Single-image super-resolution (SISR) seeks to reconstruct a high-resolution image, with its high-frequency details restored, from a low-resolution counterpart.
1. Introduction
SISR has many practical applications, such as video surveillance, remote sensing, video coding and medical imaging. On the one hand, SISR reduces the cost of obtaining high-resolution (HR) images: researchers can acquire HR images using personal computers instead of sophisticated and expensive optical imaging equipment. On the other hand, SISR reduces the cost of information transmission, i.e., high-resolution images can be obtained by decoding transmitted low-resolution image information using SISR. Many efforts have been made to deal with this challenging yet ill-posed problem, in which the high-resolution version of a low-resolution image is unknown.
Many traditional methods
[1][2][3] have been proposed to obtain high-resolution (HR) images from their low-resolution (LR) versions by establishing a mapping between LR and HR images. These methods are fast, lightweight and effective, which makes them preferable as basic tools in SISR tasks
[4]. However, they share an inherent problem: tedious parameter adjustment. Obtaining the desired results relies on continually tweaking parameters to accommodate various inputs. This inconvenience has an adverse impact on both efficiency and the user experience.
2. Deep CNN-Based SISR
Like other computer vision tasks, SISR has made significant progress through deep convolutional neural networks. Dong et al. first proposed SRCNN
[5], based on a shallow CNN. The method first up-samples the image through bicubic interpolation; the network then comprises three convolutional layers performing patch extraction and representation, nonlinear mapping, and image reconstruction. Later, that team proposed FSRCNN
[6], while Shi et al. proposed ESPCN
[7]. Meanwhile, Lai et al. proposed a Laplacian pyramid super-resolution network
[8], which takes low-resolution images as input and gradually reconstructs the sub-band residuals of high-resolution images. Tai et al. proposed a very deep persistent memory network (MemNet)
[9]. Tian et al. proposed a coarse-to-fine CNN method
[10] that adds heterogeneous convolutions and refinement blocks to extract and process low-frequency and high-frequency features separately. Wei et al.
[11] used cascading dense connections to extract features of different fineness from convolutional layers at different depths. Jin et al. adopted a framework
[12] to flexibly adjust the architecture of the network, adapting it to different kinds of images. DRCN
[13] used a deeply recursive convolutional network to improve performance without introducing new parameters for additional convolutions. DRRN
[14] improved DRCN by using residual networks. Lim et al. proposed an enhanced deep residual network (EDSR)
[15]. Liu et al.
[16] proposed an improved version of U-Net based on a multi-level wavelet. Li et al.
[17] proposed exploiting self-attention and facial semantics to obtain super-resolution face images. Most SISR studies achieved better performance by deepening the network or adding residual connections. However, great depth makes these methods difficult to train, while more parameters not only cause excessive memory consumption during inference but also slow down execution. Therefore, researchers have introduced lightweight and efficient SISR models.
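As an illustration of the SRCNN-style pipeline described above (interpolation-based upsampling followed by three convolution stages), the following is a minimal NumPy sketch. The nearest-neighbour upsampling, random weights, and tiny channel counts are illustrative stand-ins, not the original configuration.

```python
import numpy as np

def conv2d(x, w):
    """Naive 'same'-padded 2-D convolution over multi-channel feature maps.
    x: (C_in, H, W), w: (C_out, C_in, k, k). Returns (C_out, H, W)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                out[o, i, j] = np.sum(xp[:, i:i + k, j:j + k] * w[o])
    return out

def srcnn_like(lr, scale, weights):
    """SRCNN-style pipeline: upsample first, then three convolution stages
    (patch extraction -> nonlinear mapping -> reconstruction)."""
    # nearest-neighbour upsampling stands in for bicubic interpolation here
    up = np.kron(lr, np.ones((1, scale, scale)))
    f1 = np.maximum(conv2d(up, weights[0]), 0)   # patch extraction + ReLU
    f2 = np.maximum(conv2d(f1, weights[1]), 0)   # nonlinear mapping + ReLU
    return conv2d(f2, weights[2])                # image reconstruction
```

The 9/1/5 kernel sizes below mirror a common SRCNN configuration; a 1-channel 6 × 6 input at scale 2 yields a 1 × 12 × 12 output.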
In terms of lightweight models, Hui et al. proposed IDN
[18], which applies knowledge distillation to extract features from each layer of the network and learns the complementary relationships among them to reduce parameters. CARN
[19] used a lightweight cascaded residual network; at both the local and global levels, cascading mechanisms integrate features from layers at different scales in order to receive more information. However, that method still involves 1.5 M parameters and consumes too much memory. Ahn et al.
[20] proposed a lightweight residual network that uses grouped convolution to reduce the number of parameters, as well as weight classification to enhance the super-resolution effect. Yao et al. proposed GLADSR
[21] with dense connections. Tian et al. proposed LESRCNN
[22], using dense cross-layer connections and advanced sub-pixel convolution to reconstruct images. Lan et al. proposed MADNet
[23], which integrates several kinds of network structures. He et al.
[24] introduced a multi-scale residual network.
Existing lightweight SISR methods can compress the number of parameters and calculations, but doing so results in a loss of performance. In contrast,
the proposed method achieves better super-resolution performance despite a small number of parameters and reduced memory consumption.
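To make the parameter-reduction argument concrete, here is a small back-of-the-envelope calculation in plain Python (the 64-channel, 3 × 3 sizes are illustrative, not taken from any of the cited networks) showing how grouped and depthwise separable convolutions shrink a standard convolution:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution with `groups` groups (bias omitted).
    Each group maps c_in/groups input channels to c_out/groups output channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * k * k * groups

# 64 -> 64 channels with 3 x 3 kernels (illustrative sizes)
standard = conv_params(64, 64, 3)                        # 36864 weights
grouped = conv_params(64, 64, 3, groups=4)               # 9216 weights, 4x fewer
# depthwise separable = per-channel 3 x 3 conv + pointwise 1 x 1 conv
depthwise_separable = (conv_params(64, 64, 3, groups=64)
                       + conv_params(64, 64, 1))         # 4672 weights
```

Splitting into g groups divides the weight count by g, which is why grouped and depthwise separable convolutions recur throughout the lightweight designs above.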
3. Lightweight Neural Networks
Many recent super-resolution methods have focused on making neural networks lightweight. Many lightweight network structures have been proposed, including dense networks
[19][22], which use dense connections or residual connections to fully reuse features. These methods are an efficient improvement for deep neural networks but are inadequate for lightweight networks. Therefore, researchers need to pay more attention to efficient lightweight network skeletons. In subsequent works, researchers have proposed several derivative versions that introduce cross-layer connections within the network, reusing features to achieve better performance. Iandola et al. proposed SqueezeNet
[25], using a squeeze layer and a convolution layer with a kernel size of 1 × 1 to convolve the feature map of the previous layer, thereby reducing its dimensionality. ShuffleNet V1
[26] and V2
[27] flexibly used pointwise grouped convolution and channel shuffle to achieve efficient classification on ImageNet
[28]. MobileNet
[29] constructed an efficient network by applying the depthwise separable convolution introduced by Sifre et al. MobileNet-V2
[30] also made use of techniques such as grouped convolution and pointwise convolution, and introduced an attention mechanism. The design of the MobileNet-V3
[31] network utilized the NAS (neural architecture search) algorithm
[32] to search for a highly efficient network structure. In contrast, the EFblock that the researchers propose uses global and local residual connections, depthwise separable convolution, grouped convolution, and pointwise convolution.
The proposed method comprehensively considers the requirements of both lightweight design and super-resolution, and extracts features efficiently with a small number of parameters.
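The channel shuffle operation that ShuffleNet pairs with grouped convolution can be sketched in a few lines of NumPy; this sketch assumes a single (C, H, W) feature map rather than a batched tensor.

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet-style channel shuffle for a (C, H, W) feature map:
    interleave channels across groups so that the next grouped
    convolution sees information from every group."""
    c, h, w = x.shape
    assert c % groups == 0
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))
```

For example, with 6 channels and 2 groups, the channel order [0, 1, 2, 3, 4, 5] becomes [0, 3, 1, 4, 2, 5], so each group of a subsequent grouped convolution receives channels from both original groups.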
4. Multi-Scale Feature Extraction
Multi-scale feature extraction is widely used in computer vision tasks, such as semantic segmentation, image restoration, and image super-resolution. The basic idea is that filters with different convolution kernel sizes can extract features of different fineness. Szegedy et al. proposed a multi-scale module
[33] called the Inception module. It uses convolution filters with different kernel sizes to extract features in parallel, enabling the network to obtain receptive fields of different sizes and thus extract features of different fineness. In a subsequent version, the authors introduced batch normalization in Inception-V2
[34], which accelerates the training of the network. In Inception-V3
[35], the authors added a new optimizer and asymmetric convolution. The application of multi-scale convolutional layers has been widely demonstrated in tasks such as deblurring and denoising. He et al.
[24] introduced a multi-scale residual network that uses image features to significantly improve super-resolution performance. However, these methods focus only on local multi-scale features, ignoring the concept of a global scale. There is room for further improvement in realizing the multi-scale network structure. As discussed above, the researchers propose a hybrid multi-scale approach that, broadly, can be divided into local multi-scale and global multi-scale: the “local multi-scale” refers to texture features, and the “global multi-scale” refers to structure features. The researchers experimented with this idea; the specific experimental details are introduced later.
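The local multi-scale idea (parallel branches with different kernel sizes, concatenated along the channel axis, as in the Inception module) can be sketched as follows. Simple averaging filters stand in for learned convolutions in this NumPy illustration.

```python
import numpy as np

def multi_scale_block(x, kernel_sizes=(1, 3, 5)):
    """Inception-style local multi-scale sketch: filter the same input with
    averaging kernels of several sizes in parallel, then concatenate the
    branch outputs along the channel axis.
    x: (C, H, W) -> (C * len(kernel_sizes), H, W)."""
    branches = []
    for k in kernel_sizes:
        p = k // 2
        # 'edge' padding keeps each branch output the same spatial size
        xp = np.pad(x, ((0, 0), (p, p), (p, p)), mode="edge")
        h, w = x.shape[1:]
        out = np.empty_like(x, dtype=float)
        for i in range(h):
            for j in range(w):
                # local average over a k x k window = fixed smoothing filter
                out[:, i, j] = xp[:, i:i + k, j:j + k].mean(axis=(1, 2))
        branches.append(out)
    return np.concatenate(branches, axis=0)
```

Larger kernels see larger receptive fields and thus coarser structure, while the 1 × 1 branch passes the input through unchanged; concatenation lets later layers combine all fineness levels, which is the essence of the local multi-scale design above.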