Single-image super-resolution (SISR) aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) one. Among state-of-the-art intelligent algorithms for realistic image super-resolution (SR), generative adversarial networks (GANs) have achieved impressive visual performance.
1. Introduction
Single-image super-resolution (SISR) aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) one. Traditional methods for solving the SR problem are mainly interpolation-based methods [1][2][3][4] and reconstruction-based methods [5][6][7]. Intelligent computing has also been applied to image super-resolution: super-resolution methods based on genetic algorithms, guided by imaging models, use optimization techniques to seek the optimal estimate of the original image, essentially transforming the reconstruction of multiple super-resolved images into a linear system of equations. The convolutional neural network (CNN) has greatly promoted the development of the SR field and demonstrates clear superiority over traditional methods, mainly because of its strong capability to learn rich features from big data in an end-to-end manner [8]. CNN-based SR methods often use PSNR as the evaluation metric; although some of them achieve high PSNR values, their results are still not fully satisfactory in terms of perception.
The generative adversarial network (GAN) [9] has achieved impressive visual performance in the field of super-resolution (SR) since the pioneering work of SRGAN [10]. GANs have proven their capability to generate more realistic images with high perceptual quality. In pursuit of further enhancing visual quality, Wang et al. proposed ESRGAN [11]. Given the challenges of collecting well-paired datasets in real-world scenarios, unsupervised GANs have been introduced [12][13]. BSRGAN [14] and Real-ESRGAN [15] are dedicated to simulating the practical degradation process to obtain better visual results on real datasets.
However, GAN-based SR models still produce perceptually unsatisfying results with unpleasant artifacts because of insufficient design in either the generators or the discriminators. In GAN-based SR methods, the generator's ability to recover natural, fine textures depends largely on the guidance of the discriminator during GAN training, yet discriminators are usually cloned from well-known networks (U-Net [16], VGG [17], etc.) designed for image segmentation or classification, which might not fully guide generators to restore the subtle textures needed in SR. Moreover, generators should be designed to be perceptive enough to extract multi-scale image features from low-resolution (LR) images and to mitigate artifacts.
2. Single-Image Super-Resolution Methods
Single-image super-resolution: SRCNN [18] is the first method to apply deep learning to SR reconstruction, and a series of learning-based works have been proposed subsequently [19][20][21][22][23]. ESPCN [24] introduces an efficient sub-pixel convolution layer that performs the feature extraction stages in LR space instead of HR space, as illustrated in the sketch below.
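To make the sub-pixel idea concrete, the following is a minimal PyTorch sketch (not the original ESPCN code) of a sub-pixel upsampling tail: features are computed entirely in LR space, and a final convolution with r²·C output channels is rearranged into the HR image by a pixel-shuffle layer. The channel counts and kernel size are illustrative choices.

```python
import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    """Minimal ESPCN-style tail: convolve in LR space, then pixel-shuffle to HR.

    All heavy computation stays at LR resolution; only the final
    rearrangement produces the (H*r, W*r) output.
    """
    def __init__(self, in_channels=64, out_channels=3, scale=4):
        super().__init__()
        # Predict r*r*C values per LR pixel, then rearrange them spatially.
        self.conv = nn.Conv2d(in_channels, out_channels * scale ** 2,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)  # (B, C*r^2, H, W) -> (B, C, H*r, W*r)

    def forward(self, lr_features):
        return self.shuffle(self.conv(lr_features))

# Example: 64-channel LR features at 32x32 become a 3-channel 128x128 image.
feats = torch.randn(1, 64, 32, 32)
sr = SubPixelUpsampler(64, 3, scale=4)(feats)
print(sr.shape)  # torch.Size([1, 3, 128, 128])
```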
VDSR [19] uses a very deep convolutional network, and EDSR [25] removes the batch normalization layers from the network. SRGAN [10] is the first to use a GAN for the SR problem and proposes a perceptual loss comprising an adversarial loss and a content loss. Motivated by human perceptual characteristics, the residual-in-residual dense block (RRDB) strategy is exploited to build network architectures of various depths [11][26]; in particular, ESRGAN [11] introduces the RRDB into its generator. A simplified sketch of such a block is given below.
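For illustration, the following is a simplified sketch of a residual-in-residual dense block in the spirit of ESRGAN; the layer counts and the 0.2 residual scaling factor follow common open-source implementations and are assumptions rather than details stated in this text.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Densely connected 5-conv block; each conv sees all previous features."""
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(channels + i * growth,
                      growth if i < 4 else channels, 3, padding=1)
            for i in range(5)
        ])
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        features = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(features, dim=1))
            if i < 4:
                features.append(self.act(out))
        return x + 0.2 * out              # local residual with scaling

class RRDB(nn.Module):
    """Residual-in-residual dense block: three dense blocks inside one residual."""
    def __init__(self, channels=64):
        super().__init__()
        self.blocks = nn.Sequential(*[DenseBlock(channels) for _ in range(3)])

    def forward(self, x):
        return x + 0.2 * self.blocks(x)   # outer residual connection
```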
RealSR [27] estimates various blur kernels and real noise distributions to synthesize different LR images. CDC [28] proposes a divide-and-conquer SR network. Luo et al. [29] propose a probabilistic degradation model (PDM), and Shao et al. [30] propose a sub-pixel convolutional neural network (SPCNN) for image SR reconstruction.
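These degradation-modeling methods build on the classical formulation LR = (HR ⊛ k)↓s + n, with blur kernel k, downsampling factor s, and noise n. The snippet below is a simplified illustration of such a synthesis pipeline; the isotropic Gaussian blur, bicubic downsampling, and Gaussian noise are placeholder choices, not the kernels or noise distributions actually estimated by RealSR or PDM.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=7, sigma=1.2):
    """Isotropic Gaussian blur kernel (stand-in for an estimated kernel)."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def synthesize_lr(hr, scale=4, noise_std=0.01):
    """LR = blur(HR) downsampled by `scale`, plus additive Gaussian noise."""
    c = hr.shape[1]
    k = gaussian_kernel().to(hr).view(1, 1, 7, 7).repeat(c, 1, 1, 1)
    blurred = F.conv2d(F.pad(hr, (3, 3, 3, 3), mode="reflect"), k, groups=c)
    lr = F.interpolate(blurred, scale_factor=1 / scale, mode="bicubic",
                       align_corners=False)
    return (lr + noise_std * torch.randn_like(lr)).clamp(0, 1)

# Example: a 128x128 HR patch becomes a degraded 32x32 LR patch (x4).
lr = synthesize_lr(torch.rand(1, 3, 128, 128))
print(lr.shape)  # torch.Size([1, 3, 32, 32])
```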
Perceptual-driven approaches: The PSNR-oriented approaches lead to overly smooth results that lack high-frequency details, and the results sometimes do not agree with subjective human perception. To improve the perceptual quality of SR results, perceptual-driven approaches have been proposed. Based on the idea of perceptual similarity [31], Fei-Fei Li et al. propose the perceptual loss in [32]. Texture matching loss [33] and contextual loss [34] are then introduced. ESRGAN [11] improves the perceptual loss by using the VGG features before activation (a minimal sketch of such a loss is given after this paragraph) and wins the PIRM perceptual super-resolution challenge
[35]. Szegedy et al. propose Inception [36], which can extract more features with the same amount of computation, thus improving training results. To extract multi-scale information and enhance feature discriminability, RFB-ESRGAN [8] applies the receptive field block (RFB) [37] to super-resolution and wins the NTIRE 2020 perceptual extreme super-resolution challenge. There is still plenty of room for improvement in perceptual quality [38].
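As an illustration, the following is a minimal sketch of an ESRGAN-style perceptual loss: an L1 distance between VGG-19 feature maps taken before activation. The choice of the conv5_4 layer (index 34 of torchvision's VGG-19 features) and the omission of ImageNet mean/std normalization are simplifying assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGPerceptualLoss(nn.Module):
    """L1 distance between VGG-19 features taken *before* activation,
    in the spirit of ESRGAN (layer choice follows common convention)."""
    def __init__(self, layer_index=34):               # conv5_4, pre-ReLU
        super().__init__()
        vgg = vgg19(weights="IMAGENET1K_V1").features[:layer_index + 1]
        for p in vgg.parameters():
            p.requires_grad_(False)                   # fixed feature extractor
        self.vgg = vgg.eval()
        self.criterion = nn.L1Loss()

    def forward(self, sr, hr):
        # ImageNet normalization omitted for brevity.
        return self.criterion(self.vgg(sr), self.vgg(hr))

# Usage: combined with pixel and adversarial losses when training a generator.
loss_fn = VGGPerceptualLoss()
sr, hr = torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)
print(loss_fn(sr, hr).item())
```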
The design of discriminator networks: The discriminator in SRGAN is VGG-style and is trained to distinguish SR images from ground-truth (GT) images [10]. ESRGAN borrows ideas from the relativistic GAN to improve the discriminator in SRGAN [11]; a minimal sketch of this relativistic formulation follows Table 1. Real-ESRGAN replaces the VGG-style discriminator of ESRGAN with a U-Net design [15]. In [39], Newell et al. propose a novel convolutional network architecture named “stacked hourglass”, which captures and consolidates information across all scales of the image. The related work is summarized in Table 1.
Table 1. Related work on design of discriminator networks.
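As a concrete sketch of the relativistic idea, the functions below implement the commonly used relativistic average GAN losses: the discriminator estimates whether a real image is relatively more realistic than the average generated one, rather than classifying each image in isolation. This follows the general RaGAN recipe rather than any specific code release.

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(real_logits, fake_logits):
    """Relativistic average discriminator loss: real samples should score
    above the average fake score, and fake samples below the average real score."""
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), torch.zeros_like(fake_logits))
    return (loss_real + loss_fake) / 2

def relativistic_g_loss(real_logits, fake_logits):
    """Generator counterpart: generated samples should look more realistic
    than the average real sample."""
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), torch.zeros_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), torch.ones_like(fake_logits))
    return (loss_real + loss_fake) / 2
```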
Artifact suppression: The instability of GAN training often introduces many perceptually unpleasant artifacts while details are being generated in GAN-based SR networks [40]. Several SR models focus on this problem. Zhang et al. propose a supervised pixel-wise generative adversarial network (SPGAN) to obtain higher-quality face images [41]. Gong et al. [42] overcome the effect of artifacts in the super-resolution of remote sensing images using a self-supervised hierarchical perceptual loss. Real-ESRGAN uses spectral normalization (SN) regularization to stabilize the training dynamics [15], as sketched below.
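A minimal sketch of how spectral normalization can be attached to a discriminator's convolution layers in PyTorch; the layer widths and strides are illustrative and do not reproduce Real-ESRGAN's exact U-Net configuration.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch, stride=1):
    """Conv layer whose weight is rescaled by its largest singular value,
    constraining the discriminator's Lipschitz constant and stabilizing training."""
    return spectral_norm(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1))

# Illustrative discriminator stem (not Real-ESRGAN's exact U-Net layout).
discriminator_stem = nn.Sequential(
    sn_conv(3, 64), nn.LeakyReLU(0.2, inplace=True),
    sn_conv(64, 128, stride=2), nn.LeakyReLU(0.2, inplace=True),
    sn_conv(128, 256, stride=2), nn.LeakyReLU(0.2, inplace=True),
)
```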
The evaluation metrics: The DCNN-based SR approaches have two main optimization objectives: distortion metrics (e.g., PSNR, SSIM, IFC, and VIF [43][44][45]) and perceptual quality (e.g., the human opinion score, and no-reference quality measures such as Ma's score [46], NIQE [47], BRISQUE [48], and the PI [49]) [50]. Blau et al. [49] have revealed that distortion and perceptual quality are contradictory and that there is always a trade-off between the two: algorithms that are superior in terms of perceptual quality tend to be poorer in terms of, e.g., PSNR and SSIM. However, there is sometimes also an inconsistency between what human observers perceive and these perceptual quality metrics. Because the no-reference metrics do not always match perceptual visual quality [51], some SR models such as SRGAN perform mean-opinion-score (MOS) tests to quantify the perceptual ability of different methods [10]. The related work on evaluation metrics is summarized in Table 2.
Table 2. Related work on evaluation metrics.
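To make the distortion side of this trade-off concrete, the snippet below computes PSNR from the mean squared error between an SR result and its ground truth (images assumed to lie in [0, 1]); no-reference perceptual measures such as NIQE or the PI would instead be computed from the SR image alone.

```python
import torch

def psnr(sr, hr, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val].
    Higher is better; it measures distortion, not perceptual quality."""
    mse = torch.mean((sr - hr) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)

hr = torch.rand(1, 3, 128, 128)
sr = (hr + 0.02 * torch.randn_like(hr)).clamp(0, 1)
print(f"PSNR: {psnr(sr, hr):.2f} dB")
```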
The transformer: Vaswani et al. [36] propose a new, simple network architecture, the transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely. The transformer continues to show impressive capabilities in the NLP domain, and many studies have tried to transfer its powerful modeling ability to computer vision [52]. In [53], Yang et al. propose TTSR, in which LR and HR images are formulated as queries and keys in a transformer, respectively, to encourage joint feature learning across LR and HR images. The Swin transformer [54] combines the advantages of convolution and the transformer, and Liang et al. [55] propose SwinIR based on the Swin transformer. Vision transformers are computationally expensive and consume a large amount of GPU memory, so Lu et al. [56] propose ESRT, which uses an efficient transformer (ET), a lightweight version of the transformer structure.
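To hint at why window-based designs such as the Swin transformer are attractive for SR, the sketch below partitions a feature map into non-overlapping windows and applies standard multi-head self-attention within each window; shifted windows, relative position bias, and the other components of SwinIR are omitted, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """Split (B, H, W, C) features into (num_windows*B, ws*ws, C) token groups."""
    b, h, w, c = x.shape
    x = x.view(b, h // ws, ws, w // ws, ws, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)

class WindowSelfAttention(nn.Module):
    """Self-attention restricted to local windows, so cost grows roughly linearly
    with image size instead of quadratically (SwinIR's shifted windows and
    positional bias are omitted in this sketch)."""
    def __init__(self, dim=64, heads=4, window_size=8):
        super().__init__()
        self.ws = window_size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, C)
        tokens = window_partition(x, self.ws)  # (B*nW, ws*ws, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out                             # per-window attended tokens

feats = torch.randn(1, 32, 32, 64)
print(WindowSelfAttention()(feats).shape)  # torch.Size([16, 64, 64])
```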