At present, video super-resolution (VSR) encompasses non-blind VSR
[1], blind VSR
[2], online VSR
[3], and other branches
[4], and is widely used in remote sensing
[5][6], video surveillance
[7][8], face recognition
[9][10], and other fields
[11][12]. As technology develops, video resolutions are steadily increasing. Although this enriches daily life and facilitates tasks such as surveillance and identification, it also puts more pressure on video storage and transmission. VSR technology plays an important role in addressing these issues, since a video can be stored or transmitted at low resolution and reconstructed at high resolution afterwards. However, VSR is an ill-posed problem, because many high-resolution videos are consistent with the same low-resolution input, and it is difficult to find the most appropriate reconstruction model. Thus, VSR technology remains worth exploring.
To obtain high-quality images, previous studies have proposed numerous effective methods. Initially, researchers utilized interpolation methods to obtain high-resolution (HR) videos
[13][14]. These methods are fast, but their results are poor. With the development of deep learning, constructing models
[15][16][17] in different domains with deep learning has become a mainstream research method. Researchers have constructed different VSR models based on deep learning that can reconstruct high-quality videos. For example, researchers
[18][19][20][21] have utilized explicit or implicit alignment to exploit the temporal information between frames. This type of method can effectively align adjacent frames to the reference frame to extract high-quality temporal information. However, alignment increases the computational cost of the model, thus increasing the burden of training and testing. Meanwhile, inaccurate optical flow often leads to alignment errors, which degrade model performance. Moreover, scholars
[22][23][24] have used 3D convolution or deformable 3D convolution to directly aggregate spatio-temporal information across frames. Although this approach can quickly aggregate information from different time steps, it also introduces considerable temporal redundancy into the features, which reduces the reconstruction ability of the model. In addition, with the rise of the Transformer in recent years, building VSR models on Transformers has become a popular research topic. Researchers
[25][26][27] have applied Transformers to analyze and capture the motion trajectories of videos in order to fully aggregate the spatio-temporal information between consecutive frames. However, the relatively high computational cost of Transformers limits their further development in the field of VSR.
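As a point of reference for the interpolation baselines mentioned above, the following is a minimal sketch of per-frame bilinear upscaling. It is not taken from any of the cited works: the function names `bilinear_upscale` and `upscale_video`, the representation of a frame as a list of rows of grayscale values, and the pixel-centre coordinate mapping are all assumptions made for illustration.

```python
def bilinear_upscale(frame, scale):
    """Upscale one 2D grayscale frame (list of rows) by an integer factor.

    Illustrative baseline only; names and conventions are assumptions.
    """
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(h * scale):
        # Map the output pixel centre back to input coordinates, clamped to the frame.
        src_y = min(max((y + 0.5) / scale - 0.5, 0.0), h - 1)
        y0 = int(src_y)
        y1 = min(y0 + 1, h - 1)
        wy = src_y - y0
        row = []
        for x in range(w * scale):
            src_x = min(max((x + 0.5) / scale - 0.5, 0.0), w - 1)
            x0 = int(src_x)
            x1 = min(x0 + 1, w - 1)
            wx = src_x - x0
            # Blend the four neighbouring input pixels.
            top = frame[y0][x0] * (1 - wx) + frame[y0][x1] * wx
            bottom = frame[y1][x0] * (1 - wx) + frame[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def upscale_video(frames, scale):
    """Apply the same interpolation to every frame independently."""
    return [bilinear_upscale(frame, scale) for frame in frames]
```

Each output pixel is a fixed weighted average of at most four input pixels, which is why such methods are fast. Because every frame is processed independently and no new detail is inferred, the result is smooth but blurry, which is consistent with the poor quality noted above.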
2. Video Super-Resolution
2.1. Single-Image Super-Resolution
Single-image super-resolution (SISR) is the basis of super-resolution (SR). In recent years, with the development of deep learning, SR has undergone a revolution. Dong et al.
[28] were the first to apply deep learning to SISR. They presented a three-layer convolutional neural network (SRCNN) and achieved better results than earlier methods. For example, for 4× SISR evaluated with peak signal-to-noise ratio (PSNR), it outperformed the then state-of-the-art A+ algorithm
[33] by 0.21 dB and 0.18 dB on the Set5 and Set14 datasets, respectively. This demonstrated that deep learning has great potential in the field of SR. Subsequently, Kim et al.
[34] presented a very deep neural network and applied residual learning to the SR model, achieving better results than SRCNN. For example, for 4× SISR evaluated with PSNR, it outperformed SRCNN by 0.87 dB and 0.52 dB on the Set5 and Set14 datasets, respectively. Song et al.
[35] proposed using an additive neural network for SISR, which replaces the multiplication operations of traditional convolution kernels with additions when calculating the output. Experiments demonstrate that this additive neural network achieves performance and visual quality comparable to convolutional neural networks, while reducing energy consumption by a factor of approximately 2.5 when reconstructing a 1280×720 image. Liang et al.
[29] introduced Swin Transformer into SISR and obtained high-quality recovered images. Tian et al.
[30] proposed heterogeneous grouping blocks to enhance the internal and external interactions of different channels to obtain rich low-frequency structural information. In practice, Lee et al.
[31] applied SR to satellite synthetic aperture radar and effectively recovered the information of scatterers. Moreover, many scholars have constructed SISR models using methods such as generative adversarial networks (GANs) or variational autoencoders (VAEs)
[31][32][33][34][35]. Although SISR models can also be used to reconstruct HR videos, they capture only the spatial information within each frame and cannot aggregate the temporal information between neighboring frames. As a result, videos recovered by SISR models are of poor quality and often suffer from artifacts and other problems. To reconstruct high-quality HR videos, researchers have shifted their focus to VSR models.
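The PSNR figures quoted in this subsection follow the standard definition PSNR = 10·log10(MAX² / MSE), where MSE is the mean squared error between the reconstructed image and the ground truth and MAX is the largest possible pixel value. The sketch below illustrates that definition only. The function name `psnr`, the list-of-rows grayscale representation, and the default `max_value` of 255 are assumptions, and the evaluation protocols of the cited works (color channel, border cropping) are not specified here and may differ.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images.

    Illustrative sketch; 8-bit grayscale input is an assumption.
    """
    h, w = len(reference), len(reference[0])
    # Mean squared error over all pixels.
    mse = sum(
        (reference[y][x] - estimate[y][x]) ** 2
        for y in range(h)
        for x in range(w)
    ) / (h * w)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a reported gain such as 0.21 dB is the difference between the PSNR values of two methods, usually averaged over the images of a dataset. At a fixed `max_value`, a 1 dB gain corresponds to roughly a 21% reduction in MSE.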
2.2. Video Super-Resolution
VSR is an extension of SISR. In VSR, the temporal information between adjacent frames plays a vital role. To reconstruct high-quality HR frames, researchers have built a variety of models. For instance, Caballero et al.
[19] applied an optical flow field comprising coarse flow and fine flow to align adjacent frames, and constructed an end-to-end spatio-temporal module. Based on
[19], Wang et al.
[36] combined an optical flow field and long short-term memory to make more efficient use of inter-frame information and obtain more realistic details. Moreover, Tian et al.
[37] presented the first model to introduce deformable convolution into VSR, which enhanced the feature extraction ability of the model. Based on
[37], Wang et al.
[20] proposed a pyramid, cascading, and deformable (PCD) module that further enhances the alignment capability of the model. Then, Xu et al.
[38] designed a temporal modulation block to modulate the PCD module. Meanwhile, they conducted short-term and long-term feature fusion to better extract motion cues. These optical flow-based methods have also been applied to practical tasks such as video surveillance. Guo et al.
[8] utilized optical flow and other methods to construct a back-projection network, which can effectively reconstruct high-quality surveillance videos. Moreover, Isobe et al.
[22] proposed a structure of intra-group and inter-group fusion, and used 3D convolution to capture and supplement the spatio-temporal information between different groups. Ying et al.
[23] proposed deformable 3D convolution with efficient spatio-temporal exploration and adaptive motion compensation capabilities. Fuoli et al.
[39] devised a hidden space propagation scheme that effectively aggregates temporal information over long distances. Based on
[39], Isobe et al.
[40] explored the temporal differences in low-resolution (LR) and HR space, effectively complementing the missing details in LR frames. Then, Jin et al.
[5] used temporal differences over both long and short frame intervals to achieve information compensation for satellite VSR. Liu et al.
[26] designed a trajectory transformer that analyzes and utilizes motion trajectories between consecutive frames to obtain high-quality HR videos. Then, on the basis of
[26], Qiu et al.
[27] introduced the frequency domain into VSR, which provided a new basis for studying VSR.
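The explicit alignment step used by the optical flow-based methods above can be sketched as a backward warp of a neighboring frame toward the reference frame. This is a simplified illustration, not the implementation of any cited model. The function name `warp_frame` and the flow convention are assumptions: `flow[y][x] = (dx, dy)` means that reference pixel `(x, y)` corresponds to position `(x + dx, y + dy)` in the neighboring frame. The flow field is taken as given, whereas the cited works estimate it with learned modules.

```python
def warp_frame(neighbor, flow):
    """Backward-warp a 2D grayscale neighbor frame toward the reference frame.

    Assumed convention: flow[y][x] = (dx, dy) points from reference pixel
    (x, y) to its matching location in the neighbor frame.
    """
    h, w = len(neighbor), len(neighbor[0])
    aligned = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            # Sampling position in the neighbor frame, clamped to the border.
            sx = min(max(x + dx, 0.0), w - 1)
            sy = min(max(y + dy, 0.0), h - 1)
            x0, y0 = int(sx), int(sy)
            x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
            wx, wy = sx - x0, sy - y0
            # Bilinear sampling handles sub-pixel motion.
            top = neighbor[y0][x0] * (1 - wx) + neighbor[y0][x1] * wx
            bottom = neighbor[y1][x0] * (1 - wx) + neighbor[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        aligned.append(row)
    return aligned
```

After warping, the aligned neighbor can be fused with the reference frame pixel by pixel. The sketch also shows why flow accuracy matters: every output value is read from the location that the flow points to, so an error in the flow directly places the wrong content at that pixel.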