The quality of videos varies due to the different capabilities of sensors. Video super-resolution (VSR) is a technology that improves the quality of captured video.
1. Introduction
Numerous videos are captured every day; however, due to the different capabilities of sensors, the quality of captured videos can vary greatly, which affects the subsequent analysis and applications
[1][2][3][4]. Recently, computer technologies have been applied to many fields
[5][6][7][8]. In particular, video super-resolution (VSR) is a technology for improving the quality of captured video: it produces high-resolution (HR) video frames from their low-resolution (LR) counterparts. Although the VSR problem is challenging due to its ill-posed nature, it has wide applications in video display, video surveillance, video conferencing, and entertainment
[9].
VSR models take consecutive frames as input, whereas single-image super-resolution (SISR) methods process only one image at a time. Consequently, VSR models exploit both spatial and temporal information, while SISR models rely on spatial information alone for super-resolution (SR) reconstruction. Many VSR methods therefore adapt SISR models for spatial information extraction. For example, Haris et al.
[10] introduced RBPN, which employs blocks from DBPN
[11] in a recurrent encoder–decoder module to utilize spatial and temporal information. Tian et al.
[12] adapted EDSR
[13] as the main design for the SR reconstruction network in TDAN. Liang et al.
[14] utilized residual Swin Transformer blocks from SwinIR
[15] in their proposed RVRT. Although these works adapt SISR models, each method is tied to a single SISR model. Transferring other SISR techniques into such VSR models would require considerable engineering effort, and the adapted models may not perform as effectively as specialized VSR models.
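The key input difference noted above, that a VSR model consumes a window of consecutive frames while a SISR model sees one frame at a time, can be sketched as follows. This is a minimal illustration, not any particular model's input pipeline; the helper `temporal_windows` and the replication padding at the video borders are assumptions for the example.

```python
import numpy as np

def temporal_windows(frames, radius=1):
    """For each frame, collect the 2*radius + 1 consecutive frames a
    VSR model would consume; border positions replicate the edge frame."""
    n = len(frames)
    for t in range(n):
        idx = [min(max(t + d, 0), n - 1) for d in range(-radius, radius + 1)]
        yield frames[t], [frames[i] for i in idx]

# A toy five-frame "video" of 4x4 single-channel frames.
video = [np.full((4, 4), t, dtype=np.float32) for t in range(5)]
windows = list(temporal_windows(video, radius=1))
# A SISR model would process only the center frame of each pair;
# a VSR model additionally sees the neighboring frames in the window.
```

With radius 1, each frame is paired with its immediate predecessor and successor, which is the extra temporal context a SISR model never receives.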
Meanwhile, several VSR methods do not rely on SISR models. For instance, Xue et al.
[16] proposed TOF, which estimates task-oriented flow to recover details in SR frames. Wang et al.
[17] proposed SOF-VSR, which estimates HR optical flow from LR frames. SWRN
[18] can be utilized in real time on a mobile device. However, developing a VSR model without adapting SISR methods is costly, as the model must capture both temporal and spatial information. Moreover, such models may be less effective than SISR methods at exploiting spatial information.
To alleviate the above issues, researchers propose a plug-and-play approach for adapting existing SISR models to the VSR task. First, researchers summarize a common architecture of SISR models and provide a formal analysis of how different SISR models can be adapted effectively. Researchers then present an adaptation method that inserts a plug-and-play temporal feature extraction module into SISR models. The module consists of three submodules: the offset estimation submodule estimates motion between frames; the spatial aggregation submodule aligns features extracted by the original SISR model according to the estimated offsets; and the temporal aggregation submodule fuses the information extracted from all neighboring frames.
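The three-submodule pipeline can be sketched with a toy stand-in for each stage. This is only a conceptual illustration under simplifying assumptions: the real submodules are learned networks predicting dense per-pixel offsets and a learned fusion, whereas here offset estimation is a brute-force search over global integer shifts, alignment is a whole-array shift, and aggregation is a plain mean. All function names are hypothetical.

```python
import numpy as np

def estimate_offsets(center_feat, neighbor_feat):
    """Offset estimation submodule (toy stand-in): search for the global
    integer shift that minimizes the feature difference. Real models
    predict dense, per-pixel offsets with a small network."""
    best, best_err = (0, 0), np.inf
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            shifted = np.roll(neighbor_feat, (dy, dx), axis=(0, 1))
            err = np.abs(shifted - center_feat).mean()
            if err < best_err:
                best, best_err = (dy, dx), err
    return best

def spatially_align(neighbor_feat, offset):
    """Spatial aggregation submodule: warp neighbor features toward the
    center frame according to the estimated offset."""
    return np.roll(neighbor_feat, offset, axis=(0, 1))

def temporally_aggregate(aligned_feats):
    """Temporal aggregation submodule: fuse aligned features from all
    neighboring frames (a simple mean here; real models learn the fusion)."""
    return np.mean(aligned_feats, axis=0)

def temporal_feature_extraction(center_feat, neighbor_feats):
    """Align each neighbor to the center frame, then fuse."""
    aligned = [spatially_align(f, estimate_offsets(center_feat, f))
               for f in neighbor_feats]
    return temporally_aggregate(aligned)
```

For a neighbor that is an exact shifted copy of the center features, the estimated offset undoes the shift and the fused output matches the center, which is the behavior the alignment step is meant to approximate on real motion.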
To evaluate the effectiveness of the proposed method, researchers adapt five representative SISR models, i.e., SRResNet
[19], EDSR
[13], RCAN
[20], RDN
[21], and SwinIR
[15], and the evaluations are conducted on two popular benchmarks, i.e., Vid4 and SPMC-11. On the Vid4 benchmark, the VSR-adapted models improve on the original SISR models by at least 1.26 dB in peak signal-to-noise ratio (PSNR)
[22] and 0.067 in structural similarity index (SSIM)
[23]. On the SPMC-11 benchmark, the VSR-adapted models achieve gains of at least 1.16 dB in PSNR and 0.036 in SSIM over the original SISR models. Moreover, the VSR-adapted models surpass the performance of state-of-the-art VSR models.
2. Single-Image Super-Resolution
SISR is an ill-posed problem, and learning-based methods have significantly improved its performance in terms of accuracy
[13][15][19][20][21][24][25] and speed
[26][27][28][29]. In 2014, Dong et al.
[30] introduced a learning-based model, namely SRCNN, into the SISR field. Inspired by ResNet
[31], Ledig et al.
[19] proposed SRResNet in 2017. SRResNet
[19] accepts LR images directly and achieves both high performance and improved efficiency. Lim et al.
[13] proposed EDSR, which improves SRResNet by removing the unnecessary batch normalization layers in the residual blocks and increasing the number of parameters. In 2018, Zhang et al.
[21] proposed RDN, which employs a densely connected architecture in which all extracted features are fused to exploit hierarchical information. Subsequently, Zhang et al.
[20] proposed RCAN, which introduces a channel attention mechanism that adaptively reweights features channel-wise. In 2021, Liang et al.
[15] proposed SwinIR by making use of the Transformer
[32]. Specifically, SwinIR builds on the Swin Transformer
[33], a variant better suited to computer vision tasks. By combining convolution layers with Swin Transformer modules, SwinIR captures local and global dependencies simultaneously, achieving state-of-the-art performance.
3. Video Super-Resolution
In recent years, deep-learning-based models have been used to solve the VSR problem, and have become increasingly popular
[9]. Researchers roughly divide VSR models into two categories:
(1) Models adapting SISR models: Sajjadi et al.
[34] proposed FRVSR, which takes EnhanceNet
[35] as the subnetwork for SR reconstruction. Haris et al.
[10] applied the iterative up- and downsampling technique
[11] in RBPN. The representative deep learning SISR model, EDSR
[13], is utilized by many VSR models. Tian et al.
[12] applied a shallow version of EDSR
[13] in TDAN. EDVR
[36] and WAEN
[37] both employed the residual block and upsampling module from EDSR
[13] in the reconstruction module. Inspired by
[12], Xu et al.
[38] adapted EDSR as the reconstruction module. EGVSR
[39] applied ESPCN
[26] as the backbone for the SR net. The recently proposed RVRT
[14] utilized the residual Swin Transformer block, which is proposed in SwinIR
[15].
(2) Models without adapting SISR models: DUF
[40] reconstructs SR frames by estimating upsampling filters and a residual image for high-frequency details. Kim et al.
[41] employed 3D convolution to capture spatial–temporal nonlinear characteristics between LR and HR frames. Xue et al.
[16] proposed TOF, which learns a task-oriented representation of motion. Wang et al.
[17] proposed SOF-VSR, which estimates HR optical flow from LR frames. To better leverage the temporal information, TGA
[42] introduced a hierarchical architecture. Recently, Chan et al.
[43] proposed BasicVSR by investigating the essential components of VSR models. Liu et al.
[44] applied spatial convolution packing to jointly exploit spatial–temporal features. To better fuse information from neighboring frames, Lee et al.
[45] utilized both attention-based alignment and dilation-based alignment. Lian et al.
[18] proposed SWRN to achieve real-time inference while producing superior performance.
Because VSR models have to capture both temporal and spatial information, developing a VSR method from scratch requires considerable effort. Thus, many researchers turn to adapting SISR models, which allows the VSR method to focus on capturing temporal information. However, these models either utilize a SISR model as a subnetwork or borrow modules from a SISR model for feature extraction, and they may be less effective than methods that do not adapt SISR models.