Single-Image Super-Resolution Models to Video Super-Resolution

Contributors: Wenhao Wang, Zhenbing Liu, Haoxiang Lu, Rushi Lan, Yingxin Huang

The quality of videos varies due to the different capabilities of sensors. Video super-resolution (VSR) is a technology that improves the quality of captured video.

Keywords: video super-resolution; single-image super-resolution; plug-and-play

1. Introduction

Numerous videos are captured every day; however, because sensor capabilities differ, the quality of captured videos varies greatly, which affects subsequent analysis and applications ^[1]^[2]^[3]^[4]. Recently, computer technologies have been applied to many fields ^[5]^[6]^[7]^[8]. In particular, video super-resolution (VSR) is a technology for improving the quality of captured video: it produces high-resolution (HR) video frames from their low-resolution (LR) counterparts. The VSR problem is challenging because it is ill-posed (many different HR frames can produce the same LR frame), yet it has many applications, including video display, video surveillance, video conferencing, and entertainment ^[9].

VSR models take consecutive frames as input, whereas single-image super-resolution (SISR) methods process only one image at a time. VSR models therefore take both spatial and temporal information into account, while SISR models exploit only spatial information for super-resolution (SR) reconstruction. For this reason, many VSR methods adapt SISR models for spatial information extraction. For example, Haris et al. ^[10] introduced RBPN, which employs blocks from DBPN ^[11] in a recurrent encoder–decoder module to utilize spatial and temporal information. Tian et al. ^[12] adapted EDSR ^[13] as the main design for the SR reconstruction network in TDAN. Liang et al. ^[14] utilized residual Swin Transformer blocks from SwinIR ^[15] in their proposed RVRT. Although these works adapt SISR models, each method utilizes only one SISR model. Applying SISR techniques to these VSR models requires considerable effort, and the adapted models may not perform as effectively as specialized VSR models.
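The difference in input between the two model families can be illustrated with a minimal sketch. The function name `neighbor_window`, the `radius` parameter, and the policy of clamping out-of-range indices to the first or last frame are assumptions made for illustration; the source does not specify how sequence boundaries are handled.

```python
from typing import List


def neighbor_window(num_frames: int, center: int, radius: int) -> List[int]:
    """Return the indices of the 2 * radius + 1 frames centered on `center`.

    Indices that fall outside the sequence are clamped to the first or last
    frame. This boundary policy is an assumption, not taken from the source.
    """
    if not 0 <= center < num_frames:
        raise ValueError("center must index an existing frame")
    if radius < 0:
        raise ValueError("radius must be non-negative")
    return [
        min(max(i, 0), num_frames - 1)
        for i in range(center - radius, center + radius + 1)
    ]


# An SISR model sees only the target frame (radius 0) ...
sisr_input = neighbor_window(num_frames=5, center=0, radius=0)  # [0]
# ... while a VSR model also sees its temporal neighbors.
vsr_input = neighbor_window(num_frames=5, center=0, radius=1)   # [0, 0, 1]
```

With `radius=0` the window degenerates to the single-image case, which is one way to see why an SISR model can serve as the spatial backbone of a VSR model.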

Meanwhile, several VSR methods do not rely on SISR models. For instance, Xue et al. ^[16] proposed TOF, which estimates task-oriented flow to recover details in SR frames. Wang et al. ^[17] proposed SOF-VSR, which estimates HR optical flow from LR frames. SWRN ^[18] can run in real time on a mobile device. However, developing a VSR model without adapting SISR methods is very costly, as the model needs to capture both temporal and spatial information. Moreover, compared with SISR methods, such models may be less effective in utilizing spatial information.

To alleviate the above issues, researchers propose a plug-and-play approach for adapting existing SISR models to the VSR task. First, researchers summarize a common architecture of SISR models and provide a formal analysis of adaptation, so that different SISR models can be adapted effectively. Then, researchers present an adaptation method that inserts a plug-and-play temporal feature extraction module into SISR models. The module consists of three submodules: offset estimation, spatial aggregation, and temporal aggregation. The spatial aggregation submodule aligns the features extracted by the original SISR model, using the offsets produced by the offset estimation submodule. The temporal aggregation submodule then aggregates the information extracted from all neighboring frames.
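The data flow through the three submodules can be sketched with a toy example. This is not the authors' implementation: the submodules in the source are part of a trained network, whereas here each one is replaced by a hand-written stand-in (an exhaustive search for one global integer shift, a border-replicating shift, and a plain average). All function names and the `search` parameter are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one single-channel feature map of size H x W


def shift(feat: Frame, dy: int, dx: int) -> Frame:
    """Move a feature map by (dy, dx), replicating values at the border."""
    h, w = len(feat), len(feat[0])
    return [
        [feat[min(max(y - dy, 0), h - 1)][min(max(x - dx, 0), w - 1)]
         for x in range(w)]
        for y in range(h)
    ]


def estimate_offset(ref: Frame, nbr: Frame, search: int = 2) -> Tuple[int, int]:
    """Offset estimation (stand-in): global shift of `nbr` that best matches `ref`."""
    best, best_err = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            moved = shift(nbr, dy, dx)
            err = sum(abs(a - b)
                      for row_r, row_m in zip(ref, moved)
                      for a, b in zip(row_r, row_m))
            if err < best_err:
                best, best_err = (dy, dx), err
    return best


def temporal_features(frames: List[Frame], center: int) -> Frame:
    """Align every neighbor to the center frame, then average over time."""
    ref = frames[center]
    aligned = []
    for i, nbr in enumerate(frames):
        if i == center:
            aligned.append(ref)
        else:
            dy, dx = estimate_offset(ref, nbr)   # offset estimation
            aligned.append(shift(nbr, dy, dx))   # spatial aggregation (alignment)
    h, w = len(ref), len(ref[0])
    # Temporal aggregation (stand-in): mean over the aligned feature maps.
    return [[sum(f[y][x] for f in aligned) / len(aligned) for x in range(w)]
            for y in range(h)]


def impulse(y: int, x: int, size: int = 5) -> Frame:
    """Feature map that is zero everywhere except at (y, x)."""
    return [[1.0 if (r, c) == (y, x) else 0.0 for c in range(size)]
            for r in range(size)]


# A point moving one pixel to the right per frame.
frames = [impulse(2, 1), impulse(2, 2), impulse(2, 3)]
fused = temporal_features(frames, center=1)
```

In this toy case the neighbors are shifted back onto the center frame before averaging, so the fused map keeps a single sharp peak; averaging without alignment would smear it across three pixels. This illustrates why alignment precedes temporal aggregation.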

To evaluate the effectiveness of the proposed method, researchers adapt five representative SISR models, i.e., SRResNet ^[19], EDSR ^[13], RCAN ^[20], RDN ^[21], and SwinIR ^[15], and evaluate them on two popular benchmarks, Vid4 and SPMC-11. On the Vid4 benchmark, the VSR-adapted models achieve improvements of at least 1.26 dB in peak signal-to-noise ratio (PSNR) ^[22] and 0.067 in structural similarity index (SSIM) ^[23] over the original SISR models. On the SPMC-11 benchmark, the gains are at least 1.16 dB in PSNR and 0.036 in SSIM. Moreover, the VSR-adapted models surpass the performance of state-of-the-art VSR models.
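For readers unfamiliar with the PSNR scale, the standard definition, PSNR = 10 · log10(MAX² / MSE), can be written as a short function. This is a minimal sketch that assumes 8-bit intensities (`max_value=255`) and plain nested lists. The source does not describe its evaluation protocol (color space, border cropping, averaging over frames), so none of that is modeled here.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[Sequence[float]],
         estimate: Sequence[Sequence[float]],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(reference) != len(estimate) or any(
        len(r) != len(e) for r, e in zip(reference, estimate)
    ):
        raise ValueError("images must have the same shape")
    count = sum(len(row) for row in reference)
    mse = sum(
        (a - b) ** 2
        for row_r, row_e in zip(reference, estimate)
        for a, b in zip(row_r, row_e)
    ) / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel off by one intensity level gives an MSE of 1.
score = psnr([[0, 0], [0, 0]], [[1, 1], [1, 1]])
```

Because the scale is logarithmic, a gain of 1.26 dB on a single frame pair corresponds to a mean squared error roughly 25% lower (10^(-0.126) ≈ 0.75).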

2. Single-Image Super-Resolution

The SISR problem is ill-posed, and learning-based methods have significantly improved performance in terms of accuracy ^[13]^[15]^[19]^[20]^[21]^[24]^[25] and speed ^[26]^[27]^[28]^[29]. In 2014, Dong et al. ^[30] introduced a learning-based model, namely SRCNN, into the SISR field. Inspired by ResNet ^[31], Ledig et al. ^[19] proposed SRResNet in 2017. SRResNet ^[19] accepts LR images directly and achieves high performance and increased efficiency. Lim et al. ^[13] improved SRResNet by removing unnecessary batch normalization in residual blocks and expanding the number of parameters, resulting in EDSR. In 2018, Zhang et al. ^[21] employed a densely connected architecture in RDN, in which all extracted features are fused to utilize hierarchical information. Subsequently, Zhang et al. ^[20] introduced a channel attention mechanism in RCAN that adaptively weights features channel-wise. In 2021, Liang et al. ^[15] proposed SwinIR, which builds on the Transformer ^[32]. Specifically, SwinIR uses the Swin Transformer ^[33] variant, which is more appropriate for computer vision tasks. By appropriately combining convolution layers and Swin Transformer modules, SwinIR captures local and global dependencies at the same time, resulting in state-of-the-art performance.

3. Video Super-Resolution

In recent years, deep-learning-based models have been used to solve the VSR problem, and have become increasingly popular ^[9]. Researchers roughly divide VSR models into two categories:

(1) Models adapting SISR models: Sajjadi et al. ^[34] proposed FRVSR, which takes EnhanceNet ^[35] as the subnetwork for SR reconstruction. Haris et al. ^[10] applied the iterative up- and downsampling technique ^[11] in RBPN. The representative deep learning SISR model, EDSR ^[13], is utilized by many VSR models. Tian et al. ^[12] applied a shallow version of EDSR ^[13] in TDAN. EDVR ^[36] and WAEN ^[37] both employed the residual block and upsampling module from EDSR ^[13] in the reconstruction module. Inspired by TDAN ^[12], Xu et al. ^[38] adapted EDSR as the reconstruction module. EGVSR ^[39] applied ESPCN ^[26] as the backbone for the SR net. The recently proposed RVRT ^[14] utilized the residual Swin Transformer block, which was proposed in SwinIR ^[15].

(2) Models without adapting SISR models: DUF ^[40] reconstructs SR frames by estimating upsampling filters and a residual image for high-frequency details. Kim et al. ^[41] employed 3D convolution to capture spatial–temporal nonlinear characteristics between LR and HR frames. Xue et al. ^[16] proposed TOF, which learns a task-specific representation of motion. Wang et al. ^[17] proposed SOF-VSR, which estimates HR optical flow from LR frames. To better leverage temporal information, TGA ^[42] introduced a hierarchical architecture. Recently, Chan et al. ^[43] proposed BasicVSR by investigating the essential components of VSR models. Liu et al. ^[44] applied spatial convolution packing to jointly exploit spatial–temporal features. To better fuse information from neighboring frames, Lee et al. ^[45] utilized both attention-based alignment and dilation-based alignment. Lian et al. ^[18] proposed SWRN to achieve real-time inference while producing superior performance.

Because VSR models have to capture both temporal and spatial information, designing a VSR method from scratch requires more effort. Thus, many researchers turn to adapting SISR models, which lets the design of a VSR method focus on capturing temporal information. However, these models either use a SISR model as a subnetwork or borrow modules from a SISR model to extract features. Additionally, they may be less effective than methods that do not adapt SISR models.

This entry is adapted from the peer-reviewed paper 10.3390/s23115030

References

Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 13658–13667.
Shermeyer, J.; Etten, A.V. The Effects of Super-Resolution on Object Detection Performance in Satellite Imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Computer Vision Foundation/IEEE, Long Beach, CA, USA, 16–20 June 2019; pp. 1432–1441.
Dong, H.; Xie, K.; Xie, A.; Wen, C.; He, J.; Zhang, W.; Yi, D.; Yang, S. Detection of Occluded Small Commodities Based on Feature Enhancement under Super-Resolution. Sensors 2023, 23, 2439.
Yuan, X.; Fu, D.; Han, S. LRF-SRNet: Large-Scale Super-Resolution Network for Estimating Aircraft Pose on the Airport Surface. Sensors 2023, 23, 1248.
Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589.
Cheng, H.K.; Schwing, A.G. XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model. In Proceedings of the Computer Vision-ECCV 2022—17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXVIII; Lecture Notes in Computer Science. Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13688, pp. 640–658.
Chen, Y.; Xia, R.; Zou, K.; Yang, K. FFTI: Image inpainting algorithm via features fusion and two-steps inpainting. J. Vis. Commun. Image Represent. 2023, 91, 103776.
Imran, A.; Sulaman, M.; Yang, S.; Bukhtiar, A.; Qasim, M.; Elshahat, S.; Khan, M.S.A.; Dastgeer, G.; Zou, B.; Yousaf, M. Molecular beam epitaxy growth of high mobility InN film for high-performance broadband heterointerface photodetectors. Surf. Interfaces 2022, 29, 101772.
Liu, H.; Ruan, Z.; Zhao, P.; Dong, C.; Shang, F.; Liu, Y.; Yang, L.; Timofte, R. Video super-resolution based on deep learning: A comprehensive survey. Artif. Intell. Rev. 2022, 55, 5981–6035.
Haris, M.; Shakhnarovich, G.; Ukita, N. Recurrent Back-Projection Network for Video Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Computer Vision Foundation/IEEE, Long Beach, CA, USA, 16–20 June 2019; pp. 3897–3906.
Haris, M.; Shakhnarovich, G.; Ukita, N. Deep Back-Projection Networks for Super-Resolution. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Computer Vision Foundation/IEEE Computer Society, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1664–1673.
Tian, Y.; Zhang, Y.; Fu, Y.; Xu, C. TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Computer Vision Foundation/IEEE, Seattle, WA, USA, 13–19 June 2020; pp. 3357–3366.
Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, IEEE Computer Society, Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140.
Liang, J.; Fan, Y.; Xiang, X.; Ranjan, R.; Ilg, E.; Green, S.; Cao, J.; Zhang, K.; Timofte, R.; Gool, L.V. Recurrent Video Restoration Transformer with Guided Deformable Attention. Adv. Neural Inf. Process. Syst. 2022, 35, 378–393.
Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844.
Xue, T.; Chen, B.; Wu, J.; Wei, D.; Freeman, W.T. Video Enhancement with Task-Oriented Flow. Int. J. Comput. Vis. 2019, 127, 1106–1125.
Wang, L.; Guo, Y.; Liu, L.; Lin, Z.; Deng, X.; An, W. Deep Video Super-Resolution Using HR Optical Flow Estimation. IEEE Trans. Image Process. 2020, 29, 4323–4336.
Lian, W.; Lian, W. Sliding Window Recurrent Network for Efficient Video Super-Resolution. In Proceedings of the Computer Vision-ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part II; Lecture Notes in Computer Science. Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13802, pp. 591–601.
Ledig, C.; Theis, L.; Huszar, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.P.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, IEEE Computer Society, Honolulu, HI, USA, 21–26 July 2017; pp. 105–114.
Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the Computer Vision-ECCV 2018—15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part VII; Lecture Notes in Computer Science. Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11211, pp. 294–310.
Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Super-Resolution. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Computer Vision Foundation/IEEE Computer Society, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2472–2481.
Horé, A.; Ziou, D. Image Quality Metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369.
Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
Liu, Y.; Chu, Z.; Li, B. A Local and Non-Local Features Based Feedback Network on Super-Resolution. Sensors 2022, 22, 9604.
Chen, Y.; Xia, R.; Yang, K.; Zou, K. MFFN: Image super-resolution via multi-level features fusion network. Vis. Comput. 2023, 1–16.
Shi, W.; Caballero, J.; Huszar, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, IEEE Computer Society, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
Lan, R.; Sun, L.; Liu, Z.; Lu, H.; Pang, C.; Luo, X. MADNet: A Fast and Lightweight Network for Single-Image Super Resolution. IEEE Trans. Cybern. 2021, 51, 1443–1453.
Lan, R.; Sun, L.; Liu, Z.; Lu, H.; Su, Z.; Pang, C.; Luo, X. Cascading and Enhanced Residual Networks for Accurate Single-Image Super-Resolution. IEEE Trans. Cybern. 2021, 51, 115–125.
Sun, L.; Liu, Z.; Sun, X.; Liu, L.; Lan, R.; Luo, X. Lightweight Image Super-Resolution via Weighted Multi-Scale Residual Network. IEEE/CAA J. Autom. Sin. 2021, 8, 1271–1280.
Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, IEEE Computer Society, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, IEEE, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
Sajjadi, M.S.M.; Vemulapalli, R.; Brown, M. Frame-Recurrent Video Super-Resolution. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Computer Vision Foundation/IEEE Computer Society, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6626–6634.
Sajjadi, M.S.M.; Schölkopf, B.; Hirsch, M. EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, IEEE Computer Society, Venice, Italy, 22–29 October 2017; pp. 4501–4510.
Wang, X.; Chan, K.C.K.; Yu, K.; Dong, C.; Loy, C.C. EDVR: Video Restoration With Enhanced Deformable Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2019, Computer Vision Foundation/IEEE, Long Beach, CA, USA, 16–20 June 2019; pp. 1954–1963.
Choi, Y.J.; Lee, Y.; Kim, B. Wavelet Attention Embedding Networks for Video Super-Resolution. In Proceedings of the 25th International Conference on Pattern Recognition, ICPR 2020, Milan, Italy, 10–15 January 2021; pp. 7314–7320.
Xu, W.; Song, H.; Jin, Y.; Yan, F. Video Super-Resolution with Frame-Wise Dynamic Fusion and Self-Calibrated Deformable Alignment. Neural Process. Lett. 2022, 54, 2803–2815.
Cao, Y.; Wang, C.; Song, C.; Tang, Y.; Li, H. Real-Time Super-Resolution System of 4K-Video Based on Deep Learning. In Proceedings of the 32nd IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2021, Virtual, 7–9 July 2021; pp. 69–76.
Jo, Y.; Oh, S.W.; Kang, J.; Kim, S.J. Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Computer Vision Foundation/IEEE Computer Society, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3224–3232.
Kim, S.Y.; Lim, J.; Na, T.; Kim, M. Video Super-Resolution Based on 3D-CNNS with Consideration of Scene Change. In Proceedings of the 2019 IEEE International Conference on Image Processing, ICIP 2019, Taipei, Taiwan, 22–25 September 2019; pp. 2831–2835.
Isobe, T.; Li, S.; Jia, X.; Yuan, S.; Slabaugh, G.G.; Xu, C.; Li, Y.; Wang, S.; Tian, Q. Video Super-Resolution With Temporal Group Attention. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Computer Vision Foundation/IEEE, Seattle, WA, USA, 13–19 June 2020; pp. 8005–8014.
Chan, K.C.K.; Wang, X.; Yu, K.; Dong, C.; Loy, C.C. BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, Computer Vision Foundation/IEEE, Virtual, 19–25 June 2021; pp. 4947–4956.
Liu, Z.; Siu, W.; Chan, Y. Efficient Video Super-Resolution via Hierarchical Temporal Residual Networks. IEEE Access 2021, 9, 106049–106064.
Lee, Y.; Cho, S.; Jun, D. Video Super-Resolution Method Using Deformable Convolution-Based Alignment Network. Sensors 2022, 22, 8476.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply.