Video super resolution is the process of generating high-resolution video frames from given low-resolution ones. The main goal is to restore fine details while preserving coarse structure. Many approaches exist for this task, but it remains a popular and challenging problem.
Most research works model the degradation process of frames as
[math]\displaystyle{ y = (x * k)\downarrow{_s} + n }[/math]
where [math]\displaystyle{ x }[/math] — original high-resolution frame,
[math]\displaystyle{ k }[/math] — blur kernel,
[math]\displaystyle{ * }[/math] — convolution operation,
[math]\displaystyle{ \downarrow{_s} }[/math] — downscaling operation with factor [math]\displaystyle{ s }[/math],
[math]\displaystyle{ n }[/math] — additive noise,
[math]\displaystyle{ y }[/math] — low-resolution frame.
Super resolution is the inverse operation: estimating [math]\displaystyle{ x }[/math] from the input [math]\displaystyle{ y }[/math]. The video super resolution problem is therefore to estimate a video sequence {[math]\displaystyle{ \overline{x} }[/math]} from the video sequence {[math]\displaystyle{ y }[/math]} so that {[math]\displaystyle{ \overline{x} }[/math]} is close to the original {[math]\displaystyle{ x }[/math]}. To do this well, one needs to estimate the blur kernel, the downscaling operation, and the additive noise for the given input.
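The degradation model above can be illustrated with a minimal 1-D sketch, assuming a simple box-style blur kernel, nearest-sample downscaling, and a given noise vector (real degradations are 2-D and their parameters are unknown):

```python
# Toy sketch of the degradation y = (x * k) downscaled by s, plus noise n.
def degrade(x, k, s, n):
    """Blur signal x with kernel k, keep every s-th sample, add noise n."""
    half = len(k) // 2
    blurred = []
    for i in range(len(x)):
        acc = 0.0
        for j, kj in enumerate(k):
            idx = i + j - half          # zero padding outside the signal
            if 0 <= idx < len(x):
                acc += kj * x[idx]
        blurred.append(acc)
    down = blurred[::s]                 # downscaling: keep every s-th sample
    return [d + e for d, e in zip(down, n)]

x = [0, 0, 4, 4, 4, 0, 0, 0]            # "high-resolution" signal
k = [0.25, 0.5, 0.25]                   # blur kernel
y = degrade(x, k, s=2, n=[0.0] * 4)     # noiseless for illustration
print(y)  # [0.0, 3.0, 3.0, 0.0]
```

Super resolution would then have to recover `x` given only `y`, without knowing `k`, `s`, or the noise.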
We can use single image super resolution methods to generate high-resolution frames independently of their neighbors. Working with video, we can also benefit from temporal information. A few traditional methods treat the video super resolution task as an optimization problem. In recent years, deep learning based methods for video upscaling have outperformed the traditional ones.
There are several traditional methods for video upscaling. These methods try to exploit natural image priors and to effectively estimate motion between frames. The high-resolution frame is then reconstructed from both the priors and the estimated motion.
First, the low-resolution frame is transformed to the frequency domain. The high-resolution frame is estimated in this domain, and the result is finally transformed back to the spatial domain. Some methods use the Fourier transform, which helps to extend the spectrum of the captured signal and thus increase resolution. There are different approaches within this family: weighted least squares theory,[1] the total least squares (TLS) algorithm,[2] space-varying[3] or spatio-temporal[4] varying filtering. Other methods use the wavelet transform, which helps to find similarities in neighboring local areas.[5] Later, the second-generation wavelet transform was used for video super resolution.[6]
Iterative back-projection methods assume some mapping between low-resolution and high-resolution frames and refine the guessed mapping at each step of an iterative process.[7] Projection onto convex sets (POCS), which defines a specific cost function, can also be used in iterative methods.[8]
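The back-projection loop can be sketched in 1-D: start from an upscaled guess, repeatedly downscale it, compare with the observed low-resolution signal, and back-project the error. Simple averaging and sample repetition stand in here for the real degradation and back-projection operators, which are method-specific:

```python
# Toy iterative back-projection sketch in 1-D.
def downscale(x, s=2):
    # average every s consecutive samples
    return [sum(x[i:i + s]) / s for i in range(0, len(x), s)]

def upscale(x, s=2):
    # repeat every sample s times
    return [v for v in x for _ in range(s)]

def back_project(y, s=2, iters=10, step=1.0):
    hr = upscale(y, s)                                   # initial guess
    for _ in range(iters):
        err = [yi - di for yi, di in zip(y, downscale(hr, s))]
        hr = [h + step * c for h, c in zip(hr, upscale(err, s))]
    return hr

y = [1.0, 3.0]
hr = back_project(y)
print(downscale(hr))  # [1.0, 3.0] — consistent with the observation
```

The iterations drive the estimate toward consistency with the observed low-resolution signal; the choice of back-projection operator controls which consistent solution is reached.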
Iterative adaptive filtering algorithms use the Kalman filter to estimate the transformation from low-resolution frames to high-resolution ones.[9] To improve the final result, these methods exploit temporal correlation among the low-resolution sequences; some approaches also exploit temporal correlation within the high-resolution sequence.[10] A common way to approximate the Kalman filter is Least Mean Squares (LMS).[11] One can also use steepest descent,[12] Least Squares (LS),[13] or Recursive Least Squares (RLS).[13]
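A minimal sketch of the LMS idea, applied here to the hypothetical task of identifying a 2-tap filter from input/output samples (the cited methods apply the same update rule to the frame-transformation problem):

```python
# Minimal LMS (least mean squares) sketch: w <- w + mu * error * input.
def lms(xs, ds, n_taps=2, mu=0.1, epochs=200):
    w = [0.0] * n_taps
    for _ in range(epochs):
        for i in range(n_taps - 1, len(xs)):
            window = xs[i - n_taps + 1:i + 1][::-1]     # most recent sample first
            y = sum(wi * xi for wi, xi in zip(w, window))
            e = ds[i] - y                                # prediction error
            w = [wi + mu * e * xi for wi, xi in zip(w, window)]
    return w

# unknown system to identify: d[i] = 0.5*x[i] + 0.25*x[i-1]
xs = [1, 0, 2, 1, 3, 0, 1, 2, 1, 0]
ds = [0.5 * xs[i] + (0.25 * xs[i - 1] if i else 0) for i in range(len(xs))]
w = lms(xs, ds)
print(w)  # approaches [0.5, 0.25]
```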
Direct methods estimate motion between frames, upscale a reference frame, and warp neighboring frames to the high-resolution reference. To construct the result, the upscaled frames are fused together by a median filter,[14] weighted median filter,[15] adaptive normalized averaging, an AdaBoost classifier[16] or SVD based filters.[17]
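The median-filter fusion mentioned above can be sketched as a per-pixel median over the aligned frames; this makes the fusion robust to an outlier value in one frame:

```python
import statistics

# Fuse several aligned, upscaled frames with a per-pixel median.
def median_fuse(frames):
    return [statistics.median(pix) for pix in zip(*frames)]

frames = [
    [10, 52, 30],   # aligned frame 1
    [12, 50, 31],   # aligned frame 2
    [11, 90, 29],   # aligned frame 3, with an outlier at position 1
]
print(median_fuse(frames))  # [11, 52, 30] — the outlier 90 is rejected
```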
Non-parametric algorithms join motion estimation and frame fusion into one step, performed by considering patch similarities. Fusion weights can be calculated by nonlocal-means filters.[18] To strengthen the search for similar patches, one can use a rotation-invariant similarity measure[19] or an adaptive patch size.[20] Calculating intra-frame similarity helps to preserve small details and edges.[21] Parameters for fusion can also be calculated by kernel regression.[22]
Probabilistic methods use statistical theory to solve the task. Maximum likelihood (ML) methods estimate the most probable image.[23][24] Another group of methods uses maximum a posteriori (MAP) estimation. The regularization parameter for MAP can be estimated via Tikhonov regularization.[25] Markov random fields (MRF) are often used together with MAP and help to preserve similarity in neighboring patches.[26] Huber MRFs are used to preserve sharp edges.[27] A Gaussian MRF can smooth some edges but removes noise.[28]
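The MAP estimate mentioned above can be written, under a standard Bayesian formulation (not specific to any single cited method), as
[math]\displaystyle{ \overline{x} = \arg\max_x \, p(y \mid x)\, p(x) }[/math]
where the likelihood [math]\displaystyle{ p(y \mid x) }[/math] follows from the degradation model, and the prior [math]\displaystyle{ p(x) }[/math] (for example, an MRF) encodes assumptions about natural images.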
In approaches with alignment, neighboring frames are first aligned with the target one. Frames can be aligned by performing motion estimation and motion compensation (MEMC) or by using deformable convolution (DC). Motion estimation gives information about the motion of pixels between frames. Motion compensation is a warping operation that aligns one frame to another based on this motion information.
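The motion-compensation step can be sketched as warping a neighboring frame toward the target using a per-pixel motion field. The integer-valued flow field is assumed given here; real methods estimate it (usually at sub-pixel precision) with optical flow:

```python
# Warp a frame using a per-pixel (dy, dx) motion field.
def warp(frame, flow):
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy, sx = y + dy, x + dx          # source position in the frame
            if 0 <= sy < h and 0 <= sx < w:  # pixels sampled outside stay 0
                out[y][x] = frame[sy][sx]
    return out

frame = [[1, 2], [3, 4]]
flow = [[(0, 1), (0, -1)], [(0, 0), (0, 0)]]  # swap the top-row pixels
print(warp(frame, flow))  # [[2, 1], [3, 4]]
```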
Another way to align neighboring frames with the target is deformable convolution. While an ordinary convolution has a fixed sampling grid, a deformable convolution first estimates offsets for the kernel's sampling positions and then performs the convolution.
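A toy 1-D sketch of the deformable-convolution idea: each kernel tap samples the input at its base position plus an offset. The offsets are assumed given here; real layers predict them per location with a separate convolution:

```python
# 1-D deformable convolution with given integer offsets per position and tap.
def deformable_conv1d(x, k, offsets):
    half = len(k) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, kj in enumerate(k):
            idx = i + j - half + offsets[i][j]  # shifted sampling position
            if 0 <= idx < len(x):               # zero padding outside
                acc += kj * x[idx]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0]
k = [0.5, 0.5]
offsets = [[0, 0]] * len(x)  # zero offsets reduce to ordinary convolution
print(deformable_conv1d(x, k, offsets))  # [0.5, 1.5, 2.5, 3.5]
```

With nonzero offsets, the same kernel can gather samples from displaced positions, which is what lets the layer align a neighboring frame implicitly.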
Some methods align frames using a homography calculated between them.
Methods without alignment do not align frames as a first step; they process the input frames directly.
While 2D convolutions work in the spatial domain, 3D convolutions use both spatial and temporal information. They perform motion compensation and maintain temporal consistency.
Recurrent convolutional neural networks perform video super resolution by storing temporal dependencies.
Non-local methods extract both spatial and temporal information. The key idea is to compute each output position as a weighted sum over all possible positions. This strategy may be more effective than local approaches.
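The weighted sum over all positions can be sketched in 1-D, with similarity weights from a softmax over negative squared differences. This is an illustrative stand-in: actual non-local blocks compute similarities between learned embeddings rather than raw values:

```python
import math

# Non-local weighted sum: every output is a similarity-weighted
# average over ALL input positions, not just a local neighborhood.
def non_local_1d(x):
    out = []
    for xi in x:
        weights = [math.exp(-(xi - xj) ** 2) for xj in x]
        z = sum(weights)  # normalization (softmax denominator)
        out.append(sum(w * xj for w, xj in zip(weights, x)) / z)
    return out

x = [1.0, 1.1, 5.0, 0.9]
out = non_local_1d(x)
print(out)  # similar values are averaged together; the outlier 5.0 barely moves
```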
The common way to estimate the performance of video super resolution algorithms is to use objective metrics such as PSNR and SSIM.
Another way to assess the performance of a video super resolution algorithm is to organize a subjective evaluation: people are asked to compare the corresponding frames, and the final mean opinion score (MOS) is calculated as the arithmetic mean over all ratings.
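PSNR, the most widely reported of these metrics, can be computed as follows for 8-bit pixel values (this is the standard definition, not tied to any one benchmark):

```python
import math

# PSNR between a restored frame and its ground truth (flattened pixel lists).
def psnr(ref, out, max_val=255.0):
    mse = sum((r - o) ** 2 for r, o in zip(ref, out)) / len(ref)
    if mse == 0:
        return float('inf')  # identical frames
    return 10 * math.log10(max_val ** 2 / mse)

ref = [100, 120, 140, 160]
out = [101, 119, 141, 159]  # off by 1 everywhere -> MSE = 1
print(round(psnr(ref, out), 2))  # 48.13
```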
While deep learning approaches to video super resolution outperform traditional ones, it is crucial to form a high-quality dataset for evaluation. It is important to verify models' ability to restore small details, text, and objects with complicated structure, and to cope with large motion and noise.
Dataset | Videos | Mean video length | Ground-truth resolution | Motion in frames | Fine details |
---|---|---|---|---|---|
Vid4 | 4 | 43 frames | 720×480 | Without fast motion | Some small details, without text |
SPMCS | 30 | 31 frames | 960×540 | Slow motion | A lot of small details |
Vimeo-90K (test SR set) | 7824 | 7 frames | 448×256 | A lot of fast, difficult, diverse motion | Few details, text in a few sequences |
Xiph HD (complete sets) | 70 | 2 seconds | from 640×360 to 4096×2160 | A lot of fast, difficult, diverse motion | Few details, text in a few sequences |
Ultra Video Dataset 4K | 16 | 10 seconds | 4096×2160 | Diverse motion | Few details, without text |
REDS (test SR) | 30 | 100 frames | 1280×720 | A lot of fast, difficult, diverse motion | Few details, without text |
Space-Time SR | 5 | 100 frames | 1280×720 | Diverse motion | Without small details and text |
Harmonic | — | — | 4096×2160 | — | — |
CDVL | — | — | 1920×1080 | — | — |
A few benchmarks in video super resolution were organized by companies and conferences. The purposes of such challenges are to compare diverse algorithms and to find the state-of-the-art for the task.
Benchmark | Organizer | Dataset | Upscale factor | Metrics |
---|---|---|---|---|
NTIRE 2019 Challenge | CVPR (Computer Vision and Pattern Recognition) | REDS | 4 | PSNR, SSIM |
Youku-VESR Challenge 2019 | Youku | Youku-VESR | 4 | PSNR, VMAF |
AIM 2019 Challenge | ECCV (European Conference on Computer Vision) | Vid3oC | 16 | PSNR, SSIM, MOS |
AIM 2020 Challenge | ECCV (European Conference on Computer Vision) | Vid3oC | 16 | PSNR, SSIM, LPIPS |
Mobile Video Restoration Challenge | ICIP (International Conference on Image Processing), Kwai | — | — | PSNR, SSIM, MOS |
MSU Video Super Resolution Benchmark 2021 | MSU (Moscow State University) | — | 4 | ERQAv1.0, PSNR and SSIM with shift compensation, QRCRv1.0, CRRMv1.0 |
The NTIRE 2019 Challenge was organized by CVPR and proposed two tracks for Video Super Resolution: clean (only bicubic degradation) and blur (blur added firstly). Each track had more than 100 participants and 14 final results were submitted.
The REDS dataset was collected for this challenge. It consists of 30 videos of 100 frames each. The resolution of ground-truth frames is 1280×720 and the tested scale factor is 4. PSNR and SSIM were used to evaluate the models' performance. The best participants' results are presented in the table:
Team | Model name | PSNR (clean track) | SSIM (clean track) | PSNR (blur track) | SSIM (blur track) | Runtime per image in sec (clean track) | Runtime per image in sec (blur track) | Platform | GPU | Open source |
---|---|---|---|---|---|---|---|---|---|---|
HelloVSR | EDVR | 31.79 | 0.8962 | 30.17 | 0.8647 | 2.788 | 3.562 | PyTorch | TITAN Xp | YES |
UIUC-IFP | WDVR | 30.81 | 0.8748 | 29.46 | 0.8430 | 0.980 | 0.980 | PyTorch | Tesla V100 | YES |
SuperRior | ensemble of RDN, RCAN, DUF | 31.13 | 0.8811 | — | — | 120.000 | — | PyTorch | Tesla V100 | NO |
CyberverseSanDiego | RecNet | 31.00 | 0.8822 | 27.71 | 0.8067 | 3.000 | 3.000 | TensorFlow | RTX 2080 Ti | YES |
TTI | RBPN | 30.97 | 0.8804 | 28.92 | 0.8333 | 1.390 | 1.390 | PyTorch | TITAN X | YES |
NERCMS | PFNL | 30.91 | 0.8782 | 28.98 | 0.8307 | 6.020 | 6.020 | PyTorch | GTX 1080 Ti | YES |
XJTU-IAIR | FSTDN | — | — | 28.86 | 0.8301 | — | 13.000 | PyTorch | GTX 1080 Ti | NO |
The Youku-VESR Challenge was organized to check models' ability to cope with the degradation and noise found in real videos from the Youku online video-watching application. The proposed dataset consists of 1000 videos, each 4–6 seconds long. The resolution of ground-truth frames is 1920×1080 and the tested scale factor is 4. The PSNR and VMAF metrics were used for performance evaluation. The top methods are presented in the table:
Team | PSNR | VMAF |
---|---|---|
Avengers Assemble | 37.851 | 41.617 |
NJU_L1 | 37.681 | 41.227 |
ALONG_NTES | 37.632 | 40.405 |
The challenge was held by ECCV and had two tracks on video extreme super resolution: the first track checks fidelity to the reference frame (measured by PSNR and SSIM), while the second checks the perceptual quality of the videos (MOS). The dataset consists of 328 video sequences of 120 frames each. The resolution of ground-truth frames is 1920×1080 and the tested scale factor is 16. The top methods are presented in the table:
Team | Model name | PSNR | SSIM | MOS | Runtime per image in sec | Platform | GPU/CPU | Open source |
---|---|---|---|---|---|---|---|---|
fenglinglwb | based on EDVR | 22.53 | 0.64 | first result | 0.35 | PyTorch | 4× Titan X | NO |
NERCMS | PFNL | 22.35 | 0.63 | — | 0.51 | PyTorch | 2× 1080 Ti | NO |
baseline | RLSP | 21.75 | 0.60 | — | 0.09 | TensorFlow | Titan Xp | NO |
HIT-XLab | based on EDSR | 21.45 | 0.60 | second result | 60.00 | PyTorch | V100 | NO |
The challenge's conditions are the same as in the AIM 2019 Challenge. The top methods are presented in the table:
Team | Model name | Params number | PSNR | SSIM | Runtime per image in sec | GPU/CPU | Open source |
---|---|---|---|---|---|---|---|
KirinUK | EVESRNet | 45.29M | 22.83 | 0.6450 | 6.1 s | 1 × 2080Ti | NO |
Team-WVU | — | 29.51M | 22.48 | 0.6378 | 4.9 s | 1 × TitanXp | NO |
BOE-IOT-AIBD | 3D-MGBP | 53M | 22.48 | 0.6304 | 4.83 s | 1 × 1080 | NO |
sr xxx | based on EDVR | — | 22.43 | 0.6353 | 4 s | 1 × V100 | NO |
ZZX | MAHA | 31.14M | 22.28 | 0.6321 | 4 s | 1 × 1080Ti | NO |
lyl | FineNet | — | 22.08 | 0.6256 | 13 s | — | NO |
TTI | based on STARnet | — | 21.91 | 0.6165 | 0.249 s | — | NO |
CET CVLab | — | — | 21.77 | 0.6112 | 0.04 s | 1 × P100 | NO |
The MSU Video Super-Resolution Benchmark was organized by MSU and proposed three types of motion, two ways of lowering resolution, and eight types of content in its dataset. The resolution of ground-truth frames is 1920×1280 and the tested scale factor is 4. 14 models were tested. PSNR and SSIM with shift compensation were used to evaluate the models' performance, and a few new metrics were proposed: ERQAv1.0, QRCRv1.0, and CRRMv1.0.[72] The top methods are presented in the table:
Model name | Multi-frame | Subjective | ERQAv1.0 | PSNR | SSIM | QRCRv1.0 | CRRMv1.0 | Runtime per image in sec | Open source |
---|---|---|---|---|---|---|---|---|---|
DBVSR | YES | 5.561 | 0.737 | 31.071 | 0.894 | 0.629 | 0.992 | — | YES |
LGFN | YES | 5.040 | 0.740 | 31.291 | 0.898 | 0.629 | 0.996 | 1.499 | YES |
DynaVSR-R | YES | 4.751 | 0.709 | 28.377 | 0.865 | 0.557 | 0.997 | 5.664 | YES |
TDAN | YES | 4.036 | 0.706 | 30.244 | 0.883 | 0.557 | 0.994 | — | YES |
DUF-28L | YES | 3.910 | 0.645 | 25.852 | 0.830 | 0.549 | 0.993 | 2.392 | YES |
RRN-10L | YES | 3.887 | 0.627 | 24.252 | 0.790 | 0.557 | 0.989 | 0.390 | YES |
RealSR | NO | 3.749 | 0.690 | 25.989 | 0.767 | 0.000 | 0.886 | — | YES |
In many areas where we work with video, we deal with different types of video degradation, including downscaling. The resolution of video can be degraded by imperfections of measuring devices, such as optical degradation and the limited size of camera sensors. Bad light and weather conditions add noise, and object and camera motion also decrease video quality. Super resolution techniques help to restore the original video and are useful in a wide range of applications.
Super resolution also helps, as a preprocessing step, in object detection and in face and character recognition. Interest in super resolution is growing with the development of high-definition computer displays and TVs.