Advances in image super-resolution have led to its widespread use in remote sensing applications. Existing scale-arbitrary super-resolution methods are primarily predicated on learning either a discrete representation (DR) or a continuous representation (CR) of the image: DR retains sensitivity to the target resolution, while CR guarantees the generalization ability of the model.
1. Introduction
Constrained by transmission bandwidth and hardware equipment, the spatial resolution of received remote sensing images may be inadequate, resulting in insufficient details and failing to meet the requirements of certain practical applications. Moreover, the variety of resolutions available at ground terminals makes it imperative to reconstruct satellite images at arbitrary scales. In real-world remote sensing applications, the ability to represent images at arbitrary resolutions is also crucial for object detection, semantic segmentation, mapping, and human–computer interaction.
Digital images are typically composed of discrete pixels, each of which represents different levels of detail at different scales. Single-image super-resolution (SISR) is a widely used computer vision technique that aims to reconstruct images at various scales. Thanks to the progress in deep learning, SISR models that operate on fixed integer scale factors (e.g., ×2/×3/×4) have made significant advancements. However, most existing SISR models are limited to generating images with fixed integer scale factors, reducing their efficacy in remote sensing applications. Given the impracticality of training numerous models for multiple scale factors, developing a SISR method that can accommodate arbitrary (including non-integer) scale factors remains an open challenge.
In existing natural image-oriented, scale-arbitrary super-resolution techniques, two representative methods are Meta-SR [1] and LIIF [2]. Both methods assume that each pixel value is composed of RGB channels. They predict the specific RGB values of each pixel in the high-resolution (HR) space based on the feature vector, also known as the latent code, in the low-resolution (LR) space. However, their specific designs differ. On the one hand, the meta upscale module in Meta-SR generates convolution kernels whose number and weights depend on the scale factor. These kernels are then convolved with the latent code to predict the RGB value of a specific pixel. This approach of mapping the latent code to RGB values is referred to as discrete representation (DR). On the other hand, the local implicit image function (LIIF) directly predicts the RGB value of a pixel based on both the coordinates and the latent code. In contrast to the discrete point-to-point feature mapping in DR, LIIF creates a continuous representation (CR) of an image through continuous coordinates.
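To make the contrast concrete, the following is a minimal sketch of a LIIF-style continuous query in PyTorch; the module structure, names, and layer sizes are illustrative assumptions, not the published implementation:

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Toy LIIF-style decoder: maps (latent code, relative coord) -> RGB."""
    def __init__(self, latent_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # RGB output
        )

    def forward(self, latent, rel_coord):
        # latent:    (N, latent_dim)  nearest LR feature vector per query
        # rel_coord: (N, 2)           query position relative to that feature
        return self.mlp(torch.cat([latent, rel_coord], dim=-1))

# Continuous representation: the same decoder answers queries at ANY
# coordinate, so one model serves every (non-integer) scale factor.
decoder = ImplicitDecoder()
latents = torch.randn(5, 64)        # 5 sampled latent codes (toy data)
coords = torch.rand(5, 2) * 2 - 1   # 5 query coordinates in [-1, 1]
rgb = decoder(latents, coords)      # (5, 3)
```

A DR-style meta upscale module would instead let the scale factor generate the weights of this final mapping, so that each output position receives its own resolution-specific kernel.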
In comparison to discrete digital images, human perception of real-world scenes is continuous; thus both the discrete representation (DR), exemplified by Meta-SR [1], and the continuous representation (CR), exemplified by LIIF [2], can be utilized. CR employs a neural network-parameterized implicit function for continuous, global, and robust learning, while DR utilizes a multilayer perceptron (MLP) for discrete, local, and sensitive learning. In brief, CR enables reconstruction at ultra-high magnifications, while DR produces a more accurate image with sharper edges by adapting to specific resolutions.
2. Single Image Super-Resolution
Image super-resolution aims to recover high-resolution (HR) images from low-resolution (LR) images, with single image super-resolution (SISR) being a representative topic. Since the publication of the first super-resolution network constructed with convolutional neural networks [3], more and more network structures have been explored, such as residual networks [4,5], recursive networks [6,7], dense connections [8,9], multiple paths [10,11], attention mechanisms [12,13,14], encoder–decoder networks [15], and Transformer-based networks [16,17,18]. In addition, other generative models, such as generative adversarial networks (GANs) [19,20], flow-based models [21], and diffusion models [22], have also been applied to SISR tasks.
SISR is a research priority in the field of remote sensing: Lei et al. [23] propose a local-to-global combined network (LGCNet) for learning the multi-level features of salient objects. Jiang et al. [24] introduce a deep distillation recursive network (DDRN), which includes a multi-scale purification unit to compensate for the high-frequency components during information transmission. Lu et al. [25] present a multi-scale residual neural network (MRNN) to compensate for high-frequency information in the generated satellite images. Jiang et al. [26] introduce an edge-enhanced GAN (EEGAN) for recovering sharp edges in images. Wang et al. [27] propose an adaptive multi-scale feature fusion network (AMFFN) to preserve features and improve the efficiency of information usage. Several methods also incorporate attention mechanisms: Dong et al. [28] develop a multi-perception attention network (MPSR), which uses multi-perception learning and multi-level information fusion to optimize the generated images. Zhang et al. [29] present a mixed high-order attention network (MHAN) with an attention module that outperforms channel attention. Ma et al. [30] implement a dense channel attention network (DCAN), in which they design a dense channel attention mechanism to exploit multi-level features. Jia et al. [31] put forward a multi-attention GAN (MA-GAN), which is capable of improving the resolution of images at multiple scale factors. Furthermore, Liu et al. [32] develop a diffusion model with a detail complement mechanism (DMDC), further enhancing the super-resolution effect on small and dense targets in remote sensing images.
3. Scale-Arbitrary Super-Resolution
The standard SISR process comprises a feature extraction module and a reconstruction module. The core component of the reconstruction module is the upsampling layer, which enhances the resolution of the feature map. Currently, the deconvolutional layer [33] and sub-pixel layer [34] widely used in SISR can only handle fixed integer scale factors, making it difficult for the model to generalize further. Lim et al. [5] employ multiple upsampling branches for predetermined scale factors, but the method still cannot be applied to arbitrary scale factors.
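This limitation is structural, as the minimal sketch below illustrates (the channel counts and shapes are arbitrary assumptions): a sub-pixel layer packs r² output channels per RGB channel and rearranges them into an r-times larger plane, so r must be a fixed integer chosen when the network is constructed.

```python
import torch
import torch.nn as nn

r = 3  # scale factor is baked into the layer and must be an integer
# Produce 3 * r^2 channels, then rearrange them into an r-times larger plane.
upsample = nn.Sequential(
    nn.Conv2d(64, 3 * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(r),
)

feat = torch.randn(1, 64, 32, 32)   # LR feature map (toy data)
hr = upsample(feat)                 # (1, 3, 96, 96): 32 * r = 96
# A non-integer factor such as 2.5 cannot be expressed this way,
# which is why such layers do not generalize to arbitrary scales.
```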
Meta-SR [1] is the first work aimed at scale-arbitrary super-resolution. In Meta-SR, the meta upscale module utilizes a latent code in LR space to perform a one-to-one mapping to the RGB values in HR space. Because the switching of latent codes is discontinuous, the generated image may contain checkerboard artifacts. Chen et al. [2] propose a local implicit image function (LIIF) to represent the image as a continuous function and introduce a local ensemble to eliminate the checkerboard artifacts. However, the use of a shared MLP for each pixel in LIIF neglects the local characteristics of the image, which may result in an overly smooth image. To overcome this problem, Li et al. [35] present an adaptive local implicit image function (A-LIIF), which employs multiple MLPs to model pixel differences and increase the detail in the generated image. Ma et al. [36] introduce the implicit pixel flow (IPF) to convert the original blurry implicit neural representation into a sharp one, resolving the problem of overly smooth images generated by LIIF.
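The idea of the local ensemble can be sketched as follows (a simplified, self-contained illustration under assumed shapes, not the official LIIF code): each HR query is decoded from its four surrounding latent codes, and the four predictions are blended with area-based weights that vary continuously with the query coordinate.

```python
import torch

def local_ensemble(decode, latents, areas):
    """Blend predictions from the 4 LR neighbors of each HR query.

    decode:  maps a (N, D) batch of latent codes to (N, 3) RGB values
    latents: (4, N, D) latent codes of the 4 nearest LR positions
    areas:   (4, N) rectangle areas between each query and its diagonally
             opposite neighbor (the bilinear-style weighting used by LIIF)
    """
    preds = torch.stack([decode(latents[k]) for k in range(4)])  # (4, N, 3)
    w = areas / areas.sum(dim=0, keepdim=True)                   # normalize
    # The weights change continuously with the query coordinate, so the
    # output has no jumps when the nearest latent code switches -- this is
    # what suppresses the checkerboard artifact of a hard nearest lookup.
    return (preds * w.unsqueeze(-1)).sum(dim=0)                  # (N, 3)

# Toy usage with a random linear "decoder"
decode = torch.nn.Linear(64, 3)
latents = torch.randn(4, 10, 64)
areas = torch.rand(4, 10)
rgb = local_ensemble(decode, latents, areas)   # (10, 3)
```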
In addition to building upsampling modules as in Meta-SR and LIIF, some approaches introduce scale information into the feature extraction module to create scale-aware feature extraction modules. Such modules include the scale-aware dynamic convolutional layer for feature extraction [37], the scale attention module that adaptively rescales the convolution filters [38], and the scale-aware feature adaption blocks based on conditional convolution [39], among others. These scale-aware modules align with the target resolution, improving the learning ability of the network for arbitrary scale factors. To sum up, Meta-SR and its subsequent works significantly advance the field of scale-arbitrary super-resolution by overcoming the limitations of fixed scale factors and improving the quality of the generated images.
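One generic way to realize such scale awareness is sketched below under illustrative assumptions (the class name and layer sizes are hypothetical, not taken from [37,38,39]): a small MLP conditioned on the scale factor predicts per-channel weights that modulate the convolution output, so the same backbone adapts its features to the target resolution.

```python
import torch
import torch.nn as nn

class ScaleAwareConv(nn.Module):
    """Conv block whose output channels are modulated by the scale factor."""
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        # Tiny MLP: scale factor -> per-channel modulation weights
        self.scale_mlp = nn.Sequential(
            nn.Linear(1, 32), nn.ReLU(),
            nn.Linear(32, out_ch), nn.Sigmoid(),
        )

    def forward(self, x, scale):
        # x: (B, in_ch, H, W); scale: float scale factor, e.g. 2.7
        s = torch.tensor([[scale]], dtype=x.dtype, device=x.device)
        mod = self.scale_mlp(s).view(1, -1, 1, 1)   # (1, out_ch, 1, 1)
        return self.conv(x) * mod                   # scale-conditioned features

block = ScaleAwareConv()
feat = torch.randn(1, 64, 32, 32)
out = block(feat, scale=2.7)   # the same block serves any (non-integer) scale
```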
Studies in remote sensing have not yet fully explored scale-arbitrary super-resolution. For instance, Fang et al. [40] propose an arbitrary upscale module based on Meta-SR and add an edge reinforcement module in the post-processing stage to enhance the high-frequency information of the generated images. In addition, He et al. [41] present a video satellite image framework that enhances spatial resolution through subpixel convolution and bicubic-based adjustment. To conclude, there remains a need for a general approach to tackle the challenge of scale-arbitrary super-resolution for satellite images.
4. Image Rescaling
Compared to super-resolution models, which primarily focus on image upscaling, image rescaling (IR) integrates both image downscaling and upscaling to achieve more precise preservation of details. Therefore, the upscaling component of IR can also be used for super-resolution reconstruction of images at arbitrary resolutions. Xiao et al. [42] develop an invertible rescaling net (IRN) with a deliberately designed framework but limit it to a fixed integer scale factor. In contrast, Pan et al. [43] propose a bidirectional arbitrary image rescaling network (BAIRNet) that unifies image downscaling and upscaling as a single learning process. Later, Pan et al. [44] introduce a simple and effective invertible arbitrary rescaling network (IARN) that achieves arbitrary image rescaling with better performance than BAIRNet. In the field of remote sensing, Zou et al. [45] propose a rescaling-assisted image super-resolution method (RASR) to better restore the information lost in medium- and low-resolution remote sensing images.
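The invertibility underpinning IRN-style rescaling can be illustrated with a minimal additive coupling layer (a generic sketch of the mechanism, not IRN's actual architecture): because the transform inverts exactly, detail folded away in the downscaling direction can, in principle, be recovered in the upscaling direction.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Exactly invertible block: the core mechanism of invertible rescaling."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x1, x2):          # downscaling direction
        return x1, x2 + self.net(x1)    # x2 is shifted by a function of x1

    def inverse(self, y1, y2):          # upscaling direction
        return y1, y2 - self.net(y1)    # the shift is subtracted exactly

layer = AdditiveCoupling()
a, b = torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16)
y1, y2 = layer(a, b)
x1, x2 = layer.inverse(y1, y2)
assert torch.allclose(x2, b, atol=1e-6)   # reconstruction is exact
```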
As a newly emerged field, IR requires modeling the image downscaling process and supporting image magnification at non-integer scale factors. However, most IR methods focus on exploring continuous representations of images at lower scale factors (e.g., less than ×4) and neglect the underfitting problem that may arise at higher scale factors.