Deep Learning for Land Use

Image super-resolution (SR) techniques can improve the spatial resolution of remote sensing images to provide more feature details and information, which is important for a wide range of remote sensing applications, including land use/cover classification (LUCC). Convolutional neural networks (CNNs) have achieved impressive results in the field of image SR, but the inherent locality of convolution limits the performance of CNN-based SR models.

  • super-resolution
  • land use/cover classification
  • deep learning
  • remote sensing

1. Introduction

Long time series and high-spatial-resolution remote sensing images play a crucial role in high-precision land use/cover classification (LUCC) [1]. However, due to the limitations of hardware technology and cost, publicly available remote sensing data with high spatial resolution usually do not span long time series. For example, the Sentinel-2 satellites offer a spatial resolution of up to 10 m, but their temporal coverage only begins in 2015, and even expensive commercial satellite data are usually available only from 2000 onwards. Conversely, remote sensing data with long time series usually lack high spatial resolution. The Landsat series of satellites, for instance, has been providing valuable data since 1972, and these data are frequently used for time series land use analysis; however, their spatial resolution is limited to 30 m, which restricts their application in long-term, high-precision LUCC analysis. It is therefore crucial to improve the spatial resolution of long time series, low-spatial-resolution remote sensing images algorithmically. Traditional SR methods for remote sensing images mainly include interpolation [2], pan-sharpening (Pansharp) [3], sparse representation [4], and convex set projection [5]. Interpolation has the advantage of simplicity and speed, but its results are usually blurred. Pan-sharpening requires the sensor to provide a high-spatial-resolution panchromatic band, whose detail is then fused into the other bands to improve their spatial resolution. Methods based on sparse representation and convex set projection have high computational complexity and struggle to recover the high-frequency details of an image; convex set projection, in particular, demands a substantial amount of prior knowledge [6]. In recent years, deep learning techniques have developed rapidly and achieved impressive results in various computer vision (CV) tasks, including image super-resolution. With deep learning, low-resolution (LR) data can be super-resolved to a higher spatial resolution, which provides an opportunity to obtain higher-quality LUCC maps [7].
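
For orientation before the learning-based methods below, here is a minimal sketch of the interpolation baseline, assuming Python with Pillow, a hypothetical single-band Landsat-style image file, and a 3x upscale (30 m to 10 m); the file names are illustrative only:

```python
from PIL import Image

# Hypothetical file names, for illustration only.
lr = Image.open("landsat_band_30m.png")   # 30 m low-resolution input

scale = 3                                 # 30 m -> 10 m
hr_size = (lr.width * scale, lr.height * scale)

# Bicubic interpolation is simple and fast, but it cannot recover
# high-frequency detail, so edges and textures come out blurred.
sr = lr.resize(hr_size, resample=Image.BICUBIC)
sr.save("landsat_band_10m_bicubic.png")
```

This blurring is precisely the limitation that motivates the learned SR methods discussed next.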

2. Deep Learning for Image Super-Resolution

The super-resolution convolutional neural network (SRCNN) [8] was the first convolutional neural network used for image super-resolution. SRCNN uses a stack of three convolutional layers to learn the mapping between LR and HR images directly, end-to-end. The introduction of deep residual learning [9] pushed deep network architectures toward much greater depth, and the very deep super-resolution network (VDSR) [10] improved super-resolution performance by combining residual connections with a very deep stack of convolutional layers. SRCNN and VDSR up-sample the image before it is fed into the network, which slows training and increases computational cost. The fast super-resolution convolutional neural network (FSRCNN) [11] and the enhanced deep super-resolution network (EDSR) [12] instead up-sample the feature maps at the end of the network and achieved better super-resolution results in terms of the peak signal-to-noise ratio (PSNR) metric. Subsequently, the residual channel attention network (RCAN) [13] outperformed deep convolutional networks such as VDSR and EDSR by integrating an attention mechanism into the super-resolution network, demonstrating the mechanism's ability to reconstruct intricate texture details. Consequently, attention mechanisms have become a widespread component of super-resolution networks. The multispectral remote sensing image super-resolution convolutional neural network (msiSRCNN) [14] verified the feasibility of applying a convolutional neural network to the super-resolution of multispectral remote sensing images by fine-tuning SRCNN, achieving better results than the traditional methods; convolutional neural networks thus became the mainstream approach for the super-resolution of remote sensing images. Remote sensing images differ from ordinary optical images in their diverse feature types and the varying scales of those features. To address these issues, researchers have proposed many new structures [15,16,17,18,19] that enhance the feature-learning capability of super-resolution networks for remote sensing images. Although CNN-based methods have achieved significant results in remote sensing image super-resolution tasks, the inherent locality of CNNs makes it difficult to model the global pixel dependencies of remote sensing images, which limits further performance improvement in super-resolution tasks.
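
To make the SRCNN design concrete, the following is a minimal PyTorch sketch of a three-layer SRCNN-style network; the 9-1-5 kernel layout and 64/32 filter counts follow the original paper's description, but the class name and usage are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SRCNNLike(nn.Module):
    """Minimal SRCNN-style network (a sketch, not the exact published model).

    As in the original design, the LR image is bicubically up-sampled to the
    target size *before* entering the network, so the convolutions learn only
    the LR-to-HR mapping, not the up-scaling itself -- the trait that makes
    this family slower to train than FSRCNN/EDSR-style late up-sampling.
    """
    def __init__(self, channels: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, upsampled_lr: torch.Tensor) -> torch.Tensor:
        return self.body(upsampled_lr)

# Usage: x stands in for a bicubically up-sampled LR batch (N, C, H, W).
x = torch.randn(1, 3, 96, 96)
sr = SRCNNLike()(x)   # same spatial size, refined detail after training
```
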
The Transformer [20], with its powerful global modeling capability, quickly became the dominant approach in natural language processing (NLP). The Vision Transformer (ViT) [21] brought the Transformer into the CV domain, achieving performance beyond CNNs on large datasets. Many works have since combined CNNs with Transformers for image super-resolution [22,23,24,25,26,27]. These methods use a CNN as a shallow feature extractor and a Transformer for deep feature extraction, combining the local feature extraction capability of the CNN with the global modeling capability of the Transformer to further improve the quality of SR images. Although the Transformer effectively compensates for the locality of the CNN, multi-scale feature learning is just as important as local-global learning for the super-resolution of remote sensing images [17,18,27]; unfortunately, the standard Transformer lacks this capability. Numerous researchers have explored integrating multi-scale information into the Transformer [28,29,30,31], but these methods generally increase the number of network parameters, further burdening the training of an already large model. Other works [29,32] obtain a multi-scale hierarchical representation of images by progressively down-sampling the feature maps, but this is not applicable to the image super-resolution task. Finally, although CNN- and Transformer-based methods surpass traditional super-resolution methods and achieve higher PSNR values, the resulting images often appear blurred in terms of visual perception.
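
As a concrete illustration of the global modeling that convolution lacks, below is a minimal PyTorch sketch of multi-head self-attention applied to flattened image patches; the class name, token layout, and dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class GlobalTokenMixer(nn.Module):
    """Sketch of ViT-style global self-attention over image patches.

    Input shape: (batch, tokens, dim), where each token is a flattened
    patch. Every token attends to every other token, so dependencies
    between distant pixels are modeled directly -- at O(tokens^2) cost,
    which is what motivates windowed and other sparse variants.
    """
    def __init__(self, dim: int = 192, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)   # queries, keys, and values all from x
        return out

# A 64x64 feature map cut into 8x8 patches -> 64 tokens of dimension 192.
tokens = torch.randn(1, 64, 192)
mixed = GlobalTokenMixer()(tokens)    # (1, 64, 192), globally mixed
```
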
Generative adversarial networks (GANs) [33] have demonstrated powerful generative capabilities in image generation [34], style transfer [35], and image super-resolution [36,37,38]. A GAN consists of two sub-networks, the generator and the discriminator, trained against each other in a "zero-sum game": the generator aims to produce realistic images that deceive the discriminator, the discriminator aims to judge whether its input images are real, and the generator updates its gradients based on the discriminator's feedback. This adversarial training allows GANs to generate images that are visually superior to those of CNNs. The super-resolution GAN (SRGAN) [36] applied the GAN framework to the image super-resolution task, using a pretrained VGG19 [39] network as a feature extractor to compute a perceptual loss that optimizes the perceptual quality of the SR images. The enhanced super-resolution GAN (ESRGAN) [37] improves on SRGAN by using dense residual blocks to enhance the network's feature-learning capability and by removing SRGAN's BatchNorm layers [40]; ESRGAN remains one of the most advanced image super-resolution methods. For the task of remote sensing image super-resolution, researchers have made many improvements to the GAN framework, including introducing attention mechanisms [41,42], post-processing the super-resolved output [43], and improving the discriminator [44]. GAN-based methods have more powerful image-generation capabilities than CNN-based methods and produce SR images with richer detail; the researchers therefore chose to train their model within the GAN framework.
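
To show how the adversarial and perceptual objectives combine in SRGAN-style training, here is a hedged PyTorch sketch of a generator loss; the discriminator, the loss weight, and the truncation point of the VGG19 feature stack are illustrative assumptions rather than the exact configuration of any cited paper:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Frozen VGG19 feature extractor for the perceptual loss
# (input normalization is omitted here for brevity).
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features[:36].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

bce = nn.BCEWithLogitsLoss()

def generator_loss(sr, hr, discriminator, lambda_adv: float = 1e-3):
    # Perceptual term: distance between deep VGG19 feature maps of the
    # super-resolved image and the HR ground truth.
    perceptual = nn.functional.mse_loss(vgg(sr), vgg(hr))
    # Adversarial term: the generator is rewarded when the discriminator
    # labels the SR image as real (target = 1).
    pred_fake = discriminator(sr)
    adversarial = bce(pred_fake, torch.ones_like(pred_fake))
    return perceptual + lambda_adv * adversarial
```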

3. Deep Learning for Land Use/Cover Classification

Land use/cover classification extracts information on both natural and human-modified land types from remote sensing images, which is important in fields such as ecological protection, urban planning, and precision agriculture. Traditional LUCC methods [45,46,47] often rely on hand-crafted features, such as spectral indices [48], and ignore the spatial correlation between pixels. In contrast, deep-learning-based approaches eliminate the dependence on hand-crafted features and effectively capture both the spatial and spectral features inherent in remote sensing images [49], leading to superior classification accuracy and enhanced robustness. Fully convolutional networks (FCNs) [50] represent a foundational deep learning approach to semantic segmentation, enabling pixel-level classification of images. U-net [51], initially proposed for biomedical image segmentation, has been widely adopted for image segmentation in many fields, including remote sensing, owing to its strong performance. Like U-net, the Deeplab family of models [52,53,54,55] is another classic set of approaches for image segmentation; in contrast to U-net's stepwise down-sampling structure, Deeplab employs dilated convolutions [56] to facilitate multi-scale feature learning, thereby enhancing segmentation accuracy, as sketched below. Transformer-based classification is currently one of the research hotspots in the remote sensing LUCC task. The Transformer's self-attention mechanism allows it to model spectral features well, and many researchers have integrated CNNs and Transformers by using a CNN to extract spatial features and a Transformer to capture spectral features; such methods, which incorporate both spatial and spectral features, have achieved better accuracy in the LUCC task [57,58,59,60]. The morphFormer, proposed by Roy et al. [61], integrates a learnable spectral morphological convolution operation with a self-attention mechanism, enhancing the interaction between spectral features and improving the representation of structure and shape information in tokens. Compared to traditional CNNs and other Transformer-based LUCC models, morphFormer achieves higher classification accuracy in experiments and stands as one of the most advanced LUCC methods available at present. In this research, it was directly employed on the SR data in the second stage to perform the LUCC task.
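
To make the dilated-convolution idea behind Deeplab concrete, here is a minimal ASPP-style sketch in PyTorch; the module name and channel sizes are illustrative, and the dilation rates mirror those commonly used in the Deeplab papers:

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Minimal sketch of Deeplab-style parallel dilated convolutions.

    Each branch applies the same 3x3 kernel with a different dilation rate,
    so each covers a different receptive field; concatenating the branches
    yields multi-scale context without down-sampling the feature map.
    """
    def __init__(self, in_ch: int = 256, out_ch: int = 64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in (1, 6, 12, 18)   # one rate per parallel branch
        ])
        self.project = nn.Conv2d(out_ch * 4, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

features = torch.randn(1, 256, 32, 32)   # backbone feature map
context = MiniASPP()(features)           # (1, 64, 32, 32), multi-scale context
```
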
The objective of this research is to enhance the spatial resolution of remote sensing images using deep learning techniques, providing richer and more accurate surface information for LUCC tasks and thereby further improving the precision of LUCC. The research is divided into two main stages: remote sensing image SR and LUCC. In the SR stage, the researchers propose a new model named the dilated Transformer GAN (DTGAN) for real remote sensing image super-resolution. The generator of this model combines a CNN and a Transformer, using the CNN as a shallow feature extractor and the Transformer for deep feature extraction. At the same time, the researchers seek to address the Transformer's inability to learn multi-scale features, as well as its slow computation and large resource consumption. Influenced by [32,62,63,64], they design an attention mechanism called dilated window multi-head self-attention (DW-MHSA), which introduces multi-scale information into the Transformer and improves the computational efficiency of self-attention without increasing the number of network parameters. The discriminator of DTGAN uses PatchGAN [38]. In the LUCC stage, the researchers directly adopt morphFormer [61] for the LUCC of the SR results to verify the usability of the SR data.
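
The entry does not spell out the internals of DW-MHSA, so the following is only a schematic sketch of the general dilated-attention idea it builds on: pixels a fixed stride apart share one attention window, widening the receptive field at no extra parameter cost. All names and shapes below are hypothetical:

```python
import torch
import torch.nn as nn

class DilatedAttentionSketch(nn.Module):
    """Schematic dilated self-attention -- not the paper's exact DW-MHSA.

    Pixels `dilation` apart are regrouped into the same attention window
    (phase grouping), so each window spans a wider spatial extent while
    the attention itself sees the same number of tokens and parameters.
    """
    def __init__(self, dim: int = 64, heads: int = 4, dilation: int = 2):
        super().__init__()
        self.d = dilation
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C), with H and W divisible by the dilation rate.
        b, h, w, c = x.shape
        d = self.d
        # Regroup so that pixels d apart fall into the same window.
        x = x.view(b, h // d, d, w // d, d, c)
        x = x.permute(0, 2, 4, 1, 3, 5).reshape(b * d * d, (h // d) * (w // d), c)
        out, _ = self.attn(x, x, x)
        # Undo the regrouping, restoring the (B, H, W, C) layout.
        out = out.view(b, d, d, h // d, w // d, c)
        return out.permute(0, 3, 1, 4, 2, 5).reshape(b, h, w, c)

fmap = torch.randn(1, 16, 16, 64)          # toy feature map
out = DilatedAttentionSketch()(fmap)       # same shape, dilated receptive field
```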