1. Introduction
Advances in Earth observation technology have made very high-resolution (VHR) images of the Earth’s surface widely available. With VHR images, it is possible to accurately identify and classify land use and land cover (LULC)
[1], and demand for such tasks is high. Scene classification in remote sensing images aims to automatically categorize image scenes into relevant classes such as residential areas, cultivated land, forests, etc.
[2], and it has drawn considerable attention. In recent years, scene classification of VHR satellite images has been applied to disaster detection
[3], land use
[4][5][6][7][8][9], and urban planning
The implementation of deep learning (DL) for scene classification is an emerging trend, with efforts focused on achieving maximum accuracy.
In the early days of remote sensing, the spatial resolution of images was relatively low, so the pixel size was comparable to the size of the objects of interest
[11]. As a consequence, the studies on remote sensing classification were based on pixel-level
[11][12][13]. Subsequently, increases in spatial resolution shifted research toward object-level classification, which produced better results than per-pixel analysis
[14]. This approach dominated the remote sensing classification domain for decades, including
[15][16][17][18][19][20]. However, as remote sensing images continue to improve, a single image now captures many distinct object classes, rendering traditional pixel-level and object-level methods inadequate for accurate image classification. In such a scenario, scene-level classification became crucial for interpreting the global content of remote sensing images
[21]. Thus, numerous studies on scene-level analysis have appeared over the past few years
[22][23][24][25][26].
Figure 1 illustrates the progression of classification approaches from pixel-level to object-level and finally to scene-level.
Figure 1. Timeline of remote sensing classification approaches. The spatial resolution of images increased over time, resulting in three classification levels: pixel-level, object-level, and scene-level classification.
The preliminary methods for remote sensing scene classification relied predominantly on low-level features such as texture, color, gradient, and shape. Hand-crafted features like the Scale-Invariant Feature Transform (SIFT), Gabor filters, the local binary pattern (LBP), the color histogram (CH), the gray-level co-occurrence matrix (GLCM), and the histogram of oriented gradients (HOG) were designed to capture specific patterns or information from these low-level cues
[7][27][28][29][30]. These features are crafted by domain experts rather than learned automatically from data. Methods based on low-level features rely on uniform texture and cannot handle complex scenes. In contrast, mid-level methods extract more complex patterns through clustering, grouping, or segmentation
[31][32]. The idea is to acquire local attributes from small regions or patches within an image and encode them to capture intricate and detailed patterns
[33]. The bag-of-visual-words (BoVW) model is a widely used mid-level approach for scene classification in the remote sensing field
[34][35][36]. However, the limited representation capability of mid-level approaches has hindered breakthroughs in remote sensing image scene classification.
In recent times, DL models emerged as a solution to address the aforementioned limitations in low-level and mid-level methods. DL architectures implement a range of techniques, including Convolutional Neural Networks (CNNs)
[37][38], Vision Transformers (ViT), and Generative Adversarial Networks (GANs), to learn discriminative features for effective feature extraction. For scene classification, the available datasets are grouped into diverse scenes. The DL architectures are either trained on these datasets to obtain the predicted scene class
[39], or pretrained DL models are used to obtain derived classes from the same scene classes
[40][41], depending upon the application. In the context of remote sensing scene classification, the experiments are mainly focused on achieving optimal scene prediction by implementing DL architectures. CNN architectures like Residual Network (ResNet)
[42], AlexNet
[43], GoogLeNet
[44], etc., are commonly used for remote sensing scene classification. Operations like fine-tuning
[45], adding low-level and mid-level feature descriptors like LBP
[46], BoVW
[47], etc., and developing novel architectures
[33][48] are performed to obtain nearly ideal scene classification results. Furthermore, ViTs and GANs are also used to advance research and applications in this field.
2. High-Resolution Scene Classification Datasets
Multiple VHR remote sensing datasets are available for scene classification. The UC Merced Land Use Dataset (UC-Merced or UCM)
[49] is a popular dataset obtained from the United States Geological Survey (USGS) National Map Urban Area Imagery collection covering US regions. Some of the classes belonging to this dataset are airplanes, buildings, forests, rivers, agriculture, beach, etc., depicted in
Figure 2. The Aerial Image Dataset (AID)
[23] is another dataset acquired from Google Earth Imagery with a higher number of images and classes than the UCM dataset.
Table 1 lists some widely used datasets for scene recognition in the remote sensing domain. Some of the scene classes are common in multiple datasets. For instance, the “forest” class is included in UCM, WHU-RS19
[50], RSSCN7
[51], and NWPU-RESISC45
[52] datasets. However, the scene classes vary among different datasets. In addition, the number of images, image size, and spatial resolution differ from dataset to dataset. Thus, the selection of a dataset depends on the research objectives. Cheng et al.
[52] proposed NWPU-RESISC45, a novel large-scale dataset with rich image variations, high within-class diversity, and high between-class similarity, addressing the problems of small scale, limited variation, and low diversity in earlier datasets. Miao et al.
[53] merged the UCM, NWPU-RESISC45, and AID datasets to prepare a larger remote sensing scene dataset for semi-supervised scene classification and attained performance similar to state-of-the-art methods. On these VHR datasets, multiple DL architectures have been evaluated in pursuit of optimal accuracy.
Figure 2. Sample images of some classes from the UCM dataset. In total, there are 21 classes in this dataset.
Table 1. Scene databases used for scene classification.
3. CNN-Based Scene Classification Methods
CNNs
[59] effectively extract meaningful features from images by utilizing convolutional kernels and hierarchical layers. A typical CNN architecture includes a convolutional layer, pooling layer, Rectified Linear Unit (ReLU) activation layer, and fully connected (FC) layer
[60] as shown in
Figure 3. The mathematical formula for the output of each filter in a 2D convolutional layer is provided in Equation (1).
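In its standard form, consistent with the variable definitions below, Equation (1) can be written (with $x_m^{l-1}$ denoting input feature map $m$ of the previous layer and $*$ denoting 2D convolution) as

$$x_{n}^{l} = f\Big(\sum_{m} x_{m}^{l-1} * w_{m,n}^{l} + b_{n}^{l}\Big). \qquad (1)$$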
where $f(\cdot)$ represents the activation function, $w_{m,n}^{l}$ is the weight associated with the connection between input feature map $m$ and output feature map $n$, and $b_{n}^{l}$ denotes the bias term associated with output feature map $n$. The convolution layers learn features from input images, followed by the pooling layer, which reduces computational complexity while retaining multi-scale information. The ReLU activation function introduces non-linearity to the network, allowing for the modeling of complex relationships and enhancing the network’s ability to learn discriminative features. The successive use of convolutional and pooling layers allows the network to learn increasingly complex and abstract features at different scales. The first few initial layers capture low-level features such as edges and textures, while deeper layers learn high-level features and global structures
[61]. The FC layers serve as a means of mapping the learned features to the final output space, enabling the network to make predictions based on the extracted information. In a basic CNN structure, the output of an FC layer is fed into either a
softmax or a
sigmoid activation function for classification tasks. However, the FC layers account for the majority of the network’s parameters, increasing the risk of overfitting.
Dropout is implemented to counter this problem
[62]. To minimize loss and improve the model’s performance, several optimizers like Adam
[63] and stochastic gradient descent (SGD)
[64] are prevalent in research.
Figure 3. A typical CNN architecture. Downsampling operations such as pooling and convolution layers allow the capture of multi-scale information from the input image, finally classifying them with an FC layer.
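As a concrete illustration of this layout, the following minimal sketch (PyTorch assumed; the channel sizes, input resolution, and 21-class output are illustrative, not taken from any cited work) stacks convolution, ReLU, and pooling layers, followed by dropout and an FC layer:

```python
# A minimal sketch of the CNN layout in Figure 3 (PyTorch assumed; channel sizes,
# input resolution, and the 21-class output are illustrative).
import torch
import torch.nn as nn

class SimpleSceneCNN(nn.Module):
    def __init__(self, num_classes: int = 21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # convolutional layer
            nn.ReLU(inplace=True),                       # ReLU non-linearity
            nn.MaxPool2d(2),                             # pooling: downsample, keep salient info
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                           # dropout to reduce FC-layer overfitting
            nn.Linear(64 * 56 * 56, num_classes),        # FC layer maps features to classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))        # logits

logits = SimpleSceneCNN()(torch.randn(1, 3, 224, 224))  # one 224x224 RGB scene image
probs = torch.softmax(logits, dim=1)                    # softmax gives class probabilities
```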
Multiple approaches have been explored to optimize the feature extraction process for accurate remote sensing scene classification using CNNs. They are sub-divided into two categories: pretrained CNNs and CNNs trained from scratch.
3.1. Pretrained CNNs for Feature Extraction
Collecting and annotating data for large remote sensing scene datasets is costly and laborious. To address the scarcity of data in the remote sensing domain, researchers often utilize terrestrial image datasets such as ImageNet
[65] and PlacesCNN
[66], which contain a large number of diverse images from various categories. Wang et al.
[67] described the local similarity between remote sensing scene images and natural scene images. By leveraging models pretrained on these datasets, CNN algorithms can benefit from the learned features and generalize well to remote sensing tasks with limited labeled data. This process is illustrated in
Figure 4, showcasing the role of pretrained CNNs.
Figure 4. Pretrained CNN utilized for feature extraction from the aerial image dataset. The weight parameters of a CNN pretrained on the ImageNet dataset are transferred to the new CNN. The top layers are replaced with a custom layer configuration fitted to the target task.
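The weight-transfer step in Figure 4 can be sketched as follows (PyTorch/torchvision assumed; freezing the backbone and the 21-class head are illustrative choices, not the exact setup of any cited study):

```python
# Sketch of the weight-transfer step in Figure 4 (PyTorch/torchvision assumed;
# freezing the backbone and the 21-class head are illustrative choices).
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)  # ImageNet weights

# Freeze the pretrained convolutional layers so they act as a fixed feature extractor.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the ImageNet classification head with a custom layer for the target scene classes.
backbone.fc = nn.Linear(backbone.fc.in_features, 21)  # e.g., 21 UCM scene classes
```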
In 2015, Penatti et al.
[30] introduced CNNs pretrained on ImageNet into remote sensing scene classification, obtaining better classification results on the UCM dataset than low-level descriptors. On the diverse large-scale NWPU-RESISC45 dataset, three popular pretrained CNNs: AlexNet
[62], VGG-16
[68] and GoogLeNet
[69], improved performance by at least 30% compared with handcrafted and unsupervised feature learning methods
[52]. The NWPU-RESISC45 dataset is challenging owing to its high intra-class diversity and inter-class similarity. Sen et al.
[70] adopted a hierarchical approach to mitigate misclassification. Their method operates at two levels: (i) all 45 classes are rearranged into 5 main classes (Transportation, Water Areas, Buildings, Constructed Lands, Natural Lands), and (ii) the 45 sub-classes are trained within each main class. DenseNet-121
[71] pretrained on ImageNet is used as a feature extractor in both levels. Al et al.
[72] combined four scene datasets, namely UCM, AID, NWPU, and PatternNet, to construct a heterogeneous scene dataset. For compatibility, only the 12 classes shared across the datasets are retained and used with an MB-Net architecture, which is based on a pretrained ResNet-50
[73]. MB-Net is designed to capture collective knowledge from three labeled source datasets and perform scene classification on a remaining unlabeled target dataset. Shawky et al.
[74] introduced a data augmentation strategy for a CNN-MLP architecture with Xception
[75] as a feature extractor. Sun et al.
[76] captured multi-scale ground objects using a multi-level convolutional pyramid semantic fusion (MCPSF) architecture and differentiated intricate scenes consisting of diverse ground objects.
Yu et al.
[77] introduced a feature fusion strategy in which CNNs are utilized to extract features from both the original image and a processed image obtained through saliency detection. The extracted features from these two sources are then fused together to produce more discriminative features. Ye et al.
[78] proposed a parallel multi-stage (PMS) architecture based on a GoogLeNet backbone to learn features individually from three hierarchical levels (low, middle, and high) prior to fusion. Dong et al.
[79] integrated a Deep Convolutional Neural Network (DCNN) with Broad Learning System (BLS)
[80] for the first time in the remote sensing scene classification domain and named the resulting architecture FDPResNet. The DCNN, with ResNet-101 pretrained on ImageNet as its backbone, extracted both shallow and deep features, which were then fused and passed to the BLS for classification. CNN architectures for remote sensing scene classification vary in design through the incorporation of additional techniques and methodologies; the most widely employed approaches are LBP-based methods, fine-tuning, parameter reduction, and attention mechanisms.
LBP-based pretrained CNNs: LBP is a widely used robust low-level descriptor for recognizing textures
[81]. In 2018, Anwer et al.
[46] proposed Tex-Net architecture, which combined an original RGB image with a texture-coded mapped LBP image. The late fusion strategy (Tex-Net-LF) performed better than early fusion (Tex-Net-EF). Later, Yu et al.
[82], who previously introduced the two-stream deep fusion framework
[77], adopted the same concept, integrating an LBP-coded image as a replacement for the processed image obtained through saliency detection. They conducted a combination of previously proposed and new experiments using the LBP-coded image and fused the resulting features. Huang et al.
[83] stated that two-stream architectures focus solely on the RGB image stream and overlook texture-coded images. Therefore, their CTFCNN architecture, based on pretrained CaffeNet
[84], extracted three kinds of features: (i) convolutional features from multiple layers, where an improved bag-of-visual-words (iBoVW) method represents the discriminating information of each layer, (ii) FC features, and (iii) LBP-based FC features. Compared to traditional BoVW
[35], the iBoVW coding method achieved rational representation.
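As a rough illustration of how a texture-coded input for such a second stream can be produced, the following sketch computes an LBP map with scikit-image (the file path, radius, and number of sampling points are illustrative; the cited works may use different LBP variants):

```python
# Sketch of producing a texture-coded LBP image for a second CNN stream alongside
# the RGB stream (scikit-image assumed; path, radius, and sampling points illustrative).
import numpy as np
from skimage import color, io
from skimage.feature import local_binary_pattern

rgb = io.imread("scene.jpg")                       # hypothetical scene image path
gray = color.rgb2gray(rgb)

radius, n_points = 1, 8                            # classic 8-neighbour LBP
lbp = local_binary_pattern(gray, n_points, radius, method="uniform")

# Replicate the single-channel LBP map to three channels so it matches the input
# shape expected by a standard CNN stream; the two streams can then be fused later.
lbp_coded = np.repeat(lbp[..., np.newaxis], 3, axis=2).astype(np.float32)
```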
Fine-tuned pretrained CNNs: Cheng et al.
[52] not only used pretrained CNNs for feature extraction on the NWPU-RESISC45 dataset but also fine-tuned them with an increased learning rate in the last layer to gain better classification results. For the same dataset, Yang et al.
[85] fine-tuned three CNN models: VGG-16 and DenseNet-161, pretrained on ImageNet, for deep-learning classifier training, and a feature pyramid network (FPN)
[86] pretrained on Microsoft COCO (Common Objects in Context)
[87] for deep-learning detector training. The combination of DenseNet+FPN exhibited exceptional performance. Zhang et al.
[33] used a trial-and-error approach to set the hyperparameters and achieve better accuracy. Petrovska et al.
[88] implemented linear learning rate decay, which decreases the learning rate over time, and cyclical learning rates
[89]. The improved accuracy utilizing fine-tuning on pretrained CNNs validates the statement made by Castelluccio et al.
[90] in 2015.
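A minimal sketch of such a fine-tuning setup, assuming PyTorch/torchvision and illustrative settings (a larger learning rate for the replaced last layer and a linear decay schedule, not the exact configuration of any cited study), is shown below:

```python
# Sketch of fine-tuning a pretrained CNN with a larger learning rate near the output
# and a linearly decaying schedule (learning rates, class count, and epochs illustrative).
import torch
from torchvision import models

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[-1] = torch.nn.Linear(4096, 45)   # e.g., 45 NWPU-RESISC45 classes

optimizer = torch.optim.SGD(
    [
        {"params": model.features.parameters(), "lr": 1e-4},    # small lr for pretrained layers
        {"params": model.classifier.parameters(), "lr": 1e-2},  # larger lr near the output
    ],
    momentum=0.9,
)
# Linear learning-rate decay over 30 epochs (scheduler.step() called once per epoch).
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.1, total_iters=30
)
```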
Parameter reduction: CNN architectures contain a substantial number of parameters; VGG-16, for example, comprises approximately 138 million parameters
[91]. The large number of parameters is one of the factors contributing to over-fitting
[92][93]. Zhang et al.
[94] utilized DenseNet, which is known for its parameter efficiency, with around 7 million parameters. Yu et al.
[95] integrated the lightweight CNN MobileNet-v2
[96] with a bilinear feature fusion model
[97] and termed the architecture BiMobileNet. BiMobileNet has a parameter count of only 0.86 million, which is six, eleven, and eighty-five times lower than the counts reported in
[82],
[45] and
[98], respectively, while achieving better accuracy.
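Parameter counts of the kind quoted above can be checked directly, for example with the following sketch (PyTorch/torchvision assumed; exact figures depend on the library version and model variant):

```python
# Sketch for checking parameter counts of the architectures discussed above.
from torchvision import models

def count_params(model) -> float:
    """Return the number of parameters in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

for name, net in [("VGG-16", models.vgg16()),
                  ("DenseNet-121", models.densenet121()),
                  ("MobileNet-v2", models.mobilenet_v2())]:
    print(f"{name}: {count_params(net):.1f} M parameters")
```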
Attention mechanism: In the process of extracting features from entire images, it is essential to consider that images contain various objects and features. Therefore, selectively focusing on critical parts and disregarding irrelevant ones becomes crucial. Zhao et al.
[99] added a channel-spatial attention
[100] module (CBAM) following each residual dense block (RDB) based on DenseNet-101 backbone pretrained on ImageNet. CBAM helps to learn meaningful features in both channel and spatial dimensions
[101]. Ji et al.
[102] proposed an attention network based on the VGG-VD16 network that localizes discriminative areas at three different scales. The multiscale images are fed to sub-network CNN architectures and then fused for classification. Zhang et al.
[103] introduced a multiscale attention network (MSA-Network) with a ResNet backbone. A multiscale module is integrated after each residual block to extract multiscale features, and a channel and position attention (CPA) module is added after the last multiscale module to extract discriminative regions. Shen et al.
[104] incorporated two models, namely ResNet-50 and DenseNet-121, to compensate for the shortcomings of single CNN models. Both models captured low-, middle-, and high-level features, which were combined using a grouping-attention-fusion strategy. Guo et al.
[105] proposed a multi-view feature learning network (MVFL) with three branches: (i) a channel-spatial attention branch to localize discriminative areas, (ii) a triplet metric branch, and (iii) a center metric branch to increase inter-class distance and decrease intra-class distance. Zhao et al.
[106] designed an enhanced attention module (EAM) to strengthen the network’s ability to learn more discriminative features. The EAM uses two depthwise dilated convolution branches, each with a different dilation rate. Dilated convolutions enlarge the receptive field without increasing the parameter count, effectively capturing multiscale contextual information and improving the network’s capacity to learn features. The two branches are merged with depthwise convolutions to reduce dimensionality. Hu et al.
[107] introduced a multilevel inheritance network (MINet), where FPN based on ResNet-50 is adopted to acquire multilayer features. Subsequently, an attention mechanism is employed to augment the expressive capacity of features at each level. For the fusion of features, the feature weights across different levels are computed by leveraging the SENet
[108] approach.
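The channel-spatial attention idea recurring in these works can be sketched as a CBAM-style block (PyTorch assumed; the reduction ratio and 7×7 spatial kernel are common choices, not the exact configuration of any cited method):

```python
# A compact sketch of a CBAM-style channel-spatial attention block (PyTorch assumed;
# reduction ratio and spatial kernel size are illustrative).
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dimensions, then re-weight channels.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: re-weight positions from pooled channel statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3))                                 # (B, C) average pooling
        mx = x.amax(dim=(2, 3))                                  # (B, C) max pooling
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca.view(b, c, 1, 1)                              # channel-refined features
        sa_in = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        sa = torch.sigmoid(self.spatial_conv(sa_in))
        return x * sa                                            # spatially refined features
```

Such a block is typically inserted after a residual or dense block of the backbone, in the spirit of the methods above.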
3.2. CNNs Trained from Scratch
Pretrained CNNs are typically trained on large-scale datasets such as ImageNet and are therefore not tailored to the specific characteristics of the target dataset. Modifying pretrained CNNs is also inconvenient because of their architectural complexity and compatibility constraints. Although pretrained CNNs have attained outstanding classification results, Zhang et al.
[109] addressed the complexity of pretrained CNNs caused by their extensive parameter counts and implemented a lightweight MobileNet-v2 with dilated convolution and channel attention. He et al.
[110] introduced a skip-connected covariance (SCCov) network, where a skip connection is added along with covariance pooling. The SCCov architecture reduced the number of parameters while achieving better scene classification. Zhang et al.
[111] proposed a gradient-boosting random convolutional network to assemble various non-pretrained deep neural networks (DNNs). A simplified representation of CNN architectures trained from scratch is illustrated in
Figure 3. Such a CNN is trained solely on the target dataset, without involving a CNN pretrained on another dataset.
4. Vision Transformer-Based Scene Classification Methods
The ViT
[112] model can perform image feature extraction without relying on convolutional layers. This model utilizes a transformer architecture
[113], initially introduced for natural language processing. In ViT, an input image is partitioned into fixed-size patches, and each patch is then transformed into a continuous vector through a process known as linear embedding. Moreover, position embeddings are added to the patch embeddings to retain positional information. Subsequently, the resulting sequence of embeddings (Equation (2)) is fed into the transformer encoder, which consists of alternating layers of Multi-head Self-Attention (MSA)
[113] (Equation (3)) and multi-layer perceptron (MLP) (Equation (4)). Layernorm (LN)
[114] is implemented prior to both MSA and MLP to reduce training time and stabilize the training process. Residual connections are applied after every layer to improve the performance. The MLP has two layers with a Gaussian Error Linear Unit (GELU)
[115] activation function. In the final layer of the encoder, the first element of the sequence is passed to an external head classifier for prediction (Equation (5)).
Figure 5 illustrates the ViT architecture for remote sensing scene classification.
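Written out, Equations (2)–(5) take the standard ViT form, reproduced here to match the definitions below:

$$z_0 = [x_{\mathrm{class}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E] + E_{pos}, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D},\; E_{pos} \in \mathbb{R}^{(N+1) \times D} \qquad (2)$$

$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \qquad l = 1, \dots, L \qquad (3)$$

$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \qquad l = 1, \dots, L \qquad (4)$$

$$y = \mathrm{LN}(z_L^0) \qquad (5)$$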
where $x_{\mathrm{class}}$ represents the embedding for the class token, $x_p^i E$ denotes the embeddings of the different patches flattened from the original image and concatenated with $x_{\mathrm{class}}$, and $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the patch embedding matrix, with $P$ the patch size, $C$ the number of channels, and $D$ the embedding dimension. The positional embedding $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ is added to the patches, accounting for $N+1$ positions (including the class token), each in a $D$-dimensional space.
In Equation (3), $z'_l$ is the output of the MSA layer, applied after LN to the output $z_{l-1}$ of the $(l-1)$-th layer (with $z_0$ the initial embedding sequence), incorporating a residual connection. $L$ represents the total number of layers in the transformer.
In Equation (4), $z_l$ is the output of the MLP layer, applied after LN to $z'_l$ from the same layer $l$, incorporating a residual connection.
Figure 5. A ViT architecture. A remote sensing scene image is partitioned into fixed-size patches, each of them linearly embedded, and positional embeddings are added. The resulting sequence of vectors is fed to a transformer encoder for classification.
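The patch-embedding step of Equation (2) can be sketched as follows (PyTorch assumed; the 16×16 patches and 768-dimensional embedding follow common ViT defaults and are illustrative):

```python
# Sketch of the patch-embedding step of Equation (2): split the image into fixed-size
# patches, linearly embed them, prepend the class token, and add positional embeddings.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening patches and applying E.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                     # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # E_pos

    def forward(self, x):
        patches = self.proj(x).flatten(2).transpose(1, 2)        # (B, N, D) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        return torch.cat([cls, patches], dim=1) + self.pos_embed # z_0 in Equation (2)

z0 = PatchEmbedding()(torch.randn(1, 3, 224, 224))  # shape: (1, 197, 768)
```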
ViT performs exceptionally well in capturing contextual features. Bazi et al.
[116] introduced ViT for remote sensing scene classification and obtained a promising result compared to state-of-the-art CNN-based scene classification methods. Their method involves data augmentation strategies to improve classification accuracy. Furthermore, the network is compressed by pruning to reduce the model size. Bashmal et al.
[117] utilized a data-efficient image transformer (DeiT), trained via knowledge distillation on a smaller dataset, and showed promising results. Bi et al.
[118] used the combination of ViT and supervised contrastive learning (CL)
[119], named ViT-CL, to increase the robustness of the model by learning more discriminative features. However, ViTs face limitations in learning local information, and their computational complexity is significantly high
[120]. Peng et al.
[121] addressed this challenge and introduced a local–global interactive ViT (LG-ViT). In the LG-ViT architecture, images are partitioned to learn features at two different scales: small and large. ViT blocks learn from both scales to handle the problem of scale variation. In addition, a global-view network is implemented to learn global features from the whole image. The features obtained from the global-view network are embedded with the local representation branches, which enhances local–global feature interaction.
CNNs excel at preserving local information but lack the ability to comprehensively capture global contextual features. ViTs are well-suited to learn long-range contextual relations. The hybrid approach of using CNNs and transformers leverages the strengths of both architectures to improve classification performance. Xu et al.
[120] integrated ViT and CNN to harness the strength of CNN. In their study, ViT is used to extract rich contextual information, which is transferred to the ResNet-18 architecture. Tang et al.
[122] proposed an efficient multiscale transformer and cross-level attention learning (EMTCAL), which also combines CNN with a transformer to extract maximum information. They employed ResNet-34 as a feature extractor in the CNN model. Zhang et al.
[123] proposed a remote sensing transformer (TRS) to capture the global context of the image. TRS combines self-attention with ResNet through the Multi-Head Self-Attention layer (MHSA), replacing the conventional 3 × 3 spatial convolutions in the bottleneck. Additionally, the approach incorporates multiple pure transformer encoders, leveraging attention mechanisms to enhance the learning of representations. Wang et al.
[124] utilized pretrained Swin Transformer
[125] to capture features at multiple levels, followed by patch merging to concatenate the patches (except in the last block), these two elements forming the Swin Transformer Block (STB). The multilevel features obtained from the STBs are eventually merged using a technique inspired by
[86], and then further compressed using convolutions within an adaptive feature compression module to remove redundancy among the multilevel features. Guo et al.
[126] integrated Channel-Spatial Attention (CSA) into the ViT
[112] and termed the architecture Channel-Spatial Attention Transformers (CSAT). The combined CSAT model accurately acquires and preserves both local and global knowledge.
5. GAN-Based Scene Classification Methods
Supervised learning methods effectively perform remote sensing scene classification. However, owing to the limited number of labeled scene images, Miao et al. merged the UCM, NWPU-RESISC45, and AID datasets to create a larger remote sensing scene dataset
[53]. Manually annotating samples to label scenes is laborious and expensive. GANs
[127] can extract meaningful information from unlabeled data. A GAN is centered around two models, the generator and the discriminator, illustrated in
Figure 6. The generator is trained to create synthetic data that resemble real data, aiming to deceive the discriminator. On the other hand, the discriminator is trained to distinguish between the generated (fake) data and the real data
[128]. The overall objective of GAN is to achieve a competitive interplay between these two models, driving the generator to produce increasingly realistic data samples while the discriminator becomes better at detecting the generated data. In Equation (6), the value function describes the training process of GAN as a minimax game.
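In its standard form, consistent with the description below, the value function of Equation (6) is

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (6)$$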
where the input from the latent space $z$ is provided to the generator $G$ to generate the synthetic image $G(z)$. $G(z)$ is further fed to the discriminator $D$, alongside the real image $x$. The discriminator predicts each sample as synthetic (0) or real (1) based on its judgment. This process optimizes $G$ and $D$ through a dynamic interplay. $G$ trains to minimize $\log(1 - D(G(z)))$, driving it to create synthetic images that resemble real ones. Simultaneously, $D$ trains to maximize $\log(D(x)) + \log(1 - D(G(z)))$, refining its ability to distinguish real from synthetic samples.
Figure 6. A GAN architecture consisting of generator G and discriminator D. G generates the synthetic image G(z) from the latent space z. G(z) and the real image x are then fed into D, which is responsible for distinguishing between G(z) and x.
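A minimal sketch of one adversarial training step implied by Equation (6) is given below (PyTorch assumed; the fully connected generator/discriminator, image size, batch size, and learning rates are illustrative only):

```python
# Minimal sketch of one GAN training step for the minimax objective of Equation (6).
import torch
import torch.nn as nn

latent_dim, image_dim, batch = 100, 64 * 64, 32
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, image_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(image_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

x = torch.rand(batch, image_dim) * 2 - 1  # real (flattened) images scaled to match Tanh range
z = torch.randn(batch, latent_dim)        # latent samples

# Discriminator step: maximize log D(x) + log(1 - D(G(z))).
d_loss = bce(D(x), torch.ones(batch, 1)) + bce(D(G(z).detach()), torch.zeros(batch, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: minimize log(1 - D(G(z))) (implemented via the non-saturating loss).
g_loss = bce(D(G(z)), torch.ones(batch, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```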
Lin et al.
[129] acknowledged the unavailability of sufficient labeled data for remote sensing scene classification, which led them to introduce multiple-layer feature-matching generative adversarial networks (MARTA GANs). MARTA GANs fuse mid-level features with global features so that the descriptor learns better representations. Xu et al.
[130] replaced ReLU with scaled exponential linear units (SELU)
[131] activation, enhancing GAN’s ability to produce high-quality images. Ma et al.
[132] observed that samples generated by GANs had been used only for self-training and introduced a new approach, SiftingGAN, to generate a large number of authentic labeled samples. Wei et al.
[133] introduced multilayer feature fusion Wasserstein GANs (MF-WGANs), where a multi-feature fusion layer follows the discriminator to learn mid-level and high-level feature information.
Unsupervised learning methods leave labeled scene images unexploited, yet leveraging annotations can significantly enhance classification performance. Therefore, Yan et al.
[134] incorporated semi-supervised learning into GAN, aiming to exploit the benefits of labeled images and enhance classification performance. Miao et al.
[53] introduced a semi-supervised representation consistency Siamese network (SS-RCSN), which incorporates Involution-GAN for unsupervised feature learning and a Siamese network to measure the similarity between labeled and unlabeled data in a high-dimensional space. Additionally, representation consistency loss in the Siamese network aids in minimizing the disparities between labeled and unlabeled data.