New Semantic Segmentation Method for Remote Sensing Images: Comparison
Please note this is a comparison between Version 1 by Zimeng Yang and Version 2 by Jason Zhu.

Semantic segmentation is an important task in the interpretation of remote sensing images. Remote sensing images are large, contain substantial spatial semantic information, and generally exhibit strong symmetry; at the same time, they show large intraclass variance and small interclass variance, which leads to class imbalance and poor segmentation of small objects.

  • remote sensing
  • semantic segmentation
  • coordinate attention mechanism

1. Introduction

With the continuous development of computer vision technology, machine learning provides a variety of techniques and tools for identifying and extracting important symmetric features from remote sensing data [1][2][3]. However, unlike natural images, remote sensing images cover a wide imaging range with complex and diverse backgrounds, and contain more spectral channels and more complex image structures [4]. Class imbalance and the segmentation of small objects are the main factors that degrade semantic segmentation performance on remote sensing images [5].
Convolutional neural networks (CNNs) have been successfully applied to many semantic segmentation tasks [6][7]. The classical semantic segmentation models and their contributions are shown in Table 1. Great efforts have been made to successfully apply deep learning methods to the segmentation of remote sensing data [8][9]. Compared with natural image datasets, those comprising remote sensing images have higher intraclass variance and lower interclass variance, making the labeling task difficult [10]. To deal with the special data structure of remotely sensed images, Geng et al. [11] have extended the long short-term memory (LSTM) network [12] to extract contextual relationship information, where the LSTM algorithm learns potential spatial correlations. Mou et al. [13] and Tao et al. [14] have designed a spatial relationship module and a spatial information inference structure, respectively, in order to build more effective contextual spatial relationship models. Multiscale modules have also been improved to better capture long-range context and location-sensitive information among features in remote sensing images. ASPP [15] uses dilated convolution to increase the size of the receptive field while controlling the number of parameters. However, when remote sensing images contain objects with large size disparity, the pyramid pooling module cannot capture small objects well [16]. To address the uneven data distribution among different labels in image segmentation, some researchers have tried to take symmetry into account in deep learning models and architectures [17][18][19][20]. Lv et al., inspired by symmetric neural networks, proposed a method for detecting and tracking objects at night that algorithmically enhances object features together with location and appearance information [21]. Park et al. proposed a symmetric graph convolutional autoencoder which produces a low-dimensional latent representation from a graph [22]. These approaches not only help to balance the data distribution but also reduce the complexity of the model.
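Dilated (atrous) convolution, as used in ASPP, enlarges the receptive field without adding parameters: a kernel of size k with dilation rate r behaves like a kernel of effective size k + (k - 1)(r - 1). A minimal sketch (the dilation rates shown are illustrative, not the exact rates of any particular model):

```python
def effective_kernel_size(k: int, r: int) -> int:
    """Effective kernel size of a dilated convolution: k + (k - 1) * (r - 1)."""
    return k + (k - 1) * (r - 1)

# ASPP-style branches commonly pair a 3x3 kernel with several dilation rates,
# so each branch sees context at a different scale with the same parameter count.
for rate in (1, 6, 12, 18):  # example rates; actual rates vary by model variant
    print(f"rate={rate:2d} -> effective kernel {effective_kernel_size(3, rate)}")
```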
Table 1. The classical semantic segmentation models and their contributions.

| Model | Contribution | Backbone | Dataset | mIoU (%) |
|---|---|---|---|---|
| FCN-8s [23] | Fully convolutional network without fully connected (fc) layers; adapts to any input image size | VGG16 | VOC 2012 | 62.2 |
| SegNet [24] | Reduces the parameters that need to be trained for upsampling in FCN | VGG16 | CamVid | 60.1 |
| PSPNet [25] | Provides effective global context priors for pixel-level scene parsing | ResNet | VOC 2012 | 82.6 |
| DeepLab V3 [26] | Applies atrous convolution to the expansion module and improves the atrous spatial pyramid pooling module | ResNet | Cityscapes | 81.3 |
| DeepLab V3+ [26] | The improved Xception computes faster without reducing accuracy | Xception | Cityscapes | 82.1 |
| ReSegNet [27] | Proposes a new residual encoder–decoder architecture to alleviate inadequate learning | VGG16 | ISPRS Vaihingen | 74.63 |
Moreover, attention mechanisms have been successfully applied in semantic segmentation [28][29] over the past few years, where introducing an attention mechanism into a semantic segmentation model allows the model to better focus on meaningful image features [30]. In CNNs, channel attention [31] is usually implemented after each convolution [32], while spatial attention is typically implemented at the end of the network [33]. As a symmetric semantic segmentation model, U-Net can obtain the context information of an image while locating the segmentation boundary accurately. In U-Net-based networks, channel attention is usually added in each layer of the upsampling part [34]. However, channel attention only considers interchannel information and ignores the importance of location information, which is crucial for obtaining the object structure of remote sensing images [35]. To enhance the perception of information channels and important regions, Woo et al. [36] have proposed the convolutional block attention module (CBAM), which links channel attention and spatial attention in tandem. However, convolution can only capture local relationships and ignores the relational information between distant objects. Therefore, Hou et al. [37] have proposed a new coordinate attention mechanism that embeds location information into channel attention, and successfully applied it to the semantic segmentation of natural images.
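The tandem arrangement in CBAM (channel attention first, then spatial attention) can be illustrated with a parameter-free NumPy sketch. The real module learns a small shared MLP and a convolution for the two branches; this simplified version only shows the order of operations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_like(x):
    """CBAM-style sketch: channel attention followed by spatial attention.

    x -- feature map of shape (C, H, W).
    The learned MLP/convolution weights of the actual CBAM are omitted;
    plain averages stand in for the squeeze operations.
    """
    # Channel attention: squeeze the spatial dims, re-weight each channel.
    chan = sigmoid(x.mean(axis=(1, 2), keepdims=True))   # shape (C, 1, 1)
    x = x * chan
    # Spatial attention: squeeze the channel dim, re-weight each location.
    spat = sigmoid(x.mean(axis=0, keepdims=True))        # shape (1, H, W)
    return x * spat
```

The design choice the text highlights is the sequencing: the feature map is first modulated channel-wise ("what" is important), then location-wise ("where" is important).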
The above deep learning methods for class imbalance and small objects in remote sensing image segmentation do not fully utilize the spatial feature information and location-sensitive information present in remote sensing images at different scales. A novel semantic segmentation network, CAS-Net, is proposed, which integrates coordinate attention and SPD-Conv [38] layers for remote sensing images. CAS-Net adopts SPD-Conv to adjust the backbone network, reducing the loss of fine-grained information and improving the learning efficiency of feature information. In the feature extraction stage, a coordinate attention mechanism enables the model to capture direction-aware and position-sensitive information simultaneously, so as to locate small objects more accurately at multiple scales. In addition, the Dice coefficient is introduced into the cross-entropy loss function, enabling the model to directly maximize region overlap (the intersection over union) and alleviating the accuracy degradation caused by class imbalance.
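Combining the Dice coefficient with cross-entropy, as described above, can be sketched for the binary case as follows. This is a minimal NumPy illustration, not the exact CAS-Net loss: the 50/50 weighting and the smoothing constant `eps` are illustrative assumptions.

```python
import numpy as np

def dice_ce_loss(probs, targets, eps=1e-6, dice_weight=0.5):
    """Combined cross-entropy + soft-Dice loss for binary segmentation.

    probs   -- predicted foreground probabilities, shape (H, W)
    targets -- binary ground-truth mask, shape (H, W)
    dice_weight and eps are illustrative choices, not values from the paper.
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    # Pixel-wise binary cross-entropy.
    ce = -np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    # Soft Dice coefficient: 2|A ∩ B| / (|A| + |B|), a direct region-overlap metric.
    inter = np.sum(probs * targets)
    dice = (2.0 * inter + eps) / (np.sum(probs) + np.sum(targets) + eps)
    # Minimizing (1 - dice) maximizes region overlap, which is less sensitive
    # to class imbalance than cross-entropy alone.
    return (1 - dice_weight) * ce + dice_weight * (1.0 - dice)
```

Because the Dice term depends on the ratio of overlap to total mass rather than on a per-pixel average, a rare foreground class contributes to it as strongly as the dominant background, which is what mitigates the imbalance problem.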

2. Attention Mechanism

Evidence from human perceptual processes has demonstrated the importance of attention mechanisms [31], which employ high-level information to guide bottom-up feed-forward processes [39]. For the processing of remote sensing images, the joint use of channel attention and spatial attention mechanisms has been common in previous studies [40][41]. The channel attention mechanism [42] and spatial attention mechanism [43] may also be applied separately when processing hyperspectral images. Qi et al. [44] have combined a multiscale convolutional structure and an attention mechanism with the LinkNet network to obtain ATD-LinkNet, which can effectively exploit the spatial and semantic information in remote sensing images. The attention module incorporates features from different scales to effectively exploit the rich spatial and semantic information in remote sensing images, while the decoder part uses dense upsampling convolution to refine the nonlinear boundaries of objects. Li et al. [45] have proposed a dual-path attention network (DPA-Net) with a self-attention mechanism to enhance the model's ability to capture key local features in remotely sensed images, using a global attention module to extract pixel-level spatial information and a channel attention module to focus on different features in the image. The attention factor in the coordinate attention mechanism [37] decomposes channel attention to aggregate features along the two spatial directions. In this way, long-distance relationships are obtained while accurate location information is retained. The generated feature maps are then encoded into pairs of direction-aware and location-sensitive attention maps, which are applied to the input feature maps to enhance the representation of the objects of interest.
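The decomposition described above, pooling along the two spatial directions and turning the results into direction-aware attention maps, can be sketched in NumPy. This is a simplified, parameter-free illustration: the actual module of Hou et al. inserts a shared 1x1 convolution with normalization between the pooling and the sigmoid, which is omitted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x):
    """Simplified coordinate-attention sketch (no learned weights).

    x -- feature map of shape (C, H, W).
    Channel attention is decomposed into two 1D aggregations, one per
    spatial direction, so each attention weight retains the position
    along its axis.
    """
    # Aggregate features along each spatial direction.
    pool_h = x.mean(axis=2, keepdims=True)   # (C, H, 1): pooled over width
    pool_w = x.mean(axis=1, keepdims=True)   # (C, 1, W): pooled over height
    # Direction-aware, location-sensitive attention weights in (0, 1).
    # (The shared 1x1 conv + normalization of the real module is omitted.)
    a_h = sigmoid(pool_h)
    a_w = sigmoid(pool_w)
    # Broadcasting a_h (per-row) and a_w (per-column) re-weights every
    # position, encoding long-range context along both axes.
    return x * a_h * a_w
```

Unlike global average pooling, which collapses all positional information into a single scalar per channel, each weight here still knows *where* along its axis the response came from, which is what makes the mechanism position-sensitive.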

3. Small Objects

The segmentation of small objects in remote sensing images is generally a challenging task. Mnih et al. [46] have proposed a method for the automatic identification of small objects based on a restricted Boltzmann machine (RBM), which automatically extracts object locations such as roads, buildings, and trees from aerial images but requires various pre- and postprocessing steps. In particular, the basic features of the remote sensing images are extracted by preprocessing, and, because the road segmentation results are discontinuous, the final results are obtained by a postprocessing network operating on the output of the basic network. Kampffmeyer et al. [20] have compared pixel- and patch-based FCNs: the patch-based pixel classification uses 65 × 65 pixel blocks for dense segmentation, effectively improving the segmentation accuracy for small objects such as cars; meanwhile, because pixel-based segmentation is placed behind the convolution layers of the contracting path and features are directly upsampled back to the original image resolution, this method may lose some fine image information. Saito et al. [47] have proposed a new channel suppression method which takes the original pixel values of the aerial image as input and outputs the prediction using a three-channel labeled image softmax (CIS) function instead of the original softmax function; its advantage is that it requires no preprocessing and can directly recognize small objects such as roads and buildings. Sunkara et al. [38] have proposed SPD-Conv, which completely abandons the strided convolution and max pooling used in previous models and consists of a space-to-depth (SPD) layer followed by a non-strided convolution layer; it can be integrated into most CNN architectures.
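The space-to-depth layer at the heart of SPD-Conv can be sketched in NumPy: it downsamples spatially by rearranging pixels onto the channel dimension rather than discarding them, so the non-strided convolution that follows still sees every input value. A minimal sketch:

```python
import numpy as np

def space_to_depth(x, scale=2):
    """Space-to-depth (SPD) layer: trades spatial resolution for channels.

    x -- feature map of shape (C, H, W), with H and W divisible by `scale`.
    Returns a map of shape (C * scale**2, H // scale, W // scale).  In
    SPD-Conv this is followed by a non-strided convolution, so, unlike
    strided convolution or max pooling, no fine-grained information is lost.
    """
    c, h, w = x.shape
    assert h % scale == 0 and w % scale == 0, "spatial dims must divide scale"
    # Slice the map into scale x scale interleaved sub-grids and stack
    # them along the channel axis.
    subs = [x[:, i::scale, j::scale]
            for i in range(scale) for j in range(scale)]
    return np.concatenate(subs, axis=0)
```

For example, a (1, 4, 4) map becomes a (4, 2, 2) map containing exactly the same 16 values, which is why the operation is well suited to small objects whose few pixels would otherwise be averaged or discarded by strided downsampling.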