New Semantic Segmentation Method for Remote Sensing Images


Contributors: Zimeng Yang, Qiulan Wu, Feng Zhang, Xueshen Zhang, Xuefei Chen, Yue Gao

Semantic segmentation is an important task in the interpretation of remote sensing images. Such images are large, contain substantial spatial semantic information, and generally exhibit strong symmetry. As a result, they show large intraclass variance and small interclass variance, which leads to class imbalance and poor segmentation of small objects.

remote sensing
semantic segmentation
coordinate attention mechanism

1. Introduction

With the continuous development of computer vision technology, machine learning provides a variety of techniques and tools that can be applied to remote sensing data to identify and extract important symmetric features in remote sensing images ^[1]^[2]^[3]. However, unlike natural images, remote sensing images have a wide imaging range and complex, diverse backgrounds, and they contain more spectral channels and more complex image structures ^[4]. Class imbalance and the difficulty of segmenting small objects are the main factors that degrade semantic segmentation performance on remote sensing images ^[5].

Convolutional neural networks (CNNs) have been successfully applied to many semantic segmentation tasks ^[6]^[7]. The classical semantic segmentation models and their contributions are shown in Table 1. Great efforts have been made to apply deep learning methods to the segmentation of remote sensing data ^[8]^[9]. Compared with natural image datasets, remote sensing datasets have higher intraclass variance and lower interclass variance, which makes the labeling task difficult ^[10]. To deal with the special data structure of remotely sensed images, Geng et al. ^[11] extended the long short-term memory (LSTM) network ^[12] to extract contextual relationship information, with the LSTM learning potential spatial correlations. Mou et al. ^[13] and Tao et al. ^[14] designed a spatial relationship module and a spatial information inference structure, respectively, to build more effective models of contextual spatial relationships. Multiscale modules have also been improved to better capture long-range contextual and location-sensitive information among features in remote sensing images. Atrous spatial pyramid pooling (ASPP) ^[15] uses dilated convolution to enlarge the receptive field while keeping the number of parameters under control. However, when remote sensing images contain objects of very different sizes, the pyramid pooling module cannot capture small objects well ^[16]. To address the uneven distribution of data among labels in image segmentation, some researchers have taken symmetry into account in deep learning models and architectures ^[17]^[18]^[19]^[20]. Lv et al. proposed a method for detecting and tracking objects at night, inspired by symmetric neural networks, which enhances selected object features together with location and appearance information ^[21]. Park et al. proposed a symmetric graph convolutional autoencoder that produces a low-dimensional latent representation from a graph ^[22]. These approaches not only help balance the data distribution but also reduce the complexity of the model.
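
The remark that dilated convolution enlarges the receptive field while keeping the parameter count under control can be illustrated with a small calculation. The sketch below is not taken from the ASPP implementation; the function names and the dilation rates (1, 6, 12, 18) are illustrative assumptions.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent (along one axis) covered by a dilated kernel."""
    # A dilation of d inserts (d - 1) gaps between adjacent kernel taps.
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def weights_per_filter(kernel_size: int) -> int:
    """Number of weights in one 2D filter for one input channel.

    The dilation rate does not appear here: it changes where the
    taps are placed, not how many there are.
    """
    return kernel_size * kernel_size


if __name__ == "__main__":
    # Hypothetical set of parallel branches with different dilation rates.
    for rate in (1, 6, 12, 18):
        extent = effective_kernel_size(3, rate)
        print(f"rate={rate:2d}: covers {extent}x{extent} pixels "
              f"with {weights_per_filter(3)} weights")
```

Every branch uses the same nine weights per filter, yet the covered area grows with the dilation rate. Large rates also sample the input sparsely, which is consistent with the difficulty in capturing small objects noted above.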

Table 1. The classical semantic segmentation models and their contributions.

Model	Contribution	Backbone	Dataset	Mean Intersection over Union (mIoU, %)
FCN-8s ^[23]	Fully convolutional network without fully connected (fc) layers; accepts input images of any size	VGG16	VOC 2012	62.2
SegNet ^[24]	Reduces the number of parameters that must be trained for upsampling compared with FCN	VGG16	CamVid	60.1
PSPNet ^[25]	Provides effective global context priors for pixel-level scene parsing	ResNet	VOC 2012	82.6
DeepLab V3 ^[26]	Applies atrous convolution in the expansion module and improves the atrous spatial pyramid pooling module	ResNet	Cityscapes	81.3
DeepLab V3+ ^[26]	Uses an improved Xception backbone that computes faster without reducing accuracy	Xception	Cityscapes	82.1
ReSegNet ^[27]	Proposes a new residual encoder-decoder architecture to alleviate inadequate learning	VGG16	ISPRS Vaihingen	74.63
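
The last column of Table 1 reports mean intersection over union (mIoU). As a reminder of how this metric is usually computed, here is a minimal NumPy sketch. The function names and the toy label maps are assumptions for illustration. Individual benchmarks may differ in details such as how ignored labels or absent classes are handled.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Encode each (true, pred) pair as a single integer and count them.
    idx = y_true * num_classes + y_pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou(y_true, y_pred, num_classes):
    """Average of per-class IoU = TP / (TP + FP + FN)."""
    cm = confusion_matrix(y_true, y_pred, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    valid = union > 0  # skip classes absent from both maps
    return float((intersection[valid] / union[valid]).mean())


if __name__ == "__main__":
    # Toy 2x3 label maps with three classes (made-up values).
    truth = np.array([[0, 0, 1], [1, 2, 2]])
    pred = np.array([[0, 1, 1], [1, 2, 0]])
    print(mean_iou(truth, pred, num_classes=3))
```

Because each class contributes equally to the average, a rare class that is segmented poorly lowers mIoU as much as a frequent one, which is why the metric is sensitive to class imbalance.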

Moreover, attention mechanisms have been successfully applied to semantic segmentation over the past few years ^[28]^[29]. Introducing an attention mechanism into a semantic segmentation model allows the model to focus on meaningful image features ^[30]. In CNNs, channel attention ^[31] is usually applied after each convolution ^[32], while spatial attention is typically applied at the end of the network ^[33]. As a symmetric semantic segmentation model, U-Net can capture the context information of an image while locating segmentation boundaries accurately. In U-Net-based networks, channel attention is usually added to each layer of the upsampling path ^[34]. However, channel attention only considers interchannel information and ignores location information, which is crucial for recovering the structure of objects in remote sensing images ^[35]. To enhance the perception of informative channels and important regions, Woo et al. ^[36] proposed the convolutional block attention module (CBAM), which applies channel attention and spatial attention in sequence. However, convolution can only capture local relationships and ignores the relational information between distant objects. Therefore, Hou et al. ^[37] proposed a coordinate attention mechanism that embeds location information into channel attention and successfully applied it to the semantic segmentation of natural images.

The deep learning methods above, which target class imbalance and small objects in remote sensing images, do not fully utilize the spatial feature information and location-sensitive information available at different scales. To address this, a novel semantic segmentation network for remote sensing images, CAS-Net, is proposed, which integrates coordinate attention and SPD-Conv ^[38] layers. CAS-Net uses SPD-Conv to modify the backbone network, reducing the loss of fine-grained information and improving the efficiency of feature learning. In the feature extraction stage, a coordinate attention mechanism enables the model to capture direction-aware and position-sensitive information at the same time, so that small objects can be located more accurately at multiple scales. In addition, the Dice coefficient is introduced into the cross-entropy loss function so that training directly optimizes a region-overlap measure closely related to the intersection over union, reducing the loss of accuracy caused by class imbalance.
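
One common way to introduce the Dice coefficient into a cross-entropy loss is to add a soft Dice term to it. The NumPy sketch below illustrates that idea only. The entry does not state how CAS-Net combines the two terms, so the plain sum, the `dice_weight` and `smooth` parameters, and all function names are assumptions.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_entropy(probs, onehot, eps=1e-7):
    """Mean over pixels of -log p(true class)."""
    log_p = np.log(np.clip(probs, eps, 1.0))
    return float(-(onehot * log_p).sum(axis=0).mean())


def dice_loss(probs, onehot, smooth=1.0):
    """1 - soft Dice, computed per class and averaged over classes."""
    inter = (probs * onehot).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + onehot.sum(axis=(1, 2))
    dice = (2.0 * inter + smooth) / (denom + smooth)
    return float(1.0 - dice.mean())


def combined_loss(logits, labels, num_classes, dice_weight=1.0):
    """Cross-entropy plus a weighted Dice term (assumed combination).

    logits: (C, H, W) raw scores; labels: (H, W) integer class ids.
    """
    probs = softmax(logits, axis=0)
    # One-hot encode labels: (H, W) -> (H, W, C) -> (C, H, W).
    onehot = np.eye(num_classes)[labels].transpose(2, 0, 1)
    return cross_entropy(probs, onehot) + dice_weight * dice_loss(probs, onehot)


if __name__ == "__main__":
    labels = np.array([[0, 1], [1, 0]])
    confident = 20.0 * np.eye(2)[labels].transpose(2, 0, 1)
    print(combined_loss(confident, labels, num_classes=2))   # close to 0
    print(combined_loss(-confident, labels, num_classes=2))  # much larger
```

The Dice term is averaged over classes rather than over pixels, so a class that covers few pixels still contributes a full share of the loss. This is the usual motivation for pairing it with cross-entropy on imbalanced data.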

2. Attention Mechanism

Evidence from human perceptual processes has demonstrated the importance of attention mechanisms ^[31], which employ high-level information to guide bottom-up feed-forward processes ^[39]. For the processing of remote sensing images, the joint use of channel attention and spatial attention mechanisms has been common in previous studies ^[40]^[41]. The channel attention mechanism ^[42] and the spatial attention mechanism ^[43] may also be applied separately when processing hyperspectral images. Qi et al. ^[44] combined a multiscale convolutional structure and an attention mechanism with the LinkNet network to obtain ATD-LinkNet. Its attention module incorporates features from different scales to exploit the rich spatial and semantic information in remote sensing images, while the decoder uses dense upsampling convolution to refine the nonlinear boundaries of objects. Li et al. ^[45] proposed a dual-path attention network (DPA-Net) with a self-attention mechanism to enhance the model's ability to capture key local features in remotely sensed images, using a global attention module to extract pixel-level spatial information and a channel attention module to focus on different features in the image. The coordinate attention mechanism ^[37] decomposes channel attention into two one-dimensional feature-encoding processes that aggregate features along the two spatial directions. In this way, long-range dependencies are captured along one direction while accurate location information is retained along the other. The resulting feature maps are then encoded into a pair of direction-aware and position-sensitive attention maps, which are applied to the input feature map to enhance the representation of the objects of interest.
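
The coordinate attention steps described above (pool along each spatial direction, encode jointly, split into two attention maps, reweight the input) can be sketched in NumPy as follows. This is a simplified illustration, not the implementation from ^[37]. The 1 × 1 convolutions are written as plain matrix products, batch normalization is omitted, ReLU stands in for the original nonlinearity, and the weight names `w_reduce`, `w_h`, and `w_w` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def coordinate_attention(x, w_reduce, w_h, w_w):
    """Simplified coordinate attention for one feature map.

    x:        (C, H, W) input features
    w_reduce: (C_r, C) shared 1x1 conv that reduces channels
    w_h, w_w: (C, C_r) 1x1 convs producing the two attention maps
    """
    _, h, _ = x.shape
    # Direction-wise pooling: each result keeps position along one axis.
    pooled_h = x.mean(axis=2)  # (C, H), averaged over width
    pooled_w = x.mean(axis=1)  # (C, W), averaged over height
    # Encode both directions together with a shared transform.
    y = np.concatenate([pooled_h, pooled_w], axis=1)  # (C, H + W)
    y = np.maximum(w_reduce @ y, 0.0)                 # (C_r, H + W)
    # Split back into the two directions and build attention maps.
    a_h = sigmoid(w_h @ y[:, :h])  # (C, H)
    a_w = sigmoid(w_w @ y[:, h:])  # (C, W)
    # Reweight every position by its row and column attention.
    return x * a_h[:, :, None] * a_w[None, :, :].transpose(1, 0, 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 5, 7))
    out = coordinate_attention(
        x,
        w_reduce=rng.normal(size=(2, 8)),
        w_h=rng.normal(size=(8, 2)),
        w_w=rng.normal(size=(8, 2)),
    )
    print(out.shape)  # same shape as the input
```

Each output value is scaled by one weight that depends on its row and one that depends on its column, which is how position information enters what would otherwise be a purely channel-wise reweighting.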

3. Small Objects

The segmentation of small objects in remote sensing images is generally a challenging task. Mnih ^[46] proposed a method for the automatic identification of small objects based on a restricted Boltzmann machine (RBM), which automatically extracts objects such as roads, buildings, and trees from aerial images but requires various pre- and postprocessing steps. Specifically, preprocessing extracts basic features from the remote sensing images, and, because the road segmentation produced by the base network is discontinuous, a postprocessing network refines that output to obtain the final road segmentation. Kampffmeyer et al. ^[20] compared pixel-based and patch-based FCNs. The patch-based pixel classification uses 65 × 65 pixel blocks for dense segmentation, effectively improving the segmentation accuracy for small objects such as cars. In contrast, because the pixel-based segmentation upsamples features directly back to the original image resolution after the convolution layers of the contracting path, it may lose some fine image detail. Saito et al. ^[47] proposed a channel suppression method that takes the original pixel values of the aerial image as input and outputs a three-channel label image predicted with a channel-wise inhibited softmax (CIS) function instead of the standard softmax function. Its advantage is that it requires no preprocessing and can directly recognize small objects such as roads and buildings. Sunkara et al. ^[38] proposed SPD-Conv, which replaces the strided convolution and max pooling used in previous models with a space-to-depth (SPD) layer followed by a nonstrided convolution layer and can be integrated into most CNN architectures.
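
To make the SPD-Conv idea concrete, the following NumPy sketch downsamples a feature map with a space-to-depth rearrangement and then applies a nonstrided convolution. It is an illustration rather than the code of ^[38]. A 1 × 1 convolution is used for brevity, and the function names and the scale factor of 2 are assumptions.

```python
import numpy as np


def space_to_depth(x, scale=2):
    """Move each scale x scale spatial block into the channel axis.

    x: (C, H, W) -> (C * scale**2, H // scale, W // scale).
    No value is discarded, unlike strided convolution or pooling.
    """
    _, h, w = x.shape
    if h % scale or w % scale:
        raise ValueError("H and W must be divisible by scale")
    # Each slice keeps every scale-th pixel, starting at offset (i, j).
    blocks = [x[:, i::scale, j::scale]
              for i in range(scale) for j in range(scale)]
    return np.concatenate(blocks, axis=0)


def pointwise_conv(x, weight):
    """Nonstrided 1x1 convolution; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


if __name__ == "__main__":
    x = np.arange(16, dtype=float).reshape(1, 4, 4)
    y = space_to_depth(x, scale=2)
    print(y.shape)  # (4, 2, 2): resolution halved, channels x4
    out = pointwise_conv(y, np.ones((3, 4)))
    print(out.shape)  # (3, 2, 2)
```

The spatial resolution is halved as it would be with a stride-2 convolution, but every input pixel is still available to the following convolution, which is the property that helps preserve fine detail for small objects.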

This entry is adapted from the peer-reviewed paper 10.3390/sym15051037

References

Xu, Z.Y.; Zhang, W.C.; Zhang, T.X.; Yang, Z.F.; Li, J.Y. Efficient transformer for remote sensing image segmentation. Remote Sens. 2021, 13, 3585.
Zhou, X.K.; Xu, X.S.; Liang, W.; Zeng, Z.; Yan, Z. Deep-Learning-Enhanced Multitarget Detection for End-Edge-Cloud Surveillance in Smart IoT. IEEE Internet Things J. 2021, 8, 12588–12596.
Ali, I.; Rehman, A.U.; Khan, D.M.; Khan, Z.; Shafiq, M.; Choi, J.-G. Model Selection Using K-Means Clustering Algorithm for the Symmetrical Segmentation of Remote Sensing Datasets. Symmetry 2022, 14, 1149.
Li, J.Y.; Huang, X.; Gong, J.Y. Deep neural network for remote-sensing image interpretation: Status and perspectives. Natl. Sci. Rev. 2019, 6, 1082–1086.
Chen, X.L.; Zhu, G.B.; Liu, M.Q. Remote sensing image scene classification with self-supervised learning based on partially unlabeled datasets. Remote Sens. 2022, 14, 5838.
Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer vision and Pattern Recognition, Las Vegas, NV, USA, 25 June–2 July 2016; pp. 3213–3223.
Pan, J.S.; Wei, Z.Q.; Zhao, Y.H.; Zhou, Y. Enhanced FCN for farmland extraction from remote sensing image. Multimed. Tools Appl. 2022, 81, 38123–38150.
Liu, Y.; Gao, L.R.; Xiao, C.C.; Qu, Y.; Zheng, K. Hyperspectral image classification based on a shuffled group convolutional neural network with transfer learning. Remote Sens. 2020, 12, 1780.
Yuan, X.H.; Shi, J.F.; Gu, L.C. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 2021, 169, 114417.
Tuia, D.; Volpi, M.; Copa, L.; Kanevski, M.; Munoz-Mari, J. A survey of active learning algorithms for supervised remote sensing image classification. IEEE J. Sel. Top. Signal Process. 2011, 5, 606–617.
Geng, J.; Wang, H.Y.; Fan, J.C.; Ma, X.R. SAR image classification via deep recurrent encoding neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2255–2269.
Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
Mou, L.C.; Hua, Y.S.; Zhu, X.X. Spatial relational reasoning in networks for improving semantic segmentation of aerial images. In Proceedings of the IEEE Conference on International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 5232–5235.
Tao, C.; Qi, J.; Li, Y.S.; Wang, H.; Li, H.F. Spatial information inference net: Road extraction using road-specific contextual information. ISPRS J. Photogramm. Remote Sens. 2019, 158, 155–166.
Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
He, Q.; Dong, Z.; Chen, F. Pyramid: Enabling Hierarchical Neural Networks with Edge Computing. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 1860–1870.
Wang, Z.; Zhang, J.; Xia, S.; Shi, B.; Bai, X.; Zhang, L. Symmetry-enhanced deep learning for spatiotemporal prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Los Angeles CA, USA, 15–21 June 2019; pp. 5989–5998.
Ma, J.; Lu, D.; Li, Y.; Shi, G. CLHF-Net: A Channel-Level Hierarchical Feature Fusion Network for Remote Sensing Image Change Detection. Symmetry 2022, 14, 1138.
Qi, L.Y.; Yang, Y.H.; Zhou, X.K. Fast anomaly identification based on multiaspect data streams for intelligent intrusion detection toward secure industry 4.0. IEEE Trans. Ind. Inform. 2021, 18, 6503–6511.
Liang, W.; Hu, Y.; Zhou, X. Variational few-shot learning for microservice-oriented intrusion detection in distributed industrial IoT. IEEE Trans. Ind. Inform. 2021, 18, 5087–5095.
Lv, Y.; Feng, W.; Wang, S.; Dauphin, G.; Zhang, Y.; Xing, M. Spectral-Spatial Feature Enhancement Algorithm for Nighttime Object Detection and Tracking. Symmetry 2023, 15, 546.
Park, J.; Lee, M.; Chang, H.J.; Lee, K.; Choi, J.Y. Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Los Angeles CA, USA, 15–21 June 2019; pp. 6519–6528.
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 3431–3440.
Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
Zhao, H.; Shi, J.; Qi, X. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
Chen, L.C.; Zhu, Y.; Papandreou, G. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 801–818.
Sun, Y.; Tian, Y.; Xu, Y. Problems of encoder-decoder frameworks for high-resolution remote sensing image segmentation: Structural stereotype and insufficient learning. Neurocomputing 2019, 330, 297–304.
Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368.
Chen, L.; Zhang, H.W.; Xiao, J.; Nie, L.Q. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667.
Zhao, Q.; Liu, J.H.; Li, Y.W.; Zhang, H. Semantic segmentation with attention mechanism for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5403913.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Warsaw, Poland, 17–19 September 2018; pp. 7132–7141.
Tong, W.; Chen, W.T.; Han, W. Channel-attention-based DenseNet network for remote sensing image scene classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4121–4132.
Zhu, M.H.; Jiao, L.C.; Liu, F.; Yang, S.Y. Residual spectral–spatial attention network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 449–462.
Ren, Y.; Li, X.; Yang, X. Development of a dual-attention U-Net model for sea ice and open water classification on SAR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 4010205.
Wang, H.; Zhu, Y.; Green, B. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 108–126.
Woo, S.; Park, J.; Lee, J.Y. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
Hou, Q.B.; Zhou, D.Q.; Feng, J.S. Coordinate attention for efficient mobile network design. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Beijing, China, 29 October–21 November 2021; pp. 13713–13722.
Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. arXiv 2022, arXiv:2208.03641.
Mnih, V.; Heess, N.; Graves, A. Recurrent models of visual attention. In Advances in Neural Information Processing Systems 27, Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, Canada, 8–13 December 2014; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2014.
Zhang, S.Y.; Li, C.R.; Qiu, S. EMMCNN: An ETPS-based multi-scale and multi-feature method using CNN for high spatial resolution image land-cover classification. Remote Sens. 2019, 12, 66.
Gao, H.; Cao, L.; Yu, D.F. Semantic segmentation of marine remote sensing based on a cross direction attention mechanism. IEEE Access 2020, 8, 142483–142494.
Zheng, J.W.; Feng, Y.C.; Bai, C.; Zhang, J.L. Hyperspectral image classification using mixed convolutions and covariance pooling. IEEE Trans. Geosci. Remote Sens. 2021, 59, 522–534.
Zhou, M.; Zou, Z.; Shi, Z.; Zeng, W.J.; Gui, J. Local Attention networks for occluded airplane detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 17, 381–385.
Qi, X.; Li, K.; Liu, P.; Zhou, X.; Sun, M. Deep attention and multi-scale networks for accurate remote sensing image segmentation. IEEE Access 2020, 8, 146627–146639.
Li, J.; Xiu, J.; Yang, Z. Dual path attention net for remote sensing semantic image segmentation. ISPRS Int. J. Geo-Inf. 2020, 9, 571.
Mnih, V. Machine Learning for Aerial Image Labeling. Ph.D. Thesis, University of Toronto, Toronto, Canada, 2013.
Saito, S.; Yamashita, T.; Aoki, Y. Multiple object extraction from aerial imagery with convolutional neural networks. Electron. Imaging 2016, 10, 1–9.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license.