Adaptive Local Cross-Channel Interaction: History

Adding an attention module to a deep convolutional semantic segmentation network can significantly enhance network performance. However, existing channel attention modules focus on the channel dimension and neglect spatial relationships, allowing location noise to propagate to the decoder.

  • adaptive local cross-channel interaction
  • vector average pooling
  • attention mechanism

1. Introduction

By analyzing an image at the pixel level, semantic segmentation provides more nuanced identification than object detection and image classification, allowing for the output of complete scene information. Urban planning, land resource management, marine monitoring, and transportation assessment benefit significantly from semantic segmentation processing of remotely sensed imagery [1][2]. However, remote sensing images present unique processing challenges due to the abundance of feature information such as shape, location, and texture, as well as the high intra-class variance and high inter-class similarity exhibited by ground objects in the images [3].
Conventional semantic segmentation approaches emphasize manual feature extraction [4]. Feature vectors are obtained from hand-crafted rules tailored to specific application scenarios; once the scenario changes, these extracted feature vectors are difficult to reuse, and repeatedly extracting features is laborious and time-consuming. In addition, the hand-crafted rules of traditional semantic segmentation depend on complex mathematical models that, unlike current methods, are not data-driven. The comprehensibility and generalizability of conventional semantic segmentation approaches are therefore limited. More recently, deep learning-based semantic segmentation techniques have demonstrated significant potential. For instance, FCN [5] implements a fully convolutional semantic segmentation network, which serves as the baseline for today's popular semantic segmentation approaches. U-Net [6] introduces skip connections between the shallow and deep layers to effectively reconstruct low-level spatial information for high-level semantic objects, addressing the issue of inaccurate object edge segmentation. The encoder–decoder architecture encourages researchers to concentrate on better representing pixel features in the encoder to boost network performance. ResNet [7] is one such work, expanding the network depth to extract more advanced abstract features. Atrous or dilated convolutional networks, such as those developed by the DeepLab community [8][9][10][11], accomplish multi-scale tasks by enlarging the receptive field. HRNet [12][13] maintains dense multi-layer interaction between the shallow and deep feature maps. Similarly, U-Net++ [14] enhances segmentation accuracy by substituting dense connections for the regular skip connections between the encoder and the decoder. These efforts have led to growth in the use of deep learning for semantic segmentation. Researchers then began to optimize the performance of baseline semantic segmentation networks by introducing attention mechanisms, enabling the networks to capitalize on critical feature information and eliminate redundancy in feature maps or pixels to reinforce the feature representation.
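To make the skip-connection idea behind U-Net [6] concrete, the following is a minimal PyTorch-style sketch; the module layout and layer sizes are illustrative assumptions, not code from any cited paper.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Illustrative U-Net-style decoder stage: upsample the deep feature map,
    then concatenate the shallow (skip) feature map to recover spatial detail."""
    def __init__(self, deep_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(deep_ch, deep_ch // 2, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(deep_ch // 2 + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(deep)                # restore spatial resolution
        x = torch.cat([x, skip], dim=1)  # skip connection: reuse shallow detail
        return self.conv(x)

# Fuse a 1/16-resolution deep map with a 1/8-resolution shallow map.
deep = torch.randn(1, 256, 16, 16)
skip = torch.randn(1, 128, 32, 32)
print(DecoderBlock(256, 128, 128)(deep, skip).shape)  # torch.Size([1, 128, 32, 32])
```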
The effectiveness of the attention mechanism has been demonstrated for a variety of tasks [15][16], including object detection [17][18][19] and image classification [20][21][22]. Based on the principle of the attention mechanism, researchers in the field of semantic segmentation have developed several attention modules, such as channel and spatial attention modules. These modules are often incorporated into the semantic segmentation architecture to aid in extracting significant features in certain channels and pixels, thereby raising segmentation accuracy [23]. Generally, the attention modules mentioned above are built separately and capture features only along the channel or spatial dimension. The CBAM [24] attention module was the first to combine channel and spatial attention in a tandem mode, significantly improving segmentation accuracy [25]. However, a module constructed with this tandem integration may allow errors to propagate from the channel attention to the spatial attention stage, limiting further improvement in semantic segmentation performance. Researchers have also designed attention modules focusing on the relationships of channels and pixels based on the self-attention mechanism, which represents the core information as a weighted sum over each channel-spatial dimension [26]. In other words, a network can use a self-attention mechanism to raise the overall accuracy of semantic segmentation by establishing long-range contextual relationships [27]. Nevertheless, the complicated structure of a self-attention module remains challenging in terms of training cost and execution efficiency, constraining its use in large-scale remote sensing applications [28].
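The tandem channel-then-spatial design of CBAM [24] can be sketched as follows. This is a simplified PyTorch rendering based on the paper's published description (shared MLP with reduction ratio 16, 7 × 7 spatial convolution), not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to both the average- and max-pooled channel vectors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel pooling: stack channel-wise average and max maps, then convolve.
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Tandem mode: channel attention first, then spatial attention; any
        # error in the channel weights is inherited by the spatial stage.
        return self.sa(self.ca(x))
```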

2. Attention Mechanism

The attention mechanism can enhance the capacity of deep CNNs to obtain more discriminative features. By creating a global dependency between channels to determine the corresponding weights, the channel attention module SE [29] was the first to use the attention mechanism in the channel dimension to improve the network's representation of significant features; however, spatial context was not taken into account. The ECA [30] module further enhanced channel attention by establishing a local cross-channel dependency to learn the critical weights. To better capture spatial information and generate spatial attention maps with a wider receptive field, the spatial attention module in CBAM performs channel pooling and dimension reduction on the feature map. Because of its reliance on convolution, however, the spatial attention module can only capture local positional dependencies and cannot establish long-range dependencies. In addition, the pixel-relational self-attention mechanism represented by the Transformer [31] has become the new state of the art (SOTA) in computer vision and is widely recognized and applied. In DANet [26], the feature map generates Query, Key, and Value matrices via three convolutions; these matrices are then employed to calculate weights for each local and global location to build contextual information. OCRNet [32] creates a description region for each category in advance and constructs global contextual information by calculating the similarity of each pixel to the description region of the respective category. The self-attention mechanism effectively establishes a global connection but requires excessive computation, affecting network inference efficiency. Specifically, computing the weight map from the Query and Key matrices of the feature map imposes O(N²) time and space complexity on the self-attention module (N = H × W × C, where H, W, and C denote the height, width, and number of channels of the feature map, respectively), which places a large burden on the semantic segmentation network when processing large remote sensing images.
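To make the contrast concrete, below is a minimal PyTorch sketch of the two channel-attention designs discussed above: SE [29] learns weights through a global bottleneck MLP, while ECA [30] uses a single 1D convolution so that each channel's weight depends only on a few neighboring channels (local cross-channel interaction). Layer sizes and the fixed kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """SE-style channel attention: global cross-channel dependency via an MLP."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))    # squeeze: global average pooling
        return x * w.view(b, c, 1, 1)      # excite: per-channel reweighting

class ECABlock(nn.Module):
    """ECA-style channel attention: local cross-channel interaction via 1D conv."""
    def __init__(self, k: int = 3):
        super().__init__()
        # Each channel's weight depends only on its k neighbouring channels.
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3)).view(b, 1, c)   # pooled channel vector, (B, 1, C)
        return x * torch.sigmoid(self.conv(y)).view(b, c, 1, 1)

x = torch.randn(1, 64, 32, 32)
print(SEBlock(64)(x).shape, ECABlock()(x).shape)  # both torch.Size([1, 64, 32, 32])
```

By contrast, a self-attention module must materialize a full pairwise similarity map between the Query and Key matrices, which is the source of the O(N²) cost noted above.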

3. Attention in Semantic Segmentation for Remote Sensing

By integrating attention modules, a semantic segmentation network for image interpretation can better represent features, reduce noise, and build contextual information to increase overall segmentation accuracy. For example, in ENet [33], the SE module is added to the upsampling stage of the network to generate a weight for each channel, refining the segmentation accuracy of remote sensing images. In SE-UNet [34], each convolution block of the standard UNet, originally two 3 × 3 convolutions per layer, is replaced with one convolution plus one SE module to strengthen the representation of the feature maps, enhancing UNet's capacity to extract roads from satellite and aerial imagery. An efficient channel attention (ECA) module is proposed and integrated into the UNet encoder in [35], which optimizes segmentation and raises the encoder's feature extraction performance. RSIDNet [36], designed for remote sensing image denoising, adds an ECA module to the shortcut block connecting the shallow and deep layers; this augments the feature representation of the shallow feature maps, reducing the noise introduced by the layers and enhancing segmentation accuracy. SCAttNet [25] employs the CBAM module to integrate channel and spatial attention: the network first adopts ResNet to extract features, strengthening its segmentation capability for high-resolution remote sensing imagery, and then feeds them into the CBAM module to construct local contextual information and optimize the learned feature map weights at the channel and pixel levels. RAANet [37] constructs a new residual ASPP module by embedding a CBAM module and a residual structure to improve the accuracy of the semantic segmentation network. In [38], an SCA attention module covering space and channels is designed by placing the spatial attention module of CBAM in parallel with a coordinate attention module that relates channels and spatial positions, enhancing the detection of remote sensing images by a lightweight model. To improve the ability of convolutional neural networks to represent potential relationships between different objects and surrounding features, MQANet [39] introduces position, channel, label, and edge attention modules into the model, expanding the network's perceptual field and introducing the background information in labels to obtain global features. Furthermore, the self-attention mechanism has been used for remote sensing image semantic segmentation. As an illustration, a region-attention RSA module is constructed using the self-attention mechanism in RSANet [40]: the module first creates several soft object regions for each category distributed in the image, followed by region descriptors, and then evaluates the similarity between pixels of the feature maps and all region descriptors; these similarities, computed in parallel, are treated as weights for the initial feature maps. In [41], a self-attention module concerning channels and spatial positions is constructed to generate a weight map of all spatial locations and channel relationships by multiplying the Query and Key matrices of the feature map to obtain global information. Such networks may achieve more accurate segmentation results, but their complexity and high hardware resource requirements make them uneconomical for actual deployment. The common integration pattern running through these works is sketched below.
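The sketch below shows the recurring pattern: an attention block is dropped into an existing encoder or decoder stage so that each stage reweights its feature maps before passing them on. This is a hypothetical PyTorch example in the spirit of SE-UNet [34], not code from any cited paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Compact SE-style channel attention (as sketched in Section 2)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        return x * self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)

class AttentiveConvBlock(nn.Module):
    """One 3x3 convolution followed by channel attention, standing in for the
    two plain 3x3 convolutions of a standard U-Net stage."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.attn = SEBlock(out_ch)  # could equally be an ECA or CBAM block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.attn(self.conv(x))

x = torch.randn(1, 64, 128, 128)
print(AttentiveConvBlock(64, 128)(x).shape)  # torch.Size([1, 128, 128, 128])
```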

This entry is adapted from the peer-reviewed paper 10.3390/rs15081980

References

  1. Anilkumar, P.; Venugopal, P. Research Contribution and Comprehensive Review towards the Semantic Segmentation of Aerial Images Using Deep Learning Techniques. Secur. Commun. Netw. 2022, 2022, 6010912.
  2. Wang, J.J.; Ma, A.L.; Zhong, Y.F.; Zheng, Z.; Zhang, L.P. Cross-sensor domain adaptation for high spatial resolution urban land-cover mapping: From airborne to spaceborne imagery. Remote Sens. Environ. 2022, 277, 113058.
  3. Zheng, Z.; Zhong, Y.F.; Wang, J.J.; Ma, A.L. Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 4095–4104.
  4. Huang, X.; Zhang, L.P.; Gong, W. Information fusion of aerial images and LIDAR data in urban areas: Vector-stacking, re-classification and post-processing approaches. Int. J. Remote Sens. 2011, 32, 69–84.
  5. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
  6. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241.
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  8. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062.
  9. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
  10. Chen, L.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
  11. Chen, L.C.; Zhu, Y.K.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851.
  12. Sun, K.; Xiao, B.; Liu, D.; Wang, J.; Soc, I.C. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 5686–5696.
  13. Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-resolution representations for labeling pixels and regions. arXiv 2019, arXiv:1904.04514.
  14. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; pp. 3–11.
  15. Tsotsos, J.K. Analyzing vision at the complexity level. Behav. Brain Sci. 1991, 14, 768.
  16. Vikram, T.N. A Computational Perspective on Visual Attention. Cognit. Syst. Res. 2012, 19–20, 88–90.
  17. Li, W.; Liu, K.; Zhang, L.Z.; Cheng, F. Object detection based on an adaptive attention mechanism. Sci. Rep. 2020, 10, 11307.
  18. Tian, Z.; Zhan, R.; Hu, J.; Wang, W.; He, Z.; Zhuang, Z. Generating Anchor Boxes Based on Attention Mechanism for Object Detection in Remote Sensing Images. Remote Sens. 2020, 12, 2416.
  19. Chen, Z.; Tian, S.; Yu, L.; Zhang, L.; Zhang, X. An object detection network based on YOLOv4 and improved spatial attention mechanism. J. Intell. Fuzzy Syst. 2022, 42, 2359–2368.
  20. Zhang, M.; Su, H.; Wen, J. Classification of flower image based on attention mechanism and multi-loss attention network. Comput. Commun. 2021, 179, 307–317.
  21. Cao, P.; Xie, F.; Zhang, S.; Zhang, Z.; Zhang, J. MSANet: Multi-scale attention networks for image classification. Multimed. Tools Appl. 2022, 81, 34325–34344.
  22. Roy, S.K.; Dubey, S.R.; Chatterjee, S.; Baran Chaudhuri, B. FuSENet: Fused squeeze-and-excitation network for spectral-spatial hyperspectral image classification. Iet Image Process. 2020, 14, 1653–1661.
  23. Guo, M.; Xu, T.; Liu, J.; Liu, Z.; Jiang, P.; Mu, T.; Zhang, S.; Martin, R.R.; Cheng, M.; Hu, S. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368.
  24. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  25. Li, H.; Qiu, K.; Chen, L.; Mei, X.; Hong, L.; Tao, C. SCAttNet: Semantic Segmentation Network With Spatial and Channel Attention Mechanism for High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 905–909.
  26. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H.; Soc, I.C. Dual Attention Network for Scene Segmentation. In Proceedings of the 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3141–3149.
  27. Jin, Z.; Liu, B.; Chu, Q.; Yu, N. ISNet: Integrate Image-Level and Semantic-Level Context for Semantic Segmentation. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 7169–7178.
  28. Liu, S.; Cheng, J.; Liang, L.; Bai, H.; Dang, W. Light-Weight Semantic Segmentation Network for UAV Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 8287–8296.
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  30. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 11534–11542.
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  32. Yuan, Y.; Chen, X.; Wang, J. Object-contextual representations for semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 173–190.
  33. Wang, Y. Remote Sensing Image Semantic Segmentation Algorithm Based on Improved ENet Network. Sci. Program. 2021, 2021, 5078731.
  34. Sofla, R.A.D.; Alipour-Fard, T.; Arefi, H. Road extraction from satellite and aerial image using SE-Unet. J. Appl. Remote Sens. 2021, 15, 014512.
  35. Han, G.; Zhang, M.; Wu, W.; He, M.; Liu, K.; Qin, L.; Liu, X. Improved U-Net based insulator image segmentation method based on attention mechanism. Energy Rep. 2021, 7, 210–217.
  36. Han, L.; Zhao, Y.; Lv, H.; Zhang, Y.; Liu, H.; Bi, G. Remote Sensing Image Denoising Based on Deep and Shallow Feature Fusion and Attention Mechanism. Remote Sens. 2022, 14, 1243.
  37. Liu, R.R.; Tao, F.; Liu, X.T.; Na, J.M.; Leng, H.J.; Wu, J.J.; Zhou, T. RAANet: A Residual ASPP with Attention Framework for Semantic Segmentation of High-Resolution Remote Sensing Images. Remote Sens. 2022, 14, 3109.
  38. Wang, M.Y.; Wang, J.T.; Liu, C.; Li, F.Y.; Wang, Z.Y. Spatial-Coordinate Attention and Multi-Path Residual Block Based Oriented Object Detection in Remote Sensing Images. Int. J. Remote Sens. 2022, 43, 5757–5774.
  39. Li, Y.; Si, Y.; Tong, Z.; He, L.; Zhang, J.; Luo, S.; Gong, Y. MQANet: Multi-Task Quadruple Attention Network of Multi-Object Semantic Segmentation from Remote Sensing Images. Remote Sens. 2022, 14, 6256.
  40. Zhao, D.; Wang, C.; Gao, Y.; Shi, Z.; Xie, F. Semantic Segmentation of Remote Sensing Image Based on Regional Self-Attention Mechanism. IEEE Geosci. Remote Sens. Lett. 2022, 19.
  41. Zhang, Y.J.; Cheng, J.; Bai, H.W.; Wang, Q.; Liang, X.Y. Multilevel Feature Fusion and Attention Network for High-Resolution Remote Sensing Image Semantic Labeling. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6512305.