By integrating attention modules, a semantic segmentation network for image interpretation can better represent features, suppress noise, and build contextual information, thereby increasing its overall segmentation accuracy. As an example, in ENet
[33], the SE module is added to the upsampling stage of the network to generate a weight for each channel, refining the segmentation accuracy on remote sensing images. In SE-UNet [34], the convolution block of the standard UNet is altered from two 3 × 3 convolutions per layer to one convolution plus one SE module, strengthening the representation of the feature maps and thereby enhancing the capacity of UNet to extract roads from satellite and aerial imagery.
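For concreteness, the channel-reweighting idea shared by these SE-based designs can be sketched as follows. This is a minimal PyTorch sketch of a generic squeeze-and-excitation block (global average pooling, a bottleneck MLP, and a sigmoid gate), not the exact block used in [33] or [34]; the reduction ratio r = 16 is an assumed default.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block: one learned weight per channel."""
    def __init__(self, channels: int, r: int = 16):  # reduction ratio r is an assumed default
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # squeeze: global average pooling
        self.fc = nn.Sequential(                      # excitation: bottleneck MLP
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                             # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                  # reweight each channel of the input

# Example: reweight a decoder feature map of shape (N, 64, H, W)
feat = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```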
An efficient channel attention (ECA) module is proposed in [35] and integrated into the UNet encoder, which improves the encoder's feature extraction and optimizes the segmentation. In RSIDNet [36], denoising of remote sensing images is made more effective by adding an ECA module to the shortcut block connecting the shallow layers with the deep layers; this augments the feature representation of the shallow feature maps, reduces the noise they carry, and enhances the segmentation accuracy.
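The ECA designs in [35,36] replace the SE bottleneck with a lightweight cross-channel interaction; a minimal sketch of that general idea is shown below. It is not the exact module of either paper, and the fixed kernel size k = 3 is an assumption here, whereas ECA ordinarily derives the kernel size adaptively from the channel count.

```python
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """Minimal efficient channel attention: 1D conv over the pooled channel descriptor."""
    def __init__(self, k: int = 3):  # kernel size k is assumed fixed here
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = self.pool(x).view(b, 1, c)               # (B, 1, C): treat channels as a 1D sequence
        w = self.gate(self.conv(y)).view(b, c, 1, 1) # local cross-channel interaction, no dimension reduction
        return x * w

# Example: attach to a shortcut feature map, e.g. an encoder-decoder skip connection
skip = torch.randn(2, 64, 32, 32)
print(ECABlock()(skip).shape)  # torch.Size([2, 64, 32, 32])
```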
In SCAttNet [25], the CBAM module is employed to integrate channel and spatial attention: the network first adopts ResNet to extract features, strengthening its segmentation capability for high-resolution remote sensing imagery, and then feeds them into the CBAM module to construct local contextual information and optimize the learned feature-map weights at the channel and pixel levels. RAANet [37] constructs a new residual ASPP module by embedding a CBAM module and a residual structure into it, improving the accuracy of the semantic segmentation network.
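A compact sketch of the channel-plus-spatial scheme that SCAttNet [25] and RAANet [37] build on is given below. It follows the commonly used CBAM formulation (channel attention from average- and max-pooled descriptors through a shared MLP, then spatial attention from a convolution over channel-wise average and max maps) rather than either network's exact configuration; the reduction ratio and the 7 × 7 spatial kernel are assumed defaults.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal CBAM: channel attention followed by spatial attention."""
    def __init__(self, channels: int, r: int = 16, spatial_kernel: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(                    # shared MLP for avg- and max-pooled descriptors
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: combine global average- and max-pooled statistics.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * self.gate(avg + mx).view(b, c, 1, 1)
        # Spatial attention: per-pixel weight from channel-wise mean and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.gate(self.spatial(s))

# Example: refine backbone features before the segmentation head
feat = torch.randn(2, 256, 32, 32)
print(CBAM(256)(feat).shape)  # torch.Size([2, 256, 32, 32])
```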
In [38], an SCA attention module covering both spatial and channel information is designed by placing the spatial attention module of CBAM in parallel with a coordinate attention module that encodes channel and positional information, enhancing the detection performance of the lightweight model on remote sensing images. To improve the ability of convolutional neural networks to represent the potential relationships between different objects and their surrounding features, MQANet [39] introduces position attention, channel attention, label attention, and edge attention modules into the model to expand the receptive field of the network and to introduce background information from labels, thereby obtaining global features. Furthermore, the self-attention mechanism has also been used for remote sensing image semantic segmentation. As an illustration, a region-attention (RSA) module is constructed using the self-attention mechanism in RSANet
[40]. The module first creates several soft object regions for each category present in the image and then computes a descriptor for each region. It next evaluates the similarity between every pixel of the feature maps and all the region descriptors; these similarities, computed in parallel, are treated as weights for the initial feature maps.
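The general pattern behind such region attention (soft regions, region descriptors, and pixel-to-region similarities used as attention weights) can be sketched as follows. This is a simplified illustration rather than RSANet's exact RSA module; the number of soft regions is an assumed parameter, and in this sketch the similarities are used to aggregate region context that is added back to the input features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttention(nn.Module):
    """Sketch of region attention: soft regions -> region descriptors -> pixel-region similarity."""
    def __init__(self, channels: int, num_regions: int):
        super().__init__()
        self.region_proj = nn.Conv2d(channels, num_regions, kernel_size=1)  # soft region map per category

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        feats = x.flatten(2)                                   # (B, C, HW)
        regions = F.softmax(self.region_proj(x).flatten(2), dim=-1)   # (B, K, HW): soft spatial distributions
        descriptors = torch.bmm(regions, feats.transpose(1, 2))       # (B, K, C) region descriptors
        # Similarity of every pixel to every region descriptor, used as attention weights.
        sim = torch.bmm(feats.transpose(1, 2), descriptors.transpose(1, 2))  # (B, HW, K)
        sim = F.softmax(sim / c ** 0.5, dim=-1)
        context = torch.bmm(sim, descriptors).transpose(1, 2).view(b, c, h, w)
        return x + context                                     # augment the initial features with region context

# Example: 6 soft regions for 6 land-cover categories (the number of regions is assumed)
feat = torch.randn(2, 128, 32, 32)
print(RegionAttention(128, num_regions=6)(feat).shape)  # torch.Size([2, 128, 32, 32])
```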
In [41], a self-attention module covering both channels and spatial positions is constructed: the Query and Key matrices of the feature map are multiplied to generate a weight map over all spatial locations and channel relationships, thereby capturing global information. Such a network may achieve more accurate segmentation results, but its complexity and high hardware resource requirements make it uneconomical for practical deployment.
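To make that trade-off concrete, below is a minimal sketch of the Query-Key style of spatial self-attention described here (a channel-attention branch follows the same pattern with the matrices transposed); it illustrates the mechanism rather than the exact module of [41], and the reduced projection width is an assumption. The HW × HW affinity matrix over all spatial positions is what drives the quadratic memory and computation cost noted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSelfAttention(nn.Module):
    """Minimal position (spatial) self-attention: weights from Query x Key over all locations."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 8          # reduced projection width is an assumption
        self.query = nn.Conv2d(channels, reduced, 1)
        self.key = nn.Conv2d(channels, reduced, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable residual scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(x).flatten(2)                     # (B, C', HW)
        v = self.value(x).flatten(2)                   # (B, C, HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)      # (B, HW, HW): weight map over all positions
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.gamma * out                    # global context added to the input features

# The (HW x HW) attention matrix is the source of the cost noted above:
# for a 128 x 128 feature map it already holds 16384 x 16384 (~268M) entries per image.
feat = torch.randn(1, 64, 32, 32)
print(SpatialSelfAttention(64)(feat).shape)  # torch.Size([1, 64, 32, 32])
```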