Adaptive Local Cross-Channel Interaction: Comparison
Please note this is a comparison between Version 1 by Menglei Kang and Version 2 by Rita Xu.

Adding an attention module to a deep convolutional semantic segmentation network has significantly enhanced network performance. However, existing channel attention modules focus on the channel dimension and neglect spatial relationships, which lets location noise propagate to the decoder.

  • adaptive local cross-channel interaction
  • vector average pooling
  • attention mechanism

1. Introduction

By analyzing an image at the pixel level, semantic segmentation provides more nuanced identification than object detection and image classification, allowing for the output of complete scene information. Urban planning, land resource management, marine monitoring, and transportation assessment benefit significantly from semantic segmentation of remotely sensed imagery [1][2]. However, remote sensing images present unique processing challenges due to the abundance of feature information such as shape, location, and texture, as well as the high intra-class variance and high inter-class similarity exhibited by ground objects in the images [3].
Conventional semantic segmentation approaches emphasize manual feature extraction [4]. The feature vectors are obtained from hand-crafted rules tailored to particular application scenarios. Once the scenario changes, the extracted feature vectors are difficult to reuse, and extracting features again is laborious and time-consuming. In addition, the hand-crafted rules of traditional semantic segmentation depend on complex mathematical models that, unlike current methods, are not data-driven. Therefore, the comprehensibility and generalizability of conventional semantic segmentation approaches are limited. More recently, deep learning-based semantic segmentation techniques have demonstrated significant potential.

For instance, FCN [5] implements a fully convolutional semantic segmentation network, which serves as the baseline for the currently popular semantic segmentation approaches. U-Net [6] introduces skip connections between the shallow and deep layers to reconstruct the low-level spatial information of high-level semantic objects, which addresses inaccurate segmentation of object edges. The encoder-decoder architecture has encouraged researchers to concentrate on better ways to represent pixel features in the encoder in order to boost network performance. ResNet [7] is one such work; it expands the network depth to extract more advanced abstract features. Atrous (dilated) convolutional networks such as those developed by the DeepLab community [8][9][10][11] handle multi-scale tasks by enlarging the receptive field. HRNet [12][13] maintains dense multi-layer interaction between the shallow and deep feature maps. Similarly, U-Net++ [14] improves segmentation accuracy by substituting dense connections for the regular skip connections between the encoder and the decoder. These efforts have driven the growing use of deep learning for semantic segmentation. Researchers then began to optimize baseline semantic segmentation networks by introducing an attention mechanism, which lets the networks capitalize on critical feature information and suppress redundancy in feature maps or pixels to reinforce the feature representation.
The effectiveness of the attention mechanism has been demonstrated for a variety of tasks [15][16], including object detection [17][18][19] and image classification [20][21][22]. Based on the principle of the attention mechanism, researchers in the field of semantic segmentation have developed several attention modules, such as the channel and spatial attention modules. These modules are often incorporated into the semantic segmentation architecture to help extract significant features in certain channels and pixels, thereby raising the segmentation accuracy [23]. Generally, these attention modules are built separately and only capture features along either the channel or the spatial dimension. The CBAM [24] attention module was the first to combine channel and spatial attention in a tandem mode, significantly improving segmentation accuracy [25]. However, tandem integration may let errors propagate from the channel attention side to the spatial attention side, which limits further improvement in semantic segmentation performance.
Researchers have also designed attention modules that focus on the relationships among channels and pixels based on the self-attention mechanism, which represents the core information as a weighted sum over each channel-spatial dimension [26]. In other words, a network can use self-attention to establish long-range contextual relationships and thereby raise its overall segmentation accuracy [27]. Nevertheless, the complicated structure of a self-attention module remains challenging in terms of training cost and execution efficiency, which makes it hard to support large-scale remote sensing applications [28].

2. Attention Mechanism

The attention mechanism can enhance the capacity of deep CNNs to obtain more discriminative features. By creating a global dependency between channels to determine the corresponding weights, the channel attention module SE [29] was the first to improve the network's representation of significant features using the attention mechanism in the channel dimension. Yet, the spatial context was not taken into account. The effectiveness of channel attention was further enhanced by the ECA [30] module, which establishes a dependency among local channels to learn the critical weights. To better capture spatial information and produce spatial weights over a wider convolutional field, the spatial attention module in CBAM performs channel pooling and dimension reduction on the feature map. Because it relies on convolution, the spatial attention module has certain limitations: it can only capture local positional dependencies and cannot establish long-range dependencies. In addition, the pixel-relational self-attention mechanism represented by Transformer [31] has become a new SOTA in the computer vision field and is widely recognized and applied. In DANet [26], the feature map generates Query, Key, and Value matrices via three convolutions. These matrices are then employed to calculate weights for each local and global location to build contextual information.
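The local cross-channel interaction attributed to ECA above can be illustrated with a minimal sketch. It assumes the commonly described form of the module (global average pooling, a 1-D convolution across the channel axis with zero padding, then a sigmoid gate); the function names, the toy feature map, and the kernel values are hypothetical and not taken from any cited implementation. Plain nested lists are used to keep the example dependency-free; a real network would use a tensor library and a learned kernel.

```python
import math


def sigmoid(x):
    """Logistic gate that maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def global_average_pool(feature_map):
    """Squeeze a C x H x W feature map (nested lists) to one mean per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def eca_channel_weights(channel_means, kernel):
    """Local cross-channel interaction: 1-D convolution over the channel axis.

    Each channel weight depends only on its len(kernel) nearest channels
    (zero padding at both ends), followed by a sigmoid gate.
    """
    k = len(kernel)
    assert k % 2 == 1, "this sketch assumes an odd kernel size"
    pad = k // 2
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    return [
        sigmoid(sum(kernel[j] * padded[i + j] for j in range(k)))
        for i in range(len(channel_means))
    ]


def apply_channel_attention(feature_map, weights):
    """Rescale every pixel of channel c by weights[c]."""
    return [
        [[weights[c] * value for value in row] for row in channel]
        for c, channel in enumerate(feature_map)
    ]


# Toy input: 4 channels of 2 x 2 pixels.
feature_map = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[4.0, 0.0], [0.0, 4.0]],
]
kernel = [0.25, 0.5, 0.25]  # illustrative values, not learned weights

means = global_average_pool(feature_map)
weights = eca_channel_weights(means, kernel)
attended = apply_channel_attention(feature_map, weights)
```

An SE-style module would instead pass all channel means through fully connected layers, so every weight depends on every channel; the kernel size here is what restricts the interaction to neighboring channels. Note that neither variant looks at where in the image a response occurs, which is the spatial limitation the text points out.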
OCRNet [32] creates a description region for each category in advance and constructs the global contextual information by calculating the similarity of each pixel to the description region of the respective category. The self-attention mechanism effectively establishes a global connection but requires excessive computation, which affects the network inference efficiency. Specifically, computing the weight map from the Query and Key matrices of the feature map imposes O(N²) time and space complexity on the self-attention module (N = H × W × C, where H, W, and C denote the height, width, and number of channels of the feature map, respectively), which places a large burden on the semantic segmentation network when processing large remote sensing images.
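To make the quadratic cost concrete, the short calculation below counts the entries of an N × N weight map using the definition of N given above. The helper names, the example feature map size, and the assumption of 4 bytes per entry (32-bit floats) are illustrative choices, not figures from the cited works.

```python
def attention_map_entries(height, width, channels):
    """Entries in an N x N weight map, with N = H * W * C as defined in the text."""
    n = height * width * channels
    return n * n


def attention_map_gib(height, width, channels, bytes_per_entry=4):
    """Memory for the weight map in GiB, assuming 32-bit floats by default."""
    return attention_map_entries(height, width, channels) * bytes_per_entry / 2**30


small = attention_map_entries(32, 32, 64)   # N = 65,536
large = attention_map_entries(64, 64, 64)   # N = 262,144
small_gib = attention_map_gib(32, 32, 64)
growth = large // small                     # effect of doubling H and W

print(f"32x32x64 map: {small} entries, {small_gib} GiB")
print(f"doubling H and W multiplies the cost by {growth}")
```

Under these assumptions a modest 32 × 32 × 64 feature map already needs 16 GiB for the weight map alone, and doubling both spatial dimensions multiplies that by 16, which is why large remote sensing tiles are a poor fit for unrestricted self-attention.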

3. Attention in Semantic Segmentation for Remote Sensing

By integrating attention modules, a semantic segmentation network for image interpretation can better represent features, reduce noise, and build contextual information to increase the network's overall segmentation accuracy. As an example, in ENet [33], the SE module is added to the upsampling stage of the network to generate a weight for each channel and refine the segmentation accuracy of remote sensing images. In SE-UNet [34], the convolution block is altered from the two 3 × 3 convolutions per layer of the standard UNet to one convolution plus one SE module, which strengthens the representation of the feature maps and thereby enhances the capacity of UNet to extract roads from satellite and aerial imagery. An efficient channel attention (ECA) module is proposed and integrated into the UNet encoder in [35], which optimizes the segmentation and raises the encoder's feature extraction performance. In the RSIDNet proposed in [36], denoising remote sensing images is made easier by adding an ECA module to the shortcut block that connects the shallow layer with the deep layer. The module augments the feature representation of the shallow feature maps, reducing the noise brought by the layers and enhancing the segmentation accuracy.
SCAttNet [25] employs the CBAM module to integrate channel and spatial attention. To strengthen its segmentation capability for high-resolution remote sensing imagery, the network first adopts ResNet to extract features. It then feeds them into the CBAM module to construct local contextual information and optimize the learned feature map weights at the channel and pixel levels. RAANet [37] constructs a new residual ASPP module by embedding a CBAM module and a residual structure to improve the accuracy of the semantic segmentation network. In [38], an SCA attention module covering both spaces and channels is designed by placing the spatial attention module of CBAM in parallel with a coordinate attention module that models channels and spaces, which enhances the detection of remote sensing images by the lightweight model. To improve the ability of convolutional neural networks to represent potential relationships between different objects and surrounding features, MQANet [39] introduces position attention, channel attention, label attention, and edge attention modules into the model, which expands the perceptual field of the network and introduces background information in labels to obtain global features.
Furthermore, the self-attention mechanism has been used for remote sensing image semantic segmentation. As an illustration, a region-attention RSA module is constructed using the self-attention mechanism in RSANet [40]. First, the module creates several soft object regions for each category distributed in the image, followed by region descriptors. Then, it evaluates the similarity between the pixels of the feature maps and all region descriptors. The similarity values, computed in parallel, are then used as weights for the initial feature maps. In [41], a self-attention module that concerns channels and spaces is constructed to generate a weight map of all spatial locations and channel relationships by multiplying the Query and Key matrices of the feature map to obtain global information. The network may achieve more accurate segmentation results, but its complexity and high hardware resource requirements make it uneconomical in actual deployment.
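The tandem channel-then-spatial arrangement that SCAttNet borrows from CBAM can be sketched as follows. This is a deliberately reduced illustration, not the CBAM implementation: the channel gate here uses only average pooling and a sigmoid (the published module is usually described with max pooling and a shared MLP as well), a 3 × 3 kernel stands in for the wider convolution, and all names, kernel values, and the toy input are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(fmap):
    """One weight per channel from global average pooling (simplified)."""
    weights = []
    for channel in fmap:
        total = sum(sum(row) for row in channel)
        weights.append(sigmoid(total / (len(channel) * len(channel[0]))))
    return weights


def spatial_gate(fmap, kernel_avg, kernel_max):
    """One weight per pixel: pool across channels (mean and max), convolve
    the two pooled maps with k x k kernels (zero padding), then gate."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    k = len(kernel_avg)
    pad = k // 2
    avg = [[sum(fmap[c][y][x] for c in range(channels)) / channels
            for x in range(width)] for y in range(height)]
    mx = [[max(fmap[c][y][x] for c in range(channels))
           for x in range(width)] for y in range(height)]
    gate = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += (kernel_avg[dy][dx] * avg[yy][xx]
                                + kernel_max[dy][dx] * mx[yy][xx])
            row.append(sigmoid(acc))
        gate.append(row)
    return gate


def tandem_attention(fmap, kernel_avg, kernel_max):
    """Channel gating first, then spatial gating on the channel-refined map."""
    cw = channel_gate(fmap)
    refined = [[[cw[c] * v for v in row] for row in channel]
               for c, channel in enumerate(fmap)]
    sw = spatial_gate(refined, kernel_avg, kernel_max)
    return [[[sw[y][x] * refined[c][y][x] for x in range(len(sw[0]))]
             for y in range(len(sw))] for c in range(len(refined))]


# Toy input: 2 channels of 3 x 3 pixels, uniform illustrative kernels.
fmap = [
    [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0]],
    [[0.0, 1.0, 1.0], [4.0, 0.0, 0.0], [1.0, 1.0, 2.0]],
]
uniform = [[1.0 / 9.0] * 3 for _ in range(3)]
out = tandem_attention(fmap, uniform, uniform)
```

Because the spatial gate is computed from the channel-refined map rather than from the original features, any poor channel weight feeds directly into the pixel weights, which is one way to read the concern about errors passing from the channel side to the spatial side in a tandem design.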