2. Object Detection for Remote Sensing Images
Before deep learning was applied to object detection, traditional remote sensing image object detection methods mainly included threshold-based clustering, template matching, and feature extraction. The authors of
[17] proposed a method for object detection based on mean shift segmentation and non-parametric clustering, which uses prior knowledge of the object shape and a hierarchical clustering method for object extraction and clustering. The authors of
[18] proposed a remote sensing image object detection method based on feature extraction, which uses an improved SIFT (Scale-Invariant Feature Transform) algorithm to extract uniformly distributed matching features and then refines the initial matches with a binary histogram and random sample consensus. Traditional remote sensing image detection methods usually require prior knowledge to design features manually; such handcrafted features are poorly robust to complex scene changes and noise disturbances, resulting in low detection performance.
With the widespread application of deep learning in object detection, many researchers have adapted general-purpose detectors to remote sensing image object detection tasks. The authors of
[19] proposed a contextual refinement module for remote sensing images based on Faster R-CNN to extract and refine the contextual information and improve the Region Proposal Network (RPN) to obtain more positive samples. However, it did not consider the problem of background noise. L-SNR-YOLO
[20] constructs its backbone from a Swin Transformer and a convolutional neural network (CNN) to obtain multi-scale global and local information. Moreover, a feature enhancement module is proposed to make image features salient. However, this approach does not consider model lightweighting, which introduces a large number of parameters. LOCO
[21] proposes a variant of YOLO that uses the spatial characteristics of the object to design the layer structure of the model and uses constrained regression modeling to improve the robustness of the predictions, which allows for better detection of small and dense building footprints. TPH-YOLOv5
[22] adds a detection layer for small objects based on YOLOv5 and introduces a spatial attention mechanism and a transformer encoder module, significantly improving the detection accuracy for small objects in UAV images, but it neglects the large variations in object scale in remote sensing images. DFPH-YOLO
[23] proposes a dense feature pyramid network for remote sensing images based on YOLOv3, which enables four detection layers to combine semantic information before and after sampling to improve object detection performance at different scales. However, it does not avoid introducing irrelevant background information.
Recently, some studies have proposed new strategies and methods for remote sensing image object detection. LAG
[24] proposes a hierarchical anchor generation algorithm that generates anchors in different layers based on the diagonal and aspect ratio of the object, so that the anchors in each layer better match that layer's detection range. The authors of
[25] proposed a new multi-scale deformable attention module and a multi-level feature aggregation module and inserted them into the feature pyramid network (FPN) to improve the detection performance on remote sensing objects of various shapes and sizes. RSADet
[26] considers the spatial distribution, scale, and orientation changes of the objects in remote sensing images by introducing deformable convolution and a new bounding box confidence prediction branch. The authors of
[27] proposed casting bounding box regression in aerial images as a center probability map prediction problem, largely eliminating the ambiguities in object definitions and background pixels. Although the studies above provide optimization schemes for remote sensing image detection, they neglect the background noise introduced when the model extracts features from elongated objects in remote sensing images. In addition, the weighting of samples of different quality in the regression process and the lightweight design of the model also need to be considered.
3. Attention Mechanism
Currently, attention mechanisms are widely applied in the field of image processing. An attention mechanism adaptively selects the essential parts the network should focus on, thereby improving its feature extraction ability. Attention mechanisms can be divided into spatial and channel attention. Spatial attention guides the model to focus on critical spatial regions in a weighted manner, improving the network's perception of image details. Channel attention learns a weight for each channel, allowing the network to pay more attention to the critical channels during training and thereby improving its ability to extract image features.
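As a shape-level illustration (not tied to any specific network discussed here), channel attention multiplies the feature map by one learned weight per channel, while spatial attention multiplies it by one weight per spatial location; both reduce to broadcast multiplications. The weights below are hypothetical placeholders for values a network would learn:

```python
import numpy as np

# Hypothetical feature map with C = 4 channels on a 3 x 3 spatial grid.
x = np.random.rand(4, 3, 3)

# Channel attention: one scalar weight per channel, shape (C, 1, 1),
# broadcast over the spatial dimensions H and W.
channel_weights = np.array([0.9, 0.1, 0.5, 0.7]).reshape(4, 1, 1)
x_channel = x * channel_weights

# Spatial attention: one weight per spatial location, shape (1, H, W),
# shared across all channels.
spatial_weights = np.random.rand(1, 3, 3)
x_spatial = x * spatial_weights

print(x_channel.shape, x_spatial.shape)  # both remain (4, 3, 3)
```

In both cases the feature map keeps its shape; only the relative emphasis of channels or locations changes.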
The Squeeze-and-Excitation Network (SENet)
[28] is a classic channel attention method. It first compresses the features along the channel dimension by global average pooling, then learns the weight of each channel through two fully connected layers, thereby weighting the channels of the input feature map and capturing global inter-channel relationships. Efficient Channel Attention (ECA)
[29] improves on SENet by replacing the fully connected layers with a one-dimensional convolution whose kernel size is chosen adaptively, learning useful local cross-channel interactions more efficiently but ignoring global inter-channel relationships. The Convolutional Block Attention Module (CBAM)
[30] is a mixed attention method in both channel and spatial domains, which combines channel and spatial attention by performing average pooling and max pooling operations on the input feature map. Coordinate Attention (CA)
[31] is a spatial position-based attention mechanism. It extracts information through average pooling along the horizontal and vertical directions, encodes the spatial position information of the input feature map into two coordinate-wise attention maps, and fuses the coordinate information into the channel attention, making it a very effective attention mechanism. Triplet Attention Module (TAM)
[32] is a rotation attention mechanism that rotates the feature map so that the model can focus on different parts of the object in different directions, thereby improving the accuracy of object detection and image classification. Non-Local
[33], Self-Calibrated Convolutions (SC)
[34], and Bi-Level Routing Attention (BRA)
[35] are all self-attention mechanisms used for computer vision tasks. These methods establish relationships between pixels in the image and weight them semantically, but often introduce a large number of additional parameters.
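To make the squeeze-and-excitation pattern of SENet concrete, the following NumPy sketch shows the global average pooling squeeze, the two fully connected layers with a reduction ratio, and the sigmoid-gated channel reweighting. The weight matrices `w1` and `w2` and the ratio `r` are hypothetical stand-ins for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_attention(x, w1, w2):
    """SE-style channel attention on a (C, H, W) feature map.

    Squeeze: global average pooling over H and W gives one descriptor
    per channel. Excitation: two fully connected layers (reduce, then
    restore the channel dimension) with a sigmoid gate produce per-channel
    weights in (0, 1) that rescale the input channels.
    """
    c = x.shape[0]
    squeezed = x.mean(axis=(1, 2))           # squeeze: (C,)
    hidden = np.maximum(0.0, w1 @ squeezed)  # FC + ReLU: (C // r,)
    weights = sigmoid(w2 @ hidden)           # FC + sigmoid: (C,)
    return x * weights.reshape(c, 1, 1)      # channel-wise reweighting

rng = np.random.default_rng(0)
C, r = 8, 2
x = rng.standard_normal((C, 4, 4))
w1 = rng.standard_normal((C // r, C))  # reduction FC layer
w2 = rng.standard_normal((C, C // r))  # restoration FC layer
y = se_attention(x, w1, w2)
print(y.shape)  # (8, 4, 4)
```

Because the gate is a sigmoid, every channel is scaled by a factor strictly between 0 and 1, so the output never amplifies a channel; it only redistributes emphasis.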
These attention mechanisms perform well on images of natural scenes by adaptively calibrating the network to focus more on the foreground of the feature map, thus slightly mitigating the interference of background noise. However, none of the above methods adopts an effective strategy to reduce the introduction of background noise in remote sensing images, and they ignore the scale differences of remote sensing objects. Hence, their performance improvement on remote sensing object detection is smaller than on natural scenes.
4. GSConv
In order to improve the performance and efficiency of networks, the study of lightweight models has also received widespread attention. Models such as MobileNet
[36], ShuffleNet
[37], and EfficientNet
[38] achieve lightweight design through different techniques. Among them, Depthwise Separable Convolution is a common lightweight convolution technique consisting of two steps: Depthwise Convolution (DWConv) and Pointwise Convolution (PWConv).
Figure 2a shows the framework of DWConv. It applies a separate convolution kernel to each channel of the input tensor, for example, using C convolution kernels to perform convolution operations on an input tensor with C channels. The size of each convolution kernel is usually small, such as 3 × 3 or 5 × 5. PWConv, as shown in
Figure 2b, applies a 1 × 1 convolution kernel for dense calculation, which can fuse information between channels and reduce dimensionality.
Figure 2. Illustration of two types of convolution: (a) Depthwise Convolution; (b) Pointwise Convolution.
DWConv is equivalent to a grouped convolution in which the number of groups equals the number of input channels, so each channel is processed by its own kernel. Although this significantly reduces the number of parameters and the computational cost, there is no interaction between channels, and inter-channel information is separated during the calculation; this is an important reason for the lower accuracy of DWConv. PWConv fuses channel information through dense 1 × 1 convolutions and therefore achieves higher accuracy, but this dense computation also brings more parameters and a higher computational cost. Depthwise Separable Convolution uses both DWConv and PWConv but simply connects them in series, so the dense inter-channel results remain separated. Compared with ordinary convolution, although the number of parameters is reduced, the accuracy is lower, which affects the detection performance of the model.
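The parameter savings of Depthwise Separable Convolution over ordinary convolution can be checked with simple arithmetic (bias terms omitted); the channel counts below are illustrative, not taken from any model in this paper:

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution."""
    return c_out * c_in * k * k

def dwsep_params(c_in, c_out, k):
    """Depthwise (one k x k kernel per input channel) plus pointwise (1 x 1)."""
    return c_in * k * k + c_out * c_in

c_in, c_out, k = 64, 128, 3
std = conv_params(c_in, c_out, k)   # 128 * 64 * 9 = 73728 weights
sep = dwsep_params(c_in, c_out, k)  # 576 + 8192 = 8768 weights
print(std, sep, round(std / sep, 1))  # roughly an 8.4x reduction
```

The savings grow with the kernel size and channel counts, which is why the technique is central to lightweight backbones such as MobileNet.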
To solve the abovementioned problem, Li et al. proposed GSConv
[39], a new lightweight convolution technique. The structure of the GSConv module is shown in
Figure 3, where PWConv and DWConv denote the Pointwise Convolution and Depthwise Convolution of Depthwise Separable Convolution, respectively; GSConv combines them in a more efficient way. Assume the input tensor has C1 channels and the output tensor has C2 channels. First, to obtain more accurate dense calculation results, the module applies a PWConv to the input tensor and compresses the channels to half of the output channels. Then, to keep the computation lightweight, a DWConv is performed on the dense PWConv result, yielding a result with C2/2 channels. The two results are then concatenated along the channel dimension to obtain a tensor with C2 channels. Finally, to mix the results of PWConv and DWConv, a shuffle operation is applied along the channel dimension, allowing the information generated by PWConv to permeate the DWConv result. GSConv combines the accuracy of dense computation with the lightweight nature of depth-wise computation, making it an efficient, lightweight convolution method.
Figure 3. Structure of GSConv module.
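The steps above can be sketched at the shape level in NumPy. For brevity, PWConv is modelled as a 1 × 1 convolution (a matrix over channels) and DWConv as a per-channel scaling (a depthwise kernel of size 1), so only the channel bookkeeping and the shuffle are faithful to GSConv; the weights are hypothetical:

```python
import numpy as np

def gsconv_sketch(x, w_pw, w_dw):
    """Shape-level sketch of GSConv on a (C1, H, W) map, producing (C2, H, W).

    PWConv compresses the input to C2/2 channels (dense channel mixing),
    DWConv processes that result channel by channel, the two halves are
    concatenated, and a channel shuffle interleaves them so the dense
    PWConv information permeates the depthwise result.
    """
    c_half, h, w = w_pw.shape[0], x.shape[1], x.shape[2]
    dense = np.tensordot(w_pw, x, axes=([1], [0]))  # PWConv: (C2/2, H, W)
    depth = dense * w_dw.reshape(c_half, 1, 1)      # DWConv stand-in, per channel
    y = np.concatenate([dense, depth], axis=0)      # concat: (C2, H, W)
    # Channel shuffle with two groups: interleave dense and depthwise channels.
    return y.reshape(2, c_half, h, w).transpose(1, 0, 2, 3).reshape(2 * c_half, h, w)

rng = np.random.default_rng(1)
c1, c2 = 4, 8
x = rng.standard_normal((c1, 5, 5))
w_pw = rng.standard_normal((c2 // 2, c1))  # 1 x 1 conv weights
w_dw = rng.standard_normal(c2 // 2)        # one depthwise scale per channel
y = gsconv_sketch(x, w_pw, w_dw)
print(y.shape)  # (8, 5, 5)
```

After the shuffle, channels alternate between PWConv and DWConv outputs, which is what lets the subsequent layers see mixed dense and depthwise information.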