Transformer Framework and YOLO Framework for Object Detection: Comparison

Object detection is a fundamental task in remote sensing image processing; as one of its core components, small or tiny object detection plays an important role.

  • small object detection
  • remote sensing images
  • transformer
  • YOLO

1. Introduction

Remote sensing object detection is a prominent and consequential application within remote sensing image processing [1]. It aims to accurately identify and locate specific target instances within an image. Within this domain, small object detection holds particular importance, as it focuses on objects that occupy a very small area or consist of only a few pixels. Detecting small objects is considerably more challenging than detecting larger ones and yields lower accuracy [2]. In recent years, small object detection based on convolutional neural networks (CNNs) has developed rapidly alongside the growth of deep learning [3]. Small object detection faces challenges such as limited information per object, scarcity of positive samples, and class imbalance. To address these challenges, researchers have proposed diverse deep neural network methods, including CNNs, GANs, RNNs, and transformers, for detecting small objects, including those in remote sensing images. To improve small object detection, Liu W. et al. proposed the YOLOv5-Tassel network, which places a SimAM module in front of each detection head to extract the features of interest [4]. Li J. et al. suggested using GAN models to generate high-resolution images of small objects, narrowing the gap between small and large objects and improving the detection of tiny objects [5]. Xu W. et al. integrated contextual information into the Swin Transformer and designed a framework called the foreground-enhanced attention Swin Transformer (FEA-Swin) [6]; although it improves small object detection accuracy, it sacrifices some speed. Zhu X. et al. proposed the TPH-YOLOv5 model, which extends YOLOv5 by adding transformer-based attention to the detection head [7]. While this enhances performance on small objects, it also brings a significant computational burden.
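As a concrete illustration of the attention modules mentioned above, below is a minimal PyTorch sketch of the parameter-free SimAM attention that YOLOv5-Tassel places in front of each detection head [4]. The energy formulation follows the published SimAM module; how it is wired into the surrounding detector is only indicative.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention (standard published formulation).

    Placement in front of a detection head, as in YOLOv5-Tassel [4],
    is sketched here; exact integration details are an assumption.
    """
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda  # regularizer in the energy denominator

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); the energy is computed per channel map
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)  # squared deviation from channel mean
        v = d.sum(dim=(2, 3), keepdim=True) / n            # channel variance estimate
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5        # inverse energy: distinctive neurons score higher
        return x * torch.sigmoid(e_inv)                    # reweight activations

# quick shape check
feat = torch.randn(1, 256, 40, 40)
assert SimAM()(feat).shape == feat.shape
```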
In the field of remote sensing, detecting small objects remains challenging due to large image scales, complex and varied backgrounds, and unique shooting perspectives. Cheng et al. proposed a model-training regularization method that enhances small object detection in remote sensing by exploiting global contextual cues and image-level contextual information [8]. Liu J. et al. added a dilated convolution module to the FPN and designed a relationship connection attention module that automatically selects and refines features, combining global and local attention for small object detection in remote sensing [9]. Cheng et al. proposed an end-to-end cross-scale feature fusion (CSFF) framework based on the feature pyramid network (FPN), which inserts squeeze-and-excitation (SE) modules at the top layer to better detect tiny objects in optical remote sensing images [10]. Dong et al. proposed a CNN method based on balanced multi-scale fusion (BMF-CNN), which fuses high- and low-level semantic information to improve tiny object detection in remote sensing [11]. Liang X. et al. proposed a feature-fusion and scaling-based single-shot detector (FS-SSD) better suited to detecting tiny or small objects in remote sensing; FS-SSD adds a scaling branch to the deconvolution module and uses the two feature pyramids generated by the deconvolution and feature fusion modules jointly for prediction, improving detection accuracy [12]. Xu et al. designed a transformer-guided multi-interaction network (TransMIN) that uses local–global feature interaction (LGFI) and cross-view feature interaction (CVFI) modules to enhance small object detection in remote sensing, although this improvement unavoidably introduces a computational burden [13]. Li et al. proposed a transformer that aggregates multi-scale global spatial positions to enhance small object detection, again at a computational cost [14]. To reduce the computational cost of the transformer, Xu et al. improved the lightweight Swin Transformer and designed a Local Perception Swin Transformer (LPSW) backbone network to enhance small-scale detection accuracy [15]. Gong et al. designed an SPH-YOLOv5 model based on Swin Transformer Prediction Heads (SPHs) to balance the accuracy and speed of small object detection in remote sensing [16]. Although many researchers are studying the balance between detection accuracy and inference speed, achieving an elegant balance remains a challenging problem [17][18][19][20][21].
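To make the SE insertion used by CSFF [10] concrete, the following is a standard squeeze-and-excitation block in PyTorch. This is the generic published SE formulation rather than the authors' exact code, and the choice of the top FPN level in the usage example is an assumption.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation block: global pooling ("squeeze"),
    a bottleneck MLP ("excitation"), then channel-wise reweighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))  # squeeze: global average pooling, then excitation MLP
        return x * w.view(b, c, 1, 1)    # reweight channels

# e.g., applied to the top FPN level (assumed here to be P5 with 256 channels)
p5 = torch.randn(2, 256, 13, 13)
assert SEBlock(256)(p5).shape == p5.shape
```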
Considerable advancements have been achieved in applying transformers [6][7][13][14][15][16] to small object detection in the remote sensing domain. The exceptional performance of the Contextual Transformer (CoT) [22] in harnessing spatial contextual information, offering a fresh outlook on transformer design, merits particular attention. In remote sensing, small-target pixels carry scarce spatial information but rich channel information, so jointly modeling spatial and channel information is of paramount importance. Furthermore, transformers place notable demands on computational resources and network capacity, making it challenging to strike an optimal balance between detection accuracy and processing speed for small object detection in remote sensing. Meanwhile, Bar M. et al. demonstrated that background is critical for human recognition of objects [18]. Empirical research in computer vision has likewise shown that both traditional methods [19] and deep learning-based methods [12] can improve performance by properly modeling spatial context. Moreover, He K. et al. have shown that residual structures improve network performance [17][20]. Finally, researchers note that the classification and regression tasks of object detection focus on the salient features and boundary features of the target, respectively [23]. Therefore, a decoupled detection head incorporating a residual structure together with channel and spatial context knowledge should have a positive impact on the detection of small or tiny objects.

2. Transformer Framework for Object Detection

The transformer structure, based on self-attention, first appeared in NLP tasks. Compared with modern convolutional neural networks (CNNs) [24], the Vision Transformer has made impressive progress in computer vision. After Dosovitskiy A. et al. successfully introduced transformers into computer vision [25], many scholars turned to transformers [26][27][28]. In object detection, DETR [29] and Pix2seq [30] are the earliest transformer detectors, and they define two different object detection paradigms. However, transformers have many parameters and demand substantial computing power and hardware, which limits their applicability. To run transformers on mobile devices, Mehta S. et al. proposed the lightweight MobileViT series [31][32][33], which achieves a good balance between accuracy and real-time performance and has been widely used in risk detection [34], medicine [35], and other fields. A major advantage of transformers is that the attention mechanism can model global dependencies in the input data and capture long-range information; however, it tends to overlook connections between local contexts. To address this problem, Li Y. et al. proposed the lightweight CoT [22] self-attention module to capture contextual background information on 2D feature maps. It extracts information between local contexts while capturing global dependencies, enabling a more adequate information exchange. In this research, researchers use CoT to exploit the global characteristics of spatial context and channels. Based on the original structure, researchers added global residual and local fusion structures to further exploit the characteristics of space and channels.
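The sketch below illustrates the CoT idea of fusing a convolutionally mined static context with an attention-weighted dynamic context. It is a simplified rendering: the published CoT aggregates values over k × k local neighborhoods, whereas this version uses per-pixel weighting for brevity, so the class name SimplifiedCoT and all widths are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimplifiedCoT(nn.Module):
    """Simplified sketch of a Contextual Transformer (CoT)-style block [22].

    Static context: a k x k convolution over the keys.
    Dynamic context: values weighted by an attention map predicted from
    the concatenated [static context, query]. Per-pixel weighting is used
    here instead of CoT's k x k neighborhood aggregation.
    """
    def __init__(self, dim: int, kernel_size: int = 3, reduction: int = 4):
        super().__init__()
        self.key_embed = nn.Sequential(  # static context from the local neighborhood
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        self.value_embed = nn.Sequential(  # 1 x 1 value projection
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
        )
        self.attn = nn.Sequential(  # attention from [static context, query]
            nn.Conv2d(2 * dim, dim // reduction, 1, bias=False),
            nn.BatchNorm2d(dim // reduction),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k1 = self.key_embed(x)                    # static context
        v = self.value_embed(x)
        a = self.attn(torch.cat([k1, x], dim=1))  # contextualized attention weights
        k2 = torch.sigmoid(a) * v                 # dynamic context (simplified)
        return k1 + k2                            # fuse static and dynamic context

x = torch.randn(1, 64, 32, 32)
assert SimplifiedCoT(64)(x).shape == x.shape
```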

3. YOLO Framework for Object Detection

In 2015, YOLO [36] introduced a one-stage object detection method that combined candidate-box extraction, CNN feature learning, and NMS optimization to simplify the network structure. Its detection speed was nearly 10 times faster than R-CNN, making real-time object detection possible with the computing power available at the time, although it was not well suited to detecting small objects. YOLOv2 [37] added optimization strategies such as batch normalization and dimension-clustered anchor boxes on top of v1 to improve the accuracy of object regression and localization. YOLOv3 [38] added the residual structure and FPN structure on top of v2 to further improve small object detection. From YOLOv3 onward, the network framework can be roughly divided into three parts, backbone, neck, and head, with subsequent versions optimizing internal details to varying degrees. For example, YOLOv4 [39], based on v3, further optimized the backbone network and activation function and used Mosaic data augmentation to improve the robustness and reliability of the network. YOLOv5 [40] added the focus structure on top of v4 and accelerated training through slicing. YOLOv6 [41] introduced RepVGG into the backbone, proposed a more efficient EfficientRep block, and simplified the design of the decoupled detection head to improve detection efficiency. YOLOv7 [42] adopted the E-ELAN structure in the neck, which reduces inference time, and used an auxiliary-head training method. At present, YOLOv7 is one of the more advanced object detection networks owing to its real-time characteristics and is widely used in fields with strict time requirements such as industrial equipment inspection [43], sea rescue [44], and aquaculture [45]. Therefore, researchers use YOLOv7, one of the strongest baselines, as the benchmark model.
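The backbone–neck–head decomposition used from YOLOv3 onward can be summarized in a few lines. The sketch below is a generic skeleton with placeholder components, not the actual YOLOv7 wiring; real models plug in, for example, a CSP backbone, a PAN neck, and a YOLO detection head.

```python
import torch
import torch.nn as nn

class OneStageDetector(nn.Module):
    """Generic backbone-neck-head skeleton of a one-stage detector."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # extracts multi-scale features from the image
        self.neck = neck          # aggregates features across scales (e.g., FPN/PAN)
        self.head = head          # predicts class scores and box coordinates

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.neck(self.backbone(x)))

# wiring check with identity placeholders standing in for real modules
model = OneStageDetector(nn.Identity(), nn.Identity(), nn.Identity())
img = torch.randn(1, 3, 640, 640)
assert model(img).shape == img.shape
```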

4. Detection Head Framework for Object Detection

Object detection involves two subtasks, classification and regression, which output the object's class and its bounding-box position, respectively. Song G. et al. pointed out that the two tasks focus on different things [23]: classification pays more attention to the texture content of the object, while regression pays more attention to its edge information. Wu Y. et al. suggested that classification and regression may be better handled by separate FC-head and Conv-head branches [46]. Among single-stage models, YOLOX [47] adopts a decoupled head structure that separates the classification and regression branches and adds two extra 3 × 3 convolutional layers, improving detection accuracy at the cost of inference speed. Building upon this approach, YOLOv6 weighs the representation ability of the operators involved against their hardware computing overhead and adopts a Hybrid Channels strategy to redesign a more efficient decoupled head that reduces cost while maintaining accuracy, mitigating the additional latency of the 3 × 3 convolutions in the decoupled detection head. Feng C. et al. used feature extractors to learn task-interactive features from multiple convolutional layers to enhance the interaction between classification and localization [48]. They also pointed out that the preferred interaction features may differ because classification and localization pursue different goals. To resolve the resulting feature conflict between the two tasks, they designed a layer attention mechanism that focuses on different types of features, such as different layers and receptive fields, which alleviates the conflict to a certain degree.
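A minimal PyTorch sketch of a YOLOX-style decoupled head follows: a 1 × 1 stem reduces channels, then separate 3 × 3 convolution stacks serve classification and regression, with objectness predicted from the regression branch as in YOLOX [47]. Channel widths and layer counts here are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in: int, c_out: int, k: int = 3) -> nn.Sequential:
    """Conv + BatchNorm + SiLU building block."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class DecoupledHead(nn.Module):
    """YOLOX-style decoupled head sketch for one feature level."""
    def __init__(self, in_ch: int, num_classes: int, width: int = 256):
        super().__init__()
        self.stem = conv_bn_act(in_ch, width, k=1)  # 1 x 1 channel reduction
        self.cls_branch = nn.Sequential(conv_bn_act(width, width), conv_bn_act(width, width))
        self.reg_branch = nn.Sequential(conv_bn_act(width, width), conv_bn_act(width, width))
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # class scores
        self.reg_pred = nn.Conv2d(width, 4, 1)            # box offsets
        self.obj_pred = nn.Conv2d(width, 1, 1)            # objectness

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        c = self.cls_branch(x)  # texture-oriented classification features
        r = self.reg_branch(x)  # edge-oriented localization features
        return self.cls_pred(c), self.reg_pred(r), self.obj_pred(r)

head = DecoupledHead(in_ch=512, num_classes=80)
cls, reg, obj = head(torch.randn(1, 512, 20, 20))
print(cls.shape, reg.shape, obj.shape)  # (1, 80, 20, 20) (1, 4, 20, 20) (1, 1, 20, 20)
```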