DetectFormer

Object detection plays a vital role in autonomous driving systems, and the accurate detection of surrounding objects can ensure the safe driving of vehicles. This entry presents DetectFormer, a category-assisted transformer object detector for autonomous driving that achieves better accuracy than the baseline. Specifically, the ClassDecoder is assisted by proposal categories and by global information from the Global Extract Encoder (GEE) to improve category sensitivity and detection performance. This fits the distribution of object categories in specific scene backgrounds and the connection between objects and the image context. Data augmentation is used to improve robustness, and an attention mechanism is added to the backbone network to extract channel-wise spatial features and direction information.


  • autonomous driving
  • deep learning
  • object detection

1. Introduction

Vision-based object detection in traffic scenes plays a crucial role in autonomous driving systems. With the rapid development of autonomous driving, the performance of object detection has made significant progress. Traffic objects (e.g., traffic signs, vehicles, and pedestrians) can be detected automatically by extracting image features, and accurate perception of the traffic scene ensures the safety of the autonomous vehicle. These methods can be divided into anchor-based and anchor-free approaches.
Deep-learning-based object detection can be divided into single-stage and multi-stage detection. Multi-stage algorithms first extract regions of interest and then determine the location of the object within these candidate areas. Single-stage algorithms output the location and category directly on the original image as dense bounding boxes. These detection algorithms classify each anchor box or key point and detect each category independently, ignoring the relationships between categories. However, specific relationships exist between objects, such as the probability, location, and scale of different objects in a particular environment; this information is essential for object detection and can improve detection accuracy.
Such relationships between categories arise in many traffic scenarios. For example, pedestrians appearing in highway scenes and vehicles appearing on pedestrian paths are low-probability events, which indicates a connection between object categories and scenarios. Likewise, the signs “Passing” and “No Passing” should not appear in the same scene, which indicates a connection between different object categories. Specific implicit relationships therefore exist between object categories and the background of traffic scenes. Existing object detection methods do not consider these relationships: their classification subnetworks are trained to classify each object independently, without reference to the other objects in the scene, so the model underperforms in fitting the distribution of objects against the scene background. Additionally, the model does not thoroughly learn the features required by the detection task, which causes a gap in classification confidence between categories and degrades detection performance.
Based on the above assumptions, this study proposes DetectFormer, a category-assisted transformer object detector built on the single-stage method that learns the relationships between different objects. The motivation of this study was to allow the classification subnetwork to better fit the distribution of object categories in specific scene backgrounds and to ensure that the network model focuses on these relationships.
Transformer [1] is widely used in natural language processing, machine translation, and computer vision because of its ability to perceive global information. Specifically, the vision transformer (ViT) [2] and DETR [3] have been proposed and applied to computer vision. Previous studies have used transformers to capture global feature information and reallocate the network’s attention over features, a mechanism called self-attention. In this study, DetectFormer was built on the transformer concept; still, the inputs and the structure of the multi-head attention mechanism differ, because the purpose of DetectFormer is to improve detection accuracy with the assistance of category information.
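To make the self-attention idea concrete, the following is a minimal sketch of scaled dot-product self-attention in PyTorch. The tensor shapes and projection matrices are illustrative assumptions, not the DetectFormer implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of feature tokens.

    x: (batch, tokens, dim) input features; w_q, w_k, w_v: (dim, dim)
    projection matrices. All names and shapes are illustrative.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.size(-1)
    # Every token attends to every other token, which is what gives the
    # transformer its global receptive field.
    scores = q @ k.transpose(-2, -1) / d ** 0.5    # (batch, tokens, tokens)
    weights = F.softmax(scores, dim=-1)
    return weights @ v                             # (batch, tokens, dim)

# Example: 8 tokens of a flattened feature map with 64 channels.
x = torch.randn(2, 8, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)      # torch.Size([2, 8, 64])
```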
The contributions of this study are as follows:
(1) The Global Extract Encoder (GEE) is proposed to extract global information from the image features output by the backbone network, enhancing the model’s global perception ability.
(2) A novel category-assisted transformer called ClassDecoder is proposed. It can learn object category relationships and improve the model’s sensitivity by implicitly learning the relationships between objects.
(3) An attention mechanism is added to the backbone network to capture cross-channel, direction-aware, and position-sensitive information during feature extraction (see the sketch after this list).
(4) Efficient data augmentation methods are proposed to enhance the diversity of the dataset and improve the robustness of model detection.
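The attention mechanism described in contribution (3), capturing cross-channel, direction-aware, position-sensitive information, matches the general shape of coordinate-attention-style blocks. The sketch below is a minimal illustration under that assumption; the layer sizes, reduction ratio, and use of average pooling are illustrative choices, not the exact DetectFormer design.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """A coordinate-attention-style block: it pools along height and width
    separately, so the resulting gates carry direction-aware,
    position-sensitive information across channels."""

    def __init__(self, channels, reduction=32):  # reduction is an assumed hyperparameter
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Pool along width -> per-row statistics (b, c, h, 1);
        # pool along height -> per-column statistics, reshaped to (b, c, w, 1).
        x_h = x.mean(dim=3, keepdim=True)
        x_w = x.mean(dim=2, keepdim=True).transpose(2, 3)
        y = torch.cat([x_h, x_w], dim=2)                  # (b, mid? no: b, c, h+w, 1)
        y = self.act(self.bn(self.conv1(y)))              # shared transform
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                  # (b, c, h, 1) gate
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))  # (b, c, 1, w) gate
        return x * a_h * a_w  # broadcast the two directional gates

# Example: gate a 64-channel backbone feature map.
feat = torch.randn(2, 64, 32, 32)
print(CoordAttention(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```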

2. Object Detection

Traditional object detection uses HOG [4] or DPM [5] to extract image features, which are then fed into a classifier such as an SVM [6]; Chen et al. [7] used an SVM for traffic light detection. In recent years, deep-learning-based object detection algorithms have achieved better accuracy than traditional methods and have become a research hotspot. Generally, there are two types of object detection based on deep convolutional networks: (1) multi-stage detection, such as the R-CNN series [8][9][10] and Cascade R-CNN [11]; and (2) one-stage detection, also known as dense detection, which can be divided into anchor-based methods (for example, the You Only Look Once series [12][13][14] and RetinaNet [15]) and anchor-free methods (for example, FCOS [16], CenterNet [17], and CornerNet [18]). Multi-stage detection methods extract features of the foreground area using region proposal algorithms over preset dense candidates in the first stage, and the bounding boxes of objects are regressed in subsequent stages. The limitation of this structure is that it reduces detection speed and cannot satisfy the real-time requirements of autonomous driving tasks. Single-stage detection methods, unlike multi-stage methods, detect objects and regress bounding boxes directly, which avoids repeated computation of the feature map and obtains anchor boxes directly on it.

He et al. [19] proposed a detection method using CapsNet [20] for visual inspection of traffic scenes. Li et al. [21] proposed an improved Faster R-CNN for multi-object detection in complex traffic environments. Lian et al. [22] proposed attention fusion for small traffic object detection. Liang et al. [23] proposed a lightweight anchor-free detector for traffic scene object detection. However, these models cannot capture global information because they are limited by the size of the receptive field: they obtain local information when extracting image features and can enlarge the receptive field only by increasing the size of the convolution kernel or stacking more convolution layers. In recent years, transformers have been introduced as new attention-based building blocks for computer vision; they achieve superior performance because they can obtain global image information without enlarging the receptive field.

3. Transformers Structure

The transformer is an encoder–decoder architecture introduced by Vaswani et al. [1]. It was first used in machine translation, where it outperforms LSTM [24], GRU [25], and RNN-based [26] models (e.g., MoE [27] and GNMT [28]). The transformer extracts features by aggregating global information, which makes it well suited to long-sequence prediction and other information-heavy tasks; it outperforms RNN-based models in natural language processing [29][30], speech processing [31], and transfer learning [32]. As a new framework, its performance in computer vision is comparable to that of CNNs. Dosovitskiy et al. [2] proposed the vision transformer, which applied the transformer to image classification. Carion et al. [3] proposed DETR, which applied the transformer to object detection. Yan et al. [33] used a transformer to predict long-term traffic flow, and Cai et al. [34] used a transformer to capture spatial dependency in time series with continuity and periodicity.
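As a minimal illustration of this encoder–decoder design, the sketch below uses PyTorch's stock nn.Transformer module; the sequence lengths, batch size, and embedding dimension are arbitrary examples, unrelated to any specific model discussed in this entry.

```python
import torch
import torch.nn as nn

# Standard encoder-decoder transformer (Vaswani et al. [1]) as shipped in PyTorch.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.randn(10, 2, 512)  # (source length, batch, embedding dim)
tgt = torch.randn(7, 2, 512)   # (target length, batch, embedding dim)

# The encoder aggregates global information over the source sequence;
# the decoder attends to both the target so far and the encoder memory.
out = model(src, tgt)
print(out.shape)  # torch.Size([7, 2, 512])
```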