DetectFormer

Object detection plays a vital role in autonomous driving systems, and the accurate detection of surrounding objects helps ensure the safe driving of vehicles. DetectFormer is a category-assisted transformer object detector for autonomous driving. The proposed detector achieves better accuracy than the baseline. Specifically, its ClassDecoder is assisted by proposal categories and by global information from the Global Extract Encoder (GEE) to improve category sensitivity and detection performance. This allows the model to fit the distribution of object categories in specific scene backgrounds and the connection between objects and the image context. Data augmentation is used to improve robustness, and an attention mechanism is added to the backbone network to extract channel-wise spatial features and direction information.

  • autonomous driving
  • deep learning
  • object detection

1. Introduction

Vision-based object detection in traffic scenes plays a crucial role in autonomous driving systems. With the rapid development of autonomous driving, the performance of object detection has made significant progress. Traffic objects (e.g., traffic signs, vehicles, and pedestrians) can be detected automatically by extracting image features, and the resulting perception of the traffic scene helps ensure the safety of the autonomous vehicle. These methods can be divided into anchor-based and anchor-free approaches.
Deep-learning-based object detection can be divided into single-stage and multi-stage detection. Multi-stage algorithms first extract regions of interest and then determine the locations of objects within these candidate areas. Single-stage algorithms output locations and categories directly on the original image with dense bounding boxes. These detection algorithms classify each anchor box or key point and detect different categories independently, ignoring the relationships between categories. However, specific relationships exist between objects, such as the probability, location, and scale of different objects in a particular environment; this information is essential for object detection and can improve detection accuracy.
Such relationships between categories occur in many traffic scenarios. For example, pedestrians appearing on a highway and vehicles appearing on a pedestrian path are low-probability events, which indicates a connection between object categories and scenarios. Likewise, the signs “Passing” and “No Passing” should not appear in the same scene, which indicates a connection between different object categories. In other words, specific implicit relationships exist between object categories and the background of traffic scenes. Existing object detection methods do not consider these relationships: their classification subnetworks are trained to classify different objects independently, without the objects knowing about each other. As a result, the model underperforms in fitting the distribution of objects and the scene background, does not thoroughly learn the features required by the detection task, and shows a gap in classification confidence between categories, which degrades detection performance.
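To make this limitation concrete, the following minimal sketch (plain PyTorch; channel counts, anchor counts, and class numbers are arbitrary, not taken from the entry) shows a RetinaNet-style [15] dense classification head: every anchor at every location receives an independent sigmoid score per class, so nothing in the head models relationships between categories or between objects and the scene.

# Illustrative sketch only: a generic dense classification head in which every
# (anchor, class) pair is scored independently at each spatial location.
import torch
import torch.nn as nn

class DenseClsHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=9, num_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
        )
        # One logit per (anchor, class) at every location of the feature map.
        self.cls_logits = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)

    def forward(self, feat):                       # feat: [B, C, H, W]
        logits = self.cls_logits(self.conv(feat))  # [B, A*K, H, W]
        # Independent sigmoid per class: category scores never interact.
        return torch.sigmoid(logits)

scores = DenseClsHead()(torch.randn(1, 256, 32, 32))
print(scores.shape)  # torch.Size([1, 90, 32, 32])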
Based on the above observations, this entry proposes DetectFormer, a category-assisted transformer object detector built on the single-stage paradigm, which learns the relationships between different objects. The motivation is to allow the classification subnetwork to better fit the distribution of object categories within specific scene backgrounds and to make the network model focus more on these relationships.
The transformer [1] is widely used in natural language processing, machine translation, and computer vision because of its ability to perceive global information. In particular, the vision transformer (ViT) [2] and DETR [3] have applied transformers to computer vision. Previous studies used the transformer’s self-attention mechanism to capture global feature information and to reallocate the network’s attention across features. DetectFormer is built on this transformer concept, but the inputs and structure of its multi-head attention mechanism differ because its purpose is to improve detection accuracy with the assistance of category information.
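For reference, the following minimal sketch (plain PyTorch; dimensions are arbitrary) shows the scaled dot-product self-attention at the heart of the transformer [1]: every token attends to every other token, which is what gives even a single layer a global receptive field.

# Minimal scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V.
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: [N, d] sequence of N tokens; w_q/w_k/w_v: [d, d] projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                     # queries, keys, values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # [N, N] pairwise similarities
    weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
    return weights @ v                                      # every output mixes all inputs

d = 64
x = torch.randn(16, d)                                      # 16 tokens (e.g., image patches)
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([16, 64])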
The contributions of this entry are as follows:
(1) The Global Extract Encoder (GEE) is proposed to extract global information from the image features output by the backbone network, enhancing the model’s global perception ability.
(2) A novel category-assisted transformer decoder, called ClassDecoder, is proposed. It improves the model’s category sensitivity by implicitly learning the relationships between object categories (an illustrative sketch follows this list).
(3) An attention mechanism is added to the backbone network to capture cross-channel, direction-aware, and position-sensitive information during feature extraction.
(4) Efficient data augmentation methods are proposed to enhance the diversity of the dataset and improve the robustness of model detection.
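The entry does not give the exact internals of the GEE or the ClassDecoder, so the following sketch is only an illustration of the idea stated in contribution (2): per-proposal features act as queries that cross-attend to globally encoded image features, so each proposal’s category scores are refined with scene-level context. All names, module choices, and sizes here (e.g., CategoryAssistedDecoder, and nn.TransformerEncoderLayer standing in for the GEE) are assumptions, not the published architecture.

# Hedged, illustrative sketch: category scores refined by cross-attention over
# globally encoded image features (a stand-in for GEE + ClassDecoder).
import torch
import torch.nn as nn

class CategoryAssistedDecoder(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_classes=10):
        super().__init__()
        # Stand-in for the GEE: one transformer encoder layer over feature tokens.
        self.global_encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                                         batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, proposal_embed, feat_tokens):
        # proposal_embed: [B, P, dim] per-proposal features; feat_tokens: [B, N, dim]
        global_ctx = self.global_encoder(feat_tokens)             # global image context
        refined, _ = self.cross_attn(proposal_embed, global_ctx, global_ctx)
        return self.cls_head(refined)                             # [B, P, num_classes]

logits = CategoryAssistedDecoder()(torch.randn(2, 100, 256), torch.randn(2, 1024, 256))
print(logits.shape)  # torch.Size([2, 100, 10])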

2. Object Detection

Traditional object detection uses HOG [4] or DPM [5] to extract image features, which are then fed into a classifier such as an SVM [6]. Chen et al. [7] used an SVM for traffic light detection. In recent years, deep-learning-based object detection algorithms have achieved better accuracy than traditional methods and have become a research hotspot. Generally, there are two types of object detection based on deep convolutional networks: (1) multi-stage detection, such as the R-CNN series [8][9][10] and Cascade R-CNN [11]; and (2) one-stage detection, also known as dense detection, which can be divided into anchor-based methods (for example, the You Only Look Once series [12][13][14] and RetinaNet [15]) and anchor-free methods (for example, FCOS [16], CenterNet [17], and CornerNet [18]). Multi-stage detection methods use region proposal algorithms to extract foreground features from preset dense candidates in the first stage, and the bounding boxes of objects are regressed in subsequent stages. The limitation of this structure is that it reduces detection speed and cannot satisfy the real-time requirements of autonomous driving tasks. Single-stage detection methods, in contrast, detect objects and regress bounding boxes directly on the feature map, which avoids repeated computation of the feature map.
He et al. [19] proposed a detection method using CapsNet [20] for visual inspection of traffic scenes. Li et al. [21] proposed an improved Faster R-CNN for multi-object detection in complex traffic environments. Lian et al. [22] proposed attention-based feature fusion for small traffic object detection. Liang et al. [23] proposed a lightweight anchor-free detector for traffic scene object detection. However, these models cannot capture global information because they are limited by the size of the receptive field: they obtain only local information when extracting image features and can enlarge the receptive field only by increasing the convolution kernel size or stacking more convolution layers. In recent years, transformers have been introduced to computer vision as new attention-based building blocks; they achieve superior performance because they can capture global image information without enlarging the receptive field.

3. Transformers Structure

The transformer is an encoder–decoder architecture introduced by Vaswani et al. [1]. It was first used in machine translation, where it outperforms LSTM [24], GRU [25], and RNN-based models [26] (e.g., MoE [27] and GNMT [28]). The transformer extracts features by aggregating global information, which makes it well suited to long-sequence prediction and other information-heavy tasks, and it outperforms RNN-based models in natural language processing [29][30], speech processing [31], and transfer learning [32]. As a new framework, its performance in computer vision is comparable to that of CNNs. Dosovitskiy et al. [2] proposed the vision transformer, which applied the transformer to image classification. Carion et al. [3] proposed DETR, which applied the transformer to object detection. Yan et al. [33] used a transformer to predict long-term traffic flow, and Cai et al. [34] used a transformer to capture the continuity and periodicity of time series for traffic forecasting.
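For context, the following compact sketch shows the DETR-style [3] way of applying a transformer to detection: CNN feature maps are flattened into a token sequence, an encoder aggregates global context, and a decoder turns a fixed set of learned object queries into class and box predictions. Layer counts and dimensions are illustrative, and positional encodings are omitted for brevity.

# Compact DETR-style illustration: feature map -> token sequence -> transformer
# encoder-decoder with learned object queries -> per-query class and box heads.
import torch
import torch.nn as nn

dim, num_queries, num_classes = 256, 100, 91
feat = torch.randn(1, dim, 25, 25)                        # backbone output [B, C, H, W]
tokens = feat.flatten(2).permute(0, 2, 1)                 # [B, H*W, C] token sequence

transformer = nn.Transformer(d_model=dim, nhead=8,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)
object_queries = nn.Parameter(torch.randn(1, num_queries, dim))  # learned queries

hs = transformer(src=tokens, tgt=object_queries)          # [B, num_queries, dim]
class_logits = nn.Linear(dim, num_classes + 1)(hs)        # +1 for the "no object" class
boxes = nn.Linear(dim, 4)(hs).sigmoid()                   # normalized (cx, cy, w, h)
print(class_logits.shape, boxes.shape)                    # [1, 100, 92] [1, 100, 4]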

This entry is adapted from the peer-reviewed paper 10.3390/s22134833

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Long Beach, CA, USA, 2017; pp. 6000–6010.
  2. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
  3. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12346, pp. 213–229. ISBN 978-3-030-58451-1.
  4. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: San Diego, CA, USA, 2005; Volume 1, pp. 886–893.
  5. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645.
  6. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28.
  7. Chen, Z.; Shi, Q.; Huang, X. Automatic detection of traffic lights using support vector machine. In Proceedings of the 2015 IEEE Intelligent Vehicles Symposium (IV), Seoul, Korea, 28 June–1 July 2015; IEEE: Seoul, Korea, 2015; pp. 37–40.
  8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: Columbus, OH, USA, 2014; pp. 580–587.
  9. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: Santiago, Chile, 2015; pp. 1440–1448.
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  11. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Salt Lake City, UT, USA, 2018; pp. 6154–6162.
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Las Vegas, NV, USA, 2016; pp. 779–788.
  13. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 6517–6525.
  14. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  15. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
  16. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A Simple and Strong Anchor-free Object Detector. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1922–1933.
  17. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
  18. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. Int. J. Comput. Vis. 2020, 128, 642–656.
  19. He, S.; Chen, L.; Zhang, S.; Guo, Z.; Sun, P.; Liu, H.; Liu, H. Automatic Recognition of Traffic Signs Based on Visual Inspection. IEEE Access 2021, 9, 43253–43261.
  20. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. In Proceedings of the NIPS, Long Beach, CA, USA, 4–9 December 2017; pp. 3859–3869.
  21. Li, C.; Qu, Z.; Wang, S.; Liu, L. A method of cross-layer fusion multi-object detection and recognition based on improved faster R-CNN model in complex traffic environment. Pattern Recognit. Lett. 2021, 145, 127–134.
  22. Lian, J.; Yin, Y.; Li, L.; Wang, Z.; Zhou, Y. Small Object Detection in Traffic Scenes Based on Attention Feature Fusion. Sensors 2021, 21, 3031.
  23. Liang, T.; Bao, H.; Pan, W.; Pan, F. ALODAD: An Anchor-Free Lightweight Object Detector for Autonomous Driving. IEEE Access 2022, 10, 40701–40714.
  24. Graves, A. Long Short-Term Memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Studies in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2012; Volume 385, pp. 37–45. ISBN 978-3-642-24796-5.
  25. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
  26. Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent Neural Network Regularization. arXiv 2014, arXiv:1409.2329.
  27. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv 2017, arXiv:1701.06538.
  28. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144.
  29. Wang, H.; Wu, Z.; Liu, Z.; Cai, H.; Zhu, L.; Gan, C.; Han, S. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Online, 2020; pp. 7675–7688.
  30. Floridi, L.; Chiriatti, M. GPT-3: Its Nature, Scope, Limits, and Consequences. Minds Mach. 2020, 30, 681–694.
  31. Dong, L.; Xu, S.; Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Calgary, AB, Canada, 2018; pp. 5884–5888.
  32. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
  33. Yan, H.; Ma, X.; Pu, Z. Learning Dynamic and Hierarchical Traffic Spatiotemporal Features With Transformer. IEEE Trans. Intell. Transport. Syst. 2021, 1–14.
  34. Cai, L.; Janowicz, K.; Mai, G.; Yan, B.; Zhu, R. Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting. Trans. GIS 2020, 24, 736–755.