Small Object Detection and Traffic Sign Detection

The detection of traffic signs is easily affected by changes in the weather, partial occlusion, and light intensity, which increases the number of potential safety hazards in practical applications of autonomous driving.

small object detection
multi-scale feature fusion
loss function
data

1. Introduction

The traffic sign detection system is an important part of an intelligent transportation system. It can effectively provide the driver with current road traffic information, and it can also ensure the operational safety of the intelligent vehicle control system. In recent years, due to the far-reaching impact of this technology on traffic safety, this field has been deeply studied by many researchers.

Traditional traffic sign detection algorithms mainly rely on color segmentation, combined with features such as shape and contour for feature extraction, and then recognize traffic signs by classifying the extracted features with classifiers [1,2,3,4,5,6]. The handcrafted features in traditional techniques are labor-intensive to design and lack sufficient robustness to deal with complex and changeable traffic environments. In recent years, traffic sign detection algorithms based on deep convolutional neural networks have been widely developed. They are mainly divided into two categories: the two-stage object detection algorithms represented by the region-based convolutional network (R-CNN) series [7,8,9], and the one-stage object detection algorithms represented by the you only look once (YOLO) series [10,11,12] and the single shot multibox detector (SSD) series [13,14]. The two-stage algorithms achieve remarkable accuracy, but their lack of real-time performance makes most of them difficult to apply to practical detection tasks. Researchers are more concerned with one-stage algorithms because they predict the object categories and generate the bounding boxes simultaneously, which makes them suitable for detection tasks with high real-time requirements. Zhang et al. [15] introduced a multi-scale spatial pyramid pooling block based on the YOLOv3 [10] algorithm, aiming to accurately realize the real-time location and classification of traffic signs. The mean average precision (mAP) of the algorithm on the Tsinghua-Tencent 100K (TT100K) dataset [16] was satisfactory, but it detected only 23.81 frames per second (FPS). Wu et al. [17] proposed a traffic sign detection model based on SSD [13] combined with a receptive field module (RFM) and path aggregation network (PAN) [18], which achieved a 95.4% and 95.9% mAP on the German Traffic Sign Detection Benchmark (GTSDB) dataset [19] and CSUST Chinese Traffic Sign Detection Benchmark (CCTSDB) dataset [20], respectively, but it places high demands on the storage capacity and computing power of the device. Yan et al. [21] proposed an auxiliary information enhanced YOLO algorithm based on YOLOv5, which achieved an 84.8% mAP and a detection speed of 100.7 FPS on the TT100K dataset, but its robustness against complex scenes such as extreme weather and lighting changes has not been verified.

The detection of traffic signs in harsh environments such as fog, strong light, and insufficient light has attracted the attention of many scholars. Hnewa et al. [22] proposed a novel multi-scale domain adaptive YOLO framework, which extracts domain-invariant features from blurred long-distance image regions and has a significant effect on foggy image datasets. Fan et al. [23] proposed a multi-scale traffic sign detection algorithm based on an attention mechanism, which can effectively reduce the effect of illumination changes on traffic sign detection. Zhou et al. [24] proposed an attention network based on high-resolution traffic sign classification to overcome the complex factors of icy and snowy environments. However, each of the above methods targets a single scene and cannot be effectively applied to multi-scene detection tasks.

2. Small Object Detection

There are usually two ways to define small objects. One is a relative size definition: the object must be smaller than 0.12% of the original image size to be regarded as a small object. This research takes this definition as a reference. The other is an absolute size definition: the object must be smaller than 32 × 32 pixels. Because such objects occupy so few pixels, small object detection has always been a difficult topic in the field of object detection. At present, multi-scale fusion, receptive-field-based methods, high-resolution detection, and context-aware detection are the main approaches to small object detection. In high-resolution detection [25,26], high-resolution feature maps are built and used for prediction to obtain fine details, but context information is lost. In addition, to obtain the context information of the object, several methods [27,28] use top-down and bottom-up paths to fuse the features of different layers, which can greatly increase their receptive field. In this research, the feature pyramid network (FPN) [29] + PAN was used as the feature fusion module of the network, and a multiple attention mechanism was introduced in the model backbone to enhance the learning of context and expand the receptive field, so as to effectively improve the accuracy of small object detection.
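The two definitions above can disagree on the same bounding box, which a short sketch makes concrete. The Python below is an illustration only: the function names and example sizes are hypothetical, and it assumes that "size" means area in pixels (box area relative to image area for the relative definition, box area below 32 × 32 for the absolute one).

```python
def is_small_relative(box_w, box_h, img_w, img_h, max_ratio=0.0012):
    """Relative definition: box area is below 0.12% of the image area.

    Assumption: "size" is interpreted as area in pixels.
    """
    return (box_w * box_h) / (img_w * img_h) < max_ratio


def is_small_absolute(box_w, box_h, max_side=32):
    """Absolute definition: box area is below 32 x 32 pixels.

    Assumption: the threshold is applied to area, not to each side.
    """
    return box_w * box_h < max_side * max_side


# Hypothetical example 1: a 40 x 40 sign in a 2048 x 2048 image.
# Small by the relative definition, not small by the absolute one.
rel_a = is_small_relative(40, 40, 2048, 2048)
abs_a = is_small_absolute(40, 40)

# Hypothetical example 2: a 30 x 30 sign in a 640 x 640 image.
# Small by the absolute definition, not small by the relative one.
rel_b = is_small_relative(30, 30, 640, 640)
abs_b = is_small_absolute(30, 30)
```

Under these assumptions, the relative definition scales with image resolution, so a sign of a given pixel size is more likely to count as small in a high-resolution image than in a low-resolution one.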

3. Traffic Sign Detection

The key to traffic sign detection is to extract distinguishable features. Due to limitations in computing power and available dataset size, the performance of traditional methods depends on the effectiveness of manually extracted features, as in color-based [30,31] and shape-based methods [32,33]. These methods are also easily affected by factors such as extreme weather, illumination changes, variable shooting angles, and obstacles, and can only be applied to limited scenes. To promote traffic sign detection in real scenes, many authors have published excellent traffic sign datasets, such as the Laboratory for Intelligent and Safe Automobiles (LISA) dataset [34], GTSDB, CCTSDB, and TT100K. Since the TT100K dataset covers partial occlusion, illumination changes, and viewing angle changes, it is closer to the real scene than other datasets. With the development of deep learning technology and the publication of several excellent public datasets, the performance of traffic sign detection algorithms based on deep learning has improved significantly compared with the traditional traffic sign detection algorithms. Zhang et al. [35] used the Cascade R-CNN [8] combined with the sample balance method to detect traffic signs, achieving strong detection results on both CCTSDB and GTSDB. Sun et al. [36] proposed a feature expression enhanced SSD detection algorithm, which achieved an 81.26% and 90.52% mAP on TT100K and CCTSDB, respectively. However, the detection speed of this algorithm was only 22.86 FPS and 25.08 FPS, respectively, which could not achieve real-time performance. Liu et al. [37] proposed a symmetric traffic sign detection algorithm, which optimizes the delay problem by reducing the computing overhead of the network and, at the same time, improves the traffic sign detection performance in complex environments, such as scale and illumination changes, achieving a 97.8% mAP and 84 FPS on the CCTSDB dataset. However, the integration of multiple modules leads to insufficient global information acquisition.
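The FPS figures quoted above are throughput measurements. The cited works do not state their timing protocol here, so the following Python is only a minimal sketch of one common way to obtain such a number: time a detector over a batch of frames and divide the frame count by the elapsed time. The names `measure_fps` and `dummy_detect` are hypothetical, and the dummy detector merely sleeps to stand in for a real model.

```python
import time


def measure_fps(detect, frames, warmup=2):
    """Return the number of frames per second processed by `detect`.

    `detect` is any callable taking one frame. The first `warmup`
    frames are run untimed so one-off start-up costs are excluded.
    """
    frames = list(frames)
    for frame in frames[:warmup]:
        detect(frame)

    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed


def dummy_detect(frame):
    """Hypothetical stand-in for a detector: about 10 ms per frame."""
    time.sleep(0.01)
    return []


# Ten placeholder frames; a real benchmark would pass decoded images.
fps = measure_fps(dummy_detect, range(10))
```

Reported FPS depends heavily on hardware, input resolution, and whether pre- and post-processing are included in the timed region, so figures from different papers are only roughly comparable.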