YOLOv5-AC: Comparison
Please note this is a comparison between Version 2 by HaoHui Lv and Version 1 by HaoHui Lv.

YOLOv5-AC: Attention Mechanism-Based Lightweight YOLOv5 for Track Pedestrian Detection. 

In response to the dangerous behavior of pedestrians roaming freely on unsupervised train tracks, real-time pedestrian detection is urgently required to ensure the safety of trains and people. To address the low accuracy of railway pedestrian detection, the high missed-detection rate of target pedestrians, and the poor retention of non-redundant boxes, YOLOv5 is adopted as the baseline and improved. First, L1 regularization is applied to the BN layer, and the layers with smaller influence factors are removed through sparse training to achieve model pruning. Next, the context extraction module is applied to the feature extraction network, and the input features are fully extracted using receptive fields of different sizes. In addition, both the context attention module (CxAM) and the content attention module (CnAM) are added to the FPN part to correct the target position deviation introduced during feature extraction, which improves detection accuracy. Furthermore, DIoU_NMS replaces NMS as the prediction box screening algorithm to reduce the loss of detection targets when targets overlap heavily. Experimental results show that the AP of our YOLOv5-AC model for pedestrians is 95.14%, the recall is 94.22%, and the frame rate is 63.1 FPS. Compared with YOLOv5, AP and recall increased by 3.78% and 3.92%, respectively, while the detection speed increased by 57.8%. The experimental results verify that YOLOv5-AC is an effective and accurate method for pedestrian detection in railways.

 

  • pedestrian detection
  • deep learning
  • model pruning
  • context extraction module
  • attention module
  • DIoU_NMS


1、Introduction

As rail transportation plays an increasingly important role in China, the safety of rail transit operations has attracted more and more attention. As a consequence, it is of great significance to carry out research on pedestrian detection and abnormal state monitoring at railway stations to ensure the safety of pedestrians. Pedestrians usually move fast and irregularly on the railway track, while the target is very small and body positions overlap heavily within the camera's field of view. In addition, complex and uncertain environmental factors such as trees, weeds, and telephone poles around the railway track create major obstacles to pedestrian detection. In order to ensure pedestrian safety, we experimentally improve YOLOv5 through the following five points to achieve better pedestrian detection results.

(1) L1 [1] regularization is added to constrain the scaling factor of the BN [2] layer to make the activation coefficients sparse. Next, the modified model is sparsely trained so that the sparse layers can be cut out. Repeating this pruning yields a very compact model.

(2) In Backbone, the CEM module is introduced to fully extract the features of different scales. The CxAM module is introduced to extract context semantic information to improve recognition accuracy. The CnAM module is introduced to correct the position of F5 layer features and improve the accuracy of target box regression.

(3) DIoU_NMS is used instead of NMS to filter prediction boxes, which avoids eliminating the prediction boxes of different targets that overlap heavily.

(4) We collected a certain number of datasets along with a certain number of relevant public datasets to provide data support for the verification of the actual effect of the improved model.

(5) According to the direction of improvement, a number of related ablation experiments were designed to verify the validity of each contribution.

2、Improved YOLOv5-AC

2.1、YOLOv5 Network Structure

The YOLO series of algorithms, from YOLOv1 to YOLOv5 [3], has been among the most popular in the field of target detection due to its fast and efficient performance. The weight file of the latest generation, YOLOv5, is only 28 MB, which makes it ideal as an initial model. Therefore, YOLOv5 is selected as the experimental object for algorithm improvement in this experiment.

The network structure of YOLOv5 generally follows the previous series. The feature extraction network of the backbone adopts CSPDarknet [4]. The input newly adds the Focus structure, which slices the input image, reduces its size, and increases its depth to improve the speed of feature extraction. At the same time, the CSP2 structure is deployed in the neck part to enhance the ability of network feature fusion. The optimization function adopts Adam [5] and SGD [6]. Focus and Conv are the structures that mainly contain the convolution kernels and residual components. As a consequence, the network depth can be changed by controlling the number of residual components in Conv, while the network width can be adjusted by controlling the number of convolution kernels in Focus and Conv. By regulating these parameters, YOLOv5 has launched four models ranging from small to large: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The YOLOv5 network structure is shown in Figure 1.

Figure 1. YOLOv5 algorithm structure diagram.​​
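As a rough illustration of the slicing step described above, the following sketch rearranges every 2 × 2 pixel neighborhood of a single-channel image into four channels, so that height and width are halved while the depth is quadrupled. It uses plain nested lists; the function name `focus_slice` is hypothetical, the sampling order of the four offsets is an assumption, and the convolution that normally follows the slicing is omitted.

```python
def focus_slice(image):
    """Split an H x W single-channel image into four (H/2) x (W/2) channels.

    Each output channel keeps every second pixel, starting from one of the
    four positions of a 2 x 2 neighborhood (the ordering is an assumption).
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]  # (row, column) start positions
    return [
        [row[dc::2] for row in image[dr::2]]
        for dr, dc in offsets
    ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
channels = focus_slice(image)  # 4 channels, each 2 x 2
```

No pixel is discarded: the 4 × 4 input becomes a 2 × 2 map with four channels, which is why the spatial size shrinks while the depth grows.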

​2.2、Sparse Training and Model Pruning

YOLOv5 is already a very lightweight detection network whose trained weight model generally does not exceed 30 MB, but this is still too large for some embedded devices. If we simply reduce the network input size, for example from 640 to 320, the model's footprint is reduced accordingly, but the detection performance also suffers a considerable loss. Therefore, following the network slimming method proposed by Zhuang Liu et al. [7], we add an L1 regularization term to the model to constrain the scaling factors of the BN layers, which pushes the coefficients that are close to 0 even closer to 0. The parameter layers with little influence on forward propagation are then eliminated after sparse training. Repeating these operations yields a very compact and efficient network model.

The principle of YOLOv5 network channel pruning is shown in Figure 2.

Figure 2. Principle of pruning.
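The sparse training idea can be sketched in a few lines: the L1 term adds λ·Σ|γ| to the task loss, and its subgradient λ·sign(γ) pulls every BN scaling factor γ toward zero at each update. The sketch below is a minimal illustration under stated assumptions, not the authors' training code: `sparse_step`, the penalty weight `lam`, the learning rate `lr`, and the example values are all hypothetical.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def l1_penalty(gammas, lam):
    """L1 sparsity term added to the task loss: lam * sum(|gamma|)."""
    return lam * sum(abs(g) for g in gammas)


def sparse_step(gammas, task_grads, lam, lr):
    """One gradient step on (task loss + L1 penalty) for BN scaling factors.

    The L1 term contributes the subgradient lam * sign(gamma).
    """
    return [g - lr * (tg + lam * sign(g)) for g, tg in zip(gammas, task_grads)]


gammas = [0.9, 0.05, -0.4]          # hypothetical BN scaling factors
penalty = l1_penalty(gammas, lam=0.01)
updated = sparse_step(gammas, task_grads=[0.0, 0.0, 0.0], lam=0.01, lr=1.0)
```

With the task gradient set to zero for clarity, each factor moves toward zero by the same absolute amount, so factors that are already small reach the neighborhood of zero first.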

First, we add L1 regularization to the model and perform the corresponding sparse training. Then, channel pruning is performed on the trained model. Finally, the training hyperparameters are fine-tuned so that the inference results of the pruned model are as good as possible. The algorithm implementation process is shown in Figure 3.

Figure 3. Process of pruning.​​​​​​
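The channel pruning step of this process can be sketched as follows, assuming a global threshold taken from the sorted magnitudes of all BN scaling factors, in the spirit of network slimming [7]. The function name `prune_masks`, the `prune_ratio` parameter, and the safeguard that keeps at least one channel per layer are assumptions made for this illustration.

```python
def prune_masks(layer_gammas, prune_ratio):
    """Return per-layer keep masks from a global threshold on |gamma|.

    layer_gammas: list of lists, one list of BN scaling factors per layer.
    prune_ratio:  fraction of all channels to remove (0 <= ratio < 1).
    """
    all_values = sorted(abs(g) for layer in layer_gammas for g in layer)
    cut = int(len(all_values) * prune_ratio)
    threshold = all_values[cut] if cut < len(all_values) else float("inf")
    masks = []
    for layer in layer_gammas:
        mask = [abs(g) >= threshold for g in layer]
        if not any(mask):
            # Assumed safeguard: keep the strongest channel so the layer
            # is never removed entirely.
            best = max(range(len(layer)), key=lambda i: abs(layer[i]))
            mask[best] = True
        masks.append(mask)
    return masks


layer_gammas = [[0.8, 0.01, 0.5], [0.02, 0.03, 0.9]]  # hypothetical values
masks = prune_masks(layer_gammas, prune_ratio=0.5)
```

After the masks are applied, the remaining channels would be copied into a smaller network, which is then fine-tuned as described above.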

2.3、AC_FPN Structure

After the feature maps are processed by the three modules (CEM, CxAM, and CnAM), the output is sent to the deconvolution layer of the FPN network for feature fusion. The improved AC_FPN [8] structure is shown in Figure 4.

Figure 4. AC_FPN structure diagram.

​2.4、Improved NMS

Non-Maximum Suppression (NMS) is used in the post-processing stage of target detection to screen the many candidate target boxes. YOLOv5 adopts the traditional NMS [9] method. When two different targets overlap heavily, this method usually removes the box of the occluded target. In a scene with a large number of targets, many of them overlap heavily, and NMS removes the candidate boxes of the occluded targets as redundant information, which is unsuitable for a model that aims to detect accurately. In this paper, DIoU_NMS is used to replace NMS. DIoU_NMS introduces a parameter β that acts on the center point distance between the two boxes. When β → ∞, DIoU_NMS degenerates into traditional NMS. When β → 0, two boxes are both retained by DIoU_NMS as long as their center points do not coincide perfectly. As a consequence, the value of β can be adjusted between 0 and ∞ according to the actual situation to best suppress redundant boxes. Its classification score update formula is defined as Formula (11):

$$ s_i = \begin{cases} s_i, & IoU - R_{DIoU}(M, B_i) < \varepsilon \\ 0, & IoU - R_{DIoU}(M, B_i) \geq \varepsilon \end{cases} \tag{11} $$

where s_i is the classification score, ε is the NMS threshold, R_DIoU(M, B_i) is the penalty term, M is the predicted box with the highest score, and B_i is another box.
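The screening rule can be illustrated with a short sketch. It assumes boxes in (x1, y1, x2, y2) form and a penalty of the form (ρ²/c²)^β, where ρ is the distance between the two box centers and c is the diagonal of the smallest box enclosing both. This form is an assumption chosen to be consistent with the limiting behavior of β described above, and the names `diou_nms`, `eps`, and `beta` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def diou_penalty(a, b, beta):
    """Assumed penalty (rho^2 / c^2) ** beta between two boxes."""
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) / 2) ** 2 \
         + (((a[1] + a[3]) - (b[1] + b[3])) / 2) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return (rho2 / c2) ** beta if c2 > 0 else 0.0


def diou_nms(boxes, scores, eps=0.5, beta=1.0):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)  # box M with the highest remaining score
        keep.append(m)
        # A box survives only if IoU minus the penalty stays below eps.
        order = [i for i in order
                 if iou(boxes[m], boxes[i])
                 - diou_penalty(boxes[m], boxes[i], beta) < eps]
    return keep


boxes = [(0, 0, 10, 10), (3, 0, 13, 10)]  # two overlapping pedestrians
scores = [0.9, 0.8]
kept_small_beta = diou_nms(boxes, scores, eps=0.5, beta=0.5)
kept_large_beta = diou_nms(boxes, scores, eps=0.5, beta=50.0)
```

In this example the IoU of the two boxes is about 0.54, above the threshold of 0.5. With a very large β the penalty vanishes and the second box is suppressed, as in traditional NMS, whereas with β = 0.5 the center distance penalty is large enough to keep both boxes.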

2.5、 Improved YOLOv5-AC Network Structure

The features are further extracted by adding a context extraction module (CEM) in Backbone. CxAM is added to extract the context semantics. CnAM is applied to correct the feature positions of the F and F5 layers. Post-processing uses DIoU_NMS in place of NMS. The improved YOLOv5-AC structure is shown in Figure 5.

Figure 5. YOLOv5-AC structure diagram.

3、Experiment​​​​

​3.1、Training Process​​​​​​​

The experiment will proceed as shown in Table 1.

Table 1. Procedure of the experiment.

3.2、 Training Metrics​​

Accuracy and Recall are selected as metrics to compare the test results of the original model and the improved model. The calculation formulas of Accuracy and Recall are as follows:

TP is the number of people on the track that were correctly detected. FP is the number of non-pedestrian objects that were incorrectly detected as people. FN is the number of people on the track that were not detected. TN represents no one on the track and no one detected at the same time. The relationship can be intuitively understood through the confusion matrix in Table 2.

Table 2. Confusion matrix.​
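As a small worked example using the standard definitions of the four confusion matrix counts, the sketch below computes recall as TP / (TP + FN) together with two candidate readings of the "Accuracy" metric: precision, TP / (TP + FP), which is the quantity usually paired with recall in detection, and classification accuracy, (TP + TN) / (TP + FP + FN + TN). Which of the two the original formulas used is an assumption left open here; the function name `detection_metrics` and the counts are hypothetical.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, fn, tn=0):
    """Compute precision, recall, and accuracy from confusion matrix counts."""
    return {
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical counts, for illustration only.
metrics = detection_metrics(tp=90, fp=10, fn=5, tn=20)
```

With these counts, precision is 90 / 100, recall is 90 / 95, and accuracy is 110 / 125.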

References

[1] Kang, G.; Dong, X.; Zheng, L.; Yang, Y. Patchshuffle regularization. arXiv 2017. preprint.

[2] Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6 July 2015; pp. 448–456.

[3] YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 9 June 2020).

[4] Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13 June 2020; pp. 390–391.

[5] Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014. preprint.

[6] Zheng, S.; Meng, Q.; Wang, T.; Chen, W.; Yu, N.; Ma, Z.M.; Liu, T.Y. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6 August 2017; pp. 4120–4129.

[7] Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22 October 2017; pp. 2736–2744.

[8] Cao, J.; Chen, Q.; Guo, J.; Shi, R. Attention-guided context feature pyramid network for object detection. arXiv 2020. preprint.

[9] Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR'06), Washington, DC, USA, 20 August 2006; Volume 3, pp. 850–855.
