Object detection methods based on deep learning typically require devices with ample computing capability, which limits their deployment in resource-constrained environments such as embedded devices.
1. Introduction
As one of the fundamental tasks in computer vision, object detection is widely used in face detection, object tracking, image segmentation, and autonomous driving
[1]. The objective is to localize and classify specific objects in an image: accurately finding all objects of interest and marking the position of each with a rectangular bounding box
[2][3]. In recent years, the computer vision community has increasingly focused on designing deeper networks to extract valuable feature information, resulting in improved performance
[4][5][6][7][8]. However, because these models contain vast numbers of parameters, they often consume significant computational resources. For example, Khan et al.
[9] proposed an end-to-end scale-invariant head detection framework that models a set of specialized scale-specific convolutional neural networks with different receptive fields to handle scale variations. Wang et al.
[10] introduced a pyramid structure into the transformer framework, using a progressive shrinking strategy to control the scale of the feature maps. While these models demonstrate outstanding detection accuracy, they rely heavily on powerful GPUs to achieve rapid detection speed
[11]. This makes it difficult to balance accuracy and inference speed on mobile devices with limited computational resources
[12][13][14]. In short, the complex network architectures that deliver high detection accuracy usually depend on powerful graphics processing units (GPUs) to achieve fast detection speeds
[15]. With the rapid development of technologies such as smartphones, drones, and unmanned vehicles, deploying neural networks on devices with limited storage and computing power has become an urgent need. Under computing power and storage space constraints, lightweight real-time networks have become a popular research topic for applying deep learning in embedded applications
[16].
Recently, some researchers have reduced network parameter counts and model sizes by optimizing the network structure, as in SqueezeNet, MobileNetv1-v3
[17][18][19], ShuffleNetv1-v2
[20][21], Xception
[22], MixNet
[23], EfficientNet
[24], etc. The MobileNet series replaces traditional convolutions with depth-wise separable convolutions, achieving results similar to standard convolution while greatly reducing the number of computations and parameters. The ShuffleNet series uses group convolution to reduce the number of model parameters and applies channel shuffling to reorganize the feature maps produced by group convolution (a minimal sketch of both operations is given at the end of this section). Other researchers have proposed regression-based one-stage object detectors, such as SSD
[25], YOLOv1-v4
[26][27][28][29], RetinaNet
[30], MimicDet
[31], etc. Rather than taking two shots, as in the R-CNN series, one-stage detectors predict target locations and category information directly from a single network without region proposals. Building on the regression concept of YOLOv1, SSD uses predefined boxes of different scales and aspect ratios for prediction and extracts feature maps at multiple scales for detection. Although SSD is more accurate than YOLOv1, it does not perform well on small objects. YOLOv3 uses the Darknet backbone network to mine high-level semantic information, which greatly improves classification performance, and adopts a structure similar to a feature pyramid network for feature fusion to enhance small-target detection accuracy. Because the large number of easily classified negative samples in the training phase can degrade the model, RetinaNet proposes focal loss, built on standard cross-entropy loss, to counter category imbalance effectively, acting as a form of hard example mining (a sketch is given at the end of this section). To improve the accuracy of a one-stage detector, MimicDet uses features generated by a two-stage detector to train the one-stage detector during the training phase; in the inference phase, however, MimicDet predicts directly with the one-stage branch so that detection remains fast. The YOLO series achieves an excellent balance between accuracy and speed and is widely used for target detection in practical scenarios. Nevertheless, YOLO models have complex network structures and large numbers of parameters, so they require vast computing resources and considerable storage space on embedded devices, and this high computational cost limits their ability to perform multiple real-time tasks on computationally limited platforms
[32]. To reduce computing resource consumption, lightweight YOLO methods use a smaller feature extraction network to cut parameters and improve detection speed, such as the latest YOLOv4-Tiny
[33]. Therefore, when performing object detection on embedded devices, improving the detection accuracy while achieving real-time performance is a significant problem to be solved.
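To make the two lightweight-convolution ideas above concrete, the PyTorch-style sketch below shows a MobileNet-style depth-wise separable convolution block and a ShuffleNet-style channel shuffle. It is a minimal sketch rather than the exact blocks from the cited papers; the 3 × 3 kernel, batch normalization placement, and ReLU activation are common defaults assumed here.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """MobileNet-style block: per-channel 3x3 conv followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # groups=in_ch: each filter sees exactly one input channel,
        # removing the expensive cross-channel multiplications.
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1,
                                   groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # The 1x1 pointwise conv restores cross-channel mixing at low cost.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

def channel_shuffle(x, groups):
    """ShuffleNet-style reordering so later group convolutions see all groups."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # interleave the groups
    return x.view(n, c, h, w)

# Example: a separable block followed by a shuffle over 4 groups.
y = channel_shuffle(DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 56, 56)), 4)
```

For a 3 × 3 kernel, the separable block needs roughly an eighth to a ninth of the multiply-accumulate operations of a standard convolution with the same input and output channels, which is the source of the savings described above.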
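The focal loss used by RetinaNet can likewise be written in a few lines. The sketch below follows the published formulation FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t) in its binary (sigmoid) form; alpha = 0.25 and gamma = 2 are the defaults reported in the RetinaNet paper, and normalization (e.g., by the number of positive anchors) is left to the caller.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss over anchor logits; targets are 0/1 floats."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma is near 0 for well-classified anchors, so the many
    # easy negatives contribute little and hard examples dominate training.
    return (alpha_t * (1.0 - p_t) ** gamma * ce).sum()

# Example: loss over 1000 anchors, most of which are easy negatives.
loss = focal_loss(torch.randn(1000), (torch.rand(1000) > 0.95).float())
```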
2. Attention Mechanism
In recent years, attention mechanisms have been widely used in various fields of computer vision to enhance important features and suppress irrelevant noise. These mechanisms substantially improve model accuracy; representative examples include SENet
[34], CBAM
[35], Non-local
[36], SKNet
[37], GCNet
[38], NAM
[39], ECANet
[40], SA-Net
[41], SimAM
[42], GAM
[43], etc. SENet explicitly models the correlation between feature channels and automatically learns channel-wise weights for feature selection (a minimal sketch is given at the end of this section). CBAM attends to both spatial and channel information and concatenates the feature maps produced by average and maximum pooling operations to reduce feature loss, making the model focus on the target itself rather than the background. SKNet uses convolution kernels of different sizes to extract semantic features and dynamically adjusts the receptive field by aggregating feature information from multiple branches. Building on SENet and Non-local, GCNet proposes a simple global context modeling framework that mines long-distance dependencies while reducing computational pressure. NAM applies a weight sparsity penalty to the attention module, improving computational efficiency while maintaining similar performance. ECANet improves on the SENet module by proposing a local cross-channel interaction strategy without dimensionality reduction and an adaptive method for selecting the size of the one-dimensional convolution kernel, thereby improving performance. Although CBAM brings performance gains, it also increases computational complexity to a certain extent. SA-Net introduces a channel shuffle method that applies spatial and channel attention in parallel within blocks, enabling efficient integration of the two types of attention. Unlike common channel and spatial attention modules, SimAM introduces an attention mechanism without any trainable parameters, derived from neuroscience theory and the principle of linear separability. GAM proposes a global attention mechanism that introduces channel attention and multi-layer perceptrons to reduce information diffusion and amplify global interactive representations, thereby improving the performance of deep neural networks.
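To make the channel-attention idea concrete, the following PyTorch-style sketch implements an SE-style block along the lines described above: global average pooling (squeeze), a bottleneck of two fully connected layers (excitation), and sigmoid gating of the channels. This is a minimal sketch rather than the reference implementation; the reduction ratio of 16 is a commonly used default assumed here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: learn per-channel weights and rescale the input."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.size()
        w = x.mean(dim=(2, 3))            # squeeze: global average pooling -> (n, c)
        w = self.fc(w).view(n, c, 1, 1)   # excitation: per-channel gates in [0, 1]
        return x * w                      # reweight the feature maps

# Example: gate a 64-channel feature map.
y = SEBlock(64)(torch.randn(2, 64, 32, 32))
```

Because the block adds only two small fully connected layers, it can be dropped into an existing backbone at negligible parameter cost, which is one reason SE-style gating is popular in lightweight detectors.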
3. YOLOv4, YOLOv4-CSP, and YOLOv4-Tiny Networks
YOLOv4 is an evolution of YOLOv3 whose purpose is a real-time object detection network that can be applied in actual working environments. YOLOv4 proposes the CSPDarknet53 backbone network to effectively reduce repeated gradient learning and improve the learning ability of the network. For data augmentation, YOLOv4 uses mosaic augmentation to combine four images into one, which is equivalent to increasing the minibatch size, and adds self-adversarial training (SAT), in which the network first modifies the input image through a backward pass, generating an adversarial example of itself, before normal training (a minimal mosaic sketch is given at the end of this section). In addition, YOLOv4 uses modules such as ASFF
[44], ASPP
[45], and RFB
[46] to expand the receptive field and introduce attention mechanisms to emphasize important features. Based on YOLOv4, YOLOv4-CSP scales down the network width, network depth, and image resolution to achieve an optimal trade-off between speed and accuracy. Compared with YOLOv4, YOLOv4-CSP converts the first CSP stage into the original Darknet residual layer and modifies the PAN architecture of YOLOv4 according to the CSP approach. Moreover, YOLOv4-CSP inserts an SPP module in the middle of the modified PAN structure. To reduce the computational cost on embedded devices, YOLOv4-Tiny is a simplified version of YOLOv4 and YOLOv4-CSP. It uses a lightweight backbone network called CSPDarknet-Tiny and directly applies a feature pyramid network (FPN)
[47] instead of a path aggregation network (PANet)
[48] to reduce computational complexity. In the inference stage, multiscale feature maps are first fused via the FPN; then, the category scores and offsets of each predefined anchor are predicted by a 1 × 1 convolution kernel, and the predicted bounding boxes are postprocessed with non-maximum suppression (NMS)
[49] to obtain the final detection results (a minimal sketch of greedy NMS is given below). Although YOLOv4-Tiny offers reasonable accuracy and fast detection speed, its regression accuracy for small and medium targets is relatively low, which the proposed Mini-YOLOv4 network improves.
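The greedy NMS step mentioned above can be sketched as follows. This is a minimal single-class version assuming boxes in (x1, y1, x2, y2) corner format; the IoU threshold of 0.45 is a common default rather than a value taken from the cited papers.

```python
import torch

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort(descending=True)   # process boxes by descending score
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        # IoU of the current top box with all remaining boxes.
        xx1 = torch.maximum(x1[i], x1[order[1:]])
        yy1 = torch.maximum(y1[i], y1[order[1:]])
        xx2 = torch.minimum(x2[i], x2[order[1:]])
        yy2 = torch.minimum(y2[i], y2[order[1:]])
        inter = (xx2 - xx1).clamp(min=0) * (yy2 - yy1).clamp(min=0)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # discard heavily overlapping boxes
    return torch.tensor(keep)
```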
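Finally, as referenced in the data-augmentation discussion, the mosaic operation can be sketched as below. This minimal NumPy version only composes the pixels; a real implementation also resizes the source images and remaps their bounding-box labels, and the gray padding value of 114 and the output size are assumptions borrowed from common YOLO implementations.

```python
import random
import numpy as np

def mosaic(images, out_size=608):
    """Combine four HxWx3 uint8 images into one mosaic around a random center."""
    assert len(images) == 4
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray padding
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    # Target regions: top-left, top-right, bottom-left, bottom-right.
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        # Crop a window of the needed size from the source image
        # (label remapping is omitted in this sketch).
        crop = img[:h, :w]
        canvas[y1:y1 + crop.shape[0], x1:x1 + crop.shape[1]] = crop
    return canvas
```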