Lightweight Multi-Target Recognition Model for Live Streaming Scenes

Please note this is a comparison between Version 1 by Kai Qiao and Version 2 by Peter Tang.

The commercial potential of live e-commerce is being continuously explored, and machine vision algorithms are gradually attracting the attention of marketers and researchers. During live streaming, the visuals can be effectively captured by algorithms, thereby providing additional data support.

model optimization
object detection
attention mechanism
live streaming

1. Introduction

Live e-commerce has emerged as a prominent marketing trend, with its role as a powerful sales-boosting tool being widely embraced globally [1]. Since 2019, leading global retailers like Amazon and QVC have established their own live video shopping platforms. In particular, China has witnessed a significant surge in the user base of live e-commerce, reaching a staggering 469 million in 2022, indicating its immense commercial potential. The utilization of real-time marketing strategies in live streaming scenarios effectively conveys sensory cues to viewers, thereby stimulating consumer purchases [2]. Consequently, the ability to capture these sensory cues during live streaming has become increasingly crucial.

In live streaming, consumers’ visual attention centers on the anchor and the commodity, since both strongly influence purchasing decisions. Object detection algorithms from machine vision are well suited to extracting such visual cues. Machine vision, a mainstream field within deep learning, encompasses subfields including scene recognition, object recognition, object detection, and video tracking [3]. Among these, deep learning-based object detection models have advanced considerably since the introduction of Region-based Convolutional Neural Networks (R-CNN), with notable improvements in both accuracy and speed [4]. Deep learning-based object detection techniques fall into two groups: single-stage and two-stage. Two-stage detectors, such as R-CNN and Fast R-CNN, achieve higher accuracy but require significant computational resources. Single-stage detectors, by contrast, are lightweight and offer fast processing speeds. The YOLO algorithm, widely employed in practical applications, is a single-stage detector that achieves accuracy comparable to two-stage methods [5].

Attention mechanisms have been introduced into machine vision with considerable success. In general, an attention mechanism is a dynamic selection process that adaptively weights the input features, yielding significant accuracy improvements in object recognition at the cost of additional computation. Attention mechanisms such as Shuffle Attention (SA), the Convolutional Block Attention Module (CBAM), and Coordinate Attention (CA) have been developed to keep this overhead small and can be easily integrated into mobile network modules [6]. In recent years, researchers have been actively exploring lightweight modules such as GhostNet, MobileNetV3, and BlazeFace [7,8]. Additionally, many scholars have attempted to refine the backbone of YOLOv5 with lightweight modules and to incorporate attention mechanisms, aiming to strike a balance between accuracy and computational efficiency.

Qi et al. [9] integrated the Squeeze and Excitation (SE) attention mechanism into YOLOv5 for tomato virus disease identification, achieving higher accuracy. However, this modification increased inference time compared to the original model, and the attention mechanism consumed a significant amount of computational resources. Xu et al. [10] replaced the YOLOv5 backbone network with ShuffleNetV2 and integrated the CA attention mechanism, achieving a favorable balance between the evaluation indicators for mask detection. Li et al. [11] enhanced the backbone of the YOLOv5 model using GhostNet and incorporated the CA attention mechanism to detect anchor expressions in live streaming scenarios, yielding promising results. In live streaming scenes, however, relying solely on facial expressions is insufficient to capture the rich visual cues.

2. Deep Learning and Emotion Recognition in Live Streaming Scenarios

Emotions, as fundamental human behaviors, play a significant role in information processing and can trigger corresponding actions [12]. The impact of emotions on human behavior has been demonstrated across various domains such as online comments, advertising marketing, TV shopping, and live commerce [13,14,15]. The generation of emotions in live streaming scenarios is complex, with sensory cues being important factors in emotional arousal. As a result, sensory marketing has gained increasing attention [16]. Some researchers have focused on manipulating emotions through sensory cues, such as smell and music [17,18]. Rich sensory stimuli can influence consumer emotions and lead to impulsive buying, and it has been confirmed that impulsive buying behavior is primarily driven by emotions [2]. Therefore, how to evoke consumer emotions through sensory cues to promote sales in live streaming environments remains an important topic.

The development of deep learning has provided strong technical support for emotion recognition. With the emergence of Convolutional Neural Networks (CNNs), object detection models that recognize emotions from facial expressions have made significant strides [19]. Real-time or near real-time speech emotion recognition algorithms have also greatly improved with the development of deep learning, moving away from older frameworks such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) [20]. In live streaming scenarios, viewers are often attracted by the anchors in the broadcasting room, and visual cues, as the most intuitive influencing factor, should be carefully considered. According to the theory of emotional contagion [21], the emotions of the anchor exert a significant influence on the emotions of the audience.

3. Application of YOLOv5 Algorithm for Object Detection

Traditionally, feature extraction in object detection relied heavily on manual feature design, which often resulted in poor generalization. Since the introduction of R-CNN, Convolutional Neural Networks (CNNs) have become the mainstream framework for object detection, thanks to their remarkable performance and excellent feature extraction capabilities. Deep learning-based object detection algorithms are of two main types: one-stage and two-stage. One-stage algorithms directly predict the object’s coordinates and class through regression, offering faster recognition speeds. Two-stage algorithms employ region generation for target classification and calibration, leading to higher accuracy. However, the two-stage approach comes with increased computational overhead, reducing the model’s speed and hindering real-time monitoring [22,23].

Since 2015, the YOLO (You Only Look Once) family of single-stage deep learning algorithms has undergone continuous improvements. YOLO utilizes a convolutional neural network architecture to determine the location and type of objects in an image, enabling high-speed recognition. The YOLOv5 algorithm further enhances efficiency by adopting a more lightweight network architecture, significantly reducing model size and improving speed. The YOLOv5 family comprises four different architectures (YOLOv5x, YOLOv5l, YOLOv5m, and YOLOv5s), which adapt to various object detection requirements by adjusting the width and depth of the feature extraction network [24]. YOLOv5s, the lightest variant in the YOLOv5 series, has the fastest recognition speed and is widely applied in various scenarios.

Wang et al. [25] utilized a YOLOv5s model with channel pruning to achieve remarkable results in fast apple fruit detection. Guo et al. [26] optimized the backbone network of YOLOv5s and integrated the SE attention mechanism, significantly improving the model’s accuracy compared to YOLOv5s and YOLOv4. Li et al. [27] employed YOLOv5s in an industrial setting for forklift monitoring, enhancing the backbone with GhostNet and incorporating the SE attention mechanism. Li et al. [11] pioneered the application of YOLOv5 in a live streaming scenario for real-time monitoring of anchor expressions. Their improved YOLOv5s model incorporates the GhostNet module and the CA attention mechanism, achieving a superior balance between precision and speed. The YOLOv5 model has already been applied extensively in various commodity environments. However, as live streaming is still a nascent industry, there is significant potential to explore more applications for YOLOv5 in this domain. While Li et al. [11] achieved effective recognition of anchor expressions through an improved model, the focus here extends beyond expressions to encompass other elements within the live streaming scene. Re-applying and further improving the model is therefore particularly important.
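The width and depth adjustment that distinguishes the YOLOv5 variants can be illustrated with a minimal sketch. The multiplier values below are taken from the publicly available Ultralytics YOLOv5 configuration files and are an assumption here, since the text above does not state them; the function names and the base values of 9 repeats and 1024 channels are hypothetical and chosen only for illustration.

```python
import math

# (depth_multiple, width_multiple) per variant. Assumed from the public
# Ultralytics YOLOv5 configs; not given in the text above.
YOLOV5_MULTIPLES = {
    "YOLOv5s": (0.33, 0.50),
    "YOLOv5m": (0.67, 0.75),
    "YOLOv5l": (1.00, 1.00),
    "YOLOv5x": (1.33, 1.25),
}


def scale_depth(repeats: int, depth_multiple: float) -> int:
    """Scale how often a block is repeated, always keeping at least one."""
    if repeats <= 1:
        return repeats
    return max(round(repeats * depth_multiple), 1)


def scale_width(channels: int, width_multiple: float, divisor: int = 8) -> int:
    """Scale a channel count and round it up to a multiple of `divisor`."""
    return math.ceil(channels * width_multiple / divisor) * divisor


def describe(variant: str, base_repeats: int = 9, base_channels: int = 1024):
    """Return (scaled repeats, scaled channels) for a hypothetical base layer."""
    depth_multiple, width_multiple = YOLOV5_MULTIPLES[variant]
    return (
        scale_depth(base_repeats, depth_multiple),
        scale_width(base_channels, width_multiple),
    )


if __name__ == "__main__":
    for name in YOLOV5_MULTIPLES:
        repeats, channels = describe(name)
        print(f"{name}: {repeats} repeats, {channels} channels")
```

Under these assumed multipliers, the same base layer shrinks to roughly a third of the repeats and half of the channels in YOLOv5s, which is consistent with its description above as the lightest and fastest variant.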

4. The Development and Application of Attention Mechanism in Deep Learning

The attention mechanism is inspired by human perception. When humans visually perceive objects, they tend to focus on specific parts that are relevant or important to them. This selective observation allows humans to efficiently extract important information from a large quantity of visual data using limited cognitive resources. The attention mechanism mimics this process, enhancing the efficiency and accuracy of perceptual information processing, and serves as an effective solution to the challenge of information overload. By incorporating the attention mechanism into computer vision tasks, the substantial computational workload can be effectively reduced. As a result, the attention mechanism has gained significant traction in deep learning, becoming a standard component in neural network architectures [28].

Currently, the two most common attention mechanisms applied to machine vision are spatial attention and channel attention [6]. Spatial attention emphasizes the location of the object within the feature information and spatially transforms this location information. The spatial transformer network (STN) [29] is an example of spatial attention. Channel attention, in contrast, emphasizes the content information of the object. The SE network, introduced by Hu et al. [30], is a notable channel attention mechanism. The SE attention module enhances target recognition by computing global average pooled features, adaptively calibrating channel weights, and thereby filtering important features.

As deep neural networks continue to evolve, researchers have developed hybrid attention mechanisms that combine both spatial and channel attention to improve the precision and efficiency of feature recognition within large feature maps [31]. CBAM processes the feature map through both channel and spatial attention. It first applies global pooling operations to the feature map to generate channel attention features. Spatial attention features are then generated by pooling across the channel dimension and concatenating the results. Finally, the resulting attention features are combined with the input features [32]. The CA mechanism integrates spatial coordinate information by embedding location details into channel attention, decomposing channel attention into two parallel one-dimensional feature encodings. This approach differs from CBAM in that it does not forcibly compress the channels. The two one-dimensional feature encodings allow for more comprehensive extraction of spatial information and optimize feature extraction efficiency [33]. Another efficient alternative is SA, which combines channel attention and spatial attention using shuffling units. This lightweight and efficient attention mechanism has demonstrated better performance and lower complexity than the CBAM and SE attention mechanisms on public datasets [34].
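The channel reweighting performed by SE and the two one-dimensional encodings used by CA can be made concrete with a small sketch. It uses plain Python lists rather than a deep learning framework, and the function names, the toy feature map, and the weight matrices are all hypothetical; a real module would learn the weights during training and, in the case of CA, pass the pooled encodings through further convolutions.

```python
import math


def global_average_pool(feature_map):
    """Squeeze step: reduce each H x W channel to its mean value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def dense(vector, weights):
    """Bias-free fully connected layer; `weights` has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def se_attention(feature_map, w_reduce, w_expand):
    """SE-style channel attention: squeeze, excite, then rescale each channel."""
    squeezed = global_average_pool(feature_map)
    hidden = [max(0.0, h) for h in dense(squeezed, w_reduce)]  # ReLU
    gates = [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w_expand)]  # sigmoid
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


def directional_pools(channel):
    """CA-style 1-D encodings of one channel: row means and column means."""
    pooled_rows = [sum(row) / len(row) for row in channel]
    pooled_cols = [sum(col) / len(col) for col in zip(*channel)]
    return pooled_rows, pooled_cols


if __name__ == "__main__":
    # Toy feature map with 2 channels of size 2 x 2 (illustrative values only).
    fmap = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 8.0]],
    ]
    w_reduce = [[0.5, 0.5]]      # 2 channels -> 1 hidden unit
    w_expand = [[1.0], [-1.0]]   # 1 hidden unit -> 2 channel gates
    scaled, gates = se_attention(fmap, w_reduce, w_expand)
    print("channel gates:", gates)
    print("directional pools of channel 0:", directional_pools(fmap[0]))
```

The SE gate is a single scalar per channel, so all spatial positions in a channel are scaled equally, whereas the row and column encodings retain positional information along each axis. This is the distinction the CA description above draws.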