Lightweight Multi-Target Recognition Model for Live Streaming Scenes

The commercial potential of live e-commerce continues to be explored, and machine vision algorithms are gradually attracting the attention of marketers and researchers. During live streaming, the visual content can be captured effectively by such algorithms, thereby providing additional data support.

  • model optimization
  • object detection
  • attention mechanism
  • live streaming

1. Introduction

Live e-commerce has emerged as a prominent marketing trend and is widely embraced globally as a powerful sales-boosting tool [1]. Since 2019, leading global retailers such as Amazon and QVC have established their own live video shopping platforms. China in particular has witnessed a significant surge in the user base of live e-commerce, which reached 469 million in 2022, indicating immense commercial potential. Real-time marketing strategies in live streaming scenarios effectively convey sensory cues to viewers, thereby stimulating consumer purchases [2]. Consequently, the ability to capture these sensory cues during live streaming has become increasingly crucial.
In live streaming, consumers’ visual attention is centered primarily on the anchor and the commodity, as these factors play a crucial role in influencing purchasing decisions. To extract such visual cues effectively, object detection algorithms in machine vision have proven invaluable. Machine vision, a mainstream field within deep learning, encompasses various subfields including scene recognition, object recognition, object detection, and video tracking [3]. Among these subfields, object detection models based on deep learning have advanced significantly since the introduction of Region-based Convolutional Neural Networks (R-CNN), with notable improvements in both accuracy and speed [4]. Deep learning-based object detection techniques can be partitioned into two groups: single-stage and two-stage detection. Two-stage methods, such as R-CNN and Fast R-CNN, achieve higher accuracy but require significant computational resources, whereas single-stage methods are lightweight and offer fast processing speeds. The YOLO algorithm, widely employed in practical applications, is an excellent example of a single-stage technique that achieves accuracy comparable to two-stage methods [5].
The introduction of attention mechanisms into machine vision has been a great success. In general, an attention mechanism is a dynamic selection process that adaptively weights input features, yielding significant performance and accuracy improvements in object recognition, though often at a higher computational cost. Lightweight attention mechanisms, such as Shuffle Attention (SA), the Convolutional Block Attention Module (CBAM), and Coordinate Attention (CA), have been developed and can be easily integrated into mobile network modules [6]. In recent years, researchers have also been actively exploring lightweight modules such as GhostNet, MobileNetV3, and BlazeFace [7][8]. Additionally, many scholars have attempted to refine the backbone of YOLOv5 with lightweight modules and incorporate attention mechanisms, aiming to strike a balance between accuracy and computational efficiency.
Qi et al. [9] integrated the Squeeze-and-Excitation (SE) attention mechanism into YOLOv5 for tomato virus disease identification, achieving higher accuracy. However, this modification increased inference time compared with the original model, as the attention mechanism consumed a significant amount of computational resources. Xu et al. [10] substituted the YOLOv5 backbone network with ShuffleNetV2 and integrated the CA attention mechanism, achieving a favorable balance among the evaluation metrics for mask detection. Li et al. [11] enhanced the backbone of the YOLOv5 model with GhostNet and incorporated the CA attention mechanism to detect anchor expressions in live streaming scenarios, yielding promising results. In live streaming scenes, however, relying solely on facial expressions is insufficient to capture the rich visual cues available.

2. Deep Learning and Emotion Recognition in Live Streaming Scenarios

Emotions, as fundamental human behaviors, play a significant role in information processing and can trigger corresponding actions [12]. The impact of emotions on human behavior has been demonstrated across various domains such as online comments, advertising, TV shopping, and live commerce [13][14][15]. The generation of emotions in live streaming scenarios is complex, with sensory cues being important factors in emotional arousal. As a result, sensory marketing has gained increasing attention [16]. Some researchers have focused on manipulating emotions through sensory cues, such as smell and music [17][18]. Rich sensory stimuli can influence consumer emotions and lead to impulsive buying, and it has been confirmed that impulsive buying behavior is primarily driven by emotions [2]. Therefore, how to evoke consumer emotions through sensory cues to promote sales in live streaming environments remains an important topic.
The development of deep learning has provided strong technical support for emotion recognition. The emergence of Convolutional Neural Networks (CNNs) has enabled significant strides in models that recognize emotions from facial expressions [19]. Real-time and near-real-time speech emotion recognition has also improved greatly with the development of deep learning, moving away from earlier frameworks such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) [20]. In live streaming scenarios, viewers are often attracted by the anchors in the broadcasting room, and visual cues, as the most intuitive influencing factor, should be carefully considered. According to the theory of emotional contagion [21], the emotions of the anchor exert a significant influence on the emotions of the audience.

3. Application of the YOLOv5 Algorithm for Object Detection

Traditionally, feature extraction in object detection relied heavily on manual feature design, which often resulted in poor generalization. With the emergence of deep learning, and beginning with the introduction of R-CNN, Convolutional Neural Networks (CNNs) have become the mainstream framework for object detection thanks to their remarkable performance and excellent feature extraction capabilities. One-stage and two-stage algorithms are the two main types of deep learning-based object detection algorithms. One-stage algorithms directly predict object coordinates and classes through regression, offering faster recognition speeds. Two-stage algorithms, on the other hand, employ region generation for target classification and calibration, leading to higher accuracy but also increased computational overhead, which reduces speed and hinders real-time monitoring [22][23]. Since 2015, the YOLO (You Only Look Once) family of single-stage deep learning algorithms has undergone continuous improvement. YOLO uses a convolutional neural network architecture to determine the location and class of objects in an image, enabling high-speed recognition. The YOLOv5 algorithm further improves efficiency by adopting a more lightweight network architecture, significantly reducing model weight and increasing speed. The YOLOv5 family comprises four architectures (YOLOv5x, YOLOv5l, YOLOv5m, and YOLOv5s), allowing flexibility in adapting to various object detection requirements by adjusting the network’s width and depth [24].
YOLOv5s, the lightest variant in the YOLOv5 series, offers the fastest recognition speed and finds widespread application in various scenarios. Wang et al. [25] used a channel-pruned YOLOv5s model to achieve remarkable results in fast apple fruit detection. Guo et al. [26] optimized the backbone network of YOLOv5s and integrated the SE attention mechanism, significantly improving accuracy compared with YOLOv5s and YOLOv4. Li et al. [27] employed YOLOv5s in an industrial setting for forklift monitoring, enhancing the backbone with GhostNet and incorporating the SE attention mechanism. Li et al. [11] pioneered the application of YOLOv5 in a live streaming scenario for real-time monitoring of anchor expressions; their improved YOLOv5s model incorporates the GhostNet module and the CA attention mechanism, achieving a superior balance between precision and speed.
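As a brief illustration, the studies above all start from the stock YOLOv5s model released in the ultralytics/yolov5 repository [24], which can be loaded for inference through PyTorch Hub. The following is a minimal sketch, not the modified models discussed here; the image filename is a hypothetical placeholder for a frame captured from a live stream.

    # Minimal inference sketch using the stock YOLOv5s model via PyTorch Hub.
    import torch

    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)  # lightest YOLOv5 variant
    results = model("live_frame.jpg")    # hypothetical frame grabbed from a live stream
    results.print()                      # summary of detected classes and confidences
    boxes = results.xyxy[0]              # tensor of [x1, y1, x2, y2, conf, class] per detection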
YOLOv5 models have already found extensive application in various commercial environments. However, as live streaming is still a nascent industry, there is significant potential to explore further applications of YOLOv5 in this domain. While Li et al. [11] achieved effective recognition of anchor expressions through an improved model, the relevant visual cues extend beyond expressions to other elements within the live streaming scene. Therefore, further improving and re-applying the model becomes particularly important.

4. The Development and Application of Attention Mechanisms in Deep Learning

The attention mechanism is inspired by human perception. When humans visually perceive objects, they tend to focus on specific parts that are relevant or important to them. This selective observation allows humans to efficiently extract important information from a substantial quantity of visual data using limited cognitive resources. The attention mechanism mimics this process, enhancing the efficiency and accuracy of perceptual information processing, and serves as an effective means of tackling information overload. Incorporating the attention mechanism into computer vision tasks can effectively reduce the computational workload by focusing processing on the most relevant information. As a result, the attention mechanism has gained significant traction in deep learning and has become a standard component of neural network architectures [28].
Currently, the two most common attention mechanisms applied to machine vision are spatial attention and channel attention [6]. Spatial attention emphasizes where the object is located within the feature representation and applies spatial transformations to this location information; the spatial transformer network (STN) [29] is an example. Channel attention, in contrast, emphasizes the content information of the object. The SE network, introduced by Hu et al. [30], is a notable channel attention mechanism: it computes on globally average-pooled features and adaptively recalibrates channel weights, emphasizing important features and thereby enhancing target recognition.
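To make the channel recalibration concrete, the following is a minimal PyTorch sketch of an SE block, assuming the commonly used reduction ratio of 16; it is an illustrative approximation rather than the implementation used in the cited studies.

    # Minimal sketch of a Squeeze-and-Excitation (SE) block in PyTorch.
    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global average pooling
            self.fc = nn.Sequential(                         # excitation: two fully connected layers
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x):
            b, c, _, _ = x.shape
            w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # per-channel weights in (0, 1)
            return x * w                                            # recalibrate channel responses

    # Example: recalibrating a 64-channel feature map; the output shape is unchanged.
    out = SEBlock(64)(torch.randn(2, 64, 32, 32))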
As deep learning neural networks continue to evolve, researchers have developed hybrid attention mechanisms that combine spatial and channel attention to improve the precision and efficiency of feature recognition within large feature maps [31]. CBAM refines feature maps through both channel and spatial attention. It first applies global pooling operations to the feature map to generate channel attention; it then generates spatial attention by pooling along the channel dimension, concatenating the resulting maps, and convolving them into a single attention map; finally, the attention weights are multiplied with the input features [32].
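A simplified PyTorch sketch of CBAM’s two sequential sub-modules follows; the reduction ratio of 16 and the 7×7 spatial convolution are common settings and should be treated as assumptions.

    # Simplified sketch of CBAM: channel attention followed by spatial attention.
    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(                        # shared MLP over pooled descriptors
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )

        def forward(self, x):
            b, c, _, _ = x.shape
            avg = self.mlp(x.mean(dim=(2, 3)))               # global average-pooled descriptor
            mx = self.mlp(x.amax(dim=(2, 3)))                # global max-pooled descriptor
            return torch.sigmoid(avg + mx).view(b, c, 1, 1)

    class SpatialAttention(nn.Module):
        def __init__(self, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, x):
            avg = x.mean(dim=1, keepdim=True)                # channel-wise average map
            mx = x.amax(dim=1, keepdim=True)                 # channel-wise max map
            return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

    class CBAM(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.ca = ChannelAttention(channels)
            self.sa = SpatialAttention()

        def forward(self, x):
            x = x * self.ca(x)                               # refine what to attend to (channels)
            return x * self.sa(x)                            # refine where to attend (locations)

    out = CBAM(64)(torch.randn(2, 64, 32, 32))               # output shape is unchanged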
The CA mechanism integrates spatial coordinate information by embedding location details into channel attention, decomposing channel attention into two parallel one-dimensional feature encodings along the height and width directions. This approach differs from CBAM in that it does not forcibly compress channel information; the two one-dimensional encodings allow more comprehensive extraction of spatial information and improve feature extraction efficiency [33].
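The sketch below approximates a coordinate attention block in PyTorch, assuming a ReLU activation and a reduction ratio of 32; the original design uses a hard-swish-style nonlinearity, so this is an approximation rather than a reference implementation.

    # Rough sketch of Coordinate Attention (CA): channel attention factorized into
    # two one-dimensional encodings along the height and width directions.
    import torch
    import torch.nn as nn

    class CoordinateAttention(nn.Module):
        def __init__(self, channels, reduction=32):
            super().__init__()
            mid = max(8, channels // reduction)
            self.pool_h = nn.AdaptiveAvgPool2d((None, 1))    # pool along width  -> (b, c, h, 1)
            self.pool_w = nn.AdaptiveAvgPool2d((1, None))    # pool along height -> (b, c, 1, w)
            self.conv1 = nn.Conv2d(channels, mid, 1)
            self.bn = nn.BatchNorm2d(mid)
            self.act = nn.ReLU(inplace=True)                 # the original uses a hard-swish variant
            self.conv_h = nn.Conv2d(mid, channels, 1)
            self.conv_w = nn.Conv2d(mid, channels, 1)

        def forward(self, x):
            b, c, h, w = x.shape
            x_h = self.pool_h(x)                             # (b, c, h, 1)
            x_w = self.pool_w(x).permute(0, 1, 3, 2)         # (b, c, w, 1)
            y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
            y_h, y_w = torch.split(y, [h, w], dim=2)         # split back into the two directions
            a_h = torch.sigmoid(self.conv_h(y_h))                          # (b, c, h, 1)
            a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))      # (b, c, 1, w)
            return x * a_h * a_w                             # position-aware channel recalibration

    out = CoordinateAttention(64)(torch.randn(2, 64, 32, 32))  # output shape is unchanged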
Another efficient attention mechanism is SA, which combines channel attention and spatial attention using shuffle units. This lightweight and efficient mechanism has demonstrated better performance and lower complexity than the CBAM and SE attention mechanisms on public datasets [34].
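For completeness, the sketch below approximates the structure of an SA unit: features are split into groups, each group is processed by a channel branch and a spatial branch, and the result is re-mixed by a channel shuffle. The group count, affine parameters, and their initialization here are assumptions and simplifications of the published design.

    # Simplified, approximate sketch of Shuffle Attention (SA).
    import torch
    import torch.nn as nn

    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

    class ShuffleAttention(nn.Module):
        def __init__(self, channels, groups=8):
            super().__init__()
            self.groups = groups
            k = channels // (2 * groups)                     # channels per branch within a group
            self.cw = nn.Parameter(torch.ones(1, k, 1, 1))   # channel-branch scale (assumed init)
            self.cb = nn.Parameter(torch.zeros(1, k, 1, 1))  # channel-branch bias
            self.sw = nn.Parameter(torch.ones(1, k, 1, 1))   # spatial-branch scale
            self.sb = nn.Parameter(torch.zeros(1, k, 1, 1))  # spatial-branch bias
            self.gn = nn.GroupNorm(k, k)

        def forward(self, x):
            b, c, h, w = x.shape
            x = x.view(b * self.groups, c // self.groups, h, w)
            xc, xs = x.chunk(2, dim=1)                       # split each group into two branches
            xc = xc * torch.sigmoid(xc.mean(dim=(2, 3), keepdim=True) * self.cw + self.cb)  # channel gate
            xs = xs * torch.sigmoid(self.gn(xs) * self.sw + self.sb)                        # spatial gate
            out = torch.cat([xc, xs], dim=1).view(b, c, h, w)
            return channel_shuffle(out, 2)                   # mix information across the two branches

    out = ShuffleAttention(64)(torch.randn(2, 64, 32, 32))   # output shape is unchanged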

This entry is adapted from the peer-reviewed paper 10.3390/app131810170

References

  1. Zheng, S.; Chen, J.; Liao, J.; Hu, H.L. What motivates users’ viewing and purchasing behavior motivations in live streaming: A stream-streamer-viewer perspective. J. Retail. Consum. Serv. 2023, 72, 103240.
  2. Zhang, X.; Cheng, X.; Huang, X. “Oh, My God, Buy It!” Investigating impulse buying behavior in live streaming commerce. Int. J. Hum. Comput. Interact. 2022, 39, 2436–2449.
  3. Morris, T. Computer Vision and Image Processing; Palgrave Macmillan Ltd.: London, UK, 2004; pp. 1–320.
  4. Aziz, L.; Salam, M.S.B.H.; Sheikh, U.U.; Ayub, S. Exploring deep learning-based architecture, strategies, applications and current trends in generic object detection: A comprehensive review. IEEE Access 2020, 8, 170461–170495.
  5. Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275.
  6. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368.
  7. Bazarevsky, V.; Kartynnik, Y.; Vakunov, A.; Raveendran, K.; Grundmann, M. Blazeface: Sub-millisecond neural face detection on mobile gpus. arXiv 2019, arXiv:1907.05047.
  8. Jin, R.; Xu, Y.; Xue, W.; Li, B.; Yang, Y.; Chen, W. An Improved Mobilenetv3-Yolov5 Infrared Target Detection Algorithm Based on Attention Distillation. In International Conference on Advanced Hybrid Information Processing; Springer International Publishing: Cham, Switzerland, 2021; pp. 266–279.
  9. Qi, J.; Liu, X.; Liu, K.; Xu, F.; Guo, H.; Tian, X.; Li, M.; Bao, Z.; Li, Y. An improved YOLOv5 model based on visual attention mechanism: Application to recognition of tomato virus disease. Comput. Electron. Agric. 2022, 194, 106780.
  10. Xu, S.; Guo, Z.; Liu, Y.; Fan, J.; Liu, X. An improved lightweight yolov5 model based on attention mechanism for face mask detection. In Artificial Neural Networks and Machine Learning–ICANN 2022, Proceedings of the 31st International Conference on Artificial Neural Networks, Bristol, UK, 6–9 September 2022, Part III; Springer Nature: Cham, Switzerland, 2022; pp. 531–543.
  11. Li, Z.; Song, J.; Qiao, K.; Li, C.; Zhang, Y.; Li, Z. Research on efficient feature extraction: Improving YOLOv5 backbone for facial expression detection in live streaming scenes. Front. Comput. Neurosci. 2022, 16, 980063.
  12. Clore, G.L.; Schwarz, N.; Conway, M. Affective causes and consequences of social information processing. Handb. Soc. Cogn. 1994, 1, 323–417.
  13. Deng, B.; Chau, M. The effect of the expressed anger and sadness on online news believability. J. Manag. Inf. Syst. 2021, 38, 959–988.
  14. Bharadwaj, N.; Ballings, M.; Naik, P.A.; Moore, M.; Arat, M.M. A new livestream retail analytics framework to assess the sales impact of emotional displays. J. Mark. 2022, 86, 27–47.
  15. Lin, Y.; Yao, D.; Chen, X. Happiness begets money: Emotion and engagement in live streaming. J. Mark. Res. 2021, 58, 417–438.
  16. Krishna, A. An integrative review of sensory marketing: Engaging the senses to affect perception, judgment and behavior. J. Consum. Psychol. 2012, 22, 332–351.
  17. Gardner, M.P. Mood states and consumer behavior: A critical review. J. Consum. Res. 1985, 12, 281–300.
  18. Kahn, B.E.; Isen, A.M. The influence of positive affect on variety seeking among safe, enjoyable products. J. Consum. Res. 1993, 20, 257–270.
  19. Ng, H.W.; Nguyen, V.D.; Vonikakis, V.; Winkler, S. Deep learning for emotion recognition on small datasets using transfer learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, 9–13 November 2015; pp. 443–449.
  20. Abbaschian, B.J.; Sierra-Sosa, D.; Elmaghraby, A. Deep learning techniques for speech emotion recognition, from databases to models. Sensors 2021, 21, 1249.
  21. Barsade, S.G. The ripple effect: Emotional contagion and its influence on group behavior. Adm. Sci. Q. 2002, 47, 644–675.
  22. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: Columbus, OH, USA, 2014; pp. 580–587.
  23. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  24. Jocher, G. YOLOv5. GitHub repository, 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 4 March 2023).
  25. Wang, D.; He, D. Channel pruned YOLO V5s-based deep learning approach for rapid and accurate apple fruitlet detection before fruit thinning. Biosyst. Eng. 2021, 210, 271–281.
  26. Guo, K.; He, C.; Yang, M.; Wang, S. A pavement distresses identification method optimized for YOLOv5s. Sci. Rep. 2022, 12, 3542.
  27. Li, Z.; Lu, K.; Zhang, Y.; Li, Z.; Liu, J.B. Research on Energy Efficiency Management of Forklift Based on Improved YOLOv5 Algorithm. J. Math. 2021, 2021, 5808221.
  28. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62.
  29. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 2, pp. 2017–2025.
  30. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  31. Hu, H.; Li, Q.; Zhao, Y.; Zhang, Y. Parallel deep learning algorithms with hybrid attention mechanism for image segmentation of lung tumors. IEEE Trans. Ind. Inform. 2020, 17, 2880–2889.
  32. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  33. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
  34. Zhang, Q.L.; Yang, Y.B. Sa-net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2235–2239.