Chen, Y.; Zhao, H.; Gao, M.; Deng, M. Weakly Supervised Crowd-Counting Models. Encyclopedia. Available online: https://encyclopedia.pub/entry/55587 (accessed on 18 May 2024).
Weakly Supervised Crowd-Counting Models

Lightweight crowd-counting networks have become the mainstream way to deploy crowd-counting techniques on resource-constrained devices. Significant progress has been made in this field, with many outstanding lightweight models proposed in succession. However, challenges such as scale variation, global feature extraction, and fine-grained head-annotation requirements still exist in related tasks, necessitating further improvement. In this research, the researchers propose a weakly supervised hybrid lightweight crowd-counting network that integrates the initial layers of GhostNet as the backbone to efficiently extract local features and enrich intermediate features. Experimental results on accuracy and inference speed over several mainstream datasets validate the model's design principles.

Keywords: crowd counting; lightweight hybrid network; weakly supervised learning

1. Introduction

As a subdomain of object detection and counting, crowd counting is a technique for counting or estimating the total number of people present in an image or video stream. It can be applied in a variety of circumstances, such as video surveillance, traffic monitoring, and public safety. However, challenges persist in terms of fine-grained head-annotation requirements, global context feature extraction, handling scale variations, etc. Furthermore, to make this technology more practical in real-life scenarios, lightweight crowd-counting network design has become a trend in this domain. It aims at a more compact model architecture that can run on resource-constrained devices while maintaining counting accuracy, which places higher demands on model compactness and resource consumption.
Lightweight crowd-counting models often employ techniques such as model pruning [1], parameter sharing [2][3], model quantization [4], and knowledge distillation [5] to reduce parameters and computation cost. Son et al. [4] applied model quantization in their model for a microcontroller unit (MCU), achieving a 2.2× speedup over the original float model, which indicates its effectiveness on resource-constrained devices. Liu et al. [5] proposed a Structured Knowledge Transfer (SKT) framework that allows a student network to learn feature-modeling ability from a teacher network; the model size was reduced to one-quarter of the teacher network with only a small drop in accuracy. However, challenges persist in lightweight crowd-counting network design. First, most fully supervised models [1][4][6][7][8][9][10][11] require location-level annotation information in datasets to maintain accurate performance. Generating such annotations is tedious and time-consuming, and processing and learning accurate information from fine-grained annotations often demands a more complicated architecture. For instance, ref. [12] proposed an iterative crowd-counting network that handles high- and low-resolution density maps for more accurate results, and ref. [13] combined deep and shallow networks to generate a density map that is more adaptable to large scale variations. Second, lightweight models often lack effective modules for global context feature extraction [14][15] and intermediate feature enrichment [14]. Finally, it is also of vital importance to implement effective modules that handle the scale variation problems inherent in object counting.
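The distillation idea behind SKT can be illustrated with a generic feature-mimicking objective. The sketch below is a simplified illustration, not the actual SKT loss; the `alpha` weighting and the plain L1 count term are assumptions made for clarity:

```python
import numpy as np

def distillation_loss(student_feat, teacher_feat, pred_count, gt_count, alpha=0.5):
    """Generic distillation objective: the student mimics the teacher's
    intermediate features while also regressing the crowd count.
    (Illustrative sketch; not the exact SKT formulation of [5].)"""
    feature_term = np.mean((np.asarray(student_feat) - np.asarray(teacher_feat)) ** 2)
    count_term = abs(pred_count - gt_count)  # task loss on the final count
    return alpha * feature_term + (1.0 - alpha) * count_term
```

When the student reproduces the teacher's features exactly and predicts the correct count, the loss is zero; any feature mismatch or counting error raises it.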
In order to enhance the adaptability of models to datasets with coarse annotations, weakly supervised methods have been proposed, and such techniques can also be applied in the crowd-counting field. Fully supervised object detection methods include detection with bounding boxes [16][17], which requires box anchors to be annotated before training, and detection with image segmentation [18], which often requires generating a segmentation map from a point map or a density map. In contrast, weakly supervised object counting can rely solely on the ground-truth count to estimate the number of people in an image. In scenarios where only a real-time estimate of crowd size is needed and detailed location information is not crucial, the model structure is correspondingly simplified, making it more compatible with lightweight network design. However, a weakly supervised model must handle incomplete label information and learn useful patterns from it, placing increased demands on the model's training methods and optimization techniques. Some studies have concentrated on weakly supervised network design, and several have achieved good results. Yang et al. [19] proposed a soft-label ranking network that facilitates counting by ranking images according to the number of people they contain. It dispenses with expensive semantic labels and location annotations, and the ranking network drives the shared backbone CNN to explicitly acquire density sensitivity; a regression network then uses the count information to improve accuracy. Liang et al. [20] re-articulated weakly supervised crowd counting from a Transformer-based sequence-counting perspective, achieving a weakly supervised paradigm that relies only on count-level annotations. However, its Transformer backbone makes it unsuitable for direct application in lightweight crowd-counting tasks.
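The ranking idea used by [19] can be sketched with a standard margin (hinge) ranking loss: an image known to contain more people should score higher than one with fewer. This is an illustrative simplification with a hypothetical `margin` parameter, not the soft-label formulation of the original paper:

```python
def ranking_hinge(score_more, score_less, margin=1.0):
    """Hinge ranking loss: zero when the image with more people scores
    at least `margin` higher than the image with fewer people, and
    growing linearly with the violation otherwise. (Illustrative only.)"""
    return max(0.0, margin - (score_more - score_less))
```

Averaging this hinge over many ordered image pairs pushes the shared backbone toward density-sensitive features without any location labels.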
Following the strategy proposed in [20][21], a network trained in a weakly supervised setting can straightforwardly extend the traditional density-map estimation process by constraining the integral of the estimated density map to be close to the ground-truth object count for count-level annotated images. The ground-truth count can easily be obtained directly from the dataset, from the number of annotated location coordinates, or by integrating the ground-truth density map, depending on the format of the labels the dataset provides.
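The count-level supervision described above can be sketched as follows. The L1 form of the loss and the helper for deriving the ground-truth count are illustrative assumptions, not the exact objectives of [20][21]:

```python
import numpy as np

def ground_truth_count(labels):
    """Derive the GT count from whichever annotation format a dataset
    provides: a scalar count, an (N, 2) array of head coordinates,
    or a ground-truth density map. (Illustrative helper.)"""
    if np.isscalar(labels):                         # count-level annotation
        return float(labels)
    labels = np.asarray(labels)
    if labels.ndim == 2 and labels.shape[1] == 2:   # head coordinates (N, 2)
        return float(len(labels))
    return float(labels.sum())                      # integrate a density map

def count_loss(pred_density, labels):
    """L1 distance between the integral of the predicted density map
    and the ground-truth count -- the only supervision signal needed
    in this weakly supervised setting."""
    return abs(float(np.asarray(pred_density).sum()) - ground_truth_count(labels))
```

Because the loss compares a single scalar per image, no density-map or point-level annotation ever enters the training objective.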

2. Lightweight Network Design for General Tasks

Designing lightweight crowd-counting networks has become a mainstream way to make networks applicable to resource-constrained devices. Compared to heavy-structure models, lightweight models have low computation cost and fast processing time to meet real-time demands. Several works have designed lightweight models for general computer vision tasks such as object detection, semantic segmentation, and object classification, and their thought-provoking modules and innovative design principles inspired the researchers to optimize the modules in their model for better performance. Howard et al. and successors [22][23][24] proposed the MobileNet family of networks, spanning three generations. MobileNet's main contributions are depth-wise separable convolution, which replaces a standard convolution with a channel-wise convolution followed by a point-wise convolution, and the inverted residual bottleneck, which reduces computation. Building on MobileNet's structure, Han et al. [25][26] proposed a more efficient convolution architecture called GhostNet, which takes advantage of MobileNet's inverted bottleneck and depth-wise separable convolution and introduces the Ghost Bottleneck, which generates ghost features with cheap linear operations. In the Transformer domain, Zhang et al. [27] proposed MiniViT, which compresses the Vision Transformer through weight multiplexing, consisting of weight transformation and weight distillation. Bolya et al. [28] introduced Hydra Attention, a linear attention that can add more heads while keeping the computation amount unchanged. However, relying solely on the Transformer for compact model design is not competitive; in most cases, mixing a Transformer with a convolutional network is the better choice, combining a CNN's small model size with a Transformer's high accuracy. In this direction, Chen et al. [29] presented Mobile-Former, a parallel design of MobileNet and Transformer that fuses local features and global information bidirectionally. Pan et al. [30] designed EdgeViTs, a lightweight vision Transformer family that achieves accuracy–latency and accuracy–energy trade-offs in object recognition tasks. Chen et al. [31] designed a Mixing Block that combines local-window self-attention and depth-wise convolution in parallel to integrate features across windows and dimensions. Mehta et al. [32] replaced local processing in convolutions with global processing using Transformers, learning better representations with fewer parameters and simple training recipes. The hybrid networks mentioned above are designed for general purposes and are not specially tailored to crowd-counting tasks.
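The savings from depth-wise separable convolution can be seen from a simple parameter count, a sketch with bias terms omitted for simplicity:

```python
def standard_conv_params(k, c_in, c_out):
    """Parameters of a standard k x k convolution over c_in channels
    producing c_out channels."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depth-wise k x k convolution (one filter per input channel)
    followed by a 1 x 1 point-wise convolution that mixes channels."""
    return k * k * c_in + c_in * c_out

# For a 3x3 layer with 128 input and 128 output channels:
# standard:  147,456 parameters
# separable:  17,536 parameters (roughly 8.4x fewer)
```

The same factorization reduces multiply-accumulate operations by a similar ratio, which is why it underpins both MobileNet and GhostNet-style designs.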

3. Lightweight Crowd-Counting Models

In the crowd-counting field, several lightweight models have been proposed and achieved promising performance at the time of their publication; looking back at their designs, improvements can be identified to boost performance. Shi et al. [14] proposed the lightweight C-CNN model, which uses three parallel layers with filters of different sizes to address scale variation. Compared to multi-branch architectures, it has a simpler structure and fewer parameters, but such a simple structure may not handle complicated scenarios involving illumination variation, fake object appearance, etc. Zhu et al. [3] applied weight sharing to the scale-feature extraction module in LSANet, sharply decreasing the parameters of a complicated network to a minimum level. However, it outputs density maps at three different scales for better guidance, which inevitably incurs unnecessary computation cost; the researchers' weakly supervised network skips density-map generation entirely and achieves a better trade-off between accuracy and computation cost. Liang et al. [33] designed PDDNet, equipped with lightweight pyramid dilated convolution (LPC) modules to extract global context information, and Dong et al. [34] designed a Multi-Scale Feature Extraction Network (MFENet) to model multi-scale information. Compared with these methods, the PPAM block in the researchers' model takes fewer computations to process scale-aware information, and its Swin-Transformer block is more effective at extracting global context information. Tian et al. [35] and Zhang et al. [36] added a guidance branch to their lightweight models to learn localization information; such a technique needs precise head-location coordinates to guide the localization task, which is not mandatory in lightweight crowd-counting tasks. In hybrid network architectures, Sun et al. [37] introduced Transformer blocks after each downscaling convolution block to model scale-varied information stage by stage, but introducing multiple Transformer blocks for the same purpose is not computationally efficient.
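The multi-branch, scale-aware idea shared by these models can be sketched as parallel filters with different receptive fields. The single-channel NumPy toy below uses hypothetical averaging kernels in place of learned weights, purely to show the structure:

```python
import numpy as np

def conv2d(x, k):
    """Naive 'same'-padded 2-D convolution for a single channel
    (zero padding, odd-sized kernels)."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

def multi_scale_features(x):
    """Three parallel branches with growing receptive fields (3x3, 5x5,
    7x7), in the spirit of C-CNN-style multi-column designs. Averaging
    kernels stand in for learned filters in this sketch."""
    kernels = [np.ones((s, s)) / (s * s) for s in (3, 5, 7)]
    return np.stack([conv2d(x, k) for k in kernels])
```

A real model would concatenate or fuse the branch outputs along the channel dimension before regressing a count; here the stacked output simply makes the parallel structure explicit.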

4. Weakly Supervised Crowd-Counting Models

Some previous works have also concentrated on weakly supervised lightweight crowd-counting network designs, which are the most relevant to the researchers' network. Yang et al. [19] made the first attempt to train a pure convolutional network without location-level annotations, but still relied on a sorting network with handcrafted soft labels. Wang et al. [38] proposed a weakly supervised network with a multi-granularity MLP based solely on count-level annotations; however, they introduced a ranking mechanism and designed auxiliary branches for self-supervision, causing excessive computation, which should be avoided in lightweight models. Wang et al. [39] proposed a joint CNN and Transformer network that also implements weakly supervised learning for efficient crowd counting; it adopts a modified Swin-Transformer, with the patching layers discarded, for global feature modeling to complement the local features from VGGNet. However, it ignores scale-aware information at both the local and the global scope, and combining VGGNet with Swin-Transformer blocks that use patch-embedding operations can cause a loss of background context information.
From the recent lightweight crowd-counting works mentioned above, several common problems can be identified. First, lightweight convolutional models have often concentrated on designing various dilated convolution modules to expand the convolutional receptive field from a local area toward a global horizon; a Transformer, however, inherently has a global receptive field, so if its computation cost can be decreased, replacing the dilated convolution module with a Transformer is the better choice. Second, some works retain crowd-localization tasks in lightweight models, which require complicated and accurate annotations in the datasets for training. Finally, weakly supervised lightweight crowd-counting models also face the issues that fully supervised ones currently face, and they require further improvement to fit coarse-grained annotations.

References

  1. Lv, H.; Yan, H.; Liu, K.; Zhou, Z.; Jing, J. Yolov5-ac: Attention mechanism-based lightweight yolov5 for track pedestrian detection. Sensors 2022, 22, 5903.
  2. Lin, J.; Hu, J.; Xie, Z.; Zhang, Y.; Huang, G.; Chen, Z. A Multitask Network for People Counting, Motion Recognition, and Localization Using Through-Wall Radar. Sensors 2023, 23, 8147.
  3. Zhu, F.; Yan, H.; Chen, X.; Li, T. Real-time crowd counting via lightweight scale-aware network. Neurocomputing 2022, 472, 54–67.
  4. Son, S.; Seo, A.; Eo, G.; Gill, K.; Gong, T.; Kim, H.S. MiCrowd: Vision-Based Deep Crowd Counting on MCU. Sensors 2023, 23, 3586.
  5. Liu, L.; Chen, J.; Wu, H.; Chen, T.; Li, G.; Lin, L. Efficient Crowd Counting via Structured Knowledge Transfer. arXiv 2020, arXiv:2003.10120.
  6. Khan, K.; Khan, R.U.; Albattah, W.; Nayab, D.; Qamar, A.M.; Habib, S.; Islam, M. Crowd counting using end-to-end semantic image segmentation. Electronics 2021, 10, 1293.
  7. Khan, S.D.; Salih, Y.; Zafar, B.; Noorwali, A. A deep-fusion network for crowd counting in high-density crowded scenes. Int. J. Comput. Intell. Syst. 2021, 14, 168.
  8. Chen, X.; Yu, X.; Di, H.; Wang, S. Sa-internet: Scale-aware interaction network for joint crowd counting and localization. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Beijing, China, 29 October–1 November 2021; pp. 203–215.
  9. Duan, Z.; Wang, S.; Di, H.; Deng, J. Distillation remote sensing object counting via multi-scale context feature aggregation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5613012.
  10. Xie, Y.; Lu, Y.; Wang, S. Rsanet: Deep recurrent scale-aware network for crowd counting. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual, 25–28 October 2020; pp. 1531–1535.
  11. Wang, S.; Lu, Y.; Zhou, T.; Di, H.; Lu, L.; Zhang, L. SCLNet: Spatial context learning network for congested crowd counting. Neurocomputing 2020, 404, 227–239.
  12. Ranjan, V.; Le, H.; Hoai, M. Iterative crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 270–285.
  13. Boominathan, L.; Kruthiventi, S.S.; Babu, R.V. Crowdnet: A deep convolutional network for dense crowd counting. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 640–644.
  14. Shi, X.; Li, X.; Wu, C.; Kong, S.; Yang, J.; He, L. A real-time deep network for crowd counting. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2328–2332.
  15. Jiang, G.; Wu, R.; Huo, Z.; Zhao, C.; Luo, J. LigMSANet: Lightweight multi-scale adaptive convolutional neural network for dense crowd counting. Expert Syst. Appl. 2022, 197, 116662.
  16. Goh, G.L.; Goh, G.D.; Pan, J.W.; Teng, P.S.P.; Kong, P.W. Automated Service Height Fault Detection Using Computer Vision and Machine Learning for Badminton Matches. Sensors 2023, 23, 9759.
  17. Yu, R.; Wang, S.; Lu, Y.; Di, H.; Zhang, L.; Lu, L. SAF: Semantic Attention Fusion Mechanism for Pedestrian Detection. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence, Cuvu, Fiji, 26–30 August 2019; pp. 523–533.
  18. Wang, Q.; Breckon, T.P. Crowd Counting via Segmentation Guided Attention Networks and Curriculum Loss. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15233–15243.
  19. Yang, Y.; Li, G.; Wu, Z.; Su, L.; Huang, Q.; Sebe, N. Weakly supervised crowd counting learns from sorting rather than locations. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 1–17.
  20. Liang, D.; Chen, X.; Xu, W.; Zhou, Y.; Bai, X. Transcrowd: Weakly supervised crowd counting with Transformers. Sci. China Inf. Sci. 2022, 65, 160104.
  21. Lei, Y.; Liu, Y.; Zhang, P.; Liu, L. Towards using count-level weak supervision for crowd counting. Pattern Recognit. 2021, 109, 107616.
  22. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
  23. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
  24. Koonce, B.; Koonce, B. MobileNetV3. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Springer: Berlin/Heidelberg, Germany, 2021; pp. 125–144.
  25. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
  26. Han, K.; Wang, Y.; Xu, C.; Guo, J.; Xu, C.; Wu, E.; Tian, Q. GhostNets on heterogeneous devices via cheap operations. Int. J. Comput. Vis. 2022, 130, 1050–1069.
  27. Zhang, J.; Peng, H.; Wu, K.; Liu, M.; Xiao, B.; Fu, J.; Yuan, L. Minivit: Compressing vision Transformers with weight multiplexing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12145–12154.
  28. Bolya, D.; Fu, C.Y.; Dai, X.; Zhang, P.; Hoffman, J. Hydra attention: Efficient attention with many heads. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2022; pp. 35–49.
  29. Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-former: Bridging mobilenet and Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5270–5279.
  30. Pan, J.; Bulat, A.; Tan, F.; Zhu, X.; Dudziak, L.; Li, H.; Tzimiropoulos, G.; Martinez, B. Edgevits: Competing light-weight cnns on mobile devices with vision Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2022; pp. 294–311.
  31. Chen, Q.; Wu, Q.; Wang, J.; Hu, Q.; Hu, T.; Ding, E.; Cheng, J.; Wang, J. Mixformer: Mixing features across windows and dimensions. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2022; pp. 5249–5259.
  32. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision Transformer. arXiv 2021, arXiv:2110.02178.
  33. Liang, L.; Zhao, H.; Zhou, F.; Ma, M.; Yao, F.; Ji, X. PDDNet: Lightweight congested crowd counting via pyramid depth-wise dilated convolution. Appl. Intell. 2023, 53, 10472–10484.
  34. Dong, J.; Zhao, Z.; Wang, T. Crowd Counting by Multi-Scale Dilated Convolution Networks. Electronics 2023, 12, 2624.
  35. Tian, Y.; Duan, C.; Zhang, R.; Wei, Z.; Wang, H. Lightweight Dual-Task Networks For Crowd Counting In Aerial Images. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 1975–1979.
  36. Zhang, Y.; Zhao, H.; Duan, Z.; Huang, L.; Deng, J.; Zhang, Q. Congested crowd counting via adaptive multi-scale context learning. Sensors 2021, 21, 3777.
  37. Sun, Y.; Li, M.; Guo, H.; Zhang, L. MSGSA: Multi-Scale Guided Self-Attention Network for Crowd Counting. Electronics 2023, 12, 2631.
  38. Wang, M.; Zhou, J.; Cai, H.; Gong, M. Crowdmlp: Weakly supervised crowd counting via multi-granularity mlp. Pattern Recognit. 2023, 144, 109830.
  39. Wang, F.; Liu, K.; Long, F.; Sang, N.; Xia, X.; Sang, J. Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting. arXiv 2022, arXiv:2203.06388.