Crowd-counting networks have become the mainstream way to deploy crowd-counting techniques on resource-constrained devices. Significant progress has been made in this field, with many outstanding lightweight models proposed in succession. However, challenges such as scale variation, global feature extraction, and fine-grained head annotation requirements still exist in related tasks, necessitating further improvement. In this research, the researchers propose a weakly supervised hybrid lightweight crowd-counting network that integrates the initial layers of GhostNet as the backbone to efficiently extract local features and enrich intermediate features. Experimental results for accuracy and inference speed on mainstream datasets validate the model's design principles.
1. Introduction
As a subdomain of object detection and counting, crowd counting is a technique for counting or estimating the total number of people present in an image or video stream. It can be applied in a variety of circumstances, such as video surveillance, traffic monitoring, and public safety. However, challenges persist in terms of fine-grained head annotation requirements, global context feature extraction, handling scale variations, etc. Furthermore, to make this technology more practical in real-life scenarios, lightweight crowd-counting network design has become a trend in this domain. It aims at a more compact model architecture that can run on resource-constrained devices while maintaining counting accuracy, which poses higher demands on model compactness and resource consumption.
Lightweight crowd-counting models often implement techniques such as model pruning [1], parameter sharing [2][3], model quantization [4], and knowledge distillation [5] to reduce parameter counts and computation cost. Sun et al. [4] utilized model quantization in their model for a microcontroller unit (MCU), achieving a 2.2× speedup over the original float model, which indicates its effectiveness on resource-constrained devices. Liu et al. [5] proposed a Structured Knowledge Transfer (SKT) framework that allows a student network to learn feature-modeling ability from a teacher network; the model size was reduced to one-quarter of the teacher network with only a small drop in accuracy. However, challenges persist in lightweight crowd-counting network design. First, most models with fully supervised guidance [1][4][6][7][8][9][10][11] require location-level annotations in datasets to maintain accurate performance. Generating such annotations is tedious and time-consuming, and processing and learning accurate information from fine-grained annotations often requires a more complicated architecture. For instance, ref. [12] proposed an iterative crowd-counting network that handles high-resolution and low-resolution density maps for more accurate results, and ref. [13] combined deep and shallow networks to generate a density map that is more adaptable to large scale variations. Second, lightweight models often lack effective modules for global context feature extraction [14][15] and intermediate feature enrichment [14]. Finally, it is also of vital importance to implement effective modules to handle the scale variation problems inherent in the object counting task.
To enhance the adaptability of models to datasets with coarse annotations, weakly supervised methods have been proposed, and such techniques can also be applied in the crowd-counting field. Fully supervised approaches include detection with bounding boxes [16][17], which requires box anchors to be annotated before training, and detection with image segmentation [18], which often requires generating a segmentation map from a point map or a density map. In contrast, weakly supervised object counting can rely solely on the ground-truth count to estimate the number of people in an image. In scenarios where only a real-time estimate of the number of people is needed and detailed location information is not crucial, the model structure is correspondingly simplified, making it more compatible with lightweight network design. However, a weakly supervised model needs to handle incomplete label information and learn useful patterns from it, placing increased demands on the model's training methods and optimization techniques. Some studies have concentrated on weakly supervised network design and achieved good results in this regard. Yang et al. [19] proposed a soft-label ranking network that facilitates the counting task by ranking images based on the number of people they contain. It gets rid of expensive semantic labels and location annotations, and the ranking network drives the shared backbone CNN to explicitly acquire density-sensitive capability; a regression network then utilizes the count information to enhance counting accuracy. Liang et al. [20] re-articulated the problem of weakly supervised crowd counting from a Transformer-based sequence-counting perspective, achieving a weakly supervised paradigm that relies only on count-level annotations. However, it is not suitable for direct application in lightweight crowd-counting tasks due to its Transformer backbone.
Following the strategy proposed in [20][21], training in a weakly supervised setting can straightforwardly extend the traditional density map estimation process by constraining the integral of the estimated density map to be close to the ground-truth object count for count-level annotated images. The ground-truth count can be easily obtained directly from the datasets, from the number of location coordinates, or by integrating the ground-truth density map, depending on the format of the ground-truth labels provided by the datasets.
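As a minimal sketch of this idea (function names and the L1 form of the loss are illustrative, not taken from the cited works), the count-level supervision reduces to comparing the sum of the predicted density map against the annotated count:

```python
def count_level_loss(pred_density, gt_count):
    """L1 loss between the integral (here, the sum) of a predicted
    2-D density map and the count-level ground-truth annotation."""
    pred_count = sum(sum(row) for row in pred_density)
    return abs(pred_count - gt_count)

# A density map whose mass sums to 3.0, matched against a count of 3
loss = count_level_loss([[0.5, 1.0], [1.5, 0.0]], 3)  # 0.0
```

Because only the scalar count enters the loss, no point-wise or density-map annotation is needed, which is precisely what makes the paradigm compatible with coarse labels.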
2. Lightweight Network Design for General Tasks
Designing lightweight crowd-counting networks has become a mainstream way to make networks more applicable to resource-constrained devices. Compared to heavy models, lightweight models have low computation cost and fast processing time to meet real-time demands. Several works have designed lightweight models for general computer vision tasks such as object detection, semantic segmentation, and object classification, and their thought-provoking modules and innovative design principles inspired the researchers to optimize modules in the model for better performance. Howard et al. [22][23][24] proposed the MobileNet family, consisting of three generations. MobileNet's main contributions are the depth-wise separable convolution, which replaces a normal convolution with a channel-wise convolution followed by a point-wise convolution, and the inverted residual bottleneck, both designed to reduce computation. Building on MobileNet's structure, Han et al. [25][26] came up with a more efficient convolution architecture called GhostNet, which takes advantage of MobileNet's inverted bottleneck and depth-wise separable convolution and designs a Ghost Bottleneck that generates ghost features with cheap linear operations. In the Transformer domain, Zhang et al. [27] proposed MiniViT to compress the Vision Transformer by weight multiplexing, which consists of weight transformation and weight distillation. Bolya et al. [28] introduced Hydra Attention, a linear-attention scheme that can add more heads while keeping the computation amount the same as before. However, relying solely on the Transformer for compact model design is not competitive; in most cases, mixing a Transformer with a convolutional network is a better choice that takes advantage of both the CNN's small model size and the Transformer's high accuracy. In this direction, Chen et al. [29] presented Mobile-Former, a parallel design of MobileNet and Transformer aiming to fuse local features and global information bidirectionally. Pan et al. [30] designed EdgeViTs, a lightweight vision Transformer family that achieved accuracy–latency and accuracy–energy trade-offs in object recognition tasks. Chen et al. [31] designed a Mixing Block that combines local-window self-attention and depth-wise convolution in parallel to integrate features across windows and dimensions. Mehta et al. [32] replaced local processing in convolutions with global processing using Transformers to learn better representations with fewer parameters and simple training recipes. The hybrid networks mentioned above are designed for general purposes and are not specially tailored for crowd-counting tasks.
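To make the savings behind depth-wise separable convolutions and ghost modules concrete, a back-of-the-envelope parameter count can be sketched as follows (bias terms are omitted, and the layer shapes are illustrative, not taken from the cited papers):

```python
def standard_conv_params(c_in, c_out, k):
    """A standard k x k convolution learns one k x k x c_in filter
    per output channel."""
    return c_in * c_out * k * k

def dws_conv_params(c_in, c_out, k):
    """Depth-wise separable convolution: one k x k depth-wise filter per
    input channel, then a 1 x 1 point-wise convolution to mix channels."""
    return c_in * k * k + c_in * c_out

def ghost_module_params(c_in, c_out, k=1, d=3, s=2):
    """Ghost-style module: a primary conv produces c_out // s intrinsic
    maps; cheap d x d depth-wise ops generate (s - 1) 'ghost' maps each."""
    intrinsic = c_out // s
    return c_in * intrinsic * k * k + intrinsic * (s - 1) * d * d

# Mapping 64 -> 128 channels with 3 x 3 kernels
print(standard_conv_params(64, 128, 3))  # 73728
print(dws_conv_params(64, 128, 3))       # 8768
print(ghost_module_params(64, 128))      # 4672
```

Under these example shapes, the separable and ghost variants need roughly an order of magnitude fewer parameters than the standard convolution, which is the core reason these blocks appear in lightweight backbones.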
3. Lightweight Crowd-Counting Models
In the crowd-counting field, several lightweight models have been proposed and achieved promising performance at the time of their introduction; looking back at their designs, improvements can be put forward to boost performance. Shi et al. [14] proposed the lightweight C-CNN model, which uses three parallel layers with filters of different sizes to address scale variation. Compared to multi-branch architectures, it has a simpler structure and fewer parameters; however, such a simple structure may not handle complicated scenarios involving light variation, fake object representation, etc. Zhu et al. [3] implemented weight sharing in the scale feature extraction module of LSANet, sharply decreasing the parameters of a complicated network to a minimum level. However, it outputs density maps at three different scales for better guidance, which inevitably causes unnecessary computation cost; the researchers' weakly supervised network skips density map generation entirely and can achieve a better trade-off between accuracy and computation cost. Liang et al. [33] designed PDDNet, which is equipped with a multi-scale information extraction module built from lightweight pyramid dilated convolution (LPC) modules to extract global context information, and Dong et al. [34] designed a Multi-Scale Feature Extraction Network (MFENet) to model multi-scale information. Compared to these methods, the PPAM block in the researchers' model takes fewer computations to process scale-aware information, and its Swin-Transformer block is more effective at extracting global context information. Tian et al. [35] and Zhang et al. [36] added guidance branches to their lightweight models to learn localization information; such a technique needs precise head-location coordinates to guide the localization task, which is not mandatory in lightweight crowd-counting tasks. Regarding hybrid network architectures, Sun et al. [37] introduced Transformer blocks after each downscale convolution block to model scale-varied information stage by stage; it is not computationally efficient to introduce multiple Transformer blocks for the same purpose.
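A rough estimate illustrates why placing a global self-attention block at every stage is costly: the two quadratic matrix products in self-attention scale with the square of the token count, so early, high-resolution stages dominate the budget. The stage resolutions and channel widths below are hypothetical, and the linear Q/K/V projections are omitted for brevity:

```python
def global_attention_macs(n_tokens, dim):
    """Rough multiply-accumulate count of the two quadratic products in one
    global self-attention layer: Q @ K.T and softmax(QK.T) @ V."""
    return 2 * n_tokens * n_tokens * dim

# Hypothetical feature-map sides and widths at successive downscaling stages
for side, dim in [(56, 64), (28, 128), (14, 256)]:
    n = side * side
    print(side, global_attention_macs(n, dim))
```

With these shapes, the 56 × 56 stage alone is about 64 times more expensive than the 14 × 14 stage, which is why a single Transformer block applied to a low-resolution feature map can be far cheaper than one block per stage.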
4. Weakly Supervised Crowd-Counting Models
Some previous works have also concentrated on weakly supervised lightweight crowd-counting network designs, which are most relevant to the researchers' network. Yang et al. [19] made the first attempt to train a pure convolutional network without location-level annotations, but still relied on a sorting network with handcrafted soft labels. Wang et al. [38] proposed a weakly supervised network with a multi-granularity MLP based solely on count-level annotations; however, they introduced a ranking mechanism and designed auxiliary branches for self-supervision, causing excessive computation, which should be avoided in lightweight models. Wang et al. [39] proposed a joint CNN and Transformer network that also implements weakly supervised learning for efficient crowd counting. It uses a modified Swin-Transformer, with the patching layers discarded, for global feature modeling to compensate for the local features from VGGNet; however, it ignores scale-aware information at both the local and global scope, and the combination of VGGNet and Swin-Transformer blocks with patch embedding operations can cause loss of background context information.
From the recent lightweight crowd-counting works mentioned above, we can identify some common problems. First, lightweight convolutional models have often concentrated on designing various dilated convolution modules to expand the convolution's receptive field from a local area toward a global horizon. However, the Transformer inherently has a global receptive field; if its computation cost can be decreased, replacing the dilated convolution module with a Transformer would be a better choice. Second, some works retain crowd localization tasks in lightweight models, which require complicated and accurate annotations in datasets for training. Finally, weakly supervised lightweight crowd-counting models also encounter the issues that fully supervised ones face, and they require further improvement to fit coarse-grained annotations.
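The receptive-field argument above can be made concrete with a standard calculation for a stride-1 stack of convolutions (the kernel sizes and dilation rates below are illustrative, not taken from any cited model):

```python
def stacked_receptive_field(layers):
    """Receptive field (along one side) of a stride-1 stack of conv
    layers, each given as a (kernel_size, dilation) pair."""
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# Three 3 x 3 convolutions with dilation rates 1, 2, 4
print(stacked_receptive_field([(3, 1), (3, 2), (3, 4)]))  # 15
# The same stack without dilation
print(stacked_receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```

Even with geometrically growing dilation rates, the receptive field expands only layer by layer, whereas one self-attention layer connects every token pair at once; this is the trade-off motivating hybrid designs once attention's cost is brought down.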