Human pose estimation is a complex detection task in which the network must capture the rich information contained in images.
1. Introduction
Human pose estimation is a crucial task in computer vision, aiming to predict the anatomical keypoints of the human body in 2D images. With the advancement of deep convolutional neural networks, the performance of pose estimation models has improved significantly, and these models have gradually been applied to more complex scenarios, such as motion analysis [1][2][3] and human–computer interaction [4][5][6].
Currently, mainstream pose estimation models predominantly rely on convolutional neural networks (CNNs) as encoders to extract texture features. The extracted feature maps are then decoded into higher-resolution outputs using heatmap-based approaches or direct keypoint regression, a paradigm adopted by most pose estimation models. The Hourglass model [7], for instance, stacks multiple Hourglass modules, each combining symmetric up-sampling and down-sampling with intermediate supervision to generate high-resolution feature maps. HRNet [8] employs parallel branches for feature maps at different resolutions while always maintaining the highest-resolution branch. However, because convolutional kernels operate locally, CNNs have a limited receptive field that restricts their ability to capture global dependencies. Although CNNs excel at extracting texture features from images, they often struggle to learn spatial structure effectively, so the network fails to fully exploit the information contained in the image. These limitations greatly constrain the potential of CNN-based models.
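To make the heatmap-based decoding mentioned above concrete, the following minimal PyTorch sketch shows the standard decoding step: each keypoint is read off as the peak location of its predicted heatmap. The function name `decode_heatmaps` is illustrative, not taken from any of the cited works.

```python
import torch

def decode_heatmaps(heatmaps):
    """Decode keypoint coordinates from predicted heatmaps by taking the
    peak (argmax) location of each channel, one heatmap per keypoint.

    heatmaps: (B, K, H, W) tensor of per-keypoint confidence maps.
    Returns (B, K, 2) (x, y) pixel coordinates and (B, K) peak scores.
    """
    b, k, h, w = heatmaps.shape
    flat = heatmaps.view(b, k, -1)
    scores, idx = flat.max(dim=-1)                          # peak value and flat index
    xs = (idx % w).float()                                  # column index -> x
    ys = torch.div(idx, w, rounding_mode="floor").float()   # row index -> y
    return torch.stack([xs, ys], dim=-1), scores
```

Direct regression methods skip this step and predict the coordinates as continuous values, trading the spatial inductive bias of heatmaps for a simpler output head.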
In recent years, the Transformer [9] has achieved remarkable success in natural language processing (NLP), continuously breaking records on various leaderboards. As a sequence-to-sequence model, the Transformer exhibits strong capabilities for modeling dependencies between sequence elements, and in computer vision it excels at capturing the spatial structure of images. The introduction of the Vision Transformer (ViT) [10] marked the first application of the Transformer to computer vision: the authors divided images into small patches, flattened them into sequences, and trained a Transformer on these sequences. This simple yet effective approach quickly attracted the attention of many researchers. However, the high resolution of images poses computational challenges for pure Transformer methods, leading to the emergence of CNN + Transformer networks.
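The patch-to-sequence step at the core of ViT can be sketched in a few lines of PyTorch; the helper name `image_to_patch_sequence` is ours for illustration.

```python
import torch.nn.functional as F

def image_to_patch_sequence(images, patch_size=16):
    """Split images into non-overlapping patches and flatten each patch,
    turning an image into a token sequence as in ViT (before the linear
    projection and position embeddings).

    images: (B, C, H, W) with H and W divisible by patch_size.
    Returns (B, num_patches, C * patch_size**2).
    """
    # unfold extracts sliding blocks; with stride == kernel size they tile
    # the image into non-overlapping patches
    patches = F.unfold(images, kernel_size=patch_size, stride=patch_size)
    # (B, C * p * p, num_patches) -> (B, num_patches, C * p * p)
    return patches.transpose(1, 2)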
One of the most representative models in this category is TFPose [11]. The authors employ a CNN as the encoder, flatten the extracted features along the channel dimension, feed them into a Transformer, and finally regress the keypoint coordinates. Researchers regard CNN + Transformer as a favorable design that leverages the strengths of both networks, striking a balance between speed and accuracy. However, mainstream CNN + Transformer models are still in their early stages, leaving room for exploration in network integration and regression strategies, so the potential of the two networks is not fully realized. Based on these considerations, this research proposes a novel network architecture called MSTPose, which aims to address the limitations of existing models.
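A minimal sketch of this CNN + Transformer regression pipeline is given below. It illustrates the general paradigm under our own simplifying assumptions (a toy CNN encoder, a plain Transformer encoder, mean-pooled tokens); it is not the exact TFPose or MSTPose architecture.

```python
import torch.nn as nn

class CnnTransformerPose(nn.Module):
    """Illustrative CNN + Transformer regression pipeline: a CNN extracts a
    feature map, its spatial positions are flattened into a token sequence,
    a Transformer encoder models global dependencies, and a linear head
    regresses normalized (x, y) keypoint coordinates."""

    def __init__(self, num_keypoints=17, dim=256):
        super().__init__()
        # Toy CNN encoder standing in for a real backbone such as ResNet.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, dim, 3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_keypoints * 2)

    def forward(self, images):
        feat = self.encoder(images)                # (B, dim, h, w)
        b = feat.shape[0]
        tokens = feat.flatten(2).transpose(1, 2)   # (B, h*w, dim): one token per position
        tokens = self.transformer(tokens)          # global self-attention over positions
        coords = self.head(tokens.mean(dim=1))     # pool tokens, regress all keypoints
        return coords.view(b, -1, 2).sigmoid()     # (B, K, 2) in [0, 1]
```

The key point is that spatial positions of the CNN feature map become sequence tokens, so self-attention can relate any two image regions regardless of distance. In practice, positional encodings would be added to the tokens before the Transformer; they are omitted here for brevity.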
2. Convolutional Neural Network-Based Human Pose Estimation
In the field of human pose estimation, CNN-based methods have achieved tremendous success. Many early works extract image features using a CNN as the encoder. DeepPose [12] first introduces CNNs to the pose estimation problem, proposing a cascaded structure of deep neural networks. In SimpleBaseline [13], the authors apply transposed convolutions at the output of the backbone network to generate higher-resolution feature maps for better pose estimation.
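A sketch of such a head, following the configuration described in [13] (three transposed-convolution layers of 256 channels, each doubling the spatial resolution, followed by a 1x1 heatmap conv), might look as follows; the helper name is ours.

```python
import torch.nn as nn

def deconv_head(in_channels=2048, num_keypoints=17, num_layers=3, width=256):
    """SimpleBaseline-style head: a few transposed convolutions upsample the
    backbone's low-resolution feature map (kernel 4, stride 2, padding 1
    doubles the spatial size each time), then a 1x1 conv outputs one heatmap
    per keypoint."""
    layers = []
    for _ in range(num_layers):
        layers += [
            nn.ConvTranspose2d(in_channels, width, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
        ]
        in_channels = width
    layers.append(nn.Conv2d(width, num_keypoints, kernel_size=1))
    return nn.Sequential(*layers)
```

For example, applying `deconv_head()` to a (B, 2048, 8, 8) backbone output yields (B, 17, 64, 64) heatmaps.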
Because pose estimation differs from simple detection tasks, capturing the global dependencies between features is crucial. Varun Ramakrishna et al. [14] propose a sequential prediction algorithm that simulates a message-passing mechanism to predict the confidence of each variable (part), iteratively refining the estimates at each stage. Tompson et al. [15] exploit the structural relationships between human keypoints and incorporate the idea of Markov random fields to refine the predictions. Wei et al. [16] introduce the CPM (convolutional pose machines) network with VGG [17] as the backbone, employing a jointly trained multi-stage architecture with intermediate supervision to learn the dependencies between keypoints. George Papandreou et al. [18] propose a box-free system based on fully convolutional networks that learns keypoint offsets through a greedy decoding process and groups keypoints into human pose instances.
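The intermediate-supervision idea used in such multi-stage networks amounts to attaching the same ground-truth target to every stage's output, so gradients reach early stages directly rather than only through the end of the network. A minimal sketch (illustrative names, mean-squared-error loss assumed) follows.

```python
import torch.nn.functional as F

def multi_stage_loss(stage_heatmaps, target_heatmaps):
    """Intermediate supervision: the same ground-truth heatmaps supervise
    the output of every stage of a multi-stage network such as CPM.

    stage_heatmaps: list of (B, K, H, W) predictions, one per stage.
    target_heatmaps: (B, K, H, W) ground-truth heatmaps.
    """
    return sum(F.mse_loss(pred, target_heatmaps) for pred in stage_heatmaps)
```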
However, due to the local nature of convolution, the ability of CNNs to capture global dependencies is limited. An alternative is to enlarge the receptive field of the feature maps, which can be achieved in various ways, such as multi-scale fusion [19][20][21][22] and high-resolution representation [23]. Yilun Chen et al. [21] present a cascaded pyramid model that obtains multi-scale features and performs pose estimation by up-sampling to high-resolution feature maps. Bowen Cheng et al. [23] propose HigherHRNet, which utilizes transposed convolutions to obtain higher-resolution feature maps and thereby perceive small-scale objects.
As networks become increasingly complex, better methods are needed to capture image information more comprehensively. Compared to previous works that rely solely on CNNs, the emergence of the Transformer opens new possibilities for pose estimation.
3. Transformer-Based Human Pose Estimation
The Transformer is a feed-forward network based on the self-attention mechanism, which has achieved significant success in the field of NLP [24][25][26][27][28][29]. In recent years, with its introduction into the visual domain, researchers have witnessed the rise of the Transformer in computer vision [10][30][31].
In the field of image segmentation, W. Wang et al. [32] propose Attention-Guided Object Segmentation (AGOS) together with Dynamic Visual Attention Prediction (DVAP) for unsupervised video object segmentation. T. Zhou et al. [33] introduce MATNet, which employs a two-stream encoder to transform appearance features into motion-attentive features at each convolution stage; a bridge network fuses the multi-level feature maps, yielding better segmentation results.
In the domain of object detection, N. Carion et al. [30] present the DETR model, which achieves higher detection accuracy by incorporating a Transformer and employing a unique set-prediction loss. To address the slow convergence and limited feature spatial resolution of [30], X. Zhu et al. [31] propose Deformable DETR, whose attention module attends only to a small set of key sampling points around a reference, leading to improved performance.
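The core idea of [31], namely letting each query attend to only a handful of sampled locations instead of the whole feature map, can be sketched as follows. This single-head, single-scale version with illustrative names is our simplification of the actual multi-scale, multi-head module.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedDeformableAttention(nn.Module):
    """Illustrative single-head, single-scale deformable attention: each
    query predicts a few sampling offsets around its reference point and
    aggregates the bilinearly sampled features with learned weights."""

    def __init__(self, dim, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, num_points * 2)  # per-point (dx, dy)
        self.weight_proj = nn.Linear(dim, num_points)      # per-point attention weight
        self.value_proj = nn.Conv2d(dim, dim, 1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat):
        # queries: (B, N, C); ref_points: (B, N, 2) in [0, 1]; feat: (B, C, H, W)
        b, n, c = queries.shape
        value = self.value_proj(feat)
        offsets = self.offset_proj(queries).view(b, n, self.num_points, 2)
        weights = self.weight_proj(queries).softmax(dim=-1)            # (B, N, K)
        # Sampling locations, converted to [-1, 1] coords for grid_sample.
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1  # (B, N, K, 2)
        sampled = F.grid_sample(value, loc, align_corners=False)      # (B, C, N, K)
        out = (sampled * weights.unsqueeze(1)).sum(-1)                # (B, C, N)
        return self.out_proj(out.transpose(1, 2))                     # (B, N, C)
```

Because each query touches only `num_points` locations rather than all H x W positions, the cost per query is constant in the feature resolution, which is what alleviates the convergence and resolution issues of the original DETR attention.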
In the field of human pose estimation, S. Yang et al. [34] introduce TransPose, which uses a CNN as the encoder and incorporates a Transformer to precisely localize human keypoints, capturing both short- and long-range dependencies between keypoints. W. Mao et al. [11] propose TFPose, which builds upon [34] and directly regresses keypoint coordinates for pose estimation. K. Li et al. [35] develop the end-to-end PRTR model, which employs cascaded Transformer networks for direct keypoint regression. B. Shan et al. [36] propose the MSRT network, which splits and superimposes feature maps at different scales using the FAM module and utilizes a Transformer to decode keypoints.