Human Action Recognition Methods for Single-Modality Action Recognition: History

Human action recognition is widely studied in computer vision and is applied in intelligent video surveillance, intelligent human–computer interaction, robot control, video retrieval, pose estimation, and many other fields. Since the environments in which human actions are observed are diverse and complex, capturing effective features for action recognition remains a challenging problem.

  • action recognition
  • multimodality compensation
  • SlowFast pathways

1. Introduction

Human action recognition is widely studied in computer vision and is applied in intelligent video surveillance, intelligent human–computer interaction, robot control, video retrieval, pose estimation, and many other fields [1][2][3]. Since the environments in which human actions are observed are diverse and complex, capturing effective features for action recognition remains a challenging problem. Recently, several works that exploit the complementary information provided by the RGB and depth modalities [4][5][6] have made considerable progress.
Human action recognition in RGB videos has been extensively studied over the past decades. In the early years, behavioral features that represent the temporal and spatial changes of human actions were extracted from video by hand, mainly with methods based on spatiotemporal volumes [7][8], spatiotemporal interest points (STIP) [9], and trajectories [10][11]. Deep learning methods have a powerful ability to learn and analyze with complex network structures, which has made them the mainstream of current human action recognition research. Two-stream networks [12] can capture different types of information from different input modalities at little computational cost, but they struggle to learn complete action sequences. Several works [13][14] have tried dense temporal sampling to learn more motion information, but densely sampling frames can incur a high computational cost. By extracting features jointly across the spatial and temporal dimensions, 3D convolutional neural network (CNN)-based approaches achieve better performance in human action recognition; their drawback is the large number of parameters and high computational complexity. Transformer-based methods achieve remarkable results by modeling global dependencies, but they rely on pre-training on large-scale datasets and contain many parameters. Over the years, numerous studies have proposed the integration of RGB, depth, and skeleton modalities for accurate human action recognition, including the works of [4][15][16][17][18][19]. Decision-level fusion methods, which capture the unique features of each modality independently and combine them to produce the final classification score, have shown promising results.
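
As a rough illustration of the decision-level fusion idea mentioned above, the following Python/PyTorch sketch combines per-modality prediction scores into a final classification score. The modality names, fusion weights, and 60-class label space are illustrative assumptions, not taken from any of the cited works.

```python
# Minimal sketch of decision-level (late) fusion, assuming each modality
# network already outputs class logits for the same clip. Modality names,
# weights, and the class count are illustrative assumptions.
import torch
import torch.nn.functional as F

def late_fusion(logits_per_modality: dict, weights: dict) -> torch.Tensor:
    """Combine per-modality logits into a single fused class-score vector."""
    fused = None
    for name, logits in logits_per_modality.items():
        scores = F.softmax(logits, dim=-1) * weights.get(name, 1.0)
        fused = scores if fused is None else fused + scores
    return fused

# Example with random logits for a hypothetical 60-class label space.
logits = {m: torch.randn(1, 60) for m in ("rgb", "depth", "skeleton")}
fused = late_fusion(logits, {"rgb": 0.5, "depth": 0.3, "skeleton": 0.2})
predicted_class = fused.argmax(dim=-1)
```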

2. Single-Modality Action Recognition

Traditional methods, which extract spatio-temporal behavioral features from videos by hand, such as the Histogram of Oriented Gradients (HOG) [8] and STIP [9], have clear limitations. In recent years, deep learning methods have been able to learn and distinguish human action features from raw video frames, exhibiting superior representation capabilities and powerful performance. To address the challenge of capturing motion information, Ref. [20] proposed a temporal template that computes frame-to-frame differences to capture the entire motion sequence. Ref. [12] introduced a two-stream CNN model with a spatial network and a temporal network. To reduce the cost of computing optical flow, Ref. [21] proposed a method that accelerates the deep two-stream architecture by replacing optical flow with motion vectors. Furthermore, training 3D convolutional networks to explore spatiotemporal features has attracted considerable attention. Ref. [22] pioneered the use of 3D convolutional networks for simultaneous spatio-temporal feature learning in action recognition. In particular, C3D [23] developed an end-to-end framework that effectively adapts deep 3D convolutional networks to spatio-temporal feature learning from raw videos. However, C3D ignores long-term spatio-temporal dependencies in videos and performs poorly on standard benchmarks. Long-term Temporal Convolutions (LTC) are proposed in [14] to build long-term temporal structure by reducing the spatial resolution and increasing the temporal extent of the 3D convolutional layers. Ref. [24] proposed a non-local operation that captures long-term dependencies by modeling the correlation between any two locations in a feature map.
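
To make the non-local idea of [24] concrete, below is a minimal PyTorch sketch of a non-local block in the embedded-Gaussian style, where the response at each spatio-temporal position is a weighted sum over all positions in the feature map. The channel reduction factor and the use of 1×1×1 convolutions are common choices assumed here for illustration and may differ from any particular implementation.

```python
# Minimal sketch of a non-local block over a 3D (T, H, W) feature map.
import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2                       # reduced embedding dimension (assumed)
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)
        self.out = nn.Conv3d(inter, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T, H, W)
        b, c, t, h, w = x.shape
        q = self.theta(x).flatten(2)                # (b, c', T*H*W)
        k = self.phi(x).flatten(2)                  # (b, c', T*H*W)
        v = self.g(x).flatten(2)                    # (b, c', T*H*W)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # pairwise affinities
        y = (v @ attn.transpose(1, 2)).reshape(b, -1, t, h, w)
        return x + self.out(y)                      # residual connection

# Usage: y = NonLocalBlock3D(64)(torch.randn(2, 64, 8, 14, 14))
```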

3. Multimodality Action Recognition

In complex scenes, single-modality action recognition has limitations. To improve the accuracy and robustness of human action recognition, multi-modal information is combined to learn complementary features. The two-stream structure proposed by [12] is one solution to the problem of insufficient information from a single modality: the framework consists of a spatial network and a temporal network, whose prediction scores are merged to obtain the final result. Another approach to addressing the limitations of single-modality action recognition was proposed in [25], which developed a two-stream architecture that combines low-resolution RGB frames with high-resolution center crops. Since then, researchers have continued to build on these classic two-stream frameworks, exploring new ways to improve their performance. Ref. [26] proposed the temporal segment network (TSN), which performs sparse temporal sampling of the video and fuses the classification scores of the segments. Depth and RGB sequences are treated as a single entity in [27], and scene flow information is extracted from them: dynamic images are first generated from the feature sequences, different dynamic images are then input into separate convolutional neural networks, and their classification scores are finally fused for human action recognition. To explore complementary information, jointly training multiple networks has attracted considerable attention. Ref. [16] improved performance on videos by converting RGB and depth sequences into two pairs of dynamic images (one pair for RGB and one pair for depth) and jointly training a single neural network to recognize human actions from both types of dynamic images. A Modality Compensation Network (MCN) is proposed in [18] to explore common features between different modalities. To facilitate mutual learning of features extracted from dynamic images of multiple modalities, Ref. [28] proposed a Segment Cooperative ConvNet (SC-ConvNet), which utilizes a rank pooling mechanism [29] to construct these features. In another work, Ref. [30] introduced a cross-modality compensation block (CMCB) that improves the interaction between different modalities by jointly learning compensation features. To improve the performance of human action recognition, 3D convolutional models with two-stream or multi-stream designs have been studied [31][32][33][34]. The work of [31] designs a novel 3D convolutional two-stream network, the Inflated 3D ConvNet (I3D), which is built by inflating a 2D ConvNet. Feichtenhofer et al. [34] introduced a two-stream 3D convolutional framework comprising a slow pathway operating at a low frame rate to capture semantic information and a fast pathway operating at a high frame rate to capture motion.
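
As a simplified sketch of the slow/fast two-pathway design described for [34], the toy PyTorch module below feeds the same clip to a slow pathway (few frames, wider channels) and a fast pathway (all frames, lightweight channels) and fuses their pooled features for classification. The stride ratio, channel widths, single-convolution backbones, and fusion by concatenation are assumptions made for illustration, not the published architecture.

```python
# Toy slow/fast two-pathway model: one clip, two temporal sampling rates.
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    def __init__(self, num_classes: int, alpha: int = 8):
        super().__init__()
        self.alpha = alpha                                   # temporal stride ratio (assumed)
        self.slow = nn.Conv3d(3, 64, kernel_size=(1, 7, 7), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, 8, kernel_size=(5, 7, 7), padding=(2, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(64 + 8, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, T, H, W) sampled at the fast frame rate
        slow_in = clip[:, :, :: self.alpha]                  # low-frame-rate view for semantics
        s = self.pool(self.slow(slow_in)).flatten(1)         # slow pathway features
        f = self.pool(self.fast(clip)).flatten(1)            # fast pathway (motion) features
        return self.fc(torch.cat([s, f], dim=1))             # fuse and classify

# Usage: scores = TinySlowFast(num_classes=60)(torch.randn(2, 3, 32, 112, 112))
```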

4. Attention Mechanism for Video Understanding

In recent years, attention-based neural networks have attracted considerable attention in video understanding tasks such as video object segmentation [35] and person re-identification [36]. The work of [37] proposes using the Transformer model, originally designed for natural language processing, for image recognition tasks. To address the limitation of handling long token sequences in videos, Ref. [38] proposed a transformer-based approach that factorizes the components of the transformer encoder along the temporal and spatial dimensions. To improve attention efficiency, Ref. [39] proposes a directed-attention mechanism to understand human actions in their correct temporal order. A trajectory attention block is proposed in [40] to enhance the robustness of human action recognition in dynamic scenes; it generates a set of trajectory tokens along the spatial dimension and performs pooling operations along the temporal dimension. The work of [41] proposes a multiview transformer for video recognition that laterally connects multiple encoders to efficiently fuse mutual information from different features within the video. A self-supervised Video Transformer (SVT) is designed in [42], which learns cross-view correspondences and motion dependencies from video clips of different spatial extents and frame intervals. In addition, much work has proposed effective methods to address the memory and computational overhead of Transformer-based action recognition. To reduce memory and computation constraints, a Shifted Chunk Transformer (SCT) was designed in [43], which divides each frame into several local patches and processes them with image-chunk attention based on Locality-Sensitive Hashing (LSH). A Recurrent Visual Transformer (RViT) was proposed in [44] to reduce memory usage; it utilizes an attention gate mechanism and operates in a recurrent manner.
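
As an illustration of the factorized space-time attention described for [38], the PyTorch sketch below applies self-attention first over patch tokens within each frame and then over time at each spatial position. The token layout, embedding size, and use of torch.nn.MultiheadAttention are assumptions made for this sketch rather than the exact published design.

```python
# Factorized spatial-then-temporal self-attention over video patch tokens.
import torch
import torch.nn as nn

class FactorizedSpaceTimeAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames, patches, dim)
        b, t, p, d = tokens.shape
        x = tokens.reshape(b * t, p, d)                    # attend within each frame
        x, _ = self.spatial(x, x, x)
        x = x.reshape(b, t, p, d).permute(0, 2, 1, 3)      # group tokens by patch position
        x = x.reshape(b * p, t, d)                         # attend across time
        x, _ = self.temporal(x, x, x)
        return x.reshape(b, p, t, d).permute(0, 2, 1, 3)   # back to (batch, frames, patches, dim)

# Usage: out = FactorizedSpaceTimeAttention(dim=192)(torch.randn(2, 8, 196, 192))
```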

This entry is adapted from the peer-reviewed paper 10.3390/math11092115

References

  1. Wang, L.; Huynh, D.Q.; Koniusz, P. A comparative review of recent kinect-based action recognition algorithms. IEEE Trans. Image Process. 2019, 29, 15–28.
  2. Liu, F.; Xu, X.; Qiu, S.; Qing, C.; Tao, D. Simple to complex transfer learning for action recognition. IEEE Trans. Image Process. 2015, 25, 949–960.
  3. Song, X.; Lan, C.; Zeng, W.; Xing, J.; Sun, X.; Yang, J. Temporal-spatial mapping for action recognition. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 748–759.
  4. Shahroudy, A.; Ng, T.T.; Gong, Y.; Wang, G. Deep multimodal feature analysis for action recognition in RGB+D videos. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1045–1058.
  5. Liu, Y.; Lu, Z.; Li, J.; Yang, T. Hierarchically learned view-invariant representations for cross-view action recognition. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2416–2430.
  6. Liu, Z.; Cheng, J.; Liu, L.; Ren, Z.; Zhang, Q.; Song, C. Dual-stream cross-modality fusion transformer for RGB-D action recognition. Knowl.-Based Syst. 2022, 255, 109741.
  7. Zhang, Z.; Hu, Y.; Chan, S.; Chia, L.T. Motion context: A new representation for human action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Marseille, France, 12–18 October 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 817–829.
  8. Klaser, A.; Marszałek, M.; Schmid, C. A spatio-temporal descriptor based on 3D-gradients. In Proceedings of the British Machine Vision Conference (BMVC), Leeds, UK, 1–4 September 2008; pp. 1–10.
  9. Das Dawn, D.; Shaikh, S.H. A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis. Comput. 2016, 32, 289–306.
  10. Gaidon, A.; Harchaoui, Z.; Schmid, C. Activity representation with motion hierarchies. Int. J. Comput. Vis. 2014, 107, 219–238.
  11. Wang, H.; Oneata, D.; Verbeek, J.; Schmid, C. A robust and efficient video representation for action recognition. Int. J. Comput. Vis. 2016, 119, 219–238.
  12. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27, 568–576.
  13. Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 677–691.
  14. Varol, G.; Laptev, I.; Schmid, C. Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1510–1517.
  15. Wang, L.; Tong, Z.; Ji, B.; Wu, G. TDN: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1895–1904.
  16. Wang, P.; Li, W.; Wan, J.; Ogunbona, P.; Liu, X. Cooperative training of deep aggregation networks for RGB-D action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018; pp. 7404–7411.
  17. Khaire, P.; Kumar, P.; Imran, J. Combining CNN streams of RGB-D and skeletal data for human activity recognition. Pattern Recognit. Lett. 2018, 115, 107–116.
  18. Song, S.; Liu, J.; Li, Y.; Guo, Z. Modality compensation network: Cross-modal adaptation for action recognition. IEEE Trans. Image Process. 2020, 29, 3957–3969.
  19. Yue-Hei Ng, J.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4694–4702.
  20. Ijjina, E.P.; Chalavadi, K.M. Human action recognition in RGB-D videos using motion sequence information and deep learning. Pattern Recognit. 2017, 72, 504–516.
  21. Zhang, B.; Wang, L.; Wang, Z.; Qiao, Y.; Wang, H. Real-time action recognition with enhanced motion vector CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2718–2726.
  22. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231.
  23. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
  24. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
  25. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1725–1732.
  26. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 20–36.
  27. Wang, P.; Li, W.; Gao, Z.; Zhang, Y.; Tang, C.; Ogunbona, P. Scene flow to action map: A new representation for RGB-D based action recognition with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 595–604.
  28. Ren, Z.; Zhang, Q.; Cheng, J.; Hao, F.; Gao, X. Segment spatial-temporal representation and cooperative learning of convolution neural networks for multimodal-based action recognition. Neurocomputing 2021, 433, 142–153.
  29. Bilen, H.; Fernando, B.; Gavves, E.; Vedaldi, A. Action recognition with dynamic image networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2799–2813.
  30. Cheng, J.; Ren, Z.; Zhang, Q.; Gao, X.; Hao, F. Cross-modality compensation convolutional neural networks for RGB-D action recognition. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1498–1509.
  31. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
  32. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459.
  33. Zhou, Y.; Sun, X.; Zha, Z.J.; Zeng, W. MICT: Mixed 3D/2D convolutional tube for human action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 449–458.
  34. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211.
  35. Lu, X.; Wang, W.; Shen, J.; Crandall, D.; Luo, J. Zero-shot video object segmentation with co-attention siamese networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2228–2242.
  36. Wu, D.; Ye, M.; Lin, G.; Gao, X.; Shen, J. Person re-identification by context-aware part attention and multi-head collaborative learning. IEEE Trans. Inf. Forensics Secur. 2021, 17, 115–126.
  37. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  38. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846.
  39. Truong, T.D.; Bui, Q.H.; Duong, C.N.; Seo, H.S.; Phung, S.L.; Li, X.; Luu, K. Direcformer: A directed attention in transformer approach to robust action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 20030–20040.
  40. Patrick, M.; Campbell, D.; Asano, Y.; Misra, I.; Metze, F.; Feichtenhofer, C.; Vedaldi, A.; Henriques, J.F. Keeping your eye on the ball: Trajectory attention in video transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12493–12506.
  41. Yan, S.; Xiong, X.; Arnab, A.; Lu, Z.; Zhang, M.; Sun, C.; Schmid, C. Multiview transformers for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3333–3343.
  42. Ranasinghe, K.; Naseer, M.; Khan, S.; Khan, F.S.; Ryoo, M.S. Self-supervised video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2874–2884.
  43. Zha, X.; Zhu, W.; Xun, L.; Yang, S.; Liu, J. Shifted chunk transformer for spatio-temporal representational learning. Adv. Neural Inf. Process. Syst. 2021, 34, 11384–11396.
  44. Yang, J.; Dong, X.; Liu, L.; Zhang, C.; Shen, J.; Yu, D. Recurring the transformer for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14063–14073.