Human Action Recognition Methods for Single-Modality Action Recognition: History

Human action recognition is widely used in computer vision applications such as intelligent video surveillance, intelligent human–computer interaction, robot control, video retrieval, pose estimation, and many other fields. Since the environments in which human action recognition is deployed are diverse and complex, capturing effective features for action recognition remains a challenging problem.

  • action recognition
  • multimodality compensation
  • SlowFast pathways

1. Introduction

Human action recognition is widely used in computer vision applications such as intelligent video surveillance, intelligent human–computer interaction, robot control, video retrieval, pose estimation, and many other fields [1,2,3]. Since the environments in which human action recognition is deployed are diverse and complex, capturing effective features for action recognition remains a challenging problem. Recently, several works exploiting the complementary information provided by the RGB and depth [4,5,6] modalities have made considerable progress.
Human action recognition in RGB videos has been extensively studied over the past decades. Early methods manually extracted behavioral features that represent the temporal and spatial changes of human actions in the video, mainly using spatiotemporal volumes [7,8], spatiotemporal interest points (STIP) [9], and trajectories [10,11]. Deep learning methods have a powerful ability to learn and analyze with complex network structures, which has made them the mainstream of current human action research. Two-stream-based networks [12] can capture different types of information from different input modalities at little computational cost, but they struggle to learn complete action sequences. Several works [13,14] have tried dense temporal sampling to learn more motion information, but sampling many frames may incur high computational costs. By extracting richer features along the spatiotemporal dimensions, 3D convolutional neural network (CNN)-based approaches achieve better performance in human action recognition; their drawback is a large number of parameters and high computational complexity. Transformer-based methods achieve remarkable results through their global connectivity, but they rely on pre-training on large-scale datasets and train with many parameters. Over the years, numerous studies have proposed integrating the RGB, depth, and skeleton modalities for accurate human action recognition [4,15,16,17,18,19]. Decision-level fusion methods, which capture the unique features of each modality independently and combine them to produce the final classification score, have shown promising results.
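The decision-level fusion described above can be sketched as a weighted average of per-modality classification scores. The sketch below is a minimal illustration, not the method of any cited paper; the logits, class count, and equal weighting are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decision_level_fusion(scores_per_modality, weights=None):
    """Late fusion: each modality (RGB, depth, skeleton, ...) is
    classified independently; the per-class probabilities are then
    combined by a weighted average to produce the final prediction."""
    probs = np.stack([softmax(s) for s in scores_per_modality])
    if weights is None:  # equal weights by default (an assumption)
        weights = np.full(len(probs), 1.0 / len(probs))
    fused = weights @ probs
    return int(np.argmax(fused)), fused

# Hypothetical logits for a 3-class problem from two modalities.
rgb_logits = np.array([2.0, 0.5, 0.1])
depth_logits = np.array([0.3, 1.5, 0.2])
label, fused = decision_level_fusion([rgb_logits, depth_logits])
print(label)  # -> 0: RGB's confident vote outweighs depth's
```

Learned or validation-tuned weights can replace the uniform average when one modality is known to be more reliable.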

2. Single-Modality Action Recognition

In videos, traditional methods that extract spatio-temporal behavioral features manually, such as the Histogram of Oriented Gradients (HOG) [8] and STIP [9], have limitations. In recent years, deep learning methods have been able to learn and distinguish human action features from raw video frames, exhibiting superior representation capabilities and powerful performance. To address the challenge of capturing motion information, Ref. [21] proposed a temporal template that computes frame-to-frame differences to capture the entire motion sequence. Ref. [12] introduced a two-stream CNN model with a spatial network and a temporal network. To reduce the cost of computing optical flow, Ref. [22] proposed a method that accelerates the deep two-stream architecture by replacing optical flow with motion vectors. Furthermore, training 3D convolutional networks to explore spatiotemporal features has attracted considerable attention. To simultaneously understand spatio-temporal features in videos, Ref. [23] pioneered the use of 3D convolutional networks for spatio-temporal feature learning in action recognition. In particular, C3D [24] developed an end-to-end framework that effectively adapts deep 3D convolutional networks to spatio-temporal feature learning from raw videos. However, C3D ignores long-term spatio-temporal dependencies in videos and performs poorly on standard benchmarks. A Long-term Temporal Convolution (LTC) is proposed in [14] to build long-term temporal structure by reducing the spatial resolution and increasing the temporal extent of the 3D convolutional layers. Ref. [25] proposed a non-local operation that captures long-term dependencies by modeling the correlation between any two locations in a feature map.
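The non-local operation of [25] can be sketched as self-attention over every position of a flattened feature map: each output position aggregates features from all positions, weighted by pairwise similarity. The NumPy sketch below uses the embedded-Gaussian form with random weights; the single-head form, the layer sizes, and the absence of subsampling are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """Embedded-Gaussian non-local operation over a flattened
    spatio-temporal feature map x of shape (positions, channels).
    Every output position attends to *all* positions, which is the
    long-range dependency modeling described above."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    attn = softmax(theta @ phi.T, axis=-1)  # (N, N) pairwise weights
    y = attn @ g                            # aggregate over all positions
    return x + y @ w_out                    # residual connection

rng = np.random.default_rng(0)
n, c, c_half = 6, 8, 4                      # 6 positions, 8 channels (toy sizes)
x = rng.normal(size=(n, c))
y = non_local_block(x,
                    rng.normal(size=(c, c_half)),
                    rng.normal(size=(c, c_half)),
                    rng.normal(size=(c, c_half)),
                    rng.normal(size=(c_half, c)))
print(y.shape)  # (6, 8): output shape matches the input, as in a residual block
```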

3. Multimodality Action Recognition

In complex scenes, single-modal action recognition has limitations. To improve the accuracy and robustness of human action recognition, multi-modal information is combined to learn complementary features. The two-stream structure proposed by [12] is one solution to the insufficient information provided by a single modality. This framework consists of a spatial network and a temporal network, whose prediction scores are merged to obtain the final result. Another approach to addressing the limitations of single-modal action recognition was proposed by [26], who developed a two-stream architecture incorporating low-resolution RGB frames and high-resolution center crops. Since then, researchers have continued to build on these classic two-stream frameworks, exploring new ways to improve their performance. Ref. [27] proposed a temporal segment network (TSN), which performs sparse temporal sampling of the video and fuses the classification scores of the segments. Depth sequences and RGB are treated as a single entity in [28], and scene flow information is extracted from them: first, dynamic images are generated from feature sequences; then, different dynamic images are input into two different convolutional neural networks; finally, their classification scores are fused for human action recognition. To explore complementary information, jointly training multiple networks has attracted considerable attention. Ref. [16] improved performance by converting RGB and depth sequences into two pairs of dynamic images (one pair for RGB and one pair for depth) and jointly training a single neural network to recognize human actions using both types of dynamic images. A Modality Compensation Network (MCN) is proposed in [18] to explore common features between different modalities. To facilitate mutual learning of features extracted from dynamic images of multiple modalities, Ref. [29] proposed a novel Segment Cooperative ConvNet (SC-ConvNet), which utilizes a rank pooling mechanism [30] to construct these features. In another work, Ref. [31] introduced a cross-modality compensation block (CMCB) that improves the interaction between different modalities by jointly learning compensation features. To improve the performance of human action recognition, 3D convolutional models with two-stream or multi-stream designs have been studied [32,33,34,35]. The work of [32] designs a novel 3D convolutional two-stream network, the Inflated 3D ConvNet (I3D), which is based on inflating 2D ConvNets. Feichtenhofer [35] introduced a two-stream 3D convolutional framework comprising a slow pathway operating at a low frame rate to capture semantic information and a fast pathway operating at a high frame rate to capture motion information.
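The slow/fast dual-rate sampling of [35] can be illustrated with a minimal sketch: the two pathways read the same clip at different temporal strides, with the fast pathway seeing α times more frames. The stride values, the toy clip, and the omission of the actual 3D CNN pathways and lateral connections are simplifying assumptions.

```python
import numpy as np

def sample_pathways(video, alpha=8, tau=16):
    """SlowFast-style dual-rate sampling: the slow pathway takes
    every tau-th frame (semantic information), while the fast pathway
    takes every (tau // alpha)-th frame (motion information), i.e.
    alpha times more frames from the same clip."""
    slow = video[::tau]
    fast = video[::tau // alpha]
    return slow, fast

# A hypothetical 64-frame clip of 2x2 single-channel frames.
video = np.arange(64 * 4, dtype=np.float32).reshape(64, 2, 2)
slow, fast = sample_pathways(video)
print(len(slow), len(fast))  # 4 32: the fast pathway sees 8x more frames
```

In the full architecture each pathway feeds its own 3D CNN, with lateral connections fusing fast-pathway features into the slow pathway before the final classifier.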

4. Attention Mechanism for Video Understanding

In recent years, attention-based neural networks for video understanding have attracted considerable attention in tasks such as person re-identification [36] and video object segmentation [37]. The work of [38] proposes using the Transformer model, originally designed for natural language processing, for image recognition tasks. To address the limitation of handling long token sequences in videos, Ref. [39] proposed a transformer-based approach that factorizes the components of the transformer encoder along the temporal and spatial dimensions. To improve attention efficiency, Ref. [40] proposes a novel directed-attention mechanism to understand human actions in their exact temporal order. A trajectory attention block is proposed in [41] to enhance the robustness of human action recognition in dynamic scenes; it generates a set of specific trajectory markers along the spatial dimension and performs pooling operations along the temporal dimension. The work of [42] proposes a multi-view transformer for video recognition that laterally connects multiple encoders to efficiently fuse mutual information from different features within the video. A self-supervised Video Transformer (SVT) is designed by [43], which learns cross-view information and motion dependencies from video clips of different spatial extents and frame intervals. In addition, many works have proposed effective methods to address the memory and computational overhead of Transformer-based action recognition. To reduce memory and computation constraints, a Shifted Chunk Transformer (SCT) was designed by [44], which divides each frame into several local patches and processes them with Locality-Sensitive Hashing (LSH) attention. A Recurrent Visual Transformer (RViT) was proposed by [45] to reduce memory usage; it utilizes an attention gate mechanism and operates in a recurrent manner.
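The spatial/temporal factorization of [39] can be sketched in NumPy: self-attention is first applied among the patches within each frame, then across frames at each patch position, reducing the cost from O((T·S)²) for joint attention to O(T·S² + S·T²). The toy shapes and the omission of learned projections, multiple heads, and MLP layers are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Plain scaled dot-product self-attention over the next-to-last axis,
    applied independently along all leading (batch-like) axes."""
    d = x.shape[-1]
    attn = softmax(x @ np.swapaxes(x, -1, -2) / np.sqrt(d), axis=-1)
    return attn @ x

def factorized_space_time(tokens):
    """tokens: (T frames, S patches per frame, d channels).
    Step 1: spatial attention among patches within each frame.
    Step 2: temporal attention across frames at each patch position."""
    x = self_attention(tokens)        # (T, S, d): attend over patches
    x = np.swapaxes(x, 0, 1)          # (S, T, d)
    x = self_attention(x)             # attend over time
    return np.swapaxes(x, 0, 1)       # back to (T, S, d)

rng = np.random.default_rng(1)
tokens = rng.normal(size=(4, 9, 8))   # 4 frames, 9 patches, 8 channels (toy)
out = factorized_space_time(tokens)
print(out.shape)  # (4, 9, 8)
```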

This entry is adapted from the peer-reviewed paper 10.3390/math11092115
