Taxonomy for Skeleton-GNN-Based Human Action Recognition

Human action recognition has been applied in many fields, such as video surveillance and human–computer interaction. Connecting the skeleton joints according to the physical structure of the human body naturally generates a graph. A new taxonomy for skeleton-GNN-based methods is proposed according to their designs, and their merits and demerits are analyzed.

  • skeleton graphs
  • human action recognition
  • taxonomy
  • GNN

1. Introduction

Human action recognition (HAR), aiming at automatically detecting human activities, has become increasingly popular, especially since being armed with deep learning, large-scale data and greater computational resources. Typically, HAR holds great value in video surveillance [1,2], human–computer interaction (HCI) [3,4,5], virtual reality [6,7,8], security [9] and so forth.
HAR is supported by multiple modalities. One kind of modality is structured data, e.g., images or videos, together with auxiliary data such as semantic information. The common use of sensors (including cameras) and cloud databases makes structured data easy to capture and share. Moreover, such data are visually or semantically informative, e.g., the shape or motion differences of subjects, the space–time trajectory [10] and the names of joints.
With the help of carefully designed representation learners, such as deep-learning (DL) models, informative representations are obtained in a task-related way so as to solve the problem more accurately. However, the performance is upper-bounded by the data, which place little emphasis on the intrinsic relations between skeleton joints. The other kind of modality is unstructured, non-Euclidean data, such as human skeletons. Extractors, e.g., OpenPose, Google PoseNet and Nuitrack, are capable of working in real time and can thus generate skeleton graphs in sufficient quantity.
These poses contain intrinsic information across spatial joints and temporal frames, as well as 3D information when depth data are available. Additionally, compared with an image, whose storage space is proportional to its width, height and number of channels, a skeleton only requires the 3D coordinates and a confidence score for every joint, and there are normally no more than 30 joints, which decreases the storage cost significantly.
Moreover, while image-based methods suffer from varying brightness, changing backgrounds, chromatic differences, different subjects, etc., 3D skeletons work across various scenes once they are detected. As HAR should assign the same label to the same activity even when it is performed by different persons under different conditions or styles, the skeleton graph is undoubtedly a promising choice.
Models to find representations of human skeletons are classified into three categories.
The traditional approach uses handcrafted descriptors, such as principal component analysis (PCA) based on 3D position differences of joints [11], or selecting joint pairs by the top-K Relative Variance of Joint Relative Distance [12]. These descriptors are interpretable; however, they tend to extract shallow, simple features and normally fail to capture significant deep features.
The second approach recasts the task as a deep-learning problem in Euclidean space, e.g., serializing the graph nodes into a sequence and then adopting the well-known Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), etc. In this way, deep features are extracted mechanically but without attention to the intrinsic spatial and temporal relations between graph joints; e.g., the serialization of joints ignores their natural structure in skeletons.
Recently, Graph Neural Networks (GNNs), especially graph convolutional networks (GCNs), have come into the spotlight and have been applied to skeleton graphs. The earliest milestone is ST-GCN [13]. Thereafter, multiple works based on ST-GCN were proposed. Among them, 2s-AGCN [14], which adopted an attention mechanism, is another typical work. As GNNs excel at discovering the intrinsic relations between joints, GNN-based HAR methods have achieved a new state of the art (SOTA).

2. Spatial-Based Approaches

Approaches in this category take the GNN as a spatial feature extractor, and the temporal evolution is handled by other modules. Two major candidates have been proposed to evolve states in the temporal dimension. One category is the traditional conditional random field (CRF) methods, including the hidden CRF (HCRF). The other prefers the RNN family, such as the vanilla RNN, the long short-term memory network (LSTM) and the Gated Recurrent Unit (GRU).

2.1. CRF

CRF is an undirected graphical model whose nodes are divided into exactly two disjoint sets $X$ and $Y$, the observed and output variables, respectively; the conditional distribution $p(Y|X)$ is then modeled. It is suitable for labeling action sequences, since Markov-chain models are able to track the evolution along the temporal dimension.
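For concreteness, a linear-chain CRF over a label sequence of length $T$ models the conditional distribution as follows (a standard textbook formulation, not tied to any cited work):

$$p(Y \mid X) = \frac{1}{Z(X)} \prod_{t=1}^{T} \exp\Big(\sum_{k} \lambda_k f_k(y_{t-1}, y_t, X, t)\Big),$$

where the $f_k$ are feature functions with weights $\lambda_k$, and $Z(X)$ is the normalizer obtained by summing the same product over all possible label sequences. Training minimizes the negative conditional log-likelihood $-\log p(Y \mid X)$, which is the rule referred to below.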
K. Liu et al. [87,90] argued that GCN is powerful at extracting spatial information but weak at state evolution, and therefore performed HCRF on the extracted features. After features are obtained by the GCN, the HCRF learns hidden states on each node and performs directed message passing on these hidden states. Finally, the label for an action sequence sample is determined by minimizing the negative conditional log-likelihood. By viewing the skeleton graph as a conditional random field, K. Liu et al. [63] adopted a CRF-based loss function to improve performance.

2.2. RNN

Although the CRF works as a graph model and handles state evolution, there are situations where the dynamics are non-Markovian; for example, the current state may rely on the states from all previous timesteps. This is why the RNN was proposed and became popular. The RNN family is capable of preserving the relationships between states over many timesteps, whereas the CRF covers only $k$ predefined timesteps. Within the family, the LSTM alleviates the gradient explosion and gradient vanishing problems of the vanilla RNN, while the GRU can be regarded as a simplification of the LSTM.
The RNN-based methods are classified into the separated strategy, the bidirectional strategy and the aggregated block.

Separated Strategy

Some methods perform spatial information extraction, usually with a GCN (either in the spectral or the spatial domain), and perform state evolution separately. In [88], to further encode continuous motion variations, the deep features learned from skeleton graphs by a spectral GCN were gathered along consecutive temporal slices and then fed into a recurrent gated network. Finally, the recurrent temporal encoding was integrated with the spectral graph filtering and action-attending for joint training.
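The separated strategy can be illustrated with a minimal PyTorch sketch: a graph convolution extracts spatial features per frame, and an LSTM evolves them over time. The layer sizes, joint count and the normalized adjacency `a_hat` are illustrative assumptions, not taken from any cited paper.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: X' = ReLU(A_hat @ X @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, a_hat):
        # x: (batch, joints, in_dim); a_hat: (joints, joints) normalized adjacency
        return torch.relu(self.linear(a_hat @ x))

class SeparatedGCNLSTM(nn.Module):
    """Per-frame spatial GCN followed by a separate temporal LSTM."""
    def __init__(self, in_dim=3, hid_dim=64, num_classes=60, joints=25):
        super().__init__()
        self.gcn = GCNLayer(in_dim, hid_dim)
        self.lstm = nn.LSTM(hid_dim * joints, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, num_classes)

    def forward(self, x, a_hat):
        # x: (batch, time, joints, 3) skeleton coordinate sequence
        b, t, v, c = x.shape
        feats = self.gcn(x.reshape(b * t, v, c), a_hat)  # spatial features per frame
        feats = feats.reshape(b, t, -1)                  # flatten joints per frame
        out, _ = self.lstm(feats)                        # temporal state evolution
        return self.head(out[:, -1])                     # classify from the last state
```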
R. Zhao et al. [69] performed GCN and LSTM separately; the spatial information extracted by the GCN in each frame was directly fed into an LSTM cell. Z. Y. Xu et al. [91] proposed combining reinforcement learning (RL) with LSTM as a feature selection network (FSN) consisting of a policy network and a value network. To be precise, both the policy network and the value network are based on LSTM for sequential action or value generation. The feature selection is done along the temporal dimension, and the input features are the spatial features from the GCN.
S. Xu et al. [92] worked on two-subject interaction graphs. After performing GNN on the skeleton graphs in one frame to extract spatial information, an attention-equipped LSTM is performed at the joint level, person level and scene level so as to pass information at different scales. To leverage these three types of features, a Concurrent-LSTM (Co-LSTM) is applied to further balance their temporal dynamics for action recognition.
M. Li et al. [77] used a GRU to update the joint features while inferring the future pose conditioned on the A-links and previous actions. The prediction from the GRU evolution was then processed and later adopted by the GNN.
In the work proposed by J.M. Yu et al. [93], an RNN was used as an autoregressive model to predict the hidden state of noisy skeleton graphs; the hidden state was later used to predict the action class. Q.Q. Huang et al. [94] worked with the same idea except for changing the basic GNN to an attention-equipped GNN. Others, such as [62,64,95], extract state-evolution information similarly after various GNN modules, though not based on attention-equipped GCNs.

Bidirectional Strategy

Considering the bidirectional information of video sequences, some methods use a bidirectional LSTM to keep forward and backward information simultaneously.
In order to utilize past and future temporal information, X.L. Ding et al. [96] chose a bidirectional RNN to model skeleton sequences and adopted it before extracting spatial information with a GNN. To capture the temporal contextual information over frames, J.L. Gao et al. [53] provided a context-aware module consisting of bidirectional LSTM cells, aiming at modeling temporal dynamics and dependencies based on the learned spatial latent nodes.
Beyond the basic bidirectional LSTM, J. Huang et al. [97] deployed GCN inside the LSTM to enhance its ability to extract spatial features. Precisely, they proposed an LSGM, which consists of one original LSTM cell followed by two GCN layers. The LSGM was then used to build Bi-Direction LSGM modules, each comprising a forward LSGM and a reverse LSGM. The two LSGMs work in parallel, and their outputs are added together and passed to the next layer, as sketched below.
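The bidirectional pattern can be sketched as follows: a forward branch and a reverse branch run in parallel and their outputs are summed. Each branch is simplified here to a plain LSTM; in [97] each branch is an LSGM (an LSTM cell followed by two GCN layers), which this sketch deliberately omits.

```python
import torch
import torch.nn as nn

class BiDirectionalSum(nn.Module):
    """Forward and reverse recurrent branches whose outputs are added."""
    def __init__(self, dim=64):
        super().__init__()
        self.fwd = nn.LSTM(dim, dim, batch_first=True)
        self.rev = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, x):
        # x: (batch, time, dim) per-frame features
        out_f, _ = self.fwd(x)
        out_r, _ = self.rev(torch.flip(x, dims=[1]))  # process reversed sequence
        out_r = torch.flip(out_r, dims=[1])           # re-align to forward time order
        return out_f + out_r                          # sum the two directions
```

Two explicit branches are used (rather than `nn.LSTM`'s `bidirectional=True` flag) because the design described above sums the two directions, whereas the built-in flag concatenates them.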

Aggregated Block

Some argue that the extraction of spatial and temporal information can be stacked together as a basic building block, while still processing the spatial information before performing the temporal modeling. Papers [89,98] integrated GCN with LSTM; in other words, each gate in the LSTM (namely, the input gate, forget gate and output gate) is armed with a GCN, so that the LSTM operates directly on the spatial information extracted from each frame.
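A minimal sketch of such a cell follows: each gate pre-activation applies a graph convolution A_hat @ (.) @ W instead of a plain linear map, so the recurrence runs directly over per-frame graph features. The shapes and the single shared adjacency are illustrative assumptions, not the exact design of [89,98].

```python
import torch
import torch.nn as nn

class GraphConvLSTMCell(nn.Module):
    """LSTM cell whose gates are computed with graph convolutions."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        # One joint transform producing all four gate pre-activations at once.
        self.w_x = nn.Linear(in_dim, 4 * hid_dim)
        self.w_h = nn.Linear(hid_dim, 4 * hid_dim)

    def forward(self, x, h, c, a_hat):
        # x: (batch, joints, in_dim); h, c: (batch, joints, hid_dim)
        gates = a_hat @ self.w_x(x) + a_hat @ self.w_h(h)  # graph-convolved gates
        i, f, o, g = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # cell state update
        h = o * torch.tanh(c)           # hidden state update
        return h, c
```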

3. Spatiotemporal Approaches

The methods mentioned above tackle spatial and temporal information separately. However, spatial and temporal information are correlated. For example, getting up from a bed and lying down on it have similar spatial information distributed at different timestamps.

3.1. CNN

ST-GCN is a typical spatiotemporal approach, since it performs GCN on the spatiotemporal graph (STG) directly and therefore extracts spatiotemporal information simultaneously. Methods such as [29,48,54,60,68,82,86,96,101,102,103,104,105,106,107,108,109,110,111,112,113] are all developed based on ST-GCN. Methods based on AGCN also work on the STG, such as [66,73,93,114]. However, one drawback of these methods is that they only perform spatiotemporal extraction over a predefined temporal size (the kernel size of the CNN in the temporal dimension); therefore, multi-scale temporal information cannot be handled, as the sketch below makes explicit.
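A minimal sketch of an ST-GCN-style block shows where the fixed temporal receptive field comes from: a spatial graph convolution per frame is followed by a temporal convolution with a fixed kernel size. ST-GCN's partition strategies and learnable edge-importance weighting are omitted here for brevity.

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Spatial graph convolution followed by a fixed-size temporal convolution."""
    def __init__(self, in_ch, out_ch, t_kernel=9):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # joint-wise transform
        self.temporal = nn.Conv2d(out_ch, out_ch,
                                  kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))   # fixed temporal window
        self.relu = nn.ReLU()

    def forward(self, x, a_hat):
        # x: (batch, channels, time, joints); a_hat: (joints, joints)
        x = self.spatial(x)                           # per-joint feature transform
        x = torch.einsum('bctv,vw->bctw', x, a_hat)   # aggregate over the skeleton graph
        return self.relu(self.temporal(x))            # convolve along the time axis only
```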
To work on multiple timescales dynamically, so as to take long-term or short-term dependencies into consideration, P. Ghosh et al. [67] also used the STG but allowed flexible temporal connections that can span multiple timesteps. For example, the left-arm joint at timestep $t$ can have connections with the left-arm joint at timesteps $t+1, t+2, \cdots$, rather than only at $t+1$ as in ST-GCN. Their method is based on Hourglass (a CNN framework) combined with ST-GCN.
Z.T. Zhang et al. [99] attempted to handle temporal information with two gated temporal convolutional networks (TCNs), herein a 1D CNN and a 2D CNN, with tanh and sigmoid activation functions working as gates. They argued that the TCN resists overfitting to some extent, since it inherits the stable gradients of CNNs. After filtering in the temporal dimension, the outputs are combined and then processed by a GCN and an MLP.
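The tanh/sigmoid gating can be sketched as follows: one temporal convolution passed through tanh provides the content and a second through sigmoid acts as the gate, with their element-wise product as the output. The kernel size and channel counts are illustrative assumptions, not the configuration of [99].

```python
import torch
import torch.nn as nn

class GatedTCN(nn.Module):
    """Gated temporal convolution: tanh(conv(x)) * sigmoid(conv(x))."""
    def __init__(self, channels, t_kernel=9):
        super().__init__()
        pad = (t_kernel // 2, 0)
        self.content = nn.Conv2d(channels, channels, (t_kernel, 1), padding=pad)
        self.gate = nn.Conv2d(channels, channels, (t_kernel, 1), padding=pad)

    def forward(self, x):
        # x: (batch, channels, time, joints)
        return torch.tanh(self.content(x)) * torch.sigmoid(self.gate(x))
```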
In addition to making progress on the temporal dimension, some approaches attempted to modify the GNN to take multiple spatiotemporal scales into consideration. Z. Hu et al. [100] established dependence relationships between different bone nodes with a bone-joint module based on multiscale dynamic aggregated GCNs, which describe and aggregate the bone-joint semantic information. In this way, the spatial information and the multiscale temporal information are all handled together.

3.2. RNN

Based on GCN, to tackle long-term information, W.W. Ding et al. [83] used an LSTM as the vertex updater during message passing. The features of each vertex thus contain temporal information, so spatiotemporal information is handled simultaneously.
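A minimal sketch of this idea follows: at each frame, every joint aggregates its neighbors' features and feeds the aggregate into a shared LSTM cell, so vertex states accumulate temporal context during message passing. Mean-style aggregation via a row-normalized adjacency is an illustrative assumption, not the exact scheme of [83].

```python
import torch
import torch.nn as nn

class LSTMVertexUpdater(nn.Module):
    """Message passing with an LSTM cell as the per-vertex update function."""
    def __init__(self, dim=64):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)

    def forward(self, x_seq, a_hat):
        # x_seq: (time, joints, dim); a_hat: (joints, joints) row-normalized adjacency
        v, d = x_seq.shape[1], x_seq.shape[2]
        h = x_seq.new_zeros(v, d)
        c = x_seq.new_zeros(v, d)
        for x_t in x_seq:                    # iterate over frames
            msg = a_hat @ x_t                # aggregate neighbor features
            h, c = self.cell(msg, (h, c))    # per-vertex recurrent update
        return h                             # final per-vertex spatiotemporal state
```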

4. Generated Approaches

The generated approaches cover two categories: one includes self-supervised methods, also known as unsupervised methods, and the other is neural architecture search (NAS), which aims at generating the best model by combining candidate components.
Both categories work in a non-end-to-end way. Self-supervised methods first use priors, such as pretext tasks, to produce a pretrained model, and then adapt it to the target task. NAS aims at generating the best model for the target task: it first explores combinations of the given components and chooses the best model among them; the chosen model is then fine-tuned on the target task.

4.1. Self-Supervised

Self-supervised learning is a means of training models without manually labeled data. It is a member of the unsupervised learning family, where the outputs or goals are derived by the machines themselves; the machines are thus capable of labeling, categorizing and analyzing information on their own and then drawing conclusions based on connections and correlations. We classify the methods in this category as autoencoder (AE)-based, adversarial learning and the teacher–student mechanism.

AE

M. Li et al. [77] built an A-links inference module (AIM) based on an AE, where the output of the encoder is the probability of each joint pair having a type-$c$ link, and the decoder takes the encoder's output together with the joint positions in the previous frame. The loss of the AIM is thus the difference between the decoder's prediction and the corresponding part of the encoder's input. In this way, no labeled data other than the input poses are required when pretraining the AIM.
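A toy sketch of this autoencoder idea follows: an encoder maps a pair of joint features to link-type probabilities, and a decoder uses those probabilities plus the previous-frame positions to predict the current positions, so the reconstruction error alone supervises pretraining. All layer sizes and the flat pairwise layout are illustrative assumptions, not the architecture of [77].

```python
import torch
import torch.nn as nn

class TinyAIM(nn.Module):
    """Encoder infers link-type probabilities; decoder reconstructs positions."""
    def __init__(self, feat_dim=16, link_types=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * feat_dim, link_types),
                                     nn.Softmax(dim=-1))
        self.decoder = nn.Linear(link_types + 3, 3)

    def forward(self, pair_feats, prev_pos):
        # pair_feats: (pairs, 2*feat_dim); prev_pos: (pairs, 3)
        link_prob = self.encoder(pair_feats)  # probability of each link type
        pred_pos = self.decoder(torch.cat([link_prob, prev_pos], dim=-1))
        return link_prob, pred_pos

# Pretraining needs only poses: loss = F.mse_loss(pred_pos, curr_pos).
```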

Adversarial Learning

Inspired by adversarial learning, the authors of [69] incorporated it into a Bayesian inference framework, formulating it as a prior that regularizes the model parameters so as to improve generalization. The discriminator was implemented as a fully connected layer, and the training loss is similar to that adopted in generative adversarial networks (GANs).
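As a rough sketch only, a fully connected discriminator scoring feature vectors with a GAN-style loss might look as follows; the feature dimension and the binary cross-entropy formulation are illustrative assumptions, not the Bayesian formulation of [69].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A single fully connected layer acting as the discriminator.
disc = nn.Linear(64, 1)

def adversarial_loss(real_feats, fake_feats):
    """GAN-style discriminator loss over real vs. generated features."""
    real = F.binary_cross_entropy_with_logits(
        disc(real_feats), torch.ones(real_feats.size(0), 1))
    fake = F.binary_cross_entropy_with_logits(
        disc(fake_feats), torch.zeros(fake_feats.size(0), 1))
    return real + fake
```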

Teacher–Student Mechanism

For transferring knowledge between two graphs, such as one obtained in the lab and the other from real life, Y.S. Tang et al. [115] used a teacher–student mechanism. The teacher network guides the student network to transfer knowledge across the weight matrices with a task-specific loss function, so that the relation information is well preserved during transfer. By doing so, no action labels for the target domain are required during training.

4.2. NAS

In addition to self-supervised methods for generating task-specific models, some researchers have shown interest in automated machine learning (AutoML), among whose techniques NAS has gained the most attention.
W. Peng et al. [116] searched for the best architecture of skeleton-GCN methods given a set of components: dynamic graph modules with various spatiotemporal cues and Chebyshev approximations of different orders, all candidates having residual connections. The proposed NAS framework works to find the most accurate and efficient network. Moreover, instead of using a predefined graph, they generate dynamic graphs based on the node correlations captured by different function modules.
N. Heidari et al. [117] progressively adjusted the model topology by increasing the width of the model layers until performance converges: if the addition of the last layer does not improve performance, the newly added layer is removed and the algorithm stops growing the topology.
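The progressive growing loop can be sketched as follows: widen or extend the model, keep the change only if validation accuracy improves, and stop once it no longer does. Here `grow`, `train` and `evaluate` are hypothetical placeholders, not an API from the cited work.

```python
def progressive_search(model, grow, train, evaluate):
    """Grow the topology while each addition improves validation accuracy."""
    best_acc = evaluate(train(model))
    while True:
        candidate = grow(model)             # add capacity (e.g., widen a layer)
        acc = evaluate(train(candidate))
        if acc <= best_acc:                 # last addition did not help:
            return model                    # discard it and stop growing
        model, best_acc = candidate, acc    # keep the improved topology
```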

This entry is adapted from the peer-reviewed paper 10.3390/s22062091
