Taxonomy for Skeleton-GNN-Based Human Action Recognition

Human action recognition has been applied in many fields, such as video surveillance and human–computer interaction, where it helps to improve system performance. Connecting the skeleton joints as they appear in the physical body naturally generates a graph. A new taxonomy for skeleton-GNN-based methods is proposed according to current designs, and their merits and demerits are analyzed.

Keywords: skeleton graphs; human action recognition; taxonomy; GNN

1. Introduction

Human action recognition (HAR), which aims to detect human activities automatically, has become increasingly popular, especially since being armed with deep learning, tremendous data and more computational resources. HAR holds great value in video surveillance [1][2], human–computer interaction (HCI) [3][4][5], virtual reality [6][7][8], security [9] and so forth.
HAR is supported by multiple modalities. One kind of modality is structured data, e.g., images or videos, together with auxiliary data such as semantic information. The widespread use of sensors (including cameras) and cloud databases makes structured data easy to capture and share. Moreover, such data are visually or semantically informative, e.g., the shape or motion differences of subjects, the space–time trajectory [10] and the names of joints.
With the help of carefully designed representation learners, such as deep-learning (DL) models, these informative representations are obtained in a task-related way so as to help solve the problem more accurately. However, the performance is upper-bounded by the data, which place little emphasis on the intrinsic relations between the joints of skeletons. The other kind is unstructured, non-Euclidean data, such as human skeletons. Extractors such as OpenPose, Google PoseNet and Nuitrack can work in real time and thus generate sufficient skeleton graphs.
These poses contain intrinsic information across spatial joints and temporal frames, as well as 3D information when depth data are offered. Additionally, whereas an image requires storage proportional to its width, height and number of channels, a skeleton requires only the 3D coordinates and a confidence score for every joint; since there are normally no more than 30 joints (i.e., roughly 120 numbers per frame), the storage cost decreases significantly.
Moreover, while image-based methods suffer from varied brightness, changing backgrounds, chromatic differences, different subjects, etc., 3D skeletons work across various scenes once they are detected. As HAR should assign the same label to the same activity even when performed by different persons under different conditions or styles, the skeleton graph is undoubtedly a promising choice.
Models to find representations of human skeletons are classified into three categories.
The traditional approach uses handcrafted descriptors, such as principal component analysis (PCA) based on the 3D position differences of joints [11], or selecting joint pairs by the top-K relative variance of joint relative distance [12]. These descriptors are interpretable; however, they are limited in that they tend to extract shallow, simple features and normally fail to find significant deep features.
The second idea recasts the task as a deep-learning problem in Euclidean space, e.g., serializing the graph nodes into a sequence and then adopting the well-known Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), etc. In this way, deep features are extracted mechanically but without attention to the intrinsic spatial and temporal relations between graph joints; e.g., the serialization of joints ignores their natural structure in skeletons.
Recently, Graph Neural Networks (GNNs), especially graph convolutional networks (GCNs), have come into the spotlight and been applied to skeleton graphs. The earliest milestone is ST-GCN [13], and multiple works based on ST-GCN have since been proposed. Among them, 2s-AGCN [14], which adopted an attention mechanism, is another representative work. As GNNs excel at discovering the intrinsic relations between joints, GNN-based HAR methods have achieved a new state of the art (SOTA).

2. Spatial-Based Approaches

Approaches in this category use the GNN as a spatial feature extractor, while the temporal evolution is handled by other modules. Two major candidates are used to evolve states along the temporal dimension. One is the family of traditional conditional random field (CRF) methods, including the hidden CRF (HCRF). The other is the RNN family, including the vanilla RNN, the long short-term memory network (LSTM) and gated recurrent units (GRUs).

2.1. CRF

A CRF is an undirected graphical model whose nodes are divided into exactly two disjoint sets $X$ and $Y$, the observed and output variables, respectively; the conditional distribution $p(Y|X)$ is then modeled. It is suitable for labeling action sequences, since Markov-chain models can track state evolution along the temporal dimension.
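For reference, the linear-chain form commonly used for sequence labeling (the textbook formulation, not the exact model of [15][16][17]) is

$$ p(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, X, t) \right), $$

where the $f_k$ are feature functions weighted by $\lambda_k$ and $Z(X)$ sums the exponentiated score over all possible label sequences so that the distribution normalizes.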
K. Liu et al. [15][16] argued that GCNs are powerful at extracting spatial information but weak at state evolution, and therefore applied an HCRF to the extracted features. After features are obtained by the GCN, the HCRF learns hidden states on each node and performs directed message passing on these hidden states. Finally, under the minimum negative conditional log-likelihood rule, the label for an action sequence sample is assigned. Viewing the skeleton graph as a CRF, K. Liu et al. [17] also adopted a CRF as a loss function to improve performance.

2.2. RNN

Although a CRF is a graph model that handles state evolution, some situations are non-Markovian; for example, the current state may rely on the states from all previous timesteps. This is why the RNN was proposed and became popular: whereas a CRF covers k predefined timesteps, the RNN family can preserve the relationships between states across many timesteps. Within the family, LSTM solves the gradient explosion and gradient vanishing problems of the vanilla RNN, while the GRU can be regarded as a simplification of LSTM.
RNN-based methods are classified into a separated strategy, a bidirectional strategy and an aggregated block.

Separated Strategy

Some methods perform spatial information extraction, usually by GCN (in either spectral or spatial space), and perform state evolution separately. In [18], to further encode continuous motion variations, the deep features learned from skeleton graphs by a spectral GCN were gathered along consecutive temporal slices and then fed into a recurrent gated network. Finally, the recurrent temporal encoding was integrated with the spectral graph filtering and action-attending for joint training.
R. Zhao et al. [19] ran GCN and LSTM separately: the spatial information from the GCN in each frame was directly input into an LSTM cell. Z. Y. Xu et al. [20] combined reinforcement learning with LSTM in a feature selection network (FSN) consisting of a policy network and a value network; both networks are based on LSTM for sequential action or value generation. The feature selection is performed along the temporal dimension, and the input features are the spatial features from the GCN.
S. Xu et al. [21] worked on two-subject interaction graphs. After applying a GNN to the skeleton graphs in each frame to extract spatial information, attention-based LSTMs are applied at the joint level, person level and scene level so as to pass information at different scales. To leverage these three types of features, a Concurrent-LSTM (Co-LSTM) is applied to further balance their temporal dynamics for action recognition.
M.S. Li et al. [22] used a GRU to update the joint features while inferring the future pose conditioned on the A-links and previous actions. The prediction from the GRU evolution was then processed and later consumed by the GNN.
In the work of J.M. Yu et al. [23], an RNN was used as an autoregressive model to predict the hidden state of noisy skeleton graphs; the hidden state was later used to predict the action class. Q.Q. Huang et al. [24] followed the same idea but replaced the basic GNN with an attention-based GNN. Others, such as [25][26][27], extract state-evolution information similarly after various GNN modules, though without attention-based GCNs. The common pattern (a spatial GCN per frame, followed by recurrent evolution across frames) is sketched below.
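The following is a minimal sketch of the separated strategy, assuming a single normalized adjacency matrix and illustrative layer sizes; it is not the exact architecture of any cited work.

```python
import torch
import torch.nn as nn

class SeparatedGCNLSTM(nn.Module):
    """Per-frame graph convolution followed by an LSTM over frames."""
    def __init__(self, a_hat, in_dim=3, gcn_dim=64, lstm_dim=128, n_classes=60):
        super().__init__()
        self.register_buffer("a_hat", a_hat)          # (V, V) normalized adjacency
        self.gcn_weight = nn.Linear(in_dim, gcn_dim)  # per-joint feature transform
        self.lstm = nn.LSTM(gcn_dim, lstm_dim, batch_first=True)
        self.classifier = nn.Linear(lstm_dim, n_classes)

    def forward(self, x):                             # x: (N, T, V, C) coordinates
        h = torch.relu(self.a_hat @ self.gcn_weight(x))  # spatial message passing
        h = h.mean(dim=2)                             # pool joints -> (N, T, gcn_dim)
        out, _ = self.lstm(h)                         # temporal state evolution
        return self.classifier(out[:, -1])            # classify from the last timestep

# Usage on a dummy 25-joint sequence (identity matrix stands in for the skeleton):
# model = SeparatedGCNLSTM(torch.eye(25))
# logits = model(torch.randn(8, 100, 25, 3))          # -> (8, 60)
```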

Bidirectional Strategy

Considering the bidirectional information in a video sequence, some methods use a bidirectional LSTM to keep forward and backward information simultaneously.
To utilize past and future temporal information, X.L. Ding et al. [28] chose a bidirectional RNN to model skeleton sequences, adopting it before extracting spatial information with a GNN. To capture temporal contextual information over frames, J.L. Gao et al. [29] provided a context-aware module consisting of bidirectional LSTM cells, aiming at modeling temporal dynamics and dependencies based on the learned spatial latent nodes.
Beyond the basic bidirectional LSTM, J. Huang et al. [30] deployed GCN inside the LSTM to enhance its ability to extract spatial features. Precisely, they proposed an LSGM consisting of one original LSTM cell followed by two GCN layers. The LSGM was then used to build bidirectional LSGM modules, each comprising a forward LSGM and a reverse LSGM that work in parallel; their outputs are added together and passed to the next layer.
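As a minimal illustration of the bidirectional idea (the feature dimensions are assumptions), the recurrent layer simply reads the per-frame spatial features in both directions:

```python
import torch.nn as nn

# For input of shape (N, T, 64), the output is (N, T, 256): the forward and
# backward hidden states are concatenated, so every timestep sees both past
# and future context before classification.
bi_lstm = nn.LSTM(input_size=64, hidden_size=128,
                  batch_first=True, bidirectional=True)
```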

Aggregated Block

Some argue that the extraction of spatial and temporal information can be stacked together as a basic building block; within the block, the spatial information of each frame is processed before the temporal step. Papers [31][32] integrated GCN with LSTM: each gate in the LSTM, namely the input gate, forget gate and output gate, is armed with a GCN, so that the LSTM operates directly on the spatial information extracted from each frame.
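A minimal sketch of such a graph-convolutional LSTM cell follows; the single shared adjacency matrix and the dimensions are simplifying assumptions, not the exact design of [31][32].

```python
import torch
import torch.nn as nn

class GConvLSTMCell(nn.Module):
    """LSTM cell whose gates apply a graph convolution (A X W) per joint."""
    def __init__(self, a_hat, in_dim, hid_dim):
        super().__init__()
        self.register_buffer("a_hat", a_hat)               # (V, V) adjacency
        self.gates = nn.Linear(in_dim + hid_dim, 4 * hid_dim)

    def forward(self, x, h, c):        # x: (N, V, in_dim); h, c: (N, V, hid_dim)
        z = self.a_hat @ self.gates(torch.cat([x, h], dim=-1))  # graph conv on all gates
        i, f, o, g = z.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c                    # per-joint hidden and cell states
```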

3. Spatiotemporal Approaches

The methods mentioned above tackle spatial information and temporal information separately. However, the two are correlated; for example, the similar actions of waking up and lying down on a bed involve similar spatial information, but distributed at different timestamps.

3.1. CNN

ST-GCN is a typical spatiotemporal approach since it performs graph convolution on the spatiotemporal graph (STG) directly and therefore extracts spatiotemporal information simultaneously. Methods such as [33][34][35][36][37][38][39][28][40][41][42][43][44][45][46][47][48][49][50][51][52] were all developed from ST-GCN. Methods based on AGCN also work on the STG, e.g., [53][54][23][55]. However, one drawback of these methods is that they only perform spatiotemporal extraction over a predefined temporal size (the kernel size of the CNN in the temporal dimension); therefore, multi-scale temporal information cannot be handled.
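The following minimal ST-GCN-style block, with illustrative sizes and a single adjacency matrix as simplifying assumptions, makes the fixed temporal kernel explicit.

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Spatial graph convolution followed by a fixed-size temporal convolution."""
    def __init__(self, a_hat, in_ch, out_ch, t_kernel=9):
        super().__init__()
        self.register_buffer("a_hat", a_hat)          # (V, V) normalized adjacency
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))

    def forward(self, x):                             # x: (N, C, T, V)
        x = self.spatial(x)                           # per-joint channel transform
        x = torch.einsum("nctv,vw->nctw", x, self.a_hat)  # aggregate over neighbors
        return torch.relu(self.temporal(x))           # fixed temporal window t_kernel
```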
To work on multiple timescales dynamically, so as to take both long-term and short-term dependencies into consideration, P. Ghosh et al. [56] also used the STG but allowed flexible temporal connections that can span multiple timesteps: for example, the left-arm joint at timestep $t$ can connect to the left-arm joints at timesteps $t+1, t+2, \cdots$, rather than only at $t+1$ as in ST-GCN. Their method is based on Hourglass (a CNN framework) combined with ST-GCN.
Z.T. Zhang et al. [57] handled temporal information with two gated temporal convolutional networks (TCNs), herein a 1D CNN and a 2D CNN with tanh and sigmoid activation functions working as gates. They argued that a TCN will not overfit to some extent, since it inherits the stable gradients of CNNs. After filtering in the temporal dimension, the outputs are combined and then processed by a GCN and an MLP.
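A minimal sketch of this tanh/sigmoid gating, under assumed channel sizes and without the 1D/2D split of [57], is:

```python
import torch
import torch.nn as nn

class GatedTCN(nn.Module):
    """Temporal convolution where a sigmoid branch gates a tanh branch."""
    def __init__(self, channels, t_kernel=9):
        super().__init__()
        pad = (t_kernel // 2, 0)
        self.filt = nn.Conv2d(channels, channels, (t_kernel, 1), padding=pad)
        self.gate = nn.Conv2d(channels, channels, (t_kernel, 1), padding=pad)

    def forward(self, x):                             # x: (N, C, T, V)
        return torch.tanh(self.filt(x)) * torch.sigmoid(self.gate(x))
```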
In addition to progress in the temporal dimension, some approaches modify the GNN to take multiple scales in the spatiotemporal dimension into consideration. Z. Hu et al. [58] established dependence relationships between different bone nodes with a bone-joint module based on multiscale dynamic aggregated GCNs, which describe and aggregate the bone-joint semantic information. In this way, both the spatial information and the multiscale temporal information are handled together.

3.2. RNN

Building on GCN, to tackle long-term information, W.W. Ding et al. [59] used an LSTM as the vertex updater during message passing. The features of each vertex therefore contain temporal information, so spatiotemporal information is handled simultaneously.
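A minimal sketch of an LSTM cell acting as the vertex updater, under illustrative dimensions rather than the exact design of [59]:

```python
import torch
import torch.nn as nn

V, D = 25, 64                             # joints and feature width (assumed)
a_hat = torch.eye(V)                      # stand-in for the normalized adjacency
updater = nn.LSTMCell(input_size=D, hidden_size=D)

def step(x_t, h, c):                      # x_t, h, c: (V, D) for one sample
    msg = a_hat @ x_t                     # aggregate neighbor features per vertex
    return updater(msg, (h, c))           # fold the message into the vertex state

# Iterating step() over t = 1..T leaves each vertex with a hidden state that
# mixes spatial structure and temporal history simultaneously.
```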

4. Generated Approaches

The generated approaches cover two categories: self-supervised methods, also known as unsupervised methods, and neural architecture search (NAS), which aims at generating the best model by combining candidate components.
Both categories work in a non-end-to-end way. Self-supervised methods first use priors, such as pretext tasks, to generate a pretrained model and then adapt it to the target task. NAS aims at generating the best model for the target task: it first explores combinations of the given components, chooses the best model among these combinations, and then fine-tunes the chosen model on the target task.

4.1. Self-Supervised

Self-supervised learning is a means of training computers without manually labeled data. It belongs to the unsupervised learning family, in which the outputs or goals are derived by the machine itself. Machines thus become capable of labeling, categorizing and analyzing information on their own, and then drawing conclusions based on connections and correlations.

AE

M. Li et al. [22] built an A-link inference module (AIM) based on an autoencoder (AE), where the encoder outputs the probability of each joint pair having a type-c link, and the decoder takes the encoder's output together with the joint positions in the previous frame. The AIM loss is the difference between part of the encoder's input and the decoder's prediction; in this way, no labeled data beyond the input poses are required to pre-train the AIM.
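The following schematic conveys that idea under assumed dimensions and simplified heads; it is not the exact AIM architecture of [22].

```python
import torch
import torch.nn as nn

V, C, n_types = 25, 3, 3                  # joints, coordinates, link types (assumed)
encoder = nn.Sequential(nn.Linear(2 * C, 64), nn.ReLU(), nn.Linear(64, n_types))
decoder = nn.Sequential(nn.Linear(C + n_types * C, 64), nn.ReLU(), nn.Linear(64, C))

def aim_loss(prev_pose, next_pose):       # each (V, C)
    # Encoder: score every joint pair with link-type probabilities.
    pairs = torch.cat([prev_pose.unsqueeze(1).expand(V, V, C),
                       prev_pose.unsqueeze(0).expand(V, V, C)], dim=-1)
    links = encoder(pairs).softmax(dim=-1)            # (V, V, n_types)
    # Decoder: predict the next pose from the links and previous-frame joints.
    ctx = torch.einsum("vwk,wc->vkc", links, prev_pose).reshape(V, -1)
    pred = decoder(torch.cat([prev_pose, ctx], dim=-1))
    return ((pred - next_pose) ** 2).mean()           # reconstruction supervises AIM
```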

Adversarial Learning

Inspired by adversarial learning, the authors of [19] incorporated it into the Bayesian inference framework, formulating it as a prior that regularizes the model parameters so as to improve generalization. The discriminator was implemented as a fully connected layer, and the training loss is similar to that adopted in a generative adversarial network (GAN).
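For reference, the standard GAN objective that this loss resembles is

$$ \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], $$

where $D$ is the discriminator (here, a fully connected layer) and $G$ plays the role of the generator.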

Teacher–Student Mechanism

To transfer knowledge between two graphs, such as one obtained in the lab and the other from real life, Y.S. Tang et al. [60] used a teacher–student mechanism. The teacher network guides the student network to transfer the knowledge across the weight matrices via a task-specific loss function, so that the relational information is well preserved during transfer. Consequently, no action labels for the target domain are required during training.
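A generic sketch of such guidance at the feature level follows; the mean-squared matching loss and dimensions are assumptions, not the task-specific loss of [60].

```python
import torch
import torch.nn as nn

teacher, student = nn.Linear(64, 64), nn.Linear(64, 64)   # stand-in networks
for p in teacher.parameters():
    p.requires_grad_(False)               # the source-trained teacher stays fixed

def transfer_loss(target_feats):          # (N, 64) features from target graphs
    with torch.no_grad():
        guide = teacher(target_feats)     # the teacher's view of the target data
    return ((student(target_feats) - guide) ** 2).mean()  # no action labels needed
```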

4.2. NAS

In addition to self-supervised methods for generating task-specific models, some researchers have shown interest in automated machine learning (AutoML), among which NAS has gained the most attention.
W. Peng et al. [61] searched for the best architecture for skeleton-GCN methods given a set of components: dynamic graph modules with various spatiotemporal cues, and Chebyshev approximations of different orders, all candidates having residual connections. The proposed NAS framework seeks the most accurate and efficient network. Moreover, instead of relying on a predefined graph, they generate dynamic graphs based on the node correlations captured by different function modules.
N. Heidari et al. [62] progressively adjusted the model topology by increasing the width of the model layers until performance converges: if the last added layer does not improve performance, it is removed and the algorithm stops growing the topology, as sketched below.
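A minimal sketch of this progressive growth loop, where build_model and evaluate are hypothetical placeholders for the training and validation routine:

```python
def grow_progressively(build_model, evaluate, max_layers=10):
    """Grow the topology while validation accuracy improves; roll back the last step."""
    best_acc, best_layers = 0.0, 0
    for layers in range(1, max_layers + 1):
        acc = evaluate(build_model(layers))
        if acc <= best_acc:               # the newly added layer did not help:
            break                         # discard it and stop growing
        best_acc, best_layers = acc, layers
    return best_layers, best_acc
```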

References

  1. Aggarwal, J.K.; Ryoo, M.S. Human activity analysis: A review. ACM Comput. Surv. (CSUR) 2011, 43, 1–43.
  2. Ziaeefard, M.; Bergevin, R. Semantic human activity recognition: A literature review. Pattern Recognit. 2015, 48, 2329–2345.
  3. Meng, M.; Drira, H.; Boonaert, J. Distances evolution analysis for online and off-line human object interaction recognition. Image Vis. Comput. 2018, 70, 32–45.
  4. Zhang, W.; Liu, Z.; Zhou, L.; Leung, H.; Chan, A.B. Martial arts, dancing and sports dataset: A challenging stereo and multi-view dataset for 3d human pose estimation. Image Vis. Comput. 2017, 61, 22–39.
  5. Panwar, M.; Mehra, P.S. Hand gesture recognition for human computer interaction. In Proceedings of the 2011 International Conference on Image Information Processing, Shimla, India, 3–5 November 2011; pp. 1–7.
  6. Sagayam, K.M.; Hemanth, D.J. Hand posture and gesture recognition techniques for virtual reality applications: A survey. Virtual Real. 2017, 21, 91–107.
  7. Schröder, M.; Ritter, H. Deep learning for action recognition in augmented reality assistance systems. In Proceedings of the ACM SIGGRAPH 2017 Posters, Los Angeles, CA, USA, 30 July–3 August 2017; pp. 1–2.
  8. Bates, T.; Ramirez-Amaro, K.; Inamura, T.; Cheng, G. On-line simultaneous learning and recognition of everyday activities from virtual reality performances. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 3510–3515.
  9. Meng, H.; Pears, N.; Bailey, C. A human action recognition system for embedded computer vision application. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–6.
  10. Beddiar, D.R.; Nini, B.; Sabokrou, M.; Hadid, A. Vision-based human activity recognition: A survey. Multimed. Tools Appl. 2020, 79, 30509–30555.
  11. Yang, X.; Tian, Y.L. Eigenjoints-based action recognition using naive-bayes-nearest-neighbor. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 14–19.
  12. Li, M.; Leung, H. Graph-based approach for 3D human skeletal action recognition. Pattern Recognit. Lett. 2017, 87, 195–202.
  13. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
  14. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12026–12035.
  15. Liu, K.; Gao, L.; Khan, N.M.; Qi, L.; Guan, L. Graph Convolutional Networks-Hidden Conditional Random Field Model for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE International Symposium on Multimedia (ISM), San Diego, CA, USA, 9–11 December 2019; pp. 25–256.
  16. Liu, K.; Gao, L.; Khan, N.M.; Qi, L.; Guan, L. A Multi-Stream Graph Convolutional Networks-Hidden Conditional Random Field Model for Skeleton-Based Action Recognition. IEEE Trans. Multimed. 2021, 23, 64–76.
  17. Liu, K.; Gao, L.; Khan, N.M.; Qi, L.; Guan, L. A Vertex-Edge Graph Convolutional Network for Skeleton-Based Action Recognition. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Online, 12–14 October 2020; pp. 1–5.
  18. Li, C.; Cui, Z.; Zheng, W.; Xu, C.; Ji, R.; Yang, J. Action-Attending Graphic Neural Network. IEEE Trans. Image Process. 2018, 27, 3657–3670.
  19. Zhao, R.; Wang, K.; Su, H.; Ji, Q. Bayesian Graph Convolution LSTM for Skeleton Based Action Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6881–6891.
  20. Xu, Z.Y.; Wang, Y.F.; Jiang, J.Q.; Yao, J.; Li, L. Adaptive Feature Selection With Reinforcement Learning for Skeleton-Based Action Recognition. IEEE Access 2020, 8, 213038–213051.
  21. Xu, S.; Rao, H.; Peng, H.; Jiang, X.; Guo, Y.; Hu, X.; Hu, B. Attention-Based Multilevel Co-Occurrence Graph Convolutional LSTM for 3-D Action Recognition. IEEE Internet Things J. 2020, 21, 15990–16001.
  22. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3595–3603.
  23. Yu, J.; Yoon, Y.; Jeon, M. Predictively Encoded Graph Convolutional Network for Noise-Robust Skeleton-based Action Recognition. arXiv 2020, arXiv:2003.07514.
  24. Huang, Q.Q.; Zhou, F.Y.; Qin, R.Z.; Zhao, Y. View transform graph attention recurrent networks for skeleton-based action recognition. Signal Image Video Process. 2020, 15, 599–606.
  25. Si, C.Y.; Jing, Y.; Wang, W.; Wang, L.; Tan, T.N. Skeleton-based action recognition with hierarchical spatial reasoning and temporal stack learning network. Pattern Recognit. 2020, 107, 107511.
  26. Parsa, B.; Narayanan, A.; Dariush, B. Spatio-Temporal Pyramid Graph Convolutions for Human Action Recognition and Postural Assessment. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1069–1079.
  27. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Symbiotic graph neural networks for 3d skeleton-based human action recognition and motion prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 1, 1.
  28. Ding, X.; Yang, K.; Chen, W. An Attention-Enhanced Recurrent Graph Convolutional Network for Skeleton-Based Action Recognition. In Proceedings of the 2019 2nd International Conference on Signal Processing and Machine Learning, Hangzhou, China, 27–29 November 2019; pp. 79–84.
  29. Gao, J.; He, T.; Zhou, X.; Ge, S. Focusing and Diffusion: Bidirectional Attentive Graph Convolutional Networks for Skeleton-based Action Recognition. arXiv 2019, arXiv:1912.11521.
  30. Huang, J.; Huang, Z.; Xiang, X.; Gong, X.; Zhang, B. Long-Short Graph Memory Network for Skeleton-based Action Recognition. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 634–641.
  31. Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1227–1236.
  32. Zhang, H.; Song, Y.; Zhang, Y. Graph Convolutional LSTM Model for Skeleton-Based Action Recognition. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 412–417.
  33. Wang, L.; Huynh, D.; Koniusz, P. A Comparative Review of Recent Kinect-Based Action Recognition Algorithms. IEEE Trans. Image Process. 2020, 29, 15–28.
  34. Yang, C.L.; Setyoko, A.; Tampubolon, H.; Hua, K.L. Pairwise Adjacency Matrix on Spatial Temporal Graph Convolution Network for Skeleton-Based Two-Person Interaction Recognition. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Online, 25–28 October 2020; pp. 2166–2170.
  35. Cai, J.; Jiang, N.; Han, X.; Jia, K.; Lu, J. JOLO-GCN: Mining Joint-Centered Light-Weight Information for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2021; pp. 2735–2744.
  36. Yang, D.; Li, M.M.; Fu, H.; Fan, J.; Leung, H. Centrality Graph Convolutional Networks for Skeleton-based Action Recognition. arXiv 2020, arXiv:2003.03007.
  37. Bai, Z.; Ding, Q.; Tan, J. Two-Steam Fully Connected Graph Convolutional Network for Skeleton-Based Action Recognition. In Proceedings of the 2020 Chinese Control And Decision Conference (CCDC), Hefei, China, 22–24 August 2020; pp. 1056–1061.
  38. Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Richly Activated Graph Convolutional Network for Robust Skeleton-based Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 1915–1925.
  39. Zhu, G.M.; Zhang, L.; Li, H.S.; Shen, P.Y.; Shah, S.A.A.; Bennamoun, M. Topology learnable graph convolution for skeleton-based action recognition. Pattern Recognit. Lett. 2020, 135, 286–292.
  40. Gao, X.; Li, K.; Zhang, Y.; Miao, Q.; Sheng, L.; Xie, J.; Xu, J. 3D Skeleton-Based Video Action Recognition by Graph Convolution Network. In Proceedings of the 2019 IEEE International Conference on Smart Internet of Things (SmartIoT), Beijing, China, 19–21 August 2019; pp. 500–501.
  41. Jiang, Y.; Song, K.; Wang, J. Action Recognition Based on Fusion Skeleton of Two Kinect Sensors. In Proceedings of the 2020 International Conference on Culture-oriented Science & Technology (ICCST), Beijing, China, 30–31 October 2020; pp. 240–244.
  42. Li, Q.; Mo, H.; Zhao, J.; Hao, H.; Li, H. Spatio-Temporal Dual Affine Differential Invariant for Skeleton-based Action Recognition. arXiv 2020, arXiv:2004.09802.
  43. Lin, C.H.; Chou, P.Y.; Lin, C.H.; Tsai, M.Y. SlowFast-GCN: A Novel Skeleton-Based Action Recognition Framework. In Proceedings of the 2020 International Conference on Pervasive Artificial Intelligence (ICPAI), Taipei, Taiwan, China, 3–5 December 2020; pp. 170–174.
  44. Miki, D.; Chen, S.; Demachi, K. Weakly Supervised Graph Convolutional Neural Network for Human Action Localization. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 642–650.
  45. Peng, W.; Shi, J.; Xia, Z.; Zhao, G. Mix dimension in poincaré geometry for 3d skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1432–1440.
  46. Sun, D.; Zeng, F.; Luo, B.; Tang, J.; Ding, Z. Information Enhanced Graph Convolutional Networks for Skeleton-based Action Recognition. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–7.
  47. Tian, D.; Lu, Z.M.; Chen, X.; Ma, L.H. An attentional spatial temporal graph convolutional network with co-occurrence feature learning for action recognition. Multimed. Tools Appl. 2020, 79, 12679–12697.
  48. Zhong, Q.B.; Zheng, C.M.; Zhang, H.X. Research on Discriminative Skeleton-Based Action Recognition in Spatiotemporal Fusion and Human-Robot Interaction. Complexity 2020, 2020, 8717942.
  49. Cheng, K.; Zhang, Y.; Cao, C.; Shi, L.; Cheng, J.; Lu, H. Decoupling GCN with DropGraph Module for Skeleton-Based Action Recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 536–553.
  50. Song, Y.F.; Zhang, Z.; Wang, L. Richly activated graph convolutional network for action recognition with incomplete skeletons. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, China, 22–25 September 2019; pp. 1–5.
  51. Papadopoulos, K.; Ghorbel, E.; Aouada, D.; Ottersten, B. Vertex feature encoding and hierarchical temporal modeling in a spatial-temporal graph convolutional network for action recognition. arXiv 2019, arXiv:1912.09745.
  52. Fan, Y.; Weng, S.; Zhang, Y.; Shi, B.; Zhang, Y. Context-aware cross-attention for skeleton-based human action recognition. IEEE Access 2020, 8, 15280–15290.
  53. Obinata, Y.; Yamamoto, T. Temporal Extension Module for Skeleton-Based Action Recognition. arXiv 2020, arXiv:2003.08951.
  54. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks. IEEE Trans. Image Process. 2020, 29, 9532–9545.
  55. Dong, J.Q.; Gao, Y.B.; Lee, H.J.; Zhou, H.; Yao, Y.F.; Fang, Z.J.; Huang, B. Action Recognition Based on the Fusion of Graph Convolutional Networks with High Order Features. Appl. Sci. 2020, 10, 1482.
  56. Ghosh, P.; Yao, Y.; Davis, L.S.; Divakaran, A. Stacked Spatio-Temporal Graph Convolutional Networks for Action Segmentation. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 565–574.
  57. Zhang, Z.T.; Wang, Z.Y.; Zhuang, S.N.; Huang, F.Y. Structure-Feature Fusion Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. IEEE Access 2020, 8, 228108–228117.
  58. Hu, Z.; Lee, E.J. Dual Attention-Guided Multiscale Dynamic Aggregate Graph Convolutional Networks for Skeleton-Based Human Action Recognition. Symmetry 2020, 12, 1589.
  59. Ding, W.W.; Li, X.; Li, G.; Wei, Y.S. Global relational reasoning with spatial temporal graph interaction networks for skeleton-based action recognition. Signal Process.-Image Commun. 2020, 83, 115776.
  60. Tang, Y.S.; Wei, Y.; Yu, X.M.; Lu, J.W.; Zhou, J. Graph Interaction Networks for Relation Transfer in Human Activity Videos. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2872–2886.
  61. Peng, W.; Hong, X.; Chen, H.; Zhao, G. Learning graph convolutional network for skeleton-based human action recognition by neural searching. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 2669–2676.
  62. Heidari, N.; Iosifidis, A. Progressive Spatio-Temporal Graph Convolutional Network for Skeleton-Based Human Action Recognition. arXiv 2020, arXiv:2011.05668.