Skeleton-Based Sign Language Recognition

Skeleton-based sign language recognition (SLR) systems mainly use specific skeleton points rather than raw image pixels and/or sensor signals. The main advantage of skeleton-based SLR systems is that they focus attention on the sign itself and adapt well to complicated backgrounds and dynamic circumstances. However, hand skeleton information alone is sometimes insufficient to represent the exact meaning of a sign, owing to the lack of emotional and bodily expression.

  • sign language recognition
  • skeleton-based

1. Introduction

Sign language is a spatial type of visual language based on dynamic gesture movement, including hand, body, and facial gestures [1][2][3][4][5][6][7]. It is the language used by communities who cannot speak or hear, including deaf and speech-impaired people. Because of the difficulty and complexity of sign language, such as the considerable time required to understand and use it, the hearing community is generally not eager to learn it in order to communicate with these communities. In addition, teaching this language to the entire hearing community so that it can communicate with this minority is not practical or feasible. Moreover, there is no common international version of sign language; it differs across languages, such as Bangla [3], Turkish [8], Chinese [9], and English [10], as well as across cultures [1][4][11]. Establishing effective communication between the hearing and deaf communities therefore requires a sign language translator; however, expert sign language interpreters are rare. In this context, researchers believe that automatic sign language recognition (SLR) can effectively address these problems [3][4][5][6].
Researchers have worked to develop SLR systems with the help of computer vision [3][4][12], sensor-based methods [5][6][12][13][14][15][16][17][18][19], and artificial intelligence, in order to facilitate communication for deaf and hearing-impaired people. Many researchers have recently proposed skeleton-based SLR systems that mainly use specific skeleton points instead of the pixels of images and/or sensor signals [20][21][22][23][24]. The main advantage of skeleton-based SLR systems is that they focus attention on the sign itself and adapt well to complicated backgrounds and dynamic circumstances. However, there are still deficiencies in extracting skeleton points for SLR, for example, due to the high computational cost of ground-truth skeleton annotation. Many motion capture systems (e.g., Microsoft Kinect, the OAK-D camera, and Intel RealSense, among others) provide the main body coordinates and their skeleton annotations; however, it remains difficult to obtain skeleton annotations for a wide variety of gestures [25]. Shin et al. extracted 21 hand key points from an American Sign Language data set using MediaPipe; after computing distance and angular features from these points, they applied a support vector machine (SVM) for recognition [26]. Hand skeleton information alone is sometimes insufficient to represent the exact meaning of a sign, owing to the lack of emotional and bodily expression. As such, researchers have recently considered that using a full-body skeleton may be more effective for SLR systems [27].
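As an illustration of this kind of keypoint-plus-classifier pipeline, the following is a minimal sketch assuming the MediaPipe Hands solution and scikit-learn; the exact feature definitions and the placeholder names (`X_imgs`, `y`) are illustrative assumptions rather than the configuration used by Shin et al.

```python
import itertools
import numpy as np
import mediapipe as mp
from sklearn.svm import SVC

mp_hands = mp.solutions.hands

def hand_keypoints(image_rgb):
    """Extract 21 (x, y, z) hand landmarks from an RGB frame, or None if no hand is found."""
    # For brevity a detector is created per image; reuse one instance in practice.
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(image_rgb)
    if not result.multi_hand_landmarks:
        return None
    lm = result.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in lm])          # shape (21, 3)

def distance_angle_features(pts):
    """Pairwise landmark distances plus angles between consecutive landmark-to-landmark vectors."""
    dists = [np.linalg.norm(pts[i] - pts[j])
             for i, j in itertools.combinations(range(len(pts)), 2)]
    vecs = pts[1:] - pts[:-1]                                # consecutive landmark vectors
    cos = np.einsum('ij,ij->i', vecs[:-1], vecs[1:]) / (
        np.linalg.norm(vecs[:-1], axis=1) * np.linalg.norm(vecs[1:], axis=1) + 1e-8)
    angles = np.arccos(np.clip(cos, -1.0, 1.0))
    return np.concatenate([dists, angles])

# Hypothetical usage: X_imgs is a list of RGB frames with a visible hand, y their sign labels.
# features = [distance_angle_features(hand_keypoints(img)) for img in X_imgs]
# clf = SVC(kernel='rbf').fit(features, y)
```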
Xia et al. extracted the hand skeleton and body skeleton with different approaches and achieved good performance using an RNN-based model [28]. The main problems were the unreliability of the hand key points and the fact that the RNN did not model the dynamics of the skeleton well. Perez et al. extracted 67 key points, covering the face, body, and hand gestures, using an OpenCV AI Kit with Depth (OAK-D) camera. They recorded 3000 skeleton samples of Mexican Sign Language (MSL) covering 30 different signs, where each sample consists of the 3D spatial coordinates of the body, face, and hands. They mainly computed motion from the extracted skeleton key points and, finally, reported the recognition accuracy obtained by an LSTM with gated recurrent units (GRUs) [29].
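The following is a minimal sketch of a recurrent classifier over per-frame motion computed from skeleton key points, in the spirit of the approach described above; it assumes PyTorch, and the 67-point layout, clip length, hidden size, and class count are placeholders rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

class KeypointGRUClassifier(nn.Module):
    """GRU over per-frame motion features computed from skeleton key points."""
    def __init__(self, num_points=67, num_classes=30, hidden=128):
        super().__init__()
        self.gru = nn.GRU(input_size=num_points * 3, hidden_size=hidden,
                          num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, keypoints):                        # keypoints: (batch, frames, 67, 3)
        motion = keypoints[:, 1:] - keypoints[:, :-1]    # frame-to-frame displacement
        x = motion.flatten(2)                            # (batch, frames - 1, 67 * 3)
        _, h = self.gru(x)                               # h: (num_layers, batch, hidden)
        return self.head(h[-1])                          # class logits

# Example: a batch of 8 clips, 40 frames each, 67 key points in 3D.
logits = KeypointGRUClassifier()(torch.randn(8, 40, 67, 3))
```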

2. Skeleton-Based Sign Language Recognition

Many researchers have worked to develop automatic sign language recognition systems using various approaches, including segmentation, semantic detection, feature extraction, and classification [9][30][31][32][33][34][35][36]. Some studies have used the scale-invariant feature transform (SIFT) [33][34] or the histogram of oriented gradients (HOG) [34][35] for hand-crafted feature extraction, followed by machine learning classifiers such as the support vector machine (SVM) or k-nearest neighbors (kNN) [9][35][36]. The main drawback of these segmentation and semantic-detection methods is that they may struggle to perform well on video or large-scale data sets.

To overcome these challenges, researchers have recently focused on various deep-learning-based approaches to improve feature learning and SLR classification accuracy on video and large-scale data sets [1][2][3][4][30][31][37][38][39][40][41][42]. Existing SLR systems still face many difficulties in achieving good performance, owing to the high computational cost of processing the relevant information, the large number of gestures to recognize, and the need for discriminative features. One of the most common challenges is capturing the global body motion skeleton at the same time as local arm, hand, and facial expressions. Neverova et al. employed the ModDrop framework to initialize individual modalities and fuse them gradually in order to capture spatial information [37][38]. They achieved good performance in terms of spatial and temporal information across multiple modalities; however, one drawback of their approach is that they augmented the data with audio, which is not effective in all settings. Pu et al. employed connectionist temporal classification (CTC) for sequence modeling and a 3D convolutional residual network (3D-ResNet) for feature learning [38][39]. The LSTM and CTC decoder were jointly trained with a soft dynamic time warping (soft-DTW) alignment constraint. Finally, they trained the 3D-ResNet with the resulting labels and validated the model on the RWTH-PHOENIX-Weather and CSL data sets, obtaining word error rates (WERs) of 36.7% and 32.7%, respectively. Koller et al. employed a hybrid CNN-HMM model to combine two kinds of features, namely the discriminative features of the CNN and the sequence modeling of the hidden Markov model (HMM) [30][31]. They reported good recognition accuracy on three benchmark sign language data sets, reducing the WER by 20%. Huang et al. proposed an attention-based 3D convolutional neural network (3D-CNN) for SLR, extracting spatio-temporal features and selecting salient information with an attention mechanism [43][44]. They evaluated their model on the CSL and ChaLearn 2014 benchmark data sets and achieved 95.30% accuracy on the ChaLearn data set. Pigou et al. proposed a simple temporal-feature-pooling method and showed that temporal information is important for deriving discriminative features in video classification [44][45]. They also exploited recurrence with temporal convolution, which can further improve video classification. Sincan et al. proposed a hybrid method combining an LSTM, feature pooling, and a CNN to recognize isolated sign language [45][46]. They used a VGG-16 pre-trained model in the CNN part and two parallel branches for learning RGB and depth information, and finally achieved 93.15% accuracy on the Montalbano Italian sign language data set.
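A minimal sketch of such a hybrid frame-wise CNN plus recurrent model is given below, assuming PyTorch/torchvision with a VGG-16 backbone as a stand-in; the hyperparameters and the single-stream (RGB-only) design are illustrative simplifications, not the cited two-stream RGB-plus-depth configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNLSTMSignClassifier(nn.Module):
    """Frame-wise CNN features followed by an LSTM and temporal pooling."""
    def __init__(self, num_classes=20, hidden=256):
        super().__init__()
        vgg = models.vgg16()   # randomly initialized here; pre-trained weights can be loaded in practice
        self.backbone = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                                       # clips: (batch, frames, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).view(b, t, -1)   # (batch, frames, 512)
        out, _ = self.lstm(feats)
        return self.head(out.mean(dim=1))                           # average pooling over time

# Example: 2 clips of 16 RGB frames at 112x112.
logits = CNNLSTMSignClassifier()(torch.randn(2, 16, 3, 112, 112))
```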
Huang et al. proposed a continuous sign language recognition approach that eliminates temporal segmentation in pre-processing, which they called the hierarchical attention network with latent space (LS-HAN) [46][47]. It combines a two-stream CNN, a latent space, and a hierarchical attention network for video feature extraction, semantic gap bridging, and latent-space-based recognition, respectively. The main drawback of this work is that it extracts purely visual features, which are not effective for capturing hand gestures and body movements. Zhou et al. proposed a holistic visual-appearance-based approach and a 2D human-pose-based method to improve the performance of large-scale sign language recognition [47][48]. They also applied a pose-based temporal graph convolutional network (Pose-TGCN) to capture the temporal dependencies of pose trajectories and achieved 66% accuracy on 2000 word glosses. Liu et al. applied a feature extraction approach based on a deep CNN with stacked temporal fusion layers and a sequence learning model (a bidirectional RNN) [48][49]. Guo et al. employed a hierarchical LSTM approach with word embedding, including visual content, for SLR [49][50]. In their method, spatio-temporal information is first extracted using a 3D CNN and then compacted into visemes with the help of an online key based on adaptive variable length; however, this approach may not capture motion information well.

The main drawback of image- and video-pixel-based methods is their high computational complexity. To overcome this, researchers have considered joint points, instead of the pixels of full images, for hand gesture and action recognition [50][51][52]. Various models have been proposed for skeleton-based gesture recognition, including LSTMs [24] and RNNs [53][54]. Yan et al. applied a graph-based method, ST-GCN, which models skeleton dynamics for action recognition using a graph convolutional network (GCN) [24]. Following this work, many researchers have employed modified versions of ST-GCN to improve recognition accuracy for hand gestures and human activities. Li et al. employed an encoder-decoder to extract action-specific latent information [52][53]; they added two types of links for this purpose and then applied a GCN-based approach (the actional-structural GCN) to learn temporal and spatial information. Shi et al. employed a two-stream GCN [54][55] and a multi-stream GCN [21] for action recognition; in the multi-stream GCN, they integrated the GCN with a spatio-temporal network to select the most informative joints and features. Zhang et al. proposed a decoupling GCN for skeleton-based action recognition [20]. Song et al. proposed ResGCN integrated with part-wise attention (PartAtt) to improve the accuracy and computational cost of skeleton-based action recognition [22]; however, the main drawback was that its performance was not significantly higher than that of the existing ResNet. Amorim et al. proposed human-skeleton-movement-based sign language recognition using ST-GCN, in which informative key points are selected from the whole-body key points; they achieved 85.0% accuracy on their data set (named ASLLVD) [55][56]. The disadvantage of this work is that only one hand was considered together with the body key points.
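To make the core operation shared by these GCN-based models concrete, the following is a minimal sketch of a single spatial graph convolution over skeleton joints, assuming PyTorch; the toy adjacency matrix and feature sizes are placeholders, and real ST-GCN-style models additionally partition the neighborhood, stack temporal convolutions, and learn edge weights.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """One spatial graph convolution: mix joint features along a normalized skeleton adjacency."""
    def __init__(self, adjacency, in_channels, out_channels):
        super().__init__()
        A = adjacency + torch.eye(adjacency.size(0))        # add self-loops
        deg_inv_sqrt = A.sum(dim=1).pow(-0.5)
        self.register_buffer('A_norm', deg_inv_sqrt[:, None] * A * deg_inv_sqrt[None, :])
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):                                   # x: (batch, channels, frames, joints)
        x = torch.einsum('bctv,vw->bctw', x, self.A_norm)   # aggregate neighbouring joints
        return self.proj(x)                                 # per-joint feature transform

# Example: a toy 5-joint chain skeleton, 16 frames, 3-channel (x, y, confidence) input.
A = torch.zeros(5, 5)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1
out = SpatialGraphConv(A, in_channels=3, out_channels=64)(torch.randn(8, 3, 16, 5))
```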
Perez et al. extracted 67 key points, covering face, body, and hand gestures, using a special depth camera and achieved good performance with an LSTM [29]. In the same way, many researchers have considered 133 key points from the whole body to recognize sign language [8]. Jiang et al. applied a different approach with a multimodal data set that includes full-body skeleton points and achieved good recognition accuracy [8].
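As an illustration of assembling whole-body key points per frame, the sketch below uses the MediaPipe Holistic solution as a stand-in estimator (543 landmarks in total); the cited works use their own pose estimators and layouts (e.g., 133 whole-body key points), so the part counts here are assumptions for illustration only.

```python
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def whole_body_keypoints(frame_rgb, holistic):
    """Concatenate pose, face, and both-hand landmarks into one per-frame key-point vector."""
    res = holistic.process(frame_rgb)

    def to_array(landmarks, count):
        if landmarks is None:                          # zero-fill missing parts (e.g., an occluded hand)
            return np.zeros((count, 3))
        return np.array([[p.x, p.y, p.z] for p in landmarks.landmark])

    parts = [to_array(res.pose_landmarks, 33),
             to_array(res.face_landmarks, 468),
             to_array(res.left_hand_landmarks, 21),
             to_array(res.right_hand_landmarks, 21)]
    return np.concatenate(parts).reshape(-1)           # (543 * 3,) feature vector for this frame

# Usage: one Holistic instance reused across the frames of a sign video.
# with mp_holistic.Holistic(static_image_mode=False) as holistic:
#     sequence = np.stack([whole_body_keypoints(f, holistic) for f in frames])
```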