Sign2Pose: A Pose-Based Approach for Gloss Prediction

Word-level sign language recognition (WSLR) is the backbone of continuous sign language recognition (CSLR), which infers glosses from sign videos. Finding the relevant gloss for a sign sequence and detecting explicit gloss boundaries in sign videos remain persistent challenges. The Sign2Pose gloss prediction transformer addresses this by identifying the intermediate glosses for a given input video sequence.

Keywords: sign language recognition; gloss prediction; transformer; pose-based approach; pose estimation; deep learning

1. Introduction

Sign language, which has its own underlying structure, grammar, syntax, and complexities, is the main mode of communication within the Deaf community. To comprehend sign language, one must consider many factors, including hand movements, head posture, hand posture, shoulder posture, lip position, and facial expressions. However, in an environment where spoken language is far more prevalent, the Deaf community faces communication barriers and separation from society. Systems that can interpret sign language into spoken language are therefore becoming increasingly valuable in alleviating these communication difficulties.
The early stages of sign language research focused primarily on sign language recognition (SLR). SLR addresses action recognition from a performed sign sequence without paying attention to its grammatical and linguistic structure. In other words, SLR interprets performed signs of alphabets [1], numbers [2], or symbols [3] from either static images or continuous sequences of images [4] and is categorized into isolated SLR [5] and dynamic SLR [6]. Continuous SLR recognizes sign postures from a continuous sequence of sign language video, which can be either an isolated-word video or a continuous spoken-sentence sequence, whereas isolated SLR recognizes sign postures from a single static image. Prior systems relied on hidden Markov model-based sequence recognition [7] and per-image feature extraction [8], a pipeline inspired by the effectiveness of automatic speech recognition. The design of the features to be extracted posed the biggest challenge in SLR: it was difficult to create a reliable algorithm that could extract the key linguistic elements, such as hand shape [9], body movements [10], and facial expression [11], even though these had already been identified. Later, with the advancement of deep learning, manually constructed feature extraction was replaced by features extracted automatically with CNN models [2][12][13]. Despite performing automatic feature extraction, these models suffered from overfitting, class imbalance, and exploding gradients, and they lagged significantly in encoding an object's orientation and position. Soon, many hybrid models emerged, combining CNN with HMM [14], CNN with DCGAN [15], CNN with LSTM [16][17], CNN with SVM [18], and CNN with hybrid segmentation [19]. The advent of 3D CNNs [17][20][21] then brought outstanding growth in spatio-temporal feature extraction.
Although deep learning has produced state-of-the-art results on the various challenges of SLR [16][22], deep learning models require annotated datasets to tune CSLR models and enhance the training of the end-to-end sequence translation process. For this to happen, the model should first be trained on isolated words to increase CSLR performance. To resolve this issue, Chen et al. [23] proposed a transfer-learning-based approach that addresses data scarcity to some extent by gradually pretraining visual and linguistic modules from general domains towards the target domain. However, this strategy still requires annotated data to improve the model's performance. The development of better-trained sign language translation models is hampered by a lack of data, and owing to this issue, the performance of current CSLR models needs to be improved. Although various methods and architectures have been proposed to obtain exact interpretations of sign language through SLR and CSLR, meaningful translation of performed sign language is still lacking. Ever since the advent of deep learning and its application in computer vision, the pairing of vision and language has received considerable attention.
Sign language translation (SLT) [24] is the transcription of a sign language video into spoken sentence phrases, paying attention to the rich underlying grammatical structures so that the user can understand the underlying language model, the spatial representations, and the mapping between sign and spoken language. SLT is far more complex than SLR because it considers additional visual cues such as body posture, facial expressions, and signing position. In sign transcription, which is literally a written version of the sign performance, glosses are the intermediary representation. A gloss is the word associated with a specific sign, also known as its label [25]. The structure of glosses differs from that of spoken languages, and they serve as the foundation of complete sign sequence translation. For example, if a signer performs the sign sequence for the phrase "The weather is too cold today", the sign translation model suggests the relevant glosses, such as "weather", "cold", and "today".
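As a minimal sketch of this intermediary layer, the snippet below shows how a word-level recognizer emits an ordered gloss sequence that a downstream gloss-to-text model later turns into a spoken sentence. The gloss vocabulary and mapping are hypothetical and only illustrate the data flow, not any cited model.

```python
# Minimal sketch of the gloss layer between a sign video and a spoken sentence.
# The vocabulary and example mapping are hypothetical.
gloss_vocab = ["WEATHER", "COLD", "TODAY", "HELLO", "THANK-YOU"]

# Output of a word-level recognizer (WSLR) for one video: an ordered gloss sequence.
predicted_glosses = ["WEATHER", "COLD", "TODAY"]

def glosses_to_ids(glosses, vocab):
    """Map glosses to integer ids, as a downstream sequence model would consume them."""
    return [vocab.index(g) for g in glosses]

# A gloss-to-text model would then generate, e.g., "The weather is too cold today."
print(glosses_to_ids(predicted_glosses, gloss_vocab))  # [0, 1, 2]
```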

2. Significance of Glosses in Vision-Based CSLT

Recognizing the exact gloss representation for a performed sign sequence plays a significant role in CSLT. The biggest challenges of a CSLT system are the insufficient annotated datasets, identifying the explicit boundaries of signed words in the extracted frames of a sign video, and transcribing target sentences from the extracted gloss sequences. In early work, hidden Markov models [26][27][28] were widely used to capture temporal feature information. Cui et al. [29] proposed a deep neural network (DNN) for temporal feature extraction and an RNN for sequence learning. Their framework uses an iterative training process that incorporates gloss annotations from video segments and an alignment proposal module that generates aligned sentence sequences from the extracted glosses. This approach shows that iterative sequence learning eliminates the need for massive amounts of data to train an HMM model. Although these modalities are superior in learning temporal dependencies, integrating multiple modalities needs further investigation because performed sign gestures carry concurrently related streams of information. Further, Sharma et al. [30] proposed a deep transfer learning approach for sign sentence recognition, using a convolutional neural network together with a bi-directional LSTM and connectionist temporal classification (CTC). The added advantage of this model is that it can be trained end-to-end to recognize sentence sequences without requiring any prior knowledge. However, connectionist temporal classification suffers from severe overfitting during training. To resolve this issue, Niu et al. [31] used stochastic fine-grained labelling while training the model. To extract gloss information from sign video frames, the model must have contextual information so that it can associate the actual context of a sign with its gloss. To ensure this, Tunga et al. [32] proposed a pose-based SLR approach for gloss identification with contextual information using a graph convolutional network (GCN) and a BERT transformer. Although this model concentrates on both spatial and temporal feature fusion, combining the pose-based approach with image-based features could further enhance model performance. On the other hand, Cui et al. [33] proposed a model for real-time CSLR in which an RNN addresses the mapping to relevant glosses by means of a weakly supervised detection network using a connectionist temporal alignment proposal for continuous spoken sentence translation. However, this method requires improvement to handle multi-modal information.
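To make the role of CTC concrete, the following is a hedged sketch of a BiLSTM encoder with a CTC objective over per-frame features, in the spirit of the CNN + BiLSTM + CTC pipelines discussed above. The dimensions, vocabulary size, and dummy batch are illustrative, not the cited authors' exact settings.

```python
import torch
import torch.nn as nn

# Hedged sketch: a BiLSTM encoder over per-frame visual features with a CTC head.
# Sizes and names are illustrative, not the cited authors' exact models.
num_glosses = 1000          # gloss vocabulary size (index 0 reserved for the CTC blank)
feat_dim, hidden = 512, 256

encoder = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
classifier = nn.Linear(2 * hidden, num_glosses + 1)   # +1 for the blank symbol
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Dummy batch: 2 videos, 80 frames each, with precomputed frame features.
frames = torch.randn(2, 80, feat_dim)
gloss_targets = torch.tensor([3, 17, 42, 9, 28, 5])   # concatenated gloss ids (> 0)
target_lengths = torch.tensor([3, 3])                 # 3 glosses per video
input_lengths = torch.tensor([80, 80])

hidden_states, _ = encoder(frames)                    # (N, T, 2*hidden)
log_probs = classifier(hidden_states).log_softmax(-1) # (N, T, C)
loss = ctc_loss(log_probs.permute(1, 0, 2),           # CTC expects (T, N, C)
                gloss_targets, input_lengths, target_lengths)
loss.backward()
```

CTC lets the network learn the frame-to-gloss alignment implicitly, which is why no explicit gloss boundary annotations are needed.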
To ease this, transfer learning is employed by first training the deep learning network on an isolated-word dataset, which addresses the limited-data problem. Rastgoo et al. [16] adopted this transfer learning technique together with a post-processing algorithm to address the limited labelled dataset issue.
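A minimal sketch of this idea follows, assuming a PyTorch image backbone and a hypothetical checkpoint pretrained on isolated signs; the file name, head size, and choice of backbone are placeholders, not ref. [16]'s implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hedged sketch of the transfer-learning idea: start from a backbone trained on an
# isolated-word (WSLR) dataset, then fine-tune a new gloss head for continuous data.
backbone = models.resnet18(weights=None)
# Hypothetical checkpoint pretrained on isolated signs; strict=False tolerates a
# differently sized classification head in the stored weights.
backbone.load_state_dict(torch.load("isolated_sign_backbone.pt"), strict=False)

for p in backbone.parameters():          # freeze the pretrained visual features
    p.requires_grad = False

num_glosses = 1000
backbone.fc = nn.Linear(backbone.fc.in_features, num_glosses)  # new trainable gloss head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-4)
```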

3. End-to-End and Two-Stage Translation in SLT

With recent advances in neural machine translation, several works have concentrated on designing gloss-free models that generate textual content directly from the visual domain using cross-modal mappings, without any intermediate glosses. Zhao et al. [34] proposed a novel framework for sign-video-to-spoken-sentence generation using three key modules. In their model, they replaced the gloss generation module with a word existence module that checks for word existence in the input sign video, applying a CNN encoder–decoder for video feature extraction and a logistic regression classifier for word existence verification. However, challenges remain in direct visual-to-text mapping. Additionally, training an SLT model is challenging for longer sentences and video sequences, and decoding a sentence from the input sign video after extracting finite-dimensional features is tedious. Further, a key point normalization method for normalizing the signer's skeleton points was proposed in ref. [35] to translate sign videos into spoken sentences directly without any intermediate gloss; the authors applied a stochastic frame selection method for sampling and frame augmentation and transcribed sign language videos into spoken sentences using attention models. However, the direct sign-to-text translation outcomes were no better. Since end-to-end translation requires a huge amount of data to train and tune the model, two-stage SLT is the better option for CSLT, although processing the input sequence is time-consuming.
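The two preprocessing ideas mentioned above can be illustrated with the hedged sketch below. The shoulder-based normalization and segment-wise random sampling are simplified stand-ins for the key point normalization and stochastic frame selection of ref. [35], and the keypoint indices are assumed to follow a COCO-style layout.

```python
import numpy as np

# Hedged sketch of two preprocessing steps; the normalization scheme and sampling
# rule are illustrative, not ref. [35]'s exact formulation.

def normalize_keypoints(kps, left_shoulder=5, right_shoulder=6):
    """Center keypoints on the shoulder midpoint and scale by shoulder width,
    so signer position and camera distance do not affect the features.
    kps: (num_keypoints, 2) array of (x, y) coordinates for one frame.
    Shoulder indices assume a COCO-style keypoint layout."""
    center = (kps[left_shoulder] + kps[right_shoulder]) / 2.0
    scale = np.linalg.norm(kps[left_shoulder] - kps[right_shoulder]) + 1e-6
    return (kps - center) / scale

def stochastic_frame_selection(num_frames, num_samples, rng=np.random.default_rng()):
    """Split the video into equal segments and draw one random frame per segment,
    giving a fixed-length, temporally spread sample of the sequence."""
    edges = np.linspace(0, num_frames, num_samples + 1, dtype=int)
    return [int(rng.integers(edges[i], max(edges[i] + 1, edges[i + 1])))
            for i in range(num_samples)]

frame_ids = stochastic_frame_selection(num_frames=240, num_samples=64)
```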
Using glosses as a mid-level representation drastically improves SLT performance [24]. Additionally, sign-to-gloss translation averts long-term dependencies [36], and the number of sign glosses in a particular sign video is minimal compared with the number of frames in the video [14]. Therefore, combining gloss representation with sign language recognition and translation, Camgoz et al. [37] proposed a unified architecture that jointly learns continuous sign language recognition and translation via CTC, thereby improving sequence-to-sequence learning and achieving performance independent of ground-truth timing information. A detailed summary of existing deep learning models for two-stage SLT is given in Table 1.
Table 1. Summary of existing methods for gloss prediction using two-stage SLT.
Similarly, sign-to-gloss→gloss-to-text is one of the most effective translation protocols: instead of training a text-to-text translation network from scratch, it yields better results for gloss-to-text translation. A Sign2Gloss translation protocol network based on a modified standard transformer has also been proposed.
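As an illustration of the first stage of such a protocol, the sketch below builds a Sign2Gloss encoder from a standard PyTorch transformer. It is a simplified stand-in for the modified transformer mentioned above, with illustrative layer sizes and a per-frame gloss head.

```python
import torch
import torch.nn as nn

# Hedged sketch of a Sign2Gloss encoder built from a standard transformer encoder;
# layer sizes and the per-frame classification head are illustrative.
class Sign2GlossEncoder(nn.Module):
    def __init__(self, feat_dim=150, d_model=256, num_glosses=1000, num_layers=4):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)        # per-frame pose/visual features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.gloss_head = nn.Linear(d_model, num_glosses + 1) # +1 for a CTC blank

    def forward(self, frame_feats):                           # (N, T, feat_dim)
        x = self.encoder(self.input_proj(frame_feats))        # (N, T, d_model)
        return self.gloss_head(x).log_softmax(-1)             # per-frame gloss log-probs

model = Sign2GlossEncoder()
log_probs = model(torch.randn(2, 80, 150))                    # gloss posteriors for 2 videos
```

The resulting per-frame gloss posteriors would feed a CTC-style decoding step and then a separate gloss-to-text translation model in the second stage.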

4. Video Analysis and Summarization

Sign language translation takes time to process continuous sign sequences. As a result, incorporating video summarization or video processing techniques into SLT may improve gloss recognition accuracy in the Sign2Gloss translation protocol. Video summarization and video processing are, in turn, very common in video recognition and action recognition tasks [44]. The primary goal of video processing is to choose a subset of frames to facilitate fast computation when processing lengthy videos. Yao et al. [45] proposed a key frame extraction technique based on multi-feature fusion for processing dance videos in order to recognize various dance motions and steps. Furthermore, a smart key-frame extraction technique was proposed by Wang et al. [46] for vehicle target recognition. This model integrates the scale-invariant feature transform (SIFT) and a background difference algorithm, coupled with the concept of a criterion factor K, to divide and categorize frames into non-mutation and mutation frames; redundant frames are then discarded. However, because it skips a larger number of frames than SLT can afford, this method is only appropriate for vehicle recognition. To address the details lost by such frame extraction methods, Li et al. [47] proposed sparse coding with a log-regularizer for key frame extraction. This method overcomes the challenge of losing pertinent data when discarding redundant frames during key frame extraction. However, it is unsuitable for complex videos because it strips away high-level semantic information from the video.
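As a simplified illustration of key-frame selection, the sketch below keeps a frame only when it differs sufficiently from the last kept frame. This frame-difference criterion is a much simpler stand-in for the SIFT plus background-difference scheme of ref. [46], and the threshold value is arbitrary.

```python
import cv2
import numpy as np

# Hedged sketch of key-frame selection by inter-frame difference: keep a frame only
# when it differs enough from the last kept frame. Simplified stand-in for ref. [46];
# the threshold is arbitrary.
def extract_key_frames(video_path, diff_threshold=12.0):
    cap = cv2.VideoCapture(video_path)
    key_frames, last_kept = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if last_kept is None or np.mean(cv2.absdiff(gray, last_kept)) > diff_threshold:
            key_frames.append(frame)      # enough appearance change: keep this frame
            last_kept = gray
    cap.release()
    return key_frames
```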

5. Pose-Based Methods for SLT

Human pose-based architectures are used not only for action recognition but have also been applied to specific tasks in WSLR and SLT since the advancement of deep learning. Pose estimation is performed either with a probabilistic graphical model or with pictorial structures [48]. So far, human pose estimation has achieved outstanding results on static or isolated images. However, it underperforms on real-time or dynamic inputs such as video because of occlusion tracking issues, motion blur during transitions, and an inability to capture the temporal dependency between extracted video frames. Pose/skeletal data hold positional information about the human body and can provide important cues [49]. Using the RWTH-Phoenix-2014T dataset, a skeleton-based graph convolutional network was proposed for end-to-end translation; it used only 14 key points, omitting fine-grained key points on the fingers and face, resulting in poor end-to-end translation performance. Nevertheless, skeleton-based methods have gained attention in modern research since they are independent of background variations, and among skeleton-based SLR models, RGB-based skeletal methods perform particularly well. To overcome the performance degradation of the earlier work, Jiang et al. [44] proposed a skeleton-aware multi-modal ensemble over RGB frames that uses 33 key points, including points on the nose, mouth, upper body, and hands. This framework makes use of multi-modal information and utilizes a sign language graph convolutional network (SL-GCN) to build embedded dynamics. In another work, Novopoltsev et al. [50] investigated enhancing the recognition performance of SLR models through fine-tuning on sign language datasets and analyzed whether such models can be used in a real-time environment without a GPU.
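For concreteness, the sketch below extracts per-frame body and hand key points with MediaPipe Holistic, one common open toolkit for pose-based SLR features. It is an illustration only, not the exact extraction pipeline of refs. [44] or [50], and the input video path is hypothetical.

```python
import cv2
import numpy as np
import mediapipe as mp

# Hedged sketch: per-frame upper-body and hand key-point extraction with MediaPipe
# Holistic. Not the exact pipeline used in refs. [44] or [50].
mp_holistic = mp.solutions.holistic

def frame_keypoints(frame_bgr, holistic):
    """Return a flat (x, y) feature vector for pose + both hands (zeros if missing)."""
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    feats = []
    for lm_set, n in [(results.pose_landmarks, 33),
                      (results.left_hand_landmarks, 21),
                      (results.right_hand_landmarks, 21)]:
        if lm_set is None:
            feats.extend([0.0] * (2 * n))          # keypoints occluded or not detected
        else:
            for lm in lm_set.landmark:
                feats.extend([lm.x, lm.y])
    return np.array(feats, dtype=np.float32)

with mp_holistic.Holistic(static_image_mode=False) as holistic:
    cap = cv2.VideoCapture("sign_clip.mp4")        # hypothetical input video
    ok, frame = cap.read()
    if ok:
        vec = frame_keypoints(frame, holistic)     # 150-dimensional (x, y) feature vector
    cap.release()
```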
Yang et al. introduced a graph convolutional neural network model to deal with the temporal dependency among extracted frames. Following this work, many others proposed various methods for pose estimation, such as the GCN-BERT method [32], key point extraction using OpenPose [51], action-structured graph convolutional networks [52], and MS-G3D for spatio-temporal graph convolutional networks.
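The core operation shared by these graph-based models can be illustrated with a single spatial graph convolution over a toy skeleton. The sketch below uses a five-joint graph (a wrist connected to four fingertips) and is far simpler than SL-GCN or MS-G3D; the adjacency and feature sizes are illustrative.

```python
import numpy as np

# Hedged, minimal sketch of one spatial graph convolution over a skeleton graph:
# H = ReLU(D^-1/2 (A + I) D^-1/2 X W). Illustrative 5-joint graph only.
num_joints, in_dim, out_dim = 5, 2, 16
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]   # wrist connected to four fingertips

A = np.eye(num_joints)                     # adjacency with self-loops
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt        # symmetrically normalized adjacency

X = np.random.randn(num_joints, in_dim)    # per-joint (x, y) coordinates for one frame
W = np.random.randn(in_dim, out_dim) * 0.1 # learnable weights in a real model
H = np.maximum(A_hat @ X @ W, 0.0)         # joint features mixed along skeleton edges
```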
The pose-based approach proposed by Youngmin et al. [51] introduced video processing and key point extraction techniques that aided frame selection and key point extraction for precise body movement and location; a sign-to-text translation protocol was used in this pose-based approach. However, direct translation from sign language video to spoken sentences did not produce good results. In addition to these methods, automatic sign language translation is possible by merging NLP transformers with computer vision; for such tasks, a video-transformer network was proposed by De Coster et al. [53]. However, these transformer networks require a huge labelled data corpus to fine-tune and train automatic SLR models, and this method was evaluated using the large-scale annotated Turkish Sign Language corpus to meet that requirement.

References

  1. Asst professor, B.M.; Dept, C. Automatic Sign Language Finger Spelling Using Convolution Neural Network: Analysis. Int. J. Pure Appl. Math. 2017, 117, 9–15.
  2. Jennifer Eunice, R.; Hemanth, D.J. Deep CNN for Static Indian Sign Language Digits Recognition. In Frontiers in Artificial Intelligence and Applications; IOS Press: Amsterdam, The Netherlands, 2022; Volume 347, pp. 437–446.
  3. Chajri, Y.; Bouikhalene, B. Handwritten mathematical symbols dataset. Data Br. 2016, 7, 432–436.
  4. Huang, J.; Zhou, W.; Zhang, Q.; Li, H.; Li, W. Video-based sign language recognition without temporal segmentation. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 2257–2264.
  5. Tolentino, L.K.S.; Serfa Juan, R.O.; Thio-ac, A.C.; Pamahoy, M.A.B.; Forteza, J.R.R.; Garcia, X.J.O. Static sign language recognition using deep learning. Int. J. Mach. Learn. Comput. 2019, 9, 821–827.
  6. Liao, Y.; Xiong, P.; Min, W.; Min, W.; Lu, J. Dynamic Sign Language Recognition Based on Video Sequence with BLSTM-3D Residual Networks. IEEE Access 2019, 7, 38044–38054.
  7. Kumar, P.; Gauba, H.; Roy, P.P.; Dogra, D.P. Coupled HMM-based Multi-Sensor Data Fusion for Sign Language Recognition. Pattern Recognit. Lett. 2016, 86, 1–8.
  8. Chabchoub, A.; Hamouda, A.; Al-Ahmadi, S.; Barkouti, W.; Cherif, A. Hand Sign Language Feature Extraction Using Image Processing. Adv. Intell. Syst. Comput. 2020, 1070, 122–131.
  9. Ong, E.J.; Bowden, R. A boosted classifier tree for hand shape detection. In Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition, Seoul, Republic of Korea, 19 May 2004; pp. 889–894.
  10. Charles, J.; Pfister, T.; Everingham, M.; Zisserman, A. Automatic and efficient human pose estimation for sign language videos. Int. J. Comput. Vis. 2014, 110, 70–90.
  11. Liu, J.; Liu, B.; Zhang, S.; Yang, F.; Yang, P.; Metaxas, D.N.; Neidle, C. Non-manual grammatical marker recognition based on multi-scale, spatio-temporal analysis of head pose and facial expressions. Image Vis. Comput. 2014, 32, 671–681.
  12. Cheng, K.L.; Yang, Z.; Chen, Q.; Tai, Y.W. Fully Convolutional Networks for Continuous Sign Language Recognition. In Lecture Notes in Computer Science; Including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12369 LNCS, pp. 697–714.
  13. Koller, O.; Ney, H.; Bowden, R. Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data Is Continuous and Weakly Labelled. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
  14. Koller, O.; Zargaran, S.; Ney, H. Resign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3416–3424.
  15. Zhang, F.; Sheng, J. Gesture Recognition Based on CNN and DCGAN for Calculation and Text Output. IEEE Access 2019, 7, 28230–28237.
  16. Rastgoo, R.; Kiani, K.; Escalera, S. Word separation in continuous sign language using isolated signs and post-processing. arXiv 2022, arXiv:2204.00923.
  17. Guo, D.; Zhou, W.; Li, H.; Wang, M. Hierarchical LSTM for sign language translation. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 6845–6852.
  18. Agha, R.A.A.R.; Sefer, M.N.; Fattah, P. A comprehensive study on sign languages recognition systems using (SVM, KNN, CNN and ANN). In Proceedings of the First International Conference on Data Science, E-learning and Information Systems—DATA'18, New York, NY, USA, 1–2 October 2018; ACM Press: New York, NY, USA, 2018; pp. 1–6.
  19. Rahim, M.A.; Islam, M.R.; Shin, J. Non-touch sign word recognition based on dynamic hand gesture using hybrid segmentation and CNN feature fusion. Appl. Sci. 2019, 9, 3790.
  20. Wu, Y.; Zhou, Y.; Zeng, W.; Qian, Q.; Song, M. An Attention-based 3D CNN with Multi-scale Integration Block for Alzheimer’ s Disease Classification. IEEE J. Biomed. Health Inform. 2022, 26, 5665–5673.
  21. Neto, G.M.R.; Junior, G.B.; de Almeida, J.D.S.; de Paiva, A.C. Sign Language Recognition Based on 3D Convolutional Neural Networks. In Lecture Notes in Computer Science; Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; Springer: Berlin/Heidelberg, Germany, 2018; Volume 10882 LNCS, pp. 399–407.
  22. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014, 4, 3104–3112.
  23. Chen, Y.; Wei, F.; Sun, X.; Wu, Z.; Lin, S. A Simple Multi-Modality Transfer Learning Baseline for Sign Language Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 5110–5120.
  24. Camgoz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural Sign Language Translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7784–7793.
  25. Jin, T.; Zhao, Z.; Zhang, M.; Zeng, X. Prior Knowledge and Memory Enriched Transformer for Sign Language Translation. In Findings of the Association for Computational Linguistics: ACL 2022; Association for Computational Linguistics, 2022; pp. 3766–3775.
  26. Koller, O.; Zargaran, S.; Ney, H.; Bowden, R. Deep sign: Hybrid CNN-HMM for continuous sign language recognition. In Proceedings of the British Machine Vision Conference 2016, York, UK, 19–22 September 2016; pp. 136.1–136.12.
  27. Wu, D.; Pigou, L.; Kindermans, P.J.; Le, N.D.H.; Shao, L.; Dambre, J.; Odobez, J.M. Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1583–1597.
  28. Koller, O.; Forster, J.; Ney, H. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Underst. 2015, 141, 108–125.
  29. Cui, R.; Liu, H.; Zhang, C. A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training. IEEE Trans. Multimed. 2019, 21, 1880–1891.
  30. Sharma, S.; Gupta, R.; Kumar, A. Continuous sign language recognition using isolated signs data and deep transfer learning. J. Ambient Intell. Humaniz. Comput. 2021, 1, 1531–1542.
  31. Niu, Z.; Mak, B. Stochastic Fine-Grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 172–186.
  32. Tunga, A.; Nuthalapati, S.V.; Wachs, J. Pose-based Sign Language Recognition using GCN and BERT. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 31–40.
  33. Cui, R.; Liu, H.; Zhang, C. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1610–1618.
  34. Zhao, J.; Qi, W.; Zhou, W.; Duan, N.; Zhou, M.; Li, H. Conditional Sentence Generation and Cross-Modal Reranking for Sign Language Translation. IEEE Trans. Multimed. 2022, 24, 2662–2672.
  35. Kim, Y.; Kwak, M.; Lee, D.; Kim, Y.; Baek, H. Keypoint based Sign Language Translation without Glosses. arXiv 2022, arXiv:2204.10511.
  36. Du, Y.; Xie, P.; Wang, M.; Hu, X.; Zhao, Z.; Liu, J. Full transformer network with masking future for word-level sign language recognition. Neurocomputing 2022, 500, 115–123.
  37. Camgöz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation. arXiv 2020, arXiv:2003.13830v1.
  38. Ko, S.K.; Kim, C.J.; Jung, H.; Cho, C. Neural sign language translation based on human keypoint estimation. Appl. Sci. 2019, 9, 2683.
  39. Yin, K.; Read, J. Better Sign Language Translation with STMC-Transformer. arXiv 2020, arXiv:2004.00588.
  40. Walczynska, J. HandTalk: American Sign Language Recognition by 3D-CNNs. Ph.D. Thesis, University of Groningen, Groningen, The Netherlands, 2022.
  41. Papastratis, I.; Dimitropoulos, K.; Daras, P. Continuous Sign Language Recognition through a Context-Aware Generative Adversarial Network. Sensors 2021, 21, 2437.
  42. Bohacek, M.; Hruz, M. Sign Pose-based Transformer for Word-level Sign Language Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, Waikoloa, HI, USA, 4–8 January 2022; pp. 182–191.
  43. Inan, M.; Zhong, Y.; Hassan, S.; Quandt, L.; Alikhani, M. Modeling Intensification for Sign Language Generation: A Computational Approach. arXiv 2022, arXiv:2203.09679.
  44. Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble. arXiv 2021, arXiv:2110.06161v1.
  45. Yao, P. Key Frame Extraction Method of Music and Dance Video Based on Multicore Learning Feature Fusion. Sci. Program. 2022, 2022, 9735392.
  46. Wang, J.; Zeng, C.; Wang, Z.; Jiang, K. An improved smart key frame extraction algorithm for vehicle target recognition. Comput. Electr. Eng. 2022, 97, 107540.
  47. Li, Z.; Li, Y.; Tan, B.; Ding, S.; Xie, S. Structured Sparse Coding With the Group Log-regularizer for Key Frame Extraction. IEEE/CAA J. Autom. Sin. 2022, 9, 1818–1830.
  48. Nie, B.X.; Xiong, C.; Zhu, S.C. Joint action recognition and pose estimation from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1293–1301.
  49. Gan, S.; Yin, Y.; Jiang, Z.; Xie, L.; Lu, S. Skeleton-Aware Neural Sign Language Translation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4353–4361.
  50. Novopoltsev, M.; Verkhovtsev, L.; Murtazin, R.; Milevich, D.; Zemtsova, I. Fine-tuning of sign language recognition models: A technical report. arXiv 2023, arXiv:2302.07693.
  51. Shalev-Arkushin, R.; Moryossef, A.; Fried, O. Ham2Pose: Animating Sign Language Notation into Pose Sequences. arXiv 2022, arXiv:2211.13613.
  52. Liu, F.; Dai, Q.; Wang, S.; Zhao, L.; Shi, X.; Qiao, J. Multi-relational graph convolutional networks for skeleton-based action recognition. In Proceedings of the 2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), Exeter, UK, 17–19 December 2020; pp. 474–480.
  53. De Coster, M.; Van Herreweghe, M.; Dambre, J. Isolated sign recognition from RGB video using pose flow and self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3436–3445.