Sign Language Motion Generation

Version	Summary	Created by	Modification	Content Size	Created at	Operation
1		Manuel Gil-Martín	--	1289	2023-12-13 09:48:48	\|
2	Reference format revised.	Lindsay Dong	Meta information modification	1289	2023-12-18 09:58:50	\|

This entry is adapted from the peer-reviewed paper 10.3390/s23239365

Motion generation is a research field aimed at developing realistic 2D or 3D animations based on different inputs such as image sequences, speech recordings, or text descriptions. Most state-of-the-art sign language motion generation systems are based on expert rules or prerecorded movements.

motion generation motion dataset sign language motion generation

1. Introduction

Motion generation is an innovative research field which consists of producing movements, gestures, or animations by computer systems. This process involves the use of mathematical algorithms to create dynamic, natural, and fluid motions that simulate human movements, enhancing human–computer interactions. Motion generation has many different applications, such as movement of virtual characters in video games or animated films, robotics, and sign language representation in communication systems for deaf people. In sign language communication systems, motion generation allows for an increase in the naturality of interactive avatars or virtual assistants that respond with a sign language output. Increasing naturality allows for the development of friendly communication systems between deaf and hearing people. These systems empower and enhance the communication capabilities of deaf individuals, encouraging inclusivity and facilitating their integration into various social and professional scenarios.

2. Related Work

2.1. Motion Generation

Within human motion modeling ^[1]^[2]^[3]^[4], motion generation aims at creating realistic animations to simulate human gestures and movements in virtual avatars. This generation can be achieved using different types of information as inputs to the systems: image sequences, speech recordings, or text descriptions that represent a specific movement. Some previous works ^[5]^[6] focused on generating articulated motion RGB video sequences by estimating the human pose based on a single frame image. The proposed approach incorporated constraints from 2D/3D skeleton points and used generative adversarial networks (GAN) to generate a more human-like 3D human model. Other works have focused on generating animated avatars from audio. For instance, Shlizerman et al. ^[7] developed an avatar that moved its hands similarly to how a pianist or violinist would do using music as input. The authors trained a long short-term memory (LSTM) network considering audio features such as mel-frequency cepstral coefficients (MFCC) from violin and piano recital videos. Regarding the use of text description to generate motion, a previous work ^[8] generated 3D human motion sequences given a prescribed action type. The proposed approach used a conditional temporal variational autoencoder (VAE) approach endowed with Lie algebra representation.

2.2. Metrics for Motion Generation

Regarding the assessment of the generated motion, there are several possible ways to proceed. Some metrics, like smoothness, refer to how natural the generated motion is, and others, such as diversity or multimodality ^[8], refer to how realistic the generated motion is, including variance across all action categories or variety within each action class. However, there exist metrics that objectively determine how similar the generated motion is compared to the ground truth. Some of these objective metrics are Fréchet inception distance (FID), probability of correct key points (PCK), average position error (APE), and dynamic time warping (DTW). FID refers to the distance between the feature distribution of generated motion and the ground truth ^[8]. PCK scores evaluate the probability of pose key points to be close to the ground truth key points up to a specific threshold, with higher PCK corresponding to better pose generations ^[9]. APE covers the error for all the locations of the human pose (landmarks) at all times ^[10], computing the average position error. DTW finds an optimal alignment between two landmark time series by non-linearly warping them. A lower DTW corresponds to better landmark sequence generations ^[9].

Even though all these metrics have been used in the literature, some previous works have highlighted the efficacy of DTW compared to others, especially for aligning landmarks sequences. For example, a previous work ^[11], focusing on the motion capture data of the human gait, used DTW as a measure to align time instants, concluding that DTW is an effective technique in motion data analysis. In addition, another previous work ^[12] used a DTW-based metric to assess Kinect-enabled home-based physical rehabilitation exercises. They used this metric to compute motion similarity between two time series from an individual user and a virtual coach. They concluded that a DTW-based metric could be effectively used for the automatic performance evaluation of motion exercises. Moreover, a recent work ^[13] focused on determining which is the best automated metric for text to motion generation. The authors concluded that commonly used metrics such as R-Precision (a distance-based metric that measures the rate of correct motion prompt pair matchings) show strong correlations with the quality of the generated movement.

2.3. Sign Language Generation

Some previous works in the literature have directly focused on sign language motion generation, exploring the interplay between linguistic expression through videos or textual descriptions and physical movements that represent that language.

A recent work ^[14] proposed a pose-based approach for gloss prediction using a transformer model and datasets of Word-Level American Sign Language (WLASL) ^[15]. Particularly, they used WLASL100, which contains 2038 videos from 100 glosses and 97 signers, and WLASL300, which includes 5117 videos from 300 glosses and 109 signers. The authors achieved the top 1% recognition accuracy of 80.9% in WLASL100 and 64.21% in WLASL300 via a key frame extraction technique that used histogram difference and Euclidean distance metrics to select and drop redundant frames. The work also used data augmentation based on joint angle rotation. Another work ^[16] extracted landmarks from videos of British Sign Language using MediaPipe, but in the work, the landmarks were used as input to CNN and LSTM-based classifiers.

Some works focused on generating motion from natural language. For example, in a previous work ^[10], the authors built a neural architecture able to learn a joint embedding of language and pose, mapping linguistic concepts and motion animations to the same representation space. They used the KIT Motion-Language Dataset, reaching PCK scores of 70% when using a threshold of 55 mm. Another work ^[17] used complex natural language sentences to generate motion through an encoder-decoder structure. They used a hierarchical two-stream model (pose and sentence encoders) along with a pose discriminator. In this case, the model also learned a joint embedding for both pose and language.

Other works generated motion from speech. For instance, a previous work ^[9] focused on converting speech segments into sign language through a cross-modal discriminator, allowing the network to correlate between poses and speech time-steps. They obtained a DTW distance of 14.05 over an Indian Sign Language dataset. Another work ^[18] used a BERT encoder to transform source sentences to context vectors and a transformer decoder to generate Korean Sign Language sequences.

Finally, a previous paper ^[19] provided a method to automate the process of Indian Sign Language generation using plain RGB images. The method leverages animation data, using an intermediate 2D OpenPose representation, to train a sign language generation model.

3. Proposed System

The scholars propose analyse , and evaluate a deep learning architecture based on Transformers for generating sign language motion from sign phonemes (represented using HamNoSys: a notation system developed at University of Hamburg) ^[20]. The sign phonemes provide information about sign characteristics like hand configuration, localization, or movements. The use of sign phonemes is crucial for generating sign motion with a high level of details (including finger extensions and flexions). The Transformer-based approach also includes a stop detection module for deciding the end of the generation process. Both aspects, motion generation and stop detection, are evaluated in detail considering adapted performance metrics. For motion generation, the Dynamic Time Warping distance is used to compute the similarity between two landmarks sequences (ground truth and generated). The stop detection module is evaluated considering detection accuracy and ROC (Receiver Operating Characteristic) curves. The authors propose and evaluate several strategies to obtain the system configuration with the best performance. These strategies include different padding strategies, interpolation approaches, or data augmentation techniques. The best configuration of a fully automatic system obtains an average DTW distance per frame of 0.1059 and an Area Under the ROC Curve (AUC) higher than 0.94.

References

Gil-Martín, M.; San-Segundo, R.; Fernández-Martínez, F.; Ferreiros-López, J. Time Analysis in Human Activity Recognition. Neural Process. Lett. 2021, 53, 4507–4525.
Gil-Martín, M.; San-Segundo, R.; de Córdoba, R.; Pardo, J.M. Robust Biometrics from Motion Wearable Sensors Using a D-vector Approach. Neural Process. Lett. 2020, 52, 2109–2125.
Gil-Martín, M.; San-Segundo, R.; Fernández-Martínez, F.; de Córdoba, R. Human activity recognition adapted to the type of movement. Comput. Electr. Eng. 2020, 88, 106822.
Gil-Martín, M.; Johnston, W.; San-Segundo, R.; Caulfield, B. Scoring Performance on the Y-Balance Test Using a Deep Learning Approach. Sensors 2021, 21, 7110.
Min, X.; Sun, S.; Wang, H.; Zhang, X.; Li, C.; Zhang, X. Motion Capture Research: 3D Human Pose Recovery Based on RGB Video Sequences. Appl. Sci. 2019, 9, 3613.
Yan, Y.; Xu, J.; Ni, B.; Zhang, W.; Yang, X. Skeleton-Aided Articulated Motion Generation. In Proceedings of the 2017 ACM Multimedia Conference (MM’17), Mountain View, CA, USA, 23–27 October 2017; pp. 199–207.
Shlizerman, E.; Dery, L.; Schoen, H.; Kemelmacher-Shlizerman, I. Audio to Body Dynamics. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7574–7583.
Guo, C.; Zuo, X.; Wang, S.; Zou, S.; Sun, Q.; Deng, A.; Gong, M.; Cheng, L. Action2Motion: Conditioned Generation of 3D Human Motions. In Proceedings of the 28th Acm International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2021–2029.
Kapoor, P.; Mukhopadhyay, R.; Hegde, S.B.; Namboodiri, V.; Jawahar, C. Towards Automatic Speech to Sign Language Generation. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 3700–3704.
Ahuja, C.; Morency, L.-P. Language2Pose: Natural Language Grounded Pose Forecasting. In Proceedings of the 7th International Conference on 3D Vision (3DV), Quebec, QC, Canada, 15–18 September 2019; pp. 719–728.
Switonski, A.; Josinski, H.; Wojciechowski, K. Dynamic time warping in classification and selection of motion capture data. Multidimens. Syst. Signal Process. 2018, 30, 1437–1468.
Yu, X.; Xiong, S. A Dynamic Time Warping Based Algorithm to Evaluate Kinect-Enabled Home-Based Physical Rehabilitation Exercises for Older People. Sensors 2019, 19, 2882.
Voas, J.G. What is the best automated metric for text to motion generation? arXiv 2023, arXiv:2309.10248.
Eunice, J.; J, A.; Sei, Y.; Hemanth, D.J. Sign2Pose: A Pose-Based Approach for Gloss Prediction Using a Transformer Model. Sensors 2023, 23, 2853.
Li, D.; Opazo, C.R.; Yu, X.; Li, H. Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 1448–1458.
Dhulipala, S.; Adedoyin, F.F.; Bruno, A. Sign and Human Action Detection Using Deep Learning. J. Imaging 2022, 8, 192.
Ghosh, A.; Cheema, N.; Oguz, C.; Theobalt, C.; Slusallek, P. Synthesis of Compositional Animations from Textual Descriptions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; pp. 1376–1386.
Kim, J.H.; Hwang, E.J.; Cho, S.; Lee, D.H.; Park, J.C. Sign Language Production with Avatar Layering: A Critical Use Case over Rare Words. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 1519–1528. Available online: https://aclanthology.org/2022.lrec-1.163 (accessed on 12 November 2023).
Krishna, S.; Vignesh, P.V.; Babu, J.D.; Soc, I.C. SignPose: Sign Language Animation Through 3D Pose Lifting. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 2640–2649.
Gil-Martín, M.; Villa-Monedero, M.; Pomirski, A.; Sáez-Trigueros, D.; San-Segundo, R. Sign Language Motion Generation from Sign Characteristics. Sensors 2023, 23, 9365. https://doi.org/10.3390/s23239365

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.

Upload a video for this entry

Information

Subjects: Telecommunications

Contributors MDPI registered users' name will be linked to their SciProfiles pages. To register with us, please refer to https://encyclopedia.pub/register :

Manuel Gil-Martín

María Villa-Monedero

Andrzej Pomirski

Daniel Sáez-Trigueros

Rubén San-Segundo

View Times: 197

Update Date: 18 Dec 2023

Table of Contents

Video Upload Options

Confirm