Indian Sign Language Recognition of Image and Video
Contributor: VARANASI USHA BALA

Communication is key for any human being, and for the small percentage of people who have hearing or speech impairments, the only way to communicate is sign language. Sign language is a combination of hand movements, facial expressions and body gestures. Deaf and mute people have no problem communicating among themselves, but most hearing people have no interest in learning sign language, which creates a communication barrier. The main objective of this model is to use machine learning and deep learning techniques to address this problem with a system that is fully functional, feasible, reliable and easy to use. The biggest disadvantage of existing systems is that they can only recognize gestures for alphabets and numbers, not the complete words used in real life. Our proposed work takes image/video input and uses a trained machine learning model to predict the signs and gestures of sign language. We tried to recognize not only gestures but also signs, which are relatively hard and unique to recognize. We generate English words corresponding to the signs shown by the user, and the generated words can be further used to form proper English sentences. The same procedure can be used to train the proposed work to identify the signs of any sign language.

  • sign language
  • facial expressions
  • signs
  • gestures
  • hand movements
  • body gestures
  • Introduction

Everyone in this world communicates with each other by voice, but there is a small percentage of people who are unable to communicate with everyone because of a hearing disability, and millions of people suffer because of it. Sign language is not included in any standard study curriculum, which causes great difficulty for hearing-impaired people. Deaf and mute people have no problem communicating among themselves, but they may face problems while communicating with hearing people. Sign language is expressed in the form of hand gestures along with non-verbal expressions. It is not the same as verbal communication, and hence sign language is difficult for hearing people to understand unless they are trained. Even trained people face difficulty understanding sign languages because they vary widely depending on the geographical location. With the technology we have, we are yet to overcome this communication barrier.

Communication in sign language mainly takes place in two forms: (1) gesture-based signs, which express whole words at once using a unique series of gestures, and (2) fingerspelling-based gestures, which use the alphabet to form sentences; fingerspelling is more time consuming but easier to learn because of the limited alphabet in the language. Existing systems can be divided into two main categories: (1) hardware based and (2) software based. Most mute people cannot afford hardware-based solutions, which are costly and complex to operate. On the other hand, existing software-based solutions simply identify fingerspellings, which is time consuming and less accurate; in addition, the user needs to compose a grammatically correct sentence to produce an understandable sentence, which takes even more time.

Our proposed work addresses these problems with gesture-based sign language detection, which is more feasible, fast and easy to use.

 

  • Background Study
    • Objective

The main objective is to recognize the signs of sign language. In existing systems this is done using sensor gloves, a hardware-based approach in which the user needs to wear the gloves all the time, which is impractical. When the user makes signs without wearing the gloves, the system does not recognize the corresponding signs and gestures.

A variety of software algorithms have been implemented to recognize signs and gestures. Among them, algorithms such as CNN, RNN, HMM and LSTM are used to recognize signs and gestures. Other implementations train machine learning and deep learning models to recognize signs and gestures only from image input, which is a lacuna of the existing systems. Our proposed work extends this idea by taking both image and video input. This serves as a better solution to the problems faced by hearing-impaired people.

 

2.1.1       Sensor gloves

 

Sensor gloves are synthetic gloves with in-built sensors that detect the position of the hands, fingers and knuckles. The positions of the hands, fingers and knuckles are plotted in 3D to obtain a configuration. These configurations map to already defined signs and are converted accordingly.

They are highly accurate because of their static lookup nature: the system directly maps the identified configuration to an already configured gesture. They are easy to use, i.e., no additional training is required; once the gloves are worn, the user can make gestures as they regularly would without gloves. The implementation does not involve any machine learning algorithms.

The use of physical hardware makes the system expensive, and as a result only a very small part of the population can afford it, making it highly infeasible. Carrying the gloves everywhere and wearing them all the time makes users feel self-conscious and uncomfortable, and considering only the hand gestures leaves no room to capture the context of the expression.

 

  

2.1.2       Software Algorithms

 

Software algorithms collectively refers to machine learning and artificial intelligence algorithms that use computer vision and gesture recognition to translate sign language. Different models are trained on different datasets, enabling the algorithms to recognize the gestures of sign language.

Due to their low implementation cost they are usually affordable and economical, and they can offer high accuracy if designed and trained correctly. They can be easily deployed and scaled, making them available and reliable. However, the sign space being very large results in considerably long training times, and achieving a high accuracy rate may be difficult because a major part of the system is uncertain.

 

2.2          Algorithms Explored

 

2.2.1       Convolution Neural Networks

 

A Convolutional Neural Network (CNN) is a deep learning architecture with multiple layers of convolution and max pooling that convolve over an image and extract its features. These features can then be used to train a machine learning model to classify future input images.

 

2.2.2       Optical Flow

 

Optical flow provides a concise description of both the regions of the image undergoing motion and the velocity of motion. In practice, computation of optical flow is susceptible to noise and illumination changes.
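As an illustration (the entry does not state which optical-flow method is used), a dense optical-flow field between two consecutive frames can be computed with OpenCV's Farneback algorithm; the file names below are placeholders:

import cv2

# Two consecutive grayscale frames (placeholder file names)
prev = cv2.imread("frame_0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_1.png", cv2.IMREAD_GRAYSCALE)

# Each pixel of `flow` holds the (dx, dy) displacement between the two frames.
# Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
moving = magnitude > 1.0   # crude threshold separating moving regions from background
print("fraction of pixels in motion:", moving.mean())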

 

2.2.3       Key frame Extraction

 

Key frames are the frames that alone can define the content of the video. We follow a map-like structure and identify the gesture based on the extracted key frames.
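The entry does not describe the extraction rule in detail; a minimal key-frame sketch under a simple frame-differencing assumption could look like this:

import cv2

def extract_key_frames(video_path, diff_threshold=30.0):
    """Keep a frame only when it differs enough from the last kept frame."""
    cap = cv2.VideoCapture(video_path)
    key_frames, last_kept = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Mean absolute pixel difference against the previously kept frame
        if last_kept is None or cv2.absdiff(gray, last_kept).mean() > diff_threshold:
            key_frames.append(frame)
            last_kept = gray
    cap.release()
    return key_frames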

 

2.2.4       Long short-term memory

 

Long short-term memory (LSTM) is a more advanced version of the recurrent neural network (RNN), which is mainly used to recognize patterns in sequential data. LSTM is therefore well suited to identifying signs from a live video stream.

 

 

  • Proposed Method

The proposed system consists of several modules that work together to translate sign language image and video inputs into English words. The detailed architecture and methodology are discussed below.

The following are the steps of the architecture of our proposed work:

  • Dataset Acquisition
  • Training the Dataset
  • Incorporating the trained dataset with machine learning models
  • Measuring the performance and accuracy of the proposed model
  • Generating the desired output – English words

 

The video is taken as input and each frame is extracted from it. The video filter and processor labels the hand gestures in the frames. These labelled frames are then sent to the learning model, which is trained on this data. After the gestures are recognized, they are sent to the sentence generator. The sentence generator takes the recognized text and phrases the words into a sentence so that the output is understandable to the user. The output of the sentence generator is shown as an understandable sentence.

Fig 3a.  System Architecture

 

Video Input 
Video is recorded in real time, captured by the computer system's webcam. This module takes the video, generates frames and passes them to the video filter.
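A minimal sketch of this capture step with OpenCV; the `process_frame` stage is a placeholder for the video filter described next:

import cv2

def process_frame(frame):
    pass   # placeholder for the video filter / processor stage

cap = cv2.VideoCapture(0)                     # 0 = default system webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    process_frame(frame)                      # hand the frame to the next stage
    cv2.imshow("video input", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):     # press 'q' to stop capturing
        break
cap.release()
cv2.destroyAllWindows()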

Video filter and processor

The video filter takes the generated frames and identifies the objects and extracts the areas of interest from each frame. We use object detection algorithms to identify hands and faces.
Trained machine learning model

A trained machine learning model recognises the signs from the areas of interest and also predicts the context based on the facial expressions. The model generates English words corresponding to the signs in the video.

 

 

 

 

 

 

Sentence generator 

The sentence generator is a trained natural language processing (NLP) algorithm used to form properly framed English sentences. It takes the words generated by the ML model as input and uses them as keywords to generate English sentences.
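The entry does not specify which NLP technique is used; as a purely illustrative stand-in, a simple keyword-to-sentence mapping could look like this (the template table and fallback rule are hypothetical):

TEMPLATES = {                      # hypothetical keyword-to-sentence templates
    ("how", "you"): "How are you?",
    ("i", "fine"): "I am fine.",
}

def generate_sentence(keywords):
    key = tuple(word.lower() for word in keywords)
    # Fall back to joining the keywords when no template matches
    return TEMPLATES.get(key, " ".join(keywords).capitalize() + ".")

print(generate_sentence(["How", "you"]))   # -> "How are you?"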

Text Output

The text generated by the sentence generator is shown as the output and read out loud through the device speaker.
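The entry does not name a text-to-speech library; one possible way to read the generated sentence aloud is the offline pyttsx3 package:

import pyttsx3

engine = pyttsx3.init()          # initialise the platform's default speech engine
engine.say("How are you?")       # sentence produced by the sentence generator
engine.runAndWait()              # block until the speech has been spoken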

  • Dataset Acquisition

1. Initially we worked on the alphabets and numbers of Indian Sign Language to get a better understanding of how the existing systems work. For this we used a dataset from Kaggle.

The dataset contains only the right-hand perspective of each sign, so we also trained the model on mirror images of the original dataset so that signs shown with the left hand are recognized as well (a minimal sketch of this mirroring step is given after Fig 3.1.3 below).

Fig 3.1.1.  Alphabets in Indian Sign Language

Fig 3.1.2. Numbers in Indian Sign Language

 

Fig 3.1.3. Sample of right, left images used for training
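A minimal sketch of the mirroring augmentation described above; the dataset folder name is a placeholder:

import os
import cv2

DATASET_DIR = "isl_alphabet"          # placeholder path to the Kaggle image folders

for root, _, files in os.walk(DATASET_DIR):
    for name in files:
        if not name.lower().endswith((".png", ".jpg", ".jpeg")):
            continue
        path = os.path.join(root, name)
        image = cv2.imread(path)
        mirrored = cv2.flip(image, 1)               # flip around the vertical axis
        cv2.imwrite(os.path.join(root, "mirror_" + name), mirrored)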

 

2. Then we moved on to sign identification from live video. For this we captured our own custom dataset from the laptop's webcam using OpenCV.

We used MediaPipe to capture all the anchor points of the face and hands and trained the model using this data (a landmark-extraction sketch is given after Fig 4 below).

 

Fig 4. Landmarks identified by MediaPipe
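A sketch of the landmark-extraction step with MediaPipe Holistic; the flattening order and the use of x, y, z coordinates only are assumptions:

import cv2
import numpy as np
import mediapipe as mp

holistic = mp.solutions.holistic.Holistic(min_detection_confidence=0.5,
                                          min_tracking_confidence=0.5)

def frame_to_keypoints(frame_bgr):
    """Return one flat vector of pose, face and hand landmark coordinates for a frame."""
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))

    def flatten(landmarks, count):
        if landmarks is None:                       # body part not visible in this frame
            return np.zeros(count * 3)
        return np.array([[p.x, p.y, p.z] for p in landmarks.landmark]).flatten()

    return np.concatenate([
        flatten(results.pose_landmarks, 33),        # 33 pose landmarks
        flatten(results.face_landmarks, 468),       # 468 face mesh landmarks
        flatten(results.left_hand_landmarks, 21),   # 21 landmarks per hand
        flatten(results.right_hand_landmarks, 21),
    ])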

 

3.1.1                       Importance of  Facial Expressions

 

Sign language is a combination of hand movements, body gestures and facial expressions, among which facial expressions play a major role in determining the emotion of the speaker; sometimes the meaning may change completely based on the facial expression of the speaker. The example given below illustrates this scenario.

                                                    

Fig 3.1.1.1. ISL signs: I am fine (left), How are you (right)

The above image shows the ISL representation of the signs "I am fine" and "How are you" respectively. Both signs are the same; the difference is the facial expression of the speaker. In the former sign (I am fine) the facial expression of the speaker is happy, conveying wellness; in the latter sign (How are you) the facial expression is more like asking a question.

Training the dataset, incorporating the trained dataset with machine learning models, measuring the performance and accuracy of the proposed model, and generating the desired output (English words) are explained in the methodology.

3.4       Methodology

The system contains several modules. All modules are listed and their functionality is explained below.

 

Alphabet Recognition

 

With the help of modern technology and computer vision we are able to take images and videos as input and use supervised models to recognise the alphabets. Using the well-known convolutional neural network (CNN) approach, we train on a dataset of thousands of images covering various use-case scenarios. Alphabet recognition is the phase in which we predict the alphabets of Indian Sign Language by recognising the hand signs in the images; this module is modular, making it reusable in the further development of the system. The module takes frames or images as input; in the future scope of the system these frames are extracted from video footage, and every frame carries the essential information needed to recognise the pattern, which further helps the model translate the sign into an English-language alphabet. The module detects the hands in the image apart from the background, making it easier for the system to recognise the hand. The hand outline module extracts the hand features from the image and converts it into a black-and-white image. This processed image is then fed to the supervised model, which takes advantage of the processed form of the image to predict the alphabet more accurately. Hand outline detection is discussed further below, as it is one of the essential features that lets the system accurately predict the sign given by the user.
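The exact hand-outline filter is not specified in the entry; under the assumption of a simple grayscale, blur and threshold pipeline, a minimal preprocessing sketch could look like this:

import cv2

def hand_outline(image_bgr):
    """Convert a colour image to a black-and-white mask separating hand from background."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # Otsu's method picks the threshold automatically
    _, mask = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.resize(mask, (128, 128))   # match the 128 x 128 CNN input size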

 

Training Phase (Alphabet)

 

In the training phase, we gather thousands of images for our various classes, which include alphabets and numbers under different possible use-case scenarios. The training phase takes the training dataset and trains the model by iterating through it in a supervised manner; with each epoch the accuracy of the model changes, sometimes increasing and sometimes decreasing. This accuracy is the essential measure of training, as it determines the accuracy of the supervised trained model. All the images in the dataset are iterated over and the accuracy varies throughout; since we have chosen a supervised method, the dataset holds not only the training data but also the label for each sample, so that the system can validate its recognition of the sign or gesture on each iteration over the dataset. The training phase is essential for our system because we perform an accuracy analysis of the machine learning models and then choose the model with the highest satisfactory accuracy rate compared to the others; this model is the final model chosen for the system.

 

 

Recognition Phase(Alphabet)

 

The recognition phase uses the trained supervised model to recognise the gestures in the images/frames of the video footage. This phase takes live video as input, where the length of the video can vary; the system extracts the frames needed to recognise the sign language from the footage. These frames are then sent to a module that detects the hands in the frames and separates them from the background. This step is essential and advantageous, as the processed image/frame yields a higher recognition rate. The images processed by the hand outline detection module are then sent to the supervised model, which uses the training data from the training phase, checks for patterns and other features relevant to recognition, and predicts/recognises the sign or gesture. The recognised gestures are shown in the graphical user interface (GUI), a quality-of-life feature that makes the system user-friendly. The recognised gestures are not only shown live but also stored under a history panel where the user can revisit past recognised words; the recognised words are stored in English, either as sentences or as single words, and only temporarily, until the session ends. The user can give input in either image or video format; the system can take both types of input and recognise the sign language in both cases.

 

Training Phase (ISL SIGNS)

 

In this training phase we train the LSTM using video data of the selected signs (multiple instances of each sign) and train it on all the recorded signs. We used the MediaPipe module to extract anchor points such as face landmarks, hand landmarks and pose landmarks. We build an array containing all these points and train the model on the change of the coordinates of each landmark from frame to frame.
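The entry does not give the exact array layout; as an illustration of the shapes involved (30 frames per instance, 3 signs per model and 30 instances per sign as in the Results section, and the 1629-value keypoint vector from the MediaPipe sketch above), the training arrays could be arranged as follows, with random numbers standing in for real keypoints:

import numpy as np

NUM_SIGNS = 3            # signs trained per model (see Section 4.3)
INSTANCES_PER_SIGN = 30  # recorded instances of each sign (see Section 4.3)
SEQ_LEN = 30             # frames kept from each recorded instance
FEATURES = 1629          # (33 pose + 468 face + 2 x 21 hand landmarks) x 3 coordinates

# Random numbers stand in for real frame_to_keypoints() output from the earlier sketch.
X = np.random.rand(NUM_SIGNS * INSTANCES_PER_SIGN, SEQ_LEN, FEATURES).astype("float32")
y = np.repeat(np.arange(NUM_SIGNS), INSTANCES_PER_SIGN)    # one integer label per sequence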

 

Recognition Phase (ISL SIGNS)

 

The LSTM model trained on the signs is able to identify the signs shown to the camera. The user shows the signs to the camera, the captured video frames are sent to the model sequentially, and the model outputs the probability of the recorded sign being each of the trained signs. From this output we decide which sign the user has shown.
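A sketch of one possible recognition loop (an illustration, not the authors' exact code): keep a sliding window of the last 30 frames' keypoints and take the most probable class from the LSTM output. `frame_to_keypoints` is the helper from the MediaPipe sketch above and `model` is the trained LSTM sketched in Section 3.5.2; the sign names are hypothetical.

import numpy as np

SIGNS = ["hello", "thanks", "how_are_you"]   # hypothetical sign names
window = []                                  # keypoints of the most recent frames

def recognise(frame_bgr):
    window.append(frame_to_keypoints(frame_bgr))   # from the MediaPipe sketch above
    if len(window) > 30:
        window.pop(0)                              # keep only the last 30 frames
    if len(window) < 30:
        return None, 0.0                           # not enough frames yet
    probs = model.predict(np.expand_dims(window, axis=0))[0]   # trained LSTM (Section 3.5.2)
    return SIGNS[int(np.argmax(probs))], float(np.max(probs))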

3.5          Architecture of Deep learning Models

Instead of using pre-trained models to recognize Indian Sign Language, we built our own custom models and trained them on the selected datasets.

 

3.5.1       CNN-Architecture

 

We used the CNN algorithm to recognize the gestures (alphabets and numbers) in image format. There are already many pre-trained models such as LeNet, DenseNet and AlexNet with which we could train a model much faster, but we decided to use our own model to have more flexibility.

The image below shows the architecture of the CNN model we used to extract features from the images. All the images in the dataset are 128 x 128, so the input layer is 128 x 128, and we used only grayscale images to reduce computational time. After all the features are extracted, we use a dense network to classify the images into 35 classes (A-Z (26) + 1-9 (9)). An illustrative model sketch is given after Fig 7 below.

Fig 7. CNN Architecture
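As an illustration only, a minimal Keras model matching the stated constraints (128 x 128 grayscale input, stacked convolution and max-pooling layers, a dense classifier over 35 classes) might look like the following; the number of convolution blocks and the filter counts are assumptions:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),                 # 128 x 128 grayscale input
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(35, activation="softmax"),            # 26 letters + 9 digits
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()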

 

 

 

3.5.2                       LSTM-Architecture

 

LSTM is mainly used to recognize patterns in sequential data such as audio, text and video. The problem at hand is recognizing the signs from video, so LSTM is the best option to use here.

Fig 8. LSTM Architecture

We start by stacking multiple layers of LSTM units for pattern recognition, followed by dense layers for classification.
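A minimal Keras sketch of such a stacked LSTM plus dense classifier (the unit counts are assumptions); the input shape matches the keypoint sequences built in the training-phase sketch (30 frames x 1629 landmark coordinates, 3 sign classes):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(30, 1629)),                    # 30 frames of flattened landmarks
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(64),                                   # last LSTM layer returns one vector
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),             # one output per trained sign
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X, y, epochs=200) would train on the sequences from the training-phase sketch
model.summary()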

  • Results

We trained different models on the Alphabet data and one on the Sign data.

 

4.1          CNN – DENSE (Alphabet)

A dense network was used for classification of the extracted features.

 

4.2          CNN –SVM

The results below show the accuracy rates of the model that used a support vector machine (SVM) for classification of the extracted features.

The table below shows the different models we obtained by tweaking different parts of our algorithms.

 

4.3          LSTM

We tried to include many more signs in one model, but as the number of signs increases the system becomes slower, and with only a few data instances per sign (30) the accuracy decreases considerably for similar signs within the same model. So we settled for 3 signs per model and prepared multiple models. This problem can be addressed by increasing the number of instances of each sign and increasing the length of each sign.

 

4.4          Face Emotion Detection

The addition of face emotion detection to our system made a huge difference in better understanding and translating the sign language. Taking the face data into consideration while recognizing the sign matters greatly, as discussed above.

 

The above images show only the dominant emotions (all emotions are detected along with their probabilities, and only the most likely are mentioned here).

  • Conclusion

The sign language translation system can change the lives of many people across the world. It can be used to talk with mute people and understand them better. It will help communicate with children at a young age and thereby help them get out of loneliness and depression. For adults, this translation system opens up opportunities to those who have been held back until now by the communication gap.

The development of this system helps people who use Indian Sign Language, and by changing the training dataset this model can be used to implement translators for any sign language. We achieved significant accuracies with the little infrastructure we had; with further training we can achieve better accuracies. The addition of facial expression detection based on emotion detection is a very important feature for sign language. We can also use transfer learning to increase the sign space for recognition. Adding sign generation from English sentences to this system would further reduce the communication gap.

[1]

References
