BIM Interpretation Using SSD-MobileNet-V2 FPNLite and COCO mAP

Bahasa Isyarat Malaysia (BIM) is also known as Malaysian Sign Language (MSL). BIM began its journey with the founding of a deaf school in Penang, the Federation School for the Deaf (FSD), in 1954. Studies have revealed that indigenous sign words arose through gestural communication amongst deaf students at the FSD outside their classrooms.

  • Bahasa Isyarat Malaysia (BIM)
  • SSD-MobileNet-V2 FPNLite
  • COCO mAP
  • TensorFlow Lite
  • Android application

1. Introduction

Every normal human being has been granted a precious gift that cannot be replaced: the ability to express themselves by responding to events in their surroundings, observing, listening, and then reacting to circumstances through speech [1]. Unfortunately, some people lack this precious gift, which creates a massive gap between normal human beings and disadvantaged ones [1,2,3,4,5]. Because communication is a necessary element of everyday life, deaf/mute individuals must be able to communicate as normally as possible with others.
Communication is a difficult task for people who have hearing and speech impairments. Hand gestures, which involve the movement of the hands, are used as sign language for natural communication between ordinary people and deaf people, serving the same role that speech does for vocal people [4,6,7,8,9]. Nonetheless, sign languages differ by country and are used for a variety of purposes; examples include American Sign Language (ASL), British Sign Language (BSL), Japanese Sign Language [10,11], and Turkish Sign Language (TSL) [12]. This project focuses on Bahasa Isyarat Malaysia (BIM), also known as Malaysian Sign Language (MSL). BIM began its journey with the founding of a deaf school in Penang, the Federation School for the Deaf (FSD), in 1954. Studies have revealed that indigenous sign words arose through gestural communication amongst deaf students at the FSD outside their classrooms. With the aim of educating deaf students, American Sign Language (ASL) was introduced in Johor in 1964, while Kod Tangan Bahasa Malaysia (KTBM) took hold in Penang in 1978, when Total Communication was introduced into education for deaf students [13]. BIM has been the main form of communication amongst the deaf population in Malaysia since it was first developed [14,15,16].
Communication is a vital aspect of everyday life, and deaf/mute individuals must communicate as normally as possible with others [9]. The inability to speak is considered a problem because such individuals cannot clearly understand the words of normal people and, hence, cannot answer them [17]. This inability to express oneself verbally creates a significant disadvantage and, thus, a communication gap between the deaf/mute community and normal people [1,2,5,14]. The deaf/mute population, or sign language speakers, face challenges integrating socially [4], and they constantly feel helpless because no one understands them and vice versa. This major humanitarian issue requires a specialised solution. Deaf/mute individuals face difficulties connecting with the community [3,18], particularly those who were denied the blessing of hearing before developing spoken language and learning to read and write [3].
Traditionally, the deaf/mute have communicated with normal people through a human interpreter who assists with the conversation. However, this can be challenging: human interpreters are scarce [7], they are not always accessible [19], and hiring them can be expensive. It also makes such persons dependent on interpreters [2]. The process can also be relatively slow, making conversation between deaf/mute and normal people feel unnatural and tiresome, which indirectly discourages engagement in social activities [2]. Furthermore, as previously stated, the deaf/mute communicate through sign language, which others can follow only if they understand it. This creates a challenge when the deaf/mute need to communicate with normal people, who must then be proficient in a sign language that only a minority of people ever learn [19].
In addition, gesture recognition is a challenging undertaking because of the substantial variation in gesture form and meaning across cultures, situations, and individuals. This heterogeneity makes it difficult to create accurate and reliable gesture recognition models. The most important factors affecting gesture recognition are: (i) gestures vary in speed, amplitude, duration, and spatial placement, which makes them difficult to identify consistently [20]; (ii) a gesture can indicate different things depending on the situation, the culture, and the individual's perception; (iii) interference from other modalities: speech, facial expressions, and other nonverbal cues can accompany gestures and affect how they are perceived [21]; (iv) individual variation: differences in gesturing technique influence how accurately recognition models work [20]; (v) the distinctions between spoken languages and sign languages present extra difficulties for sign language recognition [22]; (vi) the unique grammar, syntax, and vocabulary of sign languages make them difficult to translate accurately into written or spoken language; and (vii) regional and cultural variation further complicates sign language recognition.
Undoubtedly, advances in technology, such as smartphones that can be used to make calls or send messages, have significantly improved people's quality of life. This includes the numerous assistive technologies available to the deaf/mute, such as speech-to-text and speech-to-visual technologies and sign language tools, which are portable and simple. Several applications are accessible to normal people; however, each has its own restrictions today [16,23]. Additionally, there is a shortage of good smartphone translation programs that support sign language translation [14] between deaf/mute and normal people. Therefore, despite the tremendous benefits of cutting-edge technologies, deaf/mute and normal people cannot yet fully benefit from them. Moreover, many Malaysians are unfamiliar with BIM, and present platforms for translating sign language are inefficient, highlighting the limited capability of the mobile translation applications currently on the market [16].

2. Interpretation of Bahasa Isyarat Malaysia (BIM) Using SSD-MobileNet-V2 FPNLite and COCO mAP

Bahasa Isyarat Malaysia (BIM), also known as Malaysian Sign Language (MSL), was initially developed in 1998, shortly after the Malaysian Federation of the Deaf was founded.
In [14], a survey of potential users was conducted as the methodology. The target populations were Universiti Tenaga Nasional (UNITEN) students and Bahasa Isyarat Malaysia Facebook Group (BIMMFD) members. The surveys included multiple-choice, open-ended, and dichotomous items. The research demonstrates that the software is considered helpful for society and suggests creating a more user-friendly and accessible way to study and communicate in BIM using this app.

2.1. SSD-MobileNet-V2 FPNLite

SSD-MobileNet-V2 can recognise multiple objects in a single image or frame. The model detects the position of each object in an image and produces the object's name and a bounding box. The pre-trained SSD-MobileNet model can classify ninety different object classes.
Because they eliminate bounding box proposals, Single-Shot Multibox Detector (SSD) models run faster than R-CNN models. Detection speed and model size were the deciding factors in the choice of the SSD-MobileNet-V2 model. As shown in Table 1, the model takes 320 × 320 input images and detects the objects and their locations in 19 milliseconds, whereas the other models require more time; for example, SSD-MobileNet-V1-COCO, the second-fastest model, needs about 30 milliseconds to categorise the objects in a picture, followed by SSD-MobileNet-V2-COCO, the third-fastest model, and so on. SSD-MobileNet-V2 320 × 320 is the most recent MobileNet model for Single-Shot Multibox detection; it is optimised for speed at a very low cost in accuracy, giving up only 0.8 points of mean average precision (mAP) relative to the second-fastest model, SSD-MobileNet-V1-COCO [28].
Table 1. Model comparison [28].
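To make the detection pipeline concrete, the following is a minimal inference sketch in Python rather than the authors' code. The TensorFlow Hub handle below is an assumption, and the input is a dummy 320 × 320 frame; any detector exported with the TF2 Object Detection API exposes the same output keys.

    # Minimal inference sketch (assumed TF Hub handle); not the authors' implementation.
    import numpy as np
    import tensorflow as tf
    import tensorflow_hub as hub

    detector = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/fpnlite_320x320/1")

    # One 320 x 320 RGB frame, batched as uint8, as the detection models expect.
    frame = np.random.randint(0, 255, size=(1, 320, 320, 3), dtype=np.uint8)
    outputs = detector(tf.constant(frame))

    boxes = outputs["detection_boxes"][0].numpy()      # normalised [ymin, xmin, ymax, xmax]
    classes = outputs["detection_classes"][0].numpy()  # COCO label ids
    scores = outputs["detection_scores"][0].numpy()    # confidence per detection

    for box, cls, score in zip(boxes, classes, scores):
        if score > 0.5:  # keep confident detections only
            print(int(cls), float(score), box)

In a BIM interpreter, the COCO label map would be replaced by the custom BIM gesture classes after fine-tuning, but the inference call and the shape of the outputs stay the same.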

2.2. TensorFlow Lite Object Detection

TensorFlow Lite is an open-source deep learning framework created for devices with limited resources, such as mobile devices and Raspberry Pi modules. It enables TensorFlow models to run on mobile, embedded, and Internet of Things (IoT) devices, providing on-device machine learning inference with low latency and a compact binary size. As a result, latency and power consumption are reduced [28].
TensorFlow Lite was explicitly created for edge-based machine learning. It enables various resource-constrained edge devices, such as smartphones, microcontrollers, and other circuits, to run multiple lightweight algorithms [29].
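As an illustration of this on-device workflow, the following Python sketch shows the two standard TensorFlow Lite steps: converting a trained SavedModel into a compact .tflite flat buffer and running it with the lightweight interpreter. It is not the authors' code; "exported_model/saved_model" is a placeholder path, and the sketch assumes the SavedModel was already exported in a TFLite-compatible form (for detection models, via the Object Detection API's TFLite export script).

    import numpy as np
    import tensorflow as tf

    # 1. Convert the SavedModel to a .tflite flat buffer (placeholder path).
    converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/saved_model")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantisation
    tflite_model = converter.convert()
    with open("detector.tflite", "wb") as f:
        f.write(tflite_model)

    # 2. Run inference with the interpreter; the same runtime ships on Android.
    interpreter = tf.lite.Interpreter(model_path="detector.tflite")
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    frame = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])  # dummy frame
    interpreter.set_tensor(input_details[0]["index"], frame)
    interpreter.invoke()
    detections = [interpreter.get_tensor(d["index"]) for d in output_details]

On Android, the same .tflite file is loaded through the TensorFlow Lite runtime, which is why this conversion step is what makes the trained model usable inside the mobile application.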
The TensorFlow Object Detection API is an open-source machine learning tool that is used in many different applications and has recently grown in popularity. The API implicitly assumes that it will be supplied with noise-free or benign datasets. In the real world, however, datasets can contain inaccurate information caused by noise, naturally occurring adversarial objects, adversarial tactics, and other flaws. Therefore, for the API to handle real-world datasets, it needs thorough testing to increase its robustness and capability [30].
Another paper defines object detection as a computer technology, linked to computer vision and image processing, that detects instances of semantic objects of a certain class (such as people, buildings, or cars) in digital images and videos. Well-studied object detection problems include pedestrian detection and face detection.
Many computer vision applications require object detection, such as image retrieval and video surveillance. Deploying this method on an edge device makes tasks such as autonomous driving (autopilot) possible [29].

2.3. MobileNets Architecture and Working Principle

Efficiency is the key to designing a practical deep learning tool that runs with as little computation as possible. Several approaches address efficiency in deep learning, and MobileNet is one of them. MobileNets reduce computation by factorising the convolutions. The architecture is built primarily from depth-wise separable filters: MobileNets factorise a standard convolution into a depth-wise convolution followed by a 1 × 1 convolution (pointwise convolution) [31]. A standard convolution filters and combines inputs into a new set of outputs in one step, whereas a depth-wise separable convolution splits this into a filtering layer and a combining layer, drastically decreasing the computation and model size, as illustrated in the sketch below.
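The following Keras sketch illustrates the factorisation; it is a minimal example rather than the BIM model itself, and the channel counts (32 in, 64 out) are arbitrary. Counting parameters shows the reduction the text describes: roughly 18.5 k weights for the standard 3 × 3 convolution versus roughly 2.4 k for the depth-wise plus pointwise pair.

    import tensorflow as tf

    inputs = tf.keras.Input(shape=(320, 320, 32))  # arbitrary feature map size

    # Standard convolution: filters and combines channels in one step.
    standard = tf.keras.layers.Conv2D(64, kernel_size=3, padding="same")(inputs)

    # Depth-wise separable convolution: a per-channel 3x3 filter (filtering layer)
    # followed by a 1x1 pointwise convolution (combining layer).
    dw = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same")(inputs)
    pw = tf.keras.layers.Conv2D(64, kernel_size=1)(dw)

    print(tf.keras.Model(inputs, standard).count_params())  # ~18.5 k parameters
    print(tf.keras.Model(inputs, pw).count_params())        # ~2.4 k parameters

The same trade-off applies to the multiply-add count, which is why MobileNet-based detectors fit comfortably on a smartphone.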

2.4. Android Speech-to-Text API

Google Voice Recognition (GVR) is a tool with an open API that converts the user's speech into readable text. GVR usually requires an internet connection between the user and the GVR server. It uses neural network algorithms to convert raw audio speech to text and works for several languages [32]. The tool uses two communication threads: the first receives the user's audio speech and sends it to the Google Cloud server, where it is converted into text and stored as strings; the second reads the resulting strings and returns them to the application on the user's device.
Google Cloud Speech-to-Text, also called the Cloud Speech API, is another speech-to-text tool. It offers far more features than the standard Google Speech API; for example, it provides more than 30 voices in multiple languages and variants. However, it is a commercial Google product rather than a free tool, and users must subscribe and pay a fee to use it. Table 2 lists the advantages and disadvantages of these tools, and a minimal usage sketch of the Cloud Speech API follows the table.
Table 2. Advantages and disadvantages of Google Cloud API and Android Speech-to-Text API.
Google Cloud API
  • Advantages: supports 80 different languages; can recognise audio uploaded in the request; returns text results in real time; accurate in noisy environments; works with apps across any device and platform.
  • Disadvantages: not free; requires higher-performance hardware.
Android Speech-to-Text API
  • Advantages: free to use; easy to use; does not require high-performance hardware; easy to develop.
  • Disadvantages: the local language must be passed to convert speech to text; not all devices support offline speech input; it cannot pass an audio file to be recognised; it only works with Android phones.
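For completeness, the following Python sketch shows how the Google Cloud Speech-to-Text API listed in Table 2 is typically called. It is only an illustration, not the authors' implementation: it requires a billed Google Cloud project with credentials configured, "utterance.wav" is a placeholder for a 16 kHz mono recording, and the Malay language code "ms-MY" is an assumption about the spoken language being transcribed.

    from google.cloud import speech

    client = speech.SpeechClient()  # uses the credentials configured in the environment

    with open("utterance.wav", "rb") as f:  # placeholder 16 kHz mono LINEAR16 recording
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="ms-MY",  # assumed target language (Malay)
    )

    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)  # best transcription hypothesis

The free Android Speech-to-Text API, by contrast, is invoked on the phone through the platform's RecognizerIntent/SpeechRecognizer mechanism rather than through a cloud client library, which is what makes it free but Android-only.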

This entry is adapted from the peer-reviewed paper 10.3390/info14060319
