Arabic is the fourth most spoken language in the world and is used by over 380 million people as a first language. In 2001, the Arab Federation of the Deaf officially declared Arabic SL the main language for deaf and hard-of-hearing people in Arab countries. Nevertheless, Arabic SL is still in its infancy, even though Arabic itself is among the most widely spoken languages. The most pervasive issue facing Arabic SL users is diglossia: each country has regional spoken dialects that differ from the written language, and these dialects have given rise to distinct Arabic SLs, as numerous as the Arab states, though all share the same alphabet and a small common vocabulary. Arabic is also one of the more sophisticated and appealing languages, and its intellectual and semantic homogeneity remains tenable
[8]. The ability of neural networks (NNs) to facilitate the recognition of Arabic SL hand gestures was the main concern of the authors
[10]. The main aim was to illustrate the recognition of different types of static and dynamic gestures by detecting actual human movements. First, it was shown how different architectures, with fully and partially recurrent topologies, can be built by combining feed-forward and recurrent neural networks
[10]. The experimental evaluations showed a 95% precision rate for the detection of static gestures, which inspired the authors to further explore the proposed structure. The automated detection of Arabic SL alphabets using an image-based approach was highlighted in
[11]. In particular, several visual descriptors were investigated in order to build an accurate recognizer for the Arabic SL alphabet. The extracted descriptors were fed into a one-versus-all support vector machine (SVM), and the results demonstrated that the Histogram of Oriented Gradients (HOG) descriptor combined with the one-versus-all SVM achieved promising performance.
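For illustration, a minimal sketch of such a HOG-plus-one-versus-all-SVM pipeline is given below, using scikit-image and scikit-learn; the image size, HOG parameters, and 28-class setup are assumptions for the example, not the configuration reported in [11].

```python
import numpy as np
from skimage.feature import hog
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def extract_hog(images):
    """Compute a HOG descriptor for each grayscale hand image."""
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8),
            cells_per_block=(2, 2), block_norm='L2-Hys')
        for img in images
    ])

# Placeholder data: 280 random 64x64 "hand images" over 28 letter classes.
rng = np.random.default_rng(0)
X_train = rng.random((280, 64, 64))
y_train = rng.integers(0, 28, size=280)

# One binary SVM per letter class (one-versus-all) on HOG descriptors.
clf = OneVsRestClassifier(LinearSVC(C=1.0))
clf.fit(extract_hog(X_train), y_train)
predicted_letters = clf.predict(extract_hog(X_train[:5]))
```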
The Kinect sensor was used in [12] to develop a real-time automatic Arabic SL recognition system based on the Dynamic Time Warping (DTW) matching approach; the system does not require power or data gloves.
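DTW aligns two gesture sequences of different lengths by warping them onto a common time axis and accumulating frame-to-frame distances. The sketch below is the generic textbook formulation with assumed feature dimensions, not the implementation of [12].

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic Time Warping distance between two sequences of
    per-frame feature vectors (e.g., skeletal joint coordinates)."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # Extend the cheapest of the three allowed warping moves.
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# A query gesture is assigned the label of its nearest stored template.
template = np.random.rand(40, 6)   # 40 frames, 6 assumed features
query = np.random.rand(55, 6)      # different length is fine for DTW
print(dtw_distance(query, template))
```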
Different aspects of human–computer interaction (HCI) were covered in a few other studies [13]. Studies from 2011 that identify Arabic SL with an accuracy of up to 82.22%
[14][15] show that hidden Markov models (HMMs) are at the center of alternative methods for SL recognition. Some other works using HMMs can be found in [16].
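A common HMM-based scheme fits one model per sign and labels a new gesture by maximum likelihood. The sketch below uses the hmmlearn library with an assumed state count and feature layout; it is a generic instance of the approach, not the setup of [14][15][16].

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_sign_models(sequences_per_sign, n_states=5):
    """Fit one Gaussian HMM per sign from its training sequences,
    where each sequence is a (frames x features) array."""
    models = {}
    for sign, seqs in sequences_per_sign.items():
        X = np.vstack(seqs)               # all frames stacked
        lengths = [len(s) for s in seqs]  # sequence boundaries
        models[sign] = GaussianHMM(n_components=n_states).fit(X, lengths)
    return models

def classify(models, seq):
    """Return the sign whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda sign: models[sign].score(seq))
```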
A five-stage approach for an Arabic SL translator with an efficiency of 91.3% was published at the same time in
[16], which relies on background subtraction and on features that are invariant to translation, scale, and partial rotation. Almasre and Al-Nuaim recognized 28 Arabic SL gestures using specialized detectors such as the Microsoft Kinect and Leap Motion controllers. More recent studies have focused on understanding Arabic SL
[17]. An image representation encoding the height, width, and depth of the scene elements was used to build and train several CNNs. The frame rate of the depth footage determines how much data the CNN must interpret and, in turn, how large the system becomes: higher frame rates capture more detail, while lower frame rates capture less depth information. Furthermore, a new method for Arabic SL recognition was proposed in 2019 using a CNN to identify the 28 letters of the Arabic alphabet and the digits from 0 to 10
[18]. The proposed seven-layer architecture was trained repeatedly under numerous training and testing permutations, with the highest reported accuracy being 90.02% when 80% of the images were used for training. Finally, the researchers showed why the proposed paradigm outperformed alternative strategies. Among deep neural networks, CNNs have primarily been utilized in computer-vision-based methods, which generally take the collected images of a motion and extract its important features in order to identify it. Multimedia systems, emotion recognition, image and semantic segmentation, super-resolution, and other problems have all been addressed using this technology [19][20][21].
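As an illustration of this family of classifiers, a compact CNN for Arabic SL letters can be sketched in Keras as follows; the layer sizes, input resolution, and class count are assumptions for the example and do not reproduce the exact seven-layer architecture of [18].

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 28  # assumed: one class per Arabic SL letter

model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),         # grayscale hand image
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),                      # regularization
    layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Mirroring the split in [18], 80% of the images would be used
# for training: model.fit(x_train, y_train, epochs=20)
```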
Oyedotun et al. employed a CNN and a stacked denoising autoencoder to identify 24 American SL gestures
[22]. Pigou et al.
[23], on the other hand, recommended the use of a CNN for Italian SL recognition
[24]. Another study
[25] presents a remarkable CNN model that automatically recognizes numbers from hand gestures and reports the results in Bangla. This model is used in the current investigation
[25]. In a related work
[24][25], a convolutional recurrent neural network (CRNN) module is used to estimate hand posture. Moreover,
[26] recommends using a deep learning model to learn the distinguishing features in large datasets and applying transfer learning to data collected from different individuals.
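A typical transfer learning recipe of this kind freezes a backbone pretrained on a large dataset and retrains only a new classification head on the gesture data of new individuals. The Keras sketch below uses a MobileNetV2 backbone; the backbone choice, input size, and class count are assumptions, not details from [26].

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# ImageNet-pretrained backbone, frozen so that only the new head
# adapts to gestures collected from different individuals.
base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights='imagenet')
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(28, activation='softmax'),  # assumed class count
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```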
In [27], a Bernoulli heat map based on a deep CNN was constructed to estimate head posture. Another study used separable 3D convolutional networks to recognize dynamic hand gestures. Another article
[28] addressed wearable hand gesture recognition using flexible strain sensors, one of the most recent studies on this topic. The authors of
[29] presented the most recent hand gesture recognition work based on a deformable CNN. Another recent effort proposed for HCI uses fingertip detection for hand gesture recognition
[30]. A small neural network is used to recognize hand gestures
[31]. Learning geometric features
[32] is another way to understand hand gestures. In
[33], the K-nearest neighbor (KNN) method, combined with statistical feature extraction, provides a reliable recognition system for Arabic SL at the character level. Tubaiz's method has a number of weaknesses, however; the biggest one is that users have to wear instrumented gloves to capture the subtleties of a particular gesture, which is often very uncomfortable for the user.
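A minimal sketch of statistical feature extraction followed by a KNN classifier is given below; the sensor-window shapes and the chosen statistics are assumptions for the example, not the features of [33].

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def stats_features(window):
    """Per-channel statistics (mean, std, min, max) over one
    gesture window of shape (frames, channels)."""
    return np.concatenate([window.mean(0), window.std(0),
                           window.min(0), window.max(0)])

# Placeholder sensor data: 100 gestures, 50 frames, 10 channels.
rng = np.random.default_rng(1)
windows = rng.random((100, 50, 10))
labels = rng.integers(0, 28, size=100)

X = np.array([stats_features(w) for w in windows])
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(knn.predict(X[:3]))   # predicted labels via 3-neighbor majority vote
```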
In [34], the researchers proposed an instrumented-glove system that uses hidden Markov models and spatiotemporal features for the continuous recognition of Arabic SL. The authors of
[35] advocated using a multiscale network for hand pose estimation. Similarly, ref.
[36] investigated text translation from Arabic SL for use on portable devices. It is reported in
[37] that Arabic SL can be automatically identified using sensor- and image-based approaches. In
[38], the authors provide a programmable framework for Arabic SL hand gesture recognition using two depth cameras and two Microsoft-Kinect-based machine learning algorithms. The CNN approach now being applied to Arabic SL also remains unmatched
[39].
In addition to the above approaches, a region-based CNN (R-CNN) has also been explored for sign language recognition. For instance, various pre-trained backbone models have been evaluated with the R-CNN, which works reliably across numerous background scenes [40].
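To illustrate the region-based detection step, the sketch below runs a pretrained Faster R-CNN from torchvision over a single frame; in practice, the box predictor would be replaced with one sized for the sign classes before fine-tuning. The model choice and frame size are assumptions and do not reproduce the setups of [40].

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# COCO-pretrained detector; region proposals let it localize hands
# even in cluttered background scenes.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = [torch.rand(3, 480, 640)]        # one placeholder RGB frame
with torch.no_grad():
    detections = model(frame)[0]         # dict: boxes, labels, scores
print(detections["boxes"].shape, detections["scores"][:5])
```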
Next, in the case of low-resolution images, the authors of
[41] used a CNN to extract more salient features, followed by an SVM classifier trained with a triplet loss.
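A triplet loss pulls embeddings of the same sign together and pushes different signs apart, after which a classifier such as an SVM can be trained on the embeddings. The PyTorch sketch below is a generic instance of this idea; the network, image resolution, and batch layout are assumptions, not the model of [41].

```python
import torch
import torch.nn as nn

# Small embedding CNN for low-resolution (32x32) grayscale images.
embed = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 8 * 8, 64),
)
triplet = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(embed.parameters(), lr=1e-3)

# One illustrative step on placeholder batches.
anchor = torch.randn(8, 1, 32, 32)    # samples of sign A
positive = torch.randn(8, 1, 32, 32)  # same sign as the anchor
negative = torch.randn(8, 1, 32, 32)  # a different sign
loss = triplet(embed(anchor), embed(positive), embed(negative))
opt.zero_grad(); loss.backward(); opt.step()
# The learned embeddings would then be fed to the SVM classifier.
```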
Similarly, to overcome the issue of computational complexity, ref. [42] proposed a lightweight model for real-time sign language recognition, which obtained excellent performance on the test data. However, these models achieve good classification accuracy on small datasets but limited performance on large-scale datasets. To tackle such issues, a deep CNN was developed that was trained on massive numbers of samples and improved the recognition scores
[43]. This work is further enhanced in
[44], where a novel deep CNN architecture is designed that obtained an outstanding semantic recognition score. In addition, to address the class-imbalance problem, the authors of
[45] developed a DL model followed by a synthetic minority oversampling technique (SMOTE; sketched below), which yielded better performance but at the cost of a large number of parameters and a large model size. Therefore, it is highly desirable to develop an image-based intelligent system for Arabic hand sign recognition using a novel CNN architecture.
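For reference, the oversampling step mentioned above can be sketched with the SMOTE implementation from the imbalanced-learn library; the feature dimensions and class counts are placeholder assumptions, not the data of [45].

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Placeholder imbalanced training set: one dominant sign class.
rng = np.random.default_rng(2)
X = rng.random((300, 128))                      # e.g., extracted features
y = np.array([0] * 250 + [1] * 30 + [2] * 20)   # skewed class labels

# SMOTE synthesizes minority-class samples by interpolating between
# nearest neighbors, balancing the set before the DL model is trained.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_bal))   # all classes now equally represented
```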