When it is necessary to store or edit a text written by hand in Arabic, this can only be performed manually, which can take significant time. However, fortunately, optical character recognition can be applied for this particular case. Optical character recognition (OCR) is a technique that is used to read and recognize the text present in an image and then convert it into a textual format. Once the text is extracted and digitized, it can utilize the applications of storing, retrieving, searching, and editing. Here, researchers shed light on the challenges accompanying OCR in general, then narrow it down to the case of Arabic OCR.
1. Introduction
Interchangeably, OCR is referred to as “text recognition”. Text recognition has many benefits and applications, especially in the automation of pipelines in libraries and healthcare institutions, the overcoming of learning difficulties, and transportation by auto-plate recognition, leading to reduced human intervention and effort, a promise AI-based solutions still deliver
[1][2][3][4][5][6].
The text to be recognized can be divided into one of two types: handwritten or printed. Handwritten-text recognition (HTR) tasks can themselves be divided into two types: online and offline. Offline handwritten-text recognition is the recognition of text that is not produced by computer devices and is not produced at the time of recognition but, rather, is usually found in still documents, records, and images. On the other hand, online text recognition focuses on text produced by computer-based devices, such as digital pens and tablets, rather than being extracted from documents
[7][8][9][10][11].
Arabic, the language of the Quran, is acknowledged as the lingua franca in 25 countries and is considered the sixth most commonly spoken native language around the world. It is used by more than one billion Muslims around the globe and is spoken by more than 274 million people
[12]. Its script’s morphological, lexical, and semantic richness make it one of the most challenging languages computationally
[13]. Specifically, the Arabic writing system is from right to left and consists of 28 letters, each of which can take multiple forms, depending on its position in the word: alone, in the beginning, in the middle, or at the end. Moreover, some of the characters that can be turned into other characters simply by adding dots. Additionally, some characters have loops, while others have similar skeletons. Moreover, the Arabic writing system is connected, meaning that connections can be formed whenever two letters are combined. In addition, diacritics can be present in Arabic text in multiple forms, and different diacritics can change the meaning of a word slightly in most cases. The fonts are rich in their diversity as they originate from different eras and geographies. All these characteristics make recognition tasks difficult
[14][15][16]. However, this challenge can also be viewed as an opportunity.
Predominantly, the performance of OCR by an OCR system or engine involves the following steps: (1) image acquisition, in which the image is collected; (2) pre-processing, which involves steps to preprocess the previously acquired image, such as binarization and thinning; (3) segmentation, which can be performed either line-wise, word-wise, or character-wise; (4) feature extraction, in which the important visual features are extracted, which can help in the subsequent step; (5) classification, in which the recognition is performed by devising models that utilize the features extracted previously; and (6) post-processing techniques, which deal with the produced text to correct it further
[17][18].
2. OCR Challenges [19][20][21]
Many challenges can occur in the OCR in images from which text is extracted, such as the following:
-
Scene complexity: The segregation of text from non-text can be difficult in cases in which the rate of noise is high or when there are strokes similar to text, but which are part of the background and are not texts themselves;
-
Conditions of uneven lighting: Due to the presence of shadows, pictures taken of any document can have the problem of uneven light distribution, in which part of the image has more light while the other is quite dark. Thus, scanned images are preferred, since they eliminate such unevenness;
-
Skewness (rotation): The text has skewed lines and deviates from the distinctive orientation. This can be a consequence of manually taking pictures;
-
Blurring and degradation: The grade of the image can be that of low quality. A remedy for this problem is to sharpen the characters;
-
Aspect ratios: Different aspect ratios can require different search procedures to detect the text;
-
Tilting (perspective distortion): The page itself may not appear in a correct, parallel perspective. The taking of photographs of texts by non-experts without the use of a scanner can account for such non-parallelism, making the task difficult. A simple solution is to use scanners or to try to take photographs for each page while perceiving the perspectives. An example is shown in Figure 1;
-
Warping: This problem is related to the text itself. The page may be perfectly parallel at the time at which the picture is taken; however, the nature of the text and how it is written are warped, which is not usually the case in writing;
-
Fonts: Printed fonts and styles vary from those that are handwritten. The same writer can write the same passage in the same space in a different way each time. On the other hand, printed text is easier because the differences tend to be less frequent from one print to another;
-
Multilingual and multi-font environments: Different languages feature different complexities. Some languages use cursive scripts with connected characters, as in the Arabic language. Moreover, recent individual pages can contain multiple languages, while old writings can have multiple fonts within the same page if they are written by different writers.
Figure 1. Example of tilting.
3. Arabic OCR Challenges [14][15][16]
-
Arabic is written from right to left, while most models are more used to the opposite direction;
-
Arabic consists of a minimum of 28 characters, and each character has a minimum of two and up to four shapes, depending on the position. Fortunately, no upper- or lower-case characters are used in Arabic;
-
The position determines the shape of a character. Therefore, a character can have different shapes depending on its position: alone, at the start, in the middle, or at the end. For example, the letter Meem has different shapes according to its position;
-
Ijam: Dots can be written above or below characters, and they differentiate between one character and another, such as Baa and Tha;
-
Tashkeel: Marks such as Tanween and Harakat can change the context of a piece of writing. Examples applied to the letter Raa are illustrated in Table 1;
-
Hamza, in various locations, such as above or below Alif, above Waw, after or above Yaa, and in the middle or end of a word;
-
Loops, such as Saad and Taa, and Meem and Ain, where the difference between characters is difficult to distinguish due to the similarity of the skeleton;
-
Separated characters separate words, such as Raa, Dal, and Waw. Figure 2 presents an example with different kinds of separation;
-
The total number of shapes is at least 81. Some letters share the same shape but differ in terms of dots;
-
Without dots, there are at least 52 letter shapes;
-
The structural characteristics of the characters, given their high similarity in shape, can be perplexing. Moreover, a slight modification can turn one character into another;
-
Arabic is cursively written. Therefore, connections are formed when two characters are connected;
-
The presence of other diacritics, such as Shadda, and punctuation, such as full stops can be confused with the dots that are used in Arabic;
-
Most models suffer from overfitting, since generalization is not a concern, partly because of the lack of annotated datasets and partly because complex models are trained on small sets or sets that are not balanced;
-
The available datasets have the issue of distorted and unclear samples;
-
The majority of datasets suffer from imbalances, in which the distribution is not even between the samples, which can affect the model’s performance by overfitting to these specific samples;
-
The writing style can vary from one author to another. Figure 3 shows different types of writing style.
Figure 2. Elucidation of how sub-words are formed in Arabic.
Figure 3. Richness and variation in Arabic fonts.
Table 1. Different types of Tashkeel.
Fatha |
Fathatain |
Dammah |
Dammatain |
Kasra |
Kasratain |
Sukoon |
Shaddah |
رَ |
رً |
رُ |
رٌ |
رِ |
رٍ |
رْ |
رّ |
This entry is adapted from the peer-reviewed paper 10.3390/app13137568