Arabic Text Clustering: Comparison
Please note this is a comparison between Version 1 by Souad Larabi-Marie-Sainte and Version 2 by Lindsay Dong.

Arabic text clustering is an essential topic in Arabic Natural Language Processing (ANLP). Its significance stems from its many applications, such as document indexing, categorization, and user review analysis.

  • natural language processing
  • text clustering
  • Arabic text

1. Introduction

Clustering text documents is an important field in the area of Natural Language Processing (NLP) as it simplifies the tedious process of categorizing specific documents among millions of resources, especially when metadata such as key phrases, titles, and labels are not available. Text clustering is valuable for different applications, including topic extraction, spam filtering, automatic document categorization, user reviews analysis, and fast information retrieval.
The process of clustering text written in natural languages is complicated, especially for the Arabic language. One complication is the language’s morphological complexity: a single Arabic word can be written in several forms, sometimes more than ten [1]. Ambiguity is another major complication, caused by the richness and complexity of Arabic morphology [1,2]. Various other factors make Arabic text clustering difficult, among them the different dialects spoken in different regions; texts from different regions may exhibit significant linguistic variations. Moreover, in Arabic, changing the order of the words in a sentence can give that sentence quite different interpretations [3,4].
Researchers have proposed several Arabic text clustering techniques to address these challenges. Among them, the K-Means clustering algorithm is the most widely applied, owing to its simplicity and efficiency in comparison with other clustering algorithms [2,5,6,7]. However, the initialization process of K-Means weakens its accuracy. The initialization starts by placing the cluster centers randomly and then assigning documents to the nearest center; if this process is inaccurate, the clustering will be imprecise [8]. Researchers proposed K-Means++, which improves the initialization process of K-Means [9]. However, experiments show that even with this smarter initialization, clustering accuracy remains low compared to other techniques. Researchers also proposed other clustering techniques, such as Suffix Tree clustering [10] and Self-Organizing Maps (SOM) [11]. Suffix Tree clustering has the limitation of overlapping documents across clusters [12], while SOM clustering techniques demonstrated high effectiveness in clustering text even with high-dimensional datasets [13,14,15,16,17].
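To make the initialization issue concrete, the following is a minimal sketch of the K-Means++ seeding rule on toy two-dimensional points standing in for document vectors; it is an illustration of the general technique, not code from any of the cited works:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # K-Means++ seeding: the first centroid is chosen uniformly at random;
    # each subsequent centroid is a data point chosen with probability
    # proportional to its squared distance from the nearest centroid
    # picked so far, which spreads the seeds out.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        C = np.array(centroids)
        # squared distance from every point to its nearest chosen centroid
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Toy data: two well-separated blobs standing in for document vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
seeds = kmeans_pp_init(X, 2, rng)
```

Because each new seed is drawn with probability proportional to its squared distance from the nearest existing seed, well-separated regions of the data tend to receive their own seed, unlike purely random initialization.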

2. Arabic Text Clustering

Alharwat and Hegazi addressed the problem of mining high-dimensional text data [19]. To overcome this problem, the authors applied modeling techniques to the documents before clustering them. The scholars used the Modern Standard Arabic (MSA) dataset [20], which has several versions with differently preprocessed articles. The outcome of this study showed that normalized data provided better clustering quality than unnormalized data. With normalization, the purity of their clusters was 0.933, and the F1-score was 0.8732. Similar to Alharwat and Hegazi, Al-Azzawy et al. used K-Means to cluster an Arabic corpus of 20 documents containing news and short anecdotes [21]. The highest clustering scores for precision, recall, and F1-measure were 98%, 88%, and 93%, respectively. Mahmood and Al-Rufaye also addressed the problem of high document dimensionality, reducing it with the Term Frequency (TF), Inverse Document Frequency (IDF), and Term Frequency–Inverse Document Frequency (TF-IDF) feature selection approaches [22]. Following that, K-Means and K-Medoids were used for the clustering. The authors ran their experiment on a 300-document corpus they built and reported that K-Medoids provided more accurate results than K-Means: K-Means scored 60%, 78%, and 67% for precision, recall, and F1-measure, respectively, while K-Medoids scored 80%, 83%, and 81%. Another group of researchers used K-Means clustering along with the TF-IDF and Binary Term Occurrence (BTO) feature selection approaches [23]. The scholars used a dataset of 1121 Arabic tweets. The outcome of their work showed that the BTO feature selection approach outperformed TF-IDF.
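For reference, the TF-IDF weighting used as a feature selection step in these studies can be computed with a short, dependency-free sketch; the three-document toy corpus below is illustrative and is not drawn from any of the datasets cited above:

```python
import math
from collections import Counter

# Toy Arabic corpus: two sports sentences and one politics sentence,
# each prefixed with a shared word ("اليوم", "today") to show the IDF effect.
docs = [
    "اليوم الفريق فاز في المباراة".split(),
    "اليوم المباراة انتهت بفوز الفريق".split(),
    "اليوم البرلمان ناقش القانون الجديد".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    # TF-IDF: term frequency in the document, discounted by how common
    # the term is across the corpus (log of inverse document frequency).
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf}

weights = tfidf(docs[0])
# A term occurring in every document (here "اليوم") gets weight 0,
# while terms that discriminate between documents get positive weight.
```

Documents represented by such weight vectors are then what K-Means (or K-Medoids) actually clusters.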
The literature on clustering Arabic text with K-Means shows high variation in performance scores, which could be attributed to the instability and inconsistency of the K-Means clustering algorithm. To overcome the limitations of the random initialization of cluster centroids, researchers used PSO-optimized K-Means to cluster Arabic text [24,25,26]. Particle Swarm Optimization (PSO) contributes by selecting the initial seeds for K-Means. One group of researchers implemented their algorithm for clustering Quran verses by theme [24], whereas another group [25,26] used three different datasets, named BBC, CNN, and OSAC [27]. The outcomes of these papers demonstrated the effectiveness of applying optimization methods for enhancing the accuracy of the clustering models used. Another work on clustering Arabic documents was based on the sentiment orientation and context of words in the data corpus [5]. The authors used the Brown clustering algorithm on user reviews covering several topics, such as news, movies, and restaurants; the data were collected from several sources [28,29,30,31]. The evaluation showed that the subjectivity and polarity of the clustered documents reached rates of 96% and 85%, respectively. It also indicated that the number of clusters affects the accuracy rates, with fewer clusters providing better results. In another work [2], the authors combined Markov Clustering, Fuzzy-C-Means, and Deep Belief Networks (DBN) to cluster Arabic documents. Two datasets were used: the first, with 10,000 documents, was acquired from the Al-Jazeera news website, and the second, with 6000 documents, from the Saudi Press Agency [32]. The clustering precision, recall, and F1-measure were 91.2%, 90.9%, and 91.02%, respectively.
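The PSO-seeded K-Means idea can be illustrated with the following minimal sketch; the swarm size, inertia weight (0.7), and acceleration coefficients (1.5) are generic illustrative choices, not the settings of the cited papers:

```python
import numpy as np

def sse(X, C):
    """Within-cluster sum of squared distances for centroids C."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1).sum()

def pso_kmeans(X, k, n_particles=10, iters=30, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    n, dim = X.shape
    # Each particle encodes k candidate centroids (flattened), seeded
    # on randomly chosen data points.
    pos = X[rng.integers(n, size=(n_particles, k))].reshape(n_particles, -1)
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_cost = np.array([sse(X, p.reshape(k, dim)) for p in pos])
    gbest = pbest[pbest_cost.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # Standard PSO update: inertia plus pulls toward personal/global bests.
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos += vel
        cost = np.array([sse(X, p.reshape(k, dim)) for p in pos])
        improved = cost < pbest_cost
        pbest[improved], pbest_cost[improved] = pos[improved], cost[improved]
        gbest = pbest[pbest_cost.argmin()].copy()
    # Refine the swarm's best seeds with a few Lloyd (K-Means) steps.
    C = gbest.reshape(k, dim)
    for _ in range(10):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(axis=0)
    return C, labels

# Toy demo: two well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(4, 0.2, (30, 2))])
C, labels = pso_kmeans(X, 2, rng=rng)
```

The swarm explores many candidate seed sets before Lloyd iterations begin, which is the mechanism by which these hybrids avoid the poor random initializations discussed above.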
The model used in that study [2] was, however, highly affected by the feature selection of root words, leading to imprecise clustering results. Al-Anzi and Abuzeina [11] used the Expectation-Maximization (EM), SOM, and K-Means algorithms to cluster Arabic documents. They built a corpus of 1000 documents extracted from Alanba, a Kuwaiti newspaper website [33]. The documents cover different topics, such as health, technology, sports, and politics. The authors compared the three clustering algorithms and reported that SOM obtained the highest accuracy among them, with a rate of 93.4%. This suggests that the use of SOM for clustering Arabic text is promising. The Bond Energy Algorithm (BEA) was also used to cluster Arabic text [34]. The results of this study showed that BEA outperforms K-Means clustering in terms of precision, recall, and F1-score. In the broader field of text clustering, researchers also proposed prototype-based models [35], which likewise outperformed K-Means clustering. To conclude, most current work on Arabic text clustering uses K-Means because it is simple and easy to apply. However, the mechanism that K-Means follows has limitations: it first initializes cluster centers and then assigns documents to them, so if the initialization is poorly formulated, the risk of incorrect clustering arises. Techniques that integrate K-Means clustering with Particle Swarm Optimization [25,26] have shown promising results, indicating that optimization contributes positively to clustering models. In addition, previous work showed that SOM provided better clustering results than K-Means for Arabic text [11]. Table 1 presents a summary of recent work on Arabic text clustering.
Table 1. Arabic text clustering related work comparison.

Ref. | Model | Dataset | Purity | F1-Score | Precision | Recall | Accuracy
[19] | K-Means | MSA | 93.3% | 87.32% | 87.13% | 87.52% | -
[21] | K-Means | Own corpus | - | 93% | 98% | 88% | -
[26] | K-Means + PSO | BBC, CNN, OSAC | 50% | 47% | 33% | - | -
[5] | Brown clustering | Own corpus | 85% | - | - | - | -
[25] | K-Means | Arabic tweets | 76.4% | - | - | - | -
[23] | K-Means (TF-IDF, BTO) | Arabic tweets | - | - | - | - | -
[22] | K-Medoids | Own corpus | - | 81% | 80% | 83% | -
[2] | Markov + Fuzzy-C-Means + DBN | Own corpus | - | 91.02% | 91.2% | 90.9% | -
[10] | Suffix Tree | Own corpus | - | 81.11% | 80.3% | 83.75% | -
[11] | SOM | Own corpus | - | - | - | - | 93.4%
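As a minimal illustration of the SOM technique that achieved the best accuracy in Table 1, the following sketch trains a small one-dimensional map; the grid size, learning schedule, and toy data are illustrative assumptions rather than the cited setup:

```python
import numpy as np

def train_som(X, grid=4, epochs=50, lr0=0.5, sigma0=1.5, rng=None):
    # A 1-D grid of weight vectors is pulled toward the inputs; the winning
    # node (best-matching unit) and its grid neighbors move most, so nearby
    # nodes come to represent similar vectors.
    rng = rng if rng is not None else np.random.default_rng(0)
    W = rng.random((grid, X.shape[1]))
    idx = np.arange(grid)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                 # decaying learning rate
        sigma = max(sigma0 * (1 - t / epochs), 0.5)  # shrinking neighborhood
        for x in X[rng.permutation(len(X))]:
            bmu = ((W - x) ** 2).sum(1).argmin()     # best-matching unit
            h = np.exp(-((idx - bmu) ** 2) / (2 * sigma ** 2))
            W += lr * h[:, None] * (x - W)           # neighborhood update
    return W

# Toy data: two well-separated 3-D blobs standing in for document vectors.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (25, 3)), rng.normal(3, 0.1, (25, 3))])
W = train_som(X, rng=rng)
# Each document is then assigned to its best-matching node; well-separated
# groups map to different nodes, giving the clustering.
labels = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1).argmin(axis=1)
```

The shrinking neighborhood is what lets SOM cope with high-dimensional text vectors: early epochs order the map globally, while later epochs fine-tune individual nodes.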