Text Classification Technique: Comparison

One effective solution in Arabic text classification is to find a suitable feature selection method, with an optimal number of features, alongside the classifier. Although several text classification methods have been proposed for the Arabic language using different techniques, such as feature selection methods, ensembles of classifiers, and discriminative features, choosing the optimal method becomes an NP-hard problem given the huge search space.

  • text classification
  • feature selection
  • swarm optimization

1. Introduction

Recently, the internet has witnessed a massive accumulation of valuable information growing exponentially every day. Most of this information is unstructured text, which makes it challenging for humans to manage and process it and to extract proper knowledge [1]. A research field in text mining called text classification (TC) emerged to overcome this problem. TC is a machine learning task that tries to assign new written content to a conceptual group from a predetermined set of categories [1]. It is crucial in a variety of applications, including sentiment analysis [2][3], spam email filtering [4][5], hate speech detection [6], text summarization [7], website classification [8], authorship attribution [9], information retrieval [10], medical diagnostics [11], emotion detection on smartphones [12], online recommendations [13], fake news detection [14][15], crypto-ransomware early detection [16], semantic similarity detection [17], part-of-speech tagging [18], news classification [19], and tweet classification [20].
Several primary stages are needed to build TC systems [21], namely, the preprocessing stage (tokenization, stop word removal, normalization, and stemming), document modeling stage (feature extraction and feature selection), and the final document classification and evaluation stage.
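To make these stages concrete, the sketch below chains feature extraction, feature selection, and classification into a single scikit-learn pipeline; it is a minimal illustration with placeholder data and parameter values, not the implementation used in any of the works discussed here.

```python
# Minimal sketch of the main TC stages (illustrative only):
# document modeling via TF-IDF, feature selection via Chi2,
# and classification with a linear SVM.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

docs = ["..."]        # preprocessed Arabic documents (placeholder)
labels = ["sports"]   # corresponding class labels (placeholder)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),          # feature extraction (document modeling)
    ("select", SelectKBest(chi2, k=500)),  # feature selection
    ("clf", LinearSVC()),                  # document classification
])
# pipeline.fit(docs, labels); pipeline.predict(new_docs)
```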
Compared to other languages, such as English, TC in Arabic is resource-poor. However, Arabic is the world’s fifth most widely spoken language, with around 4.5% of the world’s population using it as their first language [1]. The Arabic alphabet consists of 28 letters and is written from right to left. As Arabic has no capitalization, distinguishing proper names, acronyms, and abbreviations can be challenging. There are also diacritics, symbols placed above or below letters to indicate different sounds and grammatical forms, which can alter the meaning of a sentence [22]. The shape and construction of the same letter vary depending on where it appears in the word [23]. The Arabic language requires a variety of preprocessing methods before classification due to several obstacles, including its highly affixational character, the scarcity of freely available Arabic datasets [23], and the scarcity of standard Arabic morphological analysis software.
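As an illustration of the kind of preprocessing these characteristics call for, the snippet below applies one common set of normalization rules (stripping diacritics and unifying letter variants); the specific rules are a simplified assumption for this sketch, not the normalizer used in the cited studies.

```python
import re

# Illustrative Arabic normalization rules (assumed, simplified example).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tashkeel marks and dagger alef

def normalize(text: str) -> str:
    text = DIACRITICS.sub("", text)                         # strip diacritics
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)   # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")                 # alef maqsura -> yaa
    return text

print(normalize("الكِتَاب"))  # -> الكتاب (diacritics removed)
```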
An Arabic text classification (ATC) system uses a robust feature selection (FS) method and classifier (CA) to enhance performance [24]. The latter executes the classification process, whereas the former selects useful features to decrease the high dimensionality of the feature space. Additionally, incorporating FS in TC systems helps reduce classification complexity and processing demands [25][26]. Over the past years, researchers have been challenged with finding robust FS methods, relevant features, and classifiers to enhance the performance of the TC system. This problem arises from the many FS methods and classification techniques available in the literature.
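For instance, filter-style FS methods such as Chi2 and mutual information assign each term a relevance score with respect to the class labels and keep only the top-ranked terms. The toy example below (hypothetical counts and labels) shows that two such methods can rank the same terms differently, which is precisely why the choice of method and feature count matters.

```python
# Sketch: two filter-style FS methods may rank the same terms differently.
# The term-count matrix and labels are hypothetical toy data.
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

X = np.array([[3, 0, 1, 0],
              [2, 0, 0, 1],
              [0, 4, 0, 1],
              [0, 3, 1, 0]])   # 4 documents x 4 terms
y = np.array([0, 0, 1, 1])     # two classes

chi2_scores, _ = chi2(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

print(np.argsort(chi2_scores)[::-1])  # term ranking under Chi2
print(np.argsort(mi_scores)[::-1])    # term ranking under mutual information
```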
Obviously, executing such trials over all candidate methods, feature sizes, and classifiers is a complicated and time-consuming task. This necessitates the development of a technique to find an optimal solution among the alternatives.
Recently, optimization techniques such as Particle Swarm Optimization (PSO) [27] have been used to solve various problems across several domains [28]. These techniques are known to emulate natural evolution in their operations. PSO is a population-based evolutionary algorithm (EA) built on swarm intelligence and is considered one of the most efficient search methods among those proposed in the literature. Moreover, it is not computationally expensive and can converge faster than other EAs [29]. This is the main motivation behind using the PSO method. Additionally, it has been successfully applied in feature selection [30], ensemble learning [31], and clustering [32]. Therefore, this work provides a new technique for ATC that uses a meta-heuristic algorithm to find the best solution from a variety of feature selection methods and classifiers over a set of features. This configuration is determined using PSO [33].
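At its core, PSO repeatedly updates each particle's velocity and position, pulling it toward the particle's own best position and the swarm's global best. The sketch below shows the textbook update step in continuous space; the coefficient values are commonly used defaults chosen for illustration.

```python
import numpy as np

# Textbook PSO update step (coefficients are illustrative defaults).
def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=np.random):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + vel, vel

# Example: 5 particles exploring a 3-dimensional search space.
pos = np.random.random((5, 3))
vel = np.zeros_like(pos)
pos, vel = pso_step(pos, vel, pbest=pos.copy(), gbest=pos[0])
```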
The proposed method, called OCATC, consists of three phases covering data preparation, experimentation, and testing and evaluation. In the first phase, OCATC starts by preparing a given Arabic dataset using several preprocessing tasks, including tokenization, normalization, stop word removal, and stemming, and then extracts features from the dataset using the TF–IDF approach. In the second phase, the dataset is divided into training and test sets using 10-fold cross-validation, whereby the training set is used during the learning of PSO to find the optimal configuration. This step is considered the main contribution: the PSO begins by generating a set of solutions, where each solution represents a configuration of three elements, namely, the feature selection method, the number of features, and the classifier. Then, the fitness function for each solution is computed to determine the best configuration. After that, the position and velocity of each solution are updated until the stopping criteria are met. Finally, the test set is used to evaluate the quality of the best configuration to build the classification system using the optimal FS method (first element), the optimal number of features (second element), and the optimal classifier (third element). To the best of our knowledge, the recommended approach, which aims to automatically select the optimal combination of feature selection method, features, and classifier for the ATC system, has not been applied before.
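A rough sketch of how such a three-element configuration could be decoded from a particle and scored is given below; the candidate lists of FS methods and classifiers, the rounding scheme, and the use of cross-validated accuracy as the fitness are assumptions made for illustration and do not reproduce the exact OCATC design.

```python
# Illustrative decoding and scoring of one candidate configuration
# (FS method, number of features, classifier). Candidate lists and the
# cross-validated-accuracy fitness are assumptions for this sketch.
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

FS_METHODS = [chi2, mutual_info_classif]
CLASSIFIERS = [LinearSVC(), MultinomialNB()]

def fitness(particle, X_train, y_train):
    """particle = [fs_index, n_features, clf_index] as continuous values."""
    fs = FS_METHODS[int(round(particle[0])) % len(FS_METHODS)]
    k = max(1, min(int(round(particle[1])), X_train.shape[1]))
    clf = CLASSIFIERS[int(round(particle[2])) % len(CLASSIFIERS)]
    model = Pipeline([("select", SelectKBest(fs, k=k)), ("clf", clf)])
    return cross_val_score(model, X_train, y_train, cv=3).mean()
```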

2. A Novel Text Classification Technique

Several studies have addressed the issue of automatic TC, providing various methodologies and solutions. This is primarily true for English; for other languages, such as Arabic, the field is still at an early stage [1]. The authors of [34] used a variety of categorization techniques to investigate the impact of removing stop words on ATC. They discovered that a support vector machine (SVM) classifier with sequential minimal optimization had the best error rate and accuracy. In [35], the Naive Bayes (NB) classifier was used to examine the impact of a light stemmer, the Khoja stemmer, and a root extractor on Arabic document categorization. The authors concluded that a root extractor combined with a part-of-speech tagger would yield the best results. Chi-square (Chi2), Information Gain (IG), Ng–Goh–Low (NGL), and Galavotti–Sebastiani–Simi (GSS) coefficients were used to determine the essential features and the influence of feature reduction approaches on Arabic document categorization [22]. The authors also employed feature weighting approaches based on inverse document frequency (IDF), such as term frequency (TF–IDF), the location of a word’s first occurrence (FAiDF), and word compactness (CPiDF). The classification model was established using the SVM classifier. When the TF–IDF, CPiDF, and FAiDF feature weighting methods were combined, GSS outperformed the other feature selection strategies. The feature selection approach for ATC proposed by [36] was binary particle swarm optimization with a KNN classifier (BPSO-KNN). The Alj-News dataset was utilized to develop the classification model, and the best results were obtained using SVM and NB classifiers. For Arabic document classification, Sabbah et al. [37] tested a number of feature selection approaches, including Chi2, IG, Correlation (Corr), and an SVM-based Feature Ranking Method (SVM-FRM). The classification model was constructed using an SVM classifier on the BBC [38] and Abuaiadah [39] datasets. They concluded that SVM-FRM performs well on a balanced dataset, but not so well on an imbalanced one. A novel feature selection method, namely, improved chi-square (ImpCHI), was presented by [40] to enhance ATC. Three standard feature selection methods, namely, IG, Chi2, and Mutual Information (MI), were compared with ImpCHI. SVM and DT classifiers were used to evaluate the performance on the CNN dataset [38]. Experimental results demonstrated that the best result was obtained using the ImpCHI FS method with the SVM classifier. In [41], the authors presented the Frequency Ratio Accumulation Method (FRAM), a novel text categorization technique for the Arabic language. The features were extracted using a bag of words (BoW). Chi2, MI, Odds Ratio (OR), and the GSS coefficient were among the feature selection approaches used to exclude unnecessary features. According to the results, FRAM outperformed the NB, Multi-variant Bernoulli Naive Bayes (MBNB), and Multinomial Naive Bayes (MNB) classifiers, with a macro-F-measure of 95.1 percent for the unigram word-level representation approach. The Polynomial Networks (PNs) classifier was used by [42] for Arabic text classification on the Alj-News dataset. They compared the performance of the PNs classifier with other classifiers, such as SVM, NB, and DT.
Their results showed that the performance of the PNs classifier was not the best across all categories in the dataset, but it was very competitive. The authors in [43] claimed to be the first to utilize Logistic Regression (LR) for Arabic text categorization with the Alj-News dataset. The experimental results showed that Logistic Regression is beneficial for ATC. The effects of eight supervised learning algorithms on Arabic document classification were studied by [44]. Several feature representation approaches were used to extract features from the Abuaiadah dataset [39]. The authors concluded that superior results were obtained when combining an LSVM classifier with the IDFT approach. Abdelaal et al. [45] proposed an automatic classification model for Hadith. The proposed model was used to organize Arabic Hadith texts into related categories: Sahih, Hasan, Da’if, and Maudu. Several classifiers, such as LSVC, SGD, and LR, were investigated to build the classification model, and IG, Chi2, and Gain Ratio (GR) were used as feature selection methods to remove irrelevant features. The outcomes showed that LSVC outperforms the other classifiers. Moreover, in [46], the authors evaluated the Hadith dataset using DT, RF, and NB classifiers, isolating redundant features with IG and Chi2. Binary Boolean weighting and TF–IDF were used to extract features. Experimental results demonstrated that the best classifier investigated was DT. Elnagar et al. [1] performed an extensive analysis to evaluate the effectiveness of Deep Neural Network (DNN) models and a word2vec embedding model on newly constructed large corpora for Arabic document classification. The corpus was SANAD (Single-label Arabic News Articles Dataset), collected by Einea et al. [47]. The evaluation experiments showed the effectiveness of the proposed models on both single-label and multi-label categorization. Alhaj et al. [48] studied the effects of stemming strategies on ATC. Several classifiers, including NB, KNN, and SVM, were used to build the classification model, and Chi2 was used to extract essential features in different sizes. A publicly available dataset, namely, CNN, was used to evaluate the classification model. The outcomes demonstrated that the SVM classifier combined with the ARLStem stemmer outperforms the other classifiers as the number of features increases. Moreover, in [49], the authors studied the effects of stop word removal on several classifiers and feature extraction methods using the CNN public dataset. Chi2 was applied as a feature selection method, and the TF–IDF and BoW methods were used to extract features. They concluded that the best results were achieved when stop word removal was combined with TF–IDF and the SVM classifier.

References

  1. Elnagar, A.; Al-Debsi, R.; Einea, O. Arabic text classification using deep learning models. Inf. Process. Manag. 2020, 57, 102121.
  2. Al-Ayyoub, M.; Khamaiseh, A.A.; Jararweh, Y.; Al-Kabi, M.N. A comprehensive survey of arabic sentiment analysis. Inf. Process. Manag. 2019, 56, 320–342.
  3. Al-Smadi, M.; Al-Ayyoub, M.; Jararweh, Y.; Qawasmeh, O. Enhancing Aspect-Based Sentiment Analysis of Arabic Hotels’ reviews using morphological, syntactic and semantic features. Inf. Process. Manag. 2019, 56, 308–319.
  4. Dada, E.G.; Bassi, J.S.; Chiroma, H.; Abdulhamid, S.M.; Adetunmbi, A.O.; Ajibuwa, O.E. Machine learning for email spam filtering: Review, approaches and open research problems. Heliyon 2019, 5, e01802.
  5. Shrivas, A.K.; Dewangan, A.K.; Ghosh, S.M.; Singh, D. Development of proposed ensemble model for spam e-mail classification. Inf. Technol. Control 2021, 50, 411–423.
  6. Aldjanabi, W.; Dahou, A.; Al-Qaness, M.A.A.; Elaziz, M.A.; Helmi, A.M.; Damaševičius, R. Arabic offensive and hate speech detection using a cross-corpora multi-task learning model. Informatics 2021, 8, 69.
  7. Sun, G.; Wang, Z.; Zhao, J. Automatic text summarization using deep reinforcement learning and beyond. Inf. Technol. Control 2021, 50, 458–469.
  8. Li, Y.; Nie, X.; Huang, R. Web spam classification method based on deep belief networks. Expert Syst. Appl. 2018, 96, 261–270.
  9. Kapociute-Dzikiene, J.; Venckauskas, A.; Damasevicius, R. A comparison of authorship attribution approaches applied on the Lithuanian language. In Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, FedCSIS 2017, Prague, Czech Republic, 3–6 September 2017; pp. 347–351.
  10. Xu, B.; Lin, H.; Lin, Y.; Xu, K.; Wang, L.; Gao, J. Incorporating semantic word representations into query expansion for microblog information retrieval. Inf. Technol. Control 2019, 48, 626–636.
  11. Omoregbe, N.A.I.; Ndaman, I.O.; Misra, S.; Abayomi-Alli, O.O.; Damaševičius, R. Text messaging-based medical diagnosis using natural language processing and fuzzy logic. J. Healthc. Eng. 2020, 2020, 8839524.
  12. Ghosh, S.; Hiware, K.; Ganguly, N.; Mitra, B.; De, P. Emotion detection from touch interactions during text entry on smartphones. Int. J. Hum.-Comput. Stud. 2019, 130, 47–57.
  13. Ji, Z.; Pi, H.; Wei, W.; Xiong, B.; Wozniak, M.; Damasevicius, R. Recommendation Based on Review Texts and Social Communities: A Hybrid Model. IEEE Access 2019, 7, 40416–40427.
  14. Alonso, M.A.; Vilares, D.; Gómez-Rodríguez, C.; Vilares, J. Sentiment analysis for fake news detection. Electronics 2021, 10, 1348.
  15. Tesfagergish, S.G.; Damaševičius, R.; Kapočiūtė-Dzikienė, J. Deep Fake Recognition in Tweets Using Text Augmentation, Word Embeddings and Deep Learning; Springer: Cham, Switzerland, 2021; Volume 12954, pp. 523–538.
  16. Al-rimy, B.A.S.; Maarof, M.A.; Shaid, S.Z.M. Crypto-ransomware early detection model using novel incremental bagging with enhanced semi-random subspace selection. Future Gener. Comput. Syst. 2019, 101, 476–491.
  17. Mansoor, M.; Ur Rehman, Z.; Shaheen, M.; Khan, M.A.; Habib, M. Deep learning based semantic similarity detection using text data. Inf. Technol. Control 2020, 49, 495–510.
  18. Tesfagergish, S.G.; Kapočiūtė-Dzikienė, J. Part-of-speech tagging via deep neural networks for northern-Ethiopic languages. Inf. Technol. Control 2020, 49, 482–494.
  19. Alfonse, M.; Gawich, M. A novel methodology for Arabic news classification. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2022, 12, e1440.
  20. Alruily, M. Classification of arabic tweets: A review. Electronics 2021, 10, 1143.
  21. Uysal, A.K.; Gunal, S. The impact of preprocessing on text classification. Inf. Process. Manag. 2014, 50, 104–112.
  22. Ayedh, A.; Tan, G.; Rajeh, H. The Impact of Feature Reduction Techniques on Arabic Document Classification. Int. J. Database Theory Appl. 2016, 9, 67–80.
  23. Ayedh, A.; TAN, G.; Alwesabi, K.; Rajeh, H. The Effect of Preprocessing on Arabic Document Categorization. Algorithms 2016, 9, 27.
  24. Kou, G.; Yang, P.; Peng, Y.; Xiao, F.; Chen, Y.; Alsaadi, F.E. Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods. Appl. Soft Comput. 2019, 86, 105836.
  25. Larkey, L.S.; Ballesteros, L.; Connell, M.E. Improving stemming for Arabic information retrieval. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 11–15 August 2002; p. 275.
  26. Al-Anzi, F.S.; AbuZeina, D. Beyond vector space model for hierarchical Arabic text classification: A Markov chain approach. Inf. Process. Manag. 2018, 54, 105–115.
  27. Kohler, M.; Vellasco, M.M.; Tanscheit, R. PSO+: A new particle swarm optimization algorithm for constrained problems. Appl. Soft Comput. 2019, 85, 105865.
  28. Al-qaness, M.A.; Ewees, A.A.; Fan, H.; AlRassas, A.M.; Abd Elaziz, M. Modified aquila optimizer for forecasting oil production. Geo-Spat. Inf. Sci. 2022, 1–17.
  29. Unler, A.; Murat, A. A discrete particle swarm optimization method for feature selection in binary classification problems. Eur. J. Oper. Res. 2010, 206, 528–539.
  30. Engelbrecht, A.P.; Grobler, J.; Langeveld, J. Set based particle swarm optimization for the feature selection problem. Eng. Appl. Artif. Intell. 2019, 85, 324–336.
  31. Malhotra, R.; Khanna, M. Particle swarm optimization-based ensemble learning for software change prediction. Inf. Softw. Technol. 2018, 102, 65–84.
  32. Janani, R.; Vijayarani, S. Text document clustering using Spectral Clustering algorithm with Particle Swarm Optimization. Expert Syst. Appl. 2019, 134, 192–200.
  33. Eberhart, R.C.; Kennedy, J.A. New Optimizer Using Particle Swarm. In Proceedings of the Sixth International Symposium on Micro Machine and Human Science, Nagoya, Japan, 4–6 October 1995; pp. 39–43.
  34. Al-Shargabi, B.; Al-Romimah, W.; Olayah, F. A comparative study for Arabic text classification algorithms based on stop words elimination. In Proceedings of the 2011 International Conference on Intelligent Semantic Web-Services and Applications, Amman, Jordan, 18–20 April 2011; p. 11.
  35. Yousif, S.A.; Samawi, V.W.; Elkabani, I. Enhancement of Arabic Text Classification Using Semantic Relations with Part of Speech Tagger. Adv. Electr. Comput. Eng. 2015, 195–201.
  36. Chantar, H.K.; Corne, D.W. Feature subset selection for Arabic document categorization using BPSO-KNN. In Proceedings of the 2011 Third World Congress on Nature and Biologically Inspired Computing, Salamanca, Spain, 19–21 October 2011; pp. 546–551.
  37. Sabbah, T.; Ayyash, M.; Ashraf, M. Support Vector Machine based Feature Selection Method for Text Classification. In Proceedings of the International Arab Conference on Information Technology, Yassmine Hammamet, Tunisia, 22–24 December 2017.
  38. Saad, M.; Ashour, W. OSAC: Open Source Arabic Corpora. In Proceedings of the 6th ArchEng International Symposiums, EEECS’10 the 6th International Symposium on Electrical and Electronics Engineering and Computer Science, Lefke, North Cyprus, 25–26 November 2010; pp. 118–123.
  39. Abuaiadah, D.; El Sana, J.; Abusalah, W. On the impact of dataset characteristics on arabic document classification. Int. J. Comput. Appl. 2014, 101, 31–38.
  40. Bahassine, S.; Madani, A.; Al-Sarem, M.; Kissi, M. Feature selection using an improved Chi-square for Arabic text classification. J. King Saud Univ. Comput. Inf. Sci. 2018, 32, 225–231.
  41. Sharef, B.T.; Omar, N.; Sharef, Z.T. An automated arabic text categorization based on the frequency ratio accumulation. Int. Arab J. Inf. Technol. 2014, 11, 213–221.
  42. Al-Tahrawi, M.M.; Al-Khatib, S.N. Arabic text classification using Polynomial Networks. J. King Saud Univ.-Comput. Inf. Sci. 2015, 27, 437–449.
  43. Al-Tahrawi, M.M. Arabic Text Categorization Using Logistic Regression. Int. J. Intell. Syst. Appl. 2015, 7, 71–78.
  44. Sammouda, R. A comparative study of effective supervised learning methods on arabic text classification. Int. J. Comput. Sci. Netw. Secur. 2017, 17, 130–133.
  45. Abdelaal, H.M.; Youness, H.A.; Ahmed, A.M.; Ghribi, W. Knowledge Discovery in the Hadith according to the reliability and memory of the reporters using Machine learning techniques. IEEE Access 2019, 7, 157741–157755.
  46. Abdelaal, H.M.; Elemary, B.R.; Youness, H.A. Classification of Hadith According to Its Content Based on Supervised Learning Algorithms. IEEE Access 2019, 7, 152379–152387.
  47. Einea, O.; Elnagar, A.; Debsi, R.A. SANAD: Single-label Arabic News Articles Dataset for automatic text categorization. Data Brief 2019, 25, 104076.
  48. Alhaj, Y.A.; Xiang, J.; Zhao, D.; Al-Qaness, M.A.; Elaziz, M.A.; Dahou, A. A Study of the Effects of Stemming Strategies on Arabic Document Classification. IEEE Access 2019, 7, 32664–32671.
  49. Alhaj, Y.A.; Wickramaarachchi, W.U.; Hussain, A.; Al-Qaness, M.A.; Abdelaal, H.M. Efficient Feature Representation Based on the Effect of Words Frequency for Arabic Documents Classification. In Proceedings of the 2nd International Conference on Telecommunications and Communication Engineering, Beijing, China, 28–30 November 2018; pp. 397–401.