Text Classification Technique: History

One of the effective solutions in Arabic text classification is to find a suitable feature selection method, with an optimal number of features, alongside the classifier. Although several text classification methods have been proposed for the Arabic language using different techniques, such as feature selection methods, ensembles of classifiers, and discriminative features, choosing the optimal method becomes an NP-hard problem given the huge search space. 

  • text classification
  • feature selection
  • swarm optimization

1. Introduction

Recently, the internet has witnessed a massive accumulation of valuable information, growing exponentially every day. Most of this information is unstructured text, which makes it challenging for humans to manage and process it and to extract proper knowledge [1]. A research field in text mining called text classification (TC) emerged to overcome this problem. TC is a machine learning task that tries to assign new written content to a conceptual group from a predetermined set of categories [1]. It is crucial in a variety of applications, including sentiment analysis [2,3], spam email filtering [4,5], hate speech detection [6], text summarization [7], website classification [8], authorship attribution [9], information retrieval [10], medical diagnostics [11], emotion detection on smart phones [12], online recommendations [13], fake news detection [14,15], crypto-ransomware early detection [16], semantic similarity detection [17], part-of-speech tagging [18], news classification [19], and tweet classification [20].
Several primary stages are needed to build TC systems [21], namely, the preprocessing stage (tokenization, stop word removal, normalization, and stemming), the document modeling stage (feature extraction and feature selection), and the final document classification and evaluation stage.
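These stages can be sketched end to end with a generic machine learning toolkit. The snippet below is a minimal illustration only, not the system described in this entry: it uses scikit-learn, toy English documents, and made-up labels (the entry itself targets Arabic), with TF–IDF for document modeling and an SVM classifier.

```python
# Minimal sketch of the TC pipeline stages: preprocessing (tokenization,
# stop word removal inside the vectorizer), document modeling (TF-IDF
# feature extraction), and classification. Toy data for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs = [
    "the match ended with a late goal",   # sports
    "the team won the league title",      # sports
    "the bank raised interest rates",     # economy
    "markets fell as inflation rose",     # economy
]
labels = ["sports", "sports", "economy", "economy"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),  # tokenize + weight terms
    ("clf", LinearSVC()),                              # linear SVM classifier
])
pipeline.fit(docs, labels)
print(pipeline.predict(["the striker scored a goal"])[0])
```

In a full system, an explicit feature selection step (e.g., chi-square) would sit between the vectorizer and the classifier, as discussed below.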
Compared to other languages, such as English, TC in Arabic is resource-poor. However, Arabic is the world’s fifth most widely spoken language, with around 4.5% of the world’s population using it as their first language [1]. The Arabic alphabet consists of 28 letters and is written from right to left. As Arabic words do not begin with a capital letter as they do in English, distinguishing proper names, acronyms, and abbreviations can be challenging. There are also diacritics, symbols placed above or below letters to indicate different sounds and grammatical forms, which can alter the meaning of a sentence [22]. The shape and construction of the same letter vary depending on where it appears in the word [23]. The Arabic language requires a variety of preprocessing methods before classification due to several obstacles, including the Arabic language’s strong affixation character, the scarcity of freely available Arabic datasets [23], and the scarcity of standard Arabic morphological analysis software.
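Two of the preprocessing steps this implies, diacritic removal and letter normalization, can be sketched as follows. This is a minimal illustration assuming typical normalization rules (strip the diacritic range U+064B–U+0652, unify alef variants, map taa marbuta to haa); exact rules differ across Arabic NLP toolkits.

```python
# Illustrative Arabic text normalization: strip diacritics and unify
# common letter variants. Rules shown are typical, not exhaustive.
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan .. sukun

def normalize(text):
    text = DIACRITICS.sub("", text)                         # strip diacritics
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)   # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")                 # taa marbuta -> haa
    return text

print(normalize("كَتَبَ"))  # diacritized "kataba" -> undiacritized "كتب"
```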
The ATC system uses a robust feature selection (FS) method and classifier (CA) to enhance performance [24]. The latter executes the classification process, whereas the former selects useful features to decrease the high dimensionality of the feature space. Additionally, incorporating FS in TC systems helps reduce classification complexity and processing demands [25,26]. Over the past years, researchers have been challenged with finding robust FS methods, relevant features, and classifiers to enhance the performance of the TC system. This problem arises from the many FS methods and techniques available in the literature.
Obviously, carrying out this trial-and-error process is a complicated and time-consuming task. Therefore, a technique is needed to find an optimal solution among the candidates.
Recently, optimization techniques such as Particle Swarm Optimization (PSO) [27] have been used to solve selection problems across several domains [28]. These techniques are known to emulate natural evolution in their operations. PSO is a population-based evolutionary algorithm (EA) built on swarm intelligence, and it is considered one of the most efficient search methods among those proposed in the literature. Moreover, it is computationally inexpensive and converges faster than other EAs [29]. This is the main motivation behind using the PSO method. Additionally, it has been successfully applied in feature selection [30], ensemble learning [31], and clustering [32]. Therefore, this work provides a new technique for ATC that uses a meta-heuristic algorithm to find the best combination of feature selection method, features, and classifier. This configuration is determined using PSO [33].
The proposed method, called OCATC, consists of three phases covering data preparation, experimentation, and testing and evaluation. In the first phase, OCATC starts by preparing a given Arabic dataset using several preprocessing tasks, including tokenization, normalization, stop word removal, and stemming, then extracts features from the dataset using the TF–IDF approach. In the second phase, the dataset is divided into training and testing sets using 10-fold cross-validation, whereby the training set is used during the learning of PSO to find the optimal configuration. This step is considered the main contribution: PSO begins by generating a set of solutions, where each solution represents a configuration of three elements, the feature selection method, the number of features, and the classifier. Then, the fitness function for each solution is computed to determine the best configuration. After that, the position and velocity of each solution are updated until the stopping criteria are met. Finally, the testing set is used to evaluate the quality of the best configuration to build the classification system using the optimal FS method (first element), the optimal number of features (second element), and the optimal classifier (third element). The recommended approach, to the best of our knowledge, aims to find an effective solution for the ATC system to automatically select the optimal solution from a set of elements such as feature selection methods, features, and classifiers, which has not been applied before.
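The search loop described above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the fitness function below is a toy stand-in for the cross-validated accuracy OCATC would compute by actually training a pipeline, and the candidate method names and parameter ranges are assumptions made for the example.

```python
# Sketch of PSO searching over a configuration of three elements:
# (feature selection method, number of features, classifier).
# fitness() is a TOY stand-in for cross-validated accuracy.
import random
random.seed(0)

FS_METHODS = ["chi2", "ig", "mi"]    # illustrative candidate FS methods
CLASSIFIERS = ["svm", "nb", "knn"]   # illustrative candidate classifiers
MAX_FEATURES = 500

def decode(pos):
    """Map a continuous particle position to a discrete configuration."""
    fs = FS_METHODS[int(pos[0]) % len(FS_METHODS)]
    k = max(1, min(MAX_FEATURES, int(pos[1])))
    clf = CLASSIFIERS[int(pos[2]) % len(CLASSIFIERS)]
    return fs, k, clf

def fitness(config):
    """Toy landscape: pretend chi2 + svm with ~300 features scores best."""
    fs, k, clf = config
    return 1.0 - abs(k - 300) / MAX_FEATURES \
        + 0.2 * (fs == "chi2") + 0.2 * (clf == "svm")

# standard PSO update: v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)
n, dims, w, c1, c2 = 10, 3, 0.7, 1.5, 1.5
X = [[random.uniform(0, 3), random.uniform(1, 500), random.uniform(0, 3)]
     for _ in range(n)]
V = [[0.0] * dims for _ in range(n)]
pbest = [x[:] for x in X]
gbest = max(pbest, key=lambda p: fitness(decode(p)))[:]

for _ in range(50):  # iterate until the stopping criterion (here: 50 rounds)
    for i in range(n):
        for d in range(dims):
            r1, r2 = random.random(), random.random()
            V[i][d] = (w * V[i][d] + c1 * r1 * (pbest[i][d] - X[i][d])
                       + c2 * r2 * (gbest[d] - X[i][d]))
            X[i][d] += V[i][d]
        if fitness(decode(X[i])) > fitness(decode(pbest[i])):
            pbest[i] = X[i][:]
    gbest = max(pbest, key=lambda p: fitness(decode(p)))[:]

print(decode(gbest))  # best (FS method, n_features, classifier) found
```

In OCATC itself, evaluating a particle means training the decoded pipeline on the training folds and measuring its accuracy, so each fitness call is far more expensive than this toy version.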

2. A Novel Text Classification Technique

Several studies have looked at the issue of automatic TC, providing various methodologies and answers. This is primarily true for English; for other languages, such as Arabic, the field is still at an early stage [1]. The authors of [34] used a variety of categorization techniques to investigate the impact of removing stop words on ATC. They discovered that a support vector machine (SVM) classifier with sequential minimal optimization had the best error rate and accuracy. In [35], the Naive Bayes (NB) classifier is used to examine the impact of a light stemmer, the Khoja stemmer, and a root extractor on Arabic document categorization. The authors concluded that a root extractor combined with a position tagger would yield the most outstanding results.
Chi-square (Chi2), Information Gain (IG), NG-Goh-Low (NGL), and Galavotti–Sebastiani–Simi (GSS) coefficients were used to determine the essential features and the influence of feature reduction approaches on Arabic document categorization [22]. The authors also employed feature weighting approaches based on inverse document frequency (IDF), such as term frequency (TF–IDF), the location of a word’s first occurrence (FAiDF), and the compactness of the word (CPiDF). The classification model was established using the SVM classifier. When the TF–IDF, CPiDF, and FAiDF feature weighting methods were combined, GSS outperformed the other feature selection strategies. The feature selection approach for ATC by [36] was binary particle swarm optimization with a KNN classifier (BPSO-KNN). The Alj-News dataset was utilized to develop the classification model, and the best results were obtained utilizing SVM and NB classifiers.
On Arabic document classification, Sabbah et al. [37] tested a number of feature selection approaches, including Chi2, IG, Correlation (Corr), and the SVM-based Feature Ranking Method (SVM-FRM). The classification model was constructed using an SVM classifier. In their research, they used the BBC [38] and Abuaiadah [39] datasets. They concluded that SVM-FRM performs well on a balanced dataset, but not so well on an imbalanced one.
A novel feature selection method, namely, improved chi-square (ImpCHI), was presented by [40] to enhance ATC. Three standard feature selection methods, namely, IG, Chi2, and Mutual Information (MI), were compared with ImpCHI. SVM and DT classifiers were used to evaluate the performance on the CNN dataset [38]. Experimental results demonstrate that the best result was obtained using the ImpCHI FS method with the SVM classifier.
In [41], the authors presented the Frequency Ratio Accumulation Method (FRAM), a novel text categorization technique for the Arabic language. The features were extracted using a bag of words (BoW). Chi2, MI, Odds Ratio (OR), and the GSS coefficient were among the feature selection approaches used to exclude unnecessary features. According to the results, FRAM outperformed the NB, Multi-variant Bernoulli Naive Bayes (MBNB), and Multinomial Naive Bayes (MNB) classifiers. The macro-F-measure value was 95.1% for the unigram word-level representation approach.
The Polynomial Networks (PNs) classifier was used by [42] for Arabic text classification on the Alj-News dataset. They compared the performance of the PNs classifier with other classifiers, such as SVM, NB, and DT. Their results showed that the performance of the PNs classifier was not the best for all the categories in the dataset, but was very competitive. The authors in [43] claimed to be the first to utilize Logistic Regression (LR) in Arabic text categorization, using the Alj-News dataset. The results of the experiments showed that Logistic Regression is beneficial for ATC.
The effects of eight supervised learning algorithms on Arabic document classification were studied by [44]. Several feature representation approaches were used to extract features from the Abuaiadah dataset [39]. The authors concluded that superior results were obtained when combining an LSVM classifier with the IDFT approach.
Abdelaal et al. [45] proposed an automatic classification model for Hadith. The proposed model was used to organize Arabic Hadith texts into related categories: Sahih, Hasan, Da’if, and Maudu. Several classifiers, such as LSVC, SGD, and LR, were investigated to build the classification model. IG, Chi2, and Gain Ratio (GR) were used as feature selection methods to remove irrelevant features. The outcomes showed that LSVC outperforms the other classifiers. Moreover, in [46], the authors evaluated the Hadith dataset using DT, RF, and NB classifiers, and they isolated redundant features using IG and Chi2. Binary Boolean weighting and TF–IDF were used to extract features. Experimental results demonstrated that the best classifier investigated was DT.
Elnagar et al. [1] performed an extensive analysis to evaluate the effectiveness of Deep Neural Network (DNN) models and a word2vec embedding model on newly constructed large corpora for Arabic document classification. The corpus was SANAD (Single-label Arabic News Articles Dataset), collected by Einea et al. [47]. The evaluation experiments showed the effectiveness of the proposed models on both single-label and multi-label categorization.
Alhaj et al. [48] studied the effects of stemming strategies on ATC. Several classifiers, including NB, KNN, and SVM, were used to build the classification model. Chi2 was used to extract essential features in different sizes. A publicly available dataset, namely, CNN, was used to evaluate the classification model. The outcomes demonstrated that the SVM classifier combined with the ARLStem stemmer outperforms the other classifiers as the number of features increases. Moreover, in [49], the authors studied the effects of stop word removal on several classifiers and feature extraction methods using the CNN public dataset. Chi2 was applied as a feature selection method. The TF–IDF and BoW methods were used to extract features. They concluded that the best results were achieved when removing stop words combined with TF–IDF and the SVM classifier.

This entry is adapted from the peer-reviewed paper 10.3390/fi14070194
