Machine Learning-Based Text Classification Comparison

The growth in textual data associated with the increased usage of online services, and the simplicity of accessing these data, has resulted in a rise in the number of text classification research papers. Text classification has a significant influence on several domains such as news categorization, the detection of spam content, and sentiment analysis. The classification of Turkish text is the research focus of this work, since only a few studies have been conducted in this context. Researchers utilized data obtained from customers’ inquiries submitted to an institution to evaluate the proposed techniques. Classes specified in the institution’s internal procedures are assigned to these inquiries. The Support Vector Machine, Naïve Bayes, Long Short-Term Memory, Random Forest, and Logistic Regression algorithms were used to classify the data. The performance of the various techniques was then analyzed before and after data preparation, and the results were compared. The Long Short-Term Memory technique demonstrated superior effectiveness, achieving an 84% accuracy rate and surpassing the best accuracy among the traditional techniques, which was 78% for the Support Vector Machine. The techniques performed better once the number of categories in the dataset was reduced. Moreover, the findings show that data preparation and coherence between the number of classes and the size of the training set are significant variables influencing the techniques’ performance.

  • Turkish texts
  • machine learning
  • text preprocessing
  • algorithm effectiveness

1. Introduction

Corporations provide a variety of online applications through which consumers can submit grievances to the relevant divisions. Large corporations receive many complaints and inquiries daily. It is critical for people that these inquiries be received and answered, and the individuals who make them want them to be addressed as quickly as possible. Routing these inquiries to the right divisions costs large corporations a great deal of time, and the rate at which inquiries are handled drops proportionately. The misclassification of arriving inquiries lengthens the response time even further [1]. The advancement of Machine Learning (ML) algorithms has resulted in remedy recommendations for everyday difficulties [2]. ML deployment has become unavoidable for large corporations seeking to adapt to and meet customer inquiries [3].
ML often gives systems the capability to learn and improve based on experience without being explicitly programmed [4]. Unsupervised, supervised, semi-supervised, and reinforcement learning are the four main themes of ML algorithms [5]. ML trains on data and discovers how to fulfill jobs using different algorithms. These algorithms try to extract hidden knowledge from enormous amounts of available data and apply it to classification or regression models [6]. Therefore, ML may be of great assistance in text classification [7]. The Support Vector Machine (SVM), Naïve Bayes (NB), Long Short-Term Memory (LSTM), Random Forest (RF), and Logistic Regression (LR) algorithms are some of the ML techniques [8], and these were deployed in this study. Section 3 will go through these algorithms in further depth. The ML workflow typically includes three steps: data preparation, selecting the appropriate ML algorithms and variables, and evaluating and assessing performance [9].
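A minimal sketch of this three-step workflow, assuming scikit-learn and a handful of invented, labeled inquiry texts (the strings, class names, and split settings below are illustrative placeholders, not the study's data or configuration):

```python
# Illustrative sketch of the three-step ML workflow described above:
# (1) prepare the data, (2) select an algorithm, (3) evaluate performance.
# Texts and labels are invented placeholders, not the study's dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "fatura tutarım yanlış hesaplanmış", "faturamı görüntüleyemiyorum",
    "fatura itirazım sonuçlanmadı",
    "internet bağlantım sürekli kopuyor", "modem ışıkları yanmıyor",
    "bağlantı hızı çok düşük",
    "aboneliğimi iptal etmek istiyorum", "sözleşmemi sonlandırmak istiyorum",
    "hattımı kapatmak istiyorum",
]
labels = ["billing"] * 3 + ["technical"] * 3 + ["cancel"] * 3

# Step 1: data preparation, reduced here to TF-IDF vectorization inside a pipeline.
# Step 2: algorithm selection, here a linear SVM (one of the compared classifiers).
model = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("clf", LinearSVC()),
])

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=3, random_state=42, stratify=labels
)
model.fit(X_train, y_train)

# Step 3: performance evaluation on the held-out inquiries.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```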
Many studies on the use of ML for text classification have recently been published, including survey papers (e.g., [2,10,11,12]), comparative analyses using different ML algorithms (e.g., [13,14]), papers focusing on specific languages (e.g., [4,15,16]), papers applying certain ML algorithms to text classification (e.g., [1,17,18]), papers on text classification performance (e.g., [19]), and papers on text classification frameworks (e.g., [6]). However, the literature on text classification in the Turkish context is limited [20,21]. Accordingly, researchers tried to meet this demand by focusing on several ML algorithms for classifying Turkish texts. The intention is that the inquiry text will be routed to the relevant class quickly using ML algorithms, allowing staff to respond to inquiries more quickly. Another concern to consider is which ML classification algorithm is best suited for this task. Classification is challenging in this study because all of the inquiries are written in Turkish, an agglutinative (composable) language [22,23].

2. Text Classification in Turkish Context

Amasyalı and Diri [24] investigated the SVM, NB, and RF algorithms in terms of document type, author name, and author gender. For text classification, they employed the character N-grams approach. The writings of 18 authors (4 female and 14 male), with 35 distinct texts per author covering common interests and sports, were analyzed. The correlation-based feature selection approach was utilized and was found to improve classification accuracy. The best-performing method in author identification was the NB method, and the best-performing algorithm in gender and genre identification was the SVM [24]. Güran et al. [25] used DT-J48, K-Nearest Neighbor (K-NN), and Bayesian probabilistic classifiers to classify text using the N-gram method. They examined 600 text documents obtained online and represented these documents using TF-IDF. They also stemmed the words in the data to reduce the vector space. During data preparation, they converted the text to lowercase, then applied data cleansing and created stop-word lists. Later, they used the information gain approach to perform feature extraction and word splitting. All three techniques achieved their highest performance with unigram words. Among them, the K-NN was the lowest-performing ML algorithm (the K-NN achieved 65.5%, the Bayesian classifier 94%, and J48 75%). According to the authors, the uneven distribution of the training dataset has a detrimental impact on the findings [25]. Uysal and Gunal [26] investigated the effect of data preparation on the classification of news and e-mail texts in English and Turkish. The preparation involved lowercase conversion, stop-word removal, tokenization, and stemming. According to them, classification includes different phases of classification and the extraction and selection of features. The study notes that English is an example of a non-composable language, whereas Turkish is an example of a composable (agglutinative) language that is widely used across the globe. The text was classified using the SVM and measured with the Micro-F1 metric. They concluded that the preparation phase is just as critical as the feature extraction and selection processes [26]. Furthermore, they stated that stop-word removal is also a critical phase. This research also showed that while some preparation operations, such as lowercase conversion, are required irrespective of the language or field, others must be selected according to the language and field of the study [26]. Yıldırım and Yıldız [27] compared the standard bag-of-words approach and artificial neural network approaches for text classification. One of the datasets included seven distinct categories such as sports, politics, and economics. The second dataset consisted of six distinct categories, each with 600 texts. Using cleaning and stop-word procedures, they prepared the data for morphological analysis. The NB was utilized as the classification algorithm [27]. Their findings demonstrated that stop-word removal and morphological analysis had little effect on the outcome. They also emphasized the significance of feature selection in the classification process, employing the Chi-Square and Information Gain techniques [27]. Kuyumcu [28] used fast text classification (fastText) without data preparation, since data preparation is frequently time-consuming, especially in composable languages such as Turkish.
Facebook’s fastText word-embedding-based analyzer eliminates the requirement for data preparation. The fastText algorithm was applied to the Turkish text classification 3600 dataset, and the model was tested against the NB, K-NN, and J48 techniques. According to the results, the best performance was attained by the Multinomial NB classifier, which scored 90.12%. Accordingly, the author concluded that the fastText algorithm is substantially superior to other methods in terms of consistency, without the need for data preparation [28]. Çoban et al. [29] used Deep Learning (DL) to perform sentiment analysis on public Facebook data acquired from Turkish user profiles. With text representations, recurrent neural networks obtained the highest accuracy, 91.6%. Dogru et al. [30] suggested a DL-based classification of news texts using the Doc2vec word-based approach on the Turkish text classification 3600 dataset. DL-based Convolutional Neural Networks (CNN) and classic ML methods such as the NB, SVM, RF, and Gaussian NB algorithms were employed as classification techniques. In the suggested model, the highest result reached by the CNN classifications was 94.17% for the Turkish sample, compared to 96.41% for the English sample [30]. Zulqarnain et al. [31] employed DL to perform question classification in Turkish. They used three different DL algorithms (Gated Recurrent Unit, LSTM, and CNN) together with the word2vec methodology. Word2vec strategies had a major effect on the prediction performance of the different DL algorithms, which achieved an accuracy of 93.7% [31]. Bektaş [32] used text classification tools to analyze 7731 tweets from 13 prominent Turkish economists. The classification findings were then compared across four popular ML algorithms (the SVM, NB, LR, and the integration of LR with the SVM). The results revealed that the success of a text classification task is related to the feature extraction techniques and that the SVM outperforms the other ML methods using unigram feature maps. The integration of the SVM with LR generated the best results (82.9%) [32]. Bozyigit et al. [23] deployed ML to classify customers’ concerns regarding packaged food goods expressed in Turkish. The class of concern was determined using the TF-IDF and word2vec feature extraction algorithms. The results of the LR, NB, K-NN, SVM, RF, and Extreme Gradient Boosting classifiers were compared. The strongest technique was Extreme Gradient Boosting with TF-IDF weighting, which reached an 86% F-measure score [23]. Compared to the TF-IDF method, word2vec-based ML performed poorly in terms of the F-measure. Furthermore, TF-IDF-based ML provides more accurate predictions on the optimized feature subsets determined by the Chi-Square approach, which, when applied to the TF-IDF features, raises the F-measure from 86% to 88% for Extreme Gradient Boosting [23]. Eminagaoglu [33] introduced a similarity metric for classification that can be utilized with word vectors and classification techniques such as the K-NN and k-means methods. The suggested metric was validated against certain global datasets in English and Turkish and might be employed in any applicable method or model for data acquisition and text classification. Karasoy and Ballı [22] performed content-based SMS classification utilizing ML and DL approaches to filter out undesirable texts in Turkish. The features were analyzed using DL and ML, and the results were compared.
As a consequence, the CNN technique was found to be the most successful technique, with a 99.86% classification accuracy rate. Köksal and Yılmaz [21] proposed a technique and considerations for improving text categorization effectiveness with ML. ML methods and state-of-the-art pre-trained text models were used to assess two publicly available Turkish news datasets. They used a variety of ML techniques, including the NB, LR, K-NN, SVM, and RF techniques, as well as a BERT model that was specifically trained for classifying Turkish text. The results demonstrated that the technique outperformed earlier F1-score-based news classification experiments and achieved 96% accuracy [21]. Yildiz [20] presented a data distribution algorithm that addresses the data imbalance issue to improve text categorization success. The suggested algorithm was evaluated using LSTM on a very large Turkish dataset containing 263,168 articles divided into 15 groups. To compare the parameters, the model was trained with and without the suggested algorithm. The proposed algorithm produced roughly 3.5% higher accuracy than the standard approach and also revealed a more-than-three-point rise in the F1-score [20]. Researchers prioritized data preparation by normalizing the data, morphologically analyzing the terms in the corpus into their basic forms, generating the stop-word list by examining and removing the most repeated word groupings, simplifying the repeated categories using the k-means technique during the preparation phase, and reducing the number of categories. The dataset was trained using the SVM, NB, LSTM, RF, and LR methods. This study can be considered an extension and validation of [21,22,23]. The effectiveness of the various strategies was then evaluated both before and after data preparation. It is also important to note that the current study is part of and based on [34].
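As an illustration of this kind of comparison, the following sketch (a hypothetical example assuming scikit-learn; the inquiry texts, class names, and tiny stop-word list are invented placeholders rather than the institutional dataset or the study's actual preparation pipeline) evaluates several of the classical classifiers with and without a simple preparation step:

```python
# Hypothetical sketch: comparing classical classifiers with and without a simple
# preparation step (lowercasing plus stop-word removal). All texts, labels, and
# the stop-word list are invented placeholders, not the study's data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "fatura tutarım çok yüksek geldi", "faturamı ödeyemiyorum",
    "fatura detayımı göremiyorum", "faturama itiraz etmek istiyorum",
    "internet bağlantım çok yavaş", "bağlantı sürekli kopuyor",
    "modem arızalı çalışmıyor", "sinyal seviyesi çok düşük",
    "aboneliğimi iptal etmek istiyorum", "sözleşmemi feshetmek istiyorum",
    "hattımı kapatmak istiyorum", "iptal talebim sonuçlanmadı",
]
labels = ["billing"] * 4 + ["technical"] * 4 + ["cancel"] * 4

# A tiny, invented Turkish stop-word list, used only for illustration.
stop_words = ["çok", "bir", "ve", "için"]

classifiers = {
    "SVM": LinearSVC(),
    "NB": MultinomialNB(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

for prepared in (False, True):
    for name, clf in classifiers.items():
        pipe = Pipeline([
            ("tfidf", TfidfVectorizer(
                lowercase=True,
                stop_words=stop_words if prepared else None,
            )),
            ("clf", clf),
        ])
        scores = cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy")
        print(f"prepared={prepared} {name}: mean accuracy = {scores.mean():.2f}")
```

Under such a setup, the effect of preparation appears directly as the difference between the two sets of cross-validated accuracy scores.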

3. Natural Language Processing

NLP is an AI discipline that deals with processing human language in a computer-readable format. NLP, which was designed to help computers comprehend the language humans use to communicate, has become very common [13]. The ease with which people’s speech may now be accessed via social media and the proliferation of communication outlets such as radio and television have enlarged the usage of NLP. NLP is required in text classification to analyze user data. Extracting meaning from text data is a difficult step, and NLP is practiced in several stages. First, in lexical analysis, sentences are divided into smaller units called tokens. The structure, or syntax, of all tokens is then considered in the syntactic analysis process [12], which tests whether the syntax is correct. In semantic analysis, the interpretations of phrases are inferred using the preceding procedures. The NLP stages are completed after turning the data into output data. In this work, a morphology analysis tool, a key element of NLP, was employed in data preparation. Turkish morphological analysis was carried out using the Zemberek library. Zemberek performs root-dictionary-based parsing and calculates the probability of the root. The library initially scans the stored binary root record before appending the relevant suffixes to the root. The root of a term in Turkish can take several surface forms, and the altered (distorted) forms are added to the tree for such cases. The Turkish term “kitaba”, for instance, differs from the term “kitap”, which is its root; as a result, the distorted form “kitab” was added to the tree [35]. TRMOR is another morphological analyzer for Turkish, developed by Kayabas et al. [36]. TRMOR’s performance was evaluated using 1000 words chosen at random from Wikipedia, and it achieved an accuracy value of 94.12%. TRMOR first binds stems to suffix morphemes, assessing all possible connections, and then maps the output string into the correct surface form using morphological criteria [36]. Since Turkish is an agglutinative (composable) language, compound terms are difficult to parse. For example, while “su” is the compound marker in “acemborusu”, “i” is the compound marker in “ayçiçeği”. According to the study, not every compound term can currently be handled, although this may be achievable in future work [36].
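The full morphological analysis would be delegated to a tool such as Zemberek or TRMOR, whose APIs are not reproduced here. The following sketch covers only the simpler lexical-analysis step discussed above (Turkish-aware lowercasing followed by tokenization); the example sentence and the regular expression are illustrative assumptions:

```python
# Minimal sketch of the lexical-analysis (tokenization) stage with Turkish-aware
# lowercasing. Python's default str.lower() maps "I" to "i", but in Turkish
# "I" lowercases to "ı" and "İ" lowercases to "i", so those two characters are
# handled before the generic lowercasing step.
import re

def turkish_lower(text: str) -> str:
    return text.replace("İ", "i").replace("I", "ı").lower()

def tokenize(text: str) -> list[str]:
    # Split on any run of characters that is not a (Turkish) letter.
    return [t for t in re.split(r"[^a-zçğıöşüâî]+", turkish_lower(text)) if t]

print(tokenize("Kitaba İLİŞKİN sorularınızı iletebilirsiniz."))
# -> ['kitaba', 'ilişkin', 'sorularınızı', 'iletebilirsiniz']
```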

4. Machine Learning

ML refers to teaching a set of algorithms to a machine (computer) using data, where there are no set rules [12]. The machine learns by finding patterns or commonalities in the dataset; ML algorithms cannot be employed when there are no patterns. The machine parses the data itself to determine what action to take. Just as humans learn by repeatedly practicing anything new, the quantity of data available to a computer is connected to a more efficient learning process [2]. Data are essential in ML, and it has become simpler to obtain data as the internet has improved, increasing the popularity of ML. According to the approaches utilized, ML algorithms are classified as unsupervised or supervised [22]. As mentioned in the introduction, several supervised algorithms were utilized in this work, including the K-NN, LR, SVM, RF, and NB algorithms. Furthermore, the LSTM method is employed as an unsupervised DL algorithm [2].

4.1. Supervised Learning

Supervised learning is a type of ML activity wherein “supervised” refers to data that have been labeled [13]. A sample dataset is used by the ML model, and the primary objective is to obtain a description of the data classes. Algorithms for supervised learning discover the connections and correlations between input and output and anticipate the next output. That is, the data to be utilized in supervised learning are linked to the classes associated with them; these are known as labeled data [10]. Classification is a systematic method for generating classifications by predicting from training examples based on previous data labels, which is why this method is known as supervised learning. Examples of such classifiers include Rule-Based classifiers, DT classifiers, Neural Network classifiers, Neuro-Fuzzy classifiers, the SVM, and so on. In this paper, the SVM, NB, LR, and RF classifiers were utilized with labeled textual data.
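A minimal sketch of this idea, assuming scikit-learn and invented (text, label) pairs rather than the study's data: the classifier only ever sees labeled examples during training and then assigns one of the learned classes to an unseen text.

```python
# Minimal supervised-learning sketch: the model is fitted on (text, label) pairs
# and then predicts a label for a previously unseen text. All data are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

labeled_texts = [
    "kargom hala gelmedi", "siparişim nerede kaldı",
    "ürün hasarlı geldi", "paket yırtık teslim edildi",
]
labels = ["delivery", "delivery", "damage", "damage"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(labeled_texts, labels)         # learn from the labeled examples
print(clf.predict(["kargom nerede"]))  # -> ['delivery']
```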

4.2. Unsupervised Learning

In unsupervised learning, computers execute the learning process by detecting patterns in the data. There is no particular output, or associated label, in the dataset (i.e., there are no data labels). The computer builds models by looking for similarities and patterns in the inputs [23]. Unsupervised ML seeks relationships among datasets, which might be positive or negative; in other words, it discovers patterns of similarities or distinctions across datasets. Because there is no sample organizer or labeled dataset, this method is referred to as unsupervised learning [6]. Unsupervised ML methods that have been utilized include the Hidden Markov Model, k-means clustering, the LSTM method, and Singular-Value Decomposition. The current study employs both the k-means algorithm and the LSTM method, which is also regarded as a DL algorithm [12].
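As a concrete illustration of the k-means step mentioned above, the following sketch (assuming scikit-learn; the category names are invented placeholders, not the institution's actual classes) groups similar, possibly duplicated category labels without any supervision:

```python
# Unsupervised sketch: grouping similar, possibly duplicated category names with
# k-means over character n-gram TF-IDF vectors. The category names are invented.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

categories = [
    "fatura itirazı", "fatura itiraz talebi", "fatura sorunu",
    "internet arızası", "internet arıza bildirimi",
    "abonelik iptali", "abonelik iptal talebi",
]

# Character n-grams tolerate small spelling and suffix differences between labels.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vectorizer.fit_transform(categories)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for cluster_id, name in sorted(zip(kmeans.labels_, categories)):
    print(cluster_id, name)
```

Merging near-duplicate labels in this way is one plausible way to reduce the number of categories before training, in the spirit of the preparation phase described earlier.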

References

  1. Ajitha, P.; Sivasangari, A.; Immanuel Rajkumar, R.; Poonguzhali, S. Design of text sentiment analysis tool using feature extraction based on fusing machine learning algorithms. J. Intell. Fuzzy Syst. 2021, 40, 6375–6383.
  2. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep learning-based text classification: A comprehensive review. ACM Comput. Surv. (CSUR) 2021, 54, 1–40.
  3. Srinivasan, S.; Ravi, V.; Alazab, M.; Ketha, S.; Al-Zoubi, A.M.; Kotti Padannayil, S. Spam emails detection based on distributed word embedding with deep learning. In Machine Intelligence and Big Data Analytics for Cybersecurity Applications. Studies in Computational Intelligence; Maleh, Y., Shojafar, M., Alazab, M., Baddi, Y., Eds.; Springer: Cham, Switzerland, 2021; Volume 919, pp. 161–189.
  4. Akhter, M.P.; Jiangbin, Z.; Naqvi, I.R.; Abdelmajeed, M.; Fayyaz, M. Exploring deep learning approaches for Urdu text classification in product manufacturing. Enterp. Inf. Syst. 2022, 16, 223–248.
  5. Sarker, I.H. Machine learning: Algorithms, real-world applications and research directions. SN Comput. Sci. 2021, 2, 160.
  6. Mohammed, A.; Kora, R. An effective ensemble deep learning framework for text classification. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 8825–8837.
  7. Qasim, R.; Bangyal, W.H.; Alqarni, M.A.; Ali Almazroi, A. A fine-tuned BERT-based transfer learning approach for text classification. J. Healthc. Eng. 2022, 2022, 3498123.
  8. Thirumoorthy, K.; Muneeswaran, K. Feature selection for text classification using machine learning approaches. Natl. Acad. Sci. Lett. 2022, 45, 51–56.
  9. Luo, X. Efficient english text classification using selected machine learning techniques. Alex. Eng. J. 2021, 60, 3401–3409.
  10. Altınel, B.; Ganiz, M.C. Semantic text classification: A survey of past and recent advances. Inf. Process. Manag. 2018, 54, 1129–1153.
  11. Kadhim, A.I. Survey on supervised machine learning techniques for automatic text classification. Artif. Intell. Rev. 2019, 52, 273–292.
  12. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A survey on text classification: From traditional to deep learning. ACM Trans. Intell. Syst. Technol. (TIST) 2022, 13, 1–41.
  13. Hartmann, J.; Huppertz, J.; Schamp, C.; Heitmann, M. Comparing automated text classification methods. Int. J. Res. Mark. 2019, 36, 20–38.
  14. Shah, K.; Patel, H.; Sanghvi, D.; Shah, M. A comparative analysis of logistic regression, random forest and KNN models for the text classification. Augment. Hum. Res. 2020, 5, 12.
  15. El Rifai, H.; Al Qadi, L.; Elnagar, A. Arabic text classification: The need for multi-labeling systems. Neural Comput. Appl. 2022, 34, 1135–1159.
  16. Elnagar, A.; Al-Debsi, R.; Einea, O. Arabic text classification using deep learning models. Inf. Process. Manag. 2020, 57, 102121.
  17. Dai, Y.; Guo, W.; Chen, X.; Zhang, Z. Relation classification via LSTMs based on sequence and tree structure. IEEE Access 2018, 6, 64927–64937.
  18. Yuvaraj, N.; Chang, V.; Gobinathan, B.; Pinagapani, A.; Kannan, S.; Dhiman, G.; Rajan, A.R. Automatic detection of cyberbullying using multi-feature based artificial intelligence with deep decision tree classification. Comput. Electr. Eng. 2021, 92, 107186.
  19. Yadav, B.P.; Ghate, S.; Harshavardhan, A.; Jhansi, G.; Kumar, K.S.; Sudarshan, E. Text categorization performance examination using machine learning algorithms. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Warangal, India, 9–10 October 2020; IOP Publishing: Warangal, India, 2020; p. 022044.
  20. Yildiz, B. Efficient text classification with deep learning on imbalanced data improved with better distribution. Turk. J. Sci. Technol. 2022, 17, 89–98.
  21. Köksal, Ö.; Yılmaz, E.H. Improving automated Turkish text classification with learning-based algorithms. Concurr. Comput. Pract. Exp. 2022, 34, e6874.
  22. Karasoy, O.; Ballı, S. Spam SMS detection for Turkish language with deep text analysis and deep learning methods. Arab. J. Sci. Eng. 2022, 47, 9361–9377.
  23. Bozyigit, F.; Dogan, O.; Kilinc, D. Categorization of customer complaints in food industry using machine learning approaches. J. Intell. Syst. Theory Appl. 2022, 5, 85–91.
  24. Amasyalı, M.F.; Diri, B. Automatic Turkish text categorization in terms of author, genre and gender. In Natural Language Processing and Information Systems. NLDB 2006. Lecture Notes in Computer Science; Kop, C., Fliedl, G., Mayr, H.C., Métais, E., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3999, pp. 221–226.
  25. Güran, A.; Akyokuş, S.; Bayazıt, N.G.; Gürbüz, M.Z. Turkish text categorization using n-gram words. In Proceedings of the International Symposium on Innovations in Intelligent Systems and Applications (INISTA 2009), Trabzon, Turkey, 29 June–1 July 2009; IEEE: Trabzon, Turkey, 2009; pp. 369–373.
  26. Uysal, A.K.; Gunal, S. The impact of preprocessing on text classification. Inf. Process. Manag. 2014, 50, 104–112.
  27. Yıldırım, S.; Yıldız, T. A comparative analysis of text classification for Turkish language. Pamukkale Univ. J. Eng. Sci. 2018, 24, 879–886.
  28. Kuyumcu, B.; Aksakalli, C.; Delil, S. An automated new approach in fast text classification (fastText): A case study for Turkish text classification without pre-processing. In Proceedings of the 3rd International Conference on Natural Language Processing and Information Retrieval, ACM, Tokushima, Japan, 28–30 June 2019; pp. 1–4.
  29. Çoban, Ö.; Özel, S.A.; İnan, A. Deep learning-based sentiment analysis of Facebook data: The case of Turkish users. Comput. J. 2021, 64, 473–499.
  30. Dogru, H.B.; Tilki, S.; Jamil, A.; Hameed, A.A. Deep learning-based classification of news texts using doc2vec model. In Proceedings of the 1st International Conference on Artificial Intelligence and Data Analytics (CAIDA), Riyadh, Saudi Arabia, 6–7 April 2021; IEEE: Riyadh, Saudi Arabia, 2021; pp. 91–96.
  31. Zulqarnain, M.; Alsaedi, A.K.Z.; Ghazali, R.; Ghouse, M.G.; Sharif, W.; Husaini, N.A. A comparative analysis on question classification task based on deep learning approaches. PeerJ Comput. Sci. 2021, 7, e570.
  32. Bektaş, J. Detection of economy-related Turkish tweets based on machine learning approaches. In Data Mining Approaches for Big Data and Sentiment Analysis in Social Media; El-Latif, A.A.A., Ed.; IGI Global: Hershey, PA, USA, 2022; pp. 171–195.
  33. Eminagaoglu, M. A new similarity measure for vector space models in text classification and information retrieval. J. Inf. Sci. 2022, 48, 463–476.
  34. Erkaya, A.E. Text Classification based on Organizational Data Using Machine Learning; Ankara Yıldırım Beyazıt Üniversitesi Fen Bilimleri Enstitüsü: Keçiören/Ankara, Türkiye, 2019.
  35. Akın, A.A.; Akın, M.D. Zemberek, an open source NLP framework for Turkic languages. Structure 2007, 10, 1–5.
  36. Kayabaş, A.; Schmid, H.; Topcu, A.E.; Kiliç, Ö. TRMOR: A finite-state-based morphological analyzer for Turkish. Turk. J. Electr. Eng. Comput. Sci. 2019, 27, 3837–3851.