In recent years, the number of complex documents and texts has grown exponentially, requiring a deeper understanding of machine learning methods to accurately classify text in many applications. Many machine learning approaches have achieved impressive results in natural language processing. The success of these learning algorithms relies on their capacity to capture complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification remains a challenge for researchers. This paper presents a brief overview of text classification algorithms, covering different text feature extraction methods, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and its application to real-world problems are discussed.
The classification task is one of the most indispensable problems in machine learning. As text and document data sets proliferate, the development and documentation of supervised machine learning algorithms become an imperative issue, especially for text classification. Building a better document categorization system for this information requires a discerning understanding of these algorithms. Moreover, existing text classification algorithms work more efficiently when feature extraction methods are well understood and evaluated correctly. Currently, text classification algorithms can be chiefly organized in the following manner: (I) Feature extraction methods, such as term frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings (e.g., Word2Vec, contextualized word representations, Global Vectors for Word Representation (GloVe), and FastText), are widely used in both academic and commercial applications. In this paper, we address these techniques. Since text and document cleaning can improve the accuracy and robustness of an application, we also describe the basic methods of the text pre-processing step. (II) Dimensionality reduction methods, such as principal component analysis (PCA), linear discriminant analysis (LDA), non-negative matrix factorization (NMF), random projection, autoencoders, and t-distributed Stochastic Neighbor Embedding (t-SNE), can reduce the time and memory complexity of existing text classification algorithms; the most common of these methods are presented in a separate section. (III) Existing classification algorithms, such as the Rocchio algorithm, bagging and boosting, logistic regression (LR), the Naïve Bayes Classifier (NBC), k-nearest neighbors (KNN), Support Vector Machines (SVM), decision tree classifiers (DTC), random forests, conditional random fields (CRF), and deep learning, are the primary focus of this paper.
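As a brief illustration of the feature-extraction step mentioned above, the following is a minimal, library-free TF-IDF sketch. The toy corpus and the unsmoothed IDF formula (log of total documents over document frequency) are illustrative assumptions for this sketch, not the exact formulation used by any particular library:

```python
import math
from collections import Counter

# A tiny illustrative corpus (three short "documents").
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter()
for doc in tokenized:
    for term in set(doc):
        df[term] += 1

def tfidf(term, doc):
    """TF-IDF weight of `term` in `doc`: raw count times log(N / df)."""
    tf = doc.count(term)
    idf = math.log(N / df[term])
    return tf * idf

# "the" occurs twice in the first document but appears in two of the
# three documents, so its IDF is low; "cat" appears in only one
# document and ends up with the higher overall weight.
print(tfidf("cat", tokenized[0]))
print(tfidf("the", tokenized[0]))
```

This down-weighting of corpus-wide common terms relative to document-specific rare terms is the core intuition behind TF-IDF as a feature-extraction method.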
(IV) Evaluation methods, such as accuracy, Fβ, the Matthews correlation coefficient (MCC), receiver operating characteristics (ROC), and area under the curve (AUC), are explained; with these metrics, a text classification algorithm can be evaluated. (V) Critical limitations of each component of the text classification pipeline (i.e., feature extraction, dimensionality reduction, classification algorithms, and evaluation) are addressed for each technique, and the most common text classification algorithms are compared in this section. (VI) Finally, the usage of text classification in applications and in support of other domains, such as law and medicine, is covered in a separate section. This survey discusses recent techniques and trends in text classification algorithms.
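To make the evaluation metrics concrete, the following sketch computes accuracy, Fβ, and MCC from a binary confusion matrix. The counts (tp, fp, fn, tn) are hypothetical values chosen for illustration:

```python
import math

# Hypothetical binary confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 5, 45

# Accuracy: fraction of all predictions that are correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)

precision = tp / (tp + fp)
recall = tp / (tp + fn)

def f_beta(beta):
    """F-beta score: weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 favors precision.
    """
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Matthews correlation coefficient: a balanced measure in [-1, 1]
# that uses all four confusion-matrix cells.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(accuracy)      # → 0.85
print(f_beta(1.0))   # F1 score
print(mcc)
```

Unlike accuracy, MCC remains informative under class imbalance because it penalizes both false positives and false negatives symmetrically, which is why it is often preferred for skewed text classification data sets.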
This entry is adapted from the peer-reviewed paper 10.3390/info10040150