
In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to accurately classify text in many applications. Many machine learning approaches have achieved state-of-the-art results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. This paper gives a brief overview of text classification algorithms, covering different text feature extraction methods, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and its application to real-world problems are discussed.

  • text classification
  • text mining
  • text representation
  • text categorization
  • text analysis
  • document classification

Text Classification Algorithms: A Survey

docs/pic/WordArt.png

Introduction

docs/pic/OverviewTextClassification.png

Text and Document Feature Extraction


Text feature extraction and pre-processing are very significant for classification algorithms. This section starts with text cleaning, since most documents contain a lot of noise, and then discusses two primary methods of text feature extraction: weighted words and word embeddings.

Text Cleaning and Pre-processing

In Natural Language Processing (NLP), most texts and documents contain many words that are redundant for text classification, such as stop words, misspellings, and slang. In this section, we briefly explain some techniques and methods for cleaning and pre-processing text documents. In many algorithms, such as statistical and probabilistic learning methods, noise and unnecessary features can negatively affect the overall performance, so the elimination of these features is extremely important.

Tokenization

Tokenization is the process of breaking down a stream of text into words, phrases, symbols, or any other meaningful elements called tokens. The main goal of this step is to extract individual words in a sentence. Along with text classification, in text mining, it is necessary to incorporate a parser in the pipeline which performs the tokenization of the documents; for example:

sentence:

After sleeping for four hours, he decided to sleep for another four

In this case, the tokens are as follows:

{'After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided', 'to', 'sleep', 'for', 'another', 'four'}

Here is Python code for tokenization:

from nltk.tokenize import word_tokenize

# requires the NLTK 'punkt' tokenizer models: nltk.download('punkt')
text = "After sleeping for four hours, he decided to sleep for another four"
tokens = word_tokenize(text)
print(tokens)
Stop words

Text and document classification over social media, such as Twitter and Facebook, is usually affected by the noisy nature (abbreviations, irregular forms) of the text corpora. Stop words are frequently occurring words (such as "the", "is", and "a") that carry little discriminative information for classification and are therefore commonly removed during pre-processing.

Here is an example from GeeksforGeeks:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# requires the NLTK data packages 'punkt' and 'stopwords'
# (nltk.download('punkt'); nltk.download('stopwords'))

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# keep only the tokens that are not in the stop-word list
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)

Output:

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop',
'words', 'filtration', '.']
Capitalization

Sentences can contain a mixture of uppercase and lowercase letters, and multiple sentences make up a text document. To reduce the problem space, the most common approach is to reduce everything to lowercase. This brings all words in a document into the same space, but it often changes the meaning of some words, such as "US" becoming "us", where the first refers to the United States of America and the second is a pronoun. To address this, slang and abbreviation converters can be applied.

text = "The United States of America (USA) or America, is a federal republic composed of 50 states"
print(text)
print(text.lower())

Output:

"The United States of America (USA) or America, is a federal republic composed of 50 states"
"the united states of america (usa) or america, is a federal republic composed of 50 states"
Slangs and Abbreviations

Slang and abbreviations can cause problems during the pre-processing steps. An abbreviation is a shortened form of a word, such as SVM standing for Support Vector Machine. Slang is informal language in which words or phrases take on non-literal meanings; for example, "lost the plot" essentially means "gone mad". A common method for dealing with these words is converting them to formal language, as shown in the sketch below.
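A minimal dictionary-based converter sketch; the lookup table below is a hypothetical example, not an exhaustive resource:

import re

# hypothetical lookup table; a real system would use a much larger curated resource
ABBREVIATIONS = {
    "svm": "support vector machine",
    "u": "you",
    "btw": "by the way",
}

def expand_abbreviations(text):
    # replace each known abbreviation with its formal form, token by token
    words = re.findall(r"\w+|\S", text.lower())
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

print(expand_abbreviations("btw, the SVM baseline is strong"))
# -> "by the way , the support vector machine baseline is strong"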

Noise Removal

Another part of text cleaning as a pre-processing step is noise removal. Text documents generally contain characters such as punctuation or special characters, which are not necessary for text mining or classification purposes. Although punctuation is critical to understanding the meaning of a sentence, it can affect classification algorithms negatively.
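A minimal sketch of punctuation and special-character removal with a regular expression (whether to keep punctuation is a design choice that depends on the downstream model):

import re

def remove_noise(text):
    # keep only letters, digits, and whitespace; collapse repeated spaces
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_noise("He said: 'the price is $12.50!!'"))
# -> "He said the price is 12 50"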

Spelling Correction

An optional part of the pre-processing step is correcting misspelled words. Different techniques have been introduced to tackle this issue, such as hashing-based and context-sensitive spelling correction, or spelling correction using a trie and the Damerau-Levenshtein distance over bigrams.
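As an illustration of the distance-based idea, here is a minimal sketch that picks the closest word from a small hypothetical vocabulary by Levenshtein (edit) distance; production systems use far larger dictionaries and contextual information:

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

VOCAB = ["sleeping", "classification", "language", "hours"]   # hypothetical dictionary

def correct(word):
    # return the vocabulary word with the smallest edit distance
    return min(VOCAB, key=lambda w: levenshtein(word, w))

print(correct("langauge"))   # -> "language"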

Stemming

Text stemming modifies a word to obtain its variants using different linguistic processes such as affixation (the addition of affixes). For example, the stem of the word "studying" is "study", to which the suffix -ing is added.

Here is an example of Stemming from NLTK
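A minimal sketch using NLTK's PorterStemmer (one of several stemmers NLTK provides):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["studying", "studies", "sleeping", "hours"]:
    print(word, "->", stemmer.stem(word))
# studying -> studi, studies -> studi, sleeping -> sleep, hours -> hour
# note: the Porter stemmer produces truncated stems such as "studi" rather than dictionary words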

Lemmatization

Text lemmatization is the process of eliminating the redundant prefixes or suffixes of a word and extracting the base word (lemma).

from nltk.stem import WordNetLemmatizer

# requires the NLTK data package 'wordnet': nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))   # -> cat


Comparison of Feature Extraction Techniques
Weighted Words
Advantages:
  • Easy to compute
  • Easy to compute the similarity between two documents
  • Basic metric to extract the most descriptive terms in a document
  • Works with unknown words (e.g., new words in languages)
Limitations:
  • Does not capture position in the text (syntax)
  • Does not capture meaning in the text (semantics)
  • Common words affect the results (e.g., "am", "is", etc.)

TF-IDF
Advantages:
  • Easy to compute
  • Easy to compute the similarity between two documents
  • Basic metric to extract the most descriptive terms in a document
  • Common words do not affect the results, due to IDF (e.g., "am", "is", etc.)
Limitations:
  • Does not capture position in the text (syntax)
  • Does not capture meaning in the text (semantics)

Word2Vec
Advantages:
  • Captures the position of words in the text (syntax)
  • Captures meaning in the words (semantics)
Limitations:
  • Cannot capture the meaning of a word from its context (fails to capture polysemy)
  • Cannot capture out-of-vocabulary words

GloVe (Pre-Trained)
Advantages:
  • Captures the position of words in the text (syntax)
  • Captures meaning in the words (semantics)
  • Trained on a huge corpus
Limitations:
  • Cannot capture the meaning of a word from its context (fails to capture polysemy)
  • Memory consumption for storage
  • Cannot capture out-of-vocabulary words

GloVe (Trained)
Advantages:
  • Very straightforward to, e.g., enforce the word vectors to capture sub-linear relationships in the vector space (performs better than Word2Vec)
  • Gives lower weight to highly frequent word pairs, such as stop words like "am", "is", etc., so they do not dominate training
Limitations:
  • Memory consumption for storage
  • Needs a huge corpus to learn
  • Cannot capture out-of-vocabulary words
  • Cannot capture the meaning of a word from its context (fails to capture polysemy)

FastText
Advantages:
  • Works for rare words (their character n-grams are still shared with other words)
  • Solves out-of-vocabulary words with character-level n-grams
Limitations:
  • Cannot capture the meaning of a word from its context (fails to capture polysemy)
  • Memory consumption for storage
  • Computationally more expensive than GloVe and Word2Vec

Contextualized Word Representations
Advantages:
  • Captures the meaning of a word from its context (incorporates context, handling polysemy)
  • Improves performance notably on downstream tasks
Limitations:
  • Memory consumption for storage
  • Computationally more expensive than the other approaches
  • Needs another word embedding for all LSTM and feedforward layers
  • Cannot capture out-of-vocabulary words
  • Works only at the sentence and document level (it cannot work for individual words)
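As a brief illustration of the embedding side of this comparison, here is a minimal Word2Vec sketch using gensim (assuming gensim >= 4, where the dimensionality parameter is called vector_size; the tiny corpus is purely illustrative):

from gensim.models import Word2Vec

# toy corpus: each document is already tokenized into a list of words
corpus = [
    ["text", "classification", "with", "word", "embeddings"],
    ["word", "embeddings", "capture", "semantics"],
    ["documents", "are", "classified", "by", "their", "words"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["word"].shape)          # (50,) dense vector for the token "word"
print(model.wv.most_similar("word"))   # nearest tokens in the learned vector space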

 

Dimensionality Reduction


 

Principal Component Analysis (PCA)

Principal component analysis (PCA) is the most popular technique in multivariate analysis and dimensionality reduction. PCA is a method to identify a subspace in which the data approximately lie; it finds new variables that are uncorrelated and maximize the variance, so as to preserve as much variability as possible.
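A minimal sketch with scikit-learn's PCA (PCA requires a dense matrix, so the TF-IDF vocabulary is capped here; for large sparse matrices, TruncatedSVD is the usual alternative):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

docs = fetch_20newsgroups(subset='train').data
X = TfidfVectorizer(max_features=2000).fit_transform(docs).toarray()

pca = PCA(n_components=100)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # (11314, 2000) -> (11314, 100)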

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is another commonly used technique for data classification and dimensionality reduction. LDA is particularly helpful where the within-class frequencies are unequal, and its performance has been evaluated on randomly generated test data. Class-dependent and class-independent transformations are two approaches to LDA, in which the ratio of between-class variance to within-class variance and the ratio of overall variance to within-class variance are used, respectively.
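A minimal supervised LDA sketch with scikit-learn (LinearDiscriminantAnalysis also needs dense features, and the number of components is at most the number of classes minus one):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

train = fetch_20newsgroups(subset='train')
X = TfidfVectorizer(max_features=2000).fit_transform(train.data).toarray()

lda = LinearDiscriminantAnalysis(n_components=19)   # 20 classes -> at most 19 components
X_lda = lda.fit_transform(X, train.target)
print(X_lda.shape)   # (11314, 19)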

Non-negative Matrix Factorization (NMF)

The following example reduces TF-IDF features of the 20 Newsgroups dataset with non-negative matrix factorization (NMF):

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.decomposition import NMF


def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):
    vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)
    X_train = vectorizer_x.fit_transform(X_train).toarray()
    X_test = vectorizer_x.transform(X_test).toarray()
    print("tf-idf with", str(np.array(X_train).shape[1]), "features")
    return (X_train, X_test)


from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

X_train,X_test = TFIDF(X_train,X_test)



NMF_ = NMF(n_components=2000)
X_train_new = NMF_.fit_transform(X_train)
X_test_new = NMF_.transform(X_test)

print("train with old features: ",np.array(X_train).shape)
print("train with new features:" ,np.array(X_train_new).shape)

print("test with old features: ",np.array(X_test).shape)
print("test with new features:" ,np.array(X_test_new))

Output:

tf-idf with 75000 features
train with old features:  (11314, 75000)
train with new features: (11314, 2000)
test with old features:  (7532, 75000)
test with new features: (7532, 2000)



Random Projection

Random projection, or random feature, is a dimensionality reduction technique mostly used for very large datasets or very high-dimensional feature spaces. Texts and documents, especially with weighted feature extraction, can contain a huge number of underlying features. Many researchers have applied random projection to text data for text mining, text classification, and/or dimensionality reduction. Some random projection techniques are reviewed below.

docs/pic/Random%20Projection.png
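A minimal sketch with scikit-learn's random projection transformers (SparseRandomProjection works directly on sparse TF-IDF matrices; GaussianRandomProjection is the dense Gaussian variant):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection

docs = fetch_20newsgroups(subset='train').data
X = TfidfVectorizer(max_features=75000).fit_transform(docs)   # sparse matrix

rp = SparseRandomProjection(n_components=2000, random_state=0)
X_projected = rp.fit_transform(X)
print(X.shape, "->", X_projected.shape)   # (11314, 75000) -> (11314, 2000)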

Autoencoder

Autoencoder is a neural network technique that is trained to attempt to map its input to its output. As a dimensionality reduction method, the autoencoder has achieved great success via the powerful representational ability of neural networks. The main idea is that a hidden layer between the input and output layers with fewer neurons can be used to reduce the dimension of the feature space. Especially for texts, documents, and sequences that contain many features, autoencoders can help process data faster and more efficiently.

docs/pic/Autoencoder.png
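A minimal Keras sketch of such an autoencoder (the sizes are hypothetical and follow the TF-IDF example above; the encoder half is what produces the reduced representation):

from keras.layers import Input, Dense
from keras.models import Model

input_dim, encoding_dim = 75000, 2000    # hypothetical: TF-IDF input size and bottleneck size

inputs = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(inputs)      # bottleneck hidden layer
decoded = Dense(input_dim, activation='sigmoid')(encoded)     # reconstruction of the input

autoencoder = Model(inputs, decoded)   # trained to reproduce its own input
encoder = Model(inputs, encoded)       # reused later to extract the reduced features
autoencoder.compile(optimizer='adam', loss='mse')

# autoencoder.fit(X_train, X_train, epochs=10, batch_size=128)
# X_train_reduced = encoder.predict(X_train)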

T-distributed Stochastic Neighbor Embedding (t-SNE)

t-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique for embedding high-dimensional data, mostly used for visualization in a low-dimensional space. This approach is based on the stochastic neighbor embedding (SNE) work of G. Hinton and S. T. Roweis. SNE works by converting the high-dimensional Euclidean distances into conditional probabilities which represent similarities.
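A minimal scikit-learn sketch that projects document vectors to 2-D for visualization (the random matrix below is a toy stand-in for reduced document vectors, e.g., the NMF output above):

import numpy as np
from sklearn.manifold import TSNE

# toy stand-in for reduced document vectors
X_reduced = np.random.RandomState(0).rand(200, 50)

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X_reduced)
print(X_embedded.shape)   # (200, 2), ready for a 2-D scatter plot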

 

docs/pic/TSNE.png

Text Classification Techniques


 

Rocchio Algorithm
Advantages:
  • Easy to implement
  • Computationally very cheap
  • Relevance feedback mechanism (benefits ranking documents as not relevant)
Disadvantages:
  • The user can only retrieve a few relevant documents
  • Rocchio often misclassifies multimodal classes
  • This technique is not very robust
  • The linear combination in this algorithm is not good for multi-class datasets

Boosting and Bagging
Advantages:
  • Improves stability and accuracy (takes advantage of ensemble learning, where multiple weak learners outperform a single strong learner)
  • Reduces variance, which helps to avoid overfitting
Disadvantages:
  • Computational complexity
  • Loss of interpretability (if the number of models is high, understanding the model is very difficult)
  • Requires careful tuning of different hyper-parameters

Logistic Regression
Advantages:
  • Easy to implement
  • Does not require too many computational resources
  • Does not require input features to be scaled (pre-processing)
  • Does not require any tuning
Disadvantages:
  • Cannot solve non-linear problems
  • Prediction requires each data point to be independent
  • Attempts to predict outcomes based on a set of independent variables
Naive Bayes Classifier
Advantages:
  • Works very well with text data
  • Easy to implement
  • Fast compared to other algorithms
Disadvantages:
  • Makes a strong assumption about the shape of the data distribution
  • Limited by data scarcity: for any possible value in the feature space, a likelihood value must be estimated from frequencies

K-Nearest Neighbor
Advantages:
  • Effective for text datasets
  • Non-parametric
  • Considers more local characteristics of a text or document
  • Naturally handles multi-class datasets
Disadvantages:
  • Computationally very expensive
  • Difficult to find the optimal value of k
  • Large search problem to find the nearest neighbors
  • Finding a meaningful distance function is difficult for text datasets

Support Vector Machine (SVM)
Advantages:
  • Can model non-linear decision boundaries
  • Performs similarly to logistic regression when the data is linearly separable
  • Robust against overfitting (especially for text datasets, due to the high-dimensional space)
Disadvantages:
  • Lack of transparency in results, caused by the high number of dimensions (especially for text data)
  • Choosing an efficient kernel function is difficult (susceptible to overfitting/training issues depending on the kernel)
  • Memory complexity

Decision Tree
Advantages:
  • Can easily handle qualitative (categorical) features
  • Works well with decision boundaries parallel to the feature axes
  • Very fast for both learning and prediction
Disadvantages:
  • Issues with diagonal decision boundaries
  • Can be easily overfit
  • Extremely sensitive to small perturbations in the data
  • Problems with out-of-sample prediction

Conditional Random Field (CRF)
Advantages:
  • Flexible feature design
  • Since CRF computes the conditional probability of globally optimal output nodes, it overcomes the drawback of label bias
  • Combines the advantages of classification and graphical modeling, compactly modeling multivariate data
Disadvantages:
  • High computational complexity of the training step
  • Does not perform well with unknown words
  • Problems with online learning (it is very difficult to re-train the model when newer data becomes available)
Random Forest
Advantages:
  • Ensembles of decision trees are very fast to train compared to other techniques
  • Reduced variance (relative to regular trees)
  • Does not require preparation or pre-processing of the input data
Disadvantages:
  • Quite slow to create predictions once trained
  • More trees in the forest increase the time complexity of the prediction step
  • Not as easy to visually interpret
  • Overfitting can easily occur
  • The number of trees in the forest must be chosen

Deep Learning
Advantages:
  • Flexible feature design (reduces the need for feature engineering, one of the most time-consuming parts of machine learning practice)
  • Architectures can be adapted to new problems
  • Can deal with complex input-output mappings
  • Can easily handle online learning (it is very easy to re-train the model when newer data becomes available)
  • Parallel processing capability (can perform more than one job at the same time)
Disadvantages:
  • Requires a large amount of data (with only a small text sample, deep learning is unlikely to outperform other approaches)
  • Extremely computationally expensive to train
  • Model interpretability is the most important problem of deep learning (deep learning is most of the time a black box)
  • Finding an efficient architecture and structure is still the main challenge of this technique
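As a concrete end-to-end illustration of one of the simpler techniques in this comparison, here is a minimal scikit-learn sketch that trains a naive Bayes classifier on TF-IDF features from 20 Newsgroups (the choice of MultinomialNB is illustrative; any of the classifiers above could be substituted):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

clf = Pipeline([
    ('tfidf', TfidfVectorizer()),     # weighted-word feature extraction
    ('nb', MultinomialNB()),          # naive Bayes classifier
])
clf.fit(train.data, train.target)

predicted = clf.predict(test.data)
print(metrics.accuracy_score(test.target, predicted))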

Evaluation


F1 Score

docs/pic/F1.png
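The F1 score is the harmonic mean of precision and recall. A minimal scikit-learn sketch on toy labels (in practice y_true and y_pred would come from a held-out test set):

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]   # toy ground-truth labels
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]   # toy predictions

print(f1_score(y_true, y_pred, average='macro'))   # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average='micro'))   # computed from global TP/FP/FN counts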

Matthew correlation coefficient (MCC)

Compute the Matthews correlation coefficient (MCC)

The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classification. It takes into account true and false positives and negatives, and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between -1 and +1: a coefficient of +1 represents a perfect prediction, 0 an average random prediction, and -1 an inverse prediction. The statistic is also known as the phi coefficient.
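A minimal sketch with scikit-learn's implementation (toy labels for illustration):

from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # toy binary ground truth
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]   # toy predictions

print(matthews_corrcoef(y_true, y_pred))   # value in [-1, +1]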

Receiver operating characteristics (ROC)

ROC curves are typically used in binary classification to study the output of a classifier. In order to extend ROC curve and ROC area to multi-class or multi-label classification, it is necessary to binarize the output. One ROC curve can be drawn per label, but one can also draw a ROC curve by considering each element of the label indicator matrix as a binary prediction (micro-averaging).

Another evaluation measure for multi-class classification is macro-averaging, which gives equal weight to the classification of each label.

 

/docs/pic/sphx_glr_plot_roc_001.png

Area Under Curve (AUC)

Area under the ROC curve (AUC) is a summary metric that measures the entire area underneath the ROC curve. AUC has helpful properties, such as increased sensitivity in analysis of variance (ANOVA) tests, independence of the decision threshold, invariance to a priori class probabilities, and an indication of how well the negative and positive classes are separated with respect to the decision index.
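A minimal multi-class sketch: binarize the labels and compute micro- and macro-averaged ROC AUC from predicted class probabilities (the toy scores below stand in for a real classifier's predict_proba output):

import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 2, 2, 1, 0])                 # toy ground-truth labels
y_score = np.array([[0.8, 0.1, 0.1],                  # toy predicted probabilities per class
                    [0.2, 0.6, 0.2],
                    [0.1, 0.2, 0.7],
                    [0.3, 0.3, 0.4],
                    [0.3, 0.5, 0.2],
                    [0.6, 0.2, 0.2]])

y_bin = label_binarize(y_true, classes=[0, 1, 2])     # one column per class
print(roc_auc_score(y_bin, y_score, average='micro'))
print(roc_auc_score(y_bin, y_score, average='macro'))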

 

Text and Document Datasets


IMDB

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations, such as "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.

from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)

Reuters-21578

Dataset of 11,228 newswires from Reuters, labeled over 46 topics. As with the IMDB dataset, each wire is encoded as a sequence of word indexes (same conventions).

from keras.datasets import reuters

(x_train, y_train), (x_test, y_test) = reuters.load_data(path="reuters.npz",
                                                          num_words=None,
                                                          skip_top=0,
                                                          maxlen=None,
                                                          test_split=0.2,
                                                          seed=113,
                                                          start_char=1,
                                                          oov_char=2,
                                                          index_from=3)

20Newsgroups

The 20 newsgroups dataset comprises around 18,000 newsgroups posts on 20 topics, split into two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon messages posted before and after a specific date.

This module contains two loaders. The first one, sklearn.datasets.fetch_20newsgroups, returns a list of the raw texts that can be fed to text feature extractors, such as sklearn.feature_extraction.text.CountVectorizer with custom parameters so as to extract feature vectors. The second one, sklearn.datasets.fetch_20newsgroups_vectorized, returns ready-to-use features, i.e., it is not necessary to use a feature extractor.

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

from pprint import pprint
pprint(list(newsgroups_train.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Web of Science Dataset

Description of Dataset:

There are three datasets: WOS-11967, WOS-46985, and WOS-5736. Each folder contains:

  • X.txt
  • Y.txt
  • YL1.txt
  • YL2.txt

X.txt is the input data (text sequences), Y.txt is the target value, YL1.txt is the target value of level one (parent label), and YL2.txt is the target value of level two (child label).

Meta-data: this folder contains one data file with the following attributes: Y1, Y2, Y, Domain, area, keywords, Abstract.

The abstract is the input data and includes the text sequences of 46,985 published papers. Y is the target value, YL1 is the target value of level one (parent label), and YL2 is the target value of level two (child label). Domain is the major domain, which includes 7 labels: {Computer Science, Electrical Engineering, Psychology, Mechanical Engineering, Civil Engineering, Medical Science, Biochemistry}. Area is the sub-domain or area of the paper, such as CS -> computer graphics, which contains 134 labels. Keywords are the authors' keywords for the papers.

WOS-11967: contains 11,967 documents with 35 categories, which include 7 parent categories.
WOS-46985: contains 46,985 documents with 134 categories, which include 7 parent categories.
WOS-5736: contains 5,736 documents with 11 categories, which include 3 parent categories.
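A minimal, hypothetical loading sketch, assuming X.txt holds one document per line and Y.txt the matching integer label on the corresponding line:

# hypothetical paths inside one of the WOS folders
with open("X.txt", encoding="utf-8") as f:
    X = [line.strip() for line in f]
with open("Y.txt") as f:
    y = [int(line.strip()) for line in f]

print(len(X), len(y))   # should match the documented document count (e.g., 46,985)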

Referenced paper: HDLTex: Hierarchical Deep Learning for Text Classification

Citations:
@ARTICLE{Kowsari2018Text_Classification,
    title={Text Classification Algorithms: A Survey},
    author={Kowsari, Kamran and Jafari Meimandi, Kiana and Heidarysafa, Mojtaba and Mendu, Sanjana and Barnes, Laura E. and Brown, Donald E.},
    journal={Information},
    VOLUME = {10},
    YEAR = {2019},
    NUMBER = {4},
    ARTICLE-NUMBER = {150},
    URL = {http://www.mdpi.com/2078-2489/10/4/150},
    ISSN = {2078-2489},
    publisher={Multidisciplinary Digital Publishing Institute}
}