
In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to accurately classify text in many applications. Many machine learning approaches have achieved state-of-the-art results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. This paper gives a brief overview of text classification algorithms, covering different text feature extraction methods, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and its application to real-world problems are discussed.

  • text classification
  • text mining
  • text representation
  • text categorization
  • text analysis
  • document classification

Text Classification Algorithms: A Survey

docs/pic/WordArt.png

Introduction

docs/pic/OverviewTextClassification.png

Text and Document Feature Extraction


Text feature extraction and pre-processing are very significant for classification algorithms. This section starts with text cleaning, since most documents contain a lot of noise, and then discusses two primary methods of text feature extraction: weighted words and word embeddings.

Text Cleaning and Pre-processing

In Natural Language Processing (NLP), most texts and documents contain many words that are redundant for text classification, such as stop words, misspellings, and slang. In this section, we briefly explain some techniques and methods for cleaning and pre-processing text documents. In many algorithms, such as statistical and probabilistic learning methods, noise and unnecessary features can negatively affect the overall performance, so the elimination of these features is extremely important.

Tokenization

Tokenization is the process of breaking down a stream of text into words, phrases, symbols, or any other meaningful elements called tokens. The main goal of this step is to extract individual words in a sentence. Along with text classification, in text mining, it is necessary to incorporate a parser in the pipeline which performs the tokenization of the documents; for example:

sentence:

After sleeping for four hours, he decided to sleep for another four

In this case, the tokens are as follows:

{'After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided', 'to', 'sleep', 'for', 'another', 'four'}

Here is Python code for tokenization:

from nltk.tokenize import word_tokenize

# requires the NLTK 'punkt' tokenizer models: nltk.download('punkt')
text = "After sleeping for four hours, he decided to sleep for another four"
tokens = word_tokenize(text)
print(tokens)
Stop words

Text and document classification over social media, such as Twitter and Facebook, is usually affected by the noisy nature (abbreviations, irregular forms) of the text corpora. Stop words are frequently occurring words (such as "the", "is", and "a") that carry little discriminative information for classification and are therefore commonly removed during pre-processing.

Here is an example from GeeksforGeeks:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# requires the NLTK data packages 'punkt' and 'stopwords'
# (nltk.download('punkt'); nltk.download('stopwords'))

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

# keep only the tokens that are not in the stop-word list
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)

Output:

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop',
'words', 'filtration', '.']
Capitalization

Sentences can contain a mixture of uppercase and lowercase letters, and multiple sentences make up a text document. To reduce the problem space, the most common approach is to reduce everything to lowercase. This brings all words in a document into the same space, but it often changes the meaning of some words, such as "US" becoming "us", where the first refers to the United States of America and the second is a pronoun. To address this, slang and abbreviation converters can be applied.

text = "The United States of America (USA) or America, is a federal republic composed of 50 states"
print(text)
print(text.lower())

Output:

"The United States of America (USA) or America, is a federal republic composed of 50 states"
"the united states of america (usa) or america, is a federal republic composed of 50 states"
Slangs and Abbreviations

Slang and abbreviations can cause problems during the pre-processing steps. An abbreviation is a shortened form of a word, such as SVM standing for Support Vector Machine. Slang is informal language in which words or phrases take on non-literal meanings; for example, "lost the plot" essentially means "gone mad". A common method for dealing with these words is converting them to formal language, as shown in the sketch below.
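A minimal dictionary-based converter sketch; the lookup table below is a hypothetical example, not an exhaustive resource:

import re

# hypothetical lookup table; a real system would use a much larger curated resource
ABBREVIATIONS = {
    "svm": "support vector machine",
    "u": "you",
    "btw": "by the way",
}

def expand_abbreviations(text):
    # replace each known abbreviation with its formal form, token by token
    words = re.findall(r"\w+|\S", text.lower())
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

print(expand_abbreviations("btw, the SVM baseline is strong"))
# -> "by the way , the support vector machine baseline is strong"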

Noise Removal

Another part of text cleaning as a pre-processing step is noise removal. Text documents generally contain characters such as punctuation or special characters, which are not necessary for text mining or classification purposes. Although punctuation is critical to understanding the meaning of a sentence, it can affect classification algorithms negatively.
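A minimal sketch of punctuation and special-character removal with a regular expression (whether to keep punctuation is a design choice that depends on the downstream model):

import re

def remove_noise(text):
    # keep only letters, digits, and whitespace; collapse repeated spaces
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(remove_noise("He said: 'the price is $12.50!!'"))
# -> "He said the price is 12 50"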

Spelling Correction

An optional part of the pre-processing step is correcting misspelled words. Different techniques have been introduced to tackle this issue, such as hashing-based and context-sensitive spelling correction, or spelling correction using a trie and the Damerau-Levenshtein distance over bigrams.
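As an illustration of the distance-based idea, here is a minimal sketch that picks the closest word from a small hypothetical vocabulary by Levenshtein (edit) distance; production systems use far larger dictionaries and contextual information:

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

VOCAB = ["sleeping", "classification", "language", "hours"]   # hypothetical dictionary

def correct(word):
    # return the vocabulary word with the smallest edit distance
    return min(VOCAB, key=lambda w: levenshtein(word, w))

print(correct("langauge"))   # -> "language"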

Stemming

Text stemming modifies a word to obtain its variants using different linguistic processes such as affixation (the addition of affixes). For example, the stem of the word "studying" is "study", to which the suffix -ing is added.

Here is an example of Stemming from NLTK
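A minimal sketch using NLTK's PorterStemmer (one of several stemmers NLTK provides):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["studying", "studies", "sleeping", "hours"]:
    print(word, "->", stemmer.stem(word))
# studying -> studi, studies -> studi, sleeping -> sleep, hours -> hour
# note: the Porter stemmer produces truncated stems such as "studi" rather than dictionary words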

Lemmatization

Text lemmatization is the process of eliminating the redundant prefixes or suffixes of a word and extracting the base word (lemma).

from nltk.stem import WordNetLemmatizer

# requires the NLTK data package 'wordnet': nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))   # -> cat


Comparison of Feature Extraction Techniques
Weighted Words
Advantages:
  • Easy to compute
  • Easy to compute the similarity between two documents
  • Basic metric to extract the most descriptive terms in a document
  • Works with unknown words (e.g., new words in languages)
Limitations:
  • Does not capture position in the text (syntax)
  • Does not capture meaning in the text (semantics)
  • Common words affect the results (e.g., "am", "is", etc.)

TF-IDF
Advantages:
  • Easy to compute
  • Easy to compute the similarity between two documents
  • Basic metric to extract the most descriptive terms in a document
  • Common words do not affect the results, due to IDF (e.g., "am", "is", etc.)
Limitations:
  • Does not capture position in the text (syntax)
  • Does not capture meaning in the text (semantics)

Word2Vec
Advantages:
  • Captures the position of words in the text (syntax)
  • Captures meaning in the words (semantics)
Limitations:
  • Cannot capture the meaning of a word from its context (fails to capture polysemy)
  • Cannot capture out-of-vocabulary words

GloVe (Pre-Trained)
Advantages:
  • Captures the position of words in the text (syntax)
  • Captures meaning in the words (semantics)
  • Trained on a huge corpus
Limitations:
  • Cannot capture the meaning of a word from its context (fails to capture polysemy)
  • Memory consumption for storage
  • Cannot capture out-of-vocabulary words

GloVe (Trained)
Advantages:
  • Very straightforward to, e.g., enforce the word vectors to capture sub-linear relationships in the vector space (performs better than Word2Vec)
  • Gives lower weight to highly frequent word pairs, such as stop words like "am", "is", etc., so they do not dominate training
Limitations:
  • Memory consumption for storage
  • Needs a huge corpus to learn
  • Cannot capture out-of-vocabulary words
  • Cannot capture the meaning of a word from its context (fails to capture polysemy)

FastText
Advantages:
  • Works for rare words (their character n-grams are still shared with other words)
  • Solves out-of-vocabulary words with character-level n-grams
Limitations:
  • Cannot capture the meaning of a word from its context (fails to capture polysemy)
  • Memory consumption for storage
  • Computationally more expensive than GloVe and Word2Vec

Contextualized Word Representations
Advantages:
  • Captures the meaning of a word from its context (incorporates context, handling polysemy)
  • Improves performance notably on downstream tasks
Limitations:
  • Memory consumption for storage
  • Computationally more expensive than the other approaches
  • Needs another word embedding for all LSTM and feedforward layers
  • Cannot capture out-of-vocabulary words
  • Works only at the sentence and document level (it cannot work for individual words)
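As a brief illustration of the embedding side of this comparison, here is a minimal Word2Vec sketch using gensim (assuming gensim >= 4, where the dimensionality parameter is called vector_size; the tiny corpus is purely illustrative):

from gensim.models import Word2Vec

# toy corpus: each document is already tokenized into a list of words
corpus = [
    ["text", "classification", "with", "word", "embeddings"],
    ["word", "embeddings", "capture", "semantics"],
    ["documents", "are", "classified", "by", "their", "words"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["word"].shape)          # (50,) dense vector for the token "word"
print(model.wv.most_similar("word"))   # nearest tokens in the learned vector space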

 

Dimensionality Reduction


 

Principal Component Analysis (PCA)

Principal component analysis (PCA) is the most popular technique in multivariate analysis and dimensionality reduction. PCA is a method to identify a subspace in which the data approximately lie; it finds new variables that are uncorrelated and maximize the variance, so as to preserve as much variability as possible.
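A minimal sketch with scikit-learn's PCA (PCA requires a dense matrix, so the TF-IDF vocabulary is capped here; for large sparse matrices, TruncatedSVD is the usual alternative):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

docs = fetch_20newsgroups(subset='train').data
X = TfidfVectorizer(max_features=2000).fit_transform(docs).toarray()

pca = PCA(n_components=100)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)   # (11314, 2000) -> (11314, 100)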

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is another commonly used technique for data classification and dimensionality reduction. LDA is particularly helpful where the within-class frequencies are unequal, and its performance has been evaluated on randomly generated test data. Class-dependent and class-independent transformations are two approaches to LDA, in which the ratio of between-class variance to within-class variance and the ratio of overall variance to within-class variance are used, respectively.
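A minimal supervised LDA sketch with scikit-learn (LinearDiscriminantAnalysis also needs dense features, and the number of components is at most the number of classes minus one):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

train = fetch_20newsgroups(subset='train')
X = TfidfVectorizer(max_features=2000).fit_transform(train.data).toarray()

lda = LinearDiscriminantAnalysis(n_components=19)   # 20 classes -> at most 19 components
X_lda = lda.fit_transform(X, train.target)
print(X_lda.shape)   # (11314, 19)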

Non-negative Matrix Factorization (NMF)

The following example reduces TF-IDF features of the 20 Newsgroups dataset with non-negative matrix factorization (NMF):

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.decomposition import NMF


def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):
    vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)
    X_train = vectorizer_x.fit_transform(X_train).toarray()
    X_test = vectorizer_x.transform(X_test).toarray()
    print("tf-idf with", str(np.array(X_train).shape[1]), "features")
    return (X_train, X_test)


from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

X_train,X_test = TFIDF(X_train,X_test)



NMF_ = NMF(n_components=2000)
X_train_new = NMF_.fit_transform(X_train)
X_test_new = NMF_.transform(X_test)

print("train with old features: ",np.array(X_train).shape)
print("train with new features:" ,np.array(X_train_new).shape)

print("test with old features: ",np.array(X_test).shape)
print("test with new features:" ,np.array(X_test_new))

Output:

tf-idf with 75000 features
train with old features:  (11314, 75000)
train with new features: (11314, 2000)
test with old features:  (7532, 75000)
test with new features: (7532, 2000)



Random Projection

Random projection, or random feature, is a dimensionality reduction technique mostly used for very large datasets or very high-dimensional feature spaces. Texts and documents, especially with weighted feature extraction, can contain a huge number of underlying features. Many researchers have applied random projection to text data for text mining, text classification, and/or dimensionality reduction. Some random projection techniques are reviewed below.

docs/pic/Random%20Projection.png
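A minimal sketch with scikit-learn's random projection transformers (SparseRandomProjection works directly on sparse TF-IDF matrices; GaussianRandomProjection is the dense Gaussian variant):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection

docs = fetch_20newsgroups(subset='train').data
X = TfidfVectorizer(max_features=75000).fit_transform(docs)   # sparse matrix

rp = SparseRandomProjection(n_components=2000, random_state=0)
X_projected = rp.fit_transform(X)
print(X.shape, "->", X_projected.shape)   # (11314, 75000) -> (11314, 2000)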

Autoencoder

Autoencoder is a neural network technique that is trained to attempt to map its input to its output. As a dimensionality reduction method, the autoencoder has achieved great success via the powerful representational ability of neural networks. The main idea is that a hidden layer between the input and output layers with fewer neurons can be used to reduce the dimension of the feature space. Especially for texts, documents, and sequences that contain many features, autoencoders can help process data faster and more efficiently.

docs/pic/Autoencoder.png
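A minimal Keras sketch of such an autoencoder (the sizes are hypothetical and follow the TF-IDF example above; the encoder half is what produces the reduced representation):

from keras.layers import Input, Dense
from keras.models import Model

input_dim, encoding_dim = 75000, 2000    # hypothetical: TF-IDF input size and bottleneck size

inputs = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(inputs)      # bottleneck hidden layer
decoded = Dense(input_dim, activation='sigmoid')(encoded)     # reconstruction of the input

autoencoder = Model(inputs, decoded)   # trained to reproduce its own input
encoder = Model(inputs, encoded)       # reused later to extract the reduced features
autoencoder.compile(optimizer='adam', loss='mse')

# autoencoder.fit(X_train, X_train, epochs=10, batch_size=128)
# X_train_reduced = encoder.predict(X_train)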

T-distributed Stochastic Neighbor Embedding (t-SNE)

t-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique for embedding high-dimensional data, mostly used for visualization in a low-dimensional space. This approach is based on the stochastic neighbor embedding (SNE) work of G. Hinton and S. T. Roweis. SNE works by converting the high-dimensional Euclidean distances into conditional probabilities which represent similarities.
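A minimal scikit-learn sketch that projects document vectors to 2-D for visualization (the random matrix below is a toy stand-in for reduced document vectors, e.g., the NMF output above):

import numpy as np
from sklearn.manifold import TSNE

# toy stand-in for reduced document vectors
X_reduced = np.random.RandomState(0).rand(200, 50)

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X_reduced)
print(X_embedded.shape)   # (200, 2), ready for a 2-D scatter plot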

 

docs/pic/TSNE.png

Text Classification Techniques


 

Rocchio Algorithm
Advantages:
  • Easy to implement
  • Computationally very cheap
  • Relevance feedback mechanism (benefits ranking documents as not relevant)
Disadvantages:
  • The user can only retrieve a few relevant documents
  • Rocchio often misclassifies multimodal classes
  • This technique is not very robust
  • The linear combination in this algorithm is not good for multi-class datasets

Boosting and Bagging
Advantages:
  • Improves stability and accuracy (takes advantage of ensemble learning, where multiple weak learners outperform a single strong learner)
  • Reduces variance, which helps to avoid overfitting
Disadvantages:
  • Computational complexity
  • Loss of interpretability (if the number of models is high, understanding the model is very difficult)
  • Requires careful tuning of different hyper-parameters

Logistic Regression
Advantages:
  • Easy to implement
  • Does not require too many computational resources
  • Does not require input features to be scaled (pre-processing)
  • Does not require any tuning
Disadvantages:
  • Cannot solve non-linear problems
  • Prediction requires each data point to be independent
  • Attempts to predict outcomes based on a set of independent variables
Naive Bayes Classifier
Advantages:
  • Works very well with text data
  • Easy to implement
  • Fast compared to other algorithms
Disadvantages:
  • Makes a strong assumption about the shape of the data distribution
  • Limited by data scarcity: for any possible value in the feature space, a likelihood value must be estimated from frequencies

K-Nearest Neighbor
Advantages:
  • Effective for text datasets
  • Non-parametric
  • Considers more local characteristics of a text or document
  • Naturally handles multi-class datasets
Disadvantages:
  • Computationally very expensive
  • Difficult to find the optimal value of k
  • Large search problem to find the nearest neighbors
  • Finding a meaningful distance function is difficult for text datasets

Support Vector Machine (SVM)
Advantages:
  • Can model non-linear decision boundaries
  • Performs similarly to logistic regression when the data is linearly separable
  • Robust against overfitting (especially for text datasets, due to the high-dimensional space)
Disadvantages:
  • Lack of transparency in results, caused by the high number of dimensions (especially for text data)
  • Choosing an efficient kernel function is difficult (susceptible to overfitting/training issues depending on the kernel)
  • Memory complexity

Decision Tree
Advantages:
  • Can easily handle qualitative (categorical) features
  • Works well with decision boundaries parallel to the feature axes
  • Very fast for both learning and prediction
Disadvantages:
  • Issues with diagonal decision boundaries
  • Can be easily overfit
  • Extremely sensitive to small perturbations in the data
  • Problems with out-of-sample prediction

Conditional Random Field (CRF)
Advantages:
  • Flexible feature design
  • Since CRF computes the conditional probability of globally optimal output nodes, it overcomes the drawback of label bias
  • Combines the advantages of classification and graphical modeling, compactly modeling multivariate data
Disadvantages:
  • High computational complexity of the training step
  • Does not perform well with unknown words
  • Problems with online learning (it is very difficult to re-train the model when newer data becomes available)
Random Forest
Advantages:
  • Ensembles of decision trees are very fast to train compared to other techniques
  • Reduced variance (relative to regular trees)
  • Does not require preparation or pre-processing of the input data
Disadvantages:
  • Quite slow to create predictions once trained
  • More trees in the forest increase the time complexity of the prediction step
  • Not as easy to visually interpret
  • Overfitting can easily occur
  • The number of trees in the forest must be chosen

Deep Learning
Advantages:
  • Flexible feature design (reduces the need for feature engineering, one of the most time-consuming parts of machine learning practice)
  • Architectures can be adapted to new problems
  • Can deal with complex input-output mappings
  • Can easily handle online learning (it is very easy to re-train the model when newer data becomes available)
  • Parallel processing capability (can perform more than one job at the same time)
Disadvantages:
  • Requires a large amount of data (with only a small text sample, deep learning is unlikely to outperform other approaches)
  • Extremely computationally expensive to train
  • Model interpretability is the most important problem of deep learning (deep learning is most of the time a black box)
  • Finding an efficient architecture and structure is still the main challenge of this technique
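As a concrete end-to-end illustration of one of the simpler techniques in this comparison, here is a minimal scikit-learn sketch that trains a naive Bayes classifier on TF-IDF features from 20 Newsgroups (the choice of MultinomialNB is illustrative; any of the classifiers above could be substituted):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

clf = Pipeline([
    ('tfidf', TfidfVectorizer()),     # weighted-word feature extraction
    ('nb', MultinomialNB()),          # naive Bayes classifier
])
clf.fit(train.data, train.target)

predicted = clf.predict(test.data)
print(metrics.accuracy_score(test.target, predicted))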

Evaluation


F1 Score

docs/pic/F1.png
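The F1 score is the harmonic mean of precision and recall. A minimal scikit-learn sketch on toy labels (in practice y_true and y_pred would come from a held-out test set):

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]   # toy ground-truth labels
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]   # toy predictions

print(f1_score(y_true, y_pred, average='macro'))   # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average='micro'))   # computed from global TP/FP/FN counts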

Matthew correlation coefficient (MCC)

Compute the Matthews correlation coefficient (MCC)

The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classification. It takes into account true and false positives and negatives, and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between -1 and +1: a coefficient of +1 represents a perfect prediction, 0 an average random prediction, and -1 an inverse prediction. The statistic is also known as the phi coefficient.
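A minimal sketch with scikit-learn's implementation (toy labels for illustration):

from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # toy binary ground truth
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]   # toy predictions

print(matthews_corrcoef(y_true, y_pred))   # value in [-1, +1]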

Receiver operating characteristics (ROC)

ROC curves are typically used in binary classification to study the output of a classifier. In order to extend ROC curve and ROC area to multi-class or multi-label classification, it is necessary to binarize the output. One ROC curve can be drawn per label, but one can also draw a ROC curve by considering each element of the label indicator matrix as a binary prediction (micro-averaging).

Another evaluation measure for multi-class classification is macro-averaging, which gives equal weight to the classification of each label.

 

/docs/pic/sphx_glr_plot_roc_001.png

Area Under Curve (AUC)

Area under the ROC curve (AUC) is a summary metric that measures the entire area underneath the ROC curve. AUC has helpful properties, such as increased sensitivity in analysis of variance (ANOVA) tests, independence of the decision threshold, invariance to a priori class probabilities, and an indication of how well the negative and positive classes are separated with respect to the decision index.
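A minimal multi-class sketch: binarize the labels and compute micro- and macro-averaged ROC AUC from predicted class probabilities (the toy scores below stand in for a real classifier's predict_proba output):

import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 2, 2, 1, 0])                 # toy ground-truth labels
y_score = np.array([[0.8, 0.1, 0.1],                  # toy predicted probabilities per class
                    [0.2, 0.6, 0.2],
                    [0.1, 0.2, 0.7],
                    [0.3, 0.3, 0.4],
                    [0.3, 0.5, 0.2],
                    [0.6, 0.2, 0.2]])

y_bin = label_binarize(y_true, classes=[0, 1, 2])     # one column per class
print(roc_auc_score(y_bin, y_score, average='micro'))
print(roc_auc_score(y_bin, y_score, average='macro'))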

 

Text and Document Datasets


IMDB

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations, such as "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.

from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)

Reuters-21578

Dataset of 11,228 newswires from Reuters, labeled over 46 topics. As with the IMDB dataset, each wire is encoded as a sequence of word indexes (same conventions).

from keras.datasets import reuters

(x_train, y_train), (x_test, y_test) = reuters.load_data(path="reuters.npz",
                                                          num_words=None,
                                                          skip_top=0,
                                                          maxlen=None,
                                                          test_split=0.2,
                                                          seed=113,
                                                          start_char=1,
                                                          oov_char=2,
                                                          index_from=3)

20Newsgroups

The 20 newsgroups dataset comprises around 18,000 newsgroups posts on 20 topics, split into two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon messages posted before and after a specific date.

This module contains two loaders. The first one, sklearn.datasets.fetch_20newsgroups, returns a list of the raw texts that can be fed to text feature extractors, such as sklearn.feature_extraction.text.CountVectorizer with custom parameters so as to extract feature vectors. The second one, sklearn.datasets.fetch_20newsgroups_vectorized, returns ready-to-use features, i.e., it is not necessary to use a feature extractor.

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

from pprint import pprint
pprint(list(newsgroups_train.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Web of Science Dataset

Description of Dataset:

There are three datasets: WOS-11967, WOS-46985, and WOS-5736. Each folder contains:

  • X.txt
  • Y.txt
  • YL1.txt
  • YL2.txt

X.txt is the input data (text sequences), Y.txt is the target value, YL1.txt is the target value of level one (parent label), and YL2.txt is the target value of level two (child label).

Meta-data: this folder contains one data file with the following attributes: Y1, Y2, Y, Domain, area, keywords, Abstract.

The abstract is the input data and includes the text sequences of 46,985 published papers. Y is the target value, YL1 is the target value of level one (parent label), and YL2 is the target value of level two (child label). Domain is the major domain, which includes 7 labels: {Computer Science, Electrical Engineering, Psychology, Mechanical Engineering, Civil Engineering, Medical Science, Biochemistry}. Area is the sub-domain or area of the paper, such as CS -> computer graphics, which contains 134 labels. Keywords are the authors' keywords for the papers.

WOS-11967: contains 11,967 documents with 35 categories, which include 7 parent categories.
WOS-46985: contains 46,985 documents with 134 categories, which include 7 parent categories.
WOS-5736: contains 5,736 documents with 11 categories, which include 3 parent categories.
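A minimal, hypothetical loading sketch, assuming X.txt holds one document per line and Y.txt the matching integer label on the corresponding line:

# hypothetical paths inside one of the WOS folders
with open("X.txt", encoding="utf-8") as f:
    X = [line.strip() for line in f]
with open("Y.txt") as f:
    y = [int(line.strip()) for line in f]

print(len(X), len(y))   # should match the documented document count (e.g., 46,985)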

Referenced paper: HDLTex: Hierarchical Deep Learning for Text Classification

Citations:
@ARTICLE{Kowsari2018Text_Classification,
    title={Text Classification Algorithms: A Survey},
    author={Kowsari, Kamran and Jafari Meimandi, Kiana and Heidarysafa, Mojtaba and Mendu, Sanjana and Barnes, Laura E. and Brown, Donald E.},
    journal={Information},
    VOLUME = {10},
    YEAR = {2019},
    NUMBER = {4},
    ARTICLE-NUMBER = {150},
    URL = {http://www.mdpi.com/2078-2489/10/4/150},
    ISSN = {2078-2489},
    publisher={Multidisciplinary Digital Publishing Institute}
}