Text Classification Algorithms: A Survey
Introduction
Text and Document Feature Extraction
Text feature extraction and pre-processing for classification algorithms are very significant. In this section, we start to talk about text cleaning since most of the documents contain a lot of noise. In this part, we discuss two primary methods of text feature extractions- word embedding and weighted word.
Text Cleaning and Pre-processing
In Natural Language Processing (NLP), most of the text and documents contain many words that are redundant for text classification, such as stopwords, miss-spellings, slangs, and etc. In this section, we briefly explain some techniques and methods for text cleaning and pre-processing text documents. In many algorithms like statistical and probabilistic learning methods, noise and unnecessary features can negatively affect the overall performance. So, the elimination of these features is extremely important.
Tokenization
Tokenization is the process of breaking down a stream of text into words, phrases, symbols, or any other meaningful elements called tokens. The main goal of this step is to extract individual words in a sentence. Along with text classification, in text mining, it is necessary to incorporate a parser in the pipeline which performs the tokenization of the documents; for example:
sentence:
After sleeping for four hours, he decided to sleep for another four
In this case, the tokens are as follows:
{'After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided', 'to', 'sleep', 'for', 'another', 'four'}
Here is python code for Tokenization:
from nltk.tokenize import word_tokenize
text = "After sleeping for four hours, he decided to sleep for another four"
tokens = word_tokenize(text)
print(tokens)
Stop words
Text and document classification over social media, such as Twitter, Facebook, and so on is usually affected by the noisy nature (abbreviations, irregular forms) of the text corpora.
Here is an example from geeksforgeeks
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if not w in stop_words]
filtered_sentence = []
for w in word_tokens:
if w not in stop_words:
filtered_sentence.append(w)
print(word_tokens)
print(filtered_sentence)
Output:
['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop',
'words', 'filtration', '.']
Capitalization
Sentences can contain a mixture of uppercase and lower case letters. Multiple sentences make up a text document. To reduce the problem space, the most common approach is to reduce everything to lower case. This brings all words in a document in same space, but it often changes the meaning of some words, such as "US" to "us" where first one represents the United States of America and second one is a pronoun. To solve this, slang and abbreviation converters can be applied.
text = "The United States of America (USA) or America, is a federal republic composed of 50 states"
print(text)
print(text.lower())
Output:
"The United States of America (USA) or America, is a federal republic composed of 50 states"
"the united states of america (usa) or america, is a federal republic composed of 50 states"
Slangs and Abbreviations
Slangs and abbreviations can cause problems while executing the pre-processing steps. An abbreviation is a shortened form of a word, such as SVM stand for Support Vector Machine. Slang is a version of language that depicts informal conversation or text that has different meaning, such as "lost the plot", it essentially means that 'they've gone mad'. Common method to deal with these words is converting them to formal language.
Noise Removal
Another issue of text cleaning as a pre-processing step is noise removal. Text documents generally contains characters like punctuations or special characters and they are not necessary for text mining or classification purposes. Although punctuation is critical to understand the meaning of the sentence, but it can affect the classification algorithms negatively.
Spelling Correction
An optional part of the pre-processing step is correcting the misspelled words. Different techniques, such as hashing-based and context-sensitive spelling correction techniques, or spelling correction using trie and damerau-Levenshtein distance bigram have been introduced to tackle this issue.
Text Stemming is modifying a word to obtain its variants using different linguistic processeses like affixation (addition of affixes). For example, the stem of the word "studying" is "study", to which -ing.
Here is an example of Stemming from NLTK
Text lemmatization is the process of eliminating redundant prefix or suffix of a word and extract the base word (lemma).
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
Comparison of Feature Extraction Techniques
Model | Advantages | Limitation |
Weighted Words |
|
|
TF-IDF |
|
|
Word2Vec |
|
|
GloVe (Pre-Trained) |
|
|
GloVe (Trained) |
|
|
FastText |
|
|
Contextualized Word Representations |
|
|
Dimensionality Reduction
Principal Component Analysis (PCA)
Principle component analysis~(PCA) is the most popular technique in multivariate analysis and dimensionality reduction. PCA is a method to identify a subspace in which the data approximately lies. This means finding new variables that are uncorrelated and maximizing the variance to preserve as much variability as possible.
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is another commonly used technique for data classification and dimensionality reduction. LDA is particularly helpful where the within-class frequencies are unequal and their performances have been evaluated on randomly generated test data. Class-dependent and class-independent transformation are two approaches in LDA where the ratio of between-class-variance to within-class-variance and the ratio of the overall-variance to within-class-variance are used respectively.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.decomposition import NMF
def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):
vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)
X_train = vectorizer_x.fit_transform(X_train).toarray()
X_test = vectorizer_x.transform(X_test).toarray()
print("tf-idf with", str(np.array(X_train).shape[1]), "features")
return (X_train, X_test)
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
X_train,X_test = TFIDF(X_train,X_test)
NMF_ = NMF(n_components=2000)
X_train_new = NMF_.fit(X_train)
X_train_new = NMF_.transform(X_train)
X_test_new = NMF_.transform(X_test)
print("train with old features: ",np.array(X_train).shape)
print("train with new features:" ,np.array(X_train_new).shape)
print("test with old features: ",np.array(X_test).shape)
print("test with new features:" ,np.array(X_test_new))
output:
tf-idf with 75000 features train with old features: (11314, 75000) train with new features: (11314, 2000) test with old features: (7532, 75000) test with new features: (7532, 2000)
Random Projection
Random projection or random feature is a dimensionality reduction technique mostly used for very large volume dataset or very high dimensional feature space. Text and document, especially with weighted feature extraction, can contain a huge number of underlying features. Many researchers addressed Random Projection for text data for text mining, text classification and/or dimensionality reduction. We start to review some random projection techniques.
Autoencoder is a neural network technique that is trained to attempt to map its input to its output. The autoencoder as dimensional reduction methods has achieved great success via the powerful reprehensibility of neural networks. The main idea is, one hidden layer between the input and output layers with fewer neurons can be used to reduce the dimension of feature space. Especially for texts, documents, and sequences that contain many features, autoencoders could help to process data faster and more efficiently.
T-distributed Stochastic Neighbor Embedding (T-SNE) is a nonlinear dimensionality reduction technique for embedding high-dimensional data which is mostly used for visualization in a low-dimensional space. This approach is based on G. Hinton and ST. Roweis. SNE works by converting the high dimensional Euclidean distances into conditional probabilities which represent similarities.
Text Classification Techniques
Model | Advantages | Disadvantages |
Rocchio Algorithm |
|
|
Boosting and Bagging |
|
|
Logistic Regression |
|
|
Naive Bayes Classifier |
|
|
K-Nearest Neighbor |
|
|
Support Vector Machine (SVM) |
|
|
Decision Tree |
|
|
Conditional Random Field (CRF) |
|
|
Random Forest |
|
|
Deep Learning |
|
|
Evaluation
F1 Score
Matthew correlation coefficient (MCC)
Compute the Matthews correlation coefficient (MCC)
The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classification problems. It takes into account of true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient.
Receiver operating characteristics (ROC)
ROC curves are typically used in binary classification to study the output of a classifier. In order to extend ROC curve and ROC area to multi-class or multi-label classification, it is necessary to binarize the output. One ROC curve can be drawn per label, but one can also draw a ROC curve by considering each element of the label indicator matrix as a binary prediction (micro-averaging).
Another evaluation measure for multi-class classification is macro-averaging, which gives equal weight to the classification of each label. [sources]
Area Under Curve (AUC)
Area under ROC curve (AUC) is a summary metric that measures the entire area underneath the ROC curve. AUC holds helpful properties, such as increased sensitivity in the analysis of variance (ANOVA) tests, independence of decision threshold, invariance to a priori class probability and the indication of how well negative and positive classes are regarding decision index.
Text and Document Datasets
IMDB
Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations, such as "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
num_words=None,
skip_top=0,
maxlen=None,
seed=113,
start_char=1,
oov_char=2,
index_from=3)
Reuters-21578
Dataset of 11,228 newswires from Reuters, labeled over 46 topics. As with the IMDB dataset, each wire is encoded as a sequence of word indexes (same conventions).
from keras.datasets import reuters
(x_train, y_train), (x_test, y_test) = reuters.load_data(path="reuters.npz",
num_words=None,
skip_top=0,
maxlen=None,
test_split=0.2,
seed=113,
start_char=1,
oov_char=2,
index_from=3)
20Newsgroups
The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon messages posted before and after a specific date.
This module contains two loaders. The first one, sklearn.datasets.fetch_20newsgroups, returns a list of the raw texts that can be fed to text feature extractors, such as sklearn.feature_extraction.text.CountVectorizer with custom parameters so as to extract feature vectors. The second one, sklearn.datasets.fetch_20newsgroups_vectorized, returns ready-to-use features, i.e., it is not necessary to use a feature extractor.
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
from pprint import pprint
pprint(list(newsgroups_train.target_names))
['alt.atheism',
'comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x',
'misc.forsale',
'rec.autos',
'rec.motorcycles',
'rec.sport.baseball',
'rec.sport.hockey',
'sci.crypt',
'sci.electronics',
'sci.med',
'sci.space',
'soc.religion.christian',
'talk.politics.guns',
'talk.politics.mideast',
'talk.politics.misc',
'talk.religion.misc']
Web of Science Dataset
Description of Dataset:
Here is three datasets which include WOS-11967 , WOS-46985, and WOS-5736 Each folder contains:
- X.txt
- Y.txt
- YL1.txt
- YL2.txt
X is input data that include text sequences Y is target value YL1 is the target value of level one (parent label) YL2 is the target value of level one (child label)
Meta-data: This folder contains on data file as the following attribute: Y1 Y2 Y Domain area keywords Abstract
The abstract is input data that include text sequences of 46,985 published paper Y is target value YL1 is the target value of level one (parent label) YL2 is a target value of level one (child label) Domain is the major domain which includes 7 labels: {Computer Science, Electrical Engineering, Psychology, Mechanical Engineering, Civil Engineering, Medical Science, biochemistry} area is subdomain or area of the paper, such as CS-> computer graphics which contain 134 labels. keywords: is authors keyword of the papers
- Web of Science Dataset WOS-11967
This dataset contains 11,967 documents with 35 categories which include 7 parents categories.
- Web of Science Dataset WOS-46985
This dataset contains 46,985 documents with 134 categories which include 7 parents categories.
- Web of Science Dataset WOS-5736
This dataset contains 5,736 documents with 11 categories which include 3 parents categories.
Referenced paper: HDLTex: Hierarchical Deep Learning for Text Classification
Citations:@ARTICLE{Kowsari2018Text_Classification, title={Text Classification Algorithms: A Survey}, author={Kowsari, Kamran and Jafari Meimandi, Kiana and Heidarysafa, Mojtaba and Mendu, Sanjana and Barnes, Laura E. and Brown, Donald E.}, journal={Information}, VOLUME = {10}, YEAR = {2019}, NUMBER = {4}, ARTICLE-NUMBER = {150}, URL = {http://www.mdpi.com/2078-2489/10/4/150}, ISSN = {2078-2489}, publisher={Multidisciplinary Digital Publishing Institute} }