Sparse Composite Document Vector with Multi-Sense Embeddings (SCDV-MS) [13] forcibly discards outliers from the clustering output to eliminate long-tail noise in SCDV, applying a hard threshold that may hinder the thematic representation of documents. Moreover, an expressive document representation depends on modeling the underlying semantic topics in the correct form [14], which requires capturing the deep semantic insights buried in words, expressions, and string patterns [15]. Hence, for noisy long texts, we proposed the Weighted Sparse Document Vector (WSDV), which embodies the capability to emphasize important words using the Pomegranate General Mixture model [16] and a soft threshold-based noise reduction technique.
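To make the contrast with SCDV-MS concrete, the minimal sketch below compares hard thresholding of fuzzy word-cluster assignments with a soft (shrinkage-style) threshold; the threshold value and the shrinkage form are illustrative assumptions, not the published WSDV formulation.

```python
import numpy as np

def hard_threshold(assignments: np.ndarray, t: float = 0.05) -> np.ndarray:
    """Zero out low-probability cluster assignments (hard cut, SCDV-MS style)."""
    sparse = np.where(assignments >= t, assignments, 0.0)
    row_sums = sparse.sum(axis=1, keepdims=True)
    # Renormalize rows that still have mass so each word remains a distribution.
    return np.divide(sparse, row_sums, out=np.zeros_like(sparse), where=row_sums > 0)

def soft_threshold(assignments: np.ndarray, t: float = 0.05) -> np.ndarray:
    """Shrink assignments toward zero instead of cutting them (illustrative soft variant)."""
    shrunk = np.maximum(assignments - t, 0.0)
    row_sums = shrunk.sum(axis=1, keepdims=True)
    return np.divide(shrunk, row_sums, out=np.zeros_like(shrunk), where=row_sums > 0)

# Toy fuzzy assignments of two words over four clusters.
p = np.array([[0.50, 0.30, 0.15, 0.05],
              [0.40, 0.35, 0.20, 0.05]])
print(hard_threshold(p))  # small entries removed outright
print(soft_threshold(p))  # small entries shrunk, sparsity emerges more gradually
```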
It is challenging to capture semantic insights in document modeling with sparse short texts. A probability distribution over words captures better semantics than a point embedding approach (e.g., word2vec) [17], as it generalizes deterministic point embeddings of terms using the mean vector, while the covariance matrix holds the uncertainty of the estimates. Hence, instead of depending on external knowledge sources, we proposed the corpus statistics empowered Weighted Compact Document Vector (WCDV), which emphasizes potential terms while learning probabilistic word distributions using a weighted energy function. In WCDV, we employ Multimodal Word Distributions [18], which learn distributions of words using the Expected Likelihood Kernel [19]; this kernel computes the inner product between word distributions to obtain the affinity of word pairs. However, every word in a document does not hold the same importance; some are used more frequently than others, indicating their importance in the corpus. It is necessary to emphasize frequently used words, especially when word co-occurrence information is limited (e.g., microblogging, product reviews, etc.). Therefore, to preserve word frequency importance, we proposed the Weight-attained Expected Likelihood Kernel, which considers term frequency-based point weights while measuring the partial log energy between distributions in the Multimodal Word Distributions [18].
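For concreteness, the sketch below computes the log expected likelihood kernel between two diagonal-covariance Gaussian word components; scaling the log energy by term-frequency point weights is an illustrative assumption about how a weight-attained variant could look, not the exact WCDV formulation.

```python
import numpy as np

def log_elk_diag(mu1, var1, mu2, var2):
    """Log expected likelihood kernel between two diagonal Gaussians:
    log \int N(x; mu1, var1) N(x; mu2, var2) dx = log N(mu1 - mu2; 0, var1 + var2)."""
    var = var1 + var2
    diff = mu1 - mu2
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var)

def weighted_log_energy(mu1, var1, w1, mu2, var2, w2):
    """Illustrative weight-attained variant: scale the log energy by the
    term-frequency point weights of the two words (assumption)."""
    return w1 * w2 * log_elk_diag(mu1, var1, mu2, var2)

# Toy 3-dimensional word components with term-frequency point weights.
mu_a, var_a, w_a = np.array([0.1, 0.3, -0.2]), np.full(3, 0.05), 1.7
mu_b, var_b, w_b = np.array([0.0, 0.25, -0.1]), np.full(3, 0.04), 0.9
print(weighted_log_energy(mu_a, var_a, w_a, mu_b, var_b, w_b))
```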
2. Corpus Statistics
Word embedding models ignore side information (e.g., document labels) while learning embeddings from enormous document corpora. To improve word representation and text classification accuracy, Linear, Y. et al. [20] proposed using document labels as the global context both in a local neural network model and in a global matrix factorization framework. Obayes, H.K. et al. [21] combined GloVe and a bidirectional long short-term memory (BiLSTM) recurrent neural network for better sentiment classification, which is computationally expensive and offers no guidance for documents containing multiple sentences. Yang, Z. et al. [22] proposed Hierarchical Attention Networks (HAN) for document classification, which maintain a hierarchical structure from words to sentences (building sentences from words) and from sentences to documents (aggregating sentences into a document representation). Zhang, Z. et al. [23] showed that the TF-IDF algorithm combined with Naive Bayes remains significant in text classification compared to many more complex models.
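As a reference point for that comparison, a TF-IDF plus Naive Bayes baseline takes only a few lines with scikit-learn; the corpus and labels below are placeholders, not data used in this work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder corpus and labels; replace with a real labeled dataset.
docs = ["cheap flights and hotel deals", "team wins the championship game",
        "new phone released with faster chip", "stock markets close higher today"]
labels = ["travel", "sports", "tech", "finance"]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                         MultinomialNB())
baseline.fit(docs, labels)
print(baseline.predict(["airline announces discounted hotel packages"]))
```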
Recently, transformer-based models [24,25] have become more prevalent in downstream Natural Language Processing (NLP) tasks (e.g., document classification). Wang, B. and Kuo, C.-C.J. [26] proposed SBERT-WK for sentence embedding, which trains on both word- and sentence-level objectives but offers no guidance for representing a document that contains multiple sentences. However, transformer-based models require enormous computational resources. Sanh, V. et al. [27] introduced a distilled version of BERT called DistilBERT, which is smaller, faster, cheaper, and lighter than other transformer-based models.
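For context, a minimal sketch of obtaining a document vector from DistilBERT with the Hugging Face transformers library is shown below; mean pooling over token states is one common choice for turning token embeddings into a document embedding, not a method proposed in [27].

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the pre-trained DistilBERT encoder (no task-specific head).
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single document vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)          # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (1, 768)

print(embed("DistilBERT is smaller and faster than BERT.").shape)
```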
Mapping sentences to a fixed-length embedding vector using the Universal Sentence Encoder (USE)-based method [28] has also been successful in downstream Natural Language Processing (NLP) tasks. The sentence analysis method built by combining Universal Language Model Fine-tuning (ULMFiT) with a Support Vector Machine (SVM) [29] can perform document classification using a small amount of data but has higher computational complexity.
Kesiraju, S. et al. [30] proposed document embeddings along with their uncertainties, called the Bayesian subspace multinomial model (Bayesian SMM), to capture better semantics. It is a generative log-linear model that learns to represent documents in the form of Gaussian distributions and encodes uncertainty in the covariance matrix, but it holds only a single mode per word. Therefore, the encoded uncertainty might diffuse spontaneously; the mean vector can be pulled in one direction and represent one particular meaning, leaving other senses unrepresented [31]. Different senses of a word lie in a linear superposition of standard word embeddings [32], and a Gaussian mixture model holds multiple modes to represent distinct meanings of words.
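To illustrate why multiple modes matter, the sketch below fits a two-component Gaussian mixture to synthetic context vectors of a polysemous word; the data and the component count are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic context embeddings of a polysemous word such as "bank":
# one cloud for the "river" sense, another for the "finance" sense.
river_contexts = rng.normal(loc=[-2.0, 0.0], scale=0.3, size=(200, 2))
finance_contexts = rng.normal(loc=[2.0, 1.0], scale=0.3, size=(200, 2))
contexts = np.vstack([river_contexts, finance_contexts])

# A single Gaussian averages the senses; two components keep them separate.
single = GaussianMixture(n_components=1, covariance_type="diag").fit(contexts)
mixture = GaussianMixture(n_components=2, covariance_type="diag").fit(contexts)
print("single-mode mean:", single.means_[0])   # pulled between both senses
print("mixture means:", mixture.means_)        # one mean per sense
```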
For short text classification, we proposed WCDV, which represents documents with uncertainty estimates in the distribution of words using Gaussian mixture distributions. For long text classification, we proposed WSDV using the Pomegranate General Mixture model. Both WSDV and WCDV accommodate polysemous terms and train on labeled document corpora for better classification performance.
Noisy topics are outlier-prone and thus less coherent and less expressive. Newman, D. [33] regularized the LDA-based topic model, where only higher-frequency terms are allowed into the sparse word-dependency covariance matrix. This model executes two prime steps: first, measuring the point weight of each word in the vocabulary; second, setting a threshold point to eliminate lower-weighted words from the covariance matrix. Mittal, M. et al. [34] introduced automated K-means clustering, where they applied a threshold point to decide whether or not to create a new cluster for the objects. This approach prevents outlier tendencies by accommodating lower-probability objects in a new cluster. Gupta, V. et al. [13] introduced SCDV-MS, which removes noise by applying a hard threshold on the fuzzy word-cluster assignments and which achieved better classification performance and lower space and time complexity than SCDV [6].
In contrast, the proposed WSDV contains a more natural noise removal technique using a soft threshold and a more efficient sparse vectorial representation for long texts (e.g., removing the first principal components).
To capture better corpus semantics, Sia, S. et al. [35] introduced weighted data clustering on pre-trained word embeddings, where they also proved the effectiveness of re-ranking the top words in a cluster for more representative topics. Similarly, Gebru, I.D. et al. [36] proposed a Gaussian mixture-based weighted data clustering method called WD-GMM, which demonstrates how the point weight of each datum affects the covariance matrix and leads to better clustering. Inspired by them, we proposed WSDV, which extends the clustering process to weighted data for better multi-class document classification performance.
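A minimal sketch of how per-datum point weights can enter the Gaussian parameter updates, in the spirit of weighted-data clustering such as WD-GMM [36], is given below; it shows a single weighted maximum-likelihood step for one component, not the full EM algorithm of [36] or of WSDV.

```python
import numpy as np

def weighted_gaussian_update(X, resp, weights):
    """One weighted M-step for a single Gaussian component.
    X: (n, d) data, resp: (n,) responsibilities, weights: (n,) point weights."""
    w = resp * weights                       # point weights rescale responsibilities
    w_sum = w.sum()
    mu = (w[:, None] * X).sum(axis=0) / w_sum
    diff = X - mu
    cov = (w[:, None, None] * np.einsum("ni,nj->nij", diff, diff)).sum(axis=0) / w_sum
    return mu, cov

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
resp = np.ones(500)                          # all points assigned to this component
weights = rng.uniform(0.1, 2.0, size=500)    # e.g., term-frequency point weights
mu, cov = weighted_gaussian_update(X, resp, weights)
print(mu.shape, cov.shape)                   # (3,), (3, 3)
```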
Short texts are sparse due to limited word co-occurrence, which requires special treatment to capture hidden semantic information [37,38]. Pre-trained word embeddings over large external corpora are a common remedy for dealing with short-length documents. Zuo, Y. et al. [39] proposed a word embedding-enhanced Pseudo-document-based Topic Model (WE-PTM) to leverage pre-trained word embeddings, which is essential for alleviating data sparsity. Instead of incorporating external knowledge sources, Zhang, P. and He, Z. [40] proposed an ensemble approach that exploits both word embeddings and latent topics in sentence-level sentiment analysis for sentence polarity detection.
Therefore, for semantically enriched short-length document representation, instead of importing information from external knowledge sources, we employ Multimodal Word Distributions [18] to capture uncertainty in the distribution of word embeddings for the vectorial representation of documents.
Contextual analysis-based models emphasize potential terms that capture better semantic insights and boost classification performance [41]. Xu, J. et al. [42] proposed a convolutional neural network-based model that incorporates context-relevant concepts into text representation to uplift short text classification performance, but it requires expensive computational capacity.
In WCDV, we use the weighted energy function to emphasize potential terms in the short-text corpus.
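One way to realize such a weighted energy function is to scale a max-margin energy objective, as used for training multimodal word distributions [18], by term-frequency point weights; the sketch below is an illustration under that assumption, not the exact WCDV objective.

```python
def max_margin_weighted_loss(log_e_pos, log_e_neg, w_word, w_context, margin=1.0):
    """Illustrative weighted max-margin loss for one (word, context, negative) triple.
    log_e_pos / log_e_neg: log energies of the true and negative context pairs;
    w_word, w_context: term-frequency point weights emphasizing frequent terms."""
    return w_word * w_context * max(0.0, margin - log_e_pos + log_e_neg)

# Frequent word/context pair (weights > 1) contributes more to the training signal.
print(max_margin_weighted_loss(log_e_pos=-1.2, log_e_neg=-2.5, w_word=1.8, w_context=0.7))
```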
Weighted Kernel Density Estimation (WKDE) [43,44] based on point weights has proved effective. For the semantic similarity measurement task, a constant weighting assumption-based similarity measure between two concepts/words [45] performs well for the semantic representation of concepts/words but assigns the same weighting relevance to all of them. Later, it was found that a weight propagation mechanism [46,47] for augmenting the input with semantic information achieves the desired performance and removes the equal-weighting curse for concepts/words. Recently, Liu, J. et al. [48] introduced a weighted kernel mechanism for weighted k-means multi-view clustering, where they redefined the objective by assigning weights at the cluster level instead of a global weight for each view, outperforming the existing objective.
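For reference, weighted kernel density estimation simply replaces the uniform 1/n contribution of each sample with a normalized point weight; a minimal Gaussian-kernel sketch follows, with an arbitrarily chosen bandwidth.

```python
import numpy as np

def weighted_kde(x_query, samples, weights, bandwidth=0.5):
    """Weighted Gaussian KDE: f(x) = sum_i w_i K((x - x_i) / h) / (h * sum_i w_i)."""
    u = (x_query - samples) / bandwidth
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return np.sum(weights * kernel) / (bandwidth * np.sum(weights))

samples = np.array([0.0, 0.2, 1.5, 1.7, 1.9])
weights = np.array([3.0, 1.0, 1.0, 2.0, 1.0])   # e.g., frequency-based point weights
print(weighted_kde(1.6, samples, weights))
```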