Short Text Clustering Algorithms: Comparison

Short text clustering (STC) has become a critical task for automatically grouping unlabelled texts into meaningful clusters. STC is a necessary step in many applications, including Twitter personalization, sentiment analysis, spam filtering, customer review analysis and many other social-network-related applications.

  • short text
  • text representation
  • dimensionality reduction
  • clustering techniques
  • short text clustering

1. Introduction

Recently, the number of text documents on the Internet has increased significantly and rapidly. The rapid development of mobile devices and Internet technologies has encouraged users to search for information, communicate with friends and share their opinions and ideas on social media such as Twitter, Instagram, and Facebook and search engines such as Google. The texts generated every day in social media are vast and unstructured data [1].
Most of these generated texts come in the form of short texts and need special analysis compared with formally written ones [2,3]. Short texts can be found across the Internet, including on social media, in product descriptions, in advertisement text, on questions and answers (Q&A) websites [4] and in many other applications. Short texts are distinguished by a lack of context, so finding knowledge in them is difficult; this issue motivates researchers to develop novel, effective methods. Examples of short texts arise in various contexts, such as tweets, search queries, chat messages, online reviews and product descriptions. Short text also presents a challenge in clustering owing to its chaotic nature: it typically contains noise, slang, emojis, misspellings, abbreviations and grammatical errors, and tweets are a good example of these challenges. In addition, short text represents various facets of people’s daily lives; as an illustration, Twitter generates 500 million tweets per day. These short texts can be used in several applications, such as trend detection [5], user profiling [6], event exploration [7], system recommendation [8], online user clustering [9] and cluster-based retrieval [2,10].
With the vast amount of short texts being added to the web every day, extracting valuable information from short text corpora by using data-mining techniques is essential [11,12]. Among the many different data-mining techniques, clustering stands out as a unique technique for short text that provides the exciting potential to automatically recognize valuable patterns from a massive, messy collection of short texts [13]. Clustering techniques focus on detecting similarity patterns in corpus data, automatically detecting groups of similar short texts, and organising documents into semantic and logical structures.
Clustering techniques help governments, organisations and companies monitor social events, interests and trends by identifying various subjects from user-generated content [14,15]. Many users can post short text messages, image captions, search queries and product reviews on social media platforms. Twitter sets a restriction of 280 characters on the length of each tweet [16], and Instagram sets a limit of 2200 characters for each post [17].
Clustering short texts (forum titles, result snippets, frequently asked questions, tweets, microblogs, image or video titles and tags) within groups assigned to topics is an important research subject. Short text clustering (STC) has undergone extensive research in recent years to solve the most critical challenges to the current clustering techniques for short text, which are data sparsity, limited length and high-dimensional representation [18,19,20,21,22,23].
Applying standard clustering techniques directly to a short text corpus creates issues. Traditional clustering techniques such as K-means [24] achieve worse accuracy when grouping short text data than when grouping regular-length documents [25]. One reason is that standard clustering techniques such as K-means [24] and DBSCAN [26] depend on methods that measure the similarity/distance between data objects and on accurate text representations [25]. However, using standard text representation methods for STC, such as term frequency-inverse document frequency (TF-IDF) vectors or bag of words (BOW) [27], leads to sparse, high-dimensional feature vectors that are less distinctive for measuring distance [18,28,29].
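To make the sparsity problem concrete, the following minimal sketch vectorizes a handful of toy short texts with TF-IDF and clusters them with K-means. scikit-learn, the toy texts and all parameter choices are illustrative assumptions, not prescriptions from the cited works.

```python
# Minimal sketch: TF-IDF + K-means on short texts (scikit-learn assumed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "cheap flights to paris",
    "best hotel deals in paris",
    "python list comprehension example",
    "how to sort a list in python",
]

X = TfidfVectorizer().fit_transform(docs)   # sparse matrix: docs x vocabulary

# Each short text activates only a few vocabulary dimensions, so most
# entries are zero -- the sparsity/high-dimensionality problem above.
print(X.shape, "non-zeros per doc:", X.nnz / X.shape[0])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                               # expected: travel docs vs python docs
```

Even on this toy corpus, each document fills only a handful of the vocabulary dimensions; on a real short text corpus the vocabulary grows to tens of thousands of terms while each document still contributes only a few non-zero entries.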

2. Applications of Short Text Clustering

Many clustering methods have been used in several real-world applications. The following disciplines and fields use clustering:
  • Information retrieval (IR): Clustering methods have been used in various applications in information retrieval, including clustering big datasets. In search engines, text clustering plays a critical role in improving document retrieval performance by grouping and indexing related documents [30].
  • Internet of Things (IoT): With the rapid advancement of technology, several domains have focused on IoT. Data collection in the IoT involves using a global positioning system, radio frequency identification technology, sensors and various other IoT devices. Clustering techniques are used for distributed clustering, which is essential for wireless sensor networks [31,32].
  • Biology: Clustering genes and samples makes gene expression data characteristics meaningful; genes can be grouped into clusters based on their expression patterns [33].
  • Industry: Businesses collect large volumes of information about current and prospective customers. For further analysis, customers can be divided into small groups [34].
  • Climate: Recognising global climate patterns necessitates detecting patterns in the oceans and atmosphere. Data clustering seeks to identify atmospheric pressure patterns that significantly impact the climate [35].
  • Medicine: Cluster analysis is used to differentiate among disease subcategories. It can also detect disease patterns in the temporal or spatial distribution [36].

3. Components of Short Text Clustering

Clustering is a widely studied type of data analysis that aims to group a collection of data objects or items into subsets or clusters [37]. Specifically, the main goal of clustering is to generate cohesive groups of similar data elements by grouping related data points into unified clusters. All the documents or objects in the same cluster must be as similar as possible [38]; in other words, similar documents in a cluster share similar topics, so the cluster is internally coherent. At the same time, the distinctions between clusters should be notable: documents or objects in one cluster must be as different as possible from those in other clusters. Text clustering is essential to many real-world applications, such as text mining, online text organisation and automatic information retrieval systems. Fast and high-quality document clustering greatly aids users in successfully navigating, summarizing and organizing large amounts of disorganized data, and it may reveal the structure and content of previously unknown text collections. Clustering attempts to automatically group documents or objects into similar clusters by using various similarity/distance measures [39].

Differentiating between clustering and classification of documents is crucial [40], although the difference may be unclear because in both cases a set of documents must be split into groups. In classification, labelled training data are supplied, and the challenge is to categorize test sets, which consist of unlabelled data, into a predetermined set of classes; the classification problem is usually handled using a supervised learning method [41,42]. By contrast, one of the significant challenges in clustering is grouping a set of unlabelled, non-predefined data into similar groups, and unsupervised learning methods are commonly used to solve it. Furthermore, clustering is used in many data fields that do not rely on predefined knowledge of the data, unlike classification, which requires prior knowledge of the data [43].

Short texts are becoming increasingly common as online social media platforms such as Instagram, Twitter and Facebook grow in size. They have a very limited vocabulary; many words even appear only once. STC therefore significantly affects semantic analysis, demonstrating its importance in various applications, such as IR and summarisation [2]. However, the sparsity of short text representation makes traditional clustering methods unsatisfactory because each short text document contains only a few words. Short text data contain unstructured sentences whose vocabulary diverges greatly from that of regular texts when clustering techniques are applied. Therefore, self-corpus-based expansion has been presented as a semantically aligned alternative that defines and augments concepts in the corpus using clustering techniques [44] or topics based on term-frequency probabilities [45]. However, microblogging data are challenging for any of these methods because of their lack of structure and the small number of co-occurrences among words [46]. Several strategies have been proposed to alleviate the sparsity difficulties caused by the lack of context, such as corpus-based metrics [47] and knowledge-based metrics [25,48]. One of these simple strategies concentrates on data-level enhancements.
The main idea is to combine short text documents to create longer ones [49]; for aggregation, related models utilize metadata or external data [48]. Although these models can alleviate some of the sparsity issues, a drawback remains: they rely on external data to a large extent. STC is more challenging than traditional text clustering. Representations of text in the original lexical space are typically sparse, and this problem is exacerbated for short texts [50]. Therefore, learning an efficient short text representation scheme suitable for clustering is critical to the success of STC. In essence, the major drawback of the standard STC techniques is that they cannot adequately handle the sparseness of words in the documents. Compared with long texts containing rich contexts, distinguishing the clusters of short documents with few words occurring in the training set is more challenging. Most models primarily focus on learning representations from local co-occurrences of words [21]. Understanding how a model works is critical for using and developing text clustering methods. STC generally comprises several steps, as shown in Figure 1.
Figure 1. Components for text-data clustering.
(I) Pre-processing: This is the first step in STC. The data must be cleaned by removing unnecessary characters, words, symbols and digits before text representation methods can be applied. Pre-processing plays an essential role in building an efficient clustering system because short text data (original text) are unsuitable for direct clustering.
(II) Representation: Documents and texts are collections of unstructured data that must be transformed into a structured feature space before mathematical modelling can be used during clustering. The standard text representation techniques can be divided into corpus-based and external-knowledge-based methods.
(III) Dimensionality reduction: Texts or documents represented by traditional techniques often become high-dimensional, and the resulting processing time and storage complexity may slow down data-clustering procedures. Dimensionality reduction is a standard method for dealing with this issue; many researchers employ it to reduce time and memory complexity, even at the risk of a performance drop, and it may be more effective than developing an inexpensive representation directly.
(IV) Similarity measure: This is the fundamental entity in any clustering algorithm. It quantifies how alike two entities are, enabling the most similar elements to be grouped and the shortest distance between related entities to be determined. Distance and similarity have an inverse relationship, so they are used interchangeably; similarity/distance measures are typically computed on the vector representations of the data items.
(V) Clustering techniques: The crucial part of any text clustering system is selecting the best algorithm, which requires a deep conceptual understanding of each approach. The goal of clustering algorithms is to generate internally coherent clusters that are clearly distinct from one another.
(VI) Evaluation: This is the final step of STC. Understanding how the model works is necessary before applying or creating text clustering techniques. Several measures are available to evaluate STC.

3.1. Document Pre-Processing in Short Text Clustering

Document pre-processing plays an essential part in STC because short text data (original text) are unsuitable to be used directly for clustering. A textual document likely contains every type of string, such as digits, symbols, words and phrases, and noisy strings may negatively impact clustering performance, affecting information retrieval [51,52]. The pre-processing phase for STC enhances the overall processing [47]. In this context, the pre-processing step must be applied to documents before clustering if one wants to use machine learning approaches [53]. Pre-processing consists of four steps: tokenization, normalization, stop word removal and stemming. The main pre-processing steps are shown in Figure 2.
Figure 2. Main pre-processing steps.
According to [54], short texts contain many unwanted words that may harm the representation rather than assist it, which validates the benefits of pre-processing documents in STC. Utilizing documents with all their words, including unnecessary ones, is a complicated task. Generally, commonly used words classified under particles, conjunctions and other grammar-based categories may be unsuitable for supporting studies on short text clustering. Furthermore, as suggested by Bruce et al. [55], even standard terms such as ‘go’, ‘gone’ and ‘going’ in the English language are created by derivational and inflectional processes; such words are better handled by removing their inflectional and derivational morphemes so that only the original stems remain. This reduces the number of words in a document whilst preserving the semantic functions of these words. Therefore, a document free of unwanted words is an appropriate pre-processing goal [56,57].
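As a rough illustration of these steps, the sketch below chains normalization, tokenization, stop-word removal and stemming. NLTK and the Porter stemmer are assumed here purely for illustration; the cited works do not prescribe specific tools.

```python
# A minimal pre-processing sketch: normalization, tokenization,
# stop-word removal and stemming (NLTK/Porter are illustrative choices).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text: str) -> list[str]:
    text = text.lower()                                  # normalization
    text = re.sub(r"[^a-z\s]", " ", text)                # strip digits/symbols
    tokens = text.split()                                # simple tokenization
    tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
    return [stemmer.stem(t) for t in tokens]             # stemming: going -> go

print(preprocess("He is GOING to Paris on 3/14!"))       # e.g. ['go', 'pari']
```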

3.2. Document Representation

Even after the noise in the text has been removed during pre-processing, the text is still not structured well enough to produce the best clustering results. Therefore, the text representation step, which converts a word or a full text from its initial form into another, is essential. Directly applying learning algorithms to text information without representing it is impossible [58] because text information has a complex nature [59]. Textual document content must be converted into a concise representation before applying a machine learning approach to the text. Language-independent approaches are particularly successful because they do not depend on the meaning of the language, perform well on noisy text and are efficient [60]. Short text similarity has attracted more attention in recent years, and correctly capturing the semantics shared between documents is challenging owing to lexical diversity and ambiguity [61]. Representing short text is critical in NLP yet challenging owing to its sparsity; high dimensionality; complexity; large volume and much irrelevant, redundant and noisy information [1,62]. As a result, traditional methods of computing semantic similarity are a significant roadblock because they are ineffective in various circumstances. Many existing traditional systems fail to deal with terms not covered by synonyms and cannot handle abbreviations, acronyms, brand names and other terms [63]. Examples of these traditional systems are BOW and TF-IDF, which represent text as real-valued vectors to help with semantic similarity computation. However, these strategies cannot account for the fact that words have diverse meanings and that different words may be used to represent the same concept. For example, consider two sentences: ‘Majid is taking insulin’ and ‘Majid has diabetes’. Although these two sentences have closely related meanings, they do not use the same words; a small sketch after Figure 3 illustrates this lexical gap. These methods capture the lexical features of the text and are simple to implement; however, they ignore the semantic and syntactic features of the text. To address this issue, several studies have expanded and enriched the context of data from an ontology [64,65] or Wikipedia [66,67]. However, these techniques require a great deal of understanding of NLP, and they still use high-dimensional representations for short text, which may waste memory and computing time. Generally, these methods use external knowledge to improve contextual information for short texts. Many short text similarity measurement approaches exist, such as representation-based measurement [68,69], which learns new representations for short text and then measures similarity based on this model [70]. A large number of similarity metrics have previously been proposed in the literature. The researchers choose corpus-based and knowledge-based metrics because of their observed performance in NLP applications. This research explains several representation-based measurement methods, as shown in Figure 3.
Figure 3. Main taxonomies for short text representation.
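The lexical gap in the ‘insulin’/‘diabetes’ example above can be made concrete with a short sketch (scikit-learn assumed): the two sentences share only the surface word ‘Majid’, so their TF-IDF cosine similarity is low despite the related meanings.

```python
# Sketch of the lexical-gap problem: related meaning, little word overlap.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pair = ["Majid is taking insulin", "Majid has diabetes"]
X = TfidfVectorizer().fit_transform(pair)
print(cosine_similarity(X[0], X[1]))  # low score: only 'majid' overlaps
```

Representation-based methods aim to close exactly this gap by learning vectors in which such semantically related texts land near each other.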

3.3. Dimensionality Reduction

Dimensionality reduction is commonly used in machine learning and big data analytics because it aids in analysing large, high-dimensional datasets, and it can benefit tasks like data clustering and classification [71]. Recently, dimensionality-reduction methods have emerged as a promising avenue for improving clustering accuracy [72]. Text sequences in term-based vector models have many features; as a result, the memory and time costs of these methods are prohibitively high. To address this issue, many researchers use dimensionality reduction to reduce the feature-space size [73].
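As a minimal sketch of this step, the following code compresses sparse TF-IDF vectors with truncated SVD, one common dimensionality-reduction technique, before clustering. scikit-learn, the toy texts and the target dimension are illustrative assumptions.

```python
# Sketch: reduce sparse TF-IDF features with truncated SVD, then cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["cheap paris flights", "paris hotel deals",
        "python sort list", "python list tips"]

X = TfidfVectorizer().fit_transform(docs)       # high-dimensional, sparse
X_reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(X.shape, "->", X_reduced.shape, labels)   # far fewer dimensions
```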

3.4. Similarity and Distance Measure

The similarity measure determines the similarity between diverse terms, such as words, sentences, documents or concepts. Its goal is to determine the degree of relevance by matching conceptually similar terms, which are not necessarily lexicographically similar [74]. Generally, the similarity measure is a significant and essential component of any clustering technique because it makes it easier to compare two items, group the most similar elements and entities together and determine the shortest distance between them [75,76]. In other words, distance and similarity have an inverse relationship, so they are used interchangeably. In general, similarity/distance measures are computed using the vector representations of data items. Document similarity is vital in text processing [77]; it calculates the degree to which two text objects may be identical. Moreover, similarity and distance measures are used as a retrieval module in information retrieval. Similarity measurements include cosine, Jaccard and inner products; distance measures include Euclidean distance and KL divergence [78]. An analysis of the literature shows that several similarity metrics have been developed; however, no single similarity metric appears to be the most effective for every task [79].
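A small sketch of three of the measures named above (Jaccard on token sets, cosine and Euclidean on binary vectors over the shared vocabulary); the example data are assumptions for illustration, and note how the distance moves inversely to the similarities.

```python
# Sketch: Jaccard, cosine and Euclidean on two toy token sets.
import math

a = {"short", "text", "clustering"}
b = {"short", "text", "mining"}

jaccard = len(a & b) / len(a | b)            # |intersection| / |union| = 2/4

vocab = sorted(a | b)                        # shared binary vector space
va = [1 if w in a else 0 for w in vocab]
vb = [1 if w in b else 0 for w in vocab]

dot = sum(x * y for x, y in zip(va, vb))
cosine = dot / (math.sqrt(sum(x * x for x in va))
                * math.sqrt(sum(y * y for y in vb)))
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

print(f"jaccard={jaccard:.2f} cosine={cosine:.2f} euclidean={euclidean:.2f}")
# jaccard=0.50 cosine=0.67 euclidean=1.41 -- similarity up, distance down
```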

3.5. Clustering Algorithms

Clustering methods divide a collection of documents into groups or subsets. Clustering algorithms seek to generate clusters that are internally coherent yet distinct from one another; in other words, documents inside one cluster must be as similar as possible, whereas documents in different clusters should be as diverse as possible. The clustering method splits many text messages into many significant clusters. Clustering has become a standard strategy in information retrieval and text mining [80]. Concurrently, text clustering faces various challenges. On the one hand, a text vector is high-dimensional, typically with thousands or even tens of thousands of dimensions. On the other hand, the text vector is generally sparse, making it challenging to identify the cluster centre. Clustering has become an essential means of unsupervised machine learning, attracting many researchers [81,82]. In general, there are three types of clustering algorithms: hierarchical-based clustering, partition-based clustering and density-based clustering. The researchers briefly discuss a few traditional techniques for each category; clustering algorithms have been extensively studied in the literature [83,84].
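A brief sketch of one representative from each family (partition-based K-means, hierarchical agglomerative clustering and density-based DBSCAN) on the same toy TF-IDF vectors; scikit-learn and all parameter values (k, eps, min_samples) are illustrative assumptions, not recommendations from the surveyed works.

```python
# Sketch: one algorithm per clustering family on the same toy vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

docs = ["cheap paris flights", "paris hotel deals",
        "python sort list", "python list tips"]
X = TfidfVectorizer().fit_transform(docs).toarray()  # dense for all three

# Partition-based: requires the number of clusters k up front.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
# Hierarchical: merges the closest documents bottom-up.
print(AgglomerativeClustering(n_clusters=2).fit_predict(X))
# Density-based: no k, but eps is hand-tuned here; -1 would mark noise.
print(DBSCAN(eps=1.3, min_samples=2).fit_predict(X))
```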

3.6. Performance Evaluation Measure

This step provides an overview of the performance measures used to evaluate a proposed model. These performance measures involve comparing the clusters created by the model with the proper clusters. The assessment of clustering results is often called cluster validation. Cluster validity can be employed to identify the number of clusters and to determine the corresponding best partition. Many measures have been suggested for assessing the similarity between two clusterings [85,86]; they may be used to evaluate the effectiveness of various data clustering techniques applied to a given dataset. When assessing the quality of a clustering approach, these measurements are typically related to the kinds of criteria being considered. The term ‘internal assessment’ refers to assessing the clustering outcome using only the clustered data themselves [87]; such methods typically assign the best score to clusterings with a high degree of similarity inside each cluster and a low degree of similarity between clusters. In external assessment, clustering outcomes are evaluated based on data not utilized for clustering, such as known class labels and external benchmarks. It is noteworthy that these external benchmarks are composed of items that have already been categorized, typically by human specialists. These assessment techniques gauge how well the clustering complies with the established benchmark classes [88,89].
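As a minimal sketch, the snippet below computes one internal measure (silhouette, from the clustered data alone) and two external measures (normalized mutual information and adjusted Rand index, against known labels). The toy points, the specific metric choices and scikit-learn are assumptions; the literature surveys many alternatives.

```python
# Sketch: internal (silhouette) vs external (NMI, ARI) cluster validation.
import numpy as np
from sklearn.metrics import (silhouette_score,
                             normalized_mutual_info_score,
                             adjusted_rand_score)

X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
predicted = [0, 0, 1, 1]     # clusters produced by some algorithm
true_labels = [0, 0, 1, 1]   # external benchmark (e.g. human-assigned topics)

print("silhouette:", silhouette_score(X, predicted))        # internal: data only
print("NMI:", normalized_mutual_info_score(true_labels, predicted))
print("ARI:", adjusted_rand_score(true_labels, predicted))  # external: vs labels
```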

References

  1. Yang, S.; Huang, G.; Ofoghi, B.; Yearwood, J. Short text similarity measurement using context-aware weighted biterms. Concurr. Comput. Pract. Exp. 2020, 34, e5765.
  2. Zhang, W.; Dong, C.; Yin, J.; Wang, J. Attentive representation learning with adversarial training for short text clustering. IEEE Trans. Knowl. Data Eng. 2021, 34, 5196–5210.
  3. Yu, Z.; Wang, H.; Lin, X.; Wang, M. Understanding short texts through semantic enrichment and hashing. IEEE Trans. Knowl. Data Eng. 2015, 28, 566–579.
  4. Lopez-Gazpio, I.; Maritxalar, M.; Gonzalez-Agirre, A.; Rigau, G.; Uria, L.; Agirre, E. Interpretable semantic textual similarity: Finding and explaining differences between sentences. Knowl. Based Syst. 2017, 119, 186–199.
  5. Ramachandran, D.; Parvathi, R. Analysis of twitter specific preprocessing technique for tweets. Procedia Comput. Sci. 2019, 165, 245–251.
  6. Vo, D.-V.; Karnjana, J.; Huynh, V.-N. An integrated framework of learning and evidential reasoning for user profiling using short texts. Inf. Fusion 2021, 70, 27–42.
  7. Feng, W.; Zhang, C.; Zhang, W.; Han, J.; Wang, J.; Aggarwal, C.; Huang, J. STREAMCUBE: Hierarchical spatio-temporal hashtag clustering for event exploration over the Twitter stream. In Proceedings of the 2015 IEEE 31st International Conference on Data Engineering, Seoul, Korea, 13–17 April 2015; pp. 1561–1572.
  8. Ailem, M.; Role, F.; Nadif, M. Sparse poisson latent block model for document clustering. IEEE Trans. Knowl. Data Eng. 2017, 29, 1563–1576.
  9. Liang, S.; Yilmaz, E.; Kanoulas, E. Collaboratively tracking interests for user clustering in streams of short texts. IEEE Trans. Knowl. Data Eng. 2018, 31, 257–272.
  10. Carpineto, C.; Romano, G. Consensus clustering based on a new probabilistic rand index with application to subtopic retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2315–2326.
  11. Wang, T.; Brede, M.; Ianni, A.; Mentzakis, E. Detecting and characterizing eating-disorder communities on social media. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, Cambridge, UK, 6–10 February 2017; pp. 91–100.
  12. Song, G.; Ye, Y.; Du, X.; Huang, X.; Bie, S. Short text classification: A survey. J. Multimed. 2014, 9, 635.
  13. Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496.
  14. Zhang, C.; Lei, D.; Yuan, Q.; Zhuang, H.; Kaplan, L.; Wang, S.; Han, J. GeoBurst+ Effective and Real-Time Local Event Detection in Geo-Tagged Tweet Streams. ACM Trans. Intell. Syst. Technol. (TIST) 2018, 9, 1–24.
  15. Yang, S.; Huang, G.; Xiang, Y.; Zhou, X.; Chi, C.-H. Modeling user preferences on spatiotemporal topics for point-of-interest recommendation. In Proceedings of the 2017 IEEE International Conference on Services Computing (SCC), Honolulu, HI, USA, 25–30 June 2017; pp. 204–211.
  16. Alsaffar, D.; Alfahhad, A.; Alqhtani, B.; Alamri, L.; Alansari, S.; Alqahtani, N.; Alboaneen, D.A. Machine and deep learning algorithms for Twitter spam detection. In Proceedings of the International Conference on Advanced Intelligent Systems and Informatics, Cairo, Egypt, 26–28 October 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 483–491.
  17. Shanmugam, S.; Padmanaban, I. A multi-criteria decision-making approach for selection of brand ambassadors using machine learning algorithm. In Proceedings of the 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Uttar Pradesh, India, 28–29 January 2021; pp. 848–853.
  18. Hadifar, A.; Sterckx, L.; Demeester, T.; Develder, C. A self-training approach for short text clustering. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Florence, Italy, 2 August 2019; pp. 194–199.
  19. Jin, J.; Zhao, H.; Ji, P. Topic attention encoder: A self-supervised approach for short text clustering. J. Inf. Sci. 2022, 48, 701–717.
  20. Jinarat, S.; Manaskasemsak, B.; Rungsawang, A. Short text clustering based on word semantic graph with word embedding model. In Proceedings of the 2018 Joint 10th International Conference on Soft Computing and Intelligent Systems (SCIS) and 19th International Symposium on Advanced Intelligent Systems (ISIS), Toyama, Japan, 5–8 December 2018; pp. 1427–1432.
  21. Liu, W.; Wang, C.; Chen, X. Inductive Document Representation Learning for Short Text Clustering; Springer: Berlin/Heidelberg, Germany, 2021.
  22. Qiang, J.; Qian, Z.; Li, Y.; Yuan, Y.; Wu, X. Short text topic modeling techniques, applications, and performance: A survey. IEEE Trans. Knowl. Data Eng. 2020, 34, 1427–1445.
  23. Wei, C.; Zhu, L.; Shi, J. Short Text Embedding Autoencoders with Attention-Based Neighborhood Preservation. IEEE Access 2020, 8, 223156–223171.
  24. Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666.
  25. Xu, J.; Xu, B.; Wang, P.; Zheng, S.; Tian, G.; Zhao, J. Self-taught convolutional neural networks for short text clustering. Neural Netw. 2017, 88, 22–31.
  26. Mistry, V.; Pandya, U.; Rathwa, A.; Kachroo, H.; Jivani, A. AEDBSCAN—Adaptive Epsilon Density-Based Spatial Clustering of Applications with Noise. In Progress in Advanced Computing and Intelligent Engineering; Springer: Berlin/Heidelberg, Germany, 2021; pp. 213–226.
  27. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 1988, 24, 513–523.
  28. Xu, J.; Wang, P.; Tian, G.; Xu, B.; Zhao, J.; Wang, F.; Hao, H. Short text clustering via convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, CO, USA, 5 June 2015; pp. 62–69.
  29. Liu, K.; Bellet, A.; Sha, F. Similarity learning for high-dimensional sparse data. In Artificial Intelligence and Statistics; PMLR: San Diego, CA, USA, 2015; pp. 653–662.
  30. Wahid, A.; Gao, X.; Andreae, P. Multi-objective multi-view clustering ensemble based on evolutionary approach. In Proceedings of the 2015 IEEE Congress on Evolutionary Computation (CEC), Sendai, Japan, 25–28 May 2015; pp. 1696–1703.
  31. Bindhu, V.; Ranganathan, G. Hyperspectral image processing in internet of things model using clustering algorithm. J. ISMAC 2021, 3, 163–175.
  32. AL-Jumaili, A.H.A.; Mashhadany, Y.I.A.; Sulaiman, R.; Alyasseri, Z.A.A. A Conceptual and Systematics for Intelligent Power Management System-Based Cloud Computing: Prospects, and Challenges. Appl. Sci. 2021, 11, 9820.
  33. Oyelade, J.; Isewon, I.; Oladipupo, F.; Aromolaran, O.; Uwoghiren, E.; Ameh, F.; Achas, M.; Adebiyi, E. Clustering algorithms: Their application to gene expression data. Bioinform. Biol. Insights 2016, 10, BBI-S38316.
  34. Güçdemir, H.; Selim, H. Integrating multi-criteria decision making and clustering for business customer segmentation. Ind. Manag. Data Syst. 2015, 115, 1022–1040.
  35. Biabiany, E.; Bernard, D.C.; Page, V.; Paugam-Moisy, H. Design of an expert distance metric for climate clustering: The case of rainfall in the Lesser Antilles. Comput. Geosci. 2020, 145, 104612.
  36. Bu, F.; Hu, C.; Zhang, Q.; Bai, C.; Yang, L.T.; Baker, T. A cloud-edge-aided incremental high-order possibilistic c-means algorithm for medical data clustering. IEEE Trans. Fuzzy Syst. 2020, 29, 148–155.
  37. Ding, Y.; Fu, X. Topical Concept Based Text Clustering Method. In Advanced Materials Research; Trans Tech Publications Ltd.: Lausanne, Switzerland, 2012; Volume 532, pp. 939–943.
  38. Li, R.; Wang, H. Clustering of Short Texts Based on Dynamic Adjustment for Contrastive Learning. IEEE Access 2022, 10, 76069–76078.
  39. Froud, H.; Benslimane, R.; Lachkar, A.; Ouatik, S.A. Stemming and similarity measures for Arabic Documents Clustering. In Proceedings of the 2010 5th International Symposium on I/V Communications and Mobile Network, IEEE Xplore, Rabat, Morocco, 3 December 2010; pp. 1–4.
  40. Agrawal, U.; Soria, D.; Wagner, C.; Garibaldi, J.; Ellis, I.O.; Bartlett, J.M.; Cameron, D.; Rakha, E.A.; Green, A.R. Combining clustering and classification ensembles: A novel pipeline to identify breast cancer profiles. Artif. Intell. Med. 2019, 97, 27–37.
  41. Allahyari, M.; Pouriyeh, S.; Assefi, M.; Safaei, S.; Trippe, E.D.; Gutierrez, J.B.; Kochut, K. A brief survey of text mining: Classification, clustering and extraction techniques. arXiv 2017, arXiv:1707.02919.
  42. Howland, P.; Park, H. Cluster-preserving dimension reduction methods for document classification. In Survey of Text Mining II; Springer: Berlin/Heidelberg, Germany, 2008; pp. 3–23.
  43. Al-Omari, O.M. Evaluating the effect of stemming in clustering of Arabic documents. Acad. Res. Int. 2011, 1, 284.
  44. Jia, C.; Carson, M.B.; Wang, X.; Yu, J. Concept decompositions for short text clustering by identifying word communities. Pattern Recognit. 2018, 76, 691–703.
  45. Mohotti, W.A.; Nayak, R. Corpus-based augmented media posts with density-based clustering for community detection. In Proceedings of the 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), Volos, Greece, 5–7 November 2018; pp. 379–386.
  46. Lau, J.H.; Baldwin, T. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv 2016, arXiv:1607.05368.
  47. Yang, S.; Huang, G.; Cai, B. Discovering topic representative terms for short text clustering. IEEE Access 2019, 7, 92037–92047.
  48. Jin, O.; Liu, N.N.; Zhao, K.; Yu, Y.; Yang, Q. Transferring topical knowledge from auxiliary long texts for short text clustering. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Glasgow, Scotland, UK, 24–28 October 2011; pp. 775–784.
  49. Mehrotra, R.; Sanner, S.; Buntine, W.; Xie, L. Improving lda topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 28 July–1 August 2013; pp. 889–892.
  50. Aggarwal, C.C.; Zhai, C. A survey of text clustering algorithms. In Mining Text Data; Springer: Berlin/Heidelberg, Germany, 2012; pp. 77–128.
  51. Palanivinayagam, A.; Nagarajan, S. An optimized iterative clustering framework for recognizing speech. Int. J. Speech Technol. 2020, 23, 767–777.
  52. Kanimozhi, K.; Venkatesan, M. A novel map-reduce based augmented clustering algorithm for big text datasets. In Data Engineering and Intelligent Computing; Springer: Berlin/Heidelberg, Germany, 2018; pp. 427–436.
  53. Obaid, H.S.; Dheyab, S.A.; Sabry, S.S. The impact of data pre-processing techniques and dimensionality reduction on the accuracy of machine learning. In Proceedings of the 2019 9th Annual Information Technology, Electromechanical Engineering and Microelectronics Conference (IEMECON), Jaipur, India, 13–15 March 2019; pp. 279–283.
  54. Croft, W.B.; Metzler, D.; Strohman, T. Search Engines: Information Retrieval in Practice; Addison-Wesley: London, UK, 2010; Volume 520.
  55. Cambazoglu, B.B. Review of “Search Engines: Information Retrieval in Practice” by Croft, Metzler and Strohman. Inf. Process. Manag. 2010, 46, 377–379.
  56. Kaur, J.; Buttar, P.K. A systematic review on stopword removal algorithms. Int. J. Future Revolut. Comput. Sci. Commun. Eng. 2018, 4, 207–210.
  57. Al-Shalabi, R.; Kanaan, G.; Jaam, J.M.; Hasnah, A.; Hilat, E. Stop-word removal algorithm for Arabic language. In Proceedings of the 2004 International Conference on Information and Communication Technologies: From Theory to Applications, Damascus, Syria, 19–23 April 2004; p. 545.
  58. Ahmed, M.H.; Tiun, S. K-means based algorithm for islamic document clustering. In Proceedings of the International Conference on Islamic Applications in Computer Science and Technologies (IMAN 2013), Selangor, Malaysia, 1–2 July 2013; pp. 2–9.
  59. Abdulameer, A.S.; Tiun, S.; Sani, N.S.; Ayob, M.; Taha, A.Y. Enhanced clustering models with wiki-based k-nearest neighbors-based representation for web search result clustering. J. King Saud Univ. Comput. Inf. Sci. 2020, 34, 840–850.
  60. Khreisat, L. Arabic Text Classification Using N-Gram Frequency Statistics: A Comparative Study. DMIN 2006, 2006, 78–82.
  61. Zakaria, T.N.T.; Ab Aziz, M.J.; Mokhtar, M.R.; Darus, S. Semantic similarity measurement for Malay words using WordNet Bahasa and Wikipedia Bahasa Melayu: Issues and proposed solutions. Int. J. Softw. Eng. Comput. Syst. 2020, 6, 25–40.
  62. Yin, J.; Wang, J. A dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 233–242.
  63. Sabah, A.; Tiun, S.; Sani, N.S.; Ayob, M.; Taha, A.Y. Enhancing web search result clustering model based on multiview multirepresentation consensus cluster ensemble (mmcc) approach. PLoS ONE 2021, 16, e0245264.
  64. Fodeh, S.; Punch, B.; Tan, P.-N. On ontology-driven document clustering using core semantic features. Knowl. Inf. Syst. 2011, 28, 395–421.
  65. Osman, M.A.; Noah, S.A.M.; Saad, S. Ontology-Based Knowledge Management Tools for Knowledge Sharing in Organization—A Review. IEEE Access 2022, 10, 43267–43283.
  66. Banerjee, S.; Ramanathan, K.; Gupta, A. Clustering short texts using wikipedia. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, 23–27 July 2007; pp. 787–788.
  67. Zakaria, T.N.T.; Ab Aziz, M.J.; Mokhtar, M.R.; Darus, S. Text Clustering for Reducing Semantic Information in Malay Semantic Representation. Asia-Pac. J. Inf. Technol. Multimed. 2020, 9, 11–24.
  68. Mueller, J.; Thyagarajan, A. Siamese recurrent architectures for learning sentence similarity. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
  69. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
  70. Zainodin, U.Z.; Omar, N.; Saif, A. Semantic measure based on features in lexical knowledge sources. Asia-Pac. J. Inf. Technol. Multimed. 2017, 6, 39–55.
  71. Allaoui, M.; Kherfi, M.L.; Cheriet, A. Considerably improving clustering algorithms using UMAP dimensionality reduction technique: A comparative study. In International Conference on Image and Signal Processing; Springer: Berlin/Heidelberg, Germany, 2020; pp. 317–325.
  72. Swesi, I.M.A.O.; Bakar, A.A. Feature clustering for PSO-based feature construction on high-dimensional data. J. Inf. Commun. Technol. 2019, 18, 439–472.
  73. Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey. Information 2019, 10, 150.
  74. Little, C.; Mclean, D.; Crockett, K.; Edmonds, B. A semantic and syntactic similarity measure for political tweets. IEEE Access 2020, 8, 154095–154113.
  75. Alian, M.; Awajan, A. Factors affecting sentence similarity and paraphrasing identification. Int. J. Speech Technol. 2020, 23, 851–859.
  76. Alkoffash, M.S. Automatic Arabic Text Clustering using K-means and K-mediods. Int. J. Comput. Appl. 2012, 51, 5–8.
  77. Lin, Y.-S.; Jiang, J.-Y.; Lee, S.-J. A similarity measure for text classification and clustering. IEEE Trans. Knowl. Data Eng. 2013, 26, 1575–1590.
  78. Huang, A. Similarity measures for text document clustering. In Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, 14–18 April 2008; Volume 4, pp. 9–56.
  79. Froud, H.; Lachkar, A.; Ouatik, S.A. Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering. arXiv 2013, arXiv:1302.1612.
  80. Guangming, G.; Yanhui, J.; Wei, W.; Shuangwen, Z. A Clustering Algorithm Based on the Text Feature Matrix of Domain-Ontology. In Proceedings of the 2013 Third International Conference on Intelligent System Design and Engineering Applications, Hong Kong, China, 16–18 January 2013; pp. 13–16.
  81. Abualigah, L.M.Q. Feature Selection and Enhanced Krill Herd Algorithm for Text Document Clustering; Springer: Berlin/Heidelberg, Germany, 2019.
  82. Liu, F.; Xiong, L. Survey on text clustering algorithm-Research present situation of text clustering algorithm. In Proceedings of the 2011 IEEE 2nd International Conference on Software Engineering and Service Science, Beijing, China, 15–17 July 2011; pp. 196–199.
  83. Reddy, C.K.; Vinzamuri, B. A survey of partitional and hierarchical clustering algorithms. In Data Clustering; Chapman and Hall/CRC: New York, NY, USA, 2018; pp. 87–110.
  84. Bhattacharjee, P.; Mitra, P. A survey of density based clustering algorithms. Front. Comput. Sci. 2021, 15, 1–27.
  85. Karaa, W.B.A.; Ashour, A.S.; Sassi, D.B.; Roy, P.; Kausar, N.; Dey, N. Medline text mining: An enhancement genetic algorithm based approach for document clustering. In Applications of Intelligent Optimization in Biology and Medicine; Springer: Berlin/Heidelberg, Germany, 2016; pp. 267–287.
  86. Durairaj, M.; Vijitha, C. Educational data mining for prediction of student performance using clustering algorithms. Int. J. Comput. Sci. Inf. Technol. 2014, 5, 5987–5991.
  87. Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061.
  88. Qiang, J.; Li, Y.; Yuan, Y.; Wu, X. Short text clustering based on Pitman-Yor process mixture model. Appl. Intell. 2018, 48, 1802–1812.
  89. Punitha, S.; Jayasree, R.; Punithavalli, M. Partition document clustering using ontology approach. In Proceedings of the 2013 International Conference on Computer Communication and Informatics, Coimbatore, Tamil Nadu, India, 4–6 January 2013; pp. 1–5.