Semantic textual similarity (STS) is a challenging task in Natural Language Processing (NLP) and text mining and consists in verifying the degree of similarity between two pieces of text based on their meaning. STS is closely related to the field of distributional semantics, which focuses on developing theories and methods for representing and acquiring the semantic properties of linguistic items based on their distributional properties in text corpora
[1].
From a linguistic perspective, distributional semantics rests on a simple assumption called the distributional hypothesis. According to this hypothesis, the more semantically similar two words are, the more they tend to appear in the same, or similar, linguistic contexts, because “difference of meaning correlates with difference of distribution”
[1]. From a computational perspective, we can leverage this hypothesis by representing words as vectors encoding the properties of their contexts in a vector space. Their (semantic) similarity is then given by the distance between their respective vector representations. We usually refer to vector representations of words as
word embeddings [1].
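As a minimal sketch of how similarity can be read off vector representations, the snippet below computes cosine similarity between word vectors. The three-dimensional vectors and the names `toy_embeddings` and `cosine_similarity` are hypothetical and chosen only for illustration; real embeddings are learned from corpora and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

# Words whose vectors point in similar directions score closer to 1.
print(f"cat/dog: {sim_cat_dog:.3f}  cat/car: {sim_cat_car:.3f}")
```

Cosine similarity depends only on the angle between vectors, not on their length, which is one reason it is a common choice for comparing embeddings.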
The NLP field has made extensive use of the distributional hypothesis and the distributional properties of words to encode their meaning. While the earliest attempts exploited co-occurrence matrices to represent words based on their contexts, more modern approaches leverage machine learning and deep learning in the form of
neural language models (NLMs). These models assign a probability distribution to sequences of tokens, where a token usually approximates the concept of a word and is always a discrete entity. Among such models, the earliest ones typically employed unsupervised learning to obtain fixed-length representations of words
[2] and sentences
[3]. For words, a prominent example is Word2Vec, which comes in two variants: CBOW (Continuous Bag of Words) and SGNS (Skip-Gram with Negative Sampling)
[2]. For sentences, Doc2Vec extends Word2Vec with document-level representations. It represents documents as dense vectors that can be used in downstream tasks such as document classification, document similarity, and information retrieval
[3].
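To make the skip-gram with negative sampling idea more concrete, the sketch below builds the kind of training examples such a model consumes: observed (center, context) pairs labelled 1 and randomly drawn noise pairs labelled 0. The function names, the window size, and the uniform noise sampling are simplifying assumptions made here; the original Word2Vec implementation draws noise words from a smoothed unigram distribution and then trains embeddings on these examples, a step not shown.

```python
import random

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def add_negative_samples(pairs, vocabulary, k=2, seed=0):
    """Label observed pairs 1 and add up to k unobserved pairs labelled 0."""
    rng = random.Random(seed)
    observed = set(pairs)
    examples = []
    for center, context in pairs:
        examples.append((center, context, 1))
        # Assumption: uniform sampling over words never seen with `center`.
        candidates = [w for w in vocabulary
                      if w != center and (center, w) not in observed]
        for noise in rng.sample(candidates, min(k, len(candidates))):
            examples.append((center, noise, 0))
    return examples

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
examples = add_negative_samples(pairs, sorted(set(tokens)), k=2)

for center, other, label in examples[:6]:
    print(center, other, label)
```

A classifier trained to separate the two labels ends up placing words that share contexts close together, which is the distributional hypothesis put to work.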
2. Semantic Textual Similarity
Over the years, several architectures have been proposed, such as ELMo (Embeddings from Language Models) and LSTM (Long Short-Term Memory)-based language models. In recent years, Transformer-based NLMs have established themselves as the de facto standard for many NLP tasks. Transformers, such as BERT (Bidirectional Encoder Representations from Transformers), are a type of neural network for sequence transduction that relies on a self-attention mechanism. They are able to deal with complex tasks involving human language, achieving state-of-the-art results
[4]. A very attractive aspect of BERT-like architectures is that their internal representations of words and sequences are context-aware. The attention mechanism in Transformers lets the model relate words within a sentence or across larger portions of a text, establishing deep connections between them. Furthermore, researchers have proposed other architectures with attention-based mechanisms, such as ALBERT
[5] and DistilBERT
[6], which have gained significant attention and continue to be exploited by the NLP community. Nevertheless, BERT and similar models are not without limitations.
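The self-attention mechanism mentioned above can be sketched as scaled dot-product attention: each token's output is a weighted average of all value vectors, with weights given by a softmax over query-key dot products. This is a deliberately simplified illustration, not BERT's implementation; it assumes a single head, omits the learned projection matrices, and the function names are chosen here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return one context vector per query (single head, no projections)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        context = [sum(w * v[d] for w, v in zip(weights, values))
                   for d in range(len(values[0]))]
        outputs.append(context)
    return outputs

# Toy token vectors used as queries, keys, and values at once (self-attention).
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = scaled_dot_product_attention(token_vectors, token_vectors, token_vectors)
print(contextual)
```

Because every output mixes information from all positions, the representation of a token changes with its surroundings, which is what makes these embeddings context-aware.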
One such limitation is evident in semantic textual similarity tasks, particularly when working with sequence-level embeddings
[7]. It is well known that BERT’s sequence-level embeddings are not directly trained to encode the semantics of the sequences and, thus, are not suited to compare them with standard metrics such as cosine similarity
[4]. To overcome these limitations, Sentence-BERT
[7] was proposed. Sentence-BERT is a modification of the pre-trained BERT network that uses Siamese and triplet network structures. It produces semantically meaningful sentence embeddings that can be compared using a similarity measure (for example, cosine similarity or Manhattan/Euclidean distance). Finding the most similar pair in a collection of 10,000 sentences takes about 65 hours with BERT/RoBERTa but only about 5 seconds with Sentence-BERT, while maintaining the accuracy achieved by BERT
[7]. Research has been conducted to design and evaluate various approaches for employing Siamese networks, similarity concepts, one-shot learning, and context/memory awareness in textual data
[8]. Furthermore, recent efforts have focused on developing an unsupervised contrastive learning method that transforms pre-trained language models into universal text encoders, as seen with Mirror-BERT
[9] and subsequent models
[10].
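The size of the speed-up reported for Sentence-BERT follows from simple counting. The figures in [7] refer to a collection of 10,000 sentences: scoring pairs jointly with BERT requires one forward pass per unordered pair, whereas Sentence-BERT encodes each sentence once and compares the cached vectors afterwards. The sketch below only reproduces this count; the function names are chosen here for illustration.

```python
def cross_encoder_passes(n_sentences):
    """Forward passes when every unordered sentence pair is scored jointly."""
    return n_sentences * (n_sentences - 1) // 2

def bi_encoder_passes(n_sentences):
    """Forward passes when each sentence is embedded once, independently."""
    return n_sentences

n = 10_000
pairwise = cross_encoder_passes(n)
independent = bi_encoder_passes(n)

print(f"pairwise scoring:    {pairwise:,} forward passes")
print(f"independent encoding: {independent:,} forward passes")
```

The independently encoded vectors still have to be compared pair by pair, but a cosine similarity between two fixed-length vectors is far cheaper than a forward pass through a Transformer.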
In the last few years, we have also seen the rise of large language models (LLMs), whose parameter counts reach tens or even hundreds of billions. These models differ from their predecessors in scale and, in some instances, incorporate reinforcement learning techniques during training. Prominent examples of large language models include OpenAI’s GPT
[11], Meta’s LLaMA
[12], and Hugging Face’s BLOOM
[13]. LLMs typically outperform their smaller counterparts and are known for their zero-shot learning capabilities and emergent abilities
[14]. However, they are affected by two significant limitations. First, most of these models are controlled by private companies and are only accessible via APIs. Second, the computational costs of such models often pose challenges for running them on standard commercial hardware without resorting to parameter selection and/or distillation techniques.
The problem of semantic textual similarity between pairs of sentences has been discussed in several papers, and some studies have addressed the issue of extracting semantic differences from texts. In particular, research has been carried out on ideological discourse in newspaper texts. Following this idea, others investigated the utility of applying text-mining techniques to support the discourse analysis of news reports. They identified contrast patterns that highlight ideological differences between local and international press coverage
[15]. Recently, critical discourse analysis has been applied to investigate ideological differences in how the same news is reported across various topics, such as COVID-19 coverage in Iranian and American newspapers
[16], and the representation of Syrian refugees in Turkey, considering three Turkish newspapers
[17].
Sentiment analysis has also been a key area of focus
[18]. Finally, several studies have been conducted in completely different domains, such as scholarly documents. A hybrid model that considers both section headers and body text to automatically recognize generic sections in scholarly documents was proposed in
[19].