Recently published works have outlined a series of insights into diverse TS characteristics. For instance, Ref. [17] assessed various techniques used by extractive summarization approaches, as well as the associated evaluation metrics; Ref. [18] presented an overview of TS datasets, approaches, and evaluation schemas; Ref. [4] elaborated on a comprehensive classification schema of TS approaches, based on their underlying techniques, and performed a comparative assessment of their performance through the use of various metrics; Ref. [19] focused on extractive TS approaches, evaluation metrics, and their limitations; and Ref. [3] offered a comprehensive survey of TS and keyword extraction, the latter being a sibling task of TS. In addition, Refs. [2,20] are two comprehensive surveys of TS applications, approaches, datasets, and evaluation techniques, which also report on associated limitations and challenges. In contrast to these works, Ref. [21] reported a comprehensive survey that focused only on abstractive TS approaches, taking into account recent deep learning approaches, while also presenting their comparative evaluation using various versions of the ROUGE metric [22].
2. Extractive Approaches
The goal of extractive approaches is to extract the most important sentences of the document under consideration. These are assembled into a concise summary that captures the most significant aspects of the original text. Various algorithms have been proposed for extractive summarization, each utilizing different techniques for the sentence ranking and extraction step, including: (i) statistical ones, which utilize statistical metrics such as word or sentence frequency; (ii) graph-based ones, which model the document as a graph of sentences and then utilize graph theory concepts (e.g., centrality and community detection measures); and (iii) semantic-based ones, which model sentences and their terms as a co-occurrence matrix, which is then analyzed using distributional semantics [25]. In this context, this subsection discusses some of the most prominent approaches in extractive summarization, namely Luhn, LSA, TextRank, LexRank, PositionRank, and TopicRank.
Luhn [26] is one of the earliest approaches in extractive summarization. It utilizes statistical analysis to rank each sentence of a given text, based on the frequency of the most important words and their relative position in that sentence. The highest-scoring sentences are extracted to form the final summary. A limitation of this approach is that it focuses only on individual words and does not consider the relationships between words or sentences.
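The scoring idea can be illustrated with a minimal sketch. It assumes a tiny hypothetical stop-word list, a hypothetical frequency threshold (`min_freq`), and the commonly described cluster score (squared number of significant words divided by the span they cover). As a simplification, all significant words of a sentence are treated as one cluster; this is not the exact procedure of [26].

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", text.lower())


def luhn_scores(sentences, min_freq=2):
    """Score sentences by the density of frequent ("significant") words."""
    words = [w for s in sentences for w in tokenize(s) if w not in STOP_WORDS]
    freq = Counter(words)
    significant = {w for w, count in freq.items() if count >= min_freq}

    scores = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        positions = [i for i, w in enumerate(tokens) if w in significant]
        if not positions:
            scores.append(0.0)
            continue
        # Span from the first to the last significant word (one cluster).
        span = positions[-1] - positions[0] + 1
        scores.append(len(positions) ** 2 / span)
    return scores
```

The highest-scoring sentences would then be selected, in document order, to form the summary.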
Latent semantic analysis (LSA) was one of the earliest techniques used to model the semantic relationships between words and capture key concepts in a document [27]. For the task of TS, the work of [28] proposed an LSA-based technique, which models a document as a term-sentence matrix that represents the frequency of each word in each sentence of the document. It then applies singular value decomposition (SVD) to extract the most important semantic features of the document in order to rank and extract the most important sentences. However, some drawbacks of this approach concern the dimensionality and the selection of sentences. To address them, Ref. [29] built a semantic-based approach using LSA that also used more advanced algorithms. Despite such improvements, summarization approaches built on LSA can be computationally expensive, especially for larger texts, due to the use of SVD [30].
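As a rough illustration of the term-sentence matrix and its decomposition, the sketch below builds the matrix from raw counts and approximates only the leading right singular vector by power iteration, ranking sentences by their weight in that single dominant concept. A full LSA summarizer would compute a complete SVD and use several singular vectors; the function names and the single-vector ranking rule are assumptions made for brevity.

```python
import math
import re


def term_sentence_matrix(sentences):
    """Rows are vocabulary terms, columns are sentences, entries are counts."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    return [[tokens.count(term) for tokens in tokenized] for term in vocab]


def leading_right_singular_vector(matrix, iterations=100):
    """Power iteration on A^T A, whose top eigenvector is the first
    right singular vector of A (one weight per sentence)."""
    n = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            return v
        v = [x / norm for x in w]
    return v


def rank_sentences_lsa(sentences):
    """Return sentence indices ordered by weight in the dominant concept."""
    v = leading_right_singular_vector(term_sentence_matrix(sentences))
    return sorted(range(len(sentences)), key=lambda i: abs(v[i]), reverse=True)
```

The Gram matrix used here grows quadratically with the number of sentences, which hints at why SVD-based approaches become costly for long documents.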
Graph-based algorithms form another family of extractive summarization approaches, which addresses some limitations of earlier approaches by performing fast and scalable summarization. One of the earliest and most prominent graph-based ranking approaches is TextRank [31]. The first step of this approach is the representation of the document as a weighted graph of sentences. The sentences of the document are represented as nodes and the relationships between them as edges. An edge between two sentences indicates that they are similar, with the similarity measured as a function of their overlapping content. After the graph is created, the PageRank centrality algorithm [32] is applied to rank each sentence based on its connections to the other ones. Finally, the top-ranked sentences are selected to form a summary of the input document. The number of extracted sentences can be set as a user-defined parameter for the termination of the algorithm.
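The pipeline described above (sentence graph, PageRank, top-k selection) can be sketched in a few lines. The similarity function below normalizes the number of shared words by the logarithms of the sentence lengths, which is one common formulation; the damping factor, iteration count, and all function names are illustrative assumptions.

```python
import math
import re


def tokenize(sentence):
    return re.findall(r"[a-z]+", sentence.lower())


def overlap_similarity(a, b):
    """Shared words, normalized by the log lengths of both sentences."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0:  # both sentences contain a single token
        return 0.0
    return len(set(a) & set(b)) / denom


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by simple fixed-point iteration."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(weights[j][i] / out_sum[j] * scores[j]
                            for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def textrank_summary(sentences, k=2):
    """Return the k top-ranked sentences, kept in document order."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [[overlap_similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    scores = pagerank(weights)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Here `k` plays the role of the user-defined number of extracted sentences.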
LexRank [33] is another graph-based algorithm that relies on PageRank. Its key difference is that each sentence is represented as a vector of the TF-IDF (term frequency-inverse document frequency) scores of the words it contains, while the relationship between these sentence vectors is measured using cosine similarity. A similarity matrix is created in which each sentence corresponds to one row and one column, and each element is the cosine similarity between the respective sentence vectors. Only similarities above a given threshold are included. To rank the sentences, PageRank is applied. The number of selected sentences can be set similarly to TextRank. Other graph-based approaches that build on TextRank are TopicRank [34] and PositionRank [35]. TopicRank uses a topic-modelling technique, which clusters sentences with similar topics and extracts the most important sentences of each cluster. PositionRank considers both the distribution of term positions in a text and the term frequencies in a biased PageRank to rank sentences.
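A minimal sketch of the LexRank steps (TF-IDF vectors, thresholded cosine similarity matrix, PageRank) is given below. The smoothed IDF formula, the threshold value, the exclusion of self-loops, and the function names are assumptions chosen for illustration rather than details taken from [33].

```python
import math
import re
from collections import Counter


def tfidf_vectors(sentences):
    """One sparse TF-IDF vector (dict) per sentence; each sentence is a 'document'."""
    docs = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption) keeps every term weight strictly positive.
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
             for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    vectors = tfidf_vectors(sentences)
    n = len(vectors)
    # Binary adjacency: keep an edge only if similarity exceeds the threshold.
    adj = [[1.0 if i != j and cosine(vectors[i], vectors[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(adj[j][i] / degree[j] * scores[j]
                            for j in range(n) if degree[j] > 0)
            for i in range(n)
        ]
    return scores
```

Raising `threshold` makes the graph sparser, so fewer sentence pairs contribute to each other's rank.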
Many word embedding models have been developed since the introduction of the pioneering Word2Vec model [36]. Their goal is to capture semantic information for textual terms, thus increasing the accuracy of various NLP tasks. These embeddings are calculated for each term, and their mean vector serves as the document embedding. Recent advancements in deep learning allow the inference of sentence embeddings [37] from pretrained language models, while achieving better accuracy than earlier models.
3. Abstractive Approaches
The need for abstractive approaches resulted from a major drawback of extractive approaches, namely the lack of readability and coherence of the produced text, since extractive approaches utilize simple heuristics to extract and concatenate the most relevant sentences, without accounting for grammatical or syntactical rules [18]. To generate a fluent and coherent summary, more contextual information about the tokens of the input text is required; thus, a family of models is needed that generates new phrases in a manner similar to the paraphrasing process of a human reader [2,4]. Many models for abstractive summarization have already been proposed in the literature. As seen in a recent survey [2], these include graph-based [41], rule-based [42], and semantic modelling [43] approaches. These earlier models, however, do not utilize recent advancements in deep learning, which improve many NLP tasks. Newer abstractive summarization approaches build on deep learning models, including: (i) convolutional neural networks (CNNs) and recurrent neural networks (RNNs); and (ii) LSTM and GRU, which improve on the original RNNs and are discussed in [44]. Other neural architectures that are not based on CNNs and RNNs include generative adversarial networks (GANs). Certain works use these to build their abstractive approaches, as described in [45,46]. However, these yield lower evaluation scores (i.e., ROUGE) than recent deep learning models, which rely on the Transformer model explained in the next paragraph, as validated in [21].
Transformer [47] is a deep learning model that consists of a series of encoder and decoder layers, which utilize the attention mechanism to model the global dependencies of sequential data [48]. Specifically, the self-attention mechanism assigns different weights to different parts of the input, according to their contextual significance. These are encoded in hidden state layers when generating the output sequence. In addition, Transformer models use multi-head attention, which means that attention is applied in parallel to capture different patterns and relationships of the input data. Following the encoder-decoder design, the encoder maps the input into hidden representations and the decoder then decodes them to generate the output. These models are semisupervised, due to their unsupervised pretraining on large datasets, followed by supervised finetuning. Approaches built on this model achieve state-of-the-art performance on various text generation tasks, including abstractive summarization. Recent surveys [20,21] discussed and evaluated the differences between earlier abstractive approaches, including those that utilize deep learning models proposed before the introduction of the Transformer architecture.
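To make the weighting step concrete, the following sketch implements single-head scaled dot-product attention on plain Python lists. It is a simplification under stated assumptions: the learned query, key, and value projections, masking, and multiple heads of the actual Transformer are omitted, and the input vectors are used directly as queries, keys, and values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights); each output is a weighted mix of the values."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs, all_weights


# Self-attention on three toy 2-dimensional token vectors (hypothetical data).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to one and shows how strongly a token attends to every other token; multi-head attention would run several such computations in parallel on different projections.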
T5 [49], which stands for text-to-text transfer transformer, is an approach that closely follows the Transformer architecture. It provides a general framework which converts multiple NLP tasks into sequential text-to-text ones. To address each task, it prepends a task-specific prefix to the input sequence. The pretraining process comprises both supervised and unsupervised training. The unsupervised objective masks random spans of tokens, replacing each span with a unique sentinel token. The “corrupted” sentence is passed to the encoder, while the decoder learns to predict the dropped-out tokens on the output layer. A follow-up approach, namely mT5 [50], builds on T5 to provide multilingual pretrained baseline models, which can be further finetuned to address diverse downstream tasks in multiple natural languages.
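The span-masking objective can be illustrated with a small deterministic sketch. The spans are passed in explicitly instead of being sampled at random, and the sentinel naming `<extra_id_N>` follows the convention of the Hugging Face T5 tokenizer; both choices, as well as the function name, are assumptions for illustration.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a unique sentinel token.

    `spans` must be sorted and non-overlapping. Returns the corrupted
    encoder input and the decoder target as token lists.
    """
    corrupted, target = [], []
    cursor = 0
    for idx, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{idx}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # mark the dropped span
        target.append(sentinel)                 # target repeats the sentinel...
        target.extend(tokens[start:start + length])  # ...then the dropped tokens
        cursor = start + length
    corrupted.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")   # final sentinel closes the target
    return corrupted, target


tokens = "Thank you for inviting me to your party last week .".split()
model_input, model_target = span_corrupt(tokens, [(2, 2), (8, 1)])
```

For this input, `model_input` reads "Thank you <extra_id_0> me to your party <extra_id_1> week ." and `model_target` reads "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>".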
BART [51], which stands for bidirectional auto-regressive transformers, is a multitask deep learning approach, with abstractive summarization being one of the tasks it addresses. BART utilizes a “denoising” autoencoder that learns the associations between a document and its “corrupted” form, which is produced using various textual transformations. These include random token masking or deletion, text infilling, sentence permutation, and document rotation. This autoencoder is implemented as a sequence-to-sequence model with a bidirectional encoder and a left-to-right autoregressive decoder. For its pretraining, it optimizes a reconstruction loss (cross-entropy) function, under which the decoder learns to assign higher probability to the tokens found in the original document.
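Four of the listed noising transformations are simple enough to sketch directly. The snippet below is only an illustration under assumptions: the `<mask>` string, the probabilities, and the function names are hypothetical, and text infilling (which replaces variable-length spans with a single mask) is omitted.

```python
import random

MASK = "<mask>"  # placeholder mask symbol (an assumption)


def mask_tokens(tokens, prob, rng):
    """Token masking: replace each token with MASK with probability `prob`."""
    return [MASK if rng.random() < prob else t for t in tokens]


def delete_tokens(tokens, prob, rng):
    """Token deletion: drop each token with probability `prob`."""
    return [t for t in tokens if rng.random() >= prob]


def permute_sentences(sentences, rng):
    """Sentence permutation: shuffle the order of the sentences."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def rotate_document(tokens, rng):
    """Document rotation: start the document at a randomly chosen token."""
    if not tokens:
        return []
    start = rng.randrange(len(tokens))
    return tokens[start:] + tokens[:start]


rng = random.Random(0)  # fixed seed for reproducibility
original = "the quick brown fox jumps over the lazy dog".split()
corrupted = rotate_document(mask_tokens(original, 0.3, rng), rng)
```

During pretraining, the model would receive `corrupted` as input and be trained to reconstruct `original`.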
PEGASUS [9], which stands for pretraining with extracted gap-sentences for abstractive summarization, is a deep learning approach pretrained solely for the downstream task of abstractive summarization. It introduces a novel pretraining objective for Transformer-based models, called gap sentences generation (GSG). This objective is specifically designed for the task of abstractive text summarization, as it involves the masking of whole sentences, rather than the smaller text spans used in previous attempts. By doing so, it creates a “gap” in the input document, which the model is then trained to fill in by considering the rest of the sentences. Another key advantage of this approach is that the masked sentences are not selected randomly, as suggested in earlier approaches, but by a technique that ranks sentences based on their importance in the document.
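A minimal sketch of how a GSG training pair could be built is shown below. It assumes that sentence importance is approximated by the unigram F1 overlap between a sentence and the rest of the document, and it uses a hypothetical `<mask_1>` placeholder; the function names and the number of gaps are illustrative choices.

```python
import re
from collections import Counter

MASK_SENTENCE = "<mask_1>"  # placeholder for a masked sentence (an assumption)


def unigram_f1(candidate, reference):
    """F1 over unigram counts shared by two token lists."""
    c, r = Counter(candidate), Counter(reference)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def gap_sentence_example(sentences, num_gaps=1):
    """Mask the most 'important' sentences; they become the generation target."""
    tokens = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    scores = []
    for i, sent in enumerate(tokens):
        rest = [w for j, other in enumerate(tokens) if j != i for w in other]
        scores.append(unigram_f1(sent, rest))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:num_gaps])
    model_input = [MASK_SENTENCE if i in chosen else s
                   for i, s in enumerate(sentences)]
    target = [sentences[i] for i in sorted(chosen)]
    return model_input, target
```

The masked sentences act as a pseudo-summary, which is why this objective resembles the downstream summarization task.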
The rapidly increasing size and computational complexity of large pretrained models, as noted in [52], prompted researchers to explore methods to compress them into smaller versions that maintain high accuracy while offering faster inference. One such example is the work of [53], which proposes various compression techniques, including: (i) direct knowledge distillation (KD), which transfers knowledge from a large model, referred to as the “teacher” model, into a smaller “distilled” model, referred to as the “student” model; (ii) pseudo-labels, which replace the ground truth target documents of the student model with those generated by the teacher; and (iii) shrink and finetune (SFT), which shrinks the teacher model to student size by copying a subset of layers and then finetuning the student model again. They also provide various “distilled” versions of large pretrained models produced by the BART and PEGASUS approaches.
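Two of these ideas can be sketched compactly. The first function shows one common formulation of a distillation loss (cross-entropy between temperature-softened teacher and student distributions); the second picks evenly spaced teacher layers to copy into a smaller student. Both are assumptions for illustration and are not claimed to match the exact losses or layer choices of [53].

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft distribution."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(teacher, student))


def select_layers(num_teacher_layers, num_student_layers):
    """Indices of evenly spaced teacher layers to copy into the student."""
    if num_student_layers == 1:
        return [0]
    step = (num_teacher_layers - 1) / (num_student_layers - 1)
    return [round(i * step) for i in range(num_student_layers)]
```

For example, `select_layers(12, 3)` keeps the first, a middle, and the last layer of a 12-layer teacher; the student would then be finetuned, optionally with `distillation_loss` as an additional training signal.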
4. Datasets
CNN/Daily Mail [54] is a dataset containing over 300,000 news articles from CNN and the Daily Mail newspaper, written between 2007 and 2015. This dataset is distributed in three major versions. The first one was made for the NLP task of question answering and contains 313 k unique news articles and close to 1 M questions. The second version was restructured for the task of TS; the data in this version are anonymized. The third version provides a nonanonymized version of the data, where individuals’ names can be found in the dataset. Each article is accompanied by a list of bullet-point summaries, each of which abstractively summarizes a key aspect of the article. The CNN/Daily Mail dataset has three splits: training (92%, 287,113 articles), validation (4.3%, 13,368 articles), and test (3.7%, 11,490 articles).
XSum (standing for eXtreme Summarization) is a dataset that provides over 220,000 BBC news articles covering various topics [8]. Each article is accompanied by a one-sentence summary written by a human expert, who for the most part was the original author of the article. XSum has three splits: training (90%, 204,045 articles), validation (5%, 11,332 articles), and test (5%, 11,334 articles).
SAMSum [55] is a dataset that contains more than 16,000 online chat conversations written by linguists. These conversations cover diverse topics and formality styles including emoticons, slang words, and even typographical errors. They are also annotated with short third-person summaries, explaining the dialogue between different people.
Reddit TIFU [56] is a dataset consisting of 123,000 Reddit posts from the /r/tifu online discussion forum. These posts are informal stories that include a short summary, which is the title of the post, and a longer one, known as the “TL;DR” (too long; didn’t read) summary.
BillSum [57] is a dataset that deals with US Congressional (USC) and California (CA) state bill summarization. This corpus contains legislation documents that range from five to twenty thousand characters in length. In total, it contains approximately 22,200 USC bills (18,949 training documents and 3269 test documents) and approximately 1200 CA state bills (1237 test documents), accompanied by summaries written by human experts. The data are collected from the US Government Publishing Office’s Govinfo service and the CA legislature’s website.
5. Evaluation Metrics
For evaluation purposes, researchers employ two well-known families of metrics, namely BLEU and ROUGE. In general, both assess the number of matching n-grams (sequences of n consecutive terms) between the machine-generated summary and the human-written one.
BLEU [23], standing for bilingual evaluation understudy, is an automatic evaluation method that was originally created for the task of machine translation, but can also be applied to the automatic text summarization task, as suggested in [22,23,24]. It is based on the precision metric, which measures the number of words from the machine-generated (candidate) sentence that match the words in the human-written (reference) sentence, divided by the total number of words in the candidate sentence. Specifically, it combines a modified precision with a brevity penalty for short sentences. The modified precision clips the count of each candidate n-gram to its count in the reference, thereby penalizing word repetition, and the final score uses the geometric average of the modified n-gram precisions. Two variations of BLEU, namely BLEU-1 and BLEU-2, use unigram precision and both unigram and bigram precisions, respectively.
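The components named above (clipped n-gram precision, geometric average, brevity penalty) can be sketched for a single candidate and a single reference. This is a simplified sentence-level illustration without smoothing or multiple references; the function names are assumptions, and established implementations should be preferred for reporting results.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clip each candidate n-gram count by its count in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=2):
    """BLEU with uniform weights up to max_n (max_n=1 gives BLEU-1, 2 gives BLEU-2)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geometric_mean


reference = "the cat is on the mat".split()
repetitive = ["the"] * 7
p1 = modified_precision(repetitive, reference, 1)  # clipping gives 2/7, not 7/7
```

The clipping step is what prevents the repetitive candidate from receiving a perfect unigram precision.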
ROUGE [22], which stands for recall-oriented understudy for gisting evaluation, was inspired by the success of the n-gram overlap measure utilized by BLEU. In contrast to BLEU, ROUGE was introduced as a recall-oriented metric. Specifically, the most common setups are: (i) ROUGE-1 and ROUGE-2, which measure the overlap of unigrams and bigrams, respectively; and (ii) ROUGE-L, which utilizes the longest common subsequence (LCS) between the reference and the machine-generated summary. It is noted that a variation of ROUGE-L, called ROUGE-LSUM, is computed at the summary level (in contrast to ROUGE-L, which is computed at the sentence level).
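A minimal sketch of these setups is given below: ROUGE-N as n-gram recall against the reference, and ROUGE-L based on the LCS length. It treats each text as a single token sequence and omits stemming and the summary-level aggregation of ROUGE-LSUM; the function names are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Share of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum((cand & ref).values()) / total


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, F1) based on the LCS length."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
r1 = rouge_n_recall(candidate, reference, 1)
r2 = rouge_n_recall(candidate, reference, 2)
rl = rouge_l(candidate, reference)
```

Unlike ROUGE-2, ROUGE-L rewards the candidate for preserving word order across the substituted word, since the LCS does not require the matched words to be adjacent.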