Abstractive vs. Extractive Summarization: Comparison


  • text summarization
  • deep learning
  • language models
  • abstractive summarization
  • extractive summarization

1. Introduction

Due to the huge and continuously growing size of the textual corpora existing on the Internet, important information may go unnoticed or become lost. At the same time, the task of summarizing these resources by human experts is tedious and time-consuming [1]. This necessitates the automation of the task. Natural language processing (NLP) is a multidisciplinary research field, merging aspects and approaches from computer science, artificial intelligence, and linguistics; it deals with the development of processes that semantically and efficiently analyze vast amounts of textual data. Text summarization (TS) is a fundamental NLP subtask, which has been defined as the process of the automatic creation of a concise and fluent summary that captures the main ideas and topics of one or multiple documents [2].
Diverse TS applications already exist, including: (i) literature summarization, aiming to handle a long document such as a book, a scientific article, or similar literature resources [3][4]; (ii) news outlet summarization, aiming to summarize information from one or multiple news portals [5][6]; (iii) e-mail summarization [7][8]; (iv) legal document summarization, focusing on the extraction of important aspects from lengthy legal documents [9][10][11]; (v) social media summarization, where social media posts from multiple users are summarized in order to measure the social impact of a certain topic (this application is highly relevant to the field of opinion mining) [12][13]; (vi) argument summarization, as a means for meaningfully aggregating public opinion in a digital democracy platform [14].
Recently published works outlined a series of insights about diverse TS characteristics. For instance, Ref. [15] assessed various techniques used by extractive summarization approaches, as well as the associated evaluation metrics; Ref. [16] presented an overview of TS datasets, approaches, and evaluation schemas; Ref. [17] elaborated on a comprehensive classification schema of TS approaches, based on their underlying techniques, and performed a comparative assessment of their performance through the use of various metrics; Ref. [18] focused on extractive TS approaches, evaluation metrics, and their limitations; Ref. [19] offered a comprehensive survey of TS and keyword extraction, the latter being a sibling task of TS. In addition, Refs. [2][20] are two comprehensive surveys of TS applications, approaches, datasets, and evaluation techniques, also reporting on associated limitations and challenges. Contrary to earlier works, Ref. [21] reported a comprehensive survey that focused only on abstractive TS approaches, taking into account recent deep learning approaches, while also presenting their comparative evaluation using various versions of the ROUGE metric [22].
The above works, however, have a series of limitations: (i) only a few of them [17][20][21] evaluated the approaches under consideration through a common evaluation framework (e.g., the ROUGE metric); (ii) only a few of them discussed deep learning approaches [2][20][21]; (iii) they did not use alternative evaluation metrics, which yield interesting results, e.g., BLEU [23], as discussed in [22][23][24]; (iv) they did not provide links to the code repositories of their experimental setups and datasets.
Several works have reported the importance of developing a comprehensive evaluation framework [2][16][19][21]. Specifically, Ref. [2] stressed the need for new solutions regarding the automatic evaluation of TS approaches, while [16] pointed out that the automatic evaluation of TS approaches remains a very promising research area with many open issues. Some of these issues include: (i) the lack of metrics that take into account the mismatch of synonymous terms between human-assigned summaries and machine-generated ones (illustrated in the sketch below); (ii) the lack of datasets with quality summaries; and (iii) the lack of datasets for the evaluation of multilingual approaches.
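To make issue (i) concrete, the following is a small sketch using the rouge-score package (one of several ROUGE implementations; the choice of tooling is an assumption, not something prescribed by the works cited above): a candidate summary that paraphrases the reference with synonyms receives a near-zero score, because almost no n-grams overlap.

```python
# Illustration of issue (i): ROUGE cannot credit synonyms.
# Assumes the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "the cat sat on the mat"
candidate = "a feline rested on a rug"  # same meaning, different words

scores = scorer.score(reference, candidate)
for name, score in scores.items():
    # Almost no tokens overlap, so both metrics stay near zero even
    # though the candidate is an adequate paraphrase of the reference.
    print(name, round(score.fmeasure, 3))
```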

2. Extractive Approaches

The goal of extractive approaches is to extract the most important sentences of the document under consideration. These are assembled into a concise summary that captures the most significant aspects of the original text. Various algorithms have been proposed for extractive summarization, each utilizing different techniques for the sentence ranking and extraction step, including: (i) statistical ones, which utilize statistical metrics such as word or sentence frequency; (ii) graph-based ones, which model the document as a graph of sentences and then utilize graph theory concepts (e.g., centrality, community detection measures, etc.); and (iii) semantic-based ones, which model sentences and their terms as a co-occurrence matrix that is then analyzed using distributional semantics [25]. In this context, this section discusses some of the most prominent approaches in extractive summarization, namely Luhn, LSA, TextRank, LexRank, PositionRank, and TopicRank.

Luhn [26] is one of the earliest approaches in extractive summarization. It utilizes statistical analysis to rank each sentence of a given text, based on the frequency of the most important words and their relative positions in that sentence. The highest-scoring sentences are extracted to form the final summary. This approach has a clear limitation, however: it only considers individual words and ignores the relationships between words or sentences.

Latent semantic analysis (LSA) was one of the earliest techniques used in an attempt to model the semantic relationships between words and capture the key concepts in a document [27]. For the task of TS, Ref. [28] applied LSA by modelling a document as a term-sentence matrix that represents the frequency of each word in each sentence. Singular value decomposition (SVD) is then applied to extract the most important semantic features of the document, which are used to rank and extract the most important sentences. Drawbacks of this approach concern the dimensionality of the matrix and the way sentences are selected; to address them, Ref. [29] built a semantic-based approach on LSA that employs more advanced selection algorithms. Despite such improvements, summarization approaches built on LSA can be computationally expensive, especially for longer texts, due to the use of SVD [30].

Graph-based algorithms are another extractive summarization approach that addresses some limitations of earlier approaches, since they produce fast and scalable summarizations. One of the earliest and most prominent graph-based ranking approaches is TextRank [31]. Its first step is the representation of the document as a weighted graph of sentences: the sentences are represented as nodes and the relationships between them as edges, where a connection between two sentences indicates similarity between them, measured as a function of their overlapping content. After the graph is created, the PageRank centrality algorithm [32] is applied to rank each sentence based on its connections to the other ones. Finally, the top-ranked sentences are selected to form a summary of the input document; the number of extracted sentences can be set as a user-defined parameter for the termination of the algorithm.
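To make the ranking step concrete, below is a minimal TextRank-style sketch (an illustration, not the original implementation): sentences are split naively, content overlap with a smoothed variant of the paper's length normalization serves as the edge weight, and the PageRank implementation from networkx produces the ranking.

```python
# Minimal TextRank-style extractive summarizer (illustrative sketch).
# Requires networkx; sentence splitting here is deliberately naive.
import itertools
import math
import re

import networkx as nx


def textrank_summary(text: str, n_sentences: int = 2) -> str:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokens = [set(re.findall(r"\w+", s.lower())) for s in sentences]

    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        overlap = len(tokens[i] & tokens[j])
        if overlap:
            # Smoothed variant of the length normalization used in the
            # TextRank paper, so longer sentences are not favored.
            weight = overlap / (math.log(len(tokens[i]) + 1) + math.log(len(tokens[j]) + 1))
            graph.add_edge(i, j, weight=weight)

    ranks = nx.pagerank(graph, weight="weight")
    # Take the top-ranked sentences, then restore document order.
    top = sorted(sorted(ranks, key=ranks.get, reverse=True)[:n_sentences])
    return " ".join(sentences[i] for i in top)
```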
LexRank [33] is another graph-based algorithm that relies on PageRank. Its key difference is that each sentence is represented as a vector of the TF-IDF (term frequency-inverse document frequency) scores of the words it contains, while the relationship between these sentence vectors is measured using cosine similarity. A similarity matrix is created, with each sentence represented as a row and a column, and the elements of the matrix are computed as the cosine similarity scores between the sentence vectors; only similarities above a given threshold are kept. PageRank is then applied to rank the sentences, and the number of selected sentences can be set similarly to TextRank.

Other graph-based approaches that build on TextRank are TopicRank [34] and PositionRank [35]. TopicRank uses a topic-modelling technique, which clusters sentences with similar topics and extracts the most important sentences of each cluster. PositionRank ranks sentences with a biased PageRank that considers both the distribution of term positions in the text and the term frequencies.

Many word embedding models have been developed since the introduction of the pioneering Word2Vec model [36]. Their goal is to capture semantic information for textual terms, thus increasing the accuracy of various NLP tasks. These embeddings are calculated for each term, and their mean vector representation forms the document embedding. Recent advancements in deep learning allow the inference of sentence embeddings [37] from pretrained language models, with better accuracy than earlier models.
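As an illustration of this embedding-based direction, the sketch below embeds each sentence with the sentence-transformers library and keeps the sentences closest to the document centroid; the checkpoint name and the centroid heuristic are illustrative choices, not a specific published system.

```python
# Embedding-based extractive ranking: keep the sentences closest to
# the document centroid. Sketch only; assumes sentence-transformers
# is installed and the "all-MiniLM-L6-v2" checkpoint is available.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


def embed_and_rank(sentences: list[str], n_sentences: int = 2) -> list[str]:
    embeddings = model.encode(sentences, normalize_embeddings=True)
    centroid = embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    # With unit vectors, the dot product equals cosine similarity.
    scores = embeddings @ centroid
    top = sorted(np.argsort(scores)[::-1][:n_sentences])
    return [sentences[i] for i in top]
```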

3. Abstractive Approaches

The need for abstractive approaches resulted from a major drawback of extractive approaches, namely the lack of readability and coherence of the produced text: extractive approaches utilize simple heuristics to extract and concatenate the most relevant sentences, without accounting for grammatical or syntactical rules [16]. Generating a fluent and coherent summary requires more contextual information about the tokens of the input text, and thus a family of models that generate new phrases, in a manner similar to the paraphrasing process of a human reader [2][17].

Many models for abstractive summarization have already been proposed in the literature. As seen in a recent survey [2], these include graph-based [38], rule-based [39], and semantic modelling [40] approaches. These earlier models, however, do not utilize recent advancements in deep learning, which improve many NLP tasks. Newer abstractive summarization approaches build on deep learning models, including: (i) convolutional neural networks (CNNs) and recurrent neural networks (RNNs); (ii) LSTM and GRU networks, which improve on the original RNNs and are discussed in [41]. Other neural architectures that are not based on CNNs or RNNs include generative adversarial networks (GANs); certain works use these to build their abstractive approaches, as described in [42][43]. However, these yield lower evaluation scores (i.e., ROUGE) than recent deep learning models, which rely on the Transformer model explained below, as validated in [21].

Transformer [44] is a deep learning model that consists of a series of encoder and decoder layers, which utilize the attention mechanism to model the global dependencies of sequential data [45]. Specifically, the self-attention mechanism assigns different weights to different parts of the input, according to their contextual significance; these are encoded in hidden state layers when generating the output sequence. In addition, Transformer models use multi-head attention, meaning that attention is applied in parallel to capture different patterns and relationships in the input data. Transformer follows the encoder-decoder paradigm, which encodes information into hidden layers and then decodes it to generate the output. These models are semisupervised, due to their unsupervised pretraining on large datasets, followed by supervised finetuning. Approaches built on this model achieve state-of-the-art performance on various text generation tasks, including abstractive summarization. Recent surveys [20][21] discussed and evaluated the differences between earlier abstractive approaches, including those that utilize deep learning models proposed before the introduction of the Transformer architecture.

T5 [46], which stands for text-to-text transfer transformer, is an approach that closely follows the Transformer architecture. It provides a general framework which converts multiple NLP tasks into text-to-text ones; to address each task, it prepends a task-specific prefix to the input sequence. The pretraining process comprises both supervised and unsupervised training. The unsupervised objective masks random spans of tokens with unique sentinel tokens; the “corrupted” sentence is passed to the encoder, while the decoder learns to predict the dropped-out tokens on the output layer.
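A minimal usage sketch with the Hugging Face transformers library illustrates the task-prefix convention described above; "t5-small" is a lightweight public checkpoint chosen only to keep the example small.

```python
# T5 casts every task as text-to-text; summarization is requested by
# prepending the "summarize: " prefix to the input.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

document = "..."  # the text to be summarized
inputs = tokenizer("summarize: " + document, return_tensors="pt",
                   truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_new_tokens=60, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```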
A follow-up approach, namely mT5 [47], builds on T5 to provide multilingual pretrained baseline models, which can be further finetuned to address diverse downstream tasks in multiple natural languages.

BART [48], which stands for bidirectional auto-regressive transformers, is a multitask deep learning approach, with abstractive summarization among its supported tasks. BART utilizes a “denoising” autoencoder that learns the associations between a document and its “corrupted” form under various textual transformations, including random token masking or deletion, text infilling, sentence permutation, and document rotation. This autoencoder is implemented as a sequence-to-sequence model with a bidirectional encoder and a left-to-right autoregressive decoder. For its pretraining, it optimizes a reconstruction (cross-entropy) loss function, so that the decoder generates tokens found in the original document with higher probability.

PEGASUS [7], which stands for pretraining with extracted gap-sentences for abstractive summarization, is a deep learning approach pretrained solely for the downstream task of abstractive summarization. It introduces a novel pretraining objective for Transformer-based models, called gap sentences generation (GSG). This objective is specifically designed for abstractive text summarization, as it masks whole sentences rather than the smaller text spans used in previous attempts. Doing so creates a “gap” in the input document, which the model is trained to fill in by considering the rest of the sentences. Another key advantage of this approach is that the masked sentences are selected by ranking sentences according to their importance in the document, rather than randomly as suggested in earlier approaches.

The rapidly increasing size and computational complexity of large pretrained models, as noted in [49], prompted researchers to explore methods to compress them into smaller versions that maintain high accuracy while offering faster inference. One such example is the work of [50], which proposes various compression techniques, including: (i) direct knowledge distillation (KD), which transfers knowledge from a large model, referred to as the “teacher” model, into a smaller and “distilled” model, referred to as the “student” model; (ii) pseudo-labels, which replace the ground truth target documents of the student model with those of the teacher; and (iii) shrink and finetune (SFT), which shrinks the teacher model to student size by copying a subset of its layers and then finetunes the student model again. The authors also provide various “distilled” versions of large pretrained models produced by the BART and PEGASUS approaches.
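As a usage-level illustration of such distilled models, the snippet below loads one of the publicly released distilled BART checkpoints through the transformers pipeline API; the checkpoint name is one published example, and any other summarization checkpoint could be substituted.

```python
# Running a publicly released distilled BART checkpoint through the
# transformers pipeline API (sketch; model choice is illustrative).
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

long_article = "..."  # the document to be summarized
result = summarizer(long_article, max_length=120, min_length=30, do_sample=False)
print(result[0]["summary_text"])
```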

4. Datasets

CNN/Daily Mail [51] is a dataset containing over 300,000 news articles from CNN and the Daily Mail newspaper, written between 2007 and 2015. This dataset is distributed in three major versions. The first one was made for the NLP task of question answering and contains 313k unique news articles and close to 1M questions. The second version was restructured for the task of TS; the data in this version are anonymized. The third version provides a nonanonymized version of the data, where individuals’ names can be found in the dataset. Each article is accompanied by a list of bullet-point summaries, each of which abstractively summarizes a key aspect of the article. The CNN/Daily Mail dataset has three splits: training (92%, 287,113 articles), validation (4.3%, 13,368 articles), and test (3.7%, 11,490 articles).

XSum (standing for eXtreme Summarization) is a dataset that provides over 220,000 BBC news articles covering various topics [6]. Each article is accompanied by a one-sentence summary written by a human expert, who for the most part was the original author of the article. XSum has three splits: training (90%, 204,045 articles), validation (5%, 11,332 articles), and test (5%, 11,334 articles).

SAMSum [52] is a dataset that contains more than 16,000 online chat conversations written by linguists. These conversations cover diverse topics and formality styles, including emoticons, slang words, and even typographical errors. They are also annotated with short third-person summaries explaining the dialogue between the different speakers.

Reddit TIFU [53] is a dataset consisting of 123,000 Reddit posts from the /r/tifu online discussion forum. These posts are informal stories that include a short summary, which is the title of the post, and a longer one, known as the “TL;DR” (too long; didn’t read) summary.

BillSum [54] is a dataset that deals with US Congressional (USC) and California (CA) state bill summarization. This corpus contains legislation documents ranging from five to twenty thousand characters in length. In total, it contains 22,200 USC bills (18,949 train documents and 3269 test documents) and 1200 CA state bills (1237 test documents), accompanied by summaries written by human experts. The data are collected from the US Publishing Office’s Govinfo and the CA legislature’s website.
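These corpora are commonly accessed through the Hugging Face datasets library; as a sketch, the snippet below loads the nonanonymized CNN/Daily Mail release (published on the Hub under the "3.0.0" configuration) and prints the split sizes quoted above.

```python
# Loading the nonanonymized CNN/Daily Mail release from the
# Hugging Face Hub (assumes the datasets package is installed).
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
for split in ("train", "validation", "test"):
    print(split, len(cnn_dm[split]))

example = cnn_dm["train"][0]
print(example["article"][:200])  # the full news article (truncated here)
print(example["highlights"])     # the bullet-point reference summary
```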

5. Evaluation Metrics

For evaluation purposes, researchers employ two well-known families of metrics, namely BLEU and ROUGE. In general, both assess the number of matching n-grams (sequences of terms) between the machine-generated summary and the human-assigned one.

BLEU [23], standing for bilingual evaluation understudy, is an automatic evaluation method that was originally created for the task of machine translation, but can also be applied to automatic text summarization, as suggested in [22][23][24]. It is based on the precision metric, which measures the number of words from the machine-generated (candidate) sentence that match the words in the human-written (reference) sentence, divided by the total number of words in the candidate sentence. Specifically, it combines a brevity penalty for short sentences with a modified precision, computed as the geometric average of n-gram precisions, which penalizes word repetition. Two variations of BLEU, namely BLEU-1 and BLEU-2, use unigram precision and both unigram and bigram precisions, respectively.

ROUGE [22], which stands for recall-oriented understudy for gisting evaluation, was inspired by the success of the n-gram overlap measure utilized by BLEU. In contrast to BLEU, ROUGE was introduced as a recall-oriented metric. The most common setups are: (i) ROUGE-1 and ROUGE-2, for unigrams and bigrams, respectively; and (ii) ROUGE-L, which utilizes the longest common subsequence (LCS) between the reference and the machine-generated summary. It is noted that a variation of ROUGE-L, called ROUGE-LSUM, is computed at the summary level (in contrast to ROUGE-L, which is computed at the sentence level).
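In practice, these metrics are computed with off-the-shelf implementations; the sketch below uses the Hugging Face evaluate package for the ROUGE variants and NLTK for BLEU-1/BLEU-2. Both libraries are assumptions about tooling, reasonable stand-ins rather than the metrics' original scripts.

```python
# Computing the ROUGE variants and BLEU-1/BLEU-2 with common
# off-the-shelf implementations (evaluate and nltk assumed installed).
import evaluate
from nltk.translate.bleu_score import sentence_bleu

candidate = "the model produced a short summary of the article"
reference = "a short summary of the article was produced"

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=[candidate], references=[reference])
print(scores)  # rouge1, rouge2, rougeL, and rougeLsum F-measures

cand_tokens, ref_tokens = candidate.split(), reference.split()
bleu1 = sentence_bleu([ref_tokens], cand_tokens, weights=(1, 0, 0, 0))
bleu2 = sentence_bleu([ref_tokens], cand_tokens, weights=(0.5, 0.5, 0, 0))
print(bleu1, bleu2)
```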
 