Generating Paraphrase Using Simulated Annealing for Citation Sentences: History

The paraphrase generator for citation sentences produces several alternative sentences to avoid plagiarism. The generated results must also satisfy semantic similarity and lexical divergence criteria. The generation process is guided by an objective function within a simulated annealing algorithm to maintain both properties. This objective function is created by combining two factors, one for each property.

  • citation sentences
  • paraphrase generator
  • simulated annealing

1. Introduction

Paraphrase generation produces new text from the input with different wording but the same information [1]. The generator aims to create sentences with high lexical divergence while maintaining semantic similarity. It is often equated with machine translation, except that the input and output are sentences in the same language [2]. The generation of a paraphrase needs to consider several criteria. Sentences in scientific papers are usually argumentative [3], where one statement is bound in context with another, for instance in a causal paragraph or vice versa. The resulting new sentence should not exhibit characteristics of plagiarism [4]. Scientific papers contain many compound and multilevel complex sentences; hence, the output form is more complex [5]. In the domain of scientific papers, paraphrasing can be found in several settings [6], such as:
  • The abstract is a paraphrase of the sentence in the body of the paper
  • The introductory part has a paraphrase equivalent to the methodology section
  • The conclusion has a paraphrase equivalent to the experimental section
  • Definition sentences have paraphrase equivalents with others that define the same construct
  • Citation sentences that cite the same paper are paraphrases of one another.
This study used citation sentences as the source of the paraphrase collection, since sentences citing the same paper can be gathered from many different papers. Citation sentences were selected because they are often considered a part that increases a paper's plagiarism score, and they offer great potential for paraphrase collection. Therefore, the dataset was collected from the open-access ACL Anthology collection of computational linguistics papers. Citation sentences serve several functions, including citing weaknesses, contrasts, methods, and data similarities, as well as problem bases or neutral mentions [7]. The citation sentences used in this study were limited to those with a single citation target. This was done to constrain the context and purpose of each scientific argument expressed in the sentences.
Inspired by Unsupervised Paraphrasing by Simulated Annealing [8], a generate-and-test model architecture was developed with the same algorithm but different objective functions and strategies. Furthermore, this study combined two metric functions, namely METEOR [9] and the PINC score [10], to capture semantic similarity and lexical divergence. The two metrics were combined in a linear weighted function [11], which can be adjusted toward either property. The language resource that produces substitution or insertion successors for the input sentence was built with word embeddings [12]. The candidate-sentence selection strategy uses an n-gram language model [13].
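To illustrate the lexical-divergence side of the objective, the PINC score measures the fraction of candidate n-grams that do not appear in the source sentence. The following is a minimal sketch, assuming whitespace tokenization and n-grams up to length 4 (both simplifying choices, not necessarily the study's settings):

```python
def ngrams(tokens, n):
    """Set of n-grams of length n from a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """PINC score: mean fraction of candidate n-grams NOT found in the
    source, averaged over n = 1..max_n.  Higher means more divergent."""
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue  # candidate shorter than n
        overlap = len(cand_ngrams & ngrams(src, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0
```

An identical copy scores 0, a completely rewritten sentence scores 1, and partial rewrites fall in between.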
Approaches for rule-based paraphrase generation rely on hand-crafted and automatically collected paraphrase rules. In early works, these rules were mostly hand-crafted [14]. Because of the enormous manual work required, some researchers have attempted to collect paraphrase rules automatically [15]. Unfortunately, the limitations of the extraction methods produce long and complex additional patterns, which degrade performance.
Thesaurus-based approaches start by extracting all synonyms for the words to be substituted from a thesaurus. The best choice is then selected according to the context of the source phrase [16]. Although simple and effective, this strategy tends to limit the diversity of the generated paraphrases.

2. Corpus Construction

The construction of a paraphrase corpus is known as paraphrase extraction, the task of generating a collection of paraphrased sentence pairs from large document collections [27]. The extraction result can be a collection of words or phrases, such as PPDB [26], which uses bilingual pivoting. It can also be paraphrased sentence pairs, such as MSRP [28], obtained from a news collection using a supervised approach. Other corpora, such as PIT [29], were compiled from tweets using a topic-similarity approach. Each text unit and domain has unique characteristics because of its specific informational purpose. The state of the art in paraphrase corpus construction is shown in Table 1.
Table 1. Paraphrase corpus state of the art.
It is necessary to observe how authors convey information when extracting paraphrases from scientific papers. Authors of scientific papers restate information using three techniques, namely paraphrasing, summarizing, and translating [31]. Abstract sentences paired with body sentences can be collected to build a paraphrase corpus [5]. However, citation sentences have the greatest potential for building a paraphrase corpus from these papers [32]. The construction of the citation paraphrase corpus in this study is a small contribution to paraphrase generation research.

3. Objective Function

The generation model built with the generate-and-test approach requires an objective function to guide the generation process. In paraphrasing, the objective function is a formula that measures the paraphrase quality of a sentence pair (usually a value between 0 and 1). Studies in this area are usually grouped under the task of text similarity measurement.
Paraphrasing is very similar to machine translation; therefore, translation evaluation approaches can be applied to paraphrasing. Evaluation techniques such as NIST [33], BLEU [34], and WMT [35] can be combined into a formula to assess paraphrase generation results against available reference data [36].
The Term Frequency Kullback–Leibler Divergence (TF-KLD) data representation is the best technique for measuring paraphrases on the MSRP dataset [37]. Prior to classification, the feature matrix is converted into a latent representation with TF-KLD weighting, and an SVM is subsequently used for classification. Evaluation against standard TF-IDF yielded an accuracy of 80.4% and an F1 score of 85.9%.
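The core idea of TF-KLD is to reweight each term's frequency by how discriminative the term is between paraphrase and non-paraphrase pairs. A minimal sketch of one such weight follows, assuming Bernoulli term-occurrence distributions; in practice, p and q would be estimated from labeled training pairs:

```python
import math

def tf_kld_weight(p, q, eps=1e-6):
    """KL divergence KL(Bernoulli(p) || Bernoulli(q)), where p is the
    probability a term occurs in paraphrase pairs and q the probability
    it occurs in non-paraphrase pairs.  Probabilities are clipped by eps
    to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
```

A term equally likely in both classes (p = q) gets weight 0, while a highly discriminative term gets a large weight; term frequencies are multiplied by these weights before classification.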
Another approach for measuring paraphrase output is using deep learning to build sentence representations and compare them directly in vector form [32]. Beyond plain neural architectures, a Convolutional Neural Network (CNN) model consisting of composition, decomposition, and proximity-estimation layers can be used to measure paraphrase generation results [38]. Text representation with word embeddings is commonly used when a deep learning approach is applied.
Models have also been developed to measure paraphrases in the domain of scientific papers. A Siamese neural network architecture was used to learn similarity and dissimilarity from a corpus of sentence pairs from scientific papers labeled true and false [39], with an accuracy of 64%. Furthermore, an SVM model was developed by engineering word-level features such as Euclidean distance, cosine similarity, and sentence length [40], with an accuracy of 61.9%. Both studies used a learning-based approach and were strongly influenced by the quality of the corpus used.
In this study, the objective function was built from semantic similarity and lexical divergence. To combine the two, a formula was built that can be configured to favor either aspect. The formation of the objective function is explained in the experiment section.
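A linear weighted combination of the two properties can be sketched as follows; the weight name alpha is illustrative, not the paper's notation, and both input scores are assumed to lie in [0, 1]:

```python
def objective(semantic_sim, lexical_div, alpha=0.5):
    """Combine a semantic-similarity score (e.g. METEOR) and a
    lexical-divergence score (e.g. PINC) into one objective value.
    Higher alpha favors meaning preservation; lower alpha favors
    surface-form change."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * semantic_sim + (1 - alpha) * lexical_div
```

Setting alpha near 1 yields conservative paraphrases that stay close to the input's meaning; setting it near 0 pushes the generator toward aggressive rewording.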

4. Paraphrase Generator

Paraphrase generation is the task of generating new sentences from an input sentence, and various language resources are needed in this process. A common approach treats it as machine translation configured to produce sentences in the same language as the input [2].
Paraphrase generation can be found in various domains, such as news [41], where the generator can be used to package news content or form variations of headlines [42]. It can also be found in social media domains such as Twitter [29]. Paraphrase generation in these various domains aims to produce semantic similarity, compression, concatenation, and sentence simplification [43].
Sequence-to-sequence learning is a deep learning technique applied to paraphrase generation [44]. The main building blocks of this model are Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) units. The deep learning approach was later extended with the Transformer, which inspired the use of that architecture in paraphrase generation [45].

5. Simulated Annealing

Simulated Annealing (SA) is an effective algorithm for searching solutions in a very large dimensional space [46]. Its main advantage is the ability to escape local maxima of the optimization function. The algorithm is inspired by annealing in metallurgy, where a material is heated and then slowly cooled so that it can be manipulated into the desired shape. The temperature-drop schedule determines the fault tolerance during the search of the solution space: a worse solution is acceptable while the temperature is still high, and becomes less likely to be accepted towards the end of the cooling.
In the sentence generation case, let X be the very large space of possible sentences and f(x) the objective function for generating new sentences. The main target of Simulated Annealing is to find the sentence x with the maximum value of f(x). At every search step t, the sentence generated can be denoted x_t. Simulated Annealing selects x_{t+1}, obtained by modifying x_t, as the current state whenever its f value is greater. At the beginning of the search, the temperature T is very high, which allows x_{t+1} to be accepted even when its f value is smaller. Theoretically, this helps the search avoid local maxima of the objective function and approach the global maximum [47].
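This acceptance rule can be sketched as a generic maximization loop; the proposal function, initial temperature, cooling rate, and step count below are illustrative choices, not the paper's settings:

```python
import math
import random

def simulated_annealing(x0, propose, f, t0=1.0, cooling=0.99, steps=500):
    """Maximize f over a search space: always accept an improving
    candidate, and accept a worse one with probability exp(delta / t),
    which shrinks as the temperature t cools."""
    x, best = x0, x0
    t = t0
    for _ in range(steps):
        candidate = propose(x)
        delta = f(candidate) - f(x)
        # delta <= 0 gives an acceptance probability in (0, 1]
        if delta > 0 or random.random() < math.exp(delta / t):
            x = candidate
            if f(x) > f(best):
                best = x  # track the best state ever visited
        t *= cooling
    return best
```

In the paraphrasing setting, propose would apply a word substitution, insertion, or deletion to the current sentence, and f would be the combined similarity/divergence objective.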

This entry is adapted from the peer-reviewed paper 10.3390/informatics10020034
