Generating Paraphrase Using Simulated Annealing for Citation Sentences

The paraphrase generator for citation sentences produces several alternative sentences in order to avoid plagiarism. The generated sentences must satisfy standards of both semantic similarity and lexical divergence. The generation process is guided by an objective function within a simulated annealing algorithm to maintain these two properties, and the objective function is created by combining two factors, one for each property.

  • citation sentences
  • paraphrase generator
  • simulated annealing

1. Introduction

Paraphrase generation produces new text that conveys the same information as the input with different wording [1]. The generation machine aims to create sentences with high lexical divergence while maintaining semantic similarity. The generator is often equated with a machine translation system, except that the input and output are sentences in the same language [2]. Generating a paraphrase needs to satisfy several criteria. Sentences in scientific papers are usually argumentative [3], where one statement is bound in context to another, for instance in a cause-and-effect relationship. The resulting new sentence should not exhibit plagiarism characteristics [4]. Scientific papers contain many coordinate or multi-clause compound sentences; hence, the output form is more complex [5]. In the domain of scientific papers, paraphrasing can be found in several situations [6], such as:
  • The abstract is a paraphrase of the sentence in the body of the paper
  • The introductory part has a paraphrase equivalent to the methodology section
  • The conclusion has a paraphrase equivalent to the experimental section
  • Definition sentences have paraphrase equivalents with others that define the same construct
  • Citation sentences that cite the same paper are paraphrases of one another
This study used the potential of citation sentences as a paraphrase collection, since they can be gathered from many different papers. Citation sentences were selected because they are often regarded as a part that increases a paper’s plagiarism score, and they offer considerable potential for paraphrase pairs when collected. The dataset was therefore collected from open-access computational linguistics papers (the ACL Anthology). Citation sentences serve several functions, including citing weaknesses, contrasts, method or data similarities, and problem bases, or they may be neutral [7]. The citation sentences used in this study were limited to those with a single citation target. This was done to constrain the context and purpose of each statement within the scientific argument.
Inspired by Unsupervised Paraphrasing by Simulated Annealing [8], a generate-and-test model architecture was developed with the same algorithm but different objective functions and strategies. This study combined two metric functions, namely METEOR [9] and the PINC score [10], to capture semantic similarity and lexical difference. The two metrics were combined in a linear weighted function [11] whose tendency toward either property can be adjusted. The language resource that produces substitution or insertion successors for the input sentence is built with word embeddings [12], and the candidate sentence selection strategy uses an n-gram language model [13].
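A natural form for such a configurable linear combination is sketched below; the weight α and the argument order are assumptions for illustration, and the exact formulation is given in the experiment section of the underlying paper:

```latex
f(s, x) = \alpha \cdot \mathrm{METEOR}(x, s) + (1 - \alpha) \cdot \mathrm{PINC}(s, x),
\qquad \alpha \in [0, 1]
```

Here s is the source citation sentence, x is the candidate paraphrase, and α shifts the tendency between semantic similarity (METEOR) and lexical divergence (PINC).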
Rule-based approaches to paraphrase generation rely on hand-crafted or automatically collected paraphrase rules. In early works, these rules were mostly hand-crafted [14]. Because of the enormous manual work required, some researchers attempted to collect paraphrase rules automatically [15]. Unfortunately, owing to the limitations of the extraction methods, long and overly complex patterns were generated, which degraded performance.
Thesaurus-based approaches start by extracting from a thesaurus all synonyms for the words to be substituted. The best choice is then selected according to the context of the source sentence [16]. Although simple and effective, this strategy tends to limit the diversity of the generated paraphrases.
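As an illustration of the thesaurus-based strategy (not the method of this study), the sketch below uses NLTK's WordNet interface as the thesaurus; context-based selection, e.g., with a language model, would follow as a separate step.

```python
# Minimal sketch of thesaurus-based substitution with WordNet as the
# thesaurus (requires: pip install nltk; nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def synonym_candidates(word):
    """Collect WordNet synonyms of a word, excluding the word itself."""
    synonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower():
                synonyms.add(name)
    return sorted(synonyms)

# Substituting one word yields candidate paraphrases; a context model
# would then pick the best-fitting candidate.
tokens = "the method improves the results".split()
for alt in synonym_candidates("improve")[:3]:
    print(" ".join(alt if t == "improves" else t for t in tokens))
```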

2. Corpus Construction

The construction of a paraphrase corpus is known as paraphrase extraction, the task of generating a collection of paraphrased sentence pairs from large document collections [17]. The extraction result can be a collection of words or phrases, such as PPDB [18], which uses two-language pivoting. It can also be paraphrased sentence pairs, such as MSRP [19], obtained from a news collection using a supervised approach. Other corpora, such as PIT [20], were compiled from tweets grouped by the similarity of the object they deliver. Each text unit and domain has unique characteristics because of its specific informational purpose. The state of the art in constructing paraphrase corpora is shown in Table 1.
Table 1. Paraphrase corpus state of the art.
It is necessary to observe the authors’ characteristics in conveying information when extracting paraphrases from scientific papers. Authors of scientific papers convey information using three approaches, namely paraphrasing, summarizing, and translating [22]. Abstract sentences can be paired with sentences from the body of a paper to build a paraphrase corpus [5]. However, citation sentences have the greatest potential for building a paraphrase corpus from these papers [23]. The construction of the citation paraphrase corpus in this study is a small contribution to paraphrase generation research.

3. Objective Function

A generation model built with the generate-and-test approach requires an objective function to guide the generation process. In paraphrasing, the objective function is a formula that measures the paraphrase value of a pair of sentences (usually a value between 0 and 1). Studies in this area are usually grouped under the task of text similarity measurement.
Paraphrasing is a task very similar to machine translation; therefore, evaluation approaches for translation can also be used for paraphrasing. Evaluation techniques such as WMT [24], BLEU [25], and NIST [26] can be combined into a formula to assess the results of paraphrase generation on the available data [27].
The Term Frequency Kullback–Leibler Divergence (TF-KLD) data representation is the best-performing technique for measuring paraphrases on the MSRP dataset [28]. Prior to classification, the feature matrix is converted into a latent representation using TF-KLD weighting, and an SVM is then used for classification. Evaluation against standard TF-IDF showed an accuracy of 80.4% and an F1 score of 85.9%.
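A simplified sketch of the TF-KLD idea is shown below, assuming binary features and Bernoulli occurrence statistics; the smoothing and exact feature definition here are illustrative rather than the precise formulation of [28].

```python
# Simplified TF-KLD sketch: reweight each feature by the KL divergence
# between its occurrence rates in paraphrase and non-paraphrase pairs,
# so that discriminative features receive larger weights.
import numpy as np

def tf_kld_weights(X, y, eps=1e-6):
    """X: (n_pairs, n_features) binary matrix, where X[i, k] = 1 if
    feature k of one sentence also occurs in the other sentence of
    pair i; y: 1 = paraphrase pair, 0 = non-paraphrase pair."""
    p = np.clip(X[y == 1].mean(axis=0), eps, 1 - eps)  # P(occur | paraphrase)
    q = np.clip(X[y == 0].mean(axis=0), eps, 1 - eps)  # P(occur | non-paraphrase)
    # KL divergence between the two Bernoulli distributions, per feature.
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# The reweighted matrix (X * weights) would then be factored into a
# latent representation and classified with an SVM, as described above.
```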
Another approach to measuring paraphrase output is to use deep learning to build sentence representations and simply compare them in vector form [23]. Besides recurrent architectures, a Convolutional Neural Network (CNN) model consisting of composition, decomposition, and proximity-estimation layers can be used to measure paraphrase generation results [29]. Text representation with word embeddings is commonly used when a deep learning approach is applied.
Models have also been developed to measure paraphrases in the domain of scientific papers. A Siamese neural network architecture was used to learn similarity and dissimilarity from a corpus of sentence pairs from scientific papers labeled true or false [30], with an accuracy of 64%. Furthermore, an SVM model was developed with engineered word features, such as Euclidean distance, cosine similarity, and sentence length [31], reaching an accuracy of 61.9%. Both studies used a learning-based approach and were strongly influenced by the quality of the corpus used.
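The feature engineering in the SVM approach can be sketched as follows; the sentence vectors (e.g., averaged word embeddings) and the exact feature set are assumptions here, since [31] may use a different configuration.

```python
# Sketch of pairwise features (Euclidean distance, cosine similarity,
# sentence lengths) for an SVM paraphrase classifier.
import numpy as np
from sklearn.svm import SVC

def pair_features(v1, v2, len1, len2):
    """v1, v2: sentence vectors; len1, len2: sentence lengths in tokens."""
    euclid = np.linalg.norm(v1 - v2)
    cosine = float(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.array([euclid, cosine, len1, len2, abs(len1 - len2)])

# Training would stack these features for all labeled pairs:
#   X = np.stack([pair_features(...) for each pair])
#   clf = SVC(kernel="rbf").fit(X, labels)
```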
In this study, the objective function was built from semantic similarity and lexical divergence. To combine the two, a formula was constructed whose tendency toward either aspect can be configured. The formation of the objective function is explained in the experiment section.
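A minimal executable sketch of such an objective, assuming NLTK's METEOR implementation for the similarity term and a direct implementation of the PINC score [10] for the divergence term (the weight alpha is illustrative, not the paper's exact setting):

```python
# Weighted objective combining semantic similarity (METEOR) and lexical
# divergence (PINC). Requires: pip install nltk; nltk.download('wordnet').
from nltk.translate.meteor_score import meteor_score

def ngram_set(tokens, n):
    """Set of n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """PINC: average fraction of candidate n-grams absent from the source."""
    scores = []
    for n in range(1, max_n + 1):
        cand = ngram_set(candidate, n)
        if cand:
            scores.append(1.0 - len(cand & ngram_set(source, n)) / len(cand))
    return sum(scores) / len(scores) if scores else 0.0

def objective(source, candidate, alpha=0.5):
    """Linear weighted combination; alpha shifts the tendency between
    semantic similarity and lexical divergence. Inputs are token lists."""
    similarity = meteor_score([source], candidate)  # pre-tokenized inputs
    divergence = pinc(source, candidate)
    return alpha * similarity + (1.0 - alpha) * divergence
```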

4. Paraphrase Generator

Paraphrase generation is the task of generating new sentences from an input sentence, and various language resources are needed in the process. A common approach uses a machine translation setup to produce sentences in the same language [2].
Paraphrase generation can be found in various domains, such as news [32], where a generator can be used to repackage news content or produce variations of headlines [33]. It can also be found in social media domains such as Twitter [20]. Across these domains, paraphrase generation aims at semantic similarity, compression, concatenation, and sentence simplification [34].
Sequence-to-sequence learning is a deep learning technique applied to paraphrase generation [35]. The main building blocks of this model are Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) units. The deep learning approach was further developed with the Transformer, which inspired the use of this technique in paraphrase generation [36].

5. Simulated Annealing

Simulated Annealing (SA) is an effective algorithm for searching solutions in very large dimensional spaces [37]. The advantage of the algorithm is its ability to escape local maxima of the optimization function. It is inspired by the industrial annealing process, in which a material’s temperature is gradually lowered while it is manipulated into the desired shape. The temperature-drop schedule determines the error tolerance in the search of the solution space: a worse solution may be accepted while the temperature is still high, and is less likely to be accepted toward the end of the temperature drop.
In the sentence generation case, let X be the very large space of possible sentences and f(x) the objective function for a generated sentence x. The main target of simulated annealing is to find the sentence x with the maximum value of f(x). Each search step t produces a sentence x_t. Simulated annealing selects x_{t+1}, obtained by modifying x_t, as the current state whenever its f value is greater. At the beginning of the search, the temperature T is very high, which allows x_{t+1} to be accepted even when its f value is smaller; as T decreases, such acceptances become rare. Theoretically, this avoids local maxima of the objective function and guarantees reaching the global maximum [38].
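The search loop can be sketched as follows; the `propose` edit operator and the objective `f` are placeholders for the components described above (word-embedding successors, the weighted objective), and the cooling schedule is illustrative rather than the paper's exact configuration.

```python
# Minimal simulated annealing loop for sentence search with the
# Metropolis acceptance criterion.
import math
import random

def simulated_annealing(x0, f, propose, t_init=1.0, cooling=0.95, steps=200):
    """x0: initial sentence (token list); f: objective to maximize;
    propose: returns a neighbor of x via one word-level edit."""
    x, best = x0, x0
    t = t_init
    for _ in range(steps):
        x_new = propose(x)          # candidate x_{t+1} derived from x_t
        delta = f(x_new) - f(x)     # change in objective value
        # Always accept improvements; accept a worse candidate with
        # probability exp(delta / T), which shrinks as T cools.
        if delta > 0 or random.random() < math.exp(delta / max(t, 1e-12)):
            x = x_new
            if f(x) > f(best):
                best = x
        t *= cooling                # geometric temperature decay
    return best
```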

This entry is adapted from the peer-reviewed paper 10.3390/informatics10020034

References

  1. Androutsopoulos, I.; Malakasiotis, P. A Survey of Paraphrasing and Textual Entailment Methods. J. Artif. Intell. Res. 2010, 38, 135–187.
  2. Quirk, C.; Brockett, C.; Dolan, W. Monolingual Machine Translation for Paraphrase Generation; Association for Computational Linguistics: Stroudsburg, PA, USA, 2003.
  3. Lisa, C.M.L. Merging Corpus Linguistics and Collaborative Knowledge. Ph.D. Thesis, University of Birmingham, Birmingham, UK, 2009.
  4. Barrón-Cedeño, A.; Vila, M.; Martí, M.A. Plagiarism Meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection. Comput. Linguist. 2013, 39, 917–947.
  5. Kittredge, R. Paraphrasing for Condensation in Journal Abstracting. J. Biomed. Inform. 2002, 35, 265–277.
  6. Ilyas, R.; Widiyantoro, D.H.; Khodra, M.L. Building Candidate Monolingual Parallel Corpus from Scientific Papers. In Proceedings of the 2018 International Conference on Asian Language Processing, IALP, Bandung, Indonesia, 15–17 November 2018; pp. 230–233.
  7. Teufel, S.; Siddharthan, A.; Tidhar, D. An Annotation Scheme for Citation Function. In Proceedings of the COLING/ACL 2006–SIGdial06: 7th SIGdial Workshop on Discourse and Dialogue, Sydney, Australia, 15–16 July 2006; pp. 80–87.
  8. Liu, X.; Mou, L.; Meng, F.; Zhou, H.; Zhou, J.; Song, S. Unsupervised Paraphrasing by Simulated Annealing. arXiv 2019, arXiv:1909.03588.
  9. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 65–72.
  10. Chen, D.L.; Dolan, W.B. Collecting Highly Parallel Data for Paraphrase Evaluation. In Proceedings of the ACL-HLT 2011, 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Volume 1, pp. 190–200.
  11. Carbonell, J.; Goldstein, J. Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. In SIGIR Forum (ACM Special Interest Group on Information Retrieval); ACM: New York, NY, USA, 1998; pp. 335–336.
  12. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
  13. Nádas, A. Estimation of Probabilities in the Language Model of the IBM Speech Recognition System. In IEEE Transactions on Acoustics, Speech and Signal Processing; IEEE: New York, NY, USA, 1984; p. 27.
  14. McKeown, K.R. Paraphrasing Questions Using Given and New Information. Am. J. Comput. Linguist. 1983, 9, 1.
  15. Lin, D.; Pantel, P. Discovery of Inference Rules for Question-Answering. Nat. Lang. Eng. 2001, 7, 343–360.
  16. Kauchak, D.; Barzilay, R. Paraphrasing for Automatic Evaluation. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics; Association for Computational Linguistics: New York, NY, USA, 2006; pp. 455–462.
  17. Ganitkevitch, J.; Van Durme, B.; Callison-Burch, C. PPDB: The Paraphrase Database. In Proceedings of the NAACL-HLT–Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; pp. 758–764.
  18. Bhagat, R.; Hovy, E. What Is a Paraphrase? Comput. Linguist. 2013, 39, 463–472.
  19. Dolan, W.B.; Brockett, C. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, Yamamoto, Japan, 14 October 2005; pp. 9–16.
  20. Xu, W.; Ritter, A.; Callison-burch, C.; Dolan, W.B.; Ji, Y. Extracting Lexically Divergent Paraphrases from Twitter. Trans. Assoc. Comput. Linguist. 2014, 2, 435–448.
  21. Pavlick, E.; Rastogi, P.; Ganitkevitch, J.; Durme, B.V.; Callison-Burch, C. PPDB 2.0: Better Paraphrase Ranking, Fine-Grained Entailment Relations, Word Embeddings, and Style Classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), Beijing, China, 26–31 July 2015; pp. 425–430.
  22. Shi, L. Rewriting and Paraphrasing Source Texts in Second Language Writing. J. Second Lang. Writ. 2012, 21, 134–148.
  23. Teufel, S. Do “Future Work” Sections Have a Purpose? Citation Links and Entailment for Global Scientometric Questions. In Proceedings of the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries, Tokyo, Japan, 7–11 August 2017.
  24. Takahashi, K.; Ishibashi, Y.; Sudoh, K.; Nakamura, S. Multilingual Machine Translation Evaluation Metrics Fine-Tuned on Pseudo-Negative Examples for WMT 2021 Metrics Task. In Proceedings of the WMT 2021–6th Conference on Machine Translation, Online, 10–11 November 2021; pp. 1049–1052.
  25. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
  26. Doddington, G. Automatic Evaluation of Machine Translation Quality Using N-Gram Co-Occurrence Statistics; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2002; p. 138.
  27. Madnani, N.; Tetreault, J.; Chodorow, M. Re-Examining Machine Translation Metrics for Paraphrase Identification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2012; pp. 182–190.
  28. Ji, Y.; Eisenstein, J. Discriminative Improvements to Distributional Sentence Similarity. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), Seattle, WA, USA, 18–21 October 2013; pp. 891–896.
  29. Brad, F. Neural Paraphrase Generation Using Transfer Learning; Association for Computational Linguistics: New York, NY, USA, 2017.
  30. Aziz, A.A.; Djamal, E.C.; Ilyas, R. Siamese Similarity Between Two Sentences Using Manhattan’s Recurrent Neural Networks. In Proceedings of the 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA), Yogyakarta, Indonesia, 20–22 September 2019; pp. 1–6.
  31. Saputro, W.F.; Djamal, E.C.; Ilyas, R. Paraphrase Identification Between Two Sentence Using Support Vector Machine. In Proceedings of the 2019 International Conference on Electrical Engineering and Informatics (ICEEI), Nanjing, China, 8–10 November 2019; pp. 406–411.
  32. Wubben, S.; Bosch, A.V.D.; Krahmer, E. Paraphrase Generation as Monolingual Translation: Data and Evaluation. In Proceedings of the 6th International Natural Language Generation Conference, Meath, Ireland, 7–9 July 2010; pp. 203–207.
  33. Mallinson, J.; Sennrich, R.; Lapata, M. Paraphrasing Revisited with Neural Machine Translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; pp. 881–893.
  34. Zhao, S.; Lan, X.; Liu, T.; Li, S. Application-Driven Statistical Paraphrase Generation. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, Suntec, Singapore, 2–7 August 2009; pp. 834–842.
  35. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. Nips 2014, 27, 3104–3112.
  36. Parikh, A.P.; Täckström, O.; Das, D.; Uszkoreit, J. A Decomposable Attention Model for Natural Language Inference. arXiv 2016, arXiv:1606.01933.
  37. Kirkpatrick, S.; Gelatt, C.D.; Vecchi, M.P. Optimization by Simulated Annealing. Science 1983, 220, 671–680.
  38. Granville, V.; Rasson, J.P.; Krivánek, M. Simulated Annealing: A Proof of Convergence. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 652–656.