Grammar Correction of Multiple Errors in Chinese

Contributor:

Zhiyong Hu

Grammar Error Correction (GEC) is a key task in Natural Language Processing (NLP): it aims to automatically detect and correct grammatical errors in sentences, and it holds considerable research value. Mainstream grammar correction methods rely primarily on sequence tagging and text generation, both of which are end-to-end approaches. These methods perform well when error density is low, but often give unsatisfactory results when error density is high, that is, when a single sentence contains multiple errors. In such cases they tend to over-correct words that are already correct, leading to a high false alarm rate.

Chinese grammar error correction
prompt templates
sequence labeling

1. Introduction

Grammar correction is an important application task, with roles in education, official document processing, and the preprocessing stages of many natural language processing tasks. Although grammatical errors can occur in any language, this discussion focuses solely on grammar correction in Chinese texts. Because of the inherent characteristics and usage habits of Chinese, Chinese Grammar Error Correction (CGEC) differs clearly from correction in other languages, and the errors it must handle are diverse. Furthermore, sentences written by non-native speakers of Chinese often contain several types of errors at once. Under such high error density conditions, accurately detecting and correcting diverse and complex Chinese grammatical errors is challenging. Grammatical errors can be roughly classified by their characteristics into redundancy errors (R), omission errors (M), word order errors (W), and incorrect word errors (S) [1]. R-type errors are unnecessary or repeated linguistic elements in a sentence, which make it verbose. M-type errors are missing linguistic elements or structures that the sentence requires, which leave it incomplete or disfluent. W-type errors are words or phrases in the wrong order, which make the grammar or meaning unclear. S-type errors are misspelled or wrongly chosen words, which make the sentence inaccurate or hard to understand. Table 1 shows an example of each of the four error types in Chinese text.

Table 1. Types of Chinese Grammatical Errors.

Correct Sentence	Redundancy Error (R)	Omission Error (M)	Word Order Error (W)	Incorrect Word Error (S)
小猫捉一个老鼠	小猫捉一个老鼠鼠	小-捉一个老鼠	小猫一个捉老鼠	小帽捉一个老鼠
The cat catches a mouse	The cat catches a mouse mouse	The - catches a mouse	The cat a catches mouse	The hat catches a mouse
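
The four categories in Table 1 can be told apart mechanically when a sentence contains exactly one error and the correct sentence is known. The following is a minimal sketch of that idea, not a method from the cited work: the function name `classify_single_error` and the heuristic itself are assumptions made for illustration, and it uses only Python's standard `difflib` module. The omission example is written without the placeholder dash shown in the table.

```python
from difflib import SequenceMatcher


def classify_single_error(erroneous: str, correct: str) -> str:
    """Guess the error type (R, M, W or S) of a sentence with ONE error.

    Toy heuristic for illustration only; it is not a trained CGEC model.
    """
    if erroneous == correct:
        return "correct"
    # Same characters in a different order: treat as a word order error.
    if sorted(erroneous) == sorted(correct):
        return "W"
    matcher = SequenceMatcher(None, erroneous, correct, autojunk=False)
    # Collect the kinds of edits needed to turn the erroneous text
    # into the correct text, ignoring unchanged spans.
    edits = {tag for tag, *_ in matcher.get_opcodes() if tag != "equal"}
    if edits == {"delete"}:
        return "R"  # something extra must be removed
    if edits == {"insert"}:
        return "M"  # something missing must be added
    if edits == {"replace"}:
        return "S"  # a wrong character must be swapped
    return "mixed"  # more than one kind of edit: outside this sketch


correct = "小猫捉一个老鼠"
for sentence in ["小猫捉一个老鼠鼠", "小捉一个老鼠", "小猫一个捉老鼠", "小帽捉一个老鼠"]:
    print(sentence, classify_single_error(sentence, correct))
```

Sentences with several errors fall into the "mixed" branch, which is exactly the high error density case that the simple heuristic cannot resolve.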

2. Methods for Grammar Correction

This section introduces the two current mainstream paradigms for grammar correction, sequence labeling and text generation, and then reviews related work on prompt learning and prompt templates.

2.1. Grammar Correction Methods based on Sequence Labeling and Text Generation

Research on Chinese grammar error correction falls into two categories: methods based on sequence labeling and methods based on text generation. The fundamental idea of sequence labeling-based methods is to define operation tags such as 'delete,' 'retain,' and 'add' that correspond to error types such as 'redundant,' 'correct,' and 'missing.' These operation tags are attached to the text sequence; the model learns the dependencies between them and predicts an operation tag for each character, and the predicted tags are then applied to correct the text. This type of method was first proposed and applied in English error correction. Awasthi et al. [3] implemented text correction with sequence labeling by first marking the characters in the sequence with self-defined tags and then predicting the corresponding operation tags through multiple rounds of iterative prediction and refinement. However, that paper provided only simple definitions for the operation tags. Later, Omelianchuk et al. [4] refined the design of the operation tags, defining 5000 tags, including 'add,' 'delete,' 'modify,' and 'retain,' and then used a pre-trained Transformer with multi-round iterative sequence labeling to obtain the operation tags for the target sequence. In Chinese text correction, Deng et al. [5] combined a pre-trained Transformer encoder with an editing space. This editing space, also called the operation tag set, comprises 8772 tags, each representing a specific editing action such as adding, deleting, or modifying a character. Given the characteristics of Chinese text, some scholars have tried to integrate knowledge of phonetically and graphically similar characters into the grammar correction model. Li Jiacheng et al. [6] proposed a correction model that integrates a pointer network with confusion set knowledge. While predicting word editing operations, the model also lets the pointer network choose words from a confusion set that incorporates phonetic and graphical similarity knowledge, which improves correction results for substitution errors. However, although sequence labeling methods offer fast inference and need relatively little data, they demand high-quality annotated data and are restricted by the size of the operation tag set, which makes it hard for them to handle the complex problems encountered in real applications.
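
To make the tag-then-apply step concrete, the sketch below applies one operation tag per character to produce the corrected text. It is an illustration under stated assumptions, not the tag scheme of any cited system: the function `apply_edit_tags` and the tag names `$KEEP`, `$DELETE`, `$REPLACE_x`, and `$APPEND_x` are hypothetical, and the tags are written by hand here, whereas a real model would predict them.

```python
def apply_edit_tags(chars, tags):
    """Apply one operation tag per input character and return the edited text.

    The tag names are hypothetical and chosen for this sketch only.
    """
    if len(chars) != len(tags):
        raise ValueError("need exactly one tag per character")
    out = []
    for ch, tag in zip(chars, tags):
        if tag == "$KEEP":
            out.append(ch)                         # retain the character
        elif tag == "$DELETE":
            continue                               # drop a redundant character
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])     # substitute a wrong character
        elif tag.startswith("$APPEND_"):
            out.append(ch)                         # keep the character, then
            out.append(tag[len("$APPEND_"):])      # add the missing one after it
        else:
            raise ValueError(f"unknown tag: {tag}")
    return "".join(out)


# One sentence with two errors: a wrong character and a redundant character.
source = "小帽捉一个老鼠鼠"
tags = ["$KEEP", "$REPLACE_猫", "$KEEP", "$KEEP",
        "$KEEP", "$KEEP", "$KEEP", "$DELETE"]
print(apply_edit_tags(source, tags))
```

Because every insertion or replacement needs its own tag, the tag set grows with the vocabulary, which is one way to see why the size of the operation tag set limits these methods.
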
Text generation-based methods borrow from neural machine translation: they translate the original sentence directly into a correct one by learning the dependencies between the words in the input sequence. Unlike translation, however, the input and target sequences in grammar correction are in the same language and share many identical characters, so characters can often be copied directly from the input sequence into the target sequence during generation. For this reason, Wang et al. [7] proposed a grammar correction model that integrates a copy mechanism. Built on the Transformer architecture, this model predicts the character at the current position in the target sequence given the input sequence, and uses a balancing factor to control whether characters are copied from the input sequence into the generated target sequence. Additionally, Wang et al. [8] proposed a grammar correction model that combines a dynamic residual structure with the Transformer model to better capture semantic information during target sequence generation; they also used corrupted text for data augmentation. Fu et al. [9] proposed a three-stage method for grammar correction. They first eliminated shallow errors, such as spelling or punctuation errors, using a pre-trained language model and a set of similar characters. They then built Transformer models at the character and word levels to handle grammatical errors. Finally, in the ensemble stage, they re-ranked the results of the previous two stages and selected the optimal output. Because text generation methods only need to generate correct text from the input sequence using the learned dependencies, they do not require specific error types to be defined. However, they still fall short in controllability and interpretability.
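
The role of the balancing factor in a copy mechanism can be sketched as a weighted mixture of two probability distributions for the next character. This is a generic illustration and may differ from the exact formulation in [7]: the function name `mix_copy_and_generate`, the dictionary representation, and the convention that a larger `alpha` favors copying are all assumptions, and the probabilities are made-up numbers.

```python
def mix_copy_and_generate(p_gen, p_copy, alpha):
    """Blend a generation distribution with a copy distribution.

    p_gen, p_copy: dicts mapping a character to its probability.
    alpha: balancing factor in [0, 1]; larger values favor copying
           (an assumed convention for this sketch).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vocab = set(p_gen) | set(p_copy)
    return {
        ch: alpha * p_copy.get(ch, 0.0) + (1.0 - alpha) * p_gen.get(ch, 0.0)
        for ch in vocab
    }


# Illustrative numbers only: the decoder is unsure between two characters,
# while the copy distribution points at a character in the input sentence.
p_gen = {"鼠": 0.6, "猫": 0.4}
p_copy = {"鼠": 1.0}
mixed = mix_copy_and_generate(p_gen, p_copy, alpha=0.5)
print(max(mixed, key=mixed.get))
```

With `alpha` set to 0 the model generates freely, and with `alpha` set to 1 it can only reproduce input characters; a mixture of two valid distributions still sums to one.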

2.2. Prompt learning and prompt templates

In recent years, with the emergence of various large-scale pre-trained models, research methodology has been shifting from the traditional 'pre-training + fine-tuning' paradigm to the prompt-based 'pre-training + prompting + prediction' paradigm. The traditional paradigm trains the model on a large dataset (pre-training) and then optimizes it for a specific task (fine-tuning). This usually requires setting an objective function for the specific downstream task and retraining on the corresponding domain corpus so that the parameters of the pre-trained model adapt to that task. However, for ultra-large-scale pre-trained models, such as GPT-3 [10] with 175 billion parameters, adapting to downstream tasks through 'pre-training + fine-tuning' is often time-consuming and costly. Moreover, since the pre-trained model already performs well in its original domain, domain transfer through fine-tuning is constrained by the original domain and may damage the model's performance. The 'pre-training + prompting + prediction' paradigm of prompt learning therefore avoids modifying the pre-trained model; instead, prompt templates are constructed so that downstream tasks better fit the pre-trained model. As research on prompt learning flourishes, this paradigm is gradually becoming the fourth paradigm in natural language processing [11].

In prompt learning, the design of prompt templates mainly concerns the position and number of prompts, and approaches can be divided into manually designed and automatically learned templates. Manually designed prompt templates draw on human experience and professional knowledge of natural language. Petroni et al. [12] manually defined a cloze template for each relation in the knowledge source in order to probe the facts and common knowledge contained in language models. Schick et al. [13] transformed input examples into cloze examples containing task description information, successfully combining task descriptions with standard supervised learning. Manually designed prompt templates are intuitive and fluent, but they depend heavily on human language expertise and repeated trial and error, so high-quality prompt templates are costly to produce. Automatic learning of prompt templates has therefore been explored, and it can be divided into discrete and continuous types. Discrete prompts use unique discrete characters as prompts to generate prompt templates automatically. Ben-David et al. [14] proposed a domain-adaptive algorithm that trains models to generate unique domain-related features, which are then concatenated with the original input to form prompt templates. Continuous prompts construct soft prompt templates from the perspective of vector embeddings and perform prompting directly in the model's embedding space. Li et al. [15] froze the model parameters and constructed task-specific continuous vector sequences as soft prompts by adding prefixes. Furthermore, many scholars combine the two approaches to obtain higher-quality prompt templates. For example, Zhong et al. [16] first defined prompt templates using a discrete search method, then initialized virtual tokens according to the template and fine-tuned their embeddings for optimization. Han et al. [17] proposed a rule-based prompt tuning method, which combines manually designed sub-templates into a complete prompt template according to logical rules and inserts virtual tokens with adjustable embeddings.
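
A manually designed cloze template, in the sense described above, can be as simple as a fixed string with one slot for the input and one masked slot for the model to fill. The sketch below is hypothetical throughout: the function `build_cloze_prompt`, the template wording, the `[MASK]` token, and the label words are assumptions for illustration and are not taken from any of the cited papers.

```python
# Hypothetical mapping from a predicted label word to a task label.
LABEL_WORDS = {"correct": "no_error", "wrong": "has_error"}


def build_cloze_prompt(sentence: str, mask_token: str = "[MASK]") -> str:
    """Wrap a sentence in a hand-written cloze template.

    The template wording is an assumption made for this sketch; a masked
    language model would be asked to fill in the mask_token position.
    """
    return f"Sentence: {sentence} The grammar of this sentence is {mask_token}."


def label_from_filled_word(word: str) -> str:
    """Translate the word predicted for the mask into a task label."""
    return LABEL_WORDS.get(word, "unknown")


print(build_cloze_prompt("小猫捉一个老鼠鼠"))
print(label_from_filled_word("wrong"))
```

Changing the wording of the template or the label words changes what the pre-trained model is asked to predict, which is why hand-written templates need repeated trial and error.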

This entry is adapted from the peer-reviewed paper 10.3390/app13158858

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply.