Grammar Error Correction (GEC) is a key task in the field of Natural Language Processing (NLP). Its purpose is to automatically detect and correct grammatical errors in sentences, and it holds immense application and research value. Mainstream grammar correction methods rely primarily on sequence tagging and text generation, two end-to-end approaches. These methods demonstrate exemplary performance in domains with low error density, but often fail to produce satisfactory results in high-error-density situations where multiple errors occur in a single sentence. As a result, these methods tend to over-correct words that are already correct, leading to a high false alarm rate.
1. Introduction
Grammar correction is a highly important application task, playing a role in education, official document processing, and many preprocessing stages of natural language processing. Although grammatical errors can occur in any language, this discussion focuses solely on the grammar correction task for Chinese text. Influenced by the inherent characteristics and usage habits of Chinese, errors addressed by Chinese Grammar Error Correction (CGEC) exhibit clear differences and diversity. Furthermore, in Chinese sentences written by non-native speakers, multiple types of errors often appear within a single sentence [1]. Under such high-error-density conditions, accurately detecting and correcting the complex and diverse grammatical errors of Chinese is a challenging task. By their characteristics, grammatical errors can be broadly divided into redundant errors (R), missing errors (M), word order errors (W), and wrong word errors (S). Type R errors refer to unnecessary or repeated linguistic elements in a sentence, resulting in verbosity or needless repetition. Type M errors indicate that essential linguistic elements or structures are missing, leaving the sentence incomplete or disfluent. Type W errors point to incorrect word or phrase order in a sentence, violating grammatical rules or making the meaning unclear. Type S errors denote wrongly used or misspelled words that render the sentence inaccurate or hard to understand. Table 1 shows examples of these four error types in Chinese text.
Table 1. Types of Chinese Grammatical Errors.
2. Methods for Grammar Correction
This chapter mainly introduces two approaches to grammar correction: the current mainstream sequence labeling paradigm and the text generation paradigm, as well as explorations of related work using prompt learning and prompt templates.
2.1. Grammar Correction Methods Based on Sequence Labeling and Text Generation
Research on Chinese grammar error correction can be divided into two categories: methods based on sequence labeling and methods based on text generation. The fundamental idea of sequence labeling-based methods is to define corresponding 'delete,' 'retain,' 'add,' and other operation tags according to error types such as 'redundant,' 'correct,' and 'missing.' These operation tags are then attached to the text sequence. The model learns the dependencies between these operation tags and predicts an operation tag for each character in the text sequence, which is then used for grammar correction. This type of method was first proposed and applied in the field of English error correction. Awasthi et al. [3] used sequence labeling to implement text correction, first marking characters in the sequence with self-defined tags and then predicting the corresponding operation tags through an iterative process involving multiple rounds of prediction and refinement. However, that work provided only simple definitions of the operation tags. Later, Omelianchuk et al. [4] refined the design of operation tags, defining 5000 tags including 'add,' 'delete,' 'modify,' and 'retain,' and then used a pre-trained Transformer with multi-round iterative sequence labeling to obtain the operation tags for the target sequence. In the field of Chinese text correction, Deng et al. [5] achieved text correction by combining a pre-trained Transformer encoder with an editing space. This editing space comprises 8772 tags, also known as the operation tag set, where each tag represents a specific editing action such as adding, deleting, or modifying a character. Given the characteristics of Chinese text, some scholars have tried to integrate knowledge of phonetic and graphical similarity into grammar correction models. Li Jiacheng et al. [6] proposed a correction model that integrates a pointer network with confusion-set knowledge. While predicting word editing operations, the model also allows the pointer network to choose words from a confusion set incorporating phonetic and graphical similarity knowledge, thereby improving correction results for substitution errors. However, although sequence labeling methods offer fast inference and require relatively small datasets, they demand high-quality annotated data and are restricted by the size of the operation tag set, which makes it challenging to handle the complex problems encountered in real-world applications.
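The tag-then-edit loop underlying this paradigm can be sketched in a few lines. The tag names ('KEEP', 'DELETE', 'REPLACE_x', 'APPEND_x') and the example sentence below are illustrative assumptions, not the tag inventories used by any of the cited systems:

```python
# Minimal sketch of applying per-character operation tags to a Chinese
# sentence, in the spirit of sequence-labeling GEC. The tag set here is
# a toy assumption; real systems use thousands of tags and iterate the
# predict-then-apply loop over several rounds.

def apply_edit_tags(chars, tags):
    """Apply one round of edit operations to a character sequence."""
    out = []
    for ch, tag in zip(chars, tags):
        if tag == "KEEP":
            out.append(ch)
        elif tag == "DELETE":                 # redundant (R-type) character
            continue
        elif tag.startswith("REPLACE_"):      # wrong word (S-type)
            out.append(tag[len("REPLACE_"):])
        elif tag.startswith("APPEND_"):       # missing (M-type): keep, then insert
            out.append(ch)
            out.append(tag[len("APPEND_"):])
    return "".join(out)

# Toy example: delete a duplicated character and fix a wrong one.
chars = list("我我喜欢吃苹裹")
tags = ["DELETE", "KEEP", "KEEP", "KEEP", "KEEP", "KEEP", "REPLACE_果"]
print(apply_edit_tags(chars, tags))  # → 我喜欢吃苹果
```

In the iterative schemes of [3] and [4], the corrected output of one round is fed back as input for the next round of tag prediction, so that errors whose repair depends on an earlier edit can still be fixed.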
Text generation-based methods incorporate the concept of neural machine translation, translating original sentences directly into correct ones by learning the dependencies between the words in the input sequence. However, unlike translation tasks, the input and target sequences of a grammar correction task are in the same language and share many identical characters. Therefore, characters can often be copied directly from the input sequence into the target sequence during text generation. To this end, Wang et al. [7] proposed a grammar correction model that integrates a copy mechanism. Based on the Transformer architecture, this model predicts the character at the current position in the target sequence given the input sequence, and uses a balancing factor to control whether to copy characters from the input sequence into the generated target sequence. Additionally, Wang et al. [8] proposed a grammar correction model that combines a dynamic residual structure with the Transformer model to better capture semantic information during target sequence generation; they also used corrupted text for data augmentation. Fu et al. [9] proposed a three-stage method for grammar correction. They first eliminated shallow errors such as spelling or punctuation mistakes based on a pre-trained language model and a set of similar characters. They then built Transformer models at the character and word levels to handle grammatical errors. Finally, in an ensemble stage, they re-ranked the results of the previous two stages and selected the optimal output. Text generation methods only need to generate correct text from the input sequence using the learned dependencies, eliminating the need to define specific error types. However, this approach still needs improvement in controllability and interpretability.
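The balancing factor in the copy-mechanism model of Wang et al. [7] can be illustrated numerically: the final output distribution mixes a generation distribution over the vocabulary with a copy distribution induced by attention over the input. The vocabulary size, probabilities, and mixing weight below are made-up toy values, and the fixed scalar `alpha` stands in for what is in practice a learned, position-dependent factor:

```python
import numpy as np

def mixed_distribution(p_gen_vocab, attn, src_ids, alpha, vocab_size):
    """Combine generation and copy distributions:
    p(w) = (1 - alpha) * p_gen(w) + alpha * p_copy(w),
    where p_copy scatters attention weights onto the source token ids."""
    p_copy = np.zeros(vocab_size)
    for a, tok in zip(attn, src_ids):
        p_copy[tok] += a                      # attention mass -> source tokens
    return (1.0 - alpha) * p_gen_vocab + alpha * p_copy

vocab_size = 5
p_gen = np.array([0.1, 0.2, 0.3, 0.2, 0.2])   # generator softmax (toy values)
attn = np.array([0.7, 0.3])                   # attention over 2 source tokens
src_ids = [2, 4]                              # vocab ids of the source tokens
p = mixed_distribution(p_gen, attn, src_ids, alpha=0.5, vocab_size=vocab_size)
print(p)  # a valid distribution; mass is pulled toward source tokens 2 and 4
```

With a large `alpha`, the model effectively copies input characters; with a small `alpha`, it generates freely from the vocabulary, which matches the intuition that most characters in a GEC input are already correct.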
2.2. Prompt Learning and Prompt Templates
In recent years, with the emergence of various large-scale pre-trained models, research methodology has gradually been transitioning from the traditional 'pre-training + fine-tuning' paradigm to the prompt-based 'pre-training + prompting + prediction' paradigm. The traditional 'pre-training + fine-tuning' paradigm involves training a model on a large dataset (pre-training) and then optimizing it for a specific task (fine-tuning). It is usually necessary to set an objective function for the specific downstream task and retrain on the corresponding domain corpus, adjusting the parameters of the pre-trained model to adapt it to the downstream task. However, for ultra-large-scale pre-trained models, such as the GPT-3 model [10] with 175 billion parameters, matching downstream tasks using the 'pre-training + fine-tuning' paradigm is often time-consuming and costly. Moreover, since a pre-trained model already performs well in its original domain, fine-tuning for domain transfer is constrained by that original domain, which may damage its performance. Therefore, the 'pre-training + prompting + prediction' paradigm of prompt learning avoids modifying the pre-trained model; instead, prompt templates are constructed so that downstream tasks better fit the pre-trained model. As research on prompt learning flourishes, the 'pre-training + prompting + prediction' paradigm is gradually evolving into the fourth paradigm of natural language processing [11].
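The contrast with fine-tuning can be made concrete: under prompting, the downstream input is wrapped in a template and task labels are mapped to words, so the pre-trained model itself is left untouched. The template, the verbalizer, and the `mlm_fill` stand-in below are invented for illustration; any masked language model could fill the slot:

```python
# Minimal sketch of the "pre-training + prompting + prediction" idea for a
# grammaticality-judgment task. The cloze template and label-to-word
# "verbalizer" are illustrative assumptions; `mlm_fill` is an abstract
# stand-in for a frozen masked language model.

def build_prompt(sentence):
    """Wrap the raw input in a cloze template; the model only fills [MASK]."""
    return f'Sentence: "{sentence}" The grammar of this sentence is [MASK].'

VERBALIZER = {"correct": "correct", "incorrect": "wrong"}  # label -> word

def predict(sentence, mlm_fill):
    """mlm_fill(prompt) -> the word predicted for [MASK]."""
    word = mlm_fill(build_prompt(sentence))
    for label, w in VERBALIZER.items():
        if word == w:
            return label
    return None

# A stub model that always answers "wrong", just to exercise the plumbing.
print(predict("他昨天去去学校。", lambda prompt: "wrong"))  # → incorrect
```

No parameter of the underlying model is updated here; all task adaptation lives in the template and the verbalizer, which is precisely what makes the paradigm cheap for very large models.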
In prompt learning, the design of prompt templates mainly concerns the position and number of prompts, and approaches can be divided into manual design and automatic learning. Manually designed prompt templates draw on human experience and expertise in natural language. Petroni et al. [12] manually designed a corresponding cloze template for each relation in a knowledge base, exploring the factual and common-sense knowledge contained in language models. Schick et al. [13] transformed input examples into cloze examples containing task-description information, successfully combining task descriptions with standard supervised learning. Manually designed prompt templates are intuitive and fluent, but they depend heavily on human linguistic expertise and frequent trial and error, making high-quality prompt templates costly. Automatic learning of prompt templates has therefore been explored, and it can be divided into discrete and continuous types. Discrete prompts use unique discrete characters as prompts to generate prompt templates automatically. Ben-David et al. [14] proposed a domain-adaptation algorithm that trains models to generate unique domain-related features, which are then concatenated with the original input to form prompt templates. Continuous prompts construct soft prompt templates from a vector-embedding perspective and perform prompting directly in the model's embedding space. Li et al. [15] froze the model parameters while constructing task-specific continuous vector sequences as soft prompts by adding prefixes. Furthermore, many scholars combine the two approaches to obtain higher-quality prompt templates. Zhong et al. [16], for example, first defined prompt templates using a discrete search method and then initialized virtual tokens from the template, fine-tuning their embeddings for optimization. Han et al. [17] proposed a rule-based prompt-tuning method that combines manually designed sub-templates into a complete prompt template according to logical rules and inserts virtual tokens with adjustable embeddings.
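A rule-based composition of sub-templates in this spirit can be sketched as follows; the sub-templates, the composition rule, and the '[V1]'/'[V2]' virtual-token placeholders are all illustrative assumptions rather than the actual templates of Han et al. [17]:

```python
# Sketch of composing manually designed sub-templates into one prompt
# according to a simple logical rule. "[MASK]" is the cloze slot for the
# language model; "[V1]"/"[V2]" mark virtual tokens whose embeddings would
# be tuned while the rest of the template stays fixed.

SUB_TEMPLATES = {
    "entity":   "{x} is a [MASK] entity",
    "relation": "{x} [V1] [V2] {y}",   # virtual tokens with tunable embeddings
}

def compose(rule, **slots):
    """Instantiate each sub-template named in `rule` and conjoin them."""
    parts = [SUB_TEMPLATES[name].format(**slots) for name in rule]
    return ", and ".join(parts) + "."

prompt = compose(["entity", "relation"], x="Beijing", y="China")
print(prompt)  # → Beijing is a [MASK] entity, and Beijing [V1] [V2] China.
```

The discrete part (the hand-written sub-templates and the rule) supplies structure, while the continuous part (the virtual-token embeddings) is optimized, combining the strengths of both template families.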