Grammatical Error Correction (GEC) is the task of automatically detecting and correcting grammatical errors in text. Current state-of-the-art systems rely on large quantities of artificial data, but this is not always publicly available or easy to generate. The goal of this project is hence to build a GEC system that depends on very little annotated data (e.g. Bryant and Briscoe, 2018; Alikaniotis and Raheja, 2019; Stahlberg et al., 2019).
One way to do this is to leverage a pretrained Masked Language Model (MLM) to predict corrections in a manner similar to a gap-fill task. For example, an error detection system might predict an error of a certain type at a given position in a sentence, and the MLM can then predict a correction based on the context. Possible extensions include comparing performance with a real vs. oracle error detection system, different correction strategies for different error types (e.g. content vs. function words), and relaxed evaluation of the n-best predictions for each edit.
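To make the gap-fill idea concrete, here is a minimal sketch using the HuggingFace fill-mask pipeline as the MLM; the example sentence, the flagged error position, and the model choice are purely illustrative assumptions, not part of any cited system.

    # A minimal sketch of MLM-based correction: an (assumed) error detector has
    # flagged the token "have" at index 1, and the MLM fills the resulting gap.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    tokens = "She have been to Paris twice .".split()
    error_index = 1  # position flagged by the hypothetical error detector

    tokens[error_index] = fill_mask.tokenizer.mask_token
    masked_sentence = " ".join(tokens)

    # Take the n-best predictions for the gap; a relaxed evaluation could count
    # the edit as correct if the gold token appears anywhere in this list.
    for prediction in fill_mask(masked_sentence, top_k=5):
        print(prediction["token_str"], round(prediction["score"], 3))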
Language Model Based Grammatical Error Correction without Annotated Training Data.
Christopher Bryant, Ted Briscoe. 2018.
The Unreasonable Effectiveness of Transformer Language Models in Grammatical Error Correction.
Dimitris Alikaniotis, Vipul Raheja. 2019.
Neural Grammatical Error Correction with Finite State Transducers.
Felix Stahlberg, Christopher Bryant, Bill Byrne. 2019.
Grammatical error correction (GEC) is the task of automatically detecting and correcting grammatical errors in written text. The idea of ‘translating’ a grammatically incorrect sentence into a correct one has been proposed as a way to handle all error types simultaneously, and neural machine translation (NMT) sequence-to-sequence models have been applied to GEC with great success.
Recently, sequence tagging approaches have been proposed for text generation tasks where target texts highly overlap with source inputs (Malmi et al., 2019; Omelianchuk et al., 2020; Stahlberg and Kumar, 2020; Parnow et al., 2021). Rather than generating target texts from scratch, sequence tagging models predict edit operations, such as keep, delete, and insert. Target texts are then reconstructed from the inputs using these edit operations.
Since the source and target sentences are in the same language (e.g. English) and most words in the sentence do not need changing, GEC is well suited to these sequence tagging approaches.
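As a toy illustration of the tag-then-reconstruct idea (the tagset here is invented for illustration and does not follow any one of the papers below), consider:

    # Reconstruct a corrected sentence from per-token edit tags
    # (invented tagset, not any specific published tagger).
    def apply_edit_tags(source_tokens, tags):
        # Tags: "KEEP", "DELETE", "REPLACE_<word>", or "APPEND_<word>"
        # (keep the token and insert a word after it).
        output = []
        for token, tag in zip(source_tokens, tags):
            if tag == "KEEP":
                output.append(token)
            elif tag == "DELETE":
                continue
            elif tag.startswith("REPLACE_"):
                output.append(tag[len("REPLACE_"):])
            elif tag.startswith("APPEND_"):
                output.append(token)
                output.append(tag[len("APPEND_"):])
        return output

    source = "She have been to the Paris".split()
    tags = ["KEEP", "REPLACE_has", "KEEP", "KEEP", "DELETE", "KEEP"]
    print(" ".join(apply_edit_tags(source, tags)))  # She has been to Paris

A sequence tagging model then only has to predict one such tag per source token, rather than generate the whole target sentence from scratch.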
The aim of this project is to develop a sequence tagging model for error correction. Possible extensions include applying the model to other monolingual sequence-to-sequence tasks, such as text simplification, text normalisation, summarisation, sentence fusion, and sentence splitting and rephrasing.
Encode, Tag, Realize: High-Precision Text Editing.
Eric Malmi, Sebastian Krause, Sascha Rothe, Daniil Mirylenka, Aliaksei Severyn. 2019.
GECToR – Grammatical Error Correction: Tag, Not Rewrite.
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi. 2020.
Seq2Edits: Sequence Transduction Using Span-level Edit Operations.
Felix Stahlberg, Shankar Kumar. 2020.
Grammatical Error Correction as GAN-like Sequence Labeling.
Kevin Parnow, Zuchao Li, Hai Zhao. 2021.
Codeswitching is the natural linguistic phenomenon of mixing more than one language in a single sentence or discourse; e.g.
"I would like to buy un bocadillo por favor." == "I would like to buy a sandwich please."
Most NLP research has focused on processing monolingual text, but there is increasing interest in processing multilingual codeswitching texts, e.g. in machine translation (Xu and Yvon, 2021). This is challenging, however, because very few datasets exist, so researchers have begun to generate artificial codeswitching corpora (Rizvi et al., 2021; Gupta et al., 2021; Tarunesh et al., 2021).
The aim of this project is hence to follow this work and investigate different methods of automatically generating codeswitching data. The quality of the data can then be evaluated in several ways, including 1) training a classifier to differentiate between real/artificial codeswitching sentences, 2) evaluating performance on downstream tasks using real/artificial data, or 3) carrying out a small-scale human evaluation.
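One very simple starting point for generation is dictionary-based lexical substitution; the sketch below is a naive illustrative baseline (not the method of any toolkit cited below), with a toy lexicon and switching probability invented for the example.

    # A naive English-Spanish codeswitching baseline via dictionary substitution.
    # The lexicon is a toy example; a real system would derive translation
    # candidates from word-aligned parallel data instead.
    import random

    EN_ES = {"buy": "comprar", "sandwich": "bocadillo", "please": "por favor"}

    def codeswitch(sentence, switch_prob=0.5, seed=0):
        # Replace each dictionary word with its Spanish translation
        # with probability switch_prob.
        rng = random.Random(seed)
        out = []
        for word in sentence.split():
            key = word.lower().strip(".,!?")
            if key in EN_ES and rng.random() < switch_prob:
                out.append(EN_ES[key])
            else:
                out.append(word)
        return " ".join(out)

    print(codeswitch("I would like to buy a sandwich please ."))
    # -> I would like to buy a sandwich por favor .

A classifier distinguishing such output from real codeswitched text (evaluation method 1 above) would likely find this baseline easy to detect, which is exactly what motivates the more sophisticated generation methods in the references.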
Although we propose English-Spanish as a possible language pair for this work, feel free to propose a different pair if you can find an alternative dataset (e.g. there is also a lot of work on English-Hindi).
Can You Traducir This? Machine Translation for Code-Switched Input.
Jitao Xu, François Yvon. 2021.
GCM: A Toolkit for Generating Synthetic Code-mixed Text.
Mohd Sanad Zaki Rizvi, Anirudh Srinivasan, Tanuja Ganu, Monojit Choudhury, Sunayana Sitaram. 2021.
Training Data Augmentation for Code-Mixed Translation.
Abhirut Gupta, Aditya Vavre, Sunita Sarawagi. 2021.
From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text.
Ishan Tarunesh, Syamantak Kumar, Preethi Jyothi. 2021.