Department of Computer Science and Technology

Technical reports

Automatic annotation of error types for grammatical error correction

Christopher Bryant

June 2019, 138 pages

This technical report is based on a dissertation submitted December 2018 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Churchill College.

Abstract

Grammatical Error Correction (GEC) is the task of automatically detecting and correcting grammatical errors in text. Although previous work has focused on developing systems that target specific error types, the current state of the art uses machine translation to correct all error types simultaneously. A significant disadvantage of this approach is that machine translation does not produce annotated output and so error type information is lost. This means we can only evaluate a system in terms of overall performance and cannot carry out a more detailed analysis of different aspects of system performance.

In this thesis, I develop a system to automatically annotate parallel original and corrected sentence pairs with explicit edits and error types. In particular, I first extend the Damerau-Levenshtein alignment algorithm to make use of linguistic information when aligning parallel sentences, and supplement this alignment with a set of merging rules to handle multi-token edits. The output from this algorithm surpasses other edit extraction approaches in terms of approximating human edit annotations and is the current state of the art. Having extracted the edits, I next classify them according to a new rule-based error type framework that depends only on automatically obtained linguistic properties of the data, such as part-of-speech tags. This framework was inspired by existing frameworks, and human judges rated the appropriateness of the predicted error types as Good (85%) or Acceptable (10%) in a random sample of 200 edits. The whole system is called the ERRor ANnotation Toolkit (ERRANT) and is the first toolkit capable of automatically annotating parallel sentences with error types.
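As a rough illustration of the alignment step described above, the following is a minimal sketch of a token-level Damerau-Levenshtein alignment whose substitution cost takes linguistic information into account. It implements the restricted (optimal string alignment) variant, which allows only adjacent transpositions. The `sub_cost` function, its weights, and the hand-assigned part-of-speech tags are assumptions made for this example; they are not the cost function from the report, and the merging rules for multi-token edits are not shown.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); tags are assigned by hand in this sketch


def sub_cost(a: Token, b: Token) -> float:
    """Hypothetical substitution cost: cheaper when two tokens look related."""
    if a[0] == b[0]:
        return 0.0
    cost = 1.0
    if a[0].lower() == b[0].lower():
        cost -= 0.5   # tokens differ only in case
    if a[1] == b[1]:
        cost -= 0.25  # tokens share a part-of-speech tag
    return cost


def align(orig: List[Token], cor: List[Token]) -> List[str]:
    """Return one lowest-cost operation sequence.

    M = match, S = substitute, D = delete, I = insert,
    T = transpose two adjacent tokens.
    """
    n, m = len(orig), len(cor)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = float(i), "D"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = float(j), "I"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sub_cost(orig[i - 1], cor[j - 1])
            options = [
                (cost[i - 1][j - 1] + s, "M" if s == 0.0 else "S"),
                (cost[i - 1][j] + 1.0, "D"),
                (cost[i][j - 1] + 1.0, "I"),
            ]
            # Adjacent transposition: the last two tokens are swapped.
            if (i > 1 and j > 1
                    and orig[i - 1][0] == cor[j - 2][0]
                    and orig[i - 2][0] == cor[j - 1][0]):
                options.append((cost[i - 2][j - 2] + 1.0, "T"))
            cost[i][j], back[i][j] = min(options, key=lambda o: o[0])

    # Follow the back-pointers from the bottom-right cell to the origin.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op in ("M", "S"):
            i, j = i - 1, j - 1
        elif op == "D":
            i -= 1
        elif op == "I":
            j -= 1
        else:  # "T" consumes two tokens on each side
            i, j = i - 2, j - 2
    return ops[::-1]


original = [("He", "PRON"), ("go", "VERB"), ("to", "ADP"), ("school", "NOUN")]
corrected = [("He", "PRON"), ("went", "VERB"), ("to", "ADP"), ("school", "NOUN")]
print(align(original, corrected))
```

Because "go" and "went" share a tag, substituting one for the other costs less than a separate deletion and insertion, so the aligner pairs them as a single edit. This is the kind of behaviour a linguistically informed cost is meant to encourage.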

I demonstrate the value of ERRANT by applying it to the system output produced by the participants of the CoNLL-2014 shared task, and carry out a detailed error type analysis of system performance for the first time. I also develop a simple language-model-based approach to GEC that does not require annotated training data, and show how it can be improved using ERRANT error types.
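To make the idea of an error type analysis concrete, here is a minimal sketch of how per-type scores might be computed once both the system output and the reference carry typed edits. The edit representation, the matching criterion (same span and same correction), and the type labels are assumptions made for this example rather than the report's evaluation procedure. The score is F0.5, the metric used in the CoNLL-2014 shared task.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# (start, end, correction, error type); a hypothetical edit representation
Edit = Tuple[int, int, str, str]


def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-score that weights precision more than recall when beta < 1."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p + r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def per_type_scores(hyp: Set[Edit], ref: Set[Edit]) -> Dict[str, float]:
    """Score each error type separately.

    A hypothesis edit counts as correct when its span and correction
    match a reference edit; the type label is taken from the reference.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    ref_types = {e[:3]: e[3] for e in ref}
    hyp_keys = {e[:3] for e in hyp}
    for e in hyp:
        if e[:3] in ref_types:
            tp[ref_types[e[:3]]] += 1
        else:
            fp[e[3]] += 1
    for e in ref:
        if e[:3] not in hyp_keys:
            fn[e[3]] += 1
    types = set(tp) | set(fp) | set(fn)
    return {t: f_beta(tp[t], fp[t], fn[t]) for t in types}


# Illustrative labels only.
reference = {(1, 2, "went", "R:VERB:TENSE"), (3, 3, "the", "M:DET")}
hypothesis = {(1, 2, "went", "R:VERB:TENSE"), (4, 5, "", "U:PREP")}
scores = per_type_scores(hypothesis, reference)
print(scores)
```

A breakdown of this kind shows, for instance, that a system corrects verb tense errors reliably while missing determiners, which a single overall score would hide.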

Full text

PDF (3.8 MB)

BibTeX record

@TechReport{UCAM-CL-TR-938,
  author      = {Bryant, Christopher},
  title       = {{Automatic annotation of error types for grammatical error correction}},
  year        = 2019,
  month       = jun,
  url         = {https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-938.pdf},
  institution = {University of Cambridge, Computer Laboratory},
  number      = {UCAM-CL-TR-938}
}