Computer Laboratory

Technical reports

Grammatical error correction in non-native English

Zheng Yuan

March 2017, 145 pages

This technical report is based on a dissertation submitted September 2016 by the author for the degree of Doctor of Philosophy to the University of Cambridge, St. Edmund’s College.


Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in written text. Previous research has mainly focussed on individual error types, and current commercial proofreading tools target only a limited set of error types. As sentences produced by learners may contain multiple errors of different types, a practical error correction system should be able to detect and correct all errors.

In this thesis, we investigate GEC for learners of English as a Second Language (ESL). Specifically, we treat GEC as a translation task from incorrect into correct English, explore new models for developing end-to-end GEC systems for all error types, study system performance for each error type, and examine model generalisation to different corpora. First, we apply Statistical Machine Translation (SMT) to GEC and show that it can form the basis of a competitive all-errors GEC system. We implement an SMT-based GEC system which contributes to our winning system submitted to a shared task in 2014. Next, we propose a ranking model to re-rank correction candidates generated by an SMT-based GEC system. This model introduces new linguistic information, and we show that it improves correction quality. Finally, we present the first study using Neural Machine Translation (NMT) for GEC. We demonstrate that NMT can be successfully applied to GEC and can help capture new errors missed by an SMT-based GEC system.

While we focus on GEC for English, the methods presented in this thesis can be easily applied to any language.

Full text

PDF (1.4 MB)

BibTeX record

@TechReport{UCAM-CL-TR-904,
  author =       {Yuan, Zheng},
  title =        {{Grammatical error correction in non-native English}},
  year =         2017,
  month =        mar,
  url =          {},
  institution =  {University of Cambridge, Computer Laboratory},
  number =       {UCAM-CL-TR-904}
}