Computer Laboratory

Technical reports

Grammatical error prediction

Øistein E. Andersen

January 2011, 163 pages

This technical report is based on a dissertation submitted in 2010 by the author to the University of Cambridge (Girton College) for the degree of Doctor of Philosophy.

Abstract

In this thesis, we investigate methods for automatic detection, and to some extent correction, of grammatical errors. The evaluation is based on manual error annotation in the Cambridge Learner Corpus (CLC). Automatic or semi-automatic annotation of error corpora is one possible application, but the methods also apply in other settings, for instance to give learners feedback on their writing or in a proofreading tool used to prepare texts for publication.

Apart from the CLC, we use the British National Corpus (BNC) to get a better model of correct usage, WordNet for semantic relations, other machine-readable dictionaries for orthography/morphology, and the Robust Accurate Statistical Parsing (RASP) system to parse both the CLC and the BNC and thereby identify syntactic relations within the sentence. An ancillary outcome of this is a syntactically annotated version of the BNC, which we have made publicly available.

We present a tool called GenERRate, which can be used to introduce errors into a corpus of correct text, and evaluate to what extent the resulting synthetic error corpus can complement or replace a real error corpus.
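The idea of introducing errors into correct text can be illustrated with a minimal sketch. This is not GenERRate's actual interface or error inventory: the function names `inject_error` and `make_synthetic_corpus`, and the three error kinds (token deletion, adjacent swap, duplication), are assumptions chosen only for illustration.

```python
import random

def inject_error(tokens, rng, kind):
    """Return a copy of tokens with one synthetic error of the given kind.

    The error kinds here are hypothetical examples, not GenERRate's own.
    """
    tokens = list(tokens)  # never modify the caller's list
    if kind == "delete" and len(tokens) > 1:
        del tokens[rng.randrange(len(tokens))]
    elif kind == "swap" and len(tokens) > 1:
        i = rng.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    elif kind == "duplicate" and tokens:
        i = rng.randrange(len(tokens))
        tokens.insert(i, tokens[i])
    return tokens

def make_synthetic_corpus(sentences, seed=0):
    """Pair each correct sentence (label 0) with a corrupted copy (label 1)."""
    rng = random.Random(seed)  # fixed seed keeps the corpus reproducible
    kinds = ["delete", "swap", "duplicate"]
    corpus = []
    for sentence in sentences:
        tokens = sentence.split()
        corpus.append((tokens, 0))
        corpus.append((inject_error(tokens, rng, rng.choice(kinds)), 1))
    return corpus
```

A corpus built this way provides labelled correct/incorrect pairs; how well such data can stand in for real learner errors is the question the report evaluates.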

Different methods for detection and correction are investigated, including: sentence-level binary classification based on machine learning over n-grams of words, n-grams of part-of-speech tags and grammatical relations; automatic identification of features which are highly indicative of individual errors; and development of classifiers aimed more specifically at given error types, for instance concord errors based on syntactic structure and collocation errors based on co-occurrence statistics from the BNC, using clustering to deal with data sparseness. We show that such techniques can detect, and sometimes even correct, at least certain error types as well as or better than human annotators.
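As a rough illustration of the sentence-level features mentioned above, the following sketch counts n-grams of words and of part-of-speech tags for one sentence. It is an assumption-laden example rather than the report's feature extractor: the names `ngrams` and `sentence_features`, the boundary symbols `<s>` and `</s>`, and the tags in the usage note are placeholders, and grammatical-relation features are left out.

```python
from collections import Counter

def ngrams(seq, n):
    """Return all n-grams of seq as tuples, padded with boundary symbols."""
    padded = ["<s>"] * (n - 1) + list(seq) + ["</s>"] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

def sentence_features(words, tags, max_n=3):
    """Count word ("w") and tag ("t") n-grams up to length max_n."""
    feats = Counter()
    for n in range(1, max_n + 1):
        for gram in ngrams(words, n):
            feats[("w",) + gram] += 1
        for gram in ngrams(tags, n):
            feats[("t",) + gram] += 1
    return feats
```

For example, `sentence_features(["he", "go"], ["PRP", "VB"], max_n=2)` yields counts for features such as `("w", "he", "go")` and `("t", "PRP", "VB")`, which could then be passed to any binary classifier that accepts sparse count features.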

Finally, we present an annotation experiment in which a human annotator corrects and supplements the automatic annotation. The experiment confirms the high detection/correction accuracy of our system and shows that such a hybrid set-up gives higher-quality annotation with considerably less time and effort than fully manual annotation.

Full text

PDF (1.6 MB)

BibTeX record

@TechReport{UCAM-CL-TR-794,
  author =      {Andersen, {\O}istein E.},
  title =       {{Grammatical error prediction}},
  year =        2011,
  month =       jan,
  url =         {http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-794.pdf},
  institution = {University of Cambridge, Computer Laboratory},
  number =      {UCAM-CL-TR-794}
}