Department of Computer Science and Technology

Technical reports

Issues in preprocessing current datasets for grammatical error correction

Christopher Bryant, Mariano Felice

September 2016, 15 pages

DOI: 10.48456/tr-894

Abstract

In this report, we describe some of the issues encountered when preprocessing two of the largest datasets for Grammatical Error Correction (GEC), namely the public FCE corpus and NUCLE (along with the associated CoNLL test sets). In particular, we show that it is not straightforward to convert character-level annotations to token-level annotations, and that sentence segmentation is more complex when annotations change sentence boundaries. Both problems become even more complicated when multiple annotators are involved. We subsequently describe how we handle such cases and consider the pros and cons of different methods.
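To illustrate why converting character-level annotations to token-level annotations is not straightforward, the following is a minimal sketch and not the method used in the report. It assumes a simple whitespace tokenizer and hypothetical helper names (`token_spans`, `char_to_token_span`). It maps a character span to the smallest covering token span and flags whether the edit boundaries coincide with token boundaries.

```python
import re


def token_spans(text):
    """Return (start, end) character offsets of each whitespace-delimited token.

    Assumption: whitespace tokenization; a real GEC pipeline would use a
    proper tokenizer, which also splits punctuation.
    """
    return [m.span() for m in re.finditer(r"\S+", text)]


def char_to_token_span(spans, c_start, c_end):
    """Map the character span [c_start, c_end) to a token span.

    Returns (t_start, t_end, exact), where [t_start, t_end) is the smallest
    token span covering the characters and `exact` is True only if the
    character span lines up with token boundaries.
    """
    overlapping = [
        i for i, (s, e) in enumerate(spans) if s < c_end and e > c_start
    ]
    if not overlapping:
        # No token is touched: treat as an insertion point between tokens.
        idx = sum(1 for _, e in spans if e <= c_start)
        return idx, idx, c_start == c_end
    t_start, t_end = overlapping[0], overlapping[-1] + 1
    exact = spans[t_start][0] == c_start and spans[t_end - 1][1] == c_end
    return t_start, t_end, exact


text = "I need more informations"
spans = token_spans(text)

whole_word = char_to_token_span(spans, 2, 6)     # covers "need"
suffix_only = char_to_token_span(spans, 23, 24)  # covers only the final "s"
insertion = char_to_token_span(spans, 7, 7)      # empty span before "more"
```

In this sketch, an edit that targets only the final "s" of "informations" does not align with any token boundary, so it has to be expanded to the whole token. That changes the extent of the annotation, which is the kind of mismatch the abstract refers to.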

Full text

PDF (0.4 MB)

BibTeX record

@TechReport{UCAM-CL-TR-894,
  author =	 {Bryant, Christopher and Felice, Mariano},
  title = 	 {{Issues in preprocessing current datasets for grammatical
         	   error correction}},
  year = 	 2016,
  month = 	 sep,
  url = 	 {https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-894.pdf},
  institution =  {University of Cambridge, Computer Laboratory},
  doi = 	 {10.48456/tr-894},
  number = 	 {UCAM-CL-TR-894}
}