Technical reports
Issues in preprocessing current datasets for grammatical error correction
Christopher Bryant, Mariano Felice
September 2016, 15 pages
DOI: 10.48456/tr-894
Abstract
In this report, we describe some of the issues encountered when preprocessing two of the largest datasets for Grammatical Error Correction (GEC), namely the public FCE corpus and NUCLE (along with the associated CoNLL test sets). In particular, we show that converting character-level annotations to token-level annotations is not straightforward, and that sentence segmentation becomes more complex when annotations change sentence boundaries. Both problems become even more complicated when multiple annotators are involved. We then describe how we handle such cases and consider the pros and cons of different methods.
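As a rough illustration of the first issue named in the abstract, the sketch below maps a character-level span onto token indices and shows one way the mapping can fail. It is not the authors' method: the function names `token_spans` and `char_to_token_span` are hypothetical, tokens are assumed to be whitespace-delimited, and empty (insertion) spans are not handled.

```python
import re
from typing import List, Optional, Tuple


def token_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of whitespace-delimited tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_to_token_span(text: str, start: int, end: int) -> Optional[Tuple[int, int]]:
    """Map the character span [start, end) to a token span [i, j).

    Returns None when the character span does not line up with token
    boundaries, which is the case that needs special handling.
    """
    spans = token_spans(text)
    # Index of the token that begins at each character offset.
    starts = {s: i for i, (s, _) in enumerate(spans)}
    # Exclusive token index for each character offset where a token ends.
    ends = {e: i + 1 for i, (_, e) in enumerate(spans)}
    if start in starts and end in ends and starts[start] < ends[end]:
        return (starts[start], ends[end])
    return None


if __name__ == "__main__":
    sentence = "I like it alot ."
    # The annotation covers the whole token "alot": a clean mapping exists.
    print(char_to_token_span(sentence, 10, 14))
    # The annotation covers only "lot" inside "alot": no token-level span.
    print(char_to_token_span(sentence, 11, 14))
```

In this sketch a misaligned span simply yields `None`; a real preprocessing pipeline would have to decide whether to expand such a span to the enclosing token, retokenize the text, or discard the annotation.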
Full text
PDF (0.4 MB)
BibTeX record
@TechReport{UCAM-CL-TR-894,
  author = {Bryant, Christopher and Felice, Mariano},
  title = {{Issues in preprocessing current datasets for grammatical error correction}},
  year = 2016,
  month = sep,
  url = {https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-894.pdf},
  institution = {University of Cambridge, Computer Laboratory},
  doi = {10.48456/tr-894},
  number = {UCAM-CL-TR-894}
}