Building Educational Applications 2019 Shared Task:
Grammatical Error Correction

Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in text; e.g. [I follows his advices -> I followed his advice]. It can be used to not only help language learners improve their writing skills, but also alert native speakers to accidental mistakes or typos.

GEC gained significant attention in the Helping Our Own (HOO) and Conference on Natural Language Learning (CoNLL) shared tasks between 2011 and 2014 (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013; Ng et al., 2014), but has since become more difficult to evaluate given a lack of standardised experimental settings. In particular, recent systems have been trained, tuned and tested on different combinations of corpora using different metrics (Yannakoudakis et al., 2017; Chollampatt and Ng, 2018a; Ge et al., 2018; Grundkiewicz and Junczys-Dowmunt, 2018). One of the aims of this shared task is hence to once again provide a platform where different approaches can be trained and tested under the same conditions.

Another significant problem facing the field is that system performance is still primarily benchmarked against the CoNLL-2014 test set, even though this 5-year-old dataset only contains 50 essays on 2 different topics written by 25 South-East Asian undergraduates in Singapore. This means that systems have increasingly overfit to a very specific genre of English and so do not generalise well to other domains. The shared task hence also introduces a new dataset that represents a much more diverse cross-section of English language levels and domains.

Task Instructions

Participants should first join the BEA 2019 Shared Task Discussion Group. This group is the best place to ask questions and keep up-to-date with shared task news and updates. If you do not join this group, you may miss out on important shared task news.

The aim of the shared task is to correct all types of errors in written text. This includes grammatical, lexical and orthographical errors. Participants will be provided with plain text files as input, one tokenised sentence per line, and are expected to produce equivalent corrected text files as output. For example:

Input:  Travel by bus is exspensive , bored and annoying .
Output: Travelling by bus is expensive , boring and annoying .

All text was tokenised using spaCy v1.9.0 and the en_core_web_sm-1.2.0 model.
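
If system output needs to be re-tokenised to match this format, something like the following sketch can be used. Note that this is only an illustrative example assuming a current spaCy release and model; the official data was produced with the older v1.9.0 release, so token boundaries may occasionally differ.

# Illustrative tokenisation sketch (assumes a current spaCy install;
# the official data was tokenised with spaCy v1.9.0 and en_core_web_sm-1.2.0,
# so boundaries may differ slightly in edge cases).
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenise(text):
    # Join tokens with single spaces, one sentence per line.
    return " ".join(token.text for token in nlp(text))

print(tokenise("Travel by bus is exspensive, bored and annoying."))
# Travel by bus is exspensive , bored and annoying .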

In order to evaluate system performance, system output must first be automatically annotated with the ERRor ANnotation Toolkit (ERRANT) (Bryant et al., 2017):

python3 parallel_to_m2.py -orig <input_file> -cor <corrected_file> -out <corrected_m2>

This will produce an annotated M2 file that can be used in the ERRANT evaluation script:

python3 compare_m2.py -hyp <corrected_m2> -ref <reference_m2>

This will print system results to the screen. The script also has additional options to carry out a more fine-grained analysis of system performance. See the Data and Evaluation sections of this website for more information on M2 format and evaluation in general.

Data

One of the key contributions of this shared task is the introduction of new annotated datasets: the Cambridge English Write & Improve (W&I) corpus and the LOCNESS corpus.

Cambridge English Write & Improve

Write & Improve is an online platform that assists non-native English students with their writing. Specifically, students from around the world submit letters, stories, articles and essays in response to various prompts, and the W&I system provides instant feedback. Since W&I went live in 2014, W&I annotators have manually annotated some of these submissions and assigned each a CEFR level.

LOCNESS

The LOCNESS corpus consists of essays written by native English students. It was originally compiled by researchers at the Centre for English Corpus Linguistics at the University of Louvain. Since native English students also sometimes make mistakes, we asked the W&I annotators to annotate a subsection of LOCNESS so researchers can test the effectiveness of their systems on the full range of English levels and abilities.

Corpora Statistics

We release 3,600 annotated submissions to W&I across 3 different CEFR levels: A (beginner), B (intermediate), C (advanced). We also release 100 annotated native (N) essays from LOCNESS.

We attempted to balance the corpora such that there is a roughly even distribution of sentences at different levels across each of the training, development and test sets. Due to time constraints, we are unable to release a native training set from LOCNESS. An overview of the data is shown in the following table:

                          A          B          C          N      Total
Train   Texts         1,300      1,000        700          -      3,000
        Sentences    10,485     13,021     10,771          -     34,277
        Tokens      183,406    237,871    206,748          -    628,025
Dev     Texts           130        100         70         50        350
        Sentences     1,036      1,285      1,068        998      4,377
        Tokens       18,665     23,698     21,423     23,117     86,903
Test    Texts           130        100         70         50        350
        Sentences     1,107      1,330      1,008      1,030      4,475
        Tokens       18,872     23,636     19,937     23,143     85,588
Total   Texts         1,560      1,200        840        100      3,700
        Sentences    12,628     15,636     12,847      2,018     43,129
        Tokens      220,943    285,205    248,108     46,260    800,516

Other Corpora and Download Links

To increase the amount of annotated data available to participants, we also allow the use of several other learner corpora in the main restricted track of the shared task. Since these corpora were previously only available in different formats, we make new standardised versions available with the shared task.

NOTE: While the FCE and W&I+LOCNESS corpora are immediately downloadable, participants must fill out online forms to obtain the Lang-8 Corpus of Learner English and NUCLE. After filling out the forms, a link to Lang-8 will be emailed to you immediately, while NUCLE will be sent within one day. All corpora are subject to similar licences and may only be used for non-commercial purposes.

All these corpora have been standardised with ERRANT. This means they have all been converted to M2 format and automatically annotated with ERRANT error types. This is significant because NUCLE and the FCE were previously annotated according to different, incompatible error type frameworks, while Lang-8 and W&I+LOCNESS were not annotated with error types at all. In fact, Lang-8 is also the only corpus that does not originally contain any explicit human annotations, and so these were extracted automatically by ERRANT.

An additional complication with Lang-8 is that it is the only training/development corpus that contains multiple sets of annotations for certain sentences. This means you should not apply all the edits in an M2 sentence block to generate the corrected sentence, as edits from multiple annotators can interact. We thus provide a script to safely generate the corrected text for a specific annotator from an M2 file: corr_from_m2.py

M2 Format

All the above corpora have been made available in M2 format, the standard format for annotated GEC files since the CoNLL-2013 shared task.

S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0
A 3 3|||M:ADJ|||good|||-REQUIRED-|||NONE|||0
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||1
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||2

In M2 format, a line preceded by S denotes an original sentence, while a line preceded by A indicates an edit annotation. Each edit line consists of the start and end token offsets of the edit, the error type, and the tokenised correction string. The next two fields are included for historical reasons and can be ignored (see the CoNLL-2013 shared task), while the last field is the annotator id.

A "noop" edit is a special kind of edit that explicitly indicates an annotator/system made no changes to the original sentence. If there is only one annotator, noop edits are optional, otherwise a noop edit should be included whenever at least 1 out of n annotators considered the original sentence to be correct. This is something to be aware of when combining individual M2 files, as missing noops can affect results.

The above example can hence be interpreted as follows:
Annotator 0 changed "are" to "is" and inserted "good" before "sentence" to produce the correction: This is a good sentence .
Annotator 1 changed "are" to "is" to produce the correction: This is a sentence .
Annotator 2 thought the original was correct and made no changes to the sentence: This are a sentence .
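
To make the format concrete, the following minimal sketch (not the official corr_from_m2.py script mentioned above) parses this example block and reconstructs the correction of a chosen annotator. Edits are applied from right to left so that earlier token offsets remain valid, and noop edits are skipped:

def apply_m2_block(block, annotator_id=0):
    # block: an "S ..." line followed by zero or more "A ..." edit lines.
    lines = block.strip().split("\n")
    tokens = lines[0][2:].split()            # drop the leading "S "
    edits = []
    for line in lines[1:]:
        fields = line[2:].split("|||")       # drop the leading "A "
        start, end = map(int, fields[0].split())
        correction = fields[2]
        annotator = int(fields[-1])
        if annotator != annotator_id or start == -1:   # other annotators / noop
            continue
        edits.append((start, end, correction))
    # Apply edits right-to-left so earlier offsets are not shifted.
    for start, end, correction in sorted(edits, reverse=True):
        cor_toks = [] if correction in ("", "-NONE-") else correction.split()
        tokens[start:end] = cor_toks
    return " ".join(tokens)

block = """S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0
A 3 3|||M:ADJ|||good|||-REQUIRED-|||NONE|||0
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||1
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||2"""

print(apply_m2_block(block, 0))   # This is a good sentence .
print(apply_m2_block(block, 1))   # This is a sentence .
print(apply_m2_block(block, 2))   # This are a sentence .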

Tracks

There are 3 tracks in the BEA 2019 shared task. Each track controls the amount of annotated learner data that can be used in a system.

In all tracks, we place no restrictions on the amount of unannotated data (e.g. for language modelling) or other NLP tools (e.g. POS taggers, parsers, spellcheckers, etc.) that can be used, as long as the resource is publicly available.

  1. Restricted Track

     In the restricted track, participants may only use the following annotated learner datasets: the FCE, the Lang-8 Corpus of Learner English, NUCLE and W&I+LOCNESS (see Other Corpora and Download Links above). Note that we restrict participants to the preprocessed Lang-8 Corpus of Learner English rather than the raw, multilingual Lang-8 Learner Corpus because participants would otherwise need to filter the raw corpus themselves. We also do not allow the use of the CoNLL 2013/2014 shared task test sets in this track.

  2. Unrestricted Track

     In the unrestricted track, participants may use anything and everything to build their systems. This includes proprietary datasets and software.

  3. Low Resource Track (formerly Unsupervised Track)

     In the low resource track, participants may only use the following learner dataset: the W&I+LOCNESS development set.

     Since current state-of-the-art systems rely on as much annotated learner data as possible to reach the best performance, the goal of the low resource track is to encourage research into systems that do not rely on large amounts of learner data. This track should be of particular interest to researchers working on GEC for languages in which large learner corpora do not exist.

     Since we expect this to be a challenging track, we nevertheless allow participants to use the W&I+LOCNESS development set to develop their systems, with no restriction on how it may be used. The only difference between the low resource track and the restricted track is thus the amount of annotated learner data that can be used.

Evaluation

Systems will be evaluated using the ERRANT scorer, an improved version of the MaxMatch scorer (Dahlmeier and Ng, 2012) originally used in the CoNLL shared tasks. As in the previous shared tasks, this means system performance will primarily be measured in terms of span-based correction using the F0.5 metric, which weights precision twice as much as recall.
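
For reference, the following sketch shows how F0.5 is computed from true positive (TP), false positive (FP) and false negative (FN) edit counts; the counts themselves come from comparing hypothesis edits against reference edits, and the example numbers below are purely illustrative.

def f_beta(tp, fp, fn, beta=0.5):
    # Precision and recall; treated here as 1.0 when undefined (no edits proposed/required).
    p = tp / (tp + fp) if tp + fp else 1.0
    r = tp / (tp + fn) if tp + fn else 1.0
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# e.g. 40 matched edits, 10 spurious edits, 60 missed edits:
print(round(f_beta(40, 10, 60), 4))   # P = 0.8, R = 0.4 -> F0.5 = 0.6667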

In span-based correction, a system edit is only rewarded if it exactly matches a reference edit in terms of both its token offsets and its correction string. The ERRANT scorer can additionally report performance in terms of span-based detection and token-based detection, which relax this requirement. The difference between these settings is shown in the following table:

Original:       I often look at TV
Reference:      [2, 4, watch]

                                 Span-based     Span-based     Token-based
                                 Correction     Detection      Detection
Hypothesis 1    [2, 4, watch]    Match          Match          Match
Hypothesis 2    [2, 4, see]      No match       Match          Match
Hypothesis 3    [2, 3, watch]    No match       No match       Match

Systems will primarily be evaluated in terms of their span-based correction F0.5 on the combined W&I+LOCNESS test set overall.

Note that although the W&I+LOCNESS training and development sets were made available as separate files for each CEFR level, the test set will not be provided in the same format. Instead, participants will be given a single plain text file that contains all the test sentences combined. This is because systems should not expect to know the CEFR level of an input text in advance and should hence be prepared to handle all levels and abilities.

We will nevertheless also report system performance broken down by CEFR and native level, as well as in terms of detection and error types.

Metric Justification

Evaluation is still a hot topic in GEC and no method is perfect. The main advantage of the ERRANT scorer over the MaxMatch scorer and other evaluation methods is that it can provide much more detailed feedback about system performance, including scores for error detection, error correction, and error types. We hope that participants will be able to make use of this information to build better systems.

We also evaluated ERRANT against human judgements using the same setup as Chollampatt and Ng (2018b), and found that its correlation coefficients are comparable to those of other metrics:

                     Corpus                    Sentence
Metric         Pearson r     Spearman ρ        Kendall τ
ERRANT         0.64          0.626             0.623
M2             0.623         0.687             0.617
GLEU           0.691         0.407             0.567
I-measure     -0.25         -0.385             0.564

Important Dates

Date                        Event
Friday, Jan 25, 2019        New training data released
Monday, March 25, 2019      New test data released
Friday, March 29, 2019      System output submission deadline
Friday, April 12, 2019      System results announced
Friday, May 3, 2019         System paper submission deadline
Friday, May 17, 2019        Review deadline
Friday, May 24, 2019        Notification of acceptance
Friday, August 2, 2019      BEA-2019 Workshop (Florence, Italy)

Organisers

Christopher Bryant, University of Cambridge
Mariano Felice, University of Cambridge
Øistein Andersen, University of Cambridge
Ted Briscoe, University of Cambridge

Contact

The best place to ask questions and keep up-to-date with shared task news is the BEA 2019 Shared Task Discussion Group.
You can also contact the organisers directly at bea2019st@gmail.com.

References

Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 793–805.

Shamil Chollampatt and Hwee Tou Ng. 2018a. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.

Shamil Chollampatt and Hwee Tou Ng. 2018b. A reassessment of reference-based grammatical error correction metrics. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2730–2741.

Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568-572.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS corpus of learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 22–31.

Robert Dale and Adam Kilgarriff. 2011. Helping Our Own: The HOO 2011 pilot shared task. In Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, pages 242–249.

Robert Dale, Ilya Anisimoff, and George Narroway. 2012. Helping Our Own: HOO 2012: A report on the preposition and determiner error correction shared task. In Proceedings of the Seventh Workshop on Innovative Use of NLP for Building Educational Applications, pages 54–62.

Tao Ge, Furu Wei, and Ming Zhou. 2018. Fluency boost learning and inference for neural grammatical error correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1055–1065.

Roman Grundkiewicz and Marcin Junczys-Dowmunt. 2018. Near human-level performance in grammatical error correction with hybrid machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 284–290.

Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 147–155.

Toshikazu Tajiri, Mamoru Komachi, and Yuji Matsumoto. 2012. Tense and aspect error correction for ESL learners using global context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 198–202.

Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel R. Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 1–12.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189.

Helen Yannakoudakis, Marek Rei, Øistein E. Andersen, and Zheng Yuan. 2017. Neural sequence-labelling models for grammatical error correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2795–2806.

Helen Yannakoudakis, Øistein E. Andersen, Ardeshir Geranpayeh, Ted Briscoe, and Diane Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied Measurement in Education, 31(3):251–267.