Building Educational Applications 2019 Shared Task:
Grammatical Error Correction

News

07/07/2022

Codalab are phasing out the server on which the shared task was originally run. All existing Codalab competition tracks will stop accepting new submissions from 31/08/2022 and become read-only from 01/01/2023.

To enable researchers to continue testing on BEA-2019, we have created a new competition on Codalab's new server that represents all tracks:

Click here to evaluate on the BEA-2019 test set

Unfortunately, Codalab are not migrating any previous submissions or user accounts to the new server, so you must recreate them if you wish to continue evaluating on BEA-2019. We apologise for the inconvenience.

02/08/2019

A shared task overview paper containing a detailed description of the systems, data, results and analysis is now available. Please use the following citation when referring to the shared task and/or using the released datasets:

Christopher Bryant, Mariano Felice, Øistein E. Andersen and Ted Briscoe. 2019. The BEA-2019 Shared Task on Grammatical Error Correction. In Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA-2019), pp. 52–75, Florence, Italy, August. Association for Computational Linguistics.

System output produced by all participating teams is also available here.

Description

Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in text; e.g. [I follows his advices -> I followed his advice]. It can be used not only to help language learners improve their writing skills, but also to alert native speakers to accidental mistakes or typos.

GEC gained significant attention in the Helping Our Own (HOO) and Conference on Natural Language Learning (CoNLL) shared tasks between 2011 and 2014 (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013; Ng et al., 2014), but has since become more difficult to evaluate given a lack of standardised experimental settings. In particular, recent systems have been trained, tuned and tested on different combinations of corpora using different metrics (Yannakoudakis et al., 2017; Chollampatt and Ng, 2018a; Ge et al., 2018; Grundkiewicz and Junczys-Dowmunt, 2018). One of the aims of this shared task is hence to once again provide a platform where different approaches can be trained and tested under the same conditions.

Another significant problem facing the field is that system performance is still primarily benchmarked against the CoNLL-2014 test set, even though this 5-year-old dataset only contains 50 essays on 2 different topics written by 25 South-East Asian undergraduates in Singapore. This means that systems have increasingly overfit to a very specific genre of English and so do not generalise well to other domains. The shared task hence also introduces a new dataset that represents a much more diverse cross-section of English language levels and domains.

Instructions

Participants should first join the BEA 2019 Shared Task Discussion Group. This group is the best place to ask questions and keep up-to-date with shared task news and updates. If you do not join this group, you may miss out on important shared task news.

The aim of the shared task is to correct all types of errors in written text. This includes grammatical, lexical and orthographical errors. Participants will be provided with plain text files as input, one tokenised sentence per line, and are expected to produce equivalent corrected text files as output. For example:

Input:  Travel by bus is exspensive , bored and annoying .
Output: Travelling by bus is expensive , boring and annoying .

All text was tokenised using spaCy v1.9.0 and the en_core_web_sm-1.2.0 model.
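
If you need to tokenise your own text in the same way, the sketch below shows the general approach. It is only a rough guide: it uses the loading API of current spaCy releases and a helper function (tokenise) of our own, whereas the official data was produced with spaCy v1.9.0 and en_core_web_sm-1.2.0, so token boundaries may differ slightly.

import spacy

# Assumption: a recent spaCy release with en_core_web_sm installed; the
# original data used spaCy v1.9.0 and en_core_web_sm-1.2.0.
nlp = spacy.load("en_core_web_sm")

def tokenise(sentence: str) -> str:
    """Return the sentence as space-separated tokens, one sentence per line."""
    return " ".join(token.text for token in nlp(sentence))

print(tokenise("Travel by bus is exspensive, bored and annoying."))
# Travel by bus is exspensive , bored and annoying .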

Official evaluation will be carried out on the Codalab competition platform. There is a different Codalab competition for each track (see Tracks) and participants may submit to as many tracks as they wish. More detailed instructions are available on Codalab.

We encourage participants to join Codalab early so that they can become more familiar with the submission process before the official test phase. Participants may already use Codalab to submit output on the development set and receive detailed feedback about their systems using the official evaluation procedure.

If participants prefer to evaluate their systems themselves without Codalab, they must first automatically annotate their system output using the ERRor ANnotation Toolkit (ERRANT) (Bryant et al., 2017):

python3 parallel_to_m2.py -orig <input_file> -cor <corrected_file> -out <corrected_m2>

This will produce an annotated M2 file that can be used in the ERRANT evaluation script:

python3 compare_m2.py -hyp <corrected_m2> -ref <reference_m2>

This will print system results to the screen. The script also has additional options to carry out a more fine-grained analysis of system performance. See the Data and Evaluation sections of this website for more information on M2 format and evaluation in general.

Data

One of the key contributions of this shared task is the introduction of new annotated datasets: the Cambridge English Write & Improve (W&I) corpus and the LOCNESS corpus.

Cambridge English Write & Improve

Write & Improve (Yannakoudakis et al., 2018) is an online platform that assists non-native English students with their writing. Specifically, students from around the world submit letters, stories, articles and essays in response to various prompts, and the W&I system provides instant feedback. Since W&I went live in 2014, W&I annotators have manually annotated some of these submissions and assigned them a CEFR level.

LOCNESS

The LOCNESS corpus (Granger, 1998) consists of essays written by native English students. It was originally compiled by researchers at the Centre for English Corpus Linguistics at the University of Louvain. Since native English students also sometimes make mistakes, we asked the W&I annotators to annotate a subsection of LOCNESS so researchers can test the effectiveness of their systems on the full range of English levels and abilities.

Corpora Statistics

We release 3,600 annotated submissions to W&I across 3 different CEFR levels: A (beginner), B (intermediate), C (advanced). We also release 100 annotated native (N) essays from LOCNESS.

We attempted to balance the corpora such that there is a roughly even distribution of sentences at different levels across each of the training, development and test sets. Due to time constraints, we are unable to release a native training set from LOCNESS. An overview of the data is shown in the following table:

                        A          B          C          N      Total
Train   Texts       1,300      1,000        700          -      3,000
        Sentences  10,493     13,032     10,783          -     34,308
        Tokens    183,684    238,112    206,924          -    628,720
Dev     Texts         130        100         70         50        350
        Sentences   1,037      1,290      1,069        988      4,384
        Tokens     18,691     23,725     21,440     23,117     86,973
Test    Texts         130        100         70         50        350
        Sentences   1,107      1,330      1,010      1,030      4,477
        Tokens     18,905     23,667     19,953     23,143     85,668
Total   Texts       1,560      1,200        840        100      3,700
        Sentences  12,637     15,652     12,862      2,018     43,169
        Tokens    221,280    285,504    248,317     46,260    801,361

Other Corpora and Download Links

To increase the amount of annotated data available to participants, we also allow the use of several other learner corpora in the main restricted track of the shared task. Since these corpora were previously only available in different formats, we make new standardised versions available with the shared task.

NOTE: While the FCE and W&I+LOCNESS corpora are immediately downloadable, participants must fill out online forms to obtain the Lang-8 Corpus of Learner English and NUCLE. After filling out the forms, a link to Lang-8 will be emailed to you immediately, while NUCLE will be sent within one day. All corpora are subject to similar licences and may only be used for non-commercial purposes.

All these corpora have been standardised with ERRANT. This means they have all been converted to M2 format and annotated with ERRANT error types automatically. This is significant because NUCLE and the FCE were previously annotated according to different incompatible error type frameworks and Lang-8 and W&I+LOCNESS were not annotated with error types at all. In fact, Lang-8 is also the only corpus that does not originally contain any explicit human annotations, and so these were extracted automatically by ERRANT.

An additional complication with Lang-8 is that it is the only training/development corpus that contains multiple sets of annotations for certain sentences. This means you should not apply all the edits in an M2 sentence block to generate the corrected sentence, as edits from multiple annotators can interact. We thus provide a script to safely generate the corrected text for a specific annotator from an M2 file: corr_from_m2.py

M2 Format

All the above corpora have been made available in M2 format, the standard format for annotated GEC files since the CoNLL-2013 shared task. For example:

S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0
A 3 3|||M:ADJ|||good|||-REQUIRED-|||NONE|||0
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||1
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||2

In M2 format, a line preceded by S denotes an original sentence, while a line preceded by A indicates an edit annotation. Each edit line consists of the start and end token offsets of the edit, the error type, and the tokenised correction string. The next two fields are included for historical reasons and can be ignored (see the CoNLL-2013 shared task), while the last field is the annotator id.

A "noop" edit is a special kind of edit that explicitly indicates an annotator/system made no changes to the original sentence. If there is only one annotator, noop edits are optional, otherwise a noop edit should be included whenever at least 1 out of n annotators considered the original sentence to be correct. This is something to be aware of when combining individual M2 files, as missing noops can affect results.

The above example can hence be interpreted as follows:
Annotator 0 changed "are" to "is" and inserted "good" before "sentence" to produce the correction: This is a good sentence .
Annotator 1 changed "are" to "is" to produce the correction: This is a sentence .
Annotator 2 thought the original was correct and made no changes to the sentence: This are a sentence .
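
To make the edit-application logic concrete, here is a minimal sketch that recovers one annotator's corrected sentence from an M2 block. It is an illustrative reimplementation in the spirit of the provided corr_from_m2.py, not the script itself, and the function name apply_m2_block is our own.

def apply_m2_block(block, annotator=0):
    """Apply one annotator's edits from a single M2 sentence block.

    `block` is the list of lines for one sentence: an S line followed by
    zero or more A lines, as in the example above.
    """
    tokens = block[0][2:].split()                 # drop the leading "S "
    edits = []
    for line in block[1:]:
        span, etype, correction, _, _, ann_id = line[2:].split("|||")
        if int(ann_id) != annotator or etype == "noop":
            continue
        start, end = map(int, span.split())
        edits.append((start, end, correction.split()))
    # Apply edits right-to-left so earlier offsets are not shifted.
    for start, end, correction in sorted(edits, reverse=True):
        tokens[start:end] = correction
    return " ".join(tokens)

block = [
    "S This are a sentence .",
    "A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0",
    "A 3 3|||M:ADJ|||good|||-REQUIRED-|||NONE|||0",
    "A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||1",
    "A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||2",
]
print(apply_m2_block(block, annotator=0))   # This is a good sentence .
print(apply_m2_block(block, annotator=2))   # This are a sentence .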

Tracks

There are 3 tracks in the BEA 2019 shared task. Each track controls the amount of annotated learner data that can be used in a system.

In all tracks, we place no restrictions on the amount of unannotated data (e.g. for language modelling) or other NLP tools (e.g. POS taggers, parsers, spellcheckers, etc.) that can be used, as long as the resource is publicly available.

  1. Restricted Track

    In the restricted track, participants may only use the following learner datasets:

    • FCE (Yannakoudakis et al., 2011)
    • Lang-8 Corpus of Learner English (Mizumoto et al., 2011; Tajiri et al., 2012)
    • NUCLE (Dahlmeier et al., 2013)
    • W&I+LOCNESS (Bryant et al., 2019; Granger, 1998)

    Note that we restrict participants to the preprocessed Lang-8 Corpus of Learner English rather than the raw, multilingual Lang-8 Learner Corpus because participants would otherwise need to filter the raw corpus themselves. We also do not allow the use of the CoNLL 2013/2014 shared task test sets in this track.

  2. Unrestricted Track

    In the unrestricted track, participants may use anything and everything to build their systems. This includes proprietary datasets and software.

  3. Low Resource Track (formerly Unsupervised Track)

    In the low resource track, participants may only use the following learner dataset:

    • W&I+LOCNESS development set

    Since current state-of-the-art systems rely on as much annotated learner data as possible to reach the best performance, the goal of the low resource track is to encourage research into systems that do not rely on large amounts of learner data. This track should be of particular interest to researchers working on GEC for languages where large learner corpora do not exist.

    Since we expect this to be a challenging track, however, we will allow participants to use the W&I+LOCNESS development set to develop their systems. There is no restriction on how participants can use this development set. The only difference between the low resource track and the restricted track is the amount of annotated learner data that can be used.

Evaluation

Systems will be evaluated using the ERRANT scorer, an improved version of the MaxMatch scorer (Dahlmeier and Ng, 2012) originally used in the CoNLL shared tasks. As in the previous shared tasks, this means system performance will primarily be measured in terms of span-based correction using the F0.5 metric, which weights precision twice as much as recall.

In span-based correction, a system is only rewarded if a system edit exactly matches a reference edit in terms of both its token offsets and correction string. In contrast, the ERRANT scorer can also report performance in terms of span-based detection and token-based detection. The difference between these settings is shown in the following table:

                                      Span-based    Span-based    Token-based
                                      Correction    Detection     Detection
Original       I often look at TV
Reference      [2, 4, watch]
Hypothesis 1   [2, 4, watch]          Match         Match         Match
Hypothesis 2   [2, 4, see]            No match      Match         Match
Hypothesis 3   [2, 3, watch]          No match      No match      Match
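
The following is a simplified illustration of these three criteria over single edits represented as (start, end, correction) tuples. The function names are our own, and the real ERRANT scorer additionally handles multiple edits, error types and multiple annotators.

def span_correction_match(hyp, ref):
    return hyp == ref                            # same offsets and correction

def span_detection_match(hyp, ref):
    return hyp[:2] == ref[:2]                    # same offsets, any correction

def token_detection_match(hyp, ref):
    hyp_tokens = set(range(hyp[0], hyp[1]))
    ref_tokens = set(range(ref[0], ref[1]))
    return bool(hyp_tokens & ref_tokens)         # any overlapping token

reference = (2, 4, "watch")
for hypothesis in [(2, 4, "watch"), (2, 4, "see"), (2, 3, "watch")]:
    print(hypothesis,
          span_correction_match(hypothesis, reference),
          span_detection_match(hypothesis, reference),
          token_detection_match(hypothesis, reference))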

Systems will primarily be evaluated in terms of their span-based correction F0.5 on the combined W&I+LOCNESS gold test set overall.
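
For reference, the precision (P), recall (R) and F0.5 values reported in the Results section below follow directly from the TP/FP/FN edit counts. A minimal sketch (the helper p_r_f is our own, not part of ERRANT):

def p_r_f(tp, fp, fn, beta=0.5):
    """Precision, recall and F_beta from edit-level counts."""
    p = tp / (tp + fp) if tp + fp else 0.0       # guard against zero division
    r = tp / (tp + fn) if tp + fn else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

# Example: TP=3127, FP=1199, FN=2074 (the top restricted-track entry)
# reproduces its reported P=72.28, R=60.12, F0.5=69.47.
p, r, f = p_r_f(3127, 1199, 2074)
print(f"P={100*p:.2f}  R={100*r:.2f}  F0.5={100*f:.2f}")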

Note that although the W&I+LOCNESS training and development sets were made available as separate files for each CEFR level, the test set will not be provided in the same format. Instead, participants will be given a single plain text file that contains all the test sentences combined. This is because systems should not expect to know the CEFR level of an input text in advance and should hence be prepared to handle all levels and abilities.

We will nevertheless also report system performance in terms of different CEFR and native levels, as well as in terms of detection and error types.

Metric Justification

Evaluation is still a hot topic in GEC and no method is perfect. The main advantage of the ERRANT scorer over the MaxMatch scorer and other evaluation methods is that it can provide much more detailed feedback about system performance, including scores for error detection, error correction, and error types. We hope that participants will be able to make use of this information to build better systems.

We also evaluated ERRANT in relation to human judgements using the same setup as Chollampatt and Ng (2018b), and found similar correlation coefficients to other metrics.

              Corpus                      Sentence
Metric        Pearson r     Spearman ρ    Kendall τ
ERRANT        0.64          0.626         0.623
M2            0.623         0.687         0.617
GLEU          0.691         0.407         0.567
I-measure     -0.25         -0.385        0.564

Results

All results are reported in terms of span-based correction. You can download the system output used to generate these results here.

  1. Restricted Track

    Rank User Team Name TP FP FN P R F0.5 Detailed Results
    1 romang UEDIN-MS 3127 1199 2074 72.28 60.12 69.47 View
    2 yjchoe33 Kakao&Brain 2709 894 2510 75.19 51.91 69.00 View
    3 goo2go LAIX 2618 960 2671 73.17 49.50 66.78 View
    4 HelenY CAMB-CLED 2924 1224 2386 70.49 55.07 66.75 View
    5 seanxu1015 Shuyao 2926 1244 2357 70.17 55.39 66.61 View
    6 BUAA_WJP YDGEC 2815 1205 2487 70.02 53.09 65.83 View
    7 awasthiabhijeet05 ML@IITB 3678 1920 2340 65.70 61.12 64.73 View
    8 fs439 CAMB-CUED 2929 1459 2502 66.75 53.93 63.72 View
    9 tomoyamizumoto AIP-Tohoku 1972 902 2705 68.62 42.16 60.97 View
    10 arahusky UFAL, Charles University, Prague 1941 942 2867 67.33 40.37 59.39 View
    11 liuwangwang CVTE-NLP 1739 811 2744 68.20 38.79 59.22 View
    12 hsamswcc BLCU 2554 1646 2432 60.81 51.22 58.62 View
    13 yoav_kantor IBM Research AI - HRL 1819 1044 3047 63.53 37.38 55.74 View
    14 Masahiro TMU 2720 2325 2546 53.91 51.65 53.45 View
    15 qiuwenbo 1428 854 2968 62.58 32.48 52.80 View
    16 cehinson NLG NTU 1833 1873 2939 49.46 38.41 46.77 View
    17 apurva.nagvenkar CAI 2002 2168 2759 48.01 42.05 46.69 View
    18 davidzhao PKU 1401 1265 2955 52.55 32.16 46.64 View
    19 SolomonLab SolomonLab 1760 2161 2678 44.89 39.66 43.73 View
    20 mengyang Buffalo 604 350 3311 63.31 15.43 39.06 View
    21 nihalnayak Ramaiah 829 7656 3516 9.77 19.08 10.83 View
  2. Unrestricted Track

    Rank User Team Name TP FP FN P R F0.5 Detailed Results
    1 goo2go LAIX 2618 960 2671 73.17 49.50 66.78 View
    2 tomoyamizumoto AIP-Tohoku 2589 1078 2484 70.60 51.03 65.57 View
    3 arahusky UFAL, Charles University, Prague 2812 1313 2469 68.17 53.25 64.55 View
    4 hsamswcc BLCU 3051 2007 2357 60.32 56.42 59.50 View
    5 gurunathp Aparecium 1585 1077 2787 59.54 36.25 52.76 View
    6 mengyang Buffalo 699 374 3265 65.14 17.63 42.33 View
    7 nihalnayak Ramaiah 1161 8062 3480 12.59 25.02 13.98 View
  3. Low Resource Track

    Rank User Team Name TP FP FN P R F0.5 Detailed Results
    1 romang UEDIN-MS 2312 982 2506 70.19 47.99 64.24 View
    2 JiyeonHam Kakao&Brain 2412 1413 2797 63.06 46.30 58.80 View
    3 goo2go LAIX 1443 884 3175 62.01 31.25 51.81 View
    4 fs439 CAMB-CUED 1814 1450 2956 55.58 38.03 50.88 View
    5 arahusky UFAL, Charles University, Prague 1245 1222 2993 50.47 29.38 44.13 View
    6 simonHFL Siteimprove 1299 1619 3199 44.52 28.88 40.17 View
    7 Bohdan_Didenk WebSpellChecker.com 2363 3719 3031 38.85 43.81 39.75 View
    8 Satoru TMU 1638 4314 3486 27.52 31.97 28.31 View
    9 mengyang Buffalo 446 1243 3556 26.41 11.14 20.73 View

Paper Submissions

Participants are required to submit a paper describing their submissions to the shared task. Papers must adhere to the ACL Submission Guidelines. Authors are invited to submit a full paper of up to eight (8) pages of content, plus unlimited references; final versions of long papers will be given one additional page of content (up to nine (9) pages) so that reviewers’ comments can be taken into account. We also invite short papers of up to four (4) pages of content, plus unlimited references. Upon acceptance, short papers will be given five (5) content pages in the proceedings. Authors are encouraged to use this additional page to address reviewers’ comments in their final versions.

Previously published papers cannot be accepted. The submissions will be reviewed by the program committee. As reviewing will be blind, please ensure that papers are anonymous. Self-references that reveal the author’s identity, e.g., “We previously showed (Smith, 1991) …”, should be avoided. Instead, use citations such as “Smith previously showed (Smith, 1991) ...”. We are aware that participants may still be identifiable from their scores, but this is permitted in the context of the shared task.

Please submit your paper to the "GEC Shared Task" track of the BEA Workshop using the following link: https://www.softconf.com/acl2019/bea/.

Important Dates

Date                        Event
Friday, Jan 25, 2019        New training data released
Monday, March 25, 2019      New test data released
Friday, March 29, 2019      System output submission deadline
Friday, April 5, 2019       System results announced
Friday, May 3, 2019         System paper submission deadline
Friday, May 24, 2019        Review deadline
Monday, June 3, 2019        Camera-ready submission deadline
Friday, August 2, 2019      BEA-2019 Workshop (Florence, Italy)

Note: All deadlines are 23:59 UTC to be consistent with Codalab.

Organisers

Christopher Bryant, University of Cambridge
Mariano Felice, University of Cambridge
Øistein E. Andersen, University of Cambridge
Ted Briscoe, University of Cambridge

Contact

Discussion forum: BEA 2019 Shared Task Discussion Group.
Email: bea2019st@gmail.com.

References

Christopher Bryant, Mariano Felice, Øistein E. Andersen and Ted Briscoe. 2019. The BEA-2019 Shared Task on Grammatical Error Correction. In Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications, pages 52–75, Florence, Italy, August. Association for Computational Linguistics.

Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 793–805.

Shamil Chollampatt and Hwee Tou Ng. 2018a. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.

Shamil Chollampatt and Hwee Tou Ng. 2018b. A reassessment of reference-based grammatical error correction metrics. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2730–2741.

Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–572.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS corpus of learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 22–31.

Robert Dale and Adam Kilgarriff. 2011. Helping Our Own: The HOO 2011 pilot shared task. In Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, pages 242–249.

Robert Dale, Ilya Anisimoff, and George Narroway. 2012. Helping Our Own: HOO 2012: A report on the preposition and determiner error correction shared task. In Proceedings of the Seventh Workshop on Innovative Use of NLP for Building Educational Applications, pages 54–62.

Tao Ge, Furu Wei, and Ming Zhou. 2018. Fluency boost learning and inference for neural grammatical error correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1055–1065.

Sylviane Granger. 1998. The computer learner corpus: A versatile new source of data for SLA research. In Sylviane Granger, editor, Learner English on Computer, pages 3–18. Addison Wesley Longman, London and New York.

Roman Grundkiewicz and Marcin Junczys-Dowmunt. 2018. Near human-level performance in grammatical error correction with hybrid machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 284–290.

Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto. 2011. Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pages 147–155.

Toshikazu Tajiri, Mamoru Komachi, and Yuji Matsumoto. 2012. Tense and Aspect Error Correction for ESL Learners Using Global Context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 198–202.

Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel R. Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 1–12.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189.

Helen Yannakoudakis, Marek Rei, Øistein E. Andersen, and Zheng Yuan. 2017. Neural sequence-labelling models for grammatical error correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2795–2806.

Helen Yannakoudakis, Øistein E. Andersen, Ardeshir Geranpayeh, Ted Briscoe and Diane Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied Measurement in Education, 31(3), pages 251–267.