Building Educational Applications 2019 Shared Task:
Grammatical Error Correction

Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in text; e.g. [I follows his advices -> I followed his advice]. It can be used to not only help language learners improve their writing skills, but also alert native speakers to accidental mistakes or typos.

GEC gained significant attention in the Helping Our Own (HOO) and Conference on Natural Language Learning (CoNLL) shared tasks between 2011 and 2014 (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013; Ng et al., 2014), but has since become more difficult to evaluate given a lack of standardised experimental settings. In particular, recent systems have been trained, tuned and tested on different combinations of corpora using different metrics (Yannakoudakis et al., 2017; Chollampatt and Ng, 2018a; Ge et al., 2018; Grundkiewicz and Junczys-Dowmunt, 2018). One of the aims of this shared task is hence to once again provide a platform where different approaches can be trained and tested under the same conditions.

Another significant problem facing the field is that system performance is still primarily benchmarked against the CoNLL-2014 test set, even though this 5-year-old dataset only contains 50 essays on 2 different topics written by 25 South-East Asian undergraduates in Singapore. This means that systems have increasingly overfit to a very specific genre of English and so do not generalise well to other domains. The shared task hence also introduces a new dataset that represents a much more diverse cross-section of English language levels and domains.

Task Instructions

Participants should first join the BEA 2019 Shared Task Discussion Group. This group is the best place to ask questions and keep up-to-date with shared task news and updates. If you do not join this group, you may miss out on important shared task news.

The aim of the shared task is to correct all types of errors in written text. This includes grammatical, lexical and orthographical errors. Participants will be provided with plain text files as input, one tokenised sentence per line, and are expected to produce equivalent corrected text files as output. For example:

Input:  Travel by bus is exspensive , bored and annoying .
Output: Travelling by bus is expensive , boring and annoying .

All text was tokenised using spaCy v1.9.0 and the en_core_web_sm-1.2.0 model.
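
If system output needs to be re-tokenised to match this format, something like the following sketch can be used. Note that this is only an illustrative example assuming a current spaCy release and model; the official data was produced with the older v1.9.0 release, so token boundaries may occasionally differ.

# Illustrative tokenisation sketch (assumes a current spaCy install;
# the official data was tokenised with spaCy v1.9.0 and en_core_web_sm-1.2.0,
# so boundaries may differ slightly in edge cases).
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenise(text):
    # Join tokens with single spaces, one sentence per line.
    return " ".join(token.text for token in nlp(text))

print(tokenise("Travel by bus is exspensive, bored and annoying."))
# Travel by bus is exspensive , bored and annoying .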

In order to evaluate system performance, system output must first be automatically annotated with the ERRor ANnotation Toolkit (ERRANT) (Bryant et al., 2017):

python3 parallel_to_m2.py -orig <input_file> -cor <corrected_file> -out <corrected_m2>

This will produce an annotated M2 file that can be used in the ERRANT evaluation script:

python3 compare_m2.py -hyp <corrected_m2> -ref <reference_m2>

This will print system results to the screen. The script also has additional options to carry out a more fine-grained analysis of system performance. See the Data and Evaluation sections of this website for more information on M2 format and evaluation in general.

Data

One of the key contributions of this shared task is the introduction of new annotated datasets: the Cambridge English Write & Improve (W&I) corpus and the LOCNESS corpus.

Cambridge English Write & Improve

Write & Improve is an online platform that assists non-native English students with their writing. Specifically, students from around the world submit letters, stories, articles and essays in response to various prompts, and the W&I system provides instant feedback. Since W&I went live in 2014, W&I annotators have manually annotated some of these submissions and assigned each a CEFR level.

LOCNESS

The LOCNESS corpus consists of essays written by native English students. It was originally compiled by researchers at the Centre for English Corpus Linguistics at the University of Louvain. Since native English students also sometimes make mistakes, we asked the W&I annotators to annotate a subsection of LOCNESS so researchers can test the effectiveness of their systems on the full range of English levels and abilities.

Corpora Statistics

We release 3,600 annotated submissions to W&I across 3 different CEFR levels: A (beginner), B (intermediate), C (advanced). We also release 100 annotated native (N) essays from LOCNESS.

We attempted to balance the corpora such that there is a roughly even distribution of sentences at different levels across each of the training, development and test sets. Due to time constraints, we are unable to release a native training set from LOCNESS. An overview of the data is shown in the following table:

                          A          B          C          N      Total
Train   Texts         1,300      1,000        700          -      3,000
        Sentences    10,485     13,021     10,771          -     34,277
        Tokens      183,406    237,871    206,748          -    628,025
Dev     Texts           130        100         70         50        350
        Sentences     1,036      1,285      1,068        998      4,377
        Tokens       18,665     23,698     21,423     23,117     86,903
Test    Texts           130        100         70         50        350
        Sentences     1,107      1,330      1,008      1,030      4,475
        Tokens       18,872     23,636     19,937     23,143     85,588
Total   Texts         1,560      1,200        840        100      3,700
        Sentences    12,628     15,636     12,847      2,018     43,129
        Tokens      220,943    285,205    248,108     46,260    800,516

Other Corpora and Download Links

To increase the amount of annotated data available to participants, we also allow the use of several other learner corpora in the main restricted track of the shared task. Since these corpora were previously only available in different formats, we make new standardised versions available with the shared task.

NOTE: While the FCE and W&I+LOCNESS corpora are immediately downloadable, participants must fill out online forms to obtain the Lang-8 Corpus of Learner English and NUCLE. After filling out the forms, a link to Lang-8 will be emailed to you immediately, while NUCLE will be sent within one day. All corpora are subject to similar licences and may only be used for non-commercial purposes.

All these corpora have been standardised with ERRANT. This means they have all been converted to M2 format and automatically annotated with ERRANT error types. This is significant because NUCLE and the FCE were previously annotated according to different, incompatible error type frameworks, while Lang-8 and W&I+LOCNESS were not annotated with error types at all. In fact, Lang-8 is also the only corpus that does not originally contain any explicit human annotations, and so these were extracted automatically by ERRANT.

An additional complication with Lang-8 is that it is the only training/development corpus that contains multiple sets of annotations for certain sentences. This means you should not apply all the edits in an M2 sentence block to generate the corrected sentence, as edits from multiple annotators can interact. We thus provide a script to safely generate the corrected text for a specific annotator from an M2 file: corr_from_m2.py

M2 Format

All the above corpora have been made available in M2 format, the standard format for annotated GEC files since the CoNLL-2013 shared task.

S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0
A 3 3|||M:ADJ|||good|||-REQUIRED-|||NONE|||0
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||1
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||2

In M2 format, a line preceded by S denotes an original sentence, while a line preceded by A indicates an edit annotation. Each edit line consists of the start and end token offsets of the edit, the error type, and the tokenised correction string. The next two fields are included for historical reasons and can be ignored (see the CoNLL-2013 shared task), while the last field is the annotator id.

A "noop" edit is a special kind of edit that explicitly indicates an annotator/system made no changes to the original sentence. If there is only one annotator, noop edits are optional, otherwise a noop edit should be included whenever at least 1 out of n annotators considered the original sentence to be correct. This is something to be aware of when combining individual M2 files, as missing noops can affect results.

The above example can hence be interpreted as follows:
Annotator 0 changed "are" to "is" and inserted "good" before "sentence" to produce the correction: This is a good sentence .
Annotator 1 changed "are" to "is" to produce the correction: This is a sentence .
Annotator 2 thought the original was correct and made no changes to the sentence: This are a sentence .
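
To make the format concrete, the following minimal sketch (not the official corr_from_m2.py script mentioned above) parses this example block and reconstructs the correction of a chosen annotator. Edits are applied from right to left so that earlier token offsets remain valid, and noop edits are skipped:

def apply_m2_block(block, annotator_id=0):
    # block: an "S ..." line followed by zero or more "A ..." edit lines.
    lines = block.strip().split("\n")
    tokens = lines[0][2:].split()            # drop the leading "S "
    edits = []
    for line in lines[1:]:
        fields = line[2:].split("|||")       # drop the leading "A "
        start, end = map(int, fields[0].split())
        correction = fields[2]
        annotator = int(fields[-1])
        if annotator != annotator_id or start == -1:   # other annotators / noop
            continue
        edits.append((start, end, correction))
    # Apply edits right-to-left so earlier offsets are not shifted.
    for start, end, correction in sorted(edits, reverse=True):
        cor_toks = [] if correction in ("", "-NONE-") else correction.split()
        tokens[start:end] = cor_toks
    return " ".join(tokens)

block = """S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0
A 3 3|||M:ADJ|||good|||-REQUIRED-|||NONE|||0
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||1
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||2"""

print(apply_m2_block(block, 0))   # This is a good sentence .
print(apply_m2_block(block, 1))   # This is a sentence .
print(apply_m2_block(block, 2))   # This are a sentence .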

Tracks

There are 3 tracks in the BEA 2019 shared task. Each track controls the amount of annotated learner data that can be used in a system.

In all tracks, we place no restrictions on the amount of unannotated data (e.g. for language modelling) or other NLP tools (e.g. POS taggers, parsers, spellcheckers, etc.) that can be used, as long as the resource is publicly available.

  1. Restricted Track

     In the restricted track, participants may only use the following annotated learner datasets: the FCE, the Lang-8 Corpus of Learner English, NUCLE and W&I+LOCNESS (see Other Corpora and Download Links above). Note that we restrict participants to the preprocessed Lang-8 Corpus of Learner English rather than the raw, multilingual Lang-8 Learner Corpus because participants would otherwise need to filter the raw corpus themselves. We also do not allow the use of the CoNLL 2013/2014 shared task test sets in this track.

  2. Unrestricted Track

     In the unrestricted track, participants may use anything and everything to build their systems. This includes proprietary datasets and software.

  3. Low Resource Track (formerly Unsupervised Track)

     In the low resource track, participants may only use the following learner dataset: the W&I+LOCNESS development set.

     Since current state-of-the-art systems rely on as much annotated learner data as possible to reach the best performance, the goal of the low resource track is to encourage research into systems that do not rely on large amounts of learner data. This track should be of particular interest to researchers working on GEC for languages in which large learner corpora do not exist.

     Since we expect this to be a challenging track, we nevertheless allow participants to use the W&I+LOCNESS development set to develop their systems, with no restriction on how it may be used. The only difference between the low resource track and the restricted track is thus the amount of annotated learner data that can be used.

Evaluation

Systems will be evaluated using the ERRANT scorer, an improved version of the MaxMatch scorer (Dahlmeier and Ng, 2012) originally used in the CoNLL shared tasks. As in the previous shared tasks, this means system performance will primarily be measured in terms of span-based correction using the F0.5 metric, which weights precision twice as much as recall.
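
For reference, the following sketch shows how F0.5 is computed from true positive (TP), false positive (FP) and false negative (FN) edit counts; the counts themselves come from comparing hypothesis edits against reference edits, and the example numbers below are purely illustrative.

def f_beta(tp, fp, fn, beta=0.5):
    # Precision and recall; treated here as 1.0 when undefined (no edits proposed/required).
    p = tp / (tp + fp) if tp + fp else 1.0
    r = tp / (tp + fn) if tp + fn else 1.0
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# e.g. 40 matched edits, 10 spurious edits, 60 missed edits:
print(round(f_beta(40, 10, 60), 4))   # P = 0.8, R = 0.4 -> F0.5 = 0.6667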

In span-based correction, a system edit is only rewarded if it exactly matches a reference edit in terms of both its token offsets and its correction string. The ERRANT scorer can additionally report performance in terms of span-based detection and token-based detection, which relax this requirement. The difference between these settings is shown in the following table:

Original:       I often look at TV
Reference:      [2, 4, watch]

                                 Span-based     Span-based     Token-based
                                 Correction     Detection      Detection
Hypothesis 1    [2, 4, watch]    Match          Match          Match
Hypothesis 2    [2, 4, see]      No match       Match          Match
Hypothesis 3    [2, 3, watch]    No match       No match       Match

Systems will primarily be evaluated in terms of their span-based correction F0.5 on the combined W&I+LOCNESS test set overall.

Note that although the W&I+LOCNESS training and development sets were made available as separate files for each CEFR level, the test set will not be provided in the same format. Instead, participants will be given a single plain text file that contains all the test sentences combined. This is because systems should not expect to know the CEFR level of an input text in advance and should hence be prepared to handle all levels and abilities.

We will nevertheless also report system performance broken down by CEFR and native level, as well as in terms of detection and error types.

Metric Justification

Evaluation is still a hot topic in GEC and no method is perfect. The main advantage of the ERRANT scorer over the MaxMatch scorer and other evaluation methods is that it can provide much more detailed feedback about system performance, including scores for error detection, error correction, and error types. We hope that participants will be able to make use of this information to build better systems.

We also evaluated ERRANT against human judgements using the same setup as Chollampatt and Ng (2018b), and found that its correlation coefficients are comparable to those of other metrics:

                     Corpus                    Sentence
Metric         Pearson r     Spearman ρ        Kendall τ
ERRANT         0.64          0.626             0.623
M2             0.623         0.687             0.617
GLEU           0.691         0.407             0.567
I-measure     -0.25         -0.385             0.564

Important Dates

Date                        Event
Friday, Jan 25, 2019        New training data released
Monday, March 25, 2019      New test data released
Friday, March 29, 2019      System output submission deadline
Friday, April 12, 2019      System results announced
Friday, May 3, 2019         System paper submission deadline
Friday, May 17, 2019        Review deadline
Friday, May 24, 2019        Notification of acceptance
Friday, August 2, 2019      BEA-2019 Workshop (Florence, Italy)

Organisers

Christopher Bryant, University of Cambridge
Mariano Felice, University of Cambridge
Øistein Andersen, University of Cambridge
Ted Briscoe, University of Cambridge

Contact

The best place to ask questions and keep up-to-date with shared task news is the BEA 2019 Shared Task Discussion Group.
You can also contact the organisers directly at bea2019st@gmail.com.

References

Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 793–805.

Shamil Chollampatt and Hwee Tou Ng. 2018a. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.

Shamil Chollampatt and Hwee Tou Ng. 2018b. A reassessment of reference-based grammatical error correction metrics. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2730–2741.

Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568-572.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS corpus of learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 22–31.

Robert Dale and Adam Kilgarriff. 2011. Helping Our Own: The HOO 2011 pilot shared task. In Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, pages 242–249.

Robert Dale, Ilya Anisimoff, and George Narroway. 2012. Helping Our Own: HOO 2012: A report on the preposition and determiner error correction shared task. In Proceedings of the Seventh Workshop on Innovative Use of NLP for Building Educational Applications, pages 54–62.

Tao Ge, Furu Wei, and Ming Zhou. 2018. Fluency boost learning and inference for neural grammatical error correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1055–1065.

Roman Grundkiewicz and Marcin Junczys-Dowmunt. 2018. Near human-level performance in grammatical error correction with hybrid machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 284–290.

Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 147–155.

Toshikazu Tajiri, Mamoru Komachi, and Yuji Matsumoto. 2012. Tense and aspect error correction for ESL learners using global context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 198–202.

Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel R. Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 1–12.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189.

Helen Yannakoudakis, Marek Rei, Øistein E. Andersen, and Zheng Yuan. 2017. Neural sequence-labelling models for grammatical error correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2795–2806.

Helen Yannakoudakis, Øistein E. Andersen, Ardeshir Geranpayeh, Ted Briscoe, and Diane Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied Measurement in Education, 31(3):251–267.