Building Educational Applications 2019 Shared Task:
Grammatical Error Correction
News
07/07/2022
Codalab are phasing out the server on which the shared task was originally run. All existing Codalab competition tracks will stop accepting new submissions from 31/08/2022 and become read-only from 01/01/2023.
To enable researchers to continue testing on BEA-2019, we have created a new competition on Codalab's new server that represents all tracks:
Click here to evaluate on the BEA-2019 test set
Unfortunately, Codalab are not migrating any previous submissions or user accounts to the new server, so you must recreate them if you wish to continue evaluating on BEA-2019. We apologise for the inconvenience.
02/08/2019
A shared task overview paper containing a detailed description of systems, data, results and analysis is now available. Please use the following citation when referring to the shared task and/or using the released datasets:
Christopher Bryant, Mariano Felice, Øistein E. Andersen and Ted Briscoe. 2019. The BEA-2019 Shared Task on Grammatical Error Correction. In Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA-2019), pp. 52–75, Florence, Italy, August. Association for Computational Linguistics.
System output produced by all participating teams is also available here.
Description
Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in text; e.g. [I follows his advices -> I followed his advice]. It can be used to not only help language learners improve their writing skills, but also alert native speakers to accidental mistakes or typos.
GEC gained significant attention in the Helping Our Own (HOO) and Conference on Natural Language Learning (CoNLL) shared tasks between 2011 and 2014 (Dale and Kilgarriff, 2011; Dale et al., 2012; Ng et al., 2013; Ng et al., 2014), but has since become more difficult to evaluate given a lack of standardised experimental settings. In particular, recent systems have been trained, tuned and tested on different combinations of corpora using different metrics (Yannakoudakis et al., 2017; Chollampatt and Ng, 2018a; Ge et al., 2018; Grundkiewicz and Junczys-Dowmunt, 2018). One of the aims of this shared task is hence to once again provide a platform where different approaches can be trained and tested under the same conditions.
Another significant problem facing the field is that system performance is still primarily benchmarked against the CoNLL-2014 test set, even though this 5-year-old dataset only contains 50 essays on 2 different topics written by 25 South-East Asian undergraduates in Singapore. This means that systems have increasingly overfit to a very specific genre of English and so do not generalise well to other domains. The shared task hence also introduces a new dataset that represents a much more diverse cross-section of English language levels and domains.
Instructions
Participants should first join the BEA 2019 Shared Task Discussion Group. This group is the best place to ask questions and keep up-to-date with shared task news and updates. If you do not join this group, you may miss out on important shared task news.
The aim of the shared task is to correct all types of errors in written text. This includes grammatical, lexical and orthographical errors. Participants will be provided with plain text files as input, one tokenised sentence per line, and are expected to produce equivalent corrected text files as output. For example:
| Input | Travel by bus is exspensive , bored and annoying . |
| Output | Travelling by bus is expensive , boring and annoying . |
All text was tokenised using spaCy v1.9.0 and the en_core_web_sm-1.2.0 model.
Official evaluation will be carried out on the Codalab competition platform. There is a different Codalab competition for each track (see Tracks) and participants may submit to as many tracks as they wish. More instructions are available on Codalab:
We encourage participants to join Codalab early so that they can familiarise themselves with the submission process before the official test phase. Participants can already use Codalab to submit output on the development set and receive detailed feedback about their systems under the official evaluation procedure.
If participants prefer to evaluate their systems themselves without Codalab, they must first automatically annotate their system output using the ERRor ANnotation Toolkit (ERRANT) (Bryant et al., 2017):
python3 parallel_to_m2.py -orig <input_file> -cor <corrected_file> -out <corrected_m2>
This will produce an annotated M2 file that can be used in the ERRANT evaluation script:
python3 compare_m2.py -hyp <corrected_m2> -ref <reference_m2>
This will print system results to the screen. The script also has additional options to carry out a more fine-grained analysis of system performance. See the Data and Evaluation sections of this website for more information on M2 format and evaluation in general.
Data
One of the key contributions of this shared task is the introduction of new annotated datasets: the Cambridge English Write & Improve (W&I) corpus and the LOCNESS corpus.
Cambridge English Write & Improve
Write & Improve (Yannakoudakis et al., 2018) is an online web platform that assists non-native English students with their writing. Specifically, students from around the world submit letters, stories, articles and essays in response to various prompts, and the W&I system provides instant feedback. Since W&I went live in 2014, W&I annotators have manually annotated some of these submissions and assigned them a CEFR level.
LOCNESS
The LOCNESS corpus (Granger, 1998) consists of essays written by native English students. It was originally compiled by researchers at the Centre for English Corpus Linguistics at the University of Louvain. Since native English students also sometimes make mistakes, we asked the W&I annotators to annotate a subsection of LOCNESS so researchers can test the effectiveness of their systems on the full range of English levels and abilities.
Corpora Statistics
We release 3,600 annotated submissions to W&I across 3 different CEFR levels: A (beginner), B (intermediate), C (advanced). We also release 100 annotated native (N) essays from LOCNESS.
We attempted to balance the corpora such that there is a roughly even distribution of sentences at different levels across each of the training, development and test sets. Due to time constraints, we are unable to release a native training set from LOCNESS. An overview of the data is shown in the following table:
| | | A | B | C | N | Total |
|---|---|---|---|---|---|---|
| Train | Texts | 1,300 | 1,000 | 700 | - | 3,000 |
| | Sentences | 10,493 | 13,032 | 10,783 | - | 34,308 |
| | Tokens | 183,684 | 238,112 | 206,924 | - | 628,720 |
| Dev | Texts | 130 | 100 | 70 | 50 | 350 |
| | Sentences | 1,037 | 1,290 | 1,069 | 998 | 4,384 |
| | Tokens | 18,691 | 23,725 | 21,440 | 23,117 | 86,973 |
| Test | Texts | 130 | 100 | 70 | 50 | 350 |
| | Sentences | 1,107 | 1,330 | 1,010 | 1,030 | 4,477 |
| | Tokens | 18,905 | 23,667 | 19,953 | 23,143 | 85,668 |
| Total | Texts | 1,560 | 1,200 | 840 | 100 | 3,700 |
| | Sentences | 12,637 | 15,652 | 12,862 | 2,018 | 43,169 |
| | Tokens | 221,280 | 285,504 | 248,317 | 46,260 | 801,361 |
Other Corpora and Download Links
To increase the amount of annotated data available to participants, we also allow the use of several other learner corpora in the main restricted track of the shared task. Since these corpora were previously only available in different formats, we make new standardised versions available with the shared task.
NOTE: While the FCE and W&I+LOCNESS corpora are immediately downloadable, participants must fill out online forms to obtain the Lang-8 Corpus of Learner English and NUCLE. After filling out the forms, a link to Lang-8 will be emailed to you immediately, while NUCLE will be sent within one day. All corpora are subject to similar licences and may only be used for non-commercial purposes.
- FCE v2.1: DOWNLOAD
  The First Certificate in English (FCE) corpus is a subset of the Cambridge Learner Corpus (CLC) that contains 1,244 written answers to FCE exam questions (Yannakoudakis et al., 2011).
- Lang-8 Corpus of Learner English: REQUEST
  Lang-8 is an online language learning website which encourages users to correct each other's grammar. The Lang-8 Corpus of Learner English is a somewhat-clean English subsection of this website (Mizumoto et al., 2011; Tajiri et al., 2012).
- NUCLE: REQUEST
  The National University of Singapore Corpus of Learner English (NUCLE) consists of 1,400 essays written by mainly Asian undergraduate students at the National University of Singapore (Dahlmeier et al., 2013).
- W&I+LOCNESS v2.1: DOWNLOAD
  Described at the start of this section (Bryant et al., 2019; Granger, 1998).
All of these corpora have been standardised with ERRANT: they have been converted to M2 format and automatically annotated with ERRANT error types. This is significant because NUCLE and the FCE were previously annotated according to different, incompatible error type frameworks, while Lang-8 and W&I+LOCNESS were not annotated with error types at all. Lang-8 is also the only corpus that does not originally contain explicit edit annotations, so these were extracted automatically by ERRANT.
An additional complication with Lang-8 is that it is the only training/development corpus that contains multiple sets of annotations for certain sentences. This means you should not apply all the edits in an M2 sentence block to generate the corrected sentence, as edits from multiple annotators can interact. We thus provide a script to safely generate the corrected text for a specific annotator from an M2 file: corr_from_m2.py
M2 Format
All the above corpora have been made available in M2 format, the standard format for annotated GEC files since the CoNLL-2013 shared task.
S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0
A 3 3|||M:ADJ|||good|||-REQUIRED-|||NONE|||0
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||1
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||2
In M2 format, a line preceded by S denotes an original sentence while a line preceded by A indicates an edit annotation. Each edit line consists of the start and end token offset of the edit, the error type, and the tokenized correction string. The next two fields are included for historical reasons and can be ignored (see the CoNLL-2013 shared task), while the last field is the annotator id.
A "noop" edit is a special kind of edit that explicitly indicates an annotator/system made no changes to the original sentence. If there is only one annotator, noop edits are optional, otherwise a noop edit should be included whenever at least 1 out of n annotators considered the original sentence to be correct. This is something to be aware of when combining individual M2 files, as missing noops can affect results.
The above example can hence be interpreted as follows:
Annotator 0 changed "are" to "is" and inserted "good" before "sentence" to produce the correction: This is a good sentence .
Annotator 1 changed "are" to "is" to produce the correction: This is a sentence .
Annotator 2 thought the original was correct and made no changes to the sentence: This are a sentence .
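The interpretation above can be reproduced with a minimal Python sketch that parses an M2 block and applies one annotator's edits. This is only an illustration of the format, not the official tooling (the supported script for this is corr_from_m2.py); note that edits are applied right-to-left so that token offsets remain valid, and noop edits are skipped:

```python
def apply_m2_edits(m2_block: str, annotator_id: int = 0) -> str:
    """Apply one annotator's edits from an M2 block to its source sentence."""
    lines = m2_block.strip().split("\n")
    tokens = lines[0][2:].split()  # strip the leading "S "
    edits = []
    for line in lines[1:]:
        fields = line[2:].split("|||")  # strip the leading "A "
        start, end = map(int, fields[0].split())
        error_type, correction, annotator = fields[1], fields[2], int(fields[5])
        if annotator == annotator_id and error_type != "noop":
            edits.append((start, end, correction))
    # Apply edits right-to-left so earlier token offsets stay valid.
    for start, end, correction in sorted(edits, reverse=True):
        tokens[start:end] = correction.split() if correction != "-NONE-" else []
    return " ".join(tokens)

m2 = """S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||0
A 3 3|||M:ADJ|||good|||-REQUIRED-|||NONE|||0
A 1 2|||R:VERB:SVA|||is|||-REQUIRED-|||NONE|||1
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||2"""

print(apply_m2_edits(m2, 0))  # This is a good sentence .
print(apply_m2_edits(m2, 1))  # This is a sentence .
print(apply_m2_edits(m2, 2))  # This are a sentence .
```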
Tracks
There are 3 tracks in the BEA 2019 shared task. Each track controls the amount of annotated learner data that can be used in a system.
In all tracks, we place no restrictions on the amount of unannotated data (e.g. for language modelling) or other NLP tools (e.g. POS taggers, parsers, spellcheckers, etc.) that can be used, as long as the resource is publicly available.
Restricted Track
In the restricted track, participants may only use the following learner datasets:
- FCE (Yannakoudakis et al., 2011)
- Lang-8 Corpus of Learner English (Mizumoto et al., 2011; Tajiri et al., 2012)
- NUCLE (Dahlmeier et al., 2013)
- W&I+LOCNESS (Bryant et al., 2019; Granger, 1998)
Note that we restrict participants to the preprocessed Lang-8 Corpus of Learner English rather than the raw, multilingual Lang-8 Learner Corpus because participants would otherwise need to filter the raw corpus themselves. We also do not allow the use of the CoNLL 2013/2014 shared task test sets in this track.
Unrestricted Track
In the unrestricted track, participants may use anything and everything to build their systems. This includes proprietary datasets and software.
Low Resource Track (formerly Unsupervised Track)
In the low resource track, participants may only use the following learner dataset:
- W&I+LOCNESS development set
Since current state-of-the-art systems rely on as much annotated learner data as possible to reach the best performance, the goal of the low resource track is to encourage research into systems that do not rely on large amounts of learner data. This track should be of particular interest to researchers working on GEC for languages where large learner corpora do not exist.
Since we nevertheless expect this to be a challenging track, we will allow participants to use the W&I+LOCNESS development set to develop their systems. There is no restriction on how participants may use this development set. The only difference between the low resource track and the restricted track is the amount of annotated learner data that can be used.
Evaluation
Systems will be evaluated using the ERRANT scorer, an improved version of the MaxMatch scorer (Dahlmeier and Ng, 2012) originally used in the CoNLL shared tasks. As in the previous shared tasks, this means system performance will primarily be measured in terms of span-based correction using the F0.5 metric, which weights precision twice as much as recall.
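The F0.5 computation from edit-level true positives, false positives and false negatives can be sketched as follows. This is a generic F-beta formula, not the ERRANT scorer itself; the example counts are taken from the restricted-track results table in the Results section:

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5):
    """Precision, recall and F_beta; beta < 1 weights precision more than recall."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

# UEDIN-MS restricted-track edit counts: TP=3127, FP=1199, FN=2074
p, r, f05 = f_beta(3127, 1199, 2074)
print(f"P={100*p:.2f} R={100*r:.2f} F0.5={100*f05:.2f}")  # P=72.28 R=60.12 F0.5=69.47
```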
In span-based correction, a system is only rewarded if a system edit exactly matches a reference edit in terms of both its token offsets and correction string. In contrast, the ERRANT scorer can also report performance in terms of span-based detection and token-based detection. The difference between these settings is shown in the following table:
| | Edit | Span-based Correction | Span-based Detection | Token-based Detection |
|---|---|---|---|---|
| Original | I often look at TV | | | |
| Reference | [2, 4, watch] | | | |
| Hypothesis 1 | [2, 4, watch] | Match | Match | Match |
| Hypothesis 2 | [2, 4, see] | No match | Match | Match |
| Hypothesis 3 | [2, 3, watch] | No match | No match | Match |
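The three matching criteria can be sketched in Python. This is a simplified illustration rather than ERRANT's actual matching logic; in particular, token-based detection is approximated here as span overlap, which ignores edge cases such as insertion edits with empty spans:

```python
def match(hyp, ref):
    """Compare a hypothesis edit to a reference edit under three criteria.

    Edits are (start, end, correction) tuples over token offsets.
    Simplified illustration only; not ERRANT's implementation.
    """
    h_start, h_end, h_cor = hyp
    r_start, r_end, r_cor = ref
    span_detection = (h_start, h_end) == (r_start, r_end)
    span_correction = span_detection and h_cor == r_cor
    # Token-based detection: the two edits flag at least one token in common.
    token_detection = max(h_start, r_start) < min(h_end, r_end)
    return span_correction, span_detection, token_detection

ref = (2, 4, "watch")                 # I often [look at -> watch] TV
print(match((2, 4, "watch"), ref))    # (True, True, True)
print(match((2, 4, "see"), ref))      # (False, True, True)
print(match((2, 3, "watch"), ref))    # (False, False, True)
```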
Systems will primarily be evaluated in terms of their span-based correction F0.5 on the combined W&I+LOCNESS gold test set overall.
Note that although the W&I+LOCNESS training and development sets were made available as separate files for each CEFR level, the test set will not be provided in the same format. Instead, participants will be given a single plain text file that contains all the test sentences combined. This is because systems should not expect to know the CEFR level of an input text in advance and should hence be prepared to handle all levels and abilities.
We will nevertheless also report system performance in terms of different CEFR and native levels, as well as in terms of detection and error types.
Metric Justification
Evaluation is still a hot topic in GEC and no method is perfect. The main advantage of the ERRANT scorer over the MaxMatch scorer and other evaluation methods is that it can provide much more detailed feedback about system performance, including scores for error detection, error correction, and error types. We hope that participants will be able to make use of this information to build better systems.
We also evaluated ERRANT in relation to human judgements using the same setup as Chollampatt and Ng (2018b), and found similar correlation coefficients to other metrics.
| Metric | Pearson r (corpus) | Spearman ρ (corpus) | Kendall τ (sentence) |
|---|---|---|---|
| ERRANT | 0.64 | 0.626 | 0.623 |
| M2 | 0.623 | 0.687 | 0.617 |
| GLEU | 0.691 | 0.407 | 0.567 |
| I-measure | -0.25 | -0.385 | 0.564 |
Results
All results are reported in terms of span-based correction. You can download the system output used to generate these results here.
- Restricted Track
- Unrestricted Track
- Low Resource Track
| Rank | User | Team Name | TP | FP | FN | P | R | F0.5 | Detailed Results |
|---|---|---|---|---|---|---|---|---|---|
| 1 | romang | UEDIN-MS | 3127 | 1199 | 2074 | 72.28 | 60.12 | 69.47 | View |
| 2 | yjchoe33 | Kakao&Brain | 2709 | 894 | 2510 | 75.19 | 51.91 | 69.00 | View |
| 3 | goo2go | LAIX | 2618 | 960 | 2671 | 73.17 | 49.50 | 66.78 | View |
| 4 | HelenY | CAMB-CLED | 2924 | 1224 | 2386 | 70.49 | 55.07 | 66.75 | View |
| 5 | seanxu1015 | Shuyao | 2926 | 1244 | 2357 | 70.17 | 55.39 | 66.61 | View |
| 6 | BUAA_WJP | YDGEC | 2815 | 1205 | 2487 | 70.02 | 53.09 | 65.83 | View |
| 7 | awasthiabhijeet05 | ML@IITB | 3678 | 1920 | 2340 | 65.70 | 61.12 | 64.73 | View |
| 8 | fs439 | CAMB-CUED | 2929 | 1459 | 2502 | 66.75 | 53.93 | 63.72 | View |
| 9 | tomoyamizumoto | AIP-Tohoku | 1972 | 902 | 2705 | 68.62 | 42.16 | 60.97 | View |
| 10 | arahusky | UFAL, Charles University, Prague | 1941 | 942 | 2867 | 67.33 | 40.37 | 59.39 | View |
| 11 | liuwangwang | CVTE-NLP | 1739 | 811 | 2744 | 68.20 | 38.79 | 59.22 | View |
| 12 | hsamswcc | BLCU | 2554 | 1646 | 2432 | 60.81 | 51.22 | 58.62 | View |
| 13 | yoav_kantor | IBM Research AI - HRL | 1819 | 1044 | 3047 | 63.53 | 37.38 | 55.74 | View |
| 14 | Masahiro | TMU | 2720 | 2325 | 2546 | 53.91 | 51.65 | 53.45 | View |
| 15 | qiuwenbo | | 1428 | 854 | 2968 | 62.58 | 32.48 | 52.80 | View |
| 16 | cehinson | NLG NTU | 1833 | 1873 | 2939 | 49.46 | 38.41 | 46.77 | View |
| 17 | apurva.nagvenkar | CAI | 2002 | 2168 | 2759 | 48.01 | 42.05 | 46.69 | View |
| 18 | davidzhao | PKU | 1401 | 1265 | 2955 | 52.55 | 32.16 | 46.64 | View |
| 19 | SolomonLab | SolomonLab | 1760 | 2161 | 2678 | 44.89 | 39.66 | 43.73 | View |
| 20 | mengyang | Buffalo | 604 | 350 | 3311 | 63.31 | 15.43 | 39.06 | View |
| 21 | nihalnayak | Ramaiah | 829 | 7656 | 3516 | 9.77 | 19.08 | 10.83 | View |
| Rank | User | Team Name | TP | FP | FN | P | R | F0.5 | Detailed Results |
|---|---|---|---|---|---|---|---|---|---|
| 1 | goo2go | LAIX | 2618 | 960 | 2671 | 73.17 | 49.50 | 66.78 | View |
| 2 | tomoyamizumoto | AIP-Tohoku | 2589 | 1078 | 2484 | 70.60 | 51.03 | 65.57 | View |
| 3 | arahusky | UFAL, Charles University, Prague | 2812 | 1313 | 2469 | 68.17 | 53.25 | 64.55 | View |
| 4 | hsamswcc | BLCU | 3051 | 2007 | 2357 | 60.32 | 56.42 | 59.50 | View |
| 5 | gurunathp | Aparecium | 1585 | 1077 | 2787 | 59.54 | 36.25 | 52.76 | View |
| 6 | mengyang | Buffalo | 699 | 374 | 3265 | 65.14 | 17.63 | 42.33 | View |
| 7 | nihalnayak | Ramaiah | 1161 | 8062 | 3480 | 12.59 | 25.02 | 13.98 | View |
| Rank | User | Team Name | TP | FP | FN | P | R | F0.5 | Detailed Results |
|---|---|---|---|---|---|---|---|---|---|
| 1 | romang | UEDIN-MS | 2312 | 982 | 2506 | 70.19 | 47.99 | 64.24 | View |
| 2 | JiyeonHam | Kakao&Brain | 2412 | 1413 | 2797 | 63.06 | 46.30 | 58.80 | View |
| 3 | goo2go | LAIX | 1443 | 884 | 3175 | 62.01 | 31.25 | 51.81 | View |
| 4 | fs439 | CAMB-CUED | 1814 | 1450 | 2956 | 55.58 | 38.03 | 50.88 | View |
| 5 | arahusky | UFAL, Charles University, Prague | 1245 | 1222 | 2993 | 50.47 | 29.38 | 44.13 | View |
| 6 | simonHFL | Siteimprove | 1299 | 1619 | 3199 | 44.52 | 28.88 | 40.17 | View |
| 7 | Bohdan_Didenk | WebSpellChecker.com | 2363 | 3719 | 3031 | 38.85 | 43.81 | 39.75 | View |
| 8 | Satoru | TMU | 1638 | 4314 | 3486 | 27.52 | 31.97 | 28.31 | View |
| 9 | mengyang | Buffalo | 446 | 1243 | 3556 | 26.41 | 11.14 | 20.73 | View |
Paper Submissions
Participants are required to submit a paper describing their submissions to the shared task. Papers must adhere to the ACL Submission Guidelines. Authors are invited to submit a full paper of up to eight (8) pages of content, plus unlimited references; final versions of long papers will be given one additional page of content (up to 9 pages) so that reviewers' comments can be taken into account. We also invite short papers of up to four (4) pages of content, plus unlimited references. Upon acceptance, short papers will be given five (5) content pages in the proceedings. Authors are encouraged to use this additional page to address reviewers' comments in their final versions.
Previously published papers cannot be accepted. The submissions will be reviewed by the program committee. As reviewing will be blind, please ensure that papers are anonymous. Self-references that reveal the author’s identity, e.g., “We previously showed (Smith, 1991) …”, should be avoided. Instead, use citations such as “Smith previously showed (Smith, 1991) ...”. We are aware that participants may still be identifiable from their scores, but this is permitted in the context of the shared task.
Please submit your paper to the "GEC Shared Task" track of the BEA Workshop using the following link: https://www.softconf.com/acl2019/bea/.
Important Dates
| Date | Event |
|---|---|
| Friday, Jan 25, 2019 | New training data released |
| Monday, March 25, 2019 | New test data released |
| Friday, March 29, 2019 | System output submission deadline |
| Friday, April 5, 2019 | System results announced |
| Friday, May 3, 2019 | System paper submission deadline |
| Friday, May 24, 2019 | Review deadline |
| Monday, June 3, 2019 | Camera-ready submission deadline |
| Friday, August 2, 2019 | BEA-2019 Workshop (Florence, Italy) |
Note: All deadlines are 23:59 UTC to be consistent with Codalab.
Organisers
Christopher Bryant, University of Cambridge
Mariano Felice, University of Cambridge
Øistein E. Andersen, University of Cambridge
Ted Briscoe, University of Cambridge
Contact
Discussion forum: BEA 2019 Shared Task Discussion Group.
Email: bea2019st@gmail.com.
References
Christopher Bryant, Mariano Felice, Øistein E. Andersen and Ted Briscoe. 2019. The BEA-2019 Shared Task on Grammatical Error Correction. In Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications, Florence, Italy, August. Association for Computational Linguistics.
Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 793–805.
Shamil Chollampatt and Hwee Tou Ng. 2018a. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.
Shamil Chollampatt and Hwee Tou Ng. 2018b. A reassessment of reference-based grammatical error correction metrics. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2730–2741.
Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568–572.
Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS corpus of learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 22–31.
Robert Dale and Adam Kilgarriff. 2011. Helping Our Own: The HOO 2011 pilot shared task. In Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, pages 242–249.
Robert Dale, Ilya Anisimoff, and George Narroway. 2012. Helping Our Own: HOO 2012: A report on the preposition and determiner error correction shared task. In Proceedings of the Seventh Workshop on Innovative Use of NLP for Building Educational Applications, pages 54–62.
Tao Ge, Furu Wei, and Ming Zhou. 2018. Fluency boost learning and inference for neural grammatical error correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1055–1065.
Sylviane Granger. 1998. The computer learner corpus: A versatile new source of data for SLA research. In Sylviane Granger, editor, Learner English on Computer, pages 3–18. Addison Wesley Longman, London and New York.
Roman Grundkiewicz and Marcin Junczys-Dowmunt. 2018. Near human-level performance in grammatical error correction with hybrid machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 284–290.
Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata and Yuji Matsumoto. 2011. Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pages 147–155.
Toshikazu Tajiri, Mamoru Komachi, and Yuji Matsumoto. 2012. Tense and Aspect Error Correction for ESL Learners Using Global Context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 198–202.
Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, and Joel R. Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, pages 1–12.
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–14.
Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 180–189.
Helen Yannakoudakis, Marek Rei, Øistein E. Andersen, and Zheng Yuan. 2017. Neural sequence-labelling models for grammatical error correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2795–2806.
Helen Yannakoudakis, Øistein E. Andersen, Ardeshir Geranpayeh, Ted Briscoe and Diane Nicholls. 2018. Developing an automated writing placement system for ESL learners. Applied Measurement in Education, 31:3, pages 251–267.