We released the FCE-CLC dataset in 2011, which contains over 50K errors of 80 different types produced by speakers of 12 different native languages taking the First Certificate in English (FCE) examination (Yannakoudakis et al.). This dataset has been used successfully to train and test a number of error correction and detection systems (Dale et al.), as well as a native language identification system (Kochmar), which detects the native language of a writer from their text, and an automated grading system (Yannakoudakis et al.), which assigns a grade to ESL text that correlates well with those assigned by human examiners.
There has been a great deal of academic and commercial interest in detecting and correcting errors in texts produced by speakers of English as a second (or other) language (ESL), e.g. Microsoft's ESL Assistant. Most research has focussed on learning classifiers for article errors (*a information is good) and preposition errors (*We sat at the sunshine) from well-formed English text, because there is plenty of the latter and these are two common error types (see Leacock et al. for a recent overview); a sketch of this approach is given below. However, recent work has shown that more accurate classifiers can be built from error-annotated ESL data (e.g. Rozovskaya & Roth).
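To make this concrete, here is a minimal sketch of the classifier-from-native-text approach for prepositions, written with scikit-learn; the preposition list, context-window features and three-sentence corpus are placeholder assumptions rather than details of any published system.

# Minimal sketch of a preposition-selection classifier trained on
# well-formed text (illustrative only; real systems use far richer
# features such as POS tags and parse relations).
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

PREPOSITIONS = {"in", "on", "at", "to", "for", "of", "with", "by"}

def instances(tokens):
    """Yield (features, label) for every preposition slot in a sentence."""
    for i, tok in enumerate(tokens):
        if tok in PREPOSITIONS:
            yield ({
                "w-2": tokens[i-2] if i >= 2 else "<s>",
                "w-1": tokens[i-1] if i >= 1 else "<s>",
                "w+1": tokens[i+1] if i+1 < len(tokens) else "</s>",
                "w+2": tokens[i+2] if i+2 < len(tokens) else "</s>",
            }, tok)

# A real system would train on millions of sentences of native text.
corpus = [
    "we sat in the sunshine".split(),
    "they waited at the station".split(),
    "she relied on the data".split(),
]
X, y = zip(*(inst for sent in corpus for inst in instances(sent)))
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

# At test time, flag a likely error when the classifier prefers a
# different preposition from the one the learner actually wrote.
feats, written = next(instances("we sat at the sunshine".split()))
predicted = clf.predict(vec.transform([feats]))[0]
if predicted != written:
    print(f"possible error: {written} -> {predicted}")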
For the most recent shared task on error detection and correction, the NUCLE error-coded dataset was released (Dahlmeier et al.) and the task was extended to include noun and verb form and agreement errors (see Ng et al.). The results confirmed that systems which make use of error-coded data but also train on large quantities of native text (such as the Google 5-gram corpus) perform best.
We also have access to the 40M-word Cambridge Learner Corpus (CLC), about half of which has been error coded, making it the largest such resource available worldwide to date, and to the 30M-word EFCamDAT corpus, which has partial error coding (http://www.ling.cam.ac.uk/ef-unit/corpus.html).
All these datasets have been automatically tokenised, part-of-speech tagged and parsed; alongside this linguistic annotation they contain error coding and metadata, such as the writer's native language and script grades, all represented in XML format.
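For example, the error coding in the released FCE data marks each error with its type, the original text and a correction. Below is a minimal sketch of extracting these triples; the element names used here (NS, i and c) are assumptions that should be checked against the dataset's documentation.

# Sketch of extracting error annotations from CLC-style XML, where each
# error is assumed to be marked as <NS type="..."><i>original</i><c>correction</c></NS>.
import xml.etree.ElementTree as ET

def extract_errors(xml_string):
    """Return (error_type, original, correction) triples from one script."""
    root = ET.fromstring(xml_string)
    for ns in root.iter("NS"):
        yield (ns.get("type"),
               ns.findtext("i", default=""),
               ns.findtext("c", default=""))

sample = '<p>We sat <NS type="RT"><i>at</i><c>in</c></NS> the sunshine.</p>'
for etype, wrong, right in extract_errors(sample):
    print(etype, wrong, "->", right)   # RT at -> in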
One or more projects can be undertaken using one or more of these datasets, possibly benchmarking performance against recent results on the shared tasks and building on the existing approaches described in the references. A project could focus on error detection, error correction, automated assessment, or native language identification. There has been far more work on errors than on native language identification or automated grading, but the latter two are challenging and potentially important tasks. Many error types annotated in the datasets have not yet been evaluated in shared tasks and have received little attention in the literature.
If you are interested in working out your own project, it would make sense to look at some of the references below, or at the references to our existing work linked from the ALTA Institute web page, and then talk to me.
Projects can be undertaken in most programming languages and will utilise machine learning toolkits (such as Mallet, MegaM, SVMlight, Weka, etc.). They will suit students taking modules L100 and L101.
A popular approach to error detection and correction (EDC) of ESL texts has been to build separate classifiers using bespoke features for specific error types; for example, subject-verb agreement errors can be detected via the grammatical relations output by a (robust) parser (*The boys obviously likes football, where the parser outputs subject(like+s, boy+s), suggesting the correction likes/like), whilst some verb form errors can be detected using n-grams (*sawed the accident vs. saw the accident, suggesting sawed/saw).
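A minimal sketch of the n-gram test follows, assuming access to a table of n-gram counts derived from native text (such as the Google 5-gram corpus mentioned above); the counts shown here are placeholders.

# Sketch of n-gram-based verb form checking: compare the corpus frequency
# of the trigram as written with the frequency after substituting a
# candidate verb form. The counts below are placeholders; a real system
# would look them up in a large native-text n-gram table.
NGRAM_COUNTS = {
    ("sawed", "the", "accident"): 0,
    ("saw", "the", "accident"): 1750,
}

def suggest_verb_form(trigram, alternative, threshold=100):
    """Propose the alternative verb form if it is much more frequent."""
    written = NGRAM_COUNTS.get(trigram, 0)
    replaced = NGRAM_COUNTS.get((alternative,) + trigram[1:], 0)
    if replaced > threshold * max(written, 1):
        return alternative
    return trigram[0]

print(suggest_verb_form(("sawed", "the", "accident"), "saw"))  # saw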
The disadvantage of treating errors independently is that they interact: the context for one error may include another error, and a series of independent corrections may itself be ungrammatical. Given *All boy are clever, a noun number classifier proposes All boys are clever while an agreement classifier proposes All boy is clever; applying both corrections yields *All boys is clever. An efficient and effective technique for combining the predictions of the independent classifiers is needed to improve on the state of the art in EDC.
Several approaches have been proposed which have led to some improvement in performance, including integer linear programming (Wu & Ng; Rozovskaya & Roth) and joint inference (Rozovskaya & Roth). An alternative, potentially simpler method is to apply a language model trained on well-formed native text to the combined output of all the classifiers and choose the most likely sentence. This approach is used effectively in Statistical Machine Translation (SMT), which has itself been applied to EDC with some success (Yuan & Felice). The advantage of the proposed new approach is that it retains bespoke features for distinct error types, which is not possible with standard SMT.
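Here is a minimal sketch of the language modelling approach using the example above: each classifier contributes a candidate edit, all combinations of edits are enumerated, and a model of native text scores the resulting sentences. The tiny hand-set bigram table is a stand-in for a real language model (e.g. one built with SRILM).

# Sketch of LM reranking of combined classifier output. Each classifier
# proposes an edit (position, replacement); we enumerate all subsets of
# edits and pick the sentence the native-text LM scores highest.
from itertools import combinations

# Toy bigram log-probabilities standing in for a model trained on
# large quantities of native text.
BIGRAM_LOGPROB = {
    ("all", "boys"): -1.0, ("all", "boy"): -6.0,
    ("boys", "are"): -1.0, ("boys", "is"): -7.0,
    ("boy", "are"): -7.0,  ("boy", "is"): -3.0,
    ("are", "clever"): -2.0, ("is", "clever"): -2.0,
}

def score(tokens):
    """Sum bigram log-probabilities, backing off to a floor for unseen pairs."""
    return sum(BIGRAM_LOGPROB.get(bg, -10.0) for bg in zip(tokens, tokens[1:]))

def rerank(tokens, edits):
    """edits: list of (index, replacement) pairs proposed by the classifiers."""
    best = list(tokens)
    for k in range(len(edits) + 1):
        for subset in combinations(edits, k):
            cand = list(tokens)
            for i, rep in subset:
                cand[i] = rep
            if score(cand) > score(best):
                best = cand
    return best

sentence = ["all", "boy", "are", "clever"]
edits = [(1, "boys"),   # noun number classifier: boy -> boys
         (2, "is")]     # agreement classifier:  are -> is
print(" ".join(rerank(sentence, edits)))  # all boys are clever

Note that the LM correctly prefers applying only the first edit, avoiding the ungrammatical *All boys is clever that naive combination would produce.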
The project could compare several of these approaches, either by implementing two of them, or by training and testing the language modelling approach on the same data used in previous work. If you are interested in this project, read (some of) the references below and then talk to me.
The project can be undertaken in most programming languages and may utilise some existing EDC classifiers as well as ILP and language modelling toolkits (e.g. LPSOLVE, SRILM). It will suit students taking modules L100 and L101.