Readability metrics have a long history, including the Gunning fog index (1952), SMOG (1969), Flesch-Kincaid (1975) and the Coleman–Liau index (1975), as well as modern alternatives such as the Lexile Text Measure or ATOS, and newer machine-learning and NLP-based approaches. Such metrics can form the basis of a readability model that classifies text into CEFR levels. François & Miltsakaki (2012) describe this kind of experiment for French and include a review of previous related work for English.
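As one concrete point of reference, the sketch below computes the Flesch-Kincaid grade level for a piece of English text. The syllable counter is a crude vowel-group heuristic rather than a dictionary lookup, so the result is an approximation only.

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels; a real system
    # would use a pronunciation dictionary (e.g. CMUdict).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    # Simple regex-based sentence and word splitting; a production system
    # would use a proper tokeniser.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula.
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

print(flesch_kincaid_grade("The cat sat on the mat. It was warm and happy."))
```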
Possible choices for training data include: 1) existing texts from textbooks, and 2) successful written productions by learners.
A successful metric could be useful for assessing the suitability of texts for exams and textbooks, warning readers about the difficulty of text on webpages, assessing newly written scripts (e.g., for self-assessment), and so on.
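To make the classification idea concrete, the sketch below trains a standard classifier on a handful of invented texts with made-up CEFR labels, using a few simple surface features; a real model would need far more data and richer features (readability scores, lexical frequency bands, syntactic measures).

```python
import re
from sklearn.linear_model import LogisticRegression

def surface_features(text):
    # Toy features: mean sentence length, mean word length, type/token ratio.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [
        len(words) / max(1, len(sentences)),
        sum(len(w) for w in words) / max(1, len(words)),
        len(set(words)) / max(1, len(words)),
    ]

# Hypothetical training data: texts labelled with CEFR levels
# (e.g. taken from graded textbooks).
texts = ["I like my dog. It is big.",
         "The committee postponed its decision pending further analysis."]
levels = ["A1", "C1"]

model = LogisticRegression()
model.fit([surface_features(t) for t in texts], levels)
print(model.predict([surface_features("She reads books every day.")]))
```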
The aim is to infer patterns from error-annotated corpora that enable reliable detection and correction of various errors in written text, including errors not previously seen, generalising beyond simple word n-grams.
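For comparison, the kind of word-level baseline the project aims to generalise beyond might simply count substitution patterns in aligned original/corrected sentence pairs. The sketch below does this over invented data with a naive positional alignment; real error-annotated corpora provide explicit edit spans and error types.

```python
from collections import Counter

# Toy error-annotated data: (original, corrected) sentence pairs.
pairs = [
    ("he go to school", "he goes to school"),
    ("she go to work", "she goes to work"),
]

def word_edits(orig, corr):
    # Naive one-to-one alignment by position; real systems align edits properly.
    for o, c in zip(orig.split(), corr.split()):
        if o != c:
            yield (o, c)

counts = Counter(e for o, c in pairs for e in word_edits(o, c))
print(counts.most_common())   # [(('go', 'goes'), 2)]
```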
One way of discovering latent patterns would be to train a tree substitution grammar (TSG) over syntactic trees or grammatical relations, as described in Swanson (2013); more details on training TSGs can be found in Cohn & Blunsom (2010).
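To illustrate the substitution operation itself (not the induction procedure from the cited papers), the sketch below uses NLTK's Tree class with hand-written fragments: an elementary tree with a frontier nonterminal NP is expanded by substituting another fragment at that site.

```python
from nltk import Tree

# Elementary trees: fragments whose frontier nonterminals (bare leaves such
# as "NP" below) are substitution sites.
fragment = Tree.fromstring("(S (NP (PRP he)) (VP (VBZ goes) (PP (TO to) NP)))")
np_fragment = Tree.fromstring("(NP (NN school))")

def substitute(tree, site_label, replacement):
    # Replace the first frontier nonterminal matching site_label with the
    # replacement fragment (depth-first).
    for i, child in enumerate(tree):
        if isinstance(child, str):
            if child == site_label:
                tree[i] = replacement
                return True
        elif substitute(child, site_label, replacement):
            return True
    return False

substitute(fragment, "NP", np_fragment)
print(fragment)  # (S (NP (PRP he)) (VP (VBZ goes) (PP (TO to) (NP (NN school)))))
```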
Possible extensions include the use of native corpora (containing unannotated correct text) to reinforce or complement the knowledge extracted from the error-corrected data, as well as the use of graph kernels.
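As one illustration of how a native corpus might be used, a language model trained on correct text can rank candidate corrections; the sketch below stands in for such a model with raw bigram counts over a toy corpus.

```python
from collections import Counter

# Toy "native" corpus of correct sentences; in practice this would be a large
# unannotated corpus of well-formed text.
native = ["she goes to school", "he goes to work", "they go to school"]

bigrams = Counter(b for s in native
                  for b in zip(["<s>"] + s.split(), s.split() + ["</s>"]))

def score(sentence):
    # Sum of raw bigram counts: a crude stand-in for a smoothed language model.
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(bigrams[b] for b in zip(tokens, tokens[1:]))

# Prefer the candidate correction that the native corpus supports best.
print(max(["he go to school", "he goes to school"], key=score))
```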