Prof Andrew Caines

The Write & Improve Corpus 2024: a set of essays written by learners of English using the Write & Improve writing practice website, with data and annotation courtesy of Cambridge University Press & Assessment. More information, including how to obtain the data and a link to our full-text paper, is available here on the CUP&A website.
The Teacher-Student Chatroom Corpus: a collection of one-to-one written English lessons between qualified teachers and learners of English in an online chatroom. We describe the work in two papers, published at NLP4CALL 2020 and NLP4CALL 2022, and we have made the data available for research use by application here.
The CrowdED Corpus: a crowdsourced corpus of spoken and transcribed monologues by native speakers and learners of English, originally reported at LREC 2016. To accompany our paper at COLING 2020, we release corrected transcriptions and grammatical error annotations on a subset of the English recordings.
The WaCky Wordlist: a wordlist based on UKWaC, a 2-billion-word web corpus, which lists the 1.2 million word types occurring in the corpus 10 times or more, along with part-of-speech tags and phonemic forms.
The Glottolog Data Explorer: an interactive world map showing language endangerment status for languages in the Glottolog database.
An open dataset of ML & NLP papers put together by Marek Rei, used for our blogpost on NLP conferences, geographic diversity and carbon emissions, and also for Marek’s new study of 2019 publications.

Contact me: firstname.lastname @ cl.cam.ac.uk