The CrowdED Corpus: a crowdsourced corpus of spoken and transcribed monologues by native speakers and learners of English, originally reported at LREC 2016; to accompany our paper at COLING 2020 we release corrected transcriptions and grammatical error annotations on a subset of the English recordings.
The Teacher-Student Chatroom Corpus: a collection of one-to-one written English lessons between qualified teachers and learners of English in an online chatroom; we describe the work in two papers, published at NLP4CALL 2020 and NLP4CALL 2022, and the data are available for research use by application here.
The WaCky Wordlist: a wordlist based on UKWaC, a 2-billion-word web corpus, which lists the 1.2 million word types occurring at least 10 times in the corpus, along with their part-of-speech tags and phonemic forms.
The Glottolog Data Explorer: an interactive world map showing language endangerment status for languages in the Glottolog database.