- The CrowdED Corpus: a crowdsourced corpus of spoken and transcribed monologues by native speakers and learners of English, originally reported at LREC 2016; to accompany our paper at COLING 2020 we release corrected transcriptions and grammatical error annotations on a subset of the English recordings.
- The Teacher-Student Chatroom Corpus: a collection of one-to-one written English lessons between qualified teachers and learners of English in an online chatroom; we describe the work in a paper published at NLP4CALL 2020 and make the data available by application here.
- The WaCky Wordlist: a wordlist based on UKWaC, a 2-billion-word web corpus; it lists the 1.2 million word types that occur 10 or more times in the corpus, along with their part-of-speech tags and phonemic forms.
- The Glottolog Data Explorer: an interactive world map showing language endangerment status for languages in the Glottolog database.
- An open dataset of ML & NLP papers, put together by Marek Rei, used for our blogpost on NLP conferences, geographic diversity and carbon emissions, and also for Marek’s new study of 2019 publications.
Contact me: firstname.lastname @ cl.cam.ac.uk