Complex Word Identification (CWI) is an NLP task concerned with identifying words that should be simplified or adapted to a reader's current level. It is often considered the first step in lexical text simplification, which aims to replace complex words and expressions with simpler alternatives. Text simplification and CWI are NLP applications useful for a wide variety of readers, both native and non-native speakers. For example, it has been shown (Nation, 2006) that a non-native speaker should know around 98% of the vocabulary used in a text in order to understand it; as for native speakers of English, the National Literacy Trust estimates that 1 in 6 adults in the UK have poor literacy skills.
CWI has recently become an active area of research, with at least one shared task organised so far and potentially more to come (Zampieri et al., 2017). The starting point for this project will be an implementation of the baselines outlined in Paetzold & Specia (2016a), together with the approaches applied by the winning team (Paetzold & Specia, 2016b), which will serve as a benchmark system.
In addition, the CWI task is closely related to that of readability assessment (Xia et al., 2016), with the complex words being beyond the reading level that is assigned to the text as a whole. This project will consider:
Training, test data and evaluation scripts for the CWI shared task are publicly available.
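To give a flavour of what a simple baseline might look like, the sketch below labels a word as complex when its length reaches a threshold learned from training data. This is an illustrative assumption only: the toy word list, the choice of word length as the single feature, and the function names `predict`, `accuracy` and `fit_threshold` are all hypothetical, and the sketch does not use the shared task's data format or reproduce the baselines described in Paetzold & Specia (2016a).

```python
from typing import List, Tuple

# Hypothetical toy data: (word, label) pairs, where 1 = complex and 0 = simple.
# The real shared-task data would be loaded from its own files instead.
TRAIN: List[Tuple[str, int]] = [
    ("cat", 0), ("house", 0), ("run", 0), ("table", 0), ("garden", 0),
    ("ubiquitous", 1), ("perfunctory", 1), ("idiosyncrasy", 1),
]


def predict(word: str, threshold: int) -> int:
    """Label a word complex (1) if it has at least `threshold` characters."""
    return 1 if len(word) >= threshold else 0


def accuracy(data: List[Tuple[str, int]], threshold: int) -> float:
    """Fraction of words whose predicted label matches the gold label."""
    correct = sum(predict(word, threshold) == label for word, label in data)
    return correct / len(data)


def fit_threshold(data: List[Tuple[str, int]]) -> int:
    """Return the smallest length threshold with the highest training accuracy."""
    longest = max(len(word) for word, _ in data)
    candidates = range(1, longest + 2)
    # max() keeps the first maximal candidate, i.e. the smallest threshold.
    return max(candidates, key=lambda t: accuracy(data, t))


if __name__ == "__main__":
    threshold = fit_threshold(TRAIN)
    print("learned threshold:", threshold)
    print("training accuracy:", accuracy(TRAIN, threshold))
```

A baseline of this kind is mainly useful as a lower bound against which the benchmark system can be compared; other single features, such as corpus frequency, could be substituted for word length in the same way.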
Nation (2006), How Large a Vocabulary Is Needed For Reading and Listening?
Paetzold & Specia (2016a), SemEval 2016 Task 11: Complex Word Identification and referenced papers on individual systems.
Paetzold & Specia (2016b), SV000gg at SemEval-2016 Task 11: Heavy Gauge Complex Word Identification with System Voting
Zampieri et al. (2017), Complex Word Identification: Challenges in Data Annotation and System Performance
Xia et al. (2016), Text Readability Assessment for Second Language Learners