Complex Word Identification

  • Proposer: Ekaterina Kochmar
  • Supervisors: Ekaterina Kochmar
  • Special Resources: Access to the Cambridge Learner Corpus (CLC), English Vocabulary Profile (EVP) and Cambridge Advanced Learner's Dictionary (CALD); access to an NLIP machine/server
  • Applications: lexical text simplification, readability assessment

    Description

    Complex Word Identification (CWI) is an NLP task concerned with identifying words that should be simplified or adapted to the current level of a reader. It is often considered the first step in lexical text simplification, which aims to replace complex words and expressions with simpler alternatives. Text simplification and CWI are NLP applications useful for a wide variety of readers, both native and non-native speakers. For example, Nation (2006) has shown that a non-native speaker should know around 98% of the vocabulary used in a text in order to understand it, while, for native speakers of English, the National Literacy Trust estimates that 1 in 6 adults in the UK have poor literacy skills.

    CWI has recently become an active area of research, with at least one shared task organised so far and potentially more to come (Zampieri et al., 2017). The starting point for this project will be to implement the baselines outlined in Paetzold & Specia (2016a), together with the approaches applied by the winning team (Paetzold & Specia, 2016b), which will serve as a benchmark system.
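
    As an illustration of what a simple CWI baseline can look like, the sketch below labels a word as complex when its corpus frequency falls at or below a cut-off learned from labelled data. It is a minimal example under stated assumptions and is not taken from Paetzold & Specia (2016a): the function names, the toy frequency table and the toy labels are all hypothetical.

```python
def train_frequency_threshold(words, labels, freq):
    """Pick the frequency cut-off that maximises training accuracy.

    words  : list of target words
    labels : list of 0/1 gold labels (1 = complex)
    freq   : dict mapping word -> corpus frequency (hypothetical resource)
    """
    if not words:
        raise ValueError("need at least one training word")
    # Only frequencies seen in training can change the predictions.
    candidates = sorted({freq.get(w, 0) for w in words})
    best_threshold, best_accuracy = candidates[0], -1.0
    for t in candidates:
        # A word is predicted complex when its frequency is <= t.
        correct = sum(
            (freq.get(w, 0) <= t) == bool(y) for w, y in zip(words, labels)
        )
        accuracy = correct / len(words)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold


def predict_complex(words, freq, threshold):
    """Return 1 for words at or below the frequency threshold, else 0."""
    return [int(freq.get(w, 0) <= threshold) for w in words]


# Toy data, invented for this example only.
toy_freq = {"the": 1000, "cat": 200, "house": 150,
            "ubiquitous": 3, "perfunctory": 1}
train_words = ["the", "cat", "ubiquitous", "perfunctory"]
train_labels = [0, 0, 1, 1]

threshold = train_frequency_threshold(train_words, train_labels, toy_freq)
predictions = predict_complex(["house", "obfuscate"], toy_freq, threshold)
```

    Unseen words are given frequency 0 here and are therefore always flagged as complex; that is a design choice of this sketch, and a real system would likely combine frequency with other features.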

    In addition, the CWI task is closely related to that of readability assessment (Xia et al., 2016), with the complex words being beyond the reading level that is assigned to the text as a whole. This project will consider:

    1. casting the CWI task as that of readability assessment, using the large number of features employed by the system in Xia et al. (2016) and aiming to predict the areas of text (words and expressions) beyond the overall reading level of the text;
    2. using additional resources, such as CLC, EVP and CALD, to derive complexity levels for words;
    3. reconceptualising word complexity as a continuous rather than a binary variable (see 3.3 in Zampieri et al. (2017)), and treating the CWI task as a ranking/regression problem rather than that of classification;
    4. identifying the 2% of the most challenging words in need of simplification in accordance with the hypothesis of Nation (2006).
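
    Points 3 and 4 above can be combined in a small sketch: each token receives a continuous complexity score, tokens are ranked by that score, and the top 2% are flagged for simplification. Everything specific here is an assumption made for illustration: the score (smoothed negative log relative frequency), the choice to count the 2% over running tokens rather than word types, the rounding rule, and the names `complexity_score` and `flag_most_complex`.

```python
import math


def complexity_score(word, freq, total):
    """Continuous score: rarer words get higher values.

    Uses add-one smoothing so unseen words get a finite score.
    This scoring function is an assumption, not a published method.
    """
    return -math.log((freq.get(word.lower(), 0) + 1) / (total + 1))


def flag_most_complex(tokens, freq, proportion=0.02):
    """Return sorted indices of the top `proportion` of tokens by score."""
    total = sum(freq.values())
    # Flag at least one token; round up for short texts (assumption).
    k = max(1, math.ceil(proportion * len(tokens)))
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: complexity_score(tokens[i], freq, total),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy text of 100 tokens and a toy frequency table (both invented).
toy_freq = {"the": 1000, "cat": 200, "ubiquitous": 3, "perfunctory": 1}
toy_tokens = ["the"] * 97 + ["ubiquitous", "cat", "perfunctory"]

flagged = flag_most_complex(toy_tokens, toy_freq)
flagged_words = [toy_tokens[i] for i in flagged]
```

    Because the score is continuous, the same ranking could be evaluated as a regression or ranking output, with the 2% cut-off applied only as a final decision step.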


    Additional resources:

    Training, test data and evaluation scripts for the CWI shared task are publicly available.

    Background reading:

    Nation (2006), How Large a Vocabulary Is Needed For Reading and Listening?

    Paetzold & Specia (2016a), SemEval 2016 Task 11: Complex Word Identification, and referenced papers on individual systems

    Paetzold & Specia (2016b), SV000gg at SemEval-2016 Task 11: Heavy Gauge Complex Word Identification with System Voting

    Zampieri et al. (2017), Complex Word Identification: Challenges in Data Annotation and System Performance

    Xia et al. (2016), Text Readability Assessment for Second Language Learners