
Translating between text levels: unsupervised translation for text simplification

Description

In automatic text simplification the aim is to translate between sentences of different difficulty levels. Common neural machine translation methods rely on large parallel corpora for training (Stajner et al 2017), which limits the generalizability of these methods to other languages, use cases and domains. This project instead aims to explore the task of unsupervised translation for simplification, where only monolingual corpora of different difficulty levels are given. Recent attempts at unsupervised sentence simplification exist (Surya et al 2019; Zhao et al 2020) and can serve as a solid basis for this project. Possible directions for this project include: [i] learning jointly on multiple objectives, as text simplification is related to several tasks such as sentence paraphrasing and summarization, [ii] unsupervised multilingual simplification for languages such as Spanish or Italian (Aprosio et al 2019; Martin et al 2020), [iii] cross-domain performance when simplifying sentences in a different domain than the one trained on (e.g. Wikipedia vs. Newsela).
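One common ingredient of unsupervised translation that carries over to this setting is iterative back-translation: each direction of the model generates synthetic parallel pairs from its monolingual corpus to train the reverse direction. The sketch below illustrates only that data flow, under loud assumptions: the corpora are tiny stand-ins, and the "models" are trivial lookup tables rather than the seq2seq networks a real system (e.g. Surya et al 2019) would use.

```python
# Minimal sketch of iterative back-translation for unsupervised
# simplification. The corpora and "models" are hypothetical
# placeholders; only the training loop's data flow is the point.

complex_corpus = ["the committee adjourned the hearing indefinitely"]
simple_corpus = ["the meeting was stopped"]

def train(pairs):
    """Placeholder supervised step: memorise (source, target) pairs."""
    return dict(pairs)

def translate(model, sentence):
    """Placeholder inference: look up the sentence, else copy it."""
    return model.get(sentence, sentence)

simplifier, complexifier = {}, {}
for _ in range(3):  # a few rounds of iterative back-translation
    # 1. Back-translate the monolingual complex corpus with the current
    #    complex->simple model to create synthetic (simple, complex)
    #    pairs, and train the reverse (simple->complex) model on them.
    synthetic = [(translate(simplifier, c), c) for c in complex_corpus]
    complexifier = train(synthetic)
    # 2. Symmetrically, back-translate the simple corpus to retrain
    #    the complex->simple simplifier.
    synthetic = [(translate(complexifier, s), s) for s in simple_corpus]
    simplifier = train(synthetic)
```

With real neural models the two directions improve each other across rounds; with these lookup-table placeholders the loop merely demonstrates how each monolingual corpus supplies training data for the opposite direction.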

Resources

References

Linking language comprehension and production embeddings in vector space

Description

Integrated teaching and learning platforms are becoming increasingly sophisticated. Recent work has employed neural models to create vectors representing a learner's skill set and the learning tasks available to them. When these embeddings occupy the same vector space, they can be used to recommend tasks appropriate to the learner. Previous work by Moore et al (2019) has modelled latent user proficiencies and tasks as skill embeddings in the STEM domain. Building on work by Chen & Meurers (2019), the aim of this project is to apply a similar modelling approach to the domain of language learning. In particular, this project would aim to model both the learner's written language proficiency and their reading proficiency. It is anticipated that the vectors representing the two proficiencies will not occupy the same space; part of this project will therefore involve learning a mapping between the spaces, so that reading competence may be predicted from writing competence and vice versa.
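One simple baseline for mapping between two embedding spaces is a linear transformation fitted by least squares (a Procrustes-style approach familiar from cross-lingual embedding alignment). The sketch below is an assumption-laden illustration, not the method of the cited work: the writing and reading vectors are synthetic, and the learners, dimensions, and noise level are made up for the example.

```python
import numpy as np

# Hypothetical setup: paired writing- and reading-proficiency vectors
# for the same learners, living in separate embedding spaces. We fit a
# linear map W by least squares so reading competence can be predicted
# from writing competence (a baseline sketch, not the cited method).

rng = np.random.default_rng(0)
n_learners, dim = 100, 8

# Synthetic writing-skill embeddings, one row per learner.
writing = rng.normal(size=(n_learners, dim))
# Synthetic reading embeddings: a linear image of writing, plus noise.
true_map = rng.normal(size=(dim, dim))
reading = writing @ true_map + 0.01 * rng.normal(size=(n_learners, dim))

# Least-squares fit: W minimises || writing @ W - reading ||.
W, *_ = np.linalg.lstsq(writing, reading, rcond=None)

# Predict reading competence for an unseen learner's writing vector.
new_writing = rng.normal(size=(1, dim))
predicted_reading = new_writing @ W
```

Fitting the reverse map from reading to writing is symmetric; more expressive (e.g. non-linear) mappings could be explored if a linear map proves too restrictive.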

Resources

References