Proposer: Tamara Polajnar tp366@cam.ac.uk
Supervisor: Tamara Polajnar

Glossary induction

Key phrase extraction is an established task in NLP, generally used to produce keyword summaries of documents. I would be interested in reformulating it as extracting key terms for Wikipedia categories. In particular, Wikipedia has glossary pages associated with some categories (https://en.wikipedia.org/wiki/Portal:Contents/Glossaries), which could serve as ground truth; the data would have to be extracted.

For example, the terms in https://en.wikipedia.org/wiki/Glossary_of_graph_theory_terms would be the target key terms for https://en.wikipedia.org/wiki/Category:Graph_theory.
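As a rough sketch of how the ground-truth data could be collected (in Python; the MediaWiki categorymembers API is real, but the dt.glossary selector is an assumption about how glossary templates render and would need checking against the actual page HTML):

import requests
from bs4 import BeautifulSoup

API = "https://en.wikipedia.org/w/api.php"

def category_members(category, limit=500):
    # Fetch page titles belonging to a category via the MediaWiki API.
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": category, "cmlimit": limit, "format": "json"}
    resp = requests.get(API, params=params).json()
    return [m["title"] for m in resp["query"]["categorymembers"]]

def glossary_terms(url):
    # Scrape term headwords from a glossary page; glossary templates
    # appear to render each term as a <dt class="glossary"> element.
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return [dt.get_text(strip=True) for dt in soup.select("dt.glossary")]

pages = category_members("Category:Graph theory")
terms = glossary_terms("https://en.wikipedia.org/wiki/Glossary_of_graph_theory_terms")

(Larger categories need API continuation, omitted here for brevity.)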

This is generally done by statistically identifying words that are significantly more frequent in a particular sub-corpus than in the corpus overall.
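For instance, here is a minimal sketch of the standard keyness approach, using Dunning's log-likelihood ratio to score how over-represented each word is in the sub-corpus relative to a background corpus (subcorpus and background are hypothetical token lists):

import math
from collections import Counter

def log_likelihood(a, b, c, d):
    # G2 statistic for a word occurring a times in a sub-corpus of size c
    # and b times in a background corpus of size d.
    e1 = c * (a + b) / (c + d)  # expected count in the sub-corpus
    e2 = d * (a + b) / (c + d)  # expected count in the background
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def key_terms(subcorpus, background, top_n=20):
    sub, bg = Counter(subcorpus), Counter(background)
    c, d = sum(sub.values()), sum(bg.values())
    # Score only words that are relatively more frequent in the sub-corpus.
    scores = {w: log_likelihood(sub[w], bg[w], c, d)
              for w in sub if sub[w] / c > bg[w] / d}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]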

Potentially you could use RNNs with attention and other datasets, e.g. the CNN/Daily Mail dataset, but instead of generating the summary snippets in full, try to regenerate only the content words in the snippets. Other annotated key phrase extraction datasets would also work.
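A sketch of how the content-word targets could be derived from summary snippets, using NLTK's off-the-shelf tokeniser and tagger (the highlight sentence is invented and the tag prefixes are a simplification):

import nltk  # needs nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

CONTENT_TAGS = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs

def content_words(sentence):
    # Keep only open-class words as generation targets.
    tokens = nltk.word_tokenize(sentence)
    return [w for w, tag in nltk.pos_tag(tokens) if tag.startswith(CONTENT_TAGS)]

highlight = "Researchers propose a new algorithm for sparse graphs."
print(content_words(highlight))
# -> ['Researchers', 'propose', 'new', 'algorithm', 'sparse', 'graphs']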

Lemmatised vs standard embeddings in paraphrase detection/generation

The task is either paraphrase detection or, potentially, paraphrase generation. The goal would be to use this as a way of evaluating various semantic composition methods and vector types. There has been work on vector composition methods and on evaluating distributional vs. distributed vectors, but not much examination of the granularity of the vector type for this task. Is it better to use general embeddings, or do lemmatised, POS-separated ones add value? Can we build a sequence model that learns character-level, token-level, and lemma-POS-level embeddings, and see which of these features helps the most on a task such as paraphrase detection or a sequence-to-sequence task such as paraphrase generation? Starting points:
"The Interplay of Semantics and Morphology in Word Embeddings" and the PPDB paraphrase database
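One way to obtain comparable surface-form and lemma-POS embeddings trained on identical data is a sketch like this, using gensim's word2vec with spaCy supplying lemmas and POS tags (corpus_sents is a toy stand-in for the real training corpus):

import spacy
from gensim.models import Word2Vec

nlp = spacy.load("en_core_web_sm")

def tokenise(sentences, lemma_pos=False):
    # Yield one token list per sentence, either surface forms or lemma|POS.
    for doc in nlp.pipe(sentences):
        if lemma_pos:
            yield [f"{t.lemma_}|{t.pos_}" for t in doc if not t.is_space]
        else:
            yield [t.lower_ for t in doc if not t.is_space]

corpus_sents = ["The cats were running home.", "A cat runs fast."]  # stand-in

surface = Word2Vec(list(tokenise(corpus_sents)), vector_size=100, min_count=1)
lemma_pos = Word2Vec(list(tokenise(corpus_sents, lemma_pos=True)), vector_size=100, min_count=1)

Training both models on the same sentences with the same hyperparameters keeps the comparison fair, so any difference on the paraphrase task can be attributed to the token granularity rather than the training data.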

Semantic composition with sense-disambiguated embeddings

The goal of this project is to determine which embeddings work best for word retrieval on a definition-like task, e.g. reverse dictionary retrieval or the RELPRON dataset. Can you run a WSD system and then use sense-disambiguated vectors on composition tasks to see whether it makes a difference? There is related work at https://arxiv.org/pdf/1702.06696.pdf, but the task is different and the vectors they compare are not trained on the same data. Ultimately, then, this project would involve training comparable vectors with an embedding-generating method such as word2vec.
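A minimal sketch of such a pipeline, using NLTK's simple Lesk algorithm as a stand-in for a stronger WSD system and additive (averaging) composition as a baseline (the example sentences are invented):

import numpy as np
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk  # needs nltk.download('wordnet') and nltk.download('punkt')
from gensim.models import Word2Vec

def sense_tag(sentence):
    # Replace each token with token|sense_key wherever Lesk finds a synset.
    tokens = word_tokenize(sentence)
    tagged = []
    for tok in tokens:
        synset = lesk(tokens, tok)
        tagged.append(f"{tok}|{synset.name()}" if synset else tok)
    return tagged

sentences = ["The bank approved the loan.", "They sat on the river bank."]
model = Word2Vec([sense_tag(s) for s in sentences], vector_size=50, min_count=1)

def compose(tagged_tokens, model):
    # Additive composition: average the in-vocabulary sense vectors.
    vecs = [model.wv[t] for t in tagged_tokens if t in model.wv]
    return np.mean(vecs, axis=0)

The same word2vec settings would then be reused to train plain (untagged) vectors on the same corpus, so the comparison on RELPRON or reverse dictionary retrieval is not confounded by differences in training data.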