If you are interested in any of the proposals below, please email the proposer or Ann Copestake. We expect students to meet with potential supervisors before making their project selection, at least for their first-choice projects.

How clever are the models exhibiting 'super-human' performance on the CLEVR VQA dataset?

Proposer: Alexander Kuhnle
Supervisor: Ann Copestake with Alexander Kuhnle

Description

Multimodal tasks like Visual Question Answering (VQA) [1] are easy to evaluate, but still require the evaluated system to combine vision and language productively in non-trivial ways. This is especially the case in abstract settings with datasets like CLEVR [2], where visual and lexical complexity are reduced to a minimum and the difficulty lies in visually grounded language understanding.

Recently, various deep learning models have been reported to achieve more than 90% accuracy on CLEVR. The FiLM model [3] stands out in particular due to its simplicity: it applies feature-wise linear modulation, conditioned on the linguistic input representation produced by an LSTM, to the visual features of a CNN with a few layers. In this MPhil project, the first aim is to reproduce these results on CLEVR (starting with [4]). As a next step, we want to evaluate the model on our dataset [5], which, like CLEVR, consists of abstract coloured shapes, but can produce controlled data via a detailed generation configuration. Consequently, we can test in more detail which types of linguistic structures the system handles successfully, and which of them might not be sufficiently covered by CLEVR. Moreover, by ablating and modifying the model, we are interested in investigating precisely which aspects of the FiLM architecture are responsible for its significant gains in accuracy over a CNN+LSTM baseline [3].
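
To make the modulation mechanism concrete, the following is a minimal sketch of a single FiLM-modulated block in PyTorch (the framework used by [4]). Layer sizes and the residual structure of the full model are simplified here; this illustrates feature-wise linear modulation under stated assumptions, rather than reproducing the exact architecture of [3].

import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """One FiLM-modulated convolutional block (illustrative sketch)."""

    def __init__(self, num_channels, question_dim):
        super().__init__()
        self.conv = nn.Conv2d(num_channels, num_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(num_channels, affine=False)
        # project the LSTM question encoding to per-channel (gamma, beta)
        self.film = nn.Linear(question_dim, 2 * num_channels)

    def forward(self, features, question):
        # features: (batch, channels, height, width); question: (batch, question_dim)
        gamma, beta = self.film(question).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # broadcast over spatial dims
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        # feature-wise linear modulation: gamma * F + beta
        return torch.relu(gamma * self.bn(self.conv(features)) + beta)

Ablations of the kind proposed above can then be expressed as small changes to this block, e.g. fixing gamma to 1 (shift-only conditioning) or beta to 0 (scale-only conditioning).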

[1] https://arxiv.org/abs/1505.00468
[2] https://arxiv.org/abs/1612.06890
[3] https://arxiv.org/abs/1709.07871
[4] https://github.com/ethanjperez/film
[5] https://arxiv.org/abs/1704.04517

Evaluation of image captioning systems on artificial data via semantic parsing

Proposer: Alexander Kuhnle
Supervisor: Ann Copestake with Alexander Kuhnle

Description

Multimodal tasks have gained popularity in recent years, with neural networks achieving unprecedented performance on them. Two common tasks are Image Captioning (IC) [1] and Visual Question Answering (VQA) [2]. More recently, datasets like CLEVR [3] were introduced, which consist of artificial, automatically generated data, with the motivation of providing clearer evaluation results.

Similar to CLEVR, we are working on a dataset, or more precisely, a data generation system [4], which produces abstract images of coloured shapes and statements about these images in the style of traditional formal semantics (e.g. focusing on quantifiers, as in "Most squares are red."). As part of this MPhil project, we are interested in investigating IC models, e.g. [1, 5], on this kind of data. The grammar-based language generation system of our framework can also be used for parsing, so it is possible to evaluate the statements produced by an IC system against the framework-internal semantics, instead of comparing them to arbitrary 'gold captions', as is otherwise common practice in IC evaluation. Various interesting aspects can be investigated: Do systems generate a wider variety of interesting captions for the same image when run multiple times? Do systems generalise to compose new sentences they have not seen during training?
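
As an illustration of this evaluation idea, here is a rough sketch in Python. The names parse_caption and holds_in are hypothetical stand-ins for the framework-internal grammar parser and model-theoretic evaluator of [4], not its actual API.

def semantic_caption_accuracy(captioner, scenes):
    """Fraction of generated captions that are true of their scene."""
    correct = 0
    for scene in scenes:
        caption = captioner.generate(scene.image)
        semantics = parse_caption(caption)  # hypothetical grammar-based parser
        # a caption counts as correct if it parses and is true of the scene
        if semantics is not None and holds_in(semantics, scene):
            correct += 1
    return correct / len(scenes)

Because correctness is judged against the scene itself, this metric rewards any true, well-formed caption, not just those resembling a fixed reference set.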

[1] https://arxiv.org/abs/1411.4555
[2] https://arxiv.org/abs/1505.00468
[3] https://arxiv.org/abs/1612.06890
[4] https://arxiv.org/abs/1704.04517
[5] https://arxiv.org/abs/1502.03044

A proposition-based summariser using DMRS

Proposers: Ewa Muszyńska and Yimai Fang
Supervisor: Ann Copestake with Ewa Muszyńska

Description

The proposition-based summariser of Fang and Teufel (2014: https://www.cl.cam.ac.uk/~yf261/papers/ft2014.pdf ; 2016: https://www.cl.cam.ac.uk/~yf261/papers/ft2016-new.pdf ) is inspired by an incremental model of human text processing. It composes a summary of a text based on the propositions which are more likely to be remembered by a human reader given memory limitations. The current summariser extracts propositions from text using the purely syntactic Stanford Parser (Klein and Manning, 2003). The goal of the proposed project is to convert the summariser to extract propositions from a semantic, linguistically motivated Dependency Minimal Recursion Semantics (DMRS; Copestake, 2007) representation, based on the broad-coverage, linguistically precise English Resource Grammar (ERG; Copestake and Flickinger, 2000). Some potential directions for investigation are: What form and size should the extracted propositions take? What information contained in the DMRS is useful for the summariser? Does the richer representation improve the performance of the summariser?
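
As a starting point, the DMRS for a sentence can be obtained with pyDelphin and the ACE parser, roughly as sketched below (a minimal sketch assuming pydelphin 1.x and a compiled ERG grammar image; the grammar path is a placeholder, and proposition extraction from the resulting graph is the open design question of the project).

from delphin import ace, dmrs

# parse a sentence with ACE and the ERG, then convert the MRS to DMRS
with ace.ACEParser('erg.dat') as parser:  # placeholder path to the ERG image
    response = parser.interact('The dog barked.')
graph = dmrs.from_mrs(response.result(0).mrs())

for node in graph.nodes:  # elementary predications
    print(node.id, node.predicate)
for link in graph.links:  # dependencies between predications
    print(link.start, '-[{}/{}]->'.format(link.role, link.post), link.end)

Candidate propositions would then be subgraphs of this node/link structure, e.g. a verbal predication together with its argument links.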