Department of Computer Science and Technology

Technical reports

Evaluating visually grounded language capabilities using microworlds

Alexander Kuhnle

January 2020, 142 pages

This technical report is based on a dissertation submitted August 2019 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Queens’ College.

DOI: 10.48456/tr-942


Deep learning has had a transformative impact on computer vision and natural language processing. As a result, recent years have seen the introduction of more ambitious holistic understanding tasks, comprising a broad set of reasoning abilities. Datasets in this context typically act not just as application-focused benchmark, but also as basis to examine higher-level model capabilities. This thesis argues that emerging issues related to dataset quality, experimental practice and learned model behaviour are symptoms of the inappropriate use of benchmark datasets for capability-focused assessment. To address this deficiency, a new evaluation methodology is proposed here, which specifically targets in-depth investigation of model performance based on configurable data simulators. This focus on analysing system behaviour is complementary to the use of monolithic datasets as application-focused comparative benchmarks.

Visual question answering is an example of a modern holistic understanding task, unifying a range of abilities around visually grounded language understanding in a single problem statement. It has also been an early example for which some of the aforementioned issues were identified. To illustrate the new evaluation approach, this thesis introduces ShapeWorld, a diagnostic data generation framework. Its design is guided by the goal to provide a configurable and extensible testbed for the domain of visually grounded language understanding. Based on ShapeWorld data, the strengths and weaknesses of various state-of-the-art visual question answering models are analysed and compared in detail, with respect to their ability to correctly handle statements involving, for instance, spatial relations or numbers. Finally, three case studies illustrate the versatility of this approach and the ShapeWorld generation framework: an investigation of multi-task and curriculum learning, a replication of a psycholinguistic study for deep learning models, and an exploration of a new approach to assess generative tasks like image captioning.

Full text

PDF (4.1 MB)

BibTeX record

  author =	 {Kuhnle, Alexander},
  title = 	 {{Evaluating visually grounded language capabilities using
  year = 	 2020,
  month = 	 jan,
  url = 	 {},
  institution =  {University of Cambridge, Computer Laboratory},
  doi = 	 {10.48456/tr-942},
  number = 	 {UCAM-CL-TR-942}