Department of Computer Science and Technology

Technical reports

Transparent analysis of multi-modal embeddings

Anita L. Verő

May 2022, 202 pages

This technical report is based on a dissertation submitted November 2021 by the author for the degree of Doctor of Philosophy to the University of Cambridge, King’s College.

Some figures in this document are best viewed in colour. If you received a black-and-white copy, please consult the online version if necessary.

DOI: 10.48456/tr-970


Vector Space Models of Distributional Semantics, or Embeddings, serve as useful statistical models of word meanings, which can be applied as proxies to learn about human concepts. One of their main benefits is that not only text but a wide range of data types can be mapped into a common space, where they can be compared or fused together.
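As a minimal sketch of what mapping different data types into one space and fusing them can look like, the Python snippet below concatenates L2-normalised text and image vectors and compares the fused vectors by cosine similarity. Fusion by concatenation, the helper names (`l2_normalize`, `fuse`, `cosine`) and the toy vectors are all assumptions made for illustration; they are not taken from the report and do not represent its method or results.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse(text_vec, image_vec):
    """Hypothetical fusion: concatenate the two L2-normalised vectors.

    Normalising first keeps either modality from dominating merely
    because its raw values are on a larger scale.
    """
    return l2_normalize(text_vec) + l2_normalize(image_vec)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, invented purely for illustration (not real embeddings).
text = {"cat": [0.9, 0.1, 0.0], "dog": [0.8, 0.2, 0.1], "car": [0.0, 0.1, 0.9]}
image = {"cat": [0.7, 0.3], "dog": [0.6, 0.4], "car": [0.1, 0.9]}

# One fused vector per word, living in a single 5-dimensional space.
fused = {word: fuse(text[word], image[word]) for word in text}

# In this toy setup "cat" is closer to "dog" than to "car".
print(cosine(fused["cat"], fused["dog"]) > cosine(fused["cat"], fused["car"]))
```

Concatenation is only one of several possible fusion strategies; it is used here because it keeps each modality's contribution easy to inspect.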

Multi-modal semantics aims to enhance Embeddings with perceptual input, based on the assumption that the representation of meaning in humans is grounded in sensory experience. Most multi-modal research focuses on downstream tasks involving direct visual input, such as Visual Question Answering. Fewer papers have exploited visual information for meaning representations when the evaluation tasks involve no direct visual input, such as semantic similarity. Where such research has been undertaken, the results on the impact of visual information have often been inconsistent, owing to the lack of comparison and the ambiguity of intrinsic evaluation.

Does visual data bolster performance on non-visual tasks? If it does, is this only because we add more data, or does it convey qualitatively complementary information compared to a higher quantity of text? Can we achieve comparable performance using small data if it comes from the right data distribution? Is it the modality, the size or the distributional properties of the data that matter? Evaluating on downstream or similarity-type tasks is a good starting point for comparing models and data sources. However, if we want to resolve the ambiguity of intrinsic evaluations and the spurious correlations of downstream results, creating more transparent and human-interpretable models is necessary.

This thesis proposes diverse studies to scrutinize the inner “cognitive models” of Embeddings, trained on various data sources and modalities. Our contribution is threefold. Firstly, we present comprehensive analyses of how various visual and linguistic models behave in semantic similarity and brain imaging evaluation tasks. We analyse the effect of various image sources on the performance of semantic models, as well as the impact of the quantity of images in visual and multi-modal models. Secondly, we introduce a new type of modality: a visually structured, text-based semantic representation, lying in between the visual and linguistic modalities. We show that this type of embedding can serve as an efficient modality when combined with low-resource text data. Thirdly, we propose and present proof-of-concept studies of a transparent, interpretable semantic space analysis framework.

BibTeX record

@TechReport{UCAM-CL-TR-970,
  author =       {Ver{\H o}, Anita L.},
  title =        {{Transparent analysis of multi-modal embeddings}},
  year =         2022,
  month =        may,
  url =          {},
  institution =  {University of Cambridge, Computer Laboratory},
  doi =          {10.48456/tr-970},
  number =       {UCAM-CL-TR-970}
}