Department of Computer Science and Technology

Technical reports

Deep embodiment: grounding semantics in perceptual modalities

Douwe Kiela

February 2017, 128 pages

This technical report is based on a dissertation submitted July 2016 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Darwin College.

DOI: 10.48456/tr-899


Multi-modal distributional semantic models address the fact that text-based semantic models, which represent word meanings as a distribution over other words, suffer from the grounding problem. This thesis advances the field of multi-modal semantics in two directions. First, it shows that transferred convolutional neural network representations outperform the traditional bag of visual words method for obtaining visual features. It is then shown that these representations may be applied successfully to various natural language processing tasks. Second, it performs the first ever experiments with grounding in the non-visual modalities of auditory and olfactory perception using raw data. Deep learning, a natural fit for deriving grounded representations, is used to obtain the highest-quality representations compared to more traditional approaches. Multi-modal representation learning leads to improvements over language-only models in a variety of tasks. If we want to move towards human-level artificial intelligence, we will need to build multi-modal models that represent the full complexity of human meaning, including its grounding in our various perceptual modalities.

Full text

PDF (5.8 MB)

BibTeX record

  author =	 {Kiela, Douwe},
  title = 	 {{Deep embodiment: grounding semantics in perceptual
  year = 	 2017,
  month = 	 feb,
  url = 	 {},
  institution =  {University of Cambridge, Computer Laboratory},
  doi = 	 {10.48456/tr-899},
  number = 	 {UCAM-CL-TR-899}