Computer Laboratory

Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics

Multi-modal distributional models learn grounded representations for improved performance in semantics. Deep visual representations, learned using convolutional neural networks, have been shown to achieve particularly high performance. In this study, we systematically compare deep visual representation learning techniques, experimenting with three well-known network architectures. In addition, we explore the various data sources that can be used for retrieving relevant images, showing that images from search engines perform as well as, or better than, those from manually crafted resources such as ImageNet. Furthermore, we explore the optimal number of images and the multi-lingual applicability of multi-modal semantics. We hope that these findings can serve as a guide for future research in the field.


The following tarballs contain image embeddings for the top 50 image search results for MEN and SimLex words from Google, Bing and Flickr, using AlexNet (CaffeNet), GoogLeNet and VGGNet. We also provide the multi- and cross-lingual experiment data and the datasets used for evaluation.


The visual representations are stored as Python Pickle files, and are a dictionary of dictionaries of numpy arrays, e.g.:

'elephant': {
    'elephant-image-1': np.array,
    'elephant-image-2': np.array,
    ...
    'elephant-image-50': np.array
}

Please refer to the Python pickle module documentation if you are unfamiliar with the Pickle format.
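As a sketch of this layout, here is a hypothetical toy dictionary in the same shape (the concept name, image keys and vector dimensionality are made up for illustration; the real files contain 50 images per word):

```python
import numpy as np

# Toy stand-in for the real data: a dict mapping each concept to a dict
# mapping image identifiers to embedding vectors (numpy arrays).
data = {
    'elephant': {
        'elephant-image-1': np.zeros(4096),
        'elephant-image-2': np.ones(4096),
    }
}

# Each concept maps to one embedding per retrieved image.
for concept, images in data.items():
    for image_name, vec in images.items():
        print(concept, image_name, vec.shape)
```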

You could use MMFeat to load the Pickle files.

Alternatively, you can load them directly in Python and, for example, take the mean of the top 10 images as the visual concept representation:

import numpy as np
import pickle  # on Python 2, use: import cPickle as pickle

# The files were pickled under Python 2; on Python 3 you may need to pass
# encoding='latin1' to pickle.load to deserialise the numpy arrays.
with open('bing/bing-vgg.pkl', 'rb') as f:
    data = pickle.load(f)

n_images = 10

# Represent each concept by the mean of its first n_images image embeddings.
# Note that this assumes the per-concept dictionary preserves the search
# ranking order of the images.
reps = {}
for key, images in data.items():
    reps[key] = np.mean(list(images.values())[:n_images], axis=0)
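Given such concept representations, word pairs from MEN or SimLex can be scored by cosine similarity. A minimal sketch (with toy vectors standing in for two entries of reps; with the real data you would pass, e.g., reps['elephant'] instead):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors in place of real mean image embeddings.
u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 1.0, 0.0])
print(cosine(u, v))  # 0.5
```

Ranking word pairs by this score and correlating against the human similarity judgements (e.g. with Spearman's rho) is the standard evaluation on these datasets.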