Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics

We construct multi-modal concept representations by concatenating a skip-gram linguistic representation vector with a visual concept representation vector computed using the feature extraction layers of a deep convolutional neural network (CNN) trained on a large labeled object recognition dataset. This transfer learning approach brings a clear performance gain over features based on the traditional bag-of-visual-words approach. Experimental results are reported on the WordSim353 and MEN semantic relatedness evaluation tasks. We use visual features computed using either ImageNet or ESP Game images.
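As a rough illustration of the concatenation step described above, here is a minimal sketch in Python/NumPy. The linguistic and visual vectors are random stand-ins (not the actual skip-gram model, CNN features, or dimensionalities used in the paper), and the L2 normalization before concatenation is one common choice, shown here purely for illustration.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length (no-op for the zero vector)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def multimodal_vector(linguistic_vec, visual_vec):
    """Concatenate a linguistic vector with a visual feature vector.

    Both parts are L2-normalized first so that neither modality
    dominates the concatenation purely by scale.
    """
    return np.concatenate([l2_normalize(linguistic_vec),
                           l2_normalize(visual_vec)])

# Hypothetical inputs: replace with vectors from your own skip-gram
# model and a CNN feature extractor (e.g., the activations of the
# penultimate layer of a network trained on ImageNet).
linguistic = np.random.randn(300)   # stand-in for a skip-gram vector
visual = np.random.randn(4096)      # stand-in for CNN feature-layer output
concept = multimodal_vector(linguistic, visual)
```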

We want to make it easy for people to experiment with image embeddings and multi-modal semantics. To facilitate further research, we publicly release embeddings for all labels in the ESP Game dataset:

The embeddings have been normalized to zero mean and unit length. Note that they cover the visual modality only. They were extracted using code by Maxime Oquab; please cite the paper if you use these vectors.
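For readers who want to experiment with the released vectors, the sketch below shows one way to load word embeddings from a text file and compute cosine similarities for relatedness queries. The file name, the example labels, and the one-concept-per-line text format are assumptions made for illustration, not a description of the actual release format.

```python
import numpy as np

def load_embeddings(path):
    """Load embeddings from a plain-text file.

    Assumes one concept per line: the label followed by its
    whitespace-separated vector components. The actual format of the
    released file may differ.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def cosine(u, v):
    """Cosine similarity; for unit-length vectors this reduces to a dot product."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical file name and labels, shown only to illustrate usage.
emb = load_embeddings("esp_visual_embeddings.txt")
print(cosine(emb["dog"], emb["cat"]))
```

Relatedness benchmarks such as WordSim353 and MEN are typically scored by ranking such pairwise similarities against human judgments, so a loader and a similarity function like these are the main ingredients needed to reproduce that style of evaluation.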