IMPORTANCE-BASED MULTIMODAL AUTOENCODER

Abstract

Integrating information from multiple modalities (e.g., verbal, acoustic and visual data) into meaningful representations has seen great progress in recent years. However, two challenges are not sufficiently addressed by current approaches: (1) computationally efficient training of multimodal autoencoder networks which are robust to absent modalities, and (2) unsupervised learning of the important subspaces in each modality which are correlated with the other modalities. In this paper we propose the IMA (Importance-based Multimodal Autoencoder) model, a scalable model that learns modality importances and robust multimodal representations through a novel cross-covariance-based loss function. We conduct experiments on MNIST-TIDIGITS, a multimodal dataset pairing spoken and handwritten digits, and on IEMOCAP, a multimodal emotion corpus. The IMA model distinguishes digits from uncorrelated noise, and it learns word-level importances that correspond to the separation between function words and emotional words. The multimodal representations learnt by IMA are also competitive with state-of-the-art baselines on downstream tasks.

1. INTRODUCTION

With the ever-increasing amount of heterogeneous multimedia content on the internet, machine learning approaches have been applied to automated perception problems such as object recognition (Krizhevsky et al., 2012), image captioning (Vinyals et al., 2015) and automatic language translation (Choi et al., 2018). An important research direction is the problem of learning representations from multiple modalities which represent our primary channels of communication and sensation, such as vision or touch (Baltrušaitis et al., 2018). Our paper addresses two major challenges in this area. The first is the design of encoder networks that enable learning and inference of multimodal representations in the absence of modalities. This is useful in scenarios such as sensor failure, or for the imputation/bidirectional generation of missing modalities from any combination of the observed ones. The caveat is that to obtain this property, recent approaches such as the JMVAE-KL model (Suzuki et al., 2016) and MVAE (Wu & Goodman, 2018) have encoders whose complexity grows quickly with the number of modalities. When M is the number of modalities, JMVAE-KL needs 2^M sub-networks, one for every combination of input modalities, while MVAE requires only M sub-networks but O(2^M) additional subsampled loss terms to handle missing modalities. The second challenge is that multimodal data, such as emotional spoken utterances or web images with captions, are often generated not only by an underlying shared latent factor but also by modality-specific private latent factors. For example, in spoken utterances the verbal modality (words) is generated not only by emotion but also by syntax and semantics. Function words such as "I" and "the" are mostly syntactic and do not relate to emotion; similarly, not all recorded audio frames are indicative of emotion.
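To make the scaling contrast concrete, the sketch below is illustrative only (the function and variable names are our own, not from any of the cited implementations): it counts the encoder combinations a subset-enumeration design like JMVAE-KL would need, and shows the standard Gaussian product-of-experts fusion that lets an MVAE-style design get by with M unimodal encoders, handling a missing modality simply by dropping its expert from the product.

```python
from itertools import combinations

def product_of_experts(mus, variances):
    """Fuse one Gaussian expert per observed modality with a standard-normal
    prior via precision-weighted averaging (the MVAE-style joint posterior)."""
    joint_precision = 1.0 + sum(1.0 / v for v in variances)  # prior has precision 1
    joint_var = 1.0 / joint_precision
    joint_mu = joint_var * sum(m / v for m, v in zip(mus, variances))
    return joint_mu, joint_var

M = 3  # e.g. verbal, acoustic, visual
# Subset enumeration needs one encoder per non-empty modality combination:
subsets = [s for r in range(1, M + 1) for s in combinations(range(M), r)]
print(len(subsets))  # 2**M - 1 = 7 combinations

# Product of experts needs only M unimodal encoders; here two modalities
# are observed (unit variance, means 1.0 and 3.0) and one is missing:
mu, var = product_of_experts([1.0, 3.0], [1.0, 1.0])
print(round(mu, 3), round(var, 3))  # 1.333 0.333
```

The design point is that the subset count grows exponentially in M while the product-of-experts factorization keeps the number of trained sub-networks linear in M.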
Inferring how relevant a sample in each modality is to the shared latent factor (subsequently referred to as its importance) matters for downstream tasks. For the remainder of this paper, non-relevant samples are referred to as uncorrelated noise. In a supervised setting, the latent factors and modality importance weights can be learnt from task labels. When labels are absent, as in the unsupervised scenario, for the purposes of this paper we define modality importances based on correlations between the latent factor and each modality. In the important subspace of each modality, the multimodal and unimodal representations maximally correlate, indicating that samples in that modality subspace can be attributed to a

