IMPORTANCE-BASED MULTIMODAL AUTOENCODER

Abstract

Integrating information from multiple modalities (e.g., verbal, acoustic and visual data) into meaningful representations has seen great progress in recent years. However, two challenges are not sufficiently addressed by current approaches: (1) computationally efficient training of multimodal autoencoder networks which are robust in the absence of modalities, and (2) unsupervised learning of important subspaces in each modality which are correlated with other modalities. In this paper we propose the IMA (Importance-based Multimodal Autoencoder) model, a scalable model that learns modality importances and robust multimodal representations through a novel cross-covariance based loss function. We conduct experiments on MNIST-TIDIGITS, a multimodal dataset of spoken and image digits, and on IEMOCAP, a multimodal emotion corpus. The IMA model is able to distinguish digits from uncorrelated noise, and word-level importances are learnt that correspond to the separation between function and emotional words. The multimodal representations learnt by IMA are also competitive with state-of-the-art baseline approaches on downstream tasks.

1. INTRODUCTION

With the ever-increasing amount of heterogeneous multimedia content on the internet, machine learning approaches have been applied to automated perception problems such as object recognition (Krizhevsky et al., 2012), image captioning (Vinyals et al., 2015) and automatic language translation (Choi et al., 2018). An important research direction is the problem of learning representations from multiple modalities, which represent our primary channels of communication and sensation, such as vision or touch (Baltrušaitis et al., 2018). Our paper addresses two major challenges in this area. The first is the design of encoder networks that enable learning and inference of multimodal representations in the absence of modalities. This is useful for scenarios such as sensor failure, or imputation/bidirectional generation of missing modalities from any combination of the observed ones. The caveat is that, to obtain this property, recent approaches such as the JMVAE-KL model (Suzuki et al., 2016) and MVAE (Wu & Goodman, 2018) have encoders with high complexity for a large number of modalities. With M modalities, JMVAE-KL needs 2^M sub-networks, one for every combination of input modalities, while MVAE requires only M sub-networks but an additional O(2^M) subsampled loss terms to handle missing modalities. The second challenge is that multimodal data, such as emotional spoken utterances or web images with captions, are often generated not only by an underlying shared latent factor, but also by modality-specific private latent factors. For example, in spoken utterances the verbal modality (words) is generated not only by emotion but also by syntax and semantics. Function words such as I and the are mostly syntactic and do not relate to emotion; similarly, not all recorded audio frames are indicative of emotion.
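The product-of-experts idea behind MVAE, which our approach competes with on efficiency, can be illustrated with a short sketch. With Gaussian experts, the product posterior over any subset of observed modalities has a closed form, so a missing modality is handled by simply dropping its expert. The code below is an illustrative sketch under this standard Gaussian assumption, not the authors' implementation; all names are ours.

```python
import numpy as np

def poe_gaussian(mus, logvars):
    """Fuse Gaussian experts (one per observed modality) plus a standard-normal
    prior expert, as in product-of-experts multimodal VAEs.
    mus, logvars: lists of arrays of shape (latent_dim,).
    Returns the mean and log-variance of the product Gaussian."""
    # The prior expert N(0, I) keeps the product well-defined even with
    # zero observed modalities.
    mus = [np.zeros_like(mus[0])] + list(mus)
    logvars = [np.zeros_like(logvars[0])] + list(logvars)
    precisions = [np.exp(-lv) for lv in logvars]  # 1 / sigma^2 per expert
    total_prec = np.sum(precisions, axis=0)
    # Precision-weighted mean; product variance is the inverse total precision.
    mu = np.sum([m * p for m, p in zip(mus, precisions)], axis=0) / total_prec
    return mu, -np.log(total_prec)

# Two modalities observed vs. one modality missing: drop the absent expert.
mu_img, lv_img = np.array([1.0, -1.0]), np.array([0.0, 0.0])
mu_aud, lv_aud = np.array([3.0, 1.0]), np.array([0.0, 0.0])
mu_both, _ = poe_gaussian([mu_img, mu_aud], [lv_img, lv_aud])
mu_img_only, _ = poe_gaussian([mu_img], [lv_img])
```

Only M unimodal encoders are needed to evaluate this posterior for any of the 2^M modality subsets; the exponential cost MVAE pays arises from its subsampled training objective, not from the fusion rule itself.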
The inference of how relevant a sample in each modality is to the shared latent factor (subsequently referred to as importance) is important for downstream tasks. For the remainder of this paper, non-relevant samples are referred to as uncorrelated noise. In a supervised setting, the latent factors and modality importance weights can be learnt from task labels. When labels are absent in the unsupervised scenario, for the purpose of this paper we define the concept of modality importance based on correlations between the latent factor and each modality. In the important subspace of each modality, the multimodal and unimodal representations are maximally correlated, indicating that samples in that modality subspace can be attributed to a shared latent factor rather than an independent private one. In contrast, for unimportant samples in a modality, the correlation is minimal.

The main contributions of our proposed approach are two-fold. The first is a multimodal autoencoder framework whose training requires additional loss terms which are O(M), i.e., linear in the number of modalities, and which thus requires only M per-modality encoders to handle missing modalities. Computationally, this is advantageous compared to JMVAE-KL and MVAE, which require an exponential number of sub-networks and loss terms, respectively. Secondly, we define the concept of importance in an unsupervised setting, and propose novel cross-correlation based loss terms to learn important regions in each modality's representation space. The importances are modeled by separate unimodal networks referred to as importance networks. A hyper-parameter ρ_j for the j-th modality controls the integration of prior domain knowledge about the degree of importance in that modality. While not trained on any supervised labels, the importances learnt by IMA are analyzed quantitatively and found to correspond to the separation between digit vs. noise labels and emotion vs. neutral categories.
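The quantity underlying this definition of importance — how strongly a unimodal embedding correlates with the shared multimodal representation — can be sketched with a simple per-sample proxy. The exact loss terms of IMA are specified later in the paper; the function below is only our illustrative stand-in (all names are ours), using absolute cosine similarity between centered embeddings as the correlation measure.

```python
import numpy as np

def correlation_importance(unimodal, shared, eps=1e-8):
    """Per-sample importance proxy: absolute cosine similarity between a
    modality's centered embedding and the centered shared embedding.
    Samples driven by the shared latent factor score near 1; samples driven
    by modality-private factors (uncorrelated noise) score near 0.
    unimodal, shared: (n_samples, dim) arrays."""
    u = unimodal - unimodal.mean(axis=0)
    s = shared - shared.mean(axis=0)
    num = np.sum(u * s, axis=1)
    den = np.linalg.norm(u, axis=1) * np.linalg.norm(s, axis=1) + eps
    return np.abs(num / den)

rng = np.random.default_rng(0)
shared = rng.normal(size=(6, 16))
# Samples explained by the shared factor vs. uncorrelated noise samples.
aligned = 2.0 * shared + 0.01 * rng.normal(size=(6, 16))
noise = rng.normal(size=(6, 16))
imp_aligned = correlation_importance(aligned, shared)
imp_noise = correlation_importance(noise, shared)
```

Under this proxy, aligned samples receive high scores and noise samples low ones, mirroring the digit-vs-noise separation the learnt importance networks are later shown to recover.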

2. RELATED WORK

Following the great success of deep neural networks for representation learning, the research area of multimodal machine learning is gaining interest (Baltrušaitis et al., 2018). Our proposed IMA model is relevant to two main research areas in this domain: Inter-modality Correlation Learning and Efficient Multimodal VAEs. The idea of learning acoustic embeddings for words has also been explored in Wang et al. (2018) and Jung et al. (2019); however, we attempt to map words to their affective rather than phonetic representations. In this section, we describe each area and conclude with the similarities and differences between the IMA model and prior approaches.

Inter-modality Correlation Learning: There have been several approaches which measure correlations between modalities/sources of data to understand how observed data in each modality can be explained by shared underlying concepts. IBFA (Inter-Battery Factor Analysis), introduced by Tucker (1958), and its successor MBFA (Multi-Battery Factor Analysis) (Browne, 1980) are among the earliest proposed techniques to study shared factors between score sets from batteries of tests. DeepCCA (Deep Canonical Correlation Analysis), proposed by Benton et al. (2017), learns a deep projection of each modality in a bimodal understanding scenario so that the projections are maximally correlated, effectively extending the classical CCA technique (Knapp, 1978) to deep neural networks. Our proposed model extends these approaches to also detect important regions of each modality correlated with the shared latent factor.

Efficient Multimodal VAEs: VAEs (Variational Autoencoders) have been applied to multimodal data for applications such as inference and bidirectional generation of modalities. This poses a major challenge: constructing encoders to model the latent posterior which are efficient in training/inference under any combination of input modalities. Recent work addresses this by focusing on factorized models for efficient inference. Vedantam et al. (2017) employ a product-of-experts decomposition with modality-specific inference networks to train image generation models. Wu & Goodman (2018) propose the MVAE (Multimodal Variational Autoencoder), where the latent posterior is modeled with a parameter-shared product-of-experts network. Shi et al. (2019) propose the MMVAE (Mixture-of-Experts Multimodal Variational Autoencoder), where the posterior is instead a mixture of experts. These approaches have been extended more recently, for example in multi-source neural variational inference (Kurle et al., 2019), where the multimodal posterior is constructed using learnt and integrated beliefs from multiple posteriors, each informed by a different source. Sutter et al. (2020) introduce a novel Jensen-Shannon divergence based objective function which can be used to approximate both unimodal and joint multimodal posteriors. While existing approaches attempt to efficiently learn multimodal representations through posterior modeling, our proposed IMA model aligns modalities during autoencoder training by projecting them to a common space, which facilitates inference even in the absence of modalities. Only M encoders and O(M) loss terms are required by the IMA model for inference with M modalities. Prior work has also not focused sufficiently on unsupervised learning of modality importances (through detection of subspaces maximally correlated with shared latent factors), which we address in this paper.
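The classical CCA technique that DeepCCA generalizes can be computed in a few lines: whiten each view's covariance, then take the SVD of the whitened cross-covariance; the singular values are the canonical correlations. The sketch below is a minimal textbook implementation for reference (variable names and the synthetic example are ours).

```python
import numpy as np

def linear_cca(X, Y, reg=1e-6):
    """Classical canonical correlation analysis.
    X: (n, dx), Y: (n, dy). Returns canonical correlations, descending."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance estimates keep the whitening well-conditioned.
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Singular values of the whitened cross-covariance = canonical correlations.
    T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.linalg.svd(T, compute_uv=False)

rng = np.random.default_rng(1)
z = rng.normal(size=(500, 1))                  # shared latent factor
X = np.hstack([z, rng.normal(size=(500, 2))])  # view 1: shared + private dims
Y = np.hstack([z, rng.normal(size=(500, 2))])  # view 2: shared + private dims
corrs = linear_cca(X, Y)
```

On this synthetic example only one dimension is shared between the views, so the top canonical correlation is close to 1 while the rest are near 0 — the same shared-vs-private separation that motivates the importance modeling in IMA.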

