GENERALIZING MULTIMODAL VARIATIONAL METHODS TO SETS

Abstract

Making sense of multiple modalities can yield a more comprehensive description of real-world phenomena. However, learning the co-representation of diverse modalities remains a long-standing endeavor in emerging machine learning applications and research. Previous generative approaches for multimodal input approximate the joint-modality posterior by uni-modality posteriors as a product-of-experts (PoE) or mixture-of-experts (MoE). We argue that these approximations lead to a defective bound for the optimization process and a loss of semantic connection among modalities. This paper presents a novel variational method on sets, the Set Multimodal VAE (SMVAE), for learning a multimodal latent space while handling the missing-modality problem. By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensates for the drawbacks caused by factorization. Experimental results on public datasets from various domains demonstrate that the proposed method is applicable to order-agnostic cross-modal generation while achieving outstanding performance compared to state-of-the-art multimodal methods. The source code for our method is available online at https://anonymous.4open.science/r/SMVAE-9B3C/.

1. INTRODUCTION

Most real-life applications, such as robotic systems, social media mining, and recommendation systems, naturally contain multiple data sources, which raises the need for learning co-representations of diverse modalities Lee et al. (2020). Making use of additional modalities should improve the general performance of downstream tasks, as they can provide information from another perspective. In the literature, substantial improvements can be achieved by utilizing another modality as supplementary information Asano et al. (2020; 2021). However, our key insight is that these methods suffer from two critical drawbacks: 1) the implied conditional independence assumption and the corresponding factorization prevent their VAEs from modeling inter-modality correlations; 2) the aggregation of inference results from uni-modality encoders is by no means a co-representation of these modalities. To overcome these drawbacks of previous VAE methods, this work proposes the Set Multimodal Variational Autoencoder (SMVAE), a novel multimodal generative model that eschews factorization and instead relies solely on set operations to achieve scalability. The SMVAE allows for better performance compared to the latest multimodal VAE methods and can handle input modalities of variable number and permutation. By learning the actual multimodal joint posterior directly, the SMVAE is the first multimodal VAE method that achieves scalable co-representation with missing modalities. A high-level overview of the proposed method is illustrated in Fig. 1. The SMVAE can handle a set of maximally M modalities, as well as their subsets, and allows cross-modality generation. E_i and D_i represent the i-th embedding network and decoder network for the specific modality; µ_s, σ_s and µ_k, σ_k represent the parameters of the posterior distribution of the latent variable.
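The set-based encoding sketched above can be illustrated with a toy permutation-invariant encoder: each available modality is embedded by its own network E_i, the embeddings are pooled by a sum (a deterministic set operation), and the pooled vector is mapped to the posterior parameters (µ, σ). All names, the random linear-map architecture, and the sum-pooling choice below are illustrative assumptions for exposition, not the actual SMVAE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_EMB, D_LAT = 4, 8, 2

# One embedding network E_i per modality (here: a random linear map + tanh).
embed_weights = [rng.standard_normal((D_IN, D_EMB)) for _ in range(3)]
w_mu = rng.standard_normal((D_EMB, D_LAT))
w_logvar = rng.standard_normal((D_EMB, D_LAT))

def encode(modalities):
    """Map a *set* of (modality index, observation) pairs to (mu, sigma).

    Sum pooling makes the result invariant to input order, and a missing
    modality is handled by simply omitting its embedding from the sum.
    """
    pooled = sum(np.tanh(x @ embed_weights[i]) for i, x in modalities)
    return pooled @ w_mu, np.exp(0.5 * (pooled @ w_logvar))

x1, x2 = rng.standard_normal(D_IN), rng.standard_normal(D_IN)
mu_a, _ = encode([(0, x1), (1, x2)])   # both modalities present
mu_b, _ = encode([(1, x2), (0, x1)])   # permuted order -> same posterior
mu_c, _ = encode([(0, x1)])            # modality 2 missing -> still valid
```

Because the pooling is a sum, reordering the input set leaves the posterior unchanged, and no separate inference model is needed for each modality subset.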
By incorporating a set operation when learning the joint-modality posterior, we can simply drop the corresponding embedding networks when a modality is missing. Comprehensive experiments show that the proposed SMVAE outperforms state-of-the-art multimodal VAE methods and is immediately applicable to real-life multimodal data. Scalable multimodal VAEs such as MVAE Wu & Goodman (2018), MMVAE Shi et al. (2019), and MoPoE Sutter et al. (2021) assume the variational approximation is factorizable. Thus, they focus on factorizing the approximation of the multimodal joint posterior q(z|x_1, …, x_M) into a set of uni-modality inference encoders q_i(z|x_i), such that q(z|x_1, …, x_M) ≈ F({x_i}_{i=1}^M), where F(⋅) is a product or mean operation, depending on the chosen aggregation method. As discussed in Sutter et al. (2021), these scalable multimodal VAE methods differ only in the choice of aggregation method. Different from the multimodal VAE methods mentioned above, we attain the joint posterior in its original form without introducing additional assumptions on its factorization. To handle the issue of scalability, we exploit a deterministic set operation function in the noise-outsourcing process. While existing multimodal VAE methods can be viewed as typical late-fusion methods that combine decisions about the latent variables Khaleghi et al. (2013), the proposed SMVAE corresponds to early fusion at the representation level, allowing for the learning of correlation and co-representation from multimodal data.
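As a concrete illustration of the aggregation functions F(⋅) discussed above, the sketch below fuses uni-modality Gaussian posteriors via product-of-experts (precision-weighted fusion, as in MVAE) and mixture-of-experts (uniform mixture, as in MMVAE). The function names and toy numbers are illustrative, not taken from any of the cited implementations.

```python
import numpy as np

def poe(mus, sigmas):
    """Product-of-experts fusion of diagonal Gaussian experts.

    The product of Gaussian densities is Gaussian, with precision equal
    to the sum of the experts' precisions (precision = 1 / sigma^2).
    """
    precisions = 1.0 / np.square(sigmas)          # (M, D)
    var = 1.0 / precisions.sum(axis=0)            # (D,)
    mu = var * (precisions * mus).sum(axis=0)     # precision-weighted mean
    return mu, np.sqrt(var)

def moe_sample(mus, sigmas, rng):
    """Mixture-of-experts: sample z by first picking one expert uniformly."""
    k = rng.integers(len(mus))                    # choose an expert at random
    return mus[k] + sigmas[k] * rng.standard_normal(mus[k].shape)

# Two uni-modality posteriors q_i(z | x_i) over a 3-d latent space.
mus = np.array([[0.0, 1.0, -1.0], [2.0, 1.0, 0.0]])
sigmas = np.array([[1.0, 1.0, 1.0], [1.0, 2.0, 1.0]])

mu_joint, sigma_joint = poe(mus, sigmas)
```

Note that both operators only combine the outputs of independently trained uni-modality encoders, which is exactly the late-fusion behavior the SMVAE avoids by pooling at the representation level before inferring the posterior.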

Multimodal data can be exploited through cross-modal approaches Nagrani et al. (2020) or by multimodal fusion Atrey et al. (2010); Hori et al. (2017); Zhang et al. (2021). However, current multimodal research suffers severely from the lack of multimodal data with fine-grained labeling and alignment Sun et al. (2017); Beyer et al. (2020); Rahate et al. (2022); Baltrušaitis et al. (2018) and from missing modalities Ma et al. (2021); Chen et al. (2021). In the self-supervised and weakly-supervised learning fields, variational autoencoders (VAEs) for multimodal data Kingma & Welling (2013); Wu & Goodman (2018); Shi et al. (2019); Sutter et al. (2021) have been a dominant branch of development. VAEs are, by definition, generative self-supervised models that capture the dependency between an unobserved latent variable and the input observation. To jointly infer the latent representation and reconstruct the observations properly, multimodal VAEs are required to extract both modality-specific and modality-invariant features from the multimodal observations. Earlier works mainly suffer from scalability issues, as they need to learn a separate model for each modality combination Pandey & Dukkipati (2017); Yan et al. (2016). More recent multimodal VAEs handle this issue and achieve scalability by approximating the true joint posterior distribution with a mixture or product of uni-modality inference models Shi et al. (2019); Wu & Goodman (2018); Sutter et al. (2021).

Figure 1: Overview of the proposed method for learning multimodal latent space. The SMVAE is able to handle any combination or number of input modalities while having discriminative latent space and proper reconstruction.

A key challenge in learning a multimodal generative model is to maintain the model's scalability with respect to the exponential number of modality combinations. Existing multimodal generative models such as the Conditional VAE (CVAE) Pandey & Dukkipati (2017) and the joint-modality VAE (JMVAE) Suzuki et al. (2016) have difficulty scaling, since they need to assign a separate inference model to each possible input-output combination. To tackle this issue, follow-up works such as TELBO Vedantam et al. (2017), MVAE Wu & Goodman (2018), MMVAE Shi et al. (2019), and MoPoE Sutter et al. (2021) approximate the joint posterior with aggregations of uni-modality inference models.
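The scalability issue above can be made concrete: with M modalities there are 2^M − 1 non-empty input subsets, so a design that assigns one inference model per subset grows exponentially, while a set-based encoder needs only M modality embeddings. A small counting sketch (names illustrative):

```python
from itertools import combinations

def nonempty_subsets(m):
    """Enumerate all non-empty subsets of m modality indices."""
    mods = range(m)
    return [c for r in range(1, m + 1) for c in combinations(mods, r)]

# Inference models required: one per subset (JMVAE-style) vs. one per modality.
for m in (3, 5, 10):
    print(f"M={m}: per-subset={len(nonempty_subsets(m))}, per-modality={m}")
```

Already at M = 10, a per-subset design would need 1023 inference models, which is the combinatorial growth that motivates aggregation-based and set-based approaches alike.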

