GENERALIZING MULTIMODAL VARIATIONAL METHODS TO SETS

Abstract

Making sense of multiple modalities can yield a more comprehensive description of real-world phenomena. However, learning the co-representation of diverse modalities remains a long-standing endeavor in emerging machine learning applications and research. Previous generative approaches for multimodal input approximate the joint-modality posterior with uni-modality posteriors, combined as a product-of-experts (PoE) or a mixture-of-experts (MoE). We argue that these approximations lead to a defective bound for the optimization process and a loss of semantic connection among modalities. This paper presents a novel variational method on sets, called the Set Multimodal VAE (SMVAE), for learning a multimodal latent space while handling the missing-modality problem. By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensates for the drawbacks caused by factorization. On public datasets from various domains, the experimental results demonstrate that the proposed method is applicable to order-agnostic cross-modal generation while achieving outstanding performance compared to state-of-the-art multimodal methods. The source code for our method is available online at https://anonymous.4open.science/r/SMVAE-9B3C/.

1. INTRODUCTION

Most real-life applications, such as robotic systems, social media mining, and recommendation systems, naturally involve multiple data sources, which raises the need to learn a co-representation of diverse modalities Lee et al. (2020). Making use of additional modalities should improve the general performance of downstream tasks, as it can provide more information from another perspective. In the literature, substantial improvements have been achieved by utilizing another modality as supplementary information Asano et al. (2020); Nagrani et al. (2020) or by multimodal fusion Atrey et al. (2010); Hori et al. (2017); Zhang et al. (2021). However, current multimodal research suffers severely from the lack of multimodal data with fine-grained labeling and alignment Sun et al. (2017); Beyer et al. (2020); Rahate et al. (2022); Baltrušaitis et al. (2018), and from missing modalities Ma et al. (2021); Chen et al. (2021).

In the fields of self-supervised and weakly-supervised learning, variational autoencoders (VAEs) for multimodal data Kingma & Welling (2013); Wu & Goodman (2018); Shi et al. (2019); Sutter et al. (2021) have been a dominant branch of development. VAEs are, by definition, generative self-supervised models that capture the dependency between an unobserved latent variable and the input observation. To jointly infer the latent representation and reconstruct the observations properly, multimodal VAEs are required to extract both modality-specific and modality-invariant features from the multimodal observations. Earlier works mainly suffer from scalability issues, as they need to learn a separate model for each combination of modalities Pandey & Dukkipati (2017); Yan et al. (2016). More recent multimodal VAEs address this issue and achieve scalability by approximating the true joint posterior distribution with a mixture or a product of uni-modality inference models Shi et al. (2019); Wu & Goodman (2018); Sutter et al. (2021). However, our key insight is that these methods suffer from two critical drawbacks: 1) the implied conditional independence assumption and the corresponding factorization prevent their VAEs from modeling inter-modality correlations; 2) the aggregation of inference results from uni-modality posteriors is by no means a co-representation of these modalities.
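To make the PoE/MoE aggregation concrete, the following is a minimal numpy sketch of how prior multimodal VAEs combine diagonal-Gaussian uni-modality posteriors for a 1-D latent: a product of Gaussians is again Gaussian with precision-weighted mean and summed precisions, while a mixture is sampled by first picking an expert. The function names and the toy numbers are illustrative only and are not taken from the SMVAE code.

```python
import numpy as np

def poe(mus, sigmas):
    """Product-of-experts fusion of Gaussian experts N(mu_i, sigma_i^2).

    The product of Gaussians is Gaussian with precision
    1/var = sum_i 1/sigma_i^2 and mean mu = var * sum_i mu_i/sigma_i^2.
    """
    mus = np.asarray(mus, dtype=float)
    precisions = 1.0 / np.asarray(sigmas, dtype=float) ** 2
    var = 1.0 / precisions.sum(axis=0)
    mu = var * (precisions * mus).sum(axis=0)
    return mu, np.sqrt(var)

def moe_sample(mus, sigmas, n, seed=None):
    """Mixture-of-experts sampling: pick an expert uniformly, then sample it."""
    rng = np.random.default_rng(seed)
    mus = np.asarray(mus, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    idx = rng.integers(len(mus), size=n)  # which uni-modality expert per sample
    return rng.normal(mus[idx], sigmas[idx])

# Two uni-modality "experts" that disagree about the latent:
mu, sigma = poe([0.0, 2.0], [1.0, 1.0])
# With equal precisions, PoE averages the means and halves the variance,
# i.e. mu = 1.0, sigma = sqrt(0.5) -- each factor sharpens the joint posterior.
samples = moe_sample([0.0, 2.0], [1.0, 1.0], n=1000, seed=0)
# MoE instead keeps both modes, so its samples spread across both experts.
```

Note how neither rule models how the two modalities co-vary: PoE and MoE only reweight or interleave the independent uni-modality posteriors, which is exactly the factorization limitation discussed above.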

