GENERALIZING MULTIMODAL VARIATIONAL METHODS TO SETS

Abstract

Making sense of multiple modalities can yield a more comprehensive description of real-world phenomena. However, learning the co-representation of diverse modalities is still a long-standing endeavor in emerging machine learning applications and research. Previous generative approaches for multimodal input approximate a joint-modality posterior by uni-modality posteriors as a product-of-experts (PoE) or mixture-of-experts (MoE). We argue that these approximations lead to a defective bound for the optimization process and loss of semantic connection among modalities. This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space while handling the missing modality problem. By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization. On public datasets from various domains, the experimental results demonstrate that the proposed method is applicable to order-agnostic cross-modal generation while achieving outstanding performance compared to state-of-the-art multimodal methods. The source code for our method is available online https://anonymous.4open.science/r/SMVAE-9B3C/.

1. INTRODUCTION

Most real-life applications such as robotic systems, social media mining, and recommendation systems naturally contain multiple data sources, which raises the need for learning co-representations among diverse modalities Lee et al. (2020). Making use of additional modalities should improve the general performance of downstream tasks as it can provide more information from another perspective. In the literature, substantial improvements can be achieved by utilizing another modality as supplementary information Asano et al. (2020); Nagrani et al. (2020) or by multimodal fusion Atrey et al. (2010); Hori et al. (2017); Zhang et al. (2021). However, current multimodal research suffers severely from the lack of multimodal data with fine-grained labeling and alignment Sun et al. (2017); Beyer et al. (2020); Rahate et al. (2022); Baltrušaitis et al. (2018) and from missing modalities Ma et al. (2021); Chen et al. (2021). In the self-supervised and weakly-supervised learning field, variational autoencoders (VAEs) for multimodal data Kingma & Welling (2013); Wu & Goodman (2018); Shi et al. (2019); Sutter et al. (2021) have been a dominant branch of development. VAEs are generative self-supervised models by definition that capture the dependency between an unobserved latent variable and the input observation. To jointly infer the latent representation and reconstruct the observations properly, multimodal VAEs are required to extract both modality-specific and modality-invariant features from the multimodal observations. Earlier works mainly suffer from scalability issues, as they need to learn a separate model for each modality combination Pandey & Dukkipati (2017); Yan et al. (2016). More recent multimodal VAEs handle this issue and achieve scalability by approximating the true joint posterior distribution with the mixture or the product of uni-modality inference models Shi et al. (2019); Wu & Goodman (2018); Sutter et al. (2021).
However, our key insight is that these methods suffer from two critical drawbacks: 1) The implied conditional independence assumption and the corresponding factorization prevent their VAEs from modeling inter-modality correlations. 2) The aggregation of inference results from uni-modalities is by no means a co-representation of these modalities. To overcome these drawbacks of previous VAE methods, this work proposes the Set Multimodal Variational Autoencoder (SMVAE), a novel multimodal generative model that eschews factorization and instead relies solely upon set operations to achieve scalability. The SMVAE allows for better performance compared to the latest multimodal VAE methods and can handle input modalities of variable number and permutation. By learning the actual multimodal joint posterior directly, the SMVAE is the first multimodal VAE method that achieves scalable co-representation with missing modalities. A high-level overview of the proposed method is illustrated in Fig. 1. The SMVAE can handle a set of maximally M modalities as well as their subsets and allows cross-modality generation. E_i and D_i represent the i-th embedding network and decoder network for the specific modality. µ_s, σ_s and µ_k, σ_k represent the parameters of the posterior distribution of the latent variable. By incorporating set operations when learning the joint-modality posterior, we can simply drop the corresponding embedding networks when a modality is missing. Comprehensive experiments show the proposed Set Multimodal Variational Autoencoder (SMVAE) outperforms state-of-the-art multimodal VAE methods and is immediately applicable to real-life multimodal applications. Figure 1: Overview of the proposed method for learning a multimodal latent space. The SMVAE is able to handle any combination or number of input modalities while having a discriminative latent space and proper reconstruction.

2.1. MULTIMODALITY VAES

The core problem of learning a multimodal generative model is to maintain the model's scalability to the exponential number of modality combinations. Existing multimodal generative models such as the Conditional VAE (CVAE) Pandey & Dukkipati (2017) and joint-modality VAE (JMVAE) Suzuki et al. (2016) have difficulty scaling, since they need to assign a separate inference model for each possible input and output combination. To tackle this issue, follow-up works such as TELBO Vedantam et al. (2017), MVAE Wu & Goodman (2018), MMVAE Shi et al. (2019), and MoPoE Sutter et al. (2021) assume the variational approximation is factorizable. Thus, they focus on factorizing the approximation of the multimodal joint posterior q(z|x_1, ⋯, x_M) into a set of uni-modality inference encoders q_i(z|x_i), such that q(z|x_1, ⋯, x_M) ≈ F({q_i(z|x_i)}_{i=1}^M), where F(⋅) is a product or mean operation, depending on the chosen aggregation method. As discussed in Sutter et al. (2021), these scalable multimodal VAE methods differ only in the choice of aggregation method. Different from the aforementioned multimodal VAE methods, we attain the joint posterior in its original form without introducing additional assumptions on the form of the joint posterior. To handle the issue of scalability, we exploit a deterministic set operation function in the noise-outsourcing process. While existing multimodal VAE methods can be viewed as typical late-fusion methods that combine decisions about the latent variables Khaleghi et al. (2013), the proposed SMVAE method corresponds to an early-fusion method at the representation level, allowing for the learning of correlation and co-representation from multimodal data.
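For concreteness, the product aggregation used by factorized methods such as MVAE has a closed form when the uni-modality encoders are diagonal Gaussians: the combined posterior is a precision-weighted average of the experts. The following is a minimal illustrative sketch (function name and toy values are ours, not from the paper):

```python
import numpy as np

def poe_gaussian(mus, logvars):
    """Product-of-experts for diagonal Gaussians: the combined precision is
    the sum of expert precisions; the mean is the precision-weighted average."""
    precisions = np.exp(-np.asarray(logvars))   # 1 / sigma^2 per expert
    mus = np.asarray(mus)
    var = 1.0 / precisions.sum(axis=0)          # combined variance
    mu = var * (precisions * mus).sum(axis=0)   # combined mean
    return mu, np.log(var)

# Two uni-modal "experts" over a 3-d latent, both with unit variance:
mu, logvar = poe_gaussian([[0.0, 1.0, -1.0], [2.0, 1.0, 1.0]],
                          [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```

With equal unit variances, the PoE mean is simply the average of the expert means and the combined variance halves. Note that this decision-level fusion never lets the experts exchange information before aggregation, which is the drawback the SMVAE addresses.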

2.2. METHODS FOR SET-INPUT PROBLEMS

Multiple instance learning (MIL) Carbonneau et al. (2018) and 3D shape recognition Su et al. (2015); Hofer et al. (2005); Wu et al. (2015) are well-known examples of weakly-supervised learning problems that deal with set-input. MIL handles training data as numerous sets of instances with only set-level labels. A typical way to solve set-level classification problems is to use pooling methods for information aggregation Shao et al. (2021). Lee et al. (2019) observed that classical feed-forward neural networks like the multi-layer perceptron (MLP) Murtagh (1991) can guarantee neither invariance under permutation of the input elements nor support for inputs of arbitrary size. Furthermore, recursive neural networks such as RNNs and LSTMs Hochreiter & Schmidhuber (1997) are sensitive to the order of the input sequences and cannot fit the multimodal case, since there is no natural order for modalities. Deep Sets Zaheer et al. (2017) provided a formal definition of permutation-invariant functions in set-input problems and proposed a universal approximator for arbitrary set functions. Later, the Set Transformer Lee et al. (2019) further extended this idea by using the self-attention mechanism to provide interactions as well as information aggregation among the elements of an input set. However, these methods only model a set of outputs as a deterministic function. Our work fills the gap from a deterministic set function to a probabilistic distribution and applies it to multimodal unsupervised learning.

3.1. PRELIMINARIES

This work considers the multimodal learning problem as a set modeling problem and presents a scalable method for learning multimodal latent variables and cross-modality generation. Given a dataset {X^(i)}_{i=1}^N of N i.i.d. multimodal samples, we consider each sample as a set of M modality observations X^(i) = {x_j^(i)}_{j=1}^M. The multimodal data is assumed to be generated by the random process p(X, z) = p_θ(X|z)p(z), which involves an unobserved latent variable z. The prior distribution of the latent variable z is assumed to be p_θ(z), with θ denoting its parameters. The marginal log-likelihood of this dataset of multimodal sets can be expressed as a summation of the marginal log-likelihoods of the individual sets, log ∏_{i=1}^N p(X^(i)) = ∑_{i=1}^N log p(X^(i)). Since the marginal likelihood of the dataset is intractable, we cannot optimize p({X^(i)}_{i=1}^N) with regard to θ directly. We instead introduce a variational approximation q_ϕ(z|X) from a parametric family, parameterized by ϕ, as an importance distribution; q_ϕ(z|X) is often parameterized by a neural network with trainable parameters ϕ. Together, we can express the marginal log-likelihood of a single multimodal set as:

log p(X^(i)) = D_KL(q_ϕ(z|X^(i)) || p_θ(z|X^(i))) + L(ϕ, θ; X^(i))

L(ϕ, θ; X^(i)) = E_{z∼q_ϕ(z|X^(i))}[log p_θ(X^(i), z) − log q_ϕ(z|X^(i))]
              = −D_KL(q_ϕ(z|X^(i)) || p_θ(z)) + E_{z∼q_ϕ(z|X^(i))}[log p_θ(X^(i)|z)]     (1)

where D_KL(⋅||⋅) is the Kullback-Leibler (KL) divergence between two distributions. The non-negativity of the KL divergence between the variational approximation q_ϕ(z|X^(i)) and the true posterior p_θ(z|X^(i)) in the first line makes L(ϕ, θ; X^(i)) the natural evidence lower bound (ELBO) on the marginal log-likelihood.
The last line indicates that maximizing the ELBO is equivalent to maximizing the reconstruction performance while regularizing the variational approximation toward the assumed prior distribution of the latent variable. To avoid confusion, we refer to the neural networks that map the raw input observations into fixed-size feature vectors as embedding networks, and to the neural network that parameterizes the variational approximation q_ϕ(z|X^(i)) as the encoder network. A frequently used version of the objective function is:

arg max_ϕ −βD_KL(q_ϕ(z|X^(i)) || p(z)) + λ E_{z∼q_ϕ(z|X^(i))}[log p(X^(i)|z)]     (2)

where the annealing coefficient β and the reweighting coefficient λ enable warm-up training, which gradually increases the regularization effect of the prior distribution and helps avoid poor local minima in the early training stage Bowman et al. (2015); Sønderby et al. (2016). We drop the superscript of X^(i) for brevity in the remainder of the paper.
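The β- and λ-weighted objective above can be computed in closed form when the approximate posterior is a diagonal Gaussian and the prior is standard normal. A minimal single-sample sketch (our own illustrative code, with a Bernoulli reconstruction term; not the paper's implementation):

```python
import numpy as np

def elbo_terms(mu, logvar, x, x_recon, beta=1.0, lam=1.0):
    """Beta-weighted ELBO for one sample, to be maximized.
    KL(N(mu, sigma^2 I) || N(0, I)) in closed form; Bernoulli log-likelihood
    as the reconstruction term."""
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    eps = 1e-7  # numerical safety for log
    rec = np.sum(x * np.log(x_recon + eps) + (1 - x) * np.log(1 - x_recon + eps))
    return lam * rec - beta * kl

# With mu = 0, logvar = 0 the KL term vanishes, leaving only reconstruction:
elbo = elbo_terms(np.zeros(4), np.zeros(4),
                  x=np.array([1.0, 0.0]), x_recon=np.array([0.9, 0.1]))
```

During warm-up training, β would be annealed from a small value toward 1 over the first epochs, so early gradients are dominated by the reconstruction term.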

3.2. SET MULTIMODAL VARIATIONAL AUTOENCODER

In multimodal scenarios with missing modalities, we consider each sample X_s = {x_i | i-th modality present} as a subset of X, with the powerset P(X) denoting all 2^M combinations, such that X_s ∈ P(X). Our goal is to perform inference and generation from any number and permutation of available modalities, which requires an inference process that is invariant to permutations and to inputs of variable size. Following Definition 1, we denote the invariant inference process as p(z|X_s) = p(z|π⋅X_s).

Definition 1 Let S_n be the set of all permutations of the indices 1, ⋯, n, and let X = (x_1, ⋯, x_n) denote n random variables. A probability distribution p(y|X) is permutation invariant if and only if, for any permutation π ∈ S_n, p(y|X) = p(y|π⋅X), where ⋅ is the group action.

The ELBO for a subset X_s can be written as:

L_s(ϕ, θ; X_s) = −D_KL(q_ϕ(z|X_s) || p_θ(z)) + E_{z∼q_ϕ(z|X_s)}[log p_θ(X_s|z)]     (3)

The difference between L(ϕ, θ; X) in Eq. 1 and L_s(ϕ, θ; X_s) in Eq. 3 is that the ELBO for a subset X_s is not by itself a valid bound on log p(X). Additional sampling from P(X) in the optimization objective, as in Eq. 4, is needed for theoretical completeness:

arg max_ϕ ∑_{X_s∼P(X), π∈S_n} L_s(ϕ, θ; π⋅X_s)     (4)

where π is a randomly generated permutation of the input subset X_s. This sampling process is inexpensive if we combine the sampling of subsets with the sampling of mini-batches during training. By assuming a Gaussian form for the latent variable z and applying the reparameterization technique, the inference process of the SMVAE can be written as:

q(z|X_s) = N(z; µ, σ²I), ϵ ∼ N(0, I)     (5)
z := µ + σ ⊙ ϵ     (6)
µ, log σ² := g_ϕ(E_1(x_1), ⋯, E_m(x_m))     (7)

where E_i is the embedding network for the i-th modality, g_ϕ(⋅) is a neural network with trainable parameters ϕ that outputs the parameters of the latent posterior distribution (i.e., µ and σ), and ⊙ denotes element-wise multiplication.
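The reparameterized sampling step z := µ + σ ⊙ ϵ in Eq. 6 can be sketched in a few lines; this toy check (our own, with arbitrary values) confirms that samples drawn this way are centered on µ:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): a differentiable sampling
    path, so gradients can flow through mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

mu, logvar = np.array([1.0, -1.0]), np.array([0.0, 0.0])
zs = np.stack([reparameterize(mu, logvar, rng) for _ in range(20000)])
```

Averaged over many draws, the sample mean converges to µ while the randomness is "outsourced" to ϵ, which is exactly the noise-outsourcing structure the SMVAE exploits in Section 3.3.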
For the generation process, it is desirable to model the joint likelihood of the modalities conditioned on the latent variable, p_θ(X_s, z) = p(z)p_θ(X_s|z), so that the model can more easily utilize information from other available modalities when generating a complex modality. However, for ease of implementation, we assign M separate decoders D_1, ⋯, D_M to the possible modalities, i.e., p_θ(X_s|z) = [D_θ1(z), ⋯, D_θM(z)]. We find empirically that, without loss of generality, using L2-normalization as additional regularization to pull the parameters µ and σ of the inference network toward 0 and 1, respectively, facilitates learning efficiency, because the gradient of the ELBO often favors the reconstruction term over the regularization term.

3.3. SET REPRESENTATION FOR JOINT DISTRIBUTION

The scalability issue stems from the requirement of an inference process for the powerset P(X). We achieve scalability by using the noise-outsourced functional representation, i.e., z = g(ϵ, X_s), to bridge the gap between deterministic set functions and a stochastic function. The properties of the deterministic function can thus be passed to the stochastic distribution under mild conditions Bloem-Reddy & Teh (2020). With this foundation, the problem of modeling the posterior for the powerset immediately reduces to designing a differentiable deterministic function with the desired invariance and elasticity properties. Specifically, we identify four critical requirements for weakly-supervised multimodal learning: the model should 1) be scalable in the number of observable modalities; 2) be able to process input modality sets of arbitrary size and permutation; 3) satisfy Theorem 1; and 4) be able to learn the co-representation among all modalities.

Theorem 1 A valid set function f(x) is invariant to the permutation of instances if and only if it can be decomposed in the form Φ(∑ Ψ(x)), for suitable transformations Φ and Ψ.

An oversimplified example of such a set function is the summation or product used in MVAE Wu & Goodman (2018) and MMVAE Shi et al. (2019). Pooling operations such as average pooling or max pooling also fit the definition. However, these set aggregation operations require additional factorization assumptions on the joint posterior and ultimately prevent the VAE from learning a co-representation of the input modalities, as aggregation is applied only at the decision level. To establish the inductive bias of inter-modality correlation, the self-attention mechanism without positional embeddings is a reasonable choice Edelman et al. (2022); Shvetsova et al. (2022). Therefore, the proposed SMVAE leverages self-attention as the deterministic set function to aggregate the embeddings of the multimodal inputs.
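The Φ(∑ Ψ(x)) decomposition of Theorem 1 can be demonstrated with a tiny numpy sketch (our own toy instantiation, with linear maps plus tanh standing in for Ψ and Φ): because the elements are summed before the outer transform, reordering them cannot change the output.

```python
import numpy as np

def deepsets(X, psi_w, phi_w):
    """Theorem-1 form: Phi( sum_i Psi(x_i) ), with Psi and Phi realized here
    as a linear map followed by tanh."""
    h = np.tanh(X @ psi_w)                 # per-element transform Psi
    return np.tanh(h.sum(axis=0) @ phi_w)  # permutation-invariant pooling + Phi

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))            # a "set" of 5 elements
psi_w = rng.standard_normal((3, 8))
phi_w = rng.standard_normal((8, 2))
out = deepsets(X, psi_w, phi_w)
out_perm = deepsets(X[::-1], psi_w, phi_w)  # same set, reversed element order
```

Both calls produce identical outputs, which is exactly the invariance property required of g_ϕ before it can be lifted to a permutation-invariant posterior via noise outsourcing.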
Given a query Q, keys K, and values V, the attention function is denoted Att(Q, K, V) = ω(QKᵀ/√d_k)V, where K ∈ R^{m×d_k} and V ∈ R^{m×d_v} are m vectors of dimensions d_k and d_v, Q ∈ R^{n×d_q} are n vectors of dimension d_q, and ω is the softmax activation function. In our case, the key-value pairs represent the m available embeddings of the input modalities, m ≤ M. Each modality is mapped into a d-dimensional embedding space by a modality-specific embedding network. By measuring the compatibility of the corresponding keys with the query Q, information that is shared among modalities is aggregated into a co-representation. In practice, we utilize the multi-head extension of self-attention, MultiHead(Q, K, V, h) = Concat(A_1, ⋯, A_h)W^O, where A_i = Att(QW_i^Q, KW_i^K, V W_i^V) is obtained from the i-th attention head with projection parameters W_i^Q ∈ R^{(d/h)×d_q}, W_i^K ∈ R^{(d/h)×d_k}, W_i^V ∈ R^{(d/h)×d_v}, and W^O ∈ R^{d_v×d}; h denotes the total number of attention heads and d the dimension of the projections for keys, values, and queries. Inspired by Lee et al. (2019), we design our deterministic set representation function g_ϕ(X_s) as follows:

g_ϕ(X_s) := H + f_s(H),   H = I + MultiHead(I, X_s, X_s, h)     (8)

where I ∈ R^{1×d_v} is a d_v-dimensional trainable vector serving as the query for the multimodal embeddings, and f_s is a fully-connected layer. By computing attention weights between I and each embedding, I plays two roles: it acts as an aggregation vector that keeps the number of output vectors from g_ϕ(X_s) constant regardless of the number of input embeddings, and it selects relevant information from each embedding based on a similarity measurement. The former makes g_ϕ(X_s) a suitable permutation-invariant set-processing function, while the latter yields the desired co-representation among modalities.
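The seed-query aggregation H = I + Att(I, X_s, X_s) can be sketched in a single-head, single-layer form (our own simplified version without learned projections or the f_s layer): the trainable seed I attends over however many modality embeddings are present, and the output size stays fixed.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_pool(I, X):
    """Single-head sketch of H = I + Att(I, X, X): the seed vector I (1 x d)
    attends over the m available modality embeddings X (m x d), so the output
    is always (1 x d) no matter how many modalities are present."""
    d = X.shape[-1]
    w = softmax(I @ X.T / np.sqrt(d))   # (1, m) attention weights
    return I + w @ X                    # residual + weighted sum of embeddings

rng = np.random.default_rng(2)
I = rng.standard_normal((1, 4))
X = rng.standard_normal((3, 4))          # embeddings of 3 available modalities
h = attention_pool(I, X)
h_perm = attention_pool(I, X[[2, 0, 1]])  # same modalities, different order
```

Because the softmax weights permute together with the rows of X, the weighted sum is unchanged under reordering, so the pooled representation is permutation invariant while still weighting each modality by its similarity to the seed.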
Finally, since the set representation function g_ϕ(X_s) is invariant to input permutations across different input sizes, we obtain an invariant probabilistic inference function that satisfies Definition 1 through the noise-outsourced process shown in Eq. 6. Thus, by introducing the set representation function into the noise-outsourced process, the SMVAE is readily a scalable multimodal model for any subset of modalities.

3.4. TOTAL CORRELATION OPTIMIZATION WITHOUT CONDITION INDEPENDENCE

The lower bound for the multimodal data without factorizing the joint posterior (i.e., Eq. 1) provides additional information about the correlations among modalities during the optimization process compared to factorized methods. It is noteworthy that both MVAE and MMVAE depend on the assumption of conditional independence between modalities in their factorizations. Without loss of generality, the relation between L(ϕ, θ; X) and the factorized bound L_CI can be shown as:

L(ϕ, θ; X) = E_{q_ϕ(z|X)}[log (p_θ(z) ∏_{i=1}^M p_θ(x_i|z)) / q_ϕ(z|X) + log p_θ(X, z) / (p_θ(z) ∏_{i=1}^M p_θ(x_i|z))]
           = L_CI + E_{q_ϕ(z|X)}[log p_θ(X|z) / ∏_{i=1}^M p_θ(x_i|z)]     (9)

where X ≡ (x_1, ⋯, x_M) and L_CI is the lower bound for a factorizable generative process as in MVAE or MMVAE. Specifically, letting q(X) denote the empirical distribution of the multimodal dataset, we have:

E_{q(X)}[L(ϕ, θ; X)] = E_{q(X)}[L_CI] + E_{q(X)q_ϕ(z|X)}[ E_{X∼p_θ(X|z)}[log p_θ(X|z) / ∏_{i=1}^M p_θ(x_i|z)] ]     (10)

where the inner expectation is the conditional total correlation. This reveals that, without assuming a factorizable generative process and enforcing conditional independence among modalities, our optimization objective naturally models the conditional total correlation, which captures the dependency among the input modalities Watanabe (1960); Studenỳ & Vejnarová (1998). Therefore, the SMVAE has the additional advantage of learning correlations among the different modalities of the same event, which is exactly what is desired for a good co-representation.
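To build intuition for the total correlation term E[log p(X|z) / ∏ p(x_i|z)], consider the analogous closed form for a Gaussian: the total correlation equals the sum of marginal entropies minus the joint entropy, and it vanishes exactly when the coordinates are independent. A small illustrative sketch (our own; this is the unconditional Gaussian analogue, not the paper's estimator):

```python
import numpy as np

def gaussian_total_correlation(cov):
    """Total correlation of a zero-mean Gaussian:
    TC = sum_i H(x_i) - H(x) = 0.5 * (sum_i log var_i - log det cov).
    Zero if and only if the covariance is diagonal (independent coordinates)."""
    var = np.diag(cov)
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(var)) - logdet)

rho = 0.8  # correlation between two "modalities"
tc = gaussian_total_correlation(np.array([[1.0, rho], [rho, 1.0]]))
tc_indep = gaussian_total_correlation(np.eye(2))   # independent case -> 0
```

For the correlated pair, TC = −0.5 log(1 − ρ²) > 0, which is precisely the dependency information that a conditionally independent (factorized) decoder discards and the SMVAE objective retains.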

4.1. EXPERIMENT SETTINGS

We make use of uni-modal datasets including MNIST LeCun et al. (1998), FashionMNIST Xiao et al. (2017), and CelebA Liu et al. (2015) to evaluate the performance of the proposed SMVAE and compare with other state-of-the-art methods. We convert these uni-modal datasets into bi-modal datasets by transforming the labels into one-hot vectors as the second modality, as in Wu & Goodman (2018); Suzuki et al. (2016). For quantitative evaluation, we denote x_1 and x_2 as the image and text modalities and measure, on the test set, the marginal log-likelihood log p(x) ≈ log E_{q(z|⋅)}[p(x|z)p(z)/q(z|⋅)], the joint log-likelihood log p(x, y) ≈ log E_{q(z|⋅)}[p(z)p(x|z)p(y|z)/q(z|⋅)], and the marginal conditional log-probability log p(x|y) ≈ log E_{q(z|⋅)}[p(z)p(x|z)p(y|z)/q(z|⋅)] − log E_{p(z)}[p(y|z)], where q(z|⋅) denotes the importance distribution. For all multimodal VAE methods, we keep the architectures of the encoders and decoders consistent for a fair comparison. Detailed training configurations and network settings are listed in Appendix B. The marginal probabilities measure the model's ability to capture the data distribution, while the conditional log-probability measures classification performance. Higher scores on these metrics indicate that a model generates proper samples and converts between modalities better, which are the desirable properties of a generative model.
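The importance-sampling estimate log p(x) ≈ log E_{q(z|⋅)}[p(x|z)p(z)/q(z|⋅)] is computed in practice as a log-mean-exp of per-sample log-weights for numerical stability. A self-contained sketch (our own code), verified on a 1-d toy model z ∼ N(0,1), x|z ∼ N(z,1) whose true marginal is N(0,2), using the prior as the importance distribution:

```python
import numpy as np

def log_mean_exp(a):
    """Numerically stable log of the mean of exp(a)."""
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

def estimate_log_px(log_p_x_given_z, log_p_z, log_q_z):
    """K-sample importance estimate:
    log p(x) ~= log (1/K) sum_k exp(log p(x|z_k) + log p(z_k) - log q(z_k|x))."""
    return log_mean_exp(log_p_x_given_z + log_p_z - log_q_z)

rng = np.random.default_rng(3)
K, x = 200000, 0.5
z = rng.standard_normal(K)                       # proposals from q = p(z)
log_n = lambda v, m, s2: -0.5 * np.log(2 * np.pi * s2) - (v - m)**2 / (2 * s2)
est = estimate_log_px(log_n(x, z, 1.0), log_n(z, 0.0, 1.0), log_n(z, 0.0, 1.0))
true = log_n(x, 0.0, 2.0)                        # exact marginal: N(0, 2)
```

The estimate converges to the true marginal log-likelihood as K grows; in the paper's evaluation the same estimator is applied with K = 1000 and the learned q(z|⋅) as the importance distribution.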

4.2. GENERATION QUALITY AND QUANTITATIVE EVALUATION

We draw 1000 importance samples to estimate the probability metrics. Table 1 shows the quantitative results of the proposed SMVAE on each dataset. The SMVAE outperforms the other methods on almost all metrics. Its outstanding performance is mainly attributable to the direct modeling of the joint posterior distribution and the optimization of a more informative objective. Fig. 3, Fig. 4, and Fig. 5 show cross-modality image samples for each domain generated by the SMVAE model. Given only the text modality, the SMVAE can generate corresponding images of good quality. We further visualize the learned latent representation using t-SNE Hinton & Roweis (2002). As shown in Fig. 2, the latent space learned by the MVAE method produces a cohesive latent representation only when both modalities are present. When one modality is missing, the representations from their method are distributed irrespective of the semantic category of the data. On the other hand, although the MMVAE method achieves a cohesive representation for the single-modality posterior, its joint representation is less discriminative, indicating that merely combining uni-modal inference networks is insufficient to capture inter-modality co-representation. In contrast, our SMVAE achieves a discriminative latent space for both single- and joint-modality inputs thanks to its ability to exploit the shared information from different modalities.

4.3. CASE STUDY: COMPUTER VISION APPLICATION

We demonstrate that our SMVAE is able to learn image transformations including colorization, edge detection, facial landmark segmentation, image completion, and watermark removal. With the original image and each transformation as different modalities, we obtain 6 modalities in total by applying the different transformations to the ground-truth images. This case study demonstrates the SMVAE's ability to generate in multiple directions and combinations. Similar to Wu & Goodman (2018), for edge detection we use the Canny detector Canny (1986) from the Scikit-Image module Van der Walt et al. (2014) to extract edges of the facial image. For facial landmark segmentation, we use the Dlib tool King (2009) and OpenCV Bradski & Kaehler (2000). For colorization, we simply convert RGB colors to grayscale. For watermark removal, we add a watermark overlay to the original image. For image completion, we replace half of the image with black pixels. Fig. 6 shows samples generated by a trained SMVAE model. As can be seen in Fig. 6 (a), the SMVAE generates a good reconstruction of the facial landmark segmentation and the extracted edges. In Fig. 6 (b), we can see that the SMVAE is able to apply a reasonable facial color to the input grayscale image. Fig. 6 (c) demonstrates that the SMVAE can recover the image from the watermark and complete the image quite well; the reconstructed right half of the image is largely consistent with the left half of the original image. In Fig. 6 (d), all traces of the watermark are also removed. Although our reconstructed images suffer from the same blurriness problem that is shared by VAE methods Zhao et al. (2017), the SMVAE is able to perform cross-modality generation thanks to its ability to capture shared information among modalities. We further evaluate our method on the Vision&Touch dataset (2021), a real-world robot manipulation dataset that contains visual, tactile, control action, and robot proprioception data and thus possesses more diverse modalities.
The robotic arm attempts to insert the peg located on its tip into the target object. We use a total of 4 modalities: the depth images, the RGB images, the 6-axis force sensor feedback, and the control action given to the robotic arm at each time step. Fig. 7 (a) illustrates that when the robotic arm receives no force signal in the early steps, the reconstructed RGB images clearly show that the arm has no contact with the target box below. Only when the robotic arm receives high force readings does the generated image depict contact between the robotic arm and the target box. The quality of the reconstructed RGB and depth images also differs between partial and full observation. When only limited information is observed (i.e., force and action inputs), our method reconstructs RGB and depth images that properly reflect the relative position between the robotic arm and the target object (Fig. 7 (a)). When more information is presented, the latent variables carry more comprehensive information about the event and yield better reconstructions, as we removed the conditional independence assumption (Fig. 7 (b)).

5. CONCLUSION

This paper proposes a multimodal generative model that incorporates set representation learning into the VAE framework. Unlike previous multimodal VAE methods, the proposed SMVAE provides a scalable solution for multimodal data of variable size and permutation. Critically, our model learns the joint posterior distribution directly without additional factorization assumptions, yielding a more informative objective and the ability to achieve co-representation between modalities. Statistical and visualization results demonstrate that our method outperforms other state-of-the-art multimodal VAE methods. The application to cross-modality reconstruction on a robotic dataset further indicates that the proposed SMVAE has high potential in emerging multimodal tasks that must learn co-representations of diverse data sources while taking missing-modality or set-input processing problems into consideration. In the future, we will explore methods that extend the current SMVAE framework to more diverse modalities as well as dynamic multimodal sequences to provide solutions for real-world multimodal applications.



Figure 2: Visualization of the 2-D latent representations learned by our model compared to MVAE and MMVAE on the bi-modal MNIST dataset. Our model learns natural clusters while maintaining good structure for their joint distribution. Left: MVAE. Middle: SMVAE. Right: MMVAE.

Figure 3: Conditional image reconstruction of digits generated by SMVAE given text input.

Figure 4: Conditional image reconstruction of fashionMNIST generated by SMVAE given text input.

Figure 5: Generated face images by SMVAE given facial attributes.

Figure 6: Learning computer vision transformations: (a) Ground-truth images are randomly selected from the CelebA dataset; we present the generated reconstructed images, detected edges, and facial landmarks. (b) Reconstructed color images based on grayscale images. (c) Image recovery from an obscured image. (d) Image recovery from watermarked images.

Figure 7: Learning to reconstruct visual events in a robotics scenario. (a) Images sampled from the SMVAE conditioned on force and action inputs. The SMVAE is able to reconstruct different visual modalities that properly depict the relative position of the robotic arm to the target box. (b) From top to bottom, sampled reconstructed images from an action sequence.

Table 1: Statistical results. For each dataset and method, we report log p(x, y), log p(x), and log p(x|y), estimated with latent samples drawn from p(z|x, y) and from p(z|x).

Yifei Zhang, Désiré Sidibé, Olivier Morel, and Fabrice Mériaudeau. Deep multimodal fusion for semantic image segmentation: A survey. Image and Vision Computing, 105:104042, 2021.

Shengjia Zhao, Jiaming Song, and Stefano Ermon. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.

