MULTIMODAL VARIATIONAL AUTOENCODERS FOR SEMI-SUPERVISED LEARNING: IN DEFENSE OF PRODUCT-OF-EXPERTS Anonymous

Abstract

Multimodal generative models should learn a meaningful latent representation that enables coherent joint generation of all modalities (e.g., images and text). Many applications also require accurate sampling of some modalities conditioned on observations of the others. Because not all modalities may be observed for every training data point, semi-supervised learning should be possible. In this study, we evaluate a family of product-of-experts (PoE) based variational autoencoders that have these desired properties, including a novel PoE based architecture and training procedure. An empirical evaluation shows that the PoE based models can outperform an additive mixture-of-experts (MoE) approach. Our experiments support the intuition that PoE models are better suited for a conjunctive combination of modalities, while MoEs are better suited for a disjunctive fusion.

1. INTRODUCTION

Multimodal generative modelling is important because information about real-world objects typically comes in different representations, or modalities. The information provided by each modality may be erroneous and/or incomplete, and a complete reconstruction of the full information can often only be achieved by combining several modalities. For example, in image- and video-guided translation (Caglayan et al., 2019), additional visual context can potentially resolve ambiguities (e.g., noun genders) when translating written text. In many applications, modalities may be missing for a subset of the observed samples during training and deployment. Often the description of an object in one modality is easy to obtain, while annotating it with another modality is slow and expensive. Given two modalities, we call samples paired when both modalities are present and unpaired if one is missing. The simplest way to deal with paired and unpaired training examples is to discard the unpaired observations for learning. The smaller the share of paired samples, the more important the ability to additionally learn from the unpaired data becomes, referred to as semi-supervised learning in this context (following the terminology of Wu & Goodman, 2019; typically one would associate semi-supervised learning with learning from labelled and unlabelled data to solve a classification or regression task). Our goal is to provide a model that can leverage the information contained in unpaired samples and to investigate the capabilities of the model in situations of low supervision, that is, when only a few paired samples are available. While a modality can be as low dimensional as a label, which can be handled by a variety of discriminative models (van Engelen & Hoos, 2020), we are interested in high dimensional modalities, for example an image and a text caption.
Learning a representation of multimodal data that allows generating high-quality samples requires the following: 1) deriving a meaningful representation in a joint latent space for each high dimensional modality and 2) bridging the representations of different modalities in a way that preserves the relations between them. The latter means that we do not want the modalities to be represented orthogonally in the latent space; ideally, the latent space should encode the object's properties independently of the input modality. Variational autoencoders (Kingma & Welling, 2014) using a product-of-experts (PoE, Hinton, 2002; Welling, 2007) approach for combining input modalities are promising candidates for multimodal generative modelling with the desired properties, in particular the VAEVAE model developed by Wu & Goodman (2019) and a novel model termed SVAE, which we present in this study. Both models can handle multiple high dimensional modalities, which may not all be observed at training time. It has been argued that a product-of-experts (PoE) approach is not well suited for multimodal generative modelling using variational autoencoders (VAEs) in comparison to an additive mixture-of-experts (MoE). It has been shown empirically that the PoE-based MVAE (Wu & Goodman, 2018) fails to properly model two high-dimensional modalities in contrast to an (additive) MoE approach referred to as MMVAE, leading to the conclusion that "PoE factorisation does not appear to be practically suited for multi-modal learning" (Shi et al., 2019). This study sets out to test this conjecture for state-of-the-art multimodal VAEs. The next section summarizes related work. Section 3 introduces SVAE as an alternative PoE based VAE approach derived from axiomatic principles. Then we present our experimental evaluation of multimodal VAEs before we conclude.

2. BACKGROUND AND RELATED WORK

We consider multimodal generative modelling. We mainly restrict our considerations to two modalities x_1 ∈ X_1, x_2 ∈ X_2, where one modality may be missing at a time. Extensions to more modalities are discussed in the Experiments section and Appendix D. To address the problem of generative cross-modal modelling, one modality x_1 can be generated from another modality x_2 by simply using independently trained generative models (x_1 → x_2 and x_2 → x_1) or a composed but non-interchangeable representation (Wang et al., 2016; Sohn et al., 2015). However, the ultimate goal of multimodal representation learning is to find a meaningful joint latent code distribution bridging the two individual embeddings learned from x_1 and x_2 alone. This can be done by a two-step procedure that models the individual representations first and then applies an additional learning step to link them (Tian & Engel, 2019; Silberer & Lapata, 2014; Ngiam et al., 2011). In contrast, we focus on approaches that learn individual and joint representations simultaneously. Furthermore, our model should be able to learn in a semi-supervised setting. Kingma et al. (2014) introduced two models suitable for the case when one modality is high dimensional (e.g., an image) and another is low dimensional (e.g., a label), while our main interest is in modalities of high complexity. We consider models based on variational autoencoders (VAEs, Kingma & Welling, 2014; Rezende et al., 2014). Standard VAEs learn a latent representation z ∈ Z for a set of observed variables x ∈ X by modelling a joint distribution p(x, z) = p(z)p(x|z). In the original VAE, the intractable posterior q(z|x) and the conditional distribution p(x|z) are approximated by neural networks trained by maximising the ELBO

L = E_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ‖ N(0, I))    (1)

with respect to the parameters of the networks modelling q(z|x) and p(x|z). Here D_KL(· ‖ ·) denotes the Kullback-Leibler divergence.
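As a concrete illustration (our own sketch, not code from any of the cited papers), the KL term of this ELBO has a closed form for a diagonal-Gaussian posterior; the reconstruction term E_q[log p(x|z)] additionally requires a decoder network and is omitted here:

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, logvar):
    """Closed-form D_KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    summed over the latent dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0, axis=-1)

# Reparameterised sample z = mu + sigma * eps, used for the Monte Carlo
# estimate of the reconstruction term E_q[log p(x|z)].
rng = np.random.default_rng(0)
mu, logvar = np.array([0.5, -0.2]), np.array([0.0, 0.3])
z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
```

The KL term vanishes exactly when q(z|x) equals the prior, which is what drives uninformative latent dimensions toward N(0, I).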
Bi-modal VAEs that can handle a missing modality extend this approach by modelling q(z|x_1, x_2) as well as q_1(z|x_1) and q_2(z|x_2), which replace the single q(z|x). Multimodal VAEs may differ in 1) the way they approximate q(z|x_1, x_2), q_1(z|x_1) and q_2(z|x_2) by neural networks and/or 2) the structure of the loss function, see Figure 1. Typically, there are no conceptual differences in the decoding, and we model the decoding distributions in the same way for all methods considered in this study. Suzuki et al. (2017) introduced a model termed JMVAE (Joint Multimodal VAE), which belongs to the class of approaches that can only learn from paired training samples (what we refer to as the (fully) supervised setting). It approximates q(z|x_1, x_2), q_1(z|x_1) and q_2(z|x_2) with three corresponding neural networks and optimizes an ELBO-type loss of the form

L = E_{q(z|x_1,x_2)}[log p_1(x_1|z) + log p_2(x_2|z)] − D_KL(q(z|x_1, x_2) ‖ N(0, I))
    − D_KL(q(z|x_1, x_2) ‖ q_1(z|x_1)) − D_KL(q(z|x_1, x_2) ‖ q_2(z|x_2)).    (2)

The last two terms imply that the joint network output must be generated during learning, which requires paired samples. The MVAE (Multimodal VAE) model (Wu & Goodman, 2018) is the first multimodal VAE-based model allowing for missing modalities that does not require any additional network structures for learning the joint latent code distribution. The joint posterior is modelled using a product-of-experts (PoE) as q(z|x_{1:M}) = ∏_m q_m(z|x_m). For a missing modality, q_k(z|x_k) = 1 is assumed. The model allows for semi-supervised learning while keeping the number of model parameters low. The bridged model (Yadav et al., 2020) highlights the need for an additional network structure for approximating the joint latent code distribution while attempting to keep the advantages of the additional encoding networks.
It reduces the number of model parameters by introducing a bridge encoder consisting of one fully connected layer that takes the latent code vectors z_1 and z_2 generated from x_1 and x_2 and outputs the mean and variance of the joint latent code distribution. The arguably most advanced multimodal VAE model is VAEVAE by Wu & Goodman (2019), which we discuss in detail in the next section (see also Appendix C). Shi et al. (2019) proposed a MoE model termed MMVAE (Mixture-of-Experts Multimodal VAE). In the MMVAE model, the joint variational posterior for M modalities is approximated as q(z|x_{1:M}) = ∑_m α_m q_m(z|x_m) with α_m = 1/M. The model utilizes a loss function from the importance weighted autoencoder (IWAE, Burda et al., 2016) that computes a tighter lower bound than the VAE ELBO. The MoE formulation allows in principle training with a missing modality i by setting α_i = 0; however, Shi et al. (2019) do not highlight or evaluate this feature. Benchmarks in their paper compare MVAE (Wu & Goodman, 2018) and MMVAE, concluding that MVAE often fails to learn the joint latent code distribution. Because of these results and those presented by Wu & Goodman (2019), we did not include MVAE as a benchmark model in our experiments.
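The contrast between the two fusion rules is easiest to see for diagonal-Gaussian experts (a sketch of our own, not code from either paper): under a PoE, precisions add and the joint density concentrates where all experts agree, while under the uniform MoE one samples from a single expert chosen at random. MVAE additionally includes the prior N(0, I) as an expert, which corresponds to appending mu = 0, logvar = 0 to the parameter lists.

```python
import numpy as np

def poe_gaussian(mus, logvars):
    """PoE fusion of Gaussian experts q_m(z|x_m) = N(mu_m, diag(exp(logvar_m))):
    precisions add, and the joint mean is precision-weighted. A missing
    modality (q_k = 1) is handled by leaving its expert out of the lists."""
    precisions = [np.exp(-lv) for lv in logvars]
    total_precision = sum(precisions)
    joint_mu = sum(p * m for p, m in zip(precisions, mus)) / total_precision
    return joint_mu, -np.log(total_precision)

def moe_sample(mus, logvars, rng):
    """Draw one z from the MoE posterior (1/M) * sum_m N(mu_m, diag(exp(logvar_m))):
    pick an expert uniformly (alpha_m = 1/M), then sample from it."""
    m = rng.integers(len(mus))
    return mus[m] + np.exp(0.5 * logvars[m]) * rng.standard_normal(mus[m].shape)
```

Note the disjunctive character of the mixture: a MoE sample always comes from one expert, whereas the PoE mean is a compromise of all experts.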

3. VAEVAE AND SVAE

We developed a new approach as an alternative to VAEVAE. Both models 1) are VAE based; 2) allow for interchangeable cross-modal generation as well as learning a joint embedding; 3) allow for missing modalities at training time; and 4) can be applied to two similarly complex high dimensional modalities. Next, we briefly present our new model SVAE. Then we highlight the differences to VAEVAE. Finally, we state a newly derived objective function for training the models. We consider two modalities and refer to Appendix D for generalizations to more modalities.

SVAE Since both modalities might not be available for all samples, it should be possible to marginalize each of them out of q(z|x_1, x_2). While the individual encoding distributions q(z|x_1) and q(z|x_2) can be approximated by neural networks as in the standard VAE, we need to define a meaningful approximation of the joint encoding distribution q(z|x) = q(z|x_1, x_2). In the newly proposed SVAE model, these distributions are defined as follows:

q(z|x_1, x_2) = (1 / Z(x_1, x_2)) q_1(z|x_1) q_2(z|x_2)    (3)
q(z|x_1) = q_1(z|x_1) q*_2(z|x_1)    (4)
q(z|x_2) = q_2(z|x_2) q*_1(z|x_2)    (5)
q(z) = N(0, I)    (6)

The model is derived axiomatically; the proof is given in Appendix A. The desired properties of the model were that 1) when no modalities are observed, the generating distribution for the latent code is Gaussian; 2) the modalities are independent given the latent code; 3) both experts cover the whole latent space with equal probabilities; and 4) the joint encoding distribution q(z|x_1, x_2) is modelled by a PoE. The distributions q_1(z|x_1), q_2(z|x_2), q*_2(z|x_1) and q*_1(z|x_2) are approximated by neural networks. In case both observations are available, q(z|x_1, x_2) is approximated by applying the product-of-experts rule with q_1(z|x_1) and q_2(z|x_2) being the experts for each modality. In case of a missing modality, equation 4 or 5 is used.
If, for example, x_2 is missing, the distribution q*_2(z|x_1) takes over as a "replacement" expert, modelling the marginalization over x_2.

SVAE vs. VAEVAE The VAEVAE model (Wu & Goodman, 2019) is the most similar to ours. Wu & Goodman define two variants, which can be derived from the SVAE model in the following way. Variant (a) can be obtained by setting q*_2(z|x_1) = q*_1(z|x_2) = 1. Variant (b) is obtained from (a) by additionally using a separate network to model q(z|x_1, x_2). Having a joint network q(z|x_1, x_2) is the most straightforward way of capturing the inter-dependencies of the two modalities. However, the joint network cannot be trained on unpaired data, which becomes relevant when the share of supervised data gets smaller. Variant (a) also uses the product-of-experts rule to model the joint distribution of the two modalities, but does not ensure that both experts cover the whole latent space (in contrast to SVAE, see equation A.14 in the appendix), which can lead to the individual latent code distributions diverging. Based on this consideration and the experimental results from Wu & Goodman (2019), we focused on benchmarking VAEVAE (b) and refer to it simply as VAEVAE in Section 4. SVAE resembles VAEVAE in the need for additional networks besides one encoder per modality and in the structure of the ELBO loss. It does, however, solve the problem of learning the joint embeddings in a way that allows learning the parameters of the approximated q(z|x_1, x_2) using all available samples, i.e., both paired and unpaired. If q(z|x_1, x_2) is approximated with a joint network that accepts concatenated inputs, as in JMVAE and VAEVAE (b), the weights of q(z|x_1, x_2) can only be updated on the paired share of the samples. If q(z|x_1, x_2) is approximated with a PoE of decoupled networks as in SVAE, the weights are updated for each sample, whether paired or unpaired, which is the key differentiating feature of SVAE compared to existing architectures.
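The dispatch between equations 3-5 can be sketched as follows, assuming diagonal-Gaussian experts fused by a precision-weighted product (the function signatures are our own illustration, not the paper's code):

```python
import numpy as np

def svae_posterior(x1, x2, q1, q2, q2_star, q1_star):
    """Return (mu, logvar) of the SVAE encoding distribution, choosing the
    experts according to which modalities are observed (equations 3-5).
    Each q* argument is a callable mapping an input to (mu, logvar)."""
    if x1 is not None and x2 is not None:   # q(z|x1,x2) ~ q1(z|x1) q2(z|x2)
        experts = [q1(x1), q2(x2)]
    elif x1 is not None:                    # q(z|x1) = q1(z|x1) q2*(z|x1)
        experts = [q1(x1), q2_star(x1)]
    else:                                   # q(z|x2) = q2(z|x2) q1*(z|x2)
        experts = [q2(x2), q1_star(x2)]
    mus, logvars = zip(*experts)
    precisions = [np.exp(-lv) for lv in logvars]
    total = sum(precisions)                 # precisions add under a PoE
    return sum(p * m for p, m in zip(precisions, mus)) / total, -np.log(total)
```

Because every branch fuses two decoupled networks, each gradient step trains part of the joint posterior regardless of whether the sample is paired.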
A New Objective Function When developing SVAE, we devised a novel ELBO-type loss:

L = E_{p_paired(x_1,x_2)}[ E_{q(z|x_1,x_2)}[log p_1(x_1|z) + log p_2(x_2|z)]
        − D_KL(q(z|x_1, x_2) ‖ p(z|x_1)) − D_KL(q(z|x_1, x_2) ‖ p(z|x_2)) ]
    + E_{p_paired(x_1)}[ E_{q(z|x_1)}[log p_1(x_1|z)] − D_KL(q(z|x_1) ‖ p(z)) ]
    + E_{p_paired(x_2)}[ E_{q(z|x_2)}[log p_2(x_2|z)] − D_KL(q(z|x_2) ‖ p(z)) ]    (7)
L_1 = E_{p_unpaired(x_1)}[ E_{q(z|x_1)}[log p_1(x_1|z)] − D_KL(q(z|x_1) ‖ p(z)) ]    (8)
L_2 = E_{p_unpaired(x_2)}[ E_{q(z|x_2)}[log p_2(x_2|z)] − D_KL(q(z|x_2) ‖ p(z)) ]    (9)
L_comb = L + L_1 + L_2    (10)

4. EXPERIMENTS

For an unbiased evaluation, we considered the same test problems and performance metrics as Shi et al. (2019). In addition, we designed an experiment referred to as MNIST-Split that was supposed to be well suited for PoE. In all experiments we kept the network architectures as similar as possible (see Appendix F). For the new benchmark problem, we constructed a multi-modal dataset where the modalities are similar in dimensionality as well as complexity and provide missing information to each other rather than duplicating it. The latter should favor PoE modelling, which suits an "AND" combination of the modalities, over MoE modelling, which is more aligned with an "OR" combination. We measured performance at different supervision levels for each dataset (e.g., a 10% supervision level means that 10% of the training set samples were paired and the remaining 90% were unpaired).

Image and image: MNIST-Split We created an image reconstruction dataset based on MNIST digits (LeCun et al., 1998). The images were split horizontally into equal parts, either two or three depending on the experimental setting. These regions are considered as different input modalities. In the above notion of "AND" and "OR" tasks we implicitly assume an additional modality, which is the image label in this case.
The fact that the correct digit can sometimes be guessed from only one part of the image makes the new MNIST-Split benchmark a mixture of an "AND" and an "OR" task. This is in contrast to the MNIST-SVHN task described below, which can be regarded as an almost pure "OR" task.

TWO MODALITIES: MNIST-SPLIT. In the bi-modal version, referred to as MNIST-Split, the MNIST images were split into top and bottom halves of equal size, and the halves were then used as the two modalities. We tested the quality of the image reconstruction given one or both modalities by predicting the label of the reconstructed image with an independent oracle network, a ResNet-18 (He et al., 2016) trained on the original MNIST dataset. The evaluation metrics were joint coherence, synergy, and cross-coherence. For measuring joint coherence, 1000 latent space vectors were generated from the prior and both halves of an image were then reconstructed with the corresponding decoding networks. The concatenated halves yield the fully reconstructed image. Since ground truth class labels do not exist for the randomly sampled latent vectors, we could only perform a qualitative evaluation, see Figure 3. Synergy was defined as the accuracy of the image reconstruction given both halves. Cross-coherence considered the reconstruction of the full image from one half and was defined as the fraction of class labels correctly predicted by the oracle network. The quantitative results are shown in Table 1 and Figure 4. All PoE architectures clearly outperformed MMVAE, even when trained at low supervision levels. In this experiment, it is important that both experts agree on a class label. Thus, as expected, the multiplicative PoE fits the task much better than the additive mixture. Utilizing the novel loss function (10) gave the best results at very low supervision (SVAE and VAEVAE*).

THREE MODALITIES: MNIST-SPLIT-3.
We compared a simple generalization of the SVAE model to more than two modalities with the canonical extension of the VAEVAE model (both defined in Appendix D) on the MNIST-Split-3 data, the 3-modal version of the MNIST-Split task. Figure 6 shows that SVAE performed better when looking at the reconstructions of the individual modalities. While the number of parameters in the bi-modal case is the same for SVAE and VAEVAE, it grows exponentially with the number of modalities n for VAEVAE and stays of order n^2 for SVAE, see Figure 5 and Appendix D for details.

Figure 5: The SVAE and VAEVAE network architectures for the 3-modal case. The number of parameters is kn^2 for SVAE and kn2^{n−1} for VAEVAE, where n is the number of modalities and k is the number of parameters in one encoding network.

Figure 6: MNIST-Split-3 dataset, reproducing the logic of MNIST-Split for images split into three parts.

Image and image: MNIST-SVHN The first dataset considered by Shi et al. (2019) is constructed by pairing MNIST and SVHN (Netzer et al., 2011) images showing the same digit. This dataset shares some properties with MNIST-Split, but the relation between the two modalities is different: the digit class is derived from the concatenation of the two modalities in MNIST-Split, while in MNIST-SVHN it can be derived from either modality alone, which corresponds to an "OR" combination of the modalities and favors the MoE architecture. As before, oracle networks were trained to predict the digit classes of MNIST and SVHN images. Joint coherence was again computed based on 1000 latent space vectors generated from the prior. Both images were then reconstructed with the corresponding decoding networks. A reconstruction was considered correct if the predicted digit classes of MNIST and SVHN were the same. Cross-coherence was measured as above. Figure 2 shows examples of paired image reconstructions from the randomly sampled latent space of the fully supervised VAEVAE model.
The digit next to each reconstruction shows the predicted digit class for that image. The quantitative results in Figure 7 show that all three PoE based models reached a joint coherence similar to MMVAE, with VAEVAE scoring even higher. The cross-coherence results are shown in Figure 7(b) and (c).

Image and text: CUB-Captions The second benchmark considered by Shi et al. (2019) is the CUB Images-Captions dataset (Wah et al., 2011) containing photos of birds and their textual descriptions. Here the modalities are of a different nature but similar in dimensionality and information content. We used the source code by Shi et al. to compute the same evaluation metrics as in the MMVAE study. Canonical correlation analysis (CCA) was used for estimating joint and cross-coherences of images and text (Massiceti et al., 2018). The projection matrices W_x for images and W_y for captions were pre-computed using the training set of CUB Images-Captions and are available as part of the source code. Given a new image-caption pair x̃, ỹ, we computed the correlation between the two by

corr(x̃, ỹ) = φ(x̃)^T φ(ỹ) / (‖φ(x̃)‖ ‖φ(ỹ)‖),  where  φ(k) = W_k^T k − avg(W_k^T k).

We employed the same image generation procedure as in the MMVAE study. Instead of creating the images directly, we generated 2048-d feature vectors using a pre-trained ResNet-101. To find the resulting image, a nearest-neighbour lookup with Euclidean distance was performed. A CNN encoder and decoder were used for the two modalities (see Table F.5 and Table F.6). Prior to computing the correlations, the captions were converted to 300-d vectors using FastText (Bojanowski et al., 2017). As in the experiments before, we used the same network architectures and hyperparameters as Shi et al. (2019). We sampled 1000 latent space vectors from the prior distribution. Images and captions were then reconstructed with the decoding networks. The joint coherence was then computed as the CCA correlation for the resulting image and caption averaged over the 1000 samples.
Cross-coherence was computed from caption to image and vice versa using the CCA correlation averaged over the whole test set. As can be seen in Figure 8, VAEVAE showed the best performance among all models. With full supervision, the VAEVAE model outperformed MMVAE in all three metrics. The cross-coherence of the three PoE models was higher than or equal to that of MMVAE except at very low supervision levels. All three PoE based models were consistently better than MVAE.
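The CCA-based coherence metric reduces to a cosine similarity between centred projections. A sketch of the correlation above (our illustration; in the actual metric, avg(·) is a mean precomputed on the training set, which we approximate here by the per-vector mean):

```python
import numpy as np

def cca_correlation(x, y, W_x, W_y):
    """corr(x, y) = phi(x)^T phi(y) / (||phi(x)|| * ||phi(y)||),
    with phi(k) = W_k^T k - avg(W_k^T k)."""
    def phi(v, W):
        projected = W.T @ v          # project with the precomputed CCA matrix
        return projected - projected.mean()
    px, py = phi(x, W_x), phi(y, W_y)
    return float(px @ py / (np.linalg.norm(px) * np.linalg.norm(py)))
```

By construction the score lies in [−1, 1] and equals 1 when both modalities project to the same direction in the shared CCA space.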

5. DISCUSSION AND CONCLUSIONS

We studied bi-modal variational autoencoders (VAEs) based on a product-of-experts (PoE) architecture, in particular VAEVAE as proposed by Wu & Goodman (2019) and a new model SVAE, which we derived in an axiomatic way and which generalizes the VAEVAE architecture. The models learn representations that allow coherent sampling of the modalities and accurate sampling of one modality given the other. They work well in the semi-supervised setting, that is, not all modalities need to be observed at all times during training. It has been argued that the mixture-of-experts (MoE) approach MMVAE is preferable to a PoE for multimodal VAEs (Shi et al., 2019), in particular in the fully supervised setting (i.e., when all data are paired). This conjecture was based on a comparison with the MVAE model (Wu & Goodman, 2018), but it is refuted by our experiments showing that VAEVAE and our newly proposed SVAE can outperform MMVAE on the experiments conducted by Shi et al. (2019). Intuitively, MoEs are more tailored towards an "OR" (additive) combination of the information provided by the different modalities, while PoEs are more tailored towards an "AND" (multiplicative) combination. This is demonstrated by our experiments on halved digit images, where a conjunctive combination is helpful and the PoE models perform much better than MMVAE. We also extended SVAE and VAEVAE to the 3-modal case and showed that SVAE gives better reconstructions of the individual modalities while having fewer parameters than VAEVAE.
Given equation A.15 and equation A.14 we obtain

q(z) = ∫ q(z|x) p(x) dx = ∫∫ q(z|x_1, x_2) p(x_1, x_2) dx_1 dx_2    (A.15)
     = ∫∫ (1 / Z(x_1, x_2)) q_1(z|x_1) q_2(z|x_2) p(x_1) p(x_2|x_1) dx_1 dx_2.    (A.16)

Let us define

q*_j(z|x_i) = ∫ (1 / Z(x_i, x_j)) q_j(z|x_j) p(x_j|x_i) dx_j    (A.17)

and, starting from equation A.16, write

q(z) = ∫ q_1(z|x_1) p(x_1) [ ∫ (1 / Z(x_1, x_2)) q_2(z|x_2) p(x_2|x_1) dx_2 ] dx_1
     = ∫ p(x_1) q_1(z|x_1) q*_2(z|x_1) dx_1.    (A.18)

So the proposal distributions are:

q(z|x_1, x_2) = (1 / Z(x_1, x_2)) q_1(z|x_1) q_2(z|x_2)    (A.19)
q(z|x_1) = q_1(z|x_1) q*_2(z|x_1)    (A.20)
q(z|x_2) = q_2(z|x_2) q*_1(z|x_2)    (A.21)
q(z) = N(0, I)    (A.22)
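The normalisation Z(x_1, x_2) in equation A.19 stays tractable for Gaussian experts because the renormalised product of two Gaussians is again Gaussian, with precisions adding and a precision-weighted mean. A quick numerical check of this fact (our illustration, with arbitrary example parameters):

```python
import numpy as np

def normal_pdf(z, mu, var):
    return np.exp(-0.5 * (z - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

grid = np.linspace(-12.0, 12.0, 24001)
dz = grid[1] - grid[0]

# Two example experts, q_1 = N(1, 2) and q_2 = N(-1, 0.5)
product = normal_pdf(grid, 1.0, 2.0) * normal_pdf(grid, -1.0, 0.5)
product /= product.sum() * dz            # divide by Z(x_1, x_2)

# Closed form: precisions add, mean is precision-weighted
precision = 1.0 / 2.0 + 1.0 / 0.5
mean = (1.0 / 2.0 - 1.0 / 0.5) / precision
closed_form = normal_pdf(grid, mean, 1.0 / precision)
```

The numerically renormalised product and the closed-form Gaussian coincide up to discretisation error.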

B DERIVATION OF THE LOSS FUNCTION

In the following, we derive the ELBO-type loss we use for training, see equation 7. Consider the data log-likelihood

E_{p_Data(x_1,x_2)}[log p(x_1, x_2)]
  = (1/2) E_{p_Data(x_1,x_2)}[log p(x_1|x_2) + log p(x_2) + log p(x_2|x_1) + log p(x_1)]
  = (1/2) E_{p_Data(x_1,x_2)}[log p(x_1|x_2)] + (1/2) E_{p_Data(x_1,x_2)}[log p(x_2|x_1)]
    + (1/2) E_{p_Data(x_2)}[log p(x_2)] + (1/2) E_{p_Data(x_1)}[log p(x_1)].    (B.23)

We can now proceed by finding lower bounds for each term. For the last two terms log p(x_i) we can use the standard ELBO as given in equation 1. This gives the terms

L_i = E_{p_Data(x_i)}[ E_{q(z|x_i)}[log p_i(x_i|z)] − D_KL(q(z|x_i) ‖ p(z)) ].    (B.24)

Next, we bound log p(x_1|x_2). This we can do in terms of a conditional VAE (Sohn et al., 2015), where we condition all terms on x_2 (or on x_1 if we model log p(x_2|x_1)). The model we derive the log-likelihood for is p(x_1|x_2) = ∫ p(x_1|z) p(z|x_2) dz, where p(z|x_2) now plays the role of the prior. By model assumption we further have p(x_1, x_2, z) = p(x_1|z) p(x_2|z) p(z) and therefore p(x_1|x_2, z) = p(x_1|z). Thus we arrive at the ELBO losses

L_12 = E_{p_Data(x_1,x_2)}[ E_{q(z|x_1,x_2)}[log p_1(x_1|z)] − D_KL(q(z|x_1, x_2) ‖ p(z|x_2)) ]    (B.25)

and

L_21 = E_{p_Data(x_1,x_2)}[ E_{q(z|x_1,x_2)}[log p_2(x_2|z)] − D_KL(q(z|x_1, x_2) ‖ p(z|x_1)) ].    (B.26)

We now insert the bounds into equation B.23 and arrive at

2 E_{p_Data(x_1,x_2)}[log p(x_1, x_2)] ≥ L_12 + L_21 + L_1 + L_2
  = E_{p_Data(x_1,x_2)}[ E_{q(z|x_1,x_2)}[log p_1(x_1|z)] − D_KL(q(z|x_1, x_2) ‖ p(z|x_2)) ]
  + E_{p_Data(x_1,x_2)}[ E_{q(z|x_1,x_2)}[log p_2(x_2|z)] − D_KL(q(z|x_1, x_2) ‖ p(z|x_1)) ]
  + E_{p_Data(x_1)}[ E_{q(z|x_1)}[log p_1(x_1|z)] − D_KL(q(z|x_1) ‖ p(z)) ]
  + E_{p_Data(x_2)}[ E_{q(z|x_2)}[log p_2(x_2|z)] − D_KL(q(z|x_2) ‖ p(z)) ].

The first two reconstruction terms together give

E_{p_Data(x_1,x_2)}[ E_{q(z|x_1,x_2)}[log p_1(x_1|z) + log p_2(x_2|z)] ].    (B.27)

We do not know the conditional prior p(z|x_i). By definition of the VAE, we are allowed to optimize the prior; therefore we can parameterize and optimize it.
However, we know that in an optimal model p(z|x_i) ≈ q(z|x_i), and it might be possible to prove that, if p(z|x_i) is learnt in the same model class as q(z|x_i), the optimum is indeed p(z|x_i) = q(z|x_i). Inserting this choice into the bound gives the end result.

C TRAINING PROCEDURE FOR SVAE AND VAEVAE*

Algorithm 1: Training procedure for SVAE and VAEVAE*. In bold are the terms that differ from Wu & Goodman (2019).
Input: paired example (x_1, x_2), unpaired example x_1, unpaired example x_2
z ∼ q(z|x_1, x_2)
z_{x_1} ∼ q_1(z|x_1)
z_{x_2} ∼ q_2(z|x_2)
d_1 = D_KL(q(z|x_1, x_2) ‖ q_1(z|x_1)) + D_KL(q_1(z|x_1) ‖ p(z))
d_2 = D_KL(q(z|x_1, x_2) ‖ q_2(z|x_2)) + D_KL(q_2(z|x_2) ‖ p(z))
L = log p_1(x_1|z) + log p_2(x_2|z) + log p_1(x_1|z_{x_1}) + log p_2(x_2|z_{x_2}) − d_1 − d_2
L_{x_1} = log p_1(x_1|z_{x_1}) − D_KL(q_1(z|x_1) ‖ p(z))
L_{x_2} = log p_2(x_2|z_{x_2}) − D_KL(q_2(z|x_2) ‖ p(z))
L_comb = L + L_{x_1} + L_{x_2}
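For diagonal-Gaussian encoders, the d_1 and d_2 regularisers of Algorithm 1 can be computed in closed form. The sketch below is our own illustration (the reconstruction log-likelihoods are decoder-specific and omitted); it assumes the joint posterior is the precision-weighted PoE of the two experts:

```python
import numpy as np

def kl_gaussians(mu_q, lv_q, mu_p, lv_p):
    """D_KL( N(mu_q, diag(e^lv_q)) || N(mu_p, diag(e^lv_p)) )."""
    return 0.5 * np.sum(
        lv_p - lv_q + (np.exp(lv_q) + (mu_q - mu_p) ** 2) / np.exp(lv_p) - 1.0,
        axis=-1,
    )

def svae_kl_terms(mu1, lv1, mu2, lv2):
    """The d_1, d_2 terms of Algorithm 1 for one paired example."""
    prec1, prec2 = np.exp(-lv1), np.exp(-lv2)
    mu_joint = (prec1 * mu1 + prec2 * mu2) / (prec1 + prec2)  # PoE fusion
    lv_joint = -np.log(prec1 + prec2)
    zeros = np.zeros_like(mu1)
    d1 = kl_gaussians(mu_joint, lv_joint, mu1, lv1) + kl_gaussians(mu1, lv1, zeros, zeros)
    d2 = kl_gaussians(mu_joint, lv_joint, mu2, lv2) + kl_gaussians(mu2, lv2, zeros, zeros)
    return d1, d2
```

Each d_i pulls the individual expert both toward the joint posterior and toward the prior, mirroring the KL terms in equations 7-9.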

D SVAE AND VAEVAE FOR MORE THAN TWO MODALITIES

In the following, we formalize the VAEVAE model for three modalities and present a naïve extension of the SVAE model to more than two modalities. In the canonical extension of VAEVAE to three modalities, the three- and two-modal relations are captured by corresponding networks q(z|x_1, x_2, x_3), q(z|x_i, x_j) and q(z|x_i) for i, j ∈ {1, 2, 3}, see Figure 5. In the general n-modal case, the model has 2^n networks. For n = 3, the loss function reads:

L_{1,2,3} = E_{p_paired(x_1,x_2,x_3)}[ E_{q(z|x_1,x_2,x_3)}[log p_1(x_1|z) + log p_2(x_2|z) + log p_3(x_3|z)]
    − D_KL(q(z|x_1, x_2, x_3) ‖ q(z|x_1, x_2)) − D_KL(q(z|x_1, x_2, x_3) ‖ q(z|x_2, x_3))
    − D_KL(q(z|x_1, x_2, x_3) ‖ q(z|x_1, x_3)) ]    (D.28)
L_{ij} = E_{p_paired(x_1,x_2,x_3)}[ E_{q(z|x_i,x_j)}[log p_i(x_i|z) + log p_j(x_j|z)]
    − D_KL(q(z|x_i, x_j) ‖ q(z|x_1)) − D_KL(q(z|x_i, x_j) ‖ q(z|x_2))
    − D_KL(q(z|x_i, x_j) ‖ q(z|x_3)) − D_KL(q(z|x_i, x_j) ‖ q(z)) ]    (D.29)
L_i = E_{p_unpaired(x_i)}[ E_{q(z|x_i)}[log p_i(x_i|z)] − D_KL(q(z|x_i) ‖ q(z)) ]    (D.30)
L_comb = L_{1,2,3} + ∑_{i,j∈{1,2,3}, i≠j} L_{ij} + ∑_{i=1}^{3} L_i    (D.31)

In this study, we considered a simplifying extension of SVAE to n modalities using n^2 networks q^j_i(z|x_j) for i, j ∈ {1, …, n}. For the 3-modal case depicted in Figure 5, the PoE relations between the modalities are defined in the following way for i, j, k ∈ {1, 2, 3} with i ≠ j ≠ k (D.33):

q(z|x_1, x_2, x_3) = (1 / Z(x_1, x_2, x_3)) q^1_1(z|x_1) q^2_2(z|x_2) q^3_3(z|x_3)    (D.32)
q_i(z|x_i, x_j) = (1 / Z(x_i, x_j)) q^i_i(z|x_i) q^j_j(z|x_j) q^i_k(z|x_i)    (D.34)
q_j(z|x_i, x_j) = (1 / Z(x_i, x_j)) q^i_i(z|x_i) q^j_j(z|x_j) q^j_k(z|x_j)    (D.35)
q(z|x_i) = q^i_i(z|x_i) q^i_j(z|x_i) q^i_k(z|x_i)    (D.36)
q(z) = N(0, I)    (D.37)

The corresponding SVAE loss function has additional terms because the relations between pairs of modalities need to be captured with two PoE rules, q_i(z|x_i, x_j) and q_j(z|x_i, x_j), in SVAE, while there is only a single network q(z|x_i, x_j) in VAEVAE.
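The parameter scaling discussed in Appendix D and Figure 5 (k·n^2 for SVAE versus k·n·2^{n−1} for VAEVAE) can be checked by summing over modality subsets, under the assumption that a joint network reading m modalities costs roughly m·k parameters (one encoding trunk of size k per modality it reads):

```python
from math import comb

def vaevae_encoder_params(n, k):
    """Canonical n-modal VAEVAE: one network per non-empty modality subset,
    at m * k parameters for a subset of size m. Sums to k * n * 2**(n - 1)."""
    return sum(comb(n, m) * m * k for m in range(1, n + 1))

def svae_encoder_params(n, k):
    """SVAE: n**2 decoupled networks q_i^j, each of size k."""
    return n * n * k
```

For n = 2 both counts coincide (4k), consistent with the bi-modal observation in Section 4; for n = 3 VAEVAE already needs 12k against SVAE's 9k.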
The loss functions in equations D.28-D.31 above are modified such that f(q(z|x_i, x_j)) is replaced by f(q_i(z|x_i, x_j)) + f(q_j(z|x_i, x_j)) for any function f. This extension of the bi-modal case assumes that p(x_i, x_j|x_k) = p(x_i|x_k) p(x_j|x_k) for i, j, k ∈ {1, 2, 3}, i ≠ j ≠ k, that is, that x_i and x_j are conditionally independent given x_k.

F EXPERIMENTAL DETAILS

The encoder and decoder architectures for each experiment and modality are listed below. To implement the joint encoding network (VAEVAE architecture), a fully connected layer followed by a ReLU is added to the encoding architecture of each modality. Another fully connected layer accepts the concatenated features from the two modalities as input and outputs the latent space parameters. The Adam optimiser is used for learning in all models (Kingma & Ba, 2015).

F.1 MNIST-SPLIT

The models are trained for 200 epochs with learning rate 2 · 10^{-4}. The best epoch is chosen by the highest accuracy of the reconstruction from the top half evaluated on the validation set. We used a latent space dimensionality of L = 64. The network architectures are described in the corresponding tables.

F.3 CUB IMAGES-CAPTIONS

The models are trained for 200 epochs with learning rate 10^{-4}. The best epoch is chosen by the highest joint coherence evaluated on the validation set. We used a latent space dimensionality of L = 64. The network architectures are described in Table F.5 and Table F.6 for the text and image modality, respectively.

F.4 MNIST-SPLIT-THREE

The models are trained for 50 epochs with learning rate 2 · 10^{-4}. The best epoch is chosen by the highest accuracy of the reconstruction from the top part evaluated on the validation set. We used a latent space dimensionality of L = 64. The network architectures are described in the corresponding tables.



1 We also evaluated SVAE*, our model with the VAEVAE loss function, but it never outperformed the other models.
2 https://github.com/masa-su/pixyz
3 https://github.com/iffsid/mmvae



Figure 1: Schematic overview of bi-modal VAEs using a PoE and additional network structures that are capable of semi-supervised learning without requiring a two-step learning procedure. VAEVAE (a) and (b) are by Wu & Goodman (2019), JMVAE is by Suzuki et al. (2017), MVAE is by Wu & Goodman (2018), and SVAE is our newly proposed model. Each triangle stands for an individual neural network; the colors indicate the two different modalities.

Figure 2: MNIST-SVHN reconstruction for the fully supervised VAEVAE.

We conducted experiments to compare state-of-the-art PoE based VAEs with the MoE approach MMVAE (Shi et al., 2019). We considered VAEVAE (b) as proposed by Wu & Goodman (2019) and SVAE as described above. The two approaches differ both in the underlying model and in the objective function. For a better understanding of these differences, we also considered an algorithm referred to as VAEVAE*, which has the same model architecture as VAEVAE and the same loss function as SVAE.1 The difference in the training procedure for VAEVAE and VAEVAE* is described in Appendix C. Since the VAEVAE implementation was not publicly available at the time of writing, we used our own implementation of VAEVAE based on the PiXYZ library.2 For details about the experiments we refer to Appendix F. The source code to reproduce the experiments can be found in the supplementary material.

Figure 3: MNIST-Split image reconstructions of a top half and a bottom half given (a) the top half; (b) the bottom half of the original image.

Figure 4: MNIST-Split dataset. Accuracy of an oracle network applied to images reconstructed given (a) the full image (both halves); (b) the top half; (c) the bottom half.

Figure 7: Performance on MNIST-SVHN for different supervision levels. (a) Joint coherence, the share of jointly generated image pairs with the same digit class; (b) Cross-coherence, accuracy of SVHN reconstructions given MNIST; (c) Cross-coherence, accuracy of MNIST reconstructions given SVHN.
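The joint coherence metric in panel (a) can be sketched as follows. The function name `joint_coherence` and the input format (integer class predictions from a pre-trained oracle classifier, one array per modality) are illustrative assumptions; the metric itself is simply the fraction of jointly generated pairs on which the two oracle predictions agree.

```python
import numpy as np

def joint_coherence(oracle_preds_a, oracle_preds_b):
    """Joint coherence: sample latent vectors from the prior, decode both
    modalities (e.g. an MNIST and an SVHN image) from each latent vector,
    classify each generated image with a pre-trained oracle, and report
    the share of pairs whose predicted digit classes agree."""
    a = np.asarray(oracle_preds_a)
    b = np.asarray(oracle_preds_b)
    return float(np.mean(a == b))

# e.g. oracle predictions for four generated pairs, agreeing on three
print(joint_coherence([1, 2, 3, 4], [1, 2, 0, 4]))  # -> 0.75
```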

Figure 8: CUB Images-Captions dataset. Performance metrics for different supervision levels. (a) Joint coherence, the correlation between images and labels reconstructed from the randomly sampled latent vectors; (b) Cross-coherence, the correlation of the reconstructed caption given the image; (c) Cross-coherence, the correlation of the reconstructed image given the caption.

Figure 9: Examples of image and caption reconstructions given one modality as input, for SVAE and VAEVAE. Given that a caption can be broad (e.g., "this bird is black and white and has a long pointy beak" in the example), it can fit many different images. In this case, the image reconstructed from the caption tends to fit the description better than the original image does. The same goes for images: one of the reconstructed images shows a bird with a red belly, which is reflected in the generated caption even though it was not part of the original caption.

Figure E.10: A. MNIST-Split image reconstructions of a top half and a bottom half given (a) the top half; (b) the bottom half of the original image; (c) both halves. B. Side-by-side MNIST-SVHN reconstructions from a randomly sampled latent space, with oracle predictions of the digit class. The joint coherence in Figure 7 is the share of pairs for which the predicted classes agree. The examples are generated by SVAE for the supervision levels 100% and 0.1%.

Figure E.11: A. MNIST-Split image reconstructions of a top half and a bottom half given (a) the top half; (b) the bottom half of the original image; (c) both halves. B. Side-by-side MNIST-SVHN reconstructions from a randomly sampled latent space, with oracle predictions of the digit class. The joint coherence in Figure 7 is the share of pairs for which the predicted classes agree. The examples are generated by VAEVAE for the supervision levels 100% and 0.1%.

Evaluation of the models trained on the fully supervised datasets.


The network architectures for MNIST-SVHN are described in Table F.4 and Table F.3 for the MNIST and SVHN modality, respectively.

Table F.4: Network architectures for MNIST-SVHN: SVHN.

Table F.5: Network architectures for CUB-Captions language processing.

A DERIVATION OF THE MODEL ARCHITECTURE

We define our model in an axiomatic way, requiring the following properties:

1. When no modalities are observed, the generating distribution for the latent code is Gaussian:
   q(z) = p(z) = N(0, I). (A.11)
   This property is well known from VAEs and allows easy sampling.

2. The two modalities are independent given the latent code, so the decoder distribution is
   p(x|z) = p(x_1, x_2 |z) = p(x_1 |z) p(x_2 |z). (A.12)
   The second property formalizes our goal that the latent representation contains all relevant information from all modalities. By Bayes' rule and equation A.12, the joint distribution p(z|x) = p(z|x_1, x_2) is given by
   p(z|x_1, x_2) = p(x_1 |z) p(x_2 |z) p(z) / p(x_1, x_2) ∝ p(z|x_1) p(z|x_2) / p(z). (A.13)

3. Both experts cover the whole latent space with equal probabilities (equation A.14).

4. The joint encoding distribution q(z|x) = q(z|x_1, x_2) is assumed to be given by the product-of-experts rule (Hinton, 2002; Welling, 2007):
   q(z|x_1, x_2) ∝ q(z|x_1) q(z|x_2). (A.15)

The modelling by a product-of-experts in equation A.15 is a simplification of equation A.13 to make the model tractable.
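With Gaussian experts, the product-of-experts rule has a closed form: a product of Gaussian densities is again Gaussian, with precision equal to the sum of the experts' precisions and mean equal to the precision-weighted average of the experts' means. The sketch below illustrates this; the function name `poe_gaussian` and the array layout are assumptions, and following one common variant (as in MVAE by Wu & Goodman, 2018) the prior N(0, I) is included as an additional expert, which matches property 1 when no modality is observed.

```python
import numpy as np

def poe_gaussian(mus, logvars):
    """Combine K Gaussian experts N(mus[k], exp(logvars[k])) by the
    product-of-experts rule, with a standard-normal prior expert N(0, I)
    included. Inputs have shape (K, D); returns (mu, logvar) of the
    resulting Gaussian, each of shape (D,)."""
    # prepend the prior expert: mean 0, log-variance 0 (i.e. variance 1)
    mus = np.concatenate([np.zeros_like(mus[:1]), mus])
    logvars = np.concatenate([np.zeros_like(logvars[:1]), logvars])
    precisions = np.exp(-logvars)
    # product of Gaussians: precisions add, means are precision-weighted
    var = 1.0 / precisions.sum(axis=0)
    mu = (mus * precisions).sum(axis=0) * var
    return mu, np.log(var)
```

For example, two unit-variance experts with means +1 and -1 combined with the prior yield mean 0 and variance 1/3, since the three unit precisions add up to 3.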

