A GENERIC PARAMETERIZATION METHOD FOR UNSUPERVISED LEARNING

Abstract

We introduce a parameterization method called Neural Bayes which allows computing statistical quantities that are in general difficult to compute, and which opens avenues for formulating new objectives for unsupervised representation learning. Specifically, given an observed random variable x and a latent discrete variable z, we can express p(x|z), p(z|x) and p(z) in closed form in terms of a sufficiently expressive function (e.g., a neural network) using our parameterization, without restricting the class of these distributions. To demonstrate its usefulness, we develop two independent use cases for this parameterization: 1. Disjoint Manifold Separation: Neural Bayes allows us to formulate an objective which can optimally label samples from disjoint manifolds present in the support of a continuous distribution. This can be seen as a specific form of clustering where each disjoint manifold in the support is a separate cluster. We design clustering tasks that obey this formulation and empirically show that the model optimally labels the disjoint manifolds. 2. Mutual Information Maximization (MIM): MIM has become a popular means for self-supervised representation learning. Neural Bayes allows us to compute mutual information between observed random variables x and latent discrete random variables z in closed form. We use this for learning image representations and show its usefulness on downstream classification tasks.

1. INTRODUCTION

We introduce a generic parameterization called Neural Bayes that facilitates unsupervised learning from unlabeled data by categorizing it. Specifically, our parameterization implicitly maps samples from an observed random variable x to a latent discrete space z in which the distribution p(x) gets segmented into a finite number of arbitrary conditional distributions. Imposing different conditions on the latent space z through different objective functions results in learning qualitatively different representations. Our parameterization may be used to compute statistical quantities involving observed and latent variables that are in general difficult to compute (thanks to the discrete latent space), thus providing a flexible framework for unsupervised learning. To illustrate this, we develop two independent use cases for this parameterization: disjoint manifold separation (DMS) and mutual information maximization (Linsker, 1988), as described in the abstract. For the manifold separation task, we show experiments on 2D datasets and their high-dimensional counterparts designed as per the problem formulation, and show that the proposed objective can optimally label disjoint manifolds. For the MIM task, we experiment with benchmark image datasets and show that the unsupervised representations learned by the network achieve performance on downstream classification tasks comparable with a closely related MIM method, Deep InfoMax (DIM, Hjelm et al. (2019)). For both objectives we design the regularizations necessary to achieve the desired behavior in practice. All proofs can be found in the appendix.

2. RELATED WORK

Neural Bayes-DMS: Numerous recent papers have proposed clustering algorithms for unsupervised representation learning, such as Deep Clustering (Caron et al., 2018), information-based clustering (Ji et al., 2019), Spectral Clustering (Shaham et al., 2018), Associative Deep Clustering (Haeusser et al., 2018), etc. Our goal with regard to clustering in Neural Bayes-DMS is in general different from such methods. Our objective is aimed at finding disjoint manifolds in the support of a distribution. It is therefore a generalization of traditional subspace clustering methods (where the goal is to find disjoint affine subspaces) (Ma et al., 2008; Liu et al., 2010) to arbitrary manifolds. Another class of clustering algorithms comprises mixture models (such as Gaussian mixture models). Our clustering proposal (DMS) is novel compared to this class in two ways: 1. we formulate the clustering problem as that of identifying disjoint manifolds in the support of a distribution, which is different from assuming K ground-truth clusters, where the notion of a cluster is ill-defined; 2. the DMS objective in proposition 1 is itself novel, and we prove its optimality for labeling disjoint manifolds in theorem 1. Neural Bayes-MIM: Self-supervised representation learning has attracted a lot of attention in recent years. Currently, contrastive learning methods and similar variants (such as MoCo (He et al., 2020), SimCLR (Chen et al., 2020), and BYOL (Grill et al., 2020)) produce state-of-the-art (SOTA) performance on downstream classification tasks. These methods make use of handcrafted image augmentations that exploit priors such as that class information is typically associated with object shape and is location-invariant. However, since we specifically develop an easier alternative to DIM (which also maximizes the mutual information between the input and the latent representations), for a fair comparison we compare the performance of our Neural Bayes-MIM algorithm for representation learning only with DIM.
We leave the extension of the Neural Bayes-MIM algorithm with data augmentation techniques and other advanced regularizations similar to Bachman et al. (2019) as future work. Our experiments show that our proposal performs comparably with or slightly better than DIM. The main advantage of our proposal over DIM is that it offers a closed-form estimate of MI due to the discrete latent variables. We note that the principle of mutual information maximization for representation learning was introduced in Linsker (1988) and Bell & Sejnowski (1995), and since then, a number of self-supervised methods involving MIM have been proposed. Vincent et al. (2010) showed that auto-encoder based methods achieve this goal implicitly by minimizing the reconstruction error of the input samples under an isotropic Gaussian assumption. Deep InfoMax (DIM, Hjelm et al. (2019)) uses estimators like MINE (Belghazi et al., 2018) and noise-contrastive estimation (NCE, Gutmann & Hyvärinen (2010)) to estimate MI and maximize it for both local and global features in convolutional networks. Contrastive Predictive Coding (Oord et al., 2018) is another approach that maximizes MI by predicting the activations of a layer from the layer above using NCE. We also point out that the estimate of mutual information due to the Neural Bayes parameterization in the Neural Bayes-MIM-v1 objective (Eq. 8) turns out to be identical to the one proposed in IMSAT (Hu et al., 2017). However, there are important differences: 1. we introduce regularizations which significantly improve the performance of the representations on downstream tasks compared to IMSAT (cf. Table 3 in Hu et al. (2017)); 2. we provide theoretical justification for the parameterization used (lemma 1) and show in theorem 2 why it is feasible to compute high-fidelity gradients using this objective in the mini-batch setting even though it contains the term $\mathbb{E}_x[L_k(x)]$.
In contrast, the justification used in IMSAT is that optimizing using mini-batches is equivalent to optimizing an upper bound of the original objective; 3. we perform extensive ablation studies demonstrating the importance of the introduced regularizations.

3. NEURAL BAYES

Consider a data distribution p(x) from which we have access to i.i.d. samples $x \in \mathbb{R}^n$. We assume that this marginal distribution is a union of K conditionals, where the k-th conditional's density is denoted by $p(x|z=k)$ and the corresponding probability mass by $p(z=k)$. Here z is a discrete random variable with K states. We now introduce the parameterization that allows us to implicitly factorize any marginal distribution into conditionals as described above. Aside from the technical details, the key idea behind this parameterization is Bayes' rule.

Lemma 1. Let $p(x|z=k)$ and $p(z)$ be any conditional and marginal distribution defined for a continuous random variable x and a discrete random variable z. If $\mathbb{E}_{x\sim p(x)}[L_k(x)] \neq 0 \ \forall k \in [K]$, then there exists a non-parametric function $L(x): \mathbb{R}^n \to \mathbb{R}_+^K$ for any given input $x \in \mathbb{R}^n$ with the property $\sum_{k=1}^K L_k(x) = 1 \ \forall x$ such that

$$p(x|z=k) = \frac{L_k(x) \cdot p(x)}{\mathbb{E}_{x\sim p(x)}[L_k(x)]}, \qquad p(z=k) = \mathbb{E}_x[L_k(x)], \qquad p(z=k|x) = L_k(x)$$

Thus the function L can be seen as a form of soft categorization of input samples. In practice, we use a neural network with sufficient capacity and a softmax output to realize this function L. We name our parameterization method Neural Bayes and write $L_\theta$ to denote the parameters of the network. By imposing different conditions on the structure of z through meaningful objectives, we get qualitatively different kinds of factorization of the marginal p(x), and the function $L_\theta$ then encodes the posterior for that factorization. In summary, if one formulates any objective that involves the terms p(x|z), p(z) or p(z|x), where x is an observed random variable and z is a discrete latent random variable, then these terms can be substituted with $\frac{L_k(x) \cdot p(x)}{\mathbb{E}_x[L_k(x)]}$, $\mathbb{E}_x[L_k(x)]$ and $L_k(x)$ respectively. On an important note, the Neural Bayes parameterization requires the term $\mathbb{E}_x[L_k(x)]$, through which computing gradients is infeasible in general.
A general discussion around this can be found in the appendix D. Nonetheless, we show that mini-batch gradients can have good fidelity for one of the objectives we propose using our parameterization. In the next two sections, we explore two different ways of factorizing p(x) resulting in qualitatively different goals of unsupervised representation learning.
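To make the parameterization concrete, here is a minimal NumPy sketch (our own illustration, not the paper's implementation; the tiny linear-softmax "network" and all names are hypothetical). It treats the softmax output as $L(x) = p(z|x)$, estimates $p(z=k) = \mathbb{E}_x[L_k(x)]$ by a Monte-Carlo average, and verifies that the implied conditionals recombine into the marginal, i.e., $\sum_k p(z=k)\, p(x|z=k) = p(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical network L_theta: a single linear layer + softmax (K = 4 states).
n, K, B = 5, 4, 1000
W = rng.normal(size=(n, K))
X = rng.normal(size=(B, n))          # i.i.d. samples from p(x)
L = softmax(X @ W)                   # L_k(x) = p(z=k|x); each row sums to 1

p_z = L.mean(axis=0)                 # p(z=k) = E_x[L_k(x)] (Monte-Carlo estimate)
ratio = L / p_z                      # p(x|z=k) / p(x) = L_k(x) / E_x[L_k(x)]

# Bayes-consistency check: sum_k p(z=k) * p(x|z=k)/p(x) should be 1 everywhere.
mixture = (p_z * ratio).sum(axis=1)
print(np.allclose(mixture, 1.0))     # prints True
```

The check holds by construction, since $\sum_k L_k(x) = 1$; the sketch only illustrates how the three quantities in lemma 1 are read off from a softmax network.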

4. DISJOINT MANIFOLD SEPARATION (DMS)

In many cases, the support of a distribution may be a set of disjoint manifolds. In this task, our goal is to label samples from each disjoint manifold with a distinct value. This formulation can be seen as a generalization of subspace clustering (Ma et al., 2008), where the goal is to identify disjoint affine subspaces. To make the problem concrete, we first formalize the definition of a disjoint manifold.

Definition 1. (Connected Set) We say that a set $S \subset \mathbb{R}^n$ is a connected set (disjoint manifold) if for any $x, y \in S$ there exists a continuous path between $x$ and $y$ such that all the points on the path also belong to $S$.

To identify such disjoint manifolds in a distribution, we exploit the observation that only partitions that separate one disjoint manifold from the others have high divergence between the respective conditional distributions, while partitions that cut through a disjoint manifold result in conditional distributions with low divergence between them. Therefore, the objective we propose for this task is to partition the unlabeled data distribution p(x) into conditional distributions $q_i(x)$ such that a divergence between them is maximized. By doing so we recover the conditional distributions defined over the disjoint manifolds (we prove optimality in theorem 1). We begin with two disjoint manifolds and extend this idea to multiple disjoint manifolds in appendix B. Let $J$ be a symmetric divergence (e.g., Jensen-Shannon divergence, Wasserstein divergence), and let $q_0$ and $q_1$ be the disjoint conditional distributions that we want to learn. Then the aforementioned objective can be written formally as:

$$\max_{q_0, q_1, \ \pi \in (0,1)} J(q_0(x) \,\|\, q_1(x)) \quad (2)$$
$$\text{s.t.} \quad \int_x q_0(x) = 1, \quad \int_x q_1(x) = 1, \quad q_1(x)\cdot\pi + q_0(x)\cdot(1-\pi) = p(x).$$

Since our goal is simply to assign to each sample x a label indicating which manifold it belongs to, rather than to learn the conditional distributions as achieved by Eq. (2), we would like to learn a function L(x) that maps samples from disjoint manifolds to distinct labels. To do so, below we derive an objective equivalent to Eq. (2) that learns such a function L(x).

Proposition 1. (Neural Bayes-DMS) Let $L(x): \mathbb{R}^n \to [0, 1]$ be a non-parametric function for any given input $x \in \mathbb{R}^n$, and let $J$ be the Jensen-Shannon divergence. Define scalars $f_1(x) := \frac{L(x)}{\mathbb{E}_x[L(x)]}$ and $f_0(x) := \frac{1-L(x)}{1-\mathbb{E}_x[L(x)]}$. Then the objective in Eq. (2) is equivalent to

$$\max_L \ \frac{1}{2}\,\mathbb{E}_x\!\left[f_1(x)\log\frac{f_1(x)}{f_1(x)+f_0(x)}\right] + \frac{1}{2}\,\mathbb{E}_x\!\left[f_0(x)\log\frac{f_0(x)}{f_1(x)+f_0(x)}\right] + \log 2 \quad (3)$$
$$\text{s.t.} \quad \mathbb{E}_x[L(x)] \notin \{0, 1\}.$$

Optimality: We now prove the optimality of the proposed objective towards discovering the disjoint manifolds present in the support of a probability density function p(x).

Theorem 1. (Optimality) Let p(x) be a probability density function over $\mathbb{R}^n$ whose support is the union of two non-empty connected sets (definition 1) $S_1$ and $S_2$ that are disjoint, i.e., $S_1 \cap S_2 = \emptyset$. Let $L(x) \in [0,1]$ belong to the class of continuous functions learned by solving the objective in Eq. (3). Then the objective in Eq. (3) is maximized if and only if one of the following is true:

$$L(x) = \begin{cases} 0 & \forall x \in S_1 \\ 1 & \forall x \in S_2 \end{cases} \qquad \text{or} \qquad L(x) = \begin{cases} 1 & \forall x \in S_1 \\ 0 & \forall x \in S_2. \end{cases}$$

The above theorem proves that optimizing the derived objective over the space of functions L implicitly partitions the data distribution into maximally separated conditionals by assigning a distinct label to the points in each manifold. Most importantly, the theorem shows that the continuity condition on the function L(x) plays an important role. Without this condition, the network cannot identify disjoint manifolds. The extension to the case of multiple disjoint manifolds can be found in section B of the appendix.
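To make proposition 1 concrete, here is a NumPy sketch of the Neural Bayes-DMS objective (illustrative code; `dms_objective` and the toy labelings are our own, and a small $\epsilon$ is added for numerical stability as in the implementation section below). A labeler that cleanly separates two disjoint groups should score near the JS maximum of $\log 2$, while a constant labeler scores near 0:

```python
import numpy as np

def dms_objective(L, eps=1e-7):
    """Neural Bayes-DMS objective (proposition 1) evaluated on soft labels L in [0,1]."""
    pi = L.mean()                            # Monte-Carlo estimate of E_x[L(x)]
    f1 = L / (pi + eps)
    f0 = (1.0 - L) / (1.0 - pi + eps)
    t1 = f1 * np.log(f1 / (f1 + f0) + eps)   # f1 * log(f1 / (f1 + f0))
    t0 = f0 * np.log(f0 / (f1 + f0) + eps)   # f0 * log(f0 / (f1 + f0))
    return 0.5 * t1.mean() + 0.5 * t0.mean() + np.log(2.0)

# Two balanced "manifolds" of 500 samples each: a near-perfect labeler versus
# a constant (uninformative) one.
perfect = np.concatenate([np.full(500, 0.9999), np.full(500, 0.0001)])
uninformative = np.full(1000, 0.5)
print(dms_objective(perfect) > 0.68)             # True: near the maximum log 2
print(abs(dms_objective(uninformative)) < 1e-3)  # True: no separation -> ~0
```

This mirrors the claim of theorem 1: the objective is maximized (value $\log 2$) exactly when L assigns distinct hard labels to the two groups.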

4.1. IMPLEMENTATION DETAILS

Prior Collapse: The constraint in proposition 1 is a boundary condition required for technical reasons in lemma 1. In practice we do not need to worry about it because optimization itself avoids situations where $\mathbb{E}_x[L(x)] \in \{0, 1\}$. To see why, note that unless the network is initialized such that $\mathbb{E}_x[L(x)] \in \{0, 1\}$, the log terms are negative by definition. Since the denominators of $f_1$ and $f_0$ are $\mathbb{E}_x[L(x)]$ and $1 - \mathbb{E}_x[L(x)]$ respectively, the objective is maximized when $\mathbb{E}_x[L(x)]$ moves away from 0 and 1. Thus, for any reasonable initialization, optimization itself pushes $\mathbb{E}_x[L(x)]$ away from 0 and 1.

Smoothness of $L_\theta(\cdot)$: As shown in theorem 1, the proposed objectives can optimally recover disjoint manifolds only when the function $L_\theta(\cdot)$ is continuous. In practice we found that enforcing the function to be smooth (and thus also continuous) helps significantly. Therefore, after experimenting with a handful of heuristics for regularizing $L_\theta$, we found the following finite-difference Jacobian regularization to be effective ($L(\cdot)$ can be scalar- or vector-valued):

$$R_c = \frac{1}{B} \sum_{i=1}^{B} \frac{\left\| L_\theta(x_i) - L_\theta(x_i + \zeta \cdot \hat{\delta}_i) \right\|^2}{\zeta^2} \quad (4)$$

where $\hat{\delta}_i := \frac{\delta_i}{\|\delta_i\|_2}$ is a normalized noise vector computed independently for each sample $x_i$ in a batch of size B, with $\delta_i := X v_i$. Here $X \in \mathbb{R}^{n \times B}$ is the matrix containing the batch of samples, and each dimension of $v_i \in \mathbb{R}^B$ is sampled i.i.d. from a standard Gaussian. This computation ensures that the perturbation lies in the span of the data, which we found to be important. Finally, $\zeta$ is the scale of the normalized noise added to all samples in a batch. In our experiments, since we always normalize the datasets to have zero mean and unit variance across all dimensions, we sample $\zeta \sim \mathcal{N}(0, 0.1^2)$.
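The regularizer $R_c$ can be sketched as follows (illustrative NumPy code, not the paper's implementation; `jacobian_penalty` and the toy linear-softmax network are our own names, and the single shared $\zeta$ per batch follows the description above):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def jacobian_penalty(L, X, zeta_std=0.1):
    """Finite-difference smoothness regularizer R_c (Eq. 4 sketch).

    X has shape (B, n); perturbations are confined to the span of the batch
    via delta_i = X^T v_i with v_i ~ N(0, I_B)."""
    B = X.shape[0]
    V = rng.normal(size=(B, B))            # one Gaussian vector v_i per sample
    delta = V @ X                          # each row lies in the span of the data
    delta /= np.linalg.norm(delta, axis=1, keepdims=True)
    zeta = rng.normal(0.0, zeta_std)       # scale of the normalized noise
    diff = L(X) - L(X + zeta * delta)      # change in output under perturbation
    return (diff ** 2).sum(axis=1).mean() / zeta ** 2

# Toy usage with a hypothetical network L_theta = softmax(X W).
W = rng.normal(size=(4, 3))
X = rng.normal(size=(64, 4))
L = lambda Z: softmax(Z @ W)
print(jacobian_penalty(L, X) >= 0.0)       # prints True
```

In a training loop this penalty would be added to the loss with coefficient β, as in Eq. 5 below.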
Implementation: We implement the binary-partition Neural Bayes-DMS using a Monte-Carlo approximation of the following objective:

$$\min_\theta \ \frac{1}{2}\,\mathbb{E}_x\!\left[f_1(x)\log\!\left(1+\frac{f_0(x)}{f_1(x)}\right)\right] + \frac{1}{2}\,\mathbb{E}_x\!\left[f_0(x)\log\!\left(1+\frac{f_1(x)}{f_0(x)}\right)\right] + \beta \cdot R_c \quad (5)$$

where $f_1(x) := \frac{L_\theta(x)}{\mathbb{E}_x[L_\theta(x)] + \epsilon}$ and $f_0(x) := \frac{1-L_\theta(x)}{1-\mathbb{E}_x[L_\theta(x)] + \epsilon}$. Here $\epsilon = 10^{-7}$ is a small scalar used to prevent numerical instability, and β is a hyper-parameter controlling the continuity of L. The multi-partition case can be implemented in a similar way. Due to the need to compute $\mathbb{E}_x[L_\theta(x)]$ in the objective, optimizing it using gradient descent methods with small batch sizes is not possible. Therefore we experiment with this method on datasets where gradients can be computed for the very large batch size needed to approximate the gradient through $\mathbb{E}_x[L_\theta(x)]$ sufficiently well. Clustering in general is an ill-posed problem. However, in our problem setup the definition is precise: our goal is to optimally label all the disjoint manifolds present in the support of a distribution. Since this is a unique goal that is not generally considered in the literature, as empirical verification we show qualitative results on 2D synthetic datasets in figure 1. The top two sub-figures have 2 clusters and the bottom two have 3 clusters. For all experiments we use a 4-layer MLP with 400 hidden units per layer, batch norm, ReLU activations, and a Softmax activation in the last layer. In all cases we train using the Adam optimizer with a learning rate of 0.001, a batch size of 400 and no weight decay, until convergence. The regularization coefficient β was chosen from [0.5, 6] so as to result in optimal clustering. For generality, these 2D datasets were projected to high dimensions (512) by appending 510 zero-valued dimensions to each sample and then randomly rotating them before performing clustering. The datasets were then projected back to the original 2D space for visualizing the predictions.
Additional experiments can be found in the appendix C.

5. MUTUAL INFORMATION MAXIMIZATION (MIM)

Suppose we want to find a discrete latent representation z (with K states) for the distribution p(x) such that the mutual information MI(x, z) is maximized (Linsker, 1988). Such an encoding z must be very efficient, since it has to capture the maximum possible information about the continuous distribution p(x) in just K discrete states. Assuming we can learn such an encoding, we are interested in computing p(z|x), since it tells us the likelihood of x belonging to each discrete state of z, thereby performing a soft categorization that may be useful for downstream tasks. In the proposition below, we show an objective for computing p(z|x) for a discrete latent representation z that maximizes MI(x, z).

Proposition 2. (Neural Bayes-MIM-v1) Let $L(x): \mathbb{R}^n \to \mathbb{R}_+^K$ be a non-parametric function for any given input $x \in \mathbb{R}^n$ with the property $\sum_{k=1}^K L_k(x) = 1 \ \forall x$. Consider the following objective:

$$L^* = \arg\max_L \ \mathbb{E}_x\!\left[\sum_{k=1}^K L_k(x) \log \frac{L_k(x)}{\mathbb{E}_x[L_k(x)]}\right] \quad (8)$$

Then $L^*_k(x) = p(z^* = k|x)$, where $z^* \in \arg\max_z MI(x, z)$.

The proof essentially involves expressing MI in terms of p(z|x) and p(z), which can be substituted using the Neural Bayes parameterization. However, the objective proposed above poses a challenge: it contains the term $\mathbb{E}_x[L_k(x)]$, for which computing high-fidelity gradients in a batch setting is problematic (see appendix D). We can nonetheless overcome this problem for the MIM objective, because the gradient through certain terms turns out to be 0, as shown by the following theorem.

Theorem 2. (Gradient Simplification) Denote

$$J(\theta) = -\mathbb{E}_x\!\left[\sum_{k=1}^K L^\theta_k(x) \log \frac{L^\theta_k(x)}{\mathbb{E}_x[L^\theta_k(x)]}\right], \qquad \hat{J}(\theta) = -\mathbb{E}_x\!\left[\sum_{k=1}^K L^\theta_k(x) \log \frac{L^\theta_k(x)}{\langle \mathbb{E}_x[L^\theta_k(x)] \rangle}\right]$$

where $\langle \cdot \rangle$ indicates that gradients are not computed through the argument. Then $\frac{\partial J(\theta)}{\partial \theta} = \frac{\partial \hat{J}(\theta)}{\partial \theta}$.

The above theorem implies that as long as we plug in a decent estimate of $\mathbb{E}_x[L^\theta_k(x)]$ in the objective, unbiased gradients can be computed using randomly sampled mini-batches.
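Theorem 2 can be checked numerically. In the sketch below (our own illustrative code, with a toy linear-softmax $L_\theta$), we compare finite-difference gradients of the full objective $J$ against those of the surrogate $\hat{J}$ in which the prior estimate $\mathbb{E}_x[L_k]$ is frozen at its current value:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, B = 3, 4, 200
X = rng.normal(size=(B, n))
W0 = rng.normal(size=(n, K))             # current parameters theta

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def J(W):
    """Full objective: E_x[L_k] is recomputed from W (gradients flow through it)."""
    L = softmax(X @ W)
    return -(L * np.log(L / L.mean(axis=0))).sum(axis=1).mean()

c = softmax(X @ W0).mean(axis=0)         # frozen estimate of E_x[L_k] at W0

def J_hat(W):
    """Surrogate: the prior term is treated as a constant (stop-gradient)."""
    L = softmax(X @ W)
    return -(L * np.log(L / c)).sum(axis=1).mean()

def num_grad(f, W, h=1e-5):
    """Central finite-difference gradient of a scalar function of W."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        E = np.zeros_like(W)
        E[idx] = h
        g[idx] = (f(W + E) - f(W - E)) / (2 * h)
    return g

print(np.allclose(num_grad(J, W0), num_grad(J_hat, W0), atol=1e-5))  # prints True
```

The equality holds because the extra gradient terms sum to $\sum_k \partial_\theta \mathbb{E}_x[L_k(x)] = \partial_\theta(1) = 0$; in an autograd framework the surrogate corresponds to detaching the prior estimate.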
Note that the objective can be re-written as

$$\min_\theta \ -\mathbb{E}_x\!\left[\sum_{k=1}^K L^\theta_k(x) \log L^\theta_k(x)\right] + \sum_{k=1}^K \mathbb{E}_x[L_k(x)] \log \mathbb{E}_x[L_k(x)]$$

The second term is the negative entropy of the discrete latent representation $p(z=k) := \mathbb{E}_x[L_k(x)]$, and acts as a uniform prior. In other words, this term encourages learning a latent code z such that all states of z activate uniformly over the marginal input distribution x. This is an attribute of distributed representations, which is a fundamental goal in deep learning. We can therefore further encourage this behavior by treating the coefficient of this term as a hyper-parameter. In our experiments we confirm both the distributed-representation behavior of this term and the benefit of tuning its coefficient.
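The rewriting above is easy to verify numerically. The sketch below (illustrative code with a toy linear-softmax $L_\theta$) computes the Eq. 8 objective and checks that it equals $H(z) - H(z|x)$, i.e., the entropy of the prior minus the conditional entropy:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

W = rng.normal(size=(8, 10))                # toy L_theta with K = 10 states
X = rng.normal(size=(2000, 8))              # i.i.d. samples from p(x)
L = softmax(X @ W)                          # L_k(x) = p(z=k|x)
p_z = L.mean(axis=0)                        # p(z=k) = E_x[L_k(x)]

# Eq. 8: MI(x, z) = E_x[sum_k L_k(x) log(L_k(x) / E_x[L_k(x)])]
mi = (L * np.log(L / p_z)).sum(axis=1).mean()

# Equivalent decomposition: MI(x, z) = H(z) - H(z|x)
h_z = -(p_z * np.log(p_z)).sum()                      # entropy of the prior
h_z_given_x = -(L * np.log(L)).sum(axis=1).mean()     # conditional entropy
print(np.isclose(mi, h_z - h_z_given_x))              # prints True
```

Maximizing MI therefore simultaneously sharpens $p(z|x)$ (low $H(z|x)$) and spreads the prior $p(z)$ (high $H(z)$), which is exactly the distributed-representation behavior described above.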

5.1. IMPLEMENTATION DETAILS

Alternative Formulation of the Uniform Prior: In practice we found that an alternative formulation of the second term in Eq. 8 results in better performance and more interpretable filters. Specifically, we replace it with the following cross-entropy formulation:

$$R_p(\theta) := -\sum_{k=1}^K \left[ \frac{1}{K} \log\!\big(\mathbb{E}_x[L_k(x)]\big) + \frac{K-1}{K} \log\!\big(1 - \mathbb{E}_x[L_k(x)]\big) \right] \quad (9)$$

While both the second term in Eq. 8 and $R_p(\theta)$ are minimized when $\mathbb{E}_x[L_k(x)] = 1/K$, the latter formulation provides much stronger gradients during optimization when $\mathbb{E}_x[L_k(x)]$ approaches 1 (see appendix F.1 for details); $\mathbb{E}_x[L_k(x)] = 1$ is undesirable since it discourages distributed representations. Finally, unbiased gradients can be computed through Eq. 9 as long as a good estimate of $\mathbb{E}_x[L_k(x)]$ is plugged in. Also note that the condition $\mathbb{E}_x[L_k(x)] \notin \{0, 1\}$ in lemma 1 is met implicitly by the Neural Bayes-MIM objective during optimization, as discussed in the paragraph above in regards to distributed representations.
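A small sketch of $R_p$ (illustrative code; `uniform_prior_penalty` is our own name): the penalty is smallest for the uniform prior and grows when one state dominates, matching the distributed-representation motivation above:

```python
import numpy as np

def uniform_prior_penalty(p_z):
    """Cross-entropy prior R_p (Eq. 9): minimized when p(z=k) = 1/K for all k."""
    K = p_z.shape[0]
    return -((1.0 / K) * np.log(p_z)
             + ((K - 1.0) / K) * np.log(1.0 - p_z)).sum()

K = 10
uniform = np.full(K, 1.0 / K)                     # ideal distributed prior
collapsed = np.array([0.91] + [0.01] * (K - 1))   # one state dominates
print(uniform_prior_penalty(uniform)
      < uniform_prior_penalty(collapsed))         # prints True
```

Note the $\log(1 - \mathbb{E}_x[L_k(x)])$ term: it diverges as any single state's mass approaches 1, which is the source of the stronger gradients mentioned above.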

Implementation:

The final Neural Bayes-MIM-v2 objective is:

$$\min_\theta \ -\mathbb{E}_x\!\left[\sum_{k=1}^K L^\theta_k(x) \log\!\big(L^\theta_k(x) + \epsilon\big)\right] + (1+\alpha) \cdot R_p(\theta) + \beta \cdot R_c \quad (10)$$

where α and β are hyper-parameters, $R_c$ is the smoothness regularization introduced in Eq. 4, and $\epsilon = 10^{-7}$ is a small scalar used to prevent numerical instability. Qualitatively, we find that the regularization $R_c$ prevents filters from memorizing the input samples. Finally, we apply the first two terms in Eq. 10 to all hidden layers of a deep network at different scales (computed by spatial average pooling followed by Softmax). These two regularizations give a significant performance boost. Thorough implementation details are provided in appendix E. For brevity, we refer to our final objective as Neural Bayes-MIM in the rest of the paper. To compute a good estimate of the gradients, we use the following trick: during optimization, we compute gradients using a sufficiently large mini-batch of size MBS (e.g., 500) that fits in memory (so that the estimate of $\mathbb{E}_x[L_k(x)]$ is reasonable), accumulate these gradients until BS samples (e.g., 2000) have been seen, and average them before updating the parameters to further reduce the estimation error.
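The MBS/BS gradient-accumulation trick can be sketched generically (illustrative code; `accumulate_update` and the quadratic toy loss are our own constructions, not the paper's implementation). With equal-sized mini-batches, averaging the accumulated gradients reproduces the full-batch gradient over the BS samples:

```python
import numpy as np

def accumulate_update(grad_fn, data, theta, mbs=500, bs=2000, lr=1e-3):
    """One parameter update from BS samples, processed MBS at a time.

    grad_fn(theta, minibatch) returns a mini-batch gradient estimate;
    averaging BS // MBS of them reduces the error in the E_x[L_k(x)]
    estimate before the parameters are updated."""
    grads = [grad_fn(theta, data[i:i + mbs]) for i in range(0, bs, mbs)]
    return theta - lr * np.mean(grads, axis=0)

# Toy check with the loss 0.5 * ||x - theta||^2: the accumulated gradient
# equals the full-batch gradient over all BS samples.
rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 4))
theta = np.zeros(4)
g = lambda th, xb: (th - xb).mean(axis=0)
new_theta = accumulate_update(g, data, theta)
print(np.allclose(new_theta,
                  theta - 1e-3 * (theta - data.mean(axis=0))))  # prints True
```

For Neural Bayes-MIM the important detail is that each of the accumulated gradients is itself computed from an MBS-sized batch large enough to estimate $\mathbb{E}_x[L_k(x)]$ reasonably well.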

5.2. EXPERIMENTS

Instead of aiming for state-of-the-art results on self-supervised learning tasks, our goal in this section is to conduct experiments using Neural Bayes-MIM to understand the behavior of the algorithm and the hyper-parameters involved, and to compare performance with DIM, a closely related self-supervised learning method. The use of data augmentation, additional architectures and other regularizations similar to Bachman et al. (2019) is left as future work. Therefore, we use the following simple CNN encoder architecture Enc in our experiments: C(200, 3, 1, 0)-P(2, 2, 0, max)-C(500, 3, 1, 0)-C(700, 3, 1, 0)-P(2, 2, 0, max)-C(1000, 3, 1, 0). For an input image x of size 32 × 32 × 3, the output of this encoder Enc(x) has size 2 × 2 × 1000. The encoder is initialized using orthogonal initialization (Saxe et al., 2013), batch normalization (Ioffe & Szegedy, 2015) is used after each convolution layer, and ReLU non-linearities are used. All datasets are normalized to have dimension-wise zero mean and unit variance. Early stopping in all experiments is done using the test set (following previous work). We broadly follow the experimental setup of Hjelm et al. (2019). We do not use any data augmentation in our experiments. After training the encoder, we freeze its features and train a 1-hidden-layer (200 units) classifier to get the final test accuracy.

5.2.1. ABLATION STUDIES

Behavior of Neural Bayes-MIM-v1 (Eq. 8) vs Neural Bayes-MIM (v2, Eq. 9): The experiments and details are discussed in appendix F.2. The main differences are: 1. the majority of the filters learned by the v1 objective are dead, as opposed to the v2 objective, which encourages distributed representations; 2. the performance of v2 is better than that of v1.

Performance due to Regularizations and State Scaling: We now evaluate the effects of the various components involved in the Neural Bayes-MIM objective: the coefficients α and β, and applying the objective at different scales of the hidden states. We use the CIFAR-10 dataset for these experiments. In the first experiment, for each number of scales considered, we vary α and β and record the final performance, thus capturing the variation in performance due to all three components. We consider two scaling configurations: 1. no pooling is applied to the hidden layers; 2. for each hidden layer, we additionally spatially average-pool the state using a 2 × 2 pooling filter with a stride of 2. For the encoder used in our experiments (which has 4 internal hidden layers post-ReLU), this gives us 4 and 8 states respectively (including the original un-scaled hidden layers) to which the Neural Bayes-MIM objective is applied. After computing all the states, we apply the Softmax activation to each state along the channel dimension so that the Neural Bayes parameterization holds. Thus for states with height and width, the objective is applied at each spatial (x, y) location separately and averaged. Also, for states with height (or width) less than the pooling size, we use the height (or width) as the pooling size. We train Neural Bayes-MIM on the full training set for 100 epochs using Adam with learning rate 0.001 (other Adam hyper-parameters are standard), mini-batch size 500, batch size 2000, and no weight decay. In the first 32 experiments, α and β are sampled uniformly from [0, 10] and [0, 5] respectively.
In the next 5 experiments, α is set to 0 while β is sampled uniformly; in the following 5, β is set to 0 while α is sampled uniformly. Thus, in total we run 42 experiments for each number of scales considered. Once we have a trained Enc(x), we train a 1-hidden-layer (200 units) MLP classifier on the frozen features from Enc(x) using the labels in the training set. This training is done for 100 epochs using Adam with learning rate 0.001 (other Adam hyper-parameters are standard), batch size 128 and no weight decay. As a baseline for these experiments, we use a randomly initialized encoder Enc(x). Since there are no tunable hyper-parameters in this case, we perform a grid search on the classifier hyper-parameters. Specifically, we choose weight decay from {0, 1e-5, 5e-5, 1e-4}, batch size from {128, 256}, and learning rate from {0.0001, 0.001}. This yields a total of 16 configurations. The test accuracy from these runs varied between 58.59% and 67.97%; we consider 67.97% as our baseline. The performance of the encoders under the aforementioned configurations is shown in figure 2. It is clear that both hyper-parameters, α and especially β, play an important role in the quality of the representations learned. Also, applying Neural Bayes-MIM at different scales of the network states significantly improves the average and best performance. Filter visualizations, convergence experiments and the effect of batch size are shown in appendix F.3.
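The multi-scale state construction described above can be sketched as follows (illustrative NumPy code; the function names are our own). Each state, pooled or not, is turned into a valid per-location posterior by a channel-wise Softmax:

```python
import numpy as np

def channel_softmax(h):
    """Softmax over channels at every spatial location; h has shape (B, C, H, W)."""
    e = np.exp(h - h.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def avg_pool_2x2(h):
    """2x2 spatial average pooling with stride 2 (assumes even H and W)."""
    B, C, H, W = h.shape
    return h.reshape(B, C, H // 2, 2, W // 2, 2).mean(axis=(3, 5))

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16, 4, 4))         # one internal hidden layer (toy sizes)
states = [channel_softmax(h),              # original scale
          channel_softmax(avg_pool_2x2(h))]  # 2x2 average-pooled scale

# Each spatial location of each state is a valid categorical p(z|x),
# so the Neural Bayes-MIM objective can be applied per location and averaged.
for s in states:
    assert np.allclose(s.sum(axis=1), 1.0)
print([s.shape for s in states])           # prints [(8, 16, 4, 4), (8, 16, 2, 2)]
```

In the full method this is repeated for every internal hidden layer, and the per-location objective values are averaged across locations and states.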

5.2.2. FINAL CLASSIFICATION PERFORMANCE

We compare the final test accuracy of Neural Bayes-MIM with two baselines: a random encoder (described in the ablation studies) and Deep InfoMax (Hjelm et al., 2019), on the benchmark image datasets CIFAR-10 and CIFAR-100 (Krizhevsky, 2009) and STL-10 (Coates et al., 2011). Random Network refers to the use of a randomly initialized network; its experimental details are identical to those in the ablation above, involving a hyper-parameter search over 16 configurations done for each dataset separately. DIM results are reported from Hjelm et al. (2019). We omit the STL-10 number for DIM because we resize images to a much smaller size of 32 × 32 in our runs instead of the 64 × 64 used in DIM. The following describes the experimental details for Neural Bayes-MIM. We use α = 2 and β = 4 (chosen roughly by examining figure 2), and MBS = 500, BS = 4000 in all the experiments. Note that these values are not tuned for STL-10 and CIFAR-100. For CIFAR-10 and STL-10 each, we run 4 configurations of Neural Bayes-MIM over the hyper-parameters learning rate ∈ {0.0001, 0.001} and weight decay ∈ {0, 0.00001}. For each run, we then train a 1-hidden-layer (200 units) classifier on top of the frozen features with learning rate ∈ {0.0001, 0.001}. We report the best performance over all runs. For CIFAR-100, we take the encoder that produces the best performance on CIFAR-10 and train a classifier with the 2 learning rates, reporting the better of the 2 runs. Table 1 reports the classification performance of all the methods. We note that all experiments were done with the CNN architecture without any data augmentation. Neural Bayes-MIM outperforms the baseline methods in general.

6. CONCLUSION

We proposed a parameterization method that can be used to express an arbitrary set of distributions p(x|z), p(z|x) and p(z) in closed form using a neural network with sufficient capacity, which can in turn be used to formulate new objective functions. We formulated two different objectives that use this parameterization which were aimed towards different goals of unsupervised learning-identification of disjoint manifolds in the support of continuous distributions and learning deep network features using the infomax principle. We focused on theoretical analysis for both the objectives and presented preliminary experiments supporting the theoretical claims.

APPENDIX A PROOFS FOR NEURAL BAYES-DMS (BINARY CASE)

Proposition 1. (Neural Bayes-DMS, proposition 1 in the main text) Let $L(x): \mathbb{R}^n \to [0, 1]$ be a non-parametric function for any given input $x \in \mathbb{R}^n$, and let $J$ be the Jensen-Shannon divergence. Define scalars $f_1(x) := \frac{L(x)}{\mathbb{E}_x[L(x)]}$ and $f_0(x) := \frac{1-L(x)}{1-\mathbb{E}_x[L(x)]}$. Then the objective in Eq. (2) is equivalent to

$$\max_L \ \frac{1}{2}\,\mathbb{E}_x\!\left[f_1(x)\log\frac{f_1(x)}{f_1(x)+f_0(x)}\right] + \frac{1}{2}\,\mathbb{E}_x\!\left[f_0(x)\log\frac{f_0(x)}{f_1(x)+f_0(x)}\right] + \log 2 \quad (11)$$
$$\text{s.t.} \quad \mathbb{E}_x[L(x)] \notin \{0, 1\}$$

Proof: Using the Neural Bayes parameterization from lemma 1 for the binary case, we set

$$q_1(x) := \frac{L(x) \cdot p(x)}{\mathbb{E}_{x\sim p(x)}[L(x)]}, \qquad q_0(x) := \frac{(1 - L(x)) \cdot p(x)}{1 - \mathbb{E}_{x\sim p(x)}[L(x)]}$$

These parameterizations automatically satisfy the constraints in Eq. (2). Finally, using the definition of the JS divergence, the maximization problem in Eq. (2) can be written as

$$\max_{q_0, q_1} \ \frac{1}{2} \int_x \left[ q_1(x) \log \frac{q_1(x)}{0.5 \cdot (q_0(x) + q_1(x))} + q_0(x) \log \frac{q_0(x)}{0.5 \cdot (q_0(x) + q_1(x))} \right]$$

Substituting $q_0$ and $q_1$ with their respective parameterizations and using the definitions of $f_0(x)$ and $f_1(x)$ completes the proof.

Theorem 1. (Optimality, theorem 1 in the main text) Let p(x) be a probability density function over $\mathbb{R}^n$ whose support is the union of two non-empty connected sets (definition 1) $S_1$ and $S_2$ that are disjoint, i.e., $S_1 \cap S_2 = \emptyset$. Let $L(x) \in [0, 1]$ belong to the class of continuous functions learned by solving the objective in Eq. (3). Then the objective in Eq. (3) is maximized if and only if one of the following is true:

$$L(x) = \begin{cases} 0 & \forall x \in S_1 \\ 1 & \forall x \in S_2 \end{cases} \qquad \text{or} \qquad L(x) = \begin{cases} 1 & \forall x \in S_1 \\ 0 & \forall x \in S_2 \end{cases}$$

Proof: The two cases in the theorem exist due to symmetry. Recall the definitions of $f_0(x)$ and $f_1(x)$ in Eq. (3):

$$f_1(x) := \frac{L(x)}{\mathbb{E}_x[L(x)]}, \qquad f_0(x) := \frac{1 - L(x)}{1 - \mathbb{E}_x[L(x)]}$$

where $L(x) \in [0, 1]$, and for a feasible $L(x)$, $\pi := \mathbb{E}_x[L(x)] \in (0, 1)$ due to the conditions specified in the theorem. Thus $f_1(x) \in [0, \frac{1}{\pi}]$ and $f_0(x) \in [0, \frac{1}{1-\pi}]$.
By design, the terms $\log\frac{f_1(x)}{f_1(x)+f_0(x)}$ and $\log\frac{f_0(x)}{f_1(x)+f_0(x)}$ are non-positive. Thus, for any $x \in S_1 \cup S_2$,

$$F(x) = f_1(x) \log \frac{f_1(x)}{f_1(x)+f_0(x)} + f_0(x) \log \frac{f_0(x)}{f_1(x)+f_0(x)} \quad (17)$$

is maximized only when $L(x) = 0$ or $L(x) = 1$, leading to $F(x) = 0$. Therefore, the objective in Eq. (3) is maximized by setting $L(x) = 0$ or $L(x) = 1$ $\forall x \in S_1 \cup S_2$. Finally, since $L(x)$ is a continuous function, there cannot exist $x_1, x_2 \in S_1$ such that $L(x_1) = 0$ and $L(x_2) = 1$. We prove this by contradiction. Suppose there exists a pair $(x_1, x_2)$ of this kind. Then along any continuous path connecting $x_1$ and $x_2$ within $S_1$ (such a path exists since $S_1$ is a connected set), there must exist a point where $L(x)$ is not continuous, since $L(x) \in \{0, 1\}$ $\forall x \in S_1 \cup S_2$ to satisfy the maximization condition. This is a contradiction. By symmetry, the same argument holds for $x_1, x_2 \in S_2$. Therefore one of the two cases mentioned in the theorem must be the optimal solution for $L(x)$ in Eq. (3), which proves the claim.

B NEURAL BAYES-DMS: EXTENSION TO MULTIPLE PARTITIONS

In order to extend our proposal to multiple partitions (say $K$), the idea is to find a conditional distribution $q_k$ ($k \in [K]$) corresponding to each of the $K$ partitions such that the divergence between the conditional distribution of every partition and the conditional distribution of the combined remaining partitions is maximized. Specifically, we propose the following primary objective,

$$\max_{\{q_k\}:\ \pi_k \neq 0\ \forall k \in [K]}\ \frac{1}{K}\sum_{k=1}^K J(q_k(x)\,\|\,\bar q_k(x)) \quad (18)$$
$$\text{s.t.}\quad \int_x q_k(x)\,dx = 1 \quad \forall k \in [K] \quad (19)$$
$$\sum_{k=1}^K q_k(x)\,\pi_k = p(x) \quad (20)$$
$$\sum_{k=1}^K \pi_k = 1 \quad (21)$$

where $\bar q_k(x)$ is the conditional distribution corresponding to the full data distribution excluding the partition defined by $q_k(x)$. Formally,

$$\bar q_k(x) := \frac{p(x) - q_k(x)\,\pi_k}{1 - \pi_k}. \quad (22)$$

The theorem below shows an equivalent way of solving the above objective.

Theorem 2. Let $L(x): \mathbb{R}^n \to \mathbb{R}^K_+$ be a non-parametric function for any given input $x \in \mathbb{R}^n$ with the property $\sum_{k=1}^K L_k(x) = 1\ \forall x$, and let $J$ be the Jensen-Shannon divergence. Define scalars $f_k(x) := \frac{L_k(x)}{\mathbb{E}_x[L_k(x)]}$ and $\bar f_k(x) := \frac{1 - L_k(x)}{1 - \mathbb{E}_x[L_k(x)]}$. Then the objective in Eq. (18) is equivalent to,

$$\max_L\ \frac{1}{2K}\,\mathbb{E}_x\!\left[\sum_{k=1}^K f_k(x)\log\frac{f_k(x)}{f_k(x)+\bar f_k(x)} + \bar f_k(x)\log\frac{\bar f_k(x)}{f_k(x)+\bar f_k(x)}\right] + \log 2 \quad (23)$$
$$\text{s.t.}\quad \mathbb{E}_x[L_k(x)] = \pi_k.$$

Here $L_k(x)$ denotes the $k$-th unit of $L(x)$.

Proof: Similar to proposition 1, the main idea is to parameterize $q_k$ and $\bar q_k$ as follows,

$$q_k(x) := \frac{L_k(x)\,p(x)}{\mathbb{E}_{x\sim p(x)}[L_k(x)]}, \qquad \bar q_k(x) := \frac{(1 - L_k(x))\,p(x)}{1 - \mathbb{E}_{x\sim p(x)}[L_k(x)]}. \quad (24)$$

To verify that these parameterizations are valid, note that,

$$\int_x q_k(x)\,dx = \int_x \frac{L_k(x)\,p(x)}{\mathbb{E}_{x\sim p(x)}[L_k(x)]}\,dx = 1. \quad (25)$$

Similarly, $\int_x \bar q_k(x)\,dx = 1$. To verify that the second constraint is satisfied, we use the above parameterization, substitute $\mathbb{E}_x[L_k(x)] = \pi_k$, and get,

$$\sum_{k=1}^K \frac{L_k(x)\,p(x)}{\pi_k}\cdot\pi_k = p(x)\sum_{k=1}^K L_k(x) = p(x), \quad (26\text{-}27)$$

where the last equality uses the definition of $L(x)$. Also notice that each $\pi_k \in [0, 1]$, and thus $\mathbb{E}_x[L_k(x)] = \pi_k$ is feasible for any arbitrary distribution $q_k(x)$ when $L_k(x) \geq 0$.
Finally, using the proposed parameterization we have,

$$\bar q_k(x) = \frac{p(x) - q_k(x)\,\pi_k}{1 - \pi_k} \quad (28)$$
$$= p(x)\cdot\frac{1 - \frac{L_k(x)}{\mathbb{E}_x[L_k(x)]}\,\pi_k}{1 - \pi_k} \quad (29)$$
$$= p(x)\cdot\frac{1 - L_k(x)}{1 - \mathbb{E}_x[L_k(x)]} \quad (30)$$
$$= \bar f_k(x)\,p(x), \quad (31)$$

where we have used the fact that $\mathbb{E}_x[L_k(x)] = \pi_k$. Using the definition of the JS divergence, the maximization problem in Eq. (18) can be written as,

$$\max_L\ \frac{1}{2K}\sum_{k=1}^K \int_x q_k(x)\log\frac{q_k(x)}{0.5\,(q_k(x)+\bar q_k(x))} + \bar q_k(x)\log\frac{\bar q_k(x)}{0.5\,(q_k(x)+\bar q_k(x))}\,dx. \quad (32)$$

Substituting $q_k$ and $\bar q_k$ with their respective parameterizations and using the definitions of $f_k(x)$ and $\bar f_k(x)$ completes the proof.

In terms of implementation, we simply use $K$ output units in the label-generating network $L_\theta$ while sharing the rest of the network, and apply a Softmax activation at the output layer to satisfy the properties of $L$ specified in the above theorem.
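As a numerical sanity check of Theorem 2, the $K$-partition objective can be sketched as follows (our own illustration; the averaging over partitions and the `eps` smoothing are implementation choices, not code from the paper):

```python
import numpy as np

def neural_bayes_dms_multiclass(L, eps=1e-12):
    """K-partition Neural Bayes-DMS objective (Eq. 23), averaged over partitions.
    L: array of shape (N, K) whose rows sum to 1 (e.g. softmax outputs)."""
    pi = L.mean(axis=0)                      # E_x[L_k(x)] per partition
    f = L / (pi + eps)                       # f_k(x)
    fbar = (1.0 - L) / (1.0 - pi + eps)      # \bar f_k(x)
    inner = f * np.log((f + eps) / (f + fbar + eps)) + \
            fbar * np.log((fbar + eps) / (f + fbar + eps))
    # one JS-divergence term per partition k, averaged over the K partitions
    return np.mean(0.5 * inner.mean(axis=0) + np.log(2.0))

# Hard one-hot labels on three equally sized manifolds achieve the maximum log 2,
# while the uniform labelling L_k = 1/K is uninformative and scores 0.
L_hard = np.repeat(np.eye(3), 30, axis=0)
L_uniform = np.full((90, 3), 1.0 / 3.0)
print(neural_bayes_dms_multiclass(L_hard))     # ~ log 2
print(neural_bayes_dms_multiclass(L_uniform))  # ~ 0
```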

C ADDITIONAL NEURAL BAYES-DMS EXPERIMENTS

We run an experiment on MNIST. We randomly split the training set into a 90%-10% training-validation split. In this experiment, we train a CNN with the following architecture: C(100, 3, 1, 0) - P(2, 2, 0, max) - C(100, 3, 1, 0) - C(200, 3, 1, 0) - P(2, 2, 0, max) - C(500, 3, 1, 0) - P(., ., ., avg) - FC(10). Here P(., ., ., avg) denotes that the entire spatial field is average-pooled to yield 1 × 1 height-width, and FC(10) denotes a fully connected layer with output dimension 10. Finally, Softmax is applied at the output and the network is trained using the Neural Bayes-DMS objective. We optimize the objective using Adam with learning rate 0.001, batch size 5000 and 0 weight decay for 100 epochs (other Adam hyper-parameters are kept standard). We use β = 1 for the smoothness regularization coefficient. Once this network is trained, we train a linear classifier on top of its 10-dimensional output using Adam with identical configurations, except that a batch size of 128 is used. We early-stop on the validation set of MNIST and report the test accuracy of that model. The classifier reaches 99.22% test accuracy. This experiment shows that the MNIST classes lie on nearly disjoint manifolds and that Neural Bayes-DMS can correctly label them. As a baseline, a linear classifier trained on features from a randomly initialized identical CNN architecture reaches 42.97%.

D GRADIENT COMPUTATION PROBLEM FOR THE $\mathbb{E}_x[L_\theta(x)]$ TERM

The Neural Bayes parameterization contains the term $\mathbb{E}_x[L_\theta(x)]$. Computing an unbiased gradient through this term is in general difficult without very large batch sizes, even though the quantity $\mathbb{E}_x[L_\theta(x)]$ itself may be estimated well from very few samples. For instance, consider the scalar function $\psi(t) = 1 + 0.01\sin(\omega t)$ in the regime $\omega \to \infty$. The quantity $\mathbb{E}[\psi(t)]$ can be estimated very accurately from even one sample. Further, $\mathbb{E}[\psi(t)] = 1$, hence $\frac{\partial \mathbb{E}[\psi(t)]}{\partial t} = 0$. However, when using a finite number of samples, the estimate of $\frac{\partial \mathbb{E}[\psi(t)]}{\partial t}$ can have very high variance due to improper cancellation of the gradient terms from individual samples. In the case of Neural Bayes-MIM we found that the gradients through the terms involving $\mathbb{E}_x[L_\theta(x)]$ were 0, which allows us to estimate gradients for this objective reliably in the mini-batch setting. But in general this may be challenging, and solving objectives that use the Neural Bayes parameterization may require a customized work-around for each objective.
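The $\psi(t)$ example above can be reproduced numerically; the sample sizes and the value of $\omega$ below are our own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
omega = 1e4                                         # a rapidly oscillating component
t = rng.uniform(0.0, 2 * np.pi, size=(1000, 10))    # 1000 mini-batches of 10 samples

psi = 1.0 + 0.01 * np.sin(omega * t)      # psi(t) = 1 + 0.01 sin(omega t)
grad = 0.01 * omega * np.cos(omega * t)   # d psi / d t, per sample

# The value E[psi] is estimated accurately even from tiny batches ...
print(psi.mean(axis=1).std())     # small: each 10-sample estimate is close to 1
# ... but the mini-batch gradient estimates have enormous variance, even though
# the true gradient of E[psi(t)] is 0 (per-sample gradients only cancel in expectation).
print(grad.mean(axis=1).std())    # large (on the order of 0.01 * omega / sqrt(2 * 10))
```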

E IMPLEMENTATION DETAILS OF THE NEURAL BAYES-MIM OBJECTIVE

We apply the Neural Bayes-MIM objective (Eq. 10) to all the hidden layers at different scales (using average pooling). We now discuss its implementation details. Consider the CNN architecture used in our experiments: C(200, 3, 1, 0) - P(2, 2, 0, max) - C(500, 3, 1, 0) - C(700, 3, 1, 0) - P(2, 2, 0, max) - C(1000, 3, 1, 0). Let $h^i$ ($i \in \{0, 1, 2, 3\}$) denote the 4 hidden-layer ReLU outputs after the 4 convolution layers. For an input of size 32 × 32 × 3, all these hidden states have height and width dimensions in addition to the channel dimension. For a mini-batch $B$, these hidden states are therefore 4-dimensional tensors; let the dimensions of the $i$-th state be $|B| \times C_i \times H_i \times W_i$ (batch size, channels, height, width). Denote by $\mathcal{S}$ the Softmax function applied along the channel dimension, and by $\mathcal{P}$ the pooling P(2, 2, 0, avg). Further, denote $h^i := \mathcal{P}(h^{i-4})$ ($i \in \{4, 5, 6, 7\}$) as the scaled versions of the original states computed by average pooling, and define the numbers $C_i, H_i, W_i$ accordingly. Then the total Neural Bayes-MIM objective for this architecture is given by,

$$\min_\theta\ -\frac{1}{|B|}\sum_{x\in B}\frac{1}{8}\sum_{i=0}^{7}\frac{1}{H_i W_i}\sum_{h,w=1}^{H_i, W_i}\sum_{k=1}^{C_i}\mathcal{S}(h^i_{k,h,w}(x))\log \mathcal{S}(h^i_{k,h,w}(x)) + (1+\alpha)\,R_p(\theta) + \beta\,R_c \quad (33)$$

where,

$$R_p(\theta) := -\frac{1}{8}\sum_{i=0}^{7}\frac{1}{H_i W_i}\sum_{h,w=1}^{H_i, W_i}\sum_{k=1}^{C_i}\left[\frac{1}{C_i}\log\!\left(\frac{1}{|B|}\sum_{x\in B}\mathcal{S}(h^i_{k,h,w}(x))\right) + \frac{C_i - 1}{C_i}\log\!\left(1 - \frac{1}{|B|}\sum_{x\in B}\mathcal{S}(h^i_{k,h,w}(x))\right)\right]$$

and,

$$R_c = \frac{1}{|B|}\sum_{x\in B}\frac{\left\|\mathcal{P}(h^3(x)) - \mathcal{P}(h^3(x + \zeta\cdot\hat\delta))\right\|^2}{\zeta^2}$$

where $\hat\delta := \frac{\delta}{\|\delta\|_2}$ is a normalized noise vector computed independently for each sample $x$ in the batch $B$ as,

$$\delta := Xv. \quad (36)$$

Here $X \in \mathbb{R}^{n\times |B|}$ is the matrix containing the batch of samples, and each dimension of $v \in \mathbb{R}^{|B|}$ is sampled i.i.d. from a standard Gaussian. This computation ensures that the perturbation lies in the span of the data. Finally, $\zeta$ is the scale of the normalized noise added to all samples in a batch.
In our experiments, since we always normalize the datasets to have zero mean and unit variance across all dimensions, we sample ζ ∼ N(0, 0.1²). Note that for the architecture used, $\mathcal{P}(h^3(x))$ results in an output with height and width equal to 1, so the output is effectively a 2D matrix of size $|B| \times C_3$. Finally, the gradient from each mini-batch is accumulated and averaged over multiple batches before updating the parameters, for a more accurate gradient estimate.
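Ignoring the smoothness term $R_c$ and the averaging over layers and scales, the per-layer computation in Eq. (33) can be sketched as follows (a simplified numpy sketch under our own assumptions, with spatial locations flattened into the batch axis and `eps` added for numerical stability):

```python
import numpy as np

def neural_bayes_mim_loss(S, alpha=2.0, eps=1e-12):
    """Per-layer Neural Bayes-MIM-v2 loss on a batch of softmax outputs S (B, C).
    For a conv feature map, spatial locations would be flattened into the batch axis."""
    B, C = S.shape
    # conditional-entropy term: encourages confident (near one-hot) outputs
    cond = -(S * np.log(S + eps)).sum(axis=1).mean()
    # cross-entropy uniform-prior regularizer R_p: encourages a uniform marginal
    m = S.mean(axis=0)                              # E_x[S_k(x)]
    R_p = -((1.0 / C) * np.log(m + eps)
            + ((C - 1.0) / C) * np.log(1.0 - m + eps)).sum()
    return cond + (1.0 + alpha) * R_p

# a confident, balanced labelling has lower loss than a collapsed one,
# because R_p penalizes a marginal that peaks on a single state
S_balanced = np.repeat(np.eye(4), 16, axis=0) * 0.96 + 0.01   # rows sum to 1
S_collapsed = np.full((64, 4), 0.01)
S_collapsed[:, 0] = 0.97                                      # rows sum to 1
print(neural_bayes_mim_loss(S_balanced) < neural_bayes_mim_loss(S_collapsed))  # True
```

In the paper's full objective this quantity is additionally averaged over the softmax outputs of every layer at two spatial scales, and the smoothness term $R_c$ is added.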

F ADDITIONAL ANALYSIS OF NEURAL BAYES-MIM

As discussed in the main text, the term

$$R_p^{v1}(\theta) := \sum_{k=1}^K \mathbb{E}_x[L_k(x)]\log\mathbb{E}_x[L_k(x)]$$

acts as a uniform prior encouraging the representations to be distributed. However, gradients are much stronger when $\mathbb{E}_x[L_k(x)]$ approaches 1 for the alternative cross-entropy formulation,

$$R_p^{v2}(\theta) := -\sum_{k=1}^K\left[\frac{1}{K}\log\mathbb{E}_x[L_k(x)] + \frac{K-1}{K}\log\!\left(1 - \mathbb{E}_x[L_k(x)]\right)\right].$$

To see this, note that the gradient of $R_p^{v1}(\theta)$ is given by,

$$\frac{\partial R_p^{v1}(\theta)}{\partial\theta} = \sum_{k=1}^K \frac{\partial\mathbb{E}_x[L^\theta_k(x)]}{\partial\theta}\log\mathbb{E}_x[L^\theta_k(x)] + \sum_{k=1}^K \frac{\partial\mathbb{E}_x[L^\theta_k(x)]}{\partial\theta} \quad (39)$$
$$= \sum_{k=1}^K \frac{\partial\mathbb{E}_x[L^\theta_k(x)]}{\partial\theta}\log\mathbb{E}_x[L^\theta_k(x)] + \frac{\partial\,\mathbb{E}_x\!\left[\sum_{k=1}^K L^\theta_k(x)\right]}{\partial\theta} \quad (40)$$
$$= \sum_{k=1}^K \frac{\partial\mathbb{E}_x[L^\theta_k(x)]}{\partial\theta}\log\mathbb{E}_x[L^\theta_k(x)], \quad (41)$$

where the last equality holds due to the linearity of expectation and because $\sum_{k=1}^K L^\theta_k(x) = 1$ by design. On the other hand, the gradient of $R_p^{v2}(\theta)$ is given by,

$$\frac{\partial R_p^{v2}(\theta)}{\partial\theta} = -\sum_{k=1}^K\left[\frac{1}{K}\cdot\frac{1}{\mathbb{E}_x[L_k(x)]} - \frac{K-1}{K}\cdot\frac{1}{1 - \mathbb{E}_x[L_k(x)]}\right]\frac{\partial\mathbb{E}_x[L^\theta_k(x)]}{\partial\theta}. \quad (42)$$

When the representation being learned is such that the marginal $p(z)$ peaks along a single state $k$, i.e., $\mathbb{E}_x[L_k(x)] \to 1$ (making the representation degenerate), the gradient of the $k$-th term for v1 is given by,

$$\frac{\partial\mathbb{E}_x[L^\theta_k(x)]}{\partial\theta}\log\mathbb{E}_x[L^\theta_k(x)] \approx 0, \quad (43)$$

while that for v2 is given by,

$$-\left[\frac{1}{K}\cdot\frac{1}{\mathbb{E}_x[L_k(x)]} - \frac{K-1}{K}\cdot\frac{1}{1 - \mathbb{E}_x[L_k(x)]}\right]\frac{\partial\mathbb{E}_x[L^\theta_k(x)]}{\partial\theta} \approx \lim_{c\to 0}\frac{1}{c}\cdot\frac{K-1}{K}\cdot\frac{\partial\mathbb{E}_x[L^\theta_k(x)]}{\partial\theta}, \quad (44)$$

whose magnitude approaches infinity as $\mathbb{E}_x[L_k(x)] \to 1$. Thus $R_p^{v2}(\theta)$ is beneficial in terms of gradient strength. In figure 3 it can be seen that most filters learned with v1 are dead; we tried other configurations as well without any change in the outcome. Since the v1 and v2 objectives differ only in the formulation of the uniform prior regularization, we believe that v1 leads to dead filters because of the weak gradients from its regularization term, as explained above. In the second set of experiments, we train many models using the Neural Bayes-MIM-v1 and Neural Bayes-MIM-v2 objectives separately with different hyper-parameter configurations, similar to the setting of figure 2. The performance scatter plot is shown in figure 6.
We find that Neural Bayes-MIM-v2 has better average and best performance compared with Neural Bayes-MIM-v1. 
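The difference in gradient strength can be checked numerically: the coefficients multiplying $\partial\mathbb{E}_x[L_k]/\partial\theta$ in Eqs. (41) and (42) behave very differently as the marginal collapses. A small sketch (our own choice of $K$ and marginal value):

```python
import numpy as np

def coeff_v1(m):
    # coefficient of dE[L_k]/dtheta in the gradient of R_p^v1 (Eq. 41)
    return np.log(m)

def coeff_v2(m, K):
    # coefficient of dE[L_k]/dtheta in the gradient of R_p^v2 (Eq. 42)
    return -(1.0 / K) / m + ((K - 1.0) / K) / (1.0 - m)

K, m = 10, 0.999   # marginal E[L_k] collapsing onto a single state
print(abs(coeff_v1(m)))      # ~ 0.001: v1 barely pushes back on the collapse
print(abs(coeff_v2(m, K)))   # ~ 900:  v2 pushes back hard
```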

F.3 ADDITIONAL EXPERIMENTS

Visualization of Filters: We visualize the filters learned by the Neural Bayes-MIM objective on MNIST digits and qualitatively study the effects of the regularizations used. For this we train a deep fully connected network with 3 hidden layers, each of width 500, using Adam with learning rate 0.001, batch size 500 and 0 weight decay for 50 epochs (other Adam hyper-parameters are kept standard). We train three configurations: 1. α = 0, β = 4; 2. α = 4, β = 0; 3. α = 4, β = 4. The learned filters are shown in figure 4. We find that the uniform prior regularization (α > 0) prevents dead filters, while the smoothness regularization (β > 0) prevents input memorization. Accuracy vs Epochs: Finally, we plot the evolution of accuracy over epochs for all the models learned in the experiments of figure 2. For Neural Bayes-MIM we use the models with scaling (42 in total), and all 16 models for the random encoder. The convergence plot (figure 5) shows that models pre-trained with Neural Bayes-MIM quickly converge to a higher test accuracy compared to the baseline. Effect of Mini-batch size (MBS) and Batch size (BS): During implementation, we proposed to compute gradients using a reasonably large mini-batch of size MBS and accumulate gradients until BS samples are seen. This is done to overcome the gradient estimation problem due to the $\mathbb{E}_x[L_k(x)]$ term in Neural Bayes-MIM. Here we evaluate the effect of these two hyper-parameters on the final test performance. We choose MBS from {50, 100, 250, 500} and BS from {50, 250, 500, 2000, 3000}. For each combination of MBS and BS, we train the CNN encoder using Neural Bayes-MIM with α = 2 and β = 4 (chosen by examining figure 2); the rest of the training settings are kept identical to those used for the figure 2 experiment. Table 2 shows the final test accuracy on CIFAR-10 for each combination of the hyper-parameters MBS and BS. We make two observations: 1. using very small MBS (e.g. 50 and 100) typically results in poor performance (even worse than that of a random encoder (67.97%)), while larger MBS significantly improves performance; 2. using a larger BS further improves performance in most cases (even when MBS is small).
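The accumulation scheme itself can be sketched as follows (a minimal sketch with a stand-in `grad_fn`; in the actual experiments the per-mini-batch gradient of the Neural Bayes-MIM loss plays this role):

```python
import numpy as np

def accumulated_gradient(grad_fn, data, mbs, bs):
    """Average mini-batch 'gradients' of size MBS over consecutive batches
    until BS samples have been seen, then return the accumulated estimate."""
    total, seen = 0.0, 0
    while seen < bs:
        batch = data[seen:seen + mbs]
        total += grad_fn(batch) * len(batch)   # weight each batch by its size
        seen += len(batch)
    return total / bs   # the parameter update would use this estimate

# with a linear statistic the accumulated estimate matches the full-BS value
data = np.arange(100.0)
g = accumulated_gradient(np.mean, data, mbs=10, bs=50)
print(g)   # 24.5, the mean of the first 50 samples
```

For Neural Bayes-MIM the per-batch gradient depends nonlinearly on $\mathbb{E}_x[L_k(x)]$, which is why a sufficiently large MBS is still needed within each batch; accumulation over BS then reduces the remaining variance.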

G PROOF OF LEMMA 1

Lemma 1. Let $p(x|z=k)$ and $p(z)$ be any conditional and marginal distributions defined for a continuous random variable $x$ and a discrete random variable $z$. If $\mathbb{E}_{x\sim p(x)}[L_k(x)] \neq 0\ \forall k \in [K]$, then there exists a non-parametric function $L(x): \mathbb{R}^n \to \mathbb{R}^K_+$ for any given input $x \in \mathbb{R}^n$ with the property $\sum_{k=1}^K L_k(x) = 1\ \forall x$ such that,

$$p(x|z=k) = \frac{L_k(x)\,p(x)}{\mathbb{E}_{x\sim p(x)}[L_k(x)]}, \qquad p(z=k) = \mathbb{E}_x[L_k(x)], \qquad p(z=k|x) = L_k(x),$$

and this parameterization is consistent.

Proof: First we show existence. Notice that there exists a non-parametric function $g_k(x) := \frac{p(x|z=k)}{p(x)}$ defined for all $x \in \mathrm{supp}(p(x))$. Denote $G_k(x) = p(z=k)\,g_k(x)$. Then,

$$\mathbb{E}_x[G_k(x)] = \mathbb{E}_x[p(z=k)\,g_k(x)] = p(z=k) \quad (46)$$

and,

$$\frac{G_k(x)}{\mathbb{E}_x[G_k(x)]} = \frac{p(z=k)\,g_k(x)}{p(z=k)} = \frac{p(x|z=k)}{p(x)}.$$

Thus $L_k := G_k$ works. Consistency follows by directly verifying normalization of each $p(x|z=k)$, the marginalization identity $\sum_{k=1}^K p(x|z=k)\,p(z=k) = p(x)$, and Bayes' rule $p(z=k|x) = L_k(x)$.

Using the Neural Bayes parameterization, the mutual information between $x$ and $z$ can be written in closed form:

$$MI(x, z) = \mathbb{E}_{x\sim p(x)}\left[\sum_{k=1}^K L_k(x)\log\frac{L_k(x)}{\mathbb{E}_x[L_k(x)]}\right] \quad (60)$$

Therefore the two objectives are equivalent and we have a closed-form estimate of mutual information. Given that $z^*$ is a maximizer of $MI(x, z)$, since $L$ is a non-parametric function, there exists $L^*$ such that $p(z^*=k|x) = L^*_k(x)$ due to lemma 1.

Theorem 3. (Theorem 1 in main text) Denote,

$$J(\theta) = -\mathbb{E}_x\left[\sum_{k=1}^K L^\theta_k(x)\log\frac{L^\theta_k(x)}{\mathbb{E}_x[L^\theta_k(x)]}\right], \qquad \hat J(\theta) = -\mathbb{E}_x\left[\sum_{k=1}^K L^\theta_k(x)\log\frac{\overline{L^\theta_k(x)}}{\overline{\mathbb{E}_x[L^\theta_k(x)]}}\right],$$

where $\overline{(\cdot)}$ denotes that gradients are not computed through the argument. Then $\frac{\partial J(\theta)}{\partial\theta} = \frac{\partial \hat J(\theta)}{\partial\theta}$.

Proof: We note that,

$$J(\theta) = -\mathbb{E}_x\left[\sum_{k=1}^K L^\theta_k(x)\log L^\theta_k(x)\right] + \mathbb{E}_x\left[\sum_{k=1}^K L^\theta_k(x)\log\mathbb{E}_x[L^\theta_k(x)]\right]. \quad (64\text{-}65)$$

Denote the first term by $T_1$.
Then, by the chain rule,

$$\frac{\partial T_1}{\partial\theta} = -\mathbb{E}_x\left[\sum_{k=1}^K \frac{\partial L^\theta_k(x)}{\partial\theta}\log L^\theta_k(x)\right] - \mathbb{E}_x\left[\sum_{k=1}^K \frac{L^\theta_k(x)}{L^\theta_k(x)}\cdot\frac{\partial L^\theta_k(x)}{\partial\theta}\right] \quad (66)$$
$$= -\mathbb{E}_x\left[\sum_{k=1}^K \frac{\partial L^\theta_k(x)}{\partial\theta}\log L^\theta_k(x)\right] - \mathbb{E}_x\left[\frac{\partial\sum_{k=1}^K L^\theta_k(x)}{\partial\theta}\right] \quad (67\text{-}68)$$
$$= -\mathbb{E}_x\left[\sum_{k=1}^K \frac{\partial L^\theta_k(x)}{\partial\theta}\log L^\theta_k(x)\right], \quad (69)$$

where the last equality holds because $\sum_{k=1}^K L^\theta_k(x) = 1$ by design, so its gradient vanishes. Now denote the second term by $T_2$. Then, again by the chain rule,

$$\frac{\partial T_2}{\partial\theta} = \mathbb{E}_x\left[\sum_{k=1}^K \frac{\partial L^\theta_k(x)}{\partial\theta}\log\mathbb{E}_x[L^\theta_k(x)]\right] + \sum_{k=1}^K \frac{\mathbb{E}_x[L^\theta_k(x)]}{\mathbb{E}_x[L^\theta_k(x)]}\cdot\frac{\partial\mathbb{E}_x[L^\theta_k(x)]}{\partial\theta} \quad (70\text{-}71)$$
$$= \mathbb{E}_x\left[\sum_{k=1}^K \frac{\partial L^\theta_k(x)}{\partial\theta}\log\mathbb{E}_x[L^\theta_k(x)]\right] + \mathbb{E}_x\left[\frac{\partial\sum_{k=1}^K L^\theta_k(x)}{\partial\theta}\right] \quad (72\text{-}73)$$
$$= \mathbb{E}_x\left[\sum_{k=1}^K \frac{\partial L^\theta_k(x)}{\partial\theta}\log\mathbb{E}_x[L^\theta_k(x)]\right], \quad (74)$$

where once again the last equality holds due to the linearity of expectation and because $\sum_{k=1}^K L^\theta_k(x) = 1$ by design. Thus the gradient of $J$ is given by,

$$\frac{\partial J(\theta)}{\partial\theta} = \frac{\partial T_1}{\partial\theta} + \frac{\partial T_2}{\partial\theta} = -\mathbb{E}_x\left[\sum_{k=1}^K \frac{\partial L^\theta_k(x)}{\partial\theta}\log\frac{L^\theta_k(x)}{\mathbb{E}_x[L^\theta_k(x)]}\right], \quad (75)$$

which is the same as $\frac{\partial \hat J(\theta)}{\partial\theta}$. This concludes the proof.
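Theorem 3 can also be verified numerically with finite differences on a toy one-parameter model (our own construction: binary soft labels $L_1 = \sigma(\theta x)$): freezing the log term at the current parameter value leaves the gradient unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
t0, h = 0.7, 1e-5

def L(theta):
    """Binary soft labels: L_1 = sigmoid(theta * x), L_0 = 1 - L_1, shape (N, 2)."""
    p = 1.0 / (1.0 + np.exp(-theta * x))
    return np.stack([1.0 - p, p], axis=1)

def J(theta, theta_log):
    """-E_x[sum_k L_k log(L_k / E[L_k])]; theorem 3 freezes the log at theta_log."""
    Lt, Ls = L(theta), L(theta_log)
    return -(Lt * np.log(Ls / Ls.mean(axis=0))).sum(axis=1).mean()

dJ     = (J(t0 + h, t0 + h) - J(t0 - h, t0 - h)) / (2 * h)  # gradient of J
dJ_hat = (J(t0 + h, t0)     - J(t0 - h, t0))     / (2 * h)  # log term detached
print(abs(dJ - dJ_hat))   # ~ 0: the two gradients coincide, as theorem 3 claims
```

This is exactly why, in practice, one can detach (stop-gradient) the log term of the Neural Bayes-MIM objective without changing the optimization.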



We use the following shorthand for a) conv layer: C(number of filters, filter size, stride, padding); b) pooling: P(kernel size, stride, padding, pool mode).



Figure 1: Neural Bayes-DMS network prediction on synthetic high dimensional data (lying on a 2D plane) projected back to 2 dimensions for visualization. The different colors denote the label predicted by the network L θ (.) thresholded at 0.5. The darker shades of colors denote predictions made at training data points while the lighter shades denote predictions on the rest of the space.

Figure 2: The effect of Neural Bayes-MIM hyper-parameters on the final test performance for CIFAR-10. A CNN encoder is trained using Neural Bayes-MIM with different configurations: hyper-parameters α, β and the scaling of states on which the objective is applied. A one-hidden-layer classifier (with 200 units) is then trained using labels on these frozen features to get the final test accuracy. The plots show that performance is significantly worse when α = β = 0 and no scaling is used, showing their important role as regularizers. The best performing model reaches 83.59%. The black dotted line is the baseline performance (67.97%) when a randomly initialized network (with identical architecture) is used as the encoder. Green and blue dotted lines are the averages of all the green and blue points respectively.

F.1 GRADIENT STRENGTH OF UNIFORM PRIOR IN NEURAL BAYES-MIM-V1 (EQ 8) VS NEURAL BAYES-MIM-V2 (EQ 9)

Figure 3: MNIST filters learned using Neural Bayes-MIM-v1 objective (Eq 8) using the configuration α = 4, β = 4. Majority of filters are dead. For comparison with filters from Neural Bayes-MIM-v2, see figure 4 (bottom).

F.2 EMPIRICAL COMPARISON BETWEEN NEURAL BAYES-MIM-V1 (EQ 8) AND NEURAL BAYES-MIM-V2 (EQ 9)

To empirically understand the difference in behavior of the Neural Bayes-MIM objective v1 vs v2, we first plot the filters learned by the v1 objective and compare them with those learned by the v2 objective. The filters learned by the v1 objective are shown in figure 3 using the configuration α = 4, β = 4.

Figure 4: MNIST filters learned using Neural Bayes-MIM objective Eq 10. Majority of filters are dead when regularization coefficient α = 0. Filters memorize input samples when regularization coefficient β = 0. Using both regularization terms results in filters that mainly capture parts of inputs which are good for distributed representation.

Figure 5: Mean (solid line) and standard deviation (error band) of accuracy evolution during MLP classifier training on CIFAR-10 using the Neural Bayes-MIM encoder (blue) and a random encoder (green). Features learned by Neural Bayes-MIM allow training to start at a much higher value (∼65% on average) and converge faster compared with a random encoder (∼47% on average).

Figure 6: Performance of Neural Bayes-MIM-v1 vs Neural Bayes-MIM-v2. A CNN encoder is trained using Neural Bayes-MIM with different configurations: hyper-parameters α, β and the scaling of states on which the objective is applied. A one-hidden-layer classifier (with 200 units) is then trained using labels on these frozen features to get the final test accuracy. Green and blue dotted lines are the averages of all the green and blue points respectively. Neural Bayes-MIM-v2 has better average and best performance compared with Neural Bayes-MIM-v1.

the condition $\mathbb{E}_{x\sim p(x)}[L_k(x)] \neq 0\ \forall k \in [K]$. Secondly, we note that,

$$\sum_{k=1}^K p(x|z=k)\,p(z=k) = \sum_{k=1}^K \frac{L_k(x)\,p(x)}{\mathbb{E}_x[L_k(x)]}\cdot\mathbb{E}_x[L_k(x)] = p(x),$$

where the last equality is due to the condition $\sum_{k=1}^K L_k(x) = 1\ \forall x$. Thirdly, we have from Bayes' rule:

$$p(z=k|x) = \frac{p(x|z=k)\,p(z=k)}{p(x)} \quad (54)$$
$$= \frac{L_k(x)\,p(x)}{\mathbb{E}_{x\sim p(x)}[L_k(x)]}\cdot\frac{\mathbb{E}_{x\sim p(x)}[L_k(x)]}{p(x)} \quad (55)$$
$$= L_k(x), \quad (56)$$

where the second equality holds because of the existence and consistency proofs of $p(x|z=k) := \frac{L_k(x)\,p(x)}{\mathbb{E}_{x\sim p(x)}[L_k(x)]}$ and $p(z=k) := \mathbb{E}_x[L_k(x)]$ shown above.

H PROOFS FOR NEURAL BAYES-MIM

Proposition 2. (Neural Bayes-MIM-v1, proposition 1 in main text) Let $L(x): \mathbb{R}^n \to \mathbb{R}^K_+$ be a non-parametric function for any given input $x \in \mathbb{R}^n$ with the property $\sum_{k=1}^K L_k(x) = 1\ \forall x$. Consider the objective

$$\max_L\ \mathbb{E}_{x\sim p(x)}\left[\sum_{k=1}^K L_k(x)\log\frac{L_k(x)}{\mathbb{E}_x[L_k(x)]}\right].$$

Then at its optimum, $L^*_k(x) = p(z^*=k|x)$, where $z^* \in \arg\max_z MI(x, z)$.

Proof: Using the Neural Bayes parameterization in lemma 1, we have,

$$MI(x, z) = \sum_{k=1}^K \int_x L_k(x)\,p(x)\log\frac{L_k(x)}{\mathbb{E}_x[L_k(x)]}\,dx.$$
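Lemma 1's consistency claims can be sanity-checked numerically by replacing the continuous $x$ with a finite support, so that expectations become sums (our own construction, with arbitrary random choices of $p(x)$ and $L$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 6, 3                               # x supported on 6 points, K latent states
p_x = rng.dirichlet(np.ones(n))           # an arbitrary marginal p(x)
L = rng.dirichlet(np.ones(K), size=n)     # L(x): rows sum to 1, entries positive

p_z = (p_x[:, None] * L).sum(axis=0)      # p(z=k) = E_x[L_k(x)]
p_x_given_z = L * p_x[:, None] / p_z      # p(x|z=k) = L_k(x) p(x) / E_x[L_k(x)]

# consistency checks from lemma 1
assert np.allclose(p_x_given_z.sum(axis=0), 1.0)           # each p(.|z=k) normalizes
assert np.allclose((p_x_given_z * p_z).sum(axis=1), p_x)   # sum_k p(x|z=k)p(z=k) = p(x)
posterior = p_x_given_z * p_z / p_x[:, None]               # Bayes' rule
assert np.allclose(posterior, L)                           # p(z=k|x) = L_k(x)
```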

Table 1: Classification performance of a one-hidden-layer MLP classifier trained on frozen features from the mentioned encoder models (trained using Neural Bayes-MIM) and datasets (no data augmentation was used in the experiments we ran). STL-10 was resized to 32 × 32 in our runs instead of 64 × 64 as in DIM due to memory restrictions. The numbers reported from the DIM paper are their best (omitting STL-10 due to the difference in image size).

Table 2: Effect of mini-batch size (MBS) and batch size (BS) on the final test accuracy of Neural Bayes-MIM on CIFAR-10. Gradients are computed using batches of size MBS and accumulated until BS samples are seen before each parameter update. As expected, a sufficiently large MBS is needed for computing high-fidelity gradients due to the $\mathbb{E}_x[L_\theta(x)]$ term.

