A GENERIC PARAMETERIZATION METHOD FOR UNSUPERVISED LEARNING

Abstract

We introduce a parameterization method called Neural Bayes which allows computing statistical quantities that are, in general, difficult to compute, and opens avenues for formulating new objectives for unsupervised representation learning. Specifically, given an observed random variable x and a latent discrete variable z, we can express p(x|z), p(z|x) and p(z) in closed form in terms of a sufficiently expressive function (e.g., a neural network) using our parameterization, without restricting the class of these distributions. To demonstrate its usefulness, we develop two independent use cases for this parameterization: 1. Disjoint Manifold Separation: Neural Bayes allows us to formulate an objective that optimally labels samples from disjoint manifolds present in the support of a continuous distribution. This can be seen as a specific form of clustering in which each disjoint manifold in the support is a separate cluster. We design clustering tasks that obey this formulation and empirically show that the model optimally labels the disjoint manifolds. 2. Mutual Information Maximization (MIM): MIM has become a popular means for self-supervised representation learning. Neural Bayes allows us to compute the mutual information between an observed random variable x and a latent discrete random variable z in closed form. We use this for learning image representations and show its usefulness on downstream classification tasks.

1. INTRODUCTION

We introduce a generic parameterization called Neural Bayes that facilitates unsupervised learning from unlabeled data by categorizing it. Specifically, our parameterization implicitly maps samples from an observed random variable x to a latent discrete space z in which the distribution p(x) gets segmented into a finite number of arbitrary conditional distributions. Imposing different conditions on the latent space z through different objective functions results in learning qualitatively different representations. Our parameterization may be used to compute statistical quantities involving observed and latent variables that are in general difficult to compute (thanks to the discrete latent space), thus providing a flexible framework for unsupervised learning. To illustrate this, we develop two independent use cases for the parameterization: disjoint manifold separation (DMS) and mutual information maximization (Linsker, 1988), as described in the abstract. For the manifold separation task, we show experiments on 2D datasets and their high-dimensional counterparts designed as per the problem formulation, and show that the proposed objective optimally labels disjoint manifolds. For the MIM task, we experiment with benchmark image datasets and show that the unsupervised representations learned by the network achieve performance on downstream classification tasks comparable with a closely related MIM method, Deep InfoMax (DIM, Hjelm et al. (2019)). For both objectives we design regularizations necessary to achieve the desired behavior in practice. All proofs can be found in the appendix.

2. RELATED WORK

Neural Bayes-DMS: Numerous recent papers have proposed clustering algorithms for unsupervised representation learning, such as Deep Clustering (Caron et al., 2018), information-based clustering (Ji et al., 2019), Spectral Clustering (Shaham et al., 2018), Associative Deep Clustering (Haeusser et al., 2018), etc. Our goal with regard to clustering in Neural Bayes-DMS is in general different from that of such methods. Our objective is aimed at finding disjoint manifolds in the support of a distribution. It is therefore a generalization of traditional subspace clustering methods (Ma et al., 2008; Liu et al., 2010), where the goal is to find disjoint affine subspaces, to arbitrary manifolds. Another class of clustering algorithms comprises mixture models (e.g., Gaussian mixture models). Our clustering proposal (DMS) is novel compared to this class in two ways: 1. we formulate the clustering problem as that of identifying disjoint manifolds in the support of a distribution, which is different from assuming K ground-truth clusters, where the notion of a cluster is ill-defined; 2. the DMS objective in Proposition 1 is itself novel, and we prove its optimality for labeling disjoint manifolds in Theorem 1. Neural Bayes-MIM: Self-supervised representation learning has attracted a lot of attention in recent years. Currently, contrastive learning methods and similar variants (such as MoCo (He et al., 2020), SimCLR (Chen et al., 2020), and BYOL (Grill et al., 2020)) produce state-of-the-art (SOTA) performance on downstream classification tasks. These methods make use of handcrafted image augmentations that exploit priors such as the fact that class information is typically associated with object shape and is location invariant. However, since we specifically develop an easier alternative to DIM (which also maximizes the mutual information between the input and the latent representations), for a fair comparison we compare the performance of our Neural Bayes-MIM algorithm for representation learning only with DIM.
We leave the extension of the Neural Bayes-MIM algorithm with data augmentation techniques and other advanced regularizations similar to Bachman et al. (2019) as future work. Our experiments show that our proposal performs comparably to or slightly better than DIM. The main advantage of our proposal over DIM is that it offers a closed-form estimate of MI due to the discrete latent variables. We note that the principle of mutual information maximization for representation learning was introduced in Linsker (1988) and Bell & Sejnowski (1995), and since then a number of self-supervised methods involving MIM have been proposed. Vincent et al. (2010) showed that auto-encoder based methods achieve this goal implicitly by minimizing the reconstruction error of the input samples under an isotropic Gaussian assumption. Deep InfoMax (DIM, Hjelm et al. (2019)) uses estimators like MINE (Belghazi et al., 2018) and noise-contrastive estimation (NCE, Gutmann & Hyvärinen (2010)) to estimate MI and maximize it for both local and global features in convolutional networks. Contrastive Predictive Coding (Oord et al., 2018) is another approach that maximizes MI, by predicting the activations of a layer from the layer above using NCE. We also point out that the estimate of mutual information due to the Neural Bayes parameterization in the Neural Bayes-MIM-v1 objective (Eq. 8) turns out to be identical to the one proposed in IMSAT (Hu et al., 2017). However, there are important differences: 1. we introduce regularizations which significantly improve the performance of the representations on downstream tasks compared to IMSAT (cf. Table 3 in Hu et al. (2017)); 2. we provide theoretical justification for the parameterization used (Lemma 1) and show in Theorem 2 why it is feasible to compute high-fidelity gradients for this objective in the mini-batch setting even though it contains the term E_x[L_k(x)], whereas the justification used in IMSAT is that optimizing using mini-batches is equivalent to optimizing an upper bound of the original objective; 3. we perform extensive ablation studies exposing the importance of the introduced regularizations.
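To make the closed-form aspect concrete: when z is discrete and a network outputs posteriors p(z|x), the mutual information decomposes as I(x; z) = H(z) − H(z|x), both terms of which are directly computable from mini-batch softmax outputs. The sketch below illustrates this standard decomposition only; the actual Neural Bayes-MIM objective (Eq. 8) and its regularizations are given later in the paper, and the function name and toy inputs here are our own illustrative choices.

```python
import numpy as np

def mutual_information(post, eps=1e-12):
    """I(x; z) = H(z) - H(z|x) for a discrete z, from posteriors p(z|x).

    `post` is a (batch, K) array whose rows are softmax outputs,
    interpreted as p(z = k | x) for each sample in the mini-batch.
    """
    p_z = post.mean(axis=0)                                   # p(z = k) = E_x[p(z = k | x)]
    h_z = -np.sum(p_z * np.log(p_z + eps))                    # marginal entropy H(z)
    h_z_given_x = -np.mean(np.sum(post * np.log(post + eps), axis=1))
    return h_z - h_z_given_x

# Confident posteriors spread evenly over 4 states give MI close to log 4,
# while uniform posteriors carry no information about z.
confident = np.eye(4)[np.tile(np.arange(4), 64)]              # one-hot rows, balanced
uniform = np.full((256, 4), 0.25)
assert mutual_information(confident) > 1.0
assert mutual_information(uniform) < 1e-6
```

Maximizing this quantity encourages posteriors that are confident per sample (low H(z|x)) yet balanced across the batch (high H(z)), which is the intuition behind MIM with a discrete latent code.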

3. NEURAL BAYES

Consider a data distribution p(x) from which we have access to i.i.d. samples x ∈ R^n. We assume that this marginal distribution is a union of K conditionals, where the k-th conditional's density is denoted by p(x|z = k) ∈ R_+ and the corresponding probability mass is denoted by p(z = k) ∈ R_+. Here z is a discrete random variable with K states. We now introduce the parameterization that allows us to implicitly factorize any marginal distribution into conditionals as described above. Aside from the technical details, the key idea behind this parameterization is Bayes' rule. Lemma 1. Let p(x|z = k) and p(z) be any conditional and marginal distributions defined for a continuous random variable x and a discrete random variable z. If E_{x∼p(x)}[L_k(x)] ≠ 0 ∀k ∈ [K], then there exists a non-parametric function L(x) : R^n → R_+^K for any given input x ∈ R^n with the
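In practice, the role of L(x) can be played by a neural network with a K-way softmax output, so that its k-th output behaves like a posterior p(z = k|x) and the batch average of the outputs estimates E_x[L_k(x)]. The sketch below is a minimal illustration under these assumptions; the toy two-layer architecture, dimensions, and variable names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n, batch = 3, 2, 256            # latent states, input dim, mini-batch size

# A toy two-layer softmax network standing in for the function L(x).
W1, b1 = rng.normal(size=(n, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, K)), np.zeros(K)

def L(x):
    """Rows are non-negative and sum to 1 over the K discrete states."""
    h = np.maximum(x @ W1 + b1, 0.0)                  # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)           # softmax output

x = rng.normal(size=(batch, n))                       # i.i.d. samples from p(x)
post = L(x)                                           # per-sample outputs L_k(x)
p_z = post.mean(axis=0)                               # batch estimate of E_x[L_k(x)]

assert np.allclose(post.sum(axis=1), 1.0)             # valid distribution over z per sample
assert np.isclose(p_z.sum(), 1.0)                     # the K estimates also sum to 1
```

Because the softmax outputs are strictly positive, the batch averages are bounded away from zero, consistent with the non-degeneracy condition E_x[L_k(x)] ≠ 0 required by the lemma.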

