DEMI: DISCRIMINATIVE ESTIMATOR OF MUTUAL INFORMATION

Abstract

Estimating mutual information between continuous random variables is often intractable and extremely challenging for high-dimensional data. Recent progress has leveraged neural networks to optimize variational lower bounds on mutual information. Although showing promise for this difficult problem, the variational methods have been shown, both theoretically and empirically, to suffer from serious statistical limitations: 1) many methods struggle to produce accurate estimates when the underlying mutual information is either low or high; 2) the resulting estimators may suffer from high variance. In contrast, our approach trains a classifier that provides the probability that a data sample pair is drawn from the joint distribution rather than from the product of its marginal distributions. We establish a direct connection between mutual information and the average log odds estimate produced by the classifier on a test set, leading to a simple and accurate estimator of mutual information. We show theoretically that our method and other variational approaches are equivalent when they achieve their optimum, while our method sidesteps the variational bound. Empirical results demonstrate high accuracy of our approach and the advantages of our estimator in the context of representation learning.

1. INTRODUCTION

Mutual information (MI) measures the information that two random variables share. MI quantifies the statistical dependency, linear and non-linear, between two variables. This property has made MI a crucial measure in machine learning. In particular, recent work in unsupervised representation learning has built on optimizing MI between latent representations and observations (Chen et al., 2016; Zhao et al., 2018; Oord et al., 2018; Hjelm et al., 2018; Tishby & Zaslavsky, 2015; Alemi et al., 2018; Ver Steeg & Galstyan, 2014). Maximization of MI has long been a default method for multi-modality image registration (Maes et al., 1997), especially in medical applications (Wells III et al., 1996), though in most of that work the dimensionality of the random variables is very low; there, coordinate transformations on images are varied to maximize their MI. Estimating MI from finite data samples is challenging and is intractable for most continuous probability distributions. Traditional MI estimators (Suzuki et al., 2008; Darbellay & Vajda, 1999; Kraskov et al., 2004; Gao et al., 2015) do not scale well to modern machine learning problems with high-dimensional data. This impediment has motivated the construction of variational bounds on MI (Nguyen et al., 2010; Barber & Agakov, 2003); in recent years this has led to maximization procedures that use deep learning architectures to parameterize the space of functions, exploiting the expressive power of neural networks (Song & Ermon, 2019; Belghazi et al., 2018; Oord et al., 2018; Mukherjee et al., 2020). Unfortunately, optimizing lower bounds on MI has serious statistical limitations. Specifically, McAllester & Stratos (2020) showed that any high-confidence distribution-free lower bound cannot exceed O(log N), where N is the number of samples. This implies that if the underlying MI is high, it cannot be accurately and reliably estimated by variational methods such as MINE (Belghazi et al., 2018).
Song & Ermon (2019) further categorized the state-of-the-art variational methods into "generative" and "discriminative" approaches, depending on whether they estimate probability densities or density ratios. They showed that "generative" approaches perform poorly when the underlying MI is small and "discriminative" approaches perform poorly when it is large; moreover, certain approaches such as MINE (Belghazi et al., 2018) are prone to high variance. We propose a simple discriminative approach that avoids the limitations of previous discriminative methods based on variational bounds. Instead of estimating densities or attempting to predict one data variable from another, our method estimates the likelihood that a sample is drawn from the joint distribution versus the product of the marginal distributions. A similar classifier-based approach was used by Lopez-Paz & Oquab (2017) for "two-sample testing": hypothesis tests about whether two samples are from the same distribution or not. If the two distributions are the joint and the product of the marginals, then the test is a test of independence. A generalization of this work was used by Sen et al. (2017) to test for conditional independence. We show that accurate performance on this classification task provides an estimate of the log odds. This can greatly simplify MI estimation in comparison with generative approaches: estimating a single likelihood ratio may be easier than estimating three distributions (the joint and the two marginals). Moreover, classification tasks are generally amenable to deep learning, while density estimation remains challenging in many cases. Our approach also avoids estimating the partition function, which induces large variance in most discriminative methods (Song & Ermon, 2019). Our empirical results bear out these conceptual advantages.
Our approach, like other sampling-based methods such as MINE, augments the given joint/paired data with derived "unpaired" data that captures the product of the marginal distributions p(x)p(y). The unpaired data can be synthesized via permutations or resampling of the paired data. This construction, which synthesizes unpaired data and then defines a metric that encourages paired data points to map closer together than unpaired ones in the latent space, has previously been used in other machine learning applications, such as audio-video and image-text joint representation learning (Harwath et al., 2016; Chauhan et al., 2020). Recent contrastive learning approaches (Tian et al., 2019; Hénaff et al., 2019; Chen et al., 2020; He et al., 2020) further leverage a machine learning model to differentiate paired from unpaired data, mostly in the context of unsupervised representation learning. Simonovsky et al. (2016) used paired and unpaired data in conjunction with a classifier-based loss function for patch-based image registration. This paper is organized as follows. In Section 2, we derive our approach to estimating MI. Section 2.4 discusses connections to related approaches, including MINE. This is followed by empirical evaluation in Section 3. Our experimental results on synthetic and real image data demonstrate the advantages of the proposed discriminative classification-based MI estimator, which achieves higher accuracy than state-of-the-art variational approaches and a good bias/variance tradeoff.

2. METHODS

Let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ be two random variables generated by a joint distribution $p : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^+$. Mutual information (MI)
$$I(x; y) \triangleq \mathbb{E}_{p(x,y)}\left[\log \frac{p(x, y)}{p(x)p(y)}\right] \tag{1}$$
is a measure of dependence between $x$ and $y$. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ be a set of $n$ independent identically distributed (i.i.d.) samples from $p(x, y)$. The law of large numbers implies
$$\hat{I}_p(\mathcal{D}) \triangleq \frac{1}{n} \sum_{i=1}^{n} \log \frac{p(x_i, y_i)}{p(x_i)p(y_i)} \to I(x; y) \quad \text{as } n \to \infty, \tag{2}$$
which suggests a simple estimation strategy via sampling. Unfortunately, the joint distribution $p(x, y)$ is often unknown and therefore the estimate in Eq. (2) cannot be computed explicitly. Here we develop an approach to accurately approximating the estimate $\hat{I}_p(\mathcal{D})$ based on discriminative learning. In our development, we will find it convenient to define a Bernoulli random variable $z \in \{0, 1\}$ and to "lift" the distribution $p(x, y)$ to the product space $\mathcal{X} \times \mathcal{Y} \times \{0, 1\}$. We thus define a family of distributions parameterized by $\alpha \in (0, 1)$ as follows:
$$p^*(x, y \mid z = 1; \alpha) = p(x, y), \tag{3}$$
$$p^*(x, y \mid z = 0; \alpha) = p(x)p(y), \tag{4}$$
$$p^*(z = 1; \alpha) = 1 - p^*(z = 0; \alpha) = \alpha. \tag{5}$$
Using Bayes' rule, we obtain
$$\frac{p^*(z = 1 \mid x, y)}{p^*(z = 0 \mid x, y)} = \frac{p^*(x, y, z = 1)}{p^*(x, y, z = 0)} = \frac{p^*(x, y \mid z = 1)\, p^*(z = 1)}{p^*(x, y \mid z = 0)\, p^*(z = 0)} = \frac{p(x, y)}{p(x)p(y)} \cdot \frac{\alpha}{1 - \alpha}, \tag{6}$$
which implies that the estimate in (2) can be alternatively expressed as
$$\hat{I}_p = \frac{1}{n} \sum_{i=1}^{n} \left[ \log \frac{p^*(z = 1 \mid x_i, y_i)}{p^*(z = 0 \mid x_i, y_i)} - \log \frac{\alpha}{1 - \alpha} \right] \tag{7}$$
$$= \frac{1}{n} \sum_{i=1}^{n} \operatorname{logit}\left[p^*(z = 1 \mid x_i, y_i)\right] - \operatorname{logit}[\alpha], \tag{8}$$
where $\operatorname{logit}[u] \triangleq \log \frac{u}{1-u}$ is the log-odds function. Our key idea is to approximate the latent posterior distribution $p^*(z = 1 \mid x, y)$ by a classifier that is trained to distinguish between the joint distribution $p(x, y)$ and the product distribution $p(x)p(y)$, as described below.
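As a sanity check on the identity in Eq. (8), the following sketch (our own illustration, not part of the paper) uses a one-dimensional Gaussian pair, for which the density ratio, and hence the Bayes-optimal posterior $p^*(z = 1 \mid x, y)$, is available in closed form; averaging the posterior log-odds recovers the closed-form MI $-\tfrac{1}{2}\log(1 - \rho^2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, alpha, n = 0.8, 0.5, 200_000

# Sample pairs from the joint p(x, y): a standard bivariate Gaussian.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Closed-form log density ratio log[p(x,y) / (p(x)p(y))].
log_ratio = (-0.5 * (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
             - 0.5 * np.log(1 - rho**2) + 0.5 * (x**2 + y**2))

# Bayes-optimal posterior p*(z=1 | x, y) for prior alpha, via Eq. (6).
logit_alpha = np.log(alpha / (1 - alpha))
posterior = 1.0 / (1.0 + np.exp(-(log_ratio + logit_alpha)))

# Eq. (8): mean logit of the posterior, minus logit(alpha), estimates MI.
mi_hat = (np.log(posterior / (1 - posterior)) - logit_alpha).mean()
true_mi = -0.5 * np.log(1 - rho**2)   # about 0.511 nats for rho = 0.8
```

The round trip through the posterior is exactly the mechanism DEMI relies on: a perfectly calibrated classifier carries the log likelihood ratio in its logits.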

2.1. TRAINING SET CONSTRUCTION

We assume that we have access to a large collection D of i.i.d. samples (x, y) from p(x, y), and define p(x, y; D), p(x; D), and p(y; D) to be the empirical joint and marginal distributions, respectively, induced by the data set D. We construct the training set T = {(x_j, y_j, z_j)} of m i.i.d. samples from our empirical approximation to the distribution p*(x, y, z). Each sample is generated independently of all others as follows. First, a value z_j ∈ {0, 1} is sampled from the prior distribution p*(z) in (5). If z_j = 1, then a pair (x_j, y_j) is sampled randomly from the empirical joint distribution p(x, y; D); otherwise, the value x_j is sampled randomly from the empirical marginal distribution p(x; D) and the value y_j is sampled randomly from the empirical marginal distribution p(y; D), independently of x_j. This sampling is easy to implement: it simply samples an element from the set of unique values in the original collection D, with frequencies adjusted to account for repeated appearances of the same value. It is straightforward to verify that any individual sample in the training set T is generated from the distribution p*(x, y, z), up to the sampling of D. When D is small, the samples may reflect idiosyncrasies of that particular data set rather than the underlying distribution; however, the empirical distribution induced by the set T converges to p*(x, y, z) as the size of the available data D and the size m of the training set T become large.
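The sampling procedure above can be sketched in a few lines of NumPy (a minimal illustration; the function name and interface are our own):

```python
import numpy as np

def make_training_set(x, y, m, alpha=0.5, rng=None):
    """Draw m i.i.d. triples (x_j, y_j, z_j) from the empirical
    approximation of the lifted distribution p*(x, y, z).

    x, y : arrays holding the n paired observations from D.
    z=1  : (x_j, y_j) is a pair from the empirical joint p(x, y; D).
    z=0  : x_j and y_j are drawn independently from the empirical
           marginals p(x; D) and p(y; D).
    """
    rng = rng or np.random.default_rng()
    n = len(x)
    z = (rng.random(m) < alpha).astype(int)        # z_j ~ Bernoulli(alpha)
    ix = rng.integers(0, n, size=m)                # index for x_j
    iy = np.where(z == 1, ix, rng.integers(0, n, size=m))  # independent y when z=0
    return x[ix], y[iy], z

# Tiny usage example with a toy paired data set D.
rng = np.random.default_rng(0)
xs = np.arange(1000.0)
ys = 2.0 * xs                                      # deterministic pairing
tx, ty, tz = make_training_set(xs, ys, m=5000, alpha=0.5, rng=rng)
```

Resampling indices with replacement automatically matches the empirical frequencies of repeated values, which is the frequency adjustment mentioned above.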

2.2. CLASSIFIER TRAINING FOR MUTUAL INFORMATION ESTIMATION

Let $q(z = 1 \mid x, y; \theta, \mathcal{T})$ be a (binary) classifier parameterized by $\theta$ and derived from the training set $\mathcal{T}$. If $q(z = 1 \mid x, y; \theta, \mathcal{T})$ accurately approximates the posterior distribution $p^*(z = 1 \mid x, y; \alpha)$, then we can use this classifier $q$ in place of $p^*(z = 1 \mid x, y; \alpha)$ in (8) to estimate MI. We follow the widely used maximum likelihood approach to estimating the classifier's parameters $\theta$ and form the cross-entropy loss function
$$\ell(\theta; \mathcal{T}) = -\frac{1}{m} \sum_{j=1}^{m} \log q(z_j \mid x_j, y_j; \theta, \mathcal{T}) \tag{9}$$
$$= -\frac{1}{m} \sum_{j=1}^{m} \left[ z_j \log q(z = 1 \mid x_j, y_j; \theta, \mathcal{T}) + (1 - z_j) \log\left(1 - q(z = 1 \mid x_j, y_j; \theta, \mathcal{T})\right) \right] \tag{10}$$
to be minimized to determine the optimal value of the parameters $\theta$. Once the optimization is completed, we form the estimate
$$\hat{I}_q(\mathcal{D}, \mathcal{T}) = \frac{1}{n} \sum_{i=1}^{n} \operatorname{logit}\left[q(z = 1 \mid x_i, y_i; \hat{\theta}, \mathcal{T})\right] - \operatorname{logit}[\alpha] \tag{11}$$
that approximates the estimate in (8). Note that the estimate is computed using the data set $\mathcal{D}$, which is distinct from the training set $\mathcal{T}$.
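Putting the pieces together, here is an end-to-end sketch of the estimator on a one-dimensional Gaussian pair (our illustration, with assumptions spelled out). It substitutes a simple logistic-regression classifier, fit by Newton's method on hand-picked quadratic features that happen to suffice for Gaussian data, for the neural network classifier used in the paper; with alpha = 0.5 we have logit[alpha] = 0, so Eq. (11) reduces to the mean logit on held-out paired data:

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.7                        # true MI = -0.5*log(1 - rho^2), about 0.347 nats
n_train, n_test = 50_000, 20_000

def draw_pairs(n):
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    return x, y

# Training set T with alpha = 0.5: n_train pairs from the joint (z=1) and
# n_train pairs with y permuted to break the pairing (z=0).
x, y = draw_pairs(n_train)
y_shuf = rng.permutation(y)
feats = lambda a, b: np.stack([a * b, a * a, b * b, np.ones_like(a)], axis=1)
X = np.vstack([feats(x, y), feats(x, y_shuf)])
z = np.concatenate([np.ones(n_train), np.zeros(n_train)])

# Fit logistic regression by Newton's method (IRLS); for jointly Gaussian
# data the Bayes-optimal logit is exactly linear in these features.
w = np.zeros(X.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -35.0, 35.0)))
    H = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
    w += np.linalg.solve(H, X.T @ (z - p))

# Eq. (11) on held-out *paired* data; logit(alpha) = 0 for alpha = 0.5.
xt, yt = draw_pairs(n_test)
mi_hat = (feats(xt, yt) @ w).mean()
true_mi = -0.5 * np.log(1 - rho**2)
```

In the paper's experiments the hand-chosen features are replaced by an MLP, but the training loss (cross-entropy) and the final averaging step are the same.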

2.3. ASYMPTOTIC ANALYSIS

As the size of the available data $\mathcal{D}$ and the size $m$ of the training set $\mathcal{T}$ increase to infinity, the law of large numbers implies
$$\ell(\theta; \mathcal{T}) \to -\mathbb{E}_{p^*(x,y,z)}\left[\log q(z \mid x, y; \theta, \mathcal{T})\right], \tag{12}$$
and therefore
$$\hat{\theta} \triangleq \arg\min_{\theta} \ell(\theta; \mathcal{T}) \to \arg\max_{\theta} \mathbb{E}_{p^*(x,y,z)}\left[\log q(z \mid x, y; \theta, \mathcal{T})\right]. \tag{13}$$
Thus, when the model capacity of the family $q(z \mid x, y; \theta)$ is large enough to include the original distribution $p^*(z \mid x, y)$, Gibbs' inequality implies $q(z \mid x, y; \hat{\theta}, \mathcal{T}) \to p^*(z \mid x, y)$ and $\hat{I}_q(\mathcal{D}, \mathcal{T}) \to I(x; y)$ as both the training data and testing data grow.

2.4. CONNECTIONS TO OTHER MUTUAL INFORMATION ESTIMATORS

MINE and SMILE. Belghazi et al. (2018) introduced the Mutual Information Neural Estimation (MINE) method, wherein they proposed learning a neural network $f(x, y; \theta)$ that maximizes the objective function
$$J(f) = \mathbb{E}_{p(x,y)}\left[f(x, y; \theta)\right] - \log \mathbb{E}_{p(x)p(y)}\left[e^{f(x,y;\theta)}\right], \tag{14}$$
which is the Donsker-Varadhan (DV) lower bound for the Kullback-Leibler (KL) divergence. For analysis purposes, we define
$$q(x, y; \theta) \triangleq \frac{1}{Z} e^{f(x,y;\theta)} p(x)p(y), \qquad Z = \mathbb{E}_{p(x)p(y)}\left[e^{f(x,y;\theta)}\right].$$
By substituting into the definition of $J(\cdot)$ and invoking Gibbs' inequality, we obtain
$$J(f) = \mathbb{E}_{p(x,y)}\left[\log q(x, y; \theta)\right] - \mathbb{E}_{p(x,y)}\left[\log p(x)p(y)\right] \tag{15}$$
$$\leq \mathbb{E}_{p(x,y)}\left[\log p(x, y)\right] - \mathbb{E}_{p(x,y)}\left[\log p(x)p(y)\right] = I(x; y), \tag{16}$$
with equality if and only if $q(x, y; \theta) \equiv p(x, y)$, i.e.,
$$f(x, y) = \log \frac{p(x, y)}{p(x)p(y)} + C, \tag{17}$$
where $C$ is a constant that is absorbed into the partition function $Z$. Thus the objective function is a lower bound on MI and is maximized when the otherwise unconstrained "statistics network" $f(x, y)$ equals, up to an additive constant, the log likelihood ratio of the joint distribution and the product of the marginals.

Song & Ermon (2019) introduced the Smoothed Mutual Information Lower-bound Estimator (SMILE), a modification of the MINE estimator. To alleviate the high variance of $f(x, y)$ in practice, the tilting factor $e^{f(x,y)}$ is clipped to the interval $[e^{-\tau}, e^{\tau}]$ for a tuned hyper-parameter $\tau$. As $\tau \to \infty$, SMILE estimates converge to those produced by MINE.

The log likelihood ratio of the joint versus the product of the marginals, which the $f(x, y)$ network of both these methods approximates, is also the optimal classifier function for the task defined on our training set $\mathcal{T}$ above. Our parameterization of this ratio makes use of a classifier and the logit transformation. While analytically equivalent, the MINE and SMILE optimization procedures must instead search over ratio functions directly, optimizing $f(x, y)$ to approximate $\log[p(x, y)/(p(x)p(y))]$ itself. Our experimental results demonstrate the advantage of using our estimator in (11).

CPC. Oord et al. (2018) proposed the contrastive predictive coding (CPC) method, which also maximizes a lower bound,
$$J(f) = \mathbb{E}_{p(x,y)}\left[\frac{1}{N} \sum_{i=1}^{N} \log \frac{f(x_i, y_i; \theta)}{\sum_{j=1}^{N} f(x_i, y_j; \theta)}\right] + \log N \leq I(x; y), \tag{18}$$
where $f(x, y; \theta)$ is a positive-valued neural network and $N$ is the batch size. CPC is not capable of estimating a high underlying MI accurately: its estimates are upper-bounded by $\log N$, which grows only logarithmically with the batch size.
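The log N ceiling of the CPC/InfoNCE bound can be checked numerically. The sketch below (our illustration, assuming Gaussian data) evaluates the bound with the optimal critic, namely the exact density ratio available in closed form for Gaussians, so any shortfall reflects the bound itself rather than imperfect learning:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, d, N = 0.9, 10, 64
true_mi = -0.5 * np.log(1 - rho**2) * d        # about 8.3 nats, well above log(64)

def log_ratio(x, y):
    # Per-sample log[p(x,y)/(p(x)p(y))] for d independent Gaussian pairs.
    r = (-0.5 * (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
         - 0.5 * np.log(1 - rho**2) + 0.5 * (x**2 + y**2))
    return r.sum(axis=-1)

ests = []
for _ in range(100):
    x = rng.standard_normal((N, d))
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal((N, d))
    F = log_ratio(x[:, None, :], y[None, :, :])    # critic matrix, shape (N, N)
    # InfoNCE: mean_i [F_ii - logsumexp_j F_ij + log N]; each summand <= log N.
    m = F.max(axis=1, keepdims=True)
    lse = np.log(np.exp(F - m).sum(axis=1)) + m[:, 0]
    ests.append((np.diag(F) - lse).mean() + np.log(N))

infonce = float(np.mean(ests))
```

Because logsumexp over a row is at least the diagonal entry, every batch estimate is bounded by log N = log 64, about 4.16 nats, no matter how large the true MI is.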
In our approach, we do not estimate the likelihood ratio directly; instead, we construct an auxiliary variable and "lift" the joint distribution, leveraging the power of a discriminative neural network classifier. The logit transformation of the classifier response approximates the log likelihood ratio in Eq. (1).

CCMI. Mukherjee et al. (2020) recently proposed a classifier-based (conditional) MI estimator (CCMI). The classifier $g(x, y; \theta)$ is trained on paired and unpaired sample pairs to yield the posterior probability that the joint distribution $p(x, y)$ (rather than the product of marginals $p(x)p(y)$) generated a sample pair $(x, y)$. Unlike DEMI, the CCMI estimator
$$\hat{I}(x, y) = \mathbb{E}_{p(x,y)}\left[\operatorname{logit}\left[g(x, y)\right]\right] - \log \mathbb{E}_{p(x)p(y)}\left[\frac{g(x, y)}{1 - g(x, y)}\right] \tag{19}$$
still relies on the variational lower bound of Belghazi et al. (2018). The first term above employs paired samples and is identical to our estimator in Eq. (11) for $g(x, y) \triangleq q(z = 1 \mid x, y; \theta, \mathcal{T})$ and $\alpha = 0.5$. The second term depends on the unpaired samples and is asymptotically zero. Thus CCMI and DEMI are asymptotically equivalent, but for finite sample sizes CCMI is prone to higher error than DEMI, as we demonstrate empirically later in the paper.

3. EXPERIMENTS

We employ two setups widely used in prior work (Song & Ermon, 2019; Belghazi et al., 2018; Poole et al., 2019; Hjelm et al., 2018) to evaluate the proposed estimator and to compare it to state-of-the-art approaches for estimating MI. In particular, we directly evaluate the accuracy of the resulting estimates on synthetic examples where the true value of MI can be derived analytically, and we also compare the methods' performance on a representation learning task where the goal is to maximize MI. Additional experiments that investigate self-consistency and long-run training behavior are reported in Appendices A and B, respectively.

3.1. MI ESTIMATION

We sample jointly Gaussian variables x and y with known correlation and thus known MI values, which enables us to measure the accuracy of MI estimators trained on this data. We vary the dimensionality of x and y (20-d, 50-d, and 100-d), the underlying true MI, and the size of the training set (32K, 80K, and 160K samples) in order to characterize the relative behaviors of different MI estimators. In an additional experiment, we apply an element-wise cubic transformation (y_i → y_i^3) to generate non-linear dependencies in the data. Since invertible deterministic transformations of x and y preserve MI, we retain access to ground-truth MI values in this setup. For each training set, we generate a separate set of 10240 samples held out for testing/estimating MI. We generate 10 independently drawn training and test sets for each pair of correlated Gaussian variables. We assess the following estimators in this experiment:
• DEMI, the proposed method, with three settings of the parameter α ∈ {0.25, 0.5, 0.75} in Eq. (5).
• SMILE (Song & Ermon, 2019), with three settings of the clipping parameter τ ∈ {1.0, 5.0, ∞}. The τ = ∞ case (i.e., no clipping) is equivalent to the MINE (Belghazi et al., 2018) objective.
• InfoNCE (Oord et al., 2018), the objective used for contrastive predictive coding (CPC).
• CCMI (Mukherjee et al., 2020).
• A generative model (GM), i.e., directly approximating log p(x, y) and the marginals log p(x) and log p(y) using a flow network. We note that it is difficult to make the parameterizations of GM flow networks comparable to those of the other methods; additionally, because the "base" flow distribution is a Gaussian, these networks have a structural advantage in our synthetic tests. They are, in a sense, correctly specified for the Gaussian case, which would rarely happen with real data.
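The synthetic data generation described above can be sketched as follows (helper names are our own). The per-dimension correlation is chosen so that the d independent coordinate pairs carry the desired total MI, and the element-wise cubic map is invertible, so it preserves MI:

```python
import numpy as np

def corr_for_mi(mi_target, d):
    # MI of d independent Gaussian pairs, each with correlation rho:
    # I = -d/2 * log(1 - rho^2)  =>  rho = sqrt(1 - exp(-2*I/d)).
    return np.sqrt(1.0 - np.exp(-2.0 * mi_target / d))

def sample_pairs(n, d, mi_target, cubic=False, rng=None):
    rng = rng or np.random.default_rng()
    rho = corr_for_mi(mi_target, d)
    x = rng.standard_normal((n, d))
    y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal((n, d))
    if cubic:
        y = y ** 3          # invertible element-wise map: MI is preserved
    return x, y

# Example: a 20-d test set with a ground-truth MI of 10 nats.
x, y = sample_pairs(10_240, 20, mi_target=10.0, rng=np.random.default_rng(0))
```

Sweeping mi_target and d over the grids in the text reproduces the kind of ground-truth ladder used to score each estimator.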
Each estimator uses the same neural network architecture: a multi-layer perceptron with an initial concatenation layer for the x and y inputs, followed by two fully connected layers with ReLU activations and a single output. The final layer uses a linear output for MINE, SMILE, InfoNCE, and CCMI, and a logistic output for DEMI. We use 256 hidden units in each of the fully connected layers. For the GM flow network we use the RealNVP scheme (Dinh et al., 2016), which includes a "transformation block" of two 256-unit fully connected layers, each with ReLU activations; this block outputs two element-wise parameters, a scale and a shift, and is repeated three times. We train each MI estimator for 20 epochs with a mini-batch size of 64. We employ the Adam optimizer with a learning rate of 0.0005. The architecture choices and optimization settings above are comparable with Song & Ermon (2019). Results. In all experiments, InfoNCE substantially underestimated MI. This is due to its log-batch-size (log N) maximum, which saturates quickly relative to the actual mutual information in these regimes. The limited-training-data setup leads to increased errors for CCMI, shown in Appendix C, which is consistent with our analysis in Section 2.4. Overall, for Gaussian variables, the GM method performed very well. This is somewhat expected, as the base distribution of its flow network is itself a Gaussian. This trend begins to fall off at higher MI values in the 100-d case. For the cubic case, however, the GM method performs quite poorly, perhaps due to the increased model flexibility required for the transformed distribution. For 20-d Gaussian variables, MINE and SMILE with both parameter settings overestimated MI in comparison to DEMI, which provided estimates fairly close to the ground-truth values. Appendix B further investigates this behavior.
For the 50-d joint Gaussian case, DEMI again produced accurate estimates of MI, while MINE and SMILE underestimated MI substantially. For the 100-d joint Gaussian case, all approaches underestimated MI, with DEMI and CCMI performing best. For 20-d joint Gaussians with a cubic transformation, all approaches underestimated MI, with SMILE (τ = 5) and DEMI performing best. For the 50-d and 100-d cubic cases, all approaches underestimated MI, with DEMI performing best. In summary, DEMI performed best, or very close to the best baseline, in all the experiments. It was furthermore not sensitive to the setting of its parameter α, and its performance relative to the other MI estimators held up as the training data size decreased.

3.2. REPRESENTATION LEARNING

Our second experiment demonstrates the viability of DEMI as the differentiable loss in a representation learning task. Specifically, we train an encoder on the CIFAR10 and CIFAR100 (Krizhevsky et al., 2009) data sets using the Deep InfoMax (Hjelm et al., 2018) criterion. Deep InfoMax learns representations by maximizing mutual information between local features of an input and the output of an encoder, and by matching the representations to a prior distribution. To evaluate the effectiveness of different MI estimators, we include only MI maximization in the representation learning objective, without prior matching. We compare DEMI with MINE, SMILE, InfoNCE, and JSD (Hjelm et al., 2018) for MI estimation and maximization as required by Deep InfoMax. As discussed in Hjelm et al. (2018), evaluation of the quality of a representation is case-driven and relies on various proxies. We use classification as a proxy to evaluate the representations: we use Deep InfoMax to train an encoder and learn representations from a data set without class labels, then freeze the weights of the encoder and train a small fully connected neural network classifier that takes the representation as input. We use the classification accuracy as a proxy for the quality of the learned representation and thus of the MI estimator. We build two separate classifiers, one on the last convolutional layer (conv(256,4,4)) and one on the following fully connected layer (fc(1024)), for classification evaluation of the representations, similar to the setup in Hjelm et al. (2018). The size of the input images is 32 × 32 and the encoder has the same architecture as the one in Hjelm et al. (2018).

4. CONCLUSION

We described a simple approach for estimating MI from joint data, based on a neural network classifier trained to distinguish whether a sample pair is drawn from the joint distribution or from the product of its marginals. The resulting estimator is the average, over joint data, of the logit transform of the classifier responses. Theoretically, the estimator converges to MI as the data sizes grow to infinity, provided the neural network capacity is large enough to contain the true conditional probability. The accuracy of our estimator is governed by the ability of the classifier to predict the true posterior probability of items in the test set, which in turn depends on (i) the number of training sample pairs and (ii) the capacity of the neural network used in training. Thus the quality of our estimates is subject to the classical issues of model capacity and overfitting in deep learning. We leave the theoretical analysis, which is closely related to the classifier's convergence to the true separating boundary for a general hypothesis class, for future work. We discussed close connections between our approach and the lower-bound approaches MINE, SMILE, and InfoNCE (CPC). Unlike the difference-of-entropies (DoE) estimator described in McAllester & Stratos (2020), our approach does not rely on assumed distributions. We also demonstrated empirical advantages of our approach over state-of-the-art MI estimation methods on synthetic and real image data. Given its simplicity and promising performance, we believe that DEMI is a good candidate for use in research that optimizes MI for representation learning.

A SELF-CONSISTENCY TESTS

We assess and compare the MI estimators using the self-consistency tests proposed in Song & Ermon (2019). We perform the tests on MNIST images (LeCun et al., 2010). The self-consistency tests examine several important properties that a "useful" MI estimator should have, because optimizing MI matters more for many downstream machine learning applications than estimating its exact value. The tests examine: 1) the capability of detecting independence, 2) monotonicity under data processing, and 3) additivity. We thus perform the following experiments, where the MNIST image set induces a data distribution and each MNIST image is a realization of a random variable that follows this data distribution:
• MI estimation between an MNIST image and a row-masked image. Given an MNIST image X, we mask out the bottom rows and keep the top t rows of the image, which creates Y = h(X; t). The estimated MI Î(X, Y) should be zero or very close to zero when X and Y are independent; accordingly, Î(X, Y) should be close to 0 when t is small and be non-decreasing in t. We normalize this measurement by the final value at t = 28 (all rows), which should carry the maximum information.
• MI estimation between two identical MNIST images and two row-masked images. Given an MNIST image X, we create two row-masked images, Y1 = h(X; t1) and Y2 = h(X; t2), where t1 = t2 + 3. Since additional data processing should not increase mutual information, Î([X, X], [Y1, Y2]) / Î(X, Y1) should be close to 1.
• MI estimation between two MNIST images and two row-masked images. We randomly select two MNIST images and concatenate them into [X1, X2], and mask the same number of rows in each: [h(X1; t), h(X2; t)] = [Y1, Y2]. The ratio Î([X1, X2], [Y1, Y2]) / Î(X1, Y1) should be close to 2.
We have 60k MNIST images or concatenated images for training and a test set of 10k images. We train each MI estimator for 100 epochs with a mini-batch size of 64.
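The masking operation h(X; t) used in all three tests, and the concatenated inputs for the additivity test, can be sketched as follows (our illustration; array shapes assume 28 × 28 MNIST images):

```python
import numpy as np

def h(img, t):
    """Keep the top t rows of an image and zero out the rest: Y = h(X; t)."""
    out = np.zeros_like(img)
    out[:t] = img[:t]
    return out

# The test inputs for a pair of 28x28 images X1, X2 and a row budget t.
rng = np.random.default_rng(0)
X1, X2 = rng.random((2, 28, 28))
t = 10
Y1, Y2 = h(X1, t), h(X2, t)                    # same mask on both images
pair_in = np.concatenate([X1, X2], axis=1)     # [X1, X2] for the additivity test
pair_out = np.concatenate([Y1, Y2], axis=1)
```

For the data-processing test, the two masked views of the same image are produced with offsets t1 = t2 + 3, i.e., h(X, t2 + 3) and h(X, t2).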
For all methods, we concatenate the inputs, then convolve with a 5 × 5 kernel with stride 2 and 64 output channels, then apply a fully connected layer with 1024 hidden units, which maps to a single output. ReLU is applied after all but the last layer. Results. In Figure 2 we plot the results of the three self-consistency tests for each method ("monotonicity" is the top panel, "data processing" the middle, and "additivity" the bottom). In general, most methods perform well on the first test (monotonicity), with the exception of SMILE (τ = ∞), which exhibits characteristically high variance. The other settings of SMILE, as well as DEMI, are relatively well behaved, though overall InfoNCE performs best. On the second test (data processing), all methods perform well, again aside from the variance of SMILE (τ = ∞). On the third test, where the optimal value is a constant 2, SMILE (τ = 1) exhibits a large bump in the center, while both InfoNCE and DEMI converge to 1 overall; no method performs optimally. In general, InfoNCE performs best on the first two tests, but DEMI and SMILE (τ = 1) also do well on two of the three.

B LONG-RUN TRAINING BEHAVIOR OF SMILE

As shown in Section 3, SMILE somewhat overestimates the MI for the 20-dimensional Gaussian case in high-MI regimes (∼30 nats or more). This did not occur at lower MI values or in higher dimensions. Further investigation showed this problem to increase as training went on; to illustrate this, we set up a new experiment on the 20-dimensional Gaussian case. We ran each setting of SMILE (τ = 1, 5, ∞) for 100000 training steps with batch size 64, drawing samples directly from the generating distributions, so that the training set has effectively unlimited size. We did this for three ground-truth MI values of 10, 20, and 30 nats. For comparison, we also ran the proposed method through the same procedure. This setup mirrors the experiment of Figure 1 in Section 6.1 of Song & Ermon (2019), using their provided code and generation method, except that we replace their step-wise increasing MI schedule with a constant 10, 20, or 30 nat generator, and we run the experiment longer. We attribute the divergence of SMILE to the sensitivity of the empirical partition term E_{p(x)p(y)}[e^{f(x,y)}] to outliers. The proposed method eventually overestimates as well in both the 20 and 30 nat cases, but does not exhibit the strongly divergent behavior of SMILE (seen particularly strongly in the τ = ∞ setting).

C RESULTS OF MI ESTIMATION ON GAUSSIAN VARIABLES

We report below the results of MI estimation on Gaussian variables as described in Section 3.1. Each table lists the estimation error (I(x, y) - Î(x, y)) and the standard deviation of the estimates.

Table 2: MI estimation between 20-d Gaussian variables trained on 160K data samples.

Actual MI | SMILE τ = ∞ | SMILE τ = 1 | SMILE τ = 5 | InfoNCE | GM
0.1 | 0.3 ± 0.0 | 0.3 ± 0.0 | 0.3 ± 0.0 | 0.2 ± 0.0 | 0.1 ± 0.0
5 | 0.7 ± 0.1 | -0.2 ± 0.3 | 0.2 ± 0.1 | 1.0 ± 0.1 | 0.1 ± 0.0
10 | 0.9 ± 0.2 | -0.9 ± 0.6 | -1.6 ± 0.6 | 3.9 ± 0.0 | 0.1 ± 0.0
15 | 0.2 ± 1.0 | -2.6 ± 1.7 | -5.2 ± 1.8 | 8.1 ± 0.0 | 0.1 ± 0.0
20 | 1.1 ± 1.4 | -1.0 ± 2.0 | -2.0 ± 2.4 | 13.1 ± 0.0 | 0.1 ± 0.1
25 | -6.1 ± 4.6 | -12.6 ± 8.7 | -15.3 ± 13.1 | 18.1 ± 0.0 | 0.2 ± 0.0
30 | -15.8 ± 4.0 | -12.4 ± 6.9 | -18.4 ± 5.8 | 23.1 ± 0.0 | 0.2 ± 0.1
35 | -13.5 ± 7.1 | -4.5 ± 4.3 | -8.5 ± 6.8 | 28.1 ± 0.0 | 0.3 ± 0.0

Actual MI | CCMI | DEMI α = 0.5 | DEMI α = 0.25 | DEMI α = 0.75
0.1 | 0.0 ± 0.0 | 0.0 ± 0.0 | -0.6 ± 0.0 | 0.5 ± 0.0
5 | 0.4 ± 0.6 | 0.0 ± 0.1 | -0.6 ± 0.2 | 0.5 ± 0.2
10 | -0.3 ± 1.1 | -0.0 ± 0.6 | -0.9 ± 0.4 | 0.9 ± 0.4
15 | -1.7 ± 3.1 | 0.9 ± 1.1 | 0.0 ± 1.1 | 1.7 ± 0.8
20 | -4.3 ± 5.6 | 1.9 ± 1.7 | 0.0 ± 2.1 | 1.3 ± 2.5
25 | -15.6 ± 6.8 | 0.3 ± 3.8 | -0.9 ± 3.9 | -3.2 ± 5.9
30 | -24.3 ± 10.2 | -3.4 ± 3.7 | -0.5 ± 1.8 | -3.4 ± 4.8
35 | -27.3 ± 8.0 | 0.4 ± 9.1 | -1.9 ± 3.6 | -0.3 ± 3.2

Table 3: MI estimation between 20-d Gaussian variables with cubic transformation trained on 160K data samples.
Actual MI | SMILE τ = ∞ | SMILE τ = 1 | SMILE τ = 5 | InfoNCE | GM
0.1 | 0.2 ± 0.0 | 0.2 ± 0.0 | 0.2 ± 0.0 | 0.1 ± 0.0 | 0.4 ± 0.4
5 | 1.3 ± 0.1 | 0.2 ± 0.1 | 1.1 ± 0.1 | 0.9 ± 0.0 | 2.7 ± 0.2
10 | 3.0 ± 0.1 | 1.1 ± 0.3 | 1.9 ± 0.2 | 3.8 ± 0.0 | 3.5 ± 0.4
15 | 5.6 ± 0.4 | 2.8 ± 0.6 | 2.7 ± 0.7 | 8.2 ± 0.0 | 5.7 ± 0.4
20 | 9.1 ± 0.3 | 5.9 ± 1.1 | 5.2 ± 1.2 | 13.1 ± 0.0 | 8.4 ± 0.7
25 | 14.0 ± 0.6 | 10.3 ± 2.1 | 7.5 ± 1.0 | 18.1 ± 0.0 | 12.2 ± 0.7
30 | 19.7 ± 0.9 | 13.8 ± 1.8 | 11.9 ± 1.6 | 23.1 ± 0.0 | 17.1 ± 0.8
35 | 26.0 ± 0.9 | 20.6 ± 1.5 | 18.3 ± 1.6 | 28.1 ± 0.0 | 22.8 ± 0.9

Actual MI | CCMI | DEMI α = 0.5 | DEMI α = 0.25 | DEMI α = 0.75
0.1 | 0.0 ± 0.0 | 0.1 ± 0.0 | -0.6 ± 0.0 | 0.5 ± 0.0
5 | 0.9 ± 0.2 | 1.0 ± 0.4 | 0.0 ± 0.1 | 1.4 ± 0.2
10 | 2.4 ± 1.0 | 2.7 ± 0.6 | 1.5 ± 0.2 | 2.9 ± 0.2
15 | 4.8 ± 2.3 | 4.1 ± 1.7 | 3.1 ± 0.4 | 4.7 ± 0.5
20 | 6.5 ± 1.4 | 5.1 ± 3.6 | 5.4 ± 0.5 | 6.1 ± 0.6
25 | 8.8 ± 3.7 | 7.6 ± 3.1 | 8.4 ± 0.7 | 9.1 ± 0.5
30 | 9.8 ± 2.3 | 10.7 ± 3.8 | 12.5 ± 1.0 | 13.1 ± 1.1
35 | 16.7 ± 5.2 | 13.9 ± 3.6 | 16.9 ± 0.8 | 17.2 ± 1.2

Table 4: MI estimation between 20-d Gaussian variables trained on 80K data samples.

Actual MI | SMILE τ = ∞ | SMILE τ = 1 | SMILE τ = 5 | InfoNCE | GM
0.1 | 0.6 ± 0.1 | 0.6 ± 0.1 | 0.6 ± 0.1 | 0.3 ± 0.0 | 0.1 ± 0.0
5 | 1.0 ± 0.5 | -0.1 ± 0.5 | 0.5 ± 0.6 | 1.6 ± 0.2 | 0.1 ± 0.0
10 | 1.5 ± 1.1 | -0.5 ± 1.3 | -1.1 ± 1.6 | 4.0 ± 0.1 | 0.2 ± 0.1
15 | 1.9 ± 2.3 | -0.1 ± 1.8 | -2.2 ± 3.3 | 8.1 ± 0.0 | 0.2 ± 0.0
20 | 3.0 ± 4.3 | 1.7 ± 3.5 | -2.0 ± … | … | …
(remaining rows of this block are missing from the source)

Actual MI | CCMI | DEMI α = 0.5 | DEMI α = 0.25 | DEMI α = 0.75
0.1 | 0.1 ± 0.0 | 0.1 ± 0.0 | -0.6 ± 0.0 | 0.5 ± 0.0
5 | 1.4 ± 0.4 | 1.0 ± 0.1 | 0.3 ± 0.2 | 1.7 ± 0.2
10 | 4.2 ± 1.1 | 2.5 ± 0.4 | 1.8 ± 0.4 | 3.3 ± 0.3
15 | 5.6 ± 1.9 | 4.8 ± 0.5 | 3.7 ± 0.5 | 4.9 ± 0.5
20 | 5.9 ± 2.2 | 6.6 ± 0.5 | 6.3 ± 0.6 | 7.6 ± 0.8
25 | 9.4 ± 2.3 | 10.2 ± 0.7 | 10.3 ± 0.7 | 10.8 ± 1.0
30 | 15.7 ± 5.9 | 14.4 ± 1.1 | 14.0 ± 0.7 | 14.6 ± 0.8
35 | 16.3 ± 2.7 | 18.9 ± 0.7 | 18.0 ± 0.5 | 19.5 ± 1.0

Table 6: MI estimation between 20-d Gaussian variables trained on 32K data samples.
Actual MI | SMILE τ = ∞ | SMILE τ = 1 | SMILE τ = 5 | InfoNCE | GM
0.1 | 1.1 ± 0.3 | 1.2 ± 0.3 | 1.2 ± 0.2 | 0.8 ± 0.1 | 0.1 ± 0.0
5 | 0.9 ± 0.1 | -0.0 ± 0.5 | 0.5 ± 0.1 | 3.9 ± 0.3 | 1.4 ± 0.2
10 | 1.3 ± 0.2 | -0.5 ± 1.0 | -1.0 ± 0.6 | 4.0 ± 0.1 | 0.8 ± 0.1
15 | 1.6 ± 1.3 | -1.2 ± 1.3 | -2.9 ± 1.4 | 8.1 ± 0.0 | 0.8 ± 0.1
20 | -2.3 ± 1.5 | -3.8 ± 3.7 | -6.8 ± 3.9 | 13.1 ± 0.0 | 1.2 ± 0.2
25 | -0.1 ± 1.3 | 1.1 ± 0.7 | -3.4 ± … | … | …
(remaining rows of this block are missing from the source)

(The caption for the following block is missing from the source.)
Actual MI | SMILE τ = ∞ | SMILE τ = 1 | SMILE τ = 5 | InfoNCE | GM
0.1 | 0.7 ± 0.2 | 0.6 ± 0.1 | 0.6 ± 0.1 | 0.3 ± 0.0 | 0.1 ± 0.0
5 | 1.1 ± 0.1 | 0.2 ± 0.2 | 0.7 ± 0.1 | 1.7 ± 0.1 | 3.4 ± 0.3
10 | 1.6 ± 0.2 | 0.1 ± 0.4 | -0.4 ± 0.6 | 4.3 ± 0.1 | 1.0 ± 0.2
15 | 1.3 ± 0.6 | -0.8 ± 1.5 | -1.9 ± 0.9 | 8.2 ± 0.0 | 0.5 ± 0.2
20 | 2.5 ± 0.9 | 2.0 ± 3.0 | -0.9 ± 2.6 | 13.1 ± 0.0 | 0.5 ± 0.1
25 | 4.8 ± 1.1 | 4.3 ± 3.0 | 0.7 ± 2.9 | 18.1 ± 0.0 | 0.6 ± 0.1
30 | 7.5 ± 1.9 | 7.1 ± 3.1 | 2.7 ± 3.0 | 23.1 ± 0.0 | 0.7 ± 0.1
35 | 6.8 ± 3.6 | 6.9 ± 3.9 | 1.9 ± 3.9 | 28.1 ± 0.0 | 0.9 ± 0.2
40 | 7.4 ± 3.5 | 8.5 ± 1.8 | 5.1 ± … | … | …
(orphan row, table and row label missing:) 22.8 ± 3.4 | 28.7 ± 1.6 | 29.0 ± 1.3 | 29.9 ± 1.4

Table 10: MI estimation between 50-d Gaussian variables trained on 80K data samples.

Actual MI | SMILE τ = ∞ | SMILE τ = 1 | SMILE τ = 5 | InfoNCE | GM
0.1 | 1.4 ± 0.3 | 1.3 ± 0.2 | 1.5 ± 0.3 | 0.5 ± 0.1 | 0.1 ± 0.0
5 | 1.4 ± 0.1 | 0.5 ± 0.4 | 1.0 ± 0.1 | 3.3 ± 0.2 | 4.6 ± 0.2
10 | 1.9 ± 0.3 | -0.4 ± 0.6 | -0.6 ± 0.5 | 4.4 ± 0.1 | 5.5 ± 0.6
15 | 2.6 ± 0.5 | 1.4 ± 0.9 | -0.1 ± 1.3 | 8.3 ± 0.0 | 4.5 ± 0.3
20 | 4.3 ± 0.9 | 2.8 ± 0.8 | 0.5 ± 1.6 | 13.1 ± 0.0 | 3.5 ± 0.5
25 | 6.3 ± 1.2 | 4.2 ± 2.7 | 3.6 ± 2.3 | 18.1 ± 0.0 | 2.9 ± 0.6
30 | 8.5 ± 3.6 | 7.8 ± 2.5 | 4.6 ± 2.7 | 23.1 ± 0.0 | 3.2 ± 0.4
35 | 5.9 ± 7.9 | 10.4 ± 4.9 | 4.5 ± 6.6 | 28.1 ± 0.0 | 3.4 ± 0.5
40 | -0.9 ± 6.1 | 4.7 ± … | … | … | …

Actual MI | CCMI | DEMI α = 0.5 | DEMI α = 0.25 | DEMI α = 0.75
5 | 1.1 ± 0.8 | 0.3 ± 0.2 | -0.5 ± 0.2 | 1.1 ± 0.3
10 | 0.0 ± 0.5 | 1.1 ± 0.4 | 0.4 ± 0.4 | 1.5 ± 0.5
15 | -0.3 ± 1.8 | 1.7 ± 0.4 | 0.9 ± 0.9 | 2.4 ± 1.2
20 | -3.8 ± 2.0 | 3.9 ± 1.3 | 2.8 ± 1.4 | 4.1 ± 2.1
25 | -4.9 ± 2.8 | 5.3 ± 2.6 | 4.8 ± 0.9 | 5.8 ± 1.7
30 | -4.8 ± 6.3 | 5.8 ± 2.2 | 7.9 ± 1.3 | 9.0 ± 2.4
35 | -8.9 ± 8.1 | 9.6 ± 2.9 | 9.4 ± 2.2 | 10.1 ± 2.4
40 | -15.3 ± 7.8 | 10.4 ± 4.1 | 11.6 ± 3.5 | 11.3 ± 4.3
45 | -19.4 ± 7.7 | 10.6 ± 4.3 | 11.1 ± 1.9 | 13.9 ± 4.3
50 | -19.6 ± 6.7 | 14.9 ± 4.9 | 15.0 ± 2.8 | 18.1 ± 3.6
55 | -18.6 ± 12.5 | 23.2 ± 1.8 | 20.2 ± 3.2 | 24.1 ± 3.8

Table 11: MI estimation between 50-d Gaussian variables with cubic transformation trained on 80K data samples.

Actual MI | CCMI | DEMI α = 0.5 | DEMI α = 0.25 | DEMI α = 0.75
0.1 | 0.2 ± 0.0 | 0.2 ± 0.0 | -0.5 ± 0.0 | 0.5 ± 0.0
5 | 2.2 ± 0.1 | 2.0 ± 0.1 | 1.3 ± 0.1 | 2.5 ± 0.3
10 | 4.5 ± 0.8 | 4.1 ± 0.2 | 3.5 ± 0.2 | 4.4 ± 0.3
15 | 6.0 ± 0.9 | 6.0 ± 0.6 | 5.7 ± 0.4 | 6.3 ± 0.4
20 | 8.2 ± 1.6 | 7.7 ± 0.8 | 7.4 ± 0.7 | 8.5 ± 0.7
25 | 8.8 ± 2.4 | 10.1 ± 1.0 | 10.5 ± 1.1 | 10.6 ± 1.2
30 | 11.9 ± 2.6 | 13.0 ± 1.4 | 13.4 ± 0.7 | 14.2 ± 1.1
35 | 14.7 ± 2.9 | 17.6 ± 1.4 | 17.0 ± 1.0 | 18.3 ± 1.4
40 | 18.4 ± 3.3 | 20.9 ± 2.2 | 20.7 ± 1.1 | 21.2 ± 1.6
45 | 23.0 ± 3.9 | 24.0 ± 2.1 | 25.4 ± 1.5 | 24.7 ± 2.2
50 | 26.1 ± 3.1 | 28.5 ± 1.5 | 29.8 ± 0.7 | 30.2 ± 1.2
55 | 30.4 ± 2.5 | 34.1 ± 1.9 | 34.1 ± 0.9 | 34.2 ± 2.3

Actual MI | SMILE τ = ∞ | SMILE τ = 1 | SMILE τ = 5 | InfoNCE | GM
0.1 | 1.1 ± 0.2 | 1.2 ± 0.2 | 1.1 ± 0.2 | 0.2 ± 0.0 | 0.9 ± 0.4
5 | 3.0 ± 0.1 | 2.8 ± 0.2 | 3.0 ± 0.2 | 1.6 ± … | …
(remaining rows of this block are missing from the source)

Table 12: MI estimation between 50-d Gaussian variables trained on 32K data samples.

Actual MI | CCMI | DEMI α = 0.5 | DEMI α = 0.25 | DEMI α = 0.75
… | … | 2.1 ± 0.5 | 1.5 ± 0.4 | 1.9 ± 0.9
15 | 51.8 ± 5.3 | 3.8 ± 0.9 | 3.7 ± 0.7 | 4.5 ± 0.9
20 | 52.0 ± 3.4 | 6.9 ± 1.0 | 6.1 ± 0.9 | 7.0 ± 1.2
25 | 60.7 ± 3.6 | 10.1 ± 0.9 | 9.3 ± 0.8 | 11.2 ± 1.0
30 | 64.6 ± 3.5 | 12.9 ± 1.5 | 12.6 ± 0.7 | 15.…

(The caption for the following block is missing from the source.)
Actual MI | SMILE τ = ∞ | SMILE τ = 1 | SMILE τ = 5 | InfoNCE | GM
0.1 | 2.2 ± 0.8 | 2.2 ± 0.9 | 1.9 ± 0.8 | 1.5 ± 0.2 | 0.1 ± 0.0
5 | 1.7 ± 0.1 | 0.6 ± 0.4 | 1.2 ± 0.2 | 5.7 ± 0.3 | 5.0 ± 0.0
10 | 2.4 ± 0.2 | 0.3 ± 0.8 | -0.1 ± 0.5 | 4.5 ± 0.1 | 9.6 ± 0.1
15 | 3.3 ± 0.7 | 1.5 ± 0.8 | 0.2 ± 0.8 | 8.4 ± 0.1 | 13.2 ± 0.4
20 | 5.1 ± 1.2 | 3.3 ± 1.5 | 0.7 ± 1.0 | 13.1 ± 0.…



Figure 1 reports the MI estimation error (I(x, y) - Î(x, y)) versus the true underlying MI for the experiments with joint Gaussians and joint Gaussians with a cubic transformation, when the training data size is 160K. Appendix C reports additional experimental results for 80K and 32K training samples. The results of DEMI with the three different settings of α are very close, so in Figure 1 we show only DEMI (α = 0.5); Appendix C reports the other two settings as well.
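The "actual MI" in these experiments is available in closed form. A minimal sketch, assuming the standard benchmark construction in which each coordinate of x is correlated with the matching coordinate of y with correlation ρ (the helper names `gaussian_mi` and `rho_for_mi` are ours, not from the paper):

```python
import numpy as np

def gaussian_mi(rho: float, dim: int) -> float:
    """Closed-form MI (in nats) between x, y ~ N(0, I_dim) whose matching
    coordinates have correlation rho: I(x, y) = -(dim/2) * ln(1 - rho^2)."""
    return -0.5 * dim * np.log(1.0 - rho ** 2)

def rho_for_mi(target_mi: float, dim: int) -> float:
    """Invert the relation to pick the rho that yields a desired true MI."""
    return float(np.sqrt(1.0 - np.exp(-2.0 * target_mi / dim)))

# e.g. a 20-d Gaussian pair carrying 10 nats of mutual information
rho = rho_for_mi(10.0, 20)
assert abs(gaussian_mi(rho, 20) - 10.0) < 1e-9
```

Because MI is invariant under invertible elementwise maps, the same ground truth applies to the cubic-transformation benchmark (y replaced by y³).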

Figure 1: Mutual information estimation between multivariate Gaussian variables (left column) and between multivariate Gaussian variables with a cubic transformation (right column). Closer to zero is better. The estimation error (I(x, y) - Î(x, y)) versus the true underlying MI is reported. These estimates are based on a training data size of 160K. We show only DEMI (α = 0.5), since the results of the other two parameter settings are very close. A complete table of estimation results is reported in Appendix C.

Figure 2: Results of the three self-consistency tests for SMILE (τ = 1, 5, ∞), InfoNCE, and DEMI. "Monotonicity" is top, "data processing" is middle, and "additivity" is bottom.

Figure 3: Long-run behavior of SMILE and DEMI for 10 (top row), 20 (middle row), and 30 (bottom row) nats. Analytically, SMILE converges to the MINE objective as τ → ∞. Smoothed trajectories are plotted in bold, exact trajectories are the semi-translucent curves, and the actual mutual information is the black constant line. The curves in the first row of Figure 3 show good performance with relatively stable long-term behavior, particularly for τ = 1. The curves in the third row of Figure 3, on the other hand, suggest that for certain distribution/domain combinations, even though SMILE and MINE are based on a lower bound of MI, they can both grossly overestimate it. This may be, as McAllester & Stratos (2020) suggest, due in part to the sensitivity of the estimate of ln E[e^{f(x,y)}] to outliers. The proposed method eventually overestimates as well in both the 20 and 30 nat cases, but does not exhibit the strongly divergent behavior of SMILE (seen particularly in the τ = ∞ setting).
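The outlier sensitivity noted above is easy to see in isolation. A minimal NumPy sketch (illustrative only, not the paper's critic or training loop): a single large critic score on a product-of-marginals batch shifts the ln E[e^f] term of the DV/MINE objective by many nats:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_mean_exp(f: np.ndarray) -> float:
    """Numerically stable ln E[e^f], the second term of the DV/MINE bound."""
    m = f.max()
    return float(m + np.log(np.mean(np.exp(f - m))))

# Critic scores on 10,000 samples from the product of marginals.
scores = rng.normal(0.0, 1.0, size=10_000)
base = log_mean_exp(scores)

# Corrupt a single score with an outlier value.
scores_out = scores.copy()
scores_out[0] = 20.0
shifted = log_mean_exp(scores_out)

# One sample out of 10,000 moves the estimate by several nats,
# and hence moves the resulting MI estimate by the same amount.
assert shifted - base > 5.0
```

Since the MI estimate subtracts this term, a heavy-tailed score distribution translates directly into the high-variance, divergent trajectories seen for SMILE with τ = ∞.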

Table 1 reports the top-1 classification accuracy on CIFAR10 and CIFAR100. DEMI is comparable to InfoNCE in 3 out of 4 tasks and outperforms the other MI estimators by a significant margin. When the encoder is allowed to train with the class labels, the task becomes fully supervised; we report its classification accuracy as a reference. The classification accuracy based on the representations learned by Deep InfoMax with DEMI is close to, or even surpasses, the fully-supervised case. Note that the Deep InfoMax objective we use here does not include prior distribution matching to regularize the encoder.





MI estimation between 50-d Gaussian variables trained on 160K data samples. The estimation error (I(x, y) -Î(x, y)) and the standard deviation of the estimates are reported.

MI estimation between 50-d Gaussian variables with cubic transformation trained on 160K data samples. The estimation error (I(x, y) -Î(x, y)) and the standard deviation of the estimates are reported.

MI estimation between 100-d Gaussian variables trained on 160K data samples. The estimation error (I(x, y) -Î(x, y)) and the standard deviation of the estimates are reported.

MI estimation between 100-d Gaussian variables trained on 80K data samples. The estimation error (I(x, y) -Î(x, y)) and the standard deviation of the estimates are reported.

MI estimation between 100-d Gaussian variables trained on 32K data samples. The estimation error (I(x, y) - Î(x, y)) and the standard deviation of the estimates are reported.

MI estimation between 100-d Gaussian variables with cubic transformation trained on 32K data samples. The estimation error (I(x, y) - Î(x, y)) and the standard deviation of the estimates are reported.

