Unpacking Information Bottlenecks: Surrogate Objectives for Deep Learning

Abstract

The Information Bottleneck principle offers both a mechanism to explain how deep neural networks train and generalize and a regularized objective with which to train models. However, multiple competing objectives have been proposed in the literature, and the information-theoretic quantities used in these objectives are difficult to compute for large deep neural networks, which in turn limits their use as training objectives. In this work, we review these quantities and compare and unify previously proposed objectives, which allows us to develop surrogate objectives more amenable to optimization without relying on cumbersome tools such as density estimation. We find that these surrogate objectives allow us to apply the information bottleneck to modern neural network architectures. We demonstrate our insights on MNIST, CIFAR-10 and Imagenette with modern DNN architectures (ResNets).

1. Introduction

The Information Bottleneck (IB) principle, introduced by Tishby et al. (2000), proposes that training and generalization in deep neural networks (DNNs) can be explained by information-theoretic principles (Tishby and Zaslavsky, 2015; Shwartz-Ziv and Tishby, 2017; Achille and Soatto, 2018a). This is attractive, as the success of DNNs remains largely unexplained by tools from computational learning theory (Zhang et al., 2016; Bengio et al., 2009). The IB principle suggests that learning consists of two competing objectives: maximizing the mutual information between the latent representation and the label to promote accuracy, while at the same time minimizing the mutual information between the latent representation and the input to promote generalization. Following this principle, many variations of IB objectives have been proposed (Alemi et al., 2016; Strouse and Schwab, 2017; Fischer and Alemi, 2020; Fischer, 2020; Fisher, 2019; Gondek and Hofmann, 2003; Achille and Soatto, 2018a), which, in supervised learning, have been demonstrated to benefit robustness to adversarial attacks (Alemi et al., 2016; Fisher, 2019) as well as generalization and regularization against overfitting to random labels (Fisher, 2019). Whether the benefits of training with IB objectives are due to the IB principle or some other, unrelated mechanism remains an open question (Saxe et al., 2019; Amjad and Geiger, 2019; Tschannen et al., 2019). Thus, although recent work has tied the principle to successful results in both unsupervised and self-supervised learning (Oord et al., 2018; Belghazi et al., 2018; Zhang et al., 2018; Burgess et al., 2018, among others), our understanding of how IB objectives affect representation learning remains limited. Critical to studying this question is the computation of the information-theoretic quantities used.
While progress has been made in developing mutual information estimators for DNNs (Poole et al., 2019; Belghazi et al., 2018; Noshad et al., 2019; McAllester and Stratos, 2018; Kraskov et al., 2004), current methods still face many limitations for high-dimensional random variables (McAllester and Stratos, 2018) and rely on complex estimators or generative models. This presents a challenge to training with IB objectives. In this paper, we analyze information quantities and relate them to surrogate objectives for the IB principle which are more amenable to optimization, showing that complex or intractable IB objectives can be replaced with simple, easy-to-compute surrogates that produce similar performance and similar behaviour of information quantities over training. Sections 2 & 3 review commonly-used information quantities, for which we provide mathematically grounded intuition via information diagrams, and unify different IB objectives by identifying two key information quantities, the Decoder Uncertainty H[Y | Z] and the Reverse Decoder Uncertainty H[Z | Y], which act as the main loss and regularization terms in our unified IB objective. In particular, Section 3.2 demonstrates that using the Decoder Uncertainty as a training objective can minimize the training error, and shows how to estimate an upper bound on it efficiently for well-known DNN architectures. We expand on the findings of Alemi et al. (2016) in their variational IB approximation and demonstrate that this upper bound equals the commonly-used cross-entropy loss under dropout regularization. Section 3.3 examines pathologies of differential entropies that hinder optimization and proposes adding Gaussian noise to force differential entropies to become non-negative, which leads to new surrogate terms to optimize the Reverse Decoder Uncertainty.
Altogether this leads to simple and tractable surrogate IB objectives such as the following, which uses dropout, adds Gaussian noise ε over the feature vectors f_θ(x; η), and uses an L2 penalty over the noisy feature vectors:

min_θ E_{x,y ∼ p(x,y), ε ∼ N, η ∼ dropout mask} [ −log p_θ(Ŷ = y | z = f_θ(x; η) + ε) + γ ‖f_θ(x; η) + ε‖_2^2 ].   (1)

Section 4 describes experiments that validate our insights qualitatively and quantitatively on MNIST, CIFAR-10 and Imagenette, and shows that with objectives like the one in equation (1) we obtain information plane plots (as in figure 1) similar to those predicted by Tishby and Zaslavsky (2015). Our simple surrogate objectives thus induce the desired behavior of IB objectives while scaling to large, high-dimensional datasets. We present evaluations on CIFAR-10 and Imagenette images. Compared to existing work, we show that we can optimize IB objectives for well-known DNN architectures using standard optimizers, losses and simple regularizers, without needing complex estimators, generative models, or variational approximations. This will allow future research to make better use of IB objectives and study the IB principle more thoroughly.

2. Background

We briefly review well-known information quantities (Cover and Thomas, 2012; MacKay, 2003; Shannon, 1948). We will further require the Kullback-Leibler divergence D_KL(• || •) and cross-entropy H(• || •); the definitions can be found in section A.1. We will use differential entropies interchangeably with entropies: equalities between them are preserved in the differential setting, and inequalities will be covered in section 3.3. Information diagrams (I-diagrams), like the one depicted in figure 2, clarify the relationship between information quantities: similar to Venn diagrams, a quantity equals the sum of its parts in the diagram. Importantly, they offer a grounded intuition, as Yeung (1991) shows that we can define a signed measure µ* such that information quantities map to abstract sets and are consistent with set operations.
We provide details on how to use I-diagrams and what to watch out for in section A.2.

Probabilistic model. We focus on a supervised classification task that makes prediction Ŷ given data X using a latent encoding Z, while the provided target is Y. We assume categorical Y and Ŷ, and continuous X. Our probabilistic model based on these assumptions is as follows:

p(x, y, z, ŷ) = p(x, y) p_θ(z | x) p_θ(ŷ | z).   (2)

Thus, Z and Y are independent given X, and Ŷ is independent of X and Y given Z. The data distribution p(x, y) is only available to us as an empirical sample distribution. θ are the parameters we would like to learn. p_θ(z | x) is the encoder from data X to latent Z, and p_θ(ŷ | z) the decoder from latent Z to prediction Ŷ. Together, p_θ(z | x) and p_θ(ŷ | z) form the discriminative model p_θ(ŷ | x):

p_θ(ŷ | x) = E_{p_θ(z|x)} p_θ(ŷ | z).   (3)

We can derive the cross-entropy loss H(p(y | x) || p_θ(Ŷ = y | x)) (Solla et al., 1988; Hinton, 1990) by minimizing the Kullback-Leibler divergence between the empirical sample distribution p(x, y) and the parameterized distribution p_θ(x) p_θ(ŷ | x), where we set p_θ(x) = p(x). See section D.1.

The Information Bottleneck. The IB principle proposes to learn the encoder by optimizing

min I[X; Z] − β I[Y; Z].   (4)

This principle can be recast as a generalization of finding minimal sufficient statistics for the labels given the data (Shamir et al., 2010; Tishby and Zaslavsky, 2015; Fisher, 2019): it strives for minimality and sufficiency of the latent Z. Minimality is achieved by minimizing the Preserved Information I[X; Z], while sufficiency is achieved by maximizing the Preserved Relevant Information I[Y; Z]. We defer an in-depth discussion of the IB principle to Section C.1 in the appendix. We discuss the several variants of IB objectives, and justify our focus on IB and DIB, in Section C.2.
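As a concrete illustration of equations (2) and (3), the predictive distribution p_θ(ŷ | x) can be estimated by averaging decoder outputs over samples from the encoder. The following is a minimal numpy sketch; the linear encoder/decoder, weight shapes, and noise scale are hypothetical stand-ins for illustration, not the models used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stochastic encoder p_theta(z | x): a Gaussian around a linear feature map.
W_enc = rng.normal(size=(2, 5))           # hypothetical encoder weights
W_dec = rng.normal(size=(5, 3))           # hypothetical decoder weights (3 classes)

def encoder_sample(x, n_samples):
    """Draw n_samples latents z ~ p_theta(z | x)."""
    mean = x @ W_enc
    return mean + 0.1 * rng.normal(size=(n_samples, *mean.shape))

def decoder(z):
    """p_theta(y_hat | z) as a softmax over decoder logits."""
    logits = z @ W_dec
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# p_theta(y_hat | x) = E_{p_theta(z | x)} p_theta(y_hat | z), equation (3),
# estimated by Monte-Carlo averaging over encoder samples.
x = rng.normal(size=2)
p_y_given_x = decoder(encoder_sample(x, n_samples=1000)).mean(axis=0)
assert abs(p_y_given_x.sum() - 1.0) < 1e-9   # still a valid distribution
```

The same Monte-Carlo average is what an implementation of equation (3) computes in practice when the encoder is made stochastic, e.g. by dropout.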

3. Surrogate objectives

The information quantities that appear in the IB objective are not tractable to compute for the representations learned by many function classes of interest, including neural networks; for example, Strouse and Schwab (2017) only obtain an analytical solution to their Deterministic Information Bottleneck (DIB) method for the tabular setting. Alemi et al. (2016) address this challenge by constructing a variational approximation of the IB objective, but their approach has not been applied to more complex datasets than MNIST variants. Belghazi et al. (2018) use a separate statistics network to approximate the mutual information, a computationally expensive strategy that does not easily lend itself to optimization. In this section, we introduce and justify tractable surrogate losses that are easier to apply in common deep learning pipelines and which scale to large, high-dimensional datasets. We begin by proposing the following reformulation of the IB and DIB objectives.

Proposition 1. For IB, we obtain

arg min I[X; Z] − β I[Y; Z] = arg min H[Y | Z] + β′ I[X; Z | Y], where I[X; Z | Y] = H[Z | Y] − H[Z | X],   (5)

and, for DIB,

arg min H[Z] − β I[Y; Z] = arg min H[Y | Z] + β′ H[Z | Y] = arg min H[Y | Z] + β″ H[Z],   (6)

with β′ := 1/(β − 1) ∈ [0, ∞) and β″ := 1/β ∈ [0, 1).

We define the Prediction Cross-Entropy H_θ[Y | X] and the Decoder Cross-Entropy H_θ[Y | Z] as

H_θ[Y | X] := H(p(y | x) || p_θ(Ŷ = y | x)) = E_{p(x,y)} h( E_{p_θ(z|x)} p_θ(Ŷ = y | z) )   (7)
H_θ[Y | Z] := H(p(y | z) || p_θ(Ŷ = y | z)) = E_{p(x,y)} E_{p_θ(z|x)} h( p_θ(Ŷ = y | z) ).   (8)

Jensen's inequality yields H_θ[Y | X] ≤ H_θ[Y | Z], with equality iff Z is a deterministic function of X. The Decoder Cross-Entropy upper-bounds the Decoder Uncertainty,

H[Y | Z] ≤ H[Y | Z] + D_KL(p(y | z) || p_θ(ŷ | z)) = H_θ[Y | Z],   (9)

and further bounds the training error:

p("Ŷ is wrong") ≤ 1 − e^{−H_θ[Y|Z]} = 1 − e^{−(H[Y|Z] + D_KL(p(y|z) || p_θ(ŷ|z)))}.   (10)

Minimizing H_θ[Y | Z] thus uses the parametric decoder p_θ(ŷ | z) in equation (9) to variationally approximate p(y | z). We make this explicit by applying the reparameterization trick to rewrite the latent z as a parametric function of its input x and some independent auxiliary random variable η, i.e. f_θ(x; η) =_d z ∼ p_θ(z | x), yielding

H[Y | Z] ≤ H_θ[Y | Z] = E_{p(x,y)} E_{p(η)} h( p_θ(Ŷ = y | z = f_θ(x; η)) ).   (11)

Likewise, for H_θ[Y | X], the expectation over η moves inside h. Equation (11) can be applied to many forms of stochastic regularization that turn deterministic models into stochastic ones, in particular dropout. This allows us to use modern DNN architectures as stochastic encoders.

Dropout regularization. When we interpret η as a sampled dropout mask for a DNN, DNNs that use dropout regularization (Srivastava et al., 2014), or variants like DropConnect (Wan et al., 2013a), fit the equation above as stochastic encoders. Monte-Carlo dropout (Gal and Ghahramani, 2016), for example, even specifically estimates the predictive mean p_θ(ŷ | x) from equation (3). This extends the observation by Burda et al. (2015): sampling yields an unbiased estimator for the Decoder Cross-Entropy H_θ[Y | Z], while it only yields a biased estimator for the Prediction Cross-Entropy H_θ[Y | X] (which it upper-bounds).

Differential entropies. In most cases, the latent Z is a continuous random variable in many dimensions. Unlike entropies on discrete probability spaces, differential entropies defined on continuous spaces are not bounded from below. This means that the DIB objective is not guaranteed to have an optimal solution and allows for pathological optimization trajectories in which the variance of the latent Z can be scaled to be arbitrarily small, achieving arbitrarily high-magnitude negative entropy. We provide a toy experiment demonstrating this in section G.4. Intuitively, one can interpret this issue as being allowed to encode information in an arbitrarily small real number using infinite precision, similar to arithmetic coding (MacKay, 2003; Shwartz-Ziv and Tishby, 2017). In practice, due to floating point constraints, optimizing DIB naively will invariably end in garbage predictions and underflow as activations approach zero.
It is therefore not desirable for training. This is why Strouse and Schwab (2017) only consider analytical solutions to DIB by evaluating a limit for the tabular case. MacKay (2003) proposes the introduction of noise to solve this issue for continuous communication channels. Here, we propose adding specific noise to the latent representation to lower-bound the conditional entropy of Z, which allows us to enforce non-negativity across all IB information quantities as in the discrete case and to transport inequalities to the continuous case: for a continuous Ẑ ∈ R^k and independent noise ε, we set Z := Ẑ + ε; the differential entropy then satisfies H[Z] = H[Ẑ + ε] ≥ H[ε], and likewise for the conditional entropies H[Z | X] and H[Z | Y]. Choosing ε to be Gaussian with zero differential entropy ("zero-entropy noise") thus makes all these quantities non-negative. Strictly speaking, zero-entropy noise is not necessary for optimizing the bounds: any Gaussian noise is sufficient, but zero-entropy noise is aesthetically appealing as it preserves inequalities from the discrete setting. In a sense, this proposition bounds the IB objective by the DIB objective. However, adding noise changes the optimal solutions: whereas DIB in Strouse and Schwab (2017) leads to hard clustering in the limit, adding noise leads to soft clustering when optimizing the DIB objective, as is the case with the IB objective. We show in section F.6 that minimizing the DIB objective with noise leads to soft clustering (for the case of an otherwise deterministic encoder). Altogether, in agreement with Shwartz-Ziv and Tishby (2017), we argue that noise is essential to obtain meaningful differential entropies and to avoid other pathological cases as described further in section F.7.

It is not generally possible to compute H[Z | Y] exactly for continuous latent representations Z, but we can derive an upper bound: the maximum-entropy distribution for a given covariance matrix Σ is a Gaussian with the same covariance.

Proposition 4.
The Reverse Decoder Uncertainty can be approximately bounded using the empirical variance V̂ar[Z_i | y]:

H[Z | Y] ≤ E_{p(y)} Σ_i ½ ln(2πe Var[Z_i | y]) ≈ E_{p(y)} Σ_i ½ ln(2πe V̂ar[Z_i | y]),

where Z_i are the individual components of Z. H[Z] can be bounded similarly. More generally, we can create an even looser upper bound by bounding the mean squared norm of the latent:

E ‖Z‖² ≤ C ⇒ H[Z | Y] ≤ H[Z] ≤ C̃, with C̃ := (k/2) ln(2πe C/k) for Z ∈ R^k.

See section F.2 for the proof.

Surrogate objectives. These terms provide us with three different upper bounds that we can use as surrogate regularizers. We refer to them as: the conditional log-variance regularizer (log Var[Z | Y]), the log-variance regularizer (log Var[Z]) and the activation L2 regularizer (E ‖Z‖²). We can now state the main results of this paper: IB surrogate objectives that reduce to an almost trivial implementation using the cross-entropy loss and one of the regularizers above while adding zero-entropy noise to the latent Z.

Theorem 1. Let Z be obtained by adding a single sample of zero-entropy noise to a single sample of the output z of the stochastic encoder. Then each of the following objectives is an estimator of an upper bound on the IB objective. In particular, for the surrogate objective E ‖Z‖², we obtain:

min H(p(y | z) || p_θ(Ŷ = y | z)) + γ ‖z‖_2^2;   (14)

for log Var[Z | Y]:

min H(p(y | z) || p_θ(Ŷ = y | z)) + γ E_{p(y)} Σ_i ½ ln(2πe V̂ar[Z_i | y]);   (15)

and for log Var[Z]:

min H(p(y | z) || p_θ(Ŷ = y | z)) + γ Σ_i ½ ln(2πe V̂ar[Z_i]).   (16)

For the latter two surrogate regularizers, we can relate their coefficient γ to β, β′ and β″ from section 3. However, as regularizing E ‖Z‖² does not approximate an entropy directly, its coefficient does not relate to the Lagrange multiplier of any fixed IB objective. We compare the performance of these objectives in section 4.
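The objectives of Theorem 1 reduce to a few lines of code. Below is a minimal numpy sketch (the batch size, latent dimension and function names are our own illustrative choices): zero-entropy Gaussian noise has per-dimension variance 1/(2πe), and each surrogate regularizer is computed directly from a batch of noisy latents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-entropy Gaussian noise: k/2 * ln(2*pi*e*sigma^2) = 0  =>  sigma^2 = 1/(2*pi*e).
sigma2 = 1.0 / (2 * np.pi * np.e)
assert abs(0.5 * np.log(2 * np.pi * np.e * sigma2)) < 1e-12

z_hat = rng.normal(size=(128, 16))        # hypothetical encoder outputs for a batch
y = rng.integers(0, 10, size=128)         # class labels
z = z_hat + rng.normal(scale=np.sqrt(sigma2), size=z_hat.shape)   # noisy latent Z

def l2_regularizer(z):
    """Activation L2 regularizer: estimate of E ||Z||^2, as in eq. (14)."""
    return float((z ** 2).sum(axis=1).mean())

def log_var_regularizer(z):
    """log-variance regularizer: sum_i 1/2 ln(2 pi e Var[Z_i]), as in eq. (16)."""
    return float(0.5 * np.log(2 * np.pi * np.e * z.var(axis=0)).sum())

def cond_log_var_regularizer(z, y):
    """Conditional log-variance regularizer, as in eq. (15)."""
    total = 0.0
    for c in np.unique(y):
        zc = z[y == c]
        total += len(zc) / len(z) * 0.5 * np.log(
            2 * np.pi * np.e * zc.var(axis=0)).sum()
    return float(total)

# The Gaussian max-entropy bounds order the two entropy surrogates (H[Z|Y] <= H[Z]):
assert cond_log_var_regularizer(z, y) <= log_var_regularizer(z) + 1e-9
```

The full surrogate loss is then simply the usual cross-entropy plus γ times one of these regularizers, evaluated on the noisy latents.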

4. Experiments

We now provide empirical verification of the claims made in the previous sections. Our goal in this section is to highlight two main findings. First, our surrogate objectives obtain behavior similar to what we expect of exact IB objectives with respect to robustness to adversarial examples: we show that our surrogate IB objectives improve adversarial robustness compared to models trained only on the cross-entropy loss, consistent with the findings of Alemi et al. (2016). Second, we show the effect of our surrogate objectives on information quantities during training by plotting information plane diagrams, demonstrating that models trained with our objectives trade off between I[X; Z] and I[Y; Z] as expected. We show this by recovering information plane plots similar to the ones in Tishby and Zaslavsky (2015) and qualitatively examining the optimization behavior of the networks through their training trajectories. We demonstrate the scalability of our surrogate objectives by applying them to CIFAR-10 and Imagenette, two high-dimensional image datasets. For details about our experiment setup, DNN architectures, hyperparameters and additional insights, see section G. In particular, the empirical quantification of our observations on the relationship between the Decoder Cross-Entropy loss and the Prediction Cross-Entropy is deferred to the appendix due to space limitations, as is the description of the toy experiment showing that minimizing H[Z | Y] for continuous latent Z without adding noise does not constrain information meaningfully and that adding noise solves the issue, as detailed in section 3.3.

Robustness to adversarial attacks. Alemi et al. (2016) and Fischer and Alemi (2020) observe that their IB objectives lead to improved adversarial robustness over standard training objectives. We perform a similar evaluation to see whether our surrogate objectives also yield improved robustness.
We train a fully-connected residual network on CIFAR-10 for a range of regularization coefficients γ using our E ‖Z‖² surrogate objective; we then compare against a similar regularization method that does not have an information-theoretic interpretation: L2 weight-decay. We inject zero-entropy noise in both cases. After training, we evaluate the models on adversarially perturbed images using the FGSM (Szegedy et al., 2013), PGD (Madry et al., 2018), BasicIterative (Kurakin et al., 2017) and DeepFool (Moosavi-Dezfooli et al., 2016) attacks for varying levels of the perturbation magnitude parameter ε. We also compare to a simple unregularized cross-entropy baseline (black dashed line). To compute overall robustness, we use each attack in turn and only count a sample as robust if it defeats them all. As depicted in figure 5, we find that our surrogate objectives yield significantly more robust models while obtaining similar test accuracy on the unperturbed data, whereas weight-decay regularization reduces robustness against adversarial attacks. Plots for the other two regularizers can be found in the appendix in figure G.13 and figure G.14.
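The "defeats them all" aggregation described above can be made concrete. A small sketch with made-up per-attack outcomes (the boolean masks below are illustrative placeholders, not measured results):

```python
import numpy as np

# Per-sample success masks for each attack at a fixed perturbation budget
# (True = the model's prediction survived that attack). Hypothetical values.
survived = {
    "FGSM":           np.array([True,  True, False, True]),
    "PGD":            np.array([True, False, False, True]),
    "BasicIterative": np.array([True,  True, False, True]),
    "DeepFool":       np.array([True,  True,  True, True]),
}

# A sample counts as robust only if it defeats *all* attacks.
robust = np.logical_and.reduce(list(survived.values()))
overall_robust_accuracy = robust.mean()
assert overall_robust_accuracy == 0.5    # only samples 0 and 3 survive everything
```

This per-sample conjunction is stricter than averaging the per-attack accuracies, which is why it gives a more conservative robustness estimate.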

Information plane plots for CIFAR-10

To compare the different surrogate regularizers, we again use a ResNet18 model on CIFAR-10 with zero-entropy noise added to the final-layer activations Z, with K = 256 dimensions, as an encoder, and add a single K × 10 linear layer as a decoder. We train with the surrogate objectives from section 3.3 for various γ, chosen in logspace from different ranges to compensate for their relationship to β as noted in section 3.3: for log Var[Z], γ ∈ [10^-5, 1]; for log Var[Z | Y], γ ∈ [10^-5, 10]; and for E ‖Z‖², by trial and error, γ ∈ [10^-6, 10]. We estimate information quantities using the method of Kraskov et al. (2004). Figure 6 shows an information plane plot for regularizing with E ‖Z‖² for different γ over different epochs on the training set. Similar to Shwartz-Ziv and Tishby (2017), we observe an initial expansion phase followed by compression. The jumps in performance (reduction of the Residual Information) are due to drops in the learning rate. In figure 4, we can see that the saturation curves for all three surrogate objectives qualitatively match the predicted curve from Tishby and Zaslavsky (2015).
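For readers who want to reproduce information plane plots without implementing the Kraskov et al. (2004) estimator, a crude histogram plug-in estimate of mutual information can serve as a rough stand-in (this is a simplification for illustration, not the estimator used for our figures):

```python
import numpy as np

def binned_mi(a, b, bins=16):
    """Crude plug-in mutual information estimate (in nats) from a 2D histogram."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    pa, pb = p.sum(axis=1), p.sum(axis=0)
    mask = p > 0
    denom = np.outer(pa, pb)
    return float((p[mask] * np.log(p[mask] / denom[mask])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
z_dependent = x + 0.5 * rng.normal(size=20_000)   # latent preserving information
z_independent = rng.normal(size=20_000)           # latent carrying no information

# A latent that preserves information about X should score higher MI.
assert binned_mi(x, z_dependent) > binned_mi(x, z_independent)
```

Plug-in estimates like this are biased upward for small samples and coarse for continuous variables, which is why more careful estimators such as Kraskov et al. (2004) are preferable for quantitative plots.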

Information plane plots for Imagenette

To show that our surrogate objectives also scale up to larger datasets, we run a similar experiment on Imagenette (Howard, 2019), a subset of ImageNet with 10 classes and 224 × 224 × 3 ≈ 1.5 × 10^5 input dimensions, on which we obtain 90% test accuracy. Figure 1 shows the resulting trajectories on the test set. We obtain plots similar to the ones obtained for CIFAR-10, showing that our surrogate objectives scale well to higher-dimensional datasets despite their simplicity.

5. Conclusion

The contributions of this paper have been threefold: First, we have proposed simple, tractable training objectives which capture many of the desirable properties of IB methods while also scaling to problems of interest in deep learning. For this we have introduced implicit stochastic encoders, e.g. using dropout, and compared multi-sample dropout approaches to identify the one that approximates the Decoder Uncertainty H θ [Y | Z], relating them to the cross-entropy loss that is commonly used for classification problems. This widens the range of DNN architectures that can be used with IB objectives considerably. We have demonstrated that our objectives perform well for practical DNNs without cumbersome density models. Second, we have motivated our objectives by providing insight into limitations of IB training, demonstrating how to avoid pathological behavior in IB objectives, and by endeavouring to provide a unifying view on IB approaches. Third, we have provided mathematically grounded intuition by using I-diagrams for the information quantities involved in IB, shown common pitfalls when using information quantities and how to avoid them, and examined how the quantities relate to each other. Future work investigating the practical constraints on the expressivity of a given neural network may provide further insight into how to measure compression in neural networks. Moreover, the connection to Bayesian Neural Networks remains to be explored.

A Information quantities & information diagrams

Here we introduce notation and terminology in greater detail than in the main paper. We review well-known information quantities and provide more details on using information diagrams (Yeung, 1991) .

A.1 Information quantities

We denote the entropy (Shannon, 1948) H[•], joint entropy H[•, •], and conditional entropy H[• | •]. With h(x) := −ln x:

H[X] = E_{p(x)} h(p(x))
H[X, Y] = E_{p(x,y)} h(p(x, y))
H[X | Y] = H[X, Y] − H[Y] = E_{p(y)} H[X | y] = E_{p(x,y)} h(p(x | y))
I[X; Y] = H[X] + H[Y] − H[X, Y] = E_{p(x,y)} h( p(x) p(y) / p(x, y) )
I[X; Y | Z] = H[X | Z] + H[Y | Z] − H[X, Y | Z],

where X, Y, Z are random variables and x, y, z are outcomes these random variables can take. We use differential entropies interchangeably with entropies. We can do so because equalities between them hold, as can be verified by symbolic expansion. For example,

H[X, Y] = H[X | Y] + H[Y] ⇔ E_{p(x,y)} h(p(x, y)) = E_{p(x,y)} [h(p(x | y)) + h(p(y))] = E_{p(x,y)} h(p(x | y)) + E_{p(y)} h(p(y)),

which is valid in both the discrete and the continuous case (if the integrals all exist). The question of how to transfer inequalities from the discrete case to the continuous case is dealt with in section 3.3. We will further require the Kullback-Leibler divergence D_KL(• || •) and cross-entropy H(• || •):

H(p(x) || q(x)) = E_{p(x)} h(q(x))
D_KL(p(x) || q(x)) = E_{p(x)} h( q(x) / p(x) )
H(p(y | x) || q(y | x)) = E_{p(x)} E_{p(y|x)} h(q(y | x)) = E_{p(x,y)} h(q(y | x))
D_KL(p(y | x) || q(y | x)) = E_{p(x,y)} h( q(y | x) / p(y | x) )

A.2 Information diagrams

Information diagrams (I-diagrams), like the one depicted in figure 2 (or figure H.1 for a bigger version), visualize the relationship between information quantities: Yeung (1991) shows that we can define a signed measure µ* such that these well-known quantities map to abstract sets and are consistent with set operations:

H[A] = µ*(A)
H[A_1, ..., A_n] = µ*(∪_i A_i)
H[A_1, ..., A_n | B_1, ..., B_n] = µ*(∪_i A_i − ∪_i B_i)
I[A_1; ...; A_n] = µ*(∩_i A_i)
I[A_1; ...; A_n | B_1, ..., B_n] = µ*(∩_i A_i − ∪_i B_i)

Note that interaction information (McGill, 1954) follows from this construction as the canonical generalization of mutual information to multiple variables, whereas total correlation does not. In other words, equalities can be read off directly from I-diagrams: an information quantity is the sum of its parts in the corresponding I-diagram. This is similar to Venn diagrams. The sets used in I-diagrams are just abstract symbolic objects, however. An important distinction between I-diagrams and Venn diagrams is that, while we can always read off inequalities in Venn diagrams, this is not true for I-diagrams in general, because mutual information terms in more than two variables can be negative. In Venn diagrams, a set is always larger than or equal to any subset. However, if we show that all information quantities are non-negative, we can read off inequalities again. We do this for figure 2 at the end of section 2 for categorical Z and expand this to continuous Z in section 3.3. Thus, we can treat the Mickey Mouse I-diagram like a Venn diagram to read off equalities and inequalities.

Nevertheless, caution is sometimes warranted. As the signed measure can be negative, µ*(X ∩ Y) = 0 does not imply X ∩ Y = ∅: deducing that a mutual information term is 0 does not imply that one can simply remove the corresponding area in the I-diagram. There could be a Z with µ*((X ∩ Y) ∩ Z) < 0, such that µ*(X ∩ Y) = µ*(X ∩ Y ∩ Z) + µ*(X ∩ Y − Z) = 0 but X ∩ Y ≠ ∅. This also means that we cannot drop the term from expressions when performing symbolic manipulations. This is of particular importance because a mutual information of zero means two random variables are independent, which might invite one to draw them as disjoint areas. The only time one can safely remove an area from the diagram is for atomic quantities, which are quantities that reference all the available random variables (Yeung, 1991). For example, when we only have three variables X, Y, Z, I[X; Y; Z] and I[X; Y | Z] are atomic quantities. Continuing the example, 0 = I[X; Y; Z] = µ*(X ∩ Y ∩ Z) would imply X ∩ Y ∩ Z = ∅: we can safely remove atomic quantities from I-diagrams when they are 0, as there are no random variables left to apply that could lead to the problem explored above. We only use I-diagrams for the three-variable case, but they supply us with tools to easily come up with equalities and inequalities for information quantities. In the general case with multiple variables, they can be difficult to draw, but for Markov chains they can be of great use.
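The caveat that mutual information terms in more than two variables can be negative is easy to verify numerically: for independent fair bits X, Y and Z = X XOR Y, the atomic term I[X; Y; Z] = µ*(X ∩ Y ∩ Z) equals −ln 2. A small check:

```python
import numpy as np
from itertools import product

def H(p):
    """Entropy in nats of a flat array of probabilities."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Joint p(x, y, z) for independent fair bits X, Y and Z = X XOR Y.
p = np.zeros((2, 2, 2))
for x, y in product(range(2), repeat=2):
    p[x, y, x ^ y] = 0.25

Hx, Hy, Hz = H(p.sum((1, 2))), H(p.sum((0, 2))), H(p.sum((0, 1)))
Hxy, Hxz, Hyz = H(p.sum(2).ravel()), H(p.sum(1).ravel()), H(p.sum(0).ravel())
Hxyz = H(p.ravel())

# I[X; Y; Z] = mu*(X ∩ Y ∩ Z) via inclusion-exclusion over entropies.
I_xyz = Hx + Hy + Hz - Hxy - Hxz - Hyz + Hxyz
assert abs(I_xyz + np.log(2)) < 1e-12   # the signed measure here is -ln 2 < 0
```

Here X and Y are independent (µ*(X ∩ Y) = 0), yet the region is not "empty": conditioning on Z makes them fully dependent, exactly the situation the text warns about.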

B Mickey Mouse I-diagram B.1 Intuition for the Mickey Mouse information quantities

We base the names of information quantities on existing conventions and come up with sensible extensions. For example, the name Preserved Relevant Information for I[Y; Z] was introduced by Tishby and Zaslavsky (2015). It can be seen as the intersection of I[X; Z] and I[X; Y] in the I-diagram, and hence we denote I[X; Z] the Preserved Information and I[X; Y] the Relevant Information, which are sensible names as we detail below. We identify the following atomic quantities:

Label Uncertainty H[Y | X] quantifies the uncertainty in our labels. If we have multiple labels for the same data sample, it will be > 0. It is 0 otherwise.

Encoding Uncertainty H[Z | X] quantifies the uncertainty in our latent encoding given a sample. When using a Bayesian model with random variable ω for the weights, one can further split this term into H[Z | X] = I[Z; ω | X] + H[Z | X, ω], i.e. uncertainty stemming from weight uncertainty and from independent noise (Houlsby et al., 2011; Kirsch et al., 2019).

Preserved Relevant Information I[Y; Z] quantifies information in the latent that is relevant for our task of predicting the labels (Tishby and Zaslavsky, 2015). Intuitively, we want to maximize it for good predictive performance.

Residual Information I[X; Y | Z] quantifies information about the labels that is not captured by the latent (Tishby and Zaslavsky, 2015) but would be useful to capture.

Redundant Information I[X; Z | Y] quantifies information in the latent that is not needed for predicting the labels.

We also identify the following composite information quantities:

Relevant Information I[X; Y] = I[X; Y | Z] + I[Y; Z] quantifies the information in the data that is relevant for the labels and which our model needs to capture to be able to predict the labels.

Preserved Information I[X; Z] = I[X; Z | Y] + I[Y; Z] quantifies information from the data that is preserved in the latent.
Decoder Uncertainty H[Y | Z] = I[X; Y | Z] + H[Y | X] quantifies the uncertainty about the labels after learning about the latent Z. If H[Y | Z] reaches 0, it means that no additional information is needed to infer the correct label Y from the latent Z: the optimal decoder can be a deterministic mapping. Intuitively, we want to minimize this quantity for good predictive performance.

Reverse Decoder Uncertainty H[Z | Y] = I[X; Z | Y] + H[Z | X] quantifies the uncertainty about the latent Z after learning about the label Y.

Nuisance H[X | Y] = H[X | Y, Z] + I[X; Z | Y] quantifies the information in the data that is not relevant for the task (Achille and Soatto, 2018a).

B.2 Definitions & equivalences

The following equalities can be read off from figure 2. For completeness and to provide a handy reference, we list them explicitly here. They can also be verified using symbolic manipulations and the properties of information quantities. Equalities for composite quantities:

I[X; Y] = I[X; Y | Z] + I[Y; Z]   (17)
I[X; Z] = I[X; Z | Y] + I[Y; Z]   (18)
H[Y | Z] = I[X; Y | Z] + H[Y | X]   (19)
H[Z | Y] = I[X; Z | Y] + H[Z | X]   (20)
H[X | Y] = H[X | Y, Z] + I[X; Z | Y]   (21)

We can combine the atomic quantities into the overall Label Entropy and Encoding Entropy:

H[Y] = H[Y | X] + I[Y; Z] + I[X; Y | Z]   (22)
H[Z] = H[Z | X] + I[Y; Z] + I[X; Z | Y].   (23)

We can express the Relevant Information I[X; Y], Residual Information I[X; Y | Z], Redundant Information I[X; Z | Y] and Preserved Information I[X; Z] without X on the left-hand side:

I[X; Y] = H[Y] − H[Y | X]   (24)
I[X; Z] = H[Z] − H[Z | X]   (25)
I[X; Y | Z] = H[Y | Z] − H[Y | X]   (26)
I[X; Z | Y] = H[Z | Y] − H[Z | X]   (27)

This simplifies estimating these expressions, as X is usually much higher-dimensional and more irregular than the labels or latent encodings. We can also rewrite the Preserved Relevant Information I[Y; Z] as:

I[Y; Z] = H[Y] − H[Y | Z]   (28)
I[Y; Z] = H[Z] − H[Z | Y]   (29)

C Information bottleneck & related works

C.1 Goals & motivation

The IB principle from Tishby et al. (2000) can be recast as a generalization of finding minimal sufficient statistics for the labels given the data (Shamir et al., 2010; Tishby and Zaslavsky, 2015; Fisher, 2019): it strives for minimality and sufficiency of the latent Z. Minimality is about minimizing the amount of information from X needed for the task, i.e. minimizing the Preserved Information I[X; Z]; sufficiency is about preserving the information required to solve the task, i.e. maximizing the Preserved Relevant Information I[Y; Z].
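The equalities of section B.2 above can be checked numerically by constructing a discrete joint that respects the model's factorization p(x, y, z) = p(x, y) p(z | x), i.e. Y ⊥ Z | X. A quick sketch with arbitrary alphabet sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    """Entropy in nats of a flat array of probabilities."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Build a joint respecting the factorization: Z independent of Y given X.
p_xy = rng.random((4, 3)); p_xy /= p_xy.sum()
p_z_given_x = rng.random((4, 5)); p_z_given_x /= p_z_given_x.sum(1, keepdims=True)
p = p_xy[:, :, None] * p_z_given_x[:, None, :]     # p(x, y, z), axes (x, y, z)

Hx = H(p.sum((1, 2)))
Hy, Hz = H(p.sum((0, 2))), H(p.sum((0, 1)))
Hxy, Hxz, Hyz = H(p.sum(2).ravel()), H(p.sum(1).ravel()), H(p.sum(0).ravel())
Hxyz = H(p.ravel())

I_yz = Hy + Hz - Hyz
I_xy_given_z = (Hxz - Hz) + (Hyz - Hz) - (Hxyz - Hz)   # I[X; Y | Z]
I_xz_given_y = (Hxy - Hy) + (Hyz - Hy) - (Hxyz - Hy)   # I[X; Z | Y]
H_y_given_x, H_z_given_x = Hxy - Hx, Hxz - Hx
H_y_given_z, H_z_given_y = Hyz - Hz, Hyz - Hy

# Equations (19), (20), (22), (23) from the list above:
assert abs(H_y_given_z - (I_xy_given_z + H_y_given_x)) < 1e-9
assert abs(H_z_given_y - (I_xz_given_y + H_z_given_x)) < 1e-9
assert abs(Hy - (H_y_given_x + I_yz + I_xy_given_z)) < 1e-9
assert abs(Hz - (H_z_given_x + I_yz + I_xz_given_y)) < 1e-9
```

Note that (19) and (22) rely on the conditional independence Y ⊥ Z | X; for an arbitrary joint over three variables they would not hold.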
From figure 2, we can read off the definitions of the Relevant Information and the Preserved Information:

I[X; Y] = I[Y; Z] + I[X; Y | Z]   (30)
I[X; Z] = I[Y; Z] + I[X; Z | Y].   (31)

C.2 Variants of IB objectives

Alemi et al. (2016) derive the following variational approximation to the IB objective, where z = f_θ(x, ε) denotes a stochastic latent embedding with distribution p_θ(z | x), p_θ(ŷ | z) denotes the decoder, and r(z) is some fixed prior distribution on the latent embedding:

min E_{p(x,y)} E_{ε∼p(ε)} [ −log p_θ(Ŷ = y | z = f_θ(x, ε)) + γ D_KL(p_θ(z | x) || r(z)) ].   (32)

In principle, the distributions p_θ(ŷ | z) and p_θ(z | x) could be given by arbitrary parameterizations and function approximators. In practice, the implementation of DVIB presented by Alemi et al. (2016) constructs p_θ(z | x) as a multivariate Gaussian with parameterized mean and parameterized diagonal covariance using a neural network, and then uses a simple logistic regression to obtain p_θ(ŷ | z), while arbitrarily setting r(z) to be a unit Gaussian around the origin. The requirement for p_θ(z | x) to have a closed-form Kullback-Leibler divergence limits the applicability of the DVIB objective. The DVIB objective can be written more concisely as min H_θ[Y | Z] + γ D_KL(p(z | x) || r(z)) in the notation introduced in section 3. We discuss the regularizer in more detail in section F.

Fischer and Alemi (2020) take CEB and switch to a deterministic model, which they turn into a stochastic encoder by adding unit Gaussian noise. They use Gaussians of fixed variance to variationally approximate q(y | z): for each class, q(y | z) is modelled as a separate Gaussian. They are the first to report results on ImageNet and report good robustness against adversarial attacks without adversarial training.
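For reference, the closed-form Kullback-Leibler divergence that the DVIB implementation relies on, for a diagonal-Gaussian encoder against a unit-Gaussian prior r(z), is ½ Σ_i (σ_i² + µ_i² − 1 − ln σ_i²). A minimal sketch (the function name is ours):

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * float((np.exp(log_var) + mu ** 2 - 1.0 - log_var).sum())

# Sanity checks: the KL is zero iff the encoder already matches r(z),
# and positive otherwise.
assert kl_diag_gaussian_to_standard(np.zeros(4), np.zeros(4)) == 0.0
assert kl_diag_gaussian_to_standard(np.ones(4), np.zeros(4)) > 0.0
```

It is exactly this requirement for a closed-form KL term that restricts DVIB to encoder families like diagonal Gaussians, a restriction our surrogate objectives avoid.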

C.3 Canonical IB & DIB objectives

We expand the IB and DIB objectives into "disjoint" terms and drop constant ones to find a more canonical form. This leads us to focus on the optimization of the Decoder Uncertainty H[Y | Z] along with additional regularization terms. In section 3.2, we discuss the properties of H[Y | Z], and in section 3.3 we examine the regularization terms.

Proposition. For IB, we obtain
$$\arg\min\; I[X; Z] - \beta I[Y; Z] = \arg\min\; H[Y \mid Z] + \beta' \underbrace{I[X; Z \mid Y]}_{= H[Z \mid Y] - H[Z \mid X]},$$
and, for DIB,
$$\arg\min\; H[Z] - \beta I[Y; Z] = \arg\min\; H[Y \mid Z] + \beta' H[Z \mid Y] = \arg\min\; H[Y \mid Z] + \beta'' H[Z],$$
with β′ := 1/(β − 1) ∈ [0, ∞) and β″ := 1/β ∈ [0, 1).

Proof. For the steps marked with (*), we make use of β > 1. For IB, we obtain
$$\arg\min\; I[X; Z] - \beta I[Y; Z] = \arg\min\; I[X; Z \mid Y] + (\beta - 1) H[Y \mid Z] \overset{(*)}{=} \arg\min\; H[Y \mid Z] + \beta' I[X; Z \mid Y] = \arg\min\; H[Y \mid Z] + \beta' (H[Z \mid Y] - H[Z \mid X]), \tag{IB}$$
and, for DIB,
$$\arg\min\; H[Z] - \beta I[Y; Z] = \arg\min\; H[Z \mid Y] + (\beta - 1) H[Y \mid Z] \overset{(*)}{=} \arg\min\; H[Y \mid Z] + \beta' H[Z \mid Y], \tag{DIB}$$
with β′ := 1/(β − 1) ∈ [0, ∞). Similarly, we show for DIB
$$\arg\min\; H[Z] - \beta I[Y; Z] = \arg\min\; H[Z] + \beta H[Y \mid Z] \overset{(*)}{=} \arg\min\; H[Y \mid Z] + \beta'' H[Z],$$
with β″ := 1/β ∈ [0, 1), which is relevant in section 3.3.

We limit ourselves to β > 1 because, for β < 1, we would be maximizing the Decoder Uncertainty, which does not make sense: the obvious solution is one where Z contains no information about Y, that is, p(y | z) is uniform. In the case of DIB, it is to map every input deterministically to a single latent; for IB, we only minimize the Redundant Information, and the solution is free to contain noise. For β = 1, we would not care about the Decoder Uncertainty and would only minimize the Redundant Information or the Reverse Decoder Uncertainty, respectively, which allows for arbitrarily bad predictions. We note that β′ = β″/(1 − β″) using the relations above.

C.4 IB objectives and the Entropy Distance Metric

Another perspective on the IB objectives is to express them using the Entropy Distance Metric. MacKay (2003, p. 140) introduces the entropy distance
$$\mathrm{EDM}(Y, Z) = H[Y \mid Z] + H[Z \mid Y] \tag{35}$$
as a metric when we identify random variables up to permutations of the labels for categorical variables: if the entropy distance is 0, Y and Z are the same distribution up to a consistent permutation of the labels (independent of X). If the entropy distance becomes 0, both H[Y | Z] = 0 = H[Z | Y], and we can find a bijective map between Z and Y. We can express the Reverse Decoder Uncertainty H[Z | Y] using the Decoder Uncertainty H[Y | Z] and the entropies:
$$H[Z \mid Y] + H[Y] = H[Y \mid Z] + H[Z].$$
DIB will encourage the model to match both distributions for γ = 0 (β = 2), as we obtain a term that matches the Entropy Distance Metric, and it otherwise trades off Decoder Uncertainty and Reverse Decoder Uncertainty. IB behaves similarly but tends to maximize the Encoding Uncertainty as γ − 1 ∈ [−2, 0]. Fisher (2019) argues for picking this configuration, similar to the arguments in section C.1. DIB will force both distributions to become exactly the same, which would turn the decoder into a permutation matrix for categorical variables.
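The bijection property of the entropy distance is easy to verify numerically; a small numpy sketch (the channel parameters are made up):

```python
import numpy as np

def cond_entropy(p_ab):
    """H[A | B] in nats for a joint distribution p_ab[a, b]."""
    p_b = p_ab.sum(axis=0, keepdims=True)
    ratio = np.divide(p_ab, p_b, out=np.zeros_like(p_ab), where=p_b > 0)
    mask = p_ab > 0
    return float(-(p_ab[mask] * np.log(ratio[mask])).sum())

def edm(p_yz):
    """Entropy distance EDM(Y, Z) = H[Y | Z] + H[Z | Y] (eq. 35)."""
    return cond_entropy(p_yz) + cond_entropy(p_yz.T)

p_y = np.full(3, 1 / 3)
perm = np.eye(3)[[2, 0, 1]]           # Z is a fixed permutation of Y
p_yz_perm = np.diag(p_y) @ perm       # joint p(y, z) with z = pi(y)
assert abs(edm(p_yz_perm)) < 1e-12    # bijection => entropy distance 0

noisy = 0.8 * perm + 0.2 / 3          # add channel noise
p_yz_noisy = np.diag(p_y) @ noisy
assert edm(p_yz_noisy) > 0.1          # noise => strictly positive distance
```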

D Decoder Uncertainty H[Y | Z]

D.1 Cross-entropy loss

The cross-entropy loss features prominently in section 3.2. We can derive the usual cross-entropy loss for our model by minimizing the Kullback-Leibler divergence between the empirical sample distribution p(x, y) and the parameterized distribution p_θ(x) p_θ(ŷ | x). For discriminative models, we are only interested in p_θ(ŷ | x) and can simply set p_θ(x) = p(x):
$$\begin{aligned}
\arg\min_\theta D_{KL}(p(x, y) \,\|\, p_\theta(x)\, p_\theta(\hat{Y} = y \mid x)) &= \arg\min_\theta \mathbb{E}_{p(x)} D_{KL}(p(y \mid x) \,\|\, p_\theta(\hat{Y} = y \mid x)) + \underbrace{D_{KL}(p(x) \,\|\, p_\theta(x))}_{=0} \\
&= \arg\min_\theta H(p(y \mid x) \,\|\, p_\theta(\hat{Y} = y \mid x)) - \underbrace{H[Y \mid X]}_{\text{const.}} \\
&= \arg\min_\theta H(p(y \mid x) \,\|\, p_\theta(\hat{Y} = y \mid x)).
\end{aligned}$$
In section 3.2, we introduce the shorthand H_θ[Y | X] for H(p(y | x) || p_θ(Ŷ = y | x)) and refer to it as the Prediction Cross-Entropy.
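The decomposition behind this derivation, H_θ[Y | X] = H[Y | X] + E_{p(x)} D_KL(p(y | x) ‖ p_θ(ŷ | x)), can be checked numerically for discrete distributions; a minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Empirical p(x) and p(y | x), and a model p_theta(y | x), over finite alphabets.
p_x = rng.random(5); p_x /= p_x.sum()
p_y_x = rng.random((5, 3)); p_y_x /= p_y_x.sum(1, keepdims=True)
q_y_x = rng.random((5, 3)); q_y_x /= q_y_x.sum(1, keepdims=True)

# Prediction Cross-Entropy H_theta[Y | X] = E_p(x,y)[-ln q(y | x)].
ce = -np.sum(p_x[:, None] * p_y_x * np.log(q_y_x))
# Label Uncertainty H[Y | X] and expected KL term.
h_y_x = -np.sum(p_x[:, None] * p_y_x * np.log(p_y_x))
kl = np.sum(p_x[:, None] * p_y_x * np.log(p_y_x / q_y_x))

# Cross-entropy = conditional entropy + expected KL, so minimizing the
# cross-entropy minimizes the KL up to the constant H[Y | X].
assert np.isclose(ce, h_y_x + kl)
assert kl >= 0
```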

D.2 Upper bounds & training error minimization

To motivate that H[Y | Z] (or H_θ[Y | Z]) can be used as the main loss term, we show that it can bound the (training) error probability, since accuracy is often the true objective when machine learning models are deployed on real-world problemsfoot_9.

Proposition. The Decoder Cross-Entropy provides an upper bound on the Decoder Uncertainty:
$$H[Y \mid Z] \le H[Y \mid Z] + D_{KL}(p(y \mid z) \,\|\, p_\theta(\hat{y} \mid z)) = H_\theta[Y \mid Z],$$
and further bounds the training error:
$$p(\text{``}\hat{Y}\text{ is wrong''}) \le 1 - e^{-H_\theta[Y \mid Z]} = 1 - e^{-(H[Y \mid Z] + D_{KL}(p(y \mid z) \| p_\theta(\hat{y} \mid z)))}.$$
Likewise for the Prediction Cross-Entropy H_θ[Y | X] and the Label Uncertainty H[Y | X].

Proof. The upper bounds for the Decoder Uncertainty H[Y | Z] and the Label Uncertainty H[Y | X] follow from the non-negativity of the Kullback-Leibler divergence, for example:
$$0 \le D_{KL}(p(y \mid z) \,\|\, p_\theta(\hat{y} \mid z)) = H_\theta[Y \mid Z] - H[Y \mid Z],$$
$$0 \le D_{KL}(p(y \mid x) \,\|\, p_\theta(\hat{y} \mid x)) = H_\theta[Y \mid X] - H[Y \mid X].$$
The derivation for the training error probability is as follows:
$$p(\text{``}\hat{Y}\text{ is correct''}) = \mathbb{E}_{p(x,y)}\, p(\text{``}\hat{Y}\text{ is correct''} \mid x, y) = \mathbb{E}_{p(x,y)}\, \mathbb{E}_{p_\theta(z \mid x)}\, p_\theta(\hat{Y} = y \mid z) = \mathbb{E}_{p(y,z)}\, p_\theta(\hat{Y} = y \mid z).$$
We can then apply Jensen's inequality using the convex function h(x) = −ln x:
$$h\!\left( \mathbb{E}_{p(y,z)}\, p_\theta(\hat{Y} = y \mid z) \right) \le \mathbb{E}_{p(y,z)}\, h\!\left( p_\theta(\hat{Y} = y \mid z) \right) \;\Leftrightarrow\; p(\text{``}\hat{Y}\text{ is correct''}) \ge e^{-H(p(y \mid z) \| p_\theta(\hat{Y} = y \mid z))} \;\Leftrightarrow\; p(\text{``}\hat{Y}\text{ is wrong''}) \le 1 - e^{-H_\theta[Y \mid Z]}.$$
In the next section, we examine categorical Z, for which optimal decoders can be constructed; for such decoders, D_KL(p(y | z) || p_θ(Ŷ = y | z)) becomes zero, and for small H_θ[Y | Z] the bound is tight.
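The Jensen bound can be sanity-checked numerically for an arbitrary discrete joint and decoder; a minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Joint p(y, z) over finite alphabets and a (sub-optimal) decoder q(y | z).
p_yz = rng.random((3, 6)); p_yz /= p_yz.sum()
q_y_z = rng.random((3, 6)); q_y_z /= q_y_z.sum(0, keepdims=True)

# Decoder Cross-Entropy H_theta[Y | Z] = E_p(y,z)[-ln q(y | z)].
ce = -np.sum(p_yz * np.log(q_y_z))
# p("Y_hat is correct") = E_p(y,z) q(y | z) when Y_hat is sampled from q(. | z).
p_correct = np.sum(p_yz * q_y_z)

# Jensen: p("Y_hat is wrong") <= 1 - exp(-H_theta[Y | Z]).
assert 1 - p_correct <= 1 - np.exp(-ce) + 1e-12
```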

E Categorical Z

For categorical Z, p(y | z) can be computed exactly for a given encoder p_θ(z | x) by using the empirical data distribution, which, in turn, allows us to compute H[Y | Z]foot_10. This is similar to computing a confusion matrix between Y and Z, but using information content instead of probabilities. Moreover, if we set p_θ(ŷ | z) := p(Y = ŷ | z) to obtain an optimal decoder, we obtain equality in equation (9), and obtain H_θ[Y | X] ≤ H_θ[Y | Z] = H[Y | Z]. If the encoder were also deterministic, we would obtain H_θ[Y | X] = H_θ[Y | Z] = H[Y | Z].

We can minimize H[Y | Z] directly using gradient descent: d/dθ H[Y | Z] only depends on p(y | z) and d/dθ p_θ(z | x):
$$\frac{d}{d\theta} H[Y \mid Z] = \mathbb{E}_{p(x,z)} \left[ \frac{d}{d\theta} \ln p_\theta(z \mid x)\; \mathbb{E}_{p(y \mid x)}\, h(p(y \mid z)) \right].$$
Proof.
$$\frac{d}{d\theta} H[Y \mid Z] = \frac{d}{d\theta} \mathbb{E}_{p(y,z)}\, h(p(y \mid z)) = \frac{d}{d\theta} \mathbb{E}_{p(x,y,z)}\, h(p(y \mid z)) = \mathbb{E}_{p(x,y)} \frac{d}{d\theta} \mathbb{E}_{p_\theta(z \mid x)}\, h(p(y \mid z)) = \mathbb{E}_{p(x,y,z)} \left[ \frac{d}{d\theta} h(p(y \mid z)) + h(p(y \mid z)) \frac{d}{d\theta} \ln p_\theta(z \mid x) \right].$$
We now show that the first term vanishes:
$$\mathbb{E}_{p(x,y,z)} \frac{d}{d\theta} h(p(y \mid z)) = \mathbb{E}_{p(y,z)} \frac{d}{d\theta} h(p(y \mid z)) = \mathbb{E}_{p(y,z)} \left[ \frac{-1}{p(y \mid z)} \frac{d}{d\theta} p(y \mid z) \right] = -\sum_{y,z} \frac{p(y, z)}{p(y \mid z)} \frac{d}{d\theta} p(y \mid z) = -\sum_z p(z) \frac{d}{d\theta} \sum_y p(y \mid z) = 0,$$
as Σ_y p(y | z) = 1 for all z.

F Bounding the regularization terms

F.1 Lower bounds & noise injection

Theorem 2. For independent random variables A and B, we have H[A + B] ≥ H[B].

Proof. See Bercher and Vignat (2002, section 2.2).

Proposition 1. Let Y, Z and X be random variables satisfying the independence property Z ⊥ Y | X, and F a possibly stochastic function such that Z = F(X) + ε, with independent noise ε satisfying ε ⊥ F(X), ε ⊥ Y and H[ε] = 0. Then the following holds whenever I[Y; Z] is well-defined:
$$I[X; Z \mid Y] \le H[Z \mid Y] \le H[Z].$$
Proof. First, we note that H[Z | X] = H[F(X) + ε | X] ≥ H[ε | X] = H[ε] = 0 by theorem 2, as ε is independent of X; thus H[Z | X] ≥ 0. We have H[Z | X] = H[Z | X, Y] by the conditional independence assumption, and, by the non-negativity of mutual information, I[Y; Z] ≥ 0.
Then, using the decompositions from equations (20) and (29):
$$I[X; Z \mid Y] + \underbrace{H[Z \mid X]}_{\ge 0} = H[Z \mid Y], \qquad H[Z \mid Y] + \underbrace{I[Y; Z]}_{\ge 0} = H[Z].$$
The probabilistic model from section 2 fulfills the conditions exactly, and the two statements motivate our proposition. It is important to note that while zero-entropy noise is necessary for preserving inequalities like I[X; Z | Y] ≤ H[Z | Y] ≤ H[Z] in the continuous case, any Gaussian noise will suffice for optimization purposes: we optimize by pushing down an upper bound, and constant offsets do not affect this. Thus, if we had H[ε] ≠ 0, even though I[X; Z | Y] + H[Z | X] ≤ H[Z | Y] no longer has non-negative summands, we could instead use
$$I[X; Z \mid Y] + H[Z \mid X] - H[\varepsilon] \le H[Z \mid Y] - H[\varepsilon],$$
which differs from the original bound only by the constant H[ε].
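Returning to categorical Z from section E: there, p(y | z) and H[Y | Z] can be computed exactly from the empirical distribution; a minimal numpy sketch (alphabet sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical data distribution p(x, y) and a stochastic encoder p_theta(z | x)
# over small finite alphabets.
p_xy = rng.random((8, 3)); p_xy /= p_xy.sum()
p_z_x = rng.random((8, 4)); p_z_x /= p_z_x.sum(1, keepdims=True)

# Joint p(y, z) = sum_x p(x, y) p_theta(z | x): a soft "confusion matrix".
p_yz = np.einsum('xy,xz->yz', p_xy, p_z_x)
p_y_z = p_yz / p_yz.sum(0, keepdims=True)   # optimal decoder p(Y = y | z)

# Exact Decoder Uncertainty H[Y | Z] in nats, attained by the optimal decoder.
h_y_z = float(-(p_yz * np.log(p_y_z)).sum())

# Label Uncertainty H[Y | X] for comparison; data processing gives
# H[Y | Z] >= H[Y | X], since Z is a (stochastic) function of X.
p_x = p_xy.sum(1, keepdims=True)
h_y_x = float(-(p_xy * np.log(p_xy / p_x)).sum())
assert h_y_z >= h_y_x - 1e-12
```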

F.2 Upper bounds

We derive this result as follows:
$$H[Z \mid Y] = \mathbb{E}_{p(y)} H[Z \mid y] \le \mathbb{E}_{p(y)} \tfrac{1}{2} \ln \det(2\pi e \operatorname{Cov}[Z \mid y]) \le \mathbb{E}_{p(y)} \sum_i \tfrac{1}{2} \ln(2\pi e \operatorname{Var}[Z_i \mid y]) \approx \mathbb{E}_{p(y)} \sum_i \tfrac{1}{2} \ln(2\pi e \widehat{\operatorname{Var}}[Z_i \mid y]),$$
where the last step substitutes an empirical (minibatch) estimate of the variance.

Theorem 3. Given a k-dimensional random variable X = (X_i)_{i=1}^k with Var[X_i] > 0 for all i,
$$H[X] \le \tfrac{1}{2} \ln \det(2\pi e \operatorname{Cov}[X]) \le \sum_i \tfrac{1}{2} \ln(2\pi e \operatorname{Var}[X_i]).$$
Proof. First, the multivariate normal distribution with the same covariance is the maximum entropy distribution for that covariance, and thus H[X] ≤ ½ ln det(2πe Cov[X]) when we substitute the differential entropy of a multivariate normal distribution with covariance Cov[X]. Let Σ₀ := Cov[X] be the covariance matrix and Σ₁ := diag(Var[X_i])_i the matrix that only contains its diagonal. Because we add independent noise, Var[X_i] > 0 and thus Σ₁⁻¹ exists. It is clear that tr(Σ₁⁻¹ Σ₀) = k. Then, we can use the KL divergence between two multivariate normal distributions N₀, N₁ with the same mean and covariances Σ₀ and Σ₁ to show that ln det Σ₀ ≤ ln det Σ₁:
$$0 \le D_{KL}(N_0 \,\|\, N_1) = \tfrac{1}{2} \left( \operatorname{tr}(\Sigma_1^{-1} \Sigma_0) - k + \ln \tfrac{\det \Sigma_1}{\det \Sigma_0} \right) \;\Leftrightarrow\; 0 \le \tfrac{1}{2} \ln \tfrac{\det \Sigma_1}{\det \Sigma_0} \;\Leftrightarrow\; \tfrac{1}{2} \ln \det \Sigma_0 \le \tfrac{1}{2} \ln \det \Sigma_1.$$
We substitute the definitions of Σ₀ and Σ₁, and obtain the second inequality after adding (k/2) ln(2πe) on both sides.

Theorem 4. Given a k-dimensional real-valued random variable X = (X_i)_{i=1}^k ∈ ℝ^k, we can bound the entropy by the mean squared norm of the latent:
$$\mathbb{E}\|X\|^2 \le C' \;\Rightarrow\; H[X] \le C, \quad \text{with } C' := \frac{k\, e^{2C/k}}{2\pi e}.$$
Proof. We begin with the previous bound:
$$H[X] \le \sum_i \tfrac{1}{2} \ln(2\pi e \operatorname{Var}[X_i]) = \tfrac{k}{2} \ln 2\pi e + \tfrac{1}{2} \ln \prod_i \operatorname{Var}[X_i] \le \tfrac{k}{2} \ln 2\pi e + \tfrac{1}{2} \ln \left( \tfrac{1}{k} \sum_i \operatorname{Var}[X_i] \right)^{\!k} = \tfrac{k}{2} \ln \left( \tfrac{2\pi e}{k} \sum_i \operatorname{Var}[X_i] \right) \le \tfrac{k}{2} \ln \left( \tfrac{2\pi e}{k} \mathbb{E}\|X\|^2 \right),$$
where we use the AM-GM inequality $\left( \prod_i \operatorname{Var}[X_i] \right)^{1/k} \le \tfrac{1}{k} \sum_i \operatorname{Var}[X_i]$, the monotonicity of the logarithm, and
$$\sum_i \operatorname{Var}[X_i] = \sum_i \mathbb{E}[X_i^2] - \mathbb{E}[X_i]^2 \le \sum_i \mathbb{E}[X_i^2] = \mathbb{E}\|X\|^2.$$
Bounding with E‖X‖² ≤ C′, we obtain H[X] ≤ (k/2) ln((2πe/k) C′) = C, and solving for C′ yields the statement.
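Theorem 3's inequality chain can be sanity-checked on samples; a minimal numpy sketch (the mixing matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Correlated k-dimensional samples (any distribution works for the bound).
k, n = 5, 100000
mix = rng.random((k, k))
x = rng.standard_normal((n, k)) @ mix.T

cov = np.cov(x, rowvar=False)
# 0.5 * ln det(2*pi*e * Cov[X]): the Gaussian maximum-entropy bound.
sign, logdet = np.linalg.slogdet(2 * np.pi * np.e * cov)
gauss_bound = 0.5 * logdet
# Sum of per-dimension bounds from the diagonal (second inequality).
diag_bound = 0.5 * np.sum(np.log(2 * np.pi * np.e * np.diag(cov)))

# Hadamard's inequality for positive-definite matrices guarantees this:
assert sign > 0 and gauss_bound <= diag_bound + 1e-9
```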
Theorem 4 provides justification for the use of ln E‖Z‖² as a regularizer, but does not justify the use of E‖Z‖² directly. Here, we give two motivations. We first observe that ln x ≤ x − 1 due to ln's strict concavity and ln 1 = 0, and thus:
$$H[X] \le \frac{k}{2} \ln\left( \frac{2\pi e}{k} \mathbb{E}\|X\|^2 \right) \le \frac{k}{2} \left( \frac{2\pi e}{k} \mathbb{E}\|X\|^2 - 1 \right) \le \pi e\, \mathbb{E}\|X\|^2.$$
We can also take a step back and remind ourselves that IB objectives like min I[X; Z] − βI[Y; Z] are actually Lagrangians, so a constraint on E‖Z‖² can equally well enter the objective directly through its multiplier.

F.3 Relation to DVIB's regularizer

We can expand DVIB's regularization term as
$$\mathbb{E}_{p(x)} D_{KL}(p(z \mid x) \,\|\, \mathcal{N}(0, I_k)) = \mathbb{E}_{p(x)}\, \mathbb{E}_{p(z \mid x)} \left[ h\!\left( (2\pi)^{-k/2} e^{-\frac{1}{2}\|Z\|^2} \right) \right] - H[Z \mid X] = \mathbb{E}_{p(z)} \left[ \tfrac{k}{2} \ln(2\pi) + \tfrac{1}{2} \|Z\|^2 \right] - H[Z \mid X].$$
After dropping constant terms (as they do not matter for optimization purposes), we obtain
$$\tfrac{1}{2}\, \mathbb{E}\|Z\|^2 - H[Z \mid X].$$
When we inject zero-entropy noise into the latent Z, we have H[Z | X] ≥ 0 and thus ½ E‖Z‖² − H[Z | X] ≤ ½ E‖Z‖². Thus, the E‖Z‖² regularizer also upper-bounds DVIB's regularizer (up to the factor ½) in this case, with equality when we use a deterministic encoder. When we inject zero-entropy noise and use a deterministic encoder, we are hence optimizing the DVIB objective function when we use the E‖Z‖² regularizer. In other words, in this particular case, we could reinterpret "min H_θ[Y | Z] + γ E‖Z‖²" as optimizing the DVIB objective from Alemi et al. (2016) if they were using a constant covariance instead of parameterizing it in their encoder. This does not hold for stochastic encoders. We empirically compare DVIB and the surrogate objectives from section 3.3 in section G.5. In the corresponding plot in figure G.15, we can indeed note that E‖Z‖² and DVIB are separated by a factor of 2 in the Lagrange multiplier. Both Alemi et al. (2016) and Fischer (2020) focus on the application of variational approximations to these quantities.
F.4 Variational approaches

Using a slight abuse of notation to denote all variational approximations, we can write the VCEB objectivefoot_12 (Fischer, 2020) and the DVIB objective (Alemi et al., 2016) more concisely as
$$\text{VCEB} \equiv \min_\theta H_\theta[Y \mid Z] + \beta (H_\theta[Z \mid Y] - H_\theta[Z \mid X]),$$
$$\text{DVIB} \equiv \min_\theta H_\theta[Y \mid Z] + \beta (H_\theta[Z] - H_\theta[Z \mid X]).$$
DVIB does not specify how to choose stochastic encoders and picks the variational marginal q(z) to be a unit Gaussian. We relate this choice of marginal to the E‖Z‖² surrogate objective in section F.3. Alemi et al. (2016) use VAE-like encoders that output mean and standard deviation for latents, which are then sampled from a multivariate Gaussian distribution with diagonal covariance in their experiments. They run experiments on MNIST and on features extracted from the penultimate layer of models pretrained on ImageNet. While VCEB as introduced in Fisher (2019) is agnostic to the choice of stochastic encoder, Fischer (2020) mentions that stochastic encoders can be similar to encoders and decoders in VAEs (Kingma and Welling, 2013), or as in DVIB above. Both VAEs and DVIB explicitly parameterize the distribution of the latent to sample from it before passing samples to the decoder. Fischer and Alemi (2020) use an existing classifier architecture to output means for a Gaussian distribution with unit diagonal covariance. They further parameterize the variational approximation for the Reverse Decoder Uncertainty q(y | z) with one Gaussian of fixed variance per class and learn this reverse decoder during training as well. Fischer and Alemi (2020) report results on CIFAR-10 and ImageNet that show good robustness against adversarial attacks without adversarial training, similar to the results in this paper. This specific (and not otherwise motivated) instantiation of the VCEB objective in Fischer and Alemi (2020) is similar to the log Var[Z | Y] surrogate objective introduced in section 3.3 with a deterministic encoder and zero-entropy noise injection.
However, the latter uses minibatch statistics instead of learning a reverse decoder, trading variational tightness for ease of computation and optimization. Compared to this prior literature, this paper examines the use of implicit stochastic encoders (for example, when using dropout) and presents three simple surrogate objectives together with a principled motivation for zero-entropy noise injection, which has a dual use in enforcing meaningful compression and in simplifying the estimation of information quantities. Moreover, multi-sample approaches are examined to differentiate between the Decoder Cross-Entropy and the Prediction Cross-Entropy. In particular, implicit stochastic encoders together with zero-entropy noise and simple surrogates make it easier to use IB objectives in practice compared to explicitly parameterized stochastic encoders and variational approaches.

F.5 An information-theoretic approach to VAEs

While Alemi et al. (2016) draw a general connection to β-VAEs (Higgins et al., 2016), we can use the insights from this paper to derive a simple VAE objective. Taking the view that VAEs learn latent representations that compress input samples, we can approach them as entropy estimators. Using H[X] + H[Z | X] = H[X | Z] + H[Z], we obtain the ELBO:
$$H[X] = H[X \mid Z] + H[Z] - H[Z \mid X] \overset{(1)}{\le} H_\theta[X \mid Z] + H[Z] - H[Z \mid X] \overset{(2)}{\le} H_\theta[X \mid Z] + H[Z]. \tag{40}$$
We can also put eq. (40) into words: we want to find latent representations such that the reconstruction cross-entropy H_θ[X | Z] and the latent entropy H[Z], which together determine the length of encoding an input sample, become minimal and approach the true entropy, that is the optimal average encoding length under the dataset distribution. The first inequality (1) stems from introducing a cross-entropy approximation H_θ[X | Z] for the conditional entropy H[X | Z]. The second inequality (2) stems from the injection of zero-entropy noise with a stochastic encoder; for a deterministic encoder, we would have equality.
We also note that (1) is the DVIB objective for a VAE with β = 1, and (2) is the DIB objective for a VAE. Finally, we can use one of the surrogates introduced in section 3.3 to upper-bound H[Z]. For optimization purposes, we can substitute the L2 activation regularizer E‖Z‖² from proposition 4 and obtain the objective
$$\min_\theta H_\theta[X \mid Z] + \mathbb{E}\|Z\|^2.$$
It turns out that this objective is examined, amongst others, in the recently published Ghosh et al. (2019) as a CV-VAE, which uses a deterministic encoder and noise injection with constant variance. That paper derives the objective by noticing that the explicit parameterizations commonly used for VAEs are cumbersome, and that the actual latent distribution often does not match the induced distribution (commonly a unit Gaussian), which causes sampling to generate out-of-distribution data. It fits a separate density estimator on p(z) after training for sampling. The paper goes on to examine other methods of regularization, but also provides experimental results on CV-VAE, which are in line with VAEs and WAEs. The derivation and motivation in that paper are different and make no use of information-theoretic principles. Our short derivation above shows the power of using the insights from sections 3.2 and 3.3 for applications outside of supervised learning.

F.6 Gaussian noise & geometric clustering

Consider a deterministic encoder f_θ with additive Gaussian noise of variance σ² on a finite dataset, and let µ_i = f_θ(x_i). Then the distribution of Z is given by a mixture of Gaussians with the following density, where d(x, µ_i) := ‖x − µ_i‖²/σ²:
$$p(z) \propto \frac{1}{n} \sum_{i=1}^n \exp(-d(z, \mu_i)).$$
Assuming that each x_i has a deterministic label y_i, we then find that the conditional distributions p(z | y) and p(y | z) are given as follows:
$$p(z \mid y) \propto \frac{1}{n_y} \sum_{i: y_i = y} \exp(-d(z, \mu_i)),$$
$$p(y \mid z) = \sum_{i: y_i = y} p(\mu_i \mid z) = \sum_{i: y_i = y} \frac{p(z \mid \mu_i)\, p(\mu_i)}{p(z)} = \frac{\sum_{i: y_i = y} p(z \mid \mu_i)}{\sum_{k=1}^n p(z \mid \mu_k)} = \frac{\sum_{i: y_i = y} \exp(-d(z, \mu_i))}{\sum_{k=1}^n \exp(-d(z, \mu_k))},$$
where n_y is the number of x_i with class y_i = y.
Thus, the conditional Z|Y can be interpreted as a mixture of Gaussians and Y|Z as a softmax with respect to the distances between Z and the mean embeddings. We observe that H[Z | Y] is lower-bounded by the entropy of the random noise added to the embeddings:
$$H[Z \mid Y] \ge H[f_\theta(X) + \epsilon \mid Y] \ge H[\epsilon],$$
with equality when the distribution of f_θ(X)|Y is deterministic, that is, f_θ is constant on each equivalence class. Further, the entropy H[Y | Z] is minimized when H[Z] is large compared to H[Z | Y], as we have the decomposition
$$H[Y \mid Z] = H[Z \mid Y] - H[Z] + H[Y].$$
In particular, when f_θ is constant over equivalence classes of the input, H[Y | Z] is minimized when the entropy H[f_θ(X) + ε] is large, i.e. the values of f_θ(x_i) for the equivalence classes are distant from each other and there is minimal overlap between the clusters. Therefore, the optima of the information bottleneck objective under Gaussian noise share similar properties with the optima of geometric clustering of the inputs according to their output class.

To gain a better understanding of local optimization behavior, we decompose the objective terms as follows:
$$H[Z \mid Y] = \mathbb{E}_{p(y)}\, H(p(z \mid y) \,\|\, p(z \mid y)) = \mathbb{E}_{p(x,y)}\, H(p(z \mid x) \,\|\, p(z \mid y)) = \mathbb{E}_{p(x,y)}\, D_{KL}(p(z \mid x) \,\|\, p(z \mid y)) + \underbrace{H[Z \mid X]}_{\text{const.}}.$$
To examine how the mean embedding µ_k of a single datapoint x_k affects this entropy term, we look at the derivative of this expression with respect to µ_k = f_θ(x_k). We obtain:
$$\frac{d}{d\mu_k} H[Z \mid Y] = \frac{d}{d\mu_k} H[Z \mid y_k] = \frac{d}{d\mu_k} \mathbb{E}_{p(x \mid y_k)}\, D_{KL}(p(z \mid x) \,\|\, p(z \mid y_k)) = \sum_{i: y_i = y_k,\, i \ne k} \frac{1}{n_{y_k}} \frac{d}{d\mu_k} D_{KL}(p(z \mid x_i) \,\|\, p(z \mid y_k)) + \frac{1}{n_{y_k}} \frac{d}{d\mu_k} D_{KL}(p(z \mid x_k) \,\|\, p(z \mid y_k)).$$
While these derivatives do not have a simple analytic form, we can use known properties of the KL divergence to develop an intuition for how the gradient will behave.
We observe that in the left-hand sum, µ_k only affects the distribution of Z|Y (that is, we are differentiating a sum of terms that behave like a reverse KL), whereas it has greater influence on p(z | x_k) in the right-hand term, whose gradient will therefore more closely resemble that of the forward KL. The left-hand term will thus push µ_k towards the centroid of the means of inputs mapping to y_k, whereas the right-hand term is mode-seeking.
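The posterior p(y | z) derived above is a softmax over the squared distances to the mean embeddings; a small numpy sketch with synthetic, well-separated clusters (all values are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic mean embeddings mu_i = f_theta(x_i) with labels y_i: three
# well-separated clusters of 7 points each, noise variance sigma^2.
sigma2 = 0.5
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
y = np.repeat(np.arange(3), 7)
mu = centers[y] + 0.3 * rng.standard_normal((21, 2))

def p_y_given_z(z):
    """Softmax posterior over classes from squared distances d(z, mu_i)."""
    d = ((z - mu) ** 2).sum(axis=1) / sigma2
    w = np.exp(-(d - d.min()))  # shift by the minimum for numerical stability
    w /= w.sum()
    return np.array([w[y == c].sum() for c in range(3)])

post = p_y_given_z(np.array([2.5, 2.5]))
assert np.isclose(post.sum(), 1.0)            # a proper distribution over classes
assert p_y_given_z(centers[1]).argmax() == 1  # concentrates near a cluster
```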

F.7 A note on differential and discrete entropies

The mutual information between two random variables can be defined as the KL divergence between their joint distribution and the product of their marginals. However, the KL divergence is only well-defined when the Radon-Nikodym derivative of the joint with respect to the product measure exists. Mixing continuous and discrete distributions, and thus differential and discrete entropies, can violate this requirement and so lead to negative values of the "mutual information". This is particularly worrying in the setting of training stochastic neural networks, as we often assume that a stochastic embedding is generated as a deterministic transformation of an input from a finite dataset to which a continuous perturbation is added. We provide an example where naive computation, without ensuring that the joint and product distributions of the two random variables have a well-defined Radon-Nikodym derivative, yields negative mutual information. Let X ∼ U([0, 0.1]) and Z = X + R with R ∼ U({0, 1}). Then, computed naively, I[X; Z] = H[X] = ln(1/10) < 0. Generally, given X as above and an invertible function f such that Z = f(X), I[X; Z] = H[X] and can thus be negative. In a way, these cases can be reduced to (degenerate) expressions of the form I[X; X] = H[X]. We can avoid them by adding independent continuous noise. These examples show that not adding noise can lead to unexpected results: while such computations still yield finite quantities that bear a relation to the entropies of the random variables, they violate core assumptions, such as that mutual information is always non-negative.
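The pathology can be made concrete with a few lines of numpy (the sample-based recovery of X from Z at the end is only illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

# X ~ U([0, 0.1]), Z = X + R with R ~ U({0, 1}).
n = 100000
x = rng.uniform(0, 0.1, n)
z = x + rng.integers(0, 2, n)

# Naive computation mixes a differential entropy H[Z] with a discrete
# conditional entropy H[Z | X] = ln 2:
h_z = np.log(0.2)            # density 1/0.2 on [0, 0.1] u [1, 1.1]
h_z_given_x = np.log(2.0)    # given x, Z is uniform on {x, x + 1}
naive_mi = h_z - h_z_given_x

assert np.isclose(naive_mi, np.log(0.1))   # "I[X; Z]" = H[X] = ln(1/10) < 0
assert naive_mi < 0
# Yet Z determines X exactly, so the negative value is clearly pathological:
x_recovered = np.where(z >= 1.0, z - 1.0, z)
assert np.allclose(x_recovered, x)
```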

G.1 DNN architectures and hyperparameters

For our experiments, we use PyTorch (Paszke et al., 2019) and the Adam optimizer (Kingma and Ba, 2014). For CIFAR-10, we generally use an initial learning rate of 0.5 × 10⁻³ and multiply the learning rate by √0.1 whenever the loss plateaus for more than 10 epochs. For MNIST and Permutation MNIST, we use an initial learning rate of 10⁻⁴ and multiply the learning rate by 0.8 whenever the loss plateaus for more than 3 epochs. Sadly, we deviate from this in the following experiments: when optimizing the decoder uncertainty for categorical Z on CIFAR-10, we used a patience of 5 epochs and an initial learning rate of 10⁻⁴. We do not expect this difference to affect the qualitative results mentioned in section E when comparing to other objectives. We also used only 5 epochs of patience when comparing the two cross-entropies on CIFAR-10 in section 3.2; as this was used for both sets of experiments, it does not matter. We train the experiments for creating the information plane plots for 150 epochs. The toy experiment (figure 3) is trained for 20 epochs. All other experiments train for 100 epochs. We use a batch size of 128 for most experiments, a batch size of 32 for comparing the cross-entropies on CIFAR-10 (where we take 8 dropout samples each), and a batch size of 16 for MNIST (where we take 64 dropout samples each). For MNIST, we use a standard dropout CNN, following https://github.com/pytorch/examples/blob/master/mnist/main.py. For Permutation MNIST, we use a fully-connected model (for experiments with categorical Z in section E): 784 × 1024 × 1024 × C. For CIFAR-10, we use a regular deterministic ResNet18 model (He et al., 2016a) for the experiments in section E. (As the model outputs a categorical distribution, it becomes stochastic through that, and we do not need stochasticity in the weights.) For the other experiments as well as the Imagenette experiments, we use a ResNet18v2 (He et al., 2016b).
When we need a stochastic model for CIFAR-10 (for continuous Z), we add DropConnect (Wan et al., 2013b) with rate 0.1 to all but the first convolutional layer, and dropout with rate 0.1 before the final fully-connected layer. Because of memory constraints, we reuse the dropout masks within a batch. The model trains to 94% accuracy on CIFAR-10. For CIFAR-10, we always remove the max-pooling layer and change the first convolutional layer to have kernel size 3 with stride 1 and padding 1. We also use dataset augmentation during training, but not when evaluating on the training and test sets for the purpose of computing metrics: we pad the training images by 4 pixels in every direction, crop randomly, and flip images horizontally at random. We generally sample 30 values of γ for the information plane plots from the specified ranges, using a log scale. For the ablation studies mentioned below, we sample 10 values of γ each. We always sample γ = 0 separately and run a trial with it. Baselines were tuned by hand (without regularization) using grad-student descent and small grid searches.
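The pad-crop-flip augmentation can be sketched directly; a minimal numpy version (torchvision's `RandomCrop(32, padding=4)` and `RandomHorizontalFlip` are the usual library equivalents):

```python
import numpy as np

rng = np.random.default_rng(7)

def augment(img):
    """Pad 4 px on every side, take a random 32x32 crop, flip horizontally at random."""
    c, h, w = img.shape                      # CHW layout, h = w = 32
    padded = np.pad(img, ((0, 0), (4, 4), (4, 4)))
    top, left = rng.integers(0, 9, size=2)   # 40 - 32 + 1 = 9 valid offsets
    crop = padded[:, top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, :, ::-1]              # horizontal flip
    return crop

img = rng.random((3, 32, 32)).astype(np.float32)
out = augment(img)
assert out.shape == (3, 32, 32)
```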

G.2 Cluster setup & used resources

We make use of a local SLURM cluster (Jette et al., 2002). We run our experiments on GPUs (GeForce RTX 2080 Ti). We estimate reproducing all results would take 94 GPU days.

G.3 Comparison of the surrogate objectives

[Figure: information plane trajectories on the training set, comparing the objectives H_θ[Y | Z], I[Y; X | Z], log Var[Z], log Var[Z | Y], and E‖Z‖².]

G.3.1 Measurement of information quantities

Measuring information quantities can be challenging. As mentioned in the introduction, there are many complex ways of measuring entropies and mutual information terms. We side-step the issue by making use of the bounds we have established and the zero-entropy noise we inject, and design experiments around that. First, to estimate the Preserved Information I[X; Z], we note that when we use a deterministic model as encoder and only inject zero-entropy noise, we have H[Z | X] = 0 and thus I[X; Z] = H[Z] − H[Z | X] = H[Z]. We use the entropy estimator from Kraskov et al. (2004, equation (20)) to estimate the Encoding Entropy H[Z] and thus I[X; Z]. For the plots in figure 6, we retrained the decoder on the test set to obtain a tighter bound on H[Y | Z] (while keeping the encoder fixed), and then sampled the latent using the test set to estimate the trajectories. We only did this for the CIFAR-10 model without dropout. For our ablations, we did not retrain the decoder and thus only present plots on the test and training sets, respectively. At this point, it is important to recall that the Decoder Uncertainty is also the negative log-likelihood (when training with a single dropout sample), which provides a different perspective on the plots: it makes clear that we can see how much a model overfits by comparing the best and final epochs of a trajectory in the plot (marked by a circle and a square, respectively).

G.3.2 Ablation study

We perform an ablation study to determine whether injecting noise is necessary. Furthermore, we investigate the more interesting case of using a stochastic model as encoder, and whether we can use a stochastic model without injecting zero-entropy noise. We also investigate whether log Var[Z | Y] performs better when we increase the batch size, as we hypothesized that a batch size of 128 does not suffice: it leaves only ≈ 13 samples per class to approximate H[Z | Y]. Regularizing with E‖Z‖² still has a very weak effect. We hypothesize that, similar to the toy experiment depicted in figure 3, floating-point precision issues might eventually provide a natural noise source. This would change the effectiveness of γ and might require much higher values to observe regularization effects similar to those seen when we do inject zero-entropy noise. Figure G.4 shows trajectories for a stochastic encoder (as described above, with DropConnect/dropout rate 0.1); it overfits less than a deterministic one. Figure G.7 shows the effects of using higher dropout rates (DropConnect/dropout rates of 0.3/0.5); such a model overfits less than one with DropConnect/dropout rates of 0.1/0.1.

G.3.3 Comparing the two cross-entropies

With a single sample, the estimators of the Decoder Cross-Entropy H_θ[Y | Z] and the Prediction Cross-Entropy H_θ[Y | X] coincide. In section 3.2, we discuss the differences from a theoretical perspective. Here, we empirically evaluate the difference between optimizing the estimators for each of the two cross-entropy losses, for which we draw multiple dropout samples during training and inference. We examine models with continuous Z on MNIST and CIFAR-10 (Lecun et al., 1998; Krizhevsky et al., 2009). Specifically, we use a standard dropout CNN as an encoder for MNIST, and a modified ResNet18 to which we add DropConnect in each layer for CIFAR-10. We use K = 100 dimensions for the continuous latent Z in the last fully-connected layer, and use a linear decoder to obtain the final 10-dimensional output of class logits. For MNIST, we compute the cross-entropies using 64 dropout samples; for CIFAR-10, we use 8.
For the purpose of this examination of training behavior, it is not necessary to achieve SOTA accuracy: our models obtain 99.2% accuracy on MNIST and 93.6% on CIFAR-10. 
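In code, the E‖Z‖² surrogate objective used in these experiments amounts to a one-line addition to the classification loss. A minimal PyTorch sketch, where the encoder, decoder, and data below are stand-ins rather than the paper's actual models:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def surrogate_ib_loss(encoder, decoder, x, y, gamma):
    """min_theta H_theta[Y | Z] + gamma * E||Z||^2 with zero-entropy noise.

    The noise std sigma = (2*pi*e)^(-1/2) gives each noise dimension
    differential entropy 0, so the bounds from section F apply.
    """
    sigma = (2 * math.pi * math.e) ** -0.5
    z = encoder(x)
    z = z + sigma * torch.randn_like(z)   # zero-entropy noise injection
    ce = F.cross_entropy(decoder(z), y)   # single-sample Decoder Cross-Entropy
    reg = z.pow(2).sum(dim=1).mean()      # E||Z||^2 surrogate for H[Z]
    return ce + gamma * reg

# Toy usage on random data; a real run would use the paper's architectures.
enc, dec = nn.Linear(10, 4), nn.Linear(4, 3)
x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
loss = surrogate_ib_loss(enc, dec, x, y, gamma=1e-2)
loss.backward()
```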

G.4 Differential entropies and noise

We demonstrate the importance of adding noise to continuous latents by constructing a pathological sequence of parameters which attains monotonically improving and unbounded regularized objective values (via H[Z]) while always computing the same function. We use MNIST with a standard dropout CNN as encoder, with K = 128 continuous dimensions in Z, and a K × 10 linear layer as decoder. After every training epoch, we decrease the entropy of the latent by normalizing and then scaling down the latent; we multiply the weights of the decoder accordingly so as not to change the overall function. As can be seen in figure 3, without noise, the entropy can decrease freely during training without any change in error rate until it is affected by floating-point issues; whereas when adding zero-entropy noise, the error rate starts to increase gradually and meaningfully as the entropy approaches zero. We conclude that entropy regularization is meaningful only when noise is added to the latent.

[Figure G.15 caption: Comparison with DVIB (Alemi et al., 2016). 5 trials with 95% confidence interval shown. Even though we could not reproduce the baseline reported in that paper, the simpler surrogate objectives reach at least a similar test error as reported there. We also see that DVIB behaves similarly to E‖Z‖², but shifted by a factor of 2 in γ, as predicted by section F. See figure 2 for details.]
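The weight-rescaling construction can be demonstrated in a few lines of numpy (shapes and the scale factor are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)

# Latent batch z and linear decoder W; scaling z down and W up by the same
# factor s leaves the computed function unchanged...
z = rng.standard_normal((128, 16))
w = rng.standard_normal((16, 10))
s = 1000.0
logits = z @ w
logits_scaled = (z / s) @ (w * s)
assert np.allclose(logits, logits_scaled)

# ...while shifting the differential entropy of the latent by -k*ln(s), so
# without noise H[Z] can be pushed towards -inf "for free". With additive
# noise of fixed scale, the trick no longer preserves the function:
sigma = (2 * np.pi * np.e) ** -0.5   # zero-entropy noise scale
noisy = z / s + sigma * rng.standard_normal(z.shape)
err = np.abs(noisy @ (w * s) - logits).mean()
assert err > 1.0   # the noise now dominates the scaled-down signal
```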



Footnotes:
- foot_0: We shorten these to "information quantities" from now on.
- This connection was assumed without proof by Achille and Soatto (2018a;b).
- Recently, Fischer and Alemi (2020) report results on CIFAR-10 and ImageNet; see section F.4.
- This notation is compatible with V-entropy, introduced by Xu et al. (2020).
- Conversely, MacKay (2003) notes that without upper-bounding the "power" E_{p(z)}‖Z‖², all information could be encoded in a single very large integer.
- Fisher (2019) uses the term "Residual Information" for this, which conflicts with Tishby and Zaslavsky (2015).
- Not depicted in figure 2.
- That is, it does not depend on θ.
- The argument for continuous variables is the same. We need to identify distributions up to "isentropic" bijections.
- foot_9: As we only take into account the empirical distribution p(x, y) available for training, the following derivation refers only to the empirical risk, and not to the expected risk of the estimator Ŷ.
- foot_10: p(y | z) depends on θ through p_θ(z | x): p(y | z) = Σ_x p(x, y) p_θ(z | x) / Σ_x p(x) p_θ(z | x).
- For categorical Z, p_θ(ŷ | z) is a stochastic matrix which sums to 1 along the Ŷ dimension.
- foot_12: We will not examine the original objective without Lagrange multipliers from Fisher (2019) here.
- Which is the reason why we showcase it in figure 1 and in equation (1).



Figure 1: Information plane plot of the training trajectories of ResNet18 models with our surrogate objective min_θ H_θ[Y | Z] + γ E‖Z‖² on Imagenette. Color shows γ; transparency, the training epoch. Compression (Encoding Entropy ↓) trades off with test performance (Residual Information ↓). See section 4.

Mouse I-diagram. The corresponding I-diagram for X, Y, and Z is depicted in figure 2. As some of the quantities have been labelled before, we try to follow conventions and come up with consistent names otherwise. Section B.1 provides intuitions for these quantities, and section B.2 lists all definitions and equivalences explicitly. For categorical Z, all the quantities in the diagram are non-negative, which allows us to read off inequalities from the diagram: only I[X; Y; Z] could be negative, but as Y and Z are independent given X, we have I[Y; Z | X] = 0 and I[X; Y; Z] = I[Y; Z] − I[Y; Z | X] = I[Y; Z] ≥ 0. Section 3.3 investigates how to preserve these inequalities for continuous Z.

3 Surrogate IB & DIB objectives

3.1 IB Objectives

Tishby et al. (2000) introduce the IB objective as a relaxation of a constrained optimization problem: minimize the mutual information between the input X and its latent representation Z while still accurately predicting Y from Z. An analogous objective which yields deterministic Z, the Deterministic Information Bottleneck (DIB), was proposed by Strouse and Schwab (2017). Letting β be a Lagrange multiplier, we arrive at the IB and DIB objectives: min I[X; Z] − βI[Y; Z] for IB, and min H[Z] − βI[Y; Z] for DIB.

and H[Y | X]. See section D.2 for a derivation. Hence, by bounding D_KL(p(y | z) || p_θ(ŷ | z)), we can obtain a bound on the training error in terms of H[Y | Z]. We examine one way of doing so, using optimal decoders p_θ(ŷ | z) := p(Y = ŷ | z) for the case of categorical Z, in section E. Alemi et al. (2016) use the Decoder Cross-Entropy bound in equation (

H[ε]; and by using zero-entropy noise ε ∼ N(0, (1/(2πe)) I_k) specifically, we obtain H[Z] ≥ H[ε] = 0.

Proposition 3. After adding zero-entropy noise, the inequality I[X; Z | Y] ≤ H[Z | Y] ≤ H[Z] also holds for continuous Z, and we can minimize I[X; Z | Y] in the IB objective by minimizing H[Z | Y] or H[Z], similarly to the DIB objective.
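The "zero-entropy" choice of variance follows directly from the closed form of the Gaussian differential entropy; a minimal numeric check (function name is ours, for illustration):

```python
import math

# Differential entropy of an isotropic Gaussian N(0, sigma^2 I_k):
#   h = (k / 2) * ln(2 * pi * e * sigma^2).
def gaussian_entropy(sigma2: float, k: int = 1) -> float:
    return 0.5 * k * math.log(2 * math.pi * math.e * sigma2)

# "Zero-entropy noise": choosing sigma^2 = 1 / (2 * pi * e) makes h = 0,
# giving differential entropies a canonical point of origin.
sigma2_zero = 1.0 / (2 * math.pi * math.e)
print(gaussian_entropy(sigma2_zero))        # ≈ 0.0
print(gaussian_entropy(1.0))                # ≈ 1.4189 for a unit Gaussian
```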

Figure 6: Information plane plot of the training trajectories of ResNet18 models with the E‖Z‖² surrogate objective or L2 weight-decay on CIFAR-10. Color shows γ; transparency shows the training epoch. Compression (Preserved Information ↓) trades off with performance (Residual Information ↓). See section 4. While the trajectories are similar, robustness is very different; see figure 5.

Figure G.1 shows the difference between the regularizers more clearly, and figure G.3 shows the training trajectories for all three regularizers. More details in section G.3.1.

and we could remove it from the diagram without loss of generality. Moreover, the atomic quantity I[X; Y | Z] = µ*(X ∩ Y − Z) would then be 0 and could be removed from the diagram as well.

quantifies the uncertainty about the latent Z given the label Y. We can imagine training a new model to predict Z given Y; minimizing H[Z | Y] to 0 would allow for a deterministic mapping from the label to the latent.

) and see that maximizing the Preserved Relevant Information I[Y; Z] is equivalent to minimizing the Residual Information I[X; Y | Z], while minimizing the Preserved Information I[X; Z] at the same time also minimizes the Redundant Information I[X; Z | Y], as I[X; Y] is constant for the given dataset⁸. Moreover, the Preserved Relevant Information I[Y; Z] is upper-bounded by the Relevant Information I[X; Y], so to capture all relevant information in our latent, we want I[Y; Z] = I[X; Y]. Using the diagram, we can also see that minimizing the Residual Information is the same as minimizing the Decoder Uncertainty H[Y | Z]: I[X; Y | Z] = H[Y | Z] − H[Y | X]. Ideally, we also want to minimize the Encoding Uncertainty H[Z | X] to find the most deterministic latent encoding Z. Minimizing the Encoding Uncertainty and the Redundant Information I[X; Z | Y] together is the same as minimizing the Reverse Decoder Uncertainty H[Z | Y]. All in all, we want to minimize both the Decoder Uncertainty H[Y | Z] and the Reverse Decoder Uncertainty H[Z | Y].

C.2 IB objectives

"The Information Bottleneck Method" (IB). Tishby et al. (2000) introduce I(X; X̃) − β I(X̃; Y) as the optimization objective for the Information Bottleneck. We can relate this to our notation by renaming X̃ = Z, such that the objective becomes "min I[X; Z] − β I[Y; Z]". The IB objective minimizes the Preserved Information I[X; Z] and trades it off against maximizing the Preserved Relevant Information I[Y; Z]. Tishby and Zaslavsky (2015) mention that the IB objective is equivalent to minimizing I[X; Z] + β I[X; Y | Z]; see our discussion above. Tishby et al. (2000) provide an optimal algorithm for the tabular case, when X, Y and Z are all categorical. This has spawned additional research to optimize the objective for other cases and specifically for DNNs.

"Deterministic Information Bottleneck" (DIB). Strouse and Schwab (2017) introduce the objective "min H[Z] − β I[Y; Z]".
Compared to the IB objective, this also minimizes H[Z | X] and encourages determinism. Vice versa, for deterministic encoders, H[Z | X] = 0, and their objective matches the IB objective. Like Tishby et al. (2000), they provide an algorithm for the tabular case. To do so, they examine an analytical solution for their objective, as it is otherwise unbounded: H[Z | X] → −∞ for the optimal solution. As we discuss in section 3.3, it does not easily translate to a continuous latent representation.

"Deep Variational Information Bottleneck" (DVIB). Alemi et al. (2016) rewrite the bottleneck as the maximization problem "max I[Y; Z] − β I[X; Z]" and swap the β parameter: their β would be 1/β in the IB objective above, which emphasizes that I[Y; Z] is important for performance while I[X; Z] acts as a regularizer.
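The tabular algorithm alluded to above alternates the self-consistent IB updates. The following is a hedged sketch of that iteration for categorical X, Y, Z (shapes and names are illustrative, not the authors' code):

```python
import numpy as np

# Tabular IB iteration in the spirit of Tishby et al. (2000):
#   p(z | x) ∝ p(z) exp(-beta * KL[p(y | x) || p(y | z)]),
#   p(z)     = sum_x p(x) p(z | x),
#   p(y | z) = sum_x p(y | x) p(x | z).
rng = np.random.default_rng(0)
nx, ny, nz, beta = 6, 3, 4, 5.0
p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()
p_x = p_xy.sum(1)
p_y = p_xy.sum(0)
p_y_x = p_xy / p_x[:, None]                       # p(y | x)

p_z_x = rng.random((nx, nz)); p_z_x /= p_z_x.sum(1, keepdims=True)
for _ in range(200):
    p_z = p_x @ p_z_x                             # marginal p(z)
    p_y_z = (p_xy.T @ p_z_x) / p_z                # p(y | z), shape (ny, nz)
    # KL[p(y | x) || p(y | z)] for every (x, z) pair
    kl = np.einsum('xy,xyz->xz', p_y_x,
                   np.log(p_y_x[:, :, None] / p_y_z[None, :, :]))
    p_z_x = p_z[None, :] * np.exp(-beta * kl)
    p_z_x /= p_z_x.sum(1, keepdims=True)

# Report the two sides of the bottleneck trade-off.
p_z = p_x @ p_z_x
p_y_z = (p_xy.T @ p_z_x) / p_z
joint_xz = p_x[:, None] * p_z_x
I_xz = np.sum(joint_xz * np.log(joint_xz / (p_x[:, None] * p_z[None, :])))
joint_yz = p_y_z * p_z[None, :]
I_yz = np.sum(joint_yz * np.log(joint_yz / (p_y[:, None] * p_z[None, :])))
print(I_xz, I_yz)
```

By the data processing inequality, the resulting I[Y; Z] never exceeds I[X; Z], which makes for a convenient sanity check of the iteration.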

"Conditional Entropy Bottleneck" (CEB). In a preprint, Fisher (2019) introduces the Conditional Entropy Bottleneck as "min I[X; Z | Y] − I[Y; Z]". We can rewrite the objective as I[X; Z | Y] + I[X; Y | Z] − I[X; Y], using equations (30) and (31). The last term is constant for the dataset and can thus be dropped. Likewise, the IB objective can be rewritten as minimizing I[X; Z | Y] + (β − 1) I[X; Y | Z]. The two match for β = 2. Fisher (2019) provides experimental results that compare favorably to Alemi et al. (2016), possibly due to additional flexibility: Fisher (2019) does not constrain p(z) to be a unit Gaussian and employs variational approximations for all terms. We relate CEB to the Entropy Distance Metric in section C.4.

"Conditional Entropy Bottleneck" (2020). In a substantial revision of the preprint, Fischer (2020) changes the Conditional Entropy Bottleneck to include a Lagrange multiplier: "min I[X; Z | Y] − γ I[Y; Z]". Their VCEB objective can be written more concisely as min H_θ[Y | Z] + γ (H_θ[Z | Y] − H_θ[Z | X]), where, without writing down the probabilistic model, we introduce variational approximations for the Reverse Decoder Uncertainty and the Encoding Uncertainty. They are the first to report results on CIFAR-10. It is not clear how they parameterize the model they use for CIFAR-10. They use one Gaussian per class to model H_θ[Z | Y].

"CEB Improves Model Robustness"
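The correspondence between CEB and IB at β = 2 can be spelled out term by term using the Markov chain Y − X − Z (so I[Y; Z | X] = 0 and hence I[X; Y; Z] = I[Y; Z]):

```latex
\begin{align}
\underbrace{I[X; Z \mid Y] - I[Y; Z]}_{\text{CEB}}
  &= \bigl(I[X; Z] - I[X; Y; Z]\bigr) - I[Y; Z] \\
  &= I[X; Z] - 2\, I[Y; Z],
\end{align}
```

which is exactly the IB objective min I[X; Z] − β I[Y; Z] with β = 2.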

and rewrite equation (35) as EDM(Y, Z) = 2 H[Y | Z] + H[Z] − H[Y]. For optimization purposes, we can drop constant terms and rearrange:

arg min EDM(Y, Z) = arg min H[Y | Z] + ½ H[Z].

C.4.1 Rewriting IB and DIB using the Entropy Distance Metric

For β ≥ 1, we can rewrite equations (IB) and (DIB) as:

arg min EDM(Y, Z) + γ (H[Y | Z] − H[Z | Y]) + (γ − 1) H[Z | X]   (36)

for IB, and

arg min EDM(Y, Z) + γ (H[Y | Z] − H[Z | Y])   (37)

for DIB, where we replace β with γ = 1 − 2/β ∈ [−1, 1], which allows for a linear mix between H[Y | Z] and H[Z | Y].

we note that one can use the approximation e^x ≈ 1 + x to obtain:

p("Ŷ is wrong") ⪅ H_θ[Y | Z].   (38)

Finally, we split the Decoder Cross-Entropy into the Decoder Uncertainty and a Kullback-Leibler divergence: H_θ[Y | Z] = H[Y | Z] + D_KL(p(y | z) || p_θ(Ŷ = y | z)). If we upper-bound D_KL(p(y | z) || p_θ(Ŷ = y | z)), minimizing the Decoder Uncertainty H[Y | Z] becomes a sensible minimization objective as it reduces the probability of misclassification. We can similarly show that the training error is bounded by the Prediction Cross-Entropy H_θ[Y | X].
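The cross-entropy split above holds pointwise for any pair of distributions; a minimal numeric check with made-up distributions (p and q are illustrative, not from the experiments):

```python
import math

# For a single latent z:
#   cross-entropy H_theta[Y | z] = H[Y | z] + KL(p(y | z) || p_theta(y | z)).
p = [0.7, 0.2, 0.1]            # true p(y | z)
q = [0.5, 0.3, 0.2]            # decoder p_theta(y | z)

cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
entropy = -sum(pi * math.log(pi) for pi in p)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The decomposition holds exactly, and KL >= 0 makes the
# cross-entropy an upper bound on the entropy.
print(cross_entropy, entropy, kl)
```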

and reordering of E_{p(x,y,z)}[ p(y | z) (d/dθ) ln p_θ(z | x) ], we obtain the result. The same holds for the Reverse Decoder Uncertainty H[Z | Y] and for the other quantities, as can easily be verified. If we minimize H[Y | Z] directly, we can compute p(y | z) after every training epoch and fix p_θ(ŷ | z) := p(Y = ŷ | z) to create the discriminative model p_θ(ŷ | x). This is a different perspective on the self-consistent equations from Tishby et al. (2000); Gondek and Hofmann (2003).
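The "optimal decoder" construction for categorical Z amounts to estimating p(y | z) from empirical counts after each epoch. A hedged sketch (labels and latent assignments below are random stand-ins for real training data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classes, n_latents = 1000, 10, 100
y = rng.integers(0, n_classes, size=n)            # training labels
z = rng.integers(0, n_latents, size=n)            # hard latent assignments

# Empirical p(Y = y_hat | z) from co-occurrence counts; a tiny epsilon
# keeps rows for empty clusters well-defined.
counts = np.zeros((n_latents, n_classes))
np.add.at(counts, (z, y), 1.0)
p_y_given_z = (counts + 1e-12) / (counts + 1e-12).sum(axis=1, keepdims=True)

# Fixing the decoder to this table turns the encoder into a
# discriminative model: predict via argmax of p(y | z).
y_hat = p_y_given_z[z].argmax(axis=1)
```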

Figure E.1: Decoder Uncertainty, Decoder Cross-Entropy and Prediction Cross-Entropy for Permutation-MNIST and CIFAR-10 with a categorical Z. C = 100 categories are used for Z. We optimize with the different minimization objectives in turn and plot all metrics. D_KL(p(y | z) || p_θ(Ŷ = y | z)) is small when training with H_θ[Y | Z] or H[Y | Z]. When training with H_θ[Y | X] on CIFAR-10, D_KL(p(y | z) || p_θ(Ŷ = y | z)) remains quite large. We run 8 trials each and plot the median with confidence bounds (25% and 75% quartiles). See section E.1 for more details.

Figure E.1 shows the three metrics as we train with each of them in turn. Our results do not achieve SOTA accuracy on the test set: we impose a harder optimization problem, as Z is categorical, so we essentially solve a hard-clustering problem first and then map the clusters to Ŷ. Results are provided for the training set in order to compare with the optimal decoder. As predicted, the Decoder Cross-Entropy upper-bounds both the Decoder Uncertainty H[Y | Z] and the Prediction Cross-Entropy in all cases. Likewise, the gap between H_θ[Y | Z] and H[Y | Z] is tiny when we minimize H_θ[Y | Z]. On the other hand, minimizing the Prediction Cross-Entropy can lead to large gaps between H_θ[Y | Z] and H[Y | Z], as can be seen for CIFAR-10. Very interestingly, on MNIST the Decoder Cross-Entropy provides the better training objective, whereas on CIFAR-10 the Prediction Cross-Entropy reaches lower values. The Decoder Uncertainty does not train very well on CIFAR-10, and the Prediction Cross-Entropy does not train well on Permutation-MNIST at all. We suspect DNN architectures in the literature have evolved to train well with cross-entropies, but we are surprised by the heterogeneity of the results for the two datasets and models.

as an upper bound to minimize. The gradients remain the same. This also points to the nature of differential entropies as lacking a proper point of origin by themselves; we choose one by fixing H[ε]. Just like other literature usually only considers mutual information as meaningful, we consider H[Z | X] − H[ε] as more meaningful than H[Z | X]. However, we can conveniently side-step this discussion by picking canonical noise as the point of origin, in the form of zero-entropy noise with H[ε] = 0.

Alemi et al. (2016) model p_θ(z | x) explicitly as a multivariate Gaussian with parameterized mean and parameterized diagonal covariance in their encoder and regularize it to be close to N(0, I_k) by minimizing the Kullback-Leibler divergence D_KL(p_θ(z | x) || N(0, I_k)) alongside the cross-entropy: min H_θ[Y | Z] + γ D_KL(p_θ(z | x) || r(z)), as detailed in section C.2.
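For a diagonal Gaussian encoder and a standard-normal marginal, this KL term has a well-known closed form; a minimal sketch (function name and the example μ/σ values are ours, for illustration):

```python
import math

# KL( N(mu, diag(sigma^2)) || N(0, I) )
#   = 0.5 * sum_i ( mu_i^2 + sigma_i^2 - ln sigma_i^2 - 1 ).
def kl_to_standard_normal(mu, sigma):
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))

mu, sigma = [0.3, -0.1], [0.9, 1.2]
print(kl_to_standard_normal(mu, sigma))      # small positive number
print(kl_to_standard_normal([0, 0], [1, 1])) # 0.0: encoder equals the prior
```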

Detailed Comparison to CEB, VCEB & DVIB. In Fisher (2019), the introduced CEB objective "min I[X; Z | Y] − γ I[Y; Z]" is rewritten to "min γ H[Y | Z] + H[Z | Y] − H[Z | X]", similar to the IB objective in proposition 1 in section 3.1. However, these atomic quantities are not separately examined in detail.

Soft clustering by entropy minimization with Gaussian noise. Consider the problem of minimizing H[Z | Y] and H[Y | Z] in the setting where Z = f_θ(X) + ε with ε ∼ N(0, σ² I), i.e. the embedding Z is obtained by adding Gaussian noise to a deterministic function of the input. Let the training set be enumerated x_1, …, x_n, with

Figure G.1: Information plane plot of the latent Z similar to Tishby and Zaslavsky (2015) but using a ResNet18 model on CIFAR-10 with the different regularizers from section 3.3 (without dropout, but with zero-entropy noise). The dots are colored by γ. See section 4 for more details.

Figure G.2: Entropy estimates while training with different γ and different surrogate regularizers on CIFAR-10 with a ResNet18 model. Entropies are estimated on training data based on Kraskov et al. (2004). Qualitatively, all three regularizers push H[Z] and H[Z | Y] down. H[Z | Y] is not shown here because it always stays very close to H[Z]. E‖Z‖² tends to regularize entropies more strongly for small γ. See section 4 for more details.

To estimate the Residual Information I[X; Y | Z], we similarly note that I[X; Y | Z] + H[Y | X] = H[Y | Z], so it suffices to estimate H[Y | Z] since H[Y | X] is constant for the dataset. Instead of estimating the entropy using Kraskov et al. (2004), we can use the Decoder Cross-Entropy H_θ[Y | Z], which provides a tighter bound as long as we also minimize H_θ[Y | Z] as part of the training objective. When we use stochastic models as encoders, we cannot easily compute I[X; Z] anymore. In the ablation study in the next section, we thus change the X axis accordingly. Similarly, when we look at the trajectories on the test set instead of the training set, for example in figure G.4, we change the Y axis to show the Decoder Cross-Entropy H_θ[Y | Z]. It is still an upper bound, but we do not minimize it directly anymore.
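The entropy estimates above rest on nearest-neighbour estimators. As a hedged illustration of the idea (not the paper's implementation), here is a 1-D Kozachenko–Leonenko-style estimator of the kind underlying Kraskov et al. (2004):

```python
import math
import random

# 1-D nearest-neighbour differential-entropy estimate:
#   H_hat = psi(n) - psi(1) + ln(2) + (1/n) * sum_i ln(eps_i),
# where eps_i is the distance from sample i to its nearest neighbour,
# psi(1) = -gamma (Euler-Mascheroni) and psi(n) ≈ ln(n) - 1/(2n).
def entropy_knn_1d(samples):
    n = len(samples)
    xs = sorted(samples)
    eps = [xs[1] - xs[0]]                      # boundary points: one neighbour
    for i in range(1, n - 1):
        eps.append(min(xs[i] - xs[i - 1], xs[i + 1] - xs[i]))
    eps.append(xs[-1] - xs[-2])
    euler_gamma = 0.5772156649015329
    psi_n = math.log(n) - 1.0 / (2 * n)
    return psi_n + euler_gamma + math.log(2) + sum(math.log(e) for e in eps) / n

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(5000)]
true_h = 0.5 * math.log(2 * math.pi * math.e)   # ≈ 1.4189 for N(0, 1)
print(entropy_knn_1d(data), true_h)
```

In high dimensions the Kraskov estimator generalizes this with k-th-neighbour distances and unit-ball volumes; the 1-D case shows the mechanics.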

Figure G.3 shows a larger version of figure 6 for all three regularizers, as well as training trajectories on the test set. As described in the previous section, this allows us to validate that the regularizers prevent overfitting on the training set: with increasing γ, the model overfits less.

Figures G.6 and G.5 show that injecting noise is necessary independently of whether we use dropout or not. Regularizing with E‖Z‖² still has a very weak effect. We hypothesize that, similar to

Figure G.11 shows the training error probability as well as the value of each cross-entropy loss for models trained either with the Decoder Cross-Entropy or the Prediction Cross-Entropy. The Decoder Cross-Entropy H θ [Y | Z] outperforms Prediction Cross-Entropy H θ [Y | X] as a training objective: the training error probability and both cross-entropies are lower when minimizing H θ [Y | Z] compared to minimizing H θ [Y | X]. We compare only the training, rather than the test, losses of the models to isolate the effect of each loss term on training performance; we leave the prevention of overfitting to the regularization terms considered later. Recently, Dusenberry et al. (2020) also observed empirically that the Decoder Cross-Entropy H θ [Y | Z] as an objective is both easier to optimize and provides better generalization performance.

Figure G.3: Without dropout but with zero-entropy noise: Information plane plot of training trajectories for ResNet18 models on CIFAR-10 and different regularizers. The trajectories are colored by their respective γ; their transparency changes by epoch. Compression (Preserved Information ↓) trades off with performance (Residual Information ↓). See section 4. The circle marks the final epoch of a trajectory. The square marks the best epoch (lowest Residual Information).

Figure G.6: Without dropout and without zero-entropy noise: Information plane plot of training trajectories for ResNet18 models on CIFAR-10 and different regularizers. The trajectories are colored by their respective γ; their transparency changes by epoch. Compression (Preserved Information ↓) trades off with performance (Residual Information ↓). See section 4. The circle marks the final epoch of a trajectory. The square marks the best epoch (lowest Residual Information).

Figure G.8: Without dropout but with zero-entropy noise: Information plane plot of training trajectories for ResNet18 models on CIFAR-10 with the log Var[Z | Y] regularizer and batch sizes 128 and 256. The trajectories are colored by their respective γ; their transparency changes by epoch. Compression (Preserved Information ↓) trades off with performance (Residual Information ↓). See section 4. The circle marks the final epoch of a trajectory. The square marks the best epoch (lowest Residual Information).

Figure G.9: Information plane plot of the latent Z similar to Tishby and Zaslavsky (2015) but using a ResNet18 model on CIFAR-10 with the different regularizers from section 3.3 (with dropout and zero-entropy noise). The dots are colored by γ. See section 4 for more details.

[Figure: Decoder Cross-Entropy H_θ[Y | Z] vs. Preserved Information I[X; Z] for the log Var[Z | Y] regularizer; without dropout and without zero-entropy noise.]

Figure G.10: Information quantities for different γ at the end of training for ResNet18 models on CIFAR-10 with the log Var[Z | Y] regularizer and batch sizes 128 and 256. Compression (Preserved Information ↓) trades off with performance (Residual Information ↓). See section 4.

Figure G.11: Training error probability, Decoder Cross-Entropy H_θ[Y | Z] and Prediction Cross-Entropy H_θ[Y | X] with continuous Z. K = 100 dimensions are used for Z, and we use dropout to obtain stochastic models. Minimizing H_θ[Y | Z] (solid) leads to smaller cross-entropies and a lower training error probability than minimizing H_θ[Y | X] (dashed). This suggests a better data fit, which is what we desire from a loss term. We run 8 trials each and plot the median with confidence bounds (25% and 75% quartiles). See sections 3.2 and G.3.3 for more details.

Figure H.1: Mickey Mouse I-diagram.

). The derivation can be found in section C.3.

We can similarly write our canonical DIB objective H[Y | Z] + β H[Z] as the constrained objective min H[Y | Z] s.t. H[Z] ≤ C, and use the above statement to find the approximate form min H[Y | Z] s.t. E‖Z‖² ≤ C′. Reintroducing a Lagrange multiplier recovers our regularized E‖Z‖² objective:
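As a concrete illustration, the resulting surrogate min_θ H_θ[Y | Z] + γ E‖Z‖² reduces to an ordinary per-batch loss: softmax cross-entropy plus a squared-norm penalty on the latent. The sketch below uses made-up shapes and names, not the paper's ResNet setup:

```python
import numpy as np

def surrogate_loss(z, logits, labels, gamma):
    # H_theta[Y | Z]: average negative log-likelihood of the true labels,
    # computed with a numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # gamma * E[||Z||^2]: the "power" penalty standing in for H[Z].
    power = np.square(z).sum(axis=1).mean()
    return ce + gamma * power

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 16))                  # batch of latents
logits = rng.normal(size=(32, 10))             # decoder outputs
labels = rng.integers(0, 10, size=32)
print(surrogate_loss(z, logits, labels, gamma=0.01))
```

Setting γ = 0 recovers plain cross-entropy training; increasing γ trades accuracy for compression, as in figure 1.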

The plots in figure G.10 show the effects of different γ with different regularizers more clearly. On both the training and the test set, one can clearly see the effects of regularization. Overall, log Var[Z | Y] performs worse as a regularizer. In figure G.8, we compare the effect of doubling the batch size. Indeed, log Var[Z | Y] performs better with a higher batch size and looks closer to log Var[Z].

G.3.3 Comparison between Decoder Cross-Entropy and Prediction Cross-Entropy

When training deterministic models or dropout models with a single sample (as one usually does), the estimators for both the Decoder Cross-Entropy H_θ[Y | Z] and the Prediction Cross-Entropy H

G.5 Comparison between DVIB and surrogate objectives on Permutation-MNIST

Comparing DVIB and our surrogate objectives is not straightforward because DVIB uses a VAE-like model that explicitly parameterizes the mean and standard deviation of the latent, whereas the stochastic models we focus on in section 3.2 and beyond are implicit, using dropout.

For this comparison, we use the same architecture and optimization strategy for DVIB as described in Alemi et al. (2016): the encoder is a ReLU-MLP of the form 784 - 1024 - 1024 - 2K with K = 256 latent dimensions that outputs mean and standard deviation explicitly and separately. For the standard deviation, we use a softplus transform with a bias of -5. We use Polyak averaging with a decay constant of 0.999 (Polyak and Juditsky, 1992). We train the model for 200 epochs with Adam with learning rate 10⁻⁴, β₁ = 0.5, β₂ = 0.999 (Kingma and Ba, 2014) and decay the learning rate by 0.97 every 2 epochs. The marginal is fixed to a unit Gaussian around the origin. We use a softmax layer as decoder. We use 12 latent samples during training and at test time.

For our surrogate objectives, we use a similar ReLU-MLP of the form 784 - 1024 - 1024 - K with K = 256 latent dimensions and dropout layers of rate 0.3 after the first and second layer. We also use 12 dropout samples during training and at test time. We train for 75 epochs with Adam and learning rate 0.5 × 10⁻⁴. We halve the learning rate every time the loss does not decrease for 13 epochs.

We run 5 trials for each experiment. We were not able to reproduce the baseline error of 1.13% for β = 10⁻³ from Alemi et al. (2016). We show a comparison in figure G.15. Our methods do reach an error of 1.13% overall though, so the simpler surrogate objectives perform as well as or better than DVIB.

From section F.3, we know that DVIB's β would have to be twice the γ from our section 3.3. We can see this correspondence in the plot. This also implies that DVIB's β is not related to the IB objective's β from section 3.
This makes sense as DVIB arbitrarily fixes the marginal to be a unit Gaussian. 

