Unpacking Information Bottlenecks: Surrogate Objectives for Deep Learning

Abstract

The Information Bottleneck principle offers both a mechanism to explain how deep neural networks train and generalize and a regularized objective with which to train models. However, multiple competing objectives have been proposed in the literature, and the information-theoretic quantities in these objectives are difficult to compute for large deep neural networks, which in turn limits their use as training objectives. In this work, we review these quantities and compare and unify previously proposed objectives, which allows us to develop surrogate objectives more friendly to optimization without relying on cumbersome tools such as density estimation. We find that these surrogate objectives allow us to apply the information bottleneck to modern neural network architectures. We demonstrate our insights on MNIST, CIFAR-10 and Imagenette with modern DNN architectures (ResNets).

1. Introduction

The Information Bottleneck (IB) principle, introduced by Tishby et al. (2000), proposes that training and generalization in deep neural networks (DNNs) can be explained by information-theoretic principles (Tishby and Zaslavsky, 2015; Shwartz-Ziv and Tishby, 2017; Achille and Soatto, 2018a). This is attractive because the success of DNNs remains largely unexplained by tools from computational learning theory (Zhang et al., 2016; Bengio et al., 2009). The IB principle suggests that learning consists of two competing objectives: maximizing the mutual information between the latent representation and the label to promote accuracy, while at the same time minimizing the mutual information between the latent representation and the input to promote generalization. Following this principle, many variations of IB objectives have been proposed (Alemi et al., 2016; Strouse and Schwab, 2017; Fischer and Alemi, 2020; Fischer, 2020; Fisher, 2019; Gondek and Hofmann, 2003; Achille and Soatto, 2018a), which, in supervised learning, have been demonstrated to benefit robustness to adversarial attacks (Alemi et al., 2016; Fisher, 2019) as well as generalization and regularization against overfitting to random labels (Fisher, 2019). Whether the benefits of training with IB objectives are due to the IB principle or to some other, unrelated mechanism remains an open question (Saxe et al., 2019; Amjad and Geiger, 2019; Tschannen et al., 2019): although recent work has also tied the principle to successful results in both unsupervised and self-supervised learning (Oord et al., 2018; Belghazi et al., 2018; Zhang et al., 2018; Burgess et al., 2018, among others), our understanding of how IB objectives affect representation learning remains unclear. Critical to studying this question is the computation of the information-theoretic quantities[1] used.
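For reference, the two competing objectives described above are usually combined into the classical IB Lagrangian of Tishby et al. (2000), optimized over the stochastic encoder (the trade-off coefficient β below is the standard one from that formulation, not a quantity specific to this paper):

```latex
% Information Bottleneck Lagrangian: compress X into Z while
% retaining information about the label Y; beta trades off
% compression (first term) against prediction (second term).
\min_{p(z \mid x)} \; I[X; Z] \;-\; \beta \, I[Z; Y]
```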
While progress has been made in developing mutual information estimators for DNNs (Poole et al., 2019; Belghazi et al., 2018; Noshad et al., 2019; McAllester and Stratos, 2018; Kraskov et al., 2004), current methods still face many limitations for high-dimensional random variables (McAllester and Stratos, 2018) and rely on complex estimators or generative models. This presents a challenge to training with IB objectives. In this paper, we analyze information quantities and relate them to surrogate objectives for the IB principle that are more friendly to optimization, showing that complex or intractable IB objectives can be replaced with simple, easy-to-compute surrogates that produce similar performance. We revisit the variational upper bound of Alemi et al. (2016) in their variational IB approximation and demonstrate that this upper bound is equal to the commonly-used cross-entropy loss[2] under dropout regularization. Section 3.3 examines pathologies of differential entropies that hinder optimization and proposes adding Gaussian noise to force differential entropies to become non-negative, which leads to new surrogate terms that optimize the Reverse Decoder Uncertainty. Altogether, this leads to simple and tractable surrogate IB objectives such as the following, which uses dropout, adds Gaussian noise ε over the feature vectors f_θ(x; η), and applies an L2 penalty to the noisy feature vectors:

min_θ E_{x,y ∼ p(x,y), ε ∼ N, η ∼ dropout mask} [ −log p(Ŷ = y | Z = f_θ(x; η) + ε) + γ ‖f_θ(x; η) + ε‖²_2 ].  (1)

Section 4 describes experiments that validate our insights qualitatively and quantitatively on MNIST, CIFAR-10 and Imagenette, and shows that with objectives like the one in equation (1) we obtain information plane plots (as in figure 1) similar to those predicted by Tishby and Zaslavsky (2015). Our simple surrogate objectives thus induce the desired behavior of IB objectives while scaling to large, high-dimensional datasets. We present evaluations on CIFAR-10 and Imagenette images[3].
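To make the claim of tractability concrete, a single Monte-Carlo sample of an objective like equation (1) requires only a standard cross-entropy loss plus an L2 penalty on the noised features. The following NumPy sketch illustrates the computation for a toy linear feature extractor; it is a minimal illustration, and all names, shapes, and default hyperparameters here are assumptions for exposition, not from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def surrogate_ib_loss(x, y, W_feat, W_head, gamma=1e-3,
                      drop_p=0.5, noise_std=0.1):
    """One Monte-Carlo sample of a surrogate IB objective in the
    style of eq. (1): cross-entropy under a dropout mask eta and
    additive Gaussian noise eps, plus an L2 penalty on the noisy
    feature vectors z = f(x; eta) + eps."""
    f = x @ W_feat                                    # raw features f_theta(x)
    # Inverted dropout: zero out features with prob. drop_p, rescale the rest.
    eta = (rng.random(f.shape) > drop_p) / (1.0 - drop_p)
    eps = rng.normal(0.0, noise_std, size=f.shape)
    z = f * eta + eps                                 # noisy latent representation
    logits = z @ W_head
    # Numerically stable log-softmax; nll is -log p(Y = y | z).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(y)), y].mean()
    # L2 penalty over the noisy feature vectors (the compression term).
    l2 = gamma * (z ** 2).sum(axis=1).mean()
    return nll + l2
```

In practice one would minimize this with SGD over the parameters (here `W_feat`, `W_head`); the point of the surrogate is that every term is an ordinary, differentiable loss with no density estimation involved.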
Compared to existing work, we show that we can optimize IB objectives for well-known DNN architectures using standard optimizers, losses and simple regularizers, without needing complex estimators, generative models, or variational approximations. This will allow future research to make better use of IB objectives and study the IB principle more thoroughly.

2. Background

Information quantities & information diagrams. We denote entropy H[·], joint entropy H[·, ·], conditional entropy H[· | ·], mutual information I[·; ·], and Shannon's information content h(·) (Cover and Thomas, 2012; MacKay, 2003; Shannon, 1948). We will further require the Kullback-Leibler divergence D_KL(· || ·) and the cross-entropy H(· || ·); their definitions can be found in section A.1. We use differential entropies interchangeably with entropies: equalities between them are preserved in the differential setting, and inequalities are covered in section 3.3.

Figure 1: Information plane plot of the training trajectories of ResNet18 models with our surrogate objective min_θ H_θ[Y | Z] + γ E[‖Z‖²] on Imagenette. Color shows γ; transparency shows the training epoch. Compression (Encoding Entropy ↓) trades off with test performance (Residual Information ↓). See section 4.

[1] We shorten these to information quantities from now on.
[2] This connection was assumed without proof by Achille and Soatto (2018a;b).
[3] Recently, Fischer and Alemi (2020) report results on CIFAR-10 and ImageNet; see section F.4.
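To make the notation above concrete, all of these quantities can be computed directly for small discrete distributions. The following is a minimal NumPy sketch (the function names and the example joint table are illustrative, not from the paper):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H[p] in nats, ignoring zero-probability cells."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def cross_entropy(p, q):
    """Cross-entropy H(p || q) = H[p] + D_KL(p || q)."""
    return entropy(p) + kl(p, q)

def mutual_information(pxz):
    """I[X; Z] = H[X] + H[Z] - H[X, Z], from a joint distribution table
    with rows indexed by X and columns indexed by Z."""
    px, pz = pxz.sum(axis=1), pxz.sum(axis=0)
    return entropy(px) + entropy(pz) - entropy(pxz.ravel())
```

For example, a uniform joint table over two independent binary variables gives I[X; Z] = 0, while a diagonal joint table (X determines Z) gives I[X; Z] = log 2, matching the intuition that mutual information measures the dependence the IB objective trades off.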