INFORMATION PLANE ANALYSIS FOR DROPOUT NEURAL NETWORKS

Abstract

The information-theoretic framework promises to explain the predictive power of neural networks. In particular, information plane analysis, which measures mutual information (MI) between input and representation as well as between representation and output, should give rich insights into the training process. This approach, however, has been shown to depend strongly on the choice of MI estimator. The problem is amplified for deterministic networks, for which the MI between input and representation is infinite. The estimated values are then determined by the respective estimation approach and do not adequately represent the training process from an information-theoretic perspective. In this work, we show that dropout with continuously distributed noise ensures that MI is finite. We demonstrate in a range of experiments that this enables a meaningful information plane analysis for a class of dropout neural networks that is widely used in practice.

1. INTRODUCTION

The information bottleneck hypothesis for deep learning conjectures two phases of training feedforward neural networks (Shwartz-Ziv and Tishby, 2017): the fitting phase and the compression phase. The former corresponds to extracting information from the input into the learned representations and is characterized by an increase of mutual information (MI) between inputs and hidden representations. The latter corresponds to forgetting information that is not needed to predict the target, which is reflected in a decrease of the MI between learned representations and inputs, while MI between representations and targets stays the same or grows. These phases can be observed via an information plane (IP) analysis, i.e., by analyzing the development of MI between inputs and representations and between representations and targets during training (see Fig. 1 for an example). For an overview of information plane analysis we refer the reader to Geiger (2022). While elegant and plausible, the information bottleneck hypothesis is challenging to investigate empirically. As shown by Amjad and Geiger (2020, Th. 1), the MI between inputs and the representations learned by a deterministic neural network is infinite if the input distribution is continuous. The standard approach is therefore to assume the input distribution to be discrete (e.g., equivalent to the empirical distribution of the dataset S at hand) and to discretize the real-valued hidden representations by binning. This allows for non-trivial measurements, i.e., it avoids the MI always taking the maximum value of log(|S|) (Shwartz-Ziv and Tishby, 2017). In this discrete and deterministic setting the MI is theoretically equivalent to the Shannon entropy of the representation. Considering the effect of binning, however, the decrease of MI is essentially equivalent to geometric compression (Basirat et al., 2021). Moreover, the binning-based estimate depends strongly on the chosen bin size (Ross, 2014).
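The bin-size dependence mentioned above is easy to reproduce. The following minimal sketch (our own illustration, not code from the paper's repository) estimates I(X; Z) by binning a scalar representation of a deterministic map; since every input sample is distinct, the estimate reduces to the entropy of the binned representation and grows with the number of bins rather than converging to a finite value:

```python
import numpy as np

def binned_mi(z, n_bins):
    """Binning-based estimate of I(X; Z) for a deterministic layer.

    X is assumed uniform over the sample; since the map x -> z is
    deterministic and all samples are distinct, I(X; Z_binned) equals
    the Shannon entropy H(Z_binned) of the discretized representation.
    """
    edges = np.linspace(z.min(), z.max(), n_bins + 1)
    z_disc = np.digitize(z, edges[1:-1])           # bin index per sample
    _, counts = np.unique(z_disc, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))                 # entropy in bits

# Toy deterministic "layer": tanh of a 1-D input.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
z = np.tanh(x)

# The estimate tracks the discretization, not a finite underlying MI:
# it increases monotonically with the number of bins.
for n_bins in (4, 32, 256):
    print(n_bins, round(binned_mi(z, n_bins), 2))
```

The upper bound log(|S|) from the text appears here as well: with |S| = 10,000 samples, no binning can push the estimate above log2(10000) ≈ 13.3 bits.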
To instead work with continuous input distributions, Goldfeld et al. (2019) proposed to analyze stochastic neural networks in which noise is injected into the hidden representations; such noise injection, however, is rarely used in practice. In contrast, dropout, being a source of stochasticity, is heavily used in practice due to its effective regularizing properties. The core questions investigated in this work therefore are: i) Can we obtain accurate and meaningful MI estimates in neural networks with dropout noise? ii) And if so, do IPs built for dropout networks confirm the information bottleneck hypothesis? Our main contributions answer these questions and can be summarized as follows: We present a theoretical analysis showing that binary dropout does not prevent the MI from being infinite, due to the discrete nature of the noise. In contrast, we prove that dropout noise with any continuous distribution not only results in finite MI, but also provides an elegant way to estimate it. This holds in particular for Gaussian dropout, which is known to benefit generalization even more than binary dropout (Srivastava et al., 2014), and for information dropout (Achille and Soatto, 2018). We empirically analyze the quality of the MI estimation in the setup with Gaussian and information dropout in a range of experiments on benchmark neural networks and datasets. While our results do not conclusively confirm or refute the information bottleneck hypothesis, they show that the IPs obtained using our estimator exhibit qualitatively different behavior than the IPs obtained using binning estimators and strongly indicate that a compression phase is indeed happening.
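To make the distinction between binary and continuous dropout noise concrete, the following sketch implements Gaussian dropout as multiplicative noise with mean 1 and variance p/(1-p), the convention of Srivastava et al. (2014). This is an illustration under our own naming, not the paper's implementation; its point is that the noise distribution is continuous, which is the property that keeps I(X; Z) finite:

```python
import numpy as np

def gaussian_dropout(h, rate=0.5, rng=None, train=True):
    """Multiplicative Gaussian dropout (Srivastava et al., 2014).

    Each activation is multiplied by noise drawn from
    N(1, rate / (1 - rate)). Unlike binary dropout, the noise is
    continuously distributed, so the representation Z is a continuous
    RV given the input.
    """
    if not train:
        return h  # identity at test time: the noise has mean 1
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(rate / (1.0 - rate))
    noise = rng.normal(loc=1.0, scale=sigma, size=h.shape)
    return h * noise

# Usage: perturb a batch of activations during training.
h = np.ones((2, 4))
out = gaussian_dropout(h, rate=0.5, rng=np.random.default_rng(0))
```

With rate = 0.5 the noise has unit variance; at test time the layer is the identity, since the multiplicative noise already has mean 1 and no rescaling is needed.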

2. MUTUAL INFORMATION ESTIMATION FOR NEURAL NETWORKS

We use the following notation: lower-case letters denote realizations of random variables (RVs), e.g., b denotes a realization of the RV B; H(A) denotes the Shannon entropy of a discrete RV A whose distribution is denoted p_a; h(B) is the differential entropy of a continuous RV B whose distribution is described by the probability density function p_b; I(A; B) is the MI between RVs A and B; X ∈ X ⊆ R^n and Y ∈ Y are the RVs describing inputs to a neural network and corresponding targets; f(X) is the result of the forward pass of the input through the network to the hidden layer of interest; Z is an N-dimensional RV describing the hidden representations. The caveats of different approaches to measuring the MI between input X and hidden representation Z of a neural network (e.g., the MI being infinite for deterministic neural networks and continuous input distributions, or the dependence of the MI estimate on the parameterization of the estimator) have been discussed widely in the literature (Saxe et al., 2019; Geiger, 2022) and are briefly reviewed in this section. These caveats do not appear for the MI measured between representations Z and targets Y, since the target is in most cases a discrete RV (a class label), for which MI is always finite. One option for estimating I(X; Z) is to assume the input to be drawn from a discrete distribution. This view is supported by the finite precision of the computational resources used (Lorenzen et al., 2021) and makes it easy to use a finite dataset S to describe the distribution. In such a setup, the distribution of (X, Y) is assumed uniform on the dataset S, and the discretization of Z is performed at a fixed bin size (e.g., corresponding to the computer precision). The MI between



Code for the experiments is publicly available at https://github.com/link-er/IP_dropout.



Figure 1: IPs w.r.t. the activations of one layer with information dropout or Gaussian dropout in a LeNet network. In contrast to the IP based on estimating MI using binning, our estimates (both for Gaussian and information dropout) clearly show compression. This suggests that even if MI is finite, the binning estimator fails to converge to the true MI (see also Section 4).

