INFORMATION PLANE ANALYSIS FOR DROPOUT NEURAL NETWORKS

Abstract

The information-theoretic framework promises to explain the predictive power of neural networks. In particular, the information plane analysis, which measures mutual information (MI) between input and representation as well as between representation and output, should give rich insights into the training process. This approach, however, was shown to depend strongly on the choice of MI estimator. The problem is amplified for deterministic networks, for which the MI between input and representation is infinite if the input distribution is continuous. The estimated values are then determined by the chosen estimation approach rather than adequately representing the training process from an information-theoretic perspective. In this work, we show that dropout with continuously distributed noise ensures that the MI is finite. We demonstrate in a range of experiments¹ that this enables a meaningful information plane analysis for a class of dropout neural networks that is widely used in practice.

1. INTRODUCTION

The information bottleneck hypothesis for deep learning conjectures two phases in the training of feedforward neural networks (Shwartz-Ziv and Tishby, 2017): the fitting phase and the compression phase. The former corresponds to extracting information from the input into the learned representations and is characterized by an increase of mutual information (MI) between inputs and hidden representations. The latter corresponds to forgetting information that is not needed to predict the target, which is reflected in a decrease of the MI between learned representations and inputs, while the MI between representations and targets stays the same or grows. These phases can be observed via an information plane (IP) analysis, i.e., by analyzing the development of MI between inputs and representations and between representations and targets during training (see Fig. 1 for an example). For an overview of information plane analysis we refer the reader to Geiger (2022).

While elegant and plausible, the information bottleneck hypothesis is challenging to investigate empirically. As shown by Amjad and Geiger (2020, Th. 1), the MI between the inputs and the representations learned by a deterministic neural network is infinite if the input distribution is continuous. The standard approach is therefore to assume the input distribution to be discrete (e.g., equivalent to the empirical distribution of the dataset S at hand) and to discretize the real-valued hidden representations by binning. This allows for non-trivial measurements, i.e., it avoids the MI always taking its maximum value of log(|S|) (Shwartz-Ziv and Tishby, 2017). In this discrete and deterministic setting, the MI is theoretically equal to the Shannon entropy of the binned representation. Considering the effect of binning, however, the decrease of MI is essentially equivalent to geometrical compression (Basirat et al., 2021). Moreover, the binning-based estimate depends strongly on the chosen bin size (Ross, 2014).
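The binning-based estimation described above can be illustrated with a short sketch. For a deterministic network and a discrete (empirical) input distribution, I(X; T) reduces to the Shannon entropy H(T) of the binned representation T, so estimating the MI amounts to counting distinct binned activation patterns. The code below is an illustrative minimal implementation (not the paper's experimental code); the toy network and the function name `binned_entropy` are our own choices. It also exhibits the bin-size dependence noted above: as the bins shrink, the estimate approaches the maximum value log2(|S|).

```python
import numpy as np

def binned_entropy(representations, n_bins):
    """Entropy (in bits) of hidden representations discretized by uniform binning.

    For a deterministic network with a discrete input distribution,
    I(X; T) = H(T), so this entropy *is* the binning-based MI estimate.
    """
    # Uniform bin edges over the observed activation range.
    lo, hi = representations.min(), representations.max()
    edges = np.linspace(lo, hi, n_bins + 1)
    # Map each activation to a bin index; each sample becomes a tuple of indices.
    digitized = np.digitize(representations, edges[1:-1])
    # Plug-in entropy of the empirical distribution over binned patterns.
    _, counts = np.unique(digitized, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Toy "hidden layer": tanh activations of random inputs under a fixed weight matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
W = rng.normal(size=(10, 5))
T = np.tanh(X @ W)

# The estimate grows with the number of bins and saturates near log2(|S|) ~ 9.97 bits.
for n_bins in (2, 10, 100):
    print(n_bins, binned_entropy(T, n_bins))
```

With fine binning, nearly every sample falls into its own pattern, so the estimate saturates at log2(|S|) regardless of what the network has learned, which is exactly why such measurements are hard to interpret information-theoretically.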
To instead work with continuous input distributions, Goldfeld

¹ Code for the experiments is public on https://github.com/link-er/IP_dropout.
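The abstract's claim rests on dropout layers whose noise is continuously distributed, in contrast to standard Bernoulli dropout. One widely used example of such a layer is multiplicative Gaussian dropout; the sketch below is a minimal illustrative implementation (the function name, signature, and noise parameterization are our own choices, not the paper's), showing how the representation becomes a stochastic, rather than deterministic, function of the input.

```python
import numpy as np

def gaussian_dropout(activations, rate=0.1, rng=None):
    """Multiplicative Gaussian dropout.

    Each activation h is scaled by continuous noise eps ~ N(1, rate / (1 - rate)),
    a common parameterization matching the variance of Bernoulli dropout at the
    same rate. Because T = h * eps is a noisy function of the input, I(X; T)
    stays finite even for continuous input distributions.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(rate / (1.0 - rate))
    noise = rng.normal(loc=1.0, scale=sigma, size=activations.shape)
    return activations * noise

# Usage: inject noise into a hidden layer's activations during training.
h = np.tanh(np.random.default_rng(1).normal(size=(4, 5)))
t = gaussian_dropout(h, rate=0.1)
```

The noise has mean 1, so activations are unbiased in expectation, while the continuous perturbation is what makes an information plane analysis well defined.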

