USABLE INFORMATION AND EVOLUTION OF OPTIMAL REPRESENTATIONS DURING TRAINING

Abstract

We introduce a notion of usable information contained in the representation learned by a deep network, and use it to study how optimal representations for a task emerge during training. We show that the implicit regularization of training with Stochastic Gradient Descent (SGD) with a high learning rate and small batch size plays an important role in learning minimal sufficient representations for the task. In the process of arriving at a minimal sufficient representation, we find that the content of the representation changes dynamically during training. In particular, semantically meaningful but ultimately irrelevant information is encoded in the early transient dynamics of training, before being discarded later on. In addition, we evaluate how perturbing the initial part of training affects the learning dynamics and the resulting representations. We demonstrate these effects both on perceptual decision-making tasks inspired by the neuroscience literature and on standard image classification tasks.

1. INTRODUCTION

An important open question in the theory of deep learning is why highly over-parametrized neural networks learn solutions that generalize well, even though the model can in principle memorize the entire training set. Some have speculated that neural networks learn minimal but sufficient representations of the input through the implicit regularization of Stochastic Gradient Descent (SGD) (Shwartz-Ziv & Tishby, 2017; Achille & Soatto, 2018), and that the minimality of the representations relates to generalizability. Follow-up work has disputed the validity of some of these claims for deterministic deep networks (Saxe et al., 2018), leading to an ongoing debate about the notion of optimality of representations and how they are learned during training. Part of the disagreement stems from the use of information-theoretic quantities: most previous studies in deep learning have analyzed the amount of information that the learned representation contains about the inputs using Shannon's mutual information. However, when the mapping from input to representation is deterministic, the mutual information between the representation and the input is degenerate (Saxe et al., 2018; Goldfeld et al., 2018). Rather than study the mutual information in a neural network, we instead define and study the "usable information" in the network, which measures the amount of information that can be extracted from the representation by a learned decoder, and which scales to realistic high-dimensional tasks. We use this notion to quantify how relevant and irrelevant information is represented across layers of the network throughout the training process, and how this is affected by the optimization algorithm and by network pretraining. In particular, we propose to study a simple task inspired by decision-making tasks in neuroscience, where inputs and outputs are carefully designed to probe specific information-processing phenomena.
We then extend our findings to standard image classification tasks trained with state-of-the-art models. Our neuroscience-inspired task is the checkerboard (CB) task (Chandrasekaran et al., 2017; Kleinman et al., 2019). In the CB task, the subject discerns the dominant color of a checkerboard filled with red and green squares, and then reaches to the left or right target whose color matches the dominant color of the checkerboard (Fig 1a). The task therefore involves two binary choices: a color decision (i.e., reach to the red or green target) and a direction decision (i.e., reach left or right). Critically, the color of the targets (red left, green right; or green left, red right) is randomized on every trial. The direction decision is thus conditionally independent of the color decision, as detailed further in Fig 1b and Section B.6, even though the color information must be used to solve the task. This setup allows us to evaluate how both components of information are represented through training and across layers. We use this task and its extensions to study the evolution of minimal representations during training. If a representation is sufficient and minimal, we refer to it as optimal (Achille & Soatto, 2018). Our contributions are the following. (1) We introduce a notion of usable information for studying representations and training dynamics in deep networks (Section 3). (2) We use this notion to characterize the transient training dynamics of deep networks by measuring the amount of usable relevant and irrelevant information across layers and training epochs. We first use the CB task to build intuition about the training dynamics in a simplified setting, and find that training with SGD is critical for biasing the network toward learning minimal representations in intermediate layers (Section 4.1).
This adds to the literature suggesting that SGD yields minimal representations of the input information (Achille & Soatto, 2018; Shwartz-Ziv & Tishby, 2017) while avoiding some of its pitfalls. (3) Using the intuition gained from the simple task, we evaluate our findings on the CIFAR-10 and CIFAR-100 tasks with modern architectures. Remarkably, we find that the networks increase usable information about an irrelevant component of the input early in training and discard it later in training to arrive at a minimal sufficient solution, consistent with a proposed (Shwartz-Ziv & Tishby, 2017) though controversial (Saxe et al., 2018) theory.
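To make the structure of the CB task concrete, its trial generation can be sketched as follows. This is a minimal synthetic version based on the description above; the function and variable names are our own, not the authors' exact experimental setup:

```python
import random

def generate_cb_trial(rng):
    """Generate one synthetic checkerboard (CB) trial.

    Returns the dominant checkerboard color (the color decision),
    the randomized target configuration, and the correct reach
    direction (the direction decision).
    """
    # Color decision: the dominant color of the checkerboard.
    color = rng.choice(["red", "green"])
    # Target configuration is randomized on every trial:
    # either the red target is on the left, or the green target is.
    left_target = rng.choice(["red", "green"])
    # Direction decision: reach toward the target whose color
    # matches the dominant checkerboard color.
    direction = "left" if color == left_target else "right"
    return color, left_target, direction

rng = random.Random(0)
trials = [generate_cb_trial(rng) for _ in range(10_000)]
# Because the target configuration is random and independent of the
# dominant color, knowing the color decision tells you nothing about
# the direction decision: P(direction = left | color) is ~0.5 for
# both colors, even though color must be used to compute direction.
for c in ("red", "green"):
    same_color = [t for t in trials if t[0] == c]
    frac_left = sum(t[2] == "left" for t in same_color) / len(same_color)
    print(c, round(frac_left, 2))
```

This illustrates why the task is useful as a probe: the direction output requires the color information during computation, yet carries no information about the color decision itself.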

2. RELATED WORK

Some efforts to understand why neural networks generalize focus on representation learning, that is, on how deep networks learn optimal (i.e., minimal and sufficient) representations of the inputs in order to solve a task. Typically, representation learning studies the properties of the asymptotic representations after training (Achille & Soatto, 2018). Recent work suggests that these asymptotic representations contain minimal but sufficient input information for performing a task (Achille & Soatto, 2018; Shwartz-Ziv & Tishby, 2017). Implicit regularization coming from SGD, and in particular from the use of large learning rates and small batch sizes, is believed to play an important role in forming these minimal sufficient representations. How does the training process lead to these minimal but sufficient asymptotic representations? Shwartz-Ziv & Tishby (2017) propose that there are two distinct phases of training: an empirical risk minimization phase, where the network minimizes the loss on the training set, and a "compression" phase, where the network discards information about the inputs that does not need to be represented to solve the task. Recently, Saxe et al. (2018) challenged this view, arguing that the observed compression depended on the activation function and on the mutual information estimator used by Shwartz-Ziv & Tishby (2017). These works highlight the challenges of estimating mutual information when studying how representations emerge through training. In general, estimating mutual information from samples is difficult for high-dimensional random variables (Paninski, 2003). The primary difficulty is estimating a high-dimensional probability distribution from samples, since the number of samples required generally scales exponentially with the dimension. This is impractical for realistic deep learning tasks, where the representations are high dimensional.
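This exponential sample requirement is easy to see with a toy experiment (our own illustration, not from the works cited above): with a fixed sample budget, the fraction of histogram cells that receive even a single sample collapses as the dimension grows, so any density estimate built from such a histogram becomes unreliable.

```python
import numpy as np

def bin_occupancy(dim, n_samples=10_000, n_bins=10, seed=0):
    """Fraction of histogram cells containing at least one sample.

    Draws n_samples uniform points in [0, 1)^dim and discretizes each
    coordinate into n_bins bins, giving n_bins**dim cells in total.
    """
    rng = np.random.default_rng(seed)
    x = rng.random((n_samples, dim))
    cells = (x * n_bins).astype(int)           # per-axis bin indices
    occupied = len({tuple(c) for c in cells})  # distinct non-empty cells
    return occupied / n_bins**dim

# With 10,000 samples: in 1D every cell is hit; in 8D (10^8 cells),
# at most one cell in 10^4 can be occupied.
for dim in (1, 2, 4, 8):
    print(dim, bin_occupancy(dim))
```

Real network representations have hundreds or thousands of dimensions, so the empty-cell problem is far more severe than in this sketch.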
To estimate the mutual information, Shwartz-Ziv & Tishby (2017) used a binning approach, discretizing the activations into a finite number of bins. While this approximation is exact in the limit of infinitesimally small bins, in practice the bin size affects the estimator (Saxe et al., 2018; Goldfeld et al., 2018). Beyond binning, other approaches to estimating mutual information include entropic estimators (e.g., Goldfeld et al., 2018) and nearest-neighbor estimators (Kraskov et al., 2004). Although mutual information is difficult to estimate, it is an appealing quantity for summarizing key aspects of transient neural network training behavior because of its invariance to smooth, invertible transformations. In this work, rather than estimate the mutual information directly, we define and study the "usable information" in the network, which corresponds to a variational approximation of the mutual information (Barber & Agakov, 2003; Poole et al., 2019) (see Sections 3 and A.1). Recently, such variational approximations of mutual information have been viewed as meaningful characterizations of representations in deep networks, and the theoretical underpinnings of this approach are beginning to be investigated (Xu et al., 2020; Dubois et al., 2020).
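The variational (Barber & Agakov, 2003) style of estimate can be sketched as follows: fit a decoder q(y|z) to predict the label from the representation, and report H(Y) minus the decoder's cross-entropy. This is our own minimal illustration under the assumption of binary labels and a logistic-regression decoder family; the paper's decoder architecture and training details are given in Sections 3 and A.1, not here.

```python
import numpy as np

def usable_information(Z, Y, n_steps=2000, lr=0.5):
    """Estimate usable information I_u(Z; Y) in nats, for binary Y.

    Fits a logistic-regression decoder q(y|z) by full-batch gradient
    descent and returns I_u = H(Y) - CE(Y, q(Y|Z)), a variational
    lower bound on the mutual information I(Z; Y).
    """
    Z = np.asarray(Z, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, d = Z.shape
    # Entropy H(Y) of the empirical (binary) label distribution.
    p = Y.mean()
    H = 0.0 if p in (0.0, 1.0) else -(p * np.log(p) + (1 - p) * np.log(1 - p))
    # Train the decoder.
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        q = 1.0 / (1.0 + np.exp(-(Z @ w + b)))  # decoder's P(y=1 | z)
        grad = q - Y                            # d(CE)/d(logits)
        w -= lr * (Z.T @ grad) / n
        b -= lr * grad.mean()
    # Cross-entropy of the trained decoder, in nats.
    q = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
    eps = 1e-12
    ce = -np.mean(Y * np.log(q + eps) + (1 - Y) * np.log(1 - q + eps))
    return H - ce

# A representation that linearly encodes the label carries close to
# H(Y) = log 2 nats of usable label information; pure noise carries
# close to zero.
rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=1000)
Z_informative = Y[:, None] + 0.1 * rng.standard_normal((1000, 1))
Z_noise = rng.standard_normal((1000, 1))
print(usable_information(Z_informative, Y))  # close to log 2 ~ 0.69
print(usable_information(Z_noise, Y))        # close to 0
```

Because the decoder is learned from data, the estimate measures information that can actually be extracted by a model of that capacity, which is what makes the quantity well defined even for deterministic networks.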

