USABLE INFORMATION AND EVOLUTION OF OPTIMAL REPRESENTATIONS DURING TRAINING

Abstract

We introduce a notion of usable information contained in the representation learned by a deep network, and use it to study how optimal representations for the task emerge during training. We show that the implicit regularization of training with Stochastic Gradient Descent (SGD) at a high learning rate and small batch size plays an important role in learning minimal sufficient representations for the task. In the process of arriving at a minimal sufficient representation, we find that the content of the representation changes dynamically during training. In particular, semantically meaningful but ultimately irrelevant information is encoded during the early transient dynamics of training and later discarded. In addition, we evaluate how perturbing the initial part of training affects the learning dynamics and the resulting representations. We demonstrate these effects both on perceptual decision-making tasks inspired by the neuroscience literature and on standard image classification tasks.

1. INTRODUCTION

An important open question for the theory of deep learning is why highly over-parametrized neural networks learn solutions that generalize well even though the model can in principle memorize the entire training set. Some have speculated that neural networks learn minimal but sufficient representations of the input through the implicit regularization of Stochastic Gradient Descent (SGD) (Shwartz-Ziv & Tishby, 2017; Achille & Soatto, 2018), and that the minimality of the representations relates to generalizability. Follow-up work has disputed the validity of some of these claims for deterministic deep networks (Saxe et al., 2018), leading to an ongoing debate about the optimality of representations and how they are learned during training. Part of the disagreement stems from the use of information-theoretic quantities: most previous studies in deep learning have analyzed the amount of information that the learned representation contains about the inputs using Shannon's mutual information. However, when the mapping from input to representation is deterministic, the mutual information between the representation and the input is degenerate (Saxe et al., 2018; Goldfeld et al., 2018). Rather than study the mutual information in a neural network, we instead define and study the "usable information" in the network, which measures the amount of information that can be extracted from the representation by a learned decoder and scales to high-dimensional, realistic tasks. We use this notion to quantify how relevant and irrelevant information is represented across layers of the network throughout training, and how this is affected by the optimization algorithm and network pretraining. In particular, we propose to study a simple task inspired by decision-making tasks in neuroscience, where inputs and outputs are carefully designed to probe specific information-processing phenomena.
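To make the decoder-based definition concrete, the sketch below estimates usable information as the marginal label entropy H(Y) minus the cross-entropy achieved by a learned decoder on the representation. This is an illustrative minimal implementation, not the paper's exact estimator: the synthetic "representation", the logistic decoder, and all hyperparameters here are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a layer's activations: 1000 samples, 5-dim,
# with the binary label linearly decodable from the first coordinate.
n = 1000
y = rng.integers(0, 2, n)
z = rng.normal(size=(n, 5))
z[:, 0] += 2.0 * (y - 0.5)  # inject label information into dimension 0

def decoder_cross_entropy(z, y, steps=2000, lr=0.1):
    """Train a logistic decoder by gradient descent; return its
    cross-entropy H_dec(Y|Z) in bits (an upper bound on H(Y|Z))."""
    w = np.zeros(z.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))
        g = p - y                      # gradient of the log loss
        w -= lr * z.T @ g / len(y)
        b -= lr * g.mean()
    p = np.clip(1.0 / (1.0 + np.exp(-(z @ w + b))), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log2(p) + (1 - y) * np.log2(1 - p))

# Marginal entropy H(Y) in bits (binary, roughly balanced labels).
q = y.mean()
h_y = -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

# Usable information = what the decoder family can extract, in bits.
usable_bits = h_y - decoder_cross_entropy(z, y)
print(f"usable information ~ {usable_bits:.2f} bits")
```

A richer decoder family (e.g., a small MLP) would extract more of the available information; the estimate is always relative to the chosen decoder class, which is the point of the "usable" qualifier.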
We then extend our findings to standard image classification tasks trained with state-of-the-art models.

Our neuroscience-inspired task is the checkerboard (CB) task (Chandrasekaran et al., 2017; Kleinman et al., 2019). In the CB task, the subject discerns the dominant color of a checkerboard filled with red and green squares, then reaches to the left or right target whose color matches the dominant color of the checkerboard (Fig. 1a). The task therefore involves two binary choices: a color decision (i.e., reach to the red or green target) and a direction decision (i.e., reach left or right). Critically, the color of the targets (red left, green right; or green left, red right) is

