LEARNED NEURAL NETWORK REPRESENTATIONS ARE SPREAD DIFFUSELY WITH REDUNDANCY

Abstract

Representations learned by pre-training a neural network on a large dataset are increasingly used to successfully perform a variety of downstream tasks. In this work, we take a closer look at how features are encoded in such pre-trained representations. We find that learned representations in a given layer exhibit a degree of diffuse redundancy, i.e., any randomly chosen subset of neurons in the layer that is larger than a threshold size shares a large degree of similarity with the full layer and performs comparably to the full layer on a variety of downstream tasks. For example, a linear probe trained on 20% of randomly picked neurons from a ResNet50 pre-trained on ImageNet1k achieves an accuracy within 5% of a linear probe trained on the full layer of neurons for downstream CIFAR10 classification. We conduct experiments on different neural architectures (including CNNs and Transformers) pre-trained on both ImageNet1k and ImageNet21k and evaluate a variety of downstream tasks taken from the VTAB benchmark. We find that the loss and dataset used during pre-training largely govern the degree of diffuse redundancy, and that the "critical mass" of neurons needed often depends on the downstream task, suggesting a task-inherent sparsity-performance Pareto frontier. Our findings shed light on the nature of representations learned by pre-trained deep neural networks and suggest that entire layers might not be necessary to perform many downstream tasks. We investigate the potential for exploiting this redundancy to achieve efficient generalization on downstream tasks, and also draw caution to certain possible unintended consequences.

1. INTRODUCTION

Over the years, many architectures such as VGG (Simonyan & Zisserman, 2014), ResNet (He et al., 2016), and Vision Transformers (ViTs) (Kolesnikov et al., 2021) have been proposed that achieve competitive accuracies on many benchmarks, including the ImageNet challenge (Russakovsky et al., 2015). A key reason for the success of these models is their ability to learn useful representations of data (LeCun et al., 2015). Prior works have attempted to understand representations learned by deep neural networks through the lens of mutual information between the representations, inputs, and outputs (Shwartz-Ziv & Tishby, 2017), and hypothesize that neural networks perform well because of a "compression" phase in which mutual information between inputs and representations decreases. Moreover, recent works on interpretability have found that many neurons in learned representations are polysemantic, i.e., a single neuron can encode multiple "concepts" (Elhage et al., 2022; Olah et al., 2020), and that one can then train sparse linear models on such concepts to perform "explainable" classification (Wong et al., 2021). However, it is not well understood whether or how extracted features are concentrated in, or spread across, the full representation. While the length of the feature vectors extracted from state-of-the-art networks* can vary greatly, their accuracies on downstream tasks are not correlated with the size of the representation (see Table 1), but rather depend mostly on the inductive biases and training recipes (Wightman et al., 2021; Steiner et al., 2021). In all cases, the size of the extracted feature vector (i.e., the number of neurons) is orders of magnitude smaller than the input, and such extracted features are widely used for transfer learning (Bengio et al., 2013; Pan & Yang, 2009; Tan et al., 2018). We show that even when using a random subset of these extracted neurons, one can achieve downstream transfer accuracy close to that achieved by the full layer, thus showing that learned representations exhibit a degree of redundancy (Table 1).
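The random-subset probing experiment described above can be sketched as follows. This is a minimal, self-contained illustration: synthetic features stand in for actual ResNet50 penultimate-layer activations, and a least-squares linear probe stands in for the probes trained in the paper; all names and constants here are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for penultimate-layer features: class information is
# spread diffusely, i.e., every neuron carries a little signal about the label.
n, d, k = 600, 256, 4                        # samples, neurons, classes
y = rng.integers(0, k, size=n)
class_dirs = rng.normal(size=(k, d))         # each class shifts all neurons
X = class_dirs[y] + 0.5 * rng.normal(size=(n, d))
Xtr, ytr, Xte, yte = X[:400], y[:400], X[400:], y[400:]

def probe_accuracy(idx):
    """Fit a least-squares linear probe on the neuron subset `idx`."""
    W, *_ = np.linalg.lstsq(Xtr[:, idx], np.eye(k)[ytr], rcond=None)
    return ((Xte[:, idx] @ W).argmax(1) == yte).mean()

full_acc = probe_accuracy(np.arange(d))                         # all neurons
sub_acc = probe_accuracy(rng.choice(d, d // 5, replace=False))  # random 20%
print(f"full layer: {full_acc:.2f}, random 20% subset: {sub_acc:.2f}")
```

Because the signal is diffuse by construction, the 20% subset recovers nearly the full-layer accuracy; with real pre-trained features the gap depends on the pre-training recipe and downstream task, as studied in the rest of the paper.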
Early works in perception suggest that there are many redundant neurons in the human visual cortex (Attneave, 1954), and some works argued that a similar redundancy in artificial neural networks should aid faster convergence (Izui & Pentland, 1990). In this paper, we revisit redundancy in the context of modern DNN architectures trained on large-scale datasets. In particular, we propose the diffused redundancy hypothesis and systematically measure its prevalence across different pre-training datasets, losses, model architectures, and downstream tasks. We also show how this kind of redundancy can be exploited to obtain desirable properties such as generalization performance and better parity in inter-class performance. We highlight the following contributions:

• We present the diffused redundancy hypothesis, which states that learned representations exhibit redundancy that is diffused throughout the layer. Our work aims to better understand the nature of representations learned by DNNs.

• We propose a measure of diffused redundancy and systematically test our hypothesis across various architectures, pre-training datasets & losses, and downstream tasks.
  - We find that diffused redundancy is significantly impacted by the pre-training dataset & loss and by the downstream dataset.
  - We find that models explicitly trained so that particular parts of the full representation perform as well as the full layer, i.e., models with structured redundancy (e.g., Kusupati et al., 2022), also exhibit a significant amount of diffused redundancy, suggesting that this phenomenon is perhaps inevitable when DNNs have a wide enough final layer.
  - We quantify the degree of diffused redundancy as a function of the number of neurons in a given layer. As we reduce the dimension of the extracted feature vector and re-train the model, the degree of diffused redundancy decreases significantly, implying that diffused redundancy only appears when the layer is wide enough to accommodate it.

• Finally, we draw caution to some potential undesirable side effects of exploiting diffused redundancy for efficient transfer learning, which have implications for fairness.
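The proposed measure can be sketched as follows. Here `subset_accuracy` is a hypothetical stand-in for the real procedure (training a downstream linear probe on a neuron subset); its saturating form simply models a diffusely redundant layer, and all constants are illustrative assumptions rather than the paper's actual numbers.

```python
import numpy as np

rng = np.random.default_rng(1)

def subset_accuracy(idx, d=2048, full_acc=0.92):
    # Hypothetical stand-in: in practice, train a linear probe on the
    # neuron subset `idx` and return its downstream accuracy. The
    # saturating curve models a layer whose information is diffuse, so
    # accuracy plateaus once a critical mass of neurons is included.
    return full_acc * (1.0 - np.exp(-8.0 * len(idx) / d))

def diffused_redundancy(d=2048, delta=0.9, trials=5):
    """Largest fraction of randomly picked neurons that can be discarded
    while every random trial stays within delta of full-layer accuracy."""
    full = subset_accuracy(np.arange(d), d)
    for keep in (0.05, 0.1, 0.2, 0.3, 0.5, 0.75, 1.0):  # smallest first
        subsets = [rng.choice(d, int(keep * d), replace=False)
                   for _ in range(trials)]
        if min(subset_accuracy(s, d) for s in subsets) >= delta * full:
            return 1.0 - keep            # fraction that can be discarded
    return 0.0

print(f"discardable fraction at delta=0.9: {diffused_redundancy():.2f}")
```

Requiring every random trial (not just the average) to clear the threshold is what makes this a test of *diffused* redundancy: the information must be recoverable from any sufficiently large random subset, not from one well-chosen one.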

1.1. RELATED WORK

Closest to our work is that of Dalvi et al. (2020), who also investigate neuron redundancy, but in the context of pre-trained language models. They analyze two language models and find that they can achieve good downstream performance with a significantly smaller subset of neurons. However, there are two key differences from our work. First, their analysis of neuron redundancy uses neurons from all layers (by concatenating each layer), whereas we show that such redundancy exists even at the level of a single (penultimate) layer. Second, and perhaps more importantly, they use feature selection to choose the subset of neurons, whereas we show that features are diffused throughout the layer and that even a randomly chosen subset of neurons suffices. Our work also differs in that we analyze vision models.



* "Extracted features" for the purposes of this paper refers to the representation recorded at the penultimate layer, though the larger concept applies to any layer.



Table 1: Different model architectures with varying penultimate-layer lengths trained on ImageNet1k. WRN50-2 stands for WideResNet50-2. Implementations of architectures are taken from timm (Wightman, 2019). Diffused redundancy here measures the fraction of (randomly picked) neurons that can be discarded while achieving within δ = 90% of the performance of the full layer.

