CAUSAL TESTING OF REPRESENTATION SIMILARITY METRICS

Anonymous

Abstract

Representation similarity metrics are widely used to compare learned representations in neural networks, as is evident in the extensive literature investigating metrics that accurately capture the information encoded in a network. However, aiming to capture all of the information available in a network may have little to do with the information the network actually uses. One solution is to experiment with causal measures of function: by ablating groups of units thought to carry information and observing whether those ablations affect network performance, we can focus on an outcome that causally links representations to function. In this paper, we systematically test representation similarity metrics to evaluate their sensitivity to causal functional changes induced by ablation. We use network performance changes after ablation as a way to causally measure the influence of a representation on function. These measures of function allow us to test how well similarity metrics capture changes in network performance versus changes in linear decodability: network performance measures index the information used by the network, while linear decoding methods index the information available in the representation. We show that all of the tested metrics are more sensitive to decodable features than to network performance. Among these metrics, Procrustes and CKA outperform regularized CCA-based methods on average, although for AlexNet they no longer outperform CCA-based methods when evaluated against network performance. We thereby provide causal tests of the utility of different representation similarity metrics. Our results suggest that interpretability methods will be more effective if they are based on representation similarity metrics that have been evaluated using causal tests.

1. INTRODUCTION

Neural networks already play a critical role in systems where understanding and interpretation are paramount, such as self-driving cars and the criminal justice system. To understand and interpret neural networks, representation similarity metrics have been used to compare learned representations between and across networks (Kornblith et al. (2019); Raghu et al. (2017); Morcos et al. (2018b); Wang et al. (2018); Li et al. (2015); Feng et al. (2020); Nguyen et al. (2020)). Using these similarity metrics, researchers evaluate whether networks trained from different random initializations learn the same information, whether different layers learn redundant or complementary information, and how different training data affect learning (Kornblith et al. (2019); Li et al. (2015); Wang et al. (2018)). Apart from helping to answer these fundamental questions, similarity metrics have the potential to provide a general-purpose metric over representations (Boix-Adsera et al. (2022)). What it means for two representations to be similar, however, is not straightforward. Many similarity metrics have been proposed with different underlying assumptions and strategies for comparing representation spaces. For example, some similarity metrics are invariant under linear transformations while others are not (see Kornblith et al. (2019) for a theoretical comparison). These different assumptions and strategies can lead to quantitatively different predictions. For instance, Ding et al. (2021) show that certain metrics are insensitive to changes in the decodable information present in representations. In another study, Davari et al. (2022) demonstrate that the centered kernel alignment metric predicts high similarity between random and fully trained representations. It is therefore unclear which representation similarity metrics capture the most important information from representations, and further tests are needed to evaluate them. What important pieces of information do similar representations share? Previous studies of similarity metrics have assumed that similar representations share linearly decodable information (Boix-Adsera et al. (2022); Ding et al. (2021); Feng et al. (2020)). To measure the linearly decodable information in a representation, researchers usually train linear probes for downstream tasks on learned representations and compare the results.
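To make the linear-probe protocol concrete, the sketch below fits a closed-form ridge classifier on top of frozen representations and reports its decoding accuracy. This is a minimal illustration, not the probing setup of any cited study; the "representations" here are synthetic Gaussian clusters standing in for learned features.

```python
import numpy as np

def linear_probe_accuracy(train_reps, train_labels, test_reps, test_labels, l2=1e-3):
    """Fit a linear probe (one-vs-rest ridge regression) on frozen
    representations and return its test-set classification accuracy."""
    n_classes = int(train_labels.max()) + 1
    Y = np.eye(n_classes)[train_labels]                          # one-hot targets
    X = np.hstack([train_reps, np.ones((len(train_reps), 1))])   # bias column
    # Closed-form ridge solution: W = (X^T X + l2*I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    Xt = np.hstack([test_reps, np.ones((len(test_reps), 1))])
    preds = (Xt @ W).argmax(axis=1)
    return float((preds == test_labels).mean())

# Toy demo: two well-separated Gaussian clusters, one per class.
rng = np.random.default_rng(0)
reps = np.vstack([rng.normal(-2.0, 1.0, (100, 8)), rng.normal(2.0, 1.0, (100, 8))])
labels = np.repeat([0, 1], 100)
probe_acc = linear_probe_accuracy(reps, labels, reps, labels)
```

A high probe accuracy says the task information is linearly available in the representation; as the rest of the paper argues, it does not by itself say the network uses that information.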
However, the features of a representation that carry the most information may not be those actually used by the network during inference. Studies that remove features from representations in trained networks have revealed a weak link between the relevance of a feature for decoding and its effect when removed from the network (Meyes et al. (2020); Zhou et al. (2018); Donnelly & Roegiest (2019); Morcos et al. (2018b)). Hayne et al. (2022) recently showed that linear decoders specifically cannot single out the features of representations actually used by the network. Consequently, two representations that are equally decodable using linear probes may not be equal from the point of view of network performance. This distinction is crucial for neural network interpretability, where the aim is to develop human-understandable descriptions of how neural networks actually rely on their internal representations. For the purpose of interpreting neural network function, we suggest that representations should be judged as similar if they cause similar effects in a trained network. To observe these causal effects, previous studies have removed features from a representation, a process called ablation, and observed the effects (LeCun et al. (1989)). In this paper, we use ablation to evaluate how closely representation similarity metrics are related to causal function. We first ablate groups of units from AlexNet and MobileNet, compare the original representations to the ablated representations using representation similarity metrics, and then compare metric outputs to the changes seen in linear probe decoding or network performance. In this way, we can test how well representational changes from ablations are captured by representation similarity metrics by comparing those metrics to changes in linear decoding and to causal differences in network performance. Linear probes measure how much task-specific information is directly decodable from a given representation.
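The ablation procedure described above can be sketched as follows: zero out a chosen group of units and measure how many of a frozen downstream readout's decisions change. The random readout matrix below is an illustrative stand-in for the rest of a trained network, not our experimental pipeline.

```python
import numpy as np

def ablate_units(reps, unit_idx):
    """Return a copy of an (examples x units) representation with the
    selected units zeroed out, simulating their removal from the network."""
    ablated = reps.copy()
    ablated[:, unit_idx] = 0.0
    return ablated

rng = np.random.default_rng(1)
reps = rng.normal(size=(200, 16))       # stand-in for layer activations
readout = rng.normal(size=(16, 10))     # frozen stand-in for downstream layers
baseline_preds = (reps @ readout).argmax(axis=1)

ablated = ablate_units(reps, unit_idx=[0, 3, 7])
ablated_preds = (ablated @ readout).argmax(axis=1)
# Fraction of decisions the ablation flipped: a causal measure of how much
# the network relied on the removed units.
frac_changed = float((ablated_preds != baseline_preds).mean())
```

The same representation pair (original vs. ablated) can then be scored by a similarity metric, and the metric's output compared against `frac_changed`-style functional change.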
On the other hand, network performance measures quantify how the network trained on the same task uses a given representation. Finally, we test how well representation similarity metrics capture these two kinds of change. By directly comparing linear probe accuracies and network performance on the same task, we can answer questions like: how much more sensitive are representation similarity metrics to the non-causal linear properties of representations than to the causal non-linear properties used by the network? Answering these questions may help in the development of interpretability methods that are increasingly sensitive to actual network function. In this work, we show that CKA, Procrustes, and regularized CCA-based representation similarity metrics predict causal network performance changes significantly worse than non-causal decoding changes. We also show that, on average, Procrustes and CKA outperform regularized CCA-based methods. However, Procrustes and CKA do not outperform regularized CCA-based metrics on every combination of network and functional measure. Overall, our results suggest that interpretability methods will be more effective if they are based on representation similarity metrics that have been evaluated using causal tests. In summary, this paper documents the following contributions:

• We introduce a causal test of the utility of representation similarity metrics. We find that five popular representation similarity metrics are significantly less sensitive to network performance changes induced by ablation than to linearly decodable changes.

• Within the tested metrics, we show that Procrustes and CKA tend to outperform regularized CCA-based methods, but that tests using linear probes and network performance based functional measures can produce different results in different networks.
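For reference, two of the metrics named above can be sketched in a few lines of numpy: linear CKA and the orthogonal Procrustes distance, both computed on mean-centered (examples x units) matrices. This is a minimal sketch of the standard definitions, assuming the linear (non-kernel) form of CKA, not the exact implementation used in our experiments.

```python
import numpy as np

def center(X):
    """Subtract the per-unit mean across examples."""
    return X - X.mean(axis=0, keepdims=True)

def linear_cka(X, Y):
    """Linear CKA between two (examples x units) representations.
    Equals 1 for representations identical up to orthogonal
    transformation and isotropic scaling."""
    X, Y = center(X), center(Y)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, "fro")
                          * np.linalg.norm(Y.T @ Y, "fro")))

def procrustes_distance(X, Y):
    """Orthogonal Procrustes distance between centered,
    Frobenius-normalized representations (0 = identical up to
    rotation/reflection)."""
    X, Y = center(X), center(Y)
    X = X / np.linalg.norm(X, "fro")
    Y = Y / np.linalg.norm(Y, "fro")
    nuclear = np.linalg.svd(Y.T @ X, compute_uv=False).sum()
    return float(np.sqrt(max(0.0, 2.0 - 2.0 * nuclear)))

# Sanity check on synthetic data: both metrics are invariant to an
# orthogonal transform of the units.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 12))
Q, _ = np.linalg.qr(rng.normal(size=(12, 12)))  # random orthogonal matrix
cka_self = linear_cka(X, X @ Q)                  # stays ~1
dist_self = procrustes_distance(X, X @ Q)        # stays ~0
```

Applied to the ablation setting, one would compute these scores between original and ablated representations and ask how well they track the observed drop in linear decodability versus the drop in network performance.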

2. METHODS

In Section 2.1, we describe the statistical testing methodology used in our experiments. In Section 2.2, we introduce the representation similarity measures we evaluate and reformulate them for use on high dimensional representations. In Section 2.3, we describe how we use ablation to produce representations with different functional properties. Finally, in Section 2.4, we describe how we



