IMPORTANCE AND COHERENCE: METHODS FOR EVALUATING MODULARITY IN NEURAL NETWORKS

Anonymous

Abstract

As deep neural networks become more widely used, it is important to understand their inner workings. Toward this goal, modular interpretations are appealing because they offer flexible levels of abstraction beyond the standard architectural building blocks (e.g., neurons, channels, layers). In this paper, we consider the problem of assessing how functionally interpretable a given partitioning of neurons is. We propose two proxies for this: importance, which reflects how crucial sets of neurons are to network performance, and coherence, which reflects how consistently their neurons associate with input/output features. To measure these proxies, we develop a set of statistical methods based on techniques that have conventionally been used for the interpretation of individual neurons. We apply these methods to partitionings generated by a spectral clustering algorithm that uses a graph representation of the network's neurons and weights. We show that although our partitioning algorithm uses neither activations nor gradients, it reveals clusters with a surprising amount of importance and coherence. Together, these results support the use of modular interpretations, and graph-based partitionings in particular, for interpretability.

1. INTRODUCTION

Deep neural networks have achieved state-of-the-art performance in a variety of applications, but this success contrasts with the challenge of making them more intelligible. As these systems become more advanced and widely used, there are a number of reasons we may need to understand them more effectively. One reason is to shed light on better ways to build and train them. A second is the importance of transparency, especially in settings that involve matters of safety, trust, or justice (Lipton, 2018). More precisely, we want methods for analyzing a trained network that can be used to construct semantic and faithful descriptions of its inner mechanisms. We refer to this as mechanistic transparency.

Toward this goal, we consider modularity as an organizing principle for achieving mechanistic transparency. In the natural sciences, we often try to understand things by taking them apart. Aside from subdivision into the standard architectural building blocks (e.g., neurons, channels, layers), are there other ways a trained neural network can be meaningfully "taken apart"? We aim to analyze a network via a partitioning of its neurons into disjoint sets, with the hope of finding that these sets are "modules" with distinct functions.

Since there are many choices for how to partition a network, we would like metrics for anticipating how meaningful a given partition might be. Inspired by the field of program analysis (Fairley, 1978), we apply the concepts of "dynamic" and "static" analysis to neural networks. Dynamic analysis includes performing forward passes and/or computing gradients, while static analysis involves only the architecture and parameters. In a concurrent submission (Anonymous et al., 2021), we use spectral clustering to study the extent to which networks form clusters of neurons that are highly connected internally but not externally, and we find that in many cases, networks are structurally clusterable.
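To make the graph-based partitioning concrete, the following is a minimal illustrative sketch (our own simplification for exposition, not the concurrent submission's implementation): each neuron of an MLP becomes a graph node, neurons in adjacent layers are connected by edges weighted by absolute weight magnitude, and the resulting graph is partitioned with spectral clustering on the normalized Laplacian.

```python
import numpy as np

def network_to_graph(weight_mats):
    """Adjacency matrix over all neurons of an MLP: the edge between neuron
    j in layer l and neuron i in layer l+1 has weight |W_l[i, j]|."""
    sizes = [weight_mats[0].shape[1]] + [W.shape[0] for W in weight_mats]
    offsets = np.cumsum([0] + sizes[:-1])
    A = np.zeros((sum(sizes), sum(sizes)))
    for l, W in enumerate(weight_mats):
        rows = offsets[l + 1] + np.arange(W.shape[0])
        cols = offsets[l] + np.arange(W.shape[1])
        A[np.ix_(rows, cols)] = np.abs(W)
    return A + A.T  # symmetrize: the graph is undirected

def spectral_partition(A, k, n_iter=50):
    """Cluster graph nodes via the normalized Laplacian's smallest
    eigenvectors, followed by a simple farthest-point-seeded k-means."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)            # ascending eigenvalues
    X = eigvecs[:, :k]                        # spectral embedding
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    # farthest-point initialization keeps the demo deterministic
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        centers = np.stack([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels

# Toy 4-4-4 MLP whose weights split into two independent "modules".
block = np.ones((2, 2))
W = np.kron(np.eye(2), block)                 # block-diagonal 4x4 weights
labels = spectral_partition(network_to_graph([W, W]), k=2)
```

On this toy network the two weight blocks are recovered as separate clusters. The adjacency construction (absolute weight magnitudes) and the k-means details are assumptions made for illustration.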
This approach is static because the partitioning is produced according to the network's weights only, using neither activations nor gradients. Here, we build off of this concurrent submission by working to bridge graph-based clusterability and functional modularity. To see how well neurons within each cluster share meaningful similarities, we introduce two proxies: importance and coherence. Importance refers to how crucial clusters are to the network's performance, and coherence to how consistently their neurons associate with input/output features.
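As a concrete illustration of the importance proxy, one simple measurement (a sketch under our own assumptions, not necessarily the exact protocol used in our experiments) is a lesion test: zero out the activations of a cluster's neurons and record the resulting drop in accuracy.

```python
import numpy as np

def lesion_importance(forward_fn, X, y, cluster_mask):
    """Importance of a neuron cluster, measured as the accuracy drop when
    the cluster's activations are zeroed out (a lesion/ablation test).

    forward_fn(X, mask) must run the network while multiplying the hidden
    activations elementwise by `mask` (1 = keep neuron, 0 = ablate).
    """
    keep_all = np.ones_like(cluster_mask)
    acc_full = np.mean(np.argmax(forward_fn(X, keep_all), axis=1) == y)
    acc_lesion = np.mean(np.argmax(forward_fn(X, 1.0 - cluster_mask), axis=1) == y)
    return acc_full - acc_lesion

# Toy 2-4-2 ReLU network: hidden neurons 0,1 carry the signal; 2,3 are dead.
W1 = np.array([[2.0, -2.0], [-2.0, 2.0], [0.0, 0.0], [0.0, 0.0]])
W2 = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])

def forward(X, mask):
    return (np.maximum(X @ W1.T, 0.0) * mask) @ W2.T

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (200, 2))
y = (X[:, 1] > X[:, 0]).astype(int)

drop_signal = lesion_importance(forward, X, y, np.array([1.0, 1.0, 0.0, 0.0]))
drop_dead = lesion_importance(forward, X, y, np.array([0.0, 0.0, 1.0, 1.0]))
# Ablating the signal-carrying cluster hurts accuracy; ablating the dead
# cluster leaves accuracy unchanged, so its importance is zero.
```

An important cluster thus yields a large accuracy drop under lesioning, while an unimportant one yields a drop near zero; the toy network and mask convention here are hypothetical.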

