WHAT'S IN THE BOX? EXPLORING THE INNER LIFE OF NEURAL NETWORKS WITH ROBUST RULES

Anonymous

Abstract

We propose a novel method for exploring how neurons within a neural network interact. In particular, we consider the activation values of a network for given data, and propose to mine noise-robust rules of the form X → Y, where X and Y are sets of neurons in different layers. To ensure we obtain a small and non-redundant set of high-quality rules, we formalize the problem in terms of the Minimum Description Length principle, by which we identify the best set of rules as the one that best compresses the activation data. To discover good rule sets, we propose the unsupervised EXPLAINN algorithm. Extensive evaluation shows that our rules give clear insight into how networks perceive the world: they identify shared and class-specific traits, compositionality within the network, as well as locality in convolutional layers. Our rules are easily interpretable, and also super-charge prototyping, as they identify which groups of neurons to consider in unison.

1. INTRODUCTION

Neural networks provide state-of-the-art performance in many settings. How they perform their tasks, how they perceive the world, and especially, how the neurons within the network operate in concert, however, remains largely elusive. While there exists a plethora of methods for explaining neural networks, most focus on the mapping between input and output (e.g. model distillation) or only characterize individual neurons within the network (e.g. prototyping). In this paper, we introduce a new method that explains how the neurons in a neural network interact. In particular, we consider the activations of neurons in the network over a given dataset, and propose to characterize these in terms of rules X → Y, where X and Y are sets of neurons in different layers of the network. A rule hence represents that neurons Y are typically active when neurons X are. For robustness we explicitly allow for noise, and to ensure that we discover succinct and non-redundant rules we formalize the problem in terms of the Minimum Description Length principle. To discover good rule sets, we propose the unsupervised EXPLAINN method. The rules that we discover give clear insight into how the network performs its task. As we will see, they identify what the network deems similar and different between classes, show how information flows within the network, and show which convolutional filters it expects to be active where. Our rules are easily interpretable, give insight into the differences between datasets, show the effects of fine-tuning, and super-charge prototyping, as they tell us which neurons to consider in unison.

Explaining neural nets is of widespread interest, and especially important with the emergence of applications in healthcare and autonomous driving. In the interest of space we here shortly introduce the work most relevant to ours; for more information we refer to surveys (Adadi & Berrada, 2018; Ras et al., 2018; Xie et al., 2020; Gilpin et al., 2018).
There exist several proposals for investigating how networks arrive at a decision for a given sample, with saliency mapping techniques for CNNs among the most prominent (Bach et al., 2015; Zhou et al., 2016; Sundararajan et al., 2017; Shrikumar et al., 2017). Although these provide insight on what parts of the image are used, they are inherently limited to single samples, and do not reveal structure across multiple samples or classes. For explaining the inner workings of a CNN, research mostly focuses on feature visualization techniques (Olah et al., 2017) that produce visual representations of the information captured by neurons (Mordvintsev et al., 2015; Gatys et al., 2015). Although these visualizations provide insight on how CNNs perceive the world (M. Øygard, 2016; Olah et al., 2018), it has been shown that concepts are often encoded over multiple neurons, and that inspecting individual neurons does not provide meaningful information about their role (Szegedy et al., 2013; Bau et al., 2017). How to find such groups of neurons, and how information is routed between layers, remains unsolved. An orthogonal approach is that of model distillation, where we train easy-to-interpret white-box models that mimic the decisions of a neural network (Ribeiro et al., 2016; Frosst & Hinton, 2017; Bastani et al., 2017; Tan et al., 2018). If-then rules are easily interpretable, and hence a popular technique for model distillation (Taha & Ghosh, 1999; Lakkaraju et al., 2017). Although some consider individual neurons to determine what rules to return (Robnik-Šikonja & Kononenko, 2008; Özbakır et al., 2010; Barakat & Diederich, 2005), all these techniques only yield rules that directly map input to output, and hence do not provide insight into how information flows through the network. Tran & d'Avila Garcez (2018) mine association rules from Deep Belief Networks.
Their approach, however, suffers from the pattern explosion typical of frequency-based rule mining, and is not applicable to state-of-the-art networks. Chu et al. (2018) aim to explain piecewise linear NNs (PLNNs) using polytope theory to derive decision features of the network. While it provides strong guarantees, this approach is limited to PLNNs of extremely small size (< 20 neurons in hidden layers). We instead propose to mine sets of rules to discover groups of neurons that act together across different layers in feedforward networks, and so reveal how information is composed and routed through the network to arrive at the output. To discover rules over neuron activations, we need an unsupervised approach. While a plethora of rule mining methods exists, either based on frequency (Agrawal & Srikant, 1994; Bayardo, 1998; Moerchen et al., 2011) or statistical testing (Hämäläinen, 2012; Webb, 2010), these typically return millions of rules even for small datasets, thus undermining the goal of interpretability. We therefore take a pattern set mining approach similar to GRAB (Fischer & Vreeken, 2019), where we are after that set of rules that maximizes a global criterion, rather than treating each rule independently. While GRAB provides succinct and accurate sets of rules, it is limited to conjunctive expressions. This traditional pattern language is too restrictive for our setting, as we are also after rules that explain shared patterns between classes, and must be robust to the inherently noisy activation data; both require a more expressive pattern language of conjunctions, approximate conjunctions, and disjunctions. We hence present EXPLAINN, a non-parametric and unsupervised method that discovers succinct sets of rules in this richer language.
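The MDL intuition that the best rule set is the one that best compresses the data can be illustrated with a toy two-part score. The encoding below (the helper `L_N`, the per-rule cost, and the naive residual code) is purely illustrative and is not the actual encoding of GRAB or EXPLAINN; it only shows why a rule that explains many co-activations can pay for its own description cost.

```python
import numpy as np

def L_N(n):
    """Crude code length (in bits) for a natural number; illustrative only."""
    return np.log2(n + 1) + 1

def description_length(db, rules, n_neurons):
    """Toy two-part MDL score over a binary activation matrix `db`
    (samples x neurons): bits to describe the rules, plus bits to
    transmit every active cell the rules fail to explain."""
    # model cost: each rule pays for listing the neurons in X and Y
    model = sum(L_N(len(X)) + L_N(len(Y))
                + (len(X) + len(Y)) * np.log2(n_neurons)
                for X, Y in rules)
    # data cost: naively encode every '1' not covered by any rule
    covered = np.zeros_like(db, dtype=bool)
    for X, Y in rules:
        fires = db[:, X].all(axis=1)             # samples where X is active
        covered[np.ix_(np.where(fires)[0], Y)] = True
    residual = np.logical_and(db == 1, ~covered).sum()
    return model + residual * np.log2(db.size + 1)
```

Under this score, a rule set is preferred over the empty model exactly when the bits it saves on the data exceed the bits needed to write the rules down, which is the selection pressure that keeps the returned rule set small and non-redundant.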

2. THEORY

We first informally discuss how to discover association rules between neurons. We then formally introduce the concept of robust rules and how to find them for arbitrary binary datasets. Last, we show how to combine these ideas to reveal how neurons are orchestrated within feedforward networks.

2.1. PATTERNS OF NEURON CO-ACTIVATION

Similar to neurons in the brain, artificial neurons send information along their outgoing edges when they are active. To understand the flow of information through the network, it is hence essential to understand the activation patterns of neurons between layers. Our key idea is to use recent advances in pattern mining to discover a succinct and non-redundant set of rules that together describe the activation patterns found for a given dataset. For two layers I_i, I_j, these rules X → Y, with X ⊂ I_i and Y ⊂ I_j, express that the neurons in Y are usually co-activated when the neurons in X are co-activated. That is, such a rule provides local information about co-activations within layers, as well as the dependence of neurons between layers. Starting from the output layer, we discover rules between consecutive layers I_j, I_{j-1}. Discovering overlapping rules between layers, X → Y and Y → Z with X ⊂ I_j, Y ⊂ I_{j-1}, Z ⊂ I_{j-2}, allows us to trace how information flows through the entire network.

Before we can mine rules between two sets of neurons (e.g. layers) I_i and I_j of a network, we have to obtain its binarized activations for a given dataset D = {d_k = (s_k, o_k)}. In particular, for each sample s_k and neuron set I_i, we take the tensor of activations φ_i and binarize it to φ_i^b. For networks with ReLU activations, which binarize naturally at threshold 0, we might lose some information about activation strength that is eventually used by subsequent layers, but binarization allows us to derive crisp, symbolic, and directly interpretable statements on how neurons interact. Furthermore, binarization reflects the natural on/off state of biological neurons, also captured by smooth step functions such as sigmoid or tanh used in artificial neural networks. We gather the binarized activations into a dataset
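The binarization step and the semantics of a rule X → Y can be sketched as follows. The helper names (`binarize`, `rule_confidence`) and the confidence-style check are illustrative assumptions; this is a minimal sketch of the setup, not the EXPLAINN miner itself.

```python
import numpy as np

def binarize(phi, threshold=0.0):
    """Binarize an (n_samples, n_neurons) activation matrix at `threshold`.
    For ReLU layers the natural threshold is 0: a neuron is 'on' iff its
    activation is strictly positive."""
    return (phi > threshold).astype(np.uint8)

def rule_confidence(db_x, db_y, X, Y):
    """Empirical confidence of X -> Y: among samples where all neurons in X
    fire, the fraction in which all neurons in Y fire as well."""
    x_on = db_x[:, X].all(axis=1)   # samples where X is fully co-activated
    if not x_on.any():
        return 0.0
    y_on = db_y[:, Y].all(axis=1)   # samples where Y is fully co-activated
    return float((x_on & y_on).sum() / x_on.sum())

# Example: fake ReLU activations for two consecutive layers on 6 samples.
rng = np.random.default_rng(0)
phi_j  = rng.standard_normal((6, 4))   # layer I_j
phi_j1 = rng.standard_normal((6, 5))   # layer I_{j-1}
db_j, db_j1 = binarize(phi_j), binarize(phi_j1)
conf = rule_confidence(db_j, db_j1, X=[0, 2], Y=[1])
assert 0.0 <= conf <= 1.0
```

Chaining works the same way: a rule X → Y mined between I_j and I_{j-1} can be followed by a rule Y → Z between I_{j-1} and I_{j-2} whenever the two share the neuron set Y, which is how overlapping rules trace information flow across the whole network.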

