CLIP-DISSECT: AUTOMATIC DESCRIPTION OF NEURON REPRESENTATIONS IN DEEP VISION NETWORKS

Abstract

In this paper, we propose CLIP-Dissect, a new technique to automatically describe the function of individual hidden neurons inside vision networks. CLIP-Dissect leverages recent advances in multimodal vision/language models to label internal neurons with open-ended concepts without the need for any labeled data or human examples. We show that CLIP-Dissect provides more accurate descriptions than existing methods for last-layer neurons, where the ground truth is available, as well as qualitatively good descriptions for hidden-layer neurons. In addition, our method is very flexible: it is model agnostic, can easily handle new concepts, and can be extended to take advantage of better multimodal models in the future. Finally, CLIP-Dissect is computationally efficient and can label all neurons from five layers of ResNet-50 in just 4 minutes, which is more than 10× faster than existing methods. Our code is available at https://github.com/Trustworthy-ML-Lab/CLIPdissect.

1. INTRODUCTION

Deep neural networks (DNNs) have demonstrated unprecedented performance in various machine learning tasks spanning computer vision, natural language processing, and application domains such as healthcare and autonomous driving. However, due to their complex structure, it has been challenging to understand why and how DNNs achieve such great success across numerous tasks. Understanding how trained DNNs operate is essential to trust their deployment in safety-critical tasks and can help reveal important failure cases or biases of a given model. One way towards understanding DNNs is to inspect the functionality of individual neurons, which is the focus of our work. This includes methods based on manual inspection (Erhan et al., 2009; Zeiler & Fergus, 2014; Zhou et al., 2015; Olah et al., 2017; 2020; Goh et al., 2021), which provide high-quality explanations and understanding of the network but require large amounts of manual effort. To address this issue, researchers have developed automated methods to describe the functionality of individual neurons, such as Network Dissection (Bau et al., 2017) and Compositional Explanations (Mu & Andreas, 2020). In (Bau et al., 2017), the authors first created a new dataset named Broden with pixel labels associated with a pre-determined set of concepts, and then used Broden to find neurons whose activation pattern matches that of a pre-defined concept. In (Mu & Andreas, 2020), the authors further extended Network Dissection to detect more complex concepts that are logical compositions of the concepts in Broden.
Although these methods based on Network Dissection can provide accurate labels in some cases, they have a few major limitations: (1) They require a densely annotated dataset, which is expensive and requires a significant amount of human labor to collect; (2) They can only detect concepts from a fixed concept set, which may not cover the important concepts for some networks, and it is difficult to expand this concept set as each concept requires corresponding pixel-wise labeled data. To address the above limitations, we propose CLIP-Dissect, a novel method to automatically dissect DNNs with unrestricted concepts without the need for any concept-labeled data. Our method is training-free and leverages pre-trained multimodal models such as CLIP (Radford et al., 2021) to efficiently identify the functionality of individual neuron units. We show that CLIP-Dissect (i) provides high-quality descriptions for internal neurons, (ii) is more accurate at labeling final-layer neurons where we know the ground truth, and (iii) is 10×-200× more computationally efficient than existing methods. Finally, we show how one can use CLIP-Dissect to better understand neural networks, and discover that neurons connected by a high weight usually represent similar concepts.
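The general idea of using a multimodal model to label neurons can be illustrated with a toy sketch. Note this is not the paper's exact scoring function: as a simplifying assumption, we use plain Pearson correlation between per-image concept scores (as a CLIP image-text similarity matrix would provide) and a neuron's summarized activations, and the function name `describe_neuron` is our own.

```python
import numpy as np

def describe_neuron(clip_scores, neuron_activations, concepts):
    """Pick the concept whose image-text scores across a probing set best
    correlate with a neuron's activations (a simplified, hypothetical
    stand-in for CLIP-Dissect's similarity function).

    clip_scores:        (N_images, N_concepts) similarity of each probing
                        image to each concept's text embedding.
    neuron_activations: (N_images,) summarized activation of one neuron.
    """
    # Standardize both sides so the dot product below is Pearson correlation.
    a = (neuron_activations - neuron_activations.mean()) / neuron_activations.std()
    P = (clip_scores - clip_scores.mean(axis=0)) / clip_scores.std(axis=0)
    corr = P.T @ a / len(a)  # per-concept correlation with the neuron
    return concepts[int(np.argmax(corr))]
```

In practice the similarity matrix would come from a real CLIP model's image and text encoders; the concept set is just a list of strings, which is what makes the approach easy to extend with new concepts.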

2. BACKGROUND AND RELATED WORK

Network dissection. Network dissection (Bau et al., 2017) is the first work on understanding DNNs by automatically inspecting the functionality (described as concepts) of each individual neuron¹. They identify concepts of intermediate neurons by matching the pattern of neuron activations to the patterns of pre-defined concept label masks. In order to define the ground-truth concept label masks, the authors build an auxiliary densely-labeled dataset named Broden, denoted D_Broden. The dataset contains a variety of pre-determined concepts c and images x_i with their associated pixel-level labels. Each pixel of an image x_i is labeled with a set of relevant concepts, which provides a ground-truth binary mask L_c(x_i) for each concept c. Based on the ground-truth concept mask L_c(x_i), the authors propose to compute the intersection over union (IoU) score between L_c(x_i) and the binarized mask M_k(x_i) from the activations of the k-th neuron unit over all images x_i in D_Broden: IoU_{k,c} = ( Σ_{x_i ∈ D_Broden} |M_k(x_i) ∩ L_c(x_i)| ) / ( Σ_{x_i ∈ D_Broden} |M_k(x_i) ∪ L_c(x_i)| ). If IoU_{k,c} > η, then neuron k is identified as detecting concept c. In (Bau et al., 2017), the authors set the threshold η to 0.04. Note that the binary masks M_k(x_i) are computed via thresholding the spatially scaled activations S_k(x_i) > ξ, where ξ is the top 0.5% largest activation for neuron k over all images x_i ∈ D_Broden, and S_k(x_i) has the same resolution as the pre-defined concept masks, obtained by interpolating the original neuron activations A_k(x_i). (Bau et al., 2020) propose another version of Network Dissection, which replaces the human-annotated labels with the outputs of a segmentation model. This gets rid of the need for dense annotations.
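The IoU matching described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `dissect_neuron` is our own, activations are assumed to be already upsampled to the mask resolution, and the top-0.5% threshold is taken as a simple global quantile.

```python
import numpy as np

def dissect_neuron(activations, concept_masks, top_quantile=0.995, eta=0.04):
    """Assign concepts to one neuron via Network Dissection-style IoU matching.

    activations:   (N, H, W) array of the neuron's activation maps S_k(x_i),
                   already interpolated to the concept-mask resolution.
    concept_masks: dict mapping concept name -> (N, H, W) boolean ground-truth
                   masks L_c(x_i).
    """
    # xi = top 0.5% of this neuron's activations over the whole dataset;
    # thresholding gives the binary masks M_k(x_i).
    xi = np.quantile(activations, top_quantile)
    M = activations > xi

    detected = {}
    for concept, L in concept_masks.items():
        inter = np.logical_and(M, L).sum()
        union = np.logical_or(M, L).sum()
        iou = inter / union if union > 0 else 0.0
        if iou > eta:  # eta = 0.04 as in Bau et al. (2017)
            detected[concept] = iou
    return detected
```

Because the threshold and IoU are computed over the entire probing dataset rather than per image, a neuron is only labeled with a concept whose spatial pattern it matches consistently, not one it happens to fire on once.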



¹ We follow previous work and use "neuron" to describe a channel in CNNs.



Figure 1: Labels generated by our method CLIP-Dissect, NetDissect (Bau et al., 2017) and MILAN (Hernandez et al., 2022) for random neurons of ResNet-50 trained on ImageNet, displayed together with the 5 most highly activating images for that neuron. We have subjectively colored the descriptions green if they match these 5 images, yellow if they match but are too generic, and red if they do not match. In this paper we follow the torchvision (Marcel & Rodriguez, 2010) naming scheme of ResNet: Layer 4 is the second-to-last layer and Layer 1 is the end of the first residual block. MILAN(b) is trained on both ImageNet and Places365 networks, while MILAN(p) is only trained on Places365.

