TAKE 5: INTERPRETABLE IMAGE CLASSIFICATION WITH A HANDFUL OF FEATURES

Abstract

Deep neural networks use thousands of mostly incomprehensible features to identify a single class, a decision no human can follow. We propose an interpretable, sparse, and low-dimensional final decision layer in a deep neural network with measurable aspects of interpretability and demonstrate it on fine-grained image classification. We argue that a human can only understand the decision of a machine learning model if the features are interpretable and only very few of them are used for a single decision. For that, the final layer has to be sparse and, to make interpreting the features feasible, low-dimensional. We call a model with a Sparse Low-Dimensional Decision layer an "SLDD-Model". We show that an SLDD-Model is easier to interpret locally and globally than a dense high-dimensional decision layer while maintaining competitive accuracy. Additionally, we propose a loss function that improves a model's feature diversity and accuracy. Our more interpretable SLDD-Model uses only 5 out of just 50 features per class, while maintaining 97% to 100% of the accuracy of the baseline model with 2048 features on four common benchmark datasets.

1. INTRODUCTION

Understanding the decision of a deep learning model is becoming more and more important. Especially for safety-critical applications such as the medical domain or autonomous driving, practitioners are often required, either legally (Bibal et al., 2021) or by their own standards, to be able to trust the decision and evaluate its reasoning (Molnar, 2020). Due to the high dimensionality of images, most previous work on interpretable models for computer vision combines the deep features computed by a deep neural network with a method that is considered interpretable, such as a prototype-based decision tree (Nauta et al., 2021). While approaches for measuring interpretability without humans exist for conventional machine learning algorithms (Islam et al., 2020), they are missing for methods involving deep neural networks. In this work, we propose a novel sparse and low-dimensional SLDD-Model which offers measurable aspects of interpretability. The key aspect is a heavily reduced number of features, out of which only very few are considered per class. Humans can only consider 7 ± 2 aspects at once (Miller, 1956) and could therefore follow a decision that uses that many features. To be intelligible for all humans, we aim for an average of 5 features per class. Having a reduced number of features makes it feasible to investigate every single feature and understand its meaning: we are able to align several of the learned features with human concepts post hoc. The combination of reduced features and sparsity therefore increases both global ("How does the model behave?") and local ("Why did the model make this decision?") interpretability, as demonstrated in Figure 1. Our proposed method generates the SLDD-Model by utilizing glm-saga (Wong et al., 2021) to compute a sparse linear classifier for selected features, which we then finetune to the sparse structure.
We apply feature selection instead of a transformation to reduce the computational load and preserve the original semantics of the features, which can improve interpretability (Tao et al., 2015), especially if a more interpretable model like B-cos Networks (Böhle et al., 2022) is used. Additionally, we propose a novel loss function for more diverse features, which is especially relevant when one class depends on very few features, since using more redundant features limits the total information available for the decision. Our main contributions are as follows:

• We present a pipeline that produces a model with increased global and local interpretability, which identifies a single class with just a few features, e.g. 5, of its low-dimensional representation. We call the resulting model SLDD-Model.

• Our novel feature diversity loss ensures diverse features. This increases the accuracy in the extremely sparse case.

• We demonstrate the competitive performance of our proposed method on four common benchmark datasets in the domain of fine-grained image classification as well as ImageNet-1K (Russakovsky et al., 2015), and show that several learned features for algorithmic decision-making can be directly connected to attributes humans use.

• Code will be published upon acceptance.

2. RELATED WORK

2.1. FINE-GRAINED IMAGE CLASSIFICATION

Fine-grained image classification describes the problem of differentiating similar classes from one another. It is more challenging than conventional image recognition tasks (Lin et al., 2015) since the differences between classes are much smaller. To tackle this difficulty, several adaptations to the common image classification approach have been applied. They usually involve learning more discriminative features by adding a term to the loss function (Chang et al., 2020; Liang et al., 2020; Zheng et al., 2020), introducing hierarchy into the architecture (Chou et al., 2022), or using expensive expert knowledge (Chen et al., 2018; Chang et al., 2021). Chang et al. (2020) divide the features into groups such that every group is assigned to exactly one class. During training, an additional loss increases the activations of features for samples of their assigned class and reduces the overlap of feature maps in each group. Liang et al. (2020) create class-specific filters by inducing sparsity in the features. Both Chang et al. (2020) and Liang et al. (2020) optimize for class-specific filters, which are neither suitable for the low-dimensional case, where the number of classes exceeds the number of features, nor interpretable, since it is unclear whether a filter already detects the class rather than a lower-level feature. The Feature Redundancy Loss (FRL) (Zheng et al., 2020) enforces the K most used features to be localized differently by reducing the normalized inner product between their feature maps. This adds a hyperparameter and does not optimize all features at once.

2.2. INTERPRETABLE MACHINE LEARNING

Interpretable machine learning is a broad term and can refer to both models that are interpretable by design and post-hoc methods that try to understand what a model has learned. Furthermore, interpretability can be classified as the interpretability of a single instance (local) or of the entire model (global) (Molnar, 2020). In this work, we present methods making models more interpretable by design, but we also utilize post-hoc methods to offer local and global interpretability. Common local post-hoc methods are saliency maps like Grad-CAM (Selvaraju et al., 2017) that aim to show what part of the input image is relevant for the prediction. While they can be helpful, they have to be interpreted cautiously, as they do not exhibit many properties one would expect from an explanation, such as shift invariance (Kindermans et al., 2019) or producing reasonable explanations only when the model is working as intended (Adebayo et al., 2018). Another way of obtaining saliency maps is based on masking the input image and measuring the impact on the output (Zeiler & Fergus, 2014; Fong & Vedaldi, 2017; Jain et al., 2022). As a global post-hoc method, Elude (Ramaswamy et al., 2022) generates an explanation for a model by mimicking its behavior with a sparse model. This model uses additional attributes and main directions of the remaining feature space of the model as input. Instead of explaining a model, we directly train the more interpretable model in this work. Another line of research tries to align learned representations with human-understandable concepts from an additional labeled dataset (Kim et al., 2018; Bau et al., 2017; McGrath et al., 2021), increasing the global interpretability of the model.
To the best of our knowledge, measuring the interpretability of a deep neural network is an open task, as previous work focuses on measuring the quality of explanations of black boxes (Rokade & Alluri, 2021) or on conventional machine learning algorithms (Islam et al., 2020), where increased interpretability is attributed to reduced model complexity, measured e.g. via the number of operations (Yang et al., 2017; Friedler et al., 2019; Rüping et al., 2006) or the number of features (Rüping et al., 2006). The sparsity and low dimensionality of our proposed SLDD-Model are motivated by these findings. Due to the limitations of post-hoc methods in explaining a deep neural network, models that are more interpretable by design are becoming more relevant. Nauta et al. (2021) and Yang et al. (2022) used a deep feature extractor in combination with prototype-based decision trees to achieve state-of-the-art performance on fine-grained image classification while increasing interpretability. However, decision trees struggle to model linear dependencies, and it is hard to globally interpret an ensemble of deep decision trees. Kim et al. (2021) and Hoffmann et al. (2021) additionally indicate a gap between the perceived similarities of humans and prototype-based models. Concept bottleneck models (CBM) (Koh et al., 2020) first predict concepts annotated in the dataset and then use a simple model to predict the target class from the concepts. CBM-AUC (Sawada & Nakamura, 2022) extended CBM by allowing unsupervised concepts to influence the decision. PCBM (Yuksekgonul et al., 2022) created a post-hoc CBM using TCAV (Kim et al., 2018) to compress high-dimensional learned features into a concept bottleneck. Margeloiu et al. (2021) and Elude (Ramaswamy et al., 2022) both suggest that training the CBM end-to-end leads to the encoding of additional information next to the concepts, which reduces the interpretability.
In contrast to CBM, our proposed method does not require additional labels for training and leads to a very sparse decision process. While their features are generally more aligned with the given concepts, they also need to be analyzed thoroughly.

2.2.1. Glm-saga

Wong et al. (2021) developed glm-saga, a method to efficiently fit a heavily regularized sparse layer to the computed features of a backbone feature extractor by combining the path algorithm of Friedman et al. (2010) with advancements in variance-reduced gradient methods by Gazagnadou et al. (2019). They showed that human understanding is more aligned with the decision process of the sparse model and that sparsely shared features can be more easily aligned with human concepts. Additionally, they reached levels of sparsity in the final layer, with competitive accuracy, that network-wide sparsity methods do not obtain (Gale et al., 2019). For precomputed and normalized features, glm-saga computes a series of n sparse linear classifiers P = [(W^sparse_1, b_1), (W^sparse_2, b_2), ..., (W^sparse_n, b_n)], where the sparsity of W^sparse_i decreases with i. This series is called the regularization path. Each of the models minimizes the elastic net loss

L = L_target + λ R(W),    R(W) = (1 − α) · (1/2)‖W‖²_F + α ‖W‖_{1,1}

with the initial optimization goal L_target (in our case the cross-entropy loss) and regularization strength λ, which decreases along the path. The regularization function R(W) with weighting factor α ∈ [0, 1] is known as the Elastic Net (Zou & Hastie, 2005). Glm-saga optimizes the problem iteratively, clipping entries in W with an absolute value below a threshold after each step to ensure real sparsity. For reference, the pseudocode for glm-saga is included in Appendix C. Since their aim is understanding the neural network, they fit the sparse layer to fixed features and do not finetune the features to the sparse layer, which requires different optimization strategies than dense networks (Tessera et al., 2021). In this work, we utilize glm-saga to create a more interpretable model with competitive accuracy by applying it to selected features with higher diversity and finetuning the features afterwards.
This leads to improved accuracy and enables higher sparsity, which, combined with the reduced number of features, increases interpretability.
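For concreteness, the elastic-net objective that glm-saga minimizes can be sketched in a few lines. This is a minimal NumPy illustration; the function names are ours, not part of the glm-saga API:

```python
import numpy as np

def elastic_net_penalty(W, alpha):
    # R(W) = (1 - alpha) * 1/2 * ||W||_F^2 + alpha * ||W||_{1,1}
    return (1 - alpha) * 0.5 * np.sum(W ** 2) + alpha * np.sum(np.abs(W))

def elastic_net_objective(target_loss, W, lam, alpha):
    # L = L_target + lambda * R(W); lambda decreases along the regularization path
    return target_loss + lam * elastic_net_penalty(W, alpha)
```

With alpha close to 1 the L1 term dominates, which is what drives entries of W to exactly zero along the path.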

3.1. PROBLEM FORMULATION

We apply the proposed SLDD-Model to the domain of fine-grained image classification. We consider the problem of classifying an image I ∈ R^{3×w×h} of width w and height h into one class c ∈ {c_1, c_2, ..., c_{n_c}} using a trainable deep neural network Φ. This neural network extracts the feature maps M ∈ R^{n_f×w_M×h_M} and aggregates them into the feature vector f ∈ R^{n_f}. Then it applies the trainable neural network C to obtain the final output y ∈ R^{n_c} as y = C(f).
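The data flow above can be sketched with toy tensors (a NumPy sketch with made-up sizes; a Resnet50 backbone would give n_f = 2048):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sizes; real backbones produce many more, larger feature maps
n_f, h_M, w_M, n_c = 8, 7, 7, 4

# feature maps M produced by the backbone Phi for one image
M = rng.random((n_f, h_M, w_M))

# global average pooling aggregates the maps into the feature vector f
f = M.mean(axis=(1, 2))              # shape: (n_f,)

# the classifier C here is a linear layer: y = W f + b
W = rng.standard_normal((n_c, n_f))
b = np.zeros(n_c)
y = W @ f + b                        # logits, shape: (n_c,)
predicted_class = int(np.argmax(y))
```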

3.2. SLDD-Model

(Figure 2 pipeline stages: Train Dense Model with L_div → Sparsify Layer → Finetune Features)

We propose a flexible, generally applicable method for generating a more locally and globally interpretable model with no need for additional labels and an adjustable tradeoff between interpretability and accuracy. We make the decision process more interpretable by only using n*_f features with n*_f ≪ n_f and using an interpretable classifier C. At the core of an interpretable classifier C lies a linear layer y = W f + b with the weight matrix W ∈ R^{n_c×n*_f} and bias b ∈ R^{n_c}. In order for it to be interpretable, W has to be very sparse, meaning the number of non-zero entries n_w has to be very low. Miller (1956) showed that humans can handle 7 ± 2 cognitive aspects at once, which constitutes an appropriate upper bound on the average number of relevant features per class n_wc = n_w / n_c. In our work we focus on n_wc ≤ 5.


The pipeline of our approach is presented in Figure 2 and utilizes glm-saga for sparsification and feature selection. We first train a deep neural network with our proposed feature diversity loss L_div until convergence. Then the features F_train ∈ R^{n_T×n_f} for all n_T images in the training set are computed as the average-pooled feature maps M. Afterwards, the features are selected as described in Section 3.2.2, and glm-saga, presented in Section 2.2.1, is used to calculate the regularization path. Finally, the solution with the desired sparsity is selected from the regularization path, the final layer is set to this sparse model, and the remaining layers are finetuned so that the features adapt to it.

3.2.1. FEATURE DIVERSITY LOSS

The goal of the proposed feature diversity loss L_div is that every feature captures a different, independent concept. This is achieved by enforcing features to be localized differently in their feature maps. The proposed loss is motivated by the Mutual-Channel Loss (MCL) (Chang et al., 2020) and the Feature Redundancy Loss (FRL) (Zheng et al., 2020). In contrast to FRL, we use Cross-Channel Max-Pooling (CCMP) (Goodfellow et al., 2013) over all weighted feature maps to optimize all features jointly and reduce the need for the hyperparameter K. MCL also uses CCMP, but instead of grouping the channels into class-specific filters, we apply the diversity component to all feature maps M to aim for shared interpretable features. For notation, w_ĉl describes the entry in W that is assigned to the specific feature l ∈ {0, 1, ..., n_f − 1} for the predicted class ĉ = argmax(y), and m^l_ij = M_{l,i,j}.

ŝ^l_ij = ( exp(m^l_ij) / Σ_{i'=1}^{h_M} Σ_{j'=1}^{w_M} exp(m^l_i'j') ) · ( f^l / max f ) · ( |w_ĉl| / ‖w_ĉ‖_2 )    (3)

L_div = − Σ_{i=1}^{h_M} Σ_{j=1}^{w_M} max(ŝ^1_ij, ŝ^2_ij, ..., ŝ^{n_f}_ij)    (4)

Equation 3 uses the softmax to transform the feature maps M by normalizing their entries m^l_ij over the spatial dimensions, and then scales the maps so that they focus on visible and important features: the relative mean of M is maintained while the maps are weighted according to the predicted class, such that, unlike in MCL, absent features do not have to be localized in small background patches. Equation 4 decreases the loss if the different weighted feature maps Ŝ^l attend to different locations. The final training loss is then L_final = L_CE + β L_div with the weighting factor β ∈ R^+.
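A minimal NumPy sketch of L_div for a single sample, following Equations 3 and 4 (the function name is ours; a real implementation would operate on batched, differentiable tensors):

```python
import numpy as np

def feature_diversity_loss(M, W, y_logits):
    """Sketch of L_div for one sample.

    M: feature maps, shape (n_f, h_M, w_M)
    W: final-layer weights, shape (n_c, n_f)
    y_logits: model output, shape (n_c,)
    """
    n_f = M.shape[0]
    c_hat = int(np.argmax(y_logits))              # predicted class c-hat
    f = M.mean(axis=(1, 2))                       # pooled features

    # Eq. 3, first factor: spatial softmax per feature map
    flat = M.reshape(n_f, -1)
    e = np.exp(flat - flat.max(axis=1, keepdims=True))
    s = e / e.sum(axis=1, keepdims=True)

    # Eq. 3, remaining factors: relative feature strength and class weight
    w_c = W[c_hat]
    scale = (f / f.max()) * (np.abs(w_c) / np.linalg.norm(w_c))
    s_hat = s * scale[:, None]

    # Eq. 4: CCMP over features; loss decreases when maps attend to
    # different locations
    return -np.sum(s_hat.max(axis=0))
```

Two identically localized maps yield a smaller max-pooled sum (higher loss) than two maps peaked at different positions, which is exactly the diversity incentive.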

3.2.2. FEATURE SELECTION

For selecting the set of features N_f* from the initial features N_f such that |N_f*| = n*_f, we run an adapted version of glm-saga, introduced in Section 2.2.1, until one solution (W^sparse_j, b_j) of the regularization path uses a feature not already in N_f*, which we then add to the set of selected features N_f* before restarting the adapted glm-saga. As adaptation, we extended the proximal operator of the group version of glm-saga, which operates on w_l = W_{:,l}, the entries in W that correspond to an entire feature l. Since ‖w_l‖_2 indicates the importance of l, we additionally only keep entries for features that have the maximum norm or are in N_f*, such that exactly one feature is added per iteration. The resulting proximal operator with λ_1 = γλα and λ_2 = γλ(1 − α) is:

Prox_{λ1,λ2}(w_i) = { w_i (‖w_i‖_2 − λ_1) / ((1 + λ_2) ‖w_i‖_2)   if ‖w_i‖_2 > λ_1 ∧ (‖w_i‖_2 = max_{j∈N_f\N_f*} ‖w_j‖_2 ∨ i ∈ N_f*)
                    { 0                                              otherwise    (5)

The extensions relative to the original group operator are the maximum-norm and membership conditions, and γ is the learning rate of glm-saga.
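The adapted operator of Equation 5 can be sketched as follows (a NumPy sketch with a name of our choosing; the real glm-saga operator works inside its SAGA loop):

```python
import numpy as np

def prox_select(W, selected, lam1, lam2):
    """Sketch of the adapted group proximal operator (Eq. 5).

    W: weight matrix, shape (n_c, n_f); selected: indices already in N_f*.
    Keeps group-shrunk columns only for already-selected features and for
    the single unselected feature with the largest L2 norm, so at most one
    new feature can enter per application.
    """
    norms = np.linalg.norm(W, axis=0)            # ||w_l||_2 per feature l
    unselected = [j for j in range(W.shape[1]) if j not in selected]
    max_unsel = max(norms[j] for j in unselected) if unselected else -np.inf

    out = np.zeros_like(W)
    for l in range(W.shape[1]):
        # Eq. 5 condition: above threshold AND (selected OR maximum norm)
        if norms[l] > lam1 and (l in selected or norms[l] == max_unsel):
            out[:, l] = W[:, l] * (norms[l] - lam1) / ((1 + lam2) * norms[l])
    return out
```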

4. EXPERIMENTS

This section contains our experimental results. We validate our method using Resnet50 (He et al., 2016), DenseNet121 (Huang et al., 2017), and Inception-v3 (Szegedy et al., 2016) on four common benchmark datasets in the domain of fine-grained image classification. Additionally, we show the applicability of an SLDD-Model to large-scale datasets like ImageNet-1K (Russakovsky et al., 2015). An overview of CUB-2011 (Wah et al., 2011), Stanford Cars (Krause et al., 2013), FGVC-Aircraft (Maji et al., 2013), NABirds (Van Horn et al., 2015), and ImageNet-1K is given in Table 1. Additionally, CUB-2011 contains attribute labels for the images (e.g. "red wing") as well as for the classes, which makes it easier to measure the alignment with understandable concepts. After the competitive accuracy and the impact of L_div are shown, the interpretability of the SLDD-Model is discussed and these attributes are used to show the alignment of the learned features. The implementation details can be found in Appendix B.1. Finally, the tradeoff between interpretability and accuracy is visualized.

4.1. DIVERSITY METRIC

To assess the impact of L_div, we developed a measurement for the local diversity of the feature maps M that led to the decision, inspired by the diversity component of MCL (Chang et al., 2020), which entails a different way of computing the features. For that, we consider the k feature maps M^k that are weighted the highest for the predicted class ĉ in W. To compare only the localization, softmax is applied to the M^k, yielding S^k. With these distributions S^k, we compute the diversity as

diversity@k = ( Σ_{i=1}^{h_M} Σ_{j=1}^{w_M} max(s^1_ij, s^2_ij, ..., s^k_ij) ) / k    (6)

with diversity@k ∈ [1/k, 1], measuring how differently and how distinctly the M^k are localized. Since we focus on n_wc ≤ 5, we set k = 5. We report the mean diversity@5 over all classes that use at least five features. Note that the proposed L_div is a weighted version of diversity@n_f.
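Equation 6 translates directly into code (a NumPy sketch; the function name is ours):

```python
import numpy as np

def diversity_at_k(M_k):
    """diversity@k for the k highest-weighted feature maps (Eq. 6).

    M_k: array of shape (k, h_M, w_M). Returns a value in [1/k, 1]:
    1/k when all maps are localized identically, 1 when each map puts
    all of its mass on a distinct location.
    """
    k = M_k.shape[0]
    flat = M_k.reshape(k, -1)
    e = np.exp(flat - flat.max(axis=1, keepdims=True))
    s = e / e.sum(axis=1, keepdims=True)      # spatial softmax per map
    return np.sum(s.max(axis=0)) / k          # CCMP, then normalize by k
```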

4.2. RESULTS

We report the accuracy on the test set for the dense model after training the pretrained model on the training data with n_wc = n_f, for the sparse model with n_wc ≤ 5, and for the result of our whole pipeline, the model with finetuned features, obtained by training the sparse model on the training data, still with n_wc ≤ 5. Every shown metric is the average over five (four for ImageNet-1K) randomly seeded runs. The standard deviations are included in the appendix. Table 3 shows the competitive performance of our SLDD-Model compared to the dense Resnet50. It is evident that an extreme sparsity of s = 5/2048 can be obtained in the final layer with just 0.1 to 0.4 percentage points less accuracy. Additionally decreasing the number of features by 97.6%, resulting in just 50 instead of the previous 2048 features, reduces the accuracy compared to the dense model by only 1.3 to 2.7 percentage points. Finally, our proposed L_div improves the accuracy for all sparse models. Table 4 shows the general applicability of our method with different backbones. However, we observed some instability and no increased accuracy when finetuning the DenseNet121 with n*_f = 2048, showing a positive effect of sharing features. Table 2 compares our approach to competitors: without requiring additional supervision, we achieve competitive performance compared to CBM (Koh et al., 2020)-based methods, while achieving a lower dimensionality and higher sparsity. Additionally, we improve the accuracy of glm-saga with heavily reduced n*_f. For ImageNet-1K, we skipped the dense training and directly used the pretrained model. The good scalability of our proposed method to this large dataset with a higher number of classes is displayed in Table 5. Table 6 shows the diversity@5: with L_div it is very close to the maximum value of 100% in the dense case and still heavily increased in the sparse cases.
This showcases that L_div is suitable to ensure a diverse localization and improved interpretability of the used feature maps, which is visualized in Figures 9 to 17. Finally, the total number of features used by the unrestricted (n*_f = n_f) models in Table 3 is reduced from 912 to 719 for the models with L_div, but still shows a high number of class-specific features. That L_div leads to more shared features supports our motivation of enforcing different features to capture different concepts.

4.2.1. COMPARISON WITH OTHER LOSS FUNCTIONS

We compare our diversity loss L_div with MCL (Chang et al., 2020) and FRL (Zheng et al., 2020). The hyperparameters used for the loss functions are reported in Appendix B.1.1; we focused on three datasets to save computational resources. Table 3 shows that our L_div reaches the highest accuracy across all datasets. Notably, the accuracy reported in (Chang et al., 2020) for the MC-Loss is achieved by a two-layer MLP plus additional techniques, whereas we only use one layer to ensure linearly separable representations for our SLDD-Model. Although it is expected that applying L_div has a positive effect on diversity@5 due to the similar formulation, we observed a remarkable uplift in diversity@5 (Table 6) compared to MCL and FRL, which also optimize for differently localized features.

4.3. INTERPRETABILITY

In this section, we discuss the interpretability of the proposed SLDD-Model using example models. The interpretability of the proposed SLDD-Model is based on using very few (n_wc) features from a small pool of n*_f to make a decision. A low n*_f makes it feasible to analyze the remaining features and try to align them with human-understandable concepts, which is discussed in Section 4.3.1. Since the sparse linear layer is easily interpretable, the complete model with sufficiently well-understood features is both locally and globally interpretable. For global interpretability, the final layer of the SLDD-Model can be fully visualized and analyzed: Figure 3a shows W^sparse. This allows the practitioner to verify the global behavior of the model, for example via the feature aligned with "has-bill-shape:needle" in the model presented in Section 4.3.1; if a feature is aligned well, this leads to knowledge discovery. The local interpretability describes the explanation of a single decision made by the model. Decisions with sparsely connected features are inherently locally interpretable if the features can be interpreted and localized, as shown in Figure 1. The practitioner can understand where and what was found in the image and, due to the full global interpretability, also understand the behavior around the current example.

4.3.1. FEATURE ALIGNMENT

In this section, we demonstrate how the features of the proposed SLDD-Model can be aligned with interpretable concepts. We describe how one can use additional labels or expert knowledge to interpret the features and demonstrate that several learned sparse features are directly connected to attributes relevant to humans. Thus, our model learns such abstract concepts directly from the data. Overall, due to the very limited number of used features n*_f, the features can and should be thoroughly analyzed and interpreted to facilitate interpretability. For feature localization, we follow a masking approach similar to Fong & Vedaldi (2017), which is described in Appendix D.

Alignment with Additional Data

We use the attributes A contained in CUB-2011 to align the learned features with these labels after the finetuning. For each attribute a ∈ A and feature j, we compute a score C_aj that corresponds to the relative increase of the feature when the attribute is present:

δ_aj = (1/|ρ_a+|) Σ_{i∈ρ_a+} F^train_{i,j} − (1/|ρ_a−|) Σ_{i∈ρ_a−} F^train_{i,j}

C_aj = δ_aj / ( max(F^train_{:,j}) − min(F^train_{:,j}) )

The set of indices of images that contain the attribute is denoted by ρ_a+. We considered an attribute to be present if the human annotated it with "probably" or "definitely"; annotations with "guessing" were included in neither the positive (ρ_a+) nor the negative (ρ_a−) examples. For one exemplary model, a part of the matrix of C values is displayed in Figure 3b. It is clear that some features correspond to colors, some to specific shapes like "bill-shape:needle", and other features do not correlate with specific attributes. Figure 4a visually validates the connection that was implied in Figure 3b.

Manual Alignment

To ensure that the entirety of a feature is understood, or in the absence of additional data, the features have to be manually aligned for increased interpretability. This is enabled by the low number of features and the sparsity. Some useful aspects for understanding a feature are its localization, extreme examples, feature visualization (Olah et al., 2017), and which classes use the feature. One such alignment for a model trained on FGVC-Aircraft is displayed in Appendix E and Figure 4b. Aligning learned features with human-understandable concepts is still challenging, as a single feature can refer to multiple aspects and human-understandable concepts do not need to be axis-aligned (Szegedy et al., 2013). However, the low dimensionality of the remaining features allows a sophisticated analysis of every feature in practice, which could even discover spurious correlations as features, as done in the glm-saga publication.
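The attribute-alignment score C_aj described above reduces to a few array operations (a NumPy sketch; the function name is ours):

```python
import numpy as np

def attribute_alignment(F_train, present, absent, j):
    """Score C_aj: relative increase of feature j when attribute a is present.

    F_train: feature matrix of shape (n_T, n_f)
    present/absent: index lists rho_a+ and rho_a- (images with/without the
    attribute; "guessing" annotations belong to neither)
    """
    # delta_aj: mean activation difference between the two groups
    delta = F_train[present, j].mean() - F_train[absent, j].mean()
    # normalize by the feature's activation range over the training set
    return delta / (F_train[:, j].max() - F_train[:, j].min())
```

A score near 1 means the feature fires almost exclusively when the attribute is present; scores near 0 indicate no correlation.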

4.4. INTERPRETABILITY TRADEOFF

In this section, we analyze the impact of changing n*_f and n_wc on the finetuning accuracy of the model trained with L_div, shown in Figure 5. Figure 5a visualizes the impact of n*_f: with decreasing n*_f, the accuracy drops slowly until a dataset-specific threshold is reached, at which a steep decline starts. Additionally, the proposed L_div works regardless of n*_f. Figure 5b shows the finetuning accuracy in relation to n_wc: the accuracy is rather insensitive to n_wc and only decreases when n_wc < 5. This holds for both dimensionalities and showcases that five features suffice for a competitive model even if the features are shared among classes. Figure 5 demonstrates the tradeoff that our SLDD-Model offers: both n_wc and n*_f can be drastically reduced with either a negligible or small impact on accuracy, to adapt to the amount of interpretability needed.

5. LIMITATIONS AND FUTURE WORK

As shown in Figure 5, the SLDD-Model cannot get arbitrarily low-dimensional or sparse with competitive accuracy via our proposed method. The optimal sparsity and dimensionality for a given problem are hard to predict and might require some experiments to determine the minimum values for competitive accuracy. Aligning all used features with human concepts is still difficult, albeit more feasible than without an SLDD-Model. Future work could use a more interpretable feature extractor like B-cos Networks (Böhle et al., 2022) to alleviate that problem. It seems promising to apply an SLDD-Model to other safety-critical domains, such as the medical domain, where an expert can be utilized to align the features and follow the decision, as it can help bring the required interpretability and trustworthiness to the domain. Embodied autonomous agents can also benefit from it, as the entire decision process can be thoroughly analyzed. While more interpretable models could also be misused to deliberately cause harm, they can disclose existing problems with machine learning models and open up the opportunity to build fair and trustworthy models. Finally, sparsity and dimensionality could be part of metrics used to quantify the trustworthiness of a model.

6. CONCLUSION

In this work, we proposed the more interpretable sparse low-dimensional decision model (SLDD-Model) to allow a human to follow and understand the decision of a deep neural network for image classification. Our proposed pipeline constructs an SLDD-Model with drastically increased global and local interpretability while still showing competitive accuracy. As demonstrated, a practitioner can manually configure the pipeline to set the tradeoff between accuracy and interpretability. Our novel loss increases the feature diversity, and we showed that identifying a class with varied features can improve the accuracy. Finally, our SLDD-Model offers measurable aspects of interpretability, which allows future work to compare not only on accuracy but also on interpretability.

A APPENDIX

In this appendix, we provide implementation details and standard deviations for the experiments. Additionally, the pseudocode for glm-saga (Wong et al., 2021) is shown. Finally, we present the feature visualization technique and more ablations on L div .

B DETAILED RESULTS

The full results of Section 4.2 with the standard deviations are presented in Tables 7 to 13. The reported standard deviations are, except for DenseNet121 on NABirds as mentioned in Section 4.2, generally rather small compared to the differences in means, which supports our conclusions. We also show exemplary images with the 5 most important features for the dense and sparse conventional model in Figures 9 to 17. The comparison with the finetuned SLDD-Model shows an improved localization and interpretability of the features. The finetuned SLDD-Model in Figure 10 seems to use features that each individually focus more on the chest, lower belly, head, bill, or crown, whereas for the dense and sparse models the different features focus on the same regions. This increased diversity@5, and with it interpretability, was also measured in Section 4.2.

B.1 IMPLEMENTATION DETAILS

We use PyTorch (Paszke et al., 2019) to implement our methods and ImageNet-pretrained models as backbone feature extractors. We utilized glm-saga and robustness (Engstrom et al., 2019). The images are resized to 448 × 448 (299 × 299 for Inception-v3, 224 × 224 for ImageNet-1K), normalized, randomly horizontally flipped, and jitter is applied. The model is finetuned using stochastic gradient descent on the specific dataset for 150 (100 for NABirds) epochs with a batch size of 16 (64 for ImageNet-1K), starting with a learning rate of 5·10^−3 for the pretrained layers and 0.01 for the final linear layer; both are multiplied by 0.4 every 30 epochs. Additionally, we used a momentum of 0.9, ℓ2-regularization of 5·10^−4, and a dropout rate of 0.2 on the features to reduce dependencies.

Algorithm 1: Pseudocode from glm-saga (Wong et al., 2021)
1: Initialize table of scalars a_i = 0 for i ∈ [n]
2: Initialize average gradients g_avg = 0 and g0_avg = 0
3: for minibatch B ⊂ [n] do
4:   for i ∈ B do
5:     a'_i = x_i^T β + β_0 − y_i
6:     g'_i = a'_i · x_i      // calculate new gradient information
7:     g_i = a_i · x_i        // calculate stored gradient information
8:   end for
9:   g' = (1/|B|) Σ_{i∈B} g'_i
10:  g = (1/|B|) Σ_{i∈B} g_i
11:  g'_0 = (1/|B|) Σ_{i∈B} a'_i
12:  g_0 = (1/|B|) Σ_{i∈B} a_i
13:  β = β − γ(g' − g + g_avg)
14:  β_0 = β_0 − γ(g'_0 − g_0 + g0_avg)
15:  β = Prox_{γλα, γλ(1−α)}(β)
16:  for i ∈ B do
17:    a_i = a'_i
18:  end for
19:  g_avg = g_avg + (|B|/n)(g' − g); g0_avg = g0_avg + (|B|/n)(g'_0 − g_0)
20: end for

β was set to 0.196 for Resnet50, 0.098 for DenseNet121, and 0.049 for Inception-v3. For the feature selection, we set α = 0.8 and reduce the regularization strength λ by 90%, as we found this sped up the process without decreasing performance. We use glm-saga to compute the regularization path with α = 0.99 and all other parameters set to their defaults, with a lookbehind of T = 5. From this path, the solution with maximum n_wc ≤ 10 is selected.
Then the non-zero entries with the lowest absolute value are zeroed out until we are left with n_wc = 5, as we empirically found that they do not improve test accuracy after finetuning. This selected solution replaces the final layer of our model. We then train for 40 epochs, starting with the final learning rate of the initial training multiplied by 100 (1/100 of that for ImageNet-1K), and decrease it by 60 % every 10 epochs. Dropout on the features is set to 0.1 and momentum is increased to 0.95. Note that, while the increased momentum was important for the stability of the final training, the hyperparameters were not thoroughly optimized for the sparse case.
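The per-class truncation step described above can be sketched as follows. This is a minimal NumPy sketch under our own naming and array layout (rows = classes, columns = features), not the released implementation:

```python
import numpy as np

def truncate_weights_per_class(W, n_wc=5):
    """Keep only the n_wc largest-magnitude weights per class row.

    W: (n_classes, n_features) sparse weight matrix, e.g. from glm-saga.
    Returns a copy where, for each class, the non-zero entries with the
    lowest absolute value are zeroed out until at most n_wc remain.
    """
    W = W.copy()
    for row in W:  # rows are views into the copy, so in-place edits stick
        nz = np.flatnonzero(row)
        if len(nz) > n_wc:
            # indices of the smallest-|w| non-zero entries in this row
            drop = nz[np.argsort(np.abs(row[nz]))[: len(nz) - n_wc]]
            row[drop] = 0.0
    return W
```

The truncated matrix then replaces the final linear layer's weights before the 40-epoch finetuning stage.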

B.1.1 COMPETITORS

For reporting the accuracy of Resnet50 with CBM (Koh et al., 2020)-joint in Table 2, we resized the images to 448 × 448 and used a batch size of 16. The remaining hyperparameters were almost identical to the CBM experiments with Inception-v3, but we only trained for up to 400 epochs, as 650 led to decreased accuracy (−0.8 percentage points). Additionally, the learning rate was not decayed, mirroring the published code. The reported accuracy stems from three runs with a standard deviation of 0.7. For MCL, we used the reported hyperparameters of µ = 0.005 and λ = 10. For finetuning, we assigned every feature to every class that was using it. We optimized the hyperparameters for FRL based on accuracy, leading to K = 10 and λ = 0.01.

Table 10: diversity@5 in percent dependent on L_div for Resnet50. Best results per column are in bold and ± indicates the standard deviation across five runs.

C GLM-SAGA

This section includes the pseudocode for glm-saga (Wong et al., 2021) in Algorithm 1. The proximal operator Prox_{λ1,λ2}(β), applied elementwise, is defined as:

Prox_{λ1,λ2}(β) = (β − λ1)/(1 + λ2)  if β > λ1
                  (β + λ1)/(1 + λ2)  if β < −λ1
                  0                  otherwise
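The proximal operator above is the standard elastic-net prox: a soft-threshold by λ1 for the ℓ1 part followed by a shrinkage by 1/(1 + λ2) for the ℓ2 part. A direct NumPy transcription (the function name is ours):

```python
import numpy as np

def prox_elastic_net(beta, lam1, lam2):
    """Elastic-net proximal operator, applied elementwise:
    soft-threshold by lam1, then shrink by 1/(1 + lam2)."""
    beta = np.asarray(beta, dtype=float)
    return np.sign(beta) * np.maximum(np.abs(beta) - lam1, 0.0) / (1.0 + lam2)
```

In Algorithm 1 this is applied with lam1 = γλα and lam2 = γλ(1 − α), so a larger α pushes more weights exactly to zero.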

D VISUALIZATION OF FEATURES

For feature visualization, we follow a masking approach. Following (Fong & Vedaldi, 2017), we systematically blur one p × p patch of the image and measure the difference in feature activation between the augmented and the unaugmented image. The localization map L_p ∈ R^{n*_f × w/p × h/p} for that patch size is computed as L_{p,x,y} = ReLU(f(I) − f(I_{p,x,y})), where I_{p,x,y} denotes the image in which the p-sized patch starting at position (x · p, y · p) is blurred, and the ReLU suppresses parts that increase the feature activation, since blurring should not inject a feature. The final localization map combines the different patch sizes p ∈ {28, 56, 64, 112, 224} to accommodate differently sized features: L = Σ_p L_p / max(L_p). Notably, each L_p has to be resized according to the smallest p, and we only show L^i for one feature i.
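The combination step L = Σ_p L_p / max(L_p) can be sketched for a single feature as follows. This is an illustrative NumPy sketch (names ours) that, for simplicity, assumes every patch size is an integer multiple of the smallest one and uses nearest-neighbour upsampling as the resize:

```python
import numpy as np

def combine_localizations(maps, eps=1e-8):
    """Combine per-patch-size localization maps L_p into L = sum_p L_p / max(L_p).

    maps: dict {p: non-negative array of shape (w//p, h//p)} for one feature.
    Each map is normalized by its maximum and upsampled (nearest neighbour)
    to the resolution of the smallest patch size before summing.
    Assumes every p is an integer multiple of min(p).
    """
    p_min = min(maps)
    total = np.zeros_like(maps[p_min], dtype=float)
    for p, L_p in maps.items():
        f = p // p_min  # upsampling factor to the finest grid
        up = np.repeat(np.repeat(L_p, f, axis=0), f, axis=1)
        total += up / (L_p.max() + eps)
    return total
```

Normalizing each L_p by its own maximum before summing gives every patch size equal influence on the final map, so small and large features contribute comparably.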

E FEATURE ALIGNMENT

In this section, we use the alignment of the feature shown in Figure 4b with four-engine aircraft to show by example how one can align a feature manually. We first visualize the distribution of the feature over the training data in Figure 6. It indicates a rather binary attribute, and by investigating the images and saliency maps, we observed an alignment with four-engine aircraft. We test this hypothesis by filtering for the classes "A340-500" and "BAE 146-300", for which the feature corresponds to the four engines. Figure 7 shows that the lowest-activating examples of this group do not clearly show the four engines, which supports our hypothesis. Note that one feature can correspond to multiple concepts; another hypothesis is an alignment with the propeller. Whether all correlating concepts have to be understood, or how exact the analysis has to be, depends on the application.
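Such a hypothesis can also be checked quantitatively. The following NumPy sketch uses one simple proxy, a rank-based (AUC-style) score; this is our own illustration, not necessarily the alignment score C used in the paper:

```python
import numpy as np

def alignment_score(activations, attribute):
    """Rank-based proxy for feature-attribute alignment: the probability
    that an image with the attribute activates the feature more strongly
    than one without. 0.5 means no alignment, 1.0 perfect alignment."""
    a = np.asarray(activations, dtype=float)
    y = np.asarray(attribute, dtype=bool)
    pos, neg = a[y], a[~y]
    # fraction of (with-attribute, without-attribute) pairs where the
    # with-attribute example activates higher
    return float((pos[:, None] > neg[None, :]).mean())
```

A score near 1.0 for "is a four-engine aircraft" and lower scores for competing hypotheses would support the manual alignment.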

F ABLATIONS ON FEATURE DIVERSITY LOSS

In this section, we present an additional analysis of the factors in L div and the impact of β.

F.1 FACTOR IMPORTANCE

We analyzed the impact of the two factors in Equation 3 with β optimized for accuracy, shown in Tables 14 and 15. The label w/o Class-Specific indicates not using the weights of the predicted class, and w/o Rescaling refers to not maintaining their relative mean. We used β = 0.001 · 196/2048 ≈ 1e−4 for w/o Class-Specific, since we use the size of the feature maps with 196 = w_M · h_M and the number of features of the baseline model n_f = 2048 as scaling factors in order to be less dependent on model architecture and image size, and β = 0.1 for w/o Rescaling. Only the combination of both factors leads to an improved accuracy, validating our idea that it is important to only enforce diversity of features that are found in the input and used in conjunction.
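The two ablated factors can be illustrated as follows. Since Equation 3 is defined in the main text, this NumPy sketch is only our reading of the ablation, not the loss itself: the class-specific factor weights each feature by the absolute final-layer weight of the predicted class, and the rescaling factor renormalizes those weights so their mean is maintained:

```python
import numpy as np

def diversity_weights(W, predicted_class, class_specific=True, rescale=True):
    """Illustrative sketch (not Equation 3 itself) of the two ablated factors:
    per-feature weights for the diversity term.

    class_specific: use |W[c]| of the predicted class c instead of uniform
    weights, so only features actually used for the prediction are diversified.
    rescale: renormalize so the mean weight stays 1, maintaining the
    relative mean and hence the overall scale of the loss."""
    n_f = W.shape[1]
    w = np.abs(W[predicted_class]) if class_specific else np.ones(n_f)
    if rescale and w.mean() > 0:
        w = w / w.mean()
    return w
```

Under this reading, dropping either factor changes what the diversity term emphasizes, which is consistent with only the combination improving accuracy in Tables 14 and 15.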

F.2 LOSS WEIGHTING

This section is concerned with the impact of the weighting factor β for the feature diversity loss L_div. For our proposed method, we use β = 0.196. Tables 16 and 17 show that, with increasing β, L_div improves diversity@5 and accuracy across all datasets in the sparse case up to a maximum roughly around β = 1. Setting the value higher leads to L_div dominating the training. To ensure that the network is still mainly optimized for classification, we chose β = 0.196, even though in some cases a slight increase of β could still yield a small gain in accuracy. The positive relation between diversity@5 and accuracy, visualized in Figure 8, supports our approach of enforcing varied features in the extremely sparse case.



Figure 1: Local explanation by our SLDD-Model: the two features used for the predicted class, which emerged without additional supervision, are aligned with human-interpretable attributes and adequately localized (described in App. D).

Figure 2: Overview of our proposed pipeline to construct a SLDD-Model

3.1 has a non-zero weight for all four classes that have the attribute in more than 30 % of examples. The visualization of the classes positively related to a specific attribute, and of the features related to a class as in Figures 9 to 17, helps to trust the model. Additionally, W_sparse allows for further feature understanding, since it is possible to analyze the similarities between classes that share a feature.

(a) Exemplary W_sparse. The alignment of the features with attributes in CUB-2011 is displayed in Figure 3b. (b) Relationship between chosen features and attributes (C > 20 %) of CUB-2011 for the exemplary model. Higher values indicate that the feature describes the attribute.

Figure 3: Visualization of the sparse matrix and the feature alignment.

(a) Feature 45 with C = 0.39 for the attribute "has-bill-shape:needle": higher activations are localized around the bill, and a needle-like bill is visible. (b) Manually aligned feature of a model trained on FGVC-Aircraft without additional labels: the feature is manually aligned with four-engine aircraft as shown in Appendix E.

Figure 4: Example images and localization L, scaled to indicate feature activation, in ascending order for two models. The text between the rows gives the activation value for the image, which can drop below 0 due to the normalization of glm-saga, and the class name.

(a) Impact of changing n*_f with n_wc = 5. (b) Impact of changing n_wc with n*_f = 50 or 2048.

Figure 5: Relationship between Finetune Accuracy and aspects related to interpretability for Resnet50.

Table 9: Accuracy in percent for Resnet50 on ImageNet-1K using the pretrained dense model. Best results per column are in bold and ± indicates the standard deviation across four runs.

Figure 6: Top: Distribution of feature activation over the training data. Bottom: Different examples from this distribution with their scaled feature localization. The activation and class name are given above each image.

Figure 7: The three least activating training examples of the classes "A340-500" and "BAE 146-300" for the given feature.

Figure 8: Relationship between finetuned diversity@5 and accuracy for varying β for Resnet50, portrayed via color, on FGVC-Aircraft. Each dot represents an increase of β by a factor of √10 and the standard deviation is indicated by the shaded area. β = 1.96 is not shown, as L_div then dominates the training.

Figure 9: Feature maps of the top 5 features by magnitude for class Scarlet Tanager on example images. The used weights for the respective features are also displayed.

Figure 10: Feature maps of the top 5 features by magnitude for class White Crowned Sparrow on example images. The used weights for the respective features are also displayed.

Figure 12: Feature maps of the top 5 features by magnitude for class Cape Glossy Starling on example images. The used weights for the respective features are also displayed.

Figure 14: Feature maps of the top 5 features by magnitude for class Audi S5 Coupe 2012 on example images. The used weights for the respective features are also displayed.

Overview of the number of classes, training and testing examples for the used datasets

Comparison with competitors on accuracy in percent. For ease of comparison, we evaluated CUB-2011 on Inception-v3 (n f = 1024) and ImageNet-1K with Resnet50 (n f = 2048) (*denotes Resnet50 accuracy). The dense Resnet50 achieves 80.9 % on ImageNet-1K. For glm-saga, we selected the solution with maximum n wc ≤ 5. Arrows indicate generally preferable directions. For CBM -joint, Resnet50 results were created as described in Appendix B.1.1

Impact of the loss function on accuracy in percent for Resnet50. Best results per column are in bold.

Accuracy in percent dependent on backbone. Best results per column are in bold.

Accuracy in percent for Resnet50 on ImageNet-1K using the pretrained dense model.

Impact of the loss function on diversity@5 in percent for Resnet50. Best results per column are in bold.

Accuracy in percent dependent on L_div for Resnet50. Best results per column are in bold and ± indicates the standard deviation across five runs.

Accuracy in percent dependent on backbone. Best results per column are in bold and ± indicates the standard deviation across five runs.


Table 11: diversity@5 in percent dependent on backbone. Best results per column are in bold and ± indicates the standard deviation across five runs.

Table 12: Accuracy in percent for Resnet50 compared to other loss functions. Best results per column are in bold and ± indicates the standard deviation across five runs.

Table 13: diversity@5 in percent for Resnet50 compared to other loss functions. Best results per column are in bold and ± indicates the standard deviation across five runs.

Impact of factors in L div on accuracy in percent for Resnet50. Best results per column are in bold and ± indicates the standard deviation across five runs.

Impact of factors in L div on diversity@5 in percent for Resnet50. Best results per column are in bold and ± indicates the standard deviation across five runs.

Table 16: Accuracy in percent dependent on β for Resnet50. Best results per column are in bold and ± indicates the standard deviation across five runs. Our used β is underlined.

Table 17: diversity@5 in percent dependent on β for Resnet50. Best results per column are in bold and ± indicates the standard deviation across five runs. Our used β is underlined.

7. REPRODUCIBILITY

For reproducibility, we uploaded the code for both the feature selection and the feature diversity loss L_div as supplementary material. Additionally, we clearly report the used hyperparameters and data augmentation in the implementation details in Appendix B.1.

