INTERPRETABILITY WITH FULL COMPLEXITY BY CONSTRAINING FEATURE INFORMATION

Abstract

Interpretability is a pressing issue for machine learning. Common approaches to interpretable machine learning constrain interactions between features of the input, sacrificing model complexity in order to make the effects of those features on the model's output more comprehensible. We approach interpretability from a new angle: constrain the information about the features without restricting the complexity of the model. We use the Distributed Information Bottleneck to optimally compress each feature so as to maximally preserve information about the output. The learned information allocation, by feature and by feature value, provides rich opportunities for interpretation, particularly in problems with many features and complex feature interactions. The central object of analysis is not a single trained model, but rather a spectrum of models serving as approximations that leverage variable amounts of information about the inputs. Information is allocated to features according to their relevance to the output, thereby solving the problem of feature selection by constructing a learned continuum from feature inclusion to exclusion. The optimal compression of each feature, at every stage of approximation, allows fine-grained inspection of the distinctions among feature values that are most impactful for prediction. We develop a framework for extracting insight from the spectrum of approximate models and demonstrate its utility on a range of tabular datasets.
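As a sketch of the trade-off described above (the notation here is illustrative, not quoted from this excerpt): in the Distributed Information Bottleneck, each input feature $X_i$ is compressed into a representation $U_i$, and a Lagrangian balances the total information retained about the features against the predictive information about the output $Y$,

```latex
\min_{\{p(u_i \mid x_i)\}} \;\; \sum_{i=1}^{K} I(X_i; U_i) \;-\; \beta \, I(U_1, \ldots, U_K; Y),
```

where sweeping the trade-off parameter $\beta$ traces out the spectrum of approximate models: small $\beta$ admits little information about any feature, while large $\beta$ approaches the unconstrained predictor. The per-feature terms $I(X_i; U_i)$ supply the feature-level information allocation used for interpretation.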

1. INTRODUCTION

Interpretability is a pressing issue for machine learning (ML) (Doshi-Velez & Kim, 2017; Fan et al., 2021; Rudin et al., 2022). As models continue to grow in complexity, machine learning is increasingly integrated into fields where flawed decisions can have serious ramifications (Caruana et al., 2015; Rudin et al., 2018; Rudin, 2019). Interpretability is not a binary property that machine learning methods either have or lack: rather, it is the degree to which a learning system can be probed and comprehended (Doshi-Velez & Kim, 2017). Importantly, interpretability can be attained along many distinct routes (Lipton, 2018). Various constraints can be incorporated into the learning system, such as forcing feature effects to combine in a simple (e.g., linear) manner, restricting the space of possible models in exchange for a degree of comprehensibility (Molnar et al., 2020; Molnar, 2022). In contrast to explainable AI, which operates post hoc after a black-box model is trained, interpretable methods engineer the constraints into the model from the outset (Rudin et al., 2022). In this work, we introduce a novel route to interpretability that places no restrictions on model complexity, and instead tracks how much information, and what information, is most important for prediction. By identifying optimal information from features of the input, the method grants a measure of salience to each feature, produces a spectrum of models utilizing different amounts of optimal information about the input, and provides a learned compression scheme for each feature that highlights where the information comes from in fine-grained detail. The central object of interpretation is not a single

