SHAPLEY EXPLAINABILITY ON THE DATA MANIFOLD

Abstract

Explainability in AI is crucial for model development, compliance with regulation, and providing operational nuance to predictions. The Shapley framework for explainability attributes a model's predictions to its input features in a mathematically principled and model-agnostic way. However, general implementations of Shapley explainability make an untenable assumption: that the model's features are uncorrelated. In this work, we demonstrate unambiguous drawbacks of this assumption and develop two solutions to Shapley explainability that respect the data manifold. One solution, based on generative modelling, provides flexible access to data imputations; the other directly learns the Shapley value-function, providing performance and stability at the cost of flexibility. While "off-manifold" Shapley values can (i) give rise to incorrect explanations, (ii) hide implicit model dependence on sensitive attributes, and (iii) lead to unintelligible explanations in higher-dimensional data, on-manifold explainability overcomes these problems.

1. INTRODUCTION

Explainability in AI is central to the practical impact of AI on society, making it critical to get right. While many dichotomies exist within the field (local versus global explanations (Ribeiro et al., 2016), post-hoc versus intrinsic interpretability (Rudin, 2019), and model-agnostic versus model-specific methods (Shrikumar et al., 2017)), in this work we focus on local, post-hoc, model-agnostic explainability, as it provides insight into individual model predictions, does not limit model expressiveness, and is comparable across model types. In this context, explainability can be treated as a problem of attribution. Shapley values (Shapley, 1953) provide the unique attribution method satisfying a set of intuitive axioms: for example, they capture all interactions between features and sum to the model prediction. The Shapley approach to explainability has matured over the last two decades (Lipovetsky & Conklin, 2001; Kononenko et al., 2010; Štrumbelj & Kononenko, 2014; Datta et al., 2016; Lundberg & Lee, 2017).

Implementations of Shapley explainability suffer from a problem common across model-agnostic methods: they involve marginalisation over features, achieved by splicing data points together and evaluating the model on highly unrealistic inputs (e.g. Fig. 1). Such splicing would only be justified if all features were independent; otherwise, spliced data lies off the data manifold. Outside the Shapley paradigm, emerging explainability methods have begun to address this problem: see e.g. Anders et al. (2020) for a general treatment of the off-manifold problem in gradient-based explainability, and Chang et al. (2019) and Agarwal et al. (2019) for image-specific explanations that respect the data distribution. Within Shapley explainability, initial work towards remedying the off-manifold problem has emerged; e.g. Aas et al. (2019) and Sundararajan & Najmi (2019) explore empirical and kernel-based estimation techniques, but these methods do not scale to complex data. A satisfactorily general and performant solution to computing Shapley values on the data manifold has yet to appear and is a focus of this work.

[Figure 1: a binary MNIST digit, a coalition of pixels, and five random splices (Splice 1-5) used in an off-manifold Shapley computation.]

Our main contributions are twofold:

• Sec. 3 compares on- and off-manifold explainability, focusing on novel and unambiguous shortcomings of off-manifold Shapley values. In particular, we show that off-manifold explanations are often incorrect, and that they can hide implicit model dependence on sensitive features.

• Sec. 4 develops two methods to compute on-manifold Shapley values on general data sets: (i) a flexible generative-modelling technique to learn the data's conditional distributions, and (ii) a simple supervised-learning technique that targets the Shapley value-function directly. We demonstrate the effectiveness of these methods on higher-dimensional data with experiments.

2. BACKGROUND ON SHAPLEY EXPLAINABILITY

The Shapley value (Shapley, 1953) is a method from cooperative game theory that distributes credit for the total value v(N) earned by a team N = {1, 2, ..., n} among its players:

φ_v(i) = Σ_{S ⊆ N\{i}} [|S|! (n − |S| − 1)! / n!] [v(S ∪ {i}) − v(S)]    (1)

where the value function v(S) indicates the value that a coalition of players S would earn without their other teammates. The Shapley value φ_v(i) represents player i's marginal value-added upon joining the team, averaged over all orderings in which the team can be constructed.

In supervised learning, let f_y(x) be a model's predicted probability that data point x belongs to class y. To apply Shapley attribution to model explainability, one interprets the features {x_1, ..., x_n} as players in a game and the output f_y(x) as their earned value. To compute Shapley values, one must define a value function representing the model's output on a coalition x_S ⊆ {x_1, ..., x_n}. As the model is undefined on partial input x_S, the standard implementation (Lundberg & Lee, 2017) samples out-of-coalition features x_{S̄}, where S̄ = N \ S, unconditionally from the data distribution:

v^(off)_{f_y(x)}(S) = E_{p(x')} [f_y(x_S ⊔ x'_{S̄})]    (2)

We refer to this value function, and the corresponding Shapley values, as lying off the data manifold, since splices x_S ⊔ x'_{S̄} generically lie far from the data distribution. Alternatively, conditioning out-of-coalition features x_{S̄} on in-coalition features x_S would result in an on-manifold value function:

v^(on)_{f_y(x)}(S) = E_{p(x' | x_S)} [f_y(x')]    (3)

The conditional distribution p(x' | x_S) is not empirically accessible in practical scenarios with high-dimensional data or many-valued (e.g. continuous) features. A performant method to compute on-manifold Shapley values on general data has until now been lacking and is a focus of this work.

Local Shapley values can also be aggregated into a global explanation of the model by averaging over the labelled-data distribution p(x, y):

Φ_f(i) = E_{p(x,y)} [φ_{f_y(x)}(i)]    (4)
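For small n, the Shapley sum and the standard off-manifold value function above can be computed exactly by brute force. The following is a minimal sketch (function names are ours for illustration, not a library API):

```python
import itertools
import math
import numpy as np

def shapley_values(value_fn, n):
    """Exact Shapley values phi_v(i) for a value function v(S), summing over
    all coalitions S not containing i with the combinatorial weight above."""
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[i] += w * (value_fn(set(S) | {i}) - value_fn(set(S)))
    return phi

def off_manifold_value_fn(model, x, background):
    """Off-manifold value function: out-of-coalition features are drawn
    unconditionally from a background sample of the data (the splicing step)."""
    def v(S):
        spliced = background.copy()      # samples x' from the data
        for j in S:
            spliced[:, j] = x[j]         # splice in-coalition features of x
        return model(spliced).mean()
    return v

# Usage: a linear model with a zero background recovers phi = (x1, 2*x2).
model = lambda X: X[:, 0] + 2 * X[:, 1]
v = off_manifold_value_fn(model, np.array([1.0, 1.0]), np.zeros((4, 2)))
print(shapley_values(v, 2))   # [1. 2.]
```

The exact computation requires 2^n value-function evaluations per feature; practical implementations sample coalitions instead.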
Global Shapley values can be seen as a special case of the global explanation framework introduced by Covert et al. (2020). As a consequence of the axioms (Shapley, 1953) satisfied by the φ_{f_y(x)}(i)'s, global Shapley values satisfy a sum rule:

Σ_{i∈N} Φ_f(i) = E_{p(x,y)} [f_y(x)] − E_{p(x')} E_{p(y)} [f_y(x')]    (5)

One interprets the global Shapley value Φ_f(i) as the portion of model accuracy attributable to the i-th feature. Indeed, the first term in Eq. (5) is the accuracy one achieves by sampling labels from f's predicted probability distribution over classes. The offset term, which relates to class balance, is not attributable to any individual feature.

3. ON- VERSUS OFF-MANIFOLD EXPLAINABILITY

The key differences between on- and off-manifold Shapley values are a subject of ongoing discussion; see Sundararajan & Najmi (2019) or Chen et al. (2020) for recent overviews. Here we focus on theoretical arguments and experimental evidence yet to appear in the literature, in favour of the on-manifold approach. We begin with mathematically precise differences between on- and off-manifold methods in Sec. 3.1 and present unambiguous drawbacks of off-manifold Shapley values in Sec. 3.2.

3.1. ON-VERSUS OFF-MANIFOLD DIFFERENCES MADE PRECISE

Suppose the model's input features x_1, ..., x_n are the result of a data-generating process seeded by unobserved latent variables z_1, ..., z_d. Then there exist functional relationships

x_i = g_i(z_1, ..., z_d; ε_i)  for i = 1, ..., n    (6)

where ε_i represents noise in x_i. In the limit of small ε_i's, there are d directions in which a data point (x_1, ..., x_n) can be perturbed while remaining consistent with the data distribution: these correspond to perturbations in z_1, ..., z_d in Eq. (6). The data thus lives on a d-dimensional manifold in the ambient n-dimensional feature space, and therefore satisfies n − d constraints on the x_i's:

Ψ_k(x_1, ..., x_n) = 0  for k = 1, ..., n − d    (7)

On-manifold Shapley values evaluate the model on inputs that satisfy these constraints, while the off-manifold approach uses spliced data that generically breaks them. For a more detailed and mathematically precise discussion of the data manifold in this context, see Anders et al. (2020).
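A minimal numeric illustration of Eqs. (6) and (7), with hypothetical values: two features seeded by a single latent (n = 2, d = 1) satisfy one constraint, which real data points obey and spliced points generically break.

```python
# Toy data-generating process in the spirit of Eq. (6): n = 2 features seeded
# by a single latent z (d = 1), so the data satisfies n - d = 1 constraint,
# Eq. (7): Psi(x) = x2 - 2*x1 = 0.  (Hypothetical numbers for illustration.)

def g(z):
    return (z, 2 * z)        # noise-free limit of x_i = g_i(z; eps_i)

def psi(x):
    return x[1] - 2 * x[0]   # the single manifold constraint

a = g(0.5)                   # on-manifold point
b = g(-1.0)                  # on-manifold point
splice = (a[0], b[1])        # feature 1 from a, feature 2 from b

print(psi(a), psi(b))        # 0.0 0.0 -- real points satisfy the constraint
print(psi(splice))           # -3.0   -- the splice generically breaks it
```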

ALGEBRAIC MODEL DEPENDENCE CAN BE MISLEADING

Any model f_y(x) can be written in many algebraic forms that all evaluate identically on the data manifold. To show this, one can add any of the n − d constraints from Eq. (7) to any of the model's n input slots. This changes the model's algebraic form but does not affect the model's output on the data, since each constraint equals zero on-manifold. The model f_y(x) thus belongs to an n(n − d)-dimensional equivalence class of functions that behave indistinguishably on the data. On-manifold Shapley values provide the same explanation for any two models that evaluate identically on the data distribution, because on-manifold explanations do not involve evaluation anywhere else. Off-manifold Shapley values provide different explanations for two models in the same equivalence class, as spliced data in the off-manifold value function breaks the constraints of Eq. (7).
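To make this concrete, consider two linear models that differ by a constraint Ψ(x) = x₂ − 2x₁, which vanishes on a manifold where x₂ = 2x₁. For a linear model whose removed features are sampled independently (the off-manifold scheme), the Shapley value of feature i reduces to the closed form c_i (x_i − μ_i), with μ the background mean, so the disagreement is visible directly (hypothetical two-feature example):

```python
import numpy as np

# Two linear models differing by the constraint Psi(x) = x2 - 2*x1, which
# vanishes on the manifold x2 = 2*x1:
c_f = np.array([1.0, 1.0])    # f(x) = x1 + x2
c_g = np.array([-1.0, 2.0])   # g(x) = f(x) + (x2 - 2*x1)

x = np.array([1.0, 2.0])      # an on-manifold point: f(x) = g(x) = 3
assert c_f @ x == c_g @ x

# Closed-form off-manifold Shapley values for a linear model: c_i * (x_i - mu_i)
mu = np.zeros(2)
phi_f = c_f * (x - mu)        # [ 1.,  2.]
phi_g = c_g * (x - mu)        # [-1.,  4.]
# Same behaviour on the data, different off-manifold explanations.
```

Both explanations still sum to f(x) − f(μ) = 3, but they attribute it to features differently.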

HIDDEN MODEL DEPENDENCE ON SENSITIVE ATTRIBUTES

This is not an academic concern: it follows that off-manifold explanations are vulnerable to adversarial model perturbations that hide dependence on select input features (Dombrowski et al., 2019; Slack et al., 2020). Dimanov et al. (2020) demonstrated that the off-manifold Shapley value for a sensitive feature like gender could be reduced to near zero via this vulnerability. To see how this can happen, suppose that input feature x_1 represents gender, and formally solve one of the constraints in Eq. (7) for x_1. The result, say x_1 = Ψ(x_2, ..., x_n), can then be used to transform any model f_y(x) into another,

f̃_y(x_2, ..., x_n) = f_y(Ψ(x_2, ..., x_n), x_2, ..., x_n)

that has no algebraic dependence on gender x_1 but behaves identically to f_y(x) on the data manifold. The off-manifold Shapley value for gender in f̃ would vanish, since the off-manifold value function of Eq. (2) depends on x_1 only through f̃ (i.e. not at all). This result is problematic, since the two models behave equivalently on the data and thus possess the same gender bias. In contrast, the on-manifold Shapley values for f and f̃ would be identical, as x_1 dependence enters the on-manifold value function of Eq. (3) through the conditional expectation value. In a sense, on-manifold Shapley values represent the model's dependence on the information content of each feature, rather than the model's algebraic dependence.
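A toy sketch of this construction (a hypothetical two-feature example, not the experiment below): on a manifold where x₁ = x₂/2, substituting the constraint for x₁ eliminates all algebraic dependence on the sensitive feature without changing the model's behaviour on the data.

```python
# Manifold constraint solved for the sensitive feature: x1 = x2 / 2.
f       = lambda x1, x2: 3 * x1 + x2          # depends algebraically on x1
f_tilde = lambda x1, x2: 3 * (x2 / 2) + x2    # x1 eliminated via the constraint

# The two models agree on every on-manifold point ...
for z in (-1.0, 0.0, 0.7, 2.5):
    x1, x2 = z, 2 * z
    assert f(x1, x2) == f_tilde(x1, x2)

# ... yet f_tilde has no algebraic x1-dependence, so its off-manifold Shapley
# value for x1 is exactly zero, hiding dependence on the sensitive feature.
```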
[Figure 2: global Shapley values for (a) Census Income data and (b) Drug Consumption data.]

We can demonstrate this on UCI Census Income data (Dua & Graff, 2017). We trained a neural network to predict whether an individual's income exceeds $50k based on demographic features in the data. Coral bars in Fig. 2(a) display global Shapley values for this "Original model". (On-manifold values were computed with the unsupervised method developed in Sec. 4.1.)

We then trained an alternative model by fine-tuning the neural network above on a loss function that penalises model dependence on sex; see App. B for full details of this experiment. This resulted in a "Suppressed model" that makes identical predictions to the original model on 98.5% of the data. Teal bars in Fig. 2(a) display global Shapley values for this model. Note that the off-manifold Shapley value for sex is zero, despite the similar behaviour exhibited by the original and suppressed models on the data.
In contrast, on-manifold Shapley values explain both models similarly.

ON-MANIFOLD SHAPLEY VALUES IN THE OPTIMAL-MODEL LIMIT

Here we present a result that strengthens the connection between on-manifold Shapley values and the data distribution: in the limit of an optimal model of the data, on-manifold Shapley values converge to an explanation of how the information in the data associates with the labelled outcomes. To show why this holds, suppose the predicted probability f_y(x) converges to the true underlying distribution p(y|x). In this optimal-model limit (which is approached in the limit of abundant data and high model capacity) the on-manifold value function of Eq. (3) becomes

v^(on)_{f_y(x)}(S) → ∫ dx'_{S̄} p(x'_{S̄} | x_S) p(y | x_S ⊔ x'_{S̄}) = p(y | x_S)

which shows that value is attributed to x_i based on x_i's predictivity of the label y.

We can demonstrate this on UCI Drug Consumption data (Dua & Graff, 2017). Using the 10 binary features listed in Fig. 2(b), we trained a random forest f and computed its global Shapley values. Next we fit a separate random forest g_S to each coalition S of features, 2^10 models in total, in the spirit of, e.g., Štrumbelj et al. (2009). We used the accuracy A(g_S) of each model, in the sense of Eq. (5), as the value function for an additional Shapley computation:

Φ_g(i) = Σ_{S ⊆ N\{i}} [|S|! (n − |S| − 1)! / n!] [A(g_{S ∪ {i}}) − A(g_S)]

where Φ_g(i) is directly the average gain in accuracy that results from adding feature i to the set of inputs. These values are labelled "Model retraining" in Fig. 2(b). Note their agreement with the on-manifold explanation of the fixed random forest f. On-manifold Shapley values thus indicate which features in the data are most predictive of the label.

This consistency check allows us to show in passing that Tree SHAP (Lundberg et al., 2018; 2020) does not provide a method for on-manifold explainability. Observe in Fig. 2(b) that Tree SHAP roughly tracks the off-manifold explanation, albeit larger on the most predictive feature and somewhat smaller on the others. This occurs because trees tend to split on high-predictivity features first, and Tree SHAP privileges early-splitting features in an otherwise off-manifold calculation.
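The retraining-based computation can be sketched as follows. As a self-contained illustration we replace the retrained random forests g_S with simple empirical majority-vote models over the in-coalition features (our simplification for brevity, not the paper's setup), so A(g_S) is the training accuracy of the best predictor that sees only the features in S:

```python
import itertools
import math
import numpy as np

def coalition_accuracy(X, y, S):
    """Accuracy of a majority-vote model restricted to coalition S: predict the
    most common label among points sharing the same in-coalition feature values."""
    if not S:
        _, counts = np.unique(y, return_counts=True)
        return counts.max() / len(y)
    keys = [tuple(row) for row in X[:, list(S)]]
    correct = 0
    for k in set(keys):
        idx = [i for i, kk in enumerate(keys) if kk == k]
        _, counts = np.unique(y[idx], return_counts=True)
        correct += counts.max()
    return correct / len(y)

def global_shapley_by_retraining(X, y):
    """Shapley values with the per-coalition model accuracy as value function."""
    n = X.shape[1]
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[i] += w * (coalition_accuracy(X, y, set(S) | {i})
                               - coalition_accuracy(X, y, set(S)))
    return phi

# Usage: feature 0 determines the label, feature 1 is noise.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
print(global_shapley_by_retraining(X, y))   # [0.5 0. ]
```

Feature 0 receives the full accuracy gain over the 0.5 baseline, while the uninformative feature receives none, mirroring the "Model retraining" bars.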

3.2. UNAMBIGUOUS SHORTCOMINGS OF OFF-MANIFOLD EXPLAINABILITY

Whereas above we clarified precise differences between on- and off-manifold Shapley values, in this section we focus on unambiguous drawbacks of the off-manifold approach.

UNCONTROLLED MODEL BEHAVIOUR OFF-MANIFOLD

Sec. 3.1 might lead one to believe that off-manifold Shapley values provide insight into the algebraic dependence of a model. However, the off-manifold approach of evaluating the model on splices of in-distribution data does not constitute a controlled study of such dependence. Off-manifold Shapley values serve as a perilously uncontrolled technique, especially in complex nonlinear models such as neural networks. Indeed, it is widely known that deep-learning models are not robust to distributional shift (Nguyen et al., 2015; Goodfellow et al., 2015). Still, off-manifold Shapley values evaluate the model outside its domain of validity, where it is untrained and potentially wildly misbehaved. This garbage-in-garbage-out problem is the clearest reason to avoid the off-manifold approach. Since this point has been documented in the literature (Hooker & Mentch, 2019), here we simply provide an example: Fig. 1 shows a binary MNIST digit (LeCun & Cortes, 2010), a coalition of pixels, and 5 random splices that would be used to compute an off-manifold explanation.

OUTLIER DETECTION EXPLAINED INCORRECTLY OFF-MANIFOLD

To demonstrate that off-manifold Shapley values frequently lead to incorrect explanations, here we offer an example on synthetic data where the ground-truth explanation is known. We generated 10 4 synthetic data points, each consisting of 20 real-valued features, for the purpose of outlier detection. We split the dataset between 99% inliers and 1% outliers, with the classes generated according to: p in (x 1 , . . . , x 20 ) = 1 2 z=0,1 20 i=1 N [z, σ 2 ](x i ) (11) p out (x 1 , . . . , x 20 ) = 1 2 z=0,1 5 i=1 N [z, σ 2 ](x i ) 20 i=6 N [z, σ 2 ](x i ) (12) That is, there is a single binary latent variable z. For inliers, each feature is an independent noisy reading of the latent z. For outliers, the first 5 features are centred instead around its opposite z. An example outlier (with σ = 0.05) is shown in Fig. 3(a) . We generated one such data set for each σ ∈ {0.01, 0.03, . . . , 0.15} in order to study the effect of noise on explanation errors. (a)  < l a t e x i t s h a _ b a s e = " j U l t B / N W f J h B E b k b s p x R D n j Y = " > A A A B i c b V B N S N A E J U r q / q h B I t Q L y U p i t s e P F Y w X A E p m u m X b j Z h d y K W L / h x Y M i X v z v w b t s c t P X B w O O G W b m B Y n g G h n y q s r W s b h W S z u e / s H c O j t o T R V m L x i J W Y B o J r h k L e Q o W D d R j E S B Y J g f D v z O M a R L B w k z I / I U P K Q U J G j x k T x i E W Z W c T / v l i l N z r B X i Z u T C u R o s t f i C m a c Q k U k G r l O g n G F H I q L T k p Z o l h I J k P U M l S R i s / m N / t M M M D B W p i T a c / X R E Y i r S d R Y D o j g i O M E / x e i u G n G Z p M g k X S w K U F j b M C s A d c M Y p i Y g i h i p t b b T o i i l A M Z V M C O y y u k X a + F X L + q l c Z P H U Y Q T O I U q u H A F D b i D J r S A Q g L P A p v V m q W O / W x K Y O U z x / A H u c P X a R i A = = < / l a t e x i t > (b) < l a t e x i t s h a 1 _ b a s e 6 4 = " y 3 6 Q m H L D 6 G N Y T L q Z T M W M D + F Q / h I = " > A A A B 8 3 i c b V B N S 8 N A E 
We fit an isolation forest (Liu et al., 2008) to perform outlier detection on each synthetic dataset, achieving 100% accuracy in every case. We computed the off- and on-manifold value functions of Eqs. (2) and (3) for each isolation forest by sampling the probability distributions directly, as these can be inferred from Eqs. (11) and (12).

The ground-truth explanation of why Fig. 3(a) represents an outlier is that its first 5 features break correlations that exist across 99% of the data. The on-manifold explanation of Fig. 3(c) correctly attributes the 5 largest Shapley values to features x_1, ..., x_5. The off-manifold explanation of Fig. 3(b) is unambiguously incorrect: feature x_7 receives a larger value than x_2, x_4, and x_5. We consider an explanation to be erroneous if x_1, ..., x_5 do not receive the 5 largest Shapley values. To show the frequency of incorrect explanations, Fig. 3(d) displays the off- and on-manifold error rates as a function of noise σ in the synthetic data set. Incorrect explanations are commonplace off-manifold: one-quarter are in error in the presence of minimal noise, and two-thirds are incorrect at σ = 0.15. The on-manifold error rate is dramatically lower across this range.

Figs. 3(e) and 3(f) show the root cause of off-manifold errors. These histograms display the distribution of model outputs when evaluated on Shapley coalitions in the off- and on-manifold calculations for σ = 0.05. In particular, Fig. 3(e) shows the model evaluated on "inlier coalitions" which do not include x_1, ..., x_5. Note that model outputs for on-manifold coalitions agree with the model evaluated on the actual data, while off-manifold coalitions follow a very different distribution. In particular, since a positive model output indicates a predicted outlier, Fig. 3(e) shows that the off-manifold calculation itself fabricates outliers through its splicing procedure. Similarly, Fig. 3(f) shows the model evaluated on "outlier coalitions" which do include x_1, ..., x_5. Note that model outputs are similar for on-manifold coalitions and actual outliers, whereas off-manifold coalitions again differ dramatically. This is a manifestation of uncontrolled model behaviour off the data manifold, and it ultimately leads to erroneous off-manifold explanations.
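The contrast between the two value functions can be sketched as follows; `value_off_manifold` implements the splicing of Eq. (2), while `value_on_manifold` imputes from a conditional sampler as in Eq. (3). Function names and the toy model are ours, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def value_off_manifold(model, x, S, X_background, n=1000):
    """Eq. (2): marginalise out-of-coalition features by splicing x_S
    onto background samples drawn independently of x_S."""
    idx = rng.integers(0, len(X_background), size=n)
    X_spliced = X_background[idx].copy()
    X_spliced[:, S] = x[S]          # splice: keep coalition, replace the rest
    return model(X_spliced).mean()

def value_on_manifold(model, x, S, sample_conditional, n=1000):
    """Eq. (3): marginalise using imputations from p(x' | x_S)."""
    X_cond = sample_conditional(x, S, n)
    return model(X_cond).mean()
```

On data with two perfectly correlated binary features and a model that flags disagreement between them, the spliced estimate reports roughly 0.5 (it fabricates "outliers"), while the conditional estimate correctly reports 0.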

BREAKDOWN IN GLOBAL SHAPLEY VALUES OFF-MANIFOLD

To demonstrate that global Shapley values can be misleading off-manifold as well, we generated an additional synthetic data set according to the process in Fig. 4(a). The data has two binary features and a binary label. We fit a decision tree to this data, resulting in a precise match to Fig. 4(a). Note that the features x_0 and x_1 are positively correlated, both with each other and with the label y. However, with x_0 fixed, the likelihood of y = 1 decreases slightly from x_1 = 0 to x_1 = 1. One might think of x_0 as disease severity, x_1 as treatment intensity, and y as mortality rate. As Fig. 4(b) shows, the off-manifold global Shapley value of x_1 is negative, since too much weight is placed on splices, e.g. with (x_0, x_1, y) = (0, 1, 1), that occur less frequently in the actual data.

We can demonstrate this on real data using the UCI Abalone data set (Dua & Graff, 2017). We trained a neural network to classify abalone as younger than or older than the median age based on physical characteristics. Observe the drastic difference between the on- and off-manifold explanations in Fig. 4(c). This is due to the tight correlations between features in the data (4 weights and 3 lengths) making the data manifold low-dimensional and important. Notice further the large negative off-manifold global Shapley value, negating its interpretation as the portion of model accuracy attributable to that feature.

Here we develop two methods to learn the on-manifold value function: (i) an unsupervised approach that learns the conditional distribution p(x′|x_S), and (ii) a supervised technique that learns the value function directly, providing performance and stability at the cost of flexibility.

UNSUPERVISED APPROACH

One can use unsupervised learning to learn the conditional distributions p(x′|x_S) that appear in the on-manifold value function. Here we take an approach similar to Ivanov et al. (2019) to learn these distributions with variational inference. See Douglas et al. (2017) and Belghazi et al. (2019) for alternative techniques to learning conditional distributions that could be used here instead.

Our specific approach includes two model components. The first is a variational autoencoder (Kingma & Welling, 2014; Rezende et al., 2014), with encoder q_φ(z|x) and decoder p_θ(x|z). The second is a masked encoder, r_ψ(z|x_S), whose goal is to map the coalition x_S to a distribution in latent space that agrees with the encoder q_φ(z|x) as well as possible. A model of p(x′|x_S) is then provided by the composition:

p(x′|x_S) = ∫ dz p_θ(x′|z) r_ψ(z|x_S)

and a good fit to the data should maximise p(x′|x_S). A lower bound to its log-likelihood is given by

L_0 = E_{q_φ(z|x′)}[log p_θ(x′|z)] - D_KL(q_φ(z|x′) || r_ψ(z|x_S))

While L_0 could be used on its own as the objective function to learn p(x′|x_S), this would leave the variational distribution q_φ(z|x) unconstrained, at odds with our goal of learning a smooth manifold structure in latent space. This concern can be mitigated by

L_reg = -D_KL(q_φ(z|x) || p(z))

which regularises q_φ(z|x) by penalising differences from a smooth (e.g. unit normal) prior distribution p(z). We thus include L_reg as a regularisation term in our unsupervised objective: L = L_0 + β L_reg.
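For diagonal-Gaussian distributions, the KL terms in L_0 and L_reg are available in closed form. The sketch below assembles the objective L = L_0 + β L_reg under that simplification (the paper's masked encoder is a Gaussian mixture; here we use a single component, and the function names are ours):

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_r, logvar_r):
    """Closed-form KL divergence D_KL(q || r) between diagonal
    Gaussians, summed over latent dimensions."""
    return 0.5 * np.sum(
        logvar_r - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_r) ** 2) / np.exp(logvar_r)
        - 1.0
    )

def objective(recon_loglik, mu_q, logvar_q, mu_r, logvar_r, beta):
    """L = L0 + beta * L_reg, with L0 = E_q[log p(x|z)] - KL(q || r)
    and L_reg = -KL(q || p(z)) for a unit-normal prior p(z)."""
    L0 = recon_loglik - kl_diag_gauss(mu_q, logvar_q, mu_r, logvar_r)
    L_reg = -kl_diag_gauss(mu_q, logvar_q,
                           np.zeros_like(mu_q), np.zeros_like(logvar_q))
    return L0 + beta * L_reg
```

In training, the masked-encoder KL pulls r_ψ(z|x_S) towards q_φ(z|x), while the β-weighted term keeps the encoder close to the prior.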

METRIC FOR THE LEARNT VALUE FUNCTION

The unsupervised method presented above leads to a learnt estimate of the conditional distribution, and thus to an estimate of the on-manifold value function: v̂_{f_y(x)}(S) = E_{p(x′|x_S)}[f_y(x′)]. With the goal of judging the performance of this estimate, consider the following formal quantity:

mse(x_S, y) = E_{p(x′|x_S)}[ (f_y(x′) - v̂_{f_y(x)}(S))² ]    (15)

This quantity is minimal with respect to v̂_{f_y(x)}(S) when v̂_{f_y(x)}(S) = E_{p(x′|x_S)}[f_y(x′)], in agreement with the definition, Eq. (3), of the on-manifold value function. We can then quantitatively judge the performance of the unsupervised model p(x′|x_S) by computing

MSE = E_{p(x)} E_{S∼Shapley} E_{y∼Unif}[ (f_y(x) - v̂_{f_y(x)}(S))² ]    (16)

Published as a conference paper at ICLR 2021

Note that this is precisely Eq. (15) averaged over coalitions S drawn from the Shapley sum,² features x_S ∼ p(x_S) drawn from the data, and labels y drawn uniformly over classes. Moreover, the mean-square-error in Eq. (16) is easy to estimate using the empirical distribution p(x) and the learnt model p(x′|x_S), thus providing an unambiguous metric to judge the outcome of the unsupervised approach.
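A Monte Carlo estimate of Eq. (16) then only requires sampling data points, coalitions from the Shapley sum, and class labels. A minimal sketch (function names and the predecessor-set sampler are ours):

```python
import numpy as np

def shapley_coalition(n_features, rng):
    """Draw a coalition as the predecessors of a uniformly random
    position in a uniformly random feature ordering."""
    perm = rng.permutation(n_features)
    k = rng.integers(0, n_features)   # number of preceding features
    return perm[:k]

def mse_metric(f, v_hat, X, n_classes, n_samples=1000, rng=None):
    """Monte Carlo estimate of Eq. (16): average squared deviation
    between the model output f_y(x) and the learnt value function."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_samples):
        x = X[rng.integers(len(X))]
        S = shapley_coalition(X.shape[1], rng)
        y = rng.integers(n_classes)
        total += (f(x, y) - v_hat(x, S, y)) ** 2
    return total / n_samples
```

A perfect value function achieves the optimum of Eq. (15) rather than zero, since f_y(x) is not fully determined by the coalition x_S alone.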

SUPERVISED APPROACH

The MSE metric of Eq. (16) supports a supervised approach to learning the on-manifold value function directly: one can define a surrogate model g_y(x_S) that operates on coalitions of features x_S (e.g. by masking out-of-coalition features) and that is trained to minimise the loss:

L = E_{p(x)} E_{S∼Shapley} E_{y∼Unif}[ (f_y(x) - g_y(x_S))² ]    (17)

As discussed above Eq. (16), this loss is minimised as the surrogate model g_y(x_S) approaches the on-manifold value function E_{p(x′|x_S)}[f_y(x′)] of the model-to-be-explained. The unsupervised approach is flexible but untargeted: p(x′|x_S) is data-specific but model-agnostic, accommodating explanations for many models trained on the same data.

The global on-manifold Shapley values in Fig. 2(b) appear in Fig. 5 as well, labelled "Empirical". Fig. 5 also displays on-manifold Shapley values computed using the supervised and unsupervised methods introduced in this paper. As above, these are Monte Carlo estimates of Eq. (4). The supervised method involved training a fully connected network on the MSE loss of Eq. (17). All neural networks in this paper used 2 flat hidden layers, Adam (Kingma & Ba, 2015) for optimisation, and a batch size of 256. We scanned over a grid with

hidden layer size = {128, 256, 512}
learning rate = {10^-3, 10^-4}    (21)

choosing the point with minimal MSE on a held-out validation set after 10k epochs of training; see Table 2. Each supervised value in Fig. 5 corresponds to 10^4 Monte Carlo samples. The unsupervised method involved training a variational autoencoder as described in Sec. 4.1 and App. A. The encoder, decoder, and masked encoder were each modelled using fully connected networks, trained using early stopping with patience 100.
We scanned over a grid of hidden layer sizes and learning rates as in Eq. (21), as well as

latent dimension = {2, 4, 8, 16}
latent modes = {1, 2}    (22)
regularisation β = {0.05, 0.1, 0.5, 1}

choosing the point with minimal validation-set MSE; see Table 2. Unsupervised values in Fig. 5 correspond to 10^6 Monte Carlo samples.
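The supervised loss of Eq. (17) reduces to ordinary regression once coalitions are encoded. A sketch of the training-set construction, using the -1 masking convention of App. A (function names are ours, not the paper's code):

```python
import numpy as np

MASK = -1.0  # special value never appearing in the data

def mask_coalition(x, S):
    """Represent the coalition x_S: out-of-coalition features -> MASK."""
    x_masked = np.full_like(x, MASK)
    x_masked[S] = x[S]
    return x_masked

def make_training_set(f, X, n_classes, n_samples, rng=None):
    """Training pairs for the surrogate g_y(x_S) of Eq. (17): inputs
    are masked coalitions (plus the class index y), targets are the
    full-input model outputs f_y(x)."""
    rng = rng or np.random.default_rng(0)
    n = X.shape[1]
    inputs, targets = [], []
    for _ in range(n_samples):
        x = X[rng.integers(len(X))]
        perm = rng.permutation(n)
        S = perm[: rng.integers(0, n)]   # coalition from the Shapley sum
        y = rng.integers(n_classes)
        inputs.append(np.append(mask_coalition(x, S), y))
        targets.append(f(x, y))
    return np.array(inputs), np.array(targets)
```

Any regression model trained to minimise squared error on these pairs approximates the on-manifold value function directly.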

B.2 CENSUS INCOME EXPERIMENT

To produce the explanations of Fig. 2(a) we used the Census Income data set from the UCI repository (Dua & Graff, 2017). The data contains 49k individuals from the 1994 US Census, as well as 13 features which we used to predict whether annual income exceeded $50k. We trained a fully connected network (hidden layer size 50, default sklearn parameters, and early stopping), achieving a test-set accuracy of 85% amidst a 76 : 24 class balance.

The Shapley values for this model are labelled "Original model" in Fig. 2(a). These were computed exactly as described in App. B.1, except that the supervised method used 5k epochs, and the unsupervised method used patience 50. Optimised hyperparameters are given in Table 2. The on-manifold values in Fig. 2(a) were computed using the unsupervised method. While the supervised method does not appear in the figure, it was performed to complete Table 1.

We also fine-tuned the "Original model" to suppress the importance of sex. Motivated by Dimanov et al. (2020), we added a term to the loss that penalises the finite difference in the model output with respect to sex (as this is a discrete feature).



¹ The results of this paper can be applied to regression problems by reinterpreting f_y(x) as the model's predicted value rather than its predicted probability.
² In more detail, here we sample coalitions from Eq. (1), where the probability assigned to each coalition is the combinatorial factor |S|!(n - |S| - 1)!/n!.
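As a quick check of the second footnote, the combinatorial factors indeed define a probability distribution over the coalitions S that can precede a given feature i in a random ordering:

```python
from itertools import combinations
from math import factorial

def shapley_weight(s, n):
    """Weight |S|!(n-|S|-1)!/n! of a coalition of size s in the
    Shapley sum for one feature i, out of n features in total."""
    return factorial(s) * factorial(n - s - 1) / factorial(n)

# Summing over all coalitions drawn from the other n-1 features
# gives exactly one, so Eq. (1) is an expectation over coalitions.
n = 6
others = range(n - 1)
total = sum(shapley_weight(len(S), n)
            for k in range(n)
            for S in combinations(others, k))
```

Each coalition size s contributes C(n-1, s) · s!(n-s-1)!/n! = 1/n, and the n sizes together sum to one.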



Figure 1: An MNIST digit, a coalition of pixels in a Shapley calculation, and 5 off-manifold splices.

Figure 2: (a) Vulnerability of off-manifold explanations to hidden model dependence. (b) Explanations of a fixed model compared to a model that is retrained on each Shapley coalition of features.

Using 10 binary drug-consumption features -Mushrooms, Ecstasy, etc. -we trained a random forest f to predict whether individuals had consumed an 11th drug: LSD. As the data contains just 10 binary features, we were able to empirically sample the conditional distributions in the on-manifold value function, Eq. (3). See Fig. 2(b) for the resulting off- and on-manifold global Shapley values.

Figure 3: An individual outlier (a), its off-and on-manifold explanations (b & c), the error rate in explanations (d), and the distribution of model outputs on Shapley coalitions (e & f).


Figure 4: Negative global Shapley values arise off-manifold, in both (a) synthetic and (b) real data.

Figs. 3(b) and 3(c) show the resulting local Shapley values for the example outlier from Fig. 3(a).

Fig. 4(b) displays global Shapley values for this model. The global Shapley values are positive on-manifold, consistent with their interpretation as the portion of model accuracy attributable to each feature. Off-manifold, however, a negative value results from placing too much weight on splices, e.g. with (x_0, x_1, y) = (0, 1, 1), that occur less frequently in the actual data. The negative value would erroneously indicate that x_1 is detrimental to the model's overall performance.

Fig. 4(c) displays global Shapley values for this model. (On-manifold values were computed using techniques developed in Sec. 4.1; see App. B for details.)

Figure 5: Validation of unsupervised and supervised techniques for computing on-manifold Shapley values. Comparison against empirical ground truth, which appeared as "On manifold" in Fig. 2(b).

Earlier sections presented two experiments using the scalable on-manifold methods developed here. In particular, Fig. 2(a) applied the unsupervised method to Census Income data, showing that on-manifold Shapley values detect hidden model dependence on sensitive features, and Fig. 4(c) applied both methods to Abalone data, showing that global Shapley values remain positive and interpretable on-manifold. In this section, we perform additional experiments to study the performance and stability of Sec. 4.1's methods, as well as their effectiveness on higher-dimensional data.

PERFORMANCE AND STABILITY

Our implementations of the unsupervised and supervised approaches to on-manifold Shapley values are summarised in Apps. A and B. Both approaches lead to broadly similar results. Fig. 5 compares the two techniques on the Drug Consumption data, where explanations are compared against the ground-truth empirical computation from Fig. 2(b).

APPROACHES TO ON-MANIFOLD SHAPLEY VALUES

In Sec. 3 we computed on-manifold Shapley values for simple data by estimating p(x′|x_S) from the empirical data distribution or, for synthetic data, by knowing this distribution analytically. Here we introduce two performant methods to compute on-manifold Shapley values on general data. Sec. 4.1 develops the theory underlying our methods, and Sec. 4.2 presents additional experimental results.

Table 1: Performance and stability, in terms of MSE, of the supervised and unsupervised approaches. Performance is compared with off-manifold splicing and, where accessible, the empirical optimum.

Table 2: Optimal hyperparameters found for computing on-manifold Shapley values.

DATA SET | METHOD | HIDDEN DIM. | LEARN. RATE | LATENT DIM. | MODES | β


The supervised approach, by contrast, must be retrained on each model, but it entails direct minimisation of the MSE. The supervised method is thus expected to achieve higher accuracy.
We confirmed this on all data sets studied in this paper; see Table 1 for a numerical comparison of the MSEs. In Table 1, central values indicate the test-set MSE achieved by each method. The table compares the unsupervised and supervised methods against off-manifold splicing, showing significant improvement over this baseline. Note that an MSE of zero is not achievable, because f_y(x) in Eq. (16) or (17) is not fully determined by the partial input x_S. For the Drug Consumption data, where we can compute p(x′|x_S) empirically, the optimal MSE happens to be 0.0436.

Uncertainties in Table 1 represent the standard deviation in test-set MSE upon repeating each method with fixed hyperparameters 10 times. (Uncertainties are absent for the off-manifold and empirical columns, as these do not involve training a separate model.) The table thus indicates that the supervised method offers increased stability as compared to the unsupervised approach.


The supervised method is more efficient as well: while the unsupervised technique estimates the value function by sampling from p(x′|x_S), the supervised approach learns the value function directly. The supervised method thus requires far fewer model evaluations to match the standard error of the unsupervised method: roughly 10 times fewer in our experiments.

EXAMPLE ON MNIST

To demonstrate on-manifold explainability on higher-dimensional data, we trained a fully connected network on binary MNIST (LeCun & Cortes, 2010) and explained random digits in Fig. 6(a). Despite having the same sum over pixels -as controlled by the local version of Eq. (5) -and explaining the same model prediction, each on-manifold explanation is more concentrated, with more interpretable structure, than its off-manifold counterpart. The handwritten strokes are clearly visible on-manifold, with key off-stroke regions highlighted as well. Off-manifold explanations generally display lower intensities spread less informatively across the digit-region.

These off-manifold explanations are a result of splices as in Fig. 1. With such unrealistic input, the model's output is uncontrolled and less informative. In fact, it is only on very large coalitions of pixels, subject to minimal splicing, that the model can make intelligent predictions off-manifold. This is confirmed in Fig. 6(b), which shows the average Shapley summand as a function of coalition size on MNIST. Note that primarily large coalitions underpin off-manifold explanations, whereas far fewer pixels are required on-manifold, consistent with the low-dimensional manifold underlying the data.

5. CONCLUSION

In this work, we made a careful study of the off-manifold problem in AI explainability. We presented important distinctions between on- and off-manifold explainability and provided experimental evidence for several novel shortcomings of the off-manifold approach. We then introduced two techniques to compute on-manifold Shapley values on general data: one technique learns to impute features on the data manifold, while the other learns the Shapley value-function directly. In so doing, we provided compelling evidence against the use of off-manifold explainability, and demonstrated that on-manifold Shapley values offer a viable approach to AI explainability in real-world contexts.

A IMPLEMENTATION DETAILS

For the unsupervised approach, we modelled the encoder q_φ(z|x) as a diagonal normal distribution with mean and variance determined by a neural network:

q_φ(z|x) = N(z; μ_φ(x), diag σ_φ²(x))    (18)

We modelled the decoder p_θ(x|z) as a product distribution:

p_θ(x|z) = Π_i p_θ(x_i|z)    (19)

where the distribution type (e.g. normal, categorical) of each x_i is chosen per-data-set and each distribution's parameters are determined by a shared neural network. We modelled the masked encoder r_ψ(z|x_S) as a Gaussian mixture:

r_ψ(z|x_S) = Σ_k π_ψ^(k)(x_S) N(z; μ_ψ^(k)(x_S), diag σ_ψ^(k)²(x_S))    (20)

To allow r_ψ(z|x_S) to accept variable-size coalitions x_S as input, we simply masked out-of-coalition features with a special value (-1) that never appears in the data.

The unsupervised method has several hyperparameters: β, which multiplies the regularisation term; the number of mixture components in Eq. (20); as well as the architecture and optimisation of the networks involved. For each experiment in this paper, we tuned hyperparameters to minimise the MSE of Eq. (16) on a held-out validation set; see App. B for numerical details.

For the supervised approach, we modelled g_y(x_S) using a neural network, again masking out-of-coalition features (with -1) to accommodate variable-size coalitions x_S. This method's hyperparameters, relating to architecture and optimisation, were similarly tuned to minimise the validation-set MSE; see App. B for details.

B DETAILS OF EXPERIMENTS

Here we provide numerical details for the experiments presented in the paper.

B.1 DRUG CONSUMPTION EXPERIMENT

On the Drug Consumption data from the UCI repository (Dua & Graff, 2017), we used 10 binary features from the data set -Mushrooms, Ecstasy, etc., as displayed in Fig. 5 -to predict whether individuals had ever consumed an 11th drug: LSD. The explanations of Fig. 2(b) and Fig. 5 describe a random forest fit with default sklearn parameters and max features = None, which achieves 82.2% test-set accuracy amidst a 57 : 43 class balance.

In Fig. 2(b), global off-manifold Shapley values were computed using 10^6 Monte Carlo samples of Eq. (4). For each labelled data point (x, y) sampled from the test set, a single permutation was drawn to estimate Eq. (1), and a single data point x′ was drawn to estimate the off-manifold value function, Eq. (2). In all the figures of this paper, bar height represents the mean that resulted from Monte Carlo sampling, and error bars display the standard error of the mean.

Global on-manifold Shapley values in Fig. 2(b) were computed similarly, but in this case using the on-manifold value function of Eq. (3). For each sampled coalition x_S, a random data point x′ was drawn from the test set, with the crucial requirement that x′_S = x_S. In the text, we refer to this as empirically estimating the conditional distribution p(x′|x_S). Such empirical estimation is only possible because this data set has a small number of all-binary features.

Tree SHAP values in Fig. 2(b) were computed with the SHAP package (Lundberg & Lee, 2017) with model output = margin and feature perturbation = tree path dependent.

The values labelled "Model retraining" in Fig. 2(b) were computed by fitting a separate random forest g_S for each coalition S of features in the data set: 2^10 models in all. We used these models to compute the sum of Eq. (10), where A(g_S) represents a variant of model g_S's accuracy: it is the accuracy achieved if one predicts labels by drawing stochastically from g_S's predicted probability distribution (as opposed to deterministically drawing the maximum-probability class).
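The empirical estimation of p(x′|x_S) described above can be sketched as follows (function names are ours; exact matching of coalitions is feasible here because the features are few and all binary):

```python
import numpy as np

def empirical_on_manifold_value(f_y, x, S, X_data, rng=None):
    """Eq. (3) with p(x'|x_S) estimated empirically: draw x' uniformly
    from the data points whose coalition features match x_S exactly."""
    rng = rng or np.random.default_rng(0)
    if len(S) == 0:
        matches = X_data
    else:
        matches = X_data[np.all(X_data[:, S] == x[S], axis=1)]
    x_prime = matches[rng.integers(len(matches))]
    return f_y(x_prime)
```

Averaging this single-draw estimate over many Monte Carlo samples recovers the on-manifold value function; on continuous or high-dimensional data, exact matching fails and the learnt models of Sec. 4.1 are required instead.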
The modified loss function thus becomes

    L' = L + (α / N) Σ_i [ f(x_i | do(sex = 0)) − f(x_i | do(sex = 1)) ]^2,

where L is the cross-entropy loss, f(x_i | do(sex = j)) denotes f evaluated on the data point x_i with its value for sex replaced by j, and α is a hyperparameter controlling the trade-off between optimising the accuracy and minimising the effect of sex. We fine-tuned the model for an additional 200 epochs with α = 3. The resulting model agrees with the baseline on over 98.5% of the data and has the same test-set accuracy. Shapley values for this model are labelled "Suppressed model" in Fig. 2(a).
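The do-intervention and penalty can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the names `predict_proba` and `sex_col`, and the squared-difference form of the penalty, are our assumptions.

```python
import numpy as np

def do_intervention(X, sex_col, value):
    """Copy X with the sex feature set to `value` in every row."""
    X_do = X.copy()
    X_do[:, sex_col] = value
    return X_do

def modified_loss(predict_proba, X, y, sex_col, alpha):
    """Cross-entropy plus an alpha-weighted penalty on the gap between the
    model's outputs under do(sex = 0) and do(sex = 1)."""
    p = np.clip(predict_proba(X), 1e-7, 1 - 1e-7)
    xent = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    gap = predict_proba(do_intervention(X, sex_col, 0)) \
        - predict_proba(do_intervention(X, sex_col, 1))
    return xent + alpha * np.mean(gap ** 2)
```

A model that never reads the sex column incurs zero penalty, so its modified loss equals its plain cross-entropy for any α; fine-tuning therefore only pushes on the model's implicit dependence on sex.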

B.3 ABALONE EXPERIMENT

The Abalone data set from the UCI repository (Dua & Graff, 2017) contains 8 features corresponding to physical measurements (see Fig. 4c), which we used to classify abalone as younger or older than the median age. We trained a neural network to perform this task (hidden layer size 100, default sklearn parameters, and early stopping), obtaining a test-set accuracy of 78%.

Shapley values in Fig. 4(c) were computed exactly as described in App. B.1, except that the supervised method involved training for 5k epochs. Optimised hyperparameters are given in Table 2.
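The one-permutation-per-sample estimator reused here from App. B.1 can be sketched as below. The function names are ours, and the value function is left pluggable; the demo value function is the empirical conditional of App. B.1, which is only feasible for a small number of binary features.

```python
import numpy as np

def empirical_value_fn(model, X_data, x, S):
    """On-manifold value function v(S), estimated empirically by averaging
    the model over data points x' with x'_S = x_S (App. B.1; feasible only
    for low-dimensional binary data)."""
    mask = np.all(X_data[:, S] == x[S], axis=1)  # all-True when S is empty
    return model(X_data[mask]).mean()

def sampled_shapley(value_fn, x, n_samples, rng):
    """One permutation per Monte Carlo sample. Returns each feature's mean
    contribution (bar height) and its standard error (error bar)."""
    d = len(x)
    contribs = np.zeros((n_samples, d))
    for t in range(n_samples):
        S = []
        v_prev = value_fn(x, S)
        for j in rng.permutation(d):  # add features in a random order
            v_next = value_fn(x, S + [j])
            contribs[t, j] = v_next - v_prev
            S, v_prev = S + [j], v_next
    sem = contribs.std(axis=0, ddof=1) / np.sqrt(n_samples)
    return contribs.mean(axis=0), sem
```

By construction each sampled permutation's contributions telescope, so the estimated Shapley values sum to f(x) minus the empty-coalition value, matching the efficiency axiom.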

B.4 MNIST EXPERIMENT

For binary MNIST (LeCun & Cortes, 2010), we trained a fully connected network (hidden layer size 512, default sklearn parameters, and early stopping) achieving 98% test-set accuracy. The digits in Fig. 6(a) were randomly drawn from the test set.

Shapley values in Fig. 6(a) were computed exactly as described in App. B.1, except that the supervised method involved training for 2k epochs, and the on-manifold explanations are based on 16k Monte Carlo samples per pixel. Optimised hyperparameters are given in Table 2. The on-manifold explanations in Fig. 6(a) were computed using the supervised method. While the unsupervised method does not appear in the figure, it was performed to complete Table 1.

The average uncertainty, which is not shown in Fig. 6(a), is roughly 0.002, stated as a fraction of the maximum Shapley value in each image.
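The uncertainty figure quoted above can be read as each pixel's standard error of the Monte Carlo mean, normalised by the image's largest absolute Shapley value. A minimal sketch under that reading (the function name is ours):

```python
import numpy as np

def normalised_uncertainty(samples):
    """samples: (n_samples, n_pixels) Monte Carlo contributions per pixel.
    Returns each pixel's standard error of the mean, stated as a fraction
    of the maximum absolute Shapley value in the image."""
    phi = samples.mean(axis=0)                                  # Shapley estimates
    sem = samples.std(axis=0, ddof=1) / np.sqrt(samples.shape[0])
    return sem / np.abs(phi).max()
```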

