Explainable Artificial Intelligence
Reading Material
Additional recommended readings to accompany the lectures (updated regularly).
Slides
- Lecture 1: Overview and taxonomy of XAI
- Lecture 2: Feature importance methods (e.g., perturbation and saliency methods); a minimal occlusion sketch follows this list
- Lecture 3: Data attribution methods (e.g., influence functions, Data SHAP)
- Lecture 4: Inherently interpretable models (e.g., SENNs, ProtoPNets)
- Lecture 5: Concept-based XAI
- Lecture 6: Neural Symbolic Interpretability
- Lecture 7: Mechanistic Interpretability of (Early) Vision Models
- Lecture 8: Mechanistic Interpretability: Progress and Limits
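To ground Lecture 2's perturbation methods, here is a minimal sketch of occlusion-based feature importance: slide a baseline-valued patch over the input and record how much the target-class probability drops. Everything in it (`occlusion_map`, `toy_model`, the patch size, the baseline value) is an illustrative assumption, not course code.

```python
import numpy as np

def occlusion_map(model, x, target, patch=4, baseline=0.0):
    """Score each patch by the drop in the target-class probability
    when the patch is replaced with a baseline value."""
    h, w = x.shape
    base_score = model(x)[target]
    heatmap = np.zeros((h, w))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            x_occ = x.copy()
            x_occ[i:i + patch, j:j + patch] = baseline
            # A large drop means the occluded region mattered for the prediction.
            heatmap[i:i + patch, j:j + patch] = base_score - model(x_occ)[target]
    return heatmap

def toy_model(x):
    # Hypothetical two-class "classifier": softmax over the means of the
    # left and right halves of the image.
    logits = np.array([x[:, : x.shape[1] // 2].mean(),
                       x[:, x.shape[1] // 2:].mean()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = np.random.default_rng(0).random((8, 8))
print(occlusion_map(toy_model, x, target=0, patch=4).round(3))
```

Occlusion is the bluntest perturbation method; the saliency methods covered in the same lecture trade the sliding-window loop for a single gradient computation.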
Practical Lab Sessions
- Practical 1: Feature Attribution (due 17 Feb)
- Practical 2: Concept Bottleneck Models (due 3 Mar); see the CBM sketch after this list
- Practical 3: Mechanistic Interpretability (due 17 Mar); a toy sparse-autoencoder sketch follows the paper list below
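As a warm-up for Practical 2, a minimal PyTorch sketch of a jointly trained concept bottleneck model: an encoder predicts annotated concepts, and the label head sees only those concepts. All names, sizes, the loss weight, and the toy data are placeholder assumptions, not the practical's starter code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneck(nn.Module):
    def __init__(self, n_features, n_concepts, n_classes):
        super().__init__()
        self.concept_net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, n_concepts))
        # The label head sees only the (sigmoided) concept predictions.
        self.label_net = nn.Linear(n_concepts, n_classes)

    def forward(self, x):
        c_logits = self.concept_net(x)
        y_logits = self.label_net(torch.sigmoid(c_logits))
        return c_logits, y_logits

model = ConceptBottleneck(n_features=10, n_concepts=4, n_classes=3)
x = torch.randn(16, 10)                        # toy inputs
c_true = torch.randint(0, 2, (16, 4)).float()  # toy binary concept labels
y_true = torch.randint(0, 3, (16,))            # toy task labels

c_logits, y_logits = model(x)
# Joint objective: task loss plus a weighted concept-supervision term.
loss = F.cross_entropy(y_logits, y_true) \
     + 0.5 * F.binary_cross_entropy_with_logits(c_logits, c_true)
loss.backward()
```

Because the label depends only on the concept layer, one can intervene on predicted concepts at test time and observe how the label changes; the intervention papers in the list below start from exactly this property.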
Mini Project Papers
Here is the form to submit your ranked top-5 paper preferences for the mini project.
- [Empirical, Concepts] Fong, Ruth, and Andrea Vedaldi. "Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
- [Perturbation, Post-hoc] Singla, Sumedha, et al. "Explanation by Progressive Exaggeration." International Conference on Learning Representations (2020).
- [Saliency, Post-hoc, Adversarial] Dombrowski, Ann-Kathrin, et al. "Explanations can be manipulated and geometry is to blame." Advances in Neural Information Processing Systems 32 (2019).
- [Saliency Application, Generalisation] Kim, Jang-Hyun, Wonho Choo, and Hyun Oh Song. "Puzzle Mix: Exploiting saliency and local statistics for optimal mixup." International Conference on Machine Learning. PMLR, 2020.
- [Saliency Application, Fairness] Asgari, Saeid, et al. "MaskTune: Mitigating spurious correlations by forcing to explore." Advances in Neural Information Processing Systems 35 (2022): 23284-23296.
- [Sample Importance, Prototypes, Bayesian Modelling] Kim, Been, Rajiv Khanna, and Oluwasanmi O. Koyejo. "Examples are not enough, learn to criticize! Criticism for interpretability." Advances in Neural Information Processing Systems 29 (2016).
- [Concepts, Post-hoc] Crabbé, Jonathan, and Mihaela van der Schaar. "Concept activation regions: A generalized framework for concept-based explanations." Advances in Neural Information Processing Systems 35 (2022): 2590-2607.
- [IntArch, Trees] Wu, Mike, et al. "Beyond Sparsity: Tree Regularization of Deep Models for Interpretability." AAAI 2018: 1670-1678.
- [IntArch, Concepts, Unsup, LLMs] Yang, Yue, et al. "Language in a bottle: Language model guided concept bottlenecks for interpretable image classification." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
- [IntArch, Concepts, Interventions, OOD] Espinosa Zarlenga, Mateo, et al. "Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts." ICML (2025).
- [IntArch, Concepts, Interventions] Vandenhirtz, Moritz, et al. "Stochastic Concept Bottleneck Models." The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024).
- [IntArch, Causality, Concepts] Dominici, Gabriele, et al. "Causal concept graph models: Beyond causal opacity in deep learning." ICLR (2025).
- [IntArch, Prototypes] Ma, Chiyu, et al. "This looks like those: Illuminating prototypical concepts using multiple visualizations." Advances in Neural Information Processing Systems 36 (2023): 39212-39235.
- [Influence, Post-hoc] Bae, Juhan, et al. "If Influence Functions are the Answer, Then What is the Question?." Advances in Neural Information Processing Systems 35 (2022): 17953-17967.
- [Influence, Post-hoc, Issues] Basu, Samyadeep, Phil Pope, and Soheil Feizi. "Influence Functions in Deep Learning Are Fragile." International Conference on Learning Representations (2021).
- [NeSy, Logic] Barbiero, Pietro, et al. "Entropy-based logic explanations of neural networks." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. No. 6. 2022.
- [NeSy, Generative Modelling] Misino, Eleonora, Giuseppe Marra, and Emanuele Sansone. "VAEL: Bridging variational autoencoders and probabilistic logic programming." Advances in Neural Information Processing Systems 35 (2022): 4667-4679.
- [Counterfactuals, Post-hoc] Mothilal, Ramaravind K., Amit Sharma, and Chenhao Tan. "Explaining machine learning classifiers through diverse counterfactual explanations." Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 2020.
- [Counterfactuals, Post-hoc] Altmeyer, Patrick, et al. "Faithful Model Explanations through Energy-Constrained Conformal Counterfactuals." AAAI (2024).
- [MechInt, LLMs] Wang, Kevin Ro, et al. "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small." International Conference on Learning Representations (2023).
- [MechInt, LLMs] Meng, Kevin, et al. "Locating and editing factual associations in GPT." Advances in Neural Information Processing Systems 35 (2022): 17359-17372.
- [MechInt, SAE, Concepts] Rao, Sukrut, et al. "Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
- [MechInt, SAE] Rajamanoharan, Senthooran, et al. "Improving dictionary learning with gated sparse autoencoders." arXiv (2024).
- [MechInt, Circuits] Marks, Samuel, et al. "Sparse feature circuits: Discovering and editing interpretable causal graphs in language models." ICML (2025).
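Finally, for Practical 3 and the [MechInt, SAE] papers above, a toy sparse autoencoder over stand-in activations: reconstruct each activation vector through an overcomplete dictionary while an L1 penalty pushes most codes to zero. The dimensions, the sparsity coefficient, and the random "activations" are assumptions for illustration; real pipelines train on activations cached from a particular layer of a real model.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # overcomplete: d_dict >> d_model
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(codes)
        return codes, recon

sae = SparseAutoencoder(d_model=64, d_dict=512)
acts = torch.randn(128, 64)  # stand-in for cached residual-stream activations
codes, recon = sae(acts)
# Reconstruction error plus an L1 penalty that drives most codes to zero.
loss = ((recon - acts) ** 2).mean() + 1e-3 * codes.abs().sum(dim=-1).mean()
loss.backward()
```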