Explainable Artificial Intelligence
Reading Material
Additional recommended readings to accompany the lectures (updated regularly).
Slides
- Lecture 1: Overview and taxonomy of XAI
- Lecture 2: Feature importance methods (e.g., perturbation and saliency methods); a minimal occlusion sketch follows this list
- Lecture 3: Data attribution methods (e.g., influence functions, Data SHAP)
- Lecture 4: Inherently interpretable models (e.g., SENNs, ProtoPNets)
- Lecture 5: Concept-based XAI
- Lecture 6: Neural Symbolic Interpretability
- Lecture 7: Mechanistic Interpretability of (Early) Vision Models
- Lecture 8: Mechanistic Interpretability: Progress and Limits
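To ground Lecture 2's perturbation methods, here is a minimal sketch of occlusion-based feature importance: slide a baseline-valued patch over the input and record how much the target-class probability drops. Everything in it (`occlusion_map`, `toy_model`, the patch size, the baseline value) is an illustrative assumption, not course code.

```python
import numpy as np

def occlusion_map(model, x, target, patch=4, baseline=0.0):
    """Score each patch by the drop in the target-class probability
    when the patch is replaced with a baseline value."""
    h, w = x.shape
    base_score = model(x)[target]
    heatmap = np.zeros((h, w))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            x_occ = x.copy()
            x_occ[i:i + patch, j:j + patch] = baseline
            # A large drop means the occluded region mattered for the prediction.
            heatmap[i:i + patch, j:j + patch] = base_score - model(x_occ)[target]
    return heatmap

def toy_model(x):
    # Hypothetical two-class "classifier": softmax over the means of the
    # left and right halves of the image.
    logits = np.array([x[:, : x.shape[1] // 2].mean(),
                       x[:, x.shape[1] // 2:].mean()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = np.random.default_rng(0).random((8, 8))
print(occlusion_map(toy_model, x, target=0, patch=4).round(3))
```

Occlusion is the bluntest perturbation method; the saliency methods covered in the same lecture trade the sliding-window loop for a single gradient computation.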
Practical Lab Sessions
- Practical 1: Feature Attribution (due 17 Feb)
- Practical 2: Concept Bottleneck Models (due 3 Mar); see the CBM sketch after this list
- Practical 3: Mechanistic Interpretability (due 17 Mar); a toy sparse-autoencoder sketch follows the paper list below
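As a warm-up for Practical 2, a minimal PyTorch sketch of a jointly trained concept bottleneck model: an encoder predicts annotated concepts, and the label head sees only those concepts. All names, sizes, the loss weight, and the toy data are placeholder assumptions, not the practical's starter code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneck(nn.Module):
    def __init__(self, n_features, n_concepts, n_classes):
        super().__init__()
        self.concept_net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, n_concepts))
        # The label head sees only the (sigmoided) concept predictions.
        self.label_net = nn.Linear(n_concepts, n_classes)

    def forward(self, x):
        c_logits = self.concept_net(x)
        y_logits = self.label_net(torch.sigmoid(c_logits))
        return c_logits, y_logits

model = ConceptBottleneck(n_features=10, n_concepts=4, n_classes=3)
x = torch.randn(16, 10)                        # toy inputs
c_true = torch.randint(0, 2, (16, 4)).float()  # toy binary concept labels
y_true = torch.randint(0, 3, (16,))            # toy task labels

c_logits, y_logits = model(x)
# Joint objective: task loss plus a weighted concept-supervision term.
loss = F.cross_entropy(y_logits, y_true) \
     + 0.5 * F.binary_cross_entropy_with_logits(c_logits, c_true)
loss.backward()
```

Because the label depends only on the concept layer, one can intervene on predicted concepts at test time and observe how the label changes; the intervention papers in the list below start from exactly this property.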
Mini Project Papers
Here is the form to submit your ranked top-5 paper preferences for the mini project.
- [Empirical, Concepts] Fong, Ruth, and Andrea Vedaldi. "Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
- [Perturbation, Post-hoc] Singla, Sumedha, et al. "Explanation by Progressive Exaggeration." International Conference on Learning Representations (2020).
- [Saliency, Post-hoc, Adversarial] Dombrowski, Ann-Kathrin, et al. "Explanations can be manipulated and geometry is to blame." Advances in Neural Information Processing Systems 32 (2019).
- [Saliency Application, Generalisation] Kim, Jang-Hyun, Wonho Choo, and Hyun Oh Song. "Puzzle Mix: Exploiting saliency and local statistics for optimal mixup." International Conference on Machine Learning. PMLR, 2020.
- [Saliency Application, Fairness] Asgari, Saeid, et al. "MaskTune: Mitigating spurious correlations by forcing to explore." Advances in Neural Information Processing Systems 35 (2022): 23284-23296.
- [Sample Importance, Prototypes, Bayesian Modelling] Kim, Been, Rajiv Khanna, and Oluwasanmi O. Koyejo. "Examples are not enough, learn to criticize! Criticism for interpretability." Advances in Neural Information Processing Systems 29 (2016).
- [Concepts, Post-hoc] Crabbé, Jonathan, and Mihaela van der Schaar. "Concept activation regions: A generalized framework for concept-based explanations." Advances in Neural Information Processing Systems 35 (2022): 2590-2607.
- [IntArch, Trees] Wu, Mike, et al. "Beyond Sparsity: Tree Regularization of Deep Models for Interpretability." AAAI 2018: 1670-1678.
- [IntArch, Concepts, Unsup, LLMs] Yang, Yue, et al. "Language in a bottle: Language model guided concept bottlenecks for interpretable image classification." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
- [IntArch, Concepts, Interventions, OOD] Espinosa Zarlenga, Mateo, et al. "Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts." ICML (2025).
- [IntArch, Concepts, Interventions] Vandenhirtz, Moritz, et al. "Stochastic Concept Bottleneck Models." The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024).
- [IntArch, Causality, Concepts] Dominici, Gabriele, et al. "Causal concept graph models: Beyond causal opacity in deep learning." ICLR (2025).
- [IntArch, Prototypes] Ma, Chiyu, et al. "This looks like those: Illuminating prototypical concepts using multiple visualizations." Advances in Neural Information Processing Systems 36 (2023): 39212-39235.
- [Influence, Post-hoc] Bae, Juhan, et al. "If Influence Functions are the Answer, Then What is the Question?." Advances in Neural Information Processing Systems 35 (2022): 17953-17967.
- [Influence, Post-hoc, Issues] Basu, Samyadeep, Phil Pope, and Soheil Feizi. "Influence Functions in Deep Learning Are Fragile." International Conference on Learning Representations (2021).
- [NeSy, Logic] Barbiero, Pietro, et al. "Entropy-based logic explanations of neural networks." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. No. 6. 2022.
- [NeSy, Generative Modelling] Misino, Eleonora, Giuseppe Marra, and Emanuele Sansone. "VAEL: Bridging variational autoencoders and probabilistic logic programming." Advances in Neural Information Processing Systems 35 (2022): 4667-4679.
- [Counterfactuals, Post-hoc] Mothilal, Ramaravind K., Amit Sharma, and Chenhao Tan. "Explaining machine learning classifiers through diverse counterfactual explanations." Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 2020.
- [Counterfactuals, Post-hoc] Altmeyer, Patrick, et al. "Faithful Model Explanations through Energy-Constrained Conformal Counterfactuals." AAAI (2024).
- [MechInt, LLMs] Wang, Kevin Ro, et al. "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small." International Conference on Learning Representations (2023).
- [MechInt, LLMs] Meng, Kevin, et al. "Locating and editing factual associations in GPT." Advances in Neural Information Processing Systems 35 (2022): 17359-17372.
- [MechInt, SAE, Concepts] Rao, Sukrut, et al. "Discover-then-name: Task-agnostic concept bottlenecks via automated concept discovery." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.
- [MechInt, SAE] Rajamanoharan, Senthooran, et al. "Improving dictionary learning with gated sparse autoencoders." arXiv (2024).
- [MechInt, Circuits] Marks, Samuel, et al. "Sparse feature circuits: Discovering and editing interpretable causal graphs in language models." ICML (2025).
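Finally, for Practical 3 and the [MechInt, SAE] papers above, a toy sparse autoencoder over stand-in activations: reconstruct each activation vector through an overcomplete dictionary while an L1 penalty pushes most codes to zero. The dimensions, the sparsity coefficient, and the random "activations" are assumptions for illustration; real pipelines train on activations cached from a particular layer of a real model.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # overcomplete: d_dict >> d_model
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(codes)
        return codes, recon

sae = SparseAutoencoder(d_model=64, d_dict=512)
acts = torch.randn(128, 64)  # stand-in for cached residual-stream activations
codes, recon = sae(acts)
# Reconstruction error plus an L1 penalty that drives most codes to zero.
loss = ((recon - acts) ** 2).mean() + 1e-3 * codes.abs().sum(dim=-1).mean()
loss.backward()
```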