References

[1]

Reduan Achtibat, Maximilian Dreyer, Ilona Eisenbraun, Sebastian Bosse, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. From attribution maps to human-understandable explanations through concept relevance propagation. Nature Machine Intelligence, 5:1006–1019, 2022.

[2]

Yoann Poupart. LCZeroLens. 2024. URL: Xmaster6y/lczerolens.

[3]

Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 2015.

[4]

Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 2019.

[5]

Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Layer-wise relevance propagation: an overview. Explainable AI: interpreting, explaining and visualizing deep learning, pages 193–209, 2019.

[6]

Luca Andolina, Francesco Cagnetta, Alessandro Rudi, and Lorenzo Rosasco. Learning deep input-output stable systems. 2021. arXiv:2101.06252.

[7]

Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. Explaining nonlinear classification decisions with deep Taylor decomposition. ArXiv, 2015.

[8]

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: visualising image classification models and saliency maps. 2013. arXiv:1312.6034.

[9]

Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: learning important features through propagating activation differences. 2016. arXiv:1605.01713.

[10]

Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-CAM: visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128:336–359, 2016.

[11]

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: the all convolutional net. 2014. arXiv:1412.6806.

[12]

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning. 2017.

[13]

Kedar Dhamdhere, Mukund Sundararajan, and Qiqi Yan. How important is a neuron? 2018. arXiv:1805.12233.

[14]

Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120:233–255, 2015.

[15]

Zhi Chen, Yijie Bei, and Cynthia Rudin. Concept whitening for interpretable image recognition. Nature Machine Intelligence, 2:772–782, 2020.

[16]

Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. 2020. arXiv:2005.00928.

[17]

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. 2020. arXiv:2004.06643.

[18]

Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: perfect linear concept erasure in closed form. 2023. arXiv:2306.03819.

[19]

Maximilian Dreyer, Frederik Pahde, Christopher J. Anders, Wojciech Samek, and Sebastian Lapuschkin. From hope to safety: unlearning biases of deep models via gradient penalization in latent space. In AAAI Conference on Artificial Intelligence. 2023.

[20]

Amir massoud Farahmand, Csaba Szepesvári, and Jean-Yves Audibert. Manifold-adaptive dimension estimation. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, 265–272. 2007. URL: https://doi.org/10.1145/1273496.1273530, doi:10.1145/1273496.1273530.

[21]

Elena Facco, Maria d'Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1):12140, September 2017. URL: https://doi.org/10.1038/s41598-017-11873-y, doi:10.1038/s41598-017-11873-y.

[22]

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. 2018. arXiv:1610.01644.

[23]

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). 2018. arXiv:1711.11279.

[24]

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition. 2023. arXiv:2312.06681.

[25]

Anna C. Gilbert and Kevin O'Neill. CA-PCA: manifold dimension estimation, adapted for curvature. 2023. arXiv:2309.13478.

[26]

Keinosuke Fukunaga and D. R. Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers, C-20(2):176–183, 1971. doi:10.1109/T-C.1971.223208.

[27]

Jochen Bruske and Gerald Sommer. Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):572–575, 1998. doi:10.1109/34.682189.

[28]

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Neural Information Processing Systems. 2022.

[29]

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. ArXiv, 2023.

[30]

Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits. ArXiv, 2024.

[31]

Seul-Ki Yeom, Philipp Seegerer, Sebastian Lapuschkin, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Pruning by explaining: a novel criterion for deep neural network pruning. ArXiv, 2019.

[32]

Nicholas Pochinkov and Nandi Schoots. Dissecting language models: machine unlearning via selective pruning. ArXiv, 2024.

[33]

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. 2022. arXiv:2212.04089.