References
Reduan Achtibat, Maximilian Dreyer, Ilona Eisenbraun, Sebastian Bosse, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. From attribution maps to human-understandable explanations through concept relevance propagation. Nature Machine Intelligence, 5:1006–1019, 2022.
Yoann Poupart. LCZeroLens. 2024. URL: Xmaster6y/lczerolens.
Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 2015.
Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 2019.
Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Layer-wise relevance propagation: an overview. Explainable AI: interpreting, explaining and visualizing deep learning, pages 193–209, 2019.
Luca Andolina, Francesco Cagnetta, Alessandro Rudi, and Lorenzo Rosasco. Learning deep input-output stable systems. 2021. arXiv:2101.06252.
Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. Explaining nonlinear classification decisions with deep Taylor decomposition. ArXiv, 2015.
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: visualising image classification models and saliency maps. 2013. arXiv:1312.6034.
Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box: learning important features through propagating activation differences. 2016. arXiv:1605.01713.
Ramprasaath R. Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-CAM: visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128:336–359, 2016.
Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: the all convolutional net. 2014. arXiv:1412.6806.
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning. 2017.
Kedar Dhamdhere, Mukund Sundararajan, and Qiqi Yan. How important is a neuron? 2018. arXiv:1805.12233.
Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120:233–255, 2015.
Zhi Chen, Yijie Bei, and Cynthia Rudin. Concept whitening for interpretable image recognition. Nature Machine Intelligence, 2:772–782, 2020.
Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. 2020. arXiv:2005.00928.
Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. 2020. arXiv:2004.06643.
Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: perfect linear concept erasure in closed form. 2023. arXiv:2306.03819.
Maximilian Dreyer, Frederik Pahde, Christopher J. Anders, Wojciech Samek, and Sebastian Lapuschkin. From hope to safety: unlearning biases of deep models via gradient penalization in latent space. In AAAI Conference on Artificial Intelligence. 2023.
Amir massoud Farahmand, Csaba Szepesvári, and Jean-Yves Audibert. Manifold-adaptive dimension estimation. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, 265–272. 2007. URL: https://doi.org/10.1145/1273496.1273530, doi:10.1145/1273496.1273530.
Elena Facco, Maria d'Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1):12140, September 2017. URL: https://doi.org/10.1038/s41598-017-11873-y, doi:10.1038/s41598-017-11873-y.
Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. 2018. arXiv:1610.01644.
Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). 2018. arXiv:1711.11279.
Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition. 2023. arXiv:2312.06681.
Anna C. Gilbert and Kevin O'Neill. CA-PCA: manifold dimension estimation, adapted for curvature. 2023. arXiv:2309.13478.
Keinosuke Fukunaga and D. R. Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers, C-20(2):176–183, 1971. doi:10.1109/T-C.1971.223208.
Jochen Bruske and Gerald Sommer. Intrinsic dimensionality estimation with optimally topology preserving maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):572–575, 1998. doi:10.1109/34.682189.
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Neural Information Processing Systems. 2022.
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. ArXiv, 2023.
Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits. ArXiv, 2024.
Seul-Ki Yeom, Philipp Seegerer, Sebastian Lapuschkin, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Pruning by explaining: a novel criterion for deep neural network pruning. ArXiv, 2019.
Nicholas Pochinkov and Nandi Schoots. Dissecting language models: machine unlearning via selective pruning. ArXiv, 2024.
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. 2022. arXiv:2212.04089.