Mechanistic Interpretability

Circuits, neurons, and causal analysis for understanding the internal behavior of transformers.

Foundations

Year	Title	Summary	Citations*
2020	Zoom In: An Introduction to Circuits	View	2
2021	Causal Abstractions of Neural Networks	View	8
2021	A Mathematical Framework for Transformer Circuits	View	2

*Only citations among papers in the repository.

Year	Title	Summary	Citations*
2020	Investigating Gender Bias in Language Models Using Causal Mediation Analysis	View	5
2022	Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small	View	9
2023	A circuit for Python docstrings in a 4-layer attention-only transformer	View	1
2023	Localizing Model Behavior with Path Patching	View	5
2023	Towards Automated Circuit Discovery for Mechanistic Interpretability	View	8
2023	How does GPT-2 compute greater-than?	View	4
2024	Attribution Patching Outperforms Automated Circuit Discovery	View	2

*Only citations among papers in the repository.

Year	Title	Summary	Citations*
2022	Task-specific Compression for Multi-task Language Models using Attribution-based Pruning	View	1
2022	Finding Skill Neurons in Pre-trained Transformer-based Language Models	View	3
2023	Task-Specific Skill Localization in Fine-tuned Language Models	View	1
2023	Language Models Can Explain Neurons in Language Models	View	0
2023	Large language models show human-like content biases in transmission chain experiments	View	1
2025	Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective	View	1

*Only citations among papers in the repository.