Mechanistic Interpretability
Circuits, neurons, and causal analysis for understanding the internal behavior of transformers.
Foundations
| Year | Title | Citations* |
|---|---|---|
| 2020 | Zoom In: An Introduction to Circuits | 2 |
| 2021 | Causal Abstractions of Neural Networks | 8 |
| 2021 | A Mathematical Framework for Transformer Circuits | 2 |
*Citation counts include only citations among papers in this repository.
Circuits and Patching
| Year | Title | Citations* |
|---|---|---|
| 2020 | Investigating Gender Bias in Language Models Using Causal Mediation Analysis | 5 |
| 2022 | Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | 9 |
| 2023 | A circuit for Python docstrings in a 4-layer attention-only transformer | 1 |
| 2023 | Localizing Model Behavior with Path Patching | 5 |
| 2023 | Towards Automated Circuit Discovery for Mechanistic Interpretability | 8 |
| 2023 | How does GPT-2 compute greater-than? | 4 |
| 2024 | Attribution Patching Outperforms Automated Circuit Discovery | 2 |
*Citation counts include only citations among papers in this repository.
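The patching papers above share one causal recipe: run the model on a corrupted input, overwrite a single internal activation with its value from a clean run, and measure how much of the clean behavior is restored. A minimal sketch of that loop on a toy two-layer MLP (the model, weights, and inputs are illustrative, not taken from any listed paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))

def forward(x, patch=None):
    """Toy 2-layer MLP; optionally overwrite one hidden activation."""
    h = np.maximum(x @ W1, 0.0)      # hidden activations (ReLU)
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value               # the activation-patching step
    return h @ W2                    # logits

x_clean = np.array([1.0, 0.5, -0.3, 2.0])
x_corrupt = np.zeros(4)

# Cache clean activations, then restore them one unit at a time
# into the corrupted run and record the change in the first logit.
h_clean = np.maximum(x_clean @ W1, 0.0)
baseline = forward(x_corrupt)[0]
effects = [forward(x_corrupt, patch=(i, h_clean[i]))[0] - baseline
           for i in range(8)]
```

Units whose restoration moves the logit furthest back toward the clean run are the candidate circuit components; real experiments apply the same idea per attention head or per residual-stream position.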
Neurons and Knowledge Localization
| Year | Title | Citations* |
|---|---|---|
| 2022 | Task-specific Compression for Multi-task Language Models using Attribution-based Pruning | 1 |
| 2022 | Finding Skill Neurons in Pre-trained Transformer-based Language Models | 3 |
| 2023 | Task-Specific Skill Localization in Fine-tuned Language Models | 1 |
| 2023 | Language Models Can Explain Neurons in Language Models | 0 |
| 2023 | Large language models show human-like content biases in transmission chain experiments | 1 |
| 2025 | Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective | 1 |
*Citation counts include only citations among papers in this repository.
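A common scoring rule behind the attribution-based pruning and skill-neuron work above is activation × gradient: each neuron is credited with its activation times the gradient of the target output with respect to that activation. A toy sketch under that assumption (model and data are made up; for a linear readout the gradient is exact and the scores sum to the logit):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 2))

def neuron_attributions(x, target=0):
    """Score each hidden neuron as activation * gradient of the target logit."""
    h = np.maximum(x @ W1, 0.0)   # hidden activations (ReLU)
    grad = W2[:, target]          # d(logit_target)/d(h), exact for this model
    return h * grad               # per-neuron attribution scores

x = np.array([1.0, -0.5, 0.3, 2.0])
scores = neuron_attributions(x)
top = np.argsort(-np.abs(scores))[:3]   # top-3 candidate "skill neurons"
```

Pruning variants keep only the highest-scoring neurons for a task; skill-neuron analyses check whether the same neurons rank highly across many inputs from that task.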