Mechanistic Interpretability

Circuits, neurons, and causal analysis for understanding the internal behavior of transformers.




Foundations

| Year | Title | Citations* |
|------|-------|------------|
| 2020 | Zoom In: An Introduction to Circuits | 2 |
| 2021 | Causal Abstractions of Neural Networks | 8 |
| 2021 | A Mathematical Framework for Transformer Circuits | 2 |

*Only citations between papers in this repository are counted.


Circuits and Patching

| Year | Title | Citations* |
|------|-------|------------|
| 2020 | Investigating Gender Bias in Language Models Using Causal Mediation Analysis | 5 |
| 2022 | Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small | 9 |
| 2023 | A circuit for Python docstrings in a 4-layer attention-only transformer | 1 |
| 2023 | Localizing Model Behavior with Path Patching | 5 |
| 2023 | Towards Automated Circuit Discovery for Mechanistic Interpretability | 8 |
| 2023 | How does GPT-2 compute greater-than? | 4 |
| 2024 | Attribution Patching Outperforms Automated Circuit Discovery | 2 |

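The patching papers above (path patching, attribution patching) all build on the same basic operation: copy an internal activation from a "clean" run into a "corrupted" run and check how much of the clean output it restores. A minimal sketch on a toy two-layer network follows; all names, shapes, and weights here are illustrative stand-ins, not the setup of any listed paper.

```python
import numpy as np

# Toy stand-in for a model: two linear layers with a ReLU in between.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def forward(x, patch=None):
    """Run the toy model; if `patch` is given, overwrite the hidden
    activation with it (the core move of activation patching)."""
    h = np.maximum(x @ W1, 0.0)   # hidden activation
    if patch is not None:
        h = patch                 # patch in the activation from another run
    return h @ W2, h              # logits, cached hidden state

clean_x = rng.normal(size=4)
corrupt_x = rng.normal(size=4)

clean_logits, clean_h = forward(clean_x)
corrupt_logits, _ = forward(corrupt_x)
patched_logits, _ = forward(corrupt_x, patch=clean_h)

# In this toy case the hidden layer fully determines the output, so
# patching it restores the clean logits exactly.
print(np.allclose(patched_logits, clean_logits))  # True
```

In a real transformer one patches a single attention head or MLP output at a time and measures the change in a task-specific logit difference, which is how the indirect-object-identification and path-patching papers localize behavior to circuit components.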


Neurons and Knowledge Localization

| Year | Title | Citations* |
|------|-------|------------|
| 2022 | Task-specific Compression for Multi-task Language Models using Attribution-based Pruning | 1 |
| 2022 | Finding Skill Neurons in Pre-trained Transformer-based Language Models | 3 |
| 2023 | Task-Specific Skill Localization in Fine-tuned Language Models | 1 |
| 2023 | Language Models Can Explain Neurons in Language Models | 0 |
| 2023 | Large language models show human-like content biases in transmission chain experiments | 1 |
| 2025 | Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective | 1 |
