Bias in LLMs
Benchmarks, datasets, and bias mitigation methods for language models.
Benchmarks and Datasets
| Year | Title | Method type | Citations* |
|---|---|---|---|
| 2020 | Social Bias Frames: Reasoning about Social and Power Implications of Language | Benchmark / Dataset | 4 |
| 2021 | StereoSet: Measuring stereotypical bias in pretrained language models | Benchmark / Dataset | 19 |
| 2020 | RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models | Benchmark / Dataset | 7 |
| 2021 | Bot-Adversarial Dialogue for Safe Conversational Agents | Benchmark / Dataset | 2 |
| 2021 | TruthfulQA: Measuring How Models Mimic Human Falsehoods | Benchmark / Dataset | 6 |
| 2021 | BBQ: A hand-built bias benchmark for question answering | Benchmark / Dataset | 12 |
| 2022 | ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection | Benchmark / Dataset | 0 |
| 2022 | “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset | Benchmark / Dataset | 3 |
| 2023 | Nationality Bias in Text Generation | Benchmark / Dataset | 1 |
| 2023 | HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models | Benchmark / Dataset | 0 |
| 2025 | Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs | Benchmark / Dataset | 0 |
| 2025 | BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses | Benchmark / Dataset | 0 |
*Only counting citations among papers in this repository.
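Several of the benchmarks above (StereoSet most directly, and CrowS-Pairs in the dataset-frequency table further down) score bias by comparing the likelihood a model assigns to a stereotypical versus an anti-stereotypical sentence. The sketch below illustrates that pairwise idea for a causal LM; it is a simplified illustration, not the official scoring of any benchmark listed here, and the model name ("gpt2") and sentence pairs are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM on the Hub would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Total log-probability the model assigns to the sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean negative log-likelihood over the shifted targets;
    # undo the averaging to get a sum comparable across minimal pairs.
    n_targets = enc["input_ids"].shape[1] - 1
    return -out.loss.item() * n_targets

# Hypothetical (stereotypical, anti-stereotypical) minimal pairs.
pairs = [
    ("The nurse said she was tired.", "The nurse said he was tired."),
    ("The engineer showed his design.", "The engineer showed her design."),
]
stereo_wins = sum(
    sentence_log_likelihood(stereo) > sentence_log_likelihood(anti)
    for stereo, anti in pairs
)
print(f"Stereotype preference rate: {stereo_wins / len(pairs):.2f}")
```

A preference rate near 0.5 would indicate no systematic preference for the stereotypical member of each pair; the published benchmarks use their own, more careful metrics (e.g., pseudo-log-likelihood for masked LMs).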
Mitigation Methods
| Year | Title | Method type | Citations* |
|---|---|---|---|
| 2021 | FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders | Fine-tuning | 0 |
| 2021 | Sustainable Modular Debiasing of Language Models | Adapters / PEFT | 1 |
| 2021 | An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models | Evaluation / analysis | 18 |
| 2022 | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Alignment / RLHF | 8 |
| 2022 | Debiasing Pre-Trained Language Models via Efficient Fine-Tuning | Fine-tuning | 8 |
| 2022 | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Evaluation / analysis | 3 |
| 2022 | MABEL: Attenuating Gender Bias using Textual Entailment Data | Data augmentation | 10 |
| 2023 | BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Alignment / RLHF | 0 |
| 2023 | D-CALM: A Dynamic Clustering-based Active Learning Approach for Mitigating Bias | Other | 1 |
| 2023 | An Empirical Analysis of Parameter-Efficient Methods for Debiasing Pre-Trained Language Models | Adapters / PEFT | 5 |
| 2023 | Language Models Get a Gender Makeover: Mitigating Gender Bias with Few-Shot Data Interventions | Data augmentation | 1 |
| 2023 | Causal-Debias: Unifying Debiasing in Pretrained Language Models via Causal Invariant Learning | Other | 2 |
| 2023 | Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination | Weight / neuron editing | 8 |
| 2024 | Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes | Inference-time | 10 |
| 2024 | ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs | Data augmentation | 0 |
| 2025 | Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias | Evaluation / analysis | 0 |
| 2025 | BiasEdit: Debiasing Stereotyped Language Models via Model Editing | Weight / neuron editing | 6 |
| 2025 | FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering | Inference-time | 2 |
| 2025 | BiasFilter: An Inference-Time Debiasing Framework for Large Language Models | Inference-time | 1 |
| 2025 | Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective | Evaluation / analysis | 1 |
| 2025 | Debiasing the Fine-Grained Classification Task in LLMs with Bias-Aware PEFT | Adapters / PEFT | 2 |
| 2025 | BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them | Evaluation / analysis | 0 |
| 2025 | LLM Bias Detection and Mitigation through the Lens of Desired Distributions | Other | 0 |
| 2026 | No Free Lunch in Language Model Bias Mitigation? | Evaluation / analysis | 0 |
| 2026 | KnowBias: Mitigating Social Bias in LLMs via Know-Bias Neuron Enhancement | Weight / neuron editing | 0 |
*Only counting citations among papers in this repository.
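Several entries above fall under the "Inference-time" category, where the base model's weights are left untouched and bias is addressed only at generation time. The sketch below shows the simplest form of that idea, a debiasing instruction prepended to the prompt; it is an illustration under assumptions (placeholder model and instruction text), not the method of any specific paper listed here.

```python
from transformers import pipeline

# Placeholder model; an instruction-following LLM would normally be used here.
generator = pipeline("text-generation", model="gpt2")

DEBIAS_PREFIX = (
    "Respond without relying on stereotypes about gender, nationality, "
    "religion, age, or other social groups.\n\n"
)

def debiased_generate(prompt: str, max_new_tokens: int = 50) -> str:
    """Generate a continuation with a debiasing instruction prepended."""
    out = generator(DEBIAS_PREFIX + prompt,
                    max_new_tokens=max_new_tokens,
                    do_sample=False)
    return out[0]["generated_text"]

print(debiased_generate("The new engineer walked into the room and everyone assumed"))
```

The methods in the table go well beyond this (e.g., filtering or re-ranking candidate outputs, or steering hidden activations), but they share the property of requiring no further training of the base model.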
Statistics
By method type
| Method type | # of papers |
|---|---|
| Benchmark / Dataset | 12 |
| Evaluation / analysis | 6 |
| Data augmentation | 3 |
| Adapters / PEFT | 3 |
| Weight / neuron editing | 3 |
| Inference-time | 3 |
| Alignment / RLHF | 2 |
| Fine-tuning | 2 |
| Other | 3 |
| Total | 37 |
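These counts can be recomputed mechanically if the paper list is kept in tabular form. The sketch below assumes a hypothetical papers.csv export of the tables above with a method_type column; the file name and schema are assumptions, not part of the repository.

```python
import pandas as pd

# Hypothetical export of the paper tables above (one row per paper).
papers = pd.read_csv("papers.csv")

# Count papers per method type and verify the total.
counts = papers["method_type"].value_counts()
print(counts)
print("Total:", counts.sum())
```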
Dataset frequency in method papers
Number of mitigation papers (out of 24) that use each dataset.
| Dataset | Papers using it |
|---|---|
| StereoSet | 17 |
| WinoBias | 12 |
| CrowS-Pairs | 11 |
| BBQ | 10 |
| GLUE | 8 |
| SEAT | 5 |
| BOLD | 3 |
| MMLU | 2 |
| BiasFreeBench, STS-B, WNC, SentiBias, SNLI/MultiNLI, WEAT, HH-RLHF, FairFace, CUB-200, Stanford Cars, Food-101, MRPC, RTE, QNLI, WikiText-2 | 1 each |
Methods that measure overall model quality
Among the 18 papers that propose an active mitigation method (excluding the 6 purely evaluation/analysis papers).
| Measures general quality | # of papers | Papers |
|---|---|---|
| Yes | 11 | FairFil, Gira, MABEL, RLHF-Asst., Lauscher, Xie, Yang, Causal-Debias, BiasEdit, Zhao, KnowBias |
| No | 7 | Thakur, D-CALM, Gallegos, Han, BiasFilter, FairSteer, Shrestha |