Bias in LLMs

Benchmarks, datasets, and bias-mitigation methods for language models.




Benchmarks and Datasets

| Year | Title | Method type | Citations* |
|------|-------|-------------|------------|
| 2020 | Social Bias Frames: Reasoning about Social and Power Implications of Language | Benchmark / Dataset | 4 |
| 2021 | StereoSet: Measuring stereotypical bias in pretrained language models | Benchmark / Dataset | 19 |
| 2020 | RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models | Benchmark / Dataset | 7 |
| 2021 | Bot-Adversarial Dialogue for Safe Conversational Agents | Benchmark / Dataset | 2 |
| 2021 | TruthfulQA: Measuring How Models Mimic Human Falsehoods | Benchmark / Dataset | 6 |
| 2021 | BBQ: A hand-built bias benchmark for question answering | Benchmark / Dataset | 12 |
| 2022 | ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection | Benchmark / Dataset | 0 |
| 2022 | “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset | Benchmark / Dataset | 3 |
| 2023 | Nationality Bias in Text Generation | Benchmark / Dataset | 1 |
| 2023 | HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models | Benchmark / Dataset | 0 |
| 2025 | Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs | Benchmark / Dataset | 0 |
| 2025 | BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses | Benchmark / Dataset | 0 |

*Counts only citations among papers in this repository.
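Pairwise stereotype benchmarks such as CrowS-Pairs and StereoSet's intrasentence task share a simple scoring recipe: compare the model's likelihood of a stereotypical sentence against a minimally edited anti-stereotypical counterpart, and report how often the model prefers the stereotype (0.5 is the unbiased ideal). A minimal sketch of that metric, with placeholder log-likelihoods standing in for a real model:

```python
def stereotype_score(pairs, loglik):
    """Fraction of (stereo, anti) pairs where the model assigns a higher
    log-likelihood to the stereotypical sentence; 0.5 is the ideal."""
    preferred = sum(1 for stereo, anti in pairs if loglik[stereo] > loglik[anti])
    return preferred / len(pairs)

# Placeholder scores -- in practice these come from a language model,
# e.g. the summed token log-probabilities of each sentence.
loglik = {
    "stereo_1": -10.2, "anti_1": -11.0,  # model prefers the stereotype
    "stereo_2": -9.1,  "anti_2": -8.7,   # model prefers the anti-stereotype
}
pairs = [("stereo_1", "anti_1"), ("stereo_2", "anti_2")]
print(stereotype_score(pairs, loglik))  # 0.5
```

The sentence identifiers and numbers above are illustrative; each benchmark also adds its own refinements (StereoSet, for instance, combines this with a language-modeling score to penalize degenerate models).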


Mitigation Methods

| Year | Title | Method type | Citations* |
|------|-------|-------------|------------|
| 2021 | FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders | Fine-tuning | 0 |
| 2021 | Sustainable Modular Debiasing of Language Models | Adapters / PEFT | 1 |
| 2021 | An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models | Evaluation / analysis | 18 |
| 2022 | Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Alignment / RLHF | 8 |
| 2022 | Debiasing Pre-Trained Language Models via Efficient Fine-Tuning | Fine-tuning | 8 |
| 2022 | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Evaluation / analysis | 3 |
| 2022 | MABEL: Attenuating Gender Bias using Textual Entailment Data | Data augmentation | 10 |
| 2023 | BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Alignment / RLHF | 0 |
| 2023 | D-CALM: A Dynamic Clustering-based Active Learning Approach for Mitigating Bias | Other | 1 |
| 2023 | An Empirical Analysis of Parameter-Efficient Methods for Debiasing Pre-Trained Language Models | Adapters / PEFT | 5 |
| 2023 | Language Models Get a Gender Makeover: Mitigating Gender Bias with Few-Shot Data Interventions | Data augmentation | 1 |
| 2023 | Causal-Debias: Unifying Debiasing in Pretrained Language Models via Causal Invariant Learning | Other | 2 |
| 2023 | Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination | Weight / neuron editing | 8 |
| 2024 | Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes | Inference-time | 10 |
| 2024 | ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs | Data augmentation | 0 |
| 2025 | Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias | Evaluation / analysis | 0 |
| 2025 | BiasEdit: Debiasing Stereotyped Language Models via Model Editing | Weight / neuron editing | 6 |
| 2025 | FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering | Inference-time | 2 |
| 2025 | BiasFilter: An Inference-Time Debiasing Framework for Large Language Models | Inference-time | 1 |
| 2025 | Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective | Evaluation / analysis | 1 |
| 2025 | Debiasing the Fine-Grained Classification Task in LLMs with Bias-Aware PEFT | Adapters / PEFT | 2 |
| 2025 | BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them | Evaluation / analysis | 0 |
| 2025 | LLM Bias Detection and Mitigation through the Lens of Desired Distributions | Other | 0 |
| 2026 | No Free Lunch in Language Model Bias Mitigation? | Evaluation / analysis | 0 |
| 2026 | KnowBias: Mitigating Social Bias in LLMs via Know-Bias Neuron Enhancement | Weight / neuron editing | 0 |

*Counts only citations among papers in this repository.
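The data-augmentation entries above build on the idea of counterfactual data augmentation: duplicate each training sentence with demographic terms swapped, so the model sees both variants equally often. A hypothetical minimal sketch for a tiny gender word list (a real implementation needs a much larger lexicon and must resolve ambiguous mappings such as "her" → "him"/"his"):

```python
import re

# Toy swap list; real CDA lexicons cover hundreds of term pairs.
SWAPS = {
    "he": "she", "she": "he",
    "his": "her", "her": "his",  # simplification: "her" can also map to "him"
    "man": "woman", "woman": "man",
}
WORD = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)

def counterfactual(sentence: str) -> str:
    """Return the sentence with gendered terms swapped, preserving case."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        repl = SWAPS[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl
    return WORD.sub(swap, sentence)

def augment(corpus: list[str]) -> list[str]:
    """Original corpus plus one counterfactual copy of each sentence."""
    return corpus + [counterfactual(s) for s in corpus]

print(counterfactual("He lost his keys."))  # She lost her keys.
```

Fine-tuning or pre-training on `augment(corpus)` instead of `corpus` is the basic intervention; the few-shot and PEFT variants in the table apply the same idea to much smaller curated sets.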


Statistics

By method type

| Method type | # of papers |
|-------------|-------------|
| Benchmark / Dataset | 12 |
| Evaluation / analysis | 6 |
| Data augmentation | 3 |
| Adapters / PEFT | 3 |
| Weight / neuron editing | 3 |
| Inference-time | 3 |
| Alignment / RLHF | 2 |
| Fine-tuning | 2 |
| Other | 3 |
| Total | 37 |

Dataset frequency in mitigation papers

Number of mitigation papers (out of 24) that use each dataset.

| Dataset | Papers using it |
|---------|-----------------|
| StereoSet | 17 |
| WinoBias | 12 |
| CrowS-Pairs | 11 |
| BBQ | 10 |
| GLUE | 8 |
| SEAT | 5 |
| BOLD | 3 |
| MMLU | 2 |
| BiasFreeBench, STS-B, WNC, SentiBias, SNLI/MultiNLI, WEAT, HH-RLHF, FairFace, CUB-200, Stanford Cars, Food-101, MRPC, RTE, QNLI, WikiText-2 | 1 each |

Methods that measure overall model quality

Of the 18 papers that propose an active mitigation method (excluding the 6 pure evaluation/analysis papers).

| Measures overall quality | # of papers | Papers |
|--------------------------|-------------|--------|
| Yes | 11 | FairFil, Gira, MABEL, RLHF-Asst., Lauscher, Xie, Yang, Causal-Debias, BiasEdit, Zhao, KnowBias |
| No | 7 | Thakur, D-CALM, Gallegos, Han, BiasFilter, FairSteer, Shrestha |