ReDial: Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

Conference: ACL 2025
arXiv: 2410.11005
Code: Available
Area: AI Safety / Fairness
Keywords: dialect fairness, AAVE, reasoning robustness, LLM bias, benchmark

TL;DR

This paper constructs ReDial (1,216 pairs), the first high-quality, human-annotated parallel reasoning benchmark for Standard English (SE) and African American Vernacular English (AAVE), to systematically evaluate the fairness and robustness of LLMs under dialect inputs. It finds that almost all mainstream models suffer a statistically significant relative performance drop of more than 10% on AAVE queries.

Background & Motivation

Background: LLM evaluation benchmarks are typically constructed using Standard English (e.g., HumanEval, GSM8K), neglecting dialectal variations within a language. AAVE is spoken by approximately 33 million people worldwide, and about 80% of African Americans use AAVE.

Limitations of Prior Work: Existing research on dialect bias focuses primarily on linguistic and social analysis tasks, while dialect fairness in reasoning tasks (algorithmic, mathematical, and logical) remains largely unstudied. Existing dialect conversion methods either rely on pre-defined rules (missing contextual nuances) or utilize LLM translation (potentially introducing unverified biases).

Key Challenge: Dialect speakers are forced to code-switch to Standard English to receive optimal service from LLMs, which is inherently a form of linguistic discrimination. Semantically equivalent dialect inputs in reasoning tasks should not lead to performance degradation.

Goal: (1) Construct a high-quality human-annotated SE-AAVE parallel reasoning benchmark; (2) Systematically quantify the performance differences of mainstream LLMs under dialect inputs; (3) Analyze the linguistic origins of the performance degradation.

Key Insight: Native AAVE speakers (including experts with computer science backgrounds) are hired to manually rewrite seven popular SE reasoning benchmarks, preserving semantic equivalence while retaining dialectal features. This avoids the biases introduced by rule-based conversion or LLM translation.

Core Idea: Objectively quantify the unfair service of LLMs toward non-standard dialects through a human-annotated parallel reasoning benchmark.

Method

Overall Architecture

ReDial consists of 1,216 fully annotated SE-AAVE parallel prompt pairs covering four reasoning categories: algorithmic (25.7%, from HumanEval and MBPP), logical (29.8%, from LogicBench and FOLIO), mathematical (24.7%, from GSM8K and SVAMP), and comprehensive reasoning (19.7%, from AsyncHow). Each instance is manually rewritten by native AAVE speakers, keeping the original intent, meaning, and ground truth intact.

Key Designs

  1. Human Dialect Rewriting Process:

    • Function: Generate high-quality, semantically equivalent AAVE versions of reasoning problems.
    • Mechanism: Native AAVE speakers (rather than rule-based conversion or LLM translation) write the AAVE versions, so that the rewrites naturally incorporate AAVE morphosyntactic features and contextual norms. Each rewrite passes a two-round quality check: a linguistic authenticity review by AAVE experts and a logical equivalence verification by CS experts.
    • Design Motivation: Rule-based conversions (e.g., Ziems et al. 2022) miss contextual nuances, while LLM translations may propagate their own dialectal biases.
  2. Multi-dimensional Fairness Evaluation Framework:

    • Function: Systematically quantify dialect performance differences from both robustness and fairness perspectives.
    • Mechanism: Compare the performance of the same model on equivalent SE and AAVE inputs, using paired statistical tests (McNemar's test) to determine the statistical significance of performance differences. Robustness measures the model's sensitivity to input variation, while fairness measures whether systemic disadvantages exist for specific linguistic groups.
    • Design Motivation: Simply reporting average scores is insufficient; statistical significance tests are required to rule out random fluctuations.
  3. Error Analysis Experiments:

    • Function: Pinpoint the linguistic root causes of dialect performance degradation.
    • Mechanism: Synthetic perturbation experiments and AAVE feature injection experiments incrementally introduce AAVE features (lexical substitution, morphosyntactic transformation, and conversational norms) to observe which types of features cause the largest performance drops.
    • Design Motivation: Distinguish the impact of "surface-level lexical differences" from "deep-level syntactic/pragmatic differences" on the model.
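
The paired significance test mentioned in the evaluation framework can be illustrated with a minimal sketch. It assumes per-item correctness flags for the same model on the SE and AAVE version of each prompt, and it uses the exact (binomial) form of McNemar's test; the summary does not say which variant the paper uses, and the function name and toy data below are hypothetical.

```python
from math import comb


def mcnemar_exact(se_correct, aave_correct):
    """Exact two-sided McNemar test on paired binary outcomes.

    Returns (b, c, p_value), where
      b = items answered correctly on SE but wrongly on AAVE,
      c = items answered wrongly on SE but correctly on AAVE.
    """
    if len(se_correct) != len(aave_correct):
        raise ValueError("paired lists must have equal length")
    # Only discordant pairs carry information about a difference.
    b = sum(1 for s, a in zip(se_correct, aave_correct) if s and not a)
    c = sum(1 for s, a in zip(se_correct, aave_correct) if a and not s)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under H0 each discordant pair falls on either side with probability 0.5.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (not from the paper): 12 paired items.
se = [True] * 10 + [False] * 2
aave = [False] * 8 + [True] * 2 + [True, False]
b, c, p = mcnemar_exact(se, aave)
print(b, c, p)
```

Because the test looks only at discordant pairs, it asks whether the model fails more often in one direction than the other on the same underlying problems, which is the comparison the parallel design makes possible.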

Loss & Training

As this work introduces an evaluation benchmark, it does not involve model training. Accuracy is used as the evaluation metric, and pass@1 is used for algorithmic tasks.
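
As a rough illustration of the two metrics, the sketch below computes exact-match accuracy and a pass@1 rate. It assumes one generated solution per algorithmic problem, so pass@1 is simply the fraction of problems whose single solution passes its unit tests; the helper names and toy samples are hypothetical, and a real harness would run generated code in a sandbox rather than with `exec`.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("need at least one item")
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def pass_at_1(solution_src, test_src):
    """True if a single generated solution passes its unit tests.

    WARNING: exec on untrusted model output is unsafe; this is for
    illustration only.
    """
    namespace = {}
    try:
        exec(solution_src, namespace)  # define the candidate function
        exec(test_src, namespace)      # run assert-based tests against it
    except Exception:
        return False
    return True


# Toy samples (not from the benchmark): (generated solution, unit test).
samples = [
    ("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"),
    ("def add(a, b):\n    return a - b", "assert add(2, 3) == 5"),
]
pass_rate = sum(pass_at_1(s, t) for s, t in samples) / len(samples)
print(pass_rate)
```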

Key Experimental Results

Main Results

Model | SE Accuracy | AAVE Accuracy | Relative Drop
GPT-o1 | 83.2% | 74.1% | -10.9%
GPT-4o | 76.5% | 67.8% | -11.4%
Claude-3.5-Sonnet | 74.3% | 65.2% | -12.2%
Llama-3.1-70B | 68.7% | 59.4% | -13.5%
Mistral-Large | 62.1% | 53.6% | -13.7%
Phi-3-medium | 55.8% | 47.2% | -15.4%
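
The "Relative Drop" column is consistent with the change in accuracy divided by the SE accuracy, i.e. (AAVE − SE) / SE. The short check below recomputes it from the two accuracy columns; the function name is an assumption made for this sketch.

```python
def relative_drop(se_acc, aave_acc):
    """Relative change in accuracy (in percent) when moving from SE to AAVE."""
    return (aave_acc - se_acc) / se_acc * 100.0


# (SE accuracy, AAVE accuracy) in percent, copied from the table above.
results = {
    "GPT-o1": (83.2, 74.1),
    "GPT-4o": (76.5, 67.8),
    "Claude-3.5-Sonnet": (74.3, 65.2),
    "Llama-3.1-70B": (68.7, 59.4),
    "Mistral-Large": (62.1, 53.6),
    "Phi-3-medium": (55.8, 47.2),
}

for model, (se, aave) in results.items():
    print(f"{model}: {relative_drop(se, aave):+.1f}%")
```

The absolute gap is roughly 9 percentage points for every model, so the relative drop grows as the SE baseline gets lower.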

Ablation Study

Perturbation Type | Performance Change | Description
Lexical Substitution | -3.2% | Only replaces AAVE-specific lexical terms
Morphosyntactic Transformation | -5.7% | Introduces AAVE grammatical rules (e.g., negative concord)
Full Human Rewrite | -12.2% | Natural rewrites by native speakers (including pragmatics and context)
CoT Prompting Mitigation | +2.1% | CoT only partially mitigates the issue, leaving a significant gap
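
The incremental feature-injection idea behind this ablation can be sketched as a small harness that applies perturbations cumulatively and records the accuracy change against the unperturbed baseline. Everything here is hypothetical: the function names, the toy "model", and the toy string replacements are stand-ins for illustration and are not real AAVE features or the paper's code.

```python
from typing import Callable, List, Sequence, Tuple

Perturbation = Callable[[str], str]


def accuracy_under(prompts: Sequence[str], answers: Sequence[str],
                   model: Callable[[str], str],
                   perturbations: Sequence[Perturbation]) -> float:
    """Accuracy after applying the given perturbations, in order, to each prompt."""
    if not prompts:
        raise ValueError("need at least one prompt")
    correct = 0
    for prompt, answer in zip(prompts, answers):
        for perturb in perturbations:
            prompt = perturb(prompt)
        correct += model(prompt) == answer
    return correct / len(prompts)


def incremental_ablation(prompts, answers, model,
                         stages: Sequence[Tuple[str, Perturbation]]
                         ) -> List[Tuple[str, float, float]]:
    """Add one perturbation stage at a time; report (name, accuracy, change vs. baseline)."""
    baseline = accuracy_under(prompts, answers, model, [])
    report = [("baseline", baseline, 0.0)]
    applied: List[Perturbation] = []
    for name, perturb in stages:
        applied.append(perturb)
        acc = accuracy_under(prompts, answers, model, applied)
        report.append((name, acc, acc - baseline))
    return report


# Toy stand-ins: the "model" succeeds only if it sees the word "add".
toy_model = lambda prompt: "ok" if "add" in prompt else "fail"
toy_stages = [
    ("toy_stage_1", lambda s: s.replace("add 2", "combine 2")),
    ("toy_stage_2", lambda s: s.replace("add 4", "combine 4")),
]
report = incremental_ablation(["add 2 and 3", "add 4 and 5"], ["ok", "ok"],
                              toy_model, toy_stages)
for row in report:
    print(row)
```

Comparing the final synthetic stage with the score on full human rewrites then shows how much of the gap the synthetic perturbations fail to explain.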

Key Findings

  • Almost all models show a significant performance drop on AAVE (\(p < 0.05\)), with an average relative decrease of over 10%.
  • Algorithmic tasks are affected the most (code-related queries are highly sensitive to dialectal phrasings), followed by logical tasks.
  • Synthetic perturbations fail to replicate the scale of performance degradation seen in human rewrites, indicating that the impact of dialects goes far beyond the lexical level.
  • CoT prompting only partially mitigates the issue, suggesting that the bias is rooted in the models themselves rather than only in their reasoning strategies.
  • Scaling up model size does not eliminate dialect bias; larger models still exhibit significant unfairness.

Highlights & Insights

  • ReDial is the first human-annotated dialect reasoning benchmark, whose core value lies in "end-to-end human effort." It avoids the systematic bias introduced by rule-based or LLM translation, providing a reliable ground truth for dialect fairness research.
  • The paper cleverly frames dialect rewriting as a "semantic robustness test," linking dialect fairness with adversarial robustness research, which provides a solid theoretical foundation for the evaluation framework.

Limitations & Future Work

  • Only covers AAVE, without extending to other English dialects (e.g., Indian English, Singaporean English) or non-English dialects.
  • Reasoning tasks are primarily closed-form/decidable, omitting dialect fairness in open-ended generation.
  • Lacks experimental validation of debiasing methods—it identifies the problem but does not propose a solution.

Comparison with Related Work

  • vs VALUE (Ziems et al. 2022): VALUE uses rule-based conversion to generate AAVE, whereas ReDial uses human rewriting, which is more natural and captures pragmatic differences.
  • vs Hofmann et al. 2024: They study explicit/implicit LLM bias against AAVE (such as racial association), while ReDial focuses on explicit performance disparities in tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ First human-annotated dialect reasoning fairness benchmark
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple model families and reasoning task categories
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear motivation and rigorous experimental design
  • Value: ⭐⭐⭐⭐ Unveils the overlooked issue of dialect discrimination in LLMs