Frame In, Frame Out: Measuring Framing Bias in LLM-Generated News Summaries¶
Conference: ACL 2026
arXiv: 2505.05406
Code: https://github.com/vpastorino/FIFO
Area: AIGC Detection / Summarization Evaluation / Media Bias
Keywords: Framing Bias, News Summarization, LLM Evaluation, XSum, Expert Calibration
TL;DR¶
This paper proposes FIFO, a method that combines an LLM jury with expert calibration to measure framing bias in LLM-generated news summaries at scale on XSum. It finds that several high-capacity models frame their summaries more often than the human-written baseline summaries do.
Background & Motivation¶
Background: News summarization models are typically evaluated on factual consistency, coverage, fluency, and preference scores. In single-sentence summarization tasks such as XSum in particular, mainstream evaluation focuses on "correctness" and "fluency." However, news texts are not merely collections of facts; headlines and summaries shape reader comprehension through selection, emphasis, omission, and attribution of responsibility.
Limitations of Prior Work: Existing framing research mostly originates from communication studies or supervised framing detection, where the goal is usually to classify which frame category a news text belongs to. Summarization evaluation rarely examines whether models introduce interpretive angles that were not prominent in the original text. Consequently, a summary can be factually compatible and linguistically fluent yet still push readers toward more emotional, political, or moralized interpretations.
Key Challenge: Compression in summarization necessarily involves selection and omission, and framing arises precisely from the interpretive shifts these choices produce. Traditional metrics treat compression as an information-fidelity problem; this paper additionally asks whether the model alters the interpretive perspective.
Goal: The authors aim to construct a scalable benchmark that covers many models and topics without relying entirely on potentially biased LLM annotations. The benchmark also supports analyses of framing rates at the model, topic, and training-setting levels.
Key Insight: Instead of requiring models to identify fine-grained frame types, the paper first addresses a more fundamental question: does the summary contain identifiable framing at all? This binary formulation makes annotation and calibration easier to scale and is well suited as an evaluation dimension for summarization systems.
Core Idea: Use a three-model LLM jury for batch framing annotation, then use a small set of expert annotations to estimate jury reliability weights, turning the raw silver labels into expert-calibrated framing rates.
Method¶
FIFO does not train a new summarization model; it proposes a framing-aware evaluation pipeline for summarization. It collects outputs from 27 summarization systems on XSum, uses an LLM jury to assign a Framed / Not Framed label to each summary, and then calibrates these labels against an expert annotation set, yielding expert-calibrated framing rates that are comparable across models and topics.
Overall Architecture¶
The input consists of news articles and their system-generated single-sentence summaries. For each summary, FIFO judges whether it introduces an interpretive frame through selective emphasis, evaluative phrasing, responsibility attribution, causal organization, or omission. The output is an expert-calibrated framing rate at the model, topic, or subset level.
The process has four steps. First, 15,499 summaries from 27 systems (covering BART, T5, FLAN-T5, GPT, Claude, LLaMA, etc.) are collected on XSum. Second, a jury of GPT-4.1-nano, GPT-4o, and GPT-3.5-Turbo judges each summary independently, and the majority vote becomes the silver label. Third, 320 summaries are randomly sampled and annotated by framing-analysis experts to obtain gold labels; agreement between the expert and jury labels is Cohen's \(\kappa=0.616\). Fourth, based on how jury labels correspond to expert labels, each silver label is converted into a probability weight, and the weights are averaged into an expert-calibrated framing rate.
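The second step, majority voting over three independent jury judgments, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `majority_vote`, the summary IDs, and the vote lists are all hypothetical.

```python
from collections import Counter

FRAMED, NOT_FRAMED = "Framed", "Not Framed"


def majority_vote(votes):
    """Return the label chosen by most jurors.

    With an odd number of binary votes a tie cannot occur, so the
    most common label is always unique.
    """
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of jurors to avoid ties")
    if any(v not in (FRAMED, NOT_FRAMED) for v in votes):
        raise ValueError("votes must be 'Framed' or 'Not Framed'")
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical jury outputs: one list of three votes per summary.
jury_votes = {
    "summary-1": [FRAMED, FRAMED, NOT_FRAMED],
    "summary-2": [NOT_FRAMED, NOT_FRAMED, NOT_FRAMED],
    "summary-3": [FRAMED, NOT_FRAMED, FRAMED],
}

# Silver label = majority decision of the jury for each summary.
silver_labels = {sid: majority_vote(v) for sid, v in jury_votes.items()}
print(silver_labels)
```

Using an odd-sized jury keeps the silver label well defined without a tie-breaking rule, which is presumably one reason a three-model jury is convenient.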
Key Designs¶
- Binary framing operationalization:
  - Function: Converts complex framing theory into an evaluable summary attribute: whether an interpretive frame is present.
  - Mechanism: A summary is labeled Framed when it makes a particular interpretation prominent through selective emphasis, omission, evaluative language, or attribution; it is labeled Not Framed if it merely states the core events without a clear interpretive perspective.
  - Design Motivation: Fine-grained taxonomies are better suited to communication analysis, but system evaluation primarily needs to know how likely a model is to introduce framing, so a binary classification is more robust and scalable.
- LLM jury + expert gold label calibration:
  - Function: Balances large-scale coverage against expert reliability.
  - Mechanism: Three LLMs annotate each summary independently, and majority voting yields 15,499 silver labels; the 320 expert annotations are then used to estimate jury bias. When the jury labels a summary Framed, the expert also labels it Framed 77.8% of the time; when the jury labels it Not Framed, the expert still considers it Framed 16.3% of the time.
  - Design Motivation: Expert annotation alone is too costly, while LLM annotation alone may inherit model bias. Calibration weights bridge the two, bringing model-level statistics closer to expert judgment.
- Expert-calibrated framing rate:
  - Function: Converts framing frequency for each model or topic from binary silver labels into expert-calibrated estimates.
  - Mechanism: For a summary set \(S\), the calculation is \(FR(S)=\frac{1}{|S|}\sum_{s\in S}w_s\), where \(w_s=0.778\) if the jury labels it Framed, and \(w_s=0.163\) if the jury labels it Not Framed.
  - Design Motivation: This estimate acknowledges jury error and does not treat LLM labels as absolute ground truth, while maintaining large-scale statistical power to compare the effects of model capacity, fine-tuning, and news topics.
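The calibration described above can be sketched in a few lines. This is an illustrative reading of the mechanism, not the released implementation: `estimate_weights` and `framing_rate` are hypothetical names, and the toy gold set and the ten jury labels are invented for the example. Only the default weights 0.778 and 0.163 come from the reported results.

```python
def estimate_weights(pairs):
    """Estimate calibration weights from (jury_framed, expert_framed) pairs.

    Returns (P(expert Framed | jury Framed),
             P(expert Framed | jury Not Framed)).
    """
    pairs = list(pairs)
    when_framed = [expert for jury, expert in pairs if jury]
    when_not_framed = [expert for jury, expert in pairs if not jury]
    if not when_framed or not when_not_framed:
        raise ValueError("gold set must contain both jury labels")
    return (sum(when_framed) / len(when_framed),
            sum(when_not_framed) / len(when_not_framed))


def framing_rate(jury_labels, w_framed=0.778, w_not_framed=0.163):
    """FR(S): the mean weight w_s over all summaries in S."""
    labels = list(jury_labels)
    if not labels:
        raise ValueError("summary set must be non-empty")
    total = sum(w_framed if framed else w_not_framed for framed in labels)
    return total / len(labels)


# Hypothetical gold set of 8 summaries (True = Framed).
gold = ([(True, True)] * 3 + [(True, False)] * 1
        + [(False, True)] * 1 + [(False, False)] * 3)
w_f, w_n = estimate_weights(gold)

# Hypothetical system: the jury labels 6 of its 10 summaries Framed.
rate = framing_rate([True] * 6 + [False] * 4)
print(w_f, w_n, round(rate, 3))
```

With the reported weights, the hypothetical system above gets \(FR = (6 \times 0.778 + 4 \times 0.163)/10 = 0.532\), compared with a raw jury rate of 0.6.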
Loss & Training¶
This paper does not train a new generative model or propose a neural network loss function. Its "training strategy" is an evaluation calibration strategy: first generating silver labels with a prompt-based LLM jury, then estimating conditional reliability with expert gold labels, and finally aggregating framing rates using reliability weights. This design allows FIFO to serve as an external evaluation tool for various summarization systems without depending on a specific architecture.
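Because each summary receives one of only two weights, the aggregation step can be worked out in closed form. The derivation below is a sketch that follows from the formula for \(FR(S)\) given in this note; the symbol \(p_S\), the fraction of summaries in \(S\) that the jury labels Framed, is introduced here for illustration and is not taken from the paper.

\[
FR(S) = \frac{1}{|S|}\sum_{s\in S} w_s = 0.778\,p_S + 0.163\,(1-p_S) = 0.163 + 0.615\,p_S
\]

Under this reading, the calibrated rate is an affine rescaling of the raw jury rate: it always lies between 16.3% and 77.8%, and a gap between two systems' raw jury rates shrinks by a factor of 0.615 after calibration, while the ranking of systems is unchanged.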
Key Experimental Results¶
Main Results¶
| Item | Value / Setting | Role | Notes |
|---|---|---|---|
| Summary Source | XSum | Single-doc single-sentence summary | Strong compression scenarios more easily expose selective emphasis |
| System Count | 27 systems | Model-level comparison | Covers encoder-decoder and decoder-only models |
| Silver Label Scale | 15,499 summaries | Large-scale framing analysis | Generated by 3-model LLM jury majority vote |
| Expert Gold Labels | 320 summaries | Calibration and validation | Expert-jury Cohen's \(\kappa=0.616\) |
| Expert Framed when Jury is Framed | 77.8% | Calibration weight | Corresponds to \(w=0.778\) |
| Expert Framed when Jury is Not Framed | 16.3% | Calibration weight | Corresponds to \(w=0.163\) |
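For readers unfamiliar with the agreement statistic in the table, Cohen's \(\kappa\) for two binary annotators can be computed as below. This is a generic textbook sketch: the function name `cohens_kappa` and the ten toy labels are hypothetical and do not reproduce the 320-item gold set or its \(\kappa=0.616\).

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of binary labels (1 = Framed)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal rate of label 1.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical labels for 10 summaries (1 = Framed, 0 = Not Framed).
jury = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
expert = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
kappa = cohens_kappa(jury, expert)
print(round(kappa, 3))
```

In practice a library routine such as scikit-learn's `cohen_kappa_score` would normally be used; the hand-written version only makes the observed-versus-chance structure explicit.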
Ablation Study¶
| Analysis Dimension | Key Findings | Description |
|---|---|---|
| Model Capacity / Pre-training Range | Large models have significantly higher framing rates, \(p=0.0012\) | Authors suggest low framing rates in small models may stem from lower output quality |
| XSum Fine-tuning | Fine-tuned models have significantly lower framing rates than base models, \(p=0.0006\) (95% CI: -19.27% to -7.78%) | Task-specific fine-tuning may constrain summary style |
| Intra-family Size Effect | Pearson \(r=-0.44\) | Larger models within the same family have slightly lower framing rates, suggesting training data/settings are more critical than parameter count |
| Topic Effect | Human baseline ~53% for Politics, ~31% for Health/Science | Multiple high-capacity models exceed human baselines in these categories |
| Length Relationship | Point-biserial correlation \(r_{pb}\approx0.1904\) | Framed averages 147 words, Not Framed 83 words; framing is weakly correlated with but not explained away by length |
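The length analysis in the last row relies on a point-biserial correlation, which is the Pearson correlation between a binary variable and a continuous one. A minimal sketch, assuming the binary variable is the Framed label and the continuous one is summary length, is shown below; `point_biserial` is a hypothetical helper and the four data points are invented.

```python
from math import sqrt


def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 variable and a numeric one.

    Uses r_pb = (M1 - M0) / s * sqrt(p * (1 - p)), where M1 and M0 are the
    group means, s is the population standard deviation of all values, and
    p is the share of items with binary == 1.
    """
    if len(binary) != len(values) or not values:
        raise ValueError("need two non-empty sequences of equal length")
    group1 = [v for b, v in zip(binary, values) if b]
    group0 = [v for b, v in zip(binary, values) if not b]
    if not group1 or not group0:
        raise ValueError("both groups must be non-empty")
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((v - mean) ** 2 for v in values) / n)
    if s == 0:
        raise ValueError("values must not all be identical")
    p = len(group1) / n
    m1 = sum(group1) / len(group1)
    m0 = sum(group0) / len(group0)
    return (m1 - m0) / s * sqrt(p * (1 - p))


# Hypothetical data: 1 = Framed, paired with a summary length.
framed = [1, 1, 0, 0]
lengths = [30, 20, 20, 10]
r_pb = point_biserial(framed, lengths)
print(round(r_pb, 4))
```

SciPy provides the same statistic as `scipy.stats.pointbiserialr`, which also returns a p-value.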
Key Findings¶
- FIFO demonstrates that framing is not an isolated phenomenon in specific models but an evaluation dimension that varies systematically with model capability, training methods, and news topics.
- Large models are more likely to generate linguistically rich, interpretive summaries, which improves readability but leaves more room for framing.
- XSum fine-tuning can reduce framing rates, indicating that task data and stylistic constraints may be more important than simply scaling up the model.
- High-capacity models exceed human baselines in specific sensitive topics (e.g., Politics, Health), suggesting caution in deployment.
Highlights & Insights¶
- The most valuable contribution is transforming framing from a communication theory concept into a summarization evaluation metric. It reminds researchers that factually correct summaries can still be biased in "how they tell the facts."
- The expert calibration weights are pragmatic. Instead of pretending the LLM jury provides ground truth, the authors estimate systematic error with a small gold set, which is more credible than reporting raw LLM label proportions.
- Binary framing, though coarse, is suitable as a first-level risk screening. Future summarization systems could use FIFO-like metrics to identify high-risk topics or models before conducting fine-grained frame type analysis.
- The results imply that "stronger models" are not naturally more neutral. High-capacity models may be more prone to shaping interpretive frames because they are better at organizing narratives, providing context, and generating evaluative language.
Limitations & Future Work¶
- FIFO relies on an LLM jury for silver labels; despite expert calibration, annotations may still inherit the blind spots or socio-cultural biases of the jury models.
- The dataset only covers English single-document summaries and XSum style, which does not directly reflect framing behavior in multi-document, multilingual, or long-form news generation.
- The binary setting cannot specify which particular frame is used (e.g., attribution of responsibility, moral judgment, conflict frame).
- Future work could extend to multilingual news, different media ecosystems, and fine-grained frame taxonomies, combining these with factuality/stance/sentiment metrics for more complete evaluation.
Related Work & Insights¶
- vs. Traditional framing detection: Traditional work identifies what frames are expressed in news text; this paper focuses on whether a generated summary introduces framing. The former is a content analysis task; the latter is a system evaluation task.
- vs. ROUGE / factuality / coherence evaluation: These metrics measure information coverage, factual correctness, and linguistic quality. FIFO measures interpretive perspective shift, filling the gap for "factually compatible but narratively biased" summaries.
- vs. LLM-as-a-judge: Standard LLM-based evaluation treats the judge model's output as the final verdict. FIFO goes further and uses expert gold labels to calibrate judge reliability, an approach that other subjective evaluation tasks could adopt with a small expert-annotated set.
Rating¶
- Novelty: ⭐⭐⭐⭐☆ Systematically introduces framing bias to summarization evaluation with a clear problem definition.
- Experimental Thoroughness: ⭐⭐⭐⭐☆ Covers 27 systems and topic analysis; expert set is small but includes calibration; lacks multilingual/multi-doc scenarios.
- Writing Quality: ⭐⭐⭐⭐☆ Motivation, examples, and calibration formulas are clear.
- Value: ⭐⭐⭐⭐⭐ Highly practical for news summarization, media generation, and LLM content governance; a dimension likely to be reused in future evaluation work.