A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification
Conference: ACL2026
arXiv: 2410.03296
Code: https://github.com/oeberle/self_explanations_human_rationales
Area: interpretability
Keywords: LLM self-explanation, input rationale, human annotation, interpretability evaluation, attribution methods
TL;DR
This paper systematically compares extractive self-explanations generated by four open-source instruction-tuned LLMs with human rationales and post-hoc attribution methods across three text classification tasks. It finds that the consistency between self-explanations and human annotations is strongly influenced by text length and task complexity; however, in perturbation-based faithfulness evaluations, self-explanations often identify token subsets more critical to model predictions.
Background & Motivation
Background: LLMs are widely deployed in classification, summarization, QA, and decision-support scenarios. Users increasingly expect models to provide explanations for their outputs. These explanations are typically natural language descriptions or evidence snippets extracted from the input, requiring no additional training or complex gradient-based tools.
Limitations of Prior Work: A self-explanation appearing plausible does not guarantee it is a good explanation. A rationale must pass two tests: plausibility (whether it matches human reasoning) and faithfulness (whether it reflects the information the model actually relied on). Existing research often focuses on short-text sentiment analysis or compares self-explanations only with simple saliency methods, lacking systematic cross-task and cross-method comparisons.
Key Challenge: Users want explanations that are both "aligned with human intuition" and "faithful to internal model decisions," but these goals do not always coincide. Human rationales tend toward narrative and semantic evidence, while model self-explanations may favor explicit task-relevant snippets. Gradient attribution methods, by contrast, may emphasize system prompts or formatting tokens, which are critical for computation but unnatural to humans. Treating any single type as the ground truth obscures information from the other perspectives.
Goal: The authors aim to answer three specific questions: the degree of consistency between LLM-generated extractive self-explanations and human rationales across tasks; whether these rationales truly change model predictions when masked; and how token selection strategies differ among self-explanations, human explanations, and post-hoc attribution methods like LRP and Gradient×Input.
Key Insight: The paper focuses on extractive explanations rather than free-text. This ensures explanations are strictly anchored in the input text, allowing them to be converted into token-level rationales for direct comparison with human annotations, post-hoc scores, and perturbation curves.
Core Idea: By placing LLM self-explanations, human rationales, and post-hoc attributions into a unified token-level evaluation framework, the study disentangles the differences between "appearing plausible" and "being effective for the model" through the lenses of plausibility, faithfulness, and linguistic statistics.
Method
The paper designs a controlled evaluation pipeline rather than proposing a new model. The authors select text classification data with human rationales, including a newly annotated Climate-Fever rationale subset. Four LLMs perform zero-shot classification; supporting rationales are extracted only for correct predictions. The extracted rationales are then systematically evaluated for plausibility (agreement with human rationales), faithfulness (effect on the prediction when masked), and alignment with post-hoc attributions.
Overall Architecture
The pipeline consists of five steps:
- Task and Human Rationale Preparation: Covers sentiment analysis (SST/mSST), forced labor risk detection (RaFoLa), and climate claim verification (Climate-Fever). Token-level rationales were newly collected for the Climate-Fever subset.
- Model Classification: Gemma3-12B, Llama3.1-8B, Qwen3-8B, and Mistral-7B-Instruct-v0.3 perform zero-shot classification.
- Model Rationale Extraction: Rationales are requested only when the classification is correct to avoid confounding classification errors with explanation errors.
- Plausibility Evaluation: Human rationales serve as the reference for acceptable explanations. Sample-wise Cohen's Kappa measures agreement between model and human token selections.
- Faithfulness and Strategy Analysis: Post-hoc attributions (Gradient×Input, LRP) are generated. Faithfulness is measured by observing the change in probability difference between the correct and alternative answers after masking tokens. A k-greedy importance ordering is used to rank binary rationales (\(k=1\) for SST, \(k=3\) for others).
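The plausibility step above can be illustrated with a minimal sketch of sample-wise Cohen's Kappa on binary token masks. The function names, the toy masks, and the choice to return 0.0 when chance agreement equals 1 (where Kappa is undefined) are assumptions made for this illustration, not details taken from the paper's code.

```python
from typing import List, Sequence, Tuple


def cohen_kappa(human: Sequence[int], model: Sequence[int]) -> float:
    """Cohen's Kappa for two binary token masks over the same tokens."""
    if len(human) != len(model):
        raise ValueError("masks must cover the same tokens")
    n = len(human)
    if n == 0:
        raise ValueError("masks must be non-empty")
    # Observed agreement: share of tokens on which both masks agree.
    p_o = sum(h == m for h, m in zip(human, model)) / n
    # Chance agreement derived from each mask's selection rate.
    h_rate = sum(human) / n
    m_rate = sum(model) / n
    p_e = h_rate * m_rate + (1 - h_rate) * (1 - m_rate)
    if p_e == 1.0:
        # Both masks are constant and identical: Kappa is undefined.
        # Returning 0.0 is an assumption of this sketch.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


def samplewise_kappa(pairs: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average Kappa over (human_mask, model_mask) pairs, one pair per sample."""
    scores = [cohen_kappa(h, m) for h, m in pairs]
    return sum(scores) / len(scores)


# Toy example: 1 = token selected as rationale, 0 = not selected.
pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 0, 1, 0], [1, 0, 1, 0]),
]
print(samplewise_kappa(pairs))
```

Computing Kappa per sample and then averaging keeps long documents from dominating the score, which matters when text lengths differ as much as they do between SST and RaFoLa.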
Key Designs
- Controlled Comparison of Extractive Self-Explanations:
  - Function: Restricts explanations to token rationales within the input for granular comparison.
  - Mechanism: LLMs output classification results first, then extract supporting snippets for correct predictions.
  - Design Motivation: Extractive rationales transform the qualitative problem of explanation quality into measurable token-level agreement and perturbation issues.
- Cross-Difficulty, Cross-Lingual, and Cross-Task Data Design:
  - Function: Validates self-explanations in long-text, professional, and multi-lingual scenarios beyond simple sentiment analysis.
  - Mechanism: Includes SST/mSST (short sentences), RaFoLa (long news articles with implicit evidence), and Climate-Fever (ambiguous claim-evidence relationships).
  - Design Motivation: Explanation quality is often dominated by dataset attributes; diverse tasks expose the limitations of self-explanations better than sentiment analysis alone.
- Separated Plausibility and Faithfulness Evaluation:
  - Function: Distinguishes between "looking like human reasoning" and "actually influencing model predictions."
  - Mechanism: Plausibility uses Cohen's Kappa; faithfulness uses token masking to observe probability drops.
  - Design Motivation: An explanation might be highly plausible but unfaithful to the model's internal logic, or vice versa. Separating these prevents "readability" from being mistaken for "faithfulness."
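The masking-based faithfulness check can be sketched as follows. This is a toy illustration under stated assumptions: `toy_prob_diff` is a hypothetical stand-in for the model's probability difference between the correct and alternative answers, the `[MASK]` placeholder is an assumed masking token, and the rankings are invented. A real run would query the LLM at each step.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"  # hypothetical placeholder used to hide a token


def masking_curve(
    tokens: Sequence[str],
    ranking: Sequence[int],
    prob_diff: Callable[[List[str]], float],
    fractions: Sequence[float],
) -> List[float]:
    """Probability difference after masking the top-ranked share of tokens."""
    curve = []
    for frac in fractions:
        n_mask = round(frac * len(tokens))
        masked_idx = set(ranking[:n_mask])
        perturbed = [MASK if i in masked_idx else t for i, t in enumerate(tokens)]
        curve.append(prob_diff(perturbed))
    return curve


def toy_prob_diff(tokens: List[str]) -> float:
    """Stand-in for p(correct) - p(alternative); NOT a real model."""
    positive = {"great", "wonderful"}
    return sum(t in positive for t in tokens) / 2


tokens = ["a", "great", "and", "wonderful", "film"]
fractions = [0.0, 0.2, 0.4, 1.0]
faithful_ranking = [1, 3, 0, 2, 4]    # decisive tokens first
unfaithful_ranking = [0, 2, 4, 1, 3]  # decisive tokens last

print(masking_curve(tokens, faithful_ranking, toy_prob_diff, fractions))
print(masking_curve(tokens, unfaithful_ranking, toy_prob_diff, fractions))
```

A ranking counts as more faithful when its curve drops earlier: here the first ranking removes all signal after masking 40% of the tokens, while the second keeps the prediction intact until everything is masked.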
Loss & Training
No new models were trained. All experiments used zero-shot prompting with instruction-tuned open-weight LLMs. Generation used the transformers library with a repetition penalty of 1.0. Results were averaged over three random seeds; for SST/mSST/RaFoLa, standard deviations were within 0.01.
Evaluation involves two alignment processes: binarizing human and model rationales for Cohen's Kappa, and ranking rationale tokens for gradual masking to measure the faithfulness curve. For post-hoc attributions, ranking is derived from LRP or Gradient×Input scores.
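The two alignment steps can be sketched as below: turning extracted snippets into a binary token mask, and turning attribution scores into a masking order. Whitespace tokenization, exact-match span lookup, and the function names are assumptions of this sketch; the paper's k-greedy ordering for binary rationales is not reproduced here.

```python
from typing import List, Sequence


def snippets_to_mask(tokens: Sequence[str], snippets: Sequence[Sequence[str]]) -> List[int]:
    """Mark every token covered by an exact, contiguous match of a snippet."""
    tokens = list(tokens)
    mask = [0] * len(tokens)
    for snippet in snippets:
        snippet = list(snippet)
        n = len(snippet)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == snippet:
                for i in range(start, start + n):
                    mask[i] = 1
    return mask


def rank_by_score(scores: Sequence[float]) -> List[int]:
    """Token indices sorted from highest to lowest attribution score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Assumed whitespace tokenization of an invented example sentence.
tokens = "the workers were forced to work long hours".split()
snippets = [["forced", "to", "work"], ["long", "hours"]]
print(snippets_to_mask(tokens, snippets))

# Invented attribution scores, e.g. from LRP or Gradient×Input.
print(rank_by_score([0.1, 0.7, -0.2, 0.4]))
```

The binary mask feeds the agreement metric, while the ranked indices define the order in which tokens are removed when tracing the faithfulness curve.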
Key Experimental Results
Main Results
Classification performance was reported as a prerequisite. Short-text sentiment analysis (SST/mSST) reached macro-F1 scores of 0.84 to 1.00. RaFoLa (long news) was harder (0.25–0.79), and Climate-Fever claim verification was the most challenging (best macro-F1 of 0.45).
| Task / Dataset | Gemma3 | Llama3 | Qwen3 | Mistral | Key Observation |
|---|---|---|---|---|---|
| SST | 0.98 | 0.98 | 0.98 | 0.99 | Almost all models solved short-text sentiment. |
| mSST-EN | 1.00 | 0.98 | 0.99 | 0.98 | Performance near saturation in English subset. |
| mSST-DA | 0.94 | 0.84 | 0.96 | 0.96 | Danish remains high, though Llama3 is lower. |
| mSST-IT | 1.00 | 0.95 | 1.00 | 0.97 | Italian performance similar to English. |
| RaFoLa #1 | 0.25 | 0.47 | 0.38 | 0.57 | "Vulnerability" indicator is difficult. |
| RaFoLa #2 | 0.37 | 0.60 | 0.47 | 0.58 | "Abusive conditions" is challenging. |
| RaFoLa #5 | 0.79 | 0.73 | 0.74 | 0.60 | "Overtime" yields higher performance. |
| RaFoLa #8 | 0.65 | 0.76 | 0.67 | 0.73 | "Violence" is easily triggered by keywords. |
| Climate-Fever Claim | 0.45±0.04 | 0.33±0.04 | 0.38±0.02 | 0.24±0.01 | Gemma3 is best; overall difficult task. |
| Climate-Fever Evidence | 0.54±0.02 | 0.40±0.03 | 0.46±0.00 | 0.45±0.02 | Evidence classification slightly better. |
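For readers less familiar with the metric in the table, macro-F1 averages per-class F1 scores with equal weight, so a model that ignores a minority class is penalized even if overall accuracy is high. The sketch below is a plain-Python illustration with invented labels; taking the label set as the union of gold and predicted labels is an assumption of this sketch.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Invented toy labels, not data from the paper.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
print(macro_f1(y_true, y_pred))
```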
Ablation Study
Rather than standard module ablation, the study compares explanation sources: human rationale, model self-explanation, LRP, Gradient×Input, and random baseline.
| Comparison Dimension | Key Metric / Result | Description |
|---|---|---|
| Human-Model Plausibility: SST/mSST | 0.4-0.6 Cohen's Kappa for EN; lower for DA/IT (0.31-0.33) | Moderate agreement in short-text sentiment analysis. |
| Human-Model Plausibility: RaFoLa | #1 (0.12-0.17), #2 (0.19-0.27), #5 (0.21-0.48), #8 (0.27-0.41) | Agreement depends heavily on indicators with explicit keywords. |
| Human-Model Plausibility: Climate-Fever | Gemma3 (0.24), others (0.12-0.18) | Lowest agreement due to vague claim-evidence relationships. |
| Faithfulness: self-explanation | Steepest drop in top 5-10% tokens via k-greedy | Self-explanations find token subsets highly sensitive for the model. |
| Faithfulness: post-hoc attribution | LRP > Gradient×Input; still secondary to model rationale | Post-hoc methods often capture structural rather than semantic tokens. |
| Strategy: RaFoLa top 5% faithful tokens | Human/Model stopword % (34-39%); Post-hoc stopword % (12-25%) | Humans/models select language evidence; post-hoc targets structural cues. |
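The stopword statistic in the last row can be approximated with a sketch like the one below, which measures the share of stopwords among the top-ranked fraction of tokens. The tiny stopword list, the example sentence, and the ranking are illustrative assumptions; the paper's actual stopword list and tokenization are not specified here.

```python
from typing import Sequence

# Tiny illustrative stopword list; a real analysis would use a full list.
STOPWORDS = {"the", "a", "of", "to", "and", "in", "were"}


def stopword_share(tokens: Sequence[str], ranking: Sequence[int], top_frac: float = 0.05) -> float:
    """Fraction of stopwords among the top `top_frac` of ranked tokens."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    # Always inspect at least one token, even for very short inputs.
    n_top = max(1, int(len(tokens) * top_frac))
    top = [tokens[i].lower() for i in ranking[:n_top]]
    return sum(t in STOPWORDS for t in top) / len(top)


# Invented example: the ranking lists token indices from most to least important.
tokens = ["the", "workers", "were", "forced", "to", "work"]
ranking = [3, 0, 5, 1, 4, 2]
print(stopword_share(tokens, ranking, top_frac=0.5))
```

Comparing this share across explanation sources gives a rough signal of whether a method favors running linguistic evidence or isolated structural cues.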
Key Findings
- Text length and task complexity are crucial variables. Agreement is significantly higher in short sentiment sentences (21 tokens) than in long news (945 tokens) or claim verification.
- Self-explanation plausibility is unstable, but faithfulness is strong. Masking the top 5-10% of model-ranked rationale tokens reduces prediction confidence faster than masking tokens ranked by human rationales or post-hoc methods.
- In RaFoLa, indicators with explicit keywords (e.g., "hours," "violence") show higher agreement.
- Climate-Fever's low agreement stems from ambiguous semantic relationships and non-standard claims rather than surface text complexity.
- Post-hoc attributions emphasize structural or formatting tokens (e.g., `<bos>`, URLs, dates). While computationally important, these do not align with human semantic evidence.
Highlights & Insights
- A major highlight is moving self-explanation evaluation from subjective quality to measurable token-level analysis across plausibility and faithfulness.
- The choice of datasets (SST, RaFoLa, Climate-Fever) demonstrates that explanation quality is a product of models, tasks, and text structures, not an isolated model property.
- The study clearly shows that "human-like" and "faithful to the model" are distinct concepts. Post-hoc methods may capture low-level processing cues, while self-explanations reflect natural language evidence.
- For tool design, systems should ideally reconcile natural language reasons with underlying attribution patterns rather than outputting one or the other.
- Cohen's Kappa is a more robust measure of rationale agreement than F1 because it corrects for chance agreement, which matters when most tokens are unselected and the two classes are heavily imbalanced.
Limitations & Future Work
- Human rationales are not unbiased truths; they vary by annotator background and instructions.
- The study is restricted to extractive rationales and does not evaluate free-text explanations, which might contain external knowledge or inductive reasoning.
- Lower agreement (Kappa) does not necessarily mean a model explanation is useless, nor does high agreement guarantee faithfulness.
- Potential data contamination for SST may inflate performance.
- Faithfulness via masking can disrupt semantic coherence or input distribution.
- Future work should extend to generative explanations and multi-step reasoning tasks.
Related Work & Insights
- vs. Huang et al. (2023): This work extends beyond short-text sentiment to longer, cross-lingual, and more complex reasoning tasks (RaFoLa, Climate-Fever).
- vs. Randl et al. (2025): While Randl et al. compare self-explanations and saliency, this paper incorporates advanced post-hoc methods like LRP to highlight the divergence between semantic and structural evidence.
- ERASER Benchmark: Following the ERASER philosophy, this work treats human rationales as a plausibility benchmark but emphasizes that faithfulness is a separate, critical requirement.
- Insight: Applications should treat self-explanations as "candidate semantic evidence" and attributions as "computational cues," providing users with an interface that highlights both agreement and divergence.
Rating
- Novelty: ⭐⭐⭐⭐☆ Systematic comparison of three distinct explanation types across varied tasks.
- Experimental Thoroughness: ⭐⭐⭐⭐☆ Large-scale evaluation across models and languages, though restricted to extractive setups.
- Writing Quality: ⭐⭐⭐⭐☆ Clear distinction between plausibility and faithfulness.
- Value: ⭐⭐⭐⭐⭐ Essential warning against treating "human-sounding" explanations as "faithful" model indicators.