Aligning What LLMs Do and Say: Towards Self-Consistent Explanations¶
Conference: ACL 2026 | arXiv: 2506.07523 | Code: GitHub | Area: Interpretability | Keywords: Self-consistency, Feature Attribution, Explanation Faithfulness, DPO Optimization, Attribution Alignment
TL;DR¶
This paper constructs a large-scale Post-hoc Self-Consistency Bank (PSCB; 85K decisions, 428K explanations), uses it to quantify the feature-attribution gap between LLM answers and their post-hoc explanations, and improves the attribution consistency of explanations via DPO optimization without sacrificing accuracy.
Background & Motivation¶
Background: LLMs are frequently asked to generate natural language explanations for their answers, yet these post-hoc explanations often fail to align with the input features that actually drive the answers — i.e., what the model says differs from what it does.
Limitations of Prior Work: (1) Existing faithfulness metrics (e.g., counterfactual interventions) are computationally expensive and difficult to apply at scale; (2) Methods such as CC-SHAP have only been evaluated on approximately 100 samples, limiting the reliability of their conclusions; (3) No prior work has demonstrated how to reduce attribution inconsistency.
Key Challenge: LLM explanations may be fluent and plausible yet "miss the point" — the input features they highlight differ from those that actually drive the answer, posing a fundamental threat to trustworthy AI.
Goal: (1) Quantify attribution consistency between answers and explanations at scale; (2) Propose methods to improve it.
Key Insight: Feature attribution vectors are computed separately for each QA decision and its multiple explanations, and their alignment is compared. DPO fine-tuning on attribution preference data is then applied to improve consistency.
Core Idea: Spearman rank correlation better discriminates between high- and low-quality explanations than cosine similarity; DPO optimization on attribution preferences effectively improves self-consistency and generalizes across domains.
Method¶
Overall Architecture¶
PSCB construction pipeline: (1) Compute feature attribution vectors for QA decisions; (2) Generate \(K\) diverse explanations per decision and compute attribution vectors for each; (3) Measure decision–explanation attribution alignment using an alignment function; (4) Select the best and worst explanations to construct preference pairs for DPO optimization.
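As a concrete illustration, the sketch below walks through this pipeline for a single decision. The helpers `attribution_vector` and `generate_explanations` are hypothetical stand-ins for the paper's tooling (e.g., LIME / Layer Integrated Gradients wrappers and nucleus-sampled explanation generation); only `scipy.stats.spearmanr` is a real library call.

```python
# Minimal sketch (not the authors' code) of building one PSCB preference pair.
from scipy.stats import spearmanr

def build_preference_pair(model, question, answer, k=5):
    """Return (chosen, rejected) explanations for a single QA decision."""
    # (1) Attribution vector over input features for the decision itself.
    phi_dec = attribution_vector(model, question, target=answer)

    # (2) Sample k diverse explanations and attribute each of them.
    explanations = generate_explanations(model, question, answer,
                                         k=k, top_p=0.9, temperature=0.7)
    scored = []
    for explanation in explanations:
        phi_exp = attribution_vector(model, question, target=explanation)
        # (3) Alignment: Spearman rank correlation between the two attribution vectors.
        cc_sp, _ = spearmanr(phi_dec, phi_exp)
        scored.append((cc_sp, explanation))

    # (4) Most-aligned explanation becomes "chosen", least-aligned becomes "rejected".
    scored.sort(key=lambda pair: pair[0])
    return scored[-1][1], scored[0][1]
```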
Key Designs¶
- Post-hoc Self-Consistency Bank (PSCB):
- Function: Provides a large-scale attribution-augmented QA benchmark.
- Mechanism: roughly 85K decisions, each with 5 sampled explanations, yielding 428K explanation–attribution pairs in total. Two attribution methods are used (LIME and Layer Integrated Gradients, LIG), covering 4 QA datasets and 2 LLMs.
- Design Motivation: Prior evaluation was limited to approximately 100 samples, precluding reliable conclusions. Large-scale data is a prerequisite for systematic study.
- Spearman Rank Correlation as Alignment Metric:
- Function: More reliably measures attribution alignment than cosine similarity.
- Mechanism: Spearman rank correlation \(CC_{sp} = 1 - \frac{6\sum_{i=1}^{m}(r(\phi_i^{dec}) - r(\phi_i^{exp}))^2}{m(m^2-1)}\), where \(r(\cdot)\) denotes the rank of a feature's attribution and \(m\) is the number of input features, captures consistency in feature priority ordering and is invariant to attribution scale.
- Design Motivation: When distinguishing good from bad explanations, cosine similarity yields heavily overlapping score distributions (poor discriminative power), whereas Spearman rank correlation clearly separates explanations of different quality; see the toy comparison after this list.
- Attribution-Preference-Based DPO Optimization:
- Function: Improves explanation self-consistency without degrading task accuracy.
- Mechanism: For each PSCB decision, the highest-consistency explanation is taken as chosen and the lowest-consistency one as rejected; the LLM is then fine-tuned with DPO on these preference pairs.
- Design Motivation: SFT on the same data performs poorly; DPO is better suited to learning the subtle distinctions in attribution preferences.
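To make the metric choice concrete, the toy comparison below uses invented attribution values (not numbers from the paper): two vectors with the same feature priority ordering can have a mediocre cosine similarity, while a vector that matches only the single dominant feature can score a near-perfect cosine despite a scrambled ranking.

```python
# Toy illustration of cosine similarity vs. Spearman rank correlation
# on made-up attribution vectors (values are not from the paper).
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

phi_dec  = np.array([0.90, 0.05, 0.03, 0.02])  # decision attributions
phi_good = np.array([0.40, 0.35, 0.30, 0.25])  # same priority order, flatter profile
phi_bad  = np.array([0.85, 0.01, 0.02, 0.04])  # top feature matches, tail order scrambled

rho_good, _ = spearmanr(phi_dec, phi_good)
rho_bad, _ = spearmanr(phi_dec, phi_bad)
print(cosine(phi_dec, phi_good), rho_good)  # ~0.66 vs 1.0: ranking fully preserved
print(cosine(phi_dec, phi_bad), rho_bad)    # ~1.00 vs 0.2: ranking mostly broken
```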
Loss & Training¶
The standard DPO objective is used, with training conducted on the preference pairs from PSCB. Explanations are generated via nucleus sampling (top-\(p=0.9\), temperature \(T=0.7\)), with 5 explanations per decision; the best- and worst-aligned explanations form each preference pair.
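A minimal training sketch with Hugging Face TRL's `DPOTrainer` is shown below; the model name, `beta` value, and the single toy preference record are illustrative assumptions rather than the authors' setup, and keyword names may differ slightly across TRL versions.

```python
# Minimal DPO fine-tuning sketch on PSCB-style preference pairs (illustrative only).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.2-3B-Instruct"  # illustrative; the paper uses LLaMA3.1/3.2 models
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# One toy record; each PSCB pair holds the best- and worst-aligned explanation for a decision.
pairs = Dataset.from_list([{
    "prompt": "Question: ... Answer: ... Explain why this answer is correct.",
    "chosen": "explanation with the highest attribution alignment",
    "rejected": "explanation with the lowest attribution alignment",
}])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="pscb-dpo", beta=0.1),  # beta is an illustrative value
    train_dataset=pairs,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```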
Key Experimental Results¶
Main Results¶
| Model | Dataset | Mean CC-Sp (Before) | CC-Sp (After DPO) | Accuracy Change |
|---|---|---|---|---|
| LLaMA3.1-8B | ECQA | 18.47 | Significant gain | No degradation |
| LLaMA3.2-3B | ECQA | 9.75 | Significant gain | No degradation |
Ablation Study¶
| Configuration | Key Metric | Note |
|---|---|---|
| DPO vs. SFT | DPO significantly outperforms SFT | SFT fails to learn attribution preferences |
| LIME vs. LIG | Gains do not transfer across methods | Different attribution methods capture different dimensions |
| Cross-domain generalization | Effective | Improvements from ECQA training generalize to ARC, etc. |
| Correct vs. incorrect answers | Orthogonal | Self-consistency is largely independent of accuracy |
Key Findings¶
- Self-consistency and accuracy are largely orthogonal — inconsistently explained answers may still be correct, and consistent ones may be wrong.
- Spearman rank correlation demonstrates significantly stronger discriminative power than cosine similarity.
- Self-consistency gains from DPO optimization generalize across domains but not across attribution methods.
- Different attribution methods (LIME vs. LIG) capture fundamentally distinct notions of input relevance.
Highlights & Insights¶
- The finding that "self-consistency and accuracy are orthogonal" is significant — accurate models do not necessarily produce faithful explanations.
- The paper reveals a practical tension: DPO can improve LIME-based consistency without improving LIG-based consistency, indicating that "faithful explanation" is itself a multi-dimensional concept.
- PSCB as a large-scale resource holds long-term value for the interpretability community.
Limitations & Future Work¶
- Evaluation is limited to multiple-choice QA; applicability to open-ended generation tasks remains unknown.
- Both LIME and LIG have inherent limitations; more advanced attribution methods may yield different conclusions.
- Self-consistency remains a proxy for faithfulness and does not equate to true interpretability of the decision process.
- Future work may extend to larger models and broader task types.
Related Work & Insights¶
- vs. CC-SHAP: Scales evaluation from roughly 100 samples to 85K decisions and is the first to demonstrate a method for improving self-consistency rather than only measuring it.
- vs. Counterfactual Intervention Methods: Replaces expensive counterfactual testing with attribution vector comparison, substantially reducing cost.
- vs. RLHF: Extends preference learning from "human preferences" to "attribution consistency preferences," representing a new dimension of alignment.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Attribution-preference DPO optimization is an entirely new direction.
- Experimental Thoroughness: ⭐⭐⭐⭐ Large-scale benchmark, cross-domain generalization, DPO vs. SFT comparison.
- Writing Quality: ⭐⭐⭐⭐ Formally rigorous with clear experimental design.
- Value: ⭐⭐⭐⭐⭐ Far-reaching implications for LLM interpretability and trustworthy AI.