Aligning What LLMs Do and Say: Towards Self-Consistent Explanations
Conference: ACL 2026 Findings
arXiv: 2506.07523
Code: GitHub
Area: Interpretability
Keywords: Self-consistency, Feature attribution, Explanation faithfulness, DPO optimization, Attribution alignment
TL;DR
Constructed the Post-hoc Self-Consistency Bank (PSCB; about 85K decisions with 428K explanations) to quantify the feature-attribution gap between LLM answers and their natural language explanations. Improved attribution consistency through DPO optimization without compromising model accuracy.
Background & Motivation
Background: LLMs are often required to generate natural language explanations to justify their answers. However, these post-hoc explanations are frequently inconsistent with the input features that actually drive the answer—there is a discrepancy between what the model says and what it does.
Limitations of Prior Work: (1) Existing faithfulness measurement methods (such as counterfactual interventions) are computationally expensive and difficult to apply at scale; (2) Methods like CC-SHAP have only been evaluated on approximately 100 samples, limiting the reliability of their conclusions; (3) There has been no prior demonstration of how to mitigate this attribution inconsistency.
Key Challenge: LLM explanations may be fluent and plausible but fail to describe the actual reasoning process (rationalization). The disconnect between the input features prioritized in the explanation versus the decision poses a fundamental threat to Trustworthy AI.
Goal: (1) Quantify the attribution consistency between answers and explanations at scale; (2) Propose methods to improve this alignment.
Key Insight: Feature attribution vectors are calculated separately for each QA decision and its corresponding multiple explanations. The alignment between them is then measured. DPO is utilized on attribution-based preference data to improve consistency.
Core Idea: Spearman rank correlation is more effective than cosine similarity in distinguishing between high- and low-quality explanations. DPO optimization based on attribution preferences successfully enhances self-consistency and generalizes across domains.
Method
Overall Architecture
PSCB Construction Pipeline: (1) Calculate feature attribution vectors for QA decisions; (2) Generate \(K\) diverse explanations for each decision and calculate their respective attribution vectors; (3) Use an alignment function to measure the attribution consistency between the decision and the explanations; (4) Select the best and worst explanations to construct preference pairs for DPO optimization. The first two steps populate the PSCB benchmark, while the latter two steps perform alignment measurement and preference optimization on top of the bank.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
A["Multiple Choice QA Input"] --> SUB
subgraph SUB["Post-hoc Self-Consistency Bank (PSCB)"]
direction TB
B["Decision<br/>LLM Answer + Input Attribution Vector (LIME / LIG)"]
C["Explanation<br/>K Explanations via Temperature Sampling + Attribution Vectors"]
B --> C
end
SUB --> D["Spearman Rank Correlation Alignment<br/>Compare Decision vs. Explanation Attribution Vectors"]
D --> E["Construct Preference Pairs<br/>Highest score (chosen) / Lowest score (rejected)"]
E --> F["DPO Fine-tuning"]
F --> G["Self-Consistent LLM"]
Key Designs
1. Post-hoc Self-Consistency Bank (PSCB): Scaling Evaluation from Hundreds to Hundreds of Thousands
Previous works like CC-SHAP evaluated attribution consistency on only about 100 samples, which restricted the reliability of conclusions and hindered systematic research. PSCB expands the scale to roughly 85K decisions with 5 explanations each, about 428K explanation-attribution pairs in total. It uses two attribution methods, LIME and Layer Integrated Gradients (LIG), and covers 4 QA datasets and 2 LLMs, providing an attribution-augmented benchmark for large-scale quantification and DPO optimization.
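One way to picture a single PSCB entry is as a decision with its attribution vector plus a list of sampled explanations, each with its own attribution vector. The sketch below is illustrative only: the class names, field names, and toy values are assumptions and are not taken from the released benchmark.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Explanation:
    text: str                 # sampled natural language explanation
    attribution: List[float]  # one score per input feature (hypothetical layout)


@dataclass
class PSCBRecord:
    question: str
    options: List[str]
    answer: str                        # option selected by the LLM
    attribution_method: str            # "LIME" or "LIG"
    decision_attribution: List[float]  # attribution vector for the answer
    explanations: List[Explanation] = field(default_factory=list)

    def is_well_formed(self) -> bool:
        """Check that every explanation vector covers the same features as the decision."""
        m = len(self.decision_attribution)
        return all(len(e.attribution) == m for e in self.explanations)


# Toy record with made-up values, only to show the shape of the data.
record = PSCBRecord(
    question="toy question",
    options=["A", "B", "C"],
    answer="B",
    attribution_method="LIME",
    decision_attribution=[0.6, 0.3, 0.1],
    explanations=[
        Explanation("toy explanation 1", [0.5, 0.4, 0.1]),
        Explanation("toy explanation 2", [0.1, 0.2, 0.7]),
    ],
)
print(record.is_well_formed())
```

Storing the decision vector and the explanation vectors over the same input features is what makes a per-feature comparison between them possible.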
2. Spearman Rank Correlation as an Alignment Metric: Better Discriminative Power than Cosine Similarity
Cosine similarity is sensitive to attribution magnitudes, so its score distributions for good and bad explanations overlap heavily and it discriminates poorly between them. This paper instead adopts Spearman rank correlation, \(CC_{sp} = 1 - \frac{6\sum_{i=1}^{m}\left(r(\phi_i^{dec}) - r(\phi_i^{exp})\right)^2}{m(m^2-1)}\), where \(\phi^{dec}\) and \(\phi^{exp}\) are the decision and explanation attribution vectors, \(r(\cdot)\) is the rank of a feature within its vector, and \(m\) is the number of input features. The metric depends only on how the two vectors rank the features, which avoids magnitude interference and clearly separates explanations of varying quality.
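The contrast between the two metrics can be seen on a toy example. The following is a minimal pure-Python sketch, not the paper's implementation: the function names and attribution values are made up, and the closed-form Spearman expression is exact only when there are no tied ranks.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) to m, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of the 1-based positions in the tie group
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def spearman(dec: Sequence[float], exp: Sequence[float]) -> float:
    """CC_sp = 1 - 6 * sum(d_i^2) / (m * (m^2 - 1)), with d_i the rank difference."""
    m = len(dec)
    if m != len(exp) or m < 2:
        raise ValueError("vectors must have the same length m >= 2")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(dec), ranks(exp)))
    return 1 - 6 * d2 / (m * (m * m - 1))


def cosine(dec: Sequence[float], exp: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(dec, exp))
    norm = math.sqrt(sum(a * a for a in dec)) * math.sqrt(sum(b * b for b in exp))
    return dot / norm if norm else 0.0


# Toy attribution vectors (made-up numbers).
decision = [0.9, 0.05, 0.03, 0.02]
same_order = [0.4, 0.3, 0.2, 0.1]      # same feature ranking, different magnitudes
shuffled = [0.9, 0.02, 0.03, 0.05]     # dominant feature kept, the rest reordered

for name, vec in [("same_order", same_order), ("shuffled", shuffled)]:
    print(name, round(cosine(decision, vec), 3), round(spearman(decision, vec), 3))
```

In this toy case cosine similarity scores the shuffled vector higher because one large-magnitude feature dominates the dot product, whereas the rank correlation is perfect for the vector that preserves the feature ordering and drops sharply for the shuffled one.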
3. Attribution-based DPO Optimization: Improving Self-Consistency Without Damaging Accuracy
SFT struggles to learn the subtle differences in attribution preferences on the same data. This paper adopts preference learning: constructing pairs by selecting the explanation with the highest self-consistency as "chosen" and the lowest as "rejected" from PSCB. DPO fine-tuning then guides the LLM to favor producing explanations more consistent with the decision attribution while maintaining task accuracy.
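The chosen/rejected selection can be sketched as below. This is an assumption-laden illustration: `build_preference_pair` and the `min_gap` filter are hypothetical and not described in the source, and the `prompt`/`chosen`/`rejected` keys follow a common convention for preference datasets rather than the paper's actual data format.

```python
from typing import Dict, Optional, Sequence


def build_preference_pair(
    prompt: str,
    explanations: Sequence[str],
    scores: Sequence[float],
    min_gap: float = 0.0,  # hypothetical filter, not from the source
) -> Optional[Dict[str, str]]:
    """Pick the most and least self-consistent explanations for one decision."""
    if len(explanations) != len(scores) or len(explanations) < 2:
        raise ValueError("need at least two explanations, one score each")
    best = max(range(len(scores)), key=lambda i: scores[i])
    worst = min(range(len(scores)), key=lambda i: scores[i])
    if scores[best] - scores[worst] <= min_gap:
        return None  # all explanations score alike: no usable preference signal
    return {
        "prompt": prompt,
        "chosen": explanations[best],
        "rejected": explanations[worst],
    }


# Toy usage with made-up alignment scores for K = 5 explanations.
pair = build_preference_pair(
    prompt="toy question with options",
    explanations=["e1", "e2", "e3", "e4", "e5"],
    scores=[0.12, 0.55, -0.20, 0.31, 0.08],
)
print(pair)
```

Dropping decisions whose explanations all score the same is a design choice of this sketch; it avoids training on pairs where "chosen" and "rejected" are indistinguishable under the alignment metric.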
Loss & Training
A standard DPO objective is used for training on PSCB preference pairs. Explanations are generated via temperature sampling with nucleus filtering (\(p=0.9\), \(T=0.7\)), with 5 explanations per decision. The explanations with the highest and lowest alignment scores form each preference pair.
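For reference, the standard DPO loss for a single preference pair is \(-\log \sigma\big(\beta[(\log \pi_\theta(y_w|x) - \log \pi_{ref}(y_w|x)) - (\log \pi_\theta(y_l|x) - \log \pi_{ref}(y_l|x))]\big)\). The sketch below evaluates it from sequence log-probabilities; the function name, the toy numbers, and \(\beta = 0.1\) are assumptions, since the source does not state the training hyperparameters.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not given in the source
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy log-probabilities (made up). The policy that favors the chosen
# explanation more than the reference does gets the lower loss.
print(dpo_loss(-5.0, -9.0, -7.0, -7.0))  # policy prefers "chosen"
print(dpo_loss(-9.0, -5.0, -7.0, -7.0))  # policy prefers "rejected"
```

When the policy matches the reference on both sequences the margin is zero and the loss equals \(\log 2\), which is a convenient sanity check.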
Key Experimental Results
Main Results
| Model | Dataset | CC-Sp (Pre-optimization) | CC-Sp (Post-DPO) | Accuracy Change |
|---|---|---|---|---|
| LLaMA3.1-8B | ECQA | 18.47 (mean) | Significant Improvement | No Decrease |
| LLaMA3.2-3B | ECQA | 9.75 (mean) | Significant Improvement | No Decrease |
Ablation Study
| Configuration | Key Metric | Description |
|---|---|---|
| DPO vs SFT | DPO significantly outperforms SFT | SFT fails to learn attribution preferences |
| LIME vs LIG | Improvements do not generalize across methods | Different attribution methods capture different dimensions |
| Cross-domain Generalization | Effective | Improvements from ECQA training generalize to ARC, etc. |
| Correct vs Incorrect Answers | Orthogonal | Self-consistency is largely independent of accuracy |
Key Findings
- Self-consistency and accuracy are largely orthogonal—inconsistent explanations can accompany correct answers, and consistent explanations can accompany incorrect ones.
- Spearman rank correlation provides significantly better discriminative power than cosine similarity.
- Improvements in self-consistency via DPO optimization generalize across domains but do not transfer across different attribution methods.
- Different attribution methods (LIME vs. LIG) capture fundamentally different concepts of input relevance.
Highlights & Insights
- The finding that "self-consistency and accuracy are orthogonal" is crucial—accurate models do not necessarily provide faithful explanations.
- Reveals a practical tension: DPO can improve LIME-based consistency without improving LIG-based consistency, indicating that "faithful explanation" is a multi-dimensional concept.
- PSCB serves as a large-scale resource with long-term value for the interpretability community.
Limitations & Future Work
- Validated only on multiple-choice QA; applicability to open-ended generation tasks remains unknown.
- LIME and LIG have their own limitations; more advanced attribution methods might yield different conclusions.
- Self-consistency remains a proxy for faithfulness: agreement between attribution vectors does not guarantee that the explanation describes the actual decision process.
- Future work could extend to larger models and more diverse task types.
Related Work & Insights
- vs. CC-SHAP: Expanded the evaluation scale from roughly 100 samples to about 85K decisions and demonstrated the first method for improving attribution consistency.
- vs. Counterfactual Intervention Methods: Replaced expensive counterfactual testing with attribution vector comparison, significantly reducing costs.
- vs. RLHF: Extended preference learning from "human preference" to "attribution consistency preference," representing a new dimension of alignment.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Attribution-based DPO optimization is a brand-new direction.
- Experimental Thoroughness: ⭐⭐⭐⭐ Large-scale benchmark, cross-domain generalization, and DPO vs. SFT comparison.
- Writing Quality: ⭐⭐⭐⭐ Rigorous formalization and clear experimental design.
- Value: ⭐⭐⭐⭐⭐ Significant implications for LLM interpretability and Trustworthy AI.