Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
Conference: ACL 2026
arXiv: 2601.05905
Code: https://github.com/zjunlp/belief (Available)
Area: LLM Reasoning / Calibration / Trustworthiness
Keywords: Belief Robustness, Neighborhood Consistency, Self-Consistency, Bayesian Belief, Structure-Aware Training
TL;DR
This paper argues that, for LLMs, high self-consistency does not equate to true belief: on 995 questions originally answered correctly with 100% self-consistency, accuracy fell to 33.8% once minor contextual interference was introduced. The authors propose Neighbor-Consistency Belief (NCB), which uses the joint consistency of a target fact and its "conceptual neighbors" (premises, entailments, topical associations) as a structured proxy for belief robustness. Drawing on Asch's conformity experiments and Source Credibility theory, they design a cognitive stress-test protocol and show on four LLMs that high-NCB facts are markedly more resistant to interference. They further propose Structure-Aware Training (SAT), which uses teacher-student KL distillation to enforce consistent outputs across neighborhood contexts and raises average stress-test accuracy on newly learned knowledge by roughly 27 to 30 points over the Ans. Aug and Know. Aug baselines.
Background & Motivation
Background: To evaluate whether an LLM "knows" a fact, the mainstream approach relies on self-consistency (vote-based consistency across multiple samples) or token-level confidence. However, LLMs are increasingly deployed in scenarios like RAG, multi-agent collaboration, and complex prompt engineering where they are "led by external context." In these settings, "knowing" is insufficient; the model must remain stable under interference.
Limitations of Prior Work: The authors conducted a pilot study using Qwen3-30B-A3B on 995 questions with a self-consistency of 1.0 (30/30 correct samples). After inserting a single peer disagreement, accuracy dropped from 100% to 33.8%. This indicates that existing confidence metrics cannot distinguish between "answering correctly via isolated memory fragments" and "answering based on structured beliefs."
Key Challenge: Belief should be a structured latent state (in cognitive science, the human brain organizes knowledge via semantic networks where related facts constrain each other to resist interference), whereas point-wise metrics like self-consistency only observe consistency across multiple outputs of the same question, failing to capture the network structure between facts.
Goal: (a) Provide a computational metric to distinguish between "answering via structural belief" and "answering via isolated memory"; (b) Design a rigorous cognitive stress-test to verify that this metric predicts robustness; (c) Use "structural invariance" as a training objective to ensure learned knowledge is more resistant to interference.
Key Insight: Model belief as a binary latent variable \(\theta \in \{\mathcal{S}_\text{struct}, \mathcal{S}_\text{unstruct}\}\) and use Bayesian posterior estimation—if the model answers a set of neighboring facts correctly, its posterior for being in a structured belief state is significantly higher than an unstructured one; this posterior is approximated as the NCB score.
Core Idea: Use "neighbor consistency" instead of "self-consistency" as a proxy for belief strength and explicitly incorporate this structural invariance into the training loss.
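The Bayesian argument behind the Key Insight can be written out explicitly. The following is a sketch using the symbols from this summary; the shorthands \(A\) and \(B\) are introduced here for brevity and are not the paper's notation. Let \(A = \{\hat{\mathcal{E}}^* = \mathcal{E}^*\}\) (the target is answered correctly) and \(B = \{\forall i,\ \hat{a}_i = a_i\}\) (all neighbor facts are answered correctly). Then:

\[
\frac{P(\theta = \mathcal{S}_\text{struct} \mid A, B)}{P(\theta = \mathcal{S}_\text{unstruct} \mid A, B)}
= \underbrace{\frac{P(B \mid A, \mathcal{S}_\text{struct})}{P(B \mid A, \mathcal{S}_\text{unstruct})}}_{\text{Bayes factor}}
\times
\underbrace{\frac{P(\mathcal{S}_\text{struct} \mid A)}{P(\mathcal{S}_\text{unstruct} \mid A)}}_{\text{prior odds}}
\]

If answering every neighbor correctly is far more likely under a structured belief than under an isolated memory, the Bayes factor is large, and the posterior odds favor \(\mathcal{S}_\text{struct}\) unless the prior odds are very small.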
Method
Overall Architecture
The method proceeds in three phases:
- Constructing the Neighbor-Enriched Dataset: 2,000 time-invariant facts were balanced across STEM, Arts & Culture, Social Sciences, and Sports, sampled from SimpleQA, HotpotQA, and SciQ. For each target \((q^*, \mathcal{E}^*)\), DeepSeek-V3.2 generated Neighbor Facts (NFs) covering three relationship types: entity premises, logical entailments, and thematic associations (average 7.84 NFs/fact), followed by manual filtering. Misleading Entities \(\mathcal{E}^\dagger\) and their Misleading Neighbor Facts (MNFs) (average 4.88/fact) were constructed for interference.
- NCB Measurement and Stress-Test Evaluation: For each fact, 30 target responses and 10 responses per neighbor were sampled (\(T=0.7\)), and NCB was estimated via Empirical Correctness Frequency. Models were then re-tested under two types of interference: Peer Quantity (Asch conformity) and Source Credibility (authoritative sources), across Standard, CoT, and Reflection reasoning strategies.
- Structure-Aware Training: Starting from a checkpoint obtained after initial answer augmentation (Ans. Aug), a teacher model sees the raw question while the student model sees the question plus either Neighborhood Context \(C_{nq}\) or General Noise Context \(C_\text{general}\). KL distillation pushes the student's output distribution under each context toward the teacher's context-free distribution.
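The NCB measurement step in the second phase can be illustrated with a small sketch. It assumes exact-match scoring and an unweighted mean as the aggregation over the target and its neighbor facts; the summary only says the frequencies are "aggregated", so both choices, along with the function names, are assumptions made here for illustration.

```python
from statistics import mean


def correctness_frequency(samples, gold):
    """Fraction of sampled answers that match the gold answer.

    Assumption: case-insensitive exact match stands in for the real answer checker.
    """
    if not samples:
        raise ValueError("need at least one sample")
    hits = sum(1 for s in samples if s.strip().lower() == gold.strip().lower())
    return hits / len(samples)


def ncb_score(target_samples, target_gold, neighbor_samples, neighbor_golds):
    """Aggregate correctness frequency over the target question and its neighbors.

    Assumption: an unweighted mean over all questions in the observation set.
    """
    freqs = [correctness_frequency(target_samples, target_gold)]
    for samples, gold in zip(neighbor_samples, neighbor_golds):
        freqs.append(correctness_frequency(samples, gold))
    return mean(freqs)


# Toy example: 30 target samples, 10 samples per neighbor (as in the protocol).
target = ["Paris"] * 30                       # self-consistency of 1.0
neighbors = [
    ["France"] * 10,                          # neighbor answered consistently
    ["Seine"] * 5 + ["Thames"] * 5,           # neighbor answered correctly half the time
]
score = ncb_score(target, "Paris", neighbors, ["France", "Seine"])
print(round(score, 3))
```

In this toy case the target has perfect self-consistency, yet the score is pulled down by the shaky neighbor, which is the kind of gap between self-consistency and NCB that the paper is concerned with.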
Key Designs
- Neighbor-Consistency Belief (NCB) Metric:
  - Function: Approximates, with a single scalar, the posterior probability that a fact belongs to a structured belief state, separating "true understanding" from "lucky memory."
  - Mechanism: Defines a latent variable \(\theta \in \{\mathcal{S}_\text{struct}, \mathcal{S}_\text{unstruct}\}\) and formulates the posterior as \(P(\theta = \mathcal{S}_\text{struct} \mid \hat{\mathcal{E}}^* = \mathcal{E}^*, \forall i, \hat{a}_i = a_i)\). By Bayes' theorem, the posterior odds decompose into Bayes Factor × Prior Odds. Under the assumption that \(P((\forall i, \hat a_i = a_i) \mid \hat{\mathcal{E}}^* = \mathcal{E}^*, \mathcal{S}_\text{struct}) \gg P((\forall i, \hat a_i = a_i) \mid \hat{\mathcal{E}}^* = \mathcal{E}^*, \mathcal{S}_\text{unstruct})\), the Bayes factor is large and the posterior odds are \(\gg 1\). The unobservable posterior is approximated by aggregating the Empirical Correctness Frequency \(\hat p(\hat a = a \mid q)\) over \(\mathcal{O} = \{(q^*, \mathcal{E}^*)\} \cup \text{NFs}\).
  - Design Motivation: Grounded in neurocognitive science (interlocking semantic networks, Anderson’s inhibitory control theory) and the idea of "anchoring in context" from the knowledge editing literature. Recasting "belief = isolated facts" as "belief = a structured neighbor network" explains why interference so easily disrupts answers with high self-consistency.
- Cognitive Stress-Test Protocol (Asch + Source Credibility):
  - Function: Operationalizes belief stability under external interference into quantifiable experiments.
  - Mechanism: (i) Peer Quantity simulates the Asch conformity experiment: the model observes dialogues from peer agents before answering \(q^*\). Scenarios include Conflict (peers directly give \(\mathcal{E}^\dagger\)) and Misleading (peers discuss MNFs, indirectly priming the wrong answer), with the number of peer agents \(N \in [1, 10]\). (ii) Source Credibility simulates the Hovland effect: interference text is attributed to sources of Low (social media), Medium (blogs), or High (academic/news) authority. Accuracy drops are compared between the top and bottom 5%, 20%, and 35% of facts ranked by NCB.
  - Design Motivation: Borrowing classical paradigms from 1950s social psychology supports ecological validity and provides clear, controllable axes (number of peers, authority level).
- Structure-Aware Training (SAT):
  - Function: Uses belief structural invariance as a training objective so that newly learned facts remain stable within their neighborhood context.
  - Mechanism: The teacher \(\theta_T\) is frozen and the student \(\theta_S\) is trainable; both are initialized from an Ans. Aug checkpoint. For each fact, two types of context are synthesized: \(C_{nq}\) (neighbor-related) and \(C_\text{general}\) (general noise). The student's output distribution \(P_{\theta_S}(y \mid c, x)\) is aligned via KL divergence with the teacher's context-free distribution \(P_{\theta_T}(y \mid x)\) over a batch \(C_b\) of (context, question) pairs: \(\mathcal{L}_\text{KD} = \frac{1}{|C_b|}\sum_{(c, x) \in C_b} D_\text{KL}\bigl(P_{\theta_T}(y \mid x) \parallel P_{\theta_S}(y \mid c, x)\bigr)\).
  - Design Motivation: Traditional SFT only encourages the model to memorize \((q, a)\) pairs without enforcing consistency under noise. SAT injects this robustness constraint explicitly, shifting the belief representation from point-wise to context-invariant at the loss level.
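The Conflict scenario of the Peer Quantity stress test described in the protocol above can be sketched as a prompt builder. The template wording, the function name `build_conflict_prompt`, and the "Agent i" labels are hypothetical; the summary does not give the paper's actual prompts, only that peers state the misleading entity before the model answers and that the number of peers ranges from 1 to 10.

```python
def build_conflict_prompt(question: str, misleading_entity: str, n_peers: int) -> str:
    """Build a prompt in which n_peers peer agents assert the misleading answer.

    Hypothetical template: the exact wording used in the paper is not known here.
    """
    if not 1 <= n_peers <= 10:
        raise ValueError("n_peers must be between 1 and 10")
    lines = [f"Question: {question}", ""]
    for i in range(1, n_peers + 1):
        # Each peer directly states the misleading entity (Conflict scenario).
        lines.append(f"Agent {i}: I believe the answer is {misleading_entity}.")
    lines += ["", "Now give your own answer to the question."]
    return "\n".join(lines)


prompt = build_conflict_prompt("What is the capital of Australia?", "Sydney", 3)
print(prompt)
```

Sweeping `n_peers` while holding the question fixed gives the controllable "peer quantity" axis; a Misleading variant would replace the direct assertions with statements of misleading neighbor facts.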
Loss & Training
In SAT, the student optimizes only the KL loss (no hard-label supervision), which trains it to reproduce the teacher's interference-free distribution under any context \(c\). Both teacher and student are based on Qwen-2.5-32B-Instruct. Stress-test details: 30 target samples + 10 neighbor samples per fact, \(T=0.7\), bf16 + vLLM, 8×A100.
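A minimal sketch of the distillation objective follows, assuming a single next-token distribution over a toy vocabulary and plain Python lists of logits. A real implementation would operate on full output sequences inside a deep learning framework; the helper names here are illustrative and not taken from the paper's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def sat_kd_loss(batch):
    """Mean KL(P_T || P_S) over a batch of (teacher_logits, student_logits) pairs.

    teacher_logits: frozen teacher's logits given the question x alone.
    student_logits: student's logits given a context c plus the same question x.
    """
    losses = [kl_divergence(softmax(t), softmax(s)) for t, s in batch]
    return sum(losses) / len(losses)


teacher = [2.0, 0.5, -1.0]
loss_same = sat_kd_loss([(teacher, teacher)])             # student unaffected by context
loss_shifted = sat_kd_loss([(teacher, [0.0, 1.5, 0.0])])  # context shifted the student
print(loss_same, loss_shifted)
```

The loss is zero when the context leaves the student's distribution identical to the teacher's and grows as the context pulls the student away, which is the invariance the objective is meant to enforce.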
Key Experimental Results
Main Results
Stress-test results for four LLMs, comparing the top and bottom 35% of facts ranked by NCB, under Standard prompting and, for Source stress, Reflection. Each cell shows accuracy under stress, with the drop from the unstressed baseline (near 100%) in parentheses:
| Model | NCB Group | Quantity-Stress Standard | Source-Stress Standard | Reflection (Source) |
|---|---|---|---|---|
| Qwen-2.5-32B | Low NCB-35% | 74.0 (↓25.7) | 79.2 (↓20.5) | 78.7 (↓20.9) |
| Qwen-2.5-32B | High NCB-35% | 84.0 (↓16.0) | 87.2 (↓12.8) | 84.5 (↓15.5) |
| Qwen3-30B-A3B | Low NCB-35% | 70.8 (↓28.8) | 75.2 (↓24.3) | 84.1 (↓15.4) |
| Qwen3-30B-A3B | High NCB-35% | 82.4 (↓17.6) | 85.4 (↓14.6) | 90.2 (↓9.8) |
| Qwen3-30B-Thinking | Low NCB-35% | 77.3 (↓22.6) | 77.8 (↓22.1) | 84.7 (↓15.3) |
| Qwen3-30B-Thinking | High NCB-35% | 88.1 (↓11.3) | 87.1 (↓12.3) | 93.7 (↓5.8) |
| OLMo-2-32B | Low NCB-35% | 71.4 (↓28.3) | 80.3 (↓19.3) | 85.1 (↓14.5) |
| OLMo-2-32B | High NCB-35% | 81.3 (↓18.7) | 88.2 (↓11.8) | 89.8 (↓10.2) |
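Across all four models, the high-NCB group drops less than the low-NCB group in every cell. The high-NCB drop is mostly about 50%–70% of the low-NCB drop, ranging from roughly 38% (Qwen3-30B-Thinking under Reflection) to 74% (Qwen-2.5-32B under Reflection).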
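The ratio of high-NCB to low-NCB drops can be checked directly from the table. The snippet below only copies the parenthesized drops from the table above; the variable names are illustrative.

```python
# Accuracy drops (percentage points) copied from the table above, in column order:
# Quantity-Stress Standard, Source-Stress Standard, Reflection (Source).
drops = {
    "Qwen-2.5-32B": {"low": (25.7, 20.5, 20.9), "high": (16.0, 12.8, 15.5)},
    "Qwen3-30B-A3B": {"low": (28.8, 24.3, 15.4), "high": (17.6, 14.6, 9.8)},
    "Qwen3-30B-Thinking": {"low": (22.6, 22.1, 15.3), "high": (11.3, 12.3, 5.8)},
    "OLMo-2-32B": {"low": (28.3, 19.3, 14.5), "high": (18.7, 11.8, 10.2)},
}

# Ratio of the high-NCB drop to the low-NCB drop, per model and column.
ratios = {
    model: [high / low for high, low in zip(d["high"], d["low"])]
    for model, d in drops.items()
}
all_ratios = [r for rs in ratios.values() for r in rs]
min_ratio, max_ratio = min(all_ratios), max(all_ratios)

for model, rs in ratios.items():
    print(model, [round(r, 2) for r in rs])
print("range:", round(min_ratio, 2), "to", round(max_ratio, 2))
```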
Ablation Study
SAT vs. two SFT augmentation baselines (Qwen-2.5-32B-Instruct, 100 facts originally answered incorrectly):
| Metric | Vanilla (Untrained) | Ans. Aug | Know. Aug | SAT (Ours) |
|---|---|---|---|---|
| Base ACC | 4.8 | 92.4 | 85.4 | 93.0 |
| Quantity Stress | 8.2 | 20.1 | 31.0 | 58.1 |
| Source Stress | 4.6 | 41.6 | 35.7 | 63.0 |
| Average Stress | 6.4 | 30.9 | 33.4 | 60.6 |
| MMLU | 72.84 | 82.9 | 81.1 | 80.1 |
| GSM8k | 91.66 | 91.5 | 88.8 | 91.0 |
SAT raises Average Stress from 33.4 (Know. Aug, the stronger baseline) to 60.6, an ~81% relative gain, while Base ACC stays at 93.0. GSM8k is essentially flat (91.0 vs. 91.5 for Ans. Aug), and MMLU dips modestly (80.1 vs. 82.9 for Ans. Aug).
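The size of the SAT gain can be recomputed from the Average Stress row of the table. The numbers below are copied from that row; the variable names are illustrative.

```python
# Average Stress accuracy (%) from the ablation table.
ans_aug, know_aug, sat = 30.9, 33.4, 60.6

# Absolute gains in percentage points.
gain_vs_ans = sat - ans_aug
gain_vs_know = sat - know_aug

# Relative gains over each baseline.
rel_vs_ans = gain_vs_ans / ans_aug
rel_vs_know = gain_vs_know / know_aug

print(f"vs Ans. Aug:  +{gain_vs_ans:.1f} points ({rel_vs_ans:.0%} relative)")
print(f"vs Know. Aug: +{gain_vs_know:.1f} points ({rel_vs_know:.0%} relative)")
```

The absolute gains of about 27 to 30 points line up with the "approximately 30%" figure quoted elsewhere in this summary, while the relative gain over the stronger baseline is about 81%.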
Key Findings
- Finding 1 — NCB is a reliable metric for belief robustness: All four models show significantly smaller drops in the high NCB group, with the most pronounced contrast in Qwen3-Thinking (↓11.3% vs ↓22.6%). Coverage analysis revealed that Qwen3-Thinking tends to "self-refuse" on low NCB samples, suggesting reasoning models are aware of their "unstructured" knowledge.
- Finding 2 — Structural beliefs remain stable as interference intensity increases: Incremental peer conflict (from 0 to 6 opposing votes) caused low NCB accuracy to crash (97%→62%), while high NCB only declined gradually (98%→81%). Asch's classic conclusion was replicated: a single truth-teller (cfg5) significantly reduces conformity pressure.
- Finding 3 — CoT is unstable, Reflection wins: CoT often amplifies the drop (e.g., Qwen-2.5 Low NCB-35% worsened from ↓25.7% to ↓31.6%), whereas Reflection significantly mitigates it. CoT also showed a non-linear "Latitude of Rejection" effect, where accuracy dropped most at moderate interference levels.
- Finding 4 — Model scaling does not resolve belief fragility: Increasing Qwen-2.5 from 1.5B to 72B did not close the robustness gap between high and low NCB, suggesting this is not a problem solvable by scale alone.
- SAT's robustness gain is close to a free lunch: average stress-test accuracy rises by roughly 27 to 30 points while GSM8k stays flat and MMLU dips only slightly, suggesting that "structural invariance" can be injected largely independently of general capabilities.
Highlights & Insights
- The conceptual shift from point-wise to graph-wise belief evaluation is the most significant contribution: this serves as a wake-up call to engineers relying solely on "high confidence"—high confidence is not synonymous with knowing.
- Using Asch and Source Credibility as stress-test protocols provides an elegant template for "Cognitive Psychology × LLM" research, showing how 70-year-old experimental designs provide ready-made schemas for controllable evaluation.
- The SAT training paradigm (Teacher-Student KL + context augmentation) is a highly reusable trick for RAG fine-tuning, adversarial robustness, and persona consistency.
- The formal derivation of NCB using Bayesian Odds provides a grounded mathematical definition for the psychological concept of "structural belief," making it more persuasive and easier to analyze theoretically than heuristic scoring.
Limitations & Future Work
- The Neighbor Facts only cover three relation types (premises, entailment, association), excluding causal chains or hierarchical taxonomies, and are limited to time-invariant facts.
- NCB lacks direct validation against human judgments of "true understanding"; it currently serves as a proxy for robustness rather than a direct measure of human-like comprehension.
- Constructing belief neighborhoods introduces significant computational overhead in both training and inference; future work should optimize this via neighborhood caching or selective sampling.
Related Work & Insights
- vs. Self-Consistency (Wang et al., 2023a): SC is point-wise; this paper shows that it "systematically overestimates robustness."
- vs. Semantic Entropy (Farquhar et al., 2024): Semantic entropy improves on token-level probability but remains point-wise. NCB is presented as the first metric to extend belief estimation to conceptual neighborhoods.
- vs. Knowledge Editing Brittleness (Anthropic SDF 2025): Previous work noted that new knowledge is more fragile; this paper provides a structural explanation (lack of neighbor consistency) and the SAT solution.
- vs. Conformity in Multi-agent Systems (Zhang et al. 2024): This work quantifies conformity intensity using the Asch paradigm and finds that the "dissenter effect" still holds for LLMs.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Shifts evaluation from point-wise to graph-wise; elegant use of psychological paradigms.
- Experimental Thoroughness: ⭐⭐⭐⭐ Broad coverage across 4 LLMs and multiple strategies; lacks SAT validation on ultra-large models (70B+).
- Writing Quality: ⭐⭐⭐⭐ Clear logic across concept, formula, and experiment.
- Value: ⭐⭐⭐⭐⭐ Addresses the critical "illusion of confidence" issue with quantifiable metrics and effective training solutions.