Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

Conference: ACL 2026
arXiv: 2605.03971
Code: https://summerrice.github.io/LaaB (Project Homepage)
Area: Hallucination Detection
Keywords: Hallucination Detection, Self-Judgment, Meta-Judgment, Logical Constraints, Mutual Learning, Internal Features

TL;DR

LaaB treats an LLM's self-judgment (its own verdict on whether its answer was correct) as a generation that can itself be hallucinated. It first trains a "meta-judgment detector" on intrinsic features to estimate how credible that self-judgment is. It then exploits an inherent logical rule: if the self-judgment is "Yes", the response label and the judgment label are identical; if it is "No", they are opposite. A Huber loss enforces this rule between the response detector and the meta-judgment detector through confidence-weighted mutual learning. At inference time only the response detector is used, so the gains from both views come with no additional inference cost.

Background & Motivation

Background: Current LLM hallucination detection follows two main paradigms: (a) the intrinsic-pattern route—extracting hidden states (SAPLMA), prediction logits (Logits Lens), or attention patterns (Lookback Lens) during generation, essentially quantifying uncertainty at a microscopic level; (b) the self-judgment route—directly asking the LLM "Is what I just said correct?", using macroscopic symbolic judgments as signals.

Limitations of Prior Work: Both routes have critical weaknesses: (a) although fine-grained neural signals are obtained, "high-confidence hallucinations" (where the LLM is wrong but certain) are hard to detect, and metrics lack semantic calibration; (b) while explicit semantic reasoning is present, the verbal judgment itself may suffer from "secondary hallucinations"—self-preference bias, overthinking, and evaluative hallucinations make "it says it's true" unreliable.

Key Challenge: Intrinsic features (implicit/neural/micro) and self-judgments (explicit/symbolic/macro) are coupled behaviors yet are processed independently. Directly concatenating the two (e.g., simple ensemble) risks the weaker signal dragging down the stronger one; treating self-judgment as ground truth introduces noise from evaluative hallucinations.

Goal: (a) Use both signals in a unified learnable framework; (b) avoid treating self-judgment as absolute truth by providing a "reliability estimate"; (c) ensure no significant inference overhead (running the LLM twice is too expensive).

Key Insight: The authors observe that "an LLM's self-judgment of its response is also a response": the judgment is itself generated, can be hallucinated, and its credibility can be estimated from intrinsic features. If the correctness \(L_j\) of the self-judgment can be estimated, the inherent logic rule (a correct "Yes" or an incorrect "No" means the response is true; an incorrect "Yes" or a correct "No" means the response is false) lets one propagate the judgment's reliability to the response assessment, creating a prediction path independent of the response detector.

Core Idea: Treat self-judgment \(O_j\) as "another response," run a meta-judgment detector \(D_j\) to estimate its correctness \(L_j\), and use a logic bridge (\(L_r = L_j\) if \(O_j = \text{"Yes"}\) else \(L_r = 1 - L_j\)) to translate \(D_j\)'s prediction into a prediction for the original response. Finally, use mutual learning to align \(D_r\) and \(D_j\) under logical constraints.
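
The logic bridge in the Core Idea can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name `bridge_to_response_label` is hypothetical, and it assumes \(L_j\) and \(L_r\) are expressed as probabilities of being correct (1 = correct), a convention this summary does not spell out.

```python
def bridge_to_response_label(o_j: str, l_j: float) -> float:
    """Translate the meta-judgment estimate into a response estimate.

    o_j: the verbal self-judgment, "Yes" or "No".
    l_j: estimated probability that the self-judgment is correct
         (assumed convention: 1.0 = correct, 0.0 = hallucinated).
    Returns the implied probability that the original response is correct.
    """
    if o_j == "Yes":
        # The LLM claims its response is right: the response is right
        # exactly when the judgment is right, so L_r = L_j.
        return l_j
    if o_j == "No":
        # The LLM claims its response is wrong: the response is right
        # exactly when the judgment is wrong, so L_r = 1 - L_j.
        return 1.0 - l_j
    raise ValueError(f"expected 'Yes' or 'No', got {o_j!r}")


if __name__ == "__main__":
    print(bridge_to_response_label("Yes", 0.75))
    print(bridge_to_response_label("No", 0.75))
```

A confident "No" that the meta-judgment detector trusts therefore yields a low correctness estimate for the response, which is the second, independent prediction path the summary refers to.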

Method

Overall Architecture

LaaB consists of three modules and a two-stage training strategy:

Module (a) Response Hallucination Modeling: Given a query \(Q_r\) and the generated response \(O_r\), internal features \(F_r \in \{H_r, P_r, A_r\}\) (hidden states / logits / attention) are extracted and fed into an MLP detector \(D_r\), outputting \(S_r = (S_{r,\text{hallu}}, S_{r,\text{real}})\).

Module (b) Self-Judgment Hallucination Modeling: An evaluation prompt \(Q_j\) is used to obtain a verbal judgment \(O_j \in \{\text{"Yes"}, \text{"No"}\}\). Similarly, internal features \(F_j\) from generating \(O_j\) are fed into an MLP detector \(D_j\) to output \(S_j\), estimating whether this judgment itself is correct (\(L_j \in \{0,1\}\)).

Module (c) Logic-Constrained Mutual Learning: The logic rule (Table 2) translates \(D_j\)'s prediction into a prediction for \(L_r\). Huber loss aligns the probability distributions of \(D_r\) and \(D_j\), using confidence weighting to prevent mutual degradation.

Two-Stage Training: Stage 1 uses round-robin asynchronous training with individual CE losses for \(D_r\) and \(D_j\). Stage 2 is joint fine-tuning including the logic loss. Inference only uses \(D_r\)—it has absorbed knowledge from \(D_j\) through mutual learning, requiring no additional self-judgment generation and zero added inference cost.

Key Designs

  1. Meta-Judgment—Treating Self-Judgment as a "Detectable Response":

    • Function: Solves the fundamental issue of treating verbal judgments as ground truth by estimating a reliability score \(L_j\) for each judgment \(O_j\), allowing the framework to identify if the LLM's self-evaluation is trustworthy.
    • Mechanism: Symmetrical to the response detector—extracts \(H_j\) (hidden states of the last token at the optimal layer), \(P_j\) (logits of the first token, using a contrastive design: \(P_j = P_{\text{yes}} \oplus (P_{\text{yes}} - P_{\text{no}})\) if \(O_j = \text{"Yes"}\), and vice-versa), and \(A_j\) (attention ratios across segments like Query, Response, etc.). These are fed to MLP \(D_j\) to predict \(L_j\).
    • Design Motivation: The core insight is that self-evaluation is a generation process, and intrinsic signals naturally reflect the uncertainty of that judgment. By treating self-judgment as a query-response pair, the framework retains semantic advantages while calibrating reliability via neural signals, bypassing self-preference bias.
  2. Logic Rule Bridge—Logical Label Constraints:

    • Function: Translates \(D_j\)'s prediction of \(L_j\) into a prediction for \(L_r\), providing two independent estimates for \(L_r\).
    • Mechanism: Based on logical facts: If \(O_j = \text{"Yes"}\) (LLM says response is true), then \(O_j\) being correct means the response is true (\(L_r = L_j\)). If \(O_j = \text{"No"}\) (LLM says response is false), \(O_j\) being correct means the response is false (\(L_r = 1 - L_j\)). This is implemented via \(\mathcal{L}_{\text{Logic}} = \mathcal{L}_{\text{Huber}}(S_{r,\text{hallu}}, S_{j,\text{hallu}})\) if \(O_j = \text{"Yes"}\), else \(\mathcal{L}_{\text{Huber}}(S_{r,\text{hallu}}, S_{j,\text{real}})\).
    • Design Motivation: This logic is not a heuristic but a logical identity. The innovation lies in linking two independently trained detectors through this identity, forcing logical consistency and introducing a cost-free weak supervision signal.
  3. Confidence-Weighted Mutual Learning + Gradient Normalization:

    • Function: Prevents "weak detectors from dragging down strong ones."
    • Mechanism: Calculates confidence-aware weights: \(\mathcal{L}_{\text{Logic}, r} = \log(1 + \frac{S_j(L_j)}{S_r(L_r)}) \cdot \mathcal{L}_{\text{Logic}}\) (if the peer is confident, listen to the peer). Gradient normalization dynamically balances CE and Logic losses: \(\alpha_* = \frac{\|\nabla_{\theta_*^{-1}} \mathcal{L}_{\text{CE}, *}\|_2}{\|\nabla_{\theta_*^{-1}} \mathcal{L}_{\text{Logic}, *}\|_2 + \epsilon}\). Total loss: \(\mathcal{L}_* = \mathcal{L}_{\text{CE}, *} + \alpha_* \mathcal{L}_{\text{Logic}, *}\).
    • Design Motivation: Standard Deep Mutual Learning assumes peers are equal. In hallucination detection, feature quality (\(D_r\) vs \(D_j\)) is inherently unequal. Confidence weighting makes the mutual learning robust to feature selection.
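
The polarity-aware Huber loss and the confidence weight for \(D_r\) described in designs 2 and 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, detector outputs are taken to be `(p_hallu, p_real)` tuples, labels use 1 = real and 0 = hallucinated, the Huber threshold `delta=1.0` is a placeholder (the summary does not give it), and `eps` is added only to avoid division by zero.

```python
import math


def huber(a: float, b: float, delta: float = 1.0) -> float:
    """Standard Huber loss between two scalars (delta is an assumed value)."""
    diff = abs(a - b)
    if diff <= delta:
        return 0.5 * diff * diff
    return delta * (diff - 0.5 * delta)


def logic_loss(s_r, s_j, o_j: str, delta: float = 1.0) -> float:
    """Align S_{r,hallu} with S_{j,hallu} for "Yes", with S_{j,real} for "No".

    s_r, s_j: (p_hallu, p_real) outputs of the response and meta-judgment detectors.
    """
    if o_j not in ("Yes", "No"):
        raise ValueError(f"expected 'Yes' or 'No', got {o_j!r}")
    target = s_j[0] if o_j == "Yes" else s_j[1]
    return huber(s_r[0], target, delta)


def confidence_weight(peer_conf: float, own_conf: float, eps: float = 1e-8) -> float:
    """log(1 + peer/own): larger when the peer is more confident in the true label."""
    return math.log(1.0 + peer_conf / (own_conf + eps))


def weighted_logic_loss_for_r(s_r, s_j, o_j: str, l_r: int, l_j: int) -> float:
    """L_{Logic,r} = log(1 + S_j(L_j) / S_r(L_r)) * L_Logic.

    S(L) is read as the probability a detector assigns to its ground-truth
    label, so index 0 is used for label 0 (hallu) and index 1 for label 1 (real).
    """
    weight = confidence_weight(peer_conf=s_j[l_j], own_conf=s_r[l_r])
    return weight * logic_loss(s_r, s_j, o_j)


if __name__ == "__main__":
    s_r = (0.6, 0.4)   # D_r leans (wrongly) towards "hallucinated"
    s_j = (0.1, 0.9)   # D_j is confident the "Yes" judgment is correct
    print(logic_loss(s_r, s_j, "Yes"))
    print(weighted_logic_loss_for_r(s_r, s_j, "Yes", l_r=1, l_j=1))
```

In the example, \(D_j\) assigns more probability to its true label than \(D_r\) does, so the weight exceeds \(\log 2\) and \(D_r\) is pulled more strongly towards its peer. A symmetric weight for \(D_j\) would swap the two confidences; that symmetry is an assumption here, since the summary only writes out the \(D_r\) case.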

Loss & Training

Stage 1: Round-robin asynchronous training of \(D_r\) and \(D_j\), each minimizing \(\mathcal{L}_{\text{CE}} + \alpha \mathcal{L}_{\text{Logic}}\) separately until convergence. Stage 2: Joint fine-tuning with \(\mathcal{L}_{\text{Joint}} = \mathcal{L}_{\text{CE}, r} + \mathcal{L}_{\text{CE}, j} + \alpha \mathcal{L}_{\text{Logic}}\). Inference uses only \(D_r\), so latency equals that of a single detector while \(D_r\) retains the knowledge of \(D_j\) distilled through the logic loss.
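
The balancing coefficient \(\alpha\) in these losses is the ratio of gradient norms given under Key Designs. A minimal sketch of that computation follows, assuming the two gradients have already been obtained as flat lists of floats for whichever parameter subset the formula's \(\theta_*^{-1}\) denotes. The helper names are hypothetical, and treating \(\alpha\) as a constant once computed is an assumption, not something the summary states.

```python
import math


def l2_norm(grad) -> float:
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(g * g for g in grad))


def grad_norm_alpha(grad_ce, grad_logic, eps: float = 1e-8) -> float:
    """alpha = ||grad CE||_2 / (||grad Logic||_2 + eps)."""
    return l2_norm(grad_ce) / (l2_norm(grad_logic) + eps)


def total_loss(ce: float, logic: float, alpha: float) -> float:
    """L = L_CE + alpha * L_Logic for one detector."""
    return ce + alpha * logic


if __name__ == "__main__":
    grad_ce = [3.0, 4.0]      # toy CE gradient, norm 5
    grad_logic = [0.3, 0.4]   # toy logic-loss gradient, norm 0.5
    alpha = grad_norm_alpha(grad_ce, grad_logic)
    print(alpha)
    print(total_loss(ce=0.7, logic=0.05, alpha=alpha))
```

Scaling by this ratio brings the logic term's gradient to roughly the same magnitude as the CE gradient, so neither objective dominates the update regardless of their raw scales.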

Key Experimental Results

Main Results

Setup: 4 datasets (TriviaQA / MMLU / NQ_Open / HaluEval) × 4 LLMs (Llama-3.1-8B/70B, Qwen-2.5-32B, Mistral-7B) × 8 baselines (Self-Judge, SAPLMA, Logits Lens, Lookback Lens, etc.).

| Dimension | Configuration | Description |
|---|---|---|
| Datasets | TriviaQA, MMLU, NQ_Open, HaluEval | Open-domain QA, multi-task, natural QA, hallucination-specific |
| LLMs | Llama-3.1, Qwen-2.5, Mistral-7B | Comparison across scales and families |
| Ours | Base Detector + LaaB | Applying the LaaB wrapper to trainable baselines |

In the results, "+LaaB" versions almost universally outperform their corresponding base versions across various (LLM, base detector) combinations, with the highest scores consistently belonging to LaaB-enhanced configurations.

Ablation Study

| Config | Role | Expected Impact |
|---|---|---|
| Full LaaB | Complete method | Reference performance |
| w/o meta-judgment | Trusts verbal \(O_j\) directly | Performance drops due to evaluative hallucinations |
| w/o logic rule | KL alignment without polarity | Errors propagate, especially in "No" cases |
| w/o confidence weighting | Equal-weight mutual learning | Weak features degrade strong ones |
| w/o stage 1 | No pre-training | Training instability |

Key Findings

  • LaaB as a Universal Wrapper: All 3 trainable baselines (SAPLMA, Logits Lens, Lookback Lens) improved when wrapped with LaaB, suggesting it is a general framework rather than a detector-specific tweak.
  • Self-Judge Weakness: Using verbal judgments alone is insufficient due to evaluative hallucinations; LaaB filters these through meta-detection and logical rules.
  • Zero Inference Overhead: The distillation through logic loss allows for single-detector inference, making it highly practical for deployment.
  • Robustness: Consistent gains across different LLM families and scales suggest that the logic bridge captures a general property of self-judgment rather than a quirk of one particular LLM.

Highlights & Insights

  • Perspective Shift: Viewing "self-judgment as a response" is a deep insight. It shifts self-evaluation from an "oracle" to "another hallucination-prone generation," enabling calibration via intrinsic features.
  • Logic as Weak Supervision: Encoding a logical identity into the loss via Huber loss provides a "free multi-view" signal that doesn't require extra labels and forces consistency across independent detectors.
  • Engineering Efficiency: Using mutual learning to ensure "multi-view training, single-view inference" is a clever way to improve accuracy without increasing serving latency.
  • Contrastive Feature Engineering: Using \((P_{\text{yes}} - P_{\text{no}})\) explicitly encodes relative confidence, a robust trick for binary judgment tasks.
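
The contrastive logit feature mentioned in the last highlight, \(P_j = P_{\text{yes}} \oplus (P_{\text{yes}} - P_{\text{no}})\) for a "Yes" judgment, can be illustrated with scalar logits. This is a sketch under assumptions: the function name is hypothetical, \(\oplus\) is read as concatenation, the two logits are treated as single floats, and the "No" branch mirrors the "Yes" branch, which is one reading of the summary's "vice-versa".

```python
def contrastive_logit_feature(p_yes: float, p_no: float, o_j: str) -> list:
    """Build [chosen logit, chosen logit - other logit] for the first output token.

    p_yes, p_no: logits of the "Yes" and "No" tokens at the first generated position.
    o_j: the verbal judgment actually produced, "Yes" or "No".
    """
    if o_j == "Yes":
        return [p_yes, p_yes - p_no]
    if o_j == "No":
        # Assumed mirror of the "Yes" case.
        return [p_no, p_no - p_yes]
    raise ValueError(f"expected 'Yes' or 'No', got {o_j!r}")


if __name__ == "__main__":
    print(contrastive_logit_feature(2.0, 0.5, "Yes"))
    print(contrastive_logit_feature(0.5, 2.0, "No"))
```

Under this reading the second component is always the margin of the chosen answer over the rejected one, so a small or negative margin flags a judgment the LLM itself was unsure about.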

Limitations & Future Work

  • Training Data Cost: Collecting self-judgment features for every training pair doubles the data collection effort during the training phase.
  • Binary Constraint: The logic rule is currently tied to Yes/No judgments; extension to graded or multi-class evaluations is non-trivial.
  • White-box Dependency: Requires access to hidden states/logits, making it inapplicable to closed-source APIs like GPT-4o.
  • Short-form Focus: Evaluation was primarily on short-answer QA; the efficacy in long-form generation (summarization, dialogue) needs further verification.

Comparison with Related Work

  • vs Intrinsic Detectors (SAPLMA, etc.): LaaB is orthogonal and acts as an enhancement wrapper.
  • vs Self-Judge (Kadavath 2022): LaaB introduces a "reliability gate" to correct the evaluative hallucinations found in direct verbal judgments.
  • vs Deep Mutual Learning: Adapts the paradigm for unequal peers using Huber loss and confidence weighting.

Rating

  • Novelty: ⭐⭐⭐⭐ (Self-judgment as a hallucination process is highly original).
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Broad coverage across models and datasets).
  • Writing Quality: ⭐⭐⭐⭐ (Clear logical progression and notation).
  • Value: ⭐⭐⭐⭐ (Zero inference cost makes it highly deployable).