Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models
Conference: ICML 2026
arXiv: 2604.16565
Code: TBD
Area: LLM Reasoning / Diffusion Language Models / Self-Verification
Keywords: Diffusion Language Models, Manifold Geometry, Bidirectional Consistency, Reasoning Self-Verification, Reinforcement Learning Alignment
TL;DR
Starting from the geometric view that "effective reasoning trajectories are stable attractors on the learned distribution," this paper proposes BMC (Bidirectional Manifold Consistency), an unsupervised, training-free metric. BMC runs a "forward re-masking + backward few-step reconstruction" cycle on the output of a diffusion language model (dLLM) and scores the output by how stably it is reconstructed. The same score supports error diagnosis, inference-time rejection sampling, and dense RL rewards, and it outperforms the Confidence, Self-Consistency, and Self-Evaluation baselines across four reasoning benchmarks.
Background & Motivation
Background: Diffusion Large Language Models (dLLMs), represented by LLaDA and Dream, replace the strict left-to-right generation of autoregressive (AR) models with full bidirectional attention and an iterative denoising process. They are considered better suited to System-2 reasoning of the "global planning + iterative refinement" kind.
Limitations of Prior Work: Verifying whether a generated trajectory is actually "correct" remains an open problem. Most mainstream approaches use external signals inherited from the AR era: PRMs require expensive process-level annotation; Self-Consistency requires large sample batches and often fails collectively on difficult tasks; Self-Evaluation prompts often degenerate into random guessing across domains. These methods treat the model as a black box and completely ignore the probabilistic geometric structure of the diffusion process itself.
Key Challenge: The denoising process of dLLMs is reversible and bidirectional, so in principle the model itself already holds information about "how stable this trajectory is," but existing verification paradigms fail to extract it. In other words, costly external verifiers are used only because the internal "terrain" the model has already learned is overlooked.
Goal: To restore "correctness" as a measurable geometric property while satisfying: (1) No ground-truth required; (2) No additional training; (3) Computational overhead significantly lower than resampling; (4) A single signal applicable across the "diagnosis → inference → alignment" pipeline.
Key Insight: The authors assume that valid solutions lie on the high-density manifold of the learned distribution and act as stable attractors of the denoising operator \(\mathcal{T}_\theta\), whereas incorrect solutions deviate from the manifold. Thus, "whether it can be faithfully reconstructed after perturbation" acts as a proxy for manifold distance.
Core Idea: Quantify stability through a "forward re-masking + backward few-step reconstruction" loop. If \(\hat{x}_0 \approx x_0\), then \(x_0\) lies on the manifold and the reasoning is likely reliable; otherwise the trajectory has drifted off the manifold, which signals a likely error.
Method
Overall Architecture
The input is a complete sequence \(x_0\) generated by the dLLM, and the output is its geometric stability score \(S_{\text{BMC}}(x_0)\). The process has three steps: (1) Forward perturbation: partially re-mask \(x_0\) into \(\tilde{x}_t\) at masking ratio \(\gamma\); (2) Backward reconstruction: run \(K\) steps (\(K{=}16 \ll T{=}1024\)) of truncated denoising from \(\tilde{x}_t\) with the same dLLM denoiser \(p_\theta\) to obtain \(\hat{x}_0\); (3) Consistency scoring: compute \(S_{\text{BMC}}\) as a weighted sum of six similarity metrics between \(x_0\) and \(\hat{x}_0\). The score then drives three downstream uses: error diagnosis (used directly as a score), Manifold-Guided Rejection Sampling (MGRS), and a dense reward for RL.
Theoretically, the authors prove that when the discrepancy measure \(\mathcal{D}\) is the KL divergence, BMC is equivalent to an estimate of a reweighted ELBO (Prop. 3.2), and that it remains consistent with the marginal ELBO when \(\mathcal{D}\) is a Csiszár \(f\)-divergence (Prop. 3.3). In continuous embedding space, BMC can be relaxed to semantic neighbors through Lipschitz continuity (Prop. 3.4). Geometrically, the reconstruction residual upper-bounds the manifold distance (Prop. 3.5): \(\|z_0 - z^*\| \le \frac{1}{1-\kappa}\|z_0 - \mathcal{T}_\theta(z_0)\|\), where, reading the notation in the standard way, \(z^*\) is the nearby fixed point of \(\mathcal{T}_\theta\) and \(\kappa < 1\) is its contraction constant.
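A bound of this form follows from a short fixed-point argument. The sketch below assumes that \(\mathcal{T}_\theta\) is a \(\kappa\)-contraction with \(\kappa < 1\) in a neighborhood containing \(z_0\) and a fixed point \(z^* = \mathcal{T}_\theta(z^*)\); it is the textbook derivation and not necessarily the proof given in the paper.

\[
\begin{aligned}
\|z_0 - z^*\|
  &\le \|z_0 - \mathcal{T}_\theta(z_0)\| + \|\mathcal{T}_\theta(z_0) - \mathcal{T}_\theta(z^*)\|
  && \text{(triangle inequality, } z^* = \mathcal{T}_\theta(z^*)\text{)} \\
  &\le \|z_0 - \mathcal{T}_\theta(z_0)\| + \kappa\,\|z_0 - z^*\|
  && \text{(contraction)} \\
\Longrightarrow\quad
(1-\kappa)\,\|z_0 - z^*\| &\le \|z_0 - \mathcal{T}_\theta(z_0)\|
\quad\Longrightarrow\quad
\|z_0 - z^*\| \le \frac{1}{1-\kappa}\,\|z_0 - \mathcal{T}_\theta(z_0)\|.
\end{aligned}
\]

Under this reading, a small one-step residual certifies that \(z_0\) is close to the attractor, and the certificate weakens as \(\kappa \to 1\).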
Key Designs
- BMC Estimator (Forward Re-masking + Truncated Backward Reconstruction + Composite Similarity):
    - Function: Approximates \(\mathcal{R}_\mathcal{D}(x_0) := -\mathbb{E}_{t, \tilde{x}_t}[\mathcal{D}(x_0, \hat{x}_0(\tilde{x}_t))]\) via a "perturbation-recovery" loop, converting abstract manifold stability into a computable scalar score.
    - Mechanism: Forward Bernoulli masking is applied per token as \(\tilde{x}_t^{(i)} = m_i x_0^{(i)} + (1-m_i)\texttt{[MASK]}\), where \(m_i \in \{0,1\}\) is a keep indicator and each token is masked with probability \(\gamma\) (\(\gamma{=}0.9\)). Backward denoising runs only \(K{=}16\) steps instead of the full \(T\), which is the key to efficiency. The score \(S_{\text{BMC}} = \sum_k \lambda_k s_k\) is a weighted sum of six complementary metrics: Token Accuracy \(s_{\text{tok}}\) (local convergence), Semantic Similarity \(s_{\text{sem}}\) (allows paraphrasing), Number Retention \(s_{\text{num}}\) (key nodes in math chains), Final Answer Match \(s_{\text{ans}}\) (endpoint convergence), Character Similarity, and Intrinsic Confidence.
    - Design Motivation: Single metrics have blind spots: pure likelihood is too strict and misjudges paraphrases as errors, while pure answer matching ignores the stability of the reasoning chain. Combining six metrics keeps BMC close to the ELBO estimate while tolerating semantically equivalent rewrites. Truncating to \(K\) steps is the engineering choice that keeps verification overhead at a fraction of generation cost, so that verification does not turn into another form of resampling.
- Manifold-Guided Rejection Sampling (MGRS) Adaptive Inference:
    - Function: Uses BMC to upgrade "fixed-budget Best-of-N" to rejection sampling that "dynamically allocates compute based on problem difficulty."
    - Mechanism: Immediately after each \(x_0 \sim p_\theta(\cdot|q)\), \(S = S_{\text{BMC}}(x_0)\) is calculated. If \(S > \tau\) (\(\tau{=}0.75\)), the result is returned immediately ("stable terrain, no further sampling needed"); otherwise, sampling continues up to \(N_{\max}{=}10\) times. If no candidate passes the threshold, the one with the highest historical score is returned. Simple problems stop after \(\sim\)2–3 samples on average, while hard problems (e.g., MATH) naturally take \(\sim\)5–6 samples.
    - Design Motivation: Self-Consistency uses majority voting regardless of difficulty, wasting compute on easy tasks and failing collectively on hard ones. Best-of-N(Confidence) ranks by token probability, but confidence is largely unrelated to reasoning correctness. BMC provides geometric signals rather than statistical ones, naturally coupling budget with problem difficulty.
- Gated Geometric Alignment Reward:
    - Function: Injects BMC into RL training so that the dLLM internalizes manifold stability in its policy rather than relying solely on inference-time sampling.
    - Mechanism: The reward is defined as \(r(x_0) = \mathbb{I}(y_{\text{pred}} = y^*) \cdot [r_{\text{base}} + \alpha_t \cdot S_{\text{BMC}}(x_0)]\). Multiplicative gating ensures that incorrect chains always receive zero reward, which prevents reward hacking. The weight \(\alpha_t\) is ramped up linearly, \(\alpha_t = \alpha_{\min} + (\alpha_{\max} - \alpha_{\min}) \cdot t/T\), so that training emphasizes answers early and geometry later (here \(t/T\) is read as training progress, not the denoising step count \(T\) used above). Policy optimization uses Sandwiched Policy Gradient (SPG), which replaces biased ELBO approximations with sandwiched evidence bounds when estimating gradients.
    - Design Motivation: Standard outcome RL treats all "correct answers" as equal, which might reinforce fragile "lucky" chains. BMC turns sparse outcome rewards into dense "geometric quality" rewards. Without correctness gating, the model might drift toward being "consistent but wrong," so the multiplicative gate establishes a strict hierarchy: correct first, then stable.
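The BMC cycle and the MGRS loop described above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the authors' implementation: `make_toy_denoiser` is a hypothetical stand-in for the \(K\)-step truncated denoising of a real dLLM, only token accuracy is used in place of the six-metric composite, and all function names are invented for this example. The defaults \(\gamma{=}0.9\), \(N_{\text{BMC}}{=}4\), \(\tau{=}0.75\), and \(N_{\max}{=}10\) are taken from the text.

```python
import random
from typing import Callable, List, Tuple

MASK = "[MASK]"
Tokens = List[str]


def remask(x0: Tokens, gamma: float, rng: random.Random) -> Tokens:
    """Forward step: mask each token independently with probability gamma."""
    return [MASK if rng.random() < gamma else tok for tok in x0]


def token_accuracy(x0: Tokens, x_hat: Tokens) -> float:
    """Fraction of positions where the reconstruction matches the original."""
    if not x0:
        return 0.0
    return sum(a == b for a, b in zip(x0, x_hat)) / len(x0)


def bmc_score(x0: Tokens, reconstruct: Callable[[Tokens], Tokens],
              gamma: float = 0.9, n_samples: int = 4, seed: int = 0) -> float:
    """Average reconstruction agreement over n_samples re-masking rounds.

    Simplification: a single metric (token accuracy) replaces the weighted
    sum of six metrics described in the text.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x_hat = reconstruct(remask(x0, gamma, rng))
        total += token_accuracy(x0, x_hat)
    return total / n_samples


def mgrs(sample: Callable[[], Tokens], score: Callable[[Tokens], float],
         tau: float = 0.75, n_max: int = 10) -> Tuple[Tokens, float, int]:
    """Draw candidates until one scores above tau; else return the best seen."""
    if n_max < 1:
        raise ValueError("n_max must be at least 1")
    best, best_score = None, float("-inf")
    for n in range(1, n_max + 1):
        x0 = sample()
        s = score(x0)
        if s > tau:
            return x0, s, n          # stable candidate: stop early
        if s > best_score:
            best, best_score = x0, s
    return best, best_score, n_max   # budget exhausted: fall back to best


def make_toy_denoiser(attractor: Tokens) -> Callable[[Tokens], Tokens]:
    """Hypothetical denoiser: fills every mask from one fixed sequence.

    Assumes the input has the same length as `attractor`.
    """
    def reconstruct(x_t: Tokens) -> Tokens:
        return [attractor[i] if tok == MASK else tok
                for i, tok in enumerate(x_t)]
    return reconstruct


if __name__ == "__main__":
    good = "2 + 3 = 5".split()
    bad = "2 * 3 = 7".split()
    denoise = make_toy_denoiser(good)
    print("on-attractor score:", bmc_score(good, denoise))
    print("off-attractor score:", bmc_score(bad, denoise))
```

In this toy setting a sequence that coincides with the denoiser's attractor is always reconstructed exactly, while a sequence that disagrees with it loses every disagreeing token that gets masked. With a real dLLM, `reconstruct` would wrap the model's truncated denoising call and `sample` would wrap generation for a fixed prompt.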
Loss & Training
BMC itself is training-free and only runs inference with the pre-trained dLLM. In the RL phase, the SPG framework is used for alignment on LLaDA-8B with hyperparameters \(r_{\text{base}} = 1.5\), \(\alpha_t \in [0.5, 1.0]\), \(K{=}16\), \(\gamma{=}0.9\), and \(N_{\text{BMC}}{=}4\) ensemble samples. Semantic similarity is computed with all-MiniLM-L6-v2.
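To make the reward concrete, here is a small sketch of the gated reward with the hyperparameters listed above (\(r_{\text{base}} = 1.5\), \(\alpha\) ramping from 0.5 to 1.0). The function names are hypothetical, `step / total_steps` is assumed to be training progress, and `additive_reward` is one plausible ungated form included only for contrast; the summary does not give the exact additive variant used in the ablation.

```python
def alpha_schedule(step: int, total_steps: int,
                   alpha_min: float = 0.5, alpha_max: float = 1.0) -> float:
    """Linear ramp from alpha_min to alpha_max over training (assumed meaning of t/T)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return alpha_min + (alpha_max - alpha_min) * progress


def gated_reward(is_correct: bool, s_bmc: float, step: int,
                 total_steps: int, r_base: float = 1.5) -> float:
    """r = 1[correct] * (r_base + alpha_t * S_BMC): wrong answers always get 0."""
    if not is_correct:
        return 0.0
    return r_base + alpha_schedule(step, total_steps) * s_bmc


def additive_reward(is_correct: bool, s_bmc: float, step: int,
                    total_steps: int, r_base: float = 1.5) -> float:
    """Hypothetical ungated variant: the stability bonus is paid even when wrong."""
    base = r_base if is_correct else 0.0
    return base + alpha_schedule(step, total_steps) * s_bmc


if __name__ == "__main__":
    # A stable but wrong chain late in training.
    print("gated:   ", gated_reward(False, 0.95, step=900, total_steps=1000))
    print("additive:", additive_reward(False, 0.95, step=900, total_steps=1000))
```

The contrast shows the design choice: under the gate a wrong chain earns nothing however stable it is, whereas the additive form still pays it the stability bonus.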
Key Experimental Results
Main Results: Unsupervised Error Diagnosis (AUROC)
| Model | Method | GSM8K | MATH | ARC-C | GPQA |
|---|---|---|---|---|---|
| LLaDA-8B | Model Confidence | 0.753 | 0.713 | 0.550 | 0.482 |
| LLaDA-8B | Self-Evaluation | 0.549 | 0.558 | 0.546 | 0.529 |
| LLaDA-8B | Self-Consistency | 0.872 | 0.803 | 0.735 | 0.539 |
| LLaDA-8B | BMC (Ours) | 0.893 | 0.820 | 0.777 | 0.678 |
| Dream-7B | Self-Consistency | 0.684 | 0.675 | 0.708 | 0.527 |
| Dream-7B | BMC (Ours) | 0.898 | 0.825 | 0.804 | 0.605 |
On LLaDA, BMC's AUROC advantage over SC grows from +0.021 on GSM8K (0.893 vs. 0.872) to +0.139 on GPQA (0.678 vs. 0.539): the harder the task, the more the "consensus assumption" of SC fails and the more valuable the geometric signal becomes. On Dream-7B, where sampling diversity is poor, SC performs similarly to confidence (0.527–0.708 in the table), while BMC stays at 0.80 or above on GSM8K, MATH, and ARC-C and falls to 0.605 only on GPQA, suggesting it is less sensitive to the base model.
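For readers less familiar with the metric, AUROC here can be read as the probability that a randomly chosen correct solution receives a higher score than a randomly chosen incorrect one. The sketch below computes it by direct pair counting; it assumes correct solutions are the positive class, which the summary does not state explicitly, and the function name is illustrative.

```python
from typing import Sequence


def auroc(scores: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Pairwise AUROC: P(score of correct > score of incorrect), ties count 0.5."""
    pos = [s for s, y in zip(scores, is_correct) if y]
    neg = [s for s, y in zip(scores, is_correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Verifier scores for four sampled solutions and their correctness labels.
    print(auroc([0.9, 0.8, 0.3, 0.2], [True, True, False, False]))
```

A value of 0.5 corresponds to a score that carries no information about correctness, which is why the Self-Evaluation rows near 0.55 are described as close to random guessing.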
MGRS Inference and Alignment Results
| Model | Task | Standard | Self-Cons. | Best-of-N(Conf) | MGRS |
|---|---|---|---|---|---|
| LLaDA | GSM8K | 70.5 | 74.3 | 70.7 | 79.5 |
| LLaDA | MATH | 24.4 | 24.2 | 23.8 | 27.6 |
| LLaDA | ARC-C | 83.2 | 86.1 | 83.3 | 87.2 |

| Alignment Method (LLaDA, len=512) | GSM8K | MATH | ARC-C | GPQA |
|---|---|---|---|---|
| SFT | 80.4 | 34.8 | 78.1 | 26.8 |
| Outcome RL | 83.5 | 37.2 | 82.2 | 30.8 |
| Geometric Align (Ours) | 85.8 | 41.6 | 85.2 | 34.4 |
Key Findings
- Geometric Meaning of \(K\) and \(\gamma\): AUROC saturates at \(K{=}16\) (\(0.840 \to 0.873\)), indicating that truncated reconstruction is sufficient to detect the local contractivity of \(\mathcal{T}_\theta\). The masking rate \(\gamma\) shows an inverted U-shape, peaking at \(\gamma{=}0.9\) (0.889); \(\gamma{=}1.0\) drops sharply to 0.712—a few "geometric anchors" must be retained to measure stability, otherwise it becomes another unconditional resampling.
- Multiplicative Gating is Indispensable: If \(r\) is changed to an additive form, "consistent but wrong" chains still collect the stability bonus and mislead the model. This geometric reward hacking is the counterpart of the weakness of outcome-only RL, which can reinforce correct but fragile chains.
- Compute Adaptivity: MGRS draws only \(\sim\)2.2–3.3 samples on average on GSM8K, rising to \(\sim\)5.4–5.8 on MATH. The geometric signal lets the budget track difficulty, whereas Best-of-N(Conf) even shows negative sample efficiency on MATH (23.8 vs. 24.4 for standard decoding).
Highlights & Insights
- "Reconstruction Stability = Manifold Distance" is a Clean Bridge: Translating the verification problem into a geometric contractivity problem provides both the hard upper bound of Prop. 3.5 and a lightweight engineering cycle of \(K{=}16\) steps, achieving a rare balance between theory and usability.
- Single Signal Across Three Tasks: Diagnosis, inference-time sampling, and RL dense rewards all share the same BMC, avoiding a toolbox of fragmented items like "PRM for diagnosis, SC for inference, outcome for alignment." This "one-size-fits-all" is the dividend of mining the intrinsic probabilistic structure.
- Transferable to any Masked Denoising Model: As long as there is a "masking-denoising" bidirectional process, BMC can be directly applied—tasks like vision or protein discrete diffusion could immediately reuse this same geometric stability criterion.
Limitations & Future Work
- BMC assumes the dLLM has already learned "Correct Answer = High-Density Manifold." For undertrained or severely misaligned base models, the attractor structure itself may not hold, leading to distorted geometric signals.
- The six weights \(\lambda_k\) of the composite score \(S_{\text{BMC}}\) are mostly set manually in the paper without systematic discussion of adaptive weighting; these weights likely need adjustment when transferring across domains (e.g., code, medical).
- Experiments were primarily on LLaDA-8B and Dream-7B; the scale has not reached the 70B level. When the base model is already nearly perfect (as on ARC-C for Dream), BMC's improvement margin narrows, and scaling behavior requires larger-scale verification.
- Geometric stability \(\neq\) factual correctness: BMC can reject "drifting trajectories" but remains powerless against "on-manifold but factually incorrect" hallucinations (e.g., common sense errors with unified narratives).
Related Work & Insights
- vs PRM / Generative Verifier: These rely on external labeling or additional judging models, while BMC is a complete white-box utilization of the dLLM's own denoising operator. The advantage is zero annotation and no extra models; the disadvantage is it only applies to architectures with bidirectional processes like masked diffusion.
- vs Self-Consistency: SC is statistical consensus, while BMC is geometric stability. SC "fails collectively" on hard problems; BMC is more robust in sparse correct mass scenarios since it examines the stability of single trajectories.
- vs RemeDi / CDLM: RemeDi uses dual-stream re-masking of low-confidence tokens and CDLM trains specialized error-correction heads, both being "regeneration-oriented." BMC explicitly formalizes the same bidirectional dynamics into a verification criterion that can be used for diagnosis and alignment without changing training objectives.
- vs TraceRL / diffu-GRPO: While they implicitly optimize trajectory likelihood approximations, BMC provides explicit dense geometric signals and cooperates with SPG using sandwiched bounds to solve the problem of intractable likelihood gradient estimation in dLLMs. Geometrically dense rewards and unbiased gradient estimation form a complementary pair.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First to formalize dLLM bidirectional dynamics as a "manifold stability" verification criterion, bridging diagnosis, inference, and alignment.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers two dLLM suites × four reasoning benchmarks × three downstream tasks, accompanied by \(K\)/\(\gamma\) sensitivity and component ablation; lacks larger-scale and cross-domain (code/medical) verification.
- Writing Quality: ⭐⭐⭐⭐⭐ Progresses clearly from geometric intuition to four propositions and algorithm pseudocode, with strong theoretical-methodological flow.
- Value: ⭐⭐⭐⭐⭐ Provides a "training-free + multi-purpose" intrinsic verification baseline for the dLLM paradigm, serving as a directly usable tool for inference-time scaling and RL post-training of diffusion language models.