Perceptual Flow Network for Visually Grounded Reasoning

Conference: ICML 2026
arXiv: 2605.02730
Code: None
Area: Multimodal VLM / Reinforcement Learning / Visual Reasoning
Keywords: Visually Grounded Reasoning, GFlowNet, Sub-TB, Variational Reinforcement Learning, LVLM Hallucination

TL;DR

Abandoning the traditional RLVR approach of "hard supervision with precise expert bounding boxes," PFlowNet models the perceptual behavior itself as a structured Perceptual Flow latent variable. It approximates the ideal reasoning-oriented posterior with a variational distribution \(p_\theta(Z|X)\), and is trained using Sub-TB variational RL, multi-dimensional rewards, and Vicinal Geometric Shaping. As a result, the 8B Qwen3-VL achieves a new SOTA of 90.6% on V* Bench and 67.0% on MME-RealWorld-lite.

Background & Motivation

Background: To mitigate language bias and hallucination in LVLMs, recent methods (look-twice, VGR, grounded thinking, etc.) use RLVR to distill geometric priors from visual experts (e.g., GroundingDINO) into LVLMs, enforcing a "first localize key regions, then answer" reasoning process.

Limitations of Prior Work: The authors conducted a critical probing experiment on V* using Qwen2.5-VL—expanding expert-annotated boxes isotropically to obtain geometric priors with different IoUs, then feeding the corresponding crops to the LVLM and measuring answer accuracy. The counterintuitive result: the most precise expert box is not the best reasoning evidence; there exists a sweet spot. The reason is that visual experts are designed for object detection, prioritizing geometric precision while ignoring the "context needed for reasoning." Overly tight boxes cause a "tunnel vision" effect, removing essential peripheral cues.
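To make the probing setup concrete, the sketch below scales a box about its centre until it reaches a chosen IoU with the original. It is only an illustration of isotropic expansion, not the authors' code: it assumes boxes are (x1, y1, x2, y2) tuples, ignores clipping at the image border, and the names expand_box and iou are hypothetical.

```python
import math


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def expand_box(box, target_iou):
    """Scale a box about its centre so IoU(original, expanded) equals target_iou.

    Assumption: no clipping to the image, so the original box lies fully
    inside the expanded one and IoU = area(original) / area(expanded).
    """
    if not 0.0 < target_iou <= 1.0:
        raise ValueError("target_iou must be in (0, 1]")
    x1, y1, x2, y2 = box
    # Both sides grow by the same factor, so the area grows by scale**2.
    scale = 1.0 / math.sqrt(target_iou)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
```

For instance, a target IoU of 0.25 doubles both side lengths, because the original box then covers one quarter of the expanded area.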

Key Challenge: Existing VGR equates "expert geometric precision" with "quality of reasoning evidence," forcing LVLMs to strictly align with expert boxes. However, the truly useful "golden evidence" for reasoning is instance-specific, and heuristic expansion cannot precisely capture it. This is a fundamental misalignment—optimizing for the wrong target.

Goal: (1) Formalize VGR as a distribution modeling problem over latent visual trajectories \(Z\); (2) Construct a family of learnable perceptual trajectory representations, decoupling "geometric precision" from "reasoning utility" in the training objective; (3) Use variational RL to encourage exploration of "reasoning-friendly" perceptual behaviors while retaining geometric reliability.

Key Insight: Rather than using expert boxes as hard constraints, treat the reasoning trajectory \(Z\) as a latent variable, with a model-parameterized variational distribution \(p_\theta(Z|X)\) approximating the "ideal VGR posterior" \(P_V(Z|X,Y)\); the latter only requires \(Z\) to fall within a \(\sigma\)-vicinity \(\mathcal{S}_V\) centered on the golden evidence \(G\). The expert box \(E\) serves only as a "vicinal reference," not a hard target.

Core Idea: Model perception as a Perceptual Flow (planning state + a sequence of grounded perceptual states), use a Sub-Trajectory Balance (GFlowNet-style) variational objective for dense supervision, and design "multi-dimensional rewards + geometric shaping \(\omega_\lambda\) active only outside the expert vicinity" to achieve "sufficient exploration without overstepping boundaries."

Method

Overall Architecture

PFlowNet divides the LVLM workflow into two decoupled stages: (i) Flow Generation: The model samples a Perceptual Flow \(Z = (z_0 \to z_1 \to \dots \to z_K)\) from \(p_\theta(Z|X)\), where \(z_0\) is a planning state wrapped in <analyze>...</analyze>, and \(z_{\ge 1} = \langle r_k, c_k\rangle\) are wrapped in <localize>...</localize>, each containing an RoI box (relative coordinates 0–1000) and a descriptive caption; (ii) Flow-Guided Reasoning: The model generates the final answer \(Y\) autoregressively based on \(Z\) and the cropped visual evidence \(I_{RoI}\), with the joint distribution factorized as \(p_\theta(Y, Z|X) = p_\theta(Z|X) p_\theta(Y|Z, \langle X, I_{RoI}\rangle)\). Training is two-stage: first, SFT on synthetic perceptual flow data \((X, Z_s)\) for cold start, then variational RFT on \((X, Y, E)\) to optimize \(p_\theta(Z|X)\) towards \(P_V\).
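As a rough illustration of the flow structure described above, the following sketch parses a generated flow into its planning state and perceptual states, then maps a relative 0–1000 box to pixels. The exact serialization inside the localize tags is not given in this summary, so the "[x1, y1, x2, y2] caption" layout, the PerceptualState class, and the function names are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class PerceptualState:
    box: tuple      # (x1, y1, x2, y2) in relative 0-1000 coordinates
    caption: str


# Hypothetical serialization: <localize>[x1, y1, x2, y2] caption</localize>
ANALYZE_RE = re.compile(r"<analyze>(.*?)</analyze>", re.DOTALL)
LOCALIZE_RE = re.compile(
    r"<localize>\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\s*(.*?)</localize>",
    re.DOTALL,
)


def parse_flow(text):
    """Split a flow string into (planning text z_0, list of perceptual states)."""
    plan_match = ANALYZE_RE.search(text)
    if plan_match is None:
        raise ValueError("flow has no <analyze> planning state")
    plan = plan_match.group(1).strip()
    states = []
    for m in LOCALIZE_RE.finditer(text):
        box = tuple(int(v) for v in m.groups()[:4])
        states.append(PerceptualState(box, m.group(5).strip()))
    return plan, states


def to_pixels(box, width, height):
    """Convert a relative 0-1000 box to integer pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        round(x1 * width / 1000),
        round(y1 * height / 1000),
        round(x2 * width / 1000),
        round(y2 * height / 1000),
    )
```

The pixel box returned by to_pixels is what a cropping step would use to produce the evidence \(I_{RoI}\) for the reasoning stage.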

Key Designs

Perceptual Flow + Sub-TB Variational Objective:
- Function: Discretizes "where the model looks and what it sees" into structured trajectories and provides dense supervision for each sub-segment. This addresses two weaknesses of PPO-style objectives: the reward arrives only at the end of the episode, and gradient variance is high.
- Mechanism: Defines Perceptual Flow \(Z = (z_0, z_1, \dots, z_K)\), where the planning state \(z_0\) is natural language and each perceptual state \(z_k = \langle r_k, c_k\rangle\) is an RoI box plus a caption; special tokens such as <analyze> and <localize> explicitly segment the flow. Introduces Sub-Trajectory Balance from GFlowNet: for any sub-segment \(z_{i:j}\), \(\mathcal{F}(z_i)\,\mathcal{T}_F(z_{i:j}) = \mathcal{F}(z_j)\,\mathcal{T}_B(z_{j:i})\). This leads to the vRFT objective \(\mathcal{L}_{vRFT}(\theta)\) (Eq. 2 in the paper), whose loss penalizes the squared log-ratio between the two sides of the balance condition, summed over sub-segments.
- Design Motivation: (a) Explicit structure allows rewards to be defined on each sub-trajectory; (b) Sub-TB provides dense constraints ("every sub-segment must balance"), more stable than PPO; (c) Decoupling perception and reasoning allows independent optimization of \(p_\theta(Z|X)\) without contaminating LLM reasoning parameters.
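The balance condition above can be turned into a loss by squaring the log-ratio of its two sides for every sub-segment. The sketch below is a minimal, unweighted version under stated assumptions: the paper's Eq. 2 is not reproduced in this summary, so uniform weights over sub-segments are assumed, and the backward log-probabilities default to zero on the assumption that an autoregressively generated flow has a single parent per state. The function name sub_tb_loss is hypothetical.

```python
def sub_tb_loss(log_flow, log_pf, log_pb=None):
    """Sum of squared Sub-TB residuals over all sub-segments z_{i:j}, i < j.

    log_flow[k]: log F(z_k) for k = 0..K              (length K + 1)
    log_pf[k]:   log forward prob of z_k -> z_{k+1}   (length K)
    log_pb[k]:   log backward prob of z_{k+1} -> z_k  (length K)
                 Assumed 0 (probability 1) when omitted.
    """
    num_steps = len(log_pf)
    if len(log_flow) != num_steps + 1:
        raise ValueError("log_flow must have one more entry than log_pf")
    if log_pb is None:
        log_pb = [0.0] * num_steps
    if len(log_pb) != num_steps:
        raise ValueError("log_pb must have the same length as log_pf")

    # Prefix sums make the transition term of each sub-segment O(1).
    cum_f, cum_b = [0.0], [0.0]
    for f, b in zip(log_pf, log_pb):
        cum_f.append(cum_f[-1] + f)
        cum_b.append(cum_b[-1] + b)

    loss = 0.0
    for i in range(num_steps):
        for j in range(i + 1, num_steps + 1):
            lhs = log_flow[i] + (cum_f[j] - cum_f[i])  # log F(z_i) T_F(z_{i:j})
            rhs = log_flow[j] + (cum_b[j] - cum_b[i])  # log F(z_j) T_B(z_{j:i})
            loss += (lhs - rhs) ** 2
    return loss
```

A trajectory whose flows and transition probabilities balance on every sub-segment yields zero loss; any mismatch on a short segment is penalized directly, which is where the dense supervision comes from.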
Multi-dimensional Reward (Contrastive Visual + Information Gain):
- Function: Ensures the reward function simultaneously measures "perceptual accuracy" and "reasoning utility," avoiding sole focus on geometric IoU.
- Mechanism: For sub-flow \(z_{0:k}\), defines \(R(z_{0:k}\top) = \left(\prod_{i=1}^k \frac{p_\phi^+(z_i)}{p_\phi^-(z_i)}\right) p_\phi(Y \mid z_{0:k}\top, X)\), where \(p_\phi^+(z_i) = p_\phi(c_i \mid I_{r_i})\) is the visual likelihood of the caption within the cropped region, \(p_\phi^-(z_i) = p_\phi(c_i \mid I \setminus I_{r_i})\) is the likelihood in the complement region, and \(p_\phi\) is a frozen reward model initialized with the policy. The contrastive term \(p_\phi^+/p_\phi^-\), in expectation over the trajectory, is interpretable as reverse-KL distillation: \(\mathbb{E}[\sum \log(p_\phi^+/p_\phi^-)] = \sum [D_{KL}(q_\theta^i \| p_\phi(\cdot|I\setminus I_{r_i})) - D_{KL}(q_\theta^i \| p_\phi(\cdot|I_{r_i}))]\), encouraging captions to capture privileged information in the crop and avoid noise in the complement; the information gain term \(p_\phi(Y|z_{0:k}\top, X)\) ensures the chosen trajectory truly contributes to the final answer \(Y\).
- Design Motivation: Embeds "vision-grounded" and "reasoning-oriented" as two independent reward constraints, which naturally suppresses reward hacking: neither precise boxes with generic captions nor the reverse combination can earn a high reward.
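In log space the reward above becomes a sum, which is how it would typically be computed to avoid underflow. The sketch assumes the three kinds of log-likelihood have already been obtained from the frozen reward model \(p_\phi\) (for example as summed token log-probabilities, which is an assumption here); the function name log_reward is hypothetical.

```python
def log_reward(log_p_pos, log_p_neg, log_p_answer):
    """log R = sum_i [log p+(z_i) - log p-(z_i)] + log p(Y | z_{0:k}, X).

    log_p_pos[i]: log-likelihood of caption c_i given the crop I_{r_i}
    log_p_neg[i]: log-likelihood of caption c_i given the complement region
    log_p_answer: log-likelihood of the answer Y given the sub-flow and X
    """
    if len(log_p_pos) != len(log_p_neg):
        raise ValueError("need one positive and one negative term per state")
    # Contrastive visual term: positive only when the caption is better
    # explained by the crop than by the rest of the image.
    contrastive = sum(p - n for p, n in zip(log_p_pos, log_p_neg))
    # Information-gain term: how well the sub-flow supports the answer.
    return contrastive + log_p_answer
```

A generic caption that is equally likely inside and outside the crop contributes zero to the contrastive term, so it cannot raise the reward on its own.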
Vicinal Geometric Shaping:
- Function: Retains expert priors as a "safety barrier" to prevent out-of-bounds exploration, but does not force the model to 100% mimic the expert, allowing free discovery of higher-utility perceptual behaviors within the expert vicinity.
- Mechanism: Defines symmetric Chamfer-IoU distance \(d_{IoU}(A, B) = 1 - 0.5(IoU_{A\to B} + IoU_{B\to A})\), then an \(\varepsilon\)-vicinity \(\mathcal{B}_\varepsilon(E) = \{z_{0:k} \mid d_{IoU}(r_{1:k}, E) \le \varepsilon\}\) centered on the expert RoI set \(E\). The shaping weight \(\omega_\lambda(z_{0:k}, E) = \exp(-\lambda \mathbb{I}(z_{0:k} \notin \mathcal{B}_\varepsilon(E)))\) penalizes only trajectories outside the vicinity (with strength \(\lambda\)), while rewards \(R\) are unconstrained within the vicinity. The final shaped reward \(R_\lambda(z_{0:k}\top) = R(z_{0:k}\top) \omega_\lambda(z_{0:k}, E)\) is used in the Sub-TB flow objective's \(\mathcal{F}\).
- Design Motivation: Theorem 3.1 provides a TV distance upper bound, showing that as \(\lambda \to 0\) it degenerates to standard MLE (losing geometric constraints), and as \(\lambda \to \infty\) it degenerates to expert-guided RLVR (limited by expert bias); Theorem 3.4 further proves the existence of \(\lambda^\star\) such that the bound is strictly tighter than the lower bounds of both, i.e., under ideal assumptions, PFlowNet strictly outperforms both baselines.
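The shaping weight can be sketched in a few lines. The directed term \(IoU_{A\to B}\) is not defined in this summary, so the code assumes a Chamfer-style reading: for each box in A, take its best IoU against any box in B, then average. That definition, and all function names, are assumptions rather than the paper's specification.

```python
import math


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def directed_iou(src, dst):
    """Assumed IoU_{src -> dst}: mean over src of the best IoU against dst."""
    if not src or not dst:
        return 0.0
    return sum(max(iou(s, d) for d in dst) for s in src) / len(src)


def chamfer_iou_distance(a, b):
    """d_IoU(A, B) = 1 - 0.5 * (IoU_{A->B} + IoU_{B->A})."""
    return 1.0 - 0.5 * (directed_iou(a, b) + directed_iou(b, a))


def is_outside(rois, expert, eps):
    """True when the predicted RoIs fall outside the eps-vicinity of E."""
    return chamfer_iou_distance(rois, expert) > eps


def shaping_weight(rois, expert, lam, eps):
    """omega_lambda = exp(-lam * 1[outside vicinity]); equals 1 inside."""
    return math.exp(-lam) if is_outside(rois, expert, eps) else 1.0


def shaped_log_reward(log_r, rois, expert, lam, eps):
    """log R_lambda = log R - lam outside the vicinity, log R inside."""
    return log_r - lam if is_outside(rois, expert, eps) else log_r
```

With lam = 0 the weight is always 1 and the expert has no influence; as lam grows, trajectories outside the vicinity are suppressed ever more strongly, matching the two limiting cases discussed above.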

Loss & Training

Data Pipeline: Uses Gemini-3-flash / GPT-4o as teachers to randomly expand expert RoIs and generate synthetic flows \(Z_s\). A verifier then samples answers with and without \(Z_s\), and samples are split by pass@k, where \(k\) appears to denote the number of attempts needed to obtain a correct answer: samples with \(k_{w/o Z_s} = 1\) are too easy and are discarded; samples with \(k_{w/o Z_s} > 1\) and \(2 \le k_{w/Z_s} \le 16\) go to the RFT set; samples with \(k_{w/o Z_s} > 16\) and \(k_{w/Z_s} = 1\) go to the cold-start set.

Cold Start: Standard SFT, minimizing the cross-entropy of \(p_\theta(Z|X)\) on the synthetic flows \(Z_s\).

vRFT: Trains with the three designs above combined, parallelizing the Sub-TB computation over \(L\) sampled trajectories (sub-prefixes of each trajectory share a reward cache).
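The pass@k split can be read as a small routing rule. The sketch below encodes one interpretation of the thresholds stated above, assuming \(k\) counts the attempts the verifier needed before the first correct answer and that any value above the budget of 16 means no success within the budget. The function name route_sample, the "unassigned" bucket for cases the text does not cover, and the rule ordering are assumptions.

```python
def route_sample(k_without, k_with, budget=16):
    """Assign one sample to a training subset.

    k_without: attempts needed without the synthetic flow Z_s
    k_with:    attempts needed with Z_s
    Values greater than `budget` are treated as "not solved within budget".
    """
    if k_without < 1 or k_with < 1:
        raise ValueError("attempt counts start at 1")
    if k_without == 1:
        return "discard"        # solved immediately without help: too easy
    if k_without > budget and k_with == 1:
        return "cold_start"     # the flow turns a failure into an instant success
    if 2 <= k_with <= budget:
        return "rft"            # still hard with the flow, but solvable
    return "unassigned"         # combination not covered by the stated rule
```

Under this reading, the cold-start set holds samples where the synthetic flow is decisive, while the RFT set holds samples that leave room for the policy to improve through exploration.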

Key Experimental Results

Main Results

Base model is Qwen3-VL 8B, evaluated on V* Bench (complex visual search), TreeBench (perception + reasoning tree evaluation), and MME-RealWorld-Lite (OCR/remote sensing/charts/surveillance/autonomous driving).

Dataset	Metric	PFlowNet	Prev. SOTA / Baseline	Gain
V* Bench	Overall Acc	90.6%	Qwen3-VL 8B 77.5%	+13.1% vs base, new SOTA
TreeBench	Overall Acc	Not stated (reported as gain over Qwen3-VL: +13.1% / +10.4%)	Qwen3-VL 8B	+10.4 vs base
MME-RealWorld-Lite	Overall Acc	67.0%	Baselines 43–52%	+21% vs Qwen3-VL 8B
TreeBench (Attributes)	Acc	64.69 (example)	Most 50–60	Significant lead

Note: Table 2 shows that large models like InternVL3-78B / Qwen2.5-VL-72B achieve only 46.4% / 42.2% on TreeBench/MME-RealWorld-Lite, while PFlowNet with 8B significantly outperforms 70B-scale baselines, indicating the effectiveness of the training paradigm over parameter scaling.

Ablation Study

Configuration	Key Metric	Description
Full PFlowNet (\(\lambda^\star, \varepsilon^\star\))	Tightest TV bound / SOTA performance	All three components
\(\lambda \to 0\)	\(D_{TV} \to 1 - s_V\)	Degenerates to MLE, geometric prior lost
\(\lambda \to \infty\)	\(D_{TV} \to 1 - q\)	Degenerates to expert-guided RLVR, locked by expert bias
\(\varepsilon \to 0\)	Vicinity shrinks to a point, \(q \to 0\)	Reward signal fails, bound loosens
Increase \(\varepsilon\) (keep \(\mathcal{B}_\varepsilon \subseteq \mathcal{S}_V\))	\(q \uparrow\), bound tightens monotonically	Wider is better within valid domain
\(\varepsilon > \sigma\)	Vicinity exceeds \(\mathcal{S}_V\)	Geometric guidance diluted, performance drops

Key Findings

The probing experiment directly refutes the "expert is most precise" intuition: answers using precise expert boxes have lower accuracy than those with moderately expanded IoU boxes, confirming the "tunnel vision" effect.
Theorems 3.1–3.4 provide provable improvements "strictly superior to MLE and expert-guided RLVR," under the condition \(\mathcal{B}_\varepsilon \subseteq \mathcal{S}_V\) (vicinity must not exceed the valid domain).
Excellent performance-efficiency tradeoff: the 8B model outperforms 78B baselines, suggesting that the structured decomposition of perceptual flow combined with variational RL yields large gains in parameter efficiency.
Good scaling at test time: performance continues to improve with larger sampling budgets, indicating that \(p_\theta(Z|X)\) learns a truly explorable distribution rather than a single point.

Highlights & Insights

Reformulating VGR as "latent variable posterior approximation" is the paper's major conceptual advance: replacing RLVR's "geometric alignment" framework with a "distribution approximation" framework turns the previously unsolvable "expert bias" problem into a matter of tuning the \(\lambda\) and \(\varepsilon\) hyperparameters.
Sub-TB provides dense supervision while maintaining GFlowNet's exploratory nature, making it especially suitable for long-chain perceptual behaviors; bridging GFlowNet concepts from molecule generation to LVLM reasoning is both novel and natural.
The insight that the contrastive term \(p_\phi^+/p_\phi^-\) in the multi-dimensional reward is equivalent to reverse-KL distillation is elegant, directly turning the caption likelihood difference "inside/outside the box" into a KL term difference, with clear physical and optimization semantics.
The design philosophy of Vicinal Geometric Shaping ("expert as reference, not target") is transferable to any RLHF scenario needing a balance between expert prior and exploration, such as tool calling for code agents or robot policy distillation.

Limitations & Future Work

Theoretical assumptions are strong (Assumption 1/2, \(d_{eff}\)-regularity, etc.), and it is unclear whether real LVLM distributions satisfy them; the bounds are only idealized upper limits.
Perceptual Flow currently supports only "box + caption" binary states; extension is needed for finer-grained perceptual behaviors (e.g., masks, point clouds, video frames).
The data pipeline relies on strong teacher models (Gemini-3-flash / GPT-4o) to synthesize flows, posing an explicit barrier for open-source reproduction; cold start data quality directly impacts final performance.
\(\lambda\) and \(\varepsilon\) are still fixed hyperparameters, lacking adaptive scheduling; the theorem only proves the existence of optimal \(\lambda^\star\), without specifying how to choose it.
Multi-dimensional rewards require maintaining a reward model \(p_\phi\) (initialized with the policy but frozen), doubling memory cost during training.

vs Look-Twice / VGR / TraceVL: These works use expert geometry from GroundingDINO as hard rewards, limited by expert bias; PFlowNet reduces the expert to a vicinal reference and uses variational objectives to self-learn optimal perception.
vs DeepSeek-R1 (RLVR paradigm): R1 applies RLVR to math/code with verifiable rewards; PFlowNet extends RLVR to scenarios where perceptual behavior cannot be directly verified, using contrastive likelihood + information gain instead of ground-truth rewards.
vs GFlowNets (Sub-TB): Original GFN is for discrete combinatorial object generation; this work is among the first to introduce Sub-TB to LVLM multimodal reasoning.
vs Vicinal Risk Minimization: Borrows the "vicinal shaping" idea from classic VRM, but applies it to trajectory space instead of input space, providing a new RL regularization primitive.

Rating

Novelty: ⭐⭐⭐⭐⭐ Reformulates VGR, introduces GFlowNet Sub-TB, and designs vicinal geometric shaping; the overall framework is coherent and opens a new paradigm.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on V* / TreeBench / MME-RealWorld-Lite, compared to large model baselines; however, many ablation details are in the appendix.
Writing Quality: ⭐⭐⭐⭐ Clear structure in theory and method, with a consistent flow from "probing experiment → formalization → Sub-TB → reward → geometric shaping."
Value: ⭐⭐⭐⭐⭐ Paradigm-shifting for future grounded-reasoning LVLMs; vicinal shaping and multi-dimensional rewards are highly transferable.