FM-Steer: Enhance Generalist Policies with Value-Guided Cascaded Denoising

Conference: CVPR 2026
Paper: CVF Open Access
Code: https://hume-vla.github.io (Project Page)
Area: Robotics / Embodied AI / VLA
Keywords: Test-time computation, flow-matching VLA, value-guided sampling, cascaded denoising, robotic manipulation

TL;DR

FM-Steer is a test-time computation framework for flow-matching (FM) VLA generalist policies. An intermediate flow verifier estimates Q-values for "semi-denoised" candidate actions and selects the best one via Best-of-N; a lightweight Lite-Flow denoiser then completes the remaining denoising steps asynchronously. The approach improves π0 by +4.4%, +25.9%, and +12.9% on LIBERO, SimplerEnv, and real-world robots respectively, while raising the control frequency from 4 Hz to 90 Hz, all without retraining the foundation model.

Background & Motivation

Background: Generalist robot policies built as vision-language-action (VLA) models typically pair a vision-language model with an action expert that predicts future action chunks. Autoregressive VLAs (e.g., RT-2, OpenVLA) discretize actions into tokens, whereas flow-matching VLAs such as π0 and GR00T N1 generate continuous action chunks and perform better in dexterous manipulation. In parallel, "test-time scaling" (sampling multiple candidates and selecting the best) has produced significant gains on complex LLM tasks.

Limitations of Prior Work: Adapting "sampling + verification" to robotics (e.g., V-GPS, RoboMonkey) faces two major hurdles. First, these methods are designed for autoregressive VLAs and cannot be applied directly to flow-matching VLAs, because flow-matching inference follows a deterministic ODE and naive re-sampling fails to produce diverse candidates. Second, repeated sampling increases inference latency and severely reduces control frequency, which leads to jitter or failure in dynamic tasks.

Key Challenge: Robot control is subject to a real-time constraint, so any extra inference computation translates directly into latency. There is therefore a natural conflict between "investing more computation for better performance" and "maintaining high control frequency." Prior works like V-GPS often sacrifice frequency for performance.

Goal: To decompose the problem into two sub-tasks: (1) devising an effective test-time computation method for flow-matching VLAs (solving the diversity issue); and (2) increasing the control frequency even as computational expenditure grows.

Key Insight: The authors observe that verification does not need to wait for "complete denoising." On the Euler forward trajectory of flow-matching, intermediate noise actions (flow points) carry sufficient information for scoring. Thus, expensive "sampling + verification" can be limited to the early stages, while cheap "completion denoising" can be offloaded asynchronously.

Core Idea: By using "value-guided sampling on intermediate noise actions + cascading remaining denoising to a lightweight denoiser," the framework decouples "heavy test-time scaling" from "high-frequency control." The slow verifier selects better actions at a low frequency (4 Hz), while the fast Lite-Flow denoiser provides continuous, high-frequency completion (90 Hz).

Method

Overall Architecture

FM-Steer builds upon a frozen flow-matching VLA and adds two external modules: an intermediate flow verifier $\varphi$ (to estimate Q-values and select the optimal candidate) and a Lite-Flow denoiser $\phi$ (a lightweight transformer for completing the denoising process).

Background on the base model: a flow-matching VLA learns a conditional vector field $v_\theta(A_t^\tau, o_t)$ by minimizing $\mathcal{L}_{FM}=\mathbb{E}\lVert v_\theta(A_t^\tau,o_t)-u(A_t^\tau\mid A_t)\rVert^2$. The noisy actions are $A_t^\tau=\tau A_t+(1-\tau)\epsilon$ with $\epsilon\sim\mathcal{N}(0,I)$, where $\tau\in[0,1]$ indexes the noise level ($\tau=0$ is pure noise, $\tau=1$ is the clean action). The target field along this straight path is therefore $u(A_t^\tau\mid A_t)=\frac{dA_t^\tau}{d\tau}=A_t-\epsilon$. Inference starts from pure noise $A_t^0\sim\mathcal{N}(0,I)$ and applies forward Euler steps $A_t^{\tau+\delta}=A_t^\tau+\delta v_\theta(A_t^\tau,o_t)$ until it reaches $A_t^1$.
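
The forward Euler recursion can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `make_toy_field` is a hypothetical stand-in for $v_\theta$ that already knows the clean action, and `euler_denoise` is an illustrative name. The toy field is the straight-line velocity implied by the interpolation $A_t^\tau=\tau A_t+(1-\tau)\epsilon$, so the integration should land on the target.

```python
import random


def euler_denoise(field, a_start, obs, tau_start=0.0, tau_end=1.0, num_steps=10):
    """Integrate dA/dtau = field(A, tau, obs) with forward Euler steps."""
    delta = (tau_end - tau_start) / num_steps
    a = list(a_start)
    for i in range(num_steps):
        tau = tau_start + i * delta
        v = field(a, tau, obs)
        a = [x + delta * vx for x, vx in zip(a, v)]
    return a


def make_toy_field(target):
    """Hypothetical stand-in for v_theta: straight-line field toward one known action."""
    def field(a, tau, obs):
        # On the path A^tau = tau * A + (1 - tau) * eps, the velocity A - eps
        # equals (A - A^tau) / (1 - tau).
        return [(t - x) / (1.0 - tau) for t, x in zip(target, a)]
    return field


rng = random.Random(0)
target = [0.5, -0.2, 0.1]                       # toy "clean" action
noise = [rng.gauss(0.0, 1.0) for _ in target]   # A^0 ~ N(0, I)
action = euler_denoise(make_toy_field(target), noise, obs=None)
```

A learned $v_\theta$ only approximates this field, so in practice the result depends on the step count and on model error.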

FM-Steer splits the denoising trajectory ($0$ to $1$) at $\tau^*$. The base VLA only computes from $0 \to \tau^*$ to generate a batch of intermediate noise actions as candidates. The verifier selects the optimal candidate $A_t^{\tau^*}$ with the highest Q-value, which is then sliced and handed to the Lite-Flow to complete $\tau^* \to 1$. In deployment, the verifier side runs at low frequency (4 Hz), while the Lite-Flow side fetches the latest selection from a shared queue to complete denoising at high frequency (90 Hz).

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Observation o_t<br/>Image + Instruction + State"] --> B["Frozen Flow-matching VLA<br/>Denoise only 0→τ*"]
    B --> C["Value-Guided Test-time Sampling<br/>N-candidates with varied noise + randomness"]
    C --> D["Intermediate Flow Verifier φ<br/>Query token estimates Q-value, Best-of-N selection"]
    D -->|Low Frequency 4Hz| E["Shared Asynchronous Queue"]
    E -->|High Frequency 90Hz| F["Cascaded Action Denoising<br/>Lite-Flow completes τ*→1"]
    F --> G["Real-time Dexterous Control"]
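
The two-rate hand-off in the diagram can be illustrated with a deterministic simulation. This is a sketch under stated assumptions: a real deployment would run the two sides in separate threads or processes, whereas this version uses an integer clock, and the single-slot `deque` is one possible way to model "fetch the latest selection". The name `simulate` is illustrative.

```python
from collections import Counter, deque

VERIFIER_HZ = 4
LITE_FLOW_HZ = 90
TICKS_PER_SECOND = 180  # common multiple of both rates, so the clock stays integer


def simulate(seconds=1):
    """Count how many Lite-Flow steps consume each verifier selection."""
    shared = deque(maxlen=1)  # keeps only the most recent selection
    reads = Counter()
    verifier_period = TICKS_PER_SECOND // VERIFIER_HZ   # 45 ticks
    lite_period = TICKS_PER_SECOND // LITE_FLOW_HZ      # 2 ticks
    for tick in range(seconds * TICKS_PER_SECOND):
        if tick % verifier_period == 0:
            shared.append(tick // verifier_period)      # id of the new selection
        if tick % lite_period == 0 and shared:
            reads[shared[-1]] += 1                      # Lite-Flow uses the latest one
    return reads


reads = simulate()
```

In this toy setting each selection is reused for 22 or 23 consecutive Lite-Flow steps, which is the ratio 90 / 4 = 22.5 spread over integer steps.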

Key Designs

1. Value-guided test-time sampling: Best-of-N on intermediate noise actions

To address both the lack of diversity in a deterministic ODE and the latency of full generation, FM-Steer samples intermediate flow points $A_t^{\tau_n}$ instead of final actions. Diversity comes from two mechanisms. (1) Varying noise levels: the $n$-th candidate ($n=1,\dots,N$) is denoised only up to $\tau_n=T-(n-1)\xi$, so that $$A_t^{\tau_n}=\int_0^{T-(n-1)\xi} v_\theta(A_t^\tau,o_t)\,d\tau+\epsilon_n,$$ where $T \in (0,1)$ is the upper bound of the noise level, $\xi$ is the spacing between consecutive levels, and $\epsilon_n$ is the initial noise of the $n$-th candidate. (2) Noise forcing: randomness is injected into the Euler steps by modeling each step as an isotropic Gaussian $p(A_t^{\tau+\delta}\mid A_t^\tau)\sim\mathcal{N}(\mu_\tau,\Sigma_\tau)$, where the variance $\Sigma_\tau$ controls diversity.
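
Both mechanisms can be sketched together. This is an illustrative sketch only: the vector field is a hypothetical toy (`make_toy_field`), the step mean is taken to be the plain Euler update, and the covariance is assumed to be `sigma**2 * I`; none of these choices are confirmed by the paper.

```python
import random


def noise_levels(T, xi, N):
    """Noise levels tau_n = T - (n - 1) * xi for n = 1..N."""
    return [T - (n - 1) * xi for n in range(1, N + 1)]


def make_toy_field(target):
    """Hypothetical stand-in for v_theta: straight-line field toward one known action."""
    def field(a, tau, obs):
        return [(t - x) / (1.0 - tau) for t, x in zip(target, a)]
    return field


def sample_candidates(field, obs, dim, T, xi, N, sigma, delta=0.05, seed=0):
    """Draw N intermediate flow points, one per noise level, with noisy Euler steps."""
    rng = random.Random(seed)
    out = []
    for tau_n in noise_levels(T, xi, N):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # epsilon_n
        a = list(eps)
        for i in range(round(tau_n / delta)):             # assumes tau_n is a multiple of delta
            mean = [x + delta * v for x, v in zip(a, field(a, i * delta, obs))]
            a = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        out.append({"tau": tau_n, "eps": eps, "action": a})
    return out


target = [0.5, -0.2]
field = make_toy_field(target)
candidates = sample_candidates(field, None, dim=2, T=0.5, xi=0.1, N=4, sigma=0.02)
clean = sample_candidates(field, None, dim=2, T=0.5, xi=0.1, N=4, sigma=0.0)
```

With `sigma=0.0` the candidates differ only through their initial noise and stopping level; a positive `sigma` adds the per-step randomness.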

2. Intermediate flow verifier: Estimating state-action value with query tokens

The verifier $\varphi$ is tightly integrated with the base model via a learnable special query token $q_t$. Appended to the VLM input sequence, $q_t$ attends to all preceding vision-language tokens to aggregate the current observation. $q_t$ and the candidate $A_t^{\tau_n}$ are fed into the verifier to estimate $Q_\varphi(q_t,A_t^{\tau_n})$. The optimal candidate is chosen via Best-of-N: $$A_t^{\tau^*}=\arg\max_{A_t^{\tau_n}} Q_\varphi(q_t,A_t^{\tau_n}).$$ The verifier is trained using calibrated Q-learning on sparse binary rewards. This enables it to identify high-value directions midway through the denoising process.
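
The Best-of-N step itself is a plain argmax over verifier scores. In the sketch below `toy_q` is a hypothetical scoring function (negative squared distance to the query vector) used only so the example runs; the actual verifier $Q_\varphi$ is a learned network.

```python
def best_of_n(q_fn, query, candidates):
    """Return the index of the highest-scoring candidate and all scores."""
    scores = [q_fn(query, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores


def toy_q(query, action):
    """Hypothetical stand-in for Q_phi: higher when the action is closer to the query."""
    return -sum((a - q) ** 2 for a, q in zip(action, query))


query = [0.0, 1.0]
candidates = [[1.0, 1.0], [0.1, 0.9], [-0.5, 0.0]]
best, scores = best_of_n(toy_q, query, candidates)
selected = candidates[best]
```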

3. Cascaded action denoising: Decoupling sampling and high-frequency completion

The optimal candidate $A_t^{\tau^*}$ is a noisy action chunk with horizon $H$. FM-Steer uniformly slices it into $K$ sub-chunks of horizon $h$ (so $H=Kh$). For each sub-chunk $A_{t,k}^{\tau^*}$, the Lite-Flow denoiser completes the trajectory: $$A_{t,k}=\int_{\tau^*}^{1} v_\phi(A_{t,k}^\tau,o_{t,k})\,d\tau+A_{t,k}^{\tau^*}.$$ The work is thus split between two models: the verifier handles slow, high-level selection, while the Lite-Flow provides fast, reactive control. Because the two update asynchronously, the test-time scaling compute raises the control frequency rather than lowering it.
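
Slicing and completion can be sketched as follows. The assumptions are loud ones: actions are scalars per time step to keep the example short, `make_toy_lite_field` is a hypothetical stand-in for $v_\phi$ that knows the clean sub-chunk, and the observation argument is unused, whereas the real denoiser conditions on $o_{t,k}$.

```python
def slice_chunk(chunk, h):
    """Split a horizon-H chunk into K = H / h sub-chunks of horizon h."""
    if len(chunk) % h != 0:
        raise ValueError("horizon H must be a multiple of h")
    return [chunk[i:i + h] for i in range(0, len(chunk), h)]


def make_toy_lite_field(target):
    """Hypothetical stand-in for v_phi: straight-line field toward a known sub-chunk."""
    def field(a, tau, obs):
        return [(t - x) / (1.0 - tau) for t, x in zip(target, a)]
    return field


def complete(sub_chunk, field, obs, tau_star, num_steps=4):
    """Forward Euler from tau_star to 1, starting at the selected intermediate point."""
    delta = (1.0 - tau_star) / num_steps
    a = list(sub_chunk)
    for i in range(num_steps):
        tau = tau_star + i * delta
        a = [x + delta * v for x, v in zip(a, field(a, tau, obs))]
    return a


tau_star = 0.5
noisy_chunk = [0.9, -0.4, 0.3, 0.0, 1.2, -0.7, 0.5, 0.1]   # H = 8
clean_chunk = [1.0, 0.0, 0.5, 0.5, 1.0, -1.0, 0.0, 0.0]    # toy ground truth
subs = slice_chunk(noisy_chunk, h=2)                        # K = 4
targets = slice_chunk(clean_chunk, h=2)
actions = [
    complete(s, make_toy_lite_field(t), None, tau_star)
    for s, t in zip(subs, targets)
]
```

In deployment each sub-chunk would be completed just before execution, which is what lets the lightweight model react to a fresher observation than the one the verifier saw.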

Key Experimental Results

Evaluations were conducted across three simulation benchmarks (LIBERO, SimplerEnv WidowX, SimplerEnv Google Robot) and three real-world platforms (WidowX 250s, Franka, AgiBot G-1), covering 15 scenarios and 21 real-world tasks.

Main Results: LIBERO (Success Rate, SR)

| Model | Spatial | Object | Goal | Long | Overall |
| --- | --- | --- | --- | --- | --- |
| OpenVLA-OFT | 97.6% | 98.4% | 97.9% | 94.5% | 97.1% |
| GR00T N1 | 94.4% | 97.6% | 93.0% | 90.6% | 93.9% |
| π0 (Base) | 96.8% | 98.8% | 95.8% | 85.2% | 94.2% |
| π0.5 | 98.8% | 98.2% | 98.0% | 92.4% | 96.8% |
| FM-Steer (GR00T N1) | 97.5% | 99.1% | 97.4% | 95.4% | 97.3% |
| FM-Steer (π0) | 98.6% | 99.8% | 99.4% | 96.7% | 98.6% |

FM-Steer (π0) achieves the highest overall SR (98.6%), 4.4 percentage points above the base π0 (94.2%). The gain on GR00T N1 (93.9% to 97.3%) shows that the framework is plug-and-play across flow-matching backbones.

Main Results: SimplerEnv (vs. Prior Test-time Methods)

| Method | WidowX Overall | Google (Matching) | Google (Aggregation) |
| --- | --- | --- | --- |
| OpenVLA | 37.7% | 35.4% | 30.0% |
| V-GPS | 37.0% | 36.1% | 30.2% |
| RoboMonkey | 47.8% | 37.3% | 30.5% |
| π0 (Base) | 69.2% | 71.4% | 54.7% |
| FM-Steer (π0) | 79.5% | 76.6% | 60.3% |

Ablation Study

| Configuration | LIBERO | SimplerEnv | Overall |
| --- | --- | --- | --- |
| FM-Steer (π0) Full | 98.6 | 75.1 | 81.0 |
| w/o Cascaded Denoising ($T=1$) | 95.9 | 72.1 | 78.1 |
| w/o Re-sampling ($N=1$) | 93.8 | 69.1 | 75.3 |
| w/o Lite-Flow Denoiser | 89.8 | 65.8 | 71.8 |
| w/o Flow Verifier (Random) | 84.9 | 61.4 | 67.3 |
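
As a quick reading aid, the drop of each ablation relative to the full model can be computed from the Overall column. The numbers are copied from the table; the variable names are illustrative.

```python
full_overall = 81.0
ablations = {
    "w/o Cascaded Denoising": 78.1,
    "w/o Re-sampling": 75.3,
    "w/o Lite-Flow Denoiser": 71.8,
    "w/o Flow Verifier": 67.3,
}

# Drop in Overall score (points) when each component is removed.
drops = {name: round(full_overall - score, 1) for name, score in ablations.items()}
largest = max(drops, key=drops.get)
```

Here `drops` evaluates to 2.9, 5.7, 9.2, and 13.7 points respectively, so removing the flow verifier costs the most.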

Key Findings

Flow Verifier as the Core Component: Replacing the verifier with random selection led to the steepest performance drop (−78% on real-world robots).
Lite-Flow is Critical for Real-world Deployment: Executing noisy intermediate actions without completing the denoising drops real-world SR by 63%.
Failure Recovery Capabilities: FM-Steer can recover from failed states by selecting alternative high-value trajectories via value-guiding.
High-frequency Robustness: Maintains high performance at 20 Hz (Franka) and 30 Hz (AgiBot G-1), significantly outperforming π0.

Highlights & Insights

Early Verification: Shifting Best-of-N from "final actions" to "intermediate noise actions" addresses both the diversity and the latency issues of flow-matching policies.
Structural Decoupling: The asynchronous coordination between the slow verifier (4 Hz) and fast Lite-Flow (90 Hz) is the fundamental differentiator from prior works like V-GPS.
Efficiency: The plug-and-play design improves performance without retraining the heavy base model, which keeps the engineering cost low.

Limitations & Future Work

Reliance on Offline RL: The verifier depends on the quality of the reward signal used during training; a poorly defined reward may limit performance.
Hyperparameter Sensitivity: The choice of noise upper bound $T$, number of candidates $N$, and sampling variance requires manual tuning for different platforms.
Asynchronous Stale Data: Future work should explore whether the "freshness" of actions in the shared queue becomes a bottleneck in extremely dynamic scenarios.

Comparison with V-GPS/RoboMonkey: Those methods target autoregressive models and final action sampling, leading to low frequencies. FM-Steer applies test-time scaling to flow-matching via intermediate sampling.
Comparison with Cascaded Diffusion: While cascaded denoising was previously used for progressive image resolution, this work applies it to action generation to split computational budget and enable real-time control.

Rating

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐