Scalable Offline Model-Based RL with Action Chunks¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=WXGb9unEHo
Paper: Project Page
Code: https://github.com/kwanyoungpark/MAC
Area: Reinforcement Learning / Offline Model-Based Reinforcement Learning
Keywords: Offline RL, Model-Based RL, Action Chunks, Value Expansion, Flow Matching

TL;DR¶

MAC uses action chunk models to replace many single-step model calls in long-horizon offline MBRL with fewer multi-step predictions. By rejection sampling from a flow-based behavioral policy to select conservative, high-value action chunks, it significantly outperforms existing offline MBRL methods on OGBench long-horizon manipulation tasks with 100M-transition datasets.

Background & Motivation¶

Background: Offline reinforcement learning aims to train executable policies solely from existing data, without environment interaction. There are two main paradigms: model-free offline RL, which performs value learning or behavioral regularization directly on the dataset; and model-based offline RL, which first learns dynamics models and then uses imagined trajectories for planning, data augmentation, or value estimation. This work focuses on model-based value expansion because it combines generative modeling with on-policy value learning, two ingredients of training paradigms that have demonstrated scalability in long-horizon tasks.

Limitations of Prior Work: Standard offline model-free methods are often hindered by off-policy TD learning in long-horizon tasks: each Bellman update only considers short local intervals, leading to bias accumulation along long decision chains. While model-based RL could theoretically mitigate this via multi-step rollouts, learned single-step dynamics models must be called recursively, causing small single-step errors to explode into entirely incorrect future states after dozens or hundreds of steps.

Key Challenge: In model-based value expansion, the rollout length \(n\) plays two conflicting roles. With large \(n\), the target \(\sum_{i=0}^{n-1}\gamma^i r_{t+i}+\gamma^n \bar V(s_{t+n})\) relies less on bootstrapping, resulting in lower value target bias. However, using a single-step model \(p(s_{t+1}\mid s_t,a_t)\) requires \(n\) recursive predictions, making model error more likely to explode. In essence, value learning demands long rollouts, while dynamics prediction demands short ones.
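To make the two terms of this target concrete, here is a minimal sketch of the \(n\)-step value target in NumPy. The function name `n_step_target` and its argument names are illustrative assumptions, not identifiers from the MAC codebase.

```python
import numpy as np

def n_step_target(rewards, bootstrap_value, gamma):
    """Compute sum_{i=0}^{n-1} gamma^i r_{t+i} + gamma^n * V_bar(s_{t+n}).

    rewards:         the n rewards r_t, ..., r_{t+n-1}
    bootstrap_value: V_bar(s_{t+n}), the bootstrapped value estimate
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)          # 1, gamma, ..., gamma^(n-1)
    reward_part = discounts @ rewards          # grows with n
    bootstrap_part = gamma ** n * bootstrap_value
    return float(reward_part + bootstrap_part)

# Two rewards of 1 with gamma = 0.5 and V_bar = 4: 1 + 0.5 + 0.25 * 4
print(n_step_target([1.0, 1.0], 4.0, 0.5))  # 2.5
```

A larger \(n\) shifts more of the target onto observed or imagined rewards and less onto the bootstrapped estimate, which is the source of the tension described above.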

Goal: The authors aim to answer a specific question: Can offline model-based RL become a scalable solution for complex, long-horizon tasks with million- to billion-scale datasets? To achieve this, a method must satisfy three requirements: allow long rollouts for value learning to reduce short-sighted bootstrapping; minimize compounding errors from recursive model calls; and avoid exploitation of model errors by keeping the policy within the data distribution.

Key Insight: Many long-horizon control tasks do not require step-by-step future modeling. By treating \(n\) consecutive actions as an action chunk \(a_{t:t+n-1}\), a model can directly predict the state \(s_{t+n}\) reached after \(n\) steps. With \(n=10\), for example, a 100-step environment rollout requires only 10 model calls. Because action chunk distributions are high-dimensional and multi-modal, and therefore hard for a single Gaussian policy to fit, the authors train a behavioral action chunk policy with flow matching and then use rejection sampling to select action chunks within the behavioral distribution.

Core Idea: Replacing "single-step models + direct policy optimization" with "action chunk models + flow behavior policy rejection sampling" suppresses both compounding error and out-of-distribution (OOD) action exploitation in long-horizon offline MBRL.

Method¶

Overall Architecture¶

MAC (Model-Based RL with Action Chunks) reorganizes an offline RL dataset into action chunk samples: each training sample contains the current state \(s_t\), an action chunk \(a_{t:t+n-1}\) of length \(n\), the discounted cumulative reward over the chunk \(r_t=\sum_{i=0}^{n-1}\gamma^i r_{t+i}\) (the symbol \(r_t\) on the left is overloaded to denote this chunk-level reward), and the state \(s_{t+n}\) after \(n\) steps. On these samples, the algorithm simultaneously trains an action chunk dynamics model, an action chunk reward model, a flow-based behavioral action chunk policy, and a value function. It then uses the models to generate on-policy imagined trajectories to update the value function.
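The reorganization into chunk samples can be sketched as a sliding window over one trajectory. This is a minimal illustration under assumed array layouts; `make_chunk_samples` and the dictionary keys are hypothetical names, and the actual data pipeline may differ (for example in how episode boundaries are handled).

```python
import numpy as np

def make_chunk_samples(states, actions, rewards, n, gamma):
    """Slice one trajectory into action-chunk training samples.

    states:  (T + 1, state_dim)  s_0 ... s_T
    actions: (T, action_dim)     a_0 ... a_{T-1}
    rewards: (T,)                r_0 ... r_{T-1}
    """
    T = len(actions)
    discounts = gamma ** np.arange(n)
    samples = []
    for t in range(T - n + 1):
        samples.append({
            "state": states[t],
            # Flatten the n actions into one chunk vector of size n * action_dim.
            "action_chunk": actions[t:t + n].reshape(-1),
            # Discounted reward accumulated inside the chunk.
            "chunk_reward": float(discounts @ rewards[t:t + n]),
            "next_state": states[t + n],
        })
    return samples

# Toy trajectory: 5 transitions, chunks of length 2.
states = np.arange(6, dtype=float).reshape(6, 1)
samples = make_chunk_samples(states, np.ones((5, 2)), np.ones(5), n=2, gamma=0.5)
print(len(samples), samples[0]["chunk_reward"])  # 4 1.5
```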

The execution policy is not a directly learned unconstrained actor. Instead, it samples multiple candidate action chunks from the behavioral flow policy and selects the one with the highest value according to the action chunk \(Q\)-function. This selection process keeps the policy close to the offline data distribution while maximizing returns within that distribution.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Offline Trajectory Data"] --> B["Action Chunk Model:<br/>Multi-step future prediction"]
    A --> C["Flow Behavior Rejection Sampling:<br/>Selecting chunks within data distribution"]
    C --> D["One-step Distillation Policy:<br/>Accelerating flow sampling"]
    B --> E["Action Chunk Value Expansion:<br/>Training V/Q with imagined trajectories"]
    D --> E
    E --> C
    E --> F["Execution: first action<br/>or chunk control"]

Key Designs¶

1. Action Chunk Model: Reducing compounding error via multi-step future prediction

Standard model-based RL typically learns \(p(s_{t+1}\mid s_t,a_t)\). To obtain the state after 100 steps, the model output must be recursively fed back 100 times. MAC instead learns action chunk dynamics \(p_\psi(s_{t+n}\mid s_t,a_{t:t+n-1})\). The input is a whole sequence of actions, and the output is the future state after \(n\) steps. With a default \(n=10\), a 100-step environment rollout requires only \(H=10\) model calls. This design decouples the horizon of the value target (\(nH\) environment steps) from the recursion depth of the model (\(H\) calls). Experiments show that for \(n\) from 1 to 25, the action chunk model significantly suppresses MSE divergence compared to single-step models.
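The call-count arithmetic can be written down directly. The helper below is a hypothetical illustration (the name `model_calls` is not from the paper); it only counts recursive model calls and says nothing about the prediction error of each call.

```python
import math

def model_calls(env_steps, chunk_size):
    """Number of recursive model calls needed to cover env_steps environment steps."""
    return math.ceil(env_steps / chunk_size)

print(model_calls(100, 1))   # 100 calls with a single-step model
print(model_calls(100, 10))  # 10 calls with chunks of length n = 10
```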

2. Flow Behavior Rejection Sampling: Maximizing value within data-supported action chunks

While action chunks stabilize dynamics, they complicate policy modeling. Chunks of length \(n\) reside in \(A^n\), where sequences are often highly multi-modal (e.g., at the same state, one could move an object left or right). Gaussian actors tend to average these modes, producing OOD chunks. MAC trains a behavioral cloning policy \(\pi_\theta(a_{t:t+n-1}\mid s_t)\) using flow matching. During policy extraction, instead of gradient ascent, it samples \(N\) candidate chunks \(a^{(1)},\dots,a^{(N)}\sim\pi_\theta(\cdot\mid s_t)\) and selects the one with the highest \(Q\)-value: \(\pi(s_t)\overset{d}{=}\arg\max_{a^{(i)},\,i=1,\dots,N}Q(s_t,a^{(i)})\). This restricts optimization to the behavioral distribution, removing the need for manually tuned uncertainty penalties like those in MOPO or MOBILE.
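A minimal sketch of this best-of-\(N\) selection follows. The sampler and \(Q\)-function are toy stand-ins (a two-mode sampler and a sum-based score) chosen only so the example runs; in MAC these roles are played by the flow behavior policy and the learned action chunk \(Q\)-function. All function names here are assumptions.

```python
import numpy as np

def best_of_n(sample_chunks, q_fn, state, num_candidates, rng):
    """Draw candidate chunks from a behavior sampler and keep the highest-Q one."""
    candidates = sample_chunks(state, num_candidates, rng)  # (N, chunk_dim)
    q_values = q_fn(state, candidates)                      # (N,)
    best = int(np.argmax(q_values))
    return candidates[best], float(q_values[best])

def toy_behavior_sampler(state, num_candidates, rng, chunk_dim=4):
    # Hypothetical bimodal behavior: chunks cluster around -1 or +1.
    modes = rng.choice([-1.0, 1.0], size=(num_candidates, 1))
    return modes + 0.1 * rng.standard_normal((num_candidates, chunk_dim))

def toy_q(state, chunks):
    # Hypothetical critic that prefers the +1 mode.
    return chunks.sum(axis=1)

rng = np.random.default_rng(0)
chunk, q_value = best_of_n(toy_behavior_sampler, toy_q, np.zeros(3), 8, rng)
print(chunk.shape)  # (4,)
```

Because the returned chunk is always one of the sampled candidates, the selected behavior never leaves the support of the sampler, which is the sense in which the extraction is conservative.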

3. One-step Distillation Policy: Enabling flow rejection sampling for training rollouts

Pure flow sampling is computationally expensive. If each chunk requires \(N\) candidates and each candidate uses \(F\) flow steps, one rejection-sampling step needs \(NF\) network queries; for \(N=8, F=10\), that is 80 queries per chunk. To mitigate this during training, a one-step MLP policy \(\pi_\omega(s_t,z)\) is trained to distill the flow policy output: \(L_{distill}=\mathbb E\|\pi_\omega(s_t,z)-[\pi_\theta(s_t,z)]_\times\|_2^2\). This reduces training and inference costs to \(N\) queries per chunk while preserving the expressive power of multi-step flow.
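The cost argument and the distillation objective can be sketched as follows. The Euler integrator is an assumed sampling scheme for the flow policy (the text only states the number of flow steps), and `velocity_fn`, `euler_flow_sample`, and `distill_loss` are hypothetical names. The teacher output is treated as a fixed regression target.

```python
import numpy as np

def queries_per_chunk(num_candidates, flow_steps):
    """Network queries for one rejection-sampling step with a multi-step flow."""
    return num_candidates * flow_steps

def euler_flow_sample(velocity_fn, state, z, flow_steps):
    """Map noise z to an action chunk by Euler-integrating a velocity field."""
    a = z.copy()
    dt = 1.0 / flow_steps
    for k in range(flow_steps):
        a = a + dt * velocity_fn(state, a, k * dt)  # one network query per step
    return a

def distill_loss(student_out, teacher_out):
    """Squared L2 distance between one-step student and multi-step teacher outputs."""
    return float(np.mean(np.sum((student_out - teacher_out) ** 2, axis=-1)))

print(queries_per_chunk(8, 10))  # 80 with the flow policy
print(queries_per_chunk(8, 1))   # 8 with the one-step distilled policy
```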

4. Action Chunk Value Expansion: Training V/Q with on-policy imagined trajectories

MAC does not use the action chunk model solely for MPC or sequence modeling. It maintains an actor-critic structure: starting from data states, it generates action chunk imagined trajectories of length \(H\) using the current rejection sampling policy, dynamics model, and reward model. These are used to train on-policy values. The value target covers \(nH\) environment steps, regressing \(V_\phi(\hat s_{t+kn})\) to cumulative rewards plus terminal bootstrapping. \(Q_\phi(s_t,a_{t:t+n-1})\) is then trained to predict the chunk reward plus \(\gamma^n V_\phi(\hat s_{t+n})\), and serves as the ranker for rejection sampling.
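One plausible reading of these targets is a backward recursion over the \(H\) imagined chunk steps with per-chunk discount \(\gamma^n\). The sketch below assumes plain discounted returns with a terminal bootstrap; the exact target used by the authors (for example, any return mixing) is not specified here, and both function names are hypothetical.

```python
import numpy as np

def chunk_value_targets(chunk_rewards, terminal_value, gamma, n):
    """Targets for V at each imagined chunk step k = 0, ..., H - 1.

    chunk_rewards:  H chunk-level rewards (each already discounted within its chunk)
    terminal_value: V(s_{t+Hn}), the bootstrap at the end of the rollout
    """
    H = len(chunk_rewards)
    chunk_discount = gamma ** n          # one chunk spans n environment steps
    targets = np.empty(H)
    running = terminal_value
    for k in reversed(range(H)):
        running = chunk_rewards[k] + chunk_discount * running
        targets[k] = running
    return targets

def q_target(chunk_reward, next_value, gamma, n):
    """Target for Q(s_t, a_{t:t+n-1}): chunk reward plus gamma^n * V(s_{t+n})."""
    return chunk_reward + gamma ** n * next_value

print(chunk_value_targets([1.0, 1.0], 4.0, gamma=0.5, n=1))  # [2.5 3. ]
print(q_target(1.0, 3.0, gamma=0.5, n=2))                    # 1.75
```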

Mechanism Example¶

In a cube-octuple manipulation task requiring multiple object movements, a single-step model would recurse from \(s_t\) to \(s_{t+100}\), amplifying contact errors at each step. MAC segments data into chunks of \(n=10\). At \(s_t\), the flow policy generates 32 candidate 10-step sequences; the \(Q\)-function selects the one with the highest predicted long-term value. The dynamics model predicts \(s_{t+10}\) and the cumulative reward directly. Repeating this \(H=10\) times yields a trajectory of 100 environment steps with a recursion depth of only 10. Value functions see returns at a 100-step scale, while policy candidates remain grounded in behavioral data.
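The loop described in this example can be sketched end to end with toy stand-ins. Everything below the `imagine_rollout` function (the 2-D state, the goal, the additive dynamics, the distance-based reward and \(Q\)) is hypothetical and exists only to make the sketch runnable; it is not the cube-octuple environment or the learned MAC models.

```python
import numpy as np

def imagine_rollout(s0, propose, q_fn, dynamics, reward, H, num_candidates, rng):
    """Imagine H chunk-level steps, picking the best-of-N chunk at each one."""
    s, states, rewards = s0, [s0], []
    for _ in range(H):
        candidates = propose(s, num_candidates, rng)      # (N, n * action_dim)
        chunk = candidates[int(np.argmax(q_fn(s, candidates)))]
        rewards.append(reward(s, chunk))                  # chunk-level reward
        s = dynamics(s, chunk)                            # jumps n env steps at once
        states.append(s)
    return np.stack(states), np.asarray(rewards)

# Toy stand-ins (assumptions for illustration only).
N_STEPS, ACTION_DIM = 10, 2
GOAL = np.array([5.0, 5.0])

def toy_propose(s, num_candidates, rng):
    return 0.1 * rng.standard_normal((num_candidates, N_STEPS * ACTION_DIM))

def toy_dynamics(s, chunk):
    return s + chunk.reshape(N_STEPS, ACTION_DIM).sum(axis=0)

def toy_reward(s, chunk):
    return -float(np.linalg.norm(toy_dynamics(s, chunk) - GOAL))

def toy_q(s, candidates):
    return np.array([toy_reward(s, c) for c in candidates])

states, rewards = imagine_rollout(
    np.zeros(2), toy_propose, toy_q, toy_dynamics, toy_reward,
    H=10, num_candidates=32, rng=np.random.default_rng(0),
)
print(states.shape, rewards.shape)  # (11, 2) (10,)
```

The dynamics function is called 10 times, yet the final state corresponds to 100 environment steps.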

Loss & Training¶

Dynamics and reward models are trained via supervised learning: \(L_{dyn}=\mathbb E\|p_\psi(s_t,a_{t:t+n-1})-s_{t+n}\|_2^2\) and \(L_{rew}=\mathbb E\|r_\psi(s_t,a_{t:t+n-1})-r_t\|_2^2\), where \(r_t\) is the chunk-level discounted reward. In goal-conditioned tasks, a success prediction network parameterizes reward and terminal signals.
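Both objectives are plain mean squared errors on chunk-level targets. A minimal NumPy sketch, assuming batched predictions are already available (the function and argument names are hypothetical):

```python
import numpy as np

def chunk_model_losses(pred_next_state, next_state, pred_chunk_reward, chunk_reward):
    """Return (L_dyn, L_rew) for a batch of action-chunk samples.

    pred_next_state, next_state:     (batch, state_dim), prediction and target s_{t+n}
    pred_chunk_reward, chunk_reward: (batch,), prediction and target chunk reward
    """
    l_dyn = np.mean(np.sum((pred_next_state - next_state) ** 2, axis=-1))
    l_rew = np.mean((pred_chunk_reward - chunk_reward) ** 2)
    return float(l_dyn), float(l_rew)

l_dyn, l_rew = chunk_model_losses(
    np.zeros((2, 3)), np.ones((2, 3)), np.zeros(2), np.array([1.0, 3.0])
)
print(l_dyn, l_rew)  # 3.0 5.0
```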

Value functions are trained on imagined trajectories regenerated each epoch. Default hyperparameters remain largely fixed across tasks: chunk size \(n=10\), rollout length \(H=10\), discount \(\gamma=0.999\), 10 flow steps, and \(N_{test}=32\) for rejection sampling.
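For reference, the stated defaults can be collected in a small configuration object. The class and field names are assumptions made for this sketch and do not mirror the configuration format of the released code; only the numeric values come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MACDefaults:
    chunk_size: int = 10           # n: actions per chunk
    rollout_length: int = 10       # H: chunk-level model calls per imagined rollout
    discount: float = 0.999        # gamma, per environment step
    flow_steps: int = 10           # integration steps of the flow policy
    num_test_candidates: int = 32  # N_test: candidates for rejection sampling

    @property
    def env_steps_per_rollout(self) -> int:
        return self.chunk_size * self.rollout_length

    @property
    def chunk_discount(self) -> float:
        # Discount applied across one whole chunk of n environment steps.
        return self.discount ** self.chunk_size

cfg = MACDefaults()
print(cfg.env_steps_per_rollout)  # 100
```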

Key Experimental Results¶

Main Results¶

MAC is evaluated on OGBench long-horizon goal-conditioned tasks with 100M-transition datasets. The hardest tasks require up to 3000 environment steps.

Environment	Strongest Model-Free Ref.	Strongest Prev. Model-Based	Ours (MAC)	Conclusion
humanoidmaze-medium-navigate	n-SAC+BC 98 ±2	MOPO 27 ±5	36 ±2	Best among MB methods, but behind MF
humanoidmaze-giant-navigate	n-SAC+BC 82 ±5	All Prev. MB 0 ±0	0 ±0	Contact-rich locomotion remains a challenge
cube-double-play	SHARSA 95 ±3	FMPC 37 ±13	100 ±1	Significantly outperforms MF and prev. MB
cube-octuple-play	SHARSA 19 ±3	All Prev. MB 0 ±0	30 ±6	Chunking advantage is clearest in long manipulation
puzzle-3x3-play	SHARSA 100 ±0	MOPO 19 ±2	100 ±0	Reaches perfect score, matching top MF
puzzle-4x5-play	SHARSA 91 ±4	LEQ 1 ±3	99 ±3	Outperforms all existing methods on hardest puzzle

In reward-based tasks, MAC demonstrates strong general applicability, achieving the highest success rates in 4 out of 5 environments.

Ablation Study¶

Configuration	Task/Metric	Result	Notes
Chunk length \(n=1\)	cube-octuple	~0 success	Long-horizon tasks are unsolvable without chunking
Chunk length \(n=10\)	cube-octuple	Significant Gain	Optimal balance between error reduction and learnability
Chunk length \(n=25\)	cube-octuple	Performance drop	Excessive chunk size hinders policy and Q estimation
MAC (Gaussian)	Multiple tasks	Near 0 success	Gaussian policy cannot capture multi-modal chunks
MAC (FQL)	Multiple tasks	Decent but lower	Gradient-based extraction is inferior to rejection sampling
MAC (Full)	Multiple tasks	Best performance	Both flow expressivity and rejection sampling are key

Key Findings¶

Action chunks effectively reduce error accumulation. For length-100 rollouts, the MSE of the action chunk model is significantly lower than that of single-step models.
Chunk size follows a "Goldilocks" principle. \(n=10\) is stable; \(n=25\) suffers from the curse of dimensionality in the action space.
Flow rejection sampling is fundamental. Gaussian policies fail on multi-modal manipulation tasks, indicating that both expressivity and distribution constraints are required.
Contact-rich locomotion is still difficult. MAC fails on humanoidmaze-giant, likely due to the inherent difficulty of modeling discontinuous contact dynamics.
Scalability comes with manageable costs. MAC is 1.2x to 2.2x slower than methods like MOPO, but its inference latency stays in the millisecond range thanks to distillation.

Highlights & Insights¶

Decoupling "Environment Horizon" from "Model Recursion Depth". MAC allows value learning to benefit from long-horizon returns while keeping dynamic predictions stable through fewer recursive steps.
Implicit Conservatism. Unlike methods that rely on sensitive uncertainty penalty coefficients, MAC achieves safety by sampling from the behavioral flow distribution.
Purposeful Flow Matching. Flow matching is used here to handle the high dimensionality and multi-modality inherent in action sequences (chunks), not as added complexity for its own sake.
Natural Integration of AC and Rejection Sampling. Using the \(Q\)-function as a ranker for behavioral candidates effectively turns the value function into a "distribution-constrained selector."

Limitations & Future Work¶

Discontinuous Dynamics. Failure in complex locomotion environments suggests that while chunking reduces recursion, it doesn't solve the problem of predicting fundamentally difficult dynamics.
Dimensionality vs. Horizon. As \(n\) grows, the action space explosion eventually hinders \(Q\)-estimation. Adaptive chunk sizing for different control frequencies remains an open question.
Behavioral Dependency. The method relies on the behavior policy's ability to propose viable segments; performance may be capped by the quality of the offline dataset.
Stochasticity. The current deterministic action chunk model might fail in highly stochastic or partially observable environments.

vs. MOPO / MOBILE: These rely on single-step models and strict uncertainty penalties. MAC replaces these with action chunks and behavioral flow constraints.
vs. LEQ: While LEQ also focuses on conservative value estimation, MAC's on-policy value expansion with chunked rollouts proves more effective in manipulation tasks where LEQ often struggles.
vs. F-MPC: F-MPC focuses on planning; MAC integrates chunks into a full actor-critic loop for iterative improvement.
vs. Sequence Modeling (Diffuser): MAC does not generate entire trajectories but bridges RL and generative modeling by using chunked transitions within a value-based framework.

Rating¶

Novelty: ⭐⭐⭐⭐☆ Combines action chunks and flow rejection sampling into a distinct and scalable MBRL recipe.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across long-horizon OGBench tasks, D4RL, and extensive ablations.
Writing Quality: ⭐⭐⭐⭐⭐ Clearly identifies the tension in rollout length and addresses it logically.
Value: ⭐⭐⭐⭐☆ Highly relevant for long-horizon robot manipulation, though it requires further investigation for high-frequency locomotion.