Scalable Offline Model-Based RL with Action Chunks

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=WXGb9unEHo
Paper: Project Page
Code: https://github.com/kwanyoungpark/MAC
Area: Reinforcement Learning / Offline Model-Based Reinforcement Learning
Keywords: Offline RL, Model-Based RL, Action Chunks, Value Expansion, Flow Matching

TL;DR

MAC utilizes action chunk models to compress multiple single-step model calls in long-horizon offline MBRL into fewer multi-step predictions. By employing rejection sampling from a flow-based behavioral policy to select conservative and high-value action chunks, it significantly outperforms existing offline MBRL methods on 100M-scale OGBench long-horizon manipulation tasks.

Background & Motivation

Background: Offline reinforcement learning aims to train executable policies solely from existing data without environment interaction. Historically, two main paradigms exist: model-free offline RL, which performs value learning or behavioral regularization directly on the dataset; and model-based offline RL, which first learns dynamics models and then uses imagined trajectories for planning, data augmentation, or value estimation. This work focuses on model-based value expansion, as it combines generative modeling with on-policy value learning, theoretically aligning with training paradigms that have demonstrated scalability in long-horizon tasks.

Limitations of Prior Work: Standard offline model-free methods are often hindered by off-policy TD learning in long-horizon tasks: each Bellman update only considers short local intervals, leading to bias accumulation along long decision chains. While model-based RL could theoretically mitigate this via multi-step rollouts, learned single-step dynamics models must be called recursively, causing small single-step errors to explode into entirely incorrect future states after dozens or hundreds of steps.

Key Challenge: In model-based value expansion, the rollout length \(n\) plays two conflicting roles. With large \(n\), the target \(\sum_{i=0}^{n-1}\gamma^i r_{t+i}+\gamma^n \bar V(s_{t+n})\) relies less on bootstrapping, resulting in lower value target bias. However, using a single-step model \(p(s_{t+1}\mid s_t,a_t)\) requires \(n\) recursive predictions, making model error more likely to explode. In essence, value learning demands long rollouts, while dynamics prediction demands short ones.
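The trade-off above can be made concrete with a minimal sketch of the \(n\)-step target. The function names are illustrative and are not taken from the paper's code; the second helper shows how the weight on the bootstrapped value \(\bar V\) shrinks as \(n\) grows.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Return sum_{i<n} gamma^i * r_{t+i} + gamma^n * V_bar(s_{t+n}), with n = len(rewards)."""
    n = len(rewards)
    discounted_rewards = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted_rewards + (gamma ** n) * bootstrap_value


def bootstrap_weight(n: int, gamma: float) -> float:
    """Weight placed on the bootstrapped value estimate in an n-step target."""
    return gamma ** n
```

With `gamma = 0.999`, the bootstrap weight is 0.999 for `n = 1` and roughly 0.905 for `n = 100`, so a longer rollout leans more on predicted rewards and less on \(\bar V\).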

Goal: The authors aim to answer a specific question: Can offline model-based RL become a scalable solution for complex, long-horizon tasks with million- to billion-scale datasets? To achieve this, a method must satisfy three requirements: allow long rollouts for value learning to reduce short-sighted bootstrapping; minimize compounding errors from recursive model calls; and avoid exploitation of model errors by keeping the policy within the data distribution.

Key Insight: The key observation is that many long-horizon control tasks do not require step-by-step future modeling. By treating \(n\) consecutive actions as an action chunk \(a_{t:t+n-1}\), a model can directly predict the state \(s_{t+n}\) reached after \(n\) steps. With \(n=10\), for example, a 100-step environment rollout requires only 10 model calls. Because action chunk distributions are high-dimensional and multi-modal, a single Gaussian policy fits them poorly, so the authors use flow matching to train a behavioral action chunk policy and then apply rejection sampling to select action chunks within the behavioral distribution.

Core Idea: Replacing "single-step models + direct policy optimization" with "action chunk models + flow behavior policy rejection sampling" suppresses both compounding error and out-of-distribution (OOD) action exploitation in long-horizon offline MBRL.

Method

Overall Architecture

MAC (Model-Based RL with Action Chunks) reorganizes an offline RL dataset into action chunk samples: each training sample contains the current state \(s_t\), an action chunk \(a_{t:t+n-1}\) of length \(n\), the chunk-level discounted reward \(r_t=\sum_{i=0}^{n-1}\gamma^i r_{t+i}\) (with a slight abuse of notation, \(r_t\) on the left denotes the chunk reward and \(r_{t+i}\) on the right the per-step rewards), and the state \(s_{t+n}\) after \(n\) steps. On these samples, the algorithm simultaneously trains an action chunk dynamics model, an action chunk reward model, a flow-based behavioral action chunk policy, and a value function. It then uses the models to generate on-policy imagined trajectories to update the value function.
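A minimal sketch of this reorganization for a single trajectory is shown below. It assumes a stride-1 sliding window and ignores episode boundaries and terminal flags; the function name and tuple layout are hypothetical, not the layout used in the released code.

```python
from typing import List, Sequence, Tuple

# (s_t, a_{t:t+n-1}, chunk-level discounted reward, s_{t+n})
ChunkSample = Tuple[object, List[object], float, object]


def make_chunk_samples(
    states: Sequence[object],   # length T + 1
    actions: Sequence[object],  # length T
    rewards: Sequence[float],   # length T
    n: int,
    gamma: float,
) -> List[ChunkSample]:
    """Slide a window of n steps over one trajectory (assumed stride of 1)."""
    samples: List[ChunkSample] = []
    num_steps = len(actions)
    for t in range(num_steps - n + 1):
        chunk = list(actions[t:t + n])
        chunk_reward = sum((gamma ** i) * rewards[t + i] for i in range(n))
        samples.append((states[t], chunk, chunk_reward, states[t + n]))
    return samples
```

A trajectory with `T` transitions yields `T - n + 1` chunk samples under this windowing assumption.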

The execution policy is not a directly learned unconstrained actor. Instead, it samples multiple candidate action chunks from the behavioral flow policy and selects the one with the highest value according to the action chunk \(Q\)-function. This selection process keeps the policy close to the offline data distribution while maximizing returns within that distribution.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Offline Trajectory Data"] --> B["Action Chunk Model:<br/>Multi-step future prediction"]
    A --> C["Flow Behavior Rejection Sampling:<br/>Selecting chunks within data distribution"]
    C --> D["One-step Distillation Policy:<br/>Accelerating flow sampling"]
    B --> E["Action Chunk Value Expansion:<br/>Training V/Q with imagined trajectories"]
    D --> E
    E --> C
    E --> F["Execution: first action<br/>or chunk control"]

Key Designs

1. Action Chunk Model: Reducing compounding error via multi-step future prediction

Standard model-based RL typically learns \(p(s_{t+1}\mid s_t,a_t)\). To obtain the state after 100 steps, the model output must be recursively fed back 100 times. MAC instead learns action chunk dynamics \(p_\psi(s_{t+n}\mid s_t,a_{t:t+n-1})\). The input is a whole sequence of actions, and the output is the future state after \(n\) steps. With a default \(n=10\), a 100-step environment rollout requires only \(H=10\) model calls. This design reconciles long-horizon value targets with a shallow recursion depth. In experiments sweeping \(n\) from 1 to 25, the action chunk model shows significantly less MSE divergence than the single-step model.
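The recursion-depth argument can be sketched as follows. `rollout` feeds a model's own prediction back into itself once per chunk and counts the calls; `toy_chunk_model` is a made-up stand-in for a learned model and says nothing about actual prediction error.

```python
from typing import Callable, List, Sequence, Tuple


def rollout(
    model: Callable[[float, Sequence[float]], float],
    state: float,
    action_seq: Sequence[float],
    n: int,
) -> Tuple[float, int]:
    """Apply a chunk model recursively; return the final state and the number of model calls."""
    calls = 0
    for start in range(0, len(action_seq), n):
        state = model(state, action_seq[start:start + n])  # prediction is fed back as input
        calls += 1
    return state, calls


def toy_chunk_model(state: float, chunk: Sequence[float]) -> float:
    """Hypothetical dynamics: the state moves by the sum of the actions in the chunk."""
    return state + sum(chunk)


actions: List[float] = [1.0] * 100
_, single_step_calls = rollout(toy_chunk_model, 0.0, actions, n=1)
_, chunked_calls = rollout(toy_chunk_model, 0.0, actions, n=10)
```

Here `single_step_calls` is 100 and `chunked_calls` is 10: the same 100 environment steps are covered, but the chunked rollout passes through the model's own outputs ten times fewer.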

2. Flow Behavior Rejection Sampling: Maximizing value within data-supported action chunks

While action chunks stabilize dynamics, they complicate policy modeling. Chunks of length \(n\) reside in \(A^n\), where sequences are often highly multi-modal (e.g., at the same state, one could move an object left or right). Gaussian actors often average these modes, producing OOD chunks. MAC instead fits the behavior distribution with a flow-matching behavioral cloning policy \(\pi_\theta(a_{t:t+n-1}\mid s_t)\). During policy extraction, instead of gradient ascent on \(Q\), it samples \(N\) candidate chunks \(a^{(1)},\dots,a^{(N)}\sim\pi_\theta(\cdot\mid s_t)\) and selects the one with the highest \(Q\)-value: \(\pi(s_t)\overset{d}{=}\arg\max_{a^{(i)},\,i=1,\dots,N}Q(s_t,a^{(i)})\). This restricts optimization to the behavioral distribution, eliminating the need for manually tuned uncertainty penalties like those in MOPO or MOBILE.
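The selection rule amounts to best-of-N sampling, sketched below. The sampler and \(Q\)-function are toy stand-ins invented for illustration (a two-mode "left or right" behavior and a \(Q\) that prefers moving right); a real implementation would query the learned flow policy and critic.

```python
import random
from typing import Callable, List

Chunk = List[float]


def best_of_n(
    state: float,
    sample_chunk: Callable[[float, random.Random], Chunk],
    q_value: Callable[[float, Chunk], float],
    num_candidates: int,
    rng: random.Random,
) -> Chunk:
    """Draw candidates from the behavior policy and return the one with the highest Q-value."""
    candidates = [sample_chunk(state, rng) for _ in range(num_candidates)]
    return max(candidates, key=lambda chunk: q_value(state, chunk))


def toy_bimodal_sampler(state: float, rng: random.Random, n: int = 3) -> Chunk:
    """Hypothetical behavior policy with two modes: move left (-1) or right (+1) for n steps."""
    direction = rng.choice((-1.0, 1.0))
    return [direction] * n


def toy_q(state: float, chunk: Chunk) -> float:
    """Hypothetical critic that prefers moving right."""
    return sum(chunk)
```

Because the returned chunk is always one of the sampled candidates, it stays on a behavior mode. It can never be the average of the two modes (all zeros), which is the failure the text attributes to Gaussian actors.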

3. One-step Distillation Policy: Enabling flow rejection sampling for training rollouts

Pure flow sampling is computationally expensive. If each chunk requires \(N\) candidates and each candidate uses \(F\) flow steps, a single rejection sampling needs \(NF\) network queries. For \(N=8, F=10\), this is 80 queries per chunk. To mitigate this during training, a one-step MLP policy \(\pi_\omega(s_t,z)\) is trained to distill the flow policy output: \(L_{distill}=\mathbb E\|\pi_\omega(s_t,z)-[\pi_\theta(s_t,z)]_\times\|_2^2\). This reduces training and inference costs to \(N\) queries while preserving the expressive power of multi-step flow.
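The query arithmetic can be checked with a small sketch. It assumes the flow policy is sampled by fixed-step Euler integration of a velocity field, which is a common choice but is not confirmed here as the paper's solver; `toy_velocity` is a made-up field used only to exercise the loop.

```python
from typing import Callable, List, Sequence, Tuple


def sample_flow(
    velocity: Callable[[float, Sequence[float], float], List[float]],
    state: float,
    noise: Sequence[float],
    flow_steps: int,
) -> Tuple[List[float], int]:
    """Euler-integrate from noise (t = 0) to a sample (t = 1), counting network queries."""
    x = list(noise)
    dt = 1.0 / flow_steps
    queries = 0
    for k in range(flow_steps):
        v = velocity(state, x, k * dt)  # one network query per flow step
        queries += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, queries


def toy_velocity(state: float, x: Sequence[float], t: float) -> List[float]:
    """Hypothetical field that transports every coordinate toward the value `state`."""
    return [(state - xi) / (1.0 - t) for xi in x]


def flow_rejection_queries(num_candidates: int, flow_steps: int) -> int:
    """Queries for one rejection-sampling step with the multi-step flow policy: N * F."""
    return num_candidates * flow_steps


def distilled_rejection_queries(num_candidates: int) -> int:
    """Queries for one rejection-sampling step with a one-step distilled policy: N."""
    return num_candidates
```

For \(N=8\) and \(F=10\), `flow_rejection_queries(8, 10)` gives 80 while `distilled_rejection_queries(8)` gives 8, matching the counts in the text.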

4. Action Chunk Value Expansion: Training V/Q with on-policy imagined trajectories

MAC does not use the action chunk model solely for MPC or sequence modeling. It maintains an actor-critic structure: starting from data states, it generates imagined trajectories of \(H\) action chunks using the current rejection sampling policy, dynamics model, and reward model, and trains on-policy values on them. The value target covers \(nH\) environment steps, regressing \(V_\phi(\hat s_{t+kn})\) to the cumulative rewards plus a terminal bootstrap. The chunk-level \(Q_\phi(s_t,a_{t:t+n-1})\) is then trained to predict the chunk reward plus \(\gamma^n V_\phi(\hat s_{t+n})\), and serves as the ranker for rejection sampling.
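A minimal sketch of this value expansion follows. It assumes plain discounted returns-to-go with a single terminal bootstrap; the paper's exact target (for example, any weighting across horizons) may differ, and all function names are illustrative.

```python
from typing import Callable, List, Sequence, Tuple


def imagine(
    state: float,
    policy: Callable[[float], Sequence[float]],
    dynamics: Callable[[float, Sequence[float]], float],
    reward: Callable[[float, Sequence[float]], float],
    horizon: int,
) -> Tuple[List[float], List[float]]:
    """Roll out `horizon` chunks; return imagined states (H + 1) and chunk rewards (H)."""
    states = [state]
    rewards: List[float] = []
    for _ in range(horizon):
        chunk = policy(states[-1])
        rewards.append(reward(states[-1], chunk))
        states.append(dynamics(states[-1], chunk))
    return states, rewards


def value_targets(rewards: Sequence[float], terminal_value: float, gamma: float, n: int) -> List[float]:
    """Target for V at imagined state k: chunk rewards from k onward plus a terminal bootstrap."""
    chunk_discount = gamma ** n  # each imagined transition spans n environment steps
    target = terminal_value
    targets: List[float] = []
    for r in reversed(rewards):
        target = r + chunk_discount * target
        targets.append(target)
    targets.reverse()
    return targets


def q_target(chunk_reward: float, next_value: float, gamma: float, n: int) -> float:
    """Target for Q(s_t, a_{t:t+n-1}): chunk reward plus gamma^n * V(s_{t+n})."""
    return chunk_reward + (gamma ** n) * next_value
```

The first entry of `value_targets` spans all \(nH\) environment steps, while later entries cover progressively shorter suffixes of the same imagined trajectory.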

Mechanism Example

In a cube-octuple manipulation task requiring multiple object movements, a single-step model would recurse from \(s_t\) to \(s_{t+100}\), amplifying contact errors at each step. MAC segments data into chunks of \(n=10\). At \(s_t\), the flow policy generates 32 candidate 10-step sequences; the \(Q\)-function selects the one with the highest predicted long-term value. The dynamics model predicts \(s_{t+10}\) and the cumulative reward directly. Repeating this \(H=10\) times yields a trajectory of 100 environment steps with a recursion depth of only 10. Value functions see returns at a 100-step scale, while policy candidates remain grounded in behavioral data.

Loss & Training

Dynamics and reward models are trained via supervised regression: \(L_{dyn}=\mathbb E\|p_\psi(s_t,a_t)-s_{t+n}\|_2^2\) and \(L_{rew}=\mathbb E\|r_\psi(s_t,a_t)-r_t\|_2^2\), where \(a_t\) abbreviates the chunk \(a_{t:t+n-1}\) and \(r_t\) is the chunk-level discounted reward. In goal-conditioned tasks, a success prediction network parameterizes the reward and terminal signals.
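The two regression losses can be written out as a minimal sketch over a batch of chunk samples. The batch layout `(s_t, chunk, chunk_reward, s_{t+n})` and the function names are assumptions for illustration; states are plain lists of floats and the models are arbitrary callables.

```python
from typing import Callable, List, Sequence, Tuple

# (s_t, a_{t:t+n-1}, chunk reward, s_{t+n}); hypothetical layout
Sample = Tuple[List[float], List[float], float, List[float]]


def squared_l2(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def dynamics_loss(
    dynamics: Callable[[List[float], List[float]], List[float]],
    batch: Sequence[Sample],
) -> float:
    """Batch mean of ||p(s_t, chunk) - s_{t+n}||^2."""
    return sum(squared_l2(dynamics(s, c), s_next) for s, c, _, s_next in batch) / len(batch)


def reward_loss(
    reward: Callable[[List[float], List[float]], float],
    batch: Sequence[Sample],
) -> float:
    """Batch mean of (r(s_t, chunk) - chunk_reward)^2."""
    return sum((reward(s, c) - r) ** 2 for s, c, r, _ in batch) / len(batch)
```

Both losses regress directly onto quantities \(n\) steps ahead, so no intermediate states are predicted during training.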

Value functions are trained on imagined trajectories regenerated each epoch. Default hyperparameters remain largely fixed across tasks: chunk size \(n=10\), rollout length \(H=10\), discount \(\gamma=0.999\), 10 flow steps, and \(N_{test}=32\) for rejection sampling.
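For reference, the defaults listed above can be collected in a small configuration sketch. The class and field names are hypothetical and do not mirror the configuration of the released code; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MACDefaults:
    """Default hyperparameters as reported in this summary (field names are illustrative)."""

    chunk_size: int = 10           # n
    rollout_length: int = 10       # H, counted in chunks
    discount: float = 0.999        # gamma, per environment step
    flow_steps: int = 10           # integration steps for the flow policy
    num_test_candidates: int = 32  # N_test for rejection sampling

    @property
    def env_steps_per_rollout(self) -> int:
        """Environment steps covered by one imagined rollout: n * H."""
        return self.chunk_size * self.rollout_length

    @property
    def chunk_discount(self) -> float:
        """Discount applied across one chunk: gamma^n."""
        return self.discount ** self.chunk_size
```

With these defaults, one imagined rollout covers `MACDefaults().env_steps_per_rollout == 100` environment steps.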

Key Experimental Results

Main Results

Evaluations are conducted on OGBench long-horizon goal-conditioned tasks with 100M-transition datasets. The hardest tasks require up to 3000 environment steps.

| Environment | Strongest Model-Free Ref. | Strongest Prev. Model-Based | Ours (MAC) | Conclusion |
|---|---|---|---|---|
| humanoidmaze-medium-navigate | n-SAC+BC 98 ±2 | MOPO 27 ±5 | 36 ±2 | Best among MB methods, but behind MF |
| humanoidmaze-giant-navigate | n-SAC+BC 82 ±5 | All Prev. MB 0 ±0 | 0 ±0 | Contact-rich locomotion remains a challenge |
| cube-double-play | SHARSA 95 ±3 | FMPC 37 ±13 | 100 ±1 | Significantly outperforms MF and prev. MB |
| cube-octuple-play | SHARSA 19 ±3 | All Prev. MB 0 ±0 | 30 ±6 | Chunking advantage is clearest in long manipulation |
| puzzle-3x3-play | SHARSA 100 ±0 | MOPO 19 ±2 | 100 ±0 | Reaches perfect score, matching top MF |
| puzzle-4x5-play | SHARSA 91 ±4 | LEQ 1 ±3 | 99 ±3 | Outperforms all existing methods on hardest puzzle |

In reward-based tasks, MAC demonstrates strong general applicability, achieving the highest success rates in 4 out of 5 environments.

Ablation Study

| Configuration | Task/Metric | Result | Notes |
|---|---|---|---|
| Chunk length \(n=1\) | cube-octuple | ~0 success | Long-horizon tasks are unsolvable without chunking |
| Chunk length \(n=10\) | cube-octuple | Significant gain | Optimal balance between error reduction and learnability |
| Chunk length \(n=25\) | cube-octuple | Performance drop | Excessive chunk size hinders policy and Q estimation |
| MAC (Gaussian) | Multiple tasks | Near 0 success | Gaussian policy cannot capture multi-modal chunks |
| MAC (FQL) | Multiple tasks | Decent but lower | Gradient-based extraction is inferior to rejection sampling |
| MAC (Full) | Multiple tasks | Best performance | Both flow expressivity and rejection sampling are key |

Key Findings

  • Action chunks effectively reduce error accumulation. MSE for length-100 rollouts is significantly lower than that of single-step models.
  • Chunk size follows a "Goldilocks" principle. \(n=10\) is stable; \(n=25\) suffers from the curse of dimensionality in the action space.
  • Flow rejection sampling is fundamental. Gaussian policies fail on multi-modal manipulation tasks, indicating that expressivity and distribution constraints are both required.
  • Contact-rich locomotion is still difficult. MAC fails on humanoidmaze-giant, likely due to the inherent difficulty of modeling discontinuous contact dynamics.
  • Scalability comes with manageable costs. MAC is 1.2x to 2.2x slower than methods like MOPO, but inference stays at the millisecond level thanks to distillation.

Highlights & Insights

  • Decoupling "Environment Horizon" from "Model Recursion Depth". MAC allows value learning to benefit from long-horizon returns while keeping dynamic predictions stable through fewer recursive steps.
  • Implicit Conservatism. Unlike methods that rely on sensitive uncertainty penalty coefficients, MAC achieves safety by sampling from the behavioral flow distribution.
  • Purposeful Flow Matching. Flow matching is utilized here to handle the high dimensionality and multi-modality inherent in action sequences (chunks), rather than just for increased complexity.
  • Natural Integration of AC and Rejection Sampling. Using the \(Q\)-function as a ranker for behavioral candidates effectively turns the value function into a "distribution-constrained selector."

Limitations & Future Work

  • Discontinuous Dynamics. Failure in complex locomotion environments suggests that while chunking reduces recursion, it doesn't solve the problem of predicting fundamentally difficult dynamics.
  • Dimensionality vs. Horizon. As \(n\) grows, the action space explosion eventually hinders \(Q\)-estimation. Adaptive chunk sizing for different control frequencies remains an open question.
  • Behavioral Dependency. The method relies on the behavior policy's ability to propose viable segments; performance may be capped by the quality of the offline dataset.
  • Stochasticity. The current deterministic action chunk model might fail in highly stochastic or partially observable environments.

Related Work & Comparisons

  • vs. MOPO / MOBILE: These rely on single-step models and strict uncertainty penalties. MAC replaces these with action chunks and behavioral flow constraints.
  • vs. LEQ: While LEQ also focuses on conservative value estimation, MAC's on-policy value expansion with chunked rollouts proves more effective in manipulation tasks where LEQ often struggles.
  • vs. F-MPC: F-MPC focuses on planning; MAC integrates chunks into a full actor-critic loop for iterative improvement.
  • vs. Sequence Modeling (Diffuser): MAC does not generate entire trajectories but bridges RL and generative modeling by using chunked transitions within a value-based framework.

Rating

  • Novelty: ⭐⭐⭐⭐☆ Combines action chunks and flow rejection sampling into a distinct and scalable MBRL recipe.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across long-horizon OGBench tasks, D4RL, and extensive ablations.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clearly identifies the tension in rollout length and addresses it logically.
  • Value: ⭐⭐⭐⭐☆ Highly relevant for long-horizon robot manipulation, though requires further investigation for high-frequency locomotion.