Reinforcement Learning with Action Chunking¶
Conference: NeurIPS 2025
arXiv: 2507.07969
Code: None
Area: Reinforcement Learning
Keywords: Action chunking, Q-learning, offline-to-online RL, sparse rewards, manipulation tasks
TL;DR¶
This paper proposes Q-chunking, which extends action chunking from imitation learning to TD-based reinforcement learning by running RL directly over a "chunked" action space, thereby improving exploration and sample efficiency in long-horizon sparse-reward tasks.
Background & Motivation¶
In the offline-to-online RL setting, a central challenge is how to leverage offline prior data to maximize sample efficiency during online learning. Action chunking — predicting a sequence of future actions rather than a single-step action — is a technique widely used in imitation learning.
Key motivations:
- Exploration difficulty: In long-horizon sparse-reward tasks, random exploration is unlikely to reach the goal.
- Insufficient use of offline data: Existing methods are not effective at obtaining good exploration policies from offline data.
- Untapped potential of action chunking in RL: Methods such as ACT have demonstrated the value of action chunking in imitation learning, yet its application in TD learning has not been systematically studied.
- Temporal consistency: Predicting actions step by step leads to temporally inconsistent behavior, which is detrimental to tasks such as robotic manipulation.
Method¶
Overall Architecture¶
Q-chunking extends the action space from single-step actions \(a_t\) to action sequences \(\mathbf{a}_t = (a_t, a_{t+1}, \ldots, a_{t+H-1})\), and runs Q-learning directly over this chunked space.
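Since no official code is linked above, here is a minimal sketch (not the authors' implementation; the helper name `make_chunked_transitions` and its signature are illustrative assumptions) of how single-step trajectories could be repackaged into the chunked transitions that the Q-function and policy operate on:

```python
import numpy as np

def make_chunked_transitions(states, actions, rewards, H, gamma=0.99):
    """Repackage one trajectory into chunked transitions.

    Assumes `states` has length T+1 and `actions`/`rewards` have length T.
    Each output tuple is (s_t, flattened a_{t:t+H-1}, sum_{k<H} gamma^k r_{t+k},
    s_{t+H}), i.e. the macro-action transition the chunked Q-function sees.
    """
    T = len(actions)
    out = []
    for t in range(T - H + 1):
        chunk = np.asarray(actions[t:t + H]).ravel()              # macro-action
        ret = sum(gamma ** k * rewards[t + k] for k in range(H))  # in-chunk return
        out.append((states[t], chunk, ret, states[t + H]))
    return out
```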
Key Designs¶
1. Chunked Action Space
- \(H\) consecutive actions are packed into a single "macro-action": \(\mathbf{a} = (a_0, a_1, \ldots, a_{H-1})\)
- The Q-function is defined over chunked actions: \(Q(s, \mathbf{a})\)
- The policy outputs chunked actions: \(\pi(\mathbf{a} | s)\)
2. Unbiased \(n\)-step Returns
- Chunked actions naturally correspond to \(H\)-step TD targets: \(Q(s_t, \mathbf{a}_t) \leftarrow \sum_{k=0}^{H-1} \gamma^k r_{t+k} + \gamma^H \max_{\mathbf{a}'} Q(s_{t+H}, \mathbf{a}')\) (a sketch of this target computation follows this list)
- Unlike standard multi-step returns, this \(H\)-step return is unbiased: all \(H\) intermediate actions are part of the macro-action being evaluated rather than coming from a separate behavior policy, so no importance-sampling or truncation correction is needed.
3. Offline-to-Online Transition
- Offline phase: the Q-function and policy are trained on action chunks from offline data.
- Online phase: temporally consistent behavioral patterns learned from offline data are leveraged for exploration.
- Key insight: chunked actions from offline data provide a more temporally consistent exploration policy.
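As a concrete illustration of the \(H\)-step target above, the sketch below computes it in PyTorch. In continuous control, the max over next chunks is typically approximated by evaluating the target Q-function at a chunk sampled from the current policy; the function and argument names here are illustrative assumptions, not the authors' code.

```python
import torch

def chunked_td_target(q_target, reward_chunk, next_state, next_chunk, gamma, H):
    """H-step TD target for a chunked Q-function Q(s, a_{t:t+H-1}).

    `reward_chunk` already holds sum_{k<H} gamma^k r_{t+k} (computed when the
    transition was chunked), so the target is that sum plus the bootstrapped
    value gamma^H * Q(s_{t+H}, a'). Here `next_chunk` is an action chunk
    sampled from the current policy at s_{t+H}, the usual actor-critic
    stand-in for the max over next chunks in continuous action spaces.
    """
    with torch.no_grad():
        next_q = q_target(next_state, next_chunk)   # Q(s_{t+H}, a')
        return reward_chunk + (gamma ** H) * next_q
```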
Loss & Training¶
- Offline phase: conservative Q-learning in the CQL style, applied to the chunked action space (see the sketch after this list).
- Online phase: online fine-tuning in the SAC/TD3 style.
- Chunk size \(H\): treated as a hyperparameter, typically set to 5–20.
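The exact offline objective is not spelled out here; as an illustration of what a CQL-style conservative term looks like when the action argument is an \(H\)-step chunk, consider the following sketch. `q_net` and `policy` are assumed callables, and the logsumexp over policy samples is the standard CQL approximation of the soft maximum.

```python
import torch

def conservative_chunk_penalty(q_net, states, data_chunks, policy, num_samples=4):
    """CQL-style conservative term with H-step action chunks as the action.

    Pushes down Q-values of chunks sampled from the current policy and pushes
    up Q-values of chunks observed in the offline data; structurally identical
    to single-step CQL, only the action argument is now a flattened chunk.
    """
    # Q-values of action chunks drawn from the (stochastic) policy
    sampled_chunks = [policy(states) for _ in range(num_samples)]        # each (B, H*act_dim)
    q_sampled = torch.stack([q_net(states, a) for a in sampled_chunks])  # (num_samples, B)
    push_down = torch.logsumexp(q_sampled, dim=0).mean()  # soft-max over sampled chunks
    # Q-values of chunks actually present in the offline dataset
    push_up = q_net(states, data_chunks).mean()
    return push_down - push_up
```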
Key Experimental Results¶
Main Results¶
Long-horizon manipulation tasks (normalized success rate, 100K online steps):
| Method | Nut Assembly | Pick-Place | Stack | Can | Average |
|---|---|---|---|---|---|
| CQL → SAC | 12% | 25% | 8% | 35% | 20.0% |
| IQL → SAC | 18% | 32% | 12% | 42% | 26.0% |
| Cal-QL | 22% | 38% | 15% | 48% | 30.8% |
| RLPD | 28% | 42% | 18% | 52% | 35.0% |
| Q-chunking (Ours) | 45% | 62% | 35% | 72% | 53.5% |
Offline-only performance comparison:
| Method | Nut Assembly | Pick-Place | Stack | Can |
|---|---|---|---|---|
| CQL | 10% | 22% | 6% | 30% |
| IQL | 15% | 28% | 10% | 38% |
| Q-chunking (offline) | 25% | 40% | 18% | 52% |
Ablation Study¶
Effect of chunk size \(H\) (Nut Assembly success rate):
| H | Offline | Online 100K | Online 500K |
|---|---|---|---|
| 1 (no chunking) | 10% | 18% | 35% |
| 5 | 18% | 35% | 55% |
| 10 | 25% | 45% | 68% |
| 20 | 22% | 42% | 65% |
| 50 | 15% | 30% | 50% |
Key Findings¶
- At 100K online steps, Q-chunking reaches an average success rate of 53.5%, roughly a 50% relative improvement over the best baseline, RLPD (35.0%).
- Chunked actions already yield a better initial policy during the offline phase.
- The optimal chunk size is approximately 10; excessively long chunks reduce adaptability.
- Temporally consistent exploration is the key driver of improvement — chunking eliminates the "jittery" exploratory behavior characteristic of step-wise policies.
Highlights & Insights¶
- Simple yet effective: The core modification is merely a redefinition of the action space, introducing no new losses or architectural components.
- Dual benefits: The approach obtains better initialization from offline data while simultaneously achieving better exploration during online fine-tuning.
- Unbiased multi-step returns: Chunked actions naturally provide unbiased \(n\)-step returns, circumventing the bias issues of conventional multi-step methods.
Limitations & Future Work¶
- Chunk size \(H\) is a critical hyperparameter that requires task-specific tuning.
- In highly dynamic tasks demanding rapid reactions, chunking may reduce responsiveness.
- High-dimensional chunked action spaces increase the difficulty of learning the Q-function.
- Validation on physical robots has not yet been conducted.
Related Work & Insights¶
- ACT (Zhao et al.): Pioneering work on action chunking in imitation learning.
- Cal-QL, RLPD: Foundational methods for offline-to-online RL.
- Temporal Abstraction: Options and macro-actions in hierarchical RL.
Rating¶
- ⭐ Novelty: 8/10 — Extending action chunking to TD learning is a natural and effective idea.
- ⭐ Value: 8/10 — Directly relevant to practical tasks such as robotic manipulation.
- ⭐ Writing Quality: 8/10 — 36 pages with thorough experiments and in-depth analysis.