Reinforcement Learning with Action Chunking¶
Conference: NeurIPS 2025
arXiv: 2507.07969
Code: None
Area: Reinforcement Learning
Keywords: Action chunking, Q-learning, offline-to-online RL, sparse rewards, manipulation tasks
TL;DR¶
This paper proposes Q-chunking, which extends action chunking from imitation learning to TD-based reinforcement learning by running RL directly over a "chunked" action space, thereby improving exploration and sample efficiency in long-horizon sparse-reward tasks.
Background & Motivation¶
In the offline-to-online RL setting, a central challenge is how to leverage offline prior data to maximize sample efficiency during online learning. Action chunking — predicting a sequence of future actions rather than a single-step action — is a technique widely used in imitation learning.
Key motivations:
- Exploration difficulty: In long-horizon sparse-reward tasks, random exploration is unlikely to reach the goal.
- Insufficient use of offline data: Existing methods are not effective at obtaining good exploration policies from offline data.
- Untapped potential of action chunking in RL: Methods such as ACT have demonstrated the value of action chunking in imitation learning, yet its application in TD learning has not been systematically studied.
- Temporal consistency: Predicting actions step by step leads to temporally inconsistent behavior, which is detrimental to tasks such as robotic manipulation.
Method¶
Overall Architecture¶
Q-chunking extends the action space from single-step actions \(a_t\) to action sequences \(\mathbf{a}_t = (a_t, a_{t+1}, \ldots, a_{t+H-1})\), and runs Q-learning directly over this chunked space.
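Since no official code is linked above, here is a minimal sketch (not the authors' implementation; the helper name `make_chunked_transitions` and its signature are illustrative assumptions) of how single-step trajectories could be repackaged into the chunked transitions that the Q-function and policy operate on:

```python
import numpy as np

def make_chunked_transitions(states, actions, rewards, H, gamma=0.99):
    """Repackage one trajectory into chunked transitions.

    Assumes `states` has length T+1 and `actions`/`rewards` have length T.
    Each output tuple is (s_t, flattened a_{t:t+H-1}, sum_{k<H} gamma^k r_{t+k},
    s_{t+H}), i.e. the macro-action transition the chunked Q-function sees.
    """
    T = len(actions)
    out = []
    for t in range(T - H + 1):
        chunk = np.asarray(actions[t:t + H]).ravel()              # macro-action
        ret = sum(gamma ** k * rewards[t + k] for k in range(H))  # in-chunk return
        out.append((states[t], chunk, ret, states[t + H]))
    return out
```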
Key Designs¶
1. Chunked Action Space
- \(H\) consecutive actions are packed into a single "macro-action": \(\mathbf{a} = (a_0, a_1, \ldots, a_{H-1})\)
- The Q-function is defined over chunked actions: \(Q(s, \mathbf{a})\)
- The policy outputs chunked actions: \(\pi(\mathbf{a} | s)\)
2. Unbiased \(n\)-step Returns
- Chunked actions naturally correspond to \(H\)-step TD targets: \(Q(s_t, \mathbf{a}_t) \leftarrow \sum_{k=0}^{H-1} \gamma^k r_{t+k} + \gamma^H \max_{\mathbf{a}'} Q(s_{t+H}, \mathbf{a}')\) (a sketch of this target computation follows this list)
- Unlike standard multi-step returns, this \(H\)-step return is unbiased: all \(H\) intermediate actions are part of the macro-action being evaluated rather than coming from a separate behavior policy, so no importance-sampling or truncation correction is needed.
3. Offline-to-Online Transition
- Offline phase: the Q-function and policy are trained on action chunks from offline data.
- Online phase: temporally consistent behavioral patterns learned from offline data are leveraged for exploration.
- Key insight: chunked actions from offline data provide a more temporally consistent exploration policy.
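As a concrete illustration of the \(H\)-step target above, the sketch below computes it in PyTorch. In continuous control, the max over next chunks is typically approximated by evaluating the target Q-function at a chunk sampled from the current policy; the function and argument names here are illustrative assumptions, not the authors' code.

```python
import torch

def chunked_td_target(q_target, reward_chunk, next_state, next_chunk, gamma, H):
    """H-step TD target for a chunked Q-function Q(s, a_{t:t+H-1}).

    `reward_chunk` already holds sum_{k<H} gamma^k r_{t+k} (computed when the
    transition was chunked), so the target is that sum plus the bootstrapped
    value gamma^H * Q(s_{t+H}, a'). Here `next_chunk` is an action chunk
    sampled from the current policy at s_{t+H}, the usual actor-critic
    stand-in for the max over next chunks in continuous action spaces.
    """
    with torch.no_grad():
        next_q = q_target(next_state, next_chunk)   # Q(s_{t+H}, a')
        return reward_chunk + (gamma ** H) * next_q
```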
Loss & Training¶
- Offline phase: conservative Q-learning in the CQL style, applied to the chunked action space (see the sketch after this list).
- Online phase: online fine-tuning in the SAC/TD3 style.
- Chunk size \(H\): treated as a hyperparameter, typically set to 5–20.
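The exact offline objective is not spelled out here; as an illustration of what a CQL-style conservative term looks like when the action argument is an \(H\)-step chunk, consider the following sketch. `q_net` and `policy` are assumed callables, and the logsumexp over policy samples is the standard CQL approximation of the soft maximum.

```python
import torch

def conservative_chunk_penalty(q_net, states, data_chunks, policy, num_samples=4):
    """CQL-style conservative term with H-step action chunks as the action.

    Pushes down Q-values of chunks sampled from the current policy and pushes
    up Q-values of chunks observed in the offline data; structurally identical
    to single-step CQL, only the action argument is now a flattened chunk.
    """
    # Q-values of action chunks drawn from the (stochastic) policy
    sampled_chunks = [policy(states) for _ in range(num_samples)]        # each (B, H*act_dim)
    q_sampled = torch.stack([q_net(states, a) for a in sampled_chunks])  # (num_samples, B)
    push_down = torch.logsumexp(q_sampled, dim=0).mean()  # soft-max over sampled chunks
    # Q-values of chunks actually present in the offline dataset
    push_up = q_net(states, data_chunks).mean()
    return push_down - push_up
```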
Key Experimental Results¶
Main Results¶
Long-horizon manipulation tasks (normalized success rate, 100K online steps):
| Method | Nut Assembly | Pick-Place | Stack | Can | Average |
|---|---|---|---|---|---|
| CQL → SAC | 12% | 25% | 8% | 35% | 20.0% |
| IQL → SAC | 18% | 32% | 12% | 42% | 26.0% |
| Cal-QL | 22% | 38% | 15% | 48% | 30.8% |
| RLPD | 28% | 42% | 18% | 52% | 35.0% |
| Q-chunking (Ours) | 45% | 62% | 35% | 72% | 53.5% |
Offline-only performance comparison:
| Method | Nut Assembly | Pick-Place | Stack | Can |
|---|---|---|---|---|
| CQL | 10% | 22% | 6% | 30% |
| IQL | 15% | 28% | 10% | 38% |
| Q-chunking (offline) | 25% | 40% | 18% | 52% |
Ablation Study¶
Effect of chunk size \(H\) (Nut Assembly success rate):
| H | Offline | Online 100K | Online 500K |
|---|---|---|---|
| 1 (no chunking) | 10% | 18% | 35% |
| 5 | 18% | 35% | 55% |
| 10 | 25% | 45% | 68% |
| 20 | 22% | 42% | 65% |
| 50 | 15% | 30% | 50% |
Key Findings¶
- At 100K online steps, Q-chunking reaches an average success rate of 53.5%, roughly a 50% relative improvement over the best baseline, RLPD (35.0%).
- Chunked actions already yield a better initial policy during the offline phase.
- The optimal chunk size is approximately 10; excessively long chunks reduce adaptability.
- Temporally consistent exploration is the key driver of improvement — chunking eliminates the "jittery" exploratory behavior characteristic of step-wise policies.
Highlights & Insights¶
- Simple yet effective: The core modification is merely a redefinition of the action space, introducing no new losses or architectural components.
- Dual benefits: The approach obtains better initialization from offline data while simultaneously achieving better exploration during online fine-tuning.
- Unbiased multi-step returns: Chunked actions naturally provide unbiased \(n\)-step returns, circumventing the bias issues of conventional multi-step methods.
Limitations & Future Work¶
- Chunk size \(H\) is a critical hyperparameter that requires task-specific tuning.
- In highly dynamic tasks demanding rapid reactions, chunking may reduce responsiveness.
- High-dimensional chunked action spaces increase the difficulty of learning the Q-function.
- Validation on physical robots has not yet been conducted.
Related Work & Insights¶
- ACT (Zhao et al.): Pioneering work on action chunking in imitation learning.
- Cal-QL, RLPD: Foundational methods for offline-to-online RL.
- Temporal Abstraction: Options and macro-actions in hierarchical RL.
Rating¶
- ⭐ Novelty: 8/10 — Extending action chunking to TD learning is a natural and effective idea.
- ⭐ Value: 8/10 — Directly relevant to practical tasks such as robotic manipulation.
- ⭐ Writing Quality: 8/10 — 36 pages with thorough experiments and in-depth analysis.