Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward

Conference: AAAI 2026 Oral
arXiv: 2508.11143
Code: https://github.com/flyfaerss/ac3
Area: Robotic Manipulation / Reinforcement Learning
Keywords: action chunking, actor-critic, sparse reward, long-horizon manipulation, self-supervised reward shaping

TL;DR

AC3 is an actor-critic framework that directly learns continuous action sequences (action chunks). It stabilizes long-horizon robotic manipulation under sparse rewards with two mechanisms: an asymmetric actor update rule that updates the actor only from successful trajectories, and self-supervised, anchor-based intrinsic rewards. Across 25 tasks on BiGym and RLBench, it achieves higher success rates than existing approaches on most tasks.

Background & Motivation

Long-horizon robotic manipulation tasks (e.g., moving plates, flipping sandwiches) require executing continuous action sequences spanning multiple sub-tasks. Existing RL methods face two fundamental challenges in such sparse-reward settings: (1) the exploration space grows exponentially with action sequence length, making it difficult for agents to discover effective policies; and (2) positive rewards are only provided upon task completion, leaving intermediate steps without meaningful guidance signals.

The action chunking paradigm has proven highly successful in imitation learning (e.g., ACT, Diffusion Policy), yet these methods are fundamentally bounded by expert data quality and tend to fail under distribution shift. Incorporating action chunking into RL could in principle overcome this ceiling through online exploration, but directly learning continuous action chunks introduces severe Q-value explosion and training instability. Existing approaches either discretize the action space at the cost of precision (CQN-AS), or rely on large-scale offline data and complex distillation pipelines (Q-Chunking), neither of which is sufficiently direct or efficient.

Core Problem

How can RL directly learn high-dimensional continuous action sequences under sparse rewards and limited expert demonstrations? The core challenges are: (1) the action space dimensionality scales linearly with chunk length, destabilizing Q-value estimation to the point of explosion; and (2) policy gradients computed over predominantly failed states are harmful and lead to policy collapse.

Method

Overall Architecture

AC3 is built upon a DDPG-style off-policy actor-critic framework. The input consists of multi-view RGB images (\(84\times84\)) concatenated with proprioceptive states; the output is a continuous action chunk of dimension \(C \times d_a\) (default \(C=16\)). The design is organized into three components:
1. Actor network that directly predicts continuous action chunks
2. Critic network that estimates Q-values using intra-chunk \(n\)-step returns
3. Self-supervised reward shaping module that provides intrinsic reward signals at anchor points

Key Designs

Asymmetric Actor Update: The most critical design. The actor is updated exclusively from successful trajectories (expert demonstrations and online successful rollouts), not from all experience. The rationale is that under high-dimensional sparse rewards, the critic's Q-value estimates are highly unreliable over the majority of the state space; performing policy gradient updates in these regions causes the policy to be misled by erroneous gradients, resulting in collapse. Restricting training to a "trusted region" (where the value function is most reliable) is equivalent to optimizing the policy only on the manifold of successful data. The total actor loss is a weighted combination of a BC loss and a Q loss: \(\mathcal{L}_\theta = \lambda_{BC} \mathcal{L}_\theta^{BC} + \lambda_Q \mathcal{L}_\theta^Q\), where \(\lambda_{BC}=1.0\) and \(\lambda_Q=0.1\).
Intra-Chunk \(n\)-step Return: Rather than estimating the value of an entire chunk (which introduces excessive variance), the critic uses \(n\)-step returns with \(n < C\) (default \(n=4\), \(C=16\)). This choice exposes a key theoretical issue, the Unconstrained Action Subspace Problem: when \(n < C\), the trailing actions \(\{a_{t+n}, \ldots, a_{t+C-1}\}\) do not affect the TD target, so the actor can freely manipulate these dimensions to "fool" the critic into reporting higher Q-values, creating a destructive actor-critic feedback loop and Q-value explosion. The problem is most severe at small \(n\) (e.g., \(n=1\)); moderate values of \(n\) (4 or 8) implicitly constrain the trailing actions via neural network smoothness, effectively suppressing the explosion. TD3-style clipped double-Q is additionally employed to mitigate overestimation.
Self-Supervised Reward Shaping: A Goal Network \(G_\omega\) is pre-trained using contrastive learning (triplet loss) on expert demonstration data, learning state representations in which temporally proximate states are attracted and temporally distant or cross-trajectory states are repelled. During online training, an anchor point is set every \(K=C=16\) steps; if the nearest embedding distance from the current state to the demonstration state set falls below threshold \(m=0.5\), a fixed intrinsic reward \(a=0.1\) is granted (far smaller than the task success reward of 1). This semi-dense reward design aligns naturally with the chunk decision frequency and avoids the Q-value explosion associated with over-rewarding.
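
The anchor-based intrinsic reward described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `anchor_reward`, the 1-based step convention, and the use of Euclidean distance in embedding space are assumptions, and plain tuples stand in for the output of the Goal Network \(G_\omega\).

```python
import math

def anchor_reward(step, state_emb, demo_embs, K=16, m=0.5, a=0.1):
    """Return the intrinsic reward for one environment step.

    step      -- 1-based step index within the episode (assumed convention)
    state_emb -- embedding of the current state (stand-in for G_omega's output)
    demo_embs -- embeddings of all demonstration states
    """
    # Only anchor steps (every K steps) can earn the intrinsic reward.
    if step % K != 0:
        return 0.0
    # Distance to the nearest demonstration state (Euclidean is an assumption).
    nearest = min(math.dist(state_emb, e) for e in demo_embs)
    # Fixed bonus, far smaller than the task success reward of 1.
    return a if nearest < m else 0.0

# Hypothetical 2-D embeddings, used purely for illustration.
demo_embs = [(0.0, 0.0), (1.0, 1.0)]
near = anchor_reward(16, (0.1, 0.0), demo_embs)  # anchor step, close to a demo state
far = anchor_reward(32, (3.0, 3.0), demo_embs)   # anchor step, far from every demo state
```

Because \(K=C\), at most one bonus can be earned per executed chunk, which is what keeps the shaped reward semi-dense rather than dense.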

Loss & Training

Critic loss: MSE loss over dual Q-networks, with targets computed via \(n\)-step returns and clipped double-Q
Actor loss: BC loss (imitating successful trajectories) + Q loss (policy gradient), computed exclusively on successful trajectories
Goal Network: triplet loss pre-training, 5 seeds × 5 epochs = 25 epochs total, requiring under one hour
Network architecture: lightweight MLP+GRU; actor and critic share a state encoder structure with hidden dimension 512
ACT-style temporal ensemble is applied to smooth action execution
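
The asymmetric actor loss from the list above can be illustrated with per-sample scalars. This is a sketch under stated assumptions rather than the paper's code: the name `actor_loss` and its arguments are hypothetical, the Q loss is taken to be the negated critic value (the usual DDPG-style form), and the BC loss is treated as an already-computed number because its exact form is not given here.

```python
def actor_loss(bc_losses, q_values, is_success, lam_bc=1.0, lam_q=0.1):
    """Mean actor loss over the successful samples of a batch.

    bc_losses  -- per-sample behavior-cloning loss
    q_values   -- per-sample critic value of the actor's predicted chunk
    is_success -- per-sample flag: sample comes from a successful trajectory
    """
    # Asymmetric update: drop every sample from a failed trajectory.
    kept = [(bc, q) for bc, q, ok in zip(bc_losses, q_values, is_success) if ok]
    if not kept:
        return 0.0  # no trusted samples in this batch, so no actor signal
    # Weighted sum of BC loss and Q loss; Q loss = -Q is an assumed DDPG-style form.
    total = sum(lam_bc * bc + lam_q * (-q) for bc, q in kept)
    return total / len(kept)

# The third sample is from a failed rollout; its large Q-value is ignored.
loss = actor_loss([0.2, 0.4, 9.0], [1.0, 3.0, 100.0], [True, True, False])
```

An equivalent design would sample the actor's batch from a separate buffer that only stores successful trajectories; masking inside one batch is used here only because it keeps the example short.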

Key Experimental Results

Benchmark	AC3 vs BC baseline	AC3 vs CQN-AS	AC3 vs DrQ-v2
BiGym (15 tasks, 10 demos)	Significantly outperforms on most tasks	Better on most tasks	Substantially better
RLBench (10 tasks, 100 demos)	Outperforms on most tasks	Comparable (with simpler architecture)	Substantially better

Method	Trainable Parameters	Inference Speed
Chunk-wise BC	6.56M (Actor-only)	2.9ms
CQN-AS	28.58M (Q-Network)	9.5ms
AC3	14.44M (Actor-Critic)	2.9ms (Actor-only at inference)
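
As a quick sanity check on the table above, the ratios behind the efficiency comparison work out as follows (simple arithmetic on the reported figures, rounded to two significant digits):

\[
\frac{9.5\,\text{ms}}{2.9\,\text{ms}} \approx 3.3, \qquad \frac{14.44\,\text{M}}{28.58\,\text{M}} \approx 0.51, \qquad \frac{14.44\,\text{M}}{6.56\,\text{M}} \approx 2.2
\]

AC3 therefore runs inference roughly 3.3× faster than CQN-AS with about half its trainable parameters. It carries about 2.2× the parameters of chunk-wise BC during training, but its inference time matches BC because only the actor runs at inference.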

Ablation Study

Chunk length \(C\): Complex bimanual tasks (e.g., move plate) require long chunks (\(C=16\)); \(C=4\) nearly fails entirely. Simpler tasks are less sensitive to chunk length.
\(n\)-step configuration: \(n=1\) leads to Q-value explosion on high-dimensional tasks; \(n=C=16\) introduces excessive variance; \(n=4\) or \(8\) offers the best trade-off.
Asymmetric update: Allowing the actor to learn from all experience (including failures) leads to policy collapse or severe degradation.
Reward shaping: \(r_{int}=0.1\) yields the best performance; linearly increasing intrinsic rewards or setting them equal to the task reward both cause Q-value explosion.
vs QC-FQL: With limited offline data, flow-matching-based BC pre-training is ineffective, and training on all experience causes complete policy failure on bimanual tasks.
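
The \(n\)-step configurations compared above differ only in how many in-chunk rewards enter the TD target before bootstrapping. The following is a minimal sketch, not the authors' code: the name `n_step_target`, the discount `gamma=0.99`, and the `done` handling are assumptions introduced for illustration.

```python
def n_step_target(rewards, q1_next, q2_next, n=4, gamma=0.99, done=False):
    """TD target built from the first n rewards inside a chunk.

    rewards          -- per-step rewards collected while executing the chunk
    q1_next, q2_next -- the two target critics' values at the state reached after n steps
    done             -- True if the episode ended within those n steps (assumption)
    """
    # Discounted sum of the first n in-chunk rewards.
    ret = sum(gamma ** k * r for k, r in enumerate(rewards[:n]))
    if done:
        return ret  # no bootstrapping past a terminal state
    # Clipped double-Q: bootstrap from the smaller of the two estimates.
    return ret + gamma ** n * min(q1_next, q2_next)

# Toy numbers: a reward of 1.0 on the fourth step, critics disagreeing (5.0 vs 3.0).
target = n_step_target([0.0, 0.0, 0.0, 1.0], 5.0, 3.0, n=4, gamma=0.5)
```

With \(n=4\) and \(C=16\), only the first four rewards of the chunk reach the target. Raising \(n\) toward \(C\) brings in more observed rewards at the cost of higher variance, while \(n=1\) leans almost entirely on the bootstrapped critic value.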

Highlights & Insights

The asymmetric update rule is a clean and effective insight: under sparse rewards, the critic's Q-values are unreliable over most of the state space, so the actor should optimize only in the trusted region (successful trajectories). This idea is transferable to other sparse-reward RL settings.
The theoretical analysis of the Unconstrained Action Subspace is particularly rigorous, clearly explaining why Q-value explosion occurs when \(n < C\) and why moderate \(n\) mitigates this via neural network smoothness.
The architecture is deliberately minimal (MLP+GRU), requiring no Transformer or diffusion model. Inference runs at 2.9ms, over 3× faster than CQN-AS (9.5ms), making it well-suited for real-time deployment.
With very few demonstrations (only 10 in BiGym), the method surpasses the performance ceiling of pure imitation learning through online RL.

Limitations & Future Work

No real-robot validation: All experiments are conducted on simulation benchmarks; the sim-to-real gap remains uncharacterized.
Fixed chunk length: \(C\) is a manually specified hyperparameter; sub-tasks of varying complexity may benefit from different chunk lengths, making adaptive chunk length a natural extension.
Rigid thresholds in the Goal Network: The anchor reward threshold \(m\) and reward magnitude \(a\) are hand-tuned, and their robustness warrants further investigation.
Assumes successful trajectories are available: The asymmetric update relies on the presence of successful experiences in the replay buffer; on extremely hard tasks, bootstrapping from a small number of expert demonstrations may be the only recourse early in training.
Single-task training: A separate policy is trained for each task, with no cross-task generalization.
No comparison with Diffusion Policy + RL methods (e.g., DPPO).

Related Work & Insights

vs CQN-AS: CQN-AS discretizes action chunks for Q-learning, limiting precision and flexibility; AC3 generates continuous chunks directly, with over 3× faster inference.
vs Q-Chunking (QC-FQL): QC relies on flow matching, large-scale offline data, and a full-chunk design (\(n=C\)); AC3 uses simple MLP+GRU, minimal demonstrations, and a moderate \(n<C\) design, making it more practical in low-data regimes.
vs DrQ-v2 (single-step RL): Single-step RL nearly completely fails on long-horizon sparse-reward tasks, underscoring the necessity of action chunking.
vs ACT/Diffusion Policy (pure IL): IL methods are bounded by the quality of demonstrations; AC3 breaks through this ceiling via online RL.

The asymmetric update concept may have broad applicability to other sparse-reward or high-dimensional continuous control problems, such as long-horizon decision-making in autonomous driving. The bias–variance–explosion trade-off analysis for intra-chunk \(n\)-step returns offers a useful theoretical reference for future work on RL with action chunking. The contrastive learning approach to reward shaping is likewise transferable to other sparse-reward tasks that need intermediate supervisory signals.

Rating

Novelty: ⭐⭐⭐⭐ The asymmetric update and the theoretical analysis of intra-chunk \(n\)-step returns are genuinely novel, though the overall framework is a natural extension of DDPG.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 25 tasks, extensive ablations, and comparisons against diverse baselines with thorough analysis.
Writing Quality: ⭐⭐⭐⭐ Clear structure and rigorous theoretical exposition; the appendix analysis of the unconstrained action subspace is particularly well-executed.
Value: ⭐⭐⭐⭐ Addresses key stability challenges in RL with continuous action chunking; the method is simple and practical.