HAMLET: Switch Your Vision-Language-Action Model into a History-Aware Policy

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=KcJ9U0x6kO
Code: https://myungkyukoo.github.io/hamlet/
Area: Robotics / Embodied AI (VLA, History-Aware Policy)
Keywords: Vision-Language-Action, history-aware, moment token, time-contrastive learning, memory module, long-horizon manipulation

TL;DR

HAMLET gives pretrained "single-frame" VLAs history-awareness in a plug-and-play way and at low inference cost. It appends a few learnable moment tokens (initialized via time-contrastive learning) to the VLM input and adds a lightweight memory module. On real-world long-horizon tasks, it raises the average success rate from 29.2% to 76.4%.

Background & Motivation

  • Background: Current mainstream VLAs (OpenVLA, π0, GR00T, CogACT, etc.) mostly adopt the "single-frame assumption," where action prediction depends only on the current observation, leveraging large-scale VLM priors for control.
  • Limitations of Prior Work: Robotic manipulation is inherently history-dependent (non-Markovian). For instance, whether to lift or release a block depends on whether it has been grasped; single frames cannot resolve ambiguity during occlusions. Consequently, single-frame VLAs frequently fail in long-horizon tasks.
  • Key Challenge: The direct remedy, stacking multiple past frames (multi-frame), is extremely costly. Empirical tests show that adding just 4 frames slows inference by ~35% and increases peak memory by ~3.6×. It also often suffers from "causal confusion," which lowers performance (-3.3% on RoboCasa, -8.8% on LIBERO).
  • Goal: Inject history-awareness into pretrained VLAs without large-scale retraining, ensuring low overhead and cross-backbone compatibility.
  • Key Insight: Replace "stacking" with "compression." Instead of storing raw observations for each step, HAMLET stores a compact set of "moment tokens." A lightweight memory module then selectively aggregates these tokens across time to produce history-enhanced features for the action expert.

Method

Overall Architecture

HAMLET augments the standard VLA pipeline, in which a VLM backbone \(F_\theta\) encodes the observation and instruction into a latent representation \(h_t\) and an action expert \(A_\psi\) predicts action chunks, with two components: (i) moment tokens, which are appended to the VLM input at each step to compress temporal information; and (ii) a memory module, which aggregates the historical moment tokens into history-enhanced features \(\tilde{m}'\) for the action expert. The entire system is trained end-to-end with the standard action prediction loss while the visual input remains a single frame.

flowchart LR
    A[Current Obs o_t + Inst c] --> B[VLM F_θ]
    M[Moment Tokens m_t] --> B
    B --> H[Latent h_t]
    B --> Mt["Moment Rep m'_t"]
    Mt --> C[(Cache History Tokens)]
    C --> D[Memory Module M_φ<br/>Causal Self-Attention]
    D --> E["History-Enhanced Feature m̃'"]
    H --> F[Action Expert A_ψ DiT]
    E --> F
    S[Proprioception s_t] --> F
    F --> G["Action chunk a_t..a_t+k-1"]
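
The data flow in the diagram can be sketched as a per-step control loop. This is only an illustration of the caching pattern, not the authors' implementation: `vlm_stub`, `memory_stub`, and `action_stub` are hypothetical stand-ins for \(F_\theta\), \(M_\phi\), and \(A_\psi\), and the history length `T = 8` is an assumption borrowed from the efficiency table.

```python
from collections import deque

N_M = 4  # moment tokens per step (the default quoted in these notes)
T = 8    # history length in steps (assumed for this sketch)

def vlm_stub(obs, instruction, moment_tokens):
    """Hypothetical stand-in for F_theta: returns (h_t, m'_t)."""
    h_t = [float(len(instruction)), float(obs)]  # fake latent features
    # One fake 1-d "moment representation" per moment token.
    m_prime_t = [[float(obs) + tok[0]] for tok in moment_tokens]
    return h_t, m_prime_t

def memory_stub(history):
    """Hypothetical stand-in for M_phi: plain mean over all cached tokens."""
    flat = [vec for step in history for vec in step]
    return [sum(v[0] for v in flat) / len(flat)]

def action_stub(h_t, m_tilde, proprio):
    """Hypothetical stand-in for A_psi: just concatenates its inputs."""
    return h_t + m_tilde + [proprio]

cache = deque(maxlen=T)  # the oldest step is evicted automatically
moment_tokens = [[0.0] for _ in range(N_M)]  # learnable in the real model

lengths = []
for t in range(12):
    # The VLM only ever sees the current frame; history lives in the cache.
    h_t, m_prime_t = vlm_stub(obs=t, instruction="stack", moment_tokens=moment_tokens)
    cache.append(m_prime_t)
    m_tilde = memory_stub(cache)
    action_input = action_stub(h_t, m_tilde, proprio=0.0)
    lengths.append(len(cache) * N_M)  # L = (cached steps) * n_m
```

The cached sequence length grows by `N_M` per step and is capped at `T * N_M`, so the per-step history cost stays bounded no matter how long the episode runs.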

Key Designs

1. Moment Token: Compressing steps into compact summaries instead of stacking raw frames. Storing raw observations \(o_t\) is expensive and contains redundant static backgrounds. HAMLET appends a set of learnable tokens \(m_t \in \mathbb{R}^{n_m \times d}\) to the VLM input at each step \(t\): \([h_t; m'_t] = F_\theta([o_t, c; m_t])\). Through causal attention, these tokens "summarize" the current step into \(m'_t\). By default, only 4 tokens are used per step, reducing the history cost from "multiple images" to "a few vectors."

2. Time-Contrastive Learning (TCL) Initialization: Capturing discriminative temporal cues. Randomly initialized tokens may settle on uninformative representations. HAMLET therefore pre-trains the moment tokens with a time-contrastive objective while keeping the VLM frozen. Augmented views of the same observation serve as positive samples \(z_t^+\), and different time steps \(t' \neq t\) within the same trajectory serve as hard negative samples \(z_t^-\). The objective is \[\mathcal{L}_{\mathrm{TCL}} = -\sum_{t} \log \frac{\exp(\mathrm{sim}(z_t, z_t^+)/\tau)}{\exp(\mathrm{sim}(z_t, z_t^+)/\tau) + \exp(\mathrm{sim}(z_t, z_t^-)/\tau)},\] where \(\mathrm{sim}(\cdot,\cdot)\) is a similarity function and \(\tau\) is a temperature. This forces the tokens to emphasize task-relevant regions (e.g., grippers, targets) that change over time while suppressing static backgrounds.

3. Memory Module: Selective aggregation via shallow Transformers. Simple concatenation of all moment tokens yields little gain because not all moments are equally important. HAMLET stacks the moment tokens from the last \(T\) steps into a matrix \(M' \in \mathbb{R}^{L \times d}\) (where \(L = T \cdot n_m\)) and aggregates them using standard self-attention with a causal mask \(C\), with \(Q\), \(K\), and \(V\) computed from \(M'\): \[H = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + C\right)V.\] This allows the module to pick out relevant historical moments (e.g., attending to the step where a block was last visible before being covered) based on the current context.

4. Integration with Action Prediction: Plug-and-play with single-frame input. History-enhanced features are concatenated with original VLM representations: \([a_t, \dots, a_{t+k-1}] = A_\psi([h_t; \tilde{m}'], s_t)\). Since the VLM still only processes a single frame while history flows through the external memory, HAMLET preserves the generalization of single-frame VLAs and remains backbone-agnostic (compatible with GR00T, CogACT, etc.).
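
The two formulas in the list above (the TCL objective and the causally masked attention) can be written out in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the paper's code: cosine similarity is assumed for \(\mathrm{sim}\), `tau=0.1` is an arbitrary temperature, the mask is applied per token, and the learned \(Q\)/\(K\)/\(V\) projections are omitted so that \(M'\) is passed in directly.

```python
import math

def cosine(u, v):
    """Cosine similarity, assumed here as the sim(.,.) function."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tcl_loss(anchors, positives, negatives, tau=0.1):
    """Sum over t of -log(e^{s+/tau} / (e^{s+/tau} + e^{s-/tau}))."""
    total = 0.0
    for z, z_pos, z_neg in zip(anchors, positives, negatives):
        pos = math.exp(cosine(z, z_pos) / tau)
        neg = math.exp(cosine(z, z_neg) / tau)
        total += -math.log(pos / (pos + neg))
    return total

def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d) + C) V, where C blocks positions j > i."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Iterating only over j <= i is equivalent to adding -inf for j > i.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append([sum(w * V[j][c] for j, w in enumerate(weights)) / norm
                    for c in range(len(V[0]))])
    return out

# Toy check: a negative that is orthogonal to the anchor gives a lower loss
# than a negative that is identical to it.
easy = tcl_loss([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]])
hard = tcl_loss([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]])

M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three cached tokens, d = 2
H = causal_attention(M, M, M)
```

In the toy check, the first row of `H` equals the first token because it can only attend to itself, and each later row is a weighted average of the tokens up to and including its own position.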

Key Experimental Results

Main Results

Real-world long-horizon tasks (24 trials per task, GR00T N1.5 backbone):

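| Method | History? | Pick-and-Place Twice (Success %) | Cover-and-Stack (Success %) | Swap Cubes (Success %) | Avg. |
|---|---|---|---|---|---|
| π0 | No | 25.0 | 58.3 | 12.5 | 31.9 |
| GR00T N1 | No | 25.0 | 33.3 | 33.3 | 30.6 |
| GR00T N1.5 | No | 12.5 | 37.5 | 37.5 | 29.2 |
| + Multi-frame | Yes | 45.8 | 33.3 | 58.3 | 45.8 |
| + HAMLET | Yes | 66.7 | 79.2 | 83.3 | 76.4 |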
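
With 24 trials per task, each percentage above corresponds to a whole number of successful trials. The short check below recovers those counts and the averages; it assumes every rate is simply successes divided by 24, which is an inference from the stated trial count rather than something the table spells out.

```python
TRIALS = 24  # trials per task, as stated above

rates = {
    "GR00T N1.5": [12.5, 37.5, 37.5],
    "+ Multi-frame": [45.8, 33.3, 58.3],
    "+ HAMLET": [66.7, 79.2, 83.3],
}

summary = {}
for name, row in rates.items():
    # Nearest whole number of successes out of 24 for each task.
    counts = [round(p / 100 * TRIALS) for p in row]
    # Average success rate recomputed from the pooled counts.
    avg = round(100 * sum(counts) / (TRIALS * len(row)), 1)
    summary[name] = (counts, avg)
    print(name, counts, avg)
```

Under that assumption, HAMLET succeeds in 55 of 72 trials in total, against 21 of 72 for the single-frame GR00T N1.5 baseline.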

General simulation benchmarks (GR00T N1.5):

| Method | RoboCasa 100-demo (%) | LIBERO Avg. (%) |
|---|---|---|
| GR00T N1.5 | 64.1 | 95.6 |
| + Multi-frame | 60.8 (Drop) | 86.8 (Drop) |
| + HAMLET | 66.4 | 97.6 |

Ablation Study

Component Ablation (RoboCasa 100-demo):

Moment Token TCL Memory Module Avg.
62.6
63.1
63.4
64.8
65.4

Efficiency (A100, per-timestep):

| Method | History (steps) | Latency | Peak Mem |
|---|---|---|---|
| GR00T N1.5 | 1 | 80.5ms (1.00×) | 289MB (1.00×) |
| + Multi-frame | 8 | 193.0ms (2.40×) | 2023MB (7.00×) |
| + HAMLET | 8 | 85.8ms (1.07×) | 578MB (2.00×) |
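
The multipliers in parentheses follow directly from the raw numbers. As a quick arithmetic check that uses only the values listed above:

```python
# (latency in ms, peak memory in MB) per method, copied from the table above.
rows = {
    "GR00T N1.5": (80.5, 289),
    "+ Multi-frame": (193.0, 2023),
    "+ HAMLET": (85.8, 578),
}

base_ms, base_mb = rows["GR00T N1.5"]

# Ratio of each method to the single-frame baseline.
ratios = {
    name: (round(ms / base_ms, 2), round(mb / base_mb, 2))
    for name, (ms, mb) in rows.items()
}

# Absolute extra latency per timestep, in milliseconds.
extra_ms = {name: round(ms - base_ms, 1) for name, (ms, _) in rows.items()}

for name in rows:
    print(name, ratios[name], extra_ms[name])
```

In absolute terms, HAMLET adds about 5.3 ms per timestep over the baseline, while 8-frame stacking adds 112.5 ms.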

Key Findings

  • Memory module is the core contributor: Removing it causes the largest performance drop; simple concatenation of tokens provides minimal gain, highlighting "selective aggregation" as key.
  • Multi-frame stacking hurts generalization: Stacking raw frames induces causal confusion and generalizes poorly to dynamic observations, whereas HAMLET avoids this by maintaining single-frame input.
  • Memory is transferable: A memory module pretrained on LIBERO provides gains when transferred to RoboCasa.

Highlights & Insights

  • Shift from "stacking" to "compression": Reformulates history-awareness as storing semantic tokens rather than pixels, bypassing the compute/memory wall of multi-frame methods.
  • Backbone-agnostic/Plug-and-play: Consistently improves performance across GR00T and CogACT without massive retraining, making it highly practical for deployment.
  • Initialization via TCL: Self-supervised representation learning naturally aligns tokens with task-relevant moving parts (e.g., grippers), as evidenced by attention visualizations.

Limitations & Future Work

  • Task Scale: Real-world experiments are limited to three designed tabletop tasks; generalization to open-ended long-horizon scenarios remains to be validated.
  • Hyperparameter Sensitivity: There is a clear "sweet spot" for token count (performance peaks at 4-8 and drops at 32/64); whether memory capacity suffices for extreme horizons is unexplored.
  • Baselines: Lacks direct comparison with end-to-end humanoid memory systems (e.g., concurrent work Shi et al. 2025).
  • Saturation: Gains on LIBERO are marginal (95.6% to 97.6%) as the benchmark is nearly saturated; benefits are primarily seen in specifically designed history-dependent tasks.

Related Work & Insights

  • VLA Lineage: Progresses from discrete tokens (RT-2, OpenVLA) to diffusion/flow-matching heads (π0, GR00T); HAMLET fills the "history" gap for the latter.
  • Memory Architectures: Echoes recurrent policies in RL and memory networks in NLP, but marks the first systematic design of lightweight memory for large-scale pretrained VLAs.
  • Insight: For any single-step foundation model, adding learnable tokens with a lightweight aggregator may be a universal, low-cost paradigm for injecting context/temporal awareness.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐