Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective¶
Conference: AAAI 2026 arXiv: 2511.11478 Code: libero-mem.github.io Area: Video Understanding / Robotic Manipulation Keywords: Non-Markovian Decision Making, Object-Centric Memory, State Space Models, VLA, Robotics Benchmark
TL;DR¶
This paper proposes LIBERO-Mem, a benchmark comprising 10 non-Markovian robotic manipulation tasks, and Embodied-SlotSSM, an object-centric memory VLA framework combining Slot Attention with state space models, to address the failure of visuomotor policies in long-horizon tasks that require object-level historical reasoning under partial observability.
Background & Motivation¶
Background: Humans can effortlessly recall past interactions with specific objects (e.g., where a salt shaker was placed, or whether salt has already been added to a dish), enabling precise execution of multi-step, long-horizon tasks. Current robotic visuomotor policies (e.g., OpenVLA, Octo, RT-1/2), however, typically rely solely on recent sensory inputs for decision-making and lack mechanisms for encoding and recalling object-level history.
Limitations of Prior Work:
Markovian Assumption Bottleneck: Most VLA models assume the current observation is sufficient to predict the optimal action. This assumption breaks down in tasks involving repeated steps, visually similar objects, or long-horizon temporal dependencies — where identical visual inputs may correspond to different semantic states (e.g., a bowl resting on a plate vs. a bowl just placed back onto a plate).
Benchmark Insufficiency: Existing benchmarks (e.g., RLBench, LIBERO, RoboCasa) are largely constructed under the Markovian assumption. While MemoryBench and MIKASA-Robo address memory, they lack systematic stress testing for object-level ambiguity and temporal extension.
Token Scalability Problem: OpenVLA encodes each observation with 256 dense visual tokens, and object-centric VLAs use 16 slot tokens per frame, but token counts grow linearly with both slot count and sequence length, becoming infeasible for long-horizon tasks spanning hundreds of frames.
Key Challenge: When visual observations at two time steps \(t_1\) and \(t_2\) satisfy \(\mathbf{v}_{t_1} \approx \mathbf{v}_{t_2}\) but require different actions (i.e., \(P(\mathbf{a}_{t_1}|\mathbf{v}_{1:t_1},l) \neq P(\mathbf{a}_{t_2}|\mathbf{v}_{1:t_2},l)\)), purely reactive policies will inevitably fail. This constitutes an object-level Partially Observable Markov Decision Process (POMDP).
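This failure mode can be made concrete with a toy, purely illustrative sketch (not from the paper): a reactive policy must map identical observations to identical actions, while a history-conditioned policy can disambiguate repeated states. The observation and action labels below are hypothetical.

```python
# Toy illustration: identical observations that demand different actions.
# A reactive policy maps observation -> action and is forced to repeat
# itself; a history-conditioned policy can tell the two visits apart.

# Two timesteps with (approximately) the same visual observation:
obs_sequence = ["bowl_on_plate", "bowl_on_plate"]
# Ground-truth actions differ: first lift the bowl, later lower it back.
target_actions = ["lift", "lower"]

def reactive_policy(obs):
    # Best a Markovian policy can do: one fixed answer per observation.
    return {"bowl_on_plate": "lift"}[obs]

def memory_policy(obs, history):
    # Conditions on how often this observation has already occurred.
    return "lift" if history.count(obs) == 0 else "lower"

reactive_out, memory_out, history = [], [], []
for obs in obs_sequence:
    reactive_out.append(reactive_policy(obs))
    memory_out.append(memory_policy(obs, history))
    history.append(obs)

print(reactive_out)  # ['lift', 'lift']  -- fails the second step
print(memory_out)    # ['lift', 'lower'] -- matches target_actions
```

However expressive the reactive mapping is, it cannot output both `lift` and `lower` for the same input; only the history argument breaks the tie.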
Key Insight: Drawing from object-centric learning and state space models, this work designs structured, persistent memory representations that support long-horizon non-Markovian reasoning while remaining computationally tractable.
Method¶
Overall Architecture¶
Embodied-SlotSSM comprises three core components: (1) Slot Attention, which decomposes dense visual features into discrete object-centric tokens; (2) SlotSSM, which tracks object temporal dynamics via a slot-based state space model; and (3) a Relation Encoder + LLM Action Decoder, which aligns object memory with the current scene for action prediction.
Key Designs¶
1. LIBERO-Mem Benchmark¶
Ten tasks are designed spanning four object-centric memory dimensions:
- Object Motion (OM): The robot must remember its last action (pick up or place down) to act correctly (T1, T2).
- Object Sequence (OS): Success depends on recalling how many times an object has been manipulated; visual cues alone are insufficient (T3–T6, with 3/5/7 repeated pick-and-place cycles).
- Object Relation (OR): The robot must track the temporal order and relationships of object interactions (T7–T8, swapping bowl positions).
- Object Occlusion (OO): Occluded objects require the robot to rely on memory of past placements to identify the target (T9–T10).
Distinguishing Features (compared to MemoryBench/MIKASA-Robo):
- Non-Markovian observations ✓
- Long-horizon trajectories (200–700 frames) ✓
- Sub-goal-aware evaluation ✓ (unique; supports fine-grained progress assessment)
- Object identity ambiguity ✓ (unique; visually identical bowls/plates distinguished only by asset ID)
- Temporal extension stress testing ✓ (unique)
Each task includes 120 collected trajectories (100 training + 20 validation), gathered via keyboard control with multi-key tracking to produce smooth demonstrations.
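A minimal sketch of how a sub-goal-aware metric could be computed; the exact definition used by the benchmark is an assumption here (fraction of ordered sub-goals completed per episode, averaged over rollouts), and the rollout numbers are invented.

```python
# Sub-goal completion rate (assumed form): per-episode fraction of
# sub-goals completed, averaged over evaluation rollouts. This gives
# partial credit where a binary success metric would report 0.

def subgoal_completion_rate(episodes):
    """episodes: list of (completed_subgoals, total_subgoals) pairs."""
    return sum(done / total for done, total in episodes) / len(episodes)

# Hypothetical rollouts of a 3x pick-and-place task (6 sub-goals each):
episodes = [(2, 6), (6, 6), (0, 6), (4, 6)]
rate = subgoal_completion_rate(episodes)
print(f"{rate:.1%}")  # 50.0%
```

Under a binary success metric the same four rollouts would score 25% (one full success), illustrating why sub-goal granularity matters for long-horizon tasks.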
2. Slot Attention for Object Localization¶
Dense visual embeddings \(\mathbf{v}_t \in \mathbb{R}^{K \times D_{\text{enc}}}\) are decomposed into \(N\) object-centric tokens \(\mathbf{s}_t = \{\mathbf{s}_t^1, ..., \mathbf{s}_t^N\}\), with \(N=16\). Spatial features are iteratively bound to a fixed number of learnable object queries via attention and GRU-based recurrent updates.
Temporally Consistent Initialization: Slots are randomly initialized at \(t=0\); for \(t>0\), the final slot outputs from the previous frame initialize the current frame's slots, enabling slot identity propagation and persistent object tracking across time.
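The initialization scheme can be sketched as follows. This is a simplified stand-in (dot-product attention with a weighted-mean update, omitting the paper's GRU-based recurrent refinement and learned projections); its only purpose is to show how slot outputs at frame t-1 seed the slots at frame t so identities persist.

```python
import numpy as np

# Simplified slot binding with temporally consistent initialization.
# Slots are random only at t = 0; every later frame starts from the
# previous frame's slot outputs, propagating slot identity over time.

rng = np.random.default_rng(0)
N_SLOTS, D = 4, 8          # slot count and feature dim (toy sizes)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bind_slots(features, slots, n_iters=3):
    """One frame of iterative slot binding. features: (K, D), slots: (N, D)."""
    for _ in range(n_iters):
        # Normalize attention over slots, so slots compete for features.
        attn = softmax(features @ slots.T, axis=1)      # (K, N)
        attn = attn / attn.sum(axis=0, keepdims=True)   # weighted mean
        slots = attn.T @ features                       # (N, D) update
    return slots

video = rng.normal(size=(5, 16, D))     # 5 frames, 16 feature vectors each
slots = rng.normal(size=(N_SLOTS, D))   # random init only at t = 0
per_frame_slots = []
for frame in video:
    slots = bind_slots(frame, slots)    # t > 0: init from previous output
    per_frame_slots.append(slots)

print(len(per_frame_slots), per_frame_slots[0].shape)  # 5 (4, 8)
```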
Temporal Contrastive Loss: Within a fixed temporal window, representations of the same slot in adjacent frames serve as positive pairs, while slots from different videos or positions serve as negatives; contrastive learning reinforces temporal consistency.
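An InfoNCE-style sketch of this objective follows; the paper's exact loss form, temperature, and negative-sampling scheme are assumptions here, with negatives drawn only from other slot positions for brevity.

```python
import numpy as np

# Temporal contrastive loss (InfoNCE-style sketch): slot j at frame t
# and slot j at frame t+1 form a positive pair; other slots act as
# negatives. Low loss means slot identities are temporally consistent.

rng = np.random.default_rng(1)
N, D, TAU = 4, 8, 0.1

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def temporal_contrastive_loss(slots_t, slots_t1, tau=TAU):
    """slots_t, slots_t1: (N, D) slot vectors at adjacent frames."""
    a, b = l2_normalize(slots_t), l2_normalize(slots_t1)
    logits = a @ b.T / tau                       # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives are the diagonal: slot j at t matched with slot j at t+1.
    return -np.mean(np.diag(log_prob))

slots_t = rng.normal(size=(N, D))
aligned = slots_t + 0.01 * rng.normal(size=(N, D))  # consistent identities
misaligned = np.roll(aligned, 1, axis=0)            # permuted identities

loss_aligned = temporal_contrastive_loss(slots_t, aligned)
loss_misaligned = temporal_contrastive_loss(slots_t, misaligned)
print(loss_aligned < loss_misaligned)
```

Permuting slot order between frames leaves the content identical but breaks every positive pair, so the loss rises sharply; this is exactly the identity-swapping failure the objective penalizes.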
3. SlotSSM Transient Memory¶
A Mamba-based state space model with block-diagonal \(\overline{A}_t\), \(\overline{B}_t\), and \(C_t\) matrices, where each block is conditioned solely on its corresponding slot input:

\(\mathbf{h}_t^{(j)} = \overline{A}_t^{(j)} \mathbf{h}_{t-1}^{(j)} + \overline{B}_t^{(j)} \mathbf{s}_t^{(j)}, \qquad \mathbf{y}_t^{(j)} = C_t^{(j)} \mathbf{h}_t^{(j)}\)
Window Prediction: Rather than predicting only the next step, SlotSSM predicts slot latent representations over a \(P = p+q\)-step window centered on the current time step (spanning \(p\) past and \(q\) future steps), jointly learning forward dynamics and backward temporal consistency.
Core Proposition: When \(k\) objects are visually indistinguishable at time \(t\) (i.e., \(z_t^{(i)} \approx z_t^{(j)}\)), a policy \(\pi(a_t|h_t)\) must condition on object-specific history \(\mu_t^{(j)}\) to individuate objects — precisely the capability provided by SlotSSM.
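The key structural property, that a block-diagonal SSM is equivalent to running one small, independent SSM per slot, can be checked numerically with a toy time-invariant linear SSM (Mamba's selective, input-dependent parameterization is omitted for brevity):

```python
import numpy as np

# Block-diagonal SSM sketch: running one small recurrence per slot is
# exactly equivalent to one global SSM whose A and B are block-diagonal,
# so slots never mix state and memory stays object-specific.

rng = np.random.default_rng(2)
N, D_IN, D_H, T = 3, 4, 5, 6   # slots, input dim, state dim, timesteps

A = rng.normal(scale=0.3, size=(N, D_H, D_H))   # per-slot transition
B = rng.normal(size=(N, D_H, D_IN))             # per-slot input map
x = rng.normal(size=(T, N, D_IN))               # slot inputs over time

# (1) N independent recurrences: h_t = A h_{t-1} + B s_t, per slot j.
h_sep = np.zeros((N, D_H))
for t in range(T):
    for j in range(N):
        h_sep[j] = A[j] @ h_sep[j] + B[j] @ x[t, j]

# (2) One global SSM with block-diagonal matrices on the stacked state.
A_blk = np.zeros((N * D_H, N * D_H))
B_blk = np.zeros((N * D_H, N * D_IN))
for j in range(N):
    A_blk[j*D_H:(j+1)*D_H, j*D_H:(j+1)*D_H] = A[j]
    B_blk[j*D_H:(j+1)*D_H, j*D_IN:(j+1)*D_IN] = B[j]

h_blk = np.zeros(N * D_H)
for t in range(T):
    h_blk = A_blk @ h_blk + B_blk @ x[t].reshape(-1)

print(np.allclose(h_sep.reshape(-1), h_blk))  # True
```

The zero off-diagonal blocks are what guarantee that the history \(\mu_t^{(j)}\) of slot \(j\) is computed from slot \(j\)'s inputs alone.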
4. Slot-Conditioned Action Decoding¶
Slot Fusion Module: Integrates the current slot \(\mathbf{s}_t^{(j)}\), the predicted next slot \(\hat{\mathbf{s}}_{t+1}^{(j)}\), and an oracle sub-goal embedding \(\mathbf{g}_t^{(j)}\) to produce a dynamic latent variable \(\mathbf{d}_t^{(j)}\).
Relation Encoder: Performs cross-attention between slot latents and raw visual features to produce 16 relational tokens \(\{\mathbf{r}_t^{(j)}\}_{j=1}^{16}\), enabling context-aware reasoning over object states and interactions.
Action Prediction: \(\hat{\mathbf{a}}_t \sim P_\theta(\mathbf{a}_t \mid \{\mathbf{r}_t^{(j)}\}, \{\mathbf{d}_t^{(j)}\}, l)\)
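The decoding path can be sketched end-to-end under stated assumptions: linear fusion over concatenated inputs, single-head cross-attention, and a linear readout standing in for the LLM action decoder. All parameter shapes below are illustrative, not the paper's.

```python
import numpy as np

# Slot-conditioned action decoding sketch: fuse current slot s_t,
# predicted next slot, and sub-goal embedding g_t into a dynamic latent
# d_t; cross-attend slot latents against raw visual features to get
# relational tokens r_t; predict an action from (r_t, d_t).

rng = np.random.default_rng(3)
N, D, K = 16, 8, 32          # slots, feature dim, visual feature count

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

slots = rng.normal(size=(N, D))         # current slots s_t
slots_next = rng.normal(size=(N, D))    # predicted next slots
subgoal = rng.normal(size=(N, D))       # sub-goal embeddings g_t
visual = rng.normal(size=(K, D))        # raw visual features

# Slot fusion: concatenate and project (assumed parameterization).
W_fuse = rng.normal(scale=0.1, size=(3 * D, D))
d = np.concatenate([slots, slots_next, subgoal], axis=1) @ W_fuse  # (N, D)

# Relation encoder: fused slots query the raw visual features.
attn = softmax(d @ visual.T / np.sqrt(D), axis=1)   # (N, K)
r = attn @ visual                                   # (N, D) relational tokens

# Action head stand-in: linear readout instead of the LLM decoder.
W_act = rng.normal(scale=0.1, size=(2 * N * D, 7))  # 7-DoF action vector
action = np.concatenate([r, d], axis=None) @ W_act
print(r.shape, action.shape)   # (16, 8) (7,)
```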
Loss & Training¶
The training objective combines the Slot Attention reconstruction loss, temporal contrastive loss, SlotSSM window prediction loss, and a cross-entropy action prediction loss from the VLA head. The current Naive E-SlotSSM variant uses oracle text sub-goal embeddings (e.g., "bowl 1 on plate 3") as progress supervision.
Key Experimental Results¶
Main Results¶
Success Rate on LIBERO-Goal (Standard Markovian Tasks; selected tasks shown, Avg. over the full suite):
| Method | # Tokens | bowl in stove | bowl on plate | mid drawer | top drawer→bowl | Avg. |
|---|---|---|---|---|---|---|
| SlotVLA (h=1) | 16 | 45% | 0% | 5% | 0% | 32% |
| SlotVLA (h=8) | 128 | 95% | 90% | 25% | 65% | 75.5% |
| Naive E-SlotSSM | 32 | 100% | 90% | 45% | 70% | 83.0% |
Sub-goal Completion Rate on LIBERO-Mem (Non-Markovian POMDP Tasks):
| Task | π₀ (h=1) | SlotVLA (h=1) | SlotVLA (h=8) | Naive E-SlotSSM |
|---|---|---|---|---|
| T1 (1× pick-place) | 50.0% | 0% | 50.0% | 50.0% |
| T3 (3× pick-place) | 0% | 0% | 0% | 33.3% |
| T5 (5× pick-place) | 0% | 0% | 0% | 14.3% |
| T9 (bowl in basket + move basket) | 0% | 0% | 0% | 30% |
| T10 (bowl in basket + move empty basket) | 0% | 0% | 0% | 20% |
| Avg. | 5.0% | 0% | 5.0% | 14.8% |
Ablation Study¶
Ablations are presented implicitly through cross-method comparisons across task dimensions:
| Comparison Axis | SlotVLA (h=8) | Naive E-SlotSSM | Notes |
|---|---|---|---|
| General task avg. | 75.5% | 83.0% | SSM memory +7.5% |
| POMDP task avg. | 5.0% | 14.8% | Structured memory critical |
| Token efficiency | 128 tokens | 32 tokens | 4× compression |
| Long-horizon repeated tasks (T3–T6) | All 0% | Partial success | Persistent memory effective |
Key Findings¶
- Dense tokens and naive context extension fail under POMDP: π₀ (256 tokens) and SlotVLA (h=8, 128 tokens) both average only 5.0% on LIBERO-Mem, demonstrating that simply increasing frame history does not resolve non-Markovian challenges.
- Object-centric memory provides a strong inductive bias: Embodied-SlotSSM achieves approximately 3× improvement on POMDP tasks (5% → 14.8%) through structured slot-based tracking of object identity and state.
- Slot visualizations confirm object permanence: Visualizations show the model maintains consistent attention to target objects (bowls, gripper) throughout grasping and placement sequences.
- Token concatenation causes confusion: Concatenation-based approaches struggle to disambiguate identical visual states with opposing action directions (lift vs. lower), whereas SlotSSM executes correctly.
- Absolute performance remains low (14.8%), primarily constrained by oracle sub-goal dependency.
Highlights & Insights¶
- Precise and compelling problem formulation: The paper formally characterizes non-Markovian robotic manipulation as an object-level POMDP, using \(P(\mathbf{a}_{t_1}|\mathbf{v}_{1:t_1},l) \neq P(\mathbf{a}_{t_2}|\mathbf{v}_{1:t_2},l)\) with \(\mathbf{v}_{t_1} \approx \mathbf{v}_{t_2}\) to precisely specify when the Markovian assumption fails.
- Clever benchmark design: Object identity ambiguity is induced through visually identical bowls and plates; sequential memory demands arise from repeated pick-and-place operations — simple yet incisive.
- Window prediction in SlotSSM: Predicting slot representations over a past-and-future window, rather than a single next step, simultaneously supports forward dynamics prediction and backward temporal consistency via reconstruction.
- Sub-goal-aware evaluation: Extends beyond binary success/failure metrics to enable fine-grained progress assessment (number of sub-goals completed).
- Theoretical analysis of computational efficiency: SlotSSM requires only 32 tokens versus 256 for OpenVLA and 128 for SlotVLA, achieving 4–8× compression.
Limitations & Future Work¶
- Oracle sub-goal dependency: Naive E-SlotSSM relies on oracle text sub-goal embeddings (e.g., "bowl 1 on plate 3") and cannot autonomously discover sub-goals — the most significant limitation.
- Simulation-only evaluation: LIBERO-Mem has not been extended to real physical environments; sim-to-real transfer performance remains unknown.
- Low absolute performance: A POMDP task average of 14.8% positions this as a "weak baseline," still far from practical utility.
- Fixed slot count: The fixed design of \(N=16\) slots may not generalize well to scenarios with highly variable numbers of objects.
- Limited task diversity: Tasks are predominantly pick-and-place and object-swapping; more complex long-horizon manipulation (e.g., cooking, assembly) is not addressed.
Related Work & Insights¶
- Integrating the cognitive science concept of "object permanence" into robotic manipulation is a meaningful research direction, and SlotSSM provides a computational realization of this principle.
- Compared to MemoryBench and MIKASA-Robo, LIBERO-Mem's distinctive contributions lie in object identity ambiguity and temporal extension stress testing — critical factors for evaluating "genuine memory" versus "short-term perception."
- The block-diagonal design of SlotSSM (based on Mamba) decomposes a global SSM into object-independent local SSMs, constituting an elegant form of structured inductive bias.
Rating¶
- Novelty: ⭐⭐⭐⭐ (Both the LIBERO-Mem benchmark and Embodied-SlotSSM framework are novel; the problem formulation and proposed solution are original.)
- Experimental Thoroughness: ⭐⭐⭐ (Multi-baseline comparisons are adequate, but component-level ablations and real-world experiments are absent.)
- Writing Quality: ⭐⭐⭐⭐ (Problem formalization is clear; framework description is thorough; visualizations are persuasive.)
- Value: ⭐⭐⭐⭐ (Opens a new direction for non-Markovian robotic manipulation; the benchmark has long-term research value.)