Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective¶
Conference: AAAI 2026 arXiv: 2511.11478 Code: libero-mem.github.io Area: Video Understanding / Robotic Manipulation Keywords: Non-Markovian Decision Making, Object-Centric Memory, State Space Models, VLA, Robotics Benchmark
TL;DR¶
This paper proposes LIBERO-Mem, a benchmark comprising 10 non-Markovian robotic manipulation tasks, and Embodied-SlotSSM, an object-centric memory VLA framework combining Slot Attention with state space models, to address the failure of visuomotor policies in long-horizon tasks that require object-level historical reasoning under partial observability.
Background & Motivation¶
Background: Humans can effortlessly recall past interactions with specific objects (e.g., where a salt shaker was placed, or whether salt has already been added to a dish), enabling precise execution of multi-step, long-horizon tasks. Current robotic visuomotor policies (e.g., OpenVLA, Octo, RT-1/2), however, typically rely solely on recent sensory inputs for decision-making and lack mechanisms for encoding and recalling object-level history.
Limitations of Prior Work:
Markovian Assumption Bottleneck: Most VLA models assume the current observation is sufficient to predict the optimal action. This assumption breaks down in tasks involving repeated steps, visually similar objects, or long-horizon temporal dependencies — where identical visual inputs may correspond to different semantic states (e.g., a bowl resting on a plate vs. a bowl just placed back onto a plate).
Benchmark Insufficiency: Existing benchmarks (e.g., RLBench, LIBERO, RoboCasa) are largely constructed under the Markovian assumption. While MemoryBench and MIKASA-Robo address memory, they lack systematic stress testing for object-level ambiguity and temporal extension.
Token Scalability Problem: OpenVLA encodes each observation with 256 dense visual tokens, and object-centric VLAs use 16 slot tokens per frame, but token counts grow linearly with both slot count and sequence length, becoming infeasible for long-horizon tasks spanning hundreds of frames.
Key Challenge: When visual observations at two time steps \(t_1\) and \(t_2\) satisfy \(\mathbf{v}_{t_1} \approx \mathbf{v}_{t_2}\) but require different actions (i.e., \(P(\mathbf{a}_{t_1}|\mathbf{v}_{1:t_1},l) \neq P(\mathbf{a}_{t_2}|\mathbf{v}_{1:t_2},l)\)), purely reactive policies will inevitably fail. This constitutes an object-level Partially Observable Markov Decision Process (POMDP).
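This failure mode can be made concrete with a toy, purely illustrative sketch (not from the paper): a reactive policy must map identical observations to identical actions, while a history-conditioned policy can disambiguate repeated states. The observation and action labels below are hypothetical.

```python
# Toy illustration: identical observations that demand different actions.
# A reactive policy maps observation -> action and is forced to repeat
# itself; a history-conditioned policy can tell the two visits apart.

# Two timesteps with (approximately) the same visual observation:
obs_sequence = ["bowl_on_plate", "bowl_on_plate"]
# Ground-truth actions differ: first lift the bowl, later lower it back.
target_actions = ["lift", "lower"]

def reactive_policy(obs):
    # Best a Markovian policy can do: one fixed answer per observation.
    return {"bowl_on_plate": "lift"}[obs]

def memory_policy(obs, history):
    # Conditions on how often this observation has already occurred.
    return "lift" if history.count(obs) == 0 else "lower"

reactive_out, memory_out, history = [], [], []
for obs in obs_sequence:
    reactive_out.append(reactive_policy(obs))
    memory_out.append(memory_policy(obs, history))
    history.append(obs)

print(reactive_out)  # ['lift', 'lift']  -- fails the second step
print(memory_out)    # ['lift', 'lower'] -- matches target_actions
```

However expressive the reactive mapping is, it cannot output both `lift` and `lower` for the same input; only the history argument breaks the tie.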
Key Insight: Drawing from object-centric learning and state space models, this work designs structured, persistent memory representations that support long-horizon non-Markovian reasoning while remaining computationally tractable.
Method¶
Overall Architecture¶
Embodied-SlotSSM comprises three core components: (1) Slot Attention, which decomposes dense visual features into discrete object-centric tokens; (2) SlotSSM, which tracks object temporal dynamics via a slot-based state space model; and (3) a Relation Encoder + LLM Action Decoder, which aligns object memory with the current scene for action prediction.
Key Designs¶
1. LIBERO-Mem Benchmark¶
Ten tasks are designed spanning four object-centric memory dimensions:
- Object Motion (OM): The robot must remember its last action (pick up or place down) to act correctly (T1, T2).
- Object Sequence (OS): Success depends on recalling how many times an object has been manipulated; visual cues alone are insufficient (T3–T6, with 3/5/7 repeated pick-and-place cycles).
- Object Relation (OR): The robot must track the temporal order and relationships of object interactions (T7–T8, swapping bowl positions).
- Object Occlusion (OO): Occluded objects require the robot to rely on memory of past placements to identify the target (T9–T10).
Distinguishing Features (compared to MemoryBench/MIKASA-Robo):
- Non-Markovian observations ✓
- Long-horizon trajectories (200–700 frames) ✓
- Sub-goal-aware evaluation ✓ (unique; supports fine-grained progress assessment)
- Object identity ambiguity ✓ (unique; visually identical bowls/plates distinguished only by asset ID)
- Temporal extension stress testing ✓ (unique)
Each task includes 120 collected trajectories (100 training + 20 validation), gathered via keyboard control with multi-key tracking to produce smooth demonstrations.
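A minimal sketch of how a sub-goal-aware metric could be computed; the exact definition used by the benchmark is an assumption here (fraction of ordered sub-goals completed per episode, averaged over rollouts), and the rollout numbers are invented.

```python
# Sub-goal completion rate (assumed form): per-episode fraction of
# sub-goals completed, averaged over evaluation rollouts. This gives
# partial credit where a binary success metric would report 0.

def subgoal_completion_rate(episodes):
    """episodes: list of (completed_subgoals, total_subgoals) pairs."""
    return sum(done / total for done, total in episodes) / len(episodes)

# Hypothetical rollouts of a 3x pick-and-place task (6 sub-goals each):
episodes = [(2, 6), (6, 6), (0, 6), (4, 6)]
rate = subgoal_completion_rate(episodes)
print(f"{rate:.1%}")  # 50.0%
```

Under a binary success metric the same four rollouts would score 25% (one full success), illustrating why sub-goal granularity matters for long-horizon tasks.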
2. Slot Attention for Object Localization¶
Dense visual embeddings \(\mathbf{v}_t \in \mathbb{R}^{K \times D_{\text{enc}}}\) are decomposed into \(N\) object-centric tokens \(\mathbf{s}_t = \{\mathbf{s}_t^1, ..., \mathbf{s}_t^N\}\), with \(N=16\). Spatial features are iteratively bound to a fixed number of learnable object queries via attention and GRU-based recurrent updates.
Temporally Consistent Initialization: Slots are randomly initialized at \(t=0\); for \(t>0\), the final slot outputs from the previous frame initialize the current frame's slots, enabling slot identity propagation and persistent object tracking across time.
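The initialization scheme can be sketched as follows. This is a simplified stand-in (dot-product attention with a weighted-mean update, omitting the paper's GRU-based recurrent refinement and learned projections); its only purpose is to show how slot outputs at frame t-1 seed the slots at frame t so identities persist.

```python
import numpy as np

# Simplified slot binding with temporally consistent initialization.
# Slots are random only at t = 0; every later frame starts from the
# previous frame's slot outputs, propagating slot identity over time.

rng = np.random.default_rng(0)
N_SLOTS, D = 4, 8          # slot count and feature dim (toy sizes)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bind_slots(features, slots, n_iters=3):
    """One frame of iterative slot binding. features: (K, D), slots: (N, D)."""
    for _ in range(n_iters):
        # Normalize attention over slots, so slots compete for features.
        attn = softmax(features @ slots.T, axis=1)      # (K, N)
        attn = attn / attn.sum(axis=0, keepdims=True)   # weighted mean
        slots = attn.T @ features                       # (N, D) update
    return slots

video = rng.normal(size=(5, 16, D))     # 5 frames, 16 feature vectors each
slots = rng.normal(size=(N_SLOTS, D))   # random init only at t = 0
per_frame_slots = []
for frame in video:
    slots = bind_slots(frame, slots)    # t > 0: init from previous output
    per_frame_slots.append(slots)

print(len(per_frame_slots), per_frame_slots[0].shape)  # 5 (4, 8)
```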
Temporal Contrastive Loss: Within a fixed temporal window, representations of the same slot in adjacent frames serve as positive pairs, while slots from different videos or positions serve as negatives; contrastive learning reinforces temporal consistency.
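An InfoNCE-style sketch of this objective follows; the paper's exact loss form, temperature, and negative-sampling scheme are assumptions here, with negatives drawn only from other slot positions for brevity.

```python
import numpy as np

# Temporal contrastive loss (InfoNCE-style sketch): slot j at frame t
# and slot j at frame t+1 form a positive pair; other slots act as
# negatives. Low loss means slot identities are temporally consistent.

rng = np.random.default_rng(1)
N, D, TAU = 4, 8, 0.1

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def temporal_contrastive_loss(slots_t, slots_t1, tau=TAU):
    """slots_t, slots_t1: (N, D) slot vectors at adjacent frames."""
    a, b = l2_normalize(slots_t), l2_normalize(slots_t1)
    logits = a @ b.T / tau                       # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives are the diagonal: slot j at t matched with slot j at t+1.
    return -np.mean(np.diag(log_prob))

slots_t = rng.normal(size=(N, D))
aligned = slots_t + 0.01 * rng.normal(size=(N, D))  # consistent identities
misaligned = np.roll(aligned, 1, axis=0)            # permuted identities

loss_aligned = temporal_contrastive_loss(slots_t, aligned)
loss_misaligned = temporal_contrastive_loss(slots_t, misaligned)
print(loss_aligned < loss_misaligned)
```

Permuting slot order between frames leaves the content identical but breaks every positive pair, so the loss rises sharply; this is exactly the identity-swapping failure the objective penalizes.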
3. SlotSSM Transient Memory¶
A Mamba-based state space model with block-diagonal \(\overline{A}_t\), \(\overline{B}_t\), and \(C_t\) matrices, where each block is conditioned solely on its corresponding slot input:

\(\mathbf{h}_t^{(j)} = \overline{A}_t^{(j)} \mathbf{h}_{t-1}^{(j)} + \overline{B}_t^{(j)} \mathbf{s}_t^{(j)}, \qquad \mathbf{y}_t^{(j)} = C_t^{(j)} \mathbf{h}_t^{(j)}\)
Window Prediction: Rather than predicting only the next step, SlotSSM predicts slot latent representations over a \(P = p+q\)-step window centered on the current time step (spanning \(p\) past and \(q\) future steps), jointly learning forward dynamics and backward temporal consistency.
Core Proposition: When \(k\) objects are visually indistinguishable at time \(t\) (i.e., \(z_t^{(i)} \approx z_t^{(j)}\)), a policy \(\pi(a_t|h_t)\) must condition on object-specific history \(\mu_t^{(j)}\) to individuate objects — precisely the capability provided by SlotSSM.
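The key structural property, that a block-diagonal SSM is equivalent to running one small, independent SSM per slot, can be checked numerically with a toy time-invariant linear SSM (Mamba's selective, input-dependent parameterization is omitted for brevity):

```python
import numpy as np

# Block-diagonal SSM sketch: running one small recurrence per slot is
# exactly equivalent to one global SSM whose A and B are block-diagonal,
# so slots never mix state and memory stays object-specific.

rng = np.random.default_rng(2)
N, D_IN, D_H, T = 3, 4, 5, 6   # slots, input dim, state dim, timesteps

A = rng.normal(scale=0.3, size=(N, D_H, D_H))   # per-slot transition
B = rng.normal(size=(N, D_H, D_IN))             # per-slot input map
x = rng.normal(size=(T, N, D_IN))               # slot inputs over time

# (1) N independent recurrences: h_t = A h_{t-1} + B s_t, per slot j.
h_sep = np.zeros((N, D_H))
for t in range(T):
    for j in range(N):
        h_sep[j] = A[j] @ h_sep[j] + B[j] @ x[t, j]

# (2) One global SSM with block-diagonal matrices on the stacked state.
A_blk = np.zeros((N * D_H, N * D_H))
B_blk = np.zeros((N * D_H, N * D_IN))
for j in range(N):
    A_blk[j*D_H:(j+1)*D_H, j*D_H:(j+1)*D_H] = A[j]
    B_blk[j*D_H:(j+1)*D_H, j*D_IN:(j+1)*D_IN] = B[j]

h_blk = np.zeros(N * D_H)
for t in range(T):
    h_blk = A_blk @ h_blk + B_blk @ x[t].reshape(-1)

print(np.allclose(h_sep.reshape(-1), h_blk))  # True
```

The zero off-diagonal blocks are what guarantee that the history \(\mu_t^{(j)}\) of slot \(j\) is computed from slot \(j\)'s inputs alone.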
4. Slot-Conditioned Action Decoding¶
Slot Fusion Module: Integrates the current slot \(\mathbf{s}_t^{(j)}\), the predicted next slot \(\hat{\mathbf{s}}_{t+1}^{(j)}\), and an oracle sub-goal embedding \(\mathbf{g}_t^{(j)}\) to produce a dynamic latent variable \(\mathbf{d}_t^{(j)}\).
Relation Encoder: Performs cross-attention between slot latents and raw visual features to produce 16 relational tokens \(\{\mathbf{r}_t^{(j)}\}_{j=1}^{16}\), enabling context-aware reasoning over object states and interactions.
Action Prediction: \(\hat{\mathbf{a}}_t \sim P_\theta(\mathbf{a}_t \mid \{\mathbf{r}_t^{(j)}\}, \{\mathbf{d}_t^{(j)}\}, l)\)
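The decoding path can be sketched end-to-end under stated assumptions: linear fusion over concatenated inputs, single-head cross-attention, and a linear readout standing in for the LLM action decoder. All parameter shapes below are illustrative, not the paper's.

```python
import numpy as np

# Slot-conditioned action decoding sketch: fuse current slot s_t,
# predicted next slot, and sub-goal embedding g_t into a dynamic latent
# d_t; cross-attend slot latents against raw visual features to get
# relational tokens r_t; predict an action from (r_t, d_t).

rng = np.random.default_rng(3)
N, D, K = 16, 8, 32          # slots, feature dim, visual feature count

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

slots = rng.normal(size=(N, D))         # current slots s_t
slots_next = rng.normal(size=(N, D))    # predicted next slots
subgoal = rng.normal(size=(N, D))       # sub-goal embeddings g_t
visual = rng.normal(size=(K, D))        # raw visual features

# Slot fusion: concatenate and project (assumed parameterization).
W_fuse = rng.normal(scale=0.1, size=(3 * D, D))
d = np.concatenate([slots, slots_next, subgoal], axis=1) @ W_fuse  # (N, D)

# Relation encoder: fused slots query the raw visual features.
attn = softmax(d @ visual.T / np.sqrt(D), axis=1)   # (N, K)
r = attn @ visual                                   # (N, D) relational tokens

# Action head stand-in: linear readout instead of the LLM decoder.
W_act = rng.normal(scale=0.1, size=(2 * N * D, 7))  # 7-DoF action vector
action = np.concatenate([r, d], axis=None) @ W_act
print(r.shape, action.shape)   # (16, 8) (7,)
```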
Loss & Training¶
The training objective combines the Slot Attention reconstruction loss, temporal contrastive loss, SlotSSM window prediction loss, and a cross-entropy action prediction loss from the VLA head. The current Naive E-SlotSSM variant uses oracle text sub-goal embeddings (e.g., "bowl 1 on plate 3") as progress supervision.
Key Experimental Results¶
Main Results¶
Success Rate on LIBERO-Goal (Standard Markovian Tasks; selected tasks shown, Avg. over the full suite):
| Method | # Tokens | bowl in stove | bowl on plate | mid drawer | top drawer→bowl | Avg. |
|---|---|---|---|---|---|---|
| SlotVLA (h=1) | 16 | 45% | 0% | 5% | 0% | 32% |
| SlotVLA (h=8) | 128 | 95% | 90% | 25% | 65% | 75.5% |
| Naive E-SlotSSM | 32 | 100% | 90% | 45% | 70% | 83.0% |
Sub-goal Completion Rate on LIBERO-Mem (Non-Markovian POMDP Tasks):
| Task | π₀ (h=1) | SlotVLA (h=1) | SlotVLA (h=8) | Naive E-SlotSSM |
|---|---|---|---|---|
| T1 (1× pick-place) | 50.0% | 0% | 50.0% | 50.0% |
| T3 (3× pick-place) | 0% | 0% | 0% | 33.3% |
| T5 (5× pick-place) | 0% | 0% | 0% | 14.3% |
| T9 (bowl in basket + move basket) | 0% | 0% | 0% | 30% |
| T10 (bowl in basket + move empty basket) | 0% | 0% | 0% | 20% |
| Avg. | 5.0% | 0% | 5.0% | 14.8% |
Ablation Study¶
Ablations are presented implicitly through cross-method comparisons across task dimensions:
| Comparison Axis | SlotVLA (h=8) | Naive E-SlotSSM | Notes |
|---|---|---|---|
| General task avg. | 75.5% | 83.0% | SSM memory +7.5% |
| POMDP task avg. | 5.0% | 14.8% | Structured memory critical |
| Token efficiency | 128 tokens | 32 tokens | 4× compression |
| Long-horizon repeated tasks (T3–T6) | All 0% | Partial success | Persistent memory effective |
Key Findings¶
- Dense tokens and naive context extension fail under POMDP: π₀ (256 tokens) and SlotVLA (h=8, 128 tokens) both average only 5.0% on LIBERO-Mem, demonstrating that simply increasing frame history does not resolve non-Markovian challenges.
- Object-centric memory provides a strong inductive bias: Embodied-SlotSSM achieves approximately 3× improvement on POMDP tasks (5% → 14.8%) through structured slot-based tracking of object identity and state.
- Slot visualizations confirm object permanence: Visualizations show the model maintains consistent attention to target objects (bowls, gripper) throughout grasping and placement sequences.
- Token concatenation causes confusion: Concatenation-based approaches struggle to disambiguate identical visual states with opposing action directions (lift vs. lower), whereas SlotSSM executes correctly.
- Absolute performance remains low (14.8%), primarily constrained by oracle sub-goal dependency.
Highlights & Insights¶
- Precise and compelling problem formulation: The paper formally characterizes non-Markovian robotic manipulation as an object-level POMDP, using \(P(\mathbf{a}_{t_1}|\mathbf{v}_{1:t_1},l) \neq P(\mathbf{a}_{t_2}|\mathbf{v}_{1:t_2},l)\) with \(\mathbf{v}_{t_1} \approx \mathbf{v}_{t_2}\) to precisely specify when the Markovian assumption fails.
- Clever benchmark design: Object identity ambiguity is induced through visually identical bowls and plates; sequential memory demands arise from repeated pick-and-place operations — simple yet incisive.
- Window prediction in SlotSSM: Predicting slot representations over a past-and-future window, rather than a single next step, simultaneously supports forward dynamics prediction and backward temporal consistency via reconstruction.
- Sub-goal-aware evaluation: Extends beyond binary success/failure metrics to enable fine-grained progress assessment (number of sub-goals completed).
- Theoretical analysis of computational efficiency: SlotSSM requires only 32 tokens versus 256 for OpenVLA and 128 for SlotVLA, achieving 4–8× compression.
Limitations & Future Work¶
- Oracle sub-goal dependency: Naive E-SlotSSM relies on oracle text sub-goal embeddings (e.g., "bowl 1 on plate 3") and cannot autonomously discover sub-goals — the most significant limitation.
- Simulation-only evaluation: LIBERO-Mem has not been extended to real physical environments; sim-to-real transfer performance remains unknown.
- Low absolute performance: A POMDP task average of 14.8% positions this as a "weak baseline," still far from practical utility.
- Fixed slot count: The fixed design of \(N=16\) slots may not generalize well to scenarios with highly variable numbers of objects.
- Limited task diversity: Tasks are predominantly pick-and-place and object-swapping; more complex long-horizon manipulation (e.g., cooking, assembly) is not addressed.
Related Work & Insights¶
- Integrating the cognitive science concept of "object permanence" into robotic manipulation is a meaningful research direction, and SlotSSM provides a computational realization of this principle.
- Compared to MemoryBench and MIKASA-Robo, LIBERO-Mem's distinctive contributions lie in object identity ambiguity and temporal extension stress testing — critical factors for evaluating "genuine memory" versus "short-term perception."
- The block-diagonal design of SlotSSM (based on Mamba) decomposes a global SSM into object-independent local SSMs, constituting an elegant form of structured inductive bias.
Rating¶
- Novelty: ⭐⭐⭐⭐ (Both the LIBERO-Mem benchmark and Embodied-SlotSSM framework are novel; the problem formulation and proposed solution are original.)
- Experimental Thoroughness: ⭐⭐⭐ (Multi-baseline comparisons are adequate, but component-level ablations and real-world experiments are absent.)
- Writing Quality: ⭐⭐⭐⭐ (Problem formalization is clear; framework description is thorough; visualizations are persuasive.)
- Value: ⭐⭐⭐⭐ (Opens a new direction for non-Markovian robotic manipulation; the benchmark has long-term research value.)