TTOM: Test-Time Optimization and Memorization for Compositional Video Generation¶

Conference: ICLR 2026
arXiv: 2510.07940
Code: https://ttom-t2v.github.io/
Area: Video Generation / Compositional Reasoning
Keywords: Test-time optimization, compositional video generation, parameter memorization, spatiotemporal layout, attention alignment

TL;DR¶

This paper proposes TTOM, a framework that aligns attention maps of video generation models with LLM-generated spatiotemporal layouts by optimizing newly introduced parameters at inference time, while a parameter memorization mechanism stores historical optimization contexts for reuse. TTOM achieves relative improvements of 34% (CogVideoX) and 14% (Wan2.1) on T2V-CompBench.

Background & Motivation¶

Background: Text-to-video (T2V) models perform well on single-object scenarios but suffer from severe misalignment in compositional scenes involving multiple objects, attributes, motions, and spatial relationships. Existing methods leverage LLMs to generate spatiotemporal layouts and guide generation by modifying latent variables or attention maps.

Limitations of Prior Work: (a) Direct manipulation of latent variables or attention maps disrupts feature distributions, causing flickering and collapse; (b) each sample is processed independently without leveraging historical context; (c) interventions optimized for one sample do not generalize to others.

Key Challenge: Fine-grained control over compositional layouts is required, yet such control must not corrupt the feature distribution of the pre-trained model.

Goal: Achieve model-agnostic compositional layout alignment at test time while reusing historical optimization results.

Key Insight: Rather than modifying latent variables, the approach inserts and optimizes new parameters to align attention with the target layout, then stores the optimized parameters in memory for future reuse.

Core Idea: Optimize parameters instead of latent variables to achieve layout alignment, and accumulate cross-sample knowledge via parameter memorization.

Method¶

Overall Architecture¶

Streaming setting: users submit prompts sequentially, and each prompt passes through four steps. (1) An LLM generates a spatiotemporal layout (a bounding-box sequence per object). (2) The memory is queried for a matching entry: on a hit, the stored parameters are loaded (optionally followed by continued optimization); on a miss, new parameters are initialized. (3) Test-time optimization (TTO) aligns attention maps with the layout. (4) The optimized parameters are stored in memory.
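
The four steps above can be sketched as a small control loop. This is only an illustrative sketch, not the authors' code: `serve_prompt`, `plan_layout`, `scene_key`, `optimize`, and `init_params` are hypothetical names, the memory is reduced to a plain dictionary with exact-match keys (the summary describes embedding keys), and the stubs in the toy run stand in for the LLM planner and the optimizer.

```python
from typing import Callable, Dict, List, Tuple

Params = List[float]                    # stand-in for the inserted parameters phi
Layout = Dict[str, List[List[float]]]   # object name -> one bounding box per frame


def serve_prompt(
    prompt: str,
    memory: Dict[str, Params],
    plan_layout: Callable[[str], Layout],
    scene_key: Callable[[str], str],
    optimize: Callable[[Params, Layout], Params],
    init_params: Callable[[], Params],
    continue_on_hit: bool = False,
) -> Tuple[Params, bool]:
    """Handle one prompt of the stream; returns (parameters, memory_hit)."""
    layout = plan_layout(prompt)                  # (1) LLM layout planning
    key = scene_key(prompt)                       # abstracted scene description
    cached = memory.get(key)                      # (2) memory lookup
    hit = cached is not None
    if hit:
        params = list(cached)                     # reuse stored parameters
        if continue_on_hit:
            params = optimize(params, layout)     # optional continued optimization
    else:
        params = optimize(init_params(), layout)  # (3) TTO from a fresh init
    memory[key] = params                          # (4) store for later prompts
    return params, hit


# Toy run with stub components (all hypothetical).
optimize_calls = []


def fake_optimize(params: Params, layout: Layout) -> Params:
    optimize_calls.append(1)
    return [p + 1.0 for p in params]


memory: Dict[str, Params] = {}
stubs = dict(
    memory=memory,
    plan_layout=lambda prompt: {"animal": [[0.0, 0.4, 0.2, 0.6], [0.6, 0.4, 0.8, 0.6]]},
    scene_key=lambda prompt: "one object moves left to right",
    optimize=fake_optimize,
    init_params=lambda: [0.0, 0.0],
)
first, first_hit = serve_prompt("a cat walks left to right", **stubs)
second, second_hit = serve_prompt("a dog walks left to right", **stubs)
print(first_hit, second_hit, len(optimize_calls))  # prints: False True 1
```

In the toy run the second prompt maps to the same abstracted scene, so it reuses the stored parameters and the optimizer is called only once.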

Key Designs¶

Attention–Layout Correlation Probing:
- Function: Identifies which DiT layers have attention maps most correlated with the final video layout.
- Mechanism: Generate a video → segment with GroundingDINO + SAM2 → compute mIoU between per-layer attention maps and segmentation masks. Results reveal large variance in correlation across layers.
- Design Motivation: Only layers with high correlation are targeted for optimization, avoiding unnecessary interference.
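
The probing step can be illustrated with a small layer-ranking sketch. It assumes attention maps already rescaled to [0, 1] and binarized at a fixed threshold, and it keeps the top-k layers; the threshold, the top-k rule, and all function names are assumptions for illustration, since the summary only states that mIoU is computed per layer and that high-correlation layers are selected. Segmentation masks (from GroundingDINO + SAM2 in the paper) are replaced by hand-written toy masks.

```python
from typing import Dict, List

Map2D = List[List[float]]
Mask2D = List[List[int]]


def binarize(attn: Map2D, threshold: float) -> Mask2D:
    """Turn a soft attention map into a 0/1 mask."""
    return [[1 if value >= threshold else 0 for value in row] for row in attn]


def iou(pred: Mask2D, mask: Mask2D) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = union = 0
    for pred_row, mask_row in zip(pred, mask):
        for p, m in zip(pred_row, mask_row):
            inter += p & m
            union += p | m
    return inter / union if union else 0.0


def layer_miou(attn_per_object: List[Map2D], masks: List[Mask2D],
               threshold: float = 0.5) -> float:
    """Mean IoU over objects for one layer."""
    scores = [iou(binarize(a, threshold), m) for a, m in zip(attn_per_object, masks)]
    return sum(scores) / len(scores)


def select_layers(layer_attn: Dict[int, List[Map2D]], masks: List[Mask2D],
                  top_k: int) -> List[int]:
    """Rank layers by mIoU and keep the top_k most layout-correlated ones."""
    ranked = sorted(layer_attn, key=lambda l: layer_miou(layer_attn[l], masks),
                    reverse=True)
    return ranked[:top_k]


# Toy example: one object occupying the top row of a 2x2 map.
masks = [[[1, 1], [0, 0]]]
layer_attn = {
    0: [[[0.9, 0.8], [0.1, 0.2]]],   # matches the mask exactly
    1: [[[0.9, 0.1], [0.7, 0.2]]],   # partial overlap
    2: [[[0.1, 0.2], [0.9, 0.8]]],   # no overlap
}
print(select_layers(layer_attn, masks, top_k=2))  # prints: [0, 1]
```
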
Test-Time Optimization (TTO):
- Function: Inserts new parameters and optimizes them to align attention with the layout.
- Mechanism: Lightweight parameters \(\phi\) are inserted into the video foundation model (VFM) and optimized with a Jensen-Shannon divergence (JSD) alignment loss, \(L_{align} = \frac{1}{N}\sum_i JSD(\bar{A}_i \| \bar{B}_i)\), which aligns each attention map \(\bar{A}_i\) with its Gaussian-smoothed layout mask \(\bar{B}_i\).
- Design Motivation: Optimizing \(\phi\) rather than latent variables \(z_t\) avoids distribution collapse.
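
To make the alignment loss concrete, the sketch below computes \(L_{align}\) for a single frame. Several details are assumptions rather than the paper's specification: each attention map and mask is flattened and normalized to sum to 1, the Gaussian-smoothed mask is modeled as a Gaussian bump centred on the bounding box with a standard deviation of half the box size, and the names `gaussian_box_mask` and `align_loss` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in grid cells


def normalize(values: Sequence[float]) -> List[float]:
    total = sum(values)
    return [v / total for v in values]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence with natural log (ranges from 0 to ln 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def gaussian_box_mask(box: Box, height: int, width: int) -> List[float]:
    """Assumed smoothing: a Gaussian bump centred on the box, flattened row-major."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    sx, sy = max((x1 - x0) / 2, 1e-6), max((y1 - y0) / 2, 1e-6)
    return [
        math.exp(-(((x + 0.5 - cx) / sx) ** 2 + ((y + 0.5 - cy) / sy) ** 2) / 2)
        for y in range(height)
        for x in range(width)
    ]


def align_loss(attn_maps: List[Sequence[float]], boxes: List[Box],
               height: int, width: int) -> float:
    """Mean JSD between normalized attention maps and smoothed layout masks."""
    losses = [
        jsd(normalize(attn), normalize(gaussian_box_mask(box, height, width)))
        for attn, box in zip(attn_maps, boxes)
    ]
    return sum(losses) / len(losses)


# Toy example on a 4x4 grid: the target box is the top-left quadrant.
target_box = (0.0, 0.0, 2.0, 2.0)
attn_aligned = gaussian_box_mask(target_box, 4, 4)           # attention on the box
attn_misaligned = gaussian_box_mask((2.0, 2.0, 4.0, 4.0), 4, 4)  # opposite corner
loss_aligned = align_loss([attn_aligned], [target_box], 4, 4)
loss_misaligned = align_loss([attn_misaligned], [target_box], 4, 4)
print(loss_aligned < loss_misaligned)  # prints: True
```

Because JSD is bounded and symmetric, the loss stays finite even when the attention mass and the mask barely overlap.
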
Parameter Memorization:
- Function: Stores historical optimization contexts for future reuse.
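- Mechanism: The memory is a key-value store \(\mathcal{M} = \{g(C): \phi^*_C\}\) whose keys \(g(C)\) are text embeddings of abstracted scene descriptions and whose values \(\phi^*_C\) are the optimized parameters. It supports insert, read, update, and delete operations; least-frequently-used (LFU) eviction is applied when capacity is exceeded.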
- Design Motivation: Parameters from similar scenes can be loaded directly to skip optimization (efficiency), or used as strong initializations (quality).
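
A minimal sketch of such a memory follows, under stated assumptions: keys are plain float vectors standing in for text embeddings, a match is the most similar stored key whose cosine similarity reaches a threshold, and LFU counts only reads. The class name `ParameterMemory`, the threshold value, and the tie-breaking rule (oldest entry first) are illustrative choices, not details taken from the paper.

```python
import math
from typing import Any, Dict, List, Optional, Sequence, Tuple


class ParameterMemory:
    """Embedding-keyed store with LFU eviction (illustrative sketch)."""

    def __init__(self, capacity: int, threshold: float = 0.9) -> None:
        self.capacity = capacity
        self.threshold = threshold            # assumed cosine-similarity cutoff
        self.entries: Dict[int, Tuple[List[float], Any]] = {}
        self.freq: Dict[int, int] = {}        # read counts used for LFU
        self._next_id = 0

    @staticmethod
    def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def _match(self, key: Sequence[float]) -> Optional[int]:
        best_id, best_sim = None, self.threshold
        for entry_id, (stored_key, _) in self.entries.items():
            sim = self._cosine(key, stored_key)
            if sim >= best_sim:
                best_id, best_sim = entry_id, sim
        return best_id

    def read(self, key: Sequence[float]) -> Optional[Any]:
        entry_id = self._match(key)
        if entry_id is None:
            return None                       # miss: caller initializes new params
        self.freq[entry_id] += 1
        return self.entries[entry_id][1]

    def write(self, key: Sequence[float], params: Any) -> None:
        entry_id = self._match(key)
        if entry_id is not None:              # update an existing entry
            self.entries[entry_id] = (self.entries[entry_id][0], params)
            return
        if len(self.entries) >= self.capacity:
            victim = min(self.freq, key=self.freq.get)  # least frequently read
            self.delete(victim)
        self.entries[self._next_id] = (list(key), params)  # insert
        self.freq[self._next_id] = 0
        self._next_id += 1

    def delete(self, entry_id: int) -> None:
        self.entries.pop(entry_id, None)
        self.freq.pop(entry_id, None)


mem = ParameterMemory(capacity=2)
mem.write([1.0, 0.0], "phi_a")
mem.write([0.0, 1.0], "phi_b")
hit = mem.read([0.99, 0.05])       # close to the first key, so "phi_a" is returned
mem.write([1.0, 1.0], "phi_c")     # memory is full: the never-read "phi_b" is evicted
```

Returning `None` on a miss lets the caller decide between skipping optimization on a hit and starting from a fresh initialization otherwise.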

Loss & Training¶

The method is unsupervised: only \(L_{align}\) (JSD) is used to optimize the newly inserted parameters at test time. LLM-generated layouts pass through a validation step to ensure consistency. When a memory match is found, optimization can be skipped entirely and the stored parameters used for direct inference.
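
The idea of optimizing only newly inserted parameters can be shown on a toy problem. In the sketch below, a fixed score vector stands in for the pre-trained model's attention logits, an additive offset `phi` plays the role of the inserted parameters, and gradient descent on the JSD moves the resulting attention toward a target distribution. Everything here is an assumption for illustration: the additive parameterization, the learning rate, and the finite-difference gradients (a real implementation would presumably backpropagate through the model).

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def loss(phi: Sequence[float], base_scores: Sequence[float],
         target: Sequence[float]) -> float:
    """JSD between attention from (base scores + phi) and the target layout."""
    attn = softmax([s + d for s, d in zip(base_scores, phi)])
    return jsd(attn, target)


def optimize_phi(base_scores: Sequence[float], target: Sequence[float],
                 steps: int = 300, lr: float = 1.0, eps: float = 1e-5) -> List[float]:
    """Gradient descent on phi only; base_scores are never modified."""
    phi = [0.0] * len(base_scores)
    for _ in range(steps):
        grad = []
        for i in range(len(phi)):
            up, down = list(phi), list(phi)
            up[i] += eps
            down[i] -= eps
            # Central finite difference, used here to stay dependency-free.
            grad.append((loss(up, base_scores, target)
                         - loss(down, base_scores, target)) / (2 * eps))
        phi = [p - lr * g for p, g in zip(phi, grad)]
    return phi


base_scores = [2.0, 0.0, 0.0, 0.0]     # attention initially on the wrong token
target = [0.1, 0.1, 0.4, 0.4]          # normalized layout mask (toy values)
initial_loss = loss([0.0] * 4, base_scores, target)
phi = optimize_phi(base_scores, target)
final_loss = loss(phi, base_scores, target)
print(final_loss < initial_loss)
```
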

Key Experimental Results¶

Main Results¶

T2V-CompBench (7 compositional video generation categories):

| Model | Avg. Score | Motion | Numeracy | Spatial |
| --- | --- | --- | --- | --- |
| CogVideoX-5B | baseline | low | low | low |
| CogVideoX + TTOM | +34% | significant gain | significant gain | significant gain |
| Wan2.1-14B | baseline | medium | medium | medium |
| Wan2.1 + TTOM | +14% | gain | gain | gain |

Consistent improvements are also observed on VBench.

Ablation Study¶

| Configuration | Observation |
| --- | --- |
| Optimize latents vs. optimize parameters | Parameter optimization yields higher quality without collapse |
| With memory vs. without memory | Memory significantly improves both efficiency and quality |
| Layer selection | Optimizing only high-correlation layers achieves the best results |
| Skip optimization on memory hit | Large efficiency gains with only marginal quality degradation |
| Transferability | Parameters optimized for one scene transfer effectively to similar scenes |

Key Findings¶

- TTOM decouples compositional world knowledge: the optimized parameters transfer and generalize well across scenes.
- Parameter memorization enables progressive improvement in streaming inference, because compositional patterns accumulated from earlier prompts can be reused for new scenes.
- The approach is model-agnostic: it is effective on both CogVideoX and Wan2.1, which have distinct architectures.
- The JSD loss is more stable than a direct \(L_2\) loss.

Highlights & Insights¶

- Optimizing parameters rather than latents: Avoids feature distribution corruption caused by direct intervention, offering a more principled alternative to latent guidance.
- "Better with use" property of parameter memory: Transforms test-time optimization from a one-shot cost into cumulative knowledge accumulation, conceptually analogous to human experiential learning.
- Forward-looking streaming setting: Frames video generation as a continuous service rather than isolated requests, better reflecting real-world deployment scenarios.
- Attention–layout correlation probing: Provides the first systematic quantification of the correspondence between per-layer DiT attention maps and final video layouts, offering independent analytical value.
