LottieGPT: Tokenizing Vector Animation for Autoregressive Generation¶
Conference: CVPR 2026
arXiv: 2604.11792
Code: https://lottiegpt.github.io/
Area: Image/Animation Generation
Keywords: Vector Animation, Lottie, Autoregressive Generation, Tokenizer, Multimodal
TL;DR¶
LottieGPT is the first autoregressive vector animation generation framework. It designs a Lottie tokenizer that encodes hierarchical geometry, transforms, and keyframe motion into compact token sequences, builds a 660K animation dataset, and fine-tunes Qwen2.5-VL to generate editable vector animations directly from text/image inputs.
Background & Motivation¶
Background: Video generation models (Sora, Kling, etc.) can now produce high-quality raster video, but they operate in pixel space and cannot generate vector animation, a resolution-independent, editable, and compact format that remains mainstream in multimedia production.
Limitations of Prior Work: Vector animations (e.g., UI motion effects, brand animations, After Effects motion graphics) offer key properties unavailable in pixel video: infinite resolution, semantic manipulability, parameterized motion, and small file sizes. Existing SVG generation methods are limited to static output and lack temporal modeling capability.
Key Challenge: Vector animations contain both hierarchical structure and time-dependent transformation logic; encoding them into token sequences suitable for autoregressive modeling is the core challenge. Additionally, the absence of large-scale vector animation datasets is a major bottleneck.
Goal: (1) Design a tokenizer that unifies hierarchical geometry and temporal motion encoding; (2) build a large-scale vector animation dataset; (3) train the first multimodal model for vector animation generation.
Key Insight: Adopt the Lottie format (a widely deployed JSON animation standard), leveraging its keyframe + easing function parameterized representation for compact encoding.
Core Idea: Tokenize vector animations using keyframes and interpolation functions instead of per-frame data, drastically reducing sequence length while preserving structural fidelity.
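The keyframe-plus-easing parameterization the paper builds on is native to Lottie itself. As a concrete illustration (a hand-written fragment, not taken from the paper), an animated opacity property in Lottie JSON stores only two keyframes plus Bézier easing handles rather than one value per frame, and the renderer reconstructs every intermediate frame by interpolation:

```python
# Minimal illustration of Lottie's animated-property format (hand-written
# example): a 60-frame opacity fade is stored as 2 keyframes + easing
# handles, not 60 per-frame values.
opacity = {
    "a": 1,            # animated flag
    "k": [             # keyframe list
        {
            "t": 0,            # keyframe time (frames)
            "s": [0],          # start value: fully transparent
            "o": {"x": [0.33], "y": [0.0]},  # out-tangent of Bezier easing
            "i": {"x": [0.67], "y": [1.0]},  # in-tangent of Bezier easing
        },
        {"t": 60, "s": [100]},  # end value: fully opaque
    ],
}

def bezier_ease(t, ox, oy, ix, iy, steps=100):
    """Map normalized time t -> normalized progress along the cubic
    Bezier (0,0)-(ox,oy)-(ix,iy)-(1,1), by numeric sampling."""
    best = 0.0
    for k in range(steps + 1):
        u = k / steps
        x = 3*u*(1-u)**2*ox + 3*u**2*(1-u)*ix + u**3
        if x <= t:
            best = 3*u*(1-u)**2*oy + 3*u**2*(1-u)*iy + u**3
    return best

k0, k1 = opacity["k"]

def opacity_at(frame):
    """Interpolate the animated property at an arbitrary frame."""
    t = (frame - k0["t"]) / (k1["t"] - k0["t"])
    p = bezier_ease(t, k0["o"]["x"][0], k0["o"]["y"][0],
                       k0["i"]["x"][0], k0["i"]["y"][0])
    return k0["s"][0] + p * (k1["s"][0] - k0["s"][0])
```

Because the interpolation is fully determined by the keyframes and easing handles, a tokenizer that emits only these parameters loses no animation information.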
Method¶
Overall Architecture¶
LottieGPT is built on the Qwen2.5-VL architecture with three components: (1) a Lottie tokenizer that encodes JSON animations into compact token sequences; (2) a vision-language backbone for multimodal input processing; (3) a two-stage training strategy (static first, then dynamic). Input is text/image/keyframe video; output is a Lottie token sequence that can be decoded into a complete editable vector animation.
Key Designs¶
- Lottie Tokenizer — Hierarchical Structure Encoding:
  - Function: Encode the hierarchical structure of Lottie JSON (animation metadata → assets → layers → shapes → attributes) into discrete token sequences
  - Mechanism: Special tokens (e.g., `<|LAYER|>`, `<|ty|>`) mark hierarchical boundaries and relationships, directly corresponding to the Lottie schema. Supports complete shape-primitive encoding (ellipses, fills, gradients, strokes, etc.), unlike OmniSVG, which requires decomposition into atomic commands
  - Design Motivation: Preserve semantic information and hierarchical organization so the model learns structural patterns rather than arbitrary text sequences
- Keyframe Motion Compression:
  - Function: Achieve compact temporal encoding, as opposed to dense per-frame encoding
  - Mechanism: Store only keyframe time points `<|t|>`, attribute values, and Bézier easing functions `<|ease|>` rather than dense per-frame data. A 100-frame animation may need only 6 keyframes' worth of tokens; at 300 frames, the compression ratio reaches 98%. Easing functions are encoded as first-class primitives, so the same keyframes can produce dramatically different motion feels
  - Design Motivation: The essence of vector animation is keyframes + interpolation; this encoding drastically reduces token count while preserving complete animation information
- Static-to-Dynamic Progressive Training:
  - Function: Stabilize training through curriculum learning
  - Mechanism: Stage 1 trains on static vector graphics (50% text-to-Lottie + 50% image-to-Lottie); Stage 2 introduces temporal dynamics (34% text-only + 33% text + first frame + 33% text + video keyframes)
  - Design Motivation: Mixing static and dynamic training data directly causes convergence instability, because animation samples contain far more tokens than static graphics
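To make the token-count argument concrete, here is a toy sketch (my own, not the paper's tokenizer) comparing a keyframe encoding against a dense per-frame baseline for one animated property. The token names follow the special tokens mentioned above, but the exact serialization scheme is hypothetical, so the resulting ratio is only indicative:

```python
# Toy sketch (hypothetical serialization, not the paper's tokenizer):
# serialize one animated property as keyframes + easing tokens, or as
# dense per-frame values, and compare sequence lengths.

def keyframe_tokens(keyframes):
    """One <|t|> marker, time, value, and <|ease|> pair per keyframe."""
    toks = []
    for t, value, ease in keyframes:
        toks += ["<|t|>", str(t), str(value), "<|ease|>", ease]
    return toks

def per_frame_tokens(keyframes, n_frames):
    """Dense baseline: one (time, value) pair per frame, linear interp
    between the first and last keyframe for simplicity."""
    (t0, v0, _), (t1, v1, _) = keyframes[0], keyframes[-1]
    toks = []
    for f in range(n_frames):
        v = v0 + (v1 - v0) * min(max((f - t0) / (t1 - t0), 0.0), 1.0)
        toks += [str(f), f"{v:.2f}"]
    return toks

# 6 keyframes describing a 300-frame animation
kfs = [(t, t * 0.5, "easeInOut") for t in (0, 60, 120, 180, 240, 299)]
compact = keyframe_tokens(kfs)               # 6 keyframes * 5 tokens = 30
dense = per_frame_tokens(kfs, n_frames=300)  # 300 frames * 2 tokens = 600
compression = 1 - len(compact) / len(dense)  # ~0.95 for this toy setup
```

The real compression ratio depends on how many attributes, shapes, and layers a property set spans, which is how the paper's 98% figure at 300 frames can exceed this single-property toy estimate.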
Loss & Training¶
Standard causal language model cross-entropy loss: \(\mathcal{L} = -\sum_{i=1}^{N} \log P(t_i | t_{<i}, \mathbf{c})\), where \(\mathbf{c}\) is the multimodal condition. The tokenizer supports lossless round-tripping: decoded animations render identically to the originals.
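The objective above is ordinary next-token cross-entropy. A minimal dependency-free sketch over toy logits (the real model conditions each position on the multimodal context \(\mathbf{c}\) and all previous tokens; here the logits are just given):

```python
import math

def causal_ce_loss(logits, targets):
    """Sum of -log P(t_i | t_<i, c) over the sequence, where logits[i]
    holds the model's scores over the vocabulary at position i."""
    loss = 0.0
    for scores, tgt in zip(logits, targets):
        z = max(scores)  # subtract max for numerical stability
        log_norm = z + math.log(sum(math.exp(s - z) for s in scores))
        loss += log_norm - scores[tgt]  # = -log softmax(scores)[tgt]
    return loss

# Toy vocabulary of 4 token ids; 3-step target sequence whose targets
# match the highest-scoring logit at each position, so the loss is small.
logits = [[2.0, 0.1, 0.1, 0.1],
          [0.1, 3.0, 0.1, 0.1],
          [0.1, 0.1, 0.1, 4.0]]
targets = [0, 1, 3]
loss = causal_ce_loss(logits, targets)
```

Swapping in targets the model scores poorly (e.g., `[1, 0, 0]`) yields a much larger loss, which is the gradient signal driving training.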
Key Experimental Results¶
Main Results¶
| Method | Input | CLIP↑ | SSIM↑ | LPIPS↓ | DINOv2↑ | JSON↑ | Valid Rate↑ |
|---|---|---|---|---|---|---|---|
| OmniSVG-7B | Text | 0.832 | 0.563 | 0.512 | 0.727 | N/A | N/A |
| LottieGPT-7B | Text | 0.933 | 0.810 | 0.176 | 0.857 | 0.824 | 98.3% |
| StarVector-8B | Image | 0.766 | 0.385 | 0.465 | 0.529 | N/A | N/A |
| OmniSVG-7B | Image | 0.900 | 0.705 | 0.251 | 0.848 | N/A | N/A |
| LottieGPT-7B | Image | 0.945 | 0.835 | 0.154 | 0.876 | 0.843 | 98.8% |
Ablation Study¶
| Config | CLIP↑ | SSIM↑ | Valid Rate↑ |
|---|---|---|---|
| full model (Stage1+2) | 0.933 | 0.810 | 98.3% |
| Stage 1 only (no animation) | 0.928 | 0.805 | 97.5% |
| No hierarchical encoding | 0.891 | 0.752 | 92.1% |
| Per-frame encoding instead of keyframes | 0.875 | 0.701 | 85.6% |
Key Findings¶
- Keyframe encoding is critical for valid rate: per-frame encoding causes excessively long sequences, dropping valid rate from 98.3% to 85.6%
- Temporal modeling enhances static vector understanding: LottieGPT also sets a new state of the art on static SVG generation
- JSON structural scores confirm high structural fidelity in generated Lottie files
Highlights & Insights¶
- The keyframe + easing function encoding is an elegant design: it preserves complete animation semantics (motion curves as first-class primitives) while achieving extremely high compression ratios. This approach can generalize to other parameterized representation generation tasks
- The dataset contribution is significant: 660K vector animations + 15M static vector graphics, the first large-scale resource in this domain
- Framing 2D animation generation analogously to 3D animation production workflows (generate structure first, then add animation) is an inspiring perspective
Limitations & Future Work¶
- Only supports the Lottie format; does not cover SVG SMIL or CSS animations
- Complex animation token sequences remain long, constrained by VLM context windows
- No human evaluation of temporal consistency and motion naturalness in generated animations
- Extensible to interactive animation editing and conditional generation
Related Work & Insights¶
- vs OmniSVG/StarVector: These methods can only generate static SVGs; LottieGPT is the first to support temporal modeling and animation generation
- vs Pixel video generation: Pixel methods produce fixed-resolution, non-editable output; LottieGPT output is infinitely scalable and fully editable
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First autoregressive vector animation generation framework, opening a new direction
- Experimental Thoroughness: ⭐⭐⭐⭐ — Introduces LottieBench with multi-dimensional evaluation
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear motivation and well-defined contributions
- Value: ⭐⭐⭐⭐⭐ — Complete contribution of dataset + benchmark + method