OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens

Conference: CVPR 2026
arXiv: 2603.02138
Code: To be confirmed (paper references a Project Page)
Area: Multimodal VLM / Vector Animation Generation
Keywords: Lottie, vector animation, tokenization, multimodal instruction, VLM generation

TL;DR

OmniLottie proposes a Lottie Tokenizer that converts Lottie JSON files into structured command-parameter sequences, enabling pretrained VLMs to generate high-quality vector animations from multimodal instructions. The work also introduces MMLottie-2M, a large-scale dataset, to support training.

Background & Motivation

State of the Field

Vector animations (e.g., SVG animations, Lottie format) are widely used in UI design, mobile applications, and web development. They are compact, resolution-independent, and programmatically editable. However, automated vector animation generation remains largely unexplored—existing work primarily focuses on static vector graphics or pixel-level video generation.

Limitations of Prior Work

Redundancy in Lottie JSON: Raw Lottie files contain large amounts of invariant structural metadata and format tokens (e.g., brackets, key names), which constitute significant noise for learning animation generation.

Lack of training data: No large-scale paired dataset of vector animations and text descriptions exists.

VLMs cannot understand animation formats: Existing VLMs generate only text or images and cannot directly output structured animation representations.

Root Cause

Lottie is the most popular vector animation format, yet its JSON representation is unfriendly to machine learning—redundant format tokens cause sequence lengths to explode, making it difficult to learn effective generative models.

Core Idea

Design a Lottie Tokenizer that converts Lottie JSON into compact command-plus-parameter sequences (eliminating all structural redundancy), enabling pretrained VLMs to autoregressively generate vector animations in the same manner as natural language generation.

Method

Overall Architecture

OmniLottie comprises three core components:

1. Lottie Tokenizer: converts Lottie JSON into command-parameter sequences
2. OmniLottie Model: built upon a pretrained VLM; accepts multimodal instruction inputs and autoregressively generates Lottie token sequences
3. MMLottie-2M Dataset: a large-scale vector animation corpus with text and visual annotations

Key Designs

1. Lottie Tokenizer

  • Function: Converts Lottie JSON files into structured command-parameter sequences
  • Mechanism: Traverses the JSON tree structure and extracts three categories of meaningful information:
    • Shape Commands: Geometric instructions such as MOVE_TO(x, y) and BEZIER(cx1, cy1, cx2, cy2, x, y)
    • Animation Functions: Keyframe interpolation descriptors such as EASE_IN(start_frame, end_frame, start_val, end_val)
    • Control Parameters: Color, opacity, transformation matrices, etc.
  • Design Motivation: All redundant JSON formatting (indentation, brackets, key names) is eliminated, compressing sequence length to approximately 15–20% of the original
  • Novelty: Methods that train directly on raw JSON must handle sequences of ~10k+ tokens; the Lottie Tokenizer reduces this to ~1–2k tokens
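The tokenizer's core mechanism can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names (`tokenize_path`), the exact command vocabulary, and the simplified shape JSON are assumptions, though Lottie does store bezier paths as `v`/`i`/`o` (vertex / in-tangent / out-tangent) arrays under `ks.k`.

```python
import json

# Hedged sketch of the Lottie Tokenizer idea: walk a Lottie-like bezier
# path and emit command tokens plus numeric parameters, dropping all JSON
# syntax (keys, brackets, indentation). Names are illustrative.

def tokenize_path(shape: dict) -> list:
    """Convert one Lottie-style bezier path into a flat token list."""
    tokens = []
    ks = shape["ks"]["k"]               # Lottie keeps path data under ks.k
    verts = ks["v"]                     # anchor vertices
    in_tan, out_tan = ks["i"], ks["o"]  # incoming / outgoing tangents
    x0, y0 = verts[0]
    tokens += ["MOVE_TO", x0, y0]
    for idx in range(1, len(verts)):
        px, py = verts[idx - 1]
        x, y = verts[idx]
        # Lottie tangents are stored relative to their anchor vertex
        cx1, cy1 = px + out_tan[idx - 1][0], py + out_tan[idx - 1][1]
        cx2, cy2 = x + in_tan[idx][0], y + in_tan[idx][1]
        tokens += ["BEZIER", cx1, cy1, cx2, cy2, x, y]
    return tokens

raw = json.loads("""
{"ty": "sh", "ks": {"k": {
    "v": [[0, 0], [100, 0], [100, 100]],
    "i": [[0, 0], [-20, 0], [0, -20]],
    "o": [[20, 0], [0, 20], [0, 0]]}}}
""")
print(tokenize_path(raw))
# The flat command sequence replaces the much longer raw JSON text
```

Every piece of structural syntax disappears; only commands and coordinates survive, which is where the reported 15-20% sequence-length compression comes from.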

2. OmniLottie Model Architecture

  • Function: Extends a pretrained VLM (e.g., LLaVA) by augmenting its vocabulary with Lottie command tokens, enabling autoregressive generation of Lottie sequences from multimodal instructions
  • Mechanism:
    • Approximately 200 Lottie-specific tokens (shape commands, animation function names, etc.) are added to the VLM vocabulary
    • Parameter values (coordinates, colors, etc.) are represented as quantized numeric tokens
    • Standard next-token prediction loss is used during training
  • Design Motivation: Leverages the language and visual understanding capabilities of pretrained VLMs, casting vector animation generation as a sequence generation problem
  • Multimodal Support: Accepts diverse inputs including text descriptions (e.g., "draw a bouncing ball"), reference images, and sketches
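The vocabulary-extension and parameter-quantization designs can be made concrete with a minimal sketch. The base vocabulary size, bin count, value range, and command names below are assumptions for illustration; the paper specifies only that ~200 Lottie tokens are added and parameters are quantized.

```python
# Hedged sketch of the OmniLottie vocabulary extension: command tokens
# are appended after the base VLM vocabulary, and continuous parameters
# are mapped to a fixed set of quantized numeric tokens. All constants
# here are illustrative assumptions, not the paper's configuration.

BASE_VOCAB_SIZE = 32000                      # e.g. a LLaMA-style vocab
COMMANDS = ["MOVE_TO", "BEZIER", "EASE_IN", "EASE_OUT", "SET_COLOR"]
CMD_TO_ID = {c: BASE_VOCAB_SIZE + i for i, c in enumerate(COMMANDS)}

N_BINS = 256                                 # assumed quantization level
VAL_MIN, VAL_MAX = -512.0, 512.0             # assumed coordinate range
NUM_BASE = BASE_VOCAB_SIZE + len(COMMANDS)   # numeric ids follow commands

def quantize(v: float) -> int:
    """Map a continuous parameter to one of N_BINS numeric token ids."""
    v = min(max(v, VAL_MIN), VAL_MAX)
    bin_idx = round((v - VAL_MIN) / (VAL_MAX - VAL_MIN) * (N_BINS - 1))
    return NUM_BASE + bin_idx

def encode(tokens: list) -> list:
    """Turn a mixed command/parameter sequence into token ids."""
    return [CMD_TO_ID[t] if isinstance(t, str) else quantize(t)
            for t in tokens]

ids = encode(["MOVE_TO", 0.0, 0.0,
              "BEZIER", 20.0, 0.0, 80.0, 0.0, 100.0, 0.0])
```

Once the sequence is plain token ids, the standard next-token prediction loss applies unchanged, which is what lets a pretrained VLM treat animation generation like language generation.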

3. MMLottie-2M Dataset

  • Function: A large-scale vector animation dataset containing 2 million professionally designed vector animations
  • Mechanism: Professionally designed Lottie animations are collected from platforms such as LottieFiles; VLMs are used to automatically generate text descriptions and visual annotations
  • Scale: 2 million animations with text descriptions and visual annotations (keyframe screenshots)

Key Experimental Results

Main Results: Vector Animation Generation Quality

| Method | FID ↓ | CLIP Score ↑ | Human Preference (%) |
|---|---|---|---|
| DeepSVG + Motion | 142.3 | 0.21 | 12.3 |
| SVGDreamer | 98.7 | 0.28 | 22.8 |
| AnimateDiff (pixel) | 45.2 | 0.35 | 28.4 |
| OmniLottie | 38.6 | 0.41 | 36.5 |

Ablation Study

| Configuration | CLIP Score ↑ | Notes |
|---|---|---|
| Full OmniLottie | 0.41 | Complete method |
| w/o Lottie Tokenizer (raw JSON) | 0.24 | Raw JSON text; overly long sequences degrade quality |
| w/o Animation Functions | 0.33 | Only static shapes generated; no animation |
| w/o MMLottie Pretrain | 0.31 | No large-scale dataset pretraining |

Key Findings

  • The Lottie Tokenizer is the critical component—removing it drops CLIP Score from 0.41 to 0.24, as raw JSON is too verbose for the model to learn effectively
  • Generated vector animations play smoothly on mobile devices at roughly 1/100th the file size of pixel-based video
  • The flexibility of multimodal instruction is validated—text, images, and sketches all yield semantically aligned animations
  • The model can generate complex scenes involving multiple objects and multi-layer animations

Highlights & Insights

  • Casting vector animation generation as sequence generation—the Lottie Tokenizer design elegantly aligns this seemingly unconventional task with the LLM paradigm
  • MMLottie-2M fills a critical data gap—a professionally designed vector animation dataset at the 2-million scale is a valuable community resource
  • High practical utility—generated Lottie files can be directly integrated into app and web development without post-processing
  • Broader implications for structured format design—the Lottie Tokenizer approach is generalizable to other structured format generation tasks (e.g., CAD, SVG, code ASTs)

Limitations & Future Work

  • Currently limited to the Lottie format; extension to SVG animation or CSS animation is not addressed
  • Generation quality for complex animations (e.g., those involving masks, blend modes, or expressions) requires further improvement
  • Parameter value quantization introduces precision loss—subtle animation curves may be coarsened
  • Automatic evaluation metrics for animation temporal quality are lacking—FID and CLIP Score primarily assess static frames
  • The model does not support interactive editing of generated animations
Comparison with Related Work

  • vs. DeepSVG: DeepSVG targets static vector graphic generation via VAEs and does not support animation. OmniLottie is specifically designed for animation dynamics.
  • vs. AnimateDiff: AnimateDiff generates pixel-level video. OmniLottie produces vector-format output that is compact and editable.
  • vs. SVGDreamer: SVGDreamer uses diffusion models to generate SVGs but does not support animation or multimodal input.
  • Insight: Tokenization of structured formats serves as a key bridging mechanism for integrating traditional design tools with AI generation.
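The quantization precision loss noted among the limitations can be made concrete. With a uniform grid, round-trip error is bounded by half a bin width, and nearby parameter values collapse onto the same bin; the bin count and value range below are illustrative assumptions, not the paper's configuration.

```python
# Round-trip through an assumed 256-bin uniform quantizer over
# [-512, 512]: subtle curve parameters snap to a ~4-unit grid,
# which is the "coarsening" the limitation refers to.

N_BINS, VAL_MIN, VAL_MAX = 256, -512.0, 512.0
STEP = (VAL_MAX - VAL_MIN) / (N_BINS - 1)    # ~4.02 units per bin

def quantize(v: float) -> int:
    v = min(max(v, VAL_MIN), VAL_MAX)
    return round((v - VAL_MIN) / STEP)

def dequantize(b: int) -> float:
    return VAL_MIN + b * STEP

for v in [100.0, 101.3, 102.9]:              # nearby bezier control values
    r = dequantize(quantize(v))
    print(f"{v:7.2f} -> {r:7.2f} (error {abs(v - r):.2f})")
# Worst-case round-trip error is STEP / 2, about 2 units here
```

Note that 101.3 and 102.9 land in the same bin, so a gentle curve distinction between them is lost entirely.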

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First work to model vector animation generation as sequence generation; the Lottie Tokenizer design is elegant
  • Experimental Thoroughness: ⭐⭐⭐⭐ Human evaluation + automatic metrics + ablation study; animation temporal quality evaluation is absent
  • Writing Quality: ⭐⭐⭐⭐ Problem motivation is clearly articulated; tokenizer design visualizations are well executed
  • Value: ⭐⭐⭐⭐⭐ Triple contributions of dataset, method, and application value; pioneering significance for the vector animation generation field