
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

Conference: ICCV 2025 | arXiv: 2504.01014 | Code: https://github.com/TencentARC/AnimeGamer | Area: Image/Video Generation | Keywords: infinite game generation, anime life simulation, multimodal large language model, video diffusion model, game state prediction

TL;DR

This paper proposes AnimeGamer, an infinite anime life simulation system built upon a multimodal large language model (MLLM). By predicting the next game state via action-aware multimodal representations — comprising dynamic animation shots and character state updates — the system achieves a continuously consistent interactive anime gaming experience.

Background & Motivation

Recent advances in generative AI have achieved notable progress in anime production, yet existing approaches exhibit clear limitations:

Finite vs. Infinite Games: Existing game generation methods (e.g., GameNGen simulating DOOM) are confined to predefined environments and limited instructions, constituting "finite games." An ideal anime life simulator should instead be an "infinite game" — with no preset boundaries, open-ended language interaction, and a continuously evolving storyline.

Limitations of the Predecessor Work Unbounded:

  • Relies solely on an LLM for pure-text dialogue, ignoring historical visual context, resulting in poor game coherence.
  • Generates only static images, incapable of producing the dynamic effects required to portray character actions.

Core Challenge: How to maintain character consistency and contextual coherence across multi-turn interactions while generating high-quality dynamic video?

Method

Overall Architecture

AnimeGamer is trained in three stages: (a) learning tokenization and decoding of animation shots; (b) training the MLLM to predict the next game state; (c) decoder adaptation training. During inference, a sliding window strategy enables theoretically infinite game generation.

Key Designs

  1. Action-aware Multimodal Representation

    • Each animation shot is decomposed into three components:
      • Visual reference \(f_v\): CLIP embedding of the first frame of the animation clip, capturing overall appearance.
      • Motion description \(f_{md}\): a short textual action prompt (e.g., "Softly talk") encoded by T5.
      • Motion magnitude \(f_{ms}\): motion intensity of the character estimated via optical flow.
    • The encoder \(\mathcal{E}_a\) fuses the visual and textual features via MLP + LayerNorm + Concat: \(s_a = \mathcal{E}_a(f_{md}, f_v) = \text{Concat}(\text{LN}(\text{MLP}(f_v)), \text{LN}(\text{MLP}(f_{md})))\) (a minimal code sketch follows this list).
    • Design Motivation: Existing methods predict only text or image representations, which are insufficient to preserve the visual and motion information inherent in video.
  2. Animation Shot Decoder \(\mathcal{D}_a\)

    • Built upon the video diffusion model CogVideoX, replacing the original text features with action-aware multimodal representations.
    • Motion magnitude \(f_{ms}\) is injected into the timestep embedding \(f_t\) via sinusoidal embedding + FC + SiLU activation (also sketched after this list).
    • Training proceeds in two steps: the decoder is first frozen while only the encoder is trained (warm-up), followed by joint training.
    • The training objective is the standard diffusion loss: \(\mathcal{L} = \mathbb{E}_{z,c,s_a,\epsilon \sim \mathcal{N}(0,1),t}\left[\|\epsilon - \epsilon_\theta(z_t, t, c, s_a)\|_2^2\right]\)
  3. MLLM-based Game State Prediction

    • The MLLM is initialized from Mistral-7B and serves as the "game engine."
    • Input: historical multimodal context + current player instruction.
    • Output: \(N=226\) action-aware multimodal representations (\(s_a\)) + character state \(s_c\) (stamina/social/entertainment values) + motion magnitude.
    • \(s_a\) is supervised with an MSE loss; \(s_c\) and \(f_{ms}\) are supervised with a cross-entropy loss: \(\mathcal{L} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{MSE}\) (a minimal loss sketch follows this list).
  4. Decoder Adaptation

    • Separate training of the MLLM and the decoder may lead to latent space misalignment.
    • The MLLM is frozen while only the decoder is fine-tuned to adapt to the MLLM's output embeddings.
    • During inference, a sliding window combined with a train-short-test-long strategy supports infinite generation (see the inference-loop sketch below).
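
To make the representation and injection designs concrete, here is a minimal PyTorch sketch of the action-aware encoder \(\mathcal{E}_a\) and the motion-magnitude injection into the timestep embedding. All layer sizes, the concatenation order, and the activation choices are illustrative assumptions, not the released implementation.

```python
import math
import torch
import torch.nn as nn

class ActionAwareEncoder(nn.Module):
    """Sketch of E_a: fuse the CLIP visual reference f_v and the T5-encoded
    motion description f_md via MLP + LayerNorm, then concatenate along the
    token dimension (all hidden sizes are assumptions)."""
    def __init__(self, clip_dim=1024, t5_dim=4096, hidden_dim=4096):
        super().__init__()
        self.visual_mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim))
        self.text_mlp = nn.Sequential(
            nn.Linear(t5_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim))
        self.visual_ln = nn.LayerNorm(hidden_dim)
        self.text_ln = nn.LayerNorm(hidden_dim)

    def forward(self, f_v, f_md):
        # f_v:  (B, N_v, clip_dim)  CLIP embedding(s) of the clip's first frame
        # f_md: (B, N_t, t5_dim)    T5 token embeddings of the motion description
        v = self.visual_ln(self.visual_mlp(f_v))
        t = self.text_ln(self.text_mlp(f_md))
        return torch.cat([v, t], dim=1)   # s_a: (B, N_v + N_t, hidden_dim)


class MotionMagnitudeInjector(nn.Module):
    """Sketch of adding the scalar motion magnitude f_ms to the diffusion
    timestep embedding f_t via sinusoidal embedding + FC + SiLU."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.embed_dim = embed_dim
        self.proj = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.SiLU())

    def sinusoidal(self, x):
        # x: (B,) scalar motion magnitudes
        half = self.embed_dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=x.device) / half)
        args = x[:, None].float() * freqs[None, :]
        return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

    def forward(self, f_t, f_ms):
        # f_t: (B, embed_dim) timestep embedding of the video diffusion model
        return f_t + self.proj(self.sinusoidal(f_ms))
```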
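
The combined MLLM objective from item 3 is likewise short to state in code; the weight \(\alpha\), the discretization of character states into tokens, and the tensor shapes below are assumptions.

```python
import torch.nn.functional as F

def mllm_training_loss(pred_s_a, gt_s_a, state_logits, gt_state_ids, alpha=1.0):
    """Sketch of L = L_CE + alpha * L_MSE: MSE on the predicted action-aware
    embeddings, cross-entropy on the discretized character-state and
    motion-magnitude tokens."""
    mse = F.mse_loss(pred_s_a, gt_s_a)                  # (B, N, D) vs (B, N, D)
    ce = F.cross_entropy(state_logits.flatten(0, 1),    # (B*T, vocab)
                         gt_state_ids.flatten())        # (B*T,)
    return ce + alpha * mse
```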
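
Finally, a rough sketch of how the sliding-window inference loop in item 4 might look; `mllm`, `decoder`, and their methods are hypothetical interfaces used only to illustrate the train-short-test-long idea.

```python
from collections import deque

def play(mllm, decoder, max_context_turns=8):
    """Infinite game loop with a fixed-size context window: only the most
    recent turns are fed back to the MLLM, so generation can continue
    indefinitely even though training used short contexts."""
    history = deque(maxlen=max_context_turns)   # old turns are dropped automatically
    state = {"stamina": 10, "social": 10, "entertainment": 10}

    while True:
        instruction = input("Player> ")
        if instruction.strip().lower() in {"quit", "exit"}:
            break

        # 1. The MLLM predicts the next game state from the windowed history.
        s_a, state, f_ms = mllm.predict_next_state(list(history), instruction, state)

        # 2. The adapted decoder renders the animation shot from s_a and f_ms.
        clip = decoder.generate(s_a, motion_magnitude=f_ms)  # display/save the shot here

        # 3. Record this turn for future context.
        history.append({"instruction": instruction, "s_a": s_a, "state": state})
        print("Character state:", state)
```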

Data Construction

Approximately 20,000 video clips (16 frames, 480×720) are extracted from 10 popular anime films. InternVL is used to automatically annotate character motion, background, and character states, with support for player-defined characters.
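
As a concrete illustration of how the motion magnitude \(f_{ms}\) could be estimated from such clips with optical flow (the paper computes it for the character; averaging over the whole frame below is a simplifying assumption):

```python
import cv2
import numpy as np

def estimate_motion_magnitude(frames):
    """Estimate a scalar motion magnitude for a short clip as the mean dense
    optical-flow norm between consecutive frames (Farneback estimator)."""
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    magnitudes = []
    for frame in frames[1:]:
        nxt = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
        prev = nxt
    return float(np.mean(magnitudes))
```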

Key Experimental Results

Main Results

| Model | CLIP-I↑ | DreamSim↑ | CLIP-T↑ | ACC-F↑ | MAE-F↓ | Inference Time (s/turn)↓ |
|---|---|---|---|---|---|---|
| GSC | 0.786 | 0.502 | 0.333 | 0.316 | 0.826 | 50 |
| GFC | 0.766 | 0.580 | 0.333 | 0.292 | 1.021 | 63 |
| GC | 0.796 | 0.642 | 0.334 | 0.425 | 0.722 | 25 |
| AnimeGamer | 0.813 | 0.740 | 0.416 | 0.674 | 0.424 | 24 |

GPT-4V and human evaluation (10-point scale):

| Model | Overall (GPT/Human) | Instruction Following (GPT/Human) | Context Consistency (GPT/Human) | Character Consistency (GPT/Human) |
|---|---|---|---|---|
| GC | 6.42 / 7.38 | 7.29 / 7.37 | 6.58 / 6.89 | 7.49 / 7.55 |
| AnimeGamer | 8.36 / 10.0 | 9.14 / 9.95 | 8.41 / 9.95 | 9.11 / 9.86 |

Ablation Study

| Configuration | CLIP-I↑ | DreamSim↑ | ACC-F↑ | MAE-F↓ |
|---|---|---|---|---|
| w/ random frame | 0.845 | 0.450 | 0.474 | 0.562 |
| w/o warm-up | 0.831 | 0.511 | 0.703 | 0.458 |
| w/o \(f_{ms}\) | 0.853 | 0.689 | 0.182 | 1.219 |
| w/o adapt | 0.683 | 0.494 | 0.365 | 0.847 |
| Ours (full) | 0.867 | 0.793 | 0.729 | 0.403 |

Key Findings

  • AnimeGamer outperforms all baselines across every automatic metric, with particularly strong relative gains in DreamSim (+15.4%) and CLIP-T (+24.6%) over the strongest baseline (GC), demonstrating the critical role of multimodal context for consistency.
  • Human evaluation yields near-perfect scores (up to 10.0/10 overall), substantially surpassing methods that rely solely on textual context.
  • Removing motion magnitude control causes ACC-F to collapse to 0.182, confirming that text alone is insufficient to reliably control motion intensity.
  • Decoder adaptation is essential — omitting it reduces CLIP-I from 0.867 to 0.683.
  • Inference is the fastest among the compared methods at 24 s/turn, since no additional LLM API calls are required.

Highlights & Insights

  • "MLLM as Game Engine": Employing the MLLM as a game engine to directly predict game states — rather than merely acting as a text router — constitutes an innovative paradigm.
  • Elegant action-aware multimodal representation design: Decoupling visual reference, motion description, and motion magnitude enables high-quality video generation while preserving controllability.
  • End-to-end design: The paper provides a complete technical stack, spanning data collection, model training, and evaluation benchmarks.

Limitations & Future Work

  • Training and evaluation are conducted only in the closed-domain setting (custom characters); open-domain generalization remains unexplored.
  • The training data is sourced from only 10 anime films, limiting scale and diversity.
  • Only short 16-frame video clips are generated per turn.
  • The character state design (stamina/social/entertainment) is relatively simple and does not cover more complex game mechanics.
  • Comparisons with additional game generation baselines (e.g., GameNGen, DIAMOND) are absent.
  • The work extends the infinite game concept from Unbounded with significant improvements, upgrading from static image generation to dynamic video generation.
  • The action-aware representation design is generalizable to other multimodal video-controlled generation tasks.
  • The automated data collection pipeline enables rapid adaptation to any anime IP.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The first MLLM-based infinite anime life simulator; both the problem formulation and methodology are highly original.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Automatic and human evaluations are comprehensive, ablations are thorough, but additional baselines are lacking.
  • Writing Quality: ⭐⭐⭐⭐ — The framework is described clearly and supported by rich illustrations.
  • Value: ⭐⭐⭐⭐ — Strong commercial potential, though substantial improvements to actual gameplay experience are still needed.