Show-o2: Improved Native Unified Multimodal Models¶
Conference: NeurIPS 2025
arXiv: 2506.15564
Code: GitHub
Area: Unified Multimodal Models / Image Generation
Keywords: Unified multimodal models, autoregressive modeling, Flow Matching, 3D causal VAE, visual understanding and generation
TL;DR¶
This paper presents Show-o2, a natively unified multimodal model built upon autoregressive modeling and Flow Matching. By constructing unified visual representations in a 3D causal VAE space via dual-path spatial(-temporal) fusion, Show-o2 supports multimodal understanding and generation across text, images, and video, with a two-stage training strategy that effectively preserves language knowledge.
Background & Motivation¶
Large multimodal models (LMMs) and visual generation models have achieved impressive performance in visual understanding and image/video generation, respectively. Unified multimodal models (UMMs) attempt to integrate both capabilities within a single model. Existing approaches face the following challenges:
Unified visual representation: Multimodal understanding requires high-level semantic features (e.g., CLIP), while generation requires low-level structural details (e.g., VAE latents). These demands are fundamentally different. Existing methods either adopt a unified representation (Chameleon, Show-o) at the cost of one capability, or use decoupled representations (Janus series) at the cost of native unification.
Scalability to images and video: Most UMMs support only text and images; native support for video modality remains largely unexplored.
Knowledge forgetting during training: When training UMMs from a pretrained LLM, learning visual generation often leads to degradation of language knowledge unless large-scale text corpora are incorporated.
Show-o2's core innovation lies in constructing unified visual representations through a dual-path fusion mechanism in the 3D causal VAE space, supporting both images and video, while avoiding knowledge forgetting through a two-stage training strategy.
Method¶
Overall Architecture¶
Given interleaved text/image/video inputs, text is tokenized and mapped to embeddings, and visual inputs are encoded into visual latents via a 3D causal VAE encoder. The visual latents are processed through dual-path extraction and spatial(-temporal) fusion to produce unified visual representations. Text embeddings and unified visual representations are concatenated into a sequence and fed into the base language model. A language head predicts text tokens via autoregressive modeling, while a Flow head generates images/video via Flow Matching. An omni-attention mechanism is employed: causal attention along the sequence dimension and full attention within visual representations.
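To make the omni-attention concrete, below is a minimal sketch (not the authors' implementation) of how such a mask could be built: every token attends causally along the unified sequence, and tokens belonging to the same image/video segment additionally attend to each other bidirectionally. The function name, tensor layout, and the toy example are illustrative assumptions.

```python
import torch

def omni_attention_mask(is_visual: torch.Tensor, segment_id: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask (True = position may be attended to).

    is_visual:  (L,) bool, True where the token belongs to a visual segment.
    segment_id: (L,) int, identifies which image/video clip a visual token belongs to
                (the value is irrelevant for text positions).
    """
    L = is_visual.shape[0]
    # Default: causal attention along the unified token sequence.
    mask = torch.tril(torch.ones(L, L)).bool()
    # Full (bidirectional) attention among tokens of the same visual segment.
    same_segment = segment_id.unsqueeze(0) == segment_id.unsqueeze(1)
    both_visual = is_visual.unsqueeze(0) & is_visual.unsqueeze(1)
    mask |= same_segment & both_visual
    return mask

# Toy example: 3 text tokens, a 4-token image, then 2 more text tokens.
is_visual = torch.tensor([0, 0, 0, 1, 1, 1, 1, 0, 0], dtype=torch.bool)
segment_id = torch.tensor([0, 0, 0, 1, 1, 1, 1, 2, 2])
print(omni_attention_mask(is_visual, segment_id).int())
```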
Key Designs¶
- Unified Visual Representation (Dual-Path Spatial-Temporal Fusion)
A 3D causal VAE encoder extracts visual latents, which are then processed through a dual-path architecture:
- Semantic path \(\mathcal{S}(\cdot)\): Shared ViT blocks from SigLIP (with an added 2×2 patch embedding) extract high-level semantic information. A pre-distillation step enables this path to extract semantic features from both clean and noisy visual latents:
\(\mathcal{L}_{\text{distill}} = -\frac{1}{n}\sum\log\text{sim}(\mathcal{S}(\mathbf{x}_t), \text{SigLIP}(\mathbf{X}))\)
where \(\mathbf{x}_t = t \cdot \mathbf{x}_1 + (1-t) \cdot \mathbf{x}_0\) with \(t\) sampled uniformly from \([0,1]\) (\(\mathbf{x}_1\) being the clean latent and \(\mathbf{x}_0\) noise). After training, the cosine similarity between semantic features extracted from clean latents and the original SigLIP features reaches approximately 0.9.
- Projector \(\mathcal{P}(\cdot)\): A simple 2D patch embedding layer that retains complete low-level structural details.
The two feature streams are merged via a spatial(-temporal) fusion mechanism:
\(\mathbf{u} = \text{STF}(\mathcal{S}(\mathbf{x}_t), \mathcal{P}(\mathbf{x}_t))\)
Concretely, the two feature maps are concatenated along the channel dimension, followed by RMSNorm and a two-layer MLP; a minimal sketch of this fusion appears after this list. In the video setting, semantic and low-level features are naturally aligned along the temporal dimension.
Design Motivation: Understanding requires CLIP-level semantics, while generation requires VAE-level details. The dual-path design satisfies both requirements simultaneously within a unified latent space, and is naturally scalable to both images and video.
- Flow Head
A dedicated Flow head, consisting of several transformer layers with adaLN-Zero timestep modulation (analogous to DiT), is added alongside the language head to predict the velocity \(\mathbf{v}_t = d\mathbf{x}_t / dt\). The training objective is:
\(\mathcal{L} = \alpha\mathcal{L}_{\text{NTP}} + \mathcal{L}_{\text{FM}}\)
where \(\mathcal{L}_{\text{NTP}}\) denotes the next-token prediction loss and \(\mathcal{L}_{\text{FM}}\) denotes the flow matching loss.
- Two-Stage Training Strategy
- Stage 1: Only the projector, spatial(-temporal) fusion module, and Flow head are trained, using approximately 66M image-text pairs, with interleaved and video data progressively introduced. The language model parameters are frozen to preserve language knowledge. \(\alpha=0.2\).
- Stage 2: Full model fine-tuning (excluding the VAE), using 9M high-quality understanding instruction data and 16M high-quality generation data. \(\alpha=1.0\).
Design Motivation: Learning the visual generation components with the language model frozen, before fine-tuning the full model, removes the need for large-scale text corpora to preserve language knowledge.
Model Scaling: The pretrained Flow head from the 1.5B model is transferred to the 7B model via a lightweight MLP transformation that aligns hidden dimensions, enabling rapid adaptation.
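To ground the dual-path design above, here is a minimal PyTorch sketch of the semantic path, the projector, and the spatial fusion (channel-wise concat → RMSNorm → two-layer MLP). All dimensions, the number of ViT blocks, and the module names are illustrative assumptions; in the paper the semantic blocks are initialized from SigLIP and pre-distilled before this stage, which the sketch omits.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class DualPathFusion(nn.Module):
    """Fuse high-level semantics and low-level structure extracted from VAE latents."""
    def __init__(self, latent_ch=16, patch=2, sem_dim=1152, low_dim=1152, out_dim=2048):
        super().__init__()
        # Semantic path S(.): 2x2 patch embedding + (SigLIP-distilled) ViT blocks.
        self.sem_patch = nn.Conv2d(latent_ch, sem_dim, kernel_size=patch, stride=patch)
        self.sem_blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(sem_dim, nhead=8, batch_first=True), num_layers=2)
        # Projector P(.): a plain patch embedding that keeps low-level structural details.
        self.low_patch = nn.Conv2d(latent_ch, low_dim, kernel_size=patch, stride=patch)
        # Spatial(-temporal) fusion: channel-wise concat -> RMSNorm -> two-layer MLP.
        self.norm = RMSNorm(sem_dim + low_dim)
        self.mlp = nn.Sequential(
            nn.Linear(sem_dim + low_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))

    def forward(self, x_t):                                     # x_t: (B, C, H, W) VAE latent
        sem = self.sem_blocks(self.sem_patch(x_t).flatten(2).transpose(1, 2))  # (B, N, sem_dim)
        low = self.low_patch(x_t).flatten(2).transpose(1, 2)                   # (B, N, low_dim)
        u = torch.cat([sem, low], dim=-1)                       # concat along channels
        return self.mlp(self.norm(u))                           # unified visual tokens

u = DualPathFusion()(torch.randn(1, 16, 54, 54))
print(u.shape)  # torch.Size([1, 729, 2048])
```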
Loss & Training¶
- Semantic path pre-distillation: 200K iterations, batch size 512, cosine schedule with lr 2e-5
- Stage 1 (1.5B): 150K iterations, 64 H100 GPUs, approximately 1.5 days
- Stage 2: approximately 35K iterations, approximately 15 hours
- 7B model: 128 H100 GPUs, approximately 2.5 days
- Generation data captions are dropped with probability 0.1 to enable classifier-free guidance
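As a rough illustration of how the caption dropout and the two objectives combine in a single training step, here is a sketch under simplifying assumptions; `model`, `flow_head`, and the batch fields are hypothetical names, not the released training code.

```python
import torch
import torch.nn.functional as F

def training_step(model, flow_head, batch, alpha=0.2, cfg_drop_p=0.1):
    # Classifier-free guidance: drop each generation caption with probability 0.1.
    captions = ["" if torch.rand(()).item() < cfg_drop_p else c for c in batch["captions"]]

    # Flow-matching interpolation between noise x0 and clean VAE latents x1.
    x1 = batch["vae_latents"]                       # (B, C, H, W)
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1, 1, 1, device=x1.device)
    x_t = t * x1 + (1 - t) * x0
    target_v = x1 - x0                              # v_t = dx_t/dt for the linear path

    # One pass over the unified sequence (text embeddings + unified visual tokens).
    hidden, text_logits = model(captions, x_t, t)

    # Next-token prediction on text positions (label shifting assumed done upstream).
    loss_ntp = F.cross_entropy(
        text_logits.flatten(0, 1), batch["text_labels"].flatten(), ignore_index=-100)

    # Flow matching: the Flow head predicts the velocity at the visual positions.
    loss_fm = F.mse_loss(flow_head(hidden, t), target_v)

    return alpha * loss_ntp + loss_fm               # alpha = 0.2 (stage 1), 1.0 (stage 2)
```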
Key Experimental Results¶
Multimodal Understanding (Image)¶
| Model | Params | MME↑ | GQA↑ | SEED↑ | MMB↑ | MMMU↑ | MMStar↑ | AI2D↑ |
|---|---|---|---|---|---|---|---|---|
| Janus-Pro | 1.5B | 1444.0 | 59.3 | 68.3 | 75.5 | 36.3 | - | - |
| Show-o | 1.3B | 1097.2 | 58.0 | 51.5 | - | 27.4 | - | - |
| Show-o2 | 1.5B | 1450.9 | 60.0 | 65.6 | 67.4 | 37.1 | 43.4 | 69.0 |
| Janus-Pro | 7B | 1567.1 | 62.0 | 72.1 | 79.2 | 41.0 | - | - |
| TokenFlow-XL* | 14B | 1551.1 | 62.5 | 72.6 | 76.8 | 43.2 | - | 75.9 |
| Show-o2 | 7B | 1620.5 | 63.1 | 69.8 | 79.3 | 48.9 | 56.6 | 78.6 |
Image Generation (GenEval)¶
| Model | Params | Training Data | Single Obj | Two Obj | Counting | Colors | Position | Color Attr | Overall↑ |
|---|---|---|---|---|---|---|---|---|---|
| Janus-Pro | 7B | 144M | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| BAGEL | 14B | 1600M | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88 |
| Show-o2 | 1.5B | 66M | 0.99 | 0.86 | 0.55 | 0.86 | 0.46 | 0.63 | 0.73 |
| Show-o2 | 7B | 66M | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 |
Key Findings¶
- The 7B Show-o2 surpasses Janus-Pro at the same scale and TokenFlow-XL at 14B on multimodal understanding, achieving MME of 1620.5 and MMMU of 48.9.
- On generation tasks, Show-o2 trained on 66M data remains competitive with Janus-Pro trained on 144M data (0.76 vs. 0.80), though a gap remains compared to BAGEL trained on 1600M data (0.88).
- The cosine similarity of semantic features after pre-distillation reaches 0.9, validating the feasibility of extracting CLIP-level semantics from VAE latent space.
- The two-stage training strategy effectively preserves language knowledge without requiring large-scale text corpora.
- Video understanding also performs well after subsequent fine-tuning (7B model: ActivityNet-QA 56.4, VideoMME 57.4/60.9).
Highlights & Insights¶
- The dual-path unified visual representation is the central contribution: encoding both semantics and structural details within the VAE latent space eliminates the redundancy of separate CLIP and VAE encoders.
- Pre-distilling the semantic path into the VAE space is an elegant design choice — enabling the same set of latents to support both understanding and generation.
- Reusing the Flow head from the 1.5B model when scaling to 7B reduces large-model training cost.
- Native support for text, image, and video within a single unified model remains rare among current UMMs.
Limitations & Future Work¶
- Image generation quality (GenEval 0.76) still lags behind the strongest reported models (BAGEL 0.88, Mogao 0.89).
- Due to computational constraints, the 7B model did not incorporate interleaved or video training data; video generation capability is demonstrated only in the 1.5B model.
- The semantic path has a strong dependency on SigLIP, limiting flexibility in substituting the base visual encoder.
- The effect of scaling image resolution (432→1024) is not reported in detail.
Related Work & Insights¶
- The dual-path fusion mechanism offers insights into how to simultaneously preserve semantic and structural information within a single latent space.
- The two-stage training approach (first freezing the LLM to learn visual generation, then fine-tuning globally) is a practical strategy for mitigating catastrophic forgetting.
- Comparisons with Transfusion, Chameleon, and related methods suggest that AR + Flow Matching constitutes a competitive hybrid paradigm.
Rating¶
- Novelty: ⭐⭐⭐⭐ The dual-path unified representation and semantic path distillation constitute a genuinely novel design.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks covering both understanding and generation are evaluated, though generation-side comparisons are somewhat limited.
- Writing Quality: ⭐⭐⭐⭐ Architecture descriptions are clear and training details are comprehensive.
- Value: ⭐⭐⭐⭐ Provides a scalable reference design for natively unified multimodal models.