
Show-o2: Improved Native Unified Multimodal Models

Conference: NeurIPS 2025
arXiv: 2506.15564
Code: GitHub
Area: Unified Multimodal Models / Image Generation
Keywords: Unified multimodal models, autoregressive modeling, Flow Matching, 3D causal VAE, visual understanding and generation

TL;DR

This paper presents Show-o2, a natively unified multimodal model built upon autoregressive modeling and Flow Matching. By constructing unified visual representations in a 3D causal VAE space via dual-path spatial(-temporal) fusion, Show-o2 supports multimodal understanding and generation across text, images, and video, with a two-stage training strategy that effectively preserves language knowledge.

Background & Motivation

Large multimodal models (LMMs) and visual generation models have achieved impressive performance in visual understanding and image/video generation, respectively. Unified multimodal models (UMMs) attempt to integrate both capabilities within a single model. Existing approaches face the following challenges:

Unified visual representation: Multimodal understanding requires high-level semantic features (e.g., CLIP), while generation requires low-level structural details (e.g., VAE latents). These demands are fundamentally different. Existing methods either adopt a unified representation (Chameleon, Show-o) at the cost of one capability, or use decoupled representations (Janus series) at the cost of native unification.

Scalability to images and video: Most UMMs support only text and images; native support for video modality remains largely unexplored.

Knowledge forgetting during training: When training UMMs from a pretrained LLM, learning visual generation often leads to degradation of language knowledge unless large-scale text corpora are incorporated.

Show-o2's core innovation lies in constructing unified visual representations through a dual-path fusion mechanism in the 3D causal VAE space, supporting both images and video, while avoiding knowledge forgetting through a two-stage training strategy.

Method

Overall Architecture

Given interleaved text/image/video inputs, text is converted to embeddings via a tokenizer, and visual inputs are encoded into visual latents via a 3D causal VAE encoder. The visual latents are processed through dual-path extraction and spatial(-temporal) fusion to produce unified visual representations. Text embeddings and unified visual representations are concatenated into a sequence and fed into the base language model. A language head predicts text tokens via autoregressive modeling, while a Flow head generates images/video via Flow Matching. An omni-attention mechanism is employed: causal attention along the sequence dimension and full attention within visual representations.
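As a rough illustration of that omni-attention pattern, here is a minimal PyTorch sketch of one way such a mask could be built (the span bookkeeping is illustrative, not the released implementation):

```python
import torch

def omni_attention_mask(seq_len, visual_spans):
    """Boolean attention mask (True = attention allowed).

    Tokens attend causally along the sequence; tokens inside the same
    visual span additionally attend to each other bidirectionally.
    visual_spans: list of (start, end) index ranges (end exclusive) marking
    where unified visual representations sit in the packed sequence.
    """
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()  # causal base
    for start, end in visual_spans:
        mask[start:end, start:end] = True                   # full attention within a span
    return mask

# Example: 4 text tokens, a 6-token image span, then 3 more text tokens.
mask = omni_attention_mask(13, visual_spans=[(4, 10)])
```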

Key Designs

  1. Unified Visual Representation (Dual-Path Spatial-Temporal Fusion)

A 3D causal VAE encoder extracts visual latents, which are then processed through a dual-path architecture:

  • Semantic path \(\mathcal{S}(\cdot)\): Shared ViT blocks from SigLIP (with an added 2×2 patch embedding) extract high-level semantic information. A pre-distillation step enables this path to extract semantic features from both clean and noisy visual latents:

\(\mathcal{L}_{\text{distill}} = -\frac{1}{n}\sum\log\text{sim}(\mathcal{S}(\mathbf{x}_t), \text{SigLIP}(\mathbf{X}))\)

where \(\mathbf{x}_t = t \cdot \mathbf{x}_1 + (1-t) \cdot \mathbf{x}_0\) with \(t\) sampled from \([0,1]\) (a code sketch of this objective follows the list below). After pre-distillation, the cosine similarity between semantic features extracted from clean latents and the original SigLIP features reaches approximately 0.9.

  • Projector \(\mathcal{P}(\cdot)\): A simple 2D patch embedding layer that retains complete low-level structural details.
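Before the two streams are merged, the semantic path is pre-distilled as described above. A minimal sketch of that objective, assuming `semantic_path` stands for the SigLIP-initialized ViT path and `siglip_feats` for frozen teacher features; the clamp is a numerical-safety addition of this sketch, not from the paper:

```python
import torch
import torch.nn.functional as F

def pre_distillation_loss(x0, x1, semantic_path, siglip_feats):
    """L_distill = -1/n * sum log sim(S(x_t), SigLIP(X)).

    x0: Gaussian noise with the same shape as the clean VAE latents x1.
    siglip_feats: teacher features of the original image X from frozen SigLIP.
    """
    # Noisy latent on the linear interpolation path, t sampled from [0, 1].
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, *([1] * (x1.dim() - 1)))
    x_t = t * x1 + (1.0 - t) * x0

    student = semantic_path(x_t)                              # (B, N, D) token features
    sim = F.cosine_similarity(student, siglip_feats, dim=-1)  # per-token cosine similarity
    # Clamp keeps the log finite when similarity is small or negative early in training.
    return -torch.log(sim.clamp(min=1e-6)).mean()
```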

The two feature streams are merged via a spatial(-temporal) fusion mechanism:

\(\mathbf{u} = \text{STF}(\mathcal{S}(\mathbf{x}_t), \mathcal{P}(\mathbf{x}_t))\)

Concretely, the features are concatenated along the channel dimension, followed by RMSNorm and a two-layer MLP. In the video setting, semantic and low-level features are naturally aligned along the temporal dimension.
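A minimal PyTorch-style sketch of that fusion step, assuming both paths already emit aligned token sequences (layer widths and the SiLU activation are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SpatialTemporalFusion(nn.Module):
    """Channel-concatenate the two paths, then RMSNorm and a two-layer MLP."""

    def __init__(self, sem_dim, low_dim, out_dim):
        super().__init__()
        self.norm = nn.RMSNorm(sem_dim + low_dim)    # requires PyTorch >= 2.4
        self.mlp = nn.Sequential(
            nn.Linear(sem_dim + low_dim, out_dim),
            nn.SiLU(),                               # activation choice is illustrative
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, sem_feats, low_feats):
        # Both inputs are (B, N, dim) token sequences, aligned token-by-token
        # (and frame-by-frame in the video setting).
        u = torch.cat([sem_feats, low_feats], dim=-1)
        return self.mlp(self.norm(u))
```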

Design Motivation: Understanding requires CLIP-level semantics, while generation requires VAE-level details. The dual-path design satisfies both requirements simultaneously within a unified latent space, and is naturally scalable to both images and video.

  2. Flow Head

A dedicated Flow head, consisting of several transformer layers with adaLN-Zero timestep modulation (analogous to DiT), is added alongside the language head to predict the velocity \(\mathbf{v}_t = d\mathbf{x}_t / dt\). The training objective is:

\(\mathcal{L} = \alpha\mathcal{L}_{\text{NTP}} + \mathcal{L}_{\text{FM}}\)

where \(\mathcal{L}_{\text{NTP}}\) denotes the next-token prediction loss and \(\mathcal{L}_{\text{FM}}\) denotes the flow matching loss.
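A hedged sketch of how the two objectives combine during training (the inputs are simplified placeholders; the velocity target follows the linear interpolation path defined earlier):

```python
import torch
import torch.nn.functional as F

def unified_loss(text_logits, text_targets, hidden_states, x0, x1, t, flow_head, alpha):
    """L = alpha * L_NTP + L_FM for one packed multimodal sequence."""
    # Autoregressive next-token prediction on the text positions (language head output).
    l_ntp = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())

    # Flow matching: with x_t = t*x1 + (1-t)*x0, the target velocity is
    # v_t = dx_t/dt = x1 - x0, predicted by the Flow head from the LLM hidden states.
    v_pred = flow_head(hidden_states, t)     # adaLN-Zero timestep modulation inside
    l_fm = F.mse_loss(v_pred, x1 - x0)

    return alpha * l_ntp + l_fm
```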

  3. Two-Stage Training Strategy

  • Stage 1: Only the projector, spatial(-temporal) fusion module, and Flow head are trained, using approximately 66M image-text pairs, with interleaved and video data progressively introduced. The language model parameters are frozen to preserve language knowledge (see the sketch below). \(\alpha=0.2\).

  • Stage 2: Full model fine-tuning (excluding the VAE), using 9M high-quality understanding instruction data and 16M high-quality generation data. \(\alpha=1.0\).

Design Motivation: Training the visual generation components on top of a frozen LLM before global fine-tuning removes the need to mix in large-scale text corpora to prevent forgetting.
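In code, stage 1 amounts to freezing the language backbone and optimizing only the newly added components. A minimal sketch, assuming the model exposes `projector`, `stf_fusion`, and `flow_head` submodules (names and learning rate are illustrative):

```python
import torch

# Freeze everything, then re-enable gradients for the new visual components only.
for p in model.parameters():
    p.requires_grad = False                      # keep the pretrained LLM intact
for module in (model.projector, model.stf_fusion, model.flow_head):
    for p in module.parameters():
        p.requires_grad = True                   # train only the added components

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,                                     # learning rate here is illustrative
)
```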

Model Scaling: The pretrained Flow head from the 1.5B model is transferred to the 7B model via a lightweight MLP transformation that aligns hidden dimensions, enabling rapid adaptation.
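One plausible realization of that transfer, as a sketch (the adapter architecture and dimension names are assumptions, not the released code):

```python
import torch.nn as nn

class FlowHeadAdapter(nn.Module):
    """Project the 7B backbone's hidden states into the width the Flow head
    was pretrained with alongside the 1.5B backbone, then reuse its weights."""

    def __init__(self, hidden_7b, hidden_1p5b, pretrained_flow_head):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_7b, hidden_1p5b),
            nn.SiLU(),
            nn.Linear(hidden_1p5b, hidden_1p5b),
        )
        self.flow_head = pretrained_flow_head    # weights transferred from the 1.5B model

    def forward(self, hidden_states, t):
        return self.flow_head(self.proj(hidden_states), t)
```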

Loss & Training

  • Semantic path pre-distillation: 200K iterations, batch size 512, cosine schedule with lr 2e-5
  • Stage 1 (1.5B): 150K iterations, 64 H100 GPUs, approximately 1.5 days
  • Stage 2: approximately 35K iterations, approximately 15 hours
  • 7B model: 128 H100 GPUs, approximately 2.5 days
  • Generation data captions are dropped with probability 0.1 to enable classifier-free guidance
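Dropping captions during training is what enables classifier-free guidance at sampling time. A hedged sketch of both sides (function names and the guidance scale are illustrative, not from the paper):

```python
import torch

def maybe_drop_caption(caption, p_drop=0.1):
    """Training-time caption dropout that creates the unconditional branch."""
    return "" if torch.rand(()).item() < p_drop else caption

def guided_velocity(v_cond, v_uncond, scale=5.0):
    """Classifier-free guidance applied to the predicted velocity at each sampling step."""
    return v_uncond + scale * (v_cond - v_uncond)
```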

Key Experimental Results

Multimodal Understanding (Image)

| Model | Params | MME↑ | GQA↑ | SEED↑ | MMB↑ | MMMU↑ | MMStar↑ | AI2D↑ |
|---|---|---|---|---|---|---|---|---|
| Janus-Pro | 1.5B | 1444.0 | 59.3 | 68.3 | 75.5 | 36.3 | - | - |
| Show-o | 1.3B | 1097.2 | 58.0 | 51.5 | - | 27.4 | - | - |
| Show-o2 | 1.5B | 1450.9 | 60.0 | 65.6 | 67.4 | 37.1 | 43.4 | 69.0 |
| Janus-Pro | 7B | 1567.1 | 62.0 | 72.1 | 79.2 | 41.0 | - | - |
| TokenFlow-XL* | 14B | 1551.1 | 62.5 | 72.6 | 76.8 | 43.2 | - | 75.9 |
| Show-o2 | 7B | 1620.5 | 63.1 | 69.8 | 79.3 | 48.9 | 56.6 | 78.6 |

Image Generation (GenEval)

| Model | Params | Training Data | Single Obj | Two Obj | Counting | Colors | Position | Color Attr | Overall↑ |
|---|---|---|---|---|---|---|---|---|---|
| Janus-Pro | 7B | 144M | 0.99 | 0.89 | 0.59 | 0.90 | 0.79 | 0.66 | 0.80 |
| BAGEL | 14B | 1600M | 0.98 | 0.95 | 0.84 | 0.95 | 0.78 | 0.77 | 0.88 |
| Show-o2 | 1.5B | 66M | 0.99 | 0.86 | 0.55 | 0.86 | 0.46 | 0.63 | 0.73 |
| Show-o2 | 7B | 66M | 1.00 | 0.87 | 0.58 | 0.92 | 0.52 | 0.62 | 0.76 |

Key Findings

  1. The 7B Show-o2 surpasses Janus-Pro at the same scale and TokenFlow-XL at 14B on multimodal understanding, achieving MME of 1620.5 and MMMU of 48.9.
  2. On generation tasks, Show-o2 trained on 66M data remains competitive with Janus-Pro trained on 144M data (0.76 vs. 0.80), though a gap remains compared to BAGEL trained on 1600M data (0.88).
  3. The cosine similarity of semantic features after pre-distillation reaches 0.9, validating the feasibility of extracting CLIP-level semantics from VAE latent space.
  4. The two-stage training strategy effectively preserves language knowledge without requiring large-scale text corpora.
  5. Video understanding also performs well after subsequent fine-tuning (7B model: ActNet-QA 56.4, VideoMME 57.4/60.9).

Highlights & Insights

  • The dual-path unified visual representation is the central contribution: encoding both semantics and structural details within the VAE latent space eliminates the redundancy of separate CLIP and VAE encoders.
  • Pre-distilling the semantic path into the VAE space is an elegant design choice — enabling the same set of latents to support both understanding and generation.
  • Reusing the Flow head from the 1.5B model when scaling to 7B reduces large-model training cost.
  • Native support for text, image, and video within a single unified model remains rare among current UMMs.

Limitations & Future Work

  • Image generation quality (GenEval 0.76) still lags behind state-of-the-art dedicated generation models (BAGEL 0.88, Mogao 0.89).
  • Due to computational constraints, the 7B model did not incorporate interleaved or video training data; video generation capability is demonstrated only in the 1.5B model.
  • The semantic path has a strong dependency on SigLIP, limiting flexibility in substituting the base visual encoder.
  • The effect of scaling image resolution (432→1024) is not reported in detail.
  • The dual-path fusion mechanism offers insights into how to simultaneously preserve semantic and structural information within a single latent space.
  • The two-stage training approach (first freezing the LLM to learn visual generation, then fine-tuning globally) is a practical strategy for mitigating catastrophic forgetting.
  • Comparisons with Transfusion, Chameleon, and related methods suggest that AR + Flow Matching constitutes a competitive hybrid paradigm.

Rating

  • Novelty: ⭐⭐⭐⭐ The dual-path unified representation and semantic path distillation constitute a genuinely novel design.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks covering both understanding and generation are evaluated, though generation-side comparisons are somewhat limited.
  • Writing Quality: ⭐⭐⭐⭐ Architecture descriptions are clear and training details are comprehensive.
  • Value: ⭐⭐⭐⭐ Provides a scalable reference design for natively unified multimodal models.