
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

Conference: ICCV 2025
arXiv: 2503.21979
Code: GitHub
Area: Multimodal / Unified Generation & Understanding
Keywords: MAR encoder, unified visual representation, masked autoregression, image generation and understanding, three-stage training

TL;DR

This work identifies that the encoder of masked autoregressive (MAR) models inherently possesses both the fine-grained image features required for generation and the high-level semantic representations required for understanding. Based on this observation, Harmon is proposed — an autoregressive framework that unifies image generation and understanding via a shared MAR encoder. Through three-stage progressive training, Harmon achieves an Overall score of 0.76 on GenEval, surpassing unified models of comparable scale (e.g., Janus-Pro-1.5B at 0.73), while matching the understanding performance of the Janus series, which relies on a dedicated SigLIP encoder.

Background & Motivation

Background: Unifying image generation and understanding has become a key direction for next-generation multimodal intelligence. Existing approaches either loosely couple diffusion models with MLLMs (weak interaction), or unify visual representations through VQ discretization or VAE encoding (tight coupling).

Limitations of Prior Work: (1) VQGAN and VAE encoders are pretrained primarily for pixel-level reconstruction and lack high-level semantics, performing significantly worse than CLIP/SigLIP encoders on understanding tasks; (2) Methods such as Janus employ separate encoders for generation and understanding — effective, but at the cost of abandoning the cross-task synergy potential of a unified representation; (3) VILA-U attempts joint training of contrastive alignment and reconstruction on VQ tokens, but struggles to balance semantic alignment with pixel fidelity.

Key Challenge: Understanding requires coarse-grained, high-level semantics, while generation requires fine-grained pixel features — how can a single encoder satisfy both heterogeneous demands simultaneously?

Goal: To identify a visual representation that naturally accommodates both generation and understanding, and to construct a truly unified framework with a shared encoder.

Key Insight: Masked image modeling (MIM) learns rich semantics through mask-and-reconstruct pretraining; MAR extends MIM to autoregressive generation — its encoder may inherently possess dual capabilities.

Core Idea: The "learning to understand through generation" property of the MAR encoder makes it an ideal candidate for a unified encoder — understanding capability emerges as a byproduct of generative pretraining.

Method

Overall Architecture

Harmon consists of three components: a MAR encoder \(f_{\text{enc}}\), an LLM \(f_{\text{LLM}}\) (Qwen2.5), and a MAR decoder \(f_{\text{dec}}\). Generation path: text prompt → LLM → interaction with MAR encoder outputs → MAR decoder predicts masked patches (masked autoregression, \(K=64\) steps). Understanding path: all image patches are fed into the MAR encoder → encoder outputs + text embeddings → LLM performs next-token prediction to answer questions. Both paths share the same MAR encoder. A three-stage progressive training scheme is employed to unlock capabilities incrementally.
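
As a rough sketch of how both paths route through the shared encoder (module names, call signatures, and tensor shapes below are illustrative assumptions, not Harmon's released interfaces):

```python
import torch

# Minimal sketch of the two paths through the shared MAR encoder; all module
# interfaces here are hypothetical stand-ins for the components named above.

def understanding_forward(mar_encoder, llm, image_patches, text_ids):
    """Understanding path: all patches -> MAR encoder -> LLM next-token prediction."""
    z_enc = mar_encoder(image_patches)                  # (B, N, D) visual features
    text_emb = llm.embed_tokens(text_ids)               # (B, T, D) text embeddings (assumed API)
    hidden = llm(torch.cat([z_enc, text_emb], dim=1))   # joint image-text sequence
    return llm.lm_head(hidden)                          # logits used to answer the question

def generation_step(mar_encoder, llm, mar_decoder, prompt_emb, x_seen, x_buffer, mask_idx):
    """One masked-autoregressive step: visible patches -> encoder -> LLM -> decoder."""
    z_enc = mar_encoder(x_seen, x_buffer)                # Z_enc = f_enc(X_seen, X_buffer)
    cond = llm(torch.cat([prompt_emb, z_enc], dim=1))    # text-conditioned features
    return mar_decoder(cond, mask_idx)                   # predict patches at masked positions
```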

Key Designs

  1. Shared MAR Encoder for Dual Tasks:

    • Function: A single MIM-pretrained MAR encoder serves both image generation and understanding.
    • Mechanism: During generation, the encoder receives visible patches \(\mathbf{X}_{\text{seen}}\) and buffer embeddings \(\mathbf{X}_{\text{buffer}}\), producing \(\mathbf{Z}_{\text{enc}} = f_{\text{enc}}(\mathbf{X}_{\text{seen}}, \mathbf{X}_{\text{buffer}})\), which is passed through the LLM and then to the decoder to predict masked patches. During understanding, the encoder receives all patches and outputs visual representations for LLM-based text generation.
    • Design Motivation: Linear probing experiments show that MAR encoder features achieve substantially higher accuracy on ImageNet than VQGAN/VAE features, and GradCAM++ visualizations demonstrate precise activation responses to visual concepts — generative pretraining has implicitly learned semantic representations.
  2. Three-Stage Progressive Training:

    • Function: Capabilities are unlocked stage by stage to avoid task conflicts.
    • Mechanism: Stage I (visual-language alignment): 22M image-text pairs are used to train the MAR encoder and decoder at 256 resolution with the LLM frozen. Stage II (comprehensive multimodal training): the LLM is unfrozen and the full model is jointly trained on 25M QA samples and 50M image-text pairs at 256 resolution. Stage III (high-quality fine-tuning): training continues on high-quality QA data and 10M curated images at 512 resolution (see the configuration sketch after this list).
    • Design Motivation: Direct end-to-end training leads to interference between generation and understanding tasks. The staged approach — alignment → capability building → quality refinement — allows the encoder to be fully optimized for both tasks at each stage.
  3. Masked Autoregressive Generation with Diffusion Decoding:

    • Function: Images are generated progressively, with the masking ratio decreasing on a cosine schedule (a sketch of the schedule and guidance step follows this list).
    • Mechanism: Starting from full masking \(m_0 = hw\), the number of masked tokens decreases over \(K\) steps following a cosine schedule \(m_k = hw \cdot \cos(\frac{k}{2K}\pi)\). At each step, the decoder uses a lightweight MLP as a denoiser to predict masked patches, with the loss \(\mathcal{L} = \mathbb{E}_{\varepsilon,t}[\|\varepsilon - \varepsilon_\theta(x_t|t, x_{\text{mask}})\|^2]\). Classifier-free guidance (CFG, weight 3.0) is applied at inference to enhance text control.
    • Design Motivation: The masked autoregressive paradigm of MAR naturally aligns with the causal attention of LLMs, and \(K\)-step inference is substantially more efficient than pure token-by-token autoregression.
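
For reference, the staged schedule can be summarized in configuration form. The sketch below is illustrative: the dataset sizes, resolutions, and Stage I/II freezing follow the description above, while the field names and the assumption that the full model remains trainable in Stage III are assumptions.

```python
# Configuration-style summary of the three training stages (field names illustrative).
TRAINING_STAGES = [
    {   # Stage I: visual-language alignment (LLM frozen)
        "trainable": ["mar_encoder", "mar_decoder"],
        "frozen": ["llm"],
        "data": {"image_text_pairs": 22_000_000},
        "resolution": 256,
    },
    {   # Stage II: comprehensive multimodal training (LLM unfrozen)
        "trainable": ["mar_encoder", "mar_decoder", "llm"],
        "frozen": [],
        "data": {"qa_samples": 25_000_000, "image_text_pairs": 50_000_000},
        "resolution": 256,
    },
    {   # Stage III: high-quality fine-tuning (full model assumed trainable)
        "trainable": ["mar_encoder", "mar_decoder", "llm"],
        "frozen": [],
        "data": {"qa_samples": "high-quality subset", "curated_images": 10_000_000},
        "resolution": 512,
    },
]
```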
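The cosine masking schedule and the classifier-free guidance combination are simple to write down explicitly. The sketch below assumes \(hw = 256\) tokens and \(K = 64\) steps, and omits the decoder's token selection and denoising details:

```python
import math

def num_masked(k: int, K: int = 64, hw: int = 256) -> int:
    """Cosine schedule from the text: m_k = hw * cos(k * pi / (2K)).
    m_0 = hw (fully masked), m_K = 0 (all tokens generated)."""
    return round(hw * math.cos(math.pi * k / (2 * K)))

# Tokens revealed at each of the K steps: few at first, more toward the end.
reveal_per_step = [num_masked(k) - num_masked(k + 1) for k in range(64)]
assert sum(reveal_per_step) == 256 and all(r >= 0 for r in reveal_per_step)

def cfg_combine(cond_pred, uncond_pred, w: float = 3.0):
    """Classifier-free guidance as commonly applied (weight 3.0 per the text):
    guided = uncond + w * (cond - uncond)."""
    return uncond_pred + w * (cond_pred - uncond_pred)
```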

Loss & Training

Generation: diffusion loss (MSE noise prediction) + CFG training with 10% empty captions. Understanding: cross-entropy loss (computed only on answer tokens). The two losses are mixed according to data ratios (Stage III ratio 1:3:16). Total training cost: Harmon-1.5B trained on 32×A100 for 8 days.
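
As a minimal sketch of how the two objectives could be mixed in one training loop (`model.denoise` and `model.answer` are hypothetical helpers; only the 10% caption dropout, the MSE noise-prediction loss, and the answer-only cross-entropy follow the description above):

```python
import random
import torch.nn.functional as F

def training_step(batch, model):
    if batch["task"] == "generation":
        # drop the caption 10% of the time so CFG can be applied at inference
        prompt = "" if random.random() < 0.1 else batch["caption"]
        noise_pred, noise = model.denoise(batch["image"], prompt)   # hypothetical helper
        return F.mse_loss(noise_pred, noise)                        # diffusion (noise-prediction) loss
    else:  # understanding
        logits = model.answer(batch["image"], batch["question"])    # hypothetical helper
        # cross-entropy on answer tokens only; non-answer positions carry ignore_index
        return F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            batch["answer_ids"].view(-1),
            ignore_index=-100,
        )
```

Batches from the different data sources would then be interleaved according to the stage's data ratio.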

Key Experimental Results

Main Results

Image understanding (multimodal QA benchmarks):

| Model | Encoder | LLM Size | POPE↑ | MME-P↑ | MMB↑ | SEED↑ | MMMU↑ |
|---|---|---|---|---|---|---|---|
| Janus-Pro† | SigLIP | 1.5B | 86.2 | 1444 | 75.5 | 68.3 | 36.3 |
| Show-o | MAGVIT-v2 | 1.3B | 80.0 | 1097 | 51.6 | 54.4 | 26.7 |
| Harmon-1.5B | MAR-H | 1.5B | 87.6 | 1155 | 65.5 | 67.1 | 38.9 |

Image generation (GenEval benchmark):

| Model | Single | Two | Count | Colors | Position | ColorAttr | Overall↑ |
|---|---|---|---|---|---|---|---|
| Janus-Pro-1.5B | 0.98 | 0.82 | 0.51 | 0.89 | 0.65 | 0.56 | 0.73 |
| SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| Harmon-1.5B | 0.99 | 0.86 | 0.66 | 0.85 | 0.74 | 0.48 | 0.76 |

Visual quality (MJHQ-30K FID↓):

| Model | MJHQ-30K FID↓ |
|---|---|
| Janus-Pro-1.5B | 9.53 |
| Show-o | 15.18 |
| Harmon-1.5B | 5.15 |

Ablation Study

Synergistic effects of the shared encoder:

| Analysis | Finding |
|---|---|
| Understanding vs. dedicated encoder | POPE: Harmon 87.6 ≈ Janus-Pro 86.2; MMMU: Harmon 38.9 > Janus-Pro 36.3 |
| Generation vs. VQ-based methods | GenEval: Harmon 0.76 >> Show-o 0.53, LWM 0.47 |
| Shared vs. separate pathways | Joint training with understanding data improves generation performance (cross-task synergy observed in the paper's figures) |

Key Findings

  1. The MAR encoder genuinely possesses dual capabilities: The shared encoder matches dedicated SigLIP encoders on understanding benchmarks.
  2. Substantial improvements over VQ/VAE unified methods: Harmon significantly outperforms Show-o, LWM, and Chameleon on all understanding benchmarks.
  3. State-of-the-art generation quality: an MJHQ-30K FID of 5.15 leads the compared unified models by a wide margin, and the GenEval Overall of 0.76 surpasses Janus-Pro-1.5B (0.73).
  4. Cross-task synergy is real: Joint training with understanding data improves generation performance, validating the value of unification over separation.

Highlights & Insights

  • "Learning to understand through generation" insight: MAR's MIM pretraining leads the encoder to jointly learn pixel fidelity and semantic representations — this finding has independent value beyond the proposed system.
  • Rigorous empirical validation: Linear probing and GradCAM++ analyses provide compelling preliminary evidence that the MAR encoder is well-suited for unified modeling.
  • Elegant design: Shared encoder + LLM + decoder, with no additional branches or adapters.
  • MJHQ FID of 5.15: A commanding lead among unified models, demonstrating that generation quality need not be sacrificed.

Limitations & Future Work

  • High training cost (32×A100, 8 days), making reproduction difficult for small research groups.
  • Image resolution is limited to 512×512; high-resolution generation has not been validated.
  • Understanding performance still lags behind dedicated MLLMs (e.g., InternVL2, Qwen2-VL).
  • Unified video understanding and generation remains unexplored.
  • The Stage III data ratio (1:3:16) is skewed toward generation, potentially leaving the understanding side under-optimized.

Comparison with Related Methods

  • vs. Janus/Janus-Pro: These methods use a dedicated SigLIP encoder for understanding — Harmon demonstrates that a shared encoder can achieve comparable performance.
  • vs. Show-o/D-DiT: These methods rely on VQGAN/VAE encoders — substantially weaker on understanding tasks; the MAR encoder is the key differentiator.
  • vs. VILA-U: Joint training of semantic alignment and reconstruction on VQ tokens proves difficult to balance; the MAR encoder naturally accommodates both objectives.
  • Broader insight: "Generation is a sufficient condition for understanding" — Feynman's dictum, "What I cannot create, I do not understand," finds empirical support in this work.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The insight of using the MAR encoder for a unified framework is original and thoroughly validated.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive comparisons across both understanding and generation dimensions, with complete ablations and preliminary analyses.
  • Writing Quality: ⭐⭐⭐⭐⭐ Motivation is compellingly argued; experimental results are presented clearly and systematically.
  • Value: ⭐⭐⭐⭐⭐ Makes a paradigmatic contribution to unified multimodal architecture design; the discovery of MAR as a unified encoder is likely to influence subsequent work.