ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance¶
Conference: ICCV 2025 arXiv: 2412.06673 Code: Not publicly available Area: Image Generation Keywords: Unified multimodal model, visual understanding and generation, semantic visual tokenizer, self-enhancement alignment, next-token prediction
TL;DR¶
This paper proposes ILLUME, a unified MLLM that integrates multimodal understanding and generation capabilities into a single LLM via a unified next-token prediction paradigm. Through a semantic visual tokenizer (reducing pretraining data by 4× to 15M) and a self-enhancement multimodal alignment scheme (enabling the model to self-evaluate the consistency between its generated images and text), ILLUME achieves competitive or superior performance compared to state-of-the-art unified models across diverse understanding, generation, and editing tasks.
Background & Motivation¶
Core Problem¶
How to construct an efficient unified MLLM that supports visual understanding, image generation, and image editing within a single framework?
Limitations of Prior Work¶
Tool-calling approaches (e.g., LLaVA + DALL-E): understanding and generation live in separate models, so neither can be jointly optimized, which caps the overall model's potential.
Regression-based unified models (e.g., Emu, Emu2): Require joint training of LLMs and diffusion models, leading to high engineering cost and instability.
VQ tokenizer approaches (e.g., Chameleon, AnyGPT): Unify next-token prediction but require massive data for image-text alignment — Chameleon requires 1.4B image-text pairs, Janus requires 65M.
Key Insight¶
Existing VQ tokenizers (e.g., VQGAN) are trained with image reconstruction losses, so their quantized representations focus on low-level textures and lack semantic information, making image-text alignment within LLMs extremely slow. Performing quantization in semantic feature space can substantially accelerate the alignment process.
Secondary Problem¶
Can the understanding and generation capabilities of a unified model mutually reinforce each other? The authors find experimentally (Table 2) that naive joint training yields no clear mutual benefit, motivating a more refined approach.
Method¶
Overall Architecture¶
ILLUME is built on Vicuna-7B, extending the visual vocabulary to support discrete visual token generation:
- Understanding branch: UNIT visual encoder → visual adapter → LLM text space (retaining continuous features to avoid VQ information loss)
- Generation branch: LLM predicts discrete visual tokens → semantic visual tokenizer decodes → Stable Diffusion reconstructs high-resolution images
- Unified optimization objective: \(\mathcal{L} = -\sum_{i=1}^{N} \log P_\theta(y_i \mid y_{<i})\), where \(y_i\) denotes a text or visual token and \(N\) is the sequence length
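Conceptually, both modalities share a single cross-entropy objective over an extended vocabulary that appends visual codes after the text tokens. A minimal sketch, assuming hypothetical vocabulary sizes and not the authors' code:

```python
import torch
import torch.nn.functional as F

TEXT_VOCAB = 32000       # e.g., the Vicuna-7B tokenizer size (assumed)
VISUAL_VOCAB = 16384     # semantic tokenizer codebook size from the paper
TOTAL_VOCAB = TEXT_VOCAB + VISUAL_VOCAB  # visual ids appended after text ids

def unified_ntp_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss over a mixed text/visual token sequence.

    logits:  (batch, seq_len, TOTAL_VOCAB) predictions at every position
    targets: (batch, seq_len) token ids; text ids in [0, TEXT_VOCAB),
             visual ids in [TEXT_VOCAB, TOTAL_VOCAB)
    """
    # Shift so position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :].reshape(-1, TOTAL_VOCAB)
    shift_targets = targets[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_targets)
```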
Key Design 1: Semantic Visual Tokenizer¶
Unlike conventional VQGAN (trained with image reconstruction loss):
- Leverages a pretrained UNIT visual encoder to extract semantic features
- Supervises the quantization process and codebook learning via feature reconstruction loss
- Codebook size: 16,384; each image is represented by 256 discrete tokens
- Uses Stable Diffusion to reconstruct images from semantic features (high compression ratio 32×), compensating for low-level details lost during quantization
- Enables high-resolution (512×512) image generation from a fixed number of tokens
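A simplified sketch of the quantization idea: features from the semantic encoder are snapped to the nearest codebook entry, and the codebook is trained with feature-space reconstruction and commitment losses rather than a pixel reconstruction loss. Class name, loss weight, and feature dimension below are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticQuantizer(nn.Module):
    """Quantizes semantic encoder features into discrete codebook indices."""

    def __init__(self, codebook_size: int = 16384, dim: int = 1024, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.beta = beta  # commitment weight (illustrative value)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, num_tokens, dim), e.g., 256 tokens per image
        flat = feats.reshape(-1, feats.size(-1))                    # (B*T, dim)
        # Squared L2 distance from each feature to every codebook vector.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))              # (B*T, K)
        indices = dists.argmin(dim=-1).view(feats.shape[:2])        # discrete visual tokens
        quantized = self.codebook(indices)                          # (B, T, dim)

        # Feature-space losses: pull codes toward encoder features and vice versa.
        codebook_loss = F.mse_loss(quantized, feats.detach())
        commit_loss = F.mse_loss(feats, quantized.detach())
        loss = codebook_loss + self.beta * commit_loss

        # Straight-through estimator so gradients flow back to the encoder/adapter.
        quantized = feats + (quantized - feats).detach()
        return quantized, indices, loss
```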
Key Design 2: Three-Stage Progressive Training¶
| Stage | Objective | Trainable Components | Data Volume | Training Steps |
|---|---|---|---|---|
| Stage-1: Visual Embedding Initialization | Initialize visual representations | Visual adapter + visual embedding/classification head | 558K (LLaVA-Pretrain) + image reconstruction task | 5,000 |
| Stage-2: Unified Image-Text Alignment | Learn understanding + generation | LLM + visual adapter | 15M multimodal data | 15,000 |
| Stage-3: Supervised Fine-Tuning | Task-specific capabilities | Full model | Instruction tuning + high-quality image-text pairs + mixed-modal data | 8,000 |
Stage-1 innovation: An image reconstruction task is introduced — having the LLM generate the original image — to rapidly initialize the newly added visual embedding weights.
Stage-3 supports high-resolution input via an image patchify strategy (up to 9 slices, base resolution 448), with each slice downsampled to 256 tokens.
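A rough sketch of the token accounting implied by this patchify scheme. The exact slicing rule is not specified in these notes, so the grid selection and the extra global thumbnail below are assumptions:

```python
import math

def patchify_token_count(width: int, height: int,
                         base_res: int = 448,
                         tokens_per_slice: int = 256,
                         max_slices: int = 9) -> int:
    """Estimate how many visual tokens a high-resolution input consumes.

    Assumption: the image is tiled into a grid of base_res x base_res slices
    (at most max_slices), plus one global thumbnail, each mapped to 256 tokens.
    """
    cols = max(1, math.ceil(width / base_res))
    rows = max(1, math.ceil(height / base_res))
    n_slices = min(cols * rows, max_slices)
    return (n_slices + 1) * tokens_per_slice  # +1 for the global view (assumed)


# Example: a 1344x896 input -> 3x2 = 6 slices -> (6 + 1) * 256 = 1792 tokens.
print(patchify_token_count(1344, 896))
```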
Key Design 3: Self-Enhancement Multimodal Alignment Scheme¶
Core idea: Train the MLLM to self-evaluate the quality of its own generated images, forming a positive feedback loop between understanding and generation.
Step 1: Self-generated corpus — Generate images from a text subset of the training set using the model itself.
Step 2: Assessment data generation — Use GPT-4o to evaluate the consistency between self-generated images and their corresponding texts (evaluation dimensions: object accuracy, quantity, color, spatial relationships), producing scores and rationales.
Step 3: SFT alignment training — Format assessment data as dialogues:
- High-quality generations → single-turn assessment dialogue
- Low-quality generations → two-turn dialogue (assessment + correction)
A total of 50K assessment samples are generated and incorporated into Stage-3 training.
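A hedged sketch of how such assessment samples might be serialized into SFT dialogues. The prompt wording and field names are invented for illustration; the paper's exact templates may differ:

```python
def build_assessment_dialogue(prompt: str, image_token_str: str,
                              score: int, rationale: str,
                              corrected_image: str | None = None) -> list[dict]:
    """Format one self-generated sample into a single- or two-turn SFT dialogue.

    prompt:          the text the model generated the image from
    image_token_str: placeholder for the discrete visual tokens of the generated image
    score/rationale: GPT-4o consistency judgment (objects, counts, colors, spatial layout)
    corrected_image: visual tokens of a faithful image, only for low-quality samples
    """
    dialogue = [
        {"role": "user",
         "content": f"{image_token_str}\nDoes this image match the prompt: '{prompt}'? "
                    "Rate the consistency and explain."},
        {"role": "assistant",
         "content": f"Score: {score}/10. {rationale}"},
    ]
    if corrected_image is not None:
        # Two-turn case: after the assessment, the model regenerates a faithful image.
        dialogue += [
            {"role": "user", "content": "Now generate an image that matches the prompt."},
            {"role": "assistant", "content": corrected_image},
        ]
    return dialogue
```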
Mutual benefit mechanism:
- Generation aids understanding: Analyzing self-generated negative samples improves understanding of failure modes, enhancing image interpretation accuracy.
- Understanding aids generation: Discriminative capabilities are leveraged to evaluate whether generated images align with text, preventing erroneous generation.
Key Experimental Results¶
Main Results: Multimodal Understanding (Table 3)¶
| Model | Type | POPE | MMBench | SEED | MME-P | MM-Vet | MMMU | AI2D |
|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5 (7B) | Understanding-only | 85.9 | 64.3 | 58.6 | 1510.7 | 31.1 | 35.4 | 54.8 |
| LLaVA-NeXT (7B) | Understanding-only | 86.5 | 67.4 | 64.7 | - | 43.9 | 35.1 | 66.6 |
| Emu3-Chat (8B) | Understanding-only | 85.2 | 58.5 | 68.2 | - | 37.2 | 31.6 | 70.0 |
| Janus (1.3B) | Unified | 87.0 | 69.4 | 63.7 | 1338.0 | 34.3 | 30.5 | - |
| ILLUME (7B) | Unified | 88.5 | 75.1 | 72.9 | 1445.3 | 37.0 | 38.2 | 71.4 |
ILLUME achieves first or second place on 10 out of 12 benchmarks. Compared to Janus, MMMU improves by 25% and SEED by 14%.
Main Results: Image Generation (Table 4)¶
| Model | Type | MJHQ FID↓ | GenAI Overall | GenEval Overall |
|---|---|---|---|---|
| SDXL (2.6B) | Diffusion | 9.55 | 0.55 | 0.55 |
| Janus (1.3B) | Unified | 10.10 | - | 0.61 |
| Show-o (1.5B) | Unified | 15.18 | 0.53 | 0.53 |
| ILLUME (7B) | Unified | 7.76 | 0.61 | 0.61 |
Ablation Study: Effect of Self-Enhancement Alignment (Table 7)¶
| Setting | POPE | MME-P | MMBench | SEED | MM-Vet | MMMU | GenEval Overall |
|---|---|---|---|---|---|---|---|
| Baseline | 86.4 | 1358.6 | 61.7 | 65.0 | 27.4 | 31.2 | 0.56 |
| + assessment | 86.1 | 1446.7 | 63.1 | 66.0 | 29.0 | 32.0 | 0.59 |
Adding only 50K assessment samples improves both understanding and generation. MME-P increases by nearly 90 points; GenEval improves by 0.03.
Key Findings¶
- Semantic tokenizer vs. reconstruction tokenizer: With the same 20M training samples, the semantic tokenizer's training loss converges substantially faster, and the reconstruction-based tokenizer yields inferior generation quality at the same data volume.
- Joint training alone provides no significant mutual benefit (Table 2), whereas the self-enhancement scheme effectively promotes synergy between understanding and generation.
- ILLUME requires only 15M data to achieve competitive performance — 1/4 of Janus and 1/93 of Chameleon.
Highlights & Insights¶
- Root cause of data efficiency: Semantic information is the key to accelerating image-text alignment in LLMs — VQ tokenizer design should target LLM compatibility rather than image reconstruction.
- Elegance of the self-enhancement scheme: The model's own imperfect outputs are used as learning signals, requiring no additional human annotation.
- Architectural symmetry: The understanding branch uses continuous features (preserving precision); the generation branch uses discrete tokens (unified paradigm). Both branches share the encoder but exploit it differently.
- Inference flexibility: Classifier-free guidance (CFG) is employed for image generation inference, supporting any-to-any tasks over interleaved image-text data.
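For reference, classifier-free guidance in an autoregressive token decoder typically blends conditional and unconditional logits at each decoding step. A generic sketch, not ILLUME-specific hyperparameters:

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               guidance_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance for next-visual-token sampling.

    cond_logits:   logits given the text prompt
    uncond_logits: logits given an empty / masked prompt
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)


# At each decoding step, run the LLM twice (with and without the prompt),
# blend the logits as above, then sample the next visual token from the result.
```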
Limitations & Future Work¶
- Base model scale: Validated only on Vicuna-7B; performance on larger or more recent LLMs remains unexplored.
- Generation resolution: Fixed at 512×512, still lagging behind SDXL (1024×1024).
- Diffusion model dependency: The generation branch still relies on a Stable Diffusion model for image reconstruction, rather than end-to-end pure autoregressive generation.
- Training cost: 32 nodes × 8 NPUs × 3 days remains expensive for academic labs.
- Self-enhancement at 50K only: The effects of larger-scale assessment data or additional evaluation dimensions remain unexplored.
Related Work & Insights¶
- vs. Janus: Janus uses separate encoders to decouple understanding and generation representations; ILLUME shares the encoder but routes continuous and discrete features separately.
- vs. Emu3: Emu3 adopts a pure AR architecture; ILLUME still relies on a diffusion decoder on the generation side but achieves higher data efficiency.
- vs. Show-o: Show-o uses the smaller Phi-1.5B backbone; ILLUME employs a larger LLM paired with a more efficient tokenizer.
- Inspiration: The self-enhancement scheme can be extended to additional modalities (video, audio, 3D); the semantic tokenizer can serve as a general design principle.
Rating ⭐⭐⭐⭐¶
- Novelty: ⭐⭐⭐⭐ — Dual contributions of semantic tokenizer and self-enhancement alignment scheme
- Practicality: ⭐⭐⭐⭐ — Substantially reduces data requirements within a unified multi-task framework
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers understanding, generation, and editing; ablations are comprehensive
- Writing Quality: ⭐⭐⭐⭐ — Architecture diagrams are clear; motivation is well-articulated; experiments are well-organized