OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation

Conference: CVPR 2026
Paper: CVF Open Access
Code: https://onecat-ai.github.io/ (Project Page)
Area: Multimodal VLM / Unified Understanding and Generation
Keywords: Unified Multimodality, Decoder-only, Auto-regressive Generation, Modality-MoE, Next-Scale Prediction

TL;DR

OneCAT integrates understanding, generation, and editing into a single decoder-only Transformer. A Modality-MoE with hard routing (text, understanding, and generation specialists) makes inference encoder-free. A Scale-Aware Adapter brings multi-scale auto-regressive generation into the LLM. The model reports state-of-the-art results among unified models while generating images roughly 10× faster than the diffusion-based unified baseline it is compared against (BAGEL-7B).

Background & Motivation

Background: Current multimodal systems mostly follow a "modular" path: one stack for understanding (ViT + LLM), diffusion models for generation, and separate pipelines for editing. Even models described as "unified" (e.g., BAGEL, Mogao) often use a Mixture-of-Transformers (MoT) design, in which an AR branch handles understanding and a diffusion branch handles generation, and both rely on external ViT encoders and VAE tokenizers. In effect, these are several models coupled together.

Limitations of Prior Work: This multi-module design has two major drawbacks. First, the architecture constrains deep early fusion across modalities: an independent encoder compresses visual data into semantic features before they reach the LLM, which prevents low-level interaction between text and pixels. Second, external components (especially ViT encoders and diffusion denoising) add significant inference latency, particularly for high-resolution inputs and outputs.

Key Challenge: Achieving true "unification" requires a single model to master two opposing tasks: understanding requires compressing images into abstract semantics (continuous tokens), while generation requires expanding semantics into pixel details (discrete tokens). Sharing the same parameters causes gradient conflicts during early training; splitting them into separate sub-networks reverts to the modular approach.

Goal: To create a pure decoder-only, pure auto-regressive unified architecture that requires no ViT/VAE encoders during inference. It aims for early fusion, avoids interference between understanding and generation, and maintains high speed.

Key Insight: The authors propose that "pure auto-regression is sufficient to support unified multimodal intelligence." Visual input is no longer processed by a heavy encoder but directly converted into continuous tokens via a lightweight Patch Embedding layer. For generation, visual AR is shifted from "token-by-token" to "scale-by-scale" (coarse-to-fine), bypassing diffusion latency while naturally fitting the next-token paradigm.

Core Idea: Use a Modality-MoE with hard routing so that a single decoder shares attention across modalities while providing a specialized FFN for each. A single AR objective then casts understanding, generation, and editing as "predicting the next (token or scale)."

Method

Overall Architecture

Initialized from a pre-trained Qwen2.5 LLM, the entire system is a single decoder-only Transformer. The attention layers (QKV projections and the attention operation) are shared across all modalities, while the FFNs split into three specialists by modality. During inference, there are three types of input tokens: text tokens, continuous visual tokens (original images passed through Patch Embedding, used for understanding and as editing references), and discrete visual tokens (produced by the model during generation). Outputs are text via a Text Detokenizer and generated images via an Image Detokenizer (a multi-scale VAE decoder). Notably, no ViT or VAE encoder is required during inference; the VAE encoder is used only during training to produce ground-truth multi-scale discrete tokens.
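The encoder-free input path can be illustrated with a toy sketch: the image is cut into non-overlapping patches and each patch becomes one continuous token through a single linear projection. The patch size, embedding width, weights, and function names below are placeholders chosen for illustration, not values or APIs from the paper.

```python
from typing import List

Image = List[List[float]]  # toy single-channel image, H x W


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into flattened, non-overlapping patch vectors."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches


def embed(patches: List[List[float]], weight: List[List[float]]) -> List[List[float]]:
    """Linear projection: one continuous token per patch.

    weight has shape d_model x patch_dim (hypothetical values here).
    """
    return [[sum(w_ij * x_j for w_ij, x_j in zip(row, p)) for row in weight]
            for p in patches]


# Toy 4x4 image, hypothetical patch size 2 and embedding width 3.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)
weight = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
tokens = embed(patches, weight)
```

The point of the sketch is only that the visual front end is a reshape plus one matrix multiply, in contrast to a multi-layer ViT encoder.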

Understanding follows Next-Token Prediction, and generation follows Next-Scale Prediction (coarse-to-fine). Editing treats the reference image as continuous tokens routed to the understanding specialist as visual conditions, while the LLM auto-regressively outputs discrete tokens for the new image, without changing the architecture.
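To see why scale-by-scale decoding needs fewer sequential steps than token-by-token decoding, the following sketch counts forward passes for a coarse-to-fine schedule. The schedule of grid side lengths is hypothetical and not taken from the paper; only the counting logic matters.

```python
from typing import Dict, List


def compare_decoding_steps(side_lengths: List[int]) -> Dict[str, int]:
    """Count sequential steps for next-scale vs. next-token decoding.

    side_lengths: grid side length of each scale, coarse to fine.
    Next-scale prediction emits one whole scale per step; next-token
    prediction over the finest grid emits one token per step.
    """
    per_scale = [s * s for s in side_lengths]
    return {
        "next_scale_steps": len(side_lengths),
        "tokens_emitted_all_scales": sum(per_scale),
        "next_token_steps_finest_grid": per_scale[-1],
    }


# Hypothetical schedule: 1x1 up to 16x16 token grids.
schedule = [1, 2, 4, 8, 16]
stats = compare_decoding_steps(schedule)
```

With this toy schedule the model emits more tokens in total across all scales, but in far fewer sequential steps, which is where the latency advantage of next-scale prediction comes from.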

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Original Image / Text / Reference Image"] --> B["Lightweight Patch Embedding<br/>+ Text Tokenizer<br/>(Replacing ViT and VAE Encoders)"]
    B --> C["Shared QKV + Attention<br/>(Multimodal Versatile Attention)"]
    C --> D["Modality-MoE<br/>Hard Routing for Text / Und. / Gen."]
    D -->|Generation Branch| E["Multi-scale AR + Scale-Aware Adapter<br/>Coarse-to-fine by scale"]
    D -->|Und./Text Branch| F["Next-Token Prediction"]
    E --> G["Image Detokenizer Output"]
    F --> H["Text Output"]

Key Designs

1. Modality-MoE: Enabling Dual Competence in Understanding and Generation

Understanding and generation have opposing token requirements (semantic compression vs. detail expansion). Sharing an FFN leads to parameter contention. OneCAT replicates the original Qwen2.5 FFN into three specialists per block: Text FFN (text tokens), Und. FFN (continuous visual tokens for understanding), and Gen. FFN (discrete visual tokens for generation). Hard routing is used based on token modality and task type without learned gating. The shared attention layer ensures parameter efficiency and early cross-modal alignment, which is crucial for instruction following. This enables an "encoder-free" architecture: raw patches enter shared attention and modality-specific FFNs directly, merging text and pixels from the first layer.
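A minimal sketch of hard routing, assuming each token arrives with a known modality tag: the three specialists start as copies of one base FFN, and dispatch is a plain lookup with no learned gate. `ToyFFN`, `ModalityMoE`, and the numeric values are illustrative stand-ins, not the authors' implementation.

```python
import copy
from enum import Enum
from typing import List


class Modality(Enum):
    TEXT = "text"
    UND = "und"   # continuous visual tokens (understanding / reference)
    GEN = "gen"   # discrete visual tokens (generation)


class ToyFFN:
    """Stand-in for a Transformer FFN: elementwise gain followed by ReLU."""

    def __init__(self, gain: float):
        self.gain = gain

    def __call__(self, x: List[float]) -> List[float]:
        return [max(0.0, self.gain * v) for v in x]


class ModalityMoE:
    def __init__(self, base_ffn: ToyFFN):
        # The three specialists begin as copies of the same pre-trained FFN.
        self.experts = {m: copy.deepcopy(base_ffn) for m in Modality}

    def __call__(self, hidden: List[List[float]],
                 modalities: List[Modality]) -> List[List[float]]:
        # Hard routing: the expert is selected by the token's modality tag.
        return [self.experts[m](h) for h, m in zip(hidden, modalities)]


moe = ModalityMoE(ToyFFN(gain=1.0))
moe.experts[Modality.GEN].gain = 2.0  # pretend training changed the Gen. FFN
hidden = [[1.0, -1.0], [1.0, -1.0], [1.0, -1.0]]
tags = [Modality.TEXT, Modality.UND, Modality.GEN]
out = moe(hidden, tags)
```

Because routing depends only on the tag, every token activates exactly one FFN, so the active parameter count per token matches that of the original dense model.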

2. Multi-scale Visual AR + Scale-Aware Adapter (SAA): Scaling VAR within LLMs

Token-by-token image generation is slow, and diffusion has high latency. OneCAT adopts a multi-scale VAE to encode images into hierarchical multi-scale discrete tokens, allowing the LLM to perform Next-Scale Prediction. Since low-scale tokens (color, structure) and high-scale tokens (texture, detail) differ significantly, the authors propose SAA: a set of scale-specific bypasses (skip connections) parallel to the Gen. FFN linear layers. Discrete tokens are routed to the corresponding scale adapter via a scale index. Each adapter uses a LoRA-style low-rank decomposition (\(r=64\)). Unlike LoRA, SAA is jointly trained end-to-end as a permanent component. In the ablation, removing SAA lowers GenEval from 81.2 to 78.1.
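The bypass structure described above can be sketched as a linear layer plus a per-scale low-rank branch, \(y = Wx + B_s A_s x\), where \(s\) is the scale index. The class name, the toy rank of 1 (the paper uses \(r=64\)), and all weights are assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class ScaleAwareLinear:
    """Shared linear layer with one low-rank bypass per scale (toy sketch)."""

    def __init__(self, weight: Matrix, adapters: List[Tuple[Matrix, Matrix]]):
        self.weight = weight        # d_out x d_in, shared across scales
        self.adapters = adapters    # per scale: (down: r x d_in, up: d_out x r)

    def __call__(self, x: List[float], scale_index: int) -> List[float]:
        base = matvec(self.weight, x)
        down, up = self.adapters[scale_index]
        bypass = matvec(up, matvec(down, x))  # low-rank path for this scale
        return [b + p for b, p in zip(base, bypass)]


identity = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0, 0.0]], [[0.0], [0.0]]),   # scale 0: zero "up" matrix, no change
    ([[1.0, 1.0]], [[0.5], [0.0]]),   # scale 1: rank-1 correction
]
layer = ScaleAwareLinear(identity, adapters)
y_coarse = layer([1.0, 2.0], scale_index=0)
y_fine = layer([1.0, 2.0], scale_index=1)
```

The same input produces different outputs at different scales while the large shared weight stays common, which is the intended effect of a scale-specific bypass.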

3. Multimodal Versatile Attention: One Attention for Four Tasks

Different tasks require different attention masks. Based on FlexAttention, OneCAT assigns specific masks: Causal Attention for text tokens \(T\); Full Attention for continuous visual tokens \(U\) (inputs/references) for global interaction; and Block Causal Attention for multi-scale discrete tokens \(G\), where tokens within the same scale attend to each other freely but attention across scales follows causal order.
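One way to picture the three mask types is to assign each token a block id and let a query attend to every key whose block is not later than its own. The sketch below builds a dense boolean mask in plain Python rather than using the FlexAttention API; the segment format is hypothetical, and it assumes that tokens in earlier segments are always visible to later ones.

```python
from typing import List, Sequence, Tuple, Union

Segment = Tuple[str, Union[int, Sequence[int]]]


def build_block_ids(segments: List[Segment]) -> List[int]:
    """Map tokens to block ids; attention is bidirectional inside a block.

    ("T", n): n text tokens, each its own block (plain causal).
    ("U", n): n continuous visual tokens, one block (full attention).
    ("G", [n1, n2, ...]): discrete tokens, one block per scale (block causal).
    """
    block_ids: List[int] = []
    next_block = 0
    for kind, sizes in segments:
        if kind == "T":
            for _ in range(sizes):
                block_ids.append(next_block)
                next_block += 1
        elif kind == "U":
            block_ids.extend([next_block] * sizes)
            next_block += 1
        elif kind == "G":
            for n in sizes:
                block_ids.extend([next_block] * n)
                next_block += 1
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return block_ids


def build_mask(block_ids: List[int]) -> List[List[bool]]:
    """mask[q][k] is True when query q may attend to key k."""
    return [[kb <= qb for kb in block_ids] for qb in block_ids]


# Reference image (2 tokens), instruction (2 tokens), two generated scales.
segments = [("U", 2), ("T", 2), ("G", [1, 2])]
block_ids = build_block_ids(segments)
mask = build_mask(block_ids)
```

In this layout the two image tokens see each other, the second text token sees the first but not the reverse, and the two tokens of the finer scale see each other plus the coarser scale.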

4. Three-Stage Training + Understanding Distillation: Efficient Visual Perception

The Und. FFN is initialized from the text-only FFN. To bridge the visual gap efficiently, a three-stage pipeline is used. Stage 1 (Multimodal Pre-training) involves: (1-1) Customizing an MLLM teacher by connecting a pre-trained ViT and Qwen2.5 with an MLP, sharing the same LLM backbone to ensure consistency; (1-2) Freezing shared layers and pre-training Und. FFN (via distillation) and Gen. FFN separately. The distillation objective is:

\[L_{Und} = L_{NTP} + \lambda L_{Distill}, \quad L_{Distill} = \sum_{n=1}^{N}\mathrm{MSE}\big(h_S^{(n)}, h_T^{(n)}\big)\]

Here, MSE alignment is applied to the hidden states of every layer (\(\lambda=0.02\)). Stage 2 (Unified Mid-training) unfreezes the model for joint training across four tasks with SAA and native resolution strategies. Stage 3 (Unified SFT) performs supervised fine-tuning on high-quality data.
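The distillation objective can be written out directly as a small function. This is a sketch under simplifying assumptions: each layer's hidden state is flattened to a plain list of floats, and the next-token loss is passed in as an already computed number.

```python
from typing import List


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def understanding_loss(ntp_loss: float,
                       student_hidden: List[List[float]],
                       teacher_hidden: List[List[float]],
                       lam: float = 0.02) -> float:
    """L_Und = L_NTP + lambda * sum_n MSE(h_S^(n), h_T^(n))."""
    if len(student_hidden) != len(teacher_hidden):
        raise ValueError("student and teacher must have the same depth")
    distill = sum(mse(s, t) for s, t in zip(student_hidden, teacher_hidden))
    return ntp_loss + lam * distill


# Two layers with toy hidden states.
student = [[1.0, 2.0], [0.0, 0.0]]
teacher = [[1.0, 0.0], [1.0, 1.0]]
loss = understanding_loss(ntp_loss=2.0, student_hidden=student,
                          teacher_hidden=teacher)
```

Layer-wise alignment requires the student and teacher to have the same number of layers and hidden width, which is what sharing the Qwen2.5 backbone provides.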

Loss & Training

The understanding side uses \(L_{Und}\); the generation side uses cross-entropy for next-scale prediction. CFG is used during inference. Data ratios: Text:Und:Gen = 1:6:7. Variants include OneCAT-1.5B (Active 1.5B) and OneCAT-3B (Active 3B).
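Two pieces of this recipe lend themselves to a short worked example: turning the 1:6:7 data ratio into sampling fractions, and the standard classifier-free guidance (CFG) combination of conditional and unconditional logits. The guidance scale and logit values are made up, and the summary does not specify which CFG variant the authors use, so the formula below is the common textbook form rather than a confirmed detail.

```python
from fractions import Fraction
from typing import Dict, List


def mixture_fractions(ratios: Dict[str, int]) -> Dict[str, Fraction]:
    """Convert integer data ratios into exact sampling fractions."""
    total = sum(ratios.values())
    return {name: Fraction(r, total) for name, r in ratios.items()}


def cfg_logits(cond: List[float], uncond: List[float],
               scale: float) -> List[float]:
    """Standard CFG: uncond + scale * (cond - uncond).

    scale = 1.0 recovers the conditional logits unchanged.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


fractions = mixture_fractions({"text": 1, "und": 6, "gen": 7})
guided = cfg_logits(cond=[2.0, 0.0], uncond=[1.0, 1.0], scale=3.0)
```

Under the stated ratio, half of the training samples are generation data, three sevenths are understanding data, and one fourteenth is text-only data.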

Key Experimental Results

Main Results

Understanding benchmarks (SOTA among encoder-free unified models; "/" denotes no external visual components):

| Model | Visual Components | DocVQA | ChartQA | AI2D | MME-P | MMB | MMVet |
|---|---|---|---|---|---|---|---|
| Janus-Pro-7B (Unified) | SigLIP | - | - | - | 1567 | 79.2 | 50.0 |
| HoVLE-2.6B (Und.-only) | / | 86.1 | 78.6 | 73.0 | - | 71.9 | 44.3 |
| Qwen2.5-VL-3B (Enc.-based) | 0.6B ViT | 93.9 | 84.0 | 81.6 | - | 79.1 | 61.8 |
| OneCAT-3B | / | 91.2 | 81.2 | 77.8 | 1630 | 78.8 | 52.2 |

Generation benchmarks († = using LLM prompt rewriting; OneCAT does not rewrite):

| Model | GenEval Overall↑ | Counting | Color Attri. | DPG Overall↑ |
|---|---|---|---|---|
| Janus-Pro-7B | 0.80 | 0.59 | 0.66 | 84.19 |
| BAGEL-7B† | 0.88 | 0.84 | 0.77 | - |
| Mogao-7B† | 0.89 | 0.83 | 0.80 | 84.33 |
| OneCAT-1.5B | 0.85 | 0.83 | 0.75 | 81.72 |
| OneCAT-3B | 0.90 | 0.84 | 0.80 | 84.53 |

Inference Efficiency (H800):

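| Task | Baseline | Resolution | Baseline Latency | OneCAT-3B Latency | Latency Reduction |
|---|---|---|---|---|---|
| Und. TTFT | Qwen2.5-VL-3B | 1792² | 0.583s | 0.225s | 61%↓ |
| T2I Gen | BAGEL-7B | 1024² | 26.29s | 2.85s | 89%↓ |
| Editing | BAGEL-7B | 1024² | 46.44s | 4.61s | 90%↓ |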
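The reduction percentages and the "about 10×" figure can be rechecked from the reported latencies. The helper names are illustrative; the numbers are the ones listed in the efficiency comparison.

```python
def latency_reduction(baseline_s: float, new_s: float) -> float:
    """Fraction of baseline latency removed (0.61 means 61% lower)."""
    return 1.0 - new_s / baseline_s


def speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster the new system is than the baseline."""
    return baseline_s / new_s


# (baseline seconds, OneCAT-3B seconds) as reported on H800.
rows = {
    "Und. TTFT": (0.583, 0.225),
    "T2I Gen": (26.29, 2.85),
    "Editing": (46.44, 4.61),
}

for task, (base, ours) in rows.items():
    print(f"{task}: {latency_reduction(base, ours):.1%} lower latency, "
          f"{speedup(base, ours):.2f}x speedup")
```

The generation row works out to a little over 9× and the editing row to a little over 10×, so "approximately 10×" is a rounded summary of the two.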

Ablation Study

| Config | Key Metric | Insight |
|---|---|---|
| Full Hidden Distill | Avg. 35.3 | Complete strategy |
| Last Layer Logits | Avg. 33.9 | -1.4 drop |
| Custom Teacher | Avg. 35.3 | Shared backbone is better |
| Qwen2.5-VL Teacher | Avg. 33.7 | Off-the-shelf teacher is worse |
| w/ SAA | GenEval 81.2 | Baseline |
| w/o SAA | GenEval 78.1 | -3.1 drop |

Key Findings

- Full layer distillation > Logit-only distillation: Aligning internal computation patterns across every layer is more effective than output alignment (35.3 vs 33.9).
- Custom Teacher > Off-the-shelf Teacher: Sharing the LLM backbone between teacher and student provides better consistency (+1.6 points).
- Encoder-free speed gains are most significant at high resolutions: 61% reduction in TTFT for 1792² and ~10× speedup for 1024² generation.

Highlights & Insights

- Discarding encoders at inference is a major win: Removing ViT and diffusion denoising eliminates the two most expensive components of unified models.
- Minimalist Modality-MoE: Hard routing is enough for understanding and generation to coexist under a unified AR objective, suggesting that unified models do not need separate, more complex Transformers.
- SAA for "Scale-Division": Using permanent low-rank bypasses for different scales mimics frequency-specific processing in an AR framework.
- Next-Scale Prediction in LLMs: Replacing token-by-token generation with coarse-to-fine scales maintains the AR paradigm while reaching diffusion-level quality at much higher speeds.

Limitations & Future Work

- There remains a gap in understanding compared to top-tier encoder-based models (e.g., Qwen2.5-VL-3B), likely due to training data scale (0.5T vs 4T tokens).
- Generation quality is partially bounded by the external multi-scale VAE (Infinity) detokenizer.
- Hard routing by modality might lack flexibility for interleaved multimodal tasks; more dynamic routing could be explored.

- Comparison with MoT (BAGEL, Mogao): These use separate Transformers and external encoders; OneCAT uses a single decoder with MoE, generates roughly 10× faster than BAGEL-7B in the reported measurements, and supports the claim that a single AR objective is sufficient.
- Comparison with Encoder-free Understanding (EVE): OneCAT adds a custom teacher and full-layer distillation to enhance efficiency while expanding capabilities to generation and editing.
- Comparison with VAR/Infinity: OneCAT ports multi-scale AR into LLMs and introduces SAA to handle scale-variance in shared FFNs.

Rating

- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐