UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations¶

Conference: CVPR2025
arXiv: 2603.10702
Code: Project Page
Area: Image Generation
Keywords: unified multimodal model, continuous representation, semantic compression, transfusion, image generation

TL;DR¶

UniCom builds a compact continuous representation space by compressing continuous semantic features from a VLM along the channel dimension rather than by spatial downsampling. It unifies multimodal understanding and generation within a Transfusion architecture and reports SOTA generation quality among unified models.

Background & Motivation¶

Key Challenge for Unified Models: Current unified multimodal models require a "unified token" representation that supports both understanding and generation, but visual representation design remains the core bottleneck.

Information Loss from Discretization: Vector quantization-based methods (e.g., VQ-VAE) inevitably discard fine-grained semantic information, leading to sub-optimal performance in understanding tasks.

Representation Split in Hybrid Encoders: Hybrid schemes using VAE latents + ViT features result in understanding and generation operating on mismatched feature spaces, which limits deep integration.

Difficulties in Modeling Continuous Features: Directly modeling high-dimensional continuous ViT representations for generation faces complex distributions, slow convergence, and training instability.

Manifold Hypothesis: The useful information in a high-dimensional VLM embedding space lies on a low-dimensional submanifold, which suggests that a learned compression can expose this submanifold.

Practical Need: There is a need to simplify the data distribution to make generative modeling feasible while preserving semantic accuracy.

Method¶

Overall Architecture¶

UniCom decomposes the conditional image distribution into two stages: \(P(\mathbf{x}|\mathbf{c}) = \int P(\tilde{\mathbf{z}}|\mathbf{c}) \cdot P(\mathbf{x}|\tilde{\mathbf{z}}) d\tilde{\mathbf{z}}\), where \(\tilde{\mathbf{z}} \in \mathbb{R}^{N \times d}\) is the compressed semantic representation, \(N\) is the number of tokens, and \(d \ll D\) with \(D\) the dimension of the original VLM features. The framework consists of three core components:
- Semantic Compressor: Maps high-dimensional VLM features to a low-dimensional continuous space.
- Generative Prior Module: Samples a compressed representation conditioned on the text, i.e. models \(P(\tilde{\mathbf{z}}|\mathbf{c})\).
- Diffusion Decoder: Reconstructs pixel images from the compressed representation, i.e. models \(P(\mathbf{x}|\tilde{\mathbf{z}})\).
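The two-stage factorization can be read as a marginalization over the compressed representation. The steps below are a sketch rather than a derivation taken from the source. They assume that the image \(\mathbf{x}\) is conditionally independent of the condition \(\mathbf{c}\) once \(\tilde{\mathbf{z}}\) is given:

\[
\begin{aligned}
P(\mathbf{x}|\mathbf{c}) &= \int P(\mathbf{x}, \tilde{\mathbf{z}}|\mathbf{c}) \, d\tilde{\mathbf{z}} \\
&= \int P(\tilde{\mathbf{z}}|\mathbf{c}) \cdot P(\mathbf{x}|\tilde{\mathbf{z}}, \mathbf{c}) \, d\tilde{\mathbf{z}} \\
&= \int P(\tilde{\mathbf{z}}|\mathbf{c}) \cdot P(\mathbf{x}|\tilde{\mathbf{z}}) \, d\tilde{\mathbf{z}}
\end{aligned}
\]

The last line uses the assumed conditional independence, which amounts to requiring that \(\tilde{\mathbf{z}}\) carries everything about \(\mathbf{c}\) that the decoder needs.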

UniCom employs Qwen-2.5-7B-Instruct for understanding, FLUX.1-dev as the generative decoder, and SigLIP2 for semantic encoding.

Key Designs¶

1. Attention-based Semantic Compressor
   - Implemented as a shallow, lightweight Transformer module instead of a simple MLP.
   - Context-aware mapping preserves long-range semantic relationships among patches.
   - Key Finding: Channel-wise compression significantly outperforms spatial downsampling. Compressing channels by 18× (1152→64) is nearly lossless, whereas reducing the token count leads to blurry details.
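To make the two compression axes concrete, the sketch below compares output shapes at an equal compression ratio. It is shape arithmetic only and does not implement the compressor itself. The 24×24 patch grid and the helper names are hypothetical; only the 1152 and 64 channel widths come from the text above.

```python
import math


def channel_compress(num_tokens, dim, target_dim):
    """Keep every token; shrink each token's feature vector."""
    return num_tokens, target_dim


def spatial_downsample(num_tokens, dim, ratio):
    """Keep the feature width; reduce the token count by `ratio`."""
    return math.ceil(num_tokens / ratio), dim


def num_values(shape):
    """Total number of scalars in a (tokens, dim) representation."""
    tokens, dim = shape
    return tokens * dim


# Hypothetical 24x24 patch grid (an assumption, not stated in the source).
N, D, d = 24 * 24, 1152, 64
ratio = D // d  # 1152 / 64 = 18

channel_shape = channel_compress(N, D, d)
spatial_shape = spatial_downsample(N, D, ratio)

print("channel-wise:", channel_shape, num_values(channel_shape))
print("spatial:     ", spatial_shape, num_values(spatial_shape))
```

Both variants store the same number of values, but the channel-wise one still keeps one vector per patch. That is consistent with the reported observation that reducing the token count blurs details.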

2. Joint Optimization of Compressor and Decoder
   - The compressor parameters \(\phi\) and the diffusion decoder parameters \(\psi\) are trained jointly.
   - Employs a reconstruction objective: \(\mathcal{L}_{\text{recon}} = \mathcal{L}_{\text{flow}}(\mathbf{x}, \hat{\mathbf{x}}) + \lambda \cdot \mathcal{L}_{\text{perc}}(\mathbf{x}, \hat{\mathbf{x}})\).
   - Joint training forces the compressor to establish an information bottleneck, retaining signals useful for generation.

3. Comparison of Two Prediction Pathways
   - Pathway I (Transfusion): A unified Transformer processes mixed text and image sequences, utilizing causal masking for text and bidirectional attention for images, trained end-to-end with flow matching.
   - Pathway II (Query-Guided): Employs a MetaQuery to extract conditioning signals from the frozen MLLM, and then projects them to the flow matching decoder via a connector.
   - Comparison Conclusion: Transfusion achieves faster convergence and superior consistency in editing tasks.
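The mixed masking in Pathway I can be illustrated with a small attention-mask builder. This is a minimal sketch of the general idea (causal attention everywhere, plus full attention inside each image block), not the authors' implementation. The function name and the segment format are hypothetical.

```python
def build_mixed_mask(segments):
    """Return mask[q][k] == True when query position q may attend to key k.

    `segments` is a list of (kind, length) pairs with kind "text" or "image".
    Rule: every position sees itself and the past (causal); positions inside
    the same image segment additionally see each other (bidirectional).
    """
    kinds, seg_ids = [], []
    for seg_id, (kind, length) in enumerate(segments):
        kinds.extend([kind] * length)
        seg_ids.extend([seg_id] * length)

    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            same_image = kinds[q] == "image" and seg_ids[q] == seg_ids[k]
            mask[q][k] = causal or same_image
    return mask


# Two text tokens, a three-token image, then one more text token.
mask = build_mixed_mask([("text", 2), ("image", 3), ("text", 1)])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In the printed grid, the image rows share an identical block of ones over the image columns, while the text rows remain lower-triangular.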

Loss & Training¶

Reconstruction Phase: Flow matching loss + perceptual loss (LPIPS)
Generation Phase: \(\mathcal{L}_{\text{FM}} = \mathbb{E}[\|\mathbf{v}_t - \mathbf{v}_\theta(\tilde{\mathbf{z}}_t, t; \mathbf{c})\|_2^2]\), the standard flow matching objective.
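A per-sample version of this objective can be sketched as follows. The source only calls it the standard flow matching objective, so the linear noise-to-data path and the target velocity (data minus noise) used here are assumptions. `predict_velocity` is a hypothetical stand-in for \(\mathbf{v}_\theta\).

```python
def flow_matching_loss(z_data, noise, t, predict_velocity, cond=None):
    """Squared L2 error between target and predicted velocity for one sample.

    Assumes a linear path z_t = (1 - t) * noise + t * z_data, whose velocity
    is constant in t: v_t = z_data - noise.
    """
    z_t = [(1.0 - t) * e + t * x for e, x in zip(noise, z_data)]
    target = [x - e for e, x in zip(noise, z_data)]
    pred = predict_velocity(z_t, t, cond)
    return sum((v - p) ** 2 for v, p in zip(target, pred))


# Toy 2-dimensional "compressed latent" and a fixed noise sample.
z_data = [1.0, 2.0]
noise = [0.0, 0.5]

# A model that always predicts zero velocity, and an oracle that is exact.
zero_model = lambda z_t, t, cond: [0.0, 0.0]
oracle_model = lambda z_t, t, cond: [1.0, 1.5]

loss_zero = flow_matching_loss(z_data, noise, 0.3, zero_model)
loss_oracle = flow_matching_loss(z_data, noise, 0.3, oracle_model)
print(loss_zero, loss_oracle)
```

In training, the expectation in \(\mathcal{L}_{\text{FM}}\) would be approximated by averaging this quantity over sampled data, noise, and time steps.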
Understanding Phase: Cross-entropy loss \(\mathcal{L}_{ce}\).
Four-Stage Progressive Training: Alignment → Pre-training → Continued Training → Supervised Fine-Tuning.

Key Experimental Results¶

Main Results: Image Reconstruction (ImageNet val)¶

Method	rFID↓	PSNR↑	SSIM↑
FLUX.1-dev VAE	0.06	33.65	0.93
UniCom (d1152, Uncompressed)	0.38	22.60	0.61
UniCom (d64, 18× Compression)	0.42	22.28	0.61
UniTok	0.38	-	-
SD-VAE	1.06	28.62	0.86

Key finding: Compressing channels from 1152 to 64 (18×) only increases rFID from 0.38 to 0.42, demonstrating that channel-wise compression is almost lossless.
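The size of the gap between the two UniCom rows can be checked directly from the table. The snippet below is plain arithmetic on the reported numbers; the variable names are illustrative.

```python
# Values copied from the reconstruction table above.
rfid = {"d1152": 0.38, "d64": 0.42}
psnr = {"d1152": 22.60, "d64": 22.28}

rfid_relative_increase = (rfid["d64"] - rfid["d1152"]) / rfid["d1152"]
psnr_drop_db = psnr["d1152"] - psnr["d64"]

print(f"rFID +{rfid_relative_increase:.1%}, PSNR -{psnr_drop_db:.2f} dB")
# prints: rFID +10.5%, PSNR -0.32 dB
```

The relative rFID change is about 10%, but the absolute change (0.04) is small, and SSIM is unchanged at the reported precision.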

Main Results: Text-to-Image Generation (GenEval)¶

Model	Overall
UniCom	0.91 (SOTA among unified)
FLUX.1-Dev	0.82
Janus-Pro	0.80
Show-o2	0.76

Ablation Study¶

Channel vs. Spatial Compression: Under the same compression ratio, channel-wise compression significantly outperforms spatial downsampling in both reconstruction and generation.
Attention vs. MLP Compressor: The attention-based compressor outperforms MLP in terms of semantic preservation and reconstruction quality.
Transfusion vs. Query-Guided: Transfusion achieves faster convergence and higher editing consistency.

Key Findings¶

Continuous semantic representations can be significantly compressed in the channel dimension with almost no information loss.
The unified model performs exceptionally well in text rendering and complex editing without relying on a VAE for identity preservation.
A GenEval Overall score of 0.91 exceeds that of the other unified models compared, including OmniGen2 (0.86) and Mogao (0.89).

Highlights & Insights¶

The finding that channel-wise compression far outperforms spatial compression offers general guidance. The intuition is that spatial tokens contain positional information and local details, whereas the channel dimension exhibits significant redundancy.
The paradigm of continuous semantic representation + compression elegantly resolves the dilemma between discretization loss and the difficulties of high-dimensional modeling.
A systematic comparison between the Transfusion and Query-Guided paradigms provides valuable design guidelines.
High-quality image editing with identity preservation is achieved without relying on a VAE, indicating that the compressed semantic representations contain sufficient information.

Limitations & Future Work¶

Reconstruction quality remains significantly lower than that of dedicated VAEs: UniCom (d64) reaches PSNR 22.28 and SSIM 0.61, versus 33.65 and 0.93 for the FLUX.1-dev VAE. This shows that semantic compression struggles to preserve pixel-level precision.
The four-stage training pipeline is complex, leading to high training costs.
The asymmetric design—using uncompressed features for understanding tasks and compressed features for editing tasks—increases system complexity.
Validation is only performed at the 7B scale; the scaling behavior at larger sizes remains unknown.

vs Janus-Pro/Show-o2: UniCom avoids discretization loss, yielding a significant lead in generation quality.
vs VUGEN: Also employs continuous representations, but UniCom replaces the MLP with an attention-based compressor, which preserves structural semantics more effectively.
vs Transfusion: Introduces semantic compression on top of its architecture, greatly improving training stability and convergence speed.
Insight: Channel-wise compression strategies can be generalized to other scenarios requiring a feature bottleneck (e.g., video understanding, 3D generation).

Rating¶

Novelty: ⭐⭐⭐⭐ — The systematic analysis of channel-wise vs. spatial compression and the design of the attention-based semantic compressor are novel.
Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-task evaluation across reconstruction, generation, and editing with thorough ablation studies, though it lacks quantitative comparisons on understanding tasks.
Writing Quality: ⭐⭐⭐⭐ — Clearly motivated with a rigorous comparative design of the two pathways.
Value: ⭐⭐⭐⭐⭐ — Achieves SOTA generation quality for unified models, and the findings on channel-wise compression offer important guiding insights for future work.