UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations
Conference: CVPR2025
arXiv: 2603.10702
Code: Project Page
Area: Image Generation
Keywords: unified multimodal model, continuous representation, semantic compression, transfusion, image generation
TL;DR
UniCom builds a compact continuous representation space by compressing continuous semantic features from VLMs along the channel dimension rather than by spatial downsampling. It unifies multimodal understanding and generation within a Transfusion architecture and reports SOTA generation quality among unified models.
Background & Motivation
Key Challenge for Unified Models: Current unified multimodal models require a "unified token" representation that supports both understanding and generation, but visual representation design remains the core bottleneck.
Information Loss from Discretization: Vector quantization-based methods (e.g., VQ-VAE) inevitably discard fine-grained semantic information, leading to sub-optimal performance in understanding tasks.
Representation Split in Hybrid Encoders: Hybrid schemes using VAE latents + ViT features result in understanding and generation operating on mismatched feature spaces, which limits deep integration.
Difficulties in Modeling Continuous Features: Directly modeling high-dimensional continuous ViT representations for generation faces complex distributions, slow convergence, and training instability.
Manifold Hypothesis: The useful information in high-dimensional VLM embedding spaces lies on a low-dimensional submanifold, which suggests that a learned compression can expose this submanifold.
Practical Need: There is a need to simplify the data distribution to make generative modeling feasible while preserving semantic accuracy.
Method
Overall Architecture
UniCom decomposes the conditional image distribution into two stages: \(P(\mathbf{x}|\mathbf{c}) = \int P(\tilde{\mathbf{z}}|\mathbf{c}) \cdot P(\mathbf{x}|\tilde{\mathbf{z}}) d\tilde{\mathbf{z}}\), where \(\tilde{\mathbf{z}} \in \mathbb{R}^{N \times d}\) is the compressed semantic representation, \(N\) is the number of tokens, and \(d \ll D\) with \(D\) the channel dimension of the original VLM features. The framework consists of three core components:
- Semantic Compressor: Maps high-dimensional VLM features to a low-dimensional continuous space.
- Generative Prior Module: Samples a compressed representation conditioned on the text.
- Diffusion Decoder: Reconstructs pixel images from the compressed representation.
It employs Qwen-2.5-7B-Instruct for understanding, FLUX.1-dev as the generative decoder, and SigLIP2 for semantic encoding.
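The two-stage factorization above can be read as a marginalization over the compressed representation. A short derivation follows, under the assumption (implied by the form of the formula, not stated explicitly in this summary) that the image \(\mathbf{x}\) is conditionally independent of the condition \(\mathbf{c}\) once \(\tilde{\mathbf{z}}\) is given:

\[
\begin{aligned}
P(\mathbf{x}|\mathbf{c}) &= \int P(\mathbf{x}, \tilde{\mathbf{z}}|\mathbf{c}) \, d\tilde{\mathbf{z}} \\
&= \int P(\tilde{\mathbf{z}}|\mathbf{c}) \cdot P(\mathbf{x}|\tilde{\mathbf{z}}, \mathbf{c}) \, d\tilde{\mathbf{z}} \\
&= \int P(\tilde{\mathbf{z}}|\mathbf{c}) \cdot P(\mathbf{x}|\tilde{\mathbf{z}}) \, d\tilde{\mathbf{z}}
\end{aligned}
\]

Under this reading, the Generative Prior Module models the first factor \(P(\tilde{\mathbf{z}}|\mathbf{c})\) and the Diffusion Decoder models the second factor \(P(\mathbf{x}|\tilde{\mathbf{z}})\).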
Key Designs
1. Attention-based Semantic Compressor
- Implemented as a shallow, lightweight Transformer module instead of a simple MLP.
- Context-aware mapping preserves long-range semantic relationships among patches.
- Key Finding: Channel-wise compression significantly outperforms spatial downsampling. Compressing channels by 18× (1152→64) is nearly lossless, whereas reducing the token count leads to blurry details.
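To make the idea of channel-wise compression concrete, here is a toy sketch in plain Python. It is not the authors' module: the single attention head, the residual connection, the random weights, and all names (`compress_channels`, `random_matrix`, the toy sizes) are assumptions chosen for illustration. The only point it demonstrates is that tokens are mixed with attention and then projected to fewer channels, so the token count stays at N while the channel width shrinks from D to d.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix, used here in place of learned weights."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def compress_channels(features, w_q, w_k, w_v, w_out):
    """Single-head self-attention over all tokens, then projection to d channels."""
    q = matmul(features, w_q)
    k = matmul(features, w_k)
    v = matmul(features, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    attn = [softmax(row) for row in scores]      # (N, N), every token sees every token
    mixed = matmul(attn, v)                      # (N, D)
    hidden = [[f + m for f, m in zip(f_row, m_row)]
              for f_row, m_row in zip(features, mixed)]  # residual connection
    return matmul(hidden, w_out)                 # (N, d): same tokens, fewer channels


rng = random.Random(0)
N, D, d = 6, 16, 4  # toy sizes (assumption); the text reports D=1152 and d=64
features = random_matrix(N, D, rng)
w_q, w_k, w_v = (random_matrix(D, D, rng) for _ in range(3))
w_out = random_matrix(D, d, rng)

compressed = compress_channels(features, w_q, w_k, w_v, w_out)
print(len(compressed), len(compressed[0]))
```

A spatial-downsampling alternative would instead return fewer than N rows with D channels each, which is the variant the text reports as producing blurry details.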
2. Joint Optimization of Compressor and Decoder
- The compressor parameters \(\phi\) and the diffusion decoder parameters \(\psi\) are trained jointly.
- Employs a reconstruction objective: \(\mathcal{L}_{\text{recon}} = \mathcal{L}_{\text{flow}}(\mathbf{x}, \hat{\mathbf{x}}) + \lambda \cdot \mathcal{L}_{\text{perc}}(\mathbf{x}, \hat{\mathbf{x}})\).
- Joint training forces the compressor to establish an information bottleneck, retaining signals useful for generation.
3. Comparison of Two Prediction Pathways
- Pathway I (Transfusion): A unified Transformer processes mixed text and image sequences, utilizing causal masking for text and bidirectional attention for images, trained end-to-end with flow matching.
- Pathway II (Query-Guided): Employs a MetaQuery to extract conditioning signals from the frozen MLLM, then projects them to the flow matching decoder via a connector.
- Comparison Conclusion: Transfusion achieves faster convergence and superior consistency in editing tasks.
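The mixed attention pattern described for Pathway I can be sketched as a boolean mask. This is a minimal illustration, assuming that the sequence is causal overall and that tokens belonging to the same image may additionally attend to each other in both directions; the function name `transfusion_mask` and the rule that a contiguous run of "image" entries forms one image are hypothetical choices for this example, not details taken from the paper.

```python
def transfusion_mask(modalities):
    """Return allowed[i][j] == True if position i may attend to position j.

    `modalities` is a list such as ["text", "image", ...]. Each contiguous
    run of "image" entries is treated as one image block (an assumption).
    """
    n = len(modalities)

    # Give every contiguous run of image tokens its own block id.
    block = [None] * n
    current = -1
    for i, kind in enumerate(modalities):
        if kind == "image":
            if i == 0 or modalities[i - 1] != "image":
                current += 1
            block[i] = current

    allowed = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i  # standard left-to-right visibility
            same_image = block[i] is not None and block[i] == block[j]
            allowed[i][j] = causal or same_image
    return allowed


sequence = ["text", "text", "image", "image", "image", "text"]
mask = transfusion_mask(sequence)
for row in mask:
    print("".join("1" if ok else "0" for ok in row))
```

In the printed mask, the rows for the three image tokens are identical: each sees the preceding text and the whole image, but not the trailing text token. Text rows remain strictly causal.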
Loss & Training
- Reconstruction Phase: Flow matching loss + perceptual loss (LPIPS)
- Generation Phase: \(\mathcal{L}_{\text{FM}} = \mathbb{E}[\|\mathbf{v}_t - \mathbf{v}_\theta(\tilde{\mathbf{z}}_t, t; \mathbf{c})\|_2^2]\), the standard flow matching objective.
- Understanding Phase: Cross-entropy loss \(\mathcal{L}_{ce}\).
- Four-Stage Progressive Training: Alignment → Pre-training → Continued Training → Supervised Fine-Tuning.
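The generation-phase objective above can be illustrated on a single sample. The summary does not state which probability path is used, so this sketch assumes the common linear (rectified-flow style) path \(\tilde{\mathbf{z}}_t = (1-t)\,\boldsymbol{\epsilon} + t\,\tilde{\mathbf{z}}\) with target velocity \(\mathbf{v}_t = \tilde{\mathbf{z}} - \boldsymbol{\epsilon}\). The text condition \(\mathbf{c}\) is omitted for brevity, and `predict_velocity`, `oracle`, and `zero_model` are hypothetical stand-ins for the learned network.

```python
import random


def flow_matching_loss(data, noise, t, predict_velocity):
    """Squared L2 error between target and predicted velocity for one sample."""
    # Point on the assumed linear path between noise (t=0) and data (t=1).
    z_t = [(1.0 - t) * n + t * x for x, n in zip(data, noise)]
    # Along a linear path the velocity is constant: data minus noise.
    target = [x - n for x, n in zip(data, noise)]
    pred = predict_velocity(z_t, t)
    return sum((a - b) ** 2 for a, b in zip(target, pred))


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stands in for a compressed feature
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]
t = rng.random()


def oracle(z_t, t):
    """Cheating predictor that returns the exact target velocity."""
    return [x - n for x, n in zip(data, noise)]


def zero_model(z_t, t):
    """Untrained predictor that always outputs zero velocity."""
    return [0.0] * len(z_t)


loss_oracle = flow_matching_loss(data, noise, t, oracle)
loss_zero = flow_matching_loss(data, noise, t, zero_model)
print(loss_oracle, loss_zero > loss_oracle)
```

In training, the expectation in \(\mathcal{L}_{\text{FM}}\) would be approximated by averaging this per-sample loss over random draws of data, noise, and \(t\).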
Key Experimental Results
Main Results: Image Reconstruction (ImageNet val)
| Method | rFID↓ | PSNR↑ | SSIM↑ |
|---|---|---|---|
| FLUX.1-dev VAE | 0.06 | 33.65 | 0.93 |
| UniCom (d1152, Uncompressed) | 0.38 | 22.60 | 0.61 |
| UniCom (d64, 18× Compression) | 0.42 | 22.28 | 0.61 |
| UniTok | 0.38 | - | - |
| SD-VAE | 1.06 | 28.62 | 0.86 |
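Key finding: Compressing channels from 1152 to 64 (18×) raises rFID only from 0.38 to 0.42 and lowers PSNR from 22.60 to 22.28, with SSIM unchanged at 0.61. Relative to the uncompressed semantic features, channel-wise compression is therefore almost lossless.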
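The size of the gap can be checked directly from the table. The short calculation below uses only the reported numbers; the variable names are illustrative.

```python
# Values copied from the reconstruction table above.
full_dim, compressed_dim = 1152, 64
rfid_full, rfid_compressed = 0.38, 0.42
psnr_full, psnr_compressed = 22.60, 22.28

ratio = full_dim / compressed_dim
rfid_relative_increase = (rfid_compressed - rfid_full) / rfid_full
psnr_drop = psnr_full - psnr_compressed

print(f"channel compression ratio: {ratio:.0f}x")               # 18x
print(f"relative rFID increase: {rfid_relative_increase:.1%}")  # 10.5%
print(f"PSNR drop: {psnr_drop:.2f} dB")                         # 0.32 dB
```

So an 18× reduction in channel width costs about 0.04 rFID in absolute terms (roughly a tenth of the uncompressed value) and about a third of a dB in PSNR.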
Main Results: Text-to-Image Generation (GenEval)
| Model | Overall |
|---|---|
| UniCom | 0.91 (SOTA among unified) |
| FLUX.1-Dev | 0.82 |
| Janus-Pro | 0.80 |
| Show-o2 | 0.76 |
Ablation Study
- Channel vs. Spatial Compression: Under the same compression ratio, channel-wise compression significantly outperforms spatial downsampling in both reconstruction and generation.
- Attention vs. MLP Compressor: The attention-based compressor outperforms MLP in terms of semantic preservation and reconstruction quality.
- Transfusion vs. Query-Guided: Transfusion achieves faster convergence and higher editing consistency.
Key Findings
- Continuous semantic representations can be significantly compressed in the channel dimension with almost no information loss.
- The unified model performs exceptionally well in text rendering and complex editing without relying on a VAE for identity preservation.
- The GenEval Overall score of 0.91 exceeds that of every unified model compared, including OmniGen2 (0.86) and Mogao (0.89).
Highlights & Insights
- The finding that channel-wise compression far outperforms spatial compression offers guidance beyond this model. The intuition is that spatial tokens carry positional information and local details, whereas the channel dimension is highly redundant.
- The paradigm of continuous semantic representation + compression elegantly resolves the dilemma between discretization loss and the difficulties of high-dimensional modeling.
- A systematic comparison between the Transfusion and Query-Guided paradigms provides valuable design guidelines.
- High-quality image editing with identity preservation is achieved without relying on a VAE, indicating that the compressed semantic representations contain sufficient information.
Limitations & Future Work
- Reconstruction quality remains well below that of dedicated VAEs: UniCom (d64) reaches PSNR 22.28 and SSIM 0.61, versus 33.65 and 0.93 for the FLUX.1-dev VAE. This indicates that semantic compression struggles to preserve pixel-level precision.
- The four-stage training pipeline is complex, leading to high training costs.
- The asymmetric design—using uncompressed features for understanding tasks and compressed features for editing tasks—increases system complexity.
- Validation is only performed at the 7B scale; the scaling behavior at larger sizes remains unknown.
Related Work & Insights
- vs Janus-Pro/Show-o2: UniCom avoids discretization loss and leads clearly in generation quality (GenEval Overall 0.91 vs. 0.80 and 0.76).
- vs VUGEN: Also employs continuous representations, but UniCom replaces the MLP with an attention-based compressor, which preserves structural semantics more effectively.
- vs Transfusion: Introduces semantic compression on top of its architecture, greatly improving training stability and convergence speed.
- Insight: Channel-wise compression strategies can be generalized to other scenarios requiring a feature bottleneck (e.g., video understanding, 3D generation).
Rating
- Novelty: ⭐⭐⭐⭐ — The systematic analysis of channel-wise vs. spatial compression and the design of the attention-based semantic compressor are novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-task evaluation across reconstruction, generation, and editing with thorough ablation studies, though it lacks quantitative comparisons on understanding tasks.
- Writing Quality: ⭐⭐⭐⭐ — Clearly motivated with a rigorous comparative design of the two pathways.
- Value: ⭐⭐⭐⭐⭐ — Achieves SOTA generation quality for unified models, and the findings on channel-wise compression offer important guiding insights for future work.