FlowTok: Flowing Seamlessly Across Text and Image Tokens

Metadata

  • Conference: ICCV 2025
  • arXiv: 2503.10772
  • Code: GitHub
  • Area: Image Generation / Cross-modal Generation
  • Keywords: Flow Matching, 1D Token, Text-to-Image Generation, Compact Representation, Cross-modal

TL;DR

FlowTok proposes encoding both text and images as compact 1D token representations (\(77 \times 16\)) and directly evolving between text and image tokens via flow matching, eliminating the need for complex conditioning mechanisms or noise schedules, thereby enabling efficient cross-modal generation.

Background & Motivation

Conventional text-to-image generation methods treat text as a conditioning signal, progressively guiding a denoising process from Gaussian noise toward the target image. This requires complex conditioning mechanisms (e.g., cross-attention, concatenation) and noise scheduling strategies.

FlowTok explores a simpler paradigm: directly evolving between text and image modalities via flow matching. This requires projecting both modalities into a shared latent space, where the representational gap between text (1D sequences, high-dimensional semantics) and images (2D spatial structure, redundant information) constitutes the core challenge.

Prior work CrossFlow mapped text into a 2D latent space to match image embeddings, but the additional computational overhead of the text variational autoencoder made it slower than SD1.5/2.1, undermining its efficiency motivation.

Method

Overall Architecture

The core idea of FlowTok is to encode both text and images as compact 1D tokens of shape \(77 \times 16\):

  • Image side: A modified TA-TiTok encodes images into \(\mathbf{Z}_I \in \mathbb{R}^{K \times D}\) (\(K=77\), \(D=16\)).
  • Text side: A CLIP text encoder extracts initial embeddings, which are then mapped to a low-dimensional variational latent space \(\mathbf{Z}_T \in \mathbb{R}^{N \times D}\) via a text projector.
  • Generative model: Vanilla flow matching with DiT blocks, where text tokens serve directly as the source distribution.

Compared to the conventional 2D flow matching latent space of \(32 \times 32 \times 4\), FlowTok achieves a 3.3× compression ratio.
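Because text tokens replace Gaussian noise as the source distribution, the training objective reduces to plain flow matching between the two token sets. Below is a minimal numpy sketch of the velocity-prediction loss on a linear path; the random tokens and the zero `velocity_pred` are toy placeholders standing in for the projector outputs, tokenizer outputs, and the DiT prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 77, 16  # token count and channel dim from the paper

# Toy stand-ins for projected text tokens (source) and image tokens (target).
z_text = rng.normal(size=(K, D))
z_image = rng.normal(size=(K, D))

def fm_loss(velocity_pred, z_text, z_image):
    """Flow-matching loss for the linear path z_t = (1-t)*z_text + t*z_image,
    whose target velocity (z_image - z_text) is constant in t."""
    target_velocity = z_image - z_text
    return np.mean((velocity_pred - target_velocity) ** 2)

t = rng.uniform()                      # sample a time in [0, 1]
z_t = (1 - t) * z_text + t * z_image   # point on the path fed to the model
velocity_pred = np.zeros((K, D))       # placeholder for the DiT's prediction
loss = fm_loss(velocity_pred, z_text, z_image)
```

Note that no noise schedule appears anywhere: the source endpoint is the text latent itself, so the only stochasticity comes from the variational text projector.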

Key Designs

1. Image Tokenizer Improvements
  • Based on the TA-TiTok architecture; the number of latent tokens \(K\) is set to 77 to match the CLIP text length.
  • RoPE replaces learnable 1D positional encodings to improve positional modeling.
  • SwiGLU FFN replaces the standard MLP to enhance latent space quality.
  • Encoder uses ViT-B; decoder uses ViT-L; patch size = 16.

2. Text Projector
  • 6 Transformer blocks with skip connections.
  • Projects CLIP text embeddings (\(77 \times 768\)) into a low-dimensional space (\(77 \times 16\)).
  • KL divergence regularization is applied to the projected text tokens to introduce generative diversity.
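The KL regularization on the projected text tokens follows the usual variational recipe: the projector predicts a per-token mean and log-variance, and the latent is penalized toward a standard normal. The sketch below shows that computation in numpy under that standard-prior assumption; `mu` and `logvar` are hypothetical projector outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 77, 16  # text token count and channel dim

# Hypothetical projector outputs: per-token mean and log-variance.
mu = 0.1 * rng.normal(size=(N, D))
logvar = -2.0 + 0.1 * rng.normal(size=(N, D))

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), averaged over tokens and channels."""
    return 0.5 * np.mean(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps (reparameterization trick)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

kld = kl_to_standard_normal(mu, logvar)
z_text = reparameterize(mu, logvar, rng)  # stochastic source tokens
```

Sampling the source tokens rather than using a deterministic projection is what gives the model generative diversity for a fixed caption.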

3. Text Alignment Loss

To prevent semantic information loss due to channel dimensionality reduction, a CLIP-style contrastive loss is introduced:

\[\mathcal{L}_{\text{align}} = \frac{1}{2}(\text{CE}(\text{logits}_{TZ}, \text{labels}) + \text{CE}(\text{logits}_{ZT}, \text{labels}))\]

where logits are computed as scaled cosine similarities using a learnable temperature parameter \(\tau\).
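A minimal numpy sketch of this symmetric contrastive loss is below; the pooled per-sample features `T` and `Z` are toy stand-ins for the CLIP text embeddings and the projected tokens, and the temperature default is an assumption, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 16  # batch size and feature dim (toy values)

# Toy pooled features: one vector per caption (T) and per token sequence (Z);
# matched pairs share the same batch index.
T = rng.normal(size=(B, D))
Z = T + 0.1 * rng.normal(size=(B, D))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Row-wise softmax cross-entropy with integer labels."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

def align_loss(T, Z, tau=0.07):
    """Symmetric CLIP-style loss over scaled cosine-similarity logits."""
    logits_TZ = normalize(T) @ normalize(Z).T / tau
    labels = np.arange(len(T))  # diagonal entries are the positive pairs
    return 0.5 * (cross_entropy(logits_TZ, labels)
                  + cross_entropy(logits_TZ.T, labels))

loss = align_loss(T, Z)
```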

Loss & Training

\[\mathcal{L} = \mathcal{L}_{\text{fm}} + \gamma_1 \cdot \mathcal{L}_{\text{kld}} + \gamma_2 \cdot \mathcal{L}_{\text{align}}\]
  • \(\mathcal{L}_{\text{fm}}\): flow matching velocity prediction loss
  • \(\mathcal{L}_{\text{kld}}\): KL divergence regularization (\(\gamma_1 = 10^{-4}\))
  • \(\mathcal{L}_{\text{align}}\): text alignment contrastive loss (\(\gamma_2 = 1\))

Model Scales

| Model | Depth | Width | MLP | Heads | Parameters |
|---|---|---|---|---|---|
| FlowTok-B | 12 | 768 | 3072 | 12 | 153M |
| FlowTok-XL | 28 | 1152 | 4608 | 16 | 698M |
| FlowTok-H | 36 | 1280 | 5120 | 20 | 1.1B |

Key Experimental Results

Main Results: Zero-shot Text-to-Image Generation

| Method | Params | Open Data | Training Cost (8-A100 days) | Inference Speed (samples/s) | COCO FID-30K↓ | MJHQ-30K FID↓ |
|---|---|---|---|---|---|---|
| PixArt-α | 630M | | 94.1 | 7.9 | 7.32 | 9.85 |
| SD-2.1 | 860M | | 1041.6 | – | 13.45 | 26.96 |
| Show-o | 1.3B | | – | 1.0 | 9.24 | 14.99 |
| CrossFlow | 950M | | 78.8 | 1.1 | 9.63 | – |
| FlowTok-XL | 698M | ✓ | 20.4 | 22.7 | 10.06 | 7.68 |
| FlowTok-H | 1.1B | ✓ | 26.1 | 18.2 | 9.67 | 7.15 |

Ablation Study: Text Alignment Loss

| Alignment Target | COCO FID-30K↓ |
|---|---|
| Average Pooling | 36.02 |
| MLP | 29.14 |

| Loss Type | COCO FID-30K↓ |
|---|---|
| Cosine | 31.80 |
| Contrastive | 29.14 |

| \(\gamma_2\) | COCO FID-30K↓ |
|---|---|
| 1.0 | 29.14 |
| 2.0 | 30.59 |

Key Findings

  1. Extreme efficiency: FlowTok-H requires only 26.1 eight-A100 GPU days for training — 1/40 of SD-2.1 and 1/3.6 of PixArt-α.
  2. Fast inference: FlowTok-XL generates 22.7 images per second, 20× faster than CrossFlow and 22× faster than Show-o.
  3. Memory efficiency: The largest model supports a batch size of 8K on 8 A100s without gradient checkpointing or gradient accumulation.
  4. Bidirectional generation: The same framework naturally supports image-to-text generation; FlowTok-XL achieves a CIDEr score of 117.0 on COCO Karpathy.

Highlights & Insights

  1. Paradigm innovation: Text is recast from a "conditioning signal" to a "source distribution," with flow matching evolving directly between modalities, eliminating complex conditioning mechanisms.
  2. Unified 1D token representation: By encoding images as 1D tokens, the method elegantly unifies the representations of text and images.
  3. Only 20 sampling steps: The compact 1D latent space requires far fewer sampling steps than conventional 2D approaches.
  4. Fully open-source data: All training uses publicly available datasets, ensuring reproducibility.
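The 20-step sampling in point 3 amounts to integrating the learned velocity field from the text endpoint (t = 0) to the image endpoint (t = 1). A minimal Euler-integration sketch is below; `velocity_fn` is a placeholder that returns the exact velocity of a linear path toward a toy target, standing in for the trained DiT.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 77, 16

z_text = rng.normal(size=(K, D))    # source: projected text tokens
z_target = rng.normal(size=(K, D))  # toy "image" endpoint for the placeholder

def velocity_fn(z_t, t):
    """Placeholder for the DiT: exact velocity of the linear path to z_target."""
    return (z_target - z_t) / (1.0 - t) if t < 1.0 else np.zeros_like(z_t)

def sample(z0, velocity_fn, steps=20):
    """Euler integration of dz/dt = v(z, t) from t=0 (text) to t=1 (image)."""
    z, dt = z0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        z = z + dt * velocity_fn(z, t)
    return z

z_image = sample(z_text, velocity_fn, steps=20)  # then decode with the tokenizer
```

With an exact linear-path velocity, 20 Euler steps land on the target; with a learned field, step count trades speed against integration error, which is where the compact 1D latent helps keep the budget small.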

Limitations & Future Work

  1. Image resolution is limited to 256×256; higher resolutions have not been validated.
  2. Reliance on the CLIP text encoder (77-token limit) constrains the handling of long textual descriptions.
  3. The 1D representation may be less effective than 2D methods for spatially fine-grained control.
  4. Image-to-text generation performance is competitive but not state-of-the-art.
Related Work

  • CrossFlow: Also explores cross-modal flow matching, but uses a 2D latent space, incurring substantial computational overhead.
  • TA-TiTok: Provides the foundational architecture for 1D image tokenization.
  • DiT: FlowTok's generative model is based on DiT blocks, with cross-attention and other conditioning mechanisms removed.

Rating

⭐⭐⭐⭐ — Strong paradigm innovation with significant efficiency gains; however, limitations in resolution and text length reduce practical applicability.