Pretrained Reversible Generation as Unsupervised Visual Representation Learning

Conference: ICCV 2025 · arXiv: 2412.01787 · Code: Project Page · Area: Diffusion Models, Representation Learning · Keywords: Reversible generation, flow matching, unsupervised representation learning, pretrain-finetune, mutual information

TL;DR

PRG extracts unsupervised visual representations by inverting the generation process of pretrained continuous generative models (diffusion/flow models), enabling model-agnostic adaptation to discriminative tasks. It achieves 78% top-1 accuracy on ImageNet 64×64, establishing a new state of the art among generative-model-based methods.

Background & Motivation

Diffusion/flow models have achieved remarkable success in generative tasks, yet their potential for discriminative tasks remains underexplored. Existing methods that leverage diffusion models for discrimination suffer from the following issues:

Generative classifiers (\(p(y|x) = p(x|y)p(y)/p(x)\)): computationally expensive, requiring inference over all classes.

Intermediate feature extraction (e.g., DDAE): relies on specific network modules (e.g., a particular UNet layer), leading to complex and non-generalizable designs.

Large performance gap: still significantly behind discriminative approaches.

Core insight: "What I cannot create, I do not understand" (Feynman) — a model capable of generating data must have internalized the structure of that data. Inverting the generation process thus serves naturally as feature extraction.

Method

1. Pretraining Stage

Three variants of continuous-time flow models are trained:

  • PRG-GVP: Generalized VP-SDE, \(\alpha_t = \cos(\frac{\pi t}{2})\), \(\sigma_t = \sin(\frac{\pi t}{2})\)
  • PRG-ICFM: Conditional flow matching, \(v(x_t|x_0,x_1) = x_1 - x_0\)
  • PRG-OTCFM: Optimal transport conditional flow matching with joint sampling of \((x_0, x_1)\)

Training objective (flow matching loss):

\[\mathcal{L}_{\text{FM}} = \frac{1}{2}\int_0^1 \mathbb{E}_{p(x_t)}\!\left[\lambda_{\text{FM}}(t)\,\|v_\theta(x_t) - v(x_t)\|^2\right] dt\]
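
As a concrete illustration, here is a minimal PyTorch sketch of one I-CFM training step under the paper's convention (\(x_0\) = data, \(x_1\) = noise). `velocity_net` stands for any network mapping \((x_t, t)\) to a velocity; the uniform weighting \(\lambda_{\text{FM}}(t) = 1\) is an assumption of this sketch, not the paper's exact setup.

```python
import torch

def icfm_loss(velocity_net, x0):
    """One I-CFM step: regress the constant conditional velocity
    v(x_t | x_0, x_1) = x_1 - x_0 along the straight interpolation
    x_t = (1 - t) x_0 + t x_1, with x_0 = data and x_1 ~ N(0, I)."""
    x1 = torch.randn_like(x0)                      # noise endpoint
    t = torch.rand(x0.shape[0], device=x0.device)  # t ~ U[0, 1]
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))       # broadcast over C, H, W
    xt = (1.0 - t_) * x0 + t_ * x1                 # point on the path
    target = x1 - x0                               # conditional velocity
    return 0.5 * ((velocity_net(xt, t) - target) ** 2).mean()
```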

2. Reversed Generation as Feature Extraction

The generation process runs from \(t = 1\) (noise) to \(t = 0\) (data); its inversion runs from \(t = 0\) to \(t = 1\) (data → features). The representation \(x_t = F_\theta(x_0)\) can be read out at any point \(t\) along the inverted trajectory.
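
A minimal sketch of this feature extractor, assuming a fixed-step Euler solver (the paper does not prescribe a particular solver). Gradients flow through the integration, which is what the finetuning stage below relies on:

```python
def extract_features(velocity_net, x0, t_start=0.0, t_end=1.0, n_steps=20):
    """F_theta: integrate dx/dt = v_theta(x, t) from t_start (data side)
    to t_end (feature side) with fixed-step Euler. Wrap in torch.no_grad()
    for pure inference; leave gradients on when finetuning."""
    x = x0
    ts = torch.linspace(t_start, t_end, n_steps + 1, device=x0.device)
    for i in range(n_steps):
        t = ts[i].expand(x.shape[0])                 # same t for the batch
        x = x + (ts[i + 1] - ts[i]) * velocity_net(x, t)  # one Euler step
    return x                                         # representation z = x_{t_end}
```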

3. Finetuning Stage

A classifier \(p_\phi(y|z)\) is appended to the inverted trajectory, and both the flow model and classifier are jointly finetuned:

\[\mathcal{L}_{\text{total}} = -\sum_{i=1}^N \log p_\phi(y_i | F_\theta(x_i)) + \beta \mathcal{L}_{\text{FM}}(x)\]

Classifier design: a simple two-layer MLP with tanh suffices — the inverted features are already highly structured.
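
Continuing the sketches above, a minimal sketch of the joint objective \(\mathcal{L}_{\text{total}}\). The two-layer tanh head follows the paper's description; the hidden width of 512 and the use of `icfm_loss` as the flow-matching regularizer are assumptions of this sketch.

```python
import torch.nn as nn
import torch.nn.functional as F

class MLPHead(nn.Module):
    """Two-layer MLP classifier p_phi(y | z) with tanh, as described above."""
    def __init__(self, feat_dim, n_classes, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, z):
        return self.net(z.flatten(1))  # flatten spatial features to a vector

def total_loss(velocity_net, head, x, y, beta=1.0, n_steps=20):
    z = extract_features(velocity_net, x, n_steps=n_steps)  # gradients flow
    ce = F.cross_entropy(head(z), y)                        # -log p_phi(y | z)
    return ce + beta * icfm_loss(velocity_net, x)           # FM regularizer
```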

4. Theoretical Guarantee

Pretraining is equivalent to maximizing the mutual information \(\mathcal{I}(X,Z)\) between data \(X\) and representation \(Z\):

\[\theta^* = \arg\max_\theta \mathcal{I}(X,Z) = \arg\max_\theta \mathbb{E}_{p(z,x)}[\log p(x|z)]\]

Flow matching training maximizes a lower bound on the likelihood (Eq. 8), thereby indirectly maximizing mutual information.
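
To make this step explicit (a standard decomposition, reconstructed here rather than quoted from the paper): since the data entropy \(H(X)\) does not depend on \(\theta\),

\[\mathcal{I}(X,Z) = H(X) - H(X \mid Z) = H(X) + \mathbb{E}_{p(x,z)}[\log p(x \mid z)],\]

so maximizing \(\mathbb{E}_{p(x,z)}[\log p(x \mid z)]\) is equivalent to maximizing \(\mathcal{I}(X,Z)\), and the flow matching objective bounds this reconstruction term from below.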

Key Experimental Results

Main Results: CIFAR-10 Classification

Method              Params (M)   Accuracy (%)
WideResNet-28-10    36           96.3
ResNeXt-29-16×64d   68           96.4
SBGC                N/A          95.0
DDAE                36           97.2
PRG-GVP             42           97.25
PRG-ICFM            42           97.32
PRG-OTCFM           42           97.42

PRG-OTCFM surpasses the strongest baseline DDAE, reaching 97.42%.

Validation of Continuous Feature Extractor

Inference Steps   20      100     500     1000
PRG-OTCFM         97.42   97.43   97.43   97.44

Although trained with \(t_{\text{span}}=20\), inference with any number of solver steps from 20 to 1000 yields nearly identical accuracy, confirming that the learned velocity field acts as a genuinely continuous feature extractor.
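
In code terms this amounts to varying only the solver resolution of the earlier sketch (a usage illustration with hypothetical variable names; the accuracies are the paper's, not reproduced here):

```python
# Vary only the number of Euler steps in the feature extractor.
for n in (20, 100, 500, 1000):
    z = extract_features(velocity_net, images, n_steps=n)  # accuracy ~unchanged
```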

Effect of Finetuning Trajectory Length

Starting Point                             CIFAR-10 (%)   Tiny-ImageNet (%)
Classifier head only (frozen flow model)   ~50            ~20
\(x_{1/4} \to x_1\)                        95.8           52.0
\(x_{1/2} \to x_1\)                        97.0           58.4
\(x_0 \to x_1\)                            97.4           56.1
  • CIFAR-10 (simpler): full-trajectory finetuning is optimal.
  • Tiny-ImageNet (more complex): starting from an intermediate point is preferable; overly long trajectories may lead to overfitting (a hypothetical sketch of this shortened trajectory follows below).
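
A hypothetical sketch of the shortened trajectory, reusing the extractor above: one plausible reading is to jump to the intermediate point \(x_{t_s}\) via the training-time interpolation and then integrate only the remaining segment. Whether PRG injects noise in exactly this way is an assumption here.

```python
def extract_from(velocity_net, x0, t_s=0.25, n_steps=20):
    """Shortened trajectory x_{t_s} -> x_1: forward-jump to x_{t_s} using the
    I-CFM interpolation, then integrate the remaining segment of the ODE."""
    eps = torch.randn_like(x0)           # noise endpoint sample
    x_ts = (1.0 - t_s) * x0 + t_s * eps  # jump to x_{t_s}
    return extract_features(velocity_net, x_ts, t_start=t_s, n_steps=n_steps)
```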

Pretraining Quality vs. Finetuning Performance

Longer pretraining (higher mutual information) leads to better finetuning performance. Without pretraining, accuracy is only 73.5%; after sufficient pretraining, it reaches 97.4%.

Highlights & Insights

  1. Model-agnostic: does not depend on specific network architectures (applicable to both UNet and Transformer); the latent variable \(Z\) is determined by the ODE solver, independent of network structure.
  2. Infinite-depth expressiveness: continuous-time flow models provide an infinite-layer structure, achieving high expressiveness with a small parameter count.
  3. Elegant realization of the pretrain-finetune paradigm: generative pretraining combined with discriminative finetuning demonstrates that the two are complementary rather than opposing.
  4. Flexible feature selection: different tasks can leverage features extracted at different points along the trajectory.

Limitations & Future Work

  • Experiments are conducted only at 64×64 resolution; effectiveness at higher resolutions remains unknown.
  • Finetuning requires end-to-end training of the flow model (substantial computational cost); freezing the model and training only the classifier yields poor performance.
  • Backpropagation through the ODE solver increases memory and computational requirements.
  • Direct comparison with self-supervised methods such as MAE and DINO is absent.

Related Work

  • Generative classifiers: Diffusion Classifier, HybViT, SBGC
  • Representation learning: Denoising Autoencoders (DAE), MAE, iGPT
  • Diffusion features: DDAE, DiffusionDet, Baranchuk et al.

Rating

Dimension                   Score (1–5)
Novelty                     4
Technical Depth             5
Experimental Thoroughness   4
Writing Quality             4
Overall                     4.2