CaO2: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation¶
Conference: ICCV 2025 arXiv: 2506.22637 Code: GitHub Area: Image Generation / Dataset Distillation Keywords: Dataset Distillation, Diffusion Models, Condition Consistency, Objective Consistency, Latent Space Optimization
TL;DR¶
This paper identifies two critical issues in diffusion-based dataset distillation — objective inconsistency and condition inconsistency — and proposes a two-stage framework, CaO2: the first stage mitigates objective inconsistency via classifier-guided sample selection, and the second stage mitigates condition inconsistency via latent space optimization to maximize conditional likelihood, achieving an average improvement of 2.3% on ImageNet.
Background & Motivation¶
Dataset distillation aims to construct a compact surrogate dataset such that models trained on it can achieve performance close to that of full-data training. Traditional methods rely on matching-based optimization (gradient matching, feature matching, trajectory matching), but struggle to scale to large-scale, high-resolution datasets.
Recent diffusion-based distillation methods (e.g., Minimax Diffusion, D4M) leverage pretrained diffusion models as powerful distribution learners to generate representative samples as substitutes for the original data. However, the authors find that these methods overlook the evaluation process, leading to two severe inconsistencies:
Objective Inconsistency (OI): The distillation stage optimizes a generative objective (image fidelity), while the evaluation stage uses a classification objective. Images generated by diffusion models may be visually realistic yet lack discriminability.
Condition Inconsistency (CI): Diffusion models are imperfectly trained in practice; images \(\mathbf{x}_0^i\) generated under class condition \(c_i\) retain non-zero likelihood under other class conditions, i.e., \(p_\theta(\mathbf{x}_0^i | c_j) > 0, j \neq i\), resulting in degraded image–label pairing quality.
Method¶
Overall Architecture¶
CaO2 is a two-stage framework: the first stage generates an image pool and selects highly discriminative samples via a lightweight classifier (mitigating OI); the second stage optimizes the latent representations of selected samples in latent space to maximize conditional likelihood (mitigating CI). The entire pipeline requires no training of the diffusion backbone and can be completed on a single A6000 GPU.
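As a rough mental model, the pipeline can be summarized with the Python skeleton below. This is a minimal sketch; the callables `sample_fn`, `select_fn`, and `optimize_fn` are placeholders for the diffusion sampler, the OSS stage, and the CLO stage, not the authors' actual interfaces.

```python
from typing import Callable, Sequence, TypeVar

Latent = TypeVar("Latent")  # stands in for an image latent (e.g. a torch.Tensor)

def distill_class(
    sample_fn: Callable[[int, int], Sequence[Latent]],                    # (class_id, n) -> pool
    select_fn: Callable[[Sequence[Latent], int, int], Sequence[Latent]],  # stage 1: OSS
    optimize_fn: Callable[[Latent, int], Latent],                         # stage 2: CLO, per image
    class_id: int,
    ipc: int,
    m: int = 2,                                                           # pool expansion factor
) -> list:
    """Sketch of the CaO2 pipeline for one class: generate -> select -> optimize."""
    pool = sample_fn(class_id, m * ipc)      # sample an expanded pool from the frozen diffusion model
    picked = select_fn(pool, class_id, ipc)  # keep the IPC most useful samples (OSS)
    return [optimize_fn(z, class_id) for z in picked]  # refine each latent (CLO)
```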
Key Designs¶
- Objective-guided Sample Selection (OSS): For each class, \(m \times \text{IPC}\) images are first generated using a pretrained diffusion model (\(m\) is an expansion factor, typically 2–4), and a lightweight pretrained classifier (e.g., ResNet-18) is used to obtain the predicted probability for each image. The selection procedure (a code sketch follows this list):
- Retain only samples correctly classified as the conditioning class
- Select samples based on task difficulty: high-confidence samples (easy samples) for low-IPC settings, low-confidence samples (hard samples) for high-IPC settings
- Fill any shortfall by randomly sampling from the remaining images
- Condition-aware Latent Optimization (CLO): With the diffusion model parameters fixed, gradient-based optimization is applied to the latent variables of selected images to shift them toward regions of higher conditional likelihood (a code sketch follows the Loss & Training list below). The optimization objective is: $$\min_{\mathbf{x}} \mathbb{E}_{t,\varepsilon}\left[\|\epsilon_\theta(\mathbf{x}_t, \hat{\mathbf{c}}, t) - \varepsilon\|_2^2 + \lambda\|\epsilon_\theta(\mathbf{x}_t, \hat{\mathbf{c}}, t) - \varepsilon\|_\infty\right]$$ where \(\lambda=10\) controls the regularization strength and the \(\|\cdot\|_\infty\) term prevents local regions from deviating excessively. The timestep \(t\) is sampled from \([1, \hat{T}]\) (\(\hat{T} \ll T\)) so that the latent variable is only moderately perturbed. Each image is optimized for 100 iterations.
- Task-oriented Variation:
- The optimization condition \(\hat{\mathbf{c}}\) is selected based on validation accuracy as a proxy for task difficulty
- Easy tasks (easily separable classes): use the ground-truth class label \(c\) as the condition
- Hard tasks (difficult-to-separate classes): use the unconditional label \(\phi\) (classifier-free guidance), allowing the conditional information to be embedded into the image latent itself: $$\hat{\mathbf{c}} = c \cdot \mathbb{1}(c \in \mathbb{C}_e) + \phi \cdot \mathbb{1}(c \in \mathbb{C}_h)$$
- Extension to MAR (Masked Autoregressive Model): The method is not limited to diffusion models and can be applied to autoregressive generative models such as MAR. Key adaptations include replacing the noise-adding operation with random masking and designing a zero-label embedding to replace the classifier-free guidance embedding.
Loss & Training¶
- Pretrained DiT (256×256 resolution) is used as the backbone with 50-step denoising sampling
- Adam optimizer with a learning rate of 0.0006
- Each image is optimized for 100 iterations with fixed input noise
- All experiments can be completed on a single RTX A6000 GPU
- An expansion factor of \(m=2\) or \(m=4\) is sufficient for the sample selection stage
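Combining the CLO objective with the training settings above, a hedged PyTorch sketch of the per-image latent optimization could look like this. Here `eps_model` stands for the frozen DiT noise predictor and `alphas_cumprod` for its noise schedule; the mean-form L2 term and the forward-noising details are assumptions for illustration, not the authors' code.

```python
import torch

def optimize_latent(eps_model, x0, cond, alphas_cumprod,
                    t_max=50, steps=100, lr=6e-4, lam=10.0):
    """Condition-aware latent optimization (CLO) sketch.

    eps_model(x_t, cond, t) -> predicted noise; backbone weights stay frozen.
    x0:   image latent to optimize, shape (1, C, H, W).
    cond: class label/embedding, or the unconditional label for hard classes.
    alphas_cumprod: (T,) cumulative noise schedule of the pretrained model.
    """
    x = x0.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    noise = torch.randn_like(x0)                                 # fixed input noise across iterations

    for _ in range(steps):
        t = torch.randint(1, t_max + 1, (1,), device=x.device)   # small t: mild perturbation only
        a = alphas_cumprod.to(x.device)[t].view(-1, 1, 1, 1)
        x_t = a.sqrt() * x + (1 - a).sqrt() * noise              # forward-diffuse the current latent
        resid = eps_model(x_t, cond, t) - noise
        loss = resid.pow(2).mean() + lam * resid.abs().max()     # L2 term (MSE form) + L_inf regularizer
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```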
Key Experimental Results¶
Main Results¶
| Dataset (eval model) | IPC | SRe2L | Minimax | RDED | CaO2 (Ours) | Gain (vs. best baseline) |
|---|---|---|---|---|---|---|
| ImageWoof (ResNet-18) | 10 | 20.2 | 40.1 | 38.5 | 45.6 | +5.5 |
| ImageWoof (ResNet-18) | 50 | 23.3 | 67.0 | 68.5 | 68.9 | +0.4 |
| ImageNette (ResNet-18) | 10 | 29.4 | 61.4 | 61.4 | 65.0 | +3.6 |
| ImageNette (ResNet-50) | 50 | 71.2 | 77.1 | 78.0 | 82.7 | +4.7 |
| ImageNet-100 (ResNet-18) | 50 | 27.0 | 63.9 | 61.6 | 68.0 | +4.1 |
| ImageNet-1K (ResNet-18) | 10 | 21.3 | 44.3 | 42.0 | 46.1 | +1.8 |
| ImageNet-1K (ResNet-50) | 10 | 28.4 | 49.7 | 43.6 | 53.0 | +3.3 |
Ablation Study¶
| OSS | CLO | ImageWoof IPC=10 | ImageWoof IPC=50 | ImageNette IPC=10 | ImageNette IPC=50 |
|---|---|---|---|---|---|
| ✗ | ✗ | 38.7 | 66.1 | 61.4 | — |
| ✓ | ✗ | 42.3 | 67.5 | 63.2 | — |
| ✗ | ✓ | 41.8 | 67.2 | 62.8 | — |
| ✓ | ✓ | 45.6 | 68.9 | 65.0 | — |
Key Findings¶
- Both components (OSS and CLO) contribute comparably in isolation, with their combination yielding superior performance
- Improvements are more pronounced under low-IPC settings, where each image has a greater impact on dataset quality
- The average gain on ImageNette (4.3%) exceeds that on ImageWoof (1.6%), suggesting that diversity preservation benefits more varied datasets
- The method is model-agnostic at evaluation time — the same distilled dataset can be used across ResNet-18/50/101
- The framework generalizes effectively to MAR, validating its broader applicability
Highlights & Insights¶
- Precise problem formulation: The challenges of diffusion-based distillation are characterized as two quantifiable inconsistencies, with clear and rigorous formal definitions
- High efficiency: No training or fine-tuning of any generative model is required; only lightweight optimization of latent variables is performed, enabling full ImageNet-1K distillation on a single GPU
- Plug-and-play: The framework can serve as a post-processing module for existing diffusion-based distillation methods or be directly applied to images sampled from a DiT
Limitations & Future Work¶
- Sample selection depends on the quality of the pretrained classifier, whose biases may propagate into the distilled dataset
- Latent space optimization may lead to a slight degradation in visual quality, despite improved discriminability
- Performance gains diminish in very large IPC settings
- The task-oriented optimization strategy (distinguishing easy/hard classes) requires additional validation accuracy information
Related Work & Insights¶
- vs. Minimax Diffusion: The latter improves representativeness by fine-tuning the diffusion model, whereas CaO2 does not modify any model parameters
- vs. D4M: D4M clusters latent variables via prototype learning but still lacks guidance from a classification objective
- Core insight: A "good image" for a generative model is not necessarily a "good sample" for classification — this observation has broad implications for all generative model-based data augmentation strategies
Rating¶
- Novelty: ⭐⭐⭐⭐ The problem analysis is insightful, and the formulation of the two inconsistencies is convincing
- Experimental Thoroughness: ⭐⭐⭐⭐ Systematic experiments across multiple datasets, architectures, and IPC settings
- Writing Quality: ⭐⭐⭐⭐⭐ Formal definitions are clear, with a well-aligned problem–solution correspondence
- Value: ⭐⭐⭐⭐ Offers methodological guidance for the diffusion-based dataset distillation community