Scaling Backwards: Minimal Synthetic Pre-training?
Conference: ECCV 2024
arXiv: 2408.00677
Code: GitHub
Area: LLM Pre-training
Keywords: synthetic pre-training, fractal images, minimal datasets, visual representation learning, ViT
TL;DR
Proposes 1p-frac, which reaches pre-training performance comparable to ImageNet-1k using minute perturbations of a single fractal image. This challenges the conventional wisdom that "pre-training requires large-scale datasets" and suggests that pre-training may act more like weight initialization than visual concept learning.
Background & Motivation
Background: Pre-training is a fundamental technology for current vision systems. Mainstream methods use large-scale real image datasets (1.28 million images in ImageNet-1k, 14 million images in ImageNet-21k) for supervised or self-supervised pre-training. Synthetic pre-training directions such as FractalDB (1 million fractal images) and OFDB (1,000 fractal images) have demonstrated that effective representations can be obtained without real images.
Limitations of Prior Work: The scale of foundation models continues to expand (from millions to billions of images), yet the essence of pre-training remains unclear—does it discover general visual concepts, or merely provide better weight initialization? Furthermore, large-scale real datasets suffer from privacy, copyright, and fairness issues.
Key Challenge: OFDB has reduced fractal images to 1,000, but further reducing the number of classes leads to performance degradation. The key question is: "How small can the minimal effective pre-training dataset actually be?"
Goal: To find the minimal purely synthetic pre-training dataset and investigate the minimum requirements for successful pre-training.
Key Insight: Instead of adding more images, the paper proposes "scaling backwards": it constructs "classes" from subtle parameter perturbations of a single fractal image and trains the model to distinguish perturbations that are indistinguishable to the human eye.
Core Idea: The key to pre-training lies not in the volume of data, but in the structured diversity during the data generation process—recursive self-similar structures (fractals) paired with tiny affine transformation perturbations can provide sufficient pre-training signals.
Method
Overall Architecture
1p-frac consists of three core components:
- A single fractal image, defined by an iterated function system (IFS)
- A locally integrated empirical (LIEP) distribution, used to generate perturbed images
- A locally perturbed cross-entropy (LPCE) loss for pre-training
Pre-training workflow: Starting from a single IFS \(\Omega\), a minute perturbation \(\epsilon\) is applied to the affine transformation parameters, generating \(L\) perturbed images as different "classes" to train the ViT to classify these perturbations.
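The workflow above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the base IFS is a Sierpinski-style triangle chosen only for convenience, the image is rendered with the chaos game using uniform transform probabilities, and the perturbations are drawn uniformly at random from \([-\Delta/2, \Delta/2]\). The names `SIERPINSKI`, `perturb_ifs`, `render_ifs`, and `make_1p_frac` are hypothetical.

```python
import numpy as np

# Hypothetical base IFS: three affine maps, each stored as [[a, b, e], [c, d, f]].
SIERPINSKI = np.array([
    [[0.5, 0.0, 0.00], [0.0, 0.5, 0.0]],
    [[0.5, 0.0, 0.50], [0.0, 0.5, 0.0]],
    [[0.5, 0.0, 0.25], [0.0, 0.5, 0.5]],
])


def perturb_ifs(params, delta, rng):
    """Add an independent offset in [-delta/2, delta/2] to every affine parameter."""
    eps = rng.uniform(-delta / 2, delta / 2, size=params.shape)
    return params + eps


def render_ifs(params, n_points=5000, size=64, seed=0):
    """Render a binary image of the IFS attractor with the chaos game.

    The same seed is reused for every image, so two renders differ only
    through their affine parameters (an assumption made for this sketch).
    """
    rng = np.random.default_rng(seed)
    choices = rng.integers(0, len(params), size=n_points)
    v = np.zeros(2)
    pts = np.empty((n_points, 2))
    for i, j in enumerate(choices):
        # Apply w_j(v) = A_j v + t_j, where A_j is the 2x2 part and t_j the last column.
        v = params[j, :, :2] @ v + params[j, :, 2]
        pts[i] = v
    pts = pts[100:]  # discard the first iterations before the orbit settles
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    scale = np.where(hi - lo > 0, hi - lo, 1.0)
    ij = ((pts - lo) / scale * (size - 1)).astype(int)
    img = np.zeros((size, size), dtype=np.uint8)
    img[ij[:, 1], ij[:, 0]] = 1
    return img


def make_1p_frac(base, num_classes=8, delta=0.1, seed=0):
    """Build one image per perturbation; the perturbation index is the class label."""
    rng = np.random.default_rng(seed)
    images = np.stack(
        [render_ifs(perturb_ifs(base, delta, rng)) for _ in range(num_classes)]
    )
    labels = np.arange(num_classes)
    return images, labels


images, labels = make_1p_frac(SIERPINSKI, num_classes=8, delta=0.1)
print(images.shape, labels)
```

Setting `num_classes=1000` would correspond to the \(L = 1000\) perturbation points used in the paper; the small default here only keeps the example fast.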
Key Designs
- Locally Integrated Empirical (LIEP) Distribution: the key mathematical tool.
    - For a single fractal image \(I\), the empirical distribution degenerates to \(p_{\text{data}}(x,y) = \delta(x-I)\delta(y)\). Training directly on it with cross-entropy yields a trivial solution.
    - The LIEP distribution introduces a perturbation parameter \(\Delta\) and integrates over the range \(\boldsymbol{\epsilon} \in \mathcal{R}_\Delta = [-\Delta/2, \Delta/2]^{6j}\): $$p_\Delta(x,y) = \frac{1}{|\mathcal{R}_\Delta|}\int_{\mathcal{R}_\Delta}\delta(x - I_{\boldsymbol{\epsilon}})\delta(y - \boldsymbol{\epsilon})\,d\boldsymbol{\epsilon}$$
    - As \(\Delta \to 0\), the LIEP distribution converges to the original single-image empirical distribution.
    - Design Motivation: To provide a continuously controllable way to shrink or expand the support of the data distribution, so that the minimal distribution range required for pre-training can be studied precisely.
- Locally Perturbed Cross-Entropy (LPCE) Loss: the pre-training objective.
    - \(\mathcal{L}_\Delta = -\mathbb{E}_{x,y \sim p_\Delta}[\log p_\theta(y|x)]\)
    - In practice, the objective is approximated via numerical integration by uniformly sampling \(L=1000\) perturbation points.
    - The perturbation is applied to the affine transformation parameters of the IFS: $$w_j(\boldsymbol{v}; \boldsymbol{\epsilon}_j) = \left(\begin{bmatrix}a_j & b_j & e_j \\ c_j & d_j & f_j\end{bmatrix} + \boldsymbol{\epsilon}_j\right)\begin{bmatrix}\boldsymbol{v} \\ 1\end{bmatrix}$$
    - Design Motivation: To make the model distinguish minute shape differences that are indistinguishable to the human eye, forcing the network to focus on structural patterns rather than surface features.
- σ-factor Controlling Fractal Complexity: Anderson's \(\sigma\)-factor is used to evaluate the complexity of the IFS.
    - The smaller the \(\sigma\), the more complex the fractal (resembling the recursive structures of natural objects).
    - An overly large \(\sigma\) (e.g., 6.0) causes the fractal to degenerate into something resembling Gaussian noise, but it still has a positive pre-training effect.
    - The optimal value is \(\sigma = 3.5\).
    - Design Motivation: To explore which image structures matter most for pre-training, concluding that recursive self-similar structures are more important than mere complexity.
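The numerical-integration step of the LPCE loss can be written out as a finite sum. The form below is a sketch of one natural reading, assuming that each of the \(L\) sampled perturbations \(\boldsymbol{\epsilon}_l\) is treated as its own class with label \(l\); the index \(l\) and this label assignment are assumptions introduced here, not notation taken from the paper.

$$\mathcal{L}_\Delta \approx -\frac{1}{L}\sum_{l=1}^{L}\log p_\theta\left(y = l \mid x = I_{\boldsymbol{\epsilon}_l}\right), \qquad \boldsymbol{\epsilon}_l \in \mathcal{R}_\Delta$$

Under this reading, \(L = 1000\) gives an ordinary 1000-way softmax classification problem with one rendered image per class, and shrinking \(\Delta\) moves all \(L\) images toward the single unperturbed image \(I\).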
Loss & Training
- Pre-training uses the LPCE loss, with hyperparameters following DeiT standard settings.
- Data augmentation adopts DeiT's settings (RandomCrop, RandAug, Mixup, CutMix, etc.).
- Ablation studies find that RandomCrop and Mixup/CutMix have the greatest impact on pre-training performance.
- Exploration studies use ViT-Tiny, and scaling studies use ViT-Base.
- Fine-tuning datasets include CIFAR-10/100, ImageNet-100/1k, Cars, Flowers, etc.
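Because the ablation singles out Mixup/CutMix as especially important, a small sketch of Mixup combined with a soft-label cross-entropy may help make the training signal concrete. This is a generic NumPy illustration, not code from the paper or from DeiT; the function names `mixup` and `soft_cross_entropy` are hypothetical, and the default `alpha=0.8` is an assumed value rather than one stated in this summary.

```python
import numpy as np


def mixup(images, labels, num_classes, alpha=0.8, rng=None):
    """Blend each image (and its one-hot label) with a randomly chosen partner."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)           # mixing weight in (0, 1)
    perm = rng.permutation(len(images))    # partner index for every sample
    one_hot = np.eye(num_classes)[labels]
    mixed_x = lam * images + (1 - lam) * images[perm]
    mixed_y = lam * one_hot + (1 - lam) * one_hot[perm]
    return mixed_x, mixed_y


def soft_cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and soft target distributions."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())


rng = np.random.default_rng(0)
x = rng.random((4, 8, 8))        # four toy "perturbed fractal" images
y = np.array([0, 1, 2, 3])       # one class per perturbation
mixed_x, mixed_y = mixup(x, y, num_classes=4, rng=rng)
loss = soft_cross_entropy(np.zeros((4, 4)), mixed_y)
print(mixed_x.shape, loss)
```

With one image per class, mixing is one of the few sources of variation between training steps, which is consistent with the ablation's observation that removing it hurts pre-training.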
Key Experimental Results
Main Results
Comparison with pre-training datasets of different scales (ViT-Tiny, fine-tuning accuracy in %):
| Dataset | No. of Images | Type | CIFAR-100 | ImageNet-100 |
|---|---|---|---|---|
| Scratch | - | - | 64.2 | 74.9 |
| FractalDB | 1M | FDSL | 81.6 | 88.5 |
| OFDB | 1k | FDSL | 84.0 | 88.6 |
| 1p-frac | 1 | FDSL | 84.2 | 89.0 |
| ImageNet-1k | 1.28M | SL | 85.5 | - |
ViT-Base fine-tuned on ImageNet-1k: 1p-frac (1 image) achieves 82.1%, surpassing the 81.8% of ImageNet-21k pre-training.
Ablation Study
| Configuration | CIFAR-100 | Explanation |
|---|---|---|
| Δ=0.001 | 1.2 | Perturbation too small, pre-training collapses |
| Δ=0.01 | 19.9 | Positive effects begin to emerge |
| Δ=0.05 | 83.0 | Near-optimal performance |
| Δ=0.1 | 84.2 | Optimal perturbation magnitude |
| σ=3.5 (Most complex) | 84.2 | Optimal fractal complexity |
| σ=6.0 (Noise-like) | 81.3 | IFS structure still provides positive effect |
| Gaussian Noise | 1.1 | Complete failure, requires structured images |
| Uniform Noise | 2.0 | Similarly fails |
| L=16 sample points | 78.7 | Small sample size still yields positive effects |
| L=1000 sample points | 84.2 | More samples are better |
Key Findings
- "Scaling backwards" holds true: As the number of synthetic pre-training images shrinks from 1M → 1k → 1, CIFAR-100 fine-tuning accuracy unexpectedly improves from 81.6 → 84.0 → 84.2.
- Perturbation threshold exists: Pre-training collapses when \(\Delta < 0.01\), indicating that the distribution support requires a minimum scale.
- Structure > Randomness: Pre-training on Gaussian or uniform noise fails completely, which indicates that structured images, such as the recursive self-similar patterns of fractals, are required.
- Real images can also scale backwards: Applying the LPCE loss to real images that are gray-scaled, Canny-edged, and affine-transformed also yields a positive pre-training effect (C100: 82.2%, versus 84.2% for 1p-frac).
- Early layers benefit more: Linear probing experiments show that the representation quality of the first three layers of ViT trained with 1p-frac even exceeds that of ImageNet-1k pre-training.
- Extremely fast dataset construction: 1p-frac only takes 0.04 hours (~2 minutes), compared to the 19 hours of FractalDB.
Highlights & Insights
- Surprising Finding: Pre-training with a single synthetic image can match or even surpass pre-training on millions of real images, which suggests that pre-training behaves more like a "better weight initialization" than "visual concept learning."
- Elegant Mathematical Framework: The LIEP distribution and LPCE loss provide a continuously controllable experimental tool to precisely investigate the minimal requirements of pre-training.
- Profound Counter-intuitive Conclusion: Minute shape differences indistinguishable to the human eye are crucial for model pre-training—suggesting a fundamental discrepancy between the "concepts" learned by networks and human-perceived concepts.
- Practical Value: Compresses dataset construction time from hours to 2 minutes, completely bypassing privacy and copyright issues associated with real-world data.
Limitations & Future Work
- The study only verifies the ViT architecture, leaving it unexplored whether CNNs (like ResNet) exhibit a similar "scaling backwards" effect.
- The current optimal \(\sigma\) and \(\Delta\) are determined empirically via grid search, lacking theoretical guidance.
- Although full fine-tuning performance is comparable to ImageNet, a gap remains in the deeper layers under linear probing, which suggests that learning deep semantic representations may still benefit from real-world data.
- Self-supervised pre-training (e.g., MAE) has not been explored on extremely sparse synthetic images.
- A few tasks on VTAB (such as CLEVR-Count) still lag behind ImageNet supervised pre-training.
Related Work & Insights
- Asano et al. (2020): Pioneers of single-image self-supervised learning, but only effective for shallow layers and without using modern architectures.
- OFDB (Nakamura et al.): Compressed FractalDB to 1,000 images; this work further compresses it to a single image.
- Visual Atoms: Another FDSL dataset that generates images using parametric wave functions; while superior to FractalDB at 1 million images, it underperforms compared to 1p-frac with only 1 image.
- Insight: Pre-training may not require "learning the visual structure of the world" at all; instead, it optimizes the geometric configuration of network weights via classification signals. This insight has profound implications for understanding the inner workings of foundation models.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Disruptive conclusion, pushing minimal pre-training to the limit.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensively verified across five dimensions: exploration, hyperparameters, scaling, analysis, and applications.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear logic, elegant mathematical formulations, and progressively structured experimental design.
- Value: ⭐⭐⭐⭐⭐ Deep insight into the essence of pre-training with direct practical value.