Scaling Backwards: Minimal Synthetic Pre-training?¶

Conference: ECCV 2024
arXiv: 2408.00677
Code: GitHub
Area: LLM Pre-training
Keywords: synthetic pre-training, fractal images, minimal datasets, visual representation learning, ViT

TL;DR¶

Proposes 1p-frac, a pre-training dataset built from minute perturbations of a single fractal image, which reaches pre-training performance comparable to ImageNet-1k. The result challenges the conventional wisdom that "pre-training requires large-scale datasets" and suggests that the essence of pre-training may be closer to weight initialization than to visual concept learning.

Background & Motivation¶

Background: Pre-training is a fundamental technology for current vision systems. Mainstream methods use large-scale real image datasets (1.28 million images in ImageNet-1k, 14 million images in ImageNet-21k) for supervised or self-supervised pre-training. Synthetic pre-training directions such as FractalDB (1 million fractal images) and OFDB (1,000 fractal images) have demonstrated that effective representations can be obtained without real images.

Limitations of Prior Work: The scale of foundation models continues to expand (from millions to billions of images), yet the essence of pre-training remains unclear—does it discover general visual concepts, or merely provide better weight initialization? Furthermore, large-scale real datasets suffer from privacy, copyright, and fairness issues.

Key Challenge: OFDB has reduced fractal images to 1,000, but further reducing the number of classes leads to performance degradation. The key question is: "How small can the minimal effective pre-training dataset actually be?"

Goal: To find the minimal purely synthetic pre-training dataset and investigate the minimum requirements for successful pre-training.

Key Insight: Instead of adding more images, "scaling backwards" is proposed—constructing "classes" using subtle parameter perturbations of a single fractal image, and training the model to distinguish these perturbations which are indistinguishable to the human eye.

Core Idea: The key to pre-training lies not in the volume of data, but in the structured diversity during the data generation process—recursive self-similar structures (fractals) paired with tiny affine transformation perturbations can provide sufficient pre-training signals.

Method¶

Overall Architecture¶

1p-frac consists of three core components:
- A single fractal image, defined by an iterated function system (IFS)
- A locally integrated empirical (LIEP) distribution, used to generate perturbed images
- A locally perturbed cross-entropy (LPCE) loss, used for pre-training

Pre-training workflow: Starting from a single IFS $\Omega$, a minute perturbation $\boldsymbol{\epsilon}$ is applied to its affine transformation parameters. Doing this for $L$ different perturbations yields $L$ perturbed images, each treated as its own "class", and the ViT is trained to classify which perturbation produced each image.

Key Designs¶

Locally Integrated Empirical (LIEP) Distribution: The key mathematical tool.
- For a single fractal image $I$, the empirical distribution degenerates to $p_{\text{data}}(x,y) = \delta(x-I)\delta(y)$. With only one image and one label, direct cross-entropy training admits a trivial solution: the model can reach zero loss by always predicting that single label.
- The LIEP distribution introduces a perturbation parameter $\Delta$, integrating within the range $\boldsymbol{\epsilon} \in \mathcal{R}_\Delta = [-\Delta/2, \Delta/2]^{6j}$: $$p_\Delta(x,y) = \frac{1}{|\mathcal{R}_\Delta|}\int_{\mathcal{R}_\Delta}\delta(x - I_{\boldsymbol{\epsilon}})\delta(y - \boldsymbol{\epsilon})d\boldsymbol{\epsilon}$$
- As $\Delta \to 0$, the LIEP distribution converges to the original single-image empirical distribution.
- Design Motivation: To provide a continuously controllable way to shrink or expand the support of the data distribution, thereby precisely studying the minimal distribution range required for pre-training.
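
The integration range $\mathcal{R}_\Delta$ can be illustrated with a minimal Python sketch that draws perturbation vectors uniformly from the hypercube. The function name `sample_perturbations`, its parameter names, and the choice of two affine maps are assumptions made for illustration, not the authors' code. Each sampled vector is treated as one class, so the class index stands in for the continuous label $\boldsymbol{\epsilon}$.

```python
import random

def sample_perturbations(num_classes, num_transforms, delta, seed=0):
    """Draw perturbation vectors uniformly from [-delta/2, delta/2]^(6 * num_transforms)."""
    rng = random.Random(seed)
    dim = 6 * num_transforms  # six affine parameters (a, b, e, c, d, f) per map
    return [
        [rng.uniform(-delta / 2, delta / 2) for _ in range(dim)]
        for _ in range(num_classes)
    ]

# Hypothetical setting: 1000 classes, an IFS with 2 affine maps, Delta = 0.1.
perturbations = sample_perturbations(num_classes=1000, num_transforms=2, delta=0.1)
labels = list(range(len(perturbations)))  # class i corresponds to perturbation i
```

Setting `delta` to 0 makes every sampled vector zero, which mirrors the limit $\Delta \to 0$ in which all "classes" collapse onto the original image.
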
Locally Perturbed Cross-Entropy (LPCE) Loss: The pre-training objective function.
- $\mathcal{L}_\Delta = -\mathbb{E}_{x,y \sim p_\Delta}[\log p_\theta(y|x)]$
- In practice, the objective is approximated via numerical integration by uniformly sampling $L=1000$ perturbation points.
- Perturbation is applied to the affine transformation parameters of the IFS: $$w_j(\boldsymbol{v}; \boldsymbol{\epsilon}_j) = \left(\begin{bmatrix}a_j & b_j & e_j \\ c_j & d_j & f_j\end{bmatrix} + \boldsymbol{\epsilon}_j\right)\begin{bmatrix}\boldsymbol{v} \\ 1\end{bmatrix}$$
- Design Motivation: To enable the model to learn to distinguish minute shape differences that are indistinguishable to the human eye, forcing the network to focus on structural patterns rather than surface features.
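
The perturbed affine maps $w_j(\boldsymbol{v}; \boldsymbol{\epsilon}_j)$ above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the Sierpinski-style `BASE_IFS`, the helper names `perturb_ifs` and `render_points`, and the use of the chaos game to generate points are all hypothetical choices, not the fractal or the renderer used in the paper.

```python
import random

# Each map is (a, b, e, c, d, f), read row by row from the 2x3 affine matrix.
# This Sierpinski-style IFS is only an illustrative choice.
BASE_IFS = [
    (0.5, 0.0, 0.0, 0.0, 0.5, 0.0),
    (0.5, 0.0, 0.5, 0.0, 0.5, 0.0),
    (0.5, 0.0, 0.25, 0.0, 0.5, 0.5),
]

def perturb_ifs(ifs, eps):
    """Add one perturbation value to each of the 6 parameters of every map."""
    flat = [p for w in ifs for p in w]
    if len(eps) != len(flat):
        raise ValueError("eps needs 6 values per affine map")
    shifted = [p + e for p, e in zip(flat, eps)]
    return [tuple(shifted[i:i + 6]) for i in range(0, len(shifted), 6)]

def render_points(ifs, num_points=2000, burn_in=20, seed=0):
    """Chaos game: repeatedly apply a randomly chosen map w_j to the current point."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    points = []
    for step in range(num_points + burn_in):
        a, b, e, c, d, f = rng.choice(ifs)
        x, y = a * x + b * y + e, c * x + d * y + f
        if step >= burn_in:  # skip the first few iterates
            points.append((x, y))
    return points

rng = random.Random(1)
delta = 0.1
eps = [rng.uniform(-delta / 2, delta / 2) for _ in range(6 * len(BASE_IFS))]
original = render_points(BASE_IFS)
perturbed = render_points(perturb_ifs(BASE_IFS, eps))
```

Rasterizing the point sets into images is omitted here. With the same random seed, `original` and `perturbed` differ only through the shifted affine parameters, which is the kind of subtle shape difference the classifier is asked to separate.
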
σ-factor for Controlling Fractal Complexity: Anderson's $\sigma$-factor is used to quantify the complexity of the IFS.
- The smaller the $\sigma$, the more complex the fractal (resembling the recursive structures of natural objects).
- An overly large $\sigma$ (e.g., 6.0) causes the fractal to degenerate into something resembling Gaussian noise, but it still has a positive pre-training effect.
- The optimal value is $\sigma = 3.5$.
- Design Motivation: To explore what kind of image structures are most critical for pre-training—concluding that recursive self-similar structures are more important than mere complexity.

Loss & Training¶

Pre-training uses the LPCE loss, with hyperparameters following DeiT standard settings.
Data augmentation adopts DeiT's settings (RandomCrop, RandAug, Mixup, CutMix, etc.).
Ablation studies find that RandomCrop and Mixup/CutMix have the greatest impact on pre-training performance.
Exploration studies use ViT-Tiny, and scaling studies use ViT-Base.
Fine-tuning datasets include CIFAR-10/100, ImageNet-100/1k, Cars, Flowers, etc.
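
As a rough sketch of the objective, assume each of the $L$ sampled perturbations is treated as a class index $y_i$. The LPCE loss then reduces to an ordinary $L$-way cross-entropy, $\mathcal{L}_\Delta \approx -\frac{1}{L}\sum_{i=1}^{L}\log p_\theta(y_i \mid I_{\boldsymbol{\epsilon}_i})$. The function `lpce_loss` below is a hypothetical stdlib-only illustration of that sum operating on precomputed logits; it is not the authors' training code and leaves out the model, the optimizer, and data augmentation.

```python
import math

def lpce_loss(logits_per_image, labels):
    """Mean cross-entropy over perturbed images, one class per perturbation."""
    total = 0.0
    for logits, label in zip(logits_per_image, labels):
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[label]  # equals -log softmax(logits)[label]
    return total / len(labels)

# An untrained model that outputs uniform logits over 4 classes scores log(4).
uniform_loss = lpce_loss([[0.0] * 4 for _ in range(4)], [0, 1, 2, 3])
```
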

Key Experimental Results¶

Main Results¶

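Comparison with pre-training datasets of different scales (ViT-Tiny; fine-tuning accuracy in % on CIFAR-100 and ImageNet-100). FDSL denotes formula-driven supervised learning and SL denotes supervised learning on real images: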

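| Dataset | No. of Images | Type | CIFAR-100 (%) | ImageNet-100 (%) |
| --- | --- | --- | --- | --- |
| Scratch | - | - | 64.2 | 74.9 |
| FractalDB | 1M | FDSL | 81.6 | 88.5 |
| OFDB | 1k | FDSL | 84.0 | 88.6 |
| 1p-frac | 1 | FDSL | 84.2 | 89.0 |
| ImageNet-1k | 1.28M | SL | 85.5 | - |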

ViT-Base fine-tuned on ImageNet-1k: 1p-frac (1 image) achieves 82.1%, surpassing the 81.8% of ImageNet-21k pre-training.

Ablation Study¶

| Configuration | CIFAR-100 (%) | Explanation |
| --- | --- | --- |
| Δ=0.001 | 1.2 | Perturbation too small, pre-training collapses |
| Δ=0.01 | 19.9 | Partial recovery, still far below the from-scratch baseline (64.2) |
| Δ=0.05 | 83.0 | Near-optimal performance |
| Δ=0.1 | 84.2 | Optimal perturbation magnitude |
| σ=3.5 (most complex) | 84.2 | Optimal fractal complexity |
| σ=6.0 (noise-like) | 81.3 | IFS structure still provides a positive effect |
| Gaussian noise | 1.1 | Complete failure; structured images are required |
| Uniform noise | 2.0 | Also fails |
| L=16 sample points | 78.7 | A small number of samples still yields a positive effect |
| L=1000 sample points | 84.2 | More samples perform better |

Key Findings¶

"Scaling backwards" holds true: As the synthetic pre-training images scale from 1M → 1k → 1, performance unexpectedly improves from 81.6 → 84.0 → 84.2.
Perturbation threshold exists: CIFAR-100 accuracy collapses to 1.2 at $\Delta = 0.001$ and reaches only 19.9 at $\Delta = 0.01$, indicating that the support of the distribution must exceed a minimum size.
Structure > Randomness: Pre-training on Gaussian or uniform noise fails completely (1.1 and 2.0), indicating that image structure, such as the recursive self-similarity of fractals, is crucial.
Real images can also scale backwards: Applying the LPCE loss to real images that are gray-scaled, Canny-edged, and affine-transformed, a setup that closely mirrors the 1p-frac configuration, also yields a positive pre-training effect (CIFAR-100: 82.2%).
Early layers benefit more: Linear probing experiments show that the representation quality of the first three layers of ViT trained with 1p-frac even exceeds that of ImageNet-1k pre-training.
Extremely fast dataset construction: 1p-frac only takes 0.04 hours (~2 minutes), compared to the 19 hours of FractalDB.
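
Taking the rounded figures of 0.04 hours and 19 hours at face value, the construction times work out roughly as follows:

$$0.04\ \text{h} \times 60\ \tfrac{\text{min}}{\text{h}} = 2.4\ \text{min}, \qquad \frac{19\ \text{h}}{0.04\ \text{h}} = 475$$

So, under these rounded numbers, building 1p-frac is several hundred times faster than building FractalDB.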

Highlights & Insights¶

Subversive Discovery: Pre-training with a single synthetic image can match, and in some settings surpass, pre-training on millions of real images. This suggests that the essence of pre-training is closer to a "better weight initialization" than to "visual concept learning."
Elegant Mathematical Framework: The LIEP distribution and LPCE loss provide a continuously controllable experimental tool to precisely investigate the minimal requirements of pre-training.
Profound Counter-intuitive Conclusion: Minute shape differences indistinguishable to the human eye are crucial for model pre-training—suggesting a fundamental discrepancy between the "concepts" learned by networks and human-perceived concepts.
Practical Value: Reduces dataset construction time from 19 hours (FractalDB) to about 2 minutes, and avoids the privacy and copyright issues associated with real-world data.

Limitations & Future Work¶

The study only verifies the ViT architecture, leaving it unexplored whether CNNs (like ResNet) exhibit a similar "scaling backwards" effect.
The current optimal $\sigma$ and $\Delta$ are determined empirically via grid search, lacking theoretical guidance.
Although full fine-tuning performance is comparable to ImageNet pre-training, a gap remains in the deeper layers under linear probing, suggesting that learning deep semantic representations may still benefit from real-world data.
Self-supervised pre-training (e.g., MAE) has not been explored on extremely sparse synthetic images.
A few tasks on VTAB (such as CLEVR-Count) still lag behind ImageNet supervised pre-training.

Related Work & Insights¶

Asano et al. (2020): Pioneered single-image self-supervised learning, but the approach was effective only for shallow layers and did not use modern architectures.
OFDB (Nakamura et al.): Compressed FractalDB to 1,000 images; this work further compresses it to a single image.
Visual Atoms: Another FDSL dataset, which generates images from parametric wave functions. It outperforms FractalDB at the 1-million-image scale but still underperforms 1p-frac, which uses only a single image.
Insight: Pre-training may not require "learning the visual structure of the world" at all; instead, it optimizes the geometric configuration of network weights via classification signals. This insight has profound implications for understanding the inner workings of foundation models.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Disruptive conclusion, pushing minimal pre-training to the limit.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensively verified across five dimensions: exploration, hyperparameters, scaling, analysis, and applications.
Writing Quality: ⭐⭐⭐⭐⭐ Clear logic, elegant mathematical formulations, and progressively structured experimental design.
Value: ⭐⭐⭐⭐⭐ Deep insight into the essence of pre-training with direct practical value.