Formula-Supervised Visual-Geometric Pre-training (FSVGP)
Conference: ECCV 2024
arXiv: 2409.13535
Code: https://ryosuke-yamada.github.io/fdsl-fsvgp/ (Project Page)
Area: LLM Pre-training
Keywords: Visual-geometric representation learning, Synthetic data pre-training, Fractal geometry, Unified Transformer, Formula-driven supervised learning
TL;DR
Proposes FSVGP, which automatically generates aligned synthetic images and point clouds using mathematical formulas of fractal geometry. Through formula-supervised consistency labels, it achieves cross-modal visual-geometric pre-training on a unified Transformer, outperforming single-modal FDSL methods across six tasks in image and 3D object classification, detection, and segmentation.
Background & Motivation
Background: Fusing image (visual) and point cloud (geometric) information is crucial for 3D scene understanding in vision models. However, current work on visual-geometric representation learning mostly improves either image recognition or 3D object recognition in isolation; a unified model that benefits both modalities at once is still lacking.
Limitations of Prior Work: Large-scale paired image-point cloud datasets are extremely scarce, and collecting high-quality 3D data is expensive. Aligning images with point clouds requires substantial preprocessing, and annotation typically requires experts to label complex 3D spatial information by hand. Real-world datasets also raise copyright and ethical-bias concerns.
Key Challenge: Existing FDSL (Formula-Driven Supervised Learning) methods, such as VisualAtom and PC-FractalDB, operate on a single modality (either images or point clouds) and cannot perform cross-modal representation learning.
Goal: To achieve unified pre-training for both visual and geometric modalities using synthetic data without relying on real-world data, enabling a single model to simultaneously improve performance on various downstream tasks for both images and 3D objects.
Key Insight: Leverage mathematical formulas of fractal geometry to simultaneously generate images and point clouds, and achieve supervised pre-training using cross-modal aligned labels naturally provided by these formulas.
Core Idea: Mathematical formulas can generate multi-modal data and provide cross-modal consistency labels for free, which makes formula-driven data generation more cost-effective than manual labeling and alignment.
Method
Overall Architecture
FSVGP consists of two core components: (1) VG-FractalDB dataset construction: 3D Iterated Function Systems (3D-IFS) generate fractal point clouds, these are projected onto 2D planes to obtain fractal images, and the formula parameters supply cross-modally consistent labels; (2) Unified Transformer pre-training: with minimal modifications to ViT and PointT, fractal images and fractal point clouds are both fed to a single model, which is pre-trained as a classifier with cross-entropy loss.
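The dataset-construction step can be illustrated with a minimal, standard-library-only sketch. It is not the authors' generator: the parameter ranges, the uniform choice among affine maps, the burn-in length, and the orthographic projection (standing in for the virtual camera) are all assumptions made for illustration, and the function names are hypothetical. What it does show is the key property: one set of formula parameters yields a point cloud, an image, and a label shared by both.

```python
import random

def sample_ifs(num_maps, rng):
    """Draw hypothetical 3D-IFS parameters: a list of (r, b) affine maps.

    Entries of r lie in [-0.3, 0.3], so the Frobenius norm is below 1 and
    every map is contractive (an assumption, not the paper's sampling rule).
    """
    maps = []
    for _ in range(num_maps):
        r = [[rng.uniform(-0.3, 0.3) for _ in range(3)] for _ in range(3)]
        b = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        maps.append((r, b))
    return maps

def apply_affine(r, b, x):
    """Affine map t(x) = r x + b for a 3x3 matrix r and 3-vectors b, x."""
    return [sum(r[row][k] * x[k] for k in range(3)) + b[row] for row in range(3)]

def generate_point_cloud(maps, num_points, rng, burn_in=20):
    """Chaos-game iteration: apply a randomly chosen map at every step."""
    x = [0.0, 0.0, 0.0]
    points = []
    for step in range(num_points + burn_in):
        r, b = rng.choice(maps)  # uniform choice is an assumption
        x = apply_affine(r, b, x)
        if step >= burn_in:      # skip the first steps before x settles
            points.append(x)
    return points

def project_to_image(points, size=32):
    """Orthographic projection along z into a size x size binary image."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_span = min(xs), max(xs) - min(xs) + 1e-9
    y_min, y_span = min(ys), max(ys) - min(ys) + 1e-9
    image = [[0] * size for _ in range(size)]
    for px, py in zip(xs, ys):
        col = int((px - x_min) / x_span * (size - 1))
        row = int((py - y_min) / y_span * (size - 1))
        image[row][col] = 1
    return image

def make_sample(category, num_points=8192, num_maps=4):
    """Return (point cloud, image, label); the category id seeds the formula."""
    rng = random.Random(category)
    maps = sample_ifs(num_maps, rng)
    points = generate_point_cloud(maps, num_points, rng)
    image = project_to_image(points)
    return points, image, category

points, image, label = make_sample(category=3)
print(len(points), len(image), len(image[0]), label)
```

Because the label is just the identifier of the formula that produced both outputs, no pixel-to-point matching or manual annotation is involved at any step.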
Key Designs
- VG-FractalDB (Visual-Geometric Fractal Database): The dataset is defined as \(\mathcal{D} = \{(X_j, I_j, y_j)\}_{j=1}^{N}\), where \(X_j\) is a fractal point cloud, \(I_j\) is a fractal image, and \(y_j\) is the formula-supervised consistency label. The fractal point cloud is generated by a 3D-IFS, which iteratively produces \(T=8192\) 3D coordinate points via the affine transformation \(t_i(\mathbf{x}) = \mathbf{r}_i \mathbf{x} + \mathbf{b}_i\). The fractal image is obtained through virtual camera projection \(I_j = \mathcal{F}_{\text{RGB}}(X_j; \mathbf{c})\). Since the image and point cloud share the same 3D-IFS parameter \(\Theta^c\), the cross-modal labels are naturally aligned without extra annotation costs. Design Motivation: To utilize the self-similarity of fractal geometry to generate sufficiently complex visual-geometric structures while avoiding copyright and privacy issues of real data.
- Formula-Supervised Consistency Label: Each fractal category \(c\) is defined by its 3D-IFS parameters \(\Theta^c\), so the 3D point cloud and its 2D projected image share the same label \(y_j \in \{1, 2, \cdots, C\}\). Invalid categories are filtered out with a variance threshold (0.05) applied along each coordinate axis, and instances are diversified with the FractalNoiseMix technique (mixing in 20% random noise points). Design Motivation: Mathematical formulas naturally define cross-modal correspondences, bypassing the expensive pixel-to-point correspondence preprocessing required by traditional methods.
- Unified Transformer Model: Only the input-processing parts of ViT and PointT are modified. The fractal image is embedded as image tokens \(\mathbf{z}_i = [x_{\text{class}}, \mathbf{z}_i^1, \dots, \mathbf{z}_i^{M_i}]\), and the fractal point cloud is embedded as point cloud tokens \(\mathbf{z}_p = [x_{\text{class}}, \mathbf{z}_p^1, \dots, \mathbf{z}_p^{M_p}]\). The two modalities share the class token \(x_{\text{class}}\) and the MLP layer for classification. Design Motivation: To keep the model structure as simple as possible, avoiding the design of complex cross-modal modules, and ensuring the pre-trained model's applicability to various downstream tasks.
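The category filter and the noise-mixing augmentation from the consistency-label design can be sketched as follows. This is only an illustration under stated assumptions: a category is treated as valid when the point variance exceeds the threshold on every axis, and the noise points are drawn uniformly inside the bounding box of the point cloud. The summary above does not specify either detail, and the function names are hypothetical.

```python
import random

def axis_variances(points):
    """Population variance of the point coordinates along x, y, and z."""
    n = len(points)
    variances = []
    for axis in range(3):
        values = [p[axis] for p in points]
        mean = sum(values) / n
        variances.append(sum((v - mean) ** 2 for v in values) / n)
    return variances

def is_valid_category(points, threshold=0.05):
    """Assumed rule: keep a category only if it spreads out along every axis."""
    return all(v > threshold for v in axis_variances(points))

def noise_mix(points, noise_ratio=0.2, rng=None):
    """Replace a fraction of the points with uniform noise in the bounding box.

    The noise source is an assumption; only the 20% ratio comes from the text.
    """
    rng = rng or random.Random(0)
    n_noise = round(len(points) * noise_ratio)
    lows = [min(p[a] for p in points) for a in range(3)]
    highs = [max(p[a] for p in points) for a in range(3)]
    kept = rng.sample(points, len(points) - n_noise)
    noise = [[rng.uniform(lows[a], highs[a]) for a in range(3)]
             for _ in range(n_noise)]
    return kept + noise

# A flat shape (no spread in y or z) is rejected; a cube-like cloud is kept.
rng = random.Random(1)
flat = [[i / 10, 0.0, 0.0] for i in range(100)]
cube = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(1000)]
print(is_valid_category(flat), is_valid_category(cube))
mixed = noise_mix(cube)
print(len(mixed))
```

The label of a mixed instance stays that of its source category, so the augmentation adds instance diversity without touching the formula-derived supervision.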
Loss & Training
- Loss Function: Cross-entropy loss \(\mathcal{L}_{\text{ce}}(f(\mathcal{D})) = -\frac{1}{N} \sum_{j=1}^{N} \sum_{c=1}^{C} y_{j,c} \log \hat{y}_{j,c}\), where \(y_{j,c}\) is the one-hot encoding of the label \(y_j\) and \(\hat{y}_j = f(X_j, I_j)\) is the joint output of the unified model for both modalities.
- Training Setup: VG-FractalDB-1k (1000 categories, 1000 instances per category), AdamW optimizer, batch size of 64 per GPU, initial learning rate of 5e-4, weight decay of 5e-2, and 200 training epochs. Pre-training takes approximately 60 hours on 16 NVIDIA V100 GPUs.
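As a worked illustration of the loss above: with one-hot labels, the inner sum over classes keeps only the term for the correct class, so the loss is the mean of \(-\log \hat{y}_{j,y_j}\) over the batch. The sketch below assumes the model outputs raw logits that are turned into probabilities by a softmax, and it uses 0-based class indices rather than the 1-based labels in the text; the function names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Mean of -log p(correct class); labels are 0-based class indices."""
    losses = []
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)

# Uniform logits over C = 4 classes give a loss of log(4), about 1.3863.
print(round(cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]), 4))

# A confident correct prediction costs less than a confident wrong one.
right = cross_entropy([[5.0, 0.0, 0.0]], [0])
wrong = cross_entropy([[5.0, 0.0, 0.0]], [1])
print(right < wrong)
```

An untrained classifier on the 1000 categories of VG-FractalDB-1k would therefore start near \(\log 1000 \approx 6.9\).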
Key Experimental Results
Main Results: Comparison of FDSL Methods (6 Tasks)
| Pre-training Dataset | Image Classification (Acc.) | Image Detection (AP50) | Image Segmentation (AP50) | 3D Classification (Acc.) | 3D Detection (mAP25) | 3D Segmentation (mIoU) |
|---|---|---|---|---|---|---|
| VisualAtom-21k | 91.3 | 66.3 | 63.3 | ✗ | ✗ | ✗ |
| PC-FractalDB-1k | ✗ | ✗ | ✗ | 83.3 | 63.0 | 83.7 |
| VG-FractalDB-1k | 92.0 | 68.3 | 65.6 | 83.7 | 63.7 | 84.1 |
Detailed Comparison of Image Classification
| Method | Type | C10 | C100 | Cars | Flowers | VOC12 | P30 | IN100 | Average |
|---|---|---|---|---|---|---|---|---|---|
| ImageNet-1k | SL | 99.0 | 89.6 | 81.9 | 99.1 | 86.5 | 82.1 | 93.1 | 90.2 |
| ImageNet-1k (MAE) | SSL | 99.1 | 90.1 | 91.3 | 99.8 | 90.2 | 82.8 | 94.1 | 92.5 |
| VisualAtom-21k | FDSL | 97.7 | 86.7 | 89.2 | 99.0 | 82.4 | 81.6 | 91.3 | 89.7 |
| VG-FractalDB-1k | FDSL | 98.1 | 85.9 | 89.2 | 99.5 | 83.5 | 81.7 | 92.0 | 90.0 |
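The Average column can be sanity-checked from the per-dataset accuracies. The short script below assumes the average is the unweighted mean over the seven datasets, rounded to one decimal place; under that assumption it reproduces all four reported averages.

```python
# Per-dataset accuracies copied from the table:
# C10, C100, Cars, Flowers, VOC12, P30, IN100.
scores = {
    "ImageNet-1k": [99.0, 89.6, 81.9, 99.1, 86.5, 82.1, 93.1],
    "ImageNet-1k (MAE)": [99.1, 90.1, 91.3, 99.8, 90.2, 82.8, 94.1],
    "VisualAtom-21k": [97.7, 86.7, 89.2, 99.0, 82.4, 81.6, 91.3],
    "VG-FractalDB-1k": [98.1, 85.9, 89.2, 99.5, 83.5, 81.7, 92.0],
}

def mean_accuracy(values):
    """Unweighted mean rounded to one decimal place."""
    return round(sum(values) / len(values), 1)

averages = {name: mean_accuracy(vals) for name, vals in scores.items()}
for name, avg in averages.items():
    print(f"{name}: {avg}")

# Gap between the two FDSL methods, in percentage points.
gap = round(averages["VG-FractalDB-1k"] - averages["VisualAtom-21k"], 1)
print(gap)
```

The FDSL gap is small (0.3 points) and is not uniform across datasets: VG-FractalDB-1k trails VisualAtom-21k on C100 and ties on Cars.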
Ablation Study
| Configuration | IN100 (Acc.) | M40 (Acc.) | Description |
|---|---|---|---|
| ShapeNet (50k) | 87.3 | 92.7 | Using CAD models |
| VG-FractalDB (50k) | 87.9 | 92.8 | Fractal data is superior |
| VG-PN-1k (Perlin noise) | 90.7 | 92.6 | Replacing generation rules |
| VG-FractalDB-1k (Fractal) | 92.0 | 92.9 | Fractal geometry is optimal |
| VG-FractalDB-1k (MAE Self-Supervised) | 80.3 | 92.8 | SSL is inferior to FDSL |
| VG-FractalDB-1k (FSVGP) | 92.0 | 92.9 | Formula supervision is more effective |
Key Findings
- FSVGP exceeds VisualAtom-21k by 0.3 percentage points in average image classification accuracy (90.0 vs. 89.7) while using only 1/21 of the data volume (1M vs. 21M).
- VG-FractalDB-21k achieves 83.8% ImageNet-1k accuracy at 384 resolution, closing in on JFT-300M (84.2%) while using only 1/14 of the data.
- Dual-modal visual-geometric (V+G) pre-training outperforms single-modal pre-training on downstream tasks for both modalities.
- FSVGP outperforms SSL methods such as MaskPoint on 3D detection mAP25 (63.7 vs. 63.4).
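The data-volume ratios quoted in these findings follow from simple arithmetic. In the sketch below, the size of VG-FractalDB-21k (21k categories with 1,000 instances each) and the nominal 300M size of JFT-300M are assumptions inferred from the dataset names; only the 1M and 21M figures are stated above.

```python
def dataset_size(num_categories, instances_per_category=1000):
    """Number of samples when every category has the same instance count."""
    return num_categories * instances_per_category

vg_fractaldb_1k = dataset_size(1_000)    # 1,000 x 1,000, as in the training setup
visualatom_21k = 21_000_000              # "21M" as quoted above
vg_fractaldb_21k = dataset_size(21_000)  # assumption: 21k categories x 1,000 instances
jft_300m = 300_000_000                   # nominal size implied by the dataset name

ratio_vs_visualatom = visualatom_21k / vg_fractaldb_1k
ratio_vs_jft = round(jft_300m / vg_fractaldb_21k, 1)
print(ratio_vs_visualatom)  # 21.0
print(ratio_vs_jft)         # 14.3
```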
Highlights & Insights
- Ultra-simple yet Effective Cross-modal Design: No complex cross-modal attention or contrastive learning modules are required; cross-modal learning is achieved solely through a shared class token and classification MLP.
- Formula as Alignment: Mathematical formulas naturally provide cross-modal alignment signals, entirely bypassing the expensive manual annotation and paired data collection that traditional methods require.
- Privacy Advantages of Synthetic Data: The data is free from copyright issues, privacy leaks, and social biases, which matters under increasingly strict data compliance requirements.
- Unified Pre-training Paradigm: A single pre-training phase simultaneously serves 6 downstream tasks (2 modalities \(\times\) 3 tasks), demonstrating the general potential of visual-geometric representation learning.
Limitations & Future Work
- A performance gap remains relative to SSL methods pre-trained on ImageNet (such as MAE and DINO), and the domain gap between synthetic and real-world data still needs to be addressed.
- Linear probing performance is relatively weak, necessitating the design of more efficient fine-tuning strategies.
- Only a single virtual camera view (\(v=1\)) projection is utilized; multi-view projections could further enrich the diversity of the visual modality.
- The approach can be extended to more complex application scenarios, such as Bird's-Eye View (BEV) in autonomous driving and 3D shape retrieval.
Related Work & Insights
- VisualAtom / PC-FractalDB: Direct predecessors of this work, which conduct FDSL pre-training on image and point cloud single modalities, respectively.
- CrossPoint / Pri3D: Traditional visual-geometric representation learning methods that rely on real paired data and contrastive learning.
- Insight: Synthetic data generated by mathematical formulas could serve as a general paradigm for easing the bottleneck of multimodal alignment annotation. It applies to image-point cloud pairs and may generalize to other cross-modal scenarios.
Rating
- Novelty: ⭐⭐⭐⭐ The idea of simultaneously generating multi-modal data and labels using mathematical formulas is highly clever and unique.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on 13 datasets across 6 tasks. The ablation studies explore core problems such as generation rules and supervision types.
- Writing Quality: ⭐⭐⭐⭐ Clear logic, rich diagrams and tables, though some symbol definitions are slightly redundant.
- Value: ⭐⭐⭐⭐ In the era of data compliance, the synthetic pre-training path is highly promising, but the gap with real-data SSL limits its immediate practicality.