Formula-Supervised Visual-Geometric Pre-training (FSVGP)

Conference: ECCV 2024
arXiv: 2409.13535
Code: https://ryosuke-yamada.github.io/fdsl-fsvgp/ (Project Page)
Area: 3D Vision
Keywords: Visual-geometric representation learning, Synthetic data pre-training, Fractal geometry, Unified Transformer, Formula-driven supervised learning

TL;DR

Proposes FSVGP, which automatically generates aligned synthetic images and point clouds using mathematical formulas of fractal geometry. Through formula-supervised consistency labels, it achieves cross-modal visual-geometric pre-training on a unified Transformer, outperforming single-modal FDSL methods across six tasks in image and 3D object classification, detection, and segmentation.

Background & Motivation

Background: Fusing image (visual) and point cloud (geometric) information is crucial for 3D scene understanding in vision models. However, current research on visual-geometric representation learning mostly improves either image recognition or 3D object recognition in isolation, and lacks a unified model that improves both modalities at once.

Limitations of Prior Work: Large-scale paired image-point cloud datasets are scarce, and collecting high-quality 3D data is expensive. Aligning images with point clouds requires substantial preprocessing, and annotation typically requires experts to manually label complex 3D spatial information. Real-world datasets also raise copyright concerns and carry ethical biases.

Key Challenge: Existing FDSL (Formula-Driven Supervised Learning) methods, such as VisualAtom and PC-FractalDB, function only on a single modality (either images or point clouds) and cannot achieve cross-modal representation learning.

Goal: To achieve unified pre-training for both visual and geometric modalities using synthetic data without relying on real-world data, enabling a single model to simultaneously improve performance on various downstream tasks for both images and 3D objects.

Key Insight: Leverage mathematical formulas of fractal geometry to simultaneously generate images and point clouds, and achieve supervised pre-training using cross-modal aligned labels naturally provided by these formulas.

Core Idea: Mathematical formulas can generate multi-modal data and naturally provide cross-modal consistency labels, which makes formula-driven generation more cost-effective than manual labeling and alignment.

Method

Overall Architecture

FSVGP consists of two core components: (1) VG-FractalDB dataset construction: 3D Iterated Function Systems (3D-IFS) generate fractal point clouds, which are projected onto 2D planes to obtain fractal images, and the formula parameters supply cross-modally consistent labels; (2) Unified Transformer pre-training: with minimal modifications to ViT and PointT, fractal images and fractal point clouds are fed in together, and the model is pre-trained as a classifier with cross-entropy loss.
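The data-construction step can be illustrated with a minimal sketch, using only the Python standard library. It is not the authors' generator: the parameter ranges in `make_ifs`, the burn-in length, the orthographic projection along the z axis (the paper uses a virtual camera), and the 32x32 binary image are all assumptions chosen to keep the example small. What it does show is the key property: one set of formula parameters yields a point cloud, an image, and a shared label.

```python
import random

def make_ifs(num_maps, rng):
    """Sample hypothetical affine maps t_i(x) = r_i x + b_i.

    Entries of r_i lie in [-0.3, 0.3], so each row's absolute sum is at
    most 0.9 and every map is a contraction (points stay bounded).
    """
    maps = []
    for _ in range(num_maps):
        r = [[rng.uniform(-0.3, 0.3) for _ in range(3)] for _ in range(3)]
        b = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        maps.append((r, b))
    return maps

def apply_map(affine, x):
    """Apply one affine map to a 3D point."""
    r, b = affine
    return [sum(r[i][k] * x[k] for k in range(3)) + b[i] for i in range(3)]

def generate_point_cloud(maps, rng, num_points=8192, burn_in=20):
    """Chaos game: repeatedly apply a randomly chosen map to the current point."""
    x = [0.0, 0.0, 0.0]
    points = []
    for step in range(num_points + burn_in):
        x = apply_map(rng.choice(maps), x)
        if step >= burn_in:  # skip early points that may lie far from the attractor
            points.append(x)
    return points

def project_to_image(points, size=32):
    """Assumed orthographic projection along z onto a size x size binary grid."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    image = [[0] * size for _ in range(size)]
    for px, py, _ in points:
        u = int((px - x_min) / (x_max - x_min + 1e-12) * (size - 1))
        v = int((py - y_min) / (y_max - y_min + 1e-12) * (size - 1))
        image[v][u] = 1
    return image

def make_sample(category, num_points=8192):
    """Return (X_j, I_j, y_j): the category index seeds the IFS parameters."""
    rng = random.Random(category)
    maps = make_ifs(rng.randint(2, 8), rng)  # number of maps is a hypothetical choice
    cloud = generate_point_cloud(maps, rng, num_points)
    return cloud, project_to_image(cloud), category
```

Calling `make_sample(3)` returns a point cloud and an image derived from the same parameters, both labeled 3, with no separate alignment or annotation step.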

Key Designs

VG-FractalDB (Visual-Geometric Fractal Database): The dataset is defined as \(\mathcal{D} = \{(X_j, I_j, y_j)\}_{j=1}^{N}\), where \(X_j\) is a fractal point cloud, \(I_j\) is a fractal image, and \(y_j\) is the formula-supervised consistency label. The fractal point cloud is generated by a 3D-IFS, which iteratively produces \(T=8192\) 3D coordinate points via affine transformations \(t_i(\mathbf{x}) = \mathbf{r}_i \mathbf{x} + \mathbf{b}_i\), with \(\mathbf{r}_i\) a \(3 \times 3\) matrix and \(\mathbf{b}_i\) a translation vector. The fractal image is obtained through virtual camera projection \(I_j = \mathcal{F}_{\text{RGB}}(X_j; \mathbf{c})\). Since the image and point cloud share the same 3D-IFS parameters \(\Theta^c\), the cross-modal labels are naturally aligned without extra annotation costs. Design Motivation: To utilize the self-similarity of fractal geometry to generate sufficiently complex visual-geometric structures while avoiding the copyright and privacy issues of real data.
Formula-Supervised Consistency Label: Each fractal category \(c\) is defined by 3D-IFS parameters \(\Theta^c\), and the 3D point cloud and its 2D projected image share the same label \(y_j \in \{1, 2, \cdots, C\}\). Invalid categories are filtered out by applying a variance threshold (0.05) along each coordinate axis, and instances are diversified using the FractalNoiseMix technique (mixing in 20% random noise points). Design Motivation: Mathematical formulas naturally define cross-modal correspondences, bypassing the expensive pixel-to-point correspondence preprocessing required in traditional methods.
Unified Transformer Model: Only the input processing parts of ViT and PointT are modified. The fractal image is embedded as image tokens \(\mathbf{z}_i = [x_{\text{class}}, \mathbf{z}_i^1, \dots, \mathbf{z}_i^{M_i}]\), and the fractal point cloud is embedded as point cloud tokens \(\mathbf{z}_p = [x_{\text{class}}, \mathbf{z}_p^1, \dots, \mathbf{z}_p^{M_p}]\). The two modalities share the class token \(x_{\text{class}}\) and the MLP layer for classification. Design Motivation: To keep the model structure as simple as possible, avoiding the design of complex cross-modal modules, and ensuring the pre-trained model's applicability to various downstream tasks.
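The category filter and the instance augmentation described above can be sketched as follows. Two interpretations here are assumptions rather than details confirmed by this summary: a category counts as valid only if the point variance exceeds the threshold on every axis, and FractalNoiseMix is approximated by replacing 20% of the points with noise drawn uniformly from the cloud's bounding box. The function names are hypothetical.

```python
import random

def axis_variances(points):
    """Population variance of the point cloud along x, y, and z."""
    n = len(points)
    variances = []
    for axis in range(3):
        values = [p[axis] for p in points]
        mean = sum(values) / n
        variances.append(sum((v - mean) ** 2 for v in values) / n)
    return variances

def is_valid_category(points, threshold=0.05):
    # Assumption: a category is kept only if it spreads out on all three axes.
    return all(v > threshold for v in axis_variances(points))

def fractal_noise_mix(points, noise_ratio=0.2, rng=None):
    """Assumed variant: swap a fraction of points for uniform bounding-box noise."""
    rng = rng or random.Random(0)
    n = len(points)
    n_noise = int(round(n * noise_ratio))
    lo = [min(p[a] for p in points) for a in range(3)]
    hi = [max(p[a] for p in points) for a in range(3)]
    kept = rng.sample(points, n - n_noise)  # keep the remaining fractal points
    noise = [[rng.uniform(lo[a], hi[a]) for a in range(3)] for _ in range(n_noise)]
    return kept + noise
```

Under these assumptions, a degenerate fractal that collapses onto a plane is rejected, while the augmented instance keeps the original point count so it can still be batched with other samples.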

Loss & Training

Loss Function: Cross-entropy loss \(\mathcal{L}_{\text{ce}}(f(\mathcal{D})) = -\frac{1}{N} \sum_{j=1}^{N} \sum_{c=1}^{C} y_{j,c} \log \hat{y}_{j,c}\), where \(\hat{y}_j = f(X_j, I_j)\) is the joint output of the unified model for both modalities.
Training Setup: VG-FractalDB-1k (1000 categories, 1000 instances per category), using AdamW optimizer, batch size 64/GPU, initial learning rate of 5e-4, weight decay of 5e-2, training for 200 epochs. It takes approximately 60 hours on 16 NVIDIA V100 GPUs.
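As a worked illustration of the loss, the sketch below evaluates the cross-entropy formula on raw classifier outputs. It assumes the model emits one logit vector per sample and uses 0-indexed labels (the text uses \(1, \dots, C\)). With one-hot \(y_{j,c}\), the inner sum over \(c\) reduces to \(-\log \hat{y}_{j, y_j}\), which is what the code computes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Mean over the batch of -log p_hat[j][y_j] (labels are 0-indexed here)."""
    losses = []
    for logits, y in zip(batch_logits, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[y]))
    return sum(losses) / len(losses)
```

For an untrained classifier that outputs uniform probabilities over \(C\) categories, the loss is \(\log C\), so roughly 6.9 for the 1000 categories of VG-FractalDB-1k.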

Key Experimental Results

Main Results: Comparison of FDSL Methods (6 Tasks)

Pre-training Dataset	Image Classification (Acc.)	Image Detection (AP50)	Image Segmentation (AP50)	3D Classification (Acc.)	3D Detection (mAP25)	3D Segmentation (mIoU)
VisualAtom-21k	91.3	66.3	63.3	✗	✗	✗
PC-FractalDB-1k	✗	✗	✗	83.3	63.0	83.7
VG-FractalDB-1k	92.0	68.3	65.6	83.7	63.7	84.1

Detailed Comparison of Image Classification

Method	Type	C10	C100	Cars	Flowers	VOC12	P30	IN100	Average
ImageNet-1k	SL	99.0	89.6	81.9	99.1	86.5	82.1	93.1	90.2
ImageNet-1k (MAE)	SSL	99.1	90.1	91.3	99.8	90.2	82.8	94.1	92.5
VisualAtom-21k	FDSL	97.7	86.7	89.2	99.0	82.4	81.6	91.3	89.7
VG-FractalDB-1k	FDSL	98.1	85.9	89.2	99.5	83.5	81.7	92.0	90.0
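The Average column can be recomputed from the seven per-dataset accuracies in the table above. The short check below copies the values from the table; the dictionary layout and the function name are illustrative only.

```python
# Per-dataset accuracies (C10, C100, Cars, Flowers, VOC12, P30, IN100)
# paired with the Average reported in the table.
reported = {
    "ImageNet-1k": ([99.0, 89.6, 81.9, 99.1, 86.5, 82.1, 93.1], 90.2),
    "ImageNet-1k (MAE)": ([99.1, 90.1, 91.3, 99.8, 90.2, 82.8, 94.1], 92.5),
    "VisualAtom-21k": ([97.7, 86.7, 89.2, 99.0, 82.4, 81.6, 91.3], 89.7),
    "VG-FractalDB-1k": ([98.1, 85.9, 89.2, 99.5, 83.5, 81.7, 92.0], 90.0),
}

def mean_accuracy(scores):
    """Arithmetic mean rounded to one decimal, matching the table's precision."""
    return round(sum(scores) / len(scores), 1)

recomputed = {name: mean_accuracy(scores) for name, (scores, _) in reported.items()}

# Gap between the two FDSL rows, in percentage points.
fdsl_gap = round(recomputed["VG-FractalDB-1k"] - recomputed["VisualAtom-21k"], 1)
```

The recomputed means agree with the reported column, and the gap between the two FDSL rows comes out at 0.3 percentage points.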

Ablation Study

Configuration	IN100 (Acc.)	M40 (Acc.)	Description
ShapeNet (50k)	87.3	92.7	Using CAD models
VG-FractalDB (50k)	87.9	92.8	Fractal data is superior
VG-PN-1k (Perlin noise)	90.7	92.6	Replacing generation rules
VG-FractalDB-1k (Fractal)	92.0	92.9	Fractal geometry is optimal
VG-FDB-1k (MAE Self-Supervised)	80.3	92.8	SSL is inferior to FDSL
VG-FDB-1k (FSVGP)	92.0	92.9	Formula supervision is more effective

Key Findings

FSVGP exceeds VisualAtom-21k by 0.3 percentage points in average image classification accuracy (90.0 vs. 89.7) while using only 1/21 of the data volume (1M vs. 21M).
VG-FractalDB-21k achieves 83.8% ImageNet-1k accuracy at 384 resolution, closing in on JFT-300M (84.2%) while using only 1/14 of the data.
Dual-modal visual-geometric (V+G) pre-training outperforms single-modal pre-training on downstream tasks for both modalities.
It outperforms SSL methods like MaskPoint on 3D detection mAP25 (63.7 vs. 63.4).

Highlights & Insights

Ultra-simple yet Effective Cross-modal Design: No complex cross-modal attention or contrastive learning modules are required; cross-modal learning is achieved solely through a shared class token and classification MLP.
Formula as Alignment: Mathematical formulas naturally provide cross-modal alignment signals, entirely bypassing the expensive manual annotation and paired data collection that traditional methods require.
Privacy Advantages of Synthetic Data: Free from copyright issues, privacy leaks, and social biases, which is of significant importance under increasingly strict data compliance requirements.
Unified Pre-training Paradigm: A single pre-training phase simultaneously serves 6 downstream tasks (2 modalities \(\times\) 3 tasks), demonstrating the general potential of visual-geometric representation learning.

Limitations & Future Work

A performance gap remains relative to SSL methods (such as MAE and DINO) pre-trained on ImageNet, and the domain gap between synthetic and real-world data still needs to be addressed.
Linear probing performance is relatively weak, necessitating the design of more efficient fine-tuning strategies.
Only a single virtual camera view (\(v=1\)) is used for projection; multi-view projections could further enrich the diversity of the visual modality.
The approach can be extended to more complex application scenarios, such as Bird's-Eye View (BEV) in autonomous driving and 3D shape retrieval.

Related Work & Insights

VisualAtom / PC-FractalDB: Direct predecessors of this work, which conduct FDSL pre-training on image and point cloud single modalities, respectively.
CrossPoint / Pri3D: Traditional visual-geometric representation learning methods that rely on real paired data and contrastive learning.
Insight: Synthetic data generated by mathematical formulas could serve as a universal paradigm to address the bottleneck of multimodal alignment annotation—applicable not only to image-point cloud but potentially generalizable to other cross-modal scenarios.

Rating

Novelty: ⭐⭐⭐⭐ The idea of simultaneously generating multi-modal data and labels using mathematical formulas is highly clever and unique.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on 13 datasets across 6 tasks. The ablation studies explore core problems such as generation rules and supervision types.
Writing Quality: ⭐⭐⭐⭐ Clear logic, rich diagrams and tables, though some symbol definitions are slightly redundant.
Value: ⭐⭐⭐⭐ In the era of data compliance, the synthetic pre-training path is highly promising, but the gap with real-data SSL limits its immediate practicality.