Formula-Supervised Visual-Geometric Pre-training (FSVGP)

Conference: ECCV 2024
arXiv: 2409.13535
Code: https://ryosuke-yamada.github.io/fdsl-fsvgp/ (Project Page)
Area: LLM Pre-training
Keywords: Visual-geometric representation learning, Synthetic data pre-training, Fractal geometry, Unified Transformer, Formula-driven supervised learning

TL;DR

Proposes FSVGP, which automatically generates aligned synthetic images and point clouds using mathematical formulas of fractal geometry. Through formula-supervised consistency labels, it achieves cross-modal visual-geometric pre-training on a unified Transformer, outperforming single-modal FDSL methods across six tasks in image and 3D object classification, detection, and segmentation.

Background & Motivation

Background: Fusing image (visual) and point cloud (geometric) data is crucial for 3D scene understanding in vision models. However, current research on visual-geometric representation learning mostly improves either image recognition or 3D object recognition in isolation, and lacks a unified model that strengthens both modalities at once.

Limitations of Prior Work: Large-scale paired image-point cloud datasets are extremely scarce, and collecting high-quality 3D data is highly expensive. Aligning images and point clouds requires substantial preprocessing, and annotation typically requires expert manual labeling of complex 3D spatial information. Furthermore, real-world datasets face copyright issues and ethical biases.

Key Challenge: Existing FDSL (Formula-Driven Supervised Learning) methods, such as VisualAtom and PC-FractalDB, function only on a single modality (either images or point clouds) and cannot achieve cross-modal representation learning.

Goal: To achieve unified pre-training for both visual and geometric modalities using synthetic data without relying on real-world data, enabling a single model to simultaneously improve performance on various downstream tasks for both images and 3D objects.

Key Insight: Leverage mathematical formulas of fractal geometry to simultaneously generate images and point clouds, and achieve supervised pre-training using cross-modal aligned labels naturally provided by these formulas.

Core Idea: Mathematical formulas can generate multi-modal data and naturally provide cross-modal consistency labels, which makes formula-driven generation more cost-effective than manual labeling and alignment.

Method

Overall Architecture

FSVGP consists of two core components. (1) VG-FractalDB dataset construction: 3D Iterative Function Systems (3D-IFS) generate fractal point clouds, which are projected onto 2D planes to obtain fractal images, and the formula parameters supply cross-modal consistent labels. (2) Unified Transformer pre-training: with minimal modifications to ViT and PointT, fractal images and fractal point clouds are fed to the model simultaneously, and the model is pre-trained for classification with a cross-entropy loss.
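The dataset construction step can be illustrated with a minimal sketch. It assumes a random-iteration ("chaos game") sampler for the 3D-IFS and replaces the virtual camera with a plain orthographic projection onto the xy-plane that yields a binary occupancy image. The function names, parameter ranges, and image size are hypothetical and are not taken from the paper.

```python
import random

def random_ifs(num_maps, rng):
    """Sample a hypothetical 3D-IFS as a list of (r, b) affine maps."""
    maps = []
    for _ in range(num_maps):
        # Entries in [-0.3, 0.3] keep every row sum of |r| below 1,
        # so each map is contractive in this sketch.
        r = [[rng.uniform(-0.3, 0.3) for _ in range(3)] for _ in range(3)]
        b = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        maps.append((r, b))
    return maps

def apply_map(r, b, x):
    """Affine transformation t(x) = r x + b for a 3-vector x."""
    return [sum(r[i][k] * x[k] for k in range(3)) + b[i] for i in range(3)]

def generate_point_cloud(maps, num_points=8192, rng=None):
    """Iterate randomly chosen maps and record each visited 3D point."""
    rng = rng or random.Random(0)
    x = [0.0, 0.0, 0.0]
    points = []
    for _ in range(num_points):
        r, b = rng.choice(maps)
        x = apply_map(r, b, x)
        points.append(x)
    return points

def project_to_image(points, size=32):
    """Orthographic xy-projection into a size x size binary image (assumption)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    span_x = max(max(xs) - min_x, 1e-9)
    span_y = max(max(ys) - min_y, 1e-9)
    image = [[0] * size for _ in range(size)]
    for p in points:
        u = int((p[0] - min_x) / span_x * (size - 1))
        v = int((p[1] - min_y) / span_y * (size - 1))
        image[v][u] = 1
    return image

def make_sample(class_id, num_maps=4):
    """Build one (X, I, y) triple; the class id seeds the IFS parameters."""
    maps = random_ifs(num_maps, random.Random(class_id))
    cloud = generate_point_cloud(maps, rng=random.Random(class_id + 1))
    return cloud, project_to_image(cloud), class_id
```

Because the image is derived from the same point cloud, which in turn comes from the same IFS parameters, the label needs no separate annotation or alignment step: `X, I, y = make_sample(class_id=3)` returns both modalities with the shared label 3.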

Key Designs

  1. VG-FractalDB (Visual-Geometric Fractal Database): The dataset is defined as \(\mathcal{D} = \{(X_j, I_j, y_j)\}_{j=1}^{N}\), where \(X_j\) is a fractal point cloud, \(I_j\) is a fractal image, and \(y_j\) is the formula-supervised consistency label. The fractal point cloud is generated by a 3D-IFS, iteratively producing \(T=8192\) 3D coordinate points via the affine transformation \(t_i(\mathbf{x}) = \mathbf{r}_i \mathbf{x} + \mathbf{b}_i\). The fractal image is obtained through virtual camera projection \(I_j = \mathcal{F}_{\text{RGB}}(X_j; \mathbf{c})\). Since the image and point cloud share the same 3D-IFS parameter \(\Theta^c\), the cross-modal labels are naturally aligned without extra annotation costs. Design Motivation: To utilize the self-similarity of fractal geometry to generate sufficiently complex visual-geometric structures while avoiding copyright and privacy issues of real data.

  2. Formula-Supervised Consistency Label: Each fractal category \(c\) is defined by a 3D-IFS \(\Theta^c\), and the 3D point cloud and its 2D projected image share the same label \(y_j \in \{1, 2, \cdots, C\}\). Categories that fail a variance threshold criterion (threshold 0.05), checked along each coordinate axis, are filtered out as invalid, and instances are diversified using the FractalNoiseMix technique (mixing in 20% random noise points). Design Motivation: Mathematical formulas naturally define cross-modal correspondences, bypassing the expensive pixel-to-point correspondence preprocessing steps required in traditional methods.

  3. Unified Transformer Model: Only the input processing parts of ViT and PointT are modified. The fractal image is embedded as image tokens \(\mathbf{z}_i = [x_{\text{class}}, \mathbf{z}_i^1, \dots, \mathbf{z}_i^{M_i}]\), and the fractal point cloud is embedded as point cloud tokens \(\mathbf{z}_p = [x_{\text{class}}, \mathbf{z}_p^1, \dots, \mathbf{z}_p^{M_p}]\). The two modalities share the class token \(x_{\text{class}}\) and the MLP layer for classification. Design Motivation: To keep the model structure as simple as possible, avoiding the design of complex cross-modal modules, and ensuring the pre-trained model's applicability to various downstream tasks.
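The category filter and the instance augmentation in design 2 can be sketched as follows. This is only one plausible reading: it assumes a category is valid when the point variance along every axis exceeds the threshold, and that noise points are drawn uniformly inside the cloud's bounding box and replace existing points. Both function names and both choices are assumptions, and the actual FractalNoiseMix procedure may draw its noise differently.

```python
import random
from statistics import pvariance

def passes_variance_check(points, threshold=0.05):
    """True when the variance along each of the three axes exceeds the threshold."""
    return all(
        pvariance([p[axis] for p in points]) > threshold
        for axis in range(3)
    )

def fractal_noise_mix(points, noise_ratio=0.2, rng=None):
    """Replace a fraction of points with uniform noise (assumed noise model)."""
    rng = rng or random.Random(0)
    n_noise = int(len(points) * noise_ratio)
    lows = [min(p[a] for p in points) for a in range(3)]
    highs = [max(p[a] for p in points) for a in range(3)]
    mixed = [list(p) for p in points]
    # Pick distinct indices so exactly n_noise points are overwritten.
    for idx in rng.sample(range(len(mixed)), n_noise):
        mixed[idx] = [rng.uniform(lows[a], highs[a]) for a in range(3)]
    return mixed
```

A cloud that collapses onto a line or plane has near-zero variance on at least one axis and is rejected, which is the kind of degenerate category the filter is meant to remove.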

Loss & Training

  • Loss Function: Cross-entropy loss \(\mathcal{L}_{\text{ce}}(f(\mathcal{D})) = -\frac{1}{N} \sum_{j=1}^{N} \sum_{c=1}^{C} y_{j,c} \log \hat{y}_{j,c}\), where \(\hat{y}_j = f(X_j, I_j)\) is the joint output of the unified model for both modalities.
  • Training Setup: VG-FractalDB-1k (1000 categories, 1000 instances per category), using AdamW optimizer, batch size 64/GPU, initial learning rate of 5e-4, weight decay of 5e-2, training for 200 epochs. It takes approximately 60 hours on 16 NVIDIA V100 GPUs.
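For reference, the cross-entropy objective can be written out in a few lines. With one-hot labels, the inner sum over classes reduces to the negative log-probability of the correct class. The sketch assumes the model emits raw logits that are normalized with a softmax, and it uses 0-based class indices rather than the \(\{1, \dots, C\}\) convention above. The logits are placeholders, not outputs of the actual unified model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Mean over the batch of -log p(correct class); labels are 0-based."""
    losses = []
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)
```

A model that assigns uniform probability over \(C\) classes scores \(\log C\) under this loss, so an untrained 1000-class classifier starts near \(\log 1000 \approx 6.9\).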

Key Experimental Results

Main Results: Comparison of FDSL Methods (6 Tasks)

| Pre-training Dataset | Image Classification (Acc.) | Image Detection (AP50) | Image Segmentation (AP50) | 3D Classification (Acc.) | 3D Detection (mAP25) | 3D Segmentation (mIoU) |
|---|---|---|---|---|---|---|
| VisualAtom-21k | 91.3 | 66.3 | 63.3 | - | - | - |
| PC-FractalDB-1k | - | - | - | 83.3 | 63.0 | 83.7 |
| VG-FractalDB-1k | 92.0 | 68.3 | 65.6 | 83.7 | 63.7 | 84.1 |

Detailed Comparison of Image Classification

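| Method | Type | C10 | C100 | Cars | Flowers | VOC12 | P30 | IN100 | Average |
|---|---|---|---|---|---|---|---|---|---|
| ImageNet-1k | SL | 99.0 | 89.6 | 81.9 | 99.1 | 86.5 | 82.1 | 93.1 | 90.2 |
| ImageNet-1k (MAE) | SSL | 99.1 | 90.1 | 91.3 | 99.8 | 90.2 | 82.8 | 94.1 | 92.5 |
| VisualAtom-21k | FDSL | 97.7 | 86.7 | 89.2 | 99.0 | 82.4 | 81.6 | 91.3 | 89.7 |
| VG-FractalDB-1k | FDSL | 98.1 | 85.9 | 89.2 | 99.5 | 83.5 | 81.7 | 92.0 | 90.0 |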
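The Average column can be checked directly from the seven per-dataset accuracies. The short script below recomputes it from the numbers in the table, assuming a plain unweighted mean rounded to one decimal place.

```python
# Per-dataset accuracies in the order C10, C100, Cars, Flowers, VOC12, P30, IN100.
rows = {
    "ImageNet-1k (SL)": [99.0, 89.6, 81.9, 99.1, 86.5, 82.1, 93.1],
    "ImageNet-1k (MAE)": [99.1, 90.1, 91.3, 99.8, 90.2, 82.8, 94.1],
    "VisualAtom-21k": [97.7, 86.7, 89.2, 99.0, 82.4, 81.6, 91.3],
    "VG-FractalDB-1k": [98.1, 85.9, 89.2, 99.5, 83.5, 81.7, 92.0],
}

# Unweighted mean over the seven datasets, rounded as in the table.
averages = {name: round(sum(v) / len(v), 1) for name, v in rows.items()}

# Gap between the two FDSL rows, in percentage points.
fdsl_gap = round(averages["VG-FractalDB-1k"] - averages["VisualAtom-21k"], 1)

for name, avg in averages.items():
    print(f"{name}: {avg}")
print("FDSL gap:", fdsl_gap)
```

The recomputed means match the table (90.2, 92.5, 89.7, 90.0). The gap between VG-FractalDB-1k and VisualAtom-21k is 0.3 points, while MAE on ImageNet-1k remains 2.5 points ahead.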

Ablation Study

| Configuration | IN100 (Acc.) | M40 (Acc.) | Description |
|---|---|---|---|
| ShapeNet (50k) | 87.3 | 92.7 | Using CAD models |
| VG-FractalDB (50k) | 87.9 | 92.8 | Fractal data is superior |
| VG-PN-1k (Perlin noise) | 90.7 | 92.6 | Replacing generation rules |
| VG-FractalDB-1k (Fractal) | 92.0 | 92.9 | Fractal geometry is optimal |
| VG-FDB-1k (MAE Self-Supervised) | 80.3 | 92.8 | SSL is inferior to FDSL |
| VG-FDB-1k (FSVGP) | 92.0 | 92.9 | Formula supervision is more effective |

Key Findings

  • FSVGP exceeds VisualAtom-21k by 0.3 points in average image classification accuracy (90.0 vs. 89.7) while using only 1/21 of the data volume (1M vs. 21M).
  • VG-FractalDB-21k achieves 83.8% ImageNet-1k accuracy at 384 resolution, closing in on JFT-300M (84.2%) while using only 1/14 of the data.
  • Dual-modal visual-geometric (V+G) pre-training outperforms single-modal pre-training on downstream tasks for both modalities.
  • It outperforms SSL methods like MaskPoint on 3D detection mAP25 (63.7 vs. 63.4).
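The data-volume ratios quoted in these findings follow from simple arithmetic. The sketch below assumes that VG-FractalDB-21k holds 21k categories with 1000 instances each (by analogy with the 1k variant) and that JFT-300M holds 300M images, as its name suggests. Neither count is stated explicitly in this summary.

```python
from fractions import Fraction

vg_fractaldb_1k = 1000 * 1000       # 1000 categories x 1000 instances
visualatom_21k = 21_000_000         # "21M" as quoted in the findings
vg_fractaldb_21k = 21_000 * 1000    # assumption: 21k categories x 1000 instances
jft_300m = 300_000_000              # assumption: size implied by the dataset name

ratio_vs_visualatom = Fraction(vg_fractaldb_1k, visualatom_21k)
ratio_vs_jft = Fraction(vg_fractaldb_21k, jft_300m)

print(ratio_vs_visualatom)             # 1/21
print(round(float(1 / ratio_vs_jft), 1))  # 14.3, i.e. roughly 1/14
```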

Highlights & Insights

  • Ultra-simple yet Effective Cross-modal Design: No complex cross-modal attention or contrastive learning modules are required; cross-modal learning is achieved solely through a shared class token and classification MLP.
  • Formula as Alignment: Mathematical formulas naturally provide cross-modal alignment signals, entirely bypassing expensive manual annotation and paired data collection of traditional methods.
  • Privacy Advantages of Synthetic Data: Free from copyright issues, privacy leaks, and social biases, which is of significant importance under increasingly strict data compliance requirements.
  • Unified Pre-training Paradigm: A single pre-training phase simultaneously serves 6 downstream tasks (2 modalities \(\times\) 3 tasks), demonstrating the general potential of visual-geometric representation learning.

Limitations & Future Work

  • A performance gap remains relative to SSL methods pre-trained on ImageNet (such as MAE and DINO), and the domain gap between synthetic and real-world data still needs to be addressed.
  • Linear probing performance is relatively weak, necessitating the design of more efficient fine-tuning strategies.
  • Only a single virtual camera view (\(v=1\)) projection is utilized; multi-view projections could further enrich the diversity of the visual modality.
  • The approach can be extended to more complex application scenarios, such as Bird's-Eye View (BEV) in autonomous driving and 3D shape retrieval.

Related Work & Insights

  • VisualAtom / PC-FractalDB: Direct predecessors of this work, which conduct FDSL pre-training on image and point cloud single modalities, respectively.
  • CrossPoint / Pri3D: Traditional visual-geometric representation learning methods that rely on real paired data and contrastive learning.
  • Insight: Synthetic data generated by mathematical formulas could serve as a universal paradigm to address the bottleneck of multimodal alignment annotation—applicable not only to image-point cloud but potentially generalizable to other cross-modal scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of simultaneously generating multi-modal data and labels using mathematical formulas is highly clever and unique.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on 13 datasets across 6 tasks. The ablation studies explore core problems such as generation rules and supervision types.
  • Writing Quality: ⭐⭐⭐⭐ Clear logic, rich diagrams and tables, though some symbol definitions are slightly redundant.
  • Value: ⭐⭐⭐⭐ In the era of data compliance, the synthetic pre-training path is highly promising, but the gap with real-data SSL limits its immediate practicality.