RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians¶
Conference: ICCV 2025
arXiv: 2508.09830
Code: https://github.com/vLAR-group/RayletDF
Area: 3D Vision / Surface Reconstruction / Generalizable Representation
Keywords: Raylet Distance Field, Surface Reconstruction, Point Cloud, 3DGS, Generalization
TL;DR¶
This paper proposes RayletDF, a generalizable 3D surface reconstruction method based on a "raylet" (ray segment) distance field. Through three modules — a raylet feature extractor, a distance field predictor, and a multi-raylet mixer — RayletDF directly predicts surface points from point clouds or 3D Gaussians, achieving high-accuracy cross-dataset generalization via a single forward pass on unseen datasets.
Background & Motivation¶
Recovering 3D surfaces from RGB/D images or point clouds is a fundamental requirement for applications such as mixed reality and embodied AI. Existing methods each suffer from distinct limitations:
- Coordinate-based methods (OF, SDF, NeRF): require dense sampling and network evaluation to extract explicit surfaces, incurring high computational cost.
- 3DGS: enables real-time RGB rendering but produces poor depth quality and fails to capture fine surface geometry.
- Ray-based methods (DRDF, PRIF, RayDF): are efficient but constrained by Plücker/spherical ray parameterizations, limited to object-level surfaces, and require per-scene optimization.
Core Insight: Existing ray-based methods use complete rays as input, preventing them from capturing local geometric patterns. By instead using local ray segments (raylets) — unit ray segments whose starting points are sampled near the surface — the method can focus on fine-grained local surface patterns that are generalizable across different shapes.
Method¶
Core Concepts: Raylet and Raylet Distance¶
- Raylet \(\mathbf{l}\): a unit segment of a ray, with its starting point sampled near the shape surface, parameterized as a 6D vector (starting point xyz + unit direction vector).
- Raylet Distance \(d_l\): the signed distance between the surface hit point and the raylet starting point. Positive values indicate the hit point lies in front of the starting point; negative values indicate behind.
- Key advantage: multiple raylets can be sampled on both sides of the surface along a single ray, with each raylet focusing on local surface patterns.
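The raylet distance defined above is simply the signed projection of the surface hit point onto the ray, measured from the raylet's starting point. A minimal sketch (the function name is mine, not the paper's):

```python
import numpy as np

def raylet_distance(start, direction, hit_point):
    """Signed distance from the raylet starting point to the surface hit point,
    measured along the ray. Positive: hit point lies in front of the start;
    negative: hit point lies behind it."""
    u = direction / np.linalg.norm(direction)  # ensure unit direction
    return float(np.dot(hit_point - start, u))
```

For example, with a raylet starting at the origin pointing along +x and a hit point at (2, 0, 0), the distance is 2.0; moving the start to (3, 0, 0) makes it -1.0, since the hit point is now behind the start.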
Three-Module Pipeline¶
Module 1: Raylet Feature Extractor
- Extracts per-point features \(\mathbf{F} \in \mathbb{R}^{N \times 32}\) from the input scene (point cloud or 3D Gaussians) using SparseConv.
- For a query raylet \(\mathbf{l}\), the \(K\) nearest points are retrieved via KNN, and neighborhood information is aggregated as: \(\hat{\mathbf{f}}_l^k = \left(\mathbf{p}_l^k \oplus \frac{\mathbf{p}_l^k - \mathbf{p}_l}{\|\mathbf{p}_l^k - \mathbf{p}_l\|} \oplus \|\mathbf{p}_l^k - \mathbf{p}_l\|\right) \oplus \mathbf{f}_l^k\)
- Key insight: the extracted features preserve local geometric patterns near the surface, enabling the learned representation to generalize across scenes.
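The per-neighbor vector in the equation above concatenates the neighbor's position, the unit offset to the raylet start, the offset's length, and the neighbor's learned feature. A rough numpy sketch of this gathering step (function and variable names are illustrative; the paper further fuses these \(K\) vectors into a single \(\mathbf{f}_l\) with a small network):

```python
import numpy as np

def aggregate_raylet_features(p_l, points, feats, K=16):
    """For a raylet starting point p_l, gather its K nearest scene points and
    build per-neighbor vectors: (xyz ⊕ unit offset ⊕ offset length) ⊕ feature."""
    dists = np.linalg.norm(points - p_l, axis=1)
    idx = np.argsort(dists)[:K]                      # K nearest neighbors
    nbr, d = points[idx], dists[idx][:, None]
    unit = (nbr - p_l) / np.maximum(d, 1e-8)         # unit offset directions
    return np.concatenate([nbr, unit, d, feats[idx]], axis=1)  # (K, 3+3+1+C)
```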
Module 2: Raylet Distance Field Predictor
- An 8-layer MLP (256 hidden units per layer) takes the raylet position, direction, and features as input, and outputs a distance value and a confidence score: \((d_l, s_l) = \mathrm{MLP}(\mathbf{p}_l \oplus \mathbf{u}_l \oplus \mathbf{f}_l)\)
- No dense coordinate sampling along the ray is required; the surface distance is predicted in a single pass.
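Structurally, the predictor is just a plain MLP over the concatenated raylet start, direction, and aggregated feature, with a 2-dimensional output head. A minimal numpy sketch with random placeholder weights (for shape intuition only, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_predict(p_l, u_l, f_l, hidden=256, layers=8):
    """Toy stand-in for the raylet distance field predictor: an 8-layer ReLU MLP
    mapping (p_l ⊕ u_l ⊕ f_l) to a distance d_l and a confidence logit s_l.
    Weights are random placeholders, so outputs are meaningless numbers."""
    x = np.concatenate([p_l, u_l, f_l])
    for _ in range(layers - 1):
        W = rng.standard_normal((hidden, x.size)) / np.sqrt(x.size)
        x = np.maximum(W @ x, 0.0)                   # ReLU hidden layer
    W_out = rng.standard_normal((2, x.size)) / np.sqrt(x.size)
    d_l, s_l = W_out @ x                             # distance, confidence logit
    return float(d_l), float(s_l)
```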
Module 3: Multi-Raylet Mixer
- \(T\) raylets are sampled along the same ray (same direction, different starting points), with distances predicted in parallel.
- Predictions are fused via softmax-weighted aggregation: \(D = \sum_{t=1}^T \hat{s}_{l_t}\left(\|\mathbf{p}_{cam} - \mathbf{p}_{l_t}\| + d_{l_t}\right), \quad \hat{s}_{l_t} = \frac{e^{s_{l_t}}}{\sum_{t'=1}^T e^{s_{l_{t'}}}}\)
- Multi-raylet fusion improves generalizability and robustness.
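The mixer equation is straightforward to implement: each raylet's camera-to-start offset plus its predicted raylet distance gives one depth hypothesis, and the confidence logits weight them via a softmax. A toy numpy sketch (function name is mine):

```python
import numpy as np

def mix_raylets(cam, starts, dists, scores):
    """Fuse T per-raylet predictions into one camera-to-surface distance D.
    cam: (3,) camera position; starts: (T, 3) raylet starting points;
    dists: (T,) predicted raylet distances; scores: (T,) confidence logits."""
    w = np.exp(scores - scores.max())                # numerically stable softmax
    w /= w.sum()
    depths = np.linalg.norm(starts - cam, axis=1) + dists  # per-raylet hypotheses
    return float(np.dot(w, depths))
```

With equal confidence logits this degenerates to a plain average of the per-raylet depth hypotheses; the learned scores let the network down-weight raylets whose local neighborhoods are unreliable.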
Raylet Sampling Strategy¶
- For point clouds: a virtual sphere (radius = distance to the nearest point) is constructed for each point; the scene surface is bounded by the union of all virtual spheres. Ray–sphere intersections are projected onto the ray, and the top-\(T\) intersection points with the smallest perpendicular distances are selected as raylet starting points.
- For 3DGS: ray–Gaussian intersections are computed, and the top-\(T\) points are selected based on alpha blending contribution.
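The point-cloud strategy above can be sketched in a few lines of numpy. This is a simplified, hypothetical rendering of the idea (brute-force nearest-neighbor radii, no acceleration structure): project each virtual-sphere center onto the query ray, keep only spheres the ray actually pierces, and take the \(T\) projections with the smallest perpendicular distance as raylet starts.

```python
import numpy as np

def sample_raylet_starts(origin, direction, points, T=4):
    """Virtual-sphere raylet sampling sketch for a point-cloud scene.
    Each point's sphere radius is its nearest-neighbor distance."""
    u = direction / np.linalg.norm(direction)
    pair = np.linalg.norm(points[:, None] - points[None], axis=-1)
    np.fill_diagonal(pair, np.inf)
    radii = pair.min(axis=1)                          # nearest-neighbor radii
    t = (points - origin) @ u                         # projection onto the ray
    perp = np.linalg.norm(points - origin - t[:, None] * u, axis=1)
    hit = (perp <= radii) & (t > 0)                   # ray pierces the sphere
    cand = np.where(hit)[0]
    cand = cand[np.argsort(perp[cand])][:T]           # top-T smallest perp. dist
    return origin + t[cand, None] * u                 # raylet starting points
```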
Training Loss¶
- An \(\ell_1\) loss supervises the predicted distance \(D\); ground-truth values are converted from depth maps.
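The supervision itself is a plain \(\ell_1\) regression over the fused distances, e.g.:

```python
import numpy as np

def l1_depth_loss(pred_D, gt_D):
    """ℓ1 loss between fused predicted distances and ground-truth
    camera-to-surface distances (converted from depth maps)."""
    return float(np.mean(np.abs(pred_D - gt_D)))
```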
Key Experimental Results¶
Main Results: Cross-Dataset Generalization (Trained on ARKitScenes)¶
| Method | Type | ARKitScenes ADE↓ | ScanNet/ScanNet++ ADE↓ | MultiScan ADE↓ |
|---|---|---|---|---|
| 3DGS | Per-scene | 0.268 | 0.321 | 0.431 |
| PGSR | Per-scene | 0.219 | 0.202 | 0.315 |
| DepthAnythingV2 | Aligned | 0.206 | 0.168 | 0.228 |
| Pointersect | Generalizable | 0.286 | 0.366 | 0.266 |
| RayDF | Generalizable | 0.183 | 0.227 | 0.326 |
| RayletDF | Generalizable | 0.115 | 0.175 | 0.216 |
Ablation Study: Impact of Key Components (\(\delta\) Metric)¶
| Ablation | ARKit \(\delta\)↑ | ScanNet++ \(\delta\)↑ |
|---|---|---|
| RayletDF (full) | 0.928 | 0.894 |
| w/o multi-raylet mixing | 0.908 | 0.847 |
| w/o confidence score | 0.921 | 0.882 |
| K=4 (vs. K=16) | 0.916 | 0.870 |
Key Findings:
- RayletDF reduces ADE on ARKitScenes by 37% over RayDF (0.183→0.115), with particularly strong cross-dataset generalization.
- Even when trained solely on ARKitScenes, the method significantly outperforms all generalizable baselines on fully unseen ScanNet++ and MultiScan datasets.
- Multi-raylet mixing is critical (removing it drops \(\delta\) from 0.928 to 0.908 on ARKit and from 0.894 to 0.847 on ScanNet++); confidence weighting further improves accuracy.
- The method supports reconstruction from both point cloud and 3DGS inputs within the same pipeline.
- Gaussian data for 7,770 3D scenes (ScanNet/++, ARKitScenes, MultiScan) will be publicly released.
Highlights & Insights¶
- Elegant raylet concept: Decomposing rays into segments that focus on local patterns is the key to achieving generalization — local geometric patterns are shared across diverse scenes.
- No dense sampling required: Unlike SDF/OF methods that require dense coordinate sampling along rays, RayletDF predicts surface distances in a single forward pass.
- Unified input pipeline: The same pipeline seamlessly handles both point cloud and 3DGS inputs.
- Closed-form surface normal derivation: The ray-based formulation allows surface normals to be derived analytically without an additional network.
Limitations & Future Work¶
- The SparseConv backbone requires voxelization, and memory consumption grows with scene scale.
- No prediction can be produced for query rays far from the point cloud surface (such rays are discarded).
- Cross-dataset generalization still exhibits a non-trivial accuracy gap on MultiScan.
- The use of surface normals as additional regularization or for outlier filtering has not been explored.
Related Work & Insights¶
- Coordinate-based methods (OF, SDF, UDF) require dense sampling.
- Ray-based methods (RayDF, PRIF) are limited to object-level shapes.
- Depth estimation (DepthAnythingV2) produces high-quality results but lacks cross-frame consistency.
Rating¶
- Novelty: ★★★★★ — The raylet distance field is an elegant and effective new representation.
- Practicality: ★★★★☆ — Generalizable reconstruction offers substantial value for downstream applications (AR/robotics).
- Experimental Thoroughness: ★★★★★ — Comprehensive evaluation across multiple datasets with thorough ablation studies.
- Writing Quality: ★★★★☆ — Concepts are clearly articulated with intuitive illustrations.