
🧊 3D Vision

🧪 ICML2026 · 8 paper notes

📌 Same area in other venues: 📷 CVPR2026 (230) · 🔬 ICLR2026 (63) · 🤖 AAAI2026 (76) · 🧠 NeurIPS2025 (112) · 📹 ICCV2025 (263)

🔥 Top topics: Layout & Composition ×2

FSI2P: A Hierarchical Focus–Sweep Registration Network with Dynamically Allocated Depth

This paper abstracts the human observation process of "first scanning broadly, then examining in detail" into a two-stage Focus-Sweep paradigm. It replaces the Transformer with Mamba for image-to-point-cloud interaction and uses reinforcement learning to decide the number of interaction layers at each scale, achieving SOTA image-to-point-cloud (I2P) registration on RGB-D Scenes V2 and 7-Scenes.

LabBuilder: Protocol-Grounded 3D Layout Generation for Interactable and Safe Laboratory

LabBuilder compiles free-text experimental descriptions into "asset-chemical protocol" pairs, then employs hierarchical generation, geometric/chemical multi-objective optimization, and navigation repair to produce 3D chemical laboratory layouts that are both visually plausible and executable by robots.

Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

Pair2Scene shifts 3D indoor scene generation from "directly fitting the global joint distribution" to "learning pairwise local object relations (support + function) and recursively assembling them via a scene hierarchy tree." With point cloud geometric encoding, Mixture-of-Logistics probabilistic heads, and collision-aware rejection sampling, it can generate complex scenes with object counts rising from about 4 to about 14 using only 3D-FRONT training data. It outperforms baselines such as ATISS, DiffuScene, and LayoutVLM on both FID and user studies.
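As a rough illustration of the collision-aware rejection sampling idea mentioned above, the sketch below places child objects around an anchor and discards any draw that overlaps an already placed box. It is not the paper's implementation: the 2D axis-aligned boxes, the Gaussian proposal (standing in for a learned pairwise relation head), and the names `boxes_overlap` and `sample_placement` are all assumptions made for this example.

```python
import random

def boxes_overlap(a, b):
    """Axis-aligned 2D boxes as (cx, cy, width, depth); True if interiors overlap."""
    ax, ay, aw, ad = a
    bx, by, bw, bd = b
    return abs(ax - bx) < (aw + bw) / 2 and abs(ay - by) < (ad + bd) / 2

def sample_placement(anchor, size, placed, rng, max_tries=200):
    """Draw positions around `anchor` until the new box collides with nothing.

    The Gaussian proposal is a hypothetical stand-in for a learned
    pairwise relation distribution.
    """
    ax, ay, _, _ = anchor
    w, d = size
    for _ in range(max_tries):
        candidate = (ax + rng.gauss(0.0, 1.5), ay + rng.gauss(0.0, 1.5), w, d)
        if not any(boxes_overlap(candidate, other) for other in placed):
            return candidate
    return None  # give up rather than accept a colliding placement

rng = random.Random(0)
bed = (0.0, 0.0, 2.0, 1.6)          # anchor object
placed = [bed]
for size in [(0.5, 0.5), (0.5, 0.5), (1.0, 0.6)]:
    box = sample_placement(bed, size, placed, rng)
    if box is not None:
        placed.append(box)
```

Whatever the seed, every box that ends up in `placed` is collision-free with respect to the others, because a candidate is only accepted after it has been checked against all earlier placements.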

PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

PhysForge reframes "creating interactive 3D objects" as a two-stage problem: "physical planning first, physical generation second." A VLM acts as a physical architect, generating a "Hierarchical Physical Blueprint" that specifies hierarchical relationships, materials, and kinematic constraints. A diffusion model then uses KineVoxel Injection to co-denoise articulation parameters and geometric voxels. Combined with the PhysDB dataset (150k assets with four-layer annotations), this approach achieves the first single-view-to-"simulation-ready" 3D asset generation, producing assets that can be grasped, pushed, and articulated in physics engines.

PhysHanDI: Physics-Based Reconstruction of Hand-Deformable Object Interactions

This paper introduces PhysHanDI, which couples the MANO hand model with a Spring-Mass soft object model. Dense hand meshes drive the physical simulation of deformable objects, while object simulation refines hand reconstruction. The method achieves SOTA dense 3D reconstruction of both hands and soft objects from sparse-view RGB-D videos.
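To make the spring-mass idea concrete, here is a minimal sketch of one simulation step in which some nodes are driven by prescribed positions, loosely analogous to hand contact points driving a soft object. It is a generic textbook integrator, not the PhysHanDI pipeline; the function name `spring_mass_step`, the 2D setting, and all parameter values are assumptions for illustration.

```python
import math

def spring_mass_step(pos, vel, springs, driven, dt=0.01, k=50.0, damping=2.0, mass=1.0):
    """Advance a 2D spring-mass system by one semi-implicit Euler step.

    pos, vel : lists of (x, y) tuples
    springs  : list of (i, j, rest_length)
    driven   : dict {node_index: (x, y)} of nodes whose position is prescribed
               (a hypothetical stand-in for hand-driven contact points)
    """
    forces = [[0.0, 0.0] for _ in pos]
    for i, j, rest in springs:
        dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # direction undefined for coincident nodes
        f = k * (length - rest)            # Hooke's law: positive when stretched
        fx, fy = f * dx / length, f * dy / length
        forces[i][0] += fx
        forces[i][1] += fy
        forces[j][0] -= fx
        forces[j][1] -= fy
    new_pos, new_vel = [], []
    for n, (p, v, f) in enumerate(zip(pos, vel, forces)):
        if n in driven:
            new_pos.append(driven[n])      # position imposed from outside
            new_vel.append((0.0, 0.0))
            continue
        vx = (v[0] + dt * f[0] / mass) * (1.0 - damping * dt)
        vy = (v[1] + dt * f[1] / mass) * (1.0 - damping * dt)
        new_pos.append((p[0] + dt * vx, p[1] + dt * vy))
        new_vel.append((vx, vy))
    return new_pos, new_vel

# Two nodes joined by one spring of rest length 1; node 0 is pushed to x = 0.5.
pos = [(0.0, 0.0), (1.0, 0.0)]
vel = [(0.0, 0.0), (0.0, 0.0)]
springs = [(0, 1, 1.0)]
driven = {0: (0.5, 0.0)}
for _ in range(2000):
    pos, vel = spring_mass_step(pos, vel, springs, driven)
```

In this toy setup the free node is pushed away by the compressed spring and settles about one rest length from the driven node.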

R³L: Reasoning 3D Layouts from Relative Spatial Relations

R³L attributes the two systematic errors in MLLM multi-hop "relative spatial relation" reasoning (semantic drift and metric drift) to "repeated reference frame transformations." It introduces three modules: Invariant Spatial Decomposition (shortening relation chains), Consistent Spatial Imagination (an imagine-and-revise loop that eliminates conflicts), and Supportive Spatial Optimization (global-local pose reparameterization). With these, GPT-5-generated open-vocabulary 3D scenes achieve near-zero collision and out-of-bounds rates across 9 scene types, with semantic metrics significantly surpassing LayoutVLM, Holodeck, and LayoutGPT.
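One way to build intuition for why chained reference frame transformations can cause metric drift is the toy below. It composes 2D relative poses of the form "one unit ahead" and injects a small heading error at every hop. This is a generic geometric illustration under assumed values, not the paper's analysis; `compose`, `chain_end`, and the 0.05 rad error are hypothetical.

```python
import math

def compose(pose, rel):
    """Apply a relative pose (dx, dy, dtheta), expressed in `pose`'s frame."""
    x, y, th = pose
    dx, dy, dth = rel
    return (x + dx * math.cos(th) - dy * math.sin(th),
            y + dx * math.sin(th) + dy * math.cos(th),
            th + dth)

def chain_end(hops, heading_error):
    """Follow `hops` relations of 'one unit ahead', each adding a heading error."""
    pose = (0.0, 0.0, 0.0)
    for _ in range(hops):
        pose = compose(pose, (1.0, 0.0, heading_error))
    return pose

def position_error(hops, heading_error):
    """Distance between the drifted end point and the error-free one at (hops, 0)."""
    x, y, _ = chain_end(hops, heading_error)
    return math.hypot(x - hops, y)

short_chain = position_error(2, 0.05)
long_chain = position_error(6, 0.05)
```

With the same per-hop error, the six-hop chain ends up much farther from its intended position than the two-hop chain, which is consistent with the motivation for shortening relation chains.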

Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

AmbiSuR explicitly models two types of intrinsic photometric ambiguities in Gaussian Splatting (primitive boundary spillover and under-constrained pixel blending) and resolves them using truncation and ray-color consistency. It further employs higher-order spherical harmonic coefficients as "self-indicators" to identify high-risk ambiguous primitives and applies amorphous local prior regularization. This reduces the average Chamfer distance on DTU to 0.46, surpassing the previous best GeoSVR (0.47).
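For readers unfamiliar with the metric, the sketch below computes a symmetric Chamfer distance as the average of an accuracy term (prediction to ground truth) and a completeness term (ground truth to prediction). This assumes the common two-sided definition; the actual DTU evaluation protocol adds steps such as masking and downsampling that are omitted here, and the function names are illustrative.

```python
import math

def directed_mean_distance(src, dst):
    """Mean distance from each point in src to its nearest neighbour in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

def chamfer_distance(pred, gt):
    """Average of accuracy (pred -> gt) and completeness (gt -> pred)."""
    accuracy = directed_mean_distance(pred, gt)
    completeness = directed_mean_distance(gt, pred)
    return 0.5 * (accuracy + completeness)

pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.2)]
cd = chamfer_distance(pred, gt)
```

The brute-force nearest-neighbour search is quadratic in the number of points; real evaluations typically use a spatial index instead.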

SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion

This work identifies that in multimodal point cloud completion, "hard projection of 3D points directly onto 2D grids" leads to a support set with Lebesgue measure zero and gradients truncated by a Dirac delta (termed Cross-Modal Entropy Collapse). The authors replace hard projection with differentiable Gaussian Soft Splatting for continuous density estimation, and employ a hybrid encoder combining local EdgeConv and a global Transformer, along with a global-local decoder. The method achieves SOTA on PCN/ShapeNet-55/34, and counterfactual evaluation on KITTI demonstrates that baselines actually degenerate into "unimodal template retrievers."
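The contrast between hard projection and Gaussian soft splatting can be sketched in a few lines. The toy below is not the SplAttN implementation: it works on already projected 2D points, and the grid size, `sigma`, and function names are assumptions chosen for illustration.

```python
import math

def hard_project(points, size):
    """Bin each 2D point into exactly one pixel (piecewise-constant in x, y)."""
    grid = [[0.0] * size for _ in range(size)]
    for x, y in points:
        col, row = int(x), int(y)
        if 0 <= row < size and 0 <= col < size:
            grid[row][col] += 1.0
    return grid

def soft_splat(points, size, sigma=0.8):
    """Spread each point over all pixels with an isotropic Gaussian kernel."""
    grid = [[0.0] * size for _ in range(size)]
    for x, y in points:
        for row in range(size):
            for col in range(size):
                # Pixel centres sit at (col + 0.5, row + 0.5).
                d2 = (col + 0.5 - x) ** 2 + (row + 0.5 - y) ** 2
                grid[row][col] += math.exp(-d2 / (2.0 * sigma ** 2))
    return grid

a, b = (2.2, 2.2), (2.4, 2.2)   # two nearby positions inside the same pixel
hard_same = hard_project([a], 5) == hard_project([b], 5)
soft_same = soft_splat([a], 5) == soft_splat([b], 5)
```

Moving the point slightly within a pixel leaves the hard-projected image unchanged, so no gradient reaches the point position, while the soft-splatted image changes smoothly and every pixel receives a nonzero contribution.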