
🧊 3D Vision

🧪 ICML2026 · 30 paper notes

📌 Same area in other venues: 📷 CVPR2026 (751) · 🔬 ICLR2026 (197) · 🤖 AAAI2026 (79) · 🧠 NeurIPS2025 (116) · 📹 ICCV2025 (267) · 🧪 ICML2025 (17)

🔥 Top topics: Point Cloud ×4 · 3D Reconstruction ×3 · Segmentation ×2 · Image Restoration ×2 · Few-/Zero-Shot Learning ×2

4DPC\(^2\)hat: Towards Dynamic Point Cloud Understanding with Failure-Aware Bootstrapping: 4DPC\(^2\)hat is the first Multimodal Large Language Model (MLLM) designed for "dynamic point cloud sequence" (4D point cloud) understanding. The authors first use a topologically consistent construction pipeline to transform 44,000 animation assets into a dataset of 200,000 cross-modal QA pairs. Then, they employ a spatio-temporal architecture using "preserved group tokens + global tokens + bidirectional Mamba" to avoid compressing a frame into a single vector. Finally, "failure-aware bootstrapping" is used to iteratively identify incorrect model responses and synthesize targeted QA for supplementary training, enabling action understanding and temporal reasoning that significantly outperform approaches that feed video frames to static 3D models.
Adaptive Volumetric Mechanical Property Fields Invariant to Resolution: AdaVoMP uses a "Sparse Adaptive Voxel Tree (SAV)" to represent both the input shape and the output material field. A sparse Transformer encoder-decoder then autoregressively generates Young's modulus, Poisson's ratio, and density for each 3D object, layer by layer. This scales the effective resolution of simulatable material fields from \(64^3\) to \(1024^3\) (16× per axis, i.e. \(16^3 = 4096\)× more voxels) while outperforming previous SOTA models with lower test-time compute.
AvAtar: Learning to Align via Active Optimal Transport: This paper proposes AvAtar, an active alignment framework based on Optimal Transport (OT). It quantifies the influence of candidate queries on global alignment results through gradient propagation. By utilizing the adjoint state method and conjugate gradient method, it achieves efficient solutions with linear complexity. AvAtar consistently outperforms existing active learning strategies in network alignment and cross-domain alignment tasks.
Convex Distance Operator Transport: A Convex and Geometry-Preserving Formulation: This paper proposes CDOT (Convex Distance Operator Transport). By "operatorizing" the distance matrices and coupling of each metric space and replacing the non-convex squared pairwise distance difference in FGW with \(\|D_X T_\pi - T_\pi D_Y\|_{\mathrm{HS}}^2\), it achieves a framework for heterogeneous space alignment that is strictly convex with respect to the coupling \(\pi\), while remaining a valid pseudo-metric and possessing finite-sample risk bounds.
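To make the CDOT objective concrete, the sketch below evaluates the discrepancy term \(\|D_X T_\pi - T_\pi D_Y\|_{\mathrm{HS}}^2\) on a toy example. It assumes the simplest discrete reading, in which \(T_\pi\) is just the coupling matrix \(\pi\) and the Hilbert-Schmidt norm is the Frobenius norm; the paper's exact normalization of \(T_\pi\) is not given in this summary, and the function name `cdot_discrepancy` is hypothetical.

```python
import numpy as np

def cdot_discrepancy(D_X, D_Y, pi):
    """Squared Frobenius norm of D_X @ pi - pi @ D_Y.

    Assumption: the operator T_pi is taken to be the coupling matrix pi itself.
    The residual is linear in pi, so its squared norm is a convex function of pi.
    """
    residual = D_X @ pi - pi @ D_Y
    return float(np.sum(residual ** 2))

# Toy metric space: three points on a line, compared with itself.
points = np.array([0.0, 1.0, 3.0])
D = np.abs(points[:, None] - points[None, :])

identity_coupling = np.eye(3) / 3.0          # matches each point to itself
uniform_coupling = np.full((3, 3), 1.0 / 9)  # ignores the geometry entirely
midpoint_coupling = 0.5 * (identity_coupling + uniform_coupling)

cost_identity = cdot_discrepancy(D, D, identity_coupling)
cost_uniform = cdot_discrepancy(D, D, uniform_coupling)
cost_midpoint = cdot_discrepancy(D, D, midpoint_coupling)
```

The geometry-preserving coupling has zero discrepancy, the uniform coupling does not, and the midpoint of the two couplings costs no more than the average of their costs, which is the convexity property in miniature.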
APEIRIA: Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs: This paper proposes APEIRIA, which distills the execution traces of neuro-symbolic 3D concept learners into natural language chain-of-thought (CoC) for 3D MLLMs. By employing GRPO reinforcement learning, it generalizes these reasoning patterns to open-vocabulary and deeply nested instructions. APEIRIA simultaneously outperforms traditional NS3D methods and current state-of-the-art 3D MLLMs on ScanRefer, Multi3DRefer, SQA3D, and Scan2Cap, while retaining the interpretability and modularity of symbolic systems.
DynaTok: Token-Based 4D Reconstruction from Partial Point Clouds: DynaTok encodes the incomplete, unordered, correspondence-free partial point cloud of each frame into a set of compact latent tokens. It aggregates complementary observations across frames using a spatio-temporal Transformer, decouples deformation using a unified latent space of "reference geometry + residual motion," and reconstructs temporally consistent, complete 4D point cloud sequences via a flow-matching decoder.
EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation: EPS3D is the first end-to-end feed-forward open-vocabulary 3D panoptic segmentation framework. It directly predicts unified 3D panoptic Gaussians with semantic and instance attributes from unposed multi-view images in a single forward pass. By distilling 2D foundation models for supervision, it bypasses the need for 3D annotations. It introduces a semantic-instance mutual enhancement module for reciprocal calibration, achieving approximately 13% higher semantic mIoU than SOTA on Replica with an inference time of only 1 second per scene.
Fast-SAM3D: 3Dfy Anything in Images but Faster: To address the slow inference speed of the SAM3D single-view 3D reconstruction model, this paper provides the first module-level latency profiling. Identifying performance bottlenecks caused by three types of heterogeneity (shape/layout dynamics, texture sparsity, and geometric spectral differences), the authors propose Fast-SAM3D. This training-free framework utilizes modality-aware step caching, spatiotemporal token carving, and spectral-aware token aggregation to achieve a 2.67× speedup at the object level with negligible quality loss, even slightly improving the reconstruction F-Score from 92.34 to 92.59.
FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation: This paper proposes FoundObj, which utilizes 2D/3D self-supervised foundation models (DINOv2 + TRELLIS) as rewarders. By employing a "superpoint merging + PPO" RL agent, it achieves multi-class 3D object segmentation in complex indoor scenes without any scene-level human annotations, improving the unsupervised SOTA AP from 19.6 to 24.2 on ScanNet/S3DIS/ScanNet200.
FSI2P: A Hierarchical Focus–Sweep Registration Network with Dynamically Allocated Depth: This paper abstracts the human observation process of "glancing first, then examining block-by-block" into a two-stage Focus-Sweep paradigm. It replaces Transformer with Mamba for image-to-point cloud interaction and utilizes reinforcement learning to dynamically determine the number of interaction layers at each scale, achieving SOTA performance in I2P registration on RGB-D Scenes V2 and 7-Scenes.
Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion: This paper proposes FR3D—the first world model designed for "future dynamic 3D reconstruction." It disentangles camera ego-motion from scene motion within the latent space of a pre-trained 3D reconstruction model (CUT3R). By using two masked Transformers to extrapolate pose and geometry respectively and leveraging teacher-student distillation for nearly cost-free training, it achieves zero-shot generalization, enabling the prediction of 3D scenes 2 seconds into the future from monocular input.
Geodesic Flow Matching for Denoising High-Dimensional Structured Representations: Focusing on high-dimensional structured representations like Spatial Semantic Pointers (SSPs) in Vector Symbolic Architectures—which are "embedded in a Clifford torus within a unit hypersphere"—the authors observe that Euclidean linear interpolation in standard Flow Matching passes through the sphere's interior, causing amplitude collapse and phase destruction. By using Log/Exp maps to constrain the flow to the sphere via Geodesic Flow Matching (GFM), they reduce path error in spiking neural SLAM by 72% and enable a 1500-neuron path integrator to match the accuracy of a 2500-neuron baseline.
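The contrast between Euclidean and geodesic interpolation in the Geodesic Flow Matching entry can be illustrated on a plain unit sphere. This is a minimal sketch of the standard spherical Log/Exp maps only; it does not model SSPs, the Clifford torus structure, or the learned vector field, and all function names are illustrative.

```python
import numpy as np

def log_map(x, y, eps=1e-12):
    """Tangent vector at x pointing toward y, with length equal to the arc angle."""
    cos_theta = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    tangent = y - cos_theta * x          # component of y orthogonal to x
    norm = np.linalg.norm(tangent)
    if norm < eps:                       # x == y (antipodal points are not handled)
        return np.zeros_like(x)
    return theta * tangent / norm

def exp_map(x, v, eps=1e-12):
    """Walk from x along tangent vector v while staying on the unit sphere."""
    norm = np.linalg.norm(v)
    if norm < eps:
        return x.copy()
    return np.cos(norm) * x + np.sin(norm) * v / norm

def geodesic_interpolate(x0, x1, t):
    """Point at fraction t along the great-circle arc from x0 to x1."""
    return exp_map(x0, t * log_map(x0, x1))

def linear_interpolate(x0, x1, t):
    """Euclidean interpolation, which cuts through the sphere's interior."""
    return (1.0 - t) * x0 + t * x1

x0 = np.array([1.0, 0.0, 0.0])
x1 = np.array([0.0, 1.0, 0.0])
geodesic_mid = geodesic_interpolate(x0, x1, 0.5)
linear_mid = linear_interpolate(x0, x1, 0.5)
```

For these two orthogonal unit vectors the geodesic midpoint keeps unit norm, while the linear midpoint shrinks to norm \(\sqrt{0.5} \approx 0.707\), which is the amplitude collapse the summary describes.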
Geometry-Guided Modeling of Foundation Features Enables Generalizable Object Shape Deformation Learning: This paper proposes GODeform, which attaches 2D foundation model (e.g., DINOv3) features onto category template surfaces for geometry-guided propagation and cross-view fusion. It employs Flow Matching to learn a point-wise deformation field from template to target, enabling 3D shape recovery from a single image under large deformations, arbitrary viewpoints, and unseen categories, directly supporting dexterous grasping transfer.
HOI-PAGE: Zero-Shot Human-Object Interaction Generation with Part Affordance Guidance: HOI-PAGE enables an LLM to first "reason" precisely which body part should contact which object component, encoding this reasoning into a "Part Affordance Graph" (PAG). This PAG then drives 3D part segmentation, video diffusion, and optimization, generating 4D human-object interaction sequences for complex scenarios like "multiple people/single object" or "single person/multiple objects" without any 4D training data.
LabBuilder: Protocol-Grounded 3D Layout Generation for Interactable and Safe Laboratory: LabBuilder compiles free-text experimental descriptions into "asset-chemical protocols," then utilizes hierarchical generation combined with geometric/chemical multi-objective optimization and navigation repair to produce 3D chemistry laboratory layouts that are both visually plausible and executable for robotic experimental workflows.
PhyScene3D: Physically Consistent Interactive 3D Tabletop Scene Generation: PhyScene3D recasts 3D tabletop scene generation as a "human-constructive" hierarchical sequential planning problem: it linearizes scene graphs into an AABB-based anchor sequence using the Cognitive Topological Reasoning Chain (CTRC), and then embeds a differentiable SDF physics engine into the VLM training loop via Physics-Aware Denoising Alignment (PADA). As a result, the generated scenes are more physically plausible than the human-annotated training data (scene-level collision rate reduced from 81.5% to 41.6%, asset-level collision rate reduced to 3.86%).
PhysHanDI: Physics-Based Reconstruction of Hand-Deformable Object Interactions: This paper proposes PhysHanDI, which couples the MANO hand model with a Spring-Mass soft body model. It uses dense hand meshes to drive the physical simulation of deformable objects and inversely utilizes object simulation to refine hand reconstruction, achieving SOTA dense 3D reconstruction for both hands and soft objects on sparse-view RGB-D videos.
PLAID: A Unified Data Model for Machine Learning on Heterogeneous Physics Simulations: PLAID proposes a unified data model and open-source library for heterogeneous physical simulation data, releasing six industrial-grade datasets covering structural mechanics and CFD alongside reproducible benchmarks. It transforms real-world "variable mesh, variable topology, and variable dimension" simulation data into standardized benchmarks accessible to the machine learning community.
RelaxFlow: Text-Driven Amodal 3D Generation: RelaxFlow formulates "text-driven completion of occluded 3D objects" as a problem of decoupling control granularity for dual objectives. It proposes a training-free dual-branch inference framework: an observation branch maintains pixel-level hard constraints, while a semantic prior branch achieves low-pass relaxation through "multi-prior consensus + Gaussian blur on attention logits." The work theoretically proves that this relaxation is equivalent to low-pass filtering the generative vector field, reducing Point-FID from 100.38 to 81.11 on SOTA models like SAM3D/TRELLIS.
Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction: AmbiSuR explicitly models two types of endogenous photometric ambiguities in Gaussian Splatting (primitive edge overflow and pixel-mixing under-constraint) and disambiguates them using truncation and ray-color consistency. It further leverages high-order Spherical Harmonic (SH) coefficients as "self-indicators" to identify high-risk primitives, applying amorphous local prior regularization. AmbiSuR reduces the average Chamfer distance on DTU to 0.46, surpassing the previous state-of-the-art GeoSVR (0.47).
SIMPC: Learning Self-Induced Mirror-Point Consistency for Unsupervised Point Cloud Denoising: SIMPC proposes performing a "symmetric extension" along the denoising vector of the same noisy point to obtain a mirror point on the opposite side of the surface. A Mirror-Point Consistency Loss is then used to force the denoising targets of both points to coincide. This shifts unsupervised point cloud denoising from "finding statistical correspondences across multiple noise variants" to "finding deterministic geometric correspondences within a single point." It achieves performance significantly surpassing unsupervised SOTA and even beats several supervised methods on PUNet/PCNet synthetic data and Paris-Rue-Madame / Kinect real scans.
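One possible reading of SIMPC's mirror-point construction is sketched below. It assumes the mirror point is obtained by stepping twice the predicted denoising displacement from the noisy point, and that the consistency loss is a mean squared distance between the two denoised targets; both choices are assumptions made for illustration, and the two toy denoisers stand in for a learned network.

```python
import numpy as np

def mirror_points(noisy, displacement):
    """Assumed construction: extend the denoising vector to twice its length."""
    return noisy + 2.0 * displacement

def mirror_consistency_loss(noisy, denoiser):
    """Mean squared gap between the targets of each point and its mirror point."""
    displacement = denoiser(noisy)
    target = noisy + displacement
    mirrored = mirror_points(noisy, displacement)
    mirrored_target = mirrored + denoiser(mirrored)
    return float(np.mean(np.sum((target - mirrored_target) ** 2, axis=1)))

def plane_denoiser(points):
    """Oracle denoiser for the plane z = 0: project straight onto the plane."""
    displacement = np.zeros_like(points)
    displacement[:, 2] = -points[:, 2]
    return displacement

def half_step_denoiser(points):
    """Deliberately wrong denoiser that only moves halfway to the plane."""
    return 0.5 * plane_denoiser(points)

rng = np.random.default_rng(0)
noisy = rng.normal(size=(100, 3))
loss_oracle = mirror_consistency_loss(noisy, plane_denoiser)
loss_half = mirror_consistency_loss(noisy, half_step_denoiser)
```

With the oracle denoiser, each mirror point lands at the reflection across the plane and both targets coincide, so the loss is zero. The half-step denoiser is penalized even though no clean ground truth or second noisy copy was used.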
Smoothness Errors in Dynamics Models and How to Avoid Them: The authors theoretically demonstrate that the "unitary GNN" by Kiani et al. over-constrains physical systems that are "naturally smoothing" (such as heat diffusion) by strictly maintaining the Rayleigh quotient. They propose "relaxed unitary convolutions" (R-UniGraph / R-UniMesh) and extend the entire Rayleigh quotient-unitary convolution framework from graphs to triangular meshes, outperforming several strong baselines on MeshPDE and WeatherBench22 simultaneously.
SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion: This paper points out that "hard projection" in multi-modal point cloud completion, which directly maps 3D points onto a 2D grid, causes the support set to have zero Lebesgue measure and truncates gradients via Dirac delta functions (termed Cross-Modal Entropy Collapse). By replacing hard projection with continuous density estimation via differentiable Gaussian Soft Splatting, combined with a hybrid encoder (EdgeConv for local + Transformer for global) and a global-local decoder, the method achieves SOTA on PCN/ShapeNet-55/34. Furthermore, a counter-factual evaluation on KITTI demonstrates that baselines actually degenerate into "unimodal template retrievers."
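The difference between hard projection and Gaussian soft splatting in the SplAttN entry can be seen on a toy 2D grid. The sketch below splats a scalar density only; it is an assumed simplification that ignores the paper's feature channels, encoder, and decoder, and the function names and `sigma` value are illustrative. NumPy has no autodiff, so sensitivity to a small shift is used as a stand-in for a non-zero gradient.

```python
import numpy as np

def hard_project(uv, height, width):
    """Count points per pixel; the output is piecewise constant in uv."""
    image = np.zeros((height, width))
    cols = np.clip(np.floor(uv[:, 0]).astype(int), 0, width - 1)
    rows = np.clip(np.floor(uv[:, 1]).astype(int), 0, height - 1)
    np.add.at(image, (rows, cols), 1.0)
    return image

def soft_splat(uv, height, width, sigma=1.0):
    """Sum an isotropic Gaussian per point, evaluated at every pixel center."""
    ys, xs = np.mgrid[0:height, 0:width]
    dx = (xs + 0.5)[None] - uv[:, 0, None, None]
    dy = (ys + 0.5)[None] - uv[:, 1, None, None]
    weights = np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))
    return weights.sum(axis=0)

uv = np.array([[2.3, 4.6]])              # one projected point, in pixel units
shifted = uv + np.array([0.1, 0.0])      # small move that stays inside the pixel

hard_unchanged = np.array_equal(hard_project(uv, 8, 8), hard_project(shifted, 8, 8))
soft_unchanged = np.allclose(soft_splat(uv, 8, 8), soft_splat(shifted, 8, 8))
```

The hard projection does not react to the sub-pixel shift, so no gradient can flow back to the point position, whereas the soft splat changes smoothly with it.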
STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics–Physics Dual System: STABLE decomposes the process of "task instructions → simulation-ready tabletop scenes" into an LLM-based Semantic Reasoner (generating coarse layouts) and a flow-matching-based Physics Corrector with SDF losses (refining poses). By iterating through three stages—task-critical, important background, and secondary background—the system reduces object collisions to zero while achieving 99.0% scene alignment (AwS) on the MesaTask-10K dataset.
Streaming Sliced Optimal Transport: Stream-SW is the first algorithm capable of estimating the Sliced Wasserstein (SW) distance on a "sample stream": it maintains an approximate quantile function for each 1D projection using KLL/quantile sketches, turning the closed-form 1D Wasserstein integral into a streamable estimator. Its space complexity is only logarithmic in the number of samples, bringing sliced optimal transport to single-pass ("see once, then discard") settings such as IoT and edge devices.
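The quantile-function form of the 1D Wasserstein distance that Stream-SW relies on can be sketched as follows. This version is not streaming: `np.quantile` sees all samples at once and is used here as a stand-in for querying a quantile sketch such as KLL, and the number of projections and quantile levels are arbitrary illustrative choices.

```python
import numpy as np

def sliced_wasserstein_from_quantiles(X, Y, n_projections=64, n_quantiles=101, seed=0):
    """Estimate the sliced 2-Wasserstein distance from per-projection quantiles.

    Assumption: exact quantiles replace the sketch queries of a streaming method.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_projections, X.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    levels = (np.arange(n_quantiles) + 0.5) / n_quantiles   # midpoint quantile grid

    total = 0.0
    for theta in directions:
        qx = np.quantile(X @ theta, levels)   # quantile function of projected X
        qy = np.quantile(Y @ theta, levels)   # quantile function of projected Y
        total += np.mean((qx - qy) ** 2)      # discretized 1D W2^2 integral
    return float(np.sqrt(total / n_projections))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 1))
Y = X + 2.0                                   # same samples, shifted by 2
sw_same = sliced_wasserstein_from_quantiles(X, X)
sw_shifted = sliced_wasserstein_from_quantiles(X, Y)
```

In one dimension a pure shift moves every quantile by the same amount, so the estimate for the shifted copy equals the shift size. A streaming variant would keep only the per-projection sketches and build `qx` and `qy` from them.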
SVL: Spike-based Vision-Language Pretraining for Efficient 3D Open-World Understanding: SVL injects open-world understanding into Spiking Neural Networks (SNNs) via "3D-Image-Text" tri-modal contrastive pre-training. By "reparameterizing" the text encoder into a set of classification weights, the inference stage becomes entirely free of the text tower, remaining purely spike-driven. It achieves 85.4% zero-shot classification on ModelNet40 while consuming only 0.5%–11% of the energy of equivalent ANN methods.
The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity: This paper reveals the structural root of "attention sinking to the first token" in LLMs: the lack of value aggregation for the first token under causal masking leads to variance discrepancy, which is selectively amplified by super neurons in the FFN to form extreme dimensional disparity, eventually locking QK projections to force attention sinks. Based on this, head-wise RMSNorm is proposed to suppress sinks from the root during the pre-training stage.
TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization: TideGS migrates the 3DGS parameter table to an SSD, virtualizing it into "blocks" while utilizing GPU VRAM as a cache for the view-frustum visibility working set. Coupled with a three-stage asynchronous pipeline and trajectory-adaptive differential streaming, it pushes the scale of trainable Gaussians from approximately 11M (native 3DGS) or 105M (CLM) to over 1 billion on a single 24 GB GPU, achieving large-scene reconstruction quality superior to all evaluated single-GPU baselines.
Trust3R: Evidential Uncertainty for Feed-Forward 3D Reconstruction: Trust3R introduces a probabilistic evidential learning framework for feed-forward 3D reconstruction models like MASt3R. By utilizing a Normal-Inverse-Wishart prior to predict a closed-form multivariate Student-t distribution for each 3D point, it replaces heuristic confidence scores. This allows for the output of probabilistically interpretable point-wise uncertainty in a single forward pass, reducing AURC by 25% and AUSE by 41% on ScanNet++.
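The closed-form Student-t distribution mentioned in the Trust3R entry follows from the textbook Normal-Inverse-Wishart result, sketched below for a single 3D point. How Trust3R's network outputs or constrains the prior parameters is not described in this summary, so the parameter names (`mu0`, `kappa`, `nu`, `Psi`) and the example values are assumptions.

```python
import numpy as np

def niw_predictive_student_t(mu0, kappa, nu, Psi):
    """Predictive Student-t implied by a Normal-Inverse-Wishart prior.

    Returns (degrees of freedom, location, scale matrix) for a d-dimensional point.
    """
    d = mu0.shape[0]
    dof = nu - d + 1
    if dof <= 0:
        raise ValueError("nu must exceed d - 1 for a valid Student-t")
    scale = Psi * (kappa + 1.0) / (kappa * dof)
    return dof, mu0, scale

def student_t_covariance(dof, scale):
    """Covariance of a multivariate Student-t; finite only when dof > 2."""
    if dof <= 2:
        raise ValueError("covariance is undefined for dof <= 2")
    return scale * dof / (dof - 2.0)

mu0 = np.zeros(3)            # predicted 3D point location
Psi = np.eye(3)              # predicted scale matrix of the inverse-Wishart part
dof, location, scale = niw_predictive_student_t(mu0, kappa=1.0, nu=6.0, Psi=Psi)
covariance = student_t_covariance(dof, scale)
```

A scalar per-point uncertainty could then be read off as, for example, the trace of `covariance`; lower `kappa` or `nu` (less evidence) inflates it.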
Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation: KeyVT decomposes the process of feeding multi-view images sampled from 3D point clouds into 2D VLMs for 3D QA into a two-level hierarchical workflow: "key view selection" followed by "key token selection." At the view level, it uses camera geometry to partition the scene into spatially continuous sub-scenes and allocates token budgets based on relevance. At the token level, it employs Optimal Transport (OT) to eliminate cross-view redundancy. This training-free method approaches or exceeds the performance of supervised models on ScanQA, SQA3D, and VSI-Bench.