
🧊 3D Vision

📷 CVPR2025 · 364 paper notes

📌 Same area in other venues: 📷 CVPR2026 (751) · 🔬 ICLR2026 (197) · 🧪 ICML2026 (30) · 🤖 AAAI2026 (79) · 🧠 NeurIPS2025 (116) · 📹 ICCV2025 (267)

🔥 Top topics: 3D Gaussian Splatting ×61 · Diffusion Models ×26 · 3D Reconstruction ×18 · Adversarial Robustness ×16 · Segmentation ×15

3D-GSW: 3D Gaussian Splatting for Robust Watermarking: This paper proposes 3D-GSW, the first robust digital watermarking method designed specifically for 3D Gaussian Splatting. It enhances watermark robustness by removing redundant Gaussians and splitting Gaussians in high-frequency regions via Frequency-Guided Densification (FGD). Combined with a gradient mask and wavelet sub-band loss to maintain rendering quality, 3D-GSW achieves superior watermark robustness and rendering quality across the Blender, LLFF, and Mip-NeRF 360 datasets.
3D-HGS: 3D Half-Gaussian Splatting: This work proposes the 3D Half-Gaussian (3D-HGS) reconstruction kernel, which splits a 3D Gaussian into two halves using a cutting plane, each having independent opacity. Acting as a plug-and-play reconstruction kernel to replace standard Gaussian kernels, it significantly enhances rendering quality at shape and color discontinuities without sacrificing rendering speed, outperforming all SOTA methods on Mip-NeRF360, Tanks & Temples, and Deep Blending.
3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer: This work proposes 3D-LLaVA, a general-purpose 3D Large Multimodal Model (LMM) with a minimalist architecture. Its core is the Omni Superpoint Transformer (OST), a versatile visual connector that simultaneously serves as a visual feature selector, a visual prompt encoder, and a segmentation mask decoder. Using only point cloud inputs, it achieves state-of-the-art (SOTA) performance across five benchmarks, including ScanQA (92.6 CIDEr) and ScanRefer (43.3 mIoU).
3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning: This paper proposes 3D-Mem, a 3D scene memory framework based on "Memory Snapshots." It compactly represents explored areas using a small set of curated multi-view images and models unexplored regions via Frontier Snapshots, enabling efficient embodied exploration and reasoning in combination with VLMs.
3D-SLNR: A Super Lightweight Neural Representation for Large-scale 3D Mapping: This paper proposes 3D-SLNR, a super lightweight neural 3D representation. It defines the global Signed Distance Function (SDF) based on a collection of band-limited local SDFs anchored on support points of a point cloud. Each local SDF is parameterized by a single shared tiny MLP (without latent feature vectors). The output of the MLP is modulated by learnable geometric attributes (position, rotation, and scale) to adapt to complex geometries in different regions. Combined with a parallel query algorithm and a prune-and-expand strategy, it achieves SOTA reconstruction quality with less than 1/5 of the memory footprint of previous methods.
3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes: Replaces Gaussian primitives with 3D smooth convex primitives for radiance field rendering. Each convex is defined by the hull of a point set, smoothed with LogSumExp, and rendered by a custom CUDA rasterizer. The method outperforms 3DGS on T&T and Deep Blending while using fewer primitives.
3D Dental Model Segmentation with Geometrical Boundary Preserving: This paper proposes CrossTooth, which utilizes selective downsampling based on curvature priors (increasing vertex density in boundary areas by 10-15%) and cross-modal boundary feature fusion with multi-view rendered images. It achieves 95.86% mIoU and 82.05% boundary IoU on the public 3DTeethSeg'22 dataset, outperforming the previous SOTA (ToothGroupNet) by 2.3% and 5.7% respectively.
3D Gaussian Head Avatars with Expressive Dynamic Appearances by Compact Tensorial Representations: A 3D Gaussian head avatar method with a compact tensorial representation is proposed, which stores the static neutral-expression appearance using canonical tri-planes and the dynamic texture (opacity offset) of each blendshape using lightweight 1D feature lines. It achieves 300 FPS real-time rendering and accurate capture of dynamic facial details with only 10 MB of storage, outperforming GA, GBS, and GHA in PSNR and storage efficiency on the NeRSemble dataset.
3D Gaussian Inpainting with Depth-Guided Cross-View Consistency: This paper proposes 3DGIC, a framework that achieves object removal and inpainting in 3D Gaussian Splatting scenes through depth-guided cross-view consistent inpainting. By leveraging rendered depth maps, it projects background pixels visible from other views onto the masked region to refine the inpainting mask. Then, 2D inpainting results from a reference view are projected onto 3D space to constrain cross-view consistency for other views. The proposed method outperforms existing approaches in FID and LPIPS on the SPIn-NeRF dataset.
3D Student Splatting and Scooping (SSS): This work proposes SSS (Student Splatting and Scooping), advancing the 3DGS paradigm with three innovations: (1) replacing Gaussian distributions with Student-t distributions as mixture components (with learnable tail thickness that varies continuously from Cauchy to Gaussian); (2) introducing negative density components (scooping by subtracting color) to extend the formulation to non-monotonic mixture models; (3) employing SGHMC sampling instead of SGD to decouple parameter optimization. SSS achieves state-of-the-art results in 6 out of 9 metrics across Mip-NeRF360, T&T, and Deep Blending, and is highly parameter-efficient, matching or exceeding 3DGS with only 18% of the component count.
3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement: Proposes a 3D enhancement framework based on a multi-view latent diffusion model. By incorporating a pose-aware encoder, multi-view row attention, and adjacent-view epipolar aggregation modules, it significantly enhances the texture quality of low-quality 3D generation results while maintaining cross-view consistency.
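
To make the LogSumExp smoothing mentioned in the 3D Convex Splatting entry concrete, here is a minimal sketch. It assumes a convex shape is described by half-spaces and that the hard maximum of the plane distances is replaced by a LogSumExp soft maximum. The function names, the `delta` sharpness parameter, and the cube example are hypothetical illustrations, not the paper's formulation or code.

```python
import math


def smooth_max(values, delta):
    """LogSumExp approximation of max(values); a larger delta gives a sharper result."""
    peak = max(values)  # subtract the maximum for numerical stability
    total = sum(math.exp(delta * (v - peak)) for v in values)
    return peak + math.log(total) / delta


def convex_signed_distance(point, planes, delta=10.0):
    """Approximate signed distance to a convex shape given as half-spaces.

    planes is a list of (normal, offset) pairs; each plane distance is
    dot(normal, point) + offset, which is negative on the inner side.
    """
    distances = [
        sum(n_i * p_i for n_i, p_i in zip(normal, point)) + offset
        for normal, offset in planes
    ]
    return smooth_max(distances, delta)


# Hypothetical example: an axis-aligned unit cube centred at the origin.
cube = [
    ((1, 0, 0), -0.5), ((-1, 0, 0), -0.5),
    ((0, 1, 0), -0.5), ((0, -1, 0), -0.5),
    ((0, 0, 1), -0.5), ((0, 0, -1), -0.5),
]
inside = convex_signed_distance((0.0, 0.0, 0.0), cube)
outside = convex_signed_distance((1.0, 0.0, 0.0), cube)
```

The soft maximum never falls below the true maximum and exceeds it by at most `log(n) / delta` for `n` planes, so raising `delta` recovers hard edges while lower values round them off.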

3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting

4Deform: Neural Surface Deformation for Robust Shape Interpolation: This paper proposes the 4Deform framework, which achieves robust shape interpolation based on neural implicit representation and continuous velocity field learning. By linking the implicit field and the velocity field via a modified level-set equation, it achieves SOTA performance for the first time across scenarios involving noise, partial observations, topological changes, and non-isometric deformations, while supporting temporal super-resolution of real-world Kinect point cloud sequences.
4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video: Disentangles 4D equine reconstruction from monocular video into motion estimation (AniMoFormer spatio-temporal Transformer) and appearance reconstruction (EquineGS single-image feed-forward 3DGS). Leveraging the VAREN parametric model and two large-scale synthetic datasets, it achieves SOTA geometry and appearance reconstruction results on real-world data and generalizes zero-shot to donkeys and zebras.
4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video: This paper proposes 4DGC, a rate-distortion-aware 4D Gaussian compression framework. By adopting motion-aware dynamic Gaussian modeling (multi-resolution motion grids + sparse compensatory Gaussians) and end-to-end compression (differentiable quantization + implicit entropy model), 4DGC achieves 16× compression over 3DGStream without sacrificing rendering quality.
4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians: This paper presents the first 4D tracking and mapping method (4DTAM) based on differentiable rendering and 2D Gaussian surface primitives. By jointly optimizing camera poses, scene geometry, appearance, and dynamic deformation fields, 4DTAM achieves real-time reconstruction of non-rigid dynamic scenes from monocular RGB-D video streams. It also releases a novel synthetic 4D dataset, Sim4D, for evaluation.
A2Z-10M+: Geometric Deep Learning with A-to-Z BRep Annotations for AI-Assisted CAD Modeling and Reverse Engineering: The A2Z dataset is constructed containing over 1 million complex CAD models and more than 10 million multimodal annotations (high-resolution 3D scans, freehand 3D sketches, text descriptions, and BRep topology labels), representing the largest dataset for CAD reverse engineering to date. Based on this, foundation models for BRep boundary and corner detection are trained.
A Lightweight UDF Learning Framework for 3D Reconstruction Based on Local Shape Functions: This paper proposes LoSF-UDF, a lightweight framework for learning Unsigned Distance Fields (UDFs) based on local shape functions. Trained only once on synthetic local point cloud patches (653KB parameters, 0.5GB data), it generalizes to reconstruct various types of 3D surfaces and exhibits robustness against noise and outliers.
A Unified Image-Dense Annotation Generation Model for Underwater Scenes: This paper proposes TIDE, a unified text-to-image and dense annotation generation method. Relying solely on text as input, it simultaneously generates highly consistent underwater images, depth maps, and semantic masks. By ensuring consistency across multimodal outputs through Implicit Layout Sharing (ILS) and Time Adaptive Normalization (TAN) mechanisms, the synthesized SynTIDE dataset significantly enhances the performance of underwater depth estimation and semantic segmentation.
ActiveGAMER: Active GAussian Mapping through Efficient Rendering: ActiveGAMER is proposed, representing the first attempt to utilize 3D Gaussian Splatting for active mapping. By efficiently selecting the optimal next-best-view via a rendering-based information gain module, integrated with a coarse-to-fine exploration strategy, post-refinement, and a global-local keyframe policy, ActiveGAMER significantly outperforms NeRF-based methods in both geometric accuracy and rendering fidelity on the Replica and MP3D datasets.
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis: This paper proposes the AerialMegaDepth dataset generation framework. By co-registering pseudo-synthetic aerial renders from Google Earth and real ground-level images from MegaDepth into a unified coordinate system, it constructs a large-scale training dataset of 132k mixed-altitude images. Fine-tuning DUSt3R on this dataset improves the camera rotation estimation accuracy for aerial-ground pairs from 5% to 56%, while also significantly enhancing the quality of novel view synthesis.
AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction: Generates an animatable 3D human from a single image. It first employs an adapted CogVideo to generate multi-view canonical pose images (including normals), then models multi-view inconsistency as temporal changes in 4DGS to extract a consistent canonical-space Gaussian model, and finally drives animation via SMPL-X skinning.
Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking: Any3DIS is proposed, which replaces traditional unsupervised merging strategies with 3D-aware 2D mask tracking (utilizing SAM-2 to track the 2D segmentations of each superpoint across multiple frames) and optimizes 3D proposals using dynamic programming. It achieves state-of-the-art (SOTA) results in class-agnostic, open-vocabulary, and open-ended 3D instance segmentation tasks on ScanNet200 and ScanNet++.
ARM: Appearance Reconstruction Model for Relightable 3D Generation: This paper proposes the ARM framework, which decouples geometry and appearance generation. It reconstructs high-quality textures in the UV texture space using back-projection and global-receptive-field networks, while introducing material priors to resolve the ambiguity between material and illumination under sparse views. Trained on only 8 H100 GPUs, it outperforms existing methods on GSO and OmniObject3D.
ASHiTA: Automatic Scene-grounded Hierarchical Task Analysis: Proposes ASHiTA, the first framework to automatically decompose high-level tasks into hierarchies of scene-grounded subtasks. By alternating LLM-assisted hierarchical task analysis with task-driven 3D scene graph construction based on the Information Bottleneck principle, it reasons jointly over the task hierarchy and the scene representation.
BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis: Revisiting 3D semantic segmentation from the perspective of error analysis, this study classifies segmentation errors into four categories (region classification, displacement, merge, and false response) and designs corresponding evaluation metrics. It proposes BFANet, which enhances boundary awareness through a boundary-semantic decoupling module and real-time boundary pseudo-label computation, achieving 36.0 mIoU on the ScanNet200 test set (the highest score without utilizing auxiliary data during training).
BLADE: Single-view Body Mesh Learning through Accurate Depth Estimation: This paper proposes BLADE, which decouples perspective projection parameters by accurately estimating the pelvic depth \(T_z\) of the human body, recovers the human mesh using a \(T_z\)-aware pose estimator, and finally solves for the focal length and XY translation via differentiable rasterization. It achieves, for the first time, accurate recovery of both perspective projection parameters and the 3D human mesh from a single image without relying on heuristic orthographic-camera assumptions.
Blurry-Edges: Photon-Limited Depth Estimation from Defocused Boundaries: This paper proposes a depth estimation method based on a novel image patch representation termed Blurry-Edges. By modeling the smoothness of defocused boundaries, it achieves robust depth estimation under extremely low-light (photon-limited) conditions from a pair of images with different defocus levels, improving noise robustness by over 4 times compared to existing DfD methods.
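
As a rough illustration of why the pelvic depth \(T_z\) in the BLADE entry matters, the sketch below compares a generic pinhole projection with a weak-perspective (scaled orthographic) approximation. It is a textbook camera model under assumed values, not BLADE's pipeline. The function names and the sample point are hypothetical.

```python
def perspective_project(point, focal, t_z):
    """Pinhole projection of a body-relative point placed at depth t_z."""
    x, y, z = point
    depth = z + t_z
    if depth <= 0:
        raise ValueError("point must lie in front of the camera")
    return (focal * x / depth, focal * y / depth)


def weak_perspective_project(point, focal, t_z):
    """Scaled orthographic projection: every point shares the depth t_z."""
    x, y, _ = point
    scale = focal / t_z
    return (scale * x, scale * y)


def relative_gap(point, focal, t_z):
    """Relative horizontal disagreement between the two camera models."""
    u_persp, _ = perspective_project(point, focal, t_z)
    u_weak, _ = weak_perspective_project(point, focal, t_z)
    return abs(u_weak - u_persp) / abs(u_weak)


# Hypothetical hand position in metres, 0.2 m closer to the camera than the pelvis.
hand = (0.3, 0.1, -0.2)
close_up_gap = relative_gap(hand, focal=1000.0, t_z=1.0)
far_away_gap = relative_gap(hand, focal=1000.0, t_z=10.0)
```

With these assumed numbers the two models disagree by about 25% at a depth of 1 m and by about 2% at 10 m, which shows why close-range images benefit from an explicit depth estimate.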

CADCrafter: Generating Computer-Aided Design Models from Unconstrained Images

CADDreamer: CAD Object Generation from Single-view Images: This paper proposes CADDreamer, which directly generates CAD models with compact B-rep representations, clear structures, and sharp edges from a single RGB image. Utilizing a semantic-enhanced multi-view diffusion model and a geometric-topological extraction module, it supports five primitive types: planes, cylinders, cones, spheres, and tori.
Category-Agnostic Neural Object Rigging: Proposes CANOR (Category-Agnostic Neural Object Rigging), which automatically discovers low-dimensional pose spaces of deformable objects in a completely category-agnostic, data-driven manner by encoding deformable 4D objects into a sparse set of spatially localized blobs and instance-aware feature volumes, enabling intuitive pose manipulation.
CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework: Proposes CMMLoc, an uncertainty-aware text-to-point cloud localization framework based on the Cauchy Mixture Model (CMM). By modeling the coarse retrieval stage as a partially relevant retrieval problem and introducing a CMM Transformer and a cardinal direction integration module, it achieves SOTA performance on the KITTI360Pose dataset.
COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting: Proposes COB-GS, a boundary-adaptive Gaussian splitting technique driven by semantic gradient statistics. It jointly optimizes semantic information and visual texture to resolve the blurred object boundary issue in 3DGS segmentation, achieving clear object boundary segmentation while preserving visual quality.
CoCoGaussian: Leveraging Circle of Confusion for Gaussian Splatting from Defocused Images: This paper proposes CoCoGaussian, which utilizes physical photographic defocus principles (Circle of Confusion) to model defocus blur within the 3D Gaussian Splatting framework, enabling accurate 3D scene reconstruction and sharp novel view rendering using only defocused images.
Coherent 3D Portrait Video Reconstruction via Triplane Fusion: A triplane fusion-based method is proposed to fuse personalized 3D priors with frame-by-frame observations, achieving both temporal coherence and faithful reconstruction of dynamic appearances from monocular RGB videos for 3D telepresence.
ColabSfM: Collaborative Structure-from-Motion by Point Cloud Registration: The ColabSfM paradigm is proposed, which merges distributed SfM reconstruction results via 3D point cloud registration (rather than visual descriptor matching). In addition, a dedicated SfM registration dataset generation pipeline and an improved registration model, RefineRoITr, are designed.
CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis: This paper proposes CoMapGS, which utilizes pixel-level covisibility maps to guide initial point cloud enhancement and adaptively weighted supervision in sparse-view 3DGS, representing the first attempt to explicitly focus on and recover high-uncertainty single-view regions.
CoMatcher: Multi-View Collaborative Feature Matching: Proposes CoMatcher, a multi-view collaborative feature matcher that shifts from the independent two-view matching paradigm to a 1-to-N collaborative matching paradigm, leveraging contextual cues from complementary views and cross-view projective consistency constraints to improve matching reliability in complex scenes.
Consistency-aware Self-Training for Iterative-based Stereo Matching: This paper proposes the first consistency-aware self-training framework (CST-Stereo) for iterative-based stereo matching. It evaluates pseudo-label reliability through multi-resolution prediction consistency filtering and iterative prediction consistency filtering, and combines them with a soft-weighted loss to leverage unlabeled real-world data effectively, thereby improving model performance and generalization.
Continuous 3D Perception Model with Persistent State: Proposes CUT3R (Continuous Updating Transformer for 3D Reconstruction), a recurrent model that maintains a persistent internal state, allowing online, incremental metric-scale 3D reconstruction and camera pose estimation from image streams, while enabling inference of 3D structures in unobserved regions.
Cross-View Completion Models are Zero-shot Correspondence Estimators: Reveals that the cross-attention maps in cross-view completion (CVC) models inherently learn precise dense correspondence, and proposes ZeroCo to leverage this finding in zero-shot matching and learning-based geometric matching, significantly outperforming conventional methods based on encoder/decoder features.
CrossOver: 3D Scene Cross-Modal Alignment: The CrossOver framework is proposed to learn a unified scene-level cross-modal embedding space for RGB images, point clouds, CAD models, floor plans, and textual descriptions without requiring complete modal pairing. It utilizes dimensionality-specific encoders and a three-stage training pipeline to support flexible cross-modal retrieval and localization.
Ctrl-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion: Fine-tunes the InstructPix2Pix model using a single edited reference image to "learn" the editing capability, which, combined with a two-stage deformable 3D Gaussian optimization, achieves controllable and consistent dynamic 3D scene editing.
DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh: DAGSM is proposed, a text-driven disentangled digital human generation method that represents the body and individual clothing items separately using GS-enhanced Mesh (GSM), supporting outfit customization, realistic animation, and texture editing.
DashGaussian: Optimizing 3D Gaussian Splatting in 200 Seconds: DashGaussian is proposed, offering a joint framework for scheduling rendering resolution and Gaussian primitive count based on frequency analysis. It reformulates 3DGS optimization as a progressive fitting of high-frequency components, achieving an average acceleration of 45.7% without compromising rendering quality.
Decompositional Neural Scene Reconstruction with Generative Diffusion Prior: Proposes DP-Recon, which introduces generative diffusion priors (SDS) into decompositional neural scene reconstruction. By dynamically adjusting pixel-wise SDS weights using visibility guidance, it resolves conflicts between reconstruction objectives and generation guidance, achieving complete object geometry and appearance recovery under sparse views.
DEFOM-Stereo: Depth Foundation Model Based Stereo Matching: Integrates the monocular depth foundation model (Depth Anything V2) into the recurrent stereo matching framework RAFT-Stereo. By incorporating combined feature encoders and a scale update module, this approach achieves state-of-the-art stereo matching performance across multiple benchmarks while preserving strong generalization capabilities.
Deformable Radial Kernel Splatting: This paper proposes Deformable Radial Kernels (DRK) to generalize traditional Gaussian splatting. By leveraging learnable radial basis functions, \(L_1\)/\(L_2\) norm blending, and edge-sharpening mechanisms, it achieves higher quality 3D scene rendering with fewer primitives.
Denoising Functional Maps: Diffusion Models for Shape Correspondence: This paper proposes DenoisFM, the first method to apply denoising diffusion models to directly predict functional maps between shapes. It reduces learning complexity through template matching and introduces an unsupervised approach to resolve the sign ambiguity of Laplace eigenvectors, achieving competitive performance on human and animal shape matching.
Dense-SfM: Structure from Motion with Dense Consistent Matching: Proposes the Dense-SfM framework, which resolves the fragmented track issue of dense matches through Gaussian Splatting-based track expansion, and combines a Transformer and Gaussian Process-based multi-view kernelized match refinement module to achieve high-precision dense SfM reconstruction.
Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction: This paper proposes a depth-guided bundle (GDB) sampling strategy that groups adjacent rays into bundles for joint processing via sphere-cone sampling. Concurrently, it adaptively allocates the number of sampling points based on depth confidence. When applied to ENeRF and MVSGaussian, it achieves a 1.27dB PSNR improvement and a 47% speedup in FPS on the DTU dataset.
Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera: The Depth Any Camera (DAC) framework is proposed, which enables zero-shot metric depth estimation generalizing to fisheye and 360° cameras with training restricted solely to perspective images. By utilizing ERP unified representation, pitch-aware transformation, and FoV alignment, DAC improves \(\delta_1\) accuracy by up to 50% on wide-FoV datasets.
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos: Leverages a pretrained video diffusion model (SVD) for video depth estimation. Through a three-stage training strategy, it realizes temporally consistent depth sequence generation of variable lengths (up to 110 frames). By designing a segment-based inference strategy, it supports extremely long videos, comprehensively outperforming existing methods in zero-shot settings.
DepthCues: Evaluating Monocular Depth Perception in Large Vision Models: This paper proposes the DepthCues benchmark to systematically evaluate the depth perception capabilities of 20 large-scale pre-trained vision models across six human monocular depth cue tasks (elevation, light-shadow, occlusion, perspective, size, and texture gradient), revealing the emergence of human-like depth cues in modern vision models.
DepthSplat: Connecting Gaussian Splatting and Depth: Connects Gaussian Splatting (3DGS) and depth estimation, two tasks typically studied independently. Pre-trained monocular depth features enhance the multi-view depth model and thereby improve 3DGS reconstruction quality, while the photometric rendering loss of 3DGS serves as an unsupervised pre-training objective for learning a robust depth model. Both tasks achieve state-of-the-art (SOTA) performance across multiple datasets.
DeSplat: Decomposed Gaussian Splatting for Distractor-Free Rendering: DeSplat proposes decomposing 3D Gaussian Splatting into static scene Gaussians and view-specific distractor Gaussians. It accomplishes scene-distractor separation purely based on volume rendering, requiring no external semantic models. It achieves comparable distractor-free novel view synthesis performance to prior methods across three benchmark datasets without sacrificing rendering speed.
DiET-GS: Diffusion Prior and Event Stream-Assisted Motion Deblurring 3D Gaussian Splatting: A two-stage framework, DiET-GS, is proposed. It jointly constrains 3DGS optimization using event double-integral (EDI) priors and a pre-trained diffusion model to reconstruct clean 3D representations from blurry multi-view images and event streams, achieving high-quality novel view synthesis with accurate colors and fine details.
DiffPortrait360: Consistent Portrait Diffusion for 360° View Synthesis: Presents the first method capable of generating consistent 360° full head views from a single portrait. Using a dual-appearance control module, a back-view generation ControlNet, and a continuous view sequence training strategy, it supports real human, stylized, and anthropomorphic characters, and can be converted into high-quality NeRF for real-time free-viewpoint rendering.
Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models: Proposes Difix3D+, which utilizes a fine-tuned single-step diffusion model (SD-Turbo) to progressively generate pseudo-training views during the training phase to feed back into the 3D representation, and serves as a real-time post-processing enhancer during inference. It is compatible with both NeRF and 3DGS, achieving an average improvement of over 2x on FID.
Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset: Meta Reality Labs proposes the DTC dataset, containing 2,000 3D object digital twin models with millimeter-level geometric precision and photorealistic PBR materials, paired with evaluation data captured via DSLR and egocentric AR glasses, establishing the first comprehensive real-world benchmark for 3D reconstruction and inverse rendering.
Disco4D: Disentangled 4D Human Generation and Animation from a Single Image: Disco4D proposes a 4D human generation framework that disentangles clothing (represented by a Gaussian model) from the human body (represented by the SMPL-X model), generating animatable, editable, and layered 3D clothed human models from a single image, and supporting realistic 4D clothing dynamics.
DoF-Gaussian: Controllable Depth-of-Field for 3D Gaussian Splatting: DoF-Gaussian introduces a learnable lens imaging model based on geometric optics for 3D Gaussian Splatting representations. By integrating scene-wise depth prior adjustment and a defocus-to-focus adaptation strategy, it reconstructs sharp 3D scenes from shallow depth-of-field (defocus-blurred) inputs and supports controllable depth-of-field rendering (including refocusing, aperture adjustment, and bokeh shape transformation).
Doppelgangers++: Improved Visual Disambiguation with Geometric 3D Features: This paper proposes Doppelgangers++, which significantly improves the precision and generalization of doppelganger (visually ambiguous image pair) detection by introducing diverse everyday-scene training data from VisymScenes and training a Transformer classifier using 3D-aware features from the multi-layer decoder of MASt3R. It integrates into COLMAP and MASt3R-SfM pipelines to improve 3D reconstruction quality in scenes with repetitive structures.
Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration: This paper proposes Dr. Splat, which bypasses the rendering process and directly registers language-aligned CLIP embeddings onto 3D Gaussians. Combined with Product Quantization (PQ) pre-trained on large-scale image data, it achieves a 6.25% embedding compression rate. Without requiring any scene-specific optimization (~10 minutes vs. 1–24 hours in prior work), it significantly outperforms existing methods in open-vocabulary 3D semantic segmentation, 3D object grounding, and 3D object selection.
DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery: DroneSplat proposes a robust 3DGS framework for in-the-wild drone imagery, mitigating dynamic distractors via an adaptive local-global masking strategy, addressing reconstruction quality under limited views using Multi-View Stereo (MVS)-based geometry-aware point sampling and voxel-guided optimization, and presenting a dataset of 24 drone reconstruction scenes.
DropGaussian: Structural Regularization for Sparse-view Gaussian Splatting: DropGaussian proposes a simple regularization method without additional priors. By randomly dropping Gaussians during 3DGS training and introducing an opacity compensation factor, it ensures that occluded, distant Gaussians receive larger gradients and visibility. Coupled with a progressive dropping rate strategy, it effectively mitigates overfitting under sparse-view conditions and achieves performance comparable to prior-based methods without increasing computational complexity.
DropoutGS: Dropping Out Gaussians for Better Sparse-view Rendering: DropoutGS mitigates overfitting in sparse-view 3DGS through Random Dropout Regularization (RDR), and compensates for high-frequency details lost in low-complexity models using an Edge-guided Splitting Strategy (ESS). Serving as a plug-and-play module, it can be integrated with various 3DGS methods, achieving SOTA performance on LLFF, DTU, and Blender.
DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering: DSPNet introduces a dual-vision scene perception network. To overcome limitations regarding fine-grained perception and robust reasoning in 3D QA, it comprehensively integrates point clouds and multi-view images through three joint modules: Text-guided Multi-view Fusion (TGMF), Adaptive Dual-vision Perception (ADVP), and Multimodal Context-guided Reasoning (MCGR), achieving SOTA results on the SQA3D and ScanQA datasets.
Dual Exposure Stereo for Extended Dynamic Range 3D Imaging: This paper proposes Dual-Exposure Stereo, a method that extends the effective dynamic range by automatically controlling the dual-exposure parameters of stereo cameras, and designs a motion-aware dual-exposure depth estimation network to achieve robust 3D imaging in wide dynamic range scenes.
Dual Exposure Stereo for Extended Dynamic Range 3D Imaging: Proposes a Dual-Exposure Stereo method that utilizes Automatic Dual-Exposure Control (ADEC) to apply different exposures in alternating frames, combined with a motion-aware dual-exposure feature fusion network for disparity estimation. This extends the effective dynamic range of stereo cameras to 160% and achieves robust 3D imaging under extreme lighting conditions.
DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction: Proposes Dual Point Maps (DualPM), which simplifies 3D shape and pose reconstruction of deformable objects into a point map prediction problem by simultaneously predicting a pair of point maps in camera space and canonical space, generalizing to real images using only synthetic training data.
DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction: This work proposes the Dual Point Maps (DualPM) representation, which predicts a pair of point maps (camera-space \(P\) and canonical-space \(Q\)) from a single image. This simplifies the 3D shape and pose reconstruction of deformable objects into a point map prediction problem. Additionally, a layered amodal point map is introduced to achieve complete shape recovery (including self-occluded parts), generalizing to real-world images while training with only 1–2 synthetic 3D models.
DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers: DUNE proposes a co-distillation framework for heterogeneous teachers, unifying 2D (DINOv2) and 3D (MASt3R, Multi-HMR) teacher models from different tasks and data domains into a single ViT-Base universal encoder. It matches or exceeds the performance of their respective ViT-Large teachers across multiple tasks such as semantic segmentation, depth estimation, 3D reconstruction, and human pose recovery.
DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers: Proposes DUNE, which pioneers the study of heterogeneous teacher distillation (co-distillation). It distills a ViT-Base universal encoder from teacher models with highly distinct task objectives and training data (DINOv2 + MASt3R + Multi-HMR). The student achieves teacher-level performance across 2D vision, 3D scene understanding, and 3D human perception tasks.
Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera: Dyn-HaMR proposes the first optimization framework to recover 4D global motion trajectories of interacting hands from monocular videos captured by a dynamic camera. Utilizing a three-stage pipeline (hierarchical initialization \(\rightarrow\) SLAM-guided global motion optimization \(\rightarrow\) interacting motion prior optimization), it decouples camera motion from hand motion and significantly outperforms existing methods across multiple datasets.
Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis: This paper proposes Dynamic Spherical Neural Surfaces (D-SNS), which model genus-0 4D surfaces as spatio-temporal continuous functions using MLPs. Spatio-temporal registration, geodesic calculation, and mean estimation are directly performed in SRNF/SRVF spaces without discretization, outperforming 4D Atlas on 4D human and facial datasets.
Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses: Proposes two models, MultiHeadDepth and HomoDepth, which address the latency bottlenecks of the cost volume and of preprocessing in stereo depth estimation, respectively, using a hardware-friendly multi-head cost volume (approximating cosine similarity via LayerNorm + dot product, combined with group-wise pointwise convolutions) and a homography estimation network with 2D Rectified Positional Encoding (RPE). In AR glasses scenarios, this improves accuracy by 11.8-30.3% while reducing end-to-end latency by 44.5%.
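The "LayerNorm + dot product" approximation mentioned in the note above can be illustrated with a small sketch. It assumes a LayerNorm without learned scale or shift and with zero epsilon, in which case the scaled dot product equals the cosine similarity of the mean-centered vectors. The function names and feature values are hypothetical, and the paper's actual cost-volume layer may differ.

```python
import math

def layer_norm(x):
    """Zero-mean, unit-variance normalization (no learned scale/shift, eps = 0).
    Assumes x is not constant, otherwise the variance is zero."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var) for v in x]

def center(x):
    """Subtract the mean from every component."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def cosine(a, b):
    """Reference cosine similarity with explicit norms."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def layernorm_dot_similarity(a, b):
    """Dot product of layer-normalized vectors divided by the dimension.
    Each normalized vector has squared norm len(a), so no per-pair norm
    computation is needed."""
    na, nb = layer_norm(a), layer_norm(b)
    return sum(p * q for p, q in zip(na, nb)) / len(a)

# Hypothetical left/right feature vectors at one pixel and disparity.
left = [0.2, -1.3, 0.7, 2.1, -0.4, 0.9]
right = [0.1, -1.0, 0.9, 1.8, -0.6, 1.2]

approx = layernorm_dot_similarity(left, right)
exact = cosine(center(left), center(right))
```

Under these assumptions `approx` and `exact` agree up to floating-point error; the gap to plain cosine similarity comes only from the mean subtraction.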
Instruct-4DGS: Efficient Dynamic Scene Editing via 4D Gaussian-based Static-Dynamic Separation: Instruct-4DGS leverages the inherent separability of static 3D Gaussians and Hexplane deformation fields in 4D Gaussian Splatting (4DGS) to achieve efficient dynamic scene editing by editing only the static canonical Gaussians. Temporal alignment is refined via Coherent-IP2P-driven score distillation to eliminate motion artifacts, reducing the editing time by more than half while requiring only a single GPU.
EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision: EgoPressure introduces the first egocentric dataset for hand tactile pressure and pose estimation, containing 5 hours of RGB-D interaction data from 21 participants, high-fidelity MANO hand mesh annotations based on multi-view optimization, and ground-truth pressure mapping from pressure sensors. It also establishes benchmark models for estimating hand pressure and pose from RGB images.
EigenGS: Representation from Eigenspace to Gaussian Image Space: This paper proposes EigenGS, which bridges the eigenspace representation of classical PCA with the 2D Gaussian Splatting image representation. By learning unified Gaussian parameters on the eigenbasis, instant initialization of new images is achieved (without optimization from scratch). Furthermore, a frequency-aware learning mechanism is introduced to prevent high-resolution reconstruction artifacts, comprehensively outperforming GaussianImage in both convergence speed and final quality.
Empowering Large Language Models with 3D Situation Awareness: This paper proposes to automatically generate a situation-aware dataset, View2Cap (over 200k descriptions, 550k+ QAs), utilizing the camera trajectories of RGB-D videos. It designs a Situation Grounding (SG) module that converts pose estimation into an anchor classification task, enabling 3D LLMs to understand ego-centric spatial relationship descriptions (e.g., how "left" and "right" change depending on the viewpoint), achieving 54.0% EM@1 on SQA3D.
End-to-End HOI Reconstruction Transformer with Graph-based Encoding: Proposes the HOI-TG framework, which implicitly learns human-object interaction relationships using the self-attention mechanism of Transformers and embeds graph residual blocks in the encoder to enhance topological structure modeling for the human body and objects, respectively, achieving SOTA 3D HOI reconstruction on the BEHAVE and InterCap datasets.
End-to-End Implicit Neural Representations for Classification: Proposes the Meta Weight Transformer (MWT), which utilizes end-to-end meta-learning of SIREN initialization parameters and learning rate schedules. This allows the weight structure of INR to simultaneously optimize reconstruction quality and classification performance. Using a simple standard Transformer for classification on SIREN weights outperforms all equivariant architecture methods, achieving INR classification on high-resolution ImageNet-1K for the first time.
EnvGS: Modeling View-Dependent Appearance with Environment Gaussian: This paper proposes EnvGS, which utilizes a set of environment Gaussian primitives as an explicit 3D representation to capture scene reflections. By jointly optimizing environment Gaussians and base Gaussians through a GPU RT Core-based differentiable ray-tracing renderer, it achieves real-time (26+ FPS) and high-quality specular reflection novel view synthesis in real-world scenes for the first time, significantly outperforming all real-time methods.
ERUPT: Efficient Rendering with Unposed Patch Transformer: ERUPT proposes an efficient latent view synthesis model. By replacing pixel-level decoding with a patch-based decoder, incorporating learnable latent camera poses, and utilizing a frozen DINOv2 feature extractor, it achieves novel view synthesis at 600 fps using only 5 unposed images without requiring precise camera poses, reaching SOTA performance on the MSN dataset.
ESCAPE: Equivariant Shape Completion via Anchor Point Encoding: ESCAPE proposes a rotation-equivariant point cloud completion method based on anchor distance encoding. By representing point clouds as distance matrices to high-curvature anchor points, the Transformer is enabled to predict the complete shape within a rotation-invariant distance space, and coordinates are subsequently recovered via optimization. This approach significantly outperforms existing methods under arbitrary input rotations (slashing CD-L1 on the PCN dataset from 26.65 to 10.58).
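The rotation invariance of the anchor-distance encoding described above can be checked with a minimal sketch. It assumes the anchors are points of the cloud itself, so they rotate together with it. The anchor indices here are arbitrary and hypothetical, whereas the paper selects high-curvature points, and the coordinate-recovery optimization is not shown.

```python
import math
import random

def rotate(point, yaw, pitch):
    """Rotate a 3D point about the z-axis by yaw, then about the x-axis by pitch."""
    x, y, z = point
    x, y = (x * math.cos(yaw) - y * math.sin(yaw),
            x * math.sin(yaw) + y * math.cos(yaw))
    y, z = (y * math.cos(pitch) - z * math.sin(pitch),
            y * math.sin(pitch) + z * math.cos(pitch))
    return (x, y, z)

def anchor_distances(points, anchor_ids):
    """Encode every point as its Euclidean distances to the anchor points."""
    return [[math.dist(p, points[i]) for i in anchor_ids] for p in points]

rng = random.Random(0)
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
         for _ in range(50)]
anchor_ids = [0, 7, 19, 33]  # hypothetical anchors, not curvature-based

rotated = [rotate(p, yaw=1.1, pitch=-0.4) for p in cloud]

before = anchor_distances(cloud, anchor_ids)
after = anchor_distances(rotated, anchor_ids)

# Largest change in any encoded distance caused by the rotation.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(before, after)
               for a, b in zip(row_a, row_b))
```

Because rotations preserve pairwise distances, `max_diff` stays at floating-point noise level, so a network operating on this encoding sees the same input for any orientation of the cloud.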
Estimating Body and Hand Motion in an Ego-sensed World: EgoAllo proposes a system to estimate the wearer's full-body pose, height, and hand parameters from head-mounted egocentric SLAM poses and images. By designing head motion conditioning parameters that satisfy spatial and temporal invariance, it reduces human motion estimation errors by up to 18% and decreases hand world-coordinate errors by 40% using kinematic constraints.
Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation: This paper proposes Eval3D, a fine-grained and interpretable evaluation tool for 3D generation quality. The core idea is to utilize various foundation models and tools as probes to detect inconsistencies in the semantic, geometric, structural, and text-alignment aspects of generated 3D assets. This achieves pixel-precise measurements and 3D spatial feedback, aligning more closely with human judgment than existing metrics.
Event Fields: Capturing Light Fields at High Speed, Resolution, and Dynamic Range: This paper proposes Event Fields—a new paradigm for capturing high-speed, high-resolution, high-dynamic-range light fields using event cameras. It designs two complementary optical schemes: a kaleidoscope (spatial multiplexing, capturing temporal derivatives) and a galvanometer (temporal multiplexing, capturing angular derivatives), achieving unprecedented capabilities such as 250fps megapixel dynamic scene refocusing and 100Hz real-time depth estimation.
EventFly: Event Camera Perception from Ground to the Sky: EventFly proposes the first cross-platform domain adaptation framework for event cameras. By identifying high-activation regions using the Event Activation Prior (EAP), blending source/target domain event data with EventBlend, and aligning feature distributions with EventMatch dual discriminators, it achieves an average improvement of 23.8% in accuracy and 77.1% in mIoU compared to source-only training on semantic segmentation tasks across three platforms: vehicle \(\to\) UAV \(\to\) quadruped robot.
Evolving High-Quality Rendering and Reconstruction in a Unified Framework with Contribution-Adaptive Regularization: This paper proposes CarGS. By identifying that the source of contribution conflicts between rendering and reconstruction tasks in Gaussian primitives lies in covariance, the authors design Lite-Geo, a lightweight residual structure to adaptively decouple the geometric contribution of the two tasks. Additionally, they introduce a normal + SDF double-guided densification strategy, achieving both SOTA rendering quality and reconstruction accuracy in a unified model, with a storage cost of only 9% of dual-model approaches.
Exploiting Deblurring Networks for Radiance Fields: This paper proposes DeepDeblurRF, which introduces DNN-based deblurring networks into the radiance field reconstruction pipeline for the first time. By designing an RF-guided deblurring mechanism and an iterative alternating framework, it achieves high-quality novel view synthesis from blurry input images. Training is 10-100 times faster than with existing methods, and multiple 3D representations such as voxel grids and 3D Gaussian Splatting are supported.
Extreme Rotation Estimation in the Wild: This paper proposes an extreme 3D rotation estimation method for real-world Internet images, constructing the ExtremeLandmarkPairs (ELP) benchmark dataset. Through a progressive learning scheme (panoramic cropping \(\rightarrow\) FoV + appearance augmentation \(\rightarrow\) real data fine-tuning) and an auxiliary-channel-enhanced Transformer model, the proposed method significantly outperforms existing methods on non-overlapping Internet image pairs.
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass: Proposes Fast3R, which generalizes the pairwise pointmap regression of DUSt3R to multi-view scenarios. By employing all-to-all self-attention in a Transformer, it processes \(N\) unposed and unordered images in a single forward pass, completely eliminating the \(O(N^2)\) pairwise inference and global alignment optimization.
FASTer: Focal Token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection: This paper proposes FASTer, which adaptively selects focal tokens and compresses sequences via an Adaptive Scaling mechanism, and progressively aggregates long-term temporal point cloud information using a grouped hierarchical fusion strategy. It achieves new state-of-the-art (SOTA) performance on the Waymo Open Dataset with the lowest latency (75ms) and memory footprint (2856M).
Feat2GS: Probing Visual Foundation Models with Gaussian Splatting: This paper proposes Feat2GS, a unified framework that decodes 2D features of Visual Foundation Models (VFMs) into 3D Gaussian attributes via a lightweight MLP. It probes the geometric and texture awareness of VFMs individually on the novel view synthesis task, comprehensively evaluating the 3D awareness of over 10 VFMs on large-scale diverse datasets without requiring 3D ground-truth data.
Feature-Preserving Mesh Decimation for Normal Integration: The classic quadric error metric (QEM) is derived in screen space with normal maps as input. Combined with optimal Delaunay triangulation, this achieves anisotropic mesh decimation that preserves sub-millimeter accuracy even at 90%+ compression rates, accelerating high-resolution normal integration from hours to minutes.
FFaceNeRF: Few-Shot Face Editing in Neural Radiance Fields: FFaceNeRF is a NeRF-based face editing method that adapts to any custom segmentation mask layout using only 10 annotated samples. It combines a geometry adapter, tri-plane feature injection, and latent mixing for triplane augmentation (LMTA) to achieve flexible 3D-aware face editing.
FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views: FLARE proposes a cascade learning paradigm that uses camera poses as a bridge to decompose 3D reconstruction into four progressive stages: pose estimation → local geometry → global geometry → Gaussian appearance. It achieves high-quality camera pose estimation, geometric reconstruction, and novel view synthesis from 2-8 uncalibrated sparse images within 0.5 seconds.
FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views: Proposes FLARE, a feed-forward differentiable system that simultaneously infers high-quality camera poses, 3D geometry, and appearance from uncalibrated sparse-view images (2-8 views) in 0.5 seconds. A cascaded learning paradigm, with camera poses acting as a bridge, progressively simplifies the complex 3D learning task.
Floating No More: Object-Ground Reconstruction from a Single Image: The ORG framework is proposed to jointly model object 3D geometry, camera parameters, and the object-ground relationship from a single image for the first time. By predicting two compact dense representations—pixel height maps and perspective fields—it solves the problem of "floating/tilting" in reconstructed objects, significantly improving the realism of shadow generation and pose manipulation.
Flow-NeRF: Joint Learning of Geometry, Poses, and Dense Flow within Unified Neural Representations: This paper proposes Flow-NeRF, which is the first to integrate scene geometry, camera poses, and dense optical flow as a unified joint optimization target within a pose-free NeRF framework. Through shared point sampling, a pose-conditioned bijective mapping, and a feature message passing mechanism, it significantly outperforms prior methods in novel view synthesis and depth estimation, while defining and achieving novel view optical flow estimation for the first time.
Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution: Proposes CrossFlow, a general cross-modality Flow Matching framework that evolves directly from the data distribution of one modality to that of another (instead of starting from noise), without cross-attention conditioning mechanisms. It slightly outperforms standard Flow Matching baselines in text-to-image generation and demonstrates superior scaling properties regarding model size and training steps.
Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation: Floxels is proposed to replace MLPs with a simple voxel grid as an implicit representation of scene flow. Combined with multi-scan distance transform loss and cluster consistency constraints, it achieves 2nd place among unsupervised methods on the Argoverse 2 benchmark (behind EulerFlow) while reducing runtime from 24 hours to 10 minutes (60-140x speedup).
FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video: FluidNexus is proposed to reconstruct 3D fluid appearance and velocity fields and predict future states from a single video for the first time. By synthesizing multi-view reference videos using a video generation model and bridging differentiable simulation and rendering via a coupled physical-visual particle representation, this method significantly outperforms existing multi-view approaches in novel view synthesis and future prediction.
FoundationStereo: Zero-Shot Stereo Matching: Presents FoundationStereo, a large-scale foundation model for stereo depth estimation. By leveraging a million-scale high-fidelity synthetic dataset, fusing monocular depth priors via a Side-Tuning Adapter, and adopting a hybrid cost volume filtering mechanism (incorporating Axial-Planar Convolution and Disparity Transformer), this method achieves strong zero-shot generalization performance without requiring target-domain fine-tuning.
FrameVGGT: Frame Evidence Rolling Memory for streaming VGGT: FrameVGGT is proposed to reorganize the KV cache of streaming VGGT from token-level retention to frame-level evidence block retention. Through a dual-layer bounded memory structure consisting of a middle bank and sparse anchors, it maintains more coherent geometric support under a fixed memory budget, achieving an optimal trade-off between accuracy and memory for long-sequence 3D reconstruction, depth, and pose estimation.
FreeGave: 3D Physics Learning from Dynamic Videos by Gaussian Velocity: This paper proposes FreeGave, a general framework for learning 3D scene geometry, appearance, and physical velocity from multi-view dynamic videos. By introducing a learnable physics code for each 3D Gaussian kernel and designing a divergence-free velocity field parameterization, FreeGave achieves accurate future frame extrapolation without relying on PINN losses or target priors.
FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts: FreeScene proposes a user-friendly indoor scene synthesis framework. It utilizes a VLM-driven Graph Designer to convert free-form text/image inputs into scene graphs, and then uses a Mixed Graph Diffusion Transformer (MG-DiT) to perform graph-aware denoising in a hybrid continuous-discrete space. A single model supports multiple tasks such as text-to-scene and graph-to-scene, outperforming existing methods in both generation quality and controllability.
FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting: FruitNinja proposes the first method for generating internal textures for 3DGS objects. Through progressive cross-section inpainting, voxel smoothing, and the OpaqueAtom GS strategy, it achieves real-time rendering after cutting without additional optimization, significantly outperforming baselines in semantic alignment and texture consistency.
FSHNet: Fully Sparse Hybrid Network for 3D Object Detection: FSHNet proposes a fully sparse hybrid network that establishes global-range sparse voxel interactions using SlotFormer (slot partitioning + linear attention). Together with dynamic sparse label assignment and a sparse upsampling module, it outperforms existing sparse and dense detectors on three major benchmarks: Waymo, nuScenes, and Argoverse2.
Functionality Understanding and Segmentation in 3D Scenes: Fun3DU introduces the first approach for functional understanding in 3D scenes. By leveraging LLM chain-of-thought to parse task descriptions, utilizing VLMs to localize and segment functional objects across multi-view images, and applying 2D-3D voting aggregation, it substantially outperforms open-vocabulary 3D segmentation baselines on SceneFun3D (mIoU +13.2).
GASP: Gaussian Avatars with Synthetic Priors: This paper proposes GASP, which utilizes synthetic data to train a generative prior model (auto-decoder) for Gaussian Avatars. It bridges the synthetic-to-real domain gap through a three-stage fitting process and learned per-Gaussian semantic feature correlations, enabling the creation of high-quality, real-time animatable avatars (at 70 fps) supporting 360° rendering from only a single image or a short video.
GaussHDR: High Dynamic Range Gaussian Splatting via Learning Unified 3D and 2D Local Tone Mapping: This paper proposes GaussHDR, which improves HDR Gaussian Splatting by unifying 3D and 2D local tone mapping. By designing a residual local tone mapper and an uncertainty-adaptive modulation mechanism, it simultaneously enhances HDR reconstruction stability and LDR fitting quality, significantly outperforming existing methods on both synthetic and real-world scenes.
Gaussian Eigen Models for Human Heads: Proposes Gaussian Eigen Models (GEM), which distill a high-quality CNN-based Gaussian Avatar into a lightweight linear eigenbasis representation via PCA. By using linear combinations of low-dimensional coefficients to generate facial animations, it achieves high-quality, ultra-lightweight (starting at 7MB), and ultra-fast (200+ fps) animatable avatars, supporting real-time cross-identity expression reenactment from monocular video.
Gaussian Splatting Feature Fields for Privacy-Preserving Visual Localization: This paper proposes Gaussian Splatting Feature Fields (GSFFs), which combine the explicit geometry of 3DGS with implicit feature fields. Through self-supervised contrastive learning, they train scale-aware 3D features and 2D encoders, and leverage Delaunay-graph-based spatial clustering to convert features into segmentation labels, achieving high-accuracy non-privacy and privacy-preserving visual localization.
Gaussian Splatting for Efficient Satellite Image Photogrammetry (EOGS): This paper proposes EOGS, the first Earth observation framework based on 3D Gaussian Splatting. Through affine camera approximation, shadow mapping, and three regularization strategies, it achieves reconstruction accuracy comparable to EO-NeRF on satellite image 3D reconstruction tasks, while speeding up training by 300x (3 minutes vs. 15 hours).
GaussianUDF: Inferring Unsigned Distance Functions through 3D Gaussian Splatting: This paper proposes GaussianUDF, which fits 2D Gaussian planes to surfaces and leverages self-supervision and gradient inference to provide unsigned distance supervision for near-field and far-field regions respectively. This achieves efficient continuous UDF inference within the 3DGS framework for the first time, enabling high-quality open surface reconstruction.
GauSTAR: Gaussian Surface Tracking and Reconstruction: GauSTAR proposes a "Gaussian Surface" representation that binds Gaussian primitives to a mesh surface. By handling topological changes through adaptive unbinding and re-meshing mechanisms, and incorporating surface-based scene flow initialization, it introduces the first unified framework that simultaneously achieves photorealistic rendering, accurate surface reconstruction, and reliable 3D tracking in dynamic scenes.
GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency: GEAL proposes a dual-branch architecture that leverages 3D Gaussian Splatting to render point clouds into realistic 2D images, utilizing the generalization capabilities of pre-trained 2D foundation models. Through granularity-adaptive fusion and 2D-3D consistency alignment, it achieves cross-modal knowledge transfer, outperforming existing 3D affordance methods across both standard and corrupted data benchmarks.
Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects: This paper proposes Gen3DEval, a text-to-3D generation quality evaluation framework based on fine-tuned vLLMs. By fine-tuning the Llama3 model on synthetic and human-annotated data, it achieves automatic evaluation of 3D object appearance, surface quality, and text consistency, significantly outperforming general models like GPT-4o in human preference alignment.
Generating 3D-Consistent Videos from Unposed Internet Photos: This paper proposes KFC-W, a self-supervised method for generating 3D-consistent videos from unposed Internet photos. By jointly training multi-view inpainting and view interpolation objectives on a video diffusion model without any 3D annotations (such as camera parameters), the generated videos outperform the commercial model Luma Dream Machine in both geometric and appearance consistency.
Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation: This paper proposes using a multiview relighting diffusion model to first unify images captured under different illumination conditions into a reference lighting condition, and then reconstruct the 3D representation using a robust NeRF model with "shading embedding". It achieves high-fidelity appearance reconstruction under extreme illumination variations, significantly outperforming existing methods, especially in recovering specular/highlight effects.
Generative Omnimatte: Learning to Decompose Video into Layers: Generative Omnimatte fine-tunes a video inpainting diffusion model (Casper) to learn the joint removal of objects and their associated effects (shadows, reflections). Combined with a trimask conditioning mechanism and omnimatte optimization, it achieves high-quality video layer decomposition and disoccluded region completion without assuming static backgrounds or requiring camera poses.
GenFusion: Closing the Loop between Reconstruction and Generation via Videos: Proposes GenFusion, which uses a reconstruction-driven video diffusion model to fix 3D reconstruction artifacts and generate content in unobserved regions. It designs a cyclic fusion pipeline to iteratively incorporate generation results into the training set, achieving high-quality 3D scene reconstruction and content expansion under sparse-view settings.
GenPC: Zero-shot Point Cloud Completion via 3D Generative Priors: This paper proposes GenPC, a zero-shot point cloud completion framework. It uses a Depth Prompting module to convert partial point clouds into depth maps to generate RGB images as input for Image-to-3D models. Then, a Geometric Preserving Fusion module aligns and fuses the generated 3D shape with the original point cloud, achieving faster and better real-world scan completion compared to SDS-based optimization methods.
GenVDM: Generating Vector Displacement Maps From a Single Image: The first method to generate Vector Displacement Maps (VDMs) from a single image is proposed. By fine-tuning Zero123++ to generate multi-view normal maps, using neural SDF to reconstruct meshes, and then parameterizing them into VDM images using neural deformation fields, the authors construct the first academic VDM dataset. This provides 3D artists with the ability to generate customized geometric detail stamps on-demand.
Geometry Field Splatting with Gaussian Surfels: This paper introduces the Geometry Field theory into the Gaussian Surfel framework, deriving an efficient and near-exact differentiable rendering algorithm for opaque surface reconstruction. It simultaneously resolves the loss discontinuity issue when surfels aggregate and employs a latent representation based on reflection vectors to better handle specular surfaces.
Geometry in Style: 3D Stylization via Surface Normal Deformation: Performs text-driven mesh stylization by optimizing the surface normal directions of a triangular mesh, combined with a differentiable ARAP (dARAP) layer to reconstruct vertex positions, enabling expressive geometric deformations while preserving shape identity.
GIFStream: 4D Gaussian-based Immersive Video with Feature Stream: Proposes GIFStream, a 4D Gaussian representation based on a canonical space + deformation field. By attaching a time-dependent feature stream to each anchor point, it enhances the capability of modeling complex motions. Meanwhile, it leverages a time-aligned structure and end-to-end compression to achieve high-quality 1080p immersive video at 30 Mbps.
Glossy Object Reconstruction with Cost-effective Polarized Acquisition: A cost-effective polarization-assisted 3D reconstruction method is proposed. By simply adding a linear polarizer in front of a standard RGB camera and capturing one polarization image per viewpoint (without the need for polarizing angle calibration), the method recovers high-fidelity geometry and reflectance decomposition of glossy objects via end-to-end optimization of polarization rendering loss in a neural implicit field.
GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector: GO-N3RDet is proposed to address the lack of 3D spatial positioning and insufficient scene geometry awareness in NeRF-based multi-view 3D detection. By introducing three collaborative modules—the Position-Embedded Voxel Optimization Module (PEOM), Dual Importance Sampling (DIS), and Opacity Optimization Module (OOM)—it establishes a new SOTA on ScanNet and ARKitScenes.
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding: This paper proposes the GREAT framework, which fine-tunes InternVL using a Multi-Head Affordance Chain-of-Thought (MHACoT) to reason about the object's geometric attributes and latent interaction intentions in interaction images, forming an affordance knowledge dictionary. It injects this knowledge into point cloud and image features through a Cross-Modal Adaptive Fusion Module (CMAFM) to achieve open-vocabulary 3D object affordance grounding. Additionally, it constructs PIADv2, the largest 3D affordance dataset to date (15K images + 38K point clouds).
Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions: This work proposes the first multimodal multi-view 3D affordance grounding task and the AGPIL dataset (containing 30,972 point cloud-image-language triplets). It designs LMAffordance3D, a VLM-based framework that fuses 2D/3D spatial features with linguistic semantics to generalize from full-view to partial/rotation-view scenarios.
GS-2DGS: Geometrically Supervised 2DGS for Reflective Object Reconstruction: By introducing depth/normal pseudo-label supervision from foundation models (Marigold + Depth Pro) and a physically-based rendering (PBR) pipeline with deferred shading on top of 2DGS, this approach significantly outperforms existing GS methods and matches the performance of SDF methods on reflective object reconstruction while being an order of magnitude faster.
GuardSplat: Efficient and Robust Watermarking for 3D Gaussian Splatting: This paper proposes GuardSplat, which achieves high-capacity, high-fidelity, and robust copyright protection for 3DGS assets with a total optimization time of only 15 minutes. This is accomplished via CLIP-guided message decoupling optimization (training the decoder in only 5 minutes) and SH-aware watermark embedding (modifying only spherical harmonics offsets).
Guiding Human-Object Interactions with Rich Geometry and Relations: This paper proposes the ROG framework, which constructs an Interactive Distance Field (IDF) by sampling geometry-rich keypoints on object meshes. It utilizes a diffusion-based relation model to guide the motion generation model during inference to produce relation-aware and semantically-aligned human-object interactions, significantly outperforming the state-of-the-art on the FullBodyManipulation dataset.
HandOS: 3D Hand Reconstruction in One Stage: HandOS proposes an end-to-end, single-stage 3D hand reconstruction framework that unifies hand detection, 2D pose estimation, and 3D mesh reconstruction into a single pipeline. By freezing a pre-trained detector and introducing an interactive 2D-3D decoder, it eliminates the redundant calculation and cumulative errors of classical multi-stage methods, achieving state-of-the-art performance with a PA-MPJPE of 5.0 on FreiHand.
Hardware-Rasterized Ray-Based Gaussian Splatting: This paper presents VKRayGS, the first hardware-rasterized Ray-based 3D Gaussian Splatting (RayGS) rendering scheme. Through rigorous mathematical derivations, it constructs a minimum bounding quad in 3D space, achieving approximately a 40x rendering speedup while maintaining the high-quality rendering of RayGS, and additionally proposes a MIP anti-aliasing scheme for RayGS.
Hash3D: Training-free Acceleration for 3D Generation: Hash3D discovers that the features of diffusion models are highly redundant across adjacent camera poses and timesteps during SDS optimization. By caching and reusing intermediate features using an adaptive grid hash table, it accelerates various text-to-3D and image-to-3D methods by 1.3 to 4 times without training, while simultaneously enhancing multi-view consistency.
HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos: HaWoR achieves the first reconstruction of 3D hand motion in the world coordinate system from egocentric videos. By decoupling the task into camera-space hand reconstruction and adaptive SLAM camera trajectory estimation, and introducing a motion infilling network to handle out-of-view hand scenarios, it achieves state-of-the-art global trajectory accuracy (ATE 3.36mm) and hand reconstruction quality (PA-MPJPE 4.79mm) on the HOT3D dataset.
HD-EPIC: A Highly-Detailed Egocentric Video Dataset: HD-EPIC provides 41 hours of unscripted egocentric kitchen videos with unprecedented annotation density (263 annotations per minute), covering recipe steps, fine-grained actions, nutritional information, 3D digital twins, object motion trajectories, and gaze directions. It also builds a VQA benchmark of 26K questions, on which the strongest model, Gemini Pro, achieves only 37.6%.
Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes: This paper records action-sound pairs of human hand interactions in 3D reconstructed scenes and uses them to train a rectified flow-based generative model. The model predicts the corresponding interaction sounds from 3D hand trajectories, and human evaluators cannot distinguish its results from real sounds in approximately 47% of cases.
HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery: Proposes HeatFormer, a Transformer-based neural optimizer that formulates SMPL parameter estimation as a heatmap generation and alignment problem to iteratively optimize and recover human shape and pose from multi-view images. It achieves state-of-the-art accuracy of 29.5mm MPJPE on Human3.6M, demonstrating strong robustness to the number of views, camera configurations, and occlusions.
Helvipad: A Real-World Dataset for Omnidirectional Stereo Depth Estimation: This paper proposes Helvipad—the first real-world dataset for omnidirectional stereo depth estimation (40K frames, top-bottom dual 360° cameras + LiDAR). It also introduces two lightweight adaptation strategies, polar angle input and circular padding, to improve stereo matching models for handling equirectangular projection images, with the proposed 360-IGEV-Stereo achieving state-of-the-art performance across all metrics.
High-fidelity 3D Object Generation from Single Image with RGBN-Volume Gaussian Reconstruction Model: GS-RGBN proposes a hybrid Voxel-Gaussian representation to provide 3D spatial constraints for unstructured Gaussians, and designs a Cross-Volume Fusion (CVF) module to fuse RGB semantic information and normal geometric information at the feature level. It generates high-fidelity 3D objects from a single image within seconds, achieving a PSNR improvement of 5.59dB over the second-best method on the GSO dataset.
HOI3DGen: Generating High-Quality Human-Object-Interactions in 3D: HOI3DGen automatically annotates high-quality interaction data via MLLMs, fine-tunes a view-conditioned diffusion model, and performs 3D lifting along with SMPL registration. It is the first to achieve high-quality 3D human-object interaction generation with precise contact-semantic control from text, outperforming baselines by 4-15x in text consistency.
Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes: This paper proposes Horizon-GS, which achieves the first unified 3D Gaussian Splatting reconstruction and real-time rendering of both aerial and street-view perspectives through a coarse-to-fine two-stage training strategy, a camera distribution balance mechanism, and a multi-resolution LOD structure, achieving SOTA rendering quality on multiple urban scene datasets.
HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos: Meta releases HOT3D, the first large-scale egocentric multi-view hand-object interaction dataset based on real wearable devices (Project Aria + Quest 3). It contains 833 minutes of recordings with over 3.7 million images, capturing 19 subjects interacting with 33 objects. The paper demonstrates through experiments that multi-view methods significantly outperform single-view methods in tasks like 3D hand tracking and 6DoF object pose estimation.
HRAvatar: High-Quality and Relightable Gaussian Head Avatar: HRAvatar proposes a monocular video head reconstruction method based on 3DGS, which achieves flexible deformation through learnable blendshapes and LBS, reduces tracking errors using an end-to-end expression encoder, and introduces a physically-based rendering model to achieve high-quality real-time relighting.
Hybrid eTFCE-GRF: Exact Cluster-Size Retrieval with Analytical p-Values for Voxel-Based Morphometry: By combining the exact cluster-size retrieval of eTFCE's Union-Find structure with the analytical GRF p-value inference of pTFCE, this work realizes exact cluster retrieval and permutation-free statistical inference in a single framework for the first time. It achieves a 1300x speedup compared to permutation-based TFCE while maintaining strict FWER control in whole-brain voxel-based morphometry.
HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting: HybridGS proposes the first hybrid 2D+3D Gaussian representation, modeling static scenes with multi-view consistent 3D Gaussians and transient objects with single-view independent 2D Gaussians. Combined with multi-view regulation and multi-stage training, it achieves state-of-the-art (SOTA) novel view synthesis quality in scenes containing distractor elements.
HyperGS: Hyperspectral 3D Gaussian Splatting: This paper successfully extends 3DGS to hyperspectral novel view synthesis (HNVS) for the first time. By performing hyperspectral rendering in a learned latent space combined with adaptive density control and pixel-level spectral pruning, it achieves efficient and accurate reconstruction of high-dimensional spectral data.
IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments: Constructs a hierarchical semantic feature field based on 3DGS, integrating semantic information from CLIP, SAM, and DINOv2 to achieve interactive affordance prediction and cross-state motion parameter recovery for articulated objects, supporting complex indoor scenes with arbitrary categories and multiple movable parts.
Identity-preserving Distillation Sampling by Fixed-Point Iterator: Proposes Identity-preserving Distillation Sampling (IDS), which corrects the gradient errors that lead to identity loss in text-conditioned score functions through Fixed-Point Iterative Regularization (FPR). This method generates guided noise instead of random noise, achieving high structural and pose preservation in both 2D image editing and 3D NeRF editing.
IMFine: 3D Inpainting via Geometry-guided Multi-view Refinement: This paper proposes IMFine, a 3D inpainting pipeline designed for unbounded scenes (including 360° captures). It generates multi-view consistent inpainted images through geometry-prior-guided warping and a test-time adaptation-based multi-view refinement network. Additionally, a novel inpainting mask detection technique is proposed to accurately distinguish the occluded regions that truly require inpainting, significantly outperforming existing methods on diverse benchmarks.
Improving Gaussian Splatting with Localized Points Management: This paper proposes a Localized Points Management (LPM) strategy that localizes error-prone 3D regions using multi-view geometric constraints. Within these regions, targeted point densification and opacity resetting are executed. As a plug-and-play module, it improves the reconstruction quality of various 3DGS models while maintaining real-time rendering speed.
IncEventGS: Pose-Free Gaussian Splatting from a Single Event Camera: This paper proposes IncEventGS, the first method for incremental 3D Gaussian Splatting scene reconstruction using only a monocular event camera without known poses. By adopting a tracking-mapping SLAM paradigm to jointly optimize camera motion and scene representation, it outperforms existing methods in both novel view synthesis and pose estimation.
Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects: This paper reformulates the 3D editing problem into a multi-view consistent 2D inpainting task. By fine-tuning the SDXL-inpainting model to simultaneously generate consistent filled content on a \(2 \times 2\) view grid and then reconstructing the 3D asset using a Large Reconstruction Model (LRM), high-quality 3D editing is achieved in approximately 3 seconds—hundreds of times faster than Score Distillation Sampling (SDS) methods.
InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction: This paper proposes InstantHDR, the first feed-forward HDR novel view synthesis method. It performs multi-exposure fusion via geometry-guided appearance modeling and employs a MetaNet to predict a scene-adaptive tone mapper. It reconstructs HDR 3D Gaussians from uncalibrated multi-exposure LDR images in a single forward pass, achieving a speedup of ~700x compared to optimization-based methods.
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models: InteractVLM leverages the extensive visual knowledge of large-scale vision-language models (VLMs) to transfer the reasoning capabilities of 2D foundational models to 3D space via a "Render-Localize-Lift" framework. It realizes 3D contact point estimation for humans and objects from a single in-the-wild image, applying it to joint human-object interaction reconstruction and achieving a 20.6% F1 score improvement in contact estimation tasks.
IRGS: Inter-Reflective Gaussian Splatting with 2D Gaussian Ray Tracing: This paper proposes the IRGS framework, which integrates the complete rendering equation (without simplification) into Gaussian Splatting for the first time. By leveraging the proposed differentiable 2D Gaussian ray tracing, it calculates incoming light visibility and indirect radiance in real-time, achieving significantly superior relighting and material estimation results compared to prior methods on multiple inverse rendering benchmarks.
IRIS: Inverse Rendering of Indoor Scenes from Low Dynamic Range Images: IRIS proposes an inverse rendering framework to jointly recover HDR lighting, physical materials, and camera response functions from multi-view LDR images. By explicitly modeling tone mapping, automatically detecting emitters, and employing an iterative optimization strategy, it achieves high-quality material estimation, relighting, and virtual object insertion on both real and synthetic indoor scenes.
iSegMan: Interactive Segment-and-Manipulate 3D Gaussians: iSegMan proposes an interactive 3DGS segmentation and manipulation framework that requires no scene-specific training. It achieves precise 3D region control via Epipolar-guided Interaction Propagation (EIP) and Visibility-based Gaussian Voting (VGV), paired with a comprehensive manipulation toolbox supporting various functions such as semantic editing, colormapping, scaling, copying-and-pasting, combining, and deleting.
Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video: Modeling camera motion as time-continuous angular and linear velocities, this work avoids direct optimization of large-range camera-to-world transformations through velocity integration. Combined with a time-dependent NeRF and SDF flow constraints, camera poses and scene geometry are jointly optimized from monocular video without depth priors.
JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas: This work proposes the JOPP-3D framework, which achieves the first joint open-vocabulary semantic segmentation of 3D point clouds and panoramic images by tangentially decomposing panoramas into perspective images and leveraging SAM+CLIP for 3D instance-semantic alignment. It outperforms existing methods on the Stanford-2D-3D-s and ToF-360 datasets.
Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation: Formulates 3D asset generation as a 2D image generation task—fine-tuning the Flux DiT model to generate a "3D Bundle Image" (a collage of four-view RGB and normal maps), then reconstructing the 3D mesh via ISOMER, and extending support for 3D enhancement and editing through ControlNet.
Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos: This paper proposes Layered Motion Fusion (LMF), which integrates predictions from 2D motion segmentation models into the dynamic and semi-static layers of layered Neural Radiance Fields. Together with a test-time refinement strategy, this work demonstrates for the first time that 3D methods can outperform 2D baselines in dynamic object segmentation for egocentric videos, improving dynamic object segmentation mAP by 30.5%.
Learnable Infinite Taylor Gaussian for Dynamic View Rendering: This paper proposes the Learnable Infinite Taylor Formula to model the temporal evolution of Gaussian primitives' position, rotation, and scale in dynamic scenes. It employs a third-order Taylor expansion to capture large-scale motion, and utilizes an MLP alongside Linear Blend Skinning (LBS) to construct the Peano remainder for compensating high-order terms. This achieves motion modeling with zero approximation error, outperforming state-of-the-art methods on the N3DV and Technicolor datasets.
Learning Class Prototypes for Unified Sparse-Supervised 3D Object Detection: Proposes CPDet3D, the first unified indoor-outdoor sparsely-supervised 3D object detection method. It mines the categories of unlabeled objects through class-aware prototype clustering (cross-scene Sinkhorn-Knopp optimal transport matching) and recovers missed detections using multi-label cooperative refinement (pseudo labels + prototype labels). It achieves 78% of fully-supervised performance on ScanNet V2, 90% on SUN RGB-D, and 96% on KITTI using only 1 annotation per scene.
Leveraging 3D Geometric Priors in 2D Rotation Symmetry Detection: This paper proposes a rotation symmetry detection model leveraging 3D geometric priors. By directly predicting the rotation center and vertices in 3D space and projecting them back to 2D, combined with a seed-point and rotation-axis based vertex reconstruction module, the method achieves an F1-score of 33.2 on the DENDI dataset, outperforming the previous segmentation-based SOTA method EquiSym (22.5).
Light3R-SfM: Towards Feed-forward Structure-from-Motion: Light3R-SfM proposes the first feed-forward end-to-end SfM framework. It replaces the traditional optimization-based global alignment with a learnable latent global alignment module, and constructs a scene graph via a shortest path tree based on retrieval scores. It completes reconstruction in only 33 seconds on the Tanks&Temples 200-image setup (49x faster than MASt3R-SfM) while maintaining comparable accuracy.
LIM: Large Interpolator Model for Dynamic Reconstruction: This paper proposes LIM, the first feed-forward, category-agnostic dynamic 4D asset reconstruction model, which achieves high-quality continuous-time interpolation and mesh tracking with consistent topology in seconds by interpolating between implicit triplane representations via a Transformer and introducing a causal consistency loss.
LookCloser: Frequency-aware Radiance Field for Tiny-Detail Scene (FA-NeRF): FA-NeRF proposes a frequency-aware neural radiance field framework. It analyzes the scene's frequency distribution using a 3D frequency quantification method, combining frequency grids, frequency-aware feature re-weighting, and adaptive ray marching. It captures both the overall scene structure and high-definition tiny details within a single model, significantly outperforming all baseline methods on multi-frequency datasets.
LT3SD: Latent Trees for 3D Scene Diffusion: LT3SD progressively decomposes 3D scenes into latent trees (each layer containing a geometry volume + a high-frequency latent feature volume). A patch-based diffusion model trained on this representation achieves coarse-to-fine, patch-wise generation of high-quality, infinite 3D scenes, improving FID by 70% compared to the SOTA.
LUCAS: Layered Universal Codec Avatars: Proposes LUCAS, the first universal prior avatar model that disentangles the face and hair into layered meshes. By employing a shared expression code and independent decoders, it enables natural face-hair interactions while supporting both real-time mesh rendering (45 FPS on mobile devices) and high-fidelity Gaussian rendering, achieving state-of-the-art performance in cross-identity zero-shot driving.
MAC-Ego3D: Multi-Agent Gaussian Consensus for Real-Time Collaborative Ego-Motion and Photorealistic 3D Reconstruction: MAC-Ego3D uses a unified 3D Gaussian Splatting (3DGS) representation that lets multiple agents independently construct, align, and iteratively optimize local maps. By utilizing intra- and inter-agent Gaussian consensus mechanisms, it achieves real-time collaborative pose estimation and photorealistic 3D reconstruction, leading to a \(15\times\) inference acceleration, an order-of-magnitude reduction in pose error, and a 4-10 dB improvement in RGB PSNR.
MAGiC-SLAM: Multi-Agent Gaussian Globally Consistent SLAM: This paper proposes MAGiC-SLAM, a multi-agent SLAM system based on a rigidly deformable 3D Gaussian scene representation. Through a novel tracking and map fusion mechanism, and DINOv2-based loop closure detection, it achieves a processing speed 24 times faster than CP-SLAM with 7 times lower GPU usage, along with more accurate trajectory estimation and high-fidelity novel view rendering.
Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh: Mani-GS proposes a method for manipulating 3D Gaussian Splatting based on triangular meshes. By defining a local coordinate system on each triangle to bind Gaussians, the position, rotation, and scale of Gaussians adaptively adjust when the mesh deforms. This enables various manipulation types such as large deformation, local editing, and soft-body simulation, while maintaining high-quality rendering and showing high tolerance for mesh inaccuracy.
MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation: This paper proposes a progressive 3D generation framework using a Pyramid VAE paired with Cascaded MAR (MAR-LR → MAR-HR). By utilizing random masking to accommodate the unordered nature of 3D tokens and employing a condition augmentation strategy to mitigate cumulative errors during resolution scaling, the method achieves state-of-the-art (SOTA) performance among open-source approaches.
MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation: This work constructs MARVEL-40M+, a large-scale 3D description dataset featuring 8.9 million 3D assets and over 40 million multi-level text annotations. Through a multi-stage automated annotation pipeline (InternVL2 + Qwen2.5), it generates five levels of annotation ranging from detailed narratives to concise tags. Leveraging this dataset, Stable Diffusion 3.5 is fine-tuned to achieve high-fidelity text-to-3D generation within 15 seconds.
Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding: This paper proposes MPEC (Masked Point-Entity Contrastive learning), which trains a 3D encoder using a two-level contrastive loss framework: cross-view point-to-entity contrastive learning and entity-to-language alignment. This approach achieves open-vocabulary semantic understanding while preserving entity-level geometric-spatial information, obtaining a state-of-the-art f-mIoU of 66.0% on ScanNet and demonstrating strong generalization capabilities on downstream tasks across 8 datasets.
MaskGaussian: Adaptive 3D Gaussian Representation from Probabilistic Masks: Reframes Gaussian pruning in 3DGS from deterministic removal to probabilistic existence modeling. Implements a masked-rasterization technique that allows unsampled Gaussians to still receive gradients for dynamic contribution assessment, achieving 62-75% Gaussian pruning rates on Mip-NeRF360/T&T/DeepBlending with only a 0.02 PSNR loss.
MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors: The first real-time monocular dense SLAM system built upon the pairwise 3D reconstruction prior MASt3R. Through efficient pointmap matching, ray-error tracking, local fusion, loop closure detection, and second-order global optimization, it achieves globally consistent pose estimation and dense geometric reconstruction at 15 FPS without requiring camera calibration, yielding state-of-the-art results.
MAtCha Gaussians: Atlas of Charts for High-Quality Geometry and Photorealism From Sparse Views: MAtCha Gaussians models scene surfaces as an atlas of charts on 2D manifolds, rendered via 2D Gaussian surfels. By leveraging monocular depth initialization, a lightweight neural deformation model, and a structure-preserving loss, it simultaneously achieves high-quality surface mesh reconstruction and photorealistic novel view synthesis within minutes from only 3-10 sparse views.
Material Anything: Generating Materials for Any 3D Object via Diffusion: This paper proposes Material Anything, a fully automated, unified diffusion framework that adapts pre-trained image diffusion models to generate PBR materials (albedo/roughness/metallic/bump) via a triple-head U-Net architecture, a confidence mask, and a rendering loss. Together with a confidence-guided progressive multi-view generation strategy and a UV space refinement model, it generates high-quality material maps for 3D objects under various input conditions (untextured / pure albedo / scanned / generated) in a unified manner.
Matrix3D: Large Photogrammetry Model All-in-One: Matrix3D proposes a unified photogrammetry model based on a multi-modal diffusion Transformer. Through a masked learning strategy, it simultaneously performs pose estimation, depth prediction, and novel view synthesis within a single model. It achieves a pose estimation rotation accuracy of 96.5% on CO3D, significantly outperforming all specialized methods.
MEGA: Masked Generative Autoencoder for Human Mesh Recovery: MEGA proposes a human mesh recovery method based on masked generative modeling. By discretizing the human mesh into a token sequence, the model performs image-conditioned generation after self-supervised pre-training. It supports both deterministic single-shot prediction and stochastic multi-output generation modes, achieving SOTA performance in both.
MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos: MegaSaM achieves accurate, fast, and robust estimation of camera parameters and depth maps from casually captured dynamic videos by integrating monocular depth priors, motion probability maps, and uncertainty-aware global BA into a deep visual SLAM framework, significantly outperforming existing methods on both synthetic and real-world datasets.
MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data: MegaSynth proposes to achieve scalable 3D scene data synthesis by removing the reliance on semantic information, generating a dataset of 700k scenes (50 times larger than the real-world dataset DL3DV). It is used to train Large Reconstruction Models (LRMs), bringing a significant improvement of 1.2-1.8dB in PSNR across multiple benchmarks.
Mesh Mamba: A Unified State Space Model for Saliency Prediction in Non-Textured and Textured Meshes: This paper proposes Mesh Mamba, the first unified mesh saliency prediction model based on State Space Models (SSMs). By incorporating texture alignment, subgraph embedding, and bidirectional SSMs, it achieves high-quality visual attention prediction for both textured and non-textured 3D meshes. Additionally, it constructs the first dataset that systematically compares saliency differences under textured and non-textured conditions.
MeshArt: Generating Articulated Meshes with Structure-Guided Transformers: MeshArt proposes a hierarchical Transformer framework that decomposes articulated object generation into two stages: high-level joint structures and low-level part meshes. It autoregressively generates articulated objects as compact, sharp triangle meshes, improving structural coverage by 57.1% and mesh FID by over 209 points.
MEt3R: Measuring Multi-View Consistency in Generated Images: This paper proposes MEt3R, a multi-view consistency evaluation metric based on DUSt3R reconstruction and DINO feature comparison. It measures the 3D consistency of generated images without requiring camera poses, and open-sources MV-LDM, a multi-view latent diffusion model.
MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans: MetaScenes constructs a large-scale simulatable 3D scene dataset (15,366 object instances across 831 categories) by automatically replacing object assets from real-world scans to achieve Real-to-Sim transition. It proposes a multimodal alignment model, Scan2Sim, for automated asset selection, validating the dataset's efficacy on scene synthesis and cross-domain VLN transfer tasks.
MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud Processing: To address inter-task and intra-task sampling sensitivity in 3D point cloud in-context learning, MICAS proposes a multi-grained adaptive sampling mechanism consisting of task-adaptive point sampling (Gumbel-softmax differentiable sampling) and query-specific prompt sampling (probability ranking-based optimal prompt selection), boosting part segmentation by 4.1% on the ShapeNet benchmark.
MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation: MIDI extends the pre-trained image-to-3D single-object generation model into a multi-instance diffusion model. Through a novel multi-instance attention mechanism, it directly captures spatial interaction relationships between objects during the 3D generation process. It simultaneously generates multiple 3D instances with correct spatial layouts from a single image, significantly outperforming existing methods on both synthetic and real datasets.
Mitigating Ambiguities in 3D Classification with Gaussian Splatting: This paper is the first to explore using 3D Gaussian Splatting (GS) point clouds instead of traditional point clouds as the input representation for 3D classification. By leveraging the scale/rotation coefficients in GS to distinguish between linear and flat surfaces, and using opacity to differentiate transparent/reflective objects, the authors construct the first real-world GS point cloud dataset and validate the effectiveness of GS point clouds in eliminating ambiguities across various classification methods.
MNE-SLAM: Multi-Agent Neural SLAM for Mobile Robots: This work proposes MNE-SLAM, the first fully distributed multi-agent collaborative neural SLAM framework where each agent independently executes neural mapping and tracking. Decentralized collaboration is achieved via peer-to-peer communication for hierarchical loop closure detection (intra-to-inter) and multi-submap fusion. It is validated on Replica, ScanNet, TUM RGB-D, and a self-collected INS dataset. Additionally, the first real-world indoor neural SLAM (INS) dataset covering both single- and multi-agent scenarios is released.
Mobile-GS: Real-time Gaussian Splatting for Mobile Devices: This paper proposes Mobile-GS, which achieves 116 FPS real-time Gaussian Splatting rendering on a Snapdragon 8 Gen 3 mobile GPU for the first time, with only 4.6MB of storage and visual quality comparable to the original 3DGS. This is accomplished through depth-aware order-independent rendering (eliminating sorting bottlenecks), neural view-dependent enhancement, first-order SH distillation, neural vector quantization, and contribution-based pruning.
Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion: Constructs the first large-scale stereo conversion benchmark Mono2Stereo (2.4 million pairs), proposes the stereo quality metric SIoU (0.84 Spearman correlation with human judgment), and introduces a dual-condition diffusion model with Edge Consistency loss, resolving the conflict between the weak stereo effect of single-stage methods and the poor image quality of two-stage methods.
MGGTalk: Monocular and Generalizable Gaussian Talking Head Animation: MGGTalk generalizes to unseen identities while training only on monocular datasets. Its core mechanism leverages depth estimation and facial symmetry priors to compensate for the incomplete geometric and appearance information in monocular data, enabling high-quality 3DGS-based talking head animation.
MonoPlace3D: Learning 3D-Aware Object Placement for 3D Monocular Detection: This paper proposes MonoPlace3D, a scene-aware 3D data augmentation system. Its core is learning a placement network (SA-PlaceNet) that maps scene images to a distribution of plausible 3D bounding boxes. Combined with a realistic rendering pipeline based on ControlNet, it significantly improves the performance and data efficiency of monocular 3D detectors.
Morpheus: Text-Driven 3D Gaussian Splat Shape and Color Stylization: Morpheus is an autoregressive 3DGS stylization method. Its core contributions include: (1) a novel RGBD diffusion model that achieves independent intensity control for appearance and shape stylization; (2) Warp ControlNet to propagate styles using warped composite frames; and (3) depth-guided feature sharing to ensure multi-view consistency.
Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation: This paper proposes an automated data generation pipeline to construct a large-scale 3D mask-text dataset named Mosaic3D-5.6M (containing 5.6M pairs and 30K scenes). By training a language-aligned 3D encoder and a mask decoder, it achieves the first single-stage open-vocabulary 3D instance segmentation.
MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds: Proposes the 4D Motion Scaffold (MoSca) representation, which compactly encodes scene motion using a sparse 6-DoF trajectory graph. Combined with 2D foundation model priors and physical regularization, it achieves fully automatic 4D scene reconstruction from pose-free, casual monocular videos.
MoST: Efficient Monarch Sparse Tuning for 3D Representation Learning: Proposes MoST, the first reparameterization-based 3D PEFT method, which designs a Point Monarch structured matrix (incorporating KNN-based local feature smoothing into the Monarch matrix) to outperform full fine-tuning on multiple 3D benchmarks while tuning only 3.6% of the parameters.
MotionAnyMesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins: Proposes MotionAnyMesh, a zero-shot framework that eliminates hallucinations by guiding VLM inference with SP4D kinematic priors and guarantees collision-free execution through physics-constrained trajectory optimization. It automatically transforms static 3D meshes into simulation-ready articulated digital twins, achieving a physics execution success rate of 87%, nearly double that of the best existing methods.
MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond: This work constructs MotionPRO, a large-scale pressure-RGB-optical motion capture dataset (70 subjects / 400 action classes / 12.4M frames), and proposes the FRAPPE baseline to fuse pressure signals with monocular RGB. This significantly improves the physical plausibility and global trajectory accuracy of full-body pose estimation, further extending the pressure prior to humanoid robot control.
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes: Addressing novel view synthesis (NVS) for multi-object indoor scenes, this paper significantly improves cross-view object placement and geometric consistency through three key designs: injecting structure-aware features (depth + object masks), introducing an auxiliary mask prediction task, and designing a structure-guided timestep sampling scheduler.
MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion: Integrates monocular depth and normal priors tightly into classical incremental SfM. Through uncertainty propagation and alternating optimization, it breaks the fundamental limitation of three-view tracks, achieving reliable 3D reconstruction from only two-view tracks for the first time, and significantly outperforms all existing methods in extremely low-overlap and low-parallax scenarios.
Multi-View Pose-Agnostic Change Localization with Zero Labels: This paper proposes the first label-free, pose-agnostic multi-view change detection method. By embedding a change channel into 3D Gaussian Splatting (change-aware 3DGS), it fuses multi-view feature-aware and structure-aware change masks to achieve SOTA performance gains of 1.7× mIoU and 1.5× F1 in complex multi-object scenes, and enables change mask generation for unseen views.
Multi-view Reconstruction via SfM-guided Monocular Depth Estimation: This work introduces Murre, which injects SfM sparse point clouds as conditions into diffusion-based monocular depth estimation. By generating multi-view consistent metric depth maps followed by TSDF fusion, Murre outperforms state-of-the-art MVS and neural implicit reconstruction methods across diverse real-world scenarios—including indoor, street-view, and aerial scenes—while requiring only minimal fine-tuning on synthetic data.
Murre: Multi-view Reconstruction via SfM-guided Monocular Depth Estimation: Proposes Murre, a novel multi-view 3D reconstruction framework. By injecting sparse SfM point clouds into a diffusion model to guide monocular depth estimation, it bypasses the multi-view matching step of traditional MVS and outperforms the SOTA on various real-world scenes (indoor, street, aerial).
MUSt3R: Multi-view Network for Stereo 3D Reconstruction: This paper proposes MUSt3R, extending DUSt3R from a pairwise to a multi-view architecture. By symmetrizing the decoder (halving the parameters) and introducing a multi-layer memory mechanism, it achieves high-frame-rate 3D reconstruction of an arbitrary number of images in a unified coordinate system. The same network can handle offline SfM and online Visual Odometry (VO) scenarios simultaneously, achieving an ATE of only 5.5 cm in uncalibrated VO on TUM-RGBD.
MV-DUSt3R(+): Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds: MV-DUSt3R proposes a single-stage feed-forward network to jointly process an arbitrary number of pose-free input views via multi-view decoder blocks. It completely eliminates the global optimization required by DUSt3R, achieving scene reconstruction 48–78 times faster than DUSt3R while reducing the Chamfer Distance by 1.6–3.2 times. Furthermore, MV-DUSt3R+ introduces cross-reference-view attention blocks, further improving reconstruction quality in large-scale scenes.
Multi-View Pose-Agnostic Change Localization with Zero Labels: This paper proposes the first zero-label, pose-agnostic multi-view change detection method. By constructing a change-aware 3DGS representation to fuse multi-view change information, it improves mIoU by 1.7 times compared to the baseline and is capable of generating change masks for unseen views.
MVBoost: Boost 3D Reconstruction with Multi-View Refinement: MVBoost proposes a framework to boost 3D reconstruction by generating pseudo-ground-truth data through a multi-view refinement strategy. It elegantly combines the high precision of multi-view generative models with the consistency advantages of 3D reconstruction models, achieving SOTA single-image-to-3D reconstruction performance on the GSO dataset (PSNR 18.561, CD 0.101).
MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model: MVGenMaster proposes a multi-view diffusion model that integrates metric depth geometric priors. Combined with the MvD-1M dataset containing 1.6 million scenes and a training-free key-rescaling technique, it can generate up to 100 novel views from an arbitrary reference view in a single forward pass, comprehensively outperforming CAT3D and ViewCrafter on both in-distribution and out-of-distribution NVS benchmarks.
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D: MVPaint proposes a three-stage 3D texture generation framework consisting of Synchronized Multi-View Generation (SMG) + Spatial-Aware 3D Inpainting (S3I) + UV Refinement (UVR). By synchronizing multiple views in the image domain rather than the latent domain, performing inpainting in the 3D point cloud space rather than the UV space, and utilizing a spatial-aware seam smoothing algorithm, it comprehensively outperforms existing state-of-the-art (SOTA) methods on both the Objaverse and GSO T2T benchmarks.
MVSAnywhere: Zero-Shot Multi-View Stereo: This paper proposes MVSAnywhere (MVSA), a general-purpose multi-view stereo matching architecture. By using a Cost Volume Patchifier, cost volume information is efficiently tokenized and fused with monocular ViT features (Mono/Multi Cue Combiner). Combined with a view-count and scale-agnostic metadata encoding and a cascaded adaptive depth range estimation, MVSA achieves zero-shot SOTA on the Robust MVS Benchmark, while supporting an arbitrary number of source views and arbitrary depth ranges.
NeRFPrior: Learning Neural Radiance Field as a Prior for Indoor Scene Reconstruction: NeRFPrior utilizes a rapidly trained Grid-NeRF (TensoRF, 30 minutes) as a scene-specific prior to guide SDF learning through multi-view consistency constraints and a confidence-weighted depth consistency loss. The F1 score on ScanNet improves from 0.310 (MonoSDF) to 0.930 (+200%), with a total training time of only 4.7 hours (2.2x faster than MonoSDF).
Neuro-3D: Towards 3D Visual Decoding from EEG Signals: Neuro-3D is the first work to reconstruct colored 3D point clouds from electroencephalography (EEG) signals. It introduces the EEG-3D dataset (12 subjects, 72 Objaverse object categories, dynamic video + static image stimuli) and achieves cross-modal 3D visual decoding through a dynamic-static EEG fusion encoder, CLIP-aligned contrastive learning, and diffusion-based point cloud generation with color prediction.
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs: This paper proposes Node-RF, which tightly couples Neural ODEs with dynamic NeRFs by modeling continuous-time scene dynamics through the ODE evolution of latent vectors. This enables long-term temporal extrapolation beyond the training sequence and generalization across trajectories without requiring optical flow or depth supervision.
NoPain: No-box Point Cloud Attack via Optimal Transport Singular Boundary: NoPain proposes the first no-box adversarial attack method for point clouds. By leveraging semi-discrete optimal transport (OT) to calculate the mapping from noise to feature space, it samples adversarial perturbations at the singular boundaries (non-differentiable points) of the mapping. This approach requires no target classifier or surrogate model, achieving a 100% attack success rate (ASR) on PointNet with a generation time of only 28 ms per sample.
Novel View Synthesis with Pixel-Space Diffusion Models: VIVID achieves end-to-end novel view synthesis using the EDM2 pixel-space diffusion model. By employing a dual U-Net encoder-decoder with cross-attention to transfer geometric information, a simple camera pose embedding (instead of complex geometric encodings), and single-view data augmentation based on homography, it achieves an FID of 2.89 (51% lower than GenWarp) and a PSNR of 17.36 (+29%) on RealEstate10K.
ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos: ODHSR proposes the first unified framework to simultaneously perform online camera tracking, human pose estimation, and joint human-scene dense reconstruction from monocular RGB videos. Based on 3D Gaussian Splatting, it achieves a speedup of up to 75x compared to offline methods while matching or exceeding SOTA reconstruction quality.
OffsetOPT: Explicit Surface Reconstruction without Normals: Proposes OffsetOPT, a normal-free explicit surface reconstruction method. By training a triangle prediction network on uniformly distributed point clouds and generalizing it to arbitrary point clouds via point-wise offset optimization, it achieves state-of-the-art performance in both overall quality and sharp detail preservation.
Olympus: A Universal Task Router for Computer Vision Tasks: Olympus uses a multimodal large language model (MLLM) as a unified task router. By designing task-specific routing tokens and constructing a large-scale instruction dataset, it dispatches over 20 computer vision tasks (covering image, video, and 3D) to dedicated expert models, achieving a 94.75% single-task routing accuracy and a 91.82% chain-of-action precision.
On Denoising Walking Videos for Gait Recognition: This paper proposes DenoisingGait, which combines "knowledge-driven denoising" (utilizing generative diffusion models at specific timesteps to filter out gait-irrelevant information) and "geometry-driven denoising" (compressing multi-channel diffusion features into 2D direction vectors via the Feature Matching module) to generate a novel Gait Feature Field representation, achieving state-of-the-art (SOTA) performance on multiple RGB gait datasets.
One Diffusion to Generate Them All: This work proposes OneDiffusion, a 2.8B parameter unified diffusion model that models all conditional and target images as a frame sequence with varying noise scales. A single model supports multiple tasks including text-to-image, conditional generation, depth estimation, segmentation, multi-view generation, and ID customization.
Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces: Proposes a novel task of functional 3D scene graphs, utilizing VLMs and LLMs to construct 3D scene graphs featuring objects, interactive elements, and their functional relationships from RGB-D images through a progressive detection-description-reasoning pipeline, and establishes the FunGraph3D real-world dataset.
Open-World Amodal Appearance Completion: This paper proposes a training-free open-world amodal appearance completion framework that accepts flexible natural language queries (including both direct names and abstract descriptions). By unifying segmentation, occlusion analysis, and iterative inpainting, the framework reconstructs the complete appearance of occluded objects and outputs RGBA formats to support downstream applications such as 3D reconstruction and image editing.
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion: The paper proposes Ouroboros3D, which integrates multi-view generation and 3D reconstruction into a recursive diffusion process. By utilizing a 3D-aware feedback mechanism (rendering CCM and color maps as denoising conditions) and a joint training strategy, it resolves the issues of insufficient 3D consistency and domain gaps in two-stage methods, achieving state-of-the-art performance on the GSO dataset.
P-SLCR: Unsupervised Point Cloud Semantic Segmentation via Prototypes Structure Learning and Consistent Reasoning: This paper proposes P-SLCR, an unsupervised point cloud semantic segmentation method driven by a prototype library. By separating points into "consistent" and "ambiguous" categories, aligning consistent points with prototypes through consistent structure learning, and constraining two prototype libraries via semantic relation consistent reasoning, it achieves an unsupervised mIoU of 47.1% on S3DIS, outperforming the fully supervised PointNet.
Pano360: Perspective to Panoramic Vision with Geometric Consistency: Proposes Pano360, the first Transformer framework that performs panoramic stitching in 3D photogrammetric space. It leverages a pretrained VGGT backbone to obtain 3D-aware multi-view feature alignment and multi-feature joint optimization for seam detection. It supports 2 to hundreds of input images, achieving a success rate of 97.8% in challenging scenarios with weak texture, large parallax, or repetitive patterns.
Parametric Point Cloud Completion for Polygonal Surface Reconstruction: This work proposes PaCo, a new paradigm for parametric point cloud completion. Instead of predicting individual points, PaCo infers parametric planar primitives from incomplete point clouds. Through hierarchical encoding, proxy generation, and bipartite matching optimization, it directly bridges the gap between incomplete data and high-quality polygonal surface reconstruction.
PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model: PartRM proposes a 4D reconstruction framework based on a large-scale 3D Gaussian reconstruction model. It simultaneously models object appearance, geometry, and part-level motion from multi-view images. By constructing the PartDrag-4D dataset, a multi-scale drag embedding module, and a two-stage training strategy, it achieves state-of-the-art performance in part-level motion learning and can be applied to robot manipulation tasks.
PBR-NeRF: Inverse Rendering with Physics-Based Neural Fields: Based on NeILF++, PBR-NeRF introduces two physics-based prior losses (conservation of energy loss and NDF-weighted specular loss), effectively constraining the material-lighting decomposition ambiguity in inverse rendering. It achieves SOTA material estimation without sacrificing the quality of novel view synthesis.
PCDreamer: Point Cloud Completion Through Multi-view Diffusion Priors: PCDreamer leverages large-scale multi-view diffusion models to "dream" multi-view images of the missing regions from partial point clouds. It achieves high-fidelity point cloud completion through a multi-modality shape fusion module and a confidence-guided shape consolidation module, and is particularly strong at reconstructing fine-grained local details.
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models: This paper proposes Perception Tokens, a method to encode intermediate visual representations (e.g., depth maps, bounding boxes) into auxiliary reasoning tokens. This enables multimodal language models to enhance visual reasoning capabilities by generating perception tokens as intermediate steps, analogous to textual chain-of-thought.
Perceptual Inductive Bias is What You Need Before Contrastive Learning: Inspired by David Marr's multi-stage visual processing theory, this paper proposes adding a "pre-pretraining" stage prior to standard contrastive learning. By using foreground-background segmented shape silhouettes and intrinsic image decomposition (albedo + shading) as perceptual inductive biases, this approach achieves 2x faster convergence on ResNet18 and comprehensive improvements across downstream tasks such as segmentation, depth estimation, and recognition.
PerLA: Perceptive 3D Language Assistant: Proposes PerLA, a perceptive 3D language assistant that achieves parallel capture of high-resolution local details via Hilbert curve partitioning, and aggregates local information with low-resolution global context using cross-attention and graph convolutional networks. This significantly improves fine-grained perception in 3D scene understanding without increasing the number of LLM input tokens.
PERSE: Personalized 3D Generative Avatars from A Single Portrait: Starting from a single portrait, PERSE synthesizes a large-scale facial attribute editing video dataset and trains a 3DGS-based generative avatar model. This enables smooth interpolated editing of facial attributes in a continuously decoupled latent space while maintaining individual identity consistency.
Perturb-and-Revise: Flexible 3D Editing with Generative Trajectories: Perturb-and-Revise accomplishes flexible 3D editing by applying adaptive perturbations in the NeRF parameter space to allow parameters to escape local minima, optimizing along the generative trajectory using score distillation from multi-view diffusion models, and integrating identity-preserving gradients. This represents the first method to support large-scale geometric and appearance modifications, including pose changes and the addition of new objects.
PGC: Physics-Based Gaussian Cloth from a Single Pose: This paper proposes PGC, a method to reconstruct simulatable, realistic garment assets from only a single-frame multi-view capture. By incorporating a hybrid strategy of mesh-embedded 3D Gaussians and physically-based rendering (PBR), the method achieves garment rendering in novel poses with both high-frequency details and correct lighting effects.
PhysAnimator: Physics-Guided Generative Cartoon Animation: PhysAnimator combines physics simulation (2D deformable body simulation) with data-driven video diffusion models to generate physically plausible and anime-style dynamic animations from static anime illustrations, supporting interactive control via energy strokes and binding points.
PhysGen3D: Crafting a Miniature Interactive World from a Single Image: This paper proposes the PhysGen3D framework, which transforms a single image into a camera-centric interactive 3D scene. By combining the geometric/semantic understanding of visual foundation models with physics-based simulation and rendering, it generates videos that are more physically realistic and controllable than those from commercial I2V models.
PICO: Reconstructing 3D People In Contact with Objects: PICO proposes a comprehensive framework comprising a dataset (PICO-db) and a fitting method (PICO-fit). By establishing dense bijective contact correspondences between humans and objects, it reconstructs realistic 3D human-object interaction scenes from a single in-the-wild image, supporting arbitrary object categories.
Pixel-Aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision: This paper develops a prism-based pixel-aligned RGB-NIR stereo camera system mounted on a mobile robot to collect a large-scale dataset under diverse illumination conditions. It proposes two methods—image fusion and feature fusion—to enable existing RGB pre-trained vision models to leverage NIR information with little to no fine-tuning, achieving significant improvements in tasks such as depth estimation, object detection, and SfM.
PMA: Towards Parameter-Efficient Point Cloud Understanding via Point Mamba Adapter: This paper proposes the Point Mamba Adapter (PMA), which structures and fuses complementary features from all intermediate layers of a pre-trained point cloud model into an ordered sequence using the Mamba architecture. Combined with a Geometrically-Constrained Gated Prompt Generator (G2PG) to dynamically optimize sequence ordering in 3D space, PMA achieves or surpasses full fine-tuning performance while updating only 1% of the parameters.
PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning: PointLoRA integrates Low-Rank Adaptation (LoRA) with multi-scale token selection, yielding a simple and effective parameter-efficient fine-tuning paradigm for pre-trained point cloud models. With only 3.43% trainable parameters, it achieves SOTA or competitive results relative to full fine-tuning on ScanObjectNN, ModelNet40, and ShapeNetPart.
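To make the low-rank idea in the PointLoRA entry concrete, here is a minimal sketch of a generic LoRA weight update. It is not PointLoRA's implementation: the matrix sizes, the rank, the scaling factor `alpha`, and the helper names are hypothetical, and the multi-scale token selection branch is omitted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, the effective weight after a LoRA update.

    w: frozen weight, d_out x d_in
    b: trainable factor, d_out x r
    a: trainable factor, r x d_in
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters relative to the full weight matrix of one layer."""
    return rank * (d_in + d_out) / (d_in * d_out)


# Hypothetical toy sizes: a 2 x 3 frozen weight with a rank-1 update.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.0, -0.5]]
merged = lora_weight(w, b, a, alpha=2.0)
```

Only `b` and `a` would be trained while `w` stays frozen. `lora_param_fraction` covers a single layer in isolation, so it is not expected to reproduce the 3.43% figure reported for the full model.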
POp-GS: Next Best View in 3D-Gaussian Splatting with P-Optimality: Introduces the P-Optimality theory from classical optimal experimental design into 3D-GS, deriving a general covariance matrix based on the Hessian matrix. Two approximation schemes, diagonal and block-diagonal, are proposed, significantly outperforming the information gain quantification of FisherRF under D-Optimality and T-Optimality criteria.
Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors: This paper proposes Pow3R, a general-purpose 3D vision regression model built upon DUSt3R, which can flexibly incorporate any combination of auxiliary information such as camera intrinsics, relative pose, and sparse/dense depth. It achieves SOTA on multiple 3D vision tasks and unlocks new capabilities like native-resolution inference.
PRaDA: Projective Radial Distortion Averaging: PRaDA proposes a radial distortion calibration method operating entirely in projective space. By performing weighted averaging of distortion estimates from multiple image pairs in the function space, it achieves high-precision distortion correction without requiring 3D point reconstruction or camera pose estimation, significantly outperforming traditional methods such as COLMAP and GLOMAP on multiple datasets with severe distortion.
Preconditioners for the Stochastic Training of Neural Fields: This paper proposes a theoretical preconditioning framework for the stochastic training of neural fields, proving that curvature-aware diagonal preconditioners (such as ESGD) significantly accelerate the training of neural fields with sine/Gaussian/wavelet activations, while showing no significant benefit for ReLU(PE) activations, thereby providing theoretical guidance for optimizer selection in neural fields.
PrEditor3D: Fast and Precise 3D Shape Editing: This paper proposes PrEditor3D, a training-free 3D editing method. By using a pipeline that combines synchronized multi-view diffusion editing with feed-forward 3D reconstruction, and integrating color-coded 3D segmentation and voxel feature fusion, it achieves fast (within minutes) and precise (only modifying the target region) high-quality 3D shape editing.
ProbeSDF: Light Field Probes for Neural Surface Reconstruction: ProbeSDF redesigns the appearance model of SDF-based neural surface reconstruction by decoupling and storing spatial and angular features in voxel grids of different resolutions. Utilizing minimal parameters (4 per voxel) and a tiny MLP, it achieves superior geometry and image quality. Training takes only 1-2 minutes and supports real-time rendering.
PromptHMR: Promptable Human Mesh Recovery: PromptHMR proposes a Transformer-based promptable human pose and shape estimation method. By combining spatial prompts (bounding boxes, segmentation masks) and semantic prompts (language descriptions, interaction labels), it flexibly guides full-image 3D human reconstruction, achieving SOTA performance on multiple benchmarks and supporting video-based motion estimation in world coordinates.
ProtoDepth: Unsupervised Continual Depth Completion with Prototypes: ProtoDepth proposes a prototype-based continual learning method. By freezing the pre-trained model and learning a lightweight prototype set for each new domain to modulate latent features, it reduces the forgetting rate by over 50% in both indoor and outdoor scenarios.
ProxyTransformation: Preshaping Point Cloud Manifold with Proxy Attention for 3D Visual Grounding: This paper proposes ProxyTransformation, which enhances the point cloud manifold structure efficiently before training via deformable point clustering and proxy attention mechanisms. It utilizes textual information to guide sub-manifold translation and image information to guide intra-sub-manifold transformations, achieving a significant improvement of 7.49% on the ego-centric 3D visual grounding task.
PS-EIP: Robust Photometric Stereo Based on Event Interval Profile: This paper proposes a robust photometric stereo method based on the Event Interval Profile (EIP). By utilizing the temporal continuity and profile shape of event interval time series, it detects outliers caused by shadows and specular reflections, significantly outperforming EventPS-FCN without requiring deep learning.
PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting: Proposes PUP 3D-GS, a principled 3D Gaussian Splatting pruning method based on the Fisher Information Matrix, which achieves a 90% Gaussian pruning rate through second-order sensitivity scoring of spatial parameters (position + scale) while maintaining better visual quality and foreground details than existing heuristic methods.
RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting: RainyGS integrates physically-based raindrop simulation and shallow water dynamics with the 3D Gaussian Splatting rendering framework. It achieves high-fidelity, physically accurate, and real-time (>30fps) dynamic rain effect synthesis in open-world scenes for the first time, supporting flexible control from light rain to downpours.
RASP: Revisiting 3D Anamorphic Art for Shadow-Guided Packing of Irregular Objects: Inspired by 3D anamorphic art, RASP utilizes a differentiable rendering framework guided by multi-view shadow/silhouette images to optimize the arrangement of irregular 3D objects inside a container. Concurrently, it introduces SDF-based collision and extrusion handling strategies to achieve high-occupancy packing, part assembly, and multi-view art creation.
RDD: Robust Feature Detector and Descriptor Using Deformable Transformer: RDD proposes a dual-branch architecture that utilizes a convolutional network for keypoint detection and a deformable Transformer for descriptor extraction. By modeling geometric invariance and global context through deformable attention, it comprehensively outperforms existing methods on sparse and semi-dense feature matching tasks with large viewpoint and scale variations.
ReCap: Better Gaussian Relighting with Cross-Environment Captures: ReCap leverages multiple sets of images of the same object under different lighting environments as multi-task supervision signals, sharing material properties while independently optimizing lighting representations. This fundamentally resolves the albedo-lighting ambiguity. Combined with simplified shading functions and HDR post-processing, it significantly outperforms all existing methods on an expanded relighting benchmark.
Reconstructing Animals and the Wild: This paper proposes the RAW method, which uses an LLM to autoregressively decode CLIP image embeddings into structured compositional 3D scene representations (animals + natural environments). It innovatively introduces a CLIP projection head to replace discrete asset name predictions, enabling the model to generalize across larger-scale asset collections. This achieves the first simultaneous reconstruction of both animals and their environment from a single natural image.
Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning: This paper proposes a dual-branch optimization framework that reconstructs accurate 3D poses, natural interaction relationships, and plausible body contacts of close human interactions from monocular in-the-wild videos by combining human appearance constraints (3D Gaussian Splatting), a proxemics diffusion prior, and physical constraints, achieving state-of-the-art (SOTA) performance on Hi4D and 3DPW.
Reconstructing Humans with a Biomechanically Accurate Skeleton: HSMR represents the first method to estimate biomechanically accurate skeleton (SKEL) parameters from a single image. It overcomes the lack of ground-truth training data via an iterative pseudo-label refinement strategy. HSMR matches HMR2.0's performance on standard human pose estimation benchmarks while outperforming it significantly on extreme pose scenarios (MOYO yoga dataset) by over 18mm MPJPE, all while effectively avoiding unnatural joint rotations.
Reconstructing In-the-Wild Open-Vocabulary Human-Object Interactions: The authors propose Open3DHOI, the first open-vocabulary in-the-wild 3D Human-Object Interaction (HOI) dataset (comprising over 2.5k images, 133 object categories, and 120 action categories). They also design Gaussian-HOI, a 3D Gaussian Splatting-based HOI optimizer that reconstructs spatial human-object interactions and learns contact regions via Gaussian rendering.
Reconstructing People, Places, and Cameras: HSfM unifies human mesh estimation with the traditional SfM framework. By jointly optimizing humans, scene point clouds, and camera parameters, it achieves metric-scale world coordinate reconstruction from uncalibrated sparse multi-view images, reducing human localization error from 3.59m to 0.50m.
Recovering Dynamic 3D Sketches from Videos: Liv3Stroke proposes the first method to extract dynamic 3D sketches from videos. It abstracts object motion using a set of deformable 3D Bézier curves, achieving viewpoint-consistent motion sketch reconstruction by learning point cloud motion guidance and stroke-by-stroke deformation.
Ref-GS: Directional Factorization for 2D Gaussian Splatting: This paper proposes Ref-GS, which introduces deferred rendering and directional factorization into 2D Gaussian Splatting (2DGS). It models far-field illumination and surface roughness variations using a Sph-Mip spherical feature grid, and then achieves spatially varying view-dependent effects via compact tensor decomposition. This approach achieves state-of-the-art (SOTA) performance in reflective scene rendering and geometry recovery while maintaining real-time rendering at over 45 FPS.
Reference-Based 3D-Aware Image Editing with Triplanes: Based on the EG3D triplane representation space, a reference-image-guided 3D-aware editing framework is proposed, integrating four modules: encoder, automatic localization, spatial disentanglement, and fusion learning. It achieves editing results superior to existing 2D/3D GAN and diffusion methods across diverse domains including human faces, 360-degree heads, animals, cartoons, and full-body clothing.
Regularizing INR with Diffusion Prior for Self-Supervised 3D Reconstruction of Neutron CT Data: This paper proposes DINR (Diffusive INR), which combines implicit neural representation (INR/SIREN) with a pretrained diffusion model prior. By regularizing the INR reconstruction with the diffusion denoising output using a proximal loss at each DDIM timestep, DINR outperforms FBP, pure INR, DD3IP, and classical MBIR (qGGMRF) methods on sparse-view neutron CT (down to 4-5 views).
Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation: Relation3D enhances the modeling of internal relations within scene features and relations between queries in Transformer-based 3D instance segmentation via three components: Adaptive Superpoint Aggregation Module (ASAM), Contrastive Learning-guided Superpoint Refinement (CLSR), and Relation-aware Self-Attention (RSA), achieving SOTA results on ScanNetV2/ScanNet++/ScanNet200/S3DIS.
RelationField: Relate Anything in Radiance Fields: RelationField introduces object-to-object relationship modeling to neural radiance fields for the first time. By distilling relationship knowledge from multimodal large language models (such as GPT-4o) into an implicit relationship feature head in NeRF, it enables open-vocabulary 3D scene relationship querying and scene graph generation, significantly outperforming existing methods on the 3DSSG benchmark.
Relative Pose Estimation through Affine Corrections of Monocular Depth Priors: This paper proposes three new relative pose solvers that leverage monocular depth priors by explicitly modeling the affine (scale + shift) ambiguity of depth predictions. It also designs a hybrid estimation framework that combines depth-aware solvers with classical point solvers, significantly improving pose estimation accuracy under both calibrated and uncalibrated settings.
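The affine (scale + shift) ambiguity mentioned in this entry can be illustrated with a small sketch. This is not one of the paper's pose solvers: it only shows, under the assumption that a few reference depths are available, how a scale `s` and shift `t` relating predicted depth to reference depth can be recovered by ordinary least squares. The function name and the sample values are hypothetical.

```python
def fit_scale_shift(pred, target):
    """Least-squares (s, t) minimizing sum((s * p + t - d) ** 2) over paired depths."""
    n = len(pred)
    if n < 2 or n != len(target):
        raise ValueError("need at least two paired depth samples")
    mean_p = sum(pred) / n
    mean_d = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("predicted depths are constant; scale is not identifiable")
    cov = sum((p - mean_p) * (d - mean_d) for p, d in zip(pred, target))
    s = cov / var_p
    t = mean_d - s * mean_p
    return s, t


# Hypothetical values: the reference depth equals 2 * prediction + 0.5.
pred_depth = [1.0, 2.0, 3.0, 4.0]
ref_depth = [2.5, 4.5, 6.5, 8.5]
scale, shift = fit_scale_shift(pred_depth, ref_depth)
```

A monocular depth network typically predicts depth only up to such an affine transform, which is why a pose solver that uses these priors has to account for both unknowns rather than treating the prediction as metric.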
Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting: The authors propose Unified-Lift, an end-to-end, object-aware 2D-to-3D segmentation method based on 3DGS. By learning the association between a global object-level codebook and Gaussian-level features, it eliminates the dependency of existing methods on pre- and post-processing, significantly outperforming the state-of-the-art in multi-view consistent instance segmentation.
Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation: Rewis3d leverages feed-forward 3D reconstruction (MapAnything) to obtain 3D point clouds from 2D videos as auxiliary supervision signals. Utilizing a dual Student-Teacher architecture and weighted cross-modal consistency (CMC) loss, it improves weakly-supervised 2D semantic segmentation performance by 2-7% mIoU under sparse annotations (points/scribbles/coarse labels), while remaining purely 2D during inference.
RigGS: Rigging of 3D Gaussians for Modeling Articulated Objects in Videos: This paper proposes RigGS, an automated skeleton-driven modeling method without template priors, which extracts 3D skeletons from monocular videos and rigs 3D Gaussian representations to support novel view synthesis, pose editing, motion interpolation, and motion transfer.
RNG: Relightable Neural Gaussians: This work proposes the Relightable Neural Gaussians (RNG) framework, which learns a latent vector for each Gaussian element conditioned on the view and light directions. By integrating shadow cues and a hybrid forward-deferred optimization strategy, RNG achieves high-quality relighting of soft-boundary objects.
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation: RoomTour3D leverages online house tour videos to construct a geometry-aware video-instruction dataset. By obtaining geometry information of walking trajectories via 3D reconstruction and combining it with GPT-4 to generate open-vocabulary instructions, it significantly boosts performance on multiple VLN benchmarks and supports zero-shot navigation.
S2Gaussian: Sparse-View Super-Resolution 3D Gaussian Splatting: A two-stage framework, S2Gaussian, is proposed to address the joint sparse and low-resolution view scene reconstruction for the first time. The first stage optimizes low-resolution Gaussians with depth regularization and initializes high-resolution Gaussians via Gaussian Shuffle Split. The second stage refines the high-resolution scene using blur-free inconsistency modeling and a 3D robust optimization strategy.
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE: SAR3D proposes an autoregressive framework based on multi-scale 3D VQVAE, achieving high-quality 3D object generation in 0.82 seconds via "next-scale prediction" (instead of next-token prediction). Furthermore, the same set of VQVAE tokens can fine-tune LLMs to enable detailed 3D object understanding and description.
SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens: This paper proposes SAT-HMR, a DETR-based real-time multi-person 3D human mesh estimation framework. By introducing scale-adaptive tokens—utilizing high-resolution tokens for small-scale humans, low-resolution tokens for large-scale humans, and pooled/compressed tokens for the background—it improves inference speed to 24 FPS while maintaining the accuracy of high-resolution inputs, achieving an optimal balance between precision and speed.
Scalable Autoregressive Monocular Depth Estimation: A depth autoregressive model, DAR, is proposed to reformulate the monocular depth estimation task into an autoregressive prediction paradigm through two ordered objectives: resolution autoregression (gradually generating depth maps from low to high resolution) and granularity autoregression (recursively refining depth intervals from coarse to fine). The model scales up to 2.0B parameters and achieves state-of-the-art results on KITTI and NYU Depth v2.
Scaling Mesh Generation via Compressive Tokenization: This paper proposes Blocked and Patchified Tokenization (BPT), an efficient representation method that compresses triangular mesh sequences by approximately 75%. This enables autoregressive Transformers to process high-fidelity meshes with over 8k faces for the first time, achieving production-grade quality in point cloud/image-conditioned generation and validating a positive correlation scaling law between the number of mesh faces and generation performance.
Scaling Properties of Diffusion Models for Perceptual Tasks: This paper systematically studies the scaling properties of diffusion models on perceptual tasks such as depth estimation, optical flow prediction, and amodal segmentation. It establishes power-law scaling relations for both training and inference, and demonstrates that increasing test-time compute (via more denoising steps and multi-prediction ensembling) significantly boosts performance, achieving competitive results while using far less training data and computation than previous SOTA.
SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation: This paper proposes SceneFactor, which achieves text-guided large-scale 3D indoor scene generation through factored latent space diffusion (generating coarse semantic box layouts first, followed by fine geometry details). It also supports intuitive local editing via semantic box manipulation.
SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow: SCFlow2 proposes a plug-and-play 6D object pose refinement framework that embeds rigid motion fields from 3D scene flow into a shape-constrained recurrent matching network. By integrating depth maps as an iterative regularization for end-to-end training, it consistently improves the accuracy of six state-of-the-art (SOTA) methods as a post-processing step across seven datasets in the BOP benchmark, without requiring any retraining.
SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation: SCOPE proposes a plug-and-play background-guided prototype enrichment framework. After training on base classes, a class-agnostic segmentation model is utilized to mine pseudo-instances from background regions to establish an Instance Prototype Bank (IPB). When novel classes emerge in a few-shot manner, background prototypes are fused with few-shot prototypes using Contextual Prototype Retrieval (CPR) and Attention-Based Prototype Enrichment (APE), achieving up to 6.98% improvement in novel class IoU on ScanNet/S3DIS.
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding: This paper proposes SeeGround, a training-free zero-shot 3D visual grounding framework. By representing the 3D scene as a hybrid of query-aligned rendered images and spatially-enhanced textual descriptions, it leverages 2D vision-language models to outperform previous zero-shot methods on ScanRefer by 7.7% in accuracy.
Seeing A 3D World in A Grain of Sand: A catadioptric imaging system based on eight pairs of planar mirrors is designed to capture 360° surrounding multi-view images of miniature scenes in a single snapshot, combining visual hull depth constraints to improve sparse-view 3DGS reconstruction quality.
SelfSplat: Pose-Free and 3D Prior-Free Generalizable 3D Gaussian Splatting: SelfSplat proposes a generalizable 3D Gaussian Splatting framework that is pose-free and 3D prior-free. By unifying self-supervised depth/pose estimation with the 3D-GS representation, coupled with a match-aware pose network and a pose-aware depth refinement module, it significantly outperforms existing pose-free methods on the RealEstate10K, ACID, and DL3DV datasets.
SemAlign3D: Semantic Correspondence Between RGB-Images Through Aligning 3D Object-Class Representations: Leverages monocular depth estimation to construct object-class 3D representations, aligning them with input images at inference time by minimizing an alignment energy function (combining semantic and spatial likelihood). This method improves the overall PCK@0.1 on SPair-71k from 85.6% to 88.9%, with gains exceeding 10 percentage points across three categories.
Seurat: From Moving Points to Depth: This paper proposes Seurat, a monocular video depth estimation method based on 2D point trajectories. By analyzing the motion patterns of tracked points using spatial and temporal Transformers to infer depth changes over time, Seurat achieves zero-shot generalization to real-world scenes while being trained solely on synthetic data.
SfM-Free 3D Gaussian Splatting via Hierarchical Training: Proposes an SfM-free 3DGS method (SFGS) that merges multiple local 3DGS models into a unified scene representation through a hierarchical training strategy, and utilizes video frame interpolation to improve camera pose estimation, achieving a 2.25dB PSNR improvement on Tanks and Temples.
SGCR: Spherical Gaussians for Efficient 3D Curve Reconstruction: SGCR proposes Spherical Gaussians, a concise 3D representation that simplifies the anisotropic ellipsoids of standard 3D Gaussians into uniform-sized spheres. Using only 2D edge maps for supervision, it faithfully aligns them with 3D object edges. It then reconstructs precise 3D parametric curves efficiently via a novel rational Bézier curve extraction algorithm, achieving 50 times faster speed and better accuracy than NEF and EMAP.
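For readers unfamiliar with the curve type named in the entry above, a rational Bézier curve weights each control point before normalizing, which lets it represent conics exactly. The sketch below only evaluates such a curve. It is not SGCR's extraction algorithm, and the function name `rational_bezier` is an assumption made for illustration.

```python
from math import comb

import numpy as np

def rational_bezier(ctrl, weights, t):
    """Evaluate a rational Bezier curve at parameters t in [0, 1]."""
    ctrl = np.asarray(ctrl, dtype=float)
    w = np.asarray(weights, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    n = len(ctrl) - 1
    # Bernstein basis values, shape (len(t), n + 1).
    basis = np.stack(
        [comb(n, i) * t**i * (1.0 - t) ** (n - i) for i in range(n + 1)], axis=1
    )
    wb = basis * w
    # Weighted combination of control points, normalized by the total weight.
    return (wb @ ctrl) / wb.sum(axis=1, keepdims=True)

# A quadratic rational Bezier with middle weight sqrt(2)/2 traces a quarter circle.
ctrl = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
weights = [1.0, np.sqrt(2.0) / 2.0, 1.0]
pts = rational_bezier(ctrl, weights, np.linspace(0.0, 1.0, 11))
```

With equal weights the same function reduces to an ordinary polynomial Bézier curve, which cannot reproduce the circular arc exactly.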
ShapeShifter: 3D Variations Using Multiscale and Sparse Point-Voxel Diffusion: ShapeShifter proposes a method to generate high-quality shape variations from a single 3D reference model. By combining a sparse voxel grid (fVDB) with point-normal-color sampling in a multi-scale diffusion model, it achieves minute-level training and interactive inference on consumer-grade GPUs.
Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation: Proposes Sharp-It, a multi-view to multi-view diffusion model that enhances low-quality object outputs from 3D generative models like Shap-E into high-quality multi-view images via 2D diffusion, reducing the FID to 6.60 and supporting appearance editing in just 10 seconds.
SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation: SharpDepth is proposed to inject fine-grained edge detail knowledge from generative depth models (e.g., Lotus) into the predictions of discriminative metric depth models (e.g., UniDepth) via diffusion distillation. By leveraging noise-aware gating and label-free training, it achieves an optimal balance between metric accuracy and edge sharpness.
SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing: SimAvatar proposes the first fully simulation-ready text-driven 3D avatar generation framework. By representing the body, clothing, and hair as layered representations consisting of SMPL meshes, clothing meshes, and hair strands, and attaching 3D Gaussians to learn the appearance, the method is able to leverage diffusion model priors for realistic textures while directly interfacing with physical/neural simulators to produce realistic dynamics.
SimVS: Simulating World Inconsistencies for Robust View Synthesis: SimVS leverages video diffusion models to simulate inconsistencies (e.g., changes in lighting, object motion) in real-world casual captures. It uses this simulated data to train a multi-view harmonization network that converts inconsistent, sparse observations into consistent multi-view images, thereby enabling high-quality static 3D reconstruction from in-the-wild casually captured scenes.
SiNR: Sparsity Driven Compressed Implicit Neural Representations: Discovering the key property that the weight space of INRs exhibits an approximate Gaussian distribution, this work employs a random sensing matrix based on compressed sensing theory to transform weight vectors into high-dimensional sparse codes. This achieves a fundamental INR compression that does not rely on quantization schemes and can be seamlessly combined with any existing INR compression method.
Sketchy Bounding-Box Supervision for 3D Instance Segmentation: The Sketchy-3DIS framework is proposed, introducing inaccurate ("sketchy") 3D bounding box annotations to weakly-supervised 3D instance segmentation for the first time. Through joint training of an adaptive box-to-point pseudo-label generator and a coarse-to-fine instance segmenter, it achieves SOTA performance on ScanNetV2 and S3DIS, even outperforming some fully-supervised methods.
SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos: SLAM3R proposes a two-level feed-forward neural network system. It directly regresses local 3D point maps from video segments using an Image-to-Points (I2P) network, and then progressively aligns them to a global coordinate system using a Local-to-World (L2W) network. This achieves SOTA dense reconstruction accuracy and completeness at 20+ FPS, entirely without explicitly solving for camera parameters.
SOGS: Second-Order Anchor for Advanced 3D Gaussian Splatting: This work proposes SOGS, which introduces second-order anchors (utilizing the covariance matrix to capture correlations across feature dimensions for feature enhancement) and selective gradient loss into anchor-based 3D-GS. It achieves superior rendering quality compared to Scaffold-GS while reducing anchor feature dimensions from 32 down to 12-16.
Sonata: Self-Supervised Learning of Reliable Point Representations: Proposes Sonata, a reliable point cloud self-supervised learning method. By identifying and resolving the "geometric shortcut" issue (where representation collapses into low-level spatial features such as surface normals or point height), it increases the linear probe mIoU on ScanNet from 21.8% to 72.5% (a 3.3x improvement) and achieves SOTA on multiple 3D perception tasks.
SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding: SoundVista proposes a method for synthesizing ambient sound from sparse distributed microphone recordings at arbitrary novel views. By leveraging a Visual-Acoustic Binding (VAB) module to infer acoustic properties from panoramic RGB-D data, the method optimizes reference microphone layouts and adaptively weights the contributions of reference recordings using a Transformer, significantly outperforming existing methods in both simulated and real-world scenes.
SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts: Proposes SP3D, a two-stage training strategy that leverages large multimodal models (LMMs) to generate accurate cross-modal semantic prompts. Through dynamic clustering pseudo-label generation and distribution-shape scoring, it significantly boosts sparsely-supervised 3D object detection performance under an extremely low annotation rate (2%).
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images: SPAR3D proposes a two-stage method for reconstructing 3D objects from a single image. The first stage utilizes a lightweight point cloud diffusion model to generate a sparse point cloud to handle occlusion uncertainty, while the second stage employs a triplane transformer to convert the point cloud into a high-quality mesh with PBR materials, achieving 0.7-second inference and supporting interactive editing.
SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction: SPARS3R is proposed to combine the precise pose estimation from SfM with the dense depth priors from DUSt3R/MASt3R. It maps the dense point cloud to the SfM sparse point cloud via global fusion alignment, and then leverages Segment Anything Model (SAM) semantic segmentation to perform local alignment on outlier regions identified by RANSAC. This generates an initial point cloud that is both dense and pose-accurate, significantly improving the rendering quality of 3DGS in sparse views.
Sparse Point Cloud Patches Rendering via Splitting 2D Gaussians: A novel method is proposed to directly predict 2D Gaussians from point clouds for photorealistic rendering. By employing an entire-patch architecture, it achieves cross-category generalization. A splitting decoder upsamples sparse point clouds into denser Gaussian primitives, achieving state-of-the-art rendering quality and a real-time rendering speed of 142 FPS with only 2K-100K points.
Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering: This paper proposes SVRaster, an efficient radiance field rendering method that requires no neural networks or 3D Gaussians. By utilizing an adaptive multi-level sparse voxel representation and a customized rasterizer based on direction-dependent Morton sorting, it achieves artifact-free, real-time, high-fidelity rendering.
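The Morton sorting mentioned in the entry above relies on interleaving the bits of integer voxel coordinates so that a single integer key orders voxels along a Z-order curve. The sketch below shows only the plain Morton code. The direction-dependent variant used by SVRaster is not reproduced, and `morton3d` is a hypothetical helper name.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)      # x occupies bit positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)  # y occupies bit positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)  # z occupies bit positions 2, 5, 8, ...
    return code

# Sorting voxel coordinates by their Morton key groups spatial neighbours together.
voxels = [(2, 0, 0), (1, 1, 1), (0, 0, 1), (1, 0, 0), (0, 1, 0)]
ordered = sorted(voxels, key=lambda v: morton3d(*v))
```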
SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input: SpatialDreamer is proposed as a self-supervised stereo video synthesis framework based on video diffusion models. It addresses the shortage of training data through a Depth-guided Video Generation (DVG) module and ensures geometric and temporal consistency via a RefinerNet framework along with a consistency control module (incorporating stereo deviation strength and Temporal Interaction Learning, or TIL). Performance exceeds the Apple Vision Pro 3D converter.
Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting: Against resource-targeting attacks on 3DGS (which trigger Gaussian overgrowth via poisoned training images to deplete resources), this paper proposes a spectral defense: a 3D frequency filter achieves frequency-aware pruning by relating Gaussian covariance to spectral response, and 2D spectral regularization suppresses attack noise by penalizing the angular energy anisotropy of rendered images with entropy. This achieves a 5.92× compression in Gaussian count, a 3.66× reduction in memory, and a 4.34× speedup.
Spectral Informed Mamba for Robust Point Cloud Processing: A point cloud Mamba traversing strategy, SST, is proposed based on the Graph Laplacian Spectrum. It achieves isometry-invariant classification via Surface-Aware Spectral Traversing (SAST), accurate segmentation via Hierarchical Local Traversing (HLT), and addresses the token placement issue of MAE in Mamba via Traversing-Aware Relocation (TAR).
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes: Built on the 3DGS framework, SpectroMotion models dynamic objects via a deformable Gaussian MLP and time-varying illumination effects via a deformable reflection MLP. Combined with a canonical environment map and a coarse-to-fine three-stage training strategy, it achieves high-quality 3D reconstruction and real-time rendering of dynamic specular scenes for the first time.
Speedy-Splat: Fast 3D Gaussian Splatting with Sparse Pixels and Sparse Primitives: Speedy-Splat is proposed to accelerate 3DGS rendering through two complementary pipelines: (1) SnugBox and AccuTile, which precisely locate the screen-space extent of Gaussians to reduce redundant pixel processing, and (2) efficient pruning (Soft and Hard Pruning) to reduce the number of Gaussians by over 90%. Combined, they achieve an average rendering speedup of 6.71×, alongside a 10.6× reduction in model size and a 1.4× training speedup.
SphereUFormer: A U-Shaped Transformer for Spherical 360 Perception: SphereUFormer proposes a U-shaped Transformer architecture that operates directly in the spherical domain (icosphere mesh). By incorporating a spherical local self-attention mechanism and sphere-specific up/downsampling operations, it avoids the distortions introduced by Equirectangular Projection (ERP), comprehensively outperforming existing methods on both 360° depth estimation and semantic segmentation tasks.
SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis: A SplatFlow framework is proposed, consisting of a multi-view rectified flow (RF) model and a Gaussian Splatting decoder (GSDecoder), which jointly generates multi-view images, depth, and camera poses in latent space, achieving unified 3DGS generation and editing via training-free inversion and inpainting techniques.
SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video: SplineGS proposes a dynamic 3DGS framework based on cubic Hermite splines. By modeling the continuous trajectories of dynamic Gaussians through Motion-Adaptive Spline (MAS) and Motion-Adaptive Control Point pruning (MACP) while jointly optimizing camera parameters, it achieves SOTA dynamic novel view synthesis and real-time rendering without requiring COLMAP.
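A cubic Hermite spline, the primitive named in the entry above, interpolates between two control points using their positions and tangents. The sketch below evaluates a single segment only. How SplineGS places or prunes control points is not shown, and the names `hermite`, `p0`, `p1`, `m0`, `m1` are illustrative assumptions.

```python
import numpy as np

def hermite(p0, p1, m0, m1, t):
    """Cubic Hermite segment from p0 to p1 with tangents m0, m1, for t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1   # weight of the start position
    h10 = t3 - 2 * t2 + t       # weight of the start tangent
    h01 = -2 * t3 + 3 * t2      # weight of the end position
    h11 = t3 - t2               # weight of the end tangent
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1

# Hypothetical trajectory segment of one Gaussian centre in 3D.
p0 = np.array([0.0, 0.0, 0.0])
p1 = np.array([1.0, 2.0, 0.0])
m0 = np.array([1.0, 0.0, 0.0])
m1 = np.array([0.0, 1.0, 0.0])
mid = hermite(p0, p1, m0, m1, 0.5)
```

The segment passes through `p0` at `t = 0` and `p1` at `t = 1`, so consecutive segments that share a control point and tangent join smoothly.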
Stable-SCore: A Stable Registration-Based Framework for 3D Shape Correspondence: Stable-SCore revisits the "registration-correspondence" paradigm by leveraging 2D foundation models (Stable Diffusion + DINO) to establish robust 2D character correspondences. It proposes a semantic-flow-guided registration method (based on Neural Jacobian Fields) to bridge 2D correspondence and 3D deformation via differentiable rendering, significantly outperforming functional map-based methods on non-isometric character shape correspondence tasks.
StageDesigner: Artistic Stage Generation for Scenography via Theater Scripts: This paper proposes StageDesigner, the first AI-driven framework for artistic stage generation. It leverages LLMs to analyze scripts to extract scene and imagery descriptions, implements foreground entity layouts via a multi-level collision map, and generates background images aligned with the narrative atmosphere using a foreground projection module and a layout-controlled diffusion model.
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images: StdGEN proposes an efficient pipeline to generate high-quality, semantically decomposed (separate body, clothing, and hair) 3D characters from a single image. The core is a Semantic-aware Large Reconstruction Model (S-LRM) that achieves feedforward joint geometry-color-semantic reconstruction by incorporating a semantic field into NeRF/SDF, generating layered 3D characters ready for games/animations within 3 minutes.
Steepest Descent Density Control for Compact 3D Gaussian Splatting: SteepGS approaches density control in 3DGS from the perspective of non-convex optimization theory, revealing that its essence is to help Gaussian primitives escape saddle points. It derives an optimal splitting strategy—splitting into two descendants, halving the opacity, and shifting along the direction of the minimum eigenvector of the splitting matrix—which reduces the number of Gaussian points by approximately 50% while maintaining rendering quality.
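The splitting rule summarized in the entry above (two descendants, halved opacity, offsets along the minimum eigenvector of the splitting matrix) can be sketched as follows. How the splitting matrix is computed from the loss is not covered, and the step size `step` and the function name `steepest_split` are assumptions made for illustration.

```python
import numpy as np

def steepest_split(mean, opacity, split_matrix, step=0.01):
    """Split one Gaussian into two children along the minimum eigenvector."""
    # eigh returns eigenvalues in ascending order for a symmetric matrix,
    # so column 0 is the eigenvector of the smallest eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(split_matrix)
    direction = eigvecs[:, 0]
    children = [
        (mean + step * direction, opacity / 2.0),
        (mean - step * direction, opacity / 2.0),
    ]
    return children, eigvals[0]

# Toy splitting matrix whose smallest eigenvalue lies along the y axis.
S = np.diag([3.0, -1.0, 2.0])
children, min_eig = steepest_split(np.zeros(3), 0.8, S)
```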
Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos: Stereo4D proposes an automated pipeline to mine high-quality 4D reconstruction data from internet stereo fisheye videos (VR180). It generates over 100K video clips containing pseudo-metric 3D point clouds and long-range trajectories in the world coordinate system, and trains a DynaDUSt3R model that predicts 3D structure and motion directly from image pairs.
Structure from Collision: This paper introduces a brand-new task, "Structure from Collision" (SfC), which aims to infer the invisible internal structure (such as cavities) of an object by observing its appearance changes during collision. The authors design the SfC-NeRF model to optimize the internal density field under physical constraints, appearance preservation constraints, keyframe constraints, and a volume annealing strategy. The effectiveness of the method is verified on a dataset containing 115 objects with different structures and materials.
Structured 3D Latents for Scalable and Versatile 3D Generation: Proposes Structured LATents (SLat/TRELLIS), a unified 3D latent representation that fuses sparse 3D grids with DINOv2 multi-view features. It supports decoding into various formats such as radiance fields, 3D Gaussians, and meshes. A rectified flow Transformer with up to 2B parameters is trained on 500K 3D assets, generating high-quality 3D assets in approximately 10 seconds and supporting flexible local editing.
SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes: Proposes SUM Parts, the first large-scale benchmark dataset for part-level semantic segmentation of urban textured meshes (covering \(2.5\,\text{km}^2\) with 21 categories), featuring two types of labels (face annotation and texture pixel annotation), and develops an efficient interactive annotation tool combining 3D and 2D template matching.
SVG-IR: Spatially-Varying Gaussian Splatting for Inverse Rendering: This paper proposes the SVG-IR framework, which introduces a Spatially-Varying Gaussian (SVG) representation to allow a single Gaussian primitive to possess spatially-varying material and normal parameters. Combined with a physically-based indirect illumination model, SVG-IR maintains real-time rendering speeds while outperforming NeRF-based methods by 2.5 dB and existing Gaussian-based methods by 3.5 dB in relighting quality.
Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation: Reflect3D proposes a scalable zero-shot 3D reflection symmetry detector that resolves single-view ambiguity through a Transformer architecture and multi-angle aggregation from multi-view diffusion models. Integrating the detected symmetry into a single-image 3D generation pipeline significantly improves structural accuracy and texture quality.
Synthetic Prior for Few-Shot Drivable Head Avatar Inversion: SynShot proposes training a generative 3D Gaussian prior model using large-scale synthetic head data, allowing high-fidelity, drivable head avatars to be inverted via pivotal fine-tuning using only 3 real images, significantly outperforming monocular and GAN-based methods.
Targeted Forgetting of Image Subgroups in CLIP Models: A three-stage CLIP subgroup image forgetting framework (forgetting → reminding → restoring) is proposed. It selects key layers for LoRA fine-tuning using relative Fisher Information, aligns the retain data distribution utilizing BatchNorm statistics, and restores zero-shot capability via model souping, achieving precise subgroup forgetting (target ↓ to 0%) on ImageNet-1K and CIFAR-10 while maintaining 85-93% of the overall score.
Text-Guided Sparse Voxel Pruning for Efficient 3D Visual Grounding: This paper proposes TSP3D, the first single-stage 3D visual grounding framework based on a multi-layer sparse convolutional architecture. By achieving efficient 3D-text interaction through Text-Guided Pruning (TGP) and Completion-Based Addition (CBA), it achieves a SOTA accuracy of 46.71% Acc@0.5 at a speed of 12.43 FPS on ScanRefer.
Textured Gaussians for Enhanced 3D Scene Appearance Modeling: Textured Gaussians introduces traditional graphics texture mapping and alpha mapping into 3DGS. By assigning an independent 2D RGBA texture map to each Gaussian, a single Gaussian can represent spatially varying color and opacity. This significantly enhances the representation capability of 3DGS, improving rendering quality given the same budget of Gaussians and yielding an almost 2dB PSNR gain at 1% Gaussian count.
Touch2Shape: Touch-Conditioned 3D Diffusion for Shape Exploration and Reconstruction: This paper proposes Touch2Shape, which utilizes a touch-conditioned diffusion model to generate compact shape representations in a low-dimensional latent space. Combined with reinforcement learning to train a touch exploration policy, it achieves active 3D shape exploration and reconstruction from tactile images, guiding the next touch location without requiring complete shape generation at each step.
Toward Robust Neural Reconstruction from Sparse Point Sets: Proposes a neural SDF learning method based on the distributionally robust optimization (DRO) framework. By defining uncertainty sets through Wasserstein and Sinkhorn distances, it samples from model uncertainty regions to regularize training, achieving robust 3D reconstruction on sparse and noisy point clouds.
Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture: This work proposes the TexTalk4D dataset (100-minute scan-level 8K dynamic textures) and the TexTalker framework, achieving simultaneous generation of facial motion and corresponding dynamic textures (wrinkle changes) from audio for the first time, and enabling disentangled control of motion/texture styles via a style pivot-based injection strategy.
Towards Realistic Example-Based Modeling via 3D Gaussian Stitching: Proposes the first realistic example-based modeling method based on 3D Gaussian representation. It achieves seamless stitching and harmonious appearance fusion of multiple 3D Gaussian fields through sample-based cloning (S-phase) and cluster-based tuning (T-phase), supporting interactive real-time editing.
TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing: Proposes TreeMeshGPT, which serializes meshes via dynamic tree structure traversal based on triangle adjacency relations. This achieves an efficient representation of only 2 tokens per face (~22% compression rate), scales the artistic mesh generation capability to 5,500 faces, and significantly reduces normal flipping issues.
TriTex: Learning Texture from a Single Mesh via Triplane Semantic Features: This paper proposes TriTex, a method for learning a volumetric texture field from a single textured mesh. By projecting Diff3F semantic features into a triplane representation, TriTex utilizes a convolutional network and an MLP to achieve semantic-aware, feed-forward texture transfer, outperforming existing methods in both inference speed and texture fidelity.
Turbo3D: Ultra-Fast Text-to-3D Generation: Turbo3D compresses a multi-step multi-view diffusion model into a 4-step generator via dual-teacher distillation and introduces a latent space GS-LRM reconstructor. It generates high-quality 3D Gaussian Splatting assets from text in just 0.35 seconds on a single A100, while outperforming existing methods on CLIP Score and VQA Score.
Twinner: Shining Light on Digital Twins in a Few Snaps: Twinner is proposed as the first large feed-forward reconstruction model capable of simultaneously recovering scene illumination, object geometry, and PBR material properties from a sparse set of images. Using a tricolumn representation, procedural synthetic data, and fine-tuning on real data via a differentiable PBR renderer, Twinner outperforms existing feed-forward methods and matches per-scene optimization approaches on StanfordORB.
UnCommon Objects in 3D: Meta introduces uCO3D—currently the largest public object-centric 3D dataset, containing high-resolution videos of over 1,000 object categories with complete 360° 3D annotations (camera poses, depth maps, point clouds, 3D Gaussian Splatting reconstructions, and text descriptions). Training on this dataset yields significantly better performance on multiple 3D learning tasks compared to MVImgNet and CO3Dv2.
UniK3D: Universal Camera Monocular 3D Estimation: Presents UniK3D, the first universal monocular 3D estimation method supporting arbitrary camera models (from pinhole to panorama). By employing a spherical 3D output space (radial distance instead of perpendicular depth) and a model-free camera ray representation based on spherical harmonics, it achieves zero-shot SOTA performance across 13 datasets, outperforming existing methods by a large margin, especially in large field-of-view (FoV) and panoramic settings.
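The distinction between radial distance and perpendicular depth in the entry above can be made concrete with a small sketch. It assumes a pinhole camera purely for illustration, and the helper names `pinhole_ray`, `radial_to_depth`, and `depth_to_radial` are hypothetical rather than taken from UniK3D.

```python
import numpy as np

def pinhole_ray(u, v, fx, fy, cx, cy):
    """Unit ray direction through pixel (u, v) of a pinhole camera."""
    d = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    return d / np.linalg.norm(d)

def radial_to_depth(r, ray):
    """Perpendicular depth z of the point at radial distance r along a unit ray."""
    return r * ray[2]

def depth_to_radial(z, ray):
    """Radial distance of the point with depth z; requires ray[2] > 0."""
    return z / ray[2]

# This pixel's ray is (1/3, 2/3, 2/3), so radial distance 3 gives the point (1, 2, 2).
ray = pinhole_ray(u=370.0, v=340.0, fx=100.0, fy=100.0, cx=320.0, cy=240.0)
point = 3.0 * ray
z = radial_to_depth(3.0, ray)
```

Radial distance stays well defined for any ray direction, whereas converting from depth divides by the ray's z component and breaks down as rays approach or exceed 90° from the optical axis, which is the situation in very wide field-of-view and panoramic cameras.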
UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting: UniPre3D proposes the first unified 3D pre-training framework that predicts Gaussian primitives and renders images using differentiable Gaussian Splatting to provide pixel-level supervision. Meanwhile, it introduces a scale-adaptive cross-modal fusion strategy, making the pre-training method applicable to point clouds of arbitrary scales (both object-level and scene-level) and 3D models of arbitrary architectures.
UVGS: Reimagining Unstructured 3D Gaussian Splatting using UV Mapping: UVGS transforms unstructured 3D Gaussian Splatting (3DGS) into a structured 2D UV map representation via spherical mapping, which is further compressed into a 3-channel Super UVGS image. This allows pre-trained 2D image foundation models (VAEs, diffusion models) to be directly applied to 3DGS generation and compression in a zero-shot manner.
VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM: VarSplat learns the appearance variance \(\sigma^2\) for each Gaussian splat within the 3DGS-SLAM framework. It derives a differentiable per-pixel uncertainty map \(V\) via the law of total variance and applies it to tracking, loop detection, and registration, achieving more robust pose estimation and competitive reconstruction quality on Replica, TUM, ScanNet, and ScanNet++ datasets.
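As a hedged illustration of the law of total variance used in the entry above, suppose a pixel is a mixture over the splats it composites, where splat \(i\) contributes with weight \(w_i\) (assumed here to be normalized so that \(\sum_i w_i = 1\)), mean appearance \(c_i\), and learned variance \(\sigma_i^2\). The symbols \(w_i\) and \(c_i\) are assumptions introduced for this sketch, and the paper's exact formulation may differ.

\[
V \;=\; \underbrace{\sum_i w_i\,\sigma_i^2}_{\text{expected per-splat variance}} \;+\; \underbrace{\sum_i w_i\,c_i^2 \;-\; \Big(\sum_i w_i\,c_i\Big)^{2}}_{\text{variance of the per-splat means}}
\]

The first term is large when individual splats are uncertain, and the second is large when the contributing splats disagree with each other. Both terms are differentiable in \(w_i\), \(c_i\), and \(\sigma_i^2\).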
VGGT: Visual Geometry Grounded Transformer: VGGT is a large feed-forward Transformer that directly predicts camera parameters, depth maps, point clouds, and 3D point trajectories from one to hundreds of images in less than a second, outperforming existing methods without post-processing optimization.
Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior: Proposes Vid2Avatar-Pro, which uses a Universal Prior Model (UPM) learned from multi-view motion capture data of thousands of clothed individuals to create photorealistic and animatable 3D human avatars from monocular in-the-wild videos, significantly surpassing existing methods in novel view and novel pose synthesis.
Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation: Vid2Sim proposes a real2sim framework that converts monocular videos into realistic and interactive simulation environments. By employing geometrically consistent Gaussian Splatting reconstruction and a hybrid scene representation (GS+Mesh), it supports the reinforcement learning training of urban navigation agents, improving success rates by 31.2% in digital twins and 68.3% in the real world.
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos: Video Depth Anything builds upon Depth Anything V2 by introducing a lightweight spatio-temporal head and a temporal gradient matching loss. Without requiring geometric priors or video generation priors, it generates temporally consistent, high-quality depth maps for videos of arbitrary length at a real-time speed of 30 FPS.
Video Depth Without Video Models: This paper proposes RollingDepth, which avoids using video diffusion models and instead extends a single-frame latent diffusion model (Marigold) into a multi-frame snippet processor. Combined with multi-scale dilated sampling and a robust global alignment algorithm, it merges short-snippet depths into temporally consistent long-video depth, outperforming specialized video depth models and single-frame models across multiple benchmarks.
Vision-Language Embodiment for Monocular Depth Estimation: Proposes an embodied depth estimation framework that builds the physical characteristics of the camera model into a deep learning system to compute Embodied Scene Depth as a geometric prior. It also leverages vision-language complementarity (depth text descriptions + textual VAE + conditional sampler) to fuse RGB image features with the physical depth prior for monocular depth estimation.
Volumetric Surfaces: Representing Fuzzy Geometries with Layered Meshes: This paper proposes Volumetric Surfaces, a representation method that learns multi-layered translucent SDF mesh shells (k-SDF) with adaptive spacing. By rendering them via rasterization in a fixed order, it achieves real-time, high-quality view synthesis of fuzzy geometries (such as fur and hair) on low-power laptops and smartphones.
Volumetrically Consistent 3D Gaussian Rasterization: This paper identifies unnecessary physical approximations in 3DGS rendering and proposes analytically integrating the transmittance of 3D Gaussians directly within the rasterization framework to compute more accurate alpha values. This preserves the speed advantage of rasterization while approaching the physical accuracy of ray tracing.
WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments: This paper presents WildGS-SLAM, a monocular RGB SLAM system based on 3D Gaussian Splatting, which guides dynamic object removal in both tracking and mapping via DINOv2 feature-driven uncertainty prediction. In dynamic environments, it significantly outperforms existing methods in tracking accuracy (ATE RMSE of 0.46 cm) and produces high-quality, artifact-free novel view synthesis.
Wonderland: Navigating 3D Scenes from a Single Image: Wonderland proposes a pipeline for generating high-quality, wide-range 3D scenes from a single image: it first generates 3D-aware video latent variables using a video diffusion Transformer with dual-branch camera control, and then directly regresses the 3D Gaussian Splatting representation in the latent space using a Latent Large Reconstruction Model (LaLRM). This represents the first demonstration that 3D reconstruction models can be efficiently built on the latent space of video diffusion models.
WonderWorld: Interactive 3D Scene Generation from a Single Image: WonderWorld is the first framework to support interactive 3D scene generation, letting users control scene content and layout in real time via camera movement and text prompts. Each scene is generated in less than 10 seconds on a single A6000 GPU, roughly 80x faster than existing methods.
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale: This paper presents See3D, a pose-free visually conditioned multi-view diffusion model trained on large-scale internet videos (320M frames / 16M video clips). Through an automated data filtering pipeline and a time-dependent visual conditioning design, it achieves zero-shot open-world 3D generation capabilities.
Zero-Shot Monocular Scene Flow Estimation in the Wild: Proposes the first monocular scene flow estimation method capable of zero-shot generalization in the wild. By jointly predicting geometry and motion, constructing a diverse training dataset of over one million samples, and employing a pointmap + 3D offset parameterization, it comprehensively outperforms existing methods in 3D endpoint error.
MVGD: Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion: MVGD proposes a multi-view geometric framework based on pixel-level diffusion, which directly generates novel-view images and scale-consistent depth maps from an arbitrary number of known-view images without intermediate 3D representations, achieving state-of-the-art results through training on over 60 million multi-view samples.
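
Several entries above (for example UniK3D) distinguish radial distance from perpendicular depth. As a rough illustration of that distinction only, and not code from any of the listed papers, the sketch below converts between the two for a simple pinhole camera. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for this example. Perpendicular depth is the z coordinate of a point, while radial distance is its Euclidean distance from the camera centre; the former becomes zero or negative for rays at 90° or more from the optical axis, which is one reason a radial parameterization suits wide-FoV and panoramic cameras.

```python
import math

def ray_direction(u, v, fx, fy, cx, cy):
    """Unnormalized pinhole ray through pixel (u, v); its z component is 1."""
    return ((u - cx) / fx, (v - cy) / fy, 1.0)

def depth_to_radial(z, u, v, fx, fy, cx, cy):
    """Convert perpendicular (z-axis) depth to radial distance from the camera centre."""
    dx, dy, dz = ray_direction(u, v, fx, fy, cx, cy)
    return z * math.sqrt(dx * dx + dy * dy + dz * dz)

def radial_to_point(r, u, v, fx, fy, cx, cy):
    """Back-project a radial distance along the unit ray to a 3D point (x, y, z)."""
    dx, dy, dz = ray_direction(u, v, fx, fy, cx, cy)
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    return (r * dx / norm, r * dy / norm, r * dz / norm)

# Hypothetical intrinsics: focal length 100 px, principal point (50, 50).
fx = fy = 100.0
cx = cy = 50.0

# At the principal point the ray is the optical axis, so both measures agree.
r_center = depth_to_radial(4.0, 50.0, 50.0, fx, fy, cx, cy)

# Off-axis, the radial distance is larger than the perpendicular depth.
r_off = depth_to_radial(4.0, 125.0, 50.0, fx, fy, cx, cy)
point = radial_to_point(r_off, 125.0, 50.0, fx, fy, cx, cy)
```

The two measures coincide only on the optical axis; everywhere else the radial distance exceeds the perpendicular depth by the norm of the unnormalized ray.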