
🧊 3D Vision

📹 ICCV2025 · 268 paper notes

TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update

This paper proposes TRAN-D, a 2D Gaussian Splatting-based method for sparse-view transparent object depth reconstruction. It employs segmentation-guided object-aware losses to optimize Gaussian distributions in occluded regions, and leverages physics simulation (MPM) to enable dynamic scene updates after object removal, requiring only a single image for scene refresh.

3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

This paper proposes a 3D Gaussian Map based on 3D Gaussian Splatting for scene representation, combined with an open-set semantic grouping mechanism, to construct a 3D environmental representation that captures both geometric structure and rich semantic information for Vision-Language Navigation (VLN). A Multi-Level Action Prediction strategy is further designed to integrate multi-granularity spatial-semantic cues for navigation decision-making.

3D Mesh Editing using Masked LRMs

This paper proposes MaskedLRM, which reformulates 3D shape editing as a conditional reconstruction problem. During training, randomly generated 3D occluders mask multi-view inputs, and a single clean conditioning view guides completion of the occluded regions. At inference, the user defines an edit region and provides a single edited image; the model produces an edited 3D mesh in a single forward pass in under 3 seconds — 2–10× faster than optimization-based methods — while supporting topological changes (e.g., adding holes or handles) and achieving reconstruction quality on par with state-of-the-art methods.

3D Test-time Adaptation via Graph Spectral Driven Point Shift

This paper proposes GSDTTA, which shifts 3D point cloud test-time adaptation from the spatial domain to the graph spectral domain. By optimizing only the lowest 10% frequency components to adapt global structure, combined with an eigenvector-map-guided self-training strategy, GSDTTA achieves state-of-the-art performance on ModelNet40-C and ScanObjectNN-C.

3D Test-time Adaptation via Graph Spectral Driven Point Shift

This paper proposes GSDTTA, which is the first work to shift 3D point cloud test-time adaptation (TTA) from the spatial domain to the graph spectral domain. By optimizing only the lowest 10% frequency components (reducing parameters by ~90%), GSDTTA achieves global structural adjustment. Combined with a feature-map-guided self-training strategy for pseudo-label generation, it significantly outperforms existing 3D TTA methods on ModelNet40-C and ScanObjectNN-C.
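
The spectral move described in these two notes is easy to see in code. Below is a minimal NumPy sketch of the parameterization, assuming a symmetrized kNN graph and using a random low-frequency update as a stand-in for the method's actual gradient step and self-training losses; all function names are illustrative, not the paper's code.

```python
# Hypothetical re-implementation sketch of graph-spectral point shift.
import numpy as np
from scipy.spatial import cKDTree

def graph_laplacian(points, k=10):
    """Unnormalized Laplacian of a symmetrized kNN graph over the points."""
    n = len(points)
    _, idx = cKDTree(points).query(points, k=k + 1)  # self + k neighbors
    W = np.zeros((n, n))
    for i in range(n):
        for j in idx[i, 1:]:
            W[i, j] = W[j, i] = 1.0
    return np.diag(W.sum(1)) - W

def spectral_shift(points, ratio=0.1, delta=None):
    """Deform the cloud by perturbing only its lowest-frequency
    spectral coefficients (global structure, ~90% fewer parameters)."""
    _, U = np.linalg.eigh(graph_laplacian(points))   # ascending frequency
    m = max(1, int(ratio * len(points)))
    U_low = U[:, :m]                                 # lowest 10% of the spectrum
    if delta is None:                                # stand-in for a learned update
        delta = 0.01 * np.random.randn(m, 3)
    return points + U_low @ delta

shifted = spectral_shift(np.random.rand(256, 3))
print(shifted.shape)  # (256, 3)
```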

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

This paper proposes 3DGraphLLM, which encodes semantic inter-object relationships in 3D scenes as learnable graph representations and feeds them into an LLM. The method significantly outperforms baselines that ignore semantic relations across multiple 3D vision-language tasks — including object grounding, scene captioning, and visual question answering — while achieving 5× faster inference than LVLM-based approaches.

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

This paper proposes 3DGS-LM, which replaces the Adam optimizer in 3DGS with a customized Levenberg-Marquardt (LM) optimizer. By introducing a GPU cache-driven parallelization scheme for efficient Jacobian-vector product computation, the method achieves a 20% speedup in 3DGS optimization while maintaining equivalent reconstruction quality.

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

This paper proposes 3DGS-LM, which replaces the Adam optimizer in 3D Gaussian Splatting with a customized second-order Levenberg-Marquardt (LM) optimizer. Combined with an efficient GPU parallelization scheme and a gradient caching structure, the method achieves a 20% training speedup while preserving reconstruction quality.

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

This paper replaces the Adam optimizer in 3D Gaussian Splatting with a custom second-order Levenberg-Marquardt (LM) optimizer. By leveraging an efficient CUDA-parallelized PCG algorithm and a gradient cache structure to accelerate Jacobian-vector products, the method reduces optimization time by approximately 20% while maintaining equivalent reconstruction quality.

4D Gaussian Splatting SLAM

This paper presents the first complete 4D Gaussian Splatting SLAM system capable of simultaneously performing camera pose tracking and 4D Gaussian radiance field reconstruction in dynamic scenes. Gaussian primitives are partitioned into static and dynamic sets; dynamic object motion is modeled via sparse control points and an MLP; and a novel 2D optical flow map rendering algorithm is introduced to supervise dynamic Gaussian motion learning.

4D Visual Pre-training for Robot Learning

FVP proposes a visual pre-training framework based on 4D (3D spatial + temporal) point cloud prediction. By formulating the pre-training objective as "next-frame point cloud prediction" and implementing it via a diffusion model, FVP significantly improves the success rate of multiple 3D imitation learning methods on real-robot manipulation tasks (average +28% on DP3).

4D Visual Pre-training for Robot Learning

FVP formulates 3D visual pre-training as a next-point-cloud-prediction problem, training a conditional diffusion model to predict the current-frame point cloud from historical-frame point clouds. This approach achieves a 28% average success rate improvement over DP3 across 12 real-world manipulation tasks, establishing a new state of the art.

7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting

This work extends 3DGS to seven dimensions (spatial 3D + temporal 1D + directional 3D). A conditional slicing mechanism projects 7D Gaussians into 3D Gaussians compatible with the standard 3DGS pipeline, achieving up to 7.36 dB PSNR improvement on dynamic scenes with view-dependent effects while maintaining 401 FPS real-time rendering.

7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting

This paper proposes 7DGS, which models scene elements as 7-dimensional Gaussian distributions (3D spatial + 1D temporal + 3D view direction). A conditional slicing mechanism converts 7D Gaussians into time- and view-conditioned 3D Gaussians, unifying dynamic scene rendering with view-dependent appearance. On the proposed 7DGS-PBR dataset, 7DGS achieves up to 7.36 dB PSNR gain over 4DGS while using only 15.3% of the Gaussian primitives, with real-time rendering at 401 FPS.
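
The conditional slicing step can be followed with the standard multivariate-Gaussian conditioning rule. Here is a sketch under that assumption (the 3D spatial block conditioned on the 4D time-plus-direction block, with an opacity factor from the marginal density of the query); the paper's exact parameterization may differ.

```python
import numpy as np

def slice_7d_gaussian(mu, Sigma, q):
    """mu: (7,), Sigma: (7,7) SPD; q: (4,) query [t, dx, dy, dz].
    Returns the conditional 3D mean/covariance plus an opacity factor."""
    mu_p, mu_q = mu[:3], mu[3:]
    S_pp, S_pq, S_qq = Sigma[:3, :3], Sigma[:3, 3:], Sigma[3:, 3:]
    S_qq_inv = np.linalg.inv(S_qq)
    d = q - mu_q
    mu_cond = mu_p + S_pq @ S_qq_inv @ d            # time/view-dependent mean
    Sigma_cond = S_pp - S_pq @ S_qq_inv @ S_pq.T    # reduced 3D covariance
    alpha = np.exp(-0.5 * d @ S_qq_inv @ d)         # match to the query (t, dir)
    return mu_cond, Sigma_cond, alpha

A = np.random.randn(7, 7)
mu, Sigma = np.zeros(7), A @ A.T + 7 * np.eye(7)    # SPD covariance
m, S, a = slice_7d_gaussian(mu, Sigma, np.array([0.5, 0.0, 0.0, 1.0]))
print(m.shape, S.shape, round(float(a), 3))
```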

A3GS: Arbitrary Artistic Style into Arbitrary 3D Gaussian Splatting

A3GS is proposed as the first feed-forward zero-shot 3DGS style transfer framework. It encodes 3DGS scenes into a latent space via a GCN-based autoencoder and injects arbitrary style features using AdaIN, completing style transfer from any style to any 3D scene in approximately 10 seconds — two orders of magnitude faster than optimization-based methods.

A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision

This paper proposes a novel framework for training 3D diffusion models using only 2D image supervision. By employing a deterministic 3D reconstruction model as a "noisy teacher" to generate 3D noisy samples, and combining a multi-step denoising strategy with cycle-consistency regularization, the proposed method achieves 3D Gaussian Splatting generation quality that surpasses the teacher model (PSNR gain of 0.5–0.85).

A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision

This paper proposes a framework for training 3D diffusion models with 2D image supervision: a pretrained deterministic 3D reconstruction model serves as a "noisy teacher" to generate noisy 3D samples, and a multi-step denoising strategy combined with rendering losses enables cross-modal training (3D denoising + 2D supervision). The approach surpasses the teacher model by 0.5–0.85 PSNR while using a smaller model.

A Recipe for Generating 3D Worlds from a Single Image

This paper decomposes single-image-to-3D-world generation into two simpler sub-problems: panorama synthesis (training-free in-context learning) and point-cloud-conditioned inpainting (a ControlNet fine-tuned for only 5k steps). Combining these with 3DGS reconstruction produces immersive 3D environments navigable within a 2 m³ volume in VR, surpassing SOTA methods such as WonderJourney and DimensionX across all image quality metrics.

A Simple yet Mighty Hartley Diffusion Versatilist for Generalizable Dense Vision Tasks

This paper proposes HarDiff — a frequency-domain learning strategy based on the Discrete Hartley Transform (DHT) — that enhances the cross-domain generalization capability of diffusion models on dense vision tasks through low-frequency training (extracting structural priors from the source domain) and high-frequency sampling (leveraging target-domain detail guidance). HarDiff achieves state-of-the-art results across 12 benchmarks spanning semantic segmentation, depth estimation, and image dehazing.
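
For reference, the DHT that HarDiff builds on has a simple FFT identity for real input, \(\mathrm{DHT}(x) = \mathrm{Re}(\mathrm{FFT}(x)) - \mathrm{Im}(\mathrm{FFT}(x))\), and is its own inverse up to a \(1/N\) factor. The low-frequency masking below is an illustrative stand-in for the paper's low-frequency training / high-frequency sampling split.

```python
import numpy as np

def dht(x):
    """Discrete Hartley Transform of a real 1D signal, via the FFT identity."""
    X = np.fft.fft(x)
    return X.real - X.imag

def idht(H):
    """The DHT is an involution up to 1/N scaling."""
    return dht(H) / len(H)

x = np.random.rand(64)
H = dht(x)
low = H.copy()
low[8:-8] = 0.0          # keep only low-frequency (structural) content
x_structure = idht(low)  # smooth reconstruction carrying the structure prior
assert np.allclose(idht(H), x)
```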

A Unified Interpretation of Training-Time Out-of-Distribution Detection

This paper proposes a novel perspective based on inter-variable "interactions" to provide a unified explanation for why different training-time OOD detection methods are effective — they all encourage the model to encode more high-order interactions. The paper further validates the dominant role of high-order interactions in OOD detection and explains, through interaction distribution analysis, why near-OOD samples are harder to detect.

AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering

AAA-Gaussians proposes a unified 3D Gaussian rasterization framework that simultaneously addresses the three persistent problems of 3DGS—aliasing, projection distortion, and popping artifacts—through an adaptive 3D smoothing filter, view-space perspective-correct bounding computation, and frustum-based 3D culling, all within a single framework. The method substantially outperforms competing approaches under out-of-distribution viewpoint evaluation while maintaining real-time rendering performance.

AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering

This paper proposes AAA-Gaussians, which systematically addresses aliasing, projection distortion, and popping artifacts in 3DGS within a unified framework via three techniques: an adaptive 3D smoothing filter, a view-space perspective-correct bounding scheme, and frustum-based 3D culling. The method achieves state-of-the-art artifact-free real-time rendering under both in-distribution and out-of-distribution viewpoints.

AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering

By incorporating full 3D evaluation (rather than 2D splat approximations) into every stage of the 3DGS rendering pipeline, this work proposes an adaptive 3D smoothing filter, view-space bounding computation, and frustum-based tile culling to jointly address aliasing, projection artifacts, and popping artifacts in 3DGS. The method substantially outperforms existing approaches under out-of-distribution (OOD) viewpoints while maintaining real-time rendering (>100 FPS).

Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation

By identifying the optimization order mismatch between the LoRA model and the 3D model in VSD, this paper proposes a linearized lookahead correction term, \(L^2\)-VSD, which significantly improves text-to-3D generation quality at the cost of only one additional forward pass.

Adversarial Exploitation of Data Diversity Improves Visual Localization

This paper proposes RAP, a framework that synthesizes diverse training data via appearance-controllable 3DGS and introduces an adversarial discriminator to bridge the synthetic-to-real domain gap, enabling absolute pose regression methods to substantially surpass the state of the art across multiple datasets — reducing indoor translation/rotation errors by 50%/41% and outdoor errors by 38%/44%.

AllTracker: Efficient Dense Point Tracking at High Resolution

This paper proposes AllTracker, which reformulates point tracking as multi-frame long-range optical flow estimation. Through low-resolution iterative inference (2D convolutions + temporal attention) followed by high-resolution upsampling, AllTracker achieves state-of-the-art accuracy for high-resolution (768–1024 px) dense point tracking across all pixels with only 16M parameters.

Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images

This paper proposes Amodal3R, an end-to-end occlusion-aware 3D reconstruction model that introduces mask-weighted cross-attention and an occlusion-aware attention layer on top of TRELLIS, enabling direct reconstruction of complete 3D object geometry and appearance from partially occluded 2D images in the 3D latent space, substantially outperforming prior two-stage "2D completion → 3D reconstruction" pipelines.

Amodal Depth Anything: Amodal Depth Estimation in the Wild

This paper proposes a new paradigm for amodal relative depth estimation, constructs a large-scale real-world dataset ADIW (564K samples), and designs two complementary frameworks (Amodal-DAV2 and Amodal-DepthFM) built upon Depth Anything V2 and DepthFM. By minimally modifying pretrained models, the method achieves depth prediction in occluded regions, improving RMSE by 27.4% over the previous SOTA on ADIW.

AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

This paper proposes AnimateAnyMesh, the first feed-forward text-driven universal mesh animation framework. It introduces DyMeshVAE to decompose dynamic meshes into initial positions and relative trajectories, compressing them into a latent space. A Rectified Flow-based MMDiT model then learns the trajectory distribution conditioned on text. Trained on the 4M+ DyMesh dataset, the framework generates high-quality animations for meshes of arbitrary topology within 6 seconds, comprehensively outperforming DG4D, L4GM, and Animate3D.

AnyI2V: Animating Any Conditional Image with Motion Control

This paper proposes AnyI2V, a training-free framework that accepts arbitrary-modality images (mesh, point cloud, depth map, skeleton, etc.) as first-frame conditions and combines user-defined trajectories for motion-controlled video generation, outperforming existing training-free methods and competing with trained methods on FID/FVD/ObjMC metrics.

AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction

This paper proposes AR-1-to-3, an autoregressive next-view prediction framework built upon diffusion models. By adopting a progressive "near-to-far" generation strategy, combined with two conditioning mechanisms—Stacked-LE (Stacked Local-feature Encoding) and LSTM-GE (LSTM-based Global-feature Encoding)—the method significantly improves multi-view consistency in single-image-to-multi-view generation. On the GSO dataset, it achieves a PSNR of 13.18 (a 23.5% improvement over InstantMesh's 10.67) and reduces Chamfer Distance to 0.063 (compared to 0.117 for InstantMesh).

ArgMatch: Adaptive Refinement Gathering for Efficient Dense Matching

This paper proposes an Adaptive Refinement Gathering pipeline that substantially reduces dependence on heavy feature extractors and global matchers through a content-aware offset estimator, a local-consistency matching corrector, and a local-consistency upsampler, achieving competitive dense matching performance with a lightweight network.

Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description

This paper presents Articulate3D — the first large-scale real-world indoor scene dataset with articulation annotations (280 high-quality scans) — along with USDNet, a unified framework that simultaneously predicts movable/interactive part segmentation and motion parameters from 3D point clouds, providing simulation-ready scene data for embodied AI and physical simulation.

ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling

This paper presents ATLAS, a parametric human body model that explicitly decouples external surface shape from internal skeletal parameters, incorporates sparse nonlinear pose correctives, and is trained on 600K high-resolution scans, achieving more accurate and controllable human body modeling than SMPL-X.

Auto-Regressively Generating Multi-View Consistent Images

This paper proposes MV-AR, the first autoregressive model for multi-view image generation. It progressively generates each subsequent view conditioned on all previously generated views, incorporating a unified multimodal condition injection module and a Shuffle View data augmentation strategy. MV-AR achieves consistency comparable to diffusion-based methods under text, image, and shape conditioning.

AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting

This paper proposes AutoOcc, a fully automatic vision-centric pipeline for open-ended semantic occupancy annotation. By leveraging vision-language model (VLM)-guided differentiable Gaussian splatting (VL-GS), AutoOcc generates 3D semantic occupancy without any human labels, achieving IoU 83.01 / mIoU 20.92 on Occ3D-nuScenes with camera-only input, substantially outperforming existing automatic annotation methods.

Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

This paper proposes BA-Track, a framework that leverages a 3D point tracker to decompose observed motion into camera motion and object motion, enabling classical Bundle Adjustment to jointly handle static and dynamic scene elements for accurate camera pose estimation and temporally consistent dense reconstruction.

Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction

This paper proposes DiffusionGS, which bakes 3D Gaussian point clouds into the denoiser of a diffusion model, enabling single-stage, view-consistent single-view 3D object generation and scene reconstruction. Combined with a scene-object mixed training strategy and RPPC camera conditioning encoding, the method substantially outperforms existing approaches on PSNR/FID metrics while requiring only ~6 seconds for inference.

BANet: Bilateral Aggregation Network for Mobile Stereo Matching

This paper proposes BANet, a Bilateral Aggregation Network that decomposes the cost volume into a high-frequency detail volume and a low-frequency smooth volume via spatial attention and aggregates them separately. Using only 2D convolutions, BANet runs in real time on mobile devices while substantially outperforming MobileStereoNet-2D (35.3% accuracy improvement on KITTI 2015). Its 3D variant achieves the highest accuracy among real-time methods on GPU.

Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation

This paper introduces MATE-3D, a multi-dimensional benchmark of 1,280 generated text-to-3D models spanning 8 prompt categories and 8 generation methods, each rated by 21 annotators along 4 evaluation dimensions, and proposes HyperScore, a hypernetwork-based multi-dimensional quality evaluator that employs conditional feature fusion and adaptive quality mapping to surpass existing metrics across all evaluation dimensions.

Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation

This paper constructs the MATE-3D benchmark (1,280 textured meshes spanning 8 prompt categories and 8 methods, each annotated by 21 human raters along 4 dimensions, for 107,520 labels in total) and proposes HyperScore, a multi-dimensional quality evaluator. HyperScore employs learnable condition features, conditional feature fusion (simulating attention shift), and a hypernetwork that generates dimension-adaptive mapping functions (simulating changes in the decision process), achieving comprehensive superiority over existing metrics across four dimensions: semantic alignment, geometry, texture, and overall quality.

Benchmarking Egocentric Visual-Inertial SLAM at City Scale

This paper introduces LaMAria — the first city-scale egocentric multi-sensor VIO/SLAM benchmark dataset — providing centimeter-accurate ground truth via surveying-grade control points. It systematically evaluates mainstream academic SLAM methods on real egocentric data, revealing a substantial performance gap between academic systems and commercial solutions.

BezierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting

This paper proposes BezierGS, a 3D Gaussian Splatting method that models dynamic object motion trajectories using learnable Bézier curves, eliminating reliance on precise bounding box annotations. The method achieves state-of-the-art performance on both dynamic and static scene reconstruction on the Waymo and nuPlan datasets.
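
The trajectory model here reduces to evaluating a Bézier curve at a time parameter. A minimal de Casteljau evaluation is shown below as a stand-in; the curve degree and the way control points are optimized in the paper are not reproduced.

```python
import numpy as np

def bezier(control_points, t):
    """Evaluate a degree-n Bezier curve at t in [0, 1] via de Casteljau.
    control_points: (n + 1, 3) array of learnable 3D control points."""
    pts = np.asarray(control_points, dtype=float)
    while len(pts) > 1:                           # repeated linear interpolation
        pts = (1.0 - t) * pts[:-1] + t * pts[1:]
    return pts[0]

ctrl = np.array([[0, 0, 0], [1, 2, 0], [3, 2, 1], [4, 0, 1]], dtype=float)
trajectory = np.stack([bezier(ctrl, t) for t in np.linspace(0, 1, 30)])
print(trajectory.shape)  # (30, 3): one object's motion path over 30 timesteps
```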

BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis

This paper proposes BBSplat, which replaces the Gaussian opacity in 2D Gaussian Splatting with learnable RGB texture maps and alpha maps, enabling each planar primitive to possess arbitrary shape and per-pixel color control. With fewer primitives, BBSplat closes the rendering quality gap between 2DGS and 3DGS while preserving accurate mesh extraction capability and achieving up to ×17 storage compression.

Blended Point Cloud Diffusion for Localized Text-guided Shape Editing

This paper proposes BlendedPC, which reformulates localized text-guided 3D shape editing as a semantic inpainting problem. By fine-tuning an Inpaint-E model on top of Point·E and introducing an inversion-free coordinate blending mechanism at inference time, BlendedPC achieves precise local editing while preserving the identity of the original shape, outperforming existing methods comprehensively on the ShapeTalk dataset.

BokehDiff: Neural Lens Blur with One-Step Diffusion

BokehDiff proposes a one-step inference bokeh rendering method built upon a pretrained diffusion model. It incorporates energy conservation, circle-of-confusion constraints, and self-occlusion effects via a Physics-Inspired Self-Attention (PISA) module, combined with synthetic foreground data for training, achieving significant improvements over conventional methods at depth-discontinuous regions.

Bolt3D: Generating 3D Scenes in Seconds

Bolt3D is a feed-forward 3D scene generation method based on latent diffusion models. It represents 3D scenes as multiple sets of Splatter Images and employs a dedicated geometry VAE, generating a complete 3D scene on a single GPU in 7 seconds and reducing inference cost by 300× compared to optimization-based methods (CAT3D).

Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration

This paper proposes DM-Calib, which leverages Stable Diffusion priors for monocular camera intrinsic estimation. It introduces a Camera Image representation that losslessly encodes intrinsics as an image, and recovers focal length and principal point via RANSAC. DM-Calib significantly outperforms existing calibration methods on 5 zero-shot datasets and advances downstream tasks including metric depth estimation, pose estimation, and sparse-view reconstruction.

Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration

This paper proposes DM-Calib, a diffusion-based monocular camera intrinsic estimation method. It introduces a Camera Image representation that losslessly encodes intrinsics as a 3-channel image (azimuth + elevation + grayscale), fine-tunes Stable Diffusion to generate Camera Images, and extracts intrinsics via RANSAC. The method outperforms all baselines on 5 zero-shot datasets and extends camera calibration to metric depth estimation, pose estimation, and sparse-view 3D reconstruction.

Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

This paper proposes Bootstrap3D, a framework that leverages 2D/video diffusion models to automatically generate 1 million high-quality multi-view images paired with fine-grained text descriptions. Combined with a Training Timestep Rescheduling (TTR) strategy that balances image quality and view consistency during fine-tuning, Bootstrap3D significantly improves text-to-3D generation quality.

Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

This paper proposes Bootstrap3D, a framework that leverages video diffusion models to generate synthetic multi-view data, employs a fine-tuned MV-LLaVA for quality filtering and dense caption rewriting, and introduces a Training Timestep Reschedule (TTR) strategy for training multi-view diffusion models — substantially improving image quality and text alignment without sacrificing view consistency.

BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation

This paper proposes BoxDreamer, which adopts 3D bounding box corners as an intermediate representation. A reference-view-based corner synthesizer predicts 2D corner projections in query images, and 6DoF poses are recovered via PnP using the resulting 2D–3D correspondences. The method significantly outperforms existing approaches under sparse-view and heavy-occlusion conditions.
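
The last step of this pipeline, recovering the 6DoF pose from the eight predicted corner projections, is a textbook PnP problem. Below is a sketch using OpenCV as one concrete solver choice; the paper's actual solver settings are an assumption here.

```python
import numpy as np
import cv2

def pose_from_box_corners(corners_3d, corners_2d, K):
    """corners_3d: (8, 3) box corners in the object frame; corners_2d: (8, 2)
    predicted projections; K: (3, 3) intrinsics. Returns R (3, 3) and t (3,)."""
    ok, rvec, tvec = cv2.solvePnP(
        corners_3d.astype(np.float64),
        corners_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP,  # one reasonable choice for 8 correspondences
    )
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)    # rotation vector -> rotation matrix
    return R, tvec.ravel()
```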

Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation

This paper proposes PASDF, a framework that employs a pose-aware signed distance function (SDF) for continuous geometric representation, unifying 3D anomaly detection and repair tasks, achieving state-of-the-art performance on Real3D-AD and Anomaly-ShapeNet.

Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation

This paper proposes the PASDF framework, which achieves continuous geometric representation via a pose-aware signed distance function (SDF), unifying 3D anomaly detection and repair tasks, and attains state-of-the-art performance on Real3D-AD and Anomaly-ShapeNet.

PASDF: Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation

This paper proposes PASDF, a unified framework that aligns point clouds to a canonical pose via a Pose Alignment Module (PAM), learns a continuous geometric representation through a neural SDF network, and scores anomalies based on SDF deviation. Anomaly repair is achieved by extracting the zero-level set via Marching Cubes as a repair template. PASDF achieves state-of-the-art performance on Real3D-AD (O-AUROC 80.2%) and Anomaly-ShapeNet (O-AUROC 90.0%).
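
Since this note names both the scoring rule (SDF deviation) and the repair route (Marching Cubes on the zero level set), each admits a compact sketch. `sdf_net` below is an assumed trained network mapping (N, 3) points to signed distances; the pose alignment module and training losses are omitted.

```python
import numpy as np
from skimage import measure

def anomaly_scores(points, sdf_net):
    """On a defect-free shape every surface point satisfies sdf(x) ~ 0,
    so |sdf| serves directly as a per-point anomaly score."""
    return np.abs(sdf_net(points))

def repair_template(sdf_net, resolution=128, bound=1.0):
    """Extract the zero level set of the learned SDF as the repair mesh."""
    axis = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    values = sdf_net(grid.reshape(-1, 3)).reshape((resolution,) * 3)
    verts, faces, _, _ = measure.marching_cubes(values, level=0.0)
    verts = verts / (resolution - 1) * 2 * bound - bound  # index -> world coords
    return verts, faces
```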

Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework

This paper proposes 3DSR, a framework that integrates 2D diffusion-based super-resolution with 3D Gaussian Splatting (3DGS) representations. At each diffusion denoising step, multi-view 3D consistency is enforced via 3DGS rendering, enabling high-fidelity and spatially consistent 3D scene super-resolution.

BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

BUFFER-X is proposed as a zero-shot point cloud registration method that requires no manual parameter tuning. Through adaptive voxel size/search radius estimation, FPS as a replacement for learned keypoint detectors, and patch-level coordinate normalization, it achieves out-of-the-box cross-domain generalization across 11 datasets.

BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

BUFFER-X is a registration pipeline that determines voxel size and search radii via geometry-adaptive bootstrapping, replaces learned keypoint detectors with FPS, and applies patch-level coordinate normalization. Without any manual parameter tuning, it achieves zero-shot point cloud registration across 11 cross-domain datasets, attaining the best average-rank success rate across indoor/outdoor, multi-sensor, and multi-scene settings.
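
The detector replacement both notes mention, farthest point sampling, is simple enough to state exactly. A minimal NumPy version of the greedy textbook algorithm (not BUFFER-X's own code):

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Greedily pick m indices that maximize the minimum pairwise distance."""
    chosen = np.zeros(m, dtype=int)
    chosen[0] = np.random.randint(len(points))
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for i in range(1, m):
        chosen[i] = int(dist.argmax())  # farthest from everything chosen so far
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[i]], axis=1))
    return chosen

pts = np.random.rand(2048, 3)
keypoints = pts[farthest_point_sampling(pts, 128)]
```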

CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection

This paper proposes CA-I2P, which introduces a Channel Adaptive Adjustment Module (CAA) to enhance and filter channel-level discrepancies between image and point cloud features, and a Global Optimal Selection (GOS) module that replaces top-k selection with optimal transport to reduce many-to-one matching errors, achieving state-of-the-art image-to-point cloud registration performance on RGB-D Scenes V2 and 7-Scenes.

Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians

This paper proposes Can3Tok, the first variational autoencoder capable of encoding scene-level 3DGS into a low-dimensional latent space. It achieves efficient tokenization via cross-attention with canonical queries, and addresses scale inconsistency through 3DGS normalization and semantic-aware filtering, successfully generalizing to novel scenes on DL3DV-10K.

CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance

This paper proposes CasP, a cascaded matching pipeline that decomposes the matching stage into a one-to-many prior matching at 1/16 scale and a one-to-one fine matching at 1/8 scale, achieving up to 2.2× speedup while maintaining accuracy and significantly improving cross-domain generalization.

CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

This paper proposes CATSplat, a generalizable Transformer framework for feed-forward single-view 3DGS reconstruction. It enhances image features via dual cross-attention using VLM text embeddings (contextual prior) and 3D point cloud features (spatial prior), achieving comprehensive improvements over Flash3D in PSNR/SSIM/LPIPS on RE10K and other datasets, with strong cross-dataset generalization.

CF³: Compact and Fast 3D Feature Fields

This paper proposes the CF³ pipeline, which constructs a compact and high-speed 3D feature field using only 5% of the original Gaussian count via top-down feature lifting, per-Gaussian autoencoder compression, and adaptive sparsification, achieving 121–245× storage compression with real-time rendering.

CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector

By mathematically proving that regression-based depth and ground-plane depth exhibit opposite extrapolation trends under camera height variation, CHARM3R proposes a simple in-model average of the two depth estimates to cancel these trends, achieving robust generalization of Mono3D detectors to unseen camera heights with AP3D improvements exceeding 45%.
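
A toy model makes the cancellation concrete. Assume, purely for illustration, that a camera-height shift dh biases the two depth cues by equal and opposite first-order amounts; the numbers below are made up.

```python
def fused_depth(d_regression, d_ground_plane):
    """CHARM3R-style in-model average of the two depth estimates."""
    return 0.5 * (d_regression + d_ground_plane)

true_depth, dh = 20.0, 0.5            # meters; hypothetical height shift
d_reg = true_depth + 2.0 * dh         # regression branch drifts one way...
d_ground = true_depth - 2.0 * dh      # ...ground-plane branch drifts the other
print(fused_depth(d_reg, d_ground))   # 20.0: first-order biases cancel
```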

CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization

This paper proposes CL-Splats, a continual learning framework built on 3D Gaussian Splatting that incrementally updates scene reconstructions from sparse novel views via DINOv2-based change detection, 2D-to-3D mask lifting, and sphere-constrained local optimization. CL-Splats substantially outperforms CL-NeRF and related methods on both synthetic and real scenes (PSNR: 40.1 vs. 30.1 dB) while supporting applications such as historical state recovery and concurrent updates.

CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting

This paper proposes CLIP-GS, the first multimodal representation learning framework based on 3D Gaussian Splatting (3DGS). It serializes 3DGS into tokens via a GS Tokenizer and aligns multimodal representations using an Image Voting Loss, achieving comprehensive improvements over point-cloud-based methods on cross-modal retrieval, zero-shot, and few-shot 3D classification tasks.

CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation

This paper proposes CMT, the first multimodal CAD generation framework based on B-Rep representation. By employing a cascade MAR (edges first, then faces) and a topology predictor, CMT achieves accurate topology and geometry generation. The authors also construct mmABC, a multimodal CAD dataset of over 1.3 million models.

CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images

A Neural ODE is used to model continuous camera motion trajectories during exposure, combining rigid-body transformations with a learnable Continuous Motion Refinement (CMR) transform to reconstruct sharp 3D Gaussian scenes from motion-blurred images, achieving substantial improvements over the state of the art across all benchmarks.

Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs

This paper proposes CodecGS, which represents all Gaussian attributes via compact Tri-plane feature planes, combined with frequency-domain DCT entropy modeling and a channel-level bit allocation strategy, enabling efficient compression of feature planes using standard video codecs (HEVC). The method achieves storage sizes within ~10 MB while maintaining high rendering quality, yielding up to 146× compression over vanilla 3DGS.

CstNet: Constraint-Aware Feature Learning for Parametric Point Cloud

This paper proposes CstNet, the first constraint-aware feature learning method for parametric point clouds. CAD constraints are encoded as point-level MAD-Adj-PT triplet representations, and a two-stage network (constraint acquisition + constraint feature learning) achieves state-of-the-art results on the newly constructed Param20K dataset, with classification accuracy improved by +3.49% and rotation robustness improved by +26.17%.

Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting

This paper proposes the first amodal completion framework tailored for human-object interaction (HOI) scenes. It leverages human body topology and contact information to identify occluded regions via convex hull operations, and employs a multi-regional inpainting strategy on a pretrained diffusion model to achieve high-quality occluded object completion without any additional training.

Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction

This paper proposes CurveGaussian, a single-stage method that establishes a bidirectional coupling mechanism between parametric curves and edge-oriented Gaussian primitives, enabling direct end-to-end optimization of 3D parametric curves from multi-view edge maps. By eliminating the error accumulation inherent in two-stage pipelines, the method achieves comprehensive improvements over prior approaches in accuracy, efficiency, and compactness.

CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation

CutS3D is the first method to introduce 3D information (monocular depth estimation) into unsupervised instance segmentation. By cutting semantic regions in 3D point clouds, it separates overlapping instances in 2D and introduces a spatial confidence mechanism to improve pseudo-label quality, surpassing CutLER and other SOTA methods on multiple benchmarks.

DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning

DAP-MAE is proposed to jointly learn multi-domain point cloud data via a Heterogeneous Domain Adapter (HDA) and a Domain Feature Generator (DFG), enabling a single pretraining run to adapt to diverse downstream tasks including object classification, expression recognition, part segmentation, and 3D detection.

DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning

This paper proposes a domain-adaptive point cloud MAE framework (DAP-MAE) that enables a single cross-domain pre-training to achieve state-of-the-art performance across multiple downstream tasks spanning diverse domains, including object classification, facial expression recognition, part segmentation, and object detection, via two key modules: the Heterogeneous Domain Adapter (HDA) and the Domain Feature Generator (DFG).

DAViD: Data-efficient and Accurate Vision Models from Synthetic Data

This work demonstrates that high-fidelity procedural synthetic data suffices to train human-centric dense prediction models that match the accuracy of foundation models such as Sapiens-2B, requiring only 300K synthetic images, 0.3B parameters, and less than 1/16 the training cost of comparable approaches, while achieving state-of-the-art or near-SOTA performance on depth estimation, surface normal estimation, and soft foreground segmentation.

DeepMesh: Auto-Regressive Artist-Mesh Creation with Reinforcement Learning

This paper proposes DeepMesh, a framework that achieves human preference alignment in 3D mesh generation through an improved efficient mesh tokenization algorithm (72% compression rate) and the first application of DPO-based reinforcement learning to 3D mesh generation, capable of producing high-quality artist-like triangle meshes with up to 30K faces.

DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction

This paper proposes DeGauss, a self-supervised framework based on decoupled dynamic-static Gaussian Splatting. By combining foreground dynamic Gaussians and background static Gaussians via a probabilistic composition mask, it achieves distractor-free 3D reconstruction across a broad range of scenarios, from casually captured image collections to highly dynamic egocentric videos.

Demeter: A Parametric Model of Crop Plant Morphology from the Real World

Demeter is a data-driven parametric plant morphology model that decomposes plant shape into four factors — topology, articulation, shape, and deformation — supporting shape generation, 3D reconstruction, and biophysical simulation.

Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation

This paper proposes a cross-modal distillation paradigm that leverages a vision foundation model (VFM) from the image domain (Depth Anything v2) to generate pseudo-labels for training event-based depth estimation networks. It further introduces a VFM-based recurrent architecture, DepthAnyEvent-R, achieving state-of-the-art performance in event-based monocular depth estimation without requiring costly depth annotations.

Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval

This paper proposes the DAC framework, which employs a "Describe–Adapt–Combine" three-step strategy to synergize CLIP with a multimodal large language model (MLLM). Using only multi-view images, DAC substantially outperforms the previous SOTA method that relies on all modalities (point clouds + voxels + images) on open-set 3D object retrieval, achieving an average mAP improvement of over +10%.

Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling

This paper presents Diorama, the first zero-shot open-world system for complete 3D indoor scene modeling from a single RGB image. It employs a modular pipeline consisting of open-world perception and CAD-based scene assembly to produce full scenes including architectural structures and object placement, requiring neither end-to-end training nor manual annotations.

Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling

This paper presents Diorama, the first zero-shot open-world system for 3D indoor scene modeling. By modularly composing foundation models (GPT-4o, SAM, DINOv2, Metric3D, etc.), Diorama converts a single RGB image into a complete, compositional 3D indoor scene containing architectural structures and CAD objects, requiring neither end-to-end training nor manual annotation.

Discretized Gaussian Representation for Tomographic Reconstruction

This paper proposes Discretized Gaussian Representation (DGR) for CT reconstruction, which directly reconstructs 3D voxels end-to-end via discretized Gaussian functions and introduces a highly parallelized fast volume reconstruction technique. DGR surpasses both deep learning-based and instance-based reconstruction methods in sparse-view and limited-angle CT settings without any training data.

Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion

This paper proposes DISC, a category-aware dual-stream architecture for 3D semantic scene completion that disentangles instance and scene categories into separate query streams with dedicated decoding modules. Using only single-frame input, DISC surpasses multi-frame state-of-the-art methods on SemanticKITTI, achieving a 17.9% improvement in instance category mIoU.

Diving into the Fusion of Monocular Priors for Generalized Stereo Matching

This paper systematically analyzes three key problems in monocular prior fusion—affine-invariant vs. absolute depth alignment, local optima in iterative updates, and noisy disparity interference—and proposes a Binary Local Ranking Map and a Global Registration Module. On SceneFlow→Middlebury/Booster generalization benchmarks, bad2 error is reduced by half or more with negligible additional computational cost.

DMesh++: An Efficient Differentiable Mesh for Complex Shapes

This paper proposes DMesh++, which replaces weighted Delaunay triangulation (WDT) with a Minimum-Ball algorithm as the tessellation function for differentiable meshes, reducing computational complexity from \(O(N)\) to \(O(\log N)\). The method achieves up to 32× speedup on complex shapes while preserving desirable properties such as no self-intersections and few degenerate triangles.

Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels

This paper proposes DIY-SC, a 3D-aware pseudo-label generation strategy (chained propagation + relaxed cycle consistency + spherical prototype filtering) to train a lightweight adapter that improves semantic correspondence using foundation model features, achieving a 4.5% gain over the previous SOTA on SPair-71k (PCK@0.1 per-keypoint) without any manual keypoint annotations.

DriveX: Driving View Synthesis on Free-form Trajectories with Generative Prior

DriveX is a driving view synthesis framework that progressively distills generative priors from a video diffusion model into a 3DGS representation. It designs an inpainting-based video restoration task to generate pseudo-labels for novel trajectories and iteratively refines the 3D reconstruction, enabling high-quality real-time rendering on free-form trajectories.

DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness

This paper proposes the Direct Simulation Optimization (DSO) framework, which uses stability feedback from a (non-differentiable) physics simulator as a reward signal to fine-tune 3D generators via DPO or the newly proposed DRO objective, enabling feed-forward generation of physically self-supporting 3D objects without test-time optimization.
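
For context, the generic DPO objective that DSO adapts has the standard form below, where \(x^{w}\) and \(x^{l}\) are the preferred and dispreferred samples (here, generations the simulator judges stable vs. unstable) and \(\pi_{\mathrm{ref}}\) is the frozen reference generator; the paper's DRO variant is not reproduced:

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x^{w},\,x^{l})}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(x^{w})}{\pi_{\mathrm{ref}}(x^{w})} - \beta \log \frac{\pi_\theta(x^{l})}{\pi_{\mathrm{ref}}(x^{l})}\right)\right]
\]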

Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction

This paper proposes Dynamic Point Maps (DPM), extending DUSt3R's viewpoint-invariant point maps into a spatiotemporal-invariant representation that jointly controls viewpoint and time. By predicting only four sets of point maps in a feed-forward manner, DPM simultaneously addresses multiple 4D tasks including depth estimation, scene flow, motion segmentation, and 3D object tracking.

Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

This paper proposes Easi3R, a training-free plug-and-play method that analyzes and manipulates the implicit motion information encoded in DUSt3R's cross-attention layers to achieve dynamic object segmentation, camera pose estimation, and 4D dense point cloud reconstruction.

Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation

This paper proposes Easy3D, a simple yet effective method for 3D interactive instance segmentation. By combining a sparse voxel encoder, a lightweight Transformer decoder, and an implicit click fusion strategy, Easy3D consistently outperforms state-of-the-art methods on both in-domain and out-of-domain datasets. It is also the first work to successfully apply learned negative embeddings to implicit click fusion.

Efficient Spiking Point Mamba for Point Cloud Analysis

SPM (Spiking Point Mamba) proposes the first Mamba-based 3D spiking neural network framework. Through Hierarchical Dynamic Encoding (HDE) and a Spiking Mamba Block (SMB), it achieves over 3.5× energy reduction while improving accuracy by 6–7% over the previous state-of-the-art SNN method on ScanObjectNN.

Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance

The EAIL framework leverages egocentric action cues embedded in head-mounted IMU signals and employs hierarchical multimodal alignment (vision-language guidance) to learn associations between actions and environmental structures, enabling accurate inertial localization in 3D point clouds while simultaneously supporting action recognition.

EgoM2P: Egocentric Multimodal Multitask Pretraining

EgoM2P is the first large-scale multimodal multitask model for egocentric 4D understanding. It unifies four modalities — RGB video, depth, gaze, and camera trajectory — within a temporally-aware masked modeling framework, matching or surpassing task-specific models on multiple downstream tasks while being an order of magnitude faster.

EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device

This paper proposes EmbodiedSplat, a complete pipeline that captures real environments via iPhone video → reconstructs 3D Gaussian Splat meshes → fine-tunes navigation policies in Habitat-Sim → deploys to the real world. The approach achieves 20%–40% absolute success rate improvement over zero-shot baselines on real-scene ImageNav tasks, with a sim-vs-real Spearman rank correlation coefficient of 0.87–0.97.

Estimating 2D Camera Motion with Hybrid Motion Basis

This paper proposes CamFlow, which represents complex 2D camera motion via a hybrid motion basis (12 physical bases + random noise bases), reveals the nonlinear nature of superimposed homography flow fields, and incorporates a Laplace distribution-based probabilistic loss function. CamFlow substantially outperforms existing homography and meshflow methods under both standard and cross-dataset zero-shot evaluation settings.
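
Projecting an observed flow field onto a set of motion bases is a linear least-squares problem. The sketch below fits coefficients for fixed random bases; in CamFlow the bases themselves are part of the model and the loss is probabilistic, so this is only the skeleton of the idea.

```python
import numpy as np

def fit_flow_to_basis(flow, bases):
    """flow: (H*W*2,) flattened 2D flow; bases: (K, H*W*2) basis flows.
    Returns coefficients c minimizing ||bases.T @ c - flow||."""
    c, *_ = np.linalg.lstsq(bases.T, flow, rcond=None)
    return c

H, W, K = 32, 32, 12
bases = np.random.randn(K, H * W * 2)                # stand-in motion bases
flow = bases.T @ np.random.randn(K) + 0.01 * np.random.randn(H * W * 2)
coeffs = fit_flow_to_basis(flow, bases)
reconstructed = (bases.T @ coeffs).reshape(H, W, 2)  # camera-motion flow
```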

ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness

This paper proposes ETCH, a framework that models SE(3)-equivariant tightness vectors from clothing surfaces to body surfaces, reducing clothed human body fitting to a tightness-aware sparse marker fitting task. On the CAPE and 4D-Dress datasets, ETCH achieves 16.7%–69.5% reduction in joint error on loose garments and an average 49.9% improvement in shape accuracy compared to state-of-the-art methods (both tightness-agnostic and tightness-aware).

EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images

This paper proposes EvaGaussians, a framework that leverages the high temporal resolution event streams from event cameras to assist 3D Gaussian Splatting in learning from motion-blurred images. Through event-assisted initialization, joint blur/event reconstruction losses, and event-assisted geometric regularization, the method achieves high-fidelity novel view synthesis while maintaining real-time rendering efficiency.

Event-based Tiny Object Detection: A Benchmark Dataset and Baseline

This paper introduces EV-UAV, the first large-scale event camera benchmark for anti-UAV tiny object detection (147 sequences / 23M+ event-level annotations / average target size only 6.8×5.4 pixels), and proposes EV-SpSegNet — a detection framework based on sparse 3D point cloud segmentation. The method exploits the observation that tiny moving targets form continuous elongated curves in spatiotemporal event point clouds, and incorporates a Spatiotemporal Correlation loss (STC loss) to guide the network in retaining target events. It outperforms 13 state-of-the-art methods across IoU/ACC/detection probability metrics while achieving 10–100× faster inference.

Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction

This paper presents the first integration of event cameras with deformable 3D Gaussian Splatting (3D-GS) for dynamic scene reconstruction. It introduces a GS-Threshold Joint Modeling (GTJM) strategy and a Dynamic-Static Decomposition (DSD) strategy, achieving state-of-the-art rendering quality and speed on a newly constructed event-4D benchmark (average PSNR improvement of 2.73 dB on synthetic data, rendering speed 1.71× faster than 4D-GS).

Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene

This paper proposes an event-driven LLM framework that decomposes multi-character behavior planning in 3D scenes into two modules — a Narrator for event-by-event generation and an Event Parser for fine-grained spatial reasoning — achieving, for the first time, long-horizon natural interaction motion generation for 4–5+ characters in large-scale multi-room 3D scenes.

ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail

This paper proposes ExCap3D, a method for generating multi-granularity captions for objects in 3D indoor scenes at both object-level and part-level description layers. Through part-to-object information sharing and semantic/textual consistency losses, the approach ensures caption accuracy and coherence. On a newly constructed dataset of 190K captions, CIDEr scores improve by 17% and 124% over the prior SOTA at the object and part levels, respectively.

FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads

This paper presents FaceLift, a single-image 360° high-quality 3D human head reconstruction method trained exclusively on synthetic data yet generalizing well to real-world images. It generates identity-consistent multi-view images via a multi-view latent diffusion model, then feeds them into a Transformer-based reconstructor to produce pixel-aligned 3D Gaussian representations.

Faster and Better 3D Splatting via Group Training

This paper proposes a Group Training strategy that accelerates 3DGS training by periodically partitioning Gaussian primitives into an "under-training group" and a "cached group," combined with Opacity-based Priority Sampling (OPS). Across four standard benchmarks, the method achieves approximately 30% training speedup while simultaneously improving rendering quality and reducing model size, and can be applied as a plug-and-play module to 3DGS and Mip-Splatting frameworks.

FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation

This paper proposes FiffDepth, which transforms a pretrained diffusion model into a deterministic feed-forward architecture for monocular depth estimation. By preserving the diffusion trajectory to maintain detail generation capability and introducing a learnable filter to distill DINOv2's robust generalization ability into the diffusion backbone, FiffDepth simultaneously surpasses existing methods in efficiency, accuracy, and detail richness.

Find Any Part in 3D

This paper proposes Find3D, an automated 3D data annotation engine driven by 2D foundation models (SAM + Gemini) that generates 2.1 million part annotations. The resulting model is the first to simultaneously achieve open-world, cross-category, part-level, and feed-forward inference capabilities in 3D segmentation, yielding a 260% zero-shot mIoU improvement and inference speeds 6–300× faster than existing methods.

FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution

FlashDepth augments Depth Anything v2 with a Mamba recurrent module for cross-frame scale consistency, and introduces a Small-Large hybrid architecture that achieves real-time streaming video depth estimation at 2K resolution and 24 FPS with significantly sharper boundaries than existing methods.

FlexGen: Flexible Multi-View Generation from Text and Image Inputs

This paper proposes FlexGen, a flexible multi-view image generation framework that leverages GPT-4V to produce 3D-aware text annotations from tiled orthographic views and introduces an adaptive dual-control module to support single-image, text-only, or joint image-text conditioning for generating consistent multi-view images, enabling capabilities such as unseen-region completion, material editing, and texture control.

From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos

This paper proposes a hybrid pipeline for realistically inserting 3D bracelets into videos. It leverages 3D Gaussian Splatting (3DGS) to ensure temporal consistency, employs a 2D diffusion model to enhance photorealistic illumination, and introduces a Shading-Driven pipeline that separately optimizes albedo, shading, and specular residuals. The method achieves an 81.7% realism preference rate in user studies, significantly outperforming existing approaches.

From Image to Video: An Empirical Study of Diffusion Representations

This paper systematically compares diffusion models trained under image vs. video generation objectives using the same architecture (WALT) on a suite of downstream visual understanding tasks. Video diffusion models consistently outperform their image counterparts across all tasks, with particularly large gains on tasks requiring motion and 3D spatial understanding (point tracking +68%, camera pose +60%).

From One to More: Contextual Part Latents for 3D Generation

This paper proposes CoPart, a framework that represents 3D objects via contextual part latents and fine-tunes pretrained diffusion models with a mutual guidance strategy, enabling high-quality part-level 3D generation along with support for part editing, articulated object generation, and small-scale scene generation.

FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images

This paper proposes FROSS, a method that lifts 2D scene graphs directly into 3D space and represents objects as Gaussian distributions, achieving faster-than-real-time (144 FPS) online 3D semantic scene graph generation without requiring precise point cloud reconstruction.

G2SF: Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection

G2SF reinterprets memory-bank-based anomaly scores as isotropic Euclidean distances in a local feature space, and progressively evolves them into an anisotropic unified fusion metric by learning direction-aware scaling factors via a Local Scale Prediction Network (LSPN), achieving state-of-the-art performance on multimodal industrial anomaly detection.

GAS: Generative Avatar Synthesis from a Single Image

GAS is a framework that unifies novel view synthesis and novel pose synthesis into a video generation task by combining dense appearance cues from a generalizable NeRF with a video diffusion model. A modality switcher decouples the two tasks, enabling view-consistent and temporally coherent human avatar generation from a single image.

Gaussian Splatting with Discretized SDF for Relightable Assets

This paper proposes encoding a continuous SDF as an additional per-Gaussian attribute via an SDF-to-opacity transformation that unifies Gaussian splatting and SDF representations. Combined with a projection-based consistency loss and spherical initialization, the method achieves relighting quality surpassing existing Gaussian-based inverse rendering approaches using only 4 GB of GPU memory.
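
One plausible shape for an SDF-to-opacity transform is a kernel that peaks on the zero level set and decays with distance. The Laplace kernel below is an assumption for illustration; the paper's actual transform and any sharpness schedule may differ.

```python
import numpy as np

def sdf_to_opacity(sdf, beta=0.05):
    """Opacity is 1 on the surface (sdf = 0) and decays with |sdf|."""
    return np.exp(-np.abs(sdf) / beta)

print(sdf_to_opacity(np.array([0.0, 0.02, 0.2])))  # ~[1.0, 0.67, 0.018]
```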

Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

This paper proposes a video-to-4D generation framework that encodes animation data directly into a compact Gaussian variation field latent space via a Direct 4DMesh-to-GS Variation Field VAE, and trains a temporally-aware diffusion model to generate dynamic 3D content. The framework achieves high-fidelity 4D synthesis in 4.5 seconds and demonstrates strong generalization to real-world video inputs.

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

GaussianProperty presents a training-free framework that assigns physical properties (density, elastic modulus, friction coefficient, etc.) to 3D Gaussians by leveraging SAM for segmentation and GPT-4V for recognition, via a global-local reasoning module and a multi-view voting strategy. The framework supports two downstream tasks: physics-based simulation and robotic grasping.

GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments

This paper presents GaussianUpdate, the first method to integrate 3D Gaussian representations with continual learning. It achieves real-time rendering and change visualization in temporally varying scenes through a three-stage update strategy (appearance update → geometric layout update → joint refinement) and visibility-aware generative replay.

GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting

This paper proposes GazeGaussian, the first high-fidelity gaze redirection method based on 3D Gaussian Splatting (3DGS). By employing a dual-stream 3DGS model to separately represent the facial and eye regions, the method introduces an explicit Gaussian eyeball rotation representation and an expression-guided neural renderer (EGNR), achieving state-of-the-art performance in gaze accuracy, synthesis quality, and rendering speed.

Generating Physically Stable and Buildable Brick Structures from Text

BrickGPT is the first method to generate physically stable and assemblable interlocking brick structures directly from text prompts. The core idea is to formulate brick assembly as an autoregressive text generation task, augmented at inference time with physics-aware validity checking and a rollback mechanism to ensure structural stability and buildability.
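
A minimal sketch of the inference loop under our own assumptions: `sample_next_brick` and `is_physically_stable` are hypothetical stand-ins for the autoregressive model and the physics check, but the retry-then-rollback control flow mirrors the mechanism described above.

```python
# Sketch of inference-time validity checking with rollback, in the spirit
# of the paper; the two helper functions below are illustrative stand-ins,
# not the real model or solver.
import random

def sample_next_brick(structure):
    # Stand-in for the autoregressive LM: propose an (x, y, z) placement.
    return (random.randint(0, 4), random.randint(0, 4), len(structure))

def is_physically_stable(structure):
    # Stand-in check: every elevated brick needs a near-adjacent support below.
    return all(z == 0 or any(abs(b[0] - x) + abs(b[1] - y) <= 1 and b[2] == z - 1
                             for b in structure)
               for x, y, z in structure)

structure, max_retries = [], 10
for _ in range(8):                        # try to place 8 bricks
    for _ in range(max_retries):
        candidate = structure + [sample_next_brick(structure)]
        if is_physically_stable(candidate):
            structure = candidate
            break
    else:
        structure = structure[:-1]        # rollback: drop the last brick
print(structure)
```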

Geometry Distributions

This paper proposes Geometry Distributions (GeomDist), which models 3D geometry as a probability distribution over surface points and learns it via a diffusion model. Without assuming genus, connectivity, or boundary conditions, the method samples arbitrarily many surface points from Gaussian noise to represent geometry of arbitrary topology.
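
The sampling idea can be demonstrated with a toy score field: starting from Gaussian noise, points follow the score onto a surface. In the sketch below, the learned diffusion model is replaced by the analytic score of a unit sphere, purely for illustration.

```python
# Toy sketch of the core idea: Gaussian noise flows along a score field
# onto a surface. The learned model is replaced by an analytic score here.
import numpy as np

def sphere_score(x):
    """Displacement from x toward the closest point on the unit sphere."""
    r = np.linalg.norm(x, axis=1, keepdims=True)
    return (1.0 - r) * x / np.maximum(r, 1e-8)

rng = np.random.default_rng(0)
pts = rng.normal(size=(2048, 3))                # arbitrarily many samples
for step in np.linspace(0.5, 0.05, 50):         # annealed Langevin-style loop
    pts += step * sphere_score(pts)
    pts += np.sqrt(step) * 0.05 * rng.normal(size=pts.shape)
print(np.linalg.norm(pts, axis=1).mean())       # ≈ 1.0: points lie near the surface
```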

GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

This paper proposes GeoProg3D, the first visual programming framework supporting natural language interaction with city-scale high-fidelity 3D scenes. By combining a Geo-aware City-scale 3D Language Field (GCLF) with Geo-Visual APIs (GV-APIs) and an LLM reasoning engine, the framework enables compositional geospatial reasoning. GeoProg3D comprehensively outperforms existing 3D language field and VLM methods on the newly introduced GeoEval3D benchmark, which contains 952 annotated queries.

GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering

This paper proposes GeoSplatting, which differentiably generates surface-aligned Gaussians from an optimizable explicit mesh to provide accurate geometric guidance for 3DGS, achieving state-of-the-art inverse rendering performance (material–lighting decomposition) with training times of only 10–15 minutes.

Global-Aware Monocular Semantic Scene Completion with State Space Models

This paper proposes GA-MonoSSC, a hybrid architecture combining Transformer (2D global context) and Mamba (3D long-range dependencies) for indoor monocular semantic scene completion. A novel Frustum Mamba Layer is introduced to address feature discontinuities in voxel serialization, achieving state-of-the-art performance on Occ-ScanNet and NYUv2.

Global Motion Corresponder for 3D Point-Based Scene Interpolation under Large Motion

This paper proposes the Global Motion Corresponder (GMC), which learns unary potential fields that map 3D Gaussians from two time steps into a shared canonical space, enabling robust scene interpolation and extrapolation under large motion.

GSOT3D: Towards Generic 3D Single Object Tracking in the Wild

This paper presents GSOT3D, the largest generic 3D single object tracking benchmark to date, comprising 620 multimodal sequences (point cloud + RGB + depth) spanning 54 object categories. It supports three 3D tracking tasks (PC / RGB-PC / RGB-D) and introduces PROT3D, a progressive spatiotemporal tracker that achieves state-of-the-art performance via 9DoF bounding box estimation.

GUAVA: Generalizable Upper Body 3D Gaussian Avatar

This paper presents GUAVA, the first framework for feed-forward reconstruction of animatable upper-body 3D Gaussian avatars from a single image. By combining template Gaussians and UV Gaussians in a canonical space representation, GUAVA supports rich facial expression and gesture driving, completing reconstruction in approximately 0.1 s with real-time rendering capability.

Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints

This paper proposes PhysNAP, which guides the reverse diffusion process of the pretrained diffusion model NAP via point cloud alignment loss and SDF-based physical plausibility constraints (part penetration and joint mobility), enabling category-aware articulated object generation with significant improvements in alignment accuracy and physical plausibility over the unguided baseline.
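
A hedged sketch of loss-guided reverse diffusion in this spirit: at each step, the denoised sample is nudged down the gradient of a guidance loss. `denoise_step` and the one-sided Chamfer term below are illustrative stand-ins, not NAP's networks or the paper's exact constraints.

```python
# Sketch of classifier-guidance-style sampling: the reverse step is
# corrected by the gradient of a guidance loss. All components are toys.
import torch

def denoise_step(x, t):
    # Stand-in for one reverse-diffusion step of a pretrained model.
    return x * (1.0 - 0.02 * t)

def alignment_loss(x, target):
    # One-sided Chamfer distance from the partial observation to the sample.
    d = torch.cdist(target, x)          # (M, N) pairwise distances
    return d.min(dim=1).values.mean()

torch.manual_seed(0)
target = torch.rand(128, 3)             # partial point cloud observation
x = torch.randn(512, 3)                 # noisy sample
guidance_scale = 0.5
for t in range(20, 0, -1):
    x = x.detach().requires_grad_(True)
    loss = alignment_loss(denoise_step(x, t), target)
    grad, = torch.autograd.grad(loss, x)
    x = denoise_step(x, t) - guidance_scale * grad   # guided update
print(float(alignment_loss(x, target)))
```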

HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars

This paper proposes HairCUP, a compositional universal prior model that decomposes head modeling into two independent latent spaces for face and hair. By leveraging a synthetic hairless data creation pipeline for effective disentanglement, the model supports flexible face/hairstyle swapping and few-shot monocular adaptation.

Hierarchical Material Recognition from Local Appearance

This paper proposes a hierarchical material taxonomy designed for visual applications alongside a new in-the-wild dataset, Matador (~7,200 material images with depth maps, 57 categories). A graph attention network (GAT) leverages the taxonomic hierarchy for material recognition, achieving state-of-the-art results on multiple benchmarks while supporting few-shot learning of novel materials and material probing at arbitrary scene points.

HORT: Monocular Hand-held Objects Reconstruction with Transformers

This paper proposes HORT, a coarse-to-fine Transformer-based framework that efficiently reconstructs dense 3D point clouds of hand-held objects from monocular images. By integrating image features with 3D hand geometry, HORT jointly predicts the object point cloud and its pose relative to the hand, achieving state-of-the-art performance in both reconstruction accuracy and inference speed.

HouseTour: A Virtual Real Estate A(I)gent

HouseTour is proposed to jointly generate human-like 3D camera trajectories and real estate textual descriptions given a set of indoor images with known poses. The system employs a Residual Diffuser for diffusion-based trajectory planning and integrates spatial features into Qwen2-VL-3D to produce 3D-grounded text summaries.

How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach

This paper proposes Learned 3D Evaluation (L3DE), an objective and quantifiable evaluation framework based on monocular 3D cues (motion, depth, appearance) and contrastive learning, designed to measure the gap between AI-generated videos and real videos in terms of 3D visual consistency, without requiring manual annotation of artifacts or quality labels.

HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis

This paper introduces HumanOLAT — the first publicly available large-scale full-body multi-view OLAT (One-Light-at-a-Time) dataset, comprising 21 subjects × 3 poses × 40 viewpoints × 344 lighting conditions ≈ 850K frames, providing a high-quality benchmark for human relighting and novel-view synthesis.

Identity Preserving 3D Head Stylization with Multiview Score Distillation

This paper proposes a 3D head stylization framework based on Likelihood Distillation (LD), achieving high-quality stylization with identity preservation under 360-degree consistent rendering through multiview grid scoring, mirror gradients, and rank-weighted score tensors.

IM360: Large-scale Indoor Mapping with 360 Cameras

This paper presents IM360, a 3D mapping pipeline for large-scale indoor environments captured under sparse scanning conditions. By deeply integrating a spherical camera model into every stage of SfM—combined with dense feature matching and differentiable rendering-based texture optimization—IM360 achieves substantially superior camera localization accuracy and rendering quality on Matterport3D and Stanford2D3D compared to existing methods (a PSNR gain of 3.5 dB).

Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints

This paper proposes a purely image-guided unsupervised Shape-from-Template (SfT) method that reconstructs the 3D shape of deforming objects using only visual cues—color, gradients, and silhouettes—combined with mesh inextensibility constraints. The method is approximately 400× faster than the best-performing unsupervised baseline while achieving substantially higher accuracy.

Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image

This paper reframes motion blur from an "unwanted artifact" into a "valuable motion cue." By predicting a dense optical flow field and a monocular depth map from a single blurred image, and subsequently recovering the camera's 6DoF instantaneous velocity via a differentiable least-squares solver, the method achieves motion estimation accuracy comparable to or surpassing that of an IMU, with real-time performance at 30 FPS.
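
The least-squares core of this idea is classical: under instantaneous motion, the flow at a pixel with normalized coordinates \((x, y)\) and depth \(Z\) is linear in the camera's 6DoF velocity, so stacking the per-pixel image Jacobians gives an overdetermined linear system. The sketch below solves it with plain `lstsq`; the paper's differentiable solver plays this role inside a learned pipeline.

```python
# Classical relation behind the method: optical flow is linear in the
# camera's instantaneous velocity (v, w) given per-pixel depth.
import numpy as np

def interaction_rows(x, y, Z):
    """2x6 image Jacobian mapping [vx,vy,vz,wx,wy,wz] to flow (u_dot, v_dot)."""
    return np.array([
        [-1/Z,    0, x/Z,     x*y, -(1 + x*x),  y],
        [   0, -1/Z, y/Z, 1 + y*y,       -x*y, -x],
    ])

rng = np.random.default_rng(0)
xs, ys = rng.uniform(-0.5, 0.5, 200), rng.uniform(-0.5, 0.5, 200)
Zs = rng.uniform(2.0, 5.0, 200)
vel_true = np.array([0.1, -0.2, 0.3, 0.01, 0.02, -0.01])

A = np.concatenate([interaction_rows(x, y, Z) for x, y, Z in zip(xs, ys, Zs)])
flow = A @ vel_true + 1e-3 * rng.normal(size=A.shape[0])   # noisy dense flow
vel_est, *_ = np.linalg.lstsq(A, flow, rcond=None)
print(np.round(vel_est, 3))   # ≈ vel_true
```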

InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes

InstaScene proposes a unified framework for instance decomposition and complete reconstruction from cluttered scenes. It constructs a spatial contrastive learning scheme via tracked Gaussian rasterization for accurate instance segmentation, and designs an in-situ generation pipeline that leverages available observations and geometric cues to guide a 3D generative model toward complete object reconstruction.

JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

JointDiT builds an RGB-Depth joint distribution model upon the Flux diffusion Transformer. Through adaptive scheduling weights and an unbalanced timestep sampling strategy, a single model can flexibly perform three tasks—joint generation, depth estimation, and depth-conditioned image generation—by controlling the timestep of each modality.

χ: Symmetry Understanding of 3D Shapes via Chirality Disentanglement

This paper proposes an unsupervised chirality feature extraction pipeline that distills left-right chirality information from 2D foundation model features to augment 3D shape vertex descriptors, effectively resolving left-right ambiguity in shape analysis.

LACONIC: A 3D Layout Adapter for Controllable Image Creation

This paper proposes LACONIC, a lightweight adapter based on parameterized 3D semantic bounding boxes that injects explicit 3D geometric information into a pretrained text-to-image diffusion model via a decoupled cross-attention mechanism. It is the first method to simultaneously support camera control, 3D object-level semantic guidance, and full scene context modeling of off-screen objects, achieving a 75.8% reduction in FID compared to SceneCraft.

LayerLock: Non-collapsing Representation Learning with Progressive Freezing

This paper proposes LayerLock, a self-supervised video representation learning method that progressively freezes network layers while dynamically shifting prediction targets from pixels to increasingly deep intermediate layer features. It combines the training stability of pixel prediction with the semantic efficiency of latent variable prediction, and is applied to video models with up to 4B parameters.

Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models

This paper proposes learning 3D object-object spatial relationships (OOR) from synthetic images generated by pre-trained 2D diffusion models. A 3D lifting pipeline is introduced to construct a paired dataset, upon which a text-conditioned score-based diffusion model is trained to model the distribution of relative poses and scales between object pairs. The framework is further extended to multi-object scene layout generation and scene editing.

Learning 3D Scene Analogies with Neural Contextual Scene Maps

This paper introduces the 3D scene analogy task and proposes neural contextual scene maps to establish dense 3D mappings between scene regions sharing similar semantic context, enabling downstream applications such as trajectory transfer and object placement transfer.

Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

This paper proposes SMoEStereo, which integrates variable-rank MoE-LoRA and variable-kernel MoE-Adapter modules into a frozen Visual Foundation Model (VFM), combined with a lightweight decision network for selective activation of MoE modules, achieving scene-adaptive robust stereo matching with state-of-the-art performance on cross-domain and joint generalization benchmarks.

Lightweight Gradient-Aware Upscaling of 3D Gaussian Splatting Images

A lightweight image upscaling technique tailored for 3DGS is proposed, leveraging analytic image gradients from Gaussian primitives for gradient-aware bicubic spline interpolation. Without any deep learning inference, the method achieves 3–4× rendering acceleration while surpassing standard bicubic interpolation and DL-based upscaling in reconstruction quality.
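
The gradient-aware part is essentially Hermite interpolation: when analytic derivatives are available at the samples (as they are for Gaussian primitives), the interpolant can use them directly instead of estimating slopes from neighboring pixels. Below is a 1D reduction of the idea, our simplification of the paper's 2D bicubic scheme.

```python
# 1D sketch: cubic Hermite interpolation with analytic slopes at the samples.
import numpy as np

def hermite_interp(p0, p1, d0, d1, t):
    """Cubic Hermite segment through p0, p1 with analytic slopes d0, d1."""
    h00 = 2*t**3 - 3*t**2 + 1
    h10 = t**3 - 2*t**2 + t
    h01 = -2*t**3 + 3*t**2
    h11 = t**3 - t**2
    return h00*p0 + h10*d0 + h01*p1 + h11*d1

xs = np.array([0.0, 1.0])
f = np.sin(2 * xs)                 # sample values at the endpoints
df = 2 * np.cos(2 * xs)            # analytic derivatives at the samples
t = np.linspace(0, 1, 5)
approx = hermite_interp(f[0], f[1], df[0], df[1], t)
print(np.abs(approx - np.sin(2 * t)).max())   # small: slopes match the signal
```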

LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression

LINR-PCGC proposes the first implicit neural representation (INR)-based method for lossless point cloud geometry compression. By designing a lightweight multi-scale SparseConv network with Scale Context Extraction (SCE) and Child Node Prediction (CNP) modules, combined with a GoP-level shared decoder and initialization strategy, the method achieves a 21.21% bitrate reduction over G-PCC TMC13v23 and a 21.95% reduction over SparsePCGC on the MVUB dataset, without relying on any specific training data distribution.

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities

This paper proposes LLaVA-3D, which constructs "3D Patches" by injecting 3D positional embeddings into 2D CLIP patch features, extending a 2D LMM (LLaVA-Video) into a unified 2D/3D understanding model with minimal architectural modifications. The approach achieves 3.5× faster training convergence than existing 3D LMMs, reaches state-of-the-art performance on multiple 3D benchmarks, and preserves 2D capabilities without degradation.

LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling

This paper proposes LocalDyGS, a framework that decomposes complex global dynamic scenes into local spaces defined by seed points and generates temporal Gaussians via static-dynamic feature decoupling to model local motions independently, achieving high-quality reconstruction of large-scale complex dynamic scenes for the first time.

LONG3R: Long Sequence Streaming 3D Reconstruction

This paper proposes LONG3R, a streaming multi-view 3D reconstruction model based on a recurrent memory mechanism. Through three key innovations — memory gating, a dual-source refined decoder, and 3D spatio-temporal memory — LONG3R significantly improves long-sequence reconstruction quality while maintaining real-time inference speed.

LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos

LongSplat targets casually captured long videos without known camera poses. It proposes an incremental joint optimization framework that simultaneously optimizes camera poses and 3DGS, introduces a robust pose estimation module based on MASt3R priors, and designs an adaptive octree anchor formation mechanism, collectively addressing pose drift, inaccurate geometry initialization, and memory constraints.

MaskHand: Generative Masked Modeling for Robust Hand Mesh Reconstruction in the Wild

This paper proposes MaskHand, the first method to introduce generative masked modeling into 3D hand mesh reconstruction. It discretizes continuous hand poses into tokens via VQ-MANO, then employs a context-guided masked Transformer to learn the probability distribution of 2D-to-3D mappings. During inference, confidence-guided iterative sampling is used to generate high-precision hand meshes, achieving a 19.5% reduction in PA-MPJPE on the HO3Dv3 zero-shot evaluation.

MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion

MaterialMVP is an end-to-end multi-view PBR texture generation model that decouples illumination via consistency-regularized training and employs a dual-channel material generation framework (MCAA + Learnable Material Embeddings) to align albedo and metallic-roughness maps, enabling single-pass generation of high-quality, illumination-invariant, multi-view-consistent PBR materials from a 3D mesh and an image prompt.

MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes

This paper proposes MEGA, a memory-efficient framework for 4D Gaussian Splatting that eliminates redundant spherical harmonic coefficients via DC-AC color decomposition (8× compression) and reduces the total number of Gaussians through entropy-constrained Gaussian deformation. MEGA achieves approximately 190× and 125× storage compression on the Technicolor and Neural 3D Video datasets, respectively, while maintaining comparable rendering quality and real-time speed.

MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization

This paper proposes MemoryTalker, a two-stage training framework (Memorizing + Animating) that employs a key-value memory network to store generic facial motions and generates personalized 3D facial animations driven solely by audio via audio-guided stylized memory retrieval, requiring no additional prior information at inference time.

MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization

MeshAnything V2 proposes Adjacent Mesh Tokenization (AMT), which represents adjacent faces using a single vertex rather than the conventional three, reducing the average token sequence length by approximately half. This allows the maximum number of generated faces to scale from 800 to 1600 without additional computational cost, significantly improving the efficiency and quality of autoregressive mesh generation.
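
A simplified sketch of the tokenization (our own reduction; the paper's ordering rules and special tokens differ in detail): when consecutive faces share an edge, only the one new vertex is emitted; otherwise a restart marker plus the full face.

```python
# Simplified AMT-style encoding: adjacent faces contribute one vertex each,
# non-adjacent faces restart with a marker ('&') and all three vertices.
def amt_encode(faces):
    tokens, prev = [], None
    for face in faces:
        if prev is not None:
            shared = set(face) & set(prev)
            if len(shared) == 2:                   # shares an edge: 1 new vertex
                tokens += [v for v in face if v not in shared]
                prev = face
                continue
        tokens += ["&"] + list(face)               # restart with a full face
        prev = face
    return tokens

# A small strip: each of the first three faces shares an edge with the previous.
faces = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (7, 8, 9)]
print(amt_encode(faces))                            # ['&', 0, 1, 2, 3, 4, '&', 7, 8, 9]
print(len(amt_encode(faces)), "tokens vs", 3 * len(faces), "naive")
```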

MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction

MeshMamba introduces a Mamba state space model-based approach for articulated 3D mesh generation and reconstruction. By designing vertex serialization strategies based on body-part UV maps and template mesh coordinates, the method achieves efficient generation and reconstruction of meshes with tens of thousands of vertices, running 6–9× faster than Transformer-based counterparts.

MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing

MeshPad decomposes sketch-driven 3D mesh creation and editing into two sub-tasks—addition and deletion—based on a triangle sequence representation and Transformer autoregressive generation. It further proposes a vertex-aligned speculative decoder achieving a 2.2× speedup, enabling interactive mesh editing within seconds.

MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP

This paper proposes MinCD-PnP, which reduces the computationally expensive Blind PnP to a problem of minimizing the Chamfer distance between 2D-3D keypoints via a triple approximation strategy. A lightweight multi-task learning module, MinCD-Net, is designed and integrated into existing I2P registration frameworks, achieving significant improvements in inlier ratio and registration recall under cross-scene and cross-dataset settings.

MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction

MoGA is proposed to reconstruct high-fidelity 3D Gaussian avatars from a single image by learning a generative 3D avatar prior and leveraging it as a strong constraint for initialization, regularization, and pose optimization, substantially outperforming existing methods.

Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction

Momentum-GS proposes a momentum-based self-distillation mechanism to address cross-block consistency issues in block-parallel training of large-scale 3D Gaussian Splatting. By introducing a momentum teacher Gaussian decoder for global guidance and decoupling the number of blocks from the number of GPUs, the method achieves state-of-the-art performance on multiple large-scale scene datasets, improving LPIPS by 18.7% over CityGaussian.

Monocular Semantic Scene Completion via Masked Recurrent Networks

This paper proposes MonoMRN, a two-stage monocular semantic scene completion framework that first generates coarse-grained predictions, then iteratively refines occluded regions via a Masked Sparse GRU (MS-GRU), while introducing distance attention projection to reduce depth projection errors. The method achieves state-of-the-art performance on both NYUv2 and SemanticKITTI.

MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos

MonoMobility presents the first framework for zero-shot analysis of moving parts and motion attributes (motion axis and motion type) of articulated objects from monocular video. It combines off-the-shelf tools—depth estimation, optical flow, and segmentation—for coarse initialization, then refines the results through self-supervised optimization of a dynamic scene represented with 2D Gaussian splatting, using a purpose-built optimization algorithm for articulated-object dynamics. The method requires no annotated data and handles rotational, translational, and compound motion.

MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction

This paper proposes MuGS, the first generalizable 3D Gaussian Splatting method designed for multi-baseline settings. By fusing multi-view stereo (MVS) and monocular depth estimation (MDE) features and introducing a projected-sampled depth consistency network, MuGS achieves state-of-the-art novel view synthesis under both small-baseline and large-baseline scenarios.

Multi-View 3D Point Tracking

This paper presents MVTracker—the first data-driven multi-view 3D point tracker. By back-projecting multi-view depth maps into a unified 3D feature point cloud and leveraging kNN association with Transformer-based iterative refinement, MVTracker achieves robust long-range 3D point trajectory estimation under a practical 4-camera configuration, attaining median trajectory errors of 3.1 cm and 2.0 cm on Panoptic Studio and DexYCB, respectively.
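
The fusion step is standard depth back-projection; a minimal sketch follows (per-pixel features, which would ride along with the points, are omitted for brevity).

```python
# Back-project each view's depth map through the inverse intrinsics and the
# camera-to-world pose into one shared world-space point cloud.
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift an (H, W) depth map to world-space 3D points, shape (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                  # normalized camera rays
    pts_cam = rays * depth.reshape(-1, 1)            # scale by depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]

K = np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]])
depth = np.full((480, 640), 2.0)                     # toy constant-depth map
clouds = [backproject(depth, K, np.eye(4)) for _ in range(4)]  # 4 cameras
fused = np.concatenate(clouds)                       # unified 3D point cloud
print(fused.shape)
```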

MV-Adapter: Multi-view Consistent Image Generation Made Easy

This paper proposes MV-Adapter, the first adapter-based framework for multi-view image generation. By duplicating self-attention layers and adopting a parallel attention architecture, it enables plug-and-play multi-view generation on SDXL at 768 resolution, with compatibility across diverse T2I-derived models.

MVGBench: a Comprehensive Benchmark for Multi-view Generation Models

This paper presents MVGBench, a comprehensive evaluation framework for multi-view generation (MVG) models. It introduces a novel 3D consistency metric based on 3DGS self-consistency (requiring no 3D ground truth), systematically evaluates 12 state-of-the-art methods across three dimensions—peak performance, generalization, and robustness—and derives a new method, ViFiGen, from the identified best practices.

Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation

Nautilus proposes a locality-aware autoencoder for scalable artist-like mesh generation. By introducing a nautilus-shell-structured mesh tokenization algorithm that reduces sequence length to 1/4 of the naive baseline, and combining it with a dual-stream point cloud conditioner to improve local structural fidelity, Nautilus achieves for the first time direct high-quality mesh generation with up to 5,000 faces.

Neural Compression for 3D Geometry Sets

This paper proposes NeCGS, the first neural compression paradigm capable of compressing geometry sets containing thousands of diverse 3D mesh models at ratios up to 900×, achieving high-fidelity reconstruction via a TSDF-Def implicit representation and a quantization-aware auto-decoder.

NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement

NeuraLeaf disentangles the 3D geometry of leaves into two latent spaces — a 2D base shape space and a 3D deformation space — leveraging large-scale 2D leaf image datasets to learn the shape space, proposes a skeleton-free skinning model to handle highly flexible leaf deformations, and introduces DeformLeaf, the first 3D dataset dedicated to leaf deformation modeling.

No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

This paper proposes SPFSplat, the first self-supervised 3DGS framework that requires no ground-truth poses at either training or inference time. By sharing a ViT backbone to jointly predict Gaussian primitives and camera poses, SPFSplat surpasses pose-dependent state-of-the-art methods under extreme viewpoint changes.

Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising

This paper proposes Noise2Score3D, a fully unsupervised point cloud denoising framework based on Tweedie's formula. It learns the score function directly from noisy data and achieves single-step denoising, while introducing point cloud total variation to estimate unknown noise parameters.

Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features

DS4D is the first method to decouple dynamic and static features along both the temporal and spatial axes in video-to-4D generation. It introduces a Dynamic-Static Feature Decoupling module (DSFD) to extract dynamic representations, and a Temporal-Spatial Similarity Fusion module (TSSF) to adaptively aggregate dynamic information across viewpoints, achieving state-of-the-art performance on the Consistent4D and Objaverse datasets.

OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering

This paper proposes an occlusion-aware scene partitioning strategy and region-based rendering technique. By clustering a camera co-visibility graph, it achieves partitions aligned with the scene layout, significantly improving reconstruction quality and rendering speed for large-scale 3DGS.

One Look is Enough: Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation

This paper proposes PRO (Patch Refine Once), which achieves seamless patchwise depth refinement on high-resolution images through Grouped Patch Consistency Training (GPCT) and a Bias-Free Mask (BFM) strategy. PRO eliminates boundary artifacts with only a single refinement pass per patch and achieves a 12× inference speedup over PatchRefiner.

Online Language Splatting

Online Language Splatting is the first framework to achieve online, near-real-time, open-vocabulary language mapping within a 3DGS-SLAM system. Through three innovations—high-resolution CLIP embedding, two-stage online autoencoder compression, and decoupled color-language optimization—the method surpasses offline state-of-the-art in accuracy while achieving 40×–200× efficiency gains.

Open-Vocabulary Octree-Graph for 3D Scene Understanding

This paper proposes Octree-Graph, a novel scene representation combining adaptive octrees with a graph structure. Through Chronological Group-based Segment Merging (CGSM) and Instance Feature Aggregation (IFA), it obtains accurate semantic objects and enables efficient open-vocabulary 3D scene understanding.

Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps

This paper proposes S3PO-GS, a monocular RGB-only outdoor SLAM system that anchors pose estimation to 3DGS-rendered pointmaps for scale self-consistency, and employs a patch-based dynamic mapping mechanism, achieving high-accuracy localization without cumulative scale drift and high-fidelity novel view synthesis.

PanSt3R: Multi-view Consistent Panoptic Segmentation

PanSt3R builds upon MUSt3R to simultaneously perform 3D reconstruction and multi-view panoptic segmentation in a single forward pass, requiring neither camera parameters nor test-time optimization, and achieves inference speeds orders of magnitude faster than existing methods.

PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations

PCR-GS achieves high-quality 3D-GS reconstruction and pose estimation under complex camera trajectories without COLMAP priors by co-regularizing camera poses with DINO-feature reprojection and wavelet-based frequency regularization.

PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

This paper introduces PlaceIt3D, a language-guided object placement task in real 3D scenes, comprising a benchmark, a large-scale dataset, and a 3D LLM-based baseline method called PlaceWizard that performs joint reasoning over scenes, objects, and natural language instructions.

PLMP – Point-Line Minimal Problems for Projective SfM

This paper provides a complete classification of all point-line minimal problems in projective SfM, identifying 291 minimal problems (73 of which admit unique solutions solvable by linear methods), and develops a systematic framework for problem decomposition and non-minimality proofs via stabilizer subgroup analysis.

PolarAnything: Diffusion-based Polarimetric Image Synthesis

This paper proposes PolarAnything, the first diffusion-based framework for generating polarimetric images from a single RGB image. By performing denoising diffusion over encoded AoLP and DoLP representations, the method achieves physically accurate and photorealistic polarimetric attribute synthesis without requiring 3D assets or polarization cameras.

Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding

This paper proposes Predict-Optimize-Distill (POD), a self-improving framework that recovers 4D part poses of articulated objects from long monocular videos through iterative predict–optimize–distill cycles, with performance that improves consistently with video length and iteration count.

Proactive Scene Decomposition and Reconstruction

This paper proposes an online scene decomposition and reconstruction task grounded in proactive human-object interaction, where interaction behavior observed from an egocentric viewpoint defines the decomposition granularity, enabling progressive object decoupling and high-quality global reconstruction.

PseudoMapTrainer: Learning Online Mapping without HD Maps

This paper proposes PseudoMapTrainer, the first framework to train online mapping models entirely without GT HD Maps: it reconstructs road surfaces from multi-camera images via 2D Gaussian Splatting (RoGS) and combines a pretrained semantic segmentation model (Mask2Former) to generate vectorized pseudo labels. A mask-aware matching algorithm and loss function are further designed to handle partially occluded pseudo labels, supporting both single-trip and multi-trip (crowdsourced) modes.

Radiant Foam: Real-Time Differentiable Ray Tracing

This paper proposes Radiant Foam, a novel differentiable scene representation based on volumetric tetrahedral mesh ray tracing. Without relying on rasterization, it achieves rendering speed and quality comparable to Gaussian Splatting while natively supporting light transport phenomena such as reflection and refraction.

RapVerse: Coherent Vocals and Whole-Body Motion Generation from Text

This work constructs the large-scale rap dataset RapVerse and proposes a unified autoregressive transformer framework that, for the first time, simultaneously generates coherent singing vocals and whole-body 3D motion from lyric text.

RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians

This paper proposes RayletDF, a generalizable 3D surface reconstruction method based on a "raylet" (ray segment) distance field. Through three modules — a raylet feature extractor, a distance field predictor, and a multi-raylet mixer — RayletDF directly predicts surface points from point clouds or 3D Gaussians, achieving high-accuracy cross-dataset generalization via a single forward pass on unseen datasets.

RayZer: A Self-supervised Large View Synthesis Model

This paper proposes RayZer, a self-supervised multi-view 3D vision model that requires no 3D supervision (no camera poses, no scene geometry annotations). By decoupling images into camera parameters and scene representations, RayZer performs 3D-aware image autoencoding and achieves performance on novel view synthesis that matches or surpasses oracle methods relying on pose annotations.

RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration

RegGS is proposed as a framework that incrementally aligns locally generated 3D Gaussians from a feed-forward network into a globally consistent 3D representation via a differentiable 3DGS registration module based on the optimal-transport MW2 distance, enabling high-quality 3D reconstruction from unposed sparse views.

Relative Illumination Fields: Learning Medium and Light Independent Underwater Scenes

This paper proposes Relative Illumination Fields (RIF), which models non-uniform illumination distributions in camera-local coordinates via an MLP and jointly optimizes a volumetric medium representation, enabling clean reconstruction of underwater scenes free from light source and medium effects.

REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment

This paper proposes REPARO, which generates compositional 3D assets from a single image by first reconstructing individual object meshes separately and then performing layout alignment via optimal transport-based differentiable rendering.

RePoseD: Efficient Relative Pose Estimation with Known Depth Information

This paper proposes a set of efficient minimal solvers for relative pose estimation that jointly estimate the scale and affine parameters of monocular depth estimation (MDE) alongside the relative pose. The proposed solvers outperform state-of-the-art depth-aware solvers across three camera configurations (calibrated / shared focal length / unknown individual focal lengths), and large-scale experiments provide a definitive answer to the question of whether MDE depth actually benefits relative pose estimation.

Representing 3D Shapes with 64 Latent Vectors for 3D Diffusion Models

This paper proposes COD-VAE, a two-stage autoencoder framework—comprising a progressive encoder, a triplane decoder, and uncertainty-guided token pruning—that encodes 3D shapes into only 64 one-dimensional latent vectors, achieving a 16× compression ratio and 20.8× generation speedup while maintaining reconstruction quality.

Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

This paper proposes the Gaussian Atlas representation, which maps unordered 3D Gaussians onto a sphere via optimal transport and then flattens them into a structured 2D grid, enabling direct fine-tuning of pretrained 2D Latent Diffusion models for high-quality text-to-3D generation.

ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery

This paper proposes a residual split operation to replace the binary split/clone mechanism in 3D-GS, combined with image pyramid progressive supervision and a variable gradient threshold selection strategy, to adaptively address both over-reconstruction and under-reconstruction simultaneously, achieving state-of-the-art rendering quality while reducing the number of Gaussians.

Revisiting Point Cloud Completion: Are We Ready For The Real-World?

Using algebraic topology and persistent homology (\(\mathcal{PH}\)) tools, this paper reveals that existing synthetic point cloud datasets lack the rich topological features present in real-world data. It contributes the first real-world industrial point cloud completion dataset RealPC (~40,000 pairs, 21 categories), and proposes BOSHNet, which samples proxy homology skeletons as topological priors to achieve significant improvements on real-world point cloud completion.

RI3D: Few-Shot Gaussian Splatting with Repair and Inpainting Diffusion Priors

RI3D decomposes sparse-view synthesis into two sub-tasks — repairing visible regions and completing missing regions — and introduces two personalized diffusion models (repair + inpainting) combined with a two-stage optimization strategy to achieve high-quality 3DGS reconstruction under extremely sparse inputs.

RoboPearls: Editable Video Simulation for Robot Manipulation

This paper presents RoboPearls, an editable video simulation framework built upon 3D Gaussian Splatting (3DGS) that constructs photorealistic simulation environments from demonstration videos. It supports rich scene editing operations via Incremental Semantic Distillation (ISD) and a 3D-regularized NNFM loss, and employs a multi-LLM agent system to automate the simulation generation pipeline, forming a VLM-in-the-loop robot learning augmentation system.

RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation

This paper proposes RoboTron-Mani, a multimodal large model for robotic manipulation, together with the comprehensive dataset RoboData. By enhancing 3D perception via camera parameters and occupancy supervision, and enabling flexible multimodal fusion through a Modality-Isolation-Mask (MIM), RoboTron-Mani is the first generalist policy to simultaneously surpass specialist models across multiple datasets.

Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction

This paper proposes a robust and efficient 3DGS reconstruction framework for city-scale scenes. Through a visibility-based partitioning strategy, controllable LOD generation, a fine-grained appearance transformation module, and multiple regularization techniques, the framework achieves high-quality reconstruction and real-time rendering on urban data with large appearance variations and transient objects.

RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather

This paper proposes RobuSTereo, a framework that significantly improves the zero-shot generalization of stereo matching models under adverse weather conditions (rain, fog, snow) via a diffusion-based stereo data generation pipeline and a robust feature encoder combining a denoising Vision Transformer (DVT) with VGG19.

RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS

This paper identifies Gaussian densification in 3DGS as the key factor responsible for transient-object artifacts, and proposes a delayed Gaussian growth strategy along with a scale-cascaded mask bootstrapping method to decouple densification from dynamic region modeling, achieving state-of-the-art transient-free novel view synthesis across multiple benchmark datasets.

RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation

RoCo-Sim is proposed as the first simulation framework for roadside collaborative perception. By integrating extrinsic parameter optimization, occlusion-aware 3D asset placement, DepthSAM-based depth modeling, and style-transfer post-processing, it generates multi-view consistent simulation data from single images, achieving over 83% improvement in roadside 3D detection performance.

Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

Ross3D introduces 3D-aware visual reconstruction pretraining tasks—cross-view reconstruction and global BEV reconstruction—into the training pipeline of 2D large multimodal models (LMMs). Without modifying the input representation, it significantly improves 3D scene understanding through output-level supervision signals, achieving state-of-the-art performance on five benchmarks: SQA3D, ScanQA, Scan2Cap, ScanRefer, and Multi3DRefer.

S3E: Self-Supervised State Estimation for Radar-Inertial System

S3E is proposed as the first method to achieve complementary self-supervised state estimation from radar signal spectra and inertial data, leveraging a rotation-based cross-fusion technique to enhance spatial structural information under limited angular resolution.

S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction

S3R-GS identifies three major computational redundancies in conventional street scene reconstruction pipelines—unnecessary local-to-global coordinate transformations, excessive 3D-to-2D projections, and inefficient rendering of distant content—and proposes instance-specific projection, temporal visibility filtering, and adaptive level-of-detail (LOD) strategies to reduce reconstruction time to 20%–50% of competing methods while maintaining state-of-the-art rendering quality.

SAS: Segment Any 3D Scene with Integrated 2D Priors

This paper proposes SAS, a framework that for the first time integrates the complementary capabilities of multiple 2D open-vocabulary models to learn better 3D representations. It aligns feature spaces across models via Model Alignment via Text, and quantifies per-category model recognition capability using diffusion-synthesized images through Annotation-Free Model Capability Construction. These components jointly guide multi-model feature fusion and 3D distillation, achieving substantial improvements over prior work on ScanNet v2, Matterport3D, and nuScenes.

Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion

This paper presents Sat2City, the first 3D generation framework capable of simultaneously producing city-scale geometry and appearance from a single satellite image. By integrating sparse voxel grids with a cascaded latent diffusion model, it introduces a Re-Hash multi-scale feature grid and an inverse sampling strategy, achieving high-fidelity generation superior to existing methods on a self-constructed 3D city dataset.

Scene Coordinate Reconstruction Priors

This paper proposes a probabilistic training framework for scene coordinate regression (SCR) that introduces hand-crafted depth distribution priors and a learned prior based on a 3D point cloud diffusion model, significantly improving scene reconstruction quality, camera pose estimation, and downstream task performance under insufficient multi-view constraints.

SceneMI: Motion In-betweening for Modeling Human-Scene Interactions

This work formally introduces the scene-aware motion in-betweening problem and proposes the SceneMI framework, which comprehensively encodes scene context via a dual-layer scene descriptor (global voxels + local BPS). By leveraging the denoising capability of diffusion models to handle noisy keyframes, SceneMI reduces the collision frame rate by 56.9% on TRUMANS, and reduces foot skating by 37.5% and jitter by 56.5% on the real-world GIMO dataset.

Seeing and Seeing Through the Glass: Real and Synthetic Data for Multi-Layer Depth Estimation

This paper introduces the novel task of multi-layer depth estimation, constructs the LayeredDepth benchmark comprising 1,500 real-world images, and develops a procedural synthetic data generator. The work reveals severe deficiencies of existing depth estimation methods when applied to transparent objects.

SegmentDreamer: Towards High-Fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation

This paper proposes SegmentDreamer, which reformulates the SDS loss via Segmented Consistency Trajectory Distillation (SCTD) to address the imbalance between self-consistency and cross-consistency in existing consistency distillation (CD) methods, enabling high-fidelity 3D asset generation via 3DGS in ~32 minutes on a single A100 GPU.

SeHDR: Single-Exposure HDR Novel View Synthesis via 3D Gaussian Bracketing

SeHDR is proposed as the first framework for synthesizing HDR novel views from single-exposure multi-view LDR images. It generates bracketed exposures in 3D Gaussian space (Bracketed 3D Gaussians) and merges them into an HDR scene representation via differentiable Neural Exposure Fusion (NeEF).

Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis

SE-GS dynamically generates diverse 3DGS models during training via an uncertainty-aware perturbation strategy, and leverages a self-ensembling mechanism to allow the Σ-model to aggregate information from perturbed models, effectively mitigating overfitting under sparse-view settings and achieving state-of-the-art few-shot novel view synthesis performance across multiple datasets.

Sequential Gaussian Avatars with Hierarchical Motion Context

This paper proposes SeqAvatar, which leverages explicit 3DGS representations combined with hierarchical motion context (coarse-grained skeletal motion + fine-grained per-point velocity) to model motion-correlated appearance changes in human avatars. Spatio-temporal multi-scale sampling further enhances the robustness of motion conditioning. SeqAvatar achieves state-of-the-art rendering quality across multiple datasets while maintaining real-time rendering speed.

Shape of Motion: 4D Reconstruction from a Single Video

This paper proposes a dynamic 3D Gaussian representation based on \(\mathrm{SE}(3)\) motion bases, recovering globally consistent 3D motion trajectories from monocular video while simultaneously enabling real-time novel view synthesis and long-range 3D tracking, outperforming prior methods comprehensively on the iPhone and Kubric datasets.
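
In our own notation (an illustrative reading of the formulation, not the paper's exact equations), each Gaussian's rigid trajectory is a low-rank blend of a few shared motion bases:

\[
T_i(t) = \exp\Big(\sum_{b=1}^{B} w_{ib}\,\tau_b(t)\Big) \in \mathrm{SE}(3),
\]

where \(\tau_b(t) \in \mathfrak{se}(3)\) are shared time-varying basis motions and \(w_{ib}\) are per-Gaussian blending weights; with \(B\) far smaller than the number of Gaussians, the recovered motion field stays globally consistent.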

SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians

SHeaP replaces traditional differentiable mesh rendering with 2D Gaussian Splatting for self-supervised 3DMM prediction training. By binding Gaussians to the 3DMM mesh for re-animation, and introducing a graph-convolution-based Gaussian regressor together with geometry consistency regularization, SHeaP surpasses all self-supervised methods on the NoW and NeRSemble benchmarks.

SiM3D: Single-Instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark

This paper introduces SiM3D, the first benchmark for multiview multimodal 3D anomaly detection and segmentation targeting single-instance industrial scenarios. It employs industrial-grade sensors to acquire high-resolution data, replaces 2D anomaly maps with voxelized Anomaly Volumes, and is the first benchmark to support cross-domain synthetic-to-real evaluation.

Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation

Sdirt proposes a ray-tracing-based dual-pixel (DP) image simulation framework that computes spatially varying DP PSFs incorporating lens aberrations and phase-splitting characteristics, thereby bridging the domain gap between simulated and real DP data and improving the generalization of depth estimation models on real DP images.

Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras

This paper proposes a rolling shutter relative pose estimation method that requires no explicit camera motion modeling. It recovers camera pose solely from the intersections of line projections with a single selected scanline per image, and develops multiple minimal solvers for special configurations such as parallel lines and known gravity direction.

SL2A-INR: Single-Layer Learnable Activation for Implicit Neural Representation

This paper proposes SL2A-INR, a hybrid architecture combining a single-layer learnable activation block parameterized by Chebyshev polynomials with a ReLU-MLP fusion block, effectively alleviating spectral bias in implicit neural representations and achieving state-of-the-art performance on image fitting, 3D shape reconstruction, and novel view synthesis.

Sparfels: Fast Reconstruction from Sparse Unposed Imagery

Sparfels integrates a 3D foundation model (MASt3R) with efficient test-time optimization (2DGS). MASt3R provides an initial point cloud, camera poses, and dense correspondences to guide optimization. A novel splat color variance loss is introduced, enabling state-of-the-art geometric reconstruction from sparse unposed images in under three minutes.

Spatial-Temporal Aware Visuomotor Diffusion Policy Learning

This paper proposes the 4D Diffusion Policy (DP4), which injects 3D spatial and 4D spatial-temporal awareness into a diffusion policy via a dynamic Gaussian world model, achieving substantial improvements over baselines across 17 simulation tasks and 3 real-robot tasks (Adroit +16.4%, DexArt +14%, RLBench +6.45%, real tasks +8.6%).

SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images

SpatialSplat is proposed to generate compact semantic 3D Gaussians from sparse unposed images via feed-forward inference, leveraging a dual-field semantic representation and a selective Gaussian mechanism that reduces representation parameters by 60% while surpassing state-of-the-art methods.

SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models

This paper presents SpinMeRound, an identity-conditioned multi-view diffusion model that generates 360° full-head portraits with consistent identity and corresponding normal maps from a single or few face images, surpassing existing multi-view diffusion methods on face novel view synthesis benchmarks.

SplatTalk: 3D VQA with Gaussian Splatting

This paper proposes SplatTalk, a framework that leverages generalizable 3D Gaussian Splatting to generate LLM-compatible 3D tokens from multi-view RGB images alone, enabling zero-shot 3D visual question answering that surpasses 2D LMM baselines and approaches 3D LMM performance.

Stable Score Distillation

This paper proposes Stable Score Distillation (SSD), which achieves more stable and precise text-guided 2D/3D editing through single-classifier cross-prompt guidance and cross-trajectory regularization via a null-text branch, improving editing alignment while preserving the structural content of the source.

StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions

This work presents the first density-guided poisoning attack against 3D Gaussian Splatting (3DGS). By injecting illusion Gaussians into low-density regions and introducing adaptive noise to disrupt multi-view consistency, the method achieves attacks that are clearly visible from target viewpoints while remaining imperceptible from all others.

Stereo Any Video: Temporally Consistent Stereo Matching

This paper proposes Stereo Any Video, a framework that achieves spatially accurate and temporally consistent video stereo matching without relying on camera poses or optical flow. It integrates three core modules — monocular video depth foundation model priors (Video Depth Anything), all-to-all-pair correlation, and temporal convex upsampling — attaining state-of-the-art performance under zero-shot settings across multiple benchmarks.

StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting

StochasticSplats introduces Stochastic Transparency into 3DGS, replacing depth-sorted alpha blending with an unbiased Monte Carlo estimator to achieve sorting-free, popping-free rendering. At 1 SPP, it is 4× faster than standard CUDA 3DGS, and the number of samples provides a flexible quality–speed trade-off.
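
The estimator itself is classic stochastic transparency and fits in a few lines. In the toy single-pixel sketch below, each splat is accepted independently with probability \(\alpha\), the nearest accepted splat wins the sample, and averaging over samples converges to sorted alpha blending without ever sorting.

```python
# Toy single-pixel stochastic transparency vs. classic sorted blending.
import numpy as np

rng = np.random.default_rng(0)
depth = rng.uniform(size=20)                 # unsorted splat depths
alpha = rng.uniform(0.2, 0.8, size=20)
color = rng.uniform(size=20)

def sorted_alpha_blend(depth, alpha, color):
    """Reference: classic front-to-back compositing (requires a sort)."""
    order = np.argsort(depth)
    out, trans = 0.0, 1.0
    for i in order:
        out += trans * alpha[i] * color[i]
        trans *= 1.0 - alpha[i]
    return out

def stochastic_estimate(depth, alpha, color, spp):
    """Unbiased estimator: accept each splat with prob alpha, nearest wins."""
    total = 0.0
    for _ in range(spp):
        accepted = rng.uniform(size=alpha.shape) < alpha   # no sorting
        if accepted.any():
            total += color[accepted][np.argmin(depth[accepted])]
    return total / spp

print(sorted_alpha_blend(depth, alpha, color))
print(stochastic_estimate(depth, alpha, color, spp=50_000))  # ≈ same value
```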

StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors

This paper presents StrandHead, the first framework for generating strand-level 3D head avatars by distilling human-centric 2D diffusion priors. It introduces a differentiable prismatization algorithm to convert hair strands into watertight meshes with gradient backpropagation, and designs regularization losses based on statistical geometric priors of hair strands to ensure hairstyle realism.

StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning

StruMamba3D is proposed to maintain 3D point adjacency relationships by endowing SSM hidden states with spatial positional attributes (spatial states), and introduces a sequence-length-adaptive strategy to address the sequence length discrepancy between pre-training and downstream tasks. The method achieves 92.75% accuracy on the hardest ScanObjectNN split and 95.1% on ModelNet40, both representing single-modality SOTA.

SuperDec: 3D Scene Decomposition with Superquadric Primitives

SuperDec is a Transformer-based learning approach that decomposes point clouds into compact sets of superquadric primitives. Trained on ShapeNet, it generalizes to real-world scenes and supports downstream applications including robot manipulation and controllable generation.

SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates

This paper proposes SuperMat, a single-step inference framework for PBR material decomposition. Through structured expert branches and scheduler correction, it enables end-to-end training and introduces a re-render loss to enforce physical consistency, accelerating inference from seconds to milliseconds.

SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting

SurfaceSplat proposes a hybrid framework that establishes bidirectional connections between SDF (Signed Distance Function) and 3D Gaussian Splatting (3DGS): the SDF provides coarse geometry to enhance 3DGS rendering quality, while novel-view images rendered by 3DGS are in turn used to refine SDF surface reconstruction accuracy. The method achieves state-of-the-art performance on both surface reconstruction and novel view synthesis on the DTU and MobileBrick datasets.

SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing

SVG-Head is proposed as a hybrid representation combining surface Gaussians (with explicit texture maps) and volumetric Gaussians (for supplementary modeling of non-Lambertian regions), achieving, for the first time, real-time appearance editing of high-fidelity Gaussian head avatars.

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

TAPNext reformulates Tracking Any Point (TAP) in video as a sequential masked token decoding task, eliminating the tracking-specific inductive biases and heuristics prevalent in conventional approaches. It achieves causal online tracking and establishes new state-of-the-art results among both online and offline trackers, with remarkably low inference latency.

TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

TAR3D is proposed as the first framework to quantize triplane representations into discrete geometric parts and generate them autoregressively via GPT. A 3D VQ-VAE encodes meshes of arbitrary face counts into fixed-length sequences, while TriPE positional encoding preserves 3D spatial information. The method comprehensively outperforms existing approaches on text/image-to-3D tasks.

Text2VDM: Text to Vector Displacement Maps for Expressive and Interactive 3D Sculpting

Text2VDM is proposed as the first framework for generating VDM sculpting brushes from text. It addresses the semantic entanglement problem in sub-object structure generation via Sobolev-preconditioned mesh deformation and a semantically enhanced SDS loss.

Textured 3D Regenerative Morphing with 3D Diffusion Prior

This paper proposes a regenerative 3D morphing method based on a 3D diffusion prior. By performing interpolation at three levels — initial noise, model parameters, and conditioning features — and combining three strategies (Attention Fusion, Token Reordering, and Low-Frequency Enhancement), it is the first to achieve smooth and semantically plausible morphing sequences for textured 3D objects across categories.

TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction

This paper proposes the TimeFormer module, which implicitly learns temporal relationships among deformable 3D Gaussians via a cross-time Transformer encoder, and introduces a dual-stream optimization strategy that transfers motion knowledge during training with no additional overhead at inference.

TokenUnify: Scaling Up Autoregressive Pretraining for Neuron Segmentation

TokenUnify is proposed to unify three complementary learning objectives—random token prediction, next-token prediction, and next-all-token prediction—enabling hierarchical predictive coding on large-scale electron microscopy data. The method reduces autoregressive error accumulation from \(O(K)\) to \(O(\sqrt{K})\), achieving a 44% improvement on downstream neuron segmentation.

Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views

This paper proposes Point-PQAE, the first framework to introduce cross-view reconstruction into 3D generative self-supervised learning. Its point cloud cropping mechanism for generating decoupled views, View-Relative Positional Embedding (VRPE), and Positional Query module make the pre-training task more challenging and informative. Point-PQAE surpasses Point-MAE by an average of 6.7% on ScanObjectNN under the MLP-Linear protocol.

Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting

This paper proposes a scalable data generation pipeline that automatically converts single-view 2D images into metric-scale 3D representations—including point clouds, camera poses, and depth maps—by integrating depth estimation, camera calibration, and scale calibration. The pipeline produces COCO-3D and Objects365-v2-3D datasets comprising approximately 2 million scenes, yielding significant performance gains across multiple 3D tasks.

Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing

This paper proposes the Gaussian Instance Tracing (GIT) mechanism, which maintains a per-Gaussian instance weight matrix across views via inverse rasterization. GIT jointly addresses two longstanding challenges—multi-view inconsistency in 2D segmentation and boundary Gaussian ambiguity—and yields significant improvements in 3D segmentation quality under both offline contrastive learning and online self-prompting settings.

TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos

TRACE is a framework that treats each 3D Gaussian kernel as a rigid particle and learns an independent translational-rotational dynamical system for it—comprising a complete set of physical parameters including velocity, acceleration, angular velocity, and angular acceleration. Without any manual annotation, TRACE learns the physical motion laws of 3D scenes from multi-view dynamic videos and accurately extrapolates future frames.
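
A minimal constant-acceleration integrator over the per-Gaussian state listed above makes the representation concrete; the paper learns these parameters from video rather than hand-integrating, so this is only an illustration of the state, not the method.

```python
import torch

def integrate_rigid_particles(x, R, v, a, omega, alpha, dt):
    """Advance per-Gaussian rigid states by one constant-acceleration step.

    x: [N, 3] positions      R: [N, 3, 3] orientations
    v, a: [N, 3] linear velocity / acceleration
    omega, alpha: [N, 3] angular velocity / acceleration
    """
    x_new = x + v * dt + 0.5 * a * dt ** 2
    v_new = v + a * dt
    omega_new = omega + alpha * dt
    dtheta = 0.5 * (omega + omega_new) * dt     # mean angular velocity * dt
    return x_new, axis_angle_to_matrix(dtheta) @ R, v_new, omega_new

def axis_angle_to_matrix(aa):
    """Rodrigues' formula, batched over [N, 3] axis-angle vectors."""
    theta = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)   # [N, 1]
    k = aa / theta                                          # unit axes
    K = torch.zeros(aa.shape[0], 3, 3)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    s, c = torch.sin(theta)[..., None], torch.cos(theta)[..., None]
    return torch.eye(3).expand_as(K) + s * K + (1 - c) * (K @ K)
```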

Tune-Your-Style: Intensity-Tunable 3D Style Transfer with Gaussian Splatting

This paper proposes Tune-Your-Style, the first intensity-tunable 3D style transfer paradigm, which explicitly models style intensity via Gaussian neurons and parameterizes a learnable style tuner. Combined with a two-stage optimization strategy, the method enables users to freely adjust the degree of style injection without retraining.

TurboReg: TurboClique for Robust and Efficient Point Cloud Registration

This paper proposes TurboReg, a framework that replaces traditional maximum clique enumeration with lightweight 3-cliques (TurboCliques) and introduces a highly parallelizable Pivot-Guided Search (PGS) algorithm, achieving state-of-the-art registration accuracy while delivering a speedup of over 208×.
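
For a 3-clique, the test reduces to pairwise rigid-distance consistency among the three correspondences; the threshold check below is a simplified stand-in for the paper's compatibility scoring.

```python
import numpy as np

def is_turbo_clique(src, dst, idx, tau=0.1):
    """Check whether three correspondences form a mutually compatible 3-clique.

    src, dst: [N, 3] matched source/target points; idx: three correspondence
    indices. Compatibility here is rigid-distance preservation within tau.
    """
    i, j, k = idx
    for a, b in [(i, j), (j, k), (i, k)]:
        d_src = np.linalg.norm(src[a] - src[b])
        d_dst = np.linalg.norm(dst[a] - dst[b])
        if abs(d_src - d_dst) > tau:  # this edge fails the compatibility graph
            return False
    return True  # all three edges compatible: a valid clique
```

Each valid 3-clique yields a rigid-transform hypothesis (e.g., via Kabsch/SVD on the three pairs) that can be scored in parallel, which is what makes the scheme so much cheaper than maximal-clique search.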

UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

This paper proposes UniEgoMotion, the first unified egocentric motion model that achieves 3D human motion reconstruction, forecasting, and generation from an egocentric perspective within a single model, via a conditional motion diffusion framework and a head-centric motion representation. The large-scale EE4D-Motion dataset is also released.

Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes

This work presents the first RGB-only, single-model framework that unifies object detection and category-level pose estimation. By leveraging Neural Mesh Models as 3D prototypes, the method performs feature matching and multi-model RANSAC PnP to simultaneously detect objects and estimate their 9D poses. It surpasses the state of the art on all scale-agnostic metrics on REAL275.
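
The pose-solving step can be sketched with OpenCV's standard RANSAC PnP. The correspondences below are synthetic toys, and the paper's multi-model variant and 9-DoF scale recovery are omitted.

```python
import cv2
import numpy as np

# Toy correspondences: 3D prototype vertices projected by a ground-truth pose.
pts_3d = np.random.rand(100, 3).astype(np.float32)
K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], dtype=np.float32)
rvec_gt = np.array([[0.1], [0.2], [0.3]], np.float32)
tvec_gt = np.array([[0.0], [0.0], [2.0]], np.float32)
pts_2d, _ = cv2.projectPoints(pts_3d, rvec_gt, tvec_gt, K, None)

# Robust 6-DoF pose from the (feature-matched) 2D-3D correspondences.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d, pts_2d, K, None, reprojectionError=4.0, iterationsCount=1000
)
# Category-level 9-DoF additionally recovers per-axis scale, e.g. by fitting
# the prototype to the observation in normalized object coordinates.
```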

UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

This paper proposes UniVG, a unified image generation model built on MM-DiT that supports T2I generation, editing, identity-preserving generation, layout-guided synthesis, depth estimation, and more within a single set of weights, achieved via channel-wise input concatenation, progressive multi-task training, and external condition injection.

Unleashing Vecset Diffusion Model for Fast Shape Generation (FlashVDM)

FlashVDM proposes a systematic framework to accelerate both DiT sampling and VAE decoding in Vecset Diffusion Models (VDM): progressive flow distillation reduces diffusion steps to 5, while adaptive KV selection, hierarchical volume decoding, and an efficient decoder yield a 45× VAE decoding speedup, achieving an overall 32× acceleration that enables high-quality 3D shape generation in under one second.
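
The volume-decoding saving can be illustrated with a simplified coarse-to-fine stand-in: evaluate a cheap coarse grid, then spend fine-resolution decoder queries only near the surface. Grid sizes and the band threshold below are illustrative, not the paper's settings.

```python
import numpy as np

def hierarchical_decode(query_sdf, lo=-1.0, hi=1.0, coarse=32, fine=256):
    """Coarse-to-fine volume decoding (simplified sketch).

    query_sdf: callable mapping [M, 3] points to [M] SDF values (the decoder).
    Returns the fine-resolution query points that actually need decoding.
    """
    step = (hi - lo) / coarse
    centers = lo + (np.arange(coarse) + 0.5) * step
    gx, gy, gz = np.meshgrid(centers, centers, centers, indexing="ij")
    sdf = query_sdf(np.stack([gx, gy, gz], -1).reshape(-1, 3))
    sdf = sdf.reshape(coarse, coarse, coarse)

    active = np.abs(sdf) < 1.5 * step        # cells straddling the surface
    ratio = fine // coarse
    offs = (np.arange(ratio) + 0.5) * (step / ratio)
    fine_pts = []
    for cell in np.argwhere(active):         # refine only the active cells
        base = lo + cell * step              # cell corner, per axis
        ax = base[None, :] + offs[:, None]   # [ratio, 3] sub-coordinates
        fx, fy, fz = np.meshgrid(ax[:, 0], ax[:, 1], ax[:, 2], indexing="ij")
        fine_pts.append(np.stack([fx, fy, fz], -1).reshape(-1, 3))
    return np.concatenate(fine_pts)          # decode only these queries
```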

UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis

This paper proposes UPP, a unified point-level prompting framework that reformulates point cloud denoising and completion as prompting mechanisms for downstream tasks. It introduces a Rectification Prompter to filter noise, a Completion Prompter to recover missing regions, and a Shape-Aware Unit to capture geometry-sensitive features. With only 6.3% of the parameters, UPP surpasses full fine-tuning on noisy and incomplete point clouds.

UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling

This paper proposes UST-SSM, which extends selective state space models to point cloud video analysis via three core modules — Spatio-Temporal Selective Scanning (STSS), Spatio-Temporal Structure Aggregation (STSA), and Temporal Interaction Sampling (TIS) — achieving linear complexity while surpassing Transformer-based methods.

VertexRegen: Mesh Generation with Continuous Level of Detail

VertexRegen is proposed to reframe mesh generation—inspired by progressive meshes—as learning the inverse of edge collapse, i.e., vertex split, enabling "anytime" mesh generation with continuous level of detail.
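
The forward operation whose inverse the model learns is an ordinary edge collapse; the sketch below shows it with simplified bookkeeping (the orphaned vertex stays in the array), just to make the vertex-split inverse concrete.

```python
import numpy as np

def edge_collapse(V: np.ndarray, F: np.ndarray, i: int, j: int):
    """Collapse edge (i, j) by merging vertex j into i at the edge midpoint.

    V: [N, 3] vertices, F: [M, 3] triangle indices. A learned vertex split
    would invert this step; generating a mesh then means applying splits
    from a coarse base, and stopping early yields a coarser level of detail.
    """
    V = V.copy()
    V[i] = 0.5 * (V[i] + V[j])   # move the surviving vertex to the midpoint
    F = np.where(F == j, i, F)   # redirect every use of j to i
    degenerate = (F[:, 0] == F[:, 1]) | (F[:, 1] == F[:, 2]) | (F[:, 0] == F[:, 2])
    return V, F[~degenerate]     # drop the two triangles the edge supported
```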

ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Grounded in the key observation that VFM layers can be partitioned into low-level feature extractors and high-level task adapters, this paper proposes ViT-Split, which freezes the VFM backbone and introduces a task head (replicating the last \(K_t\) layers) and a prior head (a lightweight CNN aggregating multi-scale prior features). On ADE20K, ViT-Split achieves 58.2 mIoU (DINOv2-L) with only a linear head, offers 4× faster training, and requires only 1/4–1/5 of the trainable parameters compared to conventional adapters.
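
A minimal sketch of the splitting idea, assuming a timm-style ViT that exposes its transformer layers as `.blocks`; the lightweight CNN prior head is omitted for brevity.

```python
import copy
import torch.nn as nn

def make_vit_split(vfm: nn.Module, k_t: int):
    """Split a frozen VFM into a shared extractor and a tunable task head.

    Assumes the model exposes its transformer layers as `vfm.blocks`
    (true of timm-style ViTs); k_t mirrors the paper's \(K_t\).
    """
    for p in vfm.parameters():
        p.requires_grad_(False)                   # frozen backbone
    task_head = copy.deepcopy(vfm.blocks[-k_t:])  # replicated last K_t layers
    for p in task_head.parameters():
        p.requires_grad_(True)                    # only these are trained
    return vfm, task_head
```

Because the backbone is frozen and shared, its features can be computed once and reused across tasks, which is where the training-speed and parameter savings come from.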

Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting

This paper proposes Vivid4D, which reformulates multi-view augmentation from monocular video as a video inpainting problem — warping the video to novel viewpoints using monocular depth priors, then employing a video diffusion model to inpaint occluded regions. Through an iterative view expansion strategy and a robust reconstruction loss, Vivid4D significantly improves 4D dynamic scene reconstruction quality from monocular video.

VoluMe: Authentic 3D Video Calls from Live Gaussian Splat Prediction

Microsoft proposes the first method for real-time prediction of 3D Gaussian Splatting reconstructions from a monocular 2D camera, simultaneously satisfying four requirements: authenticity, realism, liveness, and temporal stability. This enables anyone to conduct volumetric 3D video calls using only a standard laptop camera.

VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions

This paper proposes VolumetricSMPL, an efficient neural volumetric body model based on Neural Blend Weights (NBW), achieving 10× inference speedup and 6× memory reduction over its predecessor COAP, while providing more accurate differentiable collision modeling through SDF (rather than occupancy function) representation.
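
The SDF-based penetration term can be sketched generically; the sign convention (negative inside the body) and the callable interface are assumptions for illustration, not VolumetricSMPL's actual API.

```python
import torch

def collision_loss(sdf_body, points: torch.Tensor) -> torch.Tensor:
    """Penalize penetration of scene/object points into the body volume.

    sdf_body: callable mapping [N, 3] points to signed distances, assumed
    negative inside the body. Only interior points contribute, and the
    term is differentiable for use in interaction optimization.
    """
    d = sdf_body(points)                  # [N] signed distances
    return torch.relu(-d).pow(2).mean()   # squared penetration depth
```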

WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions

WonderPlay introduces a Hybrid Generative Simulator that combines coarse 3D dynamic simulation from a physics solver with high-quality generation from a video diffusion model, enabling realistic multi-material dynamic 3D scene generation from a single image and user-specified actions. The framework supports diverse material types including rigid bodies, cloth, liquids, smoke, and granular materials.

WonderTurbo: Generating Interactive 3D World in 0.72 Seconds

WonderTurbo proposes the first real-time interactive 3D scene generation framework. Through the coordinated acceleration of three modules — StepSplat (feed-forward 3DGS), QuickDepth (lightweight depth completion), and FastPaint (2-step diffusion inpainting) — it compresses single-step scene extension time from 10+ seconds to 0.72 seconds, achieving a 15× speedup while maintaining generation quality comparable to WonderWorld.

Zero-Shot Inexact CAD Model Alignment from a Single Image

This paper proposes a weakly supervised 9-DoF CAD model alignment method that enhances DINOv2 features with geometry awareness and performs dense alignment optimization in Normalized Object Coordinate (NOC) space, enabling zero-shot 3D alignment that requires no pose annotations and generalizes to unseen categories.

ZeroStereo: Zero-shot Stereo Matching from Single Images

This paper proposes ZeroStereo, a pipeline that starts from an arbitrary single image, uses monocular depth estimation to generate pseudo disparity, and synthesizes high-quality right-view images via a fine-tuned diffusion inpainting model. The approach achieves state-of-the-art zero-shot stereo matching generalization using only 35K synthetic training samples.
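
A toy forward-warping sketch of the view-synthesis step, assuming a known focal length and a chosen virtual baseline; the holes this warp leaves at occlusions are exactly what the paper's diffusion inpainting model fills.

```python
import numpy as np

def synthesize_right_view(left, depth, focal, baseline):
    """Forward-warp a left image to a pseudo right view via mono depth.

    disparity d = f * B / depth; left pixel (x, y) lands at (x - d, y) in
    the right view. Returns the warped image and an occlusion mask.
    """
    H, W, _ = left.shape
    disp = focal * baseline / np.maximum(depth, 1e-6)
    right = np.zeros_like(left)
    filled = np.zeros((H, W), dtype=bool)
    xs = np.arange(W)[None, :] - np.round(disp).astype(int)  # target columns
    for y in range(H):
        valid = (xs[y] >= 0) & (xs[y] < W)
        order = np.argsort(disp[y])              # far-to-near: near wins
        cols = np.arange(W)[order][valid[order]]
        right[y, xs[y][cols]] = left[y, cols]
        filled[y, xs[y][cols]] = True
    return right, ~filled   # image plus the holes left for inpainting
```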