🧊 3D Vision
📷 CVPR2026 · 646 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (201) · 🧪 ICML2026 (30) · 🤖 AAAI2026 (79) · 🧠 NeurIPS2025 (116) · 📹 ICCV2025 (267)
🔥 Top topics: 3D Gaussian Splatting ×102 · Dynamic Scenes ×58 · 3D Reconstruction ×37 · Point Cloud ×32 · Diffusion Models ×27
- 240FPS Stereo Vision from Monocular Mixed Spikes
-
A single monocular spike camera is used to optically mix left and right views onto the same sensor, with one view subjected to periodic 60 Hz modulation. Through a two-stage process—"Least Squares Baseline Decoupling + SMS-Net Depth Refinement"—a 240 FPS binocular video is reconstructed from the mixed spike stream. This approach maintains the compact hardware and data efficiency of a monocular setup while achieving depth estimation accuracy close to the "theoretical upper bound."
- 2D-LFM: Lifting Foundation Model without 3D Supervision
-
By injecting "correspondence positional encodings" into every layer of a Transformer, this work trains the first cross-category 2D→3D lifting foundation model using only 2D keypoints (without any 3D ground truth). It outperforms large models like VGGT that rely on RGB depth in object-level geometry (Pascal3D+ 8.1mm vs. VGGT 89.4mm).
- 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding
-
This work appends a lightweight, task-agnostic "geometric bypass," the Cross-View Module (CvM, consisting of a spatial-aware encoder, a multi-view Transformer, and a cost volume), to standard Multi-Task Learning (MTL) networks. By injecting geometric correspondences between adjacent views into the shared features as a geometric-consistency cue, a single network gains a better "understanding of 3D" while simultaneously predicting depth, segmentation, surface normals, and boundaries. This yields plug-and-play performance gains on NYUv2 and PASCAL-Context (max \(\Delta\)MTL +3.09).
- 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image
-
A new "in-place completion" paradigm is proposed, extending pre-trained object-level generative priors to the scene level. It directly completes fragmented geometry at its original location without explicit pose alignment. The authors also construct a large-scale scene dataset, ARSG-110K, and the method significantly outperforms baselines such as MIDI and Gen3DSR.
- 3D-IDE: 3D Implicit Depth Emergent
-
The "Implicit Geometric Emergence Principle" (IGEP) is proposed. By utilizing a lightweight geometric verifier and a global 3D teacher for privileged supervision during training, the visual encoder develops 3D perception capabilities using only RGB video input. This achieves zero latency overhead during inference and outperforms comparable methods on several 3D scene understanding benchmarks.
- 3D-Object Perception Transformer (3PT)
-
3PT replaces the existing zero-shot 3D object perception pipelines—often characterized by "assembled frozen foundation models + depth dependency"—with a unified, end-to-end trained Transformer framework (detection + object grouping + iterative refinement) directly conditioned on CAD models. Relying solely on multi-view RGB, it significantly outperforms SOTA in detection and 6DoF pose on BOP benchmarks (with a relative improvement of 56.5% in AP-mm for industrial datasets), securing 7 first-place rankings across 11 tracks in the BOP Challenge 2025.
- 3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding
-
3D-VCD is the first inference-time hallucination mitigation framework for 3D embodied agents. It applies semantic/geometric perturbations to object-centric 3D scene graphs to generate a "corrupted" negative sample context. By running the MLLM on both the original and perturbed graphs and using a contrastive decoding formula, it suppresses tokens that maintain "high probability even when the scene changes." This method requires no retraining, incurs nearly zero additional overhead, and significantly reduces over-affirmation and object hallucinations in 3D-POPE and HEAL.
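The contrastive decoding step can be sketched in a few lines. The summary does not give the paper's exact formula, so this sketch assumes the commonly used visual contrastive decoding form, `(1 + alpha) * logits_original - alpha * logits_perturbed`; the function name and the `alpha` value are illustrative, not taken from the paper.

```python
def contrastive_logits(orig, perturbed, alpha=1.0):
    """Combine per-token logits from the original and the perturbed scene graph.

    Assumed form (not confirmed by the summary):
        (1 + alpha) * orig - alpha * perturbed
    Tokens whose logit stays high under perturbation gain nothing,
    while tokens that depend on the true scene are boosted.
    """
    return [(1.0 + alpha) * o - alpha * p for o, p in zip(orig, perturbed)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Token 0 is equally likely with or without the real scene (a likely prior-driven answer).
# Token 1 loses support when the scene graph is corrupted (a scene-grounded answer).
orig_logits = [3.0, 2.0]
perturbed_logits = [3.0, 0.5]
cd_logits = contrastive_logits(orig_logits, perturbed_logits, alpha=1.0)
```

In this toy case plain decoding picks token 0, while the contrastive logits favor token 1, which is the behavior the summary describes as suppressing tokens that stay probable "even when the scene changes."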
- 3D Gaussian Splatting at Arbitrary Resolutions with Compact Proxy Anchors
-
Building upon the anchor-based framework of Scaffold-GS, this paper employs FiLM to inject "target resolution" into anchor features and introduces a "Pixel Coverage Gate" to dynamically activate Gaussians based on sampling rates, achieving aliasing-free rendering at continuous arbitrary resolutions. Simultaneously, the method stores only approximately 30% of proxy anchors and utilizes a residual predictor to reconstruct the remaining leaf anchors online, reducing storage to nearly half of Scaffold-GS without compromising quality.
- Nope-SGS: 3D Gaussian Reconstruction from Unposed Spike Streams
-
This paper introduces Nope-SGS, the first framework for reconstructing high-speed 3D scenes directly from raw spike camera streams without camera pose priors. By remodeling spike imaging as a binomial distribution, it recovers a stable Normalized Binomial Distribution Spike (NBDS) supervision signal from unstable single-frame spikes. Combined with key-frame selection and progressive optimization, it simultaneously solves for camera trajectories and 3D Gaussians. Compared to SOTA, it achieves up to a 7.4dB improvement in PSNR, a 40% reduction in ATE, and is the fastest among spike-based methods.
- 3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction
-
This paper proposes the Self-Constrained Prior (SCP), which constructs a TSDF distance field by fusing depth maps rendered from the current 3D Gaussians. This field serves as a prior to impose geometry-aware constraints (outlier removal, opacity constraints, and movement toward the surface) on Gaussians, achieving SOTA high-fidelity surface reconstruction on NeRF-Synthetic and DTU datasets.
- 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
-
The LAM3C framework demonstrates for the first time that Video-Generated Point Clouds (VGPC) reconstructed from unlabeled web videos (e.g., real estate tours) can substitute for real 3D scans in 3D self-supervised pre-training. By employing Laplacian smoothing and noise consistency losses to stabilize representation learning on noisy point clouds, combined with the self-constructed RoomTours dataset (49K scenes), the method matches or surpasses approaches using real scans in indoor semantic and instance segmentation.
- 3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience
-
This paper proposes the training-free 3DrawAgent framework, which enables frozen LLMs to self-learn 3D spatial reasoning through "contrastive experience optimization." It generates language-driven 3D Bezier sketches in an autoregressive manner, achieving performance close to trained methods without parameter updates.
- 3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects
-
3DReflecNet constructs a hybrid dataset exceeding 22 TB, containing over 120k synthetic instances and 1,000+ real scans with a total of 7M+ multi-view frames. It specifically targets three challenging material categories that "break photometric consistency assumptions"—reflective, transparent, and low-texture—and provides benchmarks for five major tasks. Experiments systematically expose the catastrophic failure of current SOTA reconstruction methods on these materials.
- 4C4D: 4 Camera 4D Gaussian Splatting
-
The 4C4D framework is proposed to adaptively control Gaussian opacity decay through a Neural Decaying Function. This addresses the imbalance between geometry and appearance learning in sparse (only 4 cameras) 4D Gaussian Splatting, achieving SOTA performance on multiple datasets.
- 4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis
-
To address the two major ambiguities in rotation-invariant (RI) point cloud representations—"local symmetric structures being indistinguishable" and "global pose information being discarded"—this paper proposes Ga4DPF. It uses learnable steerable transformations to equivariantly lift point clouds into a 4D space to construct robust local point-pair features, combined with a Bingham distribution to dynamically estimate a consistent global rotation that assigns a global anchor to each point. It achieves SOTA performance on ModelNet40 / ScanObjectNN / ShapeNetPart with lower parameter counts and FLOPs.
- 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
-
4DPM decomposes casual monocular RGB videos into a set of rigidly moving 3D primitives. By "glueing" each primitive over time using dense 2D correspondences, it only requires estimating an \(SE(3)\) pose per primitive to remap all historical observations to any moment. This enables a complete and persistent scene geometry at every frame, even maintaining the positions of occluded objects (object permanence).
- 4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video
-
The 4DEquine framework is proposed to disentangle 4D reconstruction of equines from monocular video into two sub-problems: dynamic motion estimation (AniMoFormer) and static appearance reconstruction (EquineGS). It achieves SOTA performance on real-world data while being trained only on synthetic data.
- GAI-GS: A Wireless Channel Prediction Framework Injecting Ray-Object Interaction into 3DGS via Geometric Algebra Attention
-
GAI-GS treats 3D Gaussian Splatting (3DGS) as a wireless radiation field. It utilizes a Geometric Algebra (GA)-based attention tokenizer to implicitly model physical ray-object interactions—such as reflection, diffraction, and transmission—within the scene. These interaction features are then injected as residuals into Gaussian attributes via a dual-branch scene mapping network, achieving SOTA performance in MAE and SSIM across several real-world indoor RSSI and spatial spectrum datasets.
- GAP: Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation
-
GAP utilizes a pre-trained 3D geometric foundation model (π³) to extract 3D features, fuses 2D semantics and proprioception, and jointly predicts future action sequences and future 3D pointmaps via conditional diffusion, achieving SOTA in RoboTwin 2.0 and real-world bimanual experiments.
- ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion
-
ActionMesh adds a temporal axis to pre-trained 3D diffusion models through a minimal extension (temporal 3D diffusion), then uses a temporal 3D autoencoder to convert independent shape sequences into topology-consistent animated meshes. It generates production-grade animated 3D meshes from inputs such as video, text, or 3D meshes in only 2 minutes and achieves SOTA in both geometric accuracy and temporal consistency.
- Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning
-
BayesMM proposes a training-free dynamic Bayesian distribution learning framework that models text and geometric modalities as Gaussian distributions and automatically adjusts modality weights through Bayesian Model Averaging. It achieves robust test-time adaptation across multiple point cloud benchmarks with an average gain of over 4%.
- Adaptive 3D Perception for Small Aerial Targets Under Sparse Sampling via Reinforcement Learning
-
Addressing the issue where small aerial targets (birds, UAVs) under long-range LiDAR yield extremely sparse and jittery point clouds, A3PRL utilizes a lightweight 5D reinforcement learning policy. Based on unlabeled statistics such as sparsity, acceptance rates, and trajectory continuity, it jointly adjusts voxel resolution, detection thresholds, and association gates online. This transforms a "fixed-parameter perception pipeline" into a "closed-loop adaptive perception-control system," reducing 3D localization error by approximately 19% in MMAUD cross-scenario testing.
- Adaptive Spatial-Temporal Window: Unlocking the Potential of Event Cameras in Heterogeneous Velocity Scenarios
-
Addressing "Heterogeneous Velocity Scenarios" (HVS) containing both fast and slow objects, this paper proposes the ASTW event partitioning strategy: the pixel plane is divided into small patches, and an analytical formula for the optimal time window \(\Delta t = \gamma / D\) (where \(D\) is event density) is derived based on the Maximum Entropy principle. Implemented with \(O(N)\) vectorization, ASTW allows each spatial region to adaptively select windows, achieving up to +2.6 mAP in object detection and +2.2 SR in tracking.
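The window rule \(\Delta t = \gamma / D\) lends itself to a short single-pass sketch. This is only an illustration under stated assumptions: \(D\) is taken here to be events per second inside each patch, and `adaptive_windows`, `patch_size`, `duration`, and `eps` are hypothetical names that do not come from the paper.

```python
from collections import Counter


def adaptive_windows(events, patch_size, duration, gamma, eps=1e-9):
    """Return a per-patch time window dt = gamma / D.

    events     : iterable of (x, y, t) tuples with integer pixel coordinates
    patch_size : side length of the square patches, in pixels
    duration   : length of the observed interval, in seconds
    gamma      : constant from the formula (its value is an assumption here)
    D is assumed to be the event rate of a patch in events per second.
    """
    # One pass over the events: count how many fall into each patch.
    counts = Counter((x // patch_size, y // patch_size) for x, y, _t in events)
    windows = {}
    for patch, n in counts.items():
        density = n / duration
        windows[patch] = gamma / max(density, eps)  # eps guards against division by zero
    return windows


events = [(0, 0, 0.0), (1, 1, 0.1), (2, 2, 0.2), (3, 3, 0.3), (10, 10, 0.5)]
windows = adaptive_windows(events, patch_size=8, duration=1.0, gamma=2.0)
```

The busy patch `(0, 0)` receives a short window and the quiet patch `(1, 1)` a long one, so fast and slow regions are each binned at a suitable temporal scale.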
- AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM
-
Addressing the issue where image appearance drift caused by camera Auto-Exposure (AE) in real-world scenes destroys the photometric consistency of 3DGS, AERGS-SLAM introduces a Camera Exposure Network (CEN) that decouples the "rendered radiance map" from the "exposure process." Combined with learned illumination-robust feature localization and temporal-aware coarse-to-fine optimization, it produces the first decoupled 3DGS SLAM robust to exposure variations. It outperforms existing baselines in both localization accuracy and high-fidelity reconstruction, while rendering nearly 10x faster than HDR-GS.
- AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
-
AeroDGS is proposed as a physics-guided 4D Gaussian Splatting framework for monocular UAV videos. It reconstructs reliable static and dynamic geometry through a Monocular Geometry Lifting module and introduces differentiable physical priors—ground support, upright stability, and trajectory smoothness—to transform ambiguous image cues into physically consistent motion estimations, outperforming existing methods on both synthetic and real-world UAV scenes.
- AeroGS: Scale-Aware Gaussian Splatting for Pose-Free Dynamic UAV Scene Reconstruction
-
AeroGS utilizes "Scale-Aware Spatio-Temporal Anchors" (S2A-Anchors) to simultaneously estimate camera trajectories and reconstruct dynamic 4D scenes containing moving objects from pose-free monocular UAV videos. By relying on three decoupling mechanisms (ego-motion vs. object motion, appearance vs. deformation, and scale vs. complexity) to stabilize joint optimization, it achieves SOTA performance in both rendering PSNR and trajectory accuracy on VisDrone, UAVDT, and KITTI datasets.
- Affine Perspective-Three-Point Problem
-
This paper frames the classic P3P (Perspective-Three-Point) problem within weak-perspective and para-perspective affine camera models. It derives a closed-form minimal solver requiring only a bi-quadratic equation, followed by a lightweight iterative upgrade to "refine" the affine solution into an exact perspective solution. This two-step approach matches the accuracy of SOTA exact P3P solvers while being faster.
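For readers unfamiliar with the term, a bi-quadratic equation \(a x^4 + b x^2 + c = 0\) has a closed-form solution via the substitution \(u = x^2\). The sketch below shows that generic step only; the actual coefficients produced by the affine P3P solver are not given in this summary, and `solve_biquadratic` is a hypothetical helper.

```python
import math


def solve_biquadratic(a, b, c):
    """Real roots of a*x**4 + b*x**2 + c = 0, assuming a != 0.

    Substituting u = x**2 gives the quadratic a*u**2 + b*u + c = 0;
    every non-negative root u contributes x = +sqrt(u) and x = -sqrt(u).
    """
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []  # no real u, hence no real x
    sqrt_disc = math.sqrt(disc)
    roots = set()
    for u in ((-b + sqrt_disc) / (2.0 * a), (-b - sqrt_disc) / (2.0 * a)):
        if u >= 0:
            r = math.sqrt(u)
            roots.update((r, -r))
    return sorted(roots)


roots = solve_biquadratic(1.0, -5.0, 4.0)  # x^4 - 5x^2 + 4 = (x^2 - 1)(x^2 - 4)
```

Because only square roots are needed, such a solver avoids the general quartic or polynomial root finding that exact P3P formulations typically require.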
- AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis
-
AffordGrasp presents a diffusion-based cross-modal framework that synthesizes physically feasible and semantically consistent human hand grasp poses from text instructions and object point clouds. By leveraging affordance-guided latent space diffusion and a Distribution Adjustment Module (DAM), it significantly outperforms existing methods across four benchmarks.
- AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers
-
AffordMatcher proposes a method for locating affordance regions in 3D scenes from visual signifiers (human interactions in RGB images). By utilizing the large-scale AffordBridge dataset and a Match-to-Match attention mechanism based on a dissimilarity matrix, it achieves a 53.4 mAP in zero-shot affordance segmentation, surpassing the second-best method by 7.8 points.
- Affostruction: 3D Affordance Grounding with Generative Reconstruction
-
This paper proposes Affostruction, which completes object geometry (including unobserved regions) via generative reconstruction with sparse voxel fusion, and models the multimodal distribution of affordance using Flow Matching. It achieves functional region localization on the complete 3D shape, with reconstruction IoU improved by 54.8% and affordance aIoU by 40.4%.
- AIMDepth: Asymmetric Image-Event Mamba for Monocular Depth Estimation
-
AIMDepth introduces Mamba (State Space Models) to image-event monocular depth estimation for the first time. It employs a two-level modal alignment before fusion: bidirectional prior injection in the frequency domain (SCPG) for input-level alignment, and an asymmetric feature selection encoder (AME) for feature-level alignment. These are combined with a modal interaction local refinement module (ModiLocal), achieving SOTA performance on MVSEC/DENSE with only 8.69 GFLOPs.
- Aligning Text, Images and 3D Structure Token-by-Token
-
This paper proposes Kyvo—a decoder-only autoregressive LLM (based on Llama-3.2-1B) that treats "structured 3D scenes" as a third modality within the same token space as text and images. Through a systematic "cookbook," it provides key recipes for 3D shape tokenization, coordinate encoding, and sequence design, enabling a single model to perform four types of 3D tasks: rendering, single-image 3D reconstruction/recognition, instruction editing, and QA.
- AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
-
AlignPose aggregates single-view object pose candidates from multiple calibrated RGB views into a single candidate using 3D NMS. It then employs a multi-view feature-metric refinement that simultaneously minimizes the discrepancy between online rendered features and observed image features across all views to solve for a globally consistent world-coordinate pose. The entire process requires no object-specific training or symmetry annotations and outperforms existing methods by over 14% on industrial datasets containing textureless, reflective, and transparent objects.
- AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend
-
AMB3R attaches a "sparse but compact" voxel backend on top of a frozen VGGT frontend for explicit 3D geometric reasoning, alongside a lightweight scale head for metric scale recovery. Trained with only ~80 H100 GPU hours, it achieves SOTA across 7 tasks and 13 datasets. Its two training-free pipelines, AMB3R-VO and AMB3R-SfM, enable feed-forward models to outperform traditional optimization-based systems in VO/SLAM and SfM for the first time.
- Anatomical Domain Shifts: Test-time Heterogeneous Adaptation for 3D Human Pose Prediction
-
Addressing Continuous Test-time Adaptation (CTTA) for 3D Human Pose Prediction (HPP), this paper identifies the overlooked fact that "domain shifts are concentrated in specific body parts rather than occurring uniformly across the whole body." It proposes TT-HA: decomposing model parameters into five anatomical subsets (left/right arms, left/right legs, torso), using Instance Normalization (IN) statistics combined with Earth Mover's Distance (EMD) to measure online domain changes for each part. Based on these measurements, self-supervised fine-tuning is applied to parts with minor shifts, while parameters for parts experiencing abrupt changes are rolled back to the source model. This achieves a 4.7% reduction in overall MPJPE and a 9.2% reduction in limb errors.
- AnchorSplat: Feed-Forward 3D Gaussian Splatting with 3D Geometric Priors
-
AnchorSplat proposes an anchor-aligned feed-forward 3DGS framework that predicts Gaussians directly in 3D space using 3D geometric priors (sparse point clouds) as anchors. It achieves SOTA performance on ScanNet++ v2 (PSNR 21.48) with approximately 20x fewer Gaussians and half the reconstruction time, while providing superior depth estimation accuracy.
- AniMimic: Imitating 3D Animation from Video Priors
-
AniMimic utilizes monocular animations generated by video diffusion models as motion supervision. It automatically rigs a static 3D mesh and optimizes joint parameters via differentiable rendering to "lift" 2D motion into 3D. Subsequently, a differentiable FEM soft-body simulation is employed to incorporate inertia and elasticity, producing editable, physically plausible 4D sequences ready for animation pipelines.
- AnthroTAP: Learning Point Tracking with Real-World Motion
-
AnthroTAP proposes an automated pipeline to generate large-scale pseudo-labeled point tracking data from real-world human motion videos via SMPL fitting and optical flow filtering. Using only 1.4K videos and 4 GPUs for one day of training, it achieves SOTA performance on the TAP-Vid benchmark, surpassing BootsTAPIR which utilizes 15M videos.
- Any Resolution Any Geometry: From Multi-View To Multi-Patch
-
A single ultra-high-definition image is decomposed into patches and treated as "virtual multi-views" within a VGGT-style framework for joint processing. Combined with cross-patch attention for global consistency reasoning, the model outputs sharp and globally coherent high-resolution depth maps and surface normals in a single forward pass, reducing AbsRel on UnrealStereo4K from 0.0582 to 0.0291.
- AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion
-
AnyLift introduces a two-stage framework—synthesizing multi-view 2D motion data followed by training a camera-conditioned multi-view 2D diffusion model—to lift 2D keypoints from monocular dynamic-camera internet videos into 3D human motion and Human-Object Interaction (HOI) in world coordinates. Without any 3D supervision, it reconstructs rare actions (e.g., gymnastics, martial arts) seldom found in MoCap datasets.
- AnyPcc: Compressing Any Point Cloud with a Single Universal Model
-
AnyPcc achieves SOTA point cloud geometry compression across 15 diverse datasets with a single model. By employing a Universal Context Model (integrating spatial and channel-wise dual-granularity priors) and an Instance-Adaptive Fine-Tuning (IAFT) strategy, it achieves a ~12% bitrate gain over G-PCC v23.
- ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild
-
Addressing the gap in detecting 3D reflective symmetry planes in real-world "in-the-wild" scenes, this paper first automatically labels a large-scale landmark symmetry dataset, ArchSym, from SfM reconstructions via cross-view reflection matching. It then trains a single-view detector parameterizing symmetry planes as "signed distance maps relative to predicted geometry," accurately localizing metric-scale symmetry planes from a single RGB image and significantly outperforming existing SOTA.
- Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
-
This paper systematically introduces reinforcement learning (RL) into text-to-3D autoregressive generation for the first time. By decomposing the problem into four dimensions—reward design, RL algorithms, evaluation benchmarks, and RL paradigms—it proposes a hierarchical coarse-to-fine framework, Hi-GRPO. The resulting RL-enhanced model, AR3D-R1, outperforms Trellis on Toys4K and the new MME-3DR benchmark.
- ARES: Unifying Asymmetric RGB-Event Stereo for Probabilistic Scene Flow Estimation
-
Utilizing an asymmetric stereo setup consisting of an "Event camera + RGB camera," the proposed method first integrates temporal clues from asynchronous events with spatial structures from RGB into a unified representation via Multimodal Contextual Attention for simultaneous optical flow and disparity estimation. Then, Temporal Disparity Posterior Fusion is employed to probabilistically model the evolution of disparity over time, recovering geometrically consistent and temporally stable dense scene flow. This approach achieves SOTA scene flow accuracy under the RGB-event stereo configuration.
- ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
-
This work introduces the "single-step generation" MeanFlow paradigm to the human motion domain for the first time. By utilizing an autoregressive structure consisting of a "causal context encoder + lightweight MLP velocity predictor" combined with Bootstrapped Causal Encoding (BSCE) to suppress error accumulation, online 3D human reaction generation is achieved within a single inference step. The method reduces FID by approximately 30% compared to existing online methods while maintaining the fastest speed.
- ART: Articulated Reconstruction Transformer
-
ART reformulates "articulated object reconstruction" as a part-level feed-forward prediction problem. Using a set of learnable part slots, it decodes geometry, texture, and explicit motion parameters (axis/pivot/motion type) for each rigid part from sparse multi-view, multi-state RGB images in a single pass. This category-agnostic approach eliminates per-object optimization and significantly outperforms both feed-forward and optimization-based baselines in part-level and global geometric metrics.
- ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions
-
ArtHOI introduces the first complete pipeline for reconstructing 4D interactions between hands and articulated objects (e.g., scissors, eyeglasses, laptops) from monocular RGB videos. By utilizing Adaptive Sampling Refinement (ASR) to optimize object metric scale and pose, alongside an MLLM-guided hand-object alignment method, it outperforms the RSRD baseline—which requires pre-scanned object geometry—across multiple datasets.
- Artiverse: A Diverse and Physically Grounded Dataset for Articulated Objects
-
Artiverse employs a semi-automatic annotation pipeline—integrating few-shot segmentation, geometric reasoning, and multi-stage human verification—to filter 5,402 high-quality articulated objects (88 categories, 24,607 parts) from 10 static 3D repositories. It provides part-level annotations for functional semantics, articulated joints (including multi-DoF), and physical properties (material, mass, metric scale), reducing manual annotation time by over 30% while demonstrating significant value in part motion analysis, articulated object generation, and physical simulation tasks.
- ArtLLM: Generating Articulated Assets via 3D LLM
-
ArtLLM models articulated object generation as a language generation problem. It uses a 3D multi-modal LLM to autoregressively predict part layouts and kinematic joint parameters (quantized as tokens) from point clouds. Combined with XPart for high-fidelity part geometry synthesis, it significantly outperforms existing methods on the PartNet-Mobility dataset (mIoU 0.69, inference in only 19 seconds).
- AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
-
To address the limitations of existing assembly datasets focusing only on "final poses and IKEA furniture," this paper presents AssemblyBench, a synthetic dataset featuring 2,789 complex industrial objects with step-by-step multimodal instructions and 6-DoF assembly trajectories. It introduces AssemblyDyno, a Transformer-based model that jointly predicts the assembly sequence and part trajectories in a single forward pass. It is the first to evaluate "physical feasibility" by executing predicted trajectories in a physics simulator—under identical settings, AssemblyDyno achieves a ~33% assembly success rate in the simulator, whereas the previous SOTA achieves only ~3%.
- AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization
-
AsymLoc proposes "Asymmetric Visual Localization"—using a large Teacher to process the map database offline and an extremely small Student for online query images. By employing a geometric matching loss and joint detector-descriptor distillation, Student features are aligned with Teacher features, enabling direct parameter-free mutual nearest neighbor matching while retaining ~95% of the Teacher’s localization accuracy despite a ten-fold reduction in model size.
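Mutual nearest neighbor matching, the parameter-free step mentioned above, can be sketched as follows. This is a generic illustration using dot-product similarity on toy descriptors; the descriptor values and the function name are assumptions, not details from the paper.

```python
def mutual_nearest_neighbors(desc_a, desc_b):
    """Return index pairs (i, j) where a[i] and b[j] are each other's best match.

    Similarity is the dot product, which equals cosine similarity
    when descriptors are L2-normalized (assumed here, not checked).
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    # Best match in B for every descriptor in A, and vice versa.
    nn_ab = [max(range(len(desc_b)), key=lambda j: dot(a, desc_b[j])) for a in desc_a]
    nn_ba = [max(range(len(desc_a)), key=lambda i: dot(desc_a[i], b)) for b in desc_b]
    # Keep only pairs that agree in both directions.
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]


student_desc = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # stand-in for query (Student) features
teacher_desc = [[0.9, 0.1], [0.1, 0.9]]              # stand-in for map (Teacher) features
matches = mutual_nearest_neighbors(student_desc, teacher_desc)
```

The third query descriptor prefers map descriptor 1, but that descriptor prefers query 1, so the pair is rejected. This kind of matching only works across two different networks if their feature spaces are aligned, which is what the distillation described above is meant to ensure.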
- AutoRegressive Generation with B-rep Holistic Token Sequence Representation
-
BrepARG encodes the geometry and topology of CAD Boundary Representation (B-rep) into a unified token sequence for the first time. This enables next-token autoregressive generation using a decoder-only Transformer. It achieves SOTA results on DeepCAD/ABC, with training completed in 1.2 days and inference for a single model taking approximately 1.5 seconds on a single 4090.
- AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
-
This paper proposes AVA-Bench, the first systematic evaluation benchmark that decouples Vision Foundation Model (VFM) capabilities into 14 Atomic Visual Abilities (AVA). Through train-test distribution alignment and isolated single-capability testing, it precisely identifies VFM strengths and weaknesses, finding that 0.5B small models maintain ranking consistency comparable to 7B models.
- AvatarPointillist: AutoRegressive 4D Gaussian Avatarization
-
AvatarPointillist proposes an autoregressive (AR) generation framework to construct 4D Gaussian avatars. It utilizes a decoder-only Transformer to generate 3DGS point clouds point-by-point (including binding information) and employs a Gaussian Decoder to predict rendering attributes. This approach breaks the limitations of fixed template topologies and enables adaptive point density adjustments, outperforming baselines like LAM and GAGAvatar on the NeRSemble dataset.
- AVGGT: Rethinking Global Attention for Accelerating VGGT
-
Through a layer-wise dissection of the actual role of global attention in VGGT/π³ (early layers being ineffective, middle layers performing cross-view alignment, and final layers doing fine-tuning), a training-free two-step acceleration scheme is proposed. It replaces early global layers with intra-frame attention and applies grid subsampling to K/V for the remaining global layers, achieving an 8–10× inference speedup for 800-frame inputs with almost no loss in accuracy.
- B³-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
-
B³-Seg reformulates the task of "segmenting target objects on an off-the-shelf 3DGS asset" into a sequence of Beta-Bernoulli Bayesian updates. By utilizing an analytic form of Expected Information Gain (EIG) to actively select the most informative next camera view, the method achieves camera-free, training-free, open-vocabulary results in seconds. Its accuracy approaches that of supervised methods that require tens of minutes.
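The Beta-Bernoulli update at the core of this formulation is the textbook conjugate update, sketched below for a single Gaussian. How the paper weights observations and computes the analytic EIG is not stated in this summary, so the unit increments and the function names here are assumptions.

```python
def beta_bernoulli_update(alpha, beta, observations):
    """Update a Beta(alpha, beta) belief with binary observations.

    Each observation is True if the Gaussian fell inside the object mask
    in a rendered view, False otherwise (unit increments are an assumption;
    a real system may use soft, visibility-weighted counts).
    """
    for inside in observations:
        if inside:
            alpha += 1.0
        else:
            beta += 1.0
    return alpha, beta


def posterior_mean(alpha, beta):
    """Expected probability that the Gaussian belongs to the object."""
    return alpha / (alpha + beta)


# Start from a uniform prior Beta(1, 1), then observe four views.
alpha, beta = beta_bernoulli_update(1.0, 1.0, [True, True, False, True])
prob_foreground = posterior_mean(alpha, beta)
```

Because the posterior stays in closed form after every view, quantities such as the expected information gain of a candidate view can be evaluated analytically rather than by sampling, which is consistent with the seconds-level runtime reported above.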
- BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction
-
Addressing the issue of "hidden Gaussian spikes" appearing after object extraction in 3DGS scenes, BEA-GS introduces two complementary losses during 2DGS optimization: a 2D boundary loss for visible regions (propagating gradients via rasterization to push boundary-crossing Gaussians back) and a 3D occupancy loss for non-visible regions (bypassing rasterization to penalize "unsupported" Gaussian samples based on voxel priors). It achieves the cleanest object boundaries to date across 6 metrics on 4 datasets.
- GeoCodeBench: Benchmarking PhD-Level Coding in 3D Geometric Computer Vision
-
GeoCodeBench is the first PhD-level code generation benchmark for 3D geometric computer vision. It contains 100 function completion tasks curated from 2025 top-tier papers and codebases, accompanied by automated and diverse unit tests. The strongest model, GPT-5, achieves only a 36.6% pass rate, revealing a significant gap in LLMs' ability to implement scientific-grade 3D code.
- Best Segmentation Buddies for Image-Shape Correspondence
-
This paper proposes Best Segmentation Buddies (BSB), which relaxes the hard "pixel-vertex mutual nearest neighbor" constraint—almost impossible to satisfy between images and 3D meshes—into a "segment-level mutual nearest neighbor." This allows matching a clicked semantic part from an in-the-wild image to its corresponding part on an untextured 3D mesh in an unannotated, zero-training manner.
- Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D
-
A new paradigm called "Artistic Disparity Synthesis" (Art3D) is proposed, shifting the goal of 2D-to-3D conversion from geometric accuracy to artistic expression. Through a dual-path architecture that decouples global depth style and local artistic effects, the model learns directorial intent from professional 3D movie data.
- Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo
-
Addressing asymmetric stereo matching where "one eye is an event camera and the other is a standard RGB camera," this paper proposes Bi-CMPStereo. It utilizes a cross-domain adapter and self-reconstruction constraints to align both modalities into a "canonical space of a single target domain." By alternating between event and image as the target domain in a bidirectional fashion and fusing the results, the method significantly outperforms previous SOTA models (e.g., ZEST) in accuracy and generalization on DSEC, MVSEC, and M3ED datasets.
- Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers
-
For feed-forward multi-view geometry Transformers such as VGGT / π³ / MapAnything, the authors observe that global attention matrices are highly sparse: probability mass concentrates on a few patch pairs corresponding to cross-view geometric matches. They therefore replace dense global attention with a training-free block-sparse attention, achieving a 3× inference speedup (more on long sequences) with negligible loss in reconstruction or pose accuracy.
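A rough NumPy sketch of the block-sparse idea follows. It assumes block importance is estimated from mean-pooled queries and keys and that each query block keeps its top `keep` key blocks; both choices, and the function name, are illustrative assumptions rather than the paper's selection rule. The sketch masks a dense score matrix for clarity, so it shows the sparsity pattern but not the speedup, which requires a kernel that skips masked blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, k, v, block=4, keep=2):
    """q, k, v: (n, d) arrays with n divisible by `block`."""
    n, d = q.shape
    assert n % block == 0, "sequence length must be a multiple of the block size"
    nb = n // block
    # Cheap block-level scores from mean-pooled queries and keys.
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)
    block_scores = qb @ kb.T                      # (nb, nb)
    keep = max(1, min(keep, nb))
    top = np.argsort(-block_scores, axis=1)[:, :keep]
    block_mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(block_mask, top, True, axis=1)
    # Expand the block mask to token resolution.
    mask = np.repeat(np.repeat(block_mask, block, axis=0), block, axis=1)
    scores = (q @ k.T) / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)      # dropped blocks get zero weight
    return softmax(scores) @ v
```

Setting `keep` to the number of blocks recovers dense attention exactly, which gives a convenient correctness check.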
- Breaking the 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch for Category 6D Pose Estimation and Robotic Grasping
-
Given only a category name, a fully automated "text → image → 3D" pipeline generates canonically aligned textured meshes within 3 minutes, along with 6D pose and grasping datasets. The core mechanism utilizes depth-conditioned generation to boost pose consistency from 57% to 96%, allowing the 153K generated meshes to be used directly for zero-shot sim2real pose estimation and real-world robotic grasping.
- BRepGaussian: CAD Reconstruction from Multi-View Images with Gaussian Splatting
-
BRepGaussian achieves for the first time the direct reconstruction of complete B-rep CAD models from multi-view images. It learns edge and patch features through two-stage 2D Gaussian Splatting, followed by parametric fitting to generate watertight boundary representations without requiring point cloud supervision.
- BrickNet: Graph-Backed Generative Brick Assembly
-
This paper treats LEGO brick assembly sequences as "programs" for autoregressive generation via LLMs. The key innovation is moving away from direct regression of 6-DoF coordinates for each brick, instead using a graph-backed parametrization (spanning trees) where "connectivity" is treated as a first-class citizen. Combined with the newly constructed BrickNet dataset—comprising 320,000 large-scale human-designed LDraw samples—the model improves the number of valid connected steps from < 50 to 94+.
- Bringing Your Portrait to 3D Presence
-
Utilizing a Dual-UV representation that projects image features into a canonical UV space, paired with a factorized "3D rendering + 2D generation" synthetic data manifold and a robust proxy mesh tracker, this work enables the reconstruction of animatable 3D Gaussian avatars from a single portrait (head, half-body, or full-body). The model generalizes to real-world photos despite being trained exclusively on synthetic data.
- BuildingGPT: Auto-Regressive Building Wireframe Reconstruction Model with Reinforcement Learning
-
BuildingGPT reformulates "building wireframe reconstruction from point clouds" as a sequence generation problem: it first encodes wireframes into discrete tokens using a hierarchical tokenization scheme in the order of "foundation → wall → roof," then generates tokens sequentially using a point-cloud-conditioned auto-regressive Transformer. Finally, it employs DPO post-training based on a custom Preference Score Function (PSF) to align with human preferences for geometric accuracy and topological correctness, comprehensively surpassing detection-based and diffusion-based SOTA on the large-scale MunichWF dataset.
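For reference, the standard DPO objective on a single preference pair can be written in a few lines. This sketch assumes sequence-level log-probabilities are already available for the preferred and rejected wireframes; how the Preference Score Function builds those pairs is not described here, and the function name and `beta` default are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequences."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # sample than the frozen reference model does.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference the loss equals ln 2, and it falls as the policy assigns relatively more probability to the preferred sequence.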
- C-GenReg: Training-Free 3D Point Cloud Registration by Multi-View-Consistent Geometry-to-Image Generation with Probabilistic Modalities Fusion
-
C-GenReg utilizes a pretrained World Foundation Model (Cosmos-Transfer) to render the geometry of input point clouds into "multi-view consistent RGB views." It then extracts correspondences using a Vision Foundation Model (VFM) pretrained for dense matching (MASt3R) and merges the correspondence posteriors from the image and original geometric branches via a Noisy-AND probabilistic fusion. This zero-training, plug-and-play framework is the first generative registration method to successfully operate on real-world outdoor LiDAR.
- CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation
-
CaliTex diagnoses the root cause of "cross-view texture inconsistency" as attention ambiguity caused by undiscriminated full attention in multi-view diffusion. It proposes two types of geometry-calibrated attention: Part-Aligned Attention (calculating cross-view attention grouped by 3D semantic parts) and Condition-Routed Attention (routing reference appearance through geometric conditions before injecting into noise). Implemented on a two-stage DiT, it transforms geometric consistency into an inherent behavior of the network, with texture fidelity and cross-view consistency significantly outperforming open-source and commercial baselines.
- Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?
-
TABLeT is proposed, utilizing a pre-trained 2D natural image autoencoder (DCAE) to compress 3D fMRI volumes into just 27 continuous tokens. Paired with a simple Transformer encoder, it achieves unprecedented long-range temporal modeling (256 frames), surpassing SOTA voxel-based methods on tasks across UKB, HCP, and ADHD-200 with significantly improved computational efficiency.
- CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction
-
CARI4D is the first category-agnostic method to reconstruct metric-scale 4D human-object interactions from monocular RGB video. Its pipeline comprises object shape reconstruction, pose tracking, hand contact reasoning, and physically constrained optimization, and it generalizes zero-shot to unseen categories.
- CaT-GS: Efficient 3DGS Rendering for Large-Scale Scenes with Inter-frame Caching and Tile Scheduling
-
CaT-GS transforms the 3DGS rendering pipeline from "per-frame computation" to "frame-group reuse." By employing speculative multi-frame pre-processing and inter-frame caching, it eliminates redundant view-frustum culling, sorting, and tile intersection across consecutive frames. Combined with a load-aware CUDA kernel split for heavy tiles to balance GPU utilization, it achieves up to a 10× speedup over vanilla 3DGS and is up to 70% faster than previous SOTA methods in large-scale scenes.
- Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation
-
This work proposes the Catalyst4D framework, which propagates mature 3D static editing results to 4D dynamic Gaussian scenes using two components: Anchor Motion Guidance (AMG), which establishes region-level correspondences via optimal transport, and Color Uncertainty-guided Appearance Refinement (CUAR), which automatically identifies and repairs occlusion artifacts. It consistently outperforms existing methods in CLIP semantic similarity.
- CGHair: Compact Gaussian Hair Reconstruction with Card Clustering
-
CGHair uses hair-card-guided hierarchical clustering and a shared Gaussian appearance codebook to achieve over 200× appearance parameter compression and 4× faster hair reconstruction while maintaining comparable visual quality.
- Changes in Real Time: Online Scene Change Detection with Multi-View Fusion
-
This work proposes the first scene change detection (SCD) method that is simultaneously online, pose-agnostic, label-free, and multi-view consistent. By integrating pixel-level and feature-level change cues into a 3DGS change representation via a self-supervised fusion loss, it surpasses the detection accuracy of all existing offline methods while running in real time at over 10 FPS.
- Choreographing a World of Dynamic Objects
-
CHORD treats static 3D objects as "actors" and a video generation model as a "choreographer." Through a distillation objective customized for rectified-flow video models and a spatio-temporal hierarchical 4D motion representation, it generates physically plausible 4D animations of multi-object interactions using only 3D shapes and a text prompt. It further enables zero-shot robot manipulation.
- Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
-
Chorus utilizes three types of 2D foundation models—language-aligned (SigLIP2), general vision (DINOv3), and object-aware (PE-Spatial)—as teachers. By employing a "shared 3DGS encoder + independent projectors for each teacher," it distills a versatile feed-forward 3D Gaussian scene encoder in a single pass. It achieves SOTA performance across a wide range of tasks including semantic/instance segmentation, open-vocabulary tasks, and VQA, while requiring \(8.32\times\) to \(39.9\times\) fewer training scenes compared to point cloud pre-training baselines.
- Circular-DPO: Aligning Multi-Stage 3D Generative Models via Preference Feedback Loop
-
Addressing multi-stage 3D generative models like Trellis—which "generate sparse structures first and then fill in local details"—this work utilizes a "data loop" to route preference signals generated after end-stage DPO alignment back to the initial stage, bypassing non-differentiable discretization. Combined with dual noise-filtering weights, it jointly aligns geometry and texture, achieving a 35.15% improvement in ImageReward and a 21.44% improvement in Reward3D over the Trellis baseline.
- Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling
-
To address the unstable optimization caused by the strong coupling of "geometry" and "articulated motion" in monocular video, this paper proposes the Clay-to-Stone dual-phase 3DGS framework. It first utilizes a "soft clay" phase (CLAY) for fine-grained, semantic-aware free deformation to explore structure and motion, followed by a "stone" phase (STONE) that imposes rigid constraints and explicitly estimates axes, pivots, and joint angles. This approach achieves SOTA geometric reconstruction and realistic rendering on the ARCTIC articulated object dataset.
- ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction
-
ClipGStream partitions dynamic videos into several clips and employs a "Clip-Stream" hybrid paradigm, where a "Reference Clip" establishes the base and "Source Clips" perform incremental training on that base. This approach preserves the intra-clip temporal stability of Clip-based methods while inheriting the scalability of Frame-Stream methods, achieving flicker-free, low-memory, SOTA dynamic Gaussian reconstruction on 1400-frame sequences with significant motion.
- CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
-
CLIPoint3D is the first CLIP-based few-shot unsupervised 3D point cloud domain adaptation framework. By utilizing knowledge-driven prompt tuning, parameter-efficient fine-tuning, entropy-guided view selection, and an uncertainty-aware alignment loss, it achieves consistent accuracy improvements of 3-16% on PointDA-10 and GraspNetPC-10 with only ~11M trainable parameters.
- Color-Encoded Illumination for High-Speed Volumetric Scene Reconstruction
-
A set of high-frequency switching color LED strobes is used to illuminate the scene, encoding "timestamps" of high-speed motion into the color and intensity of images captured by multiple standard 60 FPS cameras. A modified dynamic Gaussian Splatting (Gaussian-Flow) is then employed to decode 600 FPS volumetric dynamic scenes from these color-mixed frames, achieving high-speed 3D reconstruction without specialized camera hardware for the first time.
- CoLoR: The Devil is in Scene Coordinate Regression for Large-Scale Visual Localization
-
CoLoR diagnoses the primary "culprits" behind the failure of large-scale Scene Coordinate Regression (SCR) as unsupervised single-view points and inconsistency between global and local features. By adopting an "explicit multi-view/single-view partitioning + two-stage strong supervision (multi-view reprojection + pseudo-depth bootstrapping)" strategy, it provides supervision for every point in the scene. Additionally, it utilizes MoCo-style contrastive learning to retrain local features into pixel-level retrieval features. CoLoR pushes SCR to SOTA performance on large-scale datasets like Aachen and Department Store, significantly narrowing the accuracy gap with Feature Matching (FM) methods while maintaining a map size of only a few dozen MBs.
- CompetitorFormer: Mitigating Query Conflicts for 3D Instance Segmentation via Competitive Strategy
-
To address the persistent issue of "multiple queries competing for the same object leading to mask fragmentation" in Transformer-based 3D instance segmentation, this paper introduces a Query Competition Layer. This layer explicitly calculates the "competitive landscape" (identifying the strongest spatial overlap and dominant/subordinate roles) for each query before each decoding stage. Combined with modified self-attention and cross-attention to enable "winner-take-all" dynamics, the method achieves faster convergence and SOTA performance across four benchmarks: ScanNetV2, ScanNet200, S3DIS, and ScanNet++V2.
- Complet4R: Geometric Complete 4D Reconstruction
-
Complet4R redefines "dynamic scene 4D reconstruction" as "aggregating observed geometry from all frames in a video into a complete geometry for each target timestamp (including parts occluded in that frame but visible in others)." This is implemented end-to-end using a decoder-only transformer with switchable target-timestamp aggregation tokens, achieving SOTA on a self-built 4D complete reconstruction benchmark and in 3D point tracking.
- ComPose: A Unified Completion-Pose Framework for Robust Category-Level Object Pose Estimation
-
ComPose incorporates "point cloud completion" as a task-driven internal module into a category-level 6D pose estimation network. By utilizing keypoint-based progressive completion, it restores the complete object geometry directly in the observation space. Combined with geometric relation encoding and a geometric relation consistency loss, it improves the \(10^\circ\,2\,\text{cm}\) accuracy on REAL275 (depth-only) from 68.5% to 77.8% without relying on category shape priors, while achieving a faster inference speed (38.4 FPS).
- ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
-
ConceptPose completely transforms 6D object pose estimation into a semantic matching task: an LLM automatically generates a series of textual "concepts" for an object category, then the explainability heatmaps (GradCAM) of a VLM are used to locate each concept across two images and back-project them into 3D. This yields a "concept vector" for each point. Finally, cross-view matching of these concept vectors paired with RANSAC directly calculates the relative pose. This process requires no training and no CAD models, yet improves the average ADD(-S) of the strongest baseline by 62.8% across four real-world RGB-D benchmarks.
- Confidence-Guided Multi-Scale Aggregation for Sparse-View High-Resolution 3D Gaussian Splatting
-
This paper identifies a resolution trade-off in sparse-view 3DGS: low resolution yields stable structure, while high resolution adds detail but also noise. It proposes CAGS, which uses a low-resolution Gaussian field as an anchor, re-weights the opacity of high-resolution Gaussians via a cross-scale confidence chain, and incorporates multi-scale pseudo-view regularization. This enables high-resolution reconstruction under extremely sparse conditions (e.g., 3 views), achieving a 2.7 dB PSNR improvement over NexusGS on original-resolution LLFF.
- Consistent Instance Field for Dynamic Scene Understanding
-
This work models dynamic scenes as a continuous probabilistic "instance field," where each spatio-temporal point carries both an "occupancy probability" and a "conditional identity distribution." By approximating this field using deformable 3D Gaussians with instance semantics, the method decouples object identity from multi-view visibility. It significantly outperforms previous SOTA in novel-view panoptic segmentation and open-vocabulary 4D querying (HyperNeRF mIoU +11.4, Neu3D +5.8).
- ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
-
ConsisVLA-4D utilizes three modules (CV-Aligner, CO-Fuser, CS-Thinker) to compress multi-view 2D observations into approximately 1/8 of the original tokens. It ensures "cross-view semantic consistency" and "cross-object geometric consistency" during the perception phase, and extends this to "cross-scene spatiotemporal consistency" during the reasoning phase. It improves success rates by 21.6% / 41.5% and accelerates inference by 2.3× / 2.4× on LIBERO and real-world robots compared to OpenVLA.
- Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features
-
To address the inefficiency where Implicit Neural Representations (INRs) use fixed Fourier bases and force the MLP to "synthesize" target frequencies, this paper proposes CAFE. By passing Fourier features through multiple parallel linear layers followed by Hadamard products for frequency multiplication, the representable frequency set is expanded exponentially from \(M\) fixed bases to \(O(MN3^{N-1})\), using learnable weights to select task-relevant frequencies. Supplemented by Chebyshev features for low-frequency stability (CAFE+), the method consistently outperforms baselines like SIREN, FINER, and SL2A on image fitting, 3D shapes, and NeRF (with image fitting PSNR improving by up to ~5 dB).
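The frequency-multiplication effect of a Hadamard product can be checked numerically. The sketch below relies only on the product-to-sum identity, cos(a)cos(b) = ½[cos(a−b) + cos(a+b)], so multiplying two sinusoidal features yields the sum and difference frequencies. The frequencies 5 and 12 and the helper name are arbitrary choices for illustration, not values from the paper, and the learnable linear layers are omitted.

```python
import numpy as np

def dominant_frequencies(signal, threshold=0.1):
    """Integer frequency bins whose normalized FFT magnitude exceeds `threshold`."""
    spectrum = np.abs(np.fft.rfft(signal)) / len(signal)
    return np.flatnonzero(spectrum > threshold).tolist()

n = 256
x = np.arange(n) / n                       # one period of the unit interval
feat_a = np.cos(2 * np.pi * 5 * x)         # fixed Fourier basis at frequency 5
feat_b = np.cos(2 * np.pi * 12 * x)        # fixed Fourier basis at frequency 12

product = feat_a * feat_b                  # element-wise (Hadamard) product

print(dominant_frequencies(feat_a))        # [5]
print(dominant_frequencies(product))       # [7, 17]
```

Neither 7 nor 17 is among the input bases, which is the sense in which repeated products enlarge the representable frequency set without asking the MLP to synthesize those frequencies.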
- Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
-
Context-Nav promotes the contextual information in long text descriptions from a posterior verification signal to an exploration prior. It guides frontier selection with a context-driven value map and performs viewpoint-aware 3D spatial relationship verification at candidate targets, achieving SOTA on InstanceNav and CoIN-Bench without any training.
- Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation
-
For 6-DoF object pose estimation from a single RGB image, this paper proposes Cov2Pose: using spatial covariance pooling to encode backbone features into a Symmetric Positive Definite (SPD) matrix to preserve second-order statistics, which is then compressed into a compact SPD code via manifold-aware BiMap+ReEig layers. Finally, a differentiable Cholesky decomposition maps the SPD matrix one-to-one into continuous 6D rotation + translation for direct end-to-end pose regression, achieving SOTA among direct regression methods on LM/LM-O/YCB-V.
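The first and last steps of this pipeline, covariance pooling and Cholesky factorization, can be sketched with NumPy. The feature shape, the `eps` regularizer, and the function name are assumptions for illustration; the BiMap+ReEig layers and the mapping from the Cholesky factor to rotation and translation are not reproduced here.

```python
import numpy as np

def covariance_pooling(features, eps=1e-5):
    """Pool a (C, H, W) feature map into a (C, C) SPD covariance matrix."""
    c = features.shape[0]
    x = features.reshape(c, -1)                 # C channels x (H*W) locations
    x = x - x.mean(axis=1, keepdims=True)       # center each channel
    cov = x @ x.T / x.shape[1]                  # second-order statistics
    return cov + eps * np.eye(c)                # small ridge keeps it strictly PD

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 6, 6))               # stand-in for backbone features
spd = covariance_pooling(feat)

# A strictly positive definite matrix has a unique lower-triangular Cholesky
# factor with positive diagonal, which is what makes the mapping one-to-one.
chol = np.linalg.cholesky(spd)
```

The uniqueness of the factor is the property that allows an SPD code to be decoded deterministically; a differentiable version would use an autodiff framework's Cholesky routine instead of NumPy's.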
- Coverage Optimization for Camera View Selection
-
Starting from Fisher Information Gain (FIG), this paper performs a series of analytical approximations to prove that "selecting the most informative next view" is mathematically equivalent to "selecting a view observing geometry worst covered by existing cameras." This yields CONVERGE, a lightweight, visualizable metric requiring no custom CUDA kernels. On 15 real scenes, it consistently outperforms FisherRF and random baselines in reconstruction quality, with a single scan being ~7x faster than FisherRF.
- CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion
-
CraftMesh decomposes high-fidelity mesh editing into a three-stage pipeline: "2D image editing → image-to-mesh → seamless fusion." It utilizes Poisson normal fusion (geometry) and Poisson texture coordination (color) within the SDF domain to seamlessly integrate generated editing regions into the original mesh, significantly outperforming SDS-based and multi-view diffusion-based baselines in complex insertion, deletion, and local editing tasks.
- Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment
-
The authors propose GSA (Gaussian Splatting Alignment), the first method to achieve cross-instance category-level 3DGS registration. By combining geometry-aware feature-guided coarse alignment (extended ICP for Sim(3) similarity transformation) and multi-view feature consistency fine alignment, the method significantly outperforms existing approaches in both same-object and cross-object scenarios.
- Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images
-
To address the low coverage and difficulty of large-scale ground image collection in outdoor scenes, this paper proposes Cross-View Splatter: a feed-forward network that fuses GPS-tagged ground photos with orthorectified satellite images from public map services into a unified 3D coordinate system. By predicting pixel-aligned Gaussians for both ground (perspective) and satellite (orthographic) views, it significantly enhances scene coverage and extrapolation capabilities under sparse inputs.
- CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image
-
CrowdGaussian proposes a unified framework for reconstructing multi-person 3D Gaussian Splatting (3DGS) representations from a single image. It recovers complete geometry in occluded regions through a self-supervised adapted Large Occlusion-aware Reconstruction Model (LORM) and enhances texture detail quality using a single-step diffusion refiner (CrowdRefiner) trained with Self-Calibrated Learning (SCL).
- CUBE: Representing 3D Faces with Learnable B-Spline Volumes
-
This work proposes CUBE (Control-based Unified B-spline Encoding), a hybrid geometric representation combining B-spline volumes with learnable high-dimensional control features. It achieves editable, high-precision 3D face reconstruction and scan registration through two-stage decoding (B-spline basis interpolation + lightweight MLP residuals).
- Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding
-
To address the conflicting geometric space requirements of "precise localization" and "hierarchical semantics" in 3D dense captioning, this paper introduces a multi-stage non-Euclidean geodesic attention mechanism. The encoder performs localization on the Oblique manifold, while the decoder constructs semantic hierarchies in Lorentz hyperbolic space, upgrading the Vote2Cap-DETR++ framework to the CAC framework. It achieves new SOTA captioning results on ScanRefer and Nr3D.
- CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization
-
The CustomTex framework is proposed, which implements high-fidelity, instance-controllable texture generation for 3D indoor scenes through instance-level multi-reference driving and a dual-distillation training strategy (semantic-level VSD distillation + pixel-level super-resolution distillation). It significantly outperforms existing methods in semantic consistency, texture clarity, and the reduction of "baked-in shading."
- D-Prism: Differentiable Primitives for Structured Dynamic Modeling
-
D-Prism extends differentiable geometric primitives (superquadrics) from static scenes to the dynamic domain. It utilizes a deformation network to drive rigid primitive motions, binds 3D Gaussians to each primitive to supplement appearance, and incorporates a "clone/merge/prune" dynamic adaptive control system. This enables the simultaneous reconstruction of structured geometry with part decomposition and precise part motion from monocular videos.
- Dark3R: Learning Structure from Motion in the Dark
-
The Dark3R framework is proposed to transfer 3D priors from MASt3R to extreme low-light (SNR \(< -4\) dB) raw images through teacher-student distillation, enabling Structure from Motion (SfM) and novel view synthesis in dark environments where traditional methods fail completely.
- Deformation-based In-Context Learning for Point Cloud Understanding
-
This work proposes DeformPIC, which redefines point cloud In-Context Learning from the "mask reconstruction" paradigm to a "deformation transfer" paradigm. By utilizing a Deformation Extraction Network to extract task semantics and a Deformation Transfer Network to transfer deformations to query point clouds, it reduces Chamfer Distance (CD) by 1.6, 1.8, and 4.7 in reconstruction, denoising, and registration tasks, respectively.
- Dehallu3D: Hallucination-Mitigated 3D Generation from a Single Image via Cyclic View Consistency Refinement
-
Addressing the hallucination issue where Large Reconstruction Models "imagine" sparse multi-views as outlier structures (holes, spikes), Dehallu3D introduces a plug-and-play Cyclic View Consistency Refinement (CVCR) module following the single-image-to-mesh reconstruction pipeline. It smooths out outliers via 360° circular dense adjacent-view depth consistency constraints while preserving sharp features through adaptive smoothing. It also proposes the ORM metric to specifically quantify the degree of outliers, achieving a comprehensive lead in geometry and appearance metrics on the GSO dataset.
- Dense Metric Depth Completion from Sparse Direct Time-of-Flight Sensors
-
Addressing the "extremely sparse + low resolution + noisy" depth maps from direct Time-of-Flight (dToF) sensors, this paper proposes a depth-guided dual-branch ViT encoder with masked joint attention. This allows sparse depth to unidirectionally guide RGB features without being contaminated by RGB cues, coupled with a lightweight DPT decoder to directly output dense metric depth. Trained entirely on a simulation pipeline covering flash/rotating dToF synthetic data, the model achieves zero-shot generalization across 6 datasets and 3 real dToF devices, matching or exceeding the previous SOTA while being 20× faster and using 10× less VRAM.
- Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
-
This paper proposes DAP (Depth Any Panoramas), a foundation model for panoramic metric depth estimation. Utilizing a "data-in-the-loop" approach, it feeds 2M indoor/outdoor and synthetic/real panoramic images into a three-stage pseudo-label distillation pipeline. Combined with a DINOv3 backbone, a plug-and-play distance mask head, and a set of distortion-aware geometric/sharpness losses, DAP achieves zero-shot SOTA on multiple benchmarks including Stanford2D3D, Matterport3D, and Deep360, providing particularly stable absolute scale predictions for outdoor long-range and sky regions.
- Depth Hypothesis Guided Iterative Refinement for Event-Image Monocular Depth Estimation
-
HypoDepth reformulates event-image monocular depth estimation from "direct regression of continuous depth" to "constrained search within discrete depth hypotheses." By utilizing a lightweight 3D cost volume and a GRU iterative unit, it progressively refines residual depth from low to high resolution. It achieves SOTA on DSEC and MVSEC, with a Tiny version capable of real-time operation on resource-constrained devices.
- Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering
-
To address the issues of boundary aliasing caused by hard depth testing and the inability to jointly optimize surfels and Gaussians in Gaussian-Enhanced Surfels (GES), this paper proposes DP-GES. By introducing translucent boundaries for surfels and utilizing 3-layer depth peeling to determine accurate per-pixel occlusion orders, 3D Gaussians can still perform order-independent splatting while receiving correct transmittance modulation. This approach eliminates aliasing and popping, enables differentiable joint optimization of surfels and Gaussians, and achieves state-of-the-art image quality at 472 FPS across multiple datasets.
- DepthFocus: Controllable Depth Estimation for See-Through Scenes
-
DepthFocus redefines stereo depth estimation from a "passive output of the nearest surface" to a "controllable process driven by a physical reference distance \(c\)." Using a steerable ViT that dynamically modulates features through two modules—Conditional MoE and Direct Condition Injection—the network "peels away" transparent or reflective occlusions layer-by-layer like human eye focusing, achieving SOTA on both standard single-layer benchmarks and complex multi-layer scenes.
- DetAny4D: Detect Anything 4D Temporally in a Streaming RGB Video
-
DetAny4D defines "continuous 3D bounding box prediction in streaming RGB videos" as a 4D detection task. It utilizes an end-to-end open-vocabulary framework (SAM + DINO + UniDepth features + Causal Spatiotemporal Decoder + multi-task heads) to directly output globally consistent 3D boxes across frames. It is accompanied by the DA4D dataset, which comprises 280,000 sequences. Compared to single-frame detectors, the method reduces cross-frame jitter variance by 10–30%.
- DICArt: Advancing Category-level Articulated Object Pose Estimation in Discrete State-Spaces
-
DICArt reformulates category-level 6D articulated object pose estimation as a conditional discrete diffusion process. Specifically, it discretizes rotation and translation into tokens, utilizes a "flow decider" for step-by-step denoising, and couples the estimation of each part according to the parent-child kinematic hierarchy. It significantly outperforms existing methods on synthetic, semi-synthetic, and real-world robotic arm data.
- Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance
-
Using a unified "spatial-angular 4D structured light" hardware (LED array + LCD mask + single camera), this work differentiably optimizes the next set of light/mask patterns in real-time during the capture process to minimize pixel-wise depth uncertainty. This enables efficient joint reconstruction of object shape (depth map) and reflectance (GGX SVBRDF) from a single viewpoint, reducing exposure time by up to 100\(\times\) and total acquisition time by 2\(\times\).
- DiffSoup: Direct Differentiable Rasterization of Triangle Soup for Extreme Radiance Field Simplification
-
DiffSoup represents radiance fields as an unstructured triangle soup (fewer than 20,000 primitives) with neural textures and binary opacity. It introduces a "stochastic opacity mask" to make opaque triangle rasterization directly differentiable, enabling real-time rendering on laptops and mobile devices using standard depth-testing pipelines with quality exceeding 3DGS or triangle splatting under equivalent budgets.
- DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
-
The authors transform a pre-trained multi-step image diffusion model into a "single-step, deterministic, temporally-conditioned" enhancer. Combined with a five-way data pipeline specializing in "artifact-ridden rendering ↔ realistic photo" pairs, it enhances simulation frames reconstructed via NeRF/3DGS (characterized by artifacts and lighting mismatches) into temporally coherent, high-realism visuals in real-time. In user studies, 84.28% of participants preferred this method.
- DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval
-
The view encoder for open-set 3D object retrieval (open-set 3DOR) is switched from CLIP to self-supervised DINO. A lightweight "chunked aggregation" adapter (CAM) is employed to integrate local multi-view relationships, while a Virtual Feature Synthesis (VFS) module utilizes CLIP text-visual alignment to generate unseen class virtual features for regularization. Using only uni-modal visual features, this approach outperforms bi-modal CLIP-based methods across four standard benchmarks.
- DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization
-
This paper natively integrates the Kannala-Brandt fisheye projection model into the 3DGS pipeline and proposes a cross-view joint optimization strategy based on feature overlap. This approach avoids the information loss inherent in pre-rectification and matches or exceeds SOTA performance on multiple public datasets.
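For readers unfamiliar with the Kannala-Brandt model, a minimal projection sketch follows. It uses the common four-coefficient polynomial form (the same one OpenCV's fisheye module adopts); the function name and parameter layout are illustrative, and the Jacobian needed to project 3D Gaussian covariances inside a rasterizer is not shown.

```python
import numpy as np

def project_kannala_brandt(points, fx, fy, cx, cy, k=(0.0, 0.0, 0.0, 0.0)):
    """Project (N, 3) camera-frame points to (N, 2) fisheye pixel coordinates."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.hypot(x, y)                    # distance from the optical axis
    theta = np.arctan2(r, z)              # angle between the ray and the axis
    t2 = theta * theta
    # Distorted radius: theta * (1 + k1 t^2 + k2 t^4 + k3 t^6 + k4 t^8).
    theta_d = theta * (1 + k[0] * t2 + k[1] * t2**2 + k[2] * t2**3 + k[3] * t2**4)
    # On-axis points (r == 0) map to the principal point.
    scale = np.where(r > 1e-12, theta_d / np.maximum(r, 1e-12), 0.0)
    u = fx * scale * x + cx
    v = fy * scale * y + cy
    return np.stack([u, v], axis=1)
```

Unlike a pinhole projection, a ray at 90° from the optical axis still lands at a finite pixel radius, which is why wide fisheye inputs can be used without rectification and cropping.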
- Distilling Unsigned Distance Function for Surface Reconstruction from 3D Gaussian Splatting
-
This work distills a "local patch UDF teacher" (pre-trained on synthetic algebraic surfaces) into a lightweight student UDF optimized alongside 3DGS. By employing band-limited distillation near the surface and weighting based on visibility/geometric confidence, it stably reconstructs open surfaces with boundaries and thin structures from multi-view images, achieving SOTA Chamfer Distance on DF3D and DTU.
- DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
-
DMAligner is proposed to transform the image alignment problem from the traditional optical flow warp paradigm into an "alignment-oriented view synthesis" task. By leveraging conditional diffusion models to directly generate aligned full images in conjunction with a specially constructed DSIA synthetic dataset and a Dynamics-aware Mask Producing (DMP) module, the method effectively avoids ghosting and occlusion artifacts inherent in warping methods, outperforming existing methods across multiple benchmarks.
- DROID-W: DROID-SLAM in the Wild
-
This paper proposes DROID-W, which introduces Uncertainty-aware Bundle Adjustment (UBA) combined with a DINOv2 feature-driven dynamic uncertainty update mechanism and monocular depth regularization. This enables DROID-SLAM to achieve robust camera pose estimation and scene reconstruction in highly dynamic in-the-wild scenarios, running in real time at approximately 10 FPS.
- DropAnSH-GS: Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting
-
To address the overfitting issue of 3DGS in sparse-view scenarios, this paper proposes DropAnSH-GS: it uses Anchor-based Dropout (dropping anchor points and their neighboring Gaussian clusters) instead of independent random Dropout to disrupt the local redundancy compensation effect, while introducing Spherical Harmonics (SH) Dropout to suppress high-order SH overfitting and support lossless compression after training.
- DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
-
DuoMo is proposed to decompose world-space human motion reconstruction into two independent diffusion models: a camera-space model extracts generalized camera-coordinate motion from video, and a world-space model refines lifted noisy proposals into globally consistent world-coordinate motion. By directly generating mesh vertex motion instead of SMPL parameters, it reduces W-MPJPE by 16% on EMDB and 30% on RICH.
- DVGT: Driving Visual Geometry Transformer
-
DVGT is a visual geometry Transformer designed for autonomous driving. It takes a sequence of multi-frame, multi-view images without pose information as input and directly predicts, end to end, a metric-scale global dense 3D point cloud map in the first frame's ego-coordinate system along with per-frame ego-poses. It requires no camera intrinsics/extrinsics and no post-hoc LiDAR-based scale alignment, outperforming both general geometry models (VGGT, CUT3R, MapAnything) and driving-specific models (Driv3R) across five driving datasets.
- Dynamic-Static Decomposition for Novel View Synthesis of Dynamic Scenes with Spiking Neurons
-
Addressing the pain points of "inaccurate mask priors" and "improper label representation" in dynamic-static decomposition for 3DGS, this paper employs a 4D spatio-temporal fine-grained mask field for supervision and utilizes spiking neurons to optimize dynamic-static labels directly into discrete 0/1 values. This approach precisely classifies Gaussians into dynamic or static categories, achieving SOTA rendering quality in fine-grained motion, motion boundaries, and side-view evaluations while maintaining real-time frame rates.
- Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields
-
This paper proposes PI-DEF, which uses physics-informed coordinate neural networks to simultaneously reconstruct the 4D (time + 3D) emissivity field and 3D velocity field of gas near a black hole. Under sparse EHT measurements, it significantly outperforms BH-NeRF, which relies on hard-constrained Keplerian dynamics.
- Dynamic Visual SLAM using a General 3D Prior
-
This work tightly couples classic patch-based optical flow SLAM (DPV-SLAM) with a feed-forward 3D reconstruction foundation model (\(\pi^3\)): it uses motion masks predicted by the feed-forward model to filter dynamic pixels, stabilizes bundle adjustment with its depth priors, and resolves the inter-batch scale drift of the feed-forward model via scale alignment with the SLAM sparse point cloud. This achieves accurate poses, clean motion segmentation, and scale-consistent dense depth in dynamic scenes.
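The scale-alignment step can be illustrated with a closed-form least-squares fit. This is a minimal sketch, assuming paired depths for the same points from the feed-forward model and from the SLAM sparse point cloud; the function name `align_scale` and the single-scalar model are illustrative assumptions, not the paper's formulation.

```python
def align_scale(prior_depths, slam_depths):
    """Least-squares scalar s that minimizes sum((s * prior - slam)^2)."""
    if len(prior_depths) != len(slam_depths):
        raise ValueError("depth lists must have the same length")
    num = sum(p * s for p, s in zip(prior_depths, slam_depths))
    den = sum(p * p for p in prior_depths)
    if den == 0.0:
        raise ValueError("prior depths are all zero")
    return num / den
```

Multiplying each batch of feed-forward depths by its own fitted scale would bring all batches into the SLAM map's scale; a robust variant (for example a median of per-point ratios) may be preferable when outliers are present.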
- DynamicTree: Interactive Real Tree Animation via Sparse Voxel Spectrum
-
This work compresses the motion of real scanned 3DGS trees into a set of "sparse voxels + frequency spectrum." A feed-forward diffusion model is used to generate long-term mesh motion in a single pass to drive the Gaussians. This approach avoids the spatio-temporal inconsistency common in 4D generation methods, is a hundred times faster than MPM physical simulation, and allows for real-time drag-and-drop interaction at approximately 18ms/frame by utilizing the spectrum as modal bases.
- DynamicVGGT: Learning Dynamic Point Maps for 4D Scene Reconstruction in Autonomous Driving
-
DynamicVGGT extends the static feed-forward 3D model VGGT to dynamic 4D reconstruction. By utilizing "Dynamic Point Maps" to predict point clouds for current and future frames within a unified learned coordinate system, combined with a parallel motion-aware temporal attention branch and a velocity-supervised dynamic 3D Gaussian head, it reconstructs temporally consistent dynamic driving scenes on Waymo and KITTI using only image inputs without camera parameters or dense annotations.
- E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
-
E-RayZer is the first truly self-supervised feed-forward 3D Gaussian reconstruction model. By replacing RayZer's implicit latent scene representation with explicit 3D Gaussians and employing a curriculum learning strategy based on visual overlap, it learns geometrically grounded 3D-aware representations under zero 3D annotation. It significantly outperforms RayZer in pose estimation (RPA@5° improved from ≈0 to 90.8) and leads mainstream pre-trained models like DINOv3/CroCo v2 in frozen-backbone probing for downstream 3D tasks, even rivaling supervised VGGT.
- E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction
-
This paper proposes E2EGS, a pose-free 3D reconstruction framework based entirely on event streams: it extracts noise-resistant edge maps from event streams through patch-based temporal consistency analysis, uses edge information to guide Gaussian initialization and weighted loss optimization, and achieves high-quality trajectory estimation and 3D reconstruction without depth models or RGB input.
- Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow
-
This paper proposes a feed-forward 3D asset editing framework based on the TRELLIS 3D generative backbone. It achieves globally consistent geometric deformation in a sparse voxel latent space through Voxel FlowEdit and recovers high-frequency details using normal-guided multi-view texture refinement.
- EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
-
EcoSplat is the first "count-controllable" feed-forward 3D Gaussian Splatting framework. Given an arbitrary target primitive count \(K\) at inference, it selects the \(K\) most significant Gaussians via a single feed-forward pass. Under extreme constraints (RE10K 24-view, compressed to 5% primitives), it achieves a PSNR of 24.7, significantly outperforming existing feed-forward methods that rely on threshold-based pruning.
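The selection step at inference can be pictured as a plain top-\(K\) pick over per-Gaussian importance scores. The sketch below assumes the scores are already available as a list of floats; how EcoSplat actually predicts them is not shown, and `select_top_k` is a hypothetical helper name.

```python
def select_top_k(scores, k):
    """Return the indices of the k highest-scoring primitives, in original order."""
    k = max(0, min(k, len(scores)))  # clamp the budget to a valid range
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Unlike threshold-based pruning, the number of kept primitives here is exactly the requested budget regardless of how the scores are distributed.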
- Edges Compete for Trust: Group Relative Edge Optimization for Building Reconstruction from Point Clouds
-
To address the issue where edge-based methods rely on one-to-one Hungarian matching—leaving most edge proposals "unsupervised"—this paper introduces the "intra-group relative advantage" concept from GRPO into wireframe reconstruction. The proposed GREO calculates a continuous reward for every edge based on geometric alignment quality, normalizes it within the group to form a target confidence distribution, and applies dense discriminative supervision via cross-entropy and entropy regularization. As a plug-and-play training strategy, it pushes PBWR / EdgeDiff to SOTA on Building3D with zero inference overhead.
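One way to read the reward-to-target step is sketched below: rewards are standardized within the group, passed through a softmax to form a target distribution, and compared with the predicted confidences by cross-entropy. The z-score normalization, the temperature, and all function names are assumptions made for illustration; the summary does not specify GREO's exact normalization or how the entropy term is weighted, so the two quantities are returned separately.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def group_target(rewards, temperature=1.0, eps=1e-8):
    """Turn per-edge rewards into a target distribution via group-relative advantages."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    advantages = [(r - mean) / (std + eps) for r in rewards]
    return softmax([a / temperature for a in advantages])

def group_loss_terms(logits, rewards):
    """Return (cross-entropy to the target, entropy of the prediction)."""
    target = group_target(rewards)
    pred = softmax(logits)
    ce = -sum(t * math.log(max(p, 1e-12)) for t, p in zip(target, pred))
    entropy = -sum(p * math.log(max(p, 1e-12)) for p in pred)
    return ce, entropy
```

Every edge proposal in the group receives a nonzero target, which is the sense in which the supervision is dense rather than limited to the one-to-one matched edges.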
- Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers
-
The authors observe that image-to-image (I2I) editing diffusion models are inherently deterministic image-to-image mappings, making them better suited to dense perception than the commonly used text-to-image (T2I) models. They perform full-parameter fine-tuning on the FLUX.1 Kontext editor to create a unified depth/normal/matting perceiver. By incorporating a pixel-space consistency loss and a theoretically optimal square-root depth mapping, the model achieves SOTA results across three tasks with single-step inference using only ~74,000 training images.
- Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics
-
This paper proposes E3Flow, the first equivariant flow matching policy framework based on a spherical harmonics representation. By dynamically fusing visual information from point cloud and image modalities through a Feature Enhancement Module (FEM) and combining it with rectified flow for efficient equivariant action generation, E3Flow achieves an average success rate 3.12% higher than the strongest baseline SDP across 8 MimicGen tasks, while providing a 7x speedup in inference.
- Efficiently Reconstructing Dynamic Scenes One D4RT at a Time
-
D4RT uses a unified encoder-decoder Transformer to first encode a video into a fixed global scene representation, and then utilizes a single "query 3D position of any spatio-temporal point" decoding interface to simultaneously obtain depth, point clouds, 3D point trajectories, and camera extrinsics/intrinsics. It achieves new SOTA in dynamic 4D reconstruction and tracking, running approximately 9× faster than VGGT and two orders of magnitude faster than MegaSaM.
- EfficientVPR: Toward Efficient Visual Place Recognition via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement
-
By employing a "Scene-aware Visual Prompt Tuning (SceneVPT) + instance-specific key local feature enhancement module," this work performs single-stage Visual Place Recognition (VPR) on the lightweight DINOv2-small backbone. With a descriptor dimension of only 3456, it outperforms all methods of similar scale. Compared to the two-stage SOTA based on DINOv2-large, it achieves a ~73× speedup while maintaining an average R@1 within a 2.5% margin.
- EG-3DVG: Expression and Geometry Aware Grounding Decoder for 3D Visual Grounding
-
EG-3DVG embeds two complementary attention modules—PECA, which injects 3D positions into text tokens, and GMA, which filters visual tokens based on geometric relations—within a 3D visual grounding decoder. Complemented by Expression Contrastive Learning (ECL) to distinguish intra-category distractors, it specifically addresses "cross-modal misalignment, intra-category confusion, and geometric reasoning errors," achieving SOTA in bounding box localization and mask prediction on ScanRefer and SR3D/NR3D.
- Ego-1K: A Large-Scale Multiview Video Dataset for Egocentric Vision
-
This paper introduces Ego-1K, a large-scale time-synchronized egocentric multiview video dataset containing 956 short videos (12+4 cameras, 60Hz). It fills the data gap in the field of egocentric dynamic 3D reconstruction and demonstrates that stereo depth guidance significantly enhances the quality of 4D novel view synthesis.
- Egocentric Visibility-Aware Human Pose Estimation
-
Addressing the "frequently invisible keypoints" issue in egocentric human pose estimation for head-mounted devices (HMDs), this paper constructs Eva-3M, the first large-scale real-world dataset with visibility annotations (3 million frames, 435,000 visibility labels). It proposes EvaPose, which explicitly predicts the visibility of each keypoint and weights the loss accordingly, reducing the MPJPE of visible keypoints from 49.8mm in FRAME to 34.2mm.
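The idea of weighting a per-joint error by visibility can be sketched as follows. This is an illustrative metric-style function, assuming visibility values in [0, 1] (either labels or predicted probabilities); it is not EvaPose's actual training loss, and the name `visibility_weighted_mpjpe` is hypothetical.

```python
import math

def visibility_weighted_mpjpe(pred, gt, visibility):
    """Mean per-joint position error where each joint counts by its visibility weight."""
    total, weight = 0.0, 0.0
    for p, g, v in zip(pred, gt, visibility):
        total += v * math.dist(p, g)  # Euclidean distance between 3D joints
        weight += v
    return total / weight if weight > 0 else 0.0
```

With binary weights this reduces to the MPJPE over visible keypoints only, which is the quantity the summary reports (49.8mm vs. 34.2mm).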
- EI-Part: Explode for Completion and Implode for Refinement
-
EI-Part proposes an "Explode-Implode" part-level 3D generation framework: incomplete segmented parts are exploded into a dispersed state to make room for structural completion, then imploded back to a compact state to dedicate full resolution to detail refinement. Self-attention is used in both states to maintain structural consistency among parts, ultimately outperforming SOTA models like HoloPart, X-Part, and OmniPart across Voxel IoU, CD, and F-Score.
- Elastic3D: Controllable Stereo Video Conversion with Guided Latent Decoding
-
Elastic3D utilizes a 1-step conditional latent diffusion model to directly synthesize the right-eye video from a monocular input (without depth estimation or warping). It allows users to continuously adjust 3D intensity via a scalar "parallax factor" and employs a "guided VAE decoder" with epipolar attention to inject high-frequency details from the left view into the right view, eliminating binocular rivalry artifacts. It outperforms warp-based and warp-free baselines across three real-world stereo video datasets.
- Electromagnetic Inverse Scattering from a Single Transmitter
-
This paper reformulates the Electromagnetic Inverse Scattering Problem (EISP) from "per-sample physical optimization" to "end-to-end data-driven regression." By using an MLP to directly map received scattered fields and spatial coordinates to local relative permittivity, the method leverages data distribution priors learned from the training set to compensate for the information deficiency in sparse measurements. It achieves high-quality reconstruction using only a single transmitter for the first time, with inference speeds over 70,000 times faster than previous SOTA methods.
- EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
-
EmbodiedSplat is proposed as the first online feed-forward semantic 3DGS framework. It achieves memory-efficient per-Gaussian semantic representation through a sparse coefficient field and a CLIP global codebook. Combined with 3D geometry-aware features, it enables full-scene open-vocabulary 3D understanding at 5-6 FPS under streaming inputs of 300+ frames.
- EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
-
EmbodMocap uses RGB-D videos from two handheld iPhones to jointly calibrate the scene, camera trajectories, and human motion into a single metric world coordinate system. This enables low-cost "in-the-wild" 4D human-scene capture, producing data that can simultaneously drive three types of embodied tasks: monocular human-scene reconstruction, physical character animation, and real-world humanoid robot control.
- EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy
-
The problem of anisotropic slice reconstruction in volume electron microscopy (vEM) is re-modeled as a dynamic 3D scene rendering task based on deformable 2D Gaussian Splatting. High-fidelity continuous slice synthesis is achieved under sparse data conditions through a Teacher-Student pseudo-label mechanism.
- EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head
-
EmoDiffTalk maps the "emotion-to-expression" transformation onto the explainable Action Unit (AU) encoding space. It uses AU-prompted Gaussian diffusion to drive fine-grained dynamic 3D Gaussian talking heads from speech and implements "one-sentence emotion editing" via a text-to-AU controller. It surpasses previous SOTA methods in rendering fidelity, lip synchronization, and emotional controllability on EmoTalk3D and RenderMe-360.
- EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization
-
EmoTaG is an emotion-aware 3D talking head synthesis framework based on a FLAME-Gaussian structural prior and a Gated Residual Motion Network (GRMN). It achieves few-shot personalization with only 5 seconds of video while balancing emotional expression, lip-sync, and geometric stability.
- Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with redesigned pose gradient flow
-
Energy-GS utilizes only RGB images to simultaneously optimize 3D Gaussian Splatting scenes and inaccurate camera poses. By "freezing Gaussian positions," pose gradients are stabilized, and an image singular value energy decomposition is employed to simulate NeRF-like coarse-to-fine alignment. This approach achieves SOTA pose accuracy on synthetic and real datasets, with rendering quality on par with BARF/3R-GS.
- Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator
-
This paper proposes the Hand4Whole++ modular framework, which injects features from a pre-trained hand estimator into a frozen whole-body estimator via a lightweight CHAM module. This achieves precise wrist orientation prediction and transfers fine finger joints and hand shapes from a hand model using differentiable rigid alignment.
- eRetinexGS: Retinex Modeling for Low-Light Scene Enhancement via Event Streams and 3D Gaussian Splatting
-
eRetinexGS integrates "event streams + low-light frames + multi-view consistency" into a unified 3DGS framework. Each Gaussian explicitly stores two attributes—reflectance and illumination. Event signals are utilized to guide Retinex decomposition, and the two modalities are adaptively fused based on confidence. The method reconstructs a normal-light radiance field with sharp details and accurate colors in extremely dark scenes, achieving over 5 dB higher PSNR than previous state-of-the-art event+frame methods while supporting real-time rendering at 83 FPS.
- ESAM++: Efficient Online 3D Perception on the Edge
-
ESAM++ replaces the slowest component of the state-of-the-art (SOTA) online 3D perception method ESAM—the 3D Sparse UNet backbone—with a lightweight "3D Sparse Feature Pyramid Network (SFPN)." By leveraging multi-scale feature aggregation and channel rebalancing, it achieves up to a 3× speedup in CPU inference and a 2× reduction in model size across four indoor segmentation benchmarks. It maintains or even exceeds the accuracy of ESAM, enabling real-time online 3D instance segmentation on GPU-less edge devices such as mobile CPUs.
- Eulerian Gaussian Splatting using Hashed Probability Pyramids
-
The manual heuristic of "Adaptive Density Control (ADC)" in 3DGS is replaced by optimizing a learnable voxel probability density field from which Gaussians are sampled for rendering. High-resolution density is made affordable via Hashed Probability Pyramids, and sampling variance is mitigated through control variate gradient estimation. The method achieves SOTA reconstruction quality with random initialization on mip-NeRF 360 while maintaining 3DGS-level rendering speeds.
- EV-CGNet: Co-visible Focused 3D-guided 2D Event Keypoint Detection Network
-
EV-CGNet utilizes fine-grained spatio-temporal cues from event points to guide event frame feature prototype learning (G2PL). It further employs cross-frame self-attention to constrain keypoint detection to co-visible regions (CDDL), outperforming SOTA methods like SuperEvent in re-projection error, pose estimation, and SLAM trajectory error across six benchmarks.
- Event-based Visual Deformation Measurement
-
This paper proposes an event-frame fusion visual deformation measurement (VDM) system. It utilizes event cameras to provide temporally dense motion cues and standard frames to provide spatially dense accurate constraints. Through an Affine Invariant Simplex (AIS) framework, the high-dimensional deformation field is partitioned into low-parameter triangular sub-regions. Combined with a neighborhood greedy optimization to suppress long-range error accumulation, the system achieves a SOTA survival rate 1.6 times higher than existing methods under large deformations of 100+ pixels, while consuming only 18.9% of the storage/computing power required by high-speed camera solutions.
- Event Stream Filtering via Probability Flux Estimation
-
This paper reinterprets the event camera imaging process as a "stochastic process of log-irradiance trajectories crossing contrast thresholds," where events are samples of "probability flux" leaking at the thresholds. Accordingly, a generative filter, EDFilter, is proposed. It utilizes temporal kernel density estimation + motion-aware spatial smoothing + asynchronous resampling to reconstruct a clean, continuous, and physically interpretable event stream with \(O(1)\) real-time complexity.
- Event Structural Valley: A Unified Theoretical and Practical Framework for Event Camera Autofocus
-
Starting from the physical mechanism of event generation, the paper refutes the traditional assumption that "event rate is highest at the sharpest focus." It proves that the true focus corresponds to a valley (local minimum) between two peaks on the event rate curve. Based on this, the ESVA framework is proposed, which requires no image reconstruction or supervision and reduces autofocus error to state-of-the-art levels on multiple synthetic and real datasets.
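The "valley between two peaks" observation suggests a simple search over a sampled event-rate curve. The sketch below assumes one event-rate value per focus position and a curve clean enough for plain local-maximum detection; it is a hypothetical illustration of the idea, not the ESVA algorithm.

```python
def find_focus_valley(rates):
    """Index of the minimum event rate between the two highest local peaks, or None."""
    peaks = [
        i for i in range(1, len(rates) - 1)
        if rates[i - 1] < rates[i] >= rates[i + 1]
    ]
    if len(peaks) < 2:
        return None  # no two-peak structure to search between
    # Keep the two tallest peaks, then order them by position.
    left, right = sorted(sorted(peaks, key=lambda i: rates[i], reverse=True)[:2])
    return min(range(left, right + 1), key=lambda i: rates[i])
```

On real data the curve would likely need smoothing first, since noise creates spurious local maxima.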
- EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
-
This paper proposes EventHub, a data factory for event-based stereo matching training without LiDAR or other active sensor annotations. By generating proxy events and depth labels via novel view synthesis and transferring knowledge from RGB stereo models through cross-modal distillation, the trained event stereo models surpass LiDAR-supervised models in cross-domain generalization (reducing error by up to 50% on M3ED and MVSEC).
- Evidential Neural Radiance Fields
-
This paper adapts Evidential Deep Learning (EDL) to the NeRF volume rendering pipeline, allowing the model to directly predict and disentangle aleatoric uncertainty (data noise) and epistemic uncertainty (model ignorance) in a single forward pass. Without sacrificing rendering quality or increasing inference cost, it achieves both leading reconstruction fidelity and competitive uncertainty quality across three standardized benchmarks.
- EvObj: Learning Evolving Object-centric Representations for 3D Instance Segmentation without Scene Supervision
-
Addressing the issue where "synthetic object priors cannot generalize to real-world scans" in unsupervised 3D instance segmentation, EvObj integrates two modules into the RL discovery framework of GrabS: an identification network that evolves throughout the discovery process and a point cloud completion network to recover partial candidates. By adapting synthetic priors to real-world point clouds, EvObj outperforms all unsupervised baselines on ScanNet, S3DIS, and multi-category synthetic datasets, closely approaching the supervised method 3D-BoNet on the ScanNet hidden test set.
- ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation
-
ExMesh embeds "discrete topology operations (vertex splitting/merging)" directly into a "continuous differentiable optimization" pipeline to optimize an explicit triangle mesh end-to-end from multi-view images. Without intermediate representations like Marching Cubes/TSDF or post-processing, and featuring real-time UV maintenance, it achieves a superior balance between precision, efficiency, and mesh conciseness (reaching SOTA-equivalent Chamfer distance on DTU in 13 minutes with approximately 196K faces).
- Exploring 6D Object Pose Estimation with Deformation
-
Addressing the common but often invalid assumption in 6D pose estimation that objects are "rigid and perfectly consistent with canonical CAD models," this paper constructs the first dataset explicitly characterizing deformation, DeSOPE. It covers 26 categories of daily necessities, scanning 1 canonical part + 3 incremental deformed parts (Light/Medium/Heavy) for each. A flow-driven registration aligns deformed meshes to canonical ones, and a semi-automatic pipeline generates 665K pose annotations across 133K RGB-D frames. Experiments demonstrate that more severe deformation leads to sharper performance drops in mainstream methods, revealing that the "rigidity assumption" is a significantly underestimated weakness in current pose pipelines.
- Extend3D: Town-Scale 3D Generation
-
This paper proposes Extend3D, a training-free 3D scene generation pipeline. By extending the voxel latent space of a pre-trained object-level 3D generative model (Trellis) and introducing joint denoising of overlapping patches, under-noising SDEdit initialization, and 3D-aware optimization, it generates town-scale large 3D scenes from a single image. It outperforms existing methods in both human preference and quantitative evaluation.
- ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting
-
This paper proposes the extrinsic paradigm, which completely decouples semantics from 3DGS geometry. By constructing a lightweight semantic index layer through multi-granularity object grouping and VLM text hypotheses, it achieves training-free, low-storage, and ambiguity-aware open-vocabulary 3D scene understanding.
- Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
-
FoundationStereo, which features strong zero-shot performance but slow execution, is compressed using a "divide and conquer" strategy comprising three pillars: feature distillation, block-wise search for cost filtering, and pruning of the refinement module. Supplemented by an automated pseudo-labeling pipeline processing 1.4M real stereo pairs, this approach maintains near-original zero-shot accuracy at real-time frame rates, achieving a speedup of over 10x compared to FoundationStereo.
- Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration
-
Fast3Dcache is proposed as a training-free geometry-aware caching framework for 3D diffusion models. It dynamically allocates caching budgets using Predictive Cache Scheduling Constraints (PCSC) based on voxel stabilization patterns and selects stable tokens for reuse via the Spatio-temporal Stability Criterion (SSC) based on velocity and acceleration. It achieves up to a 27.12% increase in throughput and a 54.83% reduction in FLOPs, with only about a 2% loss in geometric quality.
- Fast Markov Random Field Optimisation for Topologically Noisy 3D Shape Matching
-
This paper reformulates non-rigid 3D shape matching as a triangle-based multi-label MRF problem. It ensures neighborhood smoothness using a pairwise pseudometric that measures geodesic distances exclusively on the target shape. By employing a variant of \(\alpha\)-expansion tailored for the specific label space, the problem is solved in linear time, achieving high accuracy, stability, and speed in scenarios with topological noise (genus changes).
- Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction
-
This paper proposes Fast SceneScript, which achieves inference acceleration for 3D scene understanding by introducing Multi-Token Prediction (MTP) into structured language models. Combined with Self-Speculative Decoding (SSD) and Confidence-Guided Decoding (CGD) to filter unreliable tokens, along with a parameter-efficient head-sharing mechanism, it achieves 5.09× and 5.14× speedups for layout estimation and object detection, respectively, without compromising accuracy.
- Fast Spatial Tracking with Visual Geometry Transformer
-
This paper employs a feed-forward visual geometry Transformer to directly predict 2D/3D trajectories of arbitrary query points from monocular video. By replacing the traditional dependence on dense depth estimation and scene reconstruction with a dual-branch design of "global branch + frame-level branch + bidirectional interaction," it achieves real-time speeds of 28 ms/frame and attains SOTA performance on TAPVid-3D with 19.0 AJ / 28.9 ADP.
- Faster-GS: Analyzing and Improving Gaussian Splatting Optimization
-
This paper systematically organizes, aligns, and integrates training acceleration techniques scattered across multiple 3DGS follow-up works into a clean baseline. By introducing "memory-coalescence-friendly z-order densification" and "backward-propagation-optimizer fusion + custom Adam," it accelerates 3DGS training by up to 5× and reduces VRAM by 30% without changing reconstruction quality or Gaussian count, compressing single-scene reconstruction to under 2 minutes.
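Z-order (Morton) sorting is a standard way to place spatially nearby items next to each other in memory. The sketch below computes a 3D Morton code, assuming positions have already been quantized to non-negative integers; how Faster-GS applies the ordering during densification is not shown, and `morton3d` is an illustrative name.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into a single Z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)        # x takes bit positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)    # y takes bit positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)    # z takes bit positions 2, 5, 8, ...
    return code
```

Sorting primitives by this key tends to keep neighbors in 3D close together in the buffer, which favors coalesced memory access on the GPU.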
- FastEventDGS: Deformable Gaussian Splatting for Fast Dynamic Scenes from a Single Event Camera
-
FastEventDGS represents the first work to train Deformable 3D Gaussian Splatting (Deformable 3DGS) for dynamic scenes using only a single monocular event camera. By utilizing continuous trajectory parameterization, a dual event generation model, local patch motion loss, and expert depth refinement, it improves PSNR from ~16 dB to 22–24 dB on both synthetic and real-world fast-motion datasets.
- FastGS: Training 3D Gaussian Splatting in 100 Seconds
-
FastGS is proposed as a 3DGS acceleration framework based on multi-view consistency. By employing Multi-view Consistency Densification (VCD) and Multi-view Consistency Pruning (VCP) to precisely control the number of Gaussians, it achieves scene training in approximately 100 seconds on datasets like Mip-NeRF 360—a 15× speedup over vanilla 3DGS with comparable rendering quality.
- Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners
-
LILA utilizes a frozen DINOv2 encoder and a DPT decoder to learn pixel-wise features from unlabeled videos. The core training signal is "linear in-context learning": an optimal linear projection is fitted on context frames to map features to depth, optical flow, and self-distillation cues. By forcing the same projection to reconstruct corresponding cues on adjacent query frames, the model compresses geometric, semantic, and temporal consistency into pixel-level features. Downstream performance on VOS, surface normal estimation, and semantic segmentation significantly outperforms FlowFeat, LoftUp, and others.
- Feed-forward Gaussian Registration for Head Avatar Creation and Editing
-
MATCH utilizes a transformer to directly predict "Gaussian Splatting textures under dense semantic correspondence" from calibrated multi-view images in 0.5 seconds. It bypasses the time-consuming mesh tracking and per-subject optimization—which typically take hours or even a day in traditional workflows—and directly enables cross-identity/cross-expression applications such as avatar creation, interpolation, semantic editing, and expression transfer.
- Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction
-
MeshLAM utilizes a feed-forward Transformer to reconstruct a 3D head mesh avatar with high-fidelity textures and direct animatability from a single image in one pass. By employing a "dual shape/texture branch + GRU iterative decoding + input-to-UV back-projection guidance," it avoids test-time optimization and mesh collapse while surpassing Gaussian-based LAM in both quality and speed.
- Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments
-
This paper proposes FI3Det, the first few-shot incremental 3D object detection framework. It utilizes a VLM-guided unknown object learning module during the base training phase to perceive potential novel classes in advance. In the incremental phase, it employs a gated multi-modal prototype casting module to fuse 2D semantic and 3D geometric features for novel class detection. FI3Det achieves an average improvement of 17.37% in novel class mAP on ScanNet V2 and SUN RGB-D.
- FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinking for Large-Scale LoD 3D Gaussian Splatting
-
FilterGS eliminates the two main bottlenecks in large-scale LoD 3DGS rendering—serial layer-by-layer traversal for Gaussian selection and massive invalid Gaussian-tile key-value pairs—by utilizing "Traversal-Free Parallel Dual Filters" and "Adaptive Gaussian Shrinking based on scene crowding." It achieves nearly 300 FPS (significantly surpassing the second-best method) across six large-scale scenes while maintaining reconstruction quality comparable to SOTA.
- FILTR: Extracting Topological Features from Pretrained 3D Models
-
This paper first probes "how much topology pretrained 3D point cloud encoders actually understand" using DONUT, a synthetic dataset with topological labels. The findings reveal that while encoders have weak understanding of global topology (connected components, genus), they possess implicit perception of multi-scale structures. Subsequently, the authors propose FILTR—the first set-prediction model that adapts DETR to predict persistence diagrams directly from frozen encoder features, transforming persistence diagram extraction from a classical algorithm into a learnable, one-step feed-forward process integrable with other networks.
- FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing
-
FISHuman utilizes a "3D-aware dual-stream video diffusion model" to expand a single photo into multi-view aligned RGB+normal sequences. It then employs a "4D Remeshing" module to transform pixel drifts from inconsistent multi-view frames into controllable per-vertex deformations. This allows for the reconstruction of 3D humans with fine geometry, realistic textures, and animation-ready meshes from a single image, outperforming SOTAs like PSHuman and Human3Diffusion in geometry and appearance metrics on 2K2K / Sizer.
- FlashMesh: Faster and Better Autoregressive Mesh Synthesis via Structured Speculation
-
FlashMesh adapts "speculative decoding" from large language models to autoregressive mesh generation. By designing a predict–correct–verify framework tailored for the hierarchical structure of the Hourglass Transformer, the model predicts multiple tokens in parallel per step with geometric error correction. It achieves approximately 2× inference speedup on Meshtron-2B while reducing the Chamfer Distance from 0.092 to 0.089.
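To make the predict, correct, and verify idea concrete, here is a minimal sketch of generic greedy speculative decoding. It is not FlashMesh's hierarchical scheme: the toy next-token functions `target_next` and `draft_next`, and the parameter `k`, are assumptions introduced purely for illustration.

```python
def speculative_decode(target_next, draft_next, prefix, num_new, k=4):
    """Greedy speculative decoding with toy next-token functions.

    draft_next proposes k tokens cheaply; target_next checks them. The longest
    agreeing run is accepted, plus one corrected token on the first mismatch,
    so the output always equals plain greedy decoding with target_next.
    """
    tokens = list(prefix)
    goal = len(prefix) + num_new
    while len(tokens) < goal:
        # Predict: draft k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Verify: a real model would score all k positions in one batched pass.
        for i, proposed in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if proposed != expected:
                # Correct: keep the agreeing run and the target's own token.
                tokens.extend(draft[:i] + [expected])
                break
        else:
            tokens.extend(draft)
    return tokens[:goal]


# Hypothetical toy models: the target counts modulo 5; the draft errs after a 3.
target = lambda seq: (seq[-1] + 1) % 5
draft = lambda seq: 0 if seq[-1] == 3 else (seq[-1] + 1) % 5

decoded = speculative_decode(target, draft, [0], num_new=10)
```

Because every accepted token is checked against the target, the speedup comes only from how many drafted tokens survive verification per step, not from any change in the output.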
- FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
-
By replacing the global self-attention in VGGT with descriptor-based cross-attention, the inference time for 1,000 images is reduced to 9.3% of VGGT while maintaining competitive reconstruction accuracy and scalability to sequences of 3,000+ images.
- FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
-
FlexAvatar utilizes a transformer-based Large Reconstruction Model (LRM) combined with Structured Head Query tokens to aggregate an arbitrary number of single or sparse input images (without camera poses or expression labels) into a unified UV-space Gaussian avatar. A lightweight UNet driven by UV position maps decodes expression-related deformations in real time. Together with data distribution adjustment and a 10-second test-time refinement, it achieves SOTA 3D consistency and dynamic detail realism.
- Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning
-
Flow3r introduces a "Factored Flow" prediction module that predicts optical flow using the geometric latents of the source view and the pose latents of the target view. This enables unlabeled videos to serve as supervision for 3D geometry learning, achieving SOTA performance across 8 benchmarks in both static and dynamic scenes.
- Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM
-
Addressing 3DGS-SLAM in dynamic scenes, this paper employs "camera ego-motion + optical flow" for category-agnostic dynamic/static decomposition. It proposes a hybrid 4D Gaussian representation featuring "explicit keyframe Gaussian centers + GMM time-varying opacity/rotation," combined with scene flow propagation and adaptive insertion to accelerate dynamic Gaussian training. It significantly outperforms 4DGS-SLAM in tracking accuracy, rendering quality, and speed (mapping time reduced from 110s/step to 6s/step).
- FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction
-
FluidGaussian is proposed to guide active view selection in 3D reconstruction through uncertainty metrics propagated via fluid simulation, ensuring that reconstruction results are not only visually realistic but also physically plausible for interactions.
- ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos
-
ForeHOI utilizes an end-to-end feed-forward network to directly reconstruct the geometry of objects heavily occluded by hands from monocular hand-object interaction videos. By leveraging a dual-branch diffusion model that simultaneously predicts "completed 2D object masks" and "complete 3D voxels" with bi-directional interaction, it compresses tasks that previously required hours of optimization to under one minute, while surpassing optimization-based methods in accuracy.
- Foundry: Distilling 3D Foundation Models for the Edge
-
This paper proposes the Foundation Model Distillation (FMD) paradigm and the Foundry framework. By utilizing a "compress-and-reconstruct" objective, the student model learns a set of learnable SuperTokens to compress the teacher's latent space basis vectors. The resulting single distilled model maintains universality across multiple tasks such as classification, segmentation, and few-shot learning, while reducing FLOPs from 478G to as low as 137G.
- FreeArtGS: Articulated Gaussian Splatting Under Free-Moving Scenario
-
FreeArtGS proposes a method for reconstructing articulated objects from monocular RGB-D videos in "free-moving scenarios" (where object pose and joint states vary simultaneously). By utilizing a three-stage pipeline comprising motion-driven part segmentation, robust joint estimation, and end-to-end 3DGS optimization, it significantly outperforms all baselines on the self-produced FreeArt-21 benchmark and existing datasets.
- FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes
-
The Reproducing Kernel Particle Method (RKPM) is utilized to parameterize the skinning weights of elastic bodies. Optimal skinning eigenmodes are then directly solved via a generalized eigenvalue problem on the elastic energy Hessian, enabling meshless, reduced-order elasticity simulation. This approach is approximately 40× faster to train than Simplicits (which uses per-object neural field optimization) while achieving accuracy closer to the FEM gold standard.
- FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation
-
FreeScale scales limited real-world data into large-scale training data by sampling high-quality free-view images in a certainty-guided manner from existing scene reconstructions, achieving a 2.7 dB PSNR improvement for feed-forward novel view synthesis models.
- Fresco: Frequency-Spatial Consistent Optimization for Fine-Grained Head Avatar Modeling
-
Fresco does not modify the underlying representation of head avatars but focuses on training dynamics: it employs a Laplacian pyramid for a "low-to-high" frequency curriculum and incorporates differentiable UV-baking to align multi-view renderings to a shared texture atlas. This suppresses early pseudo high-frequency artifacts and eliminates cross-view drifting, achieving SOTA results in both novel-view and self-reenactment metrics (PSNR/LPIPS) on the NeRSemble dataset.
- From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras
-
This paper proposes the first event camera calibration framework that detects checkerboard corners directly in the event domain without relying on intensity reconstruction. By mathematically proving that "almost no events are generated at corners," the method uses edge cues to initialize corners and refines them to sub-pixel accuracy at local minima of event density. The same detection scheme is extended to AprilTags, achieving stable calibration on both self-collected and public datasets.
- FE2E: From Editor to Dense Geometry Estimator
-
This paper systematically analyzes the differences in fine-tuning behavior between image editing models and generative models for dense geometry estimation tasks. It discovers that editing models possess a natural structural prior advantage. Based on this, the FE2E framework is proposed, which for the first time adapts a DiT-based image editing model into a joint depth and normal estimator, significantly outperforming existing SOTA in zero-shot scenarios (reducing AbsRel by 35% on ETH3D).
- From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching
-
Addressing the long-standing blind spot in deep functional map matching—where only "features" are optimized while the "spectral basis" remains fixed—this paper proposes Advanced Functional Maps. By utilizing a set of learnable "suppression functions" \(G\), the fixed Laplacian basis \(\Phi\) is transformed into a learnable basis \(\Psi=\Phi G\). Features and the spectral basis are jointly optimized end-to-end via a lightweight multi-scale heat diffusion network. This approach significantly outperforms feature-only SOTA methods in difficult scenarios such as non-isometry and topological noise, while being faster and more stable by eliminating the functional map solver.
- From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
-
NAS3R is a completely self-supervised feed-forward 3D reconstruction framework. Without using any ground-truth (GT) labels or pre-trained priors during training, it jointly learns 3D Gaussians, camera intrinsics/extrinsics, and depth from uncalibrated, poseless multi-view images using only photometric loss signal from rendered target views. Its novel view synthesis (NVS) quality approaches supervised methods, while its pose and depth estimation outperform several supervised baselines.
- From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
-
TraqPoint, a reinforcement learning framework, shifts keypoint detection from the "image pair matching" paradigm to "sequence-level trackability optimization," directly optimizing the long-term tracking quality of keypoints over image sequences. It surpasses SOTA in pose estimation, visual localization, visual odometry, and 3D reconstruction.
- From Rays to Projections: Better Inputs for Feed-Forward View Synthesis
-
To address the fragility of encoding cameras as Plücker rays in feed-forward view synthesis, this paper adopts "target-view point cloud projections" as conditional inputs. This reformulates fragile geometric regression into a stable image-to-image translation task. Combined with MAE self-supervised pre-training, the method outperforms ray-conditioned baselines on standard NVS benchmarks and a custom view-consistency benchmark.
- FSFSplatter: Geometrically Accurate Reconstruction with Free Sparse-view Images within 2 minutes
-
FSFSplatter utilizes a large multi-view Transformer to convert 3 uncalibrated sparse images into a dense, geometrically consistent 2D Gaussian scene via a single feed-forward pass while simultaneously estimating camera parameters. This is followed by contribution-based pruning to remove floaters and geometry-enhanced optimization supervised by depth and multi-view features, resulting in accurate, renderable surfaces within 2 minutes. Surface error is reduced by at least 28% and NVS error by at least 46% on DTU/Replica/BlendedMVS.
- FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning
-
FunFact constructs probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. It first reconstructs object-part level 3D maps using foundation models, then transforms candidate functional relations into a "dual factor graph." By performing belief propagation with LLM commonsense priors and geometric proximity priors, the method jointly reasons over all functional edges in the scene to output well-calibrated confidence scores for each edge, significantly outperforming pair-wise reasoning baselines in functional relation recall and calibration error.
- FunREC: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos
-
This paper proposes FunREC, a training-free optimization-based method that reconstructs functional articulated 3D digital twins directly from egocentric RGB-D interaction videos. It automatically discovers articulated parts, estimates kinematic parameters, tracks 3D motion, and reconstructs both static and moving geometries. It significantly outperforms prior methods across all benchmarks (part segmentation mIoU increased by over 50, joint angle error reduced by 5-10 times) and supports simulation export and robotic interaction.
- FUSER: Feed-Forward Multiview 3D Registration Transformer and SE(3)\(^N\) Diffusion Refinement
-
FUSER transforms "multiview point cloud registration" from the traditional two-stage "pairwise matching + pose graph synchronization" pipeline into a single feed-forward inference. All scans are processed together in a compact latent space for joint reasoning, directly regressing global poses for each scan, followed by refinement using FUSER-DF, a diffusion model in the joint SE(3)\(^N\) space. It significantly outperforms existing methods on ScanNet/3DMatch/ArkitScenes and reduces per-sequence processing time from hundreds of seconds to mere seconds.
- Fusion of Depth and Semantics for Probabilistic Floorplan Localization
-
This paper reformulates the ray-matching task of "estimating camera pose on a 2D floorplan from a single RGB image" into a probabilistic framework: it couples depth and semantic ray predictions on shared representations, weights each depth ray using distribution-based confidence, and performs soft semantic matching via JSD. This approach simultaneously suppresses environmental, geometric, and semantic ambiguities in indoor scenes, significantly raising the 1m·30° recall on Structured3D and ZInD (e.g., S3D-full 57.5% \(\rightarrow\) 71.4%).
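For reference, the Jensen-Shannon divergence (JSD) used for soft matching can be computed as below. This is a minimal sketch that assumes each ray carries a categorical distribution over semantic classes; the function names and the example distributions are illustrative and do not come from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two categorical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical class distributions for one ray (three semantic classes).
predicted = [0.7, 0.2, 0.1]   # soft network prediction
floorplan = [1.0, 0.0, 0.0]   # hard label read from the floorplan
score = jsd(predicted, floorplan)
```

Unlike the KL divergence, the JSD is symmetric and bounded by ln 2, so a hard floorplan label with zero-probability classes still yields a finite matching cost.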
- GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator
-
This paper proposes GaussFusion, a geometry-informed video-to-video generative model. By rendering a Gaussian Primitives Buffer (GP-Buffer) containing depth, normals, opacity, and covariance to condition a video generator, it effectively removes floaters, flickering, and blur in 3DGS reconstructions. It is compatible with both optimization-based and feed-forward reconstruction paradigms, with a distilled version achieving real-time inference at 16 FPS.
- GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials
-
GaussianFluent supplements surface-only 3DGS by filling interiors with realistic textured Gaussians using generative models. It integrates a stabilized, parallelized Continuous Damage Material Point Method (CD-MPM) into the simulation, enabling 3DGS to realistically simulate brittle fracture, cutting, and bullet penetration in mixed materials at real-time speeds with exposed internal structures.
- GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance
-
This paper proposes GaussianGrow, which "grows" 3D Gaussians from easily accessible 3D point clouds instead of predicting both geometry and appearance from scratch. It leverages multi-view diffusion models to generate consistent appearance supervision and introduces an overlap region detection and iterative completion mechanism to resolve viewpoint fusion artifacts and occluded areas, significantly outperforming SOTA on synthetic and real-scan point clouds.
- GeCo: Geometry-Consistent Regularization for Domain Generalized Semantic Segmentation
-
Addressing the issue where adapting Visual Foundation Models (VFMs) via PEFT for Domain Generalized Semantic Segmentation (DGSS) leads to overfitting on the source domain and destruction of pre-trained geometric structures, GeCo proposes Curvature-Guided Perturbation (adjusting perturbation intensity/direction based on local manifold complexity per token) and Geodesic Regularization (constraining prediction consistency on the hypersphere of the probability simplex). It achieves SOTA on closed-set and open-set DGSS with only 4.7M trainable parameters.
- Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis
-
GRaF transfers the NeRF concept to the RF domain. By introducing a theorem stating that "the spatial spectrum of a target transmitter can be approximated by interpolating the spectra of neighboring transmitters," it transforms the "per-scene retraining" NeRF into a generalizable latent RF radiance field. Leveraging a geometry-aware Transformer to encode neighbor spectra and complex-valued neural ray tracing to reconstruct the spatial spectrum, GRaF outperforms NeRF2 in both single-scene and unseen-scene settings.
- Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
-
GenWildSplat transforms "in-the-wild internet photo reconstruction" from per-scene optimization into a single feedforward pass: given 2–6 pose-free, variably-lit sparse images with transient occlusions (pedestrians, vehicles), it predicts 3D Gaussians with controllable appearance in under 3 seconds. By employing an appearance adapter for color modulation in 3D space, segmentation masks to filter transients, and a three-stage curriculum learning strategy for stability, it achieves PSNR scores on MegaScenes that surpass optimization-based methods requiring hours of computation.
- Generalizable Structure-Aware Keypoint Correspondence for Category-Unified 3D Single Object Tracking
-
UniKPT proposes replacing point-to-point dense matching with a set of adaptive sparse keypoints. Through three modules—Adaptive Keypoint Extraction, Progressive Correspondence Alignment, and Confidence-Aware Structure Localization—it unifies the tracking of diverse categories such as pedestrians, trucks, and buses within a single model. On nuScenes, it outperforms category-specific SOTA methods by 4.37%/5.16% in Success/Precision.
- Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization
-
G-CVO represents point clouds as continuous functions in RKHS, encodes local surface geometry using anisotropic kernels, and solves registration via second-order Gauss-Newton optimization with an approximate Riemannian Hessian on the SE(3) manifold. This achieves correspondence-free registration robust to feature-sparse scenes, running approximately 10× faster than similar first-order RKHS methods.
- Generative Diffusion Priors for 3D Mapping of the Dark Universe
-
This paper transforms the highly ill-posed cosmological inverse problem of "reconstructing the 3D dark matter distribution from weak gravitational lensing observations" into a diffusion model posterior sampling task. It utilizes N-body simulations to construct the Conicus3D light-cone dataset and trains a redshift-conditioned 2D diffusion prior. By coupling this data-driven prior with a differentiable weak lensing forward model using a modified DAPS algorithm, the approach significantly improves 3D/2D reconstruction correlations and power spectrum fidelity compared to Wiener filtering and Neural Ensemble baselines on simulated JWST COSMOS-Web surveys.
- GenMatter: Perceiving Physical Objects with Generative Matter Models
-
GenMatter reformulates the task of "segmenting independently moving objects from motion" as online probabilistic inference under a two-level hierarchical generative model (cluster → particle → 3D point). By inverting this model using parallel block Gibbs sampling, the authors reproduce human perception across random dot kinematograms (RDK), camouflaged rotating objects, and natural RGB videos—scenarios where current CV systems often fail individually—using a single engine that requires no task-specific training and matches supervised trackers.
- GenSplat: Bridging the Generalization Gap in 3DGS Language Comprehension
-
GenSplat decomposes language comprehension in 3DGS scenes into a progressive curriculum of "semantics → instance → free-form text". By utilizing a <SEG> token inferred from an MLLM to query 3D Gaussian features in conjunction with a geometry-aware keyframe selector, the model achieves SOTA performance across scenes and tasks (referring segmentation / VQA / open-vocabulary) without requiring per-scene optimization during inference.
- GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis
-
GeodesicNVS proposes a Probability Density Geodesic Flow Matching (PDG-FM) framework, replacing the stochastic noise-to-data diffusion process with data-to-data deterministic flow matching. By utilizing probability density-based geodesic optimization, the interpolation paths are forced to traverse high-density regions of the data manifold, achieving more geometrically consistent novel view synthesis.
- GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
-
Starting from a single portrait, GeoDiff4D enables a diffusion model to jointly generate portrait frames and corresponding surface normals. These "Images + Normals + Expression Latents" are then fed into a 3D Gaussian reconstruction to distill the implicit 3D geometric priors from the diffusion model into an animatable 4D avatar. This approach significantly outperforms existing methods in identity preservation, expression recovery, and cross-view consistency.
- GeoFree-CoSeg: Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation Without Geometric Alignment
-
GeoFree-CoSeg proposes a new task: "Unsupervised Point Cloud-Image Cross-Modal Co-Segmentation." Using a coarse-to-fine dual-branch framework, the method extracts coarse-grained common semantics from each modality, purifies them via cross-modal semantic graphs into Top-K point-patch correspondences, and finally achieves mutual enhancement. Without any geometric alignment or segmentation annotations, it significantly improves unsupervised SOTA performance on two standard point cloud benchmarks and two new image datasets (e.g., 3D mean IoU on S3DIS is 6 points higher than LogoSP).
- Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation
-
High-order relationships where "a novel class simultaneously associates with multiple known class prototypes" are modeled using hypergraphs. By augmenting each prototype with geometric structural features, the model collaboratively infers semantics for unseen point cloud classes (e.g., bed) based on known classes (e.g., chair/sofa/table), achieving a significant lead in novel class mIoU on SemanticKITTI and SemanticPOSS.
- Geometric-Photometric Event-based 3D Gaussian Ray Tracing
-
GPERT decouples pure event-driven 3DGS rendering into two complementary branches: per-event ray-tracing depth rendering (temporally dense, spatially sparse) for geometric loss, and a single snapshot radiance map rendering (spatially dense, temporally sparse) for photometric loss. By bridging these branches via the "Image of Warped Events" (IWE), it resolves the conflict between precision and time windows inherent in the "render-twice-and-subtract" paradigm, achieving SOTA performance on real event datasets with the fastest training speed and no reliance on pre-trained models or COLMAP initialization.
- Geometry-Aligned and Anomaly-Aware Reconstruction for 3D Anomaly Detection
-
AARD addresses two systematic weaknesses in diffusion-based point cloud anomaly detection: geometric destruction by random noise and blurred details from unified references. It proposes "Geometry Rectification" to align noise with vertex normals and an "Anomaly-Aware Transformer" to route normal references to anomalous regions and input references to normal regions, setting new SOTA results on Real3D-AD (O-AUROC 0.82) and Anomaly-ShapeNet (O-AUROC 0.93).
- Geometry-Aware Cross-Modal Graph Alignment for Referring Segmentation in 3D Gaussian Splatting
-
GeoCGA reformulates the task of "identifying and segmenting target objects in 3DGS scenes using natural language" as a geometry-aware cross-modal graph alignment problem. It expands text into a semantic-spatial graph representing spatial relationships while abstracting Gaussian point clouds into an object-level geometric graph. By aligning these graphs at both node and edge levels and applying multi-view consistency constraints, it achieves relative mIoU improvements of 20.8% / 5.7% / 1.0% on Ref-LERF / LERF-OVS / 3D-OVS respectively, while significantly reducing parameters and FLOPs.
- Geometry-Guided 3D Visual Token Pruning for Video-Language Models
-
When treating a 3D scene as a "multi-view spatial video" for input into a VideoLM, thousands of redundant visual tokens are generated. This paper proposes Geo3DPruner, which utilizes the cross-frame global attention of the VGGT geometry encoder to perform a two-stage pruning process: intra-voxel (to remove multi-view redundancy) and inter-voxel (to preserve spatial diversity). The method prunes 90% of tokens while retaining approximately 92% of the original performance, significantly outperforming general pruning methods like FastV and VisPruner.
- GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
-
GeoSAM2 reformulates part segmentation of textureless 3D models as a "multi-view 2D mask prediction" task: it renders normal and point maps from 12 viewpoints, allows users to provide 2D prompts (clicks/boxes) in any view, predicts masks frame-by-frame using a shared SAM2 backbone with LoRA and geometric residual fusion, and finally back-projects these masks into 3D using visibility-aware voting to achieve class-agnostic SOTA on PartObjaverse-Tiny and PartNetE at a speed of approximately 30 seconds per object.
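The final back-projection step can be pictured as a per-face majority vote restricted to the views that actually see the face. The sketch below assumes per-view label lists and boolean visibility lists indexed by face; these data layouts and the function name `vote_labels` are hypothetical, and the paper's exact voting rule may differ.

```python
from collections import Counter


def vote_labels(view_labels, view_visibility, num_faces, unlabeled=-1):
    """Majority vote per face over the views in which the face is visible."""
    result = []
    for f in range(num_faces):
        votes = Counter(
            labels[f]
            for labels, visible in zip(view_labels, view_visibility)
            if visible[f] and labels[f] != unlabeled
        )
        # Faces never seen (or never labeled) stay unlabeled.
        result.append(votes.most_common(1)[0][0] if votes else unlabeled)
    return result


# Hypothetical example: 3 views, 4 faces, part ids 0..2, -1 means no mask.
view_labels = [[0, 0, 1, -1], [0, 1, 1, -1], [1, 1, 1, 2]]
view_visibility = [
    [True, True, True, False],
    [True, True, False, False],
    [False, True, True, False],
]
face_parts = vote_labels(view_labels, view_visibility, num_faces=4)
```

Ignoring labels from views where a face is occluded keeps a mask that bleeds onto hidden geometry from outvoting the views that see the face directly.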
- GGPT: Geometry-Grounded Point Transformer
-
GGPT obtains geometrically consistent sparse point clouds via an improved lightweight SfM pipeline (dense matching + sparse BA + DLT triangulation), then utilizes 3D Point Transformer V3 to directly fuse sparse geometric guidance with feed-forward dense predictions in 3D space for residual refinement. Trained solely on ScanNet++, it significantly improves various feed-forward 3D reconstruction models across architectures and datasets.
- Ghosts in the Point Clouds: De-glaring LiDAR in the Transient Domain
-
Addressing internal multipath glare in new-generation solid-state single-photon LiDAR—which creates "ghost" objects and obscures real ones—this paper models glare as a linear, scene-independent Glare Spread Function (GSF). The method processes low-level echoes per pixel before point cloud formation: it uses the method of moments to correct photon pileup distortion, predicts glare contributions with the GSF, and applies a binomial likelihood confidence measure to distinguish true signals from glare. It is training-free and deployable on unmodified commercial sensors.
- GHPT: Real-Time Relightable Gaussian Splatting using Hybrid Path Tracing
-
GHPT utilizes a "Gaussian-splatted G-buffer and hardware-accelerated ray tracing on an underlying mesh" hybrid path tracing paradigm, coupled with a three-stage inverse rendering pipeline (reconstructing geometry, decomposing materials and environment light, and finally performing factorized inverse path tracing on Gaussians). This approach marks the first time a 3DGS model has achieved high-quality relighting and real-time (113 FPS) scene composition with soft shadows and indirect lighting on an RTX 4080 at 1920×1080 resolution.
- GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
-
GLINT achieves SOTA geometric and appearance reconstruction of scene-scale transparent surfaces (e.g., glass walls, display cases) by decomposing Gaussian representations into interface, transmission, and reflection components integrated within a hybrid rasterization + ray tracing pipeline.
- Global-Aware Edge Prioritization for Pose Graph Initialization
-
A GNN-based global edge prioritization method is proposed, upgrading pose graph initialization from independent pairwise image retrieval to global structure-aware edge ranking and multi-MST construction, significantly improving SfM reconstruction accuracy in extremely sparse settings.
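One plausible reading of "multi-MST construction" is to extract several edge-disjoint maximum spanning trees from the ranked edges, so that the pose graph keeps redundancy beyond a single tree. The sketch below follows that reading with Kruskal's algorithm; the edge scores stand in for the GNN's predicted priorities, and all names are illustrative assumptions rather than the paper's implementation.

```python
def spanning_forest(num_nodes, edges):
    """Kruskal's algorithm; `edges` must be sorted by descending score."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    chosen = []
    for score, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so keep it
            parent[ru] = rv
            chosen.append((score, u, v))
    return chosen


def multi_mst(num_nodes, scored_edges, num_trees=2):
    """Union of `num_trees` edge-disjoint maximum spanning forests."""
    remaining = sorted(scored_edges, reverse=True)
    selected = []
    for _ in range(num_trees):
        tree = spanning_forest(num_nodes, remaining)
        selected.extend(tree)
        used = set(tree)
        remaining = [e for e in remaining if e not in used]
    return selected


# Hypothetical (score, image_u, image_v) candidates for four images.
candidates = [(0.9, 0, 1), (0.8, 1, 2), (0.7, 2, 3),
              (0.6, 0, 2), (0.5, 1, 3), (0.4, 0, 3)]
first_tree = multi_mst(4, candidates, num_trees=1)
pose_graph = multi_mst(4, candidates, num_trees=2)
```

The first tree guarantees connectivity with the highest-ranked edges; each further tree adds the best remaining edges that still span the images, which gives later optimization loop constraints to work with.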
- Global Structure-from-Motion Meets Feedforward Reconstruction
-
GLUEMAP combines the scalability and global consistency of classical global SfM with the local robustness of feedforward multi-view reconstruction networks (π³). It restricts the feedforward network to local inference using a sparse view graph, integrates tens of thousands of local reconstructions into a global solution via global motion averaging, and enhances bundle adjustment with "virtual tracks." It outperforms both pure classical and pure feedforward methods on five diverse datasets and scales to tens of thousands of images on a single RTX 4090.
- Globally Optimal Pose from Orthographic Silhouettes
-
Given a known 3D template and its unoccluded silhouette in an image, this work models "Pose-from-Silhouette (PfS)" as minimizing the Hausdorff distance between two silhouettes on \(\mathbb{SO}(3)\). By leveraging the overlooked property that "silhouette area changes continuously with rotation," the search space is heavily branched. The resulting method is the first globally optimal PfS solver for arbitrary shapes (regardless of convexity or genus) without requiring point correspondences, achieving an orientation error ~86%–90% lower than the closest baseline on synthetic and real data.
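As a reminder of the objective being minimized, the Hausdorff distance between two finite point sets can be computed by brute force as follows. This sketch assumes silhouettes sampled as 2D point lists, which is an illustrative simplification and not the paper's solver.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical silhouette samples: a unit square and a copy shifted by 0.5.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted = [(x + 0.5, y) for x, y in square]
distance = hausdorff(square, shifted)
```

The symmetric maximum matters: a single stray point in either silhouette raises the distance, which is why the measure penalizes both missing and extra regions.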
- Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves
-
The Glove2Hand framework translates egocentric videos of users wearing sensing gloves into realistic bare-hand videos while preserving tactile and IMU signals. By constructing HandSense, the first multi-modal hand-object interaction dataset, it significantly improves downstream performance for bare-hand contact estimation and occluded hand tracking.
- GM-R²: Generative Matching Learning for Unsupervised Geometric Representation and Registration
-
GM-R² reformulates "learning geometric descriptors" as a proxy task of "generating cross-view images conditioned on geometry": a generator conditioned on the geometric features of two point clouds can synthesize consistent cross-view images only when those features are consistent. It uses this generative consistency as implicit supervision to train a ControlNet encoder, achieving unsupervised registration SOTA on 3DMatch / ScanNet, even surpassing some fully supervised methods.
- GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes
-
The paper proposes GP-4DGS, which integrates Variational Gaussian Processes (GP) into 4D Gaussian Splatting (4DGS). By utilizing spatio-temporal composite kernels and variational inference, it achieves probabilistic motion modeling and equips 4DGS with three new capabilities: uncertainty quantification, motion extrapolation, and adaptive motion priors.
- GS-ASM: 2DGS-Supervised Active Stereo Matching
-
Addressing the accuracy limitations in active stereo matching caused by the lack of ground truth (GT) and reliance on self-supervision, this paper utilizes 2D Gaussian Splatting (2DGS) to reconstruct geometry from real scenes and render high-quality disparity "proxy labels." This transforms unsupervised active stereo networks into "supervised" training, complemented by a hybrid supervision regularization strategy that dynamically balances proxy supervision and self-supervision. The method achieves SOTA performance across multiple backbones, surpassing commercial RealSense D435 depth cameras.
- GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception
-
To address two persistent issues in roadside multi-view camera-LiDAR fusion, namely "feature misalignment caused by calibration errors" and "dominant camera features suppressing LiDAR," GSV2X replaces deterministic projections with 3D Gaussian distributions to "softly" lift pixel features to BEV and employs orthogonal constraints to force the two modalities to learn complementary features. On RCooper, it improves AP from 43.7% (BEVFusion) to 63.4% and shows almost no performance drop under calibration perturbation.
- Guardians of the Hair: Rescuing Soft Boundaries in Depth, Stereo, and Novel Views
-
HairGuard leverages image matting datasets to construct fine-grained depth supervision for soft boundaries (e.g., hair). It employs a "depth fixer + scene painter + color fuser" trio as a plug-and-play solution to correct depth, repair occlusions, and fuse textures, achieving SOTA performance on soft boundary details in monocular depth, stereo conversion, and novel view synthesis.
- H²A²: Homogeneity-Aware and Heterogeneity-Aware Feature Perception for Unified Indoor 3D Object Detection
-
The authors find that basic geometric structures in indoor 3D detection, such as lines, planes, and corners, induce highly consistent offset responses in sparse convolution kernels (homogeneous features) across different scenes, while scene-specific structures produce heterogeneous responses. H²A² uses a structure-aware kernel selection mechanism (SF-KS) to dynamically decide whether to use a "cross-scene shared kernel" or a "scene-exclusive kernel" at each offset position. Combined with a Norm Gradient Harmonization (NGH) algorithm that stabilizes multi-source joint training, it achieves consistent gains of 1 to 7.6 mAP over the strong baseline TR3D on ScanNet/SUN RGB-D/S3DIS.
- HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
-
Addressing the issue where diffusion priors improve image quality but generate non-existent content (hallucination) in sparse-view 3D reconstruction, HAD utilizes a pre-trained feed-forward NVS network (LVSM) as a multi-view encoder paired with a lightweight branch to predict pixel-wise "hallucination score maps." During 3DGS training, high-score (unreliable) pixels are masked, and multi-sampling fusion is employed to further decrease the hallucination ratio. Ultimately, the method achieves SOTA performance with a PSNR improvement of 0.78dB on DL3DV and 0.69dB on MipNeRF360.
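The masking step in the HAD entry above can be illustrated with a minimal sketch. It assumes a per-pixel L1 loss and a fixed cutoff; the function name `masked_l1_loss`, the `threshold` value, and the toy pixel values are all hypothetical and not taken from the paper.

```python
def masked_l1_loss(rendered, target, halluc_score, threshold=0.5):
    """Mean absolute error over pixels whose hallucination score is below threshold.

    Pixels with a score at or above the threshold are treated as unreliable
    and contribute nothing to the loss (assumed masking rule).
    """
    kept = [
        abs(r - t)
        for r, t, s in zip(rendered, target, halluc_score)
        if s < threshold
    ]
    return sum(kept) / len(kept) if kept else 0.0


# Toy 4-pixel example: pixels 1 and 3 have high scores and are masked out.
rendered = [0.2, 0.8, 0.5, 0.9]
target = [0.25, 0.3, 0.5, 0.1]
scores = [0.1, 0.9, 0.2, 0.8]
loss = masked_l1_loss(rendered, target, scores)
```

Only the two low-score pixels enter the average, so large errors on pixels flagged as hallucinated do not pull the reconstruction toward non-existent content.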
- HandDreamer: Zero-Shot Text to 3D Hand Model Generation using Corrective Hand Shape Guidance
-
HandDreamer is the first zero-shot "Text-to-3D Hand Model" method. It utilizes MANO hand models for low-score initialization, employs 2D hand skeletons as ControlNet conditions to compress the number of modes in the probability distribution, and introduces a corrective hand shape (CHS) loss to rectify geometry throughout the SDS process. This enables the generation of view-consistent, highly detailed, and animatable 3D hands without introducing Janus multi-face artifacts.
- Hermite Radial Basis Function for Surface Reconstruction via Differentiable Rendering
-
Classical Hermite Radial Basis Functions (HRBF) are integrated into a differentiable rendering framework. A global implicit field \(F\) is constructed using a set of local RBF basis functions with derivatives. Weights, positions, and scales are optimized end-to-end via multi-view RGB volume rendering. Leveraging BVH-accelerated ray intersection, the method achieves superior Chamfer distances compared to PGSR and Fast Dipole Sums on DTU and BlendedMVS datasets.
- HeroGS: Hierarchical Guidance for Robust 3D Gaussian Splatting under Sparse Views
-
HeroGS decomposes the overfitting problem of 3DGS under sparse views into three levels of hierarchical constraints: image-level (pseudo-dense supervision via frame interpolation), feature-level (adaptive Gaussian addition/deletion based on edges and tiling), and parameter-level (synergistic pruning of geometrically inconsistent Gaussians across multiple fields). It comprehensively outperforms SOTAs like FSGS and DropGaussian on LLFF 2/3/6-view benchmarks.
- Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs
-
Hg-I2P introduces Heterogeneous Graphs (HG) to unify the modeling of relationships between 2D image regions and 3D point cloud regions. By leveraging multi-path adjacency relation mining for cross-modal edge learning, heterogeneous edge-based feature adaptation, and graph-based projection consistency pruning, it achieves state-of-the-art generalization and precision across six indoor and outdoor cross-domain benchmarks.
- Hierarchical Point-Patch Fusion with Adaptive Patch Codebook for 3D Shape Anomaly Detection
-
This paper proposes a hierarchical "point-patch" fusion network that constructs a position-independent normal patch feature codebook using adaptive multi-scale patching. It then injects patch-level priors into point-wise features via RoPE cross-attention to regress anomaly offsets. The method significantly outperforms previous point-wise approaches in detecting large-scale structural defects (planar displacement, angular misalignment) on public benchmarks and self-constructed industrial datasets.
- Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting
-
This paper proposes SplatHLoc, a hierarchical visual relocalization framework based on Feature Gaussian Splatting. It synthesizes virtual views closer to the query via adaptive viewpoint retrieval and utilizes a hybrid feature matching strategy (rendered features for coarse matching and a semi-dense matcher for fine matching), achieving SOTA results on indoor and outdoor relocalization benchmarks.
- High-Fidelity Mobile Avatars with Pruned Local Blendshapes
-
This work pushes the pose-dependent appearance decoding of 3DGS-based full-body digital humans to the extreme using "local linear blendshapes + 90% blendshape pruning." Through end-to-end training (without pre-trained large models), it achieves 2K resolution at 120 FPS in mobile browsers with a model size of only 19.4 MB.
- Homaloidal parametrization for detecting critical two-view configurations
-
This paper proposes a novel quadratic transformation parametrization for detecting general critical surface degeneracies in two-view configurations using the "homaloidal net of conics" from projective geometry. By solving a linear system with 7 image correspondences to fit the quadratic transformation and using an 8th point for verification, the method identifies degeneracies without pre-estimating the fundamental matrix. It achieves higher precision and approximately 200× faster performance than the only comparable method by Luong–Faugeras.
- Human Geometry Distribution for 3D Animation Generation
-
This paper proposes a two-stage generation framework that first compresses per-frame 3D human geometry into a compact latent using an improved "Human Geometry Distribution (HuGeoDis)" and then generates short-term transitions in the latent space via identity-conditioned autoregressive diffusion. This approach synthesizes 3D human sequences with fine-grained clothing wrinkles, natural dynamics, and consistent identity even with scarce 3D animation data (reducing reconstruction Chamfer distance by approximately 90% and improving user study scores by 2.2x).
- Human Interaction-Aware 3D Reconstruction from a Single Image
-
The HUG3D framework is proposed to achieve high-fidelity textured 3D reconstruction of multiple interacting humans from a single image through perspective-orthogonal view transformation, group-instance multi-view diffusion models, and physics-aware geometric reconstruction, significantly outperforming existing methods on metrics such as CD, P2S, and NC.
- HumanBA: Human-Aware Bundle Adjustment via Global Human-Camera Decoupling
-
To address the failure of traditional SLAM in monocular videos where the foreground human occupies most of the frame, HumanBA treats humans not as dynamic distractions to be masked, but as structured landmarks. It uses HMR to estimate human motion and subtracts it from observed trajectories to obtain "pseudo-static" human joint landmarks. These landmarks, adaptively weighted by motion stability, are integrated into Bundle Adjustment (BA). This allows camera poses and global human reconstruction to mutually enhance each other during iteration, reducing trajectory errors for both on EMDB2 / SLOPER4D.
- HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
-
HumanNOVA transfers the Large Reconstruction Model (LRM) paradigm for general objects to the human domain. Utilizing a "dual-modal token conditioning + tri-plane" feed-forward architecture, it reconstructs photorealistic 3D humans from a single image in under 1 second. The study also introduces a scalable data generation pipeline that expands training assets to 100,000 (roughly a 20x increase), achieving a 40%+ relative improvement in LPIPS across three benchmarks.
- Hyper-PCN: Hypergraph-Based Point Cloud Completion via High-Order Correlation Modeling
-
Addressing the issues where Transformers in point cloud completion only model pairwise correlations and fail to reconstruct complex structures in the absence of symmetry priors, Hyper-PCN introduces hypergraphs to incomplete point clouds for the first time. It utilizes a Hypergraph Refinement Stack (HyperRS) with threshold annealing to extract high-order correlations from coarse to fine, and an Anchor-collaborative Hypergraph Neural Network (A-HGNN) to model global many-to-many relationships. Hyper-PCN consistently sets new SOTAs on benchmarks including PCN, ShapeNet-55/34, and MVP.
- HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars
-
HyperGaussians is proposed to extend 3DGS to high-dimensional multivariate Gaussians. It models expression-dependent attribute variations through conditional distributions and utilizes an inverse covariance trick for efficient conditioning. As a plug-and-play module integrated into FlashAvatar and GaussianHeadAvatar, it significantly improves the quality of high-frequency details.
- I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners
-
I-Scene shifts away from using labeled scene datasets to teach models "where to place objects." Instead, it "reprograms" a pre-trained image-to-3D instance generator (TRELLIS) into a scene-level spatial learner. By utilizing Scene Context Attention and a View-centric Space, the model learns to infer spatial relationships such as adjacency, support, and symmetry in a single feed-forward pass. It can generalize to unseen layouts even when trained on non-semantic random scenes, outperforming SOTAs trained on 3D-FRONT.
- ICTPolarReal: A Polarized Reflection and Material Dataset of Real World Objects
-
This work constructs ICTPolarReal, the first large-scale real-world polarized reflection and material dataset. Utilizing a Light Stage system with 8 cameras and 346 light sources, cross- and parallel-polarized captures were performed on 218 daily objects. This yielded over 1.2 million high-resolution images with ground truth diffuse-specular separation, significantly enhancing the performance of inverse rendering, forward relighting, and sparse-view 3D reconstruction.
- IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting
-
IDESplat replaces "single-warp depth estimation" with "multi-cascade warp iterative depth probability boosting." This improves the accuracy of Gaussian center (depth) prediction for feed-forward generalizable 3DGS. On the RE10K dataset, it surpasses DepthSplat by 0.33 dB PSNR with only ~1/10 of the parameters and achieves a significant 2.95 dB gain on the cross-dataset DTU benchmark.
- Illumination-Consistent Human-Scene Reconstruction from Monocular Video
-
This paper jointly reconstructs an animatable human and a static scene from monocular video using 3DGS. The core involves introducing a "light volume" to provide spatially-varying local illumination clues for human PBR and an implicit shadow module to decouple soft shadows cast by the human onto the scene, ensuring human-scene consistency in illumination and shadows while supporting relighting and cross-scene synthesis.
- iLRM: An Iterative Large 3D Reconstruction Model
-
iLRM reformulates feed-forward 3D Gaussian reconstruction from "mapping all image tokens to pixel-aligned Gaussians in a single pass" to "using low-resolution viewpoint embeddings as carriers and iteratively refining them layer-by-layer with multi-view image feedback." By combining representation decoupling and two-stage attention to reduce computational costs, it achieves high quality and speed on RE10K/DL3DV (0.5s for 32-view 540×960 inference, compared to 8 minutes for optimization-based methods).
- Image-to-Point Cloud Feature Back-Projection for Multimodal Training of 3D Semantic Segmentation
-
IPFP proposes a "training-only" image-LiDAR fusion strategy: aggregated image features are back-projected into the 3D physical space based on estimated depths, residing in the same coordinate system as LiDAR features and sharing a single-branch backbone for training. At inference time, the image branch is disabled for pure LiDAR deployment. IPFP consistently improves SOTA segmentation models like PTv3 and SPVCNN on nuScenes/KITTI/Waymo datasets with almost no additional inference cost.
- Inferring Compositional 4D Scenes without Ever Seeing One
-
COM4D reconstructs complete, persistent 4D scenes comprising "multiple static objects + multiple dynamic objects" from a single monocular video. The key lies in decoupling spatial compositional reasoning and single-object temporal dynamics into two distinct attention mechanisms learned from two types of readily available data, then combining them via Attention Mixing during inference—all without ever being exposed to 4D compositional training samples.
- Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting
-
ICO-GS attributes the degradation of sparse-view 3DGS to "loss of intrinsic consistency between geometry and appearance." It first constrains geometry using feature-domain multi-view photometric consistency (enhanced by pixel-wise top-k selection and edge-aware smoothing), then filters reliable depths via cycle-consistency to synthesize virtual views for supervising appearance. It consistently outperforms existing sparse-view baselines on LLFF/DTU/Blender, particularly in textureless regions.
- Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
-
Intrinsic Image Fusion (IIF) distills single-view priors from a 2D diffusion material estimator into multi-view inverse rendering. It uses a parametric distribution to aggregate multiple inconsistent PBR predictions per view into a low-dimensional consistent space, achieves 3D consistent textures via distribution matching, and finally performs inverse path tracing fine-tuning on only a few parameters per object. This significantly outperforms existing inverse rendering methods in material decoupling quality on synthetic and real indoor scenes.
- IR-HGP: Physically-Aware Gaussian Inverse Rendering for High-Illumination Scenes via Generative Priors
-
IR-HGP utilizes three collaborative modules—Hybrid Visibility Decomposition (HVD), Generative Illumination Prior (GIFP), and Physically-Aware Radiance Correction (PARC)—to extend 3DGS inverse rendering to high-illumination and strong specular reflection scenes. It solves the challenge of "baked-in shadows and highlights" in materials, achieving SOTA results (mean PSNR 33.61 on synthetic sets) in relighting and novel view synthesis while maintaining real-time rendering.
- Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation
-
Iris proposes a deterministic diffusion framework that injects real-world priors into diffusion models through a two-stage "Prior-to-Geometric" (PGD) schedule. The first stage utilizes Spectral Gated Distillation (SGD) at high timesteps to extract low-frequency layout priors from a teacher model. The second stage refines high-frequency geometric details using synthetic data at low timesteps, while introducing Spectral Gated Consistency (SGC) for cross-stage high-frequency alignment. It achieves SOTA zero-shot depth estimation performance under limited data and computational budgets.
- Iris: Integrating Language into Diffusion-based Monocular Depth Estimation
-
Iris systematically validates a naive hypothesis: feeding additional text descriptions of scene objects into a diffusion-based monocular depth estimator leverages the "text \(\leftrightarrow\) 3D scene" conditional distribution learned during text-to-image pre-training. This reduces the depth solution space, leading to overall zero-shot accuracy improvements across three diffusion MDEs (Marigold, Lotus, and E2E-FT), particularly for small objects and blurry regions, while also accelerating training and inference convergence.
- iSplat: Iterative Learning for Fine-Grained Gaussian Splatting
-
iSplat transforms feed-forward 3D Gaussian Splatting from "one-shot prediction" into "recurrent iterative refinement via GRU." By leveraging uncertainty-driven depth refinement and region-aware feature enhancement for progressive self-correction, it outperforms the 354M-parameter DepthSplat on RealEstate10K with only 42.6M parameters and improves PSNR by 2.88 dB on the cross-domain DTU dataset.
- JRM: Joint Reconstruction Model for Multiple Objects without Alignment
-
JRM reformulates the reconstruction problem of "the same object being repeatedly observed in a scene" as personalized generation. By using a 3D flow-matching model to implicitly aggregate multiple unaligned observations in the latent space, it jointly reconstructs a group of objects without explicit matching or rigid registration. This approach is more robust to association errors and articulated deformations, outperforming independent reconstruction and alignment-based baselines.
- Kaleidoscopic Scintillation Event Imaging
-
This work reformulates radiation detection as a computer vision problem: using a "kaleidoscope-shaped" (four-sided mirrored pyramid) scintillator, a single scintillation event is captured as a "direct image + multiple mirrored reflections" on a single-photon camera. A Gaussian Mixture Model (GMM), where all components are parameterized by the event's 3D coordinates \(p_0\), is solved via an EM algorithm. In extreme photon-starved conditions (dozens of photons per event), this approach reduces 3D localization error from approximately 0.8 mm to about 0.14 mm.
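One way to see why every mixture component in the kaleidoscope entry above can share the single parameter \(p_0\): the apparent source of each mirrored view is the reflection of \(p_0\) across a mirror plane. The sketch below shows only that reflection. The plane, the point, and the function name `reflect_point` are illustrative assumptions, not the paper's actual scintillator geometry.

```python
def reflect_point(p, plane_point, normal):
    """Reflect 3D point p across the plane through plane_point with the given normal."""
    n_len = sum(c * c for c in normal) ** 0.5
    n = [c / n_len for c in normal]  # unit normal
    # Signed distance from p to the plane along n.
    d = sum((pi - qi) * ni for pi, qi, ni in zip(p, plane_point, n))
    return [pi - 2.0 * d * ni for pi, ni in zip(p, n)]


# Hypothetical event position and a mirror lying in the plane x = 0.
p0 = [1.0, 2.0, 3.0]
mirror_image = reflect_point(p0, [0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
```

Reflecting the image again across the same plane returns the original point, which is a quick sanity check for any mirror model of this kind.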
- KASALv2: Fully Automatic 3D Rotational Symmetry Classification and Axis Localization
-
KASALv2 proposes a fully automatic framework that identifies rotational symmetry types, rotation orders, and all canonical axes of 3D objects at once without any reference geometry. It covers all 8 canonical rotational symmetry types, achieving 94.75% accuracy on 438 symmetric objects from GSO. Feeding the estimated symmetry priors into FoundationPose training improves pose estimation accuracy by up to 0.9% across 5 BOP datasets.
- KV-Tracker: Real-Time Pose Tracking with Transformers
-
KV-Tracker turns offline multi-view geometry large models (π³) into real-time systems: the key-value pairs computed for keyframes during the global attention of the mapping phase are cached as a scene representation. During tracking, only the current frame's queries attend to this cache, reducing per-frame inference complexity from \(O((NM)^2)\) to \(O(M^2(N+1))\). It achieves drift-free 6-DoF camera and zero-prior object tracking at approximately 27 FPS on TUM/7-Scenes/ARCTIC/OnePose.
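The two complexity terms quoted for KV-Tracker can be made concrete by counting query-key pairs. This is a back-of-the-envelope sketch: \(N\) is read as the number of cached keyframes and \(M\) as the tokens per frame, and the specific values below are illustrative assumptions.

```python
def full_attention_pairs(n_keyframes, tokens_per_frame):
    """Query-key pairs when all N*M tokens attend to each other."""
    total_tokens = n_keyframes * tokens_per_frame
    return total_tokens * total_tokens


def cached_query_pairs(n_keyframes, tokens_per_frame):
    """Query-key pairs when one frame's M queries attend to the N cached
    keyframes plus the frame itself, i.e. M * (N + 1) * M."""
    return tokens_per_frame * tokens_per_frame * (n_keyframes + 1)


N, M = 16, 1024  # assumed sizes, for illustration only
full = full_attention_pairs(N, M)
cached = cached_query_pairs(N, M)
speedup = full / cached  # equals N**2 / (N + 1)
```

The ratio grows roughly linearly in the number of keyframes, which is why caching matters more as the map grows.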
- \(L^{2}DGS\): Low-Light Dynamic Gaussian Splatting
-
\(L^{2}DGS\) is the first 4D Gaussian Splatting framework to self-supervise the reconstruction of "bright dynamic scenes" directly from low-light videos. It decomposes the color of each Gaussian into "view- and time-dependent illumination \(\times\) intrinsic scene reflectance." By utilizing OCD-Net to model motion-induced time-varying illumination and a forward degradation pipeline (BAFs + BAFE-Net) to transform bright scenes back into low-light versions for self-supervision, it significantly outperforms existing methods on both synthetic and real-world low-light dynamic data.
- LAM: Language Articulated Object Modelers
-
LAM reformulates "text-to-articulated object generation" as a unified code generation task. A collaborative team of LLM and VLM modules—planning hierarchical structures, writing geometry and articulation code, and performing closed-loop error correction via VLMs—generates geometrically and kinematically correct articulated 3D objects from single sentences. It requires no visual priors or pre-made 3D assets, achieving a joint prediction success rate of 77.1%, significantly outperforming Articulate Anything's 40.3%.
- Landscape-Awareness for Geometric View Diffusion Model
-
Using Zero123 noise-space MSE for two-view camera pose estimation yields a loss landscape riddled with local minima, which forces brute-force multi-initialization. This paper attributes those local minima to geometric symmetry and self-similarity. In the first stage, a score network reshapes update directions toward high-likelihood regions of the ground-truth pose; the second stage refines with frozen Zero123 MSE, significantly improving success rates and sampling efficiency with minimal reliance on multi-initialization.
- LangField4D: Learning Identity-Adaptive and Spatio-Temporal Continuous 4D Language Fields for Dynamic Scenes
-
LangField4D constructs an open-vocabulary language field on 4D Gaussian Splatting. It addresses semantic inconsistency caused by Gaussian drift across object boundaries via "Identity-Adaptive Gaussian Grouping" and replaces discrete state prototypes with a "TetraPlane Continuous Spatio-Temporal Semantic Representation," setting new SOTAs for both time-agnostic and time-sensitive queries in dynamic scenes.
- LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations via 3D Gaussian Splatting
-
A continuous semantic field is constructed on the 3D Gaussian Splatting representation. This method utilizes the Dirichlet Process to automatically discover new classes, compresses semantic features using gradient low-rank constraints, and organizes fragmented candidates into "unseen classes" via graph contrastive loss. This enables robust open-vocabulary 3D segmentation guided by natural language prompts, even under partial observation conditions with sparse or occluded RGB-D views.
- LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency
-
LaS-Comp is proposed as a zero-shot, category-agnostic 3D shape completion framework. By injecting known geometry in the spatial domain through an Explicit Replacement Stage and optimizing boundary consistency via the Implicit Alignment Stage, it bridges the gap between the latent and spatial domains of pretrained 3D foundation models, achieving SOTA performance across various partial observation modes.
- LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
-
This work introduces LASER, a training-free framework that converts offline feed-forward reconstruction models (e.g., VGGT, π³) into streaming systems via Layer-wise Scale Alignment (LSA). It achieves real-time streaming 4D reconstruction of kilometer-level videos at 14 FPS with a 6GB peak memory on an RTX A6000.
- Learning 3D Shape Fidelity Metric from Real-world Distortions
-
This paper proposes LoCaSE, a learnable 3D shape fidelity metric. It captures details using local attention on mesh topology and mitigates model bias through LoRA-style pre-training and fine-tuning. Accompanied by the RSF dataset featuring real-world distortions and human annotations, the metric aligns significantly closer to human perception than geometric metrics like Chamfer Distance.
- Learning Compact 3D Representations from Feed-Forward Novel View Synthesis
-
C3G utilizes a small set of learnable query tokens to "discover and decode" approximately 2K compact 3D Gaussians placed at key spatial locations from unposed multi-view images using self-attention. Compared to pixel-wise methods, it maintains comparable novel view synthesis quality with ~65× fewer Gaussians. It further reuses emergent attention maps from the query decoder to lift arbitrary 2D features to 3D without additional training, significantly enhancing tasks like 3D open-vocabulary segmentation while reducing memory consumption and increasing rendering speed.
- ECKConv: Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant Point Cloud Analysis
-
ECKConv is proposed to define convolutional kernels on the double coset space \(\text{SO(2)}\backslash\text{SE(3)}/\text{SO(2)}\) within the intertwiner framework. By explicitly parameterizing kernel functions via coordinate networks, it achieves both continuous SE(3) equivariance and large-scale scalability for the first time, validated across classification, registration, and segmentation tasks.
- Learning Differentiable Hierarchies in 3D Gaussian Splatting
-
The authors append a learnable "level scalar" to each Gaussian and utilize a differentiable decreasing step function (DDSF) to simultaneously optimize full-model rendering and hierarchy ordering in a single-stage training. This allows 3DGS to perform LoD rendering and pruning for any number of Gaussians without multi-stage training, with a training overhead of only ~10% compared to standard 3DGS.
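To illustrate how a per-Gaussian "level scalar" can drive differentiable level-of-detail selection, the sketch below uses a sigmoid as a smooth, decreasing stand-in for a step function. The exact form of the paper's DDSF is not given in the summary above, so the function `soft_decreasing_step`, the `cutoff`, and the `sharpness` parameter are assumptions.

```python
import math


def soft_decreasing_step(level, cutoff, sharpness=10.0):
    """Smooth surrogate for a decreasing step: close to 1 when level is well
    below cutoff, close to 0 when well above. A sigmoid is used here only as
    an assumed stand-in for the paper's DDSF."""
    return 1.0 / (1.0 + math.exp(sharpness * (level - cutoff)))


# Hypothetical level scalars for four Gaussians; lower level = kept earlier.
levels = [0.1, 0.4, 0.6, 0.9]
weights = [soft_decreasing_step(level, cutoff=0.5) for level in levels]
```

Because the weight varies smoothly with `level`, gradients from a rendering loss can flow back into the level scalars, which is the property a hard threshold lacks.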
- Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos
-
This paper proposes to explicitly model the continuous position and orientation deformation trajectories of dynamic Gaussians through adaptive SE(3) B-spline motion bases. Combined with a soft segment reconstruction strategy and multi-view diffusion model priors, it achieves high-quality novel view synthesis of dynamic scenes from monocular videos, outperforming existing methods on iPhone and NVIDIA datasets.
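A continuous trajectory from B-spline bases can be sketched for the translation part alone. The example below evaluates a uniform cubic B-spline segment from four control points. The orientation component of SE(3) and the paper's adaptive placement of bases are not modeled, and the names are hypothetical.

```python
def cubic_bspline_basis(u):
    """Uniform cubic B-spline blending weights for local parameter u in [0, 1]."""
    return [
        (1 - u) ** 3 / 6.0,
        (3 * u**3 - 6 * u**2 + 4) / 6.0,
        (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0,
        u**3 / 6.0,
    ]


def evaluate_position(control_points, u):
    """Blend four control points into one position on the spline segment."""
    weights = cubic_bspline_basis(u)
    dim = len(control_points[0])
    return [
        sum(weights[i] * control_points[i][d] for i in range(4))
        for d in range(dim)
    ]


# Hypothetical control points for one Gaussian's center over time.
ctrl = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [3.0, 1.0, 0.0]]
mid_position = evaluate_position(ctrl, 0.5)
```

The four weights always sum to one, so the trajectory stays inside the convex hull of its control points and can be queried at any continuous time.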
- Learning Hierarchical Hyperbolic Mixture Model for Part-aware 3D Generation
-
This paper embeds the hierarchical semantics of 3D object parts into hyperbolic space. It proposes the Hierarchical Hyperbolic Mixture Model (H2MM), a geodesic diffusion process that decouples radial and angular noise, and a high-order Riemannian ODE solver that preserves manifold geometry. The method achieves state-of-the-art results in quality (FID/KID) and speed for unconditional, category-conditional, and multimodal 3D generation.
- Learning Multi-View Spatial Reasoning from Cross-View Relations
-
XVR (Cross-View Relations) constructs a large-scale multi-view Visual Question Answering (VQA) dataset with 100,000 samples. By explicitly training VLMs on three categories of tasks—correspondence, geometric verification, and viewpoint localization—it significantly enhances cross-view spatial reasoning capabilities, achieving notable improvements across multi-view benchmarks and robotic manipulation tasks.
- Learning Scene Coordinate Reconstruction from Unposed Images via Pose Graph Optimization
-
This work introduces Pose Graph Optimization (PGO) on top of the unsupervised scene coordinate regression framework ACE-Zero. It automatically constructs edges using predicted scene coordinates and estimates confidence for each edge via a dual geometric prior (epipolar + optical flow) for weighted global optimization. This pulls locally refined, drift-prone camera poses into global consistency, matching or exceeding COLMAP in PSNR while compressing reconstruction time from 38h to 30min.
- Learning Spatial-Temporal Consistency for 3D Semantic Scene Completion
-
ConSSC lifts historical RGB frames into a unified 3D occupancy space, utilizing "Hierarchical Voxel Refinement" for geometric completion and "Temporal Semantic Aggregation" for semantic completion. Without any additional sensors, it establishes a new SOTA for camera-only semantic scene completion on SemanticKITTI and KITTI-360 (IoU 48.17 / 48.79, mIoU 19.20 / 20.85).
- Learning to Infer Parameterized Representations of Plants from 3D Scans
-
This paper uses a Recursive Neural Network (RvNN) to learn a "shape space" for plants, directly inferring unordered 3D point clouds into a parameterized L-String (binary axial tree). This approach simultaneously outputs the branching topology of the plant and the geometric parameters of each organ. Trained entirely on synthetic data generated by procedural models, the method generalizes to real scans and provides unified support for three phenotyping tasks: 3D reconstruction, skeleton extraction, and organ segmentation.
- Learning to Solve PDEs on Neural Shape Representations
-
This paper learns the "normal extension" step—the most critical component of the classic Closest Point Method (CPM)—using a lightweight, geometrically-conditioned neural operator. This enables solving surface PDEs directly on neural surface representations (SNS / SDF / Occupancy Fields / Point Clouds / Gaussian Splatting) without mesh extraction or per-instance optimization. The pipeline is fully differentiable and, after being trained once on a single example shape (Spike), generalizes to unseen shapes, topologies, and input functions with accuracy comparable to CPM.
- Lens Component Deletion based on Differentiable Ray Tracing
-
To meet the miniaturization and cost-reduction needs of micro-optical lenses, an "automated lens deletion" pipeline is proposed. It uses a contribution metric to automatically identify the least significant lens in a system, applies a deletion loss to gradually flatten and thin it until safe removal, and employs a differentiable PSF estimation based on Rayleigh-Sommerfeld diffraction theory. This allows for joint optimization of the simplified lens and a post-processing restoration network, maintaining imaging quality comparable to the original system even after removing a lens component.
- Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation
-
This work proposes the Physics-Guided Score Distillation framework, which utilizes Material Point Method (MPM) simulations as motion priors to guide Video-SDS optimization. This approach generates dynamic weather effects (snow, rain, fog, sandstorms) with physically plausible motion and realistic appearance within static 3DGS scenes.
- LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving
-
DriveMVS injects sparse LiDAR as "geometric prompts" into Multi-View Stereo (MVS): serving both as hard constraints to anchor the absolute scale of the cost volume and as soft features fused via a Triple-Cue Combiner with monocular and geometric priors. A spatio-temporal decoder ensures cross-frame consistency, enabling the model to achieve metric accuracy, temporal stability, and generalization under zero-shot cross-domain settings (KITTI MAE 0.49 m, AbsRel 2.56%).
- Lifting Unlabeled Internet-level Data for 3D Scene Understanding
-
This work presents SceneVerse++, an automated data engine that generates 3D scene understanding training data from 6,687 unlabeled internet videos. It demonstrates the feasibility of advancing 3D scene understanding using internet-level data across three tasks: 3D object detection (+20.6 [email protected]), spatial VQA (+14.9%), and vision-language navigation (+14% SR).
- Lighting-grounded Video Generation with Renderer-based Agent Reasoning
-
LiVER proposes a lighting-driven video generation framework that utilizes a renderer agent to convert text descriptions into explicit 3D scene proxies (including layout, lighting, and camera trajectories). By employing physical rendering to generate diffuse/glossy/rough GGX scene proxies and injecting them into a video diffusion model, the approach achieves physically accurate lighting effects and precise scene control.
- LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds
-
LightSplat proposes a fast and memory-efficient training-free framework that achieves open-vocabulary 3D scene understanding with a 50-400x speedup and 64x reduced memory compared to existing SOTA. It achieves this by assigning compact 2-byte semantic indices to 3D Gaussians (instead of high-dimensional CLIP features), coupled with a lightweight index-feature mapping and single-step 3D clustering.
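The idea of storing a compact index per Gaussian instead of a full feature vector can be sketched with a shared lookup table. The tiny 3-dimensional codebook and the names below are illustrative assumptions; real CLIP features are far higher-dimensional.

```python
from array import array

# Hypothetical shared table: one feature vector per semantic cluster.
codebook = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# One unsigned 16-bit index per Gaussian (type code "H" is typically 2 bytes).
gaussian_index = array("H", [0, 2, 2, 1])


def feature_of(gaussian_id):
    """Recover a Gaussian's semantic feature through the index-to-feature map."""
    return codebook[gaussian_index[gaussian_id]]


def best_gaussians(query):
    """Return ids of Gaussians whose feature has the highest dot product with query."""
    scores = [sum(q * f for q, f in zip(query, row)) for row in codebook]
    best_cluster = scores.index(max(scores))
    return [i for i, c in enumerate(gaussian_index) if c == best_cluster]


matches = best_gaussians([0.1, 0.2, 0.9])
```

A text query is scored against the small table once rather than against every Gaussian, so per-Gaussian storage stays at one index regardless of feature dimension.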
- Linear Fundamental Matrix Estimation from 7 or 5 Points
-
This paper provides an elementary geometric explanation for the phenomenon that "7-point fundamental matrix estimation is uniquely solvable under a specific point-line configuration (V-Umlaut: 5 points falling on two lines)" and proposes the first linear solver for this minimal problem. By treating it as a 5-point solver (supplemented by two virtual midpoints) and integrating it with Early Non-Minimal Refitting, the method achieves accuracy comparable to state-of-the-art (SOTA) 5-point methods while being several times faster in RANSAC.
- Lite Any Stereo: Efficient Zero-Shot Stereo Matching
-
Lite Any Stereo combines a hybrid 2D-3D cost aggregation module with a three-stage, million-scale training strategy (supervised → self-distillation → real-world knowledge distillation). Using less than 1% of the computation of accurate SOTA methods (33G MACs), it ranks 1st on four real-world benchmarks, demonstrating for the first time that ultra-lightweight models can possess strong zero-shot generalization capabilities.
- LitePT: Lighter Yet Stronger Point Transformer
-
LitePT proposes a hierarchical hybrid architecture that utilizes sparse convolutions in shallow layers and attention in deep layers, based on an in-depth analysis of their respective roles across U-Net levels. By introducing the parameter-free PointROPE positional encoding, LitePT achieves 3.6x fewer parameters, 2x faster speed, and 2x memory savings compared to Point Transformer V3, while matching or exceeding its performance across multiple point cloud benchmarks.
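A parameter-free rotary encoding over 3D coordinates might look like the sketch below. It assumes that the feature vector is split into three equal chunks, one per axis, and that each chunk is rotated as in standard rotary position embeddings; the summary does not give PointROPE's exact formulation, so `rope_3d`, `rotate_pairs`, and the `base` value are hypothetical.

```python
import math

def rotate_pairs(vec, pos, base=100.0):
    """Rotate consecutive feature pairs by angles proportional to `pos`."""
    out = []
    n_pairs = len(vec) // 2
    for k in range(n_pairs):
        theta = pos * base ** (-k / n_pairs)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * k], vec[2 * k + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def rope_3d(vec, xyz):
    """Apply a rotary encoding per axis; feature length must be a multiple of 6."""
    d = len(vec) // 3
    assert 3 * d == len(vec) and d % 2 == 0
    out = []
    for axis in range(3):
        out += rotate_pairs(vec[axis * d:(axis + 1) * d], xyz[axis])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is only rotated, the query-key dot product depends on the relative offset between two points, not on their absolute positions, and no learned parameters are involved.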
- LiteSense: Lifting Lightweight ToF with RGB for High-Resolution Metric Depth Estimation
-
LiteSense fuses Compact Normalized Histograms (CNH) from multi-zone ToF sensors with RGB images using patch-wise cross-attention within a U-Net. With only 5.5M parameters, it approaches the performance of SOTA large models in indoor metric depth estimation and significantly outperforms the comparable RGB-ToF method DELTAR.
- LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
-
LocateAnything3D reformulates monocular multi-object 3D detection as a next-token prediction task for VLMs—first by having the decoder output 2D boxes as a "visual Chain-of-Sight," and then by solving 3D boxes following a curriculum of near-to-far and center→size→rotation. Without any specialized 3D heads, it increases the \(AP_{3D}\) on Omni3D from 24.92 to 38.90.
- LoG3D: Ultra-High-Resolution 3D Shape Modeling via Local-to-Global Partitioning
-
LoG3D partitions high-resolution Unsigned Distance Fields (UDF) into uniform sub-voxel blocks called UBlocks. It employs a hybrid VAE with "local 3D convolution + global sparse Transformer" for block-wise encoding and decoding, combined with a Pad-Average strategy to eliminate boundary seams. This pushes the reconstruction resolution of 3D VAEs to \(2048^3\) for the first time, achieving SOTA in both reconstruction accuracy and generation quality.
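The idea of padding blocks and averaging their overlaps can be shown in one dimension. This is a simplified sketch under the assumption that Pad-Average means "process each block with extra context, then average wherever outputs overlap"; the paper's 3D procedure may differ, and `pad_average` is a hypothetical name.

```python
def pad_average(signal, block, pad, process):
    """Process overlapping padded blocks and average the overlapping outputs."""
    n = len(signal)
    total = [0.0] * n
    count = [0] * n
    for start in range(0, n, block):
        lo = max(0, start - pad)           # extend the block to the left
        hi = min(n, start + block + pad)   # and to the right
        out = process(signal[lo:hi])
        for i, v in zip(range(lo, hi), out):
            total[i] += v
            count[i] += 1
    # Every position is covered by at least its own block, so count >= 1.
    return [t / c for t, c in zip(total, count)]
```

With an identity `process`, the output equals the input, which is a quick sanity check that the averaging itself introduces no seams.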
- Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception
-
Long-SCOPE proposes a fully sparse long-range cooperative 3D perception framework. By utilizing geometric-guided query generation and context-aware association modules, it achieves SOTA performance in 100-150m long-range scenarios while maintaining efficient computational and communication costs.
- LongStream: Long-Sequence Streaming Autoregressive Visual Geometry
-
LongStream is proposed as a gauge-decoupled streaming visual geometry model. By utilizing keyframe-relative pose prediction, orthogonal scale learning, and cache-consistent training, it achieves stable metric-scale scene reconstruction for thousand-frame sequences in real-time (18 FPS).
- LoST: Level of Semantics Tokenization for 3D Shapes
-
This work proposes Level-of-Semantics Tokenization (LoST), which sorts 3D shape tokens by semantic significance. This allows short prefixes to decode into complete and semantically plausible shapes. Combined with the RIDA semantic alignment loss and GPT-style autoregressive generation, the method significantly outperforms existing 3D AR approaches using only 128 tokens compared to tens of thousands.
- Low-Rank Test-Time Training for Pre-Trained Point Cloud Models
-
This paper proposes LoTT-PC, a lightweight test-time training framework for pre-trained point cloud models. By replacing full-parameter fine-tuning with LoRA-style Low-Rank Modulation Units and substituting reconstruction auxiliary heads with decoder-free "Masked Feature Alignment," it outperforms the SOTA by approximately 2.7% on average across three point cloud corruption benchmarks using single-step online updates.
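The low-rank update at the heart of LoRA-style adaptation can be sketched as follows. This shows the generic LoRA form, y = Wx + s·B(Ax), with the frozen weight `w` and trainable factors `a` and `b`; how LoTT-PC's modulation units are actually wired is not stated in the summary, so treat the function as illustrative.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w, a, b, x, scale=1.0):
    """Compute W x + scale * B (A x); only A and B would be updated online."""
    base = matvec(w, x)
    low_rank = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, low_rank)]
```

Initialising `b` to zeros leaves the pre-trained output unchanged, so adaptation starts from the original model.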
- LumiMotion: Improving Gaussian Relighting with Scene Dynamics
-
LumiMotion is the first Gaussian-based method to utilize scene dynamics (moving regions) as supervision signals to improve inverse rendering. By implementing motion-static separation and leveraging motion-revealed material changes, it achieves better decoupling of lighting and material, resulting in a 23% improvement in albedo LPIPS and a 15% improvement in relighting.
- Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels
-
The Lumosaic system is proposed for active hyperspectral video, synchronizing a 12-narrowband LED array with a Coded-Exposure Pixel (CEP) camera at microsecond precision. By jointly encoding spatial-temporal-spectral information across 158 sub-frames per frame, it achieves motion-robust reconstruction of 31-channel (400–700nm) hyperspectral video at 30fps VGA resolution, surpassing passive snapshot systems by over 10dB in PSNR.
- LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes
-
LuxRemix utilizes a generative single-image lighting decomposition model to break down complex indoor illumination into "One-Light-At-a-Time" (OLAT) components. These results are consistently propagated across all viewpoints via multi-view lighting harmonization and encoded into a relightable 3D Gaussian Splatting representation, enabling users to independently toggle, recolor, or adjust the brightness of each light source in real-time from any perspective.
- M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
-
The authors constructed M3DLayout, a multi-source large-scale 3D indoor layout dataset (21,367 layouts, 433k+ object instances), integrating real scans, professional designs, and procedural generation. Complemented by structured text descriptions, it provides a high-quality foundation for text-driven 3D scene generation.
- MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
-
The MAGICIAN framework is proposed, which utilizes a pre-trained occupancy network to generate "Imagined Gaussians" for efficient surface coverage gain estimation. Combined with beam search, it achieves long-term trajectory planning in active mapping, reaching SOTA status in both indoor and outdoor scenes with over a 10% increase in coverage.
- MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
-
MajutsuCity utilizes a four-stage pipeline—"Text → Scene Design → Layout/Heightmap → Assets & Materials → Scene Assembly"—to transform natural language directly into explicit 3D cities with structural consistency, adjustable styles, and object-level editability. It introduces the MajutsuDataset, the MajutsuAgent editing agent, and a set of VLM evaluation metrics (AQS/RDR), achieving an 83.7% reduction in layout FID compared to CityDreamer and a 20.1% reduction compared to CityCraft.
- Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding
-
This paper proposes the SADG framework, the first to introduce Mamba into in-context learning for multi-task point cloud domain generalization. Through structure-aware serialization (Centroid Distance Spectrogram + Geodesic Curvature Spectrogram), hierarchical domain-aware modeling, and spectral graph alignment, it outperforms SOTA across reconstruction, denoising, and registration tasks.
- ManifoldNeuS: Manifold-aware View Optimizability for Pose-Free Neural Surface Reconstruction
-
ManifoldNeuS identifies that "treating all views uniformly" in pose-free neural surface reconstruction leads to "easy-view bias" (where easily optimized views dominate gradients while critical but difficult views are marginalized). It proposes MaVOS, a score jointly measuring "immediate fitness + long-term coverage gain" on the view manifold. This score drives a tripartite system—dynamic view scheduling, gated positional encoding, and inverse-score loss weighting—reducing pose errors on DTU from hundreds of degrees (COLMAP-free baseline) to the \(0.6^\circ\) level, with reconstruction quality approaching NeuS trained with COLMAP ground truth poses.
- MANSION: Multi-floor Language-to-3D Scene Generation for Long-horizon Tasks
-
MANSION utilizes a "hierarchical multi-agent MLLM + geometrically constrained growth solver" to transform a single natural language instruction into a complete multi-floor building directly executable in simulators. By treating vertical alignment as a hard constraint, the authors release the MansionWorld dataset containing 1000+ buildings and cross-floor task editing agents specifically designed to stress-test the long-horizon cross-floor planning capabilities of embodied agents.
- MARCO: Navigating the Unseen Space of Semantic Correspondence
-
MARCO is a semantic correspondence model built on a single DINOv2 backbone. It progressively improves spatial precision through a coarse-to-fine Gaussian RBF loss and expands sparse keypoint supervision into dense pseudo-correspondence labels using a self-distillation framework. MARCO achieves SOTA performance on standard benchmarks and unseen keypoints/categories while being \(3\times\) smaller and \(10\times\) faster than dual-encoder methods.
- Mark4D: Temporally-Consistent Watermarking for 4D Gaussian Splatting
-
Mark4D is the first watermarking method specifically designed for dynamic 4D Gaussian Splatting (4DGS). Utilizing a trio of "X-CLIP video-text latent space decoder + offsets along Gaussian motion trajectories + motion-adaptive loss weighting," it embeds invisible, distortion-resistant, and temporally consistent watermarks into dynamic scenes. It significantly outperforms baselines that directly adapt 3DGS watermarking to 4D in both visual fidelity and bit accuracy.
- Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
-
This work identifies two fundamental conflicts (order bias and instruction isolation) between causal masks in LLM decoders and 3D scene understanding. It proposes the 3D-SLIM masking strategy (Geometry-adaptive Mask + Instruction-aware Mask) to replace the causal mask, achieving significant improvements across multiple 3D scene-language tasks without architectural modifications or additional parameters.
- MatE: Material Extraction from Single-Image via Geometric Prior
-
MatE employs a coarse-to-fine framework with "coarse rectification via depth geometric prior + refinement via dual-branch diffusion" to extract four tileable PBR material maps (albedo / normal / roughness / height) in parallel from a specified region in a single real-world image. This approach overcomes the limitations of existing methods, such as viewpoint overfitting in LoRA-based methods and sequential error accumulation in video DiT-based methods.
- Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes
-
Addressing the challenge where "repetitive but geometrically distinct components should share the same material" in untextured meshes, this work proposes Material Magic Wand. By utilizing a part encoder that learns material-aware embeddings, each 3D part is encoded into a vector. Clicking a single part allows for automatic selection of all parts with the same material via nearest neighbor retrieval. On a self-constructed benchmark of 100 shapes and 241 queries, it outperforms the strongest baseline by 8.6% in retrieval AUC and 16.6% in grouping F1.
- MatLat: Material Latent Space for PBR Texture Generation
-
MatLat learns a "Material Latent Space" (MatVAE) by fine-tuning a pretrained VAE to accommodate five channels (RGB albedo, roughness, metallic) while minimizing deviation from the original latent distribution. Combined with "Correspondence-Aware Attention + Locality Regularization" to ensure multi-view consistency, it generates high-quality, relightable PBR textures for given 3D meshes.
- MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
-
MatSpray "sprays" PBR maps (basecolor/roughness/metallic) estimated per view by arbitrary 2D diffusion material predictors onto 3D Gaussian geometry via Gaussian ray tracing. It then uses a softmax neural merger for cross-view fusion combined with PBR rendering loss supervision, yielding de-lighted, multi-view consistent, and relightable 3D material assets with a reconstruction speed approximately 3.5× faster than IRGS.
- MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation
-
Addressing the "unrecoverable scale in monocular metric depth when camera intrinsics are unavailable during both training and inference," this paper observes that while RGB remains nearly invariant when focal length and scene depth change in a coupled manner, the spectral statistics of edges shift systematically. The authors propose the Spectral Quantile Estimator (SQE) to extract a scalar score as a scale proxy from the Fourier spectrum of predicted edge maps to calibrate depth. MD2E achieves SOTA in Monocular Metric Depth Estimation (MMDE) across 6 unseen benchmarks under zero-shot and fine-tuning settings (e.g., A.Rel decreased by 53.0% and RMS by 41.9% on iBIMS-1).
- MERG3R: A Divide-and-Conquer Approach to Large-Scale Neural Visual Geometry
-
MERG3R is a training-free divide-and-conquer framework that sorts thousands of unordered images, partitions them into overlapping subsets for reconstruction using geometric foundation models such as VGGT or π³, and finally merges them into a globally consistent point cloud through global alignment and confidence-weighted bundle adjustment. This enables feed-forward reconstruction models, originally limited by VRAM, to handle image sets far exceeding their native capacity.
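The partition step can be sketched as a sliding window over the sorted image list. This assumes fixed chunk size and fixed overlap; the summary does not specify MERG3R's actual partitioning rule, and `overlapping_chunks` is a hypothetical helper.

```python
def overlapping_chunks(items, size, overlap):
    """Split an ordered list into chunks that share `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(items[start:start + size])
        if start + size >= len(items):
            break
        start += step
    return chunks
```

The shared items between neighbouring chunks are what a later alignment stage could use to register the per-chunk reconstructions into a common frame.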
- Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation
-
Mesh-Pro is proposed as the first asynchronous online reinforcement learning framework for 3D quadrilateral mesh generation. Its core algorithm, ARPO (Advantage-guided Ranking Preference Optimization), combines the Plackett-Luce ranking model with advantage function weighting. This approach achieves simultaneous improvements in efficiency (3.75x faster than offline DPO) and generalization, reaching SOTA generation quality for both artist-style and dense meshes.
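For reference, the Plackett-Luce likelihood of a ranking can be written in a few lines. The optional `weights` argument is a hypothetical stand-in for advantage-based weighting; the exact ARPO objective is not given in the summary.

```python
import math

def plackett_luce_nll(scores_in_rank_order, weights=None):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    scores_in_rank_order[0] is the score of the top-ranked item.
    """
    n = len(scores_in_rank_order)
    if weights is None:
        weights = [1.0] * n
    nll = 0.0
    for i in range(n - 1):  # the last remaining item is chosen with probability 1
        tail = scores_in_rank_order[i:]
        m = max(tail)
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        nll -= weights[i] * (scores_in_rank_order[i] - log_z)
    return nll
```

With two items the expression reduces to the pairwise logistic loss used in DPO-style objectives, which is why ranking losses of this family are a natural generalisation of pairwise preferences.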
- MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer
-
MeshFlow employs a MeshVAE that encodes vertex positions, normals, and "discrete connectivity" entirely into a continuous latent space. Combined with a Rectified Flow diffusion Transformer, it generates all vertices and edges in parallel, producing artist-level triangular meshes in approximately 1 second. This is about 18x faster than the fastest autoregressive generators and avoids quantization errors.
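The rectified-flow formulation mentioned above uses straight-line paths between noise and data. The sketch below shows the standard interpolation, velocity target, and Euler sampler on plain lists; the `velocity_fn` argument stands in for a trained network and is an assumption, not MeshFlow's model.

```python
def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity network: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps=4):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

When the learned velocity is close to constant along each path, very few Euler steps are needed, which is consistent with the fast sampling such models report.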
- MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation
-
The authors transform autoregressive mesh generation from "coordinate-by-coordinate prediction" to "vertex-by-vertex weaving." By utilizing a multi-level sparse-voxel encoder to inject local geometry into the generation process across three levels—representation, prediction, and constraint—the method achieves an 18% tokenization compression rate, enables the generation of meshes with up to 16K faces, and significantly enhances geometric fidelity.
- Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
-
The proposed BrainCoDec framework achieves fMRI visual decoding that generalizes to new subjects without fine-tuning through two-stage hierarchical in-context learning (estimating encoder parameters for each voxel first, then performing functional inversion via cross-voxel aggregation). It improves Top-1 retrieval accuracy from MindEye2's 3.9% to 22.7%.
- MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images
-
MetricHMSR simultaneously recovers human SMPL meshes and 3D scenes with real physical scales (metric) from a single monocular image. The core involves explicitly encoding camera intrinsics and cropping information into the network using a "boundary camera ray map," decoupling local pose from global position via HumanMoE, and calibrating monocular depth using the recovered metric human as a geometric anchor to achieve SOTA in both human mesh recovery and metric human-scene reconstruction tasks.
- MHopReg: Efficient Hierarchical Multi-Hop Graph Search for Point Cloud Registration
-
MHopReg reformulates correspondence-based outlier removal for point cloud registration as a "hierarchical multi-hop graph search": it first predicts correspondence confidence using SE(3) equivariant graph encoding, then ensures coverage of fragmented inliers via cluster-balanced seed sampling, expands inliers layer-by-layer from seeds along the compatibility graph, and finally selects the optimal transformation using distribution-aware ranking that considers both geometric consistency and spatial coverage. It achieves a balance between accuracy and efficiency in low-overlap and large-scale scenarios.
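The compatibility graph and hop-wise expansion can be sketched as below. It assumes the common rigid-motion test (two correspondences are compatible when the distance between their source points matches the distance between their target points) and a plain breadth-first expansion; the learned confidence, seed sampling, and ranking stages are omitted, and both function names are hypothetical.

```python
import math
from collections import deque

def compatibility_graph(src, dst, tau=0.05):
    """Connect correspondences i, j whose pairwise distances agree within tau."""
    n = len(src)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(math.dist(src[i], src[j]) - math.dist(dst[i], dst[j])) < tau:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def multi_hop_expand(adj, seed, hops):
    """Collect every node reachable from `seed` within `hops` edges."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen
```

An outlier correspondence violates the distance test against the inliers, so it stays disconnected and is never reached from an inlier seed.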
- MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer
-
This paper proposes MimiCAT, a cascaded Transformer framework that learns flexible many-to-many soft correspondences via semantic keypoint labels. Combined with PokeAnimDB, a million-scale multi-category motion dataset, it achieves high-quality 3D pose transfer across categories (e.g., humanoid to quadruped/bird) for the first time.
- Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks
-
This paper proposes the dCAP framework, which achieves real-time 6-DoF relative pose estimation between the tractor and trailer in articulated autonomous trucks via Transformer-based cross-view and temporal attention mechanisms. It is integrated into BEVFormer to enhance 3D object detection performance under articulated motion (translation error 0.452m, rotation error 0.042 rad).
- Minimal Constraint Relaxation for Multiview Autocalibration
-
Addressing the long-standing issue where the three-view Kruppa autocalibration equations are over-constrained (45 equations for 5 unknowns), leading to no solution or ill-conditioned results, this paper proposes a "minimal relaxation" framework. By systematically retaining only a subset of equations and using symbolic computation with numerical homotopy continuation to exhaustively search all subsets yielding finite solutions, the authors identify a unique feasible \((1,2,2)\) selection pattern. An offline "Global-Best" relaxation is then selected based on Jacobian condition numbers. This approach proves more robust and accurate than classic Kruppa formulas and recent branch-and-bound methods on both synthetic and real data.
- Mirror Illusion Art
-
This paper proposes AutoMIA: given two 2D target images ("front view" and "mirror reflection"), it automatically optimizes a 3D-printable voxel model that satisfies both shape and color constraints. This allows the same object to present two seemingly completely different patterns before and after the mirror. The design is completed in approximately 76 seconds with 2.6 GB VRAM on a single RTX 3090.
- mmWaveFlow: Unified Enhancement and Generation of mmWave Human Point Clouds
-
The tasks of "densifying sparse mmWave point clouds" and "generating mmWave point clouds from dense ones" are unified as a single reversible transport between dense and sparse distributions. By learning this transport path via flow matching, and addressing the challenges of asymmetric distributions and path crossing through a cross-modal latent space alignment and an origin-aware module, a single model achieves SOTA performance in both enhancement and generation tasks across three datasets.
- MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
-
Given a monocular video and an arbitrary 3D skeletal asset (human/animal/robot/toy) as a prompt, MoCapAnything first predicts per-joint 3D trajectories and then solves for the asset's specific skeleton rotation (e.g., BVH) using constraint-aware Inverse Kinematics (IK). This achieves unified motion capture and cross-species retargeting across heterogeneous skeletons, reducing the MPJPE of unseen species from 7.42cm to 1.76cm on Truebones Zoo.
- Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamics
-
This work proposes an EEG-conditioned fMRI reconstruction framework based on a Diffusion Transformer. By modeling brain activity as a sequence of spatiotemporal neural frames rather than independent snapshots, it achieves spatio-temporally consistent fMRI reconstruction at cortical vertex-level resolution. Furthermore, it supports intermediate frame interpolation via null space sampling, with functional information preservation validated through downstream visual decoding tasks.
- MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
-
This paper is the first to define and address the problem of monocular 3D object detection with sparsely annotated labels, proposing Road-Aware Patch Augmentation (RAPA) and Prototype-based Filtering (PBF) modules. It significantly outperforms existing 2D SAOD methods under the KITTI 30% annotation setting (AP3D Easy: 21.28 vs 17.14).
- MonoVLM: Monocular 3D Visual Grounding with Vision Language Models
-
MonoVLM utilizes a three-stage curriculum GRPO training framework to elevate monocular 3D visual grounding (predicting a 3D bounding box from an RGB image and a text description) from nearly zero to SOTA; this is a task at which even GPT-5 fails badly. The model is first taught accurate 2D localization, then learns 3D centers through camera projection/back-projection, and finally refines complete 3D boxes using compound rewards. The 7B model outperforms specialized pure vision methods on Mono3DRefer.
- MORE-STEM: Long-Short MemOry REcall and Spatio-TEmporal Consistency Model for Query-Driven 3D/4D Point Cloud Segmentation
-
Addressing the limitation that language-driven 3D segmentation only handles static single frames and fails to understand dynamic scenes, MORE-STEM extends query-driven segmentation from 3D to 4D point cloud sequences. It integrates cross-frame text-visual alignment, spatio-temporal consistency modeling (State Space Model + Sparse Transformer), and long-short term memory recall. Additionally, it introduces InstructKITTI, the first outdoor 3D/4D instruction segmentation benchmark, achieving new SOTA performance across instruction, referring, and semantic segmentation tasks.
- MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts
-
MoRE introduces Mixture-of-Experts (MoE) routing into feed-forward dense 3D geometry foundation models represented by VGGT. This allows different experts to specialize in heterogeneous scenes such as indoor/outdoor, objects, humans, or dynamic environments. Combined with confidence-guided depth refinement and dense semantic feature fusion, it achieves SOTA performance across four tasks: point maps, depth, camera poses, and surface normals.
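Top-k expert routing, the generic mechanism behind MoE layers, can be sketched as follows. Scalar inputs and callable experts keep the example small; the gate design and the value of `k` used in MoRE are not stated in the summary, so they are assumptions here.

```python
import math

def top_k_route(logits, k=2):
    """Pick the k largest gate logits and softmax-normalise over them."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exp = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exp.values())
    return {i: e / z for i, e in exp.items()}

def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())
```

Only the selected experts are evaluated, so capacity grows with the number of experts while per-input compute stays roughly constant.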
- MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer
-
This paper proposes MoRe, a motion-aware feed-forward 4D reconstruction Transformer. It decouples dynamic motion from static structures during training via an attention forcing strategy and combines it with grouped causal attention to achieve efficient streaming inference. MoRe achieves SOTA performance in camera pose estimation and depth prediction for dynamic scenes.
- More Natural, More Real: Object-aware Gaussian Splatting for 3D Visual Decoding from Human Brain
-
BrainGS is the first brain signal-to-3D object reconstruction framework based on 3D Gaussian Splatting (3DGS). It encodes fMRI/EEG signals using a spatial-temporal fusion network, decouples and aligns brain signals with vision-semantic-color anchor points via a multi-attribute controller, and tracks/corrects object viewpoint changes through a multi-view stabilizer. It achieves SOTA reconstruction fidelity on fMRI/EEG-3D datasets (fMRI 2.936 FPD / 0.202 LPIPS).
- MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification
-
MoRel utilizes "Keyframe Anchors + Bidirectional Deformation + Learnable Temporal Opacity Blending" to decompose long-sequence dynamic scenes of thousands of frames into segments of anchor relays. Under bounded memory constraints, it eliminates flickering at segment boundaries caused by chunk-based training, achieving the best temporal consistency among all comparison methods with a tOF reduction to 0.203.
- MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes
-
In the context of online 3DGS reconstruction for streaming dynamic scenes, MoRGS explicitly supervises "per-Gaussian motion" using sparse key-view optical flow. It overlays a learnable per-Gaussian motion offset field to correct view inconsistencies in sparse flow and utilizes per-Gaussian motion confidence to apply residual updates only to truly moving Gaussians. This approach achieves state-of-the-art (SOTA) rendering quality and motion fidelity for online methods while maintaining low streaming latency.
- MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments
-
MOSAIC-GS shifts "motion estimation" in monocular dynamic scene reconstruction from the photometric optimization stage to a four-step preprocessing pipeline. It first detects, segments, and tracks dynamic objects, refines scene flow using rigidity constraints, and directly initializes trajectories for dynamic Gaussians using Poly-Fourier curves. Combined with static/dynamic Gaussian decoupling, it achieves quality comparable to SOTA (surpassing in LPIPS) while accelerating training and rendering speeds by several times.
- Motion-Aware Animatable Gaussian Avatars Deblurring
-
The authors propose the first method to reconstruct clear, animatable 3D human Gaussian Avatars directly from blurry videos. This is achieved through a 3D-aware physical blur formation model and an SMPL-based human motion model, which jointly optimize the Avatar representation and motion parameters.
- Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis
-
Motion 3-to-4 decomposes the ill-posed problem of "generating 4D dynamic objects from monocular video" into two steps: static 3D shape generation + dynamic motion reconstruction. By using a (generatable) static reference mesh as an anchor, it performs feed-forward prediction of per-frame vertex motion flow relative to the reference frame. Leveraging DINOv2 video features for "surface point-to-pixel" alignment, it ensures geometric completeness and temporal consistency while compressing inference to seconds, significantly outperforming L4GM / GVFD / V2M4 on the self-built Motion-80 benchmark with ground-truth geometry.
- MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting
-
The authors propose MotionScale, a scalable 4D Gaussian Splatting framework. By leveraging cluster-based adaptive motion fields and progressive optimization strategies, it achieves high-fidelity reconstruction of appearance, geometry, and motion for large-scale dynamic scenes from monocular videos. It achieves a PSNR of 17.98 on DyCheck and reduces the 3D tracking EPE to 0.070, significantly outperforming existing methods.
- MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second
-
The authors propose MoVieS, a feed-forward 4D dynamic scene reconstruction framework. By utilizing a Dynamic Splatter Pixel representation to unify appearance, geometry, and motion modeling, it achieves 4D reconstruction from monocular video in approximately 1 second. It supports multiple tasks including novel view synthesis, 3D point tracking, scene flow estimation, and moving object segmentation.
- Moving Border Ownership for Event-based Motion Segmentation
-
This paper reformulates event-based motion segmentation as "moving border ownership" prediction—detecting motion boundaries while simultaneously determining which side of the boundary belongs to the foreground moving object. By training a lightweight time-surface + MobileNet + ConvLSTM network with perfect supervision from Blender synthetic data, the model achieves zero-shot transfer to four real-world datasets (EED / EVIMO1 / EVIMO2 / EMSMC), reaching event-domain SOTA and running in real-time at 200 FPS.
- MSCD-GS: Motion-Separated Cooperative Deblurring Dynamic Reconstruction via Gaussian Splatting
-
To address the pervasive motion blur in dynamic scenes captured by monocular cameras, MSCD-GS categorizes Gaussian points into static and dynamic types to separately model their motion during exposure. Two motion-aware MLPs are utilized to synthesize virtual sharp images, which are then combined with a deblurring network prior for cooperative regularization. This approach reconstructs high-quality 4D dynamic scenes from blurred inputs, outperforming existing methods in both deblurring and novel view synthesis on Stereo Blur and real-world datasets with faster training speeds.
- MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation
-
A Multi-modal 3D Scene Graph (M3DSG) is proposed, utilizing dynamically allocated image edges instead of traditional text relation edges to preserve visual information. Based on this, a zero-shot navigation system, MSGNav, is constructed. A Visibility Viewpoint Decision (VVD) module is introduced to address the "last-mile" problem in navigation, achieving SOTA performance on GOAT-Bench and HM3D-ObjNav.
- MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene
-
To address the issue where generalizable NeRF (GeNeRF) supervision signals are contaminated by transient distractors (pedestrians, shadows, dynamic objects) in dynamic real-world scenes, this paper decouples "distractor-awareness" into two complementary components: source-view uncertainty (structural inconsistency across source views) and target-view uncertainty (observation anomalies in the target image). These are fused via a heteroscedastic reconstruction loss. Within a feed-forward generalization framework, this approach locates distractors without damaging static structures, outperforming existing GeNeRFs and approaching the performance of per-scene optimized distractor-free NeRFs.
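A heteroscedastic reconstruction loss is commonly written as a Gaussian negative log-likelihood with a predicted per-pixel standard deviation. The sketch below uses that standard form; how the paper fuses source-view and target-view uncertainty into a single `sigma` is not specified in the summary and is left out.

```python
import math

def heteroscedastic_loss(residual, sigma):
    """Gaussian negative log-likelihood, up to an additive constant."""
    return residual ** 2 / (2.0 * sigma ** 2) + math.log(sigma)
```

A large `sigma` discounts large residuals (as expected at distractor pixels), while the `log(sigma)` term penalises claiming high uncertainty everywhere.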
- Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning
-
This paper proposes the Multi-Scale Gaussian-Language Map (GLMap), which organizes environment representation via a "2D indexing grid + instance/region dual-layer semantic units." Each semantic unit simultaneously stores "natural language descriptions + 3D Gaussians," allowing direct reading by LLM/VLM/MLLM without additional projection training. An analytical Gaussian Estimator is used to fit Gaussian parameters directly from point clouds (avoiding gradient optimization). GLMap achieves consistent performance gains across ObjectNav, InstNav, and SQA tasks in a zero-shot manner.
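Fitting a Gaussian to a point set without gradient optimization has a closed form: the sample mean and covariance. The sketch below shows that baseline; whether GLMap's estimator matches it exactly (for example, in how opacity or scale are derived) is not stated in the summary, and `fit_gaussian` is a hypothetical name.

```python
def fit_gaussian(points):
    """Closed-form mean and (biased) covariance of a list of 3D points."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[k] - mean[k] for k in range(3)]
        for r in range(3):
            for c in range(3):
                cov[r][c] += d[r] * d[c] / n
    return mean, cov
```

For a flat square of points in the z = 0 plane, the z-variance comes out as zero, reflecting a Gaussian that is flattened onto the surface.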
- Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation
-
MVCHead achieves SOTA perceptual quality and texture/geometry consistency by directly regressing 240,000 3D Gaussians from randomly sampled 2D face images (without multi-view data, 3D supervision, or intermediate view generation). It leverages a single-forward State Space Model (SSM) featuring "Hierarchical Bi-directional State Scanning" aligned with multi-view drift axes and an "SE(3) multi-view evaluator" to bake consistency directly into the architecture.
- Multi-view Pyramid Transformer: Look Coarser to See Broader
-
MVP utilizes a "dual attention hierarchy" (relaxing the view dimension from intra-frame → intra-group → global, while merging spatial tokens from fine → coarse) to enable feed-forward Transformers to process dozens to hundreds of images. It reconstructs large-scene 3D Gaussians within 0.1–2 seconds, achieving state-of-the-art quality and speed across the 16–256 view range.
- Multimodal Semantic Bias Mitigation for Diverse Text-To-3D Generation
-
To address the "cross-modal bias" in large text-to-3D models (e.g., TRELLIS)—where models are overly sensitive to prompt formatting, focus only on a few keywords, and struggle with complex descriptions—this paper proposes a "localization-quantization-mitigation" framework. It utilizes gradients backpropagated from a 3D quality evaluation model to locate biases at the word level. Based on this, GPT-4 and external 3D generators are used to construct semantically rich and visually reliable text-3D pairs to fine-tune the large model. This approach generates higher-quality 3D content that is more diverse and better aligned with text, surpassing 8 SOTA methods on MATE-3D and T³Bench.
- MuM: Multi-View Masked Image Modeling for 3D Vision
-
MuM generalizes the "mask-and-reconstruct" objective of MAE from single images to arbitrary multi-view sequences (up to 24 views) of the same scene. By utilizing a lightweight multi-view decoder with alternating cross-frame attention, it pre-trains geometric-aware feature encoders. MuM outperforms DINOv3 and CroCo v2 on 3D tasks such as feed-forward reconstruction, dense matching, and relative pose estimation while using approximately 1/30 of the training compute.
- Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training
-
Muses is the first training-free, feed-forward framework for generating fantasy 3D creatures. It parses highly compositional text (e.g., "a creature with a tiger body, dragon wings, robotic legs, and nine fox tails") into 3D skeletons for individual parts, assembles a reasonable holistic skeleton via graph classification and LLM reasoning, and performs voxel-level geometric and texture interpolation within the Structured Latent Space (SLAT) of Trellis. A final style-consistent texture editing step completes the pipeline. Muses significantly outperforms methods like DreamBeast and OmniPart in visual fidelity and text alignment (VQAScore 0.93 vs. 0.82).
- MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
-
MV-RoMa is proposed as the first multi-view dense matching model. By employing a Track-Guided multi-view encoder and a pixel-aligned multi-view refiner, it simultaneously estimates dense correspondences from a single source image to multiple target images. This produces geometrically consistent tracks for SfM, comprehensively outperforming existing methods on benchmarks such as HPatches, ETH3D, and IMC.
- MV2UV: Generating High-quality UV Texture Maps with Multiview Prompts
-
MV2UV treats multi-view generated images as "semantic prompts" to directly generate texture maps in UV space using a fine-tuned SDXL diffusion model. By employing pixel-aligned 3D coordinates (XYZ) as cross-attention positional encodings, it simultaneously resolves multi-view inconsistencies and completes occluded regions, significantly reducing FID on GSO/DTC datasets.
- MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation
-
MV3DIS utilizes "projections of coarse 3D segments" as cross-view common references to match and filter 2D masks generated by SAM. Consistent 2D masks are then used to refine 3D instances. Without relying on video tracking or any 3D annotations, it pushes the zero-shot 3D instance segmentation mAP on ScanNetV2 to 38.5 (surpassing the previous SOTA by 4.5).
- MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
-
This paper proposes a new task, MV-3DRES (language-guided 3D segmentation directly from sparse multi-view RGB), and the MVGGT framework. By utilizing a dual-branch design that fuses a frozen geometry branch with a trainable multimodal branch and applying the PVSO optimization strategy to resolve foreground gradient dilution, the method achieves 39.9 mIoU on the self-constructed MVRefer benchmark, significantly surpassing baselines.
- MVInverse: Feed-forward Multiview Inverse Rendering in Seconds
-
MVInverse utilizes a VGGT-style alternating attention Transformer to simultaneously predict per-view consistent albedo, metallic, roughness, normal, and diffuse shading from multiview RGB sequences in a single feed-forward pass. This compresses multiview inverse rendering—which previously required minutes to hours of per-scene optimization—into seconds, while leveraging self-supervised consistency fine-tuning to ensure stable, flicker-free results on real-world videos.
- NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration
-
NanoSD is proposed as a family of Pareto-optimal lightweight diffusion foundation models (130M–315M parameters, fastest 12ms inference) constructed through hardware-aware U-Net decomposition, block-wise feature distillation, and multi-objective Bayesian optimization. It serves as a drop-in backbone achieving SOTA performance in tasks such as super-resolution, face restoration, deblurring, and monocular depth estimation.
- NaTex: Seamless Texture Generation as Latent Color Diffusion
-
NaTex redefines "coloring 3D meshes" as predicting a color field directly in 3D space. By using a geometry-aware color point cloud VAE to compress textures into an ordered latent set and applying a multi-control DiT for latent color diffusion, it completely bypasses the inherent defects of the multi-view diffusion (MVD) baking route regarding occlusion, alignment, and cross-view consistency. It significantly outperforms previous methods in texture coherence and alignment.
- Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos
-
Addressing the issue where monocular human motion recovery results have accurate joint positions but appear either jittery or over-smoothed, this paper proposes HTD-Refine. It uses a lightweight temporal network, PVA-Net, to explicitly predict the 3D velocity and acceleration of each joint from video. These high-order dynamics serve as soft constraints to optimize global trajectories. This plug-and-play approach reduces jitter, suppresses over-smoothing, and improves global accuracy for existing methods like TRAM, GVHMR, and Human3R.
- NeAR: Coupled Neural Asset–Renderer Stack
-
NeAR proposes co-designing neural asset creation and neural rendering as a coupled stack. It utilizes an illumination-homogenized structured 3D latent (LH-SLAT) to eliminate baked lighting from input images, followed by a light-aware neural decoder to synthesize relightable 3D Gaussian fields in real-time. The method outperforms existing approaches across four tasks: forward rendering, reconstruction, relighting, and novel-view relighting.
- NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos
-
NeoVerse proposes a scalable 4D world model. By utilizing feed-forward pose-free 4DGS reconstruction and online monocular degradation simulation, the training pipeline can leverage millions of in-the-wild monocular videos, achieving SOTA in both 4D reconstruction and novel-trajectory video generation.
- Nerfify: A Multi-Agent Framework for Turning NeRF Papers into Code
-
This paper proposes Nerfify, a multi-agent framework that automatically converts NeRF papers into trainable Nerfstudio plug-in code through context-free grammar constraints, Graph of Thoughts code synthesis, and compositional citation recovery. It achieves a 100% execution rate on a 30-paper benchmark, with visual quality within \(\pm 0.5\) dB PSNR of expert implementations.
- Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
-
Neu-PiG proposes a fast optimization method based on preconditioned multi-resolution latent grids. It encodes the positions and normals of the keyframe reference mesh into a unified latent space, which is then decoded by a lightweight MLP into per-frame 6-DoF deformations. It achieves high-fidelity dynamic surface reconstruction more than 60 times faster than existing training-free methods, without requiring category priors or explicit correspondences.
- Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments
-
Addressing the pain point that "dynamic lighting requires multiple sets of lightmaps, resulting in massive data volumes," NDGI compresses the entire temporal sequence of lightmaps into a compact model using mixed-dimension feature maps and a lightweight MLP. Combined with Block Compression (BC) simulation during training and Virtual Texturing (VT) for on-demand runtime decoding, it achieves a high reconstruction quality of 46.7 dB PSNR at an extremely low bitrate of 0.68 BPP. This significantly outperforms traditional GPU compression (BC7/ASTC) and existing neural compression (NTC), while reducing decoding latency to approximately one-quarter of NTC's.
- Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy
-
This paper proposes NFH-SEM, a hybrid neural field-based framework that embeds the physical model of SEM electron scattering into the neural field optimization process. It reconstructs high-fidelity 3D surfaces of microstructures from multi-view, multi-detector SEM images, achieving self-calibrated, shadow-resistant reconstruction with nanometer-scale precision (e.g., 478 nm layering features, 782 nm pollen textures, and 1.559 μm fracture steps).
- Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction
-
Neural Gabor Splatting embeds a lightweight MLP (SIREN architecture) into each Gaussian primitive, enabling a single primitive to represent complex spatially varying color patterns. Combined with a frequency-aware densification strategy, it significantly improves high-frequency surface reconstruction quality under the same data budget.
- NeuROK: Generative 4D Neural Object Kinematics
-
NeuROK reformulates the task of generating physically plausible 4D deformations for static 3D objects—traditionally dependent on category-specific physical models—into "learning a low-dimensional latent kinematic state space + solving an ODE using Lagrangian mechanics within this space." This allows for unified 4D dynamic generation across various objects (e.g., elastomers, cloth, continua, articulated objects) without physical annotations or category priors, achieving an 81% preference rate in user studies.
- NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation
-
The NG-GS framework is proposed to utilize the continuous modeling capability of NeRF to resolve discretization issues in 3DGS segmentation boundaries. High-quality object segmentation is achieved through continuous feature fields constructed via RBF interpolation combined with multi-resolution hash encoding and joint NeRF-GS optimization.
- NI-Tex: Non-isometric Image-based Garment Texture Generation
-
This paper proposes the NI-Tex framework, which uses a feed-forward architecture to generate high-quality PBR textures for a 3D garment from a single image under non-isometric conditions. This is accomplished by constructing a 3D Garment Videos dataset, image-editing-based cross-topology augmentation, and an uncertainty-guided iterative baking algorithm.
- NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather
-
NimbusGS proposes a unified 3D scene reconstruction framework that achieves SOTA reconstruction across cross-weather and hybrid-weather conditions by decomposing weather degradation into a continuous scattering field (fog/haze) and per-view particle residual layers (rain/snow), combined with a geometry-guided gradient scaling mechanism.
- No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
-
The first calibration-free and depth-free cross-sensor view synthesis framework is proposed. Through a match-densify-consolidate pipeline, sparse cross-modal keypoints are expanded into dense, RGB-aligned X-modal images (thermal/NIR/SAR). The synthesis quality is enhanced via confidence-aware fusion and self-matching filtering.
- Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs
-
Node-RF tightly couples Neural ODEs with NeRF, using continuous-time differential equations to drive the temporal evolution of implicit scene representations. It achieves long-range extrapolation and cross-trajectory generalization far beyond the training interval, significantly outperforming baselines like D-NeRF and 4D-GS on datasets such as Bouncing Balls, Pendulum, and Oscillating Ball.
- NTK-Guided Implicit Neural Teaching
-
This paper proposes NINT, which utilizes row vectors of the Neural Tangent Kernel (NTK) to measure the influence of each coordinate on global function updates. This allows for the dynamic selection of coordinates that exhibit both high fitting error and high global influence for training, reducing INR training time by nearly half without sacrificing reconstruction quality.
- NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting
-
NVGS distills the viewpoint-dependent visibility of all Gaussians within a 3DGS asset into a shared small MLP. This MLP is queried prior to rasterization to discard occluded Gaussians. Coupled with an instantiation-based rasterizer that only processes surviving Gaussians, the system enables real-time rendering of complex scenes composed of hundreds of millions of Gaussians, while reducing VRAM usage to approximately one-fourth of V3DG's and delivering higher image quality.
- ODGS-SLAM: Omnidirectional Gaussian Splatting SLAM
-
ODGS-SLAM is the first system to utilize 3D Gaussian Splatting (3DGS) as a unified representation for omnidirectional (360° panoramic) camera SLAM. It complements the 3DGS-SLAM backpropagation pipeline with analytical gradients for camera poses under equirectangular projection, counteracts equator-pole distortion using latitude weighting, and suppresses memory usage through a graph-analysis-based keyframe removal strategy. Consequently, it achieves simultaneous camera tracking and dense mapping on panoramic inputs, with tracking accuracy (ATE RMSE) that is statistically significantly better than that of existing omnidirectional and perspective 3DGS-SLAM methods.
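To make the equirectangular projection and latitude weighting concrete, here is a minimal sketch. The axis convention (+z forward, +x right, +y down) and the cosine-of-latitude weight are assumptions chosen for illustration; the summary does not state the weighting ODGS-SLAM actually uses.

```python
import math

def project_equirect(x, y, z, width, height):
    """Map a camera-frame 3D point to equirectangular pixel coordinates."""
    r = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)   # longitude in [-pi, pi], 0 = straight ahead
    lat = math.asin(y / r)   # latitude in [-pi/2, pi/2], 0 = equator
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (lat / math.pi + 0.5) * height
    return u, v

def latitude_weight(v, height):
    """Cosine-of-latitude weight: rows near the poles cover less solid angle."""
    lat = (v / height - 0.5) * math.pi
    return math.cos(lat)

u, v = project_equirect(0.0, 0.0, 1.0, 2048, 1024)
print(u, v, latitude_weight(v, 1024))
```

Pixels near the top and bottom rows are heavily stretched in an equirectangular image, so weighting per-pixel losses by such a factor keeps polar regions from dominating the optimization.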
- Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting
-
This paper proposes a feed-forward 3DGS decoder based on keypoint detection concepts, liberating Gaussian primitives from the pixel grid. By adaptively placing primitives at sub-pixel levels and combining an adaptive density mechanism with confidence pruning, it outperforms SOTA feed-forward methods in novel view synthesis while using only 1/7 as many primitives as there are input pixels.
- OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
-
OLATverse utilizes a lightstage with 35 cameras and 331 controllable light sources to capture 765 real-world objects in a one-light-at-a-time (OLAT) manner. The resulting large-scale dataset contains approximately 9 million images with precise single-light control, accompanied by camera parameters, object masks, photometric normals, and diffuse albedo. It provides the first real-world benchmark for inverse rendering, novel view synthesis, and normal estimation that combines both large scale and precise lighting control.
- OMG-Avatar: One-shot Multi-LOD Gaussian Head Avatar
-
OMG-Avatar reconstructs an animatable 3D Gaussian head avatar from a single image in 0.2 seconds. Through "hierarchical coarse-to-fine feature extraction + depth-buffer-guided occlusion-aware fusion + head-shoulder divide-and-conquer modeling," the unified model dynamically switches levels of detail (LOD) at runtime, achieving SOTA reconstruction quality and 85 FPS real-time speed with fewer Gaussian points.
- OMGTex: One-stage Multi-style Facial Texture Reconstruction without Geometry Guidance
-
OMGTex utilizes a DiT-based diffusion model to directly map facial images of any style to editable UV textures. It employs "gradient-guided alignment" during inference to correct UV structural misalignments and achieves partitioned editing through semantic attribution of attention blocks. The process remains independent of 3D geometric priors throughout, demonstrating robustness to occlusions and stylized inputs. It achieves SOTA performance on LPFF/CANVAS with a reconstruction time of 7 seconds per image.
- Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass
-
Omni-3DEdit shifts instructional 3D editing from "iterative optimization on explicit 3D representations" to a single forward pass in multi-view latent space. Using OmniNet, a network based on the pre-trained multi-view generation model SEVA, it simultaneously supports object removal, addition, and appearance editing. Equipped with a data synthesis pipeline to address paired data scarcity, it reduces the time for a single edit from dozens of minutes to approximately 2 minutes.
- OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer
-
OmniVGGT introduces a lightweight GeoAdapter to feed-forward 3D foundation models like VGGT, enabling the model to flexibly incorporate an arbitrary number of auxiliary geometric modalities (depth, camera intrinsics/poses) during both training and inference. Even with RGB-only input, OmniVGGT outperforms VGGT; with auxiliary information, performance gains are substantial, and its integration into VLA models enhances robotic manipulation.
- Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model
-
Online3R integrates a set of lightweight learnable visual prompts into a frozen geometry foundation model (MASt3R-SLAM), updating them online during test time via "local fusion pseudo-ground truth + global reference frame invariance" self-supervised constraints. This allows the feed-forward reconstruction network to adapt to new scenes during the reconstruction process, thereby eliminating inconsistency and long-range drift in sequential reconstruction and outperforming previous SOTA on multiple pose and geometry benchmarks.
- OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
-
This paper proposes OnlineHMR, the first online world-grounded human mesh recovery framework that simultaneously satisfies four criteria: system causality, faithfulness, temporal consistency, and efficiency. It achieves streaming camera-coordinate HMR through sliding-window causal learning and KV-cache inference, combined with human-centric incremental SLAM and EMA trajectory correction for online global localization.
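The summary does not spell out the EMA trajectory correction, so the sketch below shows only the generic idea: a causal exponential moving average over global root positions, which uses past frames alone and therefore suits streaming inference. The function name and the `alpha` value are hypothetical.

```python
def ema_smooth(positions, alpha=0.3):
    """Causal exponential moving average over a stream of 3D positions."""
    smoothed = []
    state = None
    for p in positions:
        if state is None:
            state = tuple(p)  # the first frame initializes the filter
        else:
            # Blend the new observation with the running estimate.
            state = tuple(alpha * x + (1.0 - alpha) * s
                          for x, s in zip(p, state))
        smoothed.append(state)
    return smoothed

track = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(ema_smooth(track, alpha=0.5))
```

A smaller `alpha` suppresses more jitter but lags further behind fast motion, which is the usual trade-off for any causal smoother.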
- OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting
-
This paper proposes OnlinePG, the first online open-vocabulary panoptic mapping system based on 3DGS. It follows a local-to-global paradigm: locally consistent 3D instances are built within a sliding window via a multi-cue clustering graph (geometric overlap, semantic similarity, and view consensus) and then incrementally fused into a global map through bidirectional bipartite matching. This yields state-of-the-art semantic and panoptic segmentation performance among online methods. On ScanNet, it achieves an mIoU of 48.48 (surpassing OnlineAnySeg by +17.2) with a real-time efficiency of 10-18 FPS.
- OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
-
This paper proposes OpenVO, an open-world monocular visual odometry framework that achieves robust metric-scale ego-motion estimation under uncalibrated and variable frame rate conditions. Through a time-aware flow encoder and a geometry-aware context encoder, it achieves over a 20% improvement in cross-dataset ATE and reduces errors by 46%-92% in variable frame rate scenarios.
- Opti-NeuS: Neural Reconstruction for Dual-Layered Transparent and Opaque Objects
-
Opti-NeuS utilizes "two-stage layered reconstruction + a learnable Index of Refraction network (IoRNetwork)" to decouple and reconstruct dual-layered objects consisting of a transparent shell and an opaque core without controlled environments or extra inputs. By first suppressing refraction to reconstruct the outer surface and then using Snell's Law to trace refractive rays for the interior, it achieves lower Chamfer Distance than Alpha-NeuS, NeTO, and NU-NeRF.
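For readers unfamiliar with refractive ray tracing, the vector form of Snell's Law can be sketched as follows. This is the textbook formula rather than code from Opti-NeuS; `refract` and its argument names are illustrative, and `eta` would come from the learned index of refraction in the paper's setting.

```python
import math

def refract(d, n, eta):
    """Refract unit direction d at a surface with unit normal n.

    n must point against the incoming ray; eta = n1 / n2 is the ratio of
    refractive indices (incident medium over transmitted medium).
    Returns the refracted unit direction, or None on total internal reflection.
    """
    cos_i = -(d[0] * n[0] + d[1] * n[1] + d[2] * n[2])
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # total internal reflection: no transmitted ray
    cos_t = math.sqrt(1.0 - sin2_t)
    k = eta * cos_i - cos_t
    return tuple(eta * d[i] + k * n[i] for i in range(3))

# A ray entering glass (index 1.5) from air at 45 degrees to the normal.
s = math.sin(math.radians(45.0))
c = math.cos(math.radians(45.0))
print(refract((s, 0.0, -c), (0.0, 0.0, 1.0), 1.0 / 1.5))
```

The tangential component of the result equals `eta` times that of the input, which is exactly the relation n1 sin θ1 = n2 sin θ2.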
- Optical Flow Matching: Reframing Optical Flow as Continuous Transport Dynamics
-
This work reframes optical flow from "discrete displacement regression between two frames" to "learning a velocity field for continuous transport of pixel coordinates over time." By employing a Triangular Velocity Synergetics (TVS) technique, the theoretical objectives of Flow Matching are aligned with the supervision signals usable by optical flow networks, achieving SOTA accuracy on Sintel / KITTI / Spring datasets alongside stronger cross-dataset generalization.
- ORBIT: Benchmarking SfM in the Wild with 360° Video
-
ORBIT utilizes online 360° panoramic videos as "reliable sources of ground truth." Because panoramic cameras observe all directions, have known intrinsics, and "hide" no stable features, a custom rig-based SfM can yield credible trajectories. These panoramas are then cropped and reprojected into perspective videos that specifically target "difficult viewpoints," forming a benchmark of 100 real-world challenging cases. Results show that SOTA methods like COLMAP, MegaSaM, and VGGT fail significantly, revealing that SfM remains far from solved.
- Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
-
The authors propose leveraging the latent features of a 3D foundation generative model (Hunyuan3D) as shape priors. These are injected into a base video diffusion model through multi-scale 3D adapters to achieve geometrically realistic and view-consistent orbital video generation from a single image.
- ORD: Object-Relation Decoupling for Generalized 3D Visual Grounding
-
ORD proposes an "Object-Relation Decoupling" framework that explicitly models target-anchor spatial relations as first-class geometric/semantic primitives. By utilizing anchor-centric relative geometry, predicate-decoupled cross-modal alignment, and anchor-guided regression, it severs the dependence on "shortcuts from entity names," consistently outperforming SOTA on multiple 3D visual grounding benchmarks including NR3D/SR3D.
- OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation
-
OrienPose explicitly injects the "orientation prior" of an object into the reference latent variables of Novel View Synthesis (NVS) and employs an orientation consistency loss to supervise view transformations at the geometric level. This converts unseen object pose estimation—from a single image without a CAD model—from an "unconstrained pixel-wise transformation" into a "geometrically defined transformation with a known starting point." On ShapeNet, it improves ACC30 by 7.3% and reduces median error by 7.3° compared to the previous SOTA, NOPE.
- Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
-
Ov3R performs simultaneous dense 3D reconstruction and open-vocabulary 3D semantic segmentation using only RGB video streams. It consists of CLIP3R, which directly infuses CLIP semantics into a reconstruction network for geometry and object-level semantics, and a 2D-3D OVS module that fuses tri-path features (CLIP3R, DINO, and 3D-CLIP) to "lift" 2D semantics to 3D. It achieves SOTA performance on Replica/7Scenes reconstruction and Replica/ScanNet open-vocabulary segmentation while maintaining approximately 15 FPS.
- P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction
-
P2GS shifts the optimization of 3DGS from LDR pixel space to the linear HDR domain. Using only LDR images, it jointly solves for "view-independent HDR radiance + per-view exposure + per-view tone mapping," effectively eliminating exposure seams and photometric inconsistencies in multi-camera driving data to achieve exposure-invariant reconstruction suitable for autonomous driving simulation.
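The decomposition into radiance, exposure, and tone mapping can be illustrated with a toy image formation model. The gamma curve below is a stand-in assumption for the per-view tone mapping, and both function names are hypothetical; P2GS's actual parameterization is not described in this summary.

```python
def render_ldr(radiance, exposure, gamma=2.2):
    """Toy forward model: scale linear HDR radiance, tone map, clip to [0, 1]."""
    scaled = max(radiance * exposure, 0.0)
    return min(scaled ** (1.0 / gamma), 1.0)

def recover_radiance(ldr, exposure, gamma=2.2):
    """Invert the toy model (valid only for unclipped pixels)."""
    return (ldr ** gamma) / exposure

# The same surface point seen by two cameras with different exposures.
dark = render_ldr(0.25, exposure=1.0, gamma=2.0)
bright = render_ldr(0.25, exposure=4.0, gamma=2.0)
print(dark, bright)
print(recover_radiance(dark, 1.0, 2.0), recover_radiance(bright, 4.0, 2.0))
```

The two LDR values differ even though the underlying radiance is identical, which is the kind of photometric inconsistency that optimizing in the linear HDR domain is meant to absorb into per-view parameters.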
- PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
-
PackUV "packs" all attributes of 4D Gaussians (3DGS sequences) into a structured multi-scale 2D UV atlas. Combined with PackUV-GS—a method that performs fitting directly in the UV domain using optical flow keyframes and motion-static separation—it enables volumetric video to be stored and streamed losslessly using standard video codecs like HEVC or FFV1 for the first time. It outperforms all existing baselines in rendering quality for sequences up to 30 minutes with large motion and frequent disocclusions.
- PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery
-
PAD-Hand is proposed as a physics-aware conditional diffusion framework that integrates Euler-Lagrange dynamics residuals into the diffusion process as virtual observations. By estimating joint-wise and frame-wise dynamic variance through last-layer Laplace approximation, it achieves hand motion recovery with both physical plausibility and uncertainty awareness, reducing acceleration error by 50.1% on DexYCB.
- PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation
-
PAM is proposed as the first engine to generate realistic hand-object interaction (HOI) videos using only initial/target hand poses and object geometry. By decoupling the process into three stages—pose, appearance, and motion generation—it achieves an FVD of 29.13 (vs. InterDyn's 38.83) and an MPJPE of 19.37mm (vs. CosHand's 30.05mm) on DexYCB. The generated synthetic data also effectively augments downstream hand pose estimation tasks.
- PaNDaS: Learnable Shape Interpolation Modeling with Localized Control
-
PaNDaS constructs a deformation feature field by combining per-face local features on the source mesh with a global encoding of the target mesh. Fed into a deformation generator based on Neural Jacobian Fields and trained only with holistic deformation supervision, the model enables localized non-rigid interpolation of arbitrary regions during inference via binary masking of the global features. It achieves state-of-the-art accuracy in both holistic and local interpolation across hand, body, and face datasets.
- Pano360: Perspective to Panoramic Vision with Geometric Consistency
-
Pano360 is proposed to extend panoramic stitching from traditional 2D pairwise matching to a 3D photogrammetric space. By utilizing a Transformer architecture to achieve global geometric consistency alignment across multiple views, it reaches a 97.8% success rate in challenging scenarios such as weak textures, large parallax, and repetitive patterns.
- Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image
-
Pano3DComposer is proposed as a modular feed-forward framework for compositional 3D scene generation from a single panorama. Through a plug-and-play Object-World Transformation Predictor based on Alignment-VGGT, generated 3D objects are transformed from local to world coordinates, accomplishing high-fidelity 3D scene generation in approximately 20 seconds on an RTX 4090.
- PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery
-
The authors propose PanoVGGT, a permutation-equivariant Transformer framework capable of jointly predicting camera poses, depth maps, and globally consistent 3D point clouds from one or more unordered panoramic images in a single feed-forward pass. They also contribute PanoCity, a large-scale dataset containing over 120,000 outdoor panoramic images.
- Paparazzo: Active Mapping of Moving 3D Objects
-
Paparazzo introduces the novel task of "active reconstruction of moving objects" and proposes a training-free dual-mode framework. It utilizes an Extended Kalman Filter (EKF) to predict the trajectory of non-cooperative moving targets and selects optimal observation viewpoints using FisherRF information gain. By balancing "high information but unreachable" views with "lower information but synchronizable" ones, it achieves more complete and efficient 3D reconstruction compared to passive or random baselines.
- Parallel Rigidity Matters for Bundle Adjustment
-
This paper employs "parallel rigidity" theory to systematically address the long-overlooked fundamental question of "when the solution to Bundle Adjustment (BA) is unique." By treating the joint optimization of camera translations and 3D points as a direction-constrained problem on a bipartite graph, the authors design the GPRBA algorithm. This algorithm efficiently extracts "generically parallel rigid" (GPR) subgraphs via the camera-to-camera viewgraph. Integrating this into the global SfM pipeline GLOMAP enables the clean removal of cameras and 3D points that are misplaced due to independent scaling.
- Parallelised Differentiable Straightest Geodesics for 3D Meshes
-
This paper proposes a parallel GPU implementation of straightest geodesics along with two differentiable schemes (an extrinsic proxy function method and a geodesic finite difference method). This approach makes the exponential map on triangle meshes both highly parallelizable and differentiable, supporting three downstream applications: geodesic convolutional layers, mesh-based flow matching, and second-order optimizers.
- Part\(^{2}\)GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting
-
Part\(^{2}\)GS assigns a learnable "part identity embedding" to each 3D Gaussian. Combined with motion-aware canonicalization, repulsive points, and physical constraints, it simultaneously reconstructs high-fidelity geometry and physically consistent motion of articulated objects from multi-view images. It reduces Chamfer Distance by up to 10× for movable parts compared to the previous SOTA.
- PartDiffuser: Part-wise 3D Mesh Generation via Discrete Diffusion
-
PartDiffuser replaces the "per-token autoregressive mesh generation" with a semi-autoregressive framework characterized by "inter-part autoregression and intra-part parallel discrete diffusion." It injects hierarchical geometric conditions through part-aware cross-attention to refine local high-frequency details while ensuring global topology. On Objaverse, it reduces Chamfer Distance by approximately 27% compared to the second-best method.
- ParticleGS: Learning Neural Gaussian Particle Dynamics from Videos for Prior-free Physical Motion Extrapolation
-
ParticleGS treats each 3D Gaussian as a physics-driven "particle," utilizing a set of shared latent dynamic fields and Neural ODEs to learn continuous-time evolution. This enables physically consistent motion extrapolation beyond the observed time window, achieving over 5 dB higher extrapolation PSNR compared to time-conditioned methods and approximately 2.5 dB higher than velocity field methods across four dynamic scene datasets.
- Particulate: Feed-Forward 3D Object Articulation
-
Particulate proposes a feed-forward model that infers a complete articulated structure (part segmentation, kinematic tree, and motion constraints) from a static 3D mesh within seconds. Trained end-to-end on public datasets using a Part Articulation Transformer, it significantly outperforms existing methods that require per-object optimization and can be integrated with 3D generative models to enable articulation generation from single images.
- PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding
-
PatchAlign3D is the first encoder-only 3D model that directly outputs "language-aligned patch-level features" on point clouds. Through a two-stage pre-training process involving "DINOv2 feature distillation + patch-text contrast," it performs zero-shot 3D part segmentation in a single feed-forward pass without multi-view rendering. On ShapeNetPart, it achieves an mIoU 31.3% higher than the previous strongest rendering-based method, COPS.
- PatchScene: Patch-based Voxel Diffusion Model for Large-Scale Scene Completion
-
PatchScene decomposes large-scale LiDAR scene completion into a set of overlapping small voxel patches, performs explicit voxel diffusion on each, and synthesizes them into consistent global point clouds using confidence-guided spatio-temporal fusion. By adopting an "inside-out, annular-flow" diffusion sequence that propagates dense information from proximal to distal regions, it achieves SOTA results on SemanticKITTI and demonstrates zero-shot generalization from 20m training to 50m inference.
- PE3R: Perception-Efficient 3D Reconstruction
-
PE3R proposes a tuning-free feed-forward 3D semantic reconstruction framework. By utilizing pixel embedding disambiguation, semantic point cloud reconstruction, and global view-aware perception modules, it directly generates semantic 3D point clouds from pose-free 2D images. It achieves a 9x speedup and sets a new SOTA in open-vocabulary segmentation and depth estimation.
- P3Sim: Perceptual 3D Simulation with Physical World Modeling
-
P3Sim models "predicting scene evolution from a single image" as probabilistic inference over multimodal scene variables (RGB / depth / optical flow). Utilizing a 7B autoregressive Transformer with pointer-value sequences for random-access decoding, combined with a geometric conditioning module and persistent scene memory, the system supports Novel View Synthesis (NVS), rigid/deformable manipulation, collisions, and multi-agent prediction, outperforming specialized baselines in NVS and 3D object manipulation benchmarks.
- PerpetualWonder: Long-horizon Action-conditioned 4D Scene Generation
-
PerpetualWonder proposes "Visual-Physically Aligned Particles" (VPP) as a unified representation that bi-directionally binds physical particles with Gaussian primitives. Combined with progressive multi-view optimization, it creates the first true closed-loop hybrid generative simulator—allowing visual corrections from video models to back-update physical states, enabling physically plausible 4D scene generation for long-horizon continuous actions starting from a single image.
- Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement
-
Photo3D utilizes GPT-4o-Image to enhance 3D renderings into "structure-aligned, photorealistic" multi-views, constructing the Photo3D-MV dataset paired with 3D geometry. By employing a "relaxed detail enhancement loss" that combines CLIP-aware perceptual adaptation with DINOv3 semantic structure matching, it injects photorealistic appearance into three mainstream 3D-native generation paradigms without compromising geometric integrity, achieving SOTA photorealism.
- PhyGaP: Physically-Grounded Gaussians with Polarization Cues
-
PhyGaP incorporates polarization cues into 2DGS optimization via Polarized Deferred Rendering (PolarDR) and introduces GridMap, a self-occlusion-aware environment mapping technique, to achieve accurate reflection decomposition and realistic relighting of glossy objects.
- PhysGaia: A Physics-Aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis
-
PhysGaia constructs a physics-aware benchmark dataset containing 17 scenes, covering multi-body interactions of various materials such as liquids, gases, fabrics, and rheological substances. It provides ground truth for 3D particle trajectories and physical parameters (e.g., viscosity), and proposes two new metrics, Trajectory Distance (TD) and AUOP, to quantify the physical realism of 4DGS methods, revealing significant deficiencies in the physical reasoning of existing DyNVS methods.
- PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
-
The first framework for feed-forward prediction of 3DGS and physical attributes (material category, Young's modulus, Poisson's ratio) from a single image. By employing a two-stage training process (supervised pre-training and DPO preference fine-tuning), it completely bypasses SDS and differentiable physics engines. Combined with the 50K+ PhysAssets dataset, it generates high-fidelity 4D physical simulations in under 1 minute, outperforming per-scene optimization methods in CLIP_sim and human preference rates.
- PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation
-
PhysGS embeds Bayesian inference into the 3D Gaussian Splatting pipeline, utilizing Vision-Language Model (VLM) priors and multi-view confidence-weighted updates to achieve point-wise probabilistic estimation and uncertainty quantification of physical properties (friction, hardness, density, stiffness). It achieves a 22.8% improvement in Absolute Percentage Error (APE) for mass estimation and a 61.2% reduction in Shore hardness error compared to NeRF2Physics.
- PhysHead: Simulation-Ready Gaussian Head Avatars
-
PhysHead is the first method to combine physics-driven hair dynamics with animatable 3DGS head avatars. It models the expressive face using FLAME meshes and 3DGS, models hair appearance using strands and 3DGS, and drives hair animation via a physics engine. Layered optimization of the face and hair is achieved through VLM-generated bald images.
- PhysHO: Physics-Based Dynamic 3D Gaussian Human and Object from Monocular Video
-
PhysHO treats SMPL-driven Linear Blend Skinning (LBS) as an "internal driving force prior" for the human body and uses the Material Point Method (MPM) as a physics engine to propagate these forces to objects through contact. Combined with per-particle residual neural constitutive laws, it reconstructs physically plausible "human push/pull object" dynamics from monocular videos and enables extrapolation for unseen motions.
- Physically Inspired Gaussian Splatting for HDR Novel View Synthesis
-
PhysHDR-GS is a physically inspired HDR novel view synthesis framework that decomposes Gaussian color into intrinsic reflectance and adjustable ambient illumination. It utilizes complementary Image-Exposure (IE) and Gaussian-Illumination (GI) branches to capture HDR details, while a cross-branch HDR consistency loss provides explicit HDR supervision without Ground Truth (GT). Furthermore, illumination-guided gradient scaling addresses the gradient starvation issue caused by exposure bias. It outperforms HDR-GS by 2.04 dB on several benchmarks while maintaining real-time rendering at 76 FPS.
- PhysIR-Splat: Physically Consistent Thermal Infrared Radiative Transfer in 3D Gaussian Splatting
-
PhysIR-Splat moves beyond simply treating 3DGS color as thermal radiation. Instead, it explicitly assigns three physical quantities—temperature, emissivity, and ambient irradiance—to each Gaussian primitive and embeds the thermal infrared imaging chain ("self-emission + ambient reflection \(\rightarrow\) atmospheric transmittance \(\rightarrow\) radiometric response") directly into the renderer. Combined with VGGT-IR, a feed-forward initializer that consumes thermal infrared (and optional RGB) to directly regress camera poses and initial geometry, it addresses the long-standing challenge of SfM degradation in weakly textured thermal infrared scenes.
- PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
-
Given a single real-world photo, PhysX-Anything utilizes a fine-tuned VLM through multi-round dialogues to directly generate geometry, joint structures, and physical properties. It employs a voxel representation that compresses geometry tokens by 193×, eventually exporting URDF/XML assets ready for immediate use in physics engines.
- PIP-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching
-
Reveals spatial sparsity and temporal redundancy in iterative stereo matching disparity updates. Proposes Progressive Iterations Pruner (PIP) to compress 32 iterations into 1, a collaborative learning paradigm for depth prior transfer without independent monocular encoders, and a hardware-aware FlashGRU operator (7.28× speedup). This enables high-precision iterative stereo matching to achieve real-time inference on Jetson Orin NX (75ms/frame, 320×640) for the first time.
- PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction
-
The study proposes PixARMesh, the first autoregressive framework for single-view scene reconstruction in native mesh space (rather than SDF). By enhancing point cloud encoders with pixel-aligned image features and global scene context, and predicting object poses and meshes simultaneously within a unified token sequence, it achieves scene-level SOTA on 3D-FRONT while outputting compact, editable, artist-ready meshes.
- Point4Cast: Streaming Dynamic Scene Reconstruction and Forecasting
-
Point4Cast utilizes a "continuously evolving latent spatio-temporal representation" to uniformly process streaming video frames. It reconstructs 3D pointmaps for past and current frames while forecasting pointmaps and camera parameters for future timestamps in a feed-forward manner. It also derives scene flow in a training-free manner, setting a new SOTA on PointOdyssey and TAPVid-3D for both dynamic scene reconstruction and the newly proposed "3D pointmap forecasting" task.
- PointCNN++: Performant Convolution on Native Points
-
PointCNN++ generalizes sparse convolution from "voxels" to "native points"—convolution centers reside directly on high-precision original coordinates, neighborhoods are searched in continuous space, and local adaptive voxelization is applied only as the final step to pair kernels. By abstracting computation as an MVMR (Matrix-Vector Multiplication and Reduction) problem with handwritten GPU kernels, it achieves zero additional memory consumption. This allows it to surpass voxel-based methods in efficiency and memory savings while maintaining geometric precision, achieving SOTA in point cloud registration (99.8% Recall on KITTI) and semantic segmentation (78.2% mIoU on nuScenes) as a plug-and-play backbone.
- PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting
-
PointGS reconstructs sparse point clouds into a dense 3D Gaussian field as a unified intermediate representation. It extracts 2D masks from rendered images using SAM and distills semantics into Gaussian primitives through scale-aware contrastive learning. After a two-step ICP to align Gaussians back to the original point cloud for nearest-neighbor label transfer, it outperforms existing unsupervised methods on S3DIS (+2.8% mIoU) and ScanNet-v2 (+0.9% mIoU) without manual annotations or point cloud pre-training.
- PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
-
PointNSP transforms autoregressive point cloud generation from "point-by-point prediction" to "next-scale LoD prediction": it first determines the global structure at low resolution, then refines geometry scale by scale. This is achieved via a multi-scale VQVAE and a causal Transformer with block-wise causal masks, maintaining the permutation invariance of point sets. It is the first autoregressive paradigm to achieve SOTA generation quality on ShapeNet, outperforming strong diffusion baselines in parameter, training, and sampling efficiency.
- PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding
-
The PointTPA framework is proposed, utilizing two lightweight modules—Serialized Neighborhood Grouping (SNG) and Dynamic Parameter Projector (DPP)—to generate customized network parameters for each input scene during inference. With an increase of <2% in parameter count, it achieves 78.4% mIoU on ScanNet, surpassing current Parameter-Efficient Fine-Tuning (PEFT) methods.
- PointWorld: Scaling 3D World Models for In-The-Wild Robotic Manipulation
-
PointWorld represents scene states and robot actions as a unified set of 3D point flows. By using a large pre-trained point cloud backbone to learn "how scene points move given an action" across approximately 2 million trajectories, a single checkpoint can drive real robotic arms to complete tasks involving rigid body pushing, deformable objects, articulated objects, and tool use from a single RGB-D input in a zero-shot manner.
- PoseMaster: A Unified 3D Native Framework for Stylized Pose Generation
-
PoseMaster proposes a 3D native approach that unifies pose stylization and 3D generation into an end-to-end framework. It directly utilizes 3D skeletons as pose control signals (instead of 2D skeleton maps), designs a skeleton densification strategy and a Point Transformer encoder to extract fine-grained spatial topological features. Trained through a large-scale "Image-Skeleton-Mesh" triplet data engine, it achieves SOTA results in pose normalization and arbitrary pose stylization.
- PP-Brep: Few-Shot B-rep Classification with Hybrid Graph Representation
-
This paper deconstructs the B-rep model of CAD into a three-layer hybrid graph (local topology graph + global parallel graph + region correlation hypergraph), paired with a hierarchical heterogeneous GNN. It utilizes RL-adaptive perturbation for contrastive pre-training to learn general representations and structure-aware graph prompts for few-shot fine-tuning. It significantly outperforms general graph prompt methods at 1/3/5-shot on the TraceParts-11 and FabWave-31 part datasets.
- PQDT: Pseudo-Query Dual Transformer for Robust Point Cloud Restoration
-
PQDT utilizes a "Pseudo-Query Dual-stage Transformer" to unify three types of point cloud degradations—completion, denoising, and deformation. It first generates a batch of noise-resistant pseudo-query anchors guided by observations, then refines them using shape priors. Combined with sparse geometric embedding attention and dynamic query selection, it achieves new SOTA performance on ShapeNet-55/34 and three newly established degradation datasets.
- PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis
-
This paper proposes PR-IQA, a cross-reference image quality assessment method that first computes geometrically consistent local quality maps in multi-view overlapping regions and then "completes" the quality information into non-overlapping regions via a reference-conditioned cross-attention network. This generates dense quality maps approaching full-reference accuracy, which are integrated into the 3DGS pipeline through a dual-filtering strategy to significantly improve sparse-view 3D reconstruction quality.
- PRIMU: Uncertainty Estimation for Novel Views in Gaussian Splatting from Primitive-Based Representations of Error and Coverage
-
PRIMU is a post-processing uncertainty estimation (UE) framework for Gaussian Splatting (GS). It back-projects rendering error, coverage, and field-of-view (FoV) statistics from training views onto each Gaussian primitive to construct a set of "uncertainty feature maps" renderable from any novel view. A gradient boosting regressor, trained on a single held-out view, then directly predicts pixel-wise error. This approach achieves new SOTA results in both RGB and depth uncertainty estimation and enables active view selection using coverage feature maps.
- PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes
-
PrITTI replaces voxels with a hybrid representation of "vectorized object primitives (cuboid/ellipsoid) + rasterized ground." It first utilizes a Layout VAE to compress 3D urban semantic layouts into a structured 2D latent space, then trains a Latent Diffusion Transformer (DiT) for controllable generation. It achieves SOTA on KITTI-360 with lower memory requirements, faster inference, and superior editability, naturally supporting downstream tasks such as scene editing, inpainting, outpainting, and street-view synthesis.
- ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars
-
Proposes ProgressiveAvatars, a progressive avatar representation based on adaptive implicit subdivision of template meshes to construct hierarchical 3DGS. It supports progressive transmission and rendering under varying bandwidth and computation constraints—obtaining a usable avatar with only 5% of the data (2.6MB), with subsequent incremental loading smoothly improving quality to levels comparable with SOTA methods.
- PromptDepth: Efficient and Promptable Geometric 3D Vision Model for Embodied Intelligence
-
PromptDepth unifies "panoptic depth, instance depth, tracking depth, and stereo depth" into a single promptable dense prediction task. A feed-forward network learns geometric representations, switching outputs via different task tokens/points/mask prompts. Combined with ILDS loss and Gram Anchoring to resolve training conflicts between panoptic and instance depth, the model achieves SOTA on multiple benchmarks using only synthetic data. It doubles inference speed and is designed for real-time 3D understanding in embodied agents.
- PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts
-
This paper proposes the Prompt Recurrent Unit (PRU), which utilizes the DPT decoder of monocular depth foundation models as an iterative refinement module (replacing GRU). By injecting monocular structural cues and stereo motion cues through Structure and Motion Prompts via residual addition, it achieves state-of-the-art (SOTA) zero-shot stereo matching performance without destroying monocular priors, reducing error by nearly 50% on Middlebury 2021.
- Proxy-GS: Unified Occlusion Priors for Training and Inference in Structured 3D Gaussian Splatting
-
Proxy-GS utilizes a "lightweight proxy mesh + hardware rasterization" to generate an occlusion depth map in under 1 ms. This depth map is used both during inference to cull occluded anchors/Gaussians for accelerated rendering and during training to guide anchor densification onto visible surfaces. Compared to Octree-GS, it achieves a 3×+ FPS improvement in heavily occluded large-scale urban scenes while simultaneously enhancing rendering quality.
- Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
-
This paper proposes an adaptive Reconstruction-aware Pruning Strategy (RPS) and 3D DoG primitives, achieving 90% Gaussian point reduction while maintaining rendering quality.
- PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding
-
PV-Ground identifies that existing 3D visual grounding (3D VG) methods typically use point cloud backbones that aggressively downsample 50,000 points to 2048, creating a detail bottleneck. It introduces sparse voxel convolution to preserve high-resolution features and distills the voxel feature pyramid into compact keypoints for interaction. A text-guided differentiable soft sampling module is proposed to adaptively concentrate keypoints on task-relevant objects, improving grounding accuracy by approximately 5% on ScanRefer/ReferIt3D.
- QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment
-
This paper proposes QD-PCQA, a quality-aware domain adaptation framework that transfers quality assessment priors from the image domain to the point cloud domain through two strategies: Rank-weighted Conditional Alignment and Quality-guided Feature Augmentation.
- QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
-
This paper proposes QuadSync, the first global synchronization algorithm for quadrifocal tensors. By constructing a block quadrifocal tensor and proving it admits a Tucker decomposition with multilinear rank \((4,4,4,4)\), the method utilizes an ADMM-IRLS optimization framework to recover camera poses from four-view measurements. It achieves superior synchronization accuracy compared to two-view and three-view methods in dense-view scenarios.
- Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection under Distribution Shift
-
Addressing the "overconfidence and calibration failure" of DETR-style 3D detectors under distribution shifts (e.g., rain or snow), this paper utilizes Normalizing Flows to estimate the feature density of object queries. This density signal is injected into post-hoc calibrators like Temperature Scaling, Platt Scaling, and Isotonic Regression, allowing calibration intensity to adaptively adjust based on "how far the query is from the training distribution." This approach simultaneously calibrates classification confidence and 3D box regression variance, outperforming standard post-hoc methods on both nuScenes (in-distribution) and MultiCorrupt (distribution shift).
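As a rough illustration of density-aware temperature scaling, the sketch below widens the softmax temperature as a query's estimated log-density falls below a reference value. The function `adaptive_temperature`, the linear mapping from density gap to temperature, and the `gain` parameter are hypothetical and are not taken from the paper, which estimates density with Normalizing Flows and also calibrates box regression variance.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; a larger temperature gives flatter probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_temperature(base_t: float, log_density: float,
                         ref_log_density: float, gain: float = 0.1) -> float:
    """Hypothetical rule: raise the temperature in proportion to how far the
    query's log-density falls below a reference (in-distribution) level."""
    gap = max(0.0, ref_log_density - log_density)
    return base_t * (1.0 + gain * gap)


logits = [2.0, 0.5, -1.0]
t_in = adaptive_temperature(1.5, log_density=-2.0, ref_log_density=-5.0)
t_out = adaptive_temperature(1.5, log_density=-15.0, ref_log_density=-5.0)
conf_in = max(softmax(logits, t_in))    # in-distribution query
conf_out = max(softmax(logits, t_out))  # far from the training distribution
```

With this rule an in-distribution query keeps the base temperature, while a low-density query is softened more strongly, so its top-class confidence drops.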
- QueryMe: Query-Driven Open-Vocabulary 3D Object Affordances Grounding from Multimodal Evidence
-
QueryMe projects a single Human-Object Interaction (HOI) image into 3D space via feed-forward monocular reconstruction, then utilizes a set of learnable query vectors to retrieve evidence in a fixed "Text → 3D HOI → Object Point Cloud" sequence. This enables the localization of object functional regions in an open-vocabulary setting, achieving a 4.19% higher AUC on unseen affordances compared to the previous SOTA, GREAT.
- Radar-Guided Polynomial Fitting for Metric Depth Estimation
-
POLAR reformulates the task of "transforming scale-invariant monocular depth estimation (MDE) into metric depth using sparse radar points" as a polynomial fitting problem. It utilizes radar features to predict a set of polynomial coefficients to apply non-uniform, depth-dependent corrections to MDE depth (instead of traditional global scale-and-shift affine transforms). This approach reduces MAE/RMSE by 24.9%/33.2% on average across three datasets while achieving real-time performance (40 fps) with minimal computational overhead.
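To make the contrast with a global scale-and-shift concrete, the sketch below applies a polynomial correction to relative depth values. The coefficients here are made-up numbers for illustration only; in POLAR they are predicted from radar features, which this sketch does not model.

```python
from typing import List, Sequence


def polynomial_correction(depth: float, coeffs: Sequence[float]) -> float:
    """Evaluate c0 + c1*d + c2*d^2 + ... using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * depth + c
    return result


def correct_depths(depths: Sequence[float], coeffs: Sequence[float]) -> List[float]:
    """Apply the same polynomial to every depth value."""
    return [polynomial_correction(d, coeffs) for d in depths]


# Degree 1 reduces to the classic affine transform: shift + scale * d.
affine = correct_depths([1.0, 2.0, 4.0], coeffs=[0.5, 2.0])

# A quadratic term makes the correction grow with depth (hypothetical values).
quadratic = correct_depths([1.0, 2.0, 4.0], coeffs=[0.1, 1.5, 0.02])
```

A degree-1 polynomial recovers the affine baseline, so higher-order terms can be read as the depth-dependent part of the correction.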
- Radiance Meshes for Volumetric Reconstruction
-
Radiance Mesh represents radiance fields by partitioning the scene into tetrahedral units with "constant density + linear color" using Delaunay tetrahedralization. In conjunction with exact volumetric rendering sorted by circumsphere power and a novel mesh shader rasterizer, it achieves real-time view synthesis that is faster than 3DGS and popping-free, while maintaining comparable quality and being naturally compatible with graphics ecosystems for simulation, editing, and surface mesh extraction.
- RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection
-
RaGS models the scene as a continuous 3D Gaussian field. It initializes Gaussians using monocular foreground cues, iteratively absorbs radar geometry and image semantics to "move" Gaussians towards foreground objects, and finally renders multi-layer BEV features for detection. It achieves SOTA on three 4D radar-camera benchmarks: VoD, TJ4DRadSet, and OmniHD-Scenes.
- Random Wins All: Rethinking Grouping Strategies for Vision Tokens
-
This paper proposes a minimalist random grouping strategy to replace various carefully designed token grouping methods in Vision Transformers. It outperforms nearly all baselines across image classification, object detection, semantic segmentation, point cloud segmentation, and VLMs. The success is explained through four dimensions: positional information, head feature diversity, global receptive field, and fixed grouping patterns.
- RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing
-
RAP is proposed as a rendering-free, feedforward importance scoring method for Gaussian primitives. It extracts 15-dimensional features from intrinsic attributes and local neighborhood statistics, using a lightweight MLP to predict importance scores. Once trained, it generalizes to unseen scenes without additional optimization.
- RayNova: Scale-Temporal Autoregressive World Modeling in Ray Space
-
This paper proposes RayNova, a geometry-agnostic multi-view world model based on dual-causal (scale + temporal) autoregression. By utilizing relative Plücker ray positional encoding, it enables unified 4D spatio-temporal reasoning, achieving SOTA multi-view video generation performance on nuScenes.
- Real-Time Dynamic Scene Rendering with Controlled Compressibility and Contact Awareness
-
Addressing artifacts at contact/occlusion boundaries caused by the common "incompressible, source-free" motion assumption in dynamic 3D Gaussian Splatting, this paper introduces a projection framework utilizing a "source-aware continuity equation + implicit surface contact constraints." By projecting network-predicted velocity fields onto a physically feasible set for supervised training, it achieves higher fidelity and real-time speeds on Plenoptic Video (33.84 dB PSNR, 120 FPS) and D-NeRF (35.24 dB PSNR, 300 FPS).
- Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
-
This paper proposes the Real2Edit2Real framework, a three-stage pipeline comprising "3D reconstruction → point cloud editing for new trajectories → depth-guided video generation for synthetic demonstrations." It generates large numbers of diverse manipulation demos from only 1-5 real demonstrations, and policies trained on them match or exceed policies trained on 50 real demonstrations, a 10-50x improvement in data efficiency.
- Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning
-
Addressing the issue where photorealism is lost when fine-tuning diffusion models on synthetic 3D renders to achieve 3D controllability, this paper decouples "domain identity (real/synthetic)" from "3D control signals" using a lightweight Domain Shifter (low-rank residual adapter). Combined with layer-aware training and domain reassignment, the control capability is transferred from the synthetic domain to the real domain. This results in strong 3D consistency and significantly higher photorealism in both multi-view texture generation and text-to-multi-view tasks.
- REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
-
REArtGS++ reconstructs part-level surface meshes and estimates joint parameters of unseen articulated objects (e.g., drawers, refrigerators) using only multi-view RGB images from any two states, without predefining joint types or relying on external models. By modeling each joint as a decoupled screw motion and extending "normal-depth consistency constraints" from discrete states to the entire motion interval via Planar Gaussians and first-order Taylor expansion, it achieves SOTA performance on PARIS and ArtGS-Multi, especially showing significant advantages for screw joints and multi-part objects.
- Recovering Physically Plausible Human-Object Interactions from Monocular Videos
-
RePHO takes "visually plausible but physically flawed" Human-Object Interaction (HOI) sequences estimated from monocular videos and reenacts them in a physical simulator using reinforcement learning policies. By leveraging "adaptive sampling + bidirectional propagation + online kinematic target updates," it identifies reliable frames from extremely noisy initial values and progressively diffuses physical validity. This results in physically consistent HOI sequences without interpenetration, floating, or jitter, significantly outperforming existing methods on BEHAVE and InterCap datasets.
- ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction
-
ReFlow proposes a "self-correction" monocular dynamic scene reconstruction framework that uses inter-frame video differences to directly supervise 3D motion without external optical flow or tracking priors. Combined with complete canonical space initialization and static-dynamic decoupling, it achieves new SOTA performance on NVIDIA Monocular and Nerfies-HyperNeRF (average PSNR 28.20 dB on NVIDIA).
- ReGenHOI: Unifying Reconstruction and Generation for 3D Human-Object Interaction Understanding
-
ReGenHOI unifies the "reconstruction" (restoring observed contacts from images) and "generation" (synthesizing future interactions from linguistic instructions) of 3D Human-Object Interaction (HOI) into a shared semantic-geometric latent space. By integrating direct 3D point cloud contact reasoning, iterative reasoning trajectories, and a gravitational field diffusion bridge for contact refinement, it simultaneously outperforms SOTA in contact estimation, reconstruction accuracy, and motion generation quality.
- Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
-
MOCHI is the first multi-view dense correspondence face reconstruction framework that does not require pre-registered data for training. By employing a "pseudo-linear inverse kinematics solver + differentiable pointmap/normal loss + dense landmarks trained on synthetic data" trio, it directly learns topology-consistent FLAME meshes from raw scans. Coupled with a lightweight test-time optimization (TTO), its reconstruction accuracy surpasses the very slow and labor-intensive traditional registration pipelines it aims to replace.
- ReLaGS: Relational Language Gaussian Splatting
-
This paper proposes ReLaGS, the first training-free framework that unifies multi-level language Gaussian fields and open-vocabulary 3D scene graphs. It improves scene representation through Max-Weight Pruning and robust outlier-aware feature aggregation, combined with GNN-based relation prediction to achieve efficient structured 3D scene understanding.
- Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations
-
Reliev3R proposes the first weakly supervised paradigm to train feed-forward 3D reconstruction models (FFRM) from scratch without multi-view geometric annotations (e.g., point clouds and poses from SfM/MVS), utilizing monocular relative depth and sparse image correspondences as alternative supervision. It achieves performance comparable to or exceeding some fully supervised FFRMs.
- Relightable Holoported Characters: Capturing and Relighting Dynamic Human Performance from Sparse Views
-
RHC utilizes a transformer network, RelightNet, to perform cross-attention between "physics-inspired features (geometry/albedo/shading/view)" and environment lighting to implicitly solve the rendering equation in a single forward pass. It enables photo-realistic, free-viewpoint relighting of dynamic full-body characters with unseen motions from just 4 flat-lit cameras—avoiding slow OLAT-based acquisition and achieving significantly higher clarity than inverse rendering methods.
- Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery
-
This paper proposes RepTRFD, a method that addresses the spectral bias issue of INR-parameterized Tensor Ring (TR) factors by reparameterizing them into a "learnable latent tensor \(\times\) fixed basis" form, consistently outperforming SOTA in tasks like image inpainting, denoising, super-resolution, and point cloud recovery.
- Repurposing 3D Generative Model for Autoregressive Layout Generation
-
LaviGen "repurposes" a pretrained native 3D generative model into an autoregressive layout generator, placing objects one-by-one directly in native 3D space. This ensures generated scene layouts are both physically plausible (no collisions, no out-of-bounds, no floating) and semantically coherent, achieving 19% higher physical plausibility and approximately 65% faster inference compared to SOTA.
- ResiHMR: Residual-Limb Aware Single-Image 3D Human Mesh Recovery for Individuals with Limb Loss
-
ResiHMR is the first single-image 3D human mesh recovery framework specifically for the amputee population. It utilizes "Residual-Limb Anchor-Factor Optimization" to clip the fixed SMPL-X skeleton to cover only the existing limbs and employs "Residual-Limb Reconstruction" to explicitly remove distal mesh vertices and seal smooth residual surfaces. This reduces the residual limb 2D MPJPE from 73.61 px to 23.19 px (using HSMR backbone).
- Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
-
The UGS-Loc framework is proposed to jointly model pose prior uncertainty and geometric uncertainty through Monte Carlo pose sampling and Fisher information-guided PnP optimization, significantly enhancing the robustness of camera pose refinement in 3DGS scenes without requiring retraining.
- RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting
-
This paper proposes RetimeGS, which integrates regularized temporal opacity, Catmull-Rom spline trajectories, bidirectional optical flow supervision, and triple rendering strategies. These designs resolve ghosting and temporal aliasing issues in 4D Gaussian Splatting (4DGS) during inter-frame interpolation, enabling ghost-free continuous-time 4D reconstruction at arbitrary timestamps.
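To make the "Catmull-Rom spline trajectories" concrete, here is a small sketch of the standard uniform Catmull-Rom segment, which passes through its two middle control points. Whether RetimeGS uses the uniform or a centripetal variant is not stated in this note, so the uniform form and the sample keyframes are assumptions:

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Uniform Catmull-Rom interpolation between p1 and p2 for t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b + (c - a) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )

# Hypothetical Gaussian centers at four consecutive frames.
keyframes = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 1.0, 0.0), (3.0, 1.0, 0.0)]
midpoint = catmull_rom(*keyframes, 0.5)  # position halfway between frames 1 and 2
```

Because the curve interpolates the keyframes themselves, querying an arbitrary timestamp gives a smooth position without moving the observed frames.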
- Revisiting 3D Reconstruction Kernels as Low-Pass Filters
-
This paper reinterprets the "reconstruction kernel" in 3D Gaussian Splatting (3DGS) as a "low-pass filter" in signal reconstruction. It demonstrates that Gaussian, Exponential, and Student’s t kernels are non-ideal low-pass filters (causing aliasing via high-frequency leakage). Accordingly, it proposes the Jinc kernel, derived from the ideal low-pass filter, and introduces a modulated kernel to balance frequency fidelity with fast spatial decay, outperforming both 3DGS and SSS in low- and high-resolution novel view synthesis.
- Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling
-
Addressing the pain point that calibration-free monocular SLAM is "either slow or non-modular," this paper proposes SLAM-MER, a pipeline implemented from scratch in C++. It utilizes dual-path 3D point queries—"Temporal Buffer (recent keyframes) + Spatial 3D Grid (early reconstructed regions)"—for localization. By invoking a feed-forward depth model (MASt3R) only on keyframes, it fuses sparse keypoint localization with semi-dense anchor representation, achieving 80+ FPS real-time performance (significantly exceeding MASt3R-SLAM at ~13 FPS and VGGT-SLAM at <5 FPS) while maintaining comparable or superior localization accuracy.
- Revisiting Optimal Coding for I-ToF under Practical Sensor Constraints
-
This paper reduces the depth error of I-ToF cameras under a realistic noise model to a computable "depth variance metric." It integrates hardware constraints (peak power, bandwidth, binary waveforms, and mutually exclusive multi-taps) directly into the design phase, which makes it possible to search for optimal coding schemes within a constrained feasible space. The two discovered schemes (for high and low SNR) consistently outperform Hamiltonian and double ramp codes both in simulation and on commercial sensors.
- Revisiting Pose Sensitivity in Splat-based Computed Tomography under Sparse-view Reconstruction
-
Targeting the streak artifacts that 3D Gaussian Splatting-based sparse-view CT produces on real data, this paper shows through controlled experiments that the primary cause is pose error in the acquisition geometry rather than view sparsity. Based on this, it derives a stable and differentiable joint self-calibration framework that incrementally optimizes camera poses while reconstructing the volume. Removing TV regularization makes the system more stable and faster, suppressing streak artifacts while preserving details in real data and achieving a PSNR approximately 10 dB higher than the SOTA on synthetic data.
- Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
-
To address the slow inference of ViT-based multi-view 3D detectors, this paper proposes SEPatch3D. It replaces traditional token pruning/merging with "scenewise spatio-temporal dynamic patch sizing + coarse patch enhancement using fine patches," achieving up to a 57.7% speedup over StreamPETR on nuScenes with negligible accuracy loss.
- REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
-
REVIVE 3D utilizes a "two-stage, plug-and-play" pipeline to transform flat images lacking 3D cues (cartoons, line art, flat illustrations) into voluminous 3D meshes. It first inflates the image into a volumetric "inflated prior" mesh, then performs noise injection and denoising refinement within the latent space of a pre-trained 3D latent diffusion backbone. The method also introduces two reference-free metrics, Compactness and Normal Anisotropy, to quantify "volume" and "surface flatness."
- ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction
-
The ReWeaver framework is proposed to jointly reconstruct 3D garment geometry and 2D sewing patterns from a minimum of 4 multi-view RGB images. By employing a dual-path Transformer to predict 3D surface patches/curves and their topological connections, followed by in-group attention to flatten 3D structures into 2D panel edges, it achieves the first topology-accurate garment asset recovery ready for direct physical simulation.
- Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation
-
The Rewis3d framework is proposed, which for the first time integrates feed-forward 3D scene reconstruction as an auxiliary supervision signal into weakly-supervised semantic segmentation. Through a dual Student-Teacher architecture and dual confidence-weighted cross-modal consistency loss, it improves mIoU by 2-7% under sparse annotations, while using only 2D images during inference.
- RF4D: Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes
-
RF4D integrates mmWave radar into neural fields. By employing a "spatio-temporal radar field + scene flow temporal regularization + physics-informed power rendering," it achieves novel view synthesis (NVS) for radar in outdoor dynamic scenes for the first time. The synthesis and occupancy estimation accuracy significantly outperform Radar Fields on two public radar datasets.
- RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos
-
From a single monocular RGB video captured by a moving camera, RHINO reconstructs the "Human + Manipulated Unknown Object + Static Scene" into detailed 4D geometry within a unified world coordinate system. It leverages 3D foundation models to stabilize motion estimation for low-texture objects, decouples true object motion from "apparent motion" via camera motion subtraction, and performs joint optimization using per-component neural SDFs with a differentiable contact prior. It outperforms state-of-the-art baselines in both novel view synthesis and 4D reconstruction.
- RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval
-
Addressing the retrieval challenges of 3D objects with arbitrary orientations and diverse categories in real-world scenarios, this paper proposes RI-Mamba, the first pure-Mamba rotation-invariant point cloud model. It decouples pose from geometry using local and global reference frames, constructs rotation-invariant token sequences via Hilbert curves, and recovers discarded pose information through linear-time orientation embeddings. Combined with cross-modal contrastive learning using automatic triplet generation, it achieves SOTA performance in arbitrary-orientation retrieval across 200+ categories in OmniObject3D.
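Serializing points along a Hilbert curve keeps spatial neighbors close together in the resulting token sequence. The sketch below shows the classic 2D index computation on a power-of-two grid; it is only an illustration of the ordering idea, since the paper presumably orders 3D points expressed in its rotation-invariant reference frames, and the grid size and function name are assumptions:

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) on an n-by-n grid (n a power of two) to its Hilbert index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the sub-curve connects end to end
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order all cells of a 4x4 grid along the curve.
cells = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(cells, key=lambda c: hilbert_index(4, c[0], c[1]))
```

Consecutive entries of `ordered` are always grid neighbors, which is the locality property that makes such curves attractive for sequence models.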
- RigMo: Unifying Rig and Motion Learning for Generative Animation
-
RigMo unifies "rig" and "motion" into a single feed-forward VAE: it learns a set of Gaussian bones, skinning weights, and per-frame SE(3) transformations directly from raw mesh sequences through self-supervision, eliminating the need for manual skeletal annotations. Coupled with a Motion-DiT operating in its latent space for controllable motion generation, it significantly outperforms existing auto-rigging and deformation baselines in reconstruction accuracy, cross-motion generalization, and inference speed.
- RINO: Rotation-Invariant Non-Rigid Correspondences
-
RINO utilizes vector neurons to transform DiffusionNet into an end-to-end SO(3)-invariant point feature extractor called RINONet. This is combined with Complex Functional Maps (CFMaps), which encode only orientation-preserving mappings, and a set of coupled unsupervised losses. This allows learning non-rigid shape correspondences directly from raw xyz coordinates without pre-alignment or handcrafted descriptors. It establishes new SOTAs in challenging scenarios such as arbitrary poses, non-isometry, partiality, non-manifold structures, and noise.
- RISE: Single Static Radar-based Indoor Scene Understanding
-
RISE utilizes a single static mmWave radar to transform "multipath ghosts"—traditionally discarded as noise—into geometric cues. By integrating Dual-angle Multipath Enhancement (BAME) and Sim-to-Real Hierarchical Diffusion (SRHD), it achieves the first indoor wall layout reconstruction and furniture detection under a single static radar, reducing Chamfer distance by 60% (to 16 cm) compared to the SOTA, with a furniture detection IoU of 58%.
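The layout result above is reported as a Chamfer distance. As a rough sketch of what that metric measures, the following computes a symmetric Chamfer distance between two small point sets. Conventions vary (squared distances, averaging instead of summing the two directions), and the one used by the paper is not stated here, so this particular form is an assumption:

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

predicted_wall = [(0.0, 0.0), (1.0, 0.0)]
reference_wall = [(0.0, 0.0), (1.0, 1.0)]
cd = chamfer_distance(predicted_wall, reference_wall)
```

This brute-force version is quadratic in the number of points; real evaluations typically use a spatial index for the nearest-neighbor search.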
- RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations
-
RnG proposes Reconstruction-Guided Causal Attention, reinterpreting the Transformer's KV-Cache as an implicit 3D representation. Using a single feed-forward Transformer, it unifies the reconstruction and generation of complete 3D geometry and appearance from unposed sparse images, achieving speeds over 100x faster than diffusion-based methods.
- Robust3DGSW: Toward Robust Watermarking for Quantization-Aware 3D Gaussian Splatting
-
To address the issues where watermarks are erased and rendering quality collapses after quantizing 3DGS models to low bits, Robust3DGSW proposes a two-stage quantization-aware watermarking framework: the first stage embeds watermarks into the mid-frequency bands of 3D Gaussian positions and 2D renderings to resist quantization loss, while the second stage utilizes multi-scale adversarial perturbations and progressive quantization training for dual decoders. This allows for watermark extraction accuracy of \(>80\%\) under 4-bit quantization while maintaining high-quality rendering.
- RoSAMDepth: Robust Self-supervised Depth Estimation Leveraging Segment Anything Model
-
RoSAMDepth utilizes object-level masks generated offline by SAM as priors, injecting them into a self-supervised monocular depth framework from three perspectives: "representation space contrast," "region-level outlier suppression + Gaussian likelihood smoothing," and "object-level reliability weighting." This allows the model to predict depth with sharper boundaries and better intra-object consistency under adverse conditions such as night and rain.
- Routing on Demand: DSNet for Efficient Progressive Point Cloud Denoising
-
DSNet (Dynamic Skip Net) is a "Routing on Demand" progressive point cloud denoising framework. It employs a normal similarity-based noise discriminator to quantify the noise intensity of each local patch, which is then mapped by an anti-monotonic decision function to an appropriate denoising module entry. This allows clean regions to skip redundant denoising while noisy regions receive sufficient refinement, achieving a superior balance between denoising quality and computational efficiency.
- RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting
-
RT-Splatting decouples the "geometric occupancy" and "optical opacity" of each Gaussian primitive into two independent attributes. This allows a single set of Gaussians to serve as both a reflective surface (performing deferred shading for high-frequency specular effects) and a transmissive volume (performing forward integration for clear backgrounds). A "specular-aware gradient gating" mechanism is utilized to suppress floaters caused by reflection residuals leaking into the transmission branch, achieving SOTA results on real-world scenes with simultaneous reflection and transmission, such as car windows and plastic films.
- S\(^2\)-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
-
S²-MLLM enables Multimodal Large Language Models (MLLMs) to perform 3D Visual Grounding (3DVG) without relying on expensive point cloud reconstruction and multi-view rendering during the inference phase. Instead, it treats feed-forward 3D reconstruction as "spatial guidance" for joint optimization during training, paired with a structural enhancement module that injects 3D coordinates and camera rays into visual features. This allows the model to perform implicit 3D spatial reasoning in the latent space, achieving state-of-the-art performance on ScanRefer / Nr3D / Sr3D with only 25% of the training overhead and zero additional inference latency.
- S2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds
-
This paper proposes S2AM3D, a point cloud part segmentation framework that merges 2D pre-trained priors with 3D contrastive supervision. It employs a point-consistent encoder to obtain globally consistent point features and a scale-aware prompt decoder to achieve continuous and controllable segmentation granularity adjustment, significantly outperforming existing methods across multiple benchmarks.
- S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs
-
S2D bridges "sparse point clouds" and "3D Gaussian Splatting" (3DGS) representations: it utilizes a point-cloud-guided single-step diffusion refiner to clean artifacts in novel views rendered from sparse inputs, paired with a reconstruction strategy featuring random sample dropping and weighted gradients to stabilize optimization. This allows high-quality, 3D-consistent 3DGS reconstruction from minimal inputs (e.g., 1 image for \(30^\circ\) coverage, \(<10\) images for \(180^\circ+\)).
- SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
-
SAGE formalizes 3D indoor scene generation as an agent operating under the MCP protocol. It invokes layout/asset generators on demand and employs a closed-loop self-correction mechanism via "Visual Review + Physical Review (Isaac Sim in-the-loop verification)." It produces physically stable, open-vocabulary scenes that can be directly imported into simulators for robot policy training, scaled through multi-layer augmentation.
- SAM 3D: 3Dfy Anything in Images
-
SAM 3D is a generative foundation model that reconstructs complete 3D shapes, textures, and layouts for any object from a single natural image. It overcomes the barrier of scarce real-world 3D data through a "model-in-the-loop + human annotation" data flywheel and an LLM-style multi-stage training recipe, achieving at least a 5:1 human preference win rate over previous SOTAs on real objects and scenes.
- SAMosaic3D: Modular Scene Assembly for Real-Time 3D Segment Anything
-
This work treats over-segmented 2D masks from SAM as "mosaic fragments" and employs an end-to-end differentiable framework to first assemble fragments of the same object within a frame and then merge instances into scene memory across frames. Running at 11.2 FPS, it reaches SOTA among online methods on ScanNet/ScanNet200/SceneNN/3RScan with zero-shot cross-dataset generalization.
- SAQN: Semantic-based Adaptive Query Network for 3D Referring Expression Segmentation
-
SAQN replaces the "point-based query generation" approach in 3D referring expression segmentation with "one learnable query per semantic class." It utilizes a minimal set of queries (21 classes + 10 adaptive queries, totaling 31) to replace the hundreds of queries used in previous works. The Adaptive Query Fusion module resolves ambiguities caused by a single class query representing all identical objects in a scene, achieving SOTA performance for both 3D-RES and 3D-GRES on ScanRefer and Multi3DRefer.
- SASNet: Spatially-Adaptive Sinusoidal Networks for INRs
-
SASNet is proposed to solve the issues of frequency initialization sensitivity and high-frequency leakage in SIREN by combining frozen frequency embedding layers with spatially-adaptive masks learned by a lightweight hash-grid MLP. It achieves faster convergence and higher reconstruction quality across image fitting, volume data fitting, and SDF reconstruction tasks.
- Scalable Feature Matching via State Space Modeling and Sparse Correlation
-
SLiM utilizes a "Conv-Mamba linear-complexity backbone + L2-norm guided sparse correlation + lightweight recurrent coordinate refinement" triad to liberate semi-dense feature matching from quadratic computational costs. On MegaDepth, it achieves AUC@5°=57.9 with only 5.9M parameters (1.5 points higher than Efficient LoFTR). At 1200×1200 resolution, it reduces memory consumption by 45% compared to JamMa and is 1.8× faster than Efficient LoFTR.
- Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
-
The authors propose QuatRoPE, a 3D positional encoding method based on quaternion rotation, which preserves all \(O(n^2)\) spatial relations between objects using only \(O(n)\) input tokens. Combined with the IGRE mechanism to reduce interference with language RoPE, it achieves significant improvements across several 3D vision-language benchmarks.
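Why a rotation applied per token can capture pairwise relations: with ordinary rotary encodings, the inner product of two rotated vectors depends only on the difference of their positions. The sketch below demonstrates this with planar rotations and scalar positions. It is not the quaternion formulation of QuatRoPE, only the underlying principle, and all values are made up for illustration:

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def pair_count(n):
    """Number of unordered object pairs among n objects."""
    return n * (n - 1) // 2

q, k = (1.0, 2.0), (0.5, -1.0)
score_a = dot(rotate(q, 0.3), rotate(k, 1.1))  # positions 0.3 and 1.1
score_b = dot(rotate(q, 2.3), rotate(k, 3.1))  # both positions shifted by 2.0
```

The two scores agree because only the offset 0.8 matters, so n encoded tokens implicitly carry all `pair_count(n)` relative offsets once attention compares them.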
- Scaling4D: Pushing the Frontier of Video Novel View Synthesis through Large-Scale Monocular Videos
-
Scaling4D reformulates Video Novel View Synthesis (VNVS) from "rendering point clouds followed by inpainting" into a "correspondence-guided generation task." This enables self-supervised training using massive amounts of real-world monocular videos, bridging the training-inference gap of previous methods. It outperforms methods like GEN3C and TrajectoryCrafter on both single-view and multi-view benchmarks, with performance scaling consistently with data volume.
- Scaling View Synthesis Transformers (SVSM)
-
This work establishes the first scaling laws for geometry-free NVS Transformers. By proposing the Effective Batch Size hypothesis (\(B_{\text{eff}} = B \cdot V_T\)), it reveals the root cause for the prior undervaluation of encoder-decoder architectures. The authors design SVSM, a unidirectional encoder-decoder that achieves a new SOTA on RealEstate10K (30.01 PSNR) using less than half the training FLOPs, shifting the Pareto frontier by \(3\times\) compared to LVSM's decoder-only baseline.
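A quick worked example of the Effective Batch Size relation \(B_{\text{eff}} = B \cdot V_T\), assuming \(V_T\) denotes the number of target views rendered per scene (this reading of \(V_T\), and the configurations below, are assumptions rather than values from the paper):

```python
def effective_batch_size(batch_size, target_views):
    """B_eff = B * V_T."""
    return batch_size * target_views

# Hypothetical configurations: fewer scenes per step, more target views each.
configs = [(256, 1), (64, 4), (32, 8)]
b_eff = [effective_batch_size(b, v) for b, v in configs]
```

Under the hypothesis, configurations with equal \(B_{\text{eff}}\) are expected to train comparably, which would favor architectures that can decode many target views from one encoding pass.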
- SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation
-
SCAPO utilizes an SE(3) equivariant autoencoder to align articulated objects in arbitrary poses (e.g., laptops, drawers, safes) to a shared canonical space. It then employs "articulation-aware blend skinning" to simultaneously regress part segmentation and joint axes/pivots/states. The entire pipeline is trained via self-supervision through cycle reconstruction and cross-space alignment without any annotations, CAD templates, or multi-frame inputs, outperforming all self-supervised baselines on both synthetic and real-world data.
- SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation
-
SCE-Depth proposes a hardware-software co-designed framework consisting of a "biomimetic spherical compound eye camera + spherical neural network." It processes compound eye images natively on the HEALPix spherical grid to avoid planarization distortions and utilizes the "distance-decaying depth-sensitive gradients" naturally generated by overlapping fields of view of adjacent ommatidia. Combined with a Spherical Gradient Feature Extractor (SGFE) and Spherical Gradient Loss (SGL), it significantly reduces wide-FOV depth errors, especially in peripheral regions.
- SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings
-
SCE-SLAM adds a "scene coordinate branch" in parallel with the optical flow branch in frame-to-frame monocular SLAM. It encodes 3D geometric relationships into a canonical scale reference using learnable patch-level scene coordinate embeddings. By propagating scale across windows via geometry-modulated attention and pulling bundle adjustment toward the reference scale using 3D coordinate constraints, it significantly suppresses long-sequence scale drift while maintaining 36 FPS real-time performance (KITTI average ATE reduced from 53.61 m in DPVO to 25.79 m, or 14.07 m with loop closure).
- Scene Grounding In the Wild
-
This paper proposes an inverse optimization framework based on semantic features to align local 3D reconstructions (SfM) captured in the wild to a complete pseudo-synthetic reference model (e.g., Google Earth Studio). By utilizing DINOv2 features and robust optimization, it bridges significant domain gaps and achieves globally consistent fusion of non-overlapping local reconstructions.
- Scene Reconstruction as Mapping Priors for 3D Detection
-
This work repurposes "maps," originally intended for planning in autonomous driving, for perception: automatically reconstructed surfel/3DGS scenes serve as "mapping priors" that replace expensive manual HD maps. A gated fusion module adaptively integrates these priors with LiDAR/camera inputs. With only 4 frames, the method outperforms the temporal-fusion SOTA that uses 100 frames on the Waymo Open Dataset.
- SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
-
SceneMaker decouples single-image 3D scene generation into three sub-tasks: "de-occlusion / 3D object generation / pose estimation," leveraging image data, 3D object data, and scene data respectively to acquire sufficient open-set priors. It compensates for occluded objects using a de-occlusion model fine-tuned from image editing models and predicts the rotation, translation, and scale of each object directly via a unified diffusion pose model with global/local attention. By utilizing a self-constructed 200K open-set scene dataset, the model achieves high-quality geometry and accurate poses for both indoor and open-set scenarios.
- SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
-
This paper presents SceneScribe-1M, a large-scale multimodal video dataset of 1 million in-the-wild videos totaling over 4,000 hours. It provides comprehensive annotations, including detailed textual descriptions, precise camera parameters, consistent depth maps, and consistent 3D point trajectories, serving as a unified resource for 3D geometric perception and video generation tasks.
- SceneTok: A Compressed, Diffusable Token Space for 3D Scenes
-
SceneTok compresses a set of multi-view images into a small set (approx. 1024 tokens, only tens of thousands of 32-bit floats) of unstructured scene tokens decoupled from spatial grids. It utilizes a lightweight rectified flow decoder for rendering from arbitrary trajectories and trains a diffusion transformer on this highly compressed latent space to achieve 3D scene generation within 5 seconds, completely decoupling rendering from generation.
- SDGS: Spatial Difference Guided Gaussian Splatting for Simultaneous Localization and 3D Reconstruction
-
SDGS utilizes sparse edges (spatial difference) as descriptors, representing them as slender 3D Gaussian ellipsoids. It estimates 6-DoF poses online through distance transform alignment between rendered and input edges. By leveraging high-frame-rate differential signals from hybrid pixel sensors for mutually-exclusive supervisor deblurring, it achieves robust tracking and clear dense reconstruction even under extreme high-speed motion where traditional RGB methods fail.
- SE(3)-Equivariance with Geometric and Topological Guidance for Category-Level Object Pose Estimation
-
SEGPose is a depth-only (point cloud) category-level 6D object pose estimation method. It is the first to simultaneously introduce geometric features, topological features, and SE(3)-equivariance into pose estimation: persistent homology generates topological labels to guide point cloud reconstruction, while Vector Neuron Networks extract SE(3)-equivariant features to guide the pose prediction head. It outperforms all similar depth-based methods on REAL275 / CAMERA25 and approaches the performance of most RGB-D methods.
- SEA-Flow3D: Simplified, Efficient, and Accurate Scene Flow via Spatial Vector Sampling and Multi-scale Refinement
-
SEA-Flow3D integrates a "3D directional vector between matching point pairs" (Spatial Vector Sampling) into the correlation sampling of a RAFT-style dense scene flow framework. This allows the iterative optimizer to continuously perceive depth and geometric directions beyond 2D correlation. Combined with a lightweight ConvNeXtV2 RNN optimizer and a coarse-to-fine multi-scale structure, it sets new accuracy records on KITTI (SF-all 3.55) and Sintel (Final 2.04) while compressing inference time to 60–72 ms.
- SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping
-
SeeGroup models multi-layer depths of transparent objects as an "intensity function" along the depth axis. By utilizing a recursive decomposition module, the model adaptively decides how to group depth layers, combined with a permutation-invariant likelihood loss. It improves quaternary relative depth accuracy from 61.34% to 70.09% on the LayeredDepth real-world benchmark.
- Seeing Depth Through Frequency and Motion: A Progressive Training Paradigm for Monocular Depth Estimation
-
Observing that blurred boundaries stem from frequency aliasing during downsampling and that PoseNets model cross-frame motion insufficiently, this paper proposes the plug-and-play Frequency-Guided Sampling (FGS) module to preserve high-frequency details and the PoseQuery Network (PQNet), which uses channel-aligned attention for cross-frame motion modeling. Combined with a progressive three-stage decoupled training paradigm that maximizes the complementarity between depth and pose, the method achieves a 4.1% reduction in Sq Rel compared to a strong baseline on KITTI.
- Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals
-
Addressing the issues of high reconstruction noise, training instability, and surface position ambiguity when millimeter-wave (mmWave) radar "sees through boxes to reconstruct interior objects," this paper proposes GeRaF 2.0. It unifies line-of-sight (LoS) geometry outside the box and non-line-of-sight (NLoS) geometry inside into a Unified LoS (ULoS) Signed Distance Field. By using a visually pre-trained SDF for stable initialization and employing two-stage training with Relative SDF alignment to lock the surface precisely on the zero-isosurface, it achieves a new SOTA in RF-based 3D reconstruction.
- Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
-
To address the mismatch between sensor output and physical radiance when reconstructing sharp High Dynamic Range (HDR) 3D representations from "single-exposure blurry Low Dynamic Range (LDR) images + event streams," this paper proposes See-NeRF. It directly represents the scene's true HDR radiance using NeRF and explicitly models the "physical radiance → sensor measurement" process through a pixel-level RGB CRF model and a latency-aware, photometrically calibrated event CRF model. The three components are jointly optimized to achieve SOTA deblurring and HDR novel view synthesis results under extreme lighting conditions.
- Seele: A Unified Acceleration Framework for Real-Time Gaussian Splatting on Mobile Devices
-
SEELE is a mobile-oriented 3DGS rendering acceleration framework that reduces the number of rendered Gaussians through "view-dependent scene representation + online filtering + asynchronous prefetching" and concentrates computing power on a few Gaussians that truly affect pixels via "contribution-aware rasterization." It is plug-and-play across four mainstream 3DGS algorithms, achieving up to 6.3× speedup and 39.1% runtime model reduction, with rendering quality often slightly improved.
- Selfi: Self-improving Reconstruction Engine via 3D Geometric Feature Alignment
-
Selfi freezes a 3D Vision Foundation Model (VFM) such as VGGT as the backbone and trains only a lightweight feature adapter. By using the depth and pose output by VGGT itself as pseudo-labels and distilling features into a "geometrically aligned" space through a re-projection consistency loss, it transforms a foundation model not originally designed for high-fidelity rendering into a SOTA engine for pose-free Novel View Synthesis (NVS) and camera pose estimation, without any 3D ground-truth supervision.
- Semantic Foam: Unifying Spatial and Semantic Scene Decomposition
-
This work extends the recently proposed Radiant Foam (a differentiable radiance field representation based on Voronoi grids) to semantic decomposition tasks. By explicitly attaching a set of semantic features to each Voronoi cell and leveraging the natural spatial adjacency of the grid for direct spatial regularization, the method avoids the occlusion and cross-view inconsistent supervision artifacts common in point-based representations, matching or exceeding SOTA methods such as Gaussian Grouping and SAGA in object-level segmentation.
- SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
-
Addressing the "rare but safety-critical" categories (children, emergency vehicles, strollers) in camera-only multi-view 3D detection characterized by extreme data scarcity, intra-class diversity, and inter-class ambiguity, SemLT3D leverages language/visual priors from CLIP for two purposes—routing 3D queries to experts based on semantic similarity (Language-guided MoE) and distilling 2D semantics from CLIP into 3D tokens (Semantic Projection Distillation). As a plug-and-play module for StreamPETR/Far3D, it significantly improves tail-class mAP and overall mAP/NDS under the 18-class nuScenes setting.
- SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM
-
The authors propose SGAD-SLAM, which utilizes a pixel-aligned simplified Gaussian representation and allows Gaussians to adjust depth offsets along the ray to improve rendering quality and scalability. A GICP tracking strategy based on geometric similarity is introduced to accelerate camera pose estimation, outperforming state-of-the-art methods on Replica, TUM, ScanNet, and ScanNet++.
- SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation
-
SGI proposes a seed-based structured 2D Gaussian representation framework. By organizing unstructured Gaussian primitives into seed-driven neural Gaussians, combined with context-guided entropy coding and multi-scale optimization strategies, it achieves up to a 7.5× compression ratio and 6.5× optimization acceleration for high-resolution image representation while maintaining or even improving reconstruction fidelity.
- SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering
-
SGS-Intrinsic proposes a two-stage indoor inverse rendering framework. The first stage utilizes semantic and geometric priors to construct a dense, geometrically consistent Gaussian field, while the second stage combines a hybrid lighting model and material priors for material-illumination decomposition, incorporating a de-shadowing module to prevent shadow baking into the albedo.
- SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
-
SGSoft reformulates the task of "finding dense point correspondences between deforming 3D shapes" as "aligning geodesic probability fields on a canonical template." It utilizes this topologically invariant soft supervision signal to train per-vertex descriptors that fuse geometric, semantic, and spatial cues. During inference, a single forward pass followed by nearest neighbor retrieval yields correspondences without requiring pre-alignment, pair-wise optimization, or post-refinement, reducing the per-pair latency to 1.7 seconds while maintaining high accuracy.
- ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
-
ShapeR converts casual image sequences through SLAM + 3D detection + VLM description into a three-way multi-modal condition: "sparse point clouds + multi-view posed images + text." These are fed into a FLUX-style rectified flow Transformer to denoise VecSet latent codes. It generates metric-accurate, complete single-object meshes in real-world occluded/cluttered scenes, achieving a 2.7× improvement in Chamfer Distance over the SOTA.
- SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation
-
SharpTimeGS assigns a learnable "lifespan" parameter to each 4D Gaussian primitive, transforming temporal visibility from Gaussian decay to a "flat-top" profile and modulating motion magnitude. This ensures that long-lived static points experience minimal drift while short-lived dynamic points retain full motion. Combined with lifespan-velocity-aware densification and velocity-aware initialization, it accurately reconstructs both static backgrounds and rapid dynamics in a unified representation, achieving SOTA results with 4K@100FPS real-time rendering.
- Simple but Effective Triplet-Based Compression Strategies for Compact Visual Localization
-
In response to the long-standing problem of "SfM point cloud compression" in visual localization—which traditionally relies on solving complex optimization problems (Set Cover / Integer Programming / Quadratic Programming)—this paper proposes an almost trivial strategy: randomly sample triplets for each database image, estimate poses using P3P, and retain the points belonging to the triplets that yield the most accurate database image poses. By using "pose accuracy" as the direct selection criterion, combined with standard descriptor quantization, this approach matches or even exceeds the performance of current SOTA compression and learning-based methods.
- SketchFaceGS: Real-Time Sketch-Driven Face Editing and Generation with Gaussian Splatting
-
SketchFaceGS utilizes a feed-forward, coarse-to-fine architecture to map a single hand-drawn sketch (plus an optional reference image) to a real-time renderable photorealistic 3D Gaussian face in a single pass. By employing UV Mask Fusion and layer-wise feature fusion, it achieves free-view, optimization-free local real-time editing, outperforming SketchFaceNeRF in both generation fidelity (FID 92.65) and editing latency (~0.3s / 243 FPS).
- Sky2Ground: A Benchmark for Site Modeling under Varying Altitude
-
This paper introduces the Sky2Ground dataset (51 scenes, 80k images, unified coverage of synthetic and real images across satellite, aerial, and ground views) and the SkyNet model (dual-stream encoder + masked satellite attention + progressive view sampling). It represents the first systematic study of joint camera localization across ground, aerial, and satellite perspectives, achieving gains of 9.6% in RRA@5 and 18.1% in RTA@5.
- SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
-
SLARM is a feed-forward Transformer that simultaneously outputs 4D Gaussian geometry, 3D scene flow, and language-aligned semantics for dynamic scenes in a single forward pass. It utilizes high-order motion functions for unsupervised learning of complex non-uniform motions, distills LSeg for text-queryable semantics, and employs windowed causal attention for constant-latency streaming inference. It improves motion accuracy by 21%, PSNR by 1.6 dB, and segmentation mIoU by 20% on the Waymo dataset.
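The windowed causal attention mentioned for SLARM can be illustrated with a small mask builder. This is a minimal sketch and not the paper's implementation: the function name `windowed_causal_mask`, the frame-level granularity, and the `window` parameter are assumptions made for illustration.

```python
def windowed_causal_mask(num_frames, window):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Causal: a frame never sees the future (j <= i).
    Windowed: a frame only sees the most recent `window` frames.
    Both `num_frames` and `window` are illustrative parameters.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(num_frames)]
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    for row in windowed_causal_mask(5, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed entries, so the attention cost per incoming frame does not grow with the length of the stream, which is the property that makes constant-latency streaming plausible.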
- SmokeSVD: Smoke Reconstruction from A Single View via Progressive Novel View Synthesis and Refinement with Diffusion Models
-
This work utilizes diffusion models to synthesize side views frame-by-frame from a single-view video, followed by a cyclic reconstruction process of "coarse density → progressive refinement → fine density". This framework achieves high-quality single-view smoke reconstruction while being two orders of magnitude faster than differentiable rendering (15 minutes vs. >30 hours).
- SMVRT: Implicit Human 3D Modeling Using Sparse Multi-View Volumetric Reconstruction with Transformer Fusion
-
SMVRT utilizes an end-to-end, template-free implicit occupancy field network to reconstruct clothed humans from sparse multi-view inputs (2–8 images). The core innovation lies in placing Transformer fusion modules across three stages: 2D encoding, 2D-to-3D voxel construction, and query point decoding. This allows the network to "select the most reliable views and features," effectively halving the Chamfer distance of prior SOTA methods on datasets like THUman2.0/2.1, MultiGarment, and MultiHuman.
- SO(3)-Equivariant ViT-Adapter for Data-Efficient Zero-Shot Sim-to-Real Indoor Panoramic Depth Estimation
-
An SO(3)-equivariant adapter is added to a frozen, perspective-pretrained ViT (Depth Anything V2). Trained on only 6.5K synthetic panoramas and no real data, the framework transfers the zero-shot generalization of perspective models to 360° panoramas and outperforms PanDA, which depends on real data, in zero-shot sim-to-real evaluation on Matterport3D / Stanford2D3D.
- Solvability of the Viewing Graph Under the Affine Camera Model
-
This paper conducts the first study on the solvability of the viewing graph under the affine camera model. It characterizes the problem of "whether a given set of two-view relations uniquely determines all cameras" as a linear system \(Ax=b\). Consequently, a practical testing algorithm based on matrix rank is provided along with several necessary/sufficient conditions. Finally, it is conjectured that "affine solvability = 2D parallel rigidity."
- Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
-
This paper proposes a matrix-inversion-free construction method for minimal problem solvers. By using hidden variable sparse resultants, multivariate polynomial systems are reduced to a univariate determinantal polynomial in \(x_1\). Then, IFFT interpolation is employed to numerically reconstruct the polynomial coefficients from sampling values on the unit circle (bypassing symbolic expansion). Finally, Cramer's rule combined with a GCD criterion is used for robust back-substitution of remaining unknowns. The method achieves a zero failure rate in numerical stability across 14 camera pose minimal problems and provides an average speedup of approximately 30% for small-scale problems.
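The interpolation step in the summary above can be sketched on a toy problem: recover the coefficients of \(\det M(x)\) from its values on the unit circle, without expanding the determinant symbolically. This is only an illustration under stated assumptions: the 2×2 matrix pencil \(M(x) = A_0 + x A_1\), the helper names `fft`, `det2`, and `det_coefficients`, and the choice of four samples are all hypothetical, and the paper's resultant construction is not reproduced here.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    With inverse=True the exponent sign is flipped but the 1/N factor
    is NOT applied; the caller divides by N.
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def det_coefficients(a0, a1, n_samples=4):
    """Coefficients c_0..c_{N-1} of det(a0 + x*a1), lowest degree first.

    The determinant is evaluated numerically at x_k = exp(-2*pi*i*k/N),
    so the samples are the forward DFT of the coefficient vector and an
    inverse FFT recovers the coefficients. N must exceed the degree (2).
    """
    samples = []
    for k in range(n_samples):
        x = cmath.exp(-2j * cmath.pi * k / n_samples)
        m = [[a0[i][j] + x * a1[i][j] for j in range(2)] for i in range(2)]
        samples.append(det2(m))
    return [c / n_samples for c in fft(samples, inverse=True)]


if __name__ == "__main__":
    coeffs = det_coefficients([[1, 2], [3, 4]], [[0, 1], [1, 0]])
    print([round(c.real, 6) for c in coeffs])
```

For this toy pencil the determinant is \(-2 - 5x - x^2\), so the recovered real parts should match those coefficients followed by a zero. Sampling at roots of unity keeps every evaluation point at magnitude one, which is one reason this kind of interpolation tends to be numerically well behaved.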
- SonoWorld: From One Image to a 3D Audio-Visual Scene
-
SonoWorld is a training-free framework that generates explorable 3D audio-visual scenes from a single image. It first expands the image into a 360° panorama and reconstructs it as a 3D Gaussian scene. It then uses VLM-driven semantic localization to place sound source anchors, and finally renders spatial audio with Ambisonics encoding so that the visual and auditory domains align both geometrically and semantically.
- SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs
-
SoPE (Spherical Coordinate-Based Positional Embedding) remaps point cloud tokens from 1D sequence indices to a spherical coordinate space \((t,r,\theta,\phi)\). Combined with multi-dimensional frequency allocation and multi-scale frequency mixing strategies, it significantly enhances the spatial perception capabilities of 3D Large Vision-Language Models.
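The remapping to \((t,r,\theta,\phi)\) described in this entry can be sketched as a plain coordinate conversion. The angle convention (polar angle \(\theta\) measured from the +z axis, azimuth \(\phi\) in the x-y plane), the origin of the sphere, and the use of the token's list index as \(t\) are assumptions for illustration; the summary does not specify them.

```python
import math


def to_spherical(x, y, z):
    """Convert Cartesian (x, y, z) to (r, theta, phi).

    Assumed convention: theta is the polar angle from +z in [0, pi],
    phi is the azimuth in (-pi, pi].
    """
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0 else 0.0
    phi = math.atan2(y, x)
    return r, theta, phi


def token_positions(points):
    """Attach a sequence index t to each point's spherical coordinates."""
    return [(t, *to_spherical(*p)) for t, p in enumerate(points)]


if __name__ == "__main__":
    for position in token_positions([(0.0, 0.0, 2.0), (0.0, 1.0, 0.0)]):
        print(position)
```

The four resulting components are what a multi-dimensional positional embedding would then encode, for example by giving each component its own band of rotary frequencies.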
- Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
-
MoGaF advances 4D Gaussian Splatting from "interpolation of observed frames" to physically consistent long-term scene forecasting. It accomplishes this by grouping Gaussians into object-level units labeled as rigid or non-rigid, applying typed motion constraints during optimization, and utilizing a lightweight Transformer per group for autoregressive motion extrapolation.
- SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection
-
SPAN (Spatial-Projection Alignment) combines two complementary geometric constraints, 3D corner spatial alignment and 3D-2D projection alignment, with a hierarchical task learning strategy. It serves as a plug-and-play module to improve the localization accuracy of arbitrary monocular 3D detectors.
- SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge
-
SPARK starts from a single RGB image, using a VLM to parse coarse URDF parameters, part-level reference images, and an open-state image. A Diffusion Transformer with multi-layer attention then simultaneously generates part-level and global meshes. Finally, differentiable forward kinematics optimize joint parameters, creating end-to-end "sim-ready" articulated objects compatible with physics engines, reducing various URDF errors by over 60% compared to previous methods.
- Sparse-View Localization via Online Neural 3D Regression
-
ON3R addresses extreme sparse-view localization scenarios where database images have minimal overlap (star-topology) and no pre-built 3D maps. It temporarily trains a small MLP online for each query image to regress query keypoints into 3D points (supervised by database reprojection residuals and monocular depth priors). Absolute poses are then estimated via P3P-RANSAC and lightweight BA. It outperforms existing structure-less methods and even exceeds structured HLOC on MegaDepth, Cambridge, and sparsified Aachen datasets.
- SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
-
SparseCam4D is proposed as the first method to achieve 4D reconstruction from sparse cameras (2-3) on standard multi-camera dynamic scene benchmarks. The core innovation is the Spatio-Temporal Distortion Field (STDF), which explicitly models spatio-temporal inconsistencies in generative observations and decouples them from the canonical 4D Gaussian representation, achieving high-fidelity and spatio-temporally consistent dynamic scene rendering.
- Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning
-
Spatial-SAM replaces the "frame-by-frame 2D logit memory" of SAM2 with a Signed Distance Field (SDF) memory pre-computed by a lightweight 3D U-Net. It adopts a dual-track semi-supervised pipeline—bootstrapping pseudo-labels with SAM2 few-shot capabilities followed by alternating training of SDF and masks. With only 1/64 of the slices annotated, it approaches fully supervised SOTA performance across multiple 3D EM datasets while significantly improving inter-slice 3D morphological consistency.
- Spatial Matters: Position-Guided 3D Referring Expression Segmentation
-
Addressing the limitation where 3D referring expression segmentation (3D-RES) focuses solely on semantics and ignores spatial relationships—leading to failures in distinguishing "multiple similar objects"—Position3D explicitly injects relative spatial positions into two stages: space-aware query generation (initializing queries with geometric awareness) and a position-guided deformable attention decoder (progressively shrinking attention from global to local targets). It achieves mIoU scores of 51.0 / 53.2 on ScanRefer and Multi3DRefer, significantly outperforming the previous SOTA, IPDN.
- SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
-
SpatialVID distills 2.71 million dynamic segments (7,089 hours in total) from 21,000 hours of in-the-wild web videos using a three-stage "hierarchical filtering + geometric/semantic annotation + balanced sampling" pipeline. Each segment includes per-frame camera poses, depth, dynamic masks, structured captions, and serialized motion instructions, representing the largest and most comprehensively annotated video dataset for "dynamic scenes + explicit geometry."
- SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors
-
SPE-MVS utilizes metric monocular depth priors to construct a "Spatial Position Encoding (SPE)" in a unified coordinate system for each pixel across views. This encoding is fed into feature extraction and cost volume construction alongside the images. A monocular depth-guided two-stage refinement module is then used to polish the probability maps, significantly improving MVS reconstruction quality in areas where photometric matching fails, such as weak-textured and non-Lambertian surfaces.
- SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping
-
SpeeDe3DGS integrates three modules — Temporal Sensitivity Pruning (TSP), Temporal Sensitivity Sampling (TSS), and Grouped Rigid Motion Distillation (GroupFlow) — into DeformableGS. It accelerates dynamic Gaussian Splatting rendering by 13.71×, reduces training time by 2.53×, and cuts the number of Gaussians to 1/10 while maintaining the image quality of neural deformation fields.
- Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
-
By periodically resetting Gaussian scales (Scale Reset) and imposing entropy constraints on alpha blending weights (Entropy Constraint), the length of the Gaussian list for each pixel is shortened. This achieves a 5–12× acceleration in 3DGS training while maintaining comparable rendering quality.
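To make the entropy constraint concrete, the sketch below computes standard front-to-back alpha blending weights for one pixel and the entropy of their normalized distribution. It is a hedged illustration only: the function names, the normalization by the weight sum, and the use of Shannon entropy are assumptions, since the summary does not give the exact loss.

```python
import math


def blending_weights(alphas):
    """Front-to-back compositing weights w_i = alpha_i * prod_{j<i}(1 - alpha_j)."""
    weights = []
    transmittance = 1.0
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights


def weight_entropy(weights, eps=1e-12):
    """Shannon entropy of the weights after normalizing them to sum to 1."""
    total = sum(weights)
    if total <= eps:
        return 0.0
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > eps)


if __name__ == "__main__":
    spread = blending_weights([0.2, 0.2, 0.2, 0.2])
    peaked = blending_weights([0.05, 0.9, 0.2, 0.2])
    print(weight_entropy(spread), weight_entropy(peaked))
```

Low entropy means a few Gaussians dominate the pixel, so the remaining ones contribute little and the per-pixel list can be cut short. Penalizing entropy therefore pushes the optimization toward shorter effective lists.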
- Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere
-
Addressing the limitation of Spherical Harmonics (SH) in representing high-frequency specular reflections for "view-dependent appearance" in radiance fields, this paper proposes Spherical Voronoi (SV). By using a set of learnable sites to softly partition the sphere into regions, SV serves as an explicit spherical function representation that is easier to optimize than SH or Spherical Gaussians (SG) while sharply modeling glint-level highlights. It is further extended as "learnable lighting probes" for spatially-varying reflections, achieving SOTA results on reflection benchmarks like Ref-NeRF and GlossySynthetic (Ref-NeRF PSNR 36.09).
- SpiderCam: Low-Power Snapshot Depth from Differential Defocus
-
SpiderCam utilizes a beam-splitting prism and two low-power image sensors to capture a pair of differential defocus images. It executes an optimized Differential Depth from Defocus (DfDD) algorithm in a streaming fashion on a low-power FPGA—one too small to store even a single pair of full frames. This represents the first passive 3D camera in literature with a total power consumption under 1 Watt (624 mW @ 32.5 FPS) and an operating range exceeding half a meter.
- Splatent: Splatting Diffusion Latents for Novel View Synthesis
-
Splatent performs 3DGS reconstruction within a frozen diffusion VAE latent space. It utilizes a one-step diffusion model combined with multi-view self-attention to inject high-frequency details—previously "averaged out" by 3D optimization—from neighboring reference views back into the rendered novel view latents. This approach achieves SOTA in latent-space novel view synthesis while preserving the reconstruction quality of the pre-trained VAE.
- SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
-
SplatSuRe avoids uniformly applying super-resolution (SR) to all pixels. Instead, it calculates a fidelity score based on how sufficiently each Gaussian is sampled across views and renders per-view weight maps. SR supervision is injected only into undersampled regions lacking high-frequency observations, resulting in sharper and multi-view consistent high-resolution reconstructions without additional neural components or modifications to the 3DGS backbone.
- SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
-
SR3R redefines 3D Super-Resolution (3DSR) as a feed-forward mapping from sparse low-resolution views to high-resolution 3DGS. Gaussian offset learning and feature refinement yield high-fidelity HR 3DGS reconstructions, with strong zero-shot generalization and no per-scene optimization.
- SRGCD: Stability-Driven Region Growth Framework for 3D Change Detection
-
This paper redefines 3D point cloud change detection from "point-wise binary segmentation" to a stability propagation process that "starts from high-confidence invariant seeds and grows layer-by-layer toward boundaries." It selects seeds using geometric consistency priors and diffuses stability from the core to the boundaries via unidirectional controlled attention, achieving SOTA results with 94.11% / 78.79% mIoU on Urb3DCD / HKCD, respectively.
- ST4R-Splat: Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting
-
This paper introduces the new task of "Spatio-Temporal Referring Segmentation in 4D Gaussian Splatting (STRS-4DGS)" and designs the ST4R-Splat framework: it utilizes time-invariant instance referring embeddings to solve "where" and instance-level temporal state mapping in feature space to solve "when." Combined with an MLLM-based pipeline for automatic spatio-temporal supervision generation, it significantly outperforms adapted SOTA baselines (time-agnostic mIoU 77.67% vs. 43.40%) on a self-constructed benchmark.
- StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets
-
StableMTL repurposes a pre-trained Latent Diffusion Model (Stable Diffusion) into a "single-step latent regressor." It jointly trains 7 dense prediction tasks (semantics, normals, depth, optical flow, scene flow, colorization, and albedo) on three synthetic datasets, each only partially annotated. By replacing per-task losses with a unified latent space MSE loss and facilitating knowledge sharing through an N-to-one "mainstream-auxiliary" task attention mechanism, StableMTL outperforms partially annotated MTL baselines by +4.78 \(\Delta m\) across 8 real-world benchmarks and exhibits strong out-of-distribution generalization.
- STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
-
STAC leverages the spatio-temporal sparsity of KV caches in causal Transformers. Through three modules (working temporal token cache, long-term spatial token cache, and chunked multi-frame optimization), it reduces memory consumption by approximately 10× and increases inference speed by 4× for streaming 3D reconstruction without additional training, while maintaining reconstruction quality.
- STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction
-
STAvatar reconstructs high-fidelity, drivable 3D head avatars from monocular video. Using a UV-adaptive soft binding framework and a temporally adaptive density control strategy, it significantly outperforms existing methods in handling occluded areas (e.g., mouth interior, eyelids) and fine details.
- SunFaded: Illumination-Aware Gaussian Splatting for Dark Scenes with Camera-Mounted Active Lighting
-
Addressing dark scenes with "camera-mounted moving light sources," this paper utilizes 2DGS and albedo attributes to decouple illumination from intrinsic appearance. Through a three-stage training process—incorporating "illumination-weighted loss → image-space tiled shading → albedo-guided geometric prior refinement"—it outperforms methods like DarkGS across PSNR/SSIM/LPIPS while achieving faster training and rendering.
- SuP: Sub-cloud Driven Point Cloud Registration
-
To address the persistent challenge in low-overlap point cloud registration—where geometric and semantic similarities in non-overlapping regions lead to mismatches—SuP reformulates the problem as "mining high-overlap anchor pairs within sub-clouds." By employing a dual-phase mining process (prior weighting for candidate screening + posterior network for consistency verification) followed by merged matching, it establishes new SOTA results on Color3DMatch/3DLoMatch and can serve as a plug-and-play module to enhance existing methods.
- SV-GS: Sparse View 4D Reconstruction with Skeleton-Driven Gaussian Splatting
-
SV-GS reconstructs continuous 4D motion of articulated objects under extreme sparse settings—with only one arbitrary view per timestamp (approx. 20× fewer than typical dense video)—driven by "input skeleton + first-frame static reconstruction." By restricting time-variance exclusively to joint poses for smooth interpolation, it achieves PSNR gains of up to 34% over SOTA on synthetic data.
- SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
-
SwiftTailor is a lightweight two-stage framework for generating 3D garment meshes: PatternMaker predicts sewing patterns, and GarmentSewer converts them into Garment Geometry Images (GGI) in a unified UV space. Combined with inverse mapping and dynamic stitching, it achieves SOTA quality with inference tens of times faster than existing methods.
- TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking
-
The authors propose TagSplat, a topology-aware Gaussian Splatting framework. By explicitly encoding the spatial connectivity between Gaussian primitives, it generates topologically consistent mesh sequences in dynamic scene reconstruction and supports precise 3D keypoint tracking.
- Tavatar: Topology-Aware Gaussian Attribute Derivation for Animatable Human Avatars
-
Tavatar no longer treats the rotation and scale of each 3D Gaussian as freely optimized parameters. Instead, it analytically derives them from the triangular geometry of the underlying deformable mesh. This anchors Gaussians naturally to the mesh topology, preventing them from detaching or creating holes under unseen complex poses (OOD). Normal error is reduced by 13.8% on X-Avatar and 17.9% on PeopleSnapshot compared to the best baseline, while maintaining competitive rendering quality.
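One way to read "analytically derives them from the triangular geometry" is to build an orthonormal frame and in-plane extents directly from a triangle's vertices. The sketch below does exactly that, but it is an assumption-laden illustration: the function `triangle_frame`, the choice of the first edge as the tangent axis, and the fixed `thickness` along the normal are hypothetical and are not taken from the paper.

```python
import math


def sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(vi / length for vi in v)


def triangle_frame(v0, v1, v2, thickness=1e-3):
    """Return (axes, scale) for a Gaussian anchored to a non-degenerate triangle.

    axes  : (tangent, bitangent, normal), i.e. the columns of a rotation matrix
    scale : (length of first edge, triangle height along the bitangent, thickness)
    """
    e1 = sub(v1, v0)
    e2 = sub(v2, v0)
    normal = normalize(cross(e1, e2))
    tangent = normalize(e1)
    bitangent = cross(normal, tangent)  # unit length, since normal is orthogonal to tangent
    scale = (math.sqrt(dot(e1, e1)), dot(e2, bitangent), thickness)
    return (tangent, bitangent, normal), scale


if __name__ == "__main__":
    print(triangle_frame((0, 0, 0), (2, 0, 0), (0, 3, 0)))
```

Because the frame is a deterministic function of the vertices, deforming the mesh moves and reorients the Gaussian automatically, which matches the summary's point that Gaussians stay attached under unseen poses.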
- Learning 3D Reconstruction with Priors in Test Time
-
A Test-Time Constrained Optimization (TCO) framework is proposed. Without retraining or modifying pre-trained multiview Transformer architectures, it significantly improves 3D reconstruction accuracy by enforcing available priors (camera pose, intrinsics, depth) as constraints on the predictions during inference.
- TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures
-
TeHOR utilizes text descriptions as semantic guidance to jointly optimize the geometry and texture of 3D humans and objects through Score Distillation Sampling from pre-trained diffusion models. It breaks the reliance of traditional methods on contact information, achieving accurate and semantically consistent 3D reconstruction for both contact-based and non-contact interactions.
- TESO: Online Tracking of Essential Matrix by Stochastic Optimization
-
TESO models the online extrinsic calibration of stereo cameras as "adaptive stochastic optimization of a robust kernelized epipolar error on the essential matrix manifold." Without any training data and with only two hyperparameters, it tracks camera calibration drifts in real-time with 0.12°-level accuracy, achieving single-frame optimization precision comparable to neural-network-based methods.
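The quantity being optimized in this entry, a robust epipolar error, can be sketched as follows. The essential matrix \(E = [t]_\times R\) and the constraint \(x_2^\top E x_1 = 0\) are standard. The specific kernel used here (a Geman-McClure-style \(r^2 / (r^2 + \sigma^2)\)), the value of `sigma`, and all function names are assumptions for illustration, since the summary does not name the kernel or the optimizer details.

```python
def skew(t):
    """Cross-product matrix [t]_x so that skew(t) @ v == t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def essential(rotation, translation):
    """E = [t]_x R for the relative pose X2 = R X1 + t."""
    return matmul(skew(translation), rotation)


def epipolar_residual(e, x1, x2):
    """Algebraic epipolar error x2^T E x1 for normalized homogeneous points."""
    ex1 = [sum(e[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * ex1[i] for i in range(3))


def robust_cost(residuals, sigma=0.01):
    """Assumed robust kernel: each term saturates at 1 for large residuals."""
    return sum(r * r / (r * r + sigma * sigma) for r in residuals)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    e = essential(identity, (1.0, 0.0, 0.0))
    inlier = epipolar_residual(e, (0.1, 0.04, 1.0), (0.3, 0.04, 1.0))
    outlier = epipolar_residual(e, (0.1, 0.04, 1.0), (0.3, 0.50, 1.0))
    print(inlier, outlier, robust_cost([inlier, outlier]))
```

A saturating kernel bounds the influence of any single bad match, which matters for online tracking where each frame contributes only a noisy batch of correspondences.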
- Text-Driven 3D Hand Motion Generation from Sign Language Data
-
Utilizing large-scale sign language videos, sign language dictionaries, and LLMs, this work automatically constructs a dataset of 1.3 million "text-3D hand motion" pairs (BOBSL3DT). From this, the authors train HandMDM, a hand motion diffusion model driven by free-text descriptions (hand shape, position, finger/arm movement), which demonstrates strong generalization to unseen gestures, various sign languages, and non-sign language hand movements.
- Text–Image Conditioned 3D Generation
-
This paper observes that image and text conditions provide complementary information in 3D generation—images provide precise appearance but are limited by viewpoint, while text provides global semantics but lacks visual detail. It proposes TIGON, a minimalist dual-branch DiT baseline that achieves native 3D generation under joint text-image conditioning through zero-initialized cross-modal bridges (early fusion) and step-wise prediction averaging (late fusion).
- TextFM: Robust Semi-dense Feature Matching with Language Guidance
-
TextFM is the first framework to introduce text semantics from Visual-Language Models (VLM) into semi-dense feature matching. It utilizes text embeddings to generate instance-level queries that inject domain-invariant semantics into coarse matching, employs LoRA for efficient fine-tuning of Visual Foundation Models (VFM), and overlays illumination-invariant physical priors, significantly outperforming existing methods like EfficientLoFTR under cross-domain and day-night variations.
- The Midas Touch for Metric Depth
-
MTD (Midas Touch for Depth) employs a training-free, mathematically interpretable "coarse-to-fine" algorithm to convert relative depth from foundation models into metric depth using extremely sparse 3D seeds (e.g., LiDAR or stereo matching). It aligns local scales via segment-wise graph optimization and performs pixel-level detailing using "discontinuity-aware geodesic cost + dynamic programming." It outperforms SOTA methods like BP-Net, DMD3C, and Marigold-DC in zero-shot depth completion and estimation, with a backend latency of only 1.9 ms.
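The basic idea of anchoring relative depth to a few metric seeds can be shown with a single global scale-and-shift fit. This is a deliberate simplification and not the method summarized above, which aligns scales per segment with graph optimization: the function names, the affine model in the depth domain, and the closed-form least-squares fit are all assumptions for illustration.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~ scale * relative + shift at the seed pixels."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    var = sum((r - mean_r) ** 2 for r in relative)
    scale = cov / var  # assumes the seeds do not all share one relative depth
    shift = mean_m - scale * mean_r
    return scale, shift


def to_metric(relative_depths, scale, shift):
    """Apply the fitted affine map to every relative depth value."""
    return [scale * d + shift for d in relative_depths]


if __name__ == "__main__":
    # Three hypothetical sparse seeds: relative depth and measured metric depth.
    scale, shift = fit_scale_shift([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
    print(scale, shift, to_metric([0.5, 4.0], scale, shift))
```

A single global fit cannot correct scale errors that vary across the image, which is presumably why a segment-wise alignment followed by pixel-level refinement is used in the work described here.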
- Think-Then-Generate: Structural Chain-of-Thought Reasoning for Consistent 3D Generation
-
Thoughtful3D introduces Chain-of-Thought (CoT) reasoning into SDS-style 3D generation. It utilizes a "think-then-generate" two-stage structural reasoning framework: 3DBlueprint-CoT for semantic parsing and stage-wise sub-goal decomposition before generation, and 3DRefine-CoT for multi-round reflection-correction of rendering artifacts during generation. Coupled with a cross-view semantic-appearance alignment loss, the method significantly alleviates multi-view inconsistency, the Janus problem, and guidance collapse, achieving comprehensive improvements in quality and consistency for text-to-3D and image-to-3D tasks.
- TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
-
TokenGS transforms feed-forward 3D Gaussian Splatting reconstruction from "regressing depth along a ray for each pixel" to "directly regressing 3D coordinates using a set of learnable Gaussian tokens." This completely decouples the number of Gaussians from input resolution and the number of views, achieving SOTA performance on both static and dynamic scenes with cleaner geometry, robustness to pose noise, and support for low-cost test-time token fine-tuning.
- TokenHand: Discrete Token Representation for Efficient Hand Mesh Reconstruction
-
TokenHand encodes a 3D hand into \(M\) discrete tokens within a shared codebook and reframes "single-image hand mesh reconstruction" from a regression problem to a token classification problem. The classifier predicts the category of each token, while a pre-trained lightweight decoder restores the 778-vertex mesh without post-processing. It achieves a PA-MPJPE of 5.7mm at 65 FPS with only 3.0M parameters on FreiHAND.
- Topology-aware Feature Propagation for Unsupervised Non-rigid Point Cloud Correspondence
-
Addressing the limitation in unsupervised non-rigid point cloud correspondence where "feature propagation based on spatial proximity connects physically disjoint parts," this paper proposes learning deformation-robust shape topology. It utilizes topology confidence weights and a Topology-aware Transformer within a "coarse-to-fine" pipeline to propagate features, supplemented by a vector quantization (VQ) codebook, achieving SOTA results across four benchmarks.
- TopoMA: Topology-Guided Multi-Agent Dense RGB 3D Reconstruction via Distributed Inference
-
TopoMA utilizes persistent homology to learn a "scene topological skeleton" connecting subgraphs of various agents. This skeleton serves as a unified coordination core for attention bias, loop closure gating, and residual transmission. This allows multiple agents to reconstruct and incrementally optimize local maps under purely distributed, server-free conditions, achieving globally consistent large-scale RGB dense reconstruction using only lightweight topological messages.
- TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification
-
TopoMesh unifies Ground Truth (GT) and predicted meshes under the Dual Marching Cubes (DMC) topological framework. This enables explicit vertex- and face-level correspondence for the first time, supporting direct mesh-level supervision (topology, vertex positions, face normals). The F1-Sharp metric improves by 5.9–7.1% over the previous SOTA, demonstrating a clear advantage in preserving sharp features.
- TouchDream: 3D Object Completion through Imagined Touch
-
TouchDream uses a conditional diffusion model to "imagine" tactile signals on object surfaces—generating compact tactile latent vectors from coarse point clouds and sampled poses. These vectors are decoded into local geometry and fused back into the point cloud. This provides fine-grained local geometric guidance for point cloud completion without any physical touch, achieving SOTA performance on PCN, ShapeNet55-34, and KITTI.
- Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds
-
PointINS adds an "offset branch" to point cloud self-supervised pre-training, enabling the model to learn to predict offset vectors pointing to instance centers without any labels. It prevents representation collapse using two complementary regularizers: ODR (Offset Distribution Regularization), which aligns with global statistical priors, and SCR (Spatial Clustering Regularization), which enforces local grouping. This transforms traditional SSL representations from being "semantics-only" to "semantics and geometry-aware," resulting in an average improvement of +3.5% mAP in indoor instance segmentation and +4.1% PQ in outdoor panoptic segmentation across five datasets.
- Towards Generalized Multimodal Homography Estimation
-
To address the failure of homography estimation models when the modality changes, this paper uses style transfer to synthesize, from a single image, misaligned image pairs that differ in texture and color but share the same structure and come with ground-truth offsets. Supervised training on this synthetic data then generalizes zero-shot to unseen modalities. In addition, CCNet is designed to fuse cross-scale information and decouple color from features, which further reduces cross-dataset MACE.
- Towards Intrinsic-Aware Monocular 3D Object Detection
-
MonoIA proposes transforming numerical camera intrinsics into language-guided semantic representations (generated via LLM descriptions + CLIP encoding). These are integrated into the detection network through a hierarchical adaptation module, achieving zero-shot generalization to unseen focal lengths and unified cross-dataset training, reaching new SOTA on KITTI, Waymo, and nuScenes.
- Towards Visual Query Localization in the 3D World
-
This work extends "Visual Query Localization (VQL)" from 2D video to the 3D world. The authors construct 3DVQL, the first 3D multimodal VQL benchmark (2002 sequences, 170k frames, 6.4K response tracks, 38 categories, Point Cloud+RGB+Depth modalities, per-frame 9DoF box annotations). Furthermore, they propose LaF, a method that lifts 2D features into 3D voxels along the viewing frustum and performs point cloud-image fusion via depth-aware attention, significantly outperforming multimodal baselines adapted from VQLoC across all metrics.
- TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast
-
TR2M predicts pixel-level scale and shift maps from an image and its text description, converting highly generalizable but scale-less relative depth into metric depth. It achieves cross-domain zero-shot metric depth estimation with only 19M trainable parameters and 102K training images.
- Tracking-Guided 4D Generation: Foundation-Tracker Motion Priors for 3D Model Animation
-
Track4DGen injects frame-by-frame point correspondences from a foundation point tracker (CoTracker3) into the intermediate features of multi-view video diffusion models and 4D Gaussian reconstruction. By using explicit feature-level temporal supervision to suppress appearance drift in 4D asset generation, it outperforms baselines like Animate3D on both video generation and 4D generation benchmarks.
- Tracking by Predicting 3-D Gaussians Over Time
-
Video-GMAE is a self-supervised method that encodes a video as "a set of 3-D Gaussian primitives drifting over time": it predicts complete Gaussians for the first frame and only residual displacements for subsequent frames. This inductive bias forces the network to learn cross-frame pixel correspondences, enabling zero-shot point tracking without any tracking annotations. After fine-tuning, it exceeds previous self-supervised methods by 34.6% and 13.1% on Kinetics and Kubric, respectively.
- TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos
-
TROPHIES proposes the new task of "unified reconstruction of humans, scenes, and cameras from multi-view videos." Using a decoupled human branch + a plug-and-play scene branch + a global alignment optimization module, it integrates dynamic humans, static geometry, and camera trajectories into a single metric-consistent 4D world coordinate system, reducing W-MPJPE by more than half on EgoHumans / EgoExo4D.
- tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
-
tttLRM introduces Test-Time Training (TTT) into large-scale 3D reconstruction models for the first time. By utilizing LaCT layers, it achieves long-context and autoregressive 3D Gaussian reconstruction with linear complexity. It compresses multi-view observations into TTT fast weights to form an implicit 3D representation, which is then decoded into explicit formats like 3DGS, achieving SOTA performance on both object-level and scene-level datasets.
- Turbo-GS: Accelerating 3D Gaussian Fitting for High-Resolution Radiance Fields
-
Turbo-GS reduces the 3DGS fitting time for 4K scenes from hours to approximately 10 minutes (e.g., 13 minutes for 4K bicycle, 3× faster than Taming 3DGS and 14× faster than 3DGS) through a trifecta of "dilated rendering computing sparse sub-pixel pairs + power-law convergence-aware Gaussian budget scheduling + color gradient-assisted densification," while maintaining or even improving rendering quality (especially LPIPS).
- TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting
-
TWINGS utilizes Thin Plate Splines (TPS) to non-rigidly align dense point clouds back-projected from monocular depth to sparse 3D control points obtained via multi-view triangulation. Dense and geometrically accurate initial point clouds are then sampled near these control points and provided to 3DGS as a plug-and-play module. This significantly outperforms existing methods in extremely sparse-view scenarios on DTU / LLFF / Mip-NeRF360 (e.g., DTU 3-view PSNR 21.52, >1.6 dB higher than the runner-up).
- UIKA: Fast Universal Head Avatar from Pose-Free Images
-
UIKA proposes a feed-forward drivable 3D Gaussian head avatar model. It projects an arbitrary number of "pose-free" input images (single image, multi-view, or mobile video) into a shared UV space via per-pixel face UV correspondence. A UV attention branch aggregates multi-view information to decode canonical space Gaussians, enabling reconstruction in a single forward pass and real-time driving at 220 FPS. It outperforms existing SOTA in both monocular and multi-view settings.
- ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting
-
This paper theoretically proves that optimizing 3DGS feature fields via \(\alpha\)-blending introduces inherent bias to 3D point features. It proposes ULF-Loc, which replaces biased feature optimization with "Geometric Weighted Multi-view Feature Fusion," selects reliable landmarks through "Keypoint Consensus Sampling," and eliminates mismatches caused by rendering artifacts using "Local Geometric Consistency Verification." On Cambridge Landmarks, it reduces the average median translation error by 17% compared to SOTA, while requiring only 1/10 of the training time and 1/6 of the VRAM of STDLoc.
- Unblur-SLAM: Dense Neural SLAM for Blurry Inputs
-
Unblur-SLAM does not simply integrate a deblurring network into the SLAM front-end. Instead, it revolves around the critical decision of "which blurry frames can be deblurred before tracking and which must be modeled directly in 3D space." It designs a complete pipeline including blur detection, physically constrained deblurring, 3D Gaussian blur refinement, and a fallback mechanism for severe blur, effectively handling both motion and defocus blur while significantly improving tracking and reconstruction quality.
- Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field
-
GAVIS models the "visibility" of each Gaussian particle relative to the training viewpoints as a direction-dependent anisotropic visibility field in 3DGS. This field is analytically constructed and queried (training-free, within 1 second) using Spherical Harmonics (SH). Integrating it into Bayesian-style uncertainty-aware rasterization provides reliable, real-time (200 FPS) uncertainty estimation for robotic active mapping, significantly outperforming FisherRF, VIMC, and NVF in both accuracy and efficiency.
- Underground Plant Exploration: Non-Destructive 3D Root Assessment with GPR Based on Point Graph Neural Network
-
This paper reconstructs underground plant root systems as 3D point clouds non-destructively using Ground Penetrating Radar (GPR). The method first detects hyperbolas formed by root reflections on GPR B-scans and regresses their geometric parameters to obtain sparse 3D points. These points are then completed into a dense root system using a Point Graph Neural Network (Point GNN) featuring residual graph convolutions and dual-pooling attention, followed by an upsampling module. On simulated data, it achieves a detection AP of 0.857 and a reconstruction EMD of 5.03%, outperforming existing methods with the smallest parameter count of 20.98M.
- Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images
-
Uni3R utilizes a VGGT-style Cross-View Transformer to perform a single feed-forward prediction of 3D Gaussians with semantic features from an arbitrary number of unposed multi-view images. It achieves novel view synthesis, open-vocabulary 3D segmentation, and depth estimation simultaneously in 0.15 seconds, setting new SOTAs on multiple benchmarks including RE10K and ScanNet.
- UniCorrn: Unified Correspondence Transformer Across 2D and 3D
-
UniCorrn unifies three types of geometric correspondence—image-image (2D-2D), image-point cloud (2D-3D), and point cloud-point cloud (3D-3D)—into a single "query keypoint \(\rightarrow\) regress correspondence coordinate" task using a weight-sharing Transformer. It achieves stackable end-to-end matching via a dual-stream attention decoder (where appearance and position streams share the same attention matrix). It matches SOTA performance on 2D-2D tasks and outperforms previous best methods by 8% and 10% in registration recall on 7Scenes (2D-3D) and 3DLoMatch (3D-3D), respectively.
- UniDAC: Universal Metric Depth Estimation for Any Camera
-
UniDAC decouples monocular metric depth into two components: relative depth and a spatially-varying scale map. Using a unified model trained exclusively on perspective views, it achieves zero-shot metric depth estimation on wide field-of-view cameras like fisheye and 360°. By leveraging a depth-guided scale upsampling module and RoPE-ϕ, a positional encoding adapted for Equirectangular Projection (ERP) geometry, it significantly outperforms previous SOTA in cross-camera generalization.
- Unified Primitive Proxies for Structured Shape Completion
-
UniCo learns unified primitive representations on shared shape features via primitive proxies. It jointly predicts complete point clouds and assembly-ready quadric primitives (including geometry, semantics, and membership) in a single forward pass, reducing Chamfer distance by up to 50% and improving normal consistency by up to 7% on synthetic and real-world point cloud benchmarks.
- UniLight: A Unified Representation for Lighting
-
UniLight projects four historically incompatible lighting representations—environment maps, images, irradiance maps, and text—into a single joint latent space via contrastive learning. By adding a Spherical Harmonics (SH) prediction auxiliary task to lock in light direction information, it supports three downstream tasks: cross-modal lighting retrieval, environment map generation, and diffusion-based relighting.
- UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching
-
UniPixie reformulates "inferring physical properties from vision" from deterministic point estimation to controllable probabilistic distribution modeling. Using a shared Perceiver-IO encoder and a conditional Flow Matching decoder, it generates physical parameters along a "soft-to-hard" continuum from a single visual input. It is the first unified architecture to simultaneously produce plug-and-play parameters for MPM, LBS, and Spring-Mass solvers, reducing Young's modulus error by over 50% compared to the strongest deterministic baseline.
- UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair
-
UniPR uses a single stereo image pair and a single forward pass to simultaneously detect and reconstruct all objects in a scene with true physical scale. It eliminates scale ambiguity via stereo geometric constraints and abandons "per-category pre-defined canonical spaces" through Pose-Aware Shape Representation (PASR), achieving 100× faster scene reconstruction and approximately 3× higher shape proportion accuracy compared to image-to-3D models.
- UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass
-
UniSH employs a feed-forward network to simultaneously output scene geometry, camera parameters, and metric-scale SMPL humans from monocular videos. By utilizing "expert depth model distillation + coarse-to-fine human-scene alignment," it transfers priors trained on synthetic data to real-world in-the-wild videos, achieving joint scene and human reconstruction in a single forward pass.
- Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
-
MonoCoP transforms the coupled attributes of size, orientation, and depth in monocular 3D detection from "independent parallel prediction" to feature-level chain-of-prediction (size→orientation→depth propagation with residual aggregation). It employs an Uncertainty-Guided Selector to dynamically switch between chain and parallel paths, significantly improving 3D detection performance on KITTI, nuScenes, and Waymo, especially for distant objects.
- Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge
-
To address the lack of functional semantics in 3D encoders and insufficient geometric cues in sparse point clouds, this paper leverages semantic knowledge from 2D visual foundation models (e.g., DINOv3). Through "Cross-Modal Affinity Transfer" (CMAT) pre-training, the 3D encoder is aligned with the inter-patch relationship structure of 2D features. Combined with a lightweight prompt segmenter, the method achieves SOTA performance on PIAD/PIADv2/LASO using significantly fewer parameters than MLLM-based approaches.
- Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
-
This paper uses a rigorous set of ablations to uncover which training factors truly determine performance in feed-forward multi-frame visual geometry estimation (represented by VGGT). It discovers that commonly used learnable confidence losses and spatial gradient losses actually hinder performance, and local region alignment causes accuracy drops. Based on these findings, it proposes a consistency loss combined with an efficient high-resolution adaptation module, integrated into the CARVE model, achieving leading and robust results across 7 benchmarks for point cloud reconstruction, video depth, and camera pose estimation.
- Unsupervised 3D Motion Estimation Using Event Camera
-
Leveraging the observation that event-camera data exhibit dilation/contraction streaks on different projection axes that reflect depth changes, this paper derives an analytical relationship between optical flow divergence and "motion in depth" (MID) to provide initial values. These are refined by a Directional Expansion Modulation (DEM) module. Finally, MID is incorporated into event-level warping and jointly optimized using contrast maximization, enabling fully unsupervised estimation of both 2D optical flow and motion along the line of sight. It achieves accuracy far exceeding unsupervised baselines on CarlaEvent3D.
- Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors
-
KeyDiff3D treats pre-trained multi-view diffusion models as "sources of geometric priors"—using them both to generate multi-view images from a single image for self-supervision and to extract implicit 3D geometric cues from intermediate features to be lifted into explicit voxels. Consequently, it predicts accurate and generalizable 3D keypoints from a single image without any 3D annotations, camera parameters, or multi-view acquisition (achieving 119mm MPJPE on Human3.6M single-view, surpassing all single-view unsupervised baselines and matching some multi-view methods).
- Unsupervised Multi-Scale Segmentation of 3D Subcellular World with Stable Diffusion Foundation Model
-
Without training or manual annotation, this work directly utilizes attention features from a pre-trained Stable Diffusion model for spectral clustering, combined with heuristic feature aggregation and adaptive thresholding. This framework segments multi-scale subcellular structures—ranging from large membranes to small ribosomes—in cryo-electron tomograms (cryo-ET). Downstream models trained on the resulting pseudo-labels approach the performance of expert manual annotations.
- Urban-GS: A Unified 3D Gaussian Splatting Framework for Compact and High-Fidelity Aerial-to-Street Reconstruction
-
Urban-GS unifies drone aerial viewpoints and street-level viewpoints into a single 3D Gaussian Splatting framework. It employs "projected area weighted densification + contribution weighted anchor pruning + global-to-local two-stage optimization" to simultaneously address cross-view scale conflicts, memory explosion, and under-optimized regions. It outperforms the SOTA Horizon-GS in rendering quality across multiple urban scenes while reducing anchor storage by an average of 41%.
- UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation
-
UST-Hand utilizes conditional normalizing flows to model 2D hand joints from each view as a probabilistic distribution rather than deterministic points. By sampling multiple hypotheses and triangulating them into a unified probabilistic 3D point cloud space, followed by iterative refinement using a Spatiotemporal Point Transformer (STPT), the model reduces the Mean Per Vertex Position Error (MPVPE) by up to 37.8% relative to previous SOTAs under self-supervised settings with noisy 2D pseudo-labels.
- UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions
-
UZ3DVG completely excludes the VLM from the inference pipeline. It leverages RGB-D scenes during training to automatically generate "3D spatial description pseudo-labels + reasoning chains," then distills this reasoning logic into a lightweight student network. This allows the inference stage to rely solely on point clouds and text—without any dependence on 2D images or LLM/VLM interaction. It achieves zero-shot SOTA on ScanRefer and NR3D while reaching an inference speed of 7.69 FPS (approximately 38x faster than existing methods).
- V-DPM: 4D Video Reconstruction with Dynamic Point Maps
-
V-DPM extends "Dynamic Point Maps (DPM)," which previously only handled image pairs, to entire videos. Through a two-stage "time-varying + time-invariant" point map decomposition and a time-conditioned decoder, the model fine-tunes the pre-trained static reconstructor VGGT using a small amount of synthetic data. It achieves single-pass feed-forward 4D reconstruction—simultaneously recovering 3D shapes, camera parameters, and the motion of every point in the scene—with 2-view errors approximately 5 times lower than previous SOTA.
- VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes
-
VAD-GS targets the issues of sparse point clouds and minimal camera overlap in autonomous driving scenes. By utilizing voxel visibility reasoning, it proactively identifies instances with missing or distorted geometry. It then selects supporting views across cameras and timestamps for Multi-View Stereo (MVS) reconstruction to supplement missing structures as reliable geometric priors for initializing new Gaussians. This approach extends MVS-based densification to moving objects for the first time, achieving state-of-the-art rendering quality and geometric consistency on Waymo and nuScenes.
- Variational Graph-based Normal Integration
-
This paper reformulates the "normal map → depth" integration problem as a unified optimization objective on a directed weighted graph. It explicitly models depth discontinuities using triplets combined with a two-component Gaussian Mixture Model (GMM), and employs variational inference to alternately solve for depth and graph weights. This approach not only outperforms the current SOTA (BiNI) on regular grids but also directly handles scattered oriented points, which grid-based methods fail to process.
- VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
-
This paper proposes VarSplat, the first 3DGS-SLAM system that learns per-splat appearance variance \(\sigma^2\) and renders a pixel-wise uncertainty map \(V\) via the Law of Total Variance. This uncertainty is used consistently across tracking, submap registration, and loop closure, achieving robust and leading performance across four datasets.
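To make the Law of Total Variance step concrete, here is a minimal single-channel sketch. It is not the paper's renderer: it assumes standard front-to-back alpha-blending weights, normalizes them so they act as mixture probabilities, and sums the within-splat and between-splat variance terms. All function names and input values are hypothetical.

```python
from typing import List, Sequence, Tuple


def blend_weights(alphas: Sequence[float]) -> List[float]:
    """Front-to-back weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights = []
    transmittance = 1.0
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights


def pixel_mean_and_variance(
    alphas: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
) -> Tuple[float, float]:
    """Blend per-splat means and variances into one pixel mean and variance.

    Law of Total Variance for a mixture:
        Var = sum_i w_i * sigma_i^2 + sum_i w_i * (mu_i - mean)^2
    Assumption: weights are renormalized to sum to 1.
    """
    weights = blend_weights(alphas)
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("no splat contributes to this pixel")
    weights = [w / total for w in weights]
    mean = sum(w * m for w, m in zip(weights, means))
    within = sum(w * v for w, v in zip(weights, variances))
    between = sum(w * (m - mean) ** 2 for w, m in zip(weights, means))
    return mean, within + between


# Two splats along one ray; values are illustrative only.
mean, var = pixel_mean_and_variance(
    alphas=[0.5, 1.0], means=[0.0, 1.0], variances=[0.1, 0.3]
)
```

In this toy case both splats receive weight 0.5, so the pixel variance combines the average per-splat variance (0.2) with the spread between the two splat colors (0.25).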
- VDFE: Difference-Aware 3D Scene Editing with Non-Intrusive Video Diffusion Priors for Multi-View Consistency and Efficiency
-
VDFE decomposes text-driven 3D scene editing into three steps: performing multi-view consistent flow editing via video diffusion priors, accurately localizing editing regions through flow differences, and selectively updating only the Gaussians in those regions. This achieves precise and efficient controllable editing of 3D Gaussian Splatting (3DGS) scenes without intrusive modifications to pre-trained video diffusion models.
- Velox: Learning Representations of 4D Geometry and Appearance
-
Velox utilizes a Perceiver encoder to compress unstructured spatiotemporal colored point clouds into a small set of "dynamic tokens" (>30× compression). These tokens are jointly supervised by two complementary decoders: a Flow Matching 4D surface decoder for geometry and a 3D Gaussian decoder for appearance. This creates a general latent representation that characterizes both 4D geometry and appearance without requiring temporal correspondences, which can be directly applied to video-to-4D generation, 3D tracking, and cloth simulation, achieving SOTA performance.
- VENI: Variational Encoder for Natural Illumination
-
VENI establishes a prior for outdoor natural illumination using an \(SO(2)\) rotation-equivariant Variational Autoencoder. It utilizes a novel Vector Neuron Vision Transformer (VN-ViT) as the encoder and adopts the equivariant neural field from RENI++ as the decoder. By directly encoding spherical environment maps into a well-structured, unique latent space, VENI achieves smoother interpolation than the decoder-only RENI++, scales to large datasets, and enhances performance in downstream tasks such as inverse rendering.
- VGA: Empowering Aerial-Ground Localization by Visual Geometry Alignment
-
VGA addresses extreme wide-baseline 6-DoF relative pose estimation between uncalibrated aerial drone views and ground views. It learns two additional physical priors on top of a MASt3R backbone—a gravity alignment prior derived from perspective fields and a planar orientation prior from Procrustes alignment of views projected onto a shared top-down plane. These are used as geometric constraints in test-time joint optimization to refine the pose, improving AUC@30° by approximately 11% over the second-best method on MatrixCity / ACC-NVS1 / ULTRRA.
- VGG-T3: Offline Feed-Forward 3D Reconstruction at Scale
-
VGG-T3 compresses the variable-length KV representation of the global attention layers in VGGT into a fixed-size MLP via Test-Time Training (TTT). This reduces the computational complexity of offline feed-forward 3D reconstruction from \(O(n^2)\) to \(O(n)\), enabling large-scale scene reconstruction with thousands of images (1k images in 58 seconds).
- VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation
-
VGGT-360 reformulates panoramic monocular depth estimation as a problem of "reconstructing a globally consistent 3D model using VGGT-like 3D foundation models from multiple views and then reprojecting it back to the panorama." Through three training-free plug-and-play modules (Uncertainty-Guided Adaptive Projection, Structural Saliency-Enhanced Attention, and Correlation-Weighted 3D Refinement), it unifies fragmented depth results from independent view inference into cross-view consistent results, surpassing both supervised and training-free SOTA methods on multiple indoor and outdoor datasets in a zero-shot manner.
- VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
-
This paper proposes VGGT-Det, the first multi-view indoor 3D object detection framework designed for sensor-geometry-free (SG-Free) input. By mining semantic priors (Attention-Guided query generation, AG) and geometric priors (Query-Driven feature aggregation, QD) inside the VGGT encoder, it outperforms state-of-the-art methods by 4.4 and 8.6 points on ScanNet and ARKitScenes, respectively.
- VGGT-\(\Omega\)
-
This work systematically "scales up" feed-forward 3D reconstruction models like VGGT. By utilizing register attention, lightweight dense heads, and single-head multi-task supervision, the training memory is reduced to approximately 30% of the original. Combined with a large-scale data pipeline capable of labeling dynamic videos and DINO-style self-supervised distillation, the model is scaled from 0.2B to 10B parameters using 15× more data. It achieves a new SOTA across six static and dynamic benchmarks (e.g., Sintel camera pose AUC@3° increased from 22.5 to 40.0, a 77% relative gain, while being 50× faster than MegaSaM).
- VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement
-
VIAFormer reformulates "repairing incomplete and noisy voxels" as a Conditioned Voxel Refinement task guided by multi-view images. It explicitly assigns 3D coordinates to 2D image tokens using an Image Index, learns a direct "dirty-to-clean" correction trajectory via Correctional Flow, and achieves bidirectional cross-modal fusion with a Hybrid Stream Transformer. It reaches SOTA performance on both Vision Foundation Model (VFM) outputs and synthetic noise, achieving an IoU gain of up to 39.1% on synthetic noise.
- VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network
-
VIMCAN integrates Mamba's linear complexity for temporal modeling with Cross-Attention's cross-modal spatial reasoning into a hybrid architecture. By fusing RGB keypoints and wearable IMU data, it achieves 17.2 mm MPJPE on TotalCapture while supporting real-time inference at 60+ FPS on consumer-grade hardware.
- Vista4D: Video Reshooting with 4D Point Clouds
-
Vista4D lifts input videos into "temporally persistent" 4D point clouds where static pixels are preserved across time. By rendering these clouds from a target camera and feeding them into a fine-tuned video diffusion model alongside the source video, it "reshoots" the scene from new angles while preserving dynamics. The model is trained on noisy multi-view data to ensure robustness against real-world 4D reconstruction artifacts.
- Volumetric Functional Maps
-
This paper is the first to extend the mature functional maps framework from surface geometry processing to 3D volumes (tetrahedral meshes). By constructing a discretization-invariant function space using the eigenfunctions of the volumetric Laplacian operator, it establishes dense correspondences within the volume interior, supporting applications such as connectivity transfer, segmentation transfer, and solid textures, while also improving the accuracy of classical surface shape matching.
- Voxify3D: Pixel Art Meets Volumetric Rendering
-
Voxify3D transforms 3D meshes into "Lego/pixel block" style voxel art using a differentiable two-stage voxel radiance field. First, coarse geometry and color are learned via DVGO. Then, stylization is achieved through six-view orthographic pixel art supervision, patch-level CLIP semantic loss, and palette-constrained Gumbel-Softmax discrete color optimization. The framework produces voxel art with clear semantics, clean color blocks, and controllable abstraction (2-8 colors, 20\(\times\)-50\(\times\) resolution), achieving a CLIP-IQA of 37.12 and 77.90% user preference.
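As an illustration of the palette-constrained color step in the Voxify3D entry, the sketch below shows a generic Gumbel-Softmax relaxation over a small palette. It is not the paper's implementation: the palette, logits, temperature, and function names are all hypothetical, and plain Python stands in for a differentiable framework.

```python
import math
import random
from typing import List, Sequence, Tuple

Color = Tuple[float, float, float]


def gumbel_softmax(
    logits: Sequence[float], tau: float, rng: random.Random
) -> List[float]:
    """Sample relaxed one-hot weights: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_palette_color(weights: Sequence[float], palette: Sequence[Color]) -> Color:
    """Blend palette colors by the relaxed weights (one voxel)."""
    r = sum(w * c[0] for w, c in zip(weights, palette))
    g = sum(w * c[1] for w, c in zip(weights, palette))
    b = sum(w * c[2] for w, c in zip(weights, palette))
    return (r, g, b)


# Hypothetical 3-color palette and per-voxel logits.
palette = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
logits = [2.0, 0.5, -1.0]
rng = random.Random(0)
weights = gumbel_softmax(logits, tau=0.5, rng=rng)
color = soft_palette_color(weights, palette)
hard_index = max(range(len(weights)), key=weights.__getitem__)
```

Lowering `tau` pushes the weights toward a one-hot vector, so each voxel converges to a single palette color, while a higher `tau` keeps the selection soft enough for gradient-based optimization.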
- Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
-
This paper introduces the Wanderland real-to-sim framework, which utilizes handheld multi-sensor scanners (LiDAR+IMU+RGB) to capture open-world indoor and outdoor scenes. By employing LIV-SLAM to obtain metric-grade precise geometry and camera poses, combined with 3DGS for photorealistic rendering and geometrically grounded collision simulation, the authors construct a large-scale dataset of 530 scenes, 420,000 frames, and 3.8 million \(m^2\). The system demonstrates that pure visual reconstruction falls far short of LiDAR-enhanced solutions in terms of metric accuracy, mesh quality, and the reliability of navigation policy training and evaluation.
- Wavelet-Driven 3D Anomaly Detection under Pose-Agnostic and Sparse-View
-
Addressing the issues of overfitting and pose misalignment in Pose-Agnostic Anomaly Detection (PAD) due to insufficient observations in sparse-view scenarios, this paper proposes Wave-Pose3D. The method migrates 3D Gaussian reconstruction, pose estimation, and anomaly scoring into the wavelet frequency domain, utilizing low frequencies for global structure and high frequencies for details. SOTA performance is achieved under 10% and 20% sparse-view conditions.
- WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation
-
WeatherCity integrates "2D image weather editing + a multi-weather Gaussian representation with shared features + physics-driven particle simulation" into a unified framework. This allows 4D autonomous driving scenes to undergo controllable switching and intensity adjustment for sunny, rainy, snowy, and foggy conditions after reconstruction. It leads in metrics such as CLIP-S and Sem-CS on Waymo/nuScenes datasets and achieves a rendering speed of 25.67 FPS.
- What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?
-
This paper systematically ablates the design space of synthetic training data for stereo matching (including floating objects, backgrounds, materials, and baselines). It finds that the combination of "realistic indoor scenes + dense floating objects + wide baselines" is optimal. Based on these findings, WMGStereo-150k is constructed, which outperforms hybrid training using four classic datasets while using only a single dataset.
- Where, What, Why: Toward Explainable 3D-GS Watermarking
-
This paper proposes a representation-native 3D-GS watermarking framework that selects carriers via Trio-Experts (where), controls gradients using a Channel-wise Group Mask (what), and achieves auditable attribution through decoupled fine-tuning (why). It surpasses the previous SOTA in both rendering quality (PSNR +0.83 dB) and bit accuracy (+1.24%).
- WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
-
WildRayZer extends RayZer, a self-supervised pose-free large view synthesis model, to real-world "moving camera and moving objects" scenarios. By utilizing a static renderer that only captures rigid structures, the model automatically discovers dynamic objects through rendering residuals, distills a motion mask estimator to discard dynamic tokens before scene encoding, and gates dynamic pixels in the rendering loss. This enables the synthesis of clean "transient-free" novel views in a single feed-forward pass without any pose or mask annotations.
- WonderZoom: Multi-Scale 3D World Generation
-
Starting from a single image, WonderZoom allows users to interactively "zoom in" on any area of a 3D scene, autoregressively synthesizing finer-scale content that did not exist previously (ranging from vast landscapes to microscopic details like a ladybug on a petal). Using an incrementally updatable scale-adaptive Gaussian Splatting representation combined with a progressive detail synthesizer, it significantly outperforms existing video and 3D world generation models in both quality and text alignment.
- WorldGen: From Text to Traversable and Interactive 3D Worlds
-
WorldGen decomposes the "Text → Traversable, Editable 3D World" process into a four-stage pipeline: "Procedural Layout → Navmesh-conditioned Global Reconstruction → Scene Decomposition → Per-object Enhancement." It utilizes an LLM-driven procedural generator to fix traversable structures first, then employs image generators and image-to-3D priors to complete appearance and details. It produces a \(50\times50\) meter scene—directly importable into game engines with support for character climbing and jumping—in approximately 5 minutes.
- X-band Radar Non-Line-of-Sight Imaging
-
This work replaces optical and millimeter-wave (mmWave) sensors with 10 GHz X-band radar for non-line-of-sight (NLOS) imaging. By leveraging longer wavelengths, it transforms "diffuse reflection" on rough walls into "specular reflection." A neural network utilizing "dense prediction + geometric-aware residual reconstruction" is employed to counter the low angular resolution inherent in long wavelengths, extending the effective "corner imaging" distance from a few meters (optical) to 40 m in real-world scenarios.
- Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion
-
Yo'City is a multi-agent framework that achieves personalized, text-driven, boundless 3D city generation through "City–District–Grid" hierarchical planning, a "Produce–Refine–Evaluate" isometric image synthesis loop, and a scene graph-guided expansion mechanism. The method outperforms existing approaches such as SynCity in both semantic consistency and visual quality.
- Z-Order Transformer for Feed-Forward Gaussian Splatting
-
The method utilizes Z-order (Morton) space-filling curves to rearrange cluttered per-pixel Gaussians into 1D sequences that maintain spatial locality. Combined with a sparse Transformer employing "group attention + top-k attention" and Z-order pooling, it predicts high-quality 3D Gaussians in a single feed-forward pass. This reduces the number of Gaussians to 1/2–1/3 of DepthSplat/AnySplat, while inference is approximately 1000x faster than per-scene optimized 3DGS.
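The Z-order (Morton) reordering mentioned in this entry can be sketched as follows. The example assumes Gaussian centers have already been quantized to non-negative integer grid coordinates; the 10-bit default, the function names, and the sample points are illustrative choices, not details from the paper.

```python
from typing import List, Tuple

Point = Tuple[int, int, int]


def morton3d(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z into one Morton code."""
    limit = 1 << bits
    for value in (x, y, z):
        if not 0 <= value < limit:
            raise ValueError(f"coordinate {value} outside [0, {limit})")
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)      # x takes bit 3i
        code |= ((y >> i) & 1) << (3 * i + 1)  # y takes bit 3i + 1
        code |= ((z >> i) & 1) << (3 * i + 2)  # z takes bit 3i + 2
    return code


def z_order_sort(points: List[Point], bits: int = 10) -> List[Point]:
    """Sort quantized 3D points along the Z-order curve."""
    return sorted(points, key=lambda p: morton3d(p[0], p[1], p[2], bits))


# Hypothetical quantized Gaussian centers.
points = [(2, 0, 0), (0, 0, 1), (1, 1, 1), (0, 0, 0), (1, 0, 0)]
ordered = z_order_sort(points)
```

Points that share high-order bits in all three coordinates end up adjacent in the sorted sequence, which is the locality property that lets a 1D attention window or pooling step cover a compact region of space.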
- Zero-Shot Depth Completion with Vision-Language Model
-
Sparse depth is injected into a minimally modified VLM (Qwen2.5-VL 3B) via "visual tokens + text prompts + text supervision." This lets the model learn "where to fill and where to preserve" as if following verbal instructions, enabling zero-shot depth completion without dense ground truth and achieving up to a 17.3% improvement across 7 cross-domain benchmarks.
- Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image
-
DynaAvatar proposes the first zero-shot framework to reconstruct animatable 3D human avatars with motion-dependent cloth dynamics from a single image. By utilizing a static-to-dynamic knowledge transfer strategy and an optical flow-guided DynaFlow loss, it achieves realistic garment dynamic modeling under limited dynamic data, significantly outperforming existing methods.
- ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
-
ZipMap "compresses" an entire image collection into a fixed-size fast-weight MLP using Test-Time Training (TTT) layers, enabling bidirectional feed-forward 3D reconstruction (camera pose + depth + point cloud) in linear time. It achieves or exceeds the accuracy of quadratic-complexity methods like VGGT/π³, reconstructing over 700 frames in under 10 seconds (20× faster than VGGT), while the resulting implicit scene state can be queried in real-time for novel view geometry and appearance.
- Zoo3D: Zero-Shot 3D Object Detection at Scene Level
-
Zoo3D proposes the first fully training-free (zero-shot) scene-level 3D object detection framework. It directly constructs 3D boxes via graph clustering of 2D instance masks and assigns semantic labels using an open-vocabulary module featuring "best-view selection + SAM refinement + multi-scale CLIP." By leveraging DUSt3R, it relaxes input requirements from point clouds to unposed RGB images, outperforming all self-supervised methods on ScanNet200 and ARKitScenes in a zero-shot setting.