🧊 3D Vision

📷 CVPR2026 · 252 paper notes

3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image: This paper proposes a novel paradigm termed in-place completion, which extends pretrained object-level generative priors to the scene level, directly completing fragmented geometry at its original spatial location without explicit pose alignment. The authors also construct ARSG-110K, a 110K-scale scene-level dataset, and substantially outperform baselines such as MIDI and Gen3DSR.
3D-IDE: 3D Implicit Depth Emergent: This paper proposes the Implicit Geometry Emergence Principle (IGEP), which employs a lightweight geometric validator and a global 3D teacher for privileged supervision during training, enabling the visual encoder to acquire 3D perception from RGB video input alone. The approach incurs zero latency overhead at inference time and surpasses comparable methods on multiple 3D scene understanding benchmarks.
3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction: This paper proposes Self-Constrained Priors (SCP), which construct a TSDF distance field by fusing depth maps rendered from the current 3D Gaussians. This field serves as a prior to impose geometry-aware constraints on Gaussians (outlier removal, opacity constraint, and surface attraction), enabling high-fidelity surface reconstruction that achieves state-of-the-art performance on NeRF-Synthetic and DTU benchmarks.
3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds: This paper proposes LAM3C, a framework that, for the first time, demonstrates that video-generated point clouds (VGPCs) reconstructed from unlabeled online videos (e.g., property walkthroughs) can replace real 3D scans for 3D self-supervised pre-training. By introducing a Laplacian smoothing loss and a noise consistency loss to stabilize representation learning on noisy point clouds, and paired with the authors' RoomTours dataset (49K scenes), LAM3C matches or surpasses methods that rely on real 3D scans on indoor semantic and instance segmentation benchmarks.
3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience: This paper proposes 3DrawAgent, a training-free framework that enables a frozen LLM to acquire 3D spatial reasoning through contrastive knowledge extraction (CKE), generating language-driven 3D Bézier sketches in an autoregressive manner without any parameter updates, achieving performance competitive with trained methods.
4C4D: 4 Camera 4D Gaussian Splatting: This paper proposes the 4C4D framework, which employs a Neural Decaying Function to adaptively control Gaussian opacity decay, addressing the geometry–appearance learning imbalance in sparse-view (only 4 cameras) 4D Gaussian Splatting, and achieves state-of-the-art performance across multiple benchmarks.
4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video: This work decomposes equine 4D reconstruction into two sub-tasks — motion estimation (AniMoFormer: spatiotemporal Transformer + post-optimization) and appearance reconstruction (EquineGS: feed-forward 3DGS) — bridged by the VAREN parametric model. Trained exclusively on synthetic data (VarenPoser + VarenTex), the method achieves state-of-the-art performance on real-world benchmarks APT-36K and AiM, and generalizes zero-shot to zebras and donkeys.
4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video: This paper proposes the 4DEquine framework, which disentangles 4D equine reconstruction from monocular video into two subproblems — dynamic motion estimation (AniMoFormer) and static appearance reconstruction (EquineGS) — achieving SOTA on real-world data while training exclusively on synthetic data.
A2Z-10M+: Geometric Deep Learning with A-to-Z BRep Annotations for AI-Assisted CAD Modeling and Reverse Engineering: This work constructs A2Z, a large-scale multimodal CAD dataset comprising 1M+ complex models with 10M+ annotations (high-resolution 3D scans, hand-drawn 3D sketches, text descriptions, and BRep topology labels), providing an unprecedented data foundation for Scan-to-BRep reverse engineering and multimodal BRep learning. Foundation models trained on A2Z substantially outperform existing methods on edge and junction detection.
A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection: This paper proposes SeDiR, a framework for semantically disentangled unified 3D anomaly detection, comprising three modules: Coarse-to-Fine Global Tokenization (CFGT), Category-Conditioned Contrastive Learning (C3L), and Geometry-Guided Decoder (GGD). SeDiR addresses the Inter-Category Entanglement (ICE) problem and outperforms the state of the art by 2.8% and 9.1% AUROC on Real3D-AD and Anomaly-ShapeNet, respectively.
GAP: Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation: GAP leverages a pretrained 3D geometric foundation model (π³) to extract 3D features, fuses them with 2D semantic features and proprioception, and jointly predicts future action sequences and future 3D point maps via conditional diffusion, achieving state-of-the-art performance on RoboTwin 2.0 and real-world bimanual manipulation benchmarks.
Action-guided Generation of 3D Functionality Segmentation Data: This paper presents SynthFun3D, the first method for automatically generating 3D functionality segmentation training data from action descriptions. By leveraging metadata-driven 3D object retrieval and scene layout generation, it produces precise part-level interaction masks without manual annotation. Training on combined synthetic and real data yields +2.2 mAP / +6.3 mAR / +5.7 mIoU improvements on the SceneFun3D benchmark.
ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion: ActionMesh minimally extends a pretrained 3D diffusion model with a temporal axis (temporal 3D diffusion), then employs a temporal 3D autoencoder to convert independent shape sequences into topology-consistent animated meshes. The method generates production-quality animated 3D meshes from diverse inputs (video, text, or 3D mesh) in just 2 minutes, achieving state-of-the-art performance in both geometric accuracy and temporal consistency.
Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation: To address the slow multi-step denoising of diffusion policies and the mode-averaging collision problem of one-step Flow Matching, this paper proposes Ada3Drift, a training-time drifting field that attracts predictions toward the nearest expert demonstration while repelling other modes, combined with multi-scale field aggregation and a sigmoid-scheduled loss transition. It preserves multimodal action distributions with a single function evaluation (1 NFE) at inference and reaches state-of-the-art performance on Adroit, Meta-World, RoboTwin, and real robots.
Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation: Ada3Drift proposes shifting the iterative refinement of diffusion policies from inference time to training time. By introducing a training-time drifting field—attracting predicted actions toward expert modes while repelling other generated samples—it achieves high-fidelity one-step (1 NFE) 3D visuomotor policies, reaching state-of-the-art performance on Adroit, Meta-World, RoboTwin, and real-robot tasks, with a 10× speedup at inference.
Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning: BayesMM proposes a training-free dynamic Bayesian distribution learning framework that models textual and geometric modalities as Gaussian distributions and automatically balances modality weights via Bayesian model averaging, achieving robust test-time adaptation across multiple point cloud benchmarks with an average improvement exceeding 4%.
AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction: This paper proposes AeroDGS, a physics-guided 4D Gaussian Splatting framework for monocular UAV video. It introduces a Monocular Geometry Lifting module to reconstruct reliable static and dynamic geometry, and incorporates differentiable physical priors — ground support, upright stability, and trajectory smoothness — to resolve ambiguous image cues into physically consistent motion estimates, outperforming existing methods on both synthetic and real UAV scenes.
AffordGrasp: Cross-Modal Diffusion for Affordance-Aware Grasp Synthesis: AffordGrasp proposes a diffusion-based cross-modal framework that generates physically plausible and semantically consistent hand grasp poses from text instructions and object point clouds, via affordance-guided latent diffusion and a Distribution Adjustment Module (DAM), significantly outperforming existing methods on four benchmarks.
AffordMatcher: Affordance Learning in 3D Scenes from Visual Signifiers: AffordMatcher proposes a method for localizing affordance regions in 3D scenes from visual signifiers (RGB images depicting human interactions). Through the large-scale AffordBridge dataset and a Match-to-Match attention mechanism based on dissimilarity matrices, it achieves 53.4 mAP on zero-shot affordance segmentation, surpassing the second-best method by 7.8 points.
Affostruction: 3D Affordance Grounding with Generative Reconstruction: This paper proposes Affostruction, which completes object geometry (including unobserved regions) via sparse voxel fusion-based generative reconstruction, models the multimodal distribution of affordances using Flow Matching, and performs affordance region localization on complete 3D shapes — achieving a 54.8% improvement in reconstruction IoU and a 40.4% improvement in affordance aIoU.
AnchorSplat: Feed-Forward 3D Gaussian Splatting with 3D Geometric Priors: AnchorSplat proposes an anchor-aligned feed-forward 3DGS framework that leverages 3D geometric priors (sparse point clouds) as anchors to predict Gaussians directly in 3D space. Using approximately 20× fewer Gaussians and half the reconstruction time, it achieves state-of-the-art performance on ScanNet++ v2 (PSNR 21.48) with superior depth estimation accuracy.
AnthroTAP: Learning Point Tracking with Real-World Motion: AnthroTAP proposes an automated pipeline that generates large-scale pseudo-labeled point tracking data from real-world human motion videos via SMPL fitting and optical flow filtering. Using only 1.4K videos and 4 GPUs for one day of training, it achieves state-of-the-art performance on the TAP-Vid benchmark, surpassing BootsTAPIR which uses 15M videos.
AnyPcc: Compressing Any Point Cloud with a Single Universal Model: AnyPcc proposes a Universal Context Model (UCM) that integrates dual-granularity spatial and channel priors, combined with an Instance-Adaptive Fine-Tuning (IAFT) strategy, to achieve state-of-the-art point cloud geometry compression across 15 diverse datasets using a single model, yielding approximately 12% bitrate reduction over G-PCC v23.
APC: Transferable and Efficient Adversarial Point Counterattack for Robust 3D Point Cloud Recognition: APC proposes a lightweight input-level purification module that neutralizes adversarial attacks by generating point-wise counter-perturbations, trained under dual geometric and semantic consistency constraints to achieve strong robustness across diverse attacks and models.
ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions: ArtHOI presents the first complete pipeline for reconstructing 4D interactions between hands and articulated objects (e.g., scissors, glasses, laptops) from monocular RGB video. Through Adaptive Sampling Refinement (ASR) for metric scale and pose estimation, and an MLLM-guided hand-object alignment strategy, the method outperforms the baseline RSRD—which requires pre-scanned object geometry—across multiple datasets.
ArtLLM: Generating Articulated Assets via 3D LLM: ArtLLM formulates articulated object generation as a language generation problem. A 3D multimodal LLM autoregressively predicts part layouts and kinematic joint parameters (discretized as tokens) from point cloud input, followed by XPart-based high-fidelity part geometry synthesis. The method significantly outperforms existing approaches on PartNet-Mobility (mIoU 0.69) with inference in only 19 seconds.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models: This paper proposes AVA-Bench, the first systematic evaluation benchmark that decouples the capabilities of vision foundation models (VFMs) into 14 atomic visual abilities (AVAs). By aligning training-test distributions and isolating individual abilities during evaluation, AVA-Bench precisely identifies the strengths and weaknesses of VFMs, and reveals that a 0.5B small model can maintain VFM ranking consistency comparable to a 7B model.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models: This paper proposes AVA-Bench, which decomposes the evaluation of vision foundation models (VFMs) into 14 "atomic visual abilities" (AVAs). Through train/test distribution alignment and single-ability isolation testing, AVA-Bench precisely identifies the strengths and weaknesses of VFMs. A key finding is that a 0.5B LLM preserves the same VFM ranking as a 7B LLM, reducing evaluation cost by \(8\times\).
AvatarPointillist: AutoRegressive 4D Gaussian Avatarization: AvatarPointillist proposes an autoregressive (AR) generative framework for constructing 4D Gaussian avatars: a decoder-only Transformer generates 3DGS point clouds (with binding information) token by token, followed by a Gaussian Decoder that predicts rendering attributes for each point. This approach breaks free from fixed template topology, enables adaptive point density adjustment, and comprehensively outperforms baselines such as LAM and GAGAvatar on NeRSemble.
Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection: BTP is the first work to apply pretrained point-language models (PLMs, e.g., ULIP) to zero-shot 3D anomaly detection. It proposes a Multi-Granularity Feature Embedding Module (MGFEM) that fuses patch-level semantics, geometric descriptors, and global CLS tokens, coupled with a joint representation learning strategy. BTP achieves 84.5% point-level AUROC on Real3D-AD, substantially outperforming the VLM-rendering-based method PointAD (73.5%).
BRepGaussian: CAD Reconstruction from Multi-View Images with Gaussian Splatting: BRepGaussian is the first method to reconstruct complete B-rep CAD models directly from multi-view images. It employs a two-stage 2D Gaussian splatting framework to learn edge and patch features, followed by parametric fitting to produce watertight boundary representations, without requiring point cloud supervision.
BulletGen: Improving 4D Reconstruction with Bullet-Time Generation: BulletGen generates novel views at selected "bullet-time" frozen frames using a static video diffusion model. The generated views are precisely localized and used to supervise 4D Gaussian scene optimization, achieving state-of-the-art performance in extreme novel view synthesis and 2D/3D tracking from monocular video input alone.
Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?: This paper proposes TABLeT, which leverages a pretrained 2D natural image autoencoder (DCAE) to compress 3D fMRI volumes into as few as 27 continuous tokens per frame. Paired with a standard Transformer encoder, this enables unprecedented long-range temporal modeling (256 frames), surpassing SOTA voxel-based methods on multiple tasks across UKB, HCP, and ADHD-200, while significantly improving computational efficiency.
CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction: CARI4D is presented as the first category-agnostic method for reconstructing metric-scale 4D human-object interactions from monocular RGB video. It covers object shape reconstruction, pose tracking, hand contact reasoning, and physics-constrained optimization, and generalizes zero-shot to unseen categories.
Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation: This paper proposes Catalyst4D, a framework that propagates high-quality 3D static editing results into 4D dynamic Gaussian scenes through two modules — Anchor-based Motion Guidance (AMG) and Color Uncertainty-guided Appearance Refinement (CUAR) — achieving spatiotemporally consistent, high-fidelity dynamic scene editing.
Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation: This paper proposes Catalyst4D, a framework that propagates mature 3D static editing results into 4D dynamic Gaussian scenes via Anchor-based Motion Guidance (AMG, which establishes region-level correspondences using optimal transport) and Color Uncertainty-guided Appearance Refinement (CUAR, which automatically identifies and corrects occlusion artifacts). The method consistently outperforms existing approaches in CLIP semantic similarity.
CGHair: Compact Gaussian Hair Reconstruction with Card Clustering: CGHair uses hair-card-guided hierarchical clustering and a shared Gaussian appearance codebook to achieve over 200× compression of appearance parameters and a 4× speedup in strand reconstruction while maintaining comparable visual quality.
Changes in Real Time: Online Scene Change Detection with Multi-View Fusion: This paper presents the first scene change detection (SCD) method that simultaneously achieves online inference, pose-agnosticism, label-free operation, and multi-view consistency. By replacing hard-threshold heuristics with a self-supervised fusion (SSF) loss that integrates pixel-level and feature-level change cues into a 3DGS change representation, the proposed approach surpasses all existing offline methods in detection accuracy while operating in real time at over 10 FPS.
CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation: CLIPoint3D is the first CLIP-based few-shot unsupervised 3D point cloud domain adaptation framework. Through knowledge-driven prompt tuning, parameter-efficient fine-tuning, entropy-guided view selection, and an uncertainty-aware alignment loss, it achieves consistent accuracy improvements of 3–16% on PointDA-10 and GraspNetPC-10 with only ~11M trainable parameters.
CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration: CMHANet integrates 2D image texture-semantic features with 3D point cloud geometric features via a cross-modal hybrid attention mechanism, combined with a contrastive learning objective, achieving state-of-the-art point cloud registration performance on 3DMatch/3DLoMatch.
CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration: CMHANet proposes a three-stage hybrid attention mechanism (geometric self-attention → image aggregation attention → source-target cross-attention) to fuse 2D image texture semantics with 3D point cloud geometric information, complemented by a cross-modal contrastive loss. The method achieves state-of-the-art registration recall on 3DMatch/3DLoMatch (92.4%/75.5%) and a zero-shot RMSE of only \(0.76\times10^{-2}\) on TUM RGB-D.
Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass: CHROMM is a unified framework that jointly estimates camera parameters, scene point clouds, and human body meshes (SMPL-X) from multi-person multi-view video in a single forward pass, without external modules or preprocessed data. It achieves competitive performance on global human motion estimation and multi-view pose estimation tasks while running more than 8× faster than optimization-based methods.
Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass: CHROMM is a unified framework that integrates the geometric prior of Pi3X and the human prior of Multi-HMR into a single feed-forward network, enabling joint reconstruction of cameras, scene point clouds, and SMPL-X human meshes from multi-person multi-view video in a single pass—without external modules, preprocessing, or iterative optimization. It achieves a multi-view WA-MPJPE of 53.1 mm on RICH and runs more than 8× faster than HAMSt3R.
Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation: Context-Nav elevates the contextual information embedded in long-form textual descriptions from a posterior verification signal to a proactive exploration prior. By constructing a context-driven value map to guide frontier selection and performing viewpoint-aware 3D spatial relation verification at candidate target locations, Context-Nav achieves state-of-the-art performance on InstanceNav and CoIN-Bench without any task-specific training.
Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment: This paper proposes GSA (Gaussian Splatting Alignment), the first method for category-level cross-instance registration of 3DGS models. It combines geometry-aware feature-guided coarse alignment (extending ICP to solve similarity transformations) with multi-view feature consistency fine alignment, substantially outperforming existing methods in both same-instance and cross-instance scenarios.
CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image: CrowdGaussian proposes a unified framework for reconstructing multi-person 3D Gaussian splatting representations from a single image. It recovers complete geometry of occluded regions via a self-supervised-adapted Large Occluded Human Reconstruction Model (LORM), and enhances texture detail quality through a single-step diffusion refiner (CrowdRefiner) trained with Self-Calibrated Learning (SCL).
CUBE: Representing 3D Faces with Learnable B-Spline Volumes: This paper proposes CUBE (Control-based Unified B-spline Encoding), a hybrid geometric representation combining B-spline volumes with learnable high-dimensional control features. Through a two-stage decoding pipeline (B-spline basis interpolation followed by a lightweight MLP residual), CUBE enables editable, high-fidelity 3D face reconstruction and scan registration.
CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization: CustomTex is a framework that achieves high-fidelity, instance-controllable texture generation for 3D indoor scenes through instance-level multi-reference image conditioning and a dual distillation training strategy (semantic-level VSD distillation + pixel-level super-resolution distillation), surpassing existing methods in semantic consistency, texture sharpness, and reduction of baked-in shading.
Dark3R: Learning Structure from Motion in the Dark: Dark3R is a teacher-student distillation framework that transfers the 3D priors of MASt3R to extremely low-light (SNR < −4 dB) raw images, enabling Structure from Motion (SfM) and novel view synthesis in dark environments where traditional methods fail entirely.
DeepShapeMatchingKit: Accelerated Functional Map Solver and Shape Matching Pipelines Revisited: This paper proposes a vectorized reformulation of the functional map solver achieving a 33× speedup, identifies and documents two undocumented implementation variants of DiffusionNet, introduces balanced accuracy as a supplementary metric for partial matching evaluation, and releases a unified open-source codebase.
Deformation-based In-Context Learning for Point Cloud Understanding: This paper proposes DeformPIC, which reframes point cloud In-Context Learning from a "masked reconstruction" paradigm to a "deformation transfer" paradigm. A Deformation Extraction Network (DEN) extracts task-specific semantics, and a Deformation Transfer Network (DTN) applies the extracted deformation to the query point cloud, reducing Chamfer Distance (CD) by 1.6/1.8/4.7 on reconstruction, denoising, and registration, respectively.
DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization: This paper natively integrates the Kannala-Brandt fisheye projection model into the 3DGS pipeline and proposes a cross-view joint optimization strategy based on feature overlap, eliminating the information loss caused by pre-undistortion and achieving state-of-the-art performance on multiple public benchmarks.
DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis: This paper proposes DMAligner, which reformulates image alignment from the traditional optical flow warping paradigm into an "alignment-oriented view synthesis" task. By leveraging a conditional diffusion model to directly generate complete aligned images, and combining a purpose-built DSIA synthetic dataset with a Dynamics-aware Mask Producing (DMP) module, DMAligner effectively eliminates the ghosting and occlusion artifacts inherent to warp-based methods, achieving state-of-the-art performance across multiple benchmarks.
DROID-W: DROID-SLAM in the Wild: This paper proposes DROID-W, which introduces uncertainty estimation into differentiable Bundle Adjustment (Uncertainty-aware BA), combined with a DINOv2-feature-driven dynamic uncertainty update mechanism and monocular depth regularization. This enables robust camera pose estimation and scene reconstruction for DROID-SLAM in highly dynamic in-the-wild scenarios, running in real time at approximately 10 FPS.
DropAnSH-GS: Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting: To address overfitting in 3DGS under sparse-view settings, this paper proposes DropAnSH-GS, which replaces independent random Dropout with Anchor-based Dropout—dropping entire clusters of spatially correlated Gaussians around selected anchors to disrupt local redundancy compensation—while introducing Spherical Harmonics (SH) Dropout to suppress high-order SH overfitting and enable lossless post-training compression.
DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction: DuoMo decomposes world-space human motion reconstruction into two independent diffusion models: a camera-space model that extracts generalizable motion estimates from video in camera coordinates, and a world-space model that refines the noisy lifted proposals into globally consistent world-space motion. By directly generating mesh vertex motion rather than SMPL parameters, DuoMo reduces W-MPJPE by 16% on EMDB and 30% on RICH.
Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields: This paper proposes PI-DEF, a physics-informed coordinate neural network framework that jointly reconstructs the 4D (temporal + 3D spatial) emissivity field and 3D velocity field of gas near a black hole. Under sparse EHT measurements, PI-DEF significantly outperforms BH-NeRF, which enforces hard Keplerian dynamical constraints.
E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training: E-RayZer is the first truly self-supervised feed-forward 3D Gaussian reconstruction model. It replaces RayZer's implicit latent scene representation with explicit 3D Gaussians and incorporates a visual-overlap-based curriculum learning strategy. Without any 3D annotations, it learns geometrically grounded 3D-aware representations and far outperforms RayZer on pose estimation (RPA@5° rises from ≈0 to 90.8%). On downstream 3D tasks under frozen-backbone probing, it significantly surpasses mainstream pre-trained models such as DINOv3 and CroCo v2, and even rivals the supervised VGGT.
E2EGS: Event-to-Edge Gaussian Splatting for Pose-Free 3D Reconstruction: This paper proposes E2EGS, a fully pose-free 3D reconstruction framework driven entirely by event streams. It extracts noise-robust edge maps from event streams via patch-based temporal consistency analysis, leverages edge information to guide Gaussian initialization and weighted loss optimization, and achieves high-quality trajectory estimation and 3D reconstruction without any depth model or RGB input.
Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow: This paper proposes a feed-forward 3D asset editing framework built upon the TRELLIS 3D generation backbone. It achieves globally consistent geometric deformation in a sparse voxel latent space via Voxel FlowEdit, and recovers high-frequency details through normal-guided multi-view texture refinement.
Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy via Spherical Harmonics: This paper proposes E3Flow, the first equivariant flow matching policy framework based on spherical harmonic representations. It introduces a Feature Enhancement Module (FEM) to dynamically fuse point cloud and image modalities, and combines rectified flow for efficient equivariant action generation. E3Flow achieves an average success rate 3.12% higher than the strongest baseline SDP across 8 MimicGen tasks while delivering a 7× speedup in inference.
Ego-1K: A Large-Scale Multiview Video Dataset for Egocentric Vision: This paper presents Ego-1K, a large-scale temporally synchronized egocentric multiview video dataset comprising 956 short clips (12+4 cameras, 60Hz), addressing the data gap in egocentric dynamic 3D reconstruction, and demonstrates that stereo depth guidance can substantially improve 4D novel view synthesis quality.
EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding: This paper proposes EmbodiedSplat, the first online feed-forward semantic 3DGS framework. It achieves memory-efficient per-Gaussian semantic representation via a sparse coefficient field and a CLIP global codebook, and integrates 3D geometry-aware features to enable full-scene open-vocabulary 3D understanding at 5–6 FPS over 300+ streaming frames.
EMGauss: Continuous Slice-to-3D Reconstruction via Dynamic Gaussian Modeling in Volume Electron Microscopy: This paper reformulates the anisotropic slice reconstruction problem in volume electron microscopy (vEM) as a dynamic 3D scene rendering task based on deformable 2D Gaussian splatting, achieving high-fidelity continuous slice synthesis under sparse data conditions via a Teacher-Student pseudo-label mechanism.
EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization: This paper proposes EmoTaG, an emotion-aware 3D talking head synthesis framework built upon FLAME-Gaussian structural priors and a Gated Residual Motion Network (GRMN). It achieves few-shot personalization from as little as 5 seconds of video while jointly addressing emotion expressiveness, lip-audio synchronization, and geometric stability.
Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator: This paper proposes Hand4Whole++, a modular framework that injects features from a pretrained hand estimator into a frozen whole-body pose estimator via a lightweight CHAM module, enabling accurate wrist orientation prediction and transferring fine-grained finger joints and hand shape from a hand model via differentiable rigid alignment.
EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors: This paper proposes EventHub, a training data factory for event-based stereo matching that requires no annotation from active sensors such as LiDAR. It generates proxy event-depth pairs via novel view synthesis and transfers knowledge from RGB stereo models through cross-modal distillation. The resulting event stereo models surpass LiDAR-supervised counterparts in cross-domain generalization, reducing error by up to 50% on M3ED and MVSEC.
Extend3D: Town-Scale 3D Generation: This paper proposes Extend3D, a training-free 3D scene generation pipeline that extends the voxel latent space of a pretrained object-level 3D generative model (Trellis). It introduces overlapping patch joint denoising, under-noising SDEdit initialization, and 3D-aware optimization to generate town-scale 3D scenes from a single image, surpassing existing methods in both human preference evaluations and quantitative metrics.
ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting: This paper proposes the extrinsic paradigm, which fully decouples semantics from 3DGS geometry. By combining multi-granularity overlapping object grouping with VLM-generated text hypotheses, it constructs a lightweight semantic index layer that enables training-free, low-storage, and ambiguity-aware open-vocabulary 3D scene understanding.
FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning: This paper proposes FaceCam, a system that addresses camera control in monocular portrait videos by using facial landmarks as a scale-aware camera representation, thereby avoiding the scale ambiguity inherent in conventional extrinsic camera representations. Two data augmentation strategies—synthetic camera motion and multi-clip stitching—are further designed to support continuous camera trajectory inference.
FACT-GS: Frequency-Aligned Complexity-Aware Texture Reparameterization for 2D Gaussian Splatting: FACT-GS reframes texture parameterization as a sampling density allocation problem, employing a learnable deformation field to achieve frequency-adaptive non-uniform texture sampling, substantially improving high-frequency detail recovery under a fixed parameter budget.
Fall Risk and Gait Analysis using World-Spaced 3D Human Mesh Recovery: This paper proposes a gait analysis pipeline based on GVHMR (world-grounded 3D human mesh recovery) that extracts spatiotemporal gait parameters from monocular video of older adults performing the Timed Up and Go (TUG) test, validating the correlation between video-derived metrics and wearable sensor measurements as well as their association with fall risk.
Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration: This paper proposes Fast3Dcache, a training-free geometry-aware caching framework for 3D diffusion models. It dynamically allocates cache budgets via a Predictive Cache Scheduling Constraint (PCSC) based on voxel stabilization patterns, and selects stable tokens for reuse via a Spatiotemporal Stability Criterion (SSC) using velocity and acceleration signals. The method achieves up to 27.12% throughput improvement and 54.83% FLOPs reduction with only ~2% degradation in geometric quality.
Fast SceneScript: Fast and Accurate Language-Based 3D Scene Understanding via Multi-Token Prediction: This paper proposes Fast SceneScript, which introduces multi-token prediction (MTP) into structured language models for 3D scene understanding to accelerate inference. Combined with self-speculative decoding (SSD) and confidence-guided decoding (CGD) to filter unreliable tokens, as well as a parameter-efficient head-sharing mechanism, the method achieves 5.09× and 5.14× speedups on layout estimation and object detection respectively without accuracy loss.
FastGS: Training 3D Gaussian Splatting in 100 Seconds: FastGS is a multi-view consistency-based acceleration framework for 3DGS that precisely controls Gaussian count via View-Consistent Densification (VCD) and View-Consistent Pruning (VCP). It achieves scene training in approximately 100 seconds on datasets such as Mip-NeRF 360, delivering over 15× speedup over vanilla 3DGS with comparable rendering quality.
FF3R: Feedforward Feature 3D Reconstruction from Unconstrained Views: FF3R is the first fully annotation-free feedforward framework capable of jointly performing geometric reconstruction and open-vocabulary semantic understanding from unconstrained multi-view image sequences, achieving a 180× speedup over optimization-based methods when processing 64+ images.
FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction: This paper proposes FluidGaussian, which guides active view selection in 3D reconstruction using uncertainty metrics propagated through fluid simulation, yielding reconstructions that are not only visually faithful but also physically plausible under interactive simulation.
ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph: This paper proposes ForgeDreamer, a framework that addresses domain semantic adaptation in industrial settings via multi-expert LoRA teacher-student distillation, and enforces high-order geometric consistency through cross-view hypergraph geometric enhancement, outperforming existing methods on industrial text-to-3D generation tasks.
Foundry: Distilling 3D Foundation Models for the Edge: This paper proposes the Foundation Model Distillation (FMD) paradigm and the Foundry framework. Through a compress-and-reconstruct objective, the student model learns a set of SuperTokens that compress the basis vectors of the teacher's latent space. The resulting single distilled model retains generality across classification, segmentation, and few-shot tasks, while reducing FLOPs from 478G to as low as 137G.
FreeArtGS: Articulated Gaussian Splatting Under Free-Moving Scenario: FreeArtGS addresses articulated object reconstruction from monocular RGB-D video under a free-moving scenario, where both object pose and joint state change arbitrarily and simultaneously. The proposed three-stage pipeline — motion-driven part segmentation, robust joint estimation, and end-to-end 3DGS optimization — substantially outperforms all baselines on the newly introduced FreeArt-21 benchmark and existing datasets.
FreeScale: Scaling 3D Scenes via Certainty-Aware Free-View Generation: FreeScale scales limited real-world data into large-scale training sets by sampling high-quality free-view images from existing scene reconstructions guided by certainty estimation, achieving a 2.7 dB PSNR improvement on feed-forward novel view synthesis models.
FE2E: From Editor to Dense Geometry Estimator: This paper systematically analyzes the fine-tuning behavior of image editing models versus generative models for dense geometry estimation. It finds that editing models possess inherent structural prior advantages, and proposes the FE2E framework — the first to adapt a DiT-based image editing model as a joint depth and normal estimator — achieving substantial zero-shot improvements over existing SOTA (35% AbsRel reduction on ETH3D).
From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images: A two-stage pipeline for reconstructing city-scale 3D models from sparse satellite images: Z-Monotonic SDF for geometry to ensure structural integrity of buildings, followed by a fine-tuned FLUX diffusion model for "deterministic inpainting" that synthesizes photorealistic textures from degraded maps, enabling view extrapolation of nearly 90° from orbit to ground level.
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection: This work shifts keypoint detection from an "image-pair matching" paradigm to "sequence-level trackability optimization." The proposed reinforcement learning framework, TraqPoint, directly optimizes long-term keypoint tracking quality over image sequences, achieving state-of-the-art performance on pose estimation, visual localization, visual odometry, and 3D reconstruction tasks.
FunREC: Reconstructing Functional 3D Scenes from Egocentric Interaction Videos: This paper presents FunREC, a training-free optimization-based method that reconstructs functional articulated 3D digital twin scenes directly from egocentric RGB-D interaction videos. It automatically discovers articulated parts, estimates kinematic parameters, tracks 3D motion, and reconstructs both static and dynamic geometry. FunREC substantially outperforms prior methods across all benchmarks (part segmentation mIoU improves by 50+, joint angle error reduced by 5–10×) and supports simulation export and robotic interaction.
GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator: This paper proposes GaussFusion, a geometry-informed video-to-video generative model that conditions a video generator on a rendered Gaussian Primitives Buffer (GP-Buffer) — encoding depth, normals, opacity, and covariance — to effectively remove floaters, flickering, and blurring artifacts in 3DGS reconstructions. The framework is compatible with both optimization-based and feed-forward reconstruction paradigms, and its distilled variant achieves real-time inference at 16 FPS.
GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance: This paper proposes GaussianGrow, which replaces the conventional paradigm of jointly predicting geometry and appearance from scratch by "growing" 3D Gaussians from readily available 3D point clouds. It employs a geometry-aware multi-view diffusion model to generate consistent appearance supervision, and addresses view-fusion artifacts and invisible-region problems through an overlap-region detection mechanism coupled with an iterative inpainting strategy, achieving substantial improvements over state-of-the-art methods on both synthetic and real-scan point clouds.
GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis: This paper proposes Data-to-Data Flow Matching (D2D-FM) to directly learn deterministic transformations between view pairs, and regularizes flow paths via probability density geodesics so that trajectories propagate along high-density data manifolds, achieving improved view consistency and geometric fidelity in novel view synthesis.
GeodesicNVS: Probability Density Geodesic Flow Matching for Novel View Synthesis: This paper proposes a Probability Density Geodesic Flow Matching (PDG-FM) framework that replaces the noise-to-data diffusion process with a deterministic data-to-data flow matching scheme, and optimizes interpolation paths to traverse high-density regions of the data manifold via probability-density-based geodesics, achieving geometrically consistent novel view synthesis.
GGPT: Geometry-Grounded Point Transformer: This paper proposes the GGPT framework, which first obtains geometrically consistent sparse point clouds via an improved lightweight SfM pipeline (dense matching + sparse BA + DLT triangulation), then employs Point Transformer V3 to jointly process sparse geometric guidance and feed-forward dense predictions directly in 3D space for residual refinement. Trained exclusively on ScanNet++, GGPT significantly improves multiple feed-forward 3D reconstruction models across architectures and datasets without any fine-tuning.
GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport: GLINT decomposes Gaussian representations into three components — interface, transmission, and reflection — and couples them with a hybrid rasterization+ray-tracing rendering pipeline, achieving state-of-the-art geometry and appearance reconstruction for scene-scale transparent surfaces such as glass walls and display cases.
Global-Aware Edge Prioritization for Pose Graph Initialization: This paper proposes a GNN-based global edge prioritization method that upgrades pose graph initialization from independent pairwise image retrieval to globally structure-aware edge ranking combined with multi-minimum-spanning-tree construction, achieving significant improvements in SfM reconstruction accuracy under extremely sparse settings.
Glove2Hand: Synthesizing Natural Hand-Object Interaction from Multi-Modal Sensing Gloves: This paper proposes the Glove2Hand framework, which translates egocentric videos of instrumented sensing gloves into photorealistic bare-hand videos while preserving tactile and IMU signals. It also introduces HandSense, the first multi-modal hand-object interaction dataset, and demonstrates significant improvements on downstream bare-hand contact estimation and occluded hand tracking.
GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes: GP-4DGS integrates variational Gaussian Processes (GP) into 4D Gaussian Splatting, enabling probabilistic motion modeling via spatiotemporal composite kernels and variational inference, while endowing 4DGS with three new capabilities: uncertainty quantification, motion extrapolation, and adaptive motion priors.
GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning: This paper proposes GS-CLIP, a two-stage framework that injects global shape context and local defect information from 3D point clouds into text prompts via a Geometry Defect Distillation Module (GDDM), and employs a dual-stream LoRA architecture to synergistically fuse rendered images and depth maps, achieving state-of-the-art zero-shot 3D anomaly detection on four large-scale benchmarks.
Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs: Hg-I2P introduces a Heterogeneous Graph to jointly model relationships between 2D image regions and 3D point cloud regions. Through multi-path adjacency mining for learning cross-modal edges, heterogeneous-edge-guided feature adaptation, and graph-based projection consistency pruning, it achieves state-of-the-art generalization and accuracy across six indoor and outdoor cross-domain benchmarks.
Hierarchical Visual Relocalization with Nearest View Synthesis from Feature Gaussian Splatting: SplatHLoc proposes a hierarchical visual relocalization framework based on Feature Gaussian Splatting (FGS). By combining adaptive viewpoint retrieval that synthesizes virtual views closer to the query perspective with a hybrid feature matching strategy (rendered features for coarse matching + semi-dense matcher for fine matching), the method achieves new state-of-the-art accuracy on both indoor and outdoor benchmarks.
Human Interaction-Aware 3D Reconstruction from a Single Image: This paper proposes HUG3D, a framework that achieves high-fidelity textured 3D reconstruction of interacting multiple persons from a single image via perspective-to-orthographic view transformation, a group-instance multi-view diffusion model, and physics-aware geometry reconstruction, outperforming existing methods across CD/P2S/NC and other metrics.
Hybrid eTFCE–GRF: Exact Cluster-Size Retrieval with Analytical p-Values for Voxel-Based Morphometry: This paper proposes a hybrid method combining the union-find exact cluster-size retrieval of eTFCE with the GRF analytical inference of pTFCE, achieving for the first time both exact cluster-size queries and analytical \(p\)-value computation without permutation testing, while running \(4.6\times\)–\(75\times\) faster than R pTFCE.
Hybrid eTFCE–GRF: Exact Cluster-Size Retrieval with Analytical p-Values for Voxel-Based Morphometry: This work combines the union-find data structure of eTFCE (for exact cluster-size queries) with the GRF analytical inference of pTFCE, achieving for the first time within a single framework both exact cluster-size extraction and analytical \(p\)-values without permutation testing. Whole-brain VBM analysis is 4.6–75× faster than R pTFCE and three orders of magnitude faster than permutation-based TFCE.
HyperMVP: Hyperbolic Multiview Pretraining for Robotic Manipulation: This paper proposes HyperMVP, the first framework for 3D multiview self-supervised pretraining in hyperbolic space. It learns hyperbolic multiview representations via a GeoLink encoder and transfers them to robotic manipulation tasks, achieving a 2.1× performance improvement on the most challenging All Perturbations setting of COLOSSEUM.
HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars: This paper proposes HyperGaussians, which extends 3DGS to high-dimensional multivariate Gaussians. Expression-dependent attribute variations are modeled via conditional distributions, and an inverse covariance trick enables efficient conditioning. Integrated as a plug-and-play module into FlashAvatar and GaussianHeadAvatar, the method significantly improves high-frequency detail quality.
ICTPolarReal: A Polarized Reflection and Material Dataset of Real World Objects: This paper presents ICTPolarReal, the first large-scale real-world polarized reflection and material dataset, capturing 218 everyday objects using an 8-camera, 346-light Light Stage system under cross- and parallel-polarization configurations. The dataset comprises over 1.2 million high-resolution images with ground-truth diffuse–specular reflection separation, and demonstrably improves inverse rendering, forward relighting, and sparse-view 3D reconstruction.
Indoor Asset Detection in Large Scale 360° Drone-Captured Imagery via 3D Gaussian Splatting: This paper proposes a pipeline based on a 3D Object Codebook that associates 2D segmentation masks into consistent 3D object instances within 3DGS using semantic and spatial constraints, enabling object-level detection on large-scale indoor 360° drone imagery. It achieves a 65% improvement in F1 score and 11% improvement in mAP over the state-of-the-art method GAGA.
InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction: This paper proposes InstantHDR, the first feed-forward HDR novel view synthesis method. It achieves multi-exposure fusion via geometry-guided appearance modeling, and employs a meta-network to learn scene-adaptive tone mappers. The method reconstructs HDR 3D scenes from uncalibrated multi-exposure LDR images in a single forward pass, running ~700× faster than optimization-based methods in pure feed-forward mode and ~20× faster when post-optimization is applied.
InstantHDR: Single-forward Gaussian Splatting for High Dynamic Range 3D Reconstruction: This paper proposes InstantHDR, the first feed-forward HDR novel view synthesis method. It introduces a geometry-guided appearance modeling module to resolve appearance inconsistencies in multi-exposure fusion, and employs a MetaNet to predict scene-specific tone mapping parameters for generalization. The method reconstructs HDR 3D Gaussian scenes in seconds from uncalibrated multi-exposure LDR images, achieving +2.90 dB PSNR over GaussianHDR under sparse 4-view settings at approximately 700× faster speed.
Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation: Iris proposes a deterministic diffusion framework that injects real-world priors into a diffusion model via a two-stage Prior-to-Geometry Decoupled (PGD) schedule: Stage 1 extracts low-frequency layout priors from a teacher model using Spectral Gated Distillation (SGD) at high timesteps, while Stage 2 refines high-frequency geometric details using synthetic data at low timesteps. A Spectral Gated Consistency (SGC) loss is further introduced to align high-frequency information across stages. The method achieves state-of-the-art zero-shot depth estimation performance under limited data and computational budget.
JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas: This paper proposes JOPP-3D, the first framework for joint open-vocabulary semantic segmentation on 3D point clouds and panoramic images. It maps panoramas onto icosahedron faces via tangential decomposition, extracts semantically aligned 3D instance embeddings using SAM and CLIP, and achieves 80.9% mIoU on S3DIS under weak supervision, surpassing all closed-vocabulary methods.
JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas: This paper proposes JOPP-3D — the first open-vocabulary semantic segmentation framework that jointly processes 3D point clouds and panoramic images. It decomposes panoramas into 20 perspective views via icosahedral tangential projection to accommodate SAM/CLIP, extracts mask-isolated instance-level CLIP embeddings for 3D semantic segmentation, and back-projects results to the panoramic domain via depth correspondence. Without any training, the method achieves 80.9% mIoU on S3DIS, surpassing all supervised approaches.
ECKConv: Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant Point Cloud Analysis: This paper proposes ECKConv, which defines convolutional kernels on the double coset space \(\text{SO(2)}\backslash\text{SE(3)}/\text{SO(2)}\) within the intertwiner framework and explicitly parameterizes kernel functions via coordinate networks. This is the first approach to simultaneously achieve continuous SE(3) equivariance and large-scale scalability, validated comprehensively across four tasks, including classification, registration, and segmentation.
Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos: This paper proposes to explicitly model the continuous positional and rotational deformation trajectories of dynamic Gaussians via adaptive SE(3) B-spline motion bases, combined with a soft segment reconstruction strategy and multi-view diffusion model priors, achieving high-quality novel view synthesis of dynamic scenes from monocular video. The method surpasses existing approaches on both the iPhone and NVIDIA datasets.
Learning Multi-View Spatial Reasoning from Cross-View Relations: XVR (Cross-View Relations) constructs a large-scale multi-view visual question answering dataset of 100K samples. By explicitly training VLMs on three categories of tasks—correspondence, verification, and viewpoint localization—XVR significantly improves cross-view spatial reasoning, yielding notable gains on both multi-view benchmarks and robotic manipulation tasks.
Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation: This paper proposes a Physics-Guided Score Distillation framework that leverages physics simulation (MPM) as a motion prior to guide Video-SDS optimization, enabling the generation of dynamic weather effects (snow, rain, fog, sandstorm) with physically plausible motion and photorealistic appearance in static 3DGS scenes.
Lifting Unlabeled Internet-level Data for 3D Scene Understanding: This paper presents SceneVerse++, an automated data engine that generates 3D scene understanding training data from 6,687 unlabeled internet videos. It demonstrates the feasibility of leveraging internet-scale data to advance 3D scene understanding across three tasks: 3D object detection (F1@.25 +20.6), spatial VQA (+14.9%), and vision-language navigation (+14% SR).
LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds: LightSplat proposes a training-free framework that is both fast and memory-efficient. By assigning each 3D Gaussian a compact 2-byte semantic index instead of high-dimensional CLIP features, combined with a lightweight index-to-feature lookup and single-pass 3D clustering, it achieves open-vocabulary 3D scene understanding that is 50–400× faster and requires 64× less memory than existing state-of-the-art methods.
Lite Any Stereo: Efficient Zero-Shot Stereo Matching: This paper proposes Lite Any Stereo, which achieves first-place rankings on four real-world benchmarks using less than 1% of the computation (33G MACs) of state-of-the-art accurate methods. This is accomplished via a hybrid 2D-3D cost aggregation module and a three-stage million-scale training strategy (supervised → self-distillation → real-data knowledge distillation), demonstrating for the first time that ultra-lightweight models can exhibit strong zero-shot generalization.
LitePT: Lighter Yet Stronger Point Transformer: LitePT conducts a systematic analysis of the roles played by convolution and attention at different U-Net stages, and proposes a hierarchical hybrid architecture that employs sparse convolution in shallow stages and attention in deep stages. Combined with the parameter-free PointROPE positional encoding, LitePT achieves 3.6× fewer parameters, 2× faster inference, and 2× lower memory consumption compared to Point Transformer V3, while matching or surpassing its performance across multiple point cloud benchmarks.
Long-SCOPE: Fully Sparse Long-Range Cooperative 3D Perception: Long-SCOPE proposes a fully sparse long-range cooperative 3D perception framework that achieves state-of-the-art performance in 100–150 m long-range scenarios through geometry-guided query generation and a context-aware association module, while maintaining efficient computation and communication costs.
LongStream: Long-Sequence Streaming Autoregressive Visual Geometry: LongStream is a gauge-decoupled streaming visual geometry model that achieves stable metric-scale scene reconstruction at 18 FPS over thousand-frame sequences, via keyframe-relative pose prediction, orthogonal scale learning, and cache-consistent training.
LoST: Level of Semantics Tokenization for 3D Shapes: This paper proposes Level-of-Semantics Tokenization (LoST), which orders 3D shape tokens by semantic saliency so that short prefixes can already decode into complete and semantically coherent shapes. Combined with the RIDA semantic alignment loss and GPT-style autoregressive generation, LoST uses only 128 tokens yet achieves significant improvements over existing 3D AR methods that require tens of thousands of tokens.
LTGS: Long-Term Gaussian Scene Chronology From Sparse View Updates: This paper proposes the LTGS framework, which constructs reusable object-level Gaussian templates to efficiently update 3DGS scene reconstructions from spatiotemporally sparse observations, enabling temporal modeling of long-term environmental evolution.
LumiMotion: Improving Gaussian Relighting with Scene Dynamics: LumiMotion is the first Gaussian-based inverse rendering method that leverages scene dynamics (motion regions) as supervision signals to improve material-lighting decomposition. Through static-dynamic separation and motion-revealed appearance changes, it achieves a 23% improvement in albedo LPIPS and a 15% improvement in relighting LPIPS.
M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation: This paper constructs M3DLayout, a large-scale multi-source 3D indoor layout dataset comprising 21,367 layouts and over 433k object instances. It integrates three complementary sources—real-world scans, professionally designed scenes, and procedurally generated environments—paired with structured textual descriptions, providing a high-quality training foundation for text-driven 3D scene generation.
MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping: This paper proposes MAGICIAN, a framework that leverages a pretrained occupancy network to generate "Imagined Gaussians" for efficiently estimating surface coverage gain. Combined with beam search, MAGICIAN enables long-term trajectory planning for active mapping, achieving state-of-the-art performance in both indoor and outdoor scenes with coverage improvements exceeding 10%.
Mamba Learns in Context: Structure-Aware Domain Generalization for Multi-Task Point Cloud Understanding: This paper proposes SADG, the first framework to introduce Mamba into in-context learning for multi-task point cloud domain generalization. Through three modules — structure-aware serialization (Centroid Distance Spectrum + Geodesic Curvature Spectrum), Hierarchical Domain-Aware Modeling, and Spectral Graph Alignment — SADG comprehensively surpasses the state of the art on reconstruction, denoising, and registration tasks.
MARCO: Navigating the Unseen Space of Semantic Correspondence: This paper proposes MARCO, a semantic correspondence model built on a single DINOv2 backbone. It progressively improves spatial precision via a coarse-to-fine Gaussian RBF loss, and expands sparse keypoint supervision into dense pseudo-correspondence labels through a self-distillation framework. MARCO achieves state-of-the-art performance on standard benchmarks as well as on unseen keypoints and categories, while being 3× smaller and 10× faster than dual-encoder approaches.
Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding: This paper identifies two fundamental conflicts between the causal mask in LLM decoders and 3D scene understanding (order bias and instruction isolation), and proposes the 3D-SLIM masking strategy (Geometry-adaptive Mask + Instruction-aware Mask) to replace the causal mask. It achieves significant improvements across multiple 3D scene-language tasks without any architectural modifications or additional parameters.
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding: This paper proposes BrainCoDec, a framework that performs fMRI-based visual decoding generalizable to new subjects without any fine-tuning. It employs a two-stage hierarchical in-context learning approach: first estimating encoder parameters for each voxel, then aggregating across voxels via functional inversion. Top-1 retrieval accuracy improves from 3.9% (MindEye2) to 22.7%.
MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer: This paper proposes MimiCAT, a cascade Transformer framework that learns flexible many-to-many soft correspondences via semantic keypoint labels. Combined with the million-scale multi-category motion dataset PokeAnimDB, it achieves, for the first time, high-quality cross-category 3D pose transfer (e.g., humanoid to quadruped/bird).
Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamics: This paper proposes an EEG-conditioned diffusion Transformer framework for fMRI reconstruction that models brain activity as a spatiotemporal sequence of neural frames rather than independent snapshots. The method achieves spatiotemporally consistent fMRI reconstruction at cortical vertex-level resolution, supports intermediate frame interpolation via null-space sampling, and validates the preservation of functional information on downstream visual decoding tasks.
MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer: This paper presents MoRe, a feed-forward motion-aware 4D reconstruction Transformer that decouples dynamic motion from static structure during training via an attention enforcement strategy, and achieves efficient streaming inference through grouped causal attention, attaining state-of-the-art performance in camera pose estimation and depth prediction on dynamic scenes.
MoRel: Long-Range Flicker-Free 4D Motion Modeling via Anchor Relay-based Bidirectional Blending with Hierarchical Densification: To address the challenges of memory explosion, temporal flickering, and occlusion handling in 4D Gaussian Splatting for long-video dynamic scene modeling, this paper proposes MoRel, a framework based on Anchor Relay-based Bidirectional Blending (ARBB). Through progressive construction of keyframe anchors and learnable temporal opacity control, MoRel achieves flicker-free, memory-bounded long-range 4D motion reconstruction.
Motion-Aware Animatable Gaussian Avatars Deblurring: This paper proposes the first method for directly reconstructing sharp, animatable 3D Gaussian human avatars from blurry video, leveraging a 3D-aware physical blur formation model and an SMPL-based human motion model to jointly optimize the avatar representation and motion parameters.
MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins: This paper presents MotionAnymesh, a zero-shot automated framework that converts static 3D meshes into collision-free, simulation-ready articulated digital twins via motion-aware segmentation (SP4D priors + VLM reasoning) and geometry-physics joint optimization for joint estimation, achieving 87% physical executability on PartNet-Mobility and Objaverse.
MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins: This paper proposes MotionAnymesh, a zero-shot framework that uses SP4D kinematic priors to guide VLMs in eliminating kinematic hallucinations, and employs physics-constrained trajectory optimization to guarantee collision-free articulation. The framework automatically converts static 3D meshes into simulation-ready URDF digital twins directly deployable in physics engines such as SAPIEN, achieving a physical executability rate of 87%—far exceeding existing methods.
MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting: This paper proposes MotionScale, a scalable 4D Gaussian Splatting framework that reconstructs the appearance, geometry, and motion of large-scale dynamic scenes from monocular video with high fidelity. Through a clustering-based adaptive motion field and a progressive optimization strategy, MotionScale achieves a PSNR of 17.98 on DyCheck and reduces 3D tracking EPE to 0.070, substantially outperforming existing methods.
MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second: This paper proposes MoVieS, a feed-forward 4D dynamic scene reconstruction framework that unifies appearance, geometry, and motion modeling via Dynamic Splatter Pixels, enabling 4D reconstruction from monocular video in approximately one second while supporting novel view synthesis, 3D point tracking, scene flow estimation, and moving object segmentation.
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation: This paper proposes a Multi-modal 3D Scene Graph (M3DSG) that replaces conventional text-based relation edges with dynamically assigned image edges, and builds a zero-shot navigation system MSGNav comprising four modules: Key Subgraph Selection, Adaptive Vocabulary Update, Closed-Loop Reasoning, and Visibility Viewpoint Decision. MSGNav achieves 52.0% SR on GOAT-Bench and 74.1% SR on HM3D-ObjNav, both state-of-the-art.
MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation: This paper proposes a Multi-modal 3D Scene Graph (M3DSG) that replaces conventional text-based relation edges with dynamically assigned image edges to preserve visual information. The zero-shot navigation system MSGNav is built upon M3DSG, and a Visibility-based Viewpoint Decision (VVD) module is introduced to address the "last-mile" navigation problem. The method achieves state-of-the-art performance on GOAT-Bench and HM3D-ObjNav.
MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction: This paper proposes MV-RoMa, the first multi-view dense matching model that simultaneously estimates dense correspondences from a single source image to multiple target images via a Track-Guided multi-view encoder and a pixel-aligned multi-view refiner, producing geometrically consistent tracks for SfM and achieving state-of-the-art performance on HPatches, ETH3D, IMC, and related benchmarks.
MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation: This paper introduces the MV-3DRES task (language-guided 3D segmentation directly from sparse multiview RGB images) and the MVGGT framework (a dual-branch design combining a frozen geometry branch with a trainable multimodal branch). A PVSO optimization strategy is proposed to address the foreground gradient dilution (FGD) problem, achieving 39.9 mIoU on the newly constructed MVRefer benchmark, substantially outperforming baselines.
NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration: This paper proposes NanoSD, a family of Pareto-optimal lightweight diffusion foundation models (130M–315M parameters, as fast as 12 ms inference) built upon SD 1.5 through hardware-aware U-Net decomposition, block-wise feature distillation, and multi-objective Bayesian optimization. NanoSD serves as a drop-in backbone that achieves state-of-the-art performance across multiple tasks including super-resolution, face restoration, deblurring, and monocular depth estimation.
NeAR: Coupled Neural Asset–Renderer Stack: NeAR proposes jointly designing neural asset creation and neural rendering as a coupled stack. By introducing illumination-homogenized structured 3D latents (LH-SLAT) to remove baked lighting from input images, and employing an illumination-aware neural decoder for real-time synthesis of relightable 3D Gaussian fields, NeAR surpasses existing methods across four tasks: forward rendering, reconstruction, relighting, and novel-view relighting.
Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences: Neu-PiG proposes a fast optimization framework based on preconditioned multi-resolution latent grids, encoding the position and normal directions of a keyframe reference mesh into a unified latent space. A lightweight MLP decodes these features into per-frame 6-DoF deformations, achieving high-fidelity dynamic surface reconstruction more than 60× faster than existing training-free methods, without requiring category-specific priors or explicit correspondences.
Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy: This paper proposes NFH-SEM, a neural field-based hybrid framework that embeds the physical model of electron scattering in SEM into a neural field optimization pipeline, enabling high-fidelity 3D surface reconstruction of microstructures from multi-view, multi-detector SEM images. The framework achieves self-calibration and shadow-robust reconstruction at nanometer-scale accuracy (478 nm stacked features, 782 nm pollen textures, 1.559 μm fracture steps).
Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction: Neural Gabor Splatting embeds a lightweight MLP (SIREN architecture) into each Gaussian primitive, enabling a single primitive to represent complex spatially-varying color patterns. Combined with a frequency-aware densification strategy, this approach significantly improves high-frequency surface reconstruction quality under the same data budget.
NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation: This paper proposes the NG-GS framework, which leverages the continuous modeling capability of NeRF to address the boundary discretization problem in 3DGS segmentation. It constructs a continuous feature field via RBF interpolation, combined with multi-resolution hash encoding and joint NeRF-GS optimization, to achieve high-quality object segmentation.
NI-Tex: Non-isometric Image-based Garment Texture Generation: NI-Tex is a framework that achieves, for the first time, high-quality feed-forward generation of PBR textures for 3D garments from a single image under non-isometric conditions. It does so by constructing a 3D Garment Videos dataset, applying image-editing-based cross-topology augmentation, and using an uncertainty-guided iterative baking algorithm.
NimbusGS: Unified 3D Scene Reconstruction under Hybrid Weather: NimbusGS proposes a unified 3D scene reconstruction framework that decomposes weather degradation into a continuous scattering field (fog/haze) and a per-view particulate residual layer (rain/snow), coupled with a geometry-guided gradient scaling mechanism, achieving state-of-the-art reconstruction under individual and hybrid weather conditions within a single framework.
No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency: This paper proposes the first cross-sensor view synthesis framework that requires neither calibration nor depth. Through a match-densify-consolidate pipeline, sparse cross-modal keypoints are expanded into dense X-modality images (thermal/NIR/SAR) aligned with the RGB viewpoint. Synthesis quality is further improved via confidence-aware densification fusion (CADF) and self-matching filtering.
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs: Node-RF tightly couples Neural ODE with NeRF, driving the temporal evolution of implicit scene representations via continuous-time differential equations. This enables long-range extrapolation far beyond the training time horizon and cross-trajectory generalization, achieving significant improvements over baselines such as D-NeRF and 4D-GS on datasets including Bouncing Balls, Pendulum, and Oscillating Ball.
Node-RF: Learning Generalized Continuous Space-Time Scene Dynamics with Neural ODE-based NeRFs: Node-RF tightly couples Neural ODE with NeRF, modeling scene dynamic evolution via differential equations in latent space, enabling long-range extrapolation beyond training time horizons, cross-sequence generalization, and dynamical system behavior analysis.
NTK-Guided Implicit Neural Teaching: This paper proposes NINT, which leverages row vectors of the Neural Tangent Kernel (NTK) to measure each coordinate's influence on the global function update, enabling dynamic selection of coordinates with both high fitting error and high global influence for training. This approach reduces INR training time by nearly half without sacrificing reconstruction quality.
Off The Grid: Detection of Primitives for Feed-Forward 3D Gaussian Splatting: This paper proposes a feed-forward 3DGS decoder based on keypoint detection, liberating Gaussian primitives from the pixel grid by placing them adaptively at sub-pixel precision. Combined with an adaptive density mechanism and confidence-based pruning, the method surpasses state-of-the-art feed-forward approaches in novel view synthesis using only 1/7 of the primitives required by pixel-aligned methods.
OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting: This paper proposes OnlinePG, the first online open-vocabulary panoptic mapping system built upon 3DGS. It adopts a local-to-global paradigm: within a sliding window, a multi-cue clustering graph (geometric overlap + semantic similarity + view consensus) constructs locally consistent 3D instances, which are then incrementally merged into a global map via bidirectional bipartite matching. OnlinePG achieves state-of-the-art semantic and panoptic segmentation among online methods, surpassing OnlineAnySeg by +17.2 mIoU on ScanNet (48.48), while running at 10–18 FPS.
OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness: This paper proposes OpenVO, an open-world monocular visual odometry framework that achieves robust metric-scale ego-motion estimation under uncalibrated cameras and varying frame rates, via a time-aware flow encoder and a geometry-aware context encoder. OpenVO achieves over 20% ATE improvement across datasets and reduces error by 46%–92% under variable frame-rate settings.
PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery: PAD-Hand is a physics-aware conditional diffusion framework that models Euler–Lagrange (EL) dynamics residuals as virtual observations integrated into the diffusion process, while estimating per-joint, per-frame dynamic variance via last-layer Laplace approximation. The method achieves physically plausible and uncertainty-aware hand motion recovery, reducing acceleration error by 50.1% on DexYCB.
Pano360: Perspective to Panoramic Vision with Geometric Consistency: Pano360 proposes a Transformer-based panoramic stitching framework that extends the conventional 2D pairwise alignment paradigm to 3D space, directly leveraging camera poses to guide global multi-image alignment. Combined with a multi-feature joint optimization strategy for seam detection, the method achieves a 97.8% success rate on challenging scenarios including weak texture, large parallax, and repetitive patterns, substantially outperforming existing approaches.
Pano360: Perspective to Panoramic Vision with Geometric Consistency: Pano360 extends panoramic image stitching from conventional 2D pairwise matching to the 3D photogrammetric space, leveraging a Transformer architecture to achieve globally geometrically consistent multi-view alignment. It attains a 97.8% success rate under challenging scenarios including weak texture, large parallax, and repetitive patterns.
Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image: This paper proposes Pano3DComposer, a modular feed-forward compositional 3D scene generation framework that takes a single panoramic image as input. A plug-and-play Object-World Transformation Predictor (based on Alignment-VGGT) maps generated 3D objects from local coordinates to world coordinates, producing high-fidelity 3D scenes in approximately 20 seconds on an RTX 4090.
PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery: This paper proposes PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and globally consistent 3D point clouds from one or more unordered panoramic images in a single feed-forward pass. The paper also contributes PanoCity — a large-scale dataset comprising over 120,000 outdoor panoramic images.
Parallelised Differentiable Straightest Geodesics for 3D Meshes: This paper proposes a parallel GPU implementation of straightest geodesics along with two differentiable schemes — an extrinsic proxy function method and a geodesic finite differences method — enabling efficient parallel and differentiable exponential map computation on triangular meshes. Three downstream applications are built upon this framework: a geodesic convolutional layer, a flow matching method on meshes, and a second-order optimizer.
Particulate: Feed-Forward 3D Object Articulation: Particulate proposes a feed-forward model that infers complete articulation structures (part segmentation, kinematic tree, and motion constraints) from a static 3D mesh within seconds. Built upon the Part Articulation Transformer and trained end-to-end on public datasets, it significantly outperforms existing per-object optimization methods and can be combined with 3D generative models to enable single-image-to-articulated-3D-object generation.
PCSTracker: Long-Term Scene Flow Estimation for Point Cloud Sequences: PCSTracker is the first end-to-end framework for long-term scene flow estimation on point cloud sequences. Through iterative joint geometry-motion optimization, spatiotemporal trajectory updates, and an overlapping sliding window strategy, it reduces EPE_3D by 57.9% on the synthetic dataset PointOdyssey3D while running in real time at 32.5 FPS.
PE3R: Perception-Efficient 3D Reconstruction: PE3R proposes a tuning-free, feed-forward 3D semantic reconstruction framework that directly generates semantic 3D point clouds from pose-free 2D images via three modules — pixel embedding disambiguation, semantic point cloud reconstruction, and global view perception — achieving a 9× speedup while establishing new state-of-the-art performance on open-vocabulary segmentation and depth estimation.
PhyGaP: Physically-Grounded Gaussians with Polarization Cues: This paper proposes PhyGaP, which integrates polarization cues into 2DGS optimization via a polarization deferred rendering pipeline (PolarDR), and introduces a self-occlusion-aware GridMap environment representation, enabling accurate reflection decomposition and realistic relighting of glossy objects.
PhysGaia: A Physics-Aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis: PhysGaia constructs a physics-aware benchmark dataset comprising 17 scenes that cover multi-body interactions across four material categories—liquid, gas, cloth, and rheological matter—providing ground truth 3D particle trajectories and physical parameters (e.g., viscosity). The paper further introduces two new metrics, Trajectory Distance (TD) and AUOP, to quantify the physical realism of 4DGS methods, revealing severe deficiencies in physical reasoning among existing DyNVS approaches.
PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis: PhysGM is the first framework for feed-forward prediction of 3DGS and physical attributes (material category, Young's modulus, Poisson's ratio) from a single image. A two-stage training pipeline (supervised pretraining + DPO preference fine-tuning) entirely bypasses SDS and differentiable physics engines. Combined with the 50K+ PhysAssets dataset, the method generates high-fidelity 4D physical simulations within one minute, surpassing per-scene optimization methods in both CLIP similarity and human preference rate.
PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis: PhysGM proposes the first feed-forward framework that simultaneously predicts 3D Gaussian representations and physical properties (stiffness, mass, etc.) from a single image in one inference pass. Combined with MPM simulation, it generates high-fidelity, physically plausible 4D animations within one minute, requiring no per-scene optimization.
PhysGS: Bayesian-Inferred Gaussian Splatting for Physical Property Estimation: This paper proposes PhysGS, which integrates Bayesian inference into a 3D Gaussian Splatting pipeline. By leveraging vision-language model priors and multi-view confidence-weighted updates, PhysGS enables per-point probabilistic estimation and uncertainty quantification of physical properties (friction, hardness, density, stiffness), achieving a 22.8% improvement over NeRF2Physics in APE for mass estimation and a 61.2% reduction in Shore hardness error.
PhysHead: Simulation-Ready Gaussian Head Avatars: This paper proposes PhysHead—the first method to integrate physics-driven hair dynamics with animatable 3DGS head avatars. It models expressive faces via FLAME mesh + 3DGS, represents hair appearance via strands + 3DGS, drives hair animation through a physics engine, and enables layered optimization of hair and face through VLM-generated bald images.
Physically Inspired Gaussian Splatting for HDR Novel View Synthesis: This paper proposes PhysHDR-GS, a physically inspired HDR novel view synthesis framework that decomposes Gaussian colors into intrinsic reflectance and adjustable ambient illumination. An Image-Exposure (IE) branch and a Gaussian-Illumination (GI) branch complementarily capture HDR details. A cross-branch HDR consistency loss provides explicit HDR supervision without ground-truth HDR data, and illumination-guided gradient scaling addresses gradient starvation caused by exposure bias. The method outperforms HDR-GS by 2.04 dB across multiple benchmarks while maintaining real-time rendering at 76 FPS.
PIP-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching: This paper reveals the spatial sparsity and temporal redundancy of disparity updates in iterative stereo matching, and proposes: (1) Progressive Iteration Pruning (PIP) to compress 32 iterations down to 1; (2) a collaborative learning paradigm for monocular depth prior transfer without an independent monocular encoder; and (3) a hardware-aware FlashGRU operator (7.28× speedup). Together, these enable high-accuracy iterative stereo matching to achieve real-time inference on Jetson Orin NX for the first time (75 ms/frame at 320×640).
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction: This paper proposes PixARMesh, the first autoregressive framework for single-view scene reconstruction that operates natively in mesh space (rather than SDF space). By enhancing a point cloud encoder with pixel-aligned image features and global scene context, the method jointly predicts object poses and meshes within a unified token sequence. PixARMesh achieves scene-level state-of-the-art on 3D-FRONT while producing compact, editable, artist-ready meshes.
PointINS: Instance-Aware Self-Supervised Learning for Point Clouds: PointINS proposes the first point cloud self-supervised learning framework that explicitly learns semantic consistency and geometric reasoning. By introducing a label-free offset branch with Offset Distribution Regularization (ODR) and Spatial Clustering Regularization (SCR), it achieves an average improvement of +3.5% mAP on indoor instance segmentation and +4.1% PQ on outdoor panoptic segmentation.
PointTPA: Dynamic Network Parameter Adaptation for 3D Scene Understanding: PointTPA is a framework that generates input-customized network parameters at inference time via two lightweight modules—Serialization-based Neighborhood Grouping (SNG) and Dynamic Parameter Projector (DPP)—achieving 78.4% mIoU on ScanNet with fewer than 2% additional parameters, surpassing existing parameter-efficient fine-tuning (PEFT) methods.
PoseMaster: A Unified 3D Native Framework for Stylized Pose Generation: PoseMaster proposes a 3D native framework that unifies pose stylization and 3D generation in an end-to-end pipeline. It directly uses 3D skeletons as pose control signals (rather than 2D skeleton images), designs a skeleton densification strategy and a Point Transformer encoder to extract fine-grained spatial topology features, and trains on large-scale Image-Skeleton-Mesh triplet data, achieving state-of-the-art performance on both pose canonicalization and arbitrary pose stylization.
PR-IQA: Partial-Reference Image Quality Assessment for Diffusion-Based Novel View Synthesis: This paper proposes PR-IQA, a cross-reference image quality assessment method that first computes geometrically consistent local quality maps in multi-view overlapping regions, then propagates quality information to non-overlapping regions via a reference-conditioned cross-attention network, producing dense quality maps approaching full-reference accuracy. Integrated into a 3DGS pipeline with a dual-filtering strategy, it significantly improves sparse-view 3D reconstruction quality.
ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars: This paper proposes ProgressiveAvatars, a progressive avatar representation that constructs hierarchical 3DGS via adaptive implicit subdivision on a template mesh, enabling progressive transmission and rendering under varying bandwidth and compute constraints. With only 5% of the data (2.6 MB), a usable avatar is immediately renderable, and incremental loading smoothly improves quality to a level comparable with state-of-the-art methods.
PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts: This paper proposes the Prompt Recurrent Unit (PRU), which replaces the GRU in iterative refinement with the DPT decoder from a monocular depth foundation model. Structure Prompts and Motion Prompts inject monocular structural and stereo motion cues via residual addition, enabling zero-shot state-of-the-art stereo matching without corrupting the monocular prior (nearly 50% error reduction on Middlebury 2021).
Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives: This paper proposes an adaptive reconstruction-aware pruning scheduler (RPS) and 3D Difference-of-Gaussian (DoG) primitives, which together prune 90% of Gaussians while preserving rendering quality.
QD-PCQA: Quality-Aware Domain Adaptation for Point Cloud Quality Assessment: This paper proposes QD-PCQA, a quality-aware domain adaptation framework that transfers image-domain quality assessment priors to the point cloud domain via two core strategies: Rank-weighted Conditional Alignment (RCA) and Quality-guided Feature Augmentation (QFA).
QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition: This paper presents QuadSync, the first global synchronization algorithm for quadrifocal tensors. By constructing a block quadrifocal tensor and proving that it admits a Tucker decomposition with multilinear rank \((4,4,4,4)\), the method recovers camera poses from four-view measurements via an ADMM-IRLS optimization framework, achieving superior synchronization accuracy over two-view and three-view methods in dense-view settings.
R4Det 4D Radar Camera Fusion 3D Detection: R4Det addresses three key challenges in 4D radar-camera fusion (inaccurate depth estimation, ego-pose-dependent temporal fusion, and poor small-object detection) with three plug-and-play modules: Panoramic Depth Fusion (PDF), Deformable Gated Temporal Fusion (DGTF), and Instance-Guided Dynamic Refinement (IGDR). State-of-the-art results are achieved on TJ4DRadSet and VoD.
Random Wins All: Rethinking Grouping Strategies for Vision Tokens: This paper proposes a minimalist random grouping strategy to replace various elaborately designed token grouping methods in Vision Transformers. The approach achieves near-universal improvements over all baselines across image classification, object detection, semantic segmentation, point cloud segmentation, and VLMs, and provides a four-dimensional explanation for its success: positional information, per-head feature diversity, global receptive field, and fixed grouping patterns.
RAP: Fast Feedforward Rendering-Free Attribute-Guided Primitive Importance Score Prediction for Efficient 3D Gaussian Splatting Processing: This paper proposes RAP, a rendering-free feedforward method for Gaussian primitive importance scoring. It extracts 15-dimensional features from intrinsic attributes and local neighborhood statistics, employs a lightweight MLP to predict importance scores, and generalizes to unseen scenes after a single training run.
RayNova: Scale-Temporal Autoregressive World Modeling in Ray Space: This paper proposes RayNova, a geometry-agnostic multi-view world model based on dual-causal (scale + temporal) autoregressive modeling. By leveraging relative Plücker ray positional encodings, RayNova achieves unified 4D spatiotemporal reasoning and attains state-of-the-art multi-view video generation performance on nuScenes.
Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface: This paper proposes the Real2Edit2Real framework, a three-stage pipeline of "3D reconstruction → point cloud editing to generate new trajectories → depth-guided video generation for synthesizing demonstrations." Starting from only 1–5 real demonstrations, the framework generates large quantities of diverse manipulation demonstrations, enabling policy performance that matches or exceeds training on 50 real demonstrations—achieving a 10–50× improvement in data efficiency.
Regularizing INR with Diffusion Prior for Self-Supervised 3D Reconstruction of Neutron Computed Tomography Data: This paper proposes DINR (Diffusive INR), which replaces the conventional inversion solver within the DD3IP diffusion framework with an INR, injecting diffusion denoising estimates into the INR optimization via a proximal loss. DINR surpasses existing SOTA methods for neutron CT reconstruction under extremely sparse-view conditions (as few as 4–5 views).
Regularizing INR with Diffusion Prior for Self-Supervised 3D Reconstruction of Neutron Computed Tomography Data: This paper proposes Diffusive INR (DINR), a framework that replaces the conventional DIS in the DD3IP diffusion reconstruction pipeline with an INR, and injects the diffusion model's denoising estimate as a regularization prior into the INR optimization via a proximal loss function. Under extremely sparse neutron CT conditions with only 4–5 views, DINR surpasses MBIR (qGGMRF), DD3IP, and vanilla INR in reconstruction quality.
ReLaGS: Relational Language Gaussian Splatting: This paper proposes ReLaGS, the first training-free framework that unifies multi-level language Gaussian fields with open-vocabulary 3D scene graphs. It improves scene representation via Maximum Weight Pruning and Robust Outlier-aware Feature Aggregation, and achieves efficient structured 3D scene understanding through GNN-based relation prediction.
Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations: Reliev3R introduces the first weakly supervised paradigm for training feed-forward 3D reconstruction models (FFRMs) from scratch without multi-view geometric annotations (i.e., no SfM/MVS-derived point clouds or camera poses). By substituting monocular relative depth and sparse image correspondences as supervisory signals, it achieves performance on par with or superior to certain fully supervised FFRMs.
Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery: This paper proposes RepTRFD, which reparameterizes Tensor Ring factors into the form of "learnable latent tensor × fixed basis" to address the spectral bias problem inherent in INR-parameterized TR factors, achieving state-of-the-art performance across image inpainting, denoising, super-resolution, and point cloud recovery tasks.
Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty: This paper proposes UGS-Loc, a framework that jointly models pose prior uncertainty and geometric uncertainty via Monte Carlo pose sampling and Fisher information-guided PnP optimization, achieving significantly improved robustness in camera pose refinement within 3DGS scenes without requiring retraining.
RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting: RetimeGS addresses ghosting artifacts and temporal aliasing in 4DGS inter-frame interpolation through regularized temporal opacity, Catmull-Rom spline trajectories, bidirectional optical flow supervision, and triple rendering, enabling artifact-free continuous-time 4D reconstruction at arbitrary timestamps.
RetimeGS: Continuous-Time Reconstruction of 4D Gaussian Splatting: This paper proposes RetimeGS, which addresses temporal aliasing (ghosting) in 4DGS frame interpolation through regularized temporal opacity (dual-Sigmoid short-tailed distribution) and Catmull-Rom spline trajectories for modeling continuous Gaussian primitive motion, combined with bidirectional optical flow supervision, triple rendering, and dynamic stretching strategies. RetimeGS achieves 30.08 dB PSNR on the Stage-Capture dataset, surpassing the previous SOTA by 1.29 dB.
ReWeaver: Towards Simulation-Ready and Topology-Accurate Garment Reconstruction: This paper proposes ReWeaver, a framework that jointly reconstructs 3D garment geometry and 2D sewing patterns from as few as four multi-view RGB images. A dual-path Transformer predicts 3D patches/curves and their topological connectivity, after which an intra-group attention module unfolds the 3D structure into 2D panel edges. ReWeaver is the first method to produce topology-accurate garment assets that are directly usable in physical simulation.
Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation: Rewis3d is the first work to introduce feed-forward 3D scene reconstruction as an auxiliary supervision signal for weakly-supervised semantic segmentation. Through a dual student-teacher architecture, it achieves bidirectional cross-modal consistency (CMC) learning between 2D images and reconstructed 3D point clouds. Combined with dual-confidence filtering and view-aware sampling, the method improves mIoU by 2–7% across multiple datasets under sparse annotations (points, scribbles, coarse labels), while requiring only 2D input at inference time.
Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation: This paper proposes Rewis3d, the first framework to integrate feedforward 3D scene reconstruction as an auxiliary supervision signal for weakly-supervised semantic segmentation. Through a dual student-teacher architecture and dual confidence-weighted cross-modal consistency loss, Rewis3d improves mIoU by 2–7% under sparse annotation, while using only 2D images at inference time.
RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations: RnG proposes Reconstruction-Guided Causal Attention, which reinterprets the Transformer's KV-Cache as an implicit 3D representation, enabling a single feed-forward Transformer to jointly perform reconstruction and generation—recovering complete 3D geometry and appearance from sparse, pose-free images—at over 100× the speed of diffusion-based methods.
RnG: A Unified Transformer for Complete 3D Modeling from Partial Observations: This paper proposes RnG, a unified feed-forward Transformer that leverages reconstruction-guided causal attention to treat KV-Cache as an implicit 3D representation, simultaneously achieving 3D reconstruction and novel-view RGBD generation from sparse unposed images, with inference speeds over 100× faster than diffusion-based methods.
S2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds: This paper presents S2AM3D, a point cloud part segmentation framework that integrates 2D pretrained priors with 3D contrastive supervision. A point-consistent encoder produces globally coherent per-point features, while a scale-aware prompt decoder enables continuously controllable segmentation granularity. The method substantially outperforms existing approaches across multiple benchmarks.
Sampling-Aware 3D Spatial Analysis in Multiplexed Imaging: This paper systematically investigates how sampling geometry (2D single sections vs. 3D serial sections) affects the accuracy of recovering spatial statistics in multiplexed imaging, and proposes a geometry-aware sparse 3D reconstruction module that enables reliable depth-informed spatial analysis under limited imaging budgets.
SASNet: Spatially-Adaptive Sinusoidal Networks for INRs: This paper proposes SASNet, which combines frozen frequency embedding layers with spatially-adaptive masks learned by a lightweight hash-grid MLP to address SIREN's sensitivity to frequency initialization and its high-frequency leakage problem, achieving faster convergence and higher reconstruction quality on image fitting, volumetric data fitting, and SDF reconstruction tasks.
Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models: This paper proposes QuatRoPE, a quaternion rotation-based 3D positional encoding method that preserves all \(O(n^2)\) pairwise spatial relations using only \(O(n)\) input tokens. Combined with the IGRE mechanism to reduce interference with language RoPE, it achieves substantial improvements across multiple 3D vision-language benchmarks.
Scaling View Synthesis Transformers (SVSM): This work establishes, for the first time, scaling laws for geometry-free NVS Transformers. It proposes the effective batch size hypothesis (\(B_\text{eff} = B \cdot V_T\)) to reveal the root cause of the underestimation of encoder-decoder architectures, designs a unidirectional encoder-decoder architecture called SVSM, and achieves a new state of the art on RealEstate10K (30.01 PSNR) with less than half the training FLOPs. The Pareto frontier shifts 3× to the left relative to LVSM decoder-only.
Scene Grounding In the Wild: This paper proposes a semantic feature-based inverse optimization framework that aligns in-the-wild local 3D reconstructions (SfM) to a complete pseudo-synthetic reference model (e.g., Google Earth Studio). By leveraging DINOv2 features and robust optimization, the method addresses large domain gaps and achieves globally consistent fusion of non-overlapping local reconstructions.
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations: This paper presents SceneScribe-1M — a large-scale multimodal video dataset comprising one million in-the-wild videos spanning over 4,000 hours, with comprehensive annotations including structured text descriptions, accurate camera parameters, temporally consistent depth maps, and 3D point trajectories. The dataset serves as a unified resource for 3D geometric perception and video generation tasks.
SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation: This paper proposes SCOPE, a plug-and-play framework that leverages a class-agnostic segmentation model to mine pseudo-instance prototypes from background regions of base training scenes. By retrieving and fusing these prototypes into sparse few-shot novel-class prototypes via attention, SCOPE improves novel-class IoU by 6.98% on ScanNet without retraining the backbone.
SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation: This paper proposes SeeThrough3D, which conditions the FLUX model on an Occlusion-aware Scene Control Representation (OSCR) rendered from semi-transparent 3D bounding boxes, enabling precise 3D layout control and occlusion-consistent text-to-image generation.
SEPatch3D: Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors: This paper proposes SEPatch3D, which achieves 57% inference acceleration with comparable detection accuracy in ViT-based sparse multi-view 3D detection, via spatiotemporal-aware dynamic patch size selection and an entropy-based informative patch enhancement mechanism.
SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM: This paper proposes SGAD-SLAM, which adopts a pixel-aligned simplified Gaussian representation and allows Gaussians to adjust their depth offset along the ray to improve rendering quality and scalability. A geometry-similarity-based GICP tracking strategy is introduced to accelerate camera pose estimation. The method comprehensively outperforms state-of-the-art approaches on Replica, TUM, ScanNet, and ScanNet++.
SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation: SGI proposes a seed-based structured 2D Gaussian representation framework that organizes unstructured Gaussian primitives into seed-driven neural Gaussians, coupled with context-guided entropy coding and a multi-scale fitting strategy, achieving up to 7.5× compression and 6.5× optimization speedup in high-resolution image representation while maintaining or improving reconstruction fidelity.
SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation: SGI organizes unstructured 2D Gaussian primitives via seed points and decodes their attributes with lightweight MLPs. Combined with context-model-driven entropy coding and a multi-scale fitting strategy, SGI achieves up to 7.5× compression and 6.5× speedup in high-resolution image representation while maintaining or improving fidelity.
SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering: SGS-Intrinsic proposes a two-stage indoor inverse rendering framework. Stage I constructs a geometrically consistent dense Gaussian field guided by semantic and geometric priors. Stage II performs material–illumination decomposition via a hybrid lighting model and material priors, with a dedicated de-shadowing module to prevent shadow baking into albedo.
Sky2Ground: A Benchmark for Site Modeling under Varying Altitude: This paper introduces the Sky2Ground dataset (51 scenes, 80k images, covering satellite/aerial/ground views with both synthetic and real imagery) and the SkyNet model (dual-stream encoder + masked satellite attention + progressive view sampling), presenting the first systematic study of joint camera localization across ground, aerial, and satellite viewpoints. SkyNet achieves a 9.6% improvement in RRA@5 and an 18.1% improvement in RTA@5.
SonoWorld: From One Image to a 3D Audio-Visual Scene: SonoWorld is a training-free framework that generates an explorable 3D audio-visual scene from a single image. The pipeline expands the input image into a 360° panorama and reconstructs it as a 3D Gaussian scene, places sound-source anchors via VLM-driven semantic grounding, and renders spatial audio through Ambisonics encoding, achieving geometric and semantic alignment between the visual and auditory modalities.
SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs: This paper proposes SoPE, a spherical coordinate-based positional embedding that remaps point cloud tokens from one-dimensional sequence indices to a spherical coordinate space \((t,r,\theta,\phi)\), combined with multi-dimensional frequency allocation and multi-scale frequency mixing strategies, significantly enhancing the spatial perception capabilities of 3D large vision-language models.
SPAN: Spatial-Projection Alignment for Monocular 3D Object Detection: This paper proposes SPAN, a plug-and-play geometric co-constraint framework that enforces global geometric consistency across decoupled predictions via two differentiable losses — Spatial Point Alignment (3D corner MGIoU alignment) and 3D-2D Projection Alignment (projected bounding rectangle GIoU alignment) — coupled with a Hierarchical Task Learning strategy to stabilize training. On KITTI, SPAN improves MonoDGP's Car Moderate AP3D by 0.92%, achieving a new state of the art with zero additional inference overhead.
Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting: This paper proposes the first frequency-domain defense framework against resource-targeting attacks on 3DGS. By combining a 3D frequency filter that selectively prunes anomalous high-frequency Gaussians with 2D spectral regularization that constrains anisotropic noise in rendered images, the method suppresses Gaussian over-proliferation by up to 5.92×, reduces peak GPU memory by up to 3.66×, and accelerates rendering by up to 4.34× under attack, while maintaining reconstruction quality.
Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting: This paper proposes the first frequency-domain defense framework against resource-targeting attacks on 3DGS — a 3D frequency filter that selectively prunes high-frequency anomalous Gaussians, combined with a 2D angular anisotropy regularization that penalizes directionally concentrated high-frequency noise. The method suppresses attack-induced Gaussian over-growth by up to 5.92×, reduces peak memory by 3.66×, improves rendering speed by 4.34×, and even raises PSNR by +1.93 dB.
Speed3R: Sparse Feed-forward 3D Reconstruction Models: Speed3R introduces a trainable dual-branch Global Sparse Attention (GSA) mechanism for feed-forward 3D reconstruction models. A compression branch provides coarse-grained scene summaries while a selection branch focuses fine-grained attention on critical tokens, achieving 12.4× inference speedup on 1000-view sequences with only marginal accuracy degradation.
Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists: By periodically resetting Gaussian scales (Scale Reset) and imposing an entropy constraint on alpha blending weights, this paper shortens the per-pixel Gaussian list length to achieve 5–12× training acceleration in 3DGS while maintaining comparable rendering quality.
SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting: SR3R reformulates 3D super-resolution (3DSR) as a feed-forward mapping from sparse low-resolution views to high-resolution 3DGS, achieving high-fidelity HR 3DGS reconstruction via Gaussian offset learning and feature refinement, without per-scene optimization, while enabling strong zero-shot generalization.
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction: This paper proposes STAC, a framework that exploits spatio-temporal sparsity in the KV cache of causal Transformers. Through three modules—working temporal token caching, long-term spatial token caching, and chunk-based multi-frame optimization—STAC reduces memory consumption by approximately 10× and improves inference speed by 4× for streaming 3D reconstruction, without any additional training and with negligible degradation in reconstruction quality.
STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction: STAvatar leverages a UV-adaptive soft binding framework and a temporal adaptive density control strategy to reconstruct high-fidelity, drivable 3D head avatars from monocular video. It significantly outperforms existing methods in occluded regions (oral interior, eyelids) and fine-grained details.
Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas: This paper proposes Stepper, a framework that generates immersive 3D scenes driven by text input by progressively synthesizing multi-view panoramas and feeding them into a feed-forward 3D reconstruction pipeline, achieving an average PSNR improvement of 3.3 dB over existing methods.
STS-Mixer: Spatio-Temporal-Spectral Mixer for 4D Point Cloud Video Understanding: STS-Mixer is the first to introduce the Graph Fourier Transform (GFT) into 4D point cloud video understanding. By decomposing point clouds in the frequency domain to capture geometric structures at different scales (low frequency = global shape, high frequency = local details) and mixing spectral features with spatio-temporal information, STS-Mixer achieves state-of-the-art performance on action recognition and semantic segmentation.
SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation: SwiftTailor is a lightweight two-stage framework that combines PatternMaker for sewing pattern prediction with GarmentSewer for converting patterns into a Garment Geometry Image (GGI) in a unified UV space. Via inverse mapping and dynamic stitching, the framework directly assembles 3D garment meshes, achieving SOTA quality while running orders of magnitude faster than existing methods.
TagSplat: Topology-Aware Gaussian Splatting for Dynamic Mesh Modeling and Tracking: TagSplat is a topology-aware Gaussian splatting framework that explicitly encodes spatial connectivity among Gaussian primitives, enabling the generation of topologically consistent mesh sequences in dynamic scene reconstruction while supporting accurate 3D keypoint tracking.
Learning 3D Reconstruction with Priors in Test Time: This paper proposes Test-time Constrained Optimization (TCO), a framework that improves 3D reconstruction accuracy by treating available priors (camera poses, intrinsics, depth) as output constraints optimized at inference time, without retraining or modifying the architecture of pretrained multiview Transformers.
Text–Image Conditioned 3D Generation: This paper identifies that image and text conditions provide complementary information for 3D generation—images supply precise appearance but are limited by viewpoint, while text provides global semantics but lacks visual detail—and proposes TIGON, a minimalist dual-branch DiT baseline that achieves native text-image jointly conditioned 3D generation via zero-initialized cross-modal bridges (early fusion) and step-wise prediction averaging (late fusion).
TopoMesh: High-Fidelity Mesh Autoencoding via Topological Unification: This paper proposes TopoMesh, which unifies both ground-truth and predicted meshes under the Dual Marching Cubes (DMC) topology framework, enabling explicit vertex- and face-level correspondence for the first time. This allows direct mesh-level supervision over topology, vertex positions, and face normals. The proposed method improves F1-Sharp by 5.9–7.1% over the current state of the art, with particularly notable advantages in sharp feature preservation.
Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos: This paper introduces the World Scene Graph Generation (WSGG) task, which constructs spatio-temporally persistent, world-coordinate-anchored scene graphs from monocular videos, covering all objects including occluded and out-of-frame ones. The paper also presents the ActionGenome4D dataset and three complementary methods (PWG/MWAE/4DST).
TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast: TR2M is a framework that leverages images and textual descriptions to predict pixel-wise scale/shift maps, converting generalizable but scale-ambiguous relative depth into metric depth. With only 19M trainable parameters and 102K training images, it achieves zero-shot cross-domain metric depth estimation.
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction: tttLRM is the first work to introduce Test-Time Training (TTT) into large-scale 3D reconstruction models. It leverages LaCT layers to achieve long-context and autoregressive 3D Gaussian reconstruction at linear complexity. Multi-view observations are compressed into TTT fast weights to form an implicit 3D representation, which is then decoded into explicit formats such as 3DGS, achieving state-of-the-art performance on both object-level and scene-level benchmarks.
Unblur-SLAM: Dense Neural SLAM for Blurry Inputs: Rather than naively inserting a deblurring network into the SLAM front-end, Unblur-SLAM is designed around a central decision: which blurry frames can be deblurred prior to tracking, and which must be modeled directly in 3D space. This insight drives a complete pipeline comprising blur detection, physically constrained deblurring, 3D Gaussian blur refinement, and a severe-blur fallback, enabling the system to handle both motion blur and defocus blur while substantially improving tracking and reconstruction quality.
UniSplat: Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images: UniSplat learns unified geometry-appearance-semantic 3D representations from unposed multi-view images via three components — dual masking, coarse-to-fine Gaussian splatting, and pose-conditioned recalibration — laying a perceptual foundation for spatial intelligence.
Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture: This paper proposes a face reconstruction pipeline based on improved Gaussian Splatting. It tightly couples Gaussians with triangle meshes via soft constraints and semantic segmentation supervision, reconstructing high-fidelity triangular mesh geometry from only 11 uncalibrated images. A PCA prior combined with a relightable Gaussian model is used to disentangle illumination and recover de-lit albedo textures, with outputs fully compatible with standard graphics pipelines (MetaHuman).
UTrice: Unifying Primitives in Differentiable Ray Tracing and Rasterization via Triangles for Particle-Based 3D Scenes: UTrice proposes replacing Gaussian ellipsoids with triangles as unified primitives for differentiable ray tracing, enabling direct triangle traversal within an OptiX BVH without any proxy geometry. The method significantly outperforms 3DGRT in rendering quality while maintaining real-time performance, and is natively compatible with triangles optimized by the rasterization-based Triangle Splatting, thereby achieving primitive unification across rasterization and ray tracing pipelines.
VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM: This paper presents VarSplat, the first 3DGS-SLAM system that learns a per-splat appearance variance \(\sigma^2\) and renders a per-pixel uncertainty map \(V\) via the law of total variance. The uncertainty is uniformly applied to tracking, submap registration, and loop detection, achieving robust and state-of-the-art performance across four datasets.
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control: This paper presents VerseCrafter, a video world model based on a unified 4D geometric control representation (static background point cloud + per-object 3D Gaussian trajectories). A lightweight GeoAdapter injects 4D control signals into a frozen Wan2.1-14B video diffusion model, enabling precise and disentangled control over camera and multi-object motion. The authors also construct VerseControl4D, a real-world dataset containing 35K training samples.
VGG-T3: Offline Feed-Forward 3D Reconstruction at Scale: This paper proposes VGG-T3, which compresses the variable-length KV representations in VGGT's global attention layers into fixed-size MLP weights via test-time training (TTT), reducing the computational complexity of offline feed-forward 3D reconstruction from \(O(n^2)\) to \(O(n)\), enabling large-scale scene reconstruction at the thousand-image level (1k images in only 58 seconds).
VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection: This paper proposes VGGT-Det, the first multi-view indoor 3D object detection framework under a sensor-geometry-free (SG-Free) setting. By mining semantic priors (via attention-guided query generation, AG) and geometric priors (via query-driven feature aggregation, QD) from the internal representations of the VGGT encoder, VGGT-Det surpasses prior state-of-the-art methods by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively.
VGGT-SLAM++: Visual SLAM with DEM-Based Covisibility and Local Bundle Adjustment: VGGT-SLAM++ augments the VGGT feed-forward Transformer odometry with Digital Elevation Maps (DEMs) as a compact, geometry-preserving representation. It leverages DINOv2 embeddings for efficient loop closure detection and covisibility graph construction, and applies high-frequency Sim(3) local bundle adjustment to correct short-term drift, achieving a 45% reduction in ATE on TUM RGB-D (0.079m → 0.036m).
VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection: This paper proposes VirPro—an adaptive multimodal pre-training paradigm that provides scene-aware semantic supervision signals for weakly-supervised monocular 3D detection via visually guided probabilistic prompts (Adaptive Prompt Bank + Multi-Gaussian Prompt Modeling). VirPro can be seamlessly integrated into existing WS-M3D frameworks, achieving up to 4.8% AP improvement on KITTI.
Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI: This paper proposes Wanderland, a real-to-sim framework that uses a handheld multi-sensor scanner (LiDAR+IMU+RGB) to capture open-world indoor and outdoor scenes. It employs LIV-SLAM to obtain metric-accurate geometry and camera poses, combines 3DGS for photorealistic rendering with geometrically grounded collision simulation, and constructs a large-scale dataset of 530 scenes / 420K frames / 3.8M m². The work systematically demonstrates that purely vision-based reconstruction falls significantly short of LiDAR-enhanced approaches in metric accuracy, mesh quality, and reliability for navigation policy training and evaluation.
What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?: This paper systematically ablates the design space of synthetic stereo matching training data (floating objects, backgrounds, materials, baselines, and more) and finds that "realistic indoor scenes + dense floating objects + wide baseline" is the optimal combination. Training on the resulting single-source WMGStereo-150k dataset alone outperforms training on a mixture of four classical datasets.
WMGStereo: What Makes Good Synthetic Training Data for Zero-Shot Stereo Matching?: This paper systematically investigates the design space of synthetic stereo datasets by individually varying six key parameters (floating object density, background objects, object types, materials, camera baseline, lighting augmentation) within the Infinigen procedural generator, and quantifies their impact on zero-shot stereo matching. The study finds that the combination of realistic indoor scenes + floating objects is most effective, leading to the construction of the WMGStereo-150k dataset. Training on this single dataset surpasses the combination of SceneFlow + CREStereo + TartanAir + IRS (28% reduction on Middlebury, 25% on Booster), with performance competitive with FoundationStereo.
Where, What, Why: Toward Explainable 3D-GS Watermarking: This paper proposes a representation-native 3D-GS watermarking framework that answers three key questions: Trio-Experts for carrier selection (where), Channel-wise Group Mask for gradient control (what), and decoupled fine-tuning for auditable attribution (why). It surpasses SOTA in both rendering quality (PSNR +0.83 dB) and bit accuracy (+1.24%).
Yo'City: Personalized and Boundless 3D Realistic City Scene Generation via Self-Critic Expansion: This paper proposes Yo'City, a multi-agent framework that achieves user-personalized, text-driven unbounded 3D city generation through a "City–District–Grid" hierarchical planning strategy, a produce–refine–evaluate isometric image synthesis loop, and a scene graph-guided expansion mechanism. The approach comprehensively outperforms existing methods such as SynCity in semantic consistency and visual quality.
Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image: DynaAvatar presents the first zero-shot framework for reconstructing animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Through a static-to-dynamic knowledge transfer strategy and an optical flow-guided DynaFlow loss, the method achieves realistic garment dynamics under limited dynamic training data, surpassing all existing approaches across the board.
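Illustrative sketch for the VarSplat entry above: the note says a per-pixel uncertainty map \(V\) is rendered from per-splat variances \(\sigma^2\) via the law of total variance. The snippet below shows only that identity for a single pixel and a single scalar channel. It is not the paper's implementation; the function name `composite_mean_and_variance`, the normalization of the blending weights into mixture probabilities, and the scalar-channel setup are all assumptions made for illustration.

```python
def composite_mean_and_variance(weights, means, variances):
    """Mixture mean and variance for one pixel (hypothetical helper).

    weights   -- non-negative blending weights of the splats on this ray
    means     -- per-splat appearance values (one scalar channel)
    variances -- per-splat appearance variances (sigma^2)

    Assumption: weights are normalized to sum to 1 so they can be read
    as mixture probabilities.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("blending weights must have a positive sum")
    probs = [w / total for w in weights]

    # E[X] = sum_i p_i * mu_i
    mean = sum(p * m for p, m in zip(probs, means))

    # Law of total variance:
    #   Var[X] = E[Var[X | i]] + Var[E[X | i]]
    within = sum(p * v for p, v in zip(probs, variances))
    between = sum(p * (m - mean) ** 2 for p, m in zip(probs, means))
    return mean, within + between


if __name__ == "__main__":
    # Two equally weighted splats that disagree in appearance.
    mean, var = composite_mean_and_variance(
        [0.5, 0.5], [0.0, 1.0], [0.01, 0.01]
    )
    print(mean, var)
```

Under these assumptions, the variance grows both when individual splats are uncertain (the "within" term) and when splats along the ray disagree with one another (the "between" term).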