🧊 3D Vision
🧠 NeurIPS 2025 · 116 paper notes
- 3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation
-
This paper proposes Tri-MARF, a tri-modal multi-agent framework comprising a VLM annotation agent (multi-view, multi-candidate description generation), an information aggregation agent (BERT clustering + CLIP weighting + UCB1 multi-armed bandit selection), and a point cloud gating agent (Uni3D text–point cloud alignment for hallucination filtering). The system achieves a CLIPScore of 88.7 (surpassing human annotation at 82.4) and a throughput of 12k objects/hour, and has annotated approximately 2 million 3D models.
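For reference, the UCB1 rule used by the aggregation agent is the textbook bandit criterion; a minimal, paper-agnostic sketch in which each candidate description is an arm (the `stats` bookkeeping and the reward source, e.g. a CLIP-agreement score, are assumptions for illustration, not Tri-MARF's actual interfaces):

```python
import math

def ucb1_select(stats, t):
    """Pick the arm (candidate description) with the highest UCB1 score.

    stats: list of (mean_reward, pull_count) per candidate.
    t:     total number of selections made so far (t >= 1).
    """
    for i, (_, n) in enumerate(stats):
        if n == 0:
            return i  # try every candidate once before exploiting
    scores = [mean + math.sqrt(2.0 * math.log(t) / n) for mean, n in stats]
    return max(range(len(stats)), key=scores.__getitem__)
```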
- 3D Visual Illusion Depth Estimation
-
This paper reveals that 3D visual illusions (e.g., wall paintings, screen replays, mirror reflections) severely mislead existing state-of-the-art monocular and stereo depth estimation methods. The authors construct a large-scale dataset comprising approximately 3k scenes and 200k images, and propose a VLM-driven monocular-stereo adaptive fusion framework that achieves state-of-the-art performance across diverse illusion scenarios.
- Anti-Aliased 2D Gaussian Splatting
-
This paper proposes AA-2DGS, which addresses severe aliasing artifacts in 2D Gaussian Splatting under varying sampling rates through two complementary mechanisms: a world-space flat smoothing kernel and an object-space Mip filter. The method significantly improves multi-scale rendering quality while preserving the geometric accuracy advantages of 2DGS.
- ARMesh: Autoregressive Mesh Generation via Next-Level-of-Detail Prediction
-
This paper proposes to formulate 3D mesh generation as a coarse-to-fine, next-level-of-detail prediction process. By reversing a generalized mesh simplification algorithm (GSlim), a progressive refinement sequence is obtained, which is then learned autoregressively via a Transformer. Generation begins from a single point and incrementally adds geometric and topological detail to produce a complete mesh.
- AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians
-
AtlasGS incorporates the Atlanta-world structural prior into an implicit-structured Gaussian representation, achieving smooth surface reconstruction that preserves high-frequency detail in indoor and urban scenes and comprehensively outperforming existing implicit and explicit methods.
- BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading
-
This paper proposes BecomingLit, a method that reconstructs high-fidelity, relightable, and real-time renderable head avatars from low-cost light stage multi-view sequences using 3D Gaussian primitives and hybrid neural shading (neural diffuse BRDF + analytic Cook-Torrance specular). A new publicly available OLAT facial dataset is also released.
- CLIPGaussian: Universal and Multimodal Style Transfer Based on Gaussian Splatting
-
CLIPGaussian proposes the first unified style transfer framework based on Gaussian Splatting, supporting text- and image-guided stylization of 2D images, videos, 3D objects, and 4D dynamic scenes. It integrates as a plug-and-play module into existing GS pipelines without requiring large generative models or retraining from scratch, and without altering model size.
- Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
-
Concerto combines intra-modal 3D point cloud self-distillation with cross-modal 2D-3D joint embedding prediction. Through a minimalist design, a single point cloud encoder (PTv3) develops emergent spatial representations that surpass both 2D/3D unimodal methods and their naive concatenation, achieving state-of-the-art performance on multiple 3D scene understanding benchmarks (ScanNet semantic segmentation: 80.7% mIoU).
- Copresheaf Topological Neural Networks: A Generalized Deep Learning Framework
-
This paper proposes Copresheaf Topological Neural Networks (CTNNs), which leverage the algebraic-topological notion of copresheaves to define directional, heterogeneous message passing on combinatorial complexes. The framework unifies CNNs, GNNs, Transformers, Sheaf Neural Networks, and Topological Neural Networks as special cases, and surpasses conventional baselines on physics simulation, graph classification, and higher-order complex classification tasks.
- CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning
-
This paper introduces CosmoBench—the largest cosmological geometric deep learning benchmark to date—comprising 34,752 point clouds and 24,996 directed trees across multiple scales, viewpoints, and tasks. A key finding is that simple linear models sometimes outperform large GNNs.
- Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation
-
Cue3D is the first model-agnostic framework for quantifying the importance of image cues in single-image 3D generation. By systematically perturbing six visual cues—illumination, texture, silhouette, perspective, edges, and local continuity—across seven methods spanning three paradigms (regression-based, multi-view, and native 3D generation), it reveals key insights: shape meaningfulness rather than texture governs generalization ability, illumination matters more than texture, and models are overly dependent on input silhouettes.
- D\(^2\)USt3R: Enhancing 3D Reconstruction for Dynamic Scenes
-
This paper proposes the Static-Dynamic Aligned Pointmap (SDAP) representation, which unifies 3D alignment of static and dynamic regions into a single framework, enabling DUSt3R-based methods to achieve accurate dense 3D reconstruction and correspondence estimation in dynamic scenes.
- DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting
-
This paper proposes DC4GS, an adaptive density control method based on Directional Consistency (DC), which improves primitive splitting decisions and split position selection in 3DGS by exploiting the angular coherence of positional gradients. DC4GS reduces the number of primitives by up to 30% while improving reconstruction quality.
- DGH: Dynamic Gaussian Hair
-
This paper proposes Dynamic Gaussian Hair (DGH), a data-driven coarse-to-fine framework that learns hair dynamics via a volumetric implicit deformation model, and achieves photorealistic novel-view rendering of dynamic hair by combining cylindrical Gaussian representations with a curvature blending strategy.
- DualFocus: Depth from Focus with Spatio-Focal Dual Variational Constraints
-
This paper proposes DualFocus, which achieves robust and accurate depth estimation from focal stacks via two complementary constraints: a spatial variational constraint (exploiting focus-dependent gradient patterns to distinguish depth edges from texture artifacts) and a focal variational constraint (enforcing a unimodal and monotonic focus probability distribution along the focal axis).
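The focal variational constraint lends itself to a compact penalty; below is a minimal sketch of one way to score unimodality of a per-pixel focus probability curve (an illustrative formulation, not the paper's exact loss):

```python
import torch

def unimodality_penalty(p):
    """Penalize focus probability curves that are not single-peaked.

    p: (N,) probabilities over N focal slices (summing to 1).
    The curve should be non-decreasing before its argmax and
    non-increasing after it; violations accumulate into the penalty.
    """
    k = torch.argmax(p)
    diff = p[1:] - p[:-1]                         # diff[i] = p[i+1] - p[i]
    rise_violation = torch.relu(-diff[:k]).sum()  # must rise before the peak
    fall_violation = torch.relu(diff[k:]).sum()   # must fall after the peak
    return rise_violation + fall_violation
```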
- Dynamic Gaussian Splatting from Defocused and Motion-blurred Monocular Videos
-
A unified framework is proposed that jointly models defocus blur and motion blur via learnable blur kernel convolution, combined with a dynamic Gaussian densification strategy and unseen-view constraints, enabling high-quality novel view synthesis of dynamic scenes from blurry monocular videos using 3DGS.
- DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation
-
DynaRend is proposed to jointly learn 3D geometry, semantics, and dynamics on triplane representations via differentiable volumetric rendering, using two complementary objectives — masked reconstruction and future prediction — enabling efficient transfer to downstream robotic manipulation tasks after pre-training.
- E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization
-
This paper proposes E-MoFlow, which models optical flow as an implicit neural representation and egomotion as a continuous spline, jointly optimizing both via differential geometric constraints under an unsupervised paradigm to achieve 6-DoF egomotion and dense optical flow estimation from event data.
- EA3D: Online Open-World 3D Object Extraction from Streaming Videos
-
This paper proposes EA3D (ExtractAnything3D), an online open-world 3D object extraction framework that performs simultaneous geometric reconstruction and comprehensive scene understanding from streaming videos via knowledge-integrated feature maps, online visual odometry, and recurrent joint optimization.
- EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes
-
EAG3R integrates asynchronous event streams from event cameras into the MonST3R point map reconstruction framework. Through a Retinex enhancement module, an SNR-aware fusion mechanism, and an event photometric consistency loss, it achieves robust depth estimation, pose tracking, and 4D reconstruction in extreme low-light dynamic scenes, significantly outperforming RGB-only methods via zero-shot transfer to nighttime scenarios.
- EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting
-
EF-3DGS is the first work to introduce event cameras into free-trajectory scene reconstruction. It employs an Event Generation Model (EGM) to reconstruct latent inter-frame images for continuous supervision, Contrast Maximization (CMax) combined with a Linear Event Generation Model (LEGM) to extract motion information for pose calibration, and a photometric BA + Fixed-GS strategy to resolve color inconsistency. The method achieves a 3 dB PSNR improvement and a 40% reduction in ATE in high-speed scenarios.
- ELECTRA: A Cartesian Network for 3D Charge Density Prediction with Floating Orbitals
-
This paper proposes ELECTRA (Electronic Tensor Reconstruction Algorithm), an equivariant Cartesian tensor network that reconstructs electron density by predicting the positions, weights, and covariance matrices of floating Gaussian orbitals. On the QM9 benchmark, ELECTRA achieves 2.4× higher accuracy than the state-of-the-art method SCDP while being 4.4–11× faster at inference, and reduces the number of SCF iterations in DFT by 50.72%.
- EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
-
EnerVerse is a generative robotic foundation model that constructs a 4D embodied space via chunk-wise autoregressive video diffusion, sparse context memory, and multi-view generation priors. Combined with a 4DGS data flywheel to narrow the Sim2Real gap, it translates 4D world representations into physical actions through a policy head, achieving state-of-the-art performance on the LIBERO benchmark.
- EUGens: Efficient, Unified, and General Dense Layers
-
EUGens introduces a new family of efficient dense layers that leverage random features to reduce the inference complexity of fully connected feedforward layers (FFLs) from quadratic to linear in layer width. The framework unifies existing efficient FFL extensions and achieves up to 27% inference speedup and 30% parameter compression across LLM pre-training, ViT image classification, and NeRF/iSDF 3D reconstruction tasks, while supporting layer-wise knowledge distillation without backpropagation.
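The underlying trick is classical: push activations through a fixed random projection and learn only a small read-out, so cost scales linearly with width. A generic random-features layer in this spirit (a sketch of the idea, not EUGens' actual layer family):

```python
import torch

class RandomFeatureFFL(torch.nn.Module):
    """Random-features stand-in for a dense feedforward layer.

    A full d x d layer costs O(d^2) per token; a frozen random projection
    to m << d features plus a learned read-out costs O(d * m).
    """
    def __init__(self, d: int, m: int):
        super().__init__()
        self.register_buffer("W", torch.randn(d, m) / m ** 0.5)  # frozen projection
        self.readout = torch.nn.Linear(m, d)                     # learned part

    def forward(self, x):
        z = torch.cos(x @ self.W)  # random-Fourier-style features
        return self.readout(z)
```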
- Evaluation of Vision-LLMs in Surveillance Video
-
This paper proposes a training-free two-stage framework that leverages small Vision-LLMs to generate textual descriptions of video content, followed by an NLI classifier for zero-shot scoring. It systematically evaluates the impact of prompting strategies and privacy-preserving filters on anomalous behavior recognition in surveillance videos.
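The second stage is plain NLI-based zero-shot classification; a minimal sketch with Hugging Face `transformers` (the model choice, description, and label set are illustrative, not the paper's configuration):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

description = "A person climbs over the fence and runs across the parking lot."
labels = ["normal behavior", "anomalous behavior"]

result = classifier(description, candidate_labels=labels)
print(dict(zip(result["labels"], result["scores"])))  # entailment-based scores
```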
- Every Camera Effect, Every Time, All at Once: 4D Gaussian Ray Tracing for Physics-based Camera Effect Data Generation
-
This paper proposes 4D Gaussian Ray Tracing (4D-GRT), which integrates 4D Gaussian Splatting with physics-based ray tracing. After reconstructing dynamic scenes from multi-view videos, the method renders physically accurate video data with controllable camera effects including fisheye distortion, depth of field blur, and rolling shutter artifacts.
- Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation
-
Fin3R is proposed to improve the geometric accuracy and robustness of feed-forward 3D reconstruction models (DUSt3R/MASt3R/CUT3R/VGGT) in a unified and lightweight manner, by freezing the decoder and fine-tuning the encoder via monocular knowledge distillation with re-normalization LoRA adapters.
- FlareX: A Physics-Informed Dataset for Lens Flare Removal via 2D Synthesis and 3D Rendering
-
This paper proposes the FlareX dataset, generated through three stages—parameterized template creation, illumination-law-guided 2D synthesis, and physics-engine-based 3D rendering—to produce physically realistic lens flare data. Models trained on FlareX significantly outperform those trained on all prior datasets on real-world test sets.
- Flux4D: Flow-based Unsupervised 4D Reconstruction
-
Flux4D is proposed as an unsupervised and generalizable 4D dynamic driving scene reconstruction framework. It employs a feed-forward network to directly predict 3D Gaussians and their motion velocities, achieving large-scale scene reconstruction using only photometric loss and a static-preference regularization. The method surpasses all unsupervised approaches on PandaSet and Waymo while approaching the performance of supervised methods.
- From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes
-
This paper proposes Anywhere3D-Bench, the first 3D visual grounding benchmark spanning four levels—area, space, object, and part—revealing that even the strongest models (Gemini-2.5-Pro and o3) achieve only ~30% accuracy on space-level tasks and ~40% on part-level tasks, far below the human performance of 95%.
- From Pixels to Views: Learning Angular-Aware and Physics-Consistent Representations for Light Field Microscopy
-
This paper proposes XLFM-Former, which learns angular–spatial priors of XLFM through view-level Masked View Modeling (MVM-LF) self-supervised pretraining, and introduces an Optical Rendering Consistency Loss (ORC Loss) based on PSF differentiable rendering to constrain the physical plausibility of the reconstructed volume. On the first standardized XLFM-Zebrafish benchmark constructed by the authors, the method achieves an average PSNR of 54.04 dB, surpassing the best baseline ConvNeXt (50.16 dB) by 7.7%.
- From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries
-
This paper proposes FactoredScenes, which decomposes real-world 3D scene generation into a five-step factorization pipeline — learning a layout program library from synthetic data, generating scene programs via LLM, executing programs to obtain axis-aligned layouts, program-conditioned hierarchical pose prediction, and object retrieval and placement. The method achieves 38.3% FID improvement and 80.4% KID improvement on bedrooms, with human evaluators able to distinguish generated scenes from real ScanNet scenes only 67% of the time.
- Fully Dynamic Algorithms for Chamfer Distance
-
This paper proposes the first fully dynamic algorithm for maintaining Chamfer distance, reducing the problem to approximate nearest neighbor (ANN) queries to achieve a \((1+\epsilon)\) approximation with update time \(\tilde{O}(\epsilon^{-d})\), significantly surpassing the linear-time lower bound of static recomputation. On real-world datasets, the algorithm achieves <10% relative error while running orders of magnitude faster than naive approaches.
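Why ANN queries suffice is clear from the static definition: Chamfer distance is a sum of nearest-neighbor distances in each direction. A static reference implementation (one common variant; definitions differ in squaring and normalization):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer(A, B):
    """Chamfer distance between point sets A (n, d) and B (m, d)."""
    d_ab, _ = cKDTree(B).query(A)  # nearest neighbor in B for each point of A
    d_ba, _ = cKDTree(A).query(B)  # nearest neighbor in A for each point of B
    return d_ab.sum() + d_ba.sum()

A, B = np.random.rand(1000, 3), np.random.rand(1200, 3)
print(chamfer(A, B))
```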
- Galactification: Painting Galaxies onto Dark Matter Only Simulations Using a Transformer-Based Model
-
This paper proposes a multimodal Transformer encoder–decoder framework that takes density and velocity fields from inexpensive dark matter N-body simulations as input and autoregressively generates galaxy catalogs (positions + physical properties). The model faithfully reproduces hydrodynamical simulation results across multiple statistical metrics while achieving approximately 100× computational speedup.
- GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies
-
GauDP is proposed to enable scalable, perception-enhanced multi-agent collaborative imitation learning by constructing a globally consistent 3D Gaussian field from decentralized RGB observations of multiple agents and dynamically allocating Gaussian attributes back to each agent's local viewpoint.
- Gaussian-Augmented Physics Simulation and System Identification with Complex Colliders
-
This paper proposes AS-DiffMPM, a differentiable Material Point Method (MPM) framework supporting arbitrary-shape rigid body colliders, combined with multiple novel-view synthesis methods to enable system identification of physical parameters from visual observations.
- Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span
-
This paper proposes EgoSpanLift, a method that lifts egocentric 2D gaze predictions into 3D space, constructing multi-level volumetric visual span representations. Combined with a 3D U-Net and a causal Transformer, the framework forecasts future 3D regions of visual attention.
- GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion
-
This paper proposes GeoComplete, which injects projected point clouds as geometric conditions into a dual-branch diffusion model and employs a target-aware masking strategy for geometrically consistent reference-driven image completion, improving PSNR by 17.1%.
- GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction
-
This paper proposes GeoSVR, an explicit surface reconstruction framework based on sparse voxels. By introducing voxel-uncertainty depth constraints and sparse voxel surface regularization, GeoSVR comprehensively outperforms existing 3DGS- and SDF-based methods in geometric accuracy, detail preservation, and reconstruction completeness.
- GOATex: Geometry & Occlusion-Aware Texturing
-
GOATex proposes the first occlusion-aware 3D mesh texturing framework. It decomposes meshes into visibility layers ordered from outermost to innermost via a ray-casting-based hit-level mechanism, applies a two-stage visibility control strategy combining normal flipping and residual face clustering, and performs visibility-weighted blending in UV space—achieving high-quality texture generation for both exterior surfaces and occluded interior surfaces.
- HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene
-
HAIF-GS proposes a dynamic 3DGS framework built upon sparse motion anchors, achieving state-of-the-art rendering quality on the NeRF-DS and D-NeRF benchmarks via three key mechanisms: an anchor filter that separates dynamic and static regions, a self-supervised induced scene flow that guides temporally consistent deformation, and hierarchical anchor densification that captures fine-grained non-rigid motion.
- High Resolution UDF Meshing via Iterative Networks
-
This paper proposes the first iterative meshing method for Unsigned Distance Fields (UDFs), which progressively propagates neighborhood information into local voxel pseudo-sign predictions through multiple forward passes. The approach effectively resolves surface holes and discontinuities caused by noisy neural UDFs at high resolutions, significantly outperforming existing single-pass methods across multiple datasets.
- How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?
-
This paper systematically demonstrates that 90–95% of tokens in 3D point cloud Transformers (e.g., PTv3, Sonata) are redundant, and proposes gitmerge3D — a globally informed graph-based token merging method that achieves up to 5.3× FLOPs reduction and 6.4× memory savings with negligible accuracy loss, via an energy-score-driven adaptive merging strategy.
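As a point of reference for what token merging does, here is a ToMe-style bipartite merge by feature similarity (a generic sketch, not gitmerge3D's globally informed graph method; duplicate matches are simply overwritten for brevity):

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """Merge the r most similar cross-set token pairs by averaging.

    x: (N, C) tokens; returns (N - r, C) tokens.
    """
    a, b = x[0::2], x[1::2]                                   # bipartite split
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T   # cosine similarity
    score, match = sim.max(dim=-1)                            # best partner per a-token
    order = score.argsort()
    keep_a, merge_a = order[:-r], order[-r:]                  # least / most similar
    b = b.clone()
    b[match[merge_a]] = (b[match[merge_a]] + a[merge_a]) / 2  # average merged pairs
    return torch.cat([a[keep_a], b], dim=0)
```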
- Hybrid Physical-Neural Simulator for Fast Cosmological Hydrodynamics
-
This paper proposes a hybrid physical-neural cosmological simulator that handles gravitational dynamics via a differentiable particle-mesh (PM) method and parameterizes the effective gas pressure field using a physics-constrained neural network. The model requires only a single reference simulation for training and outperforms the EGD baseline at both the field level and the statistics level.
- HyPlaneHead: Rethinking Tri-plane-like Representations in Full-Head Image Synthesis
-
This paper systematically analyzes three fundamental problems of tri-plane-like representations in 3D-aware head synthesis — mirror artifacts, non-uniform mapping, and feature penetration — and proposes a hybrid hy-plane representation (planar + spherical) combined with a unify-split strategy and near-equal-area warping, achieving state-of-the-art performance in full-head image synthesis.
- HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis
-
This paper proposes Hybrid Radiance Fields (HyRF), which combines compact explicit Gaussians (storing only 8 parameters each) with decoupled grid-based neural fields, achieving 20× model compression while attaining state-of-the-art rendering quality and real-time performance.
- IBGS: Image-Based Gaussian Splatting
-
This paper proposes Image-Based Gaussian Splatting (IBGS), which enhances standard 3DGS rendering quality by learning color residuals from neighboring training images. The method significantly improves the modeling of high-frequency details and view-dependent effects without introducing additional storage overhead.
- IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants
-
This paper presents IndEgo — the first large-scale multimodal egocentric vision dataset targeting real industrial environments. It comprises 3,460 egocentric video clips (~197 hours) and 1,092 exocentric recordings (~97 hours), covering five task categories (assembly/disassembly, logistics, maintenance, woodworking, and miscellaneous) as well as collaborative work scenarios. Three benchmarks are established: mistake detection, reasoning-based QA, and collaborative task understanding.
- Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks
-
This paper proposes a class of universal Stabilization Adapters that can be inserted into nearly any image model architecture. By freezing the base network and training only the adapter parameters, combined with a unified accuracy–stability–robustness loss function, the method endows frame-level models with video temporal consistency and corruption robustness.
- Jasmine: Harnessing Diffusion Prior for Self-Supervised Depth Estimation
-
This paper is the first to incorporate the visual prior of Stable Diffusion into a self-supervised monocular depth estimation (SSMDE) framework. It proposes the Mix-Batch Image Reconstruction (MIR) proxy task to shield the SD prior from corruption by reprojection noise, and introduces the Scale-Shift GRU (SSG) to bridge the gap between SD's scale-shift-invariant (SSI) and self-supervised scale-invariant (SI) depth distributions. Jasmine achieves AbsRel = 0.090 on KITTI, establishing a new state of the art among all SSMDE methods, while comprehensively outperforming supervised SD methods such as Marigold, E2E FT, and Lotus in zero-shot generalization.
- LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS
-
By treating each 3D Gaussian as a sparse code over a global dictionary, LangSplatV2 replaces the heavyweight decoder with a sparse coefficient field, achieving 476.2 FPS high-dimensional feature splatting and 384.6 FPS 3D open-vocabulary querying — a 47× speedup over LangSplat.
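The decoder-free decoding step reduces to a tiny matrix product: in practice the low-dimensional coefficients are splatted first and the dictionary is applied to the rendered map, but per Gaussian the reconstruction looks like the toy below (dimensions and sparsity level are made up for illustration):

```python
import numpy as np

D = np.random.randn(64, 512)    # global dictionary: 64 atoms x 512-dim features

idx = np.array([3, 17, 42])     # one Gaussian's sparse support (k = 3)
w = np.array([0.6, 0.3, 0.1])   # its sparse coefficients

feature = w @ D[idx]            # (512,) high-dim feature, no MLP decoder needed
```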
- Learning Efficient Fuse-and-Refine for Feed-Forward 3D Gaussian Splatting
-
This paper proposes a Fuse-and-Refine module that aggregates pixel-aligned Gaussian primitives into a coarse-to-fine voxel hierarchy via a hybrid Splat-Voxel representation. A sparse voxel Transformer fuses approximately 200K primitives within 15 ms, yielding ~2 dB PSNR improvement. The model is trained exclusively on static scenes yet generalizes zero-shot to streaming dynamic scene reconstruction.
- Learning Neural Exposure Fields for View Synthesis
-
This paper proposes Neural Exposure Fields (NExF), which achieves 3D-consistent high-quality view synthesis by learning optimal exposure values per 3D point rather than per image. On HDR scenes, NExF surpasses the state-of-the-art by 3.5+ dB in PSNR while being 50× faster.
- Linearly Constrained Diffusion Implicit Models
-
This paper proposes CDIM, a DDIM-based algorithm for solving linear inverse problems. By aligning the residual energy with the \(\chi^2\) distribution of the forward diffusion process, CDIM adaptively controls the number and step size of projection steps, achieving inference speeds 10–50× faster than DPS while exactly satisfying measurement constraints in the noiseless case.
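In the noiseless case, exactly satisfying the measurements amounts to projecting the current estimate onto the affine set \(\{x : Ax = y\}\); a generic least-norm projection is sketched below (CDIM's contribution is deciding when and how strongly to apply such corrections during DDIM sampling, which this sketch omits):

```python
import numpy as np

def project_onto_measurements(x, A, y):
    """Least-norm correction so that A @ x_new == y (assumes A has full row rank).

    x: (n,) current estimate; A: (m, n) measurement operator; y: (m,) observations.
    """
    residual = A @ x - y
    correction = A.T @ np.linalg.solve(A @ A.T, residual)
    return x - correction
```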
- LinPrim: Linear Primitives for Differentiable Volumetric Rendering
-
This paper proposes LinPrim, which replaces 3D Gaussian kernels with linear primitives (octahedra and tetrahedra) as the scene representation for novel view synthesis. Through a differentiable rasterization pipeline, LinPrim enables end-to-end optimization and achieves reconstruction quality comparable to 3DGS on real-world datasets using fewer primitives, while maintaining real-time rendering capability.
- Locality-Sensitive Hashing-Based Efficient Point Transformer for Charged Particle Reconstruction
-
By combining LSH with a Point Transformer, the paper proposes HEPTv2 for end-to-end particle track reconstruction, eliminating the DBSCAN clustering post-processing bottleneck and achieving a 28.9× speedup while maintaining competitive tracking efficiency.
- LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering
-
This paper proposes LODGE, which manages 3D Gaussian Splatting at multiple scales through a hierarchical Level-of-Detail (LOD) strategy. By dynamically selecting Gaussian representations of appropriate granularity based on camera distance, LODGE enables high-quality real-time rendering of large-scale scenes.
- Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
-
Look and Tell introduces a multimodal dataset that synchronously captures gaze, speech, and dual-view video from 25 participants in a kitchen environment using Meta Aria smart glasses and a fixed GoPro camera. Combined with 3D scene reconstruction and a multi-level annotation pipeline, it provides the first benchmark for studying referential communication across egocentric and exocentric perspectives.
- MaNGO: Adaptable Graph Network Simulators via Meta-Learning
-
This paper proposes MaNGO (Meta Neural Graph Operator), which leverages meta-learning and conditional neural processes (CNP) to learn shared latent structure across simulation tasks under varying physical parameters, enabling rapid adaptation to new physical parameters without retraining.
- MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference
-
MaterialRefGS is proposed to achieve high-fidelity novel view synthesis and accurate illumination decomposition for reflective surfaces, via multi-view consistent material inference constraints and a 2DGS ray-tracing-based environment modeling strategy.
- Mesh-RFT: Enhancing Mesh Generation via Fine-Grained Reinforcement Fine-Tuning
-
This paper proposes Mesh-RFT, a framework that achieves face-level fine-grained mesh quality optimization through a topology-aware scoring system and Masked Direct Preference Optimization (M-DPO), significantly improving the geometric integrity and topological regularity of generated meshes.
- Mesh Interpolation Graph Network for Dynamic and Spatially Irregular Global Weather Forecasting
-
This paper proposes MIGN, a framework that maps irregular weather station data onto a regular HEALPix mesh via a mesh interpolation strategy for message passing, and introduces parameterized spherical harmonics positional encoding to enhance spatial generalization, achieving significant improvements over existing methods on global weather forecasting tasks.
- Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex
-
This paper proposes BraInCoRL (Brain In-Context Representation Learning), a Transformer-based meta-learning framework that predicts voxel-level neural responses for new subjects directly from a small number of stimulus–response samples via in-context learning (ICL), requiring no fine-tuning to generalize to new subjects or stimuli. With only 100 images as context, it approaches the performance of a reference model fully trained on 9,000 images.
- MetaGS: A Meta-Learned Gaussian-Phong Model for Out-of-Distribution 3D Scene Relighting
-
MetaGS is proposed to achieve high-quality 3D scene relighting under out-of-distribution (OOD) lighting conditions by embedding a differentiable Blinn-Phong reflectance model into 3D Gaussian splatting and adopting a bilevel meta-learning training strategy.
- Metropolis-Hastings Sampling for 3D Gaussian Reconstruction
-
This paper proposes an adaptive Metropolis-Hastings framework to replace the heuristic density control mechanism in 3DGS. Through probabilistic sampling driven by multi-view photometric error, it achieves more efficient inference of Gaussian distributions and converges faster than 3DGS-MCMC.
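The accept/reject core is ordinary Metropolis-Hastings with photometric error playing the role of an energy; a schematic step follows (the `errors` callable, the symmetric Gaussian proposal, and the unit temperature are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def mh_step(positions, errors, proposal_std, rng):
    """One Metropolis-Hastings move on a randomly chosen Gaussian's position.

    positions: (G, 3) Gaussian centers; errors: callable mapping a position
    to its multi-view photometric error (lower is better).
    """
    i = rng.integers(len(positions))
    proposal = positions[i] + rng.normal(0.0, proposal_std, size=3)
    e_old, e_new = errors(positions[i]), errors(proposal)
    if rng.random() < min(1.0, np.exp(e_old - e_new)):  # symmetric proposal
        positions[i] = proposal
    return positions
```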
- More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models
-
Merge proposes a plug-and-play framework that inserts lightweight learnable Converters before each frozen pretrained T2I diffusion block, enabling depth estimation with only ~12% additional parameters while perfectly preserving the original image generation capability. It achieves state-of-the-art performance among unified models on multiple zero-shot depth estimation benchmarks.
- Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding
-
Motion4D proposes a unified 4D Gaussian splatting framework that incorporates priors from 2D foundation models (semantic masks, point tracking, depth) into 3D representations via an iterative refinement strategy, achieving spatiotemporally consistent motion and semantic modeling. The method significantly outperforms existing approaches on video object segmentation, point tracking, and novel view synthesis tasks.
- Motion Matters: Compact Gaussian Streaming for Free-Viewpoint Video Reconstruction
-
This paper proposes ComGS, a framework that exploits the locality and consistency of motion in dynamic scenes to drive the motion of all Gaussians in moving regions using only ~200 keypoints. ComGS achieves 159× storage compression over 3DGStream and 14× over QUEEN while maintaining competitive visual quality and rendering speed.
- MPMAvatar: Learning 3D Gaussian Avatars with Accurate and Robust Physics-Based Dynamics
-
MPMAvatar integrates a Material Point Method (MPM) physics simulator with 3D Gaussian Splatting rendering. Through an anisotropic constitutive model and a novel collision handling algorithm for mesh-based colliders, it achieves accurate and robust physical animation of loose garments. On ActorsHQ and 4D-DRESS, it outperforms PhysAvatar across both geometry and appearance metrics, achieving a 100% simulation success rate vs. 37.6%, with a per-frame simulation time of only 1.1 seconds.
- NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods
-
This paper proposes NerfBaselines, an evaluation framework that addresses unfair comparisons in novel view synthesis (NVS) caused by inconsistent evaluation protocols. By wrapping original method code with a unified API and enforcing environment isolation, the framework ensures that each method's behavior exactly matches its original release. Experiments reveal that seemingly minor protocol differences—such as image resizing strategies and background colors—can significantly alter method rankings.
- Neural Green's Functions
-
This paper proposes Neural Green's Functions, a learnable linear PDE solution operator based on eigendecomposition: pointwise geometric features are extracted from the domain geometry to predict the eigendecomposition of the Green's function, enabling one-time training to solve for arbitrary source functions and boundary conditions via numerical integration. On mechanical part thermal analysis, the method reduces error by 13.9% over the state-of-the-art neural operator while running 350× faster than numerical solvers.
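Once the eigendecomposition \(G(x,y)=\sum_i \varphi_i(x)\varphi_i(y)/\lambda_i\) is predicted, solving for a new source is pure numerical integration; a minimal sketch with assumed array shapes (not the paper's API):

```python
import numpy as np

def solve_with_green_eigs(phi, lam, f_vals, quad_w):
    """u(x) = ∫ G(x, y) f(y) dy  with  G(x, y) = Σ_i φ_i(x) φ_i(y) / λ_i.

    phi:    (k, n) eigenfunctions sampled at n quadrature points
    lam:    (k,)   eigenvalues
    f_vals: (n,)   source term at the quadrature points
    quad_w: (n,)   quadrature weights
    """
    coeffs = phi @ (quad_w * f_vals)  # inner products ⟨φ_i, f⟩
    return phi.T @ (coeffs / lam)     # u evaluated at the same points
```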
- Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion
-
This paper reformulates sparse-input novel view synthesis as a test-time natural video completion problem. It leverages pretrained video diffusion models to generate intermediate pseudo-views, and iteratively optimizes 3D Gaussian Splatting (3D-GS) via an uncertainty-aware mechanism, achieving high-fidelity scene reconstruction under extremely sparse input conditions.
- Object-Centric Representation Learning for Enhanced 3D Semantic Scene Graph Prediction
-
Through empirical analysis, this paper identifies object feature discriminability as the critical bottleneck in 3D scene graph predicate prediction (object misclassification accounts for 92%+ of predicate errors). It proposes an independently contrastively pre-trained object encoder (3D-2D-Text tri-modal alignment), a geometry-regularized relation encoder, and a bidirectional edge-gated GNN, achieving new SOTA on 3DSSG with Object R@1 59.53% and Predicate R@50 91.40%.
- On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation
-
This paper proposes the Geometry Encoding Mixer (GEM), a geometry-aware PEFT module designed for 3D point cloud Transformers. It captures fine-grained local geometric details via a Spatial Adapter and injects global scene context via a Context Adapter, achieving performance on par with or exceeding full fine-tuning while updating only 1.6% of parameters.
- Online Segment Any 3D Thing as Instance Tracking
-
AutoSeg3D reformulates online 3D instance segmentation as an instance tracking problem, leveraging long-term memory for cross-frame instance association, short-term memory for instance update, and spatial consistency learning to mitigate VFM over-segmentation. The method surpasses ESAM by 2.8 AP on ScanNet200 while maintaining real-time performance.
- OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects
-
This paper proposes OnlineSplatter, a feed-forward online 3D reconstruction framework that requires no camera poses, depth priors, or global optimization. It achieves constant-time incremental reconstruction of free-moving objects via a dual-key memory module combining appearance-geometry latent keys and orientation keys.
- OpenLex3D: A Tiered Evaluation Benchmark for Open-Vocabulary 3D Scene Representations
-
This paper proposes OpenLex3D, a tiered evaluation benchmark for open-vocabulary 3D scene representations. Built upon Replica, ScanNet++, and HM3D, it provides language annotations 13× richer than the original labels, supporting evaluation on two tasks: open-set 3D semantic segmentation and object retrieval.
- Orientation-anchored Hyper-Gaussian for 4D Reconstruction from Casual Videos
-
This paper proposes OriGS (Orientation-anchored Gaussian Splatting), which achieves high-quality 4D dynamic scene reconstruction from casually captured monocular videos via a global orientation field and an orientation-aware hyper-Gaussian representation.
- Orientation Matters: Making 3D Generative Models Orientation-Aligned
-
This paper introduces the task of orientation-aligned 3D object generation, constructs the Objaverse-OA dataset comprising 14,832 orientation-aligned 3D models across 1,008 categories, fine-tunes two mainstream 3D generation frameworks (Trellis and Wonder3D) to achieve orientation-aligned object generation, and demonstrates two downstream applications: zero-shot orientation estimation and arrow-guided rotation manipulation.
- PhysX-3D: Physical-Grounded 3D Asset Generation
-
PhysX proposes the first end-to-end physical-property-driven 3D asset generation paradigm, comprising PhysXNet (the first 3D dataset with systematic annotations across five physical dimensions—absolute scale, material, functional affordance, kinematics, and functional description—covering 26K+ objects) and PhysXGen (a dual-branch feed-forward generation framework that injects physical knowledge into a pretrained 3D structural latent space).
- Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers
-
This paper proposes Pixel-Perfect Depth, a monocular depth estimation model that performs diffusion generation directly in pixel space (rather than latent space). Through a Semantics-Prompted DiT (SP-DiT) that incorporates high-level semantic representations from visual foundation models and a cascaded DiT design, the model generates flying-pixel-free depth maps, surpassing all published generative models on five benchmarks.
- Plana3R: Zero-shot Metric Planar 3D Reconstruction via Feed-Forward Planar Splatting
-
This paper proposes Plana3R, a feed-forward framework that requires neither camera poses nor planar annotations, predicting sparse 3D planar primitives and metric-scale relative poses from unposed two-view images for zero-shot metric planar 3D reconstruction of indoor scenes.
- PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors
-
PlanarGS detects planar regions via a vision-language foundation model (GroundedSAM) with text prompts, combines multi-view depth priors from DUSt3R, and optimizes 3DGS through coplanarity constraints and geometric prior supervision to achieve high-fidelity surface reconstruction in indoor scenes.
- PointMAC: Meta-Learned Adaptation for Robust Test-Time Point Cloud Completion
-
PointMAC is the first framework to introduce meta-auxiliary learning and test-time adaptation (TTA) into point cloud completion. It leverages Bi-Aux Units (random masked reconstruction + denoising) as self-supervised signals, employs MAML to align auxiliary objectives with the primary task, and at inference updates only the shared encoder for sample-level refinement, achieving state-of-the-art performance on synthetic, simulated, and real-world data.
- Quantifying and Alleviating Co-Adaptation in Sparse-View 3D Gaussian Splatting
-
This paper identifies co-adaptation among Gaussians as the root cause of appearance artifacts in sparse-view 3D Gaussian Splatting, proposes the Co-Adaptation Score (CA) metric to quantify this entanglement, and introduces two plug-and-play regularization strategies—Gaussian Dropout and multiplicative opacity noise injection—that consistently reduce co-adaptation and improve novel view rendering quality across five baseline methods and three datasets.
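Both regularizers are essentially one-liners at the opacity level; generic forms are sketched below (the dropout rate, noise shape, and clamping are assumptions, not the paper's exact settings):

```python
import torch

def gaussian_dropout(opacity, p=0.1):
    """Randomly silence a fraction of Gaussians so no fixed subset can co-adapt."""
    mask = (torch.rand_like(opacity) > p).float()
    return opacity * mask

def multiplicative_opacity_noise(opacity, sigma=0.1):
    """Lognormal multiplicative perturbation of per-Gaussian opacities."""
    return (opacity * torch.exp(sigma * torch.randn_like(opacity))).clamp(0.0, 1.0)
```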
- Reconstruct, Inpaint, Test-Time Finetune: Dynamic Novel-View Synthesis from Monocular Videos
-
This paper proposes CogNVS, which decomposes dynamic scene novel-view synthesis into a three-stage pipeline — 3D reconstruction (recovering visible pixels) → video diffusion inpainting (generating occluded regions) → test-time finetuning (adapting to the target video distribution) — training the inpainting model with purely 2D video self-supervision to achieve zero-shot generalization to new test videos.
- Reconstructing the Local Density Field with Combined Convolutional and Point Cloud Architecture
-
This paper proposes a hybrid neural network architecture combining convolutional networks (U-Net) and point cloud networks (DeepSets) to reconstruct the local dark matter density field from line-of-sight peculiar velocities of dark matter halos, achieving significant improvements over purely convolutional and linear reconstruction methods at small scales.
- Rectified Point Flow: Generic Point Cloud Pose Estimation
-
This paper proposes Rectified Point Flow, a unified generative framework that reformulates pairwise point cloud registration and multi-part shape assembly as a conditional generation problem, estimating part poses by learning a continuous point-wise velocity field.
- RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes
-
ROS-Cam proposes a camera parameter (focal length + pose) optimization method for dynamic scenes supervised solely by a single RGB video. It achieves state-of-the-art accuracy and fastest runtime across 5 datasets via three key contributions: a patch-wise tracking filter for sparse, robust correspondences; a Cauchy distribution-based outlier-aware joint optimization that adaptively down-weights moving objects; and a two-stage optimization strategy grounded in Softplus/convex minimax analysis.
- RigAnyFace: Scaling Neural Facial Mesh Auto-Rigging with Unlabeled Data
-
This paper proposes RigAnyFace (RAF), a scalable facial mesh auto-rigging framework that leverages a 2D supervision strategy to exploit unlabeled neutral meshes for training scale expansion, enabling high-quality FACS blendshape rigging across diverse topologies and disconnected components (e.g., eyeballs).
- Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting
-
AsymGS leverages a key observation—that reconstruction artifacts caused by in-the-wild training data are stochastic in nature—and proposes an asymmetric dual 3DGS framework that suppresses artifacts via complementary masking strategies and consistency constraints. A Dynamic EMA Proxy is introduced for efficient training, achieving significant improvements over existing methods on multiple in-the-wild benchmarks.
- ROGR: Relightable 3D Objects using Generative Relighting
-
This paper proposes ROGR, which leverages a multi-view diffusion relighting model to generate consistent images under multiple lighting conditions, trains a lighting-conditioned NeRF on the resulting dataset, and achieves feed-forward 3D object relighting under arbitrary environment lighting. ROGR attains state-of-the-art performance on the TensoIR and Stanford-ORB benchmarks while supporting interactive rendering.
- Scaffold Diffusion: Sparse Multi-Category Voxel Structure Generation with Discrete Diffusion
-
This paper proposes Scaffold Diffusion, which treats sparse multi-category 3D voxels as token sequences and employs a Masked Diffusion Language Model (MDLM) with 3D sinusoidal positional encodings to generate spatially coherent multi-category voxel structures conditioned on occupancy maps. The method substantially outperforms autoregressive and conventional discrete diffusion baselines on an extremely sparse (>98% background) Minecraft house dataset.
- Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis
-
This paper proposes the first diffusion Transformer for voxel-level whole-brain 4D fMRI conditional generation, combining 3D VQ-GAN latent space compression, a CNN-Transformer hybrid backbone, and strong conditioning via AdaLN-Zero and cross-attention. The model achieves a task activation map correlation of 0.83, RSA of 0.98, and perfect condition specificity across seven cognitive tasks from the HCP dataset.
- SceneForge: Enhancing 3D-text alignment with Structured Scene Compositions
-
This paper proposes SceneForge, a framework that composes individual 3D point cloud objects into multi-object scenes with explicit spatial relations, paired with LLM-refined compositional captions, to enhance data diversity and complexity for 3D-text contrastive learning, yielding consistent performance gains across multiple downstream tasks.
- SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
-
This paper proposes SceneWeaver, the first reflective agent framework for 3D scene synthesis, which unifies multiple scene generation paradigms through a standardized and extensible tool interface. By employing a reason-act-reflect closed-loop for iterative refinement, it comprehensively outperforms existing methods in physical plausibility, visual realism, and semantic alignment.
- Segment then Splat: Unified 3D Open-Vocabulary Segmentation via Gaussian Splatting
-
This paper proposes a novel "Segment-then-Splat" paradigm that assigns Gaussians to distinct object sets prior to 3D reconstruction, thereby eliminating geometric and semantic ambiguity and enabling unified 3D open-vocabulary segmentation for both static and dynamic scenes.
- Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis
-
This paper proposes Shallow Flow Matching (SFM), which leverages weak generator outputs to construct intermediate states within a flow matching framework for coarse-to-fine TTS. Inference begins from these intermediate states rather than pure noise, simultaneously improving synthesis quality and accelerating inference.
- SingRef6D: Monocular Novel Object Pose Estimation with a Single RGB Reference
-
SingRef6D is a lightweight 6D pose estimation pipeline requiring only a single RGB reference image. It fine-tunes Depth-Anything v2 via a token-scaler mechanism for robust depth prediction and introduces depth-aware matching to enhance LoFTR's spatial reasoning, substantially outperforming existing methods on transparent and reflective objects.
- SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
-
This paper introduces the concept of Semantic Orientation, which describes object directions using natural language (e.g., the "insertion direction" of a USB plug or the "handle direction" of a cup). It constructs the large-scale OrienText300K dataset to train the PointSO model for zero-shot orientation prediction, and integrates these components into the SoFar system for 6-DoF scene understanding and robotic manipulation.
- Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles
-
This paper proposes Styl3R, a feed-forward network that decouples 3D reconstruction from stylization via a structure–appearance dual-branch architecture, enabling stylized 3D reconstruction from uncalibrated sparse-view images and arbitrary style images in 0.15 seconds.
- SyncHuman: Synchronizing 2D and 3D Generative Models for Single-View Human Reconstruction
-
SyncHuman is the first framework to unify 2D multi-view generative models and 3D native generative models within a single pipeline. Through pixel-aligned 2D-3D synchronized attention, the two branches mutually enhance each other, achieving high-fidelity textured mesh reconstruction under complex human poses and surpassing existing methods in both geometric accuracy and visual quality.
- TAPIP3D: Tracking Any Point in Persistent 3D Geometry
-
This paper proposes TAPIP3D, which represents video as a camera-stabilized spatiotemporal 3D feature point cloud and iteratively refines multi-frame point trajectories in persistent 3D geometric space via a 3D Neighborhood-to-Neighborhood (N2N) attention mechanism, substantially outperforming existing 3D point tracking methods.
- Temporal Smoothness-Aware Rate-Distortion Optimized 4D Gaussian Splatting
-
This paper proposes the first end-to-end rate-distortion (RD) optimized compression framework for 4D Gaussian Splatting. By exploiting the temporal smoothness prior of dynamic point trajectories via Haar wavelet transforms, the method achieves up to 91× compression over Ex4DGS (average model size ~1.1% of the original) while maintaining reasonable rendering quality and flexible rate-quality trade-off control.
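The temporal-smoothness prior works because a slowly varying trajectory concentrates almost all of its energy in a few Haar coefficients; a self-contained illustration on one coordinate track (signal, threshold, and sizes are arbitrary):

```python
import numpy as np

def haar_1d(x):
    """Full multi-level Haar transform of a length-2^k signal."""
    out, n = x.astype(float).copy(), len(x)
    while n > 1:
        a = (out[0:n:2] + out[1:n:2]) / np.sqrt(2)  # pairwise averages
        d = (out[0:n:2] - out[1:n:2]) / np.sqrt(2)  # pairwise details
        out[:n // 2], out[n // 2:n] = a, d
        n //= 2
    return out

t = np.linspace(0.0, 1.0, 64)
traj = np.sin(2 * np.pi * t)              # one Gaussian coordinate over 64 frames
coeffs = haar_1d(traj)
kept = np.abs(coeffs) > 0.05              # sparsify: keep only large coefficients
print(f"kept {kept.sum()}/{coeffs.size} coefficients")
```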
- Towards 3D Objectness Learning in an Open World
-
This paper proposes OP3Det, a class-agnostic open-world 3D detector that requires no text prompts. It leverages 2D foundation models for 3D object discovery and introduces a cross-modal Mixture-of-Experts (MoE) module to dynamically fuse point cloud and image features, substantially improving recall on novel object categories.
- TP-MDDN: Task-Preferenced Multi-Demand-Driven Navigation with Autonomous Decision-Making
-
This paper proposes the Task-Preferenced Multi-Demand-Driven Navigation (TP-MDDN) benchmark and the AWMSystem autonomous decision-making framework, which achieves long-horizon multi-subtask navigation through three LLM modules—instruction decomposition, dynamic goal selection, and task status monitoring—coupled with a multi-dimensional cumulative semantic map.
- TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming
-
This paper proposes TRIM (Trajectory Reduction and Instance Mask denoising), a post-training framework that accelerates 3D Gaussian diffusion model inference while improving generation quality through temporal trajectory pre-selection and spatial background token pruning. TRIM outperforms baselines such as DiffSplat on both T3Bench text-to-3D and GSO image-to-3D benchmarks.
- U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching
-
This paper proposes U-CAN, an unsupervised point cloud denoising framework that infers multi-step denoising paths via a Noise2Noise matching scheme and geometric consistency constraints. The method approaches supervised performance and demonstrates that the consistency constraint generalizes to 2D image denoising.
- UGM2N: An Unsupervised and Generalizable Mesh Movement Network via M-Uniform Loss
-
UGM2N is an unsupervised mesh movement network that achieves zero-shot generalization across PDE types and mesh geometries—without requiring pre-adapted mesh data—through a localized Node Patch representation and an M-Uniform loss function, while guaranteeing freedom from mesh tangling.
- UMAMI: Unifying Masked Autoregressive Models and Deterministic Rendering for View Synthesis
-
This paper proposes UMAMI, a hybrid framework that unifies Masked Autoregressive Models (MAR) and deterministic rendering for sparse-view novel view synthesis. A bidirectional Transformer encodes multi-view image tokens and Plücker ray embeddings; two lightweight MLP heads handle visible regions (deterministic regression) and occluded regions (MAR diffusion generation) respectively. The rendering speed is an order of magnitude faster than fully generative baselines.
- URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model
-
This paper proposes URDF-Anything, the first end-to-end articulated object reconstruction framework based on a 3D Multimodal Large Language Model (MLLM). By introducing a [SEG] token mechanism, the framework jointly predicts geometric segmentation and kinematic parameters, achieving state-of-the-art performance in segmentation accuracy (mIoU +17%), parameter error (−29%), and physical executability (surpassing baselines by 50%).
- VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment
-
By introducing four view alignment strategies — edge-aware image supervision, visibility-aware multi-view photometric alignment, normal consistency constraints, and depth image feature alignment — VA-GS significantly improves the geometric representation accuracy of 3D Gaussian Splatting, achieving state-of-the-art performance in surface reconstruction and novel view synthesis.
- VisualSync: Multi-Camera Synchronization via Cross-View Object Motion
-
VisualSync presents a multi-camera temporal synchronization framework grounded in epipolar geometry constraints. By leveraging pretrained vision foundation models (VGGT, CoTracker3, MASt3R) to extract motion trajectories and cross-view correspondences, the method estimates per-camera temporal offsets by minimizing the Sampson error, achieving median synchronization errors below 50 ms across four benchmarks.
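The Sampson error named above is the standard first-order approximation to the geometric epipolar error; for a correspondence \((x_1, x_2)\) and fundamental matrix \(F\) it is computed as below (VisualSync minimizes this over per-camera time offsets, which the sketch omits):

```python
import numpy as np

def sampson_error(F, x1, x2):
    """Sampson error of one correspondence; x1, x2 homogeneous (3,), F (3, 3)."""
    Fx1, Ftx2 = F @ x1, F.T @ x2
    num = float(x2 @ F @ x1) ** 2
    den = Fx1[0]**2 + Fx1[1]**2 + Ftx2[0]**2 + Ftx2[1]**2
    return num / den
```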
- Walking the Schrödinger Bridge: A Direct Trajectory for Text-to-3D Generation
-
This paper theoretically establishes SDS as a special case of the Schrödinger Bridge, and builds upon this insight to propose TraCe — a framework that constructs an explicit diffusion bridge between the current rendering and the text-conditioned target, learns the score dynamics along the bridge trajectory via LoRA fine-tuning, and achieves high-quality text-to-3D generation at low CFG values.
- WildCAT3D: Appearance-Aware Multi-View Diffusion in the Wild
-
This paper proposes WildCAT3D, which extends the multi-view diffusion model CAT3D to learn scene-level novel view synthesis from in-the-wild internet data (e.g., tourist photographs) by explicitly modeling global appearance conditions, while simultaneously supporting appearance-controlled generation.
- ZPressor: Bottleneck-Aware Compression for Scalable Feed-Forward 3DGS
-
Grounded in the Information Bottleneck (IB) principle, this paper analyzes the capacity bottleneck of feed-forward 3DGS and proposes ZPressor, a lightweight, architecture-agnostic module that compresses multi-view inputs into a compact anchor-view representation, enabling existing models to scale to 100+ input views (480P, 80GB GPU) with consistent performance gains on DL3DV-10K and RealEstate10K.