🧊 3D Vision

🔬 ICLR2026 · 65 paper notes

3DGEER: 3D Gaussian Rendering Made Exact and Efficient for Generic Cameras: This paper proposes 3DGEER, a framework that derives a closed-form solution for integrating Gaussian density along rays, designs a Particle Bounding Frustum (PBF) for accurate and efficient ray–particle association, and introduces Bipolar Equal-Angle Projection (BEAP) to unify wide-FoV camera representations. 3DGEER achieves geometrically exact and real-time efficient 3D Gaussian rendering under arbitrary camera models, outperforming existing methods comprehensively on both fisheye and pinhole datasets.
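To make the "closed-form ray integral" idea in the 3DGEER entry concrete, the sketch below integrates an unnormalized 3D Gaussian density along an infinite ray and checks the result against numerical quadrature. It uses the textbook completing-the-square identity, not the paper's own derivation. The function names, the precision-matrix input, and the numeric values are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def ray_gaussian_integral(origin, direction, mean, precision):
    """Closed-form integral over t in (-inf, inf) of
    exp(-0.5 * (x(t) - mean)^T P (x(t) - mean)), with x(t) = origin + t * direction.
    `precision` is a symmetric positive-definite inverse covariance P (illustrative input)."""
    delta = [m - o for m, o in zip(mean, origin)]
    Pd = matvec(precision, direction)
    a = dot(direction, Pd)                      # quadratic coefficient in t
    b = dot(delta, Pd)                          # linear coefficient (P symmetric)
    c = dot(delta, matvec(precision, delta))    # constant term
    # Exponent is -0.5 * (a t^2 - 2 b t + c); complete the square in t.
    return math.sqrt(2.0 * math.pi / a) * math.exp(-0.5 * (c - b * b / a))

def ray_gaussian_numeric(origin, direction, mean, precision,
                         t_lo=-30.0, t_hi=30.0, steps=60000):
    """Trapezoidal-rule reference value over a wide finite interval."""
    h = (t_hi - t_lo) / steps
    total = 0.0
    for i in range(steps + 1):
        t = t_lo + i * h
        diff = [o + t * d - m for o, d, m in zip(origin, direction, mean)]
        val = math.exp(-0.5 * dot(diff, matvec(precision, diff)))
        total += val * (0.5 if i in (0, steps) else 1.0)
    return total * h

# Illustrative values only (not taken from the paper).
P = [[2.0, 0.3, 0.0], [0.3, 1.0, 0.1], [0.0, 0.1, 0.5]]
origin = [0.0, 0.0, -3.0]
direction = [0.1, 0.2, 1.0]
mean = [0.2, -0.1, 0.5]

closed = ray_gaussian_integral(origin, direction, mean, P)
numeric = ray_gaussian_numeric(origin, direction, mean, P)
print(closed, numeric)
```

The two printed values should agree closely. A renderer would presumably restrict the integral to the visible part of the ray and fold in opacity, which this sketch leaves out.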
A Genetic Algorithm for Navigating Synthesizable Molecular Spaces: This paper proposes SynGA, a genetic algorithm that operates directly on synthesis routes (synthesis trees), constraining the search strictly within synthesizable molecular space via custom crossover and mutation operators. Combined with ML-guided building block filtering, SynGA achieves state-of-the-art performance on synthesizable analog search and property optimization.
A Step to Decouple Optimization in 3DGS: This paper provides an in-depth analysis of two overlooked coupling issues in 3DGS optimization — update-step coupling (implicit updates and momentum rescaling for invisible viewpoints) and gradient coupling (entanglement of regularization and photometric loss in Adam momentum) — and proposes AdamW-GS by decoupling and recombining these components, simultaneously improving reconstruction quality and reducing redundant primitives without additional pruning operations.
Augmented Radiance Field: A General Framework for Enhanced Gaussian Splatting: This paper proposes the Augmented Radiance Field (ARF) framework, which explicitly models specular components by designing augmented Gaussian kernels with view-dependent opacity. An error-driven compensation strategy is introduced (2D Gaussian initialization → inverse projection to 3D → joint optimization) to enhance existing 3DGS scenes as a plug-and-play post-processing step. The method surpasses state-of-the-art NeRF approaches on multiple benchmarks while requiring only second-order spherical harmonics to capture complex illumination.
Brain-IT: Image Reconstruction from fMRI via Brain-Interaction Transformer: This paper proposes Brain-IT, a framework that employs a brain-inspired Brain Interaction Transformer (BIT) to cluster functionally similar brain voxels into cross-subject shared Brain Tokens, from which localized semantic and structural image features are predicted, enabling high-fidelity reconstruction of images from fMRI signals. With only 1 hour of data, Brain-IT achieves performance comparable to prior methods trained on 40 hours of data.
CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions: CloDS proposes the first framework for unsupervised cloth dynamics learning from multi-view videos. By introducing Spatial Mapping Gaussian Splatting (SMGS) to establish a differentiable mapping between 2D images and 3D meshes, combined with dual-position opacity modulation to address self-occlusion, the method enables a GNN to learn cloth dynamics approaching fully supervised performance without any physical parameter supervision.
Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer: Color3D introduces a paradigm of "colorize one key view → fine-tune a personalized colorizer → propagate colors to all views and timesteps," reducing the complex 3D colorization problem to single-image colorization plus color propagation. It achieves rich colorization, cross-view consistency, and user controllability simultaneously on both static and dynamic 3D scenes.
COOPERTRIM: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception: CooperTrim is an adaptive feature selection framework that evaluates feature relevance via conformal temporal uncertainty estimation and dynamically determines the sharing volume through a data-driven mechanism. It achieves 80.28% bandwidth reduction with comparable performance on cooperative semantic segmentation, and is the first to apply selective sharing to cooperative segmentation tasks.
CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D: This paper proposes CORE-3D, a training-free open-vocabulary 3D semantic segmentation and natural language object retrieval pipeline that achieves state-of-the-art performance on Replica and ScanNet through progressive multi-granularity mask generation, context-aware CLIP encoding, and multi-view 3D fusion.
CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives: This paper proposes CRISP, a method for recovering simulatable human motion and scene geometry from monocular video. By fitting planar primitives to obtain clean, simulation-ready geometry and leveraging human-scene contact modeling to reconstruct occluded regions, CRISP reduces the motion tracking failure rate of a humanoid controller from 55.2% to 6.9%.
Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation: This paper proposes Ctrl&Shift, an end-to-end diffusion framework that decomposes object manipulation into object removal and reference-guided inpainting, and injects relative camera pose control, achieving geometry-consistent fine-grained object manipulation for the first time without relying on explicit 3D reconstruction.
D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping: D-REX is a Gaussian-based differentiable real-to-sim-to-real engine that identifies object mass end-to-end from visual observations and robot control signals, then uses the identified mass to learn force-aware dexterous grasping policies, effectively bridging the sim-to-real gap.
DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics: This paper proposes DiffWind, a physics-constrained differentiable framework that models wind as a grid-based physical field, represents objects as a 3D Gaussian Splatting particle system, simulates wind–object interaction via the Material Point Method (MPM), and incorporates the Lattice Boltzmann Method (LBM) as a physical constraint. The framework jointly reconstructs wind force fields and object motion from video, supports forward simulation under novel wind conditions and wind retargeting, and significantly outperforms existing dynamic scene modeling methods on the newly introduced WD-Objects dataset.
Dynamic Novel View Synthesis in High Dynamic Range: This paper is the first to formally define the HDR Dynamic Novel View Synthesis (HDR DNVS) problem and proposes the HDR-4DGS framework. Through a dynamic tone mapping module, the framework achieves temporally consistent HDR radiance field reconstruction in time-varying scenes, outperforming existing methods on both synthetic and real-world datasets.
Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention: This paper proposes Efficient-LVSM, a dual-stream architecture that decouples input view encoding from target view generation, reducing the complexity of novel view synthesis from \(O(N_{in}^2)\) to \(O(N_{in})\). On RealEstate10K, the model achieves state-of-the-art performance (29.86 dB PSNR) using only 50% of LVSM's training time, with a 4.4× inference speedup.
EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark: This paper introduces EgoNight, the first systematic nighttime egocentric vision benchmark, comprising day-night aligned videos and 3,658 manually verified QA pairs. It reveals that MLLMs suffer up to 32.8% performance degradation under low-light conditions.
EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations: EgoWorld proposes an end-to-end exocentric-to-egocentric view translation framework that extracts three complementary observations from a single third-person image—3D point clouds, hand poses, and text descriptions—projects the point cloud to obtain a sparse egocentric RGB map, and reconstructs a complete high-fidelity egocentric image via diffusion-based inpainting, achieving state-of-the-art performance across four datasets under diverse unseen settings.
Einstein Fields: A Neural Perspective To Computational General Relativity: This paper proposes EinFields, the first framework to apply neural implicit representations to the compression of four-dimensional general relativity simulations. By encoding the metric tensor field as compact neural network weights, it achieves 4000× storage compression and 5–7 digits of numerical precision, while tensor derivatives obtained via automatic differentiation are 5 orders of magnitude more accurate than those from finite differences.
Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances: Leveraging the mathematical property that standard Sliced Wasserstein (SW) distances provide lower bounds and lifted SW distances provide upper bounds for the Wasserstein distance, this paper constructs a minimal linear regression model (the RG framework) that estimates Wasserstein distances with high accuracy using only a small number of exact Wasserstein labels as supervision, comprehensively outperforming the Transformer-based method Wasserstein Wormhole in low-data regimes.
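The regression idea in the entry above can be sketched on toy data. The example below is a deliberately reduced version: it uses only the standard sliced Wasserstein distance (the lower bound) as a single regressor and fits one scale factor by least squares, whereas the paper is described as also using lifted SW upper bounds. The point clouds, the sample sizes, and all function names are assumptions. Exact distances are computed by brute force over permutations, which is only feasible for very small clouds.

```python
import itertools
import math
import random

def w2_1d_squared(xs, ys):
    """Squared 2-Wasserstein distance between equal-size 1D samples (sort and match)."""
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def sliced_w2(X, Y, n_proj=128, seed=0):
    """Monte Carlo sliced 2-Wasserstein distance between 2D point clouds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_proj):
        ang = rng.uniform(0.0, math.pi)          # random projection direction
        c, s = math.cos(ang), math.sin(ang)
        total += w2_1d_squared([c * x + s * y for x, y in X],
                               [c * x + s * y for x, y in Y])
    return math.sqrt(total / n_proj)

def exact_w2(X, Y):
    """Exact 2-Wasserstein distance for equal-size uniform clouds via brute force."""
    n = len(X)
    best = min(
        sum((X[i][0] - Y[p[i]][0]) ** 2 + (X[i][1] - Y[p[i]][1]) ** 2 for i in range(n))
        for p in itertools.permutations(range(n))
    )
    return math.sqrt(best / n)

def random_cloud(rng, n=5, shift=0.0):
    return [(rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]

rng = random.Random(42)
pairs = [(random_cloud(rng), random_cloud(rng, shift=rng.uniform(0.0, 3.0)))
         for _ in range(12)]
sw = [sliced_w2(X, Y) for X, Y in pairs]   # cheap lower bounds
w = [exact_w2(X, Y) for X, Y in pairs]     # expensive exact labels

# Fit W ~ alpha * SW using only the first 4 pairs as labelled supervision.
train = range(4)
alpha = sum(sw[i] * w[i] for i in train) / sum(sw[i] ** 2 for i in train)
pred = [alpha * s for s in sw]
print("alpha:", alpha)
print("mean abs error on held-out pairs:",
      sum(abs(pred[i] - w[i]) for i in range(4, 12)) / 8)
```

Because every projection is 1-Lipschitz, each sliced value is at most the exact distance, so the fitted scale factor cannot fall below 1. Adding an upper-bound regressor, as the entry describes, would give the fit a second anchor.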
FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation: This paper proposes FastGHA, a feed-forward few-shot 3D Gaussian head avatar generation framework that reconstructs an animatable 3D Gaussian head from 4 arbitrary-expression/viewpoint input images in ~1 second, supporting real-time animation at 62 FPS. On Ava-256, it achieves a PSNR of 22.5 dB, surpassing Avat3r's 20.7 dB while being 7.75× faster.
Fused-Planes: Why Train a Thousand Tri-Planes When You Can Share?: This paper proposes Fused-Planes, which decomposes the Tri-Plane representation into shared class-level basis planes (macro) and object-specific detail planes (micro) via a macro-micro decomposition. Combined with latent-space rendering, the method achieves 7× training speedup and 3× memory reduction while maintaining or surpassing the reconstruction quality of independently trained Tri-Planes.
Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints: CLAP (Coarse-to-fine Language-Aligned manipulation Policy) achieves strong generalization to novel instructions and unseen environments through three core components: task decomposition, VLM fine-tuning for 3D keypoint prediction, and 3D-aware representation. It outperforms the state of the art by 12% on GemBench using only 1/5 of the training data.
GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation: GeoPurify purifies noisy features projected from 2D VLMs into 3D by distilling geometric priors from a 3D self-supervised teacher model, achieving performance on par with or superior to full-data SOTA open-vocabulary 3D segmentation using only ~1.5% of the training data.
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra: This work introduces the GIQ benchmark, comprising 224 synthetic and real polyhedra, and systematically evaluates the geometric reasoning capabilities of vision foundation models across four tasks—monocular 3D reconstruction, symmetry detection, mental rotation testing, and zero-shot classification—revealing significant deficiencies in the geometric understanding of current models.
HDR-NSFF: High Dynamic Range Neural Scene Flow Fields: This paper proposes HDR-NSFF, which shifts HDR video reconstruction from the conventional 2D pixel-level fusion paradigm to 4D spatiotemporal modeling. From alternating-exposure monocular videos, it jointly reconstructs HDR radiance fields, 3D scene flow, geometry, and tone-mapping, enabling temporally and spatially consistent dynamic HDR novel view synthesis.
Improving Long-Range Interactions in Graph Neural Simulators via Hamiltonian Dynamics: This paper proposes Information-preserving Graph Neural Simulators (IGNS), which leverage port-Hamiltonian dynamical structure to prevent information dissipation on graphs. Combined with warmup initialization, geometric encoding, and multi-step training objectives, IGNS consistently outperforms existing graph neural simulators across 6 physics simulation benchmarks.
Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry: This paper trains a sparse autoencoder (SAE) on DINOv2 to extract a dictionary of 32,000 visual concepts, systematically investigates how different downstream tasks (classification / segmentation / depth estimation) selectively recruit subsets of these concepts, reveals that the geometry of the representation space goes beyond the Linear Representation Hypothesis (LRH), and proposes a novel Minkowski Representation Hypothesis (MRH) positing that tokens are superpositions of multiple convex mixtures.
Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry: By training a 32,000-unit Sparse Autoencoder dictionary on DINOv2, this work systematically analyzes how downstream tasks recruit distinct concepts, reveals that representational geometry deviates from the Linear Representation Hypothesis (LRH), and proposes the Minkowski Representation Hypothesis (MRH), which posits that token representations are Minkowski sums of multiple convex polytopes, with concepts defined by proximity to prototype points rather than linear directions.
Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps: This paper proposes Light-Geometry Interaction (LGI) maps, a 2.5D representation encoding light-occlusion relationships derived from monocular depth estimation. Embedded within a bridge matching generative framework, LGI maps enable joint modeling of shadow generation and object relighting, achieving state-of-the-art performance on both synthetic and real images.
LaVCa: LLM-assisted Visual Cortex Captioning: This paper proposes LaVCa, a method that leverages LLMs to generate natural language captions for individual voxels in the human visual cortex. Through a four-step pipeline—encoding model construction, optimal image selection, MLLM-based captioning, and LLM-driven keyword extraction with sentence composition—LaVCa reveals voxel-level visual selectivity more accurately and with greater semantic diversity than the prior method BrainSCUBA.
Learning Part-Aware Dense 3D Feature Field for Generalizable Articulated Object Manipulation: This paper proposes PA3FF (Part-Aware 3D Feature Field), a natively 3D dense part-aware feature representation. By combining a Sonata pre-trained backbone with geometric and semantic contrastive learning, PA3FF yields zero-shot part-level features. Paired with a Part-Aware Diffusion Policy (PADP), the system achieves few-shot, highly generalizable articulated object manipulation, substantially outperforming baselines such as CLIP, DINOv2, and GenDP in both simulation and real-world settings.
Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields: This paper proposes the NGFF framework, which reconstructs 3D Gaussian representations from multi-view RGB images and learns explicit neural force fields to drive physics-based dynamics. By solving ODEs, the framework enables interactive, physically plausible 4D video generation that is two orders of magnitude faster than traditional Gaussian simulators, surpassing Veo3 and NVIDIA Cosmos.
Learning Unified Representation of 3D Gaussian Splatting: The native 3DGS parameters \(\boldsymbol{\theta}=\{\mu,\mathbf{q},\mathbf{s},\mathbf{c},o\}\) suffer from non-uniqueness and numerical heterogeneity, making them unsuitable as a learning space for neural networks. This paper proposes the Submanifold Field (SF) representation: each Gaussian primitive is mapped to a continuous color field defined on its iso-probability ellipsoidal surface. The paper proves this mapping is injective, fundamentally eliminating parameter ambiguity. Combined with a VAE trained using an optimal-transport-based Manifold Distance (M-Dist), the approach comprehensively outperforms parameter-based baselines in reconstruction fidelity, cross-domain generalization, and latent space stability.
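The non-uniqueness mentioned in the entry above is easy to reproduce. The sketch below builds the covariance \(\Sigma = R\,\mathrm{diag}(\mathbf{s})^2 R^\top\) from a quaternion and a scale vector, then shows two distinct parameter sets that yield the same Gaussian shape: a sign flip of the quaternion, and a 90° rotation about z paired with swapped x/y scales. The helper names and numeric values are assumptions for illustration; this is not the paper's Submanifold Field construction.

```python
import math

def quat_to_rot(q):
    """Rotation matrix from a quaternion (w, x, y, z); normalizes the input first."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(q, s):
    """Sigma = R diag(s)^2 R^T for quaternion q and per-axis scales s."""
    R = quat_to_rot(q)
    return [[sum(R[i][k] * s[k] ** 2 * R[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

def max_abs_diff(A, B):
    return max(abs(A[i][j] - B[i][j]) for i in range(3) for j in range(3))

# Ambiguity 1: q and -q encode the same rotation.
q = (0.9, 0.1, -0.3, 0.2)
s = (0.5, 1.5, 2.0)
neg_q = tuple(-c for c in q)
diff_sign = max_abs_diff(covariance(q, s), covariance(neg_q, s))

# Ambiguity 2: rotating by 90 degrees about z while swapping the x/y scales.
identity_q = (1.0, 0.0, 0.0, 0.0)
z90_q = (math.sqrt(0.5), 0.0, 0.0, math.sqrt(0.5))
diff_perm = max_abs_diff(covariance(identity_q, (1.0, 2.0, 3.0)),
                         covariance(z90_q, (2.0, 1.0, 3.0)))

print(diff_sign, diff_perm)
```

Both differences should be zero up to floating-point error, even though the raw parameter vectors are far apart. A network regressing those raw parameters has to cope with this kind of many-to-one mapping.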
LiTo: Surface Light Field Tokenization: LiTo encodes surface light fields into compact sets of latent vectors that jointly model 3D geometry and view-dependent appearance. Light field observations are randomly subsampled from RGB-D multi-view images and passed through a Perceiver IO encoder (with 3D local attention supporting 1M-token inputs), followed by a flow-matching geometry decoder and a higher-order spherical harmonic Gaussian decoder. LiTo surpasses TRELLIS in reconstruction and single-image-to-3D generation, and is the first to model view-dependent effects such as specular highlights and Fresnel reflectance within a latent 3D representation.
MEGS2: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning: This paper proposes MEGS2, a method that compresses 3DGS from the perspective of rendering VRAM: it replaces spherical harmonics (SH) entirely with prunable, arbitrarily oriented spherical Gaussians (SG) to reduce per-primitive parameter count, and formulates the joint pruning of primitive count and lobe count as a single memory-constrained optimization problem via a unified soft pruning framework. The result is an 8× reduction in static VRAM and a 6× reduction in rendering VRAM with preserved rendering quality, enabling real-time 3DGS on mobile devices for the first time.
Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos: This work is the first to address the problem of reconstructing renderable 4D HDR scenes from pose-free alternating-exposure monocular videos. Through a two-stage optimization pipeline (orthographic video space → world space), a Video-to-World Gaussian transformation strategy, and temporal luminance regularization, the method achieves 37.64 dB HDR PSNR and 161 FPS on synthetic data, comprehensively outperforming existing approaches.
MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models: This paper presents MultiMat, the first framework to apply large multimodal models (LMMs) to procedural material node graph synthesis. By incorporating intermediate visual rendering feedback of partially generated nodes into the autoregressive generation process (via two conditioning modes: mixed and graph), and pairing this with an incremental constrained tree search for on-the-fly validation and backtracking, MultiMat is trained on 6,878 production-grade Substance Designer materials and substantially outperforms text-only baselines in both unconditional and conditional generation.
NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction: This paper proposes NOVA3R — a non-pixel-aligned amodal 3D reconstruction framework from pose-free images. It employs learnable scene tokens to aggregate global information across views and a flow-matching-based diffusion 3D decoder to generate complete point clouds (including occluded regions). The method addresses two fundamental limitations of pixel-aligned approaches — inability to reconstruct occluded surfaces and redundant geometry in overlapping regions — and outperforms prior SOTA on scene-level and object-level benchmarks including SCRREAM and GSO.
Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images: This paper presents Omni-View, a unified 3D scene understanding and generation model that enhances understanding performance through a texture module (novel view synthesis) and a geometry module (depth/pose estimation), achieving a score of 55.4 on VSI-Bench and surpassing all existing specialized 3D understanding models.
One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image: One2Scene proposes a three-stage framework that decomposes single-image explorable 3D scene generation into: panorama generation → feed-forward 3D Gaussian splatting for geometric scaffold construction → scaffold-guided novel view synthesis. By reformulating panoramic depth estimation as a multi-view stereo matching problem, the method achieves geometrically consistent and freely explorable 3D scene generation.
One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image: This paper proposes One2Scene, which decomposes the ill-posed problem of generating an explorable 3D scene from a single image into three sub-tasks: (1) panorama generation to expand visual coverage, (2) a feed-forward 3DGS network that constructs an explicit 3D geometric scaffold from sparse anchor views, and (3) scaffold-guided novel view synthesis via Dual-LoRA that fuses high-quality anchor views with geometric priors. The method achieves geometrically consistent and photorealistic scene generation under large viewpoint changes, significantly outperforming state-of-the-art methods.
OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation: This paper presents OpenFly, a comprehensive platform for aerial vision-language navigation (VLN) that integrates four rendering engines (UE / GTA V / Google Earth / 3DGS), develops a fully automated data generation pipeline (point cloud acquisition → semantic segmentation → trajectory generation → GPT-4o instruction synthesis), constructs a large-scale dataset of 100K trajectories across 18 scenes, and proposes a keyframe-aware VLN model (OpenFly-Agent) combining keyframe selection with visual token merging. OpenFly-Agent outperforms existing methods by 14.0% and 7.9% in success rate on seen and unseen scenes, respectively.
PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data: This paper proposes PartSAM, the first promptable part segmentation model trained on large-scale native 3D data. It employs a triplane dual-branch encoder (frozen SAM priors + learnable 3D branch) and a SAM-style decoder. A model-in-the-loop annotation pipeline is used to construct 5M+ shape–part pairs. Under open-world settings, a single click from PartSAM outperforms Point-SAM by over 90% in IoU@1.
PD²GS: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting: PD²GS learns a shared canonical Gaussian field and models each interaction state as a continuous deformation of it. Coarse-to-fine motion trajectory clustering and SAM-guided boundary refinement then enable part-level decoupling, reconstruction, and continuous control of articulated objects without any manual supervision.
Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction: This paper proposes PUN (Peering into the UnkNowN), which employs a lightweight feed-forward network, UPNet, to directly predict the uncertainty distribution over all candidate viewpoints on a sphere from a single image — termed a neural uncertainty map (UMap) — thereby replacing the conventional iterative active view selection pipeline that requires repeated retraining of NeRF or 3DGS models. PUN achieves comparable reconstruction quality using only half the viewpoints of the upper bound, while delivering a 400× speedup and over 50% reduction in computational resource consumption during view selection.
pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning: pySpatial is a visual programming framework that enables MLLMs to generate Python code that automatically invokes 3D spatial tools (3D reconstruction, camera pose estimation, novel view synthesis, etc.), transforming limited 2D image inputs into interactively explorable 3D scenes. The framework achieves zero-shot, plug-and-play explicit 3D spatial reasoning, attaining an overall accuracy of 58.56% on the MindCube benchmark—surpassing GPT-4.1-mini by 12.94% and VLM-3R by 16.5%—while also successfully driving a real quadruped robot to perform indoor navigation.
QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models: This paper presents QuadGPT, the first end-to-end autoregressive framework for native quadrilateral mesh generation. It combines unified mixed-topology tokenization (padding triangular faces into 4-vertex blocks), an Hourglass Transformer architecture, and topology-reward-based truncated DPO (tDPO) fine-tuning, and outperforms existing triangle-to-quad conversion pipelines and cross-field-guided methods in Chamfer Distance, Hausdorff Distance, quad ratio, and user preference.
Quantized Visual Geometry Grounded Transformer: To address the deployment demands of the billion-scale 3D reconstruction model VGGT, this paper proposes QuantVGGT, the first dedicated PTQ framework for VGGT. It resolves heavy-tailed activation distributions caused by special tokens via dual-smoothed fine-grained quantization (Hadamard rotation + channel-wise smoothing), and addresses calibration instability via noise-filtered diverse sampling. At 4-bit quantization, the method achieves 3.7× memory compression and 2.5× inference speedup while retaining 98%+ accuracy.
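The Hadamard-rotation ingredient named in the entry above can be illustrated in isolation. The sketch below rotates an activation vector containing one large outlier with an orthonormal Hadamard matrix, quantizes it to 4 bits with a single per-tensor scale, and rotates back. It covers only the rotation idea under simplifying assumptions: QuantVGGT's channel-wise smoothing and calibration sampling are not modelled, and the vector, the quantizer, and the function names are hypothetical.

```python
import math

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of 2)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in H]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def quantize(x, bits=4):
    """Symmetric uniform quantization with one per-tensor scale (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    return [round(v / scale) * scale for v in x]

def sq_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Activation vector with a single heavy outlier (made-up values).
x = [8.0, 0.3, -0.2, 0.1, 0.25, -0.15, 0.05, 0.2]
H = hadamard(len(x))
Ht = transpose(H)

rotated = matvec(H, x)               # outlier energy is spread across all channels
round_trip = matvec(Ht, rotated)     # H is orthonormal, so H^T undoes the rotation

err_plain = sq_error(x, quantize(x))
err_rotated = sq_error(x, matvec(Ht, quantize(rotated)))
print("max |x|:", max(abs(v) for v in x), "max |Hx|:", max(abs(v) for v in rotated))
print("error without rotation:", err_plain, "with rotation:", err_rotated)
```

The rotation shrinks the largest magnitude, so the quantization grid is finer for the same bit width. Whether the reconstruction error improves for a given input depends on the data, which is why the sketch prints both errors and asserts nothing about their order.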
RadioGS: Radiometrically Consistent Gaussian Surfels for Inverse Rendering: RadioGS introduces a radiometric consistency loss that minimizes the residual between the learned radiance of each Gaussian surfel and its physically rendered radiance, providing physics-based supervision for unobserved directions. This forms a self-correcting feedback loop that enables accurate indirect illumination and material decomposition, while supporting relighting in minutes.
Scaling Sequence-to-Sequence Generative Neural Rendering: This paper presents Kaleido, a family of decoder-only rectified flow transformers that treats 3D as a special subdomain of video. Through Unified Positional Encoding, a masked autoregressive framework, and a video pretraining strategy, Kaleido achieves "any-to-any" 6-DoF novel view synthesis without any explicit 3D representation. It is the first generative method to match per-scene optimization (InstantNGP) in rendering quality under multi-view settings, and scales resolution from 512/576px to 1024px.
SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation: SceneTransporter reformulates open-world structured 3D scene generation as a global correspondence assignment problem by introducing an entropic optimal transport (OT) framework into the denoising loop of a compositional 3D latent diffusion model. The OT plan gates cross-attention to enforce exclusive patch-to-part routing (preventing feature entanglement), while edge-regularized assignment costs encourage clean instance separation at image boundaries. The approach achieves state-of-the-art instance-level consistency and geometric fidelity on 74 diverse open-world scene images.
Sharp Monocular View Synthesis in Less Than a Second: SHARP generates approximately 1.2 million 3D Gaussians from a single image via a single feedforward pass, completing inference in under one second on an A100 GPU with rendering speeds exceeding 100 FPS. It achieves state-of-the-art zero-shot generalization across 6 datasets, reducing LPIPS by 25–34% and synthesis time by three orders of magnitude compared to the strongest prior method.
Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction for 3D-Aware Distillation: Within a student-teacher distillation framework, this work augments the teacher with a pretrained feed-forward 3D reconstruction model (MVSplat) that lifts 2D features into a 3D Gaussian representation and renders them to novel viewpoints, enabling the student to learn geometrically consistent, 3D-aware 2D features. The proposed method surpasses existing approaches across downstream tasks including depth estimation, surface normal estimation, semantic segmentation, and multi-view correspondence.
Splat Feature Solver: This paper unifies the feature lifting problem for 3D splat representations as a sparse linear inverse problem \(AX=B\), proposes a closed-form solver with a provable \((1+\beta)\)-approximation error bound under convex loss, and introduces two regularization strategies—Tikhonov Guidance and Post-Lifting Aggregation—achieving state-of-the-art performance on open-vocabulary 3D segmentation.
Station2Radar: Query-Conditioned Gaussian Splatting for Precipitation Field: This paper proposes Query-Conditioned Gaussian Splatting (QCGS), the first method to introduce 2D Gaussian Splatting into precipitation field generation. By fusing satellite imagery with sparse automatic weather station (AWS) observations, QCGS achieves flexible-resolution precipitation field reconstruction without radar input, reducing RMSE by over 50% compared to conventional gridded products.
StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams: StreamSplat proposes a fully feed-forward online dynamic 3D reconstruction framework that enables instant generation of dynamic 3DGS representations from uncalibrated video streams, achieving 1200× speedup over optimization-based methods through three key innovations: probabilistic position sampling, bidirectional deformation fields, and adaptive Gaussian fusion.
Stroke3D: Lifting 2D Strokes into Rigged 3D Model via Latent Diffusion Models: Stroke3D is the first method to generate rigged 3D mesh models directly from user-drawn 2D strokes and text prompts. It employs a skeleton-first two-stage pipeline: a graph VAE and graph DiT are used to generate controllable 3D skeletons, followed by TextuRig dataset augmentation and SKA-DPO optimization to synthesize high-quality meshes.
Stylos: Multi-View 3D Stylization with Single-Forward Gaussian Splatting: Stylos proposes a single-forward 3D style transfer framework that achieves zero-shot 3D stylization from uncalibrated inputs via a dual-path design with a shared Transformer backbone (geometry self-attention + style cross-attention) and a voxel-level 3D style loss, supporting scalability from single-view to hundreds of views.
SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors: SurfSplat proposes a feedforward 3D reconstruction framework based on 2DGS that binds Gaussian rotation and scale to local neighborhood positions via surface continuity priors, resolves color bias through a forced alpha blending strategy, and introduces the HRRC metric to reveal reconstruction quality discrepancies at high resolutions.
Topology-Preserved Auto-regressive Mesh Generation in the Manner of Weaving Silk: This paper proposes a "silk-weaving"-inspired mesh tokenization algorithm that provides a canonical topological framework through vertex layering and ordering, guaranteeing manifoldness, watertightness, normal consistency, and part-awareness in generated meshes while achieving state-of-the-art compression efficiency.
UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images: This paper proposes UFO-4D, a unified feedforward framework that directly predicts dynamic 3D Gaussian representations from two unposed images, enabling jointly consistent estimation of 3D geometry, 3D motion, and camera pose, achieving up to 3× improvement over existing methods on geometry and motion benchmarks.
Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction: This paper proposes USplat4D, an uncertainty-aware dynamic Gaussian splatting framework that estimates per-Gaussian time-varying uncertainty scores and constructs uncertainty-guided spatiotemporal graphs to propagate reliable motion cues, substantially improving monocular 4D reconstruction quality in occluded regions and under extreme novel viewpoints.
Universal Beta Splatting: This paper proposes Universal Beta Splatting (UBS), which generalizes 3D Gaussian Splatting to an N-dimensional anisotropic Beta kernel. By enabling per-dimension shape control, UBS unifies spatial geometry, view-dependent appearance, and scene dynamics within a single representation, achieving interpretable scene decomposition and state-of-the-art rendering quality.
UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction: This paper proposes UrbanGS, a scalable 3DGS reconstruction framework for urban-scale scenes that simultaneously improves geometric accuracy, rendering quality, and memory efficiency through depth-consistent D-Normal regularization, spatially adaptive Gaussian pruning (SAGP), and a unified partitioning strategy.
Weight Space Representation Learning on Diverse NeRF Architectures: This paper proposes the first representation learning framework capable of processing weights from diverse NeRF architectures (MLP / tri-plane / hash table). By combining a Graph Meta-Network (GMN) encoder with a SigLIP contrastive loss, it constructs an architecture-agnostic latent space, enabling classification, retrieval, and language-grounded tasks across 13 NeRF architectures, with generalization to architectures unseen during training.