🧊 3D Vision
🔬 ICLR2026 · 201 paper notes
📌 Same area in other venues: 📷 CVPR2026 (646) · 🧪 ICML2026 (30) · 🤖 AAAI2026 (79) · 🧠 NeurIPS2025 (116) · 📹 ICCV2025 (267)
🔥 Top topics: 3D Gaussian Splatting ×40 · 3D Reconstruction ×16 · Dynamic Scenes ×14 · Point Cloud ×11 · Novel View Synthesis ×8
- 3DGEER: 3D Gaussian Rendering Made Exact and Efficient for Generic Cameras
-
3DGEER achieves geometrically exact, real-time 3D Gaussian rendering under any camera model through three components: a closed-form solution for integrating Gaussian density along rays, Particle Bounding Frustums (PBF) for precise and efficient ray-particle association, and Bipolar Equi-Angular Projection (BEAP) to unify wide field-of-view camera representations. It outperforms existing methods across the board on fisheye and pinhole datasets.
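The idea of a closed-form ray integral can be illustrated with a minimal sketch. It assumes an isotropic, unnormalized Gaussian and an unbounded ray, which is a simplification chosen here for brevity and not necessarily the formulation derived in 3DGEER. The function names are hypothetical. Completing the square in the ray parameter `t` gives the analytic value, which the second function checks by brute-force numerical integration.

```python
import math


def gaussian_ray_integral(origin, direction, mean, sigma):
    """Closed-form integral over t in (-inf, inf) of
    exp(-0.5 * |origin + t * direction - mean|^2 / sigma^2).

    Assumption: isotropic Gaussian with standard deviation sigma.
    """
    diff = [o - m for o, m in zip(origin, mean)]
    inv_var = 1.0 / (sigma * sigma)
    # Exponent is -0.5 * (a t^2 + 2 b t + c); complete the square in t.
    a = inv_var * sum(d * d for d in direction)
    b = inv_var * sum(d * x for d, x in zip(direction, diff))
    c = inv_var * sum(x * x for x in diff)
    return math.sqrt(2.0 * math.pi / a) * math.exp(-0.5 * (c - b * b / a))


def numeric_ray_integral(origin, direction, mean, sigma,
                         t_min=-20.0, t_max=20.0, steps=40000):
    """Midpoint-rule reference value used only to sanity-check the formula."""
    dt = (t_max - t_min) / steps
    total = 0.0
    for i in range(steps):
        t = t_min + (i + 0.5) * dt
        sq = sum((o + t * d - m) ** 2
                 for o, d, m in zip(origin, direction, mean))
        total += math.exp(-0.5 * sq / (sigma * sigma)) * dt
    return total
```

Because the formula works directly on the ray, it does not depend on how the camera projects that ray to a pixel, which is consistent with the summary's claim of exactness under any camera model.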
- 3DSMT: A Hybrid Spiking Mamba-Transformer for Point Cloud Analysis
-
3DSMT integrates the event-driven low-power characteristics of Spiking Neural Networks (SNNs) with the local modeling of Transformers and the linear-complexity global modeling of Mamba into a hybrid architecture. By utilizing "Spiking Local Offset Attention + Spiking Mamba Blocks," it achieves SOTA results among SNN methods in classification, few-shot, and segmentation tasks, with energy consumption being dozens of times lower than ANN counterparts, while even outperforming several ANN models.
- A²TG: Adaptive Anisotropic Textured Gaussians for Efficient 3D Scene Representation
-
A²TG assigns an "anisotropic texture" with adaptive resolution and aspect ratio to each 2D Gaussian. By utilizing gradient-driven selection and upsampling rules, texture parameters are allocated only to Gaussians that truly require high-frequency details, achieving higher rendering quality and lower VRAM consumption than fixed-square textured Gaussian Splatting under the same memory budget.
- A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features
-
FastForward compresses "mapping" into a single feature extraction step: it uses a set of features randomly sampled from posed mapping images and anchored in 3D space as the scene map. A DUSt3R-style feed-forward network then predicts the 3D coordinates of the query image in one pass to solve the pose. This achieves mapping in seconds and localization in 0.5s, while its accuracy matches or even surpasses SCR/structured methods that require minutes to hours for mapping.
- A Step to Decouple Optimization in 3DGS
-
The paper provides an in-depth analysis of overlooked optimization couplings in 3DGS, specifically update step coupling (implicit updates and momentum rescaling under invisible viewpoints) and gradient coupling (regularization and photometric loss coupling within Adam's momentum). By decoupling and reorganizing these components, the authors propose the AdamW-GS optimizer, which simultaneously improves reconstruction quality and reduces redundant primitives without requiring additional pruning operations.
- Active Learning of 3D Gaussian Splatting with Consistent Region Partition and Robust Pose Estimation
-
This paper proposes an online active learning algorithm for 3D Gaussian Splatting (3DGS). It guides users by suggesting the "next best view" during training. The system partitions the model into consistent regions using visibility features, identifies the most under-reconstructed areas via semantic feature variance, and directly generates the next optimal pose using a von Mises-Fisher distribution. It also incorporates a robust pose optimization module to handle noise from handheld capture, outperforming SOTAs like FisherRF on NeRF-Synthetic in few-shot settings (10/20 views).
- Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
-
The authors reformulate multi-view novel view synthesis as a dual-branch diffusion inpainting task of "image + geometry." By utilizing MoAI (cross-Modal Attention Instillation) to inject attention maps from the image branch into the geometry branch, the method generates aligned novel view images and point clouds directly from pose-free reference images, achieving SOTA performance in camera extrapolation settings.
- All That Glitters Is Not Gold: Key-Secured 3D Secrets within 3D Gaussian Splatting
-
KeySS transforms "hiding multiple 3DGS secret scenes within a single 3DGS cover scene" into an end-to-end trainable framework. It employs a decoder controlled by CLIP-encoded keys to directly map cover Gaussians to secret Gaussians; incorrect keys result in reconstructing only the cover. The study identifies that different Gaussian attributes contribute unequally to hiding secrets (opacity is effective, while spherical harmonics are nearly useless). It proposes the 3D-Sinkhorn distance to measure steganographic imperceptibility in the Gaussian parameter space, ultimately surpassing GS-Hider in reconstruction fidelity and anti-detection security.
- Anime-Ready: Controllable 3D Anime Character Generation with Body-Aligned Component-Wise Garment Modeling
-
Anime-Ready normalizes text or single images into A-pose anime character images, then utilizes Anime-SMPL, a body-aligned component-wise garment DiT, and fragmented texture generation to advance 3D anime characters from "looking similar" to animation-ready assets with skeletons, swappable outfits, and expression control.
- ARTDECO: High-Fidelity Online 3D Reconstruction with Hierarchical Gaussian Structure + Feed-forward Priors
-
ARTDECO utilizes feed-forward 3D foundation models (MASt3R / π³) as modular pose and point cloud priors, coupled with a Gaussian decoder that decodes structured Gaussians from multi-scale features, and a hierarchical semi-implicit Gaussian representation with LoD. This system achieves SLAM-level speed, feed-forward robustness, and rendering quality approaching per-scene optimization from monocular video streams.
- ArtUV: Artist-style UV Unwrapping
-
ArtUV turns the manual UV unwrapping done by professional artists into an automated, end-to-end two-stage process. It first uses SeamGPT to predict semantic seams, then uses a Graph Convolutional + Pyramid Autoencoder to regress clean, low-distortion artist-style UV maps from the "crude UVs" produced by traditional software. It outperforms Blender/Maya and even manual work in distortion, utilization, and speed.
- AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer
-
AssetFormer is an autoregressive Transformer based on the Llama architecture that models modular 3D assets (composed of primitive sequences) as discrete token sequences. By utilizing DFS/BFS graph traversal reordering and joint vocabulary decoding, it generates modular 3D assets directly compatible with game engines from text descriptions.
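To make the traversal-based reordering concrete, here is a minimal sketch of how a graph of connected primitives could be flattened into a sequence in BFS or DFS order. The adjacency-dictionary format and the function names are assumptions for illustration only. The summary does not describe AssetFormer's actual graph construction or tokenization.

```python
from collections import deque


def bfs_order(adjacency, root):
    """Return nodes in breadth-first order starting from root."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def dfs_order(adjacency, root):
    """Return nodes in depth-first (pre-order) order starting from root."""
    order, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push neighbors reversed so the first-listed neighbor is visited first.
        for nxt in reversed(adjacency.get(node, [])):
            if nxt not in seen:
                stack.append(nxt)
    return order


# Hypothetical modular asset: a base connected to two walls that share a roof.
example_asset = {
    "base": ["wall_a", "wall_b"],
    "wall_a": ["roof"],
    "wall_b": ["roof"],
    "roof": [],
}
```

On `example_asset`, BFS keeps sibling primitives adjacent in the sequence, while DFS follows one chain of connected primitives to its end before returning to the siblings.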
- Augmented Radiance Field: A General Framework for Enhanced Gaussian Splatting
-
This paper proposes the Augmented Radiance Field (AugS) framework, which explicitly models specular components by designing augmented Gaussian kernels with view-dependent opacity. It introduces an error-driven compensation strategy (2D Gaussian initialization → inverse projection to 3D → joint optimization) as a plug-and-play post-processing step to enhance existing 3DGS scenes. It outperforms SOTA NeRF methods on several datasets while capturing complex lighting using only second-order Spherical Harmonics (SH).
- BigMaQ: A Big Macaque Motion and Animation Dataset Bridging Image and 3D Pose Representations
-
BigMaQ utilizes 16 calibrated cameras for markerless multi-view motion capture of real macaques, coupling "individual-specific textured 3D surface meshes + frame-wise joint rotation poses" with "ethological action labels." This constitutes the first large-scale non-human primate dataset capable of feeding generative 3D pose vectors directly into action recognition, demonstrating that adding this pose description consistently improves mAP across various vision backbones.
- CHROMA: Consistent Harmonization of Multi-View Appearance via Bilateral Grid Prediction
-
CHROMA utilizes a multi-view aware Transformer to predict per-frame 3D bilateral grid affine transformations for an entire image sequence at once. It corrects appearance inconsistencies caused by camera ISP/exposure differences to a reference frame in a feed-forward manner, significantly improving the quality of novel view synthesis without slowing down 3DGS training.
- CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning
-
CLAP proposes the first unsupervised joint pre-training method for "Camera+LiDAR fusion perception." It utilizes Curvature Sampling to select only highly informative points/pixels, managing the VRAM overhead of differentiable rendering. Furthermore, it employs Learnable Prototypes + EM Training to align both modalities into a shared feature space to exploit complementarity, achieving double the downstream gains compared to the previous SOTA (UniPAD) on NuScenes and Waymo.
- CLoD-GS: Continuous Level-of-Detail via 3D Gaussian Splatting
-
CLoD-GS assigns a learnable "distance decay factor" to each 3D Gaussian, allowing primitive opacity to decrease smoothly with viewing distance. This achieves Continuous Level-of-Detail (CLoD) within a single model, eliminating the multi-version storage and "popping" artifacts of traditional discrete LoD while simultaneously reducing primitive counts and VRAM usage.
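A per-primitive distance decay can be sketched as follows. The exponential form, the visibility threshold, and the function names are assumptions made for illustration. The summary only states that opacity decreases smoothly with viewing distance through a learnable factor.

```python
import math


def effective_opacity(base_opacity, decay, distance):
    """Opacity after distance attenuation (assumed exponential form)."""
    return base_opacity * math.exp(-decay * distance)


def visible_count(gaussians, distance, threshold=0.01):
    """Count primitives whose attenuated opacity stays above a threshold.

    gaussians: list of (base_opacity, decay) pairs; both values are toy numbers.
    """
    return sum(
        1
        for opacity, decay in gaussians
        if effective_opacity(opacity, decay, distance) > threshold
    )


# Toy primitives: one never fades, the other two fade at different rates.
toy_gaussians = [(0.9, 0.0), (0.5, 0.1), (0.3, 0.5)]
```

With such a rule, the number of primitives that contribute to a frame shrinks continuously as the camera moves away, so no discrete switch between LoD versions is needed.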
- CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions
-
CloDS introduces the first framework for unsupervised learning of cloth dynamics from multi-view videos. By establishing a differentiable mapping from 2D images to 3D meshes via Spatial Mapping Gaussian Splatting and resolving self-occlusion with dual-position opacity modulation, the GNN learns cloth dynamics near the level of full supervision without physical parameter labels.
- CogniMap3D: Cognitive 3D Mapping and Rapid Retrieval
-
To be supplemented after in-depth reading.
- Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer
-
Color3D proposes a paradigm of "colorize one key view → fine-tune personalized colorizer → propagate color to all views and timesteps." By converting the complex 3D colorization problem into a single-image colorization and color propagation task, it achieves a unification of rich colorization, cross-view consistency, and user controllability across both static and dynamic 3D scenes.
- ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes
-
ComGS utilizes "Surface Octahedral Probes (SOPs)" to cache indirect illumination and occlusion as octahedral textures on object/scene surfaces. By using KNN interpolation instead of per-iteration ray tracing and simplifying complex scene photometry into "local environment map completion at the placement point," it achieves harmonious 3D object-scene composition with realistic shadows at ~26 FPS and 36 seconds of editing time, outperforming existing methods by +1.4 dB PSNR.
- CompMarkGS: Robust Watermarking for Compressed 3D Gaussian Splatting
-
Aiming at the problem of existing 3DGS watermarks being destroyed after quantization compression, this paper embeds the watermark into the anchor features of anchor-based 3DGS. By using a "Quantization Distortion Layer" to simulate compression noise during training, the watermark maintains ~94% bit accuracy before and after HAC/ContextGS compression, while preserving rendering quality via Frequency-aware Anchor Growth and HSV loss.
- Contact-guided Real2Sim from Monocular Video with Planar Scene Primitives
-
CRISP reconstructs "simulatable" human motions and scene geometry from monocular videos. The core idea involves clustering point clouds into approximately 50 clean, convex planar primitives and completing occluded supporting surfaces using human-scene contact cues. Validated by RL-driven humanoid controllers, this approach reduces motion tracking failure rates from 55.2% to 6.9% (an 8x improvement).
- COOPERTRIM: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception
-
CooperTrim is an adaptive feature selection framework that assesses feature relevance via conformal temporal uncertainty metrics and uses a data-driven mechanism to dynamically determine how much to share. It achieves an 80.28% bandwidth reduction with comparable performance in cooperative semantic segmentation, marking the first application of selective sharing to segmentation tasks.
- Ctrl&Shift: High-Quality Geometry-Aware Object Manipulation in Visual Generation
-
Ctrl&Shift is an end-to-end diffusion framework that achieves geometrically consistent, fine-grained object manipulation without explicit 3D reconstruction by decomposing the task into object removal and reference-guided inpainting, while injecting relative camera pose control.
- CylinderSplat: 3D Gaussian Splatting with Cylindrical Triplanes for Panoramic Novel View Synthesis
-
CylinderSplat utilizes a dual-branch feed-forward 3D Gaussian Splatting framework (pixel branch + volume branch) for panoramic (360°) novel view synthesis. The core innovation involves replacing traditional Cartesian triplanes with cylindrical triplanes that align with panoramic geometry and the Manhattan world assumption. The volume branch completes occluded/sparse regions that the pixel branch cannot recover, achieving SOTA results in both single-view and multi-view panoramic NVS.
- D²GS: Depth-and-Density Guided Gaussian Splatting for Stable and Accurate Sparse-View Reconstruction
-
D²GS addresses two failure modes of 3DGS under sparse views—"near-field overfitting and far-field underfitting"—by using "Depth-and-Density Guided Dropout" to suppress redundant near-field Gaussians and "Distance-Aware Fidelity Enhancement" to reinforce far-field supervision. It also proposes an Inter-Model Robustness (IMR) metric based on Optimal Transport to quantify reconstruction stability, simultaneously achieving state-of-the-art image quality and robustness on LLFF/MipNeRF360.
- DA\(^{2}\): Depth Anything in Any Direction
-
DA\(^{2}\) utilizes a "Perspective-to-Panorama" data engine to transform approximately 540,000 perspective RGB-D pairs into panoramic training data (increasing the total to approximately 607,000). Combined with the SphereViT backbone, which explicitly injects spherical coordinates, it predicts scale-invariant distance end-to-end from a single 360° panorama. In zero-shot settings, it improves AbsRel by approximately 38% over the strongest baselines, even surpassing previous in-domain methods.
- Dens3R: A Foundation Model for 3D Geometry Prediction
-
To be completed after deeper reading.
- Depth Anything 3: Recovering the Visual Space from Any Views
-
To be added after thorough reading of the paper.
- Depth Anything with Any Prior
-
Prior Depth Anything employs a two-stage "coarse-to-fine" pipeline to fuse precise but sparse metric depth priors measured by sensors with complete but relative geometric structures predicted by monocular depth models. A single model unifies three tasks—depth completion, super-resolution, and inpainting—in a zero-shot manner, matching or even exceeding the performance of specialized SOTAs across 7 real-world datasets.
- DepthLM: Metric Depth from Vision Language Models
-
DepthLM demonstrates that a standard VLM does not require a dense prediction head or specialized depth losses. By relying solely on visual markers, intrinsic-conditioned data augmentation, and text-based SFT, it achieves pixel-level metric depth performance that approximates or even surpasses various specialized vision-only depth models for the first time.
- DiffPBR: Point-Based Rendering via Spatial-Aware Residual Diffusion
-
DiffPBR directly renders colored point clouds into photo-realistic, cross-view consistent images: first using adaptive CoNo-Splatting to rasterize sparse point clouds into "just-right" initial color maps and geometry-aware noise maps, then employing Spatial-Aware Residual Diffusion (RDDM) to supplement only the missing high-frequency details. It outperforms SOTA by 3–5 dB PSNR across three datasets, reduces training from 41 to 8 GPU hours, and increases rendering speeds from 3.6 to 10 FPS.
- DiffTrans: Differentiable Geometry-Materials Decomposition for Reconstructing Transparent Objects
-
DiffTrans targets transparent objects with complex topologies and internal absorption textures. It initializes geometry and environment light from multi-view images and masks, then jointly optimizes geometry, index of refraction (IoR), and absorption via a differentiable recursive mesh ray tracer, achieving superior geometric reconstruction and relighting performance in both synthetic and real-world scenes.
- DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics
-
This paper proposes DiffWind, a physics-constrained differentiable framework that jointly reconstructs wind fields and object motion from videos. It models wind as a grid-based physics field, represents objects as 3D Gaussian Splatting (3DGS) particle systems, and uses the Material Point Method (MPM) for wind-object interaction. By incorporating the Lattice Boltzmann Method (LBM) as a physical constraint, the framework supports forward simulation under novel wind conditions and wind retargeting, significantly outperforming existing dynamic scene modeling methods on the self-built WD-Objects dataset.
- DiMeR: Disentangled Mesh Reconstruction Model with Normal-only Geometry Training
-
DiMeR decomposes feed-forward mesh reconstruction from sparse views into two non-interfering branches: geometry relies solely on normal maps, while texture relies solely on RGB images. By equipping the geometry branch with a streamlined FlexiCubes extractor and genuine 3D supervision (eikonal + GT SDF + PBR expectation loss), it reduces Chamfer Distance by over 30% on GSO and OmniObject3D.
- DispViT: Direct Stereo Disparity Regression with a Single-Stream Vision Transformer
-
DispViT discards the "cost volume construction + iterative refinement" paradigm that has dominated stereo matching for decades. It uses a single-stream ViT to tokenize left and right images into a single sequence for direct disparity regression. Supported by lightweight designs such as a shift-embedding tokenizer, asymmetric initialization, probabilistic disparity parameterization, and disparity-aware RoPE, it achieves SOTA accuracy on benchmarks like Scene Flow. It is also significantly more robust and faster in ambiguous scenarios such as occlusion, reflection, and transparency.
- Distractor-free Generalizable 3D Gaussian Splatting
-
TBD
- Do 3D Large Language Models Really Understand 3D Spatial Relationships?
-
The authors find that the high scores of existing 3D Large Language Models (3D-LLMs) on benchmarks like SQA3D are largely driven by "language shortcuts"—a "blind model" that ignores 3D input and fine-tunes only on text QA pairs can match or even outperform SOTA models. Consequently, they construct the more rigorous Real-3DQA benchmark (filtering questions guessable without 3D and introducing viewpoint rotation consistency evaluation) and propose 3D Reweighted Fine-Tuning (3DR-FT) to compel models to utilize 3D cues.
- DreamCS: Geometry-Aware Text-to-3D Generation with Unpaired 3D Reward Supervision
-
DreamCS proposes the first preference alignment framework that directly provides supervision on 3D geometry. It first constructs 3D-MeshPref, an unpaired 3D mesh preference dataset with 30,000 samples via LLMs and human annotation. It then trains RewardCS, a geometry-aware reward model that does not require paired samples, using Cauchy-Schwarz divergence. Finally, it integrates this into the SDS text-to-3D pipeline via differentiable meshization, adaptive mesh fusion, and progressive reward guidance, significantly alleviating Janus (multi-face) issues and geometric incompleteness.
- Dynamic Novel View Synthesis in High Dynamic Range
-
This is the first paper to formalize the HDR Dynamic Novel View Synthesis (HDR DNVS) problem, and it designs the HDR-4DGS framework to address it. By introducing a dynamic tone mapping module, the method achieves temporally consistent HDR radiance field reconstruction in time-varying scenes, outperforming existing methods on both synthetic and real-world datasets.
- EA3D: Event-Augmented 3D Diffusion for Generalizable Novel View Synthesis
-
EA3D integrates continuous geometric cues from event cameras with appearance cues from sparse RGB frames into view-dependent 3D features. These are then decoded by a 3D-aware video diffusion model (modified from CogVideoX) to generate temporally consistent novel view videos. This enables generalizable novel view synthesis without per-scene optimization under fast camera motion, wide baselines, and cross-scene settings.
- EasyCreator: Empowering 4D Creation through Video Inpainting
-
EasyCreator reformulates the task of "generating 4D video with variable camera trajectories and editable content from monocular video" as a video inpainting task. It renders visibility masks of occluded regions using dynamic point clouds and employs a strong video inpainting base (Wan2.1) for completion. Combined with composite masks, self-iterative tuning, and temporal packing inference, it outperforms several camera redirection SOTAs with minimal additional large-scale training.
- Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention
-
This paper proposes Efficient-LVSM, a dual-stream architecture that decouples input view encoding from target view generation, reducing the complexity of novel view synthesis from \(O(N_{in}^2)\) to \(O(N_{in})\). It achieves SOTA performance on RealEstate10K (29.86 dB PSNR) with 50% of the training time and a 4.4x inference speedup.
- EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark
-
This paper proposes EgoNight, the first nocturnal egocentric vision benchmark, featuring day-night aligned videos and 3658 human-verified QA pairs. It reveals a performance degradation of up to 32.8% in MLLMs under low-light conditions.
- EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations
-
EgoWorld proposes an end-to-end exocentric-to-egocentric view translation framework. It extracts three complementary observations—3D point clouds, hand poses, and text descriptions—from a single exocentric image. A sparse egocentric RGB mapping is obtained via point cloud re-projection, followed by high-fidelity egocentric image reconstruction using a diffusion model in an inpainting manner. The method outperforms SOTA across multiple unseen settings on four datasets, including H2O.
- Einstein Fields: A Neural Perspective To Computational General Relativity
-
The authors propose EinFields, the first framework to apply neural implicit representations to the compression of 4D General Relativity simulations. By encoding the metric tensor field into compact neural network weights, it achieves \(4000\times\) storage compression and 5-7 bit numerical precision, while tensor derivatives obtained via automatic differentiation are five orders of magnitude more accurate than finite difference methods.
- ETGS: Explicit Thermodynamics Gaussian Splatting for Dynamic Thermal Reconstruction
-
ETGS embeds an explicit thermodynamic model, where each Gaussian follows a first-order heat transfer ODE, into 3D Gaussian Splatting. By deriving an analytical closed-form solution for the ODE that can be directly evaluated at any time, ETGS reconstructs rapidly changing dynamic thermal scenes with training and rendering efficiency close to static 3DGS, achieving an average PSNR ~5 dB higher than previous state-of-the-art methods on the self-built RHD dataset.
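The benefit of an analytical solution can be shown with a minimal sketch. It assumes the first-order ODE takes the form of Newton's law of cooling, dT/dt = -k (T - T_env). That form and all parameter names are assumptions, since the summary does not give the exact per-Gaussian equation. The closed form can be evaluated at any time in a single step, while a numerical integrator has to step through every intermediate time.

```python
import math


def temperature_closed_form(t, t0_temp, env_temp, k):
    """Analytical solution of dT/dt = -k * (T - env_temp) with T(0) = t0_temp."""
    return env_temp + (t0_temp - env_temp) * math.exp(-k * t)


def temperature_euler(t, t0_temp, env_temp, k, dt=1e-4):
    """Explicit Euler integration of the same ODE, for comparison only."""
    steps = int(round(t / dt))
    temp = t0_temp
    for _ in range(steps):
        temp += dt * (-k * (temp - env_temp))
    return temp
```

Querying the closed form at an arbitrary timestamp costs one exponential per Gaussian, which fits the summary's statement that training and rendering stay close to static 3DGS in efficiency.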
- Exploring the Potential of Encoder-free Architectures in 3D LMMs
-
This paper proposes ENEL, the first encoder-free 3D Large Multimodal Model. It delegates "high-level semantic extraction" and "local geometric inductive bias"—tasks previously handled by pre-trained 3D encoders—directly to the LLM. The 7B model matches the performance of PointLLM-PiSA-13B in classification, captioning, and VQA.
- FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
-
FantasyWorld attaches a trainable geometric branch alongside a frozen video foundation model (Wan2.1). In a single forward pass, it simultaneously outputs camera-conditioned video frames and an implicit 3D field (depth/point maps/camera poses). Through bidirectional cross-attention, geometric constraints guide the video while video priors complement the geometry, exceeding recent geometry-consistent baselines in Multi-view and Style Consistency on WorldScore.
- Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances
-
By leveraging the mathematical property that Sliced Wasserstein (SW) variants provide lower bounds of the Wasserstein distance while lifted SW variants provide upper bounds, the authors construct a minimalist linear regression model (RG framework). Trained with a small number of accurate Wasserstein pairs as supervision, this high-precision proxy estimator significantly outperforms the Transformer-based method, Wasserstein Wormhole, in low-data scenarios.
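The regression idea can be sketched with ordinary least squares on two features: a lower-bound estimate and an upper-bound estimate of the Wasserstein distance. The two-feature, no-intercept model, the function names, and the toy numbers below are all assumptions for illustration. They are not the paper's RG framework or its data.

```python
def fit_two_bound_regression(lower, upper, target):
    """Least-squares fit of target ~ w_lower * lower + w_upper * upper.

    Solves the 2x2 normal equations directly (no intercept term).
    """
    s_ll = sum(l * l for l in lower)
    s_uu = sum(u * u for u in upper)
    s_lu = sum(l * u for l, u in zip(lower, upper))
    s_lt = sum(l * t for l, t in zip(lower, target))
    s_ut = sum(u * t for u, t in zip(upper, target))
    det = s_ll * s_uu - s_lu * s_lu
    if abs(det) < 1e-12:
        raise ValueError("lower and upper features are collinear")
    w_lower = (s_lt * s_uu - s_lu * s_ut) / det
    w_upper = (s_ll * s_ut - s_lu * s_lt) / det
    return w_lower, w_upper


def predict(w_lower, w_upper, lower, upper):
    """Estimate the distance for a new pair from its two bounds."""
    return w_lower * lower + w_upper * upper


# Synthetic toy pairs: bounds plus an "exact" distance lying between them.
toy_lower = [0.8, 1.5, 2.1, 3.0]
toy_upper = [1.4, 2.1, 3.5, 4.2]
toy_exact = [0.4 * l + 0.6 * u for l, u in zip(toy_lower, toy_upper)]
```

If the fitted weights are non-negative and sum to one, every prediction stays between the lower and upper bound, which is one reason a small number of exactly computed pairs can suffice for supervision.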
- FastAvatar: Towards Unified and Fast 3D Avatar Reconstruction with Large Gaussian Reconstruction Transformers
-
Using a unified feed-forward Transformer (LGRT), FastAvatar reconstructs drivable high-quality 3DGS avatars from an arbitrary number (1 to 16 frames) of facial observations, whether single images, multi-view setups, or monocular videos, within seconds. For the first time, it achieves incremental reconstruction where "more observations lead to better quality."
- FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation
-
This paper proposes FastGHA, a feed-forward few-shot 3D Gaussian head avatar framework. It reconstructs animatable 3D Gaussian heads from 4 arbitrary expression/viewpoint images in ~1 second, supports real-time animation at 62 FPS, and achieves a PSNR of 22.5 dB on Ava-256 (surpassing Avat3r's 20.7 dB while being 7.75x faster).
- FastVGGT: Fast Visual Geometry Transformer
-
Addressing the global attention bottleneck of the large-scale feed-forward 3D reconstruction model VGGT, this paper observes that its token attention maps are highly homogeneous ("token collapse"). Based on this, it proposes a training-free, 3D multi-view oriented three-partition token merging strategy, achieving a 4× speedup with 1000 input images while suppressing error accumulation in long sequences.
- FieryGS: In-the-Wild Fire Synthesis with Physics-Integrated Gaussian Splatting
-
FieryGS integrates 3D Gaussian Splatting (3DGS) reconstruction of real-world scenes, MLLM-based material property reasoning, controllable combustion simulation, and unified volume rendering. This pipeline allows users to automatically synthesize dynamic fire, smoke, and charring effects that are visually realistic and adhere to material and geometric constraints in multi-view scenes captured in-the-wild.
- Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM
-
Flash-Mono uses a recurrent feed-forward model to directly predict camera poses and pixel-aligned 2D Gaussian primitives for each frame. By shifting the monocular GS-SLAM paradigm from "training Gaussians from scratch" to "prediction + lightweight refinement," it achieves an approximately 10\(\times\) speedup while maintaining SOTA rendering and tracking quality.
- FlashWorld: High-quality 3D Scene Generation within Seconds
-
To be added after in-depth reading
- Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection
-
Fore-Mamba3D shifts the Mamba encoder from "scanning all scene voxels" to "encoding only foreground voxels." Two mechanisms, sliding window propagation and semantic/geometric fusion, recover the long-range dependencies and context lost due to foreground sparsity, achieving SOTA performance on nuScenes/KITTI/Waymo with lower FLOPs.
- Fracture-GS: Dynamic Fracture Simulation with Physics-Integrated Gaussian Splatting
-
Fracture-GS unifies an "enhanced Collision-MPM" and a "fracture-aware 3D Gaussian continuum representation" into a pipeline from multi-view images to rendering. It specifically handles brittle fractures under extreme mechanical collisions. By using momentum-conserving interface forces to eliminate non-physical adhesion and MVEE Gaussian reconstruction to fill rendering holes at fracture interfaces, it significantly exceeds PhysGaussian and GIC in PSNR/LPIPS/FID and human-evaluated fracture fidelity.
- Frequency-Aware Dynamic Gaussian Splatting
-
This paper reveals the root cause of motion blur in dynamic 3DGS from a frequency perspective—"high-frequency rendering details" and "high-frequency motion" compete for expressive power on fixed Gaussian kernels. It proposes the Frequency-Differentiated Gaussian Kernel (FDGK) and Fourier Deformation Network (FDN) to decouple detail expression from motion modeling, significantly reducing blur and achieving a new SOTA on synthetic and real 4D benchmarks.
- From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting
-
This work utilizes semantic and motion priors from Visual Foundation Models (VFMs) to allocate control points based on "motion complexity" rather than "geometric uniformity." By replacing MLP deformation fields with cubic spline-parameterized node trajectories, the method achieves fast and high-quality dynamic 3DGS reconstruction from monocular videos.
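A spline-parameterized node trajectory can be sketched with a Catmull-Rom spline, one common family of cubic splines that passes through its control positions. The choice of Catmull-Rom, the uniform integer keyframe times, and the function names are assumptions. The summary does not state which cubic spline the paper uses.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Evaluate a uniform Catmull-Rom segment between p1 and p2 at t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2.0 * b
               + (-a + c) * t
               + (2.0 * a - 5.0 * b + 4.0 * c - d) * t2
               + (-a + 3.0 * b - 3.0 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def node_position(keyframes, time):
    """Position of one control node at a continuous time.

    keyframes: at least two 3D positions placed at integer times 0..n-1.
    End segments reuse the boundary keyframe as the missing neighbor.
    """
    n = len(keyframes)
    time = min(max(time, 0.0), float(n - 1))
    i = min(int(time), n - 2)
    t = time - i
    p0 = keyframes[max(i - 1, 0)]
    p1 = keyframes[i]
    p2 = keyframes[i + 1]
    p3 = keyframes[min(i + 2, n - 1)]
    return catmull_rom(p0, p1, p2, p3, t)
```

Compared with an MLP deformation field, a trajectory like this is defined by a handful of explicit control positions per node and can be evaluated at any timestamp with a few multiplications.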
- FullPart: Generating Each 3D Part at Full Resolution
-
FullPart integrates two paradigms: generating bounding box layouts using implicit vecset diffusion, followed by generating details for each part within its own independent, full-resolution voxel grid. It employs center-corner encoding to resolve scale mismatches during assembly and introduces PartVerse-XL—the largest manually annotated 3D part dataset to date (40K objects / 320K parts)—achieving SOTA performance in part-based 3D generation.
- Fused-Planes: Why Train a Thousand Tri-Planes When You Can Share?
-
The paper proposes Fused-Planes, which decomposes the Tri-Plane representation into shared class-level base planes (macro) and object-specific detail planes (micro) through a macro-micro decomposition. Combined with latent space rendering, it achieves 7× training speedup and 3× memory compression while maintaining or even exceeding the reconstruction quality of independent Tri-Planes.
- G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior
-
G4Splat argues that "accurate geometry is the prerequisite for effectively utilizing generative priors." It first derives scale-accurate plane-aware depth using the planar structures ubiquitous in man-made scenes, then integrates this geometry throughout the entire workflow—including visibility estimation, novel view selection, and video diffusion inpainting—to achieve high-quality sparse-view scene reconstruction with superior geometry and appearance in both observed and unobserved regions.
- Generalizable Coarse-to-Fine Robot Manipulation via Language-Aligned 3D Keypoints
-
CLAP (Coarse-to-fine Language-Aligned manipulation Policy) achieves strong generalization to novel instructions and environments through three core components: task decomposition, VLM-finetuned 3D keypoint prediction, and 3D-aware representations. It outperforms SOTA by 12% on GemBench using only 1/5 of the training data.
- Generative Human Geometry Distribution
-
The authors upgrade "Geometry Distribution" from representing a single object to a "generative model scalable to datasets." By replacing network weights with 2D feature maps and using the SMPL template as the source distribution for Flow Matching instead of a Gaussian, this work enables large-scale 3D human generation with geometry distributions for the first time, achieving a 57% improvement in geometry quality over the SOTA.
- GenFusion: Feed-forward Human Performance Capture via Progressive Canonical Space Updates
-
GenFusion accumulates monocular RGB video streams frame-by-frame into a progressively "completed" canonical feature space as temporal context. It then warps this context back to the current frame and renders novel views through diffusion-based probabilistic regression. This allows the model to synthesize frontal details consistent with historical observations even from side-view inputs, producing sharper results than deterministic regression.
- GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation
-
GeoPurify purifies noisy features projected from 2D VLMs into 3D by distilling geometric priors from a 3D self-supervised teacher model. It matches or exceeds SOTA open-vocabulary 3D segmentation performance using only ~1.5% of the training data.
- GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
-
The GIQ benchmark dataset is proposed, comprising 224 synthetic and real polyhedra, to systematically evaluate the geometric reasoning capabilities of vision foundation models across four tasks: monocular 3D reconstruction, symmetry detection, mental rotation tests, and zero-shot classification, revealing significant deficiencies in current models' basic geometric understanding.
- GOLDILOCS: General Object-Level Detection and Labeling of Changes in Scenes
-
GOLDILOCS reformulates cross-time scene change detection as a problem of "where the static 3D reconstruction hypothesis is violated," utilizing MASt3R for dense reconstruction, back-depth conflict filtering, SAM2 for mask tracking, and SSIM for structural differences to simultaneously detect and label object-level changes (added, removed, moved, warped) under zero-training conditions.
- GOOD: Geometry-guided Out-of-Distribution Modeling for Open-set Test-time Adaptation in Point Cloud Semantic Segmentation
-
This work shifts Open-set Test-time Adaptation (OSTTA) from "point-wise" to "geometrically connected superpoint" granularity. It utilizes superpoint purity and entropy confidence combined with a GMM to distinguish ID/OOD, supplemented by superpoint ID prototypes for error correction. This addresses the severe class imbalance in 3D point clouds where ID points are overwhelming while OOD points are sparse or absent.
- Gradient-Direction-Aware Density Control for 3D Gaussian Splatting
-
GDAGS identifies that 3DGS density control considers only the "magnitude" of view-space gradients while ignoring the "direction." It proposes the Gradient Consistency Ratio (GCR) and a nonlinear dynamic weighting rule to prioritize splitting large Gaussians with direction conflicts and cloning small Gaussians with consistent directions. This simultaneously alleviates over-reconstruction and over-densification, achieving comparable or better rendering quality while significantly reducing memory consumption.
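The summary above does not give the formula for the Gradient Consistency Ratio. As a minimal sketch, the block below assumes one plausible definition: the norm of the summed view-space gradients divided by the sum of their norms. The function name and the zero-gradient convention are hypothetical, and the paper's actual GCR and weighting rule may differ.

```python
import math


def gradient_consistency_ratio(grads):
    """Assumed GCR: ||sum of gradients|| / sum of ||gradient||.

    grads: list of (gx, gy) view-space gradients collected for one Gaussian.
    Returns a value in [0, 1]; 1 means every gradient points the same way.
    """
    sum_x = sum(g[0] for g in grads)
    sum_y = sum(g[1] for g in grads)
    norm_of_sum = math.hypot(sum_x, sum_y)
    sum_of_norms = sum(math.hypot(g[0], g[1]) for g in grads)
    if sum_of_norms == 0.0:
        # Arbitrary convention for this sketch: no signal counts as consistent.
        return 1.0
    return norm_of_sum / sum_of_norms


# Gradients that agree in direction versus gradients that cancel out.
aligned = [(1.0, 0.0), (2.0, 0.0), (0.5, 0.0)]
conflicting = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
gcr_aligned = gradient_consistency_ratio(aligned)
gcr_conflicting = gradient_consistency_ratio(conflicting)
```

Under this assumed definition, a magnitude-only criterion cannot tell these two cases apart once the conflicting gradients are averaged, whereas the ratio separates them cleanly.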
- Guaranteed Simply Connected Mesh Reconstruction from an Unorganized Point Cloud
-
Closed triangle meshes are reconstructed from noisy point clouds with an algebraic guarantee of simple connectivity (homeomorphic to a 2-sphere) via Helmholtz-Hodge Decomposition (HHD), filling the gap in topological control for existing methods.
- H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows
-
H2OFlow utilizes 3D generative models to create synthetic HOI data and models point-wise displacement distributions from human to target poses via "dense diffused flow" on point clouds. With zero manual annotation, it simultaneously learns contact, orientation, and spatial occupancy affordances, generalizing effectively to real-world point clouds.
- HDR-NSFF: High Dynamic Range Neural Scene Flow Fields
-
The authors propose HDR-NSFF, which transforms HDR video reconstruction from the traditional 2D pixel-level fusion paradigm to 4D spatio-temporal modeling. By jointly reconstructing the HDR radiance field, 3D scene flow, geometry, and tone-mapping from a monocular video with alternating exposures, they achieve spatio-temporally consistent dynamic HDR novel view synthesis.
- HoloPart: Generative 3D Part Amodal Segmentation
-
HoloPart extends the concept of "amodal segmentation" from 2D to 3D, proposing the new task of "3D Part Amodal Segmentation": decomposing a global mesh into geometrically complete semantic parts (rather than fragmented surface patches) using a diffusion model that incorporates local attention and global shape context for part completion.
- Horseshoe Splatting: Handling Structural Sparsity for Uncertainty-Aware Gaussian-Splatting Radiance Field Rendering
-
Horseshoe Splatting applies a global-local Horseshoe shrinkage prior to the covariance scale of each 3DGS Gaussian. Variational inference then jointly prunes noisy directions automatically and outputs pixel-level uncertainty, matching SOTA rendering quality while providing calibrated uncertainty maps.
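For reference, the standard horseshoe prior on a generic coefficient \(\theta_i\) can be written as below. The symbols \(\theta_i\), \(\lambda_i\), \(\tau\), and \(\tau_0\) are illustrative; how the paper maps them onto the per-Gaussian covariance scales is not stated in this summary and is left as an assumption.

\[
\theta_i \mid \lambda_i, \tau \sim \mathcal{N}\!\left(0,\ \lambda_i^2 \tau^2\right), \qquad
\lambda_i \sim \mathrm{C}^{+}(0, 1), \qquad
\tau \sim \mathrm{C}^{+}(0, \tau_0)
\]

Here \(\mathrm{C}^{+}\) is the half-Cauchy distribution. The global scale \(\tau\) shrinks all coefficients toward zero, while the heavy-tailed local scales \(\lambda_i\) let individual coefficients escape that shrinkage.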
- Human3R: Everyone Everywhere All at Once
-
Human3R freezes the online 4D reconstruction foundation model CUT3R and uses Visual Prompt Tuning (VPT) to insert "human prompts." This allows the model to simultaneously output multi-person SMPL-X meshes (everyone), dense scene point clouds (everywhere), and camera trajectories (all-at-once) in a single feed-forward pass at 15 FPS with 8 GB VRAM, reaching SOTA after training on a single GPU for just one day.
- Hyden: A Hybrid Dual-Path Encoder for Monocular Geometry of High-resolution Images
-
Hyden utilizes a low-resolution ViT to capture global geometry and a full-resolution CNN to recover local details. Through self-distillation using both global and local crop pseudo-labels, it upgrades monocular geometry models like DepthAnything-v2 and MoGe2 into versions that are faster, sharper, and more accurate under high-resolution inputs.
- IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
-
IGGT utilizes a multi-view Geometry Transformer to simultaneously predict cameras, depth, point maps, and instance-level features. By employing 3D-consistent contrastive learning to bind geometric reconstruction and instance semantics within a single representation, it achieves more stable results in semantic 3D reconstruction, multi-view instance matching, and open-vocabulary scene understanding.
- Implicit 4D Gaussian Splatting for Fast Motion with Large Inter-Frame Displacements
-
SPIN-4DGS targets a failure mode under fast motion in which poorly learned Gaussian attributes cause dynamic objects to blur or disappear. It explicitly slices by \((x,y,z,t)\) to obtain reliable spatiotemporal positions, then uses a lightweight feedforward network to decode scale, rotation, color, and opacity directly from these positions. It achieves an average PSNR 1.4–1.7 dB higher than the strongest baselines on six CMU Panoptic Sports scenes, outperforming D3DGS by +1.83 dB in the basketball scene.
- IncVGGT: Incremental VGGT for Memory-Bounded Long-Range 3D Reconstruction
-
Under a completely training-free premise, IncVGGT modifies VGGT/StreamVGGT with two orthogonal modules: "Input-side registration & synthesis" and "History-side Top-k cache pruning." This compresses the quadratic growth of attention to near-constant levels, enabling the processing of 10,000 frames on an 80GB GPU without memory overflow. Compared to StreamVGGT on 500 frames, it reduces operations by 58.5×, memory by 9×, and energy by 25.7×, with 4.9× faster inference while maintaining comparable accuracy.
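The "History-side Top-k cache pruning" idea can be illustrated with a small sketch. It assumes each cached entry already carries a scalar importance score (for example, accumulated attention mass). The scoring rule, function name, and data layout are hypothetical and not taken from the paper.

```python
def prune_cache_top_k(cache, scores, k):
    """Keep the k highest-scoring cache entries, preserving temporal order.

    cache:  list of cached items (stand-ins for per-frame key/value tokens).
    scores: one importance score per cached item (assumed to be given).
    k:      cache budget; the cache never grows beyond this size.
    """
    if len(cache) != len(scores):
        raise ValueError("cache and scores must have the same length")
    if k >= len(cache):
        return list(cache)
    # Rank indices by score, keep the best k, then restore the original order.
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [cache[i] for i in keep]


history = ["f0", "f1", "f2", "f3", "f4"]
importance = [0.1, 0.9, 0.3, 0.8, 0.2]
pruned = prune_cache_top_k(history, importance, k=2)
```

Because the cache size is capped at `k`, the cost of attending to history stays bounded as more frames arrive, which is the intuition behind the near-constant memory claim.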
- Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing
-
Interp3D proposes a training-free framework that leverages the 3D generative prior of TRELLIS to inject a progressive three-stage correspondence — "Semantic Alignment → Structural Alignment → Texture Alignment" — into the diffusion generation process, thereby generating structurally coherent, visually plausible, and smoothly transitioning morphing sequences between two textured 3D assets.
- Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry
-
By training a 32,000-unit Sparse Autoencoder dictionary on DINOv2, this paper systematically analyzes how downstream tasks recruit different concepts. It finds that representation geometry deviates from the Linear Representation Hypothesis (LRH) and proposes the Minkowski Representation Hypothesis (MRH), which posits that token representations are Minkowski sums of multiple convex polytopes, where concepts are defined by proximity to prototypical points rather than linear directions.
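As a reminder of the underlying definition, the Minkowski sum and the form of the hypothesis described above can be written as follows. The index \(K\), the polytopes \(P_k\), and the prototype points \(p_{k,j}\) are illustrative notation, not symbols taken from the paper.

\[
A \oplus B = \{\, a + b : a \in A,\ b \in B \,\}, \qquad
x \in P_1 \oplus P_2 \oplus \cdots \oplus P_K, \qquad
P_k = \mathrm{conv}\{\, p_{k,1}, \dots, p_{k,m_k} \,\}
\]

Under this reading, a token \(x\) decomposes into one point from each polytope, and a concept corresponds to how close that point lies to a prototype \(p_{k,j}\) rather than to a projection onto a single linear direction.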
- Joint Optimization for 4D Human-Scene Reconstruction in the Wild
-
JOSH uses "human-scene contact" as a bridge to integrate camera pose, global human motion, and dense scene point clouds into a single-stage joint optimization, reconstructing physically consistent 4D human-scene interactions from casual monocular web videos. The authors further use JOSH to generate pseudo-labels for 20 hours of web video to train JOSH3R, an end-to-end model capable of real-time inference.
- Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps
-
This paper proposes Light-Geometry Interaction (LGI) maps, a 2.5D representation encoding light-occlusion relationships from monocular depth estimation. These maps are embedded into a bridge matching generation framework to achieve joint modeling of shadow generation and object relighting, attaining SOTA performance on both synthetic and real images.
- Large Depth Completion Model from Sparse Observations
-
LDCM employs a minimalist framework for sparse depth completion without complex modules. At the front end, it uses Poisson reconstruction to align the relative depth from monocular foundation models with sparse observations into metric-consistent coarse depth. At the back end, it replaces traditional depth regression heads with pixel-wise 3D point map regression heads, achieving SOTA performance in zero-shot depth completion and point map estimation across six benchmarks.
- Learning Hierarchical and Geometry-Aware Graph Representations for Text-to-CAD
-
Graph-CAD decomposes the long-horizon "text-to-CAD code" task into three stages. It first leverages an LLM to generate an explicit decomposition graph expressing assembly hierarchies and geometric constraints as an intermediate representation. It then plans actions and generates bpy code sequentially, integrated with a structure-aware progressive curriculum learning approach to push the model's capability boundaries. This raises the Geometric Constraint Satisfaction (GCS) rate on CADBench from ~0.40 (end-to-end) to 0.90.
- Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields
-
The NGFF framework is proposed to construct 3D Gaussian representations from multi-view RGB images and learn explicit neural force fields to drive physical dynamics. By employing ODE solvers, it achieves interactive, physically realistic 4D video generation, running two orders of magnitude faster than traditional Gaussian simulators and surpassing Veo3 and NVIDIA Cosmos.
- Learning Unified Representation of 3D Gaussian Splatting
-
Native 3DGS parameters \(\boldsymbol{\theta}=\{\mu,\mathbf{q},\mathbf{s},\mathbf{c},o\}\) suffer from non-uniqueness and numerical heterogeneity, making them unsuitable as a learning space for neural networks. This paper proposes the Submanifold Field representation: mapping each Gaussian primitive to a continuous color field on its isoprobability ellipsoid. This mapping is proven to be injective, eliminating parameter ambiguity at the source. Combined with a manifold distance (M-Dist) based on optimal transport to train a VAE embedding, this method significantly outperforms parameter-based baselines in reconstruction fidelity, cross-domain generalization, and latent space stability.
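The non-uniqueness mentioned above can be checked numerically. The sketch below uses the usual 3DGS covariance \(\Sigma = R\,\mathrm{diag}(\mathbf{s})^2 R^\top\) and shows two different parameter sets that describe the same Gaussian shape: \(\mathbf{q}\) versus \(-\mathbf{q}\), and a 90° rotation paired with swapped scales. The helper names are illustrative, and this only demonstrates the ambiguity; it is not the paper's Submanifold Field.

```python
import math


def quat_to_rot(q):
    """Rotation matrix from a quaternion (w, x, y, z), normalized first."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(q, s):
    """Sigma = R diag(s)^2 R^T for rotation q and per-axis scales s."""
    R = quat_to_rot(q)
    return [
        [sum(R[i][k] * s[k] ** 2 * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def max_abs_diff(A, B):
    """Largest element-wise difference between two 3x3 matrices."""
    return max(abs(A[i][j] - B[i][j]) for i in range(3) for j in range(3))


# Ambiguity 1: q and -q encode the same rotation.
q = (0.5, 0.5, 0.5, 0.5)
neg_q = (-0.5, -0.5, -0.5, -0.5)
diff_sign = max_abs_diff(covariance(q, (1, 2, 3)), covariance(neg_q, (1, 2, 3)))

# Ambiguity 2: rotating 90 degrees about z while swapping the x/y scales.
h = math.sqrt(0.5)
diff_swap = max_abs_diff(
    covariance((1, 0, 0, 0), (1, 2, 3)),
    covariance((h, 0, 0, h), (2, 1, 3)),
)
```

Both differences are zero up to floating-point error, so a network regressing raw \((\mathbf{q}, \mathbf{s})\) would face several equally valid targets for one and the same primitive.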
- Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting
-
To be added after in-depth reading.
- Light of Normals: Unified Feature Representation for Universal Photometric Stereo
-
LINO UniPS utilizes "Light Register Tokens with light alignment supervision + interleaved attention" to explicitly decouple illumination from normal features within the encoder. It further employs a "wavelet dual-branch + normal-gradient perception loss" to preserve high-frequency geometric details, achieving new SOTA normal accuracy on benchmarks like DiLiGenT and Luces.
- LiTo: Surface Light Field Tokenization
-
LiTo is proposed to simultaneously model 3D geometry and view-dependent appearance by encoding the surface light field into a compact set of latent vectors. By using random sub-sampling of light fields from multi-view RGB-D images as input, a Perceiver IO encoder (supporting local 3D attention for 1 million tokens) coupled with a flow-matching geometry decoder and a high-order Spherical Harmonic Gaussian decoder is employed. This achieves reconstruction and single-image-to-3D generation results that surpass TRELLIS, marking the first time view-dependent effects like specular highlights and Fresnel reflections are modeled in a latent 3D representation.
- LumiTex: Towards High-Fidelity PBR Texture Generation with Illumination Context
-
LumiTex focuses on PBR texture generation given a mesh and a reference image, integrating multi-view illumination context, branched albedo/metallic-roughness material inference, and geometry-guided view completion based on LVSM into a single pipeline. It outperforms open-source and commercial baselines in texture quality, relighting consistency, and human preference.
- Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
-
Lyra employs a camera-controllable video diffusion model as a "teacher" and uses its RGB decoding branch to supervise a newly added 3DGS decoding "student." It achieves feed-forward generation of explicit 3D (and even 4D) Gaussian scenes from a single image or video using only synthetic video self-distillation, without requiring any real-world multi-view data.
- Mango-GS: Enhancing Spatio-Temporal Consistency in Dynamic Scenes Reconstruction using Multi-Frame Node-Guided 4D Gaussian Splatting
-
Mango-GS drives dense 4D Gaussians using a set of sparse control nodes with decoupled "position + latent code," and performs multi-frame temporal Transformer operations in the node space. By shifting from "frame-by-frame memorization of transients" to "modeling motion trends," it achieves SOTA image quality, optimal temporal consistency, and 149.5 FPS real-time rendering for dynamic scene reconstruction.
- MAVEN: A Mesh-Aware Volumetric Encoding Network for Simulating 3D Flexible Deformation
-
MAVEN treats 2D facets and 3D cells within the mesh as explicit nodes for message passing, utilizing "geometry-aware volumetric encoding" to more accurately simulate flexible deformation and contact of 3D solids on sparse meshes.
- MEGS2: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning
-
MEGS2 is proposed to compress 3DGS from the perspective of rendering VRAM: using prunable arbitrary-direction Spherical Gaussians (SG) to completely replace Spherical Harmonics (SH) to reduce parameters per primitive, and a unified soft-pruning framework to model primitive and lobe count pruning as a single memory-constrained optimization problem. It achieves 8x static VRAM and 6x rendering VRAM compression while maintaining rendering quality, enabling 3DGS to run in real-time on mobile devices for the first time.
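To make the Spherical Gaussian replacement concrete, the sketch below evaluates a sum of lobes using the common form \(a \exp(\lambda(\boldsymbol{\mu}\cdot\mathbf{v} - 1))\) for a single color channel. The paper's exact parameterization and pruning logic are not given in this summary, so the function and argument names here are assumptions.

```python
import math


def _normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def eval_spherical_gaussians(view_dir, lobes):
    """Sum of SG lobes for one color channel.

    view_dir: 3-vector, the direction the primitive is viewed from.
    lobes:    list of (axis, sharpness, amplitude) tuples; axis is a 3-vector.
    """
    v = _normalize(view_dir)
    total = 0.0
    for axis, sharpness, amplitude in lobes:
        mu = _normalize(axis)
        cos_angle = sum(a * b for a, b in zip(mu, v))
        # Peak value `amplitude` along the axis, decaying away from it.
        total += amplitude * math.exp(sharpness * (cos_angle - 1.0))
    return total


lobes = [((0.0, 0.0, 1.0), 2.0, 1.0)]
on_axis = eval_spherical_gaussians((0.0, 0.0, 1.0), lobes)
opposite = eval_spherical_gaussians((0.0, 0.0, -1.0), lobes)
```

Each lobe has a free direction, so dropping a lobe removes one localized highlight rather than a whole frequency band. For comparison, the degree-3 SH used in standard 3DGS stores 16 coefficients per color channel.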
- Mesh Splatting for End-to-end Multiview Surface Reconstruction
-
The authors "soften" a mesh into multiple semi-transparent shells along its normals and make these layers differentiable with respect to the base mesh. This allows for end-to-end optimization of the mesh surface using volume rendering, reconstructing high-quality meshes with minimal vertices within 20 minutes.
- Mobile-GS: Real-time Gaussian Splatting for Mobile Devices
-
Mobile-GS incorporates a suite of five techniques—"depth-aware order-independent rendering, neural view enhancement, first-order SH distillation, neural vector quantization, and contribution pruning"—to compress 3DGS to 4.6 MB and achieve 1100+ FPS on desktop. It marks the first implementation of real-time Gaussian Splatting at 116 FPS on a Snapdragon 8 Gen 3 mobile device.
- MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting
-
MoE-GS is the first framework to introduce the Mixture of Experts architecture into dynamic Gaussian Splatting. By employing a Volume-aware Pixel Router to adaptively fuse heterogeneous deformation priors (HexPlane, per-Gaussian, polynomial, and interpolation), it consistently outperforms SOTA on N3V and Technicolor datasets while maintaining efficiency through single-pass rendering, gate-aware pruning, and knowledge distillation.
- MoGen: Detailed Neuronal Morphology Generation via Point Cloud Flow Matching
-
MoGen utilizes flow matching on high-resolution 3D point clouds to generate realistic mouse cortical axon/dendrite fragment morphologies. By feeding millions of synthetic samples into a shape plausibility classifier within a production-grade connectome reconstruction pipeline, it reduces residual reconstruction errors by 4.4%, equivalent to saving approximately 157 person-years of manual proofreading for whole-brain reconstruction.
- Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos
-
This work is the first to address the problem of reconstructing renderable 4D HDR scenes from pose-less alternating-exposure monocular videos. Through a two-stage optimization (orthographic video space → world space), a Video-to-World Gaussian transformation strategy, and temporal luminance regularization, it achieves 37.64 dB HDR PSNR and 161 FPS on synthetic data, significantly outperforming existing methods.
- MOSIV: Multi-Object System Identification from Video
-
MOSIV formalizes "Multi-Object System Identification" as a task for the first time—simultaneously reconstructing the 4D geometry of each object from multi-view videos and optimizing continuous constitutive material parameters per object (stiffness, plasticity, friction). By driving a differentiable MPM simulator with geometry alignment losses, it moves beyond discrete modeling (selecting categories from a fixed material library) to replicate observations and predict long-term future dynamics in contact-heavy multi-object scenes.
- MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models
-
MultiMat is proposed as the first framework to utilize Large Multimodal Models (LMMs) for synthesizing procedural material node graphs. By integrating visual rendering feedback from intermediate nodes during the autoregressive generation process (via Mixed and Graph conditioning modes) and employing incremental constrained tree search for real-time validation and backtracking, the model significantly outperforms text-only baselines after training on 6,878 production-grade Substance Designer materials.
- Nano3D: A Training-Free Approach for Efficient 3D Editing Without Masks
-
This work adapts the training-free 2D editing method FlowEdit into the geometry-appearance decoupled generation pipeline of TRELLIS. By employing Voxel/Slat-Merge based on connected component analysis to fuse "edited regions" back onto the original object, it enables consistent local 3D editing (addition, removal, modification) without masks, training, or multi-view reconstruction, facilitating the construction of the first 3D editing dataset of 100k scale.
- Neural Compression of 3D Meshes using Sparse Implicit Representation
-
The mesh is converted into a Sparse Implicit Representation (SIR) that stores SDF values only near the surface. A 0.42 MB Sparse Convolutional Autoencoder (SNC) performs end-to-end rate-distortion compression, achieving 30%–90% bit rate savings over Draco / V-DMC / G-PCC / NeCGS at near real-time speeds.
- NGS-Marker: Robust Native Watermarking for 3D Gaussian Splatting
-
NGS-Marker embeds watermarks directly into the 3D Gaussian primitives themselves rather than rendered images. Consequently, even if an attacker extracts a small subset of Gaussians to integrate into a new scene, attribution information can be decoded from any local region, specifically addressing "partial infringement" where existing methods fail.
- NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction
-
NOVA3R is proposed for non-pixel-aligned complete 3D reconstruction from unposed images. It employs learnable scene tokens to aggregate global information across views and a flow-matching-based diffusion 3D decoder to generate complete point clouds (including occluded areas). This addresses two fundamental limitations of pixel-aligned methods—only reconstructing visible surfaces and creating redundant geometry in overlapping regions—outperforming SOTA in both scene-level and object-level reconstruction on SCRREAM and GSO datasets.
- ODE-GS: Latent ODEs for Dynamic Scene Extrapolation with 3D Gaussian Splatting
-
ODE-GS decouples "reconstruction" and "future prediction" for dynamic 3D Gaussian Splatting: it first trains a temporal deformation model to generate Gaussian parameter trajectories within the observation window, then utilizes a Transformer + Neural ODE to extrapolate past trajectories into future timestamps in a continuous latent space. This approach avoids out-of-distribution (OOD) failures caused by "timestamp conditioning," improving extrapolation metrics on D-NeRF, NVFi, and HyperNeRF by an average of approximately 19.8%.
- Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images
-
Omni-View is a unified 3D scene understanding and generation model that enhances understanding performance through the generative capabilities of a texture module (novel view synthesis) and a geometry module (depth/pose estimation), achieving a score of 55.4 on VSI-Bench and surpassing all existing specialized 3D understanding models.
- OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
-
The authors construct OmniWorld, a 4D world modeling dataset spanning four domains (simulator, robot, human, and internet) with over 300 million frames, featuring five modalities: depth, camera pose, text, optical flow, and foreground masks. By combining self-collected game engine data with 12 public datasets and an automated annotation pipeline, they demonstrate that fine-tuning existing SOTA models on OmniWorld leads to significant gains in 3D geometric reconstruction and camera-controllable video generation.
- One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image
-
One2Scene is proposed to decompose the ill-posed problem of single-image to explorable 3D scene generation into three sub-tasks: (1) panoramic image generation to extend visual coverage, (2) a feed-forward 3DGS network to construct an explicit 3D geometric scaffold from sparse anchor views, and (3) scaffold-guided novel view synthesis. By fusing high-quality anchor views and geometric priors via Dual-LoRA, the method achieves geometrically consistent and realistic scene generation under large viewpoint changes, significantly outperforming SOTA.
- Open-Set Semantic Gaussian Splatting SLAM with Expandable Representation
-
This work integrates a dynamically expandable semantic feature pool into 3DGS-SLAM. Each 3D Gaussian stores only a low-dimensional index key to soft-aggregate semantics from the shared pool on demand. This enables online reconstruction of 3D scenes with open-vocabulary semantics using minimal memory. Consistency targets and semantic stability guidance are employed to resolve cross-view semantic inconsistencies, improving rendering, trajectory, and segmentation quality in both Replica and handheld captured scenes.
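One way to picture the "low-dimensional index key plus shared pool" design is as a softmax lookup. The sketch below assumes each pool entry has its own key and feature vector, and that a Gaussian's key is compared to pool keys by dot product. These details, and all names, are illustrative assumptions rather than the paper's formulation.

```python
import math


def _softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_semantics(key, pool_keys, pool_features, temperature=1.0):
    """Soft-aggregate a semantic feature for one Gaussian from a shared pool.

    key:           low-dimensional index key stored on the Gaussian.
    pool_keys:     one key per pool entry (same dimension as `key`).
    pool_features: one high-dimensional semantic feature per pool entry.
    """
    logits = [
        sum(a * b for a, b in zip(key, pk)) / temperature for pk in pool_keys
    ]
    weights = _softmax(logits)
    dim = len(pool_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, pool_features))
        for d in range(dim)
    ]


pool_keys = [(1.0, 0.0), (0.0, 1.0)]
pool_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
feature = aggregate_semantics((1.0, 0.0), pool_keys, pool_features, temperature=0.01)
```

The memory saving in this sketch comes from storing only the short key per Gaussian, while the long feature vectors live once in the pool, which can grow as new semantics appear.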
- OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation
-
This work constructs OpenFly, a comprehensive platform for Aerial Vision-Language Navigation (VLN). It integrates 4 rendering engines (UE, GTA V, Google Earth, and 3DGS), develops a fully automatic data generation toolchain (point cloud acquisition → semantic segmentation → trajectory generation → GPT-4o instructions), and builds a large-scale dataset of 100,000 trajectories across 18 scenes. It also proposes a keyframe-aware VLN model, OpenFly-Agent (Keyframe Selection + Visual Token Merging), which outperforms existing methods by 14.0% and 7.9% in Success Rate (SR) on seen and unseen scenes, respectively.
- ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision
-
ORCaS enables unsupervised depth completion models to predict features of occluded regions—areas invisible to the input view but visible to adjacent views—during training. This forces the model to learn an inductive bias regarding 3D object shapes, outperforming previous state-of-the-art methods on VOID1500 / NYUv2 by an average of 8.91%, with significant leads in cross-dataset generalization and sparse inputs.
- OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction
-
OVSeg3R is a training scheme: it treats point clouds obtained from 3D reconstruction of 2D videos as input, lifts open-vocabulary 2D instance segmentation results to 3D as annotations using reconstructed 2D-3D correspondences, and stabilizes training with "View-level Instance Partitioning" (VIP) and "2D Instance Boundary-aware Superpoints" (IBSp). It extends a closed-set SOTA 3D segmenter into an open-vocabulary model, achieving +2.3 mAP overall and +7.7 mAP on novel classes on ScanNet200.
- PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation
-
PAGE-4D attaches a "Dynamics-Aware Aggregator" to the feed-forward 3D foundation model VGGT. It utilizes a self-supervised dynamic mask to decouple motion information based on the specific task—masking it during pose estimation and amplifying it during geometry reconstruction. Fine-tuning only the middle 10 layers enables VGGT to outperform the original version in pose, depth, and point cloud reconstruction for dynamic scenes.
- PAINET: A Principled Efficient Transformer for 3D Dynamics Modeling
-
PAINET formulates unobserved long-range all-to-all interactions in 3D multi-body systems as an energy minimization problem. From this, it derives an equivariant Transformer encoder with particle-type adaptive mapping, followed by a parallel EGNN to decode future trajectories. It achieves lower prediction errors at nearly identical computational costs across human motion, small/large molecules, and protein dynamics.
- Parameterization-Based Dataset Distillation of 3D Point Clouds through Learnable Shape Morphing
-
This paper introduces the concept of "Distilled Dataset Parameterization" (DDP) to 3D point cloud distillation for the first time. By representing the synthetic set as a convex combination of low-resolution anchors with learnable weights via shape morphing, the method generates a larger and more diverse set of synthetic samples within the same storage budget. Combined with a uniformity-aware matching loss, it significantly outperforms existing distillation methods across five standard 3D benchmarks.
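The "convex combination of anchors with learnable weights" can be sketched as follows. The block assumes the anchors are point clouds with point-to-point correspondence and that convexity is enforced with a softmax over free logits. Both are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def _softmax(logits):
    """Numerically stable softmax; outputs are non-negative and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def morph_from_anchors(anchors, logits):
    """Build one synthetic point cloud as a convex combination of anchors.

    anchors: list of point clouds, each a list of (x, y, z) tuples with the
             same length and ordering (correspondence is assumed).
    logits:  one free parameter per anchor; in training these would be learned.
    """
    weights = _softmax(logits)
    n_points = len(anchors[0])
    return [
        tuple(
            sum(w * cloud[p][c] for w, cloud in zip(weights, anchors))
            for c in range(3)
        )
        for p in range(n_points)
    ]


anchor_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
anchor_b = [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)]
midway = morph_from_anchors([anchor_a, anchor_b], logits=[0.0, 0.0])
```

Storing a few anchors plus one small weight vector per sample is cheaper than storing every sample as a full point cloud, which is how a fixed storage budget can yield more synthetic shapes.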
- Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
-
Part-X-MLLM is a native 3D, part-aware multimodal large model that unifies heterogeneous 3D tasks—such as generation, editing, and QA—into "writing programs with a part-based grammar." Given an RGB point cloud and natural language, it autoregressively outputs a token sequence encoding part bounding boxes, semantic descriptions, and editing instructions. This sequence is then executed by off-the-shelf geometry engines, driving diverse 3D asset operations via a language-native frontend and achieving SOTA on 11 task types.
- PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data
-
The study proposes PartSAM, the first promptable part segmentation model trained on large-scale native 3D data. It employs a dual-branch triplane encoder (combining a frozen SAM prior with a learnable 3D branch) and a SAM-style decoder. Through a model-in-the-loop annotation pipeline, the authors construct over 5 million shape-part pairs; the resulting model outperforms Point-SAM by over 90% in single-click IoU under open-world settings.
- PAT3D: Physics-Augmented Text-to-3D Scene Generation
-
PAT3D integrates vision-language model (VLM) reasoning and differentiable rigid-body contact simulation into the text-to-3D scene generation pipeline. By extracting support dependencies from a reference image to build a scene tree and generating an interpenetration-free initial layout, the method utilizes "simulation-in-the-loop" differentiable optimization. This allows the scene to converge under gravity to a static equilibrium that is stable, non-interpenetrating, and semantically aligned, making it the first "simulation-ready" scene generation method suitable for editing and robotic manipulation.
- PatchRefiner V2: Fast and Lightweight Real-Domain High-Resolution Metric Depth Estimation
-
PatchRefiner V2 replaces the "large and slow" refinement branch in the tile-based high-resolution metric depth framework with a lightweight encoder. It recovers the resulting accuracy loss through a Coarse-to-Fine denoising module + Noisy Pre-training, and enhances boundary quality using a local window gradient matching loss during the synthetic-to-real transfer stage—achieving higher accuracy than the previous SOTA on UnrealStereo4K with 9.2x fewer parameters and 10.7x faster inference.
- Path Matters: Unveiling Geometric Implicit Bias via Curvature-Aware Sparse View Optimization
-
This paper reveals two types of geometric implicit biases in 3DGS under sparse views—stronger supervision requirements for high-curvature regions and sensitivity to the smoothness of input view trajectories. Accordingly, it proposes a "Curvature-aware Camera Trajectory Optimization + Synthetic View Generation" framework. This approach ensures that pseudo-label views cover more surface details while maintaining smoothness, pushing rendering quality and geometric accuracy to SOTA on datasets such as DTU, Mip-NeRF 360, and Tanks & Temples.
- PD²GS: Part-Level Decoupling and Continuous Deformation of Articulated Objects via Gaussian Splatting
-
The PD²GS framework is proposed to achieve part-level decoupling, reconstruction, and continuous control of articulated objects by learning a shared canonical Gaussian field and modeling each interaction state as its continuous deformation. It employs coarse-to-fine motion trajectory clustering and SAM-guided boundary refinement without manual supervision.
- Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction
-
This paper proposes PUN (Peering into the UnkNowN), which utilizes a lightweight feed-forward network, UPNet, to directly predict the uncertainty distribution (neural uncertainty map) across all candidate viewpoints on a sphere from a single image. This replaces the traditional active view selection (AVS) pipeline that requires iterative retraining of NeRF/3DGS. PUN achieves comparable reconstruction quality using only half the viewpoints of the upper bound, while realizing a 400x speedup in the selection phase and over 50% savings in computational resources.
- \(\pi^3\): Permutation-Equivariant Visual Geometry Learning
-
\(\pi^3\) proposes a fully permutation-equivariant feed-forward network that completely discards the "fixed reference view" inductive bias inherited from traditional SfM. Instead, it predicts "affine-invariant camera poses + scale-invariant local pointmaps" in each frame's own coordinate system. This approach is naturally robust to input order and sets new SOTA records across tasks like camera pose estimation, monocular/video depth, and dense pointmaps, while achieving 57.4 FPS.
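Permutation equivariance means that reordering the input frames reorders the outputs in the same way and changes nothing else. The toy functions below are stand-ins invented for illustration, not the \(\pi^3\) architecture. They contrast a symmetric design with one that privileges the first frame as a reference.

```python
def toy_equivariant_net(frames):
    """Each output is its own frame plus the mean of all frames.

    The mean is symmetric in the inputs, so shuffling frames only
    shuffles the outputs.
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] + mean[d] for d in range(dim)] for f in frames]


def toy_reference_net(frames):
    """Each output is expressed relative to frame 0 (a fixed reference view)."""
    ref = frames[0]
    return [[f[d] - ref[d] for d in range(len(ref))] for f in frames]


def permute(items, perm):
    """Reorder items so that position i holds items[perm[i]]."""
    return [items[i] for i in perm]


frames = [[1.0, 0.0], [2.0, 4.0], [3.0, 8.0]]
perm = [2, 0, 1]

# Equivariant: net(permute(x)) equals permute(net(x)).
equivariant_ok = (
    toy_equivariant_net(permute(frames, perm))
    == permute(toy_equivariant_net(frames), perm)
)
# Reference-based: the result depends on which frame happens to come first.
reference_ok = (
    toy_reference_net(permute(frames, perm))
    == permute(toy_reference_net(frames), perm)
)
```

In this toy setting `equivariant_ok` holds and `reference_ok` does not, which mirrors why dropping the fixed reference view makes predictions insensitive to input order.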
- Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction
-
Pixel3DMM utilizes DINOv2-driven pixel-level normal and UV coordinate priors to constrain FLAME optimization, significantly improving 3D face reconstruction accuracy—especially in exaggerated expression scenarios—and proposes a new benchmark for simultaneously evaluating posed and neutral geometry.
- Plan then Act: Bi-level CAD Command Sequence Generation
-
To address the poor quality of CAD command sequences directly generated by LLMs, this paper proposes PTA: a fine-tuned Planner (Qwen3-8B) first parses user text into a "chained high-level operation plan," which is then implemented by an Actioner equipped with a Requirement-Aware Mechanism (RAM) into executable low-level CAD command sequences. PTA reduces the invalid rate to 0.85% and achieves leading performance across various geometric metrics on the Text2CAD dataset.
- Point-Focused Attention Meets Context-Scan State Space: Robust Biological Visual Perception for Point Cloud Representation
-
PointLearner utilizes a biomimetic "focus-then-scan" design—Point-Focused Attention (simulating foveal vision) and Context-Scan State Space (simulating saccadic reasoning)—to model local fine-grained structures and global long-range dependencies under linear complexity. It achieves SOTA performance on ModelNet40/ScanObjectNN/ShapeNet/S3DIS and demonstrates strong robustness to noise and sparse sampling.
- Point-MoE: Large-Scale Multi-Dataset Training with Mixture-of-Experts for 3D Semantic Segmentation
-
This work integrates a sparsely activated Mixture-of-Experts (MoE) module into the attention output projection layer of Point Transformer V3 (PTv3). This allows a unified model to jointly train on heterogeneous indoor and outdoor point cloud datasets without relying on "dataset labels." By allowing routers to spontaneously select experts for tokens, the model achieves a semantic segmentation mIoU across 7 datasets (including zero-shot) that surpasses PPT (which requires dataset labels), while reducing inference FLOPs by 30.9%.
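The routing idea in this summary, where a router picks experts per token without a dataset label, can be sketched in a few lines. This is a minimal illustration under assumptions: top-k softmax gating is a common MoE choice and is not confirmed as Point-MoE's exact scheme, the router logits are passed in rather than learned, and `top_k_route` and `moe_forward` are hypothetical names.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts; weights sum to 1."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]  # softmax over the selected experts only
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(token, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs for a single token (hypothetical helper)."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](token)  # only the k chosen experts are evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out
```

Because only `k` experts run per token, compute scales with `k` rather than with the total number of experts, which is the usual motivation for sparse activation.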
- Point-UQ: An Uncertainty Quantification Paradigm for Point Cloud Few-Shot Class-Incremental Learning
-
Point-UQ shifts the focus of 3D Few-Shot Class-Incremental Learning (FSCIL) from "repeatedly fine-tuning features" to "dynamically optimizing decisions." It uses predictive entropy to measure cognitive uncertainty for each sample to adaptively arbitrate between semantic classifiers and geometric prototypes, thereby preserving old class knowledge while correctly identifying new class samples without retraining.
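A rough sketch of entropy-based arbitration between two decision heads follows. The threshold `tau`, the normalization by `log(num_classes)`, and the hard switch between heads are assumptions made for illustration; the summary does not specify Point-UQ's actual arbitration rule.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def arbitrate(classifier_logits, prototype_logits, tau=0.5):
    """Trust the semantic classifier when its entropy is low, else the geometric prototypes.

    tau is a hypothetical threshold on entropy normalized to [0, 1];
    at least two classes are required so that log(num_classes) is non-zero.
    """
    probs = softmax(classifier_logits)
    normalized = predictive_entropy(probs) / math.log(len(probs))
    source = "classifier" if normalized <= tau else "prototype"
    scores = classifier_logits if source == "classifier" else prototype_logits
    return scores.index(max(scores)), source
```

A confident classifier output such as `[10.0, 0.0, 0.0]` is kept as is, while a near-uniform one such as `[0.1, 0.0, 0.0]` defers to the prototype scores.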
- PointRePar: SpatioTemporal Point Relation Parsing for Robust Category-Unified 3D Tracking
-
PointRePar is a "category-unified" 3D single-object tracker. It employs a U-shaped spatial relation parsing backbone built with Mamba and Dynamic Feature Aggregation to learn more discriminative shape features, combined with a dual-layer point-level/box-level temporal parsing mechanism to capture motion. Coupled with sparse-adaptive Gaussian perturbation training, a single model trained jointly across all categories outperforms the previous category-unified method CUTrack and competes with category-specific SOTA models.
- Positional Encoding Field
-
This paper discovers that image tokens in DiT are highly independent, with spatial coherence almost entirely determined by positional encodings (PE). Based on this, it extends 2D PE into a 3D "Positional Encoding Field" (PE-Field) with depth and hierarchy. By simply modifying the PE, the Diffusion Transformer can rearrange image content in 3D space, achieving SOTA results in single-image novel view synthesis (NVS) and naturally generalizing to controllable spatial editing.
- Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction
-
PG-Occ represents driving scenes using a set of sparse 3D Gaussians with text-aligned features. It employs "Progressive Online Densification" to supplement Gaussians in under-reconstructed areas during inference, paired with "Anisotropy-aware Sampling" to adaptively extract features according to Gaussian shapes. This achieves a 14.3% mIoU improvement over the previous SOTA on the Occ3D-nuScenes open vocabulary occupancy prediction task.
- pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning
-
pySpatial is a visual programming framework that enables MLLMs to automatically invoke 3D spatial tools (3D reconstruction, camera pose recovery, novel view synthesis, etc.) by generating Python code. It transforms limited 2D image inputs into interactively explorable 3D scenes, achieving zero-shot, plug-and-play explicit 3D spatial reasoning. It outperforms GPT-4.1-mini by 12.94% and VLM-3R by 16.5% with an overall accuracy of 58.56% on the MindCube benchmark, and successfully drives a real quadruped robot for indoor navigation.
- QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models
-
This paper proposes QuadGPT, the first end-to-end autoregressive framework for generating native quad meshes. By utilizing a unified mixed-topology tokenization (padding triangles into 4-vertex blocks), an Hourglass Transformer architecture, and truncated DPO (tDPO) fine-tuning based on topological rewards, it surpasses existing triangle-to-quad conversion pipelines and cross-field-guided methods in Chamfer Distance, Hausdorff Distance, quad ratio, and user preference.
- Quantized Visual Geometry Grounded Transformer
-
To address the deployment needs of the billion-scale 3D reconstruction model VGGT, this paper proposes QuantVGGT, the first dedicated PTQ framework. It resolves the heavy-tail distribution caused by special tokens through dual-smoothed fine-grained quantization (Hadamard rotation + channel smoothing) and addresses calibration instability via noise-filtered diverse sampling. 4-bit quantization achieves 3.7× memory compression and 2.5× speedup while maintaining 98%+ accuracy.
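To see why a Hadamard rotation helps with heavy-tailed activations, consider the toy sketch below. It only illustrates the rotation step on a single vector with per-vector symmetric quantization; the channel-smoothing step and QuantVGGT's actual granularity are not reproduced, and all function names are hypothetical.

```python
import math

def hadamard(n):
    """Orthonormal Hadamard matrix of size n (n must be a power of two), Sylvester construction."""
    H = [[1.0]]
    while len(H) < n:
        # H_2m = [[H, H], [H, -H]]
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    s = 1.0 / math.sqrt(n)
    return [[s * x for x in row] for row in H]

def matvec(M, x):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def quantize(x, bits=4):
    """Symmetric uniform quantization with one scale per vector; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax or 1.0  # fall back to 1.0 for an all-zero vector
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in x]

# A single outlier channel is spread evenly across all channels by the rotation.
x = [8.0, 0.0, 0.0, 0.0]
H = hadamard(4)
rotated = matvec(H, x)          # every entry has magnitude 4 instead of one entry at 8
restored = matvec(H, rotated)   # the Sylvester matrix is symmetric and orthonormal, so H @ H = I
```

Since the quantization scale is set by the largest magnitude, lowering the peak leaves more of the 4-bit range for the remaining values.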
- Quartet of Diffusions: Structure-Aware Point Cloud Generation through Part and Symmetry Guidance
-
This paper decomposes point cloud generation into four diffusion processes: shape latent variables, symmetry groups, semantic parts, and part assembly. By using explicit part and symmetry priors, it generates more consistent and controllable 3D point clouds that closely follow the ground truth distribution on ShapeNetPart.
- RadioGS: Radiometrically Consistent Gaussian Surfels for Inverse Rendering
-
RadioGS proposes a radiometric consistency loss that minimizes the residual between the learned radiance of each Gaussian surfel and its physically rendered radiance. This provides a physics-based supervision signal for unobserved directions, constructing a self-correcting feedback loop that achieves accurate indirect illumination and material decomposition and supports efficient relighting in minutes.
- RayI2P: Learning Rays for Image-to-Point Cloud Registration
-
This paper reformulates image-to-point cloud registration from "establishing 2D-3D correspondences" to "predicting a bundle of 3D rays for each image patch." A differentiable ray-guided regression module is then used to directly estimate the camera's 6-DoF pose, fundamentally bypassing projection ambiguity and scale inconsistency, setting new state-of-the-art accuracy on KITTI and nuScenes.
- ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation
-
ReconViaGen integrates strong reconstruction priors (VGGT) as multi-view perceptual conditions into a diffusion-based 3D generator (TRELLIS). During inference, it employs rendering-aligned velocity compensation to constrain the denoising trajectory. This approach maintains the capability to "complete unobserved parts" while ensuring global structure and local details are highly consistent with input views, achieving SOTA results on Dora-bench and OmniObject3D.
- Reducing Class-Wise Performance Disparity via Margin Regularization
-
The paper proposes MR2 (Margin Regularization for performance disparity Reduction), which dynamically adjusts class-dependent margins in logit and representation spaces. Based on a theoretically derived generalization bound, it reduces inter-class performance disparity while simultaneously improving overall accuracy.
- ReLi3D: Relightable Multi-View 3D Reconstruction with Disentangled Illumination
-
ReLi3D is the first end-to-end feed-forward system capable of simultaneously reconstructing complete geometry, spatially-varying PBR materials, and consistent HDR environment lighting from sparse multi-view images in less than 1 second. The core idea is to utilize "multi-view constraints" as the primary driver for material-illumination disentanglement, transforming the inherently ill-posed single-image inverse rendering problem into a well-constrained one.
- ReSplat: Degradation-agnostic Feed-forward Gaussian Splatting via Self-guided Residual Diffusion
-
ReSplat couples a universal diffusion-based image restoration model and a feed-forward 3D Gaussian Splatting (3DGS) model into a self-guided closed loop. 3D Gaussian centers generated midway through diffusion sampling serve as "self-guidance" to achieve multi-view consistent restoration. The restored images are then fed back into the GS model for scene reconstruction, enabling clearer and more robust novel view synthesis under various degradations such as blur, low light, fog, rain, and snow.
- RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo
-
This paper proposes RobustSpring, the first benchmark for image corruption robustness in optical flow, scene flow, and stereo matching (dense matching). It injects 20 corruptions into the high-resolution Spring dataset in a manner that is consistent across time, stereo views, and depth. Equipped with a Lipschitz-based robustness metric decoupled from accuracy, it evaluates 17 models, revealing hidden weaknesses where "high accuracy \(\neq\) high robustness."
- RoRE: Rotary Ray Embedding for Generalised Multi-Modal Scene Understanding
-
RoRE directly encodes image patches as "rays" and injects them into a Transformer via learnable Rotary Positional Embedding (RoPE). Combined with asymmetric rotation and modality-shared ray embeddings, this allows a single network to handle arbitrary camera geometries and modalities—such as perspective, fisheye, and RGB-Thermal—without retraining, significantly improving generalization and consistency across geometries and modalities.
- Sat3DGen: Comprehensive Street-level 3D Scene Generation from Single Satellite Image
-
Given a single top-down satellite image, Sat3DGen injects three types of geometric constraints (gravity density prior, satellite-view depth prior, spatial boundary tokens) and panorama-to-perspective view augmentation into a feed-forward tri-plane NeRF framework. This approach reduces street-level 3D geometric RMSE from 6.76m to 5.20m and improves rendering FID from ~40 to 19.
- Scaling Sequence-to-Sequence Generative Neural Rendering
-
Kaleido is proposed as a series of decoder-only rectified flow transformer generative models that treat 3D as a special sub-domain of video. Through Unified Positional Encoding, a masked autoregressive framework, and video pre-training strategies, it achieves "any-to-any" 6-DoF novel view synthesis without any explicit 3D representation. It matches the rendering quality of per-scene optimization methods (InstantNGP) in multi-view settings for the first time and increases the resolution from 512/576px to 1024px.
- Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation
-
Scenethesis is a training-free agentic framework that utilizes LLMs to draft coarse layouts, vision foundation models for visual grounding and scene graph extraction, and a physics-aware optimizer (semantic correspondence + SDF contact/support constraints) for object-wise pose correction. A GPT-5 judge verifies spatial consistency and triggers re-planning, enabling the generation of collision-free, stable, and interactive 3D scenes for both indoor and outdoor environments.
- SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation
-
SceneTransporter reformulates open-world structured 3D scene generation as a global assignment-association problem by introducing an entropic Optimal Transport (OT) framework into the denoising loop of compositional 3D latent diffusion models: OT plan-gated cross-attention achieves exclusive patch-to-part routing (preventing feature entanglement), while edge-regularized assignment costs encourage the separation of different instances at image boundaries, achieving SOTA instance-level consistency and geometric fidelity across 74 diverse open-world scene images.
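Entropic optimal transport of the kind mentioned here is typically solved with Sinkhorn iterations. The sketch below is a generic textbook version, not SceneTransporter's implementation: uniform marginals, the regularization strength `eps`, and the iteration count are assumptions, and rows and columns merely stand in for image patches and parts.

```python
import math

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic OT between uniform marginals; returns the transport plan as nested lists."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # row marginal (e.g. image patches)
    b = [1.0 / m] * m  # column marginal (e.g. parts)
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two patches, two parts: each patch is cheap to assign to "its own" part.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
```

With a small `eps` the plan is close to a hard one-to-one assignment, which is the behavior an exclusive patch-to-part gate would rely on. Costs should be kept in a moderate range, since `exp(-c / eps)` underflows for large `c / eps`.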
- ShapeGen4D: Towards High Quality 4D Shape Generation from Videos
-
ShapeGen4D adapts a large-scale pre-trained 3D shape diffusion model into a feed-forward "video-to-4D mesh sequence" generator. By employing temporally aligned latent codes, spatio-temporal attention, and cross-frame shared noise, it generates geometrically consistent dynamic mesh sequences end to end, handles topological changes and volumetric expansion/contraction, and outperforms baselines like L4GM, V2M4, and GVFD in geometric accuracy.
- Sharp Monocular View Synthesis in Less Than a Second
-
SHARP generates approximately 1.2 million 3D Gaussians from a single image via a single feedforward neural network. It completes inference in less than 1 second on an A100 GPU and supports rendering speeds exceeding 100 FPS. It achieves zero-shot SOTA performance across 6 datasets, reducing LPIPS by 25–34% compared to the strongest prior methods while shortening synthesis time by three orders of magnitude.
- Signal Structure-Aware Gaussian Splatting for Large-Scale Scene Reconstruction
-
This paper reformulates large-scale 3DGS scene reconstruction as a "signal structure recovery" problem. It derives the average sampling frequency and scene bandwidth for the 3D Gaussian representation and proposes SIG, an adaptive scheduler that switches image resolution and densification timing based on scene frequency convergence. Combined with spherically-constrained Gaussians to suppress floaters, it achieves a +0.9 dB PSNR improvement on multiple large-scale benchmarks while accelerating single-GPU training by approximately 1.5×.
- SkyEvents: A Large-Scale Event-Enhanced UAV Dataset for Robust 3D Scene Reconstruction
-
This paper introduces SkyEvents, the first "Event + RGB + LiDAR" multimodal dataset for large-scale UAV 3D scene reconstruction (45 sequences, >8 hours, 0.72 km² point cloud). It proposes a Geometric Timestamp Alignment (GTA) module and a Region-level Event Rendering (RER) loss, demonstrating that incorporating the event modality significantly enhances the texture and geometric fidelity of 3DGS reconstruction under extreme conditions such as low light and motion blur.
- SMAGA: Secondary Motion-Aware 3D Clothed Gaussian Avatars from Monocular Videos
-
3DGS human avatars reconstructed from monocular videos struggle to represent the flowing secondary motion of loose clothing (e.g., skirts). To address this, the paper proposes a two-stage framework: template-free personalized Gaussian initialization first aligns the Gaussians with clothed silhouettes, and a GNN deformer then structures the Gaussians into a graph and autoregressively predicts second-order (mass-spring-damper) dynamics. The result is realistic, temporally coherent clothing dynamics under single-view constraints.
- SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling
-
SpaceControl proposes a training-free test-time method that voxelizes user-provided 3D geometry (from coarse superquadrics to fine meshes) and encodes it into the latent space of a pre-trained 3D generative model (Trellis). By utilizing an SDEdit-style "add noise to \(t_0\) then denoise" mechanism to inject spatial guidance and a single parameter \(\tau_0\) to smoothly adjust "geometric fidelity ↔ generative realism," it significantly outperforms training-based and optimization-based baselines in geometric alignment (Chamfer distance) without any parameter fine-tuning.
- SpatialHand: Generative Object Manipulation from 3D Perspective
-
SpatialHand elevates generative object insertion from the 2D image plane to a "3D perspective." By decoupling 6DoF poses into three conditional streams—2D position (mask), depth (depth map), and 3D orientation (latent embedding)—and feeding them into a FLUX diffusion Transformer, paired with an automated synthetic data pipeline and progressive multi-stage training, it achieves precise 3D localization, arbitrary rotation, and correct occlusion control for inserted objects.
- Special Unitary Parameterized Estimators of Rotation
-
This paper re-derives the classical Wahba rotation estimation problem using the special unitary group \(SU(2)\), yielding linear quaternion constraints, a closed-form two-point solution, and two network-oriented continuous rotation representations. Among these, 2-vec generally outperforms Gram-Schmidt within the same dimensionality, and QuadMobius achieves state-of-the-art or competitive results across multiple rotation learning tasks.
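For context, the Gram-Schmidt baseline that the summary compares against maps two 3-vectors to a rotation matrix by orthonormalization. The sketch below shows only this baseline, assuming the standard column-wise construction; it does not implement the paper's \(SU(2)\)-based 2-vec or QuadMobius representations, and the function names are hypothetical.

```python
import math

def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def gram_schmidt_rotation(a1, a2):
    """Map two non-parallel 3-vectors to a rotation matrix whose columns are b1, b2, b3."""
    b1 = _normalize(a1)
    d = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - d * x for x, y in zip(b1, a2)])  # remove the b1 component, then normalize
    b3 = _cross(b1, b2)                                   # right-handed third axis, so det = +1
    return [[b1[i], b2[i], b3[i]] for i in range(3)]      # R[i][j] is component i of column j
```

Any six network outputs with non-parallel halves yield a valid rotation this way, which is why the mapping is continuous and popular as a regression target.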
- SpikeStereoNet: A Brain-Inspired Stereo Depth Estimation Framework for Spike Streams
-
This paper proposes SpikeStereoNet, which estimates stereo depth directly from a pair of raw spike streams (binary high-frequency streams from spike cameras). It employs a three-layer Recurrent Spiking Neural Network (RSNN) as an iterative refinement operator to repeatedly update disparity. Accompanied by large-scale synthetic and real spike stereo datasets, the method outperforms existing frame-based and event-based stereo matching methods on both datasets while maintaining high accuracy with only 10% of the training data.
- Spiking Discrepancy Transformer for Point Cloud Analysis
-
To address the issues in Spiking Neural Networks (SNNs) for point cloud analysis, where "dot-product attention tends to smooth edges and struggles to model local and global features simultaneously," this paper proposes using the discrepancy between spike sequences instead of dot-product similarity as the attention mechanism. Combined with a spatial-aware spiking neuron that injects coordinates into the initial membrane potential, the hierarchical Spiking Discrepancy Transformer achieves SOTA within the SNN domain, with energy consumption at only a few percent of ANN SOTA.
- Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction for 3D-Aware Distillation
-
In a student-teacher distillation framework, the teacher is augmented with a pre-trained feed-forward 3D reconstruction model (MVSplat). By lifting 2D features to a 3D Gaussian representation and rendering them to novel views, the student learns geometrically consistent 3D-aware 2D features. This approach comprehensively outperforms existing methods across downstream tasks including depth estimation, normal estimation, semantic segmentation, and multi-view correspondence.
- Splat Feature Solver
-
The problem of feature lifting in 3D splat representations is unified and modeled as a sparse linear inverse problem \(AX=B\). A closed-form solver is proposed with a provable \((1+\beta)\)-approximation error bound under convex loss. Combined with Tikhonov Guidance and Post-Lifting Aggregation filtering, the method achieves SOTA performance in open-vocabulary 3D segmentation.
- Splat the Net: Radiance Fields with Splattable Neural Primitives
-
This paper proposes "splattable neural primitives," where the density field of each primitive is represented by a shallow neural network (SIREN) bounded spatially by an ellipsoid. By deriving a closed-form solution for the density integral along view rays, the method maintains the high expressivity of neural representations while achieving the efficient splatting of 3DGS. It achieves quality and speed comparable to 3DGS using 10× fewer primitives and 6× fewer parameters for novel view synthesis.
- SSD-GS: Scattering and Shadow Decomposition for Relightable 3D Gaussian Splatting
-
SSD-GS replaces the "spherical harmonic coefficients" in 3D Gaussian Splatting with a physically interpretable four-term shading decomposition: "diffuse + specular + shadow + subsurface scattering." Combined with learnable dipole scattering, occlusion-aware two-stage soft shadows, and progressive training, it significantly outperforms existing methods in relighting fidelity for complex materials like metals and translucent objects.
- Station2Radar: Query-Conditioned Gaussian Splatting for Precipitation Field
-
The authors propose Query-Conditioned Gaussian Splatting (QCGS), the first work to introduce 2D Gaussian Splatting into precipitation field generation. By fusing satellite imagery with sparse Automatic Weather Station (AWS) observations, QCGS achieves resolution-flexible precipitation field reconstruction under radar-free conditions, improving RMSE by over 50% compared to traditional gridded products.
- STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
-
STREAM3R reformulates dense 3D reconstruction as a "frame-by-frame causal attention problem in a decoder-only Transformer." Whenever a new image arrives, it performs causal cross-attention with the cached historical frame features to regress pointmaps. This enables incremental online reconstruction using KVCache and sliding window attention similar to LLMs, achieving performance superior to or comparable with existing streaming methods in depth estimation and 3D reconstruction for both static and dynamic scenes, while offering faster inference.
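The caching pattern described above can be sketched with a toy single-head attention. This is a simplified illustration under assumptions: each frame is reduced to one query/key/value vector, the `window` parameter is a hypothetical stand-in for sliding window attention, and nothing here reflects STREAM3R's actual architecture or pointmap heads.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention of one query over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

class StreamingAttention:
    """Caches keys/values of past frames so each new frame only attends backwards in time."""

    def __init__(self, window=None):
        self.keys, self.values = [], []
        self.window = window  # hypothetical sliding-window size; None keeps the full history

    def step(self, query, key, value):
        self.keys.append(key)
        self.values.append(value)
        if self.window is not None:
            # Drop the oldest entries so memory stays bounded.
            self.keys = self.keys[-self.window:]
            self.values = self.values[-self.window:]
        return attend(query, self.keys, self.values)
```

Each call to `step` costs time proportional to the cache length, and past frames are never recomputed, which is what makes incremental online processing possible.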
- Streaming Visual Geometry Transformer
-
This paper proposes StreamVGGT, which transforms the offline global-attention-based VGGT into a causal Transformer utilizing "temporal causal attention + cached memory tokens." This enables 3D geometric reconstruction to be updated incrementally frame-by-frame (reducing latency from \(O(N^2)\) to \(O(N)\)). By distilling from the original VGGT as a teacher for low-cost training, StreamVGGT approaches the performance of the offline VGGT and outperforms existing streaming methods across multiple 3D reconstruction, depth, and pose benchmarks.
- StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams
-
StreamSplat proposes a fully feed-forward online dynamic 3D reconstruction framework. Through three innovations—probabilistic position sampling, bidirectional deformation fields, and adaptive Gaussian fusion—it can instantly generate dynamic 3DGS representations from uncalibrated video streams, achieving a speed 1200x faster than optimization-based methods.
- Stroke3D: Lifting 2D Strokes into Rigged 3D Model via Latent Diffusion Models
-
Stroke3D is the first method to directly generate rigged 3D mesh models from user-drawn 2D strokes and text prompts. It employs a skeleton-first two-stage pipeline: it first generates controllable 3D skeletons using Graph VAE + Graph DiT, then generates high-quality meshes, enhanced by the TextuRig dataset and SKA-DPO optimization.
- Stylos: Multi-View 3D Stylization with Single-Forward Gaussian Splatting
-
Stylos proposes a single-forward 3D style transfer framework. Through a dual-path design (geometry self-attention + style cross-attention) sharing a Transformer backbone and a voxel-level 3D style loss, it achieves zero-shot 3D stylization from uncalibrated inputs, supporting scaling from single-view to hundreds of views.
- SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors
-
SurfSplat proposes a feedforward 3D reconstruction framework based on 2DGS. It binds Gaussian rotation and scale to neighborhood positions via surface continuity priors, addresses color bias through a forced alpha blending strategy, and introduces the High-Resolution Rendering Consistency (HRRC) metric to reveal reconstruction quality differences at high resolutions.
- Test-Time Optimization of 3D Point Cloud LLM via Manifold-Aware In-Context Guidance and Refinement
-
This paper proposes Point-Graph LLM (PGLLM), which organizes unlabeled support sets into a KNN graph at test time without retraining. It injects 3D captions of neighboring samples as in-context guidance into a second-stage LLM and performs score refinement via label propagation to correct noisy predictions. This approach improves the accuracy and robustness of 3D recognition, OOD detection, and captioning with almost zero additional computational overhead.
- Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
-
The VIST3A framework is proposed, which seamlessly interfaces the latent space of a pretrained video generator with feed-forward 3D reconstruction models (such as AnySplat/MVDUSt3R/VGGT) via model stitching. Subsequently, direct reward finetuning is employed to align the generative model with the stitched 3D decoder, achieving high-quality end-to-end text-to-3DGS and text-to-pointmap generation. It consistently outperforms existing methods on T3Bench, SceneBench, and DPG-Bench.
- The Less You Depend, the More You Learn: Synthesizing Novel Views from Sparse, Unposed Images with Minimal 3D Knowledge
-
This paper systematically demonstrates the scalability law that "the less one depends on explicit 3D knowledge, the more one can learn from large-scale data." Based on this, the authors propose UP-LVSM—a pure Transformer feed-forward NVS framework that requires no explicit scene structure or camera pose annotations. By utilizing a self-supervised "Latent Plücker Learner," it synthesizes high-fidelity novel views directly from unposed 2D images, outperforming methods trained with ground-truth poses.
- TIGaussian: Disentangle Gaussians for Spatial-Aware Text-Image-3D Alignment
-
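TIGaussian sets a new SOTA for text-image-3DGS tri-modal alignment. It uses a multi-branch encoder to disentangle the intrinsic attributes of 3D Gaussian Splatting (3DGS), leverages diffusion priors to complete single-view images into multi-view fused features, and uses a Query Transformer to project 3D features into the text space.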
- TINKER: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization
-
TINKER transforms large-scale 2D image editing models and video diffusion models into a 3D-oriented multi-view consistent editing pipeline. It generates dense consistent views from one or a few edited reference images and completes high-quality 3DGS editing without requiring per-scene optimization of the editing model.
- Topology-Preserved Auto-regressive Mesh Generation in the Manner of Weaving Silk
-
A mesh tokenization algorithm inspired by "silk weaving" is proposed. It provides a canonical topological framework through vertex layering and ordering, ensuring manifoldness, watertightness, normal consistency, and part-awareness of generated meshes while achieving SOTA compression efficiency.
- Towards Physically Executable 3D Gaussian for Embodied Navigation
-
This paper proposes the SAGE-3D paradigm, upgrading 3DGS from a "rendering-only" scene representation to an environment for training and evaluating embodied agents by adding object-level semantics and physical collision structures. It releases the InteriorGS dataset with 1k annotated scenes and SAGE-Bench, the first 3DGS-based VLN benchmark (2M trajectory-instruction pairs).
- Trace Anything: Representing Any Video in 4D via Trajectory Fields
-
Trace Anything represents every pixel in a video as a continuous 3D trajectory and directly predicts the trajectory field of the entire video through a single feed-forward inference. This achieves efficient 4D dynamic scene representation without requiring depth, optical flow, 2D trackers, or per-scene optimization.
- True Self-Supervised Novel View Synthesis is Transferable
-
This paper proposes "transferability" as the core criterion for determining whether a model truly performs Novel View Synthesis (NVS). Based on this, it introduces XFactor—the first model capable of learning cross-scene transferable camera pose representations through pure self-supervision without relying on multi-view geometry. By utilizing two simple designs—a "stereo-monocular model" and a "pose-preserving transferable objective"—it significantly outperforms RayZer and RUST.
- TTT3R: 3D Reconstruction as Test-Time Training
-
The state update of the recurrent 3D reconstruction model CUT3R is reformulated as a test-time online learning problem. By deriving a closed-form, per-token adaptive learning rate based on the alignment confidence between memory states and new observations to gate state updates, this method significantly mitigates long-sequence forgetting without retraining or adding parameters. It improves global pose accuracy by 2× over the baseline while remaining efficient (20 FPS, 6 GB VRAM) for sequences of thousands of images.
- UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images
-
This paper proposes UFO-4D, a unified feedforward framework that directly predicts dynamic 3D Gaussian representations from only two unposed images. It jointly and consistently estimates 3D geometry, 3D motion, and camera pose, with performance improvements of up to 3× over existing methods on geometry and motion benchmarks.
- ULTRA-360: Unconstrained Dataset for Large-scale Temporal 3D Reconstruction across Altitudes and Omnidirectional Views
-
ULTRA-360 constructs a large-scale real-world image dataset covering campus-level buildings, four-season appearances, ground-level and aerial multi-altitude views, and perspective and 360-degree cameras. Using a semi-automatic calibration pipeline and multi-category reconstruction benchmarks, it reveals key shortcomings in current large-scale temporal 3D/4D reconstruction regarding cross-altitude matching, doppelganger disambiguation, densification, and multi-appearance modeling.
- Uncertainty-Aware 3D Reconstruction for Dynamic Underwater Scenes
-
This paper proposes UDF (Uncertainty-aware Dynamic Field) to simultaneously model underwater dynamic geometry and time-varying participating media in a unified 4D field. It utilizes per-pixel uncertainty derived from "surface observation blurring + inter-frame optical flow inconsistency" to weight the rendering loss, achieving high-quality reconstruction and novel view synthesis on both controlled and in-the-wild underwater videos.
- Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction
-
This paper proposes USplat4D, an uncertainty-aware dynamic Gaussian splatting framework that estimates time-varying uncertainty for each Gaussian and constructs an uncertainty-guided spatiotemporal graph to propagate reliable motion cues. This significantly improves monocular 4D reconstruction quality in occluded regions and under extreme novel views.
- Unified 3D Scene Understanding Through Physical World Modeling
-
3WM unifies RGB image patches, optical flow patches, and camera poses into a random-access probabilistic graphical model. Using GPT-style autoregressive prediction, it completes novel view synthesis (NVS), 3D object manipulation, and self-supervised depth estimation within a single prompt interface, outperforming specialized models on multiple real-world benchmarks.
- UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding
-
UniUGG is the first "Unified Understanding and Generation" framework for 3D modalities. It utilizes a jointly pre-trained geometric-semantic ViT to encode visual representations and enables an LLM, combined with a diffusion model, to "imagine" geometrically consistent 3D scenes from a reference image and target view transforms via conditional denoising on compressed latent tokens. It maintains superior spatial VQA capabilities, outperforming the second-best method by 17.9% on VSI-Bench.
- Universal Beta Splatting
-
The authors propose Universal Beta Splatting (UBS), which generalizes 3D Gaussian Splatting into N-dimensional anisotropic Beta kernels. By providing per-dimension shape control, it unifies the modeling of spatial geometry, view-dependent appearance, and scene dynamics within a single representation, achieving interpretable scene decomposition and SOTA rendering quality.
- UnLoc: Leveraging Depth Uncertainties for Floorplan Localization
-
UnLoc explicitly models monocularly predicted "floorplan depth" as a Laplace distribution with uncertainty. By replacing scene-specific depth networks with an off-the-shelf pre-trained monocular depth model (Depth Anything v2), it achieves significant improvements over the SOTA (F3Loc) in sequential visual floorplan localization, improving recall by 42.2x on 15-frame short sequences of the real-world dataset LaMAR HGE.
- Unsupervised Representation Learning for 3D Mesh Parameterization with Semantic and Visibility Objectives
-
This paper advances unsupervised neural UV parameterization from "geometry distortion only" to "serving real-world texturing workflows." By using semantic partitioning to align UV islands with 3D components and ambient occlusion (AO) to guide seams into inconspicuous areas, the method produces 3D mesh UV atlases better suited for editing, texture generation, and asset reuse.
- UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections
-
UP2You proposes a "data corrector" paradigm that transforms a collection of unconstrained photos with varying poses, viewpoints, crops, and occlusions into clean orthogonal multi-view RGB and normal maps via a single forward pass in seconds. These are then processed by traditional reconstruction algorithms to generate high-fidelity textured human meshes. The entire pipeline takes 1.5 minutes with nearly constant memory usage, outperforming previous optimization-based methods that require hours.
- UrbanGS: Efficient and Scalable Architecture for Geometrically Accurate Large-Scale Urban Gaussian Splatting
-
UrbanGS extends 3DGS to city-level scenes using four components: "depth-consistent D-Normal dual-supervised regularization + geometry-aware confidence weighting + spatially adaptive Gaussian pruning + unified partitioning." It surpasses methods like CityGaussian-v2 and VCR-GauS in rendering quality, geometric accuracy, and memory efficiency, while still running on a single A5000 without running out of memory.
- Variation-Aware Flexible 3D Gaussian Editing
-
VF-Editor redefines 3D Gaussian editing as an "attribute-wise variational prediction" problem. By utilizing a feed-forward variational predictor distilled from multi-source 2D editing knowledge, it can natively edit an entire Gaussian field in approximately 0.3 seconds. This approach eliminates multi-view inconsistencies inherent in the "2D edit then 3D rebuild" paradigm while supporting flexible editing operations such as free mixing and intensity adjustment.
- VoMP: Predicting Volumetric Mechanical Property Fields
-
VoMP is the first feed-forward method to predict internal volumetric mechanical material fields (Young's modulus \(E\), Poisson's ratio \(\nu\), density \(\rho\)) for 3D objects. It aggregates multi-view DINOv2 features per voxel for any voxelizable and renderable 3D representation (Mesh / Gaussian Splatting / NeRF / SDF), predicts per-voxel material latent codes via a Geometry Transformer, and decodes them into real physical triplets using a MatVAE constrained on a "physics-feasible material manifold." It generates simulation-ready materials within seconds, significantly outperforming previous methods in both accuracy and speed.
- WAFT: Warping-Alone Field Transforms for Optical Flow
-
WAFT completely replaces the standard cost volume in optical flow methods with high-resolution feature warping. By utilizing a DPT/ViT iterative update module to implicitly handle large displacements, it achieves top-tier accuracy on Spring, Sintel, and KITTI while consuming only 1/3 of the VRAM and being 1.3–4.1 times faster than comparable methods.
- Weight Space Representation Learning on Diverse NeRF Architectures
-
The authors propose the first representation learning framework capable of processing weights from diverse NeRF architectures (MLP/tri-plane/hash table). By utilizing a Graph Meta-Network encoder combined with SigLIP contrastive loss to construct an architecture-agnostic latent space, the method supports classification, retrieval, and language tasks across 13 NeRF architectures and generalizes to architectures unseen during training.
- World2Minecraft: Occupancy-Driven Simulated Scenes Construction
-
This work converts real-world indoor scenes into voxel-aligned editable Minecraft environments using "3D semantic occupancy prediction" and builds a simulation platform for Vision-Language Navigation (VLN). It also uses Minecraft to automatically generate 100,000 occupancy annotations (MinecraftOcc dataset), serving as both a challenging benchmark and an augmentation source for real-world datasets.
- WorldTree: Towards 4D Dynamic Worlds from Monocular Video Using Tree-Chains
-
WorldTree utilizes a "Temporal Partition Tree" to recursively bifurcate monocular videos into coarse-to-fine sub-intervals for layer-wise optimization. It combines this with "Spatial Ancestral Chains" to link each child node with its ancestors for spatial complementarity and motion representation specialization. This approach simultaneously addresses the issues of "global temporal optimization" and "hierarchical spatial coupling" in monocular dynamic reconstruction, reducing LPIPS on NVIDIA-LS by 8.26% and mLPIPS on DyCheck by 9.09% compared to the runner-up.
- YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting
-
YoNoSplat uses a feedforward model to directly predict per-view local 3D Gaussians, camera poses, and intrinsics from an arbitrary number of unposed and uncalibrated multi-view images, which are then aggregated into a global scene. By employing a "mix-forcing" training strategy, pairwise distance normalization, and Intrinsic Condition Embedding (ICE), it resolves pose-geometry entanglement and scale ambiguity. It achieves SOTA performance in both posed and unposed settings, reconstructing a scene from 100 images in 2.69 seconds.
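The pairwise distance normalization mentioned in the YoNoSplat note can be illustrated with a minimal sketch. It assumes the scene is rescaled so that the mean distance between camera centers equals 1, which removes the arbitrary global scale of an unposed reconstruction. The choice of the mean (rather than, say, the maximum) and the function names `pairwise_distance_scale` and `normalize_scene` are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def pairwise_distance_scale(centers: np.ndarray) -> float:
    """Mean Euclidean distance over all unordered pairs of camera centers.

    centers: array of shape (N, 3), one camera center per row.
    """
    n = centers.shape[0]
    if n < 2:
        raise ValueError("need at least two camera centers")
    diffs = centers[:, None, :] - centers[None, :, :]  # (N, N, 3)
    dists = np.linalg.norm(diffs, axis=-1)             # (N, N)
    upper = np.triu_indices(n, k=1)                    # each pair counted once
    return float(dists[upper].mean())


def normalize_scene(centers: np.ndarray, points: np.ndarray, eps: float = 1e-8):
    """Rescale camera centers and 3D points by the same factor (hypothetical helper).

    After rescaling, the mean pairwise camera distance is 1 (assumed target).
    Returns the rescaled centers, rescaled points, and the scale that was divided out.
    """
    scale = max(pairwise_distance_scale(centers), eps)
    return centers / scale, points / scale, scale


if __name__ == "__main__":
    cams = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
    pts = np.array([[1.0, 1.0, 4.0]])
    cams_n, pts_n, s = normalize_scene(cams, pts)
    print("scale divided out:", s)
    print("mean pairwise distance after:", pairwise_distance_scale(cams_n))
```

Because the cameras and the 3D points are divided by the same factor, two copies of a scene that differ only in global scale map to the same normalized scene, which is the ambiguity such a normalization is meant to remove.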