
🧊 3D Vision

🎞️ ECCV2024 · 181 paper notes


🔥 Top topics: 3D Gaussian Splatting ×33 · Diffusion Models ×21 · Point Cloud ×20 · NeRF ×16 · Novel View Synthesis ×10

3D Congealing: 3D-Aware Image Alignment in the Wild: 3D Congealing aligns a set of unannotated, semantically similar internet images into a shared 3D canonical space. By combining SDS guidance from a pre-trained diffusion model to obtain the 3D shape and DINO semantic feature matching to estimate poses and coordinate mappings, it requires no templates, pose annotations, or camera parameters.
3D Reconstruction of Objects in Hands without Real World 3D Supervision: This paper proposes the HORSE framework, which trains an occupancy network to reconstruct the 3D shape of hand-held objects from a single RGB image. This is achieved by extracting multi-view 2D mask supervision from in-the-wild videos (using hand pose as an object pose proxy) and learning a 2D slice adversarial shape prior from a synthetic 3D shape collection. Without using any real-world 3D annotations, it outperforms 3D-supervised methods by 11.6% on the MOW dataset.
3D Single-Object Tracking in Point Clouds with High Temporal Variation: HVTrack is the first to explore 3D single-object tracking under high temporal variation scenarios. It addresses coordinate-wise cloud shape variations, distractor interference, and background noise via three modules: Relative-Pose-Aware Memory (RPM), Base-Expansion Feature Cross-Attention (BEA), and Contextual Point Guided Self-Attention (CPA). On the KITTI-HV dataset with a 5-frame interval, it improves Success/Precision by 11.3%/15.7% over the state-of-the-art (SOTA).
3DEgo: 3D Editing on the Go!: 3DEgo compresses the traditional three-stage 3D editing pipeline (COLMAP pose estimation $\rightarrow$ unedited scene initialization $\rightarrow$ iterative editing and update) into a single-stage framework: first performing multi-view consistent 2D editing on video frames using an autoregressive noise blending module, and then directly reconstructing the 3D scene from the edited frames using COLMAP-free 3DGS, boosting the speed by approximately 10x and supporting videos from arbitrary sources.
3iGS: Factorised Tensorial Illumination for 3D Gaussian Splatting: 3iGS replaces the independently optimized spherical harmonics (SH) coefficients of each Gaussian in 3DGS with a continuous incident illumination field based on tensor decomposition. Combined with learnable BRDF features and a lightweight neural renderer to model the outgoing radiance, it significantly improves the rendering quality of view-dependent effects such as specular reflections while maintaining real-time rendering speeds.
3×2: 3D Object Part Segmentation by 2D Semantic Correspondences: Proposes a training-free 3D object part segmentation method, 3-By-2, which utilizes 2D semantic correspondences from diffusion models (DIFT) to transfer part labels from annotated 2D datasets or a small number of 3D annotated objects to 3D, achieving state-of-the-art (SOTA) performance under both zero-shot and few-shot settings.
4Diff: 3D-Aware Diffusion Model for Third-to-First Viewpoint Translation: This paper proposes 4Diff, a transformer-based diffusion model integrating 3D geometric priors. By incorporating egocentric point cloud rasterization and 3D-aware rotary cross-attention mechanisms, it translates exocentric (third-person) images into egocentric (first-person) images, achieving state-of-the-art performance on the Ego-Exo4D dataset and demonstrating strong generalization capabilities to novel environments.
6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model: Proposes 6DGS, which inverts the 3DGS rendering workflow: it casts rays uniformly from the surfaces of the ellipsoids (Ellicell), uses an attention mechanism to bind rays to target image pixels, and then solves for the camera pose in closed form via weighted least squares. Requiring no iterations or initial poses, it improves rotation accuracy by 12% and translation accuracy by 22% on real-world scenes, achieving near-real-time performance at 15fps.
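The closed-form step mentioned in the 6DGS note above can be illustrated with a generic weighted least-squares intersection of lines: given ray origins, unit directions, and per-ray weights, find the point that minimises the weighted sum of squared distances to the lines through those rays. This is a minimal sketch under stated assumptions, not the authors' implementation; the names `solve3` and `intersect_rays` and the toy inputs are hypothetical, and the sketch covers only a camera-centre estimate, since the summary does not say how the full pose is assembled.

```python
def solve3(a, b):
    """Solve the 3x3 linear system a @ x = b with Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det(a)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; no unique intersection")
    x = []
    for k in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][k] = b[r]  # replace column k with the right-hand side
        x.append(det(m) / d)
    return x


def intersect_rays(origins, directions, weights):
    """Point minimising sum_i w_i * dist(point, line_i)^2; directions must be unit length."""
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d, w in zip(origins, directions, weights):
        for r in range(3):
            for c in range(3):
                # Projector onto the plane orthogonal to the ray: I - d d^T
                p = (1.0 if r == c else 0.0) - d[r] * d[c]
                a[r][c] += w * p
                b[r] += w * p * o[c]
    return solve3(a, b)


# Hypothetical toy input: three axis-aligned rays that all pass through (1, 2, 3).
origins = [(0.0, 2.0, 3.0), (1.0, 0.0, 3.0), (1.0, 2.0, 0.0)]
directions = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
center = intersect_rays(origins, directions, weights=[1.0, 2.0, 0.5])
```

Because the objective is quadratic in the unknown point, the solution comes from a single 3x3 linear solve, which is why no iterations or initial pose are needed for this step.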
A Compact Dynamic 3D Gaussian Representation for Real-Time Dynamic View Synthesis: This work models the position and rotation parameters in 3DGS as continuous functions of time (Fourier approximation for position, linear approximation for rotation), reducing the storage complexity of dynamic scenes from $O(TN)$ to $O(LN)$. It achieves rendering quality comparable to NeRF-based methods on the D-NeRF, DyNeRF, and HyperNeRF datasets while maintaining real-time rendering speeds over 118 FPS.
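To make the storage argument in the note above concrete, here is a minimal sketch of a time-parameterised Gaussian: each position coordinate is a truncated Fourier series in normalised time and the rotation quaternion is linear in time. The exact basis used in the paper is not given in this summary, so the parameterisation, the function names `position_at` and `rotation_at`, and the sample coefficients are all assumptions for illustration.

```python
import math


def position_at(t, offset, cos_coeffs, sin_coeffs):
    """Evaluate one coordinate of a Gaussian centre at normalised time t in [0, 1]."""
    value = offset
    for k, (a, b) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        value += a * math.cos(2 * math.pi * k * t) + b * math.sin(2 * math.pi * k * t)
    return value


def rotation_at(t, q_start, q_rate):
    """Quaternion that is linear in time, renormalised to unit length."""
    q = [s + r * t for s, r in zip(q_start, q_rate)]
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


# Hypothetical coefficients: one harmonic per coordinate.
x_quarter = position_at(0.25, offset=1.0, cos_coeffs=[0.5], sin_coeffs=[2.0])
q_mid = rotation_at(0.5, q_start=[1.0, 0.0, 0.0, 0.0], q_rate=[0.0, 0.0, 0.0, 2.0])
```

With K harmonics, each coordinate stores 1 + 2K numbers no matter how many frames T the sequence has, which is the intuition behind replacing per-frame storage with a fixed number of coefficients per Gaussian.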
A Direct Approach to Viewing Graph Solvability: This paper proposes a more direct reformulation of the viewing graph solvability problem than prior works, introduces new concepts to understand the solvability of real-world SfM graphs, and presents more efficient algorithms for detecting and decomposing unsolvable scenarios.
A Probability-guided Sampler for Neural Implicit Surface Rendering: This paper proposes a probability-guided ray sampler that models a probability density function in a 3D image projection space to guide ray sampling toward regions of interest. It also designs a novel surface reconstruction loss comprising near-to-surface and empty-space components. The sampler can be integrated as a plug-and-play module into existing neural implicit surface renderers, significantly improving reconstruction accuracy and rendering quality.
AEDNet: Adaptive Embedding and Multiview-Aware Disentanglement for Point Cloud Completion: This paper proposes AEDNet, which performs global embedding in the encoder and local disentanglement in the decoder through the Adaptive Embedding and Multiview-Aware Disentanglement (AED) module. By observing the point cloud from the outside through 3D viewpoints generated on a unit sphere, it achieves a comprehensive understanding of 3D object geometry, reaching SOTA on the MVP and PCN datasets.
Analysis-by-Synthesis Transformer for Single-View 3D Reconstruction: Proposes the Analysis-by-Synthesis Transformer (AST), which models pixel-to-shape and pixel-to-texture relationships using Shape Transformer and Texture Transformer in a unified framework, achieving high-quality mesh reconstruction and texture generation with only 2D annotations, outperforming existing methods on CUB-200-2011 and ShapeNet.
Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting via Analytic Integration: Instead of sampling each Gaussian at the pixel center as 3DGS does, this work analytically approximates the integral of the Gaussian signal over the pixel window using a conditional logistic function, achieving alias-free 3D Gaussian Splatting and outperforming Mip-Splatting in multi-scale rendering.
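The difference between pixel-center sampling and integrating over the pixel window, as described in the Analytic-Splatting note above, can be seen in one dimension. The sketch below is not the paper's formulation: as a stand-in it uses the classic logistic approximation of the standard normal CDF (scale factor 1.702), and it works with a normalised 1D density for simplicity. All function names are hypothetical.

```python
import math


def gaussian_cdf_exact(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_cdf_logistic(x):
    """Classic logistic approximation of the standard normal CDF."""
    return 1.0 / (1.0 + math.exp(-1.702 * x))


def window_integral(u, mean, sigma, cdf):
    """Integral of the density N(mean, sigma^2) over the pixel [u - 0.5, u + 0.5]."""
    hi = (u + 0.5 - mean) / sigma
    lo = (u - 0.5 - mean) / sigma
    return cdf(hi) - cdf(lo)


def point_sample(u, mean, sigma):
    """Density at the pixel centre, i.e. centre sampling with pixel width 1."""
    z = (u - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# A Gaussian much narrower than one pixel, centred on the pixel.
exact = window_integral(0.0, 0.0, 0.1, gaussian_cdf_exact)
approx = window_integral(0.0, 0.0, 0.1, gaussian_cdf_logistic)
sampled = point_sample(0.0, 0.0, 0.1)
```

In this toy case the window integral stays close to 1 under both the exact and the logistic CDF, while the centre sample is several times larger, which is the kind of mismatch that shows up as aliasing when the rendering scale changes.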
AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation: This work proposes AnimatableDreamer, which extracts skeletons and motion from monocular videos and generates text-guided animatable 3D non-rigid models via Canonical Score Distillation (CSD), comprehensively outperforming existing methods in both generation quality and temporal consistency.
BAD-Gaussians: Bundle Adjusted Deblur Gaussian Splatting: This work introduces a physical motion blur imaging model into the 3D Gaussian Splatting framework, jointly optimizing scene Gaussian parameters and camera trajectories during exposure to restore sharp 3D scenes from blurry images and achieve real-time rendering.
BeNeRF: Neural Radiance Fields from a Single Blurry Image and Event Stream: BeNeRF is proposed to jointly recover a neural radiance field and camera motion trajectory from only a single blurry image and its corresponding event stream. High-quality deblurring and novel view synthesis are achieved without requiring multi-view inputs or known poses.
Bi-directional Contextual Attention for 3D Dense Captioning: This paper proposes BiCA, which decouples instance queries from context queries and decodes them in parallel via a bi-directional contextual attention mechanism. This resolves the conflict between the localization and caption generation objectives in 3D dense captioning, achieving state-of-the-art (SOTA) performance on both the ScanRefer and Nr3D benchmarks.
Binomial Self-compensation for Motion Error in Dynamic 3D Scanning: This work proposes a binomial self-compensation (BSC) algorithm. By performing a weighted sum of motion-affected phase sequences based on binomial coefficients, the algorithm exponentially eliminates motion errors in four-step phase-shifting profilometry without requiring any intermediate variables, thereby achieving high-precision dynamic 3D scanning at the same frame rate as the camera.
CaesarNeRF: Calibrated Semantic Representation for Few-Shot Generalizable Neural Rendering: This paper proposes CaesarNeRF, which introduces scene-level semantic representations built upon generalizable NeRF (such as GNT). By leveraging camera pose calibration (aligning feature rotations with the target view) and sequential refinement (gradually updating global features across Transformer layers), CaesarNeRF improves PSNR by 1.74dB (on LLFF) compared to GNT in a 1-view setup, and seamlessly enhances other baselines like IBRNet and MatchNeRF.
Camera Height Doesn't Change: Unsupervised Training for Metric Monocular Road-Scene Depth Estimation: Proposes the FUMET training framework, which leverages vehicle size priors detected on the road to aggregate camera height estimates and utilizes the invariance of camera height within the same video sequence as metric scale supervision, enabling any monocular depth network to learn absolute scale without auxiliary sensors.
CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images: This paper proposes the CanonicalFusion framework, which achieves direct canonicalization by jointly predicting depth maps and compressed LBS weight maps, and fuses information from multiple input images using forward-skinning differentiable rendering to generate drivable 3D human avatars.
CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-Aware 3D Gaussian Field: This paper proposes CG-SLAM, an efficient dense RGB-D SLAM framework based on a consistency- and geometric-stability-optimized uncertainty-aware 3D Gaussian field, achieving state-of-the-art (SOTA) performance in both localization accuracy and reconstruction quality with a tracking speed of up to 15Hz.
CityGaussian: Real-Time High-Quality Large-Scale Scene Rendering with Gaussians: CityGaussian (CityGS) is proposed to enable high-quality 3D Gaussian Splatting training and cross-scale real-time rendering for city-scale scenes (> 1.5 km²) for the first time, leveraging a divide-and-conquer training strategy and a block-wise Level-of-Detail (LoD) mechanism.
Click-Gaussian: Interactive Segmentation to Any 3D Gaussians: This work proposes Click-Gaussian, which learns a discriminative 3D feature field with two-level granularity (coarse/fine) and combines it with Global Feature-guided Learning (GFL) to address cross-view mask inconsistency. It achieves real-time interactive 3D Gaussian segmentation at only 10ms per click, which is 15-130 times faster than existing methods while significantly improving segmentation accuracy.
CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation: This paper proposes CloudFixer, the first test-time input adaptation method for 3D point clouds. By optimizing geometric transformation parameters guided by a pre-trained diffusion model, it transforms out-of-distribution test point clouds back to the source domain. It avoids backpropagation through the diffusion model, achieving single-instance adaptation in under 1 second.
CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians: CoherentGS introduces a structured representation (one Gaussian per pixel) for 3DGS, establishing single-view and multi-view consistency constraints using an implicit convolutional decoder and total variation loss. Combined with a monocular depth-based initialization strategy, it achieves high-quality novel view synthesis under extremely sparse inputs (e.g., 3 images), significantly outperforming existing NeRF methods in terms of LPIPS.
ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance: This paper proposes ComboVerse, a compositional 3D asset generation framework. It first decomposes an input image containing multiple objects into individual elements and reconstructs them independently as single-object 3D models. Then, it optimizes the position, scale, and rotation parameters of the objects guided by Spatially-Aware Score Distillation Sampling (SSDS), enabling high-quality multi-object compositional 3D asset creation. It significantly outperforms existing methods in both CLIP Score and human evaluation.
Compress3D: a Compressed Latent Space for 3D Generation from a Single Image: This paper proposes a highly compressed triplane latent space autoencoder, paired with a two-stage diffusion model (generating a shape embedding first, followed by a triplane latent). It generates high-quality 3D assets from a single image in just 7 seconds, utilizing significantly less training data and time than comparable methods.
CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization: Identifying that the disagreement in Gaussian locations and rendering results between two co-trained 3DGS radiance fields is negatively correlated with reconstruction quality, this paper proposes CoR-GS to suppress inaccurate reconstructions through co-pruning and pseudo-view co-regularization, achieving state-of-the-art sparse-view novel view synthesis.
CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model: This paper proposes CRM (Convolutional Reconstruction Model), which leverages the spatial alignment prior between triplanes and six orthographic views. It replaces the Transformer with a U-Net to directly map the six views to a triplane, and utilizes FlexiCubes for end-to-end training. CRM generates high-fidelity textured meshes from a single image in under 10 seconds, with only 1/8 of the training cost of LRM.
CrossScore: Towards Multi-View Image Evaluation and Scoring: A new Cross-Reference (CR) image quality assessment paradigm is proposed. By comparing a query image with multiple reference images from different perspectives, a cross-attention neural network is utilized to predict pixel-level quality scores highly correlated with SSIM, enabling the evaluation of novel view synthesis quality without ground-truth reference images.
D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction: This work proposes D-SCo, a dual-stream conditional diffusion model for hand-held object point cloud reconstruction from a single RGB image. By combining unified hand-object semantic embeddings and hand-joint geometric embeddings, two branches provide semantic and geometric priors, respectively. Paired with a hand-constrained centroid-fixing strategy to stabilize the diffusion process, D-SCo achieves an F-5 score of 0.61 on ObMan (outperforming DDF-HO by 10.9%) and also leads significantly on real-world datasets such as HO3D/MOW.
DATENeRF: Depth-Aware Text-based Editing of NeRFs: Leverages scene depth reconstructed by NeRF to guide text-based 2D image editing (via depth-conditioned ControlNet + projection inpainting scheme), achieving multi-view consistent, high-quality NeRF scene editing.
Deblur e-NeRF: NeRF from Motion-Blurred Events under High-speed or Low-light Conditions: This work proposes Deblur e-NeRF, which models the motion blur of event cameras using a physically accurate pixel bandwidth model, enabling the first direct and effective reconstruction of blur-free NeRFs from motion-blurred event streams.
Deceptive-NeRF/3DGS: Diffusion-Generated Pseudo-observations for High-Quality Sparse-View Reconstruction: A fine-tuned Stable Diffusion + ControlNet is used to transform coarse NeRF/3DGS renderings into high-quality pseudo-observations. By densifying sparse input views by $5$-$10\times$ before retraining, this approach outperforms methods like FreeNeRF by $1$-$2\text{ dB}$ PSNR on datasets such as Hypersim, LLFF, and ScanNet, while training about $10\times$ faster than diffusion-regularization methods.
Deep Patch Visual SLAM: Based on the DPVO visual odometry system, this work extends it into a complete SLAM system, DPV-SLAM, by introducing efficient proximity loop closure and classical loop closure mechanisms, achieving real-time, high-precision, and low-memory monocular visual SLAM on a single GPU.
DG-PIC: Domain Generalized Point-In-Context Learning for Point Cloud Understanding: This paper proposes DG-PIC, the first point cloud understanding framework that simultaneously addresses multi-domain and multi-task learning in a unified model. Through dual-level source prototype estimation and a test-time feature shifting mechanism, it enhances generalization to unseen domains without requiring model updates.
Differentiable Convex Polyhedra Optimization from Multi-view Images: A differentiable convex polyhedra construction method based on duality transform and three-plane intersection solving is proposed. By bypassing implicit field supervision and directly optimizing gradients using multi-view image losses, high-fidelity convex polyhedral shape representation is achieved.
Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions: Leverages a text-to-image diffusion model (ControlNet/T2I-Adapter) to transform clean-weather images into adverse-condition images while preserving the same 3D structure. The existing monocular depth estimation networks are fine-tuned via self-distillation to uniformly address out-of-distribution challenges like adverse weather and non-Lambertian surfaces.
DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation: This paper introduces the diffusion denoising process to monocular depth estimation for the first time. By performing visually conditioned iterative denoising in the latent depth space, and proposing a self-diffusion mechanism to resolve the mode collapse issue caused by sparse Ground Truth (GT) depths, it achieves SOTA performance on KITTI and NYU-Depth-V2.
Divide and Fuse: Body Part Mesh Recovery from Partially Visible Human Images: A "divide-and-conquer" bottom-up human mesh recovery method is proposed, which reconstructs individual body parts independently and then fuses them, effectively solving the failure mode of traditional top-down methods (such as SMPL) when large areas of the human body are invisible.
DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors: The DreamDissector framework is proposed to disentangle text-to-3D NeRFs containing multi-object interactions into independent textured meshes using a Neural Category Field and Deep Concept Mining, achieving object-level 3D editing control.
DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting: Proposes DreamScene360, which utilizes panoramic images as an intermediate representation, combined with a GPT-4V self-refinement mechanism and panoramic 3D Gaussian Splatting, to achieve rapid generation of immersive 360° 3D scenes from text.
DreamView: Injecting View-specific Text Guidance into Text-to-3D Generation: DreamView proposes an adaptive text guidance injection module to collaboratively inject view-specific and global text descriptions into a diffusion model, achieving customizable and multi-view consistent text-to-3D generation.
Dual-level Adaptive Self-Labeling for Novel Class Discovery in Point Cloud Segmentation: This paper proposes a dual-level adaptive self-labeling method that addresses the class imbalance problem through semi-relaxed optimal transport and incorporates region-level representations to enhance pointwise classifier learning, achieving efficient novel class discovery in point cloud segmentation.
Dynamic Neural Radiance Field from Defocused Monocular Video: Proposes $D^2RF$, the first method to recover sharp dynamic NeRFs from defocused monocular videos, which unifies Depth of Field (DoF) rendering with volume rendering and introduces layered DoF volume rendering to model defocus blur and recover sharp novel views.
Efficient Depth-Guided Urban View Synthesis (EDUS): This work proposes EDUS, which leverages noisy geometric priors (monocular/stereo depth) to guide generalizable NeRF. Through a tri-part decomposition consisting of a foreground 3D CNN and background/sky image-based rendering, it achieves fast feed-forward inference and efficient scene-by-scene fine-tuning under sparse urban views.
Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration: This paper proposes Equi-GSPR, a sparse point cloud registration method based on SE(3) equivariant graph neural networks. By utilizing equivariant message passing, low-rank feature transformation (LRFT), and implicit feature space similarity matching, it achieves SOTA registration performance on indoor and outdoor datasets with low model complexity.
Event-based Mosaicing Bundle Adjustment: This work proposes EMBA, the first photometric Bundle Adjustment method for rotation-only event cameras. It formulates the problem as a regularized non-linear least squares optimization based on a linearized event generation model, designs an efficient solver by exploiting the block-diagonal sparse structure of the normal equation matrix, and simultaneously optimizes the camera rotation trajectory and the panoramic gradient map.
Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion: This work proposes the EGIInet framework, which achieves modal alignment using a unified encoder and utilizes an explicitly guided information interaction strategy (FT-Loss) to enable the network to accurately identify key structural information in images. On view-guided point cloud completion tasks, it outperforms XMFnet by 16% CD with fewer parameters.
External Knowledge Enhanced 3D Scene Generation from Sketch: Proposes the SEK framework, which integrates freehand sketches and an external object relation knowledge base as conditions for a diffusion model. Through knowledge-enhanced graph reasoning and spectrum filtering, it generates the layout and object geometry of 3D indoor scenes simultaneously in an end-to-end manner.
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance: FALIP (Foveal-Attention CLIP) is proposed as a training-free method that enhances the region-awareness capability of CLIP without modifying the original image by inserting a foveal-like attention mask into the multi-head self-attention module of CLIP. It achieves improvements across zero-shot tasks including referring expression comprehension, image classification, and 3D point cloud recognition.
FastCAD: Real-Time CAD Retrieval and Alignment from Scans and Videos: This work proposes FastCAD to achieve CAD model retrieval and alignment for all objects in a scene within 50ms through contrastive learning embedding space distillation and direct parameter prediction, which is 50 times faster than existing methods with superior accuracy.
Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering: Proposes an unbiased radiance cache-based inverse rendering method. By utilizing occlusion-aware vMF importance sampling and quick cache control variates, it eliminates rendering bias present in existing methods while maintaining computational efficiency, thereby improving the quality of material and illumination decomposition.
FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally: This paper reformulates the 2D-to-3D segmentation problem of 3D Gaussian Splatting as an integer linear program. Leveraging the linearity of alpha blending, it obtains a closed-form optimal solution in just 30 seconds, achieving a 50x speedup over existing methods.
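One way to read the FlashSplat note above: because alpha blending is linear in the per-Gaussian labels, the rendered mask at a pixel is a weighted sum of those labels, so each Gaussian can be labelled independently by comparing how much blending weight it contributes to foreground versus background pixels. The sketch below illustrates that accumulate-and-compare idea under stated assumptions; the function name `assign_labels`, the dense `weights` layout, and the toy numbers are hypothetical, and this is not the authors' code.

```python
def assign_labels(weights, masks):
    """Label each Gaussian by where most of its blending weight lands.

    weights[p][i]: blending weight (alpha times transmittance) of Gaussian i at pixel p.
    masks[p]: 1 for a foreground pixel, 0 for a background pixel.
    Returns one 0/1 label per Gaussian.
    """
    n = len(weights[0])
    fg = [0.0] * n  # weight accumulated on foreground pixels
    bg = [0.0] * n  # weight accumulated on background pixels
    for w_row, m in zip(weights, masks):
        for i, w in enumerate(w_row):
            if m:
                fg[i] += w
            else:
                bg[i] += w
    return [1 if f > b else 0 for f, b in zip(fg, bg)]


# Hypothetical toy input: 4 pixels (possibly from several views), 3 Gaussians.
weights = [
    [0.7, 0.2, 0.0],
    [0.6, 0.1, 0.1],
    [0.1, 0.8, 0.0],
    [0.0, 0.3, 0.6],
]
masks = [1, 1, 0, 0]
labels = assign_labels(weights, masks)
```

A single pass over the rendered weights is enough, with no gradient descent on a per-Gaussian label field, which is consistent with the short runtime the note reports.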
FlashTex: Fast Relightable Mesh Texturing with LightControlNet: Proposes LightControlNet, an illumination-aware variant of ControlNet. Combined with a two-stage texture optimization pipeline, it generates high-quality, relightable PBR textures for 3D meshes in approximately 4 minutes, which is 3-10 times faster than existing methods.
FLAT: Flux-Aware Imperceptible Adversarial Attacks on 3D Point Clouds: This paper proposes the FLAT framework to address the imperceptibility issue in 3D point cloud adversarial attacks from a flux perspective. By calculating the flux of local perturbation vector fields to evaluate uniformity changes and adjusting perturbation directions when high flux (disrupted uniformity) is detected, FLAT generates adversarial point clouds that are far harder to perceive than those of existing methods.
Flying with Photons: Rendering Novel Views of Propagating Light: Proposes the Transient Field representation which, combined with a first-of-its-kind multi-view ultrafast imaging dataset, achieves the first novel-view rendering of light propagation videos in real-world scenes from dynamic perspectives and can handle complex light transport effects such as scattering, reflection, refraction, and diffraction.
Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis: Proposes the Forest2Seq framework, which organizes unordered indoor scene objects into a hierarchical scene tree/forest structure and derives a meaningful permutation order as prior knowledge using breadth-first search (BFS). Combined with a Transformer autoregressive decoder, it significantly improves the quality of indoor scene synthesis.
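The ordering step in the Forest2Seq note above, turning a scene forest into a sequence with breadth-first search, can be sketched in a few lines. How the forest is built from a real scene is not described in this summary, so the node format `(name, children)`, the function name `bfs_order`, and the bedroom example are hypothetical.

```python
from collections import deque


def bfs_order(forest):
    """Return object names in breadth-first order.

    forest: list of root nodes; each node is a (name, children) tuple.
    """
    order = []
    queue = deque(forest)
    while queue:
        name, children = queue.popleft()
        order.append(name)
        queue.extend(children)  # children are visited after all nodes of the current level
    return order


# Hypothetical bedroom: two roots, one of which has dependent objects.
scene = [
    ("bed", [("nightstand_left", [("lamp", [])]), ("nightstand_right", [])]),
    ("wardrobe", []),
]
sequence = bfs_order(scene)
```

The resulting order places parent objects before the objects that depend on them, which gives an autoregressive decoder a consistent sequence to learn from.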
Formula-Supervised Visual-Geometric Pre-training (FSVGP): Proposes FSVGP, which automatically generates aligned synthetic images and point clouds using mathematical formulas of fractal geometry. Through formula-supervised consistency labels, it achieves cross-modal visual-geometric pre-training on a unified Transformer, outperforming single-modal FDSL methods across six tasks in image and 3D object classification, detection, and segmentation.
FutureDepth: Learning to Predict the Future Improves Video Depth Estimation: This paper proposes FutureDepth, which injects implicit motion and scene features into the depth decoder via a Future Prediction Network (F-Net) to learn motion cues and a Reconstruction Network (R-Net) to learn multi-frame correspondences. It achieves state-of-the-art (SOTA) accuracy and temporal consistency on four datasets (NYUDv2, KITTI, DDAD, and Sintel), with inference efficiency significantly surpassing existing video depth estimation methods.
G2fR: Frequency Regularization in Grid-Based Feature Encoding Neural Radiance Fields: Proposes G²fR (Generalized Grid-based Frequency Regularization), establishing a theoretical link between frequency regularization and grid-based feature encoding NeRF to solve the core challenges of GFE-NeRF in camera pose optimization and few-shot reconstruction.
G3R: Gradient Guided Generalizable Reconstruction: G3R is proposed as a gradient-guided generalizable reconstruction method. It iteratively updates a 3D Neural Gaussian representation using 3D gradient feedback from differentiable rendering via a learned reconstruction network. It achieves reconstruction of large-scale scenes (>10,000m²) in under 2 minutes, accelerating the process by at least 10x while maintaining comparable or superior rendering quality compared to 3DGS.
GAURA: Generalizable Approach for Unified Restoration and Rendering of Arbitrary Views: This paper proposes GAURA, a unified restoration and rendering framework based on generalizable NeRF. By utilizing learnable degradation-aware latent codes, it dynamically adapts to different image degradation types during the feature aggregation and rendering stages, enabling the rendering of clean novel views from degraded images without scene-specific optimization.
GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing: This paper proposes GaussCtrl, which utilizes depth-conditioned ControlNet editing and an attention alignment module to achieve multi-view consistent text-driven 3DGS scene editing, supporting editing of all viewpoints at once and requiring only a single 3D model update.
Gaussian Grouping: Segment and Edit Anything in 3D Scenes: Learns a 16-dimensional Identity Encoding for each Gaussian in 3D Gaussian Splatting to achieve instance-level grouping, utilizing SAM + DEVA video tracking to generate multi-view consistent 2D pseudo-labels for supervision. It achieves 69-77% mIoU on LERF-Mask open-vocabulary segmentation (outperforming LERF by over 2x) and outperforms Panoptic Lifting by 4.9% mIoU in panoptic segmentation while being 14x faster. It also supports various editing operations such as 3D object removal, inpainting, colorization, and style transfer.
GaussianImage: 1000 FPS Image Representation and Compression by 2D Gaussian Splatting: This paper proposes GaussianImage, which represents the first attempt to apply 2D Gaussian Splatting to image representation and compression. By utilizing compact 8-parameter 2D Gaussians and an accumulative summation rasterization algorithm, it achieves a decoding speed of over 2000 FPS, while matching the representation quality and compression performance of INR-based methods.
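The accumulative summation rasterization mentioned in the GaussianImage note above can be sketched for a single pixel: each 2D Gaussian contributes its colour scaled by its Gaussian weight, and the contributions are simply added, with no depth sorting or alpha compositing. The 8-parameter split assumed here (2 for position, 3 for a Cholesky factor of the covariance, 3 for colour) is an assumption for illustration, as are the function name `render_pixel` and the toy values; this is not the authors' implementation.

```python
import math


def render_pixel(x, y, gaussians):
    """Sum the weighted colours of all 2D Gaussians at pixel (x, y)."""
    rgb = [0.0, 0.0, 0.0]
    for mx, my, l11, l21, l22, r, g, b in gaussians:
        dx, dy = x - mx, y - my
        # Covariance = L @ L.T with L = [[l11, 0], [l21, l22]].
        # Solve L v = d by forward substitution; then d^T Cov^-1 d = |v|^2.
        v0 = dx / l11
        v1 = (dy - l21 * v0) / l22
        w = math.exp(-0.5 * (v0 * v0 + v1 * v1))
        rgb[0] += w * r
        rgb[1] += w * g
        rgb[2] += w * b
    return rgb


# Hypothetical toy scene: two isotropic blobs, 8 numbers each.
gaussians = [
    (0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0),  # red blob at the origin
    (2.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0),  # blue blob two pixels to the right
]
at_origin = render_pixel(0.0, 0.0, gaussians)
at_middle = render_pixel(1.0, 0.0, gaussians)
```

Since addition is order-independent, the Gaussians need no sorting before rasterization, which is one plausible reason this formulation decodes quickly.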
GaussReg: Fast 3D Registration with Gaussian Splatting: This work presents the first exploration of registration between 3D Gaussian Splatting scenes, proposing a coarse-to-fine GaussReg framework. The coarse stage utilizes point cloud registration to estimate the initial transformation, while the fine stage extracts volumetric features from rendered images for fine-grained alignment, achieving comparable accuracy to HLoc while being 44x faster.
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis: Proposes Generative Camera Dolly (GCD), which fine-tunes the Stable Video Diffusion model to generate synchronized dynamic novel-view videos from any viewpoint using a monocular video, supporting extreme camera transitions up to 180° without requiring depth input or explicit 3D modeling.
GeometrySticker: Enabling Ownership Claim of Recolorized Neural Radiance Fields: Proposes GeometrySticker, which "sticks" binary copyright information onto the geometric components (instead of the color components) of NeRF. This allows original creators to extract watermarks from rendered images to claim ownership, even if the NeRF is recolorized.
GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image: The authors present GeoWizard, a foundational model for geometry estimation based on Stable Diffusion priors. It jointly predicts depth and normals using a unified model with a Geometry Switcher, and eliminates layout ambiguity in mixed scenes via a Scene Distribution Decoupler, achieving state-of-the-art results on zero-shot depth and normal benchmarks.
Global-to-Pixel Regression for Human Mesh Recovery: A two-stage regression framework extending from global features to pixel-level features is proposed. It captures fine-grained body part information through an adaptive 2D keypoint-guided local encoding module and introduces a dynamic matching strategy to improve vision-mesh alignment, achieving SOTA performance on Human3.6M and 3DPW.
GPSFormer: A Global Perception and Local Structure Fitting-Based Transformer for Point Cloud Understanding: This paper proposes GPSFormer, which learns short-range and long-range dependencies through a Global Perception Module (GPM) and accurately models local geometric information using a Taylor-series-inspired Local Structure Fitting Convolution (LSFConv). With only 2.36M parameters, it achieves 95.4% accuracy on ScanObjectNN, outperforming all supervised learning and pre-training methods.
GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation: This paper proposes GRM, a feed-forward 3D reconstruction model based on a pure Transformer architecture. It converts sparse-view (4 images) pixels into dense 3D Gaussian representations via pixel-aligned Gaussians, completing the reconstruction in approximately 0.1 seconds. Combined with multi-view diffusion models, it enables text-to-3D and image-to-3D generation.
GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting: This paper proposes GS-LRM, an extremely simple Transformer-based large reconstruction model that patchifies multi-view images and directly regresses per-pixel 3D Gaussian parameters through self-attention. It significantly outperforms SOTA in both object-level (surpassing Triplane-LRM by 4dB PSNR) and scene-level (surpassing pixelSplat by 2.2dB PSNR) reconstruction, completing inference in 0.23 seconds on a single A100 GPU.
GVGEN: Text-to-3D Generation with Volumetric Representation: Proposes GVGEN, the first framework to directly generate 3D Gaussians from text in a feed-forward manner. By organizing unordered Gaussians into a structured volumetric representation (GaussianVolume) and designing a coarse-to-fine generation pipeline (generating geometric volumes first and then predicting Gaussian attributes), text-to-3D generation is completed in approximately 7 seconds.
HAC: Hash-grid Assisted Context for 3D Gaussian Splatting Compression: This work utilizes structured binary hash-grids to establish spatial context relationships for unordered 3DGS anchors. Through conditional probability modeling and adaptive quantization, it achieves efficient entropy coding, reaching a 75× compression rate compared to vanilla 3DGS while maintaining or even improving rendering quality.
HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting: Proposes HeadGaS, which equips each 3D Gaussian primitive with a learnable latent feature base, linearly blends features using expression parameters, and predicts expression-dependent color and opacity via an MLP. This design achieves real-time (250+ fps) and high-quality animatable head reconstruction, outperforming baselines in PSNR by approximately 2 dB.
Heterogeneous Graph Learning for Scene Graph Prediction in 3D Point Clouds: The 3D-HetSGP framework is proposed to model 3D scene graph prediction as a heterogeneous graph learning problem. By utilizing a two-stage process of Heterogeneous Graph Structure Learning (HGSL) and Heterogeneous Graph Reasoning (HGR), it addresses the suboptimal performance issue caused by indiscriminate message passing in existing homogeneous fully-connected graph methods.
Hiding Imperceptible Noise in Curvature-Aware Patches for 3D Point Cloud Attack: Proposes the Wavelet Patches Attack (WPA) method, which employs wavelet transform to analyze local curvature structures of point clouds and hides adversarial perturbations within curvature-consistent patches—perturbing along tangent planes in flat regions and along normal vectors in sharp regions—achieving a more imperceptible 3D point cloud attack compared to existing methods.
High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior: This paper proposes RPrDepth, which utilizes features and predictions of "rich-resource" models (such as multi-frame and high-resolution models) as priors during training. Through a prior depth fusion module and a rich-resource guided loss, the model achieves or even exceeds the depth estimation accuracy of multi-frame high-resolution models while performing inference using only a low-resolution single image.
High-Resolution and Few-shot View Synthesis from Asymmetric Dual-Lens Inputs: This paper proposes DL-GS (Dual-Lens 3D-GS), which addresses two major issues of 3D-GS in few-shot training and super-resolution rendering by leveraging stereo geometric constraints and high-resolution guidance from asymmetric dual-lens systems (wide-angle + telephoto) commonly found on mobile devices. It achieves SOTA performance through a consistency-aware training strategy and a multi-reference guided refinement module.
Human Hair Reconstruction with Strand-Aligned 3D Gaussians: This paper proposes Gaussian Haircut, which introduces a dual representation of classic hair strands (polylines) and strand-aligned 3D Gaussian primitives. By integrating 3D orientation field lifting and a coarse-to-fine strand fitting optimization strategy, high-fidelity strand-level hairstyles can be reconstructed from multi-view images. The reconstructed hairstyles can be directly used for editing, rendering, and physical simulation in graphics engines, achieving a speedup of over 10× compared to previous methods.
Hyperion: A Fast, Versatile Symbolic Gaussian Belief Propagation Framework for Continuous-Time SLAM: This paper presents Hyperion, a continuous-time Gaussian Belief Propagation (GBP) SLAM framework that automatically generates ultra-efficient B/Z-spline implementations based on the SymForce symbolic computation framework. It achieves comparable accuracy to traditional NLLS solvers (Ceres) in motion tracking and localization scenarios, while naturally supporting distributed multi-agent inference.
I²-SLAM: Inverting Imaging Process for Robust Photorealistic Dense SLAM: Proposes I²-SLAM, which integrates the physical imaging process (motion blur modeling + tone mapping) into a visual SLAM system. Through the joint optimization of an HDR radiance field map, multi-virtual-camera motion blur simulation, and differentiable tone mapping, it reconstructs sharp HDR 3D maps and more accurate camera trajectories from degraded hand-held casual videos.
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation: The IDOL framework is proposed to achieve joint video-depth generation for human-centric tasks by unifying a dual-modal U-Net and motion consistency loss, significantly outperforming existing methods.
Implicit Filtering for Learning Neural Signed Distance Functions from 3D Point Clouds: A non-linear implicit filter is proposed to smooth the implicit field of neural SDFs without requiring normals while preserving sharp geometric details, achieving field-wide consistency regularization through extension to non-zero level sets.
Improving 2D Feature Representations by 3D-Aware Fine-Tuning: Lifts 2D foundation model features into a 3D Gaussian representation for multi-view fusion, then fine-tunes the 2D model on the rendered 3D-aware features. The fine-tuned features improve semantic segmentation and depth estimation performance under simple linear probing.
Improving Domain Generalization in Self-Supervised Monocular Depth Estimation via Stabilized Adversarial Training: Proposes the SCAT framework, which reduces the sensitivity of UNet skip connections to perturbations via the Scale Depth Network (SDN) and introduces Conflict Gradient Surgery (CGS) to resolve the dual optimization conflict caused by adversarial augmentation. It is the first to successfully apply adversarial data augmentation to self-supervised monocular depth estimation to enhance cross-domain generalization.
Interactive 3D Object Detection with Prompts: This work proposes a multi-modal interactive 3D object detection framework named "Prompt in 2D, Detect in 3D" + "Detect in 3D, Refine in 3D". By bridging the 2D-3D complexity gap with simple 2D interaction prompts (clicks or bounding boxes) and supporting iterative refinement, it significantly reduces 3D annotation costs. Its effectiveness and outstanding open-set adaptation capabilities are validated on nuScenes.
Invertible Neural Warp for NeRF: This paper proposes using Invertible Neural Networks (INNs) to over-parameterize the rigid transformation function of camera poses, significantly improving pose estimation accuracy and reconstruction quality in joint NeRF optimization. It demonstrates that invertibility is a critical constraint when using MLPs to model rigid warps.
JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation: Proposes Joint Score Distillation (JSD), which models the joint distribution of multi-view denoised images through an energy function, extending SDS from single-view independent optimization to multi-view joint optimization. This effectively resolves the Janus problem in 3D generation while maintaining generation fidelity for complex text prompts.
Lagrangian Hashing for Compressed Neural Field Representations: Combines the Eulerian grid hash table of InstantNGP with a Lagrangian point cloud representation to store movable Gaussian feature points in hash buckets, achieving a compact neural field representation with a 1.8-2.8x reduction in parameters without losing reconstruction quality.
Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance: This paper proposes a large-scale language-driven 6-DoF grasp dataset named Grasp-Anything-6D (containing 1M scenes and over 200M grasp poses) and LGrasp6D, a diffusion-based framework. The core novelty lies in the Negative Prompt Guidance (NPG) strategy, which directs the grasp poses away from non-target objects during inference.
LaRa: Efficient Large-Baseline Radiance Fields: The LaRa feed-forward reconstruction model is proposed, which unifies local and global reasoning through a Gaussian Volume representation and a Group Attention Layer. It reconstructs 360° radiance fields from large-baseline views using only 4 images, and outperforms more computationally demanding methods like LGM while training for only 2 days on 4 A100 GPUs.
Learning 3D-Aware GANs from Unposed Images with Template Feature Field: Proposes Template Feature Field (TeFF), which jointly learns a generative radiance field and a semantic feature field to automatically extract 3D templates and estimate camera poses online from unposed in-the-wild images, thereby enabling generative adversarial learning of complete 3D geometries.
Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal: This paper proposes the GScream framework, which achieves high-quality object removal in the 3D Gaussian Splatting representation using monocular depth-guided training and cross-attention feature regularization while maintaining geometric consistency and texture coherence.
Learning to Generate Conditional Tri-Plane for 3D-Aware Expression Controllable Portrait Animation: This paper proposes Export3D, which learns appearance-decoupled expression representations (CLeBS) via contrastive pre-training and directly generates conditional tri-planes integrated with Expression-Adaptive Layer Normalization (EAdaLN), achieving cross-identity 3D-aware portrait expression animation without identity leakage.
LEIA: Latent View-Invariant Embeddings for Implicit 3D Articulation: LEIA is proposed to characterize different states of articulated objects by learning view-invariant latent embeddings. It utilizes a HyperNetwork to modulate NeRF weights, enabling smooth interpolation between unseen articulated configurations without requiring any prior motion knowledge or 3D supervision.
LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation: This paper proposes LGM, a multi-view 3D Gaussian reconstruction model based on an asymmetric U-Net architecture. It predicts 65,536 3D Gaussian primitives from 4 orthogonal view images, enabling high-resolution 3D model generation from text or images within 5 seconds at 512 resolution. The model bridges the training-inference domain gap through data augmentation strategies.
LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation: Proposes the LN3Diff++ framework, which compresses multi-view images into a compact 3D latent space via a 3D-aware VAE, and trains diffusion models (U-Net or DiT) on this space to achieve high-quality, fast, and general conditional 3D generation, including text-to-3D and image-to-3D.
MaRINeR: Enhancing Novel Views by Matching Rendered Images with Nearby References: The MaRINeR method is proposed to enhance the rendered image quality of 3D reconstructions using nearby reference images via deep feature matching and hierarchical detail transfer. It is applicable to post-processing rendering for various 3D representations, including explicit (mesh) and implicit (NeRF) representations.
MegaScenes: Scene-Level View Synthesis at Scale: Constructs MegaScenes, a large-scale scene-level 3D dataset containing over 100k SfM reconstructions from Wikimedia Commons internet photos, and combines warp conditioning with pose conditioning to improve pose consistency in scene-level novel view synthesis.
Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation: Proposes Mesh2NeRF, which directly constructs GT radiance fields from textured meshes via analytical solutions, modeling the density field with an occupancy function and the color field with a reflection model, providing accurate 3D point-wise supervision for NeRF representation and generation tasks.
MeshFeat: Multi-Resolution Features for Neural Fields on Meshes: This paper proposes MeshFeat, a parametric multi-resolution feature encoding method for neural fields on meshes. It constructs multi-resolution feature representations using mesh simplification algorithms, achieving a 13x inference speedup while maintaining reconstruction quality.
milliFlow: Scene Flow Estimation on mmWave Radar Point Cloud for Human Motion Sensing: This work proposes milliFlow, the first scene flow estimation method for mmWave radar point clouds. By leveraging multi-scale feature extraction, global aggregation, GRU temporal propagation, and constrained regression, it reduces EPE3D from 0.107m (the second-best result) to 0.046m (centimeter-level accuracy) on a self-collected dataset. It also shows that scene flow features benefit downstream tasks, including human activity recognition (+7.9%), human body parsing (+3.6%), and human tracking.
Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot: Multi-HMR is the first single-shot multi-person whole-body (including hands and facial expressions) 3D human mesh recovery method. It employs a ViT backbone and a Human Perception Head (HPH) with cross-attention, combined with a new synthetic dataset named CUFFS to address the difficulty of hand pose learning, achieving state-of-the-art (SOTA) performance on both multi-person and whole-body benchmarks.
MVDD: Multi-View Depth Diffusion Models: MVDD is proposed, a diffusion model based on multi-view depth map representations. By incorporating epipolar "line segment" attention and denoising depth fusion, it achieves 3D-consistent, high-quality shape generation, enabling the synthesis of dense point clouds with over 20K points.
MVDiffusion++: A Dense High-Resolution Multi-View Diffusion Model for Single or Sparse-View 3D Object Reconstruction: MVDiffusion++ proposes a pose-free multi-view latent diffusion model. By leveraging two elegant ideas—a "pose-free architecture" and a "view dropout training strategy"—it generates dense (32 views) high-resolution (512×512) multi-view images from a single or sparse set of input images, enabling high-quality 3D object reconstruction.
MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo: This work integrates cost volume-based depth estimation from MVS with 3D Gaussian Splatting, enhancing generalization through hybrid rendering (splatting + volume rendering). It proposes a geometric-consistency-based point cloud aggregation strategy that allows per-scene optimization to surpass the performance of a 10-minute 3D-GS optimization in just 45 seconds.
MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images: MVSplat is proposed, which constructs a cost volume via plane-sweep to accurately locate Gaussian centers. It achieves state-of-the-art sparse-view feed-forward 3D Gaussian prediction with significantly fewer parameters (1/10 of pixelSplat) and the fastest inference speed (22 fps).
NGP-RT: Fusing Multi-Level Hash Features with Lightweight Attention for Real-Time Novel View Synthesis: NGP-RT is proposed to replace the per-point MLP by aggregating multi-level explicit hash features using a lightweight attention mechanism, and to introduce an occupancy distance grid to reduce memory access during ray marching, achieving real-time NeRF rendering at 1080p 108fps on the Mip-NeRF 360 dataset.
NOVUM: Neural Object Volumes for Robust Object Classification: This paper proposes the NOVUM architecture, which maintains a neural volume representation composed of 3D Gaussians for each object category. By matching image features with the Gaussian features of each category, it achieves classification. NOVUM improves classification accuracy by 6-33% compared to standard architectures like ResNet/ViT/Swin under occlusion, corruption, and real-world OOD scenarios, while supporting 3D pose estimation and interpretable visualization.
nuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding: This paper constructs nuCraft, a high-precision 3D semantic occupancy dataset based on nuScenes (with a resolution up to 0.1m voxels, $8\times$ denser than existing benchmarks), and proposes VQ-Occ, which uses VQ-VAE to encode occupancy data into a compact latent space for prediction, achieving direct generation of high-resolution semantic occupancy without post-processing upsampling for the first time.
Omni-Recon: Harnessing Image-Based Rendering for General-Purpose Neural Radiance Fields: This paper proposes the Omni-Recon framework to construct a general-purpose NeRF through an image-based rendering (IBR) pipeline. By leveraging a decoupled dual-branch design of geometry and appearance, it is the first to achieve compatibility with multiple downstream 3D tasks—such as generalizable 3D reconstruction, zero-shot multi-task scene understanding, real-time rendering, and scene editing—within a single model.
Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation: Constructs Omni6D, the first large-scale category-level 6DoF pose estimation RGBD dataset. It covers 166 categories, 4,688 instances, and 800,000 images, far exceeding existing datasets like NOCS (only 6 categories). It also proposes a symmetry-aware evaluation metric and a progressive fine-tuning strategy.
On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy: This work mathematically analyzes the projection error introduced by the local affine approximation in 3D Gaussian Splatting (3D-GS). It proves that the error function reaches its minimum when the Gaussian mean direction aligns with the projection plane normal. Based on this, a projection-to-tangent-plane strategy for each Gaussian is proposed (Optimal Gaussian Splatting), which significantly reduces rendering artifacts without sacrificing real-time performance.
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models: Diff2Scene is proposed, marking the first attempt to leverage a pretrained text-to-image diffusion model (Stable Diffusion) for open-vocabulary 3D semantic segmentation. Through an innovative mask distillation method, semantically rich mask embeddings from the 2D foundation model are transferred to a 3D geometry-aware mask model, outperforming the state-of-the-art by 12% on ScanNet200.
Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation: The GGSD framework is proposed, which leverages 3D geometric priors (semantic consistency of superpoints) to guide knowledge distillation from 2D to 3D models. It further uncovers the representational advantages of 3D data through a self-distillation mechanism, significantly outperforming existing methods on both indoor and outdoor open-vocabulary 3D scene understanding tasks.
P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising: Proposes P2P-Bridge, which formulates point cloud denoising as a Schrödinger Bridge problem to learn the optimal transport plan between noisy and clean point clouds. It introduces a data-to-data (rather than data-to-noise) diffusion framework for the first time, significantly outperforming existing methods on both synthetic data and real-world indoor scenes (ScanNet++, ARKitScenes).
PCF-Lift: Panoptic Lifting by Probabilistic Contrastive Fusion: This paper proposes PCF-Lift, which replaces deterministic features with probabilistic feature embeddings (multivariate Gaussian distributions) and combines contrastive loss based on the Probabilistic Product Kernel (PP Kernel) with cross-view constraints. This effectively addresses the issues of inconsistent segmentation and inconsistent IDs in 2D segmentation, significantly outperforming state-of-the-art methods on the ScanNet and Messy Room datasets.
Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting: This paper proposes a deformation representation method based on Per-Gaussian Embedding, which defines deformation as a function of per-Gaussian latent embeddings and temporal embeddings. Combined with coarse-to-fine deformation decomposition and local smoothness regularization, it achieves comprehensive advantages in quality, speed, and model capacity across multiple dynamic scene datasets.
PISR: Polarimetric Neural Implicit Surface Reconstruction for Textureless and Specular Objects: This paper proposes PISR, which leverages the geometric constraints of polarized light (the correspondence between the angle of polarization and the azimuth of surface normals) to directly regularize neural implicit surface shapes. Combined with hash grid acceleration and image-space normal smoothing, it achieves high-precision reconstruction on textureless and specular objects with a 0.5mm Chamfer distance and a 99.5% F-score, while running 4 to 30 times faster than previous polarimetric methods.
Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting: By introducing pixel coverage count as a gradient weighting factor into the point cloud growth decision criteria of 3DGS, Pixel-GS addresses the issue where large Gaussians in sparse regions of the initial point cloud fail to split effectively, while suppressing floaters near the camera through distance-aware gradient scaling.
PointLLM: Empowering Large Language Models to Understand Point Clouds: PointLLM connects a point cloud encoder (Point-BERT) to the LLaMA large language model via an MLP projection layer. Trained in two stages on 730K instruction-following samples (660K brief descriptions + 70K complex instructions), it achieves a generative accuracy of 53.4% on 3D object classification (surpassing LLaVA-13B's 44.2%) and a human evaluation win rate of 55% over human annotations in object description tasks.
Power Variable Projection for Initialization-Free Large-Scale Bundle Adjustment: This paper proposes the Power Variable Projection (PoVar) algorithm, which extends the power series expansion method to the Variable Projection (VarPro) framework and further generalizes it to Riemannian manifold optimization. This achieves efficient optimization of initialization-free large-scale Bundle Adjustment (BA) for the first time.
ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion: Proposes ProDepth, a probabilistic fusion framework that infers dynamic region uncertainty via an auxiliary decoder to adaptively fuse single-frame and multi-frame depth probability distributions using a weighted geometric mean. This directly corrects erroneous matching costs in the cost volume, and combined with an uncertainty-aware loss reweighting strategy, achieves SOTA performance in self-supervised multi-frame monocular depth estimation.
Progressive Classifier and Feature Extractor Adaptation for Unsupervised Domain Adaptation on Point Clouds: The PCFEA method is proposed for unsupervised domain adaptation on point clouds. By progressively constructing intermediate domains from the source to the target domain, it trains the classifier using target-style feature augmentation at the macro-level (PTFA), and guides the feature extractor to align with the intermediate domains at the micro-level (IDFA). It achieves a mean accuracy of 76.5% on PointDA-10 (+2.9% over SOTA) and 87.6% on GraspNetPC-10 (+13.7% over SOTA).
Protecting NeRFs' Copyright via Plug-And-Play Watermarking Base Model: NeRFProtector is proposed, which utilizes a pre-trained watermarking base model (message extractor) to embed binary watermarks in a plug-and-play manner during the NeRF creation process. By employing Progressive Global Rendering (PGR), watermarking knowledge is distilled into the NeRF representation, achieving high bit-accuracy copyright protection without modifying the NeRF architecture.
Ray-Distance Volume Rendering for Neural Scene Reconstruction: The RS-Recon method is proposed, which replaces the traditional SDF with a ray-direction-dependent Signed Ray Distance Function (SRDF) to parameterize the density function in volume rendering. Combined with an SRDF-SDF consistency loss and a self-supervised visibility task, it achieves more accurate surface reconstruction and view synthesis in multi-object indoor scenes.
Spring-Gaus: Reconstruction and Simulation of Elastic Objects with Spring-Mass 3D Gaussians: This paper proposes Spring-Gaus, which integrates a learnable 3D spring-mass model into 3D Gaussian Splatting to reconstruct the appearance, geometry, and physical dynamics parameters of elastic objects from multi-view videos, supporting future prediction and simulation under different conditions.
Reliable Spatial-Temporal Voxels For Multi-Modal Test-Time Adaptation: This paper proposes Latte (ReLiable Spatial-temporal Voxels), a multi-modal test-time adaptation method that constructs spatial-temporal voxels (ST voxels) via sliding window frame aggregation and computes spatial-temporal entropy (ST entropy) to evaluate prediction reliability, thereby enabling adaptive cross-modal learning and achieving SOTA performance on three MM-TTA benchmarks.
Repaint123: Fast and High-Quality One Image to 3D Generation with Progressive Controllable Repainting: Repaint123 proposes a progressive controllable repainting strategy that uses 2D diffusion models to generate multi-view consistent, high-quality images, and then rapidly optimizes 3D representations via a simple MSE loss. It generates 3D content with delicate textures and multi-view consistency from a single image in just 2 minutes, significantly outperforming SDS-based methods.
RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation: RISurConv is proposed to construct local triangular surfaces and extract highly representative Rotation Invariant Surface Properties (RISP). Combined with attention-augmented convolutions, it is the first rotation-invariant point cloud analysis network to surpass non-rotation-invariant methods in accuracy.
RoGUENeRF: A Robust Geometry-Consistent Universal Enhancer for NeRF: This paper proposes RoGUENeRF, a post-processing enhancer for NeRF that combines 3D reprojection alignment, non-rigid optical flow refinement, and geometry-aware attention. It significantly improves the rendering quality of various NeRF methods while maintaining view consistency, demonstrating robustness to camera calibration errors.
S³D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis: This paper proposes S³D-NeRF, a NeRF-based method that leverages a hierarchical facial appearance encoder, a cross-modal facial deformation field, and a lip-sync discriminator to synthesize high-fidelity talking head videos driven by speech using only a single source image, outperforming existing single-shot methods in video quality and lip synchronization.
SAGS: Structure-Aware 3D Gaussian Splatting: This work proposes SAGS, which implicitly encodes scene geometry using a local-global graph representation and graph neural networks. It improves the rendering quality of 3DGS, reduces storage requirements (up to 24× compression), and significantly suppresses floater artifacts while maintaining real-time rendering.
Sapiens: Foundation for Human Vision Models: Sapiens presents a family of human-centric vision foundation models (0.3B to 2B parameters) pre-trained on 300 million human images using MAE self-supervised methods. It natively supports $1024 \times 1024$ high-resolution inference, systematically outperforming the state-of-the-art across four major human vision tasks: 2D pose estimation, body part segmentation, depth estimation, and surface normal prediction.
SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer: SC4D proposes a sparse-controlled video-to-4D generation framework. By decoupling the motion and appearance of dynamic 3D objects into sparse control points (~512) and dense Gaussian volumes (~50k), combined with Adaptive Gaussian Initialization (AG) and Gaussian Alignment Loss (GA) to address shape degradation, it achieves high-quality generation and cross-entity motion transfer based on control point trajectories.
ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities: This paper proposes a new task of 3D reasoning grounding and introduces the ScanReason benchmark (10K+ QA-location pairs, 5 reasoning types). It designs the ReGround3D framework, which couples MLLM reasoning with a 3D grounding module via a Chain-of-Grounding mechanism, achieving accurate 3D object localization under implicit instructions.
ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention: ScatterFormer is proposed, which is the first voxel Transformer to directly apply linear attention on variable-length voxel sequences across windows. By implementing a Scattered Linear Attention (SLA) module and a chunk-wise matrix multiplication algorithm, it achieves sub-millisecond latency. Paired with a Cross-Window Interaction (CWI) module to replace window shifting, it achieves state-of-the-art accuracy on Waymo and nuScenes while maintaining a detection speed of 23 FPS.
SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs: SceneGraphLoc is proposed to perform coarse localization of query images within a reference map composed of multimodal 3D scene graphs. Without relying on large-scale image databases, it achieves localization accuracy comparable to state-of-the-art image-level methods while reducing storage requirements by three orders of magnitude.
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding: This paper proposes the first million-scale 3D vision-language dataset, SceneVerse (68K indoor scenes + 2.5M scene-language pairs), and introduces GPS, a multi-level contrastive pre-training framework, achieving SOTA results in 3D visual grounding and QA tasks, as well as zero-shot transfer capabilities.
SEDiff: Structure Extraction for Domain Adaptive Depth Estimation via Denoising Diffusion Models: Proposes SEDiff, which for the first time leverages diffusion models to extract domain-invariant structural information, eliminating the domain gap between synthetic and real data through structure-consistent style transfer to achieve high-performance domain adaptive monocular depth estimation.
SEED: A Simple and Effective 3D DETR in Point Clouds: SEED proposes a simple and effective 3D DETR detector. It obtains high-quality queries in a coarse-to-fine manner through a Dual Query Selection (DQS) module, and achieves flexible query interaction by utilizing geometric structural information of 3D objects with a Deformable Grid Attention (DGA) module, reaching new SOTA on Waymo and nuScenes.
SegPoint: Segment Any Point Cloud via Large Language Model: SegPoint is proposed, the first model to utilize multimodal LLM reasoning capabilities in a unified framework to complete four tasks: 3D instruction segmentation, referring segmentation, semantic segmentation, and open-vocabulary segmentation. Additionally, the Instruct3D benchmark (2,565 pairs) is constructed, on which SegPoint achieves an mIoU of 27.5%.
SemanticHuman-HD: High-Resolution Semantic Disentangled 3D Human Generation: SemanticHuman-HD is proposed as the first 3D human image synthesis method that achieves semantic disentanglement. By leveraging $K$ independent local generators and a 3D-aware super-resolution module, it enables semantically controllable human generation at $1024^2$ resolution.
SGS-SLAM: Semantic Gaussian Splatting for Neural Dense SLAM: This paper proposes SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting. By optimizing multiple channels to integrate appearance, geometry, and semantic features, it achieves state-of-the-art (SOTA) performance in camera pose estimation, map reconstruction, and semantic segmentation.
ShapeFusion: A 3D Diffusion Model for Localized Shape Editing: Proposes ShapeFusion, a 3D mesh localized editing method based on a masked diffusion training strategy, achieving fully localized and interpretable 3D shape editing by directly operating in vertex space without latent-space optimization.
SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark: Proposes SignAvatars, the first large-scale multi-prompt (HamNoSys/language/word) 3D sign language holistic motion dataset (70K videos, 8.34M frames, 153 signers). It designs an automatic 3D annotation pipeline with biomechanical constraints, and proposes the VQ-VAE-based SignVAE model as the first benchmark baseline for 3D Sign Language Production (SLP).
SINDER: Repairing the Singular Defects of DINOv2: Reveals that the root cause of high-norm defect tokens in DINOv2 feature maps is the principal left singular vector of network weights (singular defect), and proposes SINDER, which repairs the defects by fine-tuning singular values on a small dataset while preserving feature quality.
SlotLifter: Slot-guided Feature Lifting for Learning Object-centric Radiance Fields: SlotLifter is proposed, which combines 2D-to-3D feature lifting with Slot Attention through a slot-guided feature lifting design. It achieves state-of-the-art performance in both scene decomposition and novel view synthesis, while speeding up training by approximately 5 times.
SparseSSP: 3D Subcellular Structure Prediction from Sparse-View Transmitted Light Images: Proposes SparseSSP, an efficient framework with a mixed-dimension topology that converts 3D subcellular structure prediction into a 2D network task via a Z-axis depth-to-channel transformation, reducing imaging frequency by up to 87.5% while maintaining state-of-the-art accuracy.
SpectraM-PS: Spectrally Multiplexed Photometric Stereo Under Unknown Spectral Composition: A spectrally multiplexed photometric stereo method (SpectraM-PS) is proposed that eliminates the need for physical model constraints. Under conditions where the spectral composition of the light source is completely unknown, it recovers surface normals from a single RGB image in a data-driven manner, moving from traditional multi-shot photometric stereo to a single-shot setting.
SplatFields: Neural Gaussian Splats for Sparse 3D and 4D Reconstruction: SplatFields finds that the performance bottleneck of 3D Gaussian Splatting (3DGS) in sparse-view settings stems from the lack of spatial autocorrelation in splat features. It proposes to introduce spatial regularization by predicting splat features through an implicit neural field, consistently improving reconstruction quality in both static 3D and dynamic 4D sparse reconstruction scenarios.
SuperGaussian: Repurposing Video Models for 3D Super Resolution: SuperGaussian is proposed to achieve 3D super-resolution by repurposing pre-trained video upscaling models. It requires no category-specific training, can handle various 3D input formats (Gaussians, NeRF, meshes, etc.), and outputs high-quality Gaussian Splatting models.
Surface Reconstruction from 3D Gaussian Splatting via Local Structural Hints: To address the issue of poor surface reconstruction quality in 3DGS, this paper proposes utilizing monocular normal/depth priors to enhance the geometric organization of Gaussian primitives, constructing local signed distance fields via Moving Least Squares (MLS), and jointly learning a neural implicit network for regularization, significantly improving the surface reconstruction precision of 3DGS.
T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning: T-MAE proposes a temporal masked autoencoder pre-training strategy that takes two temporally adjacent frames as input and learns temporal dependencies by masking the current frame and reconstructing it with the help of historical frame information. Equipped with the proposed SiamWCA (Siamese encoder + Windowed Cross-Attention) architecture, it outperforms SOTA self-supervised methods on the Waymo and ONCE datasets with fewer labeled data and fewer training iterations.
TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting: TalkingGaussian is proposed, a deformation-driven talking head synthesis framework based on 3D Gaussian Splatting. It represents facial motion by applying smooth deformations to persistent Gaussian primitives, and decomposes the face and inner mouth regions to address motion inconsistency.
MALD-NeRF: Taming Latent Diffusion Model for Neural Radiance Field Inpainting: MALD-NeRF is proposed to achieve high-quality NeRF inpainting through masked adversarial training and a scene-customized latent diffusion model, effectively addressing the multi-view inconsistency and texture shift problems of diffusion models.
TCC-Det: Temporarily Consistent Cues for Weakly-Supervised 3D Detection: This paper proposes TCC-Det, a weakly-supervised 3D object detection method that requires no manual 3D annotations. By leveraging an off-the-shelf 2D detector (Mask-RCNN) and multi-frame temporal consistency cues, it generates high-quality pseudo 3D labels to train a 3D point cloud detector (Voxel-RCNN). It outperforms all prior weakly-supervised methods on KITTI and Waymo, significantly narrowing the gap to fully-supervised methods.
TC-Stereo: Temporally Consistent Stereo Matching: Proposes TC-Stereo, which achieves temporally consistent stereo matching through temporal disparity completion for good initialization, temporal state fusion to maintain hidden state coherence, and dual-space (disparity + disparity gradient) iterative refinement to improve ill-posed regions.
Texture-GS: Disentangling the Geometry and Texture for 3D Gaussian Splatting Editing: This paper proposes Texture-GS, which disentangles geometry and texture for 3D Gaussian Splatting (3D-GS) for the first time. By leveraging a UV-mapping MLP and local Taylor expansion, it represents scene appearance as 2D texture maps, enabling real-time texture swapping and editing (58 FPS on an RTX 2080 Ti).
The NeRFect Match: Exploring NeRF Features for Visual Localization: Proposes NeRFMatch, which explores the potential of internal NeRF features as 3D descriptors and establishes an attention-based 2D-3D matching network. It achieves competitive localization performance on Cambridge Landmarks, validating the feasibility of NeRF as a scene representation for localization.
Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis: This paper proposes Thermal3D-GS, which models atmospheric transmission effects and thermal conduction physical processes using neural networks and introduces temperature consistency constraints, achieving high-quality novel-view synthesis of thermal infrared images, and establishing the first large-scale thermal infrared novel-view synthesis dataset, TI-NSD.
TPA3D: Triplane Attention for Fast Text-to-3D Generation: Proposes TPA3D, a GAN-based text-guided 3D generation framework that performs layer-wise refinement of sentence-level and word-level text features through a Triplane Attention (TPA) module, achieving fast and fine-grained text-to-3D textured mesh generation.
Track Everything Everywhere Fast and Robustly: This paper proposes an efficient and robust test-time optimization method for pixel tracking. By introducing the CaDeX++ invertible deformation network, monocular depth priors, and DINOv2 long-term semantic consistency, the method speeds up training by over 10 times while significantly improving tracking accuracy and robustness.
TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views via Feature Tracks: This paper proposes TrackNeRF, which integrates feature tracks from SfM into NeRF training. By replacing traditional pairwise correspondence losses with a global multi-view reprojection consistency loss, TrackNeRF significantly improves NeRF reconstruction quality and pose optimization accuracy under sparse views with noisy poses.
TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos: Proposes TRAM, a two-stage method that recovers metric-scale camera motion via robustified SLAM and regresses camera-frame human motion using a video Transformer (VIMO), combining both to achieve accurate 3D global trajectory and motion reconstruction of humans in world coordinates.
Transferable 3D Adversarial Shape Completion using Diffusion Models: This work proposes 3DAdvDiff, which leverages 3D diffusion models to generate high-quality, transferable 3D adversarial point clouds via adversarial shape completion. By combining model uncertainty, ensemble adversarial guidance, and saliency scoring strategies, it achieves state-of-the-art (SOTA) attack success rates against modern 3D models under black-box settings.
UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation: Proposes UniDream, which achieves relightable text-to-3D generation with clean albedo textures and PBR materials by training an albedo-normal aligned multi-view diffusion model (AN-MVM), integrated with a Transformer reconstruction model and stage-wise SDS optimization.
VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing: Proposes VCD-Texture, which unifies 2D and 3D self-attention learning (JNP) during the Stable Diffusion denoising process, addresses the variance decay issue caused by rasterization through Variance Alignment (VA), and handles inconsistent regions using inpainting refinement, achieving high-fidelity and highly consistent 3D texture synthesis.
VersatileGaussian: Real-Time Neural Rendering for Versatile Tasks Using Gaussian Splatting: This paper proposes VersatileGaussian, which equips 3D Gaussians with shared multi-task features and designs a Task Correlation Attention (TCA) module to enable cross-task information flow, achieving SOTA accuracy for multi-task label prediction on ScanNet and Replica datasets while maintaining a real-time rendering speed of 35 FPS.
View Selection for 3D Captioning via Diffusion Ranking: This work proposes DiffuRank, a method that leverages a pre-trained text-to-3D diffusion model (Shap-E) to score the rendered views of 3D objects and rank them by alignment, selecting the most representative top-6 views for GPT-4 Vision to generate high-quality captions. This refines approximately 200k incorrect annotations in Cap3D and expands the dataset to 1.5 million captions.
Vista3D: Unravel the 3D Darkside of a Single Image: Vista3D is proposed, which generates diverse and consistent high-fidelity 3D meshes from a single image within 5 minutes. It utilizes a coarse-to-fine two-stage framework (3D Gaussian Splatting $\rightarrow$ FlexiCubes differentiable isosurface refinement + decoupled texture) combined with viewpoint-aware diffusion prior composition.
WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians: Proposes WaSt-3D, which reformulates style transfer as an optimal transport problem between two Gaussian distributions using 3D Gaussian Splatting representations. By matching the 3D distributions of the content and style scenes via Sinkhorn divergence, it achieves the first 3D scene-to-scene geometric style transfer.
When Do We Not Need Larger Vision Models?: This paper proposes the Scaling on Scales (S2) strategy: freezing a small model (e.g., ViT-B), running it on multiple image scales, and concatenating the features. This matches or even outperforms large models (ViT-H/G) on classification, segmentation, depth estimation, and MLLM benchmarks without increasing parameters. Furthermore, it demonstrates both theoretically and experimentally that the representations learned by large models can be largely approximated linearly by multi-scale small models.
WordRobe: Text-Guided Generation of Textured 3D Garments: Proposes WordRobe, which learns a 3D garment UDF latent space through a coarse-to-fine two-stage encoder-decoder framework. It utilizes a weakly supervised CLIP mapping network to achieve text-driven 3D garment generation and editing, and leverages the view-composited property of ControlNet to generate view-consistent texture maps in a single forward inference pass, running 13 times faster than Text2Tex.
Zero-Shot Multi-Object Scene Completion: OctMAE is proposed, a hybrid architecture fusing Octree U-Net and latent 3D MAE to achieve high-quality, near-real-time multi-object scene shape completion from a single RGB-D image. Efficiency and generalization are significantly enhanced via an occlusion-masking strategy and 3D Rotary Position Embedding (RoPE).
ZeST: Zero-Shot Material Transfer from a Single Image: ZeST is proposed, a zero-shot, training-free material transfer method. By combining three parallel branches—extracting material representations via IP-Adapter, providing geometric guidance through ControlNet, and utilizing a foreground grayscale image for lighting cues—it achieves 2D material transfer from a single material exemplar image to a target object.