🧊 3D Vision
🤖 AAAI2026 · 74 paper notes
- 3D-ANC: Adaptive Neural Collapse for Robust 3D Point Cloud Recognition
This paper introduces the Neural Collapse (NC) mechanism into adversarial robustness for 3D point cloud recognition. By replacing the classifier head with a fixed ETF structure and adopting an adaptive training framework (RBL + FDL) to construct a disentangled feature space, 3D-ANC improves the adversarial accuracy of DGCNN on ModelNet40 from 27.2% to 80.9%, surpassing the best baseline by 34 percentage points.
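The fixed classifier head above relies on a simplex Equiangular Tight Frame (ETF), whose construction is standard Neural Collapse geometry; a minimal NumPy sketch (the function name and dimensions are ours, not the paper's):

```python
import numpy as np

def simplex_etf(num_classes: int, feat_dim: int, seed: int = 0) -> np.ndarray:
    """Fixed simplex ETF classifier: columns are unit vectors with pairwise
    cosine similarity exactly -1/(num_classes - 1), the Neural Collapse
    geometry. Returns a (feat_dim, num_classes) weight matrix."""
    assert feat_dim >= num_classes
    rng = np.random.default_rng(seed)
    # Random orthonormal basis U with shape (feat_dim, num_classes)
    U, _ = np.linalg.qr(rng.standard_normal((feat_dim, num_classes)))
    K = num_classes
    return np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)

W = simplex_etf(num_classes=40, feat_dim=256)  # e.g. 40 ModelNet40 classes
G = W.T @ W  # Gram matrix: ~1 on the diagonal, ~-1/39 off-diagonal
```

Because the ETF is fixed, only the feature extractor is trained, which is what lets the adaptive losses shape a disentangled feature space around it.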
- 3D-Free Meets 3D Priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance
This paper proposes a framework that combines 3D-free methods (HawkI-style test-time optimization) with 3D-based priors (weak guidance images from Zero123++) to synthesize camera-controlled views at specified elevation/azimuth angles from a single image, requiring neither additional 3D data nor training. The approach comprehensively outperforms Zero123++, HawkI, and Stable Zero123 on LPIPS, CLIP-Score, and other metrics in complex scenes.
- 3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation
This work adapts the SAM2 foundation model for 3D teeth segmentation by converting 3D meshes into 2D images via multi-view rendering and designing three lightweight adapters—a Prompt Embedding Generator, a Mask Refiner, and a Mask Classifier—along with a Deformable Global Attention Plugin (DGAP) to address automatic prompting, boundary refinement, and semantic classification. The proposed method achieves a new state-of-the-art T-mIoU of 91.90% on Teeth3DS.
- 4DSTR: Advancing Generative 4D Gaussians with Spatial-Temporal Rectification for High-Quality and Consistent 4D Generation
This paper proposes the 4DSTR framework, which significantly improves the spatial-temporal consistency of 4D Gaussian generation and its adaptability to rapid temporal changes through a Mamba-based temporal correlation rectification module (correcting Gaussian scale and rotation residuals) and a per-frame adaptive densification and pruning strategy.
- Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models
This paper proposes Uni-Adapter, a training-free online test-time adaptation (TTA) framework for 3D vision-language foundation models (VLFMs). It addresses distribution shifts via clustering-based dynamic prototype caching and graph-regularized label smoothing, achieving state-of-the-art performance on multiple 3D corruption benchmarks.
- AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation
This paper identifies a critical yet overlooked issue in SDS: the source distribution is dynamically evolving rather than static. AnchorDS is proposed to anchor the source distribution by feeding the current rendered image as an image condition into a dual-conditioned diffusion model, thereby resolving semantic over-smoothing and multi-view inconsistency in SDS. The method comprehensively outperforms SDS, VSD, and SDS-Bridge on T3Bench.
- AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation
AnchorHOI is proposed to achieve zero-shot text-driven 4D human-object interaction (HOI) generation by introducing two intermediate bridges — anchor NeRF and anchor keypoints — to distill interaction priors and motion priors from image and video diffusion models, respectively. The method outperforms existing approaches on both static 3D and dynamic 4D HOI generation.
- Arbitrary-Scale 3D Gaussian Super-Resolution
This paper proposes Arbi-3DGSR, an integrated framework that, for the first time, enables a single 3DGS model to support arbitrary-scale (including non-integer) high-resolution rendering through three core components: scale-aware rendering, generative prior-guided optimization, and progressive super-resolving. At ×5.7 scale, PSNR improves by 6.59 dB over vanilla 3DGS while maintaining real-time rendering at 85 FPS.
- ASSIST-3D: Adapted Scene Synthesis for Class-Agnostic 3D Instance Segmentation
This paper proposes ASSIST-3D, a synthetic data pipeline that generates high-quality annotated data for class-agnostic 3D instance segmentation through three stages: heterogeneous object selection, LLM-guided scene layout generation, and realistic point cloud construction, significantly improving model generalization.
- Can Protective Watermarking Safeguard the Copyright of 3D Gaussian Splatting?
This paper presents the first systematic study exposing the vulnerability of 3DGS watermarking frameworks, and proposes GSPure — a purification framework that leverages view-aware Gaussian weight accumulation and geometric feature clustering to precisely isolate and remove watermark-related Gaussian primitives, reducing watermark PSNR by up to 16.34 dB while incurring less than 1 dB loss in scene fidelity.
- Cheating Stereo Matching in Full-Scale: Physical Adversarial Attack against Binocular Depth Estimation
This paper proposes the first full-surface 3D texture physical adversarial attack against stereo matching models. Through a stereo-aligned rendering module and a region-aware merging attack strategy, adversarial vehicles seamlessly blend into the background in the predicted depth map, causing severe failures in autonomous driving perception systems.
- Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation
This paper proposes a Class-Partitioned VQ-VAE (CPVQ-VAE) and a Latent Flow Matching Model (LFMM), achieving the first purely generative point cloud scene generation method that requires no external database retrieval, reducing Chamfer Distance by 70.4% on complex living room scenes.
- DANCE: Density-Agnostic and Class-Aware Network for Point Cloud Completion
This paper proposes the DANCE framework, which achieves density-agnostic point cloud completion via ray-based candidate point sampling and an opacity prediction mechanism, while introducing a classification head to provide semantic priors. The method achieves state-of-the-art performance on the PCN and MVP benchmarks.
- DAPointMamba: Domain Adaptive Point Mamba for Point Cloud Completion
This work presents the first integration of Mamba (SSM) into unsupervised domain adaptive point cloud completion (UDA PCC). The proposed DAPointMamba framework achieves high-quality cross-domain point cloud completion through three modules—Cross-Domain Patch-Level Scanning, Spatial SSM Alignment, and Channel SSM Alignment—while maintaining linear complexity and a global receptive field.
- Debiasing Diffusion Priors via 3D Attention for Consistent Gaussian Splatting
This paper proposes the TD-Attn framework, which addresses multi-view inconsistency (the Janus problem) caused by prior-view bias in T2I diffusion models for 3D generation and editing. The framework comprises two modules—3D-Aware Attention Guidance (3D-AAG) and Hierarchical Attention Modulation (HAM)—and can be integrated as a general-purpose plugin into existing 3DGS pipelines.
- DeepRAHT: Learning Predictive RAHT for Point Cloud Attribute Compression
This paper proposes DeepRAHT, the first end-to-end differentiable Region Adaptive Hierarchical Transform (RAHT) framework for lossy point cloud attribute compression. By integrating learnable prediction models with a Laplace distribution-based rate proxy, DeepRAHT achieves compression performance surpassing both the G-PCC standard and existing deep learning methods.
- Distilling Future Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection
This paper proposes FTKD (Future Temporal Knowledge Distillation), a framework comprising two strategies—Future-aware Feature Reconstruction (FFR) and Future-guided Logit Distillation (FLD)—to effectively transfer future frame knowledge from an offline teacher model to an online student model, achieving gains of 1.3 mAP / 1.3 NDS on nuScenes without additional inference overhead.
- Domain Generalized Stereo Matching with Uncertainty-guided Data Augmentation
This paper proposes UgDA-Stereo, a plug-and-play training-time module that simulates diverse unseen domain styles by applying Gaussian uncertainty perturbations—derived from batch statistics—to the per-channel mean and standard deviation of RGB images. Combined with a feature consistency constraint, the method substantially improves the cross-domain generalization of stereo matching models.
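The perturbation idea admits a generic sketch: jitter each image's per-channel statistics with Gaussian noise whose scale is the batch-level spread of those statistics, then re-stylize. This is our own simplified naming and shape convention, not the paper's exact module:

```python
import numpy as np

def perturb_channel_stats(batch: np.ndarray, rng=None) -> np.ndarray:
    """Style augmentation sketch: jitter each image's per-channel mean and
    std with Gaussian noise whose scale is the batch-level spread of those
    statistics (an uncertainty estimate derived from batch statistics).

    batch: (N, C, H, W) float array; returns an array of the same shape."""
    rng = rng or np.random.default_rng(0)
    mu = batch.mean(axis=(2, 3), keepdims=True)            # (N, C, 1, 1)
    sigma = batch.std(axis=(2, 3), keepdims=True) + 1e-6
    mu_unc = mu.std(axis=0, keepdims=True)                 # (1, C, 1, 1)
    sigma_unc = sigma.std(axis=0, keepdims=True)
    new_mu = mu + rng.standard_normal(mu.shape) * mu_unc
    new_sigma = sigma + rng.standard_normal(sigma.shape) * sigma_unc
    # Normalize with the original stats, re-stylize with the perturbed ones
    return (batch - mu) / sigma * new_sigma + new_mu

x = np.random.default_rng(42).random((8, 3, 32, 32))
x_aug = perturb_channel_stats(x)
```

Because only first-order channel statistics change, image content is preserved while the "style" drifts, which is what makes the module plug-and-play at training time.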
- Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos
This paper proposes a coarse-to-fine temporal alignment module that can be plugged into existing 4D Gaussian Splatting frameworks to address reconstruction quality degradation caused by temporal misalignment across multi-view videos. The method achieves consistent improvements in PSNR/SSIM/LPIPS over multiple baselines on the DyNeRF dataset.
- Enhancing Generalization of Depth Estimation Foundation Model via Weakly-Supervised Adaptation with Regularization
This paper proposes WeSTAR, a framework that synergistically combines semantics-aware hierarchical depth normalization self-training, sparse pairwise ordinal weak supervision, and LoRA weight regularization to enhance the generalization of depth estimation foundation models (Depth Anything V2) on unseen domains and corrupted data in a parameter-efficient manner, achieving state-of-the-art results on multiple OOD benchmarks.
- Enhancing Rotation-Invariant 3D Learning with Global Pose Awareness and Attention Mechanisms
This paper proposes the Shadow-informed Pose Feature (SiPF) and the RIAttnConv operator. By introducing a global "shadow" reference point generated via Bingham distribution learning, the method enhances the global pose awareness of local rotation-invariant features, resolving the "Wing-tip Feature Collapse" problem where symmetric structures (e.g., left and right wings of an airplane) cannot be distinguished. The approach achieves state-of-the-art performance on ModelNet40 classification and ShapeNetPart segmentation.
- EPSegFZ: Efficient Point Cloud Semantic Segmentation for Few- and Zero-Shot Scenarios
This paper proposes EPSegFZ, a pretraining-free framework for few- and zero-shot 3D point cloud semantic segmentation. It extracts high-frequency features via ProERA, updates prototypes with textual information via LGPE, and establishes accurate query-prototype correspondences via DRPE. EPSegFZ surpasses the state of the art by 5.68% on S3DIS and 3.82% on ScanNet.
- FantasyStyle: Controllable Stylized Distillation for 3D Gaussian Splatting
This paper presents FantasyStyle, the first 3DGS style transfer framework built entirely on diffusion model distillation. It introduces a Multi-View Frequency Consistency (MVFC) mechanism that suppresses low-frequency components to reduce cross-view conflicts, and designs Controllable Stylized Distillation (CSD) with negative guidance to eliminate content leakage from style images. The method surpasses existing VGG-based and diffusion-based approaches in both stylization quality and content preservation.
- Fast 3D Surrogate Modeling for Data Center Thermal Management
This paper develops a vision-based 3D surrogate modeling framework for data centers. Server workloads, fan speeds, and air-conditioning temperature setpoints are encoded as 3D voxel representations, and architectures including 3D CNN U-Net, 3D Fourier Neural Operator, and 3D Vision Transformer are employed for real-time temperature field prediction. The proposed framework achieves inference speeds up to 20,000× faster than traditional CFD solvers while enabling a 7% reduction in energy consumption.
- FoundationSLAM: Unleashing the Potential of Deep Foundation Models in End-to-End Dense Visual SLAM
This work injects geometric priors from depth foundation models into a flow-based SLAM system. Three modules — a hybrid flow network, a bi-consistent BA layer, and reliability-aware refinement — form a closed loop. The resulting system achieves state-of-the-art trajectory accuracy and dense reconstruction quality across TUM/EuRoC/7Scenes/ETH3D benchmarks at 18 FPS in real time.
- Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine
This paper proposes FFSE — an autoregressive 3D-aware image editing framework built on a video diffusion model — paired with a hybrid dataset 3DObjectEditor (real + synthetic). FFSE enables multi-round object translation, scaling, and rotation on real images in the manner of a 3D engine, while generating physically plausible background effects such as shadows, reflections, and occlusions, and maintaining cross-round consistency. It substantially outperforms existing methods in both single-round and multi-round editing.
- Gaussian Blending: Rethinking Alpha Blending in 3D Gaussian Splatting
This paper revisits scalar alpha blending in 3DGS and identifies its neglect of intra-pixel spatial variation as the root cause of multi-scale rendering artifacts (enlargement erosion / downscaling dilation). The proposed Gaussian Blending models alpha and transmittance as spatial distributions within a pixel (2D uniform window), achieving real-time anti-aliasing without retraining. PSNR on multi-scale Blender improves from 31.59 to 35.80.
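For reference, the scalar rule being rethought is standard front-to-back alpha compositing, C = Σᵢ cᵢ αᵢ Πⱼ₍ⱼ₌₁..ᵢ₋₁₎ (1 − αⱼ); a minimal sketch of that baseline (Gaussian Blending replaces the scalar αᵢ with a spatial distribution over the pixel footprint):

```python
import numpy as np

def composite(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back scalar alpha compositing: C = sum_i T_i * a_i * c_i,
    with transmittance T_i = prod_{j<i} (1 - a_j)."""
    T = 1.0                 # transmittance seen by the next splat
    out = np.zeros(3)
    for c, a in zip(colors, alphas):
        out += T * a * c
        T *= 1.0 - a
    return out

c = composite(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
              np.array([0.5, 1.0]))
# the half-opaque red splat leaves T = 0.5 for the green one: [0.5, 0.5, 0]
```

The identified failure mode is that a single αᵢ per pixel ignores how coverage varies within the pixel, which is exactly what surfaces at non-native rendering scales.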
- GaussianImage++: Boosted Image Representation and Compression with 2D Gaussian Splatting
This paper proposes GaussianImage++, which achieves high-quality image representation and compression with a limited number of 2D Gaussian primitives via a distortion-driven densification mechanism and content-aware Gaussian filters, combined with an attribute-separated learnable scalar quantizer for efficient compression.
- Generalized Geometry Encoding Volume for Real-time Stereo Matching
This paper proposes GGEV, which integrates depth priors from a monocular depth foundation model (Depth Anything V2) into the cost aggregation process in a lightweight manner. Through Depth-aware Dynamic Cost Aggregation (DDCA), GGEV adaptively enhances matching relationships across different disparity hypotheses, achieving strong generalization at real-time inference speed.
- Geometry Meets Light: Leveraging Geometric Priors for Universal Photometric Stereo under Limited Multi-Illumination Cues
This paper proposes GeoUniPS, which injects geometric priors learned by a large-scale 3D reconstruction model (VGGT) into a universal photometric stereo pipeline. Through a light–geometry dual-branch encoder, the method recovers plausible surface normals even when multi-illumination cues are unreliable (e.g., shadows, self-occlusions, biased lighting). A new perspective-projection training dataset, PS-Perp, is also introduced to bridge the gap between the orthographic projection assumption and real-world cameras.
- Graph Smoothing for Enhanced Local Geometry Learning in Point Cloud Analysis
This paper analyzes the limitations of conventional graph construction methods (ball query), specifically sparse connectivity at boundary points and noisy connectivity at junction regions, and proposes a graph smoothing module (symmetric adjacency optimization + von Neumann kernel) and a local geometry learning module (adaptive shape features + cylindrical coordinate transformation), achieving competitive performance on classification and segmentation tasks.
- Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark
This paper presents Griffin, the first aerial-ground cooperative (AGC) 3D perception dataset and benchmark framework, comprising 250+ dynamic scenes (37K+ frames) generated via CARLA-AirSim joint simulation. Griffin features realistic UAV dynamics, variable cruise altitudes (20–60 m), occlusion-aware annotations, and a systematic robustness evaluation protocol.
- GT2-GS: Geometry-aware Texture Transfer for Gaussian Splatting
This paper proposes GT2-GS, a framework that achieves high-quality, view-consistent texture transfer for 3DGS via a geometry-aware texture transfer loss (GT2 Loss), an adaptive fine-grained control module (AFCM), and a geometry-preserving branch (GPB), outperforming existing 3D style transfer methods in both texture fidelity and scene content preservation.
- Hierarchical Direction Perception via Atomic Dot-Product Operators for Rotation-Invariant Point Clouds Learning
This paper proposes DiPVNet, which leverages the dual properties of the atomic dot-product operator (directional selectivity + rotation invariance) to construct a local L2DP operator and a global DASFT module, achieving hierarchical direction-aware rotation-invariant point cloud learning.
- IE-SRGS: An Internal-External Knowledge Fusion Framework for High-Fidelity 3D Gaussian Splatting Super-Resolution
This paper proposes IE-SRGS, a framework that fuses external knowledge (high-frequency texture priors from a pretrained 2D super-resolution model) with internal knowledge (cross-view consistent depth and texture features from a multi-scale 3DGS model), coordinated via a mask-guided fusion strategy, to achieve high-fidelity 3DGS super-resolution reconstruction from low-resolution inputs, attaining state-of-the-art performance on both synthetic and real-world scenes.
- Learning Conjugate Direction Fields for Planar Quadrilateral Mesh Generation
This paper proposes a data-driven approach based on DGCNN to efficiently generate conjugate direction fields (CDFs), bypassing the high computational cost of traditional nonlinear optimization. The method supports user stroke-guided controllable CDF generation, achieves a 1–2 order-of-magnitude speedup, and is accompanied by a large-scale dataset of 50,000+ free-form surfaces.
- MeshA*: Efficient Path Planning With Motion Primitives
This paper proposes MeshA*, an algorithm that reformulates lattice-based path planning from "searching at the motion primitive level" to "searching at the grid cell level while simultaneously fitting primitive sequences." By defining a novel search space based on extended cells, MeshA* achieves a 1.5×–2× runtime speedup over standard LBA* while preserving completeness and optimality guarantees.
- MeshSplat: Generalizable Sparse-View Surface Reconstruction via Gaussian Splatting
This paper proposes MeshSplat, the first generalizable sparse-view surface reconstruction framework based on 2DGS. It regularizes depth prediction via a Weighted Chamfer Distance loss and aligns 2DGS orientations through an uncertainty-guided normal prediction network, learning geometric priors in a self-supervised manner from novel view synthesis. MeshSplat achieves state-of-the-art performance on both sparse-view mesh reconstruction and cross-dataset generalization.
- MoBGS: Motion Deblurring Dynamic 3D Gaussian Splatting for Blurry Monocular Video
MoBGS proposes an end-to-end dynamic deblurring 3D Gaussian Splatting framework that reconstructs sharp spatiotemporal novel views from blurry monocular video via two core modules — Blur-adaptive Latent Camera Estimation (BLCE) and Latent Camera-induced Exposure Estimation (LCEE) — achieving substantial improvements over existing state-of-the-art methods on the Stereo Blur dataset.
- MR-CoSMo: Visual-Text Memory Recall and Direct Cross-Modal Alignment Method for Query-Driven 3D Segmentation
This paper proposes MR-CoSMo, a coarse-to-fine query-driven 3D segmentation model that establishes explicit alignment between 3D point clouds and text/2D images via a Direct Cross-Modal Alignment module (DCMA), and incorporates a Visual-Text Memory Module to store high-confidence feature pairs for enhanced cross-scene segmentation consistency. The method achieves state-of-the-art performance across three tasks: 3D instruction segmentation, referring segmentation, and semantic segmentation.
- Multi-Modal Assistance for Unsupervised Domain Adaptation on Point Cloud 3D Object Detection
This paper proposes MMAssist, which leverages image and text features as "bridges" to align 3D features between the source and target domains, while incorporating 2D detection results to enhance pseudo-label quality, achieving significant improvements in LiDAR-based 3D unsupervised domain adaptation object detection.
- NURBGen: High-Fidelity Text-to-CAD Generation through LLM-Driven NURBS Modeling
NURBGen is the first text-to-CAD generation framework based on NURBS surface representation. By fine-tuning an LLM, it maps natural language descriptions to structured NURBS parameter JSONs. A hybrid representation (untrimmed NURBS + analytic primitives) and a large-scale partABC dataset are introduced, achieving significant improvements over existing methods in geometric fidelity and dimensional accuracy.
- OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction
This paper proposes OceanSplat, which achieves high-fidelity underwater 3D Gaussian Splatting scene reconstruction under scattering media through trinocular view consistency constraints, synthetic epipolar depth priors, and depth-aware alpha adjustment, significantly reducing floating artifacts and surpassing existing methods.
- Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning
This paper proposes OSU-3DSG, a unified framework that integrates vision-language models for open-world 3D scene graph generation and supports four scene interaction tasks — scene question answering, visual grounding, instance retrieval, and task planning — via retrieval-augmented reasoning, achieving performance comparable to supervised methods under a zero-shot setting.
- OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding
This paper proposes the Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) task and the corresponding OpenScan benchmark, extending 3D scene understanding beyond object categories to eight linguistic attribute dimensions, revealing critical deficiencies of existing OV-3D methods in understanding abstract object attributes.
- Opt3DGS: Optimizing 3D Gaussian Splatting with Adaptive Exploration and Curvature-Aware Exploitation
This paper proposes Opt3DGS, a framework that divides 3DGS training into two phases — exploration and exploitation. The exploration phase employs adaptively weighted SGLD to escape local optima, while the exploitation phase uses a local quasi-Newton Adam optimizer for precise convergence. The method achieves state-of-the-art rendering quality without modifying the Gaussian representation.
- Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models
This work identifies pervasive redundant channels in vision foundation models (SAM/SAM2/DINOv2) and proposes a parameter-free adaptation method: an output-difference-based channel selection algorithm identifies optimal replacement pairs, substituting redundant channels with effective ones to enhance feature representations for downstream tasks and yielding average mIoU gains of 5–11 points.
- Pb4U-GNet: Resolution-Adaptive Garment Simulation via Propagation-before-Update Graph Network
This paper proposes Pb4U-GNet, which decouples message propagation from feature update (Propagation-before-Update) and incorporates resolution-aware propagation depth control and update scaling mechanisms, enabling garment simulation models trained solely on low-resolution meshes to generalize to high-resolution meshes.
- PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos
PFAvatar is proposed as a two-stage pipeline—comprising pose-aware diffusion model fine-tuning (ControlBooth) and NeRF distillation (BoothAvatar)—that reconstructs high-quality 3D personalized avatars from real-world Outfit-of-the-Day (OOTD) photos, completing personalization within 5 minutes and achieving a 48× speedup over prior methods.
- Physics-Informed Deformable Gaussian Splatting: Towards Unified Constitutive Laws for Time-Evolving Material Field
Each 3D Gaussian is treated as a Lagrangian material point. A time-evolving material field predicts per-particle velocities and constitutive stress tensors; the Cauchy momentum residual serves as a physics constraint while Lagrangian particle flow matching provides a data-fitting term. The approach achieves physical consistency and cross-scene generalization in monocular dynamic view synthesis, reaching state-of-the-art performance on both a self-constructed physics-driven dataset and the HyperNeRF real-world benchmark.
- Point-SRA: Self-Representation Alignment for 3D Representation Learning
Point-SRA is proposed to enhance 3D point cloud representation learning via Dual Self-Representation Alignment (MAE-SRA + MFT-SRA) and MeanFlow probabilistic modeling, exploiting the complementarity of representations under different mask ratios. The method surpasses Point-MAE by 5.59% on ScanObjectNN.
- Point Cloud Quantization through Multimodal Prompting for 3D Understanding
This paper proposes PCQ (Point Cloud Quantization), which leverages text embeddings from pretrained vision-language models as semantic prototypes. Through Gumbel-Softmax differentiable quantization, continuous point cloud features are discretized into a text prototype space, and cross-modal feature fusion is applied to achieve significant improvements in 3D understanding.
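The Gumbel-Softmax step admits a generic sketch: soft-assign each continuous feature to the prototype space and read back a convex combination of prototypes. Names and shapes below are ours, not the paper's:

```python
import numpy as np

def gumbel_softmax_quantize(feats, prototypes, tau=1.0, rng=None):
    """Soft-assign continuous features to discrete prototypes with the
    Gumbel-Softmax trick; returns a convex combination of prototypes.

    feats: (N, D); prototypes: (K, D); output: (N, D)."""
    rng = rng or np.random.default_rng(0)
    logits = feats @ prototypes.T                     # (N, K) similarities
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max(axis=1, keepdims=True))
    y /= y.sum(axis=1, keepdims=True)                 # differentiable one-hot
    return y @ prototypes

rng = np.random.default_rng(1)
feats = rng.standard_normal((4, 8))
protos = rng.standard_normal((16, 8))                 # 16 "text prototypes"
quantized = gumbel_softmax_quantize(feats, protos, tau=0.1)
```

As the temperature `tau` approaches 0 the assignment hardens toward picking a single prototype, while staying differentiable for end-to-end training.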
- RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-Wave Point Cloud Sequence
This paper proposes RadarLLM, the first end-to-end framework leveraging large language models for semantic-level human motion understanding from millimeter-wave radar point cloud sequences. The framework comprises a motion-guided radar tokenizer based on Aggregate VQ-VAE and a radar-aware language model, along with a physics-aware simulation pipeline for generating large-scale paired radar-text data.
- Redundant Queries in DETR-Based 3D Detection: Unnecessary and Prunable
This paper proposes GPQ (Gradually Pruning Queries), a method that progressively prunes redundant object queries in DETR-based 3D detectors using classification scores. Without introducing any additional learnable parameters, GPQ can be applied as a fine-tuning step directly on pretrained checkpoints, achieving up to 67.86% FLOPs reduction and 65.16% inference time reduction on edge devices.
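The core operation, keeping only high-scoring queries, can be sketched as a one-shot top-k filter (GPQ itself prunes gradually over fine-tuning; this simplification is ours):

```python
import numpy as np

def prune_queries(queries, cls_scores, keep_ratio=0.5):
    """Keep the top-scoring fraction of object queries.

    queries: (Q, D); cls_scores: (Q,) best classification score per query."""
    k = max(1, int(len(queries) * keep_ratio))
    keep = np.sort(np.argsort(cls_scores)[::-1][:k])  # indices, original order
    return queries[keep]

queries = np.arange(12, dtype=float).reshape(6, 2)
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])
kept = prune_queries(queries, scores, keep_ratio=0.5)  # queries 0, 2, 4 survive
```

Since pruning only drops rows of the query tensor, no new learnable parameters are introduced, which is why it can run as a fine-tuning step on pretrained checkpoints.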
- Rethinking Multimodal Point Cloud Completion: A Completion-by-Correction Perspective
This paper proposes a novel Completion-by-Correction paradigm that leverages a pretrained image-to-3D model to generate a topologically complete shape prior, then corrects it in feature space to align with local observations. This replaces the conventional Completion-by-Inpainting approach, achieving a 23.5% reduction in average CD and a 7.1% improvement in F-score on ShapeNet-ViPC.
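The CD figure above is the symmetric Chamfer Distance; a minimal NumPy implementation for small point sets:

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3):
    mean squared nearest-neighbor distance, accumulated in both directions."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
cd = chamfer_distance(a, b)  # 0.5 in each direction -> 1.0
```

The O(N·M) broadcast is fine for illustration; completion benchmarks typically use KD-tree or GPU implementations for large clouds.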
- Rethinking Rainy 3D Scene Reconstruction via Perspective Transforming and Brightness Tuning
This paper proposes OmniRain3D, the first dataset that jointly models perspective heterogeneity and brightness dynamicity in rainy 3D scenes, along with REVR-GSNet, an end-to-end framework integrating recursive brightness enhancement, Gaussian primitive optimization, and GS-guided rain elimination to reconstruct high-fidelity clean 3D scenes from rain-degraded images.
- Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation
This paper proposes a box-guided approach that leverages 2D bounding boxes from the open-vocabulary detector YOLO-World to guide the assembly of 3D instance masks from superpoints, without relying on SAM or CLIP. The method achieves high efficiency (<1 min/scene) while substantially improving retrieval performance on rare-category objects.
- RTGaze: Real-Time 3D-Aware Gaze Redirection from a Single Image
This paper proposes RTGaze, a real-time 3D-aware gaze redirection method that achieves high-quality gaze redirection from a single image at 61 ms/frame via a hybrid-frequency feature encoder, a gaze injection module, and 3D facial geometry prior distillation — approximately 800× faster than the previous state-of-the-art 3D method, GazeNeRF.
- Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion
This paper proposes Simba, a framework that, for the first time, reformulates point cloud completion as diffusion over a geometric transformation field rather than over point coordinates. A Sym-Diffuser learns the conditional distribution of per-point affine transformations to generate coarse completions, which are then progressively refined to high-fidelity outputs via a cascaded Mamba architecture (MBA-Refiner). Simba achieves state-of-the-art performance on PCN, ShapeNet, and KITTI benchmarks.
- SmartSplat: Feature-Smart Gaussians for Scalable Compression of Ultra-High-Resolution Images
This paper proposes SmartSplat, a feature-aware 2D Gaussian Splatting framework for image compression. By introducing three key strategies—gradient-color-guided variational sampling, repulsion-based uniform sampling, and scale-adaptive color initialization—SmartSplat achieves, for the first time, high-quality reconstruction of 8K/16K ultra-high-resolution (UHR) images at extreme compression ratios (up to 5000×).
- Sparse4DGS: 4D Gaussian Splatting for Sparse-Frame Dynamic Scene Reconstruction
Sparse4DGS is proposed as the first method for sparse-frame dynamic scene reconstruction, achieving high-fidelity 4D scene reconstruction from sparse video frames via two core modules: Texture-Aware Deformation Regularization (TADR) and Texture-Aware Canonical Optimization (TACO).
- SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction
SparseSurf is proposed to achieve simultaneous high-accuracy surface reconstruction and high-quality novel view synthesis under sparse-view settings, via Stereo Geometry-Texture Alignment (SGTA) and Pseudo-Feature Enhanced Geometry Consistency (PFEGC).
- Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction
This paper proposes Splat-SAP, a feed-forward method that reconstructs scale-aware point maps from wide-baseline stereo camera inputs and renders free-viewpoint video of human-centered scenes via a Gaussian Plane, requiring neither per-scene optimization nor 3D geometric supervision.
- Splats in Splats: Robust and Effective 3D Steganography towards Gaussian Splatting
This paper proposes "Splats in Splats," the first steganography framework that embeds 3D hidden content into 3DGS assets without modifying any vanilla 3DGS attributes. It achieves secure, robust, and efficient copyright protection through importance-graded spherical harmonic (SH) coefficient encryption and autoencoder-assisted opacity mapping.
- SplatSSC: Decoupled Depth-Guided Gaussian Splatting for Semantic Scene Completion
SplatSSC is proposed as a monocular 3D semantic scene completion framework based on depth-guided initialization and a Decoupled Gaussian Aggregator (DGA). Through compact Gaussian primitive initialization and robust geometry-semantics decoupled aggregation, it achieves state-of-the-art performance on Occ-ScanNet with significantly fewer primitives.
- Split-Layer: Enhancing Implicit Neural Representation by Maximizing the Dimensionality of Feature Space
This paper proposes Split-Layer, which decomposes fully connected layers in MLPs into multiple parallel branches and integrates their outputs via the Hadamard product. Without increasing parameter count or computational cost, this approach exponentially expands the feature space dimensionality from \(C\) to \(\binom{C/\sqrt{N}+N-1}{N}\), significantly enhancing the representational capacity of implicit neural representations (INRs).
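The split-and-fuse idea can be sketched directly: run several parallel linear branches and combine their outputs with an elementwise (Hadamard) product. Branch count and sizes below are ours, for illustration only:

```python
import numpy as np

def split_layer(x, branch_weights):
    """Fuse several parallel linear branches with an elementwise
    (Hadamard) product instead of a single fully connected layer.

    x: (B, D_in); branch_weights: list of (D_in, D_out) matrices."""
    out = np.ones((x.shape[0], branch_weights[0].shape[1]))
    for W in branch_weights:
        out *= x @ W          # Hadamard product across branch outputs
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
branches = [rng.standard_normal((8, 16)) for _ in range(3)]  # N = 3 branches
y = split_layer(x, branches)
```

The multiplicative fusion is what creates high-order feature interactions at the same parameter budget as the original fully connected layer.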
- STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification
STMI proposes a three-component multi-modal object re-identification framework that suppresses background noise via SAM segmentation-guided feature modulation (SFM), extracts compact representations through semantic token reallocation (STR), and captures high-order semantic relationships via cross-modal hypergraph interaction (CHI), achieving significant improvements on benchmarks such as RGBNT201.
- StreamSTGS: Streaming Spatial and Temporal Gaussian Grids for Real-Time Free-Viewpoint Video
This paper proposes StreamSTGS, a streamable spatial-temporal Gaussian grid representation that encodes canonical 3D Gaussian attributes as 2D images and temporal features as video, enabling real-time free-viewpoint video streaming at only 170 KB per frame. Reconstruction quality is maintained (PSNR 32.30 dB) through Transformer-assisted training and a sliding window mechanism.
- Surface-Based Visibility-Guided Uncertainty for Continuous Active 3D Neural Reconstruction
This paper proposes a Surface-Based Visibility field (SBV) that derives surface confidence from SDF sign changes and updates it via a voxel grid, enabling accurate visibility-aware uncertainty estimation during continuous active learning for Next-Best View selection. SBV achieves up to 11.6% improvement in image rendering quality across four benchmarks: DTU, Blender, TanksAndTemples, and BlendedMVS.
- TG-Field: Geometry-Aware Radiative Gaussian Fields for Tomographic Reconstruction
This paper proposes TG-Field, a geometry-aware Gaussian deformation framework for extremely sparse-view CT reconstruction. It employs a multi-resolution hash encoder to model spatial geometric priors, together with a spatiotemporal attention module and a motion flow network to handle dynamic CT, achieving state-of-the-art performance on both static and dynamic CT reconstruction.
- TOSC: Task-Oriented Shape Completion for Open-World Dexterous Grasp Generation from Partial Point Clouds
This paper introduces Task-Oriented Shape Completion (TOSC), a novel task that completes only the contact regions relevant to a manipulation task—rather than the entire object—by leveraging pretrained foundation models to generate candidate shapes, a 3D Discriminative Autoencoder (DAE) to select the optimal shape, and a FlowGrasp flow-matching model to synthesize dexterous grasps. The approach achieves improvements of 16.17% in grasp displacement and 55.26% in Chamfer Distance over prior methods.
- UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning
This paper proposes UniC-Lift, a unified single-stage 3D instance segmentation framework that learns optimizable vector embeddings on 3DGS primitives via contrastive loss and triplet loss, and directly decodes consistent 3D segmentation labels through a simple Embedding-to-Label procedure — eliminating post-processing clustering steps such as HDBSCAN and reducing training time from 15+ hours to under 40 minutes.
- VGGT-DP: Generalizable Robot Control via Vision Foundation Models
This paper proposes VGGT-DP, a biologically inspired visuomotor policy framework that integrates the pretrained 3D-aware foundation model VGGT as a visual encoder with Diffusion Policy. Through three key designs — frame-wise token reuse (FTR), random token pruning, and proprioception-guided visual learning — VGGT-DP substantially outperforms DP and DP3 baselines on high-precision manipulation tasks in MetaWorld.
- VPN: Visual Prompt Navigation
This paper proposes Visual Prompt Navigation (VPN), a novel navigation paradigm in which users annotate visual trajectories (keypoints connected by arrows) on 2D top-down maps to guide agent navigation, replacing natural language or image-goal instructions. Two datasets, R2R-VP and R2R-CE-VP, are constructed alongside a VPNet baseline model. Combined with view-level and trajectory-level data augmentation, the approach achieves strong performance in both discrete and continuous environments.