🧊 3D Vision

🧪 ICML2025 · 17 paper notes

📌 Same area in other venues: 📷 CVPR2026 (751) · 🔬 ICLR2026 (197) · 🧪 ICML2026 (30) · 🤖 AAAI2026 (79) · 🧠 NeurIPS2025 (116) · 📹 ICCV2025 (267)

🔥 Top topics: Diffusion Models ×3 · 3D Gaussian Splatting ×3 · Alignment/RLHF ×2 · Adversarial Robustness ×2 · Segmentation ×2

ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization: This work applies Direct Preference Optimization (DPO) to diffusion-based Human Mesh Recovery (HMR). An HMR-Scorer is trained to evaluate prediction quality and is used to construct a preference dataset of winner/loser pairs; the base diffusion model is then fine-tuned via DPO, improving HMR performance on in-the-wild images without requiring 3D annotations.
D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples: This paper proposes D-Fusion, a method that constructs visually consistent preference data pairs and preserves denoising trajectories via mask-guided Self-Attention Fusion. It addresses the performance limitations in training diffusion models with DPO caused by visual inconsistency, significantly improving prompt-image alignment quality across various RL algorithms and prompt types.
Diverse Prototypical Ensembles Improve Robustness to Subpopulation Shift: The paper proposes Diversified Prototypical Ensemble (DPE), which replaces the standard linear classification head with multiple diverse prototype classifiers. By utilizing both explicit (inter-prototype similarity loss) and implicit (bootstrap sampling) diversification strategies, DPE adaptively discovers subpopulation decision boundaries without requiring subpopulation annotations, significantly improving worst-group accuracy.
FlowDrag: 3D-aware Drag-based Image Editing with Mesh-guided Deformation Vector Flow Fields: This paper proposes FlowDrag, which constructs a 3D mesh from an image and generates continuous 2D vector flow fields through progressive SR-ARAP deformation. This injects global geometric priors into the motion supervision process of diffusion models, yielding state-of-the-art results on DragBench (MD=22.88) and the newly proposed VFD-Bench (PSNR=18.55, 1-LPIPS=0.82, MD=28.23).
FreeMesh: Boosting Mesh Generation with Coordinates Merging: This work proposes the Per-Token-Mesh-Entropy (PTME) metric, which evaluates mesh tokenizer quality without training, and introduces Rearrange & Merge Coordinates (RMC), a coordinate merging technique borrowed from NLP. RMC achieves a compression rate of up to 21.2% across three tokenizers (MeshXL, MeshAnythingV2, and EdgeRunner), significantly increasing the number of generatable faces while preserving geometric details.
GAPrompt: Geometry-Aware Point Cloud Prompt for 3D Vision Model: This paper proposes GAPrompt, a geometry-aware PEFT method for pre-trained 3D vision models. By synergistically leveraging point cloud geometric information through three modules—Point Prompt, Point Shift Prompter, and Prompt Propagation—it matches or even outperforms full fine-tuning while training only 2.19% of the parameters.
High Dynamic Range Novel View Synthesis with Single Exposure: This paper is the first to pose the problem of HDR novel view synthesis (HDR-NVS) from single-exposure LDR images only, and designs Mono-HDR-3D, a meta-algorithm framework based on camera imaging principles. It achieves HDR scene modeling without HDR supervision through an LDR-to-HDR Color Converter (L2H-CC) and an HDR-to-LDR closed-loop Color Converter (H2L-CC).
Of Mice and Machines: A Comparison of Learning Between Real World Mice and RL Agents: This paper systematically compares the behavior of real mice and RL agents in a predator-prey maze. After showing that RL agents lack self-preservation instincts, the authors propose two mechanisms, Trauma-Inspired Safety Buffer (TISB) and Variance-Penalized TD learning (VP-TDMPC-2), which improve the state visitation overlap between agents and mice from 20.9% to 86.1%.
PhysicsNeRF: Physics-Guided 3D Reconstruction from Sparse Views: PhysicsNeRF proposes a physics-prior-based sparse-view NeRF framework. By leveraging four complementary constraints—depth ranking, cross-view consistency, sparsity regularization, and progressive training—it achieves a PSNR of 21.4 dB with only 8 views while providing an in-depth theoretical analysis of the nature of overfitting under sparse-view conditions.
Probabilistic Interactive 3D Segmentation with Hierarchical Neural Processes: NPISeg3D proposes the first probabilistic interactive 3D segmentation framework based on Hierarchical Neural Processes. Through a two-level latent variable structure (scene-level and object-level) and a probabilistic prototype modulator, it achieves segmentation accuracy superior to AGILE3D with only a few clicks, while providing reliable uncertainty estimates.
ReferSplat: Referring Segmentation in 3D Gaussian Splatting: ReferSplat proposes the new task of Referring 3D Gaussian Splatting Segmentation (R3DGS). By constructing 3D Gaussian Referring Fields, a Position-Aware Cross-Modal Interaction (PCMI) module, and Gaussian-Text Contrastive Learning (GTCL), it enables target object segmentation (including occluded/invisible objects) in 3DGS scenes based on natural language descriptions. It achieves SOTA performance on the newly created Ref-LERF dataset and open-vocabulary segmentation benchmarks.
SE(3)-Equivariant Diffusion Policy in Spherical Fourier Space: This paper proposes constructing SE(3)-equivariant diffusion policies in spherical Fourier space, leveraging the equivariant properties of spherical harmonics to make the policy equivariant under rigid body transformations of the input scene, thereby achieving better spatial generalization in robot manipulation tasks.
Symmetry-Robust 3D Orientation Estimation: A robust two-stage 3D orientation estimation pipeline is proposed to handle rotational symmetry. The first stage recovers the orientation within an equivalence class of the octahedral symmetry group via quotient regression, and the second stage uses a classifier to predict one of the 24 octahedral flips, recovering the precise orientation. The pipeline achieves state-of-the-art results on ShapeNet.
LaGa: Tackling View-Dependent Semantics in 3D Language Gaussian Splatting: This paper proposes LaGa, which establishes cross-view semantic connections via 3D scene decomposition and constructs view-aggregated semantic representations using adaptive clustering with dual-factor re-weighting. This addresses the overlooked view-dependent semantics issue in 3D Language Gaussian Splatting, achieving a 3D mIoU of 64.0% (+18.7%) on LERF-OVS.
The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training: This paper reveals a significant and persistent sharpness disparity among different types of modules (Emb, QK, FFN, VO, Norm) in Transformers. Based on this, it proposes a Blockwise LR strategy that allocates larger learning rates to low-sharpness modules, achieving nearly 2× acceleration in LLM pre-training without compromising stability.
Thickness-aware E(3)-Equivariant 3D Mesh Neural Networks: This paper proposes T-EMNN, which introduces a thickness-aware message passing mechanism and a PCA-based data-driven coordinate system. While maintaining the computational efficiency of surface meshes, it models the thickness interaction between opposing surfaces to achieve E(3)-equivariant/invariant node-level 3D deformation prediction.
VTGaussian-SLAM: RGBD SLAM for Large Scale Scenes with Splatting View-Tied 3D Gaussians: This paper proposes View-Tied 3D Gaussians, which bind Gaussians to depth pixels and simplify them to spherical shapes to significantly reduce storage footprint. Combined with a tracking/mapping strategy that only optimizes Gaussians associated with adjacent views, a scalable RGBD SLAM system is realized for large-scale scenes.
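
Two of the entries above (ADHMR and D-Fusion) fine-tune diffusion models with DPO on winner/loser pairs. As a reading aid, here is a minimal sketch of the standard DPO loss for a single preference pair, assuming scalar log-probabilities are available from the policy and from a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta` are illustrative assumptions; this is the generic objective, not the diffusion-specific formulation used in either paper.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair.

    logp_*     : log-probability of the sample under the policy being trained
    ref_logp_* : log-probability of the same sample under the frozen reference
    beta       : temperature on the implicit reward (hypothetical default)
    """
    # Implicit reward margin: how much more the policy (relative to the
    # reference) favors the winner than the loser.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Loss is -log(sigmoid(margin)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative values only: the policy has shifted toward the winner.
loss = dpo_loss(logp_w=-0.5, logp_l=-2.0, ref_logp_w=-1.0, ref_logp_l=-1.0)
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls below that as the policy moves probability toward the winner relative to the loser.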