# 📹 ICCV2025 Paper Notes
1359 ICCV2025 paper notes covering 3D Vision (268), Image Generation (219), Multimodal VLM (142), Autonomous Driving (98), Segmentation (78), Video Understanding (58), Video Generation (51), Human Understanding (49), and 39 other areas. Each note includes a TL;DR, motivation, method, experiments, highlights, and limitations — a 5-minute read of the core ideas.
## 🧊 3D Vision
- TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update
  - This paper proposes TRAN-D, a 2D Gaussian Splatting-based method for sparse-view transparent object depth reconstruction. It employs segmentation-guided object-aware losses to optimize Gaussian distributions in occluded regions, and leverages physics simulation (MPM) to enable dynamic scene updates after object removal, requiring only a single image for scene refresh.
- 3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation
  - This paper proposes a 3D Gaussian Map based on 3D Gaussian Splatting for scene representation, combined with an open-set semantic grouping mechanism, to construct a 3D environmental representation that captures both geometric structure and rich semantic information for Vision-Language Navigation (VLN). A Multi-Level Action Prediction strategy is further designed to integrate multi-granularity spatial-semantic cues for navigation decision-making.
- 3D Mesh Editing using Masked LRMs
  - This paper proposes MaskedLRM, which reformulates 3D shape editing as a conditional reconstruction problem. During training, randomly generated 3D occluders mask multi-view inputs, and a single clean conditioning view guides completion of the occluded regions. At inference, the user defines an edit region and provides a single edited image; the model produces an edited 3D mesh in a single forward pass in under 3 seconds — 2–10× faster than optimization-based methods — while supporting topological changes (e.g., adding holes or handles) and achieving reconstruction quality on par with state-of-the-art methods.
- 3D Test-time Adaptation via Graph Spectral Driven Point Shift
  - This paper proposes GSDTTA, the first work to shift 3D point cloud test-time adaptation (TTA) from the spatial domain to the graph spectral domain. By optimizing only the lowest 10% of frequency components (reducing the adapted parameters by ~90%) to adjust global structure, combined with an eigenvector-map-guided self-training strategy for pseudo-label generation, GSDTTA significantly outperforms existing 3D TTA methods and achieves state-of-the-art performance on ModelNet40-C and ScanObjectNN-C. A spectral-decomposition sketch follows.
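    A minimal sketch of the spectral side of this idea, assuming a kNN graph Laplacian over the test cloud; `build_knn_laplacian`, `k=8`, and the plain unnormalized Laplacian are illustrative choices, not the paper's exact implementation:

    ```python
    import torch

    def build_knn_laplacian(points, k=8):
        # points: (N, 3); symmetric kNN adjacency -> unnormalized graph Laplacian
        d = torch.cdist(points, points)
        knn = d.topk(k + 1, largest=False).indices[:, 1:]  # drop self-neighbor
        A = torch.zeros_like(d)
        A.scatter_(1, knn, 1.0)
        A = torch.maximum(A, A.T)                          # symmetrize
        return torch.diag(A.sum(1)) - A

    points = torch.randn(1024, 3)                          # stand-in test cloud
    evals, evecs = torch.linalg.eigh(build_knn_laplacian(points))

    num_low = int(0.10 * points.shape[0])                  # lowest 10% of frequencies
    U_low = evecs[:, :num_low]                             # (N, num_low) spectral basis
    coef = (U_low.T @ points).clone().requires_grad_()     # adaptable coefficients
    residual = points - U_low @ (U_low.T @ points)         # fixed high-frequency detail

    # Test-time adaptation would optimize only `coef` (e.g., with a self-training
    # loss); the shifted cloud moves globally while local detail stays intact.
    adapted = U_low @ coef + residual
    ```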
- 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
  - This paper proposes 3DGraphLLM, which encodes semantic inter-object relationships in 3D scenes as learnable graph representations and feeds them into an LLM. The method significantly outperforms baselines that ignore semantic relations across multiple 3D vision-language tasks — including object grounding, scene captioning, and visual question answering — while achieving 5× faster inference than LVLM-based approaches.
- 3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt
  - This paper proposes 3DGS-LM, which replaces the Adam optimizer in 3D Gaussian Splatting with a customized second-order Levenberg-Marquardt (LM) optimizer. A GPU cache-driven parallelization scheme and a CUDA-parallelized PCG solver accelerate the Jacobian-vector products at the core of each LM step, reducing optimization time by approximately 20% while maintaining equivalent reconstruction quality. A matrix-free LM sketch follows.
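    As a rough illustration of the optimizer family, the sketch below performs one matrix-free LM step by running plain conjugate gradients on the damped normal equations, with Jacobian products obtained from autograd. It assumes 1-D `params` and residuals and is far from the paper's cached CUDA implementation:

    ```python
    import torch
    from torch.autograd.functional import jvp, vjp

    def lm_step(params, residual_fn, damping=1e-2, cg_iters=10):
        # Solve (J^T J + damping * I) dx = -J^T r with matrix-free CG.
        r = residual_fn(params)
        _, g = vjp(residual_fn, params, r)               # g = J^T r

        def A(v):                                        # A v = (J^T J + damping*I) v
            _, Jv = jvp(residual_fn, (params,), (v,))
            _, JtJv = vjp(residual_fn, params, Jv)
            return JtJv + damping * v

        x = torch.zeros_like(params)
        res = -g - A(x)                                  # CG residual for b = -g
        p, rs = res.clone(), res.dot(res)
        for _ in range(cg_iters):
            Ap = A(p)
            alpha = rs / p.dot(Ap)
            x, res = x + alpha * p, res - alpha * Ap
            rs_new = res.dot(res)
            p, rs = res + (rs_new / rs) * p, rs_new
        return params + x                                # candidate LM update
    ```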
- 4D Gaussian Splatting SLAM
  - This paper presents the first complete 4D Gaussian Splatting SLAM system capable of simultaneously performing camera pose tracking and 4D Gaussian radiance field reconstruction in dynamic scenes. Gaussian primitives are partitioned into static and dynamic sets; dynamic object motion is modeled via sparse control points and an MLP; and a novel 2D optical flow map rendering algorithm is introduced to supervise dynamic Gaussian motion learning.
- 4D Visual Pre-training for Robot Learning
  - FVP is a 4D (3D space + time) visual pre-training framework for robot learning that formulates its objective as next-frame point cloud prediction: a conditional diffusion model predicts the current-frame point cloud from historical-frame point clouds. Across 12 real-world manipulation tasks, FVP significantly improves the success rate of multiple 3D imitation learning methods (average +28% on DP3), establishing a new state of the art.
- 7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting
  - This paper proposes 7DGS, which models scene elements as 7-dimensional Gaussians (3D spatial + 1D temporal + 3D view-directional). A conditional slicing mechanism converts 7D Gaussians into time- and view-conditioned 3D Gaussians compatible with the standard 3DGS pipeline, unifying dynamic scene rendering with view-dependent appearance. On the proposed 7DGS-PBR dataset, 7DGS achieves up to a 7.36 dB PSNR gain over 4DGS while using only 15.3% of the Gaussian primitives, with real-time rendering at 401 FPS. The slicing math is sketched below.
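    The conditional slicing reduces to standard Gaussian conditioning: partition the 7D mean and covariance into spatial and (time, direction) blocks, then condition on the query (t, d). The block layout and the opacity rescaling by the conditioning density below are assumptions consistent with the note, not the paper's exact formulation:

    ```python
    import torch

    def slice_7d_gaussian(mu, Sigma, t, d):
        # mu: (7,), Sigma: (7, 7); dims 0:3 spatial, dims 3:7 = (t, d) conditioning
        mu_x, mu_c = mu[:3], mu[3:]
        S_xx, S_xc = Sigma[:3, :3], Sigma[:3, 3:]
        S_cx, S_cc = Sigma[3:, :3], Sigma[3:, 3:]
        c = torch.cat([t.view(1), d])                  # (4,) query time + view direction
        S_cc_inv = torch.linalg.inv(S_cc)
        mu_cond = mu_x + S_xc @ S_cc_inv @ (c - mu_c)  # conditional 3D mean
        S_cond = S_xx - S_xc @ S_cc_inv @ S_cx         # conditional 3D covariance
        diff = c - mu_c                                # density of the conditioning
        opacity_scale = torch.exp(-0.5 * diff @ S_cc_inv @ diff)  # marginal at (t, d)
        return mu_cond, S_cond, opacity_scale          # a 3DGS-compatible Gaussian
    ```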
- A3GS: Arbitrary Artistic Style into Arbitrary 3D Gaussian Splatting
  - A3GS is proposed as the first feed-forward zero-shot 3DGS style transfer framework. It encodes 3DGS scenes into a latent space via a GCN-based autoencoder and injects arbitrary style features using AdaIN (sketched below), completing style transfer from any style to any 3D scene in approximately 10 seconds — two orders of magnitude faster than optimization-based methods.
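    For reference, AdaIN itself is a two-line statistics swap. The sketch below assumes per-Gaussian latent features as rows; `eps` is illustrative:

    ```python
    import torch

    def adain(content, style, eps=1e-5):
        # content, style: (N, C) latent features; match channel-wise mean/std
        c_mean, c_std = content.mean(0), content.std(0) + eps
        s_mean, s_std = style.mean(0), style.std(0) + eps
        return (content - c_mean) / c_std * s_std + s_mean
    ```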
- A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision
  - This paper proposes a framework for training 3D diffusion models with only 2D image supervision: a pretrained deterministic 3D reconstruction model serves as a "noisy teacher" that generates noisy 3D samples, while a multi-step denoising strategy combined with rendering losses and cycle-consistency regularization enables cross-modal training (3D denoising + 2D supervision). The resulting 3D Gaussian Splatting generator surpasses its teacher by 0.5–0.85 PSNR despite using a smaller model.
- A Recipe for Generating 3D Worlds from a Single Image
  - This paper decomposes single-image-to-3D-world generation into two simpler sub-problems — panorama synthesis (training-free, via in-context learning) and point-cloud-conditioned inpainting (a ControlNet fine-tuned for only 5k steps) — followed by 3DGS reconstruction. The resulting immersive 3D environments are navigable within a 2 m³ volume in VR and surpass SOTA methods such as WonderJourney and DimensionX across all image quality metrics.
- A Simple yet Mighty Hartley Diffusion Versatilist for Generalizable Dense Vision Tasks
  - This paper proposes HarDiff, a frequency-domain learning strategy based on the Discrete Hartley Transform (DHT) that enhances the cross-domain generalization of diffusion models on dense vision tasks through low-frequency training (extracting structural priors from the source domain) and high-frequency sampling (leveraging target-domain detail guidance). HarDiff achieves state-of-the-art results across 12 benchmarks spanning semantic segmentation, depth estimation, and image dehazing. A DHT sketch follows.
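    For reference, the Discrete Hartley Transform can be computed from the FFT via DHT(x) = Re(FFT(x)) - Im(FFT(x)). The sketch below uses that identity with an illustrative axis-wise cutoff for the low/high split; the paper's exact masking scheme is not shown:

    ```python
    import torch

    def dht2d(x):
        X = torch.fft.fft2(x)
        return X.real - X.imag                        # cas-kernel transform via FFT

    def split_frequencies(x, cutoff=0.25):
        H = dht2d(x)
        h, w = x.shape[-2:]
        fy = torch.fft.fftfreq(h).abs().view(-1, 1)   # normalized row frequencies
        fx = torch.fft.fftfreq(w).abs().view(1, -1)   # normalized column frequencies
        low_mask = ((fy <= cutoff) & (fx <= cutoff)).to(x.dtype)
        return H * low_mask, H * (1 - low_mask)       # (low-freq, high-freq) parts
    ```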
- A Unified Interpretation of Training-Time Out-of-Distribution Detection
  - This paper proposes a novel perspective based on inter-variable "interactions" to provide a unified explanation for why different training-time OOD detection methods are effective — they all encourage the model to encode more high-order interactions. The paper further validates the dominant role of high-order interactions in OOD detection and explains, through interaction distribution analysis, why near-OOD samples are harder to detect.
- AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering
  - This paper proposes AAA-Gaussians, which incorporates full 3D evaluation (rather than 2D splat approximations) into every stage of the 3DGS rendering pipeline. Three techniques within a single framework — an adaptive 3D smoothing filter, view-space perspective-correct bounding computation, and frustum-based 3D tile culling — jointly address the three persistent problems of 3DGS: aliasing, projection distortion, and popping artifacts. The method substantially outperforms existing approaches under out-of-distribution viewpoints while maintaining real-time rendering (>100 FPS).
- Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation
  - By identifying the optimization order mismatch between the LoRA model and the 3D model in VSD, this paper proposes a linearized lookahead correction term, \(L^2\)-VSD, which significantly improves text-to-3D generation quality at the cost of only one additional forward pass.
- Adversarial Exploitation of Data Diversity Improves Visual Localization
  - This paper proposes RAP, a framework that synthesizes diverse training data via appearance-controllable 3DGS and introduces an adversarial discriminator to bridge the synthetic-to-real domain gap, enabling absolute pose regression methods to substantially surpass the state of the art across multiple datasets — reducing indoor translation/rotation errors by 50%/41% and outdoor errors by 38%/44%.
- AllTracker: Efficient Dense Point Tracking at High Resolution
  - This paper proposes AllTracker, which reformulates point tracking as multi-frame long-range optical flow estimation. Through low-resolution iterative inference (2D convolutions + temporal attention) followed by high-resolution upsampling, AllTracker achieves state-of-the-art accuracy for high-resolution (768–1024 px) dense point tracking across all pixels with only 16M parameters.
- Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images
  - This paper proposes Amodal3R, an end-to-end occlusion-aware 3D reconstruction model that introduces mask-weighted cross-attention and an occlusion-aware attention layer on top of TRELLIS, enabling direct reconstruction of complete 3D object geometry and appearance from partially occluded 2D images in the 3D latent space, substantially outperforming prior two-stage "2D completion → 3D reconstruction" pipelines.
- Amodal Depth Anything: Amodal Depth Estimation in the Wild
  - This paper proposes a new paradigm for amodal relative depth estimation, constructs a large-scale real-world dataset ADIW (564K samples), and designs two complementary frameworks (Amodal-DAV2 and Amodal-DepthFM) built upon Depth Anything V2 and DepthFM. By minimally modifying pretrained models, the method achieves depth prediction in occluded regions, improving RMSE by 27.4% over the previous SOTA on ADIW.
- AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
  - This paper proposes AnimateAnyMesh, the first feed-forward text-driven universal mesh animation framework. It introduces DyMeshVAE to decompose dynamic meshes into initial positions and relative trajectories, compressing them into a latent space. A Rectified Flow-based MMDiT model then learns the trajectory distribution conditioned on text. Trained on the 4M+ DyMesh dataset, the framework generates high-quality animations for meshes of arbitrary topology within 6 seconds, comprehensively outperforming DG4D, L4GM, and Animate3D.
- AnyI2V: Animating Any Conditional Image with Motion Control
  - This paper proposes AnyI2V, a training-free framework that accepts arbitrary-modality images (mesh, point cloud, depth map, skeleton, etc.) as first-frame conditions and combines user-defined trajectories for motion-controlled video generation, outperforming existing training-free methods and competing with trained methods on FID/FVD/ObjMC metrics.
- AR-1-to-3: Single Image to Consistent 3D Object Generation via Next-View Prediction
  - This paper proposes AR-1-to-3, an autoregressive next-view prediction framework built upon diffusion models. By adopting a progressive "near-to-far" generation strategy, combined with two conditioning mechanisms—Stacked-LE (Stacked Local-feature Encoding) and LSTM-GE (LSTM-based Global-feature Encoding)—the method significantly improves multi-view consistency in single-image-to-multi-view generation. On the GSO dataset, it achieves a PSNR of 13.18 (a 23.5% improvement over InstantMesh's 10.67) and reduces Chamfer Distance to 0.063 (compared to 0.117 for InstantMesh).
- ArgMatch: Adaptive Refinement Gathering for Efficient Dense Matching
  - This paper proposes an Adaptive Refinement Gathering pipeline that substantially reduces dependence on heavy feature extractors and global matchers through a content-aware offset estimator, a local-consistency matching corrector, and a local-consistency upsampler, achieving competitive dense matching performance with a lightweight network.
- Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description
  - This paper presents Articulate3D — the first large-scale real-world indoor scene dataset with articulation annotations (280 high-quality scans) — along with USDNet, a unified framework that simultaneously predicts movable/interactive part segmentation and motion parameters from 3D point clouds, providing simulation-ready scene data for embodied AI and physical simulation.
- ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
  - This paper presents ATLAS, a parametric human body model that explicitly decouples external surface shape from internal skeletal parameters, incorporates sparse nonlinear pose correctives, and is trained on 600K high-resolution scans, achieving more accurate and controllable human body modeling than SMPL-X.
- Auto-Regressively Generating Multi-View Consistent Images
  - This paper proposes MV-AR, the first autoregressive model for multi-view image generation. It progressively generates each subsequent view conditioned on all previously generated views, incorporating a unified multimodal condition injection module and a Shuffle View data augmentation strategy. MV-AR achieves consistency comparable to diffusion-based methods under text, image, and shape conditioning.
- AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting
  - This paper proposes AutoOcc, a fully automatic vision-centric pipeline for open-ended semantic occupancy annotation. By leveraging vision-language model (VLM)-guided differentiable Gaussian splatting (VL-GS), AutoOcc generates 3D semantic occupancy without any human labels, achieving IoU 83.01 / mIoU 20.92 on Occ3D-nuScenes with camera-only input, substantially outperforming existing automatic annotation methods.
- Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction
  - This paper proposes BA-Track, a framework that leverages a 3D point tracker to decompose observed motion into camera motion and object motion, enabling classical Bundle Adjustment to jointly handle static and dynamic scene elements for accurate camera pose estimation and temporally consistent dense reconstruction.
- Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction
  - This paper proposes DiffusionGS, which bakes 3D Gaussian point clouds into the denoiser of a diffusion model, enabling single-stage, view-consistent single-view 3D object generation and scene reconstruction. Combined with a scene-object mixed training strategy and RPPC camera conditioning encoding, the method substantially outperforms existing approaches on PSNR/FID metrics while requiring only ~6 seconds for inference.
- BANet: Bilateral Aggregation Network for Mobile Stereo Matching
  - This paper proposes BANet, a Bilateral Aggregation Network that decomposes the cost volume into a high-frequency detail volume and a low-frequency smooth volume via spatial attention (sketched below) and aggregates them separately. Using only 2D convolutions, BANet runs in real time on mobile devices while substantially outperforming MobileStereoNet-2D (35.3% accuracy improvement on KITTI 2015); its 3D variant achieves the highest accuracy among real-time methods on GPU.
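    A minimal sketch of the decomposition step, assuming an edge-aware attention map predicted from left-image features; the tiny attention head is a placeholder, not BANet's architecture:

    ```python
    import torch
    import torch.nn as nn

    class BilateralSplit(nn.Module):
        def __init__(self, feat_ch):
            super().__init__()
            self.attn = nn.Sequential(
                nn.Conv2d(feat_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())

        def forward(self, cost_volume, left_feat):
            # cost_volume: (B, D, H, W); left_feat: (B, C, H, W)
            a = self.attn(left_feat)            # (B, 1, H, W) edge-aware weights
            detail = cost_volume * a            # high-frequency: edges, thin structures
            smooth = cost_volume * (1 - a)      # low-frequency: smooth regions
            return detail, smooth               # aggregated by separate branches
    ```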
- Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation
  - This paper constructs MATE-3D, a multi-dimensional text-to-3D benchmark of 1,280 textured meshes (8 generation methods over prompts spanning 8 categories), each rated by 21 annotators across 4 evaluation dimensions for 107,520 labels in total. It further proposes HyperScore, a multi-dimensional quality evaluator that combines learnable condition features, conditional feature fusion (simulating attention shift), and a hypernetwork that generates dimension-adaptive mapping functions (simulating changes in the decision process), surpassing existing metrics across all four dimensions: semantic alignment, geometry, texture, and overall quality.
- Benchmarking Egocentric Visual-Inertial SLAM at City Scale
  - This paper introduces LaMAria — the first city-scale egocentric multi-sensor VIO/SLAM benchmark dataset — providing centimeter-accurate ground truth via surveying-grade control points. It systematically evaluates mainstream academic SLAM methods on real egocentric data, revealing a substantial performance gap between academic systems and commercial solutions.
- BezierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting
  - This paper proposes BezierGS, a 3D Gaussian Splatting method that models dynamic object motion trajectories using learnable Bézier curves, eliminating reliance on precise bounding box annotations. The method achieves state-of-the-art performance on both dynamic and static scene reconstruction on the Waymo and nuPlan datasets.
- BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis
  - This paper proposes BBSplat, which replaces the Gaussian opacity in 2D Gaussian Splatting with learnable RGB texture maps and alpha maps, enabling each planar primitive to possess arbitrary shape and per-pixel color control. With fewer primitives, BBSplat closes the rendering quality gap between 2DGS and 3DGS while preserving accurate mesh extraction capability and achieving up to 17× storage compression.
- Blended Point Cloud Diffusion for Localized Text-guided Shape Editing
  - This paper proposes BlendedPC, which reformulates localized text-guided 3D shape editing as a semantic inpainting problem. By fine-tuning an Inpaint-E model on top of Point·E and introducing an inversion-free coordinate blending mechanism at inference time, BlendedPC achieves precise local editing while preserving the identity of the original shape, outperforming existing methods comprehensively on the ShapeTalk dataset.
- BokehDiff: Neural Lens Blur with One-Step Diffusion
  - BokehDiff proposes a one-step inference bokeh rendering method built upon a pretrained diffusion model. It incorporates energy conservation, circle-of-confusion constraints, and self-occlusion effects via a Physics-Inspired Self-Attention (PISA) module, combined with synthetic foreground data for training, achieving significant improvements over conventional methods at depth-discontinuous regions.
- Bolt3D: Generating 3D Scenes in Seconds
  - Bolt3D is a feed-forward 3D scene generation method based on latent diffusion models: it represents 3D scenes as multiple sets of Splatter Images and employs a dedicated geometry VAE, generating a complete 3D scene on a single GPU in 7 seconds — reducing inference cost by 300× compared to optimization-based methods (CAT3D).
- Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration
  - This paper proposes DM-Calib, a diffusion-based monocular camera intrinsic estimation method that leverages Stable Diffusion priors. It introduces a Camera Image representation that losslessly encodes intrinsics as a 3-channel image (ray azimuth + elevation + grayscale), fine-tunes Stable Diffusion to generate Camera Images, and recovers the focal length and principal point via RANSAC. DM-Calib significantly outperforms existing calibration methods on 5 zero-shot datasets and advances downstream tasks including metric depth estimation, pose estimation, and sparse-view 3D reconstruction. A sketch of the per-pixel ray encoding follows.
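    A sketch of how pinhole intrinsics can be encoded per pixel as ray angles; following the note, two channels hold ray azimuth and elevation (the grayscale channel is omitted), and the exact normalization is an assumption:

    ```python
    import torch

    def camera_image(K, h, w):
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        x = (u - cx) / fx                              # normalized ray directions
        y = (v - cy) / fy
        z = torch.ones_like(x)
        azimuth = torch.atan2(x, z)                    # left-right ray angle
        elevation = torch.atan2(y, torch.hypot(x, z))  # up-down ray angle
        return torch.stack([azimuth, elevation])       # (2, H, W) camera image
    ```

    Recovering intrinsics then amounts to fitting fx, fy, cx, cy to a generated camera image, with RANSAC over pixel samples for robustness.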
- Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
  - This paper proposes Bootstrap3D, a framework that leverages 2D/video diffusion models to automatically generate 1 million high-quality synthetic multi-view images, employs a fine-tuned MV-LLaVA for quality filtering and dense caption rewriting, and introduces a Training Timestep Reschedule (TTR) strategy that balances image quality and view consistency during fine-tuning — substantially improving the image quality and text alignment of multi-view diffusion models, and thereby text-to-3D generation, without sacrificing view consistency.
- BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation
  - This paper proposes BoxDreamer, which adopts 3D bounding box corners as an intermediate representation. A reference-view-based corner synthesizer predicts 2D corner projections in query images, and 6DoF poses are recovered via PnP from the resulting 2D–3D correspondences (sketched below). The method significantly outperforms existing approaches under sparse-view and heavy-occlusion conditions.
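    A minimal sketch of the pose-recovery step with OpenCV's PnP solver; the canonical corners, camera matrix, and the random stand-in for the network's predicted 2D corners are all illustrative:

    ```python
    import cv2
    import numpy as np

    corners_3d = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
                          dtype=np.float64)      # 8 canonical box corners (8, 3)
    corners_2d = np.random.rand(8, 2) * 640      # stand-in for predicted projections
    K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])

    ok, rvec, tvec = cv2.solvePnP(corners_3d, corners_2d, K, None,
                                  flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)                   # rotation matrix from axis-angle
    # (R, tvec) is the 6DoF object pose in the query camera frame.
    ```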
- Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation
  - This paper proposes PASDF, a framework that unifies 3D anomaly detection and repair through a high-quality continuous geometric representation: a Pose Alignment Module (PAM) aligns point clouds to a canonical pose, a neural network learns a pose-aware signed distance function (SDF), and anomalies are scored by their SDF deviation. Repair is achieved by extracting the zero-level set via Marching Cubes as a repair template. PASDF achieves state-of-the-art performance on Real3D-AD (O-AUROC 80.2%) and Anomaly-ShapeNet (O-AUROC 90.0%). A scoring/extraction sketch follows.
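    A minimal sketch of SDF-deviation scoring and zero-level-set extraction; `sdf_net` is a stand-in for the trained pose-aware SDF network, and the grid resolution/bounds are illustrative:

    ```python
    import torch
    from skimage import measure

    def anomaly_scores(sdf_net, points):
        # Nominal surface points should satisfy SDF ~ 0; deviation flags anomalies.
        return sdf_net(points).abs().squeeze(-1)          # (N,) per-point scores

    def repair_template(sdf_net, res=128, bound=1.0):
        g = torch.linspace(-bound, bound, res)
        grid = torch.stack(torch.meshgrid(g, g, g, indexing="ij"), dim=-1)
        sdf = sdf_net(grid.reshape(-1, 3)).reshape(res, res, res)
        # Marching Cubes on the zero level set yields the anomaly-free mesh.
        verts, faces, _, _ = measure.marching_cubes(sdf.detach().numpy(), level=0.0)
        return verts, faces
    ```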
- Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework
  - This paper proposes 3DSR, a framework that integrates 2D diffusion-based super-resolution with 3D Gaussian Splatting (3DGS) representations. At each diffusion denoising step, multi-view 3D consistency is enforced via 3DGS rendering, enabling high-fidelity and spatially consistent 3D scene super-resolution.
- BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
  - BUFFER-X is a zero-shot point cloud registration pipeline that requires no manual parameter tuning: it determines voxel size and search radii via geometry-adaptive bootstrapping, replaces learned keypoint detectors with farthest point sampling (FPS, sketched below), and applies patch-level coordinate normalization. It generalizes out of the box across 11 cross-domain datasets, attaining the best average-rank success rate across indoor/outdoor, multi-sensor, and multi-scene settings.
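    The detector replacement is classic farthest point sampling; a minimal NumPy version with O(N·M) distance updates:

    ```python
    import numpy as np

    def farthest_point_sampling(points, m):
        n = points.shape[0]
        chosen = np.zeros(m, dtype=np.int64)
        dist = np.full(n, np.inf)
        chosen[0] = 0                                   # arbitrary seed point
        for i in range(1, m):
            d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
            dist = np.minimum(dist, d)                  # distance to nearest chosen point
            chosen[i] = dist.argmax()                   # pick the farthest remaining point
        return points[chosen]
    ```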
- CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection
  - This paper proposes CA-I2P, which introduces a Channel Adaptive Adjustment Module (CAA) to enhance and filter channel-level discrepancies between image and point cloud features, and a Global Optimal Selection (GOS) module that replaces top-k selection with optimal transport (sketched below) to reduce many-to-one matching errors, achieving state-of-the-art image-to-point cloud registration performance on RGB-D Scenes V2 and 7-Scenes.
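    A minimal entropic optimal-transport (Sinkhorn) sketch standing in for the GOS idea: instead of per-row top-k, iterate row/column rescaling toward a doubly stochastic soft assignment; `epsilon` and `iters` are illustrative, and this is not the paper's exact solver:

    ```python
    import torch

    def sinkhorn(sim, epsilon=0.05, iters=50):
        # sim: (N_img, N_pts) feature similarities
        K = torch.exp(sim / epsilon)
        u = torch.ones(K.shape[0], device=K.device)
        v = torch.ones(K.shape[1], device=K.device)
        for _ in range(iters):
            u = 1.0 / (K @ v)                           # drive row sums of P to 1
            v = 1.0 / (K.T @ u)                         # drive column sums of P to 1
        P = u[:, None] * K * v[None, :]                 # soft global assignment plan
        return P
    ```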
- Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians
  - This paper proposes Can3Tok, the first variational autoencoder capable of encoding scene-level 3DGS into a low-dimensional latent space. It achieves efficient tokenization via cross-attention with canonical queries, and addresses scale inconsistency through 3DGS normalization and semantic-aware filtering, successfully generalizing to novel scenes on DL3DV-10K.
- CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance
  - This paper proposes CasP, a cascaded matching pipeline that decomposes the matching stage into a one-to-many prior matching at 1/16 scale and a one-to-one fine matching at 1/8 scale, achieving up to 2.2× speedup while maintaining accuracy and significantly improving cross-domain generalization.
- CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
  - This paper proposes CATSplat, a generalizable Transformer framework for feed-forward single-view 3DGS reconstruction. It enhances image features via dual cross-attention using VLM text embeddings (contextual prior) and 3D point cloud features (spatial prior), achieving comprehensive improvements over Flash3D in PSNR/SSIM/LPIPS on RE10K and other datasets, with strong cross-dataset generalization.
- CF³: Compact and Fast 3D Feature Fields
  - This paper proposes the CF³ pipeline, which constructs a compact and high-speed 3D feature field using only 5% of the original Gaussian count via top-down feature lifting, per-Gaussian autoencoder compression, and adaptive sparsification, achieving 121–245× storage compression with real-time rendering.
- CHARM3R: Towards Unseen Camera Height Robust Monocular 3D Detector
  - CHARM3R proves mathematically that regression-based depth and ground-plane-projected depth exhibit opposite extrapolation trends under camera height variation, and therefore simply averages the two depth estimates inside the model to cancel these trends, achieving robust generalization of Mono3D detectors to unseen camera heights with AP3D improvements exceeding 45%.
- CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization
  - This paper proposes CL-Splats, a continual learning framework built on 3D Gaussian Splatting that incrementally updates scene reconstructions from sparse novel views via DINOv2-based change detection, 2D-to-3D mask lifting, and sphere-constrained local optimization. CL-Splats substantially outperforms CL-NeRF and related methods on both synthetic and real scenes (PSNR: 40.1 vs. 30.1 dB) while supporting applications such as historical state recovery and concurrent updates.
- CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting
  - This paper proposes CLIP-GS, the first multimodal representation learning framework based on 3D Gaussian Splatting (3DGS). It serializes 3DGS into tokens via a GS Tokenizer and aligns multimodal representations using an Image Voting Loss, achieving comprehensive improvements over point-cloud-based methods on cross-modal retrieval, zero-shot, and few-shot 3D classification tasks.
- CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation
  - This paper proposes CMT, the first multimodal CAD generation framework based on B-Rep representation. By employing a cascade MAR (edges first, then faces) and a topology predictor, CMT achieves accurate topology and geometry generation. The authors also construct mmABC, a multimodal CAD dataset of over 1.3 million models.
- CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images
  - A Neural ODE is used to model continuous camera motion trajectories during exposure, combining rigid-body transformations with a learnable Continuous Motion Refinement (CMR) transform to reconstruct sharp 3D Gaussian scenes from motion-blurred images, achieving substantial improvements over the state of the art across all benchmarks.
- Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs
  - This paper proposes CodecGS, which represents all Gaussian attributes via compact Tri-plane feature planes, combined with frequency-domain DCT entropy modeling and a channel-level bit allocation strategy, enabling efficient compression of feature planes using standard video codecs (HEVC). The method achieves storage sizes within ~10 MB while maintaining high rendering quality, yielding up to 146× compression over vanilla 3DGS.
- CstNet: Constraint-Aware Feature Learning for Parametric Point Cloud
  - This paper proposes CstNet, the first constraint-aware feature learning method for parametric point clouds. CAD constraints are encoded as point-level MAD-Adj-PT triplet representations, and a two-stage network (constraint acquisition + constraint feature learning) achieves state-of-the-art results on the newly constructed Param20K dataset, with classification accuracy improved by +3.49% and rotation robustness improved by +26.17%.
- Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting
  - This paper proposes the first amodal completion framework tailored for human-object interaction (HOI) scenes. It leverages human body topology and contact information to identify occluded regions via convex hull operations, and employs a multi-regional inpainting strategy on a pretrained diffusion model to achieve high-quality occluded object completion without any additional training.
- Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction
  - This paper proposes CurveGaussian, a single-stage method that establishes a bidirectional coupling mechanism between parametric curves and edge-oriented Gaussian primitives, enabling direct end-to-end optimization of 3D parametric curves from multi-view edge maps. By eliminating the error accumulation inherent in two-stage pipelines, the method achieves comprehensive improvements over prior approaches in accuracy, efficiency, and compactness.
- CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation
  - CutS3D is the first method to introduce 3D information (monocular depth estimation) into unsupervised instance segmentation. By cutting semantic regions in 3D point clouds, it separates overlapping instances in 2D, and a spatial confidence mechanism improves pseudo-label quality, surpassing CutLER and other SOTA methods on multiple benchmarks.
- DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning
  - This paper proposes DAP-MAE, a domain-adaptive point cloud masked autoencoder that jointly learns from multi-domain point cloud data via two key modules, a Heterogeneous Domain Adapter (HDA) and a Domain Feature Generator (DFG), enabling a single cross-domain pre-training run to achieve state-of-the-art performance across diverse downstream tasks, including object classification, facial expression recognition, part segmentation, and 3D object detection.
- DAViD: Data-efficient and Accurate Vision Models from Synthetic Data
  - This work demonstrates that high-fidelity procedural synthetic data suffices to train human-centric dense prediction models that match the accuracy of foundation models such as Sapiens-2B, requiring only 300K synthetic images, 0.3B parameters, and less than 1/16 the training cost of comparable approaches, while achieving state-of-the-art or near-SOTA performance on depth estimation, surface normal estimation, and soft foreground segmentation.
- DeepMesh: Auto-Regressive Artist-Mesh Creation with Reinforcement Learning
  - This paper proposes DeepMesh, a framework that achieves human preference alignment in 3D mesh generation through an improved efficient mesh tokenization algorithm (72% compression rate) and the first application of DPO-based reinforcement learning to 3D mesh generation, capable of producing high-quality artist-like triangle meshes with up to 30K faces.
- DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction
  - This paper proposes DeGauss, a self-supervised framework based on decoupled dynamic-static Gaussian Splatting. By combining foreground dynamic Gaussians and background static Gaussians via a probabilistic composition mask, it achieves distractor-free 3D reconstruction across a broad range of scenarios, from casually captured image collections to highly dynamic egocentric videos.
- Demeter: A Parametric Model of Crop Plant Morphology from the Real World
  - Demeter is a data-driven parametric plant morphology model that decomposes plant shape into four factors — topology, articulation, shape, and deformation — supporting shape generation, 3D reconstruction, and biophysical simulation.
- Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
  - This paper proposes a cross-modal distillation paradigm that leverages a vision foundation model (VFM) from the image domain (Depth Anything v2) to generate pseudo-labels for training event-based depth estimation networks. It further introduces a VFM-based recurrent architecture, DepthAnyEvent-R, achieving state-of-the-art performance in event-based monocular depth estimation without requiring costly depth annotations.
- Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval
  - This paper proposes the DAC framework, which employs a "Describe–Adapt–Combine" three-step strategy to synergize CLIP with a multimodal large language model (MLLM). Using only multi-view images, DAC substantially outperforms the previous SOTA method that relies on all modalities (point clouds + voxels + images) on open-set 3D object retrieval, achieving an average mAP improvement of over +10%.
- Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling
  - This paper presents Diorama, the first zero-shot open-world system for complete 3D indoor scene modeling from a single RGB image. By modularly composing foundation models (GPT-4o, SAM, DINOv2, Metric3D, etc.) into an open-world perception and CAD-based scene assembly pipeline, Diorama produces full scenes including architectural structures and object placement, requiring neither end-to-end training nor manual annotation.
- Discretized Gaussian Representation for Tomographic Reconstruction
  - This paper proposes Discretized Gaussian Representation (DGR) for CT reconstruction, which directly reconstructs 3D voxels end-to-end via discretized Gaussian functions and introduces a highly parallelized fast volume reconstruction technique. DGR surpasses both deep learning-based and instance-based reconstruction methods in sparse-view and limited-angle CT settings without any training data.
- Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion
  - This paper proposes DISC, a category-aware dual-stream architecture for 3D semantic scene completion that disentangles instance and scene categories into separate query streams with dedicated decoding modules. Using only single-frame input, DISC surpasses multi-frame state-of-the-art methods on SemanticKITTI, achieving a 17.9% improvement in instance category mIoU.
- Diving into the Fusion of Monocular Priors for Generalized Stereo Matching
  - This paper systematically analyzes three key problems in monocular prior fusion—affine-invariant vs. absolute depth alignment, local optima in iterative updates, and noisy disparity interference—and proposes a Binary Local Ranking Map and a Global Registration Module. On SceneFlow→Middlebury/Booster generalization benchmarks, bad2 error is reduced by half or more with negligible additional computational cost.
- DMesh++: An Efficient Differentiable Mesh for Complex Shapes
  - This paper proposes DMesh++, which replaces weighted Delaunay triangulation (WDT) with a Minimum-Ball algorithm as the tessellation function for differentiable meshes, reducing computational complexity from \(O(N)\) to \(O(\log N)\). The method achieves up to 32× speedup on complex shapes while preserving desirable properties such as no self-intersections and few degenerate triangles.
- Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels
  - This paper proposes DIY-SC, a 3D-aware pseudo-label generation strategy (chained propagation + relaxed cycle consistency + spherical prototype filtering, with the cycle check sketched below) to train a lightweight adapter that refines foundation-model features for semantic correspondence, achieving a 4.5% gain over the previous SOTA on SPair-71k (per-keypoint PCK@0.1) without any manual keypoint annotations.
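    A minimal sketch of relaxed cycle-consistency filtering for pseudo-labels: keep a match i -> j only if matching back from j lands within a pixel radius of i (rather than exactly on it). The descriptors and threshold `tau` are stand-ins, not DIY-SC's full pipeline:

    ```python
    import torch

    def cycle_consistent(feat_a, feat_b, kps_a, tau=5.0):
        # feat_a: (N, C), feat_b: (M, C) keypoint descriptors; kps_a: (N, 2) coords in A
        sim = feat_a @ feat_b.T
        fwd = sim.argmax(dim=1)                    # A -> B nearest neighbors
        bwd = sim.argmax(dim=0)                    # B -> A nearest neighbors
        back = bwd[fwd]                            # index in A after the round trip
        dist = (kps_a - kps_a[back]).norm(dim=1)   # relaxed: distance, not identity
        return dist <= tau                         # mask of matches kept as pseudo-labels
    ```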
- DriveX: Driving View Synthesis on Free-form Trajectories with Generative Prior
  - DriveX is a driving view synthesis framework that progressively distills generative priors from a video diffusion model into a 3DGS representation. It designs an inpainting-based video restoration task to generate pseudo-labels for novel trajectories and iteratively refines the 3D reconstruction, enabling high-quality real-time rendering on free-form trajectories.
- DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness
  - This paper proposes the Direct Simulation Optimization (DSO) framework, which uses stability feedback from a (non-differentiable) physics simulator as a reward signal to fine-tune 3D generators via DPO or the newly proposed DRO objective, enabling feed-forward generation of physically self-supporting 3D objects without test-time optimization.
- Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction
  - This paper proposes Dynamic Point Maps (DPM), extending DUSt3R's viewpoint-invariant point maps into a spatiotemporal-invariant representation that jointly controls viewpoint and time. By predicting only four sets of point maps in a feed-forward manner, DPM simultaneously addresses multiple 4D tasks including depth estimation, scene flow, motion segmentation, and 3D object tracking.
- Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
  - This paper proposes Easi3R, a training-free plug-and-play method that analyzes and manipulates the implicit motion information encoded in DUSt3R's cross-attention layers to achieve dynamic object segmentation, camera pose estimation, and 4D dense point cloud reconstruction.
- Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation
  - This paper proposes Easy3D, a simple yet effective method for 3D interactive instance segmentation. By combining a sparse voxel encoder, a lightweight Transformer decoder, and an implicit click fusion strategy, Easy3D consistently outperforms state-of-the-art methods on both in-domain and out-of-domain datasets. It is also the first work to successfully apply learned negative embeddings to implicit click fusion.
- Efficient Spiking Point Mamba for Point Cloud Analysis
  - SPM (Spiking Point Mamba) is the first Mamba-based 3D spiking neural network framework. Through Hierarchical Dynamic Encoding (HDE) and a Spiking Mamba Block (SMB), it achieves over 3.5× energy reduction while improving accuracy by 6–7% over the previous state-of-the-art SNN method on ScanObjectNN.
- Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance
  - The EAIL framework leverages egocentric action cues embedded in head-mounted IMU signals and employs hierarchical multimodal alignment (vision-language guidance) to learn associations between actions and environmental structures, enabling accurate inertial localization in 3D point clouds while simultaneously supporting action recognition.
- EgoM2P: Egocentric Multimodal Multitask Pretraining
  - EgoM2P is the first large-scale multimodal multitask model for egocentric 4D understanding. It unifies four modalities — RGB video, depth, gaze, and camera trajectory — within a temporally-aware masked modeling framework, matching or surpassing task-specific models on multiple downstream tasks while being an order of magnitude faster.
- EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device
  - This paper proposes EmbodiedSplat, a complete pipeline that captures real environments via iPhone video → reconstructs 3D Gaussian Splat meshes → fine-tunes navigation policies in Habitat-Sim → deploys to the real world. The approach achieves 20%–40% absolute success rate improvement over zero-shot baselines on real-scene ImageNav tasks, with a sim-vs-real Spearman rank correlation coefficient of 0.87–0.97.
- Estimating 2D Camera Motion with Hybrid Motion Basis
  - This paper proposes CamFlow, which represents complex 2D camera motion via a hybrid motion basis (12 physical bases + random noise bases), reveals the nonlinear nature of superimposed homography flow fields, and incorporates a probabilistic loss based on the Laplace distribution (sketched below). CamFlow substantially outperforms existing homography and meshflow methods under both standard and cross-dataset zero-shot evaluation settings.
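    A minimal Laplace negative log-likelihood in the spirit of the note's probabilistic loss, assuming the network predicts a per-pixel log-scale alongside the flow (the constant log 2 is dropped):

    ```python
    import torch

    def laplace_nll(pred_flow, pred_log_b, gt_flow):
        # NLL of Laplace(mu=pred_flow, scale=b): |x - mu| / b + log b (+ const)
        b = pred_log_b.exp()
        return ((pred_flow - gt_flow).abs() / b + pred_log_b).mean()
    ```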
- ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
  - This paper proposes ETCH, a framework that models SE(3)-equivariant tightness vectors from clothing surfaces to body surfaces, reducing clothed human body fitting to a tightness-aware sparse marker fitting task. On the CAPE and 4D-Dress datasets, ETCH achieves 16.7%–69.5% reduction in joint error on loose garments and an average 49.9% improvement in shape accuracy compared to state-of-the-art methods (both tightness-agnostic and tightness-aware).
- EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images
  - This paper proposes EvaGaussians, a framework that leverages the high temporal resolution event streams from event cameras to assist 3D Gaussian Splatting in learning from motion-blurred images. Through event-assisted initialization, joint blur/event reconstruction losses, and event-assisted geometric regularization, the method achieves high-fidelity novel view synthesis while maintaining real-time rendering efficiency.
- Event-based Tiny Object Detection: A Benchmark Dataset and Baseline
  - This paper introduces EV-UAV, the first large-scale event camera benchmark for anti-UAV tiny object detection (147 sequences / 23M+ event-level annotations / average target size only 6.8×5.4 pixels), and proposes EV-SpSegNet — a detection framework based on sparse 3D point cloud segmentation. The method exploits the observation that tiny moving targets form continuous elongated curves in spatiotemporal event point clouds, and incorporates a Spatiotemporal Correlation loss (STC loss) to guide the network in retaining target events. It outperforms 13 state-of-the-art methods across IoU/ACC/detection probability metrics while achieving 10–100× faster inference.
- Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction
  - This paper presents the first integration of event cameras with deformable 3D Gaussian Splatting (3D-GS) for dynamic scene reconstruction. It introduces a GS-Threshold Joint Modeling (GTJM) strategy and a Dynamic-Static Decomposition (DSD) strategy, achieving state-of-the-art rendering quality and speed on a newly constructed event-4D benchmark (average PSNR improvement of 2.73 dB on synthetic data, rendering speed 1.71× faster than 4D-GS).
- Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene
  - This paper proposes an event-driven LLM framework that decomposes multi-character behavior planning in 3D scenes into two modules — a Narrator for event-by-event generation and an Event Parser for fine-grained spatial reasoning — achieving, for the first time, long-horizon natural interaction motion generation for 4–5+ characters in large-scale multi-room 3D scenes.
- ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
  - This paper proposes ExCap3D, a method for generating multi-granularity captions for objects in 3D indoor scenes at two description levels: object-level and part-level. Through part-to-object information sharing and semantic/textual consistency losses, the approach ensures caption accuracy and coherence. On a newly constructed dataset of 190K captions, CIDEr scores improve by 17% and 124% over the prior SOTA at the object and part levels, respectively.
- FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads
  - This paper presents FaceLift, a single-image 360° high-quality 3D human head reconstruction method trained exclusively on synthetic data yet generalizing well to real-world images. It generates identity-consistent multi-view images via a multi-view latent diffusion model, then feeds them into a Transformer-based reconstructor to produce pixel-aligned 3D Gaussian representations.
- Faster and Better 3D Splatting via Group Training
  - This paper proposes a Group Training strategy that accelerates 3DGS training by periodically partitioning Gaussian primitives into an "under-training group" and a "cached group," combined with Opacity-based Priority Sampling (OPS, sketched below). Across four standard benchmarks, the method achieves approximately 30% training speedup while simultaneously improving rendering quality and reducing model size, and can be applied as a plug-and-play module to 3DGS and Mip-Splatting frameworks.
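    A minimal sketch of opacity-based priority sampling: each period, draw the "under-training" group with probability proportional to opacity and cache the rest; `group_frac` is illustrative:

    ```python
    import torch

    def sample_training_group(opacities, group_frac=0.2):
        n = opacities.numel()
        k = max(1, int(group_frac * n))
        probs = opacities.clamp_min(1e-8)
        idx = torch.multinomial(probs / probs.sum(), k, replacement=False)
        mask = torch.zeros(n, dtype=torch.bool)
        mask[idx] = True                  # True = actively trained, False = cached
        return mask
    ```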
- FiffDepth: Feed-forward Transformation of Diffusion-Based Generators for Detailed Depth Estimation
  - This paper proposes FiffDepth, which transforms a pretrained diffusion model into a deterministic feed-forward architecture for monocular depth estimation. By preserving the diffusion trajectory to maintain detail generation capability and introducing a learnable filter to distill DINOv2's robust generalization ability into the diffusion backbone, FiffDepth simultaneously surpasses existing methods in efficiency, accuracy, and detail richness.
- Find Any Part in 3D
  - This paper proposes Find3D, an automated 3D data annotation engine driven by 2D foundation models (SAM + Gemini) that generates 2.1 million part annotations. The resulting model is the first to simultaneously achieve open-world, cross-category, part-level, and feed-forward inference capabilities in 3D segmentation, yielding a 260% zero-shot mIoU improvement and inference speeds 6–300× faster than existing methods.
- FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution
  - FlashDepth augments Depth Anything v2 with a Mamba recurrent module for cross-frame scale consistency, and introduces a Small-Large hybrid architecture that achieves real-time streaming video depth estimation at 2K resolution and 24 FPS with significantly sharper boundaries than existing methods.
- FlexGen: Flexible Multi-View Generation from Text and Image Inputs
  - This paper proposes FlexGen, a flexible multi-view image generation framework that leverages GPT-4V to produce 3D-aware text annotations from tiled orthographic views and introduces an adaptive dual-control module to support single-image, text-only, or joint image-text conditioning for generating consistent multi-view images, enabling capabilities such as unseen-region completion, material editing, and texture control.
- From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos
  - This paper proposes a hybrid pipeline for realistically inserting 3D bracelets into videos. It leverages 3D Gaussian Splatting (3DGS) to ensure temporal consistency, employs a 2D diffusion model to enhance photorealistic illumination, and introduces a Shading-Driven pipeline that separately optimizes albedo, shading, and specular residuals. The method achieves an 81.7% realism preference rate in user studies, significantly outperforming existing approaches.
- From Image to Video: An Empirical Study of Diffusion Representations
  - This paper systematically compares diffusion models trained under image vs. video generation objectives using the same architecture (WALT) on a suite of downstream visual understanding tasks. Video diffusion models consistently outperform their image counterparts across all tasks, with particularly large gains on tasks requiring motion and 3D spatial understanding (point tracking +68%, camera pose +60%).
- From One to More: Contextual Part Latents for 3D Generation
  - This paper proposes CoPart, a framework that represents 3D objects via contextual part latents and fine-tunes pretrained diffusion models with a mutual guidance strategy, enabling high-quality part-level 3D generation along with support for part editing, articulated object generation, and small-scale scene generation.
- FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images
  - This paper proposes FROSS, a method that lifts 2D scene graphs directly into 3D space and represents objects as Gaussian distributions, achieving faster-than-real-time (144 FPS) online 3D semantic scene graph generation without requiring precise point cloud reconstruction.
- G2SF: Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection
  - G2SF reinterprets memory-bank-based anomaly scores as isotropic Euclidean distances in a local feature space, and progressively evolves them into an anisotropic unified fusion metric by learning direction-aware scaling factors via a Local Scale Prediction Network (LSPN), achieving state-of-the-art performance on multimodal industrial anomaly detection.
- GAS: Generative Avatar Synthesis from a Single Image
  - GAS is a framework that unifies novel view synthesis and novel pose synthesis into a video generation task by combining dense appearance cues from a generalizable NeRF with a video diffusion model. A modality switcher decouples the two tasks, enabling view-consistent and temporally coherent human avatar generation from a single image.
- Gaussian Splatting with Discretized SDF for Relightable Assets
  - This paper proposes encoding a continuous SDF as an additional per-Gaussian attribute, with an SDF-to-opacity transformation (sketched below) that unifies Gaussian splatting and SDF representations. Combined with a projection-based consistency loss and spherical initialization, the method achieves relighting quality surpassing existing Gaussian-based inverse rendering approaches using only 4 GB of GPU memory.
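    A minimal sketch of an SDF-to-opacity transform: opacity peaks at the zero level set so splats concentrate on the surface. The Gaussian-bell form and `beta` are assumptions, not the paper's exact transform:

    ```python
    import torch

    def sdf_to_opacity(sdf, beta=0.02):
        # ~1 on the surface (sdf = 0), decaying to 0 away from it
        return torch.exp(-0.5 * (sdf / beta) ** 2)
    ```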
- Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis
  - This paper proposes a video-to-4D generation framework that encodes animation data directly into a compact Gaussian variation field latent space via a Direct 4DMesh-to-GS Variation Field VAE, and trains a temporally-aware diffusion model to generate dynamic 3D content. The framework achieves high-fidelity 4D synthesis in 4.5 seconds and demonstrates strong generalization to real-world video inputs.
- GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
  - GaussianProperty presents a training-free framework that assigns physical properties (density, elastic modulus, friction coefficient, etc.) to 3D Gaussians by leveraging SAM for segmentation and GPT-4V for recognition, via a global-local reasoning module and a multi-view voting strategy. The framework supports two downstream tasks: physics-based simulation and robotic grasping.
- GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments
  - This paper presents GaussianUpdate, the first method to integrate 3D Gaussian representations with continual learning. It achieves real-time rendering and change visualization in temporally varying scenes through a three-stage update strategy (appearance update → geometric layout update → joint refinement) and visibility-aware generative replay.
- GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting
-
This paper proposes GazeGaussian, the first high-fidelity gaze redirection method based on 3D Gaussian Splatting (3DGS). By employing a dual-stream 3DGS model to separately represent the facial and eye regions, the method introduces an explicit Gaussian eyeball rotation representation and an expression-guided neural renderer (EGNR), achieving state-of-the-art performance in gaze accuracy, synthesis quality, and rendering speed.
- Generating Physically Stable and Buildable Brick Structures from Text
-
BrickGPT is the first method to generate physically stable and assemblable interlocking brick structures directly from text prompts. The core idea is to formulate brick assembly as an autoregressive text generation task, augmented at inference time with physics-aware validity checking and a rollback mechanism to ensure structural stability and buildability.
- Geometry Distributions
-
This paper proposes Geometry Distributions (GeomDist), which models 3D geometry as a probability distribution over surface points and learns it via a diffusion model. Without assuming genus, connectivity, or boundary conditions, the method samples arbitrarily many surface points from Gaussian noise to represent geometry of arbitrary topology.
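The sampling idea can be sketched as transporting Gaussian noise to surface points with a learned vector field. GeomDist trains a diffusion model; the sketch below uses a generic Euler solver and a hypothetical `velocity_net` purely for brevity.

```python
import torch

@torch.no_grad()
def sample_surface_points(velocity_net, n_points: int = 4096, steps: int = 64):
    """Transport 3D Gaussian noise to surface samples (schematic sampler).

    `velocity_net(x, t) -> dx/dt` stands in for the learned model; any
    number of points can be drawn, with no assumption on topology.
    """
    x = torch.randn(n_points, 3)                # start from N(0, I)
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v = velocity_net(x, t0.expand(n_points))
        x = x + (t1 - t0) * v                   # Euler step toward t = 0
    return x                                    # points on the surface
```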
- GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields
-
This paper proposes GeoProg3D, the first visual programming framework supporting natural language interaction with city-scale high-fidelity 3D scenes. By combining a Geo-aware City-scale 3D Language Field (GCLF) with Geo-Visual APIs (GV-APIs) and an LLM reasoning engine, the framework enables compositional geospatial reasoning. GeoProg3D comprehensively outperforms existing 3D language field and VLM methods on the newly introduced GeoEval3D benchmark, which contains 952 annotated queries.
- GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering
-
This paper proposes GeoSplatting, which differentiably generates surface-aligned Gaussians from an optimizable explicit mesh to provide accurate geometric guidance for 3DGS, achieving state-of-the-art inverse rendering performance (material–lighting decomposition) with training times of only 10–15 minutes.
- Global-Aware Monocular Semantic Scene Completion with State Space Models
-
This paper proposes GA-MonoSSC, a hybrid architecture combining Transformer (2D global context) and Mamba (3D long-range dependencies) for indoor monocular semantic scene completion. A novel Frustum Mamba Layer is introduced to address feature discontinuities in voxel serialization, achieving state-of-the-art performance on Occ-ScanNet and NYUv2.
- Global Motion Corresponder for 3D Point-Based Scene Interpolation under Large Motion
-
This paper proposes the Global Motion Corresponder (GMC), which learns unary potential fields that map 3D Gaussians from two time steps into a shared canonical space, enabling robust scene interpolation and extrapolation under large motion.
- GSOT3D: Towards Generic 3D Single Object Tracking in the Wild
-
This paper presents GSOT3D, the largest generic 3D single object tracking benchmark to date, comprising 620 multimodal sequences (point cloud + RGB + depth) spanning 54 object categories. It supports three 3D tracking tasks (PC / RGB-PC / RGB-D) and introduces PROT3D, a progressive spatiotemporal tracker that achieves state-of-the-art performance via 9DoF bounding box estimation.
- GUAVA: Generalizable Upper Body 3D Gaussian Avatar
-
This paper presents GUAVA, the first framework for feed-forward reconstruction of animatable upper-body 3D Gaussian avatars from a single image. By combining template Gaussians and UV Gaussians in a canonical space representation, GUAVA supports rich facial expression and gesture driving, completing reconstruction in approximately 0.1 s with real-time rendering capability.
- Guiding Diffusion-Based Articulated Object Generation by Partial Point Cloud Alignment and Physical Plausibility Constraints
-
This paper proposes PhysNAP, which guides the reverse diffusion process of the pretrained diffusion model NAP via point cloud alignment loss and SDF-based physical plausibility constraints (part penetration and joint mobility), enabling category-aware articulated object generation with significant improvements in alignment accuracy and physical plausibility over the unguided baseline.
- HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars
-
This paper proposes HairCUP, a compositional universal prior model that decomposes head modeling into two independent latent spaces for face and hair. By leveraging a synthetic hairless data creation pipeline for effective disentanglement, the model supports flexible face/hairstyle swapping and few-shot monocular adaptation.
- Hierarchical Material Recognition from Local Appearance
-
This paper proposes a hierarchical material taxonomy designed for visual applications alongside a new in-the-wild dataset, Matador (~7,200 material images with depth maps, 57 categories). A graph attention network (GAT) leverages the taxonomic hierarchy for material recognition, achieving state-of-the-art results on multiple benchmarks while supporting few-shot learning of novel materials and material probing at arbitrary scene points.
- HORT: Monocular Hand-held Objects Reconstruction with Transformers
-
This paper proposes HORT, a coarse-to-fine Transformer-based framework that efficiently reconstructs dense 3D point clouds of hand-held objects from monocular images. By integrating image features with 3D hand geometry, HORT jointly predicts the object point cloud and its pose relative to the hand, achieving state-of-the-art performance in both reconstruction accuracy and inference speed.
- HouseTour: A Virtual Real Estate A(I)gent
-
HouseTour is proposed to jointly generate human-like 3D camera trajectories and real estate textual descriptions given a set of indoor images with known poses. The system employs a Residual Diffuser for diffusion-based trajectory planning and integrates spatial features into Qwen2-VL-3D to produce 3D-grounded text summaries.
- How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach
-
This paper proposes Learned 3D Evaluation (L3DE), an objective and quantifiable evaluation framework based on monocular 3D cues (motion, depth, appearance) and contrastive learning, designed to measure the gap between AI-generated videos and real videos in terms of 3D visual consistency, without requiring manual annotation of artifacts or quality labels.
- HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis
-
This paper introduces HumanOLAT — the first publicly available large-scale full-body multi-view OLAT (One-Light-at-a-Time) dataset, comprising 21 subjects × 3 poses × 40 viewpoints × 344 lighting conditions ≈ 850K frames, providing a high-quality benchmark for human relighting and novel-view synthesis.
- Identity Preserving 3D Head Stylization with Multiview Score Distillation
-
This paper proposes a 3D head stylization framework based on Likelihood Distillation (LD), achieving high-quality stylization with identity preservation under 360-degree consistent rendering through multiview grid scoring, mirror gradients, and rank-weighted score tensors.
- IM360: Large-scale Indoor Mapping with 360 Cameras
-
This paper presents IM360, a 3D mapping pipeline for large-scale indoor environments captured under sparse scanning conditions. By deeply integrating a spherical camera model into every stage of SfM—combined with dense feature matching and differentiable rendering-based texture optimization—IM360 achieves substantially superior camera localization accuracy and rendering quality on Matterport3D and Stanford2D3D compared to existing methods (a PSNR gain of 3.5 dB).
- Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints
-
This paper proposes a purely image-guided unsupervised Shape-from-Template (SfT) method that reconstructs the 3D shape of deforming objects using only visual cues—color, gradients, and silhouettes—combined with mesh inextensibility constraints. The method is approximately 400× faster than the best-performing unsupervised baseline while achieving substantially higher accuracy.
- Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image
-
This paper reframes motion blur from an "unwanted artifact" into a "valuable motion cue." By predicting a dense optical flow field and a monocular depth map from a single blurred image, and subsequently recovering the camera's 6DoF instantaneous velocity via a differentiable least-squares solver, the method achieves motion estimation accuracy comparable to or surpassing that of an IMU, with real-time performance at 30 FPS.
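The least-squares step can be illustrated with the classical instantaneous motion-field model, which relates per-pixel flow to depth and the six velocity components. The numpy sketch below uses one common sign convention and is a simplified stand-in for the paper's differentiable solver.

```python
import numpy as np

def velocity_from_flow_depth(xy, flow, depth):
    """6DoF instantaneous velocity from dense flow and depth (least squares).

    xy:    (N, 2) normalized image coordinates
    flow:  (N, 2) per-pixel motion field (here: recovered from blur)
    depth: (N,)   metric depth per pixel
    Uses the classical instantaneous motion-field equations (one common
    sign convention); the paper solves this step differentiably.
    """
    x, y = xy[:, 0], xy[:, 1]
    iZ = 1.0 / depth
    zeros = np.zeros_like(x)
    # horizontal flow: u = (x*vz - vx)/Z + xy*wx - (1 + x^2)*wy + y*wz
    Au = np.stack([-iZ, zeros, x * iZ, x * y, -(1 + x**2), y], axis=1)
    # vertical flow:   v = (y*vz - vy)/Z + (1 + y^2)*wx - xy*wy - x*wz
    Av = np.stack([zeros, -iZ, y * iZ, 1 + y**2, -x * y, -x], axis=1)
    A = np.concatenate([Au, Av], axis=0)
    b = np.concatenate([flow[:, 0], flow[:, 1]])
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:3], sol[3:]  # translational velocity v, angular velocity w
```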
- InstaScene: Towards Complete 3D Instance Decomposition and Reconstruction from Cluttered Scenes
-
InstaScene proposes a unified framework for instance decomposition and complete reconstruction from cluttered scenes. It constructs a spatial contrastive learning scheme via tracked Gaussian rasterization for accurate instance segmentation, and designs an in-situ generation pipeline that leverages available observations and geometric cues to guide a 3D generative model toward complete object reconstruction.
- JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
-
JointDiT builds an RGB-Depth joint distribution model upon the Flux diffusion Transformer. Through adaptive scheduling weights and an unbalanced timestep sampling strategy, a single model can flexibly perform three tasks—joint generation, depth estimation, and depth-conditioned image generation—by controlling the timestep of each modality.
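The timestep-based task switching can be made concrete with a small helper: a timestep of zero marks a modality as observed, a positive timestep marks it as being generated. The function below is a hypothetical illustration of the control scheme, not the paper's API.

```python
def modality_timesteps(task: str, t: float) -> tuple:
    """Per-modality diffusion timesteps (t_rgb, t_depth) for one model.

    t = 0 means the modality is given as clean conditioning; t > 0 means
    it is currently being denoised/generated. (Illustrative helper.)
    """
    if task == "joint_generation":   # denoise both modalities together
        return t, t
    if task == "depth_estimation":   # RGB observed, depth generated
        return 0.0, t
    if task == "depth_to_image":     # depth observed, RGB generated
        return t, 0.0
    raise ValueError(f"unknown task: {task}")
```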
- χ: Symmetry Understanding of 3D Shapes via Chirality Disentanglement
-
This paper proposes an unsupervised chirality feature extraction pipeline that distills left-right chirality information from 2D foundation model features to augment 3D shape vertex descriptors, effectively resolving left-right ambiguity in shape analysis.
- LACONIC: A 3D Layout Adapter for Controllable Image Creation
-
This paper proposes LACONIC, a lightweight adapter based on parameterized 3D semantic bounding boxes that injects explicit 3D geometric information into a pretrained text-to-image diffusion model via a decoupled cross-attention mechanism. It is the first method to simultaneously support camera control, 3D object-level semantic guidance, and full scene context modeling of off-screen objects, achieving a 75.8% reduction in FID compared to SceneCraft.
- LayerLock: Non-collapsing Representation Learning with Progressive Freezing
-
This paper proposes LayerLock, a self-supervised video representation learning method that progressively freezes network layers while dynamically shifting prediction targets from pixels to increasingly deep intermediate layer features. It combines the training stability of pixel prediction with the semantic efficiency of latent variable prediction, and is applied to video models with up to 4B parameters.
- Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models
-
This paper proposes learning 3D object-object spatial relationships (OOR) from synthetic images generated by pre-trained 2D diffusion models. A 3D lifting pipeline is introduced to construct a paired dataset, upon which a text-conditioned score-based diffusion model is trained to model the distribution of relative poses and scales between object pairs. The framework is further extended to multi-object scene layout generation and scene editing.
- Learning 3D Scene Analogies with Neural Contextual Scene Maps
-
This paper introduces the 3D scene analogy task and proposes neural contextual scene maps to establish dense 3D mappings between scene regions sharing similar semantic context, enabling downstream applications such as trajectory transfer and object placement transfer.
- Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
-
This paper proposes SMoEStereo, which integrates variable-rank MoE-LoRA and variable-kernel MoE-Adapter modules into a frozen Visual Foundation Model (VFM), combined with a lightweight decision network for selective activation of MoE modules, achieving scene-adaptive robust stereo matching with state-of-the-art performance on cross-domain and joint generalization benchmarks.
- Lightweight Gradient-Aware Upscaling of 3D Gaussian Splatting Images
-
A lightweight image upscaling technique tailored for 3DGS is proposed, leveraging analytic image gradients from Gaussian primitives for gradient-aware bicubic spline interpolation. Without any deep learning inference, the method achieves 3–4× rendering acceleration while surpassing standard bicubic interpolation and DL-based upscaling in reconstruction quality.
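A 1D sketch of the gradient-aware idea: cubic Hermite segments consume both the rendered value and its analytic derivative at each pixel, which is exactly the extra information 3DGS provides for free. The paper operates in 2D with bicubic splines; this 1D version only illustrates the principle.

```python
import numpy as np

def hermite_upsample_1d(values, grads, factor: int = 4):
    """Upscale a 1D signal with cubic Hermite segments (1D stand-in).

    values[i] is the rendered intensity at pixel i and grads[i] its
    analytic spatial derivative, which 3DGS provides in closed form.
    No learned network is involved; the paper's scheme is the 2D
    bicubic analogue of this.
    """
    t = np.linspace(0.0, 1.0, factor, endpoint=False)
    h00 = 2 * t**3 - 3 * t**2 + 1      # Hermite basis functions
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    out = []
    for i in range(len(values) - 1):
        out.append(h00 * values[i] + h10 * grads[i]
                   + h01 * values[i + 1] + h11 * grads[i + 1])
    out.append(np.array([values[-1]]))  # keep the final sample
    return np.concatenate(out)
```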
- LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression
-
LINR-PCGC proposes the first implicit neural representation (INR)-based method for lossless point cloud geometry compression. By designing a lightweight multi-scale SparseConv network with Scale Context Extraction (SCE) and Child Node Prediction (CNP) modules, combined with a GoP-level shared decoder and initialization strategy, the method achieves a 21.21% bitrate reduction over G-PCC TMC13v23 and a 21.95% reduction over SparsePCGC on the MVUB dataset, without relying on any specific training data distribution.
- LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities
-
This paper proposes LLaVA-3D, which constructs "3D Patches" by injecting 3D positional embeddings into 2D CLIP patch features, extending a 2D LMM (LLaVA-Video) into a unified 2D/3D understanding model with minimal architectural modifications. The approach achieves 3.5× faster training convergence than existing 3D LMMs, reaches state-of-the-art performance on multiple 3D benchmarks, and preserves 2D capabilities without degradation.
- LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling
-
This paper proposes LocalDyGS, a framework that decomposes complex global dynamic scenes into local spaces defined by seed points and generates temporal Gaussians via static-dynamic feature decoupling to model local motions independently, achieving high-quality reconstruction of large-scale complex dynamic scenes for the first time.
- LONG3R: Long Sequence Streaming 3D Reconstruction
-
This paper proposes LONG3R, a streaming multi-view 3D reconstruction model based on a recurrent memory mechanism. Through three key innovations — memory gating, a dual-source refined decoder, and 3D spatio-temporal memory — LONG3R significantly improves long-sequence reconstruction quality while maintaining real-time inference speed.
- LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
-
LongSplat targets casually captured long videos without known camera poses. It proposes an incremental joint optimization framework that simultaneously optimizes camera poses and 3DGS, introduces a robust pose estimation module based on MASt3R priors, and designs an adaptive octree anchor formation mechanism, collectively addressing pose drift, inaccurate geometry initialization, and memory constraints.
- MaskHand: Generative Masked Modeling for Robust Hand Mesh Reconstruction in the Wild
-
This paper proposes MaskHand, the first method to introduce generative masked modeling into 3D hand mesh reconstruction. It discretizes continuous hand poses into tokens via VQ-MANO, then employs a context-guided masked Transformer to learn the probability distribution of 2D-to-3D mappings. During inference, confidence-guided iterative sampling is used to generate high-precision hand meshes, achieving a 19.5% reduction in PA-MPJPE on the HO3Dv3 zero-shot evaluation.
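The confidence-guided iterative sampling follows the familiar masked-decoding pattern: predict all masked tokens, commit the most confident ones, and re-mask the rest. Below is a generic greedy sketch of that loop; the `model`, `mask_id`, and unmasking schedule are placeholders, not MaskHand's implementation.

```python
import torch

@torch.no_grad()
def confidence_guided_decode(model, seq_len: int, mask_id: int, iters: int = 8):
    """Iterative confidence-guided sampling of masked tokens (generic sketch).

    Each round the model predicts every masked position, the most
    confident predictions are committed, and the rest stay masked.
    `model` maps (1, L) token ids to (1, L, vocab) logits (placeholder).
    """
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(1, iters + 1):
        logits = model(tokens)
        conf, pred = logits.softmax(-1).max(-1)           # (1, L) each
        conf = conf.masked_fill(tokens != mask_id, float("inf"))
        k = seq_len * step // iters                       # unmasking schedule
        keep = conf.topk(k, dim=-1).indices               # committed slots
        proposal = torch.where(tokens == mask_id, pred, tokens)
        tokens = tokens.scatter(1, keep, proposal.gather(1, keep))
    return tokens
```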
- MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion
-
MaterialMVP is an end-to-end multi-view PBR texture generation model that decouples illumination via consistency-regularized training and employs a dual-channel material generation framework (MCAA + Learnable Material Embeddings) to align albedo and metallic-roughness maps, enabling single-pass generation of high-quality, illumination-invariant, multi-view-consistent PBR materials from a 3D mesh and an image prompt.
- MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes
-
This paper proposes MEGA, a memory-efficient framework for 4D Gaussian Splatting that eliminates redundant spherical harmonic coefficients via DC-AC color decomposition (8× compression) and reduces the total number of Gaussians through entropy-constrained Gaussian deformation. MEGA achieves approximately 190× and 125× storage compression on the Technicolor and Neural 3D Video datasets, respectively, while maintaining comparable rendering quality and real-time speed.
- MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization
-
This paper proposes MemoryTalker, a two-stage training framework (Memorizing + Animating) that employs a key-value memory network to store generic facial motions and generates personalized 3D facial animations driven solely by audio via audio-guided stylized memory retrieval, requiring no additional prior information at inference time.
- MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization
-
MeshAnything V2 proposes Adjacent Mesh Tokenization (AMT), which represents adjacent faces using a single vertex rather than the conventional three, reducing the average token sequence length by approximately half. This allows the maximum number of generated faces to scale from 800 to 1600 without additional computational cost, significantly improving the efficiency and quality of autoregressive mesh generation.
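One plausible simplified reading of AMT is sketched below: whenever consecutive faces share an edge, only the newly introduced vertex index is emitted. The separator token and the shared-edge test are illustrative simplifications.

```python
def adjacent_mesh_tokenize(faces):
    """Adjacent Mesh Tokenization, one plausible simplified reading.

    If the next face shares an edge (two vertices) with the previous
    face, emit only the newly introduced vertex index; otherwise emit a
    separator followed by the full face. Roughly halves sequence length
    on well-ordered meshes. `-1` is an illustrative separator token.
    """
    tokens, prev = [], None
    for f in faces:
        if prev is not None and len(set(f) & set(prev)) == 2:
            (new_v,) = set(f) - set(prev)
            tokens.append(new_v)
        else:
            tokens += [-1, *f]
        prev = f
    return tokens

# a 3-face strip: 9 raw vertex indices become 6 tokens
print(adjacent_mesh_tokenize([(0, 1, 2), (1, 2, 3), (2, 3, 4)]))
```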
- MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction
-
MeshMamba introduces a Mamba state space model-based approach for articulated 3D mesh generation and reconstruction. By designing vertex serialization strategies based on body-part UV maps and template mesh coordinates, the method achieves efficient generation and reconstruction of meshes with tens of thousands of vertices, running 6–9× faster than Transformer-based counterparts.
- MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing
-
MeshPad decomposes sketch-driven 3D mesh creation and editing into two sub-tasks—addition and deletion—based on a triangle sequence representation and Transformer autoregressive generation. It further proposes a vertex-aligned speculative decoder achieving a 2.2× speedup, enabling interactive mesh editing within seconds.
- MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP
-
This paper proposes MinCD-PnP, which reduces the computationally expensive Blind PnP to a problem of minimizing the Chamfer distance between 2D-3D keypoints via a triple approximation strategy. A lightweight multi-task learning module, MinCD-Net, is designed and integrated into existing I2P registration frameworks, achieving significant improvements in inlier ratio and registration recall under cross-scene and cross-dataset settings.
- MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction
-
MoGA is proposed to reconstruct high-fidelity 3D Gaussian avatars from a single image by learning a generative 3D avatar prior and leveraging it as a strong constraint for initialization, regularization, and pose optimization, substantially outperforming existing methods.
- Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
-
Momentum-GS proposes a momentum-based self-distillation mechanism to address cross-block consistency issues in block-parallel training of large-scale 3D Gaussian Splatting. By introducing a momentum teacher Gaussian decoder for global guidance and decoupling the number of blocks from the number of GPUs, the method achieves state-of-the-art performance on multiple large-scale scene datasets, improving LPIPS by 18.7% over CityGaussian.
- Monocular Semantic Scene Completion via Masked Recurrent Networks
-
This paper proposes MonoMRN, a two-stage monocular semantic scene completion framework that first generates coarse-grained predictions, then iteratively refines occluded regions via a Masked Sparse GRU (MS-GRU), while introducing distance attention projection to reduce depth projection errors. The method achieves state-of-the-art performance on both NYUv2 and SemanticKITTI.
- MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos
-
MonoMobility presents the first framework for zero-shot analysis of moving parts and motion attributes (motion axis and motion type) of articulated objects from monocular video. It combines off-the-shelf tools—depth estimation, optical flow, and segmentation—for coarse initialization, and then refines the results via self-supervised optimization of a dynamic scene represented with 2D Gaussian splatting together with a specially designed articulated-object dynamic scene optimization algorithm. The method requires no annotated data and handles rotational, translational, and compound motion.
- MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction
-
This paper proposes MuGS, the first generalizable 3D Gaussian Splatting method designed for multi-baseline settings. By fusing multi-view stereo (MVS) and monocular depth estimation (MDE) features and introducing a projected-sampled depth consistency network, MuGS achieves state-of-the-art novel view synthesis under both small-baseline and large-baseline scenarios.
- Multi-View 3D Point Tracking
-
This paper presents MVTracker—the first data-driven multi-view 3D point tracker. By back-projecting multi-view depth maps into a unified 3D feature point cloud and leveraging kNN association with Transformer-based iterative refinement, MVTracker achieves robust long-range 3D point trajectory estimation under a practical 4-camera configuration, attaining median trajectory errors of 3.1 cm and 2.0 cm on Panoptic Studio and DexYCB, respectively.
- MV-Adapter: Multi-view Consistent Image Generation Made Easy
-
This paper proposes MV-Adapter, the first adapter-based framework for multi-view image generation. By duplicating self-attention layers and adopting a parallel attention architecture, it enables plug-and-play multi-view generation on SDXL at 768 resolution, with compatibility across diverse T2I-derived models.
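The parallel-attention idea can be sketched as a frozen spatial branch plus a duplicated, trainable multi-view branch whose outputs are summed. The module below is a schematic assumption (head count, view folding), not the released architecture.

```python
import torch
import torch.nn as nn

class ParallelMVAttention(nn.Module):
    """Parallel-attention adapter block (schematic sketch).

    The frozen base self-attention is duplicated into a trainable
    multi-view branch; both branches read the same input and their
    outputs are summed, preserving the base model's feature space.
    The new branch attends across all views jointly.
    """
    def __init__(self, base_attn: nn.MultiheadAttention, dim: int, heads: int = 8):
        super().__init__()
        self.base = base_attn                    # frozen spatial attention
        for p in self.base.parameters():
            p.requires_grad = False
        self.mv = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, n_views: int) -> torch.Tensor:
        b, l, d = x.shape                        # b = batch * n_views
        spatial, _ = self.base(x, x, x)
        xv = x.reshape(b // n_views, n_views * l, d)   # fold views into seq
        mv, _ = self.mv(xv, xv, xv)
        return spatial + mv.reshape(b, l, d)
```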
- MVGBench: a Comprehensive Benchmark for Multi-view Generation Models
-
This paper presents MVGBench, a comprehensive evaluation framework for multi-view generation (MVG) models. It introduces a novel 3D consistency metric based on 3DGS self-consistency (requiring no 3D ground truth), systematically evaluates 12 state-of-the-art methods across three dimensions—peak performance, generalization, and robustness—and derives a new method, ViFiGen, from the identified best practices.
- Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation
-
Nautilus proposes a locality-aware autoencoder for scalable artist-like mesh generation. By introducing a nautilus-shell-structured mesh tokenization algorithm that reduces sequence length to 1/4 of the naive baseline, and combining it with a dual-stream point cloud conditioner to improve local structural fidelity, Nautilus achieves for the first time direct high-quality mesh generation with up to 5,000 faces.
- Neural Compression for 3D Geometry Sets
-
This paper proposes NeCGS, the first neural compression paradigm capable of compressing geometry sets containing thousands of diverse 3D mesh models at ratios up to 900×, achieving high-fidelity reconstruction via a TSDF-Def implicit representation and a quantization-aware auto-decoder.
- NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement
-
NeuraLeaf disentangles the 3D geometry of leaves into two latent spaces — a 2D base shape space and a 3D deformation space — leveraging large-scale 2D leaf image datasets to learn the shape space, proposes a skeleton-free skinning model to handle highly flexible leaf deformations, and introduces DeformLeaf, the first 3D dataset dedicated to leaf deformation modeling.
- No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
-
This paper proposes SPFSplat, the first self-supervised 3DGS framework that requires no ground-truth poses at either training or inference time. By sharing a ViT backbone to jointly predict Gaussian primitives and camera poses, SPFSplat surpasses pose-dependent state-of-the-art methods under extreme viewpoint changes.
- Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising
-
This paper proposes Noise2Score3D, a fully unsupervised point cloud denoising framework based on Tweedie's formula. It learns the score function directly from noisy data and achieves single-step denoising, while introducing point cloud total variation to estimate unknown noise parameters.
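Tweedie's formula gives the single-step denoiser directly: the posterior mean of the clean point is the noisy observation plus sigma squared times the learned score. A minimal sketch, assuming a pretrained `score_net`:

```python
import torch

def tweedie_denoise(noisy_points: torch.Tensor, score_net, sigma: float):
    """Single-step point cloud denoising via Tweedie's formula.

        x_hat = y + sigma^2 * grad_y log p(y)

    The score is predicted by a network trained on noisy data alone;
    `score_net` is a placeholder for that learned model.
    """
    score = score_net(noisy_points)        # (N, 3) estimated score
    return noisy_points + (sigma ** 2) * score
```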
- Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features
-
DS4D is the first method to decouple dynamic and static features along both the temporal and spatial axes in video-to-4D generation. It introduces a Dynamic-Static Feature Decoupling module (DSFD) to extract dynamic representations, and a Temporal-Spatial Similarity Fusion module (TSSF) to adaptively aggregate dynamic information across viewpoints, achieving state-of-the-art performance on the Consistent4D and Objaverse datasets.
- OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering
-
This paper proposes an occlusion-aware scene partitioning strategy and region-based rendering technique. By clustering a camera co-visibility graph, it achieves partitions aligned with the scene layout, significantly improving reconstruction quality and rendering speed for large-scale 3DGS.
- One Look is Enough: Seamless Patchwise Refinement for Zero-Shot Monocular Depth Estimation
-
This paper proposes PRO (Patch Refine Once), which achieves seamless patchwise depth refinement on high-resolution images through Grouped Patch Consistency Training (GPCT) and a Bias-Free Mask (BFM) strategy. PRO eliminates boundary artifacts with only a single refinement pass per patch and achieves a 12× inference speedup over PatchRefiner.
- Online Language Splatting
-
This paper presents the first framework to achieve online, near-real-time, open-vocabulary language mapping within a 3DGS-SLAM system. Through three innovations—high-resolution CLIP embedding, two-stage online autoencoder compression, and decoupled color-language optimization—the method surpasses offline state-of-the-art methods in accuracy while achieving 40×–200× efficiency gains.
- Open-Vocabulary Octree-Graph for 3D Scene Understanding
-
This paper proposes Octree-Graph, a novel scene representation combining adaptive octrees with a graph structure. Through Chronological Group-based Segment Merging (CGSM) and Instance Feature Aggregation (IFA), it obtains accurate semantic objects and enables efficient open-vocabulary 3D scene understanding.
- Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps
-
This paper proposes S3PO-GS, a monocular RGB-only outdoor SLAM system that anchors pose estimation to 3DGS-rendered pointmaps for scale self-consistency, and employs a patch-based dynamic mapping mechanism, achieving high-accuracy localization without cumulative scale drift and high-fidelity novel view synthesis.
- PanSt3R: Multi-view Consistent Panoptic Segmentation
-
PanSt3R builds upon MUSt3R to simultaneously perform 3D reconstruction and multi-view panoptic segmentation in a single forward pass, requiring neither camera parameters nor test-time optimization, and achieves inference speeds orders of magnitude faster than existing methods.
- PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations
-
PCR-GS is proposed to achieve high-quality 3D-GS reconstruction and pose estimation under complex camera trajectories without COLMAP priors, co-regularizing camera poses through DINO-feature reprojection and wavelet-based frequency regularization.
- PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
-
This paper introduces PlaceIt3D, a language-guided object placement task in real 3D scenes, comprising a benchmark, a large-scale dataset, and a 3D LLM-based baseline method called PlaceWizard that performs joint reasoning over scenes, objects, and natural language instructions.
- PLMP -- Point-Line Minimal Problems for Projective SfM
-
This paper provides a complete classification of all point-line minimal problems in projective SfM, identifying 291 minimal problems (73 of which admit unique solutions solvable by linear methods), and develops a systematic framework for problem decomposition and non-minimality proofs via stabilizer subgroup analysis.
- PolarAnything: Diffusion-based Polarimetric Image Synthesis
-
This paper proposes PolarAnything, the first diffusion-based framework for generating polarimetric images from a single RGB image. By performing denoising diffusion over encoded AoLP and DoLP representations, the method achieves physically accurate and photorealistic polarimetric attribute synthesis without requiring 3D assets or polarization cameras.
- Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding
-
This paper proposes Predict-Optimize-Distill (POD), a self-improving framework that recovers 4D part poses of articulated objects from long monocular videos through iterative predict–optimize–distill cycles, with performance that improves consistently with video length and iteration count.
- Proactive Scene Decomposition and Reconstruction
-
This paper proposes an online scene decomposition and reconstruction task grounded in proactive human-object interaction, where interaction behavior observed from an egocentric viewpoint defines the decomposition granularity, enabling progressive object decoupling and high-quality global reconstruction.
- PseudoMapTrainer: Learning Online Mapping without HD Maps
-
This paper proposes PseudoMapTrainer, the first framework to train online mapping models entirely without ground-truth HD maps: it reconstructs road surfaces from multi-camera images via 2D Gaussian Splatting (RoGS) and uses a pretrained semantic segmentation model (Mask2Former) to generate vectorized pseudo-labels. A mask-aware matching algorithm and loss function are further designed to handle partially occluded pseudo-labels, supporting both single-trip and multi-trip (crowdsourced) modes.
- Radiant Foam: Real-Time Differentiable Ray Tracing
-
This paper proposes Radiant Foam, a novel differentiable scene representation based on volumetric tetrahedral mesh ray tracing. Without relying on rasterization, it achieves rendering speed and quality comparable to Gaussian Splatting while natively supporting light transport phenomena such as reflection and refraction.
- RapVerse: Coherent Vocals and Whole-Body Motion Generation from Text
-
This work constructs the large-scale rap dataset RapVerse and proposes a unified autoregressive transformer framework that, for the first time, simultaneously generates coherent singing vocals and whole-body 3D motion from lyric text.
- RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians
-
This paper proposes RayletDF, a generalizable 3D surface reconstruction method based on a "raylet" (ray segment) distance field. Through three modules — a raylet feature extractor, a distance field predictor, and a multi-raylet mixer — RayletDF directly predicts surface points from point clouds or 3D Gaussians, achieving high-accuracy cross-dataset generalization via a single forward pass on unseen datasets.
- RayZer: A Self-supervised Large View Synthesis Model
-
This paper proposes RayZer, a self-supervised multi-view 3D vision model that requires no 3D supervision (no camera poses, no scene geometry annotations). By decoupling images into camera parameters and scene representations, RayZer performs 3D-aware image autoencoding and achieves performance on novel view synthesis that matches or surpasses oracle methods relying on pose annotations.
- RegGS: Unposed Sparse Views Gaussian Splatting with 3DGS Registration
-
RegGS is proposed as a framework that incrementally aligns locally generated 3D Gaussians from a feed-forward network into a globally consistent 3D representation via a differentiable 3DGS registration module based on the optimal-transport MW2 distance, enabling high-quality 3D reconstruction from unposed sparse views.
- Relative Illumination Fields: Learning Medium and Light Independent Underwater Scenes
-
This paper proposes Relative Illumination Fields (RIF), which models non-uniform illumination distributions in camera-local coordinates via an MLP and jointly optimizes a volumetric medium representation, enabling clean reconstruction of underwater scenes free from light source and medium effects.
- REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment
-
This paper proposes REPARO, which generates compositional 3D assets from a single image by first reconstructing individual object meshes separately and then performing layout alignment via optimal transport-based differentiable rendering.
- RePoseD: Efficient Relative Pose Estimation with Known Depth Information
-
This paper proposes a set of efficient minimal solvers for relative pose estimation that jointly estimate the scale and affine parameters of monocular depth estimation (MDE) alongside the relative pose. The proposed solvers outperform state-of-the-art depth-aware solvers across three camera configurations (calibrated / shared focal length / unknown individual focal lengths), and large-scale experiments provide a definitive answer to the question of whether MDE depth actually benefits relative pose estimation.
- Representing 3D Shapes with 64 Latent Vectors for 3D Diffusion Models
-
This paper proposes COD-VAE, a two-stage autoencoder framework—comprising a progressive encoder, a triplane decoder, and uncertainty-guided token pruning—that encodes 3D shapes into only 64 one-dimensional latent vectors, achieving a 16× compression ratio and 20.8× generation speedup while maintaining reconstruction quality.
- Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation
-
This paper proposes the Gaussian Atlas representation, which maps unordered 3D Gaussians onto a sphere via optimal transport and then flattens them into a structured 2D grid, enabling direct fine-tuning of pretrained 2D Latent Diffusion models for high-quality text-to-3D generation.
- ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery
-
This paper proposes a residual split operation to replace the binary split/clone mechanism in 3D-GS, combined with image pyramid progressive supervision and a variable gradient threshold selection strategy, to adaptively address both over-reconstruction and under-reconstruction simultaneously, achieving state-of-the-art rendering quality while reducing the number of Gaussians.
- Revisiting Point Cloud Completion: Are We Ready For The Real-World?
-
Using algebraic topology and persistent homology (\(\mathcal{PH}\)) tools, this paper reveals that existing synthetic point cloud datasets lack the rich topological features present in real-world data. It contributes the first real-world industrial point cloud completion dataset RealPC (~40,000 pairs, 21 categories), and proposes BOSHNet, which samples proxy homology skeletons as topological priors to achieve significant improvements on real-world point cloud completion.
- RI3D: Few-Shot Gaussian Splatting with Repair and Inpainting Diffusion Priors
-
RI3D decomposes sparse-view synthesis into two sub-tasks — repairing visible regions and completing missing regions — and introduces two personalized diffusion models (repair + inpainting) combined with a two-stage optimization strategy to achieve high-quality 3DGS reconstruction under extremely sparse inputs.
- RoboPearls: Editable Video Simulation for Robot Manipulation
-
This paper presents RoboPearls, an editable video simulation framework built upon 3D Gaussian Splatting (3DGS) that constructs photorealistic simulation environments from demonstration videos. It supports rich scene editing operations via Incremental Semantic Distillation (ISD) and a 3D-regularized NNFM loss, and employs a multi-LLM agent system to automate the simulation generation pipeline, forming a VLM-in-the-loop robot learning augmentation system.
- RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation
-
This paper proposes RoboTron-Mani, a multimodal large model for robotic manipulation, together with the comprehensive dataset RoboData. By enhancing 3D perception via camera parameters and occupancy supervision, and enabling flexible multimodal fusion through a Modality-Isolation-Mask (MIM), RoboTron-Mani is the first generalist policy to simultaneously surpass specialist models across multiple datasets.
- Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction
-
This paper proposes a robust and efficient 3DGS reconstruction framework for city-scale scenes. Through a visibility-based partitioning strategy, controllable LOD generation, a fine-grained appearance transformation module, and multiple regularization techniques, the framework achieves high-quality reconstruction and real-time rendering on urban data with large appearance variations and transient objects.
- RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather
-
This paper proposes RobuSTereo, a framework that significantly improves the zero-shot generalization of stereo matching models under adverse weather conditions (rain, fog, snow) via a diffusion-based stereo data generation pipeline and a robust feature encoder combining a denoising Vision Transformer (DVT) with VGG19.
- RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS
-
This paper identifies Gaussian densification in 3DGS as the key factor responsible for transient-object artifacts, and proposes a delayed Gaussian growth strategy along with a scale-cascaded mask bootstrapping method to decouple densification from dynamic region modeling, achieving state-of-the-art transient-free novel view synthesis across multiple benchmark datasets.
- RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation
-
RoCo-Sim is proposed as the first simulation framework for roadside collaborative perception. By integrating extrinsic parameter optimization, occlusion-aware 3D asset placement, DepthSAM-based depth modeling, and style-transfer post-processing, it generates multi-view consistent simulation data from single images, achieving over 83% improvement in roadside 3D detection performance.
- Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
-
Ross3D introduces 3D-aware visual reconstruction pretraining tasks—cross-view reconstruction and global BEV reconstruction—into the training pipeline of 2D large multimodal models (LMMs). Without modifying the input representation, it significantly improves 3D scene understanding through output-level supervision signals, achieving state-of-the-art performance on five benchmarks: SQA3D, ScanQA, Scan2Cap, ScanRefer, and Multi3DRefer.
- S3E: Self-Supervised State Estimation for Radar-Inertial System
-
S3E is proposed as the first method to achieve complementary self-supervised state estimation from radar signal spectra and inertial data, leveraging a rotation-based cross-fusion technique to enhance spatial structural information under limited angular resolution.
- S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction
-
S3R-GS identifies three major computational redundancies in conventional street scene reconstruction pipelines—unnecessary local-to-global coordinate transformations, excessive 3D-to-2D projections, and inefficient rendering of distant content—and proposes instance-specific projection, temporal visibility filtering, and adaptive level-of-detail (LOD) strategies to reduce reconstruction time to 20%–50% of competing methods while maintaining state-of-the-art rendering quality.
- SAS: Segment Any 3D Scene with Integrated 2D Priors
-
This paper proposes SAS, a framework that for the first time integrates the complementary capabilities of multiple 2D open-vocabulary models to learn better 3D representations. It aligns feature spaces across models via Model Alignment via Text, and quantifies per-category model recognition capability using diffusion-synthesized images through Annotation-Free Model Capability Construction. These components jointly guide multi-model feature fusion and 3D distillation, achieving substantial improvements over prior work on ScanNet v2, Matterport3D, and nuScenes.
- Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion
-
This paper presents Sat2City, the first 3D generation framework capable of simultaneously producing city-scale geometry and appearance from a single satellite image. By integrating sparse voxel grids with a cascaded latent diffusion model, it introduces a Re-Hash multi-scale feature grid and an inverse sampling strategy, achieving high-fidelity generation superior to existing methods on a self-constructed 3D city dataset.
- Scene Coordinate Reconstruction Priors
-
This paper proposes a probabilistic training framework for scene coordinate regression (SCR) that introduces hand-crafted depth distribution priors and a learned prior based on a 3D point cloud diffusion model, significantly improving scene reconstruction quality, camera pose estimation, and downstream task performance under insufficient multi-view constraints.
- SceneMI: Motion In-betweening for Modeling Human-Scene Interactions
-
This work formally introduces the scene-aware motion in-betweening problem and proposes the SceneMI framework, which comprehensively encodes scene context via a dual-layer scene descriptor (global voxels + local BPS). By leveraging the denoising capability of diffusion models to handle noisy keyframes, SceneMI reduces the collision frame rate by 56.9% on TRUMANS, and reduces foot skating by 37.5% and jitter by 56.5% on the real-world GIMO dataset.
- Seeing and Seeing Through the Glass: Real and Synthetic Data for Multi-Layer Depth Estimation
-
This paper introduces the novel task of multi-layer depth estimation, constructs the LayeredDepth benchmark comprising 1,500 real-world images, and develops a procedural synthetic data generator. The work reveals severe deficiencies of existing depth estimation methods when applied to transparent objects.
- SegmentDreamer: Towards High-Fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation
-
This paper proposes SegmentDreamer, which reformulates the SDS loss via Segmented Consistency Trajectory Distillation (SCTD) to address the imbalance between self-consistency and cross-consistency in existing consistency distillation (CD) methods, enabling high-fidelity 3D asset generation via 3DGS in ~32 minutes on a single A100 GPU.
- SeHDR: Single-Exposure HDR Novel View Synthesis via 3D Gaussian Bracketing
-
SeHDR is proposed as the first framework for synthesizing HDR novel views from single-exposure multi-view LDR images. It generates bracketed exposures in 3D Gaussian space (Bracketed 3D Gaussians) and merges them into an HDR scene representation via differentiable Neural Exposure Fusion (NeEF).
- Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis
-
SE-GS dynamically generates diverse 3DGS models during training via an uncertainty-aware perturbation strategy, and leverages a self-ensembling mechanism to allow the Σ-model to aggregate information from perturbed models, effectively mitigating overfitting under sparse-view settings and achieving state-of-the-art few-shot novel view synthesis performance across multiple datasets.
- Sequential Gaussian Avatars with Hierarchical Motion Context
-
This paper proposes SeqAvatar, which leverages explicit 3DGS representations combined with hierarchical motion context (coarse-grained skeletal motion + fine-grained per-point velocity) to model motion-correlated appearance changes in human avatars. Spatio-temporal multi-scale sampling further enhances the robustness of motion conditioning. SeqAvatar achieves state-of-the-art rendering quality across multiple datasets while maintaining real-time rendering speed.
- Shape of Motion: 4D Reconstruction from a Single Video
-
This paper proposes a dynamic 3D Gaussian representation based on \(\mathrm{SE}(3)\) motion bases, recovering globally consistent 3D motion trajectories from monocular video while simultaneously enabling real-time novel view synthesis and long-range 3D tracking, outperforming prior methods comprehensively on the iPhone and Kubric datasets.
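The motion-basis idea reduces, per Gaussian and per time step, to blending a small set of shared basis transforms with learned coefficients. The sketch below blends translations linearly and projects blended rotations back to SO(3) via SVD, a common simplification of proper SE(3) blending.

```python
import numpy as np

def blend_motion_bases(weights, base_R, base_t):
    """Pose of one Gaussian at one time step from shared motion bases.

    weights: (K,) per-Gaussian basis coefficients
    base_R:  (K, 3, 3) basis rotations at this time step
    base_t:  (K, 3)    basis translations at this time step
    Rotations are blended linearly and projected back to SO(3) via SVD,
    a common simplification of proper SE(3) blending.
    """
    t = weights @ base_t                         # blended translation (3,)
    M = np.einsum("k,kij->ij", weights, base_R)  # blended, non-orthogonal
    U, _, Vt = np.linalg.svd(M)
    R = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt  # nearest rotation
    return R, t
```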
- SHeaP: Self-Supervised Head Geometry Predictor Learned via 2D Gaussians
-
SHeaP replaces traditional differentiable mesh rendering with 2D Gaussian Splatting for self-supervised 3DMM prediction training. By binding Gaussians to the 3DMM mesh for re-animation, and introducing a graph-convolution-based Gaussian regressor together with geometry consistency regularization, SHeaP surpasses all self-supervised methods on the NoW and NeRSemble benchmarks.
- SiM3D: Single-Instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
-
This paper introduces SiM3D, the first benchmark for multiview multimodal 3D anomaly detection and segmentation targeting single-instance industrial scenarios. It employs industrial-grade sensors to acquire high-resolution data, replaces 2D anomaly maps with voxelized Anomaly Volumes, and is the first benchmark to support cross-domain synthetic-to-real evaluation.
- Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation
-
Sdirt proposes a ray-tracing-based dual-pixel (DP) image simulation framework that computes spatially varying DP PSFs incorporating lens aberrations and phase-splitting characteristics, thereby bridging the domain gap between simulated and real DP data and improving the generalization of depth estimation models on real DP images.
- Single-Scanline Relative Pose Estimation for Rolling Shutter Cameras
-
This paper proposes a rolling shutter relative pose estimation method that requires no explicit camera motion modeling. It recovers camera pose solely from the intersections of line projections with a single selected scanline per image, and develops multiple minimal solvers for special configurations such as parallel lines and known gravity direction.
- SL2A-INR: Single-Layer Learnable Activation for Implicit Neural Representation
-
This paper proposes SL2A-INR, a hybrid architecture combining a single-layer learnable activation block parameterized by Chebyshev polynomials with a ReLU-MLP fusion block, effectively alleviating spectral bias in implicit neural representations and achieving state-of-the-art performance on image fitting, 3D shape reconstruction, and novel view synthesis.
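A learnable Chebyshev activation can be sketched as a trainable coefficient vector over the polynomial recurrence, evaluated on inputs squashed into [-1, 1]. The module below illustrates the idea and is not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class ChebyshevActivation(nn.Module):
    """Learnable activation as a Chebyshev expansion (sketch of the idea).

    y = sum_k c_k * T_k(x_c), with the recurrence T_0 = 1, T_1 = x,
    T_{k+1} = 2x T_k - T_{k-1}. Inputs are squashed into [-1, 1], where
    the polynomials are well behaved; the coefficients c_k are the
    learnable parameters.
    """
    def __init__(self, degree: int = 8):
        super().__init__()
        self.coeffs = nn.Parameter(torch.randn(degree + 1) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xc = torch.tanh(x)                 # map inputs into [-1, 1]
        t_prev, t_cur = torch.ones_like(xc), xc
        out = self.coeffs[0] * t_prev + self.coeffs[1] * t_cur
        for k in range(2, self.coeffs.numel()):
            t_prev, t_cur = t_cur, 2 * xc * t_cur - t_prev
            out = out + self.coeffs[k] * t_cur
        return out
```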
- Sparfels: Fast Reconstruction from Sparse Unposed Imagery
-
Sparfels integrates a 3D foundation model (MASt3R) with efficient test-time optimization (2DGS). MASt3R provides an initial point cloud, camera poses, and dense correspondences to guide optimization. A novel splat color variance loss is introduced, enabling state-of-the-art geometric reconstruction from sparse unposed images in under three minutes.
- Spatial-Temporal Aware Visuomotor Diffusion Policy Learning
-
This paper proposes the 4D Diffusion Policy (DP4), which injects 3D spatial and 4D spatial-temporal awareness into a diffusion policy via a dynamic Gaussian world model, achieving substantial improvements over baselines across 17 simulation tasks and 3 real-robot tasks (Adroit +16.4%, DexArt +14%, RLBench +6.45%, real tasks +8.6%).
- SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images
-
SpatialSplat is proposed to generate compact semantic 3D Gaussians from sparse unposed images via feed-forward inference, leveraging a dual-field semantic representation and a selective Gaussian mechanism that reduces representation parameters by 60% while surpassing state-of-the-art methods.
- SpinMeRound: Consistent Multi-View Identity Generation Using Diffusion Models
-
This paper presents SpinMeRound, an identity-conditioned multi-view diffusion model that generates 360° full-head portraits with consistent identity and corresponding normal maps from a single or few face images, surpassing existing multi-view diffusion methods on face novel view synthesis benchmarks.
- SplatTalk: 3D VQA with Gaussian Splatting
-
This paper proposes SplatTalk, a framework that leverages generalizable 3D Gaussian Splatting to generate LLM-compatible 3D tokens from multi-view RGB images alone, enabling zero-shot 3D visual question answering that surpasses 2D LMM baselines and approaches 3D LMM performance.
- Stable Score Distillation
-
This paper proposes Stable Score Distillation (SSD), which achieves more stable and precise text-guided 2D/3D editing through single-classifier cross-prompt guidance and cross-trajectory regularization via a null-text branch, improving editing alignment while preserving the structural content of the source.
- StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions
-
This work presents the first density-guided poisoning attack against 3D Gaussian Splatting (3DGS). By injecting illusion Gaussians into low-density regions and introducing adaptive noise to disrupt multi-view consistency, the method achieves attacks that are clearly visible from target viewpoints while remaining imperceptible from all others.
- Stereo Any Video: Temporally Consistent Stereo Matching
-
This paper proposes Stereo Any Video, a framework that achieves spatially accurate and temporally consistent video stereo matching without relying on camera poses or optical flow. It integrates three core modules — monocular video depth foundation model priors (Video Depth Anything), all-to-all-pair correlation, and temporal convex upsampling — attaining state-of-the-art performance under zero-shot settings across multiple benchmarks.
- StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting
-
StochasticSplats introduces Stochastic Transparency into 3DGS, replacing depth-sorted alpha blending with an unbiased Monte Carlo estimator to achieve sorting-free, popping-free rendering. At 1 SPP, it is 4× faster than standard CUDA 3DGS, and the number of samples provides a flexible quality–speed trade-off.
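The unbiased estimator is easy to state per pixel: each sample keeps every splat independently with probability alpha, the nearest kept splat wins, and averaging over samples converges to the sorted alpha-blended color. A numpy sketch, with array index standing in for depth order:

```python
import numpy as np

def stochastic_blend(colors, alphas, rng, spp: int = 8):
    """Unbiased, sorting-free alpha blending for one pixel (sketch).

    Each sample keeps every splat independently with probability alpha
    and takes the kept splat nearest the camera (array index stands in
    for depth here). Averaging over `spp` samples estimates the sorted
    alpha-blended color without ever sorting.
    """
    colors = np.asarray(colors, float)
    alphas = np.asarray(alphas, float)
    acc = np.zeros(3)
    for _ in range(spp):
        kept = rng.random(len(alphas)) < alphas  # stochastic acceptance
        idx = np.flatnonzero(kept)
        if idx.size:                             # nearest kept splat wins
            acc += colors[idx[0]]
    return acc / spp

rng = np.random.default_rng(0)
print(stochastic_blend([[1, 0, 0], [0, 1, 0]], [0.5, 1.0], rng, spp=10000))
# converges to 0.5*red + 0.5*green, matching sorted alpha blending
```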
- StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors
-
This paper presents StrandHead, the first framework for generating strand-level 3D head avatars by distilling human-centric 2D diffusion priors. It introduces a differentiable prismatization algorithm to convert hair strands into watertight meshes with gradient backpropagation, and designs regularization losses based on statistical geometric priors of hair strands to ensure hairstyle realism.
- StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
-
StruMamba3D is proposed to maintain 3D point adjacency relationships by endowing SSM hidden states with spatial positional attributes (spatial states), and introduces a sequence-length-adaptive strategy to address the sequence length discrepancy between pre-training and downstream tasks. The method achieves 92.75% accuracy on the hardest ScanObjectNN split and 95.1% on ModelNet40, both representing single-modality SOTA.
- SuperDec: 3D Scene Decomposition with Superquadric Primitives
-
SuperDec is a Transformer-based learning approach that decomposes point clouds into compact sets of superquadric primitives. Trained on ShapeNet, it generalizes to real-world scenes and supports downstream applications including robot manipulation and controllable generation.
- SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates
-
This paper proposes SuperMat, a single-step inference framework for PBR material decomposition. Through structured expert branches and scheduler correction, it enables end-to-end training and introduces a re-render loss to enforce physical consistency, accelerating inference from seconds to milliseconds.
- SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting
-
SurfaceSplat proposes a hybrid framework that establishes bidirectional connections between SDF (Signed Distance Function) and 3D Gaussian Splatting (3DGS): the SDF provides coarse geometry to enhance 3DGS rendering quality, while novel-view images rendered by 3DGS are in turn used to refine SDF surface reconstruction accuracy. The method achieves state-of-the-art performance on both surface reconstruction and novel view synthesis on the DTU and MobileBrick datasets.
- SVG-Head: Hybrid Surface-Volumetric Gaussians for High-Fidelity Head Reconstruction and Real-Time Editing
-
SVG-Head is proposed as a hybrid representation combining surface Gaussians (with explicit texture maps) and volumetric Gaussians (for supplementary modeling of non-Lambertian regions), achieving, for the first time, real-time appearance editing of high-fidelity Gaussian head avatars.
- TAPNext: Tracking Any Point (TAP) as Next Token Prediction
-
TAPNext reformulates Tracking Any Point (TAP) in video as a sequential masked token decoding task, eliminating the tracking-specific inductive biases and heuristics prevalent in conventional approaches. It achieves causal online tracking and establishes new state-of-the-art results among both online and offline trackers, with remarkably low inference latency.
- TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction
-
TAR3D is proposed as the first framework to quantize triplane representations into discrete geometric parts and generate them autoregressively via GPT. A 3D VQ-VAE encodes meshes of arbitrary face counts into fixed-length sequences, while TriPE positional encoding preserves 3D spatial information. The method comprehensively outperforms existing approaches on text/image-to-3D tasks.
- Text2VDM: Text to Vector Displacement Maps for Expressive and Interactive 3D Sculpting
-
Text2VDM is proposed as the first framework for generating VDM sculpting brushes from text. It addresses the semantic entanglement problem in sub-object structure generation via Sobolev-preconditioned mesh deformation and a semantically enhanced SDS loss.
- Textured 3D Regenerative Morphing with 3D Diffusion Prior
-
This paper proposes a regenerative 3D morphing method based on a 3D diffusion prior. By performing interpolation at three levels — initial noise, model parameters, and conditioning features — and combining three strategies (Attention Fusion, Token Reordering, and Low-Frequency Enhancement), it is the first to achieve smooth and semantically plausible morphing sequences for textured 3D objects across categories.
- TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction
-
This paper proposes the TimeFormer module, which implicitly learns temporal relationships among deformable 3D Gaussians via a cross-time Transformer encoder, and introduces a dual-stream optimization strategy that transfers motion knowledge during training with no additional overhead at inference.
- TokenUnify: Scaling Up Autoregressive Pretraining for Neuron Segmentation
-
TokenUnify is proposed to unify three complementary learning objectives—random token prediction, next-token prediction, and next-all-token prediction—enabling hierarchical predictive coding on large-scale electron microscopy data. The method reduces autoregressive error accumulation from \(O(K)\) to \(O(\sqrt{K})\), achieving a 44% improvement on downstream neuron segmentation.
- Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views
-
This paper proposes Point-PQAE, the first framework to introduce cross-view reconstruction into 3D generative self-supervised learning. By designing a point cloud cropping mechanism to generate decoupled views, a View-Relative Positional Embedding (VRPE), and a Positional Query module, the pre-training task becomes more challenging and informative. Point-PQAE surpasses Point-MAE by an average of 6.7% on ScanObjectNN under the Mlp-Linear protocol.
- Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting
-
This paper proposes a scalable data generation pipeline that automatically converts single-view 2D images into metric-scale 3D representations—including point clouds, camera poses, and depth maps—by integrating depth estimation, camera calibration, and scale calibration. The pipeline produces COCO-3D and Objects365-v2-3D datasets comprising approximately 2 million scenes, yielding significant performance gains across multiple 3D tasks.
- Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing
-
This paper proposes the Gaussian Instance Tracing (GIT) mechanism, which maintains a per-Gaussian instance weight matrix across views via inverse rasterization. GIT jointly addresses two longstanding challenges—multi-view inconsistency in 2D segmentation and boundary Gaussian ambiguity—and yields significant improvements in 3D segmentation quality under both offline contrastive learning and online self-prompting settings.
- TRACE: Learning 3D Gaussian Physical Dynamics from Multi-view Videos
-
TRACE is a framework that treats each 3D Gaussian kernel as a rigid particle and learns an independent translational-rotational dynamical system for it—comprising a complete set of physical parameters including velocity, acceleration, angular velocity, and angular acceleration. Without any manual annotation, TRACE learns the physical motion laws of 3D scenes from multi-view dynamic videos and accurately extrapolates future frames.
- Tune-Your-Style: Intensity-Tunable 3D Style Transfer with Gaussian Splatting
-
This paper proposes Tune-Your-Style, the first intensity-tunable 3D style transfer paradigm, which explicitly models style intensity via Gaussian neurons and parameterizes a learnable style tuner. Combined with a two-stage optimization strategy, the method enables users to freely adjust the degree of style injection without retraining.
- TurboReg: TurboClique for Robust and Efficient Point Cloud Registration
-
This paper proposes TurboReg, a framework that replaces traditional maximum clique enumeration with lightweight 3-cliques (TurboCliques) and introduces a highly parallelizable Pivot-Guided Search (PGS) algorithm, achieving state-of-the-art registration accuracy while delivering over 208× speedup.
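As a concrete illustration of why 3-cliques suffice, here is a minimal numpy sketch under stated assumptions: correspondences are scored by pairwise length consistency, a pivot pair seeds a TurboClique, and three correspondences already determine a rigid pose via the Kabsch algorithm. All function names are illustrative; the paper's Pivot-Guided Search and clique scoring are considerably more elaborate.

```python
import numpy as np

def compatibility(src, dst, tau=0.05):
    """Correspondences i, j are compatible when the distance between
    their source points matches the distance between their target
    points (rigid motion preserves pairwise lengths)."""
    ds = np.linalg.norm(src[:, None] - src[None, :], axis=-1)
    dd = np.linalg.norm(dst[:, None] - dst[None, :], axis=-1)
    C = np.abs(ds - dd) < tau
    np.fill_diagonal(C, False)
    return C

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) with dst ~= R @ src + t."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

def register_3clique(src, dst, tau=0.05):
    """src, dst: (N, 3) matched keypoints (putative correspondences)."""
    C = compatibility(src, dst, tau)
    deg = C.sum(1)
    i = int(deg.argmax())              # pivot correspondence
    j = int(np.argmax(C[i] * deg))     # best partner of the pivot
    ks = np.flatnonzero(C[i] & C[j])   # any k here closes a 3-clique
    if len(ks) == 0:
        return None
    idx = np.array([i, j, int(ks[0])])
    return kabsch(src[idx], dst[idx])  # 3 points fix a rigid pose
```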
- UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation
-
This paper proposes UniEgoMotion, the first unified egocentric motion model that achieves 3D human motion reconstruction, forecasting, and generation from an egocentric perspective within a single model, via a conditional motion diffusion framework and a head-centric motion representation. The large-scale EE4D-Motion dataset is also released.
- Unified Category-Level Object Detection and Pose Estimation from RGB Images using 3D Prototypes
-
This work presents the first RGB-only, single-model framework that unifies object detection and category-level pose estimation. By leveraging Neural Mesh Models as 3D prototypes, the method performs feature matching and multi-model RANSAC PnP to simultaneously detect objects and estimate their 9D poses. It surpasses the state of the art on all scale-agnostic metrics on REAL275.
- UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
-
This paper proposes UniVG, a unified image generation model built on MM-DiT that supports T2I generation, editing, identity-preserving generation, layout-guided synthesis, depth estimation, and more within a single set of weights, achieved via channel-wise input concatenation, progressive multi-task training, and external condition injection.
- Unleashing Vecset Diffusion Model for Fast Shape Generation (FlashVDM)
-
FlashVDM proposes a systematic framework to accelerate both DiT sampling and VAE decoding in Vecset Diffusion Models (VDM): progressive flow distillation reduces diffusion steps to 5, while adaptive KV selection, hierarchical volume decoding, and an efficient decoder yield a 45× VAE decoding speedup, achieving an overall 32× acceleration that enables high-quality 3D shape generation in under one second.
- UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis
-
This paper proposes UPP, a unified point-level prompting framework that reformulates point cloud denoising and completion as prompting mechanisms for downstream tasks. It introduces a Rectification Prompter to filter noise, a Completion Prompter to recover missing regions, and a Shape-Aware Unit to capture geometry-sensitive features. With only 6.3% of the parameters, UPP surpasses full fine-tuning on noisy and incomplete point clouds.
- UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling
-
This paper proposes UST-SSM, which extends selective state space models to point cloud video analysis via three core modules — Spatio-Temporal Selective Scanning (STSS), Spatio-Temporal Structure Aggregation (STSA), and Temporal Interaction Sampling (TIS) — achieving linear complexity while surpassing Transformer-based methods.
- VertexRegen: Mesh Generation with Continuous Level of Detail
-
VertexRegen is proposed to reframe mesh generation—inspired by progressive meshes—as learning the inverse of edge collapse, i.e., vertex split, enabling "anytime" mesh generation with continuous level of detail.
- ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads
-
Grounded in the key observation that VFM layers can be partitioned into low-level feature extractors and high-level task adapters, this paper proposes ViT-Split, which freezes the VFM backbone and introduces a task head (replicating the last \(K_t\) layers) and a prior head (a lightweight CNN aggregating multi-scale prior features). On ADE20K, ViT-Split achieves 58.2 mIoU (DINOv2-L) with only a linear head, offers 4× faster training, and requires only 1/4–1/5 of the trainable parameters compared to conventional adapters.
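A structural sketch of the split, assuming a timm-style ViT that exposes `blocks` and `forward_features` (both assumptions), and simplifying the multi-scale prior head to a single scale. This is not the authors' code, only the frozen-backbone / copied-tail-blocks / light-CNN arrangement the summary describes.

```python
import copy
import torch
import torch.nn as nn

class ViTSplit(nn.Module):
    def __init__(self, vfm, k_t=4, dim=1024, num_classes=150):
        super().__init__()
        self.backbone = vfm.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False                      # frozen VFM
        self.task_head = copy.deepcopy(vfm.blocks[-k_t:])  # copy last K_t blocks
        for p in self.task_head.parameters():
            p.requires_grad = True                       # only the copy trains
        self.prior_head = nn.Sequential(                 # lightweight CNN
            nn.Conv2d(dim, dim // 4, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim // 4, dim, 1))
        self.linear = nn.Linear(2 * dim, num_classes)    # linear head only

    def forward(self, x):
        with torch.no_grad():
            tok = self.backbone.forward_features(x)      # (B, N, C), CLS assumed stripped
        task = tok
        for blk in self.task_head:
            task = blk(task)                             # high-level task adapter
        b, n, c = tok.shape
        h = w = int(n ** 0.5)                            # square token grid assumed
        prior = self.prior_head(tok.transpose(1, 2).reshape(b, c, h, w))
        prior = prior.flatten(2).transpose(1, 2)         # back to (B, N, C)
        return self.linear(torch.cat([task, prior], -1))  # per-token logits
```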
- Vivid4D: Improving 4D Reconstruction from Monocular Video by Video Inpainting
-
This paper proposes Vivid4D, which reformulates multi-view augmentation from monocular video as a video inpainting problem — warping the video to novel viewpoints using monocular depth priors, then employing a video diffusion model to inpaint occluded regions. Through an iterative view expansion strategy and a robust reconstruction loss, Vivid4D significantly improves 4D dynamic scene reconstruction quality from monocular video.
- VoluMe: Authentic 3D Video Calls from Live Gaussian Splat Prediction
-
Microsoft proposes the first method for real-time prediction of 3D Gaussian Splatting reconstructions from a monocular 2D camera, simultaneously satisfying four requirements: authenticity, realism, liveness, and temporal stability. This enables anyone to conduct volumetric 3D video calls using only a standard laptop camera.
- VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions
-
This paper proposes VolumetricSMPL, an efficient neural volumetric body model based on Neural Blend Weights (NBW), achieving 10× inference speedup and 6× memory reduction over its predecessor COAP, while providing more accurate differentiable collision modeling through SDF (rather than occupancy function) representation.
- WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions
-
WonderPlay introduces a Hybrid Generative Simulator that combines coarse 3D dynamic simulation from a physics solver with high-quality generation from a video diffusion model, enabling realistic multi-material dynamic 3D scene generation from a single image and user-specified actions. The framework supports diverse material types including rigid bodies, cloth, liquids, smoke, and granular materials.
- WonderTurbo: Generating Interactive 3D World in 0.72 Seconds
-
WonderTurbo proposes the first real-time interactive 3D scene generation framework. Through the coordinated acceleration of three modules — StepSplat (feed-forward 3DGS), QuickDepth (lightweight depth completion), and FastPaint (2-step diffusion inpainting) — it compresses single-step scene extension time from 10+ seconds to 0.72 seconds, achieving a 15× speedup while maintaining generation quality comparable to WonderWorld.
- Zero-Shot Inexact CAD Model Alignment from a Single Image
-
A weakly supervised 9-DoF CAD model alignment method that enhances DINOv2 features with geometry awareness and performs dense alignment optimization in Normalized Object Coordinate (NOC) space, enabling zero-shot 3D alignment without pose annotations that generalizes to unseen categories.
- ZeroStereo: Zero-shot Stereo Matching from Single Images
-
This paper proposes ZeroStereo, a pipeline that starts from an arbitrary single image, uses monocular depth estimation to generate pseudo disparity, and synthesizes high-quality right-view images via a fine-tuned diffusion inpainting model. The approach achieves state-of-the-art zero-shot stereo matching generalization using only 35K synthetic training samples.
🎨 Image Generation¶
- A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
-
This paper proposes A0, an affordance-aware hierarchical diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding (predicting contact points and trajectories) and low-level action execution. Pretrained on 1M contact-point samples and fine-tuned with minimal task-specific data, A0 achieves cross-platform deployment across Franka/Kinova/Realman/Dobot, reaching a 45% success rate on complex trajectory tasks such as whiteboard wiping.
- A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
-
This paper proposes A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial understanding and low-level action execution by predicting object-centric contact points and post-contact trajectories via an Embodiment-Agnostic Affordance Representation. Pre-trained on 1 million contact-point annotations, A0 generalizes across four robot platforms: Franka, Kinova, Realman, and Dobot.
- A Unified Framework for Motion Reasoning and Generation in Human Interaction
-
This paper proposes MoLaM, a unified interactive motion-language model that, through a three-stage training strategy and a newly constructed Inter-MT² dataset (82.7K multi-turn instructions), is the first to simultaneously achieve understanding, generation, editing, and reasoning of dyadic interaction motion within a single framework.
- Accelerating Diffusion Sampling via Exploiting Local Transition Coherence
-
This paper proposes LTC-Accel, a training-free diffusion sampling acceleration method based on the phenomenon of Local Transition Coherence (LTC). Exploiting the strong correlation between transition operators of adjacent denoising steps, it approximates the current step's computation using the previous step's transition operator, achieving a 1.67× speedup on Stable Diffusion v2 and, combined with distilled models, up to 10× acceleration in video generation.
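A toy sketch of the reuse pattern: on designated steps the network call is skipped and the previous step's update is reapplied. This elementwise reuse is a crude stand-in for the paper's transition-operator approximation, shown only to make the control flow concrete; `step_fn` is a hypothetical one-step solver.

```python
import torch

@torch.no_grad()
def ltc_sample(step_fn, x, timesteps, reuse_every=2):
    """step_fn(x, t) -> next latent, one full solver step (network call).
    Every `reuse_every`-th step reuses the previous update instead."""
    prev_update = None
    for i, t in enumerate(timesteps):
        if prev_update is not None and i % reuse_every == 1:
            x = x + prev_update          # reuse the previous transition
        else:
            x_next = step_fn(x, t)       # full network evaluation
            prev_update = x_next - x     # cache this step's update
            x = x_next
    return x
```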
- Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Models and Small Edge Models
-
This paper proposes RouteT2I, the first edge-cloud model routing framework for text-to-image generation. It maximizes image generation quality under cost constraints through multi-dimensional quality metrics, Pareto Relative Superiority, and a dual-gated token selection MoE architecture.
- Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model
-
This paper proposes RouteT2I, a framework that dynamically routes text-to-image generation requests to either a lightweight edge model or a large cloud model via multi-dimensional quality assessment metrics and a dual-gate token-selection MoE routing model, achieving 83.97% of the quality gain attainable by exclusively using the cloud model at a 50% routing rate.
- Addressing Text Embedding Leakage in Diffusion-Based Image Editing
-
This work identifies the root cause of attribute leakage in text-driven diffusion-based image editing — semantic entanglement in EOS embeddings of autoregressive text encoders — and proposes the ALE framework (ORE + RGB-CAM + BB) to comprehensively eliminate attribute leakage through embedding disentanglement, attention masking, and background blending.
- Addressing Text Embedding Leakage in Diffusion-based Image Editing
-
This paper proposes the ALE framework, which systematically addresses attribute leakage in diffusion-based text-guided image editing through three components: Object-Restricted Embedding (ORE) to decouple semantic entanglement in EOS tokens, Region-Guided Blended Cross-Attention Masking (RGB-CAM) to constrain spatial attention, and Background Blending (BB) to preserve unedited regions. A new evaluation benchmark, ALE-Bench, is also introduced.
- ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation
-
This paper proposes ADIEE, an automated pipeline for constructing training datasets for instruction-guided image editing evaluation. A LLaVA-NeXT-8B model is fine-tuned on over 100K samples as a scorer, surpassing all open-source VLMs and Gemini-Pro 1.5 on multiple benchmarks. The trained scorer can further serve as a reward model to improve image editing models.
- ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation
-
This paper proposes ADIEE, an automated pipeline for constructing a training dataset of over 100,000 samples for image editing evaluation. It fine-tunes LLaVA-NeXT-8B as an editing quality scorer and surpasses open-source VLMs and Gemini-Pro 1.5 on multiple benchmarks. The resulting scorer can also serve as a reward model to improve editing model performance.
- Aether: Geometric-Aware Unified World Modeling
-
Aether proposes a geometric-aware unified world modeling framework that jointly trains reconstruction, prediction, and planning capabilities on synthetic 4D data, built upon post-training of CogVideoX to achieve zero-shot generalization to real-world scenes.
- Aether: Geometric-Aware Unified World Modeling
-
This paper proposes Aether, a unified world model that post-trains the CogVideoX video diffusion model on synthetic RGB-D data. Through a multi-task training strategy that randomly combines input/output modalities, Aether simultaneously achieves 4D reconstruction, action-conditioned video prediction, and goal-conditioned visual planning, with zero-shot transfer to real-world data reaching performance comparable to domain-specific models.
- AIComposer: Any Style and Content Image Composition via Feature Integration
-
AIComposer proposes the first cross-domain image composition method that requires no text prompts. By fusing foreground and background CLIP features via an MLP network, combined with backward inversion + forward denoising and a local cross-attention strategy, the method achieves natural stylization and seamless composition without training the diffusion model, improving LPIPS and CSD metrics by 30.5% and 18.1%, respectively.
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
-
This paper proposes AID, a framework that transfers a pretrained Image2Video diffusion model (SVD) to text-guided video prediction (TVP) tasks. Through MLLM-assisted video state prediction, a Dual-Query Transformer for condition injection, and spatiotemporal adapters, AID surpasses the previous state-of-the-art FVD scores by over 50% across multiple datasets.
- ALE: Attribute-Leakage-free Editing for Text-based Image Editing
-
This paper identifies semantic entanglement in the EOS embeddings of autoregressive text encoders as the root cause of attribute leakage in text-guided image editing, and proposes the ALE framework to eliminate such leakage via three components: Object-Restricted Embedding (ORE), Region-Guided Blended Cross-Attention Masking (RGB-CAM), and Background Blending (BB). A dedicated benchmark, ALE-Bench, is also introduced for evaluation.
- Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing
-
This paper proposes ISLock, the first training-free image editing framework for autoregressive (AR) visual generation models. Through Anchor Token Matching (ATM), it implicitly aligns self-attention patterns in the latent space, enabling structure-consistent text-guided image editing.
- AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
-
This paper proposes AnimeGamer, an infinite anime life simulation system built upon a multimodal large language model (MLLM). By predicting the next game state via action-aware multimodal representations — comprising dynamic animation shots and character state updates — the system achieves a continuously consistent interactive anime gaming experience.
- Anti-Tamper Protection for Unauthorized Individual Image Generation
-
This paper proposes Anti-Tamper Perturbation (ATP), which decouples protection perturbations (preventing forged generation) and authorization perturbations (detecting purified tampering) into separate frequency-domain regions. When an attacker attempts to purify the protective signal, the anti-tamper mechanism is triggered to deny service, achieving a 100% protection success rate against various purification attacks.
- AnyPortal: Zero-Shot Consistent Video Background Replacement
-
AnyPortal presents a zero-shot, training-free video background replacement framework that synergistically leverages IC-Light's relighting capability and the temporal prior of a video diffusion model (CogVideoX), together with a newly proposed Refinement Projection Algorithm (RPA) for pixel-level foreground preservation, running efficiently on a single 24 GB GPU.
- Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images!
-
This paper exposes the threat of "neural plagiarism"—diffusion models can readily replicate copyright-protected images (including watermarked ones). It proposes a universal attack framework based on "anchors and shims," searching for perturbations in the cross-attention mechanism to achieve coarse-to-fine semantic modification, bypassing copyright protections ranging from visible trademarks to invisible watermarks.
- AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
-
This paper proposes APT (AutoPrompT), a black-box red-teaming framework driven by an LLM. Through an alternating optimize-finetune pipeline and a dual-evasion strategy, APT automatically generates human-readable adversarial suffixes that bypass content filters, effectively circumventing the safety mechanisms of T2I models while enabling zero-shot cross-prompt transferability.
- Balanced Image Stylization with Style Matching Score
-
This paper proposes Style Matching Score (SMS), which recasts image stylization as a style distribution matching problem. Through progressive spectrum regularization and semantic-aware gradient refinement, SMS achieves a superior balance between style alignment and content preservation, and can be distilled into a lightweight feed-forward network for one-step stylization.
- Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video
-
This paper proposes BCD (Bitrate-Controlled Diffusion), a general self-supervised video disentanglement framework that separates per-frame motion features from global content features via a low-bitrate vector quantization information bottleneck, and reconstructs video using a conditional diffusion model. The approach demonstrates high-quality motion transfer and autoregressive video generation on talking-head and pixel-art cartoon datasets.
- 3DSR: Bridging Diffusion Models and 3D Representations for 3D Consistent Super-Resolution
-
3DSR is proposed — an alternating iterative framework coupling diffusion-based SR with 3DGS to achieve 3D-consistent super-resolution: after each denoising step, SR images are used to train a 3DGS, yielding 3D-consistent renderings that are re-encoded into the latent space to guide the next denoising step. Without fine-tuning any model, it explicitly enforces cross-view consistency, achieving +1.16 dB PSNR and 50% FID reduction on LLFF (vs. StableSR).
- Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
-
This paper proposes TDSM (Triplet Diffusion for Skeleton-Text Matching), the first work to apply diffusion models to zero-shot skeleton-based action recognition (ZSAR). TDSM achieves implicit alignment between skeleton features and text prompts through the reverse diffusion process, and introduces a triplet diffusion loss to enhance discriminability. It substantially outperforms state-of-the-art methods on NTU-60/120 and PKU-MMD, with improvements ranging from 2.36% to 13.05%.
- Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition
-
This paper proposes TDSM (Triplet Diffusion for Skeleton-Text Matching), the first work to apply diffusion models to zero-shot skeleton-based action recognition (ZSAR). It achieves implicit alignment between skeleton features and text prompts through the reverse diffusion process, and introduces a triplet diffusion loss to enhance discriminability. TDSM substantially outperforms state-of-the-art methods on NTU-60/120 and PKU-MMD by margins ranging from 2.36% to 13.05%.
- BVINet: Unlocking Blind Video Inpainting with Zero Annotations
-
This paper is the first to formally define and address the task of blind video inpainting: simultaneously predicting where to restore and how to restore, end-to-end, without any annotation of corrupted regions. A mask prediction network and a video completion network mutually reinforce each other via a consistency constraint, achieving strong results on both synthetic data and real-world applications (danmaku overlay removal and film-scratch repair).
- CaO2: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation
-
This paper identifies two critical issues in diffusion-based dataset distillation — objective inconsistency and condition inconsistency — and proposes a two-stage framework, CaO2: the first stage mitigates objective inconsistency via classifier-guided sample selection, and the second stage mitigates condition inconsistency via latent space optimization to maximize conditional likelihood, achieving an average improvement of 2.3% on ImageNet.
- CAP: Evaluation of Persuasive and Creative Image Generation
-
This paper proposes three novel evaluation metrics (creativity, alignment, and persuasiveness) for the task of advertising image generation, and leverages LLMs to expand implicit messages into explicit visual descriptions to improve T2I model performance on advertisement generation, achieving significantly higher agreement with human annotations than baselines such as CLIPScore.
- CharaConsist: Fine-Grained Consistent Character Generation
-
This paper proposes a training-free, fine-grained consistent character generation method that achieves high-quality cross-image character consistency on a DiT architecture (FLUX.1) for the first time, via Point-Tracking Attention, adaptive token merging, and foreground-background decoupled control.
- CHORDS: Diffusion Sampling Accelerator with Multi-Core Hierarchical ODE Solvers
-
This paper proposes CHORDS, a diffusion sampling acceleration framework based on multi-core hierarchical ODE solvers. Through a slow-to-fast inter-core rectification mechanism, CHORDS achieves 2.1×–2.9× speedup on 4–8 GPUs without sacrificing generation quality.
- CHORDS: Diffusion Sampling Accelerator with Multi-Core Hierarchical ODE Solvers
-
This paper proposes CHORDS, a training-free and model-agnostic diffusion sampling acceleration framework based on multi-core hierarchical ODE solvers. By employing a slow-to-fast solver hierarchy and an inter-core rectification mechanism, CHORDS achieves up to 2.9× speedup across 4–8 GPU cores without sacrificing generation quality.
- CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts
-
This paper proposes CNS-Bench, the first benchmark that leverages LoRA adapters to impose continuous and photorealistic nuisance shifts on diffusion models for systematically evaluating the OOD robustness of image classifiers, covering 14 shift types, 5 severity levels, and 40+ classifiers.
- CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
-
CoMPaSS leverages the SCOP data engine to curate spatially unambiguous training data and introduces the parameter-free TENOR module to inject token ordering information into the attention mechanism, substantially improving spatial relationship generation accuracy in T2I diffusion models (VISOR +98%, GenEval Position +131%).
- CompleteMe: Reference-based Human Image Completion
-
This paper proposes the CompleteMe framework, which leverages a dual U-Net architecture and Region-focused Attention (RFA) Block to achieve high-fidelity reference-guided human image completion by exploiting fine-grained person-specific details (clothing textures, tattoos, etc.) from reference images.
- Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal
-
This paper proposes CODiff, a compression-aware one-step diffusion model for JPEG artifact removal. The core contribution is a Compression-aware Visual Embedder (CaVE) that extracts JPEG compression priors via an explicit–implicit dual learning strategy, guiding the diffusion model toward high-quality restoration. CODiff comprehensively outperforms existing methods on LIVE-1, Urban100, and DIV2K-Val while achieving extremely high inference efficiency.
- CompSlider: Compositional Slider for Disentangled Multiple-Attribute Image Generation
-
This paper proposes CompSlider, a compositional slider model that generates conditional priors to enable simultaneous, independent, and fine-grained control over multiple attributes in T2I foundation models. It addresses inter-attribute entanglement via a disentanglement loss and a structural consistency loss.
- Contrastive Flow Matching (ΔFM)
-
A contrastive regularization term is introduced into the Flow Matching training objective to enforce separation between velocity fields of different conditions, achieving 9× training acceleration, 5× fewer sampling steps, and up to 8.9 FID reduction with zero additional inference overhead.
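A minimal sketch of such an objective, assuming a standard linear-interpolation flow-matching setup, an assumed model interface, and an illustrative weight `lambda_c`: the usual regression to the conditional target velocity, minus a term that pushes the prediction away from the target velocity of a shuffled (mismatched) pair in the batch.

```python
import torch
import torch.nn.functional as F

def contrastive_fm_loss(model, x0, x1, cond, lambda_c=0.05):
    """x0: noise batch, x1: data batch (B, C, H, W), cond: conditions."""
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1, 1)
    xt = (1 - t) * x0 + t * x1              # linear interpolation path
    u = x1 - x0                             # target velocity of the pair
    v = model(xt, t.flatten(), cond)        # assumed model interface
    perm = torch.randperm(x1.size(0), device=x1.device)
    u_neg = u[perm]                         # velocity of a mismatched pair
    return F.mse_loss(v, u) - lambda_c * F.mse_loss(v, u_neg)
```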
- CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models
-
This work is the first to explore content-style decomposition (CSD) in visual autoregressive (VAR) models. Through three key innovations—scale-aware alternating optimization, SVD-based style embedding rectification, and augmented key-value memory—CSD-VAR achieves content preservation and style transfer quality that surpasses existing diffusion-model-based methods.
- CURE: Cultural Gaps in the Long Tail of Text-to-Image Systems
-
This work introduces the CURE benchmark and scoring suite, which employs Marginal Information Attribution (MIA) of attribute specifications as a proxy for human judgment to systematically evaluate the representational capacity of T2I systems across the global cultural long tail.
- Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
-
Cycle consistency (reconstruction similarity via image→text→image or text→image→text) is employed as a supervision signal in lieu of human preferences to construct the 866K preference dataset CyclePrefDB. The resulting CycleReward model surpasses all existing methods on detailed caption evaluation and can improve both VLMs and diffusion models via DPO.
- Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
-
This paper proposes CycleReward, which leverages cycle consistency as a self-supervised signal to replace human preference annotations — captions are reconstructed into images via a T2I model and ranked by visual similarity, yielding the 866K preference-pair dataset CyclePrefDB. The trained reward model outperforms HPSv2/PickScore/ImageReward by 6%+ on detailed captioning, and DPO training with it improves VLM performance across multiple vision-language tasks, all without any human annotation.
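A sketch of the scoring direction used to build caption preferences (text→image here), with `t2i_fn` and `embed_fn` as hypothetical stand-ins for a text-to-image model and a perceptual image embedder:

```python
import torch.nn.functional as F

def cycle_reward(image, caption, t2i_fn, embed_fn):
    """Score an (image, caption) pair by reconstructing the image from
    the caption and comparing embeddings with the original."""
    recon = t2i_fn(caption)                              # text -> image
    sim = F.cosine_similarity(embed_fn(image), embed_fn(recon), dim=-1)
    return sim.mean().item()

# Preference pairs: of two candidate captions for the same image, the
# one whose reconstruction lands closer to the source is preferred.
```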
- DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer
-
This paper proposes DC-AR, a masked autoregressive text-to-image generation framework built upon a Deep Compression Hybrid Tokenizer (DC-HT, 32× spatial compression). Through a hybrid pipeline of discrete token generation for structure followed by residual token refinement, DC-AR achieves state-of-the-art gFID of 5.49 on MJHQ-30K while delivering 1.5–7.9× higher throughput than diffusion models.
- DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing
-
DCT-Shield introduces adversarial perturbations in the Discrete Cosine Transform (DCT) domain rather than pixel space, making the immunization noise highly imperceptible and inherently robust to JPEG compression, thereby effectively defending against diffusion-model-based malicious image editing.
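A minimal sketch of the domain choice only, using scipy's DCT on JPEG-aligned 8×8 blocks; the random-sign nudge is an illustrative placeholder for the paper's actual adversarial optimization objective.

```python
import numpy as np
from scipy.fft import dctn, idctn

def perturb_dct(img, eps=2.0, rng=None):
    """img: float grayscale array with H and W divisible by 8."""
    rng = rng or np.random.default_rng(0)
    out = img.copy()
    for i in range(0, img.shape[0], 8):
        for j in range(0, img.shape[1], 8):
            block = dctn(img[i:i+8, j:j+8], norm="ortho")
            # nudge low-frequency coefficients: these survive JPEG's
            # quantization far better than pixel-space noise does
            block[1:4, 1:4] += eps * np.sign(rng.standard_normal((3, 3)))
            out[i:i+8, j:j+8] = idctn(block, norm="ortho")
    return out
```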
- Deeply Supervised Flow-Based Generative Models
-
DeepFlow introduces deep supervision and a VeRA (Velocity Refiner with Acceleration) module between Transformer layers of flow-based models, aligning intermediate-layer velocity features via second-order ODE dynamics. Without relying on any external pretrained model, it achieves an 8× training speedup and significant FID improvement.
- DeepShield: Fortifying Deepfake Video Detection with Local and Global Forgery Analysis
-
This paper proposes DeepShield, a deepfake video detection framework that combines Local Patch Guidance (LPG) and Global Forgery Diversification (GFD). It provides patch-level supervision via spatiotemporal artifact modeling and synthesizes diverse forgery representations through distribution-level feature augmentation, significantly outperforming state-of-the-art methods in cross-dataset and cross-manipulation evaluations.
- Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation
-
Dense2MoE is the first paradigm for converting dense Diffusion Transformers (DiT) into sparse MoE structures. By replacing FFN layers with MoE layers and grouping Transformer blocks into Mixture of Blocks (MoB), combined with a multi-stage distillation pipeline, it compresses FLUX.1's 12B parameters to 5.2B activated parameters while preserving original performance, comprehensively outperforming pruning-based methods.
- Dense Policy: Bidirectional Autoregressive Learning of Actions
-
This paper proposes Dense Policy, a robot manipulation policy based on bidirectional autoregressive expansion, which achieves hierarchical coarse-to-fine action generation in logarithmic time and surpasses mainstream generative policies such as Diffusion Policy and ACT on both simulation and real-world tasks.
- Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent
-
This paper proposes DescriptiveEdit, which reframes "instruction-based image editing" as "text-to-image generation conditioned on a reference image." A Cross-Attentive UNet introduces attention bridge layers to inject reference image features into the generation process. With only 75M trainable parameters, the method achieves high-fidelity descriptive editing and is seamlessly compatible with community tools such as ControlNet and IP-Adapter.
- DIA: The Adversarial Exposure of Deterministic Inversion in Diffusion Models
-
This paper proposes DDIM Inversion Attack (DIA), which disrupts the image editing capability of diffusion models by directly attacking the DDIM inversion trajectory. DIA effectively defends against malicious deepfake generation and privacy-violating content synthesis, substantially outperforming existing defenses such as AdvDM and Photoguard across diverse editing methods.
- DICE: Staleness-Centric Optimizations for Parallel Diffusion MoE Inference
-
DICE is a framework targeting the staleness problem in parallel inference of MoE-based diffusion models. Through three levels of optimization — step-level interweaved parallelism, layer-level selective synchronization, and token-level conditional communication — DICE achieves a 1.26× speedup on DiT-MoE with negligible quality degradation.
- DiffDoctor: Diagnosing Image Diffusion Models Before Treating
-
This paper proposes DiffDoctor, the first method to fine-tune diffusion models using pixel-level feedback. It first trains a robust artifact detector (1M+ samples with a category-balancing strategy), then backpropagates gradients through the detector to the diffusion model by minimizing the artifact confidence of each pixel in synthesized images, achieving significant artifact reduction on unseen prompts.
- DiffSim: Taming Diffusion Models for Evaluating Visual Similarity
-
DiffSim is the first work to discover that attention-layer features of pretrained diffusion models (Stable Diffusion) can be used to measure visual similarity. It proposes the Aligned Attention Score (AAS), which aligns features from two images in the self-attention and cross-attention layers of the U-Net and computes cosine similarity, achieving state-of-the-art performance on multiple benchmarks covering human perceptual alignment, style similarity, and instance consistency.
- DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching
-
This paper proposes training an unconditional diffusion model in the spectral domain of Functional Maps, and replacing hand-crafted axiomatic regularizers (e.g., Laplacian commutativity, orthogonality) with distilled structural priors, enabling zero-shot non-rigid shape matching across categories.
- Diffusion-based 3D Hand Motion Recovery with Intuitive Physics
-
This paper proposes a physics-augmented conditional diffusion model that refines per-frame 3D hand reconstruction results into temporally consistent motion sequences via an iterative denoising process, incorporating intuitive physics constraints (kinematic and stability constraints) to substantially improve reconstruction accuracy and physical plausibility.
- DIIP: Diffusion Image Prior
-
This paper discovers that pretrained diffusion models exhibit an implicit bias analogous to Deep Image Prior (DIP) when reconstructing degraded images—the iterative optimization first produces a clean image before overfitting to the degraded input—and that this bias generalizes to a broader range of degradation types than DIP. Based on this finding, the authors propose DIIP, a fully blind (degradation-model-free) image restoration method.
- Discovering Divergent Representations between Text-to-Image Models
-
This paper proposes CompCon (Comparing Concepts), an evolutionary search algorithm that automatically discovers "divergent representations" between two text-to-image models — identifying which visual attributes differ between models and which prompt types trigger these differences — and introduces the ID² benchmark dataset for systematic evaluation.
- Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy
-
This paper proposes PaRaMS (Parameter Rearrangement & Random Multi-head Scaling), a parameter-level proactive defense method that displaces a protected model away from the shared loss basin via functionally equivalent parameter transformations, causing severe performance degradation upon merging while preserving original performance when the model is used standalone.
- DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers
-
DiTFastAttnV2 is proposed for multi-modality diffusion Transformers (MMDiT), achieving fine-grained attention compression via Head-wise Arrow Attention and Head-wise Caching mechanisms. It reduces attention FLOPs by 68% and achieves 1.5× end-to-end speedup on 2K image generation without visual quality degradation.
- DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization
-
DMQ is a framework that combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to address outlier problems in diffusion model quantization, achieving, for the first time, stable high-quality image generation under the W4A6 low-bit setting.
- Domain Generalizable Portrait Style Transfer
-
DGPST proposes a diffusion-based portrait style transfer framework that establishes cross-domain dense semantic correspondences via a semantic adapter to warp the reference image, employs AdaIN-Wavelet Transform for latent space initialization to balance stylization and content preservation, and generates final results through a dual-conditional diffusion model combining ControlNet (high-frequency structural guidance) and a style adapter (style guidance). The model is trained solely on 30K real portrait photographs yet generalizes to diverse domains including photos, cartoons, sketches, and anime.
- DPoser-X: Diffusion Model as Robust 3D Whole-Body Human Pose Prior
-
This paper proposes DPoser-X, a 3D whole-body human pose prior based on an unconditional diffusion model. It unifies various pose-related tasks as inverse problems and performs test-time optimization via a truncated timestep schedule for variational diffusion sampling. A hybrid training strategy is introduced to effectively combine whole-body and part-specific datasets. DPoser-X achieves up to 61% improvement across 8 benchmarks covering body, hand, face, and whole-body modeling.
- DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses
-
DreamDance proposes a human image animation framework that takes only 2D skeleton pose sequences as input. It first generates mutually aligned depth maps and normal maps from 2D poses via a Mutually Aligned Geometry Diffusion Model to enrich 3D geometric guidance, then integrates multi-level guidance signals through an SVD-based Cross-Domain Controlled Video Diffusion Model to synthesize high-quality human animations. The method achieves state-of-the-art performance on the TikTok dataset (FVD 153.07 vs. Champ 170.20).
- Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion
-
This paper proposes Dual Recursive Feedback (DRF), a training-free system that recursively refines intermediate latents via appearance feedback and generation feedback. DRF addresses the insufficient structure/appearance disentanglement of controllable T2I diffusion models in class-invariant scenarios, thereby achieving fine-grained pose transfer and appearance fusion.
- DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability
-
DynamicID achieves zero-shot single/multi-identity personalized image generation through two core components — Semantic Activation Attention (SAA) and Identity-Motion Reconfigurer (IMR) — while maintaining high fidelity and flexible facial editability.
- Early Timestep Zero-Shot Candidate Selection for Instruction-Guided Image Editing
-
This paper proposes ELECT (Early-timestep Latent Evaluation for Candidate selecTion), a zero-shot framework that selects the optimal seed by estimating background inconsistency at early denoising timesteps, reducing computational overhead by 41% (up to 61%) while improving background consistency and instruction-following quality, without requiring external supervision or additional training.
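Schematically, seed selection might look as follows, with `edit_few_steps` and `decode_latent` as hypothetical helpers and the edit mask assumed given; the real scoring operates on early-timestep estimates rather than full generations.

```python
import torch

@torch.no_grad()
def select_seed(source, mask, seeds, edit_few_steps, decode_latent,
                n_early=5):
    """source: (B, C, H, W) image; mask: 1 inside the edit region."""
    best_seed, best_score = None, float("inf")
    for seed in seeds:
        torch.manual_seed(seed)               # seed drives the init noise
        latent = edit_few_steps(source, n_steps=n_early)
        preview = decode_latent(latent)       # rough early x0 estimate
        score = ((preview - source) * (1 - mask)).abs().mean().item()
        if score < best_score:                # lower = more consistent bg
            best_seed, best_score = seed, score
    return best_seed
```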
- EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Equivariant Flow Matching
-
EC-Flow introduces an "embodiment-centric flow" paradigm that predicts pixel-level motion trajectories of the robot body from action-unlabeled RGB videos, and converts visual predictions into executable actions via URDF kinematic constraints. It substantially outperforms object-centric methods in scenarios involving deformable objects, occlusion, and non-displacement manipulation.
- EDiT: Efficient Diffusion Transformers with Linear Compressed Attention
-
EDiT proposes a linear compressed attention mechanism that enhances local query information via ConvFusion and compresses key/value tokens via a Spatial Compressor, achieving up to 2.2× speedup on DiT and MM-DiT architectures while maintaining comparable image quality.
- EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing
-
This paper proposes EEdit, an efficient image editing framework that achieves an average 2.46× speedup without quality degradation across diverse editing tasks—including prompt-guided, drag-based, and image composition editing—via three components: Spatial Locality Caching (SLoC) to skip computation in unedited regions, Token Index Preprocessing (TIP) for lossless acceleration of caching operations, and Inversion Step Skipping (ISS) to reduce inversion redundancy.
- Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization
-
OAT proposes an adaptive octree tokenization scheme guided by quadric error metrics (QEM) that dynamically allocates token budgets according to local geometric complexity, reducing token count by 50% while preserving reconstruction quality. On this representation it builds an autoregressive model, OctreeGPT, for high-quality text-to-3D generation.
- Efficient Input-Level Backdoor Defense on Text-to-Image Synthesis via Neuron Activation Variation
-
NaviT2I identifies an "Early-step Activation Variation" phenomenon induced by backdoor triggers in text-to-image diffusion models, and proposes an efficient input-level backdoor defense framework that requires only the first diffusion iteration for analysis. The method achieves an average AUROC of 96.3% across 8 mainstream attacks while incurring only 3.8%–16.7% of the computational cost of existing methods.
- EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model
-
EmotiCrafter is proposed as the first emotional image generation method based on a continuous Valence-Arousal (V-A) model. It integrates V-A values into text features via an emotional embedding mapping network, which is then injected into Stable Diffusion XL to achieve precise dual control over content and emotion. The generated images significantly outperform existing methods in terms of emotional continuity and controllability.
- End-to-End Multi-Modal Diffusion Mamba
-
This paper proposes Multi-Modal Diffusion Mamba (MDM), an end-to-end multimodal model based on the Mamba architecture. By employing a unified VAE encoder-decoder and a multi-step selective diffusion model, MDM enables simultaneous generation of images and text with computational complexity \(\mathcal{O}(MLN^2)\), surpassing existing end-to-end models on tasks including image generation, image captioning, and VQA.
- Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment
-
This paper identifies a "scoring paradox" in CLIP/BLIP-based reward models when evaluating high-quality images — detail-rich, high-fidelity images are paradoxically assigned lower scores. The authors propose two new metrics: ICT Score (Image-Contained-Text, measuring the degree to which an image encodes the textual information) and HP Score (a purely image-modal human preference score). Training on the Pick-High dataset yields over 10% improvement in preference prediction accuracy and successfully guides SD3.5-Turbo toward generating higher-quality images.
- Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts
-
This paper systematically analyzes the unintended negative effects of concept erasure techniques on non-target concepts (spillover degradation) in text-to-image models. It proposes EraseBench, a comprehensive evaluation framework covering multiple dimensions including visual similarity, binomial association, and semantic relatedness. The findings reveal that current state-of-the-art erasure methods remain unreliable in preserving the generation quality of non-target concepts.
- Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing
-
This paper systematically analyzes the attention mechanism of Multimodal Diffusion Transformers (MM-DiT), decomposing the attention matrix into four functional sub-blocks (I2I/T2I/I2T/T2T). Based on this analysis, it proposes an efficient prompt-based image editing method that operates by replacing image input projections (\(\mathbf{q}_i, \mathbf{k}_i\)), and is applicable to multiple MM-DiT variants including the SD3 series and Flux.1.
- FaceCraft4D: Animated 3D Facial Avatar Generation from a Single Image
-
This paper proposes FaceCraft4D, a framework that generates animatable 360° 4D facial avatars from a single image by combining three complementary priors: a 3D shape prior (PanoHead GAN inversion), a 2D image prior (diffusion model texture enhancement), and a video prior (LivePortrait expression animation). A COIN training strategy is introduced to address multi-view data inconsistency, enabling high-quality real-time rendering at 156 FPS.
- Fair Generation without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention
-
This paper proposes Entanglement-Free Attention (EFA), a training-free inference-time debiasing method that injects target attributes (e.g., gender, race) into person regions by modifying the cross-attention mechanism, while preserving non-target attributes (e.g., background, objects). EFA eliminates generative bias without introducing new unfair associations.
- FedDifRC: Unlocking the Potential of Text-to-Image Diffusion Models in Heterogeneous Federated Learning
-
This paper is the first to introduce the internal representations of a pretrained text-to-image diffusion model (Stable Diffusion) into federated learning, proposing the FedDifRC framework. Through two complementary modules—Text-Driven Diffusion Contrastive Learning (TDCL) and Noise-Driven Diffusion Consistency Regularization (NDCR)—the framework effectively mitigates data heterogeneity and achieves significant performance improvements on global models across diverse non-iid settings.
- Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment
-
This paper proposes PostDiff — a training-free diffusion model acceleration framework that reduces redundancy at two levels: at the input level via a mixed-resolution denoising strategy (low resolution in early steps → high resolution in later steps), and at the module level via a hybrid caching strategy (DeepCache + cross-attention caching). The work systematically addresses the key question of whether reducing the number of denoising steps or reducing the per-step computation cost is more effective — concluding that the latter is superior across most efficiency regimes.
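A schematic of the input-level half (mixed-resolution denoising), with `denoise_step` as a hypothetical one-step callable; the module-level caching half is omitted.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mixed_res_sample(denoise_step, latent, timesteps, switch_frac=0.5):
    """Run early steps at half resolution, then finish at full size."""
    h, w = latent.shape[-2:]
    x = F.interpolate(latent, scale_factor=0.5, mode="bilinear")
    switch = int(len(timesteps) * switch_frac)
    for i, t in enumerate(timesteps):
        if i == switch:                    # move to full resolution late
            x = F.interpolate(x, size=(h, w), mode="bilinear")
        x = denoise_step(x, t)
    return x
```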
- FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation
-
FICGen is proposed as the first method to address the "contextual illusion dilemma" in Layout-to-Image (L2I) generation for degraded scenes (low-light, underwater, remote sensing, adverse weather, etc.). It extracts high- and low-frequency prototypes of degraded scenes via a learnable dual-query mechanism, injects them into the latent diffusion space through visual-frequency enhanced attention, and achieves foreground-background disentanglement using instance consistency maps and spatial-frequency adaptive aggregation. FICGen comprehensively outperforms existing L2I methods across five degraded-scene datasets.
- Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
-
Fix-CLIP enhances CLIP's long-text understanding capability through three key innovations: (1) a dual-branch training pipeline that aligns short texts with masked images and long texts with original images; (2) learnable Regional Prompts with unidirectional attention masks for local visual feature extraction; and (3) a hierarchical feature alignment module that aligns multi-scale features across intermediate encoder layers. After incremental training on 30M synthetic long-text data, Fix-CLIP substantially outperforms state-of-the-art methods on both long- and short-text retrieval benchmarks. Its text encoder can be directly plugged into diffusion models to improve long-text generation quality.
- FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
-
This paper proposes FLOAT, an audio-driven talking portrait generation method based on Flow Matching, which employs a Transformer architecture to predict vector fields in an orthogonal motion latent space. The approach enables efficient (~10-step sampling), temporally consistent, high-quality talking video generation, with additional support for speech-driven emotion enhancement and test-time head pose editing.
- FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems
-
FlowDPS derives a Tweedie formula for flow models to decompose the Flow ODE into a clean image estimation component and a noise estimation component. Likelihood gradients are injected into the clean image component while stochastic noise is introduced into the noise component, enabling principled posterior sampling for inverse problems under the flow matching framework. FlowDPS surpasses all prior methods on four linear inverse problems using SD3.0.
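To make the decomposition concrete, assume an SD3-style rectified-flow path \(x_t = (1-t)\,x_0 + t\,\epsilon\) with learned velocity \(v_\theta \approx \epsilon - x_0\) (the exact convention is my assumption, not stated in the summary). Rearranging gives the Tweedie-style estimates \(\hat{x}_0 = x_t - t\,v_\theta(x_t, t)\) and \(\hat{\epsilon} = x_t + (1-t)\,v_\theta(x_t, t)\): the measurement-likelihood gradient is injected through \(\hat{x}_0\), while fresh stochastic noise enters through \(\hat{\epsilon}\) before recombining along the path.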
- FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
-
FlowEdit proposes an inversion-free, optimization-free, model-agnostic text-based image editing method that constructs an ODE path directly between the source and target distributions of a pre-trained flow model, achieving structure-preserving editing with lower transport cost than inversion-based approaches.
- FlowTok: Flowing Seamlessly Across Text and Image Tokens
-
FlowTok proposes encoding both text and images as compact 1D token representations (\(77 \times 16\)) and directly evolving between text and image tokens via flow matching, eliminating the need for complex conditioning mechanisms or noise schedules, thereby enabling efficient cross-modal generation.
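A minimal sketch of the training objective under the stated token shapes (the model interface is an assumption): both modalities live as \(77 \times 16\) token arrays, and the velocity model is regressed on the straight path from text tokens to image tokens, with no noise schedule or cross-attention conditioning.

```python
import torch
import torch.nn.functional as F

def flowtok_loss(model, text_tokens, image_tokens):
    """text_tokens, image_tokens: (B, 77, 16) compact 1D token grids."""
    b = text_tokens.size(0)
    t = torch.rand(b, 1, 1, device=text_tokens.device)
    xt = (1 - t) * text_tokens + t * image_tokens   # straight path
    target_v = image_tokens - text_tokens           # constant velocity
    v = model(xt, t.flatten())                      # assumed interface
    return F.mse_loss(v, target_v)
```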
- ForgeLens: Data-Efficient Forgery Focus for Generalizable Forgery Image Detection
-
This paper proposes ForgeLens, a feature-guided framework built upon a frozen CLIP-ViT backbone. Through a lightweight Weight-Shared Guidance Module (WSGM) and a Forgery-Aware Feature Integrator (FAFormer), it steers the frozen pretrained network to focus on forgery-relevant features, achieving state-of-the-art generalization performance with only 1% of training data.
- Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
-
This paper proposes Free4D, the first tuning-free framework for single-image 4D scene generation. It achieves spatial consistency via 4D geometric structure initialization and adaptive guidance denoising, temporal consistency via reference latent replacement, and integrates multi-view information into a coherent 4D Gaussian representation through modulation-based refinement, enabling real-time controllable rendering.
- FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers
-
This paper proposes FreeCus, a completely training-free subject-driven customization framework that activates the intrinsic zero-shot subject customization capability of Diffusion Transformers (DiT) through three innovations: pivotal attention sharing, an upgraded dynamic shifting mechanism for fine-grained feature extraction, and multimodal large language model (MLLM) semantic enhancement. FreeCus achieves results comparable to or better than methods that require additional training.
- FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model
-
FreeMorph proposes the first tuning-free generalized image morphing method. Through two key designs—guidance-aware spherical interpolation and step-oriented change trend—it generates smooth transition sequences between image pairs of arbitrary semantics and layouts within 30 seconds, achieving a speed improvement of 10–50× over existing methods.
- FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
-
This paper proposes FreeScale, a tuning-free inference paradigm that extracts and fuses information from different receptive field scales via a Scale Fusion mechanism (global high-frequency + local low-frequency), combined with tailored self-cascade upscaling and restrained dilated convolution, achieving for the first time text-to-image generation at 8K resolution on a single A800 GPU, while also supporting high-resolution video generation.
- From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
-
This paper proposes TaylorSeer, which upgrades the feature caching paradigm for diffusion models from "cache-and-reuse" to "cache-and-forecast" — leveraging Taylor series expansion with high-order finite differences over historical features to predict intermediate features at future timesteps. TaylorSeer achieves near-lossless 4.99× acceleration on FLUX and 5.00× on HunyuanVideo, entirely without additional training.
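A small sketch of cache-and-forecast as I read it from the summary: the k-th derivative is estimated by the k-th finite difference over features cached at uniform timestep intervals, and the Taylor terms extrapolate one interval ahead. This is illustrative, not the authors' code.

```python
import torch

def taylor_forecast(history, order=2):
    """history: list of cached features at equally spaced timesteps,
    oldest first (needs len(history) >= order + 1). Returns the
    extrapolated feature one cache interval ahead."""
    pred = history[-1].clone()
    diffs = list(history)
    fact = 1.0
    for k in range(1, order + 1):
        diffs = [b - a for a, b in zip(diffs[:-1], diffs[1:])]
        if not diffs:
            break
        fact *= k
        # k-th finite difference approximates f^(k) * h^k, so the
        # Taylor term at step h is diffs[-1] / k!
        pred = pred + diffs[-1] / fact
    return pred
```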
- GameFactory: Creating New Games with Generative Interactive Videos
-
This paper proposes GameFactory, a multi-stage training strategy that decouples game style from action control on top of a pretrained video diffusion model, enabling action control learned from small-scale Minecraft data to generalize to arbitrary open-domain scenes for interactive game video generation. It is the first method with a complete technical paper to validate scene generalization over a complex action space (7 keys + mouse).
- GAP: Gaussianize Any Point Clouds with Text Guidance
-
This paper proposes GAP, a framework that leverages depth-aware image diffusion models to convert colorless point clouds into high-fidelity 3D Gaussian representations. A surface anchoring mechanism ensures geometric fidelity, and a diffusion-based inpainting strategy completes hard-to-observe regions.
- Generating Multi-Image Synthetic Data for Text-to-Image Customization
-
This paper proposes SynCD (Synthetic Customization Dataset) and its generation pipeline, which synthesizes multi-image consistent object datasets using shared attention and 3D asset priors. The trained encoder model surpasses existing encoder-based methods without requiring test-time optimization.
- Generative Modeling of Shape-Dependent Self-Contact Human Poses
-
This work constructs Goliath-SC, the first large-scale self-contact pose dataset with accurate shape annotations (383K poses / 130 subjects), proposes PAPoseDiff—a shape-conditioned part-aware latent diffusion model for modeling body-shape-dependent self-contact pose distributions—and leverages the learned diffusion prior for monocular pose refinement, outperforming SOTA methods such as BUDDI and SMPLer-X on unseen subjects.
- GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
-
This paper proposes GenFlowRL, which integrates generative object-centric optical flow with reinforcement learning by shaping rewards using a δ-flow representation extracted from a flow generation model trained on cross-embodiment datasets. The approach enables robust and generalizable robot manipulation policy learning, significantly outperforming flow-based imitation learning and video-guided RL methods across 10 manipulation tasks.
- GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
-
This work identifies that "perfect image reconstruction does not always yield the best visual representations," and proposes GenHancer — a two-stage post-training method that uses only a lightweight randomly initialized denoiser (~1/10 the parameters of pretrained heavy denoisers) conditioned solely on the global [CLS] token. Through self-supervised reconstruction, GenHancer enhances CLIP's fine-grained visual perception, achieving a 6.0% improvement over DIVA on MMVP-VLM.
- Golden Noise for Diffusion Models: A Learning Framework
-
This paper introduces the concept of "Noise Prompt" and proposes a lightweight Noise Prompt Network (NPNet). By collecting 100K noise pairs via Re-denoise Sampling, NPNet is trained to transform random Gaussian noise into semantically informed "golden noise," serving as a plug-and-play module to improve the generation quality of SDXL and other diffusion models with only a 3% increase in inference time.
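A schematic of the training side, assuming (noise, golden-noise) pairs already collected via Re-denoise Sampling; the residual MLP is an illustrative stand-in for the actual NPNet architecture.

```python
import torch
import torch.nn as nn

class NPNet(nn.Module):
    """Residual MLP mapping flattened Gaussian noise to 'golden' noise."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.SiLU(), nn.Linear(2 * dim, dim))

    def forward(self, noise):
        return noise + self.net(noise)     # predict a small residual

def train_step(npnet, opt, noise, golden):
    """noise, golden: (B, D) pairs from Re-denoise Sampling."""
    opt.zero_grad()
    loss = (npnet(noise) - golden).pow(2).mean()
    loss.backward()
    opt.step()
    return loss.item()
```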
- Grouped Speculative Decoding for Autoregressive Image Generation
-
This paper proposes Grouped Speculative Decoding (GSD), a training-free acceleration method for autoregressive image generation that performs speculative verification at the level of semantically valid token clusters rather than the single most probable token, achieving an average speedup of 3.7× without degrading image quality.
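A toy rendering of group-level acceptance, with the nucleus (top-p) set as an illustrative simplification of the paper's semantically valid token clusters: the drafted token is accepted whenever it falls inside the target model's high-probability group, not only when it matches the argmax.

```python
import torch

def group_accept(draft_token, target_logits, top_p=0.9):
    """target_logits: (V,) logits of the target model at this position."""
    probs = torch.softmax(target_logits, dim=-1)
    sorted_p, sorted_idx = probs.sort(descending=True)
    keep = sorted_p.cumsum(-1) - sorted_p < top_p   # nucleus membership
    group = set(sorted_idx[keep].tolist())
    return int(draft_token) in group                # accept if in-group
```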
- Guiding Noisy Label Conditional Diffusion Models with Score-based Discriminator Correction
-
This paper proposes Score-based Discriminator Correction (SBDC), which trains a lightweight discriminator to correct the generation trajectory of noisy-label conditional diffusion models at inference time. The discriminator is trained by partitioning the training set into clean and corrupted subsets via noise detection, and the paper finds that applying guidance only during the early-to-middle stages of the sampling process yields optimal results.
- Holistic Tokenizer for Autoregressive Image Generation
-
This paper proposes Hita, a holistic-to-local image tokenizer that captures global attributes such as texture, material, and shape via learnable global queries, and integrates dual codebooks with a causal-attention fusion module. Without modifying the AR model architecture, Hita reduces ImageNet 256×256 generation FID to 2.59, accelerates training convergence by 2.1×, and supports zero-shot style transfer and image completion.
- Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning
-
HUB introduces the first comprehensive benchmark for evaluating concept unlearning methods in text-to-image diffusion models, covering 33 target concepts across 6 evaluation dimensions (faithfulness, alignment, pinpoint-ness, multilingual robustness, adversarial robustness, and efficiency), with 16,000 prompts per concept. The benchmark reveals that no single method achieves superiority across all dimensions.
- HPSv3: Towards Wide-Spectrum Human Preference Score
-
HPSv3 constructs the first wide-spectrum human preference dataset HPDv3 (1.08M image-text pairs and 1.17M annotated preference pairs), trains a preference model using a VLM backbone (Qwen2-VL) with an uncertainty-aware ranking loss, and proposes a Chain-of-Human-Preference (CoHP) iterative generation method, significantly improving the accuracy and coverage of image generation evaluation.

- HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation
-
This work combines the hierarchical representation learning capacity of hyperbolic space with the high-quality generative capability of diffusion autoencoders. By manipulating the radius and direction of latent codes within the Poincaré disk, it achieves controllable, diverse, and class-consistent few-shot image generation.
- ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
-
This paper proposes ILLUME, a unified MLLM that integrates multimodal understanding and generation capabilities into a single LLM via a unified next-token prediction paradigm. Through a semantic visual tokenizer (reducing pretraining data by 4× to 15M) and a self-enhancement multimodal alignment scheme (enabling the model to self-evaluate the consistency between its generated images and text), ILLUME achieves competitive or superior performance compared to state-of-the-art unified models across diverse understanding, generation, and editing tasks.
- ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
-
This paper introduces ImageGem, the first large-scale in-the-wild generative interaction dataset (57K users, 242K customized LoRAs, 3M text prompts, 5M generated images). Its per-user preference annotations enable three applications: aggregate preference alignment that surpasses Pick-a-Pic, personalized retrieval and generative recommendation (with significant gains from VLM reranking), and the first generative model personalization, which learns preference editing directions in the LoRA weight space (W2W) to customize diffusion models.
- Improved Noise Schedule for Diffusion Training
-
This paper proposes a unified framework for analyzing and designing noise schedules in diffusion models from a probability distribution perspective. It finds that a Laplace noise schedule—which concentrates sampling probability near \(\log\text{SNR}=0\) (the signal–noise transition point)—improves FID by 26.6% over the standard cosine schedule under the same training budget, outperforming all loss-weighting adjustment methods.
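A minimal sketch of the schedule idea, assuming the standard variance-preserving identity \(\bar\alpha = \mathrm{sigmoid}(\log \mathrm{SNR})\); the location and scale defaults below are illustrative, not the paper's tuned values.

```python
import torch

def sample_laplace_logsnr(batch: int, mu: float = 0.0, b: float = 1.0):
    """Draw log-SNR values from Laplace(mu, b) via inverse-CDF sampling,
    concentrating training near logSNR = 0 (the signal-noise transition)."""
    u = (torch.rand(batch) - 0.5).clamp(-0.4999, 0.4999)
    logsnr = mu - b * torch.sign(u) * torch.log1p(-2 * u.abs())
    alpha_bar = torch.sigmoid(logsnr)  # since SNR = alpha_bar / (1 - alpha_bar)
    return logsnr, alpha_bar

# Standard DDPM-style corruption with the sampled level:
# x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps
```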
- Inference-Time Diffusion Model Distillation
-
This paper proposes Distillation++, an inference-time diffusion distillation framework that leverages a pretrained teacher model during the student model's sampling process to correct its denoising trajectory, significantly narrowing the teacher–student performance gap without requiring additional training data or fine-tuning.
- InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
-
This paper proposes InfGen, a "second-generation" paradigm that replaces the VAE decoder with a Transformer-based generator, decoding fixed-size latents into images at arbitrary resolution in a single forward pass—without modifying or retraining the diffusion model. It reduces 4K image generation to under 10 seconds, achieving over 10× speedup compared to the fastest existing method, UltraPixel.
- InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation
-
InfiniDreamer leverages a pretrained short-sequence motion diffusion model as a prior and proposes Segment Score Distillation (SSD), an optimization method that iteratively refines overlapping short segments within a coarsely initialized long motion sequence, enabling arbitrarily long human motion generation without requiring additional long-sequence training data.
- Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping
-
This paper proposes Inpaint4Drag, which decomposes drag-based image editing into two stages: pixel-space bidirectional warping and image inpainting. Inspired by elastic object deformation, the proposed bidirectional warping algorithm enables real-time preview (0.01s) and efficient generation (0.3s), achieving a 600× speedup over existing methods while serving as a universal adapter for arbitrary inpainting models.
- IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features
-
This paper proposes IntroStyle, a training-free style attribution method that leverages only channel-wise mean and variance statistics from intermediate layers of a diffusion model's own denoising network, measuring inter-image style similarity via the 2-Wasserstein distance. IntroStyle substantially outperforms supervised state-of-the-art methods on WikiArt and DomainNet without any task-specific training.
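Since both the statistics and the distance have closed forms, the core metric fits in a few lines. The sketch below assumes `feat_a` and `feat_b` are intermediate denoising-network activations of shape (C, H, W) and models each as a diagonal Gaussian over channels; the paper's choice of layers and timesteps, and any aggregation across them, are omitted.

```python
import torch

def style_w2(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """2-Wasserstein style distance between two (C, H, W) feature maps,
    each summarized as a diagonal Gaussian over channels."""
    mu_a, sd_a = feat_a.mean(dim=(1, 2)), feat_a.std(dim=(1, 2))
    mu_b, sd_b = feat_b.mean(dim=(1, 2)), feat_b.std(dim=(1, 2))
    # Closed form for diagonal Gaussians:
    # W2^2 = ||mu_a - mu_b||^2 + ||sd_a - sd_b||^2
    return (((mu_a - mu_b) ** 2).sum() + ((sd_a - sd_b) ** 2).sum()).sqrt()
```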
- Invisible Watermarks, Visible Gains: Steering Machine Unlearning with Bi-Level Watermarking Design
-
This paper proposes Water4MU, a framework that integrates digital watermarking with machine unlearning (MU) via bi-level optimization (BLO). The upper level optimizes the watermark network to facilitate unlearning, while the lower level performs the unlearning optimization, thereby substantially improving unlearning effectiveness without significantly compromising model utility.
- IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark
-
This paper proposes IRGPT, the first multimodal large language model grounded in real-world infrared images. It introduces IR-TD, a large-scale infrared-text dataset containing 260K+ image-text pairs, and designs a Bi-cross-modal Curriculum transfer learning strategy. IRGPT achieves state-of-the-art performance across 9 infrared task benchmarks, with its summed zero-shot score exceeding the InternVL2-8B baseline by 76.35 points.
- Joint Diffusion Models in Continual Learning
-
This paper proposes JDCL, which unifies a classifier and a diffusion generative model into a single jointly parameterized network. Combined with knowledge distillation and a two-stage training strategy, JDCL substantially alleviates catastrophic forgetting in generative replay-based continual learning, surpassing existing generative replay methods.
- LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering
-
This paper proposes LaRender, a training-free image generation method grounded in volume rendering principles. It precisely controls inter-object occlusion relationships by "rendering" object features in latent space. The method replaces only the cross-attention layers of a pretrained diffusion model without introducing any learnable parameters, significantly outperforming existing SOTA methods in occlusion accuracy while enabling rich effects such as semantic transparency control.
- Latent Diffusion Models with Masked AutoEncoders
-
This paper systematically analyzes three key properties that autoencoders in LDMs should possess (latent space smoothness, perceptual compression quality, and reconstruction quality), identifies that existing autoencoders fail to satisfy all three simultaneously, and proposes Variational Masked AutoEncoders (VMAEs). By combining MAE's hierarchical features with VAE's probabilistic encoding, VMAEs achieve significant improvements in generation quality (ImageNet-1K gFID: 5.98 vs. 6.49 for SD-VAE) using only 13.4% of the parameters and 4.1% of the GFLOPs.
- LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization
-
LATINO-PRO is the first work to embed Latent Consistency Models (LCMs) as generative priors within a zero-shot inverse problem solving framework, achieving state-of-the-art reconstruction quality with only 8 neural function evaluations (NFEs), and further improving performance via empirical Bayes-based automatic text prompt calibration.
- Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers
-
This paper proposes LayouSyn, an open-vocabulary text-to-layout generation pipeline that extracts scene elements via a lightweight open-source language model and generates layouts using an aspect-ratio-aware diffusion Transformer, achieving state-of-the-art performance on spatial and quantity reasoning benchmarks.
- LazyMAR: Accelerating Masked Autoregressive Models via Feature Caching
-
LazyMAR addresses the inference efficiency bottleneck of Masked Autoregressive (MAR) models by exploiting two types of redundancy: token redundancy (most token features are highly similar across adjacent decoding steps) and condition redundancy (the residual between conditional and unconditional outputs in classifier-free guidance changes minimally between adjacent steps). Based on these observations, the paper proposes token cache and condition cache mechanisms, achieving a 2.83× speedup with negligible loss in generation quality.
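A toy sketch of the token-cache logic implied by these observations; the cosine-similarity test, the threshold `tau`, and the function names are assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def cached_forward(tokens_now, tokens_prev, cached_feats, compute_fn, tau=0.95):
    """Recompute features only for tokens whose embeddings changed noticeably
    since the previous decoding step; reuse the cache for the rest."""
    sim = F.cosine_similarity(tokens_now, tokens_prev, dim=-1)
    stale = sim < tau                  # tokens that drifted past the threshold
    out = cached_feats.clone()
    if stale.any():
        out[stale] = compute_fn(tokens_now[stale])
    return out
```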
- LD-RPS: Zero-Shot Unified Image Restoration via Latent Diffusion Recurrent Posterior Sampling
-
LD-RPS proposes a zero-shot, dataset-free unified image restoration method that performs recurrent posterior sampling via a pretrained latent diffusion model. It leverages multimodal large language models for semantic priors and a learnable F-PAM module to align the degradation domain, achieving high-quality blind restoration across diverse degradation types.
- Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model
-
TP-Diff is the first work to introduce diffusion models into unpaired image deblurring. It learns spatially varying texture priors via a memory-augmented Texture Prior Encoder (TPE), and designs a Filter-Modulated Multi-head Self-Attention (FM-MSA) to leverage these priors for precise deblurring, achieving a new unsupervised state-of-the-art on multiple benchmarks with only 11.89M parameters.
- Learning Few-Step Diffusion Models by Trajectory Distribution Matching
-
This paper proposes Trajectory Distribution Matching (TDM), a novel paradigm that unifies trajectory distillation and distribution matching by aligning the marginal distributions of student and teacher ODE trajectories at the distributional level. TDM enables efficient few-step diffusion model distillation, requiring only 2 A800 GPU-hours to distill PixArt-α into a 4-step generator that surpasses the teacher model.
- Learning to See in the Extremely Dark
-
This paper proposes a paired-to-paired data synthesis pipeline to construct SIED, a RAW image enhancement dataset for extremely dark scenes (down to 0.0001 lux), and designs a diffusion-model-based framework that achieves high-quality restoration of ultra-low-SNR RAW images via an Adaptive Illumination Correction Module (AICM) and a color consistency loss.
- Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
-
This paper proposes UNO, a universal DiT-based customized generation model. Through a "model-data co-evolution" paradigm—wherein synthetic data generated by weaker models progressively trains stronger models—combined with progressive cross-modal alignment and Universal RoPE, UNO achieves state-of-the-art performance on both single- and multi-subject-driven image generation (DreamBench DINO 0.760, CLIP-I 0.835).
- Less is More: Improving Motion Diffusion Models with Sparse Keyframes
-
This paper proposes sMDM, a motion diffusion framework centered on sparse keyframes. By introducing a masking-interpolation strategy and the Visvalingam-Whyatt keyframe selection algorithm, sMDM reduces redundant frame processing and consistently outperforms dense-frame baselines in text alignment and motion quality.
- LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding
-
LIFT proposes a meta-learning-based multi-scale implicit neural representation framework that achieves unified encoding across tasks (generation, classification) and data modalities (2D images, 3D voxels) via parallel local implicit functions and a hierarchical latent generator, attaining state-of-the-art performance on both reconstruction and generation tasks while substantially reducing computational cost.
- LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation
-
This paper systematically investigates how to safely and efficiently convert a pretrained DiT into a linear attention variant called LiT. It proposes five practical guidelines—depth-wise convolution augmentation, fewer heads, weight inheritance, selective parameter loading, and hybrid distillation—achieving comparable performance with only 20% of DiT's training steps.
- Long-Context State-Space Video World Models
-
This paper proposes integrating State Space Models (SSM/Mamba) into video world models. Through a block-wise SSM scan scheme that balances spatial consistency and temporal memory, combined with local frame attention, the method achieves persistent long-term spatial memory under linear training complexity and constant inference overhead, substantially outperforming finite-context Transformer baselines on Memory Maze and Minecraft datasets.
- Looking in the Mirror: A Faithful Counterfactual Explanation Method for Interpreting Deep Image Classification Models
-
This paper treats a classifier's decision boundary as a "mirror" and generates counterfactual explanations (CFEs) by reflecting feature representations to the other side of the mirror. A triangulation loss is designed to preserve distance relationships between the latent space and image space, yielding faithful, controllable, and animatable counterfactual explanations.
- LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models
-
This paper formulates the retrieval of relevant and diverse LoRA combinations from a library of 100K+ adapters as a combinatorial optimization problem. It proposes LoRAverse, a framework based on submodular function maximization, which achieves relevance- and diversity-aware LoRA selection through concept extraction followed by submodular retrieval.
- LUSD: Localized Update Score Distillation for Text-Guided Image Editing
-
LUSD addresses the failure modes of existing score distillation methods in image editing (particularly object insertion) through two simple modifications—attention-based spatial regularization and gradient filtering-normalization—which resolve instabilities caused by large disparities in gradient magnitude and spatial distribution, achieving a better balance between prompt fidelity and background preservation.
- M2SFormer: Multi-Spectral and Multi-Scale Attention with Edge-Aware Difficulty Guidance for Image Forgery Localization
-
This paper proposes M2SFormer, which unifies multi-spectral (2D DCT frequency-domain) and multi-scale (SIFT-style spatial pyramid) attention mechanisms within encoder-decoder skip connections, and introduces an edge-aware curvature-based difficulty-guided attention decoder. The method achieves state-of-the-art cross-domain generalization in image forgery localization (average unseen-domain DSC 43.0% and mIoU 34.3% under the CASIAv2 training protocol).
- Make Me Happier: Evoking Emotions Through Image Diffusion Models
-
EmoEditor presents the first systematic emotion-driven image generation framework, employing a dual-branch diffusion model (global emotion conditioning + local semantic features) to generate target-emotion images from only a source image and a target emotion label — without manual text instructions or reference images. The work also introduces the EmoPair dataset containing 340K emotion-annotated image pairs.
- MamTiff-CAD: Multi-Scale Latent Diffusion with Mamba+ for Complex Parametric Sequence
-
This paper proposes MamTiff-CAD, a framework that combines a Mamba+-based encoder with a Transformer decoder in an autoencoder to learn latent representations of CAD command sequences, followed by a multi-scale Transformer diffusion model for generation. It is the first method to generate complex CAD models with sequence lengths of 60–256 commands.
- MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
-
MaskControl is the first work to introduce spatial controllability into generative masked motion models. It manipulates the logits of the token classifier via two core components — Logits Regularizer (implicit alignment during training) and Logits Optimization (explicit optimization during inference) — simultaneously achieving high-quality motion generation (FID reduced by 77%) and high-precision joint control (average error 0.91 cm vs. 1.08 cm).
- MatchDiffusion: Training-free Generation of Match-Cuts
-
MatchDiffusion is proposed as a training-free two-stage pipeline that exploits the property of diffusion models—whereby early denoising steps establish the macroscopic scene structure and late steps add semantic details—to automatically generate match-cut video pairs via Joint Diffusion and Disjoint Diffusion.
- MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
-
This paper proposes MAVFlow, a zero-shot audio-visual renderer based on conditional flow matching (CFM), which leverages dual-modal guidance from audio speaker embeddings and visual emotion embeddings to preserve speaker consistency in multilingual AV2AV translation.
- Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts
-
This paper proposes a Meta-Unlearning framework for diffusion models that augments standard unlearning objectives with a meta-objective, causing benign knowledge associated with unlearned concepts to self-destruct upon malicious fine-tuning, thereby preventing relearning of erased concepts. The framework is compatible with most existing unlearning methods and requires only the addition of a simple meta-objective.
- Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching
-
This paper identifies an "alignment gap" in vision foundation models (e.g., DINOv2) for image feature matching: contrastive learning-based models discard instance-level details and lack cross-image interaction mechanisms, causing failures in multi-instance matching scenarios. To address this, the authors propose the IMD framework, which employs diffusion models as feature extractors to preserve instance-level details, and designs a Cross-Image Interaction Prompt Module (CIPM) for bidirectional information exchange. IMD achieves state-of-the-art performance on standard benchmarks and on the newly introduced multi-instance benchmark IMIM, with a 12% improvement in multi-instance scenarios.
- MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance
-
MMAIF proposes a unified multi-task, multi-degradation, language-guided image fusion framework that operates in latent space via a realistic degradation pipeline and a modernized DiT architecture. It offers both a regression and a Flow Matching variant, surpassing existing restoration+fusion pipelines across diverse degraded fusion tasks.
- MoFRR: Mixture of Diffusion Models for Face Retouching Restoration
-
This paper introduces the Face Retouching Restoration (FRR) task for the first time and proposes the MoFRR framework—inspired by DeepSeek MoE—which activates retouching-type-specific experts (Wavelet DDIM) and a shared expert (general DDIM) via a router, achieving near-authentic restoration of retouched faces on the newly constructed million-scale RetouchingFFHQ++ dataset.
- MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics
-
This paper proposes MosaicDiff, a training-free structural pruning method for diffusion models that dynamically partitions the inference trajectory into three stages based on pretraining learning-speed dynamics and applies stage-specific subnetworks with varying sparsity, achieving significant acceleration on DiT and SDXL without sacrificing generation quality.
- MotionDiff: Training-Free Zero-Shot Interactive Motion Editing via Flow-Assisted Multi-View Diffusion
-
MotionDiff proposes a training-free, zero-shot multi-view motion editing approach that estimates multi-view optical flow from static scenes via a Point Kinematics Model (PKM), and leverages a decoupled motion representation to guide Stable Diffusion in generating high-quality, multi-view-consistent motion editing results.
- MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
-
This paper proposes MotionStreamer, which integrates a continuous causal latent space with a diffusion head into an autoregressive framework for text-conditioned streaming human motion generation, supporting online multi-turn generation and dynamic motion composition.
- Multi-turn Consistent Image Editing
-
This paper proposes a multi-turn image editing framework based on flow matching. By incorporating dual-objective LQR guidance and an adaptive attention mechanism, it effectively suppresses error accumulation across editing rounds, enabling flexible and controllable iterative editing while maintaining content consistency.
- Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation
-
This paper proposes SewingLDM, a multimodal conditional latent diffusion model that generates complex sewing patterns under text, sketch, and body-shape conditions via an extended sewing pattern representation and a two-stage training strategy, with seamless integration into CG simulation pipelines.
- MUNBa: Machine Unlearning via Nash Bargaining
-
This work formulates Machine Unlearning (MU) as a two-player cooperative bargaining game and derives a closed-form solution via Nash bargaining theory to simultaneously address gradient conflict and gradient dominance between the forgetting and retention objectives, achieving an optimal balance between unlearning and preservation across both classification and generation tasks.
- Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling
-
This paper introduces the SoulDance dataset (the first high-quality 3D dance dataset encompassing body, hand, and facial motion) and the SoulNet framework (hierarchical residual vector quantization + music-aligned generative model + cross-modal retrieval), achieving the first whole-body 3D dance generation with coordinated facial expressions, body, and hand movements aligned to musical rhythm and emotion.
- NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes
-
NuiScene proposes an efficient vector set encoding scheme for scene chunks, paired with an explicitly trained outpainting diffusion model, to enable fast unbounded outdoor scene generation. The work also curates NuiScene43, a high-quality outdoor scene dataset.
- NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping
-
This paper proposes NullSwap, which embeds identity-guided invisible perturbations into source images to cloak facial identity information, preventing Deepfake face-swapping models from extracting the correct identity, thereby enabling proactive defense against face-swapping attacks in a purely black-box setting.
- Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis
-
Omegance proposes scaling the noise prediction in each denoising step of a diffusion model by a single parameter \(\omega\), enabling training-free global, spatial, and temporal control over the detail granularity of generated images and videos. The method is architecture-agnostic and compatible with SDXL, SD3, FLUX, and other models.
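Because the method is a one-line change to the sampler, it is easy to sketch. The deterministic DDIM step below is standard; the only modification is scaling the predicted noise by \(\omega\), and all function and variable names are illustrative.

```python
import torch

def ddim_step_omega(model, x_t, t, t_prev, alphas_bar, omega=1.0):
    """Deterministic DDIM step where the predicted noise is scaled by omega;
    omega = 1 recovers the ordinary sampler."""
    eps = omega * model(x_t, t)          # the single-parameter modification
    a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps

# Varying omega (globally, per region, or per timestep) steers how much
# predicted noise each step removes, and with it the detail granularity.
```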
- OminiControl: Minimal and Universal Control for Diffusion Transformer
-
OminiControl is proposed to unify spatially aligned and non-aligned image control tasks on the DiT architecture with only 0.1% additional parameters. Core innovations include unified sequence processing, dynamic positional encoding, and an attention bias control mechanism.
- OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting
-
This paper proposes OmniPaint, a unified framework that reformulates object removal and insertion as mutually inverse and complementary tasks. Built upon the FLUX diffusion prior, it introduces the CycleFlow unpaired training mechanism and the CFD reference-free evaluation metric. With only 3K real paired samples, OmniPaint achieves high-fidelity object editing, excelling particularly at complex physical effects such as shadows and reflections.
- OmniVTON: Training-Free Universal Virtual Try-On
-
OmniVTON proposes the first training-free universal virtual try-on framework. By decoupling garment texture and pose conditions, the method employs three core modules—Structured Garment Morphing (SGM), Continuous Boundary Stitching (CBS), and Spectral Pose Injection (SPI)—to achieve high-fidelity try-on in both in-shop and in-the-wild settings, while also supporting multi-person try-on for the first time.
- Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering
-
This paper presents Ouroboros, a unified framework comprising two single-step diffusion models (for inverse rendering RGB→X and forward rendering X→RGB respectively) that are jointly trained with cycle-consistency to enforce bidirectional rendering coherence. The method achieves state-of-the-art performance across multiple datasets while running 50× faster than multi-step diffusion baselines, and can be applied to video decomposition in a training-free manner.
- PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
-
This paper proposes PanoLlama, which extends fixed-size visual autoregressive (VAR) models to endless panorama generation via a token redirection strategy, enabling training-free next-crop prediction that surpasses joint diffusion methods in coherence, fidelity, and aesthetics.
- PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution
-
This paper proposes PatchScaler, a patch-level independent diffusion super-resolution pipeline that employs a Global Restoration Module to generate confidence maps quantifying per-region reconstruction difficulty, partitions patches into easy/medium/hard groups with different sampling step budgets, and incorporates a texture prompt retrieval mechanism — achieving superior quality on RealSR at only 0.23× the inference time of ResShift.
- Penalizing Boundary Activation for Object Completeness in Diffusion Models
-
This paper investigates the root cause of incomplete object generation in diffusion models — the RandomCrop data augmentation used during training — and proposes a training-free boundary activation penalty method. By applying cross-attention and self-attention constraints during early denoising steps, the method suppresses object generation near image boundaries, reducing the object incompleteness rate of SDv2.1 from 45.7% to 17.3%.
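A minimal sketch of what such a boundary penalty could look like, assuming access to per-token cross-attention maps at inference time; the border width and damping factor are illustrative, and the paper's exact constraints may differ.

```python
import torch

def damp_boundary_attention(attn, border=4, scale=0.3):
    """Downweight an object token's cross-attention response near the image
    boundary so early denoising steps avoid placing the object there.
    attn: (heads, H, W) attention map over latent positions."""
    mask = torch.ones_like(attn)
    mask[:, :border, :] = scale
    mask[:, -border:, :] = scale
    mask[:, :, :border] = scale
    mask[:, :, -border:] = scale
    return attn * mask
```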
- PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation
-
This paper proposes PersonalVideo, a framework that applies hybrid reward supervision—comprising an Identity Consistency Reward (ICR) and a Semantic Consistency Reward (SCR)—directly to generated videos. This approach eliminates the distribution gap between T2I fine-tuning and T2V inference inherent in conventional methods, achieving high identity fidelity while preventing degradation of motion dynamics and semantic alignment.
- PINO: Person-Interaction Noise Optimization for Long-Duration and Customizable Motion Generation of Arbitrary-Sized Groups
-
This paper proposes Person-Interaction Noise Optimization (PINO), a training-free framework that decomposes complex multi-person group interactions into semantically well-defined dyadic interaction pairs. By leveraging a pretrained two-person interaction diffusion model with noise optimization and physical penalty terms, PINO sequentially synthesizes group interaction motions of arbitrary scale, supporting fine-grained user control and long-duration motion generation.
- PLA: Prompt Learning Attack against Text-to-Image Generative Models
-
This paper proposes PLA (Prompt Learning Attack), a gradient-driven adversarial attack framework targeting black-box T2I models. By leveraging sensitive knowledge encoding and multimodal similarity losses, PLA learns adversarial prompts that bypass both prompt filters and post-hoc safety checkers, achieving an average ASR-4 exceeding 90%, substantially outperforming existing methods.
- Pretrained Reversible Generation as Unsupervised Visual Representation Learning
-
PRG extracts unsupervised visual representations by inverting the generation process of pretrained continuous generative models (diffusion/flow models), enabling model-agnostic adaptation to discriminative tasks. It achieves 78% top-1 accuracy on ImageNet 64×64, a new state of the art among generative-model-based methods.
- Randomized Autoregressive Visual Generation
-
This paper proposes Randomized AutoRegressive modeling (RAR): during standard autoregressive training, the input sequence is randomly permuted and gradually annealed back to raster-scan order, enabling the model to learn bidirectional context. RAR achieves a state-of-the-art FID of 1.48 on ImageNet-256 for autoregressive image generation while remaining fully compatible with the language model framework.
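The training-side mechanism can be sketched in a few lines; the linear schedule below (shuffle probability decaying to zero over the first 75% of training) is an illustrative assumption, not the paper's exact annealing curve.

```python
import torch

def maybe_permute(tokens: torch.Tensor, step: int, total_steps: int):
    """With probability r (annealed from 1 to 0), train on a randomly permuted
    token order; by the end of training only raster-scan order remains."""
    r = max(0.0, 1.0 - step / (0.75 * total_steps))  # linear anneal to zero
    if torch.rand(()).item() < r:
        perm = torch.randperm(tokens.size(0))
        return tokens[perm], perm
    return tokens, torch.arange(tokens.size(0))
```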
- REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents
-
This paper proposes Reducio-VAE, a content-frame-conditioned 3D video autoencoder that compresses video into a motion latent space 64× smaller than a standard 2D VAE. Paired with Reducio-DiT, it generates 16-frame 1024×1024 videos in 15.5 seconds on a single A100 GPU, with training requiring only 3,200 A100 GPU hours.
- ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation
-
To address the challenge of real image editing in Rectified Flow (ReFlow) models, this paper systematically analyzes intermediate representations in MM-DiT, identifies three key features (I2I-SA, I2T-CA, and residual features), and proposes mid-step feature extraction along with two attention adaptation techniques. The resulting training-free, user-mask-free method achieves high-quality real image editing on the FLUX model, attaining a 68.2% human preference rate that substantially outperforms competing approaches.
- REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder
-
REGEN replaces the conventional VAE decoder with a Diffusion Transformer (DiT) as a re-generative decoder for video, breaking the temporal compression bottleneck through a "generation rather than exact reconstruction" learning paradigm and achieving up to 32× temporal compression.
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
-
This paper proposes REPA-E, which enables joint end-to-end training of VAE and latent diffusion Transformers via representation alignment (REPA) loss, achieving 17× and 45× training speedups over REPA and vanilla training respectively, and setting a new state of the art of FID 1.12 on ImageNet 256×256.
- REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
-
This paper proposes REPA-E, the first training framework to successfully enable end-to-end joint tuning of a VAE and a latent diffusion model. By updating the VAE via the REPA alignment loss rather than the diffusion loss, REPA-E achieves a 17–45× training speedup and sets a new state of the art on ImageNet 256 (FID 1.12).
- Rethink Sparse Signals for Pose-guided Text-to-Image Generation
-
This paper proposes SP-Ctrl (Spatial-Pose ControlNet), which replaces the fixed RGB encoding of OpenPose with learnable Spatial-Pose Representations (SPR) and introduces a Keypoint Concept Learning (KCL) strategy that leverages cross-attention heatmap constraints to improve keypoint alignment. The method enables sparse pose signals to achieve pose control accuracy comparable to dense signals (depth maps / DensePose), while preserving image diversity and cross-species generation capability.
- Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
-
This paper identifies two structural issues in MM-DiT architectures (FLUX, SD3.5): the token count asymmetry between visual and text modalities suppresses cross-modal attention, and attention weights are insensitive to timestep. To address these, the authors propose TACA (Temperature-Adjusted Cross-modal Attention), which rebalances multimodal interaction via temperature scaling and timestep-adaptive adjustment. Combined with LoRA fine-tuning, TACA achieves significant improvements in text-image alignment on T2I-CompBench (spatial relations +16.4%, shape +5.9%) with negligible additional computational overhead.
- Rethinking Layered Graphic Design Generation with a Top-Down Approach
-
This paper proposes Accordion, a top-down framework that converts AI-generated rasterized design images into editable layered designs (comprising background, foreground object, and vectorized text layers), where a VLM plays distinct roles across three stages: reference creation, design planning, and layer generation.
- Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities
-
This paper proposes VLN-PE, the first physically realistic vision-and-language navigation platform supporting humanoid, quadruped, and wheeled robots. It systematically evaluates existing VLN methods under real physical constraints, revealing a 34% drop in success rate when transferring from simulation to physical deployment.
- Revelio: Interpreting and Leveraging Semantic Information in Diffusion Models
-
Revelio employs k-sparse autoencoders (k-SAE) to uncover monosemantic, interpretable features encoded across different layers and timesteps of diffusion models, and validates the transfer learning utility of these features via a lightweight classifier, Diff-C, enabling a systematic interpretation of black-box diffusion models.
- SA-LUT: Spatial Adaptive 4D Look-Up Table for Photorealistic Style Transfer
-
This paper proposes SA-LUT, which achieves spatially adaptive photorealistic style transfer via a style-guided 4D look-up table and a context map generated by content-style cross-attention. On the newly introduced PST50 benchmark, SA-LUT reduces LPIPS by 66.7% compared to 3D LUT methods while supporting real-time 4K video processing at 16 FPS.
- SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
-
SANA-Sprint proposes a hybrid distillation framework combining continuous-time consistency models (sCM) and latent adversarial diffusion distillation (LADD). It converts pretrained Flow Matching models to TrigFlow in a lossless manner and jointly trains with sCM+LADD, achieving unified adaptive high-quality text-to-image generation in 1–4 steps, with a single-step latency of only 0.1 seconds on H100.
- SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
-
This work converts a pretrained SANA flow matching model into TrigFlow via a lossless mathematical transformation, and combines continuous-time consistency distillation (sCM) with latent adversarial diffusion distillation (LADD) in a hybrid training strategy, achieving unified 1–4 step adaptive high-quality image generation. One-step generation of 1024×1024 images requires only 0.1s on an H100, surpassing FLUX-schnell with an FID of 7.59 and GenEval of 0.74 while being 10× faster.
- SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models
-
SCFlow is proposed to learn an invertible merging mapping between style and content via Flow Matching, leveraging the invertibility of the mapping to allow disentanglement to emerge naturally as an implicit property of the merging process, without requiring explicit disentanglement supervision.
- ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion
-
ScoreHOI employs a score-based diffusion model as an optimizer, integrating DDIM inversion–forward sampling with physical constraints (contact, penetration, ground contact) to guide the denoising process. Combined with a contact-driven iterative refinement strategy, it achieves physically plausible 3D reconstruction of human-object interactions from monocular images, improving contact F-Score by 9% on BEHAVE.
- SDMatte: Grafting Diffusion Models for Interactive Matting
-
This paper proposes SDMatte, a Stable Diffusion-based interactive matting model that converts the text interaction capability of diffusion models into visual prompt interaction capability via three key designs: visual prompt cross-attention, coordinate/opacity embeddings, and mask self-attention. SDMatte significantly outperforms SAM-based methods across multiple datasets.
- Semantic Discrepancy-aware Detector for Image Forgery Identification
-
This paper proposes the Semantic Discrepancy-aware Detector (SDD), which leverages three modules — Semantic Token Sampling (STS), Concept-level Forgery Discrepancy Learning (CFDL), and a Low-level Forgery Feature Enhancer — to align CLIP's visual semantic concept space with the forgery space via reconstruction learning. SDD achieves state-of-the-art performance on the UnivFD and SynRIS benchmarks (mean AP 98.51%, AUROC 95.1%).
- Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity
-
This paper addresses the frequency integrity loss caused by discarding the imaginary part in existing semantic watermarking methods for latent diffusion models (LDMs). It proposes Hermitian Symmetric Fourier Watermarking (SFW) and a center-aware embedding strategy to preserve frequency-domain integrity while enhancing detection robustness and generation quality.
- ShortFT: Diffusion Model Alignment via Shortcut-based Fine-Tuning
-
ShortFT is proposed to construct denoising shortcuts using trajectory-preserving few-step diffusion models, substantially compressing the original lengthy denoising chain to enable complete end-to-end reward gradient backpropagation, achieving efficient and effective alignment of diffusion models with reward functions.
- SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
-
SliderSpace applies PCA to CLIP features of images generated by a diffusion model under a given prompt, automatically discovering multiple semantically orthogonal controllable directions. Each direction is trained as a LoRA adapter (slider), enabling concept decomposition, artistic style exploration, and diversity enhancement without any manually specified attributes.
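The discovery step reduces to PCA over CLIP embeddings, as sketched below; distilling each recovered direction into its own LoRA slider is the part omitted here.

```python
import torch

def discover_directions(clip_feats: torch.Tensor, k: int = 8) -> torch.Tensor:
    """PCA over CLIP embeddings (N, D) of images sampled from one prompt:
    the top-k right singular vectors give k orthogonal semantic directions."""
    centered = clip_feats - clip_feats.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:k]  # (k, D); each row would be distilled into one LoRA slider
```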
- SMGDiff: Soccer Motion Generation using Diffusion Probabilistic Models
-
This paper proposes SMGDiff, a two-stage diffusion model framework that generates high-quality, diverse soccer motion animations in real time from user control signals, while refining ball-foot interaction details via a contact guidance module.
- Spectral Image Tokenizer
-
This paper proposes the Spectral Image Tokenizer (SIT), which tokenizes images in the frequency domain after converting them via the Discrete Wavelet Transform (DWT). The resulting token sequence is naturally arranged in a coarse-to-fine order, enabling capabilities unavailable to conventional raster-scan tokenizers, including multi-resolution reconstruction, progressive generation, text-guided super-resolution, and image editing.
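A toy sketch of the coarse-to-fine ordering using PyWavelets; the wavelet and level choices are illustrative, and the quantization that would turn these coefficients into discrete tokens is omitted.

```python
import numpy as np
import pywt

def coarse_to_fine_sequence(img: np.ndarray) -> np.ndarray:
    """Flatten a multi-level 2D DWT into a single sequence ordered from the
    coarsest approximation to the finest detail subbands."""
    coeffs = pywt.wavedec2(img, "haar", level=3)  # [cA3, (cH3, cV3, cD3), ...]
    parts = [coeffs[0].ravel()]                   # coarsest band first
    for ch, cv, cd in coeffs[1:]:                 # then increasingly fine bands
        parts.append(np.concatenate([ch.ravel(), cv.ravel(), cd.ravel()]))
    return np.concatenate(parts)
```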
- Straighten Viscous Rectified Flow via Noise Optimization
-
This paper proposes VRFNO (Viscous Rectified Flow via Noise Optimization), which enhances trajectory distinguishability by introducing a historical velocity term and jointly trains an encoder to optimize noise for constructing optimal couplings, effectively straightening the inference trajectories of Rectified Flow. VRFNO achieves state-of-the-art one-step/few-step generation performance on CIFAR-10 and AFHQ (one-step FID of 4.50, without distillation).
- StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
-
StreamDiffusion proposes a pipeline-level real-time diffusion framework that achieves up to 91 fps on a single RTX 4090—59.6× faster than Diffusers AutoPipeline—through Stream Batch (batched denoising steps), R-CFG (residual classifier-free guidance), and SSF (stochastic similarity filtering).
- Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal
-
This paper formulates portrait shadow removal as a diffusion inpainting problem. It trains an illumination-invariant structure extraction network to obtain structure maps free of shadow boundaries, uses these maps to guide an inpainting diffusion model for shadow region restoration, and applies a gradient-guided detail recovery diffusion model to reconstruct fine facial details. The proposed method substantially outperforms existing approaches on benchmark datasets.
- StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance
-
This paper proposes Negative Visual Query Guidance (NVQG), a training-free method that suppresses content leakage by injecting the reference image's queries as a negative guidance signal in self-attention layers. The approach achieves high-quality visual style prompting and outperforms existing methods in both style similarity and text alignment.
- StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion
-
This paper proposes StyleMotif, a single-branch motion latent diffusion framework that unifies content generation and multi-modal style injection (text/image/video/audio/motion) via a style-content cross normalization mechanism. Compared to SMooDi's dual-branch design, StyleMotif reduces trainable parameters by 43.9% and improves inference speed by 22.5%, while achieving a 5.23% gain in Style Recognition Accuracy (SRA).
- SummDiff: Generative Modeling of Video Summarization with Diffusion
-
SummDiff is the first work to introduce diffusion models into video summarization, formulating the task as a conditional generation problem. By learning the distribution of "good summaries," the model generates diverse plausible summaries that better reflect the inherent subjectivity of the video summarization task.
- SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing
-
SuperEdit addresses the noisy supervision problem in instruction-based image editing by leveraging diffusion generation priors to guide a VLM in rectifying editing instructions, and by constructing contrastive supervision signals (positive/negative instructions + triplet loss), surpassing SmartEdit by 9.19% with less data and a smaller model.
- Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection
-
This paper proposes SynOOD, which synthesizes challenging near-boundary OOD samples by combining MLLM-based contextual semantic extraction, iterative diffusion inpainting, and OOD gradient guidance. The synthesized samples are used to fine-tune the CLIP image encoder and negative label features, achieving a 2.80% AUROC improvement and 11.13% FPR95 reduction on the ImageNet benchmark.
- TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation
-
TaxaDiffusion leverages the hierarchical structure of biological taxonomy (Kingdom→Phylum→Class→Order→Family→Genus→Species) to progressively train a diffusion model, gradually refining from high-level shared characteristics to species-level subtle distinctions. This approach achieves high-precision fine-grained animal image generation, reducing FID to 31.87 on the FishNet dataset (vs. 43.91 for LoRA), improving the BioCLIP alignment score by 37%, and remaining effective for rare species with very few training samples, even a single image.
- TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance
-
This paper proposes TeEFusion, which encodes the guidance magnitude of CFG directly as a linear combination of conditional and unconditional text embeddings to replace dual forward passes, achieving efficient CFG distillation with zero additional parameters. The method is compatible with complex teacher sampling strategies (e.g., Z-Sampling, W2SD), enabling a 6× inference speedup over the teacher model.
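The fusion itself is a single linear interpolation in text-embedding space, sketched below; the student call is a hypothetical API, and the distillation training loop is omitted.

```python
import torch

def fuse_text_embeddings(e_cond, e_uncond, w: float):
    """Bake the CFG scale w into one fused embedding, so the distilled student
    needs a single forward pass instead of a conditional/unconditional pair."""
    return e_uncond + w * (e_cond - e_uncond)

# Hypothetical student call:
# eps = student(x_t, t, fuse_text_embeddings(e_cond, e_uncond, w=7.5))
```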
- TeRA: Rethinking Text-guided Realistic 3D Avatar Generation
-
TeRA is proposed as the first text-guided 3D realistic avatar generation framework based on a latent diffusion model. By distilling a large-scale human reconstruction model to construct a structured latent space, TeRA generates realistic 3D human avatars in 12 seconds—two orders of magnitude faster than SDS-based methods.
- Text Embedding Knows How to Quantize Text-Guided Diffusion Models
-
This paper is the first to leverage text prompts to guide dynamic bit-width allocation for diffusion model quantization — by predicting the quality of images generated from a given text prompt, it adaptively selects high/medium/low bit precision for different layers and timesteps, reducing computational complexity while maintaining or even improving generation quality.
- The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation
-
This paper identifies the "curse of conditions" in conditional flow matching — a training-test mismatch caused by standard optimal transport (OT) ignoring conditioning information, which induces a conditionally skewed prior during training while an unbiased prior is used at test time. The authors propose C²OT (Conditional Optimal Transport), which resolves this issue by incorporating a condition-weighted term into the OT cost matrix.
- The Silent Assistant: NoiseQuery as Implicit Guidance for Goal-Driven Image Generation
-
This paper proposes NoiseQuery, a training-free T2I generation enhancement method that pre-builds a large-scale noise library and, at inference time, retrieves the initial noise best matching the user's goal. It enables fine-grained control over both high-level semantics and low-level visual attributes at only 0.002 s of additional overhead per prompt, and improves performance across multiple T2I models and enhancement techniques.
- Timestep-Aware Diffusion Model for Extreme Image Rescaling
-
This paper proposes TADM, which performs extreme image rescaling (16×/32×) in the latent space of a pretrained Stable Diffusion model. By introducing a Decoupled Feature Rescaling Module (DFRM) and a timestep-aware alignment strategy, TADM dynamically allocates the generative capacity of the diffusion model to handle spatially non-uniform degradation.
- TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
-
This paper proposes TLB-VFI, an efficient video diffusion model for frame interpolation. It employs a temporal-aware autoencoder—comprising a latent-space temporal block and a pixel-space 3D wavelet gating mechanism—to extract rich temporal information, combined with a redesigned Brownian bridge diffusion process. With only 46.7M parameters (3× fewer than image diffusion methods and 20× fewer than video diffusion methods), TLB-VFI achieves approximately 20% FID improvement on SNU-FILM extreme and Xiph-4K benchmarks.
- Towards Robust Defense against Customization via Protective Perturbation Resistant to Diffusion-based Purification
-
This paper proposes AntiPure, an adversarial perturbation method that directly attacks the diffusion-based purification process through two guidance mechanisms—Patch-wise Frequency Guidance (PFG) and Erroneous Timestep Guidance (ETG)—to generate protective perturbations that continue to disrupt customization fine-tuning even after purification, outperforming all existing protection methods under the purification-customization (P-C) workflow.
- Trade-offs in Image Generation: How Do Different Dimensions Interact?
-
This paper proposes TRIG-Bench, a benchmark comprising 40,200 samples across 10 evaluation dimensions and 132 pairwise dimension subsets, along with a VLM-as-Judge metric termed TRIGScore. It is the first work to systematically reveal and analyze trade-offs among evaluation dimensions (e.g., realism, relation alignment, style) in image generation models, and leverages a Dimension Trade-off Map (DTM) to guide fine-tuning for performance improvement.
- Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting
-
This paper proposes Trans-Adapter, a plug-and-play adapter module that enables diffusion-based image inpainting models to directly process transparent (RGBA) images. It also introduces the LayerBench benchmark and the Alpha Edge Quality (AEQ) metric.
- Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models
-
This paper proposes TLoRA, which decomposes the fine-tuning of pretrained weights into a Transform adaptation and a Residual adaptation, parameterized respectively via Tensor Ring Matrix (TRM) and Tensor Ring (TR) decompositions. On SDXL, TLoRA achieves highly parameter-efficient fine-tuning with only 0.4M parameters while outperforming LoRA and other baselines.
- TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
-
This paper proposes TRCE, a two-stage concept erasure strategy—textual semantic erasure followed by denoising trajectory steering—that reliably removes malicious concepts while minimizing degradation of the model's general generation capability.
- Understanding Flatness in Generative Models: Its Role and Benefits
-
This paper presents the first systematic study of loss landscape flatness in generative models, particularly diffusion models. It theoretically demonstrates that flat minima enhance robustness to perturbations in the prior distribution, and empirically shows that SAM effectively promotes flatness in diffusion models, leading to improved generation quality, reduced exposure bias, and greater quantization robustness.
- UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer
-
UniCombine proposes a DiT-based multi-condition controllable generation framework that achieves unified generation under arbitrary condition combinations (text + spatial map + subject image) via a Conditional MMDiT Attention mechanism and a LoRA Switching module. It supports both training-free and training-based modes, and introduces SubjectSpatial200K, the first dataset for multi-condition generation.
- Unlocking the Potential of Diffusion Priors in Blind Face Restoration
-
This paper proposes FLIPNET, a unified framework built upon a T2I diffusion model that switches between a restoration mode (BoostHub selectively fuses LQ features + BFR-oriented facial embeddings) and a degradation mode (learns from real degradation datasets and synthesizes degraded images) by simply flipping the inputs, simultaneously addressing two key challenges: the HQ/LQ distribution gap and the synthetic/real degradation gap.
- Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching
-
DDM4IP proposes an unsupervised framework that models the degradation distribution via Conditional Flow Matching, while simultaneously learning an unknown forward degradation model through a distribution matching loss. Using only a small number of unpaired images, the method achieves competitive or superior performance on deblurring, spatially-varying PSF calibration, and blind super-resolution tasks.
- Video Color Grading via Look-Up Table Generation
-
This paper proposes a video color grading framework that explicitly generates Look-Up Tables (LUTs) via a diffusion model. A GS-Extractor captures high-level style features from a reference scene, and an L-Diffuser generates a color LUT that can be applied losslessly to all video frames in a single forward pass. Text prompts are further supported for fine-grained adjustments such as brightness and contrast.
- Video Motion Graphs
-
Video Motion Graphs proposes a retrieval-augmented generation system for human motion video synthesis. It constructs a motion graph from reference videos and performs conditioned path search to obtain keyframes, then employs HMInterp—a dual-branch diffusion-based frame interpolation model combining skeleton guidance from a Motion Diffusion Model and progressive condition training—to seamlessly connect discontinuous frames. The system supports multiple conditioning signals (music, speech, action labels) and significantly outperforms both generative and retrieval-based baselines in human motion video quality.
- VIGFace: Virtual Identity Generation for Privacy-Free Face Recognition Dataset
-
This paper proposes VIGFace, a framework that pre-allocates virtual prototypes orthogonal to real identities in the feature space of a face recognition (FR) model, and trains a diffusion model to generate face images conditioned on these prototypes—producing identities that do not exist in the real world, thereby enabling privacy-free face recognition dataset construction and data augmentation.
- VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
-
This paper presents VisualCloze, which unifies diverse image generation tasks under a "visual cloze" paradigm—defining tasks via visual in-context examples rather than text instructions, performing unified generation through an image infilling model, and constructing the Graph200K graph-structured dataset to enhance cross-task knowledge transfer. The framework supports in-distribution tasks, unseen-task generalization, multi-task composition, and reverse generation.
- What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?
-
Through systematic analysis of the behavior of \(W_{\{q,k,v,o\}}\) components during LoRA fine-tuning, this work reveals that \(W_v\) and \(W_o\) are responsible for learning panoramic spherical structure while \(W_q\) and \(W_k\) retain perspective-domain shared knowledge. Based on this finding, the paper proposes UniPano, an efficient single-branch panorama generation framework.
- What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization
-
This paper systematically analyzes the domain separation capacity of latent spaces from six pretrained models (CLIP, DiT, SD, MAE, DINOv2, ResNet) and demonstrates that diffusion model features are most effective at separating domain information in an unsupervised setting. Building on this insight, the authors propose GUIDE — a framework that leverages diffusion features to discover pseudo-domain representations and augment classifier features — achieving 66.3% average accuracy across five DomainBed datasets without domain labels (surpassing the ERM baseline by +2.6% on average and by +4.3% on TerraIncognita), while outperforming most methods that require domain labels.
- Your Text Encoder Can Be An Object-Level Watermarking Controller
-
By fine-tuning only the pseudo-token embedding \(\mathcal{W}_*\) in the text encoder, this work achieves object-level invisible watermark embedding in T2I diffusion model-generated images, attaining 99% bit accuracy (48 bits) with \(10^5\times\) fewer parameters.
🧩 Multimodal VLM¶
- A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition
-
This paper proposes QME (Quality-guided Mixture of score-fusion Experts), a framework that dynamically integrates similarity scores from multiple biometric modalities—including face recognition, gait recognition, and person re-identification—via learnable score fusion strategies and a quality-based MoE routing mechanism, achieving state-of-the-art performance on multiple whole-body recognition benchmarks.
- A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition
-
This paper proposes a Quality-guided Mixture of score-fusion Experts (QME) framework that employs a quality-guided MoE strategy to perform learnable fusion of similarity scores from heterogeneous biometric modalities (face, gait, body). Combined with a pseudo-quality loss and a score triplet loss, QME achieves state-of-the-art performance on multiple whole-body biometric recognition benchmarks.
- Acknowledging Focus Ambiguity in Visual Questions
-
This work is the first to formally define and systematically investigate focus ambiguity in visual question answering — the phenomenon arising when a linguistic expression in a question may plausibly refer to multiple regions in an image, a type of ambiguity entirely overlooked by existing VQA systems. The authors construct the VQ-FocusAmbiguity dataset (5,500 samples with 12,880 instance segmentation annotations) and demonstrate that modern models perform poorly at both recognizing and localizing focus ambiguity.
- Adaptive Prompt Learning via Gaussian Outlier Synthesis for Out-of-Distribution Detection
-
This paper proposes the APLGOS framework, which initializes learnable in-distribution (ID) prompts using ChatGPT-standardized Q&A pairs, synthesizes virtual OOD prompts and images by sampling from the low-likelihood regions of class-conditional Gaussian distributions, and aligns text-image embeddings via contrastive learning to achieve more compact ID/OOD decision boundaries.
- Adaptive Prompt Learning via Gaussian Outlier Synthesis for Out-of-distribution Detection
-
This paper proposes APLGOS, a framework that leverages prompt learning in vision-language models to synthesize virtual OOD prompts and images by sampling from low-probability regions of class-conditional Gaussian distributions, thereby enforcing more compact decision boundaries between in-distribution (ID) and out-of-distribution (OOD) categories. The method achieves state-of-the-art performance on four mainstream benchmarks.
- Advancing Textual Prompt Learning with Anchored Attributes
-
This paper proposes ATPrompt, which embeds general-purpose attribute tokens (e.g., color, shape) into textual prompts, extending the learning space of soft prompts from a one-dimensional class level to a multi-dimensional attribute level. ATPrompt serves as a plug-and-play module that integrates seamlessly into existing textual prompt learning methods, consistently improving baseline performance across 11 datasets.
- AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
-
This paper proposes AdvDreamer, a framework that generates physically reproducible adversarial 3D transformation (Adv-3DT) samples from single images via zero-shot monocular pose manipulation, a naturalness reward model, and an inverse semantic probability loss. The framework reveals that current VLMs—including GPT-4o—suffer performance drops of 50–80% under 3D variations, and establishes MM3DTBench, the first VQA benchmark for evaluating VLM robustness to 3D variations.
- AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models
-
This paper proposes AIGI-Holmes, which adapts MLLMs into a "Holmes"-style detector capable of both accurately identifying AI-generated images and providing human-verifiable explanations. This is achieved by constructing the Holmes-Set dataset with explanatory annotations and a carefully designed three-stage training pipeline (visual expert pre-training → SFT → DPO). At inference time, a collaborative decoding strategy further enhances generalization.
- AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models
-
This paper proposes AIGI-Holmes, which achieves explainable and generalizable AI-generated image detection through the construction of Holmes-Set — an annotated dataset with interpretive labels — a three-stage training pipeline (visual expert pre-training → SFT → DPO), and a collaborative decoding strategy. The method attains state-of-the-art detection accuracy on three benchmarks while providing human-verifiable explanations.
- AirCache: Activating Inter-Modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
-
This paper proposes AirCache, a KV Cache compression method for LVLMs that evaluates visual token importance via an Elite Observation Window, combined with adaptive layer-wise budget allocation based on the intensity and skewness of importance score distributions. At only 10% visual KV Cache retention, performance degradation remains within 1%, while decoding latency is reduced by 29%–66%.
- AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
-
This paper proposes AirCache, which achieves model performance retention with only 10% of the visual KV cache—reducing decoding latency by 29%–66%—through an elite observation window (leveraging text self-attention to select critical text tokens for evaluating visual token importance) and adaptive inter-layer budget allocation (based on the intensity and skewness of importance score distributions).
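The exact allocation rule is the paper's design; the sketch below is only a plausible stand-in showing how per-layer budgets could be derived from the intensity and skewness of importance-score distributions:

```python
import numpy as np
from scipy.stats import skew

def allocate_budgets(importance_per_layer, total_budget):
    """importance_per_layer: one 1-D array of visual-token importance scores per
    layer. The weighting below (mean intensity scaled by positive skewness) is a
    hypothetical stand-in for the paper's adaptive rule."""
    stats = np.array([imp.mean() * (1.0 + max(skew(imp), 0.0))
                      for imp in importance_per_layer])
    weights = stats / stats.sum()
    return np.maximum(1, (weights * total_budget).astype(int))

toy = [np.random.rand(576) ** p for p in (1, 3, 8)]   # increasingly skewed layers
print(allocate_budgets(toy, total_budget=173))        # ~10% of 3 * 576 tokens
```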
- Analyzing Finetuning Representation Shift for Multimodal LLMs Steering
-
A training-free framework that reveals representation shifts in multimodal large language models (MLLMs) during finetuning through concept-level analysis, and leverages shift vectors for lightweight model behavior steering (debiasing, safety control).
- Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
-
This paper presents the first systematic study of visual correspondence matching deficiencies in multimodal large language models (MLLMs). The authors construct the MMVM benchmark (1,510 samples) and a 220K matching dataset, and propose CoLVA, which leverages object-level contrastive learning and a fine-grained visual expert to substantially improve cross-image instance matching in MLLMs.
- Attention to the Burstiness in Visual Prompt Tuning!
-
This paper reveals the "burstiness" and non-Gaussian distribution of self-attention module data in Visual Prompt Tuning, and proposes learning "bursty prompts" via data whitening and a bilinear model. The approach substantially outperforms VPT and its variants across multiple benchmarks, e.g., improving accuracy on CUB-200 from 42.15% to 77.86%.
- AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs
-
This paper proposes AutoComPose, the first framework leveraging multimodal large language models (MLLMs) to automatically generate human pose transition descriptions. Through body-part-level description generation, diversification augmentation, and a cyclic consistency loss, AutoComPose achieves superior composed pose retrieval performance while eliminating the need for costly manual annotation.
- BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning
-
Inspired by the efficient learning capabilities of human infants, this paper proposes the BabyVLM framework, which includes a synthetic training dataset (converting general-purpose data into child-directed formats) and multiple developmentally aligned evaluation benchmarks. The framework enables data-efficient pretraining of compact VLMs, achieving performance that surpasses models trained solely on SAYCam or generic data.
- Background Invariance Testing According to Semantic Proximity
-
This paper proposes a background invariance testing method based on semantic proximity. It constructs a keyword ontology via association analysis to systematically sample background scenes, achieving an optimal balance between test diversity (recall) and consistency with human judgment (precision). The work further demonstrates that visualization-based testing frameworks are more informative than global statistical metrics.
- BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models
-
By analyzing the semantic refinement of visual embeddings in the shallow layers of LLMs, this paper proposes BASIC, a method that leverages intrinsically refined visual embeddings from within the LLM as supervision signals to directly guide the visual projector in generating better initial visual embeddings along two dimensions: directional alignment and semantic distribution.
- Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval
-
This paper identifies the candidate prior bias problem in MLLM-based retrieval systems — where candidate likelihood estimation tends to favor candidates with high prior probability rather than those that are semantically most relevant — and proposes BLiM (Bidirectional Likelihood Estimation) and CPN (Candidate Prior Normalization) to address this issue, achieving an average R@1 gain of 6.4 across four text-video retrieval benchmarks.
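In outline, the scoring could look like the following sketch, where the three log-likelihood callables are assumed to be provided by the MLLM (the names are hypothetical):

```python
def blim_score(query_video, cand_text,
               loglik_text_given_video,   # (t, v) -> log P(t | v)
               loglik_video_given_text,   # (t, v) -> log P(v | t)
               loglik_text_prior,         # (t,)   -> log P(t), no visual input
               alpha=1.0, beta=1.0):
    """Higher is better. Subtracting the text-only prior (the CPN step) penalizes
    candidates that are merely likely a priori rather than visually grounded."""
    forward = loglik_text_given_video(cand_text, query_video)
    backward = loglik_video_given_text(cand_text, query_video)
    return alpha * forward + beta * backward - loglik_text_prior(cand_text)

# retrieval: best = max(candidates, key=lambda t: blim_score(v, t, f, g, p))
```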
- Boosting MLLM Reasoning with Text-Debiased Hint-GRPO
-
This paper identifies two critical issues in applying GRPO to MLLM reasoning — low data utilization (invalid gradients when all sampled outputs for a hard question are incorrect) and text bias (the model ignores visual input and relies solely on textual reasoning) — and proposes two corresponding solutions: Hint-GRPO (adaptively providing reasoning hints) and text-debiasing calibration (enhancing image conditioning at test time). The approach achieves significant reasoning improvements across 11 datasets on 3 base MLLMs.
- CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers
-
This paper proposes CAD-Assistant, the first tool-augmented vision-language model framework for generic CAD tasks. By integrating a CAD-specific toolset (sketch parameterizer, rendering module, constraint checker, etc.) and the FreeCAD Python API, it surpasses supervised task-specific methods in a zero-shot setting.
- Calibrating MLLM-as-a-Judge via Multimodal Bayesian Prompt Ensembles
-
This paper proposes Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB), which learns image-cluster-conditioned prompt weights to substantially improve calibration and judgment accuracy of MLLMs used as evaluators, addressing the failure of standard prompt ensemble methods in multimodal settings.
- CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models
-
This work is the first to introduce multimodal large language models (MLLMs) into category-agnostic pose estimation (CAPE), enabling keypoint localization for arbitrary categories using only a query image and textual descriptions—without requiring traditional support images or annotations—surpassing the 5-shot state-of-the-art on the MP-100 benchmark.
- CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning
-
CaptionSmiths is a framework that enables slider-style flexible control over three caption attributes — length, descriptiveness, and lexical uniqueness — via continuous scalar interpolation rather than discrete clustering. Trained jointly on multiple datasets, it achieves more precise attribute control and higher lexical alignment quality than baselines.
- CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
-
This paper introduces CAPTURe, a benchmark that evaluates spatial reasoning and world model construction in VLMs by requiring amodal counting of regularly arranged objects under occlusion. Results show that even the strongest model, GPT-4o, achieves a 14.75% counting error under occlusion, while humans perform nearly perfectly.
- Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning
-
This paper proposes the Causal CLIP Adapter (CCA), which applies Independent Component Analysis (ICA) to causally disentangle CLIP visual features, and enhances cross-modal alignment via unidirectional text classifier fine-tuning and bidirectional cross-attention, achieving state-of-the-art few-shot classification performance across 11 benchmark datasets.
- ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning
-
This paper proposes PointCoT, which integrates reflective visual grounding (bounding boxes) into the chain-of-thought for chart reasoning, enabling MLLMs to interactively verify each reasoning step against the chart's visual content. It also constructs the ChartPoint-SFT-62k dataset containing 19.2K high-quality samples, achieving a +5.04% improvement on ChartBench.
- Chimera: Improving Generalist Model with Domain-Specific Experts
-
This paper proposes Chimera, a scalable and low-cost multimodal pipeline that integrates domain-specific expert knowledge (tables, charts, math, documents) into a generalist multimodal large model via a lightweight routing module for dynamic expert selection, a progressive training strategy, and a Generalist-Specialist Collaboration Masking (GSCM) mechanism. Chimera achieves 64.9% on MathVista (SOTA) and matches or surpasses specialist models on multiple visual structure extraction tasks.
- CLIPSym: Delving into Symmetry Detection with CLIP
-
This paper proposes CLIPSym, the first method to leverage the multimodal understanding capability of pretrained CLIP for reflection and rotation symmetry detection. It introduces a Semantics-Aware Prompt Grouping (SAPG) strategy to integrate textual semantic cues and a decoder with theoretical rotation equivariance guarantees, achieving state-of-the-art results on three benchmarks: DENDI, SDRW, and LDRS.
- CoA-VLA: Improving Vision-Language-Action Models via Visual-Textual Chain-of-Affordance
-
This paper proposes CoA-VLA, which organizes four categories of robotic affordances (object, grasp, spatial, and motion) into a chain-of-thought reasoning process, and injects them into a diffusion policy network via a visual-textual co-injection module, significantly improving the accuracy and generalization of VLA models in multi-task manipulation.
- CoA-VLA: Improving Vision-Language-Action Models via Visual-Textual Chain-of-Affordance
-
This paper proposes the Chain-of-Affordance (CoA-VLA) framework, which injects four categories of robot affordances (object, grasp, spatial, and movement) into the policy network of a VLA model in both textual and visual modalities. The approach achieves an 85.54% success rate on a real-robot multi-task benchmark spanning 7 tasks, outperforming OpenVLA by 30.65%, and demonstrates generalization to unseen object poses and obstacles.
- CompCap: Improving Multimodal Large Language Models with Composite Captions
-
This paper proposes CompCap, an automated framework for synthesizing six categories of composite images (collages, image-text mixtures, charts, tables, code, and diagrams) along with high-quality captions. The resulting CompCap-118K dataset, when incorporated into the SFT stage, significantly improves MLLM comprehension of composite images.
- Controlling Multimodal LLMs via Reward-guided Decoding
-
This paper proposes Multimodal Reward-Guided Decoding (MRGD), which constructs two reward models to independently control object precision and recall, enabling fine-grained controllability over MLLM outputs at inference time while substantially reducing object hallucinations.
- Controlling Multimodal LLMs via Reward-guided Decoding
-
This paper proposes MRGD (Multimodal Reward-Guided Decoding), which trains a PaliGemma-based object hallucination reward model and an OWLv2-based object recall reward model. During MLLM inference, MRGD performs sentence-level beam search by scoring candidates with a linearly weighted combination of the two rewards. On CHAIR, it reduces LLaVA-1.5's CHAIRi from 15.05 to 4.53 (a 70% reduction) while enabling dynamic and controllable precision–recall trade-offs.
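A sketch of the decoding loop, with all model calls injected as callables (hypothetical names, not the authors' implementation), might look like this:

```python
def reward_guided_decode(image, prompt, propose_sentences, r_precision, r_recall,
                         w_p=1.0, w_r=1.0, beam_width=4, max_sentences=6):
    """Sentence-level beam search: extend each beam with candidate sentences,
    score full prefixes with a weighted sum of the two rewards, keep the top-k."""
    beams = [""]
    for _ in range(max_sentences):
        scored = []
        for prefix in beams:
            for sent in propose_sentences(image, prompt, prefix, n=beam_width):
                text = prefix + sent
                score = w_p * r_precision(image, text) + w_r * r_recall(image, text)
                scored.append((score, text))
        scored.sort(key=lambda st: st[0], reverse=True)
        beams = [text for _, text in scored[:beam_width]]
    return beams[0]
```

Adjusting `w_p` and `w_r` at inference time is what gives the controllable precision–recall trade-off the note describes.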
- CVPT: Cross Visual Prompt Tuning
-
To address the computational redundancy and attention disruption caused by prompt tokens participating in self-attention in Visual Prompt Tuning (VPT), this paper proposes CVPT, which decouples the interaction between prompt and image tokens via cross-attention and leverages a weight-sharing mechanism to initialize the cross-attention module. CVPT significantly outperforms VPT across 25 datasets and achieves performance comparable to mainstream adapter-based methods.
- DADM: Dual Alignment of Domain and Modality for Face Anti-Spoofing
-
This paper proposes the DADM framework, which simultaneously addresses intra-domain modality misalignment and inter-domain modality misalignment in multimodal face anti-spoofing via a Mutual Information Mask (MIM) module and a dual domain-modality alignment optimization strategy, achieving state-of-the-art performance across four evaluation protocols.
- DASH: Detection and Assessment of Systematic Hallucinations of VLMs
-
This paper proposes DASH, a fully automated pipeline that systematically discovers false-positive object hallucination clusters in VLMs via two complementary strategies: LLM-based text query generation (DASH-LLM) and diffusion model optimization-based image query generation (DASH-OPT). Applied to ReLAION-5B, DASH uncovers 19k+ clusters and 950k+ images, and constructs the more challenging DASH-B benchmark.
- DisenQ: Disentangling Q-Former for Activity-Biometrics
-
This paper proposes DisenQ (Disentangling Q-Former), which leverages structured language guidance to disentangle video features into three independent spaces—biometric, motion, and non-biometric—achieving state-of-the-art activity-aware person recognition without requiring additional visual modalities.
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
-
This paper proposes Dita (Diffusion Transformer Policy), which, unlike prior methods that denoise on compressed embeddings using shallow networks, adopts in-context conditioning to directly condition denoising on raw visual tokens. A causal Transformer processes the full token sequence of language, images, timesteps, and noisy actions. With 334M parameters, Dita achieves state-of-the-art or competitive performance on SimplerEnv zero-shot, LIBERO, CALVIN, and other benchmarks.
- DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
-
This paper proposes DocThinker, the first framework to apply GRPO (Group Relative Policy Optimization) reinforcement learning to document understanding. By training MLLMs with a four-objective rule-based reward (format, answer accuracy, RoI IoU, and question rephrasing quality), DocThinker enables models to autonomously generate interpretable reasoning processes. Using only 4K training samples, it improves Qwen2.5-VL-7B on DocVQA from 0.355 (SFT) to 0.579 (RL) and achieves 82.4% precision on visual grounding tasks.
- DOGR: Towards Versatile Visual Document Grounding and Referring
-
This paper proposes DOGR-Engine, a data engine for document grounding and referring, constructs DOGR-Bench — the first comprehensive benchmark evaluating document grounding and referring capabilities across 7 task types × 3 document types — and develops DOGR, the first document understanding MLLM that integrates precise text localization with interactive grounding and referring capabilities.
- DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
-
This paper proposes the DWIM framework, which employs a discrepancy-aware workflow generation strategy to curate high-quality training data and an instruct-masking fine-tuning strategy to clone only effective actions, endowing LLMs with tool-aware capability for compositional visual reasoning and achieving state-of-the-art results on multiple visual reasoning benchmarks.
- Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
-
This paper proposes Dynamic-VLM, which employs a dynamic visual token compressor to flexibly adjust the number of tokens per frame according to video length. Combined with a 2-million-scale high-quality synthetic video QA dataset, the method achieves a 2.7% improvement over LLaVA-OneVision on VideoMME and a 10.7% improvement on MuirBench.
- Dynamic Group Detection using VLM-augmented Temporal Groupness Graph
-
This paper proposes a VLM-augmented temporal groupness graph for detecting dynamically changing groups in video. The core innovation lies in using CLIP to extract groupness-augmented features from bounding boxes containing person pairs and background context to estimate grouping probability, followed by Louvain clustering over a full-sequence temporal graph to enable dynamic group detection.
- Dynamic Multimodal Prototype Learning in Vision-Language Models
-
This paper proposes ProtoMM, a training-free multimodal prototype learning framework that models prototypes as discrete distributions over textual descriptions and visual particles. By leveraging optimal transport to dynamically update multimodal prototypes, ProtoMM achieves state-of-the-art performance across 15 zero-shot benchmarks.
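The transport step can be grounded with a standard entropic-OT solver; below is a minimal Sinkhorn sketch (illustrative hyperparameters, not the paper's implementation) for moving mass between visual particles and textual-description prototypes:

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.05, iters=200):
    """cost: (n, m) pairwise cost matrix; a, b: marginal weights summing to 1."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):           # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan (n, m)

n, m = 8, 5                          # toy sizes: particles vs. prototypes
plan = sinkhorn(np.random.rand(n, m), np.full(n, 1 / n), np.full(m, 1 / m))
```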
- Effective Training Data Synthesis for Improving MLLM Chart Understanding
-
This paper proposes a modular five-stage chart data synthesis pipeline that produces a high-quality training set, ECD (Effective Chart Dataset), comprising 10k+ chart images and 300k+ QA pairs, consistently improving chart understanding across multiple open-source MLLMs.
- Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features
-
This paper proposes Sparse Attention Vectors (SAVs) — a training-free method that extracts fewer than 5% of attention heads from frozen generative Large Multimodal Models (LMMs) as strong feature representations. With only approximately 20 labeled samples per class, SAVs achieve state-of-the-art performance on vision-language classification tasks, outperforming LoRA fine-tuning by an average of 7% on challenging benchmarks including BLINK, VLGuard, and NaturalBench.
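A rough sketch of the recipe, assuming per-head features have already been hooked out of the frozen LMM (everything below, including the head-scoring rule, is illustrative):

```python
import numpy as np

def fit_savs(head_feats, labels, num_heads=32):
    """head_feats: (N, H, D) per-example, per-head features; labels: (N,).
    Keep the heads whose class centroids are most spread apart."""
    classes = np.unique(labels)
    centroids = np.stack([head_feats[labels == c].mean(0) for c in classes])  # (C,H,D)
    spread = np.linalg.norm(centroids[:, None] - centroids[None], axis=-1).mean((0, 1))
    heads = np.argsort(spread)[-num_heads:]
    return classes, centroids[:, heads], heads

def predict(x, classes, centroids, heads):
    """x: (H, D) features of one test example; nearest-class-centroid rule."""
    d = np.linalg.norm(centroids - x[heads][None], axis=-1).sum(-1)  # (C,)
    return classes[np.argmin(d)]
```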
- Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
-
This paper proposes ED-VTG, a two-stage framework for video temporal grounding (VTG) that first enriches the input query and then predicts temporal intervals. By leveraging the descriptive capability of multimodal LLMs to supplement query details, combined with a lightweight interval decoder and a multiple instance learning (MIL) framework, ED-VTG is the first LLM-based method to comprehensively match or surpass specialized models across multiple benchmarks.
- Evading Data Provenance in Deep Neural Networks
-
This paper exposes the false sense of security in existing Dataset Ownership Verification (DOV) methods. Through a unified evasion framework, Escaping DOV, task-relevant but identity-free knowledge is transferred from a teacher model to a surrogate student via OOD data, successfully bypassing all 11 evaluated DOV methods simultaneously.
- EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
-
This work systematically investigates the optimal architecture and training strategy for encoder-free VLMs, proposing a Divide-and-Conquer architecture that fully decomposes a transformer into modality-specific components (independent attention/FFN/LayerNorm per modality). Using only 100M publicly available data, EVEv2 surpasses all encoder-free counterparts and approaches the performance of encoder-based VLMs.
- Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection
-
This paper proposes Graph Score Propagation (GSP), a training-free framework that performs score propagation over a graph constructed from class prototypes and test data. By incorporating prompt clustering and a self-training negative prompting strategy, GSP leverages VLMs for efficient OOD detection on 3D point clouds, consistently outperforming existing state-of-the-art methods on both synthetic and real-world datasets.
- FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection
-
This paper proposes FA (Forced prompt leArning), which introduces a learnable "forced prompt" and trains it to produce higher ID-class matching scores than a frozen original prompt, compelling it to capture richer ID class descriptions beyond label text semantics. FA achieves significant improvements in CLIP-based few-shot OOD detection without external auxiliary data or additional parameters.
- FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
-
This paper proposes FALCON, which introduces learnable Visual Registers into the ViT encoder. Through the ReCompact mechanism, visual redundancy is eliminated directly during the encoding stage (achieving 9× token compression), while the ReAtten module resolves visual fragmentation caused by image cropping via inter-register interactions.
- Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
-
This paper identifies a systematic positional bias in early visual token pruning for VLMs—caused by RoPE, which tends to retain tokens from the bottom of the image—and proposes FEATHER, which addresses this issue via RoPE-free attention, uniform sampling, and multi-stage pruning, achieving over 5× performance improvement on visual grounding tasks.
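The selection criterion, in sketch form (hypothetical helper names), unions attention-ranked tokens with a uniform grid so coverage is not biased toward any one image region:

```python
import numpy as np

def feather_select(attn_no_rope, num_tokens, keep_ratio=0.25, grid_stride=4):
    """attn_no_rope: (num_tokens,) attention received by each visual token,
    computed without rotary position embedding to avoid the bottom-of-image bias."""
    k = int(num_tokens * keep_ratio)
    by_attention = set(np.argsort(attn_no_rope)[-k:].tolist())  # top-k by attention
    uniform = set(range(0, num_tokens, grid_stride))            # coverage safety net
    return sorted(by_attention | uniform)

keep = feather_select(np.random.rand(576), 576)
```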
- FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models
-
This paper proposes FedMVP, which, under a federated learning setting, employs a PromptFormer network to fuse image visual features with LLM-generated category attribute text features, generating dynamic multimodal visual prompts injected into CLIP's visual encoder. FedMVP achieves substantial improvements of 1.57%–2.26% over existing federated prompt learning methods across 20 datasets and three generalization settings.
- Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
-
This paper proposes VLADBench, a fine-grained vision-language model evaluation benchmark for autonomous driving scenarios, organized into 5 top-level domains, 11 second-level dimensions, and 29 fine-grained tasks. Using a closed-ended QA format, it progressively assesses VLM capabilities from static knowledge to dynamic reasoning, and trains small-scale domain-specific (DS) models on 1.4M domain-specific QA data to validate cognitive interactions across domains.
- FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging
-
This paper proposes FinMMR, a bilingual (Chinese–English) multimodal financial numerical reasoning benchmark comprising 4,300 questions, 8,700+ financial charts, and 14 financial sub-domains. It systematically evaluates 15 MLLMs to identify bottlenecks in complex domain-specific reasoning, and proposes three improvement strategies: visual filtering, knowledge augmentation, and model collaboration.
- FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
-
This paper proposes FOLDER — a plug-and-play visual token compression module that systematically analyzes three key factors of information loss (reduction impact, propagation effect, and aggregation method), performs aggressive token merging in the last few layers of the visual encoder, and achieves up to 70% token reduction while maintaining or even improving model performance.
- FREE-Merging: Fourier Transform for Efficient Model Merging
-
This paper is the first to identify the frequency-domain manifestation of task interference in model merging. It proposes FR-Merging, which removes low-frequency interference via high-pass filtering to construct a high-quality merged backbone, and combines it with lightweight task expert modules (FREE-Merging) to achieve an optimal performance–cost trade-off across vision, language, and multimodal tasks.
- Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference
-
This paper proposes Free-MoRef, a training-free method inspired by Mixture-of-Experts (MoE) that partitions long video tokens into multiple short sequences as multi-references, queries them in parallel via the MoRef attention mechanism, and fuses the resulting activations into a unified response. The approach enables efficient and comprehensive understanding of 2× to 8× longer frame inputs on a single A100 GPU, surpassing dedicated long-video models on VideoMME, MLVU, and LongVideoBench.
- From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning
-
This paper proposes the MIR benchmark, comprising 22,257 multi-image interleaved reasoning QA pairs with five-stage reasoning steps, and introduces a progressive curriculum learning strategy that trains MLLMs from easy to hard samples to improve multi-image interleaved reasoning capability.
- From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning
-
This paper proposes Dual-LoRA and Visual Cue Enhancement (VCE) modules that adopt a "from holistic to localized" paradigm to address data conflicts in efficient visual instruction fine-tuning, surpassing LoRA-MoE methods with only a 1.16× inference time overhead.
- G2D: Boosting Multimodal Learning with Gradient-Guided Distillation
-
This paper proposes G2D (Gradient-Guided Distillation), which addresses the modality imbalance problem in multimodal learning by combining feature distillation and logit distillation from unimodal teachers to a multimodal student, together with a Sequential Modality Prioritization (SMP) gradient modulation strategy guided by unimodal teacher confidence scores. G2D achieves 85.89% accuracy on CREMA-D, surpassing all state-of-the-art methods focused on modality imbalance.
- GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
-
This paper introduces the DataDoP dataset (29K free-moving camera trajectories with descriptions extracted from real films) and the GenDoP auto-regressive Transformer model, which generates artistic, high-quality camera motion trajectories conditioned on text and/or RGBD input, outperforming existing methods in controllability, motion smoothness, and complexity.
- Generalizable Object Re-Identification via Visual In-Context Prompting
-
VICP proposes a generalizable object re-identification framework in which an LLM infers identity-discriminative rules from a small set of positive/negative image pairs and converts them into dynamic visual prompts injected into a frozen visual foundation model (DINOv2), enabling zero-parameter-update generalization to unseen object categories.
- GTA-CLIP: Generate, Transduct, Adapt — Iterative Transduction with VLMs
-
This paper proposes GTA-CLIP, which iteratively executes three steps — LLM-based attribute generation, attribute-enhanced transductive inference, and encoder fine-tuning — achieving an average zero-shot improvement of 9.5% and few-shot improvement of 3–4% across 12 datasets, and for the first time unifying attribute discovery, transductive inference, and model adaptation in a zero-label setting.
- GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
-
This paper introduces GEOBench-VLM, a comprehensive benchmark designed to evaluate VLMs on geospatial tasks, encompassing 31 sub-tasks across 8 major categories and over 10,000 manually verified instructions. The benchmark reveals that current state-of-the-art VLMs, including GPT-4o, still perform poorly on geospatial tasks, with the highest accuracy reaching only 41.7%.
- Global and Local Entailment Learning for Natural World Imagery
-
This paper proposes Radial Cross-Modal Embeddings (RCME), a framework that explicitly models the transitivity of entailment relations to learn hierarchical representations in vision-language models. RCME enables inference at arbitrary taxonomic ranks on the Tree of Life and achieves state-of-the-art performance on hierarchical classification and retrieval tasks.
- GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
-
GRAB is a graph analysis benchmark for large multimodal models (LMMs), comprising 3,284 synthetically generated questions spanning 5 tasks and 23 graph properties. The strongest model evaluated, Claude 3.5 Sonnet, achieves only 21.0% accuracy, revealing critical deficiencies in LMMs' capacity for visual analytical reasoning.
- Growing a Twig to Accelerate Large Vision-Language Models
-
This paper proposes TwigVLM, which attaches a lightweight twig module to the early layers of a VLM to simultaneously enable twig-guided visual token pruning (TTP, for prefilling acceleration) and self-speculative decoding (SSD, for decoding acceleration). On LLaVA-1.5-7B, TwigVLM retains 96% accuracy after pruning 88.9% of visual tokens and achieves a 154% speedup in long-answer generation, substantially outperforming existing methods in both accuracy and speed.
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-Based VLM Agent
-
This paper identifies that relying solely on outcome rewards during RL training of VLM agents leads to "thought collapse," and proposes the GTR framework, which employs an external VLM corrector to automatically rectify reasoning processes and jointly trains thoughts and actions via PPO + SFT, achieving 3–5× improvement in task success rates on the Game of 24 and ALFWorld benchmarks.
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
-
This work identifies that the encoder of masked autoregressive (MAR) models inherently possesses both the fine-grained image features required for generation and the high-level semantic representations required for understanding. Based on this observation, Harmon is proposed — an autoregressive framework that unifies image generation and understanding via a shared MAR encoder. Through three-stage progressive training, Harmon achieves an Overall score of 0.76 on GenEval, surpassing all unified models, while matching the understanding performance of the Janus series that employs a dedicated SigLIP encoder.
- Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
-
This paper proposes the Hints of Prompt (HoP) framework, which enhances CLIP visual representations through three hierarchical hints (Affinity/Semantic/Question hint) to capture instance-level structure, domain-specific semantics, and question relevance. HoP surpasses the fully trained baseline on autonomous driving VQA tasks using only 25% of the training data.
- HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
-
This paper introduces HRScene, a benchmark covering 25 real-world scenarios and 2 diagnostic datasets (resolution 1K–35K). Evaluating 28 VLMs reveals that current state-of-the-art models achieve an average accuracy of only ~50% on real high-resolution tasks, with significant regional performance divergence and a pronounced lost-in-middle problem.
- IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
-
This paper proposes IDEATOR, a framework that leverages VLMs themselves as red-team models to autonomously generate multimodal jailbreak image-text pairs, achieving a 94% attack success rate against MiniGPT-4's safety mechanisms. Based on this framework, the authors construct VLJailbreakBench, a safety evaluation benchmark comprising 3,654 samples.
- IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
-
This paper proposes IDEATOR, the first black-box jailbreak framework that uses a VLM to red-team other VLMs. A weakly safety-aligned VLM (MiniGPT-4) serves as the attacker, generating semantically rich image–text jailbreak pairs in conjunction with Stable Diffusion. A breadth-depth exploration strategy iteratively refines attacks, achieving a 94% attack success rate (ASR) on MiniGPT-4 with an average of 5.34 queries, and transferring to LLaVA/InstructBLIP/Chameleon at 75–88%. The work also introduces VLJailbreakBench (3,654 samples) to expose safety vulnerabilities across 11 VLMs.
- Information Density Principle for MLLM Benchmarks
-
This paper proposes an "information density" principle to evaluate MLLM benchmark quality along four dimensions — Fallacy, Difficulty, Redundancy, and Diversity — and constructs a three-tier automated evaluation pipeline (Human–Model–Data) to conduct a systematic "benchmark for benchmark" analysis of 19 mainstream benchmarks.
- Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models
-
This paper proposes MVP (Mixture of Visual Projectors), a Mixture-of-Experts framework for visual projectors conditioned on instruction context. Through an expert recommendation strategy and an expert pruning mechanism, MVP enables generative VLMs to continually learn new vision-language tasks without catastrophic forgetting, while maintaining responsiveness to diverse instruction types. MVP consistently outperforms existing methods across classification, captioning, and question-answering tasks.
- Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
-
This paper proposes the Instruction-oriented Preference Alignment (IPA) framework, which anchors alignment signals to instruction completion efficacy rather than hallucination factors alone, via an automated preference construction mechanism and a progressive preference data collection pipeline. IPA achieves consistent improvements on Qwen2VL-7B across 9 benchmarks spanning hallucination evaluation, general VQA, and text comprehension.
- Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model
-
This paper proposes LaZSL, which leverages Optimal Transport (OT) to achieve fine-grained alignment between local visual regions and semantic attributes, constructing an interpretable zero-shot classifier without additional training. LaZSL demonstrates strong accuracy, interpretability, and domain generalization across 9 datasets.
- Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining
-
Iris introduces two core innovations — Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL) — achieving SOTA on multiple GUI understanding benchmarks with only 850K annotated samples, matching methods that use over 10× more data, while reducing inference time from 3 seconds to 1 second.
- Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation
-
This paper proposes Token Condensation as Adaptation (TCA), a training-free test-time adaptation method that leverages a Domain-aware Token Reservoir (DTR) to guide cross-head token pruning/merging and logits self-correction. Without modifying model parameters, TCA improves cross-dataset performance of CLIP/SigLIP variants by up to 21.4% while reducing GFLOPs by 12.2%–48.9%.
- Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
-
This paper identifies a Shuffle Inconsistency between the comprehension capability and the safety capability of multimodal large language models (MLLMs)—models can understand shuffled harmful instructions, yet their safety mechanisms fail to defend against them. Building on this finding, the authors propose SI-Attack, a query-based black-box jailbreak method that achieves substantially higher attack success rates on both open-source and closed-source commercial models.
- Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
-
This paper proposes the first automated feature interpretation framework for Large Multimodal Models (LMMs). It employs Sparse Autoencoders (SAEs) to decompose LMM internal representations into monosemantic features, leverages larger LMMs to automatically interpret these features, and demonstrates that feature steering can correct model hallucinations.
- LATTE: Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning
-
This paper proposes Latte, a framework that enables collaborative test-time adaptation of vision-language models (e.g., CLIP) in decentralized federated learning settings. Through a dual-memory mechanism combining local and external memory, Latte achieves cross-client knowledge sharing while preserving client-level personalization.
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
-
LLaVA-CoT proposes a method enabling vision-language models to perform autonomous multi-stage structured reasoning. By constructing the LLaVA-CoT-100k structured reasoning annotation dataset, the model is trained to sequentially execute four stages—Summary, Caption, Reasoning, and Conclusion—and a Stage-Wise Retracing Search (SWIRES) is proposed for test-time scaling, allowing an 11B model to surpass Gemini-1.5-pro and GPT-4o-mini.
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
-
This paper proposes the LLaVA-KD framework, which transfers knowledge from large-scale MLLMs to small-scale MLLMs via Multimodal Distillation (MDist) and Relational Distillation (RDist) strategies combined with a three-stage training scheme (DPT-SFT-DFT), significantly improving small model performance without modifying the model architecture.
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
-
By exploiting the sparsity of attention scores between the CLS token and spatial tokens in the visual encoder, this work adaptively prunes and merges visual tokens, maintaining comparable LMM performance while retaining only 5.5% of visual tokens.
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
-
By constructing the LLaVA-CoT-100k dataset with structured reasoning annotations, the proposed method trains a VLM to autonomously execute a four-stage reasoning pipeline—Summary → Caption → Reasoning → Conclusion—combined with a SWIRES search strategy at test time. The resulting 11B model outperforms substantially larger models including GPT-4o-mini and Gemini-1.5-pro.
- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
-
By exploiting the sparsity of attention scores between the [CLS] token and visual tokens in CLIP-ViT, PruMerge adaptively selects important visual tokens via IQR-based outlier detection, then merges pruned tokens back into retained tokens through k-nearest-neighbor clustering, achieving up to 14× visual token compression with negligible performance degradation.
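The two steps the note describes can be sketched as follows (numpy, with illustrative thresholds and a simplified merge rule, not the authors' code):

```python
import numpy as np

def prumerge(tokens, cls_attn):
    """tokens: (N, D) visual tokens; cls_attn: (N,) [CLS]-to-token attention."""
    q1, q3 = np.percentile(cls_attn, [25, 75])
    keep = np.where(cls_attn > q3 + 1.5 * (q3 - q1))[0]       # IQR outliers
    if keep.size == 0:                                        # degenerate fallback
        keep = np.argsort(cls_attn)[-max(1, len(tokens) // 20):]
    drop = np.setdiff1d(np.arange(len(tokens)), keep)
    merged = tokens[keep].copy()
    nearest = (tokens[drop] @ tokens[keep].T).argmax(1)       # most similar kept token
    for i, j in zip(drop, nearest):
        merged[j] += cls_attn[i] * tokens[i]                  # attention-weighted fold-in
    return merged

out = prumerge(np.random.randn(576, 1024), np.random.rand(576) ** 6)
```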
- Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness
-
This paper proposes DataTailor — a collaborative multimodal data selection framework grounded in three principles: informativeness, uniqueness, and representativeness. Using only 15% of the data, DataTailor achieves 101.3% of the performance obtained with full-data fine-tuning, embodying the "Less is More" philosophy.
- MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
-
This paper proposes MaTVLM, which replaces a portion of Transformer layers in a pretrained VLM with Mamba-2 layers and trains the resulting model via single-stage knowledge distillation, achieving 3.6× inference speedup and 27.5% memory reduction while maintaining competitive performance.
- MAVias: Mitigate Any Visual Bias
-
This paper proposes MAVias, an open-set visual bias mitigation framework that extracts visual attribute tags from images using a tagging foundation model, employs an LLM to filter out tags irrelevant to the target class as potential biases, encodes the identified biases via vision-language embeddings, and incorporates them into training to learn bias-invariant representations. MAVias substantially outperforms existing methods on CelebA, Waterbirds, UrbanCars, and ImageNet9.
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
-
This paper introduces Multi-Context Visual Grounding as a novel task and the MC-Bench benchmark—comprising 2,000 manually annotated samples, 3 text description styles, and 20 practical skills—to evaluate 20+ MLLMs and foundation models. It reveals a substantial performance gap between current models and humans (human AP50=41.3% vs. best end-to-end model AP50=30.7%), and provides an agentic baseline combining GPT-4o and G-DINO (AP50=36.2%).
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
-
This paper proposes Visual-Predictive Instruction Tuning (VPiT), which extends a pretrained LLM into a unified model—MetaMorph—capable of both visual understanding and generation via lightweight instruction tuning alone. A key finding is that visual generation ability emerges as a natural byproduct of visual understanding, and the two capabilities mutually benefit each other in an asymmetric manner.
- METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models
-
METEOR proposes the first three-stage progressive token pruning framework for multi-encoder MLLMs: at the encoding stage, feature rank is used to allocate sparsity ratios across encoders; at the fusion stage, collaborative pruning eliminates cross-encoder redundancy; at the decoding stage, pruning ratios are adaptively adjusted based on text prompts. The framework reduces visual tokens by 76% with only a 0.3% performance drop.
- Mitigating Object Hallucinations via Sentence-Level Early Intervention
-
This paper proposes SENTINEL, a framework grounded in the key observation that hallucinations emerge early in generation and propagate forward. By combining in-domain candidate bootstrapping with dual-detector cross-validation to construct sentence-level preference data, and employing Context-aware DPO (C-DPO) for early intervention, SENTINEL reduces hallucinations on Object HalBench by 92% while preserving general capabilities.
- MM-IFEngine: Towards Multimodal Instruction Following
-
This paper proposes the MM-IFEngine pipeline, which systematically generates high-quality image–instruction pair data (in both SFT and DPO variants) and constructs the MM-IFEval benchmark, achieving significant improvements in multimodal instruction following for MLLMs.
- MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
-
Apple proposes the CA-VQA dataset and MM-Spatial model, leveraging high-quality 3D scene data and open-set annotations to generate training/evaluation data covering spatial relation prediction, metric estimation, and 3D grounding. The resulting general-purpose MLLM achieves SOTA on 3D spatial understanding benchmarks while remaining competitive on other tasks.
- MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning
-
This paper introduces MMAT-1M, the first million-scale multimodal agent tuning dataset, constructed via a four-stage data engine (Foundation → Rationale → Reflection → Integration). It endows MLLMs with CoT reasoning, tool invocation, and self-reflection capabilities, achieving an average improvement of 2.7% on InternVL2.5-8B and 8.8% on RAG tasks.
- MMOne: Representing Multiple Modalities in One Scene
-
MMOne is a general framework that addresses property disparity and granularity disparity in multi-modal scene representation through a modality modeling module (with modality indicators) and a multi-modal decomposition mechanism. It jointly models RGB, thermal, and language modalities within a single 3DGS representation, achieving consistent improvements across all modalities.
- MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild
-
This paper proposes MolParser, an end-to-end Optical Chemical Structure Recognition (OCSR) method that handles Markush structures via an extended SMILES representation (E-SMILES), constructs a large-scale training set MolParser-7M with 7 million samples, and incorporates real-world literature data through active learning. MolParser achieves 76.9% accuracy on the WildMol benchmark, significantly outperforming existing methods.
- Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models
-
This paper proposes MCP/MCP++, a multi-cache enhanced prototype learning framework that constructs compact intra-class distributions via three complementary cache modules—entropy cache, align cache, and negative cache—and further introduces cross-modal residual learning to refine the alignment between visual and textual prototypes, achieving state-of-the-art zero-shot generalization across 15 downstream tasks.
- Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
-
This paper proposes LLaVA-Reward, which leverages the hidden states (rather than text generation outputs) of a pretrained MLLM to directly predict reward scores. A Skip-connection Cross Attention (SkipCA) module is introduced to enhance bidirectional visual-text interaction, and LoRA adapters are employed to handle different evaluation dimensions. The method achieves state-of-the-art performance on text-image alignment, fidelity, and safety evaluation, and can be applied to inference-time scaling for diffusion models.
- MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
-
This paper proposes MultiVerse, a multi-turn conversation evaluation benchmark comprising 647 dialogues collected from 12 VLM evaluation datasets, spanning 484 task types and 484 interaction goals. Using a checklist-based evaluation approach, the benchmark reveals that even the strongest model, GPT-4o, achieves only ~50% success rate on complex multi-turn conversations.
- MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
-
This paper proposes Semantic Discrete Encoding (SDE), which injects pretrained CLIP semantic features into the quantization process of a visual tokenizer, enabling discrete visual tokens to be naturally aligned with language tokens. With only 24M image-text pairs, the resulting unified model achieves state-of-the-art performance on both visual understanding and generation.
- MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
-
This paper proposes a Semantic Discrete Encoding (SDE) visual tokenizer that augments VQGAN with SigLIP semantic feature constraints, enabling discrete visual tokens to align semantically with language tokens. Built upon SDE, a unified autoregressive VLM (MUSE-VL) is constructed that, using only 24M training samples, outperforms Emu3 by 4.8% on understanding benchmarks, surpasses the specialist model LLaVA-NeXT 34B by 3.7%, and simultaneously supports image generation.
- NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection
-
This paper proposes NegRefine, which leverages an LLM to filter proper nouns and subcategory labels from the negative label set, and designs a multi-label matching scoring function to handle cases where an image simultaneously matches both in-distribution and negative labels. On the ImageNet-1K benchmark, NegRefine achieves an average AUROC improvement of 1.82% and FPR95 reduction of 4.35%, establishing a new state of the art in zero-shot OOD detection.
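For context, a NegLabel-style base score that the refined negative set plugs into looks roughly like this (the multi-label matching refinement itself is in the paper):

```python
import numpy as np

def id_score(sim_id, sim_neg, tau=0.01):
    """sim_id / sim_neg: CLIP similarities of an image to in-distribution labels
    and to (filtered) negative labels. Values near 1 indicate ID; low values
    indicate OOD. Temperature tau is illustrative."""
    pos = np.exp(sim_id / tau).sum()
    neg = np.exp(sim_neg / tau).sum()
    return pos / (pos + neg)
```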
- On Large Multimodal Models as Open-World Image Classifiers
-
This paper systematically evaluates 13 large multimodal models (LMMs) on open-world image classification, proposes an evaluation protocol comprising four complementary metrics, and reveals systematic error patterns in LMMs regarding granularity judgment and fine-grained discrimination.
- One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
-
This paper proposes C-PGC, a framework that trains a conditional perturbation generator via malicious contrastive learning to produce a pair of universal image-text adversarial perturbations (UAPs), fundamentally disrupting the multimodal alignment of VLP models and achieving strong attack performance across multiple VLP models and downstream tasks in both white-box and black-box settings.
- ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models
-
This paper proposes ONLY, a training-free single-layer intervention decoding method. It selects text-biased attention heads via the Text-to-Visual Entropy Ratio (TVER) to generate textually-enhanced logits, which are then used in adaptive contrastive or collaborative decoding against the original logits. With only 1.07× inference overhead, ONLY outperforms VCD/M3ID by 3.14% on POPE and reduces CHAIR_S by 6.2 points on CHAIR.
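In sketch form (variable names hypothetical; the exact head-selection direction follows the paper):

```python
import torch

def tver(text_attn, visual_attn, eps=1e-8):
    """Text-to-Visual Entropy Ratio for one head's attention over text vs. visual
    tokens; heads are ranked by this ratio to find the text-biased ones whose
    outputs build the textually-enhanced logits."""
    def entropy(p):
        p = p / (p.sum() + eps)
        return -(p * (p + eps).log()).sum()
    return (entropy(text_attn) / (entropy(visual_attn) + eps)).item()

def combined_logits(logits_orig, logits_text_enhanced, alpha=0.5):
    # contrastive form: amplify what the enhanced pass adds over the original
    return (1 + alpha) * logits_text_enhanced - alpha * logits_orig
```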
- OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
-
This paper introduces OpenVision — a fully open-source (data, training code, and weights) family of vision encoders (5.9M–632.1M parameters) trained with the CLIPS framework on the Recap-DataComp-1B dataset. When integrated into multimodal frameworks such as LLaVA, OpenVision matches or surpasses OpenAI CLIP and Google SigLIP, providing the community with a transparent and flexible alternative visual backbone.
- OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography
-
This paper proposes OracleFusion, a two-stage semantic typography framework. Stage 1 employs MLLM-enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of oracle bone script (OBS) and localize key components. Stage 2 introduces Structural Oracle Vector Fusion (SOVF), which generates semantically enriched vector glyphs through glyph structure constraints and skeleton-preserving losses, conveying semantic meaning while preserving original glyph integrity to assist expert decipherment of undeciphered OBS characters.
- OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding
-
This paper proposes OrderChain, a prompting paradigm that enhances the ordinal understanding capability of multimodal large language models (MLLMs) via task-aware prompts and a Range-Optimized Chain-of-Thought (RO-CoT), achieving for the first time a unified ordinal regression model across diverse tasks.
- Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
-
This paper proposes the Abstract Perspective Change (APC) framework, which leverages visual foundation models to construct an abstract scene representation and perform perspective transformations, enabling VLMs to reason spatially from arbitrary viewpoints. APC substantially outperforms existing VLMs and fine-tuned models on both synthetic and real-image benchmarks.
- Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models
-
This paper proposes Physics Context Builders (PCBs), a modular framework that fine-tunes small specialized VLMs on simulation data to generate detailed physical scene descriptions, which serve as physical context to augment the physical reasoning capabilities of large foundation VLMs (e.g., GPT-4o), without modifying the large model itself.
- PhysSplat: Efficient Physics Simulation for 3D Scenes via MLLM-Guided Gaussian Splatting
-
This paper proposes PhysSplat, the first approach to leverage multimodal large language models (MLLMs) for zero-shot estimation of physical properties of objects in 3D scenes. Combined with a physics-geometry adaptive sampling strategy, it achieves realistic physics simulation on a single GPU within 2 minutes.
- Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information
-
Pi-GPS leverages diagrammatic information to resolve ambiguities in textual descriptions. By introducing a lightweight Rectifier–Verifier module, it addresses a previously overlooked problem of textual ambiguity, achieving nearly 10% improvement over prior state-of-the-art neuro-symbolic methods on Geometry3K.
- PRO-VPT: Distribution-Adaptive Visual Prompt Tuning via Prompt Relocation
-
This paper proposes PRO-VPT, a framework that co-designs Adaptive Distribution Optimization (ADO) with Visual Prompt Tuning (VPT) via nested optimization. By iteratively relocating prompts through idleness score-based pruning and a reinforcement learning-based allocation strategy, PRO-VPT achieves gains of 1.6 pp and 2.0 pp over VPT on VTAB-1k and FGVC, respectively.
- ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition
-
This paper proposes ProbRes, a framework that leverages a probabilistic residual search strategy based on jump diffusion, combined with ConceptNet commonsense priors and VLM likelihood estimation, to efficiently navigate large-scale search spaces in open-world egocentric activity recognition. ProbRes substantially reduces the number of VLM queries while improving recognition accuracy.
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
-
This paper proposes StepGRPO, an online reinforcement learning framework that introduces two rule-based step-wise reasoning rewards — StepRAR (Step-wise Reasoning Accuracy Reward) and StepRVR (Step-wise Reasoning Validity Reward) — without requiring a process reward model. The framework addresses the sparse reward problem in RL-based MLLM training, enabling models to autonomously explore and improve their reasoning capabilities.
- ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
-
This paper proposes ReasonVQA, a dataset constructed through a low-cost and scalable framework that automatically integrates structured encyclopedic knowledge (Wikidata) with images, generating 1/2/3-hop multi-hop reasoning questions. The benchmark comprises 598K images and 4.2M questions, posing significant challenges to existing VQA models.
- Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
-
This work identifies a pervasive vulnerability of mainstream VLMs to Gaussian noise. It proposes the Robust-VLGuard safety dataset (covering both image-text aligned and misaligned scenarios) with noise-augmented fine-tuning to improve Gaussian noise robustness, and combines this with DiffPure, which converts adversarial noise into Gaussian-like noise, forming the DiffPure-VLM defense framework that effectively resists adversarial attacks of varying strengths.
- SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders
-
SAUCE leverages sparse autoencoders (SAEs) to identify and selectively suppress features associated with target concepts in VLM intermediate representations, enabling fine-grained concept unlearning without weight updates. Evaluated across 60 concepts, it surpasses the previous SOTA in forgetting quality by 18%.
- SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning
-
SC-Captioner proposes a multi-turn reinforcement learning framework based on policy gradient. By designing a correction reward function that incorporates correctness bonuses and mistake punishments, it enables large vision-language models to acquire self-correction capabilities for image captioning, while also introducing an improved CAPTURE evaluation metric.
- Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
-
This paper proposes the Vision Value Model (VisVM), trained via temporal difference (TD) learning, to guide sentence-level inference-time search in VLMs for generating higher-quality descriptive captions. Compared to greedy decoding and CLIP-PRM, VisVM search significantly reduces hallucination (CHAIRs from 32.4 to 26.2), and data generated through this process, when used for self-training, yields an average improvement of 10.8% across 9 benchmarks.
- Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
-
This paper proposes the Vision Value Model (VisVM), a value network trained via TD learning to predict the long-term value of sentences generated by a VLM. VisVM guides sentence-level beam search at inference time to produce image descriptions with fewer hallucinations and richer detail. High-quality captions generated by VisVM are further used for self-training, achieving an average improvement of 10.8% over LLaVA-Next across 9 benchmarks.
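Because VisVM is trained with TD learning over the sentences of a caption, its objective reduces to a standard TD(0) regression. A minimal sketch assuming a per-sentence reward signal (e.g., CLIP similarity) and a simple MLP value head; the paper's actual architecture and reward may differ:

```python
import torch
import torch.nn as nn

class SentenceValueModel(nn.Module):
    """Tiny value head over joint (image, sentence) features; stand-in for VisVM."""
    def __init__(self, dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_feat, sent_feats):
        return self.head(torch.cat([img_feat, sent_feats], dim=-1)).squeeze(-1)

def td0_loss(model, img_feat, sent_feats, rewards, gamma=0.9):
    """TD(0): regress V(s_t) onto r_t + gamma * V(s_{t+1}) across a caption's sentences.
    img_feat: (D,), sent_feats: (T, D), rewards: (T,)."""
    values = model(img_feat.expand(len(sent_feats), -1), sent_feats)  # (T,)
    with torch.no_grad():
        next_v = torch.cat([values[1:], values.new_zeros(1)])         # V after last sentence = 0
        targets = rewards + gamma * next_v
    return nn.functional.mse_loss(values, targets)
```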
- Scaling Laws for Native Multimodal Models
-
By training 457 models across diverse architectures, scales, and training data mixtures, this paper systematically investigates scaling laws for Native Multimodal Models (NMMs). It finds that early-fusion architectures (without pretrained visual encoders) outperform late-fusion counterparts at small parameter scales, are more training-efficient, and simpler to deploy; incorporating MoE further yields substantial performance gains.
- SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency
-
This paper proposes SCAN, a dynamic bootstrapping dataset pruning method that iteratively identifies pruning candidates and applies dataset mutation operations, achieving an average performance drop of less than 1% at a 30–35% pruning rate in CLIP and MoCo contrastive pre-training.
- ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
-
This work identifies significant layer-level redundancy in MLLMs—most layers contribute minimally to the transformation of visual tokens—and proposes ShortV: freezing visual tokens (skipping their attention and FFN computations) in approximately 60% of layers. On LLaVA-NeXT-13B, this achieves a 50% reduction in FLOPs with negligible performance degradation. The method is training-free and orthogonal to token pruning approaches, allowing them to be combined.
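Conceptually, freezing visual tokens in a layer means their hidden states pass through unchanged while text tokens still update (and can still attend to them). A minimal sketch of such a layer; for clarity it computes the full block and then discards the visual-token updates, whereas the real method skips that compute to realize the FLOPs savings:

```python
import torch
import torch.nn as nn

class ShortVLayer(nn.Module):
    """Transformer block that freezes visual tokens in designated layers.
    Sketch only: the paper selects ~60% of layers by a layer-contribution
    score; layer selection and KV-cache handling are omitted here."""
    def __init__(self, dim, heads, freeze_visual=False):
        super().__init__()
        self.freeze_visual = freeze_visual
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, visual_mask):          # x: (B, T, D); visual_mask: (B, T) bool
        h = x + self.attn(self.n1(x), self.n1(x), self.n1(x), need_weights=False)[0]
        h = h + self.ffn(self.n2(h))
        if self.freeze_visual:                  # visual tokens keep their input states
            h = torch.where(visual_mask.unsqueeze(-1), x, h)
        return h
```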
- SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
-
SimpleVQA is the first VQA benchmark designed for comprehensive multimodal factuality evaluation of MLLMs. It spans 9 task types and 9 thematic domains, and employs a short-answer design with deterministic references alongside an LLM-as-a-judge scoring protocol to systematically assess the factual capabilities of 18 MLLMs and 8 text-only LLMs.
- SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning
-
This paper identifies a phenomenon termed "dual catastrophic forgetting" in continual visual instruction tuning (CVIT) of multimodal large models, wherein both visual understanding capability and instruction-following capability degrade simultaneously. To address this, SMoLoRA is proposed, employing a separable-routing mixture of LoRA experts to effectively mitigate both forms of forgetting.
- SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
-
This paper reveals a "visual head sparsity" phenomenon in Multimodal Large Language Models (MLLMs), where only approximately 5% of attention heads actively participate in visual understanding. It proposes a training-free visual head identification framework based on OCR tasks and introduces SparseMM — an acceleration strategy that asymmetrically allocates KV-Cache budgets across heads according to their visual scores — achieving 1.38× real-time speedup and 52% memory reduction with no performance degradation.
- SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
-
This paper proposes SparseVILA—the first VLM inference acceleration framework that decouples visual sparsity between the prefill and decode stages: query-agnostic redundant token pruning during prefill, and query-aware relevant token retrieval during decode. The approach achieves up to 4.0× prefill speedup, 2.5× decode throughput improvement, and 2.6× end-to-end acceleration, while maintaining accuracy in multi-turn conversation settings where existing methods suffer severe degradation due to permanent token deletion.
- Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation
-
This paper proposes Sparse Optimization (SO), a framework that replaces low-rank adaptation methods (e.g., LoRA) via dynamic sparse gradient selection and importance-based momentum pruning. SO achieves state-of-the-art performance on few-shot VLM adaptation across 11 datasets while reducing memory overhead.
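The core loop is easy to picture: at each step, only the largest-magnitude gradient entries are selected as the update support, with momentum maintained on that dynamic support. A rough sketch under these assumptions; SO's actual importance scoring and momentum pruning are more elaborate:

```python
import torch

@torch.no_grad()
def sparse_momentum_step(param, state, lr=1e-3, density=0.01, beta=0.9):
    """Update only the top `density` fraction of entries by gradient magnitude."""
    g = param.grad.view(-1)
    k = max(1, int(density * g.numel()))
    idx = g.abs().topk(k).indices                     # dynamically selected support
    buf = state.setdefault("momentum", torch.zeros_like(g))
    buf.mul_(beta)                                    # decay everywhere
    buf[idx] += g[idx]                                # accumulate on the support
    param.view(-1)[idx] -= lr * buf[idx]              # sparse parameter update

w = torch.nn.Parameter(torch.randn(256, 256))
w.grad = torch.randn_like(w)
sparse_momentum_step(w, state={})                     # touches only ~1% of entries
```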
- Spatial Preference Rewarding for MLLMs Spatial Understanding
-
This paper proposes SPR (Spatial Preference Rewarding), a framework that automatically constructs preference data pairs via semantic and localization scores, and trains MLLMs with DPO to distinguish high-precision grounding (chosen) from ambiguous or erroneous grounding (rejected), substantially improving fine-grained spatial understanding—particularly at high IoU thresholds.
- STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
-
This paper proposes STI-Bench, a benchmark for evaluating the precise spatial-temporal understanding capabilities of multimodal large language models (MLLMs), covering three scene categories (desktop/indoor/outdoor), eight static and dynamic task types, and over 2,000 QA pairs. The benchmark reveals that the current state-of-the-art MLLM (Gemini-2.5-Pro) achieves an average accuracy of only 41.4%, exposing fundamental deficiencies in precise spatial quantification and temporal dynamic understanding.
- Synergistic Prompting for Robust Visual Recognition with Missing Modalities
-
This paper proposes the Synergistic Prompting (SyP) framework, which employs a dynamic adapter to generate input-adaptive scaling factors that modulate a base prompt (dynamic prompt), synergizing with a static prompt that captures shared cross-modal features. SyP achieves robust visual recognition under missing-modality conditions and consistently outperforms SOTA methods such as DCP on MM-IMDb, Food101, and Hateful Memes.
- TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
-
This paper proposes TAB (Transformer Attention Bottleneck), a single-head co-attention bottleneck layer inserted after standard MHSA. By removing the skip connection and constraining attention values to \([0,1]\), TAB enables precise attention visualization, ground-truth-supervised training, and test-time user editing intervention in VLMs. On change captioning tasks, it establishes for the first time a causal relationship between attention values and VLM outputs.
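Because the bottleneck drops the residual path, every output token is a \([0,1]\)-weighted combination of value vectors, which is what ties attention causally to the output and makes it editable at test time. A minimal sketch, assuming a sigmoid as the \([0,1]\) constraint (the paper only specifies the constraint, not necessarily this mechanism):

```python
import torch
import torch.nn as nn

class AttentionBottleneck(nn.Module):
    """Single-head co-attention bottleneck: values squashed to [0,1] per
    query-key pair (not softmax-normalized), and deliberately no skip
    connection, so the output is fully explained by the attention map."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, text_tokens, vis_tokens, attn_override=None):
        a = torch.sigmoid(self.q(text_tokens) @ self.k(vis_tokens).transpose(-2, -1) * self.scale)
        if attn_override is not None:      # test-time user edit of the attention map
            a = attn_override
        return a @ self.v(vis_tokens), a   # no residual: output depends only on `a`
```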
- Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown
-
Using Monster Hunter: World as a testbed, this paper constructs a multimodal knowledge graph (MH-MMKG) containing text, images, video, and complex entity relations, designs 238 complex queries along with a multi-agent knowledge retrieval method, and reveals the inadequacy of current MLLMs in domain-specific knowledge retrieval and reasoning tasks.
- The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models
-
This paper proposes the Inter-Intra Modal Measure (IIMM)—a metric that requires only a single forward pass to predict both the performance gain and the degree of catastrophic forgetting following fine-tuning of vision-language dual-encoder models. By quantifying intra-modal image embedding similarity and inter-modal misaligned label alignment, IIMM demonstrates strong linear predictive power (\(R^2 > 0.85\)) across 4 foundation models and 5 fine-tuning strategies.
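Both ingredients of the metric come from one forward pass over the target-task embeddings. A hedged sketch of an IIMM-style computation; the paper's exact combination of the intra- and inter-modal terms may differ:

```python
import torch
import torch.nn.functional as F

def iimm(image_embs, text_embs, labels):
    """image_embs: (N, D); text_embs: (C, D), one per class label; labels: (N,) long."""
    img = F.normalize(image_embs, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    sim_ii = img @ img.T
    n = len(img)
    intra = (sim_ii.sum() - sim_ii.diag().sum()) / (n * (n - 1))   # mean off-diagonal
    sim_it = img @ txt.T                                           # (N, C)
    mask = F.one_hot(labels, num_classes=txt.shape[0]).bool()
    inter_mis = sim_it[~mask].mean()          # mean similarity to the *wrong* labels
    return (intra + inter_mis) / 2            # illustrative combination
```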
- ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools
-
This paper proposes ToolVQA — a large-scale multimodal tool-augmented VQA dataset containing 23K samples. It is automatically constructed via the ToolEngine pipeline, which combines image-guided DFS with LCS-based example matching, to generate multi-step reasoning data in realistic scenarios. LLaVA-7B fine-tuned on this dataset surpasses GPT-3.5-Turbo on 5 OOD benchmarks.
🚗 Autonomous Driving¶
- 3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation
-
This paper proposes PGA, the first physical adversarial attack framework based on 3DGS, which generates cross-view robust physical adversarial camouflage through fast and accurate target reconstruction, resolution of Gaussian mutual/self-occlusion issues, and a min-max background adversarial optimization strategy. PGA surpasses state-of-the-art methods in both digital and physical domains.
- 3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation
-
This paper proposes PGA, the first physical adversarial attack framework based on 3D Gaussian Splatting (3DGS). By addressing mutual occlusion and self-occlusion among Gaussians to ensure cross-viewpoint consistency, and by designing a min-max optimization strategy to filter non-robust adversarial features, PGA substantially outperforms state-of-the-art methods in both the digital and physical domains.
- 3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views
-
This paper presents 3DRealCar, the first large-scale real-world 3D vehicle dataset comprising 2,500 vehicles from 100+ brands, each with approximately 200 high-resolution 360-degree RGB-D views captured under three lighting conditions (standard, reflective, and low-light), along with 13-category vehicle parsing annotations, supporting tasks including 3D reconstruction, detection, and generation.
- 3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views
-
This paper introduces 3DRealCar, the first large-scale real-world 3D car dataset comprising high-resolution (1920×1440) 360-degree RGB-D scans of 2,500 real vehicles (averaging 200 views per car), covering 100+ brands and three lighting conditions (standard / high-reflectance / low-light). The dataset provides rich annotations including point clouds and parsing maps, and benchmarks multiple 3D reconstruction methods, revealing significant reconstruction challenges under reflective and low-light conditions.
- 4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads
-
This paper proposes 4DSegStreamer, a streaming 4D panoptic segmentation framework built upon a dual-thread system (predictive thread + inference thread). It achieves real-time, high-quality 4D panoptic segmentation through geometric and motion memory maintenance, ego-pose prediction, and inverse forward flow iteration.
- 6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting
-
This paper proposes 6DOPE-GS, a model-free online tracking method that jointly optimizes 6D object pose and 3D reconstruction using 2D Gaussian Splatting (2DGS). Through dynamic keyframe selection and opacity-percentile-based density control, it achieves a 5× speedup while maintaining state-of-the-art accuracy.
- 6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting
-
Leveraging the efficient differentiable rendering capability of 2D Gaussian Splatting, this paper proposes a CAD-model-free online 6D object pose estimation and tracking method. By jointly optimizing a Gaussian object field and keyframe poses, it achieves approximately 5× speedup over BundleSDF while maintaining comparable accuracy.
- A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds
-
This paper proposes an SfM-free constrained optimization framework that jointly optimizes camera parameters and 3DGS scene reconstruction from coarse poses and noisy point clouds produced by multi-camera SLAM systems, via camera pose decomposition, sensitivity-based pre-conditioning, log-barrier constraints, and geometric constraints.
- ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation
-
This paper proposes ACAM-KD, an adaptive student-teacher cooperative attention masking framework for knowledge distillation. By employing Student-Teacher Cross-Attention Feature Fusion (STCA-FF) and Adaptive Spatial-Channel Masking (ASCM) to dynamically adjust distillation focus, ACAM-KD surpasses the state of the art by up to 1.4 mAP on COCO detection and improves mIoU by 3.09 on Cityscapes segmentation.
- ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation
-
This paper proposes ACAM-KD, which introduces two modules — Student-Teacher Cross-Attention Feature Fusion (STCA-FF) and Adaptive Spatial-Channel Masking (ASCM) — to enable dynamically evolving feature selection in knowledge distillation that adapts to the student's learning state. On COCO detection, RetinaNet R50 distilled from R101 achieves 41.2 mAP (+1.4 over prior SOTA); on Cityscapes segmentation, DeepLabV3-MBV2 improves mIoU by 3.09.
- AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving
-
This paper proposes AD-GS, a self-supervised autonomous driving scene rendering framework based on 3D Gaussian Splatting. The core innovation is combining learnable B-spline curves with trigonometric functions for local-global motion modeling, coupled with a simplified binary pseudo-segmentation for robust scene decomposition. Without relying on manual 3D annotations, AD-GS substantially outperforms existing self-supervised methods.
- AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving
-
This paper proposes AD-GS, a self-supervised autonomous driving scene rendering framework that models dynamic object motion by combining locally-aware learnable B-spline curves with globally-aware trigonometric functions. It employs simplified pseudo 2D segmentation for scene decomposition, significantly outperforming existing self-supervised methods and approaching the performance of annotation-dependent approaches without relying on manual 3D annotations.
- AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving
-
AdaDrive presents the first LLM-augmented autonomous driving framework with an adaptive slow-fast architecture. Two adaptive connectors dynamically determine when to activate the LLM (Connector-W) and how much the LLM contributes (Connector-H), achieving SOTA performance on language-grounded driving benchmarks (driving score 80.9%) while reducing inference latency to 189 ms and GPU memory to 6.79 GB.
- Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts
-
This paper proposes DUO (Dual Uncertainty Optimization), the first test-time adaptation framework that jointly minimizes semantic uncertainty and geometric uncertainty, achieving robust monocular 3D object detection via conjugate focal loss and normal field constraints.
- AGO: Adaptive Grounding for Open World 3D Occupancy Prediction
-
This paper proposes the AGO framework, which handles known categories via noise-augmented grounding training and unknown categories via a modality adapter for adaptive alignment. An information entropy-based open-world recognizer dynamically selects the optimal features at inference time. AGO surpasses VEON by 4.09 mIoU on the Occ3D-nuScenes self-supervised benchmark while exhibiting open-world zero-shot/few-shot transfer capability.
- ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions
-
ALOcc is proposed as a framework that achieves state-of-the-art performance on multiple 3D semantic occupancy and occupancy flow prediction benchmarks through three innovations: an occlusion-aware adaptive lifting mechanism, a semantic prototype-based occupancy head, and a BEV cost volume-based flow prediction module, while offering multiple model variants ranging from real-time to high-accuracy configurations.
- ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions
-
This paper proposes the ALOcc framework, which achieves state-of-the-art performance on multiple occupancy prediction benchmarks while maintaining high inference speed through three improvements: an occlusion-aware adaptive lifting mechanism, semantic prototype alignment, and BEV cost volume-based flow prediction.
- Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations
-
LiMA proposes a long-horizon image-to-LiDAR memory aggregation framework that explicitly leverages spatiotemporal cues in LiDAR sequences via three modules—cross-view aggregation, long-term feature propagation, and cross-sequence memory alignment—to enhance LiDAR representation learning, achieving substantial improvements over existing pre-training methods on semantic segmentation and 3D object detection.
- CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting
-
This paper proposes the CCL-LGS framework, which employs a zero-shot tracker for cross-view mask association and a Contrastive Codebook Learning (CCL) module to distill semantically compact intra-class and discriminative inter-class features. The framework addresses cross-view semantic inconsistency in 2D-prior-based 3D semantic field reconstruction caused by occlusion, blur, and viewpoint variation.
- CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving
-
CoDa-4DGS augments the 4D Gaussian Splatting (4DGS) framework with context awareness (self-supervised 4D semantic features from 2D foundation models) and temporal deformation awareness (tracking per-Gaussian deformation between adjacent frames). By jointly encoding semantic and deformation features as dynamic compensation cues for each Gaussian, the method captures finer-grained details in autonomous driving dynamic scenes and surpasses existing self-supervised approaches.
- CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving
-
CoLMDriver is the first end-to-end LLM-based cooperative driving system. Through an Actor-Critic language negotiation module and an intention-guided trajectory generator, it achieves an 11% higher success rate than existing methods across diverse V2V interaction scenarios.
- Controllable 3D Outdoor Scene Generation via Scene Graphs
-
This work proposes the first method to use scene graphs as control signals for large-scale 3D outdoor scene generation. A GNN encodes sparse scene graphs into BEV embedding maps, which are then fed into a cascaded 2D→3D discrete diffusion model to generate semantic 3D scenes. An accompanying interactive system allows users to directly edit scene graphs to control the generation.
- CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
-
This paper proposes CoopTrack, the first fully instance-level end-to-end cooperative 3D multi-object tracking framework. It achieves cross-agent instance matching and fusion via a learnable graph attention association module and multi-dimensional feature extraction, reaching state-of-the-art performance on V2X-Seq.
- Counting Stacked Objects
-
The paper decomposes the stacked object counting problem into two sub-problems — volume estimation and occupancy ratio estimation — solving the former via multi-view 3D reconstruction and the latter via a depth-map-driven neural network that infers interior occupancy from visible surfaces. This is the first method to accurately count largely invisible stacked objects, significantly outperforming humans.
- CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection
-
This paper proposes CVFusion — the first two-stage 4D radar-camera fusion network for 3D object detection. Stage 1 generates high-recall proposals via a Radar-Guided Iterative (RGIter) BEV fusion module, while Stage 2 refines each proposal by aggregating heterogeneous multi-view features through Point-Guided Fusion (PGF) and Grid-Guided Fusion (GGF). CVFusion achieves mAP improvements of 9.10% and 3.68% on VoD and TJ4DRadSet, respectively.
- DAMap: Distance-aware MapNet for High Quality HD Map Construction
-
This paper identifies two inherent deficiencies in current HD map construction methods regarding high-quality prediction — inappropriate classification labels and suboptimal task-specific features — and proposes DAMap (comprising three components: DAFL, HLS, and TMDA) to systematically address task misalignment, achieving consistent gains of 2–3 mAP across multiple baselines on NuScenes and Argoverse2.
- DCHM: Depth-Consistent Human Modeling for Multiview Detection
-
This paper proposes DCHM, a depth-consistent human modeling framework that requires no 3D annotations. It generates pseudo depth labels via superpixel-level Gaussian splatting to fine-tune a monocular depth estimation network, and combines multiview label matching to achieve high-accuracy pedestrian detection under sparse-view and heavily occluded scenarios. DCHM achieves 84.2% MODA on Wildtrack and improves MODP by 31.2% over UMPD.
- Decoupled Diffusion Sparks Adaptive Scene Generation
-
This paper proposes Nexus, an adaptive driving scene generation framework based on decoupled diffusion. By assigning independent noise states to individual tokens, Nexus unifies goal-oriented and reactive generation, reduces displacement error by 40%, and introduces Nexus-Data, a dataset comprising 540 hours of safety-critical driving scenarios.
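The decoupling amounts to sampling an independent diffusion timestep per scene token, so conditioning tokens (e.g., a fixed goal) can be held clean while others are noised. An illustrative sketch of that training-time noising step, not Nexus's exact parameterization:

```python
import torch

def decoupled_noising(tokens, alphas_bar):
    """Per-token noise states for decoupled diffusion training.
    tokens: (B, T, D) scene tokens; alphas_bar: (S,) cumulative alpha-bar schedule."""
    B, T, _ = tokens.shape
    t = torch.randint(0, len(alphas_bar), (B, T))       # independent timestep per token
    a_bar = alphas_bar[t].unsqueeze(-1)                 # (B, T, 1)
    noise = torch.randn_like(tokens)
    noisy = a_bar.sqrt() * tokens + (1 - a_bar).sqrt() * noise
    return noisy, noise, t                              # model is trained to denoise `noisy`
```

Setting `t = 0` for selected tokens keeps them clean, which is how goal-oriented and reactive generation can share one model.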
- Detect Anything 3D in the Wild
-
DetAny3D is a promptable 3D detection foundation model that transfers prior knowledge from two 2D foundation models—SAM and depth-pretrained DINO—via a proposed 2D Aggregator and Zero-Embedding Mapping (ZEM) mechanism, enabling stable 2D-to-3D knowledge transfer. Using only monocular images, it achieves zero-shot 3D object detection across arbitrary scenes and camera configurations, surpassing baselines by up to 21% AP3D on novel categories.
- DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation
-
This paper proposes DiST-4D, the first feed-forward 4D driving scene generation framework. By disentangling temporal prediction (DiST-T) and spatial novel view synthesis (DiST-S) into two separate diffusion processes, with metric depth serving as a geometric bridge, the method achieves state-of-the-art temporal video generation (FVD 22.67) and spatial NVS (FID 10.12) on nuScenes simultaneously, without any per-scene optimization.
- Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion
-
This paper proposes ScoreLiDAR, a diffusion model distillation framework for 3D LiDAR scene completion. By incorporating scene-wise and point-wise structural losses to guide distillation, it reduces completion time from 30.55 seconds to 5.37 seconds (>5× speedup) while surpassing all state-of-the-art methods on SemanticKITTI.
- DONUT: A Decoder-Only Model for Trajectory Prediction
-
DONUT draws inspiration from the decoder-only architecture of LLMs and proposes a unified autoregressive model for processing both historical and future trajectories, coupled with an overprediction strategy to improve anticipation of the distant future. It achieves state-of-the-art performance on the Argoverse 2 benchmark.
- DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving
-
This paper proposes DriveX, a self-supervised world model framework that learns transferable general scene representations in a BEV latent space via Omni Scene Modeling (OSM)—jointly supervising 3D point cloud prediction, 2D semantic representation, and image generation. A Future Spatial Attention (FSA) paradigm is designed to seamlessly integrate predicted future states into downstream tasks such as occupancy prediction, flow estimation, and end-to-end driving, achieving state-of-the-art performance across multiple tasks.
- DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic
-
This paper proposes DuET, a framework that, for the first time, addresses both class-incremental and domain-incremental object detection simultaneously (Dual Incremental Object Detection, DuIOD) via exemplar-free Task Arithmetic model merging. It introduces a Directional Consistency Loss to mitigate sign conflicts, achieving substantial improvements over existing methods on the Pascal Series and Diverse Weather Series benchmarks.
- EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
-
This paper proposes EmbodiedOcc, a framework that leverages 3D semantic Gaussians as a global memory to enable online indoor 3D occupancy prediction from monocular visual input through progressive exploration and local updating.
- EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting
-
This paper proposes the Explicit Motion Decomposition (EMD) module, which models the motion characteristics of each Gaussian primitive via learnable motion embeddings and a dual-scale deformation framework. As a plug-and-play module, EMD integrates seamlessly into both self-supervised and supervised street-view Gaussian splatting methods, achieving state-of-the-art performance under the self-supervised setting on the Waymo and KITTI datasets.
- Epona: Autoregressive Diffusion World Model for Autonomous Driving
-
This paper proposes Epona, an autoregressive diffusion world model that achieves a unified framework for high-resolution long-horizon driving video generation and real-time trajectory planning through decoupled spatiotemporal modeling and asynchronous multimodal generation.
- ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
-
This paper proposes ETA, a dual-system framework that shifts large-model computation from the current frame to preceding time steps and applies batch inference, enabling large-model features to be available at every frame. ETA achieves a driving score of 69.53 on Bench2Drive with a 50 ms latency, improving the state of the art by 8%.
- EVT: Efficient View Transformation for Multi-Modal 3D Object Detection
-
This paper proposes EVT, a framework that achieves efficient LiDAR-guided view transformation via Adaptive Sampling and Adaptive Projection (ASAP), combined with group-wise mixed query selection and geometry-aware cross-attention, attaining state-of-the-art performance of 75.3% NDS on the nuScenes test set at real-time inference speed.
- Extrapolated Urban View Synthesis Benchmark
-
This paper presents the first Extrapolated Urban View Synthesis (EUVS) benchmark, which leverages publicly available multi-traversal/multi-vehicle/multi-camera datasets to systematically evaluate the generalization ability of 3DGS and NeRF methods under extrapolation settings, revealing that current methods severely overfit to training viewpoints.
- Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics
-
This paper proposes a "First Reasoning, Then Forecasting" strategy that infers reward distributions over driving intentions via Query-centric Inverse Reinforcement Learning (QIRL), and couples this with a Bi-Mamba-enhanced DETR-style trajectory decoder to significantly improve prediction confidence and accuracy.
- Free-running vs. Synchronous: Single-Photon Lidar for High-flux 3D Imaging
-
This paper systematically compares free-running and synchronous single-photon lidar (SPL) operating modes for depth imaging under high-flux conditions, proposes an efficient joint maximum likelihood estimator and a score-based depth regularization algorithm SSDR, and demonstrates that the free-running mode consistently outperforms the synchronous mode across diverse flux levels and signal-to-background ratios.
- SDKD: Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting
-
This paper proposes SDKD (Spectral Decoupled Knowledge Distillation), a framework that leverages a frequency-aware teacher model and a frequency-aligned distillation strategy to transfer multi-scale spectral knowledge from complex spatiotemporal forecasting models to lightweight student networks, achieving up to 81.3% MSE reduction on the Navier-Stokes dataset.
- Future-Aware Interaction Network For Motion Forecasting
-
This paper proposes FINet, which incorporates latent future trajectories into the scene encoding stage for joint optimization, while introducing the Mamba architecture as a replacement for Transformers in spatiotemporal modeling, achieving efficient and accurate motion forecasting.
- GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
-
This paper proposes GaussianFlowOcc, which replaces dense voxel grids with sparse 3D Gaussian distributions for occupancy estimation. A Gaussian Transformer is introduced for efficient scene modeling, and a Temporal Module estimates per-Gaussian 3D temporal flow to handle dynamic objects. The method substantially outperforms existing approaches on nuScenes under weak supervision (51%+ mIoU improvement) while achieving 50× faster inference.
- GaussRender: Learning 3D Occupancy with Gaussian Rendering
-
This paper proposes GaussRender, a plug-and-play differentiable Gaussian rendering module that projects predicted and ground-truth 3D occupancy onto 2D views and enforces semantic and depth consistency constraints, thereby eliminating visual artifacts such as floating voxels. The approach achieves significant improvements in geometric fidelity across multiple benchmarks, with particularly pronounced gains on surface-sensitive metrics such as RayIoU.
- Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model
-
This paper proposes GALTraj, the first method to apply generative active learning to trajectory prediction. During training, it dynamically identifies tail samples on which the model fails, and employs a controllable diffusion model to synthesize new samples that preserve tail-behavior characteristics while complying with traffic rules. This effectively alleviates long-tail data imbalance, improving both tail-case performance and overall prediction accuracy.
- GM-MoE: Low-Light Enhancement with Gated-Mechanism Mixture-of-Experts
-
This paper is the first to introduce Mixture-of-Experts (MoE) networks into low-light image enhancement (LLIE), employing three specialized sub-expert networks to handle color restoration, detail enhancement, and high-level feature enhancement respectively. A dynamic gating mechanism adaptively adjusts the contribution of each expert, achieving state-of-the-art PSNR performance on five benchmark datasets.
- GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting
-
This paper proposes GS-LIVM, the first real-time photo-realistic LiDAR-inertial-visual mapping framework designed for large-scale unbounded outdoor scenes. It addresses the problem of sparse and non-uniform LiDAR point clouds via voxel-level Gaussian Process Regression (Voxel-GPR), and leverages a covariance-centric design to rapidly initialize 3D Gaussian parameters. The method achieves state-of-the-art mapping efficiency and rendering quality across multiple outdoor datasets.
- GS-Occ3D: Scaling Vision-only Occupancy Reconstruction with Gaussian Splatting
-
This paper proposes GS-Occ3D, a scalable vision-only occupancy reconstruction framework that achieves full-dataset auto-labeling on Waymo through Octree-based Gaussian Surfel representation and a three-layer decomposed modeling of ground, static background, and dynamic objects. The resulting labels enable downstream occupancy prediction models to achieve zero-shot generalization comparable to or better than LiDAR-based annotations.
- Hermes: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
-
This paper proposes Hermes, the first unified driving world model that simultaneously performs 3D scene understanding (VQA/captioning) and future scene generation (point cloud prediction). By leveraging BEV representations and world queries to inject LLM world knowledge into future scene generation, Hermes reduces 3s point cloud generation error by 32.4% and improves scene understanding CIDEr by 8.0%.
- IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation
-
This paper proposes IGL-Nav, a system that builds a renderable scene memory via incremental 3D Gaussian representations and efficiently solves the image-goal navigation problem through a coarse-to-fine localization strategy, while supporting a free-view setting with arbitrary camera viewpoints.
- INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception
-
This paper proposes INSTINCT, a LiDAR-based instance-level collaborative perception framework that achieves state-of-the-art performance across multiple datasets through three core modules — quality-aware filtering, dual-branch detection routing, and cross-agent local instance fusion — while reducing communication bandwidth to approximately 1/264–1/281 of that required by existing methods.
- LangTraj: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation
-
LangTraj is proposed as the first diffusion-based trajectory simulator that incorporates natural language as a training-time condition. It is accompanied by the InterDrive dataset, containing 150K human-annotated interaction behaviors, enabling language-controllable multi-agent interaction simulation and safety-critical scenario generation.
- Language Driven Occupancy Prediction (LOcc)
-
This paper proposes LOcc, an effective and generalizable open-vocabulary occupancy (OVO) prediction framework. Its core contribution is a semantic transitive labeling pipeline (LVLM + OV-Seg → LiDAR → voxel) that generates dense, fine-grained 3D language occupancy pseudo-GT, replacing the noisy and sparse intermediate feature distillation used in prior work. LOcc comprehensively surpasses state-of-the-art methods on Occ3D-nuScenes.
- Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering
-
This paper proposes UGSDF, a method that jointly learns an SDF network and 3D Gaussian Splatting to model dynamic objects in urban scenes. Using only 2D priors (a depth network and a point tracker), UGSDF achieves state-of-the-art rendering quality without requiring LiDAR data, 3D motion annotations, or human body templates.
- SkyDiffusion: Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis
-
This paper proposes SkyDiffusion, which combines a Curved-BEV transformation with a BEV-guided diffusion model to achieve high-quality cross-view synthesis from ground-level street view images to aerial/satellite imagery, and introduces the Ground2Aerial-3 multi-scene dataset.
- LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal
-
This paper proposes LightsOut, a diffusion-based image outpainting framework that enhances existing single-image flare removal (SIFR) methods by predicting and reconstructing out-of-frame light sources. It serves as a plug-and-play preprocessing module that improves arbitrary SIFR models without requiring additional training.
- Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation
-
This paper proposes InfGen, a unified autoregressive next-token prediction model that interleaves closed-loop motion simulation with scenario generation (dynamic agent insertion and removal), achieving for the first time stable long-term (30-second) traffic simulation. InfGen reaches state-of-the-art performance on short-term benchmarks and significantly outperforms all existing methods on long-term tasks.
- LookOut: Real-World Humanoid Egocentric Navigation
-
LookOut proposes predicting future 6D head pose sequences (translation + rotation) over a 4.5-second horizon from first-person video with known poses. It backprojects DINOv2 features into 3D space and compresses them into a BEV representation to capture scene geometry and semantics. Trained on a self-collected 4-hour real-world dynamic scene dataset, the model learns human-like navigation behaviors including waiting, detouring, and looking both ways before crossing the street.
- MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception
-
This paper proposes the MAESTRO framework, which generates task-specific features and suppresses inter-task interference in multi-task 3D perception through three modules: Class-wise Prototype Generator (CPG), Task-Specific Feature Generator (TSFG), and Scene Prototype Aggregator (SPA). MAESTRO simultaneously surpasses single-task models on 3D object detection, BEV map segmentation, and 3D occupancy prediction.
- MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding
-
This paper proposes MCAM, which constructs a causal structure between visual and language modalities via a Driving State Directed Acyclic Graph (DSDAG), combined with a multi-level feature extractor and a causal analysis module, for behavior description and causal reasoning in ego-vehicle-level driving video understanding.
- MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion
-
This paper proposes MGSfM, a global Structure-from-Motion (SfM) framework for multi-camera systems. By exploiting multi-camera rigid constraints through two core modules — Decoupled Multi-camera Rotation Averaging (DMRA) and Multi-camera Geometry driven Position estimation (MGP) — MGSfM achieves accuracy comparable to or better than incremental SfM on large-scale scenes while being approximately 10× faster.
- Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration
-
Mixed Signals is the first real-world V2X dataset featuring heterogeneous LiDAR configurations (varying mounting heights and tilt angles), collected by three autonomous vehicles and one roadside unit. It provides 45,100 point cloud frames and 240,600 annotated bounding boxes, and is also the first V2X dataset collected in a left-hand traffic country (Australia).
- MonoSOWA: Scalable Monocular 3D Object Detector Without Human Annotations
-
This paper proposes the first monocular 3D object detection method that requires no human annotations of any kind (neither 2D nor 3D). A novel Local Object Motion Model (LOMM) is introduced to disentangle inter-frame motion sources, enabling auto-labeling at a speed ~700× faster than prior work. A Canonical Object Space (COS) is further proposed to enable multi-dataset training across heterogeneous camera configurations.
- Occupancy Learning with Spatiotemporal Memory
-
This paper proposes ST-Occ, a scene-level spatiotemporal occupancy representation learning framework. Through a Unified Temporal Modeling paradigm, it employs a spatiotemporal memory bank defined in scene coordinates along with an uncertainty- and dynamics-aware memory attention mechanism. ST-Occ outperforms the prior state of the art by 3 mIoU on the Occ3D benchmark while reducing temporal inconsistency by 29%.
- OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving
-
This paper proposes OD-RASE, a framework that constructs a road traffic expert knowledge ontology to filter infrastructure improvement proposals generated by LVLMs, enabling proactive identification of accident-prone road structures and generation of improvement recommendations.
- Passing the Driving Knowledge Test
-
This paper introduces DriveQA — the first large-scale text-and-visual dual-modality driving knowledge test benchmark (26K text QA + 448K image QA) — to systematically evaluate LLMs/MLLMs on driving knowledge including traffic regulations, sign recognition, and right-of-way judgment. The benchmark reveals significant deficiencies in numerical reasoning and complex right-of-way scenarios, and demonstrates that DriveQA pretraining yields generalization gains on downstream driving tasks.
- PBCAT: Patch-Based Composite Adversarial Training against Physically Realizable Attacks on Object Detection
-
This paper proposes PBCAT (Patch-Based Composite Adversarial Training), which combines small-area gradient-guided adversarial patches with global imperceptible perturbations for adversarial training, providing unified defense against multiple physically realizable attacks (adversarial patches and adversarial textures). PBCAT achieves a 29.7% AP improvement over the previous SOTA defense on pedestrian detection tasks.
- ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
-
Building upon ReconDreamer, ReconDreamer++ introduces a Novel Trajectory Deformation Network (NTDNet) to bridge the domain gap between generated data and real observations, and independently models the ground plane to preserve geometric priors. On Waymo, the method achieves performance on par with Street Gaussians on original trajectories, while improving NTA-IoU by 6.1% and FID by 23.0% on novel trajectories.
- Referring Expression Comprehension for Small Objects
-
This work proposes the SOREC dataset (100K referring expression–bounding box pairs for small objects) and the PIZA adapter module (Progressive-Iterative Zooming Adapter), enabling pretrained models such as GroundingDINO to autoregressively zoom in on extremely small targets, achieving substantial accuracy gains for small-object REC in autonomous driving scenarios.
- RESCUE: Crowd Evacuation Simulation via Controlling SDM-United Characters
-
This paper proposes RESCUE, the first online SDM (Sensing–Decision–Motion) unified 3D evacuation simulation framework, integrating a 3D adaptive social force model and a personalized gait controller to achieve real-time personalized evacuation simulation for hundreds of agents.
- Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations
-
This paper proposes the Resonance (Re) model, which decomposes pedestrian trajectory prediction into a superposition of multiple "vibrations"—a linear base, a self-bias, and a resonance-bias. By leveraging spectral similarity between trajectories to simulate "resonance" phenomena in social interactions, the method is validated on ETH-UCY, SDD, NBA, and nuScenes benchmarks.
- Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations
-
This paper proposes Resonance, a physics-inspired model that decomposes pedestrian trajectories into multiple independent "vibration" components, each representing an agent's response to a single cause. The final trajectory is predicted via superposition of these components, while social interactions are learned by simulating resonance phenomena, enhancing interpretability.
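The superposition structure can be made concrete with its simplest component, the linear base, sketched below; the self-bias and resonance-bias terms would be learned networks added on top (names here are descriptive, not the paper's API):

```python
import numpy as np

def linear_base(history, horizon):
    """Constant-velocity extrapolation as the linear 'vibration' component.
    history: (T, 2) past positions; returns (horizon, 2) future base trajectory."""
    v = history[-1] - history[-2]                 # last-step velocity
    steps = np.arange(1, horizon + 1)[:, None]
    return history[-1] + steps * v

# Full prediction would be a superposition of components:
#   prediction = linear_base(history, H) + self_bias(history) + resonance_bias(neighbors)
# where self_bias and resonance_bias are trained; only the structure is shown here.
```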
- RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case
-
This paper proposes RoboTron-Sim, a framework that constructs a hard-case simulation dataset (HASS), introduces Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego encoder (I2E), enabling MLLMs to effectively leverage simulated hard cases to improve real-world driving performance. On nuScenes hard scenarios, it achieves ~48% reduction in L2 distance and ~46% reduction in collision rate, establishing state-of-the-art open-loop planning performance.
- Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs
-
This paper proposes the Probabilistic Point Cloud (PPC) representation, which attaches measurement confidence derived from raw single-photon LiDAR timing histograms as a probability attribute to each 3D point. Combined with a lightweight NPD filter and FPPS sampling strategy, PPC enables robust 3D object detection under low signal-to-background ratio (SBR) conditions, substantially outperforming point cloud denoising baselines on SUN RGB-D and KITTI with negligible computational overhead.
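The key data structure is a 3D point augmented with a confidence derived from its raw timing histogram. A toy sketch of one plausible confidence, the peak-window fraction of counts; the paper derives its probability from the photon timing statistics more carefully:

```python
import numpy as np

def point_confidence(histogram, half_window=2):
    """Signal-vs-background proxy: fraction of histogram counts falling in a
    small window around the peak bin. Illustrative stand-in for PPC's confidence."""
    h = np.asarray(histogram, dtype=float)
    peak = h.argmax()
    lo, hi = max(0, peak - half_window), min(len(h), peak + half_window + 1)
    return h[lo:hi].sum() / max(h.sum(), 1e-8)

# A strong return on top of flat background noise:
hist = np.full(128, 2.0); hist[40:43] += 50
print(point_confidence(hist))  # ~0.39, well above the ~0.04 a flat histogram gives
```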
- RTMap: Real-Time Recursive Mapping with Change Detection and Localization
-
RTMap is proposed as the first end-to-end framework that simultaneously addresses three core challenges in multi-traversal online HD map construction: prior-map-based localization, road structure change detection, and probabilistic crowdsourced map fusion. It achieves improvements in both map quality and localization accuracy on TbV and nuScenes.
- SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World
-
SA-Occ is proposed as the first method to leverage satellite imagery to assist onboard cameras in 3D occupancy prediction. Three modules—Dynamic-Decoupling Fusion, 3D Projection Guidance, and Uniform Sampling Alignment—address cross-view perception challenges, achieving 39.05% mIoU (+6.97%) on Occ3D-nuScenes with only 6.93 ms additional latency.
- Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control
-
This paper proposes SQIL (Saliency-Aware Quantized Imitation Learning), which identifies task-critical states via saliency scoring and applies weighted distillation during quantization-aware training. SQIL recovers full-precision performance for 4-bit quantized VLA policy models in robotic manipulation and autonomous driving, while achieving 2.5–3.7× inference speedup.
- SAM4D: Segment Anything in Camera and LiDAR Streams
-
This paper presents SAM4D, the first promptable multimodal segmentation foundation model for camera and LiDAR streams. It introduces Unified Multimodal Positional Encoding (UMPE) to enable cross-modal prompting and interaction, Motion-aware Cross-Modal Attention (MCMA) for temporal consistency, and constructs the Waymo-4DSeg dataset containing 300K+ masklets, demonstrating strong capabilities in cross-modal segmentation and data annotation.
- Self-Supervised Sparse Sensor Fusion for Long Range Perception
-
LRS4Fusion proposes a long-range LiDAR-camera fusion framework based on sparse voxel representations, combined with a self-supervised pretraining strategy via sparse occupancy and velocity field reconstruction, achieving state-of-the-art performance within a 250-meter perception range: a 26.6% improvement in object detection mAP and a 30.5% reduction in LiDAR prediction Chamfer Distance.
- Semantic Causality-Aware Vision-Based 3D Occupancy Prediction
-
This paper analyzes semantic ambiguity in 2D-to-3D transformation for vision-based 3D occupancy prediction from a causal perspective, proposes a Causal Loss for end-to-end semantic consistency supervision, and designs the SCAT module (channel-grouped lifting, learnable camera offsets, normalized convolution) to significantly improve occupancy prediction accuracy and robustness to camera perturbations.
- SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions
-
Inspired by the human sketching process, this work models lane topology as a chain of sequential graph expansions, incrementally constructing directed lane graphs via an autoregressive transformer, thereby overcoming the inability of DAG-based methods to represent cycles and bidirectional lanes.
- SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection
-
This paper proposes SparseLaneSTP, which integrates lane geometry priors (parallelism, continuity) and temporal information into a sparse Transformer architecture. Through Catmull-Rom spline representation, spatio-temporal attention mechanisms, and temporal regularization, the method achieves state-of-the-art performance on multiple 3D lane detection benchmarks.
- Splat-LOAM: Gaussian Splatting LiDAR Odometry and Mapping
-
Splat-LOAM is the first LiDAR odometry and mapping pipeline built entirely on 2D Gaussian primitives, simultaneously achieving high-accuracy pose estimation and lightweight scene reconstruction via a spherical-projection-driven differentiable rasterizer.
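The spherical projection underlying such a rasterizer is the standard LiDAR range-image mapping, sketched below with typical 64-beam parameters (not the paper's):

```python
import numpy as np

def spherical_projection(points, h=64, w=1024, fov_up=3.0, fov_down=-25.0):
    """Map LiDAR points (N, 3) to (row, col) pixels of a range image.
    fov_up/fov_down are vertical field-of-view bounds in degrees."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)                               # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                             # elevation
    up, down = np.radians(fov_up), np.radians(fov_down)
    col = ((1.0 - yaw / np.pi) / 2.0 * w).astype(int) % w
    row = ((up - pitch) / (up - down) * h).clip(0, h - 1).astype(int)
    return row, col, r                                   # r serves as the pixel's range
```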
- SRefiner: Soft-Braid Attention for Multi-Agent Trajectory Refinement
-
This paper proposes Soft-Braid Attention, which explicitly models spatiotemporal topological relationships between trajectories and between trajectories and lanes via "soft crossing points" to guide multi-agent trajectory refinement. The method achieves significant improvements over four baseline methods on both Argoverse v2 and INTERACTION datasets, establishing a new state of the art for trajectory refinement.
- TARS: Traffic-Aware Radar Scene Flow Estimation
-
This paper proposes TARS, a traffic-aware radar scene flow estimation method that constructs a Traffic Vector Field (TVF) via joint object detection, capturing rigid-body motion at the traffic level rather than the instance level. TARS surpasses the state of the art by 15% and 23% on the VOD and a proprietary dataset, respectively.
- Towards Open-World Generation of Stereo Images and Unsupervised Matching
-
This paper proposes GenStereo, a diffusion-based stereo image generation framework that achieves both high visual quality and geometric accuracy through disparity-aware coordinate embedding, cross-view attention, and an adaptive fusion mechanism, while advancing unsupervised stereo matching to a new state of the art.
- TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking
-
TrackAny3D is the first work to transfer large-scale pretrained 3D models to category-agnostic 3D single object tracking. By introducing a dual-path adapter, a Mixture of Geometry Experts (MoGE) module, and a temporal context optimization strategy, it achieves state-of-the-art performance on cross-category unified tracking within a single model.
- TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes
-
This paper proposes TrafficLoc, a coarse-to-fine image-to-point cloud registration method that achieves high-accuracy localization of traffic surveillance cameras in 3D reference maps via Geometry-guided Attention Loss (GAL), Inter- and Intra-modal Contrastive Learning (ICL), and Dense Training Alignment (DTA). On the self-constructed Carla Intersection dataset, it outperforms the previous state of the art by up to 86%.
- UAVScenes: A Multi-Modal Dataset for UAVs
-
UAVScenes is the first large-scale multi-modal UAV dataset that simultaneously provides per-frame semantic annotations for both images and LiDAR point clouds along with accurate 6-DoF poses. It contains over 120,000 annotated frames and supports six perception tasks including semantic segmentation, depth estimation, localization, scene recognition, and novel view synthesis.
- UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
-
This paper proposes UniOcc, the first unified benchmark for 2D/3D occupancy prediction and forecasting, integrating four data sources — nuScenes, Waymo, CARLA, and OpenCOOD — while introducing per-voxel flow annotations and ground-truth-free evaluation metrics. Large-scale experiments reveal the significant value of voxel-level flow information and cross-domain training for occupancy tasks.
- Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Perception
-
This paper proposes the first 3D object detection framework relying solely on stereo event cameras. Through a semantic-geometric dual filtering module and object-centric ROI alignment, it enables continuous-time 3D detection during blind time periods, significantly outperforming methods that depend on synchronized sensors (Ev-3DOD) in dynamic large-motion scenarios. Its pedestrian AP3D even surpasses methods that use LiDAR+RGB+Event.
- Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving
-
This paper proposes SceneCrafter, a unified 3DGS-based simulation framework that simultaneously supports synthetic data generation and closed-loop evaluation via an adaptive kinematic model and bidirectional interactive agent control. Experiments demonstrate that synthetic data significantly improves the generalization of end-to-end autonomous driving models, yielding up to 18% improvement in Route Completion.
- Wavelet Policy: Lifting Scheme for Policy Learning in Long-Horizon Tasks
-
Wavelet Policy is the first work to introduce wavelet analysis into embodied intelligence policy learning. It proposes a multi-scale policy network based on a learnable lifting scheme, decomposing observation sequences into different frequency components and synthesizing action sequences layer by layer. The method achieves superior or comparable performance to baselines across five long-horizon tasks, including autonomous driving (CARLA), robotic manipulation, and multi-robot collaboration.
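A lifting scheme factors a wavelet transform into split, predict, and update steps; Wavelet Policy makes the predict/update operators learnable. The fixed Haar pair below is the simplest instance and shows the perfect-reconstruction property the scheme inherits:

```python
import numpy as np

def haar_lifting(x):
    """One Haar lifting step: split even/odd, predict odd from even, update even."""
    even, odd = x[..., ::2], x[..., 1::2]
    detail = odd - even              # predict: residual = high-frequency detail
    approx = even + detail / 2       # update: low-frequency approximation
    return approx, detail

def haar_inverse(approx, detail):
    even = approx - detail / 2
    odd = detail + even
    x = np.empty(even.shape[:-1] + (even.shape[-1] * 2,))
    x[..., ::2], x[..., 1::2] = even, odd
    return x

sig = np.sin(np.linspace(0, 4 * np.pi, 64))
a, d = haar_lifting(sig)
assert np.allclose(haar_inverse(a, d), sig)   # perfect reconstruction
```

Stacking such steps decomposes an observation sequence into coarse-to-fine frequency bands, which is the multi-scale structure the policy network exploits.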
- Where am I? Cross-View Geo-localization with Natural Language Descriptions
-
This paper introduces a novel task of cross-view geo-localization via natural language descriptions, constructs the CVG-Text multimodal dataset covering 30,000+ coordinates across 3 cities (street-view + satellite + OSM + text), and proposes CrossText2Loc — a method employing Extended Positional Embedding for long-text handling and an Explainable Retrieval Module for localization rationale, achieving over 10% improvement in Top-1 Recall.
- Where, What, Why: Towards Explainable Driver Attention Prediction
-
This paper proposes a new paradigm of "explainable driver attention prediction," introducing the first large-scale W³DA dataset and the LLada framework, which unifies spatial attention prediction (Where), semantic parsing (What), and cognitive reasoning (Why) within a single end-to-end large language model-driven architecture.
- World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model
-
World4Drive constructs an intention-aware latent world model that leverages spatial-semantic priors from visual foundation models to achieve annotation-free end-to-end planning, reducing L2 error by 18.1% and collision rate by 46.7%.
✂️ Segmentation¶
- 2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos
-
This paper proposes an automated pipeline to extract precise bimanual affordance annotations from human activity videos, yielding the 2HANDS dataset, and trains a VLM-based 2HandedAfforder model that predicts precise object region segmentation masks for left and right hand grasps conditioned on text prompts. The approach significantly outperforms existing methods on the newly introduced ActAffordance benchmark.
- A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions
-
This paper proposes a plug-and-play physical motion restoration approach that repairs artifact frames in video-based motion capture via a Mask-conditioned Motion Correction Module (MCM), and achieves physics-based simulation of high-difficulty in-the-wild motions via a Physics-based Motion Transfer Module (PTM) built on pretraining and test-time adaptation, substantially improving the physical plausibility of recovered motions.
- A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions
-
A plug-and-play physical motion restoration framework is proposed that repairs defective frames in video-based motion capture via a Mask-conditioned Motion Correction Module (MCM), and subsequently transfers the corrected motion into a physically plausible simulation through a Physics-based Motion Transfer Module (PTM) with RL-based test-time adaptation. This work is the first to achieve physics-based simulation restoration for in-the-wild high-difficulty motions such as gymnastics and martial arts back-flips.
- Advancing Visual Large Language Model for Multi-granular Versatile Perception
-
This paper proposes MVP-LM, a multi-granular versatile perception framework built upon a visual large language model. Through a novel multi-granular decoder and a CoT-inspired data unification strategy, MVP-LM is the first single model to simultaneously support all four perception combinations—box and mask predictions under both word-level and sentence-level instructions—achieving competitive performance on panoptic segmentation, object detection, visual grounding, and referring expression segmentation.
- AnimalClue: Recognizing Animals by their Traces
-
This paper introduces AnimalClue, the first large-scale dataset for animal trace recognition, containing 159,605 bounding boxes spanning 968 species across five categories of indirect clues (footprints, feces, eggs, bones, and feathers), and establishes four benchmarks covering classification, detection, instance segmentation, and attribute prediction.
- Auto-Vocabulary Semantic Segmentation
-
This paper introduces Auto-Vocabulary Semantic Segmentation (AVS), a new task in which the AutoSeg framework autonomously discovers target categories from images and performs segmentation without any human-specified vocabulary. AutoSeg achieves 87.1 mIoU on PASCAL VOC, far surpassing the only comparable method ZeroSeg (20.1), and even outperforming several open-vocabulary methods that require explicit category specification.
- Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection
-
This paper proposes RISE — a retrieval self-augmented unsupervised camouflaged object detection paradigm that constructs foreground/background prototype libraries from the training set itself and leverages KNN retrieval to generate pseudo-labels, substantially outperforming existing unsupervised and prompt-based methods without any annotations.
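The retrieval step lends itself to a compact illustration. Below is a minimal sketch, assuming L2-normalized pixel features and prototype banks already built from the unlabeled training set; the function and shapes are hypothetical, not the authors' implementation.

```python
# Minimal sketch of retrieval-based pseudo-labeling: each pixel is labeled
# by a majority vote over its k nearest prototypes. Illustrative only.
import numpy as np

def knn_pseudo_labels(pixel_feats, fg_bank, bg_bank, k=5):
    """pixel_feats: (N, D); fg_bank: (Mf, D); bg_bank: (Mb, D).
    Returns (N,) pseudo-labels, 1 = foreground, 0 = background."""
    bank = np.concatenate([fg_bank, bg_bank], axis=0)
    is_fg = np.concatenate([np.ones(len(fg_bank)), np.zeros(len(bg_bank))])
    sims = pixel_feats @ bank.T                      # cosine similarity
    nn_idx = np.argsort(-sims, axis=1)[:, :k]        # k nearest prototypes
    return (is_fg[nn_idx].mean(axis=1) > 0.5).astype(np.int64)
```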
- Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?
-
This paper proposes SatDiFuser, a framework that repurposes a generative geospatial diffusion model (DiffusionSat) as a discriminative remote sensing foundation model. Through systematic analysis of multi-stage, multi-timestep diffusion features and three designed fusion strategies (Global Weighted, Localized Weighted, and MoE Joint Fusion), SatDiFuser outperforms existing state-of-the-art geospatial foundation models (GFMs) on semantic segmentation and classification tasks, achieving gains of up to +5.7% mIoU and +7.9% F1.
- CAVIS: Context-Aware Video Instance Segmentation
-
This paper proposes CAVIS, which introduces a Context-Aware Instance Tracker (CAIT) to incorporate contextual information around object boundaries for enhanced instance association, and designs a Prototypical Cross-frame Contrastive loss (PCC) to enforce cross-frame feature consistency, achieving state-of-the-art performance on both VIS and VPS benchmarks.
- CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation
-
This paper proposes Closed Loop Optimal Transport (CLOT), a framework that jointly solves three OT problems through a three-level cyclic feature learning pipeline (frame embeddings → segment embeddings → cross-attention refined frame embeddings), establishing an explicit feedback loop between frame-level and segment-level representations to substantially improve boundary detection and clustering quality in unsupervised action segmentation.
- ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction
-
This paper proposes ConformalSAM, a framework that leverages Conformal Prediction to calibrate the output uncertainty of the foundation segmentation model SEEM on target domains. Unreliable pixel labels are filtered out before serving as supervision signals for unlabeled data. Combined with a late-stage self-reliance training strategy, the framework achieves 81.21 mIoU on PASCAL VOC under the 1/16 labeled setting.
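The filtering mechanism is standard split conformal prediction and can be sketched compactly. Assumptions here: a small labeled calibration set, softmax outputs from the frozen segmenter (SEEM in the paper), and `-1` as the ignore label; all names are illustrative.

```python
# Minimal sketch of split conformal filtering of pixel pseudo-labels.
import numpy as np

def calibrate(cal_probs, cal_labels, alpha=0.1):
    """Conformal quantile from (N, C) probs and (N,) true labels."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def filter_pseudo_labels(probs, q_hat):
    """Keep a pixel's argmax label only if its conformal prediction set
    {c : 1 - p_c <= q_hat} is a singleton; mark the rest as ignore (-1)."""
    pred_sets = (1.0 - probs) <= q_hat
    labels = probs.argmax(axis=1)
    return np.where(pred_sets.sum(axis=1) == 1, labels, -1)
```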
- CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
-
This paper identifies inter-class patch correlations in CLIP as the fundamental bottleneck for segmentation performance, and proposes CorrCLIP, which addresses this via SAM-constrained patch interaction scope (scope reconstruction), DINO-based similarity value reconstruction (value reconstruction), spatial/semantic feature refinement, and SAM mask post-processing. The method achieves an average mIoU improvement from 48.6% to 53.6% across 8 benchmarks under the training-free setting.
- Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild
-
CAV-SAM represents the correspondence between reference–target image pairs as a pseudo-video sequence, bridging semantic gaps via a Diffusion-Based Semantic Transition (DBST) module and aligning geometric variations via a Test-Time Geometric Alignment (TTGA) module. This enables SAM2's video segmentation capability to be adapted to reference segmentation in a training-free manner, surpassing the state of the art by approximately 5% mIoU on cross-domain few-shot segmentation benchmarks.
- Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild
-
This paper represents the correspondences between reference-target image pairs as a pseudo-video sequence generated by a diffusion model, leverages SAM2's interactive video object segmentation (iVOS) capability for segmentation, and combines lightweight test-time fine-tuning to handle geometric variation. The proposed method outperforms state-of-the-art approaches by approximately 5% mIoU on cross-domain few-shot segmentation without requiring any meta-training.
- DDB: Diffusion Driven Balancing to Address Spurious Correlations
-
This paper proposes Diffusion Driven Balancing (DDB), which leverages the textual inversion and inpainting capabilities of Stable Diffusion to automatically generate minority-group samples for balancing spurious correlations in datasets. Combined with a bicephalous pruning strategy based on ERM model prediction probabilities and integrated gradients, DDB achieves state-of-the-art worst-group accuracy on Waterbirds and MetaShift.
- DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy
-
This paper proposes DeRIS, a framework that decouples referring image segmentation into two branches — perception and cognition — and introduces a Loopback Synergy mechanism to iteratively enhance cross-branch interaction. A non-referent sample conversion augmentation strategy is also introduced. DeRIS achieves state-of-the-art performance on RefCOCO/+/g and gRefCOCO benchmarks.
- Dynamic Dictionary Learning for Remote Sensing Image Segmentation
-
This paper proposes D2LS, a dynamic dictionary learning framework that iteratively updates category-aware semantic embeddings (the dictionary) via multi-stage alternating cross-attention, and incorporates contrastive constraints to enhance inter-class separability. D2LS surpasses the state of the art on both coarse-grained and fine-grained remote sensing image segmentation benchmarks.
- E-SAM: Training-Free Segment Every Entity Model
-
E-SAM is a training-free framework that systematically addresses over-segmentation and under-segmentation in SAM's Automatic Mask Generation (AMG) via three cascaded modules—Multi-level Mask Generation (MMG), Entity-level Mask Refinement (EMR), and Under-Segmentation Refinement (USR)—surpassing existing entity segmentation methods by +30.1 points on benchmark metrics.
- Enhancing Transformers Through Conditioned Embedded Tokens
-
This paper identifies an inherent ill-conditioning problem in the self-attention matrices of Transformers. Through theoretical analysis, it establishes a direct relationship between the condition number of the self-attention matrix and that of the embedded token matrix, and proposes Conditioned Embedded Tokens — an SVD-based correction term applied to the embedding matrix — achieving consistent performance improvements across image classification, object detection, instance segmentation, and NLP tasks.
- Ensemble Foreground Management for Unsupervised Object Discovery
-
This paper proposes UnionCut — a foreground union detection method based on minimum cut and ensemble learning — which provides mathematically guaranteed foreground priors for unsupervised object discovery (UOD). It enables UOD algorithms to reliably determine whether discovered regions are foreground and when to stop exploration. A distilled variant, UnionSeg, is also proposed to substantially improve both efficiency and accuracy.
- Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation
-
This paper proposes DPMFormer, a framework that transforms domain-specific properties of input images into textual context prompts via domain-aware prompt learning, combined with domain-robust consistency learning, to address the semantic misalignment between visual and textual contexts in language-driven domain generalization for semantic segmentation.
- Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation
-
This paper proposes PDAF (Probabilistic Diffusion Alignment Framework), which explicitly estimates a Latent Domain Prior (LDP) via probabilistic diffusion modeling to provide domain-shift compensation for existing segmentation networks, achieving state-of-the-art cross-domain generalization without requiring paired target-domain samples.
- FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation
-
This paper challenges the default practice of averaging 80 templates in open-vocabulary semantic segmentation (OVSS), revealing that each class has specific "class-expert" templates that significantly outperform the averaged classifier. It proposes FLOSS, a method that uses prediction entropy to unsupervisedly select expert templates and fuse their predictions, consistently improving existing OVSS methods without any labels or training.
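The label-free selection rule can be sketched as follows. The statistic used here, the mean entropy of the pixels a template assigns to each class, is a simplification of the paper's criterion, and all names are illustrative.

```python
# Minimal sketch of unsupervised class-expert template selection.
import numpy as np

def pick_expert_templates(logits):
    """logits: (T, N, C) class logits from T prompt templates over N
    unlabeled pixels. Returns, per class, the index of the template that
    is most confident on that class, using no labels."""
    T, N, C = logits.shape
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-9)).sum(axis=-1)   # (T, N)
    argmax = probs.argmax(axis=-1)                           # (T, N)
    experts = np.zeros(C, dtype=int)
    for c in range(C):
        scores = [entropy[t][argmax[t] == c].mean()
                  if (argmax[t] == c).any() else np.inf for t in range(T)]
        experts[c] = int(np.argmin(scores))
    return experts
```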
- Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling
-
This paper proposes a remote sensing model pre-training pipeline comprising OpticalRS-13M, a dataset of 13 million optical remote sensing images, and SelectiveMAE, an efficient MIM method that selectively encodes and reconstructs patches based on semantic richness. Using only 40% of image patches, SelectiveMAE achieves performance comparable to full-patch training while delivering more than 2× speedup.
- Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation
-
This paper introduces Continual Video Instance Segmentation (CVIS) as a new problem formulation, and proposes the Hierarchical Visual Prompt Learning (HVPL) model, which mitigates catastrophic forgetting of old categories via forgetting compensation mechanisms at both frame-level and video-level.
- HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
-
This paper proposes HiMTok (Hierarchical Mask Tokenizer), which represents segmentation masks as up to 32 coarse-to-fine discrete tokens, enabling LMMs to directly generate segmentation results in the same manner as text generation — without any additional image-conditioned mask decoder — achieving state-of-the-art performance across multiple segmentation benchmarks.
- How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?
-
This paper proposes the SSP (Stepping Stone Plus) framework, which employs optical flow as auxiliary mask prompts in conjunction with two types of textual prompts and a Visual-Textual Alignment (VTA) module, achieving state-of-the-art performance on the audio-visual semantic segmentation task.
- Hybrid-TTA: Continual Test-time Adaptation via Dynamic Domain Shift Detection
-
Hybrid-TTA proposes a continual test-time adaptation (CTTA) framework that employs a Dynamic Domain Shift Detection (DDSD) module to determine whether the current input originates from a new domain, adaptively switching between Full Tuning (FT) and Adapter Tuning (AT). It additionally introduces Masked Image Modeling Adaptation (MIMA) as an auxiliary task to enhance model stability, achieving 62.2% mIoU on the Cityscapes-to-ACDC benchmark while running approximately 20× faster than comparable methods.
- Implicit Counterfactual Learning for Audio-Visual Segmentation
-
This paper proposes the Implicit Counterfactual Framework (ICF), which employs multi-granularity implicit text as a modality bridge to reduce the audio-visual representation gap, and leverages semantic counterfactuals to generate orthogonal counterfactual samples that mitigate modality preference. Combined with Collaborative Distribution-Aware Contrastive Learning (CDCL), ICF achieves unbiased cross-modal understanding and state-of-the-art performance on three AVS benchmarks.
- Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation
-
This paper proposes Inter2Former, which employs Dynamic Hybrid Attention (DHA) to route boundary tokens to full attention and non-boundary tokens to linear-complexity BSQ attention. Combined with Dynamic Prompt Embedding (DPE), Hybrid Mixture of Experts (HMoE), and Dynamic Local Upsampling (DLU), the method achieves state-of-the-art performance and efficient inference for high-precision interactive segmentation on CPU devices.
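The routing idea can be sketched as below. Plain linear attention with an elu+1 feature map stands in for BSQ attention, which is not reproduced here; single-head shapes, illustrative only.

```python
# Minimal sketch of dynamic hybrid routing: exact softmax attention for
# boundary tokens, O(N) linear attention for the rest.
import torch
import torch.nn.functional as F

def hybrid_attention(q, k, v, boundary_mask):
    """q, k, v: (N, D); boundary_mask: (N,) bool."""
    out = torch.empty_like(v)
    b = boundary_mask
    # full softmax attention for boundary queries
    attn = torch.softmax((q[b] @ k.T) / q.shape[-1] ** 0.5, dim=-1)
    out[b] = attn @ v
    # linear attention for non-boundary queries
    phi_q, phi_k = F.elu(q[~b]) + 1, F.elu(k) + 1
    kv = phi_k.T @ v                                   # (D, D) summary
    z = phi_q @ phi_k.sum(dim=0, keepdim=True).T       # (M, 1) normalizer
    out[~b] = (phi_q @ kv) / (z + 1e-6)
    return out
```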
- Joint Self-Supervised Video Alignment and Action Segmentation
-
This paper proposes the VAOT/VASOT framework, which integrates Gromov-Wasserstein optimal transport with structural priors to unify self-supervised video alignment and action segmentation within a single model for the first time. The framework surpasses existing methods on video alignment and achieves state-of-the-art performance on action segmentation.
- Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
-
By analyzing the scarcity and misalignment of negation expressions in CLIP's pre-training data, this work designs two LLM/MLLM-based negation data generation pipelines to fine-tune the CLIP text encoder, producing NegationCLIP — a model that enhances negation understanding while preserving general performance. A new benchmark, NegRefCOCOg, is proposed for comprehensive negation evaluation.
- Know Your Attention Maps: Class-specific Token Masking for Weakly Supervised Semantic Segmentation
-
This paper proposes an end-to-end weakly supervised semantic segmentation method that introduces multiple [CLS] tokens (one per class) into a ViT, applies random masking to [CLS] token output embeddings, and prunes redundant attention heads. Class-specific pseudo segmentation masks are generated directly from self-attention maps without any additional CAM module.
- Latent Expression Generation for Referring Image Segmentation and Grounding
-
This paper proposes the Latent-VG framework, which generates multiple latent expressions from a single textual description — each sharing the same subject but highlighting distinct visual attributes — to bridge the semantic gap between sparse text and rich visual information via complementary visual details. The method achieves state-of-the-art performance on both referring image segmentation and referring expression comprehension tasks.
- LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation
-
This paper proposes LawDIS, a language-window dual-control dichotomous image segmentation framework built upon Stable Diffusion. In macro mode, language prompts guide target segmentation; in micro mode, variable-size windows refine local details. LawDIS comprehensively outperforms 11 state-of-the-art methods on DIS5K.
- LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation
-
This paper proposes LawDIS, a controllable dichotomous image segmentation framework built upon a latent diffusion model. It achieves high-quality foreground mask generation through the synergy of macro-level language control (LS) and micro-level window refinement (WR), comprehensively outperforming 11 state-of-the-art methods on the DIS5K benchmark.
- LayerAnimate: Layer-level Control for Animation
-
This paper proposes LayerAnimate, a framework that integrates the layer-separation paradigm of traditional animation production with video diffusion models to enable fine-grained layer-level control (motion scores, trajectories, sketches). An automated data curation pipeline is designed to address the scarcity of layered animation data. The framework comprehensively outperforms existing methods across six video generation tasks.
- Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation
-
This paper proposes Learn2Synth, a training framework that leverages hypergradients to learn optimal synthetic data augmentation parameters, enabling segmentation networks trained exclusively on synthetic data to achieve peak performance on real data. The framework simultaneously attains high in-domain accuracy and strong out-of-domain generalization, outperforming both SynthSeg and supervised learning baselines on brain MRI segmentation tasks.
- Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
-
This paper presents a complete affordance learning system comprising: (1) an automatic pipeline for extracting precise graspable and functional affordance segmentation annotations from egocentric videos; (2) a Geometry-guided Affordance Transformer (GAT) based on DINOv2 with depth-geometric guidance for cross-domain affordance segmentation (mIoU improved by 13.8%); and (3) the Aff-Grasp framework, which achieves a 77.1% grasping success rate across 179 real robot trials.
- LEGION: Learning to Ground and Explain for Synthetic Image Detection
-
This paper proposes the LEGION framework and the SynthScars dataset, leveraging a multimodal large language model (MLLM) to unify artifact detection, pixel-level segmentation, and textual explanation for synthetic image detection. It further extends the role of the detector from a "Defender" to a "Controller," guiding generative models to produce higher-quality images.
- LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity
-
This paper proposes LeGrad, a layer-wise explainability method designed specifically for ViTs. It computes the gradient of the activation with respect to the attention map at each layer as the explanation signal, aggregates these signals across layers to produce high-quality spatial saliency maps, and demonstrates superior spatial fidelity in segmentation, perturbation, and open-vocabulary settings.
- MOVE: Motion-Guided Few-Shot Video Object Segmentation
-
This paper introduces a novel task of motion-guided few-shot video object segmentation along with a large-scale dataset MOVE (224 motion categories, 4,300 videos, 314K masks), and proposes a Decoupled Motion-Appearance (DMA) network. Through a dual-branch architecture combining frame-differencing-based motion prototypes and appearance prototypes, the proposed method significantly outperforms existing FSVOS methods on the new benchmark.
- O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views
-
This work reframes cross-view (ego-exo) object segmentation as a mask matching problem. It leverages FastSAM to generate candidate masks, DINOv2 to extract semantic features, and contrastive learning to match objects across views, achieving state-of-the-art performance on the Ego-Exo4D benchmark with only 1% of the trainable parameters used by prior methods.
- Object-level Correlation for Few-Shot Segmentation
-
This paper proposes OCNet, which constructs object-level (rather than image-level) support-query correlations by emulating biological visual processes: it first mines generic objects in the query image and then identifies the target object among them, effectively suppressing irrelevant object noise from the background.
- OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation
-
This paper proposes OmniSAM, the first framework to apply SAM2 to unsupervised domain adaptation (UDA) for panoramic semantic segmentation. It partitions panoramic images into patch sequences via a sliding window and leverages SAM2's memory mechanism to capture cross-patch correspondences. Combined with a FoV-based prototypical adaptation module and a dynamic pseudo-label update strategy, OmniSAM significantly surpasses the state of the art on both indoor and outdoor benchmarks (+10.22% / +6.58%).
- On the Generalization of Representation Uncertainty in Earth Observation
-
This paper systematically investigates the generalization of pretrained representation uncertainty in Earth Observation (EO), demonstrating that EO-pretrained uncertainty generalizes robustly across geographic locations, EO tasks, and target granularities, while remaining highly sensitive to ground sampling distance (GSD).
- Online Generic Event Boundary Detection
-
This paper proposes Online Generic Event Boundary Detection (On-GEBD) as a new task—detecting event boundaries in real time from streaming video—and introduces the ESTimator framework inspired by the cognitive science Event Segmentation Theory (EST). Through the collaboration of a Consistent Event Anticipator (CEA) and an Online Boundary Discriminator (OBD), ESTimator achieves an Avg F1 of 0.748 on Kinetics-GEBD, surpassing all online baselines and approaching the performance of offline methods.
- Online Reasoning Video Segmentation with Just-in-Time Digital Twins
-
This paper proposes a multi-agent framework based on the concept of "Just-in-Time Digital Twins" that decouples perception from reasoning. Without any LLM fine-tuning, the framework enables online video reasoning segmentation and comprehensively outperforms existing methods across semantic, spatial, and temporal reasoning tasks.
- Open-World Skill Discovery from Unsegmented Demonstration Videos
-
Inspired by the human cognitive Event Segmentation Theory (EST), this paper proposes the Skill Boundary Detection (SBD) algorithm, which leverages prediction error spikes from a pretrained unconditional action prediction model to automatically identify skill boundaries in unsegmented demonstration videos, significantly improving the performance of conditional policies and hierarchical agents in Minecraft.
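The core signal, a spike in the prediction error of an unconditional action model, can be sketched in a few lines. The z-score spike test and window size below are assumptions, not the paper's exact rule.

```python
# Minimal sketch of skill-boundary detection from prediction-error spikes.
# `errors` is the per-timestep loss of a pretrained unconditional action
# prediction model on an unsegmented demonstration video.
import numpy as np

def detect_skill_boundaries(errors, win=16, z_thresh=3.0):
    boundaries = []
    for t in range(win, len(errors)):
        mu = errors[t - win:t].mean()
        sd = errors[t - win:t].std() + 1e-8
        if (errors[t] - mu) / sd > z_thresh:           # error spike
            boundaries.append(t)
    return boundaries
```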
- PartField: Learning 3D Feature Fields for Part Segmentation and Beyond
-
PartField learns a continuous 3D feature field via a feed-forward model, distilling knowledge from mixed 2D/3D part proposals through contrastive learning. It outperforms existing methods by 20%+ on category-agnostic 3D part segmentation while achieving inference speeds orders of magnitude faster.
- Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss
-
This paper proposes P3HOT, a framework that achieves state-of-the-art performance on Human-Object Contact (HOT) detection by incorporating text prompt guidance to focus on human contact regions, a depth-aware module to filter irrelevant backgrounds, and a Regional Joint Loss to enforce intra-region category consistency.
- RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
-
This paper introduces RAGNet, the first large-scale reasoning-based affordance segmentation benchmark (273k images, 180 categories, 26k reasoning instructions), and proposes the AffordanceNet framework, which integrates VLM-pretrained affordance prediction with grasp pose generation, demonstrating strong open-world generalization and reasoning capabilities.
- Refer to Any Segmentation Mask Group With Vision-Language Prompts
-
This paper proposes the Omni-modal Referring Expression Segmentation (ORES) task and the RAS framework, which leverages a mask-level LMM with a non-autoregressive decoding mechanism to select target mask groups from a candidate pool based on vision-language hybrid prompts. The approach achieves state-of-the-art performance on the newly introduced ORES dataset as well as classical RES/GRES benchmarks.
- ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
-
This paper proposes ReferDINO, which end-to-end adapts the GroundingDINO visual grounding foundation model to the Referring Video Object Segmentation (RVOS) task. By introducing a grounding-guided deformable mask decoder, an object-consistent temporal enhancer, and a confidence-based query pruning strategy, ReferDINO significantly surpasses state-of-the-art methods across five benchmarks (e.g., +3.9% \(\mathcal{J}\&\mathcal{F}\) on Ref-YouTube-VOS) while achieving real-time inference at 51 FPS.
- ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
-
By leveraging the general visual-language mappings learned by video diffusion models, and by preserving the complete generative model architecture while shifting the prediction target from noise to mask latents, this work enables open-world referring segmentation of any concept expressible in natural language in videos — including non-object dynamic processes.
- Region-based Cluster Discrimination for Visual Representation Learning
-
This paper proposes RICE (Region-Aware Cluster Discrimination), which constructs a billion-scale region dataset, designs a Region Transformer layer, and introduces a unified region cluster discrimination loss to jointly optimize object-aware and OCR capabilities, significantly improving visual encoder performance across segmentation, detection, and MLLM multi-task benchmarks.
- Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes
-
This paper constructs USC12K, the first unconstrained dataset for salient and camouflaged object detection covering four scene types, proposes USCNet built upon SAM, introduces an Attribute Relationship Modeling (ARM) module to explicitly model the relationship between salient and camouflaged objects, and designs a new metric CSCS to quantify confusion between the two categories, achieving state-of-the-art performance across all scene types.
- ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones
-
This paper introduces ROADWork, the first large-scale work zone dataset comprising 4,375 video clips, 9,650 richly annotated images, and 129K images with drivable path annotations. It reveals that foundation models fail severely in work zone scenarios (AP of only 2.9–4.2), while fine-tuning yields substantial improvements (+32.2 AP), and proposes a four-level cognitive framework of Recognize, Observe, Analyze, and Drive.
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
-
To address the error accumulation caused by SAM 2's greedy selection strategy in long videos, this paper proposes a training-free constrained tree search memory strategy that maintains multiple segmentation paths and selects the optimal result at the video level, achieving an average improvement of 3.7 J&F across 9 VOS and 3 VOT benchmarks, with gains of up to 5.3 in long-video scenarios.
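The training-free idea reduces to keeping several memory paths instead of one greedy path. A minimal beam-style sketch follows; `segment_step` and its per-candidate score are hypothetical stand-ins, not SAM 2's actual interface.

```python
# Minimal beam-style sketch of maintaining several segmentation paths.
# segment_step(state, frame) -> [(new_state, mask, score), ...] yields
# candidate continuations of one memory path.
def track_with_hypotheses(frames, init_state, segment_step, beam=3):
    paths = [(0.0, init_state, [])]                    # (cum_score, state, masks)
    for frame in frames:
        expanded = [
            (cum + score, new_state, masks + [mask])
            for cum, state, masks in paths
            for new_state, mask, score in segment_step(state, frame)
        ]
        expanded.sort(key=lambda p: p[0], reverse=True)
        paths = expanded[:beam]                        # constrained tree search
    return max(paths, key=lambda p: p[0])[2]           # best path at video level
```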
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation
-
This paper proposes the SCORE framework, which leverages multi-granularity scene context (regional context + global context) to enhance open-vocabulary remote sensing instance segmentation. Two dedicated modules — Region-Aware Integration (RAI) and Global Context Adaptation (GCA) — are introduced to strengthen visual and textual representations, respectively.
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation
-
This paper proposes SCORE, a framework that injects multi-granularity scene knowledge from a remote sensing-specific CLIP into an open-vocabulary instance segmentation pipeline via two modules — Region-Aware Integration (RAI) and Global Context Adaptation (GCA) — achieving an average mAP improvement of 5.53% over the previous state of the art in cross-dataset evaluation across multiple remote sensing benchmarks.
- Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation
-
This paper proposes Skeleton Motion Quantization (SMQ), which achieves unsupervised temporal action segmentation on skeleton sequences via a joint-decoupled temporal autoencoder and a skeleton motion word quantization module, substantially outperforming existing unsupervised methods on HuGaDB, LARa, and BABEL.
- SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation
-
This paper proposes SPADE — a spatial-aware denoising network for open-vocabulary panoptic scene graph generation (PSG). It adapts a pretrained diffusion model into a PSG-specific spatial prior extractor via DDIM inversion-guided calibration, and designs a relational graph Transformer to capture both long-range and local context. SPADE substantially outperforms prior state-of-the-art methods in both closed-set and open-set settings, with particularly strong performance on spatial relation prediction.
- Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
-
This paper exposes an evaluation bias in existing open-vocabulary segmentation (OVS) benchmarks, where test sets exhibit high semantic similarity to training spaces. It proposes a new benchmark, OpenBench, and a method, OVSNet, that integrates heterogeneous features via Gradient-Free Aggregation (GFA) and expands the training semantic space at zero cost through Proxy Calibration (PC), achieving state-of-the-art performance on both existing benchmarks and OpenBench.
- TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models
-
TAViS is a text-bridged audio-visual segmentation framework that couples the cross-modal alignment capability of ImageBind with the precise segmentation capability of SAM2. By introducing a text-bridged hybrid prompting mechanism and alignment supervision strategy, TAViS achieves state-of-the-art performance across single-source, multi-source, semantic, and zero-shot segmentation scenarios.
- Temporal Rate Reduction Clustering for Human Motion Segmentation
-
This paper proposes Temporal Rate Reduction Clustering (TR²C), which integrates the Maximal Coding Rate Reduction (MCR²) principle with temporal continuity regularization to jointly learn temporally consistent representations and affinity matrices conforming to the Union of Subspaces (UoS) distribution, achieving substantial state-of-the-art improvements on five HMS benchmarks.
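For reference, a minimal statement of the MCR² objective that the summary invokes, in its standard form from the coding-rate-reduction literature rather than quoted from this paper: with frame features \(Z \in \mathbb{R}^{d \times n}\), diagonal cluster-membership matrices \(\Pi_j\), and distortion \(\epsilon\),

\[
\Delta R(Z, \Pi) = \frac{1}{2}\log\det\!\Big(I + \frac{d}{n\epsilon^{2}} Z Z^{\top}\Big) - \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_{j})}{2n} \log\det\!\Big(I + \frac{d}{\operatorname{tr}(\Pi_{j})\,\epsilon^{2}} Z \Pi_{j} Z^{\top}\Big)
\]

TR²C couples an objective of this form with temporal continuity regularization so that neighboring frames receive consistent representations.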
- TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba
-
This paper proposes TinyViM, a lightweight convolution-Mamba hybrid visual backbone based on frequency decoupling. A Laplace Mixer routes low-frequency components to Mamba for global context modeling and enhances high-frequency components via depthwise convolution. A frequency ramp Inception structure progressively adjusts frequency allocation across stages. TinyViM achieves 2–3× higher throughput than existing Mamba models on classification, detection, and segmentation tasks.
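The frequency split can be illustrated with a toy PyTorch module. The pooled branch marks where the Mamba path would sit (a 1×1 conv stub here); everything is illustrative rather than the paper's architecture.

```python
# Toy illustration of a Laplace-style frequency split: a pooled low-pass
# branch for global context and a high-frequency residual enhanced by
# depthwise convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplaceSplitMixer(nn.Module):
    def __init__(self, dim, pool=2):
        super().__init__()
        self.pool = pool
        self.global_branch = nn.Conv2d(dim, dim, 1)    # stub for the Mamba path
        self.high_enhance = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x):                              # x: (B, C, H, W)
        low = F.avg_pool2d(x, self.pool)               # low-frequency band
        low_up = F.interpolate(low, size=x.shape[-2:], mode="bilinear",
                               align_corners=False)
        high = x - low_up                              # high-frequency residual
        ctx = F.interpolate(self.global_branch(low), size=x.shape[-2:],
                            mode="bilinear", align_corners=False)
        return ctx + self.high_enhance(high)           # recombine both bands
```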
- TopoTTA: Topology-Enhanced Test-Time Adaptation for Tubular Structure Segmentation
-
This paper proposes TopoTTA, the first test-time adaptation (TTA) framework specifically designed for tubular structure segmentation (TSS). It adapts to cross-domain differences in topological structure via Topological Meta-Differential Convolutions (TopoMDCs), and restores topological continuity through a Topological Hard Sample Generation (TopoHG) strategy, achieving an average clDice improvement of 31.81% across 10 datasets.
- Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
-
This paper proposes the OmniAVS dataset and OISA model, extending referring audio-visual segmentation (RAVS) beyond simple acoustic attribute perception to omnimodal expressions (arbitrary combinations of text/speech/sound/image) and deep reasoning (understanding sound content + world knowledge), achieving SOTA on the new benchmark and multiple related tasks.
- Training-Free Class Purification for Open-Vocabulary Semantic Segmentation
-
This paper proposes FreeCP, a training-free class purification framework that addresses class redundancy and visual-language ambiguity arising from over-complete vocabularies in open-vocabulary semantic segmentation (OVSS), via a two-stage strategy of redundancy purification and ambiguity purification. As a plug-and-play module, FreeCP consistently improves existing methods across eight benchmarks.
- Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation
-
This paper introduces the Personalized Open-Vocabulary Semantic Segmentation (Personalized OVSS) task for the first time, and proposes a plug-and-play method based on text prompt tuning. By incorporating negative mask proposals to suppress false positives and injecting visual embeddings to enrich personalized concept representations, the method enables recognition of user-specific object instances from only a few image-mask pairs, while preserving the original OVSS performance.
- UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis
-
This paper proposes UniGlyph, a visual text generation framework that adopts segmentation masks as a unified conditioning signal. By replacing conventional rendered glyph conditions with Adaptive Glyph Conditioning (AGC) and Glyph Region Loss (GRL), UniGlyph achieves state-of-the-art bilingual (Chinese and English) text image generation under a single ControlNet architecture, with particularly large margins in small-font and complex-layout scenarios.
- VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation
-
VEGGIE proposes an end-to-end unified framework that bridges an MLLM with a video diffusion model, enabling a single model to simultaneously accomplish 8 tasks—including instructional video editing, concept grounding, and reasoning segmentation—using only the diffusion loss.
- VSC: Visual Search Compositional Text-to-Image Diffusion Model
-
This paper proposes VSC, a visual search-based compositional text-to-image diffusion generation method that significantly improves the accuracy and scalability of multi-attribute-object binding by generating reference images for each attribute-object pair independently, fusing visual prototype embeddings, and training with segmentation-guided cross-attention localization.
- VSSD: Vision Mamba with Non-Causal State Space Duality
-
This paper proposes Non-Causal State Space Duality (NC-SSD), which transforms the SSD formulation of Mamba2 into a non-causal form by retaining the relative weights of token contributions in lieu of the cumulative decay of hidden states. Built upon NC-SSD, the VSSD visual backbone surpasses existing SSM-based models across classification, detection, and segmentation benchmarks while achieving 20%–50% faster training speed.
- What If: Understanding Motion Through Sparse Interactions
-
This paper proposes the Flow Poke Transformer (FPT), which directly predicts multimodal probability distributions over object motion in a scene (rather than a single deterministic outcome), conditioned on sparse "poke" interactions, enabling interpretable motion understanding and moving part segmentation.
- WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images
-
This paper proposes WildSeg3D, the first feed-forward 3D segmentation model that requires no scene-specific training. It addresses multi-view pointmap alignment errors via Dynamic Global Alignment (DGA) and achieves real-time interactive 3D segmentation through Multi-view Group Mapping (MGM), outperforming the current state of the art in accuracy while being 40× faster.
- ZIM: Zero-Shot Image Matting for Anything
-
This paper proposes ZIM, a zero-shot image matting model that constructs the SA1B-Matte dataset by converting SA1B segmentation labels into fine-grained matting labels via a label converter. A hierarchical pixel decoder and a prompt-aware masked attention mechanism are further introduced to achieve micro-level fine-grained matting while preserving zero-shot generalization capability.
📹 Video Understanding¶
- 4D-Bench: Benchmarking Multi-Modal Large Language Models for 4D Object Understanding
-
4D-Bench is the first benchmark for evaluating multi-modal large language models (MLLMs) on 4D object understanding. It encompasses two tasks—4D object question answering and captioning—and reveals that even the state-of-the-art GPT-4o achieves only 63% accuracy against a human baseline of 91%, exposing significant deficiencies in multi-view temporal reasoning among current MLLMs.
- 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
-
This paper introduces 4D-Bench, the first benchmark for evaluating multi-modal large language models (MLLMs) on 4D object (dynamic 3D object) understanding, comprising two tasks: 4D object question answering and 4D object captioning. The benchmark reveals that even GPT-4o achieves only 63% accuracy on simple 4D objects (vs. 91% human baseline), with particularly weak performance on object counting and temporal understanding.
- 4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
-
This paper introduces 4D-Bench, the first benchmark for evaluating multimodal large language models (MLLMs) on 4D object understanding (i.e., 3D objects with temporal evolution). It comprises two core tasks: 4D Object QA (751 QA pairs) and 4D Object Captioning (580 objects × 5 annotations). Evaluation reveals that even the state-of-the-art GPT-4o achieves only 63% accuracy compared to 91% for humans, exposing a substantial gap in multi-view spatiotemporal understanding among MLLMs.
- Adaptive Hyper-Graph Convolution Network for Skeleton-Based Human Action Recognition
-
This paper proposes Hyper-GCN, which replaces conventional binary graphs with an adaptive non-uniform hypergraph to model skeletal topology, and introduces virtual hyper-joints to create virtual connections that enable direct modeling of multi-joint cooperative relationships. The approach achieves state-of-the-art performance on NTU-60/120 and NW-UCLA with the most lightweight GCN design (base variant: only 1.1M parameters, 1.63 GFLOPs).
- Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections
-
This paper proposes Hyper-GCN, which transcends the limitation of conventional GCNs that model only binary pairwise joint relationships, by introducing adaptive non-uniform hypergraph convolution and virtual hyper joints. The design enables efficient aggregation of multi-joint collaborative semantics, achieving state-of-the-art performance on NTU-60/120 and NW-UCLA benchmarks with the most lightweight GCN architecture to date.
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
-
This paper proposes AIM, a training-free adaptive inference method for multimodal LLMs that achieves a 6.8× FLOPs reduction while maintaining performance, through similarity-based iterative visual token merging before the LLM and progressive PageRank-based token pruning within LLM layers. Under equal compute budgets, AIM even surpasses SOTA on long video understanding (+4.6 MLVU).
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
-
This paper proposes a training-free adaptive inference framework that achieves flexible accuracy–efficiency trade-offs across a 40× FLOPs range for multimodal LLMs. The method combines iterative token merging based on embedding cosine similarity before the LLM, and progressive token pruning based on PageRank-derived multimodal importance scores within LLM layers. Strong performance is demonstrated on both video and image understanding benchmarks.
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
-
This paper proposes AIM, a training-free adaptive inference method that combines iterative token merging before the LLM (based on embedding cosine similarity) with progressive token pruning within LLM layers (based on PageRank importance scores), achieving a 6.8× FLOPs reduction with negligible performance loss, and even surpassing SOTA on long video understanding benchmarks.
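Both ingredients are simple enough to sketch: a greedy cosine-similarity merge before the LLM and power-iteration PageRank over an attention graph for pruning within layers. Thresholds, shapes, and the greedy merge order below are assumptions; `attn` is assumed row-stochastic.

```python
# Sketch of the two AIM-style ingredients; illustrative, not the authors' code.
import numpy as np

def merge_similar_tokens(tokens, thresh=0.9):
    """tokens: (N, D). Merge each token into the first kept token whose
    cosine similarity exceeds `thresh`, updating a running mean."""
    kept, counts = [], []
    for t in tokens:
        t_n = t / (np.linalg.norm(t) + 1e-8)
        for i, kv in enumerate(kept):
            if t_n @ (kv / (np.linalg.norm(kv) + 1e-8)) > thresh:
                counts[i] += 1
                kept[i] = kv + (t - kv) / counts[i]    # running mean
                break
        else:
            kept.append(t.copy())
            counts.append(1)
    return np.stack(kept)

def pagerank_scores(attn, d=0.85, iters=50):
    """Token importance by power iteration on an (N, N) row-stochastic
    attention matrix; the lowest-scoring tokens are pruned."""
    n = attn.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        p = (1 - d) / n + d * (attn.T @ p)
    return p
```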
- Aligning Effective Tokens with Video Anomaly in Large Language Models
-
This paper proposes VA-GPT, which efficiently aligns anomaly-relevant tokens within MLLMs via two modules — Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG) — enabling precise detection, description, and temporal localization of anomalous events.
- AllTracker: Efficient Dense Point Tracking at High Resolution
-
AllTracker reformulates point tracking as a multi-frame long-range optical flow problem, iteratively refining correspondence estimates on low-resolution grids via 2D convolutions and pixel-aligned temporal attention, followed by upsampling. With only 16M parameters, it achieves state-of-the-art accuracy and enables high-resolution (768×1024) dense tracking of all pixels at speeds approaching optical flow methods.
- An Empirical Study of Autoregressive Pre-training from Videos
-
This paper systematically investigates autoregressive pre-training from videos (termed Toto), training a causal Transformer on over one trillion visual tokens. Despite minimal inductive biases, the approach achieves competitive performance across image recognition, video classification, object tracking, and robot manipulation, while exhibiting scaling laws analogous to those of language models, albeit at a slower rate.
- Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking
-
This paper proposes TRACT, a method that leverages trajectory-level information to enhance open-vocabulary multi-object tracking (OV-MOT). It improves association via Trajectory Consistency Reinforcement (TCR) and improves classification via Trajectory Feature Aggregation (TFA) and Trajectory Semantic Enrichment (TSE). TRACT achieves significant performance gains on the OV-TAO benchmark, particularly in classification accuracy.
- Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition
-
This paper proposes the Language-Guided Action Anatomy (LGA) framework, which leverages large language models to decompose action labels into atomic-level action descriptions encoded as subject–motion–object triplets. On the video side, a clustering-based segmentation strategy partitions frame sequences into corresponding atomic action stages. Multimodal fusion and matching are then performed at the atomic level, yielding substantial improvements in few-shot action recognition performance.
- Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos
-
This paper introduces Argus, the first model to generate complete 360° panoramic videos from standard perspective videos. Through three geometry- and motion-aware techniques—camera movement simulation, view-based frame alignment, and blended decoding—Argus achieves spatially consistent and temporally coherent panoramic video generation within a diffusion-based framework.
- BlinkTrack: Feature Tracking over 80 FPS via Events and Images
-
BlinkTrack introduces a differentiable Kalman filter into a learning framework to address the challenges of asynchronous data association and uncertainty-aware fusion between event cameras and conventional cameras, achieving feature tracking at over 80 FPS with significantly superior performance in occlusion scenarios compared to existing methods.
- Breaking the Encoder Barrier for Seamless Video-Language Understanding
-
This paper proposes ELVA, the first encoder-free Video Large Language Model (Video-LLM), which achieves performance comparable to encoder-based architectures through hierarchical token merging, video guidance supervision, and hybrid resolution inference, using only 7M publicly available video-text pairs while reducing FLOPs by 95% and inference latency by 92%.
- DeSPITE: Exploring Contrastive Deep Skeleton-PointCloud-IMU-Text Embeddings for Action Recognition
-
DeSPITE proposes a privacy-preserving multimodal contrastive pre-training framework that aligns four modalities — LiDAR point clouds, skeleton poses, IMU signals, and text — into a unified embedding space, enabling cross-modal matching, retrieval, and a pre-training paradigm for human activity recognition.
- DisTime: Distribution-based Time Representation for Video Large Language Models
-
This paper proposes DisTime, a framework that enables continuous time representation in Video-LLMs via a single learnable time token and a distribution-based time decoder. Complemented by the large-scale automatically annotated dataset InternVid-TG (1.25M events), DisTime achieves state-of-the-art performance on three categories of time-sensitive tasks: moment retrieval, dense video captioning, and grounded VQA.
- DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding
-
DynImg proposes a novel video representation method that appends non-key frames as "temporal visual prompts" below key frames to form dynamic images, enabling fine-grained spatiotemporal interaction inside the visual encoder (rather than at the high-level token stage). Combined with a 4D rotary positional encoding to maintain correct spatiotemporal ordering, DynImg surpasses SOTA by approximately 2% on multiple video understanding benchmarks while using fewer visual tokens.
- EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
-
EgoAdapt is a framework that jointly trains cross-modal distillation and policy learning to adaptively select the optimal modality combination, achieving up to 89% GMACs reduction while maintaining performance on par with or superior to SOTA on egocentric perception tasks.
- egoPPG: Heart Rate Estimation from Eye-Tracking Cameras in Egocentric Systems to Benefit Downstream Vision Tasks
-
This paper introduces egoPPG as a new egocentric vision task, proposes PulseFormer to estimate heart rate (MAE=7.67 bpm) from the eye-tracking cameras of unmodified egocentric head-mounted devices, and demonstrates that heart rate estimation improves skill assessment accuracy on EgoExo4D by 14.1%.
- EMoTive: Event-Guided Trajectory Modeling for 3D Motion Estimation
-
This paper proposes EMoTive, an event camera-based 3D motion estimation framework that encodes fine-grained temporal evolution via Event Kymograph and models spatiotemporal trajectories using event-density-guided non-uniform NURBS parametric curves. Optical flow and motion-in-depth fields are derived from these trajectories, achieving state-of-the-art performance on the newly constructed CarlaEvent3D dataset and real-world benchmarks.
- Factorized Learning for Temporally Grounded Video-Language Models
-
This paper proposes D2VLM, a framework that decomposes video understanding into a "first localize evidence, then generate answers based on evidence" paradigm. It introduces evidence tokens to capture event-level visual semantics and designs Factorized Preference Optimization (FPO) to simultaneously improve temporal grounding and text response quality.
- Fine-grained Spatiotemporal Grounding on Egocentric Videos
-
This paper presents EgoMask, the first pixel-level spatiotemporal grounding benchmark for egocentric videos, comprising short/medium/long evaluation splits and a large-scale training set EgoMask-Train. Through systematic analysis, it reveals key differences between egocentric and exocentric videos, and demonstrates that fine-tuned models can achieve substantial performance gains.
- Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow
-
Flow4Agent is the first work to introduce optical flow motion priors into LLM-based video understanding. It employs Temporal Granularity Optimization (TGO) to cluster video events via coarse-grained optical flow and filter redundant scenes using semantic priors, and Motion Token Pruning (MTP) to remove intra-frame static redundant tokens via fine-grained optical flow. The method achieves state-of-the-art performance on long-video benchmarks including VideoMME, MLVU, and LongVideoBench.
- FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases
-
FlowSeek integrates the prior knowledge of a depth foundation model (Depth Anything V2) and classical low-dimensional motion parameterization (motion bases) into an optical flow network, achieving state-of-the-art cross-dataset generalization while training on a single consumer-grade GPU.
- Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition
-
This paper proposes FS-VAE (Frequency-Semantic Enhanced Variational Autoencoder), which achieves significant performance gains in zero-shot skeleton-based action recognition through three key contributions: frequency decomposition for enhanced skeleton semantic learning, multilevel semantic alignment to bridge the visual-text modality gap, and a calibrated cross-alignment loss to mitigate alignment ambiguity.
- General Compression Framework for Efficient Transformer Object Tracking
-
This paper proposes CompressTracker, a general Transformer tracker compression framework that achieves architecture-agnostic efficient compression through three progressive innovations—stage division, replacement training, and feature mimicking—delivering a 2.42× speedup while retaining approximately 99% of SUTrack's accuracy.
- HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics
-
This paper proposes HERMES, a framework comprising two general-purpose modules — the Episodic COmpressor (ECO) and the Semantics reTRiever (SeTR) — that capture episodic memory and semantic information from video respectively. HERMES can serve as a standalone system achieving state-of-the-art performance, or be integrated as plug-and-play components into existing video-language models, simultaneously reducing inference latency by up to 43% and memory consumption by up to 46%.
- Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
-
This paper addresses the Online Video Temporal Grounding (OnVTG) task by proposing a hierarchical event memory mechanism that stores historical event information at multiple temporal scales. Combined with a segment-tree-based event proposal structure and a future prediction branch, the method achieves state-of-the-art grounding accuracy and low-latency prediction on TACoS, ActivityNet Captions, and MAD.
- Learning to Generalize Without Bias for Open-Vocabulary Action Recognition
-
This paper proposes Open-MeDe, a meta-learning-based framework for open-vocabulary action recognition (OVAR). By simulating "known-to-open" generalization tasks via cross-batch meta-optimization and stabilizing training with a Gaussian weight averaging strategy, the framework improves generalization in both in-context and out-of-context settings without relying on CLIP regularization.
- MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation
-
MEMFOF is the first memory-efficient multi-frame optical flow method. By reducing the correlation volume resolution and introducing a high-resolution training strategy, it achieves state-of-the-art accuracy on Spring, Sintel, and KITTI benchmarks while requiring only 2.09 GB of GPU memory for 1080p inference.
- MikuDance: Animating Character Art with Mixed Motion Dynamics
-
This paper proposes MikuDance, a diffusion-based character art animation system that achieves high-dynamic animation of complex character artwork through two core contributions: Mixed Motion Modeling, which unifies character motion and 3D camera motion into a pixel-space representation, and Mixed-Control Diffusion, which implicitly aligns character shape/scale with motion guidance within the Reference UNet.
- MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration
-
This paper introduces MIORe and VAR-MIORe, two multi-task motion restoration benchmark datasets captured using a 1000fps industrial-grade camera and a professional lens array. The benchmarks span a full motion magnitude spectrum from near-static to extreme motion, employ an adaptive frame-averaging mechanism to generate consistent motion blur, and provide a unified evaluation platform for deblurring, frame interpolation, and optical flow estimation.
- MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
-
MobileViCLIP introduces spatiotemporal structural re-parameterization into the efficient image-text model MobileCLIP and trains it on large-scale video-text datasets, yielding a mobile-deployable video-text model that achieves performance comparable to much larger models on zero-shot retrieval and action recognition.
- Moment Quantization for Video Temporal Grounding
-
This paper proposes MQVTG, which for the first time introduces vector quantization into video temporal grounding (VTG) by mapping video clips to discrete vectors via a moment codebook and soft quantization, thereby enhancing foreground/background discriminability and achieving state-of-the-art performance on 6 benchmarks.
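Soft quantization against a codebook has a compact form. The sketch below assumes cosine similarities and a temperature-softened assignment, which may differ in detail from the paper.

```python
# Minimal sketch of soft quantization against a learnable moment codebook.
import torch
import torch.nn.functional as F

def soft_quantize(clip_feats, codebook, tau=0.07):
    """clip_feats: (N, D) video-clip features; codebook: (K, D) moment
    codewords. Each clip becomes a softmax-weighted mixture of codewords,
    keeping the operation differentiable end to end."""
    sims = F.normalize(clip_feats, dim=-1) @ F.normalize(codebook, dim=-1).T
    weights = torch.softmax(sims / tau, dim=-1)        # (N, K) soft assignment
    return weights @ codebook                          # quantized features
```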
- Multi-modal Multi-platform Person Re-Identification: Benchmark and Method
-
This paper presents MP-ReID, the first multi-modal multi-platform person re-identification benchmark encompassing three modalities (RGB, infrared, thermal) and two platforms (ground and UAV), along with a unified prompt learning framework, Uni-Prompt ReID, which leverages modality-aware, platform-aware, and visual-enhanced prompts to substantially improve ReID performance under complex real-world conditions.
- Online Dense Point Tracking with Streaming Memory
-
This paper proposes SPOT, a framework for online dense long-range point tracking via a customized memory readout module, sensory memory, and visibility-guided splatting. SPOT achieves state-of-the-art performance on the CVO benchmark with 10× fewer parameters and 2× faster speed, while matching or surpassing offline methods on multiple sparse tracking benchmarks.
- OVG-HQ: Online Video Grounding with Hybrid-modal Queries
-
This paper proposes OVG-HQ, a new online video grounding task supporting hybrid-modal queries (text, image, and video clip), and introduces a Parametric Memory Block (PMB) to retain historical context alongside a hybrid distillation strategy to mitigate modality imbalance, enabling real-time moment localization in streaming video.
- PriOr-Flow: Enhancing Primitive Panoramic Optical Flow with Orthogonal View
-
This paper proposes PriOr-Flow, a dual-branch framework that leverages the low-distortion prior of orthogonal views to compensate for severe distortions in polar regions of ERP panoramic images, achieving significant improvements in panoramic optical flow estimation — reducing EPE by 30.0% on MPFDataset and 29.6% on FlowScape.
- Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
-
This paper proposes Q-Frame, a training-free plug-and-play framework for video frame selection and multi-resolution adaptation. By leveraging CLIP cross-modal matching and the Gumbel-Max trick, Q-Frame achieves query-aware frame selection, enabling Video-LLMs to process more informative frames under the same computational budget. It achieves significant performance gains on three benchmarks: MLVU, LongVideoBench, and Video-MME.
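Query-aware selection via the Gumbel-Max trick can be sketched in a few lines, assuming per-frame CLIP similarities are already computed: adding Gumbel noise and taking the top-m samples m distinct frames with probability proportional to the softmax of the scores.

```python
# Minimal sketch of Gumbel-Max frame selection; scores are assumed to be
# precomputed CLIP text-frame similarities for one query.
import numpy as np

def gumbel_max_select(frame_scores, m, rng=None):
    """frame_scores: (T,). Returns m distinct frame indices sampled
    approximately ~ softmax(frame_scores)."""
    rng = np.random.default_rng(0) if rng is None else rng
    g = rng.gumbel(size=frame_scores.shape)            # Gumbel(0, 1) noise
    return np.argsort(-(frame_scores + g))[:m]         # perturbed top-m
```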
- RainbowPrompt: Diversity-Enhanced Prompt-Evolving for Continual Learning
-
This paper proposes RainbowPrompt, a prompt-evolving mechanism that integrates multiple task-specific prompts into a diversity-enhanced unified prompt via attention-based transformation and task-guided alignment, achieving an average improvement of 8.23% over existing methods on image classification and video action recognition tasks.
- ResidualViT for Efficient Temporally Dense Video Encoding
-
This paper proposes ResidualViT, which draws an analogy to I-frame/P-frame strategies in video compression by alternating between a full ViT and a lightweight residual ViT for encoding video frames. The approach achieves up to 60% reduction in computational cost and 2.5× inference speedup while maintaining accuracy close to the original CLIP.
- Simultaneous Motion And Noise Estimation with Event Cameras
-
This paper presents the first method for simultaneous motion and noise estimation with event cameras. It scores each event using the local contrast in the motion-compensated image of warped events (IWE) within the Contrast Maximization (CMax) framework, and obtains motion parameters along with signal/noise classification through alternating optimization. The method achieves state-of-the-art performance on the E-MLB denoising benchmark.
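The Contrast Maximization objective the method builds on is easy to sketch: warp events by a candidate motion, accumulate them into the image of warped events (IWE), and score the motion by the image's contrast. A brute-force velocity search stands in for the optimizer, and the per-event noise scoring is omitted.

```python
# Minimal sketch of the CMax objective; illustrative only.
import numpy as np

def iwe_contrast(x, y, t, v, H, W):
    """x, y, t: (N,) event coordinates and timestamps; v = (vx, vy)."""
    xw = np.clip(np.round(x - v[0] * t), 0, W - 1).astype(int)
    yw = np.clip(np.round(y - v[1] * t), 0, H - 1).astype(int)
    iwe = np.zeros((H, W))
    np.add.at(iwe, (yw, xw), 1.0)                      # accumulate warped events
    return iwe.var()                                   # sharper IWE = better fit

def best_velocity(x, y, t, H, W, vmax=3):
    return max(((iwe_contrast(x, y, t, (vx, vy), H, W), (vx, vy))
                for vx in range(-vmax, vmax + 1)
                for vy in range(-vmax, vmax + 1)))[1]
```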
- Sparse-Dense Side-Tuner for Efficient Video Temporal Grounding
-
This paper proposes SDST (Sparse-Dense Side-Tuner), the first anchor-free side-tuning architecture for video temporal grounding (VTG). Through a sparse-dense dual-stream design, SDST jointly addresses moment retrieval (MR) and highlight detection (HD). A novel Reference-based Deformable Self-Attention (RDSA) module is introduced to resolve the context deficiency in standard deformable cross-attention. SDST achieves state-of-the-art or highly competitive results on QVHighlights, TACoS, and Charades-STA while reducing trainable parameters to 27% of the current SOTA.
- TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
-
This paper proposes TimeExpert — the first MoE-based Video-LLM framework that routes timestamps, saliency scores, and text descriptions to specialized experts via task-aware dynamic gating and token-adaptive routing, complemented by task-dependent auxiliary losses. TimeExpert achieves state-of-the-art performance across three VTG task categories: Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.
- TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
-
This paper proposes TOGA — a weakly supervised vision-language model that generates pseudo temporal labels via a multi-scale visual-language connector and consistency constraints, enabling joint generation of open-ended answers and temporal grounding without any temporal annotations, achieving SOTA on NExT-GQA, MSVD-QA, and ActivityNet-QA.
- Towards Efficient General Feature Prediction in Masked Skeleton Modeling
-
This paper proposes GFP (General Feature Prediction), a framework that elevates the reconstruction target in masked skeleton modeling from low-level joint coordinates to multi-scale high-level semantic feature prediction. Coupled with a lightweight Target Generation Network and an information maximization constraint, GFP achieves a 6.2× training speedup while attaining state-of-the-art performance.
- Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding
-
This paper introduces Video Thinking Test (Video-TT), a benchmark for evaluating both the correctness and robustness of video large language models (Video LLMs). It comprises 1,000 YouTube Shorts videos and 5,000 questions, designed around visual/narrative complexity factors and natural adversarial question variants. The benchmark reveals a substantial gap between the best-performing model (GPT-4o, 36.6%) and humans (84.3%).
- Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition
-
This paper proposes the Trokens framework, which converts point trajectories into semantically-aware relational tokens via semantic-aware trajectory point sampling and relational motion modeling (comprising intra-trajectory HoD and inter-trajectory relative displacement descriptors). By fusing these tokens with appearance features, Trokens achieves state-of-the-art performance on six few-shot action recognition benchmarks.
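The intra-trajectory HoD descriptor is essentially an orientation histogram over a track's step displacements. A minimal sketch, assuming magnitude-weighted orientation bins; the paper's exact binning and normalization may differ:

```python
import numpy as np

def histogram_of_displacements(traj, n_bins=8):
    """Intra-trajectory HoD: bin the directions of per-step displacements
    of one point track into n_bins orientation bins, weighted by step
    magnitude, yielding a compact appearance-free motion descriptor."""
    d = np.diff(traj, axis=0)                       # (T-1, 2) step vectors
    ang = np.arctan2(d[:, 1], d[:, 0])              # directions in [-pi, pi]
    mag = np.linalg.norm(d, axis=1)
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

# Toy usage: a 32-step track drifting up-right -> mass in first-quadrant bins.
traj = np.cumsum(np.abs(np.random.default_rng(0).standard_normal((32, 2))), axis=0)
print(histogram_of_displacements(traj))
```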
- UMDATrack: Unified Multi-Domain Adaptive Tracking Under Adverse Weather Conditions
-
UMDATrack proposes the first unified multi-domain adaptive tracking framework. It leverages text-guided diffusion models to synthesize a small number (<2% of frames) of unlabeled multi-weather videos, employs Domain-Customized Adapters (DCA) to efficiently transfer object representations across weather domains, and introduces Target-aware Confidence Alignment (TCA) based on optimal transport to enhance cross-domain localization consistency. The framework substantially outperforms existing state-of-the-art trackers under nighttime, hazy, and rainy conditions.
- Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras
-
This paper proposes the first unsupervised learning framework based on a single network for jointly estimating optical flow and image intensity from event camera data. The core contribution is a complementary loss formulation combining a newly derived Event-based Photometric Error (PhE) with Contrast Maximization (CMax).
- Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
-
This paper proposes Vamba — a hybrid Mamba-Transformer large multimodal model (LMM) that encodes video tokens with linear complexity via Mamba-2 blocks and updates text tokens via cross-attention. Vamba processes up to 1024 frames on a single GPU and outperforms all efficient LMM methods on hour-level video understanding benchmarks.
- VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges
-
VideoLLaMB is proposed, achieving long streaming video understanding with linear GPU memory scaling via SceneTiling semantic segmentation, recurrent memory bridge layers, and a memory cache retrieval mechanism, yielding an average improvement of 4.2 points across 4 VideoQA benchmarks.
- VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization
-
This paper proposes VideoMiner, a tree-structured reinforcement learning framework for long-form video understanding. It iteratively applies segmentation–captioning–clustering to construct a hierarchical video tree, and introduces T-GRPO (Tree-based Group Relative Policy Optimization) to guide a policy model in adaptively exploring key frames. VideoMiner achieves state-of-the-art performance on four long-video benchmarks, and it is observed that T-GRPO spontaneously elicits chain-of-thought reasoning.
- VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
-
This paper proposes VTimeCoT, a training-free visual-temporal chain-of-thought framework that overlays a synchronized progress bar and highlights key segments at the bottom of video frames, enabling multimodal large language models (MLLMs) to accurately perceive timestamps. The approach substantially outperforms GPT-4o and Qwen2VL-7B baselines on temporal grounding and reasoning QA tasks.
- What You Have is What You Track: Adaptive and Robust Multimodal Tracking
-
This paper proposes FlexTrack—the first framework to systematically study tracking under temporally incomplete multimodal data—achieving adaptive computational complexity via a Heterogeneous Mixture-of-Experts fusion module (HMoE) combined with a video-level masking training strategy. FlexTrack achieves state-of-the-art performance on 9 benchmarks, with gains of 2.6% under complete modalities and 10.2% under missing-modality scenarios.
- XTrack: Multimodal Training Boosts RGB-X Video Object Trackers
-
This paper proposes XTrack, which employs a Mixture of Modal Experts (MeME) framework and a soft-routing classifier to enable cross-modal knowledge sharing across RGB-D/T/E modalities, allowing inference with a single modality to benefit from multimodal training knowledge, achieving an average precision gain of 3%.
🎬 Video Generation¶
- Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis
-
This paper proposes Adversarial Distribution Matching (ADM), which replaces the predefined KL divergence in DMD with an implicit, data-driven measure of distributional discrepancy: a diffusion-based discriminator adversarially aligns the latent predictions of the real and fake score estimators along the PF-ODE. Combined with Adversarial Distillation Pretraining (ADP), the resulting DMDX pipeline surpasses DMD2 on one-step SDXL generation and sets new multi-step distillation benchmarks on SD3 and CogVideoX.
- Aligning Moments in Time using Video Queries
-
This paper proposes MATR (Moment Alignment TRansformer), which conditions target video representations on query video features via dual-stage sequence alignment (soft-DTW), enabling video-to-video moment retrieval (Vid2VidMR). A self-supervised pretraining strategy is designed accordingly, achieving +13.1% R@1 and +8.1% mIoU on ActivityNet-VRL.
- BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation
-
BadVideo is the first backdoor attack framework targeting text-to-video (T2V) generation models. It exploits inherent static and dynamic redundancy in video (e.g., unspecified background elements, motion trajectories) through two strategies—spatio-temporal composition and dynamic element transition—to covertly embed malicious content. The framework achieves up to 93.5% human-evaluated attack success rate on LaVie and Open-Sora while effectively evading existing content moderation systems.
- Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis
-
This paper proposes Causal-VidSyn, a diffusion model that achieves causal entity localization via an Accident-Reason Answering (ArA) module and a gaze-conditioned visual token selection mechanism. The authors also construct the Drive-Gaze dataset comprising 1.54 million frames of gaze data. The method outperforms state-of-the-art approaches across three tasks: accident video editing, normal-to-accident video diffusion, and text-to-video generation.
- D3: Training-Free AI-Generated Video Detection Using Second-Order Features
-
Drawing from second-order control systems in Newtonian mechanics, this paper identifies a fundamental distinction between real and AI-generated videos in their second-order temporal features ("acceleration"): real videos exhibit high fluctuation while generated videos remain flat. Based on this insight, the authors propose D3, a fully training-free AI-generated video detection method that classifies videos solely by computing the standard deviation of second-order differences of inter-frame features, achieving state-of-the-art performance across 40 test subsets.
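The D3 statistic is simple enough to state in full: take second-order temporal differences of per-frame features and measure their spread. A sketch, assuming features from any frozen image encoder:

```python
import numpy as np

def d3_score(frame_feats):
    """Training-free D3-style statistic: standard deviation of the
    second-order temporal differences ("acceleration") of per-frame
    features. Per the paper's finding, real videos fluctuate (high
    score) while generated videos stay flat (low score)."""
    accel = np.diff(frame_feats, n=2, axis=0)  # f[t+2] - 2*f[t+1] + f[t]
    return accel.std()

# Toy usage: 64 frames x 768-d features.
print(d3_score(np.random.default_rng(0).standard_normal((64, 768))))
```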
- DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images
-
This paper proposes DACoN, which fuses semantic features from the DINOv2 foundation model with high-resolution spatial features from a U-Net to enable automatic anime line art colorization with an arbitrary number of reference images, surpassing existing methods on both key-frame and sequential-frame colorization tasks.
- Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer
-
To address the difficulty of decoupling motion from appearance in DiT models with 3D full-attention, this paper proposes Shared Temporal Kernels and a Dense Point Tracking Loss, along with a comprehensive motion transfer benchmark MTBench and a hybrid motion fidelity metric.
- DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation
-
This paper introduces DH-FaceVid-1K, a large-scale high-quality face video dataset comprising 1,200+ hours, 270,043 video clips, and 20,000+ unique identities. It specifically addresses the severe underrepresentation of Asian faces in existing datasets and empirically validates scaling laws with respect to data volume and model parameter count through systematic experiments.
- Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
-
This paper proposes DisWM, a framework that pre-trains disentangled representations from "distracting videos" offline, then transfers semantic knowledge to downstream world models via offline-to-online latent space distillation, improving sample efficiency and robustness of visual reinforcement learning under environmental variations.
- DIVE: Taming DINO for Subject-Driven Video Editing
-
This paper proposes DIVE, a framework that leverages semantic features from the pretrained DINOv2 model as implicit correspondences to guide subject-driven video editing. DINO features are used for temporal motion modeling and target subject identity registration, enabling high-quality subject replacement while preserving motion consistency.
- DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization
-
This paper proposes DOLLAR, which combines Variational Score Distillation (VSD) and Consistency Distillation (CD) to achieve few-step video generation, and introduces a latent-space reward model fine-tuning strategy to further optimize specific quality dimensions. The 4-step student model achieves a VBench score of 82.57, surpassing the teacher model and commercial baselines such as Gen-3 and Kling, while single-step distillation yields a 278.6× sampling speedup.
- DreamRelation: Relation-Centric Video Customization
-
DreamRelation is proposed as the first relation-centric video customization method. Through a Relation LoRA Triplet combined with Hybrid Mask Training, it achieves disentanglement between relation and appearance, and enhances relational dynamics learning via a spatiotemporal relational contrastive loss, enabling animals to imitate human interactions.
- Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
-
This paper analyzes the optimization conflict between high- and low-noise levels in consistency model distillation, and proposes a parameter-efficient Dual-Expert Consistency Model (DCM). A semantic expert handles layout and motion while a detail expert handles fine-grained details, complemented by a temporal coherence loss and GAN with feature matching loss. On HunyuanVideo (13B), DCM achieves 4-step sampling quality approaching the 50-step baseline.
- DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization
-
DualReal is the first framework to propose adaptive joint training for identity and motion, achieving lossless fusion along both dimensions via Dual-aware Adaptation and a StageBlender Controller, with average gains of 21.7% and 31.8% on CLIP-I and DINO-I metrics.
- EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models
-
This paper proposes EfficientMT, an efficient end-to-end video motion transfer framework that reuses a pretrained T2V model backbone to extract temporal motion features, combines a scaler module with a temporal integration mechanism, and achieves zero-shot motion transfer using only a small amount of synthetic paired data. The inference speed is more than 10× faster than optimization-based methods.
- ETVA: Evaluation of Text-to-Video Alignment via Fine-Grained Question Generation and Answering
-
This paper proposes ETVA, a text-to-video alignment evaluation method based on fine-grained question generation and answering. It employs a multi-agent scene graph traversal to generate atomic questions and a knowledge-augmented multi-stage reasoning pipeline to answer them. ETVA substantially outperforms existing metrics in correlation with human judgments (Spearman's ρ 58.47 vs. 31.0) and introduces an evaluation benchmark containing 2k prompts and 12k questions.
- Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation
-
This paper proposes SynFMC, a synthetic dataset (the first video dataset with complete 6D pose annotations for both camera and objects) and the FMC method, enabling independent or simultaneous 6D pose control of camera and objects in text-to-video generation. The approach produces high-fidelity videos across diverse scenarios and is compatible with multiple personalized T2I models.
- FuXi-RTM: A Physics-Guided Prediction Framework with Radiative Transfer Modeling
-
This paper proposes FuXi-RTM, the first hybrid physics-guided weather forecasting framework that integrates a deep learning radiative transfer model (DLRTM) as a differentiable physical regularizer, outperforming the unconstrained baseline on 88.51% of variable–lead-time combinations.
- FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation
-
This paper proposes FVGen, a framework that distills a multi-step video diffusion model (VDM) into a student model requiring only 4 sampling steps. Through GAN-based student initialization and softened reverse KL divergence optimization, FVGen reduces sampling time by over 90% while maintaining or even surpassing the visual quality of the teacher model.
- Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks
-
This paper proposes Video Interface Networks (VINs), an abstraction module analogous to "fast thinking," which encodes long videos into fixed-size global tokens at each diffusion step to guide a DiT in generating multiple video chunks in parallel, enabling efficient and temporally consistent long video generation.
- Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
-
This paper adapts a pretrained video diffusion model (DynamiCrafter) into a monocular 4D dynamic scene reconstructor that simultaneously predicts three complementary geometric modalities — point maps, disparity maps, and ray maps. Through a multi-modal alignment and fusion algorithm combined with sliding-window inference, the model generalizes zero-shot to real videos despite being trained exclusively on synthetic data, substantially outperforming current state-of-the-art video depth estimation methods.
- LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models
-
LeanVAE is proposed as an ultra-efficient video VAE built upon non-overlapping patch operations, a Neighborhood-Aware Feedforward (NAF) module, wavelet transforms, and compressed sensing. With only 40M parameters, it achieves a 50× reduction in FLOPs and a 44× speedup in inference while maintaining competitive reconstruction quality.
- Long Context Tuning for Video Generation
-
This paper proposes Long Context Tuning (LCT), which extends the context window of pretrained single-shot video diffusion models to the scene level. By introducing interleaved 3D positional embeddings and an asynchronous noise strategy, LCT achieves cross-shot visual and temporal consistency without additional parameters, supporting both joint and autoregressive multi-shot generation, and exhibits emergent capabilities such as compositional generation.
- MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
-
MagicDrive-V2 proposes a multi-view driving video generation framework based on DiT + 3D VAE. Through a spatial-temporal condition encoding module and a progressive training strategy, it achieves high-resolution long video generation at 848×1600×6 views and 241 frames, significantly surpassing existing methods in both resolution and frame count.
- MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers
-
MagicMirror is the first framework to achieve zero-shot identity-preserving video generation on a Video Diffusion Transformer (CogVideoX). It employs dual-branch facial feature extraction, Conditioned Adaptive Normalization (CAN), and a two-stage training strategy (image pre-training followed by video fine-tuning) to generate high-quality dynamic videos while maintaining consistent facial identity.
- MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent
-
This paper proposes MotionAgent, which leverages a Motion Field Agent to parse motion descriptions from text into object trajectories and camera extrinsics, then unifies them into optical flow maps via an analytical flow synthesis module, enabling fine-grained and precise control over both object motion and camera motion in I2V generation using only text input.
- MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
-
This paper proposes MotionShot, a training-free motion transfer framework that achieves high-fidelity motion transfer between arbitrary reference–target object pairs with significant appearance and structural differences, via a two-level motion alignment strategy combining high-level semantic alignment and low-level morphological alignment.
- Multi-identity Human Image Animation with Structural Video Diffusion
-
This paper proposes the Structural Video Diffusion framework, which maintains multi-person appearance consistency via mask-guided identity-specific embeddings, jointly learns RGB/depth/normal tri-modal geometric structure to model human–object interactions, and introduces the Multi-HumanVid dataset of 25K multi-person interaction videos to enable multi-identity human video generation.
- NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors
-
NormalCrafter proposes a video normal estimation method built upon Stable Video Diffusion (SVD). By incorporating Semantic Feature Regularization (SFR) and a two-stage training strategy, the method generates normal sequences with fine-grained details and temporal consistency, substantially outperforming existing per-frame methods on video benchmarks.
- OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics
-
This paper proposes OCK (Object-Centric Kinematics), which augments object-centric video prediction by introducing explicit kinematic attributes (position, velocity, acceleration) as complements to slot representations. Two Transformer variants — Joint-OCK and Cross-OCK — are designed to fuse appearance and motion information, achieving significant improvements in dynamic video prediction quality across complex synthetic and real-world scenarios.
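The explicit kinematic attributes reduce to finite-difference derivatives of slot positions. A minimal sketch of the attribute construction; how Joint-OCK/Cross-OCK fuse these with appearance slots is not shown:

```python
import numpy as np

def kinematic_attributes(positions, dt=1.0):
    """Augment per-slot object positions with explicit kinematics:
    velocity and acceleration via finite differences (np.gradient keeps
    all three attributes aligned with the original T timesteps)."""
    vel = np.gradient(positions, dt, axis=0)
    acc = np.gradient(vel, dt, axis=0)
    return np.concatenate([positions, vel, acc], axis=-1)

# Toy usage: T=20 steps, N=4 slots, 2-D positions -> (20, 4, 6) features.
pos = np.cumsum(np.random.default_rng(0).standard_normal((20, 4, 2)), axis=0)
print(kinematic_attributes(pos).shape)
```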
- OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
-
This paper proposes OmniHuman, a multi-condition human animation generation framework based on Diffusion Transformer. Through an omni-conditions training strategy that mixes motion-related conditions including text, audio, and pose, the framework enables effective data scaling. It is the first single model to support audio-driven human video generation with arbitrary body proportions and aspect ratios, achieving state-of-the-art performance on both portrait and half-body animation tasks.
- Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
-
Prompt-A-Video is proposed, which automatically constructs training data via a reward-guided prompt evolution pipeline and optimizes an LLM through two-stage SFT and DPO training to generate enhanced prompts aligned with the preferences of specific video diffusion models.
- Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization
-
This paper proposes UMIVR, a framework that explicitly quantifies three types of uncertainty in text-to-video retrieval—textual ambiguity (semantic entropy), mapping uncertainty (JS divergence), and frame uncertainty (temporal quality frame sampling)—and adaptively generates clarification questions based on the quantified uncertainty to iteratively refine queries, achieving 69.2% R@1 on MSR-VTT-1k after 10 interaction rounds.
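Of the three signals, the JS-divergence-based mapping uncertainty is the most self-contained. The sketch below computes JSD between two candidate-score distributions; which distributions UMIVR actually compares is not spelled out in the note, so the inputs here are placeholders:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two (normalized) retrieval
    distributions: 0 iff they agree, bounded above by log 2. High JSD
    can be read as an ambiguous text-to-video mapping."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float((a * np.log((a + eps) / (b + eps))).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy usage: two candidate rankings that mildly disagree.
print(js_divergence(np.array([0.7, 0.2, 0.1]), np.array([0.3, 0.4, 0.3])))
```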
- RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
-
RealCam-I2V integrates monocular metric depth estimation to construct 3D scenes for metric-scale-aligned training, provides an interactive 3D scene trajectory drawing interface, and introduces a scene-constrained noise shaping mechanism, addressing the scale inconsistency and real-world usability issues inherent in existing trajectory-guided I2V methods.
- Reangle-A-Video: 4D Video Generation as Video-to-Video Translation
-
Reangle-A-Video reformulates multi-view video generation as a video-to-video translation problem. It learns view-invariant motion via self-supervised fine-tuning of a video diffusion model, and combines DUSt3R-guided multi-view consistent inpainting to generate synchronized multi-view videos from a monocular input video.
- ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
-
This paper proposes ReCamMaster, which achieves camera-trajectory-controlled video re-generation from a single input video via a frame-dimension conditioning mechanism and a multi-camera synchronized dataset synthesized in UE5, significantly outperforming existing methods.
- SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering
-
SteerX proposes a zero-shot inference-time guidance method that integrates scene reconstruction into the video generation process. By designing geometric reward functions using camera-free feed-forward reconstruction models, SteerX steers the generation distribution toward geometrically consistent samples, enabling high-quality camera-free 3D/4D scene generation.
- STIV: Scalable Text and Image Conditioned Video Generation
-
This paper proposes STIV, a unified text-image conditioned video generation framework based on Diffusion Transformer. It integrates image conditioning via a frame replacement strategy and introduces joint image-text classifier-free guidance, enabling both T2V and TI2V generation within a single model. The 8.7B-parameter model achieves state-of-the-art scores of 83.1 and 90.1 on VBench T2V and I2V, respectively.
- SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization
-
This paper proposes SweetTok, a video tokenizer that decouples spatial and temporal information compression via a Decoupled Query AutoEncoder (DQAE), and assigns codewords by part-of-speech through a Motion-enhanced Language Codebook (MLC). Using only 25% of the token count, SweetTok achieves 42.8% improvement in rFVD and 15.1% improvement in gFVD, attaining an optimal balance between compression ratio and reconstruction fidelity.
- TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
-
This paper constructs TIP-I2V, the first million-scale real-user text and image prompt dataset for image-to-video (I2V) generation (1,701,935 unique prompt pairs), accompanied by generated videos from five state-of-the-art I2V models. Built upon this dataset, the paper introduces TIP-Eval, a large-scale evaluation benchmark, alongside studies on user preference analysis and AI-generated video detection.
- VACE: All-in-One Video Creation and Editing
-
This paper proposes VACE, an all-in-one video creation and editing framework built on Diffusion Transformer. Its unified Video Condition Unit (VCU) interface consolidates text, image, video, and mask inputs into a single conditional representation, and a pluggable Context Adapter injects task concepts into the DiT model. A single model covers 12+ video tasks, making VACE the first video DiT to simultaneously support reference-guided generation, video editing, mask-based editing, and their arbitrary combinations, with performance on par with task-specific models.
- Versatile Transition Generation with Image-to-Video Diffusion
-
This paper proposes VTG, a unified video transition generation framework built upon an image-to-video diffusion model. VTG achieves smooth, high-fidelity transitions across four task categories — object morphing, motion prediction, concept blending, and scene transition — via interpolation-based initialization (noise SLERP + LoRA interpolation + text SLERP), bidirectional motion fine-tuning, and DINOv2 representation alignment regularization.
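The interpolation-based initialization hinges on spherical interpolation of the endpoint noises. A standard SLERP sketch (the LoRA-weight and text-embedding interpolations follow the same pattern; shapes are illustrative):

```python
import numpy as np

def slerp(a, b, t):
    """Spherical linear interpolation between two flattened noise tensors.
    Unlike lerp, intermediate points keep a sensible norm, which matters
    for diffusion initial noise."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))
    if np.isclose(omega, 0.0):          # (near-)parallel: fall back to lerp
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# Endpoint noises for the first/last transition frames; t sweeps in between.
rng = np.random.default_rng(0)
z0, z1 = rng.standard_normal(4096), rng.standard_normal(4096)
z_mid = slerp(z0, z1, 0.5)
```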
- V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models
-
This paper proposes the ReDPO loss function and the V.I.P. iterative online preference distillation framework, which combines preference learning (DPO) with SFT regularization for distilling pruned video diffusion models. The approach matches or surpasses the performance of the full model while reducing parameters by 36.2%–67.5%.
- VMBench: A Benchmark for Perception-Aligned Video Motion Generation
-
This paper proposes VMBench — the first comprehensive benchmark for video motion quality evaluation, featuring five-dimensional perception-aligned motion metrics (PMM) and a meta-information-guided motion prompt generation framework (MMPG). VMBench covers 969 motion categories and achieves an average improvement of 35.3% in Spearman correlation over existing methods.
- VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
-
This paper proposes the VPO framework, which systematically optimizes text prompts for video generation based on three core principles (Harmless, Accurate, Helpful). Through principle-guided SFT and multi-feedback preference optimization, VPO significantly improves the safety, alignment, and quality of generated videos.
- VSRM: A Robust Mamba-Based Framework for Video Super-Resolution
-
This work is the first to introduce Mamba into video super-resolution (VSR), proposing the VSRM framework. It achieves efficient spatiotemporal modeling via the Dual Aggregation Mamba Block, combined with Deformable Cross-Mamba Alignment and a frequency-domain loss, achieving state-of-the-art performance on multiple benchmarks.
- WorldScore: A Unified Evaluation Benchmark for World Generation
-
This paper proposes WorldScore — the first unified evaluation benchmark for world generation. It decomposes world generation into a series of next-scene generation tasks, enabling unified evaluation of 3D, 4D, I2V, and T2V models across 3,000 test samples and 10 evaluation metrics.
- X-Dancer: Expressive Music to Human Dance Video Generation
-
X-Dancer proposes a unified Transformer–diffusion framework that takes a single static image and a music sequence as input, autoregressively generates 2D whole-body dance pose token sequences synchronized with musical beats via a Transformer, and then synthesizes high-fidelity dance videos from these tokens using a diffusion model, surpassing existing methods in diversity, expressiveness, and video quality.
🧑 Human Understanding¶
- AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion
-
AdaHuman is proposed as a framework that generates high-fidelity, animatable 3D human avatars from a single image via a pose-conditioned 3D joint diffusion model and a compositional 3DGS refinement module.
- AJAHR: Amputated Joint Aware 3D Human Mesh Recovery
-
This paper proposes the first 3D human mesh recovery framework for amputees. It synthesizes 1M+ amputee images (the A3D dataset), designs the BPAC-Net amputation classifier to distinguish amputation from occlusion, and employs a dual-tokenizer switching strategy to encode amputated and normal pose priors separately. The method achieves substantial improvements on amputee data (MVE 16.87 lower than TokenHMR on ITW-amputee) while remaining competitive on non-amputee benchmarks.
- AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning
-
This paper proposes AR-VRM, the first method to enhance visual robot manipulation (VRM) through explicit imitation of human hand keypoints. It employs a keypoint vision-language model pretrained on large-scale human activity videos to acquire motion knowledge, and establishes correspondences between human hand keypoints and robot components via analogical reasoning.
- Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars
-
This paper presents Avat3r — the first animatable large reconstruction model (LRM) that regresses high-quality drivable 3D Gaussian head avatars from only 4 input images in a feed-forward manner. By integrating DUSt3R positional maps and Sapiens semantic features as priors, and modeling expression-driven animation via simple cross-attention, Avat3r substantially outperforms existing methods on the Ava256 and NeRSemble datasets.
- Bi-Level Optimization for Self-Supervised AI-Generated Face Detection
-
This paper proposes BLADES, a method that employs bi-level optimization to explicitly align self-supervised pretraining with the AI-generated face detection objective. The inner loop optimizes a visual encoder on pretext tasks including EXIF classification/ranking and face manipulation detection, while the outer loop optimizes task weights to improve performance on a proxy detection task, enabling cross-generator generalization without relying on any synthetic face data.
- Bring Your Rear Cameras for Egocentric 3D Human Pose Estimation
-
This paper is the first to investigate the value of rear-mounted cameras on HMDs for egocentric 3D whole-body pose estimation. It proposes a Transformer-based multi-view heatmap refinement method with an uncertainty-aware masking mechanism, achieving >10% MPJPE improvement on the newly constructed Ego4View dataset.
- CarGait: Cross-Attention based Re-ranking for Gait Recognition
-
This paper proposes CarGait, a cross-attention-based re-ranking method for gait recognition. Given the top-K retrieval results of any single-stage gait model, CarGait performs strip-wise cross-attention between the probe and each candidate sequence to learn fine-grained pair-wise gait correspondences, maps global features from the frozen single-stage model into a new discriminative embedding space, and recomputes distances for re-ranking. It consistently improves Rank-1/5 accuracy across seven baseline models on three major benchmarks (Gait3D, GREW, OU-MVLP), with an inference speed of 6.5 ms per probe that substantially outperforms existing re-ranking methods.
- CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation
-
This work is the first to introduce causal reasoning into category-level object pose estimation (COPE). It eliminates spurious correlations induced by data bias via a front-door adjustment-based causal reasoning module, and provides unbiased categorical semantic supervision through residual knowledge distillation from the 3D foundation model ULIP-2. The method achieves 61.7% on the strict 5°2cm metric on REAL275, surpassing the state of the art by 4.7%.
- Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing
-
This paper proposes BioTUCH, a framework that detects self-contact events via wrist-to-wrist bioimpedance sensing and performs contact-aware 3D arm pose refinement in conjunction with a visual pose estimator, achieving an average improvement of 11.7% in reconstruction accuracy.
- Controllable and Expressive One-Shot Video Head Swapping
-
This paper proposes a diffusion-based multi-condition controllable video head swapping framework (SwapAnyHead) that achieves high-fidelity identity preservation, seamless background blending, and accurate cross-identity expression transfer and editing via a shape-agnostic mask strategy, a hair enhancement strategy, and an expression-aware 3DMM-driven landmark retargeting module.
- DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
-
DreamActor-M1 proposes a human image animation framework based on the DiT architecture, achieving fine-grained facial and body control through hybrid control signals comprising implicit facial representations, 3D head spheres, and 3D body skeletons. Combined with complementary appearance guidance and a progressive training strategy, the framework supports multi-scale generation ranging from portrait to full-body.
- Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation
-
This paper proposes ViTaM-D, a vision-tactile fusion framework that achieves dynamic reconstruction of hand-object interaction for both rigid and deformable objects. The framework introduces a novel Distributed Force-aware Contact Representation (DF-Field) and a two-stage pipeline consisting of visual dynamic tracking followed by force-aware optimization. The HOT dataset is also introduced to address the evaluation gap in deformable object hand-object interaction.
- DynFaceRestore: Balancing Fidelity and Quality in Diffusion-Guided Blind Face Restoration
-
This paper proposes DynFaceRestore, which reformulates blind degradation as a Gaussian deblurring problem via Dynamic Blur Level Mapping (DBLM), and achieves an optimal fidelity-quality trade-off during diffusion model sampling through a Dynamic Starting Step lookup table (DSST) and a Dynamic Guidance Scaling Adjuster (DGSA).
- EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
-
This paper proposes EgoAgent, a unified predictive agent model that simultaneously learns to represent egocentric visual observations, predict future world states, and generate 3D human motions within a single Transformer.
- Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision
-
This paper proposes Fish2Mesh, a fisheye-aware Transformer model that embeds the spherical geometry of fisheye images into a Swin Transformer via an Egocentric Positional Encoding (EPE) based on equirectangular projection, enabling accurate 3D human mesh recovery from a head-mounted fisheye camera in egocentric perspective.
- GenM3: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
-
This paper proposes GenM3, a framework that learns unified discrete motion representations via a Multi-Expert VQ-VAE (MEVQ-VAE) and employs a Multi-path Motion Transformer (MMT) to handle intra-modal variation and cross-modal alignment. By integrating 11 motion datasets (~220 hours), GenM3 achieves state-of-the-art FID of 0.035 on HumanML3D.
- GENMO: A GENeralist Model for Human MOtion
-
This paper proposes GENMO, the first generalist model that unifies human motion estimation (recovering motion from video/2D keypoints) and motion generation (synthesizing motion from text/music/keyframes) within a single framework. Through a dual-mode training paradigm (regression + diffusion), GENMO achieves both precise estimation and diverse generation in a single model.
- GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation
-
This paper proposes GestureHYDRA, a co-speech gesture synthesis system based on a Hybrid-Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation, capable of reliably activating semantically explicit gestures such as numerical and directional indications.
- GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation
-
GGTalker proposes a prior-adaptation two-stage training strategy that learns generalizable audio-to-expression and expression-to-visual priors from large-scale datasets, then rapidly adapts to a specific identity. The method achieves state-of-the-art performance across rendering quality, 3D consistency, lip synchronization, and training efficiency, requiring only 20 minutes of adaptation to generate photorealistic talking-head videos at 120 FPS.
- HccePose(BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation
-
This paper proposes simultaneously predicting the 3D coordinates of both the front and back surfaces of an object and densely sampling between the two surfaces to construct ultra-dense 2D-3D correspondences. Combined with a novel Hierarchical Continuous Coordinate Encoding (HCCE), the method surpasses existing state-of-the-art approaches on all seven core BOP benchmark datasets.
- High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation
-
This paper proposes GLSMamba, the first pure-Mamba framework for video-based human pose estimation (VHPE). It models global dynamic context via a Global Spatiotemporal Mamba (GSM) module—featuring 6D selective space-time scanning and spatiotemporal-modulated scan merging—and captures local keypoint details via a Local Refinement Mamba (LRM) with windowed spatiotemporal scanning. The method achieves state-of-the-art performance on four benchmarks with linear computational complexity.
- HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding
-
This paper proposes the HIS-QA task, the HIS-Bench benchmark, and HIS-GPT — the first foundation model for joint 3D human-in-scene understanding. Through an Auxiliary Interaction Module (AInt) and a Layout-Trajectory Positional Encoding (LTP), HIS-GPT captures fine-grained human–scene interactions and substantially outperforms GPT-4o and other baselines across 16 sub-tasks.
- HUMOTO: A 4D Dataset of Mocap Human Object Interactions
-
This paper presents HUMOTO, a high-fidelity 4D human-object interaction dataset comprising 735 sequences (7,875 seconds at 30fps), covering 63 precisely modeled objects with 72 articulated parts. It introduces an LLM-driven scene scripting pipeline and a multi-sensor capture system, achieving significantly superior hand pose accuracy and interaction quality compared to existing datasets.
- IDFace: Face Template Protection for Efficient and Secure Identification
-
This paper proposes IDFace, a homomorphic encryption (HE)-based face template protection method that achieves retrieval over 1 million encrypted templates in only 126ms — incurring merely a 2× overhead compared to unprotected retrieval — through two key techniques: a near-isometric transformation (real-valued vector → ternary vector) and a space-efficient encoding scheme.
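The real-to-ternary step can be sketched as below; the thresholding rule (keep large coordinates, zero the rest) is a guess at near-isometric behavior, not the paper's construction, and the homomorphic encryption and encoding stages are omitted entirely:

```python
import numpy as np

def ternarize(v, tau=0.4):
    """Map a real-valued face embedding to {-1, 0, +1}: coordinates whose
    magnitude clears a data-dependent threshold keep their sign, the
    rest become 0. A rough stand-in for a near-isometric transform."""
    u = v / np.linalg.norm(v)
    thr = tau * np.abs(u).mean()
    out = np.zeros_like(u, dtype=np.int8)
    out[u > thr] = 1
    out[u < -thr] = -1
    return out

# Toy usage: 512-d embedding -> ternary template (cheap to encrypt/match).
print(ternarize(np.random.default_rng(0).standard_normal(512))[:16])
```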
- ImHead: A Large-scale Implicit Morphable Model for Localized Head Modeling
-
imHead proposes the first large-scale implicit 3D head morphable model. Through a global-local decoupled architecture trained on a dataset of 4,000 identities, it achieves both a compact implicit representation and localized facial editing, surpassing existing methods in reconstruction accuracy and editing flexibility.
- KinMo: Kinematic-Aware Human Motion Understanding and Generation
-
This paper proposes the KinMo framework, which decomposes human motion into six kinematic groups and their interactions as a hierarchically describable representation. An automatic annotation pipeline generates fine-grained textual descriptions at multiple granularities. Combined with hierarchical text-motion alignment (HTMA) and a coarse-to-fine motion generation strategy, KinMo significantly improves motion understanding and fine-grained motion generation.
- LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition
-
This paper proposes LVFace, which addresses training instability of ViT in large-scale face recognition via a Progressive Cluster Optimization (PCO) strategy. The training process is decomposed into three stages — feature alignment, centroid stabilization, and boundary refinement — achieving state-of-the-art results on multiple benchmarks.
- MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances
-
MagShield is proposed as the first method addressing magnetic disturbance in sparse inertial motion capture systems. It adopts a two-stage detect-then-correct strategy: detecting magnetic disturbances via joint analysis of multiple IMUs, and correcting orientation errors using a human motion prior network. The approach can be plugged into existing sparse-IMU motion capture systems as a drop-in module to enhance their robustness.
- MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation
-
This paper introduces the Multimodal DuetDance (MDD) dataset — the first large-scale, professional-grade duet dance dataset simultaneously integrating motion, music, and text descriptions. MDD comprises 620 minutes of motion capture data spanning 15 dance styles and over 10K fine-grained text annotations, and defines two new tasks: Text-to-Duet and Text-to-Dance Accompaniment.
- Mitigating Object Hallucinations via Sentence-Level Early Intervention
-
This paper proposes SENTINEL, a framework that mitigates object hallucinations in MLLMs via sentence-level early intervention and in-domain preference learning. It reduces hallucination rates by over 90% on Object HalBench while maintaining or even improving general-purpose capabilities.
- MixRI: Mixing Features of Reference Images for Novel Object Pose Estimation
-
This paper proposes MixRI, a lightweight network (5.3M parameters) that requires only 12 reference images, establishing 2D–3D correspondences between multiple reference images and a query image via a multi-view feature fusion strategy. MixRI achieves pose estimation performance comparable to methods requiring hundreds of reference images across 7 core BOP challenge datasets.
- Monocular Facial Appearance Capture in the Wild
-
This paper proposes a method for reconstructing facial appearance attributes (diffuse albedo, specular intensity, specular roughness) from monocular head-rotation videos. By introducing an occlusion-aware split-sum approximation shading model, the method achieves studio-grade facial appearance capture quality without imposing any simplifying assumptions on the illumination environment.
- NGD: Neural Gradient Based Deformation for Monocular Garment Reconstruction
-
This paper proposes NGD, a neural gradient-based deformation method that decomposes the Jacobian field into a frame-invariant static component and a frame-dependent dynamic component. Combined with an adaptive remeshing strategy, NGD reconstructs high-fidelity dynamic garment geometry and appearance from monocular video, significantly outperforming existing SOTA methods on challenging scenarios such as loose-fitting garments.
- One-Shot Knowledge Transfer for Scalable Person Re-Identification
-
This paper proposes OSKT (One-Shot Knowledge Transfer), which distills teacher model knowledge into a compact intermediate representation termed a "weight chain," enabling the generation of student models of arbitrary sizes for person re-identification with a single round of computation.
- OpenAnimals: Revisiting Person Re-Identification for Animals Towards Better Generalization
-
This paper develops the OpenAnimals open-source framework, systematically revisiting the transferability of person re-identification methods to animal re-identification. It proposes ARBase, an animal-oriented strong baseline that substantially outperforms existing person ReID methods across multiple benchmarks.
- PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation
-
This paper proposes the PersPose framework, which addresses inaccurate depth estimation caused by existing methods' neglect of field-of-view (FOV) information. It encodes the cropped camera intrinsics as a 2D map via Perspective Encoding (PE) and centers the subject through Perspective Rotation (PR) to eliminate perspective distortion.
- PHD: Personalized 3D Human Body Fitting with Point Diffusion
-
This paper proposes PHD, a personalized 3D human pose estimation paradigm that first calibrates user-specific body shape via SHAPify, then employs a shape-conditioned point diffusion model (PointDiT) as a 3D prior, and iteratively optimizes pose parameters through Point Distillation Sampling combined with 2D keypoint constraints, achieving state-of-the-art absolute pose accuracy on the EMDB dataset.
- PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data
-
This paper proposes the PoseSyn framework, which identifies hard samples for a target pose estimator (TPE) from in-the-wild 2D pose data via an Error Extraction Module (EEM), then expands inaccurate pseudo-labels into diverse motion sequences via a Motion Synthesis Module (MSM). A human animation model subsequently renders these sequences into realistic training images with accurate 3D annotations, improving 3D pose estimation accuracy by up to 14% across multiple real-world benchmarks.
- RayPose: Ray Bundling Diffusion for Template Views in Unseen 6D Object Pose Estimation
-
This work reformulates unseen 6D object pose estimation as a ray alignment problem, proposes an object-centric ray parameterization scheme, and employs a diffusion transformer to infer the 6D pose of a query image from multiple template images with known poses.
- SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning
-
SemGes proposes a two-stage framework that integrates semantic information at both global and fine-grained levels through semantic coherence and semantic relevance learning, generating co-speech gestures aligned with speech semantics. The method surpasses existing approaches on two benchmarks: BEAT and TED-Expressive.
- Sequential Keypoint Density Estimator: An Overlooked Baseline of Skeleton-Based Video Anomaly Detection
-
SeeKer proposes to autoregressively factorize the joint density of skeleton sequences at the keypoint level, detecting abnormal human behaviors by predicting conditional Gaussian distributions over subsequent keypoints. It substantially outperforms existing methods on the UBnormal and MSAD-HR benchmarks.
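Scoring under the predicted conditional Gaussians reduces to a negative log-likelihood. A diagonal-covariance sketch, assuming the model already emits per-keypoint means and variances for the next frame:

```python
import numpy as np

def keypoint_nll(mu, var, x):
    """Anomaly score: negative log-likelihood of observed keypoints x
    under predicted diagonal Gaussians (mu, var). More surprising motion
    means higher NLL, hence more anomalous."""
    return float(0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum())

# Toy usage: 17 keypoints x 2 coords; an off-distribution pose scores higher.
mu, var = np.zeros((17, 2)), np.ones((17, 2))
print(keypoint_nll(mu, var, np.zeros((17, 2))))     # likely pose
print(keypoint_nll(mu, var, 4 * np.ones((17, 2))))  # surprising pose
```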
- Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator
-
This paper proposes SOKE, a multilingual sign language generation framework built upon pretrained language models. It discretizes continuous sign language motion into token sequences via a decoupled tokenizer, and achieves high-quality text-to-3D-avatar sign language generation across multiple languages through multi-head decoding and retrieval-augmented strategies.
- SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data
-
This paper proposes SynFER, a diffusion-model-based facial expression synthesis framework that achieves fine-grained expression generation via dual control signals — text descriptions and Facial Action Units (FAUs) — and introduces a FERAnno label calibrator to ensure annotation reliability. The effectiveness of synthetic data for FER is validated across four learning paradigms: self-supervised, supervised, zero-shot, and few-shot learning.
- TriDi: Trilateral Diffusion of 3D Humans, Objects, and Interactions
-
TriDi is proposed as the first unified diffusion model that jointly models the three-variable distribution of humans (H), objects (O), and interactions (I). A single network covers 7 conditional generation modes, outperforming dedicated unidirectional baselines across all settings.
- UDC-VIT: A Real-World Video Dataset for Under-Display Cameras
-
This paper presents UDC-VIT, the first real-world video dataset for under-display cameras (UDC), comprising 647 video clips with 116,460 frames in total. A carefully designed dual-camera beam-splitter acquisition system achieves precise spatiotemporal alignment. With face recognition as the primary application scenario, the dataset reveals the inadequacy of synthetic datasets in simulating real-world UDC degradation.
- Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning
-
This paper proposes the first weakly supervised paradigm for visible-infrared person re-identification (VIReID), which relies solely on intra-modality identity annotations (without cross-modal correspondence labels). A heterogeneous expert collaborative consistency learning framework is introduced to establish cross-modal identity correspondences, achieving performance close to fully supervised methods.
- What's Making That Sound Right Now? Video-centric Audio-Visual Localization
-
This paper proposes AVATAR, a video-level audio-visual localization benchmark, and TAVLO, a temporally-aware model that addresses the neglect of temporal dynamics in conventional AVL methods through high-resolution temporal modeling.
📦 Model Compression¶
- A Good Teacher Adapts Their Knowledge for Distillation
-
This paper identifies the root cause of the teacher–student capacity gap in knowledge distillation as intra-class distribution mismatch in the output distributions, and proposes AID (Adapted Intra-class Distribution), a method that fine-tunes the teacher model prior to distillation to align its intra-class distribution with the student's learning capacity, achieving state-of-the-art performance across diverse architecture combinations.
- Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning
-
This paper proposes APT (Additive Prompt Tuning), which replaces the conventional prompt concatenation paradigm with an additive operation. By introducing only two learnable vectors added to the key/value projections of the CLS token, APT achieves state-of-the-art class-incremental learning performance while substantially reducing computational overhead (41.5% reduction in GFLOPs) and trainable parameters (78.2% reduction).
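One plausible reading of the additive mechanism: two learned vectors are added to the keys and values that the CLS query attends over, so sequence length (and attention cost) never grows, unlike prompt concatenation. A single-head numpy sketch under that assumption:

```python
import numpy as np

def additive_prompt_attention(q_cls, K, V, p_k, p_v):
    """APT-style additive prompting: shift every key/value by learned
    vectors p_k/p_v instead of concatenating prompt tokens, then let the
    CLS query attend as usual."""
    K_p, V_p = K + p_k, V + p_v                    # broadcast over tokens
    logits = q_cls @ K_p.T / np.sqrt(K.shape[-1])
    attn = np.exp(logits - logits.max())           # stable softmax
    attn /= attn.sum()
    return attn @ V_p

# Toy usage: 197 tokens of width 64; p_k/p_v are the only new parameters.
rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(64), rng.standard_normal((197, 64)), rng.standard_normal((197, 64))
p_k, p_v = 0.01 * rng.standard_normal(64), 0.01 * rng.standard_normal(64)
print(additive_prompt_attention(q, K, V, p_k, p_v).shape)
```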
- ARGMatch: Adaptive Refinement Gathering for Efficient Dense Matching
-
This paper proposes an Adaptive Refinement Gathering pipeline comprising three modules—a content-aware offset estimator, a local consistency matching corrector, and a local consistency upsampler—augmented with an adaptive gating mechanism. The approach substantially reduces reliance on heavyweight feature extractors and global matchers, achieving performance comparable to state-of-the-art methods with a lightweight model.
- B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens
-
This paper proposes B-VLLM, a framework that dynamically balances spatio-temporal tokens within the VLLM context window budget through three modules: text-conditioned adaptive frame selection, temporal frame token merging, and spatial token sampling. It addresses the dilemma between uniform sampling (which neglects temporal dynamics) and per-frame token reduction (which loses spatial detail), achieving a 10% performance improvement on MVBench.
- Beyond Low-Rank Tuning: Model Prior-Guided Rank Allocation for Effective Transfer in Low-Data and Large-Gap Regimes
-
This paper proposes SR-LoRA (Stable Rank-Guided LoRA), which leverages the stable rank of pretrained weight matrices as a natural prior to assign optimal per-layer ranks for LoRA modules. Without any search procedure, SR-LoRA achieves flexible layer-wise rank allocation and significantly outperforms fixed low-rank LoRA and other adaptive-rank methods in large-domain-gap and few-shot transfer scenarios such as medical imaging.
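Stable rank is cheap to read off a weight matrix's singular values, which is what makes it usable as a search-free prior. The proportional allocation rule below is an assumed reading of the note, not necessarily the paper's exact procedure:

```python
import numpy as np

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2: a smooth, noise-robust proxy
    for matrix rank (always <= the true rank)."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / (s[0] ** 2))

def allocate_ranks(weight_matrices, budget):
    """Split a total LoRA rank budget across layers proportionally to
    each layer's stable rank (rounding makes the sum approximate)."""
    sr = np.array([stable_rank(W) for W in weight_matrices])
    return np.maximum(1, np.round(budget * sr / sr.sum())).astype(int)

# Toy usage: 4 pretrained layers sharing a budget of 64 ranks.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((256, 256)) for _ in range(4)]
print(allocate_ranks(layers, budget=64))
```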
- Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
-
This paper proposes TokenBridge, which converts continuous tokens into discrete tokens by applying post-training dimension-wise quantization to pre-trained continuous VAE features. The approach preserves the high-fidelity representation capability of continuous tokens while enabling straightforward autoregressive modeling with standard cross-entropy loss, achieving generation quality on ImageNet 256×256 comparable to continuous methods.
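Dimension-wise post-training quantization can be sketched directly: each latent channel is binned independently, so one continuous token becomes a short tuple of discrete codes that an autoregressive model can predict with cross-entropy. The bin count and clipping range below are assumptions:

```python
import numpy as np

def dimwise_quantize(z, n_bins=16, lo=-3.0, hi=3.0):
    """Post-hoc scalar quantization per latent channel: returns discrete
    bin indices (the autoregressive targets) and their dequantized
    bin-center values (fed back to the frozen VAE decoder)."""
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.clip(np.digitize(z, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return idx, centers[idx]

# Toy usage: 256 tokens x 16 channels -> 256 x 16 codes in [0, 15].
codes, z_hat = dimwise_quantize(np.random.default_rng(0).standard_normal((256, 16)))
print(codes.shape, codes.min(), codes.max())
```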
- CIARD: Cyclic Iterative Adversarial Robustness Distillation
-
This paper proposes CIARD, which addresses the optimization objective conflict between the clean teacher and robust teacher in dual-teacher ARD frameworks via a Contrastive Push Loss, and introduces an Iterative Teacher Training (ITT) strategy to continuously update the robust teacher and prevent performance degradation. CIARD simultaneously improves adversarial robustness by +3.53% and clean accuracy by +5.87% on CIFAR-10/100 and Tiny-ImageNet.
- Color Matching Using Hypernetwork-Based Kolmogorov-Arnold Networks (cmKAN)
-
This paper proposes cmKAN, a hypernetwork-driven Kolmogorov-Arnold Network for color matching. A generator predicts spatially varying KAN spline parameters, supporting three scenarios (supervised / unsupervised / pairwise optimization) and three tasks (raw-to-raw / raw-to-sRGB / sRGB-to-sRGB). cmKAN outperforms existing methods by an average of 37.3% across all tasks while remaining extremely lightweight (76.4K parameters).
- Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement
-
This paper proposes CSCI, a method that introduces a Color token to learn color representations (Color See) and employs a novel S2A self-attention mechanism to disentangle color information from ReID features (Color Ignore), effectively eliminating appearance bias in clothes-changing person re-identification without requiring any external annotations.
- Competitive Distillation: A Simple Learning Strategy for Improving Visual Classification
-
This paper proposes a competitive distillation strategy in which, during multi-network joint training, the best-performing network is dynamically selected as the teacher at each iteration. Combined with a stochastic perturbation mechanism that introduces mutation operations analogous to genetic algorithms, the approach achieves significant improvements in visual classification performance.
- Context Guided Transformer Entropy Modeling for Video Compression
-
This paper proposes the Context Guided Transformer (CGT) conditional entropy model, which reduces entropy modeling time by approximately 65% while achieving an 11% BD-Rate improvement in video compression. This is accomplished via a Temporal Context Resampler that reduces computational overhead and a Dependency-Weighted Spatial Context Assigner that explicitly models spatial dependencies.
- Cross-Architecture Distillation Made Simple with Redundancy Suppression
-
This paper proposes RSD (Redundancy Suppression Distillation), which extracts architecture-agnostic knowledge via cross-architecture invariance maximization and feature decorrelation. Using a single simple RSD loss and a lightweight MLP decoupling module, RSD substantially outperforms OFA—the pioneering cross-architecture distillation method—on both CIFAR-100 and ImageNet-1k, while incurring only a fraction of OFA's parameter overhead.
- Dataset Distillation via the Wasserstein Metric
-
This paper proposes WMDD (Wasserstein Metric-based Dataset Distillation), which replaces MMD with Wasserstein barycenters for distribution matching and incorporates per-class BatchNorm regularization, achieving state-of-the-art dataset distillation performance on large-scale benchmarks including ImageNet-1K.
- DLF: Extreme Image Compression with Dual-generative Latent Fusion
-
This paper proposes the Dual-generative Latent Fusion (DLF) framework, which decomposes the image latent space into semantic and detail branches for separate compression, and eliminates inter-branch redundancy via a cross-branch interactive design. At extreme low bitrates (<0.01 bpp), DLF achieves state-of-the-art reconstruction quality with BD-Rate savings of up to 67.82% over MS-ILLM, while decoding significantly faster than diffusion-based approaches.
- DuoLoRA: Cycle-Consistent and Rank-Disentangled Content-Style Personalization
-
DuoLoRA introduces rank-dimension mask learning (ZipRank) for LoRA merging, combined with SDXL layer priors and a cycle-consistent merging loss (Constyle loss), enabling efficient content-style LoRA composition that surpasses ZipLoRA and other state-of-the-art methods across multiple benchmarks while reducing trainable parameters by 19×.
- EA-ViT: Efficient Adaptation for Elastic Vision Transformer
-
This paper proposes the first ViT framework that introduces elastic structure at the adaptation stage. Through a multi-dimensional elastic architecture, curriculum learning, and a lightweight router, a single adaptation run yields sub-models covering \(10^{26}\) configurations, consistently outperforming existing elastic methods across multiple downstream tasks.
- Efficient Adaptation of Pre-Trained Vision Transformer Underpinned by Approximation Theory
-
This paper identifies that the row/column vectors of pre-trained ViT weight matrices exhibit approximate orthogonality, whereas the projection matrices learned by LoRA/Adapter do not. The authors propose AOFT, a strategy that generates approximately orthogonal down/up projection matrices from a single learnable vector, aligning the adaptation modules with the properties of the backbone network. This reduces the generalization error bound and achieves competitive performance on FGVC and VTAB-1k with fewer parameters.
- FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning
-
FastVAR proposes a training-free post-hoc acceleration method for VAR models. Grounded in the observation that large-scale steps primarily model high-frequency textures and are robust to pruning, it selects pivotal tokens via frequency-guided scoring (PTS) to retain only high-frequency tokens during the forward pass, and restores pruned positions using cached token maps from earlier scales (CTR). Built on top of FlashAttention, FastVAR achieves an additional 2.7× speedup with less than 1% performance degradation, and for the first time enables 2K image generation in 1.5 seconds on a single RTX 3090 GPU.
- Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation
-
This paper proposes FBT (Fuse Before Transfer), which mitigates the feature gap in cross-architecture knowledge distillation (CAKD) by first fusing modules (CNN/MSA/MLP) from heterogeneous teachers and students to construct an adaptive intermediate fusion model before knowledge transfer, and replaces the conventional MSE loss with a spatial-agnostic InfoNCE loss. FBT achieves an average improvement of 8.38% on CIFAR-100 and 2.31% on ImageNet-1K.
- Gain-MLP: Improving HDR Gain Map Encoding via a Lightweight MLP
-
This paper proposes replacing traditional JPEG/HEIC compression with a lightweight 10 KB MLP network for encoding HDR gain maps. The MLP takes SDR image color and spatial coordinates \((r,g,b,x,y)\) as input and incorporates exponential residual encoding (gamma map), outperforming existing methods and traditional compression techniques across multiple HDR reconstruction metrics.
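To see how small such a network is, here is an illustrative stand-in for the per-pixel MLP (layer sizes are my assumption chosen to land near the stated 10 KB; the residual/gamma encoding is omitted):

```python
import torch
import torch.nn as nn

class GainMLP(nn.Module):
    """Tiny MLP mapping SDR color + pixel coordinates to an HDR gain.

    Input: (r, g, b, x, y) per pixel; output: scalar gain.
    Hidden width is a hypothetical choice targeting ~10 KB at fp16."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, rgbxy: torch.Tensor) -> torch.Tensor:
        return self.net(rgbxy)

model = GainMLP()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} params, about {n_params * 2 / 1024:.1f} KB at fp16")
```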
- Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations
-
This paper proposes LieRA, which leverages Lie group theory to generalize matrix-level PEFT methods (e.g., LoRA) to high-dimensional parameter spaces (e.g., convolutional kernels). By representing perturbations in the Lie algebra and mapping them back to the Lie group via the exponential map, LieRA achieves efficient fine-tuning while preserving the structural properties of the parameter space.
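A rough sketch of the algebra-to-group step as I read it (the parameterization and shapes are assumptions, not the paper's design): a low-rank perturbation lives in the Lie algebra and is mapped back through the matrix exponential before acting on the frozen weight.

```python
import torch
import torch.nn as nn

class LieAdapter(nn.Module):
    """Illustrative LieRA-style update: W' = exp(A) @ W with low-rank A.

    A = u v^T is the Lie-algebra perturbation (here simply gl(n));
    torch.matrix_exp maps it to the group, so the update is a smooth
    group action rather than an additive delta. Sketch only."""
    def __init__(self, weight: torch.Tensor, rank: int = 4):
        super().__init__()
        n = weight.shape[0]
        self.register_buffer("weight", weight)        # frozen pretrained W
        self.u = nn.Parameter(torch.zeros(n, rank))   # zero init -> exp(0)=I
        self.v = nn.Parameter(torch.randn(n, rank) * 0.01)

    def adapted_weight(self) -> torch.Tensor:
        a = self.u @ self.v.t()                       # element of the algebra
        return torch.matrix_exp(a) @ self.weight      # exponential map, then act

layer = LieAdapter(torch.randn(16, 16))
print(layer.adapted_weight().shape)  # torch.Size([16, 16])
```

The zero initialization of one factor makes the adapted weight start exactly at the pretrained matrix, mirroring LoRA's zero-init convention.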
- Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention
-
This paper identifies that ID samples exhibit consistent local gradient directions while OOD samples display chaotic gradient directions, and proposes to "short-circuit" feature coordinates exploited by spurious gradients at inference time to suppress OOD confidence. A first-order Taylor approximation is employed to avoid a second forward pass, yielding a lightweight and efficient OOD detection method.
- Heavy Labels Out! Dataset Distillation with Label Space Lightening
-
This paper proposes the HeLlO framework, which constructs a lightweight image-label projector using a CLIP pretrained model and LoRA-like low-rank knowledge transfer, reducing soft label storage in dataset distillation to 0.003% of the original while maintaining or surpassing SOTA performance.
- Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning
-
This paper proposes TUNA, a method that trains orthogonal task-specific adapters for each incremental task and merges them into a universal adapter. Combined with an entropy-based adapter selection mechanism and a dual-adapter ensemble inference strategy, TUNA achieves state-of-the-art performance in exemplar-free PTM-based class-incremental learning.
- Knowledge Distillation with Refined Logits
-
RLD refines teacher knowledge into two complementary forms — Sample Confidence and Masked Correlation — to mitigate the negative effects of teacher mispredictions without disrupting inter-class correlations. It consistently outperforms existing logit distillation methods on both CIFAR-100 and ImageNet.
- Learned Image Compression with Hierarchical Progressive Context Modeling
-
This paper proposes a Hierarchical Progressive Context Model (HPCM) that partitions the latent representation into multi-scale sub-representations and encodes them sequentially from coarse to fine. Combined with a cross-attention-based progressive context fusion mechanism across coding steps, HPCM enables more efficient long-range dependency modeling and more accurate entropy parameter estimation, achieving a better trade-off between compression performance and computational complexity.
- Local Dense Logit Relations for Enhanced Knowledge Distillation
-
This paper proposes Local Dense Relational Logit Distillation (LDRLD), which captures fine-grained inter-class relations by recursively decoupling and recombining logit knowledge, combined with an Adaptive Decay Weight (ADW) strategy that assigns higher weights to critical class pairs. LDRLD consistently outperforms existing logit distillation state-of-the-art methods on CIFAR-100, ImageNet-1K, and Tiny-ImageNet.
- MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective
-
This paper proposes MixA-Q, a mixed-precision activation quantization framework that repurposes window-level activation sparsity (originally used for pruning) as a dimension for quantization — assigning lower bit-widths to less important windows rather than skipping their computation entirely. The method achieves lossless 1.35× speedup under PTQ and lossless 1.25× speedup under QAT on COCO object detection, while exhibiting superior out-of-distribution (OOD) robustness.
- MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion
-
This paper proposes MotionFollower, which achieves video motion editing via two lightweight convolutional controllers (pose + appearance) and a consistency guidance mechanism based on score function regularization, surpassing strong baselines such as MotionEditor while reducing GPU memory consumption by approximately 80%.
- MSQ: Memory-Efficient Bit Sparsification Quantization
-
MSQ discovers mixed-precision quantization policies by computing the least significant bit (LSB) directly from weights via a RoundClamp quantizer and imposing L1 regularization to induce bit-level sparsity, without explicitly creating bit-level trainable parameters. This reduces trainable parameters by 8× and training time by 86% while maintaining competitive accuracy–compression trade-offs.
- Multi-Object Sketch Animation by Scene Decomposition and Motion Planning
-
MoSketch is the first method to address multi-object sketch animation. It integrates four modules — LLM-based scene decomposition, LLM-based motion planning, a motion refinement network, and compositional SDS — under a divide-and-conquer strategy to tackle two core challenges: object-aware motion modeling and complex motion optimization. High-quality multi-object sketch animation is achieved without any training data.
- OuroMamba: A Data-Free Quantization Framework for Vision Mamba
-
OuroMamba is the first data-free post-training quantization (PTQ) framework for Vision Mamba Models (VMMs). It generates high-quality synthetic data via enhanced implicit attention and employs a mixed-precision quantization scheme with dynamic outlier detection, significantly outperforming existing data-driven PTQ methods under W4A4 settings.
- Partial Forward Blocking: A Novel Data Pruning Paradigm for Lossless Training Acceleration
-
This paper proposes Partial Forward Blocking (PFB), which computes sample importance at shallow layers during forward propagation and prunes low-importance samples by blocking their subsequent deep-layer forward passes. On ImageNet with 40% pruning, PFB achieves a 0.5% accuracy improvement and a 33% reduction in training time.
- Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation
-
This paper proposes PAT (Perspective-Aware Teaching), a framework that addresses the view mismatch problem across heterogeneous architectures via Region-Aware Attention (RAA) and the teacher unawareness problem via Adaptive Feedback Prompting (AFP), enabling feature-level distillation to comprehensively surpass logit-level methods in heterogeneous knowledge distillation for the first time.
- PLAN: Proactive Low-Rank Allocation for Continual Learning
-
This paper proposes PLAN, a framework that proactively allocates orthogonal low-rank subspaces for each task and employs a perturbation-based strategy to minimize inter-task interference, achieving efficient and forgetting-free fine-tuning of large models in continual learning (CL) settings, establishing a new state of the art on standard CL benchmarks.
- SAMO: A Lightweight Sharpness-Aware Approach for Multi-Task Optimization with Joint Global-Local Perturbation
-
This paper proposes SAMO, a lightweight sharpness-aware multi-task optimization method that mitigates task gradient conflicts via joint global-local perturbation, while substantially reducing computational overhead through zeroth-order gradient approximation and layer-wise normalization.
- Scheduling Weight Transitions for Quantization-Aware Training
-
This paper identifies that conventional learning rate scheduling fails to control the effective step size of quantized weights in quantization-aware training (QAT), and proposes a Transition Rate (TR) scheduling technique that explicitly governs the number of discrete weight transitions via a Transition-Adaptive Learning Rate (TALR), substantially improving low-bit quantized model performance.
- Soft Separation and Distillation: Toward Global Uniformity in Federated Unsupervised Learning
-
This paper proposes the Soft Separation and Distillation (SSD) framework, which addresses insufficient inter-client representation uniformity in federated unsupervised learning through two modules — Dimension Scaling Regularization (DSR) and Projector Distillation (PD) — significantly improving global representation quality without incurring additional communication overhead.
- SSVQ: Unleashing the Potential of Vector Quantization with Sign-Splitting
-
This paper proposes Sign-Splitting Vector Quantization (SSVQ), which decouples the sign bits of weights from the codebook, introduces learnable sign parameters and an enhanced iterative freezing strategy, enabling each quantized weight to update independently along its own gradient direction during VQ fine-tuning. SSVQ significantly outperforms conventional VQ and scalar quantization under extreme compression ratios.
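The sign-splitting step itself fits in a few lines (a conceptual sketch, not the trained pipeline with iterative freezing): signs are stored as a separate 1-bit tensor while the codebook quantizes magnitudes only.

```python
import torch

def sign_split_vq(w: torch.Tensor, codebook: torch.Tensor):
    """Sign-splitting VQ sketch: assign |w| vectors to non-negative
    codewords; keep signs as separate (in the paper, learnable) bits.

    w: (N, d) weight vectors; codebook: (K, d)."""
    signs = torch.sign(w)                  # 1 bit per weight, kept exact
    idx = torch.cdist(w.abs(), codebook).argmin(dim=1)
    return signs, idx

def reconstruct(signs, idx, codebook):
    return signs * codebook[idx]

w = torch.randn(1024, 8)
codebook = torch.rand(256, 8)              # non-negative codewords
signs, idx = sign_split_vq(w, codebook)
w_hat = reconstruct(signs, idx, codebook)
print((w_hat - w).abs().mean())
```

Decoupling the sign means two weights sharing a codeword can still move in opposite directions during fine-tuning, which is the flexibility the paper exploits.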
- StolenLoRA: Exploring LoRA Extraction Attacks via Synthetic Data
-
StolenLoRA is the first work to formulate model extraction attacks targeting LoRA-adapted models. It leverages LLM-driven Stable Diffusion to synthesize high-quality training data, eliminating the need for real training data, and designs a Disagreement-based Semi-supervised Learning (DSL) strategy that maximizes information gain through selective querying. With only 10k queries, StolenLoRA achieves an attack success rate (ASR) of up to 96.60%, exposing critical security vulnerabilities in LoRA-adapted models.
- Task Vector Quantization for Memory-Efficient Model Merging
-
This paper proposes quantizing task vectors (the difference between fine-tuned and pre-trained weights) rather than the fine-tuned weights themselves. By exploiting the narrower numerical range of task vectors, the method achieves quantization down to 3-bit without accuracy loss. The paper further proposes Residual Task Vector Quantization (RTVQ), which decomposes task vectors into a shared high-precision base vector and low-precision per-task offsets, maintaining or even improving model merging performance while using only 8% of the original storage.
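The core observation, that task vectors occupy a much narrower numerical range than full weights, is easy to verify with a plain uniform quantizer (an illustrative sketch, not the paper's RTVQ pipeline):

```python
import torch

def quantize_task_vector(pretrained: torch.Tensor,
                         finetuned: torch.Tensor, bits: int = 3):
    """Quantize the task vector (finetuned - pretrained) rather than the
    finetuned weights; its narrow range tolerates very low bit-widths."""
    tv = finetuned - pretrained
    scale = tv.abs().max() / (2 ** (bits - 1) - 1)
    q = (tv / scale).round().clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.to(torch.int8), scale

def restore(pretrained, q, scale):
    return pretrained + q.float() * scale

w0 = torch.randn(4096, 4096)
w1 = w0 + 0.01 * torch.randn(4096, 4096)   # fine-tuning moves weights little
q, s = quantize_task_vector(w0, w1, bits=3)
print(f"max abs error: {(restore(w0, q, s) - w1).abs().max():.2e}")
```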
- Time-Aware Auto White Balance in Mobile Photography
-
This paper proposes a lightweight illumination estimation method (~5K parameters) that leverages contextual metadata from mobile devices (timestamps and geolocation) alongside image color information. The method achieves performance on par with or superior to much larger models on a newly collected dataset of 3,224 smartphone images, and runs in under 0.25ms on a flagship mobile DSP.
- TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning
-
This paper proposes TR-PTS, a framework that performs task-driven layer-wise parameter selection via the Fisher Information Matrix and dynamically filters/merges tokens using CLS attention scores. By tuning only 0.34%–0.60% of parameters, TR-PTS surpasses full fine-tuning by 3.40% on FGVC and 10.35% on VTAB.
- UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale
-
This paper proposes UniConvNet, which employs a three-layer Receptive Field Aggregator (RFA) composed of moderately sized convolution kernels (7×7, 9×9, 11×11) to expand the Effective Receptive Field (ERF) while preserving its Asymptotically Gaussian Distribution (AGD), achieving consistent improvements over existing CNNs and ViTs across lightweight to large-scale model regimes.
- Variance-Based Pruning for Accelerating and Compressing Trained Networks
-
This paper proposes Variance-Based Pruning (VBP), a one-shot structured pruning method that removes neurons with the smallest activation variance in MLP hidden layers and compensates their mean activations into the subsequent layer's bias. With only 10 epochs of fine-tuning, VBP recovers 99% of the original accuracy while reducing computation by 35% and parameters by 36%.
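Since the compensation step is the interesting part, here is a minimal reconstruction on a two-layer MLP (my reading of the description; names and the calibration setup are illustrative): the lowest-variance hidden units are removed and their mean activation is folded into the next layer's bias, preserving the expected output.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def variance_prune(fc1: nn.Linear, fc2: nn.Linear,
                   calib_x: torch.Tensor, keep_ratio: float = 0.64):
    """VBP sketch: drop low-variance units of fc1, compensate their
    mean activation into fc2's bias."""
    act = torch.relu(fc1(calib_x))                 # (N, hidden) calibration acts
    var, mean = act.var(dim=0), act.mean(dim=0)
    k = int(act.shape[1] * keep_ratio)
    keep = var.topk(k).indices
    keep_mask = torch.zeros(act.shape[1], dtype=torch.bool)
    keep_mask[keep] = True
    drop = torch.arange(act.shape[1])[~keep_mask]
    # Mean compensation: a removed unit contributes its average activation.
    fc2.bias += fc2.weight[:, drop] @ mean[drop]
    # Build the smaller layers.
    new_fc1 = nn.Linear(fc1.in_features, k)
    new_fc1.weight.copy_(fc1.weight[keep]); new_fc1.bias.copy_(fc1.bias[keep])
    new_fc2 = nn.Linear(k, fc2.out_features)
    new_fc2.weight.copy_(fc2.weight[:, keep]); new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

fc1, fc2 = nn.Linear(128, 256), nn.Linear(256, 64)
fc1_p, fc2_p = variance_prune(fc1, fc2, torch.randn(1000, 128))
```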
- ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
-
This paper proposes ViT-Linearizer, a cross-architecture distillation framework that transfers the "quadratic knowledge" learned by ViT self-attention into linear-complexity recurrent models (Mamba-based Adventurer) via two core mechanisms: activation matching and masked prediction. The approach achieves 84.3% accuracy on ImageNet while delivering up to 4.2× inference speedup on high-resolution tasks.
- VQ-SGen: A Vector Quantized Stroke Representation for Creative Sketch Generation
-
VQ-SGen treats each stroke as an independent entity and decouples its shape from positional information. By applying vector quantization (VQ), it constructs a compact discrete stroke codebook, and employs a cascaded autoregressive Transformer to sequentially generate semantic labels, shape codes, and position codes for each stroke. The method significantly outperforms existing approaches on the CreativeSketch dataset.
🏥 Medical Imaging¶
- AcZeroTS: Active Learning for Zero-shot Tissue Segmentation in Pathology Images
-
This work proposes AcZeroTS, a framework that integrates active learning with a VLM-based prototype-guided zero-shot segmentation model (ProZS). By simultaneously accounting for uncertainty, diversity, and the ability of selected samples to improve prototype coverage over unseen classes, the framework selects the most informative samples for annotation, achieving high-quality segmentation of both seen and unseen tissue types under minimal annotation budgets.
- Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation
-
This paper proposes ProLearn, a framework that introduces a Prototype-driven Semantic Approximation (PSA) module to fundamentally alleviate textual reliance in medical language-guided segmentation. The prototype space is initialized from a small number of image-text pairs; thereafter, both training and inference require no text input. ProLearn maintains strong performance under 1% text availability (QaTa-COV19 Dice = 0.857), with parameters 1000× fewer than LLM-based solutions and inference speed 100× faster.
- An OpenMind for 3D Medical Vision Self-Supervised Learning
-
This work releases OpenMind, the largest publicly available 3D medical imaging pre-training dataset (114k brain MRIs), and systematically compares 7+ SSL methods across two architectures — a CNN (ResEnc-L) and a Transformer (Primus-M) — on 15 downstream datasets. Key findings: MAE pre-training yields the best segmentation performance, contrastive learning excels at classification, and for the first time, a pre-trained Transformer is shown to outperform a randomly initialized CNN on select datasets.
- Beyond Brain Decoding: Visual-Semantic Reconstructions to Mental Creation Extension Based on fMRI
-
This paper proposes NeuroCreat — a multimodal brain architecture that integrates the visual and textual capabilities of LLMs — extending fMRI decoding from single-task visual stimulus reconstruction to three levels: image reconstruction + text captioning + mental creation. A Prompt Variant Alignment (PVA) module is introduced to effectively bridge the gap between low-resolution fMRI signals and high-level semantic representations.
- Boosting Vision Semantic Density with Anatomy Normality Modeling for Medical Vision-language Pre-training
-
This paper proposes ViSD-Boost, which addresses the alignment bias caused by low visual semantic density in medical vision-language pre-training (VLP). The method employs disease-level visual contrastive learning to enhance visual semantics and VQ-VAE-based anatomical normality modeling to amplify abnormality signals, achieving 84.9% AUC in zero-shot diagnosis across 54 diseases spanning 15 organs.
- COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation
-
This paper proposes COIN, a three-stage framework that addresses the critical "error-free instance absence" problem in annotation-free cell instance segmentation. The framework combines unsupervised semantic segmentation with optimal transport for pixel-level cell propagation, model–SAM consistency for instance-level confidence scoring, and confidence-guided recursive self-distillation, achieving performance on MoNuSeg and TNBC that surpasses semi-supervised and weakly supervised methods.
- Controllable Latent Space Augmentation for Digital Pathology
-
This paper proposes HistAug — a lightweight Transformer-based latent space augmentation model that simulates realistic image transformations (hue shifts, erosion, etc.) in feature space via conditional cross-attention, providing controllable and computationally efficient data augmentation for pathology MIL training at minimal overhead.
- Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography
-
This paper proposes an efficient self-supervised joint reconstruction method that parameterizes the speed of sound (SOS) as either a pixel grid or a neural field, recovering SOS and high-quality photoacoustic images by backpropagating gradients through a differentiable imaging forward model. The method surpasses the current state of the art in accuracy while achieving a 35× speedup (40 seconds vs. 23 minutes).
- CryoFastAR: Fast Cryo-EM Ab initio Reconstruction Made Easy
-
CryoFastAR is the first geometric foundation model for cryo-EM, bringing the DUSt3R-style paradigm to the domain: a ViT encoder with a cross-view attention decoder directly predicts Fourier Planar Maps from multi-view noisy particle images in a feed-forward manner for pose estimation, without iterative optimization. This enables ab initio protein 3D reconstruction 10–33× faster than traditional methods while maintaining comparable quality on both synthetic and real datasets.
- CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations
-
This paper proposes CuMPerLay, a differentiable Cubical Multiparameter Persistence (CMP) vectorization layer that decomposes CMP into multiple learnable single-parameter persistence lines. By jointly learning bifiltration functions for end-to-end training and embedding the layer into Swin Transformer, the method achieves significant improvements on medical image classification and semantic segmentation tasks, particularly in data-scarce settings.
- DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup
-
Inspired by the intuition of human inspectors "consulting a dictionary," DictAS reformulates few-shot anomaly segmentation (FSAS) as a dictionary lookup task—a query feature is deemed anomalous if it cannot be retrieved from a dictionary of normal samples. Through self-supervised training, the framework acquires class-agnostic lookup capability and achieves state-of-the-art FSAS performance and inference speed across 7 industrial and medical datasets.
- G2PDiffusion: Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion
-
This paper proposes G2PDiffusion, the first diffusion model-based cross-species genotype-to-phenotype prediction framework, which generates morphological images conditioned on evolutionary signals, namely multiple sequence alignments (MSA) and environmental context, to predict species appearance from DNA sequences.
- GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule
-
This paper proposes GDKVM, an echocardiography video segmentation architecture based on linear key-value association and the gated delta rule, achieving state-of-the-art performance on CAMUS and EchoNet-Dynamic through efficient memory management and multi-scale feature fusion while maintaining real-time inference speed.
- GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
-
GECKO is proposed as a WSI-level MIL aggregator pretraining method that requires no additional clinical data modalities. By automatically extracting interpretable concept priors from H&E WSIs and aligning them with deep features via contrastive learning, GECKO surpasses existing unimodal and multimodal pretraining methods on five classification tasks while providing pathologist-interpretable WSI-level descriptions.
- GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis
-
This paper presents GEMeX, the largest chest X-ray VQA dataset to date (151K images, 1.6M questions), which for the first time simultaneously provides textual reasoning explanations and visual region grounding across four question types, and systematically evaluates 12 representative large vision-language models.
- IDF: Iterative Dynamic Filtering Networks for Generalizable Image Denoising
-
This paper proposes Iterative Dynamic Filtering Networks (IDF), which achieves strong out-of-distribution (OOD) denoising performance using only ~0.04M parameters. By combining per-pixel dynamic kernel prediction with an adaptive iterative refinement strategy, IDF generalizes to diverse unseen noise types (Gaussian, Poisson, salt-and-pepper, Monte Carlo rendering, and real noise) while trained exclusively on single-level Gaussian noise.
- InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation
-
InsideOut extends 3D Gaussian Splatting (3DGS) beyond RGB surface modeling to simultaneously represent internal X-ray structures, achieving joint representation of RGB appearance and internal radiative structure through hierarchical fitting and an X-ray reference loss.
- Integrating Biological Knowledge for Robust Microscopy Image Profiling on De Novo Cell Lines
-
This paper proposes integrating external biological knowledge — protein–protein interaction graphs and transcriptomic features from single-cell foundation models — into microscopy image pretraining, explicitly decoupling perturbation-specific and cell-line-specific representations to improve generalization of perturbation screening on unseen (de novo) cell lines.
- M-Net: MRI Brain Tumor Sequential Segmentation Network via Mesh-Cast
-
M-Net reinterprets the spatial continuity between adjacent MRI slices as "quasi-temporal" data, and proposes the Mesh-Cast mechanism to seamlessly integrate arbitrary sequential models (LSTM, Transformer, Mamba SSM, etc.) into both channel and temporal information processing. Combined with a Two-Phase Sequential training strategy (TPS), M-Net achieves state-of-the-art segmentation performance on BraTS2019 and BraTS2023.
- MRGen: Segmentation Data Engine for Underrepresented MRI Modalities
-
To address the lack of segmentation annotations for scarce MRI modalities, this work constructs a large-scale radiological image dataset MRGen-DB (~250K slices, 100+ modalities) and trains a controllable diffusion-based data engine MRGen. Using dual-condition control via text prompts and segmentation masks, MRGen generates high-quality MR images in target modalities for training segmentation models. Across 10 cross-modal segmentation experiments, the average DSC improves from 10%–27% to 43%–45%, enabling "zero-shot" segmentation for annotation-scarce modalities.
- MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance
-
This paper proposes MultiverSeg, a progressive interactive segmentation system in which each image annotated by the user reduces the number of interactions required for subsequent images. By incorporating previously segmented images as in-context inputs, the system improves with use. On 12 unseen datasets, it reduces click counts by 36% and scribble steps by 25% compared to ScribblePrompt.
- NEURONS: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction
-
This paper proposes NEURONS, a framework inspired by the hierarchical structure of the human visual cortex that decouples fMRI-to-video reconstruction into four sub-tasks (key object segmentation, concept recognition, scene description, and blurry video reconstruction), emulating the functional specialization of cortical regions V1/V2/V4/ITC. NEURONS substantially outperforms state-of-the-art methods in video consistency (+26.6%) and semantic accuracy (+19.1%).
- ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users
-
This paper presents ProGait—the first multi-purpose video dataset targeting transfemoral amputee prosthesis users—supporting three tasks: video object segmentation, 2D human pose estimation, and gait analysis. Baseline models are provided to demonstrate the dataset's effectiveness in improving prosthesis detection.
- Progressive Test Time Energy Adaptation for Medical Image Segmentation
-
This paper proposes a progressive test-time adaptation method based on energy-based models. A shape energy model is trained as an in-distribution/out-of-distribution discriminator; at test time, energy minimization guides the segmentation model to adapt to the target domain. The method consistently outperforms baselines across 8 public datasets covering cardiac, spinal cord, and lung segmentation tasks.
- PVChat: Personalized Video Chat with One-Shot Learning
-
This paper proposes PVChat, the first video large language model supporting personalized subject learning from a single reference video. Through a ReLU-routed Mixture-of-Heads (ReMoH) attention mechanism, a systematic data augmentation pipeline, and a progressive image-to-video training strategy, PVChat achieves identity-aware video question answering and surpasses existing state-of-the-art ViLLMs across diverse scenarios including medical, TV drama, and anime settings.
- RadGPT: Constructing 3D Image-Text Tumor Datasets
-
This paper proposes RadGPT — an anatomy-aware vision-language AI pipeline that converts radiologist-revised tumor segmentation masks into structured reports via deterministic algorithms, then adapts them into narrative-style reports using an LLM. This pipeline is used to construct AbdomenAtlas 3.0, the first large-scale public abdominal CT image-text tumor dataset (9,262 CT scans with per-voxel annotations and reports). The work demonstrates that segmentation assistance significantly improves tumor detection rates in AI-generated reports.
- Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data
-
Through a systematic study of data scaling laws on a large-scale private dataset, this work demonstrates that synthetic tumors can substantially reduce the need for real annotations (from 1,500 to 500 cases). Building on these findings, the authors construct AbdomenAtlas 2.0—the first large-scale manually annotated CT dataset with over 10,000 scans covering six organ tumor types—achieving significant improvements on both in-distribution and out-of-distribution benchmarks.
- SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications
-
This paper introduces SciVid, a benchmark comprising five interdisciplinary scientific video tasks—including animal behavior classification, tissue tracking, and weather forecasting—that systematically evaluates six categories of Video Foundation Models (ViFMs). The study finds that adapting a frozen ViFM backbone with a simple trainable readout suffices to achieve state-of-the-art performance on multiple scientific tasks, providing the first systematic evidence of the transferability of general-purpose ViFMs to scientific domains.
- SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images
-
This paper introduces PETS-5k, the largest PET segmentation dataset to date (5,731 3D whole-body PET scans, over 1.3 million 2D slices), and proposes SegAnyPET — the first 3D promptable segmentation foundation model tailored for PET imaging. Through a Cross-Prompt Confidence Learning (CPCL) strategy to handle inconsistent annotation quality, SegAnyPET substantially outperforms existing foundation models and task-specific models on both seen and unseen targets.
- Semi-supervised Deep Transfer for Regression without Domain Alignment
-
This paper proposes CRAFT (Contradistinguisher-based Regularization Approach for Flexible Training), a semi-supervised transfer learning framework that requires neither source data nor domain alignment, specifically designed for regression tasks. CRAFT jointly optimizes a supervised loss and an unsupervised Contradistinguisher-based regularization term to substantially improve prediction performance under label-scarce conditions.
- SIC: Similarity-Based Interpretable Image Classification with Neural Networks
-
This paper proposes SIC, an inherently interpretable neural network that simultaneously provides local, global, and faithful explanations. By extracting class-representative support vectors from training images and computing input-to-support-vector similarities via B-cos transformations for classification, SIC achieves accuracy comparable to black-box models while delivering pixel-level contribution maps and case-based global explanations. On the FunnyBirds benchmark, SIC outperforms ProtoPNet on 8 out of 9 interpretability metrics.
- SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality
-
This paper proposes SimMLM, a simple yet effective framework for multi-modal learning under missing modality conditions. It consists of a Dynamic Mixture of Modality Experts (DMoME) architecture and a More vs. Fewer (MoFe) ranking loss. SimMLM comprehensively outperforms state-of-the-art methods on brain tumor segmentation and multi-modal classification tasks with fewer parameters and lower computational cost, while providing interpretable modality importance estimates.
- TeethGenerator: A Two-Stage Framework for Paired Pre- and Post-Orthodontic 3D Dental Data Generation
-
This paper proposes TeethGenerator, a two-stage framework for generating paired pre- and post-orthodontic 3D dental point cloud models. Stage I employs a VQ-VAE combined with a diffusion model to generate post-treatment tooth morphology, while Stage II uses a Transformer conditioned on a style model to generate the corresponding pre-treatment dental arrangement.
- Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts
-
This paper proposes a new task and benchmark for Long-Tailed Online Anomaly Detection (LTOAD). The core innovation is replacing class-label dependency with a learnable class-agnostic concept set, combined with a Concept VQ-VAE and a comprehensive prompt learning framework. The proposed method achieves state-of-the-art performance in both offline and online settings without requiring class labels.
- UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation
-
This paper introduces UKBOB—the largest annotated medical image segmentation dataset to date (51,761 MRI 3D volumes, 72 organ classes, 1.37 billion 2D segmentation masks)—and proposes a Specialized Organ Label Filter (SOLF) for cleaning automated annotations and an Entropy Test-Time Adaptation (ETTA) method for handling domain shift under noisy labels. The resulting Swin-BOB foundation model achieves state-of-the-art performance on the BRATS and BTCV benchmarks.
- Vector Contrastive Learning for Pixel-wise Pretraining in Medical Vision
-
This paper proposes Vector Contrastive Learning (Vector CL), which reformulates standard contrastive learning from a binary optimization problem into a vector regression problem. By modeling feature distances to quantify the degree of dispersion, it addresses the over-dispersion problem in pixel-wise medical vision pretraining, achieving significant improvements over 17 methods across 8 downstream tasks.
- ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis
-
This paper proposes ViCTr, a two-stage framework that combines Rectified Flow with a Tweedie-corrected diffusion process to achieve high-fidelity pathology-aware medical image synthesis. The method reduces inference steps from 50 to 3–4 and, for the first time, enables graded-severity pathology synthesis for abdominal MRI.
- Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves
-
This paper proposes VSWE (Visual Surface Wave Elastography), a method that extracts the dispersion relation from a video of surface wave propagation and combines it with physics-based finite element optimization to infer subsurface layer thickness and stiffness. High-accuracy parameter recovery is demonstrated in both simulated and real gelatin experiments, providing a proof-of-concept for at-home health monitoring.
🖼️ Image Restoration¶
- Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis
-
To address the hardware bottleneck of polarization cameras (low light efficiency, low spatial resolution, and high noise) and the lack of dedicated datasets and noise models, this work constructs two datasets: PolarNS for noise statistics analysis and PolarBurstSR for burst super-resolution training and evaluation. It further proposes a polarimetric noise propagation analysis model and adapts five state-of-the-art burst super-resolution methods to the polarization domain, establishing a standardized evaluation framework. Results show that polarization-specific training significantly outperforms generic RGB training in reconstructing both intensity maps (s0) and the angle of linear polarization (AoLP).
- Blind2Sound: Self-Supervised Image Denoising without Residual Noise
-
This paper proposes the Blind2Sound framework, which perceives noise levels and achieves personalized denoising via an adaptive re-visible loss, complemented by a Cramer Gaussian loss that improves noise parameter estimation accuracy. The framework eliminates residual noise in self-supervised blind denoising and outperforms all contemporary self-supervised methods and even some supervised baselines.
- Blind Noisy Image Deblurring Using Residual Guidance Strategy
-
This paper proposes a Residual Guidance Strategy (RGS) for coarse-to-fine blind image deblurring within an image pyramid framework. At each scale transition, the convolution residual from the adjacent coarser scale is denoised via a guided filter and used to correct the blurred input at the current scale. This approach significantly improves kernel estimation accuracy and restoration quality under high noise levels (σ=0.1), surpassing multiple deep learning methods without requiring any training.
- Closed-Loop Transfer for Weakly-supervised Affordance Grounding
-
This paper proposes LoopTrans, a closed-loop knowledge transfer framework that unifies exocentric and egocentric image activation via a shared CAM module, refines coarse activations into precise localizations using pixel-level pseudo-masks, and feeds egocentric localization results back to enhance exocentric knowledge extraction through denoising distillation, achieving state-of-the-art performance across all metrics on AGD20K.
- Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention
-
GIGA-ToF proposes a ToF depth denoising network that fuses motion-invariant graph structures across frames. Through cross-frame graph attention and algorithm unrolling of a MAP problem, the method simultaneously improves temporal stability and spatial sharpness, demonstrating strong generalization on both synthetic and real data.
- CWNet: Causal Wavelet Network for Low-Light Image Enhancement
-
This paper proposes CWNet, a Causal Wavelet Network that models low-light image enhancement through a structural causal model (SCM), treating semantic information as causal factors and brightness/color degradation as non-causal factors, and employs a wavelet-based backbone for fine-grained frequency-domain feature restoration.
- Decouple to Reconstruct: High Quality UHD Restoration via Active Feature Disentanglement and Reversible Fusion
-
This paper proposes D²R-UHDNet, a framework that employs a Controlled Differential Disentangled VAE (CD²-VAE) to actively decompose degraded images into a degradation-dominant latent space and background-dominant features, and processes the background features via a complex-domain invertible multi-scale fusion network. The method achieves state-of-the-art performance across six UHD restoration tasks with only 1M parameters.
- Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration
-
Targeting the redundancy caused by uniform subspace allocation across heads in standard Multi-Head Attention (MHA), this paper proposes HINT, which introduces Hierarchical Multi-Head Attention (HMHA) and Query-Key Cache Updating (QKCU) to enhance inter-head diversity and interaction, achieving state-of-the-art results on 12 benchmarks across 5 image restoration tasks.
- EAMamba: Efficient All-Around Vision State Space Model for Image Restoration
-
This paper proposes EAMamba, a framework that introduces a Multi-Head Selective Scan Module (MHSSM) and an all-around scanning strategy to achieve multi-directional scanning without increasing computational complexity or parameter count. EAMamba addresses the computational overhead and local pixel forgetting issues of Vision Mamba in image restoration, achieving 31–89% FLOPs reduction while maintaining competitive performance across super-resolution, denoising, deblurring, and dehazing tasks.
- Efficient Concertormer for Image Deblurring and Beyond
-
This paper proposes Concertormer, which decomposes self-attention into a global Concertino component and a local Ripieno component, and further introduces a Cross-Dimensional Communication module and a Gated Depthwise Convolution MLP. The method achieves global-local feature modeling at linear complexity, attaining state-of-the-art performance on image deblurring and other restoration tasks.
- Emulating Self-Attention with Convolution for Efficient Image Super-Resolution
-
Motivated by the observation that features and attention maps across adjacent self-attention layers exhibit high inter-layer similarity (89%/87%), this paper proposes ConvAttn — a module composed of a shared large-kernel convolution and a dynamic convolution kernel — to replace the majority of self-attention layers. Flash Attention is introduced into lightweight SR for the first time, extending the window size to \(32 \times 32\), achieving state-of-the-art performance at minimal latency and memory cost.
- Enhancing Image Restoration Transformer via Adaptive Translation Equivariance
-
This paper systematically investigates the impact of Translation Equivariance (TE) on the convergence speed and generalization ability of image restoration networks. It proposes Sliding Key-Value Self-Attention (SkvSA), its adaptive variant (ASkvSA), and Downsampled Self-Attention (DSA), and constructs TEAFormer, which achieves state-of-the-art performance on super-resolution, deblurring, denoising, and other tasks while maintaining linear complexity.
- Exploiting Diffusion Prior for Task-driven Image Restoration
-
This paper proposes EDTR, a method that leverages diffusion model priors via a pre-restoration + partial diffusion strategy combined with short-step denoising to effectively recover task-relevant details, achieving significant gains in classification, segmentation, and detection under complex degradation scenarios.
- FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration
-
This work constructs the first million-scale real-world paired image restoration dataset covering 20 degradation types, and proposes the FoundIR framework, which combines a degradation-agnostic generalist model with degradation-aware expert models to surpass existing performance ceilings across 24 benchmarks.
- Generic Event Boundary Detection via Denoising Diffusion (DiffGEBD)
-
DiffGEBD is the first work to introduce diffusion models into Generic Event Boundary Detection (GEBD). It frames boundary prediction as an iterative denoising process from random noise to a plausible boundary distribution, leverages Classifier-Free Guidance to control prediction diversity, and proposes two new evaluation metrics—Symmetric F1 and Diversity Score—to measure quality and diversity in multi-prediction scenarios.
- IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution
-
This paper proposes IM-LUT, which achieves arbitrary-scale image super-resolution by learning to mix multiple interpolation functions, and converts the prediction network into a look-up table form to enable lightweight, fast CPU inference while maintaining reconstruction quality.
- Learning Pixel-adaptive Multi-layer Perceptrons for Real-time Image Enhancement
-
This paper proposes the BPAM framework, which combines the spatial modeling capability of bilateral grids with the nonlinear mapping power of MLPs by dynamically generating unique micro-MLP parameters for each pixel, enabling high-quality, real-time image enhancement.
- Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables
-
By decomposing 3D LUTs into linear combinations of 2D LUTs and further compressing the 2D LUTs via SVD, together with a cache-efficient spatial feature fusion structure, the proposed method achieves spatially-aware image enhancement while reducing model parameters by 84% and accelerating 4K inference by 2.8×.
- Low-Light Image Enhancement using Event-Based Illumination Estimation (RetinEV)
-
RetinEV proposes exploiting temporal-mapping events (triggered by transmittance modulation) rather than conventional motion events for illumination estimation. Combined with Retinex theory, it decomposes low-light images into illumination and reflectance components, and employs an Illumination-guided Reflectance Enhancement (IRE) module to achieve high-quality low-light image enhancement, reaching real-time inference at 35.6 FPS on 640×480 images.
- Metric Convolutions: A Unifying Theory to Adaptive Image Convolutions
-
This paper proposes a metric-geometric perspective that unifies existing adaptive convolution variants (standard, dilated, shifted, and deformable), and introduces Metric Convolution based on unit-ball sampling of an explicit Randers metric, achieving superior geometric regularization and generalization with substantially fewer parameters.
- MobileIE: An Extremely Lightweight and Effective ConvNet for Real-Time Image Enhancement on Mobile Devices
-
This paper proposes MobileIE, an extremely lightweight CNN framework with approximately 4K parameters, which achieves real-time image enhancement at over 1100 FPS on mobile devices for the first time. This is accomplished through multi-branch re-parameterizable convolution (MBRConv), a feature self-transformation (FST) module, hierarchical dual-path attention (HDPA), and an incremental weight optimization (IWO) strategy. MobileIE achieves state-of-the-art speed–performance trade-offs across three tasks: low-light enhancement, underwater enhancement, and ISP.
- MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration
-
This paper proposes MP-HSIR, a unified hyperspectral image restoration framework that integrates three modalities of guidance—spectral prompts (universal low-rank spectral patterns), text prompts, and visual prompts—to comprehensively outperform existing all-in-one methods and numerous task-specific methods across 9 HSI restoration tasks, including denoising, deblurring, super-resolution, inpainting, dehazing, and band completion.
- Outlier-Aware Post-Training Quantization for Image Super-Resolution
-
This paper proposes an outlier-aware post-training quantization method for image super-resolution. It introduces a dual-region piecewise linear quantizer to balance outlier preservation with normal activation fidelity, and incorporates a sensitivity-aware finetuning strategy that directs attention to quantization-sensitive layers. Under the W4A4 setting, the method substantially outperforms existing PTQ approaches and approaches QAT-level performance.
- PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining
-
The first point-based event camera deraining framework, leveraging 4D event cloud representation and a Multi-Scale State Space Model (MS3M) to achieve efficient deraining while preserving microsecond-level temporal precision, reaching state-of-the-art performance with only 0.26M parameters.
- Robust Adverse Weather Removal via Spectral-based Spatial Grouping (SSGformer)
-
SSGformer proposes an All-in-One adverse weather image restoration method based on spectral decomposition and grouping attention: it extracts high-frequency edge information via the Sobel operator and analyzes low-frequency degradation textures via SVD, fuses both to generate spatial grouping masks, and performs channel and spatial attention within groups to achieve robust removal of multiple weather degradations (rain, snow, haze, raindrops).
- Self-Calibrated Variance-Stabilizing Transformations for Real-World Image Denoising
-
This paper proposes Noise2VST, a framework that learns a model-free variance-stabilizing transformation (VST) via self-supervised learning, enabling off-the-shelf Gaussian denoisers to handle real-world noisy images without any additional training.
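For context, the classical fixed VST that such methods generalize is the Anscombe transform, which maps Poisson noise to approximately unit-variance Gaussian noise; Noise2VST learns the transform from the noisy image itself instead of fixing it analytically. A quick numerical check of the classical case:

```python
import numpy as np

def anscombe(x):
    """Classical Anscombe VST: Poisson(lam) -> approx. unit variance."""
    return 2.0 * np.sqrt(x + 3.0 / 8.0)

rng = np.random.default_rng(0)
for lam in [5, 20, 100]:
    x = rng.poisson(lam, size=1_000_000)
    print(f"lam={lam:3d}  raw var={x.var():7.2f}  VST var={anscombe(x).var():.3f}")
# The stabilized variance is ~1 regardless of lam, so an off-the-shelf
# Gaussian denoiser applies; Noise2VST learns this mapping when the
# real-world noise model is unknown.
```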
- Towards a Universal Image Degradation Model via Content-Degradation Disentanglement
-
This paper proposes the first universal image degradation model. Through a disentangle-by-compression approach, it separates degradation information from image content, introduces IDEN and IDA layers to handle inhomogeneous degradation, and enables cross-degradation encoding, synthesis, and transfer. The model can serve as a plug-in module to convert non-blind image restoration methods into blind ones.
- UniPhys: Unified Planner and Controller with Diffusion for Flexible Physics-Based Character Control
-
This paper proposes UniPhys, a behavior cloning framework based on diffusion models that unifies motion planning and physics-based control within a single model. By adopting the Diffusion Forcing training paradigm to address compounding prediction errors, UniPhys enables flexible multi-task physics-based character motion generation, including text-driven control, velocity control, goal reaching, and dynamic obstacle avoidance.
- UniRes: Universal Image Restoration for Complex Degradations
-
This paper proposes UniRes — a diffusion-based universal image restoration framework that acquires expert knowledge across four tasks (super-resolution, motion deblurring, defocus deblurring, and denoising) through multi-task training. At inference time, it handles arbitrary combinations of real-world complex degradations end-to-end by flexibly composing latent-space prediction weights from different tasks.
🎯 Object Detection¶
- 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection
-
This paper proposes 3D-MOOD, the first end-to-end monocular open-set 3D object detector, which lifts open-set 2D detections into 3D space via geometry-aware 3D query generation and a canonical image space design, achieving state-of-the-art performance on both the Omni3D closed-set benchmark and the Argoverse 2 / ScanNet open-set benchmarks.
- Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning
-
This paper proposes tgGBC (trim keys gradually Guided By Classification scores), a zero-shot runtime pruning method that computes key importance by element-wise multiplication of classification scores and attention maps, progressively pruning unimportant keys across layers. It achieves nearly 2× acceleration of the Transformer decoder on multiple 3D detectors with less than 1% performance degradation.
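The scoring rule is simple enough to show directly; a schematic version follows (the shapes and the exact aggregation over queries are my assumptions):

```python
import torch

def key_importance(attn: torch.Tensor, cls_scores: torch.Tensor) -> torch.Tensor:
    """tgGBC-style scoring sketch.

    attn:       (n_queries, n_keys) decoder cross-attention map.
    cls_scores: (n_queries,) max classification score per query.
    A key's importance is the score-weighted attention it receives."""
    return (cls_scores.unsqueeze(1) * attn).sum(dim=0)

def prune_keys(keys, attn, cls_scores, keep_ratio=0.5):
    imp = key_importance(attn, cls_scores)
    keep = imp.topk(int(keys.shape[0] * keep_ratio)).indices
    return keys[keep], keep

keys = torch.randn(900, 256)                    # e.g., BEV/key tokens
attn = torch.softmax(torch.randn(100, 900), dim=-1)
cls_scores = torch.rand(100)
pruned, kept_idx = prune_keys(keys, attn, cls_scores)
```

Applied progressively across decoder layers, this shrinks the key set that each attention layer must attend over, which is where the near-2× decoder speedup comes from.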
- Adversarial Attention Perturbations for Large Object Detection Transformers
-
This paper proposes AFOG (Attention-Focused Offensive Gradient), an architecture-agnostic adversarial attack method that leverages a learnable attention mechanism to concentrate perturbations on vulnerable image regions. With only 10 iterations and visually imperceptible perturbations, AFOG reduces the mAP of 12 detection Transformers by up to 37.8×, while also outperforming existing methods on CNN-based detectors.
- Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning
-
This paper proposes the AMR framework, which leverages a Splice-and-Boost data augmentation strategy and a cold-start–distillation two-stage training pipeline to substantially improve boundary awareness and semantic discriminability in video moment retrieval—without relying on any external data or pretrained models—surpassing the previous SOTA by +5% on QVHighlights.
- Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability
-
This paper proposes PCR (Prediction Consistency and Reliability), an automated evaluation method that estimates object detection model performance without human annotations. PCR analyzes the spatial consistency and confidence reliability of bounding boxes before and after NMS to estimate mAP, and constructs a corruption-based meta-dataset for more realistic and scalable evaluation.
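To give a flavor of the consistency signal, a toy proxy can compare each box surviving NMS with the cluster it suppressed (an illustrative stand-in, not PCR's actual statistics):

```python
import torch
from torchvision.ops import nms, box_iou

def consistency_proxy(boxes, scores, iou_thr=0.5):
    """Toy PCR-style signal: for each box kept by NMS, measure how tightly
    its suppressed duplicates agree with it (IoU), weighted by score."""
    keep = nms(boxes, scores, iou_thr)
    suppressed = torch.tensor([i for i in range(len(boxes))
                               if i not in set(keep.tolist())], dtype=torch.long)
    if len(suppressed) == 0:
        return torch.tensor(1.0)
    ious = box_iou(boxes[keep], boxes[suppressed])  # (kept, suppressed)
    best = ious.max(dim=0).values                   # agreement of each duplicate
    w = scores[suppressed]
    return (best * w).sum() / w.sum()

boxes = torch.tensor([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]],
                     dtype=torch.float)
scores = torch.tensor([0.9, 0.8, 0.7])
print(consistency_proxy(boxes, scores))
```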
- SGCDet: Boosting Multi-View Indoor 3D Object Detection via Adaptive 3D Volume Construction
-
This paper proposes SGCDet, which achieves efficient and accurate multi-view indoor 3D object detection without relying on ground-truth geometric supervision. A geometry- and context-aware aggregation module (3D deformable attention with multi-view attention fusion) adaptively lifts image features into the volume, and an occupancy-probability-based sparse voxel construction strategy refines the volume coarse-to-fine, surpassing existing methods while substantially reducing computational overhead.
- Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion
-
This paper leverages the image guidance strength of diffusion models to generate a continuous synthetic-to-real spectrum of data, and proposes a Diffusion Curriculum Learning (DisCL) strategy that adaptively selects synthetic data at optimal guidance levels across different training stages, effectively addressing long-tail classification and low-quality data learning challenges.
- DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion
-
DISTIL proposes a data-free trojan trigger inversion method that searches for trigger patterns in the latent space of a pretrained guided diffusion model—rather than in pixel space—and injects uniform noise regularization at each step to effectively distinguish genuine backdoor triggers from adversarial perturbations, achieving up to 7.1% accuracy improvement on BackdoorBench.
- Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection
-
This work is the first to introduce Mixture of Experts into real-time open-vocabulary object detectors. Through MoE-Tuning, it extends Grounding DINO 1.5 Edge from a dense model into a dynamic inference framework, proposing fine-grained expert decomposition and a pretrained-weight allocation strategy. Trained on only 1.56M open-source samples, the resulting model surpasses the original version trained on 20M private samples.
- EA-KD: Entropy-based Adaptive Knowledge Distillation
-
This paper proposes EA-KD, a plug-and-play knowledge distillation method based on information entropy. It dynamically reweights distillation losses by combining the entropy values of teacher and student outputs, prioritizing learning from high-entropy (high-information) samples. EA-KD consistently improves multiple KD frameworks across image classification, object detection, and LLM distillation tasks with negligible computational overhead.
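As a rough illustration of entropy-based reweighting, the sketch below scales a per-sample KD loss by the averaged entropies of teacher and student predictions; the temperature, the exact combination of the two entropies, and the normalization are assumptions rather than the paper's formula:

```python
import torch
import torch.nn.functional as F

def ea_kd_loss(student_logits, teacher_logits, T=4.0):
    """Entropy-weighted distillation: high-entropy (high-information) samples
    contribute more to the KD loss."""
    p_t = F.softmax(teacher_logits / T, dim=-1)
    p_s = F.softmax(student_logits / T, dim=-1)
    h_t = -(p_t * p_t.clamp_min(1e-8).log()).sum(-1)   # teacher entropy per sample
    h_s = -(p_s * p_s.clamp_min(1e-8).log()).sum(-1)   # student entropy per sample
    w = 0.5 * (h_t + h_s)                              # dynamic sample weights
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1), p_t,
                  reduction="none").sum(-1) * (T * T)  # per-sample KD term
    return (w.detach() * kd).mean()
```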
- EvRT-DETR: Latent Space Adaptation of Image Detectors for Event-based Vision
-
This paper proposes I2EvDet, a framework that adapts mainstream image detectors to event-based video detection by inserting lightweight RNN temporal modules into the frozen latent space of RT-DETR, achieving state-of-the-art results of +2.3 and +1.4 mAP on the Gen1 and 1Mpx benchmarks, respectively, with minimal architectural modifications.
- Intervening in Black Box: Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding
-
This paper proposes the CBM-HNMU framework, which approximates the reasoning process of a black-box model via a Concept Bottleneck Model (CBM), automatically identifies and corrects harmful concepts, and distills the corrected knowledge back into the black-box model, enabling systematic model intervention and accuracy improvement beyond the sample level.
- Large-scale Pre-training for Grounded Video Caption Generation
-
This paper proposes the GROVE model along with a large-scale automatic annotation pipeline, constructing the HowToGround1M pre-training dataset (1M videos) and the manually annotated iGround dataset (3,513 videos). GROVE jointly performs video caption generation and multi-object spatio-temporal bounding box localization, achieving state-of-the-art results on iGround, VidSTG, ActivityNet-Entities, and other benchmarks.
- LMM-Det: Make Large Multimodal Models Excel in Object Detection
-
This paper proposes LMM-Det, which through systematic analysis identifies low recall as the core bottleneck of large multimodal models (LMMs) in object detection. By applying data distribution adjustment (pseudo-label augmentation) and inference optimization (per-category detection), LMM-Det improves COCO AP from 0.2 to 47.5 without any additional specialized detection modules.
- Measuring the Impact of Rotation Equivariance on Aerial Object Detection
-
This paper proposes MessDet, a rotation-equivariant aerial object detector that achieves strict rotation equivariance through a novel downsampling procedure, and introduces rotation-equivariant channel attention (RE-CA) and a multi-branch detection head, attaining state-of-the-art performance on DOTA and other benchmarks with significantly fewer parameters.
- OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images
-
OpenRSD is a general-purpose open-prompt object detection framework for remote sensing that supports both text and image multimodal prompts. It integrates an alignment head and a fusion head to balance speed and accuracy, employs a three-stage training pipeline, and is trained on the ORSD+ dataset comprising 470K images. OpenRSD achieves state-of-the-art average performance across seven public benchmarks while maintaining real-time inference at 20.8 FPS.
- Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights
-
This paper systematically revisits 11 adversarial patch defense methods, establishes the first patch defense benchmark covering 13 attacks, 11 detectors, and 4 metrics, constructs a large-scale APDE dataset of 94,000 images, and reveals three key insights: the difficulty of defending against natural adversarial patches stems from data distribution rather than high-frequency components; patch detection accuracy is inconsistent with defense performance; and adaptive attacks can circumvent most existing defenses.
- SFUOD: Source-Free Unknown Object Detection
-
This paper introduces a novel Source-Free Unknown Object Detection (SFUOD) setting and proposes the CollaPAUL framework, which simultaneously detects known and unknown objects without access to source data by combining collaborative tuning to fuse source- and target-domain knowledge with a principal-axis-based pseudo-label assignment strategy for unknown objects.
- Sim-DETR: Unlock DETR for Temporal Sentence Grounding
-
This paper systematically analyzes the root causes of anomalous behavior in DETR-based temporal sentence grounding (TSG) — inter-query conflict and intra-query global-local contradiction — and proposes two simple decoder modifications (Query Grouping & Ranking + Global-Local Bridging) to form Sim-DETR, unlocking the full potential of DETR for TSG.
- The Devil is in the Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning
-
This paper is the first to identify spurious correlations between text queries and background frames as the fundamental bottleneck in moment retrieval performance. It proposes TD-DETR, a framework that mitigates this issue via two strategies: dynamic context video synthesis and text-dynamics interaction enhancement, achieving state-of-the-art results on QVHighlights and Charades-STA.
- Uncertainty-Aware Gradient Stabilization for Small Object Detection
-
This paper identifies gradient instability caused by steep loss curvature in traditional localization methods when applied to small objects, and proposes UGS (Uncertainty-aware Gradient Stabilization), a framework comprising three components — classification-based localization, uncertainty minimization, and uncertainty-guided refinement — to stabilize gradients and significantly improve small object detection performance.
- UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement
-
This paper proposes the UPRE framework, which jointly optimizes Multi-view Domain Prompts (MDP) and Unified Representation Enhancement (URE) to simultaneously alleviate detection bias and domain bias in zero-shot domain adaptive object detection, achieving state-of-the-art performance across nine datasets spanning three scenario types: adverse weather, cross-city, and virtual-to-real.
- VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
-
VisRL is the first framework to apply reinforcement learning to intention-driven visual perception. Through iterative DPO training, it enables large multimodal models (LMMs) to autonomously select focus regions (by predicting bounding boxes) according to query intent, achieving superior visual reasoning over SFT without requiring costly intermediate bounding box annotations.
- Visual-RFT: Visual Reinforcement Fine-Tuning
-
Visual-RFT extends the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm from DeepSeek R1—originally applied to mathematics and code—to visual perception tasks. It introduces task-specific verifiable reward functions, including an IoU reward for object detection and a CLS reward for classification, achieving substantial improvements over SFT on fine-grained classification, few-shot detection, and grounded reasoning with only a fraction of the training data.
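The IoU reward is simple enough to write down; here is a plain-Python sketch for a single predicted box against its ground truth (boxes as (x1, y1, x2, y2); any extra reward shaping or matching across multiple boxes is omitted):

```python
def iou_reward(pred, gt):
    """Verifiable detection reward: intersection-over-union of two boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0
```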
- Visual Modality Prompt for Adapting Vision-Language Object Detectors
-
This paper proposes ModPrompt, an encoder-decoder-based visual prompting strategy that adapts vision-language object detectors (e.g., YOLO-World, Grounding DINO) to new modalities such as infrared and depth, while preserving zero-shot detection capability.
- VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under Real Occlusions
-
This paper presents VOccl3D, a large-scale synthetic video dataset (250K frames, 400 video sequences) rendered via 3DGS, targeting 3D human pose and shape (HPS) estimation under realistic occlusion scenarios. Models fine-tuned on VOccl3D demonstrate significant performance improvements in occluded settings.
- YOLO-Count: Differentiable Object Counting for Text-to-Image Generation
-
This paper proposes YOLO-Count, a fully differentiable open-vocabulary object counting model built upon the YOLO architecture. Through an innovative cardinality map regression target and a hybrid strong-weak supervised training strategy, YOLO-Count achieves state-of-the-art performance on both general object counting and quantity-controlled text-to-image generation.
- YOLOE: Real-Time Seeing Anything
-
This paper proposes YOLOE, which unifies text prompt, visual prompt, and prompt-free open-scenario detection and segmentation within the YOLO architecture. Through three key designs — RepRTA (Re-parameterizable Region-Text Alignment), SAVPE (Semantic-Activated Visual Prompt Encoder), and LRPC (Lazy Region-Prompt Contrast) — YOLOE achieves high efficiency and strong performance, surpassing YOLO-World v2 on LVIS with 3× lower training cost.
📊 LLM Evaluation¶
- 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
-
This paper introduces 3DSRBench, the first comprehensive 3D spatial reasoning benchmark comprising 2,772 manually annotated VQA pairs across 12 question types. Through balanced data distribution and a novel FlipEval strategy, the benchmark enables robust evaluation. Results reveal that state-of-the-art LMMs—including GPT-4o and Gemini—fall far short of human performance on 3D spatial reasoning (≈52% vs. 95.7%), with substantial performance degradation under uncommon camera viewpoints.
- A Conditional Probability Framework for Compositional Zero-shot Learning
-
This paper proposes CPF, a conditional probability framework for compositional zero-shot learning (CZSL) that decomposes the compositional likelihood into an object likelihood and a conditional attribute likelihood. Through a text-enhanced object learning module and an object-guided attribute learning module, CPF explicitly models the semantic constraints and contextual dependencies between attributes and objects, achieving a 17.9% AUC improvement on UT-Zappos50K and a 5.5% Unseen Accuracy improvement on MIT-States.
- A Conditional Probability Framework for Compositional Zero-shot Learning
-
This paper proposes a Conditional Probability Framework (CPF) that decomposes the compositional recognition probability into an object likelihood \(p(o|x)\) and a conditional attribute likelihood \(p(a|o,x)\). Two dedicated modules — Text-Enhanced Object learning (TEO) and Object-Guided Attribute learning (OGA) — explicitly model attribute-object dependencies, achieving state-of-the-art performance across three CZSL benchmarks.
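In log space, the decomposition amounts to adding object log-probabilities to object-conditioned attribute log-probabilities and taking the argmax over pairs; a minimal sketch with hypothetical tensor shapes:

```python
import torch

def czsl_predict(obj_logits, attr_logits_per_obj):
    """Score attribute-object pairs as p(o|x) * p(a|o,x) for one image.

    obj_logits:          (O,)    object logits
    attr_logits_per_obj: (O, A)  attribute logits conditioned on each object
    """
    log_p_o = torch.log_softmax(obj_logits, dim=-1)            # log p(o|x)
    log_p_a = torch.log_softmax(attr_logits_per_obj, dim=-1)   # log p(a|o,x)
    joint = log_p_o[:, None] + log_p_a                         # (O, A) log-joint
    o, a = divmod(joint.argmax().item(), joint.shape[1])
    return o, a   # indices of the predicted (object, attribute) pair
```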
- A Real-world Display Inverse Rendering Dataset
-
This paper presents the first real-world inverse rendering dataset built upon an LCD display-camera system, comprising stereo polarization images of 16 objects with diverse materials captured under OLAT illumination patterns alongside high-precision geometric ground truth. A simple yet effective display inverse rendering baseline is proposed, outperforming existing inverse rendering methods.
- A Real-world Display Inverse Rendering Dataset
-
This paper presents the first real-world inverse rendering dataset (DIR) built upon an LCD display–polarization camera system, comprising polarimetric stereo images of objects with diverse reflectance properties captured under OLAT illumination, calibrated display backlight/nonlinearity, and high-quality ground-truth geometry. A simple yet effective baseline method for display-based inverse rendering is also proposed.
- BATCLIP: Bimodal Online Test-Time Adaptation for CLIP
-
This paper proposes BATCLIP, a bimodal online test-time adaptation (TTA) method for CLIP that simultaneously adapts the LayerNorm parameters of both the visual and text encoders. By introducing a projection matching loss and an inter-class separability loss to enhance vision-text feature alignment and class discriminability, BATCLIP achieves state-of-the-art performance on CIFAR-10C, CIFAR-100C, and ImageNet-C.
- Combinative Matching for Geometric Shape Assembly
-
This paper proposes Combinative Matching (CMNet), which jointly models two fundamental properties of interlocking parts — surface shape consistency and volumetric occupancy complementarity — via an equivariant network trained with three objectives: orientation alignment, shape matching, and occupancy matching, substantially reducing local ambiguity in geometric assembly.
- Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography
-
This paper proposes DMDiff, a framework that leverages the natural image priors of pretrained diffusion models. Through a positive/neutral/negative tripath multi-prompt diffusion strategy and a Spatially-Varying Degradation-Aware (SVDA) attention module, DMDiff achieves high-fidelity tunable image reconstruction for millimeter-scale metalens cameras, surpassing existing methods across multiple metrics.
- Discontinuity-aware Normal Integration for Generic Central Camera Models
-
This paper proposes a novel normal integration method that supports explicit discontinuity modeling and generic central camera models. By establishing constraints between surface normals and ray directions under a local planarity assumption, the method achieves state-of-the-art performance on standard normal integration benchmarks and, for the first time, directly handles generic central cameras such as fisheye and panoramic cameras.
- DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection
-
This paper proposes DisCoPatch, a framework that exploits the inherent bias of BatchNorm toward batch statistics in adversarial VAEs to distinguish ID from OOD samples. At inference time, multiple patches from the same image are composed into a batch to ensure distributional consistency. The method achieves state-of-the-art performance on covariate-shift OOD detection (ImageNet-1K(-C) 95.5% AUROC) and near-OOD detection (95.0% AUROC), with a model size of only 25 MB and latency an order of magnitude lower than competing methods.
- DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing
-
DISTA-Net proposes a dynamic deep unfolding network that replaces the static nonlinear transform and threshold parameters in ISTA-based sparse reconstruction with input-adaptive counterparts, constituting the first deep learning method for closely-spaced infrared small target (CSIST) unmixing. The work also establishes the first open-source ecosystem encompassing a dataset, evaluation metrics, and a toolkit.
- Few-Shot Pattern Detection via Template Matching and Regression
-
This paper proposes TMR, a method that combines classical template matching with support-conditioned bounding box regression to achieve few-shot detection of arbitrary patterns—including non-object-level patterns. The authors also introduce the RPINE dataset to cover a broader range of repetitive patterns. TMR surpasses existing FSCD methods on multiple benchmarks and demonstrates strong cross-dataset generalization.
- ForCenNet: Foreground-Centric Network for Document Image Rectification
-
This paper proposes ForCenNet, a foreground-centric document rectification network featuring three key contributions: foreground label generation, a mask-guided Transformer decoder, and a curvature consistency loss. The method requires only undistorted images for training and achieves state-of-the-art performance on four benchmarks: DocUNet, DIR300, WarpDoc, and DocReal.
- Generative Zoo
-
A scalable pipeline is proposed for synthesizing animal 3D pose and shape training data using conditional image generation models (FLUX + ControlNet), producing the million-scale GenZoo dataset. Training exclusively on synthetic data achieves state-of-the-art performance on real-world benchmarks.
- HiERO: Understanding the Hierarchy of Human Behavior Enhances Reasoning on Egocentric Videos
-
This paper proposes HiERO, a weakly supervised hierarchical graph architecture that learns the hierarchy of functional activity cues by aligning video segments with narration text. The resulting segment features encode multi-scale behavioral dependencies. HiERO substantially outperforms fully supervised methods in zero-shot evaluation on procedure learning tasks (F1 +12.5% on EgoProceL) and achieves state-of-the-art performance on video–text alignment benchmarks.
- Imbalance in Balance: Online Concept Balancing in Generation Models
-
Through carefully designed causal experiments, this work reveals that data distribution—rather than model scale or data volume—is the decisive factor for concept composition ability in diffusion models. It further proposes IMBA Loss, an online concept-level balancing loss that adaptively reweights token-level losses via the discrepancy between conditional and unconditional distributions (the IMBA distance). With only a few lines of code modification, the method significantly improves multi-concept generation capability.
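A hedged sketch of the reweighting idea: measure, per token, how far the conditional prediction departs from the unconditional one, and upweight the training loss accordingly. The L2 form of the distance and the 1 + α scaling are assumptions for illustration, not the paper's exact IMBA distance:

```python
import torch

def imba_weights(eps_cond, eps_uncond, alpha=1.0, eps=1e-8):
    """Per-token weights from the conditional/unconditional prediction gap.

    eps_cond, eps_uncond: (B, L, D) noise predictions with and without the
    text condition; returns (B, L) weights, detached from the graph.
    """
    d = (eps_cond - eps_uncond).pow(2).mean(dim=-1)    # (B, L) per-token distance
    return (1.0 + alpha * d / (d.mean() + eps)).detach()

# Usage in a diffusion step, with eps_pred / eps_true of shape (B, L, D):
#   per_token = (eps_pred - eps_true).pow(2).mean(-1)            # (B, L)
#   loss = (imba_weights(eps_cond, eps_uncond) * per_token).mean()
```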
- InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild
-
This paper proposes the InterSyn framework, which jointly models single-person and multi-person motions within a unified interleaved sequence via an Interleaved Learning strategy, combined with a Relative Coordination Refinement (REC) module, to generate more natural and coordinated human interaction motions. On the InterHuman test set, FID is reduced by 6.1% and R Precision Top-1 is improved by 2.8% compared to FreeMotion.
- Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation
-
Lay2Story introduces the task of layout-togglable story generation, constructs the Lay2Story-1M dataset of over 1 million high-resolution images, and proposes a global–subject dual-branch framework built on the DiT architecture, achieving comprehensive improvements over existing methods in consistency, semantic relevance, and aesthetic quality.
- Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues
-
This paper proposes an end-to-end neural inverse rendering framework that jointly recovers geometry, spatially-varying reflectance, and lighting parameters from multi-view images captured under varying illumination, requiring neither light source calibration nor intermediate photometric stereo cues (e.g., normal maps). The method outperforms existing multi-stage MVPS approaches.
- ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction
-
This paper presents ODP-Bench, the first comprehensive benchmark for OOD performance prediction, covering 29 OOD datasets, 10 prediction algorithms, and 1,444 pretrained models. It reveals a key finding that existing algorithms perform reasonably well on synthetic corruptions but consistently fail under natural distribution shifts.
- OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning
-
This paper introduces OmniDiff, a fine-grained image difference captioning dataset comprising 324 diverse scenes (real-world and 3D synthetic), and proposes a plug-and-play Multi-scale Differential Perception (MDP) module integrated into an MLLM to build the M3Diff model, achieving state-of-the-art performance on OmniDiff and multiple public benchmarks.
- On the Robustness Tradeoff in Fine-Tuning
-
The first systematic study of the adversarial robustness–accuracy tradeoff during fine-tuning, conducted across 231 models, 7 fine-tuning strategies, and 6 datasets. Key findings: (1) robustness rises in the early stages of fine-tuning and then declines; (2) different PEFT strategies and task complexities yield distinct Pareto frontiers; (3) OOD robustness exhibits no analogous tradeoff and instead tracks accuracy changes closely.
- PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing
-
This paper proposes PHATNet, a physics-guided haze transfer network that extends the Atmospheric Scattering Model (ASM) to latent space to disentangle and transfer haze patterns, generating domain-adaptive fine-tuning datasets that enable dehazing models to effectively adapt to unseen real-world haze scenes at test time.
- Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting
-
This paper identifies that existing CLIP few-shot classification benchmarks constitute a "partially transductive setting" due to CLIP's exposure to test datasets during pretraining. It proposes an unlearning-based inductive benchmark evaluation framework and introduces a few-shot classification method that achieves stable state-of-the-art performance under the new benchmark.
- SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting
-
This paper proposes SketchSplat, which represents 3D edges as parametric sketches (line segments + Bézier curves) and directly optimizes edge parameters via differentiable rendering by sampling Gaussian points from sketches. Combined with adaptive topology control and an improved 2D edge detector, the method achieves state-of-the-art accuracy, completeness, and compactness on CAD datasets.
- Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating
-
A practical method is proposed for estimating camera spectral sensitivity using an uncalibrated diffraction grating film. By jointly estimating spectral sensitivity and grating efficiency, accurate closed-form solutions are obtained from a single capture of a light source with known spectrum. The method significantly outperforms traditional color chart approaches at an equipment cost of under US$5.
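Stripped of the joint grating-efficiency estimation, the core recovery step is a regularized least-squares problem; a generic NumPy sketch under that simplification (the matrix of incident spectra and the ridge regularizer are assumptions):

```python
import numpy as np

def estimate_sensitivity(L, c, lam=1e-3):
    """Solve min_s ||L s - c||^2 + lam ||s||^2 in closed form.

    L: (K, W) incident spectrum for each of K measurements over W wavelengths
    c: (K,)   observed camera responses for one color channel
    """
    W = L.shape[1]
    return np.linalg.solve(L.T @ L + lam * np.eye(W), L.T @ c)
```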
- StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
-
StreamMind proposes an "event-gated LLM invocation" paradigm to replace the existing "per-frame LLM invocation" approach. By inserting a Cognition Gate network between the video encoder and the LLM, the model invokes the LLM only when query-relevant events occur. Combined with a state-space-based Event-Preserving Feature Extractor (EPFE) that keeps per-frame perception cost constant, the system achieves 100 fps streaming video processing on a single A100 GPU.
- Supercharging Floorplan Localization with Semantic Rays
-
A semantics-aware floorplan localization framework is proposed that fuses predicted semantic rays with depth rays into a structural-semantic probability volume. Combined with a coarse-to-fine refinement strategy, the method achieves 2–3× performance improvements on two standard benchmarks.
- SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition
-
SVTRv2 is proposed with three key designs — Multi-Size Resize (MSR), Feature Rearrangement Module (FRM), and Semantic Guidance Module (SGM) — enabling a CTC-based model to comprehensively outperform encoder-decoder methods across multi-scene benchmarks for the first time, while retaining inference speed advantages.
🤖 Robotics & Embodied AI¶
- Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding
-
This paper proposes AdaRPG, a framework that leverages foundation vision-language models for part-level segmentation and affordance reasoning on articulated objects, and employs GPT-4o to generate high-level control code for adaptively scheduling atomic manipulation skills, achieving cross-category zero-shot generalization in both simulation and real-world environments.
- AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation
-
This paper proposes AnyBimanual, a plug-and-play framework that transfers pretrained unimanual manipulation policies to general bimanual manipulation scenarios via a Skill Manager and a Visual Aligner, achieving significant multi-task generalization with only a small number of bimanual demonstrations.
- Beyond Losses Reweighting: Empowering Multi-Task Learning via the Generalization Perspective
-
From a generalization perspective, this paper introduces Sharpness-Aware Minimization (SAM) into multi-task learning (MTL). By decomposing each task's SAM gradient into a "low-loss direction" and a "flat direction" and aggregating them separately, the method reduces gradient conflicts and guides the model toward a jointly flat low-loss region shared across tasks.
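The decomposition itself is a projection; a small sketch on flattened per-task gradients (the aggregation rule across tasks is not shown, and the names are illustrative):

```python
import torch

def decompose_sam_grad(g, g_sam, eps=1e-12):
    """Split a task's SAM gradient into a low-loss component (its projection
    onto the plain gradient g) and a flatness component (the orthogonal rest)."""
    coef = torch.dot(g_sam, g) / (torch.dot(g, g) + eps)
    low_loss = coef * g           # still points downhill on the task loss
    flat = g_sam - low_loss       # seeks flatter regions of the landscape
    return low_loss, flat
```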
- Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations
-
This paper proposes URMMDG, a framework that constructs a cross-modal unified representation space via supervised contrastive learning and decouples class-generic information from modality/domain-specific information through mutual information minimization. This enables effective transfer of classical single-modal domain generalization methods (Mixup, JiGen, IBN-Net) to multimodal domain generalization (MMDG) settings, achieving state-of-the-art performance on the EPIC-Kitchens and HAC benchmarks.
- Certifiably Optimal Anisotropic Rotation Averaging
-
This paper proposes a novel SDP relaxation that enforces solutions to lie within the convex hull of SO(3), conv(SO(3)), achieving for the first time certifiably globally optimal rotation averaging under anisotropic cost functions. It resolves the fundamental failure of the conventional O(3) relaxation in anisotropic settings.
- CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
-
This paper proposes CombatVLA, an efficient 3B-parameter VLA model designed for combat tasks in 3D action role-playing games. Through the Action-of-Thought data format and a truncated inference strategy, CombatVLA achieves inference speeds up to 50× faster than existing VLM-based game frameworks while surpassing human players in combat success rate.
- COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation
-
This paper proposes COSMO, a low-cost VLN architecture combining selective memorization, which replaces the computationally expensive attention mechanisms in Transformers with two customized selective state space modules—Round Selective Scan (RSS, capturing global context in a single scan pass) and Cross-modal Selective State Space Module (CS3, dual-stream cross-modal interaction)—achieving navigation performance surpassing the baseline DUET with only 15.5% of its parameters and 9.3% of its FLOPs.
- DexVLG: Dexterous Vision-Language-Grasp Model at Scale
-
This paper presents DexVLG — the first large-scale vision-language-dexterous-grasp model. It introduces DexGraspNet 3.0, a dataset comprising 174K objects and 170M grasp poses with part-level semantic annotations. By combining a VLM encoder with a Flow Matching pose prediction head, DexVLG achieves over 76% zero-shot execution success in simulation and demonstrates semantically aligned dexterous grasping in the real world.
- Embodied Representation Alignment with Mirror Neurons
-
Inspired by mirror neurons, this paper aligns the intermediate representations of action understanding (observing others' behavior) and embodied execution (autonomously performing actions) into a shared latent space via contrastive learning. The work reveals a spontaneous alignment phenomenon between the two model families that correlates with task success rate, and demonstrates that explicit alignment yields improvements on action recognition (+3.3%) and robot manipulation (+3.5%).
- EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment
-
This paper proposes EvolvingGrasp, which achieves efficient evolutionary generation and human preference alignment for dexterous grasp pose synthesis via Handpose-wise Preference Optimization (HPO) and a Physics-Aware Consistency Model (PCM), attaining state-of-the-art performance on four benchmark datasets with a 30× inference speedup.
- GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
-
This paper presents GUIOdyssey, the first comprehensive dataset for cross-app GUI navigation on mobile devices (8,334 episodes, 212 apps, 1,357 app combinations), along with OdysseyAgent—a multimodal navigation agent equipped with a history resampling module that significantly improves cross-app task performance while balancing accuracy and inference efficiency.
- iManip: Skill-Incremental Learning for Robotic Manipulation
-
This paper proposes the iManip framework, which enables robots to continually acquire new manipulation skills without retraining through a temporal replay strategy and a scalable PerceiverIO architecture, while mitigating catastrophic forgetting of previously learned skills. iManip achieves an average improvement of 9.4% over conventional incremental learning baselines on RLBench.
- Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning
-
This paper proposes IMMP (Interaction-Merged Motion Planning), a two-stage strategy — Interaction-Conserving Pre-Merging (constructing a multi-metric checkpoint pool) and Interaction Transfer with Merging (task-vector-based weighted merging grouped by interaction modules) — to transfer agent behavior and interaction knowledge from diverse trajectory datasets to a target domain, effectively improving cross-domain adaptability of motion planning.
- TesserAct: Learning 4D Embodied World Models
-
TesserAct is a 4D embodied world model that trains a video generative model to jointly predict RGB, depth, and normal videos, which are subsequently converted into high-quality 4D scenes, enabling spatiotemporally consistent 3D world dynamics simulation and robot action planning.
- Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
-
This paper proposes Moto, a framework that encodes inter-frame visual motion from video into discrete sequences via unsupervised Latent Motion Tokens. A GPT-style autoregressive pre-training scheme is employed to learn motion priors, which are then transferred to real robot manipulation through a co-fine-tuning strategy. Moto achieves performance competitive with 55B-parameter models on the SIMPLER and CALVIN benchmarks using only 98M parameters.
- NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
-
This paper proposes NavMorph, an RSSM-based self-evolving world model that models continuous environment dynamics in latent space via a World-aware Navigator and a Foresight Action Planner, and introduces a Contextual Evolution Memory (CEM) for rapid online test-time adaptation.
- PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency
-
This paper proposes PacGDC, which exploits the inherent shape and position ambiguities in 2D-to-3D projection to synthesize large quantities of pseudo-geometric data—using multiple depth foundation models as scale manipulators—thereby achieving generalizable depth completion with minimal annotation cost, attaining state-of-the-art performance in both zero-shot and few-shot settings.
- PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation
-
This paper proposes PASG (Primitive-Aware Semantic Grounding), a closed-loop framework that dynamically couples low-level geometric features with high-level task semantics through automated geometric primitive extraction (keypoints, functional axes, principal axes) and VLM-driven semantic anchoring. PASG achieves near-human-annotation performance on robotic manipulation tasks, and introduces the Robocasa-PA benchmark along with the fine-tuned model Qwen2.5VL-PA.
- Rep-MTL: Unleashing the Power of Representation-Level Task Saliency for Multi-Task Learning
-
This paper proposes Rep-MTL, a multi-task optimization method grounded in representation-level task saliency. It mitigates negative transfer and explicitly promotes cross-task complementarity via entropy-regularized task-specific saliency regulation (TSR) and sample-level cross-task saliency alignment (CSA), without modifying the optimizer or network architecture.
- Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning
-
This paper proposes DTME-MTL, a framework that identifies and categorizes gradient conflicts in token space into range-space conflicts and null-space conflicts, and addresses them via Token Modulation (affine transformation) and Token Expansion (task-specific token insertion), respectively, to mitigate negative transfer in Transformer-based multi-task learning with minimal parameter overhead.
- Selective Contrastive Learning for Weakly Supervised Affordance Grounding
-
This paper proposes a selective contrastive learning approach for weakly supervised affordance grounding (WSAG). By combining prototypical contrastive learning and pixel-level contrastive learning, the method adaptively learns affordance-relevant cues at both object and part granularities, effectively preventing the model from attending to action-irrelevant salient features. The approach comprehensively outperforms competing methods that rely on stronger foundation models (GPT-4, LLaVA, etc.) on AGD20K and HICO-IIF benchmarks.
- Self-supervised Learning of Hybrid Part-aware 3D Representations of 2D Gaussians and Superquadrics
-
This paper proposes PartGS, a self-supervised part-aware 3D reconstruction framework that hybridly couples 2D Gaussian Splatting with superquadrics. Through parameter sharing and multiple regularization terms, PartGS achieves simultaneous high-quality geometric decomposition and texture reconstruction, outperforming state-of-the-art methods by 75.9% in reconstruction accuracy and 16.13 dB in PSNR on DTU, ShapeNet, and real-world scenes.
- SITE: towards Spatial Intelligence Thorough Evaluation
-
This paper presents SITE, a comprehensive spatial intelligence benchmark grounded in a tripartite cognitive-science taxonomy. It comprises 8,068 multiple-choice VQA tasks spanning 31 datasets (images and videos). Evaluation results show that the strongest VLM (GPT-4o) still lags human experts by approximately 32% on overall spatial reasoning, and VLM spatial intelligence scores are highly correlated with robotic manipulation success rates (Pearson \(r=0.902\)).
- TransiT: Transient Transformer for Non-line-of-sight Videography
-
TransiT is a novel architecture for real-time NLOS video reconstruction that achieves 64×64 resolution at 10 FPS from sparse fast-scan (16×16, 0.4 ms/point) transient measurements. The system integrates transient compression, inter-frame feature fusion, and a spatiotemporal Transformer, and further proposes an MMD-based transfer learning strategy to bridge the distribution gap between synthetic and real data.
- UnZipLoRA: Separating Content and Style from a Single Image
-
This paper proposes UnZipLoRA, a method that simultaneously trains two decoupled and compatible LoRAs (a content LoRA and a style LoRA) from a single image. Through three strategies—prompt separation, column separation, and block separation—the method achieves effective disentanglement of content and style, enabling independent manipulation and free recombination. UnZipLoRA surpasses DreamBooth-LoRA, Inspiration Tree, and B-LoRA across all user preference metrics.
- Weakly-Supervised Learning of Dense Functional Correspondences
-
This paper defines the task of Dense Functional Correspondence—establishing pixel-level dense correspondences between objects of different categories based on shared functionality (e.g., "pouring")—and proposes a weakly-supervised learning framework that distills functional and structural knowledge into a new model via VLM-based pseudo-labeling of functional parts combined with multi-view contrastive learning.
🛡️ AI Safety¶
- A Framework for Double-Blind Federated Adaptation of Foundation Models
-
This paper proposes BlindFed, a framework that achieves "double-blind" federated adaptation of foundation models through FHE-friendly architectural transformation, two-stage split learning, and privacy-enhancing strategies — keeping the model hidden from data holders and data hidden from the service provider. BlindFed achieves 94.28% accuracy on CIFAR-10, approaching LoRA's 95.92%.
- A Framework for Double-Blind Federated Adaptation of Foundation Models
-
BlindFed proposes a double-blind federated foundation model adaptation framework combining FHE-friendly architectural redesign (polynomial approximation of nonlinear operations), a two-stage split learning protocol (offline knowledge distillation + online encrypted inference), and privacy enhancements (sample permutation + random block sampling), achieving adaptation accuracy close to LoRA under the constraint that the data owner cannot observe the model and the model owner cannot observe the data.
- Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning
-
This paper proposes Active MINT (aMINT), a multi-task learning framework that jointly trains a MINT model alongside the audited model during training, enabling detection of whether specific data was used for training with over 80% accuracy — significantly outperforming existing passive MINT and membership inference attack methods.
- Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering
-
This paper proposes QUAD—a continual VQA method that stores only past task questions (without images). Through question replay and attention consistency distillation, QUAD achieves privacy preservation while outperforming methods that store full image–question–answer triplets.
- Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering
-
This paper proposes QUAD, which replays only questions from previous tasks (without storing images), combined with attention consistency distillation to preserve intra- and inter-modal attention patterns across tasks, achieving state-of-the-art performance in continual VQA under a privacy-preserving setting.
- Backdoor Attacks on Neural Networks via One-Bit Flip
-
This paper proposes SOLEFLIP, the first inference-time backdoor attack on quantized models that requires flipping only a single bit. Through an efficient algorithm for identifying exploitable weights and bit positions, along with a corresponding trigger generation procedure, SOLEFLIP achieves an average attack success rate of 98.9% with zero degradation in clean accuracy across CIFAR-10, SVHN, and ImageNet.
- Backdoor Mitigation by Distance-Driven Detoxification
-
This paper proposes Distance-Driven Detoxification (D3), which reformulates backdoor defense as a constrained optimization problem — maximizing the distance between the fine-tuned model weights and the poisoned initial weights, subject to a constraint that the clean sample loss does not exceed a threshold. This allows the model to effectively escape the "backdoor region," achieving best or second-best defense performance across 7 state-of-the-art attacks.
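The constrained objective translates naturally into a penalty-method sketch: maximize the parameter distance from the poisoned initialization while penalizing any clean-loss excess over the threshold. The hinge penalty and constants below are assumptions, not the paper's solver:

```python
import torch

def d3_objective(model, init_params, clean_loss, tau=0.1, lam=10.0):
    """Penalized form of: max ||w - w_0||^2  s.t.  L_clean(w) <= tau.

    init_params: snapshot of the (poisoned) initial weights, in the same order
    as model.parameters(); clean_loss: scalar loss tensor on clean data.
    """
    dist = sum((p - p0).pow(2).sum()
               for p, p0 in zip(model.parameters(), init_params))
    # Minimizing this pushes the weights away from w_0 while softly keeping
    # the clean loss below tau.
    return -dist + lam * torch.relu(clean_loss - tau)
```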
- Backdooring Self-Supervised Contrastive Learning by Noisy Alignment
-
This paper proposes Noisy Alignment (NA), a method that enhances backdoor attacks against self-supervised contrastive learning by explicitly suppressing noise components in poisoned images. The attack is formulated as a 2D image layout optimization problem, and theoretically optimal layout parameters are derived. NA achieves up to 45.9% improvement in ASR on ImageNet-100.
- Client2Vec: Improving Federated Learning by Distribution Shifts Aware Client Indexing
-
This paper proposes the Client2Vec mechanism, which leverages a CLIP encoder and a Distribution Shifts Aware Index Generation Network (DSA-IGN) to generate, prior to federated training, an index vector for each client that encodes both label and feature distribution information. The resulting indices are then used to improve three key stages of FL: client sampling, model aggregation, and local training.
- Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation
-
This paper proposes the Controllable Feature Whitening (CFW) framework, which eliminates linear correlations between target features and bias features via whitening transformations to mitigate model bias. The approach requires neither adversarial training nor additional regularization hyperparameters, and supports smooth interpolation between demographic parity and equalized odds through a single weighting coefficient.
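A minimal sketch of the core operation — decorrelating target features from bias features via a whitening of the bias block — with the controllability weighting omitted and shapes assumed:

```python
import torch

def decorrelate(target, bias, eps=1e-5):
    """Remove the linear component of `target` explained by `bias`.

    target: (B, Dt), bias: (B, Db) batch features; returns bias-free targets.
    """
    t = target - target.mean(0, keepdim=True)
    b = bias - bias.mean(0, keepdim=True)
    # Whiten the bias features so their covariance is the identity.
    cov = (b.T @ b) / (b.shape[0] - 1) + eps * torch.eye(b.shape[1])
    Lc = torch.linalg.cholesky(cov)
    b_w = torch.linalg.solve_triangular(Lc, b.T, upper=False).T
    # With whitened bias, the least-squares projection has a closed form.
    return t - b_w @ (b_w.T @ t) / (b.shape[0] - 1)
```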
- FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos
-
This paper proposes FakeRadar, a deepfake video detection framework that actively generates outlier samples simulating unknown forgeries in the feature space via Forgery Outlier Probing, and designs an Outlier-Guided Tri-Training strategy with three-class optimization (Real/Fake/Outlier). FakeRadar significantly outperforms existing methods on cross-dataset and cross-manipulation evaluations.
- FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields
-
This paper is the first to study federated meta-learning for Neural Fields (NFs) under private data settings. It reveals the severe privacy leakage mechanisms of existing federated meta-learning methods on neural field tasks, and proposes FedMeNF, which regularizes private information in local meta-gradients via a privacy-preserving loss function, effectively protecting client data privacy while retaining fast adaptation capability.
- FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
-
This paper proposes FedVLA — the first federated learning framework for Vision-Language-Action (VLA) models — comprising three synergistic components: Instruction-Oriented Scene-Parsing (IOSP) for task-aware feature extraction, Dual Gating Mixture-of-Experts (DGMoE) for adaptive knowledge routing, and Expert-Driven Aggregation (EDA) for effective cross-client knowledge integration, achieving task success rates comparable to centralized training while preserving data privacy.
- Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning
-
This paper proposes FedPoisonMIA, a poisoning-based membership inference attack for federated learning that maximizes angular deviation, along with a defense mechanism called Angular Trimmed-mean (ATM) that filters malicious gradients via angular distance.
- FRET: Feature Redundancy Elimination for Test Time Adaptation
-
This paper proposes Feature Redundancy Elimination (FRET) as a novel perspective for test-time adaptation (TTA), observing that embedding feature redundancy increases significantly under distribution shift. Two methods are designed: S-FRET (direct minimization of the redundancy score) and G-FRET (GCN-based attention-redundancy decomposition with bi-level optimization). G-FRET achieves state-of-the-art performance across multiple architectures and datasets.
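S-FRET's direct minimization admits a compact sketch: treat redundancy as the off-diagonal mass of the feature-dimension correlation matrix over the test batch and minimize it (the paper's exact redundancy score may differ):

```python
import torch
import torch.nn.functional as F

def s_fret_loss(feats):
    """Redundancy score of a batch of embeddings, shape (B, D): mean squared
    off-diagonal correlation between feature dimensions."""
    z = F.normalize(feats - feats.mean(0, keepdim=True), dim=0)  # unit columns
    corr = z.T @ z                                               # (D, D)
    off = corr - torch.diag(torch.diag(corr))
    return off.pow(2).mean()   # minimize during test-time adaptation
```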
- LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement
-
This paper proposes LoRA-FAIR, which introduces a server-side residual correction term \(\Delta\mathbf{B}\) to simultaneously address two fundamental challenges in federated LoRA fine-tuning — server-side aggregation bias and client-side initialization staleness — consistently outperforming existing federated fine-tuning methods on ViT and MLP-Mixer without incurring additional communication overhead.
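The aggregation bias stems from the fact that averaging A and B separately does not average the products \(\mathbf{B}_i\mathbf{A}_i\); the sketch below fits a server-side residual \(\Delta\mathbf{B}\) to close that gap (the optimizer, step count, and least-squares criterion are assumptions):

```python
import torch

def aggregate_lora(As, Bs, lr=0.1, steps=100):
    """Average client LoRA factors, then fit dB so that (B_bar + dB) @ A_bar
    matches the average client update mean_i(B_i @ A_i)."""
    A_bar = torch.stack(As).mean(0)    # (r, d_in)
    B_bar = torch.stack(Bs).mean(0)    # (d_out, r)
    target = torch.stack([B @ A for A, B in zip(As, Bs)]).mean(0)
    dB = torch.zeros_like(B_bar, requires_grad=True)
    opt = torch.optim.SGD([dB], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((B_bar + dB) @ A_bar - target).pow(2).mean().backward()
        opt.step()
    return A_bar, (B_bar + dB).detach()
```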
- Mind the Cost of Scaffold! Benign Clients May Even Become Accomplices of Backdoor Attack
-
This paper proposes BadSFL, the first backdoor attack tailored to the Scaffold federated learning algorithm. By manipulating control variates, BadSFL turns benign clients into unwitting accomplices. Combined with GAN-based data augmentation and an optimization strategy that predicts the global model's future convergence direction, BadSFL achieves backdoor persistence lasting 60+ rounds after the attack ceases in non-IID settings—three times longer than baseline methods.
- Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
-
This paper proposes SARDFQ to address semantic distortion and semantic insufficiency in data-free quantization (DFQ) of ViTs. Attention Prior Alignment (APA) guides synthetic images to match the attention patterns of real images, while Multi-Semantic Reinforcement (MSR) enriches local patch semantics. SARDFQ achieves a 15.52% Top-1 accuracy improvement on ImageNet W4A4 ViT-B.
- SpecGuard: Spectral Projection-based Advanced Invisible Watermarking
-
SpecGuard embeds watermark information into the spectral domain of high-frequency subbands obtained via wavelet decomposition (approximated through FFT-based spectral projection). The encoder employs a strength factor to enhance robustness, while the decoder applies a learnable threshold derived from Parseval's theorem for bit recovery. The method achieves high image quality (PSNR > 42 dB) alongside comprehensive robustness against distortion, regeneration, and adversarial attacks, surpassing existing SOTA methods.
- Staining and Locking Computer Vision Models without Retraining
-
This paper proposes novel algorithms for staining (watermark embedding) and locking (usage protection) of pretrained vision models without any retraining or fine-tuning. The approach directly modifies a small number of weights to implant highly selective detector neurons, provides theoretically computable false positive rate guarantees, and is validated on image classification and object detection models.
- Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment
-
This paper reveals that inverse adversarial attacks in adversarial training introduce spurious correlations by shifting model attention toward background features. The proposed DHAT method addresses this bias through two components—Debiased High-confidence Logit Regularization (DHLR) and Foreground Logit Orthogonal Enhancement (FLOE)—achieving state-of-the-art adversarial robustness on CIFAR-10/100 and ImageNet-1K.
- Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection
-
This paper proposes FakeSTormer, a fine-grained deepfake video detection framework that simultaneously models temporal and spatial vulnerability regions via multi-task learning, coupled with a Self-Blended Video (SBV) data synthesis strategy to generate high-quality forgery samples. Trained exclusively on real data, it achieves state-of-the-art generalization across multiple cross-dataset benchmarks.
🎵 Audio & Speech¶
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
-
This work extracts keyframes and text (via ASR and OCR) from YouTube instructional videos to construct a high-quality interleaved image-text "multimodal textbook" dataset for VLM pretraining, achieving substantial improvements over web-crawled interleaved datasets on knowledge-intensive and reasoning benchmarks.
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
-
This work collects 2.5 years (22,000 hours) of instructional videos from YouTube and constructs a high-quality interleaved image-text "multimodal textbook" corpus (6.5M keyframes + 0.75B text tokens) via an LLM-driven multi-level extraction and filtering pipeline. The resulting dataset significantly improves VLM pretraining on knowledge-intensive and reasoning tasks, yielding substantial gains on ScienceQA and MathVista in particular.
- Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation
-
This paper proposes Danceba, a framework comprising three core modules — Phase-based Rhythm Extraction (PRE), Temporal Gated Causal Attention (TGCA), and Parallel Mamba Motion Modeling (PMMM) — to achieve music-driven dance generation with high rhythm alignment and diversity, attaining a 48.68% improvement in FID\(_k\) and a 12% improvement in BAS on the AIST++ dataset.
- Everything is a Video: Unifying Modalities through Next-Frame Prediction
-
This paper reformulates multimodal learning tasks involving text, images, audio, and video as a unified next-frame prediction problem—rendering all inputs and outputs as sequences of 64×64 video frames—and demonstrates that a single Transformer model without any modality-specific encoders can handle cross-modal tasks, validating the radical yet feasible "everything is a video" unified representation paradigm.
- How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Objects
-
This paper proposes a material-controlled acoustic profile generation task (M-CAPA): given audio-visual observations of an indoor scene and a user-defined target material configuration, the model generates a target room impulse response (RIR) that reflects the material changes. A companion dataset, Acoustic Wonderland, is also introduced.
- Latent Swap Joint Diffusion for 2D Long-Form Latent Generation
-
This paper proposes SaFa (Swap Forward), a modality-agnostic and efficient method that replaces the averaging operation in conventional joint diffusion with two latent swap operators—Self-Loop Latent Swap and Reference-Guided Latent Swap—to address spectrum aliasing and preserve cross-view consistency, achieving significant improvements over existing methods in both long audio and panoramic image generation.
- Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry
-
This paper proposes a non-contact system based on laser speckle vibrometry that simultaneously senses micro-vibrations on the surfaces of multiple opaque containers via a 2D grid, then employs a Vibration Transformer to infer container type and hidden liquid fill level from vibration spectra — establishing "seeing inside opaque containers" as a novel computer vision task.
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
-
This paper proposes Lyra, a speech-centric and efficient omni-cognition MLLM framework. Through three key strategies—multimodal LoRA, a latent cross-modality regularizer, and a latent multi-modality extractor—Lyra achieves state-of-the-art performance across vision-language-speech modalities with less training data, and is the first to support speech inputs spanning several hours.
- Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
-
This paper proposes Lyra, a speech-centric omni-modal MLLM framework consisting of three core components — a DTW-based cross-modality regularizer, multi-modality LoRA, and a latent multi-modality extractor — along with the first 12K long-speech SFT dataset. Using only 2.7M training samples and modest compute, Lyra achieves state-of-the-art performance simultaneously on vision-language, vision-speech, and speech-language benchmarks, while supporting speech inputs of up to 2 hours in length.
- MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing
-
This paper proposes the MUG framework, which simultaneously improves segment-level and event-level prediction in weakly supervised audio-visual video parsing (AVVP) through a pseudo label-augmented cross-modal random combination data augmentation strategy and an audio-visual Mamba network.
- Understanding Co-speech Gestures in-the-wild
-
This paper proposes JEGAL — a joint gesture-audio-language tri-modal embedding space that learns co-speech gesture representations under weak supervision via a global phrase-level contrastive loss and a local gesture-word coupling loss. Three new gesture understanding tasks and benchmarks are introduced, and the method outperforms a range of baselines including large vision-language models.
- VGGSounder: Audio-Visual Evaluations for Foundation Models
-
To address the limitations of the VGGSound dataset — including missing multi-labels, category overlap, and modality misalignment — this work constructs VGGSounder, a multi-label audio-visual classification benchmark with modality-level annotations, and proposes a "modality confusion" metric to expose deficiencies in foundation models' multimodal fusion capabilities.
- Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
-
This paper proposes Zero-AVSR, a framework that transcribes speech into language-agnostic romanized text (Roman text) and then leverages an LLM to convert the Roman text into target-language graphemes, enabling zero-shot audio-visual speech recognition without any target-language speech data. The authors also construct the MARC dataset, covering 82 languages and 2,916 hours of audio-visual data.
🔬 Interpretability¶
- AIM: Amending Inherent Interpretability via Self-Supervised Masking
-
This paper proposes AIM, a top-down learnable binary masking mechanism for self-supervised spatial feature selection, built upon a feature pyramid architecture. Without requiring additional annotations, AIM guides CNNs to focus on genuinely discriminative features and suppress spurious correlations, simultaneously achieving inherent interpretability and improved OOD generalization.
- ArgoTweak: Towards Self-Updating HD Maps through Structured Priors
-
This paper proposes ArgoTweak, the first HD map dataset providing complete triplets of "prior map + current sensor data + up-to-date ground-truth map." It decomposes large-scale map modifications into element-level atomic changes via a bijective change mapping framework, and introduces interpretable evaluation metrics (mAPC/mACC). Models trained on ArgoTweak reduce the sim2real gap by more than 10× compared to synthetic-prior baselines.
- CAD-Recode: Reverse Engineering CAD Code from Point Clouds
-
This paper proposes CAD-Recode, which translates point clouds into executable Python CadQuery code to reconstruct CAD models. By leveraging a pretrained LLM (Qwen2-1.5B) as the decoder paired with a lightweight point cloud encoder, the method achieves more than 10× reduction in Chamfer Distance on three benchmarks: DeepCAD, Fusion360, and CC3D.
- CAD-Recode: Reverse Engineering CAD Code from Point Clouds
-
CAD-Recode frames 3D CAD reverse engineering as a point-cloud-to-Python-code translation task. It leverages the Python code understanding capabilities of pretrained LLMs as the decoder, combined with a lightweight point cloud projector and a million-scale procedurally generated dataset, achieving significant improvements over existing methods on multiple CAD benchmarks while enabling LLM-driven CAD editing and question answering.
- CE-FAM: Concept-Based Explanation via Fusion of Activation Maps
-
CE-FAM is a concept explanation method that trains a branch network sharing activation maps with an image classifier to mimic VLM embeddings, establishing a direct correspondence from concept prediction to concept region (a weighted sum of activation maps) to concept contribution (its effect on the classification score). The paper also introduces a novel NRA evaluation metric and surpasses existing methods on zero-shot concept reasoning.
- Granular Concept Circuits: Toward a Fine-Grained Circuit Discovery for Concept Representations
-
This paper proposes Granular Concept Circuit (GCC), a method that automatically discovers fine-grained visual circuits encoding specific concepts in deep visual models by iteratively evaluating inter-neuron functional dependency (Neuron Sensitivity Score) and semantic consistency (Semantic Flow Score). GCC is the first method capable of discovering multiple concept-level circuits within a single query.
- Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond
-
LFRD² proposes a hybrid framework that combines learnable time-fractional reaction-diffusion equations with neural networks for under-display ToF (UD-ToF) depth map restoration. The approach captures long-range memory dependencies across iterations via fractional calculus and introduces an efficient continuous convolution operator to replace discrete convolution, achieving state-of-the-art performance on UD-ToF depth restoration, ToF denoising, and depth super-resolution tasks.
- Minerva: Evaluating Complex Video Reasoning
-
This paper introduces Minerva — a manually annotated benchmark of 1,515 complex video reasoning QA pairs, each with 5 answer choices and a detailed reasoning trace, designed to evaluate the video reasoning capabilities of multimodal large language models. It further establishes a video reasoning error taxonomy (Temporal / Perceptual / Logical / Completeness) and the MiRA automated evaluation framework.
- "Principal Components" Enable A New Language of Images
-
This paper proposes Semanticist, a visual tokenization framework that embeds a provable PCA structure into the latent token space—where each subsequent token contributes decreasing, non-overlapping information—and employs a diffusion decoder to decouple the semantic-spectral entanglement effect, achieving state-of-the-art performance on both image reconstruction and autoregressive generation.
- SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning
-
This paper proposes the SVIP framework, which addresses semantic misalignment in zero-shot learning at its source by identifying and replacing semantically irrelevant image patches at the input stage with learnable embeddings initialized from attribute-level word embeddings.
- VITAL: More Understandable Feature Visualization through Distribution Alignment and Relevant Information Flow
-
This paper proposes VITAL, a feature visualization method that reframes the problem as aligning intermediate feature distributions with those of real images (rather than conventional activation maximization), and incorporates relevance scores to filter irrelevant features, producing neuron visualizations that are more interpretable to humans.
🛰️ Remote Sensing¶
- AstroLoc: Robust Space to Ground Image Localizer
-
This paper proposes AstroLoc, the first space-to-ground localization model trained on 300K manually annotated astronaut photographs. Through a query-satellite pairwise loss and an unsupervised mining technique, the model learns robust representations of Earth's surface, achieving an average improvement of 35% in Recall@1 and consistently exceeding 99% in Recall@100; it has already localized over 500K photographs in real-world deployment.
- CityNav: A Large-Scale Dataset for Real-World Aerial Navigation
-
This paper introduces CityNav, the first large-scale aerial vision-and-language navigation dataset for real-world urban environments, comprising 32,637 human demonstration trajectories covering 4.65 km². A Geo-Semantic Map (GSM) auxiliary representation is proposed and shown to significantly improve baseline navigation performance.
- GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization
-
This paper proposes GeoDistill, a framework that enhances locally discriminative feature learning via a Field-of-View (FoV) occlusion-based teacher-student self-distillation paradigm. Under weakly supervised conditions (requiring only coarse GPS annotations), it achieves robust cross-view localization with performance improvements exceeding 10%, and can be applied as a plug-and-play component to different localization frameworks.
- GeoExplorer: Active Geo-Localization with Curiosity-Driven Exploration
-
This paper proposes GeoExplorer, an active geo-localization (AGL) agent that integrates goal-directed extrinsic rewards with curiosity-driven intrinsic rewards. By jointly modeling action-state dynamics and curiosity-based exploration within a reinforcement learning framework, GeoExplorer achieves more robust UAV search strategies and demonstrates superior generalization to unseen targets and environments.
- Information-Bottleneck Driven Binary Neural Network for Change Detection
-
This paper proposes BiCD, the first binary neural network specifically designed for change detection. By introducing an auxiliary objective module guided by the Information Bottleneck (IB) principle, BiCD enhances the feature representation capability and separability of BNNs, achieving state-of-the-art performance among BNN-based methods on both street-view and remote sensing change detection benchmarks, while achieving 30× memory compression and 2.5× inference acceleration.
- Pan-Crafter: Learning Modality-Consistent Alignment for Pan-Sharpening
-
Pan-Crafter proposes a modality-consistent alignment framework that explicitly addresses cross-modal misregistration between PAN and MS images via Modality-Adaptive Reconstruction (MARs) and Cross-Modal Misalignment-aware Multi-scale Attention (CM3A), achieving state-of-the-art performance on multiple remote sensing benchmarks while running 1110× faster than diffusion-based methods.
- RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model
-
This work is the first to introduce the physical heat conduction process into a remote sensing foundation model. RS-vHeat replaces the attention mechanism with a Heat Conduction Operator (HCO) to model local region correlations in remote sensing images, achieving strong performance across 4 tasks and 10 datasets while reducing GPU memory by 84%, FLOPs by 24%, and improving throughput by 2.7× compared to the attention-based baseline.
- SkySense V2: A Unified Foundation Model for Multi-Modal Remote Sensing
-
This paper proposes SkySense V2, which employs a single unified Transformer backbone to process three remote sensing modalities — high-resolution optical, multispectral, and SAR imagery — and introduces Adaptive Patch Merging (APM), modality-specific prompt tokens, and Query-based Semantic Aggregation Contrastive Learning (QSACL) for pre-training. With only 665M parameters (vs. 1.26B in the predecessor SkySense), SkySense V2 achieves an average improvement of 1.8 points across 7 tasks on 16 datasets.
- SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images
-
This paper proposes SMARTIES, a unified sensor-agnostic foundation model for remote sensing that maps heterogeneous sensor data into a shared space via spectrum-aware projection. Combined with cross-sensor token mixing and masked reconstruction for self-supervised pre-training, SMARTIES surpasses sensor-specific models on both unimodal and multimodal tasks and generalizes to sensors unseen during pre-training.
- Towards a Unified Copernicus Foundation Model for Earth Vision
-
This work presents a unified Earth observation foundation model system covering all major Copernicus Sentinel missions, comprising the Copernicus-Pretrain dataset with 18.7 million aligned images, the Copernicus-FM model supporting arbitrary spectral and non-spectral sensors, and the Copernicus-Bench evaluation benchmark spanning 15 hierarchical downstream tasks.
- WildSAT: Learning Satellite Image Representations from Wildlife Observations
-
This paper proposes WildSAT, which leverages millions of geotagged wildlife observations from citizen science platforms to align satellite images, species locations, and textual descriptions via contrastive learning, substantially improving remote sensing representation quality and enabling zero-shot text-based retrieval.
🔄 Self-Supervised Learning¶
- A Token-level Text Image Foundation Model for Document Understanding (TokenFD/TokenVL)
-
This paper proposes TokenFD, the first token-level text image foundation model, pre-trained on 20 million images and 1.8 billion BPE token-mask pairs via token-level vision-language alignment to achieve image-as-text semantic understanding. Built upon TokenFD, TokenVL is introduced as a document understanding MLLM, achieving a score of 860 on OCRBench (highest among 8B-class models) and an average improvement of 8.8% across ten VQA benchmarks including DocVQA.
- Always Skip Attention
-
This paper theoretically demonstrates that the self-attention mechanism in Vision Transformers is inherently ill-conditioned, leading to training collapse in the absence of skip connections. It further proposes Token Graying (TG), a method that improves the condition number of input tokens to enhance ViT training stability and performance.
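The conditioning claim is easy to probe numerically. The sketch below measures the condition number \(\kappa = \sigma_{\max}/\sigma_{\min}\) of a token matrix and then flattens its spectrum to show the intended effect on \(\kappa\); this only illustrates the diagnosis, not the paper's actual Token Graying procedure:

```python
import numpy as np

def condition_number(tokens: np.ndarray) -> float:
    """kappa = sigma_max / sigma_min of a (n_tokens, dim) token matrix."""
    s = np.linalg.svd(tokens, compute_uv=False)
    return float(s[0] / s[-1])

rng = np.random.default_rng(0)
# Highly correlated tokens (one shared dominant direction) -> ill-conditioned input.
base = rng.normal(size=(1, 64))
tokens = base + 0.1 * rng.normal(size=(196, 64))
print("before:", condition_number(tokens))

# Damp the dominant singular direction to flatten the spectrum; TG achieves a
# related effect through its own token reconstruction, not this SVD edit.
u, s, vt = np.linalg.svd(tokens, full_matrices=False)
s[0] *= 0.2
print("after: ", condition_number(u @ np.diag(s) @ vt))
```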
- CObL: Toward Zero-Shot Ordinal Layering without User Prompting
-
This paper presents CObL, an architecture based on multiple frozen Stable Diffusion UNets operating in parallel, capable of inferring an occlusion-ordered object layer representation (one amodally-completed object per layer) from a single image without any user prompts or prior knowledge of object count. Trained on only a few thousand synthetic tabletop scenes, CObL generalizes zero-shot to real-world photographs.
- From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations
-
This paper theoretically analyzes how MAE learns spatial correlations in images. It derives a closed-form solution for linear MAE, reveals how masking ratio and patch size select short- or long-range spatial features, and extends the analysis to nonlinear MAE, providing theoretical guidance for hyperparameter selection in practice.
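For intuition, with a linear encoder-decoder the MAE objective reduces to multivariate least squares: predict masked patches \(x_{\text{mask}}\) from visible patches \(x_{\text{vis}}\). Stated as a sketch of the setting (the paper's derivation is richer), the textbook solution is

\[
\min_{W}\; \mathbb{E}\,\big\lVert x_{\text{mask}} - W\,x_{\text{vis}} \big\rVert_2^2
\;\;\Longrightarrow\;\;
W^\star = \Sigma_{\text{mask},\text{vis}}\,\Sigma_{\text{vis},\text{vis}}^{-1},
\]

where the \(\Sigma\) terms are patch covariances. Masking ratio and patch size determine which entries of these covariances dominate, which is how they select short- versus long-range spatial features.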
- Improving Large Vision and Language Models by Learning from a Panel of Peers
-
This paper proposes the Panel-of-Peers (PoP) learning framework, in which multiple LVLMs of comparable capability mutually generate candidate responses and score each other to construct preference data. Combined with iterative self-improvement via SimPO, PoP raises the average score across 15 benchmarks from 48% to 57% without any human-annotated data.
- LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
-
LoftUp is proposed to map low-resolution VFM features to arbitrary high resolutions via a coordinate-cross-attention architecture, with class-agnostic mask refinement and self-distillation to construct full-resolution pseudo-GT for training, achieving average improvements of 10–20% across 6 downstream tasks and nearly 50% on video object segmentation.
- Manual-PA: Learning 3D Part Assembly from Instruction Diagrams
-
This paper proposes Manual-PA, a Transformer-based instruction-guided 3D part assembly framework that infers assembly order by aligning 3D parts with instruction step diagrams via contrastive learning, then uses the learned order as soft guidance through positional encoding for 6DoF pose prediction, significantly outperforming existing methods on PartNet.
- MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning
-
MoSiC extracts long-range motion trajectories via an offline point tracker and propagates cluster assignments along the temporal dimension through an Optimal Transport (Sinkhorn-Knopp)-based clustering mechanism. This enables learning spatially and temporally consistent dense representations from video data, improving DINOv2 by 1%–6% across multiple image and video benchmarks using only video for training.
- Scaling Language-Free Visual Representation Learning
-
By training DINOv2/MAE-series models (1B–7B parameters) on MetaCLIP's 2 billion web images, this work systematically demonstrates that purely visual self-supervised learning (SSL) exhibits superior scaling behavior compared to CLIP in both model and data dimensions, surpassing CLIP on average VQA performance at 5B+ parameters—including OCR/Chart tasks conventionally assumed to require language supervision.
- To Label or Not to Label: PALM – A Predictive Model for Evaluating Sample Efficiency in Active Learning Models
-
This paper proposes PALM — a unified mathematical model that characterizes active learning trajectories using four interpretable parameters (maximum accuracy \(A_{\max}\), coverage efficiency \(\delta\), initial learning offset \(\alpha\), and scalability \(\beta\)). The model predicts complete learning curves from limited labeled data, enabling quantitative and fair comparison of active learning strategies.
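The paper's exact parameterization is not reproduced here, but the idea of fitting a four-parameter curve to a partial trajectory and extrapolating is easy to sketch. The functional form below is an assumed stand-in that gives \(A_{\max}\), \(\delta\), \(\alpha\), \(\beta\) the roles the note describes:

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a_max, delta, alpha, beta):
    # Accuracy rises from offset alpha toward ceiling a_max; delta and beta
    # control how quickly added labels pay off. Placeholder form, not PALM's.
    return a_max - (a_max - alpha) * (1.0 + n / delta) ** (-beta)

n = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)  # labelled-set sizes
acc = np.array([0.42, 0.55, 0.66, 0.74, 0.79, 0.82])         # measured accuracy

params, _ = curve_fit(learning_curve, n, acc,
                      p0=[0.9, 500.0, 0.3, 1.0],
                      bounds=(0, [1.0, 1e5, 1.0, 10.0]))
print(dict(zip(["A_max", "delta", "alpha", "beta"], np.round(params, 3))))
# Extrapolate the full curve from the limited measurements.
print("predicted acc @ 10k labels:", learning_curve(10_000, *params))
```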
- WIR3D: Visually-Informed and Geometry-Aware 3D Shape Abstraction
-
WIR3D optimizes a set of 3D Bézier curve parameters under the spatial guidance of CLIP intermediate-layer activations to faithfully represent the geometric structure and visually salient features (including texture) of 3D shapes from arbitrary viewpoints, achieving sparse yet semantically rich 3D shape abstraction.
📚 Pretraining¶
- ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
-
ACE-G decomposes a scene coordinate regressor (SCR) into a general-purpose Transformer and a scene-specific map code, and pre-trains the Transformer across tens of thousands of scenes to learn generalization from mapping images to unseen query images. This significantly improves relocalization robustness under illumination and viewpoint changes while maintaining computational efficiency.
- ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training
-
ACE-G decomposes a scene coordinate regressor into a scene-agnostic Transformer and a scene-specific map code, and achieves significant generalization gains under illumination and viewpoint variation by conducting alternating mapping/query pre-training across tens of thousands of scenes, while maintaining lightweight computational overhead.
- ConstStyle: Robust Domain Generalization with Unified Style Transformation
-
This paper proposes ConstStyle, a framework that constructs a theoretically grounded Unified Domain to which all training samples are style-aligned during training, while test samples from unseen domains are partially projected toward this unified domain at inference time, effectively reducing the domain gap and improving generalization performance.
- Dataset Ownership Verification for Pre-trained Masked Models
-
DOV4MM proposes the first dataset ownership verification method tailored for masked pre-trained models. By comparing the embedding reconstruction difficulty of seen versus unseen samples, and applying a paired t-test, the method determines whether a black-box model was pre-trained on a specific dataset. It achieves p-values well below 0.05 across 10 masked image models and 4 masked language models.
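The decision rule amounts to a one-sided paired t-test on embedding reconstruction errors. A minimal sketch with synthetic error arrays (the 0.05 threshold follows the note; array shapes and error scales are placeholders):

```python
import numpy as np
from scipy.stats import ttest_rel

def verify_ownership(err_suspect, err_unseen, alpha=0.05):
    """Flag a model if it reconstructs masked embeddings of suspect samples
    systematically more easily than matched unseen samples."""
    t, p_two_sided = ttest_rel(err_suspect, err_unseen)
    # One-sided alternative: suspect errors are smaller than unseen errors.
    p = p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2
    return p < alpha

rng = np.random.default_rng(0)
seen = rng.normal(loc=0.80, scale=0.05, size=64)    # lower reconstruction error
unseen = rng.normal(loc=0.95, scale=0.05, size=64)
print(verify_ownership(seen, unseen))               # True -> dataset likely used
```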
- ETA: Energy-based Test-time Adaptation for Depth Completion
-
This paper proposes ETA, a method that employs an energy-based model to quantify the likelihood of depth predictions belonging to the source domain distribution, and guides a pre-trained depth completion model to adapt to new environments at test time by minimizing the energy of target-domain predictions. ETA achieves average improvements of 6.94% and 10.23% over the previous state of the art on outdoor and indoor scenes, respectively.
- FlowMo: Flow to the Mode — Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
-
This paper proposes FlowMo, a Transformer-based diffusion autoencoder trained in two stages (mode-matching pretraining + mode-seeking post-training), achieving state-of-the-art performance on ImageNet-1K discrete image tokenization for the first time among diffusion autoencoders — without convolutions, adversarial losses, 2D spatially-aligned latents, or distillation from other tokenizers.
- Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution
-
This paper introduces Image Intrinsic Scale (IIS)—the maximum scaling factor at which an image exhibits its highest perceptual quality—and proposes the IISA task, constructs a dataset of 785 images with expert annotations, and presents a weak-label training strategy (WIISA) that consistently improves IIS prediction across multiple NR-IQA methods.
- Make Your Training Flexible: Towards Deployment-Efficient Video Models
-
This paper proposes Flux — a data augmentation tool that enables flexible video model training through flexible sampling grids and group-dynamic token selection, allowing a single model to operate efficiently across varying computational budgets. The paper further introduces a Token Optimization test-time paradigm that matches previous SOTA performance using only 1/4 of the tokens, saving approximately 90% of computation.
- Synchronization of Multiple Videos
-
This paper proposes Temporal Prototype Learning (TPL), a prototype-based video synchronization framework that constructs shared compact 1D representations from high-dimensional embeddings extracted by pretrained models. By learning a unified prototype sequence to anchor key action phases, TPL aligns multiple videos jointly and, for the first time, addresses the synchronization of generative AI videos.
- SynCity: Training-Free Generation of 3D Worlds
-
SynCity proposes a training- and optimization-free method for 3D world generation. Through carefully designed prompt engineering strategies, it combines a pretrained language model, a 2D image generator (Flux), and a 3D generator (TRELLIS) to autoregressively synthesize large-scale, high-quality, freely navigable 3D scenes in a tile-by-tile fashion.
🔍 Information Retrieval & RAG¶
- Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching
-
This paper proposes D2S-VSE, a two-stage training framework (dense-text pretraining + dense-to-sparse feature distillation fine-tuning) that enhances information capacity in visual-semantic embeddings, addressing the core asymmetry in information density between image and text modalities for image-text matching.
- Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching
-
This paper proposes D2S-VSE, a two-stage training framework that addresses the information density asymmetry in image-text matching. In the first stage, the model is pre-trained on LLaVA-generated dense captions to enhance information capacity; in the second stage, dense text embeddings are distilled into sparse text embeddings. The method achieves state-of-the-art performance on MS-COCO and Flickr30K.
- External Knowledge Injection for CLIP-Based Class-Incremental Learning
-
This paper proposes Engine (ExterNal knowledGe INjEction), a framework that employs dual-branch injection tuning (visual branch via data augmentation; text branch via GPT-4-generated discriminative descriptions) and post-tuning knowledge injection at inference (pairwise discriminative feature re-ranking), achieving 3–10% improvements over all CLIP-based class-incremental learning methods across 9 benchmark datasets without storing any historical samples.
- LangBridge: Interpreting Image as a Combination of Language Embeddings
-
LangBridge achieves interpretable vision-language alignment by explicitly decomposing visual features into linear combinations of LLM vocabulary embeddings, and supports pretraining-free adapter transfer across different LLMs.
- MonSTeR: a Unified Model for Motion, Scene, Text Retrieval
-
This paper proposes MonSTeR—the first tri-modal retrieval model for motion, scene, and text—which constructs a unified latent space via higher-order relationship modeling inspired by topological deep learning. By capturing intrinsic dependencies among all three modalities, MonSTeR substantially outperforms baselines that rely solely on unimodal representations across multiple retrieval tasks, and can further serve as an evaluation tool for human-scene interaction models.
- OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
-
This paper presents OHRBench—the first benchmark for evaluating the cascading impact of OCR on RAG systems. It comprises 8,561 document images across 7 domains and 8,498 QA pairs, and systematically reveals the distinct impact patterns of OCR-induced Semantic Noise and Formatting Noise on both the retrieval and generation stages.
- Representation Shift: Unifying Token Compression with FlashAttention
-
This paper proposes Representation Shift, a training-free and model-agnostic token importance metric that measures the magnitude of representational change before and after a network layer, enabling — for the first time — compatibility between token compression and FlashAttention, achieving up to 5.5× speedup on video understanding and image classification tasks.
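Because the metric needs only activations before and after a layer (never attention maps), it composes with fused kernels. A minimal sketch of the scoring and pruning step, with the keep ratio as a placeholder:

```python
import numpy as np

def representation_shift(x_in: np.ndarray, x_out: np.ndarray) -> np.ndarray:
    """Per-token importance: how much the layer changed each token."""
    return np.linalg.norm(x_out - x_in, axis=-1)

def keep_top_tokens(x_in, x_out, keep_ratio=0.25):
    scores = representation_shift(x_in, x_out)
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # most-changed tokens, original order
    return x_out[keep]

rng = np.random.default_rng(0)
x_in = rng.normal(size=(196, 64))                      # tokens entering a block
x_out = x_in + rng.normal(scale=0.1, size=x_in.shape)  # tokens leaving the block
x_out[:10] += 1.0                                      # a few tokens change a lot
print(keep_top_tokens(x_in, x_out).shape)              # (49, 64)
```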
- ViLU: Learning Vision-Language Uncertainties for Failure Prediction
-
This paper proposes ViLU, a post-hoc uncertainty quantification framework for VLM zero-shot prediction. By fusing visual embeddings, predicted text embeddings, and image-conditioned text representations via cross-attention, ViLU constructs uncertainty-aware multimodal representations that significantly outperform existing failure prediction methods across 13 classification datasets and large-scale image-text datasets.
💬 LLM / NLP¶
- Any-SSR: How Recursive Least Squares Works in Continual Learning of Large Language Models
-
This paper proposes the Analytic Subspace Routing (Any-SSR) framework, which eliminates inter-task interference by assigning each task an independent LoRA subspace, and trains a zero-forgetting analytic router via a recursive least squares (RLS) closed-form solution, enabling replay-free continual learning for LLMs.
- Any-SSR: How Recursive Least Squares Works in Continual Learning of Large Language Models
-
This paper proposes Analytic Subspace Routing (Any-SSR), which assigns an independent LoRA subspace to each new task to eliminate knowledge interference, while employing an analytic router based on a recursive least squares (RLS) closed-form solution to dynamically select subspaces. The approach provides theoretical guarantees against forgetting prior task knowledge, enabling replay-free continual learning for LLMs.
- Balancing Task-Invariant Interaction and Task-Specific Adaptation for Unified Image Fusion
-
TITA proposes a unified image fusion framework that requires no task identifier at inference. It employs an Interaction-enhanced Pixel Attention (IPA) module to explore task-invariant complementary information extraction, an Operation-based Adaptive Fusion (OAF) module to dynamically adapt to task-specific requirements, and the FAMO strategy to mitigate multi-task gradient conflicts.
- Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation
-
This paper proposes DiffBrush, the first diffusion-based method for handwritten text-line generation. Through content-decoupled style learning (column/row masking) and a multi-scale content discriminator (line/word level), DiffBrush substantially outperforms existing methods in both style imitation and content accuracy.
- FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization
-
This paper formalizes model merging as a constrained optimization problem and introduces FW-Merging, a Frank-Wolfe optimization-inspired method that iteratively selects the most relevant models and performs local merging. The approach achieves scalable and robust merging over large black-box model pools, surpassing the data-aware method AdaMerging by 8.39% when merging 20 ViT models.
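A toy rendition of the Frank-Wolfe view: checkpoints are vertices of a convex hull, and each round steps toward the vertex picked by the linear minimization oracle. The proxy objective and step schedule here are illustrative assumptions, not the paper's relevance criterion:

```python
import numpy as np

def fw_merge(candidates, loss_grad, rounds=10):
    theta = candidates[0].copy()
    for t in range(rounds):
        g = loss_grad(theta)
        # Linear minimization oracle over the hull: vertex minimizing <g, v>.
        best = min(candidates, key=lambda v: float(g @ v))
        gamma = 2.0 / (t + 2)                  # classic Frank-Wolfe step size
        theta = (1 - gamma) * theta + gamma * best
    return theta

rng = np.random.default_rng(0)
target = rng.normal(size=32)                   # stand-in for "ideal" weights
pool = [target + rng.normal(scale=s, size=32) for s in (0.2, 0.5, 1.0, 2.0)]
merged = fw_merge(pool, loss_grad=lambda th: th - target)
print(np.linalg.norm(merged - target))         # closer than most single checkpoints
```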
- ShadowHack: Hacking Shadows via Luminance-Color Divide and Conquer
-
This paper proposes the ShadowHack framework, which decomposes shadow removal into two subtasks—luminance restoration and color reconstruction. LRNet with Rectified Outreach Attention (ROA) recovers luminance and texture, followed by CRNet with cross-attention to reconstruct accurate color. The method achieves state-of-the-art performance on the ISTD+ and SRD datasets.
- VA-GPT: Aligning Effective Tokens with Video Anomaly in Large Language Models
-
This paper proposes VA-GPT, a multimodal large language model for video anomaly event understanding. Through two modules—Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG)—VA-GPT enables MLLMs to precisely align anomaly-relevant information in both spatial and temporal dimensions, achieving state-of-the-art performance on both in-domain and cross-domain anomaly detection benchmarks.
- VIM: Versatile Interactive Motion-Language Model
-
This paper proposes VIM, the first multimodal large language model capable of simultaneously understanding and generating dyadic interactive motion and text within a unified framework. Accompanied by the Inter-MT² dataset containing 82.7K multi-turn interactive motion instruction samples, VIM supports a diverse set of tasks including text-to-motion, motion-to-text, reaction generation, motion editing, and motion reasoning.
🛡️ LLM Safety¶
- Adversarial Robust Memory-Based Continual Learner
-
This paper identifies two compounding challenges when combining continual learning with adversarial training—accelerated forgetting and gradient confusion—and proposes two plug-and-play modules, Anti-Forgettable Logit Calibration (AFLC) and Robustness-Aware Experience Replay (RAER), achieving up to 8.13% improvement in adversarial robustness on Split-CIFAR10/100 and Split-Tiny-ImageNet.
- Asynchronous Event Error-Minimizing Noise for Safeguarding Event Dataset
-
This paper proposes UEvs, the first unlearnable example generation method for asynchronous event data. It introduces Event Error-Minimizing Noise (E²MN) and an adaptive projection mechanism that prevent unauthorized models from learning from event datasets while preserving utility for legitimate use.
- ChartCap: Mitigating Hallucination of Dense Chart Captioning
-
This work constructs ChartCap, a large-scale dataset of 565K real chart–caption pairs. By adopting type-specific caption schemas that exclude irrelevant information while emphasizing structure and key insights, and by introducing a reference-free Visual Consistency Score (VCS) evaluation metric, the paper effectively mitigates hallucination in VLM-based chart captioning.
- Enhancing Adversarial Transferability by Balancing Exploration and Exploitation with Gradient-Guided Sampling
-
This paper proposes Gradient-Guided Sampling (GGS), an inner-iteration sampling strategy that uses the gradient direction from the previous inner iteration to guide sampling. By striking a balance between Exploitation (attack strength / loss maxima) and Exploration (cross-model generalization / flat loss landscape), GGS significantly outperforms existing transfer attack methods across diverse architectures including CNNs, ViTs, and MLLMs.
- Forgetting Through Transforming: Enabling Federated Unlearning via Class-Aware Representation Transformation
-
This paper proposes FUCRT, a federated unlearning method based on class-aware representation transformation. Rather than directly erasing the representations of forget classes, FUCRT transforms them toward the semantically nearest retain classes, and employs dual contrastive learning to align transformation consistency across clients. The method guarantees 100% unlearning on four datasets while maintaining or even improving performance on retain classes.
- Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning
-
This paper proposes Geminio, the first gradient inversion attack (GIA) leveraging vision-language models (VLMs) to enable natural language-guided targeted reconstruction. A malicious server can specify the type of data to steal via natural language queries, precisely locating and reconstructing semantically matching private samples from large-batch gradients, without disrupting normal FL model training.
- Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
-
This paper proposes Oasis, a method that induces MLLMs to autoregressively generate high-quality multimodal instruction-following data using only an input image (without any text prompt). Combined with a fine-grained instruction quality control mechanism, synthesizing 500K samples yields an average 3.1% overall performance gain for LLaVA-NeXT, surpassing other data synthesis methods.
- Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation
-
This paper presents the first study on preventing video data from being exploited by deep trackers without authorization. It proposes a DiT-based generative framework for producing Temporal Unlearnable Examples (TUE), employing a temporal contrastive loss to induce trackers to rely on perturbation noise for temporal matching rather than learning genuine data structure. The method achieves strong transferability across models, datasets, and tasks.
📐 Optimization & Theory¶
- Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
-
This paper proposes SimVQ, a method that reparameterizes codebook vectors via a single learnable linear transformation layer (\(\bm{C}\bm{W}\)), converting the disjoint optimization of the codebook into a joint spatial optimization, thereby fundamentally resolving representation collapse in VQ models and achieving near-100% codebook utilization.
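A minimal PyTorch sketch of the reparameterization as described: the codebook stays frozen and only one shared linear layer is learned, so all codes move jointly. The auxiliary losses below follow standard VQ-VAE practice and are an assumption, not the paper's exact objective:

```python
import torch
import torch.nn as nn

class SimVQSketch(nn.Module):
    def __init__(self, num_codes=1024, dim=64):
        super().__init__()
        self.register_buffer("C", torch.randn(num_codes, dim))  # frozen codebook
        self.W = nn.Linear(dim, dim, bias=False)  # the single learnable layer

    def forward(self, z):                      # z: (batch, dim) encoder outputs
        codebook = self.W(self.C)              # effective codes: all of C, mapped jointly
        idx = torch.cdist(z, codebook).argmin(dim=1)
        q = codebook[idx]
        # Standard VQ losses: first term trains W, second keeps the encoder committed.
        loss = ((q - z.detach()) ** 2).mean() + 0.25 * ((q.detach() - z) ** 2).mean()
        return z + (q - z).detach(), idx, loss  # straight-through estimator

vq = SimVQSketch()
q, idx, loss = vq(torch.randn(8, 64))
print(q.shape, idx.shape, float(loss))
```

Because every code shares the map \(W\), gradient updates reshape the whole codebook at once rather than only the few codes that happen to be selected, which is the mechanism behind the near-100% utilization claim.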
- Class-Wise Federated Averaging for Efficient Personalization
-
cwFedAvg extends FedAvg from client-level aggregation to class-level aggregation, constructing a dedicated global model per class and combining them into a personalized local model weighted by each client's class distribution. Coupled with Weight Distribution Regularization (WDR) to strengthen the alignment between class distribution and weight norms, the method achieves substantial personalization gains under non-IID settings while maintaining the same communication overhead as FedAvg.
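The two-step aggregation is compact enough to sketch directly; the weight vectors and Dirichlet class distributions below are toy stand-ins, and the WDR regularizer is omitted:

```python
import numpy as np

def cw_fedavg(client_weights, client_class_dist):
    W = np.asarray(client_weights)           # (n_clients, d) local model weights
    D = np.asarray(client_class_dist)        # (n_clients, n_classes) label shares
    col = D / D.sum(axis=0, keepdims=True)   # normalize per class over clients
    class_models = col.T @ W                 # one global model per class
    row = D / D.sum(axis=1, keepdims=True)   # normalize per client over classes
    personalized = row @ class_models        # client-specific mixtures
    return class_models, personalized

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 10))
dist = rng.dirichlet(np.ones(3), size=4)     # non-IID class distributions
cm, pm = cw_fedavg(weights, dist)
print(cm.shape, pm.shape)                    # (3, 10) (4, 10)
```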
- Cooperative Pseudo Labeling for Unsupervised Federated Classification
-
FedCoPL is the first work to extend unsupervised federated learning (UFL) to classification tasks. It addresses CLIP's inherent bias and label shift challenges via a cooperative pseudo labeling strategy (global assignment ensuring class balance) and a partial prompt aggregation protocol (aggregating only visual prompts while keeping text prompts local).
- Federated Continual Instruction Tuning
-
This paper introduces the first Federated Continual Instruction Tuning (FCIT) benchmark, covering 2 scenarios, 4 settings, and 12 datasets, and proposes the DISCO framework, which addresses data heterogeneity and catastrophic forgetting via Dynamic Knowledge Organization (DKO) and Subspace Selective Activation (SSA).
- Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data
-
This paper proposes FED-PRIME, a federated prompt-tuning framework for multimodal settings with missing modalities. It maintains two sets of learnable prompts — inter-client and intra-client — to capture cross-client alignable missing patterns and client-specific missing patterns, respectively, and employs a clustering-alignment mechanism for server-side aggregation. FED-PRIME substantially outperforms existing baselines across diverse missing-data configurations.
- Learning Interpretable Queries for Explainable Image Classification with Information Pursuit
-
This paper parameterizes the query dictionary of Information Pursuit (IP) as learnable vectors in the CLIP semantic embedding space, and learns a task-sufficient interpretable query dictionary via an alternating optimization algorithm, substantially closing the performance gap between interpretable classifiers and black-box classifiers.
- Memory-Efficient 4-bit Preconditioned Stochastic Optimization
-
This paper proposes a 4-bit quantization scheme based on Cholesky decomposition and error feedback, compressing the preconditioner matrices of the Shampoo optimizer to 4-bit precision. The approach substantially reduces GPU memory consumption while preserving training performance close to 32-bit Shampoo, with convergence guarantees provided for both smooth and non-smooth settings.
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces
-
This paper proposes SubZero (random Subspace Zeroth-order), which estimates gradients in random subspaces via per-layer low-rank perturbations, significantly reducing gradient variance and angular error in zeroth-order optimization, enabling memory-efficient LLM fine-tuning at a cost close to inference.
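A two-point zeroth-order step restricted to a random low-rank subspace is simple to sketch; the hyper-parameters and toy quadratic objective below are placeholders, not the paper's configuration:

```python
import numpy as np

def subzero_step(theta, loss, lr=1e-3, mu=1e-3, rank=2, rng=None):
    """Perturb an (m, n) weight matrix inside a random rank-r subspace,
    estimate the directional derivative from two forward passes, and
    descend along the same low-rank direction (memory cost ~ inference)."""
    rng = rng or np.random.default_rng()
    m, n = theta.shape
    u = rng.normal(size=(m, rank)) / np.sqrt(rank)
    v = rng.normal(size=(n, rank))
    z = u @ v.T                                          # low-rank perturbation
    g = (loss(theta + mu * z) - loss(theta - mu * z)) / (2 * mu)
    return theta - lr * g * z

rng = np.random.default_rng(0)
target = rng.normal(size=(8, 8))
f = lambda w: float(np.sum((w - target) ** 2))           # toy fine-tuning loss
W = np.zeros_like(target)
for _ in range(5000):
    W = subzero_step(W, f, rng=rng)
print(f(W))                                              # far below the initial loss
```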
🎮 Reinforcement Learning¶
- Embodied Navigation with Auxiliary Task of Action Description Prediction
-
DescRL introduces action description generation as an auxiliary task for reinforcement learning-based navigation. By distilling knowledge from pretrained vision-language models to train an ADPredictor, the navigation agent simultaneously produces interpretable action descriptions and achieves improved navigation performance, attaining state-of-the-art results on Semantic Audio-Visual Navigation (SAVNav) and several other tasks.
- mDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs
-
This paper proposes mDP3, a training-free and model-agnostic video frame selection method that estimates frame similarity in RKHS via a conditional Gaussian kernel, leverages Determinantal Point Processes (DPP) to capture query relevance and list-wise diversity, and models temporal structure via a Markov Decision Process (MDP). Using only 8 input frames, mDP3 significantly outperforms uniform sampling and existing frame selection methods on multiple long-video benchmarks.
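The core selection step is a greedy MAP approximation for a DPP whose kernel couples query relevance (quality) with pairwise similarity (diversity). A naive sketch on toy embeddings, with the RBF bandwidth as a placeholder and the RKHS/MDP components omitted:

```python
import numpy as np

def greedy_dpp(features, relevance, k=8, gamma=0.02):
    sim = np.exp(-gamma * np.sum(
        (features[:, None, :] - features[None, :, :]) ** 2, axis=-1))
    L = relevance[:, None] * sim * relevance[None, :]  # quality x diversity kernel
    chosen = []
    for _ in range(k):
        best, best_det = None, -np.inf
        for i in range(len(features)):
            if i in chosen:
                continue
            idx = chosen + [i]
            det = np.linalg.det(L[np.ix_(idx, idx)])   # "volume" of the selected set
            if det > best_det:
                best, best_det = i, det
        chosen.append(best)
    return sorted(chosen)

rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 32))     # frame embeddings
rel = rng.uniform(0.1, 1.0, size=64)   # query-frame relevance scores
print(greedy_dpp(frames, rel))         # 8 relevant yet mutually diverse frames
```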
- NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation
-
This paper proposes NavQ, a foresighted VLN agent that employs a Q-model to predict, in a single forward pass, long-horizon future semantic aggregation features (Q-features) for each candidate action. Combined with an A*-style search strategy, NavQ achieves significant improvements on object-goal navigation benchmarks.
- Progressor: A Perceptually Guided Reward Estimator with Self-Supervised Online Refinement
-
This paper proposes Progressor, a framework that learns task-agnostic reward functions from unannotated videos via self-supervision. It provides dense reward signals by predicting task progress distributions and addresses distribution shift during online RL training through an adversarial push-back strategy.
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
-
This paper proposes R1-Onevision, a framework that converts images into formalized textual representations via a cross-modal reasoning pipeline, combined with a two-stage post-training strategy of SFT followed by rule-based reinforcement learning (GRPO), to significantly enhance multimodal reasoning in vision-language models, surpassing GPT-4o on multiple mathematical reasoning benchmarks.
- RL-Selector: Reinforcement Learning-Guided Data Selection via Redundancy Assessment
-
This paper proposes RL-Selector, which introduces the ε-sample cover concept to quantify sample redundancy and formulates data selection as a reinforcement learning problem. A lightweight A2C policy network adaptively optimizes the selection strategy, achieving generalization performance comparable to or surpassing full-data training with significantly fewer samples across multiple benchmark datasets.
- RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
-
This paper introduces the concept of compositional constraints to formalize safety and efficiency requirements in multi-agent embodied collaboration, constructs the first multi-agent manipulation benchmark RoboFactory based on this formalization, and systematically investigates architectures and training strategies for multi-agent imitation learning.
🦾 LLM Agent¶
- Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
-
A three-stage self-supervised framework is proposed that significantly improves cross-view description consistency and accuracy for the same object in indoor environments, achieved through agent-driven multi-view observation collection, LLM consensus-based pseudo-label generation, and contrastive fine-tuning of the captioner.
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
-
This paper identifies the "thought collapse" phenomenon in RL-based VLM Agent training — where CoT reasoning rapidly degenerates into state-agnostic, templated thoughts that lead to ineffective actions — and proposes the GTR framework, which combines a VLM corrector for automatic thought correction (SFT) with PPO-based action optimization in a dual-objective training scheme, achieving 3–5× success rate improvements on the 24-Point Game and ALFWorld.
- Less is More: Empowering GUI Agent with Context-Aware Simplification
-
This paper proposes SimpAgent — a context-aware simplification framework that achieves SOTA on multiple GUI navigation benchmarks while reducing FLOPs by 27%, via masking-based element pruning (randomly masking irrelevant element regions during training) and consistency-guided history compression (directly dropping historical visual tokens at intermediate LLM layers with a KL divergence consistency constraint).
- UIPro: Unleashing Superior Interaction Capability for GUI Agents
-
UIPro is proposed to achieve state-of-the-art GUI interaction performance across mobile, web, and desktop platforms by constructing 20.6M GUI understanding samples for pre-training and introducing a unified action space to integrate heterogeneous GUI agent task data.
👥 Social Computing¶
- Gradient Extrapolation for Debiased Representation Learning
-
This paper proposes GERNE, a method that constructs two batches with different degrees of spurious correlation and performs linear extrapolation on their gradients to guide the model toward learning debiased representations, outperforming state-of-the-art methods under both known and unknown attribute settings.
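The extrapolation itself is one line. The sketch below shows it on a toy regression where feature 0 is causal and feature 1 is spurious, with the extrapolation factor c tuned (an assumption for this toy) so the virtual batch is unbiased in expectation:

```python
import numpy as np

def gerne_update(theta, grad_fn, batch_biased, batch_less_biased, c=0.125, lr=0.1):
    g_b = grad_fn(theta, batch_biased)
    g_lb = grad_fn(theta, batch_less_biased)
    # Extrapolate past the less-biased gradient, away from the biased one.
    return theta - lr * (g_lb + c * (g_lb - g_b))

def grad_fn(th, batch):                    # mean-squared-error gradient
    X, y = batch
    return 2 * X.T @ (X @ th - y) / len(y)

rng = np.random.default_rng(0)
Xb, Xl = rng.normal(size=(256, 2)), rng.normal(size=(256, 2))
yb = Xb[:, 0] + 0.9 * Xb[:, 1]             # strong spurious correlation
yl = Xl[:, 0] + 0.1 * Xl[:, 1]             # weak spurious correlation
theta = np.zeros(2)
for _ in range(200):
    theta = gerne_update(theta, grad_fn, (Xb, yb), (Xl, yl))
print(theta)  # ~[1, 0]: spurious weight extrapolated away (0.1 - 0.8c = 0)
```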
- Learning Visual Proxy for Compositional Zero-Shot Learning
-
This paper proposes the concept of Visual Proxy — text-guided visual class centers introduced into CZSL for the first time — and jointly optimizes textual prototypes and visual proxies via Cross-Modal Joint Learning (CMJL), achieving closed-world SOTA on four CZSL benchmarks.
- No More Sibling Rivalry: Debiasing Human-Object Interaction Detection
-
This paper identifies and systematically analyzes the "Toxic Siblings Bias" in HOI detection—highly similar HOI triplets that mutually interfere and compete at both the input and output levels. Two debiasing learning objectives are proposed: Contrastive-then-Calibration (C2C) and Merge-then-Split (M2S), achieving +9.18% mAP over the baseline and +3.59% over the previous state-of-the-art on HICO-DET.
- PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
-
This paper proposes PropVG, the first end-to-end proposal-based visual grounding framework that eliminates the need for pretrained detectors. It decomposes visual grounding into two stages — foreground proposal generation and contrastive learning-based referring scoring — and introduces a Multi-granularity Target Discrimination (MTD) module that integrates object-level and semantic-level information to determine target existence. PropVG achieves state-of-the-art performance on 10 datasets while running 4× faster than traditional proposal-based methods.
📈 Time Series¶
- I²-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting
-
This paper proposes I²-World, which decouples 3D scene tokenization into two complementary processes — intra-scene multi-scale residual quantization and inter-scene temporal quantization — thereby retaining the high compression ratio of 3D tokenizers while incorporating the temporal modeling capability of 4D tokenizers, enabling efficient and high-quality 4D occupancy forecasting.
- V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction
-
This paper proposes V2XPnP, a V2X spatio-temporal fusion framework built upon a unified Transformer architecture, which achieves multi-agent end-to-end perception and prediction under a one-step communication strategy. The work also introduces the first large-scale real-world sequential dataset supporting all V2X collaboration modes, achieving state-of-the-art performance on both perception and prediction tasks.
- VA-MoE: Variables-Adaptive Mixture of Experts for Incremental Weather Forecasting
-
This paper proposes a novel incremental weather forecasting paradigm and the VA-MoE framework. Through a variables-adaptive MoE architecture and index embedding mechanism, VA-MoE achieves forecasting accuracy comparable to full training with only 25% trainable parameters and 50% of the initial training data.
- VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
-
This paper proposes VLRMBench, a comprehensive and challenging benchmark for vision-language reward models (VLRMs) comprising 12,634 questions across 12 tasks, covering three dimensions: process understanding, outcome judgment, and criticism generation. Extensive experiments on 26 models reveal significant deficiencies in current VLRMs.
💡 LLM Reasoning¶
- Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning
-
This paper proposes Corvid, which comprehensively enhances the chain-of-thought reasoning capability of MLLMs through a hybrid visual encoder, a GateMixer connector, a high-quality CoT dataset, and a test-time self-verification strategy, surpassing open-source models of comparable parameter scale on mathematical reasoning and scientific problem solving.
- Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
-
This paper proposes UV-CoT, a framework that enables image-level chain-of-thought (Visual CoT) reasoning without any manual bounding box annotations, by automatically constructing preference data and introducing an improved Score-DPO loss. UV-CoT surpasses the supervised Visual-CoT method on 6 benchmarks.
- Video-T1: Test-Time Scaling for Video Generation
-
This paper transfers the test-time scaling (TTS) paradigm from LLMs to video generation by reformulating TTS as a search problem over trajectories from Gaussian noise space to the target video distribution. It proposes the Tree-of-Frames (ToF) search algorithm for efficient inference-time compute scaling, achieving consistent quality improvements across diverse video generation models on VBench.
📡 Signal & Communications¶
- Boosting Multimodal Learning via Disentangled Gradient Learning
-
This paper reveals an optimization conflict between modality encoders and fusion modules in multimodal learning — the fusion module suppresses gradients propagated back to individual modality encoders, causing even the dominant modality to underperform its unimodal counterpart. The paper proposes the Disentangled Gradient Learning (DGL) framework, which addresses this issue by cutting the gradient path from the fusion module to the encoders and replacing it with independent unimodal losses.
- Generalizable Non-Line-of-Sight Imaging with Learnable Physical Priors
-
This paper proposes two modules — Learnable Path Compensation (LPC) and Adaptive Phasor Field (APF) — to address material-dependent radiance intensity falloff and frequency-domain denoising under varying SNR conditions in NLOS imaging, respectively. Trained solely on synthetic data, the method achieves state-of-the-art generalization across multiple real-world datasets.
- Rectifying Magnitude Neglect in Linear Attention
-
This paper identifies that Linear Attention completely discards Query magnitude information, causing a significant deviation of attention score distributions from Softmax Attention. It proposes Magnitude-Aware Linear Attention (MALA), which restores magnitude awareness by introducing a scaling factor \(\beta\) and an offset term \(\gamma\), achieving comprehensive improvements over existing methods across classification, detection, segmentation, NLP, speech recognition, and image generation tasks.
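The "magnitude neglect" itself is easy to demonstrate: with a positively homogeneous feature map, rescaling the query leaves normalized linear-attention scores untouched, while softmax scores sharpen. This sketch shows only the failure mode; MALA's \(\beta\)/\(\gamma\) correction is not reproduced here:

```python
import numpy as np

def softmax_scores(q, K):
    s = K @ q
    e = np.exp(s - s.max())
    return e / e.sum()

def linear_scores(q, K, phi=lambda x: np.maximum(x, 0.0)):  # ReLU feature map
    s = phi(K) @ phi(q)
    return s / s.sum()

rng = np.random.default_rng(0)
q, K = rng.normal(size=16), rng.normal(size=(8, 16))
print(np.round(softmax_scores(q, K), 3))      # doubling |q| sharpens these...
print(np.round(softmax_scores(2 * q, K), 3))
print(np.allclose(linear_scores(q, K), linear_scores(2 * q, K)))  # ...but not these: True
```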
🔗 Causal Inference¶
- A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets
-
This paper proposes a block-based method that leverages LLMs and diffusion models to automatically generate high-quality counterfactual image-text pair datasets, accompanied by a set-aware loss function. Without manual annotation, the approach significantly improves CLIP's compositional reasoning ability, surpassing state-of-the-art methods on ARO/VL-Checklist and other benchmarks with substantially less data.
- Social Debiasing for Fair Multi-modal LLMs
-
This paper constructs CMSC, a large-scale counterfactual dataset spanning 18 social concepts, and proposes the Anti-Stereotype Debiasing (ASD) strategy—comprising bias-aware data resampling and a Social Fairness Loss—that effectively reduces social bias across four MLLM architectures with negligible degradation of general multimodal capability.
⚖️ Alignment & RLHF¶
- Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
-
This paper proposes HIMRD, a black-box multimodal jailbreak attack method that bypasses unimodal safety mechanisms by distributing malicious semantics across multiple modalities. A heuristic search strategy is employed to identify optimal understanding-enhancing prompts and inducing prompts, achieving average attack success rates of approximately 90% and 68% on open-source and closed-source multimodal large language models, respectively.
- MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
-
This paper proposes MagicID, a framework that constructs hybrid video pair data capturing identity and dynamic preferences, and designs a two-stage Hybrid Preference Optimization (HPO) training strategy. MagicID is the first work to apply DPO to identity-customized video generation, simultaneously addressing identity degradation and motion weakening caused by conventional self-reconstruction training.
💻 Code Intelligence¶
- TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
-
This paper proposes TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediate bridge, enabling zero-shot text-guided TikZ graphics program synthesis without text-aligned training data. TikZero substantially outperforms baseline methods, and its end-to-end fine-tuned variant TikZero+ matches or surpasses large commercial models such as GPT-4o.
🕸️ Graph Learning¶
- PASTA: Part-Aware Sketch-to-3D Shape Generation with Text-Aligned Prior
-
This paper proposes the PASTA framework, which leverages VLM-derived text priors to compensate for semantic deficiencies in sketches, and designs ISG-Net (IndivGCN + PartGCN) to model inter-part structural relationships, achieving state-of-the-art performance in sketch-to-3D shape generation with support for part-level editing.
⚡ LLM Efficiency¶
- MixANT: Observation-dependent Memory Propagation for Stochastic Dense Action Anticipation
-
This paper proposes MixANT, which introduces input-dependence into the forgetting gate (A matrix) of Mamba via a Mixture-of-Experts approach. A lightweight router dynamically selects context-aware A matrices to control temporal memory propagation, achieving state-of-the-art performance across all three dense action anticipation benchmarks: 50Salads, Breakfast, and Assembly101.
🌐 Multilingual & Translation¶
- SignRep: Enhancing Self-Supervised Sign Representations
-
This paper proposes SignRep, a scalable self-supervised sign language representation learning framework that incorporates sign-specific skeleton priors, feature regularization, and an adversarial style-invariant loss into Masked Autoencoder pretraining. Using only a single RGB modality, SignRep surpasses complex multi-modal and multi-branch methods, achieving state-of-the-art performance on three tasks: sign language recognition, dictionary retrieval, and sign language translation.
⚛️ Physics¶
- ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers
-
This paper proposes ResQ, the first framework to natively implement residual neural networks (ResNets) on analog Rydberg atom quantum computers by exploiting continuous-time Hamiltonian evolution. Input features and trainable parameters are encoded via piecewise parameterized laser pulses, and the method achieves an average 50% improvement over classical models of equivalent scale on MNIST, FashionMNIST, and medical dataset classification tasks.
🧮 Scientific Computing¶
- JPEG Processing Neural Operator for Backward-Compatible Coding
-
This paper proposes JPNeO, a next-generation codec that is fully backward-compatible with the JPEG format. By introducing neural operators at both the encoding stage (JENO) and decoding stage (JDNO), along with a trainable quantization matrix, JPNeO significantly improves JPEG reconstruction quality—particularly for chroma components—while maintaining low memory footprint and parameter count.
📂 Others¶
- A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention
-
This paper identifies a previously overlooked issue in GCD—ViT attention on unlabeled data (especially novel categories) tends to disperse onto background regions (distracted attention)—and proposes an Attention Focusing (AF) module that corrects attention via multi-scale token importance measurement combined with adaptive pruning. As a plug-and-play module on top of SimGCD, AF achieves up to 15.4% performance improvement.
- A Hyperdimensional One Place Signature to Represent Them All: Stackable Descriptors For Visual Place Recognition
-
This paper proposes HOPS (Hyperdimensional One Place Signatures), a framework leveraging hyperdimensional computing (HDC) to fuse multiple reference descriptors of the same place captured under varying environmental conditions into a unified representation, substantially improving the robustness and recall of Visual Place Recognition (VPR) without increasing computational or memory overhead.
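The bundling operation at the heart of HDC is element-wise superposition, which is why signatures stay "stackable". A toy sketch in which one place is seen under three conditions and queried under a fourth (dimensions and noise scales are arbitrary choices here):

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def bundle(descriptors):
    s = np.sum(descriptors, axis=0)  # superpose; fusing another condition later
    return s / np.linalg.norm(s)     # is just one more addition

rng = np.random.default_rng(0)
dim = 4096
place = rng.normal(size=dim)                                    # shared place component
views = [place + 0.8 * rng.normal(size=dim) for _ in range(3)]  # day / night / rain
query = place + 0.8 * rng.normal(size=dim)                      # unseen condition
sig = bundle(views)
print("query vs one view:", round(cos(query, views[0]), 3))     # ~0.61
print("query vs bundle:  ", round(cos(query, sig), 3))          # higher, ~0.71
```

Superposition lets the condition-specific noise partially cancel while the shared place component adds coherently, which is where the recall gain comes from.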
- A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
-
This paper proposes a unified linear N-point solver that recovers camera linear velocity and 3D point structure from 2D point correspondences with arbitrary timestamps, supporting global shutter, rolling shutter, and event camera sensor modalities.
- AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes
-
This paper proposes AdaptiveAE, which formulates HDR bracketed exposure capture as a Markov Decision Process (MDP) using deep reinforcement learning, jointly optimizing ISO and shutter speed combinations to adaptively select optimal exposure parameters for dynamic scenes within a user-defined time budget. The method achieves PSNR 39.70 on the HDRV dataset, outperforming the previous best method Hasinoff et al. (37.59) by 2.1 dB.
- Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponents
-
This paper proposes LEAwareSGD, an optimizer that dynamically adjusts the learning rate using Lyapunov exponents (LE) to guide model training toward the edge of chaos, enabling broader exploration of the parameter space within an adversarial data augmentation framework and achieving significant improvements in single domain generalization (SDG).
- AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm
-
This paper formulates multi-exposure HDR reconstruction from a MAP estimation perspective, decomposes the problem into two alternating subproblems—alignment and fusion—via a spatial correspondence prior, and unfolds them into an end-to-end trainable AFUNet comprising SAM (spatial alignment), CFM (channel fusion), and DCM (data consistency) modules. The method achieves state-of-the-art performance on three HDR benchmarks, reaching PSNR-μ of 44.91 dB on the Kalantari dataset.
- Auto-Regressively Generating Multi-View Consistent Images (MV-AR)
-
This paper is the first to introduce autoregressive (AR) models into multi-view image generation. By generating views sequentially, the model leverages all preceding views to enhance consistency across distant viewpoints. It further proposes a unified multimodal condition injection architecture and a Shuffle Views data augmentation strategy, enabling a single model to handle text, image, and geometry conditions simultaneously.
- C4D: 4D Made from 3D through Dual Correspondences
-
This paper proposes C4D, a framework that upgrades existing 3D reconstruction paradigms to full 4D reconstruction by jointly capturing dual temporal correspondences — short-term optical flow and dynamic-aware long-term point tracking (DynPT) — on top of DUSt3R's 3D pointmap predictions. Motion masks are generated to separate static and dynamic regions. Three optimization objectives are introduced: camera motion alignment, camera trajectory smoothing, and point trajectory smoothing. The resulting system produces per-frame point clouds, camera parameters, and 2D/3D trajectories, achieving competitive performance across depth estimation, pose estimation, and point tracking tasks.
- DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
-
This paper proposes DeSPITE, a contrastive learning framework that aligns four modalities—LiDAR point clouds, skeletal poses, IMU signals, and text—into a joint embedding space. It is the first to adopt LiDAR (rather than RGB) as the primary visual modality, enabling previously infeasible tasks such as cross-modal matching and retrieval, while also serving as an effective HAR pretraining strategy that achieves state-of-the-art performance on MSR-Action3D and HMPEAR.
- Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection
-
This paper proposes the first sketch-based cross-modal few-shot keypoint detection framework. By leveraging a prototype network, grid-based locator, prototype domain adaptation, and a de-stylization network, the framework detects novel keypoints on unseen categories in real photographs using only a handful of annotated sketches.
- EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration
-
This paper proposes EDFFDNet, which replaces conventional B-spline FFD and TPS with an Exponentially Decaying Free-Form Deformation (EDFFD) model for image registration. Combined with an Adaptive Sparse Motion Aggregator (ASMA) and a progressive correlation strategy, the method achieves a +0.5 dB PSNR improvement on the UDIS-D dataset while reducing parameter count by 70.5% and GPU memory usage by 32.6%.
- Failure Cases Are Better Learned But Boundary Says Sorry: Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training
-
This paper reveals a counterintuitive phenomenon in adversarial training — the model's perceptual change on failure cases is actually smaller than on success cases (i.e., failure cases are "over-learned") — and proposes Robust Perception Adversarial Training (RPAT), which encourages perceptions to change smoothly with perturbations to alleviate the accuracy-robustness trade-off.
- FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases
-
FixTalk is proposed as a framework that addresses identity leakage in GAN-based talking head generation through two lightweight plug-and-play modules — the Enhanced Motion Indicator (EMI) and the Enhanced Detail Indicator (EDI). EMI eliminates identity information from motion features to suppress identity leakage, while EDI repurposes the leaked identity information to compensate for missing details under extreme poses, thereby removing rendering artifacts.
- From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
-
This paper proposes a Progressive Active Learning (PAL) framework that trains infrared small target detection networks through a three-stage strategy—model pre-start, model enhancement, and model refinement—driving the network to actively identify and learn from hard samples in an easy-to-hard manner. Under single point supervision, PAL substantially narrows the performance gap with fully supervised methods (IoU improvement of 8.53%–29.1%).
- Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery
-
This paper proposes DiffGRE, a diffusion-model-based framework for on-the-fly category discovery. It synthesizes novel samples containing virtual category information via Attribute Composition Generation (ACG), filters low-quality samples through Diversity-Driven Refinement (DDR), and injects additional category knowledge via Semi-supervised Leader Encoding (SLE). DiffGRE achieves substantial performance gains over existing OCD methods across 6 fine-grained datasets (average ACC-ALL improvement of 6.5%).
- Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging
-
This paper proposes Hi3DGen, a framework that uses normal maps as an intermediate representation to bridge 2D images and 3D geometry. Through two core components — a Noise-injected Regressive Normal Estimation (NiRNE) module and Normal-Regularized Latent Diffusion (NoRLD) — the framework significantly improves the geometric detail fidelity of generated 3D models.
- HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity
-
This paper proposes HiNeuS, a unified neural surface reconstruction framework that simultaneously addresses three core challenges—reflective ambiguity, low-texture degradation, and detail preservation—through three innovations: SDF-guided visibility verification, planar conformal regularization, and rendering-prioritized Eikonal relaxation.
- HyTIP: Hybrid Temporal Information Propagation for Masked Conditional Residual Video Coding
-
This paper proposes HyTIP, which unifies output-recurrence (explicit buffering of decoded frames) and hidden-to-hidden propagation (implicit buffering of latent features) within a single learned video codec, achieving coding performance comparable to state-of-the-art methods with only 14% of their buffer size.
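The hybrid idea can be pictured as a recurrent decoding step that carries both buffers at once, as in the minimal sketch below (layer shapes and update rules are illustrative, not HyTIP's actual architecture):

```python
import torch
import torch.nn as nn

class HybridPropagation(nn.Module):
    """Explicit buffer (last decoded frame) + implicit buffer (latent hidden
    state), both fed into the next decoding step."""
    def __init__(self, hidden_ch=16):
        super().__init__()
        self.decode = nn.Conv2d(3 + hidden_ch + 3, 3, 3, padding=1)      # frame decoder stub
        self.update = nn.Conv2d(3 + hidden_ch, hidden_ch, 3, padding=1)  # hidden-state update

    def forward(self, residual, prev_frame, hidden):
        inp = torch.cat([residual, hidden, prev_frame], dim=1)
        frame = self.decode(inp)                                          # output-recurrence path
        hidden = torch.tanh(self.update(torch.cat([frame, hidden], dim=1)))  # hidden-to-hidden path
        return frame, hidden    # both buffers are handed to the next step
```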
- I Am Big, You Are Little; I Am Right, You Are Wrong
-
This work employs the causal-reasoning XAI tool ReX to extract Minimal Pixel Sets (MPS) from image classification models, systematically comparing the "attentional focus" of 15 models across 5 architectures. Large models (EVA/ConvNeXt) are found to make classification decisions using as little as 5% of image pixels, and statistically significant differences in MPS size and spatial location are observed across architectures.
- IAP: Invisible Adversarial Patch Attack through Perceptibility-Aware Localization
-
This paper proposes the IAP framework, which achieves — for the first time in targeted attack settings — truly invisible adversarial patches via perceptibility-aware patch localization and color-preserving gradient updates, while simultaneously bypassing multiple SOTA patch defenses.
- Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery
-
This paper proposes IICMVNCD, the first framework extending Novel Class Discovery (NCD) to the multi-view setting. It captures distributional consistency between known and novel classes via intra-view matrix factorization, and transfers view relationships learned from known classes to novel classes through inter-view weight learning, eliminating the need for pseudo-labels.
- Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy
-
This paper introduces an entropy-constrained supervision setting to establish a fair comparison framework between meta-learning and Whole-Class Training (WCT). It theoretically demonstrates that meta-learning yields tighter generalization bounds, and reveals its advantages in label noise robustness and suitability for heterogeneous tasks. Building on these insights, the proposed MINO framework achieves state-of-the-art performance on unsupervised few-shot and zero-shot tasks.
- Jigsaw++: Imagining Complete Shape Priors for Object Reassembly
-
Jigsaw++ proposes a generative model-based approach for learning complete shape priors, mapping partially assembled fragment point clouds to the shape space of complete objects via a retargeting strategy, thereby improving reassembly quality in a manner orthogonal to existing assembly algorithms.
- Joint Asymmetric Loss for Learning with Noisy Labels
-
This paper extends asymmetric loss functions to the more challenging passive-loss setting and proposes the Asymmetric Mean Squared Error (AMSE), rigorously establishing the necessary and sufficient condition for AMSE to satisfy the asymmetric condition. Embedding AMSE into the APL framework yields the Joint Asymmetric Loss (JAL), which delivers consistent improvements over existing robust-loss methods on CIFAR-10/100 and other datasets.
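The note does not reproduce AMSE's definition, so the sketch below uses a hypothetical asymmetric weighting of the squared error purely to illustrate the active-plus-passive (APL-style) combination; consult the paper for the exact form that provably satisfies the asymmetric condition:

```python
import torch
import torch.nn.functional as F

def amse(logits, target, a=3.0):
    """Hypothetical asymmetric MSE: the target-class error is amplified by
    `a` relative to the non-target (passive) errors, making the loss treat
    the two sides asymmetrically. Not the paper's exact AMSE."""
    p = F.softmax(logits, dim=-1)
    onehot = F.one_hot(target, p.size(-1)).float()
    sq = (p - onehot) ** 2
    w = torch.where(onehot.bool(), torch.full_like(sq, a), torch.ones_like(sq))
    return (w * sq).sum(dim=-1).mean()

def joint_asymmetric_loss(logits, target, alpha=1.0, beta=1.0):
    """APL-style combination: an active term (plain cross-entropy here for
    brevity; APL typically uses a normalized loss) plus the passive term."""
    return alpha * F.cross_entropy(logits, target) + beta * amse(logits, target)
```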
- Kaputt: A Large-Scale Dataset for Visual Defect Detection
-
Kaputt introduces a large-scale retail logistics defect detection dataset comprising 230,000+ images and 48,000+ unique items — 40× the scale of MVTec-AD — and is the first to incorporate significant pose and appearance variation. State-of-the-art anomaly detection methods achieve no more than 56.96% AUROC on this benchmark, exposing critical shortcomings of existing approaches in real-world retail scenarios.
- LaCoOT: Layer Collapse through Optimal Transport
-
This paper proposes LaCoOT, an optimal transport-based regularization strategy that minimizes the Max-Sliced Wasserstein distance between intermediate feature distributions within a network during training, enabling the removal of entire layers post-training while maintaining performance and significantly reducing model depth and inference time.
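The core regularizer is easy to state concretely: project two feature batches onto many random directions, compute the 1D Wasserstein-2 distance in each direction (a sort and a subtraction), and keep the worst direction. The sketch below approximates the max over directions by sampling rather than optimizing them, and assumes equal batch sizes:

```python
import torch

def max_sliced_w2(x, y, n_proj=128):
    """Max-Sliced Wasserstein-2 distance between two feature batches (B, D)."""
    theta = torch.randn(n_proj, x.size(1), device=x.device)
    theta = theta / theta.norm(dim=1, keepdim=True)       # random unit directions
    px, py = x @ theta.t(), y @ theta.t()                 # (B, n_proj) projections
    w2 = ((px.sort(dim=0).values - py.sort(dim=0).values) ** 2).mean(dim=0)
    return w2.max().sqrt()                                # worst direction

def lacoot_penalty(features):
    """LaCoOT-style regularizer (sketch): pull consecutive blocks' output
    distributions together so a block can later be dropped with little damage.
    `features` is a list of (B, D) activations, one per block."""
    return sum(max_sliced_w2(f1, f2) for f1, f2 in zip(features, features[1:]))
```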
- LayerD: Decomposing Raster Graphic Designs into Layers
-
This paper proposes LayerD, a method that decomposes raster graphic designs into editable layers by iteratively extracting the unoccluded top layer and completing the background behind it. It exploits a domain prior of graphic design (largely flat, low-texture regions) for refinement, and introduces a DTW-based hierarchical evaluation protocol.
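The decomposition loop itself is simple to sketch: repeatedly peel the unoccluded top layer and inpaint what was behind it. `extract_top`, `inpaint`, and the stopping rule below are stand-ins for LayerD's learned components:

```python
def decompose(image, extract_top, inpaint, max_layers=8):
    """Iterative top-down layer decomposition (sketch).

    extract_top(canvas) -> (layer_rgba, mask): the unoccluded top layer
    inpaint(canvas, mask) -> canvas with the peeled region completed
    """
    layers, canvas = [], image
    for _ in range(max_layers):          # LayerD's actual stop rule differs
        layer, mask = extract_top(canvas)
        if mask.sum() == 0:              # nothing left to peel
            break
        layers.append(layer)
        canvas = inpaint(canvas, mask)   # complete the occluded background
    layers.append(canvas)                # remaining canvas = bottom layer
    return layers[::-1]                  # bottom-to-top order
```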
- LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
-
LayerTracer presents the first cognitive-aligned layered SVG generation framework built upon a Diffusion Transformer (DiT). It constructs a dataset of 20,000+ designer operation sequences, trains a DiT to generate multi-stage rasterized blueprints that simulate designer workflows, and converts these blueprints into clean, editable layered SVGs via layer-wise vectorization and path deduplication. The framework supports both text-driven generation and image-to-layered-SVG conversion.
- Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval
-
This paper presents the first learning paradigm for encoding user-defined multi-level visual hierarchies in hyperbolic space. It introduces an angle-based entailment contrastive loss to learn scene→object→part hierarchies without explicit hierarchy labels, and proposes an optimal-transport-based hierarchical retrieval evaluation metric.
- Loss Functions for Predictor-based Neural Architecture Search
-
This paper presents the first comprehensive and systematic study of 8 loss functions for performance predictors, spanning regression, ranking, and weighting categories. Evaluated across 13 tasks on 5 search spaces, the study reveals the characteristics and complementarity of each loss type, and proposes PWLNAS—a piecewise loss (PW loss) combination method—that surpasses existing state-of-the-art on multiple benchmarks.
- Magic Insert: Style-Aware Drag-and-Drop
-
This paper proposes Magic Insert, the first method to formally define and address the "style-aware drag-and-drop" problem—inserting a subject from an arbitrary style into a target image of a different style, such that the subject automatically adapts to the target style while being composited in a physically plausible manner. The core components are style-aware personalization (LoRA + IP-Adapter style injection) and Bootstrap Domain Adaptation (adapting a real-image-trained insertion model to the stylized image domain).
- Membership Inference Attacks with False Discovery Rate Control
-
This paper proposes MIAFdR, the first membership inference attack (MIA) method with theoretical false discovery rate (FDR) guarantees. By designing a novel non-member conformity score function and an adjusted membership decision strategy, MIAFdR controls FDR and can be integrated as a plug-and-play wrapper into existing MIA methods, maintaining attack performance while providing FDR guarantees.
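The note leaves the FDR mechanism implicit; a standard recipe matching the description is to turn non-member conformity scores into conformal p-values and threshold them with the Benjamini-Hochberg procedure, as sketched below (MIAFdR's actual score function and decision adjustment differ in detail):

```python
import numpy as np

def fdr_controlled_members(scores_test, scores_nonmember_cal, q=0.1):
    """Flag likely training members while controlling FDR at level q.
    Higher score = more member-like."""
    cal = np.asarray(scores_nonmember_cal)
    test = np.asarray(scores_test)
    # Conformal p-value: fraction of non-member calibration scores >= test score.
    p = (1 + (cal[None, :] >= test[:, None]).sum(axis=1)) / (len(cal) + 1)
    # Benjamini-Hochberg step-up: largest k with p_(k) <= q*k/m.
    order = np.argsort(p)
    m = len(p)
    passed = p[order] <= q * (np.arange(1, m + 1) / m)
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True             # these samples are declared members
    return reject
```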
- Multi-view Gaze Target Estimation
-
This paper is the first to extend Gaze Target Estimation (GTE) from single-view to multi-view settings. By integrating three modules — Head Information Aggregation (HIA), Uncertainty-based Gaze Selection (UGS), and Epipolar-based Scene Attention (ESA) — the method fuses information across multiple cameras. It significantly outperforms single-view state-of-the-art methods on the newly introduced MVGT dataset and enables cross-view estimation that single-view methods cannot handle.
- NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations
-
This paper proposes NAPPure, a framework that jointly optimizes the underlying clean image and perturbation parameters via likelihood maximization, extending adversarial purification beyond additive perturbations to handle blur, occlusion, and geometric distortion. NAPPure achieves an average robust accuracy of 73.93% on GTSRB, compared to only 43.2% for conventional methods.
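The joint optimization can be pictured as fitting an observation model x_obs ≈ T(x_clean; θ): descend on reconstruction error plus an image prior with respect to both the clean image and the perturbation parameters. `apply_pert` and `log_prior` below are stand-ins, and NAPPure's concrete priors and solver differ:

```python
import torch

def nap_purify(x_obs, apply_pert, log_prior, theta_init, iters=300, lr=1e-2):
    """Purification under a non-additive perturbation model (sketch).

    apply_pert(x, theta): differentiable perturbation model (e.g., blur)
    log_prior(x):         natural-image log-likelihood surrogate
    """
    x = x_obs.clone().requires_grad_(True)
    theta = theta_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([x, theta], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        recon = ((apply_pert(x, theta) - x_obs) ** 2).mean()  # data fidelity
        loss = recon - 1e-3 * log_prior(x)                    # plus image prior
        loss.backward()
        opt.step()
    return x.detach()                                         # purified image
```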
- Omni-DC: Highly Robust Depth Completion with Multiresolution Depth Integration
-
This paper presents OMNI-DC, a highly robust depth completion model that achieves zero-shot generalization across diverse datasets and sparse depth patterns via a multiresolution Discrete Depth Integration module (Multi-res DDI), a Laplacian loss, and scale normalization.
- On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations
-
This paper proposes a unified spectral framework to systematically analyze and quantify the trade-off between the smoothness (complexity) and faithfulness of gradient-based explanations. It introduces Expected Frequency (EF) to measure a network's reliance on high-frequency information, controls explanation complexity by convolving ReLU with a Gaussian function, and defines an "explanation gap" to quantify the faithfulness loss induced by surrogate models.
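Whatever the paper's exact parameterization, convolving ReLU with a Gaussian has a standard closed form, x·Φ(x/σ) + σ·φ(x/σ), which makes the smoothness knob concrete: σ → 0 recovers exact ReLU, while larger σ suppresses high-frequency content in the activation and hence in its gradients:

```python
import math
import torch

def smoothed_relu(x, sigma=1.0):
    """ReLU convolved with a Gaussian N(0, sigma^2):
    E[max(0, x + sigma * Z)] = x * Phi(x/sigma) + sigma * phi(x/sigma)."""
    z = x / sigma
    Phi = 0.5 * (1 + torch.erf(z / math.sqrt(2)))              # standard normal CDF
    phi = torch.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)    # standard normal PDF
    return x * Phi + sigma * phi
```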
- Φ-GAN: Physics-Inspired GAN for Generating SAR Images Under Limited Data
-
This paper proposes Φ-GAN, which integrates the ideal Point Scattering Center (PSC) electromagnetic scattering physical model into GAN training as a differentiable neural module. Through a dual physics loss (generator physical consistency constraint + discriminator electromagnetic feature distillation), the method significantly improves the quality and stability of SAR image generation under data-scarce conditions.
- Processing and Acquisition Traces in Visual Encoders: What Does CLIP Know About Your Camera?
-
This paper reveals that visual encoders such as CLIP systematically encode image acquisition and processing parameters (e.g., camera model, ISO, JPEG quality, and other perceptually invisible attributes) within their learned representations, and that these latent signals significantly influence semantic prediction accuracy—both positively and negatively—through statistical correlations with semantic labels.
- Recover Biological Structure from Sparse-View Diffraction Images with Neural Volumetric Prior
-
This paper proposes Neural Volumetric Prior (NVP), a hybrid neural representation combining an explicit 3D feature grid with an implicit MLP, integrated with a physically accurate diffraction-based rendering equation. NVP enables, for the first time, high-fidelity volumetric reconstruction of the 3D refractive index of semi-transparent biological specimens from sparse-view inputs (as few as 6–7 diffraction images), requiring roughly 50× fewer images and 3× less processing time than prior approaches.
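The representation pattern (explicit feature grid plus a small implicit MLP) is easy to sketch; the resolution, feature width, and output head below are illustrative, and the diffraction renderer is omitted entirely:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridField(nn.Module):
    """Explicit-grid + implicit-MLP volumetric field (sketch).
    Maps 3D points in [-1, 1]^3 to a scalar (e.g., refractive index)."""
    def __init__(self, res=64, feat=8):
        super().__init__()
        self.grid = nn.Parameter(torch.zeros(1, feat, res, res, res))  # explicit features
        self.mlp = nn.Sequential(nn.Linear(feat + 3, 64), nn.ReLU(),
                                 nn.Linear(64, 1))                     # implicit decoder

    def forward(self, pts):                        # pts: (N, 3)
        g = pts.view(1, -1, 1, 1, 3)               # grid_sample wants (1, D, H, W, 3)
        f = F.grid_sample(self.grid, g, align_corners=True)  # (1, feat, N, 1, 1)
        f = f.view(self.grid.size(1), -1).t()      # (N, feat) trilinear features
        return self.mlp(torch.cat([f, pts], dim=-1)).squeeze(-1)
```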
- Recovering Parametric Scenes from Very Few Time-of-Flight Pixels
-
This paper investigates the feasibility of recovering 3D parametric scene geometry from an extremely small number of measurements (as few as 15 pixels) captured by low-cost, wide-field-of-view ToF sensors. An analysis-by-synthesis framework combining feedforward prediction with differentiable rendering is proposed, demonstrating surprisingly strong performance on tasks such as 6D object pose estimation.
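The analysis-by-synthesis loop reduces to: predict scene parameters feedforward, then refine them by gradient descent through a differentiable ToF forward model. `render_tof` below is a stand-in for that model:

```python
import torch

def refine_pose(render_tof, observed, pose_init, iters=200, lr=1e-2):
    """Refine a feedforward pose prediction against sparse ToF measurements.

    render_tof(pose) -> (P,): differentiable ToF forward model over P pixels
    observed:           (P,)  measured ToF values
    pose_init:          (6,)  feedforward pose estimate (illustrative size)
    """
    pose = pose_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(render_tof(pose), observed)
        loss.backward()
        opt.step()
    return pose.detach()
```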
- Revisiting Image Fusion for Multi-Illuminant White-Balance Correction
-
This paper addresses white-balance (WB) correction under multi-illuminant scenes by proposing an efficient Transformer-based fusion model to replace conventional linear fusion, alongside a large-scale multi-illuminant WB dataset containing 16,000+ images. The proposed method achieves a 100% improvement in correction quality over existing methods on the new dataset.
- SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis
-
SemTalk decomposes co-speech motion into rhythm-aligned base motions and semantics-aware sparse motions, and adaptively fuses them via learned semantic scores to achieve high-quality holistic co-speech motion generation with frame-level semantic emphasis.
- Stroke2Sketch: Harnessing Stroke Attributes for Training-Free Sketch Generation
-
This paper proposes Stroke2Sketch, a training-free reference-guided sketch generation framework that achieves fine-grained stroke attribute transfer while preserving content structure within a pretrained diffusion model, via three collaborative modules: Cross-image Stroke Attention (CSA), Directive Attention Module (DAM), and Semantic Preservation Module (SPM).
- Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos
-
This paper proposes Switch-a-View, a model that learns view-switching patterns (ego/exo) from large-scale unlabeled in-the-wild videos, enabling automatic view selection in multi-view instructional videos without requiring explicit best-view annotations.
- SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis
-
This paper proposes SyncDiff, a unified multi-body human-object interaction (HOI) motion synthesis framework that achieves precise multi-body synchronization via alignment scores and an explicit synchronization strategy, while introducing frequency-domain decomposition to model high-frequency interaction semantics.
- Thermal Polarimetric Multi-view Stereo
-
This paper proposes a method for high-fidelity 3D shape reconstruction using thermal polarimetric (long-wave infrared polarimetric) cues. It theoretically demonstrates that LWIR polarimetric observations are unaffected by illumination conditions and material optical properties, enabling accurate 3D reconstruction of transparent, translucent, and heterogeneous objects—significantly outperforming visible-light polarimetric methods.
- Toward Material-Agnostic System Identification from Videos
-
This paper proposes MASIV, the first visual system identification framework that requires no predefined material priors. It replaces hand-crafted elastic/plastic equations with a learnable neural constitutive model, reconstructs dense continuum particle trajectories to provide temporally rich geometric supervision, and infers the intrinsic dynamic properties of objects from multi-view videos.
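Replacing a hand-crafted constitutive law with a learnable one can be sketched as a small network mapping deformation state (plus a latent material code) to stress; the sizes and inputs below are illustrative, not MASIV's exact design:

```python
import torch
import torch.nn as nn

class NeuralConstitutive(nn.Module):
    """Learnable stand-in for a hand-crafted constitutive law (sketch):
    maps a flattened deformation gradient F (9,) and a latent material code
    to a flattened stress tensor (9,), assuming no material family a priori."""
    def __init__(self, latent=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(9 + latent, 128), nn.SiLU(),
                                 nn.Linear(128, 128), nn.SiLU(),
                                 nn.Linear(128, 9))

    def forward(self, F_flat, material_code):
        return self.net(torch.cat([F_flat, material_code], dim=-1))
```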
- You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
-
This paper proposes PHCP, the first framework to address the domain gap in heterogeneous collaborative perception at inference time. By leveraging collaborating agents' pseudo labels for few-shot unsupervised domain adaptation, PHCP trains lightweight adapters via self-training to align feature spaces without any joint training, and achieves performance on OPV2V close to the SOTA method HEAL using only a small number of unlabeled samples.