📂 Others
📹 ICCV2025 · 48 paper notes
- A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention
  - This paper identifies a previously overlooked issue in GCD: ViT attention on unlabeled data (especially novel categories) tends to disperse onto background regions (distracted attention). It proposes an Attention Focusing (AF) module that corrects attention via multi-scale token importance measurement combined with adaptive pruning. As a plug-and-play module on top of SimGCD, AF achieves up to a 15.4% performance improvement.
- A Hyperdimensional One Place Signature to Represent Them All: Stackable Descriptors For Visual Place Recognition
  - This paper proposes HOPS (Hyperdimensional One Place Signatures), a framework leveraging hyperdimensional computing (HDC) to fuse multiple reference descriptors of the same place captured under varying environmental conditions into a unified representation, substantially improving the robustness and recall of Visual Place Recognition (VPR) without increasing computational or memory overhead.
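HOPS's exact encoding and stacking operators are not reproduced in the note above; the following is a minimal sketch of the underlying HDC idea only, assuming bipolar hypervectors and majority-vote bundling (all names are illustrative). The point it demonstrates: a bundle of several condition-specific descriptors remains similar to each of them while staying quasi-orthogonal to unrelated vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality

def random_hv():
    # Random bipolar hypervector (+1/-1 entries).
    return rng.choice([-1, 1], size=D)

def bundle(hvs):
    # Bundling: element-wise majority vote (sign of the sum).
    s = np.sign(np.sum(hvs, axis=0))
    return np.where(s == 0, 1, s)  # break ties deterministically

def similarity(a, b):
    # Normalized dot product in [-1, 1].
    return float(a @ b) / D

# Three "reference descriptors" of the same place under different conditions.
refs = [random_hv() for _ in range(3)]
signature = bundle(refs)

# The bundled signature stays close to each component descriptor...
print([round(similarity(signature, r), 2) for r in refs])
# ...but is quasi-orthogonal to an unrelated descriptor.
print(round(similarity(signature, random_hv()), 2))
```

Because the bundle has the same dimensionality as its components, adding more reference conditions does not grow memory or query cost, which is the property the paper exploits.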
- A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
  - This paper proposes a unified linear N-point solver that recovers camera linear velocity and 3D point structure from 2D point correspondences with arbitrary timestamps, supporting global shutter, rolling shutter, and event camera sensor modalities.
- AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes
  - This paper proposes AdaptiveAE, which formulates HDR bracketed exposure capture as a Markov Decision Process (MDP) solved with deep reinforcement learning, jointly optimizing ISO and shutter speed combinations to adaptively select optimal exposure parameters for dynamic scenes within a user-defined time budget. The method achieves a PSNR of 39.70 dB on the HDRV dataset, outperforming the previous best method (Hasinoff et al., 37.59 dB) by 2.1 dB.
- Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponents
  - This paper proposes LEAwareSGD, an optimizer that dynamically adjusts the learning rate using Lyapunov exponents (LE) to guide model training toward the edge of chaos, enabling broader exploration of the parameter space within an adversarial data augmentation framework and achieving significant improvements in single domain generalization (SDG).
- AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction via Deep Unfolding Paradigm
  - This paper formulates multi-exposure HDR reconstruction from a MAP estimation perspective, decomposes the problem into two alternating subproblems, alignment and fusion, via a spatial correspondence prior, and unfolds them into an end-to-end trainable AFUNet comprising SAM (spatial alignment), CFM (channel fusion), and DCM (data consistency) modules. The method achieves state-of-the-art performance on three HDR benchmarks, reaching a PSNR-μ of 44.91 dB on the Kalantari dataset.
- Auto-Regressively Generating Multi-View Consistent Images (MV-AR)
  - This paper is the first to introduce autoregressive (AR) models into multi-view image generation. By generating views sequentially, the model leverages all preceding views to enhance consistency across distant viewpoints. It further proposes a unified multimodal condition injection architecture and a Shuffle Views data augmentation strategy, enabling a single model to handle text, image, and geometry conditions simultaneously.
- C4D: 4D Made from 3D through Dual Correspondences
  - This paper proposes C4D, a framework that upgrades existing 3D reconstruction paradigms to full 4D reconstruction by jointly capturing dual temporal correspondences, short-term optical flow and dynamic-aware long-term point tracking (DynPT), on top of DUSt3R's 3D pointmap predictions. Motion masks are generated to separate static and dynamic regions. Three optimization objectives are introduced: camera motion alignment, camera trajectory smoothing, and point trajectory smoothing. The resulting system produces per-frame point clouds, camera parameters, and 2D/3D trajectories, achieving competitive performance across depth estimation, pose estimation, and point tracking tasks.
- DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
  - This paper proposes DeSPITE, a contrastive learning framework that aligns four modalities (LiDAR point clouds, skeletal poses, IMU signals, and text) into a joint embedding space. It is the first to adopt LiDAR (rather than RGB) as the primary visual modality, enabling previously infeasible tasks such as cross-modal matching and retrieval, while also serving as an effective HAR pretraining strategy that achieves state-of-the-art performance on MSR-Action3D and HMPEAR.
- Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection
  - This paper proposes the first sketch-based cross-modal few-shot keypoint detection framework. By leveraging a prototype network, grid-based locator, prototype domain adaptation, and a de-stylization network, the framework detects novel keypoints on unseen categories in real photographs using only a handful of annotated sketches.
- EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration
  - This paper proposes EDFFDNet, which replaces conventional B-spline FFD and TPS with an Exponentially Decaying Free-Form Deformation (EDFFD) model for image registration. Combined with an Adaptive Sparse Motion Aggregator (ASMA) and a progressive correlation strategy, the method achieves a +0.5 dB PSNR improvement on the UDIS-D dataset while reducing parameter count by 70.5% and GPU memory usage by 32.6%.
- Failure Cases Are Better Learned But Boundary Says Sorry: Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training
  - This paper reveals a counterintuitive phenomenon in adversarial training: the model's perceptual change on failure cases is actually smaller than on success cases (i.e., failure cases are "over-learned"). It proposes Robust Perception Adversarial Training (RPAT), which encourages perceptions to change smoothly with perturbations to alleviate the accuracy-robustness trade-off.
- FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases
  - FixTalk addresses identity leakage in GAN-based talking head generation through two lightweight plug-and-play modules: the Enhanced Motion Indicator (EMI) and the Enhanced Detail Indicator (EDI). EMI eliminates identity information from motion features to suppress identity leakage, while EDI repurposes the leaked identity information to compensate for missing details under extreme poses, thereby removing rendering artifacts.
- From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
  - This paper proposes a Progressive Active Learning (PAL) framework that trains infrared small target detection networks through a three-stage strategy (model pre-start, model enhancement, and model refinement), driving the network to actively identify and learn from hard samples in an easy-to-hard manner. Under single point supervision, PAL substantially narrows the performance gap with fully supervised methods (IoU improvements of 8.53%–29.1%).
- Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery
  - This paper proposes DiffGRE, a diffusion-model-based framework for on-the-fly category discovery. It synthesizes novel samples containing virtual category information via Attribute Composition Generation (ACG), filters low-quality samples through Diversity-Driven Refinement (DDR), and injects additional category knowledge via Semi-supervised Leader Encoding (SLE). DiffGRE achieves substantial performance gains over existing OCD methods across 6 fine-grained datasets (average ACC-ALL improvement of 6.5%).
- Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging
  - This paper proposes Hi3DGen, a framework that uses normal maps as an intermediate representation to bridge 2D images and 3D geometry. Through two core components, a Noise-injected Regressive Normal Estimation (NiRNE) module and Normal-Regularized Latent Diffusion (NoRLD), the framework significantly improves the geometric detail fidelity of generated 3D models.
- HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity
  - This paper proposes HiNeuS, a unified neural surface reconstruction framework that simultaneously addresses three core challenges (reflective ambiguity, low-texture degradation, and detail preservation) through three innovations: SDF-guided visibility verification, planar conformal regularization, and rendering-prioritized Eikonal relaxation.
- HyTIP: Hybrid Temporal Information Propagation for Masked Conditional Residual Video Coding
  - This paper proposes HyTIP, which unifies output-recurrence (explicit buffering of decoded frames) and hidden-to-hidden propagation (implicit buffering of latent features) within a single learned video coding framework, achieving coding performance comparable to state-of-the-art methods with only 14% of their buffer size.
- I Am Big, You Are Little; I Am Right, You Are Wrong
  - This work employs the causal-reasoning XAI tool rex to extract Minimal Pixel Sets (MPS) from image classification models, systematically comparing the "attentional focus" of 15 models across 5 architectures. Large models (EVA/ConvNext) are found to make classification decisions using as little as 5% of image pixels, and statistically significant differences in MPS size and spatial location are observed across architectures.
- IAP: Invisible Adversarial Patch Attack through Perceptibility-Aware Localization
  - This paper proposes the IAP framework, which achieves, for the first time in targeted attack settings, truly invisible adversarial patches via perceptibility-aware patch localization and color-preserving gradient updates, while simultaneously bypassing multiple SOTA patch defenses.
- Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery
  - This paper proposes IICMVNCD, the first framework extending Novel Class Discovery (NCD) to the multi-view setting. It captures distributional consistency between known and novel classes via intra-view matrix factorization, and transfers view relationships learned from known classes to novel classes through inter-view weight learning, eliminating the need for pseudo-labels.
- Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy
  - This paper introduces an entropy-constrained supervision setting to establish a fair comparison framework between meta-learning and Whole-Class Training (WCT). It theoretically demonstrates that meta-learning yields tighter generalization bounds, and reveals its advantages in label noise robustness and suitability for heterogeneous tasks. Building on these insights, the proposed MINO framework achieves state-of-the-art performance on unsupervised few-shot and zero-shot tasks.
- Jigsaw++: Imagining Complete Shape Priors for Object Reassembly
  - Jigsaw++ proposes a generative model-based approach for learning complete shape priors, mapping partially assembled fragment point clouds to the shape space of complete objects via a retargeting strategy, thereby improving reassembly quality in a manner orthogonal to existing assembly algorithms.
- Joint Asymmetric Loss for Learning with Noisy Labels
  - This paper extends asymmetric loss functions to the more challenging passive loss setting and proposes Asymmetric Mean Squared Error (AMSE), rigorously establishing the necessary and sufficient conditions for AMSE to satisfy the asymmetric condition. Embedding AMSE into the APL framework yields the Joint Asymmetric Loss (JAL), which achieves comprehensive improvements over existing robust loss methods on CIFAR-10/100 and other datasets.
- Kaputt: A Large-Scale Dataset for Visual Defect Detection
  - Kaputt introduces a large-scale retail logistics defect detection dataset comprising 230,000+ images and 48,000+ unique items (40× the scale of MVTec-AD), and is the first to incorporate significant pose and appearance variation. State-of-the-art anomaly detection methods achieve no more than 56.96% AUROC on this benchmark, exposing critical shortcomings of existing approaches in real-world retail scenarios.
- LaCoOT: Layer Collapse through Optimal Transport
  - This paper proposes LaCoOT, an optimal transport-based regularization strategy that minimizes the Max-Sliced Wasserstein distance between intermediate feature distributions within a network during training, enabling the removal of entire layers post-training while maintaining performance and significantly reducing model depth and inference time.
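This is not the paper's implementation, but the distance it regularizes can be sketched with a dependency-free random-search approximation: project both feature sets onto random unit directions, take the 1-D Wasserstein distance of each projection, and keep the worst case (in practice the maximizing direction is usually found by optimization; all names here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def w1_1d(u, v):
    # 1-D Wasserstein-1 distance between equal-size empirical samples:
    # sort both and average the absolute differences of the order statistics.
    return float(np.mean(np.abs(np.sort(u) - np.sort(v))))

def max_sliced_w1(X, Y, n_dirs=256):
    # Max-Sliced Wasserstein: worst-case 1-D distance over random unit
    # projection directions (random search stands in for gradient ascent).
    d = X.shape[1]
    thetas = rng.normal(size=(n_dirs, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    return max(w1_1d(X @ t, Y @ t) for t in thetas)

# Features from two layers with matching distributions give a small
# distance; a mean-shifted distribution gives a large one.
A = rng.normal(size=(500, 16))
B = rng.normal(size=(500, 16))
C = rng.normal(size=(500, 16)) + 2.0  # shifted along every coordinate

print(round(max_sliced_w1(A, B), 3))  # small
print(round(max_sliced_w1(A, C), 3))  # large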
- LayerD: Decomposing Raster Graphic Designs into Layers
  - This paper proposes LayerD, a method that decomposes raster graphic designs into editable layers by iteratively extracting the unoccluded top layer and completing the background. It leverages domain priors of graphic design (texture-flat regions) for refinement, and introduces a DTW-based hierarchical evaluation protocol.
- LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
  - LayerTracer presents the first cognitive-aligned layered SVG generation framework built upon a Diffusion Transformer (DiT). It constructs a dataset of 20,000+ designer operation sequences, trains a DiT to generate multi-stage rasterized blueprints that simulate designer workflows, and converts these blueprints into clean, editable layered SVGs via layer-wise vectorization and path deduplication. The framework supports both text-driven generation and image-to-layered-SVG conversion.
- Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval
  - This paper presents the first learning paradigm for encoding user-defined multi-level visual hierarchies in hyperbolic space. It introduces an angle-based entailment contrastive loss to learn scene→object→part hierarchies without explicit hierarchy labels, and proposes an optimal-transport-based hierarchical retrieval evaluation metric.
- Loss Functions for Predictor-based Neural Architecture Search
  - This paper presents the first comprehensive and systematic study of 8 loss functions for performance predictors, spanning regression, ranking, and weighting categories. Evaluated across 13 tasks on 5 search spaces, the study reveals the characteristics and complementarity of each loss type, and proposes PWLNAS, a piecewise loss (PW loss) combination method that surpasses the existing state of the art on multiple benchmarks.
- Magic Insert: Style-Aware Drag-and-Drop
  - This paper proposes Magic Insert, the first method to formally define and address the "style-aware drag-and-drop" problem: inserting a subject from an arbitrary style into a target image of a different style, such that the subject automatically adapts to the target style while being composited in a physically plausible manner. The core components are style-aware personalization (LoRA + IP-Adapter style injection) and Bootstrap Domain Adaptation (adapting a real-image-trained insertion model to the stylized image domain).
- Membership Inference Attacks with False Discovery Rate Control
  - This paper proposes MIAFdR, the first membership inference attack (MIA) method with theoretical false discovery rate (FDR) guarantees. By designing a novel non-member conformity score function and an adjusted membership decision strategy, MIAFdR controls FDR and can be integrated as a plug-and-play wrapper into existing MIA methods, maintaining attack performance while providing FDR guarantees.
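MIAFdR's conformity score and adjusted decision strategy are specific to the paper. As background, FDR control is classically obtained by converting scores into conformal p-values and applying the Benjamini–Hochberg step-up procedure; a hedged sketch of that standard pipeline follows (function names and the score direction are illustrative, with higher scores meaning more member-like).

```python
import numpy as np

def conformal_pvalues(scores, nonmember_scores):
    # Conformal p-value for "this candidate is a non-member": the fraction
    # of reference non-member scores at least as member-like as the
    # candidate's score (with the usual +1 smoothing).
    ref = np.asarray(nonmember_scores)
    return np.array([((ref >= s).sum() + 1) / (len(ref) + 1) for s in scores])

def benjamini_hochberg(pvals, alpha=0.1):
    # Classical BH step-up: find the largest k with p_(k) <= alpha * k / m
    # and reject (declare "member") the k smallest p-values.
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    k = int(np.nonzero(below)[0].max() + 1) if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# The candidates with p-values 0.001 and 0.002 are declared members at
# FDR level 0.05; the two larger p-values are not.
print(benjamini_hochberg([0.001, 0.002, 0.2, 0.9], alpha=0.05))
```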
- Multi-view Gaze Target Estimation
  - This paper is the first to extend Gaze Target Estimation (GTE) from single-view to multi-view settings. By integrating three modules, Head Information Aggregation (HIA), Uncertainty-based Gaze Selection (UGS), and Epipolar-based Scene Attention (ESA), the method fuses information across multiple cameras. It significantly outperforms single-view state-of-the-art methods on the newly introduced MVGT dataset and enables cross-view estimation that single-view methods cannot handle.
- NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations
  - This paper proposes NAPPure, a framework that jointly optimizes the underlying clean image and perturbation parameters via likelihood maximization, extending adversarial purification beyond additive perturbations to handle blur, occlusion, and geometric distortion. NAPPure achieves an average robust accuracy of 73.93% on GTSRB, compared to only 43.2% for conventional methods.
- Omni-DC: Highly Robust Depth Completion with Multiresolution Depth Integration
  - This paper presents OMNI-DC, a highly robust depth completion model that achieves zero-shot generalization across diverse datasets and sparse depth patterns via a multiresolution Discrete Depth Integration module (Multi-res DDI), a Laplacian loss, and scale normalization.
- On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations
  - This paper proposes a unified spectral framework to systematically analyze and quantify the trade-off between the smoothness (complexity) and faithfulness of gradient-based explanations. It introduces Expected Frequency (EF) to measure a network's reliance on high-frequency information, controls explanation complexity by convolving ReLU with a Gaussian function, and defines an "explanation gap" to quantify the faithfulness loss induced by surrogate models.
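Convolving ReLU with a Gaussian has a well-known closed form (the same construction that underlies GELU-style smooth activations), which is presumably the smoothing knob referred to above; a minimal sketch, with σ → 0 recovering plain ReLU and larger σ giving a smoother surrogate:

```python
import math

def phi(z):
    # Standard normal pdf.
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def Phi(z):
    # Standard normal cdf.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def smoothed_relu(x, sigma):
    # Convolving ReLU with a Gaussian of std sigma gives the closed form
    # E[max(x + Z, 0)] = x * Phi(x/sigma) + sigma * phi(x/sigma),
    # where Z ~ N(0, sigma^2).
    if sigma == 0:
        return max(x, 0.0)
    return x * Phi(x / sigma) + sigma * phi(x / sigma)

# sigma controls how far the kink at 0 is smeared out; far from the kink
# the surrogate matches ReLU, which is why faithfulness degrades only
# near decision-relevant low-magnitude activations.
for sigma in (0.0, 0.1, 1.0):
    print(sigma, round(smoothed_relu(0.0, sigma), 4))
```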
- Φ-GAN: Physics-Inspired GAN for Generating SAR Images Under Limited Data
  - This paper proposes Φ-GAN, which integrates the ideal Point Scattering Center (PSC) electromagnetic scattering physical model into GAN training as a differentiable neural module. Through a dual physics loss (generator physical consistency constraint + discriminator electromagnetic feature distillation), the method significantly improves the quality and stability of SAR image generation under data-scarce conditions.
- Processing and Acquisition Traces in Visual Encoders: What Does CLIP Know About Your Camera?
  - This paper reveals that visual encoders such as CLIP systematically encode image acquisition and processing parameters (e.g., camera model, ISO, JPEG quality, and other perceptually invisible attributes) within their learned representations, and that these latent signals significantly influence semantic prediction accuracy, both positively and negatively, through statistical correlations with semantic labels.
- Recover Biological Structure from Sparse-View Diffraction Images with Neural Volumetric Prior
  - This paper proposes Neural Volumetric Prior (NVP), a hybrid neural representation combining an explicit 3D feature grid with an implicit MLP, integrated with a physically accurate diffraction-based rendering equation. NVP enables, for the first time, high-fidelity volumetric reconstruction of the 3D refractive index of semi-transparent biological specimens from sparse-view inputs (as few as 6–7 fluorescence images), reducing the required number of images by approximately 50× and processing time by 3×.
- Recovering Parametric Scenes from Very Few Time-of-Flight Pixels
  - This paper investigates the feasibility of recovering 3D parametric scene geometry using an extremely small number (as few as 15 pixels) of low-cost wide-field-of-view ToF sensors. An analysis-by-synthesis framework combining feedforward prediction and differentiable rendering is proposed, demonstrating surprisingly strong performance on tasks such as 6D object pose estimation.
- Revisiting Image Fusion for Multi-Illuminant White-Balance Correction
  - This paper addresses white-balance (WB) correction under multi-illuminant scenes by proposing an efficient Transformer-based fusion model to replace conventional linear fusion, alongside a large-scale multi-illuminant WB dataset containing 16,000+ images. The proposed method achieves a 100% improvement in correction quality over existing methods on the new dataset.
- SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis
  - SemTalk decomposes co-speech motion into rhythm-aligned base motions and semantics-aware sparse motions, and adaptively fuses them via learned semantic scores to achieve high-quality holistic co-speech motion generation with frame-level semantic emphasis.
- Stroke2Sketch: Harnessing Stroke Attributes for Training-Free Sketch Generation
  - This paper proposes Stroke2Sketch, a training-free reference-guided sketch generation framework that achieves fine-grained stroke attribute transfer while preserving content structure within a pretrained diffusion model, via three collaborative modules: Cross-image Stroke Attention (CSA), Directive Attention Module (DAM), and Semantic Preservation Module (SPM).
- Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos
  - This paper proposes Switch-a-View, a model that learns view-switching patterns (ego/exo) from large-scale unlabeled in-the-wild instructional videos to enable automatic view selection in multi-view instructional videos, without requiring explicit best-view annotations.
- SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis
  - This paper proposes SyncDiff, a unified multi-body human-object interaction (HOI) motion synthesis framework that achieves precise multi-body synchronization via alignment scores and an explicit synchronization strategy, while introducing frequency-domain decomposition to model high-frequency interaction semantics.
- Thermal Polarimetric Multi-view Stereo
  - This paper proposes a method for high-fidelity 3D shape reconstruction using thermal polarimetric (long-wave infrared polarimetric) cues. It theoretically demonstrates that LWIR polarimetric observations are unaffected by illumination conditions and material optical properties, enabling accurate 3D reconstruction of transparent, translucent, and heterogeneous objects, significantly outperforming visible-light polarimetric methods.
- Toward Material-Agnostic System Identification from Videos
  - This paper proposes MASIV, the first visual system identification framework that requires no predefined material priors. It replaces hand-crafted elastic/plastic equations with a learnable neural constitutive model, reconstructs dense continuum particle trajectories to provide temporally rich geometric supervision, and infers the intrinsic dynamic properties of objects from multi-view videos.
- You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
  - This paper proposes PHCP, the first framework that addresses the domain gap in heterogeneous collaborative perception at inference time. By leveraging collaborating agents' pseudo labels for few-shot unsupervised domain adaptation, PHCP trains lightweight adapters via self-training to align feature spaces, requiring no joint training, and achieves near-SOTA (HEAL) performance on OPV2V with only a small number of unlabeled samples.