📂 Others
📹 ICCV2025 · 33 paper notes
📌 Same area in other venues: 📷 CVPR2026 (98) · 🔬 ICLR2026 (116) · 💬 ACL2026 (4) · 🧪 ICML2026 (70) · 🤖 AAAI2026 (117) · 🧠 NeurIPS2025 (118)
🔥 Top topics: Dynamic Scenes ×2 · Adversarial Robustness ×2 · Few-/Zero-Shot Learning ×2 · Diffusion Models ×2
- A Hyperdimensional One Place Signature to Represent Them All: Stackable Descriptors For Visual Place Recognition
-
This paper proposes HOPS (Hyperdimensional One Place Signatures), a framework leveraging hyperdimensional computing (HDC) to fuse multiple reference descriptors of the same place captured under varying environmental conditions into a unified representation, substantially improving the robustness and recall of Visual Place Recognition (VPR) without increasing computational or memory overhead.
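The fusion idea in the HOPS summary can be illustrated with the standard HDC "bundling" operation. This is a minimal sketch, assuming bundling means an element-wise sum followed by L2 normalization. The toy 4-D descriptors and the helper names `bundle` and `cosine` are hypothetical and not taken from the paper. The fused signature has the same length as a single descriptor, which is why matching cost and storage do not grow with the number of reference conditions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def bundle(descriptors):
    """Fuse same-length descriptors by element-wise summation, then L2-normalize.

    Assumption: this is generic HDC bundling, not necessarily the exact HOPS recipe.
    """
    summed = [sum(vals) for vals in zip(*descriptors)]
    return l2_normalize(summed)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Toy descriptors (hypothetical values): place A seen under two conditions, place B under one.
place_a_day = [0.9, 0.1, 0.0, 0.2]
place_a_night = [0.6, 0.0, 0.3, 0.4]
place_b_day = [0.0, 0.8, 0.5, 0.1]

signature_a = bundle([place_a_day, place_a_night])  # one vector per place
signature_b = bundle([place_b_day])

# A query of place A under an unseen condition is matched against one signature per place.
query = [0.7, 0.05, 0.2, 0.3]
references = [("A", signature_a), ("B", signature_b)]
best = max(references, key=lambda kv: cosine(query, kv[1]))[0]
```

Because each place keeps a single signature, adding a new reference condition only changes the values in that signature, not the size of the reference database.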
- A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks
-
This paper proposes a unified linear N-point solver that recovers camera linear velocity and 3D point structure from 2D point correspondences with arbitrary timestamps, supporting global shutter, rolling shutter, and event camera sensor modalities.
- AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes
-
This paper proposes AdaptiveAE, which formulates HDR bracketed exposure capture as a Markov Decision Process (MDP) and solves it with deep reinforcement learning, jointly optimizing ISO and shutter speed to select exposure parameters adaptively for dynamic scenes within a user-defined time budget. The method achieves a PSNR of 39.70 dB on the HDRV dataset, outperforming the previous best method, Hasinoff et al. (37.59 dB), by about 2.1 dB.
- Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponents
-
This paper proposes LEAwareSGD, an optimizer that dynamically adjusts the learning rate using Lyapunov exponents (LE) to guide model training toward the edge of chaos, enabling broader exploration of the parameter space within an adversarial data augmentation framework and achieving significant improvements in single domain generalization (SDG).
- Auto-Regressively Generating Multi-View Consistent Images (MV-AR)
-
This paper is the first to introduce autoregressive (AR) models into multi-view image generation. By generating views sequentially, the model leverages all preceding views to enhance consistency across distant viewpoints. It further proposes a unified multimodal condition injection architecture and a Shuffle Views data augmentation strategy, enabling a single model to handle text, image, and geometry conditions simultaneously.
- C4D: 4D Made from 3D through Dual Correspondences
-
This paper proposes C4D, a framework that upgrades existing 3D reconstruction paradigms to full 4D reconstruction by jointly capturing dual temporal correspondences — short-term optical flow and dynamic-aware long-term point tracking (DynPT) — on top of DUSt3R's 3D pointmap predictions. Motion masks are generated to separate static and dynamic regions. Three optimization objectives are introduced: camera motion alignment, camera trajectory smoothing, and point trajectory smoothing. The resulting system produces per-frame point clouds, camera parameters, and 2D/3D trajectories, achieving competitive performance across depth estimation, pose estimation, and point tracking tasks.
- Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection
-
This paper proposes the first sketch-based cross-modal few-shot keypoint detection framework. By leveraging a prototype network, grid-based locator, prototype domain adaptation, and a de-stylization network, the framework detects novel keypoints on unseen categories in real photographs using only a handful of annotated sketches.
- EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration
-
This paper proposes EDFFDNet, which replaces conventional B-spline FFD and TPS with an Exponentially Decaying Free-Form Deformation (EDFFD) model for image registration. Combined with an Adaptive Sparse Motion Aggregator (ASMA) and a progressive correlation strategy, the method achieves a +0.5 dB PSNR improvement on the UDIS-D dataset while reducing parameter count by 70.5% and GPU memory usage by 32.6%.
- FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases
-
This paper proposes FixTalk, a framework that addresses identity leakage in GAN-based talking head generation through two lightweight plug-and-play modules: the Enhanced Motion Indicator (EMI) and the Enhanced Detail Indicator (EDI). EMI removes identity information from motion features to suppress identity leakage, while EDI repurposes the leaked identity information to compensate for missing details under extreme poses, thereby removing rendering artifacts.
- HyTIP: Hybrid Temporal Information Propagation for Masked Conditional Residual Video Coding
-
This paper proposes HyTIP, a learned video coding framework that unifies output-recurrence (explicit buffering of decoded frames) and hidden-to-hidden propagation (implicit buffering of latent features), achieving coding performance comparable to state-of-the-art methods while using only 14% of their buffer size.
- I Am Big, You Are Little; I Am Right, You Are Wrong
-
This work employs the causal-reasoning XAI tool ReX to extract Minimal Pixel Sets (MPS) from image classification models, systematically comparing the "attentional focus" of 15 models across 5 architectures. Large models (EVA, ConvNeXt) are found to make classification decisions using as little as 5% of image pixels, and statistically significant differences in MPS size and spatial location are observed across architectures.
- Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery
-
This paper proposes IICMVNCD, the first framework extending Novel Class Discovery (NCD) to the multi-view setting. It captures distributional consistency between known and novel classes via intra-view matrix factorization, and transfers view relationships learned from known classes to novel classes through inter-view weight learning, eliminating the need for pseudo-labels.
- Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy
-
This paper introduces an entropy-constrained supervision setting to establish a fair comparison framework between meta-learning and Whole-Class Training (WCT). It theoretically demonstrates that meta-learning yields tighter generalization bounds, and reveals its advantages in label noise robustness and suitability for heterogeneous tasks. Building on these insights, the proposed MINO framework achieves state-of-the-art performance on unsupervised few-shot and zero-shot tasks.
- Jigsaw++: Imagining Complete Shape Priors for Object Reassembly
-
Jigsaw++ proposes a generative model-based approach for learning complete shape priors, mapping partially assembled fragment point clouds to the shape space of complete objects via a retargeting strategy, thereby improving reassembly quality in a manner orthogonal to existing assembly algorithms.
- Joint Asymmetric Loss for Learning with Noisy Labels
-
This paper extends asymmetric loss functions to the more challenging passive loss setting. It proposes the Asymmetric Mean Squared Error (AMSE), rigorously establishes the necessary and sufficient conditions for AMSE to satisfy the asymmetric condition, and embeds AMSE into the APL framework to construct the Joint Asymmetric Loss (JAL), which improves on existing robust loss methods on CIFAR-10/100 and other datasets.
- LaCoOT: Layer Collapse through Optimal Transport
-
This paper proposes LaCoOT, an optimal transport-based regularization strategy that minimizes the Max-Sliced Wasserstein distance between intermediate feature distributions within a network during training, enabling the removal of entire layers post-training while maintaining performance and significantly reducing model depth and inference time.
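To make the quantity named in the LaCoOT summary concrete, the sketch below approximates a max-sliced Wasserstein distance between two equal-size sets of feature vectors. It is an illustration under stated assumptions, not the paper's training code: the maximizing direction is searched over random unit vectors (a real regularizer would more likely optimize it), the 1-D distance is the order-1 Wasserstein distance, and the function names are hypothetical.

```python
import math
import random

def project(points, direction):
    """Project each point onto a direction and return the sorted 1-D values."""
    return sorted(sum(p_i * d_i for p_i, d_i in zip(p, direction)) for p in points)

def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_sliced_w1(xs, ys, n_directions=256, seed=0):
    """Approximate the max-sliced 1-Wasserstein distance between two point sets.

    Assumptions: equal sample counts, and random search over directions
    instead of an exact maximization.
    """
    assert len(xs) == len(ys), "this sketch needs equal-size sample sets"
    rng = random.Random(seed)
    dim = len(xs[0])
    best = 0.0
    for _ in range(n_directions):
        d = random_unit_vector(dim, rng)
        px, py = project(xs, d), project(ys, d)
        # For equal-size 1-D samples, W1 is the mean gap between sorted values.
        w1 = sum(abs(a - b) for a, b in zip(px, py)) / len(px)
        best = max(best, w1)
    return best

# Toy check: a point set and a copy translated by (3, 4), whose length is 5.
layer_in = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer_out = [[x + 3.0, y + 4.0] for x, y in layer_in]
same = max_sliced_w1(layer_in, layer_in)
shifted = max_sliced_w1(layer_in, layer_out)
```

For a pure translation, the distance along any direction is the projected shift, so the estimate is bounded above by the shift length and approaches it as more directions are sampled. A layer whose input and output distributions are driven close under this distance behaves nearly like an identity map, which is the intuition behind removing it.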
- LayerD: Decomposing Raster Graphic Designs into Layers
-
This paper proposes LayerD, a method that decomposes raster graphic designs into editable layers by iteratively extracting the unoccluded top layer and completing the background. It leverages domain priors of graphic design (texture-flat regions) for refinement, and introduces a DTW-based hierarchical evaluation protocol.
- LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
-
LayerTracer presents the first cognitive-aligned layered SVG generation framework built upon a Diffusion Transformer (DiT). It constructs a dataset of 20,000+ designer operation sequences, trains a DiT to generate multi-stage rasterized blueprints that simulate designer workflows, and converts these blueprints into clean, editable layered SVGs via layer-wise vectorization and path deduplication. The framework supports both text-driven generation and image-to-layered-SVG conversion.
- Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval
-
This paper presents the first learning paradigm for encoding user-defined multi-level visual hierarchies in hyperbolic space. It introduces an angle-based entailment contrastive loss to learn scene→object→part hierarchies without explicit hierarchy labels, and proposes an optimal-transport-based hierarchical retrieval evaluation metric.
- Loss Functions for Predictor-based Neural Architecture Search
-
This paper presents the first comprehensive and systematic study of 8 loss functions for performance predictors, spanning regression, ranking, and weighting categories. Evaluated across 13 tasks on 5 search spaces, the study reveals the characteristics and complementarity of each loss type, and proposes PWLNAS—a piecewise loss (PW loss) combination method—that surpasses existing state-of-the-art on multiple benchmarks.
- Magic Insert: Style-Aware Drag-and-Drop
-
This paper proposes Magic Insert, the first method to formally define and address the "style-aware drag-and-drop" problem—inserting a subject from an arbitrary style into a target image of a different style, such that the subject automatically adapts to the target style while being composited in a physically plausible manner. The core components are style-aware personalization (LoRA + IP-Adapter style injection) and Bootstrap Domain Adaptation (adapting a real-image-trained insertion model to the stylized image domain).
- NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations
-
This paper proposes NAPPure, a framework that jointly optimizes the underlying clean image and perturbation parameters via likelihood maximization, extending adversarial purification beyond additive perturbations to handle blur, occlusion, and geometric distortion. NAPPure achieves an average robust accuracy of 73.93% on GTSRB, compared to only 43.2% for conventional methods.
- On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations
-
This paper proposes a unified spectral framework to systematically analyze and quantify the trade-off between the smoothness (complexity) and faithfulness of gradient-based explanations. It introduces Expected Frequency (EF) to measure a network's reliance on high-frequency information, controls explanation complexity by convolving ReLU with a Gaussian function, and defines an "explanation gap" to quantify the faithfulness loss induced by surrogate models.
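Convolving ReLU with a zero-mean Gaussian has a well-known closed form, which gives a feel for how a single width parameter controls smoothness. The sketch below evaluates that closed form. Whether the paper parameterizes its surrogate activation exactly this way is an assumption, and `gaussian_smoothed_relu` is a hypothetical name.

```python
import math

def relu(x):
    """Standard ReLU."""
    return max(0.0, x)

def gaussian_smoothed_relu(x, sigma):
    """ReLU convolved with N(0, sigma^2): x * Phi(x / sigma) + sigma * phi(x / sigma).

    Phi and phi are the standard normal CDF and PDF.
    A non-positive sigma falls back to plain ReLU.
    """
    if sigma <= 0:
        return relu(x)
    z = x / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return x * cdf + sigma * pdf

# Larger sigma rounds off the kink at zero; far from zero the function matches ReLU.
at_kink = gaussian_smoothed_relu(0.0, 1.0)
far_right = gaussian_smoothed_relu(10.0, 1.0)
far_left = gaussian_smoothed_relu(-10.0, 1.0)
```

Increasing `sigma` suppresses high-frequency content in the activation, giving smoother gradients at the cost of a larger deviation from the original network, which is the trade-off the summary describes.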
- Φ-GAN: Physics-Inspired GAN for Generating SAR Images Under Limited Data
-
This paper proposes Φ-GAN, which integrates the ideal Point Scattering Center (PSC) electromagnetic scattering physical model into GAN training as a differentiable neural module. Through a dual physics loss (generator physical consistency constraint + discriminator electromagnetic feature distillation), the method significantly improves the quality and stability of SAR image generation under data-scarce conditions.
- Processing and Acquisition Traces in Visual Encoders: What Does CLIP Know About Your Camera?
-
This paper reveals that visual encoders such as CLIP systematically encode image acquisition and processing parameters (e.g., camera model, ISO, JPEG quality, and other perceptually invisible attributes) within their learned representations, and that these latent signals significantly influence semantic prediction accuracy—both positively and negatively—through statistical correlations with semantic labels.
- Recover Biological Structure from Sparse-View Diffraction Images with Neural Volumetric Prior
-
This paper proposes Neural Volumetric Prior (NVP), a hybrid neural representation combining an explicit 3D feature grid with an implicit MLP, integrated with a physically accurate diffraction-based rendering equation. NVP enables, for the first time, high-fidelity volumetric reconstruction of the 3D refractive index of semi-transparent biological specimens from sparse-view inputs (as few as 6–7 diffraction images), reducing the required number of images by approximately 50× and processing time by 3×.
- Recovering Parametric Scenes from Very Few Time-of-Flight Pixels
-
This paper investigates the feasibility of recovering 3D parametric scene geometry from extremely few measurements (as few as 15 pixels) captured by low-cost, wide-field-of-view ToF sensors. It proposes an analysis-by-synthesis framework that combines feedforward prediction with differentiable rendering, and demonstrates surprisingly strong performance on tasks such as 6D object pose estimation.
- Revisiting Image Fusion for Multi-Illuminant White-Balance Correction
-
This paper addresses white-balance (WB) correction in multi-illuminant scenes by proposing an efficient Transformer-based fusion model to replace conventional linear fusion, alongside a large-scale multi-illuminant WB dataset containing 16,000+ images. On the new dataset, the proposed method reports up to a 100% improvement over existing methods.
- Stroke2Sketch: Harnessing Stroke Attributes for Training-Free Sketch Generation
-
This paper proposes Stroke2Sketch, a training-free reference-guided sketch generation framework that achieves fine-grained stroke attribute transfer while preserving content structure within a pretrained diffusion model, via three collaborative modules: Cross-image Stroke Attention (CSA), Directive Attention Module (DAM), and Semantic Preservation Module (SPM).
- Switch-a-View: View Selection Learned from Unlabeled In-the-wild Videos
-
This paper proposes Switch-a-View, a model that learns view-switching patterns (ego/exo) from large-scale unlabeled in-the-wild instructional videos to enable automatic view selection in multi-view instructional videos, without requiring explicit best-view annotations.
- SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis
-
This paper proposes SyncDiff, a unified multi-body human-object interaction (HOI) motion synthesis framework that achieves precise multi-body synchronization via alignment scores and an explicit synchronization strategy, while introducing frequency-domain decomposition to model high-frequency interaction semantics.
- Toward Material-Agnostic System Identification from Videos
-
This paper proposes MASIV, the first visual system identification framework that requires no predefined material priors. It replaces hand-crafted elastic/plastic equations with a learnable neural constitutive model, reconstructs dense continuum particle trajectories to provide temporally rich geometric supervision, and infers the intrinsic dynamic properties of objects from multi-view videos.
- You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
-
This paper proposes PHCP, the first framework that addresses the domain gap in heterogeneous collaborative perception at inference time. PHCP uses collaborating agents' pseudo labels for few-shot unsupervised domain adaptation, training lightweight adapters via self-training to align feature spaces without any joint training. With only a small number of unlabeled samples, it achieves performance on OPV2V close to that of the state-of-the-art method HEAL.