
📂 Others

📷 CVPR2025 · 58 paper notes

📌 Same area in other venues: 📷 CVPR2026 (105) · 🔬 ICLR2026 (116) · 💬 ACL2026 (4) · 🧪 ICML2026 (70) · 🤖 AAAI2026 (117) · 🧠 NeurIPS2025 (121)

🔥 Top topics: Adversarial Robustness ×6

BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending: Proposes a manufacturability metric taxonomy for sheet metal bending processes (categorized into a four-quadrant framework based on two dimensions: configuration-dependency \(\times\) feasibility/complexity), and constructs BenDFM, the first synthetic dataset containing 20,000 parts (comprising both manufacturable and non-manufacturable samples). Benchmarking indicates that graph-structured representations (UV-Net) outperform point clouds (PointNext), and predicting configuration-dependent metrics is more challenging.
Bounds on Agreement between Subjective and Objective Measurements: By assuming only that the voting mean converges to the true quality, mathematical bounds on PCC (upper bound) and MSE (lower bound) between subjective tests (MOS) and objective estimators are derived. A Binomial-based voting model, BinoVotes, is proposed to enable the calculation of these bounds even when voting variance is unavailable. Validation on 18 subjective test datasets demonstrates that BinoVotes bounds align closely with full-data-driven bounds.
CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction: This paper proposes CARE Transformer, which decouples the learning of local inductive bias and long-range dependencies through asymmetrical feature decoupling. Fueled by a dynamic memory unit and a dual interaction module that fully exploit feature complementarity, it delivers a mobile-friendly linear-complexity vision Transformer. It achieves 78.4% top-1 accuracy on ImageNet with only 0.7 GMACs.
Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis: By providing a perfect oracle noise transition matrix T, this work demonstrates that Forward Correction still suffers from training collapse under ideal conditions (first ascending, then descending, and eventually converging to the uncorrected baseline). It systematically diagnoses the root causes of failure from three levels: macro (convergence end-state), micro (gradient dynamics), and information-theoretic (irreversible information loss in noisy channels). This reveals that the failure is not a matter of inaccurate T estimation, but a structural deficiency of high-capacity networks under finite samples.
Do ImageNet-trained Models Learn Shortcuts? The Impact of Frequency Shortcuts on Generalization: This paper proposes a Hierarchical Frequency Shortcut Search (HFSS) method to efficiently discover frequency shortcuts learned by CNNs and Transformers at the ImageNet-1K scale for the first time (permitting correct classification with only 5% of frequencies). It reveals that frequency shortcuts are surprisingly beneficial in texture-preserving OOD tests but detrimental in stylized tests (IN-R/IN-S), pointing out that existing OOD evaluation frameworks overlook the impact of frequency shortcuts.
EBS-EKF: Accurate and High Frequency Event-based Star Tracking: This paper proposes EBS-EKF, which models the circuit behavior of event cameras under low-light conditions to obtain intensity-dependent centroid offset correction, combined with a 3D Extended Kalman Filter for star tracking, achieving an order of magnitude higher accuracy than existing methods on real night-sky data.
EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching: EDM is proposed as the first learning-based dense feature matching method for Equirectangular Projection (ERP) panoramic images. It addresses the polar distortion of ERP through a Spherical Space Alignment Module (SSAM, utilizing spherical positional encoding with 3D Cartesian coordinates + Gaussian Process regression) and geodesic flow refinement. On Matterport3D, it outperforms DKM by 26.72% in AUC@5°, and on Stanford2D3D by 42.62%.
Effortless Active Labeling for Long-Term Test-Time Adaptation: This work proposes EATTA, an approach that labels only the single most valuable sample per batch (instead of multiple), selected by feature perturbation sensitivity, during long-term test-time adaptation (TTA). Combined with a gradient norm debiasing strategy that balances the gradients of the supervised and unsupervised losses, EATTA achieves an average error rate of 50.9% on ImageNet-C at an extremely low annotation cost, outperforming SimATTA by 3.9% even though SimATTA uses three times the labeling budget.
Event Ellipsometer: Event-based Mueller-Matrix Video Imaging: The first system to achieve 30fps video-rate Mueller matrix imaging. By using an event camera to capture the intensity modulations caused by a rapidly rotating quarter-wave plate (QWP), the system maps event time differences to Mueller matrix ratios and reconstructs physically valid Mueller matrix videos using SVD estimation combined with spatiotemporal propagation.
EVOS: Efficient Implicit Neural Training via EVOlutionary Selector: This paper proposes EVOS, an evolutionary selection paradigm (sparse fitness evaluation + frequency-guided crossover + augmented unbiased mutation) for intelligent sparse sampling of INR training coordinates. EVOS reduces training time by 48-66% (180s \(\rightarrow\) 97s) while maintaining or even improving reconstruction quality (PSNR 37.81 vs. standard 37.10).
Exploring Contextual Attribute Density in Referring Expression Counting (CAD-GD): The concept of Contextual Attribute Density (CAD) is proposed to enhance referring expression counting. By incorporating three modules—a U-shape density estimator, CAD attention, and dynamic query initialization—the approach reduces counting errors on the REC-8K dataset by approximately 30% compared to GroundingREC (MAE decreases from 6.80 to 5.43).
Feature Selection for Latent Factor Models: A class-specific feature selection method based on signal-to-noise ratio (SNR) is proposed for low-rank generative models (PPCA/LFA/ELF). Accommodating a new class requires only \(O(1)\) computation without retraining models of historical classes, thereby circumventing catastrophic forgetting. Furthermore, a novel non-parametric latent factor model, ELF, is proposed, and its effectiveness is validated on microarray cancer classification and high-dimensional feature selection.
FIction: 4D Future Interaction Prediction from Video: This paper proposes FIction, the first model for 4D future interaction prediction from video. Given an input video, it predicts which objects in the environment a person will interact with, at what 3D locations the interaction will occur, and how the interaction will be executed (3D human pose), achieving a relative gain of more than 30% over prior methods on the EgoExo4D dataset.
Focal Split: Untethered Snapshot Depth from Differential Defocus: Inspired by jumping spider vision, this work constructs Focal Split, the first untethered (battery-powered) snapshot depth-from-differential-defocus camera. By utilizing a beamsplitter to split the optical path across two sensors with different focal distances, it estimates depth in real-time on a Raspberry Pi using only 500 FLOPs/pixel and 4.9W of power.
Foundations of the Theory of Performance-Based Ranking: This paper establishes a rigorous mathematical foundation for performance-based ranking based on probability theory and order theory. It proposes a general framework consisting of 6 pillars and 3 axioms, defines a parameterized family of "ranking scores," and demonstrates in binary classification tasks that metrics such as accuracy, TPR, TNR, PPV, and F-score satisfy these axioms, whereas commonly used metrics like MCC and geometric mean are unsuitable for ranking.
Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers: Proposes the first geometric solver method to estimate full 6-DoF egomotion (angular and linear velocities) solely using event streams. By establishing line-segment geometric constraints on the eventail manifold—specifically incidence and novel coplanarity relations—a sparse solver requiring as few as 8 events is designed, enabling decoupled rotation and translation estimation without requiring an IMU.
Gradient-Guided Annealing for Domain Generalization: This paper proposes the GGA method, which uses simulated annealing in the early stages of training to search for parameter space points where gradients across domains are aligned (by maximizing the minimum cosine similarity of gradients between domains). This guides the model to find starting points for domain-invariant features at the beginning of optimization, improving domain generalization without data augmentation. It can be combined with existing DG methods to obtain significant improvements.
HotSpot: Signed Distance Function Optimization with an Asymptotically Sufficient Condition: This paper proposes HotSpot, which utilizes the classical relationship between the screened Poisson equation and distance fields to design a new heat loss. It provides an asymptotically sufficient condition for optimizing neural signed distance functions, ensuring that the implicit function converges to the true distance field. It significantly outperforms existing methods in 2D/3D surface reconstruction with complex topologies.
Image Reconstruction from Readout-Multiplexed Single-Photon Detector Arrays: This paper formalizes the multiphoton coincidence resolution problem in row-column readout-multiplexed single-photon detector arrays as an inverse imaging problem. It proposes a probabilistic Multiphoton Estimator (ME) capable of resolving the spatial locations of up to 4 concurrent incident photons. Compared to traditional methods, ME achieves a 3-4 dB PSNR improvement on a 32×32 array and reduces the required frame count by approximately 4 times.
Improving Accuracy and Calibration via Differentiated Deep Mutual Learning: Proposes Diff-DML (Differentiated Deep Mutual Learning), which simultaneously improves accuracy and uncertainty calibration quality while maintaining the prediction diversity of the ensemble models through two core designs: Differentiated Training Strategy (DTS) and Diversity-Preserving Learning Objective (DPLO).
Improving Transferable Targeted Attacks with Feature Tuning Mixup: Proposes FTM (Feature Tuning Mixup), which improves the transferability of targeted adversarial attacks by mixing optimized attack-specific perturbations with random clean perturbations in the feature space of the surrogate model. A momentum-based stochastic update strategy keeps the method computationally efficient, and the average success rate across 14 black-box models improves from 74.6% to 77.4%.
Instance-wise Supervision-level Optimization in Active Learning: This paper proposes the ISO (Instance-wise Supervision-level Optimization) framework. In active learning, it not only selects which samples to annotate but also automatically determines the optimal annotation level (exact vs. coarse labels) for each sample. Through a value-cost ratio (VCR) and a diversity-aware batch selection algorithm, it achieves over 10% higher accuracy than traditional active learning under a fixed budget constraint.
Integral Fast Fourier Color Constancy: This paper proposes IFFCC, which extends the FFCC algorithm to multi-illuminant scenes. By using an integral UV histogram to accelerate regional histogram computation and by parallelizing the Fourier convolution operations, it delivers real-time multi-illuminant automatic white balance with accuracy comparable to pixel-level neural networks, while using 400x fewer parameters and running 20-100x faster.
LATTE-MV: Learning to Anticipate Table Tennis Hits from Monocular Videos: LATTE-MV proposes a scalable system to reconstruct 3D match data from monocular table tennis match videos and trains a Transformer model to anticipate the opponent's striking intention. Combined with conformal prediction for uncertainty-aware anticipatory control, it improves the robot's return success rate in simulation from 49.9% to 59.0%.
Locally Orderless Images for Optimization in Differentiable Rendering: This work proposes an inverse rendering optimization method that leverages local histogram matching within a 3D scale space (inner scale \(\sigma\), tonal scale \(\beta\), and extent scale \(\alpha\)) of Locally Orderless Images (LOIs). It expands the support range of sparse gradients without modifying differentiable renderers, effectively avoiding local optima.
MagicArticulate: Make Your 3D Models Articulation-Ready: This paper proposes MagicArticulate, a two-stage framework. The first stage models skeleton generation as a sequence prediction task using an auto-regressive Transformer. The second stage predicts skinning weights via a functional diffusion process combined with a volumetric geodesic distance prior. Together with the large-scale Articulation-XL dataset (33K+), it achieves automatic conversion from static 3D models to animatable assets.
NeISF++: Neural Incident Stokes Field for Polarized Inverse Rendering of Conductors and Dielectrics: NeISF++ extends polarized inverse rendering from supporting only dielectrics to supporting both conductors and dielectrics. By introducing a generalized pBRDF model with a binary control variable \(m\), complex refractive index modeling, and DoLP geometric initialization, it reduces the normal angular error on synthetic conductor scenes to 1.789° (an 83% reduction compared to NeISF's 10.303°).
On the Generalization of Handwritten Text Recognition Models: This paper presents the first systematic analysis of the out-of-distribution (OOD) generalization capability of HTR models. Through 336 OOD evaluations of 8 SOTA models across 7 datasets (5 languages), it is discovered that textual discrepancy is the most critical factor affecting generalization, and the OOD error can be reliably predicted in 70% of cases (with a deviation \(< 10\) percentage points).
Open Set Label Shift with Test Time Out-of-Distribution Reference: To address the Open Set Label Shift (OSLS) problem—where the target distribution contains out-of-distribution (OOD) classes unseen in the source distribution and the label distribution shifts—this paper proposes a retrain-free three-stage estimation method. By leveraging an existing in-distribution (ID) classifier and an OOD detector, the method estimates the target-domain label distribution and OOD proportion using the EM algorithm, and subsequently corrects the classifier to adapt to the target distribution.
Order-One Rolling Shutter Cameras: A unified theory of Order-One Rolling Shutter (RS1) cameras is proposed. It gives a mathematical characterization of the rolling shutter cameras that map a spatial point to exactly one image point, constructs explicit parametrizations, and provides a complete classification of the 31 relative pose minimal problems for linear RS1 cameras.
PLeaS: Merging Models with Permutations and Least Squares: This work proposes PLeaS, a two-step model merging algorithm. First, it exploits permutation symmetry to partially match the features of two models (merging similar features while retaining dissimilar ones). Second, it uses layer-wise least squares optimization to align the merged model's features with the permuted ensemble features of the original models, achieving up to a 15 percentage point improvement over existing methods at the same model size.
Potential Field Based Deep Metric Learning: PFML is proposed to replace traditional tuple mining with the concept of physical potential fields for metric learning. Each sample creates a continuous attractive field (intra-class) and repulsive field (inter-class) in the embedding space with a distance decay property (weaker interactions at long distances), achieving 92.7% R@1 on Cars-196 (prev. SOTA was 89.6%).
Practical Solutions to the Relative Pose of Three Calibrated Cameras: This paper addresses the classic challenge of relative pose estimation for three calibrated cameras from four point correspondences in three views (4p3v). It proposes practical solutions based on approximate geometry—using affine camera approximations or mean point correspondence approximations to estimate the relative pose of the first two cameras, and then registering the third camera via P3P. Combined with local optimization, this approach achieves SOTA accuracy on real-world data.
Regor: Progressive Correspondence Regenerator for Robust 3D Registration: Regor proposes a progressive correspondence regeneration strategy. Unlike traditional "top-down" outlier removal methods, it iteratively generates high-quality correspondences in local spheres using a "bottom-up" approach. The number of generated correct matches is up to 10 times that of existing methods, achieving robust registration even under weak feature conditions.
Radio Frequency Ray Tracing with Neural Object Representation for Enhanced RF Modeling: The RFScape framework is proposed, which learns a neural representation of electromagnetic properties for each individual object. By combining this with the composability of traditional ray tracing, it achieves high-precision RF propagation modeling with sparse training samples, outperforming traditional ray tracing by 13 dB and the SOTA neural baseline by 5 dB.
Removing Reflections from RAW Photos: Proposes the first end-to-end reflection removal system based on RAW images. It simulates realistic reflections (including Fresnel, double reflection, white balance, and exposure) in the XYZ color space, trains an EfficientNet+BiFPN base model to separate the transmission and reflection layers, and then uses a Gaussian pyramid upsampler to preserve high-resolution details. An optional front-facing camera context map is leveraged to aid inference, achieving a PSNR of 30.62 dB.
Rethinking Epistemic and Aleatoric Uncertainty for Active Open-Set Annotation: An Energy-Based Approach: The EAOA framework is proposed, which utilizes free energy-based epistemic uncertainty (EU) and aleatoric uncertainty (AU) metrics, combined with an adaptive coarse-to-fine query strategy, to effectively select samples that are both known-class and highly informative in active open-set annotation scenarios.
Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods: A framework is established for rooftop wind field reconstruction based on wind tunnel PIV experimental data (rather than CFD simulations). The reconstruction performance of Kriging interpolation and three deep learning models (UNet, ViTAE, and CWGAN) is systematically compared using 5-30 sparse sensors. Results show that Multi-Directional Training (MDT) enables deep learning methods to comprehensively outperform Kriging (with SSIM improvements of up to 32.7%), and optimizing sensor layouts via QR decomposition enhances robustness by up to 27.8%.
Scene-Agnostic Pose Regression for Visual Localization: Proposes a new task paradigm called "Scene-Agnostic Pose Regression" (SPR), which regresses the relative poses of subsequent frames using the first frame of the sequence as the coordinate origin. This avoids the need for retraining in APR, database retrieval in RPR, and cumulative drift in VO. The paper also introduces 360SPR, a large-scale dataset containing 200K panorama images, and a dual-branch SPR-Mamba model.
SDF-Net: Structure-Aware Disentangled Feature Learning for Optical–SAR Ship Re-Identification: SDF-Net is proposed, which leverages the physical prior of ships as rigid bodies. It extracts scale-invariant gradient energy statistics in intermediate ViT layers as cross-modal geometric anchors. In the terminal layer, features are disentangled into modality-invariant shared features and modality-specific features, which are then fused through an additive residual, achieving state-of-the-art (SOTA) performance in optical-SAR ship re-identification.
STRAP-ViT: Segregated Tokens with Randomized Transformations for Defense against Adversarial Patches in ViTs: STRAP-ViT proposes a training-free, plug-and-play defense module for ViTs. It utilizes Jensen-Shannon divergence to segregate tokens affected by adversarial patches from benign tokens and then applies randomized composite transformations to neutralize their adversarial effects, achieving a robust accuracy within 2-3% of the clean baseline across multiple ViT architectures and attack methods.
Subnet-Aware Dynamic Supernet Training for Neural Architecture Search: This paper proposes a dynamic supernet training strategy (CaLR + MS). It addresses the subnet training unfairness problem via complexity-aware learning rate scheduling, and alleviates the gradient noise issue through momentum separation, significantly improving the search performance of N-shot NAS with minimal computational overhead.
Sufficient Invariant Learning for Distribution Shift: This paper proposes the Sufficient Invariant Learning (SIL) framework to improve robustness under distribution shift by learning diverse subsets of invariant features instead of a single invariant feature. It designs the ASGDRO algorithm to implement SIL by seeking a common flat minimum across environments, achieving SOTA performance on multiple distribution shift benchmarks.
TAET: Two-Stage Adversarial Equalization Training on Long-Tailed Distributions: This paper proposes TAET, a two-stage adversarial equalization training framework: it first stabilizes early training using cross-entropy loss, and then balances the performance across all classes using Hierarchical Adversarial Robust Learning (HARL) combined with three losses (BCL/HDL/RCEL). It also introduces a Balanced Robustness evaluation metric to address the insufficient robustness of tail classes in adversarial training under long-tailed distributions.
TensoFlow: Tensorial Flow-based Sampler for Inverse Rendering: This paper proposes TensoFlow, which learns a spatially and directionally aware importance sampler using Tensorial Normalizing Flow to replace fixed, predefined samplers (e.g., cosine-weighted, GGX) in inverse rendering. This significantly reduces the variance of Monte Carlo estimation of the rendering equation and improves the quality of material and illumination decomposition.
Three-View Focal Length Recovery From Homographies: An efficient solver is proposed to recover focal lengths from three-view homographies. By leveraging normal consistency constraints to derive new explicit constraints, the problem is formulated as solving univariate or bivariate polynomials, achieving speedups of 80x-270x over existing methods.
Towards In-the-Wild 3D Plane Reconstruction from a Single Image: ZeroPlane proposes the first cross-domain zero-shot 3D plane reconstruction framework. By constructing a large-scale plane benchmark dataset with 14 datasets and 560k annotations, and designing a normal-offset decoupled classification-regression paradigm along with a pixel-level geometric enhanced embedding module, it achieves generalization performance significantly outperforming existing methods in diverse indoor and outdoor scenes.
Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks: This paper proposes Probability Margin Attack (PMA), which defines the adversarial margin loss function in the probability space rather than the logits space. Its gradient is equivalent to an adaptively weighted combination of untargeted and targeted cross-entropy losses, and the attack consistently outperforms existing individual attack methods. Based on this, a million-scale evaluation dataset, CC1M, is constructed to conduct the first million-scale white-box robustness evaluation of adversarially trained models.
TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception: Proposes the TraF-Align framework, which learns the spatiotemporal flow path of features by predicting object motion trajectories at the feature level. It generates temporally ordered sampling points along the trajectory to guide the current-timestamp query to relevant historical features, achieving precise feature alignment in asynchronous multi-agent perception. It achieves state-of-the-art (SOTA) performance on two real-world datasets: V2V4Real and DAIR-V2X-Seq.
VKDNW: Training-free Neural Architecture Search through Variance of Knowledge of Deep Network Weights: VKDNW proposes a training-free NAS proxy based on the spectral entropy of the eigenvalues of the Fisher Information Matrix (FIM). It successfully applies Fisher information theory to large-scale deep network architecture search for the first time, evaluating network classification accuracy potential without any training. Additionally, it introduces the nDCG evaluation metric, which is better suited for NAS tasks.
Tuning the Frequencies: Robust Training for Sinusoidal Neural Networks: This paper proposes TUNER, a sinusoidal MLP training scheme based on the amplitude-phase expansion theory of Bessel functions. By expanding hidden neurons into Fourier series of integer combinations of input frequencies, it provides robust frequency initialization and in-training band-limiting control, significantly improving the convergence stability and reconstruction quality of implicit neural representations.
Uncertainty Weighted Gradients for Model Calibration: By analyzing a unified framework of methods like Focal Loss, this work reveals that directly applying uncertainty weights to the loss function leads to a misalignment between gradients and uncertainty. Hence, the Uncertainty-GRA framework is proposed to apply uncertainty weights directly to gradients, using the Generalized Brier Score as a more precise uncertainty metric, achieving state-of-the-art calibration performance.
UniPhy: Learning a Unified Constitutive Model for Inverse Physics Simulation: This work proposes UniPhy, the first unified latent-conditioned constitutive model that encodes diverse material properties (such as elastomers, sand, plastics, Newtonian, and non-Newtonian fluids) within a shared latent space. During inference, the latent variables are optimized through a differentiable Material Point Method (MPM) simulator to match observed particle trajectories, reducing reconstruction errors by 1 to 2 orders of magnitude compared to NCLaw.
VinaBench: Benchmark for Faithful and Consistent Visual Narratives: VinaBench annotates visual narrative samples with commonsense links and discourse constraints, proposes faithfulness and consistency evaluation metrics, and verifies that using these constraints substantially improves the quality of visual narrative generation.
Wear Classification of Abrasive Flap Wheels using a Hierarchical Deep Learning Approach: This paper proposes a hierarchical visual classification framework based on EfficientNetV2, which decomposes the wear state of abrasive flap wheels into three levels (usage state \(\rightarrow\) wear type \(\rightarrow\) severity), achieving classification accuracy of 93.8% to 99.3% across various subtasks.
Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos: This paper proposes LangView, which utilizes viewpoint-agnostic textual narrations as weak supervision signals. By comparing the alignment between the predicted captions of each viewpoint and the ground-truth narration, it generates pseudo-labels for the best viewpoint, enabling automatic view selection in multi-view instructional videos without manual annotation.
Zero-Shot Head Swapping in Real-World Scenarios: The paper proposes HID (Head Injection Diffusion), a zero-shot head swapping method that achieves seamless head-body fusion by automatically generating context-aware editing masks through IOMask. It also introduces a hair injection module to precisely transfer hairstyle details, achieving SOTA performance in real-world scenarios containing upper bodies and multi-angle faces.
ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training: ZO-SAM strategically integrates zero-order optimization into the perturbation step of SAM, achieving the flat-minimum advantages of SAM with only a single backward pass, thereby halving the computational overhead while enhancing accuracy and robustness in sparse training scenarios.
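
The ZO-SAM entry above describes replacing the gradient computation in SAM's perturbation step with a zero-order estimate, so that only one backward pass is needed per update. The toy sketch below illustrates that general idea under stated assumptions and is not the paper's implementation: the quadratic `loss`, the two-point random-direction estimator in `zo_direction`, and the values of `mu`, `rho`, and `lr` are all hypothetical choices made for illustration.

```python
import math
import random


def loss(w):
    # Toy quadratic loss standing in for a network's training loss (assumption).
    return 0.5 * sum((i + 1) * x * x for i, x in enumerate(w))


def grad(w):
    # Analytic gradient of the toy loss; stands in for one backward pass.
    return [(i + 1) * x for i, x in enumerate(w)]


def zo_direction(w, rng, mu=1e-3):
    # Two-point zero-order estimate along a random unit direction.
    # Uses forward evaluations of the loss only, no gradient call.
    u = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]
    plus = loss([a + mu * b for a, b in zip(w, u)])
    minus = loss([a - mu * b for a, b in zip(w, u)])
    slope = (plus - minus) / (2 * mu)
    sign = 1.0 if slope >= 0 else -1.0
    # Flip the direction if needed so that it points uphill.
    return [sign * x for x in u]


def zo_sam_step(w, rng, lr=0.1, rho=0.05):
    # Perturbation step: move a distance rho along the estimated ascent direction.
    d = zo_direction(w, rng)
    w_adv = [a + rho * b for a, b in zip(w, d)]
    # Descent step: the only gradient (backward) computation in the update.
    g = grad(w_adv)
    return [a - lr * b for a, b in zip(w, g)]


def run(steps=200, seed=0):
    rng = random.Random(seed)
    w = [1.0, -2.0, 0.5]
    for _ in range(steps):
        w = zo_sam_step(w, rng)
    return w
```

Standard SAM would call `grad` twice per step (once to build the perturbation, once at the perturbed point); here the first call is replaced by two forward evaluations of `loss`, which is where the saving described in the entry comes from in this simplified setting.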