📂 Others

📷 CVPR2026 · 98 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (116) · 💬 ACL2026 (4) · 🧪 ICML2026 (70) · 🤖 AAAI2026 (117) · 🧠 NeurIPS2025 (118) · 📹 ICCV2025 (33)

🔥 Top topics: Few-/Zero-Shot Learning ×7 · Adversarial Robustness ×5 · Federated Learning ×3 · Alignment/RLHF ×2 · Face & Gaze ×2

A2GC: Asymmetric Aggregation with Geometric Constraints for Locally Aggregated Descriptors

Addressing the failure of the "symmetric Sinkhorn" assumption in feature aggregation for Visual Place Recognition (VPR), A2GC reformulates the Optimal Transport solver into an asymmetric version (averaging row/column normalization + independent source/target marginal calibration) and overlays a geometric constraint branch (using learnable coordinate embeddings to bias spatially adjacent features towards the same cluster), achieving 95.6% Recall@1 on Pitts30k.
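For orientation, the block below sketches the standard entropy-regularized Sinkhorn iteration that A2GC departs from, with source and target marginals passed in separately. It is not the paper's asymmetric solver: the averaged row/column normalization and the geometric branch are not reproduced, and `sinkhorn`, `eps`, and the toy cost matrix are illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.5, iters=200):
    """Entropy-regularized OT plan with row marginals a and column marginals b."""
    K = np.exp(-np.asarray(cost, float) / eps)  # Gibbs kernel
    u = np.ones_like(a, dtype=float)
    v = np.ones_like(b, dtype=float)
    for _ in range(iters):
        u = a / (K @ v)      # rescale rows towards marginal a
        v = b / (K.T @ u)    # rescale columns towards marginal b
    return u[:, None] * K * v[None, :]

# Toy problem: two features (rows) assigned to two clusters (columns).
cost = np.array([[0.0, 1.0], [1.0, 0.0]])
a = np.array([0.5, 0.5])      # source (feature) marginal, assumed
b = np.array([0.25, 0.75])    # target (cluster) marginal, assumed
plan = sinkhorn(cost, a, b)
```

After convergence the rows of `plan` sum to `a` and the columns sum to `b`, which is the constraint pair that the summary describes as being calibrated independently.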

A Debiased Reconstruction-based Framework for Training-Free Detection of AI-Generated Images

To address the issue where "reconstruction-based training-free AI image detection" is biased by simple backgrounds or large latent norms, this paper proposes using augmentations like rotation + low-pass filtering—which "preserve bias factors but destroy forensic information"—to normalize reconstruction errors. By computing debiased scores at both the image and latent levels and fusing them into a unified RDD score, the method achieves training-free SOTA performance (average AUROC 0.981 / 0.940) across 18 sub-benchmarks including GenImage and LSUN-Bedroom.

A Difference-in-Difference Approach to Detecting AI-Generated Images

Addressing the limitation where first-order reconstruction errors fail as modern diffusion models generate images closer to reality, this paper performs reconstruction twice. It utilizes the "difference of reconstruction errors"—a second-order difference—to cancel out stochastic perturbations inherent in the reconstruction process and amplify weak signals between real and fake images. By combining separate classifiers for first-order and second-order errors, the method achieves a 20%–30% improvement over the strongest baselines in cross-dataset and cross-generator scenarios.
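A minimal sketch of the second-order feature, under two assumptions that the summary does not confirm: the error is a mean squared error, and the second reconstruction is applied to the output of the first. `reconstruct` is a hypothetical callable standing in for the diffusion-based reconstruction.

```python
import numpy as np

def recon_error(a, b):
    """Mean squared error between two images (assumed metric)."""
    return float(np.mean((a - b) ** 2))

def did_features(x, reconstruct):
    """Return (first-order error, difference of reconstruction errors)."""
    x1 = reconstruct(x)        # first reconstruction
    x2 = reconstruct(x1)       # second reconstruction (assumed to act on x1)
    e1 = recon_error(x, x1)    # first-order error
    e2 = recon_error(x1, x2)   # error of the second pass
    return e1, e1 - e2         # second-order difference

# Toy stand-in: a "reconstruction" that shrinks pixel values by 10%.
def toy_reconstruct(img):
    return 0.9 * img

x = np.ones((4, 4))
e1, delta = did_features(x, toy_reconstruct)
```

In the method described above, the two values would feed separate classifiers whose outputs are then combined.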

Adaptive Bayesian Early-Exit Networks for Efficient Non-Transferable Learning

ENL-DEE redesigns "Non-Transferable Learning (NTL)" as a Bayesian early-exit network. By freezing the backbone and training only several early-exit classification heads, it uses entropy-based routing to guide source domain samples to deep exits (preserving performance) and eject target domain samples at shallow exits (non-semantic features, accuracy near random). This significantly strengthens model IP protection while drastically reducing training and inference costs.

Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition

ADAMAB trains a lightweight "calibrator" on top of frozen pre-trained embedding models and utilizes a modified Upper Confidence Bound (UCB) algorithm to adaptively determine which data to synthesize for augmentation on a per-class basis. This approach improves accuracy by up to approximately 40% on few-shot long-tail recognition tasks with only 2–5 initial samples per class, providing theoretical guarantees for convergence.
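To illustrate the bandit component, here is plain UCB1 choosing which class to augment next. The paper uses a modified UCB whose details are not given here, so this is only the textbook baseline; the function name and the reward definition (for example, the validation gain after augmenting a class) are assumptions.

```python
import math

def ucb_select(mean_reward, pulls, total_pulls, c=2.0):
    """Return the index of the class (arm) with the highest UCB1 score."""
    # Try every class once before comparing confidence bounds.
    for arm, n in enumerate(pulls):
        if n == 0:
            return arm
    scores = [
        m + math.sqrt(c * math.log(total_pulls) / n)  # mean + exploration bonus
        for m, n in zip(mean_reward, pulls)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Equal mean reward, but class 1 has been augmented far less often.
choice = ucb_select([0.5, 0.5], [10, 1], total_pulls=11)
```

The exploration bonus shrinks as a class is augmented more often, so rarely sampled classes keep being revisited until their estimated benefit is known.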

AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments

This paper proposes AdaSFormer, a serialized Transformer framework for indoor Monocular Semantic Scene Completion (MSSC). By introducing three core designs—Adaptive Serialized Attention (with learnable offsets), Center Relative Position Encoding, and Convolution-Modulated Layer Normalization—it achieves SOTA performance on NYUv2 and Occ-ScanNet.

ALLNet: Multi-task Dense Prediction for Degraded Images

ALLNet dismantles the two-stage cascaded "restoration-then-prediction" pipeline. Using a dual-decoder U-Net, it enables mutual feature feeding between the restoration and prediction streams at every scale. By employing a degradation-adaptive Mixture-of-Experts (MoE) module for degradation removal and a Task Collaborative Refinement (TCR) module for bidirectional semantic alignment, it outperforms existing SOTA methods across four tasks on degraded versions of NYUD-v2 and PASCAL-Context.

Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images

Addressing the collapse of fully decentralized training under non-i.i.d. data and imbalanced client scales, this paper replaces the "Euclidean gossip" operation (averaging model parameters between neighbors) with linear mixing in the expectation parameter space of exponential families. This approach happens to be equivalent to a curvature-aware KL-Barycentric consensus (natural gradient step), reducing the per-round complexity from \(O(d^3)\) to \(O(d)\) without constructing or inverting the Fisher matrix. The authors provide an implementation called KL-consensus Adam, which has nearly the same overhead as Adam and achieves approximately 20% higher accuracy than the Euclidean consensus baseline on CIFAR-100.
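The difference between the two consensus rules can be seen on one-dimensional Gaussians. The sketch below mixes the expectation parameters \((\mathbb{E}[x], \mathbb{E}[x^2])\) linearly, which is the moment-matching solution that minimizes \(\sum_i w_i \, \mathrm{KL}(p_i \,\|\, q)\) over Gaussians \(q\). Whether this is the KL direction used in the paper is an assumption, and the function names are illustrative.

```python
import numpy as np

def mix_expectation(mus, variances, weights):
    """Linear mixing in expectation-parameter space (moment matching)."""
    mus, variances, w = (np.asarray(v, float) for v in (mus, variances, weights))
    w = w / w.sum()
    m1 = np.sum(w * mus)                     # mixed E[x]
    m2 = np.sum(w * (variances + mus ** 2))  # mixed E[x^2]
    return float(m1), float(m2 - m1 ** 2)    # back to (mean, variance)

def mix_euclidean(mus, variances, weights):
    """Plain parameter averaging, as in Euclidean gossip."""
    mus, variances, w = (np.asarray(v, float) for v in (mus, variances, weights))
    w = w / w.sum()
    return float(np.sum(w * mus)), float(np.sum(w * variances))

# Two neighbors that disagree about the mean: N(0, 1) and N(2, 1).
kl_mean, kl_var = mix_expectation([0.0, 2.0], [1.0, 1.0], [1, 1])
eu_mean, eu_var = mix_euclidean([0.0, 2.0], [1.0, 1.0], [1, 1])
```

Both rules agree on the mean, but the expectation-space mix widens the variance to account for the disagreement between neighbors, while parameter averaging ignores it. Each rule costs one weighted sum per parameter, consistent with the \(O(d)\) claim.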

Bi-directional Autoregressive Diffusion for Large Complex Motion Interpolation

ARVFI reformulates video frame interpolation from "generating all intermediate frames at once" to "generating frames autoregressively from two endpoints towards the center." By replacing optical flow with DINOv3 features as the motion representation, it significantly enhances interpolation accuracy for large complex motions (leading in FID across benchmarks) while reducing sampling to 15 steps—approximately 3x faster than its backbone, Wan.

Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models

BISE proposes that a biased model trained normally (vanilla) on biased data actually already contains a relatively unbiased subnetwork. By freezing the original parameters and learning a set of structured pruning masks, combined with "reweighted cross-entropy + biased mutual information regularization" to prune neurons relying on shortcut features, this subnetwork can be extracted without retraining or additional unbiased datasets. Performance is on par with SOTA debiasing methods, can exceed them after fine-tuning, and the model becomes smaller and faster.

Bidirectional Query-Driven Generation of Parametric CAD Sketch

CADSketcher reformulates parametric CAD sketch completion from a unidirectional "prefix → continuation" task into a "middle fragment → bidirectional outward expansion" query-driven generation. By integrating bidirectional query learning, confidence gating, and a validity compiler, it improves sketch-level accuracy on SketchGraphs from ~33% to 45.6% and reduces the invalid rate to zero.

Bootstrapping Multi-view Learning for Test-time Noisy Correspondence

Focusing on "view mismatch" (Test-time Noisy Correspondence, TNC) that occurs only during deployment, BML performs in-place bootstrapping on a clean training set, injecting controllable mismatches and recording which views were contaminated. This "revealed" knowledge is used to supervise a lightweight reliability estimator (incorporating both intra-view uncertainty and inter-view prediction divergence). During inference, the estimated reliability weights are used in a weighted fusion to suppress corrupted views, consistently outperforming existing SOTA methods across 11 benchmarks.

BrepVGAE: Variational Graph Autoencoder with Unified Latent Representation for B-rep

BrepVGAE unifies heterogeneous "faces" and "edges" in CAD B-rep models as nodes of a single sparse isomorphic graph. Using a Variational Graph Autoencoder (VGAE), it compresses the graph into a global latent vector and employs a set-based parallel decoder to reconstruct the entire topological adjacency and continuous geometric features in a single pass. It significantly outperforms methods like BrepGen in reconstruction accuracy, topological validity, and generation diversity.

Bridging Domain Expertise and Generalization for Performance Estimation

To estimate model accuracy on unlabeled test sets under distribution shift, this paper moves beyond relying solely on the evaluated model's own outputs by introducing a foundation model (CLIP/SigLIP) as an "external reference." It first calibrates the foundation model's predictions to the same confidence scale as the evaluated model using JS divergence, then fuses them via confidence-weighted averaging into a "pseudo-ground-truth" distribution. Accuracy is estimated from the agreement between the evaluated model's predictions and this fused distribution, reducing the average MAE from the second-best method's 6.72% to 6.53%.

CHIRP dataset: towards long-term, individual-level, behavioral monitoring of bird populations in the wild

To enable computer vision to truly serve long-term, individual-level behavioral monitoring of wild birds, this paper constructs the CHIRP dataset (concurrently covering Re-ID, action recognition, 2D keypoints, detection, and instance segmentation) from a wild population of Siberian Jays in Swedish Lapland recorded over 9 years (2014–2022). It proposes an "application-oriented evaluation" paradigm centered on biological indicators such as "feeding rate" and "co-occurrence rate." Additionally, it introduces a baseline method, CORVID, a pipeline that recognizes individuals by their colored leg rings and outperforms the animal Re-ID foundation model MegaDescriptor in Top-1 accuracy under "territory constraints."

Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization

COVec introduces the artistic concept of "Clair-Obscur" into image vectorization, performing intrinsic image decomposition in the vector domain for the first time. It decomposes a real photograph into three semantically coherent SVG layers—albedo, shade, and light—via region-level semantic binarization initialization and a two-stage differentiable rendering optimization. This achieves high fidelity with minimal layers, ensuring the resulting SVGs are truly editable.

Coded-E2LF: Coded Aperture Light Field Imaging from Events

This paper demonstrates for the first time that 4D light fields can be reconstructed with pixel-level accuracy using only an event camera (without traditional intensity images). The proposed Coded-E2LF system triggers events by accumulating sequences of coded aperture patterns into event images. By utilizing a "black-first" pattern, the authors establish the mathematical equivalence between event-based and intensity-based coded aperture imaging. Combined with end-to-end deep optics training, the system achieves \(8 \times 8\) viewpoint light field reconstruction.

Computer Vision with a Superpixelation Camera

The authors propose "SuperCam," a superpixelation camera where the sensor generates superpixel maps directly on-chip through sparse sampling. It avoids storing full high-resolution images entirely, driving segmentation, detection, and depth estimation with memory requirements one to two orders of magnitude lower than conventional images. Under the same memory budget, its segmentation error is at least 2x lower than that of a constrained version of SNIC.

Confusion-Aware Spectral Regularizer for Long-Tailed Recognition

This paper demonstrates that "worst-class error" in long-tailed scenarios is tightly upper-bounded by the spectral norm of the frequency-weighted confusion matrix. Consequently, it proposes CAR, a regularizer that directly minimizes this spectral norm using a differentiable confusion matrix proxy and an EMA estimator. CAR improves worst-class accuracy by 6%–10% and overall accuracy by 2.4%–4.8% over previous SOTA on benchmarks like ImageNet-LT, CIFAR100-LT, and iNaturalist.
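As a rough illustration of the quantity being bounded, the sketch below computes the spectral norm (largest singular value) of a confusion matrix built from hard predictions. The weighting is an assumption: off-diagonal counts are divided by each true class's frequency, which may differ from the paper's definition, and the differentiable proxy and EMA estimator are not shown.

```python
import numpy as np

def confusion_spectral_norm(y_true, y_pred, num_classes):
    """Spectral norm of the per-class error-rate matrix (assumed weighting)."""
    conf = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        conf[t, p] += 1
    counts = conf.sum(axis=1)               # samples per true class
    errors = conf.copy()
    np.fill_diagonal(errors, 0.0)           # keep only misclassifications
    rates = errors / np.maximum(counts, 1)[:, None]
    return float(np.linalg.norm(rates, 2))  # ord=2 gives the largest singular value

# Class 0 is confused with class 1 half of the time; class 1 is always correct.
norm = confusion_spectral_norm([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```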

Consensus vs. Controversy: Mapping the Decision Space Where Architectures Diverge

The authors perform "disagreement forensics" on ImageNet using 12 pre-trained models from three major families (CNN, ViT, and MLP-Mixer). They find that while overall accuracies are nearly identical (mean 79.9%), architectural differences are concentrated in the most controversial 10% of images. This "controversial subset" exhibits ~4.5x higher disagreement than the "consensus subset," and intra-family consistency is significantly higher than inter-family, providing actionable guidance for model selection and ensemble construction.
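One simple way to score how controversial an image is, sketched here as an assumption rather than the paper's exact metric, is the fraction of model pairs whose top-1 predictions differ. The most controversial 10% of images would then be the top decile under this score.

```python
import numpy as np
from itertools import combinations

def pairwise_disagreement(preds):
    """preds: (num_models, num_images) array of predicted labels.

    Returns, per image, the fraction of model pairs that disagree.
    """
    preds = np.asarray(preds)
    pairs = list(combinations(range(preds.shape[0]), 2))
    dis = np.zeros(preds.shape[1])
    for i, j in pairs:
        dis += preds[i] != preds[j]
    return dis / len(pairs)

# Three hypothetical models, three images.
preds = [[0, 1, 2],
         [0, 1, 0],
         [0, 2, 1]]
dis = pairwise_disagreement(preds)
most_controversial = int(np.argmax(dis))
```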

Convolutional Neural Networks Driven by Content Similarity

By performing "intra-channel sorting" on features to align tokens with high similarity in adjacent positions and then aggregating them using 1D depthwise convolution, the proposed pure CNN model, Ego, enables "content-driven aggregation" similar to self-attention. Ego outperforms Transformers and advanced CNNs of comparable scale in classification, segmentation, and detection at lower computational cost.

Coupling Liquid Time-Constant Encoders with Modern Hopfield Memory

This work attaches an external Modern Hopfield associative memory module to Liquid Time-Constant (LTC) networks to decouple "real-time encoding" from "long-term memory" within a single hidden state. It theoretically demonstrates that this coupling maintains bounded stability while contracting upstream gradients and depressing the Hessian trace, smoothing the training surface and yielding an average accuracy gain of 2.3% across six time-series benchmarks.

Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion

This paper proposes CLDyN, a closed-loop dynamic network that enables a frozen fusion network to adapt to downstream tasks (detection/segmentation/saliency) without retraining. By utilizing a "Request-driven Semantic Compensation (RSC)" module with only 0.46M parameters, the system receives semantic feedback and dynamically customizes convolutional structures for task-specific compensation. It maintains high fusion quality while achieving superior multi-task adaptability on M3FD, FMB, and VT5000 datasets.

Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging

To address the long-standing issue of "sparse and unreliable supervision" in Continual Test-Time Adaptation (CTTA), this paper pivots from the traditional "backward-alignment" (forcing shifting test data toward static source anchors). Instead, it proposes a "forward-facilitation" approach: generating semantically pure category exemplars offline via diffusion models and dynamically "coloring" them with the current target domain style online (via input/statistics/representation bridging). This produces reliable on-demand supervision with ground-truth labels aligned to the current distribution, reducing average error rates on ImageNet-C / CIFAR100-C / CIFAR10-C to 44.1% / 29.8% / 9.1%, with significantly lower memory and latency compared to diffusion-based TTA methods.

Data-Centric Meta-Learning for Robust Few-Shot Generalization

Addressing the generalization collapse of optimization-based meta-learning in cross-domain few-shot scenarios, this work elevates "learnable visual prompts" from test-time auxiliaries to a core mechanism throughout the meta-training process. By aligning task inputs in the data space to reduce gradient direction conflicts, the method learns more universal prior knowledge and enables efficient adaptation by updating only the prompts and classification heads during inference.

Debiased Sample Selection for Learning with Noisy Labels

This paper identifies two types of confirmation bias inherent in the "small-loss-is-clean" sample selection strategy dominant in noisy label learning: class-level bias (easy-to-learn classes are over-selected while hard-to-learn classes are neglected) and instance-level bias (mislabeled samples with pseudo-low losses are memorized as clean samples). It proposes two plug-and-play modules, MDA (Marginal Distribution Adjustment) and CCS (Candidate Class Selection), to eliminate these biases. Combined as DSS, the approach consistently improves various selectors and SOTA pipelines on CIFAR-10/100 synthetic noise and real-world noise datasets including CIFAR-N, Clothing1M, and WebVision.

Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis

This paper demonstrates through controlled experiments that Forward Correction (FC) still suffers from performance collapse in the late stages of training even given a perfect noise transition matrix \(T\). It provides a systematic diagnosis of this failure from three perspectives: macro-convergence states, micro-optimization dynamics, and information theory.

DF²-VB: Dual-level Fuzzy Fusion with View-specific Boosting for Multi-view Multi-label Classification

To address the conflict in Multi-view Multi-label Classification (MVMLC) where "feature-level fusion is expressive but lacks supervision, while decision-level fusion is supervised but relies on weak representations," DF²-VB unifies both levels into a single framework. It utilizes fuzzy membership functions for dynamic element-level weighting of consistent features (FDF) and employs Boosting to adaptively measure the importance of samples and view-specific atomic classifiers (VB). This mutually reinforces expressiveness and discriminability, achieving new SOTA results across 6 public datasets.

DiffBMP: Differentiable Rendering with Bitmap Primitives

This paper proposes DiffBMP, the first general-purpose differentiable rendering engine for bitmap primitives. It implements an efficient custom CUDA parallel pipeline to enable gradient optimization of position, rotation, scaling, color, and opacity for thousands of bitmap primitives, filling the gap where 2D differentiable rendering was previously restricted to vector graphics.

Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation

This work represents a single stroke simultaneously as a "discrete polyline" and a "continuous Bézier curve," enabling differentiable bi-directional conversion. A residual-guided discrete search handles global structure while gradient optimization performs pixel-level refinement. Coupled with a Gaussian-style differentiable polyline renderer that optimizes thousands of strokes in parallel, the method improves PSNR by 4–5 dB on complex textures while using 30–50% fewer strokes and being 30–40% faster than existing methods.

DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models

This paper discovers that directly porting AdamW to Differentially Private Federated Learning (DPFL) fails, identifying three pathologies: "amplified second-moment variance, DP-induced second-moment bias, and intensified client drift." It proposes DP-FedAdamW, the first AdamW optimizer tailored for DPFL. By employing block-wise second-moment aggregation, explicit DP noise bias subtraction, and local update alignment with the global direction, it achieves a 5.83% improvement over SOTA on Tiny-ImageNet (Swin-Base, ε=1).

Drainage: A Unifying Framework for Addressing Class Uncertainty

By adding an extra "drainage node" to the output layer and a "drainage loss" generalized from cross-entropy, the framework allows ambiguous, out-of-distribution, or mislabeled samples to direct their probability mass into this node rather than being forced into an incorrect class. This approach achieves up to 10% higher accuracy in high-noise scenarios compared to existing robust losses and directly serves as a rejector for open-set recognition.

DREAM: Document Recognition with Explicit Adaptive Memory

DREAM equips document recognition models with an "explicit prototype memory"—compressing recurring layout structures and writing styles (margins, skewed text, table lines, etc.) from the training corpus into a set of retrievable prototype vectors. Regional features sparsely "read" these prototypes using cross-attention, which are then "written" back via EMA during training. Serving as non-parametric structural knowledge, these are integrated with visual features for the decoder, allowing a 0.6B parameter model to outperform Large Language Models (LLMs) dozens of times its size on datasets like Fox, DreamDoc, and SCUT-HCCDoc.

Dual-Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions

A dual-band Long-Wave Infrared (LWIR) video analysis framework is proposed. By jointly utilizing spectral cues (constant dual-band emissivity ratio) and temporal cues (smooth object radiation vs. abrupt background radiation changes), the method achieves pixel-wise separation of reflection and emission components in dynamic scenes near ambient temperature for the first time, while recovering object emissivity and temperature fields.

Evidential Deep Partial Label Learning to Quantify Disambiguation Uncertainty

This work introduces Evidential Deep Learning (EDL) into Partial Label Learning (PLL) by using a Dirichlet distribution to model candidate label sets as "evidence" for disambiguation trustworthiness. Equipped with non-candidate label suppression and intra-class conflict-aware regularization, the proposed approach identifies ground-truth labels from ambiguous candidates while providing uncertainty estimates for each prediction. It serves as a plug-and-play loss function for various deep networks.
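The uncertainty estimate in evidential deep learning follows from the Dirichlet parameters. The sketch below uses the standard formulation (\(\alpha_k = e_k + 1\), uncertainty \(u = K / \sum_k \alpha_k\)). The paper's candidate-set handling and regularizers are not reproduced, and the function name is illustrative.

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Expected class probabilities and uncertainty mass from non-negative evidence."""
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0                      # Dirichlet concentration
    strength = alpha.sum(axis=-1, keepdims=True)
    prob = alpha / strength                     # expected class probabilities
    uncertainty = evidence.shape[-1] / strength.squeeze(-1)  # u = K / S
    return prob, uncertainty

# No evidence at all versus strong evidence for class 0.
p_none, u_none = dirichlet_uncertainty([0.0, 0.0, 0.0])
p_conf, u_conf = dirichlet_uncertainty([9.0, 0.0, 0.0])
```

With no evidence the uncertainty equals 1 and the prediction is uniform. As evidence accumulates for a class, the uncertainty mass shrinks.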

EXOTIC: External Vision-driven Incomplete Multi-view Classification

EXOTIC introduces an "external visual knowledge base" to Incomplete Multi-view Classification (IMVC) for the first time. It utilizes pre-trained vision-language models to transform unlabeled image collections into semantic priors. After filtering and purification, these priors are used to complete missing views, breaking the performance ceiling inherent in "internal-only supervision" methods—showing particularly significant improvements at high missing rates (e.g., 80.0% vs. 72.1% second-best on LandUse21 at MR=0.1).

FedSDR: Federated Graph Learning with Structural Noise Detection and Reconstruction

Addressing the neglected issue of "client graph structure contaminated by random edge addition/deletion" in subgraph federated learning, FedSDR employs a spectral-domain structural fidelity metric \(S_{\text{ide}}\) to identify and downweight contaminated clients during aggregation (SNAA). It then leverages feature similarity from the healthy global model to perform "spurious edge pruning + missing edge completion" (RLSR) on local damaged graphs, outperforming 17 federated baselines across 7 datasets.

FedSST: Rethinking Fair Federated Graph Learning under Structural Shift

FedSST utilizes a shared "probe graph" to detect the structural preferences of GNNs across different clients, compresses this into a scalar signal mapped to fair weights that favor "intermediate clients," and uses these weights to drive both differentiated local training assignment (SLTA) and two-level adaptive aggregation (SSCA). This approach simultaneously improves average accuracy and reduces inter-client accuracy variance in cross-domain graph classification.

FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

FoleyDesigner mimics the workflow of professional Foley artists by decomposing silent film clips into layered sound events. It uses "depth + azimuth" spatio-temporal clues extracted from visual tracking to drive a DiT diffusion model for frame-level aligned stereo generation. Finally, a multi-agent system handles post-mixing and upmixing to 5.1 surround sound, outperforming existing baselines in spatio-temporal alignment and cinematic Foley quality.

GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents

The GardenDesigner framework is proposed to encode the aesthetic principles of Jiangnan gardens into computable constraints via a Chain of Agents (Terrain Distribution → Road Generation → Asset Selection → Layout Optimization). Combined with the expert-annotated GardenVerse dataset, it enables non-professional users to automatically construct Jiangnan gardens that comply with aesthetic standards within one minute via text input.

Global Information Thresholding for Sufficient and Necessary Circuits

Addressing the common pain point where automatic circuit discovery relies on "manual fixed budgets" (fixed top-k), this paper moves away from pre-defining circuit size. Instead, it scores edges (using signed integrated gradients) and automatically searches for a single global threshold \(\tau\) based on a "model behavior retention" target. This makes the circuit size a result of "retaining behavior" rather than a hyperparameter. The method achieves optimal or near-optimal CPR/CMD on the MIB benchmark and improves both sufficiency and necessity diagnostics on GPT-2 IOI.
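The threshold search can be pictured as a one-dimensional bisection. This is a sketch under stated assumptions, not the paper's procedure: edge scores are taken as precomputed, `retention` is a hypothetical callable that evaluates how much model behavior a candidate circuit keeps, and retention is assumed not to increase as edges are pruned.

```python
def find_global_threshold(scores, retention, target, iters=50):
    """Bisect for a large tau whose circuit {score >= tau} still meets the target."""
    lo, hi = min(scores), max(scores)
    best = lo                      # tau = min(scores) keeps every edge
    for _ in range(iters):
        tau = (lo + hi) / 2
        kept = [s >= tau for s in scores]
        if retention(kept) >= target:
            best, lo = tau, tau    # behavior retained: try pruning more
        else:
            hi = tau               # pruned too much: back off
    return best

# Toy retention: share of total edge score that the circuit keeps.
scores = [0.5, 0.3, 0.15, 0.05]

def toy_retention(kept):
    return sum(s for s, k in zip(scores, kept) if k) / sum(scores)

tau = find_global_threshold(scores, toy_retention, target=0.9)
circuit_size = sum(s >= tau for s in scores)
```

Here the circuit size is an output of the search: the caller sets only the retention target, never a top-k budget.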

Global Underwater Geolocation from Time-Lapse Polarization Imagery

By using a single underwater polarization camera to capture a time-lapse sequence looking up at the sky with UTC timestamps, this work employs "physically-guided synthesis of 2.8 million training sequences + a two-stage Transformer to first reconstruct the solar elevation curve and then regress latitude and longitude." This reduces the median cross-site (unseen waters) localization error from the SOTA of approximately 3000 km to approximately 500 km, an improvement of nearly 8x.

Graph Attention Prototypical Network for Robust Few-Shot Classification

To address the "prototype shift" problem where Prototypical Networks experience sharp accuracy drops due to mislabeled samples in the support set, GAPNet introduces a four-step pipeline: "Global+Local Dual Features → Pseudo-label Guided Graph Construction → Edge-Aware Graph Attention → Adaptive Noise-Robust Prototype Generation." By explicitly modeling intra/inter-class relationships and dynamically suppressing noise sample weights, it outperforms the SOTA by 3%–8% on 5-way 5-shot tasks across four datasets and exhibits significantly slower decay under 40% label noise.

Hearing the Room Through the Shape of the Drum: Modal-Guided Sound Recovery from Multi-Point Surface Vibrations

Addressing "hard" objects with poor response and strong resonance (drumheads, laptops, photo frames), this work utilizes speckle vibrometry to capture dual-axis vibrations from a 10×10 grid on the object surface. A physical forward model is derived to link "scene sound → multi-point vibrations" using the object's vibration modes as a bridge. By inverting this model via optimization, dozens of noisy vibration channels are fused into a single denoised, de-resonated sound, achieving quality significantly superior to single-point speckle vibrometry and classical signal processing fusion (averaging, delay-and-sum).

HierUQ: Hierarchical Uncertainty Quantification with Adaptive Granularity Reconciliation for Degraded Image Classification

HierUQ addresses hierarchical classification of degraded (blurred/occluded/noisy/low-resolution) images by providing reliable confidence levels via Hierarchical Uncertainty Quantification (HUQ) based on label smoothing and proper scoring rules. It utilizes Confidence-Aware Path Adjustment (CAPA) to automatically fall back from fine-grained to coarser levels when uncertain, and employs Self-paced Multi-Layer Joint Optimization (MLJO) to coordinate multi-level objectives, achieving SOTA on degraded remote sensing ship and bird datasets.

Hyperbolic Defect Feature Synthesis for Few-Shot Defect Classification

This paper proposes HypDFS, which shifts defect feature synthesis from Euclidean space to hyperbolic space. By modeling defect distributions with sparse hyperbolic prototypes, sampling synthetic features, and employing a residual adapter with hierarchical defect contrastive losses, HypDFS leverages the inherent "tree-like hierarchy" of industrial defects. It significantly outperforms Euclidean baselines on MVTec-FS and MTD few-shot benchmarks.

HyperNAS: Enhancing Architecture Representation for NAS Predictor via Hypernetwork

HyperNAS treats "weight generation via hypernetwork" as an auxiliary task alongside the NAS performance predictor. Both tasks share a GCN encoder, coupled with an adaptive multi-task loss using a preference coefficient. This enables the model to learn architecture representations with better generalization even with minimal labeled samples—achieving 97.60% top-1 on CIFAR-10 using at least 5x fewer samples.

HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition

This paper proposes HypeVPR, a visual place recognition framework based on hierarchical embeddings in hyperbolic space. It specifically addresses the cross-field-of-view matching problem between perspective images (queries) and equirectangular images (database). By constructing multi-level descriptors from local to global in the Poincaré ball, it achieves a flexible balance between accuracy, efficiency, and storage, with retrieval speeds significantly faster than sliding window baselines while maintaining comparable accuracy.
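For reference, the geodesic distance in the Poincaré ball (curvature \(-1\)) is \(d(u, v) = \operatorname{arcosh}\!\left(1 + \frac{2\lVert u - v\rVert^2}{(1 - \lVert u\rVert^2)(1 - \lVert v\rVert^2)}\right)\). The sketch below implements only this distance; how HypeVPR builds and compares its multi-level descriptors is not shown, and the function name is illustrative.

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_dist / denom))

# Distance from the origin to a point at Euclidean radius 0.5.
d = poincare_distance([0.0, 0.0], [0.5, 0.0])
```

Distances grow rapidly near the boundary of the ball, which is what lets coarse descriptors sit near the origin and fine-grained ones spread out towards the edge.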

ImmerIris: A Large-Scale Dataset and Benchmark for Off-Axis and Unconstrained Iris Recognition in Immersive Applications

The authors created ImmerIris, a large-scale dataset for "off-axis and unconstrained" iris recognition in XR/VR HMD scenarios, containing 499,800 eye images from 546 subjects. They established 8 evaluation protocols of increasing difficulty and demonstrated that traditional two-stage methods are bottlenecked by "normalization." They propose a NormFree paradigm that directly processes cropped eye images using face recognition backbones, which is simple yet outperforms normalization-based SOTA methods on most protocols.

InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space

InstantRetouch shifts language-guided photo retouching from "direct pixel/latent editing" to "predicting a single set of affine transformation grids in a compact, content-disentangled bilateral space." By distilling a multi-step diffusion teacher into a single-step generator using Variational Score Distillation (VSD), it achieves 68ms inference at 4K resolution—70–900 times faster than diffusion baselines—while maintaining near-perfect content fidelity (zero content drift).

InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search

InTrain formalizes "whether an architecture can be trained effectively" as an intrinsic invariant independent of the training process. By combining the Geometric Capacity (via Participation Ratio) of forward activations and the Optimization Resilience (via Gradient Health) of backward gradients through scale-invariant multiplicative coupling, it achieves ranking correlations on NAS-Bench-101/201 that match ensemble proxies and surpass all single-index proxies.
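The participation ratio has a compact closed form, \(\mathrm{PR} = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2\), where \(\lambda_i\) are the eigenvalues of the activation covariance. The sketch below computes it for a batch of activations. How InTrain couples it with the gradient-health term is not specified here, so only this one ingredient is shown.

```python
import numpy as np

def participation_ratio(activations):
    """activations: (num_samples, num_features). Returns the effective dimensionality."""
    x = np.asarray(activations, float)
    x = x - x.mean(axis=0, keepdims=True)          # center the features
    cov = x.T @ x / (x.shape[0] - 1)               # sample covariance
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return float(eig.sum() ** 2 / np.sum(eig ** 2))

# Variance spread evenly over two axes versus collapsed onto one direction.
pr_full = participation_ratio([[1, 0], [-1, 0], [0, 1], [0, -1]])
pr_rank1 = participation_ratio([[1, 1], [-1, -1], [2, 2], [-2, -2]])
```

The ratio is unchanged when all activations are rescaled by a constant, in line with the scale invariance mentioned above.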

IrisFP: Adversarial-Example-based Model Fingerprinting with Enhanced Uniqueness and Robustness

This paper proposes IrisFP, a model fingerprinting framework that simultaneously enhances uniqueness and robustness through three innovations: placing fingerprints at multi-class decision boundary intersections, constructing composite sample fingerprints, and screening fingerprints based on statistical separability. It consistently outperforms SOTA methods in AUC across five datasets.

Language Does Matter for Cross-Domain Few-Shot Visual Feature Enhancement

To address the issue that "pure visual features easily learn non-transferable shortcut patterns" in cross-domain few-shot tasks, this paper uses image captioning models and Large Language Models (LLMs) to generate "image-level + domain-level" language attributes for each image. A lightweight residual cross-attention mechanism then embeds language semantics into visual features. This plug-and-play module can be integrated into classification, segmentation, and detection baselines, yielding consistent performance gains across multiple CD-FSL benchmarks.

Large-scale Robust Enhanced Ensemble Clustering via Outlier Decoupling

Addressing the issue where anchor-based ensemble clustering produces biased anchors due to "reconstructing contaminated base clusterings," this paper proposes RANGE. It first utilizes a high-order fuzzy enhancement strategy to improve bipartite graph reliability. It then explicitly decomposes the similarity matrix into "clean structure + residual outlier structure" in the anchor space, using orthogonal penalties and \(\ell_{2,1}\)-norm constraints to isolate pollution into a few anchor directions. This residual structure also enables outlier detection, forming a cross-task framework with linear complexity scalable to millions of samples.
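The \(\ell_{2,1}\)-norm mentioned above sums the Euclidean norms of a matrix's rows, so penalizing it drives whole rows to zero. The sketch assumes the row-wise convention; some papers apply it column-wise, and which axis corresponds to anchors in RANGE is not stated here.

```python
import numpy as np

def l21_norm(matrix):
    """Sum of the l2 norms of the rows (row-wise convention assumed)."""
    return float(np.sum(np.linalg.norm(np.asarray(matrix, float), axis=1)))

# The same entries cost less when concentrated in one row.
concentrated = l21_norm([[3.0, 4.0], [0.0, 0.0]])
spread = l21_norm([[3.0, 0.0], [0.0, 4.0]])
```

Because a concentrated residual is cheaper than a spread one, the penalty pushes outlier structure into a few rows, matching the "few anchor directions" behavior described in the summary.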

Learning Long-term Motion Embeddings for Efficient Kinematics Generation

Instead of modeling "appearance + motion" pixel-by-pixel using video generation models, this work proposes learning a motion-only long-term latent space with \(64\times\) temporal compression. A Trajectory VAE first compresses sparse tracking trajectories into a dense, queryable motion grid, followed by a conditional flow matching model that generates long-term goal-directed motion based on text or "pokes." This approach is over 10,000 times faster than SOTA video models while achieving higher quality.

Learning What Helps: Task-Aligned Context Selection for Vision Tasks

TACS enables discriminative vision models (ViT) to learn to select paired samples from a candidate pool that "truly improve task performance," rather than "visually most similar" neighbors. By jointly training a selector through a differentiable sampling path and a reward-driven policy optimization path, retrieval is transformed from a static preprocessing step into a learnable component optimized through the downstream task loss. It consistently outperforms similarity-based retrieval across 18 datasets.

Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning

To address the "instance entanglement" problem in Instance-Dependent Partial Label Learning (ID-PLL), where instances from similar classes share overlapping features and candidate labels, this paper proposes the CAD framework. CAD mitigates class confusion through a two-pronged approach: intra-class alignment via class-specific augmentation and inter-class separation via weighted penalty loss.

Modeling the Visual Ambiguity of Human Sketches

This paper points out that visual ambiguity, where "one sketch corresponds to multiple plausible images," can degrade sketch-image matching training. It proposes AmbiScore, calculated using CLIP, to quantify the ambiguity of each sketch-image pair. The DisAmb framework is introduced to explicitly model and eliminate ambiguity through Elastic Matching (dynamically adjusting supervision strength based on ambiguity) and Purified Matching (using Grounded SAM masks for shape jigsaw and texture swapping). The method significantly advances SOTA on ZS-SBIR / FG-ZS-SBIR without increasing inference overhead.

MooCap: A Multi-View Benchmark for Cow-Object-Human Interaction and Behavior Dynamics

MooCap integrates classic ethological "controlled stimulus experiments" into computer vision. Utilizing 43 cows, 7 standardized interaction scenarios, 42 hours of synchronized multi-view video, and dense annotations (23 fine-grained behaviors + 39 keypoints + 4 spatial zones + three longitudinal rearing labels), it establishes three benchmarks: temporal action segmentation, skeleton action recognition, and longitudinal phenotypic classification. SOTA models achieve only \(66.4\%\) frame accuracy and \(0.39\) mean F1, highlighting the vast potential for research in animal behavior understanding.

More Than Meets the Eye: A Unified Image Fusion Framework via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization

Image fusion is reformulated as a free energy minimization problem—where the perception path suppresses "semantic entropy" and the reconstruction path elevates "pixel entropy." By training on only infrared-visible data, the model generalizes zero-shot to unseen fusion tasks such as medical, multi-focus, and multi-exposure imaging, while significantly improving downstream detection/segmentation performance.

MSPT: Efficient Large-Scale Physical Modeling via Parallelized Multi-Scale Attention

MSPT partitions million-scale point clouds into ball tree patches, performing local self-attention within each patch while pooling each patch into a small number of "supernodes" for cross-patch global communication. Both operations are integrated into the same attention operator for parallel computation, enabling solvers for industrial-grade PDE/aerodynamics problems on a single GPU with near-linear complexity, achieving SOTA on multiple benchmarks with significantly lower memory and latency.

MUFASA: A Multi-Layer Framework for Slot Attention

MUFASA is a plug-and-play multi-layer Slot Attention framework. Instead of performing Slot Attention solely on the features of the last layer of a pre-trained DINO ViT, it simultaneously runs Slot Attention on several final layers. It uses Hungarian matching to align slots across layers and fuses them into a unified set of object-centric representations. This approach pushes methods like DINOSAUR/SPOT to new SOTA performance on VOC/COCO/MOVi-C for unsupervised segmentation, while significantly accelerating training convergence with minimal inference overhead.
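
The cross-layer alignment step can be pictured with the sketch below. It is an illustration under stated assumptions, not MUFASA's implementation: it searches all permutations by brute force instead of running the Hungarian algorithm (practical only for a handful of slots), scores matches by cosine similarity, and fuses matched slots by simple averaging. The names `align_slots` and `fuse_slots` are hypothetical.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align_slots(reference, other):
    """Return perm so that other[perm[i]] is matched to reference[i].

    Brute-force stand-in for Hungarian matching: tries every permutation
    and keeps the one with the highest total cosine similarity.
    """
    k = len(reference)
    sim = [[cosine(r, o) for o in other] for r in reference]
    best = max(
        permutations(range(k)),
        key=lambda p: sum(sim[i][p[i]] for i in range(k)),
    )
    return list(best)


def fuse_slots(reference, other):
    """Average each reference slot with its matched slot (assumed fusion rule)."""
    perm = align_slots(reference, other)
    return [
        [(a + b) / 2 for a, b in zip(reference[i], other[perm[i]])]
        for i in range(len(reference))
    ]


layer_a = [[1, 0], [0, 1], [1, 1]]
layer_b = [[0, 2], [2, 2], [3, 0]]  # same objects, different slot order
print(align_slots(layer_a, layer_b))  # [2, 0, 1]
print(fuse_slots(layer_a, layer_b))
```

Alignment is needed because slot order is arbitrary in each layer, so averaging slot `i` of one layer with slot `i` of another would mix unrelated objects.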

Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes

This work is the first to push multi-view crowd tracking from small scenes with dozens of frames (e.g., Wildtrack/MultiviewX) to large-scale real-world scenes spanning hundreds of meters. It proposes a fully Transformer-based model, MVTrackTrans (tracking in ground BEV space + view-ground cross-attention to complement appearance information), and releases two large-scale long-sequence datasets, MVCrowdTrack and CityTrack. The model leads CNN-based methods in MOTA/IDF1 on large datasets.

NAF: Zero-Shot Feature Upsampling via Neighborhood Attention Filtering

NAF reformulates the task of "upsampling low-resolution features from Vision Foundation Models (VFMs)" as a neighborhood attention filtering process that only observes the high-resolution input image, ignoring the VFM features themselves during guidance. Trained once, it can be applied zero-shot to any VFM (including 7B models) and any magnification factor. It achieves new SOTA results across downstream tasks such as semantic segmentation, depth estimation, open-vocabulary segmentation, and video propagation, while operating approximately 4x faster than comparable methods.

Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling

The Poisson distribution over discrete latent spike variables in a VAE is replaced with a Negative Binomial distribution. By introducing a dispersion parameter, the model allows the variance to exceed the mean, capturing the "overdispersion" of real neural spikes. The framework includes a trainable KL estimation and reparameterization sampling, achieving reconstruction and generation quality superior to single-layer VAE baselines across four datasets.
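
For reference, the overdispersion property can be written out under the standard mean-dispersion parameterization of the Negative Binomial (the symbols \(\mu\) and \(r\) are chosen here for illustration and may differ from the paper's notation):

\[
\mathbb{E}[z] = \mu, \qquad \operatorname{Var}[z] = \mu + \frac{\mu^{2}}{r} > \mu, \qquad \lim_{r \to \infty} \operatorname{Var}[z] = \mu .
\]

A Poisson variable always has variance equal to its mean, so it is recovered only in the limit \(r \to \infty\). For example, with \(\mu = 4\) and \(r = 2\) the variance is \(4 + 16/2 = 12\), three times the Poisson value.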

Neural Collapse in Test-Time Adaptation

The authors extend Neural Collapse (NC) theory from the class level to the sample level, discovering the NC3+ phenomenon (alignment between sample feature embeddings and corresponding classifier weights). Based on this, they reveal that performance degradation under distribution shift is fundamentally caused by sample-level feature-classifier misalignment. They propose the NCTTA method, which uses a mixed objective of geometric proximity and prediction confidence to guide feature realignment, achieving a 14.52% improvement over Tent on ImageNet-C.

Neural Mixture Density Processes

Addressing the limitation where classical Neural Processes (NPs) only output unimodal predictive distributions due to Gaussian likelihood assumptions, this paper proposes Neural Mixture Density Processes (NMDP). By using a Dirichlet latent variable on a simplex to linearly weight a set of task-shared density experts and training with an importance-weighted EM/MM proxy objective, NMDP achieves competitive predictive accuracy, superior uncertainty calibration, and interpretable task representations on heterogeneous, multi-modal function families.

NeuroRule: Bridging Vision and Logic with Differentiable Rule Induction

NeuroRule connects the pixel-level perception of Mask2Former with a differentiable first-order logic rule induction engine. It automatically learns explainable compositional logic rules (e.g., riding(x,y) ∧ on(y,z) → travel-on(x,z)) from images in an end-to-end manner. This approach achieves SOTA performance across three scene graph benchmarks (VG / PSG / Open-PSG) while providing an auditable reasoning chain for every relation prediction.

NexusFlow: Unifying Disparate Tasks under Partial Supervision via Invertible Flow Networks

NexusFlow utilizes a set of "surrogate networks" with invertible affine coupling layers to map intermediate features of structurally disparate tasks (e.g., sparse object tracking vs. dense map reconstruction) into a unified standard latent space with aligned distributions. In extreme partial supervision scenarios where labels are partitioned by geographic domain (e.g., tasks labeled only in different cities), it achieves performance nearly approaching full supervision as a plug-and-play module without altering the original model architecture.

Optical Diffraction-based Convolution for Semiconductor Lithography

OptiCo derives the Rayleigh-Sommerfeld diffraction integral into a "complex convolution," constructing Optical Phase (OP) kernels that encode light wave phase variations. These kernels are directly embedded into a CNN, allowing the network to explicitly adhere to diffraction physics during lithography mask optimization. This approach reduces the Edge Placement Error (EPE) from double-digit levels seen in peer models to near zero on the OOD subset of LithoBench.

Order Matters: 3D Shape Generation from Sequential VR Sketches

The authors propose VRSketch2Shape, a framework that models temporal stroke information of VR sketches for the first time. Utilizing a sequence-aware BERT encoder and a diffusion-based 3D generator (SDFusion), it generates high-fidelity 3D shapes from ordered VR sketches. The work also contributes a multi-category dataset containing 20k synthetic and 900 real sketches.

PAI-Bench: A Comprehensive Benchmark For Physical AI

PAI-Bench decomposes "Physical AI" into two capabilities, perception and prediction, and maps them to three evaluation tracks: video generation, conditional video generation, and video understanding. Using 2,808 real-world samples paired with task-aligned physical plausibility metrics, the authors systematically evaluate 15 video generation models, 4 controllable generation models, and 16 Multimodal Large Language Models (MLLMs). The findings indicate that while these models produce aesthetically pleasing visuals, they generally fail to learn physical laws, and their understanding capabilities lag significantly behind human performance.

PAUL: Uncertainty-Guided Partition and Augmentation for Robust Cross-View Geo-Localization under Noisy Correspondence

Addressing the "semi-positive" alignment noise caused by GPS drift in UAV-satellite cross-view localization, PAUL utilizes a Gaussian Mixture Model (GMM) to softly partition clean/noisy pairs, employs Evidential Deep Learning (EDL) for uncertainty-guided region mask augmentation, and uses dual-network co-training to absorb effective signals from noisy samples, consistently outperforming existing noisy correspondence methods under various noise ratios.

PhysSkin: Real-Time and Generalizable Physics-Based Skin Simulation

This paper proposes PhysSkin, a generalized physics-informed framework that directly learns continuous skinning weight fields from static 3D geometries via a neural skinning field autoencoder. Using physics-informed self-supervised learning strategies (energy minimization + smoothness + orthogonality constraints), it achieves real-time physics-based animation across shapes and discretizations without any labeled data or simulation trajectories.

Plug-and-Play Incomplete Multi-View Clustering via Janus-Faced Affinity Learning with Topology Harmonization

PJFTH proposes a hyperparameter-free plug-and-play framework for incomplete multi-view clustering. It utilizes "Janus-faced affinity learning" to explicitly strip view-exclusive artifacts before fusing the consensus graph and "topology calibration" to align disordered anchor sequences across views. The objective is optimized via a six-step alternating process whose complexity is linear in the sample size \(n\), achieving competitive performance on six datasets with varying missing rates.

Progressive Neural Architecture Generation

PNAG recasts neural architecture "generation" as a coarse-to-fine autoregressive process: each step decodes a fully functional sub-architecture using vector quantization, gradually increasing in scale until the target architecture is reached. By applying consistency constraints at every step to ensure validity, it cuts the time for a single generation by 1300× compared to diffusion-based methods while achieving higher architectural accuracy.

Prototype-based Causal Intervention for Multi-Label Image Classification

ProCI models the "confounding context" in multi-label classification as a set of learnable category-level prototypes, storing them in a dynamic memory and using an adaptive module to approximate Pearl's backdoor adjustment in the feature space. Relying only on image-level labels, it eliminates reliance on spurious co-occurrences, improving F2CIW by +5.44 points on the heavily confounded industrial dataset Sewer-ML.

RaUF: Learning the Spatial Uncertainty Field of Radar

RaUF reformulates "low-fidelity radar point cloud reconstruction" as a Bayesian problem of learning a spatial uncertainty field. It employs anisotropic Gaussians to characterize the "crescent-shaped" azimuth/range uncertainty of radar, converting conflicting "feature-to-label" supervision into learnable confidence signals. By injecting Doppler consistency into spatial features via bidirectional domain attention to suppress ghost points, it achieves state-of-the-art reconstruction accuracy and downstream task reliability on Coloradar, RaDelft, and self-collected datasets.

Region-Wise Correspondence Prediction between Manga Line Art Images

This paper introduces the task of "directly predicting region-level correspondences from unlabeled raw manga line art pairs." It employs a combined ViT + Multiplex Transformer to jointly learn intra-image structural grouping and cross-image similarity. Combined with edge-aware post-processing to transform patch similarities into pixel-level region segmentation and matching, the method achieves 78.4–84.4% region-level accuracy on hand-drawn style line art.

Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach

Addressing optimal F-measure prediction in multi-label classification, this paper rewrites the \(O(q^3)\) matrix multiplication in the Bayesian rule into a convolution using the Hankel structure, further reducing complexity to \(O(q^2\log q)\) via FFT. It replaces the traditionally difficult-to-train \(q\) multinomial estimators with a "train \(q\) binary estimators + autoregressive sampling + Monte Carlo integration" strategy to alleviate sparsity issues, consistently outperforming standard practices from the past decade across six datasets.
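
The Hankel-to-convolution step rests on a general identity: multiplying a Hankel matrix by a vector equals a slice of a 1D convolution. The sketch below checks that identity with a direct convolution. It is not the paper's algorithm, and in practice an FFT-based convolution would replace `convolve` to obtain the logarithmic factor. All function names are hypothetical.

```python
def hankel_matvec_naive(h, x):
    """y[i] = sum_j h[i + j] * x[j], i.e. H @ x with H[i][j] = h[i + j]."""
    n = len(x)
    return [sum(h[i + j] * x[j] for j in range(n)) for i in range(n)]


def convolve(a, b):
    """Direct full linear convolution; an FFT would compute the same result faster."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def hankel_matvec_conv(h, x):
    """Same product, read off from the convolution of h with reversed x."""
    n = len(x)
    full = convolve(h, x[::-1])
    return full[n - 1 : 2 * n - 1]


h = [1, 2, 3, 4, 5]  # defines a 3x3 Hankel matrix (length 2n - 1)
x = [1, 0, -1]
print(hankel_matvec_naive(h, x))  # [-2, -2, -2]
print(hankel_matvec_conv(h, x))   # [-2, -2, -2]
```

Reversing `x` turns the index sum `i + j` of the Hankel structure into the index difference that a convolution computes, which is why the two functions agree.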

Revisiting Sparsity Constraint Under High-Rank Property in Partial Multi-Label Learning

This paper points out that the long-standing joint assumptions of "sparse noise labels + low-rank true labels" in Partial Multi-Label Learning (PML) are actually contradictory. It proves that sparse perturbations in fact preserve the high-rank property of predicted label matrices. Accordingly, the authors propose Schirn, which applies a sparsity constraint to the noise matrix and a high-rank (nuclear norm) constraint to the prediction matrix, and which consistently outperforms 9 SOTA methods across 11 datasets.

RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models

This paper unifies Softmax Attention, Linear Attention, and Mamba into a single token-mixing matrix \(Y=MX\). Through rank analysis, it proves that Mamba is a "low-rank approximation" of Softmax Attention, with its representational power strictly bounded between the two. The authors propose the Binary-AUC metric to quantify feature map quality and demonstrate that Vision Mamba trained via DINO self-supervision achieves 78.5% ImageNet linear probing accuracy.
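
To make the token-mixing view concrete, the two attention variants can be written in their standard textbook forms (row normalization of the linear variant omitted; the feature map \(\phi\) is assumed to keep the head dimension \(d\), and the paper's own notation may differ). With \(Q, K \in \mathbb{R}^{N \times d}\):

\[
M_{\text{softmax}} = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right), \qquad
M_{\text{linear}} = \phi(Q)\,\phi(K)^{\top}, \qquad
\operatorname{rank}\!\left(M_{\text{linear}}\right) \le d .
\]

The rank bound holds because \(M_{\text{linear}}\) is a product of an \(N \times d\) and a \(d \times N\) matrix, whereas the row-wise softmax is nonlinear and can yield an \(N \times N\) matrix of rank up to \(N\). This gap is what a rank analysis of \(M\) compares.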

Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance

SMVS-TAG concatenates anchors learned from each view into a third-order tensor and imposes a tensor Schatten p-norm low-rank constraint in the frequency domain. This directly couples cross-view consistency and complementarity at the "anchor itself" level. This approach improves anchor quality while ensuring the regularization term is independent of the sample size \(n\). It substantially improves clustering accuracy (ACC) for large-scale multi-view clustering across seven datasets (leading the second-best method by over 30% on certain datasets).

Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation

The FECO framework is proposed to achieve robust dense foot contact estimation from a single RGB image through shoe style-content randomization (adversarial training) and ground-aware learning (pixel height maps + ground normals), significantly outperforming existing methods on multiple benchmarks.

SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

The SimRecon framework proposes a "Perception → Generation → Simulation" pipeline to automatically construct simulator-ready compositional 3D scenes from real-world videos. The core innovations include Active View Optimization (AVO) to find optimal projection views for single-object generation and a Scene Graph Synthesizer (SGS) to guide physically plausible hierarchical assembly.

Spectral Conformal Risk Control: Distribution-Free Tail Guarantees via Bayesian Quadrature

This paper proposes BQ-SRC, extending conformal risk control from "averaging loss" to "managing high-cost tail errors" using spectral risk (such as CVaR). By constructing a distribution-free risk upper envelope from a Bayesian quadrature perspective and replacing the DKW inflation with an exact binomial confidence lower bound, the method reduces Monte Carlo conservatism by approximately 3x. It maintains finite-sample tail risk guarantees with smaller prediction sets across tasks including synthetic regression, multi-label classification, and semantic segmentation.
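
As a reminder of what a tail-focused spectral risk measures, the sketch below computes a plain empirical CVaR: the mean of the worst fraction of observed losses. It is only the naive plug-in estimator, without the Bayesian quadrature envelope or the finite-sample guarantees described above. The function name `empirical_cvar` and the `tail` parameter are hypothetical.

```python
import math


def empirical_cvar(losses, tail):
    """Mean of the worst `tail` fraction of losses (0 < tail <= 1)."""
    if not 0 < tail <= 1:
        raise ValueError("tail must be in (0, 1]")
    ordered = sorted(losses, reverse=True)
    # Keep at least one sample so the estimate is always defined.
    k = max(1, math.ceil(tail * len(ordered)))
    worst = ordered[:k]
    return sum(worst) / k


losses = [1, 2, 3, 4, 5, 6, 7, 8]
print(empirical_cvar(losses, tail=1.0))   # 4.5, the ordinary mean
print(empirical_cvar(losses, tail=0.25))  # 7.5, the mean of the two worst losses
```

Setting `tail=1.0` recovers the average loss that standard conformal risk control targets, while smaller values weight only the high-cost errors.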

TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size

The TeamHOI framework is proposed, utilizing a Transformer-based decentralized policy network and Masked Adversarial Motion Priors (Masked AMP). This allows a single policy to generalize to cooperative carrying tasks with an arbitrary number of agents, achieving a \(97\%+\) success rate for teams of 2-8 humanoid agents carrying tables.

Towards Knowledge-augmented Bayesian Deep Learning For Computer Vision

This work embeds domain knowledge into both the "prior" and the "likelihood" of Bayesian inference. An informative prior \(p(\theta\mid K)\) is first pre-trained under knowledge constraints, followed by an adaptive "knowledge likelihood" \(p(K\mid\theta,D)\) during the main training stage to continuously enforce constraints. This approach achieves higher accuracy, stable constraint satisfaction, and superior uncertainty estimation in image classification and monocular 3D hand reconstruction.

UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition

UniMERNet redefines the task of converting formula images to LaTeX: it constructs the UniMER-1M dataset covering four real-world scenarios and, based on the observation that "decoder attention naturally follows a raster-scan (horizontal then vertical) pattern," proposes Raster-Scan Attention. This decomposes 2D attention into two orthogonal 1D computations, reducing complexity from \(O(NH^2W^2D)\) to \(O(NHWD(H+W))\). With 313M parameters, it achieves ~10× VRAM savings and 5× speedup, while its CDM consistently outperforms Texify, GOT, and even 72B/78B multimodal large models across four real-world scenarios.
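
The size of the saving follows directly from the two complexity expressions quoted above:

\[
\frac{N H^{2} W^{2} D}{N H W D\,(H + W)} = \frac{HW}{H + W} .
\]

For a square feature map with \(H = W\), the ratio is \(H/2\). As a purely illustrative case (the resolution is not taken from the paper), \(H = W = 32\) gives \(1024 / 64 = 16\), so the attention cost drops by a factor of 16 in the leading term.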

Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

Upsample Anything unifies classical Joint Bilateral Upsampling (JBU) and 2D Gaussian Splatting into a per-pixel anisotropic Gaussian kernel. These kernels are learned via a 50-step "RGB self-reconstruction" test-time optimization for each image, then directly transferred to the low-resolution features of foundation models for pure mixture upsampling. It requires no dataset-level training, takes approximately 0.419 seconds for a 224×224 image, and achieves or approaches SOTA across segmentation, depth estimation, and depth/probability map upsampling.

VideoMaMa: Mask-Guided Video Matting via Generative Prior

VideoMaMa utilizes a pre-trained video diffusion model (SVD) to "translate" coarse binary segmentation masks into pixel-accurate alpha mattes. Trained solely on synthetic data, it achieves zero-shot generalization to real videos. It automatically converts SA-V segmentation annotations into MA-V, a matting dataset featuring over 50,000 real video clips, which is subsequently used to fine-tune a standard SAM2 into a more robust matting model, SAM2-Matte.

VideoWorld 2: Learning Transferable Knowledge from Real-world Videos

VideoWorld 2 proposes the "Dynamic-enhanced Latent Dynamics Model (dLDM)," which utilizes a pre-trained Video Diffusion Model (VDM) to handle appearance reconstruction. This forces latent codes to encode only task-relevant action dynamics, enabling the first-ever learning of transferable and executable long-horizon task knowledge from raw real-world videos. On a minute-level manual origami task, the 7-step continuous success rate improved from a 0% baseline to 68.8%, with the capability to transfer manipulation knowledge learned on Open-X to CALVIN.

ViT3: Unlocking Test-Time Training in Vision

This paper systematically explores the design space of Test-Time Training (TTT) for vision tasks, summarizes six practical design insights, and proposes ViT3, a pure TTT vision architecture with linear complexity that matches or exceeds Mamba and linear attention methods in classification, generation, detection, and segmentation tasks.

What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely \(F_1\)

This paper systematically investigates the properties of the \(F_\beta\) score family as a ranking tradeoff between Precision and Recall from a ranking theory perspective. It proves that the rankings induced by \(F_\beta\) constitute a geodesic (shortest path) between Precision and Recall rankings. Consequently, it proposes a closed-form formula to find the optimal \(\beta\) value and demonstrates that the commonly used \(F_1\) and skew-insensitive \(F_1\) are not the optimal ranking tradeoffs in most cases.
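
A small numeric sketch shows how the choice of \(\beta\) changes the induced ranking. It uses the standard definition \(F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)\) and does not implement the paper's closed-form formula for the optimal \(\beta\). The two models and their precision/recall values are made up for illustration.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; returns 0.0 when the denominator vanishes."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


def rank_by_f_beta(models, beta):
    """Sort model names from best to worst F-beta score."""
    return sorted(
        models,
        key=lambda name: f_beta(models[name][0], models[name][1], beta),
        reverse=True,
    )


# Hypothetical (precision, recall) pairs.
models = {"A": (0.9, 0.5), "B": (0.6, 0.8)}

print(rank_by_f_beta(models, beta=1.0))  # ['B', 'A']
print(rank_by_f_beta(models, beta=0.5))  # ['A', 'B']
```

Model A wins when precision is weighted more heavily (\(\beta = 0.5\)) and loses at \(\beta = 1\), so the ranking of the same two systems depends on where \(\beta\) sits between the Precision ranking and the Recall ranking.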

What Is Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution

This paper systematically analyzes the deficiencies of existing rendered synthetic data in terms of corpus, font, and layout diversity. It proposes the UnionST synthesis engine and a Self-Evolution Learning (SEL) framework. Using only synthetic data, UnionST significantly outperforms traditional synthetic sets. Combined with SEL, it approaches fully supervised performance using only 9% of real-world annotations.

When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse

The authors perform the first systematic evaluation of mainstream Audio-Visual Speech Recognition (AVSR) models in real video conferencing (VC) scenarios, finding that error rates skyrocket from 0.93%/0.56% to the 33% range. Consequently, they construct the first VC-oriented multimodal dataset, MLD-VC (31 speakers, 22.79 hours, 4 platforms, with explicit Lombard effect injection). By deconstructing the transmission pipeline, they identify that "speech enhancement algorithms shifting F1/F2 formants upward" is the hidden culprit behind performance collapse; fine-tuning on MLD-VC reduces the average CER by 17.5%.

WiTTA-Bench: Benchmarking Test-Time Adaptation for WiFi Sensing

WiTTA-Bench is the first benchmark to systematically evaluate "Test-Time Adaptation (TTA) for WiFi Sensing." It decomposes domain shifts in WiFi Channel State Information (CSI) into three physically-induced categories: cross-environment, cross-subject, and cross-device. Evaluating 20 representative TTA methods under Online TTA (OTTA) and Test-Time Domain Adaptation (TTDA) protocols, while introducing a paired cross-device dataset WiHAR-Dual, the study identifies a difficulty hierarchy of CE < CS < CD and reveals WiFi-specific findings, such as the failure of consistency-based methods that typically excel in computer vision.

Zero-shot Detection of AI-Generated Image via RAW-RGB Alignment

The authors redefine a "synthetic image" as an image generated directly in digital space without a physical world source. They propose a self-supervised method using only real RAW–RGB data pairs to learn a forensic feature called an "alignment trace"—which characterizes "whether this RGB can be traced back to a legitimate RAW source"—achieving zero-shot SOTA performance (Clustering NMI 0.964, Similarity AUC 0.925) without exposure to any generative model priors.