📂 Others
📷 CVPR2026 · 54 paper notes
- AdaSFormer: Adaptive Serialized Transformers for Monocular Semantic Scene Completion from Indoor Environments
-
This paper proposes AdaSFormer, a serialized Transformer framework for indoor Monocular Semantic Scene Completion (MSSC), achieving state-of-the-art performance on NYUv2 and Occ-ScanNet through three core designs: Adaptive Serialization Attention (with learnable offsets), Center-Relative Position Encoding, and Convolutional Modulation Layer Normalization.
- AssistMimic: Physics-Grounded Humanoid Assistance via Multi-Agent RL
-
The first multi-agent RL framework that performs contact-rich human-human assistive motion imitation in physics simulation, enabling MARL in high-contact settings via motion prior initialization, dynamic reference redirection, and contact facilitation rewards.
- BenDFM: A Taxonomy and Synthetic CAD Dataset for Manufacturability Assessment in Sheet Metal Bending
-
This paper proposes a two-dimensional taxonomy of manufacturability metrics (configuration dependence × feasibility/complexity) and introduces BenDFM, the first synthetic CAD dataset for sheet metal bending (20,000 parts, covering both manufacturable and non-manufacturable designs). Benchmark results show that topology-aware graph representations (UV-Net, AUC 0.896) consistently outperform point cloud methods (PointNext, AUC 0.844) across all four task categories.
- BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending
-
This paper proposes a two-dimensional taxonomy of manufacturability metrics (configuration dependency × feasibility/complexity) and constructs BenDFM, the first synthetic dataset for sheet metal bending (20k parts). Benchmark results show that graph-based representations (UV-Net) outperform point cloud representations (PointNext), and configuration-dependent metrics are harder to predict.
- Bounds on Agreement between Subjective and Objective Measurements
-
Starting from the mathematical properties of MOS, this paper derives theoretical formulas for the upper bound on PCC and the lower bound on MSE between subjective test results and any objective estimator. It further proposes the BinoVotes/BinoMOS voting model and validates both the bounds and the model on 18 subjective test datasets.
- Bounds on Agreement between Subjective and Objective Measurements
-
This paper derives closed-form expressions for the upper bound on PCC and the lower bound on MSE between subjective MOS values and any objective quality estimator, and proposes BinoVotes — a binomial distribution-based voting model — to estimate these bounds when per-vote variance information is unavailable.
- U-F²-CBM: CLIP-Free, Label Free, Unsupervised Concept Bottleneck Models
-
This paper proposes TextUnlock, a method that trains a lightweight MLP to project features from an arbitrary frozen visual classifier into the text embedding space—while preserving the original classifier's output distribution—requiring no CLIP, no annotations, and no linear probe training. Any legacy classifier can thereby be converted into an interpretable concept bottleneck model. Evaluated on 40+ architectures, the approach surpasses even supervised CLIP-based CBMs.
- Coded-E2LF: Coded Aperture Light Field Imaging from Events
-
This paper provides the first demonstration that an event camera alone (without conventional intensity images) can reconstruct a 4D light field at pixel-level accuracy. The proposed Coded-E2LF system triggers events via a coded aperture pattern sequence and accumulates them into event images. By introducing an all-black pattern, a mathematical equivalence between event-based and intensity-based coded aperture imaging is established. Combined with end-to-end deep optics training, the system achieves 8×8 sub-aperture light field reconstruction.
- Crowdsourcing of Real-world Image Annotation via Visual Properties
-
This paper proposes an image annotation methodology constrained by visual properties. It constructs an object category hierarchy through knowledge representation and combines an interactive crowdsourcing framework that leverages visual genus and visual differentia to guide the annotation process, thereby reducing annotator subjectivity and the semantic gap problem.
- Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis
-
Through three complementary levels of analysis — macroscopic convergence state, microscopic gradient dynamics, and information-theoretic limits — this paper rigorously proves that even given a perfect noise transition matrix, Forward Correction (FC) inevitably collapses to the same suboptimal level as no correction. The root cause lies in memorization under finite samples and the information loss induced by the noisy channel.
- Deconstructing the Failure of Ideal Noise Correction: A Three-Pillar Diagnosis
-
Through controlled experiments, this paper demonstrates that even given a perfect noise transition matrix \(T\), forward correction (FC) still suffers from performance collapse in the late stages of training. The paper systematically diagnoses the root causes of this failure from three complementary perspectives: macroscopic convergence states, microscopic optimization dynamics, and information theory.
- DiffBMP: Differentiable Rendering with Bitmap Primitives
-
This paper proposes DiffBMP — the first general-purpose differentiable rendering engine for bitmap primitives — which enables efficient gradient-based optimization of position, rotation, scale, color, and opacity across thousands of bitmap primitives via a custom CUDA parallel pipeline, filling the gap left by 2D differentiable rendering methods that are restricted to vector graphics.
- DirPA: Addressing Prior Shift in Imbalanced Few-shot Crop-type Classification
-
This paper proposes Dirichlet Prior Augmentation (DirPA), which mitigates prior shift between artificially balanced training episodes and severely imbalanced real-world label distributions by sampling from a Dirichlet distribution to simulate unknown long-tailed label distributions during few-shot training. The method is validated on crop-type classification tasks across multiple EU countries, demonstrating cross-regional effectiveness.
- DirPA: Addressing Prior Shift in Imbalanced Few-shot Crop-type Classification
-
This paper proposes Dirichlet Prior Augmentation (DirPA), which constructs imbalanced episodes during FSL training by sampling class proportion vectors from a Dirichlet distribution, actively simulating real-world long-tail distributions to eliminate prior shift. The method demonstrates consistent robustness improvements and rare-class accuracy gains on crop-type classification tasks across multiple European countries.
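The episode construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the symmetric concentration parameter `alpha`, and the episode size are assumptions chosen for illustration. A Dirichlet sample is drawn by normalizing independent Gamma draws, and episode labels are then sampled according to those class proportions.

```python
import random


def sample_class_proportions(n_classes, alpha, rng):
    """Draw one class-proportion vector from a symmetric Dirichlet(alpha).

    Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    The symmetric concentration `alpha` is an assumption for illustration.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_episode_labels(n_classes, episode_size, alpha, rng):
    """Sample the labels of one imbalanced episode (hypothetical helper)."""
    proportions = sample_class_proportions(n_classes, alpha, rng)
    classes = list(range(n_classes))
    return rng.choices(classes, weights=proportions, k=episode_size)


rng = random.Random(0)
proportions = sample_class_proportions(n_classes=5, alpha=0.5, rng=rng)
labels = sample_episode_labels(n_classes=5, episode_size=20, alpha=0.5, rng=rng)
```

Smaller values of `alpha` concentrate the mass on a few classes, which produces more strongly long-tailed episodes; larger values approach a balanced episode.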
- Do Vision Models Perceive Illusory Motion in Static Images Like Humans?
-
This paper systematically evaluates a range of optical flow models on static-image motion illusions such as the Rotating Snakes, finding that only the biologically-inspired Dual-Channel model reproduces the human-perceived rotational motion under simulated saccade conditions.
- Dual-Band Thermal Videography: Separating Time-Varying Reflection and Emission Near Ambient Conditions
-
This paper proposes a dual-band long-wave infrared (LWIR) video analysis framework that jointly leverages spectral cues (constant emissivity ratio across dual bands) and temporal cues (smooth object radiance variation vs. abrupt background radiance changes) to achieve, for the first time, pixel-wise separation of reflected and emitted components in dynamic scenes near ambient temperature, along with recovery of per-pixel emissivity and temperature fields.
- ELogitNorm: Enhancing OOD Detection with Extended Logit Normalization
-
This paper diagnoses two feature collapse problems in LogitNorm (dimensional collapse and origin collapse), and proposes ELogitNorm — replacing the feature norm with the average distance to decision boundaries as an adaptive temperature scaling factor. The method requires no hyperparameters, is compatible with all post-hoc OOD detection methods, achieves a 10.48% far-OOD AUROC improvement on CIFAR-10 (with SCALE), reduces FPR95 from 51.45% to 27.74% on ImageNet-1K, and simultaneously improves classification accuracy and ECE calibration.
- FEAT: Federated Geometry-Aware Correction for Exemplar Replay under Continual Dynamic Heterogeneity
-
FEAT is proposed to address the underutilization of replay exemplars in federated continual learning (FCL), mitigating cross-client heterogeneity and task-level data imbalance via geometric structure alignment (angular distillation based on ETF prototypes) and energy-based geometric correction (inference-time debiasing).
- GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents
-
This paper proposes GardenDesigner, a framework that encodes the aesthetic principles of Jiangnan gardens into computable constraints through a chain of agents (terrain distribution → road generation → asset selection → layout optimization). Combined with the expert-annotated GardenVerse dataset, the framework enables non-expert users to automatically construct aesthetically compliant Jiangnan gardens from text input within one minute.
- GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global-Local Feature Fusion
-
This paper proposes GazeOnce360, an end-to-end dual-resolution CNN model for 360° multi-person gaze direction estimation using a single upward-facing tabletop fisheye camera. The authors also construct MPSGaze360, the first large-scale synthetic dataset for this setting, achieving substantial improvements over the existing multi-stage method GAM360 in both accuracy and speed.
- HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
-
This paper proposes HypeVPR, a visual place recognition framework based on hierarchical embedding in hyperbolic space, specifically designed to address cross-field-of-view matching between perspective (query) and equirectangular panoramic (database) images. By constructing multi-level descriptors from local to global within the Poincaré ball, HypeVPR flexibly balances accuracy, efficiency, and storage, and reaches retrieval speeds several times faster than sliding-window baselines at comparable accuracy.
- Integration of deep generative Anomaly Detection algorithm in high-speed industrial line
-
A GAN-based dense bottleneck residual autoencoder (DRAE), an improved variant of GRD-Net, achieves semi-supervised anomaly detection on a pharmaceutical BFS production line, completing inference over 2.81 million training patches within a 500 ms time constraint (0.17 ms/patch) at a balanced accuracy of 97.62%.
- Integration of Deep Generative Anomaly Detection Algorithm in High-Speed Industrial Line
-
This paper proposes a semi-supervised anomaly detection framework based on GAN and a Dense Residual Autoencoder (DRAE), specifically designed for high-speed online quality inspection in pharmaceutical Blow-Fill-Seal (BFS) production lines. Trained exclusively on non-defective samples, the system achieves 96.4% accuracy with a per-patch inference latency of only 0.17ms, satisfying the strict industrial constraint of a 500ms inspection cycle.
- IrisFP: Adversarial-Example-based Model Fingerprinting with Enhanced Uniqueness and Robustness
-
This paper proposes IrisFP, a model fingerprinting framework that simultaneously enhances fingerprint uniqueness and robustness through three innovations: placing fingerprints at the intersection of multi-class decision boundaries, constructing composite sample fingerprints, and performing statistically-guided fingerprint selection. IrisFP consistently achieves higher AUC than state-of-the-art methods across 5 datasets.
- LoViF 2026 Challenge on Human-oriented Semantic Image Quality Assessment
-
The inaugural LoViF 2026 challenge on human-oriented semantic image quality assessment introduces the SeIQA benchmark dataset (510/80/160 train/validation/test pairs) to measure whether image degradation alters the semantic information that humans care about, rather than traditional perceptual fidelity. The winning solution, RedpanQA Alliance, achieves a final score of 0.8724 using the Qwen3-VL multimodal large language model with LoRA fine-tuning and a PLCC loss.
- Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning
-
To address the "instance entanglement" problem in instance-dependent partial label learning (ID-PLL)—where instances from visually similar classes share overlapping features and candidate label sets—this paper proposes the CAD framework, which mitigates class confusion through two complementary mechanisms: intra-class alignment via class-specific augmentation and inter-class separation via a weighted penalty loss.
- MyoVision: A Mobile Research Tool and NEATBoost-Attention Ensemble Framework for Real Time Chicken Breast Myopathy Detection
-
This paper proposes MyoVision, a smartphone-based transillumination imaging framework, and the NEATBoost-Attention neuroevolution-optimized ensemble model for low-cost, real-time three-class classification of chicken breast myopathies (Wooden Breast and Spaghetti Meat).
- NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries
-
This paper proposes NaiLIA, a multimodal retrieval method for nail design images that achieves fine-grained matching via dense intent descriptions and palette queries. A confidence-based relaxed contrastive (CRC) loss is introduced to handle unlabeled positives. NaiLIA substantially outperforms existing methods on the authors' newly constructed NAIL-STAR benchmark and on Marqo Fashion200K.
- Neural Collapse in Test-Time Adaptation
-
This work extends Neural Collapse (NC) theory from the class level to the sample level, discovering the NC3+ phenomenon (sample feature embeddings align with their corresponding classifier weights). Building on this, it identifies feature-classifier misalignment at the sample level as the root cause of performance degradation under distribution shift, and proposes NCTTA, which employs a hybrid objective combining geometric proximity and prediction confidence to guide feature re-alignment, achieving a 14.52% improvement over Tent on ImageNet-C.
- Next-Scale Autoregressive Models for Text-to-Motion Generation
-
MoScale proposes a next-scale autoregressive motion generation framework that replaces conventional next-token prediction. By performing hierarchical causal generation from coarse to fine, the model captures global semantic structure and introduces cross-scale hierarchical refinement and in-scale temporal refinement, achieving state-of-the-art performance on HumanML3D and KIT-ML (Top-1 0.540, FID 0.046).
- Novel Anomaly Detection Scenarios and Evaluation Metrics to Address the Ambiguity in the Definition of Normal Samples
-
To address the practical challenge that the definition of "normal" shifts with specification changes in industrial anomaly detection, this paper proposes two novel evaluation scenarios (A2N/N2A), a new metric (S-AUROC), and a training augmentation method called RePaste. RePaste increases the training frequency of high-anomaly-score regions by repasting them onto subsequent training images, enabling models to flexibly adapt to changes in the definition of normal samples.
- OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion
-
This work introduces OmniFood8K, a multimodal Chinese food nutrition dataset comprising 8,036 samples, along with a synthetic dataset NutritionSynth-115K containing 115K samples. An end-to-end framework is proposed that predicts nutritional information from a single RGB image via a Scale-Shift depth adapter, frequency-aligned fusion, and a mask-based prediction head.
- Order Matters: 3D Shape Generation from Sequential VR Sketches
-
This paper proposes VRSketch2Shape, a framework that, for the first time, models the temporal stroke order of VR sketches. Through a sequence-aware BERT encoder combined with a diffusion-based 3D generator (SDFusion), the framework generates high-fidelity 3D shapes from ordered VR sketches. The work also contributes a multi-category dataset comprising 20k synthetic and 900 real sketches.
- POLISH'ing the Sky: Wide-Field and High-Dynamic Range Interferometric Image Reconstruction
-
POLISH++ extends the POLISH framework by introducing a patch-wise training-and-stitching strategy and an arcsinh nonlinear transformation, addressing two major practical deployment challenges in radio interferometric imaging: wide-field imaging (images exceeding ten thousand pixels) and high dynamic range (\(10^4\)–\(10^6\)). On T-RECS simulated data, POLISH++ substantially outperforms CLEAN in source detection accuracy, recovers strong gravitational lens systems near the PSF scale through super-resolution, and is projected to increase the number of gravitational lens discoveries in DSA surveys by approximately one order of magnitude.
- Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model
-
This paper proposes HD-LIF (Hybrid-Driven LIF), a family of spiking neuron models that adopts distinct spike computation mechanisms above and below the firing threshold. It theoretically establishes gradient separability and alignment, resolving the forward–backward propagation inconsistency in SNN online training, while simultaneously achieving full-pipeline optimization of learning accuracy, memory complexity, and power consumption—attaining 78.61% accuracy on CIFAR-100 with 10× parameter compression, 11× power reduction, and 30% NOPs savings.
- Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning via Hybrid-Driven LIF Model
-
This paper proposes the Hybrid-Driven LIF (HD-LIF) model family, which achieves gradient separability and alignment by adopting distinct spike computation mechanisms in the sub- and supra-threshold regions. This approach resolves the fundamental forward–backward propagation inconsistency in SNN online training, while simultaneously optimizing training accuracy, memory complexity, and inference power consumption across all stages.
- Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods
-
This work establishes a learning-observation framework based on PIV wind tunnel experimental data, systematically comparing Kriging interpolation with three deep learning models (UNet/ViTAE/CWGAN) for rooftop wind field reconstruction under 5–30 sparse sensors. It demonstrates that under multi-direction training (MDT), deep learning consistently outperforms Kriging (SSIM improvement of 18–34%), and sensor placement robustness is enhanced by up to 27.8% via QR decomposition-based sensor layout optimization.
- Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods
-
Based on wind tunnel PIV experimental data, this paper systematically compares Kriging interpolation and three deep learning methods (UNet, ViTAE, CWGAN) for rooftop wind field reconstruction under sparse sensor conditions, and proposes QR decomposition-based sensor placement optimization to enhance robustness.
- Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation
-
This paper proposes FECO, a framework that achieves robust dense foot contact estimation from a single RGB image via shoe style–content randomization (adversarial training) and ground-aware learning (pixel height maps + ground normals), significantly outperforming existing methods on multiple benchmarks.
- SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules
-
This paper proposes SHREC, an algorithm that recovers projection angles of helical molecule segments directly from cryo-EM 2D projection images via spectral embedding, without requiring prior knowledge of helical symmetry parameters (rise/twist), enabling truly ab-initio helical reconstruction.
- SHREC: A Spectral Embedding-Based Approach for Ab-Initio Reconstruction of Helical Molecules
-
SHREC employs spectral embedding to directly recover projection angles of helical molecules from 2D cryo-EM projection images without prior knowledge of helical symmetry parameters. By proving that projections of helical segments form a one-dimensional closed manifold homeomorphic to the circle \(S^1\), the method achieves near-publication-quality high-resolution reconstructions (3.66 Å–8.23 Å) on three public datasets: TMV, VipA/VipB, and MakA.
- SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
-
This paper proposes SimRecon, a framework that automatically constructs simulation-ready compositional 3D scenes from real videos via a three-stage "perception → generation → simulation" pipeline. The core innovations are Active Viewpoint Optimization (AVO), which identifies the optimal projection viewpoint for single-object generation, and the Scene Graph Synthesizer (SGS), which guides physically plausible hierarchical assembly.
- SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design
-
This paper presents SldprtNet, a large-scale multimodal CAD dataset comprising 242,000+ industrial parts, where each sample contains four fully aligned modalities: .sldprt/.step 3D models, seven-view composite images, parametric modeling scripts, and natural language descriptions. The authors develop a lossless encoder/decoder toolchain supporting 13 CAD commands, and baseline experiments demonstrate the significant advantage of multimodal input over text-only input for CAD generation tasks.
- SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design
-
This paper introduces SldprtNet, a large-scale multimodal dataset comprising 242K+ industrial CAD parts, where each sample includes a .sldprt/.step model, a 7-view composite image, a parametric modeling script (supporting lossless encoding/decoding of 13 command types), and a natural language description generated by Qwen2.5-VL. Baseline experiments demonstrate that multimodal input (image + text) outperforms text-only input for CAD generation.
- Stronger Normalization-Free Transformers
-
By systematically analyzing four key properties required for pointwise functions to replace normalization layers (zero-centeredness, boundedness, center-sensitivity, and monotonicity), this work identifies \(\text{Derf}(x) = \text{erf}(\alpha x + s)\) as the optimal normalization-layer substitute through large-scale search. Derf consistently outperforms LayerNorm and DyT across vision recognition, image generation, speech representation, and DNA sequence modeling, with performance gains primarily attributable to stronger generalization rather than fitting capacity.
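As a small illustration of the formula quoted above, the sketch below evaluates \(\text{erf}(\alpha x + s)\) pointwise. Treating `alpha` and `s` as plain floats with defaults of 1.0 and 0.0 is an assumption made here; the summary does not say how they are parameterized or initialized, and any additional affine terms the paper may use are omitted.

```python
import math


def derf(x, alpha=1.0, s=0.0):
    """Pointwise Derf(x) = erf(alpha * x + s).

    `alpha` (scale) and `s` (shift) are plain floats in this sketch;
    their default values are assumptions for illustration only.
    """
    return math.erf(alpha * x + s)


# Evaluate on a few increasing inputs to see the bounded, monotonic shape.
inputs = [-4.0, -1.0, 0.0, 1.0, 4.0]
outputs = [derf(x, alpha=0.5) for x in inputs]
```

With `s = 0` the function passes through the origin, its outputs stay within [-1, 1], and they increase with the input, which matches the zero-centeredness, boundedness, and monotonicity properties listed in the summary.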
- TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size
-
TeamHOI proposes a framework using a Transformer-based decentralized policy network and Masked Adversarial Motion Prior (Masked AMP), enabling a single policy to generalize to cooperative carrying tasks with any number of agents, achieving 97%+ success rate for 2–8 humanoid agents cooperatively carrying a table.
- UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting
-
This paper proposes UniSpector, an open-set industrial defect detection framework that addresses visual prompt embedding collapse through spectral-spatial dual-domain feature fusion (SSPE) and angular-margin contrastive prompt encoding (CPE). On the newly constructed Inspect Anything benchmark encompassing 360 defect categories, UniSpector surpasses the best baseline by 19.7% in AP50 detection and 15.8% in segmentation.
- V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos
-
This paper proposes V-Nutri, the first framework to leverage process information from egocentric cooking videos for dish-level nutrition estimation. A VideoMamba-based keyframe selection module identifies ingredient addition moments, which are fused with the final dish image to predict calories and macronutrients.
- ViT3: Unlocking Test-Time Training in Vision
-
This paper systematically explores the design space of Test-Time Training (TTT) for vision tasks, distills six practical design insights, and proposes ViT3—a purely TTT-based vision architecture with linear complexity—that matches or surpasses Mamba and linear attention methods on classification, generation, detection, and segmentation tasks.
- What Is the Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely F₁
-
This paper systematically studies the \(F_\beta\) score family as a ranking tradeoff between Precision and Recall from a ranking-theoretic perspective. It proves that the rankings induced by \(F_\beta\) form a geodesic (shortest path) between the Precision and Recall rankings, derives a closed-form formula for finding the optimal \(\beta\), and demonstrates that the commonly used \(F_1\) and skew-insensitive \(F_1\) are rarely optimal ranking tradeoffs in practice.
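To see why the choice of \(\beta\) matters for ranking, the sketch below uses the standard definition \(F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)\) and ranks two hypothetical systems under different values of \(\beta\). The systems and their precision/recall values are invented for illustration, and the paper's closed-form optimal \(\beta\) is not reproduced here.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta score; returns 0.0 when the denominator vanishes."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        return 0.0
    return (1.0 + b2) * precision * recall / denominator


# Hypothetical systems as (precision, recall) pairs, for illustration only.
systems = {"A": (0.9, 0.5), "B": (0.6, 0.8)}


def rank(beta):
    """Order system names from best to worst under F-beta."""
    return sorted(systems, key=lambda name: f_beta(*systems[name], beta), reverse=True)


ranking_f1 = rank(1.0)    # recall and precision weighted equally
ranking_f05 = rank(0.5)   # precision weighted more heavily
```

Under \(F_1\) the higher-recall system B ranks first, while under \(F_{0.5}\) the higher-precision system A overtakes it, so the induced ranking depends on \(\beta\).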
- What Is Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution
-
This paper systematically analyzes the deficiencies of existing rendered synthetic data in corpus, font, and layout diversity, and proposes the UnionST synthetic engine together with a Self-Evolution Learning (SEL) framework. Using only synthetic data, the approach substantially outperforms conventional synthetic sets; combined with SEL, only 9% of real labeled data is required to approach fully supervised performance.
- Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation
-
This paper analyzes the energy landscape to reveal the complementarity between adversarial training (AT) and JEM—AT aligns the clean-adversarial energy distribution (→ robustness); JEM aligns the clean-generated energy distribution (→ accuracy + generation). The proposed EB-JDAT models the joint distribution \(p(\mathbf{x}, \tilde{\mathbf{x}}, y)\) and employs min-max energy optimization to align the energy distributions of all three data types. On CIFAR-10, AutoAttack robustness reaches 68.76% (surpassing SOTA AT by +10.78%), while maintaining 90.39% clean accuracy and competitive generation quality with FID=27.42.
- Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation
-
This paper proposes EB-JDAT, a framework that models the joint energy distribution \(p_\theta(\mathbf{x}, \tilde{\mathbf{x}}, y)\) over clean, adversarial, and generated samples, achieving — for the first time in a single model — high classification accuracy, strong adversarial robustness, and competitive generative capability. On CIFAR-10, it attains 66.12% AutoAttack robustness, surpassing state-of-the-art adversarial training methods by over 10 percentage points.
- ZO-SAM: Zero-Order Sharpness-Aware Minimization for Efficient Sparse Training
-
This paper proposes ZO-SAM, which replaces the backpropagation in SAM's perturbation step with zeroth-order gradient estimation, reducing SAM's computational overhead from two backward passes to one. This makes SAM practical for sparse training for the first time, achieving consistent improvements of 0.38%–2.54% over all mainstream sparse training methods on CIFAR-10/100 and ImageNet-1K.