📊 LLM Evaluation
📹 ICCV2025 · 29 paper notes
- 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
  - This paper introduces 3DSRBench, the first comprehensive 3D spatial reasoning benchmark comprising 2,772 manually annotated VQA pairs across 12 question types. Through balanced data distribution and a novel FlipEval strategy, the benchmark enables robust evaluation. Results reveal that state-of-the-art LMMs—including GPT-4o and Gemini—fall far short of human performance on 3D spatial reasoning (≈52% vs. 95.7%), with substantial performance degradation under uncommon camera viewpoints.
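The FlipEval idea can be made concrete with a toy scoring rule. The sketch below is one plausible reading of the strategy (the function names and the both-views-correct criterion are illustrative, not the benchmark's actual protocol): for an orientation-dependent question, horizontally flipping the image must flip a left/right answer, so a model is credited only when it is correct on both views.

```python
# Hedged sketch of a FlipEval-style consistency check. All names are
# illustrative, not the benchmark's API.

FLIP_MAP = {"left": "right", "right": "left"}

def flip_answer(answer: str) -> str:
    """Ground-truth answer for the horizontally flipped image."""
    return FLIP_MAP.get(answer, answer)

def flipeval_score(pred_orig, pred_flip, gt_orig):
    """1 if the model is correct on BOTH the original and flipped views."""
    gt_flip = flip_answer(gt_orig)
    return int(pred_orig == gt_orig and pred_flip == gt_flip)

# A model that always answers "left" gets the original right but fails the
# flipped view, so the pair is scored 0 — left/right bias cannot inflate it.
assert flipeval_score("left", "left", "left") == 0
assert flipeval_score("left", "right", "left") == 1
```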
- A Conditional Probability Framework for Compositional Zero-shot Learning
  - This paper proposes CPF, a conditional probability framework for compositional zero-shot learning (CZSL) that decomposes the compositional likelihood into an object likelihood and a conditional attribute likelihood. Through a text-enhanced object learning module and an object-guided attribute learning module, CPF explicitly models the semantic constraints and contextual dependencies between attributes and objects, achieving a 17.9% AUC improvement on UT-Zappos50K and a 5.5% Unseen Accuracy improvement on MIT-States.
- A Conditional Probability Framework for Compositional Zero-shot Learning
  - This paper proposes a Conditional Probability Framework (CPF) that decomposes the compositional recognition probability into an object likelihood \(p(o|x)\) and a conditional attribute likelihood \(p(a|o,x)\). Two dedicated modules — Text-Enhanced Object learning (TEO) and Object-Guided Attribute learning (OGA) — explicitly model attribute-object dependencies, achieving state-of-the-art performance across three CZSL benchmarks.
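The decomposition at the heart of CPF is easy to state in code. Below is a minimal sketch with toy logits (the dictionaries and head outputs are invented for illustration; the paper's TEO/OGA modules are learned networks): the compositional score is the product of the object posterior and the object-conditioned attribute posterior.

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

# Toy logits (illustrative only):
object_logits = {"shoe": 2.0, "boot": 0.5}
attr_logits_given = {            # conditional attribute head, one row per object
    "shoe": {"leather": 1.0, "rubber": 0.0},
    "boot": {"leather": 0.2, "rubber": 0.8},
}

p_o = softmax(object_logits)
scores = {}
for o, po in p_o.items():
    p_a = softmax(attr_logits_given[o])
    for a, pa in p_a.items():
        scores[(a, o)] = po * pa  # p(a, o | x) = p(o | x) * p(a | o, x)

best = max(scores, key=scores.get)  # highest-scoring (attribute, object) pair
```

Because each factor is a proper distribution, the scores over all pairs sum to one, and the conditional factor lets the attribute posterior change with the predicted object — the contextual dependency the paper targets.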
- A Real-world Display Inverse Rendering Dataset
  - This paper presents the first real-world inverse rendering dataset built upon an LCD display-camera system, comprising stereo polarization images of 16 objects with diverse materials captured under OLAT illumination patterns alongside high-precision geometric ground truth. A simple yet effective display inverse rendering baseline is proposed, outperforming existing inverse rendering methods.
- A Real-world Display Inverse Rendering Dataset
  - This paper presents the first real-world inverse rendering dataset (DIR) built upon an LCD display–polarization camera system, comprising polarimetric stereo images of objects with diverse reflectance properties captured under OLAT illumination, calibrated display backlight/nonlinearity, and high-quality ground-truth geometry. A simple yet effective baseline method for display-based inverse rendering is also proposed.
- BATCLIP: Bimodal Online Test-Time Adaptation for CLIP
  - This paper proposes BATCLIP, a bimodal online test-time adaptation (TTA) method for CLIP that simultaneously adapts the LayerNorm parameters of both the visual and text encoders. By introducing a projection matching loss and an inter-class separability loss to enhance vision-text feature alignment and class discriminability, BATCLIP achieves state-of-the-art performance on CIFAR-10C, CIFAR-100C, and ImageNet-C.
- Combinative Matching for Geometric Shape Assembly
  - This paper proposes Combinative Matching (CMNet), which jointly models two fundamental properties of interlocking parts — surface shape consistency and volumetric occupancy complementarity — via an equivariant network trained with three objectives: orientation alignment, shape matching, and occupancy matching, substantially reducing local ambiguity in geometric assembly.
- Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography
  - This paper proposes DMDiff, a framework that leverages the natural image priors of pretrained diffusion models. Through a positive/neutral/negative tripath multi-prompt diffusion strategy and a Spatially-Varying Degradation-Aware (SVDA) attention module, DMDiff achieves high-fidelity tunable image reconstruction for millimeter-scale metalens cameras, surpassing existing methods across multiple metrics.
- Discontinuity-aware Normal Integration for Generic Central Camera Models
  - This paper proposes a novel normal integration method that supports explicit discontinuity modeling and generic central camera models. By establishing constraints between surface normals and ray directions under a local planarity assumption, the method achieves state-of-the-art performance on standard normal integration benchmarks and, for the first time, directly handles generic central cameras such as fisheye and panoramic cameras.
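The kind of constraint such a formulation rests on can be written generically (this is a standard perspective normal-integration relation under local planarity, not necessarily the paper's exact equations). For a central camera, the surface point at pixel \(u\) lies along its ray, and local planarity ties neighboring depths linearly:

```latex
p(u) = z(u)\,r(u), \qquad
n(u)^{\top}\bigl(z(u')\,r(u') - z(u)\,r(u)\bigr) = 0,
```

where \(r(u)\) is the ray direction, \(z(u)\) the depth along the ray, and \(n(u)\) the surface normal at pixel \(u\), with \(u'\) a neighboring pixel. Because only ray directions appear, the relation holds for fisheye and panoramic models alike, and a discontinuity-aware method can switch the constraint off across depth edges.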
- DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection
  - This paper proposes DisCoPatch, a framework that exploits the inherent bias of BatchNorm toward batch statistics in adversarial VAEs to distinguish ID from OOD samples. At inference time, multiple patches from the same image are composed into a batch to ensure distributional consistency. The method achieves state-of-the-art performance on covariate-shift OOD detection (ImageNet-1K(-C) 95.5% AUROC) and near-OOD detection (95.0% AUROC), with a model size of only 25 MB and latency an order of magnitude lower than competing methods.
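The inference-time batching trick is simple to picture. A minimal sketch (illustrative only; the actual DisCoPatch discriminator, patch sizes, and sampling scheme differ): cropping every element of the batch from one image means any BatchNorm layer downstream computes its batch statistics from that single image's distribution.

```python
import numpy as np

def patch_batch(image: np.ndarray, patch: int, n: int, seed: int = 0):
    """Crop n random patches from ONE image and stack them as a batch.

    Feeding this batch to a BatchNorm network makes the batch statistics
    reflect a single image's distribution (the consistency DisCoPatch relies
    on); this helper only illustrates the batching idea.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    ys = rng.integers(0, h - patch + 1, size=n)   # top-left corners
    xs = rng.integers(0, w - patch + 1, size=n)
    return np.stack([image[y:y + patch, x:x + patch] for y, x in zip(ys, xs)])

img = np.arange(64 * 64 * 3, dtype=np.float32).reshape(64, 64, 3)
batch = patch_batch(img, patch=16, n=8)           # shape (8, 16, 16, 3)
```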
- DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing
  - DISTA-Net proposes a dynamic deep unfolding network that replaces the static nonlinear transform and threshold parameters in ISTA-based sparse reconstruction with input-adaptive counterparts, constituting the first deep learning method for closely-spaced infrared small target (CSIST) unmixing. The work also establishes the first open-source ecosystem encompassing a dataset, evaluation metrics, and a toolkit.
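For context, classic ISTA alternates a gradient step with soft-thresholding; deep unfolding unrolls a fixed number of these iterations into network layers, and DISTA-Net's contribution is making the transform and threshold input-adaptive. A textbook ISTA sketch (not the paper's network):

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam=0.05, n_iter=500):
    """Classic ISTA for min_x 0.5*||Ax - y||^2 + lam*||x||_1.

    A deep-unfolded variant would turn each iteration into a layer and learn
    the step size / threshold; DISTA-Net further predicts them from the input.
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - y) / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))          # underdetermined measurement matrix
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.5, -2.0, 1.0]     # sparse ground truth
y = A @ x_true
x_hat = ista(A, y)
```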
- Few-Shot Pattern Detection via Template Matching and Regression
  - This paper proposes TMR, a method that combines classical template matching with support-conditioned bounding box regression to achieve few-shot detection of arbitrary patterns—including non-object-level patterns. The authors also introduce the RPINE dataset to cover a broader range of repetitive patterns. TMR surpasses existing FSCD methods on multiple benchmarks and demonstrates strong cross-dataset generalization.
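The classical signal TMR builds on is plain normalized cross-correlation. A small, deliberately slow reference implementation (illustrative; the paper's matcher operates on learned features and adds support-conditioned box regression on top):

```python
import numpy as np

def match_template(image, template, eps=1e-8):
    """Normalized cross-correlation score map over valid positions.

    Each window and the template are standardized, so the score is a
    correlation coefficient in [-1, 1]; the exact-match position scores ~1.
    """
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + eps)
    H, W = image.shape
    out = np.empty((H - th + 1, W - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            w = image[i:i + th, j:j + tw]
            wn = (w - w.mean()) / (w.std() + eps)
            out[i, j] = (wn * t).mean()
    return out

blob = np.array([[0.0, 1, 0], [1, 2, 1], [0, 1, 0]])  # non-constant pattern
img = np.zeros((12, 12))
img[4:7, 5:8] = blob
score = match_template(img, blob)
peak = np.unravel_index(np.argmax(score), score.shape)  # top-left of best match
```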
- ForCenNet: Foreground-Centric Network for Document Image Rectification
  - This paper proposes ForCenNet, a foreground-centric document rectification network featuring three key contributions: foreground label generation, a mask-guided Transformer decoder, and a curvature consistency loss. The method requires only undistorted images for training and achieves state-of-the-art performance on four benchmarks: DocUNet, DIR300, WarpDoc, and DocReal.
- Generative Zoo
  - A scalable pipeline is proposed for synthesizing animal 3D pose and shape training data using conditional image generation models (FLUX + ControlNet), producing the million-scale GenZoo dataset. Training exclusively on synthetic data achieves state-of-the-art performance on real-world benchmarks.
- HiERO: Understanding the Hierarchy of Human Behavior Enhances Reasoning on Egocentric Videos
  - This paper proposes HiERO, a weakly supervised hierarchical graph architecture that learns the hierarchy of functional activity cues by aligning video segments with narration text. The resulting segment features encode multi-scale behavioral dependencies. HiERO substantially outperforms fully supervised methods in zero-shot evaluation on procedure learning tasks (F1 +12.5% on EgoProceL) and achieves state-of-the-art performance on video–text alignment benchmarks.
- Imbalance in Balance: Online Concept Balancing in Generation Models
  - Through carefully designed causal experiments, this work reveals that data distribution—rather than model scale or data volume—is the decisive factor for concept composition ability in diffusion models. It further proposes IMBA Loss, an online concept-level balancing loss that adaptively reweights token-level losses via the discrepancy between conditional and unconditional distributions (the IMBA distance). With only a few lines of code modification, the method significantly improves multi-concept generation capability.
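The reweighting idea can be sketched abstractly. In the toy code below, both the discrepancy measure and the weighting formula are invented stand-ins for the paper's IMBA distance and loss (hypothetical, for intuition only): tokens where conditional and unconditional predictions diverge carry concept-specific signal, so their per-token loss is up-weighted.

```python
import numpy as np

def imba_style_weights(eps_cond, eps_uncond, alpha=1.0, eps=1e-8):
    """Per-token weights from the conditional/unconditional discrepancy.

    ILLUSTRATIVE ONLY: the norm-based distance and the 1 + alpha*d/mean(d)
    form are stand-ins, not the paper's IMBA distance or loss equations.
    """
    d = np.linalg.norm(eps_cond - eps_uncond, axis=-1)  # per-token discrepancy
    return 1.0 + alpha * d / (d.mean() + eps)

def weighted_mse(eps_pred, eps_target, w):
    """Token-weighted denoising loss: a few-line change to a standard MSE."""
    per_token = ((eps_pred - eps_target) ** 2).mean(axis=-1)
    return (w * per_token).mean()

# Token 0's conditional prediction diverges from the unconditional one,
# so it receives a larger weight than token 1.
eps_cond = np.array([[1.0, 0.0], [0.0, 0.0]])
eps_uncond = np.zeros((2, 2))
w = imba_style_weights(eps_cond, eps_uncond)
```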
- InterSyn: Interleaved Learning for Dynamic Motion Synthesis in the Wild
  - This paper proposes the InterSyn framework, which jointly models single-person and multi-person motions within a unified interleaved sequence via an Interleaved Learning strategy, combined with a Relative Coordination Refinement (REC) module, to generate more natural and coordinated human interaction motions. On the InterHuman test set, FID is reduced by 6.1% and R-Precision Top-1 is improved by 2.8% compared to FreeMotion.
- Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation
  - Lay2Story introduces the task of layout-togglable story generation, constructs the Lay2Story-1M dataset of over 1 million high-resolution images, and proposes a global–subject dual-branch framework built on the DiT architecture, achieving comprehensive improvements over existing methods in consistency, semantic relevance, and aesthetic quality.
- Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues
  - This paper proposes an end-to-end neural inverse rendering framework that jointly recovers geometry, spatially-varying reflectance, and lighting parameters from multi-view images captured under varying illumination, requiring neither light source calibration nor intermediate photometric stereo cues (e.g., normal maps). The method outperforms existing multi-stage MVPS approaches.
- ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction
  - This paper presents ODP-Bench, the first comprehensive benchmark for OOD performance prediction, covering 29 OOD datasets, 10 prediction algorithms, and 1,444 pretrained models. A key finding emerges: existing algorithms perform reasonably well on synthetic corruptions but consistently fail under natural distribution shifts.
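One of the simplest predictors in this space is the average-confidence baseline, which estimates accuracy on an unlabeled shifted test set from the mean max-softmax confidence (a standard baseline in the accuracy-prediction literature; whether ODP-Bench includes it under this exact name is not confirmed here):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [v / z for v in exps]

def predict_accuracy(all_logits):
    """Average-confidence baseline: mean max-softmax over the test set.

    Requires no labels on the shifted data — exactly the setting OOD
    performance prediction targets.
    """
    return sum(max(softmax(l)) for l in all_logits) / len(all_logits)

# One confident and one uncertain sample -> a mid-range accuracy estimate.
logits = [[3.0, 0.1, 0.1], [0.2, 0.1, 0.0]]
est = predict_accuracy(logits)
```

Miscalibration is precisely why such baselines degrade under natural shifts: the confidence stays high while the accuracy drops.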
- OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning
  - This paper introduces OmniDiff, a fine-grained image difference captioning dataset comprising 324 diverse scenes (real-world and 3D synthetic), and proposes a plug-and-play Multi-scale Differential Perception (MDP) module integrated into an MLLM to build the M3Diff model, achieving state-of-the-art performance on OmniDiff and multiple public benchmarks.
- On the Robustness Tradeoff in Fine-Tuning
  - The first systematic study of the adversarial robustness–accuracy tradeoff during fine-tuning, conducted across 231 models, 7 fine-tuning strategies, and 6 datasets. Key findings: (1) robustness first increases then decreases in the early stages of fine-tuning; (2) different PEFT strategies and task complexities yield distinct Pareto frontiers; (3) OOD robustness exhibits no analogous tradeoff and instead tracks accuracy changes closely.
- PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive Real-world Image Dehazing
  - This paper proposes PHATNet, a physics-guided haze transfer network that extends the Atmospheric Scattering Model (ASM) to latent space to disentangle and transfer haze patterns, generating domain-adaptive fine-tuning datasets that enable dehazing models to effectively adapt to unseen real-world haze scenes at test time.
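The pixel-space ASM that PHATNet lifts into latent space is compact enough to state directly (standard formulation; the parameter values below are arbitrary):

```python
import numpy as np

def apply_haze(J, depth, beta=1.0, A=0.9):
    """Classic atmospheric scattering model: I = J*t + A*(1 - t), t = exp(-beta*d).

    J     : clean image (H, W, 3)
    depth : scene depth (H, W); farther pixels get lower transmission t
    beta  : scattering coefficient (haze density)
    A     : global airlight
    Shown in pixel space purely to make the physics explicit; PHATNet applies
    the same disentangle-and-transfer idea to latent representations.
    """
    t = np.exp(-beta * depth)[..., None]     # transmission map
    return J * t + A * (1.0 - t)

J = np.full((4, 4, 3), 0.2)                  # uniform clean image
depth = np.linspace(0.0, 5.0, 16).reshape(4, 4)
I = apply_haze(J, depth)                     # near pixels ~J, far pixels ~A
```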
- Rethinking Few Shot CLIP Benchmarks: A Critical Analysis in the Inductive Setting
  - This paper identifies that existing CLIP few-shot classification benchmarks constitute a "partially transductive setting" due to CLIP's exposure to test datasets during pretraining. It proposes an unlearning-based inductive benchmark evaluation framework and introduces a few-shot classification method that achieves stable state-of-the-art performance under the new benchmark.
- SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting
  - This paper proposes SketchSplat, which represents 3D edges as parametric sketches (line segments + Bézier curves) and directly optimizes edge parameters via differentiable rendering by sampling Gaussian points from sketches. Combined with adaptive topology control and an improved 2D edge detector, the method achieves state-of-the-art accuracy, completeness, and compactness on CAD datasets.
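Sampling points from a parametric sketch is the step that makes splatting possible. A minimal cubic Bézier sampler (illustrative; SketchSplat additionally attaches Gaussians to such samples and backpropagates rendering losses into the control points):

```python
import numpy as np

def bezier_points(ctrl, n=32):
    """Sample n points along a cubic Bezier curve with 4 control points.

    ctrl : (4, 3) array of 3D control points.
    The curve interpolates ctrl[0] and ctrl[3]; the whole parametric edge is
    determined by just these 4 points, which is what makes the sketch
    representation compact and directly optimizable.
    """
    t = np.linspace(0.0, 1.0, n)[:, None]
    p0, p1, p2, p3 = ctrl
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

ctrl = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 0.0],
                 [2.0, 2.0, 0.0], [3.0, 0.0, 0.0]])
pts = bezier_points(ctrl)        # (32, 3) points ready to be splatted
```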
- Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating
  - A practical method is proposed for estimating camera spectral sensitivity using an uncalibrated diffraction grating film. By jointly estimating spectral sensitivity and grating efficiency, accurate closed-form solutions are obtained from a single capture of a light source with known spectrum. The method significantly outperforms traditional color chart approaches at an equipment cost under US$5.
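The reason a closed-form solution exists is that each measurement is linear in the unknown sensitivity. A generic least-squares toy (the mixing matrix here is random; in the paper it would be built from the known source spectrum and the jointly estimated grating efficiency):

```python
import numpy as np

# Each diffracted observation samples a weighted sum of the unknown per-band
# sensitivity s(lambda), so stacking observations gives a linear system
# c = M @ s solvable in closed form. All quantities below are synthetic.
rng = np.random.default_rng(0)
n_bands = 31                                   # e.g. 400-700 nm in 10 nm steps
s_true = np.exp(-0.5 * ((np.arange(n_bands) - 15) / 5.0) ** 2)  # smooth peak

M = rng.uniform(0.1, 1.0, size=(100, n_bands)) # known per-observation weights
c = M @ s_true                                 # noiseless camera measurements
s_hat, *_ = np.linalg.lstsq(M, c, rcond=None)  # closed-form recovery
```

With noiseless, overdetermined data the recovery is exact up to machine precision; real captures add noise, which is where joint estimation of the grating efficiency matters.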
- StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
  - StreamMind proposes an "event-gated LLM invocation" paradigm to replace the existing "per-frame LLM invocation" approach. By inserting a Cognition Gate network between the video encoder and the LLM, the model invokes the LLM only when query-relevant events occur. Combined with an Event-Preserving Feature Extractor (EPFE) based on a state-space model that ensures constant per-frame perception cost, the system achieves 100 fps streaming video processing on a single A100 GPU.
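The control flow of event-gated invocation is the easy part to sketch (the gate below is a trivial threshold stand-in for StreamMind's learned Cognition Gate; all names are illustrative): per-frame perception runs on every frame at constant cost, while the expensive LLM call fires only when the gate detects a query-relevant event.

```python
def run_stream(frames, gate, call_llm):
    """Event-gated loop: cheap perception every frame, LLM only on events."""
    responses = []
    for t, feat in enumerate(frames):   # constant-cost per-frame perception
        if gate(feat):                  # gate decides whether to invoke
            responses.append((t, call_llm(feat)))
    return responses

# Toy stand-ins: frames are scalar "event scores"; the gate is a threshold.
frames = [0.1, 0.2, 0.9, 0.3, 0.95]
out = run_stream(frames,
                 gate=lambda f: f > 0.8,
                 call_llm=lambda f: f"reply@{f}")
# The LLM is invoked at frames 2 and 4 only; the other frames cost nothing
# beyond perception, which is what decouples throughput from LLM latency.
```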
- Supercharging Floorplan Localization with Semantic Rays
  - A semantics-aware floorplan localization framework is proposed that fuses predicted semantic rays with depth rays into a structural-semantic probability volume. Combined with a coarse-to-fine refinement strategy, the method achieves 2–3× performance improvements on two standard benchmarks.
- SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition
  - SVTRv2 is proposed with three key designs — Multi-Size Resize (MSR), Feature Rearrangement Module (FRM), and Semantic Guidance Module (SGM) — enabling a CTC-based model to comprehensively outperform encoder-decoder methods across multi-scene benchmarks for the first time, while retaining inference speed advantages.
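For readers unfamiliar with the CTC side of the comparison, greedy CTC decoding is just collapse-then-drop-blanks (a textbook procedure, not an SVTRv2 contribution); its single non-autoregressive pass is the source of the speed advantage the paper preserves:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Greedy CTC decoding: collapse consecutive repeats, then drop blanks.

    frame_labels : per-frame argmax label indices from a CTC head.
    Runs in one pass with no autoregressive loop, unlike an
    encoder-decoder recognizer that emits characters one step at a time.
    """
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Blank (0) separates the two 3s, so the repeated character survives.
seq = [0, 3, 3, 0, 3, 5, 5, 0]
decoded = ctc_greedy_decode(seq)
```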