📊 LLM Evaluation

📷 CVPR2026 · 28 paper notes

ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation: This paper theoretically proves that fine-tuning weight deltas encode input covariance information, and proposes ACE-Merging, which achieves data-free closed-form model merging through three steps: adaptive covariance estimation, collective structural prior, and spectral refinement. ACE-Merging achieves an average improvement of 4% over prior methods on GPT-2 and 5% on RoBERTa-Base.
AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks: This paper proposes AdaBet, a gradient-free layer selection method grounded in algebraic topology, which uses the first Betti number \(b_1\) to quantify the topological complexity of each layer's activation space via a single forward pass—requiring no labels, gradients, or backpropagation. By fine-tuning only 10% of layers on ResNet50/VGG16/MobileNetV2/ViT-B16, AdaBet surpasses full fine-tuning in accuracy while reducing peak memory by approximately 40%.
Anchoring and Rescaling Attention for Semantically Coherent Inbetweening: This paper proposes KAB (Keyframe-Anchored Attention Bias) and ReTRo (Rescaled Temporal RoPE), two training-free inference-time methods built upon the Wan2.1 video diffusion model. These methods address semantic infidelity, frame inconsistency, and temporal rhythm instability in generative inbetweening (GI) with sparse keyframes under large-motion conditions. The paper also introduces TGI-Bench, the first text-conditioned GI evaluation benchmark.
Cross-Scale Pansharpening via ScaleFormer and the PanScale Benchmark: This paper proposes PanScale, the first cross-scale pansharpening dataset, along with the PanScale-Bench evaluation benchmark, and the ScaleFormer framework — which reinterprets resolution variation as sequence length variation, achieving cross-scale generalization via Scale-Aware Patchify bucketed sampling, decoupled spatial-sequence modeling, and RoPE.
CryoHype: Reconstructing a Thousand Cryo-EM Structures with Transformer-Based Hypernetworks: This paper proposes CryoHype, a Transformer-based hypernetwork approach for cryo-EM reconstruction that dynamically modulates the weights of implicit neural representations (INRs) to reduce parameter sharing, achieving for the first time the simultaneous reconstruction of 1,000 distinct protein structures from unlabeled cryo-EM images.
Enhancing Out-of-Distribution Detection with Extended Logit Normalization: This paper identifies two forms of feature collapse induced by LogitNorm during training—dimensional collapse and origin collapse—and proposes a hyperparameter-free Extended Logit Normalization (ELogitNorm) that replaces the distance-to-origin scaling factor with the distance from features to the decision boundary. ELogitNorm significantly improves both post-hoc OOD detection performance and confidence calibration without sacrificing classification accuracy.
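For context, the LogitNorm baseline that ELogitNorm modifies can be sketched in a few lines. This is a minimal sketch of the original distance-to-origin normalization only, not of ELogitNorm's decision-boundary scaling, whose exact form is not given in this summary. The temperature `tau=0.04` and the helper name `logitnorm_loss` are illustrative assumptions.

```python
import math

def logitnorm_loss(logits, label, tau=0.04, eps=1e-7):
    """Cross-entropy on logits divided by (tau * L2 norm of the logits).

    Sketch of the LogitNorm baseline; tau and eps are illustrative values.
    """
    norm = math.sqrt(sum(z * z for z in logits)) + eps
    scaled = [z / (norm * tau) for z in logits]
    # Numerically stable log-sum-exp.
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_sum_exp - scaled[label]

# Multiplying every logit by 10 leaves the loss (almost) unchanged,
# because only the direction of the logit vector matters after normalization.
loss_small = logitnorm_loss([2.0, 1.0, -1.0], label=0)
loss_large = logitnorm_loss([20.0, 10.0, -10.0], label=0)
loss_wrong = logitnorm_loss([2.0, 1.0, -1.0], label=2)
```

Because the loss depends only on the direction of the logit vector, the network cannot lower it by inflating logit magnitude, which is the property the distance-to-origin scaling factor provides.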
Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning: This paper proposes a Factored Flow Prediction module that predicts optical flow from the geometric latent of a source view and the pose latent of a target view, enabling unlabeled videos to serve as supervisory signals for 3D geometry learning. The method achieves state-of-the-art performance across 8 benchmarks covering both static and dynamic scenes.
Free-Grained Hierarchical Visual Recognition: This paper proposes free-grained hierarchical recognition, a setting in which training labels may appear at any level of a taxonomy. Two complementary methods are introduced to compensate for missing supervision — text-guided pseudo-attributes (Text-Attr) and taxonomy-guided semi-supervised learning (Taxon-SSL) — while at inference time the model adaptively selects its prediction depth.
HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT: HeSS proposes a Head Sensitivity Score to quantify the sensitivity of each attention head in VGGT's global attention layers to sparsification, and redistributes the attention budget from insensitive heads to sensitive ones accordingly. This approach significantly outperforms the uniform sparsification method SparseVGGT at high sparsity ratios with virtually no additional runtime overhead.
Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces: This paper proposes Hier-COS, a framework that assigns orthogonal basis vectors to each node in a label hierarchy tree to construct a Hierarchy-Aware Vector Space (HAVS) with theoretical guarantees. It is the first to unify "hierarchy-aware fine-grained classification" and "hierarchical multi-level classification" within a single framework. The paper also introduces a new evaluation metric, HOPS, and reports state-of-the-art performance across four datasets.
Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces: This paper proposes Hier-COS, a framework that assigns orthogonal basis vectors to each node in a label hierarchy tree and constructs a Hierarchy-Aware Vector Space (HAVS) via subspace composition (ancestor bases + self basis + descendant bases). The approach provides theoretical guarantees that the distance structure of the feature space is consistent with the hierarchy tree, while also introducing the HOPS evaluation metric to address the permutation-invariance deficiency of existing hierarchical evaluation metrics.
HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning: This paper identifies a "Domain Gravity" bias in heterogeneous-domain continual learning—whereby data-rich or low-entropy domains exert disproportionate influence in a shared embedding space—and proposes HyCal, a training-free method that calibrates prototypes by fusing cosine similarity and Mahalanobis distance, achieving robust classification in cross-discipline imbalanced few-shot class-incremental learning.
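One way to picture fusing cosine similarity with Mahalanobis distance for prototype classification is the sketch below. The weighted-sum fusion rule, the weight `alpha`, the diagonal covariance, and all function names are assumptions made for illustration; the summary does not specify how HyCal actually combines the two terms.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (an assumption)."""
    return math.sqrt(sum((a - m) ** 2 / s for a, m, s in zip(x, mean, var)))

def fused_score(x, proto, alpha=0.5):
    """Hypothetical fusion: reward angular agreement, penalize scaled distance."""
    sim = cosine(x, proto["mean"])
    dist = mahalanobis_diag(x, proto["mean"], proto["var"])
    return alpha * sim - (1.0 - alpha) * dist

def classify(x, prototypes, alpha=0.5):
    """Return the name of the prototype with the highest fused score."""
    return max(prototypes, key=lambda name: fused_score(x, prototypes[name], alpha))

prototypes = {
    "a": {"mean": [1.0, 0.0], "var": [1.0, 1.0]},
    "b": {"mean": [0.0, 1.0], "var": [1.0, 1.0]},
}
predicted = classify([0.9, 0.1], prototypes)
```

No training is involved: the prototypes are just per-class statistics, so the calibration runs entirely at inference time.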
Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery: This paper proposes AL-GCD, a framework that simulates human analogical reasoning by designing an Analogical Text Concept Generator (ATCG)—which analogically generates textual concepts for unlabeled samples by drawing on a visual-textual knowledge base built from labeled categories—thereby casting category discovery as a joint visual-textual reasoning task. AL-GCD achieves an average improvement of 5.0% across six benchmarks, with 7.1% gains on fine-grained datasets.
Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models: This paper introduces StEvo-Bench, a benchmark comprising 225 tasks that evaluates whether video world models can correctly continue evolving scene states during unobserved intervals—induced by inserting occlusions or redirecting the camera during video generation. Experiments reveal that state-of-the-art models (including Veo 3 and Sora 2 Pro) achieve success rates below 10%, exposing a fundamental tendency of current video models to couple state evolution tightly with pixel-level observation.
Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models: This paper proposes StEvo-Bench, a benchmark comprising 225 tasks across 6 evolution categories, which systematically evaluates whether 9 video world models can decouple state evolution from observation via occlusion or camera-away controls. All models achieve a success rate below 10% under observation interruption, and 5 specialized verifiers are employed to precisely localize failure modes.
Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline: This paper formally separates Video Fluency Assessment (VFA) from conventional Video Quality Assessment (VQA) for the first time, introduces FluVid — the first fluency-oriented benchmark dataset (4,606 videos) — and proposes a baseline model FluNet that leverages Temporal Permuted Self-Attention (T-PSA) for efficient inter-frame interaction, achieving SRCC/PLCC of 0.816/0.821.
PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion: This paper proposes PRISM, a holistic video dataset condensation method that begins from only two temporal anchors (the first and last frames) and adaptively inserts keyframes by detecting gradient direction conflicts. It achieves state-of-the-art storage efficiency while preserving content–motion coupling integrity: on miniUCF at 1VPC it reaches 17.9% accuracy with 20 MB, roughly a 5× storage reduction relative to the 94 MB used by prior methods.
R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII: This paper introduces R2G, the first standardized multi-view circuit graph benchmark suite, providing five stage-aware graph representations with information equivalence across 30 IP cores. A systematic study reveals that graph representation choice has a greater impact on performance than GNN model choice.
ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation: ReflexSplit proposes an explicit layer fusion-separation framework that addresses the transmission-reflection confusion problem in single image reflection separation (SIRS). It combines three components: Cross-scale Gated Fusion (CrGF) for adaptive multi-scale feature aggregation; a differential dual-dimensional attention mechanism \(\mathbf{A}^t - \lambda_\ell \mathbf{A}^r\) within the Layer Fusion-Separation Block (LFSB) for cross-stream interference suppression; and a curriculum training strategy with depth-dependent initialization and epoch-wise warmup that progressively strengthens separation intensity. The method achieves state-of-the-art performance on both synthetic and real-world benchmarks.
Reframing Long-Tailed Learning via Loss Landscape Geometry: This paper reframes the head-tail seesaw dilemma in long-tailed learning through the lens of loss landscape geometry. It identifies that tail class degradation stems from optimization converging to sharp minima that are far from tail-class optima. A dual-module framework comprising GKP (Grouped Knowledge Preservation) and GSA (Grouped Sharpness Aware) is proposed based on continual learning principles, achieving state-of-the-art results on four benchmarks (CIFAR-LT / ImageNet-LT / iNat2018) without requiring additional data.
SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval: This paper proposes SATTC, a label-free test-time calibration head that operates directly on the similarity matrix over frozen EEG and image encoders. It combines a geometric expert (subject-adaptive whitening + adaptive CSLS) and a structural expert (mutual nearest neighbors + bidirectional top-k ranking + category popularity) via a product-of-experts fusion, significantly improving Top-1 accuracy and reducing the hubness effect in cross-subject EEG-to-image retrieval.
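The CSLS component mentioned above can be illustrated with the standard (non-adaptive) Cross-domain Similarity Local Scaling rule applied directly to a similarity matrix. This is a minimal sketch of plain CSLS only; the adaptive variant, the whitening step, and the product-of-experts fusion used by SATTC are not reproduced here, and the function name and toy matrix are illustrative.

```python
def csls(sim, k=1):
    """Standard CSLS: 2*sim[i][j] - r_row[i] - r_col[j].

    r_row[i] is the mean of the k largest similarities in row i (query side),
    r_col[j] is the mean of the k largest similarities in column j (gallery side).
    """
    n_rows, n_cols = len(sim), len(sim[0])

    def mean_top_k(values):
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top)

    r_row = [mean_top_k(row) for row in sim]
    r_col = [mean_top_k([sim[i][j] for i in range(n_rows)]) for j in range(n_cols)]
    return [[2 * sim[i][j] - r_row[i] - r_col[j] for j in range(n_cols)]
            for i in range(n_rows)]

def argmax(row):
    return max(range(len(row)), key=lambda j: row[j])

# Toy matrix: gallery item 0 is a "hub" that is the raw nearest neighbor of both queries.
raw = [[0.9, 0.5],
       [0.8, 0.75]]
rescaled = csls(raw, k=1)
raw_top1 = [argmax(row) for row in raw]
csls_top1 = [argmax(row) for row in rescaled]
```

In the toy matrix, the second query switches from the hub item to its own near match after rescaling, which is the hubness reduction that CSLS targets.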
Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score: This paper proposes SemiCP, a framework that incorporates unlabeled data into the conformal prediction calibration pipeline via a Nearest Neighbor Matching (NNM) score. Under extremely limited labeled data, SemiCP reduces the average coverage gap by up to 77% while simultaneously shrinking prediction set sizes.
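The calibration step that SemiCP extends can be sketched with standard split conformal prediction. This is a minimal sketch using labeled calibration scores only; it does not implement the NNM score or the use of unlabeled data. The function names and the toy scores are illustrative assumptions.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    cal_scores are nonconformity scores of the true labels on held-out data.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: include every label.
        return math.inf
    return sorted(cal_scores)[rank - 1]

def prediction_set(class_scores, threshold):
    """Keep every candidate label whose nonconformity score is within the threshold."""
    return [label for label, score in class_scores.items() if score <= threshold]

# Ten labeled calibration scores, target miscoverage alpha = 0.2.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
threshold = conformal_threshold(cal_scores, alpha=0.2)
labels = prediction_set({"cat": 0.3, "dog": 0.85, "fox": 0.95}, threshold)
```

With very few labeled calibration points the threshold becomes coarse, or infinite as in the `rank > n` branch, which is the limited-label regime the summary describes.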
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras: This paper proposes SparseCam4D, the first method to achieve sparse-camera (2–3 views) 4D reconstruction on standard multi-camera dynamic scene benchmarks. The core innovation is the Spatio-Temporal Distortion Field (STDF), which explicitly models spatio-temporal inconsistencies in generative observations and decouples them from the underlying 4D Gaussian representation, enabling high-fidelity, spatio-temporally consistent rendering of dynamic scenes.
TacSIm: A Dataset and Benchmark for Football Tactical Style Imitation: This paper presents TacSIm, the first large-scale dataset and benchmark that reconstructs full-team trajectories from real Premier League broadcast footage and performs tactical style imitation in a virtual football environment, quantifying imitation fidelity via two metrics: spatial occupancy similarity and motion vector similarity.
Temporal Imbalance of Positive and Negative Supervision in Class-Incremental Learning: This paper identifies temporal imbalance as a previously overlooked source of bias in class-incremental learning (CIL) and proposes the Temporal-Adjusted Loss (TAL), which dynamically downweights negative supervision for old classes via a temporally decaying memory kernel. TAL integrates in a plug-and-play manner and significantly alleviates catastrophic forgetting.
Unified Primitive Proxies for Structured Shape Completion: This paper proposes UniCo, which learns unified primitive representations over shared shape features via primitive proxies, jointly predicting complete point clouds and assembly-ready quadric primitives (with geometry, semantics, and membership) in a single forward pass. UniCo reduces Chamfer distance by up to 50% and improves normal consistency by up to 7% on synthetic and real-world point cloud benchmarks.
VGA-Bench: A Unified Benchmark for Video Aesthetics and Generation Quality Evaluation: VGA-Bench proposes a unified AIGC video evaluation benchmark comprising a three-tier taxonomy (aesthetic quality, aesthetic labels, and generation quality), 1,016 prompts, 60,000 videos, and three dedicated evaluation models, enabling automated assessment aligned with human judgment.
Weakly Supervised Video Anomaly Detection with Anomaly-Connected Components and Intention Reasoning: This paper proposes the LAS-VAD framework, which introduces an Anomaly-Connected Components (ACC) mechanism to partition video frames into semantically consistent groups for pseudo-label generation to compensate for the absence of frame-level annotations, and an Intention-Aware Mechanism (IAM) that leverages position-velocity-acceleration features to distinguish normal from anomalous behaviors with similar appearances but different intentions. The method achieves 89.96% AP (I3D) on XD-Violence.