📊 LLM Evaluation

🧠 NeurIPS2025 · 76 paper notes

A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning: This paper proposes a theoretical framework based on K-L divergence and high-dimensional statistical analysis to determine the optimal number of samples to transfer from each source task in multi-source transfer learning. The framework avoids the negative transfer caused by naively using all source data, and the resulting algorithm OTQMS surpasses the state of the art by 1.0–1.5% on DomainNet and Office-Home while reducing sample usage by 47.85% and training time by 35.19%.
A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification: This paper presents ESCAPE—the first standardized multilabel antimicrobial peptide classification benchmark, integrating 80,000+ peptides from 27 public databases, along with a dual-branch Transformer + bidirectional cross-attention baseline model that achieves a 2.56% relative improvement in mAP over the second-best method.
A Unified Framework for Provably Efficient Algorithms to Estimate Shapley Values: This paper proposes a unified framework that subsumes KernelSHAP, LeverageSHAP, and related Shapley value estimators under a randomized sketching perspective, provides the first non-asymptotic theoretical guarantees for KernelSHAP, and extends these methods to high-dimensional datasets such as CIFAR-10 via algorithmic improvements including Poisson approximation.
AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners: This work identifies that random data sampling in STaR (Self-Taught Reasoner) leads to severely imbalanced training frequencies across observations (easy problems are over-trained while hard problems are under-trained) and proposes AdaSTaR, which combines adaptive diversity sampling (prioritizing under-trained samples) with adaptive curriculum sampling (adjusting difficulty based on model strength) to achieve the highest accuracy on all 6 benchmarks while reducing training FLOPs by 58.6%.
Aggregation Hides OOD Generalization Failures from Spurious Correlations: This paper reveals the "aggregation masking" phenomenon in OOD generalization benchmarks: while aggregate evaluation exhibits accuracy-on-the-line (AoTL)—a positive correlation between ID and OOD accuracy—the proposed OODSelect method can identify large, semantically coherent subsets (up to 75%) from the same OOD data on which higher ID accuracy corresponds to lower OOD accuracy (Pearson R as low as −0.92), demonstrating that the harm of spurious correlations is systematically concealed by aggregate evaluation.
Asymmetric Duos: Sidekicks Improve Uncertainty: Asymmetric Duos (AD) pairs a large model with a small "sidekick", combining their predictions via temperature-weighted logit averaging, to achieve uncertainty estimation quality close to that of a 5-member deep ensemble at only 10–20% additional FLOPs. RN50 AD (5% FLOPs overhead) approaches an \(m=5\) deep ensemble (400% FLOPs overhead) on AUROC/AURC/SAC@98.
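The combination step in the Asymmetric Duos entry can be sketched as follows. This is a minimal illustration, assuming each model's logits are scaled by the inverse of a per-model temperature and summed before a softmax; the function names, the example temperatures, and the exact combination rule are assumptions for illustration, not the paper's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def duo_predict(large_logits, small_logits, t_large=1.0, t_small=2.0):
    """Combine two models' logits with per-model temperatures (assumed form).

    A larger temperature down-weights that model's contribution, so the
    sidekick (t_small) influences the result less than the large model.
    """
    combined = [
        l / t_large + s / t_small
        for l, s in zip(large_logits, small_logits)
    ]
    return softmax(combined)

# Hypothetical 3-class logits from the large model and the sidekick.
probs = duo_predict([2.0, 0.5, -1.0], [1.0, 1.2, -0.5])
```

In practice the temperatures would be fitted on held-out data; here they are fixed constants chosen only to make the example concrete.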
Bayesian Evaluation of Large Language Model Behavior: This paper proposes a Beta-Binomial Bayesian framework for evaluating LLM behavior. By modeling the posterior distribution of \(\theta_m\) over stochastic generations for each prompt, the framework quantifies statistical uncertainty in evaluation metrics and introduces sequential sampling strategies such as Thompson sampling to achieve narrower credible intervals with fewer API calls.
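The Beta-Binomial update behind the Bayesian evaluation entry can be sketched for a single prompt. This is a minimal example assuming a Beta(1, 1) prior and binary pass/fail outcomes per generation; the function names, the prior, and the Monte Carlo interval are illustrative assumptions rather than the paper's procedure.

```python
import random

def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Return posterior Beta parameters plus the posterior mean and variance."""
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, var

def credible_interval(a, b, level=0.95, draws=10_000, seed=0):
    """Approximate an equal-tailed credible interval by sampling the posterior."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lo = samples[int(tail * draws)]
    hi = samples[int((1.0 - tail) * draws) - 1]
    return lo, hi

# Hypothetical data: 7 of 10 generations for one prompt show the behavior.
a, b, mean, var = beta_posterior(successes=7, trials=10)
lo, hi = credible_interval(a, b)
```

More generations for the same prompt shrink the posterior variance, which is what a sequential sampling strategy exploits when deciding where to spend the next API call.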
Belief-Calibrated Multi-Agent Consensus Seeking for Complex NLP Tasks: This paper proposes the Belief-Calibrated Consensus Seeking (BCCS) framework, which incorporates three modules—belief-calibrated consensus judgment, conflict-aware collaborator assignment, and leader selection—to enable multi-agent systems to reach more stable consensus on complex NLP tasks, yielding improvements of 2.23% and 3.95% on difficult subsets of MATH and MMLU, respectively.
Benchmarking is Broken — Don't Let AI be its Own Judge: This paper systematically critiques the fundamental flaws of current AI benchmark evaluation—data contamination (45%+ overlap in MMLU), selective reporting, and lack of proctoring—and proposes PeerBench: drawing on the proctoring paradigm of high-stakes exams (e.g., SAT/GRE), it constructs a next-generation AI evaluation infrastructure via a rolling confidential question bank, peer-review quality control, reputation-weighted scoring, and cryptographic commitment mechanisms.
Benchmarking Large Language Models for Zero-Shot and Few-Shot Phishing URL Detection: This paper systematically evaluates three commercial LLMs — GPT-4o, Claude-3.7, and Grok-3-Beta — on phishing URL detection under a unified zero-shot and few-shot prompt framework. Results show that few-shot prompting consistently improves performance across all models, with Grok-3-Beta achieving the best F1 (0.9399) on the balanced dataset, while different models exhibit distinct precision–recall trade-off behaviors.
Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation: This paper formalizes LLM benchmark evaluation as a hierarchical statistical model, theoretically demonstrates that multiple stochastic generations (\(k>1\)) reduce the variance of benchmark score estimates, and introduces a prompt-level difficulty metric \(\mathbb{P}(\text{correct})\) along with data maps for benchmark quality control.
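The variance claim in the entry on multiple generations follows from the law of total variance, and a small sketch makes it concrete. It assumes prompts are drawn i.i.d., each prompt has a fixed probability of a correct answer, and the benchmark score is the mean over prompts of the per-prompt mean over \(k\) generations; the function name and the example probabilities are hypothetical.

```python
def score_variance(p, k):
    """Variance of the benchmark score estimate under a two-level model.

    p: per-prompt probabilities of a correct generation, treated as the
       population that the n = len(p) prompts are drawn from.
    k: number of stochastic generations per prompt.
    """
    n = len(p)
    mu = sum(p) / n
    between = sum((x - mu) ** 2 for x in p) / n   # variance across prompts
    within = sum(x * (1 - x) for x in p) / n      # mean Bernoulli variance
    return (between + within / k) / n

p = [0.2, 0.5, 0.9]          # hypothetical prompt difficulties
var_k1 = score_variance(p, 1)
var_k10 = score_variance(p, 10)
```

Only the within-prompt term shrinks with \(k\); the between-prompt term is a floor that can be lowered only by adding prompts.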
Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations: This paper proposes LAGER, a framework that aggregates score token logits from intermediate to final layers of an LLM and computes an expected score to derive the final judgment. Without any model fine-tuning, LAGER improves human alignment by up to 7.5% and matches or surpasses reasoning-based methods without requiring chain-of-thought inference.
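The expected-score idea in the LAGER entry can be illustrated with a short sketch. It assumes the logits of the candidate score tokens are available per layer, aggregates them with layer weights (uniform by default), and takes the probability-weighted mean of the scores; the function names, the weighting, and the choice to aggregate before the softmax are assumptions, not the paper's exact formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(layer_logits, scores, layer_weights=None):
    """Aggregate per-layer score-token logits and return the expected score.

    layer_logits: one list per layer, each holding a logit per candidate score.
    scores: the numeric value of each candidate score token.
    layer_weights: optional per-layer weights (uniform if omitted).
    """
    n_layers = len(layer_logits)
    if layer_weights is None:
        layer_weights = [1.0 / n_layers] * n_layers
    aggregated = [
        sum(w * layer[i] for w, layer in zip(layer_weights, layer_logits))
        for i in range(len(scores))
    ]
    probs = softmax(aggregated)
    return sum(s * p for s, p in zip(scores, probs))

# Hypothetical logits from two layers for the score tokens 1, 2, 3.
judgment = expected_score([[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]], [1, 2, 3])
```

Unlike taking the arg-max token, the expected score is continuous, so it can separate responses that would otherwise receive the same integer rating.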
Bispectral OT: Dataset Comparison using Symmetry-Aware Optimal Transport: This paper proposes Bispectral Optimal Transport (BOT), which replaces the raw pixel distances in the cost matrix of discrete optimal transport with bispectrum (group Fourier invariant) distances, enabling transport plans to eliminate group-action-induced variation (e.g., rotation) while preserving signal structure. On rotation-augmented MNIST and related datasets, the class-preservation accuracy improves from 33% to 84%.
BLINK-Twice: You See But Do You Observe? A Reasoning Benchmark on Visual Perception: This paper introduces BLINK-Twice, a vision-centric reasoning benchmark comprising 345 visually challenging images, 103 adversarial samples, 896 VQA pairs, and 1,725 annotated reasoning steps. Through seven categories of visual illusion scenarios, it evaluates the "you see but do not observe" reasoning capability of MLLMs. The strongest model, Gemini-2.5 Pro, achieves only 26.9% G-Acc, suggesting that multi-round image observation and active visual interaction are promising directions for improvement.
Can Large Language Models Master Complex Card Games?: This paper systematically evaluates the ability of LLMs to learn eight complex card games. It finds that through SFT on high-quality game trajectory data, LLMs can approach the performance of strong game AIs and simultaneously master multiple games, though general capabilities degrade — a decline that can be mitigated by mixing in general instruction data.
CLIMB: Class-Imbalanced Learning Benchmark on Tabular Data: This paper presents CLIMB — the most comprehensive benchmark to date for class-imbalanced learning on tabular data — encompassing 73 real-world datasets and 29 CIL algorithms. Large-scale experiments reveal several practical insights: naive rebalancing is often ineffective, ensemble methods are critical, and data quality impacts performance more than the degree of imbalance itself.
CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance: This paper proposes CodeAssistBench (CAB), the first fully automated benchmark for evaluating multi-turn, repository-level programming assistance. CAB automatically constructs 3,286 real-world programming help scenarios from GitHub Issues, spanning 7 languages and 214 repositories, and reveals a substantial performance gap: state-of-the-art models achieve 70–83% on StackOverflow-style questions but only 7–16% on post-cutoff repositories.
ComPO: Preference Alignment via Comparison Oracles: To address likelihood displacement and verbosity caused by noisy preference pairs (where preferred and dispreferred responses are highly similar) in DPO, this paper proposes ComPO, a zeroth-order preference alignment method based on comparison oracles. The approach partitions data into clean and noisy subsets, applying DPO to the clean subset and ComPO to extract alignment signals from the noisy subset, achieving consistent improvements in LC win rate on benchmarks such as AlpacaEval 2.
Conformal Online Learning of Deep Koopman Linear Embeddings: This paper proposes the COLoKe framework, which reinterprets conformal prediction as a model consistency diagnostic tool. Parameter updates are triggered only when the Koopman model's prediction error exceeds a dynamically calibrated threshold, enabling efficient online Koopman linear embedding learning for nonlinear dynamical systems.
Cost-Sensitive Freeze-thaw Bayesian Optimization for Efficient Hyperparameter Tuning: CFBO incorporates user-defined utility functions (cost–performance trade-offs) into the freeze-thaw Bayesian optimization framework, and combines an adaptive stopping criterion with LC mixup-based transfer learning to achieve optimal cost–performance trade-offs on multi-fidelity HPO benchmarks.
Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models: This work constructs the Braingle Brainteaser benchmark (242 math + 236 logic puzzles) and systematically evaluates LLM reasoning strategies on brainteasers. The findings reveal that models occasionally produce creative, insight-driven solutions, but frequently fall back on brute-force enumeration even when elegant solutions exist; self-correction ability is limited; and translating narrative formats into mathematical formats yields modest performance gains.
Decoupled Entropy Minimization: This paper decouples classical entropy minimization (EM) into two opposing components — the Cluster Aggregation Driving Factor (CADF, which rewards dominant classes) and the Gradient Mitigation Calibrator (GMC, which penalizes high-confidence classes) — revealing two inherent flaws of classical EM (reward collapse and easy-class bias). The proposed AdaDEM addresses these issues via normalized rewards and marginal entropy calibration, achieving significant improvements across semi-supervised learning, domain adaptation, reinforcement learning, and other tasks.
Document Summarization with Conformal Importance Guarantees: This work presents the first application of Conformal Prediction to document summarization. By calibrating a threshold on sentence importance scores, it provides rigorous statistical guarantees on user-controllable coverage (\(1-\alpha\)) and recall (\(\beta\)) for extractive summaries. The method is model-agnostic and requires only a small calibration set.
Efficient Semantic Uncertainty Quantification in Language Models via Diversity-Steered Sampling: This paper proposes a diversity-steered sampling framework that injects NLI-based semantic similarity penalties during decoding to encourage semantically diverse generation, and corrects distributional bias via importance weighting with control variates to reduce variance. The method accurately estimates semantic entropy (aleatoric uncertainty) and mutual information (epistemic uncertainty) of LLMs using as few as 16 samples.
EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving: This paper proposes EvaLearn, a benchmark that evaluates the learning capability and learning efficiency of LLMs through a sequential problem-solving paradigm, revealing that models with stronger static performance do not necessarily possess greater learning potential.
Exploiting Task Relationships in Continual Learning via Transferability-Aware Task Embeddings: This paper proposes H-embedding, a transferability-aware task embedding based on H-score, and integrates it into a hypernetwork framework. By explicitly modeling inter-task relationships in the embedding space to guide parameter generation, the method achieves state-of-the-art final accuracy in a rehearsal-free setting.
Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training: Through controlled experiments, this paper reveals the fundamental mechanism by which larger vocabularies improve language model performance: expanding the vocabulary reduces the Kolmogorov complexity of tokenized text, exploiting vocabulary frequency imbalance to substantially lower the loss on high-frequency tokens, thereby driving down global cross-entropy and improving downstream task performance.
Generalization Error Analysis for Selective State-Space Models Through the Lens of Attention: This work unrolls selective SSMs (Mamba) into an attention-equivalent form and derives generalization bounds via covering number techniques, controlled by the spectral abscissa \(s_{\mathbf{A}}\) of the continuous-time state matrix. When \(s_{\mathbf{A}} < 0\), the bound is independent of sequence length; when \(s_{\mathbf{A}} \geq 0\), it grows exponentially. The paper further proves this dependence is irreducible.
Heterogeneous Adversarial Play in Interactive Environments: This paper proposes HAP (Heterogeneous Adversarial Play), which formalizes teacher-student interaction as a minimax game: a teacher network automatically generates challenge tasks targeting student weaknesses, while the student policy continuously adapts and evolves, forming an adaptive curriculum without manual design. HAP outperforms state-of-the-art baselines in multi-task RL environments, and the generated curriculum proves effective for human learners as well.
HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild: This paper introduces HouseLayout3D—the first real-world 3D layout estimation benchmark targeting large-scale multi-floor buildings—and MultiFloor3D, a training-free baseline that combines modern 3D reconstruction and segmentation models to surpass existing deep learning methods on multi-floor building layout estimation.
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization: This paper proposes HybridNorm, a hybrid normalization strategy that applies QKV normalization within the attention module to decouple gradients and Post-Norm within the FFN to enhance regularization. Across scales from 550M to 7B parameters, HybridNorm simultaneously achieves the training stability of Pre-Norm and the generalization performance of Post-Norm, yielding an average downstream task improvement of 2.45% at the 7B scale.
Hyperbolic Fine-Tuning for Large Language Models: This work identifies that LLM token embeddings follow power-law distributions and exhibit tree-like hyperbolic structure, and proposes HypLoRA — performing low-rank adaptation directly on the Lorentz hyperbolic manifold (bypassing the cancellation effect of tangent space mappings) — achieving significant gains over standard LoRA on arithmetic and commonsense reasoning tasks (e.g., M.AVG +7.5% on Qwen2.5-7B).
Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion: This paper proposes the HSACC framework, which employs a two-level semantic space design (low-level mutual information consistency + high-level adaptive weighted fusion) combined with cooperatively optimized implicit missing-view recovery, achieving significant improvements over existing incomplete multi-view clustering methods on five benchmark datasets.
Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities: This paper introduces the Ineq-Comp benchmark, which applies compositionally transformed variants of simple inequality seed problems—variants that humans can resolve with minimal additional effort—to expose fundamental deficiencies in the compositional reasoning of current LLM-based formal theorem provers. Even DeepSeek-Prover-V2-7B suffers a performance drop exceeding 20%.
Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning: This paper proposes Controllable Pseudo-label Generation (CPG), a framework that progressively incorporates reliable pseudo-labels into the labeled set via a controllable self-reinforcing optimization cycle. By training a Bayes-optimal classifier on a distribution of known composition, CPG achieves accuracy gains of up to 15.97% in the Realistic LTSSL setting where the unlabeled data distribution is entirely unknown.
LCDB 1.1: A Database Illustrating Learning Curves Are More Ill-Behaved Than Previously Thought: This paper constructs LCDB 1.1, a large-scale high-resolution learning curve database, demonstrating that ill-behaved sample learning curves (non-monotonic, non-convex) are approximately twice as prevalent as previously believed, with roughly 15% of curves exhibiting significant ill-behavior that feature scaling largely fails to remedy.
Learning Generalizable Shape Completion with SIM(3) Equivariance: This paper proposes SIMECO, the first SIM(3)-equivariant shape completion network. Through a three-stage modular design — feature canonicalization → similarity-invariant geometric reasoning → transformation recovery — SIMECO outperforms all augmentation-based and equivariant baselines under an unbiased evaluation protocol, achieving a 17% MMD reduction on KITTI and a 14% CD-\(\ell_1\) reduction on OmniObject3D. Notably, SIMECO under the stricter protocol still surpasses competing methods evaluated under their own biased settings.
Leveraging Robust Optimization for LLM Alignment under Distribution Shifts: This paper proposes DoRA (Distribution-aware Optimization for Robust Alignment), which trains a distribution classifier to assign calibrated weights to individual samples and incorporates them into a KL-DRO framework to minimize worst-case loss. DoRA operates as a model-agnostic plug-and-play module that consistently improves the robustness of various alignment algorithms—including DPO, RRHF, and LIRE—under distribution shifts.
LTD-Bench: Evaluating Large Language Models by Letting Them Draw: LTD-Bench evaluates the spatial reasoning capabilities of LLMs by having them draw (via dot-matrix output or code-based rendering), transforming abstract evaluation metrics into intuitive visual outputs. The benchmark reveals critical deficiencies in current state-of-the-art LLMs regarding bidirectional mapping between linguistic and spatial concepts.
MEIcoder: Decoding Visual Stimuli from Neural Activity by Leveraging Most Exciting Inputs: MEIcoder leverages neuron-specific Most Exciting Inputs (MEIs) as biological priors, combined with SSIM loss and adversarial training, to achieve state-of-the-art visual stimulus reconstruction from neural population activity in the primary visual cortex (V1), with particular strengths in small-dataset and low-neuron-count regimes.
Mind the Gap: Removing the Discretization Gap in Differentiable Logic Gate Networks: This paper proposes Gumbel Logic Gate Networks (Gumbel LGNs), which inject Gumbel noise into logic gate selection and employ a straight-through (ST) estimator to reduce the discretization gap of differentiable logic gate networks by 98%, achieve a 4.5× speedup in training, and reduce the proportion of unused neurons to 0%.
Model-Behavior Alignment under Flexible Evaluation: When the Best-Fitting Model Isn't the Right One: Through large-scale model recovery experiments, this paper demonstrates that even with 4.5 million behavioral data points, flexible evaluation methods based on linear probing achieve model recovery accuracy below 80% across 20 visual models. This reveals a fundamental trade-off between predictive accuracy and model identifiability, challenging the prevailing paradigm that the best-fitting model is the most appropriate one.
Model Context Protocol for Vision Systems: Audit, Security, and Protocol Extensions: This paper presents the first protocol-level audit of MCP deployment in vision systems, analyzing 91 public MCP servers and finding that 78% exhibit schema inconsistencies and 89% lack runtime validation; it further proposes protocol extensions including semantic schemas, visual memory, and runtime validators.
MVSMamba: Multi-View Stereo with State Space Model: This paper proposes MVSMamba, the first Mamba-based Multi-View Stereo (MVS) network, which achieves efficient intra-view and inter-view global omnidirectional feature aggregation via a reference-centered dynamic scanning strategy, attaining state-of-the-art performance on DTU and Tanks-and-Temples with superior efficiency.
Normal-Abnormal Guided Generalist Anomaly Detection: NAGL is the first framework to incorporate mixed normal-and-abnormal reference samples into Generalist Anomaly Detection (GAD). Through two attention modules—Residual Mining (RM) and Anomaly Feature Learning (AFL)—it learns transferable anomaly patterns in residual space, substantially outperforming normal-reference-only methods in cross-domain scenarios with as few as 1 anomaly reference sample.
Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories: This paper presents the first systematic evaluation of how train/test splitting strategies affect generalization performance in attribute prediction tasks. It proposes four progressively harder splitting schemes based on LLM semantic grouping, embedding similarity, embedding clustering, and ground-truth supercategory labels. The study finds that unsupervised clustering-based splitting achieves leakage reduction comparable to ground-truth supercategory splits—without requiring any annotations—while retaining substantially better predictive performance.
On Evaluating LLM Alignment by Evaluating LLMs as Judges: This paper systematically investigates the consistency between LLMs' generation capability and evaluation capability (GE-consistency), finding a strong correlation between the two rankings under a strong preference oracle (Spearman \(\rho = 0.96\)). Based on this finding, the authors propose the AlignEval benchmark, which measures LLM alignment by assessing LLMs' ability as judges—without directly invoking LLM-as-Judge to evaluate model outputs—achieving performance comparable to or better than AlpacaEval and Arena-Hard.
On the Entropy Calibration of Language Models: This paper systematically investigates the entropy calibration of language models — whether the entropy of generated text matches the log loss on human text — and finds that due to the power-law nature of data distributions (\(\alpha \approx 1\)), error accumulation improves extremely slowly with model scale (scaling exponent \(\approx -0.05\)). The paper further provides a theoretical proof that entropy can be calibrated in polynomial time without sacrificing diversity.
Open-Insect: Benchmarking Open-Set Recognition of Novel Species in Biodiversity Monitoring: This paper introduces Open-Insect — the first large-scale fine-grained open-set recognition benchmark for insect species discovery, spanning three geographic regions and three types of open-set splits. It systematically evaluates 38 OSR algorithms, finding that simple posterior methods (e.g., MSP) remain strong baselines in fine-grained settings, and demonstrates the critical role of domain-relevant auxiliary data in improving OSR performance.
OptiTree: Hierarchical Thoughts Generation with Tree Search for LLM Optimization Modeling: This paper proposes OptiTree, which organizes hierarchical classification and modeling thoughts for operations research (OR) problems by constructing a modeling tree, and employs tree search to adaptively decompose complex problems into sequences of simpler subproblems, achieving significant accuracy gains in optimization modeling tasks for LLMs (exceeding 10% on multiple challenging benchmarks).
PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation: This paper presents PARROT, a practical and realistic benchmark for cross-system SQL translation (SQL-to-SQL), comprising 598 core translation pairs (expanded to 28,003 pairs) sourced from 38 open-source benchmarks and real-world business scenarios, covering 22 production-grade database systems. The benchmark reveals that the strongest current LLMs achieve an average accuracy below 38.53%.
PaTH Attention: Position Encoding via Accumulating Householder Transformations: This paper proposes PaTH (Position encoding via accumulating Householder Transformations), a data-dependent multiplicative position encoding scheme that replaces RoPE's static rotation matrices with accumulated Householder transformations, achieving superior theoretical expressiveness and empirical language modeling performance over RoPE.
PFΔ: A Benchmark Dataset for Power Flow under Load, Generation, and Topology Variations: PFΔ is the first power flow benchmark dataset to simultaneously encompass load, generation dispatch, and topology variations. It comprises 859,800 solved instances across six grid scales, includes close-to-infeasible extreme operating conditions, and introduces a standardized evaluation task suite for systematically assessing ML methods under diverse operating conditions.
Put CASH on Bandits: A Max K-Armed Problem for Automated Machine Learning: This paper addresses the Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem in AutoML. Through data-driven analysis, it reveals that HPO reward distributions are bounded and left-skewed, and proposes MaxUCB—a bandit algorithm specifically tailored to this distributional property—achieving both theoretical and empirical improvements over existing methods.
RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases: This paper proposes RDB2G-Bench — the first benchmark framework for evaluating relational-database-to-graph modeling methods, comprising 5 real-world RDBs, 12 prediction tasks, approximately 50,000 precomputed graph model–performance pairs, and a systematic comparison of 10 automatic graph modeling approaches.
Reliably Detecting Model Failures in Deployment Without Labels: This paper proposes D3M (Disagreement-Driven Deterioration Monitoring), a three-stage model monitoring algorithm based on variational Bayesian posterior sampling, which reliably detects model performance degradation in label-free, training-data-free deployment settings while maintaining low false positive rates under non-degrading distribution shifts.
Rethinking Evaluation of Infrared Small Target Detection: This paper systematically identifies three critical limitations in existing evaluation protocols for infrared small target detection (IRSTD), and proposes a hierarchical analysis framework comprising the hybrid-level metric hIoU, a systematic error analysis methodology, and a cross-dataset evaluation setting.
Rethinking Losses for Diffusion Bridge Samplers: This paper identifies theoretical flaws in the widely used Log Variance (LV) loss for diffusion bridge samplers—namely, that it violates the data processing inequality and its gradients are not equivalent to those of the reverse KL (rKL)—and proposes computing rKL gradients via the log-derivative trick (rKL-LD). The proposed approach consistently outperforms LV loss across multiple benchmarks while exhibiting more stable training and reduced sensitivity to hyperparameters.
RGB-to-Polarization Estimation: A New Task and Benchmark Study: This paper formally defines the novel task of estimating polarization components (S₁/S₂/S₃) from standard RGB images, establishes the first systematic benchmark encompassing both restoration-based and generative methods, and finds that pretrained MAE achieves the best overall pixel-level accuracy (PSNR 24.74). Restoration-based methods consistently outperform diffusion-based generative methods, with pretrained weight transfer identified as a critical advantage.
Risk Management for Mitigating Benchmark Failure Modes: BenchRisk: Grounded in the NIST Risk Management Framework, this work systematically analyzes 57 failure modes across 26 LLM benchmarks, proposes 196 mitigation strategies, and introduces BenchRisk—a meta-evaluation framework that scores the reliability of benchmarks themselves.
Robust Hallucination Detection in LLMs via Adaptive Token Selection: HaMI frames hallucination detection as a Multiple Instance Learning (MIL) problem, treating each generated sequence as a bag of token instances. By jointly optimizing token selection and hallucination detection, it adaptively identifies the most informative tokens, achieving substantial AUROC improvements over all existing methods across four QA benchmarks (up to 11.9%).
scMRDR: A Scalable and Flexible Framework for Unpaired Single-Cell Multi-Omics Data Integration: This paper proposes scMRDR, a framework based on β-VAE that disentangles latent representations of single-cell multi-omics data into modality-shared and modality-specific components, achieving scalable integration of unpaired multi-omics data through isometric regularization, adversarial training, and masked reconstruction loss.
Semi-Supervised Regression with Heteroscedastic Pseudo-Labels: This paper proposes an uncertainty-aware pseudo-label framework based on heteroscedastic modeling, which dynamically calibrates per-sample pseudo-label uncertainty via bilevel optimization to mitigate the negative impact of noisy pseudo-labels on regression models, achieving state-of-the-art performance on multiple SSR benchmarks.
Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems: This work systematically evaluates three language models with fewer than 1.5B parameters (gemma3, llama3.2, qwen2.5) on compiler auto-parallelization tasks. Using six inference strategies across 11 real-world kernels, the approach achieves an average speedup of 6.81× and a peak speedup of 43.25×, demonstrating that small models can serve as powerful compiler optimization reasoning engines.
SPROD: Spurious-Aware Prototype Refinement for Reliable Out-of-Distribution Detection: SPROD is a post-hoc OOD detection method designed to handle spurious correlations in training data. It subdivides each class prototype into "correctly classified" and "misclassified" subgroups (the latter sharing spurious features), and combines this split with K-means-style refinement and distance-based (generative) scoring. Across 5 spurious-correlation OOD benchmarks, it achieves an average AUROC of 85.1% (+4.8% vs. runner-up KNN) and FPR@95 of 49.0% (−9.3% vs. runner-up).
Test-Time Adaptation by Causal Trimming: This paper proposes TACT, a method that identifies non-causal directions in the representation space via data augmentation and PCA, then removes the projections of both test representations and class prototypes along these directions at test time. This reduces model reliance on non-causal features and significantly improves prediction performance under distribution shift.
The Geometry of Cortical Computation: Manifold Disentanglement and Predictive Dynamics in VCNet: This paper proposes VCNet—a neural network architecture that simulates the macroscopic organization of the primate visual cortex—reinterpreting dual-stream separation (manifold disentanglement) and predictive coding (geodesic refinement) through the language of geometry and dynamical systems. At an extremely compact size of 0.04 MB, VCNet achieves 92.1% accuracy on Spots-10 (10% above a distilled DenseNet), and attains 74.4% on light field classification at 3.52 MB (surpassing MobileNetV2 by 2.3%).
Thought Communication in Multiagent Collaboration: This paper proposes ThoughtComm, a framework that formalizes multiagent communication as a latent variable generative model. It proves that both shared and private thoughts are identifiable under nonparametric conditions, extracts latent thoughts via a sparsity-regularized autoencoder, and feeds them back to each agent through prefix injection. ThoughtComm achieves an average improvement of 19.06% over the current SOTA Multiagent Finetuning on mathematical reasoning benchmarks.
Tight Lower Bounds and Improved Convergence in Performative Prediction: Under the performative prediction framework, this paper provides the first tight convergence rate analysis for Repeated Risk Minimization (RRM) and proposes the Affine Risk Minimizers (ARM) algorithm class, which achieves convergence over a broader problem class by leveraging data from historical training snapshots.
Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking: This paper introduces DeepFund, the first live fund investment benchmark for LLMs, which employs a multi-agent architecture (Financial Planner + Analyst Team + Portfolio Manager) connected to real-time market data, eliminating the information leakage caused by LLM "time travel" in traditional backtesting. Over 24 trading days of live testing across 9 flagship LLMs, only Grok 3 achieves positive returns, revealing fundamental limitations of current LLMs in active fund management.
Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era: This paper proposes ImAge (Implicit Aggregation), which inserts learnable aggregation tokens at a specific layer of a Transformer backbone and leverages the intrinsic self-attention mechanism to implicitly aggregate patch features into a global descriptor, completely eliminating the need for an external aggregator. With the smallest descriptor dimensionality (6144) and fastest inference speed, ImAge surpasses SOTA methods such as SALAD and BoQ across multiple VPR benchmarks, and ranks 1st on the MSLS Challenge leaderboard.
Turbocharging Gaussian Process Inference with Approximate Sketch-and-Project: This paper proposes the ADASAP algorithm, which extends the sketch-and-project framework to large-scale GP inference via approximate subspace preconditioning, distributed computation, and Nesterov acceleration. It is the first method to scale exact GP inference to more than \(3\times10^8\) samples, while theoretically establishing condition number-free convergence guarantees for the SAP family.
Unlocking Transfer Learning for Open-World Few-Shot Recognition: A two-stage framework is proposed that combines open-set-aware meta-learning with open-set-free transfer learning, achieving the first successful application of the transfer learning paradigm to few-shot open-set recognition (FSOSR) and reaching SOTA on miniImageNet and tieredImageNet.
What Does It Take to Build a Performant Selective Classifier?: This paper presents the first finite-sample decomposition of the selective classification gap, attributing it to five sources—Bayes noise, approximation error, ranking error, statistical noise, and implementation bias—and demonstrates that monotone calibration methods have limited effect on closing this gap.
Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally: This paper constructs WCB, the most comprehensive central bank monetary policy corpus to date (380,000+ sentences, 25 central banks, spanning 28 years), defines three NLP tasks (stance detection, temporal classification, uncertainty estimation), and through 15,075 benchmark experiments demonstrates that models trained on aggregated multi-bank data significantly outperform single-bank training, confirming the principle that "the whole is greater than the sum of its parts."
Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator: This paper identifies that post-training (SFT/RLHF/DPO) degrades the confidence calibration of pre-trained language models, and proposes DACA, a method that exploits the well-calibrated nature of pre-trained models by aligning confidence distributions exclusively on prediction-consistent samples, achieving label-free calibration of post-trained models with up to 15.08% ECE improvement.
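
For readers unfamiliar with the ECE figure quoted in the last entry, the sketch below shows one common way to compute expected calibration error with equal-width confidence bins. It is a generic illustration, not the DACA paper's evaluation code. The function name `expected_calibration_error`, the default of 10 bins, and the left-closed bin convention are assumptions made here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct:     booleans, True where the chosen answer was right.
    n_bins:      number of equal-width bins (assumed default, not from the source).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bins are [k/n_bins, (k+1)/n_bins); a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Four answers at 90% confidence, three of them right: accuracy is 75%.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A lower value means stated confidence tracks observed accuracy more closely. The result depends on the binning scheme, so ECE numbers are only comparable when the bin count and convention match.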