📂 Others

🧪 ICML2025 · 90 paper notes

🔥 Top topics: Adversarial Robustness ×5 · Domain Adaptation ×3 · Few-/Zero-Shot Learning ×2

Access Controls Will Solve the Dual-Use Dilemma

Proposes a conceptual framework based on access control to address the dual-use dilemma in AI safety. By obtaining real-world context through user verification and combining it with content classification, the framework achieves fine-grained permission management, simultaneously mitigating over-refusal and under-refusal.

Addressing Imbalanced Domain-Incremental Learning through Dual-Balance Collaborative Experts (DCE)

DCE proposes a two-stage training framework of a frequency-aware expert group + a dynamic expert selector to simultaneously resolve the two challenges of intra-domain class imbalance and cross-domain class distribution shift in domain-incremental learning, achieving state-of-the-art (SOTA) performance on four benchmarks.

Adversarial Combinatorial Semi-bandits with Graph Feedback

This paper introduces graph feedback into the adversarial combinatorial semi-bandits framework and proposes the OSMD-G algorithm, establishing the optimal regret bound of \(\widetilde{\Theta}(S\sqrt{T} + \sqrt{\alpha S T})\), where \(S\) is the size of the combinatorial action and \(\alpha\) is the independence number of the feedback graph. The key technique lies in utilizing randomized swap rounding to achieve negatively correlated sampling.

AutoAL: Automated Active Learning with Differentiable Query Strategy Search

Proposes the first differentiable active learning strategy search framework, AutoAL. By collaboratively training two networks, SearchNet and FitNet, under a bilevel optimization framework, it automatically selects the optimal strategy from multiple candidate AL strategies for a given task, consistently outperforming all candidate strategies and other SOTA methods on natural and medical image datasets.

Beyond Entropy: Region Confidence Proxy for Wild Test-Time Adaptation

Reveals the fundamental limitation of entropy minimization in wild test-time adaptation (WTTA)—conflicting optimization dynamics caused by inconsistent predictions of semantically similar samples in local regions. Proposes the ReCAP framework, which models regions probabilistically and utilizes a finite-to-infinite asymptotic approximation to convert the intractable region confidence into an efficiently optimizable proxy objective, consistently outperforming the state-of-the-art on ImageNet-C.

Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation

This paper theoretically analyzes the Bayes-optimal solutions of two aggregation strategies in multi-label bipartite ranking—loss aggregation and label aggregation—revealing that loss aggregation suffers from a "label dictatorship" phenomenon (where a single label dominates the ranking due to marginal skewness), whereas label aggregation treats all labels in a more balanced manner.

Constrained Hamiltonian Systems on Observation-Induced Fiber Bundles: Theory of Symmetry and Integrability

This work proposes a geometric framework of "observation-induced fiber bundles" that treats observational uncertainty in partially observable systems as intrinsic variation of fiber coordinates rather than as external perturbation. On this structure, it unifies the treatment of state and observational constraints, establishing a complete theory of symplectic geometry, integrability, symmetry, and conservation laws.

Continuous-Time Analysis of Heavy Ball Momentum in Min-Max Games

Through continuous-time ODE modeling, this work systematically reveals that heavy ball momentum behaves completely differently in min-max games compared to minimization problems: smaller momentum (including negative momentum) expands the stable stepsize range and guides trajectories toward flatter gradient regions, while alternating updates converge faster than simultaneous updates and amplify this regularization effect.
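
The contrast between simultaneous and alternating updates can be probed on a toy problem. The sketch below is not taken from the paper: it assumes the bilinear game \(f(x, y) = xy\), a discrete heavy-ball update, and hypothetical names (`run_gda`, `eta`, `beta`), and it only illustrates the setting rather than reproducing the ODE analysis.

```python
import math

def run_gda(eta, beta, steps, alternating):
    """Heavy-ball gradient descent-ascent on the toy game f(x, y) = x * y.

    Returns the final distance to the equilibrium (0, 0).
    eta is the step size, beta the momentum coefficient (may be negative).
    """
    x, y = 1.0, 1.0
    x_prev, y_prev = x, y
    for _ in range(steps):
        # Min player: descent step on x, since df/dx = y.
        x_new = x - eta * y + beta * (x - x_prev)
        # Max player: ascent step on y, since df/dy = x.
        # Alternating updates use the freshly computed x_new.
        x_seen = x_new if alternating else x
        y_new = y + eta * x_seen + beta * (y - y_prev)
        x_prev, y_prev, x, y = x, y, x_new, y_new
    return math.hypot(x, y)

if __name__ == "__main__":
    for beta in (0.0, -0.2):
        sim = run_gda(eta=0.1, beta=beta, steps=1000, alternating=False)
        alt = run_gda(eta=0.1, beta=beta, steps=1000, alternating=True)
        print(f"beta={beta:+.1f}  simultaneous={sim:.3g}  alternating={alt:.3g}")
```

Without momentum (`beta=0`), the simultaneous iterates spiral away from the equilibrium on this game, while the alternating iterates stay on a bounded orbit. Varying `beta`, including negative values, gives a quick way to explore the qualitative claims in the summary above.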

Cross-regularization: Adaptive Model Complexity through Validation Gradients

Proposes Cross-regularization, which directly optimizes regularization parameters (weight norm, noise scale, augmentation intensity) via validation set gradients, converging to the cross-validation optimal solution in a single training run, thereby eliminating the need for manual hyperparameter tuning.

Curvature Enhanced Data Augmentation for Regression

Proposes CEMS (Curvature-Enhanced Manifold Sampling), which utilizes the second-order approximation (curvature information) of the data manifold to generate synthetic samples for data augmentation in regression tasks, achieving state-of-the-art (SOTA) or near-SOTA performance in both in-distribution and out-of-distribution scenarios.

DRO-BAS: Decision Making under the Exponential Family DRO with Bayesian Ambiguity Sets

Proposes the DRO-BAS framework, which leverages Bayesian posterior beliefs to construct two posterior-informed ambiguity sets (BASPP and BASPE). Under exponential family conjugate models, these can be reformulated as efficient single-stage stochastic programs, Pareto-dominating existing Bayesian DRO methods on the Newsvendor and Portfolio problems.

Democratic AI is Possible. The Democracy Levels Framework Shows How It Might Work

This paper proposes the "Democracy Levels" framework, which categorizes the transfer of AI decision-making authority from unilateral power to democratic systems into six levels (L0–L5). Equipped with a multidimensional evaluation system and practical tools, it provides a systematic roadmap for the democratization of AI governance.

DiLQR: Differentiable Iterative Linear Quadratic Regulator via Implicit Differentiation

This paper proposes the DiLQR framework, which applies implicit differentiation to the fixed points of the iLQR controller to derive analytical gradient solutions. This reduces the backpropagation computational complexity from linear growth with the number of iterations to a constant \(O(1)\), achieving up to a 128× speedup while improving learning performance by up to \(10^6\) times compared to conventional neural network policies.

Discrepancy Minimization in Input-Sparsity Time

Proposes the first input-sparsity time algorithm for real-valued discrepancy minimization, in two versions: a combinatorial version running in \(\widetilde{O}(\mathrm{nnz}(A)+n^3)\) time and a fast matrix multiplication (FMM) version running in \(\widetilde{O}(\mathrm{nnz}(A)+n^{2.53})\) time. The logarithmic approximation guarantee relative to \(\mathrm{herdisc}\) remains unchanged, nearly bridging the computational gap between real-valued and binary matrices.

Discrete Neural Algorithmic Reasoning

This paper proposes the Discrete Neural Algorithmic Reasoner (DNAR). By leveraging three core components (feature discretization, hard attention, and separate continuous/discrete data flows), DNAR forces neural networks to execute algorithmic trajectories along a finite set of predefined states. It achieves perfect (100%) test scores on tasks such as BFS, DFS, Dijkstra, Prim, and MIS, and allows for formal proofs of the correctness of the learned algorithms.

Diversity By Design: Leveraging Distribution Matching for Offline Model-Based Optimization

This paper proposes DynAMO, which explicitly models design diversity as a distribution matching problem to simultaneously discover high-quality and highly diverse candidate designs in offline model-based optimization (MBO).

DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers

Existing sequence parallelism methods for multi-dimensional Transformers (e.g., spatio-temporal attention models in video generation) can shard along only a single dimension, which causes massive redundant communication. This paper proposes Dynamic Sequence Parallelism (DSP), which dynamically switches the parallel dimension between computation stages (instead of communicating inside modules), using efficient all-to-all operations for resharding. DSP achieves end-to-end throughput improvements ranging from 32.2% to 10× and reduces communication overhead by at least 50%.

Efficient Network Automatic Relevance Determination

Extends Automatic Relevance Determination (ARD) from single-output to multi-output regression scenarios, proposes the NARD framework to jointly estimate sparse regression coefficients and the output precision matrix, and designs three acceleration algorithms (Sequential/Surrogate/Hybrid) to reduce complexity from \(\mathcal{O}(d^3)\) to \(\mathcal{O}(p^2)\).

Efficient Optimization with Orthogonality Constraint: a Randomized Riemannian Submanifold Method

A randomized Riemannian submanifold descent method (RSDM) is proposed. By restricting each update step to a randomized low-dimensional submanifold, the computational complexity of the retraction operation in orthogonality-constrained optimization is reduced from \(O(np^2)\) to \(O(r^3)\), while maintaining a total computational complexity that matches that of full-space Riemannian gradient descent.

Exploiting Similarity for Computation and Communication-Efficient Decentralized Optimization

Proposes the Stabilized Proximal Decentralized Optimization (SPDO) method and its accelerated variant, achieving optimal communication and computation complexities simultaneously within a proximal decentralized optimization framework. This is achieved by relaxing local subproblem accuracy requirements (from increasing with iterations to constant) via a stabilized projection technique, and reducing communication overhead by replacing the maximum function dissimilarity \(\delta_{\max}\) with the average function similarity \(\delta\).

Faster and Stronger: When ANN-SNN Conversion Meets Parallel Spiking Calculation

Integrates parallel spiking calculation with ANN-SNN conversion for the first time, establishing a mathematically equivalent mapping. This achieves 72.90% Top-1 accuracy on ImageNet within an ultra-low latency of only 4 steps, accelerating inference by 19× to 38×.

Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry

This paper proposes using manifold capacity and its associated geometric metrics (GLUE) to characterize the richness of feature learning. This approach goes beyond the traditional lazy vs. rich dichotomy, revealing new insights into different learning phases, learning strategies, computational neuroscience, and OOD generalization.

FEDTAIL: Federated Long-Tailed Domain Generalization with Sharpness-Guided Gradient Matching

FedTAIL proposes a federated domain generalization framework that simultaneously addresses the dual challenges of domain shift and long-tailed class imbalance through three modules: gradient coherence regularization, class-wise sharpness-aware minimization, and curvature-aware dynamic weighting, achieving SOTA performance on multiple benchmarks.

Feedforward Few-shot Species Range Estimation

Proposes FS-SINR (Few-shot Spatial Implicit Neural Representations), a Transformer-based feedforward few-shot species range estimation model. Without requiring retraining for new species, it predicts spatial distributions in a single forward pass from a few (or even zero) observation locations, outperforming retraining-based methods like LE-SINR on IUCN and S&T benchmarks with only 2-6% of the computational time.

Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator

This paper systematically analyzes the theoretical connection between the squared gradient accumulator (Squisher) of the Adam optimizer and the diagonal of the Fisher Information Matrix. It demonstrates that Squisher can serve as a zero-cost approximation of the Fisher diagonal, performing comparably to Fisher across five major applications, including model merging, continual learning, and sparsification.
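
A small numerical sketch of the idea, under simplifying assumptions that are not from the paper: a logistic-regression model held at fixed weights, batch size 1, and hypothetical helper names. It compares a bias-corrected Adam-style running average of squared gradients with the diagonal of the empirical Fisher (the mean of squared per-example gradients).

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Per-example gradients of the logistic loss at fixed weights w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X

def empirical_fisher_diag(grads):
    """Diagonal of the empirical Fisher: mean of squared per-example gradients."""
    return np.mean(grads ** 2, axis=0)

def adam_second_moment(grads, beta2=0.99):
    """Adam-style exponential moving average of squared gradients, bias-corrected."""
    v = np.zeros(grads.shape[1])
    for g in grads:  # one example per step (batch size 1, an assumption)
        v = beta2 * v + (1.0 - beta2) * g ** 2
    return v / (1.0 - beta2 ** len(grads))

# Synthetic data; labels are sampled from the model itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 5))
w = np.full(5, 0.1)
y = (rng.random(4000) < 1.0 / (1.0 + np.exp(-X @ w))).astype(float)

g = per_sample_grads(w, X, y)
fisher = empirical_fisher_diag(g)
squisher = adam_second_moment(g)
print(np.round(fisher, 3), np.round(squisher, 3))
```

In real training the weights move and gradients are averaged over minibatches before squaring, so the accumulator is only an approximation. The sketch isolates the simplest case, in which both quantities are averages of the same squared gradients.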

Fixed-Confidence Multiple Change Point Identification under Bandit Feedback

This paper proposes the problem of multiple change point identification in piecewise-constant bandits under the fixed-confidence setting, establishes instance-dependent lower bounds on sample complexity, and designs MCPI (Multiple Change Point Identification), a simple, computationally efficient, and asymptotically optimal algorithm.

Fixing the Loose Brake: Exponential-Tailed Stopping Time in Best Arm Identification

This paper reveals that classic fixed-confidence best arm identification algorithms (Successive Elimination, KL-LUCB) have non-zero probability events where they never stop. It proposes two schemes, FC-DSH and the meta-algorithm BrakeBooster, which achieve the first guaranteed exponential tail decay for the stopping time while preserving instance-dependent complexity up to logarithmic factors.

Fully Dynamic Euclidean Bi-Chromatic Matching in Sublinear Update Time

This work presents the first fully dynamic sublinear update algorithm for the Euclidean bi-chromatic matching problem. For any fixed \(\varepsilon > 0\), it achieves an \(O(1/\varepsilon)\) approximation ratio and \(O(n^{\varepsilon})\) update time, which can be utilized to efficiently monitor distribution shifts (Wasserstein distance).

Function Encoders: A Principled Approach to Transfer Learning in Hilbert Spaces

Proposes a taxonomy of transfer learning from the geometric perspective of Hilbert spaces (convex hull interpolation / linear span extrapolation / full-space extrapolation), and designs the Function Encoder method utilizing learnable neural network basis functions to achieve all three types of transfer, outperforming methods such as MAML and Transformers on multiple benchmarks.

General Agents Contain World Models

This work theoretically proves that any agent capable of generalizing across multi-step goal-conditioned tasks must implicitly learn a predictive model of its environment (a world model), and this model can be extracted from the agent's policy—the stronger the agent and the more complex the goals, the more accurate its implicit world model.

Generation from Noisy Examples

The theoretical framework of "language generation in the limit" by Kleinberg & Mullainathan (2024) is extended to noisy sample stream scenarios. The Noisy Closure dimension is proposed to fully characterize the necessary and sufficient conditions for uniform noise-dependent generability, proving that all countable hypothesis classes remain non-uniformly generable under finite noise.

GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras

This paper proposes Generalized Lipschitz Group Equivariant Neural Networks (GLGENN), which leverage weight sharing across four fundamental subspaces defined by grade involution and reversion in geometric algebra. While maintaining equivariance to pseudo-orthogonal groups, GLGENN significantly reduces trainable parameters (to approximately 1/2 to 1/3 of CGENN) and matches or exceeds CGENN on key performance metrics across multiple benchmark tasks.

GPU-friendly and Linearly Convergent First-order Methods for Certifying Optimal \(k\)-sparse GLMs

Proposes GPU-friendly and linearly convergent first-order methods. Through composite reformulation and a dual gap restart strategy, solving the perspective relaxation is accelerated by 1-2 orders of magnitude, enabling optimality certification for large-scale sparse GLMs.

Gradient Aligned Regression via Pairwise Losses

Proposes Gradient Aligned Regression (GAR), which aligns the gradients of the predictive and true functions by introducing two pairwise difference losses (error variance + negative Pearson correlation coefficient) in the label space, and robustly aggregates three sub-losses using DRO. This achieves the same linear complexity as traditional regression losses, while outperforming MAE/MSE and contrastive learning methods on multiple benchmarks.

Hierarchical Refinement: Optimal Transport to Infinity and Beyond

Proposes the Hierarchical Refinement (HiRef) algorithm, which dynamically constructs multi-scale data partitions by recursively solving low-rank optimal transport subproblems to obtain a full bijective Monge map in log-linear time and linear space complexity, scaling optimal transport to million-scale datasets.

How Do Transformers Learn Variable Binding in Symbolic Programs?

By training Transformers to perform variable dereferencing on synthetic programs, a three-stage developmental trajectory is identified: (1) random prediction \(\rightarrow\) (2) shallow heuristics \(\rightarrow\) (3) systematic dereferencing mechanism. Causal interventions demonstrate that the model learns to utilize the residual stream as an addressable memory space.

If Open Source Is to Win, It Must Go Public

This paper argues that open-source AI, under current practices, cannot independently achieve AI democratization—model weights are merely "inert code" requiring substantial capital to activate. It must be embedded within public AI infrastructure (public funding + public access + public governance + private commitment) to serve as a genuine public good.

Improved Exploration in GFlowNets via Enhanced Epistemic Neural Networks

This paper integrates Epistemic Neural Networks (ENN/epinet) into GFlowNets to achieve uncertainty-driven exploration, proposing the ENN-GFN-Enhanced algorithm. It significantly improves mode discovery efficiency and distribution learning quality on HyperGrid and sequence generation tasks.

Improved Learning via k-DTW: A Novel Dissimilarity Measure for Curves

Proposes \(k\)-DTW—a novel dissimilarity measure for polygonal curves that focuses only on the sum of the \(k\) largest distances in a traversal, combining the robustness of DTW with the metric properties of the Fréchet distance, while proving for the first time a dimension-free learning bound for curve clustering.

Improving Generalization with Flat Hilbert Bayesian Inference

Proposes Flat Hilbert Bayesian Inference (FHBI), which generalizes the sharpness-aware minimization (SAM) concept of flatness from finite-dimensional Euclidean space to infinite-dimensional Reproducing Kernel Hilbert Space (RKHS), and integrates it with particle-based Bayesian inference, outperforming nine baselines with an average Top-1 accuracy of 73.7% on the VTAB-1K benchmark.

Improving the Effective Receptive Field of Message-Passing Neural Networks

This paper formalizes the concept of Effective Receptive Field (ERF) in MPNNs, proves that node contributions decay exponentially with distance (modeled as a binomial distribution), and proposes the IM-MPNN architecture. By utilizing multiscale graph coarsening and cross-scale information interleaving, IM-MPNN expands the ERF and achieves significant improvements on long-range dependency benchmarks such as LRGB.

K²IE: Kernel Method-based Kernel Intensity Estimators for Inhomogeneous Poisson Processes

K²IE is proposed as an RKHS least-squares regularized kernel intensity estimator. By proving that the dual coefficients in its representer theorem are identically 1, the authors theoretically unify classical kernel intensity estimation (KIE) with modern kernel methods, combining the computational efficiency of KIE with the boundary correction advantages of kernel methods.

LapSum -- One Method to Differentiate Them All: Ranking, Sorting and Top-k Selection

The authors propose LapSum, a unified framework for four major differentiable sorting tasks (differentiable ranking, sorting, top-k selection, and permutation matrices) based on a closed-form invertible formula of the sum of cumulative density functions of the Laplace distribution. It operates with a time complexity of only \(O(n\log n)\) and \(O(n)\) space complexity, significantly outperforming existing methods in large-scale scenarios.
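
To make the "sum of Laplace CDFs" idea concrete, here is a hedged sketch of a soft top-k selector. It assumes the relaxation assigns each score the weight \(F((s_i - b)/\alpha)\), with \(F\) the Laplace CDF and the threshold \(b\) chosen so the weights sum to \(k\). The threshold is found by plain bisection, which stands in for the paper's closed-form inverse. The names `soft_top_k` and `alpha` are hypothetical.

```python
import math

def laplace_cdf(z):
    """CDF of the standard Laplace distribution, written to avoid overflow."""
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)

def soft_top_k(scores, k, alpha=0.1, iters=60):
    """Soft membership weights in (0, 1) that sum to approximately k.

    Assumes 0 < k < len(scores). Smaller alpha gives a sharper selection.
    """
    def total(b):
        return sum(laplace_cdf((s - b) / alpha) for s in scores)

    # total(b) is decreasing in b; bracket the root generously.
    lo = min(scores) - 50.0 * alpha
    hi = max(scores) + 50.0 * alpha
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > k:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    return [laplace_cdf((s - b) / alpha) for s in scores]

if __name__ == "__main__":
    print(soft_top_k([3.0, 1.0, 2.0, 0.5], k=2))
```

Each bisection step costs \(O(n)\), so this naive version does not reach the complexity quoted above. It only shows why a monotone sum of CDFs yields a well-defined, differentiable selection threshold.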

Latent Variable Estimation in Bayesian Black-Litterman Models

By treating the subjective investor views \((q, \Omega)\) in the classical Black-Litterman portfolio optimization model as latent variables, this paper automatically infers them from market feature data via a Bayesian network. This eliminates reliance on manual subjective inputs, improving the Sharpe ratio by approximately 50% and reducing the turnover rate by around 55% on 30-year Dow Jones and 20-year ETF datasets.

Learning Distances from Data with Normalizing Flows and Score Matching

This paper proposes to learn density functions and score functions using normalizing flows and score matching to efficiently compute density-based Fermat distances, addressing the issues of slow convergence and rough paths in high-dimensional spaces associated with traditional graph-based methods.

Learning Safe Strategies for Value Maximizing Buyers in Uniform Price Auctions

For value-maximizing buyers with RoI constraints in repeated uniform-price multi-unit auctions, this work introduces the concept of "safe bidding strategies," proves that they only need to satisfy mild no-overbidding conditions, and designs a polynomial-time online learning algorithm that achieves a regret bound of \(\widetilde{O}(M\sqrt{mT})\).

Lightspeed Geometric Dataset Distance via Sliced Optimal Transport

Proposes s-OTDD (sliced optimal transport dataset distance), which maps label distributions to scalars via Moment Transform Projection (MTP) to achieve near-linear complexity dataset distance computation, running significantly faster than OTDD while achieving comparable performance.

Modern Methods in Associative Memory

A systematic tutorial by the IBM & MIT team, extending Dense Associative Memory (DenseAM) from classic Hopfield networks to modern AI architectures. It unifies AM with Transformer attention and diffusion models through an energy function framework, revealing deep connections, accompanied by mathematical derivations and programming exercises.

Modified K-means Algorithm with Local Optimality Guarantees

This work identifies a long-standing misconception that the classic K-means algorithm always converges to a local optimum. It proposes the LO-K-means modification, which guarantees convergence to a continuous or discrete local optimum without increasing the per-step computational complexity.

NeuronTune: Towards Self-Guided Spurious Bias Mitigation

NeuronTune proposes a group-label-free self-guided debiasing method: by comparing the difference in neuron activations between correctly and incorrectly predicted samples in the model's latent space, it identifies the dimensions affected by spurious biases and sets them to zero. It then retrains the final classification layer, significantly improving the worst-group accuracy.

Nonparametric Modern Hopfield Models

This paper proposes a nonparametric framework for modern Hopfield models, modeling the memory storage and retrieval process as a nonparametric regression problem. This formulation derives the first efficient sparse-structure modern Hopfield model with sub-quadratic complexity, backed by comprehensive theoretical analysis (retrieval error bounds, noise robustness, and exponential memory capacity).

On the Importance of Gaussianizing Representations

Based on information-theoretic motivations (the normal distribution is simultaneously the optimal signal and the worst-case noise distribution), this paper proposes the Normality Normalization layer. After conventional normalization, activation values are Gaussianized using a Power Transform, and scaled Gaussian noise is injected for regularization. This universally improves generalization and robustness across ViTs and ResNets without introducing additional learnable parameters.

Online Sparsification of Bipartite-Like Clusters in Graphs

Proposes a near-linear time online graph sparsification algorithm that compresses the number of edges to \(\widetilde{O}(n)\) while preserving the bipartite-like cluster structure of the graph. It is applicable to both undirected and directed graphs, significantly accelerating existing clustering algorithms.

OOD-Chameleon: Is Algorithm Selection for OOD Generalization Learnable?

The problem of training algorithm selection for OOD generalization is formulated as a learnable multi-label classification task. By training a selector on a "dataset of datasets," the optimal training algorithm (ERM / GroupDRO / Resampling / Logits Adjustment) can be predicted a priori using only dataset statistical features (such as shift degree and data scale). Evaluations across 7 applications in synthetic, vision, and language domains demonstrate that the selector learns transferable, non-trivial decision rules.

Optimal Auction Design in the Joint Advertising

This paper proposes an optimal auction mechanism for joint advertising scenarios (where retailers and suppliers co-bid for ad slots): for a single slot, a Myerson-style closed-form optimal solution is derived; for multiple slots, a BundleNet neural network is designed to construct IC constraints on a per-bundle basis, maximizing platform revenue while ensuring approximate incentive compatibility.

Optimal Sensor Scheduling and Selection for Continuous-Discrete Kalman Filtering with Auxiliary Dynamics

This work proposes an optimal sensor scheduling framework for continuous-discrete Kalman filtering (CD-KF). By modeling multi-sensor observations as independent Poisson processes, it derives a continuously differentiable upper bound for the posterior covariance matrix. A gradient-based optimization method is then utilized to jointly optimize observation rates and auxiliary dynamics inputs, and the deterministic observation times are selected via Wasserstein-2 optimal quantization.

Parity Requires Unified Input Dependence and Negative Eigenvalues in SSMs

This paper theoretically proves that linear SSMs (such as S4/Mamba) cannot compute the parity function—even when input-dependent parameterization is allowed—unless the state transition matrix contains negative eigenvalues, providing a precise mathematical characterization of the expressivity bottleneck of SSMs.
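
The role of negative eigenvalues can be seen in a one-dimensional toy recurrence. This is an illustration rather than the paper's construction: the state update \(h_t = a(x_t)\,h_{t-1}\) with \(a(1) = -1\) and \(a(0) = 1\) flips sign on every 1, so the sign of the final state encodes parity. The function name is hypothetical.

```python
from itertools import product

def parity_via_linear_recurrence(bits):
    """Compute parity with a scalar, input-dependent linear recurrence."""
    h = 1.0
    for x in bits:
        # Input-dependent transition; the value -1 is the negative eigenvalue.
        a = -1.0 if x == 1 else 1.0
        h = a * h
    # h is +1 after an even number of ones and -1 after an odd number.
    return 0 if h > 0 else 1

if __name__ == "__main__":
    for bits in product([0, 1], repeat=3):
        print(bits, parity_via_linear_recurrence(bits))
```

If the transition values are restricted to be non-negative, the state in this scalar recurrence can never change sign, so this particular sign-flipping trick is unavailable. That restriction is the case covered by the impossibility result summarized above.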

Permutation Equivariant Neural Networks for Symmetric Tensors

This work presents the first study on permutation equivariant neural networks with symmetric tensors as inputs. It provides two complete characterizations of all linear permutation equivariant functions between symmetric power spaces, and experimentally demonstrates that this method significantly outperforms standard MLPs in terms of data efficiency and generalization capability.

Position: AI Evaluation Should Learn from How We Test Humans

Proposes systematically introducing the adaptive testing paradigm from human psychometrics into AI evaluation, achieving efficient and reliable model capability assessment by estimating item characteristics (difficulty, discrimination, and guessing factor), reconstructing full benchmark scores with only 3% of the questions.
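
The three item characteristics named above match the standard three-parameter logistic (3PL) model from item response theory. The sketch below assumes that model and a maximum-information selection rule, both common in adaptive testing but not confirmed as the paper's exact procedure. All function names are hypothetical.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability that a test taker with ability theta answers correctly.

    a: discrimination, b: difficulty, c: guessing factor.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def pick_next_item(theta, items, asked):
    """Choose the unasked item that is most informative at the current estimate."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))

if __name__ == "__main__":
    # Each item is (a, b, c); the values are made up for illustration.
    items = [(1.0, -2.0, 0.2), (1.0, 0.0, 0.2), (1.0, 2.0, 0.2)]
    print(pick_next_item(0.0, items, asked=set()))
```

Selecting items near the current ability estimate is what lets an adaptive test reach a reliable score with far fewer questions than a fixed benchmark.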

Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena

Proposes the Dynamical Feedback Principle and demonstrates that layerwise linear models are sufficient to provide a unified explanation for four major deep learning dynamical phenomena: neural collapse, emergence, lazy/rich regimes, and grokking. The authors advocate prioritizing the study of layerwise structures over non-linear activations.

Practical Principles for AI Cost and Compute Accounting

To address the ambiguity of accounting standards for compute/cost thresholds in AI regulation, this paper proposes seven principles to close evasion loopholes (such as the distillation loophole), avoid disincentivizing safety measures, and achieve consistent implementation across firms, providing a theoretical framework for operationalizing regulations like the EU AI Act.

Prediction-Powered Adaptive Shrinkage Estimation

By combining Prediction-Powered Inference (PPI) with Empirical Bayes shrinkage, this work proposes the PAS two-stage estimation method. It first utilizes ML predictions for within-problem variance reduction, and then performs across-problem adaptive shrinkage with the ML predictions as shrinkage targets. The shrinkage parameters are automatically tuned via the Correlation-Unbiased Risk Estimator (CURE), with theoretical guarantees of asymptotic optimality.

Prediction via Shapley Value Regression (ViaSHAP)

Proposes ViaSHAP, which integrates Shapley value computation into the model training process, so that at inference time a prediction is obtained directly by summing the Shapley values. It requires no post-hoc explainer, achieves XGBoost-level predictive accuracy on tabular data, and yields Shapley value approximation quality significantly superior to FastSHAP.

Probably Approximately Global Robustness Certification

A probably approximately global (PAG) robustness certification framework based on ε-net sampling is proposed. The required sample complexity is independent of the input dimension, number of classes, and model architecture, enabling the efficient certification of global robustness for large-scale neural networks.

IBDR: Promoting Ensemble Diversity with Interactive Bayesian Distributional Robustness

This paper proposes the IBDR Bayesian inference framework. By introducing an interactive loss and Wasserstein distributionally robust optimization over the product distribution space, the framework constructs a particle ensemble that balances diversity and low sharpness. Using ViT-B/16, it achieves a 73.6% average accuracy on VTAB-1K, outperforming all baselines.

Provably Cost-Sensitive Adversarial Defense via Randomized Smoothing

A "cost-sensitive certified radius" is proposed based on the randomized smoothing framework, achieving the first scalable cost-sensitive adversarial robustness certification and training for large models and high-dimensional data. This significantly improves robustness against high-cost misclassifications while maintaining overall accuracy.
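
For background, the sketch below computes the standard (cost-agnostic) \(\ell_2\) certified radius from the randomized smoothing literature, \(R = \tfrac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right)\). It is not the cost-sensitive radius proposed in the paper, whose exact form is not given in this summary. The function and parameter names are hypothetical.

```python
from statistics import NormalDist

def certified_l2_radius(p_top, p_runner_up, sigma):
    """Standard randomized-smoothing L2 radius.

    p_top: lower bound on the smoothed probability of the predicted class.
    p_runner_up: upper bound on the probability of the strongest other class.
    sigma: standard deviation of the Gaussian smoothing noise.
    """
    inv_cdf = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return 0.5 * sigma * (inv_cdf(p_top) - inv_cdf(p_runner_up))

if __name__ == "__main__":
    print(round(certified_l2_radius(0.9, 0.1, sigma=0.5), 4))
```

One natural way to make such a radius cost-sensitive is to restrict the competing classes to those whose confusion is costly. Whether the paper does exactly this is an assumption here.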

Provably Improving Generalization of Few-Shot Models with Synthetic Data

This paper proposes a theoretical framework to quantify the impact of the distribution gap between synthetic and real data on the generalization capability of few-shot classification. Based on this theory, it designs an algorithm that jointly optimizes data partitioning and model training, surpassing SOTA on 10 benchmark datasets.

Randomized Dimensionality Reduction for Euclidean Maximization and Diversity Measures

It is proved that for a broad class of Euclidean maximization problems (such as maximum matching, Max-TSP, maximum spanning tree, and subgraph diversity), reducing the dimension to \(O(\lambda)\) (where \(\lambda\) is the doubling dimension of the dataset) using a data-independent Gaussian JL transform preserves the value of all candidate solutions. This dependency is also shown to be tight.
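
A data-independent Gaussian JL transform is short to write down. The sketch below is a generic version with an arbitrary target dimension chosen for illustration. The paper's contribution is the bound on how small that dimension can be, which this code does not compute.

```python
import numpy as np

def gaussian_jl(points, target_dim, seed=0):
    """Project rows of `points` with a random Gaussian matrix.

    Entries are N(0, 1 / target_dim), so squared lengths are preserved in
    expectation. The matrix does not depend on the data.
    """
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    G = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)
    return points @ G

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 1000))
    Y = gaussian_jl(X, target_dim=200)
    before = np.linalg.norm(X[0] - X[1])
    after = np.linalg.norm(Y[0] - Y[1])
    print(f"distance ratio after projection: {after / before:.3f}")
```

Because the projection is linear, the cost of any candidate solution built from pairwise distances (a matching, a tour, a spanning tree) can be evaluated in the reduced space directly.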

Ranked Entropy Minimization for Continual Test-Time Adaptation

Proposes Ranked Entropy Minimization (REM), which constructs an explicit ranking structure of prediction difficulty through a progressive masking strategy. Combining masked consistency loss and entropy ranking loss, it solves the model collapse issue of entropy minimization methods in Continual Test-Time Adaptation (CTTA) while maintaining computational efficiency.

Regression for the Mean: Auto-Evaluation and Inference with Few Labels through Post-hoc Regression

Reinterprets the tuning of \(\lambda\) in PPI++ as a post-hoc regression and proposes two improved methods, Ridge-PPI and Sigmoid-PPI. These methods significantly reduce the variance of mean estimation in few-label scenarios (\(n < 50\)), outperforming both classical estimation and PPI++.
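
For orientation, the baseline PPI++ mean estimator that the paper builds on can be sketched as follows, assuming the usual form \(\hat{\theta}_\lambda = \bar{Y} + \lambda\,(\bar{f}_{\text{unlabeled}} - \bar{f}_{\text{labeled}})\) with plug-in \(\hat{\lambda} = \widehat{\mathrm{Cov}}(Y, f) / \big((1 + n/N)\,\widehat{\mathrm{Var}}(f)\big)\). Ridge-PPI and Sigmoid-PPI are not reproduced here. The function names are hypothetical.

```python
from statistics import fmean

def ppi_mean(y_lab, f_lab, f_unlab, lam):
    """Prediction-powered mean estimate with power-tuning parameter lam.

    lam = 0 recovers the classical labeled-sample mean; lam = 1 is plain PPI.
    """
    return fmean(y_lab) + lam * (fmean(f_unlab) - fmean(f_lab))

def plug_in_lambda(y_lab, f_lab, f_unlab):
    """Variance-minimizing lam, estimated from the samples."""
    n, big_n = len(y_lab), len(f_unlab)
    my, mf = fmean(y_lab), fmean(f_lab)
    cov = sum((y - my) * (f - mf) for y, f in zip(y_lab, f_lab)) / (n - 1)
    all_f = list(f_lab) + list(f_unlab)
    m_all = fmean(all_f)
    var = sum((f - m_all) ** 2 for f in all_f) / (len(all_f) - 1)
    return cov / ((1.0 + n / big_n) * var)

if __name__ == "__main__":
    y_lab, f_lab = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
    f_unlab = [3.0, 3.0, 3.0, 3.0]
    lam = plug_in_lambda(y_lab, f_lab, f_unlab)
    print(lam, ppi_mean(y_lab, f_lab, f_unlab, lam))
```

The plug-in \(\hat{\lambda}\) has the form of a regression slope of the labels on the predictions. This fits the post-hoc regression reading described above and suggests why it becomes unstable when only a few labels are available.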

Residual Matrix Transformers: Scaling the Size of the Residual Stream

Replaces the residual stream vector of the Transformer with an outer-product memory matrix, allowing the size of the residual stream to be scaled independently of the model parameter count and FLOPS, saving 58% FLOPS, 25% parameters, and 41% training tokens for the same loss.

Rethinking Aleatoric and Epistemic Uncertainty

This paper points out that the aleatoric/epistemic uncertainty dichotomy in machine learning suffers from fundamental conceptual confusion. It proposes a decision-theoretic alternative framework that unifies predictive uncertainty, reducible/irreducible decomposition, predictive performance, and data dispersion within a coherent theoretical system, and reveals the limitations of BALD as an epistemic uncertainty estimator.

Revisiting Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model

For the Labeled Stochastic Block Model (LSBM), this work proposes the IAC (Instance-Adaptive Clustering) algorithm. Using a two-stage strategy of a single spectral clustering step followed by iterative likelihood refinement, it is the first to achieve community recovery matching the instance-specific information-theoretic lower bound with \(\mathcal{O}(n(\log n)^3)\) complexity, while providing guarantees both in expectation and with high probability.

Revisiting the Predictability of Performative, Social Events

This paper leverages modern learning theory tools (performative prediction and outcome indistinguishability) to answer a classic 20th-century question in social science: can social events still be accurately predicted when predictions actively influence outcomes? The answer is affirmative, yet such "accurate" predictions can be entirely useless.

Runtime Analysis of Evolutionary NAS for Multiclass Classification

This work presents the first theoretical runtime analysis of evolutionary neural architecture search (ENAS) on multiclass classification. It proves that (1+1)-ENAS with either one-bit or bit-wise mutation finds the optimal architecture in an expected runtime of \(O(rM\ln rM)\), showing that simple one-bit mutation can perform on par with the more complex bit-wise mutation.
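
For readers unfamiliar with the (1+1) scheme, here is a minimal sketch of a (1+1) evolutionary algorithm with one-bit mutation. The bit-string encoding and the OneMax fitness (count of ones) are stand-ins chosen for illustration; they are not the paper's architecture encoding or its classification fitness.

```python
import random


def one_plus_one_ea(fitness, n_bits, target, max_iters=10_000, seed=0):
    """(1+1) EA with one-bit mutation; returns (best individual, mutation steps used)."""
    rng = random.Random(seed)
    parent = [rng.randint(0, 1) for _ in range(n_bits)]
    for step in range(max_iters):
        if fitness(parent) >= target:
            return parent, step
        child = parent[:]
        child[rng.randrange(n_bits)] ^= 1  # flip exactly one uniformly chosen bit
        if fitness(child) >= fitness(parent):  # keep the child if it is no worse
            parent = child
    return parent, max_iters


# Toy run (assumed): maximize the number of ones in a 20-bit string.
best, steps = one_plus_one_ea(sum, n_bits=20, target=20)
print(best, steps)
```

Bit-wise mutation would instead flip each bit independently with a small probability, allowing several simultaneous changes per step.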

Sampling from Binary Quadratic Distributions via Stochastic Localization

This work is the first to apply the Stochastic Localization (SL) framework to sampling from general Binary Quadratic Distributions (BQDs). It proves that after a sufficient number of SL iterations, the posterior distribution almost surely satisfies the Poincaré inequality, thereby guaranteeing polynomial-time mixing for discrete MCMC samplers. Consistent improvements in sampling efficiency are verified on QUBO combinatorial optimization problems.

Score Matching with Missing Data

This paper adapts score matching and its major extensions to missing-data scenarios. It proposes two variants, the Importance Weighting (IW) method and the variational method, and demonstrates their respective advantages in settings such as graphical model estimation.

Set-Valued Predictions for Robust Domain Generalization

A set-valued predictor is proposed to address the robustness issue in domain generalization (DG): it outputs a subset of labels rather than a single label, satisfying predefined coverage requirements on as many unseen domains as possible while minimizing the prediction set size.
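
The trade-off between coverage and set size can be illustrated with a generic thresholding rule. The rule, the label names, and the probabilities below are assumptions for this sketch and are not the paper's predictor, which additionally targets coverage across many unseen domains.

```python
def prediction_set(probs, threshold):
    """Return every label whose predicted probability reaches the threshold."""
    chosen = {label for label, p in probs.items() if p >= threshold}
    # Never return an empty set: fall back to the single most likely label.
    return chosen or {max(probs, key=probs.get)}


def coverage_and_size(examples, threshold):
    """Fraction of examples whose true label is in the set, and the mean set size."""
    sets = [prediction_set(probs, threshold) for probs, _ in examples]
    covered = sum(true in s for s, (_, true) in zip(sets, examples))
    return covered / len(examples), sum(len(s) for s in sets) / len(examples)


# Toy examples (assumed) from one hypothetical domain: (class probabilities, true label).
examples = [
    ({"cat": 0.7, "dog": 0.2, "fox": 0.1}, "cat"),
    ({"cat": 0.4, "dog": 0.35, "fox": 0.25}, "dog"),
    ({"cat": 0.1, "dog": 0.3, "fox": 0.6}, "fox"),
    ({"cat": 0.45, "dog": 0.1, "fox": 0.45}, "fox"),
]

for t in (0.5, 0.3):
    print(t, coverage_and_size(examples, t))
```

Lowering the threshold raises coverage at the cost of larger sets, which is the tension the set-valued formulation has to balance.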

Softmax is not Enough (for Sharp Size Generalisation)

This work theoretically proves that softmax attention coefficients inevitably disperse as the input size grows, so attention cannot stay sharply focused on a small number of key elements, and proposes adaptive temperature as a mitigation.
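
The dispersion effect is easy to see numerically. In the sketch below, one item has a fixed logit advantage over all others, and its softmax weight is tracked as the number of items grows. The logit values and the fixed lower temperature are assumptions for illustration; the paper's adaptive temperature rule is not reproduced here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def max_weight(n_items, temperature=1.0):
    """Largest coefficient when one item has logit 1 and all others have logit 0."""
    logits = [1.0] + [0.0] * (n_items - 1)
    return max(softmax(logits, temperature))


for n in (10, 100, 1000):
    print(n, round(max_weight(n), 4), round(max_weight(n, temperature=0.1), 4))
```

With bounded logits, the top weight at temperature 1 shrinks toward zero as `n` grows, while a lower temperature keeps it sharp for longer; choosing that temperature per input is the idea behind an adaptive scheme.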

Sparse Training from Random Initialization: Aligning Lottery Ticket Masks using Weight Symmetry

This work explains why Lottery Ticket Hypothesis (LTH) masks cannot transfer to new initializations from the perspective of weight symmetry, and proposes to achieve sparse training by aligning LTH masks with the optimization basins of new initializations via permutation matching.
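
The weight-symmetry argument can be checked on a tiny network: permuting hidden units leaves the function unchanged, so a sparse mask only stays valid if it is permuted the same way. In this sketch the weights, masks, and permutation are hand-picked assumptions; the paper's procedure for finding the permutation by matching is not shown.

```python
def forward(w1, w2, x):
    """Two-layer ReLU network: w2 @ relu(w1 @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def permute_hidden(w1, w2, perm):
    """Reorder hidden units: rows of w1 and the matching columns of w2."""
    return [w1[p] for p in perm], [[row[p] for p in perm] for row in w2]


def apply_mask(w, mask):
    """Zero out the weights where the mask is 0."""
    return [[wi * mi for wi, mi in zip(wr, mr)] for wr, mr in zip(w, mask)]


# Toy network (assumed): 2 inputs, 3 hidden units, 1 output, plus a sparse mask.
w1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 3.0]]
w2 = [[2.0, -1.0, 0.5]]
m1 = [[1, 0], [1, 1], [0, 1]]
m2 = [[1, 1, 0]]
x = [1.0, 2.0]
perm = [2, 0, 1]

original = forward(apply_mask(w1, m1), apply_mask(w2, m2), x)

# The same network with its hidden units relabeled.
pw1, pw2 = permute_hidden(w1, w2, perm)
pm1, pm2 = permute_hidden(m1, m2, perm)
aligned = forward(apply_mask(pw1, pm1), apply_mask(pw2, pm2), x)
misaligned = forward(apply_mask(pw1, m1), apply_mask(pw2, m2), x)

print(original, aligned, misaligned)
```

The aligned mask reproduces the original sparse network, while the unpermuted mask prunes the wrong units, which mirrors why a mask found for one initialization does not transfer directly to another.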

Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Settings

This paper proposes the Suitability Filter framework, which leverages "suitability signals" from model outputs to detect classifier performance degradation on unlabeled user data, determining whether the accuracy has dropped significantly compared to the test set via statistical hypothesis testing.

Symmetry-Aware GFlowNets

Uncovers systematic sampling bias in GFlowNets for graph generation caused by equivalent actions (different actions yielding isomorphic graphs), where node-by-node generation favors low-symmetry graphs and fragment-based generation favors highly symmetric components. This work proposes SA-GFN, a simple correction method that scales rewards by the size of the final state's automorphism group, achieving unbiased sampling with only a single automorphism group computation.
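
To make the correction concrete, the sketch below counts automorphisms of a small undirected graph by brute force and scales a reward by that count. The function names are hypothetical, the brute-force search is only practical for very small graphs, and the multiplicative direction is an assumption for the node-by-node setting, inferred from the statement that it under-samples symmetric graphs.

```python
from itertools import permutations


def automorphism_count(n_nodes, edges):
    """Count node relabelings that map the undirected edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    count = 0
    for perm in permutations(range(n_nodes)):
        mapped = {frozenset((perm[u], perm[v])) for u, v in edges}
        if mapped == edge_set:
            count += 1
    return count


def corrected_reward(reward, n_nodes, edges):
    """Assumed node-by-node correction: multiply the reward by |Aut(G)|."""
    return reward * automorphism_count(n_nodes, edges)


triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]
print(automorphism_count(3, triangle), automorphism_count(3, path))
print(corrected_reward(1.0, 3, triangle), corrected_reward(1.0, 3, path))
```

Because only the terminal graph's automorphism group is needed, the correction costs a single group-size computation per sampled object, in line with the summary above.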

SynDaCaTE: A Synthetic Dataset for Evaluating Part-Whole Hierarchical Inference

This paper proposes the SynDaCaTE synthetic dataset and the Mereological Inference framework, decomposing part-whole hierarchical inference into two independently evaluable sub-tasks: Image-to-Parts and Parts-to-Wholes. Through carefully designed control experiments, it demonstrates that the bottleneck of CapsNets lies in extracting parts from images rather than inferring wholes from parts. Additionally, the permutation-equivariant SetTransformer is found to significantly outperform all baselines in part-to-whole inference (with over a 10x precision advantage).

Synthesizing Images on Perceptual Boundaries of ANNs for Uncovering and Manipulating Human Perceptual Variability

This paper proposes the BAM (Boundary Alignment & Manipulation) framework, which systematically uncovers, predicts, and manipulates perceptual variability among human individuals by sampling and generating image stimuli on the perceptual decision boundaries of ANNs.

TANGO: Clustering with Typicality-Aware Nonlocal Mode-Seeking and Graph-Cut Optimization

Proposes the concept of "typicality" to quantify the confidence of data points serving as modes (cluster centers) from a global perspective. Combined with an improved path similarity and graph-cut optimization, it achieves automatic mode detection and clustering without manual threshold tuning.

The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Networks

This paper systematically analyzes the tradeoffs between expressivity and runtime for various tensor product operations in \(E(3)\)-equivariant neural networks, reveals a significant gap between theoretical complexity and empirical performance, and proposes a simplified Gaunt tensor product implementation based on spherical grids, achieving a 30% speedup in MACE interatomic potential training.

Time-Aware World Model for Adaptive Prediction and Control

Proposes the Time-Aware World Model (TAWM), which conditions the model on the time step \(\Delta t\) as an explicit input and mixes samples with various \(\Delta t\) values during training, enabling the model to adapt to arbitrary time resolutions at inference time via single-step prediction without increasing sample complexity.

To Each Metric Its Decoding: Post-Hoc Optimal Decision Rules of Probabilistic Hierarchical Classifiers

This paper proposes a post-hoc optimal decoding framework for probabilistic hierarchical classifiers. It derives optimal decision rules for various evaluation metrics (such as the hierarchical \(F_\beta\) score, \(hF_\beta\)) and provides general algorithms for the case where candidate predictions are restricted to the node set. Furthermore, it derives a dedicated optimal strategy for \(hF_\beta\) in subset prediction.

Truly Self-Improving Agents Require Intrinsic Metacognitive Learning

This paper proposes a formal framework demonstrating that truly self-improving agents require intrinsic metacognitive learning capabilities (rather than extrinsic, human-designed fixed loops). The framework comprises three components: metacognitive knowledge, metacognitive planning, and metacognitive evaluation. It also analyzes the limitations of existing self-improving agents and outlines paths toward achieving intrinsic metacognition.

UnHiPPO: Uncertainty-Aware Initialization for State Space Models

This work extends the HiPPO framework to handle noisy measurements. By reformulating the initialization of State Space Models (SSMs) as a linear stochastic control/estimation problem, the authors derive an uncertainty-aware initialization scheme for SSM dynamics, which significantly enhances noise robustness without increasing runtime overhead.