Skip to content

📐 Learning Theory

🧪 ICML2026 · 17 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (4) · 🤖 AAAI2026 (1) · 🧠 NeurIPS2025 (20)

🔥 Top topics: Adversarial Robustness ×2

A Perturbation Approach to Unconstrained Linear Bandits

This paper revisits the perturbation-based bandit linear optimization approach of Abernethy et al., proposing the PABLO reduction to transform unconstrained linear bandits into a problem that can call any OLO subroutine. This yields comparator-adaptive static/dynamic regret, high-probability bounds, and discussions on several lower bounds.

Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding

This paper proposes the Conditional Kernel Ridge Regression (Conditional KRR) framework, which injects a set of unpenalized features into kernel methods. By reducing it to standard KRR via a residual kernel, the authors prove that the reduction cost is \(\mathcal{O}(1/\sqrt{N})\). Furthermore, the study identifies sufficient conditions for Conditional KRR to outperform standard KRR under both hard thresholding (top-k eigenfunctions) and soft thresholding (random Gaussian features) settings.

CORE-MTL: Rethinking Gradient Balancing via Causal Orthogonal Representations

The authors reattribute the root cause of "negative transfer" in multi-task learning from "gradient conflict" to the "entanglement of semantics and noise within shared representations." They propose CORE-MTL: a dual-stream encoder splits representations into semantic \(\hat{Z}_s\) and residual \(\hat{Z}_r\), implementing "causal orthogonality" via CKA independence constraints + counterfactual style substitution + inverse rendering reconstruction. Theoretically, it provides a tighter OOD upper bound than gradient balancing. Experimentally, it outperforms ten baselines including PCGrad, GradNorm, STCH, and FairGrad across ID settings (NYUv2/Cityscapes) and OOD settings (GTA5→Cityscapes, Cityscapes-C).

Correcting Split Selection in Online Decision Trees via Anytime-Valid Inference

The authors point out that the "fixed sample size" concentration inequality used by the classic Hoeffding Tree (HT) for splitting on data streams is invalidated by its own "data-dependent stopping rule." They reformulate the split criterion using testing-by-betting + Universal Portfolio, allowing both single trees and Adaptive Random Forests to maintain controlled Type-I errors at any stopping time while achieving higher accuracy and smaller tree sizes across 12 real-world streams.

Estimating Correlation Clustering Cost in Node-Arrival Stream

This paper investigates the approximate estimation of correlation clustering costs under the "node-arrival" streaming model. The authors propose the C4Approx algorithm, which achieves a \((O(1), n^{1-\alpha})\)-approximation using sublinear space of \(O(n^{(3+\alpha)/4}\log n)\) words and a constant number of passes. They also provide two matching lower bounds to prove that both multiple passes and additive errors are inevitable. On real-world data, the algorithm achieves performance comparable to the Pivot algorithm while storing only 2% of the nodes.

Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift

ECL proves that full alignment of input distributions \(P_s(X) = P_t(X)\) is not a necessary condition for calibration under covariate shift. Instead, it is sufficient that the "conditional expectation of \(P(Y_k=1|X)\) on each confidence level set is consistent across domains." Based on this, it constructs ECL: a differentiable loss with unbiased mini-batch gradients that generalizes across canonical, class-wise, and top-label calibration.

Matroid Algorithms Under Size-Sensitive Independence Oracles

The authors propose a "size-sensitive matroid oracle" model where the query cost grows linearly with the size of the query set. They prove that under this model, the optimal query cost for finding a basis, estimating rank, and estimating the partition number is \(\tilde{\Theta}(n^2)\). Furthermore, for matroids with a bounded girth \(c\), they present a maximum weight basis algorithm with a complexity of \(\mathcal{O}(n^{2-1/c}\log n)\), breaking the quadratic lower bound.

MMD-Balls as Credal Sets: A PAC-Bayesian Framework for Epistemic Uncertainty in Test-Time Adaptation

The paper provides the first PAC-Bayes upper bound for test-time adaptation formulated as "Target Risk \(\le\) Source Empirical Risk + KL Complexity + MMD Distribution Shift Term." It interprets MMD-balls as Walley-style credal sets, naturally separating aleatoric and epistemic uncertainty through "upper and lower risk intervals," and provides a computable criterion for "when to adapt and when to abstain."

Multi-task Linear Regression without Eigenvalue Lower Bounds: Adaptivity, Robustness and Safety

This paper proposes a robust multi-task linear regression estimator using \(\|\theta_j-\beta\|_{\bm\Sigma_j}\) (matrix-weighted norm) as a regularization term. It replaces the rigid "minimum eigenvalue of the second moment \(\Omega(1)\) for every task" assumption in prior work with a relative "balance constant" \(B\). This provides minimax, adaptive, and fallback safety guarantees to Independent Task Learning (ITL) in high-dimensional scenarios with ill-conditioned, low-rank, or outlier tasks.

On the Learnability of Test-Time Adaptation: A Recovery Complexity Perspective

This paper establishes the first theoretical framework for Test-Time Adaptation (TTA) learnability. It introduces \((\epsilon,\delta)\)-Recovery Complexity to measure the time required for a model to reduce excess risk to \(\epsilon\) after a distribution shift. By extending local recovery to non-stationary test streams via \((\epsilon,\rho)\)-TTA Learnability, the authors derive matching minimax upper/lower bounds, revealing the intrinsic "adaptation speed—information constraint" trade-off in TTA.

Optimal Design for Multinomial Logit Model with Applications to Best Assortment Identification

This paper provides the first computationally feasible G-optimal experimental design for the combinatorial action space of Multinomial Logit (MNL) bandits. By formulating the Frank–Wolfe Linear Maximization Oracle (LMO) as a 0–1 MILP or a polynomial-time Schur complement relaxation, the authors construct the first Best Assortment Identification algorithm for "linear utility + non-uniform rewards" with a sample complexity of \(\tilde{\mathcal{O}}(d\log N / \Delta^2)\).

Parsimonious Learning-Augmented Online Metric Matching

This paper resolves an open question posed by Im et al. (2022) by extending "action-based prediction" for Online Metric Matching (OMM) into a "parsimonious prediction" framework—where predictions are issued expensively every \(k\) steps. Using a Follow-the-Prediction (FtP) framework combined with a meta-algorithm that automatically generates "virtual predictions," the authors provide deterministic and randomized competitive ratio upper bounds that essentially match known lower bounds.

Provably Data-driven Multiple Hyper-parameter Tuning with Structured Loss Function

This paper uses "real algebraic geometry + first-order logic quantifier elimination" to provide the first provable generalization bound for multi-dimensional hyperparameter tuning. It generalizes the Balcan 2025 framework, which was previously limited to one-dimensional scalar hyperparameters, to various practical scenarios including arbitrary \(p\) dimensions, bilevel validation loss, and approximate inner optimization, while establishing the first matching lower bound.

Realizable Bayes-Consistency for General Metric Losses

This paper provides a sharp characterization for the open problem of when a hypothesis class \(\mathcal{H}\) admits a distribution-free strong universal Bayes-consistent learning algorithm under general (possibly unbounded) metric losses in the realizable setting. The necessary and sufficient condition is that \(\mathcal{H}\) does not contain a new combinatorial obstacle termed the "unbounded gap Littlestone tree."

Semi-Supervised Noise Adaptation: Transferring Knowledge from Noise Domain

The authors treat a "synthetic domain generated from Gaussian noise" as an alternative source domain in semi-supervised transfer learning. They first prove that such noise, which lacks semantics but possesses a "discriminative structure," provides quantifiable improvements to generalization bounds for the target domain. They then use the Noise Adaptation Framework (NAF) with three losses to jointly optimize domain risks and distribution discrepancy, achieving a 12.35% improvement for 4-shot ResNet-18 on CIFAR-10 over ERM.

Simple Algorithms for Bad Triangle Transversals with Applications to Correlation Clustering

This paper presents two simple 2-approximation algorithms for the "Bad Triangle Transversal" (BTT) problem on signed graphs that require only a single LP solve. It proves a unified NP-hardness lower bound of \(\tfrac{2137}{2136}\) for BTT, Correlation Clustering (CC), MinSTC, and Cluster Deletion on complete graphs. Additionally, it introduces a new pivot process that converts any feasible BTT cover into a clustering with at most \(\tfrac{3}{2}|F|\) errors, tightening the gap between BTT and CC optimal values from 2 to \(3/2\).

Towards Optimal Robustness in Learning-Augmented Paging

This paper proposes a unified "Relative Prediction Budget" (RPB) perspective for randomized online paging with predictions. Based on OnlineMin, the RPB-OnOPT framework is designed, pushing the provable robust competitive ratio from the existing \(2H_k+O(1)\) to \(H_k+O(1)\), which is near the information-theoretic lower bound, while maintaining 1-consistency.