Skip to content

📐 Optimization & Theory

🧪 ICML2026 · 15 paper notes

📌 Same area in other venues: 📷 CVPR2026 (8) · 🔬 ICLR2026 (44) · 🤖 AAAI2026 (22) · 🧠 NeurIPS2025 (110) · 📹 ICCV2025 (7)

Adaptive Estimation and Inference in Semi-parametric Heterogeneous Clustered Multitask Learning via Neyman Orthogonality

This work bridges double machine learning and clustered multitask learning, proposing an adaptive framework that combines Neyman orthogonality with data-driven pairwise fusion penalties. In a semiparametric setting with heterogeneous (possibly infinite-dimensional) noise, it accurately recovers latent task clusters, achieves oracle rates at the aggregation level, and establishes asymptotic normality for valid statistical inference.

Budget-Feasible Mechanisms for Submodular Welfare Maximization in Procurement Auctions

This work is the first to provide a truthful mechanism with approximation guarantees for submodular social welfare maximization in procurement auctions with "budget constraint + private costs": BFM-SWM—a descending clock auction with geometrically increasing thresholds, single-point protection, and a price/payment rate parameter \(\beta\)—achieves non-negative surplus and budget feasibility, with a 0.0328-approximation for general submodular functions and 0.0877-approximation for monotone submodular functions. As a byproduct, BFM-VM improves the deterministic best approximation for value maximization from 1/64 to \(1/(12+4\sqrt{3})\approx 0.0528\), and reduces runtime from \(\mathcal{O}(n^2\log n)\) to \(\mathcal{O}(n\log n)\).

FAB: A First-Order AB-based Gradient Algorithm for Distributed Bilevel Optimization over Time-Varying Directed Graphs

This paper proposes FAB—the first purely first-order algorithm for distributed bilevel optimization over time-varying directed graphs—combining AB/Push-Pull communication with value function penalization, providing a non-asymptotic \(\mathcal{O}(K^{-2/3})\) convergence rate, and incidentally resolving the long-standing open problem of AB/Push-Pull convergence rates in nonconvex, time-varying directed graph settings.

Learning-Augmented Scalable Linear Assignment Problem Optimization via Neural Dual Warm-Starts

A lightweight network is trained to predict dual variables \(\hat{u}\) for the linear assignment problem (LAP). Using the Min-Trick, feasible duals \(\hat{v}\) are constructed and used as a warm start for the LAPJV exact solver, achieving over \(2\times\) end-to-end speedup on \(N=16{,}384\) instances while maintaining optimality.

Learning Dynamics of Zeroth-Order Optimization: A Kernel Perspective

This paper uses empirical NTK as a unified perspective to prove that the eNTK induced by zeroth-order SGD is equivalent to projecting the first-order eNTK onto a random subspace spanned by perturbations. This, via the Johnson-Lindenstrauss lemma, explains why ZO methods still work on billion-parameter LLMs: the error depends only on output dimension \(V\) and perturbation count \(P\), and is independent of model dimension \(d\).

Learning to Approximate Uniform Facility Location via Graph Neural Networks

This work designs an MPNN that "neuralizes" the classic approximation algorithm SimpleUniformFL for Uniform Facility Location, enabling end-to-end training with unsupervised expected cost loss, while also providing a provable \(\mathcal{O}(\log n)\) approximation guarantee (and \(\mathcal{O}(1)\) for the recursive version). Experiments show it outperforms the classic SimpleUniformFL algorithm and approaches ILP optimality.

On the Convergence Rate of LoRA Gradient Descent

This work is the first to prove that original LoRA gradient descent achieves a minimum gradient norm convergence rate of \(O(1/\log T)\) without assuming bounded adapter matrices or requiring the reparameterized loss to be Lipschitz smooth (recovers the classical \(O(1/T)\) rate if parameter norms are bounded). Based on this, strictly theoretically-motivated adaptive/normalized learning rates are designed and empirically validated for accelerated and stabilized training on logistic regression, ResNet-18, and TinyLlama.

Probing Neural TSP Representations for Prescriptive Decision Support

The authors treat a trained TSP neural solver as a "transferable encoder," using frozen representations and lightweight probes to predict two types of expensive operations research sensitivity queries (node removal and edge forbidding). They systematically demonstrate that probe accuracy improves monotonically with solver quality and that ensembling with traditional heuristics achieves SOTA.

RL-SPH: Learning to Achieve Feasible Solutions for Integer Linear Programs

This paper proposes RL-SPH—an end-to-end reinforcement learning heuristic that does not rely on external ILP solvers and can independently produce 100% feasible solutions. By leveraging "feasibility reward + two-phase policy + feasibility-aware neighborhood search," the Graph Transformer Agent reduces the average primal gap by 28.6 times on ILPs with non-binary integer variables.

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

Based on the "row block diagonal dominance" structure of the Transformer layer Hessian, this work replaces the expensive Newton-Schulz orthogonalization in the Muon optimizer with a single row-wise \(\ell_2\) normalization, reducing per-step preconditioning complexity from \(\mathcal{O}(mn\min(m,n))\) to \(\mathcal{O}(mn)\). On GPT-2 / LLaMA pretraining, this yields a 13–44× wall-clock speedup, with perplexity not only maintained but slightly improved.

Streaming Sliced Optimal Transport

Stream-SW is the first algorithm capable of estimating the sliced Wasserstein distance on "sample streams": for each 1D projection, it maintains an approximate quantile function using KLL/quantile sketch, turning the closed-form integral of 1D Wasserstein into a streamable estimator with logarithmic space complexity in the number of samples, thus enabling SOT in IoT/edge device scenarios where samples are "seen once and discarded".

Support-Proximity Augmented Diffusion Estimation for Offline Black-Box Optimization

SPADE replaces the traditional regression surrogate with a conditional diffusion model to model \(p(y\mid\boldsymbol{x})\), and implicitly injects data priors into the surrogate via "mean/rank calibration" and "kNN support regularization (mean shrinkage + variance inflation)", enabling offline black-box optimization to stably achieve SOTA on Design-Bench and LLM data mixture tasks.

Test time training enhances in-context learning of nonlinear functions

This work establishes the first rigorous generalization bound for the combination of single-layer softmax-attention transformer and LoRA test-time fine-tuning, proving that on single-index polynomial tasks, TTT reduces the sample complexity of ICL from \(r^{\Theta(\mathrm{ie}(\sigma_*))}\) to \(r^{\Theta(\mathrm{ge}(\sigma_*))}\), allows the link function to vary per task, and enables inference error to approach the noise level as context length \(\to\) increases.

Transformed Latent Variable Multi-Output Gaussian Processes

This paper proposes T-LVMOGP: it transforms the core modeling challenge of multi-output GPs—constructing cross-output covariance \(k_{p,p'}(x, x')\)—into "computing inner products with a single scalar base kernel in a Lipschitz-regularized RCNN embedding space," and fully embeds this into the SVGP framework. For the first time, MOGP can scalably and expressively handle \(P > 10000\) outputs (including ZINB-likelihood spatial transcriptomics data), comprehensively outperforming baselines such as SV-LMC, OILMM, and GS-LVMOGP.

Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

This paper stores "stale" block cyclic coordinate descent gradient estimates in a FIFO buffer, reusing them with momentum decay, and proves this is equivalent to BCCD with warm-start. It also presents a counterintuitive result: a larger finite difference step size \(\epsilon\) implicitly smooths the loss landscape and reduces the effective Lipschitz constant, making stale gradients yield more stable descent.