📐 Optimization & Theory
🔬 ICLR2026 · 45 paper notes
- A Convergence Analysis of Adaptive Optimizers under Floating-Point Quantization
This paper establishes the first theoretical framework for analyzing the convergence of adaptive optimizers under floating-point quantization. By imposing a relative error quantization model simultaneously on gradients, weights, and optimizer states (first and second moments), it proves that quantized Adam and Muon achieve the same \(\tilde{O}(T^{-1/4})\) convergence rate as their full-precision counterparts when the mantissa length grows only logarithmically in the number of iterations. The analysis further reveals that Adam is highly sensitive to the quantization of weights and second moments, whereas Muon is theoretically more robust.
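To make the error model concrete, here is a minimal sketch of one Adam step in which the gradient, both moment estimates, and the updated weights each pass through a relative-error quantizer \(Q(x)=x(1+\epsilon)\) with \(|\epsilon|\le 2^{-m}\); the uniform choice of \(\epsilon\) and the function names are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def quantize_rel(x, mantissa_bits, rng):
    """Relative-error model: Q(x) = x * (1 + eps) with |eps| <= 2**-mantissa_bits."""
    eps = rng.uniform(-1.0, 1.0, size=x.shape) * 2.0 ** (-mantissa_bits)
    return x * (1.0 + eps)

def quantized_adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
                        mantissa_bits=10, rng=None):
    """One Adam step with the gradient, both moments, and the weights quantized."""
    rng = np.random.default_rng(0) if rng is None else rng
    g = quantize_rel(g, mantissa_bits, rng)                              # gradient
    m = quantize_rel(b1 * m + (1 - b1) * g, mantissa_bits, rng)          # first moment
    v = quantize_rel(b2 * v + (1 - b2) * g * g, mantissa_bits, rng)      # second moment
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    w = quantize_rel(w - lr * m_hat / (np.sqrt(v_hat) + eps), mantissa_bits, rng)
    return w, m, v
```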
- Adaptive Rollout Allocation for Online RL with Verifiable Rewards (VIP)
This paper proposes VIP (Variance-Informed Predictive allocation), which uses a Gaussian process to predict the success probability of each prompt and then solves a convex optimization problem to allocate rollout counts under a compute budget constraint, minimizing gradient variance. VIP consistently improves the sampling efficiency of GRPO/RLOO on mathematical reasoning tasks, achieving up to 12.3-point gains in Pass@32 on AIME24/25.
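As a sketch of the allocation step: assuming per-prompt success-probability estimates \(p_i\) (produced by a Gaussian process in the paper), minimizing the variance proxy \(\sum_i \sigma_i^2/n_i\) under a budget \(\sum_i n_i = B\) has the closed-form solution \(n_i \propto \sigma_i\) with \(\sigma_i=\sqrt{p_i(1-p_i)}\). The names and the simple proportional rule below are illustrative stand-ins for the paper's convex program.

```python
import numpy as np

def allocate_rollouts(p_hat, budget, n_min=1):
    """Allocate rollouts proportionally to the predicted per-prompt std-dev.

    Solves min sum_i sigma_i^2 / n_i  s.t.  sum_i n_i = budget (continuous relaxation),
    then rounds and enforces a floor of n_min rollouts per prompt.
    """
    sigma = np.sqrt(p_hat * (1.0 - p_hat))           # Bernoulli std-dev per prompt
    sigma = np.maximum(sigma, 1e-6)                   # keep the allocation well-defined
    n = sigma / sigma.sum() * budget                  # closed-form optimum: n_i ∝ sigma_i
    return np.maximum(np.round(n).astype(int), n_min)

# Near-impossible or near-solved prompts (p close to 0 or 1) get few rollouts,
# uncertain prompts (p near 0.5) get many.
print(allocate_rollouts(np.array([0.05, 0.5, 0.9, 0.45]), budget=64))
```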
- Celo2: Towards Learned Optimization Free Lunch
This paper proposes Celo2—a learned optimizer meta-trained in only 4.5 GPU hours—that generalizes stably to models of up to 1.3 billion parameters (GPT-3 XL scale), six orders of magnitude beyond its meta-training distribution, via simple recipes including a normalized MLP update rule and task augmentation. Celo2 outperforms the prior VeLO optimizer (which required 4,000 TPU-months of meta-training) as well as carefully tuned AdamW baselines.
- CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving
CogFlow proposes a cognition-inspired three-stage visual mathematical reasoning framework (Perception → Internalization → Reasoning), enhanced by Synergistic Visual Rewards for perception, a Knowledge Internalization Reward to bridge perception and reasoning, and Visual-Gated Policy Optimization to anchor visual reasoning, addressing the core problem of "correct perception but drifted reasoning" in existing methods.
- COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
COLD-Steer is proposed as a training-free LLM activation steering method that approximates the representational change induced by one step of gradient descent on in-context examples, achieving 95% steering effectiveness with only 1/50 of the samples required by prior methods.
- Constraint Matters: Multi-Modal Representation for Reducing Mixed-Integer Linear Programming
This paper proposes a constraint-reduction framework for simplifying MILP models. It defines a fixed-constraint strength \(\rho\) and uses information gain \(\Delta H = -\log\rho\) to identify critical tight constraints (CTCs). A multi-modal GNN combining an instance-level bipartite graph with an abstract-level type graph is designed to predict CTCs. On four large-scale benchmarks, the method achieves an average improvement of 51.06% in solution quality (\(\text{gap}_\text{abs}\)) and an average speedup of 17.47% in convergence (PDI).
- Converge Faster, Talk Less: Hessian-Informed Federated Zeroth-Order Optimization
This paper proposes HiSo (Hessian-informed Scalar-only communication), which leverages a global diagonal Hessian approximation to accelerate convergence in federated zeroth-order optimization while strictly maintaining scalar communication without transmitting any second-order information. Theoretical analysis proves that, under low effective rank and whitening assumptions, the convergence rate is independent of the Lipschitz constant \(L\) and model dimension \(d\). Experiments on OPT-350M/1.3B/2.7B fine-tuning demonstrate a 1.4–5.4× reduction in communication rounds, with total communication cost remaining at the KB level.
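The scalar-only communication trick can be sketched as follows: server and clients share a random seed, so both can regenerate the same perturbation direction and a client only returns one scalar (its two-point finite-difference estimate), which the server rescales with a diagonal-curvature estimate. The preconditioner handling and constants below are illustrative, not HiSo's exact rule.

```python
import numpy as np

def client_scalar(w, loss_fn, seed, mu=1e-3):
    """Client: two-point zeroth-order estimate along a shared-seed direction -> ONE scalar."""
    u = np.random.default_rng(seed).standard_normal(w.shape)
    return (loss_fn(w + mu * u) - loss_fn(w - mu * u)) / (2 * mu)

def server_round(w, scalars, seed, h_diag, lr=0.05):
    """Server: rebuild the direction from the seed, precondition, and update."""
    u = np.random.default_rng(seed).standard_normal(w.shape)
    g_hat = np.mean(scalars) * u                      # zeroth-order gradient estimate
    return w - lr * g_hat / np.sqrt(h_diag + 1e-8)

# Toy run: two clients with shifted quadratic objectives.
w, h_diag = np.zeros(5), np.ones(5)                   # stand-in diagonal curvature
for t in range(300):
    seed = 1000 + t
    scalars = [client_scalar(w, lambda x, c=c: np.sum((x - c) ** 2), seed)
               for c in (np.ones(5), 2 * np.ones(5))]
    w = server_round(w, scalars, seed, h_diag)
print(w.round(2))                                     # ≈ 1.5 per coordinate (the midpoint)
```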
- Convergence of Muon with Newton-Schulz
This work provides the first convergence guarantees for the Muon optimizer as it is actually used in practice—with Newton-Schulz (NS) approximation rather than exact SVD-based polar decomposition. It proves that the convergence rate matches the idealized SVD variant up to a constant factor \(C_q\) that decays doubly exponentially in the number of NS iterations \(q\), and that Muon enjoys a \(\sqrt{r}\) advantage over its vector-space counterpart SGD-M due to reduced rank loss.
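For reference, the cubic Newton-Schulz iteration below approximates the polar factor \(UV^\top\) of a matrix without an SVD, which is the orthogonalization step Muon needs; the practical Muon implementation uses a tuned quintic polynomial, so this is a simplified stand-in.

```python
import numpy as np

def newton_schulz_polar(G, q=8):
    """Approximate the polar factor U V^T of G = U S V^T with q cubic NS iterations.

    The iteration converges when the spectral norm of the initial iterate is below
    sqrt(3); dividing by the Frobenius norm (an upper bound on the spectral norm)
    guarantees this.
    """
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(q):
        X = 1.5 * X - 0.5 * X @ X.T @ X        # cubic Newton-Schulz step
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
print(np.linalg.norm(newton_schulz_polar(G) - U @ Vt))   # small after a handful of iterations
```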
- Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate
Grounded in convex optimization theory, this paper proves that the training loss in deep learning converges at a rate of \(O(1/\sqrt{T})\) and that the optimal learning rate scales as \(1/\sqrt{T}\). The resulting scaling law is validated on models ranging from GPT-2 scale up to 12.5B parameters (\(R^2 \geq 0.978\)), enabling learning-rate extrapolation across an 80× range of training steps.
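Taken literally, the claimed \(1/\sqrt{T}\) law gives a simple extrapolation rule for retuning the learning rate when the training horizon changes; the reference numbers below are hypothetical.

```python
def extrapolate_lr(lr_ref, steps_ref, steps_new):
    """If the optimal LR scales as 1/sqrt(T), rescale a tuned LR to a new horizon."""
    return lr_ref * (steps_ref / steps_new) ** 0.5

# Hypothetical: an LR tuned for a 10k-step run, extrapolated to an 80x longer run.
print(extrapolate_lr(lr_ref=3e-4, steps_ref=10_000, steps_new=800_000))  # ≈ 3.35e-05
```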
- DeepAFL: Deep Analytic Federated Learning
This paper proposes DeepAFL, which designs gradient-free analytic residual blocks and introduces a layer-wise federated training protocol, achieving for the first time a deep analytic federated learning model with representation learning capability. The method retains the ideal property of invariance to data heterogeneity while overcoming the limitation that restricts existing analytic approaches to single-layer linear models, surpassing the state of the art by 5.68%–8.42% across three benchmark datasets.
- Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks
This paper provides the first proof of directional convergence of gradient descent in leaky ReLU two-layer neural networks, and on this basis establishes sufficient conditions for benign overfitting under a significantly broader mixed-data setting that goes well beyond nearly orthogonal data, while uncovering a novel phase transition phenomenon.
- Dual Optimistic Ascent (PI Control) is the Augmented Lagrangian Method in Disguise
This paper proves that dual optimistic ascent (PI control), widely used in constrained deep learning, is mathematically equivalent to the classical Augmented Lagrangian Method (ALM) under the single-step first-order update regime. This equivalence transfers ALM's robust convergence guarantees (linear convergence to all strict local solutions) to PI control, and provides principled tuning guidance for the optimism coefficient \(\omega\).
- Exploring Diverse Generation Paths via Inference-time Stiefel Activation Steering
This paper proposes STARS (Stiefel-based Activation Steering for Diverse ReaSoning), a training-free inference-time activation steering method that jointly optimizes \(N\) orthogonal steering directions on the Stiefel manifold at each token decoding step, maximizing the geometric volume of hidden states to promote divergent activation trajectories. STARS consistently outperforms temperature sampling in diversity on test case generation (TestEval) and scientific discovery (LiveIdeaBench) with negligible latency overhead and without sacrificing output quality.
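As a generic illustration of the underlying tool (not STARS's actual token-level objective), the sketch below maximizes a log-volume objective \(\log\det(V^\top A V)\) over orthonormal direction sets \(V\) on the Stiefel manifold, using a tangent-space projection and a QR retraction; the matrix \(A\) and the objective are placeholders.

```python
import numpy as np

def stiefel_ascent(A, d, N, steps=500, lr=0.05, seed=0):
    """Maximize logdet(V^T A V) over V on the Stiefel manifold St(d, N)."""
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((d, N)))       # random orthonormal start
    for _ in range(steps):
        M = V.T @ A @ V
        G = 2 * A @ V @ np.linalg.inv(M)                   # Euclidean gradient of logdet
        sym = 0.5 * (V.T @ G + G.T @ V)
        V, _ = np.linalg.qr(V + lr * (G - V @ sym))        # tangent projection + QR retraction
    return V

d, N = 10, 3
A = np.diag(np.arange(1.0, d + 1.0))
V = stiefel_ascent(A, d, N)
print(np.sort(np.linalg.eigvalsh(V.T @ A @ V)))            # ≈ [8, 9, 10]: top eigenspace of A
```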
- Faster Gradient Methods for Highly-Smooth Stochastic Bilevel Optimization
By reinterpreting the F2SA method as a forward-difference approximation to the hyper-gradient, this paper proposes the F2SA-p algorithm family based on higher-order finite differences. Under higher-order smoothness conditions, the SFO complexity for stochastic bilevel optimization is improved from \(\tilde{\mathcal{O}}(\epsilon^{-6})\) to \(\tilde{\mathcal{O}}(p\epsilon^{-4-2/p})\), and a matching lower bound of \(\Omega(\epsilon^{-4})\) is established, showing near-optimality for sufficiently large \(p\).
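The driving idea, higher-order finite differences trading extra function evaluations for a better approximation order, can be seen in a generic one-dimensional toy (this is not the F2SA-p hypergradient estimator itself, only the standard stencils it builds on):

```python
import numpy as np

# Forward-difference stencils for f'(0) with increasing order of accuracy p.
STENCILS = {
    1: ([0, 1],       [-1.0, 1.0]),                  # (f(h) - f(0)) / h
    2: ([0, 1, 2],    [-1.5, 2.0, -0.5]),            # O(h^2) forward stencil
    3: ([0, 1, 2, 3], [-11/6, 3.0, -1.5, 1/3]),      # O(h^3) forward stencil
}

f, true_deriv = np.exp, 1.0                          # f(x) = e^x, so f'(0) = 1
for p, (nodes, coeffs) in STENCILS.items():
    for h in (1e-1, 1e-2):
        est = sum(c * f(n * h) for n, c in zip(nodes, coeffs)) / h
        print(f"order {p}, h={h:g}: error = {abs(est - true_deriv):.1e}")
```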
- FedDAG: Clustered Federated Learning via Global Data and Gradient Integration for Heterogeneous Environments
FedDAG is a clustered federated learning framework that performs more accurate client clustering via weighted class-wise similarity fusion of data and gradient signals, and enables cross-cluster feature transfer through a dual-encoder architecture, consistently outperforming existing baselines across diverse heterogeneity settings.
- FrontierCO: Real-World and Large-Scale Evaluation of Machine Learning Solvers for Combinatorial Optimization
FrontierCO is a large-scale, real-world benchmark covering 8 categories of combinatorial optimization problems (TSP, MIS, CVRP, etc.), evaluating 16 ML solvers (neural methods + LLM agents) against state-of-the-art classical solvers. The benchmark reveals that ML methods remain significantly behind classical approaches on structurally complex and extremely large-scale instances, though they show potential to surpass classical methods in certain scenarios.
- Generalization Below the Edge of Stability: The Role of Data Geometry
This paper introduces the principle of data shatterability to provide a unified explanation of how data geometry governs the strength of implicit regularization induced by gradient descent near the Edge of Stability (EoS). For the \(\mathrm{Beta}(\alpha)\) radial distribution family, the authors derive a spectrum of generalization upper and lower bounds that depend on \(\alpha\). For mixture distributions supported on low-dimensional subspaces, they prove that the generalization rate adapts to the intrinsic dimension \(m\) rather than the ambient dimension \(d\).
- Learning to Recall with Transformers Beyond Orthogonal Embeddings
This paper analyzes the "early phase" of gradient descent on the empirical loss for a single-layer Transformer on a token retrieval task under random (non-orthogonal) embeddings. It derives an explicit formula for storage capacity, revealing a multiplicative dependence among sample size \(N\), embedding dimension \(d\), and sequence length \(L\), and proves that this scaling relation is intrinsic to the information-theoretic lower bound.
- Learning to Solve Orienteering Problem with Time Windows and Variable Profits
This paper proposes DeCoST, a learning-based two-stage framework that decouples the coupled discrete routing decisions and continuous service time allocation in OPTWVP. The first stage employs a parallel decoder to jointly generate routes and initial service times, while the second stage applies LP to optimally allocate service times (globally optimal). Cross-stage coordination is achieved via pTAR feedback. DeCoST achieves optimality gaps of only 0.83%–3.31% on OPTWVP instances with 50–500 nodes, with inference up to 45× faster than metaheuristics.
- Markovian Transformers for Informative Language Modeling
This paper proposes the Markovian Language Model (MLM) framework, which forces the CoT to serve as a causally necessary reasoning bottleneck through a structural constraint: the original question is removed during answer prediction, so the answer must be derived solely from the CoT, analogous to the narrow latent layer of an autoencoder. Combined with GRPO-style policy gradient training, this improves accuracy on GSM8K from 19.6% to 57.1%. The learned CoT also transfers across model architectures (Llama→Mistral/Phi/GPT-2), indicating that it encodes natural-language reasoning rather than steganography.
- Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization
This paper provides a rigorous theoretical analysis of the implicit bias of SAM when training linear diagonal networks, revealing a qualitative phase transition induced by increasing depth from \(L=1\) to \(L=2\): the limiting direction of \(\ell_\infty\)-SAM is highly sensitive to initialization, while \(\ell_2\)-SAM exhibits a sequential feature amplification phenomenon — "minor first, major last" — demonstrating that analyses focused solely on the \(t\to\infty\) limit are insufficient to characterize the full dynamics of SAM.
- MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates
This paper proposes MT-DAO, a multi-timescale distributed adaptive optimizer that introduces a slow momentum (high \(\beta\)) to address the timescale mismatch caused by excessively rapid decay of standard momentum under infrequent communication. MT-DAO provides the first convergence guarantee in this setting, eliminates the performance gap with fully synchronized DDP in language model pre-training, and reduces end-to-end training time by 6–27%.
- ∇-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space
This paper proposes ∇-Reasoner, which upgrades inference-time search from zeroth-order (sampling + evaluation) to first-order (gradient descent). By applying Differentiable Textual Optimization (DTO) in the token logits space—jointly leveraging reward gradients and LLM likelihood—the framework iteratively refines the decoding strategy, achieving 10–40% accuracy gains on mathematical reasoning tasks while reducing model calls by 10–40%.
- Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit
This paper proves that, under generic non-degeneracy assumptions, standard two-layer neural networks trained via hierarchical gradient descent can learn generic Gaussian Multi-Index models \(f(\bm{x})=g(\bm{U}\bm{x})\) with \(\tilde{O}(d)\) samples and \(\tilde{O}(d^2)\) time, achieving information-theoretically optimal sample and time complexity. This constitutes the first proof that neural networks can efficiently learn hierarchical functions.
- Non-Asymptotic Analysis of Efficiency in Conformalized Regression
This work establishes the first non-asymptotic efficiency bounds for Conformalized Quantile Regression (CQR) and Conformalized Median Regression (CMR) trained with SGD, explicitly characterizing the joint dependence of prediction set length bias on training sample size \(n\), calibration sample size \(m\), and miscoverage level \(\alpha\).
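For context, the split-CQR construction whose efficiency is being analyzed fits in a few lines; the quantile regressors \(q_{lo}, q_{hi}\) below are placeholders (in the paper they are trained with SGD, which is where the bounds enter).

```python
import numpy as np

def cqr_interval(q_lo, q_hi, x_cal, y_cal, x_test, alpha=0.1):
    """Split conformalized quantile regression (Romano et al., 2019).

    q_lo, q_hi: fitted lower/upper quantile regressors, callables x -> prediction.
    """
    # Conformity scores: how far calibration points fall outside [q_lo, q_hi].
    scores = np.maximum(q_lo(x_cal) - y_cal, y_cal - q_hi(x_cal))
    m = len(y_cal)
    k = int(np.ceil((m + 1) * (1 - alpha)))          # finite-sample-corrected rank
    qhat = np.sort(scores)[min(k, m) - 1]            # (exact method returns an infinite
                                                     #  interval if k > m; clipped here)
    # Widen (or shrink) the quantile band by qhat for marginal 1 - alpha coverage.
    return q_lo(x_test) - qhat, q_hi(x_test) + qhat
```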
- Nonparametric Teaching of Attention Learners
This paper proposes AtteNT, a reinterpretation of attention-learner (Transformer/ViT) training through the lens of nonparametric teaching theory. It analyzes the importance-adaptive role of attention in parametric gradients, proves that the dynamic ANTK converges to an importance-adaptive canonical kernel under functional gradient descent (thereby bridging parameter space and function space), and applies a greedy teaching algorithm that selects the samples with the largest prediction deviation to accelerate training, yielding 13.01% time savings for LLM fine-tuning and 20.58% for ViT training from scratch with no degradation in accuracy.
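The greedy teaching step itself is simple: at each round, select the examples on which the current learner deviates most from the target and feed those first. The deviation measure and batch size below are illustrative.

```python
import numpy as np

def greedy_teaching_batch(preds, targets, batch_size):
    """Pick the examples with the largest prediction deviation for the next update."""
    deviation = np.linalg.norm(preds - targets, axis=-1)     # per-example deviation
    return np.argsort(-deviation)[:batch_size]               # indices, hardest first

preds   = np.array([[0.1], [0.9], [0.4], [0.2], [0.7], [0.5]])
targets = np.array([[0.0], [0.0], [0.5], [0.2], [1.0], [0.4]])
print(greedy_teaching_batch(preds, targets, batch_size=2))   # -> [1 4]
```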
- NRGPT: An Energy-based Alternative for GPT
This paper proposes NRGPT (eNeRgy-GPT), which applies minimal modifications to standard GPT to yield an energy-based model: attention and feedforward energy functions are designed such that each forward pass is equivalent to a gradient descent step on the energy landscape. The work proves asymptotic energy decrease and stable convergence properties, and validates performance comparable to standard GPT on ListOps, Shakespeare, and OpenWebText.
- Optimizer Choice Matters for the Emergence of Neural Collapse
Through 3,900+ training experiments and theoretical analysis, this paper reveals that optimizer choice—particularly the coupling mechanism of weight decay—plays a decisive role in the emergence of Neural Collapse: AdamW (decoupled weight decay) fails to produce Neural Collapse, whereas SGD and Adam (coupled weight decay) succeed.
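The mechanism being contrasted is where weight decay enters the update: coupled decay (SGD or Adam with an \(\ell_2\) term) is added to the gradient before any adaptive rescaling, whereas decoupled decay (AdamW) shrinks the weights directly. A minimal side-by-side sketch (bias correction omitted for brevity):

```python
import numpy as np

def adaptive_step(w, g, m, v, lr, wd, b1=0.9, b2=0.999, eps=1e-8, decoupled=False):
    """One adaptive step with either coupled (Adam + L2) or decoupled (AdamW) decay."""
    if not decoupled:
        g = g + wd * w                   # coupled: decay gets rescaled by 1/sqrt(v) below
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    w = w - lr * m / (np.sqrt(v) + eps)
    if decoupled:
        w = w - lr * wd * w              # decoupled: decay bypasses the adaptive scaling
    return w, m, v
```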
- Personalized Collaborative Learning with Affinity-Based Variance Reduction
This paper proposes AffPCL, a personalized collaborative learning framework that enables heterogeneous agents to collaboratively learn personalized solutions without prior knowledge, via bias correction and importance correction mechanisms. It achieves an adaptive convergence rate of \(O(t^{-1} \cdot \max\{n^{-1}, \delta\})\)—yielding linear speedup when agents are similar, and no worse than independent learning when they are dissimilar.
- Πnet: Optimizing Hard-Constrained Neural Networks with Orthogonal Projection Layers
This paper proposes the Πnet architecture, which appends an orthogonal projection layer based on Douglas-Rachford operator splitting to the output of a neural network, guaranteeing strict satisfaction of convex constraints, and employs the implicit function theorem for efficient backpropagation. Πnet substantially outperforms existing methods in training time, solution quality, and hyperparameter robustness.
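To make the hard-constraint layer concrete, the sketch below computes the exact projection of a network output onto the intersection of a box and a hyperplane; for simplicity it uses Dykstra's alternating-projection algorithm as a stand-in for Πnet's Douglas-Rachford-based layer, and it omits the implicit-function-theorem backward pass.

```python
import numpy as np

def proj_box(z, lo, hi):
    return np.clip(z, lo, hi)

def proj_hyperplane(z, a, b):
    """Projection onto the hyperplane {x : a @ x = b}."""
    return z - (a @ z - b) / (a @ a) * a

def project_intersection(y, lo, hi, a, b, iters=500):
    """Dykstra's algorithm: projection of y onto box ∩ hyperplane."""
    x, p, q = y.copy(), np.zeros_like(y), np.zeros_like(y)
    for _ in range(iters):
        u = proj_box(x + p, lo, hi)
        p = x + p - u
        x = proj_hyperplane(u + q, a, b)
        q = u + q - x
    return x

y = np.array([2.0, -1.0, 0.5])                                      # raw network output
x = project_intersection(y, lo=0.0, hi=1.0, a=np.ones(3), b=2.0)    # [0,1]^3, sum(x) = 2
print(x.round(3), x.sum())                                          # ≈ [1, 0, 1], feasible
```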
- Provable and Practical In-Context Policy Optimization for Self-Improvement
This paper proposes the In-Context Policy Optimization (ICPO) framework, theoretically proving that a single-layer linear self-attention Transformer, after sufficient pretraining, can simulate a policy optimization algorithm in context. Building on this, the paper designs a practical ME-ICPO algorithm that achieves multi-round test-time self-reflection via minimum-entropy selection and self-evaluation rewards, yielding significant gains on mathematical reasoning tasks (Qwen2.5-Math-7B improves from 11% to 30% on AIME 2024).
- Rapid Training of Hamiltonian Graph Networks using Random Features
This paper proposes RF-HGN, which constructs dense layer parameters via random feature sampling (ELM/SWIM) and solves a linear least-squares problem to train Hamiltonian Graph Networks. By completely bypassing gradient-descent-based iterative optimization, RF-HGN achieves 150–600× speedup on N-body physical systems while maintaining comparable accuracy and strong zero-shot generalization.
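The training shortcut can be illustrated generically: freeze randomly sampled hidden-layer weights and solve a ridge least-squares problem for the readout, so no iterative gradient descent is involved. This is an ELM-style sketch on a plain regression task, not the Hamiltonian graph architecture itself.

```python
import numpy as np

def fit_random_features(X, Y, width=512, reg=1e-6, seed=0):
    """ELM-style training: frozen random hidden layer + closed-form least-squares readout."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], width)) / np.sqrt(X.shape[1])
    b = rng.uniform(-1, 1, width)
    H = np.tanh(X @ W + b)                                   # random features (never trained)
    beta = np.linalg.solve(H.T @ H + reg * np.eye(width), H.T @ Y)   # ridge least squares
    return lambda Xq: np.tanh(Xq @ W + b) @ beta

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (2000, 1))
Y = np.sin(X)                                                # toy target
model = fit_random_features(X, Y)
print(np.abs(model(X) - Y).max())                            # small fit error, no SGD used
```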
- Rethinking Consistent Multi-Label Classification Under Inexact Supervision
The paper proposes the COMES framework, which provides consistent risk estimators for multi-label classification under inexact supervision via first-order (Hamming loss) and second-order (Ranking loss) strategies, without requiring estimation of the label generation process or uniform distribution assumptions.
- Rolling Ball Optimizer: Learning by Ironing Out Loss Landscape Wrinkles
This paper proposes the Rolling Ball Optimizer (RBO), which breaks the spatial locality of conventional optimizers by simulating the rolling motion of a finite-radius rigid sphere over the loss landscape. This induces an ironing property on the loss function and demonstrates superior convergence speed and generalization performance on MNIST and CIFAR-10/100.
- RRNCO: Towards Real-World Routing with Neural Combinatorial Optimization
This paper proposes the RRNCO architecture, which introduces two key innovations — Adaptive Node Embedding (ANE) and Neural Adaptive Bias (NAB) — to jointly model asymmetric distances, travel durations, and bearing angles within a deep routing framework for the first time. It also constructs a VRP benchmark dataset based on 100 real-world cities, significantly narrowing the sim-to-real gap for NCO methods.
- RS-ORT: A Reduced-Space Branch-and-Bound Algorithm for Optimal Regression Trees
This paper proposes RS-ORT, an algorithm that reformulates regression tree training as a two-stage optimization problem and applies branch-and-bound over a reduced space (branching only on tree structure variables). Combined with closed-form leaf predictions, threshold discretization, and exact last-layer subtree resolution, RS-ORT is the first exact method to achieve globally optimal regression tree learning on datasets with up to 2 million samples containing continuous features.
- Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures
This paper proposes a unified theoretical framework that explains the pervasive simplicity bias observed across multiple neural network architectures (fully connected, convolutional, and attention-based) through saddle-to-saddle learning dynamics — the phenomenon whereby gradient descent tends to learn simple solutions first before progressively learning more complex ones.
- Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
This paper proposes the Scaf-GRPO framework, which injects hierarchical in-prompt hints (Knowledge → Planning → Solution) to overcome the "learning cliff" (zero-reward) problem in GRPO training. On Qwen2.5-Math-7B, it achieves a 44.3% relative improvement in pass@1 on AIME24 while preserving on-policy training consistency.
- Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?
Under the Power-Law Random Features model, this paper systematically analyzes the scaling laws of SignSGD, reveals two distinctive effects of SignSGD relative to SGD—drift normalization and noise reshaping—and demonstrates that the compute-optimal scaling exponent of SignSGD can surpass that of SGD in noise-dominated regimes.
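For reference, the update under analysis differs from SGD only in keeping the coordinate-wise sign of the gradient, which is exactly what produces the drift-normalization effect:

```python
import numpy as np

def sgd_step(w, grad, lr):
    return w - lr * grad

def signsgd_step(w, grad, lr):
    # Each coordinate moves by exactly lr regardless of the gradient magnitude
    # ("drift normalization"); the gradient noise is likewise reshaped by the sign.
    return w - lr * np.sign(grad)
```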
- SCRAPL: Scattering Transform with Random Paths for Machine Learning
To address the prohibitive computational cost of using the multivariate scattering transform (ST) as a differentiable loss function due to its large number of paths \(P\), this paper proposes SCRAPL—a framework that samples a single random path per training step and stabilizes gradient updates via three variance-reduction techniques: P-Adam (path-adaptive momentum), P-SAGA (path stochastic average gradient), and \(\theta\)-importance sampling. On unsupervised sound matching tasks, SCRAPL achieves Pareto optimality by attaining accuracy close to the full-path ST at computational costs comparable to multi-scale spectral (MSS) loss.
- Test-Time Meta-Adaptation with Self-Synthesis
This paper proposes MASS (Meta-Adaptation with Self-Synthesis), a framework that employs bilevel optimization-based meta-learning to enable LLMs to generate task-specific synthetic training data at inference time via a Generator, filter samples through a Scorer, and perform weighted SFT self-update via LoRA. Meta-gradients are backpropagated through the inner update to optimize data quality, improving Llama-3.1-8B from 43.6% to 59.0% on MATH-500.
- The Affine Divergence: Aligning Activation Updates Beyond Normalisation
This paper reveals a fundamental misalignment between the steepest descent direction in parameter space and the effective update propagated to activations under gradient descent — the "affine divergence" \(\Delta\mathcal{L}/\Delta z_i = (\partial\mathcal{L}/\partial z_i) \cdot (\|\vec{x}\|^2+1)\) — derives normalization as the natural remedy from first principles, and discovers a non-normalizing alternative that empirically surpasses conventional normalization methods.
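To see where the \((\|\vec{x}\|^2+1)\) factor comes from, consider a single affine layer \(z_i = \vec{w}_i\cdot\vec{x} + b_i\) and one vanilla gradient step with learning rate \(\eta\) (a sketch under these assumptions; the paper's notation and scope may differ). Re-evaluating \(z_i\) on the same input after the step gives

\[
\Delta z_i = \Bigl(\vec{w}_i - \eta\,\tfrac{\partial\mathcal{L}}{\partial z_i}\,\vec{x}\Bigr)\cdot\vec{x} + \Bigl(b_i - \eta\,\tfrac{\partial\mathcal{L}}{\partial z_i}\Bigr) - \bigl(\vec{w}_i\cdot\vec{x} + b_i\bigr) = -\eta\,\frac{\partial\mathcal{L}}{\partial z_i}\,\bigl(\|\vec{x}\|^2 + 1\bigr),
\]

so the activation moves along the steepest-descent direction but rescaled by the input-dependent factor \(\|\vec{x}\|^2+1\), which is the misalignment that normalization removes.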
- Unifying Formal Explanations: A Complexity-Theoretic Perspective
This paper proposes a unified framework that reduces sufficient reasons and contrastive reasons (local/global, probabilistic/non-probabilistic) to the problem of minimizing a unified probabilistic value function. It reveals that global value functions possess key combinatorial optimization properties—monotonicity, submodularity/supermodularity—and proves that global explanations are computable in polynomial time, even when their local counterparts are NP-hard.
- Weak-SIGReg: Covariance Regularization for Stable Deep Learning
This work transfers SIGReg regularization from LeJEPA's self-supervised learning setting to supervised learning and proposes a computationally efficient variant called Weak-SIGReg—constraining the covariance matrix toward the identity (rather than matching all moments). Random projections reduce memory from \(O(C^2)\) to \(O(CK)\). On a ViT without BN or residual connections, this approach recovers CIFAR-100 accuracy from 20.73% (collapsed) to 72.02%, matching or surpassing carefully tuned baselines.
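A minimal sketch of the random-projection variant, assuming the goal is to pull the batch feature covariance toward the identity while only materializing \(K\) projected directions (hence \(O(CK)\) rather than \(O(C^2)\) memory); the function name and the exact penalty form are illustrative, not the paper's definition.

```python
import numpy as np

def weak_sigreg_penalty(Z, K=16, seed=0):
    """Push the feature covariance toward identity using K random 1-D projections.

    Z: (N, C) batch of features. Rather than forming the C x C covariance, project
    onto K random unit directions and penalize the deviation of each projected mean
    from 0 and each projected variance from 1. In practice Z would be an autograd
    tensor and this scalar would simply be added to the training loss.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((Z.shape[1], K))
    U /= np.linalg.norm(U, axis=0, keepdims=True)    # K random unit directions
    P = Z @ U                                        # (N, K) projected features
    return np.mean(P.mean(axis=0) ** 2) + np.mean((P.var(axis=0) - 1.0) ** 2)
```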
- When to Restart? Exploring Escalating Restarts on Convergence
This paper proposes SGD-ER (SGD with Escalating Restarts), a convergence-aware learning rate scheduling strategy that triggers restarts with linearly escalating learning rates upon detecting training stagnation, enabling the optimizer to escape sharp local minima and explore flatter loss landscape regions. SGD-ER achieves 0.5–4.5% test accuracy improvements on CIFAR-10/100 and TinyImageNet.
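A minimal sketch of the scheduling logic, assuming stagnation is detected from a plateau of the best seen loss; the plateau test, the linear escalation rule, and the constants are illustrative rather than the paper's exact criterion.

```python
class EscalatingRestartSchedule:
    """Restart the learning rate on stagnation, escalating the peak LR each time."""

    def __init__(self, base_lr, decay=0.999, escalation=0.5, patience=200, tol=1e-3):
        self.base_lr = base_lr          # peak LR of the first cycle
        self.lr = base_lr
        self.decay = decay              # within-cycle multiplicative decay
        self.escalation = escalation    # linear growth of the peak LR per restart
        self.patience = patience        # steps without improvement before restarting
        self.tol = tol
        self.best = float("inf")
        self.stale = 0
        self.restarts = 0

    def step(self, loss):
        if loss < self.best - self.tol:
            self.best, self.stale = loss, 0
        else:
            self.stale += 1
        if self.stale >= self.patience:                       # stagnation detected
            self.restarts += 1
            self.stale = 0
            self.lr = self.base_lr * (1.0 + self.escalation * self.restarts)  # escalate
        else:
            self.lr *= self.decay                             # ordinary within-cycle decay
        return self.lr
```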