Skip to content

📐 Optimization & Theory

🧪 ICML2025 · 61 paper notes

📌 Same area in other venues: 📷 CVPR2026 (22) · 🔬 ICLR2026 (222) · 🧪 ICML2026 (88) · 🤖 AAAI2026 (21) · 🧠 NeurIPS2025 (126) · 📹 ICCV2025 (7)

🔥 Top topics: Federated Learning ×4 · Layout & Composition ×2 · Adversarial Robustness ×2 · Alignment/RLHF ×2 · LLM ×2

A Generalization Result for Convergence in Learning-to-Optimize

A probabilistic framework is proposed to combine PAC-Bayesian generalization theory with the Kurdyka-Łojasiewicz (KL) convergence theorem from variational analysis, proving for the first time with high probability that learned optimization algorithms converge to critical points without restricting the algorithm design.

A Near-Optimal Single-Loop Stochastic Algorithm for Convex Finite-Sum Coupled Compositional Optimization

This paper proposes the ALEXR algorithm—an efficient single-loop primal-dual block-coordinate stochastic algorithm for solving convex finite-sum coupled compositional optimization (cFCCO) problems. It achieves near-optimal convergence rates under both smooth and non-smooth conditions, and proves the optimality of the algorithm by deriving lower bounds.

A Unified View on Learning Unnormalized Distributions via Noise-Contrastive Estimation

Proposes two estimator families, alpha-CentNCE and f-CondNCE, based on f-NCE, to unify methods for learning unnormalized distributions (such as MLE, MC-MLE, GlobalGISO, pseudo-likelihood, and ISO), corrects the misleading connection between CondNCE and score matching, and establishes the first finite-sample convergence guarantees for bounded exponential families.

Adjustment for Confounding using Pre-Trained Representations

This paper investigates how to leverage latent representations from pre-trained neural networks to adjust for confounding in non-tabular data (e.g., images, text). It formalizes representation sufficiency conditions, proves that sparsity/additivity assumptions do not hold under Invertible Linear Transformations (ILTs), and establishes convergence rate theory for deep networks based on low intrinsic dimension and Hierarchical Compositional Models (HCMs), thereby guaranteeing valid inference of ATE estimation within the Double Machine Learning (DML) framework.

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Proposed AdvPrompter, which uses an LLM (AdvPrompter) to generate human-readable adversarial prompt suffixes for target LLMs in seconds. Trained via an alternating optimization algorithm, it achieves high attack success rates on AdvBench and HarmBench and transfers to closed-source black-box LLMs, while presenting a strategy for adversarial training using generated adversarial suffixes to enhance target LLM robustness.

Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

This paper theoretically proves that early-stopped gradient descent (GD) possesses a statistical advantage over asymptotic GD in overparameterized logistic regression: early-stopped GD is calibrated and consistent, whereas the logistic risk of asymptotic GD diverges to infinity and its calibration error does not vanish. Additionally, a quantitative connection between early stopping and \(\ell_2\) regularization is established.

Beyond Self-Repellent Kernels: History-Driven Target Towards Efficient Nonlinear MCMC on General Graphs

This paper proposes the History-Driven Target (HDT) framework, which embeds self-repellent dynamics into any MCMC sampler by modifying the target distribution (rather than the transition kernel). It achieves the \(O(1/\alpha)\) variance reduction while resolving the three major limitations of SRRW: high computational overhead, restriction to reversible chains, and high memory footprint.

BOPO: Neural Combinatorial Optimization via Best-anchored and Objective-guided Preference Optimization

This paper introduces preference optimization into Neural Combinatorial Optimization (NCO) and proposes BOPO. Through (1) best-anchored preference pair construction (hybrid rollout + uniform filtering + best-anchored pairing) and (2) an objective-guided adaptively scaled loss function (\(\beta = g(y_l)/g(y_w)\)), BOPO comprehensively outperforms state-of-the-art (SOTA) approaches on three classical combinatorial optimization problems (JSP, TSP, and FJSP) without requiring a reward model or a reference policy.

Can Transformers Learn Full Bayesian Inference In Context?

This paper demonstrates that Transformers can perform full Bayesian inference in context. By pre-training an encoder-decoder architecture (TabPFN encoder + diffusion Transformer decoder) on synthetic data, the model can generate posterior samples of comparable quality to HMC for statistical models like GLMs and Gaussian mixture models during deployment, without requiring any parameter updates.

Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed

This paper proves that the high-probability convergence of AdaGrad/Adam under heavy-tailed noise can be poor (with polynomial dependence on the confidence level) and demonstrates that gradient clipping resolves this issue—specifically, Clip-AdaGrad-Norm and Clip-Adam-Norm achieve high-probability convergence bounds with polylogarithmic dependence on the confidence level under heavy-tailed noise, which are then extended to delayed stepsizes versions.

Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

An asymmetric triangular wave-based reparameterization method for ReLU networks is proposed, forcing a 4-neuron-wide network of depth \(d\) to generate \(2^d\) linear regions at initialization and maintain this exponential expressivity during pre-training, reducing error by 3 orders of magnitude on 1D function approximation tasks.

Conformal Prediction as Bayesian Quadrature

Revisiting conformal prediction from a Bayesian perspective—proving that both split conformal prediction and conformal risk control are special cases of the Bayesian Quadrature framework, proposing practical Bayesian alternatives, and providing interpretable guarantees as well as a richer representation of future loss ranges.

Constant Stepsize Local GD for Logistic Regression: Acceleration by Instability

It is proven that Local GD can converge with any positive stepsize \(\eta > 0\) for distributed logistic regression. By allowing non-monotonic objective decrease during an initial unstable phase, it achieves a faster convergence rate of \(\widetilde{\mathcal{O}}(M/(\gamma^5 R^2))\) compared to the existing worst-case lower bounds of convex optimization.

Efficient Curvature-Aware Hypergradient Approximation for Bilevel Optimization

Proposes the NBO framework, which exploits the inherent structure of hypergradients in bilevel optimization (where solving the lower-level problem and the Hessian-inverse-vector product share the same Hessian). By efficiently integrating curvature information via inexact Newton methods to approximate the hypergradient, it improves the gradient computational complexity by a factor of \(\kappa \log \kappa\) compared to the SOTA in deterministic settings.

Emergence in Non-Neural Models: Grokking Modular Arithmetic via Average Gradient Outer Product

This paper demonstrates that the grokking (delayed generalization) phenomenon is not unique to neural networks or gradient descent, but rather stems from the progressive learning of task-relevant features. By reproducing modular arithmetic grokking on kernel machines using the non-neural Recursive Feature Machines (RFM), it reveals that block-circulant feature matrices are central to generalization.

Enhancing Parallelism in Decentralized Stochastic Convex Optimization

Proposes Decentralized Anytime SGD (DAT-SGD), which alleviates consensus distance bias by calculating gradients on slowly-varying averaged query points. This improves the parallelism upper bound of decentralized stochastic convex optimization from \(\mathcal{O}(\rho^{1/2} N^{1/4})\) to \(\mathcal{O}(\rho \sqrt{N})\), matching the rate of centralized learning for the first time under highly connected topologies.

FedSWA: Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging

To address the generalization failure of FedSAM under high data heterogeneity, this paper proposes FedSWA (cyclic learning rate + EMA aggregation) and FedMoSWA (momentum-based variance reduction control variables). Both theoretical analysis and empirical results demonstrate their superiority over FedSAM and its variants, achieving a 21.8% accuracy improvement over FedSAM on CIFAR-100 with Dirichlet-0.1.

Flexible Tails for Normalizing Flows

This paper proposes Tail Transform Flow (TTF), which appends a non-Lipschitz transformation based on the complementary error function to the final layer of normalizing flows. This converts Gaussian tails to heavy tails with adjustable weights, avoiding the optimization difficulties in neural networks caused by using heavy-tailed base distributions.

FSL-SAGE: Accelerating Federated Split Learning via Smashed Activation Gradient Estimation

This paper proposes FSL-SAGE, a federated split learning algorithm that estimates server-side gradient feedback via an auxiliary model. It significantly reduces communication overhead and client-side memory footprint while maintaining an \(O(1/\sqrt{T})\) convergence rate comparable to FedAvg.

GCAL: Adapting Graph Models to Evolving Domain Shifts

This work proposes Graph Continual Adaptive Learning (GCAL). Through an "adapt and generate memory" bi-level optimization strategy, GCAL utilizes unsupervised domain adaptation via information maximization when GNN models face continuously evolving OOD graph sequences. At the same time, a variational memory graph generation module based on the information bottleneck theory is designed to compress historical graph knowledge, effectively mitigating catastrophic forgetting.

Generalization and Robustness of the Tilted Empirical Risk

This paper provides systematic generalization error bounds and robustness guarantees for Tilted Empirical Risk (TER) under negative tilt parameters (\(\gamma < 0\)). Under the condition of unbounded loss functions with bounded \((1+\epsilon)\)-th order moments, it establishes a convergence rate of \(O(n^{-\epsilon/(1+\epsilon)})\) using uniform and information-theoretic approaches, and presents a data-driven scheme for selecting the tilt parameter.

Global Convergence and Rich Feature Learning in \(L\)-Layer Infinite-Width Neural Networks under \(\mu\)P Parametrization

This work proves that under \(\mu\)P (Maximal Update Parametrization), when an \(L\)-layer infinite-width MLP is trained using SGD, the features at each layer remain linearly independent and undergo substantial adaptation throughout training, guaranteeing that any convergence point of training is a global minimum. This is the first work to simultaneously achieve both theoretical goals of "rich feature learning" and "global convergence."

Grokking at the Edge of Linear Separability

Reveals the root cause of grokking (delayed generalization) in the simplest logistic regression binary classification task: when the ratio of data dimensionality to the number of samples \(\lambda = d/N\) approaches the critical threshold \(\lambda_c = 1/2\), the training dynamics remain near the overfitted solution for an arbitrarily long time before finally converging to the generalizing solution, resembling the "critical slowing down" phenomenon in physics.

How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias

This work theoretically characterizes the two-stage training dynamics of a single-layer Transformer learning two types of regular language recognition tasks, "even pairs" and "parity check". It proves that the linear layer implicitly converges to the max-margin hyperplane under gradient descent, and reveals the critical role of CoT in solving the parity problem.

Improved Last-Iterate Convergence of Shuffling Gradient Methods for Nonsmooth Convex Optimization

This paper provides the first proof that the last-iterate convergence rates of Random Reshuffle (RR) and Single Shuffle (SS) in nonsmooth (strongly) convex finite-sum optimization are strictly better than that of Proximal GD. Specifically, RR achieves \(\tilde{O}(GD_\star / (n^{1/4}\sqrt{K}))\), closely matching the lower bound \(\Omega(1/(n^{1/4}\sqrt{K}))\).

Improved Sample Complexity for Private Nonsmooth Nonconvex Optimization

This work studies nonsmooth nonconvex (NSNC) stochastic optimization under differential privacy constraints. By improving the effective sensitivity of the gradient estimator, it reduces the best-known sample complexity by a factor of \(\Omega(\sqrt{d})\), and proves for the first time that Goldstein stationarity generalizes from empirical loss to population loss.

In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

This paper theoretically and empirically demonstrates that multi-head softmax attention, trained on linear regression ICL tasks, spontaneously develops elegant attention patterns (diagonal-homogeneous KQ and last-entry-only zero-sum OV). It proves that these structures enable the model to approximate a debiased gradient descent predictor, achieving near-Bayes-optimal performance.

Incremental Gradient Descent with Small Epoch Counts is Surprisingly Slow on Ill-Conditioned Problems

This paper systematically investigates the convergence behavior of incremental gradient descent (IGD) in the small epoch regime (\(K \lesssim \kappa\)). It proves that IGD is at least \(n\) times slower than SGD with replacement in this regime, and its convergence rate catastrophically deteriorates to an exponential rate when the component functions are non-convex.

Integer Programming for Generalized Causal Bootstrap Designs

Proposes an integer programming (IP) based method to numerically solve for the worst-case copula, extending the design uncertainty quantification of causal bootstrap from "complete randomization + difference-in-means estimator" to any known probability assignment and linear/quadratic treatment estimators, and proves asymptotic validity.

Interior-Point Vanishing Problem in Semidefinite Relaxations for Neural Network Verification

This work is the first to identify the "interior-point vanishing" problem when SDP relaxation is applied to deep neural network verification, where the SDP problem loses strict feasibility as the network depth increases, resulting in numerical instability and solver failures. The authors propose five mitigation methods, among which B-Remove (removing layer boundary constraints) is the most effective, resolving 88% of previously unsolvable cases.

Layer-wise Quantization for Quantized Optimistic Dual Averaging

Applying layer-wise quantization (assigning different quantization schemes to different layers) and the Quantized Optimistic Dual Averaging (QODA) algorithm achieves competitive convergence rates on monotone variational inequalities, yielding a 150% end-to-end acceleration in distributed WGAN training.

Learning Mixtures of Experts with EM: A Mirror Descent Perspective

This paper rigorously analyzes the convergence of the EM algorithm for training Mixture of Experts (MoE) models from the perspective of mirror descent. It proves that EM is equivalent to projected mirror descent with KL divergence as the regularization term, establishes conditions for local linear convergence, and demonstrates that EM outperforms gradient descent on both synthetic and real-world datasets.

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

This paper proposes EvalPlanner, which decouples the reasoning process of LLM-as-a-Judge into two phases: "evaluation plan generation" and "plan execution". By iteratively optimizing plan and execution preference pairs using DPO in a self-training loop, it achieves a new generative reward model State-of-the-Art (SOTA) score of 93.9 on RewardBench with only 22K synthetic preference pairs.

Nearly Optimal Sample Complexity for Learning with Label Proportions

This paper studies the sample complexity of learning from label proportions (LLP). Under the squared loss, it establishes nearly optimal upper and lower bounds for the sample complexity, and designs ERM- and SGD-based algorithms that significantly improve the dependency on the bag size compared to existing results.

Nonparametric Teaching for Graph Property Learners

This paper proposes the GraNT paradigm to extend nonparametric teaching theory to graph property learning scenarios. By greedily choosing a subset of graph samples with the "largest prediction deviation," GraNT accelerates GCN training, reducing training time by 30%–47% while maintaining generalization performance.

On Temperature Scaling and Conformal Prediction of Deep Classifiers

This paper presents the first systematic study on the impact of Temperature Scaling (TS) calibration on Conformal Prediction (CP) methods, revealing the counter-intuitive phenomenon where TS improves class-conditional coverage of APS/RAPS-like methods at the expense of increasing the prediction set size. It establishes a complete non-monotonic theoretical explanation and proposes practical guidelines.

On Understanding Attention-Based In-Context Learning for Categorical Data

This work generalizes the in-context learning (ICL) of Transformers from real-valued outputs to categorical data (categorical outcomes). It demonstrates that an architecture alternating between self-attention and cross-attention can exactly implement multi-step functional gradient descent (functional GD). Furthermore, it theoretically proves that this GD parameter configuration constitutes a stationary point of the attention model's loss function.

Optimization over Sparse Support-Preserving Sets: Two-Step Projection with Global Optimality Guarantees

An IHT algorithm with two-step projection (first hard thresholding, then convex projection) is proposed for sparse optimization problems with extra support-preserving constraints. Global objective value guarantees (without systematic error) are provided under RSC/RSS conditions, revealing a new trade-off between sparsity relaxation and suboptimality.

Provable Benefit of Random Permutations over Uniform Sampling in Stochastic Coordinate Descent

This paper provides the first theoretical proof that, in coordinate descent for positive definite quadratic functions, the contraction rate of Random Permutation Coordinate Descent (RPCD) is strictly superior to that of Randomized Coordinate Descent (RCD), thereby resolving a long-standing open theoretical question.

Provable In-Context Vector Arithmetic via Retrieving Task Concepts

This paper proves from an optimization theory perspective that a non-linear Transformer with residual connections and Layer Normalization, trained via gradient descent on QA data, can perform factual recall ICL tasks through vector addition (task vector + query). Additionally, training on ICL data leads to harmful memorization of low-level features.

Quantum Optimization via Gradient-Based Hamiltonian Descent

Integrates gradient information into the Quantum Hamiltonian Descent (QHD) framework to propose gradient-based QHD, achieving convergence rates at least one order of magnitude faster and a higher success probability of finding global optima than the original QHD and classical methods (NAG, SGDM) in both convex and non-convex optimization.

Random Feature Representation Boosting

RFRBoost is proposed, leveraging the gradient representation boosting theory to construct deep residual random feature neural networks. It obtains a closed-form solution under the MSE loss and reduces to a quadratically constrained least squares problem under general loss functions. It significantly outperforms single-layer RFNNs and end-to-end trained MLP ResNets on tabular data.

Revisiting Unbiased Implicit Variational Inference

Revisiting the "impractical" Unbiased Implicit Variational Inference (UIVI) by replacing its inner MCMC loop with importance sampling, and unbiasedly learning the optimal proposal distribution by minimizing the expected forward KL divergence, which achieves or surpasses the SOTA on standard SIVI benchmarks.

REVOLVE: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization

REVOLVE guides optimization by tracking the "evolutionary" trends of responses across iterations in LLM systems. It is more stable and efficient than immediate-feedback-based methods like TextGrad, improving prompt optimization, solution refinement, and code optimization by 7.8%, 20.72%, and 29.17% respectively.

Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift

This paper proposes NS-DPO, which incorporates a single exponential decay parameter \(\gamma\) into the Dynamic Bradley-Terry model to temporally weight training data. This enables robust DPO alignment under preference drift over time, without sacrificing performance in stationary scenarios.

SAFER: A Calibrated Risk-Aware Multimodal Recommendation Model for Dynamic Treatment Regimes

The SAFER framework is proposed to integrate multimodal information from structured EHR and clinical notes. It utilizes KL divergence to measure label uncertainty and incorporates conformal prediction to control the FDR, providing statistical safety guarantees for high-risk dynamic treatment recommendations.

Sassha: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation

This work proposes the Sassha optimizer, which introduces sharpness-aware minimization (SAM) into the second-order optimization framework. Through stable Hessian approximation and a lazy update strategy, it enables second-order methods to comprehensively outperform first-order methods such as SGD, AdamW, and SAM in generalization performance for the first time.

SDP-CROWN: Efficient Bound Propagation for Neural Network Verification with Tightness of Semidefinite Programming

Proposes SDP-CROWN, which integrates the tightness of Semidefinite Programming (SDP) relaxation into the linear bound propagation framework. By adding only one parameter \(\lambda\) per layer, it tightens the verification relaxation by up to a factor of \(\sqrt{n}\) under \(\ell_2\) perturbations while maintaining scalability comparable to \(\alpha\)-CROWN.

Sparse Causal Discovery with Generative Intervention for Unsupervised Graph Domain Adaptation

This paper proposes the SLOGAN framework, which decouples causal and spurious features through sparse causal graph construction and information bottlenecks. It incorporates a generative intervention mechanism utilizing cross-domain spurious feature swapping alongside class-adaptive dynamic pseudo-label calibration to achieve stable causal feature transfer in unsupervised graph domain adaptation.

Statistical and Computational Guarantees of Kernel Max-Sliced Wasserstein Distances

This paper provides sharp finite-sample statistical guarantees (dimension-free, with a convergence rate of \(n^{-1/(2p)}\)) and computational guarantees (proving exact computation is NP-hard and proposing an efficient semidefinite relaxation SDR alongside a first-order algorithm) for the Kernel Max-Sliced (KMS) Wasserstein distance. Its superior performance is demonstrated in high-dimensional two-sample testing, human activity detection, and generative modeling.

Subspace Optimization for Large Language Models with Convergence Guarantees

This paper reveals that GaLore (a subspace optimization algorithm) does not always converge under stochastic settings, and proposes GoLore (Gradient Random Low-rank Projection)—a provably convergent variant that guarantees convergence even under standard batch sizes.

Synonymous Variational Inference for Perceptual Image Compression

Based on the perspective of synonymity in semantic information theory, this work proposes the Synonymous Variational Inference (SVI) method to theoretically prove that the optimization direction for perceptual image compression is a rate-distortion-perception three-way trade-off. It also designs a progressive Synonymous Image Compression (SIC) codec, allowing a single model to cover multiple bitrates and perceptual quality levels.

The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions

Through the "spawn-and-perturb" experimental paradigm, the sensitivity of neural network training trajectories to initial conditions is systematically investigated, showing that infinitesimal perturbations (even to a single weight) at the very beginning of training can lead to entirely different convergence outcomes—the "butterfly effect". This instability is independent of training noise and rapidly decays as training progresses.

The Panaceas for Improving Low-Rank Decomposition in Communication-Efficient Federated Learning

Addressing three core questions of low-rank decomposition in federated learning (what to decompose, how to decompose, and how to aggregate), this paper proposes three complementary technologies—MUD (Model Update Decomposition), BKD (Block-wise Kronecker Decomposition), and AAD (Aggregation-Aware Decomposition)—to achieve faster convergence and higher accuracy while maintaining low communication overhead.

Tilted Sharpness-Aware Minimization

This paper proposes Tilted SAM (TSAM), which utilizes exponential tilting to smooth the min-max objective of SAM into a soft optimization that weights multiple local solutions within a neighborhood by their loss values. Theoretically, TSAM is smoother and exhibits a stronger preference for flat minima. Empirically, it consistently outperforms SAM and its variants across both vision and language tasks.

Training Dynamics of In-Context Learning in Linear Attention

This paper completely characterizes the dynamical process of multi-head linear attention acquiring in-context learning (ICL) capabilities during gradient flow training: the merged KQ parametrization exhibits a single abrupt loss drop, whereas the separate KQ parametrization displays saddle-to-saddle progressive learning of principal component regression with staircase training dynamics.

Transformative or Conservative? Conservation Laws for ResNets and Transformers

This work systematically derives and proves conservation laws under gradient flow training dynamics for modern architectures such as convolutional ResNets and Transformers. It reveals that residual connections do not alter conservation laws, block-level conservation laws are equivalent to those of isolated blocks, and the conservation error under discrete SGD is \(O(\text{step-size}^2)\).

Understanding Mode Connectivity via Parameter Space Symmetry

Analyzes the topological connectivity of the set of minima of neural network loss functions through continuous symmetries in parameter space (e.g., \(GL_h(\mathbb{R})\)). It derives that the number of connected components of the minima for linear networks is \(2^{l-1}\), proves that skip connections can reduce this number, and provides explicit symmetry-induced low-loss connecting paths as well as sufficient conditions for linear mode connectivity to approximately hold.

Understanding Sharpness Dynamics in NN Training with a Minimalist Example: The Effects of Dataset Difficulty, Depth, Stochasticity, and More

Proposed a minimalist model using a "deep linear network with a single neuron per layer" to systematically study progressive sharpening and edge of stability phenomena. The concept of dataset difficulty \(Q\) is introduced, and both upper and lower bounds of sharpness at global optima are derived. The theoretical analysis uncovers the impact mechanisms of dataset size, network depth, batch size, and learning rate on sharpness dynamics.

Understanding the Statistical Accuracy-Communication Trade-off in Personalized Federated Learning with Minimax Guarantees

This paper provides the first quantitative characterization of how the personalization parameter \(\lambda\) simultaneously affects statistical accuracy and communication efficiency in personalized federated learning, establishes the minimax optimal statistical rate, and proposes the FedCLUP algorithm to achieve the optimal statistical-communication trade-off.

Widening the Network Mitigates the Impact of Data Heterogeneity on FedAvg

Based on Neural Tangent Kernel (NTK) theory, this work proves that the upper bound of model divergence caused by data heterogeneity in FedAvg is \(\mathcal{O}(n^{-1/2})\) (where \(n\) represents the network width). In the infinite-width limit, both global and local models linearized. Consequently, FedAvg with the same number of iterations is equivalent to centralized gradient descent, achieving identical generalization performance.