
📐 Optimization & Theory

🧪 ICML2026 · 88 paper notes

📌 Same area in other venues: 📷 CVPR2026 (22) · 🔬 ICLR2026 (222) · 🤖 AAAI2026 (21) · 🧠 NeurIPS2025 (126) · 📹 ICCV2025 (7) · 🧪 ICML2025 (61)

🔥 Top topics: Federated Learning ×5 · LLM ×4 · Adversarial Robustness ×2 · Agents ×2 · Alignment/RLHF ×2

A2SG: Adaptive and Asymmetric Surrogate Gradients for Training Deep Spiking Neural Networks: To address the dual issues of "sharp loss landscapes" and "conflicting gradients across timesteps" in deep Spiking Neural Networks (SNNs) trained with surrogate gradients, this paper proposes a unified framework, A2SG. On one hand, it employs an adaptive effective window width (automatically adjusting \(\beta\) based on Spatial Gradient Variation (SGV) and Temporal Gradient Consistency (TGC)) to suppress gradient variation and align directions across timesteps. On the other hand, it replaces symmetric surrogate functions with an asymmetric shape that allocates gradients based on membrane potential levels. It theoretically proves that asymmetric shapes exhibit lower variation than symmetric ones and that smaller local gradient variation leads to flatter loss landscapes, consistently improving accuracy and energy efficiency across CNN and Transformer-based SNNs.
A Fully First-Order Layer for Differentiable Optimization: Mainstream differentiable optimization layers rely on implicit differentiation of KKT conditions, which requires computing Hessians and solving large KKT linear systems, making them difficult to scale to large problems. This paper rewrites differentiable optimization as a bi-level optimization, constructing a "ghost proxy" problem with a fixed active set and linearized active constraints to simplify inequality constraints into equality constraints locally. It then uses finite differences to estimate the hypergradient using only first-order information within nearly constant \(\mathcal{O}(\log(1/\epsilon))\) calls. The authors implement FFOLayer, a PyTorch library that is plug-and-play with any convex solver (including GUROBI/MOSEK). It achieves convergence comparable to exact methods while computational time and peak memory grow nearly sublinearly with problem scale.
A General Framework for Dynamic Consistent Submodular Maximization: This paper presents a general consistency framework for fully dynamic submodular maximization. In streaming environments with insertions and deletions, it provides the first constant approximation guarantees with sublinear worst-case per-step solution changes (recourse) for both cardinality and matroid constraints.
Accelerated Multiple Wasserstein Gradient Flows for Multi-objective Distributional Optimization: This paper generalizes Multiple Wasserstein Gradient Descent into continuous-time gradient flows and introduces Nesterov-style momentum acceleration to obtain A-MWGraD. Theoretically, it improves the convergence rate to the weak Pareto optimum from \(O(1/t)\) to \(O(1/t^2)\) in geodesically convex scenarios. Empirically, it accelerates convergence in multi-target sampling and Bayesian multi-task learning.
AdaGC: Enhancing LLM Pretraining Stability via Adaptive Gradient Clipping: To address the recurring loss spikes in large model pretraining, AdaGC replaces the "one-size-fits-all" Global Gradient Clipping with "per-tensor adaptive clipping based on the EMA of its own historical gradient norm." By suppressing abnormal gradients before they pollute the optimizer's first and second moments, it reduces spike scores to zero on Llama-2 7B, Mixtral 8×1B, and ERNIE 10B-A1.4B, while improving downstream accuracy by +1.32%, +1.27%, and +2.48% respectively compared to Global Gradient Clipping (GlobalGC).
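
The per-tensor idea in the AdaGC entry above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the relative threshold `lam`, the EMA decay `beta`, their default values, and the choice to update the EMA with the clipped norm are all assumptions made for the example.

```python
import math

def adaptive_clip(grads, ema_norms, lam=1.05, beta=0.98):
    """Clip each tensor's gradient against the EMA of its own past norms.

    grads:     dict name -> list of floats (a flattened gradient tensor)
    ema_norms: dict name -> float, running EMA of that tensor's gradient norm
    lam, beta: illustrative hyperparameters (assumed, not taken from the paper)
    """
    clipped = {}
    for name, g in grads.items():
        norm = math.sqrt(sum(x * x for x in g))
        ema = ema_norms.get(name)
        if ema is not None and norm > lam * ema:
            scale = lam * ema / norm      # shrink the outlier gradient
            g = [x * scale for x in g]
            norm = lam * ema
        # Update the EMA with the (possibly clipped) norm so that a spike
        # cannot inflate the reference used at later steps.
        ema_norms[name] = norm if ema is None else beta * ema + (1 - beta) * norm
        clipped[name] = g
    return clipped
```

Because clipping happens before the gradient reaches the optimizer, an abnormal step on one tensor does not leak into the moment estimates of the others, which is the contrast with a single global threshold.
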
Adaptive Estimation and Inference in Semi-parametric Heterogeneous Clustered Multitask Learning via Neyman Orthogonality: This paper bridges Double Machine Learning (DML) and clustered multitask learning by proposing an adaptive framework that combines Neyman orthogonality with a data-driven pairwise fusion penalty. In semi-parametric settings with heterogeneous (potentially infinite-dimensional) nuisance parameters, it accurately recovers latent task clusters, achieves oracle-level aggregation rates, and establishes asymptotic normality for valid statistical inference.
Adaptive Preconditioners Trigger Loss Spikes in Adam: This paper attributes loss spikes in Adam training to the lag-induced decoupling between the second-moment preconditioner and the current squared gradients, and explains as well as predicts spike occurrences using the curvature of the preconditioned Hessian in the gradient direction.
Adaptive Sharpness-Aware Minimization with a Polyak-type Step size: A Theory-Grounded Scheduler: This paper generalizes the Polyak step size to USAM/SAM, providing a sharpness-aware scheduler that does not rely on manual learning rate tuning. Its stability and performance are verified through convex optimization theory and CIFAR experiments.
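
For orientation, the classical Polyak step size that this entry builds on sets \(\eta_t = (f(x_t) - f^\star)/\|\nabla f(x_t)\|^2\). The sketch below applies it to plain gradient descent only; it does not reproduce the paper's USAM/SAM extension, and the function name `polyak_gd` and the assumption that the optimal value `f_star` is known are illustrative.

```python
def polyak_gd(f, grad, x, f_star=0.0, steps=50):
    """Plain gradient descent with the classical Polyak step size.

    f_star is the optimal objective value, assumed known here.
    """
    for _ in range(steps):
        g = grad(x)
        g2 = sum(gi * gi for gi in g)
        if g2 == 0.0:                     # already stationary
            break
        eta = (f(x) - f_star) / g2        # Polyak step: no learning rate to tune
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# Usage on an ill-conditioned convex quadratic with minimum value 0.
f = lambda x: 0.5 * (x[0] ** 2 + 10 * x[1] ** 2)
grad = lambda x: [x[0], 10 * x[1]]
x_end = polyak_gd(f, grad, [1.0, 1.0])
```

For convex objectives with the correct `f_star`, each step moves the iterate strictly closer to the minimizer, which is why no manual learning-rate schedule is needed in this simple setting.
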
Asymmetric Perturbation in Solving Bilinear Saddle-Point Optimization: This paper demonstrates that perturbing the payoff of only one player in a bilinear zero-sum game preserves the original equilibrium under a sufficiently small perturbation. Based on this, the authors construct AsymP-GDA, which theoretically achieves linear last-iterate convergence and approaches the original equilibrium faster and more accurately than symmetric perturbation in normal-form and extensive-form game experiments.
Automatic Unsupervised Ensemble Outlier Model Selection–Extended Version: The MetaEns framework is proposed to adaptively and greedily construct compact, high-quality anomaly detection ensembles under unlabeled conditions. It works by predicting the marginal ensemble gain of candidate detectors through meta-learning, combined with a proxy objective function featuring diversity discounts and algorithm family risk regularization.
Balanced LoRA: Removing Parameter Invariance to Accelerate Convergence: This paper reveals that the overparameterization of LoRA leads to varying condition numbers for different low-rank factors \((A, B)\). It proves that the balanced minimum point (\(A^\top A = BB^\top\)) possesses the optimal condition number. Based on this, it proposes BaLoRA—projecting adapters onto the balanced manifold after each optimization step to accelerate convergence and enhance fine-tuning performance with almost zero overhead.
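
The parameter invariance mentioned in the Balanced LoRA entry is easiest to see in the rank-1 case, where the adapter is an outer product \(ab^\top\) and the balance condition \(A^\top A = BB^\top\) reduces to \(\|a\|^2 = \|b\|^2\). The sketch below covers only that special case, with a hypothetical helper name; the paper's projection for general rank is not reproduced here.

```python
import math

def balance_rank1(a, b):
    """Rescale rank-1 factors so ||a||^2 == ||b||^2 without changing a b^T."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return list(a), list(b)           # nothing to balance
    c = math.sqrt(nb / na)                # (a*c)(b/c)^T equals a b^T
    return [x * c for x in a], [x / c for x in b]
```

The rescaling leaves the adapter's output untouched, so it changes only the conditioning of the optimization problem, not the function being represented.
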
Balancing Learning Rates Across Layers: Exact Two-Step Dynamics and Optimal Scaling in Linear Neural Networks: This paper derives exact closed-form expressions for the test loss after one and two steps of gradient descent in two- and three-layer linear neural networks. It reveals a phase transition phenomenon: asymmetric learning rates are optimal for the first step, while symmetric (balanced) learning rates become locally optimal after the second step, providing a theoretical foundation for layer-wise learning rate scheduling.
Bayesian Gated Non-Negative Contrastive Learning: To address the optimization conflict (gradient oscillation) caused by shared background features in Non-Negative Contrastive Learning (NCL), this paper proposes BayesNCL. A Bayesian gating head learns a Bernoulli distribution for each feature dimension to dynamically filter high-frequency public features, achieving a 142.1% improvement in semantic consistency on ImageNet-100 without sacrificing downstream accuracy.
Bregman meets Lévy: Stochastic Mirror Descent with Heavy-Tailed Noise in Continuous and Discrete Time: This paper proposes Lévy Mirror Flow (LMF)—a continuous-time SDE model for Stochastic Mirror Descent driven by Lévy noise. It proves that SMD maintains convergence guarantees even under heavy-tailed gradient noise with infinite variance (convex case \(O(\varepsilon^{-p/(p-1)})\), strongly convex case \(\tilde{O}(\varepsilon^{-1/(p-1)})\)), and seamlessly transfers continuous-time results to discrete-time algorithms.
Budget-Feasible Mechanisms for Submodular Welfare Maximization in Procurement Auctions: This paper introduces BFM-SWM, the first truthful mechanism with approximation guarantees for submodular welfare maximization in procurement auctions under budget constraints and private costs. By utilizing a descending clock auction with geometrically increasing thresholds, single-point protection, and a price/payment ratio parameter \(\beta\) to ensure non-negative surplus and budget feasibility, the mechanism achieves a 0.0328-approximation for general submodular functions and a 0.0877-approximation for monotone submodular functions. As a byproduct, BFM-VM improves the best deterministic approximation ratio for valuation maximization from 1/64 to \(1/(12+4\sqrt{3})\approx 0.0528\), while reducing the runtime from \(\mathcal{O}(n^2\log n)\) to \(\mathcal{O}(n\log n)\).
Bulk-Calibrated Credal Ambiguity Sets: Fast, Tractable Decision Making under Out-of-Sample Contamination: To address the long-standing issue where placing a "Huber (linear-vacuous) contamination set" in an unbounded space causes the worst-case risk to become \(+\infty\) and the DRO objective to fail, this paper proposes bulk-calibrated credal ambiguity sets. By learning a high-probability "bulk set" \(\Xi_0\) from data, restricting the contamination budget solely within \(\Xi_0\), and using moment conditions to separately control the tail, the authors derive a closed-form \(\text{mean}+\sup\) robust objective. This objective is fast, finite, and can be solved via common LP/SOCP for standard losses.
Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad: This paper provides the first proof that AdaGrad converges under heavy-tailed noise (\(p \in (4/3, 2]\)) without any algorithmic modifications. It also establishes an algorithm-dependent lower bound showing that AdaGrad cannot achieve the minimax optimal rate, while proving that AdaGrad-Norm can achieve a faster rate of \(O(1/T^{(p-1)/(2p)})\) under the assumption of a bounded objective function.
CLoVE: Personalized Federated Learning through Clustering of Loss Vector Embeddings: CLoVE utilizes the "loss vector of each client across all candidate models" as a client embedding for Clustered Federated Learning (CFL). Based on the observation that "clients in the same cluster share similar loss patterns while those in different clusters exhibit significantly different patterns," the method recovers correct client clusters and trains cluster-specific models within a few communication rounds without requiring meticulous model initialization, achieving SOTA across numerous non-IID settings.
Colorful Pinball: Density-Weighted Quantile Regression for Conditional Guarantee of Conformal Prediction: This paper reveals the inherent flaw of standard pinball loss in optimizing conditional coverage through Taylor expansion—specifically, its neglect of heteroscedasticity. It proposes the density-weighted pinball loss as a tighter surrogate objective for the Mean Squared Coverage Error (MSCE) and designs a triple-head quantile network using finite differences to estimate density weights, significantly improving conditional coverage performance across 8 high-dimensional regression benchmarks.
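
As a reference point for the Colorful Pinball entry, the standard pinball loss and a generic weighted variant can be written as below. The weights are passed in as plain numbers standing in for density estimates; how the paper estimates and normalizes them is not shown, and normalizing by the weight sum is an assumption of this sketch.

```python
def pinball(y, q, tau):
    """Standard pinball (quantile) loss for target y, prediction q, level tau."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

def weighted_pinball(ys, qs, tau, weights):
    """Weighted mean of pinball losses; weights stand in for density estimates."""
    total = sum(w * pinball(y, q, tau) for y, q, w in zip(ys, qs, weights))
    return total / sum(weights)
```

With uniform weights the second function reduces to the ordinary mean pinball loss, so the weighting is the only place where heteroscedasticity can enter.
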
Conflicting Biases at the Edge of Stability: Norm versus Sharpness Regularization: This is an analytical paper: the authors point out that Gradient Descent (GD) simultaneously exhibits two conflicting implicit biases—small learning rates (LR) tend to suppress the parameter norm, while large learning rates (Edge of Stability) tend to suppress the loss sharpness. The learning rate interpolates between the two, and the authors observe a phase transition divided by a critical learning rate \(\eta_c\). They further use a theoretical counter-example of a diagonal linear network to prove that "any single implicit bias is insufficient to explain generalization."
Conservation Laws for Modern Neural Architectures: This paper reformulates the problem of "characterizing all conserved quantities in training dynamics" as solving a data-independent partial differential equation (PDE). Leveraging meromorphic continuation techniques from complex analysis, it provides for the first time a complete list of conservation laws for GELU/SiLU/SwiGLU feed-forward networks, multi-head attention (including sinusoidal PE and RoPE), and various gated MoEs, effectively solving the open problem for multi-head attention posed by Marcotte et al. (2025).
Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial Corruption: Under heavy-tailed noise and constant-proportion strong adversarial corruption, the authors prove that a dimension-independent, constant-radius convex basin exists in the squared loss of Gaussian Single-Index Models for a wide class of non-monotonic link functions (GeLU, Swish, Tanh, Probit, Logistic, Phase Retrieval...). Based on this, they design a robust recovery algorithm with \(\tilde{O}(nd)\) time and \(\tilde{O}(d)\) sample complexity, achieving a final estimation error of \(O(\sigma\sqrt{\epsilon})\).
Cost-Aware Stopping for Bayesian Optimization: The authors extend Weitzman's Pandora's Box stopping rule to Bayesian Optimization (BO) with correlations. They prove that under a shared "acquisition function value crossing the current best" stopping rule, the PBGI and LogEIPC cost-aware acquisition functions achieve an expected cost-adjusted simple regret no worse than "stopping after one sample." This provides the first adaptive stopping rule with theoretical guarantees for cost-adjusted simple regret.
Delayed Momentum Aggregation: Communication-efficient Byzantine-robust Federated Learning with Partial Participation: Addressing the pain point where "Byzantine clients temporarily form a majority in sampled clients" collapses existing robust aggregators under partial participation, this paper proposes the Delayed Momentum Aggregation principle. The server feeds the current round's new momentum along with the most recent cached momentum from unsampled clients into the robust aggregator, effectively maintaining the global Byzantine ratio \(\delta < 1/2\) in every aggregation round. Based on this, the DeMoA optimizer is designed, which achieves stable ResNet-18/CIFAR-10 training even under extreme settings of \(p=0.1\) and \(\delta=0.2\).
Depth over Fidelity in Fixed-Budget Noisy Evolution Strategies: In noisy black-box optimization where the number of evaluations is strictly limited (fixed budget), instead of spending budget on repeated measurements to "clean" intra-generational rankings (fidelity), it is more effective to save that budget to perform more distribution updates (depth). This paper introduces PEM (Probabilistic Elite Membership) to replace hard ranking weights with "expected weights over ranking uncertainty" and utilizes Residual Bootstrap (RB-PEM) to estimate it with near-zero additional overhead. This approach consistently outperforms the mainstream "denoise then rank" paradigm in high-misranking, budget-constrained scenarios.
Differentially Private Submodular Maximization with a Knapsack Constraint: This paper introduces differentially private algorithms for submodular maximization with a knapsack constraint (SMK). For monotone objectives, it achieves the optimal \((1-1/e)\) approximation while improving the additive error from polynomial dependency on \(n\) to polylogarithmic dependency and reducing query complexity from exponential to polynomial. Additionally, it provides the first DP algorithm with provable guarantees (\(1/4\) approximation) for non-monotone objectives.
Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic: This paper proposes DELTA, which distills intermediate activations from a "tangent space linearized teacher" into a standard non-linear student in an online manner. Combined with EK-FAC curvature regularization and sampling along the interpolation path, DELTA ensures that task vectors from standard non-linear fine-tuning inherit properties like "addability, low interference, and robustness to scaling" typically found in linearized models, without introducing any inference overhead.
Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation: This paper proposes the AgentPulse framework, which combines split conformal prediction, adaptive conformal inference (ACI), Mondrian conformal, and BH-FDR to provide distribution-free coverage guarantees, composite pipeline uncertainty bounds, and ranking abstention mechanisms with FDR control for continuous scoring of 50 AI agents, treating "measurement uncertainty" as a first-class evaluation output.
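
Of the components listed in the AgentPulse entry, split conformal prediction is the simplest to illustrate. The sketch below is the textbook procedure for absolute-residual scores, not the framework's code; the function names are hypothetical.

```python
import math

def split_conformal_radius(cal_residuals, alpha):
    """Conformal quantile of absolute residuals from a held-out calibration set."""
    n = len(cal_residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the finite-sample quantile
    if k > n:
        return float("inf")               # too few calibration points for this alpha
    return sorted(cal_residuals)[k - 1]

def interval(pred, radius):
    """Symmetric prediction interval around a point prediction."""
    return (pred - radius, pred + radius)
```

Under exchangeability of calibration and test scores, the resulting interval covers the true score with probability at least \(1-\alpha\), regardless of the score distribution.
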
Distribution Alignment for One-Shot Federated Learning via Optimal Transport: This paper proposes SLOT-Align, a training-free, single-round federated feature alignment framework. Each client computes the first and second-order statistics of features using a shared frozen encoder. The server aggregates these into a global reference via a Bures–Wasserstein barycenter. Clients then align local features to this reference using closed-form Optimal Transport (OT) mappings between Gaussians. This approach consistently improves accuracy in extreme one-shot federated scenarios where domain shift is coupled with label shift.
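
The closed-form OT map between Gaussians \(\mathcal{N}(m_1,\Sigma_1)\) and \(\mathcal{N}(m_2,\Sigma_2)\) is \(T(x) = m_2 + A(x - m_1)\) with \(A = \Sigma_1^{-1/2}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}\). The sketch below restricts this to diagonal covariances, where \(A\) is a per-dimension ratio of standard deviations; the diagonal restriction and the function name are simplifying assumptions, not part of SLOT-Align as described.

```python
import math

def diag_gaussian_ot_map(mean_src, var_src, mean_ref, var_ref):
    """Return the OT map from a diagonal Gaussian (src) to another (ref)."""
    scales = [math.sqrt(vr / vs) for vs, vr in zip(var_src, var_ref)]

    def transport(x):
        # Center with the source mean, rescale, then shift to the reference mean.
        return [mr + s * (xi - ms)
                for xi, ms, mr, s in zip(x, mean_src, mean_ref, scales)]

    return transport

# Usage: align local features to a global reference using only statistics.
T = diag_gaussian_ot_map([0.0, 1.0], [1.0, 4.0], [2.0, -1.0], [9.0, 1.0])
aligned = T([1.0, 3.0])
```

Only means and variances cross the network in this sketch, which mirrors why a statistics-based alignment fits a single communication round.
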
Diversity-Driven Offline Multi-Objective Optimization via Nested Pareto Set Learning: For offline multi-objective optimization (offline MOO) where "only a fixed offline dataset is available and the true objective function cannot be queried," this paper proposes DOMOO. It utilizes nested Pareto set learning to jointly update preferences and models, embeds an out-of-distribution (OOD) risk suppression factor into the preference gradient, and employs an offline-specific \(\text{IGD}_{\text{offline}}\) metric for diversity filtering, thereby obtaining a solution set with better convergence and diversity simultaneously.
Dynamics and Representation Structure of Local Approximations to Gradient-Based Learning in Linear Recurrent Neural Networks: This paper derives analytical ODEs for the updates of BPTT, one-step tBPTT, and RFLO in student–teacher data-aligned linear RNNs. Comparing their fixed-point manifolds, stability, and convergence rates shows that RFLO avoids the non-optimal saddle manifold of BPTT/tBPTT, but at the cost of sign-dependent stability and slower convergence. Furthermore, RFLO is intrinsically limited to low-rank perturbations of initial weights, a constraint that generalizes to non-data-aligned settings.
Efficient Stochastic Optimisation via Sequential Monte Carlo: When the gradient of the loss is formulated as an expectation over an intractable parameter-dependent distribution \(\pi_\theta\), conventional approaches require an expensive MCMC inner loop for each optimization step. This paper proposes SOSMC, which utilizes a Sequential Monte Carlo (SMC) sampler to link the sequence of distributions \((\pi_{\theta_k})_k\) that evolve slowly with the parameters. By reusing particles from the previous step and obtaining weighted gradient estimates, the method eliminates the inner loop, reducing computational cost while providing convergence guarantees. It outperforms single/double-loop baselines in tasks such as EBM reward tuning and image deblurring.
Enhancing LLM Training via Spectral Clipping: This paper proposes SPECTRA: an optimizer-agnostic wrapper that applies post-spectral clipping to the update matrix and optional pre-spectral clipping to the original gradients. Theoretically equivalent to the composite Frank-Wolfe algorithm with weight regularization, it consistently reduces validation loss for AdamW / Signum / Mars / AdEMAMix in 124M–1.5B LLM pre-training.
ePC: Fast and Deep Predictive Coding in Digital Simulation: This paper identifies a neglected root cause where "state-based Predictive Coding (sPC) exponentially decays training signals with network depth in digital simulations," leading to training failure and extremely slow convergence in deep networks. The authors propose ePC, an equivalent reparameterization that changes the optimization variable from states to errors. It calculates exactly the same state equilibria and weight gradients as sPC but utilizes reverse-mode AD to allow signals to reach all layers in a single step. Consequently, ePC converges over 100 times faster in deep networks and matches the performance of backpropagation on deep architectures.
Flatland: The Adventures of Gradient Descent with Large Step Sizes: This paper provides a unified definition of "large step size" requiring only "local Lipschitz / Hölder gradient continuity." By constructing a first-order adaptive step size using equality-type non-monotone line search, Gradient Descent (GD) is driven to run on the Edge of Stability (EoS) from the start of training, suppressing sharpness to the global minimum of \(2/K\). It further discovers that "premature entry into the global flat region is harmful" and employs a self-stabilization constraint to recover failed training sessions.
FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo: FOAM couples Shampoo's damping coefficient \(\epsilon\) and eigenvalue decomposition (EVD) trigger frequency into a feedback control loop via a "relative operator error proxy \(h_t\)" that can be cheaply estimated in the stale eigenspace. It reduces EVD calls by over 80% on large model training while maintaining convergence quality.
Follow-the-Perturbed-Leader for Decoupled Bandits: Best-of-Both-Worlds and Practicality: This paper designs the first Best-of-Both-Worlds (BOBW) FTPL algorithm for the decoupled multi-armed bandit problem (where each round selects one arm to "exploit" and another to "explore"). By employing Pareto perturbations for exploitation and a proxy \(q_{t,i}\)—dependent only on the ranking of cumulative estimated losses—to define the exploration distribution, the algorithm eliminates the need for per-step convex optimization required by FTRL or geometric resampling required by standard FTPL. It achieves regret bounds of \(\mathcal{O}(\sqrt{KT})\) in adversarial environments and \(\mathcal{O}(K/\Delta_{\min})\) in stochastic environments, matching state-of-the-art FTRL methods while being approximately 130× faster for \(K=2\).
Full-Batch Gradient Descent Outperforms One-Pass SGD: Sample Complexity Separation in Single-Index Learning: This paper provides a rigorous proof in Gaussian single-index models with quadratic activation: while full-batch gradient descent (GD) "reusing all data" under naive quadratic activation is no more sample-efficient than one-pass SGD (both requiring \(n\gtrsim d\log d\)), simply truncating the activation allows full-batch GD to achieve weak or even strong recovery with \(n\gtrsim d\) (linear sample size). This establishes a \(\log d\) sample complexity separation from one-pass SGD, which still requires \(d\log d\).
Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway: Previous analyses using Gradient Flow (GF) for multi-pathway deep linear networks concluded that a "winner-takes-all" phenomenon arises, in which signals concentrate on a single path, leading to symmetry breaking. This paper demonstrates that discrete Gradient Descent (GD) with large step sizes tells a different story: single-path solutions are sharp minima, whereas distributing signals across multiple paths reduces sharpness by a factor of \(H^{2/L-1}\). Consequently, oscillations at the Edge of Stability (EoS) overturn early symmetry breaking and lead into a "path rebalancing" phase, ultimately favoring shared over single-path exclusive representations.
HO-SFL: Hybrid-Order Split Federated Learning with Backprop-Free Clients and Dimension-Free Aggregation: HO-SFL decouples the client and server in Split Federated Learning (SFL) through Lagrangian variable lifting. The server continues to execute first-order backpropagation (BP), while clients perform only zeroth-order (ZO) perturbed forward passes. By leveraging shared random seeds, the uplink communication per round is compressed to \(\mathcal{O}(P)\) scalars. This reduces the VRAM requirement for fine-tuning large models on edge devices to inference-level while maintaining a convergence rate of \(\mathcal{O}(\sqrt{d_c/PT})\).
Improved Convergence Analysis of Topology Dependence in Decentralized SGD: This paper provides a tighter convergence analysis for Decentralized SGD by replacing the topological factor—previously determined solely by the "spectral gap (second largest eigenvalue)"—with the "entire spectrum of the mixing matrix." This theoretically explains for the first time why sparse topologies like rings perform significantly better in training than pessimistic predictions suggest when data is near-homogeneous.
Interpretability and Generalization Bounds for Learning Spatial Physics: This paper employs numerical analysis tools to prove that a learned solution operator \(\mathbf{W}\) for linear PDEs (e.g., 1D Poisson) converges only to the projection of the true operator \(\mathbf{A}\) onto the training function space, denoted as \(\mathbf{A}\mathbf{U}\mathbf{U}^\top\). Consequently, the function space itself—rather than data volume or grid fineness—determines OOD generalization. The authors propose a mechanistic interpretability technique that visualizes whether the Green's function structure is learned by applying the weight matrix to one-hot vectors. Using a 25×25 cross-dataset evaluation, they identify the failure modes of eight classes of SciML models, including PINNs, DeepONets, and FNOs.
Learning-Augmented Scalable Linear Assignment Problem Optimization via Neural Dual Warm-Starts: Training a lightweight network to predict the dual variables \(\hat{u}\) of the Linear Assignment Problem (LAP) and constructing feasible duals \(\hat{v}\) via the Min-Trick provides a warm-start for the LAPJV exact solver. This approach accelerates end-to-end solving of \(N=16{,}384\) scale instances by over \(2\times\) while maintaining optimality.
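
The LAP dual requires \(u_i + v_j \le c_{ij}\) for all \(i, j\), so any predicted \(\hat{u}\) can be completed to a feasible pair by taking \(\hat{v}_j = \min_i (c_{ij} - \hat{u}_i)\). The sketch below assumes that this column-wise minimum is what the entry calls the Min-Trick; the helper names are hypothetical and the brute-force check is only for tiny instances.

```python
from itertools import permutations

def min_trick(cost, u):
    """Complete predicted row duals u to feasible column duals v."""
    n = len(cost)
    return [min(cost[i][j] - u[i] for i in range(n)) for j in range(n)]

def dual_value(u, v):
    """Dual objective; by weak duality it lower-bounds the optimal cost."""
    return sum(u) + sum(v)

def brute_force_optimum(cost):
    """Exact optimum by enumeration, only feasible for very small n."""
    n = len(cost)
    return min(sum(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

cost = [[4, 1, 3], [2, 0, 5], [3, 2, 2]]
u_hat = [1, 0, 2]                 # stand-in for a network prediction
v_hat = min_trick(cost, u_hat)
```

Feasibility holds for any `u_hat`, so a poor prediction can only weaken the warm start; it cannot make the exact solver return a suboptimal assignment.
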
Learning a Zeroth-Order Optimizer for Fine-Tuning LLMs: This paper proposes ZO Fine-tuner: using a "per-block lightweight neural network PertNN" to automatically learn the perturbation variance for each parameter block of an LLM. It upgrades the fixed \(\mathcal{N}(0,I)\) perturbation in MeZO to a block-adaptive non-uniform distribution. On OPT-30B, the auxiliary network occupies <2MB yet outperforms existing zeroth-order (ZO) baselines in 82.1% of 28 experiment pairs (4 LLMs × 7 datasets), achieving "train once, reuse across tasks and derived models."
Learning Context-Conditioned Predicate Semantics via Prototype Feedback: AlignG transforms the static predicate prototypes of PE-Net into "image-conditioned" dynamic prototypes: it first performs incremental GRU updates on prototypes using relation candidates to obtain image-specific prototypes, then feeds these back to recalibrate relation features, while anchoring the alignment loss to static global prototypes to prevent drift. It achieves F@100 gains of 1.4 and 2.7 on VG-150 and GQA-200 SGDet settings, respectively.
Learning Dynamics of Zeroth-Order Optimization: A Kernel Perspective: This paper adopts empirical NTK (eNTK) as a unified perspective to prove that the eNTK induced by zeroth-order (ZO) SGD is equivalent to projecting the first-order (FO) eNTK onto a random subspace spanned by perturbations. Using the Johnson-Lindenstrauss (JL) Lemma, the authors explain why ZO methods remain effective for billion-parameter LLMs: the error depends only on the output dimension \(V\) and the perturbation count \(P\), and is independent of the model dimension \(d\).
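
The random-subspace view in this entry can be made concrete with a generic two-point zeroth-order estimator averaged over \(P\) Gaussian perturbations. This is a sketch of the common MeZO-style estimator, not code from the paper; the function name, `eps`, and the seeding scheme are assumptions.

```python
import random

def zo_gradient(f, x, num_perturbations=4, eps=1e-3, seed=0):
    """Two-point zeroth-order gradient estimate using forward evaluations only."""
    rng = random.Random(seed)
    d = len(x)
    grad = [0.0] * d
    for _ in range(num_perturbations):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_plus = [xi + eps * zi for xi, zi in zip(x, z)]
        x_minus = [xi - eps * zi for xi, zi in zip(x, z)]
        # Finite-difference estimate of the directional derivative along z.
        slope = (f(x_plus) - f(x_minus)) / (2 * eps)
        for i in range(d):
            grad[i] += slope * z[i] / num_perturbations
    return grad
```

The estimate always lies in the span of the sampled directions `z`, which is the sense in which the zeroth-order update is a projection of the first-order one onto a random subspace.
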
Learning Locally, Revising Globally: Global Reviser for Federated Learning with Noisy Labels: This paper observes a "delayed memory" phenomenon in the global model of FL regarding noisy labels (memory rate \(\le 30\%\) on CIFAR-10, significantly lower than centralized training). Based on this, FedGR is proposed: using server-side GMM to jointly sieve samples and estimate per-client noise ratios based on aggregated loss proxies, periodically "revising" local EMA teachers with global parameters for distillation, and adding global-local representation consistency regularization. These three modules work synergistically to achieve significant gains over 8 SOTA baselines on CIFAR-10/100 and Clothing1M under dual heterogeneity (label noise \(\times\) non-IID).
Learning Randomized Reductions: This paper formalizes the manual task of "discovering a Randomized Self-Reduction (RSR) for a function \(f\)," which has been stagnant for forty years, into a learning problem based on correlated sampling. The authors construct the Bitween framework: it first utilizes sparse linear regression to mine RSRs within a fixed query set \(\{x+r, x-r, x \cdot r, x, r\}\), and then employs an LLM agent to search in a larger query function space. Ultimately, it pushes RSR coverage from 54% to 80% on the RSR-Bench consisting of 80 mathematical/ML functions and provides the first known RSR expression for the sigmoid function.
Limits of Convergence-Rate Control for Open-Weight Safety: The authors formalize "open-weight safety" as "how to delay the convergence of malicious fine-tuning," proving that the maximum singular value of the Hessian is lower-bounded by the weight spectrum. They design the SpecDef algorithm to strictly decelerate first/second-order optimization but simultaneously prove that any such convergence-rate control method can be bypassed by an adversary at the cost of a "linear increase in model size."
LiMuon: Light and Fast Muon Optimizer for Large Models: LiMuon integrates STORM-style momentum variance reduction with Randomized SVD (RSVD) into the Muon optimizer. It compresses matrix parameter momentum from \(m \times n\) to \((m+n)\hat{r}\) while reducing the SFO complexity for finding \(\epsilon\)-stationary points from \(\mathcal{O}(\epsilon^{-4})\) to \(\mathcal{O}(\epsilon^{-3})\). It simultaneously achieves lower perplexity/higher accuracy and reduced GPU memory on Mamba-130M / Qwen2.5-0.5B / ViT.
LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers: LoRe adapts the "Cluster + Bath" decomposition from condensed matter physics into a training-free inference-time wrapper for diffusion-based graph combinatorial optimization solvers. By evaluating only a fixed proportion of high-conflict edges per step and compensating for the discarded components with an \(\mathcal{O}(N)\) global recall term, it enables MIS solvers to exceed the baseline OOM limit by \(3\times\) (executing \(n=50\mathrm{k}\) instances on a single GPU) and achieves \(\sim 15\times\) speedup with \(44\times\) memory compression on TSP \(n=1000\).
Lower Complexity Bounds for Nonconvex-Strongly-Convex Bilevel Optimization with First-Order Oracles: This paper provides the first complexity lower bounds for smooth "nonconvex-strongly-convex" bilevel optimization under standard (deterministic/stochastic) first-order oracles that are strongly related to the condition number \(\kappa\). The established bounds are \(\Omega(\kappa^{3/2}\epsilon^{-2})\) for the deterministic case and \(\Omega(\kappa^{5/2}\epsilon^{-4})\) for the stochastic case. These results demonstrate that bilevel problems are inherently more difficult than single-level nonconvex and min-max optimization, exposing a significant power gap in \(\kappa\) between existing upper and lower bounds.
Memory-Efficient LLM Pretraining via Minimalist Optimizer Design: By "deconstructing Adam bottom-up," this paper identifies two truly essential components (per-column gradient normalization and first-order momentum restricted to the last layer) and composes them into the SCALE optimizer. SCALE achieves near-SGD memory (13.74 GB on LLaMA 7B) while matching Adam-level pretraining perplexity and even surpassing Muon/APOLLO.
Minibatch Selection via Partition Matroid Constrained Gradient Matching: PartitionSel models "cross-domain minibatch selection" as maximizing a validation-guided weighted gradient matching utility under partition matroid constraints (per-domain budgets). The authors prove that this objective is monotone and weak submodular, allowing it to be solved via Orthogonal Matching Pursuit (OMP) with approximation guarantees. This induces implicit data mixing at the batch level during every training step without training any proxy models, thereby reducing cross-domain redundancy and gradient conflicts.
Mirror Descent Under Generalized Smoothness: This paper proposes the concept of \(\ell*\)-generalized smoothness based on an arbitrary norm and its dual norm. By utilizing a "generalized self-bounding lemma," the gradient dual norm is controlled within the initial sub-optimality gap. This establishes, for the first time, convergence rates for Mirror Descent and its accelerated, optimistic, Mirror Prox, stochastic, and composite variants under non-Euclidean geometry that match those under classic \(L\)-smoothness.
Mirror Mean-Field Langevin Dynamics: This paper merges mean-field Langevin dynamics (MFLD) with mirror Langevin dynamics (MLD) to create "Mirror Mean-Field Langevin Dynamics" (MMFLD). It provides the first global convergence algorithm for minimizing the entropy-regularized functional \(\mathcal{L}(\mu)=F(\mu)+\lambda\,\mathrm{Ent}(\mu)\) on a convex constrained domain \(X\subseteq\mathbb{R}^d\). In continuous time, it proves \(e^{-2C_{\mathrm{LSI}}\lambda t}\) linear convergence using uniform mirror LSI; for discretization, it provides uniform-in-time propagation of chaos using an \(N\)-particle system with Euler-Maruyama.
Multi-Objective Bayesian Optimization via Adaptive ε-Constraints Decomposition: STAGE-BO reformulates MOBO as a sequence of ε-constrained single-objective Bayesian sub-problems with "thresholds adaptively selected via fill distance," solved using cEI. This achieves uniform Pareto front coverage without calculating hypervolume and is naturally compatible with hard constraints and user preferences.
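For orientation, the classical ε-constraint scalarization that this decomposition builds on can be written as follows for \(m\) objectives \(f_1,\dots,f_m\) over a domain \(\mathcal{X}\). The choice of \(f_1\) as the retained objective and the symbols \(\mathcal{X}\) and \(\varepsilon_j\) are notation assumed here; how STAGE-BO sets the thresholds from the fill distance is not reproduced.

\[
\min_{x \in \mathcal{X}} \; f_1(x) \quad \text{s.t.} \quad f_j(x) \le \varepsilon_j, \qquad j = 2, \dots, m .
\]

Each threshold vector \((\varepsilon_2,\dots,\varepsilon_m)\) yields one constrained single-objective problem, and sweeping the thresholds traces out different points on the Pareto front.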
Muon in Associative Memory Learning: Training Dynamics and Scaling Laws: This paper provides a theoretical characterization of convergence rates and scaling laws for Muon on a linear associative memory model with softmax retrieval and hierarchical spectra: compared to GD, Muon achieves exponential acceleration in the noiseless case and improves the loss scaling law from \(\tilde{\Omega}(T^{-(1-1/\beta)})\) to \(\tilde{\mathcal{O}}(T^{-2})\) in the power-law spectrum noise case, attributing this acceleration to the matrix sign operator acting as an adaptive task-aligned implicit preconditioner.
Neural QAOA\(^2\): Differentiable Joint Graph Partitioning and Parameter Initialization for Quantum Combinatorial Optimization: A generative-evaluative neural network (GEN) is proposed to jointly differentiate "graph partitioning + quantum circuit parameter initialization" for QAOA². The evaluator learns a high-fidelity quantum performance surrogate, while the generator outputs discrete partitions and initial parameters guided by its gradients. Straight-Through Estimator (STE) and an Orthogonal Complement Head (OCH) enable end-to-end training. The method surpasses heuristic baselines across 183 QUBO/Ising/MaxCut instances (21-1000 variables), ranking first in 101 instances.
On the Convergence Rate of LoRA Gradient Descent: This paper proves for the first time that the minimum gradient norm of original LoRA gradient descent converges at a rate of \(O(1/\log T)\) without assuming bounded adapter matrices or requiring Lipschitz smoothness of the re-parameterized loss (recovering the classic \(O(1/T)\) if parameter norms are bounded). Based on this, adaptive/normalized learning rates strictly corresponding to the theory are designed, with training acceleration and stability improvements validated on logistic regression, ResNet-18, and TinyLlama.
On the Expressive Power of GNNs to Solve Linear SDPs: This paper characterizes for the first time the minimum GNN expressive power required to learn solutions for linear SDPs from the perspective of the Weisfeiler–Leman (WL) hierarchy. It proves that standard variable-constraint bipartite message passing (VC-WL) and higher-order VC-2-WL are insufficient. In contrast, the VC-2-FWL architecture, equivalent to 2-FWL, is shown to be sufficient for simulating the update steps of the PDHG solver. Using high-quality predictions as a warm-start on synthetic data and SDPLIB benchmarks results in speedups of up to approximately 80%.
On the Interaction of Batch Noise, Adaptivity, and Compression, under \((L_0,L_1)\)-Smoothness: An SDE Approach: This paper demonstrates that standard first-order and second-order SDEs in current literature completely fail to capture learning rate stability constraints under \((L_0,L_1)\)-smoothness (even predicting convergence in regions where the discrete version diverges). By flipping the sign of the curvature term in the drift, the authors construct a family of "stability-faithful" first-order weak approximation SDEs. This enables the first unified analysis of DCSGD and DSignSGD under compression, affine variance, and heavy-tailed noise, providing specific prescriptions for selecting normalization intensity.
On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization: This paper theoretically proves that in nonstationary strongly convex stochastic optimization where the optimum drifts over time, Momentum SGD is systematically inferior to vanilla SGD due to "inertial lag," with performance degradation amplified by a factor of the order \((1 - \beta)^{-2}\). Through information-theoretic lower bounds, it demonstrates that this cost is a fundamental obstacle rather than an analytical artifact.
PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs: PathWise reformulates LLM-based Automated Heuristic Design (AHD) as a sequential decision process on an "entailment graph." Four LLM agents—Policy, World Model, and Dual-Critics—collaborate to replace gradient updates with natural language reflections. On problems such as TSP, CVRP, KP, and Bin Packing, it outperforms mainstream baselines like FunSearch, EoH, ReEvo, HSEvo, and MCTS-AHD with only 50% of the evaluation budget.
Probing Neural TSP Representations for Prescriptive Decision Support: The authors treat trained TSP neural solvers as "transferable encoders," using frozen representations and lightweight probes to predict two types of expensive operational sensitivity queries (node removal and edge forbidding). They systematically demonstrate that probe accuracy improves monotonically with solver quality and that the probes reach state-of-the-art performance when integrated with traditional heuristics.
Provably Data-Driven Lagrangian Relaxation for Mixed Integer Linear Programming: This paper provides the first rigorous statistical learning theory for the empirical approach of "learning to predict Lagrangian multipliers to accelerate MILP": it derives an ERM generalization upper bound of \(\mathcal{O}(s^{1.5}/\sqrt{N})\), a minimax lower bound of \(\Omega(s/\sqrt{N})\), and constructively achieves the optimal rate of \(\Theta(s/\sqrt{N})\) using an SGA averaging algorithm. Furthermore, it proves that the sample complexity can be improved to \(\Theta(s/N)\) when the problem is reformulated as "learning a warm-start initial value."
Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent: This paper establishes sharp Kreiss constant bounds \(K(J) \leq 2/(1-\gamma) + \|C\|/(4(1-\gamma))\) for block-triangular Jacobians \(J = \begin{bmatrix} A & 0 \\ C & D \end{bmatrix}\) in coupled gradient descent, providing matching lower bounds. It reveals that transient amplification can be arbitrarily large even when the spectral radius is \(< 1\). This theory serves as a scaling law for high-dimensional learning dynamics, providing a finite-time iteration complexity of \(O(K(J)^2 \log(1/\delta))\) and extending the results to nearly self-referential systems.
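A minimal numerical sketch of transient amplification, using a hypothetical \(2\times 2\) lower-triangular matrix of the same block shape (the values \(A = D = 0.9\) and \(C = 10\) are chosen here for illustration and do not come from the paper). The spectral radius is 0.9, yet the largest entry of \(J^k\) grows to roughly 39 before decaying.

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]


def max_abs_entry(m):
    """Largest absolute entry, used here as a simple size measure."""
    return max(abs(v) for row in m for v in row)


def power_growth(j, steps):
    """Return [max|J^k|] for k = 1..steps."""
    out, p = [], [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(steps):
        p = matmul2(p, j)
        out.append(max_abs_entry(p))
    return out


# Illustrative values only (assumed, not taken from the paper).
A, C, D = 0.9, 10.0, 0.9
J = [[A, 0.0], [C, D]]

# Eigenvalues of a triangular matrix are its diagonal entries.
spectral_radius = max(abs(A), abs(D))

growth = power_growth(J, 200)
peak = max(growth)
peak_step = growth.index(peak) + 1
```

Here the lower-left entry of \(J^k\) equals \(k\,C\,A^{k-1}\), which peaks near \(k \approx -1/\ln A\) even though every eigenvalue has modulus below 1.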
Rethinking the Flow-Based Gradual Domain Adaptation: A Semi-Dual Optimal Transport Perspective: This paper reformulates flow-based Gradual Domain Adaptation (GDA), which typically constructs intermediate domains, as an Entropy-regularized Semi-dual Unbalanced Optimal Transport (E-SUOT) problem. By bypassing explicit Probability Density Function (PDF) estimation of the target domain and directly learning a sequence of transport maps to push source samples toward the target, the method consistently outperforms existing GDA/UDA approaches on Portraits, MNIST-rot, and Office-Home.
RACO: Reward-free Alignment for Conflicting Objectives: RACO reformulates multi-objective LLM preference alignment as a multi-objective optimization problem, where each objective possesses its own DPO loss. Gradient conflicts are addressed using clipped CAGrad (CAGrad with coefficients clipped by user weights). It theoretically guarantees convergence to Pareto-critical points respecting user-specified weights (with strict acceleration in two-objective scenarios). Empirically, it consistently achieves superior Pareto trade-offs across Qwen 3, Llama 3, and Gemma 3 model families.
RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization: Based on the "row-block diagonal dominant" structure of the Transformer layer-wise Hessian, this paper replaces the expensive Newton-Schulz orthogonalization in the Muon optimizer with a single row-level \(\ell_2\) normalization. This reduces the per-step preconditioning complexity from \(\mathcal{O}(mn\min(m,n))\) to \(\mathcal{O}(mn)\), resulting in a 13–44× wall-clock speedup in GPT-2 / LLaMA pre-training with slightly improved perplexity.
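The row-level normalization can be sketched in a few lines of plain Python. This is an illustration under assumptions: the momentum form, the `eps` stabilizer, and the `lr`/`beta` values are hypothetical and not taken from the paper. The point is that the preconditioning step touches each of the \(mn\) entries a constant number of times.

```python
import math


def row_normalize(momentum, eps=1e-8):
    """Scale each row of a matrix (list of lists) to unit l2 norm.

    One pass over all m*n entries, so the cost is O(mn).
    """
    out = []
    for row in momentum:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / (norm + eps) for v in row])
    return out


def step(weights, momentum, grad, lr=0.02, beta=0.95):
    """One hypothetical update: accumulate momentum, then take a row-normalized step."""
    momentum = [
        [beta * m + g for m, g in zip(m_row, g_row)]
        for m_row, g_row in zip(momentum, grad)
    ]
    direction = row_normalize(momentum)
    weights = [
        [w - lr * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weights, direction)
    ]
    return weights, momentum
```

For example, `row_normalize([[3.0, 4.0]])` rescales the single row to approximately `[0.6, 0.8]`.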
SPSsafe: Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: SPSsafe extends the Stochastic Polyak Step Size (SPS) to non-smooth stochastic optimization without requiring interpolation assumptions or knowledge of optimal values. It retains rigorous convergence guarantees when combined with momentum, written in the iterate moving average (IMA) form that is equivalent to stochastic heavy ball (SHB). For DNN training it is more robust than existing adaptive methods (AdaGrad, Adam, DecSPS, etc.) and avoids the collapse of gradient norms (vanishing gradients).
Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration: Dataset pruning is reformulated as a "Maximum Weight Clique Problem" on a weighted graph, where node weights represent the intrinsic value of samples and edge weights represent redundancy/diversity relationships. Under mild conditions, this unified objective is proven to be submodular, allowing for a greedy solution with approximation guarantees. This approach reduces training time by over 40% on ImageNet-1k with ResNet-50 without sacrificing accuracy.
Sharp Description of Local Minima in the Loss Landscape of High-Dimensional Two-Layer ReLU Networks: Under the high-dimensional Gaussian input setting for teacher-student two-layer ReLU networks, this paper provides a hierarchical classification of all local minima of the population loss using a set of exact low-dimensional summary statistics equations regarding weight overlaps \((Q,R)\). It characterizes how over-parameterization transforms low-order spurious minima into saddle points while retaining high-order minima, thereby reconciling Safran–Shamir’s existence results, Arjevani–Field’s group-theoretic classification, and Safran et al.’s Hessian instability theory for the first time.
Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression: This paper reveals that post-training weight sign matrices are indistinguishable from i.i.d. Rademacher noise across all architectures, forming a "one-bit wall" for sub-bit compression. Using stopping time analysis, it proves this pseudo-randomness is actually a "lock-in" of initialized signs. Consequently, it proposes a from-scratch training scheme using low-rank sign templates + gap initialization + outer-zone log-barrier regularization, amortizing sign bits to nearly 0 bit/weight.
Stability Analysis of Sharpness-Aware Minimization: This paper analyzes the convergence instability of SAM near saddle points from a dynamical systems perspective. It first proves under deterministic gradient flow that a saddle point becomes an attractor for SAM as long as the neighborhood radius \(\rho > -1/\lambda_1\). Subsequently, within a stochastic diffusion framework, it demonstrates that the mean square displacement for saddle point escape in SAM is smaller than that of SGD by \(2\eta t^2|\lambda_j|^3\rho/B\). Finally, the SAM diffusion formula is utilized to explain why momentum and batch size are the true hidden drivers behind SAM achieving SOTA generalization performance.
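The threshold can be checked in one dimension. The sketch assumes the unnormalized SAM flow \(\dot{x} = -\nabla f\big(x + \rho \nabla f(x)\big)\) on the quadratic \(f(x) = \tfrac{1}{2}\lambda x^2\) with \(\lambda < 0\); both are assumptions for illustration, and the paper's setting may differ (for instance by normalizing the ascent step). Since \(\nabla f(x) = \lambda x\),

\[
\dot{x} = -\lambda\,(x + \rho\lambda x) = -\lambda\,(1 + \rho\lambda)\,x .
\]

The origin is attracting when \(\lambda(1+\rho\lambda) > 0\). Because \(\lambda < 0\), this requires \(1 + \rho\lambda < 0\), that is \(\rho > -1/\lambda\), which is consistent with the condition stated above.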
SVRG and Beyond via Posterior Correction: This paper demonstrates that the classic variance reduction algorithm SVRG is essentially a special case of Bayesian "Posterior Correction" (PoCo) under an isotropic Gaussian posterior. Based on this, it automatically derives two new extensions previously difficult to obtain: a Newton-type variant that simultaneously corrects the Hessian, and an Adam-type variant (IVON-PoCo) scalable to deep learning.
SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation: This paper redefines the objective of "model merging" from "avoiding task interference" to "promoting task synergy." It proposes SyMerge: jointly optimizing only one task-specific layer per task and the layer-wise merging coefficients of the encoder, while employing fine-tuned expert models as soft-label teachers to prevent test-time drift caused by entropy minimization. This approach elevates merged models to performance levels near single-task upper bounds across vision, dense prediction, and NLP benchmarks.
Taming the Loss Landscape of PINNs with Noisy Feynman-Kac Supervision: Operator Preconditioning and Non-Asymptotic Error Bounds: Incorporating a small number of interior point pseudo-labels, obtained via Monte Carlo simulation of the Feynman–Kac formula, into the PINN loss essentially acts as operator preconditioning for the PDE operator. This work provides an operator-level proof that the condition number remains bounded with respect to the number of collocation points \(N\), along with a non-asymptotic \(L^2\) error bound for \(\tanh\) activations. This approach enables PINNs to solve previously failed problems such as Schrödinger, Poisson, and committor equations.
Test time training enhances in-context learning of nonlinear functions: This paper establishes the first rigorous generalization bound for the combination of a single-layer softmax-attention transformer and LoRA test-time fine-tuning. It proves that TTT compresses the sample complexity of ICL from \(r^{\Theta(\mathrm{ie}(\sigma_*))}\) to \(r^{\Theta(\mathrm{ge}(\sigma_*))}\) on single-index polynomial tasks, allows the link function to vary per task, and ensures that the inference error approaches the noise level as the context length grows.
The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks: This paper proves that under the setting of smooth \(L\)-homogeneous models + exponential-tail loss + learning rate decay, Muon (including Muon-Signum, Muon-Adam) as a momentum-based "normalized steepest descent" converges to the KKT points of the corresponding norm max-margin problem; Adam (without the stability constant \(\varepsilon\)) converges to the KKT points of the \(\ell_\infty\) max-margin problem. This elevates implicit bias conclusions, previously only valid for linear models, to all smooth homogeneous networks.
Towards Understanding Adam Convergence on Highly Degenerate Polynomials: This paper selects a class of high-order degenerate polynomials \(L(x)=\tfrac{1}{k}x^k\) (even \(k\ge 4\)) as a minimal problem model. It proves that under a constant learning rate, Adam achieves local linear convergence by exponentially amplifying the effective learning rate through a "decoupling" mechanism between \(v_t\) and \(g_t^2\). Meanwhile, GD and momentum only achieve a sublinear rate of \(\Theta(t^{-1/(k-2)})\) on the same problem. The study comprehensively characterizes three phase regions of Adam—"stable convergence / spike / SignGD oscillation"—on the \((\beta_1,\beta_2)\) plane.
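The sublinear GD baseline is easy to reproduce numerically. The sketch below covers gradient descent only (not Adam), with \(k = 4\), a step size of 0.1, and a starting point of 1.0 chosen here for illustration. For \(k = 4\) the continuous-time approximation gives \(x_t \approx (2\eta t)^{-1/2}\), which matches the \(\Theta(t^{-1/(k-2)})\) rate.

```python
import math


def gd_on_degenerate_poly(k=4, lr=0.1, x0=1.0, steps=10_000):
    """Run gradient descent on L(x) = x**k / k and return the final iterate."""
    x = x0
    for _ in range(steps):
        x -= lr * x ** (k - 1)  # the gradient of x**k / k is x**(k-1)
    return x


steps, lr = 10_000, 0.1
x_final = gd_on_degenerate_poly(k=4, lr=lr, steps=steps)

# For k = 4 the iterates follow x_t ~ (2 * lr * t) ** (-1/2),
# so this ratio should be close to 1 for large t.
ratio = x_final * math.sqrt(2 * lr * steps)
```

After 10,000 steps the iterate is still on the order of \(10^{-2}\), which shows how slowly GD closes the gap on this flat minimum.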
Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm: The authors derive closed-form training dynamics for simplified single-layer linear attention Transformers, proving that regularization methods only alter convergence speed without shifting the convergence point (leading to inevitable failure in cFKA scenarios). In contrast, data replay directly modifies the convergence point and amplifies oscillations to stabilize old knowledge. Based on these findings, the authors propose STOC, which selects snippets via token-level attention contributions to guide pre-trained models in generating replay corpora. STOC consistently suppresses forgetting more effectively than LAMOL on synthetic, KnowEdit, and IndustryCorpus (legal) datasets.
TPV: Parameter Perturbations Through the Lens of Test Prediction Variance: The authors formalize the "local prediction sensitivity of a trained model to parameter perturbations" as Test Prediction Variance (TPV). They prove that under a first-order approximation, TPV reduces to a trace form \(\mathrm{Tr}(H_{\mathrm{eff}}C)\), unifying SGD noise, label noise, quantization, and pruning within a curvature-covariance framework. A stability theorem is provided to estimate TPV using only the training set, leading to the label-free pruning criterion JBR and model selection signals that do not require test labels.
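The trace form can be sanity-checked on a linear model \(f(x;\theta) = \theta^\top x\), where the first-order approximation is exact: a zero-mean parameter perturbation with covariance \(C\) gives prediction variance \(x^\top C x\) at input \(x\). The sketch takes \(H_{\mathrm{eff}}\) to be the average outer product of the prediction gradients over the test inputs, which is an assumption made here and may differ from the paper's definition; the inputs and covariance are made-up values.

```python
def quad_form(x, c):
    """x^T C x for a vector x and a square matrix C."""
    d = len(x)
    return sum(x[i] * c[i][j] * x[j] for i in range(d) for j in range(d))


def mean_outer(xs):
    """H = (1/n) * sum_i x_i x_i^T over the test inputs."""
    d, n = len(xs[0]), len(xs)
    return [[sum(x[i] * x[j] for x in xs) / n for j in range(d)] for i in range(d)]


def trace_of_product(h, c):
    """Tr(H C), computed without forming the matrix product."""
    d = len(h)
    return sum(h[i][j] * c[j][i] for i in range(d) for j in range(d))


# Hypothetical test inputs and perturbation covariance.
test_inputs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
cov = [[0.04, 0.01], [0.01, 0.09]]

# Average per-input prediction variance for the linear model.
avg_variance = sum(quad_form(x, cov) for x in test_inputs) / len(test_inputs)

# The same quantity written as a curvature-covariance trace.
trace_form = trace_of_product(mean_outer(test_inputs), cov)
```

The two quantities agree up to floating-point error, which is the identity \(\tfrac{1}{n}\sum_i x_i^\top C x_i = \mathrm{Tr}(HC)\).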
Ubiquity of Emergent Hebbian Dynamics in Regularized Learning: This paper demonstrates that near the steady state of L2 weight decay, the learning signals of nearly any learning rule (including SGD, Adam, DFA, and even Random Networks) spontaneously align toward the Hebbian direction. Conversely, sufficiently strong noise flips this alignment toward an anti-Hebbian direction, with a clear phase transition boundary emerging at \(\gamma \propto \sigma^2\).
URS: Unified Neural Routing Solver: The authors propose a Unified Data Representation (UDR) and a Mixed Bias Module (MBM) to replace problem enumeration—enabling a single neural model to generalize zero-shot to 110 VRP variants (99 unseen) without fine-tuning.
Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning: UDS proposes an efficient online batch selection framework for LLM Supervised Fine-tuning (SFT): it leverages the nuclear norm of the logits matrix obtained solely from forward passes to simultaneously characterize "optimization utility + intra-sentence diversity." It then uses low-dimensional bilinear random projection of logits to measure similarity matching against a historical sample memory buffer for "inter-sentence diversity." By selecting top-K samples based on a weighted sum of these metrics, UDS avoids reliance on external resources like reference models or validation sets and performs no additional backpropagation. Consequently, it is faster than full SFT and consistently outperforms existing SOTA online batch selection methods across several benchmarks.
Variational Adapter for Cross-modal Similarity Representation: This paper learns continuous cross-modal similarity distributions through a variational inference framework. Adaptive uncertainty weights mitigate the false negatives caused by binary labeling, significantly enhancing VLM performance in cross-modal retrieval and domain generalization tasks.
\(α\)-PFN: Fast Entropy Search via In-Context Learning: This paper "amortizes" information-theoretic acquisition functions like Entropy Search (ES) into a single forward pass using a two-stage Prior-data Fitted Network (PFN). It first trains a base PFN capable of making predictions conditioned on optimal point information, then trains an \(α\)-PFN that directly outputs the distribution of information gain. This bypasses slow and complex Monte Carlo approximations, achieving performance comparable to SOTA Entropy Search on synthetic and real HPO benchmarks while providing speedups of up to 70x.