📐 Optimization & Theory

🧠 NeurIPS 2025 · 113 paper notes

A Single-Loop First-Order Algorithm for Linearly Constrained Bilevel Optimization

For bilevel optimization problems with coupled linear constraints in the lower-level problem, this paper proposes SFLCB, a single-loop first-order algorithm that eliminates Hessian dependence via a penalty-based reformulation combined with an augmented Lagrangian, improving the iteration complexity from \(O(\epsilon^{-3}\log(\epsilon^{-1}))\) to \(O(\epsilon^{-3})\).

A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning

This paper proposes the first theoretical framework for sampling-based test-time scaling methods, decomposing reasoning error into estimation error and model error. It reveals the limitations of Self-Consistency (slow convergence) and Perplexity (large model error), and introduces the RPC method that combines the strengths of both, achieving comparable reasoning performance on 7 benchmarks with only 50% of the sampling cost.

A Unified Approach to Submodular Maximization Under Noise

This paper proposes a unified meta-algorithm framework that takes any exact submodular maximization algorithm satisfying a "robustness" condition as a black box and automatically converts it into an algorithm that maintains its approximation ratio (losing only \(o(1)\)) under a persistent noisy value oracle. This achieves, for the first time, optimal approximation ratios for non-monotone submodular maximization under matroid constraints and in the unconstrained setting.

A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias

Through a linear stability analysis framework, this paper demonstrates that "flat minima ⇒ better generalization" and "SGD prefers simple functions" are two sides of the same coin — data coherence simultaneously governs both phenomena, and SAM amplifies the simplicity bias further by imposing stricter stability conditions.

Adaptive Algorithms with Sharp Convergence Rates for Stochastic Hierarchical Optimization

This paper proposes Ada-Minimax and Ada-BiO, two adaptive algorithms that combine momentum normalization with a novel online noise estimation strategy to achieve, for the first time, sharp convergence rates of \(\tilde{O}(1/\sqrt{T} + \sqrt{\bar{\sigma}}/T^{1/4})\) for nonconvex-strongly-concave minimax and nonconvex-strongly-convex bilevel optimization without requiring prior knowledge of the gradient noise level.

An Adaptive Algorithm for Bilevel Optimization on Riemannian Manifolds

AdaRHD is the first adaptive algorithm for Riemannian bilevel optimization (RBO) that requires no prior knowledge of problem-specific parameters (strong convexity constants, Lipschitz bounds, manifold curvature). By adopting an inverse cumulative gradient norm strategy for adaptive step size selection and solving the lower-level problem, linear system, and upper-level update sequentially within a three-stage framework, AdaRHD achieves a convergence rate of \(O(1/\epsilon)\) matching non-adaptive methods, while exhibiting substantially greater robustness to initial step size choices compared to RHGD.

Asymptotically Stable Quaternionic Hopfield Structured Neural Network with Supervised Projection-based Manifold Learning

This paper proposes a Quaternion-valued Supervised Hopfield-structured Neural Network (QSHNN) that employs a periodic projection strategy to maintain the quaternionic structural consistency of the weight matrix. The existence and uniqueness of fixed points and their asymptotic stability are established via Lyapunov theory, while bounded trajectory curvature guarantees path smoothness for robotic path planning.

Auto-Compressing Networks

Auto-Compressing Networks (ACN) replace short residual connections with long-range forward connections (aggregating all layer outputs directly into the final output), making the Direct Gradient (DG) component significantly stronger than the Forward Gradient (FG), thereby implicitly compressing information into earlier layers. A ViT with only 6 layers matches standard 12-layer performance; BERT saves 75% of its layers; additional benefits include noise robustness (+6.4%) and reduced catastrophic forgetting in continual learning (−18%).

Automated Algorithm Design via Nevanlinna-Pick Interpolation

This paper proposes an automated algorithm design framework based on Nevanlinna-Pick interpolation from frequency-domain robust control theory, targeting strongly convex optimization with equality constraints, and achieves an optimal trade-off between the number of matrix-vector multiplications and the convergence rate.

AutoOpt: A Dataset and a Unified Framework for Automating Optimization Problem Solving

AutoOpt introduces the first end-to-end framework for converting optimization problem images to executable code — comprising the AutoOpt-11k dataset of 11,554 optimization formula images (handwritten + printed), an M1 hybrid encoder (ResNet+Swin→mBART) for image-to-LaTeX conversion (BLEU 96.70), an M2 DeepSeek-Coder module for LaTeX-to-PYOMO translation, and an M3 bilevel decomposition solver, achieving an overall pipeline success rate of 94.20%.

Better NTK Conditioning: A Free Lunch from ReLU Nonlinear Activation in Wide Neural Networks

This paper establishes a previously unnoticed "free" benefit of ReLU activation in wide neural networks: (a) it induces better data separation in the model's gradient feature space (angles between similar inputs are amplified in gradient space), and (b) this strictly reduces the condition number of the NTK matrix compared to linear networks. Depth further amplifies this effect — in the infinite-width-then-infinite-depth limit, all data pairs achieve equal angular separation in gradient space (~75.5°), and the NTK condition number converges to a fixed value \((n+4)/3\) that depends only on the number of training samples \(n\).

Beyond Õ(√T) Constraint Violation for Online Convex Optimization with Adversarial Constraints

This paper studies Constrained Online Convex Optimization (COCO) with adversarial constraints and introduces a tunable parameter \(\beta\) to achieve a precise tradeoff between \(\tilde{O}(T^\beta)\) regret and \(\tilde{O}(T^{1-\beta})\) cumulative constraint violation (CCV), going beyond the previously best-known \(\tilde{O}(\sqrt{T})\) constraint violation bound.

Brain-like Variational Inference

This paper proposes the FOND framework (Free energy Online Natural-gradient Dynamics), which derives spiking neural network inference dynamics from first principles via free energy minimization, and implements iPVAE (iterative Poisson VAE). iPVAE outperforms standard VAEs and predictive coding models in reconstruction–sparsity trade-off, biological plausibility, and OOD generalization.

Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

This paper introduces PrefCleanBench, the first comprehensive benchmark for systematically evaluating 13 preference data cleaning methods in the context of LLM alignment. It covers diverse datasets, model architectures, and optimization algorithms, revealing the underappreciated yet critical role of data preprocessing in responsible AI development.

Composing Global Solutions to Reasoning Tasks via Algebraic Objects in Neural Nets

This paper proposes the CoGS framework, demonstrating that the weight space of two-layer quadratic-activation networks on Abelian group multiplication reasoning tasks admits a semiring algebraic structure. The Sum Potentials in the loss function are ring homomorphisms, enabling global optimal solutions to be algebraically composed from partial solutions—each satisfying only a subset of loss constraints—via ring addition and multiplication. Approximately 95% of gradient descent solutions match the theoretical constructions exactly.

Constrained Network Slice Assignment via Large Language Models

This paper investigates the use of LLMs (Claude series) for solving constrained optimization problems in 5G network slice resource allocation. Two approaches are proposed: zero-shot LLM direct assignment and LLM-guided integer programming. Empirical findings show that LLMs alone can produce reasonable initial allocations but may violate hard constraints, whereas combining LLMs with an ILP solver achieves 100% completeness and balanced utilization.

Contribution of Task-Irrelevant Stimuli to Drift of Neural Representations

This work theoretically demonstrates that the statistical properties (variance and dimensionality) of task-irrelevant stimuli are key drivers of representational drift in online learning. Across Oja's rule, Similarity Matching, autoencoders, and supervised two-layer networks, a drift rate of \(D \propto \lambda_\perp^2 (n-m)\) is consistently observed. Furthermore, learning-noise-induced drift exhibits anisotropic geometric structure, qualitatively distinct from the isotropic drift induced by Gaussian synaptic noise.

Covariances for Free: Exploiting Mean Distributions for Training-free Federated Learning

This paper proposes FedCOF, which leverages only the class means uploaded by clients to perform unbiased estimation of class covariance matrices on the server side, enabling initialization of a global classifier with zero training and minimal communication overhead — achieving performance on par with or surpassing Fed3R, which requires transmission of second-order statistics.

DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization

DartQuant proposes a distribution-calibration-based rotation matrix optimization method that drives activation distributions toward uniformity via a Whip loss to reduce quantization error, and replaces expensive manifold optimizers with QR-Orth, achieving 47× speedup and 10× memory reduction on 70B models — enabling large-model rotation calibration on a single RTX 3090 for the first time.

Deep Taxonomic Networks for Unsupervised Hierarchical Prototype Discovery

Deep Taxonomic Networks proposes a deep latent variable model with a complete binary tree Gaussian mixture prior. Through variational inference, the model automatically discovers hierarchical taxonomies and multi-level prototype clusters from unlabeled data without requiring a predefined number of classes, substantially outperforming baselines such as TreeVAE across multiple datasets.

Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study

This paper establishes, within the matrix factorization framework (a canonical theoretical testbed for neural networks), that Guess & Check (G&C; randomly sampling parameters until the training set is fit) exhibits generalization that degrades with increasing width (the first provable demonstration of a canonical setting where G&C is strictly inferior to gradient descent) yet improves with increasing depth, revealing the fundamentally opposite roles that width and depth play in generalization.

Doubly Robust Alignment for Large Language Models

DRPO draws on doubly robust estimation from causal inference to propose a preference optimization algorithm that maintains consistency whenever either the preference model or the reference policy is correctly specified, outperforming PPO/DPO and their variants both theoretically and empirically.

DynaAct: Large Language Model Reasoning with Dynamic Action Spaces

DynaAct frames action space construction in LLM reasoning as a subset selection problem, dynamically assembling a compact action space at each step via a submodular function that balances utility and diversity. It outperforms rStar, RAP, and other baselines on 6 benchmarks, surpassing rStar by 6.8% on MATH-500.
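
A generic sketch of the per-step subset selection described above: greedily assemble a small action set whose marginal gains trade a usefulness score off against redundancy with already-selected actions. The `utility` and `similarity` inputs are hypothetical stand-ins, and the utility-minus-redundancy objective is a generic monotone surrogate, not DynaAct's exact submodular function.

```python
def greedy_action_subset(candidates, utility, similarity, k, lam=0.5):
    """Greedily select k actions, trading usefulness off against redundancy.

    utility(a)       -> scalar usefulness of action a          (hypothetical)
    similarity(a, b) -> redundancy between two actions, in [0, 1]  (hypothetical)
    """
    selected = []
    while len(selected) < min(k, len(candidates)):
        best, best_gain = None, float("-inf")
        for a in candidates:
            if a in selected:
                continue
            redundancy = max((similarity(a, s) for s in selected), default=0.0)
            gain = utility(a) - lam * redundancy  # marginal-gain estimate
            if gain > best_gain:
                best, best_gain = a, gain
        selected.append(best)
    return selected

# Toy usage: numeric "actions" scored by magnitude, penalized when close together.
acts = [0.1, 0.2, 0.9, 0.95, 0.5]
print(greedy_action_subset(acts, utility=abs,
                           similarity=lambda a, b: 1.0 - abs(a - b), k=3))
```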

Effective Policy Learning for Multi-Agent Online Coordination Beyond Submodular Objectives

This paper proposes two multi-agent online coordination algorithms, MA-SPL and MA-MPL, which leverage a policy-based continuous extension technique to surpass the limitations of submodularity. For the first time, both algorithms achieve the optimal \((1 - c/e)\) approximation ratio under submodular and weakly submodular objectives, while supporting time-varying objectives and the practical constraint of local-only feedback.

Efficient Adaptive Experimentation with Noncompliance

This paper proposes AMRIV — the first semiparametrically efficient, multiply robust ATE estimator for adaptive experiments with noncompliance — combined with a variance-optimal instrumental variable allocation strategy and sequential inference guarantees.

Efficient Adaptive Federated Optimization

FedAda2/FedAda2++ proposes efficient server–client joint adaptive optimization for federated learning: client-side local preconditioners are initialized from zero (eliminating server-to-client transmission), with an optional SM3-based memory-efficient compression of local statistics. The method is theoretically shown to achieve the same \(O(T^{-1/2})\) convergence rate as full joint adaptivity, while incurring the same communication cost as FedAvg.

Efficient Federated Learning against Byzantine Attacks and Data Heterogeneity via Aggregating Normalized Gradients

This paper proposes Fed-NGA, an algorithm that aggregates client gradients via weighted averaging after \(\ell_2\) normalization, achieving Byzantine robustness and resilience to data heterogeneity simultaneously at an extremely low time complexity of \(\mathcal{O}(pM)\). Under non-convex loss functions, it is the first to prove zero optimality gap convergence under certain mild conditions.
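
A minimal sketch of the aggregation rule described above, assuming flattened gradients and (optionally) uniform client weights: each gradient is \(\ell_2\)-normalized before weighted averaging, so no single (possibly Byzantine) client can dominate the update.

```python
import numpy as np

def fed_nga_aggregate(client_grads, weights=None, eps=1e-12):
    """Weighted average of l2-normalized client gradients (one pass, O(pM))."""
    m = len(client_grads)
    if weights is None:
        weights = np.full(m, 1.0 / m)
    agg = np.zeros_like(client_grads[0])
    for w, g in zip(weights, client_grads):
        agg += w * g / (np.linalg.norm(g) + eps)  # normalization bounds each client's influence
    return agg

# Example: a Byzantine client with a huge gradient no longer dominates the average.
grads = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([1e6, -1e6])]
print(fed_nga_aggregate(grads))
```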

Emergence and Scaling Laws in SGD Learning of Shallow Neural Networks

This paper provides a precise analysis of online SGD learning of additive models (sums of single-index functions) in shallow neural networks, proving that the learning of each teacher neuron exhibits a sharp phase transition (emergence), and that the superposition of many such transition curves across different timescales naturally produces a smooth power-law scaling law.

Escaping Saddle Points without Lipschitz Smoothness: The Power of Nonlinear Preconditioning

This paper proposes a unified sufficient condition connecting the \((L_0,L_1)\)-smoothness and anisotropic smoothness frameworks, proves that nonlinear preconditioned gradient methods (including gradient clipping variants) retain saddle-point avoidance under these relaxed conditions, and establishes that a perturbed variant achieves second-order stationary points with polylogarithmic dimension dependence.

Estimation of Stochastic Optimal Transport Maps

This paper proposes a transport error metric \(\mathcal{E}_p\) for stochastic OT maps—decomposed into an optimality gap and a feasibility gap—and constructs a computationally efficient rounding estimator achieving a near-optimal convergence rate of \(\tilde{O}(n^{-1/(d+2p)})\) under minimal assumptions that require neither the existence nor uniqueness of Brenier maps. The framework is further extended to Hölder-continuous kernels and adversarially corrupted settings, establishing the first general theory for OT map estimation.

Evaluating LLMs for Combinatorial Optimization: One-Phase and Two-Phase Heuristics for 2D Bin-Packing

This paper proposes a systematic evaluation framework combining LLMs with evolutionary algorithms to assess the capability of LLMs in generating and optimizing heuristics for the 2D bin-packing problem. GPT-4o achieves optimal solutions within 2 iterations, reducing the average number of bins from 16 to 15 and improving space utilization from 0.76–0.78 to 0.83.

Exact and Linear Convergence for Federated Learning under Arbitrary Client Participation is Attainable

This paper introduces stochastic matrices and time-varying graphs as modeling tools to unify client participation and local update processes in federated learning into a matrix multiplication formulation. It proposes the FOCUS algorithm (based on a push-pull strategy), which achieves, for the first time, exact convergence and linear convergence rates under arbitrary client participation and data heterogeneity.

Exploring Landscapes for Better Minima along Valleys

This paper proposes an optimizer adapter "E" that incorporates an exponential moving average of gradient differences \(\mathbf{a}_k = \text{EMA}(\mathbf{g}_k - \mathbf{g}_{k-1})\) into the gradient update, enabling optimizers to continue exploring "valleys" in the loss landscape for lower and flatter minima after reaching a local minimum. The resulting ALTO optimizer achieves an average improvement of 2.5% in test accuracy under large-batch training.
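
A minimal sketch of the adapter idea under stated assumptions: the EMA of gradient differences \(\mathbf{a}_k\) is maintained exactly as in the summary, but the way it enters the update (added to the gradient with a coefficient `lam`) is an illustrative choice, not necessarily ALTO's exact rule.

```python
import numpy as np

def alto_like_sgd(grad_fn, x0, lr=0.01, beta=0.9, lam=1.0, steps=1000):
    """SGD augmented with a_k = EMA(g_k - g_{k-1}) to keep exploring along valleys."""
    x = np.asarray(x0, dtype=float)
    a = np.zeros_like(x)
    g_prev = grad_fn(x)
    for _ in range(steps):
        g = grad_fn(x)
        a = beta * a + (1.0 - beta) * (g - g_prev)  # EMA of gradient differences
        x = x - lr * (g + lam * a)                  # exploration term added to the gradient
        g_prev = g
    return x
```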

Extragradient Method for \((L_0, L_1)\)-Lipschitz Root-finding Problems

Under the \(\alpha\)-symmetric \((L_0,L_1)\)-Lipschitz condition (a relaxation of the classical \(L\)-Lipschitz assumption), this paper proposes an adaptive step size strategy \(\gamma_k = 1/(c_0 + c_1\|F(x_k)\|^\alpha)\) for the extragradient (EG) method, establishing the first complete convergence guarantees for three classes of root-finding problems: strongly monotone (linear convergence), monotone (sublinear convergence), and weak Minty (local convergence).
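
A minimal sketch of the extragradient iteration with the adaptive step size from the summary; the constants `c0`, `c1`, and `alpha` are user-chosen, and the toy monotone operator in the example is only for illustration.

```python
import numpy as np

def adaptive_extragradient(F, x0, c0=1.0, c1=1.0, alpha=1.0, iters=1000):
    """Extragradient with step gamma_k = 1 / (c0 + c1 * ||F(x_k)||^alpha)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        gamma = 1.0 / (c0 + c1 * np.linalg.norm(F(x)) ** alpha)
        x_half = x - gamma * F(x)   # extrapolation (leading) step
        x = x - gamma * F(x_half)   # correction step at the extrapolated point
    return x

# Example: root of the strongly monotone operator F(x) = A x.
A = np.array([[2.0, 1.0], [-1.0, 2.0]])
print(adaptive_extragradient(lambda x: A @ x, np.array([1.0, -1.0])))
```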

Faster Algorithms for Structured John Ellipsoid Computation

For computing the John ellipsoid of a symmetric convex polytope \(P = \{x \in \mathbb{R}^d : -\mathbf{1}_n \leq Ax \leq \mathbf{1}_n\}\), this paper proposes two fast algorithms: a near-input-sparsity algorithm based on sketching with per-iteration cost \(\widetilde{O}(\text{nnz}(A) + d^\omega)\), and a treewidth-based algorithm with per-iteration cost \(O(n\tau^2)\), both significantly improving upon the prior state-of-the-art cost of \(O(nd^2)\).

FedQS: Optimizing Gradient and Model Aggregation for Semi-Asynchronous Federated Learning

This paper proposes FedQS, the first framework to simultaneously optimize both gradient aggregation and model aggregation strategies in semi-asynchronous federated learning (SAFL). By partitioning clients into four categories and adaptively adjusting training strategies, FedQS achieves comprehensive improvements over baselines in accuracy, convergence speed, and stability.

FedRTS: Federated Robust Pruning via Combinatorial Thompson Sampling

This paper reformulates federated dynamic pruning as a Combinatorial Multi-Armed Bandit (CMAB) problem and proposes TSAdj, a Thompson Sampling-based topology adjustment mechanism. By replacing deterministic decisions with probabilistic ones, the method obtains more robust sparse model topologies while significantly reducing communication overhead.

Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

This paper argues that LLM generalization and hallucination share a common mechanism — out-of-context reasoning (OCR) — and provides theoretical guarantees on a single-layer attention model: the factorized parameterization \((W_O, W_V)\) can perform OCR due to the nuclear norm implicit bias of gradient descent, whereas the merged parameterization \(W_{OV}\) cannot due to its Frobenius norm bias. Moreover, OCR is sample-efficient (requiring only \(m_{\text{train}}>0\)).

Gradient Descent as Loss Landscape Navigation: a Normative Framework for Deriving Learning Rules

This paper proposes treating learning rules as optimal navigation policies in a (partially observable) loss landscape. By solving a continuous-time optimal control problem via variational calculus, it derives gradient descent, momentum, natural gradient, Adam, and continual learning strategies within a unified framework.

Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data

This paper provides the first complete characterization of the implicit bias of Normalized Steepest Descent (NSD) and Normalized Momentum Descent (NMD) on multiclass linearly separable data: these algorithms converge to the maximum-margin solution under the corresponding \(p\)-norm at a rate of \(\mathcal{O}(1/\sqrt{t})\), with Spectral Descent (spectral norm) and Muon as special cases, and further extended to Adam (max-norm margin).
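
For concreteness, a minimal spectral-norm steepest-descent step: the matrix gradient \(G = U\Sigma V^\top\) is replaced by its orthogonalized factor \(UV^\top\). Muon approximates this orthogonalization with Newton-Schulz iterations rather than an explicit SVD; the SVD version below is only a sketch.

```python
import numpy as np

def spectral_descent_step(W, G, lr=0.1):
    """One normalized steepest-descent step w.r.t. the spectral norm.

    The steepest descent direction under the constraint ||D||_2 <= 1 is -U V^T,
    where G = U S V^T is the SVD of the gradient matrix.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)
```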

Improving the Straight-Through Estimator with Zeroth-Order Information

This paper proposes FOGZO (First-Order-Guided Zeroth-Order Gradient Descent), which injects STE gradients as a bias source into zeroth-order gradient estimation. By retaining the computational efficiency of STE while leveraging zeroth-order information to correct occasional erroneous gradient directions, FOGZO achieves 1–22 point improvements in accuracy/perplexity on DeiT, ResNet, and LLaMA with only 2 additional forward passes.

In Search of Adam's Secret Sauce

Through large-scale experiments training 1500+ language models, this paper establishes: (1) Signum closes 96% of the SGD–Adam gap yet remains 25% slower than Adam; (2) setting \(\beta_1 = \beta_2\) is a near-optimal simplification of Adam; (3) under \(\beta_1 = \beta_2 = \beta\), Adam can be reinterpreted as a signal-to-noise ratio–adaptive Signum that estimates the gradient mean and variance via online Gaussian variational inference.
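
The two update rules being compared, in minimal form (bias correction and weight decay omitted); the second fixes \(\beta_1 = \beta_2 = \beta\) as in finding (2).

```python
import numpy as np

def signum_step(x, g, m, lr=0.01, beta=0.9):
    """Signum: take the sign of the momentum buffer."""
    m = beta * m + (1.0 - beta) * g
    return x - lr * np.sign(m), m

def adam_equal_betas_step(x, g, m, v, lr=0.01, beta=0.9, eps=1e-8):
    """Adam with beta1 = beta2 = beta, the near-optimal simplification noted above."""
    m = beta * m + (1.0 - beta) * g
    v = beta * v + (1.0 - beta) * g * g
    return x - lr * m / (np.sqrt(v) + eps), m, v
```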

Isotropic Noise in Stochastic and Quantum Convex Optimization

This paper introduces the concept of an Isotropic Stochastic Gradient Oracle (ISGO)—where noise is bounded in every direction with high probability—and designs a stochastic cutting-plane algorithm achieving a query complexity of \(\tilde{O}(R^2\sigma_I^2/\epsilon^2 + d)\), improving over SGD by a factor of \(d\) in certain parameter regimes. As corollaries, the paper establishes new state-of-the-art complexities under sub-exponential noise and improves the dimension dependence of quantum stochastic convex optimization via a quantum isotropization subroutine.

Kernel Learning with Adversarial Features: Numerical Efficiency and Adaptive Regularization

This paper proposes a new paradigm that relocates adversarial perturbations from the input space to the feature space within reproducing kernel Hilbert spaces (RKHS), enabling exact closed-form solutions to the inner maximization problem. The outer minimization is solved efficiently via iterative reweighted kernel ridge regression, and the resulting adaptive regularization matches cross-validation performance without any hyperparameter tuning.

Large Language Bayes

This work mathematically "glues" an LLM and a probabilistic programming language (PPL/Stan) into a joint distribution \(p(z,x,m|t) = p(m|t)_{\text{LLM}} \cdot p(z,x|m)_{\text{PPL}}\). Given only an informal problem description and data, the system automatically samples candidate formal models from the LLM, performs Bayesian inference within each model, and produces a marginal-likelihood-weighted model average — requiring no user-written probabilistic model.
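
Spelling out the model average implied by the stated joint factorization: conditioning on data \(x\) and the informal description \(t\), and marginalizing over the sampled formal models \(m\), gives a mixture whose weights combine the LLM prior with each model's marginal likelihood.

\[
p(z \mid x, t) \;=\; \sum_{m} p(m \mid x, t)\, p(z \mid x, m),
\qquad
p(m \mid x, t) \;\propto\; \underbrace{p(m \mid t)}_{\text{LLM}} \; \underbrace{p(x \mid m)}_{\text{marginal likelihood (PPL)}} .
\]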

Large Stepsizes Accelerate Gradient Descent for Regularized Logistic Regression

This paper proves that applying GD with large stepsizes (entering the Edge of Stability regime) to \(\ell_2\)-regularized logistic regression on linearly separable data accelerates the step complexity from the classical \(\widetilde{O}(\kappa)\) to \(\widetilde{O}(\sqrt{\kappa})\), matching the acceleration rate of Nesterov momentum in the small-regularization regime.

Layer-wise Update Aggregation with Recycling for Communication-Efficient Federated Learning

This paper proposes FedLUAR, which uses a gradient-to-weight ratio metric to identify low-priority layers and recycles their updates from the previous round (rather than discarding them), achieving accuracy nearly identical to FedAvg with only 17% communication overhead.

Learning at the Speed of Physics: Equilibrium Propagation on Oscillator Ising Machines

This work presents the first complete mapping of Equilibrium Propagation (EP) onto Oscillator Ising Machine (OIM) hardware, leveraging GHz-scale physical dynamics to enable backpropagation-free local learning. The approach achieves 97.2%/88.0% accuracy on MNIST/Fashion-MNIST and demonstrates robustness under parameter quantization and noise.

Learning from Interval Targets

This paper studies regression under interval-only supervision (lower and upper bounds), establishes non-asymptotic generalization bounds based on hypothesis class smoothness without requiring a small ambiguity degree assumption, and proposes a minmax learning framework that leverages smoothness constraints to limit worst-case labels, achieving significant improvements over unconstrained methods across 18 real-world datasets.

Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis

This paper proves that orthogonal multi-index models \(f_*(\mathbf{x}) = \sum_{k=1}^P \phi(\mathbf{v}_k^* \cdot \mathbf{x})\) can be learned via a two-phase online SGD with sample complexity \(\tilde{O}(dP^{L-1})\) (where \(L\) is the lowest higher-order Hermite degree of the link function), significantly improving upon \(\tilde{O}(Pd^{L-1})\) obtained by using only the lowest-order information. The key insight is to first recover the subspace using second-order terms and then recover the directions using \(L\)-th order terms, jointly exploiting Hermite components of different orders.

Learning Parameterized Skills from Demonstrations

This paper proposes DEPS, an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Through a three-level hierarchical policy (discrete skill selection → continuous parameter selection → low-level actions) and an information bottleneck design, DEPS learns interpretable and generalizable skill abstractions, achieving significant improvements over baselines on LIBERO and MetaWorld.

Learning Provably Improves the Convergence of Gradient Descent

This paper presents the first rigorous proof of training convergence for an unrolling-based Learn to Optimize (L2O) framework (Math-L2O). By leveraging NTK theory, it establishes a linear convergence rate and proposes a deterministic initialization strategy that provably ensures L2O improves upon the convergence performance of gradient descent. Experiments demonstrate over 50% improvement in optimality compared to standard GD.

Learning Quadratic Neural Networks in High Dimensions: SGD Dynamics and Scaling Laws

This paper provides a precise analysis of gradient-based training of two-layer neural networks with quadratic activations in high dimensions. Under the setting where data is generated by \(f_*(x) \propto \sum_{j=1}^r \lambda_j \sigma(\langle \theta_j, x \rangle)\), with extensive width \(r \asymp d^\beta\) and power-law decaying coefficients \(\lambda_j \asymp j^{-\alpha}\), the paper derives scaling laws for the prediction risk as a function of optimization time, sample size, and model width.

Learning Reconfigurable Representations for Multimodal Federated Learning with Missing Data

This paper proposes PEPSY, a framework that learns client-side embedding controls to encode data-missing patterns, reconfiguring globally aggregated representations into complete-data features adapted to each client's local context, addressing both modality-missing and feature-missing scenarios in multimodal federated learning.

Learning Single-Index Models via Harmonic Decomposition

This paper proposes spherical harmonics as a natural basis for single-index models (SIMs) in place of Hermite polynomials, leveraging rotational symmetry to characterize sample and computational complexity for learning SIMs under arbitrary spherically symmetric input distributions. Two families of optimal estimators are constructed (tensor unfolding + online SGD), and a sample–runtime tradeoff is revealed that does not arise in the Gaussian setting.

Learning Sparse Approximate Inverse Preconditioners for Conjugate Gradient Solvers on GPUs

This paper proposes a graph neural network (GNN)-based method for learning sparse approximate inverse (SPAI) preconditioners, exploiting the natural compatibility between SPAI's locality and GNN message passing, and introduces a scale-invariant loss function (SAI loss) that achieves 40%–53% reduction in solve time (68%–113% speedup) on GPUs.

Learning Theory for Kernel Bilevel Optimization

This work establishes the first finite-sample generalization bounds for kernel bilevel optimization (KBO), proving that plug-in estimation errors for both the objective value and its gradient converge uniformly at the parametric rate \(\mathcal{O}(1/\sqrt{m}+1/\sqrt{n})\), and applies this theory to analyze the statistical accuracy of bilevel gradient descent.

Learning to Insert for Constructive Neural Vehicle Routing Solver

This paper proposes L2C-Insert, the first learning-based insertion construction paradigm for neural combinatorial optimization. By allowing nodes to be inserted at any feasible position within a partial solution—rather than appended only to its end—the method significantly improves construction quality and flexibility for TSP/CVRP.

Least Squares Variational Inference

This paper proposes LSVI (Least Squares Variational Inference), a gradient-free variational inference method based on ordinary least squares regression. Within the exponential family, LSVI iteratively solves for the optimal variational approximation by performing OLS regression on a tempered log-target, admitting efficient \(O(d^3)\) (full-covariance) or \(O(d)\) (mean-field) implementations for the Gaussian family.

MAR-FL: A Communication Efficient Peer-to-Peer Federated Learning System

This paper proposes MAR-FL, a system that reduces the communication complexity of P2P federated learning from \(O(N^2)\) to \(O(N \log N)\) via a Moshpit All-Reduce mechanism and dynamic group aggregation, while maintaining robustness against network jitter.

MDNS: Masked Diffusion Neural Sampler via Stochastic Optimal Control

This paper proposes the Masked Diffusion Neural Sampler (MDNS), a framework grounded in stochastic optimal control theory for continuous-time Markov chains (CTMCs). By aligning path measures, MDNS trains a discrete neural sampler capable of accurately sampling from Ising/Potts models with state spaces as large as \(10^{122}\), substantially outperforming existing learning-based baselines.

MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization

MeCeFO proposes a fault-tolerant optimization algorithm for LLM training that minimizes overhead during node failures through three techniques—skip-connections, selective activation recomputation, and low-rank gradient approximation—achieving only a 4.18% throughput drop under high-frequency failure conditions.

Memory-Augmented Potential Field Theory: A Framework for Adaptive Control in Non-Convex Domains

This paper proposes Memory-Augmented Potential Field Theory (MAPFT), which maintains a dynamic memory module within stochastic optimal control to detect and encode topological features of the state space (local minima, low-gradient regions, etc.), and adaptively reshapes the value function landscape to enable control in non-convex environments. On tasks such as Humanoid-v4, the method achieves a 27% improvement in cumulative reward over the best RL baseline (SAC), and raises the local optima escape rate from ~30% to ~72%.

MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees

MESS+ is the first framework to formalize LLM request routing as a constrained stochastic optimization problem with SLA guarantees. By combining an online-learned request satisfaction predictor with a virtual queue mechanism, it dynamically selects models per request. Across 3 reasoning and 5 question-answering benchmarks, MESS+ achieves an average 2× cost reduction while satisfying SLA constraints, with theoretical guarantees on both cost optimality and constraint satisfaction.

MOBO-OSD: Batch Multi-Objective Bayesian Optimization via Orthogonal Search Directions

This paper proposes MOBO-OSD, an algorithm that generates diverse Pareto-optimal solutions by defining orthogonal search directions on an approximated convex hull of individual minima (CHIM), combined with Pareto front estimation and a batch selection strategy, consistently outperforming state-of-the-art multi-objective Bayesian optimization methods on both synthetic and real-world benchmarks.

Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent

This work rigorously proves, from the perspective of gradient descent training dynamics, that a single-layer multi-head Transformer can learn both forward and backward reasoning on a tree path-finding task via Chain-of-Thought, and reveals that distinct attention heads spontaneously specialize to collaboratively solve multi-stage subtasks.

Multiplayer Federated Learning: Reaching Equilibrium with Less Communication

This paper proposes the Multiplayer Federated Learning (MpFL) framework, which models FL clients as rational players in a game-theoretic setting, and introduces the PEARL-SGD algorithm that reduces communication overhead via local updates while converging to a Nash equilibrium.

Natural Gradient Descent for Improving Variational Inference Based Classification of Radio Galaxies

This work replaces standard SGD with the natural gradient descent optimizer iVON for optimizing BNN parameters under variational inference, achieving better uncertainty calibration in radio galaxy classification while maintaining predictive performance comparable to HMC and BBB-VI.

Natural Gradient VI: Guarantees for Non-Conjugate Models

Under mean-field parameterization, this paper establishes three key theoretical results for natural gradient variational inference (NGVI) in non-conjugate models: a relative smoothness condition on the variational loss, a global convergence-to-stationary-point guarantee for a modified NGVI with non-Euclidean projections, and, under additional structural assumptions, hidden convexity and fast global convergence guarantees.

Near-Exponential Savings for Mean Estimation with Active Learning

This paper proposes the PartiBandits algorithm, which combines disagreement-based active learning with UCB-style stratified sampling to achieve near-exponential label savings for mean estimation when auxiliary information \(X\) is predictive of the target variable \(Y\).

Neural Thermodynamics: Entropic Forces in Deep and Universal Representation Learning

This paper establishes a "neural thermodynamics" framework, proving that emergent entropic forces arising from stochasticity and discrete-time updates in SGD training systematically break continuous parameter symmetries while preserving discrete ones. This leads to a gradient balance phenomenon analogous to thermodynamic equipartition, thereby (a) providing the first theoretical proof of the Platonic Representation Hypothesis (that different models learn similar representations), and (b) reconciling the seemingly contradictory observations of sharpness-seeking and flatness-seeking behavior in deep learning optimization.

NeuSymEA: Neuro-symbolic Entity Alignment via Variational Inference

This paper proposes NeuSymEA, a neuro-symbolic reasoning framework based on a variational EM algorithm that unifies symbolic rule reasoning and neural network embeddings within a Markov Random Field for entity alignment, achieving significant performance gains and low-resource robustness on DBP15K.

Non-Stationary Bandit Convex Optimization: A Comprehensive Study

This paper systematically studies bandit convex optimization (BCO) in non-stationary environments, proposes two algorithms (TEWA-SE and cExO), and establishes unified regret upper and lower bounds under three non-stationarity measures (number of switches \(S\), total variation \(\Delta\), and path length \(P\)), achieving minimax optimality in several settings.

On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts

This paper presents the first minimax parameter estimation analysis for contaminated Mixture of Experts (MoE) models with softmax gating. It introduces the concept of "distinguishability" to characterize the relationship between a pretrained model and a prompt, proving that MLE achieves the parametric rate \(\tilde{O}(n^{-1/2})\) when distinguishability holds, while the rate degrades significantly otherwise.

Online Two-Stage Submodular Maximization

This paper introduces the Online Two-Stage Submodular Maximization (O2SSM) problem for the first time, and proposes the RAOCO algorithm for Weighted Threshold Potential (WTP) functions. By combining fractional relaxation with randomized pipage rounding, RAOCO achieves sublinear \((1-1/e)^2\)-regret in polynomial time, while also improving the approximation ratio for the offline problem.

Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification

This paper establishes a generalization rate of \(\widetilde{O}(L^4(1+\gamma L^2)/(n\gamma^2))\) for gradient descent on deep ReLU networks, achieving for the first time simultaneously: (1) the optimal \(1/n\) dependence on sample size \(n\), and (2) only polynomial dependence on depth \(L\).

Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions

This paper analyzes the ICL capability of Transformers for learning Markovian dynamical functions from an optimization-theoretic perspective. It derives the global optimal solution (in closed form) for single-layer linear self-attention, proves that recovering Transformer parameters from an extended parameter space is NP-hard, and reveals that multi-layer LSA is equivalent to preconditioned multi-objective optimization.

Optimistic Online-to-Batch Conversions for Accelerated Convergence and Universality

This paper proposes an optimistic online-to-batch (O2B) conversion framework that relocates optimism from the online algorithm to the conversion mechanism itself, enabling simple online gradient descent (OGD) to achieve an \(O(T^{-2})\) accelerated convergence rate. It is the first to achieve optimal convergence for strongly convex smooth objectives via O2B conversion, while simultaneously attaining universality with respect to smoothness.

Oracle-Efficient Combinatorial Semi-Bandits

Two oracle-efficient frameworks (adaptive and scheduled) are proposed for combinatorial semi-bandits, reducing oracle calls from \(\Theta(T)\) to \(O(\log\log T)\) while maintaining near-optimal regret bounds.

OrthoGrad Improves Neural Calibration

This paper presents the first systematic study of OrthoGrad (⊥Grad)—a geometrically constrained optimization method that projects gradients layer-wise onto directions orthogonal to the weights—for neural network calibration. Experiments on CIFAR-10 in low-data regimes demonstrate that OrthoGrad significantly improves calibration metrics (entropy, loss, confidence) without degrading accuracy, and the paper establishes convergence guarantees for a simplified variant under standard assumptions.
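
A minimal per-layer sketch of the projection described above; the optional rescaling that preserves the gradient norm is an assumption for illustration, not necessarily part of the method.

```python
import numpy as np

def orthograd_project(w, g, eps=1e-12, preserve_norm=True):
    """Replace gradient g by its component orthogonal to the layer's weights w."""
    w_flat, g_flat = w.ravel(), g.ravel()
    g_perp = g_flat - (np.dot(w_flat, g_flat) / (np.dot(w_flat, w_flat) + eps)) * w_flat
    if preserve_norm:  # rescaling is an illustrative choice
        g_perp *= np.linalg.norm(g_flat) / (np.linalg.norm(g_perp) + eps)
    return g_perp.reshape(g.shape)
```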

Personalized Subgraph Federated Learning with Differentiable Auxiliary Projections

This paper proposes FedAux, a framework that introduces differentiable Auxiliary Projection Vectors (APVs) to map node embeddings into a one-dimensional space and perform soft-ranking aggregation via Gaussian kernels. The APV simultaneously serves as a compact, privacy-preserving summary of the local subgraph for server-side similarity computation and participates in joint client-side optimization, enabling personalized subgraph federated learning.

Perturbation Bounds for Low-Rank Inverse Approximations Under Noise

This paper provides the first non-asymptotic spectral norm perturbation bound for low-rank inverse approximations \(\|(\tilde{A}^{-1})_p - A_p^{-1}\|\) under additive noise. Using contour integration techniques, it derives sharp bounds depending on the eigenvalue gap, spectral decay, and noise alignment, improving upon classical full-inverse bounds by up to a factor of \(\sqrt{n}\).

Preference Learning with Response Time: Robust Losses and Guarantees

This paper incorporates response time information from user decision-making into the preference learning framework, reducing the estimation error of reward model learning from exponential to polynomial order via a Neyman orthogonal loss function.

Probing Neural Combinatorial Optimization Models

This work is the first to systematically introduce probing methodology into the study of neural combinatorial optimization (NCO) models. It proposes a CS-Probing framework to analyze the decision-relevant knowledge, inductive biases, and generalization mechanisms encoded in model representations, and identifies critical embedding dimensions that can be leveraged to improve model generalization.

Problem-Parameter-Free Decentralized Bilevel Optimization

This paper proposes AdaSDBO, a fully parameter-free decentralized bilevel optimization algorithm that requires no prior knowledge of problem parameters. By employing adaptive step sizes based on cumulative gradient norms, it achieves a convergence rate of \(\tilde{O}(1/\sqrt{T})\).

PROFIT: A Specialized Optimizer for Deep Fine Tuning

PROFIT frames fine-tuning as a multi-task learning problem across the time dimension, and achieves forgetting-resistant fine-tuning without additional data or parameters by orthogonally projecting new-task gradients onto the direction of a "regression equilibrium point."

Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

This paper reveals a fundamental duality between sparse autoencoder (SAE) architectures and the concept structures they are capable of discovering — each SAE implicitly assumes a particular organization of concepts, and when this assumption is mismatched, concepts are systematically missed. Based on this analysis, the authors propose SpaDE, a novel SAE that accounts for nonlinear separability and dimensional heterogeneity.

Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner

By decomposing Shampoo's preconditioner into eigenvalue and eigenbasis components, this work reveals that learning rate grafting essentially compensates for eigenvalue staleness and scaling bias, and proposes eigenvalue correction and adaptive eigenbasis update frequency as principled replacements for these heuristics.

Quantitative Convergence of Trained Single Layer Neural Networks to Gaussian Processes

This paper establishes explicit quantitative upper bounds on the convergence of gradient-descent-trained shallow neural networks to Gaussian processes at any training time \(t \geq 0\), proving that the squared 2-Wasserstein distance decays polynomially at rate \(O(\log n_1 / n_1)\).

Rao-Blackwellised Reparameterisation Gradients

This paper proposes the R2-G2 estimator as a Rao-Blackwellised variant of reparameterisation gradients, proves that local reparameterisation in Bayesian MLPs is a special case thereof, and extends the low-variance gradient advantage to a broad class of probabilistic models.

Rethinking Neural Combinatorial Optimization for Vehicle Routing Problems with Different Constraint Tightness Degrees

This paper reveals that existing NCO methods severely overfit to fixed constraint tightness (e.g., fixed vehicle capacity \(C=50\) in CVRP), and proposes a Variable Constraint Tightness (VCT) training scheme along with a Multiple Expert Module (MEM), enabling models to effectively handle the full spectrum of constraints from extremely tight to extremely loose.

Revisiting Orbital Minimization Method for Neural Operator Decomposition

This paper revisits the classical Orbital Minimization Method (OMM) originating from computational chemistry, provides a concise linear-algebraic consistency proof, reveals its deep connections to Sanger's rule and streaming PCA, and generalizes it into a unified framework for training neural networks to perform spectral decomposition of positive semidefinite operators.

Robust Estimation Under Heterogeneous Corruption Rates

This paper studies robust estimation under heterogeneous corruption rates—where each sample is corrupted with a distinct known probability—and establishes tight minimax rates for mean estimation and linear regression under both bounded and Gaussian distributions. A key finding is that the optimal estimator can simply discard samples whose corruption rates exceed a certain threshold.

Second-Order Optimization Under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity

This paper provides the first systematic theoretical treatment of second-order stochastic optimization under heavy-tailed noise. It establishes tight minimax sample complexity lower bounds, proposes a normalized SGD algorithm with gradient and Hessian clipping (Clip NSGDHess), and proves that the proposed algorithm nearly achieves the information-theoretic limit.

Set Smoothness Unlocks Clarke Hyper-stationarity in Bilevel Optimization

This paper introduces set smoothness, a novel structural property of set-valued mappings, proves that it holds naturally in the nonconvex-PŁ bilevel setting, and uses it to reveal hidden weak convexity/concavity structure in the hyper-objective. This yields the first computability guarantee for Clarke stationary points of nonsmooth hyper-objectives.

Sharper Convergence Rates for Nonconvex Optimisation via Reduction Mappings

This paper proposes the Reduction Mapping framework, which exploits the manifold structure of the optimal solution set (arising from over-parameterization or symmetry) to reparameterize the optimization problem, and proves that this improves curvature properties and theoretically accelerates the convergence of gradient-based methods.

Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful

This paper systematically investigates the behavior of small batch sizes (down to batch size = 1) in language model pre-training and fine-tuning. It proposes a scaling rule for Adam \(\beta_2\) based on fixing the "token half-life," demonstrates that small-batch training is stable, shows that vanilla SGD becomes competitive with adaptive optimizers under small batches, and recommends avoiding gradient accumulation.
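
One way to instantiate the "fixed token half-life" rule mentioned above, under the assumption that the half-life of the second-moment EMA is the number of steps \(k\) with \(\beta_2^k = 1/2\) (about \(\ln 2/(1-\beta_2)\) steps, i.e. \(B\ln 2/(1-\beta_2)\) tokens at batch size \(B\)); holding that token count fixed yields the rescaling below.

```python
def rescale_beta2(beta2_ref, batch_ref, batch_new):
    """Keep the token half-life of Adam's second-moment EMA fixed across batch sizes.

    Token half-life at batch size B is B * log(1/2) / log(beta2); holding it
    constant gives beta2_new = beta2_ref ** (batch_new / batch_ref).
    (This is one reading of the rule described above.)
    """
    return beta2_ref ** (batch_new / batch_ref)

# Example: beta2 = 0.95 tuned at batch size 512, adapted to batch size 1.
print(rescale_beta2(0.95, 512, 1))  # ~0.9999
```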

Stable Coresets via Posterior Sampling: Aligning Induced and Full Loss Landscapes

This paper proposes a coreset selection framework based on posterior sampling. By sampling weight perturbations on BatchNorm layers to smooth the loss landscape, it guarantees alignment between the coreset and full-dataset loss landscapes (including approximations of the Hessian and Newton step), significantly outperforming existing methods under high label noise.

Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization

For non-smooth non-convex finite-sum coupled compositional optimization (FCCO), this paper proposes two stochastic momentum methods — SONEX (single-loop) and ALEXR2 (double-loop) — that improve the iteration complexity from \(O(1/\epsilon^6)\) to \(O(1/\epsilon^5)\) via outer Moreau envelope smoothing and nested smoothing techniques, and achieve the same optimal complexity for non-convex inequality-constrained optimization.

Streaming Federated Learning with Markovian Data

This work provides the first rigorous analysis of streaming federated learning with Markovian data under non-convex objectives, establishing that Minibatch SGD, Local SGD, and Local SGD-M all achieve sample complexity inversely proportional to the number of clients (linear speedup), and that Local SGD-M matches the communication complexity of Minibatch SGD without requiring heterogeneity assumptions.

The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels

This paper provides the first theoretical proof that the implicit bias of Structured State Space Models (SSMs) can be poisoned by clean-label training samples — specifically, there exist specially constructed training examples whose labels are correctly assigned by a teacher model, yet their inclusion fundamentally distorts the implicit bias of SSMs, causing complete generalization failure.

The Rich and the Simple: On the Implicit Bias of Adam and SGD

This paper provides theoretical and empirical evidence that neural networks trained with SGD tend to learn simple linear features (simplicity bias), whereas Adam produces richer nonlinear features, yielding predictors closer to the Bayes-optimal classifier and better generalization under distribution shift.

Training-Free Bayesianization for Low-Rank Adapters of Large Language Models

This paper proposes TFB (Training-Free Bayesianization), which converts a pre-trained LoRA adapter into its Bayesian counterpart without any retraining by searching for the maximum admissible variance within a family of low-rank isotropic Gaussian distributions. The procedure is theoretically shown to be equivalent to generalized variational inference.

Training Robust Graph Neural Networks by Modeling Noise Dependencies

This paper proposes Dependency-Aware Graph Noise (DANG) and the DA-GNN framework, which model a causal dependency chain from node feature noise → graph structure noise → label noise, and employ variational inference to derive an ELBO objective for training GNNs robust to multi-source correlated noise.

Understanding Adam Requires Better Rotation Dependent Assumptions

Through systematic empirical investigation, this paper demonstrates that the Adam optimizer exhibits strong dependence on the choice of coordinate basis in parameter space, showing that existing rotation-invariant theoretical assumptions are insufficient to explain Adam's superiority. The orthogonality of per-layer updates is identified as a reliable indicator for predicting Adam's performance under different bases.

Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks

This work presents the first theoretical analysis of the generalization behavior of mini-batch Adam. It proves that large-batch Adam/AdamW converges to solutions with high test error even with weight decay, whereas small-batch variants achieve near-zero test error through a combination of implicit regularization from stochastic gradients and explicit regularization from weight decay. Moreover, the effective weight decay upper bound for Adam is strictly smaller than that for AdamW.

Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise

This paper reveals the theoretical mechanism underlying m-sharpness in SAM through an extended stochastic differential equation (SDE) framework — smaller micro-batch size \(m\) induces stronger implicit regularization via the covariance of stochastic gradient noise (SGN) — and proposes a parallelizable Reweighted SAM (RW-SAM) method based on this insight.

Unveiling the Power of Multiple Gossip Steps: A Stability-Based Generalization Analysis in Decentralized Training

This paper presents the first stability-based generalization analysis of Multiple Gossip Steps (MGS) in Decentralized SGD (DSGD), proving that MGS reduces optimization error at an exponential rate to tighten generalization bounds, while also establishing that even as the number of Gossip steps approaches infinity, DSGD cannot fully close the generalization gap with centralized training.

VERA: Variational Inference Framework for Jailbreaking Large Language Models

This paper formalizes black-box LLM jailbreaking as a variational inference problem, training a small attacker LLM to approximate the posterior distribution of adversarial prompts for a target LLM. Once trained, the attacker can efficiently generate diverse jailbreak prompts without relying on human-crafted templates.

Verbalized Algorithms: Zero-shot Classical Algorithmic Reasoning for Correctness and Runtime Guarantees

This paper proposes the Verbalized Algorithms (VAs) framework, which preserves the control flow of classical algorithms while replacing their atomic operations (e.g., binary comparisons) with LLM queries, thereby inheriting the correctness and complexity guarantees of classical algorithms for natural language reasoning tasks. The framework is validated across four case studies: sorting, maximum finding, clustering, and submodular maximization.
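
A minimal sketch of the pattern: classical control flow (here, Python's built-in sort) with only the atomic pairwise comparison delegated to an LLM; `llm_prefers` is a hypothetical oracle stub so the example runs standalone.

```python
from functools import cmp_to_key

def llm_prefers(a: str, b: str) -> bool:
    """Hypothetical oracle: ask an LLM a single yes/no comparison question,
    e.g. 'Is passage A more relevant than passage B?'.
    Stubbed with string length so the sketch runs without any API."""
    return len(a) > len(b)

def verbalized_sort(items):
    """Classical sorting control flow; only the comparison is verbalized,
    so the usual O(n log n) comparison count carries over to LLM queries."""
    def cmp(a, b):
        return -1 if llm_prefers(a, b) else 1
    return sorted(items, key=cmp_to_key(cmp))

print(verbalized_sort(["a longer sentence", "short", "mid length"]))
```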

VIKING: Deep Variational Inference with Stochastic Projections

VIKING proposes a variational approximate posterior family based on the kernel- and image-space decomposition of the Fisher-Rao metric, and achieves scalable full-covariance Bayesian training via a stochastic alternating projections algorithm, surpassing existing Bayesian deep learning methods on multiple benchmarks.

Wasserstein Transfer Learning

This paper proposes WaTL, the first transfer learning framework for distributional outputs in Wasserstein space. It adopts a three-step procedure — weighted auxiliary estimation, bias correction, and projection — combined with adaptive source selection, to transfer knowledge from source domains and improve distributional regression in the target domain.