Skip to content

📐 Optimization & Theory

🔬 ICLR2026 · 222 paper notes

📌 Same area in other venues: 📷 CVPR2026 (22) · 🧪 ICML2026 (88) · 🤖 AAAI2026 (21) · 🧠 NeurIPS2025 (121) · 📹 ICCV2025 (7)

🔥 Top topics: Federated Learning ×16 · LLM ×10 · Adversarial Robustness ×5 · Layout & Composition ×4 · Compression ×4

A Block Coordinate Descent Method for Nonsmooth Composite Optimization under Orthogonality Constraints

OBCD is proposed as a block coordinate descent algorithm for solving "smooth + nonsmooth" composite optimization under orthogonality constraints (Stiefel manifold). By updating only \(k\ge 2\) rows of the solution matrix and reducing the problem to a small-scale \(k\times k\) orthogonal subproblem for exact solution, the method ensures strict feasibility and low per-iteration cost. It establishes "block-\(k\) stationarity," a stronger optimality guarantee than classic critical points, alongside an \(O(1/\epsilon)\) iteration complexity and last-iterate convergence rates under KL conditions.

A Convergence Analysis of Adaptive Optimizers under Floating-Point Quantization

This paper establishes the first theoretical framework for analyzing the convergence of adaptive optimizers under floating-point quantization. By applying a relative error quantization model simultaneously to gradients, weights, and optimizer states (momentum and second moments), it proves that quantized Adam and Muon maintain the same \(\tilde{O}(T^{-1/4})\) convergence rate as full-precision versions when the mantissa length grows only logarithmically with the number of iterations. It further reveals the theoretical mechanism explaining why Adam is highly sensitive to weight and second-moment quantization while Muon is more robust.

A Memory-Efficient Hierarchical Algorithm for Large-scale Optimal Transport Problems

This paper proposes HALO—a multiscale hierarchical framework for large-scale optimal transport (OT) problems. By combining "coarse-to-fine warm-start," "active support set pruning," and a "factorization-free first-order LP solver," the framework reduces memory requirements to \(O(n)\). On \(1024^2\) pixel images, it achieves an 8.9× speedup and a 70.5% reduction in GPU memory compared to the strongest baselines, while providing a scale-invariant upper bound on iteration complexity.

A Scalable Constant-Factor Approximation Algorithm for \(W_p\) Optimal Transport

This paper provides the first truly quadratic-time constant-factor approximation algorithm for all \(p \in [1, \infty]\) (including \(p = \infty\)): on any metric space, it computes a \((4+\varepsilon)\)-approximation for \(W_p\) optimal transport in \(O(n^2+(n^{3/2}\varepsilon^{-1}\log n\log\Delta)^{1+o(1)}\log U)\) time, reducing the previous \(O(\log n)\) approximation ratio to a constant.

A Schrödinger Eigenfunction Method for Long-Horizon Stochastic Optimal Control

For a class of stochastic optimal control (SOC) problems where the "uncontrolled drift is the gradient of a potential function," this paper proves that the linearized HJB operator is unitarily equivalent to a Schrödinger operator with a purely discrete spectrum. Consequently, long-horizon optimal control can be directly determined by the principal eigenfunction of this operator (with correction terms decaying exponentially over the time horizon). Based on this, closed-form solutions for symmetric LQR are provided, and a relative eigenfunction loss is proposed to eliminate "implicit reweighting" bias, reducing the memory/time complexity of long-horizon SOC from \(O(Td)\) to \(O(d)\) while improving control accuracy by approximately one order of magnitude.

A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent

This paper characterizes the relationship between adaptive optimizers like Adam/Shampoo and Normalized Steepest Descent (NSD) methods like SignGD/Muon through the lens of "two geometries / two types of smoothness." Both classes exploit the non-Euclidean geometry of the loss function, but adaptive optimizers rely on a stronger "adaptive smoothness" \(\Lambda_{\mathcal H}(f)\), whereas NSD depends on standard smoothness \(L_{\|\cdot\|_{\mathcal H}}(f)\). The authors extend the analysis of adaptive smoothness from convex to non-convex settings and prove that this stronger assumption yields benefits unattainable under standard smoothness: a Nesterov acceleration rate of \(\tilde O(T^{-2})\) and dimension-independent stochastic convergence rates.

Activation Function Design Sustains Plasticity in Continual Learning

This paper repositions "activation functions" as the primary, architecture-agnostic lever for mitigating loss of plasticity in continual learning. Through an attribute-level analysis of negative slope and saturation behavior, three design principles are refined. Based on these, two plug-and-play non-linearities, Smooth-Leaky and Randomized Smooth-Leaky, are proposed, which consistently improve late-stage adaptation in supervised continual classification and non-stationary MuJoCo reinforcement learning.

Adaptive Acquisition Selection for Bayesian Optimization with Large Language Models

This paper proposes LMABO, which utilizes a pre-trained Large Language Model (LLM) as a "zero-shot online strategist" for the Bayesian Optimization (BO) process. In each iteration, the optimization state is serialized into a structured text prompt, enabling the LLM to select the most suitable acquisition function (AF) from a portfolio. LMABO consistently outperforms static AFs, adaptive portfolios, and other LLM-based baselines across 50 benchmarks.

Adaptive gradient descent on Riemannian manifolds and its applications to Gaussian variational inference

This paper proposes RAdaGD—a family of Riemannian adaptive gradient descent methods that do not require line searches. By automatically adjusting step sizes through online estimation of local smoothness constants, RAdaGD achieves a non-ergodic convergence rate of \(f(x_k)-f(x^\star)\le O(1/k)\) under the weak assumptions of "local geodesic smoothness + generalized geodesic convexity." Based on this, it provides the first convergence guarantee for Gaussian variational inference when the target log-density does not satisfy global L-smoothness.

Adaptive Rollout Allocation for Online RL with Verifiable Rewards (VIP)

VIP (Variance-Informed Predictive allocation) is proposed to predict success probabilities of prompts via Gaussian processes, and subsequently use convex optimization to allocate rollout counts under computational budget constraints to minimize gradient variance. This consistently improves sampling efficiency for GRPO/RLOO in mathematical reasoning tasks, showing up to a 12.3-point Pass@32 improvement on AIME24/25.

Align-SAM: Seeking Flatter Minima for Better Cross-Subset Alignment

Align-SAM reframes "generalization" as the "consistency of updates across two random subsets of the same distribution." Building on SAM's pursuit of flat minima, it introduces an auxiliary mini-batch and ensures the gradient of the primary training batch is "congruent" with the gradient of the auxiliary batch. This approach consistently and moderately outperforms SAM/ASAM across various settings, including classification, noisy labels, few-shot transfer, and meta-learning.

Angle k-means: Accelerating Exact k-means via Angular Relationships

This paper proposes Angle k-means, which accelerates the assignment step by precomputing distances and angles between centroids. By utilizing a geometric inequality involving only angular comparisons to prune distant candidate centroids, it runs faster than SOTA methods like Ball k-means and Exp-ns without introducing any hyperparameters or altering the final clustering results.

Arbitrary-Order Block SignSGD for Memory-Efficient LLM Fine-Tuning

This paper proposes ABSignSGD, an optimizer combining SignSGD with "arbitrary-order block coordinate updates." By updating only one Transformer layer block per step, storing only that block's state, and using sign-based updates, it compresses the GPU memory of full-parameter fine-tuning to near-inference levels. It further incorporates a depth-biased block selection strategy to save an additional 20% runtime. A unified \(O(1/\sqrt{K})\) convergence proof and a multi-agent majority-vote variant (reducing communication by 960× via sign-only transmission) are provided.

AutoEP: LLMs-Driven Automation of Hyperparameter Evolution for Metaheuristic Algorithms

AutoEP inputs "online Exploratory Landscape Analysis (ELA) quantitative metrics" into a multi-LLM reasoning chain, allowing LLMs to dynamically adjust hyperparameters of metaheuristics (e.g., GA, PSO, ACO) generation by generation under zero-training conditions. By "grounding" reasoning in data to suppress hallucinations, an open-source 30B model can match the tuning performance of GPT-4.

Bayesian Evidence-Driven Prototype Evolution for Federated Domain Adaptation

FedPTE treats the server-side global prototype set as a dynamically evolving topological structure. It employs a Bayesian Gaussian Mixture Model (BGMM) and marginal likelihood ratios as "statistical evidence" to decide when to split or merge prototype clusters. Complemented by stability penalties and client-side topology-aware contrastive learning, it continuously characterizes fine-grained intra-class structures and mitigates domain shift in cross-domain federated learning.

Bayesian Parameter Shift Rules in Variational Quantum Eigensolvers

The Parameter Shift Rule (PSR) used for gradient estimation in Variational Quantum Eigensolvers (VQE) is reformulated into a Bayesian version—utilizing a derivative Gaussian Process with a VQE kernel to estimate gradients. This allows for the reuse of historical observations at arbitrary positions and provides posterior uncertainty of the gradients. Based on this, "Gradient Confidence Region (GradCoRe)" is proposed to adaptively allocate measurement shots, enabling VQE SGD optimization to converge significantly faster under the same measurement budget, surpassing existing SOTAs including the NFT family.

Beyond Aggregation: Guiding Clients in Heterogeneous Federated Learning

FedDRM upgrades the server's role from a "passive aggregator" to an "intelligent router"—utilizing Density Ratio Models (DRM) and Empirical Likelihood (EL) to model heterogeneity as a learnable client classification task. This allows the system to train high-quality local models while enabling the server to route new queries to the most specialized client, improving both local and system-level routing accuracy on CIFAR and real-world medical datasets.

Beyond Short Steps in Frank-Wolfe Algorithms

This work upgrades the analysis of the Frank-Wolfe (FW) algorithm from "short steps considering only primal progress" to a "unified primal-dual gap framework." Based on this, it proposes an Optimistic FW algorithm borrowing the "optimism" concept from online learning (featuring an \(O(LD^2/t)\) primal-dual convergence bound and a computable stopping criterion). Furthermore, it derives a class of "primal-dual short steps" that generalize short steps to the dual gap and are transferable to gradient descent. Experimentally, the optimistic variant significantly outperforms heavy-ball FW, vanilla FW, and even adaptive line search in convergence order.

Beyond the Heatmap: A Rigorous Evaluation of Component Impact in MCTS-Based TSP Solvers

This is an evaluation paper that deconstructs the mainstream "Heatmap + MCTS" paradigm for solving TSP. Using extensive experiments, the authors demonstrate that the "heatmap complexity," which the community has focused on, is not the most critical factor. Instead, neglected MCTS search hyperparameters dominate performance. A zero-learning, zero-parameter k-nearest neighbor prior heatmap (GT-Prior), when paired with tuned MCTS, can match or even exceed complex learning models like DIFUSCO.

Bi-LoRA: Efficient Sharpness-Aware Minimization for Fine-Tuning Large-Scale Models

Bi-LoRA utilizes an additional "adversarial LoRA module" to specifically carry SAM's adversarial perturbations, merging the original sequential two-step "perturb-then-descend" process into a single parallel forward-backward pass. This enables SAM to be practically applied to large-scale model LoRA fine-tuning with almost no added cost, while escaping the restricted subspace of LoRA-SAM to find flatter minima.

Bilevel Optimization with Lower-Level Uniform Convexity: Theory and Algorithm

This paper bridges the solvable gap between lower-level strong convexity (\(p=2\)) and general convexity by introducing "lower-level uniform convexity (LLUC, index \(p\ge 2\))". It establishes an implicit differentiation theorem under uniform convexity (providing explicit hypergradient formulas and Hölder smoothness) and designs a stochastic algorithm, UniBiO, proving an oracle complexity of \(\widetilde{O}(\epsilon^{-5p+6})\) to find an \(\epsilon\)-stationary point, which recovers the optimal \(\widetilde{O}(\epsilon^{-4})\) when \(p=2\).

Binomial Gradient-Based Meta-Learning for Enhanced Meta-Gradient Estimation

Addressing the pain point in gradient-based meta-learning (GBML) like MAML where "meta-gradient backpropagation scales linearly with adaptation steps \(K\)," this paper applies truncated binomial expansion to the meta-gradient product sequence \(\prod_{k}(I-\alpha H_k)\) instead of simply truncating the tail. The resulting estimator, BinomMAML, preserves more second-order information at the same truncation order \(L\), with error decaying at a super-exponential rate relative to \(L\). It supports parallel HVP computation and achieves precision significantly closer to full MAML on miniImageNet/tieredImageNet with only a slight increase in overhead.

Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods

The paper represents every distributed/asynchronous SGD method as a weighted directed "computation tree" and reduces convergence analysis to "measuring distances on the tree" via a geometric master theorem. This framework provides a unified explanation for existing methods and facilitates the batch design of 8 new methods (at least 6 of which achieve optimal computational time complexity).

BoGrape: Bayesian optimization over graphs with shortest-path encoded

BoGrape transforms the challenging task of "Bayesian optimization over the graph structure itself" into a Mixed-Integer Programming (MIP) problem. It precisely characterizes the shortest-path structure of an unknown graph using decision variables and encodes the shortest-path graph kernel along with the Gaussian Process (GP) posterior into the MIP. This allows for global optimization of the acquisition function and the integration of task constraints like molecular feasibility, outperforming existing graph BO methods on synthetic benchmarks and QM7/QM9 molecular design tasks.

Byzantine-Robust Federated Learning with Learnable Aggregation Weights

The paper reformulates the discrete decision of "detecting and removing malicious clients" into a continuous optimization of aggregation weights \(w\). By jointly solving this with the global model \(\theta\), the authors propose FedLAW—a federated learning framework that suppresses Byzantine clients while adaptively re-weighting honest clients in data-heterogeneous scenarios, backed by provable robustness and convergence guarantees.

CALM: Co-evolution of Algorithms and Language Model for Automatic Heuristic Design

CALM enables the simultaneous evolution of "prompts for heuristic generation" and the "underlying LLM" itself. Within an LLM-driven evolutionary heuristic design loop, each "prompt-response-performance" triplet is treated as reinforcement learning data. A local 7B INT4 model is fine-tuned online using GRPO, allowing heuristics generated on a single 24GB GPU to outperform SOTA methods relying on the GPT-4o-mini API across multiple combinatorial optimization tasks.

Cautious Optimizers: Improving Training with One Line of Code

Add a single line of code to any momentum optimizer: only update coordinates where the "update direction" and "current gradient" share the same sign; otherwise, zero out the update for those coordinates and scale up others proportionally. This yields "cautious" versions like C-AdamW / C-Lion, which consistently accelerate LLM pre-training and image classification without modifying original hyperparameters.

Cautious Weight Decay

This paper proposes Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification: weight decay is applied only on coordinates where the "optimizer update direction" aligns with the "parameter sign." This preserves the original loss objective (no longer implicitly optimizing a regularized/constrained proxy objective) and generates sliding mode dynamics upon reaching the stationary manifold, leading towards a local Pareto-optimal small-norm solution. Without adding new hyperparameters, CWD consistently reduces final loss and improves accuracy for language model pre-training and ImageNet across ADAMW / LION / MUON.

Celo2: Towards Learned Optimization Free Lunch

Ours proposes Celo2—a learned optimizer meta-trained in only 4.5 GPU hours. Through simple recipes such as normalized MLP update rules and task augmentation, it achieves stable generalization to billion-parameter models (GPT-3 XL 1.3B), which is 6 orders of magnitude larger than the meta-training distribution. Performance surpasses the previous VeLO, which consumed 4000 TPU-months, and carefully tuned AdamW baselines.

Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

This paper provides a more refined convergence analysis for Clipped SGD under heavy-tailed noise. By utilizing Freedman’s inequality more effectively and providing tighter clipping error bounds, the authors accelerate the known optimal high-probability convergence rates by a factor of \(\mathrm{poly}(1/d_{\mathrm{eff}})\) (\(d_{\mathrm{eff}}\) is the "generalized effective dimension" defined by the authors). Furthermore, the results for expected convergence break existing lower bounds and perfectly match (i.e., are optimal) the newly proven lower bounds.

CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving

CogFlow proposes a cognitive-inspired three-stage visual mathematical reasoning framework (Perception → Internalization → Reasoning). By employing Synergistic Visual Rewards to enhance perception, a Knowledge Internalization Reward to bridge perception and reasoning, and Visual-Gated Policy Optimization to anchor visual reasoning, it addresses the core issue of "correct perception but drifted reasoning" in existing methods.

COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

The authors propose COLD-Steer, which achieves training-free LLM activation steering by approximating representation changes produced by gradient descent on in-context examples. It achieves 95% of the steering effect using only 1/50th of the sample size.

Combination-of-Experts with Knowledge Sharing for Cross-Task Vehicle Routing Problems

Addressing the structural characteristic of Vehicle Routing Problems (VRP) where "each task is composed of several basic constraints," this paper proposes CoEKS: utilizing "constraint-specific experts + combiners" to activate and weigh experts of each constraint on demand, combined with a "mutual expert distillation + shared transformation layer" for multi-view knowledge sharing. This enables a unified model to perform accurately on seen constraint combinations, achieve zero-shot generalization to unseen combinations (relative improvement of 12–18% over SOTA), and support the insertion of new experts to adapt to entirely new constraints (approx. 25% improvement).

Combinatorial Bandit Bayesian Optimization for Tensor Outputs

Addressing expensive black-box systems where the output is a multi-modal tensor, this paper proposes a Tensor Output Bayesian Optimization framework (TOBO). It utilizes a Tensor Gaussian Process (TOGP) as a surrogate model to capture cross-modal dependencies and employs Upper Confidence Bound (UCB) sampling. The framework is further extended to a Combinatorial Bandit Bayesian Optimization (CBBO) setting where only a subset of tensor elements contributes to the objective. A CMAB-UCB2 criterion is designed to jointly select input points and subsets, both of which are proven to achieve sublinear regret of \(\tilde{O}(\sqrt{T})\).

Communication-Efficient Decentralized Optimization via Double-Communication Symmetric ADMM

Addressing decentralized composite optimization without a central node, this paper proposes DS-ADMM: embedding two rounds of communication per iteration into the ADMM structure via a pair of symmetric consensus constraints and accelerating it with symmetric ADMM. Although communication per iteration doubles, the total number of iterations decreases significantly, reducing overall communication costs while proving linear convergence under the weak condition of metric subregularity.

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

The authors complete the μP/CompleteP scaling rules ("tune on small models, transfer to large models") across four critical training axes—width, depth, batch size, and training duration. They further demonstrate that under proper parameterization, even fine-grained "per-module" hyperparameters (individual learning rates, weight decays, and Adam parameters for each tensor/layer type) can be directly transferred from a 50M parameter proxy model to a 7.2B parameter model, achieving a ~1.3× training speedup.

Composite Optimization with Error Feedback: the Dual Averaging Approach

To address the long-standing gap where "Error Feedback (EF) fails in composite optimization with non-smooth regularization/constraints," this paper reshapes the summation structure of iterations using Dual Averaging (DA). By combining it with the recent EControl error feedback mechanism, it provides the first convergence rate for composite convex optimization that perfectly matches the rate of non-composite cases.

Compositional Generalization through Gradient Search in Nonparametric Latent Space

This paper proposes the Abduction Transformer, which represents hidden rules in few-shot abstract reasoning tasks as variable-sized nonparametric latent mixture distributions. By performing gradient search on latent hypotheses at test time, it significantly improves OOD compositional generalization on 1-D ARC, SRAVEN, and linguistic systematicity tasks.

Conformal Robustness Control: A New Strategy for Robust Decision

Targeting the pain point of "overly conservative coverage constraints" when using conformal prediction for robust decision-making, this paper proposes Conformal Robustness Control (CRC). It optimizes prediction set construction directly under explicit robustness constraints (rather than requiring set coverage). The problem is solved using smooth proxies and alternating Lagrangian gradients, with non-asymptotic theoretical guarantees and test-time finite-sample calibration. CRC achieves lower risk certificates and decision losses in tasks like portfolio optimization, stocks, and battery energy storage while precisely hitting target robustness levels.

ConRep4CO: Contrastive Representation Learning of Combinatorial Optimization Instances across Types

ConRep4CO unifies diverse combinatorial optimization (CO) instance types by reducing them to a SAT format as an "intermediary modality." It employs augmentation-free contrastive pre-training using "original instance ↔ its SAT form" as positive pairs to learn cross-problem universal representations. When integrated into specialized neural solvers for MVC, MIS, MC, and MDS, the optimality gap is reduced by an average of 32% to 61%.

Constraint Matters: Multi-Modal Representation for Reducing Mixed-Integer Linear programming

A MILP model simplification framework based on constraint reduction is proposed. It defines fixed constraint strength \(\rho\) and utilizes information gain \(\Delta H=-\log\rho\) to identify Critical Tight Constraints (CTC). A multi-modal GNN representation, merging instance-level bipartite graphs with abstract-level type graphs, is designed to predict CTCs. Across 4 large-scale benchmarks, solution quality (\(\text{gap}_\text{abs}\)) improved by an average of 51.06%, and convergence speed (PDI) accelerated by an average of 17.47%.

Contextual Causal Bayesian Optimisation

This paper proposes CoCa-BO, which unifies the selection of "which variables to intervene on" (causal scope selection) and "what values to intervene with" (contextual Bayesian optimization) into a single search process. It uses a multi-armed bandit to select scopes among all "possibly-optimal mixed policy scopes" (POMPS) and Gaussian processes to select intervention values within each scope, providing the first sublinear regret bound covering both causal BO and contextual BO.

Convergence of Muon with Newton-Schulz

This paper provides the first non-convex convergence guarantee for the practical Muon optimizer (which uses Newton-Schulz approximation instead of exact SVD polar decomposition). It proves that the convergence rate matches the idealized SVD version up to a constant factor that decays doubly exponentially with the number of Newton-Schulz steps \(q\), and that Muon suffers \(\sqrt{r}\) times less rank-dependent loss than its vector counterpart SGD-M.

Convergence of Regret Matching in Potential Games and Constrained Optimization

This paper provides the first proof that Regret Matching+ (RM+) serves as a reliable and fast first-order optimizer for "constrained optimization over products of simplices" (including potential games as a special case), converging to an \(\epsilon\)-KKT point in \(O_\epsilon(1/\epsilon^4)\) steps. Simultaneously, it proves that the original Regret Matching (RM) can take exponential steps to converge even in two-player common-interest games, establishing the first worst-case (and exponential) separation between RM and RM+.

Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate

Starting from convex optimization theory, this work proves that deep learning training loss converges at a rate of \(O(1/\sqrt{T})\) and the optimal learning rate scales with \(1/\sqrt{T}\). This scaling law is validated on models ranging from GPT-2 to 12.5B parameter models (\(R^2 \ge 0.978\)), achieving learning rate extrapolation for up to 80x the training steps.

Corner Gradient Descent

This paper proposes a "contour perspective" that equates generalized (S)GD with arbitrary linear memory to a contour \(\gamma\) on the complex plane (via the response mapping \(\Psi=P/Q\)). It proves that by forming a "sharp corner" at the origin with an exterior angle \(\theta\pi\) (\(1<\theta<2\)), the loss convergence rate of SGD on power-law quadratic problems can be accelerated from \(O(t^{-\zeta})\) to \(O(t^{-\theta\zeta})\). The authors provide a precise phase diagram for the optimal acceleration factor \(\theta_{\max}=\min(2,\nu,\frac{2}{\zeta+1/\nu})\) and utilize rational approximation to compress the ideal infinite-memory "corner algorithm" into a practical finite-memory version, validated on synthetic problems and MNIST.

CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs

Starting from gradient analysis in reinforcement learning, CurES proves that "the prompt sampling distribution determines convergence speed, while rollout quotas determine the stability of gradient updates." Based on this, it employs a Bayesian (Beta-Binomial) difficulty estimation to dynamically bias sampling probabilities and rollout budgets toward medium-difficulty tasks. It outperforms GRPO by an average of 3.3 (1.5B) / 4.82 (7B) points across eight mathematical reasoning benchmarks and achieves up to 5.5x faster convergence.

Cutting the Skip: Training Residual-Free Transformers

Starting from the condition number of the Transformer Jacobian, this paper reveals that the essence of skip (residual) connections is to improve the network's condition number. Based on this, a scheme is proposed that modifies only the initialization without changing the architecture. This allows a Transformer completely devoid of residual connections to be trained as fast as models with residuals for the first time, while learning more abstract and hierarchically clear representations on dense prediction tasks, outperforming residual-based baselines.

DADA: Dual Averaging with Distance Adaptation

DADA grafts the technique of "dynamically estimating the distance from the initial point to the optimal solution" onto the classic Dual Averaging (DA) framework. This results in a universal first-order method that requires no problem-dependent parameters, holds for both constrained and unbounded domains, and automatically adapts to six major convex function classes. Simultaneously, it reduces the cost of misestimating the initial distance from multiplicative \(\rho^2\) to logarithmic \(\log^2\rho\).

Decentralized Nonconvex Optimization under Heavy-Tailed Noise: Normalization and Optimal Convergence

This paper proposes GT-NSGDm—a decentralized algorithm combining gradient normalization, momentum variance reduction, and gradient tracking. For the first time in a decentralized nonconvex setting, assuming only that gradient noise has finite \(p\)-th moments (\(p\in(1,2]\)), it proves that the expected gradient norm achieves the optimal convergence rate of \(O(1/T^{(p-1)/(3p-2)})\), consistent with the centralized lower bound.

Deep-ICE: The First Globally Optimal Algorithm for Minimizing 0–1 Loss in Two-Layer ReLU and Maxout Networks

This paper employs constructive algorithmics (list homomorphism + fusion law) to derive Deep-ICE, the first globally optimal algorithm for Empirical Risk Minimization (ERM) of two-layer ReLU/Maxout networks under 0-1 loss. With a worst-case complexity of approximately \(O(N D^{K+1})\), it achieves superior training and test accuracy compared to SVMs and gradient-descent-trained MLPs across 11 UCI datasets using an accompanying coreset heuristic.

Deep Latent Variable Model based Vertical Federated Learning with Flexible Alignment and Labeling Scenarios

The problem of "misaligned users" in Vertical Federated Learning (VFL) is reinterpreted as a classic "blockwise missing data" problem. By employing a deep latent variable model with a two-stage training approach (unsupervised pre-training followed by supervised fine-tuning), this work simultaneously handles misaligned, unlabeled, and data under arbitrary missing mechanisms (MCAR/MAR/MNAR) in multi-party scenarios. It outperforms the strongest baselines in 160 out of 168 configurations, with an average lead of 9.6 percentage points.

DeepAFL: Deep Analytic Federated Learning

Ours proposes DeepAFL, which achieves the first deep analytic federated learning model with representation learning capabilities by designing gradient-free analytic residual blocks and introducing a layer-wise federated training protocol. It maintains ideal invariance to data heterogeneity while breaking the limitations of existing analytic methods restricted to single-layer linear models, outperforming SOTA by 5.68%-8.42% on three benchmark datasets.

DeMo: Decoupled Momentum Optimization

DeMo replaces the "sync full-precision gradients every step" approach in distributed data parallelism with "syncing only compressed local momentum." By decoupling momentum updates across workers, compressing momentum via DCT orthogonal transformation + top-k sparsification, and utilizing the momentum buffer itself as error feedback, DeMo achieves up to an 85× reduction in communication volume per step compared to AdamW-DDP, while maintaining comparable downstream accuracy and convergence.

Derandomized Online-to-Non-convex Conversion for Stochastic Weakly Convex Optimization

This paper demonstrates that in stochastic weakly convex optimization, the randomized interpolation or scaling required by O2NC can be removed. By taking stochastic subgradients directly at the current iterate and using online incremental learning with quadratic regularization, the optimal Goldstein stationarity complexity is achieved. This further derives an SGDM variant that is nearly equivalent to periodic momentum restart.

DES-LOC: Desynced Low Communication Adaptive Optimizers for Foundation Models

DES-LOC assigns independent synchronization periods to parameters and various momentum states of adaptive optimizers—parameters are synchronized frequently, while momenta are synchronized sparsely according to their "half-lives." While maintaining provable convergence, it reduces communication volume by 170× compared to DDP and 2× compared to the previous SOTA, Local Adam, achieving 1.3–2.1× end-to-end acceleration on 1–13B models.

DiffBED: Scaling Bayesian Experimental Design to High-Dimensions

DiffBED identifies that the primary cause of Bayesian Experimental Design (BED) failure in high-dimensional design spaces is not the inadequacy of EIG estimators, but rather the "overconfident" exploitation of the likelihood in areas far from the data manifold (a form of reward hacking). It utilizes a diffusion model as a "realism prior" and uses EIG gradients to guide the reverse SDE of the diffusion process, thereby generating designs that are both highly informative and realistically feasible. This marks the first extension of BED to image design spaces exceeding 750,000 dimensions.

Difference Predictive Coding for Training Spiking Neural Networks

This paper transforms the bio-inspired local learning framework "Predictive Coding" into a spike-native training algorithm named DiffPC. Instead of transmitting dense floating-point numbers between layers, it only emits sparse ternary spikes (-1/0/1) when states change. DiffPC achieves 99.3% on MNIST and 89.6% on Fashion-MNIST, outperforming backpropagation baselines on CIFAR-10 while reducing training communication volume by over two orders of magnitude.

Differentiable Model Predictive Control on the GPU

The authors propose DiffMPC, a solver that completely ports differentiable Model Predictive Control (MPC) to the GPU. By employing Sequential Quadratic Programming (SQP) for the forward pass, a Preconditioned Conjugate Gradient (PCG) with a "stair" preconditioner to solve the KKT linear system in parallel across the time dimension, and the Implicit Function Theorem (IFT) to reuse the same KKT matrix for gradient computation, the method achieves a \(4\)--\(7\times\) speedup on GPU compared to baselines like mpc.pytorch, trajax, and Theseus. It was successfully used to automatically tune parameters via Reinforcement Learning, enabling a Toyota Supra to robustly drift through puddles.

Diffusion-DFL: Decision-focused Diffusion Models for Stochastic Optimization

This paper integrates diffusion models into the Decision-focused Learning (DFL) framework for the first time. It uses conditional diffusion models to characterize the full distribution of uncertain parameters and solves stochastic optimization using samples from this distribution. Two end-to-end gradient algorithms are proposed: an exact but memory-intensive reparameterization estimator, and a lightweight score function estimator that approximates the score function using the gradient of the ELBO, reducing memory usage from 60.75 GB to 0.13 GB. Decision quality consistently outperforms all baselines across three real-world optimization tasks.

Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks

This paper provides the first proof of directional convergence for gradient descent in leaky ReLU two-layer neural networks. Based on this, it establishes sufficient conditions for benign overfitting under a broad mixture data setting that significantly exceeds nearly orthogonal data while identifying a new phase transition phenomenon.

Directional Sheaf Hypergraph Networks: Unifying Learning on Directed and Undirected Hypergraphs

This paper proposes Directional Sheaf Hypergraph Networks (DSHN), which combine Cellular Sheaf theory with the directional information of directed hypergraphs to construct a complex-valued Hermitian Laplacian operator. This operator unifies and generalizes existing graph and hypergraph Laplacians, achieving a relative accuracy improvement of 2%–20% across 7 real-world datasets.

Distributionally Robust Linear Regression with Block Lewis Weights

This paper designs a second-order algorithm for "Empirical Group Distributionally Robust Least Squares" (i.e., min-max group regression) with an iterative complexity of only \(\tilde{O}(\min\{\mathrm{rank}(A),m\}^{1/3}\varepsilon^{-2/3})\). The key is using block Lewis weights to construct an optimal approximating ellipsoidal geometry, coupled with an accelerated proximal method, making the number of iterations nearly independent of the group count \(m\).

Distributionally Robust Optimization via Generative Ambiguity Modeling

This paper defines the "ambiguity set" of DRO directly on the parameter space of generative models (Diffusion Models / VAEs). By using reconstruction loss to constrain the consistency between generated and nominal distributions, and solving the inner maximization via dual learning and policy optimization, the authors develop GAS-DRO—a tractable DRO algorithm capable of searching for worst-case distributions across different support sets.

DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD

Starting from the analysis of the Jacobian matrix, this paper clarifies the source of "heavy-tailed gradients" during Transformer training. By redesigning the DNT architecture with normalization operators at appropriate positions (InputNorm + PreNorm + QKNorm + MidNorm), the authors enable training with vanilla momentum SGDW. The performance matches AdamW (ImageNet 81.5% vs 82.1%, OpenWebText val loss 2.849 vs 2.863) while saving half of the optimizer's memory.

DR-Submodular Maximization with Stochastic Biased Gradients: Classical and Quantum Gradient Algorithms

This paper systematically investigates continuous DR-submodular maximization under "stochastic biased gradients." It extends the analytical Lyapunov framework from exact gradients to gradients with bias and noise. Consequently, it proves a \(1/e\) approximation for a new class of constraints (convex sets with a greatest element), surpassing the \(1/4\) hardness of general convex constraints. The authors provide both classical (\(O(\epsilon^{-3})\) iterations) and quantum (\(O(\epsilon^{-1})\) iterations) zeroth-order algorithms, demonstrating quantum acceleration theoretically and numerically.

Dual Optimistic Ascent (PI Control) is the Augmented Lagrangian Method in Disguise

Proves that dual optimistic ascent (PI control), widely used in constrained deep learning, is mathematically equivalent to the classical Augmented Lagrangian Method (ALM) under a single-step first-order update regime. This transfers ALM's robust convergence guarantees (linear convergence to all strict local solutions) to PI control and provides principled tuning guidance for the optimism coefficient \(\omega\).

Efficient Algorithms for Incremental Metric Bipartite Matching

This paper provides the first constant-factor approximation algorithm for incremental minimum-cost bipartite matching in arbitrary metric spaces. Under the setting of a fixed server set \(S\) and requests arriving online on one side, a "distance scaling hierarchy + push-relabel" framework is used to maintain an \(O(1/\delta^{0.631})\) approximate matching. The amortized update time per insertion is \(\tilde{O}(n^{1+\delta})\), and the approach is naturally parallelizable and GPU-compatible.

Efficient Submodular Maximization for Sums of Concave over Modular Functions

Focusing on the "Sum of Concave over Modular" (SCM) subclass of submodular functions, this paper proposes a toolkit of "concave extension + Accelerated Approximate Projected Gradient Ascent (AAPGA) + randomized rounding." It reduces the query complexity for cardinality, knapsack, and partition matroid constraints from \(O(n^2\varepsilon^{-2})\) of PGA to \(O(n^{1/2}\varepsilon^{-1/2}(T_1+T_2))\), achieving a measured speedup of up to 32.3x.

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

The paper attributes the long plateau in grokking to the severe imbalance of the gradient spectrum and proposes EGD: leveling the update speed across different singular directions without changing the primary gradient direction, thereby significantly compressing the "memorization-then-generalization" delay into a few epochs.

Elastic Optimal Transport: Theory, Application, and Empirical Evaluation

This paper proposes Elastic Optimal Transport (ELOT), which replaces the equality constraints of classical OT with "marginal inequality constraints + mixed-sign cost matrices." This allows the transport mass to be adaptively determined by the geometric structure of the problem itself, significantly outperforming POT/UOT series methods on unsupervised domain adaptation and partial domain adaptation benchmarks.

Enhancing Communication Compression via Discrepancy-aware Calibration for Federated Learning

Existing communication compression methods in Federated Learning (e.g., Top-k, ATOMO) decide which parameters to discard based on "magnitude." This paper proposes using a small set of local calibration data to directly measure "how much dropping a specific compression unit changes the layer output." By sorting based on this output discrepancy, the method provides a plug-and-play enhancement for mainstream compression schemes, achieving an 18.9% relative accuracy improvement at a compression ratio of 0.1.

Enhancing Learning with Noisy Labels via Rockafellian Relaxation

This paper proposes the Rockafellian Relaxation Method (RRM), which wraps any supervised training loss into a reweightable min-min optimization problem. By automatically downweighting suspicious high-loss samples, it enhances the robustness of classification models in real-world noise, synthetic noise, and partial adversarial perturbation scenarios.

Entropic Confinement and Mode Connectivity in Overparameterized Neural Networks

This work reveals that a systematic increase in curvature along low-loss paths generates an entropic force barrier. Even when the energy path is flat, SGD noise confines optimization dynamics to flat regions near minima, explaining the paradox of "mode connectivity vs. restricted dynamics."

Error Feedback for Muon and Friends

The paper proposes EF21-Muon, the first communication-efficient distributed LMO optimizer that generalizes error feedback to non-Euclidean geometry with rigorous convergence guarantees. It reduces to Muon/Scion/Gluon when compression is disabled and achieves up to 7× communication savings on NanoGPT without accuracy loss.

Evaluating Data Influence in Meta Learning

This work generalizes influence functions from single-layer M-estimators to the bi-level optimization (BLO) structure of meta-learning. It proposes two closed-form formulas, task-IF and instance-IF, which accurately characterize the direct and indirect contributions of a task or an instance to meta-parameters using the "total gradient / total Hessian." Accelerated by EK-FAC and Neumann series, the method enables retraining-free identification of deleterious data and model editing.

Exploring Diverse Generation Paths via Inference-time Stiefel Activation Steering

The authors propose STARS (Stiefel-based Activation Steering for Diverse ReaSoning), a training-free inference-time activation steering method. By jointly optimizing \(N\) parallel generation paths' orthogonal steering directions on the Stiefel manifold during each token's decoding to maximize the geometric volume of hidden states, STARS promotes divergent activation trajectories. It consistently outperforms temperature sampling in diversity across test case generation (TestEval) and scientific discovery (LiveIdeaBench) with minimal latency and no loss in quality.

Exploring Mode Connectivity in Krylov Subspace for Domain Generalization

This paper moves beyond the mainstream approach of "finding flat minima" and instead exploits the global geometric property of the loss surface—mode connectivity. It proposes the Billiard Optimization Algorithm (BOA), which simulates billiard dynamics within a low-dimensional Krylov subspace to "walk" from a standard minimum along low-loss tunnels to a solution with stronger generalization, outperforming sharpness-aware methods like SAM on DomainBed.

Fantastic Pretraining Optimizers and Where to Find Them

Based on a unified and fair hyperparameter tuning and end-to-end evaluation protocol, this study systematically compares 11 deep learning optimizers. It reveals that the 1.4–2× speedups claimed by new optimizers largely originate from "weak baselines." True speedups do not exceed 1.4× and decay to 1.1× as model scale increases; furthermore, it confirms that matrix-based optimizers (Muon/Soap/Kron) indeed outperform scalar-based ones.

Fast Convergence of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks

This paper establishes the first convergence theory for Natural Gradient Descent (NGD) in training two-layer PINNs. It proves that the learning rate can be \(O(1)\), the convergence rate is independent of the sample size and the minimum eigenvalue of the Gram matrix, and it achieves quadratic convergence under smooth activations—significantly faster than first-order gradient descent.

Fast Data Mixture Optimization via Gradient Descent

FASTMIX reparameterizes "selecting data mixture proportions" into "weighting losses for each data source," making mixture proportions differentiable. By training only one proxy model and using gradient descent to simultaneously optimize the model and proportions, it reduces search costs from several hundred GPU-hours to 1–2 GPU-hours while achieving superior performance.

Fast Frank–Wolfe Algorithms with Adaptive Bregman Step-Size for Weakly Convex Functions

The paper liberates the Frank–Wolfe algorithm from the classical assumptions of "Lipschitz gradient + convexity." As long as the objective function satisfies "relative smoothness (L-smad)" with respect to a kernel generating distance and is weakly convex, Ours provides convergence guarantees from sublinear to linear under both convex and non-convex settings using adaptive Bregman step-sizes, and proves local linear convergence of FW for a class of non-convex problems for the first time.

Faster Gradient Methods for Highly-Smooth Stochastic Bilevel Optimization

By reinterpreting the F2SA method as a forward difference approximation of the hyper-gradient, this work proposes the F2SA-p algorithmic family utilizing higher-order finite differences. Under high-order smoothness conditions, it improves the SFO complexity of stochastic bilevel optimization from \(\tilde{\mathcal{O}}(\epsilon^{-6})\) to \(\tilde{\mathcal{O}}(p\epsilon^{-4-2/p})\), and proves an \(\Omega(\epsilon^{-4})\) lower bound indicating the method is nearly optimal for sufficiently large \(p\).

FedDAG: Clustered Federated Learning via Global Data and Gradient Integration for Heterogeneous Environments

A Clustered Federated Learning (CFL) framework named FedDAG is proposed. It integrates data and gradient signals to compute weighted class-wise similarities for more accurate client clustering and utilizes a dual-encoder architecture for cross-cluster feature transfer, consistently outperforming existing baselines under various heterogeneity settings.

Federated Learning with Profile Mapping under Distribution Shifts and Drifts

FEROMA decouples "which models to aggregate with" from "client/cluster identity" to "data distribution profiles": each client extracts a lightweight, differentially private distribution profile. The similarity between profiles automatically determines whether the current round employs clustered aggregation, personalization, or global aggregation. This enables robust operation under simultaneous distribution shift (between clients) and distribution drift (over time), with costs comparable to FedAvg.

FedMC: Federated Manifold Calibration

Addressing the issue where local calibration using global linear geometric priors (points/ellipsoids) pushes samples off the manifold and generates OOD pseudo-samples in Federated Learning, FedMC utilizes local Kernel PCA on clients to learn non-linear manifold geometry. These are aggregated into a privacy-secure "Geometry Dictionary" on the server, allowing clients to perform manifold-aligned calibration via table lookups. This serves as a plug-and-play module that consistently enhances various FL and FPL methods.

FedMuon: Federated Learning with Bias-corrected LMO-based Optimization

This paper points out that directly using Muon (an optimizer based on the Linear Minimization Oracle, LMO) as a local optimizer for FedAvg fails to converge because LMO is a biased operator. It proposes FedMuon, which utilizes SCAFFOLD-like control variables for bias correction, and provides the first proof that it converges for any number of Newton-Schulz iterations, with faster convergence as iterations increase.

FIRE: Frobenius-Isometry Reinitialization for Balancing the Stability–Plasticity Tradeoff

FIRE reformulates the long-standing problem of "how much to reset weights" as a constrained optimization problem with a closed-form solution. By projecting weights onto an orthogonal (isometric) manifold to restore plasticity while minimizing Frobenius error relative to old weights (maintaining stability), FIRE uses Newton-Schulz iterations for efficient approximation. It outperforms naive training and standard reinitialization methods across vision, language, and RL tasks with nearly zero hyperparameter tuning.

FMIP: Joint Continuous-Integer Flow for Mixed-Integer Linear Programming

Addressing the limitation that existing generative MILP heuristics only model integer variables and ignore integer-continuous coupling, FMIP utilizes flow matching to jointly model the solution distribution on the "integer + continuous" mixed solution space. Utilizing this complete solution candidate, it designs a holistic guidance mechanism to push the generation trajectory toward "better and more feasible" solutions, reducing the primal gap by an average of 41.34% across 8 standard benchmarks.

From Gradient Volume to Shapley Fairness: Towards Fair Multi-Task Learning

Addressing the unfairness issue in multi-task learning where gradient conflicts cause "strong tasks to dominate update directions and weak tasks to be repeatedly sacrificed," this paper proposes SVFair. It uses the volume of the parallelepiped spanned by normalized gradients (Gram determinant) as the utility function for a Shapley cooperative game. This allows for calculating the degree of task gradient deviation from the whole in a single forward pass, redistributing update weights accordingly to achieve superior MR and \(\Delta m\%\) across supervised and reinforcement learning benchmarks.

From Sequential to Parallel: Reformulating Dynamic Programming as GPU Kernels for Large-Scale Stochastic Combinatorial Optimization

The bottleneck of "sequentially solving second-stage integer subproblems per scenario" in stochastic combinatorial optimization is reformulated as matrix-vector multiplication over the \((\min, +)\) semiring. By designing hardware-aware GPU kernels for scenario batching, Bellman updates for over \(10^6\) scenarios are evaluated in parallel within a single GPU pass, achieving speedups of one to five orders of magnitude.

From Sorting Algorithms to Scalable Kernels: Bayesian Optimization in High-Dimensional Permutation Spaces

The paper reinterprets "comparison-based sorting algorithms" as feature generators for permutations. This perspective unifies the SOTA Mallows kernel as a special case of enumeration sort and derives the Merge Kernel with a length of only \(\Theta(n\log n)\) using merge sort. It significantly outperforms Mallows kernels in high-dimensional permutation Bayesian Optimization due to its orders of magnitude smaller feature dimensionality.

FrontierCO: Real-World and Large-Scale Evaluation of Machine Learning Solvers for Combinatorial Optimization

FrontierCO is a large-scale real-world benchmark covering 8 categories of combinatorial optimization (CO) problems (TSP, MIS, CVRP, etc.). It evaluates 16 ML solvers (neural methods + LLM Agents) against SOTA traditional solvers, finding that ML methods still lag significantly behind traditional approaches on structurally complex and extreme-scale instances, despite showing potential in specific scenarios.

FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed

FZOO utilizes "batched one-sided estimation + Rademacher (±1) perturbation" to bring zeroth-order optimizers close to Adam's convergence speed. By adopting adaptive step sizes based on the standard deviation of batch losses, it reduces the required forward passes by an order of magnitude. Simultaneously, ±1 perturbations allow multiple forward passes to be merged into a single batched matrix multiplication, making full-parameter LLM fine-tuning on consumer-grade memory realistic.

Gen-DFL: Decision-Focused Generative Learning for Robust Decision Making

Gen-DFL replaces the traditional "point prediction" in Decision-Focused Learning (DFL) with a conditional generative model. This allows the model to directly learn the full conditional distribution of optimization parameters and sample from high-risk tail regions. By using a CVaR objective for end-to-end training, it significantly reduces decision regret in high-dimensional, risk-sensitive decision-making problems.

Generalizable Heuristic Generation Through LLMs with Meta-Optimization

MoH elevates LLM-based heuristic generation from "evolving heuristics with a fixed evolutionary algorithm" to "iteratively building optimizers." By using self-invocation to generate diverse heuristic optimizers and selecting the best one based on multi-task utility as the next round's meta-optimizer, MoH breaks free from manually designed evolutionary frameworks and significantly enhances cross-scale generalization.

Generalization Below the Edge of Stability: The Role of Data Geometry

This paper proposes the "data shatterability" principle to provide a unified explanation of how data geometry controls the strength of implicit regularization for gradient descent near the Edge of Stability (EoS). It derives a spectrum of \(\alpha\)-dependent generalization upper and lower bounds for the Beta(α) radial distribution family and proves that generalization rates adapt to the intrinsic dimension \(m\) rather than the ambient dimension \(d\) for mixtures of low-dimensional subspace distributions.

Generative Bayesian Optimization: Generative Models as Acquisition Functions

GenBO trains a generative model directly as a proposal distribution whose sampling density is proportional to the acquisition function. By leveraging DPO-style logic to train in a single step using noisy utility values, it bypasses the need to fit surrogate regression or classification models, making it simple and scalable for high-dimensional, combinatorial, and large-batch black-box optimization (e.g., protein design).

GIT-BO: High-Dimensional Bayesian Optimization with Tabular Foundation Models

GIT-BO utilizes a frozen TabPFN v2 as a zero-training Bayesian optimization surrogate model, estimates a low-dimensional active subspace from the gradients of its predictive mean, and performs point selection using UCB within this subspace. It achieves a superior performance-time trade-off compared to various GP-based high-dimensional BO methods on synthetic and engineering optimization tasks up to 500 dimensions.

Globally Aware Optimization with Resurgence

This paper introduces resurgence theory from mathematical physics into neural network optimization: it first calculates the divergent asymptotic series of the parameter space partition function \(Z(g)=\int e^{-L(\theta)/g}\,d\theta\), then utilizes the Borel transform to map the singularities of this series to the values of the loss function at all critical points. This provides "target loss values" as global information to local gradient optimizers, packaged as a plug-and-play learning rate scheduler named SURGE.

Gradient-Based Diversity Optimization with Differentiable Top-\(k\) Objective

This work relaxes the inherently non-differentiable "top-\(k\) set diversity" objective into a differentiable loss using soft-rank, enabling its integration into gradient-based training. By employing Multiple Gradient Descent Algorithm (MGDA) to adaptively balance the conflicting gradients of "relevance vs. diversity," the framework optimizes both objectives simultaneously without altering model architectures or requiring post-processing. It achieves significant diversity gains across five recommendation datasets with negligible loss in accuracy.

Gradient-Normalized Smoothness for Optimization with Approximate Hessians

This paper proposes "Gradient-Normalized Smoothness," a problem-agnostic local characterization that enables gradient-regularized Newton-type methods using approximate Hessians to automatically adapt to the correct smoothness class, recovering optimal global convergence rates of exact Newton methods on both convex and non-convex objectives.

Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models

The proposed GradFix method constructs a binary mask using the signs of gradients computed from a very small number of samples on a target pre-trained model. This mask filters the source model’s task vector coordinate-wise, retaining only components aligned with the descent direction of the target loss landscape. It achieves cross-model task knowledge transfer without any fine-tuning, theoretically provides a strict first-order descent guarantee, and significantly outperforms naive transfer and few-shot fine-tuning on both vision and language benchmarks.

Gradient Descent with Large Step Sizes: Chaos and Fractal Convergence Region

This paper provides a rigorous proof for matrix factorization problems: when gradient descent uses large step sizes close to the critical threshold, fractal convergence boundaries and chaotic dynamics emerge in the parameter space. The final minimum reached (or even whether convergence occurs) becomes extremely sensitive to initialization, causing commonly assumed implicit biases—such as "flatness/minimum norm/balance"—to fail completely.

Harmonized Cone for Feasible and Non-conflict Directions in Training Physics-Informed Neural Networks

The paper unifies "implementability by non-negative loss weights" and "avoiding loss increase for any component" into a harmonized cone. It proposes HARMONIC, which uses the Double Description method to construct update directions within this cone, consistently outperforming existing reweighting and multi-objective gradient methods across multiple PDE / IDE benchmarks.

HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models

HBO decomposes the data mixing problem in LLM instruction fine-tuning into two levels: "how to sample across datasets" and "how to sample within each dataset according to difficulty." Using Global and Local Actors to dynamically update sampling probabilities based on the training state, HBO consistently outperforms static sampling and existing dynamic data mixing methods in multilingual and multi-task fine-tuning.

HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

HeuriGym places LLMs into an agentic closed loop of "problem reading—code writing—execution feedback—iterative refinement," requiring them to write complete heuristics from scratch for 9 real-world combinatorial optimization (CO) problems (EDA, biology, logistics, etc.). It uses a new metric, QYI, to measure solution quality and feasibility—results show that even the strongest GPT-o4-mini-high and Gemini-2.5-Pro score only 0.6 (where experts score 1.0), exposing weaknesses in tool usage, long-range planning, and adaptive reasoning.

Hierarchical Multi-Stage Recovery Framework for Kronecker Compressed Sensing

This paper proposes a "hierarchical observation" perspective for Kronecker Compressed Sensing (KCS), noting that each factor matrix of the Kronecker product measurement matrix actually probes signal sparsity at different levels. Based on this, it designs a Multi-Stage Recovery (MSR) framework that decomposes high-dimensional recovery into layer-wise MMV subproblems. MSR uniformly handles standard, hierarchical, and Kronecker support sparsity models with a unified \((s,N)\)-RIP theoretical guarantee. While maintaining accuracy comparable to SOTA, it reduces runtime by one to three orders of magnitude.

High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes

Ours extends the "effective dynamics" high-dimensional scaling limit framework by Ben Arous et al. to SGD with Polyak momentum (SGD-M) and adaptive step-size SGD with scalar preconditioning. It proves that at the critical step-size, SGD-M amplifies high-dimensional fluctuations and differs from online SGD only by a time rescaling, whereas a simple preconditioner that normalizes the gradient to unit norm (SGD-U) can broaden the range of convergent step-sizes and push fixed points closer to the population optimum.

High-dimensional Mean-Field Games by Particle-based Flow Matching

The authors reformulate the solution of high-dimensional Mean-Field Games (MFG) as a particle optimization problem in Lagrangian coordinates. By employing an alternating iteration of "particle proximal descent + Flow Matching for velocity field fitting" to approximate fictitious play, the method can solve both potential and non-potential games with high-dimensional scalability while providing provable convergence rates.

High-Probability Bounds for the Last Iterate of Clipped SGD

This paper provides the first high-probability convergence rate for the last iterate of Clipped-SGD under convex and smooth objectives with heavy-tailed noise (only finite \(\alpha\)-th moments, \(\alpha\in(1,2]\)). It also introduces a general technique to convert high-probability bounds into expectation bounds.

High Probability Bounds for Non-Convex Stochastic Optimization with Momentum

Ours completes the high-probability convergence and generalization bounds for Stochastic Gradient Descent with Momentum (SGDM) under non-convex settings: by relaxing noise to sub-Weibull heavy-tailed distributions and sequentially layering PL and Bernstein structural assumptions, a complete hierarchy of bounds is derived, ranging from \(\tilde{O}(1/\sqrt{T})\) and \(\tilde{O}(1/T)\) to dimension-independent \(\tilde{O}(1/n^2)\), marking the first generalization bounds for SGDM in the industry.

Hinge Regression Tree: A Newton Method for Oblique Regression Tree Splitting

This work rewrites the splitting of each node in an oblique regression tree as a nonlinear least squares problem of "taking the max/min envelope of two linear predictors." It proves that alternating fitting is exactly equivalent to the damped Newton (Gauss-Newton) method under a fixed partition. By using iterative updates with backtracking line search, the authors obtain compact oblique regression trees that are both fast and stable, with an \(O(\delta^2)\) universal approximation rate.

HOTA: Hamiltonian Framework for Optimal Transport Advection

HOTA reformulates the dual problem of the Generalized Schrödinger Bridge (GSB) as a joint optimization of "Kantorovich potential + HJB value function." By employing RL-style replay buffers, target networks, and adaptive gradient balancing, it achieves stable training without modeling intermediate densities, handles non-smooth potential functions, and strictly guarantees terminal distribution matching.

How does the optimizer implicitly bias the model merging loss landscape?

This paper proposes using a single physical quantity, "effective noise scale," to uniformly characterize the impact of optimization hyperparameters—such as learning rate, weight decay, batch size, momentum, and data augmentation—on model merging. It proves that merging benefits are a non-monotonic function of this noise (with an optimal critical point), thereby extending the implicit bias of the optimizer from "flatness of a single minimum" to the "global loss landscape geometry between different solutions."

How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data

This paper abstracts Muon/Shampoo as Spectral Gradient Descent (SpecGD) and provides closed-form training trajectories on Gaussian mixture imbalanced data. It proves that SpecGD learns all spectral components at the same rate (whereas GD prioritizes principal components), thereby achieving superior worst-class/class-balanced generalization under early stopping. This reveals the mechanism through which Muon outperforms SGD on imbalanced datasets.

Hyperbolic Aware Minimization: Implicit Bias for Sparsity

HAM alternates a lightweight "hyperbolic mirror step" with a standard optimizer step. Without increasing parameters or memory, it replicates the sparse implicit bias brought by m⊙w pointwise over-parameterization, while fixing its inherent "inverse metric collapse" near the origin that stalls sign flips, leading to gains in both dense and sparse training.

Hyperparameter Trajectory Inference with Conditional Lagrangian Optimal Transport

This paper proposes Hyperparameter Trajectory Inference (HTI): treating continuous hyperparameters as "time," it uses conditional Lagrangian optimal transport to learn the trajectory of neural network output distributions across varying hyperparameters. This allows for approximating outputs under unobserved hyperparameter settings at inference time without retraining the original model.

Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime

This paper provides the first proof that the implicit bias of mini-batch Adam differs from the full-batch regime. By constructing specific datasets, it demonstrates that per-sample Adam converges to the \(\ell_2\) max-margin classifier (whereas full-batch Adam converges to \(\ell_\infty\)), and characterizes its data-adaptive Mahalanobis norm margin maximization behavior on general datasets via the AdamProxy framework.

Implicit Regularization of SGD Reduces Shortcut Learning

This paper proves that the implicit regularization of SGD (with strength proportional to the learning rate divided by batch size \(\epsilon/b\)) systematically suppresses the model's reliance on spurious features, thereby improving group robustness without sacrificing accuracy—whereas full-batch GD not only lacks this benefit but may even exacerbate shortcut dependence.

Improved ℓp Regression via Iteratively Reweighted Least Squares

This paper introduces a novel IRLS algorithm for \(\ell_p\) regression with \(p \in (1, \infty)\) from a primal-dual perspective. By employing a lightweight multiplicative update rule, it achieves an iteration complexity of \(O\!\big(p^2 n^{\frac{p-2}{3p-2}}\log\frac{n}{\epsilon}\big)\). This marks the first time a practical IRLS method has attained the optimal iteration bounds previously only held by complex theoretical algorithms, significantly outperforming classical p-IRLS and CVX in experiments.

Improving Feasibility via Fast Autoencoder-Based Projections

Training an autoencoder to "straighten" complex (non-convex) feasible sets into a simple convex latent space (a ball) allows a single forward decoding pass to rapidly correct infeasible neural network outputs back into the feasible region. This achieves near 100% feasibility in sub-millisecond time, serving as a low-latency alternative to traditional projections or solvers.

Improving LLM-based Global Optimization with Search Space Partitioning

HOLLM adaptively partitions the search space into a set of "sub-region meta-arms" using a KD-tree, selects promising local regions via a bandit-style scoring function, and restricts the LLM to sample candidate points only within these small regions. This converts the LLM's weakness in global sampling—characterized by uneven and biased distribution—into a localized low-dimensional sampling advantage, outperforming global LLM sampling and traditional Bayesian optimization on multimodal functions and hyperparameter optimization.

Improving Online-to-Nonconvex Conversion for Smooth Optimization via Double Optimism

Based on the online-to-nonconvex (O2NC) framework by Cutkosky et al., this paper replaces the complex fixed-point inner loop with a "double optimism" hint function. This results in a unified first-order algorithm reaching \(O(\varepsilon^{-1.75}+\sigma^2\varepsilon^{-3.5})\) complexity, achieving both the optimal deterministic rate (removing the logarithmic factor) and the optimal stochastic rate.

In-Context Multi-Objective Optimization

TAMO transforms the Multi-Objective Bayesian Optimization (MOBO) workflow—which traditionally requires re-fitting a surrogate and optimizing an acquisition function for every new task—into an offline-trained, dimension-agnostic Transformer policy. During testing, it generates the next query via a single forward pass based only on historical observations and a candidate pool, maintaining comparable or superior Pareto quality across synthetic and real-world tasks while reducing proposal time by approximately \(50\times\) to \(1000\times\).

Incentives in Federated Learning with Heterogeneous Agents

Analyses incentive problems in heterogeneous Federated Learning (FL) from a game-theoretic perspective, proving the existence of Pure Strategy Nash Equilibrium (PSNE) under heterogeneous data distributions and PAC accuracy targets, and proposes a linear programming-based approximation algorithm to determine optimal contribution levels.

Incorporating Expert Priors into Bayesian Optimization via Dynamic Mean Decay

The proposed framework, DynMeanBO, directly incorporates expert priors (distribution of the optimum \(\pi(x)\)) into the mean function of a Gaussian Process, utilizing a weight that decays over iterations to prioritize the prior early and phase it out later. This results in a prior-informed Bayesian Optimization framework that is compatible with any acquisition function, incurs nearly zero additional overhead, and remains robust to poor priors.

It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization

This paper proposes MIRAS, a framework that unifies sequence modules like Transformers, linear RNNs, TTT, and Titans as "associative memory via online optimization at test-time." By extending the design axes of attentional bias and retention, it introduces three attention-free models—MONETA, YAAD, and MEMORA—which outperform various modern recurrent baselines in language modeling, commonsense reasoning, and long-context needle recall.

Jacobian Aligned Random Forests

JARF estimates the Expected Jacobian Outer Product (EJOP) of class probability gradients using a random forest to obtain a globally shared linear preconditioning matrix. This rotates "oblique" splitting directions toward the coordinate axes, allowing standard axis-aligned random forests to achieve the accuracy of oblique decision trees with almost no additional training cost.

Landing with the Score: Riemannian Optimization through Denoising

When a manifold is only implicitly provided via data samples, this paper utilizes the score function and its Jacobian learned by diffusion models to approximate the "nearest point projection" and "tangent space projection" on the manifold. This translates classical Riemannian optimization to scenarios with only data and no explicit geometry, providing two inference-time algorithms (DLF / DRGD) with non-asymptotic convergence guarantees.

LCA: Local Classifier Alignment for Continual Learning

This paper proposes the Local Classifier Alignment (LCA) loss function to resolve the classifier mismatch issue caused by incremental backbone merging in continual learning. By simultaneously minimizing classification loss and loss sensitivity within local regions of Gaussian class prototypes, combined with an incremental PEFT merging strategy (IM), the method achieves an overall average accuracy of 85.6% across seven benchmarks, significantly outperforming the state-of-the-art.

LDT: Layer-Decomposition Training Makes Networks More Generalizable

LDT decomposes network layers into stable and unstable layers based on gradient variance. It employs dual-branch cross-freezing and dynamic EMA updates to sever gradient interference from unstable layers to stable layers, thereby enhancing cross-domain generalization in super-resolution, classification, semantic segmentation, and NLP domain generalization tasks.

Learning from Algorithm Feedback: One-Shot SAT Solver Guidance with GNNs

A GNN is used in a single forward pass to generate "weights + polarities" for every variable in a SAT formula. These are multiplied into the solver's existing branching score function, and the GNN is trained end-to-end via policy gradient (GRPO) using the solver's own search cost as the reward—a paradigm the authors call RLAF (Reinforcement Learning from Algorithm Feedback). This approach accelerates base solvers by up to \(2\times\) across various SAT distributions and outperforms supervised learning schemes based on human-defined heuristics (backbone / UNSAT core).

Learning of Population Dynamics: Inverse Optimization Meets JKO Scheme

This paper proposes iJKOnet, which reformulates the task of "inferring energy functionals from discrete-time population snapshots" as an inverse optimization problem. By maximizing the gap between the optimal value of a JKO step and the value at the ground-truth measure, a min-max objective is derived. This allows learning the energy functional driving the Wasserstein gradient flow through standard adversarial end-to-end training, without requiring input-convex neural networks or precomputed optimal transport couplings.

Learning to Recall with Transformers Beyond Orthogonal Embeddings

This work analyzes the "early stage" of empirical gradient descent for a single-layer Transformer on a token retrieval task under random (non-orthogonal) embedding conditions. It derives an explicit formula for the model's storage capacity, revealing a multiplicative dependence between the sample size \(N\), embedding dimension \(d\), and sequence length \(L\), and proves that this scaling relationship is inherent to information-theoretic lower bounds.

Learning to Segment for Vehicle Routing Problems

The authors observe that during the search process of iterative VRP solvers, a large number of edges become "stable and unchanged" early on, and repeatedly optimizing them wastes computational resources. They propose the First-Segment-Then-Aggregate (FSTA) decomposition framework to aggregate stable segments into supernodes, allowing the solver to focus only on unstable parts. A neural network called L2Seg (with three variants: autoregressive, non-autoregressive, and synergistic) is trained to identify which segments to aggregate, accelerating SOTA solvers such as LKH-3, LNS, and L2D by 2~7 times.

Learning to Solve Orienteering Problem with Time Windows and Variable Profits

DeCoST is proposed as a two-stage learning framework that decouples coupled discrete routing decisions and continuous service time allocation in OPTWVP. The first stage employs a parallel decoder to jointly generate the path and initial service times, while the second stage precisely optimizes service times through Linear Programming (LP) to achieve global optimality. Cross-stage coordination is maintained via pTAR feedback. On 50-500 node OPTWVP instances, DeCoST achieves an optimality gap of only 0.83%-3.31%, with inference speeds up to 45 times faster than meta-heuristics.

LEGACY: A Lightweight Dynamic Gradient Compression Strategy for Distributed Deep Learning

Moving away from parameter-heavy or computationally intensive adaptive compressors, LEGACY utilizes two cost-free signals—"layer size" and "training stage"—to equip any compressor (Top-k, QSGD, PowerSGD, etc.) with a lightweight dynamic scheduler, significantly improving accuracy under the same communication volume.

Leveraging Discrete Function Decomposability for Scientific Design

DADO explicitly incorporates the decomposable structure of discrete objective functions in scientific design into the distributional optimization process. By using message passing on a junction tree to provide coordinated weights for each local generative factor, it identifies high-scoring designs in massive discrete search spaces more efficiently than standard Estimation of Distribution Algorithms (EDA).

LMask: Learn to Solve Constrained Routing Problems with Lazy Masking

LMask transforms the "one-pass forward, no-look-back" construction paradigm of neural routing solvers into a forward-and-backtracking lazy masking decoding. Combined with search trajectory embeddings and soft constraint penalties during training, it reduces the infeasibility rate to near 0% on constrained TSP (Time Windows/Draft Limits) while achieving superior solutions.

Local Entropy Search over Descent Sequences for Bayesian Optimization

The "information gain" objective of entropy search is shifted from the global optimum to the local optimum reachable by an iterative optimizer (e.g., gradient descent) starting from an initial point. By propagating the GP posterior through the optimizer to obtain a distribution of "descent sequences," the next query point is selected to maximize information gain for this distribution, resulting in higher sample efficiency for high-dimensional and complex black-box optimization.

LoRA meets Riemannion: Muon Optimizer for Parametrization-independent Low-Rank Adapters

This work treats LoRA low-rank updates as points on a "fixed-rank manifold" for direct optimization, lifting the Muon optimizer to the Riemannian manifold (termed Riemannion). This fundamentally eliminates the parameterization ambiguity caused by LoRA factorization. Combined with a gradient-aligned initialization and a single-backpropagation implementation, it improves both convergence speed and final accuracy for LLM and diffusion model fine-tuning.

Markovian Transformers for Informative Language Modeling

Ours proposes the Markovian Language Model (MLM) framework, which enforces Chain-of-Thought (CoT) to become a causally necessary reasoning bottleneck through structural constraints (removing the original question during answer prediction and deriving only from CoT)—analogous to the narrow latent layer of an autoencoder. Combined with GRPO-style policy gradient training, performance on GSM8K improved from 19.6% to 57.1%. Moreover, the learned CoT is transferable across model architectures (Llama → Mistral/Phi/GPT-2), proving that CoT encodes natural language reasoning rather than steganography.

MASAM: Multimodal Adaptive Sharpness-Aware Minimization for Heterogeneous Data Fusion

The authors adapt SAM, originally used in unimodal learning to find flat minima, into a "modality-adaptive" version. By using an adaptive perturbation score, the method identifies the current dominant modality and applies a decoupled perturbation only to it along the fusion gradient direction. This simultaneously mitigates modality imbalance and pulls each modality's encoder into flat regions during heterogeneous fusion.

Matched Data, Better Models: Target Aligned Data Filtering with Sparse Autoencoders

Sparse Autoencoders (SAEs) are used to decompose CLIP features into "countable" monosemantic concepts. Data filtering is then modeled as a submodular maximization problem (SDM) where the concept distribution of a selected subset is aligned with a target distribution. On DataComp-medium, this approach approaches state-of-the-art performance with fewer samples and \(5\times\) fewer GPU hours.

MILPnet: A Multi-Scale Architecture with Geometric Feature Sequence Representations for Advancing MILP Problems

The authors reformulate MILP instances from "bipartite graphs" into "geometric feature sequences." By employing multi-scale hybrid attention modeling, the approach fundamentally bypasses the expressiveness bottleneck of GNNs restricted by the Weisfeiler-Lehman (WL) test—which fails to distinguish Foldable MILPs—while providing theoretical guarantees for approximating feasibility, optimal solutions, and optimal value mappings.

Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

This work provides an in-depth analysis of the implicit bias of SAM trained on linear diagonal networks, revealing a qualitative shift induced by depth from \(L=1\) to \(L=2\): the limit direction of \(\ell_\infty\)-SAM is highly sensitive to initialization, while \(\ell_2\)-SAM exhibits a "weak-to-strong" sequential feature amplification phenomenon. The results indicate that analyses focusing solely on the \(t\to\infty\) limit are insufficient to reveal the complete dynamic behavior of SAM.

Monotone Near-Zero-Sum Games: A Generalization of Convex-Concave Minimax

This paper defines a new game class between monotone zero-sum and general-sum games—monotone \(\delta\)-near-zero-sum games. It proposes the ICL algorithm to black-box reduce the problem into a sequence of zero-sum subproblems. When the game is "sufficiently close to zero-sum," the gradient complexity of non-zero-sum games is accelerated from \(O(L/\min\{\mu,\nu\})\) to \(O(L/\sqrt{\mu\nu})\), approaching the zero-sum optimal rate for the first time.

MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates

MT-DAO is proposed as a multi-timescale distributed adaptive optimizer that introduces slow momentum (high \(\beta\)) to resolve the timescale mismatch problem caused by the rapid decay of standard momentum in low-frequency communication training. It provides the first convergence guarantee for such methods, eliminates the performance gap compared to fully synchronous DDP in language model pre-training, and reduces end-to-end training time by 6-27%.

\(\mu\)LO: Compute-Efficient Meta-Generalization of Learned Optimizers

This paper derives the Maximal Update Parametrization (\(\mu\)P) for two state-of-the-art learned optimizers (small_fc_lopt and VeLO) and pairs it with a low-cost "multi-width single-task" meta-training recipe. This allows optimizers meta-trained only on small MLPs to generalize to much wider, deeper, and longer-trained unseen tasks with zero additional computational overhead.

Multi-Action Self-Improvement for Neural Combinatorial Optimization

MACSIM extends the self-improvement paradigm of neural combinatorial optimization (NCO) from "single-step actions" to "multi-agent joint actions." By predicting all agent-task assignments in parallel and using a permutation-invariant set prediction loss, it explicitly leverages agent permutation symmetry to improve sample efficiency and coordination while significantly compressing solution generation latency in the self-improvement loop.

Multilevel Control Functional

This paper proposes Multilevel Control Functionals (MLCF), which graft non-parametric Stein control functionals onto the telescopic sum of Multilevel Monte Carlo (MLMC). By using control functionals to further suppress the variance of differences between adjacent fidelity models at each level, MLCF achieves faster convergence rates than MLMC when the integrand and density are smooth and the dimensionality is moderate. The authors also provide optimal sample allocation strategies and extensions to variational inference.

Muon Outperforms Adam in Tail-End Associative Memory Learning

This paper reveals the mechanism behind Muon's speed advantage over Adam through the lens of "associative memory": Muon's update rule normalizes gradient singular values, which naturally matches the outer-product superposition structure of associative memory, thereby enabling more balanced learning of low-frequency "tail" knowledge in heavy-tailed data.

MuonBP: Faster Muon via Block-Periodic Orthogonalization

MuonBP enables each device to perform orthogonalization only on local shards under tensor parallelism, executing global orthogonalization every \(P\) steps. By using two distinct learning rates—"block step" and "full step"—it eliminates the throughput loss caused by inter-device communication in Muon. On an 8B model, it achieves approximately 8% speedup over Muon with even better convergence results.

Nesterov Finds GRAAL: Optimal and Adaptive Gradient Method for Convex Optimization

This paper "correctly" integrates GRAAL, a parameter-free and line-search-free adaptive gradient method, with Nesterov acceleration for the first time. The resulting Accelerated GRAAL allows the step size to adapt to local curvature at a geometric (linear) rate, similar to non-accelerated GRAAL, achieving near-optimal iteration complexity of \(O(\sqrt{L\|x_0-x^*\|^2/\epsilon})\) under both \(L\)-smoothness and more general \((L_0,L_1)\)-smoothness.

NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization

NeuCLIP reformulates contrastive loss as a "minimization problem with auxiliary variables" using convex conjugation. It then employs variational analysis to collapse \(n\) sample-wise auxiliary variables into a lightweight Neural Partitioning Network (NPN) to predict log-partition functions. This eliminates optimization errors that scale with the dataset/batch size ratio (\(O(n/B)\)) found in FastCLIP, consistently outperforming existing methods on CLIP training scales from millions to billions.

Neural Hamilton–Jacobi Characteristic Flows for Optimal Transport

This paper reformulates Optimal Transport (OT) maps as gradients of viscosity solutions to Hamilton–Jacobi (HJ) equations. By leveraging the "Method of Characteristics," it collapses dynamic integration into straight lines, enabling a single-network, pure minimization loss framework to obtain closed-form, bidirectional, and provably optimal transport maps, completely eliminating adversarial training and numerical ODE integration.

Neural Multi-Objective Combinatorial Optimization for Flexible Job Shop Scheduling Problems

Using a single preference-conditioned attention network combined with decomposition-based PPO, this work generates an entire Pareto front covering various trade-offs for Multi-Objective Flexible Job Shop Scheduling (MOFJSP) in a single training session, significantly outperforming evolutionary algorithms in both effectiveness and speed.

Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit

This paper proves that under generic non-degenerate assumptions, standard two-layer neural networks trained via layer-wise gradient descent can learn generic Gaussian Multi-Index models \(f(\bm{x})=g(\bm{U}\bm{x})\) using \(\tilde{O}(d)\) samples and \(\tilde{O}(d^2)\) time. Both sample and time complexities reach information-theoretic optimality, providing the first proof that neural networks can efficiently learn hierarchical functions.

Neural Optimal Transport Meets Multivariate Conformal Prediction

The authors utilize Neural Optimal Transport to learn a continuous, cyclically monotone "vector quantile function" (transporting a reference distribution to a conditional distribution), then use the induced multivariate rank as a conformal score to construct multivariate prediction regions that possess finite-sample coverage guarantees and adapt to the geometric shape of the conditional distribution.

Neural Sum-of-Squares: Certifying the Nonnegativity of Polynomials with Transformers

This work utilizes a Transformer to predict a "compact monomial basis" for polynomials to significantly reduce the scale of Semidefinite Programs (SDP) corresponding to Sum-of-Squares (SOS) certificates. Combined with a repair and expansion fallback mechanism that guarantees correctness, it accelerates non-negativity proofs by 100× to 2000× compared to SOTA solvers.

Never Saddle for Reparameterized Steepest Descent as Mirror Flow

This paper proposes a "steepest mirror flow" framework that unifies the entire family of steepest descent algorithms under reparameterization—from SignGF (≈Adam) to GF (≈SGD)—into a mirror flow perspective. It geometrically explains why steeper descent escapes saddle points faster and learns sparse features better, thereby elucidating two mechanisms why Adam/AdamW often outperforms SGD in fine-tuning tasks.

New Hybrid Fine-Tuning Paradigm for LLMs: Algorithm Design and Convergence Analysis Framework

Ours proposes a "Hybrid Fine-Tuning" paradigm that updates the massive base LLM using zeroth-order optimization and lightweight PEFT modules using first-order gradients. A "Mixed Smoothness Condition" is introduced to address the vast disparity in parameter smoothness, providing the first optimal convergence guarantee for Random Reshuffling SGD under generalized smoothness with multiple learning rates.

Newton Method Revisited: Global Convergence Rates up to \(O(1/k^3)\) for Stepsize Schedules and Linesearch Procedures

This paper re-analyzes Newton's method with stepsize from the unconventional perspective of "Hölder continuity of the third derivative." It proposes a family of explicitly computable stepsize schedules (RN), pushing the global convergence rate of the classical Newton method from \(O(1/k^2)\) to \(O(1/k^3)\). It also provides linesearch/backtracking versions (GRLS, UN) that do not require prior knowledge of smoothness constants and, for the first time, proves convergence guarantees for the practically common Greedy Newton linesearch.

NExCO: Native Solution Expansion for Diffusion-based Combinatorial Optimization

NExCO transforms "adaptive expansion" from an external wrapper around global predictors into an intrinsic decoding mechanism for CO-specific masked diffusion models. Every intermediate state of diffusion is a valid partial solution; the model constructs full solutions by progressively "unmasking" high-confidence variables with feasibility projections, achieving approximately 50% quality improvement and up to 4× inference acceleration across TSP, MIS, and CVRP.

Non-Asymptotic Analysis of Efficiency in Conformalized Regression

Establish the first non-asymptotic efficiency bounds for Conformalized Quantile Regression (CQR) and Conformalized Median Regression (CMR) under SGD training, explicitly characterizing the joint dependence of prediction set length deviation on training sample size \(n\), calibration sample size \(m\), and miscoverage rate \(\alpha\).

Non-Convex Federated Optimization under Cost-Aware Client Selection

This paper proposes a cost-aware framework for federated optimization that explicitly models different communication costs associated with various client selection strategies. By combining the Inexact Composite Gradient Method (I-CGM) with a new RG-SAGA gradient estimator, the authors achieve optimal communication and local computation complexities for non-convex optimization within this model.

NRGPT: An Energy-based Alternative for GPT

Ours proposes NRGPT (eNeRgy-GPT), which applies minimal modifications to the standard GPT to transform it into an energy-based model. By designing attention energy and feed-forward energy functions, each forward pass layer becomes equivalent to a gradient descent step of tokens on an energy landscape. The work proves properties of asymptotic energy descent and stable convergence, validating performance comparable to standard GPT on ListOps, Shakespeare, and OpenWebText.

On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime

Starting from the eigenvalue dynamics of the Neural Tangent Kernel (NTK), this paper demonstrates theoretically and experimentally that preconditioned gradient descent (PGD), such as Gauss-Newton/Levenberg-Marquardt, can flatten the "disparity in convergence rates across frequency modes" caused by spectral bias into uniform convergence. This significantly compresses the delayed generalization window of grokking. However, PGD tends to remain trapped in the lazy NTK subspace, leading to weaker final generalization; it is necessary to switch back to first-order methods (e.g., Adam) after the lazy phase is exhausted to restore generalization.

On the Convergence Direction of Gradient Descent

This paper demonstrates that when Gradient Descent (GD) converges to a local strongly convex minimum, its trajectory does not approach from an arbitrary direction. Instead, it either aligns to a fixed direction (small learning rate) or converges while oscillating back and forth along a straight line (large learning rate). The boundary is exactly \(\eta = 2/(\lambda_1+\lambda_n)\). This discrete version of the "Gradient Conjecture" also provides an explanation for the sharpness oscillation observed in the Edge of Stability phenomenon.

On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

In decentralized training with extreme communication constraints and high data heterogeneity, the authors find that a "global merging" (averaging all node models) at the end of training brings global test performance close to Federated Learning levels. They theoretically prove for the first time that the global merging model of decentralized SGD can achieve the convergence rate of parallel SGD—the key lies in reinterpreting the inter-node variance, previously considered "harmful noise," as a "constructive component" necessary for matching this rate.

Online Black-Box Prompt Optimization with Regret Guarantees under Noisy Feedback

This paper proposes AOZPT, the first zeroth-order method for black-box prompt optimization in online learning + noisy feedback scenarios. It employs an adaptive uncertainty scaling mechanism to suppress two types of uncertainties: "generative model output noise" and "high variance of zeroth-order gradients." Under non-convex settings, it is proven to achieve sublinear regret, demonstrating superior stability and performance over offline and online baselines in text and image generation tasks.

Optimizer Choice Matters for the Emergence of Neural Collapse

Through 3,900+ training experiments and theoretical analysis, this work reveals that the choice of optimizer (specifically the coupling of weight decay) plays a decisive role in the emergence of Neural Collapse—AdamW (decoupled weight decay) fails to produce Neural Collapse, while SGD and Adam (coupled weight decay) succeed.

Personalized Collaborative Learning with Affinity-Based Variance Reduction

Proposes the personalized collaborative learning framework AffPCL. Through bias correction and importance correction mechanisms, heterogeneous agents collaboratively learn personalized solutions without prior knowledge, achieving an adaptive convergence rate of \(O(t^{-1} \cdot \max\{n^{-1}, \delta\})\). This results in linear acceleration when agents are similar and performance no worse than independent learning when differences are large.

Πnet: Optimizing Hard-Constrained Neural Networks with Orthogonal Projection Layers

The Πnet architecture is proposed to guarantee strict satisfaction of convex constraints by attaching an orthogonal projection layer based on Douglas-Rachford operator splitting to the output layer of neural networks. Efficient backpropagation is achieved through the Implicit Function Theorem, significantly outperforming existing methods in training time, solution quality, and hyperparameter robustness.

Predictive Differential Training Guided by Training Dynamics

Training a DNN is treated as a nonlinear dynamical system in a high-dimensional weight space. Using Koopman/DMD, weights several epochs ahead are predicted to skip SGD iterations. A "dynamic consistency analysis" mask is employed to accept only high-fidelity predicted weights whose local dynamics align with global dynamics. This works as a plug-and-play plugin to accelerate various optimizers (SGD/Adam/LAMB, etc.) by 10–40% without loss of accuracy.

Provable and Practical In-Context Policy Optimization for Self-Improvement

This paper proposes the In-Context Policy Optimization (ICPO) framework, theoretically proving that a single-layer Linear Self-Attention Transformer, after sufficient pre-training, can simulate policy optimization algorithms in-context. It designs a practical ME-ICPO algorithm that achieves multi-round self-reflection at test-time through minimum-entropy selection and self-evaluated rewards, yielding significant improvements in mathematical reasoning tasks (e.g., Qwen2.5-Math-7B improved from 11% to 30% on AIME 2024).

Provably Accelerated Imaging with Restarted Inertia and Score-based Image Priors

To address the slow convergence of RED-like imaging reconstruction algorithms, this paper proposes RISP, which adds "inertial steps + restart mechanism" to iterations. Without requiring convexity of the prior, it provably improves the convergence rate from \(O(n^{-1/2})\) to \(O(n^{-4/7})\), achieving up to 24× speedup in large-scale imaging while maintaining reconstruction quality.

Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction

This paper proves that in classical federated learning (centralized distributed optimization), once the "server \(\to\) worker" communication cost \(\tau_s\) is factored in, even unbiased sparsification compressors cannot simultaneously improve the "communication term \(\tau_s d L\Delta/\varepsilon\)" and the "variance term \(h\sigma^2 L\Delta/\varepsilon^2\)" with the number of workers \(n\). Even in the homogeneous (i.i.d.) setting where all workers share the same function, the improvement magnitude is at most a polynomial-logarithmic factor \(\mathrm{poly}\log n\). To this end, the authors construct a novel chain-based worst-case function and develop a proof framework that reduces the lower bound to "random sum concentration inequalities."

Rapid Training of Hamiltonian Graph Networks using Random Features

This paper proposes RF-HGN, which constructs dense layer parameters through random feature sampling (ELM/SWIM) and solves a linear least squares problem to train Hamiltonian Graph Networks. This approach completely bypasses gradient descent iterative optimization, achieving a 150-600x speedup on N-body physical systems while maintaining comparable accuracy and strong zero-shot generalization.

Reducing Contextual Stochastic Bilevel Optimization via Structured Function Approximation

This paper parameterizes the context-dependent lower-level solution \(y^\star(x,\xi)\) as \(W(x)\Phi(\xi)\) using a set of sufficiently expressive basis functions. This reduces the challenging Contextual Stochastic Bilevel Optimization (CSBO) problem to a standard Stochastic Bilevel Optimization (SBO) problem, eliminating the reliance on conditional sampling oracles and improving the sample complexity of CSBO from \(\tilde O(\epsilon^{-4})\) to a near-optimal \(\tilde O(\epsilon^{-3})\), matching SBO.

Rethinking Consistent Multi-Label Classification Under Inexact Supervision

The COMES framework is proposed to provide consistent risk estimators for multi-label classification under inexact supervision through first-order (Hamming loss) and second-order (Ranking loss) strategies, eliminating the need for label generation process estimation or uniform distribution assumptions.

Riemannian Federated Learning via Averaging Gradient Streams

This paper proposes RFedAGS, which, in the context of Federated Learning on Riemannian manifolds, replaces the averaging of client model points with the weighted averaging of local stochastic gradients transported back to the server's tangent space. This approach provides convergence guarantees even when arbitrary partial participation and non-IID data coexist, outperforming existing Riemannian FL methods on tasks such as PCA, Hyperbolic Structured Prediction, and SPD Fréchet mean.

Riemannian Optimization on Relaxed Indicator Matrix Manifold

Ours proposes a new relaxation for indicator matrices by relaxing column sum constraints from "equal to a fixed value" to "falling within an interval \((l,u)\)." It is proven that this relaxation set forms an \((n-1)c\)-dimensional embedded submanifold (RIM manifold). A complete Riemannian optimization toolbox is provided, reducing the computational complexity of gradients and Hessians from \(O(n^3)\) (required by doubly stochastic manifolds) to \(O(n)\). On image denoising and Ratio Cut clustering with millions of variables, the method is 70–200 times faster and achieves superior results compared to doubly stochastic manifold methods.

Riemannian Zeroth-Order Gradient Estimation with Structure-Preserving Metrics for Geodesically Incomplete Manifolds

To address the failure of zeroth-order estimators on Riemannian manifolds where "geodesic incompleteness" causes the exponential map to push perturbed points out of the manifold, this paper constructs a conformally equivalent metric \(g'\) that is geodesically complete while preserving the original stationary point structure. From a purely intrinsic perspective (independent of embeddings), it derives the mean squared error (MSE) upper bound for the two-point symmetric zeroth-order estimator (revealing its relationship with manifold curvature). Combined with unbiased rejection sampling, it generalizes the optimal convergence complexity of Riemannian zeroth-order SGD from Euclidean metrics to general Riemannian metrics.

Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity

Ringleader ASGD utilizes a "gradient table + round-based buffering + exactly one update per worker per round" asynchronous mechanism to avoid training dominance by fast devices in non-convex stochastic optimization under arbitrary data heterogeneity. It achieves the optimal time complexity for parallel first-order stochastic methods within a fixed computation time model.

Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures

This paper proposes a unified theoretical framework to explain the pervasive simplicity bias—the phenomenon where gradient descent tends to learn simple solutions before progressively moving to complex ones—across various neural network architectures (fully connected, convolutional, and attention-based) through saddle-to-saddle learning dynamics.

Scalable and Adaptive Trust-Region Learning via Projection Convex Hull

This paper proposes Projection Convex Hull (PCH), which transforms the intractable MINLP of convex hull trust-region learning into a differentiable surrogate optimization with weighted projections. By iteratively learning a small number of supporting hyperplanes, it obtains polyhedral trust regions in high-dimensional data that are tight, interpretable, and directly embeddable into downstream optimization models.

Scalable Second-Order Riemannian Optimization for K-means Clustering

This paper reformulates the non-negativity constrained low-rank K-means SDP relaxation as an unconstrained smooth optimization problem on a product manifold. It is solved using a cubic-regularized Riemannian Newton method. Through manifold decomposition, the cost per Newton subproblem is reduced to linear complexity with respect to the number of samples \(n\). This allows convergence to a second-order critical point (globally optimal under benign non-convexity assumptions) within \(n\cdot\epsilon^{-3/2}\cdot\mathrm{poly}(r,d)\) time, which is 2–4 times faster than first-order SOTA methods while maintaining comparable statistical accuracy.

Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?

This work systematically analyzes the scaling laws of SignSGD under the Power-Law Random Features (PLRF) model, revealing two unique effects—drift-normalization and noise-reshaping—and proving that SignSGD's compute-optimal slope can exceed that of SGD in noise-dominated scenarios.

Scaling Multi-Task Bayesian Optimization with Large Language Models

BOLT distills a large number of historical Bayesian Optimization (BO) trajectories into an LLM, enabling the LLM to generate high-quality initial solutions for new tasks. These candidates are then passed to standard single-task BO for continued search, overcoming the performance saturation issues encountered by traditional multi-task BO as the number of tasks increases in domains such as database query plan optimization and antimicrobial peptide design.

SCRAPL: Scattering Transform with Random Paths for Machine Learning

Addressing the high computational cost of the multivariate scattering transform (ST) as a differentiable loss function due to the large number of paths \(P\), the authors propose SCRAPL. By randomly sampling only one path per step and employing three variance reduction techniques—P-Adam (Path-adaptive momentum), P-SAGA (Path-stochastic average gradient), and \(\theta\)-importance sampling—to stabilize gradients, SCRAPL achieves Pareto optimality. It reaches nearly the same accuracy as full-path ST at a low computational cost comparable to MSS in unsupervised sound matching tasks.

Seesaw: Accelerating Training by Balancing Learning Rate and Batch Size Scheduling

This paper theoretically proves the finite-sample equivalence between "learning rate decay" and "batch size increase" under SGD (and normalized SGD as a proxy for Adam). Based on this, it proposes the plug-and-play Seesaw scheduler: whenever a cosine schedule would halve the learning rate, it instead multiplies the learning rate by \(1/\sqrt{2}\) and doubles the batch size. This matches the loss curve of cosine decay under iso-FLOPs while reducing serial wall-clock time by approximately 36%.

SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration

This paper establishes a unified convergence analysis framework for AdaGrad-type adaptive preconditioned SGD. By constraining the preconditioning operator within an operator subspace satisfying specific structural assumptions, the authors use a single proof to simultaneously recover SOTA convergence rates for AdaGrad-Norm, AdaGrad, and ASGO/One-sided Shampoo, while providing the first convergence guarantee for DASGO. It further proves that diagonal preconditioning (AdaGrad, DASGO) can be combined with Nesterov momentum for provable acceleration, theoretically explaining for the first time why the "diagonal preconditioning + momentum" mechanism in Adam is so efficient.

Shuffling the Data, Stretching the Step-Size: Sharper Bias in Constant Step-Size SGD

This paper rigorously combines two classic heuristics—Random Reshuffling (RR1) and Richardson–Romberg Extrapolation (RR2)—into a unified algorithm for the first time. It proves that on quasi-strongly monotone Variational Inequality Problems (VIPs), their synergy can compress the asymptotic bias of constant step-size SGD from \(O(\gamma)\) to \(O(\gamma^3)\), while maintaining the \(O(\gamma^2)\) mean squared error (MSE) provided by RR1. Both theory and experiments validate this "1+1>2" synergy.

Sign-SGD via Parameter-Free Optimization

This paper proposes ALIAS, a series of parameter-free Sign-SGD algorithms that eliminate the need for manual learning rate tuning. By estimating the objective gap and local smoothness constants per iteration, ALIAS automatically determines sign update stepsizes. It matches or exceeds tuned Sign-SGD and AdamW across LLaMA pre-training, Swin fine-tuning, and various benchmarks, significantly reducing the total computational cost associated with learning rate grid searches.

Single-Loop Byzantine-Resilient Federated Bilevel Optimization

This paper investigates federated bilevel optimization in the presence of Byzantine clients. It first derives an asymptotic error lower bound determined by the heterogeneity of both upper and lower levels, then proposes the single-loop algorithm BR-FedBi and its Momentum/PAGE variants. These algorithms utilize auxiliary variables for hypergradient estimation combined with robust aggregation, theoretically achieving optimal Byzantine resilience or optimal sample complexity, and experimentally outperforming BILANTINE which requires sub-loops.

Sobolev Gradient Ascent for Optimal Transport: Barycenter Optimization and Convergence Analysis

This paper reformulates the exact Wasserstein barycenter as an unconstrained concave dual problem and performs gradient ascent directly in the \(\dot H^1\) Sobolev geometry. This approach eliminates expensive \(c\)-concavity projections while providing a global \(O(T^{-1/2})\) convergence guarantee, matching the order of classical non-smooth convex optimization.

Solving the 2-norm k-hyperplane clustering problem via multi-norm formulations

This paper transforms the non-convex exact solution problem of 2-norm k-hyperplane clustering into a multi-norm mixed-integer model reinforced by 2-norm, 1-norm, and \(\infty\)-norm constraints. This allows spatial branch-and-bound to obtain non-zero lower bounds earlier, accelerating median solution time by up to approximately \(41\times\) on LowDim/HighDim benchmarks.

SPREAD: Efficient Sampling-based Adaptive Diffusion Pareto Frontier Refinement

SPREAD treats the conditional DDPM as a multi-objective optimization (MOO) solver: it first trains a Diffusion Transformer conditioned on "objective values," then injects a guidance term consisting of a "multi-gradient descent direction + Gaussian RBF repulsion" into each reverse diffusion step. This enables a batch of candidate points to converge rapidly to the Pareto optimality while spreading uniformly to cover the entire Pareto frontier, matching or exceeding specialized SOTA methods in online, offline, and Bayesian settings.

Strongly Convex Sets in Riemannian Manifolds

This paper systematically generalizes the "strong convexity of sets" from Euclidean (Hilbert) space to Riemannian manifolds for the first time. It provides three non-equivalent definitions of strongly convex sets, their implication relationships, a set of constructive identification tools (lower level sets of smooth strongly convex functions are strongly convex sets), and proves that the Riemannian Frank-Wolfe (RFW) algorithm achieves linear convergence under the (Approximate) Riemannian Scaling Inequality (RSI).

Submodular Function Minimization with Dueling Oracle

This paper investigates submodular function minimization (SFM) in a setting where only noisy pairwise comparison feedback ("which of the two sets has a larger function value") is available (dueling oracle), without any access to direct function values. The authors construct subgradient estimators using the Lovász extension and SGD. For linear transfer functions, they provide an \(O(n^{3/2}/\sqrt{T})\) error bound with a matching lower bound (optimal within a restricted algorithm class). For sigmoid transfer functions, they utilize Firth bias correction to achieve an \(O(n^{7/5}/T^{2/5})\) error bound.

Symmetry-Aware Bayesian Optimization via Max Kernels

When the black-box objective function is invariant under the action of a group \(G\), this paper departs from the mainstream approach of "averaging" over group orbits to obtain an invariant kernel. Instead, it adopts the "maximum alignment" similarity \(k_{\max}\) between orbits and recovers a valid (Positive Semi-Definite) GP kernel \(k_+^{(D)}\) via eigenvalue clipping and Nyström extension. This significantly reduces cumulative regret in Bayesian Optimization (BO) without increasing asymptotic costs, with the advantage becoming more pronounced as the group size increases.

Taming Curvature: Architecture Warm-up for Stable Transformer Training

This paper employs "hot-start power iteration" to reduce the online tracking cost of the (preconditioned) Hessian's maximum eigenvalue for billion-parameter Transformers to \(<5\) HVPs per step. It confirms that training loss spikes are accompanied by curvature surges and that curvature increases with depth. Consequently, it proposes "Architecture Warm-up"—freezing most layers as identity early in training and progressively unfreezing them as the learning rate decays—to significantly suppress divergence and spikes without slowing convergence.

The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm

Polar Express transforms the polar decomposition approximation in Muon from heuristic Newton-Schulz coefficient searches into solving for the worst-case error-optimal combination of odd polynomials each round. While remaining purely matrix-multiplication-based and bfloat16-friendly, it allows Muon updates in GPT-2 training to converge faster and more stably toward valid polar factors.

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

Instead of designing a new optimizer, this paper utilizes Jacobian-vector products to perform actual full Gauss-Newton (GN) preconditioning on 45M/150M LLaMA models. Treating this as a "performance upper bound" for second-order optimization, the study investigates the gap left by existing methods like SOAP/Muon/Shampoo that use approximate Hessians. Results demonstrate that full GN reduces the iterations required to reach a target loss by \(5.4\times\) compared to SOAP and \(16\times\) compared to Muon at large batch sizes. Furthermore, layerwise curvature—completely ignoring cross-layer information—nearly matches the performance of full GN.

The Power of Small Initialization in Noisy Low-Tubal-Rank Tensor Recovery

In noisy low-rank tensor recovery with an over-estimated tubal-rank, this paper proves that setting the initialization scale of Factorized Gradient Descent (FGD) close to zero (small initialization) allows the final recovery error to depend only on the true tubal-rank \(r\) rather than the over-estimated rank \(R\). This approaches the information-theoretic minimax lower bound and can be achieved without any prior knowledge by using early stopping on a validation set.

Tighter Performance Theory of FedExProx

This paper revisits the federated extrapolated proximal method FedExProx, pointing out that its original theory was surprisingly no better than naive Gradient Descent (GD) on quadratic problems. By proposing a new analysis that "proves distance convergence first and then translates it to function value convergence," the authors bypass the \(L_{\max}\) dependence that caused looseness. They provide the first rigorous proof that FedExProx's total time complexity can be strictly superior to GD in realistic federated scenarios where communication dominates computation, extending the conclusions to partial participation, adaptive extrapolation, and PŁ conditions.

Toward Principled Flexible Scaling for Self-Gated Neural Activation

From a decision-making (multi-criteria scoring) perspective, this paper reveals the root cause of why self-gated activation functions underperform in layers already modeling fine-grained context (e.g., Transformers): gating function saturation leads to "trivially discriminative gating weights" for important features. The authors propose FleS, which uses sign-sensitive channel statistics via a small MLP to generate adaptive "vertical + horizontal" scaling coefficients. These coefficients dynamically adjust the upper bound and steepness of the gating curve. FleS consistently outperforms SOTA activation functions on Swin/PoolFormer/ResNet (e.g., 71.4% vs. 68.7% for GELU on Swin-Min).

Towards Better Branching Policies: Leveraging the Sequential Nature of Branch-and-Bound Tree

This paper proposes Mamba-Branching: explicitly modeling the Branch-and-Bound (B&B) solving process as a sequence of branching paths from the root to the optimal solution node. By utilizing Mamba with linear complexity for ultra-long sequence modeling and pre-training discriminative embeddings for candidate variables via contrastive learning, it outperforms all existing neural branching policies on heterogeneous MILPs and surpasses SCIP's default relpscost rule with faster convergence on difficult instances.

Towards Dynamic Interleaving Optimizers

DOIT treats the problem of "which optimizer to use during training" as an online decision-making problem that changes based on the training state. It utilizes Gaussian Process (GP) proxy models to predict the short-term reward of each optimizer at the current parameter state and selects the optimizer using an acquisition function that integrates transferability and training progress. This allows for dynamic interleaving between multiple optimizers, achieving \(2\%–10\%\) faster convergence and \(1\%–3\%\) higher accuracy compared to single or simple hybrid optimizers.

Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension

This paper provides a unified interpretation of optimizers such as Adam, Shampoo, and SOAP as "optimal Frobenius approximations of the Fisher Information Matrix (FIM) under different structural assumptions." Based on this, two design principles—structure selection and low-rank extension—are proposed, leading to two new memory-efficient optimizers: RACS and Alice. In LLaMA pre-training (up to 1.3B), these optimizers converge over 2\(\times\) faster than Adam with almost no additional VRAM overhead.

Towards Understanding the Calibration Benefits of Sharpness-Aware Minimization

This paper theoretically proves that the effectiveness of Sharpness-Aware Minimization (SAM) in mitigating "overconfidence" in deep networks stems from its implicit regularization of the negative entropy of the predictive distribution (equivalent to implicit maximum entropy). Based on this, it proposes an improved version, CSAM, which specifically suppresses overconfident samples and achieves lower calibration errors than SAM and various calibration methods across multiple datasets (including ImageNet-1K).

Trion: FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of LLMs

This paper uses a fixed Discrete Cosine Transform (DCT) orthogonal matrix + dynamic column selection to replace the expensive SVD/QR projections in low-rank optimizers such as GaLore and Dion. By storing only \(r\) integer indices per layer instead of full projection matrices, the authors develop two optimizers, Trion and DCT-AdamW, achieving rank-independent runtime and up to a 25% reduction in memory without sacrificing accuracy.

Unbiased Gradient Estimation for Event Binning via Functional Backpropagation

Addressing the issue of biased gradients when learning directly from raw events due to the discontinuity of binning functions used for aggregating events into frames, this paper proposes Functional Backpropagation (FBP). By lifting the binning function into functional space and utilizing integration by parts, the cotangent function emerges naturally. It is then reconstructed from sampled cotangent vectors to synthesize provably unbiased weak derivatives that approximate long-range finite differences. This approach keeps the forward output unchanged and only modifies backpropagation, consistently benefiting tasks such as egomotion, optical flow, and SLAM.

Unified Analyses for Hierarchical Federated Learning: Topology Selection under Data Heterogeneity

This paper establishes the first unified non-convex convergence framework for four two-tier topologies of Hierarchical Federated Learning (HFL) (Star-Star / Star-Ring / Ring-Star / Ring-Ring). By utilizing a common set of assumptions and an "effective learning rate," it compares their convergence bounds in a single table, derives three actionable topology selection principles, and validates them on CIFAR-10, CINIC-10, Fashion-MNIST, and SST-2.

Unifying Formal Explanations: A Complexity-Theoretic Perspective

Proposes a unified framework reducing sufficient and contrastive reasons (local/global, probabilistic/non-probabilistic) to a minimization problem of a unified probability value function. It reveals that global value functions possess key combinatorial optimization properties like monotonicity, submodularity, and supermodularity, proving that global explanations are computable in polynomial time—even when the corresponding local explanations are NP-hard.

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

LGBO continuously converts LLM semantic preferences about "where is worth trying" into controllable mean shifts of the GP surrogate model. This allows Bayesian optimization (BO) to utilize domain knowledge to accelerate cold starts in scientific discovery tasks without delegating final point selection to potentially unstable LLMs.

Unlocking the Potential of Weighting Methods in Federated Learning through Communication Compression

This paper proposes ADI (Agnostic DIANA), which embeds "DIANA-style differential compression" into the solving of the agnostic weighted federated learning saddle point problem. This makes the two previously incompatible paths—"automatic weighting" and "communication compression"—simultaneously viable for the first time. By using a \(\min_\theta\max_\pi\) weighting framework to combat data heterogeneity while only uploading compressed differential vectors and a scalar loss, it breaks the communication bottleneck. Theoretically, it covers exact gradient, stochastic gradient, and partial participation scenarios; experimentally, it achieves better convergence than pure compression baselines on heterogeneously partitioned CIFAR-10.

ViTSP: Guiding Large-Scale Traveling Salesman Problem Solving with Vision-Language Models

ViTSP visualizes large-scale TSP instances as images and feeds them into a pre-trained VLM, allowing the VLM to "see" and box promising small regions as subproblems. These are then iteratively solved by an exact solver (Concorde) to improve the global solution. On real-world TSPLIB instances with 1k–88k nodes, it achieves an average optimality gap of only 0.24%, surpassing LKH-3 and various learning-based solvers without any task-specific training.

Weight Decay May Matter More Than µP for Learning Rate Transfer in Practice

This paper revisits learning rate transfer in large model training through a unified "relative update" framework. It discovers that the alignment assumptions upon which µP relies quickly fail during actual training. What truly stabilizes cross-width feature learning and enables learning rate transfer throughout most of the training process is independent weight decay. The learning rate scaling of µP essentially serves only as an "implicit learning rate warmup," which can be largely replaced by a stronger explicit warmup.

WSM: Decay-free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

WSM theoretically equates "learning rate decay" with "weighted checkpoint merging"—showing that constant learning rate training followed by merging recent checkpoints with derived weights can simulate any decay curve (cosine, linear, 1-sqrt, etc.). This removes the decay phase from training entirely and consistently outperforms the mainstream WSD scheme on benchmarks like MATH, HumanEval, and MMLU-Pro.