When to Restart? Exploring Escalating Restarts on Convergence

Conference: ICLR 2026 · arXiv: 2603.04117 · Area: Optimization · Keywords: learning rate scheduling, adaptive restarts, convergence-aware training, SGD optimization, deep learning training

TL;DR

This paper proposes SGD-ER (SGD with Escalating Restarts), a convergence-aware learning rate scheduling strategy that triggers restarts with linearly escalating learning rates upon detecting training stagnation, enabling the optimizer to escape sharp local minima and explore flatter loss landscape regions. SGD-ER achieves 0.5–4.5% test accuracy improvements on CIFAR-10/100 and TinyImageNet.

Background & Motivation

The learning rate is among the most critical hyperparameters in deep learning training, directly affecting convergence speed, stability, and generalization.

Existing Learning Rate Schedules and Their Limitations

| Scheduler | Strategy | Limitation |
|---|---|---|
| Exponential/Linear Decay | Monotonically decreasing | Cannot escape sharp minima or saddle points |
| Cosine Annealing (SGDR) | Periodic cosine decay + warm restarts | Restart timing is fixed, agnostic to training dynamics |
| Cyclical LR (CLR) | Smooth oscillation within predefined bounds | Fixed boundaries, non-adaptive |
| Warmup-Stable-Decay (WSD) | Three phases: warmup, stable, decay | Tied to a fixed computational budget |

Core Problem: Existing methods apply restarts or adjustments that are predefined or periodic, remaining entirely unaware of actual training dynamics such as stagnation or convergence behavior.

Core Argument: Restarts should be adaptive—triggered by convergence rather than a fixed schedule. When the model reaches a loss plateau, restarting with a larger learning rate can help escape the current local minimum.

Method

SGD-ER Algorithm

The core algorithmic logic proceeds as follows:

  1. Begin training with initial learning rate \(\eta_0\)
  2. Gradually reduce the learning rate using a decay strategy (exponential or linear)
  3. Declare convergence when the validation loss shows no significant improvement within a patience window
  4. Trigger a restart: linearly escalate the learning rate to \(\eta_k = (k+1) \cdot \eta_0\), where \(k\) denotes the restart count
  5. Retain current model parameters and continue training
  6. Termination condition: stop if the best loss after a restart does not improve upon the previous best, or if the maximum number of epochs is reached

Learning Rate Update Rule

The SGD update at restart \(k\):

\[\theta_{t+1} = \theta_t - \eta_k \nabla f(\theta_t), \quad \eta_k = (k+1)\eta_0\]
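
To make the control flow concrete, below is a minimal sketch of the restart loop wrapped around a standard PyTorch SGD optimizer. The helpers `train_one_epoch` and `evaluate`, the improvement threshold, and the per-epoch decay factor are assumptions made for illustration, not values taken from the paper.

```python
import torch

def sgd_er(model, train_one_epoch, evaluate, eta0=0.1, decay=0.99,
           patience=50, max_epochs=2000):
    """Sketch of SGD with Escalating Restarts (SGD-ER).

    train_one_epoch(model, optimizer) and evaluate(model) are assumed
    helpers; evaluate returns the validation loss.
    """
    k = 0                       # restart counter
    eta_k = (k + 1) * eta0      # eta_k = (k+1) * eta_0
    optimizer = torch.optim.SGD(model.parameters(), lr=eta_k)
    best_loss, prev_best, wait = float("inf"), float("inf"), 0

    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)
        val_loss = evaluate(model)

        # Within-cycle decay of the learning rate (exponential here).
        for group in optimizer.param_groups:
            group["lr"] *= decay

        if val_loss < best_loss - 1e-4:     # improvement threshold assumed
            best_loss, wait = val_loss, 0
        else:
            wait += 1

        if wait >= patience:                # plateau detected
            if best_loss >= prev_best:      # no gain since last restart: stop
                break
            prev_best, wait = best_loss, 0
            k += 1
            eta_k = (k + 1) * eta0          # linearly escalated restart LR
            # Parameters are retained; only the learning rate is reset.
            for group in optimizer.param_groups:
                group["lr"] = eta_k
    return model
```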

Theoretical Analysis: Accelerated Saddle Point Escape

Theorem 1: Let \(f\) be an \(L\)-smooth function and \(\theta^*\) a strict saddle point with \(\lambda_{\min}(\nabla^2 f(\theta^*)) = -\gamma < 0\). The number of iterations required to escape the \(\delta\)-neighborhood at restart \(k\) satisfies:

\[T_k \geq \frac{\ln(\delta / |x_0|)}{\ln(1 + \eta_k \gamma)}\]

As \(k \to \infty\), \(T_k \to 0\): a larger learning rate accelerates saddle point escape. This provides a theoretical justification for learning rate escalation.
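
Plugging illustrative numbers into the bound shows how quickly it shrinks with the restart count; all constants below are arbitrary choices for demonstration, not values from the paper.

```python
import math

delta, x0, eta0, gamma = 0.1, 1e-3, 0.1, 0.5   # illustrative values only

for k in range(4):
    eta_k = (k + 1) * eta0
    T_k = math.log(delta / abs(x0)) / math.log(1 + eta_k * gamma)
    print(f"k={k}: eta_k={eta_k:.2f}, escape bound T_k >= {T_k:.1f} iterations")

# The bound drops from roughly 94 iterations at k=0 to about 25 at k=3.
```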

Convergence Detection Criterion

A plateau-based criterion is employed: convergence is signaled when the validation loss exhibits no meaningful decrease within a predefined patience window. This is consistent with early stopping practice.

  • CIFAR-100: patience = 50 epochs
  • CIFAR-10: patience = 30 epochs
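
This check is easy to factor out as a small stateful helper. The sketch below assumes an absolute improvement threshold (`min_delta`), which the paper does not specify, so treat that value as a placeholder.

```python
class PlateauDetector:
    """Signals convergence when the validation loss has not improved
    by at least `min_delta` for `patience` consecutive epochs."""

    def __init__(self, patience=50, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience   # True -> trigger a restart

# Usage: detector = PlateauDetector(patience=50)   # CIFAR-100 setting
#        if detector.step(val_loss): ...trigger a restart...
```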

Key Design Considerations

  • Parameter retention: restarts modify only the learning rate without resetting model parameters, allowing continued exploration from learned representations
  • Linear escalation: the learning rate increases by \(\eta_0\) at each restart, providing a moderate but sustained increase in exploration intensity
  • Dual termination: training stops if no improvement is observed after a restart, preventing unnecessary divergence

Key Experimental Results

Main Results: ResNet-18 Test Accuracy (%)

| Dataset | SGD_exp | SGD_lin | Adam | CosA | CLR | WSDS | Ours_exp | Ours_lin |
|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | 90.86 | 91.93 | 91.34 | 92.59 | 92.15 | 93.05 | 93.83 | 93.83 |
| CIFAR-100 | 68.30 | 71.00 | 67.94 | 71.63 | 70.44 | 72.39 | 74.30 | 74.30 |
| TinyImageNet | 59.09 | 58.35 | 54.53 | 59.46 | 57.53 | 59.28 | 59.71 | 60.79 |

Cross-Architecture Results: CIFAR-100 Test Accuracy (%, Exponential Decay)

| Architecture | SGD_exp | CosA | CLR | WSDS | Ours |
|---|---|---|---|---|---|
| ResNet-34 | 67.75 | 72.17 | 71.04 | 72.36 | 74.24 |
| ResNet-50 | 65.52 | 72.10 | 70.25 | 73.76 | 76.77 |
| VGG-16 | 65.17 | 67.35 | 67.23 | 68.08 | 68.56 |
| DenseNet-121 | 56.10 | 71.20 | 66.61 | 72.45 | 76.76 |

Long-Training Results: CIFAR-100, 2000 Epochs

| SGD_exp | SGD_lin | Adam | CosA | CLR | WSDS | Ours_exp | Ours_lin |
|---|---|---|---|---|---|---|---|
| 68.53 | 62.17 | 71.27 | 72.84 | 72.10 | 73.59 | 74.41 | 74.41 |

Overfitting Analysis (CIFAR-100, Average over 3 Seeds)

| Method | Train Loss | Val Loss | Test Loss | Test Acc (%) |
|---|---|---|---|---|
| CLR | 1.60e-05 | 0.00488 | 0.00496 | 70.65 |
| CosA | 1.75e-05 | 0.00466 | 0.00472 | 72.05 |
| WSDS | 1.64e-05 | 0.00462 | 0.00465 | 72.83 |
| Ours_exp | 2.40e-05 | 0.00434 | 0.00443 | 73.62 |
| Ours_lin | 2.16e-05 | 0.00427 | 0.00435 | 74.61 |

Note: CLR achieves the lowest training loss but the highest test loss—a classic overfitting pattern. SGD-ER exhibits slightly higher training loss yet substantially better generalization.

Key Experimental Findings

  1. SGD-ER achieves the highest test accuracy across all dataset–architecture combinations.
  2. The largest gain is observed on DenseNet-121: 56.10% → 76.76% (+20.66 percentage points); standard SGD nearly fails to train DenseNet effectively.
  3. A brief accuracy drop occurs immediately after each restart, but the model rapidly recovers and surpasses the previous best performance.
  4. Under extended training (2000 epochs), SGD-ER continues to improve while competing methods saturate.
  5. SGD-ER achieves better generalization at higher training loss, indicating convergence to flatter minima.

Highlights & Insights

  1. Simplicity and effectiveness: the method requires only a patience parameter and a linear increment, introduces no additional computational overhead, and can serve as a plug-and-play module for any SGD-based training pipeline.
  2. Convergence-aware vs. fixed-period scheduling: the central principle—"let training dynamics determine when to restart"—is more principled than predefined periodic schedules.
  3. Higher training loss = better generalization: this result directly reflects classical theory on flat vs. sharp minima; the minima found by SGD-ER are wider and generalize better.
  4. Greater benefit for weaker architectures: DenseNet-121 nearly fails under standard SGD, yet SGD-ER restores it to a level competitive with ResNet variants.
  5. Theory–practice alignment: Theorem 1 predicts faster saddle point escape with larger learning rates, which is empirically confirmed by the rapid recovery observed after each restart.

Limitations & Future Work

  1. Methodological simplicity: linear escalation may not be the optimal strategy; a systematic investigation of escalation magnitude and schedule is absent.
  2. Manual patience tuning: different values are required for CIFAR-10 (30) and CIFAR-100 (50), necessitating task-specific adjustment.
  3. Post-restart accuracy fluctuation: each restart induces a performance dip before recovery, resulting in a non-smooth training trajectory.
  4. Evaluation limited to image classification: the method has not been validated on NLP, speech, or other task domains.
  5. Integration with Adam-family optimizers insufficiently explored: the work primarily focuses on SGD; Adam variants are only briefly mentioned in the appendix.
  6. Theoretical analysis restricted to saddle point escape: convergence guarantees for escaping local minima and bounds on convergence rates are not established.

Rating

  • Novelty: ⭐⭐⭐ — The idea is intuitive but not sophisticated; it belongs to the category of "simple yet effective" engineering improvements.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 3 datasets, 5 architectures, and multiple baselines with consistent and significant results.
  • Writing Quality: ⭐⭐⭐⭐ — Figures and tables are clear; Fig. 1's learning rate curve comparison is particularly intuitive.
  • Value: ⭐⭐⭐⭐ — Offers practical engineering value as a plug-and-play module, though theoretical depth is limited.