Rolling Ball Optimizer: Learning by Ironing Out Loss Landscape Wrinkles

Conference: ICLR 2026 · arXiv: 2505.19527 · Area: Optimization · Keywords: optimizer, loss landscape, rolling ball, smoothing effect, generalization

TL;DR

This paper proposes the Rolling Ball Optimizer (RBO), which breaks the spatial locality of conventional optimizers by simulating the rolling motion of a finite-radius rigid sphere over the loss landscape. The finite radius induces an "ironing" (smoothing) effect on the loss function, and RBO shows faster convergence and stronger generalization than SGD and Entropy-SGD on MNIST and CIFAR-10/100, while remaining competitive with SAM.

Background & Motivation

Training deep learning models fundamentally requires minimizing high-dimensional, data-dependent loss functions whose optimization landscapes are typically highly complex:

  • Numerous spurious local minima (some sharply curved)
  • Ill-conditioned valleys and saddle points
  • Potentially fractal structure
  • Noise in training data propagates into the fine-grained geometry of the landscape

Fundamental limitation of existing optimizers — their "point-mass" nature:

All mainstream optimizers (SGD, Adam, HeavyBall, NAG, etc.) operate as point-mass dynamics on the loss landscape, relying solely on local information (gradients) at the current position. This spatial locality leads to:

  1. Sensitivity to microscopic structure: any arbitrarily small perturbation to the loss function (including data-noise-induced perturbations) affects the update
  2. Insensitivity to macroscopic structure: inability to capture global geometric features of the landscape
  3. Susceptibility to sharp minima, ill-conditioned valleys, and saddle points

Although SAM and Entropy-SGD relax strict spatial locality by probing a neighborhood of the current point, they focus solely on avoiding sharp minima and neglect other geometric properties of the loss landscape.

Method

Core Idea

The conventional point-mass is replaced by a rigid sphere of finite radius \(\rho > 0\) rolling over the loss landscape. The sphere's dynamics respond to landscape features at a scale proportional to \(\rho\), such that:

  • Noise much smaller than \(\rho\) does not affect the sphere's trajectory
  • Sharp minima and ill-conditioned valleys narrower than the sphere cannot "accommodate" it
  • Adjusting \(\rho\) controls the granularity at which the optimizer interacts with the landscape

Algorithm

RBO alternates between two steps:

Step 1: Descent step (analogous to gradient descent)

\[\tilde{c}_{t+1} = c_t - \eta \tau(p_t)\]

where \(p_t\) is the contact point between the sphere and the loss landscape \(\Gamma\), and \(\tau(p)\) is the unit tangent of \(\Gamma\) at \(p\) pointing uphill (the gradient-like direction, so subtracting it moves the center downhill).

Step 2: Constraint projection (restoring the distance invariant)

\[p_{t+1} = \arg\min_{p \in \Gamma} \|p - \tilde{c}_{t+1}\|^2\]

The landscape point nearest to the new center is found and taken as the new contact point; the sphere center is then updated as \(c_{t+1} = p_{t+1} + \rho \nu(p_{t+1})\), where \(\nu(p)\) is the upward unit normal to \(\Gamma\) at \(p\).
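
To make the two steps concrete, here is a minimal 1D sketch of one RBO iteration (my own illustration, not the authors' implementation): the landscape is the graph of a toy loss in the \((\theta, \text{loss})\) plane, the tangent and normal come from a finite-difference gradient, and the Step-2 projection uses brute-force search for clarity (the paper's iterative solver is sketched in the next subsection).

```python
import numpy as np

def f(theta):
    """Toy 1D loss: a smooth valley plus fine-grained 'wrinkles'."""
    return theta**2 + 0.05 * np.sin(50 * theta)

def grad_f(theta, h=1e-5):
    """Central-difference gradient of the toy loss."""
    return (f(theta + h) - f(theta - h)) / (2 * h)

def rbo_step(p, rho, eta):
    """One RBO iteration in the (theta, loss) plane; p is the contact point."""
    g = grad_f(p)
    scale = np.hypot(1.0, g)
    tau = np.sign(g) * np.array([1.0, g]) / scale  # uphill unit tangent at p
    nu = np.array([-g, 1.0]) / scale               # upward unit normal at p
    c = np.array([p, f(p)]) + rho * nu             # center: d(c, Gamma) = rho
    c_tilde = c - eta * tau                        # Step 1: descent
    # Step 2: nearest landscape point to the tentative center
    # (brute-force grid search here, for illustration only).
    thetas = np.linspace(c_tilde[0] - 2 * rho, c_tilde[0] + 2 * rho, 8001)
    d2 = (thetas - c_tilde[0]) ** 2 + (f(thetas) - c_tilde[1]) ** 2
    return thetas[np.argmin(d2)]                   # new contact point

p = 2.0
for _ in range(30):
    p = rbo_step(p, rho=1.0, eta=0.3)
print(f"contact point {p:.4f}, loss {f(p):.4f}")
```

With \(\rho = 1\) the sphere is far wider than the wrinkles (wavelength \(2\pi/50 \approx 0.13\)), so the trajectory tracks the smooth \(\theta^2\) valley rather than the noise.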

Distance Invariant

The core constraint of RBO: the distance from the sphere center \(c_t\) to the loss landscape \(\Gamma\) always equals the radius \(\rho\):

\[\forall t \geq 0, \quad d(c_t, \Gamma) = \inf_{p \in \Gamma} \|p - c_t\| = \rho\]

Iterative Solution of the Projection Step

Writing the tentative center as \(\tilde{c}_{t+1} = (\tilde{\theta}, \tilde{y})\) and the landscape as the graph of the loss \(f\), the constraint projection amounts to minimizing \(\frac{1}{2}\|\theta - \tilde{\theta}\|^2 + \frac{1}{2}(f(\theta) - \tilde{y})^2\), which is solved by gradient descent:

\[\theta^{(k+1)} = \theta^{(k)} - \gamma\left[\theta^{(k)} - \tilde{\theta} + (f(\theta^{(k)}) - \tilde{y})\, \nabla f(\theta^{(k)})\right]\]

This makes the dynamics of RBO dependent on landscape information over a broader region, rather than only at the current contact point.
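
A hedged sketch of this solver in 1D (function names and constants are mine): the bracketed term is exactly the gradient of the squared-distance objective, so the iteration is plain gradient descent, warm-started at \(\tilde{\theta}\).

```python
import numpy as np

def project_iteratively(theta_tilde, y_tilde, f, grad_f,
                        gamma=0.05, n_iters=500, tol=1e-10):
    """Gradient descent on 0.5*(theta - theta_tilde)**2
    + 0.5*(f(theta) - y_tilde)**2; returns the approximate contact point."""
    theta = float(theta_tilde)        # warm start at the tentative center
    for _ in range(n_iters):
        step = theta - theta_tilde + (f(theta) - y_tilde) * grad_f(theta)
        if abs(step) < tol:
            break
        theta -= gamma * step
    return theta

# Example: project the tentative center (0.8, 2.0) onto the parabola y = x^2.
def parabola(t): return t ** 2
def parabola_grad(t): return 2 * t

p = project_iteratively(0.8, 2.0, parabola, parabola_grad)
print(f"contact point ~ ({p:.4f}, {parabola(p):.4f})")
```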

Ironing Property

Weak ironing (Lemma): For any continuous bounded function \(\phi: \mathbb{R}^d \to \mathbb{R}\), as \(\rho \to +\infty\), the offset surface on which the sphere center moves converges to a constant; i.e., the loss landscape is completely "ironed out" at that scale.

Linear ironing (Proposition): For a landscape composed of an affine function \(f\) plus a bounded perturbation \(\phi\), RBO with sufficiently large \(\rho\) behaves approximately as if optimizing on the pure affine landscape.
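
The weak-ironing claim can be checked numerically. The sketch below (my own construction, not from the paper) computes the 1D offset curve traced by the sphere center over a bounded oscillatory \(\phi\), a morphological dilation of its graph by a disk of radius \(\rho\), and prints how the curve's oscillation shrinks as \(\rho\) grows.

```python
import numpy as np

def center_offset(phi, xs, rho):
    """Height of the sphere center over each x: dilation of the graph of
    phi by a disk of radius rho (1D brute force), minus rho so a flat
    landscape yields an identically zero offset."""
    h = np.empty_like(phi)
    for i, x in enumerate(xs):
        near = np.abs(xs - x) <= rho
        h[i] = np.max(phi[near] + np.sqrt(rho**2 - (xs[near] - x)**2))
    return h - rho

xs = np.linspace(-5, 5, 2001)
phi = np.sin(7 * xs) * np.cos(3 * xs)     # bounded "wrinkles"
inner = slice(400, 1601)                  # ignore boundary effects
for rho in (0.5, 2.0, 8.0, 32.0):
    h = center_offset(phi, xs, rho)
    print(f"rho={rho:5.1f}  oscillation of the center curve: "
          f"{h[inner].max() - h[inner].min():.4f}")
```

The oscillation decays toward zero: at a large enough scale the sphere center effectively moves over a flattened (ironed) surface.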

Unreachable Point Theory

If the spectral norm of the Hessian of the loss at a point \(\theta_0\) is \(\sigma = \|\nabla^2 f(\theta_0)\|_2\), then the corresponding landscape point \(p\) is unreachable by RBO whenever \(\rho > 1/\sigma\) (a numerical check follows the list below). This implies:

  • Automatic avoidance of sharp minima: the larger the curvature, the smaller the \(\rho\) needed to bypass the minimum
  • Open-set property of unreachable points: neighborhoods of unreachable points are also unreachable
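
The threshold has a simple two-dimensional picture (my own back-of-the-envelope check, not the paper's proof): a disk of radius \(\rho\) resting in the quadratic valley \(y = \sigma x^2/2\) can touch the valley floor only when \(\rho \le 1/\sigma\), since its lower arc rises like \(x^2/(2\rho)\) and must stay above the parabola.

```python
import numpy as np

def touches_bottom(sigma, rho, n=20001):
    """True if a circle of radius rho centered at (0, rho) stays above
    the parabola y = sigma*x**2/2, i.e. it can rest on the valley floor."""
    xs = np.linspace(-rho, rho, n)
    arc = rho - np.sqrt(np.maximum(rho**2 - xs**2, 0.0))  # lower semicircle
    return bool(np.all(arc >= 0.5 * sigma * xs**2 - 1e-9))

sigma = 4.0   # spectral norm of the Hessian at the minimum
for rho in (0.10, 0.25, 0.30, 1.00):
    print(f"rho={rho:4.2f}  reaches the bottom: {touches_bottom(sigma, rho)}")
# Expected: True for rho <= 1/sigma = 0.25, False above the threshold.
```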

Key Experimental Results

Main Results: Test Set Performance Comparison

Dataset / Model         SGD      Entropy-SGD   SAM      RBO
MNIST / MLP             91.77%   95.22%        97.22%   97.51%
MNIST / ResNet-6        97.59%   98.18%        99.11%   99.07%
MNIST / VGG-9           98.78%   98.57%        99.39%   99.27%
CIFAR-10 / ResNet-8     56.54%   59.16%        69.09%   71.58%
CIFAR-10 / VGG-9        66.04%   65.46%        77.81%   81.87%
CIFAR-100 / ResNet-8    19.28%   28.33%        36.26%   37.11%
CIFAR-100 / VGG-9       29.37%   28.98%        47.17%   50.07%

All values are test accuracy.

Effect of Radius \(\rho\) on Performance (MNIST/MLP, 3 epochs)

\(\rho\) range   Learning rate range   Observation
0.1 – 1.0        0.001 – 1.0           Microscopic regime, close to a point-mass optimizer
1.0 – 5.0        0.01 – 50             Macroscopic regime, best performance
> 10             > 100                 Super-macroscopic regime, potentially unstable

Key Experimental Findings

  1. RBO consistently outperforms SGD and Entropy-SGD in accuracy, and is competitive with SAM (RBO achieves higher accuracy; SAM achieves lower loss values).
  2. RBO converges extremely fast: on CIFAR-10/100, it reaches the final training performance of other optimizers in half the number of epochs.
  3. RBO can stably operate with very large learning rates (e.g., \(\eta = 6\) or even \(\eta = 100\)), whereas SGD is limited to \(\eta = 0.01\).
  4. Performance improves monotonically as \(\rho\) and \(\eta\) increase, until the super-macroscopic instability regime is entered.

Highlights & Insights

  1. Elegant physical intuition: The analogy of replacing a point mass with a rolling rigid sphere is both intuitive and profound — just as a car wheel is insensitive to Planck-scale road roughness, a large sphere naturally ignores fine-grained noise.
  2. Theoretical innovation: The ironing property and unreachable point theory provide rigorous mathematical characterizations of the non-locality of the optimizer.
  3. Extreme learning rate stability: RBO remains stable at \(\eta = 100\), four orders of magnitude above the \(\eta = 0.01\) at which SGD operates in these experiments.
  4. Honest experimental design: The authors explicitly state that no hyperparameter tuning was performed for any experiment, so the reported results do not represent RBO's optimal performance.
  5. Interesting source of inspiration: The algorithm draws on the physical simulation of sphere motion from the open-source game Marble Marcher.

Limitations & Future Work

  1. High computational cost: The constraint projection step requires additional iterative optimization, making the per-step cost substantially higher than SGD.
  2. Incomplete theory: The strong ironing conjecture (generalization to arbitrary continuous functions) remains unproven.
  3. Limited experimental scale: Evaluation is restricted to MNIST and CIFAR-10/100 using relatively small architectures (MLP, ResNet-6/8, VGG-9).
  4. Uncertainty in high dimensions: Approximation errors in the projection step may accumulate with increasing dimensionality; the impact of the curse of dimensionality remains unknown.
  5. Modest validation performance gains: While training performance is excellent, improvements on the validation set are less pronounced than expected.
  6. Short training horizon: only 10-epoch runs are reported; long-horizon training behavior is unknown.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Reconceptualizes optimizer design from a physics simulation perspective; highly creative.
  • Experimental Thoroughness: ⭐⭐⭐ — Results are convincing but limited in scale; large-model/large-dataset validation is absent.
  • Writing Quality: ⭐⭐⭐⭐ — Physical intuition is clearly conveyed, theoretical derivations are rigorous, and figures are well-crafted.
  • Value: ⭐⭐⭐⭐ — Opens a new direction for non-local optimizers, though practical adoption requires addressing the computational cost.