Rolling Ball Optimizer: Learning by Ironing Out Loss Landscape Wrinkles¶
Conference: ICLR 2026 · arXiv: 2505.19527 · Area: Optimization · Keywords: optimizer, loss landscape, rolling ball, smoothing effect, generalization
TL;DR¶
This paper proposes the Rolling Ball Optimizer (RBO), which breaks the spatial locality of conventional optimizers by simulating the rolling motion of a finite-radius rigid sphere over the loss landscape. The rolling dynamics induce an ironing (smoothing) effect on the loss landscape, and the paper demonstrates superior convergence speed and generalization on MNIST and CIFAR-10/100.
Background & Motivation¶
Training deep learning models fundamentally requires minimizing high-dimensional, data-dependent loss functions whose optimization landscapes are typically highly complex:
- Numerous spurious local minima (some sharply curved)
- Ill-conditioned valleys and saddle points
- Potentially fractal structure
- Noise in training data propagates into the fine-grained geometry of the landscape
Fundamental limitation of existing optimizers — their "point-mass" nature:
All mainstream optimizers (SGD, Adam, HeavyBall, NAG, etc.) operate as point-mass dynamics on the loss landscape, relying solely on local information (gradients) at the current position. This spatial locality leads to:
- Sensitivity to microscopic structure: any arbitrarily small perturbation to the loss function (including data-noise-induced perturbations) affects the update
- Insensitivity to macroscopic structure: inability to capture global geometric features of the landscape
- Susceptibility to sharp minima, ill-conditioned valleys, and saddle points
Although SAM and Entropy-SGD relax strict spatial locality, they focus solely on avoiding sharp minima and neglect other geometric properties of the loss landscape.
Method¶
Core Idea¶
The conventional point-mass is replaced by a rigid sphere of finite radius \(\rho > 0\) rolling over the loss landscape. The sphere's dynamics respond to landscape features at a scale proportional to \(\rho\), such that:
- Noise much smaller than \(\rho\) does not affect the sphere's trajectory (see the numerical sketch after this list)
- Sharp minima and ill-conditioned valleys narrower than the sphere cannot "accommodate" it
- Adjusting \(\rho\) controls the granularity at which the optimizer interacts with the landscape
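A quick numerical illustration of the first bullet (our own toy in NumPy, not from the paper; the helper `center_trajectory` and all constants are hypothetical): roll a ball along a 1-D landscape with and without fine-grained noise and compare the center trajectories.

```python
import numpy as np

# Toy illustration (not the paper's code): the trajectory of a large ball is
# nearly unaffected by noise much smaller than its radius rho.
def center_trajectory(xs, ys, rho):
    """Height of the center of a ball of radius rho resting on the curve (xs, ys),
    for each horizontal center position."""
    h = np.empty_like(xs)
    for i, x0 in enumerate(xs):
        near = np.abs(xs - x0) <= rho          # points the ball could touch
        h[i] = np.max(ys[near] + np.sqrt(rho**2 - (xs[near] - x0)**2))
    return h

xs = np.linspace(0, 10, 2001)
smooth = np.sin(xs)                            # macroscopic structure
noisy = smooth + 0.02 * np.sin(40 * xs)        # plus fine-grained "noise"

for rho in (0.05, 2.0):
    diff = center_trajectory(xs, noisy, rho) - center_trajectory(xs, smooth, rho)
    print(rho, np.ptp(diff))  # small rho: the wiggles pass through to the trajectory;
                              # large rho: only a near-constant offset remains
```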
Algorithm¶
RBO alternates between two steps:
Step 1: Descent step (analogous to gradient descent)
The sphere center is advanced along the steepest descent direction of the landscape at the current contact point: \(c_{t+1/2} = c_t + \eta\,\tau(p_t)\), where \(p_t\) is the contact point between the sphere and the loss landscape \(\Gamma\), \(\eta\) is the learning rate, and \(\tau(p)\) is the steepest descent direction of \(\Gamma\) at \(p\).
Step 2: Constraint projection (restoring the distance invariant)
The landscape point nearest to the new center \(c_{t+1/2}\) is found and taken as the new contact point \(p_{t+1}\); the sphere center is then updated as \(c_{t+1} = p_{t+1} + \rho\,\nu(p_{t+1})\), where \(\nu(p)\) is the upward unit normal to \(\Gamma\) at \(p\).
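Below is a minimal NumPy sketch of one RBO iteration under these definitions, for a landscape given as the graph of a loss \(f: \mathbb{R}^d \to \mathbb{R}\). It is our reading of the two-step update; the helper names, the graph parameterization, and the inner-loop hyperparameters are our own, not the paper's.

```python
import numpy as np

# Sketch of one RBO step (our reading of the two-step update, not the paper's code).
# Landscape points are (theta, f(theta)) in R^{d+1}; the ball center c lives in R^{d+1}.

def surface_point(theta, f):
    return np.append(theta, f(theta))

def descent_direction(theta, grad_f):
    """Unit steepest-descent tangent tau(p) of the graph at the contact point."""
    g = grad_f(theta)
    tau = np.append(-g, -g @ g)                 # tangent of the graph along -grad f
    return tau / (np.linalg.norm(tau) + 1e-12)

def upward_normal(theta, grad_f):
    """Upward unit normal nu(p) to the graph of f at theta."""
    g = grad_f(theta)
    nu = np.append(-g, 1.0)
    return nu / np.linalg.norm(nu)

def rbo_step(theta_p, c, f, grad_f, rho, eta, proj_iters=100, proj_lr=0.05):
    # Step 1: descent -- advance the center along the steepest-descent tangent at p_t.
    c = c + eta * descent_direction(theta_p, grad_f)
    # Step 2: constraint projection -- iteratively find the landscape point nearest
    # to the new center, then restore the invariant dist(c, Gamma) = rho.
    theta = theta_p.copy()
    for _ in range(proj_iters):
        r = surface_point(theta, f) - c         # residual from center to surface point
        # gradient of 0.5 * ||(theta, f(theta)) - c||^2 with respect to theta:
        theta = theta - proj_lr * (r[:-1] + r[-1] * grad_f(theta))
    c = surface_point(theta, f) + rho * upward_normal(theta, grad_f)
    return theta, c
```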
Distance Invariant¶
The core constraint of RBO: the distance from the sphere center \(c_t\) to the loss landscape \(\Gamma\) always equals the radius, \(\operatorname{dist}(c_t, \Gamma) = \min_{p \in \Gamma} \|c_t - p\| = \rho\).
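Continuing the sketch above, a brute-force test of the invariant on a 1-D quadratic (our own check; radius, learning rate, and tolerance are arbitrary):

```python
# Continues the rbo_step sketch above: verify dist(c_t, Gamma) = rho after each step.
f = lambda th: 0.5 * th[0] ** 2
grad_f = lambda th: np.array([th[0]])

rho, eta = 0.5, 0.05
theta_p = np.array([2.0])
c = surface_point(theta_p, f) + rho * upward_normal(theta_p, grad_f)

grid = np.linspace(-3, 3, 200001)
graph = np.stack([grid, 0.5 * grid**2], axis=1)   # dense sampling of Gamma
for _ in range(20):
    theta_p, c = rbo_step(theta_p, c, f, grad_f, rho, eta)
    dist = np.min(np.linalg.norm(graph - c, axis=1))
    assert abs(dist - rho) < 1e-3                 # holds up to inner-loop error
```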
Iterative Solution of the Projection Step¶
The constraint projection is solved by iteratively optimizing the nearest-point problem \(p_{t+1} = \operatorname{arg\,min}_{p \in \Gamma} \|p - c_{t+1/2}\|\).
This makes the dynamics of RBO dependent on landscape information over a broader region, rather than only at the current contact point.
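Concretely, for a landscape given as the graph of the loss \(f\) (the parameterization used in the sketch above; the paper may treat \(\Gamma\) more generally), the nearest-point problem reads \(\min_{\theta} \tfrac{1}{2}\big(\|\theta - c_\theta\|^2 + (f(\theta) - c_y)^2\big)\) with \(c = (c_\theta, c_y)\), whose gradient \((\theta - c_\theta) + (f(\theta) - c_y)\,\nabla f(\theta)\) is exactly what the inner loop of the sketch descends. Each inner iteration queries \(f\) and \(\nabla f\) at points away from the current contact point, which is where the non-local information enters.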
Ironing Property¶
Weak ironing (Lemma): For any continuous bounded function \(\phi: \mathbb{R}^d \to \mathbb{R}\), as \(\rho \to +\infty\), the offset manifold containing the sphere center trajectory converges to a constant — i.e., the loss landscape is completely "ironed out."
Linear ironing (Proposition): For a landscape composed of an affine function \(f\) plus a bounded perturbation \(\phi\), RBO with sufficiently large \(\rho\) behaves approximately as if optimizing on the pure affine landscape.
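A back-of-envelope way to see the weak-ironing lemma (our sanity check, not the paper's proof): if \(|\phi| \le M\), then for every \(x\) the center height \(h(x)\) on the offset manifold satisfies \(h(x) - M \le \operatorname{dist}\big((x, h(x)), \Gamma\big) = \rho \le h(x) + M\), so \(h(x) \in [\rho - M, \rho + M]\). The manifold's total variation is thus at most \(2M\) independent of \(\rho\), and relative to the scale \(\rho\) the landscape the sphere center sees flattens as \(\rho \to +\infty\).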
Unreachable Point Theory¶
If the spectral norm of the Hessian at a point \(p = (\theta_0, f(\theta_0))\) on the landscape is \(\sigma = \|\nabla^2 f(\theta_0)\|\), then for \(\rho > 1/\sigma\), \(p\) is unreachable by RBO (a toy 1-D check follows the list below). This implies:
- Automatic avoidance of sharp minima: the larger the curvature, the smaller the \(\rho\) needed to bypass the minimum
- Open-set property of unreachable points: neighborhoods of unreachable points are also unreachable
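A toy 1-D check of this threshold (our own illustration, not the paper's experiment): rest a ball of radius \(\rho\) in the parabola \(f(x) = \tfrac{\sigma}{2}x^2\), whose graph has radius of curvature \(1/\sigma\) at the minimum.

```python
import numpy as np

# Toy check of the unreachability threshold (our illustration, not the paper's code).
# Landscape f(x) = (sigma/2) x^2 has f''(0) = sigma, so the graph's radius of
# curvature at the minimum is 1/sigma; a ball with rho > 1/sigma cannot touch x = 0.
sigma = 4.0                                     # 1/sigma = 0.25
f = lambda x: 0.5 * sigma * x**2

def contact_points(rho, n=400001):
    """Rest a ball of radius rho in the valley; return where it touches the curve."""
    x = np.linspace(-rho, rho, n)
    # Center height at which the curve point above x would lie on the sphere:
    h = f(x) + np.sqrt(np.maximum(rho**2 - x**2, 0.0))
    h_rest = h.max()                            # lowest center with no penetration
    return np.unique(np.round(x[np.isclose(h, h_rest, rtol=0.0, atol=1e-9)], 2) + 0.0)

print(contact_points(0.20))   # rho < 1/sigma -> [0.]           : minimum reachable
print(contact_points(1.00))   # rho > 1/sigma -> [-0.97  0.97]  : ball bridges the valley
```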
Key Experimental Results¶
Main Results: Test Set Performance Comparison¶
| Dataset/Model | SGD (Acc) | Entropy-SGD (Acc) | SAM (Acc) | RBO (Acc) |
|---|---|---|---|---|
| MNIST/MLP | 91.77% | 95.22% | 97.22% | 97.51% |
| MNIST/ResNet-6 | 97.59% | 98.18% | 99.11% | 99.07% |
| MNIST/VGG-9 | 98.78% | 98.57% | 99.39% | 99.27% |
| CIFAR-10/ResNet-8 | 56.54% | 59.16% | 69.09% | 71.58% |
| CIFAR-10/VGG-9 | 66.04% | 65.46% | 77.81% | 81.87% |
| CIFAR-100/ResNet-8 | 19.28% | 28.33% | 36.26% | 37.11% |
| CIFAR-100/VGG-9 | 29.37% | 28.98% | 47.17% | 50.07% |
Effect of Radius \(\rho\) on Performance (MNIST/MLP, 3 epochs)¶
| \(\rho\) range | Learning rate range | Observation |
|---|---|---|
| 0.1 – 1.0 | 0.001 – 1.0 | Microscopic regime, close to point-mass optimizer |
| 1.0 – 5.0 | 0.01 – 50 | Macroscopic regime, best performance |
| > 10 | > 100 | Super-macroscopic regime, potentially unstable |
Key Experimental Findings¶
- RBO consistently outperforms SGD and Entropy-SGD in accuracy, and is competitive with SAM (RBO achieves higher accuracy; SAM achieves lower loss values).
- RBO converges extremely fast: on CIFAR-10/100, it reaches the final training performance of other optimizers in half the number of epochs.
- RBO can stably operate with very large learning rates (e.g., \(\eta = 6\) or even \(\eta = 100\)), whereas SGD is limited to \(\eta = 0.01\).
- Performance improves monotonically as \(\rho\) and \(\eta\) increase, until the super-macroscopic instability regime is entered.
Highlights & Insights¶
- Elegant physical intuition: The analogy of replacing a point mass with a rolling rigid sphere is both intuitive and profound — just as a car wheel is insensitive to Planck-scale road roughness, a large sphere naturally ignores fine-grained noise.
- Theoretical innovation: The ironing property and unreachable point theory provide rigorous mathematical characterizations of the non-locality of the optimizer.
- Extreme learning rate stability: RBO remains stable at \(\eta = 100\), which is inconceivable for conventional optimizers.
- Honest experimental design: The authors explicitly state that no hyperparameter tuning was performed for any experiment, so the reported results do not represent RBO's optimal performance.
- Interesting source of inspiration: The algorithm draws on the physical simulation of sphere motion from the open-source game Marble Marcher.
Limitations & Future Work¶
- High computational cost: The constraint projection step requires additional iterative optimization, making the per-step cost substantially higher than SGD.
- Incomplete theory: The strong ironing conjecture (generalization to arbitrary continuous functions) remains unproven.
- Limited experimental scale: Evaluation is restricted to MNIST and CIFAR-10/100 using relatively small architectures (MLP, ResNet-6/8, VGG-9).
- Uncertainty in high dimensions: Approximation errors in the projection step may accumulate with increasing dimensionality; the impact of the curse of dimensionality remains unknown.
- Modest validation performance gains: While training performance is excellent, improvements on the validation set are less pronounced than expected.
- Only short training runs of 10 epochs are reported; long-horizon training behavior is unknown.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Reconceptualizes optimizer design from a physics simulation perspective; highly creative.
- Experimental Thoroughness: ⭐⭐⭐ — Results are convincing but limited in scale; large-model/large-dataset validation is absent.
- Writing Quality: ⭐⭐⭐⭐ — Physical intuition is clearly conveyed, theoretical derivations are rigorous, and figures are well-crafted.
- Value: ⭐⭐⭐⭐ — Opens a new direction for non-local optimizers, though practical adoption requires addressing the computational cost.