Learning by ironing out loss landscape wrinkles

Rolling Ball Optimizer: Learning by Ironing Out Loss Landscape Wrinkles, by Muhammad Jamal al-Din Balghamari and 4 other authors

Abstract: Training large neural networks (NNs) requires optimizing high-dimensional, data-driven loss functions. The optimization landscape of these functions is often highly complex and rugged, even fractal-like, with many spurious local minima, ill-conditioned valleys, degenerate points, and saddle points. Complicating matters further, these landscape properties are a function of the data, which means that noise in the training data can propagate forward and give rise to unrepresentative small-scale geometry. This poses a difficulty for gradient-based optimization methods, which rely on local geometry to compute updates and are therefore vulnerable to being led astray by noisy data. In practice, this translates into a strong dependence of the optimization dynamics on the noise in the data, i.e. poor generalization performance. To address this problem, we propose a new optimizer, the Rolling Ball Optimizer (RBO), which breaks this spatial locality by incorporating information from a larger region of the loss landscape into its updates. We achieve this by simulating the motion of a rigid ball of finite radius rolling over the loss landscape, a straightforward generalization of gradient descent (GD) to which it reduces in the infinitesimal-radius limit. The radius acts as a hyperparameter that sets the scale at which RBO views the loss landscape, allowing control over the granularity of its interaction with it. We are motivated by the intuition that the large-scale geometry of the loss landscape is less specific to the data than its fine structure, and is easier to optimize. We support this intuition by demonstrating that our algorithm has a smoothing effect on the loss function. Evaluation against SGD, SAM, and Entropy-SGD on MNIST and CIFAR-10/100 shows promising results in terms of convergence speed, training accuracy, and generalization performance.
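
The abstract does not spell out the update rule, so the sketch below is only an illustration of the rolling-ball intuition, not the authors' RBO algorithm: it approximates a "radius-R view" of the landscape by averaging gradients sampled inside a ball of that radius around the current point, and it collapses to plain gradient descent as the radius shrinks to zero. The function name, the sampling scheme, and the toy loss are all assumptions made for illustration.

```python
import numpy as np

def rolling_ball_step(loss_grad, w, radius, lr, n_samples=32, rng=None):
    """Hypothetical sketch of a 'finite-radius' update (not the paper's RBO rule):
    average gradients sampled inside a ball of the given radius around w, so the
    step reflects the landscape at that scale rather than the pointwise gradient.
    As radius -> 0 this collapses to plain gradient descent."""
    rng = np.random.default_rng() if rng is None else rng
    grads = []
    for _ in range(n_samples):
        u = rng.normal(size=w.shape)                # random direction
        u *= radius * rng.uniform() ** (1.0 / w.size) / np.linalg.norm(u)
        grads.append(loss_grad(w + u))              # slope at a nearby point
    return w - lr * np.mean(grads, axis=0)          # descend the scale-R slope

# Toy usage: a broad parabola plus fine ripples, i.e. a 'wrinkled' 1-D loss
#   L(w) = w^2 + 0.1 * sin(50 w)   =>   dL/dw = 2 w + 5 cos(50 w)
loss_grad = lambda w: 2.0 * w + 5.0 * np.cos(50.0 * w)
w = np.array([2.0])
for _ in range(300):
    w = rolling_ball_step(loss_grad, w, radius=0.2, lr=0.05)
print(w)  # hovers near the broad minimum at 0 rather than in a ripple
```

Averaging sampled gradients is only one cheap proxy for reading the landscape at scale R; the rigid-ball dynamics described in the abstract interact with the surface geometrically, so this snippet should be read as an intuition aid rather than a reference implementation.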

Submission history

By: Muhammad Balghamri
[v1] Monday, 26 May 2025, 05:26:21 UTC (1,411 KB)
[v2] Sunday, 12 October 2025, 12:30:19 UTC (1,603 KB)
[v3] Friday, 24 October 2025, 04:55:44 UTC (1,603 KB)
