Title: Early stopping vs late stopping: Different flavors of SGD
Authors: Nicole Muecke - Institute for Stochastics and Applications (Germany) [presenting]
Abstract: While stochastic gradient descent (SGD) is a workhorse of machine learning, the learning properties of many variants used in practice are hardly known. We consider non-parametric regression with (strongly) convex objectives and contribute to filling this gap, focusing on the effect and interplay of multiple passes, mini-batching, and averaging, in particular tail averaging. An important aspect is choosing the total number of iterations and the step size in a data-driven way, namely via the localized empirical Rademacher complexity. The results show how these different flavors of SGD can be combined to achieve optimal learning errors, while also providing practical insights.
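The abstract combines mini-batching with tail averaging, i.e. averaging only the last fraction of SGD iterates rather than all of them. The following is a minimal sketch of that combination on a least-squares objective (a strongly convex example); the function name, parameter names, and the choice of averaging the last half of the iterates are illustrative assumptions, not the authors' actual procedure or step-size rule.

```python
import numpy as np

def tail_averaged_sgd(X, y, step_size=0.05, batch_size=8,
                      n_iter=2000, tail_frac=0.5, seed=0):
    """Mini-batch SGD on the least-squares loss, returning the average
    of the last `tail_frac` fraction of iterates (tail averaging).
    All names and defaults here are hypothetical illustrations."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    tail_start = int((1 - tail_frac) * n_iter)  # iterates before this are discarded
    w_tail = np.zeros(d)
    for t in range(n_iter):
        # Sample a mini-batch and take a stochastic gradient step.
        idx = rng.integers(0, n, size=batch_size)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        w -= step_size * grad
        # Accumulate only the tail iterates for the final average.
        if t >= tail_start:
            w_tail += w
    return w_tail / (n_iter - tail_start)

# Toy usage: recover a linear model from noiseless data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
w_hat = tail_averaged_sgd(X, X @ w_true)
```

Averaging only the tail discards the early, biased iterates while still reducing the variance of the later ones; choosing the stopping time and step size adaptively, as the abstract proposes, would replace the fixed `n_iter` and `step_size` above.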