Speaker
Description
Adaptive gradient optimization algorithms, including Adam, Adagrad, and their variants, have found widespread use in machine learning, signal processing, and many other settings. However, many algorithms in this family are not rotationally equivariant. In this talk we examine how a simple change of basis in either parameter space or data space can drastically impact both the convergence rates and the generalization of these algorithms. We begin by studying reparameterizations in parameter space and describe a data-driven method, proposed in our recent work, which produces a “favorable” basis for adaptive algorithms. Our method is an orthonormal transformation based on the expected gradient outer product (EGOP) matrix. We present theoretical results and empirical evidence that reparameterizations based on the EGOP eigenbasis can improve convergence of adaptive gradient methods, even when the leading EGOP eigenspaces are approximated using randomized numerical linear algebra methods. We show that for a broad class of functions, the sensitivity of adaptive algorithms to the choice of basis is influenced by the decay of the EGOP matrix spectrum. We illustrate the potential impact of EGOP reparameterization by presenting empirical evidence and theoretical arguments that common machine learning tasks with “natural” data exhibit EGOP spectral decay.
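
For orientation, the EGOP of a differentiable objective and the reparameterization it induces can be written as follows. This is a sketch in notation assumed here for illustration rather than taken from the abstract: $f$ is the objective, $\rho$ is a sampling distribution over parameters $\theta$, $V$ is the orthonormal matrix of EGOP eigenvectors, $\Lambda$ is the diagonal matrix of eigenvalues, and $\tilde f$ is the reparameterized objective.

```latex
\[
\mathrm{EGOP}(f)
  = \mathbb{E}_{\theta \sim \rho}\!\left[ \nabla f(\theta)\, \nabla f(\theta)^{\top} \right]
  = V \Lambda V^{\top},
\qquad
\tilde f(\tilde\theta) = f(V \tilde\theta).
\]
```

Because $V$ is orthonormal, plain gradient descent follows the same trajectory in either basis, whereas coordinate-wise adaptive methods such as Adam and Adagrad generally do not.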