Description
How does Gradient Descent (GD) incrementally form learned representations to solve tasks? In this talk, we will show that each step of GD is given exactly by the application of a massive tensor-valued linear operator, which we call the Configuration Space Neural Tangent Kernel (NTK). We prove that it can be decomposed into two operators, P and K: the former captures state-to-state dynamical dependencies and the latter captures immediate parameter-to-state dependencies. Based on this, we prove a universal result stating that any weight-based model in deep learning or optimal control can be factored so that K is a simpler Kronecker product matrix (dubbed the Kronecker core factorization). Importantly, the key ingredients of this core are already computed during inference, so they can be used to determine immediately which tasks are more easily learnable by GD for a particular model. We show that this structure implies a bottlenecking of GD dynamics, leading to low-rank dynamical corrections and an implicit bias towards particular tasks. We specialize our analysis to recurrent models (e.g., neural ODEs or RNNs), showing how the factorization constrains the temporal modes of learning. In addition to these theoretical results, we develop a package (kpflow) that uses matrix-free numerical linear algebra for fast analysis of the NTK gradient operator for any model. This package makes it easier to work with linear operators on tensor domains using fast, randomized analysis tools.
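To make "matrix-free" concrete for a Kronecker-structured operator, here is a minimal NumPy sketch. It is not the kpflow API, and the names `kron_matvec`, `A`, and `B` are hypothetical stand-ins chosen for illustration. It relies only on the standard identity (A ⊗ B) vec(X) = vec(A X Bᵀ) under row-major flattening, which lets the operator be applied without ever building the full Kronecker matrix.

```python
import numpy as np


def kron_matvec(A, B, x):
    """Apply (A kron B) to x without forming the Kronecker product.

    A has shape (p, q), B has shape (r, s), x has length q * s.
    With row-major flattening, (A kron B) @ x equals (A @ X @ B.T).ravel(),
    where X = x.reshape(q, s).
    """
    q, s = A.shape[1], B.shape[1]
    X = np.asarray(x).reshape(q, s)
    return (A @ X @ B.T).reshape(-1)


# Illustrative factors only; they do not come from any particular model.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
B = rng.standard_normal((5, 3))
x = rng.standard_normal(4 * 3)

y_free = kron_matvec(A, B, x)    # never builds the 30 x 12 matrix
y_dense = np.kron(A, B) @ x      # dense reference, feasible only at small sizes

print(np.allclose(y_free, y_dense))
```

The dense product stores p·r × q·s entries, while the matrix-free version touches only the two factors, which is what keeps this kind of analysis tractable when the operator is very large.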