Speaker
Description
Multi-head self-attention is a fundamental building block of the transformer architecture, underlying large language models and much of modern generative AI. However, some aspects of the self-attention function space remain poorly understood. In particular, its parameterization is non-unique: continuous families of distinct weight matrices can induce the same input–output map. This talk studies the self-attention function space and analyzes this non-uniqueness.
We study two common variants of self-attention. In linear attention, the attention map is a matrix of polynomials that are cubic in the inputs and trilinear in the parameters, making it natural to use tools from algebraic geometry to describe the generic fibers of the parameterization. In softmax attention, the normalization breaks symmetries, leading to a different notion of generic identifiability, which we study using complex analysis.
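As a rough illustration of the non-uniqueness described above, the following NumPy sketch compares a single-head linear attention map with its softmax counterpart. Everything in it is an assumption made for illustration and is not taken from the talk: the single-head setup, the function names `linear_attention` and `softmax_attention`, the shapes `n`, `d`, `k`, and the two reparameterizations chosen (an invertible change of basis on the query/key pair, and a rescaling traded between the query and value weights). The talk's own analysis may consider different or more general symmetries.

```python
import numpy as np

def linear_attention(X, Wq, Wk, Wv):
    # Unnormalized scores: the output is cubic in X and trilinear in (Wq, Wk, Wv).
    return (X @ Wq @ Wk.T @ X.T) @ (X @ Wv)

def softmax_attention(X, Wq, Wk, Wv):
    scores = X @ Wq @ Wk.T @ X.T
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ (X @ Wv)

rng = np.random.default_rng(0)
n, d, k = 5, 4, 3  # tokens, embedding dim, head dim (illustrative values)
X = rng.normal(size=(n, d))
Wq = rng.normal(size=(d, k))
Wk = rng.normal(size=(d, k))
Wv = rng.normal(size=(d, k))

# Reparameterization 1: an invertible A acting on the head dimension.
# A is unit upper triangular, so it is always invertible.
A = np.eye(k) + np.triu(rng.normal(size=(k, k)), 1)
Wq_a = Wq @ A
Wk_a = Wk @ np.linalg.inv(A).T  # (Wq A)(Wk A^{-T})^T = Wq Wk^T

# Reparameterization 2: scale the queries by c and the values by 1/c.
c = 2.0
Wq_c, Wv_c = c * Wq, Wv / c

lin_basis_same = np.allclose(linear_attention(X, Wq, Wk, Wv),
                             linear_attention(X, Wq_a, Wk_a, Wv))
soft_basis_same = np.allclose(softmax_attention(X, Wq, Wk, Wv),
                              softmax_attention(X, Wq_a, Wk_a, Wv))
lin_scale_same = np.allclose(linear_attention(X, Wq, Wk, Wv),
                             linear_attention(X, Wq_c, Wk, Wv_c))
soft_scale_same = np.allclose(softmax_attention(X, Wq, Wk, Wv),
                              softmax_attention(X, Wq_c, Wk, Wv_c))

print(lin_basis_same, soft_basis_same, lin_scale_same, soft_scale_same)
```

In this sketch, the change of basis leaves the product `Wq @ Wk.T` unchanged, so both variants compute the same map. The rescaling cancels in the linear variant because the output is multilinear in the weights, but it generically changes the softmax output, which is one simple way a normalization can remove a symmetry.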