Speaker
Description
The Canonical Polyadic (CP) tensor decomposition is a well-known method for interpretable analysis of high-dimensional data. Recently, the Generalized CP (GCP) method was introduced by Hong, Kolda, and Duersch (2020) to allow a flexible choice of loss function in the optimization problem defining the CP model, enabling more interpretable decompositions of strongly non-Gaussian data such as count or binary data. Furthermore, Kolda and Hong (2020) introduced a version of GCP that leverages randomization and stochastic optimization to address scalability to large, sparse datasets. In this work, we take these ideas a step further and consider synchronous and asynchronous algorithms for parallel GCP tensor decomposition through the GenTen software package, exploiting both shared- and distributed-memory parallelism. We build on shared-memory parallel CP decomposition algorithms that use Kokkos for portability across CPU and GPU architectures (Phipps and Kolda, 2019), extending them to support the random sampling and stochastic optimization methods required by GCP. We then couple this approach, through MPI, to the well-known medium-grained distributed-memory parallelism scheme developed for traditional CP decompositions (Smith and Karypis, 2016), providing a synchronous, hybrid MPI+Kokkos parallel GCP decomposition capability. Finally, we propose an asynchronous distributed parallelism approach that builds on related techniques from federated learning to achieve even better scalability to large datasets. We study the effectiveness of the proposed synchronous and asynchronous approaches in terms of computational cost and accuracy on synthetic and publicly available real-world datasets of varying sizes, dimensions, and sparsity patterns, using several loss functions.
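To make the core idea concrete, the following is a minimal, self-contained sketch of GCP fitting with a Poisson loss (suited to count data) by stochastic gradient descent on randomly sampled tensor entries, in the spirit of the randomized approach described above. Everything here — function names, the toy tensor, step sizes, and the gradient clipping — is illustrative and is not taken from the GenTen implementation, which is a parallel C++/Kokkos code.

```python
import numpy as np

def cp_entries(A, B, C, i, j, k):
    """Model values m = sum_r A[i,r] * B[j,r] * C[k,r] at sampled entries."""
    return np.sum(A[i] * B[j] * C[k], axis=1)

def poisson_loss(x, m):
    """GCP elementwise Poisson loss f(x, m) = m - x * log(m), summed."""
    return float(np.sum(m - x * np.log(m)))

def gcp_sgd(X, rank, n_iters=3000, batch=64, lr=0.01, seed=0):
    """Fit a rank-`rank` GCP model to a dense 3-way count tensor by SGD."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    # Positive random initialization keeps the Poisson model m > 0.
    A = rng.uniform(0.1, 1.0, (I, rank))
    B = rng.uniform(0.1, 1.0, (J, rank))
    C = rng.uniform(0.1, 1.0, (K, rank))
    for t in range(n_iters):
        # Uniformly sample a mini-batch of tensor entries.
        i = rng.integers(0, I, batch)
        j = rng.integers(0, J, batch)
        k = rng.integers(0, K, batch)
        m = cp_entries(A, B, C, i, j, k)
        x = X[i, j, k]
        # df/dm for the Poisson loss, clipped to damp rare huge values
        # when the model m is still tiny early in the run.
        g = np.clip(1.0 - x / m, -50.0, 50.0)[:, None]
        # Chain rule: dm/dA[i,:] = B[j,:] * C[k,:], and cyclically.
        gA = g * (B[j] * C[k])
        gB = g * (A[i] * C[k])
        gC = g * (A[i] * B[j])
        # Slowly decaying step size stabilizes the late iterations.
        step = lr / (1.0 + t / 1000.0)
        np.add.at(A, i, -step * gA)
        np.add.at(B, j, -step * gB)
        np.add.at(C, k, -step * gC)
        # Project back to a strictly positive model.
        A, B, C = (np.maximum(M, 1e-3) for M in (A, B, C))
    return A, B, C

# Toy problem: counts drawn from a true rank-2 Poisson CP model.
rng = np.random.default_rng(42)
At, Bt, Ct = (rng.uniform(0.5, 1.5, (n, 2)) for n in (6, 5, 4))
M_true = np.einsum('ir,jr,kr->ijk', At, Bt, Ct)
X = rng.poisson(M_true).astype(float)
A, B, C = gcp_sgd(X, rank=2)
M_fit = np.einsum('ir,jr,kr->ijk', A, B, C)
```

The sampled-entry gradient is the key to scalability: each mini-batch touches only `batch` rows of each factor matrix, which is what makes the shared-memory (Kokkos) and distributed-memory (MPI) parallelizations of GCP discussed in the abstract feasible for large sparse tensors.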