Accelerating Sparse MTTKRP for Small Tensor Decomposition on GPU

Abstract

Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the bottleneck kernel of sparse tensor decomposition. During decomposition, spMTTKRP is performed iteratively along every mode of the input tensor. In this work, we propose a mode-specific tensor layout for GPUs that keeps multiple copies of the tensor, each optimized for a specific mode. The proposed layout improves the data locality of external memory accesses and eliminates the need to communicate intermediate values between GPU thread blocks and GPU global memory. We also propose a tensor partitioning scheme that distributes the total computation across the GPU streaming multiprocessors according to the sparsity and dimensions of the input tensor. Our approach achieves geometric mean speedups of 2.4x, 7.9x, and 8.9x in total execution time over three state-of-the-art GPU baselines, respectively.
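
To make the kernel concrete, the following is a minimal CPU reference sketch of spMTTKRP on a tensor stored in coordinate (COO) form. It is not the GPU implementation described above. The names `spmttkrp` and `mode_specific_copies` are hypothetical, and sorting each tensor copy by its output-mode index is only an assumed, simplified stand-in for a mode-specific layout: it groups all nonzeros that update the same output row, so one worker could accumulate that row locally.

```python
def spmttkrp(nonzeros, factors, mode, rank):
    """Reference spMTTKRP along `mode` for a sparse tensor in COO form.

    nonzeros: list of (index_tuple, value) pairs
    factors:  one factor matrix per mode, each a list of rows of length `rank`
    Returns the output matrix for `mode` as a list of rows.
    """
    out = [[0.0] * rank for _ in range(len(factors[mode]))]
    for idx, val in nonzeros:
        row = out[idx[mode]]
        for r in range(rank):
            prod = val
            # Multiply by the matching row entry of every other mode's factor.
            for m, i in enumerate(idx):
                if m != mode:
                    prod *= factors[m][i][r]
            row[r] += prod
    return out


def mode_specific_copies(nonzeros, n_modes):
    """Assumed layout: one copy per mode, sorted by that mode's index."""
    return [sorted(nonzeros, key=lambda nz: nz[0][m]) for m in range(n_modes)]


# Tiny 2 x 2 x 2 example with three nonzeros and rank-2 factors.
nonzeros = [((0, 0, 0), 1.0), ((0, 1, 1), 2.0), ((1, 0, 1), 3.0)]
A = [[1, 0], [0, 1]]
B = [[1, 2], [3, 4]]
C = [[5, 6], [7, 8]]
factors = [A, B, C]

copies = mode_specific_copies(nonzeros, n_modes=3)
# One decomposition iteration runs the kernel once per mode,
# each time reading the copy laid out for that mode.
results = [spmttkrp(copies[m], factors, m, rank=2) for m in range(3)]
```

Because every copy holds the same nonzeros, the result for a given mode does not depend on which copy is read; only the memory access pattern changes, which is the property a mode-specific layout exploits at the cost of extra storage.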