arXiv

CFP: Low-overhead Profiling-based Intra-operator Parallelism Generation by Preserving Communication-Free Structures

Weifang Hu, Xuanhua Shi, Chang Wu, Yunkai Zhang, Xuan Peng, Jiaqi Zhai, Hai Jin, Yongluan Zhou, Xuehai Qian

Abstract

This paper introduces CFP, a system that searches intra-operator parallelism configurations by leveraging runtime profiles of actual parallel programs. The key idea is to profile a limited space by identifying a new structure named ParallelBlock, a group of operators with the property of communication-free tensor partition propagation: the partition of its input tensor can propagate through all operators to its output tensor without introducing communication or synchronization. Based on this property, the optimal tensor partition of each operator within a ParallelBlock can be inferred from the partition of the input tensor through partition propagation, which prevents avoidable communication. The search space is thus reduced by profiling each ParallelBlock only with different input tensor partitions at its entry, instead of enumerating all partition combinations among the operators within the ParallelBlock. The search space is further reduced by identifying ParallelBlock sequences (segments) with similar parallel behavior. CFP computes the overall performance of the model from the profiles of all segments. On GPT, LLAMA, and MoE models, CFP achieves up to 1.51x, 1.31x, and 3.43x speedups, respectively, over the state-of-the-art framework Alpa.
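
The propagation property described in the abstract can be illustrated with a toy model. The following is a minimal sketch, not CFP's implementation: it assumes 2-D tensors, describes each operator by a hypothetical "dimension map" from a sharded input dimension to the corresponding output dimension, and treats a missing entry as "sharding this dimension would require communication". The names `propagate`, `search_space_sizes`, and the operator labels are all invented for this illustration.

```python
# Toy illustration only; all names here are hypothetical, not CFP's API.
# Each operator is (name, dim_map), where dim_map maps a sharded input
# dimension to the output dimension that ends up sharded. A dimension
# missing from dim_map cannot be sharded without communication.

def propagate(ops, in_dim):
    """Propagate a sharded input dimension through a chain of operators.

    Returns a list of (op_name, sharded_output_dim) pairs, or None if
    some operator would need communication for this partition.
    """
    result = []
    cur = in_dim
    for name, dim_map in ops:
        cur = dim_map.get(cur)
        if cur is None:
            return None  # propagation breaks: communication required
        result.append((name, cur))
    return result


def search_space_sizes(ops, input_dims):
    """Compare exhaustive enumeration with entry-only enumeration.

    Exhaustive: every operator independently picks one of the dims.
    Entry-only: pick a partition for the input tensor, infer the rest.
    """
    exhaustive = len(input_dims) ** len(ops)
    entry_only = sum(1 for d in input_dims if propagate(ops, d) is not None)
    return exhaustive, entry_only


# Elementwise ops keep the sharded dim; transpose swaps dims 0 and 1.
block = [
    ("bias_add", {0: 0, 1: 1}),
    ("gelu", {0: 0, 1: 1}),
    ("transpose", {0: 1, 1: 0}),
]

# A sum over dim 0 keeps old dim 1 (as new dim 0); sharding dim 0 of its
# input would need an all-reduce, so dim 0 is absent from the map.
with_reduce = block + [("sum_over_dim0", {1: 0})]

print(propagate(block, 0))        # partitions inferred from the entry
print(propagate(with_reduce, 1))  # None: the reduction needs communication
print(search_space_sizes(block, [0, 1]))
```

In this toy chain, choosing the partition of the input tensor determines the partition of every later operator, so two entry choices stand in for the eight per-operator combinations that an exhaustive enumeration would consider. Appending the reduction shows where propagation stops being communication-free, which is the kind of boundary that would separate one group of operators from the next.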