arXiv

Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization

Zhanda Zhu, Christina Giannoula, Muralidhar Andoorveedu, Qidong Su, Karttikeya Mangalam, Bojian Zheng, Gennady Pekhimenko

Abstract

Various forms of parallelism, such as data, tensor, and pipeline parallelism, along with memory optimizations such as activation checkpointing, redundancy elimination, and offloading, have been proposed to accelerate distributed training of Large Language Models. To find the best combination of these techniques, automatic distributed training systems have been proposed. However, existing systems tune only a subset of these optimizations because they lack overlap awareness, cannot navigate the vast search space, and ignore inter-microbatch imbalance, which leads to sub-optimal performance. To address these shortcomings, we propose Mist, a memory-, overlap-, and imbalance-aware automatic distributed training system that comprehensively co-optimizes all memory footprint reduction techniques alongside parallelism. Mist is based on three key ideas: (1) fine-grained overlap-centric scheduling, which orchestrates optimizations in an overlapped manner; (2) symbolic-based performance analysis, which predicts runtime and memory usage using symbolic expressions for fast tuning; and (3) imbalance-aware hierarchical tuning, which decouples the process into an inter-stage imbalance- and overlap-aware Mixed Integer Linear Programming problem and an intra-stage Dual-Objective Constrained Optimization problem, and connects them through Pareto frontier sampling. Our evaluation shows that Mist achieves an average speedup of 1.28$\times$ (up to 1.73$\times$) over the state-of-the-art manual system Megatron-LM and 1.27$\times$ (up to 2.04$\times$) over the state-of-the-art automatic system Aceso.
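
The abstract mentions Pareto frontier sampling as the link between the intra-stage and inter-stage problems. As a rough illustration of what a Pareto frontier over two objectives looks like, the sketch below filters candidate stage configurations down to those that are not dominated in both runtime and memory. This is not Mist's tuner: the function name `pareto_frontier`, the configuration labels, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (label, runtime_ms, memory_gb); every value here is a made-up example,
# not a measurement from the paper.
Config = Tuple[str, float, float]


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return configs that no other config beats on both runtime and memory.

    Both objectives are minimized. After sorting by runtime (ties broken by
    memory), a config is kept only if it uses strictly less memory than
    every config that is at least as fast.
    """
    ordered = sorted(configs, key=lambda c: (c[1], c[2]))
    frontier: List[Config] = []
    best_memory = float("inf")
    for config in ordered:
        if config[2] < best_memory:
            frontier.append(config)
            best_memory = config[2]
    return frontier


candidates: List[Config] = [
    ("no_checkpointing", 100.0, 70.0),
    ("partial_checkpointing", 115.0, 50.0),
    ("full_checkpointing", 130.0, 35.0),
    ("slow_partial", 120.0, 50.0),   # slower than partial_checkpointing, same memory
    ("poor_offload", 140.0, 55.0),   # slower and larger than partial_checkpointing
]

for label, runtime_ms, memory_gb in pareto_frontier(candidates):
    print(f"{label}: {runtime_ms} ms, {memory_gb} GB")
```

Only the three checkpointing variants survive the filter; the other two are each slower than some alternative without saving any memory. A tuner that works hierarchically could pass such a reduced set of runtime-memory trade-off points per stage to a higher-level solver instead of every raw configuration, which is the general motivation for sampling from a frontier.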