MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness

arXiv

Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun Liang, Xiang Chen

Abstract

With the advances in artificial intelligence, Mixture-of-Experts (MoE) has become the main form of Large Language Models (LLMs), and the demand for compressing such models is increasing. Quantization is an effective compression method that not only reduces model size but also significantly accelerates inference. Existing quantization methods have gradually shifted their focus from parameter scaling to the analysis of data distributions. However, that analysis is designed for dense LLMs and relies on a simple one-model-all-data mapping, which is unsuitable for MoEs. This paper proposes a new quantization framework called MoQa. MoQa decouples the data-model distribution complexity of MoEs across multiple analysis stages, quantitatively revealing the dynamics of sparse data activation, data-parameter mapping, and inter-expert correlations. Based on these analyses, MoQa identifies the significance of particular experts and parameters with optimal data-model distribution awareness, and proposes a series of fine-grained mixed-quantization strategies that adapt to various data activation and expert combination scenarios. Moreover, MoQa discusses the limitations of existing quantization methods and analyzes the impact of each analysis stage, offering novel insights for MoE quantization. Experiments show that MoQa achieves a perplexity decrease of 1.69 to 2.18 in language modeling tasks and an accuracy improvement of 1.58% to 8.91% in zero-shot inference tasks. We believe MoQa will play a role in future MoE construction, optimization, and compression.
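
To make the general idea of expert-aware mixed-precision quantization more concrete, here is a minimal sketch in plain Python. It is not MoQa's algorithm, which the abstract does not specify. It assumes, purely for illustration, that expert significance can be approximated by how often the router activates each expert, and that more significant experts receive a higher bit-width. The names `quantize_uniform`, `assign_bits`, `routing_trace`, and `keep_ratio` are hypothetical, and the weights are toy values.

```python
from collections import Counter


def quantize_uniform(weights, bits):
    """Symmetric uniform quantization; returns the dequantized values."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1      # e.g. 7 positive levels for 4-bit
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


def assign_bits(routing_trace, num_experts, high_bits=8, low_bits=4, keep_ratio=0.5):
    """Assumed heuristic: the most frequently activated experts keep more bits."""
    counts = Counter(routing_trace)   # activation count per expert id
    ranked = sorted(range(num_experts), key=lambda e: counts[e], reverse=True)
    num_high = max(1, int(num_experts * keep_ratio))
    high = set(ranked[:num_high])
    return {e: (high_bits if e in high else low_bits) for e in range(num_experts)}


# Toy MoE layer: four experts, each with a tiny weight vector (made-up values).
experts = {
    0: [0.90, -0.31, 0.05],
    1: [0.12, 0.47, -0.66],
    2: [-0.25, 0.80, 0.33],
    3: [0.07, -0.02, 0.54],
}

# Hypothetical calibration trace: the expert id chosen by the router per token.
routing_trace = [0, 0, 2, 0, 1, 2, 0, 2]

bits = assign_bits(routing_trace, num_experts=len(experts))
quantized = {e: quantize_uniform(w, bits[e]) for e, w in experts.items()}

for e in sorted(experts):
    print(e, bits[e], quantized[e])
```

In this toy trace, experts 0 and 2 are activated most often and are kept at 8 bits, while experts 1 and 3 fall back to 4 bits. A real framework would replace the frequency heuristic with its own significance analysis and would operate on tensors rather than Python lists.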