Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models

Abstract

Mixture of Experts (MoE) has become a key architectural paradigm for scaling Large Language Models (LLMs) efficiently: for each input token, only a subset of the parameters is activated. Conventional MoE architectures nevertheless suffer from excessive memory use and communication overhead during training and inference, mainly because of the large number of expert modules. In this paper, we introduce Mixture of Latent Experts (MoLE), a novel parameterization that maps specific experts into a shared latent space. Specifically, each expert operation is decomposed into two components: a shared projection into a lower-dimensional latent space, followed by an expert-specific transformation with significantly reduced parametric complexity. This factorization substantially reduces both parameter count and computational cost. Beyond pretraining with the MoLE architecture, we establish a rigorous mathematical framework for converting pre-trained MoE models into MoLE, characterize sufficient conditions for optimal factorization, and develop a systematic two-phase algorithm for the conversion. Our theoretical analysis shows that MoLE significantly improves computational efficiency across multiple dimensions while preserving the model's representational capacity. Empirical evaluations corroborate these findings: MoLE achieves performance comparable to standard MoE implementations while substantially reducing resource requirements.
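
To make the two-component decomposition concrete, the following minimal sketch compares parameter counts for a single expert layer and runs a toy forward pass. It is an illustration under stated assumptions, not the paper's implementation: it assumes each expert is a single linear map, that the shared projection is a `d_latent x d_in` matrix, and that each expert keeps its own `d_out x d_latent` matrix. All function names and the layer sizes are hypothetical.

```python
def moe_params(n_experts, d_in, d_out):
    """Parameter count when every expert owns a full d_out x d_in matrix."""
    return n_experts * d_in * d_out


def mole_params(n_experts, d_in, d_out, d_latent):
    """Parameter count for the assumed factorized layout: one shared
    d_latent x d_in projection plus one d_out x d_latent matrix per expert."""
    return d_latent * d_in + n_experts * d_out * d_latent


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mole_expert_forward(shared_proj, expert_mats, expert_id, x):
    """Apply the shared projection first, then the selected expert's
    small transformation in the latent space."""
    latent = matvec(shared_proj, x)
    return matvec(expert_mats[expert_id], latent)


# Hypothetical sizes, chosen only for illustration.
N_EXPERTS, D_IN, D_OUT, D_LATENT = 8, 1024, 4096, 256

dense_count = moe_params(N_EXPERTS, D_IN, D_OUT)
latent_count = mole_params(N_EXPERTS, D_IN, D_OUT, D_LATENT)
print(dense_count, latent_count, latent_count / dense_count)

# Toy forward pass: 3-dimensional input, 2-dimensional latent space,
# two experts that each produce a 1-dimensional output.
toy_shared = [[1, 0, 0], [0, 1, 0]]
toy_experts = [[[1, 1]], [[2, 0]]]
print(mole_expert_forward(toy_shared, toy_experts, 0, [3, 4, 5]))
print(mole_expert_forward(toy_shared, toy_experts, 1, [3, 4, 5]))
```

Under these assumptions the shared projection is stored once, so its cost is amortized over all experts, and the per-expert cost scales with `d_latent` rather than `d_in`. The savings therefore depend on how small the latent dimension can be made without hurting quality, which the sketch does not address.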