Characterization of GPU TEE Overheads in Distributed Data Parallel ML Training

Abstract

Confidential computing (CC), or computing within trusted execution environments (TEEs), is now the most common approach to secure computing in the cloud. NVIDIA's recent introduction of GPU TEEs allows machine learning (ML) models to be trained without leaking model weights or data to the cloud provider. However, the performance implications of using GPU TEEs for ML training are not well characterized. In this work, we present an in-depth characterization of the performance overhead of running distributed data parallel (DDP) ML training inside GPU TEEs. Our study reveals the performance challenges that DDP training faces in this setting. DDP aggregates gradients from multiple devices using ring all-reduce, a well-known approach that consists of multiple scatter-reduce and all-gather operations. In a GPU TEE, only the GPU package (the GPU and its HBM memory) is trusted, so any data communicated outside the GPU package must be encrypted and authenticated to ensure confidentiality and integrity. Consequently, each phase of ring all-reduce requires encryption and message authentication code (MAC) generation by the sender, and decryption and MAC verification by the receiver. As the number of GPUs participating in DDP increases, the overhead of secure inter-GPU communication during ring all-reduce grows proportionally. In addition, larger models trigger more asynchronous all-reduce operations, which further increases the communication cost. Our results show that with four GPU TEEs, depending on the model being trained, the runtime per training iteration increases by 8x on average and by up to 41.6x compared to DDP training without TEEs.
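
To make the communication pattern described above concrete, the following is a minimal sketch of ring all-reduce in plain Python. It is an illustration only, not the paper's measurement setup and not how NCCL or PyTorch implement the collective. The function name `ring_all_reduce` and the counter `secure_sends` are hypothetical, and the sketch assumes that every chunk transfer between neighboring GPUs corresponds to exactly one encryption plus MAC generation on the sender and one decryption plus MAC verification on the receiver.

```python
def ring_all_reduce(grads):
    """Simulate ring all-reduce over per-device gradient vectors.

    grads: list of N equal-length lists, one per simulated device.
    Returns (reduced, secure_sends), where reduced[d] is device d's final
    vector and secure_sends counts chunk transfers. Assumption: in a GPU TEE
    each transfer costs one encrypt+MAC and one decrypt+verify.
    """
    n = len(grads)
    length = len(grads[0])
    assert length % n == 0, "sketch assumes the vector splits evenly into N chunks"
    size = length // n

    # chunks[d][c] is device d's local copy of chunk c.
    chunks = [[list(g[c * size:(c + 1) * size]) for c in range(n)] for g in grads]
    secure_sends = 0

    # Phase 1: scatter-reduce, N-1 steps.
    # In step s, device d sends chunk (d - s) mod N to its right neighbor,
    # which adds it into its own copy of that chunk.
    for step in range(n - 1):
        # Snapshot all outgoing messages first: sends within a step are concurrent.
        msgs = [(d, (d - step) % n) for d in range(n)]
        payloads = [list(chunks[d][c]) for d, c in msgs]
        for (d, c), payload in zip(msgs, payloads):
            dst = (d + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]
            secure_sends += 1

    # Phase 2: all-gather, N-1 steps.
    # Device d now owns the fully reduced chunk (d + 1) mod N and forwards
    # completed chunks around the ring; receivers overwrite their copies.
    for step in range(n - 1):
        msgs = [(d, (d + 1 - step) % n) for d in range(n)]
        payloads = [list(chunks[d][c]) for d, c in msgs]
        for (d, c), payload in zip(msgs, payloads):
            dst = (d + 1) % n
            chunks[dst][c] = payload
            secure_sends += 1

    reduced = [[x for chunk in dev for x in chunk] for dev in chunks]
    return reduced, secure_sends


if __name__ == "__main__":
    # Four simulated GPUs, each holding an 8-element integer "gradient".
    grads = [[d * 10 + i for i in range(8)] for d in range(4)]
    reduced, sends = ring_all_reduce(grads)
    print(reduced[0], sends)
```

Under these assumptions, each device performs 2(N-1) sends and 2(N-1) receives per all-reduce, so the number of cryptographic operations per GPU grows linearly with the number of GPUs N, and the total across the ring is 2N(N-1). A model that triggers several all-reduce calls per iteration multiplies this count accordingly.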
