Analyzing Modern NVIDIA GPU Cores

arXiv

Rodrigo Huerta, Mojtaba Abaie Shoushtary, José-Lorenzo Cruz, Antonio González

Abstract

GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and scientific simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures that are more than 15 years old. This paper reverse engineers modern NVIDIA GPU cores, unveiling many key aspects of their design and explaining how GPUs leverage hardware-compiler techniques in which the compiler guides the hardware during execution. In particular, it reveals how the issue logic works, including the policy of the issue scheduler, the structure of the register file and its associated cache, and multiple features of the memory pipeline. Moreover, it analyzes how a simple instruction prefetcher based on a stream buffer fits well with modern NVIDIA GPUs and is likely to be used. Furthermore, we investigate the impact of the register file cache and the number of register file read ports on both simulation accuracy and performance. By modeling all these newly discovered microarchitectural details, we achieve 18.24% lower mean absolute percentage error (MAPE) in execution cycles than previous state-of-the-art simulators, resulting in an average MAPE of 13.98% with respect to real hardware (NVIDIA RTX A6000). We also demonstrate that this new model holds for other NVIDIA architectures, such as Turing. Finally, we show that the software-based dependence management mechanism included in modern NVIDIA GPUs outperforms a hardware mechanism based on scoreboards in terms of performance and area.
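For readers unfamiliar with the accuracy metric, the following is a minimal sketch of how MAPE over execution cycles could be computed. It assumes the usual definition (the mean over benchmarks of |simulated − measured| / measured, expressed as a percentage). The function name `mape` and the cycle counts are hypothetical illustrations, not values or code from the paper.

```python
def mape(simulated, measured):
    """Mean absolute percentage error of simulated vs. measured cycle counts.

    Assumes the standard definition; each measured value must be non-zero.
    """
    if len(simulated) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    # Relative error of each benchmark with respect to the hardware measurement.
    errors = [abs(s - m) / m for s, m in zip(simulated, measured)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical cycle counts for three kernels (illustrative only).
measured_cycles = [1000, 2000, 4000]   # e.g. counts read from real hardware
simulated_cycles = [1100, 1800, 4000]  # e.g. counts reported by a simulator

print(f"MAPE = {mape(simulated_cycles, measured_cycles):.2f}%")  # MAPE = 6.67%
```

Because each error is normalized by the measured value, short and long kernels contribute equally to the average, so a large relative error on a small kernel is not masked by long-running ones.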