A Performance Analysis of Task Scheduling for UQ Workflows on HPC Systems

Abstract

Uncertainty Quantification (UQ) workloads are becoming increasingly common in science and engineering. They involve submitting thousands or even millions of similar tasks with potentially unpredictable runtimes, and the total number of tasks is usually not known a priori. A static one-size-fits-all batch script would likely lead to suboptimal scheduling, and native schedulers installed on High Performance Computing (HPC) systems, such as SLURM, often struggle to handle such workloads efficiently. In this paper, we introduce a new load balancing approach suitable for UQ workflows. To demonstrate its efficiency in a real-world setting, we focus on the GS2 gyrokinetic plasma turbulence simulator. Individual simulations can be computationally demanding, with runtimes varying significantly, from minutes to hours, depending on the high-dimensional input parameters. Our approach uses UQ and Modelling Bridge, which offers a language-agnostic interface to a simulation model, combined with HyperQueue, which works alongside the native scheduler. In particular, deploying this framework on HPC systems does not require system-level changes. We benchmark our proposed framework against a standalone SLURM approach using GS2 and a Gaussian Process surrogate thereof. Our results demonstrate a reduction in scheduling overhead of up to three orders of magnitude and a reduction of up to 38% in CPU time for long-running simulations compared to the naive SLURM approach, while making no assumptions about the job submission patterns inherent to UQ workflows.
