关于分布式内存外的大规模子集选择问题：基于成对次模函数的研究

Research

arXiv

On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions

摘要 Abstract

现代数据集包含数十亿个样本，使得在全部可用数据上进行训练变得不可行。选择高质量的子集有助于降低训练成本并提升模型质量。次模性是一种离散凸性的类比，常用于解决此类子集选择问题。然而，现有的优化次模函数的算法是顺序执行的，并且现有的分布式方法需要至少一台中心机器将目标子集存储在DRAM中。在十亿量级的数据点规模下，即使子集也可能无法容纳单台机器，而顺序算法运行速度极慢。本文通过提出一种新型的分布式边界算法，放松了对中心机器存储目标子集的要求，并提供了可证明的近似保证。该算法通过迭代计算最小和最大效用值来选择高质量点并丢弃不重要的点。当边界计算未能找到完整子集时，我们采用多轮分区的分布式贪婪算法识别剩余的子集。我们讨论了如何在分布式数据处理框架中实现这些算法，并对不同配置进行了实证分析。我们在CIFAR-100和ImageNet数据集上找到了高质量的子集，与集中式方法相比，质量损失可以忽略不计，并且能够扩展到包含130亿个点的数据集。

Modern datasets span billions of samples, making training on all available data infeasible. Selecting a high quality subset helps in reducing training costs and enhancing model quality. Submodularity, a discrete analogue of convexity, is commonly used for solving such subset selection problems. However, existing algorithms for optimizing submodular functions are sequential, and the prior distributed methods require at least one central machine to fit the target subset in DRAM. At billion datapoint scale, even the subset may not fit a single machine, and the sequential algorithms are prohibitively slow. In this paper, we relax the requirement of having a central machine for the target subset by proposing a novel distributed bounding algorithm with provable approximation guarantees. The algorithm iteratively bounds the minimum and maximum utility values to select high quality points and discard the unimportant ones. When bounding does not find the complete subset, we use a multi-round, partition-based distributed greedy algorithm to identify the remaining subset. We discuss how to implement these algorithms in a distributed data processing framework and empirically analyze different configurations. We find high quality subsets on CIFAR-100 and ImageNet with marginal or no loss in quality compared to centralized methods, and scale to a dataset with 13 billion points.