arXiv

FaceID-6M: A Large-Scale, Open-Source FaceID Customization Dataset

Shuhe Wang, Xiaoya Li, Jiwei Li, Guoyin Wang, Xiaofei Sun, Bob Zhu, Han Qiu, Mo Yu

Abstract

Because current face identity (FaceID) customization methods are data-driven, all state-of-the-art models rely on large-scale datasets containing millions of high-quality text-image pairs for training. However, none of these datasets is publicly available, which restricts transparency and hinders further progress in the field. To address this issue, we collect and release FaceID-6M, the first large-scale, open-source FaceID dataset, containing 6 million high-quality text-image pairs. Filtered from LAION-5B (Schuhmann et al., 2022), FaceID-6M undergoes rigorous image and text filtering to ensure dataset quality: resolution filtering to maintain high-quality images and faces, face filtering to remove images that lack human faces, and a keyword-based strategy to retain descriptions containing human-related terms (e.g., nationalities, professions, and names). Through these cleaning processes, FaceID-6M provides a high-quality dataset optimized for training powerful FaceID customization models, offering an open resource for research and development in the field. We conduct extensive experiments to show the effectiveness of FaceID-6M, demonstrating that models trained on it achieve performance comparable to, and slightly better than, currently available industrial models. To support and advance research in the FaceID customization community, we make our code, datasets, and models fully publicly available at https://github.com/ShuheSH/FaceID-6M.
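
The three filtering stages named in the abstract (resolution, face, and keyword filtering) can be illustrated with a minimal Python sketch. This is not the authors' released pipeline: the `Sample` record, the thresholds `MIN_IMAGE_SIDE` and `MIN_FACE_SIDE`, the `HUMAN_KEYWORDS` list, and the `detect_faces` callable are all hypothetical names and values chosen for illustration, and the abstract does not state the actual thresholds, keyword lists, or face detector used.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

# (x, y, width, height) of a detected face, in pixels.
FaceBox = Tuple[int, int, int, int]


@dataclass
class Sample:
    """One text-image pair; image_ref is an opaque path or URL."""
    caption: str
    width: int
    height: int
    image_ref: str


# Hypothetical values, for illustration only (not taken from the paper).
MIN_IMAGE_SIDE = 512
MIN_FACE_SIDE = 128
HUMAN_KEYWORDS = ("actor", "actress", "singer", "president",
                  "american", "woman", "man")

# Whole-word, case-insensitive match against the keyword list.
_KEYWORD_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, HUMAN_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def passes_resolution(sample: Sample, min_side: int = MIN_IMAGE_SIDE) -> bool:
    """Resolution filtering: drop images whose shorter side is too small."""
    return min(sample.width, sample.height) >= min_side


def passes_keywords(caption: str) -> bool:
    """Keyword filtering: keep captions that mention a human-related term."""
    return _KEYWORD_RE.search(caption) is not None


def passes_face(faces: List[FaceBox], min_face_side: int = MIN_FACE_SIDE) -> bool:
    """Face filtering: require at least one sufficiently large face."""
    return any(min(w, h) >= min_face_side for (_, _, w, h) in faces)


def filter_samples(
    samples: Iterable[Sample],
    detect_faces: Callable[[Sample], List[FaceBox]],
) -> Iterator[Sample]:
    """Yield samples that pass all three filters.

    The metadata and text checks run first so that the (assumed expensive)
    face detector is only called on samples that survive them.
    """
    for sample in samples:
        if not passes_resolution(sample):
            continue
        if not passes_keywords(sample.caption):
            continue
        if not passes_face(detect_faces(sample)):
            continue
        yield sample
```

In this sketch the face detector is injected as a plain callable, so any detector that returns bounding boxes could be plugged in without changing the filtering logic. Ordering the checks from cheapest to most expensive is a design choice of the sketch, not something the abstract specifies.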