Abstract
Dataset distillation has become a popular method for compressing large datasets into smaller, more efficient representations while preserving critical information for model training. Data features are broadly categorized into two types: instance-specific features, which capture unique, fine-grained details of individual examples, and class-general features, which represent shared, broad patterns across a class. However, previous approaches often struggle to balance these features: some focus solely on class-general patterns, neglecting finer instance details, while others prioritize instance-specific features, overlooking the shared characteristics essential for class-level understanding. In this paper, we introduce the Non-Critical Region Refinement Dataset Distillation (NRR-DD) method, which preserves instance-specific details and fine-grained regions in synthetic data while enriching non-critical regions with class-general information. This approach enables models to leverage all pixel information, capturing both feature types and enhancing overall performance. Additionally, we present Distance-Based Representative (DBR) knowledge transfer, which eliminates the need for soft labels in training by relying on the distance between synthetic data predictions and one-hot encoded labels. Experimental results show that NRR-DD achieves state-of-the-art performance on both small- and large-scale datasets. Furthermore, by storing only two distances per instance, our method delivers comparable results across various settings. The code is available at https://github.com/tmtuan1307/NRR-DD.