arXiv

Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Hsun-Yu Kuo,

Wei-Yun Ma

Abstract

Synthetic data augmentation via large language models (LLMs) gives researchers additional training data and can thereby improve performance on downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from real-world data, and this misalignment can lead to poor results when the trained model is applied in practice. We therefore propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diverse LLM-generated data, using only a small amount of real-world data. We empirically assess the effectiveness of our method on multiple text classification tasks. The results show that applying our approaches to a BERT-level model robustly outperforms standard cross-entropy and other data weighting approaches, offering a potential solution for effectively leveraging synthetic data from any suitable data generator for model training.
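
To make the idea of a weighted loss concrete, the sketch below shows a generic per-example weighted cross-entropy in plain Python. It is only an illustration of the general technique: the abstract does not say how the weights are computed, so the `weights` argument and the function names `softmax` and `weighted_cross_entropy` are assumptions made here, not the authors' method.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, labels, weights):
    """Weight-normalized mean of per-example cross-entropy losses.

    `weights` is a hypothetical per-example importance score, e.g. higher
    for synthetic examples judged closer to real-world data. How such
    scores are obtained is outside the scope of this sketch.
    """
    if not (len(batch_logits) == len(labels) == len(weights)):
        raise ValueError("batch_logits, labels, and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    loss = 0.0
    for logits, label, w in zip(batch_logits, labels, weights):
        probs = softmax(logits)
        # Standard cross-entropy for one example, scaled by its weight.
        loss += w * -math.log(probs[label])
    return loss / total_weight
```

With uniform weights this reduces to the standard mean cross-entropy that the abstract uses as a baseline. Setting an example's weight to zero removes its contribution to the loss entirely.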