Abstract
While models for Plain Language Summarization (PLS) have developed significantly, evaluation remains a challenge. PLS lacks a dedicated assessment metric, and the suitability of existing text generation evaluation metrics is unclear because of the unique transformations involved (e.g., adding background explanations, removing jargon). To address these questions, our study introduces a granular meta-evaluation testbed, APPLS, designed to evaluate metrics for PLS. We identify four PLS criteria from previous work (informativeness, simplification, coherence, and faithfulness) and define a set of perturbations corresponding to these criteria that sensitive metrics should be able to detect. We apply these perturbations to extractive hypotheses for two PLS datasets to form our testbed. Using APPLS, we assess the performance of 14 metrics, including automated scores, lexical features, and LLM prompt-based evaluations. Our analysis reveals that while some current metrics show sensitivity to specific criteria, no single method captures all four criteria simultaneously. We therefore recommend that a suite of automated metrics be used to capture PLS quality along all relevant criteria. This work contributes the first meta-evaluation testbed for PLS and a comprehensive evaluation of existing metrics. APPLS and our evaluation code are available at https://github.com/LinguisticAnomalies/APPLS.
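The perturbation idea described above can be illustrated with a minimal sketch. This is not the APPLS implementation: `unigram_f1` is a toy stand-in for a real metric, `delete_sentences` is a hypothetical informativeness perturbation, and the example texts are invented for illustration. The sketch only shows the general pattern of scoring a hypothesis before and after a perturbation and checking whether the score moves.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only (a deliberately crude tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(hypothesis, reference):
    # Toy stand-in metric (assumption): unigram-overlap F1 against a reference.
    hyp, ref = Counter(tokenize(hypothesis)), Counter(tokenize(reference))
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def delete_sentences(text, keep):
    # Hypothetical informativeness perturbation: keep only the first `keep` sentences.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:keep])


def sensitivity(metric, hypothesis, perturbed, reference):
    # Score drop caused by the perturbation; a sensitive metric yields a nonzero value.
    return metric(hypothesis, reference) - metric(perturbed, reference)


# Invented example texts, for illustration only.
reference = "The medicine reduced blood pressure. Side effects were mild. More research is needed."
hypothesis = "The drug lowered blood pressure in adults. Side effects were mild. More studies are needed."
perturbed = delete_sentences(hypothesis, keep=1)

delta = sensitivity(unigram_f1, hypothesis, perturbed, reference)
print(f"score drop after deleting sentences: {delta:.3f}")
```

Under these assumptions, a metric that is sensitive to informativeness should score the perturbed text lower than the original, while an unperturbed copy should produce no change. The same pattern would repeat with a different perturbation function for each of the other criteria.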