Evaluating book summaries from internal knowledge in Large Language Models: a cross-model and semantic consistency approach

Abstract

We study the ability of large language models (LLMs) to generate comprehensive and accurate book summaries solely from their internal knowledge, without access to the original text. Using a diverse set of books and multiple LLM architectures, we examine whether these models can synthesize meaningful narratives that align with established human interpretations. Evaluation follows an LLM-as-a-judge paradigm: each AI-generated summary is compared against a high-quality, human-written summary in a cross-model assessment, in which every participating LLM evaluates not only its own outputs but also those produced by the others. This design makes it possible to identify potential biases, such as a tendency of models to favor their own summarization style over that of others. In addition, alignment between the human-written and LLM-generated summaries is quantified with ROUGE and BERTScore, which measure lexical and semantic correspondence, respectively. The results reveal nuanced variations in content representation and stylistic preferences among the models, highlighting both the strengths and the limitations of relying on internal knowledge for summarization tasks. These findings contribute to a deeper understanding of how LLMs internally encode factual information and of the dynamics of cross-model evaluation, with implications for the development of more robust natural language generation systems.
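
To make the two kinds of measurement in the abstract concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a simplified ROUGE-1 (lowercased whitespace tokens, no stemming) and a hypothetical score table `scores[judge][author]` holding the rating each judge model gave to each model's summary. The function names, model names, and numbers are illustrative placeholders, not values from the study.

```python
from collections import Counter
from statistics import mean
from typing import Dict


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1: a simplified ROUGE-1 (no stemming, whitespace tokens)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def self_preference(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """For each judge: score given to its own summary minus the mean score
    it gave to the other models' summaries. Positive values suggest the
    judge favors its own output."""
    bias = {}
    for judge, row in scores.items():
        others = [s for author, s in row.items() if author != judge]
        bias[judge] = row[judge] - mean(others)
    return bias


# Hypothetical ratings on an arbitrary scale (made up for illustration).
example_scores = {
    "model_a": {"model_a": 8.0, "model_b": 6.0, "model_c": 7.0},
    "model_b": {"model_a": 7.0, "model_b": 7.0, "model_c": 7.0},
    "model_c": {"model_a": 6.0, "model_b": 8.0, "model_c": 5.0},
}

if __name__ == "__main__":
    print(self_preference(example_scores))
    print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
```

A single difference per judge is only a rough indicator; a real analysis would aggregate over many books and would use an established ROUGE and BERTScore implementation rather than this toy overlap measure.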