If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs
Abstract
Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, focusing primarily on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets, Hamlet and a synthetic script collection, both rich in narrative structure and character interactions. Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking across both parametric and non-parametric approaches. Experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1 demonstrate that non-parametric methods significantly outperform parametric ones in managing stateful learning. However, all models exhibit catastrophic forgetting as interactions extend, highlighting the need for further advances in lifelong learning.