Abstract
Practitioners working with large language models (LLMs) commonly notice that outputs can vary for the same inputs under settings expected to be deterministic. Yet how pervasive this is, and what impact it has on results, has not to our knowledge been systematically investigated. We investigate non-determinism in five LLMs configured to be deterministic, applied to eight common tasks across 10 runs in both zero-shot and few-shot settings. We see accuracy variations of up to 15% across naturally occurring runs, with a gap between best possible and worst possible performance of up to 70%. In fact, none of the LLMs consistently delivers repeatable accuracy across all tasks, much less identical output strings. Sharing preliminary results with insiders has revealed that non-determinism is perhaps essential to the efficient use of compute resources via co-mingled data in input buffers, so this issue is not going away anytime soon. To better quantify our observations, we introduce two metrics focused on determinism: TARr@N, the total agreement rate at N runs over raw output, and TARa@N, the total agreement rate at N runs over parsed-out answers. Our code and data are publicly available at https://github.com/breckbaldwin/llm-stability.
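To make the two metrics concrete, the following is a minimal sketch, not the code from the repository. It assumes that "total agreement" on an item means all N runs produced the same string for that item, and that the rate is the fraction of items with total agreement. The function names, the `parse_answer` helper, and the toy data are hypothetical and serve only as illustration.

```python
from typing import Callable, Sequence


def total_agreement_rate(runs: Sequence[Sequence[str]]) -> float:
    """Fraction of items on which every run produced the same string.

    runs[i][j] is the output of run i on item j (assumed layout).
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run and one item")
    n_items = len(runs[0])
    if any(len(run) != n_items for run in runs):
        raise ValueError("every run must cover the same items")
    # zip(*runs) groups the N outputs for each item together.
    agreeing = sum(1 for outputs in zip(*runs) if len(set(outputs)) == 1)
    return agreeing / n_items


def parse_answer(raw: str) -> str:
    """Hypothetical parser: keep the text after the last 'answer:' marker."""
    marker = "answer:"
    idx = raw.lower().rfind(marker)
    tail = raw[idx + len(marker):] if idx != -1 else raw
    return tail.strip().rstrip(".").upper()


def tar_raw(runs: Sequence[Sequence[str]]) -> float:
    """TARr@N under the assumption above: agreement over raw output strings."""
    return total_agreement_rate(runs)


def tar_parsed(
    runs: Sequence[Sequence[str]],
    parse: Callable[[str], str] = parse_answer,
) -> float:
    """TARa@N under the assumption above: agreement over parsed answers."""
    return total_agreement_rate([[parse(out) for out in run] for run in runs])


# Toy data (invented for illustration): N = 3 runs over 4 items.
example_runs = [
    ["Answer: B", "The answer: A.", "Answer: C", "Answer: D"],
    ["Answer: B", "Answer: A", "Answer: C", "Answer: A"],
    ["Answer: B", "answer: a", "Answer: C.", "Answer: D"],
]

print(f"TARr@3 = {tar_raw(example_runs):.2f}")     # 0.25
print(f"TARa@3 = {tar_parsed(example_runs):.2f}")  # 0.75
```

In the toy data, only the first item has identical raw strings in all three runs, while the first three items agree once answers are parsed out. This illustrates why agreement over parsed answers can be higher than agreement over raw output.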