The Greatest Good Benchmark: Measuring LLMs' Alignment with Utilitarian Moral Dilemmas

Abstract

The question of how to make decisions that maximise the well-being of all persons is highly relevant to designing language models that are beneficial to humanity and free from harm. We introduce the Greatest Good Benchmark to evaluate the moral judgments of large language models (LLMs) using utilitarian dilemmas. Our analysis across 15 diverse LLMs reveals consistently encoded moral preferences that diverge from both established moral theories and the moral standards of the lay population. Most LLMs show a marked preference for impartial beneficence and a rejection of instrumental harm. These findings showcase the 'artificial moral compass' of LLMs, offering insights into their moral alignment.

The Greatest Good Benchmark: Measuring LLMs' Alignment with Utilitarian Moral Dilemmas - arXiv