arXiv

The Files are in the Computer: Copyright, Memorization, and Generative AI

Abstract

The New York Times's copyright lawsuit against OpenAI and Microsoft alleges that OpenAI's GPT models have "memorized" NYT articles. Other lawsuits make similar claims. But parties, courts, and scholars disagree on what memorization is, whether it is taking place, and what its copyright implications are. These debates are clouded by ambiguities over the nature of "memorization." We attempt to bring clarity to the conversation. We draw on the technical literature to provide a firm foundation for legal discussions, and we offer a precise definition of memorization: a model has "memorized" a piece of training data when (1) it is possible to reconstruct from the model (2) a near-exact copy of (3) a substantial portion of (4) that piece of training data. We distinguish memorization from "extraction" (a user intentionally causes a model to generate a near-exact copy), from "regurgitation" (a model generates a near-exact copy, regardless of the user's intentions), and from "reconstruction" (the near-exact copy can be obtained from the model by any means). Several consequences follow. (1) Not all learning is memorization. (2) Memorization occurs when a model is trained; regurgitation is a symptom of memorization, not its cause. (3) A model that has memorized training data is a "copy" of that training data in the sense used by copyright law. (4) A model is not like a VCR or other general-purpose copying technology; it is better at generating some types of outputs (possibly regurgitated ones) than others. (5) Memorization is not a phenomenon caused by "adversarial" users bent on extraction; it is latent in the model itself. (6) The amount of training data that a model memorizes is a consequence of choices made in training. (7) Whether or not a model that has memorized training data actually regurgitates it depends on overall system design. In a very real sense, memorized training data is in the model. To quote Zoolander, the files are in the computer.
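To make the terms of the definition concrete, the following is a minimal sketch, not the authors' method, of how one might test whether a single model output contains a near-exact copy of a substantial portion of a known training document. It uses the longest verbatim run of whitespace-separated tokens as a crude stand-in for "near-exact," and the function names and the `min_tokens` and `min_fraction` thresholds are illustrative assumptions rather than values from the paper. A positive result is evidence of regurgitation or extraction in that one output; a negative result says nothing about whether the text could be reconstructed from the model by other means.

```python
from difflib import SequenceMatcher


def longest_shared_run(training_text: str, output_text: str) -> int:
    """Return the length, in whitespace tokens, of the longest verbatim shared run."""
    train_tokens = training_text.split()
    output_tokens = output_text.split()
    # autojunk=False disables difflib's popularity heuristic, so frequent
    # tokens such as "the" are not silently ignored on long inputs.
    matcher = SequenceMatcher(None, train_tokens, output_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(train_tokens), 0, len(output_tokens))
    return match.size


def looks_regurgitated(
    training_text: str,
    output_text: str,
    min_tokens: int = 50,        # assumed floor on the absolute length of the copy
    min_fraction: float = 0.2,   # assumed floor on the share of the training text
) -> bool:
    """Heuristic check: does the output copy a substantial portion of the training text?"""
    train_len = len(training_text.split())
    if train_len == 0:
        return False
    run = longest_shared_run(training_text, output_text)
    return run >= min_tokens and run / train_len >= min_fraction


if __name__ == "__main__":
    article = " ".join(f"w{i}" for i in range(100))
    output = "intro " + " ".join(f"w{i}" for i in range(20, 80)) + " outro"
    print(longest_shared_run(article, output))
    print(looks_regurgitated(article, output))
```

Both thresholds are policy choices: what counts as "substantial" is exactly the kind of question the definition leaves to be argued, so a real analysis would justify them rather than hard-code them, and would likely tolerate small edits instead of requiring exact token matches.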