Abstract
Generating high-quality stories spanning thousands of tokens requires competency across a variety of skills, from tracking plot and character arcs to keeping a consistent and engaging style. Because labeled datasets are difficult to source and story quality is hard to measure precisely, most work using large language models (LLMs) for long-form story generation relies on combinations of hand-designed prompting techniques to elicit author-like behavior. This is a manual process that is highly dependent on the specific story-generation task. Motivated by the recent success of applying reinforcement learning (RL) with verifiable rewards to domains like math and coding, we propose a general story-generation task (Next-Chapter Prediction) and a reward formulation (Verified Rewards via Completion Likelihood Improvement) that allow us to use an unlabeled book dataset as a learning signal for reasoning. We learn to reason over a story's condensed information and generate a detailed plan for the next chapter. Our reasoning is evaluated via the chapters it helps a story generator create, and is compared against non-trained and supervised finetuning (SFT) baselines. Pairwise human judgments reveal that the chapters produced with our learned reasoning are preferred across almost all metrics, and that the effect is more pronounced in the sci-fi and fantasy genres.