arXiv

Direct Post-Training Preference Alignment for Multi-Agent Motion Generation Models Using Implicit Feedback from Pre-training Demonstrations

Ran Tian, Kratarth Goel

Abstract

Recent advancements in LLMs have revolutionized motion generation models in embodied applications. While LLM-type auto-regressive motion generation models benefit from training scalability, there remains a discrepancy between their token prediction objectives and human preferences. As a result, models pre-trained solely with token-prediction objectives often generate behaviors that deviate from what humans would prefer, making post-training preference alignment crucial for producing human-preferred motions. Unfortunately, post-training alignment requires extensive preference rankings of motions generated by the pre-trained model, which are costly to annotate, especially in multi-agent settings. Recently, there has been growing interest in leveraging pre-training demonstrations to scalably generate preference data for post-training alignment. However, these methods often adopt an adversarial assumption, treating all pre-trained model-generated samples as unpreferred examples. This adversarial approach overlooks the valuable signal provided by preference rankings among the model's own generations, ultimately reducing alignment effectiveness and potentially leading to misaligned behaviors. In this work, instead of treating all generated samples as equally bad, we leverage implicit preferences encoded in pre-training demonstrations to construct preference rankings among the pre-trained model's generations, offering more nuanced preference alignment guidance with zero human cost. We apply our approach to large-scale traffic simulation and demonstrate its effectiveness in improving the realism of the pre-trained model's generated behaviors. Relying solely on implicit feedback from pre-training demonstrations, without additional post-training human preference annotations or high computational costs, it makes a lightweight 1M motion generation model comparable to SOTA large imitation-based models.
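The core idea, ranking the model's own generations against a demonstration instead of labeling all of them as unpreferred, can be illustrated with a minimal sketch. The abstract does not state which measure is used to compare a generation with a demonstration, so the average displacement used here is purely an assumption, and the names `avg_displacement`, `build_preference_pairs`, and `margin` are hypothetical.

```python
from itertools import combinations
from math import hypot


def avg_displacement(sample, demo):
    """Mean Euclidean distance between matched waypoints over all agents.

    ASSUMPTION: this distance is an illustrative stand-in, not the measure
    used in the paper. `sample` and `demo` are lists (one entry per agent)
    of equally long lists of (x, y) waypoints.
    """
    total, count = 0.0, 0
    for agent_sample, agent_demo in zip(sample, demo):
        for (xs, ys), (xd, yd) in zip(agent_sample, agent_demo):
            total += hypot(xs - xd, ys - yd)
            count += 1
    return total / count


def build_preference_pairs(samples, demo, margin=0.0):
    """Return (preferred_idx, unpreferred_idx) pairs among generated samples.

    A sample closer to the demonstration is treated as preferred. Pairs whose
    distances differ by no more than `margin` are skipped as uninformative.
    """
    dists = [avg_displacement(s, demo) for s in samples]
    pairs = []
    for i, j in combinations(range(len(samples)), 2):
        if abs(dists[i] - dists[j]) <= margin:
            continue
        pairs.append((i, j) if dists[i] < dists[j] else (j, i))
    return pairs


# One agent, three waypoints; three hypothetical model generations.
demo = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]]
samples = [
    [[(0.0, 0.0), (1.0, 0.1), (2.0, 0.2)]],  # slight drift
    [[(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]],  # large deviation
    [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]],  # matches the demonstration
]
pairs = build_preference_pairs(samples, demo)
```

Pairs of this form could then be passed to any pairwise preference-alignment objective. Under the adversarial assumption criticized in the abstract, by contrast, every generated sample would only ever appear on the unpreferred side, paired against the demonstration.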