Multi-modal Knowledge Distillation-based Human Trajectory Forecasting

Abstract

Pedestrian trajectory forecasting is crucial in applications such as autonomous driving and mobile robot navigation. In such applications, camera-based perception enables the extraction of additional modalities (human pose, text) that enhance prediction accuracy. Indeed, we find that textual descriptions play a crucial role in integrating the additional modalities into a unified understanding. However, extracting text online requires a vision-language model (VLM), which may not be feasible for resource-constrained systems. To address this challenge, we propose a multi-modal knowledge distillation framework in which a student model with limited modalities is distilled from a teacher model trained with the full range of modalities. The comprehensive knowledge of a teacher model trained with trajectory, human pose, and text is distilled into a student model that uses only trajectory, or trajectory with human pose as the sole supplement. In doing so, we separately distill the core locomotion insights from intra-agent multi-modality and from inter-agent interaction. Our generalizable framework is validated with two state-of-the-art models across three datasets in both ego-view (JRDB, SIT) and BEV (ETH/UCY) setups, using both annotated and VLM-generated text captions. Distilled student models show consistent improvement in all prediction metrics for both full and instantaneous observations, with gains of up to ~13%. The code is available at https://github.com/Jaewoo97/KDTF.
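To make the teacher-student setup concrete, the following is a minimal sketch of how a student objective with separate intra-agent and inter-agent distillation terms might be composed. It is not taken from the released code: the mean-squared-error form of each term, the weights `w_intra` and `w_inter`, and all function and argument names are illustrative assumptions, and features are represented as flat lists of floats for simplicity.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_loss(
    pred_traj: Vector,
    gt_traj: Vector,
    student_intra: Vector,
    teacher_intra: Vector,
    student_inter: Vector,
    teacher_inter: Vector,
    w_intra: float = 1.0,  # hypothetical weight, not from the paper
    w_inter: float = 1.0,  # hypothetical weight, not from the paper
) -> float:
    """Combine a forecasting loss with two separate distillation terms.

    Assumption: each distillation term is an MSE between the student's
    features and the (frozen) teacher's features. The teacher was trained
    with trajectory, pose, and text; the student sees fewer modalities.
    """
    # Forecasting error of the student against the ground-truth future path.
    task = mse(pred_traj, gt_traj)
    # Match the teacher's per-agent (intra-agent, multi-modal) features.
    intra = mse(student_intra, teacher_intra)
    # Match the teacher's agent-to-agent (inter-agent interaction) features.
    inter = mse(student_inter, teacher_inter)
    return task + w_intra * intra + w_inter * inter
```

With this sketch, a student whose features already match the teacher's pays only the forecasting loss, while any mismatch in either feature group adds a separately weighted penalty.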