ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling

Abstract

Anticipating the multimodality of future events lays the foundation for safe autonomous driving. However, multimodal motion prediction for traffic agents has been clouded by the lack of multimodal ground truth. Existing works predominantly adopt the winner-take-all training strategy to tackle this challenge, yet still suffer from limited trajectory diversity and uncalibrated mode confidence. While some approaches address these limitations by generating excessive trajectory candidates, they necessitate a post-processing stage to identify the most representative modes, a process lacking universal principles and compromising trajectory accuracy. We are thus motivated to introduce ModeSeq, a new multimodal prediction paradigm that models modes as sequences. Unlike the common practice of decoding multiple plausible trajectories in one shot, ModeSeq requires motion decoders to infer the next mode step by step, thereby more explicitly capturing the correlation between modes and significantly enhancing the ability to reason about multimodality. Leveraging the inductive bias of sequential mode prediction, we also propose the Early-Match-Take-All (EMTA) training strategy to diversify the trajectories further. Without relying on dense mode prediction or heuristic post-processing, ModeSeq considerably improves the diversity of multimodal output while attaining satisfactory trajectory accuracy, resulting in balanced performance on motion prediction benchmarks. Moreover, ModeSeq naturally emerges with the capability of mode extrapolation, which supports forecasting more behavior modes when the future is highly uncertain.