Interact with me: Joint Egocentric Forecasting of Intent to Interact, Attitude and Social Actions

Abstract

For efficient human-agent interaction, an agent should proactively recognize its target user and prepare for upcoming interactions. We formulate this challenging problem as the novel task of jointly forecasting, from the agent's (egocentric) perspective, a person's intent to interact with the agent, their attitude towards the agent, and the action they will perform. To this end, we propose \emph{SocialEgoNet}, a graph-based spatiotemporal framework that exploits task dependencies through a hierarchical multitask learning approach. For high inference speed, SocialEgoNet uses only whole-body skeletons (keypoints from the face, hands and body) extracted from 1 second of video input. For evaluation, we augment an existing egocentric human-agent interaction dataset with new class labels and bounding box annotations. Extensive experiments on this augmented dataset, named JPL-Social, demonstrate \emph{real-time} inference and an average accuracy of 83.15\% across all tasks, outperforming several competitive baselines. The additional annotations and code will be available upon acceptance.
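To make the idea of hierarchical multitask prediction concrete, the following is a minimal sketch, not the SocialEgoNet implementation. It assumes a task order of intent, then attitude, then action, with each head receiving the shared skeleton feature plus the probabilities of the earlier heads; the abstract does not specify this order. The frame count, keypoint count, feature size, class counts, the name `HierarchicalHeads`, and the random weights are all hypothetical, and a simple pooled linear embedding stands in for the graph-based spatiotemporal encoder.

```python
import numpy as np

# All sizes are illustrative assumptions, not values taken from the paper.
NUM_FRAMES = 30        # assumed: about 1 second of video at 30 fps
NUM_KEYPOINTS = 133    # assumed: a COCO-WholeBody-style layout (body, face, hands)
FEATURE_DIM = 64       # assumed size of the shared feature vector
NUM_INTENT, NUM_ATTITUDE, NUM_ACTION = 2, 3, 8  # hypothetical class counts


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


class HierarchicalHeads:
    """Three task heads chained so later tasks see earlier predictions."""

    def __init__(self, rng):
        # Random weights only; a real model would learn these.
        self.w_embed = rng.normal(0, 0.1, (NUM_KEYPOINTS * 2, FEATURE_DIM))
        self.w_intent = rng.normal(0, 0.1, (FEATURE_DIM, NUM_INTENT))
        self.w_attitude = rng.normal(
            0, 0.1, (FEATURE_DIM + NUM_INTENT, NUM_ATTITUDE))
        self.w_action = rng.normal(
            0, 0.1, (FEATURE_DIM + NUM_INTENT + NUM_ATTITUDE, NUM_ACTION))

    def forward(self, skeletons):
        # skeletons: (batch, frames, keypoints, 2) array of 2D keypoints.
        batch, frames = skeletons.shape[:2]
        per_frame = skeletons.reshape(batch, frames, -1) @ self.w_embed
        # Placeholder for the graph-based spatiotemporal encoder:
        # a nonlinearity followed by average pooling over time.
        feat = np.tanh(per_frame).mean(axis=1)

        intent = softmax(feat @ self.w_intent)
        # The attitude head is conditioned on the intent probabilities.
        attitude = softmax(
            np.concatenate([feat, intent], axis=-1) @ self.w_attitude)
        # The action head is conditioned on both earlier predictions.
        action = softmax(
            np.concatenate([feat, intent, attitude], axis=-1) @ self.w_action)
        return intent, attitude, action


rng = np.random.default_rng(0)
model = HierarchicalHeads(rng)
clip = rng.normal(size=(4, NUM_FRAMES, NUM_KEYPOINTS, 2))  # synthetic input
intent, attitude, action = model.forward(clip)
print(intent.shape, attitude.shape, action.shape)
```

Feeding earlier predictions into later heads is one common way to encode task dependencies; whether SocialEgoNet passes probabilities, logits, or intermediate features between tasks is not stated in the abstract.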