arXiv

Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks

Zhi-Wei Xia, Kun-Yu Lin

Abstract

Error detection in procedural activities is essential for ensuring consistent and correct outcomes in AR-assisted and robotic systems. Existing methods often focus on temporal ordering errors or rely on static prototypes to represent normal actions. However, these approaches typically overlook a common scenario: multiple distinct actions may be valid after a given sequence of executed actions. This leads to two issues: (1) static prototypes cannot detect errors effectively when the inference environment or the distribution of action execution differs from that seen in training; and (2) the model may use the wrong prototype to detect errors when the ongoing action's label differs from the predicted one. To address these issues, we propose the Adaptive Multiple Normal Action Representation (AMNAR) framework. AMNAR predicts all valid next actions and reconstructs their corresponding normal action representations, which are then compared against the ongoing action to detect errors. Extensive experiments demonstrate that AMNAR achieves state-of-the-art performance, highlighting both its effectiveness and the importance of modeling multiple valid next actions in error detection. The code is available at https://github.com/iSEE-Laboratory/AMNAR.
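
The comparison step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine distance, the fixed threshold, the toy feature vectors, and the names `cosine_distance` and `detect_error` are all assumptions made for illustration. The sketch only shows the idea of scoring an ongoing action against several valid normal representations rather than a single static prototype.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def detect_error(
    ongoing: Vector,
    normal_reps: Dict[str, Vector],
    threshold: float = 0.3,  # hypothetical value, not taken from the paper
) -> Tuple[bool, str, float]:
    """Compare the ongoing action with every valid next action.

    normal_reps maps each valid next action (hypothetical labels) to its
    normal representation. The ongoing action is flagged as an error only
    if it is far from all of them.
    """
    if not normal_reps:
        raise ValueError("at least one valid next action is required")
    closest_action, closest_dist = min(
        ((name, cosine_distance(ongoing, rep)) for name, rep in normal_reps.items()),
        key=lambda pair: pair[1],
    )
    return closest_dist > threshold, closest_action, closest_dist


if __name__ == "__main__":
    # Two valid next actions after the same executed sequence (toy features).
    reps = {"pour_water": [1.0, 0.0, 0.0], "add_coffee": [0.0, 1.0, 0.0]}
    print(detect_error([0.9, 0.1, 0.0], reps))  # close to "pour_water"
    print(detect_error([0.0, 0.0, 1.0], reps))  # far from every valid action
```

Taking the minimum distance over all valid next actions means that an action is not penalized merely for differing from one predicted label, which corresponds to issue (2) in the abstract. How the normal representations are actually reconstructed and scored is defined in the paper and the linked repository, not here.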