Abstract
While sequential decision-making environments often involve high-dimensional observations, not all features of these observations are relevant for control. In particular, the observation space may capture factors of the environment that the agent cannot control but that add complexity to the observation space. The need to ignore these "noise" features in order to operate in a tractably small state space poses a challenge for efficient policy learning. Due to the abundance of video data available in many such environments, task-independent representation learning from action-free offline data offers an attractive solution. However, recent work has highlighted theoretical limitations in action-free learning under the Exogenous Block MDP (Ex-BMDP) model, where temporally correlated noise features are present in the observations. To address these limitations, we identify a realistic setting where representation learning in Ex-BMDPs becomes tractable: when action-free video data from multiple agents with differing policies are available. Concretely, this paper introduces CRAFT (Comparison-based Representations from Action-Free Trajectories), a sample-efficient algorithm that leverages differences in controllable feature dynamics across agents to learn representations. We provide theoretical guarantees for CRAFT's performance and demonstrate its feasibility on a toy example, offering a foundation for practical methods in similar settings.