Abstract
This paper addresses a key limitation of existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models consistent with the observational and interventional distributions of an MDP, each yielding a different counterfactual distribution, so fixing a particular causal model limits the validity and usefulness of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods, which require solving prohibitively large optimisation problems (with a number of variables that grows exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward with respect to the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.
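To make the final step concrete, the following is a minimal sketch of textbook robust value iteration on an interval MDP, in which each transition probability is only known to lie in an interval and the policy maximises the worst-case discounted reward. It is an illustration under stated assumptions, not the paper's implementation: the function names, the discount factor, and the toy interval bounds are all hypothetical, and the closed-form counterfactual bounds themselves are not reproduced here.

```python
import numpy as np

def worst_case_distribution(lo, hi, values):
    """Pick p with lo <= p <= hi and sum(p) == 1 that minimises p @ values."""
    lo = np.asarray(lo, dtype=float)
    hi = np.asarray(hi, dtype=float)
    p = lo.copy()
    budget = max(0.0, 1.0 - p.sum())   # probability mass still to be placed
    for s in np.argsort(values):       # lowest-value successor states first
        extra = min(hi[s] - lo[s], budget)
        p[s] += extra
        budget -= extra
    return p

def robust_value_iteration(lo, hi, reward, gamma=0.9, tol=1e-8, max_iter=10_000):
    """lo, hi: interval bounds of shape (S, A, S); reward: shape (S, A)."""
    n_states, n_actions, _ = lo.shape
    v = np.zeros(n_states)
    q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        for s in range(n_states):
            for a in range(n_actions):
                # Adversary chooses the transition distribution inside the intervals.
                p = worst_case_distribution(lo[s, a], hi[s, a], v)
                q[s, a] = reward[s, a] + gamma * p @ v
        v_new = q.max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    return v, q.argmax(axis=1)

# Hypothetical toy problem: state 1 is absorbing with zero reward.
# In state 0, action 0 is "safe" (stays in state 0 with certainty),
# while action 1 pays more but has uncertain transition probabilities.
lo = np.array([[[1.0, 0.0], [0.2, 0.1]],
               [[0.0, 1.0], [0.0, 1.0]]])
hi = np.array([[[1.0, 0.0], [0.9, 0.8]],
               [[0.0, 1.0], [0.0, 1.0]]])
reward = np.array([[1.0, 1.5],
                   [0.0, 0.0]])

values, policy = robust_value_iteration(lo, hi, reward)
print(values, policy)
```

The inner minimisation is solved greedily: every successor starts at its lower bound, and the remaining probability mass is assigned to the lowest-value successors first, up to their upper bounds. This avoids a linear program per state-action pair, which is the usual design choice when the uncertainty set is a product of intervals.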