arXiv

Overcoming Deceptiveness in Fitness Optimization with Unsupervised Quality-Diversity

Abstract

Policy optimization seeks the best solution to a control problem according to an objective or fitness function, and is a fundamental field of engineering and research with applications in robotics. Traditional optimization methods such as reinforcement learning and evolutionary algorithms struggle with deceptive fitness landscapes, where following immediate improvements leads to suboptimal solutions. Quality-diversity (QD) algorithms offer a promising approach by maintaining diverse intermediate solutions as stepping stones for escaping local optima. However, QD algorithms require domain expertise to define hand-crafted features, which limits their applicability where it is unclear how to characterize solution diversity. In this paper, we show that unsupervised QD algorithms, specifically the AURORA framework, which learns features from sensory data, efficiently solve deceptive optimization problems without domain expertise. By enhancing AURORA with contrastive learning and periodic extinction events, we propose AURORA-XCon, which outperforms all traditional optimization baselines and matches the best QD baseline with domain-specific hand-crafted features, in some cases improving on it by up to 34%. This work establishes a novel application of unsupervised QD algorithms, shifting their focus from discovering novel solutions toward traditional optimization and expanding their potential to domains where defining feature spaces poses challenges.
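To make the notions of a deceptive landscape and of stepping stones concrete, the following is a minimal sketch and not the paper's method. It implements neither AURORA nor AURORA-XCon: the one-dimensional `fitness` function, the mutation scale, the starting point, and the grid archive with a hand-crafted feature (the position `x` itself) are all assumptions chosen for illustration. It contrasts a greedy hill climber with a simple grid-archive QD loop in the style of MAP-Elites.

```python
import random


def fitness(x):
    """Toy deceptive landscape on [0, 1] (hypothetical, not from the paper).

    A broad local optimum (fitness 0.5) sits at x = 0.2. A narrow global
    optimum (fitness 2.0) sits at x = 0.9, behind a valley of low fitness.
    """
    return max(0.5 - abs(x - 0.2), 2.0 - 40.0 * abs(x - 0.9))


def mutate(x, rng, sigma=0.05):
    """Gaussian mutation, clipped to the search interval [0, 1]."""
    return min(1.0, max(0.0, x + rng.gauss(0.0, sigma)))


def hill_climb(x0, iterations, rng):
    """Greedy (1+1) search: keep a child only if it is at least as fit."""
    best = x0
    for _ in range(iterations):
        child = mutate(best, rng)
        if fitness(child) >= fitness(best):
            best = child
    return best


def grid_qd(x0, iterations, rng, n_bins=20):
    """Minimal grid-archive QD loop (MAP-Elites style).

    The feature is hand-crafted (an assumption of this sketch): it is
    simply x, discretised into n_bins cells. Each cell keeps its fittest
    solution, so low-fitness solutions in the valley survive and can
    serve as stepping stones toward the global optimum.
    """
    def cell(x):
        return min(int(x * n_bins), n_bins - 1)

    archive = {cell(x0): x0}
    for _ in range(iterations):
        # Select a parent uniformly among the occupied cells.
        parent = rng.choice(list(archive.values()))
        child = mutate(parent, rng)
        c = cell(child)
        # Insert if the cell is empty or the child beats its current elite.
        if c not in archive or fitness(child) > fitness(archive[c]):
            archive[c] = child
    return max(archive.values(), key=fitness)


# Both searches start at x = 0.3, on the slope leading to the local optimum.
x_hc = hill_climb(0.3, 20_000, random.Random(0))
x_qd = grid_qd(0.3, 20_000, random.Random(0))
print(f"hill climbing: x={x_hc:.3f}, fitness={fitness(x_hc):.3f}")
print(f"grid QD:       x={x_qd:.3f}, fitness={fitness(x_qd):.3f}")
```

In this toy setting the hill climber follows immediate improvements and stays at the local optimum, while the archive retains solutions across the valley and reaches the narrow peak. Here the feature is trivially available. The abstract's point is that in realistic control problems such a feature must normally be hand-designed, which is the step unsupervised QD aims to remove by learning features from sensory data.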