arXiv

GDRNPP: A Geometry-guided and Fully Learning-based Object Pose Estimator

Xingyu Liu, Ruida Zhang, Chenyangguang Zhang, Gu Wang, Jiwen Tang, Zhigang Li, Xiangyang Ji

Abstract

6D pose estimation of rigid objects is a long-standing and challenging task in computer vision. Recently, the emergence of deep learning has revealed the potential of Convolutional Neural Networks (CNNs) to predict reliable 6D poses. Because direct pose regression networks still perform suboptimally, most methods continue to rely on traditional techniques to varying degrees. For example, top-performing methods often adopt an indirect strategy: they first establish 2D-3D or 3D-3D correspondences, then apply RANSAC-based PnP or the Kabsch algorithm, and finally employ ICP for refinement. Although this improves performance, integrating traditional techniques makes the networks time-consuming and prevents end-to-end training. In contrast, this paper introduces a fully learning-based object pose estimator. We first perform an in-depth investigation of both direct and indirect methods and propose a simple yet effective Geometry-guided Direct Regression Network (GDRN) that learns the 6D pose from monocular images in an end-to-end manner. We then introduce a geometry-guided pose refinement module that improves pose accuracy when additional depth data is available. Guided by the predicted coordinate map, we build an end-to-end differentiable architecture that establishes robust and accurate 3D-3D correspondences between the observed and rendered RGB-D images to refine the pose. Our enhanced pose estimation pipeline, GDRNPP (GDRN Plus Plus), topped the leaderboard of the BOP Challenge for two consecutive years, becoming the first method to surpass all prior methods that relied on traditional techniques in both accuracy and speed. The code and models are available at https://github.com/shanice-l/gdrnpp_bop2022.
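For readers unfamiliar with the traditional step that indirect methods rely on, the following is a minimal NumPy sketch of the Kabsch algorithm, which recovers a rigid pose from 3D-3D correspondences. It illustrates the classical solver mentioned in the abstract, not the learned refinement module of GDRNPP. The function name `kabsch` and the synthetic test data are assumptions made for illustration only.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)

    # Center both point sets on their centroids.
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)

    # 3x3 cross-covariance matrix of the centered correspondences.
    H = (src - src_c).T @ (dst - dst_c)

    # The SVD of H gives the optimal rotation; the sign correction
    # guards against returning a reflection (det = -1).
    U, _, Vt = np.linalg.svd(H)
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # The translation aligns the rotated source centroid with the target centroid.
    t = dst_c - R @ src_c
    return R, t


# Synthetic check: a known pose applied to random model points.
rng = np.random.default_rng(0)
model_points = rng.normal(size=(50, 3))

angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.1, -0.2, 0.5])

observed_points = model_points @ R_true.T + t_true
R_est, t_est = kabsch(model_points, observed_points)
```

With noise-free correspondences the estimated pose matches the ground truth up to floating-point error. In practice the correspondences are noisy and contain outliers, which is why classical pipelines add RANSAC and ICP on top, at the cost of speed and end-to-end differentiability.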