UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis

Abstract

Document structure analysis, also known as document layout analysis, is crucial for understanding both the physical layout and the logical structure of documents, and it supports applications such as information retrieval, document summarization, and knowledge extraction. Hierarchical Document Structure Analysis (HDSA) specifically aims to restore the hierarchical structure of documents created with authoring software that uses hierarchical schemas. Previous research has primarily followed two approaches: one tackles specific subtasks of HDSA in isolation, such as table detection or reading order prediction, while the other adopts a unified framework with multiple branches or modules, each designed to address a distinct task. In this work, we propose a unified relation prediction approach for HDSA, called UniHDSA, which treats the various HDSA subtasks as relation prediction problems and consolidates their relation prediction labels into a unified label space. This allows a single relation prediction module to handle multiple tasks simultaneously, at either the page level or the document level. To validate the effectiveness of UniHDSA, we develop a multimodal end-to-end system based on Transformer architectures. Extensive experimental results demonstrate that our approach achieves state-of-the-art performance on a hierarchical document structure analysis benchmark, Comp-HRDoc, and competitive results on a large-scale document layout analysis dataset, DocLayNet, illustrating the superiority of our method across all subtasks. The Comp-HRDoc benchmark and UniHDSA's configurations are publicly available at https://github.com/microsoft/CompHRDoc.
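To make the idea of a unified label space concrete, the following is a minimal sketch of how one set of pairwise predictions could serve several subtasks at once. It is not the UniHDSA implementation: the label names in `Relation`, the `decode_relations` helper, and the toy scores are all hypothetical, chosen only for illustration, and a real system would obtain the per-pair scores from a learned relation prediction module.

```python
from collections import defaultdict
from enum import IntEnum
from typing import Dict, List, Sequence, Tuple


class Relation(IntEnum):
    """Hypothetical unified label space; names are illustrative, not the paper's."""
    NONE = 0           # no relation between the two units
    SAME_REGION = 1    # e.g. two text lines grouped into one layout region
    READING_ORDER = 2  # e.g. unit i is read immediately before unit j
    PARENT_CHILD = 3   # e.g. unit i is the hierarchical parent of unit j


Pair = Tuple[int, int]


def decode_relations(
    pair_scores: Dict[Pair, Sequence[float]]
) -> Dict[Relation, List[Pair]]:
    """Pick the highest-scoring label per ordered pair and group pairs by label.

    Assumes each value holds one score per member of `Relation`, as a single
    shared prediction head over the unified label space would produce.
    """
    by_label: Dict[Relation, List[Pair]] = defaultdict(list)
    for pair, scores in sorted(pair_scores.items()):
        if len(scores) != len(Relation):
            raise ValueError(f"expected {len(Relation)} scores for pair {pair}")
        best = max(range(len(scores)), key=scores.__getitem__)
        label = Relation(best)
        if label is not Relation.NONE:
            by_label[label].append(pair)
    return dict(by_label)


# Toy scores standing in for the output of a relation prediction module.
toy_scores = {
    (0, 1): [0.1, 0.7, 0.1, 0.1],
    (1, 2): [0.1, 0.1, 0.6, 0.2],
    (0, 2): [0.1, 0.1, 0.2, 0.6],
    (2, 0): [0.9, 0.0, 0.1, 0.0],
}
relations = decode_relations(toy_scores)
```

In this sketch each subtask simply reads its own slice of `relations` (for example, `relations[Relation.READING_ORDER]`), so no task-specific branch is needed after the shared prediction step.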