A Comparative Study of PDF Parsing Tools Across Diverse Document Categories

Research

arXiv

Abstract

PDF is one of the most widely used data formats, which makes PDF parsing crucial for information extraction and retrieval, particularly with the rise of retrieval-augmented generation (RAG) systems. While various PDF parsing tools exist, their effectiveness across different document types remains understudied, especially beyond academic papers. Our research aims to address this gap by comparing 10 popular PDF parsing tools across 6 document categories using the DocLayNet dataset. These tools include PyPDF, pdfminer.six, PyMuPDF, pdfplumber, pypdfium2, Unstructured, Tabula, and Camelot, as well as the deep learning-based tools Nougat and Table Transformer (TATR). We evaluated both text extraction and table detection capabilities. For text extraction, PyMuPDF and pypdfium2 generally outperformed the other tools, but all parsers struggled with Scientific and Patent documents. For these challenging categories, learning-based tools such as Nougat demonstrated superior performance. In table detection, TATR excelled in the Financial, Patent, Law & Regulations, and Scientific categories. The table detection tool Camelot performed best in the Tender category, while PyMuPDF performed best in the Manual category. Our findings highlight the importance of selecting a parsing tool based on document type and the specific task, and they provide practical guidance for researchers and practitioners working with diverse document sources.
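To make the idea of comparing parsers per document category concrete, the following is a minimal sketch, not the paper's evaluation code. The abstract does not say which metrics were used, so a token-level F1 score is assumed here purely for illustration. The function names `token_f1` and `best_parser_per_category`, and the parser names in the usage note, are hypothetical.

```python
from collections import Counter


def token_f1(extracted: str, reference: str) -> float:
    """Token-level F1 between parser output and ground-truth text.

    Assumption: this metric is chosen for illustration only; the
    abstract does not specify the metrics used in the study.
    """
    pred = extracted.split()
    gold = reference.split()
    if not pred or not gold:
        # Both empty counts as a perfect match; exactly one empty is a total miss.
        return float(pred == gold)
    # Multiset intersection: each token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_parser_per_category(outputs, references):
    """Pick the highest-scoring parser for each document category.

    outputs:    {category: {parser_name: extracted_text}}
    references: {category: ground_truth_text}
    """
    best = {}
    for category, by_parser in outputs.items():
        scores = {
            name: token_f1(text, references[category])
            for name, text in by_parser.items()
        }
        best[category] = max(scores, key=scores.get)
    return best
```

With placeholder inputs such as `outputs = {"Financial": {"parser_a": "total revenue 2020", "parser_b": "total rev enue"}}` and `references = {"Financial": "total revenue 2020"}`, `best_parser_per_category(outputs, references)` selects `parser_a` for the Financial category, because its output matches the reference exactly. In practice the extracted text would come from running each real parser on the same page.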