arXiv

A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950

Zhao Fang

Abstract

This paper compares large language models (LLMs) with traditional natural language processing (NLP) tools on three tasks for Chinese texts from 1900 to 1950: word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER). Historical Chinese documents are difficult to analyze because of their logographic script, the absence of natural word boundaries, and significant linguistic change over the period. Using a sample dataset from the Shanghai Library Republican Journal corpus, the study compares traditional tools such as Jieba and spaCy with LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods on all metrics, but at considerably higher computational cost, highlighting a trade-off between accuracy and efficiency. LLMs also cope better with genre-specific challenges such as poetry and with temporal variation (i.e., pre-1920 versus post-1920 texts), which suggests that their in-context learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.
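As an illustration of how segmentation output from any of these tools might be scored against a gold standard, the following is a minimal sketch. It assumes span-based precision, recall, and F1, a common choice for Chinese word segmentation; the abstract does not state which metrics the paper uses. The function names and the example sentence are hypothetical and are not taken from the paper or its dataset.

```python
from typing import List, Set, Tuple


def to_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Convert a token list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for tok in tokens:
        end = start + len(tok)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Span-based precision, recall, and F1 for one segmented sentence.

    A predicted word counts as correct only if both of its boundaries
    match a gold word exactly. This metric is an assumption made for
    illustration, not necessarily the one used in the paper.
    """
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and pred must segment the same character string")
    gold_spans = to_spans(gold)
    pred_spans = to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: a hand-written gold segmentation and an
# imagined tool output that misplaces one boundary.
gold = ["上海", "图书馆", "藏", "民国", "期刊"]
pred = ["上海", "图书", "馆藏", "民国", "期刊"]
precision, recall, f1 = segmentation_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The same scoring function can be applied to the output of a dictionary-based segmenter or to an LLM's response once both are reduced to token lists over the identical character string, which keeps the comparison independent of the tool being evaluated.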