KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications

Abstract

We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text. Although tokenization is a well-studied problem, tokenizers built specifically for professional domains remain understudied. This paper makes two main contributions. First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text. Our kl3m-004-128k-cased tokenizer uses 9-17% fewer tokens than the GPT-4o and Llama3 tokenizers on domain-specific documents, despite having a smaller vocabulary. On specialized terminology the cased tokenizer is more efficient still, using up to 83% fewer tokens for legal terms and 39% fewer tokens for financial terms. Second, we develop character-level BPE tokenizers (with vocabulary sizes of 4K, 8K, and 16K) for text correction tasks such as OCR post-processing. These tokenizers keep token boundaries consistent between error-containing text and its corrected form, which makes correction patterns easier for models to learn. Taken together, the KL3M tokenizers support professional applications by fitting more text into a context window, reducing computational cost, and preserving the meaning of domain-specific terms. Our analysis shows that these efficiency gains directly benefit the processing of long legal and financial documents. We release all tokenizers and code through GitHub and Hugging Face to support further research in specialized tokenization.
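The "fewer tokens" figures above are relative reductions against a baseline tokenizer. The following is a minimal sketch of that arithmetic, together with a toy illustration of why a domain term in the vocabulary lowers token counts. It uses greedy longest-match segmentation as a stand-in for BPE, and the vocabularies, the function names `token_reduction` and `greedy_tokenize`, and the example term are all hypothetical, chosen for illustration rather than taken from the KL3M tokenizers.

```python
from typing import List, Set


def token_reduction(baseline_tokens: int, candidate_tokens: int) -> float:
    """Percent fewer tokens the candidate uses relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation (a simplification, not real BPE).

    Any single character is accepted as a fallback token, so every
    input can be segmented even if no vocabulary entry matches.
    """
    tokens: List[str] = []
    max_len = max((len(v) for v in vocab), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: the "domain" one adds a single legal term.
GENERAL_VOCAB = {"in", "dem", "ni", "fication"}
DOMAIN_VOCAB = GENERAL_VOCAB | {"indemnification"}

term = "indemnification"
general = greedy_tokenize(term, GENERAL_VOCAB)  # several sub-word pieces
domain = greedy_tokenize(term, DOMAIN_VOCAB)    # one whole-word token
reduction = token_reduction(len(general), len(domain))
print(f"{len(general)} -> {len(domain)} tokens ({reduction:.1f}% fewer)")
```

In this toy setup the term falls from four tokens to one, a 75% reduction. The number depends entirely on the made-up vocabularies and is not meant to reproduce the reported results; measuring the real figures would require running the released tokenizers and the baselines over the same documents.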