GLiNER-biomed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition

Abstract

Biomedical named entity recognition (NER) presents unique challenges due to specialized vocabularies, the sheer volume of entities, and the continuous emergence of novel entities. Traditional NER models, constrained by fixed taxonomies and human annotations, struggle to generalize beyond predefined entity types or to adapt efficiently to emerging concepts. To address these issues, we introduce GLiNER-biomed, a domain-adapted suite of Generalist and Lightweight Model for NER (GLiNER) models tailored specifically to biomedical NER. In contrast to conventional approaches, GLiNER uses natural language descriptions to infer arbitrary entity types, enabling zero-shot recognition. Our approach first distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model, which we then use to generate high-coverage synthetic biomedical NER data. We subsequently train two GLiNER architectures, uni-encoder and bi-encoder, at multiple scales to balance computational efficiency and recognition performance. Evaluations on several biomedical datasets demonstrate that GLiNER-biomed outperforms state-of-the-art GLiNER models in both zero-shot and few-shot scenarios, achieving a 5.96% improvement in F1-score over the strongest baseline. Ablation studies highlight the effectiveness of our synthetic data generation strategy and emphasize the complementary benefits of combining synthetic biomedical pre-training with fine-tuning on high-quality general-domain annotations. All datasets, models, and training pipelines are publicly available at https://github.com/ds4dh/GLiNER-biomed.
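For readers less familiar with the metric behind the reported improvement, the following is a minimal sketch of entity-level F1 in Python. It assumes exact-match scoring, where a prediction counts only if both its span boundaries and its entity type agree with a gold annotation; the abstract does not state the matching protocol or averaging scheme used in the evaluation, so this is an illustrative assumption. The function name `entity_f1`, the `Entity` tuple layout, and the example sentence and labels are all hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: (start offset, end offset, entity type).
Entity = Tuple[int, int, str]


def entity_f1(gold: Iterable[Entity], predicted: Iterable[Entity]) -> float:
    """Entity-level F1 under exact match (an assumption, see lead-in):
    a prediction is correct only if span and type both match a gold entity."""
    gold_set: Set[Entity] = set(gold)
    pred_set: Set[Entity] = set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        # Also covers empty gold or empty prediction sets.
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations for "Aspirin reduces fever in adults."
# Because entity types are free-text labels in an open NER setting,
# a type mismatch ("disease" vs. "symptom") counts as an error here.
gold = [(0, 7, "drug"), (16, 21, "symptom")]
predicted = [(0, 7, "drug"), (16, 21, "disease"), (25, 31, "population")]
score = entity_f1(gold, predicted)
```

In this toy case one of three predictions is correct and one of two gold entities is recovered, so precision is 1/3 and recall is 1/2. Stricter or more lenient schemes (partial span overlap, type-agnostic matching) would give different numbers for the same predictions.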

GLiNER-biomed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition - arXiv