BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

Abstract

The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology. They excel in zero-shot classification, with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and deliver stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.
