Gemma 3 Technical Report

Abstract

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, wider language coverage, and a longer context of at least 128K tokens. We also change the model architecture to reduce the KV-cache memory, which tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers and keeping the span of local attention short. The Gemma 3 models are trained with distillation and outperform Gemma 2 in both pre-trained and instruction-finetuned versions. In particular, our novel post-training recipe significantly improves math, chat, instruction-following, and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
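
To make the KV-cache argument concrete, the following is a minimal sketch of how cache size scales when most layers use short-span local attention. It is a back-of-the-envelope estimate, not the report's implementation. The function name `kv_cache_bytes`, the interleaving pattern, and every numeric value (layer count, KV heads, head dimension, window size, local-to-global ratio) are illustrative assumptions; the abstract itself states none of them.

```python
def kv_cache_bytes(context_len, n_layers, local_per_global, window,
                   n_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV-cache size in bytes for a single sequence.

    Assumes (hypothetically) that layers repeat in a pattern of
    `local_per_global` local layers followed by one global layer.
    A global layer caches every token in the context; a local layer
    caches at most `window` tokens.
    """
    period = local_per_global + 1
    n_global = n_layers // period
    n_local = n_layers - n_global
    # Factor 2 covers keys and values; bytes_per_value=2 assumes 16-bit storage.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value
    cached_tokens = (n_global * context_len
                     + n_local * min(context_len, window))
    return per_token_per_layer * cached_tokens


# Illustrative configuration only; not the actual Gemma 3 hyperparameters.
config = dict(context_len=128 * 1024, n_layers=48, n_kv_heads=8, head_dim=128)

all_global = kv_cache_bytes(local_per_global=0, window=0, **config)
mostly_local = kv_cache_bytes(local_per_global=5, window=1024, **config)

print(f"all global layers: {all_global / 2**30:.2f} GiB")
print(f"5 local per global: {mostly_local / 2**30:.2f} GiB")
```

Under these assumptions only the global layers grow with context length, so raising the share of local layers and shortening their span both reduce the cache roughly in proportion.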

Gemma 3 Technical Report - arXiv