Abstract
We present RWKV-7 "Goose", a new sequence modeling architecture with constant memory usage and constant inference time per token. Despite being trained on dramatically fewer tokens than other top models, our 2.9-billion-parameter language model achieves a new 3B SoTA on multilingual tasks and matches the current 3B SoTA on English-language downstream performance. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages while retaining parallelizability of training. Under standard complexity conjectures, this exceeds the capabilities of Transformers, which are limited to $\mathsf{TC}^0$. To demonstrate RWKV-7's language modeling capability, we also present an extended open-source 3.1-trillion-token multilingual corpus and train four RWKV-7 models, ranging from 0.19 billion to 2.9 billion parameters, on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at https://huggingface.co/RWKV and our training and inference code at https://github.com/RWKV/RWKV-LM, all under the Apache 2.0 License.