Beyond Vanilla Fine-Tuning: Leveraging Multistage, Multilingual, and Domain-Specific Methods for Low-Resource Machine Translation
Abstract
Fine-tuning multilingual sequence-to-sequence large language models (msLLMs) has shown promise for building neural machine translation (NMT) systems for low-resource languages (LRLs). However, conventional single-stage fine-tuning struggles in extremely low-resource NMT settings, where training data is very limited. This paper contributes to artificial intelligence by proposing two approaches for adapting msLLMs in these challenging scenarios: (1) continual pre-training (CPT), in which the msLLM is further trained on domain-specific monolingual data to compensate for the under-representation of LRLs, and (2) intermediate task transfer learning (ITTL), which fine-tunes the msLLM on both in-domain and out-of-domain parallel data to enhance its translation capabilities across various domains and tasks. As an application in engineering, these methods are implemented in NMT systems for Sinhala, Tamil, and English (six language pairs) in domain-specific, extremely low-resource settings (datasets containing fewer than 100,000 samples). Our experiments show that, averaged over all translation directions, these approaches improve translation quality by +1.47 bilingual evaluation understudy (BLEU) points over the standard single-stage fine-tuning baseline. A multi-model ensemble yields a further BLEU improvement on top of this gain.
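To make the evaluation setup concrete, the following is a minimal sketch of how three languages give six translation directions and how an average BLEU gain over a baseline could be computed across them. It is not the authors' evaluation code: the function names `translation_directions` and `average_bleu_gain`, and the use of ISO 639-1 codes as language identifiers, are assumptions made for illustration, and no scores from the paper are included.

```python
from itertools import permutations
from statistics import mean

# Assumed identifiers: ISO 639-1 codes for Sinhala, Tamil, and English.
LANGUAGES = ("si", "ta", "en")


def translation_directions(languages=LANGUAGES):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


def average_bleu_gain(baseline, system):
    """Mean per-direction BLEU difference (system minus baseline).

    Both arguments map (source, target) -> BLEU score and must cover
    the same set of directions.
    """
    if baseline.keys() != system.keys():
        raise ValueError("baseline and system must cover the same directions")
    return mean(system[d] - baseline[d] for d in baseline)
```

Averaging the per-direction differences, rather than comparing two pooled scores, weights each of the six directions equally regardless of test-set size; whether the paper averages in exactly this way is an assumption of the sketch.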