arXiv

Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

Eva Navas ,

Abstract

Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on, but also complements its linguistic adaptability by incorporating language models. Using statistical language models, we obtained improvements of up to 51% on in-distribution datasets and up to 34% on out-of-distribution sentences, while large language models provided moderate but consistently robust improvements across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting results for transformer-based ASR models. In summary, this research clears the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details of this study, the technical documentation and source code are available at http://www.github.com/hitz-zentroa/whisper-lm.
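To make the idea of combining an ASR model with a language model concrete, the following is a minimal sketch of n-best rescoring with a weighted language model score, followed by a word error rate comparison. The abstract does not specify the integration method, so the scoring rule (acoustic score plus `alpha` times the language model log-probability plus `beta` times the word count) is an assumption here, as are the names `UnigramLM`, `rescore`, `alpha`, and `beta`. The hypotheses, scores, and corpus are toy values invented for illustration and are not results from the study.

```python
import math
from collections import Counter


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


class UnigramLM:
    """Toy add-one-smoothed unigram model (hypothetical stand-in for a real LM)."""

    def __init__(self, corpus):
        words = [w for sentence in corpus for w in sentence.split()]
        self.counts = Counter(words)
        self.total = len(words)
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def log_prob(self, sentence: str) -> float:
        return sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab))
            for w in sentence.split()
        )


def rescore(nbest, lm, alpha=0.5, beta=0.0):
    """Pick the hypothesis maximizing asr_score + alpha * lm_score + beta * n_words.

    nbest is a list of (text, asr_log_prob) pairs; alpha and beta are the
    language model weight and word insertion bonus (assumed parameter names).
    """
    def combined(item):
        text, asr_log_prob = item
        return asr_log_prob + alpha * lm.log_prob(text) + beta * len(text.split())

    return max(nbest, key=combined)[0]


# Toy data: the ASR model alone slightly prefers the misspelled hypothesis.
reference = "the cat sat on the mat"
nbest = [("the cat sat on the matt", -1.0), ("the cat sat on the mat", -1.2)]
lm = UnigramLM(["the cat sat on the mat", "the dog sat on the rug"])

baseline = max(nbest, key=lambda item: item[1])[0]  # ASR score only
rescored = rescore(nbest, lm, alpha=0.5)

wer_baseline = wer(reference, baseline)
wer_rescored = wer(reference, rescored)
relative_reduction = (wer_baseline - wer_rescored) / wer_baseline
```

In this toy case the language model penalizes the unseen word "matt" enough to flip the ranking, so the rescored hypothesis matches the reference. With a smaller `alpha` the ASR score would dominate and the ranking would not change, which illustrates why the weights need tuning.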