arXiv

LANGALIGN: Enhancing Non-English Language Models via Cross-Lingual Embedding Alignment

Ho-Jin Choi

Abstract

While Large Language Models have gained attention, many service developers still rely on embedding-based models due to practical constraints. In such cases, the quality of fine-tuning data directly impacts performance, and English datasets are often used as seed data for training non-English models. In this study, we propose LANGALIGN, which enhances target language processing by aligning English embedding vectors with those of the target language at the interface between the language model and the task header. Experiments on Korean, Japanese, and Chinese demonstrate that LANGALIGN significantly improves performance across all three languages. Additionally, we show that LANGALIGN can be applied in reverse to convert target language data into a format that an English-based model can process.