AfroXLMR-Social: Adapting Pre-trained Language Models for African Languages Social Media Text
Abstract
Pretrained Language Models (PLMs) built from various sources are the foundation of today's NLP progress. The language representations learned by such models achieve strong performance across many tasks, on datasets of varying sizes drawn from various sources. We present a thorough analysis of domain-adaptive and task-adaptive continual pretraining approaches for low-resource African languages and show promising results on the evaluated tasks. We create AfriSocial, a corpus designed for domain-adaptive finetuning that has passed through quality pre-processing steps. Continually pretraining PLMs with AfriSocial as domain-adaptive pretraining (DAPT) data consistently improves performance on a fine-grained emotion classification task across 16 targeted languages, by 1% to 28.27% macro F1 score. Likewise, the task-adaptive pretraining (TAPT) approach, which further trains the model on a small amount of unlabeled data from a similar task, shows promising results. For example, using unlabeled sentiment data (source task) for the fine-grained emotion classification task (target task) improves the base model's F1 score by 0.55% to 15.11%. Combining the two methods, DAPT + TAPT, also achieves better results than the base models. All resources will be made available to improve low-resource NLP tasks in general, as well as other tasks in similar domains, such as hate speech and sentiment analysis.
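The improvements above are reported in macro F1, which averages per-class F1 scores without weighting by class frequency, so rare emotion classes count as much as frequent ones. The following is a minimal sketch of that metric, assuming a single-label setting with one gold and one predicted label per example; the function name `macro_f1` and the toy labels are illustrative and are not taken from the paper's evaluation code.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred.

    Assumption: single-label classification (one label per example).
    """
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed

    per_class = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)

    return sum(per_class) / len(per_class)


if __name__ == "__main__":
    # Toy example with hypothetical emotion labels.
    gold = ["joy", "joy", "anger", "fear"]
    pred = ["joy", "anger", "anger", "joy"]
    print(f"macro F1 = {macro_f1(gold, pred):.4f}")
```

In the toy example the per-class scores are 0.5 for joy, 2/3 for anger, and 0 for fear; the never-predicted fear class pulls the average down as strongly as any other class would, which is why macro F1 is a common choice for imbalanced emotion datasets.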