Detection of Somali-written Fake News and Toxic Messages on Social Media Using Transformer-based Language Models
Abstract
Anyone with a social media account can create and share content, and the public increasingly relies on social media platforms as a source of news and information. Together, these trends bring significant challenges such as misinformation, fake news, and harmful content. Although human content moderation is useful to an extent and is used by these platforms to flag posted material, AI models offer a more sustainable, scalable, and effective way to mitigate harmful content. However, low-resource languages such as Somali face limitations in AI automation, including scarce annotated training datasets and a lack of language models tailored to their unique linguistic characteristics. This paper presents part of our ongoing research to bridge some of these gaps for the Somali language. In particular, we created two human-annotated Somali datasets sourced from social media for two downstream applications, fake news classification and toxicity classification, and developed a transformer-based monolingual Somali language model named SomBERTa, which is the first of its kind to the best of our knowledge. SomBERTa is then fine-tuned and evaluated on toxic content, fake news, and news topic classification datasets. A comparative evaluation against related multilingual models (e.g., AfriBERTa and AfroXLMR) showed that SomBERTa consistently outperformed these comparators on both the fake news and toxic content classification tasks while achieving the best average accuracy (87.99%) across all tasks. This research contributes to Somali NLP by offering a foundational language model and a replicable framework for other low-resource languages, promoting digital and AI inclusivity and linguistic diversity.