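Safe Reinforcement Learning from Human Feedback for Multimodal Large Language Models (Safe RLHF-V): A Study of Safety Alignment in Multimodal Settings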

Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks. How can we ensure that MLLMs are safely aligned with human intent to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards? Going a step further, how can we fine-tune MLLMs to enhance reasoning performance while ensuring they satisfy safety constraints? Fundamentally, this can be formulated as a min-max optimization problem. In this study, we propose Safe RLHF-V, the first multimodal safety alignment framework that jointly optimizes helpfulness and safety using separate multimodal reward and cost models under Lagrangian-based constrained optimization. Because existing preference datasets do not separate helpfulness from safety in multimodal scenarios, we introduce BeaverTails-V, the first open-source dataset with dual preference annotations for helpfulness and safety, along with multi-level safety labels (minor, moderate, severe). Additionally, we design a Multi-level Guardrail System to proactively defend against unsafe queries and adversarial attacks. Applying Beaver-Guard-V moderation for 5 rounds of filtering and re-generation on the precursor model improves the overall safety of the upstream model by an average of 40.9%. Experimental results demonstrate that fine-tuning different MLLMs with Safe RLHF-V effectively enhances helpfulness while improving safety. Specifically, Safe RLHF-V improves model safety by 34.2% and helpfulness by 34.3%. All datasets, models, and code are available at https://github.com/SafeRLHF-V to support the safe development of MLLMs and reduce potential societal risks.
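To make the min-max formulation concrete, the following is a minimal sketch of Lagrangian-based constrained optimization on a toy scalar problem. It is not the Safe RLHF-V implementation: the functions `reward` and `cost`, the scalar parameter `theta`, the step sizes, and the helper `solve` are all hypothetical stand-ins chosen for illustration. A real system would replace them with learned multimodal reward and cost models and a policy-gradient update. The sketch minimizes `-reward(theta) + lam * cost(theta)` over `theta` while maximizing over the multiplier `lam >= 0`, which enforces the constraint `cost(theta) <= 0`.

```python
# Toy illustration of a Lagrangian min-max problem (all names are assumptions):
#   min over theta, max over lam >= 0 of  -reward(theta) + lam * cost(theta)


def reward(theta: float) -> float:
    """Stand-in helpfulness objective, maximized at theta = 2."""
    return -(theta - 2.0) ** 2


def cost(theta: float) -> float:
    """Stand-in safety cost; the constraint is cost(theta) <= 0, i.e. theta <= 1."""
    return theta - 1.0


def solve(steps: int = 2000, lr_theta: float = 0.05, lr_lambda: float = 0.05):
    """Simultaneous gradient descent on theta and projected ascent on lam."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the Lagrangian with respect to theta:
        # d/dtheta [ (theta - 2)^2 + lam * (theta - 1) ]
        grad_theta = 2.0 * (theta - 2.0) + lam
        # Gradient with respect to lam is the constraint violation itself.
        grad_lam = cost(theta)
        theta -= lr_theta * grad_theta
        # Projection keeps the multiplier non-negative.
        lam = max(0.0, lam + lr_lambda * grad_lam)
    return theta, lam


if __name__ == "__main__":
    theta, lam = solve()
    print(f"theta={theta:.4f}, lambda={lam:.4f}, cost={cost(theta):.4f}")
```

In this toy setting the unconstrained optimum (theta = 2) violates the constraint, so the multiplier grows until the iterate settles on the constraint boundary near theta = 1 with lam near 2. The multiplier acts as an adaptive penalty weight, which is the general reason a Lagrangian treatment can trade off helpfulness against safety without a hand-tuned coefficient.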