Persistence of Backdoor-based Watermarks for Neural Networks: A Comprehensive Evaluation

arXiv

Anh Tu Ngo, Chuan Song Heng, Nandish Chattopadhyay, Anupam Chattopadhyay

Abstract

Deep neural networks (DNNs) have gained considerable traction in recent years owing to the unparalleled results they have achieved. However, training such sophisticated models is resource-intensive, which leads many to regard DNNs as the intellectual property (IP) of their owners. In this era of cloud computing, high-performance DNNs are often deployed across the internet for public access. As a result, DNN watermarking schemes, especially backdoor-based watermarks, have been actively developed in recent years to protect proprietary rights. Nonetheless, much uncertainty remains about the robustness of existing backdoor watermark schemes against both adversarial attacks and unintended modifications such as fine-tuning. One reason is that robustness cannot be fully guaranteed for backdoor-based watermarks. In this paper, we extensively evaluate the persistence of recent backdoor-based watermarks in neural networks under fine-tuning, and we propose a novel data-driven idea for restoring the watermark after fine-tuning without exposing the trigger set. Our empirical results show that solely by introducing training data after fine-tuning, the watermark can be restored, provided the model parameters do not shift dramatically during fine-tuning. Depending on the type of trigger samples used, trigger accuracy can be restored to as high as 100%. Our study further explores how the restoration process works using loss landscape visualization, as well as the idea of introducing training data during the fine-tuning stage to alleviate watermark vanishing.
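The abstract reports results in terms of trigger accuracy. As a rough illustration only, the sketch below assumes the common definition: the fraction of trigger samples that a model still maps to their assigned watermark labels. The function name `trigger_accuracy`, the toy trigger set, and the two stand-in classifiers are hypothetical and are not taken from the paper.

```python
from typing import Callable, Hashable, Sequence, Tuple


def trigger_accuracy(
    predict: Callable[[Hashable], int],
    trigger_set: Sequence[Tuple[Hashable, int]],
) -> float:
    """Fraction of trigger samples classified as their watermark label.

    Assumption: each entry of trigger_set is a (sample, watermark_label) pair.
    """
    if not trigger_set:
        raise ValueError("trigger_set must not be empty")
    hits = sum(1 for sample, label in trigger_set if predict(sample) == label)
    return hits / len(trigger_set)


# Hypothetical trigger set: string IDs stand in for real trigger inputs.
toy_trigger_set = [("t1", 3), ("t2", 7), ("t3", 1), ("t4", 3)]

# Stand-in for a freshly watermarked model: it remembers every trigger label.
watermarked_labels = dict(toy_trigger_set)


def watermarked_model(sample: Hashable) -> int:
    return watermarked_labels[sample]


# Stand-in for a fine-tuned model that has forgotten two of the four triggers.
fine_tuned_labels = {"t1": 3, "t2": 0, "t3": 0, "t4": 3}


def fine_tuned_model(sample: Hashable) -> int:
    return fine_tuned_labels[sample]


accuracy_before = trigger_accuracy(watermarked_model, toy_trigger_set)
accuracy_after = trigger_accuracy(fine_tuned_model, toy_trigger_set)
```

In this toy setting `accuracy_before` is 1.0 and `accuracy_after` is 0.5. A restoration step of the kind the abstract describes would aim to raise the second value again without ever training on `toy_trigger_set` itself.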