Research

arXiv

PMANet: Malicious URL detection via post-trained language model guided multi-level feature attention network

Ruitong Liu, Yanbin Wang, Haitao Xu, Zhan Qin, Fan Zhang, Yiwei Liu, Zheng Cao

Abstract

The proliferation of malicious URLs has made their detection a critical task for network security. Pre-trained language models are promising for this task, but existing methods fall short in three areas: domain-specific adaptation, handling of character-level information, and integration of local and global encodings. To address these challenges, we propose PMANet, a post-trained language model-guided multi-level feature attention network. PMANet applies a post-training process with three self-supervised objectives (masked language modeling, noisy language modeling, and domain discrimination) to capture both subword-level and character-level information. It also includes a hierarchical representation module and a dynamic layer-wise attention mechanism that extract features from low to high levels, and it uses spatial pyramid pooling to integrate local and global features. Experiments across diverse scenarios, including small-scale data, class imbalance, and adversarial attacks, show that PMANet outperforms state-of-the-art models, achieving an AUC of 0.9941 and correctly detecting all 20 malicious URLs in a case study. Code and data are available at https://github.com/Alixyvtte/Malicious-URL-Detection-PMANet.
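
To make the role of spatial pyramid pooling more concrete, the following is a minimal sketch of a one-dimensional variant applied to a sequence of token feature vectors. It is not taken from the PMANet repository: the function name `spatial_pyramid_pool_1d`, the pyramid levels `(1, 2, 4)`, and the choice of max pooling are assumptions made for illustration only. The idea it shows is that coarse bins summarize the whole URL (global features), fine bins summarize segments of it (local features), and the concatenated output has the same length regardless of URL length.

```python
def spatial_pyramid_pool_1d(features, levels=(1, 2, 4)):
    """Max-pool a variable-length sequence into a fixed-length vector.

    features: list of token vectors (lists of floats, all the same width).
    levels:   number of bins at each pyramid level (assumed values,
              not the configuration used in the paper).
    """
    if not features:
        raise ValueError("features must contain at least one token vector")
    seq_len = len(features)
    width = len(features[0])
    pooled = []
    for bins in levels:
        for b in range(bins):
            # Adaptive bin boundaries: floor for the start, ceiling for the
            # end, so every bin covers at least one token even when the
            # sequence is shorter than the number of bins.
            start = (b * seq_len) // bins
            end = ((b + 1) * seq_len + bins - 1) // bins
            window = features[start:end]
            # Max over the tokens in this bin, one value per feature dimension.
            pooled.extend(max(vec[d] for vec in window) for d in range(width))
    return pooled


# Hypothetical usage: two inputs of different lengths, each token a 2-d vector.
short_url = [[1.0, 2.0]]
long_url = [[float(i), float(-i)] for i in range(10)]
short_vec = spatial_pyramid_pool_1d(short_url)
long_vec = spatial_pyramid_pool_1d(long_url)
```

With the assumed levels, both outputs have length `width * (1 + 2 + 4)`, so a downstream classifier can accept URLs of any length without truncation or padding at this stage.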