大型语言模型在零样本漏洞检测中的推理应用

Research

arXiv

Reasoning with LLMs for Zero-Shot Vulnerability Detection

摘要 Abstract

在日益复杂且相互依赖的软件系统时代，自动化软件漏洞检测（SVD）仍然是一个关键挑战。尽管代码分析领域的大规模语言模型（LLMs）取得了显著进展，但现有的评估方法学往往缺乏必要的“上下文感知鲁棒性”，无法捕捉现实世界中的复杂性和跨组件交互。为了解决这些局限性，我们提出了VulnSage，这是一个全面的评估框架以及一个从C/C++开发的多样化大规模开源系统软件项目中精心策划的数据集。与现有数据集不同，它利用启发式噪声预过滤方法结合基于LLMs的推理，确保代表性和最小噪声的漏洞光谱。该框架支持函数级、文件级和函数间级别的多层次分析，并采用四种不同的零样本提示策略：Baseline、Chain-of-Thought、Think和Think & Verify。通过这项评估，我们发现结构化推理提示显著提升了LLMs的表现，其中Think & Verify将模糊响应从20.3%降低到9.1%，同时提高了准确性。我们进一步证明，专门针对代码的模型始终优于通用替代方案，其性能在不同类型的漏洞中存在显著差异，表明没有单一方法能够在所有安全上下文中普遍表现优异。数据集和代码链接：https://github.com/Erroristotle/VulnSage.git

Automating software vulnerability detection (SVD) remains a critical challenge in an era of increasingly complex and interdependent software systems. Despite significant advances in Large Language Models (LLMs) for code analysis, prevailing evaluation methodologies often lack the \textbf{context-aware robustness} necessary to capture real-world intricacies and cross-component interactions. To address these limitations, we present \textbf{VulnSage}, a comprehensive evaluation framework and a dataset curated from diverse, large-scale open-source system software projects developed in C/C++. Unlike prior datasets, it leverages a heuristic noise pre-filtering approach combined with LLM-based reasoning to ensure a representative and minimally noisy spectrum of vulnerabilities. The framework supports multi-granular analysis across function, file, and inter-function levels and employs four diverse zero-shot prompt strategies: Baseline, Chain-of-Thought, Think, and Think & Verify. Through this evaluation, we uncover that structured reasoning prompts substantially improve LLM performance, with Think & Verify reducing ambiguous responses from 20.3% to 9.1% while increasing accuracy. We further demonstrate that code-specialized models consistently outperform general-purpose alternatives, with performance varying significantly across vulnerability types, revealing that no single approach universally excels across all security contexts. Link to dataset and codes: https://github.com/Erroristotle/VulnSage.git