AL-Bench: A Benchmark for Automatic Logging

Abstract

Logging, the practice of inserting log statements into source code, is critical for improving software reliability. Recently, language model-based techniques have been developed to generate log statements automatically from input code. Although these tools show promising results in prior studies, the fairness of comparisons among their results is not guaranteed because of the use of ad hoc datasets. In addition, existing evaluation approaches that depend exclusively on code similarity metrics fail to capture the impact of code differences on runtime logging behavior, since minor code modifications can render a program uncompilable or cause substantial discrepancies in log output semantics. To enhance the consistency and reproducibility of logging evaluation, we introduce AL-Bench, a comprehensive benchmark designed specifically for automatic logging tools. AL-Bench includes a large-scale, high-quality, diverse dataset collected from 10 widely recognized projects with varying logging requirements. Moreover, it introduces a novel dynamic evaluation methodology that provides a runtime perspective on logging quality in addition to the traditional static evaluation at the source code level. Specifically, AL-Bench evaluates not only the similarity between the oracle and predicted log statements in source code, but also the difference between the log files that these two kinds of log statements print at runtime. AL-Bench reveals significant limitations in existing static evaluation: on average, the accuracy of the logging tools in predicting log position, level, and message is 37.49%, 23.43%, and 15.80% lower, respectively, than their reported results. Furthermore, with dynamic evaluation, AL-Bench reveals that 20.1%-83.6% of the generated log statements fail to compile. Moreover, the best-performing tool achieves only 21.32% cosine similarity between the log files produced by the oracle and the generated log statements.
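
To make the log-file comparison concrete, the following is a minimal sketch of a cosine similarity between two log files. The abstract does not say how AL-Bench turns log files into vectors, so the term-frequency (bag-of-words) representation, the tokenizer, and the names `tokenize`, `cosine_similarity`, `oracle_log`, and `predicted_log` are all illustrative assumptions rather than the benchmark's actual implementation.

```python
import math
import re
from collections import Counter


def tokenize(log_text: str) -> list[str]:
    """Lowercase the log text and split it into alphanumeric tokens.

    Assumption: a simple regex tokenizer; the benchmark may tokenize differently.
    """
    return re.findall(r"[a-z0-9]+", log_text.lower())


def cosine_similarity(log_a: str, log_b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two log files."""
    freq_a = Counter(tokenize(log_a))
    freq_b = Counter(tokenize(log_b))
    # Dot product over tokens in log_a; tokens missing from log_b count as 0.
    dot = sum(count * freq_b[token] for token, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    # An empty log file has no direction, so treat its similarity as 0.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical log files: the oracle statements print two lines at runtime,
# while the predicted statements print only the first one.
oracle_log = "INFO Connection opened to host db1\nWARN Retry 1 for host db1\n"
predicted_log = "INFO Connection opened to host db1\n"

if __name__ == "__main__":
    print(f"{cosine_similarity(oracle_log, predicted_log):.4f}")
```

In this sketch a generated log statement that fails to compile would produce no log file at all, which the empty-input branch scores as 0; whether the benchmark treats such cases the same way is not stated in the abstract.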