GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding
Abstract
Recently, Multimodal Large Language Models (MLLMs) have been used as agents that control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents demonstrate strong understanding mainly in static environments and are applied mostly to relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. It should also possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, GUI-World, which features meticulously crafted Human-MLLM annotations covering six GUI scenarios and eight types of GUI-oriented questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including Image LLMs and Video LLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that current models struggle with dynamic GUI content when they are not given manually annotated keyframes or operation history. Video LLMs, meanwhile, fall short on all GUI-oriented tasks, given the scarcity of GUI video datasets. We therefore take an initial step of fine-tuning a Video LLM, GUI-Vid, as a GUI-oriented assistant, which demonstrates improved understanding of various GUI tasks. However, due to the limited performance of base LLMs, we conclude that using Video LLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding. The dataset and code are publicly available at https://gui-world.github.io.