Towards Operationalizing Heterogeneous Data Discovery

Abstract

Querying and exploring massive collections of data sources, such as data lakes, has long been an essential research topic in the database community. Although much effort has been devoted to data discovery and data integration in data lakes, prior work has mainly focused on the scenario where the data lake consists of structured tables. Real-world enterprise data lakes, however, are often more complex and may contain silos of multi-modal data sources with structured, semi-structured, and unstructured data. In this paper, we envision an end-to-end system with a declarative interface for querying and analyzing multi-modal data lakes. First, we propose a set of multi-modal operators, a unified interface that extends relational operations with AI-composed ones to express analytical workloads over data sources in various modalities. In addition, we formally define the essential steps in the system: data discovery, query planning, query processing, and result aggregation. On this basis, we pinpoint the research challenges in realizing and optimizing these steps and discuss the opportunities offered by advanced techniques based on Large Language Models. Finally, we present our preliminary attempts to address this problem and outline future plans for this research topic.