面向分布感知的数据集搜索的理论框架

A Theoretical Framework for Distribution-Aware Dataset Search

摘要 Abstract

有效数据发现是现代数据驱动决策制定的基础。然而,识别具有特定分布特性(如百分位或偏好)的数据集仍然充满挑战。尽管最近的研究已经使用户能够基于百分位谓词进行搜索,但大多数数据发现研究仍依赖于启发式方法。本文提出了首个有理论支持的统一框架,适用于集中式和去中心化环境下的数据发现。设$\mathcal{P}=\{P_1,...,P_N\}$为包含$N$个数据集的存储库,其中$P_i\subset \mathbb{R}^d$,且$d=O(1)$。我们研究了集中式和联邦环境下百分位索引(Ptile)问题和偏好索引(Pref)问题。在集中式设置下,假设可以直接访问数据集;在联邦设置下,假设可以访问每个数据集的概要信息。Ptile的目标是构建一个数据结构,使得给定谓词(矩形$R$和区间$\theta$),报告所有满足条件的索引集合$J$,即$j\in J$当且仅当$|P_j\cap R|/|P_j|\in\theta$。Pref的目标是构建一个数据结构,使得给定谓词(向量$v$和区间$\theta$),报告所有满足条件的索引集合$J$,即$j\in J$当且仅当$\omega(P_j,v)\in \theta$,其中$\omega(P_j,v)$是$P_j$在$v$上的第$k$大投影内积。首先,我们证明在集中式设置下无法期望接近线性的数据结构和对数多项式的查询时间。接着,我们展示了空间复杂度为$\tilde{O}(N)$的数据结构,可以在$\tilde{O}(1+OUT)$时间内回答Ptile和Pref查询,其中$OUT$是输出大小。每个数据结构返回一组索引$J$,满足:i) 对于每个满足谓词的$P_i$,都有$i\in J$;ii) 如果$j\in J$,则$P_j$满足谓词,误差为$\varepsilon+2\delta$,其中$\varepsilon\in(0,1)$且$\delta$是概要信息的误差。

Effective data discovery is a cornerstone of modern data-driven decision-making. Yet, identifying datasets with specific distributional characteristics, such as percentiles or preferences, remains challenging. While recent proposals have enabled users to search based on percentile predicates, much of the research in data discovery relies on heuristics. This paper presents the first theoretically backed framework that unifies data discovery under centralized and decentralized settings. Let $\mathcal{P}=\{P_1,...,P_N\}$ be a repository of $N$ datasets, where $P_i\subset \mathbb{R}^d$, for $d=O(1)$ . We study the percentile indexing (Ptile) problem and the preference indexing (Pref) problem under the centralized and the federated setting. In the centralized setting we assume direct access to the datasets. In the federated setting we assume access to a synopsis of each dataset. The goal of Ptile is to construct a data structure such that given a predicate (rectangle $R$ and interval $\theta$) report all indexes $J$ such that $j\in J$ iff $|P_j\cap R|/|P_j|\in\theta$. The goal of Pref is to construct a data structure such that given a predicate (vector $v$ and interval $\theta$) report all indexes $J$ such that $j\in J$ iff $\omega(P_j,v)\in \theta$, where $\omega(P_j,v)$ is the inner-product of the $k$-th largest projection of $P_j$ on $v$. We first show that we cannot hope for near-linear data structures with polylogarithmic query time in the centralized setting. Next we show $\tilde{O}(N)$ space data structures that answer Ptile and Pref queries in $\tilde{O}(1+OUT)$ time, where $OUT$ is the output size. Each data structure returns a set of indexes $J$ such that i) for every $P_i$ that satisfies the predicate, $i\in J$ and ii) if $j\in J$ then $P_j$ satisfies the predicate up to an additive error $\varepsilon+2\delta$, where $\varepsilon\in(0,1)$ and $\delta$ is the error of synopses.