LIRA: 一种基于学习的查询感知分区框架用于大规模近似最近邻搜索
LIRA: A Learning-based Query-aware Partition Framework for Large-scale ANN Search
摘要 Abstract
近似最近邻搜索是信息检索中的基础问题。以往基于分区的方法通过探测部分分区来提升搜索效率,但面临两个常见问题。在查询阶段,一种常见的策略是根据查询点到分区质心的距离排名来选择分区进行探测,这种方法会不可避免地探测到不相关的分区,因为它忽略了数据分布。在分区构建阶段,所有基于分区的方法都面临边界问题,即将查询点的最近邻分离到多个分区中,导致kNN分布呈现长尾现象,从而降低最优nprobe(即探测分区的数量)的效果。为了解决这一问题,我们提出了LIRA,一种基于学习的查询感知分区框架。具体来说,我们提出了一种探测模型,可以直接探测包含查询点kNN的分区,这可以减少探测浪费并实现针对查询的个性化探测。此外,我们将探测模型与基于学习的冗余策略相结合,以减轻长尾kNN分布对搜索效率的不利影响。在真实向量数据集上的大量实验表明,LIRA在准确率、延迟和查询扩散之间的权衡上表现出优越性。代码已发布在https://github.com/SimoneZeng/LIRA-ANN-search。
Approximate nearest neighbor search is fundamental in information retrieval. Previous partition-based methods enhance search efficiency by probing partial partitions, yet they face two common issues. In the query phase, a common strategy is to probe partitions based on the distance ranks of a query to partition centroids, which inevitably probes irrelevant partitions as it ignores data distribution. In the partition construction phase, all partition-based methods face the boundary problem that separates a query's nearest neighbors to multiple partitions, resulting in a long-tailed kNN distribution and degrading the optimal nprobe (i.e., the number of probing partitions). To address this gap, we propose LIRA, a LearnIng-based queRy-aware pArtition framework. Specifically, we propose a probing model to directly probe the partitions containing the kNN of a query, which can reduce probing waste and allow for query-aware probing with nprobe individually. Moreover, we incorporate the probing model into a learning-based redundancy strategy to mitigate the adverse impact of the long-tailed kNN distribution on search efficiency. Extensive experiments on real-world vector datasets demonstrate the superiority of LIRA in the trade-off among accuracy, latency, and query fan-out. The codes are available at https://github.com/SimoneZeng/LIRA-ANN-search.