更快速且简洁的字符串网络频率在线计算方法

Faster and Simpler Online Computation of String Net Frequency

摘要 Abstract

字符串 $S$ 中重复子串 $u$ 的出现被称为网络出现(net occurrence),如果向左或向右扩展该出现会使出现次数减少为 1。字符串 $S$ 中重复子串 $u$ 的网络频率(NF)定义为其在 $S$ 中的所有网络出现次数。最近,Guo 等人 [SPIRE 2024] 提出了一种在线 $O(n \log \sigma)$ 时间算法,维护了一个 $O(n)$ 空间的数据结构,可以在 $O(m \log \sigma + \sigma^2)$ 时间内回答单次 NF 查询,并在 $O(n\sigma^2)$ 时间内报告 All-NF 问题的所有答案。这里,$n$ 是输入字符串 $S$ 的长度,$m$ 是查询模式的长度,$\sigma$ 是字母表的大小。$\sigma^2$ 是其方法的主要缺点,因为计算字符串网络频率最初是为中国语言处理设计的,而其中 $\sigma$ 可能非常大。本文提出了一种改进的在线 $O(n \log \sigma)$ 时间算法,单次 NF 查询的时间复杂度降为 $O(m \log \sigma)$,并在输出最优的 $O(|\mathsf{NF}^+(S)|)$ 时间内报告 All-NF 问题的所有答案,其中 $\mathsf{NF}^+(S)$ 是字符串 $S$ 中具有正 NF 值的子串集合。我们注意到,$|\mathsf{NF}^+(S)| = O(n)$ 始终成立。与基于 Ukkonen 后缀树构造的 Guo 等人的算法不同,我们的算法基于 Weiner 后缀树构造。

An occurrence of a repeated substring $u$ in a string $S$ is called a net occurrence if extending the occurrence to the left or to the right decreases the number of occurrences to 1. The net frequency (NF) of a repeated substring $u$ in a string $S$ is the number of net occurrences of $u$ in $S$. Very recently, Guo et al. [SPIRE 2024] proposed an online $O(n \log \sigma)$-time algorithm that maintains a data structure of $O(n)$ space which answers Single-NF queries in $O(m\log \sigma + \sigma^2)$ time and reports all answers of the All-NF problem in $O(n\sigma^2)$ time. Here, $n$ is the length of the input string $S$, $m$ is the query pattern length, and $\sigma$ is the alphabet size. The $\sigma^2$ term is a major drawback of their method since computing string net frequencies is originally motivated for Chinese language processing where $\sigma$ can be thousands large. This paper presents an improved online $O(n \log \sigma)$-time algorithm, which answers Single-NF queries in $O(m \log \sigma)$ time and reports all answers to the All-NF problem in output-optimal $O(|\mathsf{NF}^+(S)|)$ time, where $\mathsf{NF}^+(S)$ is the set of substrings of $S$ paired with their positive NF values. We note that $|\mathsf{NF}^+(S)| = O(n)$ always holds. In contract to Guo et al.'s algorithm that is based on Ukkonen's suffix tree construction, our algorithm is based on Weiner's suffix tree construction.

更快速且简洁的字符串网络频率在线计算方法 - arXiv