
[Polish black tech, continuously updated] Small-Space Multiple-Pattern Matching



Thanks to lct1999 for translating this together with me and for all the help.

Claris recommended this Polish black-tech paper to me yesterday. It is mainly about how to use hashing to solve the kinds of problems an Aho-Corasick automaton can solve, so for the flourishing of the black-tech cause I am translating it. Progress may be very, very slow... continuously updated online.
There may be plenty of badly translated passages... many places may be rather literal... this is only meant as a reference for reading the paper.
I did not bother polishing the word order.
If you have a better translation for some passage, please contact me.
The layout does not strictly follow the original paper...

Section 1. Introduction

Multiple-pattern matching, i.e., locating the occurrences of s patterns of total length m in a text of length n, is a fundamental problem in the field of string algorithms.
The classical algorithm of Aho and Corasick solves this problem in O(n+m) time and O(m) working space (on top of the space used to store the text and the patterns).
To list all occurrences (occ of them in total), rather than e.g. only the leftmost occurrence of each pattern, an extra O(occ) time is necessary.
When space is very limited, we can use a compressed Aho-Corasick automaton (perhaps this means storing the transition edges in STL containers or persistent segment trees).
In the extreme case, we can run a linear-time, constant-space single-pattern matching algorithm for each pattern in turn, which increases the running time to O(n·s+m).
Well-known examples of such algorithms are those of Galil and Seiferas [8], Crochemore and Perrin [5], and Karp and Rabin [13].

Multiple-pattern matching, the task of locating the occurrences of s
patterns of total length m in a single text of length n, is a
fundamental problem in the field of string algorithms. The algorithm
by Aho and Corasick [2] solves this problem using O(n+m) time and O(m)
working space in addition to the space needed for the text and
patterns. To list all occ occurrences rather than, e.g., the leftmost
ones, extra O(occ) time is necessary. When the space is limited, we
can use a compressed Aho-Corasick automaton [11]. In extreme cases,
one could apply a linear-time constant-space single-pattern matching
algorithm sequentially for each pattern in turn, at the cost of
increasing the running time to O(n · s + m). Well-known examples of
such algorithms include those by Galil and Seiferas [8], Crochemore
and Perrin [5], and Karp and Rabin [13] (see [3] for a recent survey).

If all the given patterns have the same length, it is easy to generalize the Karp-Rabin algorithm to the multiple-pattern setting, matching in O(n+m) expected time and O(s) working space. To implement it, we store the fingerprints of all patterns in a hash table, then scan the text with a sliding window (for the notion of a sliding window, see e.g. the material on LZ77 coding), maintaining the fingerprint of the text fragment currently inside the window. Using the hash table we can check whether the current fragment matches some pattern; if it does, we report the match position and update the hash table so that every pattern is reported at most once. The idea is simple and is actually used in practice, but it is not clear how to extend it to patterns of distinct lengths; that remains open for now. In this paper we develop a dictionary matching algorithm that works for an arbitrary set of patterns in O(n log n + m) time and O(s) working space, under the assumption that we have read-only random access to the text and the patterns.
If required, we can also compute, for every pattern, the length of its longest prefix occurring in the text, again in O(n log n + m) time and O(s) space.

It is easy to generalize Karp-Rabin matching to handle multiple patterns in O(n + m) expected time and
O(s) working space provided that all patterns are of the same length [10]. To do this, we store the fingerprints
of the patterns in a hash table, and then slide a window over the text maintaining the fingerprint of the
fragment currently in the window. The hash table lets us check if the fragment is an occurrence of a pattern.
If so, we report it and update the hash table so that every pattern is returned at most once. This is a very
simple and actually applied idea [1], but it is not clear how to extend it for patterns with many distinct
lengths. In this paper we develop a dictionary matching algorithm which works for any set of patterns in
O(n log n + m) time and O(s) working space, assuming that read-only random access to the text and the
patterns is available. If required, we can compute for every pattern its longest prefix occurring in the text,
also in O(n log n + m) time and O(s) working space.
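
To make the equal-length scheme above concrete, here is a minimal C++ sketch (mine, not from the paper): std::unordered_map stands in for the hash table (Theorem 1 below replaces it with a deterministic dictionary), a single random base x modulo the Mersenne prime 2^61 - 1 is an illustrative choice, and the strings in main are made up. The window fingerprint is maintained in O(1) per shift using a precomputed x^{-1}, matching the fingerprint definition of Sect. 2.1.

    // Sketch, not the paper's final algorithm: equal-length multi-pattern
    // Karp-Rabin matching with a sliding window. A single modulus means
    // false positives are possible with (small) probability.
    #include <bits/stdc++.h>
    using namespace std;
    typedef unsigned long long u64;

    const u64 MOD = (1ULL << 61) - 1;             // large Mersenne prime modulus

    u64 mulmod(u64 a, u64 b) { return (u64)((__uint128_t)a * b % MOD); }
    u64 powmod(u64 a, u64 e) {
        u64 r = 1;
        for (; e; e >>= 1, a = mulmod(a, a)) if (e & 1) r = mulmod(r, a);
        return r;
    }

    // Phi(w) = w[0] + w[1]*x + ... + w[k-1]*x^{k-1} mod p, as in Sect. 2.1
    u64 fingerprint(const string& w, u64 x) {
        u64 phi = 0, pw = 1;
        for (unsigned char c : w) { phi = (phi + mulmod(c, pw)) % MOD; pw = mulmod(pw, x); }
        return phi;
    }

    int main() {
        string T = "abracadabra";                     // example text
        vector<string> P = {"abra", "cada", "brac"};  // patterns, all of length l
        size_t l = P[0].size(), s = P.size();

        mt19937_64 rng(20160422);
        u64 x = rng() % (MOD - 2) + 2;                // random base

        unordered_map<u64, vector<int>> dict;         // fingerprint -> pattern ids
        for (int j = 0; j < (int)s; ++j) dict[fingerprint(P[j], x)].push_back(j);

        vector<long long> leftmost(s, -1);
        if (T.size() >= l) {
            u64 xinv = powmod(x, MOD - 2);            // x^{-1} mod p (p prime)
            u64 xtop = powmod(x, l - 1);              // x^{l-1}
            u64 phi  = fingerprint(T.substr(0, l), x);
            for (size_t i = 0; i + l <= T.size(); ++i) {
                auto it = dict.find(phi);
                if (it != dict.end())
                    for (int j : it->second)
                        if (leftmost[j] < 0) leftmost[j] = i;  // leftmost occurrence
                if (i + l < T.size()) {               // slide window by one position:
                    phi = mulmod((phi + MOD - (unsigned char)T[i]) % MOD, xinv);
                    phi = (phi + mulmod((unsigned char)T[i + l], xtop)) % MOD;
                }
            }
        }
        for (size_t j = 0; j < s; ++j)
            cout << P[j] << " first occurs at position " << leftmost[j] << "\n";
    }

Running it prints the leftmost 0-based occurrence of each pattern (abra at 0, brac at 1, cada at 4).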

A very recent independent work of Clifford et al. gives a dictionary matching algorithm in the streaming model (what is the streaming model..? see the next sentence).
There, the patterns and later the text are read only once (as opposed to read-only random access), and an occurrence has to be reported immediately after its last character is read.
Their algorithm uses O(s log ℓ) space and O(log log(s+ℓ)) time per character, where ℓ is the length of the longest pattern (m/s ≤ ℓ ≤ m). Although several ideas in the two results are similar, one should note that the streaming model and the read-only model are quite different. In particular, in the streaming model, computing for every pattern its longest prefix occurring in the text requires Ω(m log min(n, |Σ|)) bits of space (Σ being the alphabet, of course), in contrast to the O(s) working space our solution achieves in the read-only model.

In a very recent independent work Clifford et al. [4] gave a dictionary matching algorithm in the streaming
model. In this setting the patterns and later the text are scanned once only (as opposed to read-only random
access) and an occurrence needs to be reported immediately after its last character is read. Their algorithm
uses O(s log ℓ) space and takes O(log log(s + ℓ)) time per character, where ℓ is the length of the longest
pattern (m/s ≤ ℓ ≤ m). Even though some of the ideas used in both results are similar, one should note that
the streaming and read-only models are quite different. In particular, computing the longest prefix occurring
in the text for every pattern requires Ω(m log min(n, |Σ|)) bits of space in the streaming model, as opposed
to the O(s) working space achieved by our solution in the read-only setting.

As a prime application of our dictionary matching algorithm, we show how to approximate the LZ77 parse of a text of length n using working space proportional to the number of phrases (again, we assume read-only random access to the text). Computing the LZ77 parse in small space is an issue of high importance, with space being a frequent bottleneck of today's systems. Moreover, LZ77 is useful not only for data compression, but also as a tool for speeding up algorithms. We present a general approximation algorithm that works in O(z) space on inputs whose LZ77 parse consists of z phrases.
For any ε ∈ (0,1], the algorithm can be used to produce a parse consisting of (1+ε)z phrases in O(ε⁻¹n log n) time.
To the best of our knowledge, approximating the LZ77 factorization in small space has not been considered before, and our algorithm
is significantly more efficient than the methods that produce the exact answer. A recent sublinear-space algorithm of Kärkkäinen et al. runs in O(nd) time and O(n/d) space, for an arbitrary parameter d. An earlier online algorithm of Gąsieniec et al. uses O(z) space and O(z² log² z) time per appended character.
Other earlier methods use significantly more space whenever the parse is small relative to n; see [7] for a recent discussion.

As a prime application of our dictionary matching algorithm, we show how to approximate the Lempel-
Ziv 77 (LZ77) parse [18] of a text of length n using working space proportional to the number of phrases
(again, we assume read-only random access to the text). Computing the LZ77 parse in small space is an
issue of high importance, with space being a frequent bottleneck of today's systems. Moreover, LZ77 is
useful not only for data compression, but also as a way to speed up algorithms [15]. We present a general
approximation algorithm working in O(z) space for inputs admitting LZ77 parsing with z phrases. For any
ε ∈ (0,1], the algorithm can be used to produce a parse consisting of (1 + ε)z phrases in O(ε⁻¹n log n) time.
To the best of our knowledge, approximating LZ77 factorization in small space has not been considered
before, and our algorithm is significantly more efficient than methods producing the exact answer. A recent
sublinear-space algorithm, due to Kärkkäinen et al. [12], runs in O(nd) time and uses O(n/d) space, for any
parameter d. An earlier online solution by Gąsieniec et al. [9] uses O(z) space and takes O(z² log² z) time
for each character appended. Other previous methods use significantly more space when the parse is small
relative to n; see [7] for a recent discussion.
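
For readers unfamiliar with LZ77, the following naive sketch (mine, not the paper's algorithm) shows what a parse into phrases is: each phrase is either a fresh character or the longest prefix of the remaining suffix that already occurs starting at an earlier position (the earlier occurrence may overlap the phrase). It runs in quadratic time with full random access; the number z of phrases it outputs is exactly the quantity that the paper approximates within a factor 1 + ε using only O(z) space.

    #include <bits/stdc++.h>
    using namespace std;

    // Naive greedy LZ77 factorization, for illustration only; quadratic time
    // and full random access, NOT the paper's small-space algorithm.
    vector<tuple<int,int,char>> lz77(const string& w) {
        vector<tuple<int,int,char>> phrases;   // (source position, length, literal)
        int n = w.size(), i = 0;
        while (i < n) {
            int bestLen = 0, bestSrc = -1;
            for (int j = 0; j < i; ++j) {      // try every earlier starting point
                int k = 0;
                while (i + k < n && w[j + k] == w[i + k]) ++k;  // may overlap past i
                if (k > bestLen) { bestLen = k; bestSrc = j; }
            }
            if (bestLen == 0) { phrases.emplace_back(-1, 0, w[i]); i += 1; } // literal
            else { phrases.emplace_back(bestSrc, bestLen, '\0'); i += bestLen; }
        }
        return phrases;
    }

    int main() {
        for (auto [src, len, c] : lz77("ababaab")) {
            if (len == 0) cout << "literal '" << c << "'\n";
            else cout << "copy " << len << " chars from position " << src << "\n";
        }
    }

On "ababaab" this prints four phrases: two literals (a, b), then a copy of length 3 from position 0, then a copy of length 2 from position 0, so z = 4.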

Structure of the paper. Section 2 introduces terminology and recalls several known concepts. This is followed by the description of our dictionary matching algorithm: in Section 3 we show how to process patterns of length at most s, and in Section 4 we handle the longer patterns, with different procedures for the repetitive and the non-repetitive ones. In Section 5 we extend the algorithm so that it computes, for every pattern, the longest prefix occurring in the text. Finally, in Section 7 we use the dictionary matching algorithm to construct an approximate LZ77 parse, while in Section 6 we explain how to modify the algorithms to make them Las Vegas (a Las Vegas algorithm is one whose answer is always correct while its running time is random; the name has nothing to do with the city here).

Structure of the paper. Sect. 2 introduces terminology and recalls several known concepts. This is
followed by the description of our dictionary matching algorithm. In Sect. 3 we show how to process
patterns of length at most s and in Sect. 4 we handle longer patterns, with different procedures for repetitive
and non-repetitive ones. In Sect. 5 we extend the algorithm to compute, for every pattern, the longest
prefix occurring in the text. Finally, in Sect. 7, we apply the dictionary matching algorithm to construct an
approximation of the LZ77 parsing, and in Sect. 6 we explain how to modify the algorithms to make them
Las Vegas.

Model of computation. Our algorithms are designed for the word-RAM with Ω(log n)-bit machine words, and we assume an integer alphabet of polynomial size. The use of Karp-Rabin fingerprints makes the algorithms Monte Carlo randomized: they return the correct answer with high probability, i.e., the error probability is inverse-polynomial in the input size, and the degree of that polynomial can be made arbitrarily large. With some additional effort, our algorithms can be turned into Las Vegas randomized ones, where the answer is always correct and the time bounds hold with high probability. Throughout the paper we assume read-only random access to the text and the patterns, and we do not count their sizes when measuring space consumption.

Model of computation. Our algorithms are designed for the word-RAM with Ω(log n)-bit words and
assume integer alphabet of polynomial size. The usage of Karp-Rabin fingerprints makes them Monte
Carlo randomized: the correct answer is returned with high probability, i.e., the error probability is inverse
polynomial with respect to input size, where the degree of the polynomial can be set arbitrarily large. With
some additional effort, our algorithms can be turned into Las Vegas randomized, where the answer is always
correct and the time bounds hold with high probability. Throughout the whole paper, we assume read
only random access to the text and the patterns, and we do not include their sizes while measuring space
consumption.

Section 2. Preliminaries

We consider finite words over an integer alphabet Σ = {0, …, σ−1}, where σ = poly(n+m). For a word w = w[1]…w[n] ∈ Σ^n we define the length of w as |w| = n. For 1 ≤ i ≤ j ≤ n, the word u = w[i]…w[j] is called a subword of w. By w[i..j] we denote the occurrence of u at position i, called a fragment of w. A fragment with left endpoint i = 1 is called a prefix, and a fragment with right endpoint j = n is called a suffix.

We consider finite words over an integer alphabet Σ = {0, …, σ−1}, where σ = poly(n + m). For a word
w = w[1]…w[n] ∈ Σ^n, we define the length of w as |w| = n. For 1 ≤ i ≤ j ≤ n, a word u = w[i]…w[j]
is called a subword of w. By w[i..j] we denote the occurrence of u at position i, called a fragment of w. A
fragment with i = 1 is called a prefix and a fragment with j = n is called a suffix.

A positive integer p is a period of the word w if w[i] = w[i+p] for all i = 1, 2, …, |w|−p. In that case the prefix w[1..p] is also often called a period of w. The length of the shortest period of w is denoted per(w). We call a word w periodic if per(w) ≤ |w|/2 and highly periodic if per(w) ≤ |w|/3. The well-known periodicity lemma [6] states that if p and q are both periods of w and p + q ≤ |w|, then gcd(p,q) is also a period of w. We say that w is primitive if per(w) is not a proper divisor of |w| (equivalently, w is not the concatenation of two or more copies of a shorter word); note that the shortest period w[1..per(w)] is always primitive.

A positive integer p is called a period of w whenever w[i] = w[i + p] for all i = 1, 2, …, |w| − p. In this
case, the prefix w[1..p] is often also called a period of w. The length of the shortest period of a word w is
denoted as per(w). A word w is called periodic if per(w) ≤ |w|/2 and highly periodic if per(w) ≤ |w|/3. The
well-known periodicity lemma [6] says that if p and q are both periods of w, and p + q ≤ |w|, then gcd(p, q)
is also a period of w. We say that word w is primitive if per(w) is not a proper divisor of |w|. Note that the
shortest period w[1..per(w)] is always primitive.
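
A tiny brute-force illustration of these definitions (mine, not from the paper): compute per(w) directly from the definition and classify the word as periodic, highly periodic, and/or primitive.

    #include <bits/stdc++.h>
    using namespace std;

    // Smallest period per(w): the least p >= 1 with w[i] == w[i+p] for all
    // valid i. Naive O(n^2) scan, purely to illustrate the definitions.
    int per(const string& w) {
        int n = w.size();
        for (int p = 1; p < n; ++p) {
            bool ok = true;
            for (int i = 0; i + p < n; ++i)
                if (w[i] != w[i + p]) { ok = false; break; }
            if (ok) return p;
        }
        return n;                          // every word is a period of itself
    }

    int main() {
        for (string w : {"abaababaab", "aabaabaa", "abcabcabc"}) {
            int p = per(w), n = w.size();
            cout << w << ": per = " << p
                 << (2 * p <= n ? ", periodic" : "")
                 << (3 * p <= n ? ", highly periodic" : "")
                 << (n % p != 0 || p == n ? ", primitive" : "") << "\n";
        }
    }

For example, "abaababaab" has per = 5, so it is periodic but, since 5 divides 10, it is not primitive (it equals (abaab)^2); "aabaabaa" has per = 3 and is periodic and primitive; "abcabcabc" has per = 3 and is highly periodic but not primitive.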

2.1 Fingerprints

Our randomized construction is based on Karp-Rabin fingerprints; see [13]. Fix a word w[1..n] over the alphabet Σ = {0, …, σ−1}, a constant c ≥ 1, a prime p > max(σ, n^{c+4}), and choose x ∈ Z_p uniformly at random. We define the fingerprint of a subword w[i..j] as Φ(w[i..j]) = (w[i] + w[i+1]·x + ⋯ + w[j]·x^{j−i}) mod p. With probability at least 1 − 1/n^c, no two distinct subwords of the same length have equal fingerprints. The event that this happens for some two subwords is called a false positive. From now on, when stating results, we assume that no false positives occur, to avoid repeating every time that the answers are only correct with high probability rather than with certainty.
For dictionary matching, we assume that no two distinct subwords of w = T P_1 P_2 … P_s (the concatenation of the text and all the patterns) have equal fingerprints. Fingerprints make it easy to locate many patterns of the same length.
The introduction described a straightforward solution that builds a hash table of the patterns' fingerprints.
However, we can only guarantee that the hash table is constructed correctly with probability 1 − O(1/s^c) (for an arbitrary constant c), while we want to bound the error probability by O(1/(n+m)^c). Hence we replace the hash table with a deterministic dictionary, as explained below. Although this increases the time by O(s log s), the extra term is absorbed into the final complexities.

2.1 Fingerprints
Our randomized construction is based on Karp-Rabin fingerprints; see [13]. Fix a word w[1..n] over an
alphabet Σ = {0, …, σ−1}, a constant c ≥ 1, a prime number p > max(σ, n^{c+4}), and choose x ∈ Z_p uniformly
at random. We define the fingerprint of a subword w[i..j] as Φ(w[i..j]) = w[i] + w[i+1]x + ⋯ + w[j]x^{j−i} mod p.
With probability at least 1 − 1/n^c, no two distinct subwords of the same length have equal fingerprints. The
situation when this happens for some two subwords is called a false-positive. From now on when stating the
results we assume that there are no false-positives to avoid repeating that the answers are correct with high
probability. For dictionary matching, we assume that no two distinct subwords of w = T P_1 … P_s have equal
fingerprints. Fingerprints let us easily locate many patterns of the same length. A straightforward solution
described in the introduction builds a hash table mapping fingerprints to patterns. However, then we can
only guarantee that the hash table is constructed correctly with probability 1 − O(1/s^c) (for an arbitrary
constant c), and we would like to bound the error probability by O(1/(n+m)^c). Hence we replace the hash
table with a deterministic dictionary as explained below. Although it increases the time by O(s log s), the
extra term becomes absorbed in the final complexities.
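
As a sanity check of the definition (with illustrative parameters, not the paper's exact setup; the paper insists on p > max(σ, n^{c+4}) to drive the error below 1/n^c), the following sketch fingerprints every subword of a short word and counts false positives, i.e., distinct equal-length subwords with equal fingerprints.

    #include <bits/stdc++.h>
    using namespace std;
    typedef unsigned long long u64;
    typedef __uint128_t u128;

    // Phi(w[i..j]) = (w[i] + w[i+1]*x + ... + w[j]*x^{j-i}) mod p,
    // computed incrementally for every subword of a small word.
    int main() {
        const u64 p = (1ULL << 61) - 1;        // a large prime modulus
        mt19937_64 rng(12345);
        u64 x = rng() % (p - 2) + 2;           // x chosen at random from Z_p

        string w = "abracadabra";
        int n = w.size(), falsePositives = 0;
        map<pair<int,u64>, string> seen;       // (length, fingerprint) -> subword
        for (int i = 0; i < n; ++i) {
            u64 phi = 0, pw = 1;
            for (int j = i; j < n; ++j) {      // extend subword w[i..j] by one char
                phi = (u64)((phi + (u128)(unsigned char)w[j] * pw) % p);
                pw = (u64)((u128)pw * x % p);
                auto key = make_pair(j - i + 1, phi);
                auto it = seen.find(key);
                if (it == seen.end()) seen[key] = w.substr(i, j - i + 1);
                else if (it->second != w.substr(i, j - i + 1)) ++falsePositives;
            }
        }
        cout << "false positives among equal-length subwords: "
             << falsePositives << "\n";        // 0 with high probability
    }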

Theorem 1

Given a text T of length n and patterns P_1, …, P_s, each of length exactly ℓ, we can compute the leftmost occurrence in T of every pattern P_i using O(n + sℓ + s log s) total time and O(s) space.

Theorem 1. Given a text T of length n and patterns P_1, …, P_s, each of length exactly ℓ, we can compute
the leftmost occurrence of every pattern P_i in T using O(n + sℓ + s log s) total time and O(s) space.

Proof

We compute the fingerprint Φ(P_j) of every pattern. Then, in O(s log s) time, we build a deterministic dictionary D mapping each Φ(P_j) to j (a hash-table-like structure with deterministic guarantees). For multiple identical patterns we create just one entry, and at the end we copy the answers to all copies of the pattern.
Then we scan the text T with a sliding window of length ℓ while maintaining the fingerprint Φ(T[i..i+ℓ−1]) of the current window. Using D we can find, in O(1) time, an index j such that Φ(T[i..i+ℓ−1]) = Φ(P_j), if there is one, and update the answer for P_j if needed (i.e., if no occurrence of P_j was recorded before). If we precompute x⁻¹, the fingerprint can be updated in O(1) time as i grows, since Φ(T[i+1..i+ℓ]) = (Φ(T[i..i+ℓ−1]) − T[i])·x⁻¹ + T[i+ℓ]·x^{ℓ−1} mod p.

Proof. We calculate the fingerprint Φ(P_j) of every pattern. Then we build in O(s log s) time [16] a
deterministic dictionary D with an entry mapping Φ(P_j) to j. For multiple identical patterns we create just
one entry, and at the end we copy the answers to all instances of the pattern. Then we scan the text T
with a sliding window of length ℓ while maintaining the fingerprint Φ(T[i..i+ℓ−1]) of the current window.
Using D, we can find in O(1) time an index j such that Φ(T[i..i+ℓ−1]) = Φ(P_j), if any, and update the
answer for P_j if needed (i.e., if there was no occurrence of P_j before). If we precompute x⁻¹, the fingerprints
Φ(T[i..i+ℓ−1]) can be updated in O(1) time while increasing i.

2.2 Tries

The trie of a collection of strings P_1, …, P_s is a rooted tree whose nodes correspond to prefixes of the strings.
The root represents the empty word, and each edge is labeled with a single character. The node corresponding to a particular prefix is called its locus. In a compacted trie, the unary nodes (nodes with exactly one child) that are not the locus of any pattern P_i are dissolved, and the labels of their incident edges are concatenated. The dissolved nodes are called implicit, in contrast to the explicit nodes, which remain stored; the locus of a string in a compacted trie may therefore be explicit or implicit. All edges leaving a given node are kept on a list sorted by their first character, which is distinct among these edges. The edge labels are stored as pointers to the corresponding fragments of the strings P_i, so a compacted trie fits in space proportional to the number of explicit nodes, which is O(s).
Consider two compacted tries T_1 and T_2. We say that (possibly implicit) nodes v_1 ∈ T_1 and v_2 ∈ T_2 are twins if they are loci of the same string; every v_1 ∈ T_1 has at most one twin v_2 ∈ T_2.

A trie of a collection of strings P1, … , Ps is a rooted tree whose nodes correspond to prefixes of the strings.
The root represents the empty word and the edges are labeled with single characters. The node corresponding
to a particular prefix is called its locus. In a compacted trie unary nodes that do not represent any Pi are
dissolved and the labels of their incident edges are concatenated. The dissolved nodes are called implicit as
opposed to the explicit nodes, which remain stored. The locus of a string in a compacted trie might therefore
be explicit or implicit. All edges outgoing from the same node are stored on a list sorted according to the
first character, which is unique among these edges. The labels of edges of a compacted trie are stored as
pointers to the respective fragments of strings Pi. Consequently, a compacted trie can be stored in space
proportional to the number of explicit nodes, which is O(s).
Consider two compacted tries T1 and T2. We say that (possibly implicit) nodes v1 ∈ T1 and v2 ∈ T2 are
twins if they are loci of the same string. Note that every v1 ∈ T1 has at most one twin v2 ∈ T2.
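
The O(s)-space representation described above can be sketched as follows (my illustration, with simplifications: edge lists are unsorted and searched linearly, whereas the paper keeps them sorted by first character). The key point is that an edge label is a reference (pattern id, start, end) into the read-only patterns, so each explicit node costs O(1) words.

    #include <bits/stdc++.h>
    using namespace std;

    // Compacted-trie sketch: edge labels are never copied, only referenced.
    struct Edge { int pat, beg, end, child; };     // label = P[pat][beg..end)

    struct Node {
        vector<Edge> edges;    // paper: sorted by (unique) first character
        int patternId;         // this node is the locus of P[patternId], or -1
    };

    struct CompactedTrie {
        const vector<string>& P;   // read-only access to the patterns
        vector<Node> nodes;        // nodes[0] is the root (empty word)

        CompactedTrie(const vector<string>& pats) : P(pats) { nodes.push_back({{}, -1}); }

        void insert(int pid) {
            const string& s = P[pid];
            int v = 0, i = 0;
            while (i < (int)s.size()) {
                int e = -1;                        // edge of v starting with s[i]
                for (int k = 0; k < (int)nodes[v].edges.size(); ++k)
                    if (P[nodes[v].edges[k].pat][nodes[v].edges[k].beg] == s[i]) { e = k; break; }
                if (e < 0) {                       // no such edge: attach the rest of s
                    nodes.push_back({{}, pid});
                    nodes[v].edges.push_back({pid, i, (int)s.size(), (int)nodes.size() - 1});
                    return;
                }
                int j = nodes[v].edges[e].beg;     // walk along the edge label
                while (j < nodes[v].edges[e].end && i < (int)s.size() &&
                       P[nodes[v].edges[e].pat][j] == s[i]) { ++j; ++i; }
                if (j == nodes[v].edges[e].end) { v = nodes[v].edges[e].child; continue; }
                // partial match: split the edge, making an implicit node explicit
                Edge old = nodes[v].edges[e];
                nodes.push_back({{{old.pat, j, old.end, old.child}}, -1});
                int mid = (int)nodes.size() - 1;
                nodes[v].edges[e].end = j;
                nodes[v].edges[e].child = mid;
                v = mid;
                if (i == (int)s.size()) { nodes[mid].patternId = pid; return; }
            }
            nodes[v].patternId = pid;              // s ended exactly at node v
        }
    };

    int main() {
        vector<string> P = {"abra", "abc", "ab"};
        CompactedTrie trie(P);
        for (int j = 0; j < (int)P.size(); ++j) trie.insert(j);
        cout << "explicit nodes: " << trie.nodes.size() << "\n";   // O(s) of them
    }

Inserting "abra", "abc", "ab" yields four explicit nodes: the root, the branching node for "ab" (which is also the locus of the pattern "ab"), and two leaves.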

Original article: http://blog.csdn.net/creationaugust/article/details/51203474
