[Polish black magic, continuously updated] Small-Space Multiple-Pattern Matching
Thanks to lct1999, who translated this together with me and gave me a lot of help.

Yesterday the veteran Claris recommended this piece of Polish black magic to me: a paper about how to use hashing to solve the kinds of problems the Aho-Corasick automaton can solve. So, for the flourishing of the black-magic cause, I am translating it here. Progress may be very, very slow... it will keep being updated online.

There will probably be plenty of rough spots, and many passages may be fairly literal, word-for-word renderings (I didn't bother polishing the word order), so treat this only as a reference to accompany the paper. If you have better suggestions for any passage, please contact me. The translation does not strictly follow the layout of the original paper.
Multiple-pattern matching, the task of locating the occurrences of s patterns of total length m in a single text of length n, is a fundamental problem in the field of string algorithms. The algorithm by Aho and Corasick [2] solves this problem using O(n + m) time and O(m) working space in addition to the space needed for the text and patterns. To list all occ occurrences, rather than, e.g., only the leftmost ones, extra O(occ) time is necessary. When space is limited, we can use a compressed Aho-Corasick automaton [11]. In extreme cases, one could apply a linear-time constant-space single-pattern matching algorithm sequentially for each pattern in turn, at the cost of increasing the running time to O(n·s + m). Well-known examples of such algorithms include those by Galil and Seiferas [8], Crochemore and Perrin [5], and Karp and Rabin [13] (see [3] for a recent survey).
It is easy to generalize Karp-Rabin matching to handle multiple patterns in O(n + m) expected time and O(s) working space, provided that all patterns are of the same length [10]. To do this, we store the fingerprints of the patterns in a hash table, and then slide a window over the text while maintaining the fingerprint of the fragment currently in the window. The hash table lets us check whether the fragment is an occurrence of a pattern. If so, we report it and update the hash table so that every pattern is returned at most once. This is a very simple idea that is actually applied in practice [1], but it is not clear how to extend it to patterns with many distinct lengths. In this paper we develop a dictionary matching algorithm which works for any set of patterns in O(n log n + m) time and O(s) working space, assuming that read-only random access to the text and the patterns is available. If required, we can compute for every pattern its longest prefix occurring in the text, also in O(n log n + m) time and O(s) working space.
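To make the equal-length idea concrete, here is a minimal C++ sketch (names and constants are mine, not from the paper; the paper draws a random x ∈ Z_p with a much larger prime, and replaces the hash table by a deterministic dictionary, see Sect. 2.1). For simplicity, the rolling hash below keeps the highest power of the base on the leftmost character, so the window can slide without modular inverses; the paper's own convention uses x⁻¹ instead (see the proof of Theorem 1).

```cpp
#include <cstdint>
#include <string>
#include <vector>
#include <unordered_map>

// Toy parameters for illustration only; with the paper's large random
// parameters, equal fingerprints mean equal strings with high probability.
static const uint64_t P = 1000000007ULL;  // prime modulus
static const uint64_t X = 131;            // base (random in Z_p in the paper)

// Leftmost occurrence of each pattern; all patterns must have the same
// (positive) length, and pats must be non-empty.
std::vector<long long> leftmostOccurrences(const std::string& text,
                                           const std::vector<std::string>& pats) {
    const size_t len = pats[0].size();
    std::unordered_map<uint64_t, std::vector<size_t>> table;  // fingerprint -> ids

    auto hashOf = [&](const std::string& s) {    // hash of s[0..len-1]
        uint64_t h = 0;
        for (size_t k = 0; k < len; ++k) h = (h * X + (unsigned char)s[k]) % P;
        return h;
    };
    for (size_t j = 0; j < pats.size(); ++j)
        table[hashOf(pats[j])].push_back(j);     // identical patterns share an entry

    std::vector<long long> ans(pats.size(), -1);
    if (text.size() < len) return ans;

    uint64_t xTop = 1;                           // X^(len-1) mod P
    for (size_t k = 1; k < len; ++k) xTop = xTop * X % P;

    uint64_t h = 0;                              // hash of the first window
    for (size_t i = 0; i < len; ++i) h = (h * X + (unsigned char)text[i]) % P;

    for (size_t i = 0; ; ++i) {                  // window = text[i..i+len-1]
        auto it = table.find(h);
        if (it != table.end()) {
            for (size_t j : it->second) ans[j] = (long long)i;
            table.erase(it);                     // report each pattern at most once
        }
        if (i + len == text.size()) break;
        h = ((h + P - (unsigned char)text[i] * xTop % P) % P * X
             + (unsigned char)text[i + len]) % P;  // slide the window by one
    }
    return ans;
}
```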
In very recent independent work, Clifford et al. [4] gave a dictionary matching algorithm in the streaming model. In this setting the patterns and later the text are scanned once only (as opposed to read-only random access), and an occurrence needs to be reported immediately after its last character is read. Their algorithm uses O(s log ℓ) space and takes O(log log(s + ℓ)) time per character, where ℓ is the length of the longest pattern (m/s ≤ ℓ ≤ m). Even though some of the ideas used in both results are similar, one should note that the streaming and read-only models are quite different. In particular, computing the longest prefix occurring in the text for every pattern requires Ω(m log min(n, |Σ|)) bits of space in the streaming model, as opposed to the O(s) working space achieved by our solution in the read-only setting.
As a prime application of our dictionary matching algorithm, we show how to approximate the Lempel-Ziv 77 (LZ77) parse [18] of a text of length n using working space proportional to the number of phrases (again, we assume read-only random access to the text). Computing the LZ77 parse in small space is an issue of high importance, with space being a frequent bottleneck of today's systems. Moreover, LZ77 is useful not only for data compression, but also as a way to speed up algorithms [15]. We present a general approximation algorithm working in O(z) space for inputs admitting an LZ77 parsing with z phrases. For any ε ∈ (0, 1], the algorithm can be used to produce a parse consisting of (1 + ε)z phrases in O(ε⁻¹ n log n) time. To the best of our knowledge, approximating the LZ77 factorization in small space has not been considered before, and our algorithm is significantly more efficient than methods producing the exact answer. A recent sublinear-space algorithm, due to Kärkkäinen et al. [12], runs in O(nd) time and uses O(n/d) space, for any parameter d. An earlier online solution by Gąsieniec et al. [9] uses O(z) space and takes O(z² log² z) time for each character appended. Other previous methods use significantly more space when the parse is small relative to n; see [7] for a recent discussion.
Structure of the paper. Sect. 2 introduces terminology and recalls several known concepts. This is followed by the description of our dictionary matching algorithm: in Sect. 3 we show how to process patterns of length at most s, and in Sect. 4 we handle longer patterns, with different procedures for repetitive and non-repetitive ones. In Sect. 5 we extend the algorithm to compute, for every pattern, the longest prefix occurring in the text. Finally, in Sect. 7 we apply the dictionary matching algorithm to construct an approximation of the LZ77 parsing, and in Sect. 6 we explain how to modify the algorithms to make them Las Vegas.
Model of computation. Our algorithms are designed for the word-RAM with Ω(log n)-bit words and assume an integer alphabet of polynomial size. The usage of Karp-Rabin fingerprints makes them Monte Carlo randomized: the correct answer is returned with high probability, i.e., the error probability is inverse polynomial with respect to the input size, where the degree of the polynomial can be set arbitrarily large. With some additional effort, our algorithms can be turned into Las Vegas randomized, where the answer is always correct and the time bounds hold with high probability. Throughout the whole paper, we assume read-only random access to the text and the patterns, and we do not include their sizes while measuring space consumption.
2 Preliminaries

We consider finite words over an integer alphabet Σ = {0, …, σ − 1}, where σ = poly(n + m). For a word w = w[1]…w[n] ∈ Σ^n, we define the length of w as |w| = n. For 1 ≤ i ≤ j ≤ n, a word u = w[i]…w[j] is called a subword of w. By w[i..j] we denote the occurrence of u at position i, called a fragment of w. (For example, in w = abab the word ab occurs as two distinct fragments, w[1..2] and w[3..4].) A fragment with i = 1 is called a prefix and a fragment with j = n is called a suffix.
A positive integer p is called a period of w whenever w[i] = w[i + p] for all i = 1, 2, …, |w| − p. In this case, the prefix w[1..p] is often also called a period of w. The length of the shortest period of a word w is denoted as per(w). A word w is called periodic if per(w) ≤ |w|/2 and highly periodic if per(w) ≤ |w|/3. The well-known periodicity lemma [6] says that if p and q are both periods of w, and p + q ≤ |w|, then gcd(p, q) is also a period of w. We say that a word w is primitive if per(w) is not a proper divisor of |w|. Note that the shortest period w[1..per(w)] is always primitive.
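As an implementation aside (not from the paper): per(w) can be computed in linear time with the standard KMP prefix function, using the well-known fact that per(w) = |w| − π[|w|]. A minimal C++ sketch:

```cpp
#include <string>
#include <vector>

// Shortest period of w via the KMP prefix function:
// per(w) = |w| - pi[|w|-1]  (standard fact, not from the paper).
int shortestPeriod(const std::string& w) {
    int n = (int)w.size();
    if (n == 0) return 0;
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; ++i) {
        int k = pi[i - 1];
        while (k > 0 && w[i] != w[k]) k = pi[k - 1];
        if (w[i] == w[k]) ++k;
        pi[i] = k;
    }
    return n - pi[n - 1];
}

// "periodic" and "highly periodic" exactly as defined above.
bool periodic(const std::string& w)       { return 2 * shortestPeriod(w) <= (int)w.size(); }
bool highlyPeriodic(const std::string& w) { return 3 * shortestPeriod(w) <= (int)w.size(); }
```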
2.1 Fingerprints

Our randomized construction is based on Karp-Rabin fingerprints; see [13]. Fix a word w[1..n] over an alphabet Σ = {0, …, σ − 1}, a constant c ≥ 1, a prime number p > max(σ, n^(c+4)), and choose x ∈ Z_p uniformly at random. We define the fingerprint of a subword w[i..j] as Φ(w[i..j]) = w[i] + w[i+1]·x + … + w[j]·x^(j−i) mod p. With probability at least 1 − 1/n^c, no two distinct subwords of the same length have equal fingerprints. The situation when this happens for some two subwords is called a false positive. From now on, when stating the results, we assume that there are no false positives, to avoid repeating that the answers are correct with high probability. For dictionary matching, we assume that no two distinct subwords of w = T P1 … Ps have equal fingerprints. Fingerprints let us easily locate many patterns of the same length. The straightforward solution described in the introduction builds a hash table mapping fingerprints to patterns. However, then we can only guarantee that the hash table is constructed correctly with probability 1 − O(1/s^c) (for an arbitrary constant c), and we would like to bound the error probability by O(1/(n+m)^c). Hence we replace the hash table with a deterministic dictionary, as explained below. Although this increases the time by O(s log s), the extra term is absorbed in the final complexities.
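For concreteness, here is a direct evaluation of Φ in C++ (toy constants of my own choosing; the paper requires p > max(σ, n^(c+4)) and a uniformly random x ∈ Z_p, which is what makes false positives unlikely):

```cpp
#include <cstdint>
#include <string>

// Toy constants for illustration; see the text for the real requirements.
static const uint64_t p = 1000000007ULL, x = 131;

// Phi(w[i..j]) = w[i] + w[i+1]*x + ... + w[j]*x^(j-i)  (mod p),
// with 0-based, inclusive indices i <= j.
uint64_t phi(const std::string& w, size_t i, size_t j) {
    uint64_t h = 0, pw = 1;                        // pw = x^(k-i) mod p
    for (size_t k = i; k <= j; ++k) {
        h = (h + (unsigned char)w[k] * pw) % p;    // add w[k] * x^(k-i)
        pw = pw * x % p;
    }
    return h;
}
```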
Theorem 1. Given a text T of length n and patterns P1, …, Ps, each of length exactly ℓ, we can compute the leftmost occurrence of every pattern Pi in T using O(n + sℓ + s log s) total time and O(s) space.
Proof. We calculate the fingerprint Φ(Pj) of every pattern. Then we build, in O(s log s) time [16], a deterministic dictionary D with an entry mapping Φ(Pj) to j. For multiple identical patterns we create just one entry, and at the end we copy the answers to all instances of the pattern. Then we scan the text T with a sliding window of length ℓ while maintaining the fingerprint Φ(T[i..i+ℓ−1]) of the current window. Using D, we can find in O(1) time an index j such that Φ(T[i..i+ℓ−1]) = Φ(Pj), if any, and update the answer for Pj if needed (i.e., if there was no occurrence of Pj before). If we precompute x⁻¹, the fingerprint Φ(T[i..i+ℓ−1]) can be updated in O(1) time while increasing i.
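The O(1) update from the proof, spelled out in C++ (assumed toy modulus; since p is prime, x⁻¹ can be precomputed as x^(p−2) mod p by Fermat's little theorem):

```cpp
#include <cstdint>

// Toy modulus for illustration; the paper picks p > max(sigma, n^(c+4))
// and a random x in Z_p.
static const uint64_t p = 1000000007ULL;

uint64_t modpow(uint64_t b, uint64_t e) {        // b^e mod p
    uint64_t r = 1; b %= p;
    for (; e; e >>= 1, b = b * b % p) if (e & 1) r = r * b % p;
    return r;
}

// Given h = Phi(T[i..i+len-1]) under the paper's convention
// Phi(w[i..j]) = w[i] + w[i+1]*x + ... + w[j]*x^(j-i) mod p,
// return Phi(T[i+1..i+len]): subtract the outgoing character,
// divide by x (multiply by xInv), then add inChar * x^(len-1).
uint64_t slide(uint64_t h, uint64_t outChar, uint64_t inChar,
               uint64_t xInv, uint64_t xTop) {
    h = (h + p - outChar % p) % p * xInv % p;    // (h - T[i]) * x^(-1)
    return (h + inChar % p * xTop) % p;          // + T[i+len] * x^(len-1)
}
```

Here xInv = modpow(x, p − 2) and xTop = modpow(x, len − 1) are computed once before the scan, so each of the n window shifts indeed costs O(1) time.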
2.2 Tries

A trie of a collection of strings P1, …, Ps is a rooted tree whose nodes correspond to prefixes of the strings. The root represents the empty word and the edges are labeled with single characters. The node corresponding to a particular prefix is called its locus. In a compacted trie, unary nodes that do not represent any Pi are dissolved and the labels of their incident edges are concatenated. The dissolved nodes are called implicit, as opposed to the explicit nodes, which remain stored. The locus of a string in a compacted trie might therefore be explicit or implicit. All edges outgoing from the same node are stored on a list sorted according to the first character, which is unique among these edges. The labels of edges of a compacted trie are stored as pointers to the respective fragments of the strings Pi. Consequently, a compacted trie can be stored in space proportional to the number of explicit nodes, which is O(s).
Consider two compacted tries T1 and T2. We say that (possibly implicit) nodes v1 ∈ T1 and v2 ∈ T2 are
twins if they are loci of the same string. Note that every v1 ∈ T1 has at most one twin v2 ∈ T2.
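To make the O(s)-space representation concrete, here is one possible in-memory layout in C++ (a sketch with names of my own; the paper only specifies that labels are stored as pointers into the patterns):

```cpp
#include <vector>

// One explicit node of a compacted trie, in O(1) words plus its edge list.
// Edge labels are not copied: an edge references a fragment of some pattern.
struct Edge {
    int pat;         // which pattern P_i the label lives in
    int start, len;  // label = P_pat[start .. start+len-1]
    int child;       // index of the child node in CompactedTrie::nodes
};

struct Node {
    int patternId;            // id if this node is the locus of some P_i, else -1
    std::vector<Edge> edges;  // sorted by first label character (unique per node)
};

// Total space: proportional to the number of explicit nodes, i.e. O(s),
// on top of the read-only patterns themselves.
struct CompactedTrie {
    std::vector<Node> nodes;  // nodes[0] is the root (locus of the empty word)
};
```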
Original post: http://blog.csdn.net/creationaugust/article/details/51203474