A simple fast hybrid pattern-matching algorithm

Authors: Frantisek Franek, Christopher G. Jennings, W. F. Smyth

Published in: Journal of Discrete Algorithms, 2007

Presenter: Chung-Chan Wu

Date: December 11, 2007

Department of Computer Science and Information Engineering

National Cheng Kung University, Taiwan R.O.C.

1

Outline

Introduction

Algorithm Description

KMP (Knuth-Morris-Pratt)

Boyer-Moore

Sunday shift

The Hybrid Algorithm (FJS)

Extension

Experimental Results

Conclusions

2

Introduction

 This contribution resides in these categories:

In an effort to reduce processing time, we propose a mixture of Sunday’s variant of BM with KMP.

Our goal is to combine the best/average-case advantages of Sunday’s algorithm (BMS) with the worst-case guarantees of KMP.

 According to the experiments we have conducted, our new algorithm (FJS) is among the fastest in practice for the computation of all occurrences of a pattern p = p[1..m] in a text string x = x[1..n] on an alphabet Σ of size k.

3

KMP (Knuth-Morris-Pratt)

 Main Feature

Performs the comparisons from left to right.

Preprocessing phase: O(m) space and time.

Searching phase: O(m + n) time.

A precomputed table, the π-table, tells the search how far to fall back in the pattern after a mismatch.

The π value avoids an immediate repeat of the same mismatch:

• the pattern character that follows the reused prefix must differ from the pattern character that just mismatched.

Among software algorithms, KMP has the best worst-case running time.

4

KMP (Knuth-Morris-Pratt)

index       0  1  2  3  4  5  6  7
pattern[i]  G  C  A  G  A  G  A  G
π-value    -1  0  0 -1  1 -1  1 -1

Input string: GCATCGCAGAGAGTATACAGTACG

Pattern: GCAGAGAG
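
The π-values above can be reproduced with a short sketch. This is an illustration in Python, not code from the slides or the paper; the names `kmp_next` and `kmp_search` are assumptions. Indices are 0-based, as in the table, and the sketch computes one extra entry, π[m], which is only used to continue the search after a full match.

```python
def kmp_next(p):
    """Return the pi-table of p (length len(p) + 1, with pi[0] = -1).

    An entry is replaced by the entry it points to whenever the next
    comparison would repeat the character that has just mismatched.
    """
    m = len(p)
    nxt = [0] * (m + 1)
    nxt[0] = -1
    i, j = 0, -1
    while i < m:
        # Fall back through shorter borders until p[i] can extend one.
        while j > -1 and p[i] != p[j]:
            j = nxt[j]
        i += 1
        j += 1
        if i < m and p[i] == p[j]:
            nxt[i] = nxt[j]   # same character: skip the doomed comparison
        else:
            nxt[i] = j
    return nxt


def kmp_search(p, x):
    """Return the 0-based start positions of all occurrences of p in x."""
    m, n = len(p), len(x)
    if m == 0:
        return []
    nxt = kmp_next(p)
    found = []
    i = 0                     # position in the pattern
    for j in range(n):        # position in the text, never moves backward
        while i > -1 and p[i] != x[j]:
            i = nxt[i]
        i += 1
        if i == m:
            found.append(j + 1 - m)
            i = nxt[i]
    return found
```

On the slide's example, `kmp_next("GCAGAGAG")` starts with the eight tabulated values, and `kmp_search` reports the single occurrence that begins at text index 5.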

5

Boyer-Moore

 Main Feature

Performs the comparisons from right to left.

Preprocessing phase: O(m + k) space and time, where k is the alphabet size.

Searching phase: O(mn) time in the worst case.

Two precomputed tables, δ1 (bad-character shift) and δ2 (good-suffix shift).

Performs well in the best and average cases.

6

BM - Observation 1

 If char is known not to occur in the pattern, then no occurrence of the pattern can overlap that text position, so the pattern can be shifted completely past it.

Figure (bad-character shift): b does not occur in the pattern, so use δ1.

7

BM - Observation 2

 If the rightmost occurrence of char in the pattern is δ1 characters from the right end of the pattern, then we can slide the pattern down δ1 positions without checking for matches.

Figure (bad-character shift): b occurs in the pattern, so use δ1.
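
Observations 1 and 2 together define the bad-character table. The sketch below is a Python illustration with assumed names (`bad_character_table`, `delta1`). It uses the common variant that ignores the final pattern position, so that every shift is at least 1; this seems to be the variant behind the EXAMPLE table later in the deck (E maps to 6 rather than 0), but that reading is an assumption.

```python
def bad_character_table(pattern):
    """Map each character to its distance from the right end of the pattern.

    Only the rightmost occurrence counts, and the last position is skipped
    so that no entry is zero. Characters missing from the table do not
    occur in pattern[:-1].
    """
    m = len(pattern)
    table = {}
    for idx in range(m - 1):
        table[pattern[idx]] = m - 1 - idx   # later occurrences overwrite earlier ones
    return table


def delta1(table, pattern, char):
    """Bad-character shift for char: the tabulated value, else the full length m."""
    return table.get(char, len(pattern))
```

For the pattern EXAMPLE this yields A=4, E=6, L=1, M=3, P=2, X=5, and 7 for every other character.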

8

BM - Observation 3(a)

 The good-suffix shift consists in aligning the segment y[i+j+1 … j+m-1] = x[i+1 … m-1] with its rightmost occurrence in x that is preceded by a character different from x[i]. (On this slide and the next, x denotes the pattern and y the text.)

Figure (good-suffix shift): u reoccurs in the pattern preceded by c ≠ a, so use δ2.

9

BM - Observation 3(b)

 If no such segment exists, the shift consists in aligning the longest suffix v of y[i+j+1 … j+m-1] with a matching prefix of x.

Figure (good-suffix shift): only a suffix v of u reoccurs in the pattern, so use δ2.

10

Boyer-Moore Example

δ1 (bad-character shift)

char   A  E  L  M  P  X  rest
shift  4  6  1  3  2  5  7

δ2 (good-suffix shift)

char   E   X   A   M  P  L  E
shift  12  11  10  9  8  7  1

Input string: HERE IS A SIMPLE EXAMPLE

Pattern: EXAMPLE
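
The δ2 row can be checked with a brute-force sketch of Observations 3(a) and 3(b). This is a Python illustration with assumed names; it tries every shift and is far slower than the linear-time preprocessing used in practice. It also assumes that the slide's δ2 counts how far the text pointer advances (the pattern shift plus the distance from the mismatch position to the end of the pattern), an interpretation that reproduces the row but is not stated in the deck.

```python
def good_suffix_shifts(pattern):
    """For each mismatch position j (0-based), return the smallest shift s that
    (a) keeps the already matched suffix pattern[j+1:] consistent with the
        pattern characters moved underneath it, and
    (b) brings a different character under the mismatch position.
    Positions that fall off the left end of the pattern impose no constraint.
    """
    m = len(pattern)
    shifts = []
    for j in range(m):
        for s in range(1, m + 1):          # s = m always qualifies
            suffix_ok = all(k - s < 0 or pattern[k - s] == pattern[k]
                            for k in range(j + 1, m))
            differs = j - s < 0 or pattern[j - s] != pattern[j]
            if suffix_ok and differs:
                shifts.append(s)
                break
    return shifts


def delta2(pattern):
    """Pattern shift plus the distance from position j to the pattern's end."""
    m = len(pattern)
    return [s + (m - 1 - j) for j, s in enumerate(good_suffix_shifts(pattern))]
```

For EXAMPLE the pattern shifts are 6, 6, 6, 6, 6, 6, 1, and `delta2("EXAMPLE")` gives 12, 11, 10, 9, 8, 7, 1.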

11

Sunday Shift

Boyer-Moore δ1

char   A  E  L  M  P  X  rest
shift  4  6  1  3  2  5  7

Sunday shift δ1

char   A  E  L  M  P  X  rest
shift  5  1  2  4  3  6  8

Figure: alignment of the input string and the pattern for each shift.
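
Each Sunday entry is the pattern length minus the 0-based index of the character's rightmost occurrence, with m + 1 for characters that do not occur. The table is indexed by the text character just past the current window. The following Python sketch uses assumed names (`sunday_table`, `sunday_search`) and a plain window comparison; it illustrates the shift rule only and is not the FJS algorithm.

```python
def sunday_table(pattern):
    """Map each pattern character to m minus its rightmost 0-based index."""
    m = len(pattern)
    return {c: m - idx for idx, c in enumerate(pattern)}  # rightmost wins


def sunday_search(pattern, text):
    """Return the 0-based start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    delta = sunday_table(pattern)
    found = []
    i = 0                                  # left end of the current window
    while i + m <= n:
        if text[i:i + m] == pattern:
            found.append(i)
        if i + m >= n:                     # no character after the window
            break
        # Shift so that text[i + m] lines up with its rightmost occurrence
        # in the pattern, or jump past it when it does not occur at all.
        i += delta.get(text[i + m], m + 1)
    return found
```

For EXAMPLE the table reproduces the Sunday row above, and searching "HERE IS A SIMPLE EXAMPLE" finds the occurrence at index 17.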

12

FJS Algorithm

 Definitions

Search for p = p[1..m] in x = x[1..n]

• by shifting p from left to right along x.

• position j = 1 of p is aligned with a position i ∈ 1..n − m + 1 in x.

• partial match: if a mismatch occurs at j > 1, we say that a partial match has been determined with p[1..j − 1].

• i’ = i + m − j: the position of x that is aligned with p[m] when p[j] is aligned with x[i].

Figure: the pattern aligned under the input string, showing positions i and i’ in the text and j and m in the pattern.

13

FJS Algorithm

 Strategy

Whenever no partial match of p with x[i..i + m − 1] has been found, Sunday shifts are performed to determine the next position i’ at which x[i’] = p[m].

When such an i’ has been found, KMP matching is then performed on p[1..m − 1] and x[i’ − m + 1..i’ − 1].

If a partial match of p with x has been found, KMP matching is continued on p[1..m].

• once a suitable i’ has been found, the first half of FJS just performs KMP matching in a different order:

• position m of p is compared first, followed by 1, 2, . . . , m − 1

14

FJS Algorithm

 Pre-processing

Sunday’s array Δ = Δ[1..k], computable in O(m + k) time.

KMP array β’ = β’[1..m + 1], computable in O(m) time.

15

FJS Algorithm

index       0  1  2  3  4  5  6  7
pattern[i]  G  C  A  G  A  G  A  G
π-value    -1  0  0 -1  1 -1  1 -1

δ1 (Sunday shift)

char   A  C  G  rest
shift  2  7  1  9

Input string: GCATCGCAGAGAGTATACAGTACG

Pattern: GCAGAGAG
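
The strategy can be put together in a compact sketch. This is a Python illustration written from the description in these slides, not the authors' implementation. It uses 0-based indices, a dictionary in place of the array Δ, and assumed names (`beta_prime`, `sunday_delta`, `fjs_search`, `ip` for i’).

```python
def beta_prime(p):
    """KMP array (the pi-table of the earlier slides), length m + 1."""
    m = len(p)
    b = [0] * (m + 1)
    b[0] = -1
    i, j = 0, -1
    while i < m:
        while j > -1 and p[i] != p[j]:
            j = b[j]
        i += 1
        j += 1
        b[i] = b[j] if i < m and p[i] == p[j] else j
    return b


def sunday_delta(p):
    """Sunday shift for every character that occurs in p."""
    m = len(p)
    return {c: m - idx for idx, c in enumerate(p)}


def fjs_search(p, x):
    """Return the 0-based start positions of all occurrences of p in x."""
    m, n = len(p), len(x)
    if m < 1:
        return []
    beta, delta = beta_prime(p), sunday_delta(p)
    found = []
    last = m - 1          # index of the final pattern character
    i = j = 0             # text position and pattern position being compared
    ip = last             # i': text position aligned with p[last]
    while ip < n:
        if j <= 0:
            # No partial match: Sunday shifts until x[ip] equals p[last].
            while x[ip] != p[last]:
                if ip + 1 >= n:
                    return found
                ip += delta.get(x[ip + 1], m + 1)
                if ip >= n:
                    return found
            # The last character matches; KMP-match the first m - 1 characters.
            j = 0
            i = ip - last
            while j < last and x[i] == p[j]:
                i += 1
                j += 1
            if j == last:
                found.append(i - last)
                i += 1
                j += 1
            if j <= 0:
                i += 1
            else:
                j = beta[j]
        else:
            # A partial match exists: continue ordinary KMP matching.
            while j < m and x[i] == p[j]:
                i += 1
                j += 1
            if j == m:
                found.append(i - m)
            j = beta[j]
        ip = i + last - j  # the 0-based form of i' = i + m - j
    return found
```

On the example above, `fjs_search("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG")` reports the single occurrence at index 5. The sketch guards the read of `x[ip + 1]` explicitly because Python strings have no terminating sentinel.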

16

Extension

The alphabet-based preprocessing arrays of BM-type algorithms are their most useful feature, but they can be a source of trouble as well.

The ASCII alphabet:

Characters were usually 8 bits or fewer.

The preprocessing time can be disregarded.

Natural-language text:

Wide characters

DNA data: {A, T, C, G} is mapped into {00, 01, 10, 11}.

• alphabets of size varying by powers of 2 from 2 to 64.

An example DNA : ACTG ( 2

2 4

)

The preprocessing time is a bottleneck.
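
To make the alphabet dependence concrete, here is a small Python sketch of the 2-bit DNA mapping listed above, together with Sunday's array stored as a plain list of k entries. The names `DNA_CODE`, `encode_dna` and `sunday_array` are assumptions for illustration. The point is only that the array must be initialised for every one of the k symbols before the pattern is scanned, which is where the O(m + k) preprocessing cost comes from.

```python
# 2-bit codes as listed on the slide: A=00, T=01, C=10, G=11.
DNA_CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}


def encode_dna(sequence):
    """Translate a DNA string into a list of 2-bit symbol codes."""
    return [DNA_CODE[base] for base in sequence]


def sunday_array(codes, k):
    """Sunday's array over an integer alphabet {0, ..., k - 1}.

    Filling the k default entries costs O(k); scanning the pattern costs O(m).
    """
    m = len(codes)
    delta = [m + 1] * k                 # default shift for absent symbols
    for idx, code in enumerate(codes):
        delta[code] = m - idx           # rightmost occurrence wins
    return delta
```

With k = 4 the array is tiny; with wide characters the same initialisation touches every possible symbol even when the pattern is short.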

17

Environment

18

Experiment – Frequency

These patterns occur 3,366,899 times

19

Experiment – Pattern Length

20

Experiment – Alphabet Size

21

Experiment – Pathological Cases

22

Conclusion

We have tested FJS against four high-profile competitors (BMH, BMS, RC, TBM) over a range of contexts:

• pattern frequency (C1), pattern length (C2), alphabet size (C3), and pathological cases (C4).

FJS was uniformly superior to its competitors, with an up to 10% advantage over BMS and RC.

For FJS the pathological cases (C4) are those in which the KMP part is forced to execute on prefixes of patterns where KMP provides no advantage.

We presented a hybrid exact pattern-matching algorithm, FJS, which combines the benefits of KMP and BMS. It requires O(m + k) time and space for preprocessing.

23
