Pattern Matching T: a b a c a a b 1 P: a b a 240-301 Comp. Eng. Lab III (Software), Pattern Matching a b c a a b 4 3 2 c a b 1 Overview 1. 2. 3. 4. 5. What is Pattern Matching? The Brute Force Algorithm The Boyer-Moore Algorithm The Knuth-Morris-Pratt Algorithm More Information 240-301 Comp. Eng. Lab III (Software), Pattern Matching 2 1. What is Pattern Matching? Definition: – given a text string T and a pattern string P, find the pattern inside the text T: “the rain in spain stays mainly on the plain” P: “n th” Applications: – text editors, Web search engines (e.g. Google), image analysis 240-301 Comp. Eng. Lab III (Software), Pattern Matching 3 String Concepts Assume S is a string of size m. A substring S[i .. j] of S is the string fragment between indexes i and j. A prefix of S is a substring S[0 .. i] A suffix of S is a substring S[i .. m-1] – i is any index between 0 and m-1 240-301 Comp. Eng. Lab III (Software), Pattern Matching 4 Examples S a n d r e w 0 Substring All 5 S[1..3] == "ndr" possible prefixes of S: – "andrew", "andre", "andr", "and", "an”, "a" All possible suffixes of S: – "andrew", "ndrew", "drew", "rew", "ew", "w" 240-301 Comp. Eng. Lab III (Software), Pattern Matching 5 2. The Brute Force Algorithm Check each position in the text T to see if the pattern P starts in that position T: a n d r e w P: r e w T: a n d r e w P: r e w P moves 1 char at a time through T .... 240-301 Comp. Eng. Lab III (Software), Pattern Matching 6 Example-1 240-301 Comp. Eng. Lab III (Software), Pattern Matching 7 Example-2 240-301 Comp. Eng. Lab III (Software), Pattern Matching 8 Example-3 T=a bacaabaccabacabaabb P=a b a c a b 240-301 Comp. Eng. Lab III (Software), Pattern Matching 9 Brute Force-Complexity… Given a pattern M characters in length, and a text N characters in length... Worst case: compares pattern to each substring of text of length M. For example, M=5. This kind of case can occur for image data Total number of comparisons: M (N-M+1) Worst case time complexity: O(MN) 240-301 Comp. Eng. Lab III (Software), Pattern Matching 10 Best case if pattern not found…. Given a pattern M characters in length, and a text N characters in length... For example, M=5. Total number of comparisons: N Best case time complexity: O(N) 240-301 Comp. Eng. Lab III (Software), Pattern Matching 11 Analysis Brute force pattern matching runs in time O(mn) in the worst case. But most searches of ordinary text take O(m+n), which is very quick. 240-301 Comp. Eng. Lab III (Software), Pattern Matching continued 12 The brute force algorithm is fast when the alphabet of the text is large – e.g. A..Z, a..z, 1..9, etc. It is slower when the alphabet is small – e.g. 0, 1 (as in binary files, image files, etc.) 240-301 Comp. Eng. Lab III (Software), Pattern Matching continued 13 Example of a worst case: – T: "aaaaaaaaaaaaaaaaaaaaaaaaaah" – P: "aaah" Example of a more average case: – T: "a string searching example is standard" – P: "store" 240-301 Comp. Eng. Lab III (Software), Pattern Matching 14 3. The Boyer-Moore Algorithm The Boyer-Moore pattern matching algorithm is based on two techniques. The looking-glass Heuristic: – When testing a possible placement of P against T, begin the comparisons from the end of P and move backward to the front of P. 240-301 Comp. Eng. Lab III (Software), Pattern Matching 15 Character-Jump Heuristic: During the testing of a possible placement of P against T, a mismatch of text character T[i] with the corresponding pattern character P[j] is handled as follows. If c is not contained anywhere in P, then shift P completely past T[i] (for it cannot match any character in P). Otherwise, shift P until an occurrence of character c in P gets aligned with T[i]. 240-301 Comp. Eng. Lab III (Software), Pattern Matching 16 2. The character-jump Heuristic: – when a mismatch occurs at T[i] == x – the character in pattern P[j] is not the same as T[i] There are 3 possible cases, tried in order. T P x a i ba j 240-301 Comp. Eng. Lab III (Software), Pattern Matching 17 Case 1 If P contains x somewhere, then try to shift P right to align the last occurrence of x in P with T[i]. T P x a i x c ba j T and move i and j right, so j at end 240-301 Comp. Eng. Lab III (Software), Pattern Matching x a ? ? inew P x c ba jnew 18 Case 2 If P contains x somewhere, but a shift right to the last occurrence is not possible, then shift P right by 1 character to T[i+1]. T P x a x i cw ax j T and move i and j right, so j at end x is after 240-301 Comp. Eng. LabjIIIposition (Software), Pattern Matching xa x ? inew P cw ax jnew 19 Case 3 If cases 1 and 2 do not apply, then shift P to align P[0] with T[i+1]. T x a i P d c ba T and move i and j right, so j at end j No x in P 240-301 Comp. Eng. Lab III (Software), Pattern Matching x a ? ?? inew P d c ba 0 jnew 20 Boyer-Moore Example (1) T: a p a t r i 1 t h m P: r i t e r n 2 t h m r i m a t c h i n g 3 t h m r i r i 240-301 Comp. Eng. Lab III (Software), Pattern Matching 4 t h m a l g o r i 5 t h m t h m 11 10 9 8 7 r i t h m r i 6 t h m 21 Boyer-Moore Example (2) T=a bacaabadcabacabaabb P=a b a c a b 240-301 Comp. Eng. Lab III (Software), Pattern Matching 22 Last Occurrence Function algorithm preprocesses the pattern P and the alphabet A to build a last occurrence function L() Boyer-Moore’s – L() maps all the letters in A to integers L(x) is defined as: // x is a letter in A – the largest index i such that P[i] == x, or – -1 if no such index exists 240-301 Comp. Eng. Lab III (Software), Pattern Matching 23 L() Example P a b a c a b 0 1 2 3 4 5 A = {a, b, c, d} P: "abacab" x L(x) a 4 b 5 c 3 d -1 L() stores indexes into P[] 240-301 Comp. Eng. Lab III (Software), Pattern Matching 24 Note In Boyer-Moore code, L() is calculated when the pattern P is read in. Usually L() is stored as an array – something like the table 240-301 Comp. Eng. Lab III (Software), Pattern Matching 25 Boyer-Moore Example (2) T a b a c a a b a d c a b a c a b a a b b 1 Pa b a a b c a a b 4 3 2 13 12 11 10 9 8 c a b a b b a c 5 a b a c a a 7 b a b a c a b 6 a b a c a b 240-301 Comp. Eng. Lab III (Software), Pattern Matching x L(x) a 4 b 5 c 3 d -1 26 Algorithm 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. Algorithm BoyreMooreMatch(T,P): Input: Strings T (text) with n characters and P (pattern) with m characters Output: Starting index of the first substring of T matching P, or an indication that P is not a substring of T compute function last L(X) i←m−1 j←m−1 repeat if T[i] = = P[j] then if j = 0 then return i // a match! else i ← i − 1 j←j−1 else i ← i + m − min(j, 1 + last(T[i])) // jump step j←m−1 until i>n − 1 return “There is no substring of T matching P.” 240-301 Comp. Eng. Lab III (Software), Pattern Matching 27 Analysis Boyer-Moore worst case running time is O(nm + A) But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small. – e.g. good for English text, poor for binary Boyer-Moore is significantly faster than brute force for searching English text. 240-301 Comp. Eng. Lab III (Software), Pattern Matching 28 Worst Case Example T: "aaaaa…a" P: "baaaaa" T: a a a a a a a a a 6 5 4 3 2 1 P: b a a a a a 12 11 10 b a a 9 8 7 a a a 18 17 16 15 14 13 b a a a a a 24 23 22 21 20 19 b 240-301 Comp. Eng. Lab III (Software), Pattern Matching a a a a a 29 4. The KMP Algorithm The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-toright order (like the brute force algorithm). But it shifts the pattern more intelligently than the brute force algorithm. 240-301 Comp. Eng. Lab III (Software), Pattern Matching continued 30 If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons? Answer: the largest prefix of P[0 .. j-1] that is a suffix of P[1 .. j-1] 240-301 Comp. Eng. Lab III (Software), Pattern Matching 31 Example i T: P: j=5 jnew = 2 240-301 Comp. Eng. Lab III (Software), Pattern Matching 32 Why j == 5 Find largest prefix (start) of: "a b a a b" ( P[0..j-1] ) which is suffix (end) of: "b a a b" ( p[1 .. j-1] ) Answer: "a b" Set j = 2 // the new j value 240-301 Comp. Eng. Lab III (Software), Pattern Matching 33 KMP Failure Function KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself. j = mismatch position in P[] k = position before the mismatch (k = j-1). The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k]. 240-301 Comp. Eng. Lab III (Software), Pattern Matching 34 Failure Function Example (k == j-1) P: "abaaba" j: 012345 jk 0 1 2 3 4 F(k) F(j) 0 0 1 1 2 3 F(k) is the size of the largest prefix. In code, F() is represented by an array, like the table. 240-301 Comp. Eng. Lab III (Software), Pattern Matching 35 Why is F(4) == 2? F(4) P: "abaaba" means – find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4] = find the size largest prefix of "abaab" that is also a suffix of "baab" = find the size of "ab" =2 240-301 Comp. Eng. Lab III (Software), Pattern Matching 36 Using the Failure Function algorithm modifies the brute-force algorithm. Knuth-Morris-Pratt’s – if a mismatch occurs at P[j] (i.e. P[j] != T[i]), then k = j-1; j = F(k); // obtain the new j 240-301 Comp. Eng. Lab III (Software), Pattern Matching 37 Example T: a b a c a a b a c c a b a c a b a a b b 1 2 3 4 5 6 P: a b a c a b 7 a b a c a b 8 9 10 11 12 a b a c a b 13 a b a c a b k 0 1 2 3 4 14 15 16 17 18 19 F(k) 0 0 1 0 1 a b a c a b 240-301 Comp. Eng. Lab III (Software), Pattern Matching 38 More Examples.... 1.T= "THIS IS A TEST TEXT“ P=“TEXT” 2.T=“AABDACAADAABAAACA” P=“AABA” P="cgtacgttcgtacg“ Find Failure Function. 240-301 Comp. Eng. Lab III (Software), Pattern Matching 39 Algorithm Algorithm KMPMatch(T,P): Input: Strings T (text) with n characters and P (pattern) with m characters Output: Starting index of the first substring of T matching P, or an indication that P is not a substring of T f ← KMPFailureFunction(P) // construct the failure function f for P i←0 j←0 while i < n do if P[j] = T[i] then if j = m − 1 then return i − m + 1 // a match! i←i+1 j←j+1 else if j > 0 // no match, but we have advanced in P then j ← f(j − 1) // j indexes just after prefix of P that must match else i←i+1 return “There is no substring of T matching P.” 240-301 Comp. Eng. Lab III (Software), Pattern Matching 40 KMP Advantages KMP runs in optimal time: O(m+n) – very fast The algorithm never needs to move backwards in the input text, T – this makes the algorithm good for processing very large files that are read in from external devices or through a network stream 240-301 Comp. Eng. Lab III (Software), Pattern Matching 41 KMP Disadvantages KMP doesn’t work so well as the size of the alphabet increases – more chance of a mismatch (more possible mismatches) – mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later 240-301 Comp. Eng. Lab III (Software), Pattern Matching 42 Pattern Matching Applications Applications in Computational Biology – DNA sequence is a long word (or text) over a 4-letter alphabet – GTTTGAGTGGTCAGTCTTTTCGTTTCGACGGAGCCCCCA ATTAATAAACTCATAAGCAGACCTCAGTTCGCTTAGAGC AGCCGAAA….. – Find a Specific pattern W Finding patterns in documents formed using a large alphabet – Word processing – Web searching – Desktop search (Google, MSN) Matching strings of bytes containing – Graphical data – Machine code grep in unix – grep searches for lines matching a pattern. 240-301 Comp. Eng. Lab III (Software), Pattern Matching 43 KMP Extensions The basic algorithm doesn't take into account the letter in the text that caused the mismatch. T: a b a a b x Basic KMP does not do this. P: a b a a b a a b a a b a 240-301 Comp. Eng. Lab III (Software), Pattern Matching 44 5. More Information Algorithms in C++ Robert Sedgewick Addison-Wesley, 1992 – chapter 19, String Searching Online Animated Algorithms: – http://www.ics.uci.edu/~goodrich/dsa/ 11strings/demos/pattern/ – http://www-sr.informatik.uni-tuebingen.de/ ~buehler/BM/BM1.html – http://www-igm.univ-mlv.fr/~lecroq/string/ 240-301 Comp. Eng. Lab III (Software), Pattern Matching 45