ICS220 – Data Structures and Algorithms Analysis
Lecture 14
Dr. Ken Cosh

Review
• Memory Management
  – Memory Allocation
  – Garbage Collection

This Week
• String Matching
  – String matching is a common task for many computer users:
    • Internet searches
    • String manipulation in word processing
    • Advanced DNA sequence matching
  – Therefore effective pattern matching algorithms are essential.

Brute Force
• Our first simple string matching algorithm is brute force.
  – We compare the first character of the pattern with the text. If it matches, we check the second character, and so on. On a mismatch, we step the pattern forward one character in the text and start again from the first character of the pattern.
• Any useful information that could be used in subsequent searches is then lost.

Brute Force
bruteForceStringMatching(pattern P, text T)
    i = 0;
    while i ≤ |T| - |P|
        j = 0;
        while j < |P| && Ti == Pj
            i++; j++;
        if j == |P| return match at i - |P|;
        i = i - j + 1;
    return no match;

Brute Force
• T = ababcdababababababad, P = babab

     ababcdababababababad
  1  babab
  2   babab
  3    babab
  4     babab
  5      babab
  6       babab
  7        babab
  8         babab

In this case the match is found on the 8th try.

Brute Force Complexity
• The best case for the algorithm is that the string is matched straight away (consider searching this sentence for “The”). Here |P| comparisons are required, giving O(|P|).
• The worst case is if the string isn’t found, but for each character in T we are required to make |P| comparisons, so the worst case is O(|T||P|).
• The average case depends on the size of the character set and the frequencies of its characters.

Brute Force Complexity
• Notice the nested while loops in the Brute Force algorithm:
    while i ≤ |T| - |P|
        while j < |P| && Ti == Pj
• Shortly we’ll investigate how we can reduce the number of iterations of each loop.
• For the worst case to occur we could search for a string such as aaaaaaaaaaaab within a string aaaaaaaaaaaaaaaaaaaaaaaaaaa etc.

Improving Brute Force
• A key problem with brute force is that each time we abort the comparison we have to start from the beginning of the pattern again.
• We could reduce the work the algorithm does by skipping comparisons that cannot lead to a match.
• Hancart’s algorithm allows the search to step forward 2 characters when a match cannot start at the next position.

Hancart
• Hancart’s algorithm refines brute force in a couple of ways.
  – First, the first two characters of the pattern are compared.
    • Either they are the same, or they are different.
  – Second, comparisons begin with the 2nd character: P1 is compared with Ti+1.
• Case P0 == P1, mismatch with Ti+1:
  – If P1 != Ti+1, then P0 != Ti+1, so no match can start at i+1 (i = i+2).
• Case P0 != P1, mismatch after the first comparison:
  – If P1 == Ti+1, then P0 != Ti+1, so no match can start at i+1 (i = i+2).

Hancart
• Hancart’s revision works by allowing us to skip forward 2 characters in situations where there can’t be a match.
• Notice that the situations where 2 steps forward are allowed depend on whether the first 2 characters of the pattern are the same.
• We can refine the search further by extending this observation: the number of steps forward allowed depends on the contents of the pattern.
• The Knuth Morris Pratt algorithm observes that the pattern contains enough information to determine where the next match could begin.

Hancart
• Hancart’s algorithm reduces the number of iterations through the outer loop
  – by sometimes allowing the increment to be: i = i - j + 2;

Knuth Morris Pratt
• The Knuth Morris Pratt algorithm begins by finding, for each position in the substring, the length of the longest suffix ending there that is equal to a prefix of the same substring.
  – Substring:      A,B,C,D,A,B,D
  – Longest Suffix: 0,0,0,0,1,2,0
    • i.e. when the 2nd A comes it is both a suffix and a prefix for the substring. The following B forms ‘AB’, a 2 character prefix and suffix.
  – Now for each iteration of the outer loop i can be increased by j-x, where j is the number of substring characters already matched and x is the longest suffix value for the last matched character.
    • i.e. if a mismatch is found when comparing the second A (the 5th character), the 4 characters before it (A,B,C,D) have matched and their longest suffix value is 0, so i can be increased by 4 (4-0).

Test
Try searching for this substring, A,B,C,D,A,B,D, within this string: ABCDABCABCDABDE

Knuth Morris Pratt complexity
• Knuth Morris Pratt removes some of the complexity of the brute force algorithm by preprocessing the substring being searched for (to create the suffix table).
• Now, as we don’t need to recheck characters in the text, the outer loop is O(|T|).
• Preprocessing can be performed quickly, in O(|P|) time, leaving a total complexity of O(|T|+|P|).
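
The two searches discussed in this lecture can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function names (brute_force_search, build_suffix_table, kmp_search) are assumptions chosen for this example, and the brute force version indexes the text with i + j instead of advancing and resetting i as the pseudocode does.

```python
def brute_force_search(text, pattern):
    """Try every alignment i; O(|T||P|) comparisons in the worst case."""
    for i in range(len(text) - len(pattern) + 1):
        j = 0
        # Check the bound on j first so pattern[j] is never read out of range.
        while j < len(pattern) and text[i + j] == pattern[j]:
            j += 1
        if j == len(pattern):
            return i  # match starts at i
    return -1  # no match


def build_suffix_table(pattern):
    """table[k] = length of the longest proper suffix of pattern[:k+1]
    that is also a prefix of the pattern (the 'Longest Suffix' row)."""
    table = [0] * len(pattern)
    x = 0  # length of the prefix currently matched
    for k in range(1, len(pattern)):
        while x > 0 and pattern[k] != pattern[x]:
            x = table[x - 1]  # fall back to a shorter prefix
        if pattern[k] == pattern[x]:
            x += 1
        table[k] = x
    return table


def kmp_search(text, pattern):
    """Scan the text once, never moving backwards in it: O(|T| + |P|)."""
    if not pattern:
        return 0
    table = build_suffix_table(pattern)
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = table[j - 1]  # same effect as shifting the pattern by j - x
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1  # start index of the match
    return -1  # no match


if __name__ == "__main__":
    print(build_suffix_table("ABCDABD"))
    print(brute_force_search("ababcdababababababad", "babab"))
    print(kmp_search("ababcdababababababad", "babab"))
```

For the earlier example (T = ababcdababababababad, P = babab) both functions return index 7, which is the 8th alignment tried by brute force. The table built for A,B,C,D,A,B,D is the Longest Suffix row shown on the Knuth Morris Pratt slide.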