DR. A.P.J. ABDUL KALAM TECHNICAL UNIVERSITY
Branch: IT
Design and Analysis of Algorithms
UNIT-5

String Matching Algorithms

Basic Definition of String Matching:
A string matching algorithm is also called a "string searching algorithm." Given a text array T[1..n] of n characters and a pattern array P[1..m] of m characters, the problem is to find an integer s, called a valid shift, where 0 ≤ s ≤ n − m and T[s+1..s+m] = P[1..m]. In other words, we want to find every place where P occurs in T, i.e., where P is a substring of T. The characters of P and T are drawn from some finite alphabet, such as {0, 1} or {A, B, ..., Z, a, b, ..., z}.
Given a string T[1..n], a substring is written T[i..j] for some 1 ≤ i ≤ j ≤ n: the string formed by the characters of T from index i to index j, inclusive. By this definition, every string is a substring of itself (take i = 1 and j = n).

Different Types of String Matching Algorithms:
• The Naive String Matching Algorithm
• Finite Automata
• The Rabin-Karp Algorithm
• The Knuth-Morris-Pratt Algorithm
• The Boyer-Moore Algorithm

Given a text and a pattern, we search for the pattern in the text and return the valid shifts at which the pattern occurs.

Naive or Standard String Matching Algorithm:
The naive approach tests every possible placement of the pattern P[1..m] relative to the text T[1..n]. We try each shift s = 0, 1, ..., n − m in turn, and for each shift s we compare T[s+1..s+m] to P[1..m].

NAIVE-STRING-MATCHER (T, P)
1. n ← length[T]
2. m ← length[P]
3. for s ← 0 to n − m
4.     do if P[1..m] = T[s+1..s+m]
5.         then print "Pattern occurs with shift" s

Analysis: The for loop in lines 3–5 executes n − m + 1 times (we need at least m characters remaining at the end), and each iteration performs up to m comparisons.
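The pseudocode above translates almost directly into Python. A minimal sketch (the function name `naive_match` is ours; shifts are 0-based):

```python
def naive_match(text, pattern):
    """Return all valid shifts s such that text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):        # try every possible shift
        if text[s:s+m] == pattern:    # up to m character comparisons
            shifts.append(s)
    return shifts

# T = 1011101110, P = 111 (the worked example below)
print(naive_match("1011101110", "111"))   # → [2, 6]
```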
So the total complexity is O((n − m + 1)m).

Example (Naive String Matching): Let T = 1011101110 and P = 111. Find the valid shifts using the naive algorithm. (Comparing P against T at every shift gives matches at shifts s = 2 and s = 6.)

Rabin-Karp String Matching Algorithm:
The Rabin-Karp algorithm calculates a hash value for the pattern, as well as for each m-character substring of the text to be compared. If the hash values are unequal, the algorithm computes the hash value for the next m-character substring. If the hash values are equal, the algorithm compares the pattern and the m-character substring character by character. In this way there is only one hash comparison per text position, and character matching is required only when the hash values match. The numerical (hash) values can be computed efficiently using Horner's rule.

RABIN-KARP-MATCHER (T, P, d, q)
1. n ← length[T]
2. m ← length[P]
3. h ← d^(m−1) mod q
4. p ← 0
5. t_0 ← 0
6. for i ← 1 to m
7.     do p ← (d·p + P[i]) mod q
8.        t_0 ← (d·t_0 + T[i]) mod q
9. for s ← 0 to n − m
10.    do if p = t_s
11.        then if P[1..m] = T[s+1..s+m]
12.            then print "Pattern occurs with shift" s
13.    if s < n − m
14.        then t_{s+1} ← (d(t_s − T[s+1]·h) + T[s+m+1]) mod q

Line 14 is the rolling-hash formula used to shift by one character, derived from Horner's rule:
t_{s+1} ← (d(t_s − T[s+1]·h) + T[s+m+1]) mod q

Example: For string matching with working modulus q = 11, how many spurious hits does the Rabin-Karp matcher encounter in the text T = 3141592653589793 when looking for the pattern P = 26?
Solution: Given T = 3141592653589793, P = 26, and working modulus q = 11, we first calculate
P mod q = 26 mod 11 = 4.
Now we look for windows of the text whose hash value equals 4.
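The matcher above can be sketched in Python. This is a hedged sketch for digit strings, with radix d = 10 and the modulus q = 11 used in the worked example; the name `rabin_karp` is ours:

```python
def rabin_karp(text, pattern, d=10, q=11):
    """Return (valid_shifts, spurious_hits) for digit strings text/pattern."""
    n, m = len(text), len(pattern)
    h = pow(d, m - 1, q)                  # d^(m-1) mod q
    p = t = 0
    for i in range(m):                    # preprocessing via Horner's rule
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts, spurious = [], 0
    for s in range(n - m + 1):
        if p == t:                        # hash hit: verify char by char
            if text[s:s+m] == pattern:
                shifts.append(s)
            else:
                spurious += 1             # equal hashes, unequal strings
        if s < n - m:                     # rolling-hash update (line 14)
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts, spurious

# The worked example: T = 3141592653589793, P = 26, q = 11
print(rabin_karp("3141592653589793", "26"))   # → ([6], 3)
```

The pattern occurs at shift 6, and three windows (15, 59, 92) hash to 4 mod 11 without matching, giving 3 spurious hits.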
We keep traversing the text, comparing P mod q with each window's hash t_s. There are two cases:
Case 1: If P mod q = t_s and the window equals the pattern, we have an exact match.
Case 2: If P mod q = t_s but the window does not equal the pattern, it is a spurious hit, and we continue searching for the next match.
We move to the next window of the text with the help of the rolling-hash update (Horner's rule).

Complexity: The running time of RABIN-KARP-MATCHER in the worst case is O((n − m + 1)m), but it has a good average-case running time. If the expected number of valid shifts is small (O(1)) and the prime q is chosen to be sufficiently large, then the Rabin-Karp algorithm can be expected to run in time O(n + m), plus the time required to process spurious hits.

Example 2: Suppose T = 2359023141526739921, P = 31415, and working modulus q = 13. Find the valid match using the Rabin-Karp algorithm.

String Matching: Knuth-Morris-Pratt Algorithm
Knuth, Morris, and Pratt introduced a linear-time algorithm for the string matching problem. A matching time of O(n) is achieved by avoiding comparisons with elements of the string 'S' that have previously been involved in a comparison with some element of the pattern 'p'; i.e., backtracking on the string 'S' never occurs.

Components of the KMP Algorithm:
1. The Prefix Function (Π): The prefix function Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern 'p'; in other words, it enables avoiding backtracking on the string 'S'.
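Both components of KMP, the prefix function and the matcher, can be sketched together in Python (0-based indexing, so `pi[q-1]` plays the role of Π[q]; the function names are ours):

```python
def compute_prefix(pattern):
    """pi[q] = length of the longest proper prefix of pattern[:q+1]
    that is also a suffix of it."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]               # fall back to a shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_search(text, pattern):
    """Return all shifts at which pattern occurs in text."""
    pi = compute_prefix(pattern)
    shifts, q = [], 0                   # q = number of characters matched
    for i, c in enumerate(text):
        while q > 0 and pattern[q] != c:
            q = pi[q - 1]               # next character does not match
        if pattern[q] == c:
            q += 1                      # next character matches
        if q == len(pattern):           # all of the pattern matched
            shifts.append(i - len(pattern) + 1)
            q = pi[q - 1]               # look for the next match
    return shifts

print(kmp_search("AAAABAAAAABBBAAAAB", "AAAB"))   # → [1, 7, 14]
```

The printed shifts match the example output given below for this input.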
2. The KMP Matcher: With the string 'S', the pattern 'p', and the prefix function Π as inputs, it finds the occurrences of 'p' in 'S' and returns the shifts of 'p' at which occurrences are found. Knuth-Morris-Pratt (KMP) checks the characters from left to right.

Example:
Input: main string "AAAABAAAAABBBAAAAB", pattern "AAAB"
Output:
Pattern found at location: 1
Pattern found at location: 7
Pattern found at location: 14

Algorithm for Computing the Prefix Function (Π):
COMPUTE-PREFIX-FUNCTION (P)
1. m ← length[P]          // 'P' is the pattern to be matched
2. Π[1] ← 0
3. k ← 0
4. for q ← 2 to m
5.     do while k > 0 and P[k + 1] ≠ P[q]
6.         do k ← Π[k]
7.        if P[k + 1] = P[q]
8.            then k ← k + 1
9.        Π[q] ← k
10. return Π

Running Time Analysis (Prefix Function): In the pseudocode above, the for loop from step 4 to step 9 runs m − 1 times. Steps 1 to 3 take constant time, and an amortized argument bounds the total work of the inner while loop, so the running time of computing the prefix function is O(m).

Example: Compute Π for a given pattern 'p'.

KMP Matcher: With the pattern 'p', the string 'S', and the prefix function Π as input, it finds the matches of p in S.
KMP-MATCHER (T, P)
1. n ← length[T]
2. m ← length[P]
3. Π ← COMPUTE-PREFIX-FUNCTION (P)
4. q ← 0                      // number of characters matched
5. for i ← 1 to n             // scan the text from left to right
6.     do while q > 0 and P[q + 1] ≠ T[i]
7.         do q ← Π[q]        // next character does not match
8.        if P[q + 1] = T[i]
9.            then q ← q + 1  // next character matches
10.       if q = m            // is all of P matched?
11.           then print "Pattern occurs with shift" i − m
12.               q ← Π[q]    // look for the next match

Running Time Analysis (KMP Matcher): The for loop beginning in step 5 runs n times, i.e., as long as the length of the string 'S'. Since steps 1 to 4 take constant time, the running time is dominated by this for loop. Thus the running time of the matching function is O(n).

If m is the length of the pattern and n is the length of the text to be searched, then:

Algorithm                               Matching time
Naive string matching algorithm         O((n − m + 1)·m)
Rabin-Karp string matching algorithm    Average: O(n + m); Worst: O((n − m + 1)·m)
Knuth-Morris-Pratt algorithm            O(n)

Introduction to Complexity Classes:
Definition of the NP class: The set of decision problems for which no polynomial-time solution is known, but whose answers can be verified in polynomial time. The NP class contains the P class as a subset. NP problems tend to be hard to solve.
Note: The term "NP" does not mean "not polynomial." It stands for "non-deterministic polynomial": a non-deterministic machine may explore many possible computations for the same input.
Definition of the P class: The set of decision problems that can be solved in polynomial time. P problems are considered easy to solve.
Definition of polynomial time: An algorithm runs in polynomial time if it produces its output within a number of steps bounded by a polynomial in the size of the input.
Definition of non-polynomial time: An algorithm runs in non-polynomial time if its running time is not bounded by any polynomial in the input size; it still produces an output, but no polynomial time bound holds.
Definition of a Decision Problem: A problem is called a decision problem if its output is a simple "yes" or "no" (equivalently true/false, 1/0, accept/reject). We can phrase many optimization problems as decision problems; for example, given a graph G = (V, E), does G contain a Hamiltonian cycle?

Definition of the NP-hard class: A problem falls in the NP-hard class if it satisfies the following: if we can solve this problem in polynomial time, then we can solve all NP problems in polynomial time. Equivalently, every problem in NP can be converted (reduced) to it in polynomial time.

Definition of the NP-complete class: A problem is NP-complete if it is in NP and it is NP-hard.

Introduction to NP-Completeness:
A decision problem L is NP-hard if L' ≤p L for all L' ∈ NP.
Definition: L is NP-complete if L ∈ NP and L' ≤p L for some known NP-complete problem L'.
Given this formal definition, the complexity classes are:
P: the set of decision problems that are solvable in polynomial time.
NP: the set of decision problems that can be verified in polynomial time.
NP-hard: L is NP-hard if for all L' ∈ NP, L' ≤p L. Thus, if we can solve L in polynomial time, we can solve all NP problems in polynomial time.
NP-complete: L is NP-complete if L ∈ NP and L is NP-hard.
If any NP-complete problem is solvable in polynomial time, then every NP-complete problem is solvable in polynomial time. Conversely, if we can prove that any NP-complete problem cannot be solved in polynomial time, then no NP-complete problem can be solved in polynomial time.

Representation of Complexity Classes: Pictorial representation of the classes NP, NP-hard, and NP-complete (with P inside NP).

Randomized Algorithms:
A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic.
The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of the random bits. Formally, the algorithm's performance is a random variable determined by the random bits; thus either the running time, or the output (or both), are random variables. One distinguishes algorithms that use the random input to reduce the expected running time or memory usage but always terminate with a correct result in a bounded amount of time (Las Vegas algorithms).

Randomized Version of Quicksort:
Instead of always using A[r] as the pivot, we use a randomly chosen element from the subarray A[p..r]. We do so by exchanging element A[r] with an element chosen at random from A[p..r]. This modification, in which we randomly sample the range p, ..., r, ensures that the pivot element x = A[r] is equally likely to be any of the r − p + 1 elements in the subarray. Because the pivot element is randomly chosen, we expect the split of the input array to be reasonably well balanced on average. The changes to PARTITION and QUICKSORT are small; in the new partition procedure, we simply perform the swap before actually partitioning:

RANDOMIZED-PARTITION (A, p, r)
1. i ← RANDOM(p, r)
2. exchange A[r] ↔ A[i]
3. return PARTITION(A, p, r)

RANDOMIZED-QUICKSORT (A, p, r)
1. if p < r
2.     then q ← RANDOMIZED-PARTITION(A, p, r)
3.          RANDOMIZED-QUICKSORT(A, p, q − 1)
4.          RANDOMIZED-QUICKSORT(A, q + 1, r)

Analysis of Randomized Quicksort: We discuss the worst-case and expected running time of the randomized quicksort algorithm.
Worst-case complexity: Let T(n) be the worst-case time for the procedure QUICKSORT on an input of size n.
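The two procedures above can be sketched in Python before we analyze them. This is a hedged sketch using a Lomuto-style PARTITION and 0-based indices, not the textbook's exact code:

```python
import random

def partition(a, p, r):
    """Lomuto partition around pivot a[r]; returns the pivot's final index."""
    x = a[r]
    i = p - 1
    for j in range(p, r):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[r] = a[r], a[i + 1]
    return i + 1

def randomized_partition(a, p, r):
    i = random.randint(p, r)            # RANDOM(p, r): uniform pivot choice
    a[r], a[i] = a[i], a[r]             # exchange A[r] <-> A[i]
    return partition(a, p, r)

def randomized_quicksort(a, p, r):
    if p < r:
        q = randomized_partition(a, p, r)
        randomized_quicksort(a, p, q - 1)
        randomized_quicksort(a, q + 1, r)

data = [5, 2, 9, 1, 7, 3]
randomized_quicksort(data, 0, len(data) - 1)
print(data)   # → [1, 2, 3, 5, 7, 9]
```

Note that the sorted output is deterministic; only the sequence of pivot choices (and hence the running time) is random.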
We have the recurrence
T(n) = max over 0 ≤ q ≤ n−1 of (T(q) + T(n − 1 − q)) + Θ(n),
where the parameter q ranges from 0 to n − 1 because the procedure PARTITION produces two subproblems with total size n − 1. We guess that T(n) ≤ cn² for some constant c. Substituting this guess into the recurrence, we obtain
T(n) ≤ max over 0 ≤ q ≤ n−1 of (cq² + c(n − 1 − q)²) + Θ(n) ≤ cn²
for a sufficiently large constant c, since q² + (n − 1 − q)² is maximized at the endpoints q = 0 or q = n − 1. Hence the worst-case running time is O(n²).

Expected running time: If, in each level of recursion, the split induced by RANDOMIZED-PARTITION puts any constant fraction of the elements on one side of the partition, then the recursion tree has depth Θ(lg n) and O(n) work is performed at each level. Even if we add new levels with the most unbalanced split possible between these levels, the total time remains O(n lg n).

Introduction to Approximation Algorithms:
An approximation algorithm is a way of dealing with NP-completeness for optimization problems. This technique does not guarantee the best solution. The goal of an approximation algorithm is to come as close as possible to the optimal value in a reasonable amount of time, which is at most polynomial time. Such algorithms are called approximation algorithms or heuristic algorithms.
For the traveling salesperson problem, the optimization problem is to find the shortest cycle, and the approximation problem is to find a short cycle.
For the vertex cover problem, the optimization problem is to find the vertex cover with the fewest vertices, and the approximation problem is to find a vertex cover with few vertices.

Vertex Cover: Approximation Algorithm
Vertex Cover Problem: A vertex cover of a graph G is a set of vertices such that each edge in G is incident to at least one of these vertices. The decision version of the vertex cover problem was proven NP-complete. Now we want to solve the optimization version of the problem, i.e., find a minimum-size vertex cover of a given graph. We call such a vertex cover an optimal vertex cover C*.

APPROX-VERTEX-COVER (G = (V, E))
{
    C = ∅;  E' = E;
    while E' is not empty do
    {
        let (u, v) be any edge in E';   (*)
        add u and v to C;
        remove from E' all edges incident to u or v;
    }
    return C;
}

The idea is to take an edge (u, v), put both of its vertices into C, and remove all edges incident to u or v. We carry on until all edges have been removed. C is then a vertex cover, but how good is it? (For the example graph, VC = {b, c, d, f, g}.)

Set Cover Problem: Approximation Algorithm
Given a universe U of n elements and a collection of subsets of U, say S = {S1, S2, ..., Sm}, where every subset Si has an associated cost, find a minimum-cost subcollection of S that covers all elements of U.
Example:
U = {1, 2, 3, 4, 5}, S = {S1, S2, S3}
S1 = {4, 1, 3}, Cost(S1) = 5
S2 = {2, 5}, Cost(S2) = 10
S3 = {1, 4, 3, 2}, Cost(S3) = 3
Output: The minimum cost of a set cover is 13, and the set cover is {S2, S3}.
There are two possible set covers: {S1, S2} with cost 15 and {S2, S3} with cost 13.

Set Cover is NP-hard: No polynomial-time exact algorithm is known for this problem, as it is NP-hard. There is, however, a polynomial-time greedy approximation algorithm, which provides an O(log n) approximation.
Greedy Approximation Algorithm: Let U be the universe of elements, {S1, S2, ..., Sm} a collection of subsets of U, and Cost(S1), Cost(S2), ..., Cost(Sm) the costs of the subsets. At each step, add to the cover the set whose cost per newly covered element is minimum.
Example: Let us use the instance above to understand the greedy algorithm.
First iteration: I = {}
Per-new-element cost for S1 = Cost(S1)/|S1 − I| = 5/3
Per-new-element cost for S2 = Cost(S2)/|S2 − I| = 10/2
Per-new-element cost for S3 = Cost(S3)/|S3 − I| = 3/4
Since S3 has the minimum value, S3 is added and I becomes {1, 4, 3, 2}.
Second iteration: I = {1, 4, 3, 2}
Note that S1 adds no new element to I (|S1 − I| = 0), so it is skipped.
Per-new-element cost for S2 = Cost(S2)/|S2 − I| = 10/1
S2 adds only the element 5 to I, so S2 is added and the cover is {S3, S2} with cost 13.
The greedy algorithm happens to find the optimal solution for this example, but it does not provide the optimal solution in general.

Fast Fourier Transform (FFT)
Complex roots of unity: A complex nth root of unity is a complex number ω such that ω^n = 1. There are exactly n complex nth roots of unity: e^(2πik/n) for k = 0, 1, ..., n − 1. To interpret this formula, we use the definition of the exponential of a complex number:
e^(iu) = cos(u) + i·sin(u).
The value ω_n = e^(2πi/n) is the principal nth root of unity; all other complex nth roots of unity are powers of ω_n.

Recursive FFT: By using a method known as the fast Fourier transform (FFT), which takes advantage of the special properties of the complex roots of unity, we can compute the discrete Fourier transform DFT_n(a) in time Θ(n lg n), as opposed to the Θ(n²) time of the straightforward method. We assume throughout that n is an exact power of 2. The FFT method employs a divide-and-conquer strategy, using the even-indexed and odd-indexed coefficients of A(x) separately to define two new polynomials A[0](x) and A[1](x) of degree-bound n/2:
A[0](x) = a_0 + a_2·x + a_4·x² + ... + a_{n−2}·x^(n/2−1)
A[1](x) = a_1 + a_3·x + a_5·x² + ... + a_{n−1}·x^(n/2−1)
so that A(x) = A[0](x²) + x·A[1](x²).

Complexity: To determine the running time of the procedure RECURSIVE-FFT, note that, exclusive of the recursive calls, each invocation takes time Θ(n), where n is the length of the input vector. The recurrence for the running time is therefore
T(n) = 2T(n/2) + Θ(n) = Θ(n lg n).
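The recursive FFT described above can be sketched in Python. This is a minimal version for n an exact power of 2, using the principal root ω_n = e^(2πi/n); the name `recursive_fft` is ours:

```python
import cmath

def recursive_fft(a):
    """Compute DFT_n(a) for n = len(a) an exact power of 2."""
    n = len(a)
    if n == 1:
        return a[:]                      # base case: DFT of a single point
    wn = cmath.exp(2j * cmath.pi / n)    # principal nth root of unity
    w = 1
    y0 = recursive_fft(a[0::2])          # even-indexed coefficients: A[0]
    y1 = recursive_fft(a[1::2])          # odd-indexed coefficients:  A[1]
    y = [0] * n
    for k in range(n // 2):
        y[k] = y0[k] + w * y1[k]         # A(w^k) = A[0](w^2k) + w^k A[1](w^2k)
        y[k + n // 2] = y0[k] - w * y1[k]  # uses w^(k+n/2) = -w^k
        w *= wn
    return y

# DFT of [1, 1, 1, 1]: all energy lands in the zero-frequency bin
print(recursive_fft([1, 1, 1, 1]))   # ≈ [4, 0, 0, 0] (up to roundoff)
```

Each level of recursion does Θ(n) butterfly work over Θ(lg n) levels, matching the T(n) = 2T(n/2) + Θ(n) recurrence above.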