
UNIT-5 DAA Complete Notes

DR. APJ ABDUL KALAM TECHNICAL UNIVERSITY
Branch - IT
Design and Analysis of Algorithms
UNIT-5
String Matching Algorithms
Basic Definition of String Matching
String Matching Algorithm is also called "String Searching Algorithm."
String matching problem: given a text array T[1..n] of n characters and a
pattern array P[1..m] of m characters, the problem is to find every integer s,
called a valid shift, where 0 ≤ s ≤ n-m and T[s+1..s+m] = P[1..m].
In other words, we determine whether P occurs in T, i.e., whether P is a substring
of T. The characters of P and T are drawn from some finite alphabet, such as
{0, 1} or {A, B, ..., Z, a, b, ..., z}.
Given a string T[1..n], a substring is written T[i..j] for some 1 ≤ i ≤ j ≤ n: the
string formed by the characters of T from index i to index j, inclusive. In
particular, every string is a substring of itself (take i = 1 and j = n).
Different Types of String Matching Algorithms
Algorithms used for String Matching:
The following methods are used for string matching:
• The Naive String Matching Algorithm
• Finite Automata
• The Rabin-Karp-Algorithm
• The Knuth-Morris-Pratt Algorithm
• The Boyer-Moore Algorithm
String Matching Problem: Given a text and a pattern, we have to search for the
pattern in the text and return every valid shift at which the pattern occurs in
the text.
Naïve or Standard String Matching Algorithm:
Naïve or Standard Matching Algorithm: The naïve approach tests all possible
placements of pattern P[1..m] relative to text T[1..n]. We try each shift
s = 0, 1, ..., n-m successively and, for each shift s, compare T[s+1..s+m]
with P[1..m].
NAIVE-STRING-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. for s ← 0 to n - m
4.     do if P[1..m] = T[s+1..s+m]
5.         then print "Pattern occurs with shift" s
Analysis: The for loop in lines 3 to 5 executes n - m + 1 times (we need at least m
characters remaining at each shift), and each iteration performs up to m comparisons.
So the total complexity is O((n-m+1)·m).
Example on Naïve String Matching Algorithm:
Let T = 1011101110 and P = 111. Find the valid shifts using the naïve string
matching algorithm.
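The naïve matcher can be sketched directly in Python (0-indexed shifts, so shift s compares text[s..s+m-1] with the pattern):

```python
def naive_string_matcher(text, pattern):
    """Try every shift s and compare up to m characters directly."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # shifts 0, 1, ..., n-m
        if text[s:s + m] == pattern:    # up to m character comparisons
            shifts.append(s)
    return shifts

# For T = 1011101110 and P = 111, the pattern occurs at shifts 2 and 6
print(naive_string_matcher("1011101110", "111"))  # → [2, 6]
```

So for the example above, the valid shifts are s = 2 and s = 6 (0-indexed).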
Rabin Karp String Matching Algorithm:
Rabin-Karp Algorithm: The Rabin-Karp string matching algorithm calculates a hash
value for the pattern, as well as for each m-character substring of the text to
be compared. If the hash values are unequal, the algorithm computes the hash
value of the next m-character substring.
If the hash values are equal, the algorithm compares the pattern with the
m-character substring character by character. In this way, there is only one hash
comparison per text position, and character matching is required only when the
hash values match. We can compute the numerical (hash) values using Horner's rule.
RABIN-KARP-MATCHER (T, P, d, q)
1. n ← length [T]
2. m ← length [P]
3. h ← d^(m-1) mod q
4. p ← 0
5. t_0 ← 0
6. for i ← 1 to m
7.     do p ← (d·p + P[i]) mod q
8.        t_0 ← (d·t_0 + T[i]) mod q
9. for s ← 0 to n - m
10.    do if p = t_s
11.        then if P[1..m] = T[s+1..s+m]
12.            then print "Pattern occurs with shift" s
13.       if s < n - m
14.           then t_(s+1) ← (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q
The hash value is rolled from shift s to shift s+1 using Horner's rule:
t_(s+1) ← (d·(t_s − T[s+1]·h) + T[s+m+1]) mod q
Example on Rabin Karp String Matching Algorithm:
Example: For string matching, working module q = 11, how many spurious hits
does the Rabin-Karp matcher encounters in Text T = 3141592653589793 when
looking for the pattern P=26.
Solution: Given
Text T = 3141592653589793
Pattern P=26.
Working Modulo q = 11
We calculate P mod q = 26 mod 11 = 4.
Now we scan the text, looking for windows whose hash value equals P mod q.
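A small Python sketch of the rolling hash (base d = 10 for decimal digits, modulus q = 11; the helper name rabin_karp_hits is our own) shows which windows are valid matches and which are spurious hits:

```python
def rabin_karp_hits(text, pattern, d=10, q=11):
    """Return (valid shifts, spurious-hit shifts) for digit strings."""
    n, m = len(text), len(pattern)
    h = pow(d, m - 1, q)                      # d^(m-1) mod q
    p = t = 0
    for i in range(m):                        # preprocessing via Horner's rule
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    valid, spurious = [], []
    for s in range(n - m + 1):
        if p == t:                            # hash hit: verify the characters
            (valid if text[s:s + m] == pattern else spurious).append(s)
        if s < n - m:                         # roll the hash to the next window
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return valid, spurious

valid, spurious = rabin_karp_hits("3141592653589793", "26")
print(valid, spurious)  # → [6] [3, 4, 5]
```

So the matcher encounters 3 spurious hits: the windows 15, 59, and 92 each hash to 4 mod 11, the same as P = 26, while the single valid match 26 occurs at shift 6 (0-indexed).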
We keep traversing the text, comparing P mod q with the hash t_s of each window
T[s+1..s+m]. When the hash values are equal, there are two cases.
Case 1: If the hash values match and P[1..m] = T[s+1..s+m], then s is a valid
shift (an exact match).
Case 2: If the hash values match but the strings differ, it is a spurious hit,
and we continue searching for the next match.
When the hash values differ, no character comparison is needed; we move to the
next window of the text with the help of Horner's rule.
Complexity: The running time of RABIN-KARP-MATCHER in the worst case is
O((n-m+1)·m), but it has a good average-case running time.
If the expected number of valid shifts is small (O(1)) and the prime q is chosen
to be sufficiently large, then the Rabin-Karp algorithm can be expected to run in
time O(n+m), plus the time required to process spurious hits.
Example on Rabin Karp String Matching Algorithm:
Example 2: Suppose Text T = 2359023141526739921 , Pattern P= 31415 and
Working modulo q=13. Find the valid match using Rabin Karp Algorithm.
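Under the same assumptions (decimal digit strings, base d = 10; the function name is our own), the sketch below reports the valid shifts and the spurious hits for Example 2:

```python
def rabin_karp(text, pattern, d=10, q=13):
    """Return (valid shifts, spurious-hit shifts) for digit strings."""
    n, m = len(text), len(pattern)
    h = pow(d, m - 1, q)                      # d^(m-1) mod q
    p = t = 0
    for i in range(m):                        # Horner's rule preprocessing
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    valid, spurious = [], []
    for s in range(n - m + 1):
        if p == t:                            # hash hit: verify characters
            (valid if text[s:s + m] == pattern else spurious).append(s)
        if s < n - m:                         # roll to the next window
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return valid, spurious

print(rabin_karp("2359023141526739921", "31415"))  # → ([6], [12])
```

Here 31415 mod 13 = 7, and the window 67399 at shift 12 also hashes to 7, giving one spurious hit; the valid match is at shift 6 (0-indexed).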
String Matching: Knuth-Morris-Pratt Algorithm
Knuth-Morris-Pratt Algorithm: Knuth, Morris, and Pratt introduced a linear-time
algorithm for the string matching problem.
A matching time of O(n) is achieved by avoiding comparisons with characters of 'S'
that have previously been involved in a comparison with some element of the
pattern 'p'; i.e., backtracking on the string 'S' never occurs.
Components of KMP Algorithm:
1. The Prefix Function (Π): The prefix function Π for a pattern encapsulates
knowledge about how the pattern matches against shifts of itself.
This information can be used to avoid useless shifts of the pattern 'p'. In other
words, it enables us to avoid backtracking on the string 'S'.
2. The KMP Matcher: With string 'S', pattern 'p', and prefix function 'Π' as
inputs, it finds the occurrences of 'p' in 'S' and returns the shifts of 'p' at
which the occurrences are found.
Knuth-Morris-Pratt (KMP) is an algorithm which checks the characters from left to
right.
For Example:
Input: Main String: "AAAABAAAAABBBAAAAB", Pattern: "AAAB"
Output:
Pattern found at location: 1
Pattern found at location: 7
Pattern found at location: 14
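A compact Python sketch of the two KMP components (0-indexed, so pi[q] corresponds to Π[q+1] in the pseudocode's numbering) reproduces the locations above:

```python
def compute_prefix_function(pattern):
    """pi[q] = length of the longest proper prefix of pattern[0..q]
    that is also a suffix of pattern[0..q]."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(text, pattern):
    pi = compute_prefix_function(pattern)
    q, shifts = 0, []                 # q = number of characters matched
    for i, c in enumerate(text):      # scan the text left to right
        while q > 0 and pattern[q] != c:
            q = pi[q - 1]             # fall back; i never moves backwards
        if pattern[q] == c:
            q += 1
        if q == len(pattern):         # full match ending at position i
            shifts.append(i - len(pattern) + 1)
            q = pi[q - 1]             # look for the next match
    return shifts

print(kmp_matcher("AAAABAAAAABBBAAAAB", "AAAB"))  # → [1, 7, 14]
```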
Algorithm for Computing the Prefix Function (Π):
COMPUTE-PREFIX-FUNCTION (P)
1. m ← length [P]          // 'P' is the pattern to be matched
2. Π[1] ← 0
3. k ← 0
4. for q ← 2 to m
5.     do while k > 0 and P[k+1] ≠ P[q]
6.            do k ← Π[k]
7.        if P[k+1] = P[q]
8.            then k ← k + 1
9.        Π[q] ← k
10. return Π
String Matching: Knuth-Morris-Pratt Algorithm
Running Time Analysis for Computing the Prefix Function:
In the above pseudocode for calculating the prefix function, the for loop from step 4 to
step 9 runs m - 1 times, and steps 1 to 3 take constant time. Hence the running time of
computing the prefix function is O(m). Example: Compute Π for the pattern 'p' below:
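Since the worked example from the slides is not reproduced in the text, the sketch below computes Π for the classic pattern 'ababaca' (an assumed example); the list is 0-indexed, so pi[q] stands for Π[q+1] in the pseudocode's numbering:

```python
def compute_prefix_function(pattern):
    m = len(pattern)
    pi = [0] * m        # pi[q]: length of the longest proper prefix of
    k = 0               # pattern[0..q] that is also a suffix of it
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi

print(compute_prefix_function("ababaca"))  # → [0, 0, 1, 2, 3, 0, 1]
```

For example, pi[4] = 3 because 'aba' is the longest proper prefix of 'ababa' that is also a suffix of it.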
KMP Matcher: KMP Matcher with the pattern 'p,' the string 'S' and prefix function 'Π' as
input, finds a match of p in S.
KMP-MATCHER (T, P)
1. n ← length [T]
2. m ← length [P]
3. Π ← COMPUTE-PREFIX-FUNCTION (P)
4. q ← 0                        // number of characters matched
5. for i ← 1 to n               // scan the text from left to right
6.     do while q > 0 and P[q+1] ≠ T[i]
7.            do q ← Π[q]       // next character does not match
8.        if P[q+1] = T[i]
9.            then q ← q + 1    // next character matches
10.       if q = m              // is all of P matched?
11.           then print "Pattern occurs with shift" i - m
12.                q ← Π[q]     // look for the next match
String Matching: Knuth-Morris-Pratt Algorithm
Running Time Analysis: The for loop beginning in step 5 runs n times, i.e., as long as the
length of the string 'S'. Since steps 1 to 4 take constant time, the running time is
dominated by this for loop. Thus the running time of the matching function is O(n).
If m is the length of the pattern and n is the length of the text to be searched, then:

Algorithm                                Matching time
Naïve string matching algorithm          O((n-m+1)·m)
Rabin-Karp string matching algorithm     Average: O(n+m); Worst: O((n-m+1)·m)
Knuth-Morris-Pratt algorithm             O(n)
Introduction to Complexity Classes:
Definition of NP class Problem: NP is the set of decision problems for which no
polynomial-time solving algorithm is known, but for which a proposed solution can
be verified in polynomial time. NP contains the class P as a subset; NP problems
are, in general, hard to solve.
Note: The term "NP" does not mean "not polynomial." It stands for
"non-deterministic polynomial": a problem is in NP if a non-deterministic machine
can solve it in polynomial time.
Definition of P class Problem: P is the set of decision problems that can be
solved, i.e., an output produced, in polynomial time. P problems are considered
easy to solve.
Definition of Polynomial time: An algorithm runs in polynomial time if the time
needed to produce the output is bounded by a polynomial in the size of the input.
Definition of Non-Polynomial time: If an output is produced for the given input
but the running time is not bounded by any polynomial in the input size, the
algorithm runs in non-polynomial time: an output is still produced, but no fixed
polynomial time bound exists.
Definition of Decision Based Problem: A problem is called a decision problem if
its output is a simple "yes" or "no" (you may also think of this as true/false,
0/1, or accept/reject). We will phrase many optimization problems as decision
problems. For example: given a graph G = (V, E), does G contain a Hamiltonian
cycle?
Definition of NP-hard class: A problem L is NP-hard if every problem in NP can be
reduced (converted) to L in polynomial time. Hence, if we could solve L in
polynomial time, we could solve all NP problems in polynomial time.
Definition of NP-complete class: A problem is NP-complete if it is in NP and it
is NP-hard.
Introduction to NP Completeness:
NP-Completeness: A decision problem L is NP-Hard if L' ≤p L for all L' ϵ NP.
Definition: L is NP-complete if L ϵ NP and L' ≤p L for some known NP-complete
problem L'. Given this formal definition, the complexity classes are:
P: is the set of decision problems that are solvable in polynomial time.
NP: is the set of decision problems that can be verified in polynomial time.
NP-Hard: L is NP-hard if for all L' ϵ NP, L' ≤p L. Thus if we can solve L in polynomial
time, we can solve all NP problems in polynomial time.
NP-Complete: L is NP-complete if L ϵ NP and L is NP-hard.
If any NP-complete problem is solvable in polynomial time, then every NP-complete
problem is also solvable in polynomial time. Conversely, if we can prove that any
NP-complete problem cannot be solved in polynomial time, then no NP-complete
problem can be solved in polynomial time.
Representation of Complexity Classes:
(Figure omitted: pictorial representation of the complexity classes P, NP, NP-hard, and NP-complete.)
Randomized Algorithms
A Randomized algorithm is an algorithm that employs a degree of randomness
as part of its logic.
The algorithm typically uses uniformly random bits as an auxiliary input to guide
its behavior, in the hope of achieving good performance in the "average case"
over all possible choices of random bits.
Formally, the algorithm's performance will be a random variable determined by
the random bits; thus either the running time, or the output (or both) are
random variables.
One distinguishes between algorithms that use the random input to reduce the
expected running time or memory usage but always terminate with a correct result
in a bounded amount of time (Las Vegas algorithms), and algorithms that may
produce an incorrect result with some probability (Monte Carlo algorithms).
Randomized version of Quick Sort:
Randomized version of Quick sort: Instead of always using A[r] as the pivot, we
will use a randomly chosen element from the subarray A[p..r].
We do so by exchanging element A[r] with an element chosen at random from
A[p..r].
This modification, in which we randomly sample the range p, ..., r, ensures that
the pivot element x = A[r] is equally likely to be any of the r - p + 1 elements
in the subarray.
Because the pivot element is randomly chosen, we expect the split of the input
array to be reasonably well balanced on average.
The changes to PARTITION and QUICKSORT are small.
In the new partition procedure, we simply implement the swap before actually
partitioning:
RANDOMIZED-PARTITION (A, p, r)
1. i ← RANDOM(p, r)
2. exchange A[r] ↔ A[i]
3. return PARTITION(A, p, r)

RANDOMIZED-QUICKSORT (A, p, r)
1. if p < r
2.     then q ← RANDOMIZED-PARTITION(A, p, r)
3.          RANDOMIZED-QUICKSORT(A, p, q - 1)
4.          RANDOMIZED-QUICKSORT(A, q + 1, r)
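The pseudocode translates almost line for line into Python (with the standard PARTITION procedure inlined; the input array is an assumed example):

```python
import random

def randomized_partition(A, p, r):
    i = random.randint(p, r)       # pick the pivot uniformly from A[p..r]
    A[r], A[i] = A[i], A[r]        # move it to the end, then partition as usual
    x = A[r]
    i = p - 1
    for j in range(p, r):
        if A[j] <= x:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[r] = A[r], A[i + 1]
    return i + 1

def randomized_quicksort(A, p, r):
    if p < r:
        q = randomized_partition(A, p, r)
        randomized_quicksort(A, p, q - 1)
        randomized_quicksort(A, q + 1, r)

data = [13, 19, 9, 5, 12, 8, 7, 4, 21, 2, 6, 11]
randomized_quicksort(data, 0, len(data) - 1)
print(data)  # → [2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 19, 21]
```

Whatever pivots the random choices produce, the output is the sorted array; only the running time is a random variable.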
Analysis of Randomized Quicksort: We now discuss the worst-case and expected
running time of the randomized quicksort algorithm.
Worst case complexity: Let T(n) be the worst-case time for the procedure
QUICKSORT on an input of size n. We have the recurrence
T(n) = max_{0 ≤ q ≤ n-1} (T(q) + T(n-q-1)) + Θ(n),
where the parameter q ranges from 0 to n - 1 because the procedure PARTITION
produces two subproblems with total size n - 1. We guess that T(n) ≤ c·n² for
some constant c. Substituting this guess into the recurrence and verifying it,
we obtain T(n) = O(n²) in the worst case.
Expected running time: If, in each level of recursion, the split induced by
RANDOMIZED-PARTITION puts any constant fraction of the elements on one side of
the partition, then the recursion tree has depth Θ(lg n), and O(n) work is
performed at each level.
Even if the most unbalanced splits possible are interleaved between these
levels, the total time remains O(n lg n).
Introduction to Approximation Algorithms:
An approximation algorithm is a way of dealing with NP-completeness for
optimization problems.
This technique does not guarantee the optimal solution.
The goal of an approximation algorithm is to come as close as possible to the
optimal value in a reasonable amount of time, which is at most polynomial time.
Such algorithms are called approximation algorithms or heuristic algorithms.
For the traveling salesperson problem, the optimization problem is to find
the shortest cycle, and the approximation problem is to find a short cycle.
For the vertex cover problem, the optimization problem is to find the vertex
cover with the fewest vertices, and the approximation problem is to find a
vertex cover with few vertices.
Vertex Cover: Approximation Algorithm
Vertex Cover Problem: A Vertex Cover of a graph G is a set of vertices such that
each edge in G is incident to at least one of these vertices.
The decision vertex-cover problem was proven NPC. Now, we want to solve the
optimal version of the vertex cover problem, i.e., we want to find a minimum size
vertex cover of a given graph.
We call such vertex cover an optimal vertex cover C*.
APPROX-VERTEX-COVER (G = (V, E))
{
    C = ∅;
    E' = E;
    while E' is not empty do
    {
        let (u, v) be any edge in E';
        add u and v to C;
        remove from E' all edges incident to u or v;
    }
    return C;
}
The idea is to take edges (u, v) one by one, put both endpoints into C, and
remove all the edges incident to u or v. We carry on until all edges have been
removed. C is then a vertex cover; but how good is it? Since any optimal cover
must contain at least one endpoint of each chosen edge, and the chosen edges
share no endpoints, |C| is at most twice the optimal size: this is a
2-approximation.
For the example graph in the original slides, the algorithm returns
VC = {b, c, d, f, g}.
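A sketch of APPROX-VERTEX-COVER in Python; the edge list below is an assumed example graph (the slides' own graph is not reproduced in the text), so the exact cover returned may differ from the one quoted above, but it is always a valid cover:

```python
def approx_vertex_cover(edges):
    """Pick an arbitrary remaining edge, take both endpoints,
    delete every edge touching them; repeat until no edges remain."""
    cover = set()
    remaining = list(edges)
    while remaining:
        u, v = remaining[0]            # any edge in E'
        cover.update((u, v))           # add both endpoints to C
        remaining = [e for e in remaining if u not in e and v not in e]
    return cover

# assumed example graph on vertices a..g
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"),
         ("d", "e"), ("d", "f"), ("d", "g"), ("e", "f")]
cover = approx_vertex_cover(edges)
# every edge has at least one endpoint in the cover
print(all(u in cover or v in cover for u, v in edges))  # → True
```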
Set Cover Problem : Approximate Algorithm
Given a universe U of n elements and a collection of subsets of U, say
S = {S1, S2, ..., Sm}, where every subset Si has an associated cost, find a
minimum-cost subcollection of S that covers all elements of U.
Example:
U = {1,2,3,4,5}
S = {S1,S2,S3}
S1 = {4,1,3}, Cost(S1) = 5
S2 = {2,5}, Cost(S2) = 10
S3 = {1,4,3,2}, Cost(S3) = 3
Output: The minimum cost of a set cover is 13, and the set cover is {S2, S3}.
There are two possible set covers: {S1, S2} with cost 15 and {S2, S3} with
cost 13.
Set Cover is NP-Hard: No polynomial-time exact algorithm is known for this
problem, as it is a known NP-hard problem. There is, however, a polynomial-time
greedy approximation algorithm: the greedy algorithm provides an
O(log n)-approximation.
Greedy Approximation Algorithm: Let U be the universe of elements,
{S1, S2, ..., Sm} be a collection of subsets of U, and Cost(S1), Cost(S2), ...,
Cost(Sm) be the costs of the subsets.
U = {1,2,3,4,5}, S = {S1,S2,S3}
S1 = {4,1,3}, Cost(S1) = 5
S2 = {2,5}, Cost(S2) = 10
S3 = {1,4,3,2}, Cost(S3) = 3
Example: Let us consider the above example to understand Greedy Algorithm.
First Iteration: I = {}
The per new element cost for S1 = Cost(S1)/|S1 – I| = 5/3
The per new element cost for S2 = Cost(S2)/|S2 – I| = 10/2
The per new element cost for S3 = Cost(S3)/|S3 – I| = 3/4
Since S3 has the minimum cost per new element, S3 is added, and I becomes {1,4,3,2}.
Second Iteration: I = {1,4,3,2}. The per new element cost for
S1 = Cost(S1)/|S1 – I| is undefined, since S1 does not add any new element to I,
so S1 is skipped. The per new element cost for S2 = Cost(S2)/|S2 – I| = 10/1;
S2 adds only the element 5. So S2 is added, I becomes {1,2,3,4,5} = U, and the
total cost is 3 + 10 = 13.
The greedy algorithm provides the optimal solution for the above example, but it
may not provide the optimal solution all the time.
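The greedy rule "pick the subset with the minimum cost per newly covered element" can be sketched as follows:

```python
def greedy_set_cover(universe, subsets, costs):
    """Repeatedly pick the subset minimizing cost per newly covered element."""
    covered = set()
    chosen, total = [], 0
    while covered != universe:
        # cost-effectiveness = cost / number of new elements (skip useless sets)
        best = min((s for s in subsets if subsets[s] - covered),
                   key=lambda s: costs[s] / len(subsets[s] - covered))
        chosen.append(best)
        total += costs[best]
        covered |= subsets[best]
    return chosen, total

U = {1, 2, 3, 4, 5}
subsets = {"S1": {4, 1, 3}, "S2": {2, 5}, "S3": {1, 4, 3, 2}}
costs = {"S1": 5, "S2": 10, "S3": 3}
print(greedy_set_cover(U, subsets, costs))  # → (['S3', 'S2'], 13)
```

The trace matches the two iterations above: S3 is chosen first (ratio 3/4), then S2 (ratio 10/1), for a total cost of 13.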
Fast Fourier Transform(FFTs)
Complex roots of unity: A complex nth root of unity is a complex number ω such that ω^n = 1.
There are exactly n complex nth roots of unity: e^(2πik/n) for k = 0, 1, ..., n-1.
To interpret this formula, we use the definition of the exponential of a complex number:
e^(iu) = cos(u) + i·sin(u)
The value ω_n = e^(2πi/n) is the principal nth root of unity; all other complex nth roots
of unity are powers of ω_n.
Recursive FFTs: By using a method known as the fast Fourier transform (FFT), which takes
advantage of the special properties of the complex roots of unity, we can compute the
discrete Fourier transform DFT_n(a) in time Θ(n lg n), as opposed to the Θ(n²) time of the
straightforward method. We assume throughout that n is an exact power of 2.
The FFT method employs a divide-and-conquer strategy, using the even-indexed and odd-indexed
coefficients of A(x) separately to define two new polynomials A^[0](x) and A^[1](x) of
degree-bound n/2:
A^[0](x) = a_0 + a_2·x + a_4·x² + ... + a_(n-2)·x^(n/2-1)
A^[1](x) = a_1 + a_3·x + a_5·x² + ... + a_(n-1)·x^(n/2-1)
so that A(x) = A^[0](x²) + x·A^[1](x²).
Complexity: To determine the running time of procedure RECURSIVE-FFT, we note that,
exclusive of the recursive calls, each invocation takes time Θ(n), where n is the length
of the input vector. The recurrence for the running time is therefore
T(n) = 2T(n/2) + Θ(n) = Θ(n lg n).
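The divide-and-conquer scheme above can be transcribed directly into Python as a sketch of RECURSIVE-FFT (n must be a power of 2):

```python
import cmath

def recursive_fft(a):
    """Compute DFT_n(a) of a coefficient vector; n must be a power of 2."""
    n = len(a)
    if n == 1:
        return a[:]
    wn = cmath.exp(2j * cmath.pi / n)     # principal nth root of unity
    w = 1
    y0 = recursive_fft(a[0::2])           # even-indexed coefficients: A^[0]
    y1 = recursive_fft(a[1::2])           # odd-indexed coefficients:  A^[1]
    y = [0] * n
    for k in range(n // 2):
        y[k] = y0[k] + w * y1[k]          # butterfly: combine the two halves
        y[k + n // 2] = y0[k] - w * y1[k]
        w *= wn
    return y

# DFT of the all-ones vector: n at frequency 0, zero elsewhere
y = recursive_fft([1, 1, 1, 1])
print([round(abs(v), 6) for v in y])  # → [4.0, 0.0, 0.0, 0.0]
```

Each invocation does Θ(n) work in the combining loop, matching the recurrence T(n) = 2T(n/2) + Θ(n).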