Construction of Index (Page 197)
• Objective: Given a document, find the number of occurrences of each word in the document.
• Example: Computer Science students know computers and computer languages.
• Keywords: computer, computers, science, students, know, and, languages.

Linear time algorithm
• Let T be the text and |T| the length of T. We can find the number of occurrences of each word in T in O(|T|) time.

Constructing an automaton
[Figure: automaton built from the keywords computer, computers, science, students, know, and, languages; one path of states per word, and words with a common prefix share the states of that prefix.]

Remarks
• There is a final state for each word.
• Each final state has a counter storing the number of times that state has been reached.
• While reading the text, the algorithm creates new states for each new word.
• For a word seen before, we just go through the corresponding states.
• When a final state is reached, add 1 to its counter.

Assignment one (due in week 6 on Monday, 9:20 pm)
• Write a program to convert a text into a vector such that each element of the vector is the number of occurrences of the corresponding keyword.
• Marking scheme:
  – 100% if using the linear time algorithm
  – 20% if using O(nm) time, where n is the length of the text and m is the number of words in the document
• A report describing the program is required:
  – A flow chart of the program
  – Specification of each function
  – Comments for the code

Remarks
• The following part might be hard for you. However, it is useful, and no other part of the course is harder than this part.

String Matching
The problem:
• Input: a text T (a very long string) and a pattern P (a short string).
• Output: the index in T where a copy of P begins.

Some Notations and Terminologies
• |P| and |T|: the lengths of P and T.
• P[i]: the i-th letter of P.
• Prefix of P: a substring of P starting with P[1].
• P[1..i]: the prefix containing the first i letters of P.
• Suffix of P[1..i]: a substring of P[1..i] ending at P[i], e.g. P[3..i], P[5..i] (i>4).
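Going back to the index construction at the start of these notes, the automaton with a counter on each final state can be sketched as a trie. This is only a minimal sketch under stated assumptions: words are taken to be maximal runs of letters, case is ignored (so "Computer" and "computer" count as the same keyword), and the names count_keywords and to_vector are made up for the illustration.

```python
def count_keywords(text):
    """Build a trie over the words of text; each state carries a counter."""
    # Assumption: a state is a dict with outgoing edges and a counter.
    root = {"next": {}, "count": 0}
    state = root
    in_word = False
    # The trailing space makes sure the last word is counted.
    for ch in text.lower() + " ":
        if ch.isalpha():
            # Follow the edge for ch, creating a new state for a new word.
            state = state["next"].setdefault(ch, {"next": {}, "count": 0})
            in_word = True
        else:
            if in_word:
                state["count"] += 1  # final state of a word reached
            state = root
            in_word = False
    return root


def to_vector(root, keywords):
    """Read the counter of each keyword's final state (0 if absent)."""
    vector = []
    for word in keywords:
        state = root
        for ch in word.lower():
            state = state["next"].get(ch)
            if state is None:
                break
        vector.append(state["count"] if state is not None else 0)
    return vector


keywords = ["computer", "computers", "science", "students",
            "know", "and", "languages"]
root = count_keywords(
    "Computer Science students know computers and computer languages.")
print(to_vector(root, keywords))
```

Each character of the text causes one dictionary step, so the counting pass is linear in |T| as long as dictionary lookups are treated as constant time.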
Straightforward method
• Basic idea:
  1. i=1;
  2. Start with T[i] and match P with T[i], T[i+1], ..., T[i+|P|-1]. If all |P| letters match, report i.
  3. Whenever a mismatch is found, i=i+1 and goto 2, as long as i+|P|-1≤|T|.
• Example 1: T=ABABABCCA and P=ABABC

    i=1:  P: ABABC          mismatch at T[5]
          T: ABABABCCA
    i=2:  P:  ABABC         mismatch at T[2]
          T: ABABABCCA
    i=3:  P:   ABABC        P occurs at position 3
          T: ABABABCCA

Analysis
• Step 2 takes O(|P|) comparisons in the worst case.
• Step 2 could be repeated O(|T|) times.
• Total running time is O(|T||P|).

Knuth-Morris-Pratt Method (linear time algorithm)
A better idea
• In step 3, when there is a mismatch we move forward one position (i=i+1).
• We may move more than one position at a time when a mismatch occurs (carefully study the pattern P). For example:

    P: ABABC              P:   ABA
    T: ABABABCCA          T: ABABABCCA
    (mismatch at T[5])    (P moved 2 positions; matching resumes at T[5])

Questions:
• How do we decide how many positions to jump when a mismatch occurs?
• How much can we gain? The running time becomes O(|T|+|P|).

Example 2:

    P: abcabcabcaa
                 |        (mismatch at P[11])
    T: abcabcabcabcaa
                 |
    P:    abcabcab        (P moved 3 positions; comparison resumes back here)

• We can move forward more than one position. Reason?
• Study of pattern P:

    P[1..7]      abcabca          (as a prefix)
    P[1..10]     abcabcabca
    P[1..7]         abcabca       (as a suffix)
    P[1..4]            abca       (as a suffix)

• P[1..7] is the longest prefix that is also a suffix of P[1..10].
• P[1..4] is a prefix that is a suffix of P[1..10], but not the longest.
• Hint: When a mismatch occurs at P[i+1], we want to find the longest proper prefix of P[1..i] which is also a suffix of P[1..i].
• A suffix of P is a substring of P ending at the last position of P.

Failure function
• f(i) is the largest r with r<i such that
      P[1]P[2]...P[r] = P[i-r+1]P[i-r+2]...P[i],
  where the left side is the prefix of length r and the right side is the suffix of P[1]P[2]...P[i] of length r.
• That is, P[1..f(i)] is the longest proper prefix that is a suffix of P[1..i].
• Example 3: P=ababaccc and i=5.

    prefix P[1..3] = aba
    P[1..5]        = ababa
    suffix P[3..5] =   aba        (r=3)

  So f(5)=3.
• Example 4: P=abcabbabcabbaa
  It is easy to verify that f(1)=0, f(2)=0, f(3)=0, f(4)=1, f(5)=2, f(6)=0, f(7)=1, f(8)=2, f(9)=3, f(10)=4, f(11)=5, f(12)=6, f(13)=7, f(14)=1.

The Scan Algorithm
(draw a figure to show)
• i: indicates that T[i] is the next character in T to be compared with the pattern.
• q: indicates that P[q+1] is the next character in P to be compared with T[i] (the first q characters of P are already matched).

1. i=1 and q=0;
2. Compare T[i] with P[q+1]:
     case 1: T[i]==P[q+1]
         i=i+1; q=q+1;
         if q==|P| then { print "P occurs at i-|P|"; q=f(q); }
     case 2: T[i]≠P[q+1] and q≠0
         q=f(q);
     case 3: T[i]≠P[q+1] and q==0
         i=i+1;
3. Repeat step 2 until i>|T|.

• Example 5: P=abcabbabcabbaa

    i     1  2  3  4  5  6  7  8  9  10 11 12 13 14
    f(i)  0  0  0  1  2  0  1  2  3  4  5  6  7  1

    T: abcabcabbabbabcabbabcabbaa
       abcabb                         mismatch at T[6]; q=f(5)=2
          abcabbabc                   mismatch at T[12]; q=f(8)=2
                abc                   mismatch at T[12]; q=f(2)=0
                  a                   mismatch at T[12] with q=0; i=i+1
                   abcabbabcabbaa     q==|P|; P occurs at position 13

Running time complexity (hard)
• The running time of the scan algorithm is O(|T|).
• Proof:
  – There are two pointers i and p.
  – i: the next character in T to be compared.
  – p: the position in T that is aligned with P[1].
[Figure: P=abcabcabcaa aligned under T=abcabcabcabcaa with the pointers p and i marked; after the mismatch, P is shifted to the right and p moves forward.]

Facts:
1. When a match is found, move i forward.
2. When a mismatch is found, move p forward; p never moves past i. (When p=i and a mismatch occurs, move both i and p forward.)
Every comparison moves i or p forward, and each pointer moves forward at most |T| times. From facts 1 and 2, the total number of comparisons is therefore at most 2|T|. Thus, the time complexity is O(|T|).

Another version of scan algorithm (code)

    n=|T|
    m=|P|
    q=0
    for i=1 to n {
        while q>0 and P[q+1]≠T[i] do {
            q=f(q)
        }
        if P[q+1]==T[i] then q=q+1
        if q==m then {
            print "pattern occurs at i-m+1"
            q=f(q)
        }
    }

Failure Function Construction
Basic idea:
Case 1: f(1) is always 0.
Case 2: if P[q]==P[f(q-1)+1] then f(q)=f(q-1)+1.
Example: P=abcabcc
    f(1)=0; f(2)=0; f(3)=0; f(4)=1; f(5)=2; f(6)=3; f(7)=0;
    P[4]=P[f(4-1)+1], so f(4)=f(4-1)+1=0+1=1.
    P[5]=P[f(5-1)+1], so f(5)=f(5-1)+1=1+1=2.
    P[6]=P[f(6-1)+1], so f(6)=f(6-1)+1=2+1=3.
Case 3: if P[q]≠P[f(q-1)+1] and f(q-1)≠0 then consider P[q] ?= P[f(f(q-1))+1]. (Do it recursively.)
Case 4: if P[q]≠P[f(q-1)+1] and f(q-1)==0 then f(q)=0.
Consider the computation of f(7).
For P=abcabcc: P[7]≠P[f(7-1)+1] (P[7]=c, P[4]=a) and P[7]≠P[f(f(7-1))+1] (P[7]=c, P[1]=a), so f(7)=0.

The algorithm (code) to compute failure function

    1. m=|P|;
    2. f(1)=0;
    3. k=0;
    4. for q=2 to |P| do {
    5.     k=f(q-1);
    6.     if (k>0 and P[k+1]!=P[q]) { k=f(k); goto 6; }
    7.     if (k>0 and P[k+1]==P[q]) { f(q)=k+1; }
    8.     if (k==0) { if (P[k+1]==P[q]) f(q)=1; else f(q)=0; }
       }

Another version

    m=|P|;
    f(1)=0;
    k=0;
    for q=2 to |P| do {
        k=f(q-1);
        while (k>0 and P[k+1]!=P[q]) do {
            k=f(k);
        }
        if (P[k+1]==P[q]) then k=k+1;
        f(q)=k;
    }

• Example 6:

    i     1  2  3  4  5  6  7  8  9  10 11 12
    P  =  a  b  c  a  b  c  a  b  c  a  a  c

  f(1)=0; f(2)=0; f(3)=0; f(4)=1; f(5)=2; f(6)=3; f(7)=4; f(8)=5; f(9)=6; f(10)=7; f(11)=1.
  (The computation of f(11) is very interesting.)
  Question: Do we need to compute f(12)?
  Yes, if you want to find ALL occurrences of P. No, if you just want to find the first occurrence of P.

Example: P=abcabc, T=abcabcabc

    T: abcabcabc
       abcabc           (occurrence at position 1)
          abcabc        (occurrence at position 4)

    i     1  2  3  4  5  6
    f(i)  0  0  0  1  2  3

When a match is found at the end of P, set q=f(|P|) and continue scanning.

Running time complexity (Fun Part, not required)
The running time of the failure function construction algorithm is O(|P|). (The proof is similar to that for the scan algorithm.)

Total running time complexity
The total complexity for failure function construction and the scan algorithm is O(|P|+|T|).

Linear Time Algorithm for Multiple Patterns
• Input: a string T (very long) and a set of patterns P1, P2, ..., Pk.
• Output: all the occurrences of the Pi's in T.
Let us consider the set of patterns { he, she, his, hers }. We can construct an automaton as follows:
[Figure: automaton for { he, she, his, hers } with states 0 to 9. Paths: 0 -h-> 1 -e-> 2 -r-> 8 -s-> 9; 1 -i-> 6 -s-> 7; 0 -s-> 3 -h-> 4 -e-> 5. State 0 has a loop labelled e, i, r.]

• g(s,a)=s' means that at state s, if the next input letter is a, then the next state is s'.
• The states of the automaton are organized column by column.
• Each state corresponds to a prefix of some pattern Pi.
• F: the set of final states (dark circled) corresponding to the ends of patterns.
• For the starting state 0, add g(0,a)=0 if g(0,a) is originally fail.
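The construction of g() described above can be sketched in a few lines. This is a hedged sketch, not the course's reference code: the name build_goto, the list-of-dictionaries representation (a missing key stands for fail) and the explicit alphabet argument are all assumptions made for the illustration. States are numbered in order of creation, which for the pattern order he, she, his, hers gives the state numbers 0 to 9 used in these notes.

```python
def build_goto(patterns, alphabet):
    """Return (goto, final): goto[s][a] = s', a missing key means fail."""
    goto = [{}]      # state 0 is the starting state
    final = set()    # F: states at which some pattern ends
    for pattern in patterns:
        s = 0
        for a in pattern:
            if a not in goto[s]:
                goto.append({})              # create a new state
                goto[s][a] = len(goto) - 1
            s = goto[s][a]
        final.add(s)
    # For the starting state, replace every fail by a loop back to 0.
    for a in alphabet:
        goto[0].setdefault(a, 0)
    return goto, final


goto, final = build_goto(["he", "she", "his", "hers"], "ehirs")
for s, edges in enumerate(goto):
    print(s, edges)
print("final states:", sorted(final))
```

Inserting the patterns in a different order would produce the same automaton with different state numbers.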
• Exercise: write down the g() function for the above automaton.
• Failure function: f(s) = the state for the longest prefix of some pattern Pi that is a proper suffix of the string on the path from 0 (the starting state) to s.
• Example: he is the longest prefix (of hers) that is a suffix of the string she, so f(5)=2.

The scan algorithm
Text: T[1]T[2]...T[n]

    s=0;
    for i=1 to n do {
        while g(s,T[i])==fail do s=f(s);
        s=g(s,T[i]);
        if s is in F then return "yes";
    }
    return "no"

For this test to catch every occurrence, F must also contain each state whose chain of failure states leads to a final state, because a pattern may end in the middle of a longer path.

Theorem: The scan algorithm takes O(|T|) time.
Proof: Again, the two pointer argument.
• When a match is found, move the first pointer forward. (s=g(s,T[i]);)
• When a mismatch is found (g(s,T[i])==fail), move the second pointer forward. (s=f(s);)
• When a final state is reached, declare that a pattern has been found. (if s is in F then return "yes";)

• Example:

    i       1  2  3  4  5  6  7  8
    T[i]    s  h  e  r  s  h  i  i
    states visited, in order (including failure moves): 3 4 5 2 8 9 3 4 1 0 0 0

    s       1  2  3  4  5  6  7  8  9
    f(s)    0  0  0  1  2  0  3  0  3

Failure function construction
• Basic idea: similar to that for one pattern.

    for each state s of depth 1 do f(s)=0
    for each depth d>=1 do
        for each state sd of depth d and character a such that g(sd,a)=s' do {
            s=f(sd)
            while g(s,a)==fail do {
                s=f(s)
            }
            f(s')=g(s,a)
        }

• The while loop always stops, because g(0,c)≠fail for any possible character c.
The failure function for {he, she, his, hers} is

    s       1  2  3  4  5  6  7  8  9
    f(s)    0  0  0  1  2  0  3  0  3

Time complexity: O(|P1|+|P2|+...+|Pk|).
Proof: Two pointer argument. Left as an assignment (optional).
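The failure function construction and the scan can be put together in one runnable sketch. It is an illustration under stated assumptions, not the course's reference solution: the names build_automaton and find_all are invented here, g(0,a)=0 is handled implicitly (a missing edge at state 0 stays at 0), states are processed depth by depth with a queue, and instead of returning "yes" at the first final state it reports all occurrences, which requires merging the outputs of f(s) into state s.

```python
from collections import deque


def build_automaton(patterns):
    """Return (goto, fail, out) for the given patterns."""
    goto, out = [{}], [set()]   # out[s]: patterns ending at state s
    for pattern in patterns:
        s = 0
        for a in pattern:
            if a not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][a] = len(goto) - 1
            s = goto[s][a]
        out[s].add(pattern)
    fail = [0] * len(goto)              # f(s)=0 for states of depth 1
    queue = deque(goto[0].values())     # process states depth by depth
    while queue:
        sd = queue.popleft()
        for a, nxt in goto[sd].items():
            s = fail[sd]
            while s != 0 and a not in goto[s]:
                s = fail[s]
            fail[nxt] = goto[s].get(a, 0)   # g(0,a)=0 when no edge exists
            out[nxt] |= out[fail[nxt]]      # patterns ending via f(nxt)
            queue.append(nxt)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start position, pattern) pairs, positions counted from 1."""
    goto, fail, out = build_automaton(patterns)
    s, hits = 0, []
    for i, a in enumerate(text, start=1):
        while s != 0 and a not in goto[s]:
            s = fail[s]                     # mismatch: follow f
        s = goto[s].get(a, 0)               # match, or stay at state 0
        for pattern in sorted(out[s]):
            hits.append((i - len(pattern) + 1, pattern))
    return hits


goto, fail, out = build_automaton(["he", "she", "his", "hers"])
print(fail[1:])
print(find_all("ushers", ["he", "she", "his", "hers"]))
```

With the patterns inserted in the order he, she, his, hers, the state numbers agree with the ones used in these notes, so fail[1:] can be compared directly with the f(s) table.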