Fuzzy Set Theory
• Definition: A fuzzy subset A of a universe of discourse U is characterized by a membership function $\mu_A : U \to [0,1]$, which associates with each element u of U a number $\mu_A(u)$ in the interval [0,1].
• Classical set theory: A = {a, b, c}. A subset of A: {a, c}.
• In classical set theory an element is either in a set or not in a set, so $\mu_A(u)$ is either 0 or 1.

Set Theory
• Let U be the set of all elements (the universe).
• There are three basic operations:
• $A \cup B$ = {elements in A or in B}.
• $A \cap B$ = {elements in both A and B}.
• Not A = U - A.
• Definition: Let U be the universe of discourse, A and B be two fuzzy subsets of U, and $\bar{A}$ be the complement of A relative to U. Also, let u be an element of U. Then
$$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$$
$$\mu_{A \cup B}(u) = \max\{\mu_A(u), \mu_B(u)\}$$
$$\mu_{A \cap B}(u) = \min\{\mu_A(u), \mu_B(u)\}$$

Fuzzy Information Retrieval
We first set up the term-term correlation matrix. For terms $k_i$ and $k_l$,
$$c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$$
where $n_i$ is the number of documents containing $k_i$, $n_l$ is the number of documents containing $k_l$, and $n_{i,l}$ is the number of documents containing both $k_i$ and $k_l$. Note that $c_{i,i} = 1$.

We define a fuzzy set for each term $k_i$. In the fuzzy set for $k_i$, a document $d_j$ has a degree of membership $\mu_{i,j}$ computed as
$$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$
Example: $c_{1,2} = 0.1$, $c_{1,3} = 0.21$.
$d_1 = (0, 1, 1, 0)$: $\mu_{1,1} = 1 - 0.9 \times 0.79 = 0.289$.
$d_2 = (1, 0, 0, 0)$: $\mu_{1,2} = 1 - 0 = 1$ (since $c_{1,1} = 1$).
What about $d_3 = (1, 0, 1, 0)$?

Whenever the document $d_j$ contains a term that is strongly related to $k_i$, the document $d_j$ belongs to the fuzzy set of term $k_i$ with a high degree, i.e., $\mu_{i,j}$ is very close to 1.
Example: $c_{1,2} = 0.9$, $d_1 = (0, 1, 0, 0)$. Then $\mu_{1,1} = 1 - (1 - 0.9) = 0.9$.

Query:
• A query is a Boolean formula, e.g., q = $k_a$ and ($k_b$ or not $k_c$), written $q = k_a \wedge (k_b \vee \neg k_c)$.
• As vectors over $(k_a, k_b, k_c)$: q = (1, 1, 1) or (1, 1, 0) or (1, 0, 0).
• Suppose q in disjunctive normal form is $q_{dnf} = cc_1 \vee cc_2 \vee \cdots \vee cc_p$.

[Figure 1: diagram of the fuzzy document sets $D_a$, $D_b$, $D_c$ with the regions $cc_1$, $cc_2$, $cc_3$ marked; $D_q = cc_1 + cc_2 + cc_3$.]
Figure 1. Fuzzy document sets for the query $[q = k_a \wedge (k_b \vee \neg k_c)]$. Each $cc_i$, $i \in \{1,2,3\}$, is a conjunctive component. $D_q$ is the query fuzzy set.
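The correlation and membership definitions above can be sketched in Python as follows. This is only an illustration under stated assumptions: documents are binary term vectors, the function names `correlation_matrix` and `membership` are made up for this sketch, the three-document collection is hypothetical, and a term that occurs in no document is given correlation 0 (the slides do not cover that case).

```python
from math import prod


def correlation_matrix(docs):
    """Return c[i][l] = n_il / (n_i + n_l - n_il) for binary document vectors."""
    t = len(docs[0])
    # n[i]: number of documents containing term i
    n = [sum(1 for d in docs if d[i]) for i in range(t)]
    c = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for l in range(t):
            # n_il: number of documents containing both term i and term l
            n_il = sum(1 for d in docs if d[i] and d[l])
            denom = n[i] + n[l] - n_il
            # Assumption: a term absent from every document gets correlation 0.
            c[i][l] = n_il / denom if denom else 0.0
    return c


def membership(c, doc, i):
    """Return mu_{i,j} = 1 - product over terms k_l in doc of (1 - c[i][l])."""
    return 1.0 - prod(1.0 - c[i][l] for l, present in enumerate(doc) if present)


# Hypothetical collection of three documents over three terms.
c = correlation_matrix([(1, 1, 0), (1, 0, 0), (0, 1, 1)])
print(c[0][1])  # correlation between the first and second term

# Slide example: c_{1,2} = 0.1, c_{1,3} = 0.21, d_1 = (0, 1, 1, 0).
row_for_k1 = [[1.0, 0.1, 0.21, 0.0]]
print(round(membership(row_for_k1, (0, 1, 1, 0), 0), 3))  # 0.289
```

Because $c_{i,i} = 1$, any document that contains $k_i$ itself gets membership 1 in this sketch, which agrees with the $d_2$ example.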
$$\mu_{q,j} = \mu_{cc_1+cc_2+cc_3,\,j} = 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i,j})$$
$$= 1 - (1 - \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j})\,(1 - \mu_{a,j}\,\mu_{b,j}\,(1 - \mu_{c,j}))\,(1 - \mu_{a,j}\,(1 - \mu_{b,j})(1 - \mu_{c,j}))$$
where $\mu_{i,j}$, $i \in \{a, b, c\}$, is the membership of $d_j$ in the fuzzy set associated with $k_i$, and $\mu_{q,j}$ is the membership of document $d_j$ for the query q.

Exercise: Suppose there are 3 documents and 4 terms: $d_1 = (1, 0, 1, 0)$, $d_2 = (1, 1, 0, 0)$, and $d_3 = (0, 1, 1, 0)$.
(1) Compute the term-term correlation matrix $c_{i,l}$.
(2) Compute $\mu_{i,j}$ (the membership of document j in the fuzzy set of term i).
(3) If the query is q = (1, 0, 0, 0) or (1, 1, 0, 0), compute $\mu_{q,k}$ for each document $d_k$.

Revision to the previous slide:
$$\mu_{q,j} = \mu_{cc_1+cc_2+cc_3,\,j} = \max\{\mu_{cc_1,j}, \mu_{cc_2,j}, \mu_{cc_3,j}\}$$
where $\mu_{cc_1,j}$, $\mu_{cc_2,j}$, $\mu_{cc_3,j}$ are computed as before.

String Matching Allowing Errors
• Problem: Given a short pattern P of length m, a long text T of length n, and a maximum allowed number of errors k, find all the text positions where the pattern occurs with at most k errors.

Dynamic Programming
• Let C[i,j] be the minimum number of errors needed to match $P_1 \ldots P_i$ against a substring of the text ending at $T_j$; i and j are the indices for the pattern and the text.
• Three kinds of error: mismatch (a, b), insertion (a, ε), and deletion (ε, a).
$$C[0,j] = 0$$
$$C[i,0] = i$$
$$C[i,j] = \begin{cases} C[i-1,j-1] & \text{if } P_i = T_j \\ 1 + \min(C[i-1,j],\, C[i,j-1],\, C[i-1,j-1]) & \text{otherwise} \end{cases}$$

The matrix

          s  x  s  u  r  g  e  r  y
       0  0  0  0  0  0  0  0  0  0
    s  1  0  1  0  1  1  1  1  1  1
    u  2  1  1  1  0  1  2  2  2  2
    r  3  2  2  2  1  0  1  2  2  3
    v  4  3  3  3  2  1  1  2  3  3
    e  5  4  4  4  3  2  2  1  2  3
    y  6  5  5  5  4  3  3  2  2  2

The dynamic programming algorithm searching for 'survey' in a text containing 'surgery', with two errors. Entries in the last row with value at most 2 (bold in the original slide) indicate matching positions.
Running time: O(nm).

Exercise
• Let ABCABCDDABEDF be the text and ABCDAB be the pattern. Find the occurrences of the pattern with at most 1 error.

String Matching Allowing Errors (FAST Algorithm)
• Just keep the cells with value at most k.
• This reduces the time complexity (to O(kn) on average).

Regular expressions Matching
• Regular expression:
1. Any letter x in $\Sigma \cup \{\varepsilon\}$ is a regular expression, where $\Sigma$ is the set of all letters.
2. If A and B are regular expressions, then A|B, A.B and (A)* are regular expressions.

Regular expressions Matching (Not Required)
• Given a regular expression E and a string T, find all the substrings in T that match E.
• Let d(i) be the set of all states in the automaton that can be reached after $T_1 T_2 \ldots T_i$ has been read.
• Given d(i), d(i+1) can be computed easily.
• There is a starting state and a final state in the automaton.
• Whenever the final state is reached, we have found a substring in T that matches the expression.

[Figures: construction of the automata $F_{A|B}$, $F_{A.B}$ and $F_{(A)^*}$ from $F_A$ and $F_B$, each with a starting state S, a final state f and ε-transitions; an example automaton with states a to l.]

Example:
• E = (A|AA).(B|AB).
• T = ABBAB.
• d(1) = {a, b, d, c}
• d(2) = {a, b, d, e, f, g, i}
• d(3) = {a, b, c, e, f, g, i, h, l}
• d(4) = {a, b, d, c, j}
• d(5) = {a, b, d, e, f, g, i, k}

Running time
• O(n^2), where n is the size of the automaton, since each set d(i) could contain O(n) states.
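
The step from d(i) to d(i+1) described above can be sketched as a state-set simulation. This is a minimal illustration, assuming a hand-built automaton for E = (A|AA).(B|AB): the state numbers 0 to 8, the `NFA` table and the function names are assumptions made for this sketch and do not correspond to the states a to l of the slide's figure. The starting state is added back at every step so that a match may begin at any text position.

```python
EPS = ""  # label used for epsilon transitions

# Hand-built automaton for E = (A|AA).(B|AB); state numbers are arbitrary.
NFA = {
    0: [(EPS, 1), (EPS, 2)],  # start: choose the branch A or the branch AA
    1: [("A", 4)],            # branch A
    2: [("A", 3)],            # branch AA, first A
    3: [("A", 4)],            # branch AA, second A
    4: [(EPS, 5), (EPS, 6)],  # choose the branch B or the branch AB
    5: [("B", 8)],            # branch B
    6: [("A", 7)],            # branch AB, the A
    7: [("B", 8)],            # branch AB, the B
    8: [],                    # final state
}
START, FINAL = 0, 8


def eps_closure(states):
    """Return all states reachable from `states` using epsilon transitions only."""
    seen = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for label, target in NFA[s]:
            if label == EPS and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def match_ends(text):
    """Return the 1-based positions i at which some substring matching E ends."""
    ends = []
    current = eps_closure({START})  # plays the role of d(0)
    for i, ch in enumerate(text, start=1):
        # Follow every transition labelled with the current character.
        moved = {t for s in current for label, t in NFA[s] if label == ch}
        # Re-add the start state so a new match can begin at position i + 1.
        current = eps_closure(moved | {START})
        if FINAL in current:
            ends.append(i)
    return ends


print(match_ends("ABBAB"))
```

For the text ABBAB of the example, this sketch reports matches ending at positions 2 and 5, the two occurrences of AB.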