Fuzzy Set Theory

• Definition: A fuzzy subset A of a universe of discourse U is characterized by a
membership function $\mu_A : U \to [0,1]$, which associates with each element u of U
a number $\mu_A(u)$ in the interval [0,1].
• Classical set theory: A = {a, b, c}. A subset of A: {a, c}.
• In a classical set, an element is either in the set or not in it, so $\mu_A(u)$ is either 0 or 1.
Set Theory
• Let U be the set of all elements (the universe).
• There are three basic operations:
• $A \cup B$ = {elements in A or in B}.
• $A \cap B$ = {elements in both A and B}.
• Not A = U − A.
• Definition: Let U be the universe of discourse, A and B be two fuzzy subsets of U, and $\bar{A}$ be
the complement of A relative to U. Also, let u be an element of U. Then,
$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$
$\mu_{A \cup B}(u) = \max\{\mu_A(u), \mu_B(u)\}$
$\mu_{A \cap B}(u) = \min\{\mu_A(u), \mu_B(u)\}$
Fuzzy Information Retrieval
We first set up the term-term correlation matrix.
For terms $k_i$ and $k_l$,
$c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$
where $n_i$ is the number of documents containing $k_i$, $n_l$ is the number of documents
containing $k_l$, and $n_{i,l}$ is the number of documents containing both $k_i$ and $k_l$.
Note that $c_{i,i} = 1$.
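As a rough sketch, the correlation matrix can be computed directly from binary document vectors. The function name `correlation_matrix` and the four-document collection below are hypothetical, chosen only to keep the arithmetic easy to check by hand.

```python
def correlation_matrix(docs):
    """docs: list of binary term-incidence tuples, one per document.

    Returns c with c[i][l] = n_il / (n_i + n_l - n_il).
    """
    t = len(docs[0])
    n = [sum(d[i] for d in docs) for i in range(t)]  # n_i: documents containing term i
    c = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for l in range(t):
            n_il = sum(1 for d in docs if d[i] and d[l])  # documents containing both terms
            denom = n[i] + n[l] - n_il
            # Assumption: if neither term occurs anywhere, the formula is undefined
            # and the entry is simply left at 0.0.
            c[i][l] = n_il / denom if denom else 0.0
    return c

# Hypothetical collection: 4 documents over 3 terms.
docs = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
c = correlation_matrix(docs)
print(c[0][1])  # terms 1 and 2 co-occur in 2 documents: 2 / (3 + 3 - 2)
```

The matrix is symmetric, and every term that occurs in at least one document has a diagonal entry of 1.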
Fuzzy Information Retrieval
We define a fuzzy set for each term $k_i$. In the fuzzy set for $k_i$, a document $d_j$ has a
degree of membership $\mu_{i,j}$ computed as
$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$
Example: $c_{1,2} = 0.1$, $c_{1,3} = 0.21$.
$d_1 = (0, 1, 1, 0)$: $\mu_{1,1} = 1 - 0.9 \times 0.79 = 0.289$.
$d_2 = (1, 0, 0, 0)$: $\mu_{1,2} = 1 - 0 = 1$ (since $c_{1,1} = 1$).
What about $d_3 = (1, 0, 1, 0)$?
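The membership formula can be checked numerically with a short sketch. The helper name `membership` is hypothetical, and the correlation of term 1 with term 4 is not given in the example, so it is set to 0.0 here as a placeholder (no document below contains term 4, so the value is never used). `math.prod` needs Python 3.8 or later.

```python
import math

def membership(c_row, doc):
    """mu_{i,j} = 1 - product over the terms k_l present in d_j of (1 - c_{i,l}).

    c_row: row i of the correlation matrix; doc: binary term vector of d_j.
    """
    return 1 - math.prod(1 - c_row[l] for l, present in enumerate(doc) if present)

# Row for term k_1: c_{1,1} = 1, c_{1,2} = 0.1, c_{1,3} = 0.21 (from the example);
# c_{1,4} = 0.0 is an assumed placeholder.
c1 = [1.0, 0.1, 0.21, 0.0]

mu_11 = membership(c1, (0, 1, 1, 0))  # 1 - 0.9 * 0.79
mu_12 = membership(c1, (1, 0, 0, 0))  # 1 - (1 - c_{1,1}) = 1
print(round(mu_11, 3), mu_12)
```

Any document that contains $k_i$ itself gets membership 1, because the factor $(1 - c_{i,i})$ is 0.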
Fuzzy Information Retrieval
Whenever the document $d_j$ contains a term that is strongly related to $k_i$,
the document $d_j$ belongs to the fuzzy set of term $k_i$, i.e.,
$\mu_{i,j}$ is very close to 1.
Example: $c_{1,2} = 0.9$, $d_1 = (0, 1, 0, 0)$.
$\mu_{1,1} = 1 - (1 - 0.9) = 0.9$
Query:
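• A query is a Boolean formula, e.g.,
• q = $k_a$ and ($k_b$ or not $k_c$), that is, $q = k_a \wedge (k_b \vee \neg k_c)$.
• In disjunctive normal form over $(k_a, k_b, k_c)$: q = (1, 1, 1) or (1, 1, 0) or (1, 0, 0).
• In general, suppose q is $q_{dnf} = cc_1 \vee cc_2 \vee \cdots \vee cc_p$.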
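[Figure: Venn diagram of the fuzzy document sets $D_a$, $D_b$, $D_c$ with regions $cc_1$, $cc_2$, $cc_3$, where $D_q = cc_1 \cup cc_2 \cup cc_3$.]
Figure 1. Fuzzy document sets for the query $q = k_a \wedge (k_b \vee \neg k_c)$. Each
$cc_i$, $i \in \{1,2,3\}$, is a conjunctive component. $D_q$ is the query fuzzy set.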
$\mu_{q,j} = \mu_{cc_1 + cc_2 + cc_3,\, j}$
$= 1 - \prod_{i=1}^{3} (1 - \mu_{cc_i, j})$
$= 1 - (1 - \mu_{a,j}\mu_{b,j}\mu_{c,j}) \times (1 - \mu_{a,j}\mu_{b,j}(1 - \mu_{c,j})) \times (1 - \mu_{a,j}(1 - \mu_{b,j})(1 - \mu_{c,j}))$
where $\mu_{i,j}$, $i \in \{a, b, c\}$, is the membership of $d_j$
in the fuzzy set associated with $k_i$.
$\mu_{q,j}$ is the membership of document j for query q.
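A small sketch of this computation for the example query $q = k_a \wedge (k_b \vee \neg k_c)$ follows. The function name `query_membership` and the sample membership values are assumptions made for illustration; the three conjunctive components are (1, 1, 1), (1, 1, 0) and (1, 0, 0).

```python
import math

def query_membership(mu_a, mu_b, mu_c):
    """Membership of a document in the query fuzzy set D_q.

    Each conjunctive component multiplies mu for a term that must be present
    and (1 - mu) for a term that must be absent; the components are then
    combined as 1 - product of (1 - mu_cc).
    """
    cc = [
        mu_a * mu_b * mu_c,              # cc1 = (1, 1, 1)
        mu_a * mu_b * (1 - mu_c),        # cc2 = (1, 1, 0)
        mu_a * (1 - mu_b) * (1 - mu_c),  # cc3 = (1, 0, 0)
    ]
    return 1 - math.prod(1 - x for x in cc)

print(query_membership(1.0, 1.0, 1.0))  # the document fully satisfies cc1
print(query_membership(0.0, 0.7, 0.2))  # no membership in the k_a set at all
print(query_membership(0.5, 0.5, 0.5))
```

With crisp memberships (0 or 1) the result agrees with ordinary Boolean evaluation of the query.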
Exercise: suppose there are 3 documents and 4 terms:
$d_1 = (1, 0, 1, 0)$, $d_2 = (1, 1, 0, 0)$, and $d_3 = (0, 1, 1, 0)$.
(1) Compute the term-term correlation matrix $c_{i,l}$.
(2) Compute $\mu_{i,j}$ (the membership of document j in the fuzzy set of term i).
(3) If the query is q = (1, 0, 0, 0) or (1, 1, 0, 0), compute
$\mu_{q,k}$ for each document $d_k$.
A change to the formula on the last slide:
$\mu_{q,j} = \mu_{cc_1 + cc_2 + cc_3,\, j} = \max\{\mu_{cc_1,j}, \mu_{cc_2,j}, \mu_{cc_3,j}\}$,
where $\mu_{cc_1,j}$, $\mu_{cc_2,j}$, $\mu_{cc_3,j}$ are computed as before.
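The max-based variant can be sketched the same way, again for the example query $q = k_a \wedge (k_b \vee \neg k_c)$. The function name `query_membership_max` and the sample values are assumptions for illustration.

```python
def query_membership_max(mu_a, mu_b, mu_c):
    """Query membership as the maximum over the conjunctive components."""
    cc = [
        mu_a * mu_b * mu_c,              # cc1 = (1, 1, 1)
        mu_a * mu_b * (1 - mu_c),        # cc2 = (1, 1, 0)
        mu_a * (1 - mu_b) * (1 - mu_c),  # cc3 = (1, 0, 0)
    ]
    return max(cc)

print(query_membership_max(0.5, 0.5, 0.5))  # every component equals 0.125
print(query_membership_max(1.0, 1.0, 0.0))  # cc2 is fully satisfied
```

Taking the maximum matches the fuzzy union defined earlier, where the membership of a union is the max of the individual memberships.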
String Matching Allowing Errors
• Problem: Given a short pattern P of length
m, a long text T of length n, and a
maximum allowed number of errors k, find
all the text positions where the pattern
occurs with at most k errors.
Dynamic Programming
• Let C[i,j] be the minimum number of errors needed to match the first i characters of the
pattern against a substring of the text ending at position j; i and j are the indices for
the pattern and the text.
• Three kinds of error: mismatch (a, b), insertion (a, ε) and deletion (ε, a).
$C[0, j] = 0$
$C[i, 0] = i$
$C[i, j] = C[i-1, j-1]$ if $P_i = T_j$,
else $C[i, j] = 1 + \min(C[i-1, j],\ C[i, j-1],\ C[i-1, j-1])$
The matrix
      s  x  s  u  r  g  e  r  y
   0  0  0  0  0  0  0  0  0  0
s  1  0  1  0  1  1  1  1  1  1
u  2  1  1  1  0  1  2  2  2  2
r  3  2  2  2  1  0  1  2  2  3
v  4  3  3  3  2  1  1  2  3  3
e  5  4  4  4  3  2  2  1  2  3
y  6  5  5  5  4  3  3  2  2  2
The dynamic programming algorithm searches for ‘survey’ in
the text ‘surgery’ (the matrix columns show it preceded by ‘sx’) with at most two errors.
Bold entries (last-row values of at most 2) indicate
matching positions. Running time: O(nm).
Exercise
• Let the text be ABCABCDDABEDF and the
pattern be ABCDAB. Find the occurrences
of the pattern with at most 1 error.
String Matching Allowing Errors
(FAST Algorithm)
• Keep only the cells with value at most k.
• This reduces the running time, because cells with larger values never need to be computed exactly.
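One way to realize this idea is a column-by-column cut-off, sketched below. This is an assumption about what the slide intends, not necessarily the exact algorithm it refers to: each column is computed only down to one row past the last cell with value at most k, and every value is capped at k + 1. The function name `match_ends_cutoff` is hypothetical.

```python
def match_ends_cutoff(pattern, text, k):
    """1-based end positions of occurrences with at most k errors,
    computing only the cells that can still hold a value <= k."""
    m = len(pattern)
    col = [min(i, k + 1) for i in range(m + 1)]  # column 0, capped at k + 1
    last = min(k, m)                             # last row whose value is <= k
    ends = []
    for j, ch in enumerate(text, 1):
        # Values never decrease along a diagonal, so in this column only
        # rows 1..last+1 can hold a value <= k.
        top = min(last + 1, m)
        diag = col[0]                            # C[i-1][j-1] for i = 1
        for i in range(1, top + 1):
            up_left, diag = diag, col[i]         # diag now holds C[i][j-1]
            if pattern[i - 1] == ch:
                col[i] = up_left
            else:
                col[i] = min(k + 1, 1 + min(col[i - 1], diag, up_left))
        # Rows below `top` keep the cap k + 1; find the new last active row.
        last = max(i for i in range(top + 1) if col[i] <= k)
        if last == m:
            ends.append(j)
    return ends

print(match_ends_cutoff("survey", "surgery", 2))
```

The result is the same set of positions as the full matrix would give; the saving comes from skipping the lower part of each column whenever the pattern prefix is already too far from the text.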
Regular Expression Matching
• Regular expression:
1. Any letter x in Σ ∪ {ε} is a regular
expression, where Σ is the set of all
letters.
2. If A and B are regular expressions, then
A|B, A.B and (A)* are regular
expressions.
Regular Expression Matching
(Not Required)
• Given a regular expression E and a string T,
find all the substrings in T that match E.
• Let d(i) be the set of all states in the automaton
that can be reached after T1T2…Ti has been read.
• Given d(i), d(i+1) can be computed easily.
• There is a starting state and a final state in the
automaton.
• Whenever the final state is reached, we have found a
substring in T that matches the expression.
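[Figures: automaton constructions for A|B, A.B and (A)*, each built from the automata FA and FB with a start state S, a final state f and ε-transitions; and the automaton for (AA|B).(B|AB), with states a through l and transitions labeled A, B and ε.]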
Example:
• E=(AA|B).(B|AB).
• T=ABBAB.
• d(1)={a, b, d, c}
• d(2)={a, b, d, e, f, g, i}
• d(3)={a, b, c, e, f, g, i, h, l}
• d(4)={a, b, d, c, j}
• d(5)={a, b, d, e, f, g, i, k}
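The state-set computation can be sketched as a small automaton simulation. The transition tables below are an assumption: they describe one plausible automaton for (AA|B).(B|AB) with states named a through l, since the original figure is not recoverable here. The names `closure`, `state_sets` and `match_ends` are hypothetical.

```python
# Assumed automaton for (AA|B).(B|AB): start state "a", final state "l".
EPS = {"a": {"b", "d"}, "e": {"f"}, "f": {"g", "i"}, "h": {"l"}, "k": {"l"}}
STEP = {("b", "A"): "c", ("c", "A"): "e", ("d", "B"): "e",
        ("g", "B"): "h", ("i", "A"): "j", ("j", "B"): "k"}
START, FINAL = "a", "l"

def closure(states):
    """All states reachable from `states` through ε-transitions."""
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def state_sets(text):
    """d(1), d(2), ... : the states reachable after each text character."""
    current = closure({START})
    sets = []
    for ch in text:
        moved = {STEP[(s, ch)] for s in current if (s, ch) in STEP}
        # The start state is added again at every position because a
        # matching substring may begin anywhere in the text.
        current = closure(moved | {START})
        sets.append(current)
    return sets

def match_ends(text):
    """1-based positions where some substring matching E ends."""
    return [i for i, d in enumerate(state_sets(text), 1) if FINAL in d]

for i, d in enumerate(state_sets("ABBAB"), 1):
    print(i, sorted(d))
print(match_ends("ABBAB"))
```

Each step touches every state in the current set once, which is why the size of the sets drives the cost of the simulation.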
Running time
• $O(n^2)$, where n is the size of the
automaton, since d(i) could
contain O(n) states.