ppt - Department of Computer Science

FINDING FUZZY SETS FOR QUANTITATIVE
ATTRIBUTES FOR MINING OF FUZZY
ASSOCIATION RULES
By H.N.A. Pham, T.W. Liao, and E. Triantaphyllou
Department of Industrial Engineering
3128 CEBA Building
Louisiana State University
Baton Rouge, LA 70803-6409
Email: hpham15@lsu.edu, ieliao@lsu.edu, and trianta@lsu.edu
1
Outline
Introduction
Background
A fuzzy approach for mining association
rules
Experimental evaluation
Conclusions
2
Introduction
• Association analysis is a new and attractive research area
in data mining
• The Apriori algorithm (R. Agrawal, IBM, 1993) is a key
technique for association analysis
• Although the Apriori principle considerably reduces the
search space, the technique still requires heavy
computation, particularly for large databases
• This research proposes an approach that finds fuzzy
sets for quantitative attributes in a database by using
clustering techniques and then applies techniques for
mining fuzzy association rules.
3
Outline
Introduction
Background
  • Association rules and the Apriori algorithm
  • Necessity to find fuzzy sets for quantitative
    attributes
A fuzzy approach for mining fuzzy association
rules
Experimental evaluation
Conclusions
4
Association rules: Market basket analysis
• Analyzes customer buying habits by finding associations
between the different items that customers place in their
“shopping baskets” (in the form X → Y, where X and Y are sets of
items)
• Example question: How often do people buy candy and beer together?
• I = {I1=beer, I2=cake, I3=onigiri}
• A transactional database
TID1: {I1, I2, I3}
TID2: {I1, I2}
TID3: {I2, I3}
TID4: {I2}
TID5: {I1, I2}
• An association rule: {I1} → {I3}
5
Rule measures: Support and Confidence
[Figure: Venn diagram of "Customer buys beer" and "Customer buys onigiri"; the overlap is "Customer buys both"]

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

• Association rule: X → Y
• support s = probability that a transaction contains both X and Y
• confidence c = conditional probability that a transaction
having X also contains Y
• A → C (s=50%, c=66.6%)
• C → A (s=50%, c=100%)
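The two measures can be checked directly against the four transactions on this slide. The sketch below is a minimal illustration; the names `TRANSACTIONS`, `support`, and `confidence` are chosen here for the example and do not come from the deck.

```python
# Transactions from the slide's table (IDs 2000, 1000, 4000, 5000).
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(X and Y) / support(X): how often Y appears when X does."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


if __name__ == "__main__":
    print(support({"A", "C"}, TRANSACTIONS))       # support of A -> C
    print(confidence({"A"}, {"C"}, TRANSACTIONS))  # confidence of A -> C
    print(confidence({"C"}, {"A"}, TRANSACTIONS))  # confidence of C -> A
```

Support is symmetric in X and Y, while confidence is not, which is why A → C and C → A share s = 50% but differ in c.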
6
Association mining: the Apriori algorithm
It is composed of two steps:
1. Find all frequent itemsets: By definition, each of
these itemsets will occur at least as frequently as a
pre-determined minimum support count
2. Generate strong association rules from the
frequent itemsets: By definition, these rules must
satisfy minimum support and minimum confidence
(Agrawal, 1993)
7
Association mining: the Apriori principle
Min. support 50%
Min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A → C:
support = support({A and C}) = 50%
confidence = support({A and C})/support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
(if an itemset is not frequent, neither are its supersets)
8
The Apriori algorithm: Finding frequent
itemsets using candidate generation
1. Find the frequent itemsets: the sets of items whose
support is at least the minimum support
• A subset of a frequent itemset must also be a frequent itemset,
i.e., if {A, B} is a frequent itemset, both {A} and {B} must be
frequent itemsets
• Iteratively find the frequent itemsets Lk with cardinality from 1 to k
(k-itemsets) from the candidate itemsets Ck (Lk ⊆ Ck)
C1 → … → Li-1 → Ci → Li → Ci+1 → … → Lk
2. Use the frequent itemsets to generate association rules.
9
Example (min_sup_count = 2)
Transactional data
TID    List of item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Scan D for the count of each candidate → C1
Itemset   Sup.Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Compare candidate support count with minimum support count → L1
Itemset   Sup.Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
10
Example (min_sup_count = 2)
Generate candidates C2 from L1 by using the Apriori principle → C2
Itemset
{I1, I2}, {I1, I3}, {I1, I4}, {I1, I5}, {I2, I3},
{I2, I4}, {I2, I5}, {I3, I4}, {I3, I5}, {I4, I5}

Scan D for the count of each candidate → C2
Itemset    S.count
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0

Compare candidate support count with minimum support count → L2
Itemset    S.count
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2

Generate candidates C3 from L2 by using the Apriori principle → C3
Itemset
{I1, I2, I3}
{I1, I2, I5}

Scan D for the count of each candidate → C3
Itemset        S.count
{I1, I2, I3}   2
{I1, I2, I5}   2

Compare candidate support count with minimum support count → L3
Itemset        S.count
{I1, I2, I3}   2
{I1, I2, I5}   2
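The candidate-generation loop traced in this example can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation; the function names are assumptions made for the example, and the transactions are T100 to T900 from the example above.

```python
from itertools import combinations

# Transactions T100..T900 from the example.
TRANSACTIONS = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]


def count_support(candidates, transactions):
    """Scan the database once and count each candidate itemset."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}


def generate_candidates(frequent, k):
    """Join frequent (k-1)-itemsets, then prune with the Apriori principle:
    keep a k-itemset only if all of its (k-1)-subsets are frequent."""
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent
                for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions, min_sup_count):
    """Return [L1, L2, ...], each a dict {itemset: support count}."""
    candidates = {frozenset([i]) for t in transactions for i in t}
    levels = []
    k = 1
    while candidates:
        counts = count_support(candidates, transactions)
        frequent = {c: n for c, n in counts.items() if n >= min_sup_count}
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        candidates = generate_candidates(set(frequent), k)
    return levels


if __name__ == "__main__":
    for level, itemsets in enumerate(apriori(TRANSACTIONS, 2), start=1):
        print(f"L{level}:", {tuple(sorted(s)): n for s, n in itemsets.items()})
```

With `min_sup_count = 2` the loop stops after L3: the only 4-item candidate, {I1, I2, I3, I5}, is pruned because its subset {I1, I3, I5} is not frequent.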
11
Necessity to find fuzzy sets for
quantitative attributes
Transaction ID   Age   Married   NumCars
100              33    Yes       2
200              39    Yes       2
300              35    No        1
400              20    No        0

A quantitative association rule with min_sup = min_conf = 50%:
(Age = 33 or 39) and (Married = Yes) -> (NumCars = 2)
A quantitative association rule with min_sup = min_conf = 50%:
(Age = 33..39) and (Married = Yes) -> (NumCars = 2)
A fuzzy association rule with min_sup = min_conf = 50%:
(Age = middle-aged) and (Married = Yes) -> (NumCars = 2)
12
Solution: Sharp boundary intervals
It is composed of two steps:
1. Partition the attribute domains into small intervals
and combine adjacent intervals into larger ones such
that the combined intervals have enough support
2. Replace each original attribute by its attribute-interval
pairs, so that the quantitative problem is transformed into
a Boolean one.
(Srikant and Agrawal, 1996)
13
Example: Sharp boundary intervals
Transaction ID   Age   Married   NumCars
100              33    Yes       2
200              39    Yes       2
300              35    No        1
400              20    No        0

Transaction ID   Age: 18-30   Age: 31-39   Married   NumCars: 0-1   NumCars: 2-3
100              No           Yes          Yes       No             Yes
200              No           Yes          Yes       No             Yes
300              No           Yes          No        Yes            No
400              Yes          No           No        Yes            No

• Algorithms ignore or over-emphasize the elements near the
boundary of the intervals in the mining process
• The use of sharp boundary intervals is also not intuitive with
respect to human perception
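The transformation from the quantitative table to the Boolean one can be sketched as follows. The interval bounds are read off the column headers on this slide; the function and variable names are assumptions made for the example.

```python
# Records from the slide: (transaction ID, Age, Married, NumCars).
RECORDS = [
    (100, 33, "Yes", 2),
    (200, 39, "Yes", 2),
    (300, 35, "No", 1),
    (400, 20, "No", 0),
]

# Intervals taken from the slide's Boolean table (inclusive bounds).
AGE_INTERVALS = [(18, 30), (31, 39)]
CARS_INTERVALS = [(0, 1), (2, 3)]


def to_boolean_row(record):
    """Replace each quantitative attribute by one Boolean per interval."""
    tid, age, married, cars = record
    row = {"Transaction ID": tid}
    for lo, hi in AGE_INTERVALS:
        row[f"Age: {lo}-{hi}"] = lo <= age <= hi
    row["Married"] = married == "Yes"
    for lo, hi in CARS_INTERVALS:
        row[f"NumCars: {lo}-{hi}"] = lo <= cars <= hi
    return row


BOOLEAN_TABLE = [to_boolean_row(r) for r in RECORDS]

if __name__ == "__main__":
    for row in BOOLEAN_TABLE:
        print(row)
```

The boundary problem is visible in the interval test itself: an age of 30 and an age of 31 fall into different columns even though they are nearly identical, while 31 and 39 are treated as the same.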
14
Solution: Experts
• A user or expert must provide the algorithm with the
required fuzzy sets of the quantitative attributes and
their corresponding membership functions
• Fuzzy sets and membership functions provided by
experts may not be suitable for mining fuzzy
association rules in the database
15
Solution: Fuzzy sets for quantitative
attributes
It is composed of three steps:
Step 1: Transform the original database into one with positive integers
Step 2: For each attribute
    Cluster the values of the i-th attribute into k medoids
    Classify the i-th attribute into k fuzzy sets
    Generate membership functions for each fuzzy set
End for
Step 3: Transform the database based on the fuzzy sets
(Ada, 1998)
Drawback: the association between attributes is lost in the mining approach
16
Outline
Introduction
Background
A fuzzy approach for mining fuzzy
association rules
  • Fuzzy approach
  • Fuzzy mining of association rules
Experimental evaluation
Conclusions
17
Fuzzy approach
It is composed of five steps:
Step 1: Transform the original database into one with
positive integers
Step 2: Cluster values of attributes into k medoids.
Step 3: Classify attributes into k fuzzy sets
Step 4: Generate membership functions for each fuzzy set
Step 5: Transform the database based on fuzzy sets
18
Fuzzy approach: Step 2
Clustering:
• The clustering method considers the search space of a
database with n attributes as an n-dimensional space
• Uses the Matlab fuzzy toolbox
Advantage: the association between attributes is not lost in the mining approach
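To illustrate clustering whole records in n-dimensional space, the sketch below runs a brute-force k-medoids search on the (Salary, IQ) records from the deck's worked example. This is a swapped-in technique: the slides use a Matlab toolbox routine that they do not specify, and the exhaustive search, the Euclidean distance, and the absence of attribute rescaling are all assumptions made for this example.

```python
from itertools import combinations
from math import dist

# (Salary, IQ) records from the deck's worked example, IDs 1 to 7.
RECORDS = [
    (10000, 120), (7000, 100), (30000, 183), (9000, 110),
    (15000, 140), (20000, 165), (5000, 85),
]


def total_cost(medoids, points):
    """Sum of Euclidean distances from each point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def k_medoids_exhaustive(points, k):
    """Try every k-subset of the points as medoids and keep the cheapest.

    Brute force is only practical for tiny data sets; it stands in for
    the toolbox routine used in the deck.
    """
    return min(combinations(points, k), key=lambda ms: total_cost(ms, points))


MEDOIDS = k_medoids_exhaustive(RECORDS, 3)

if __name__ == "__main__":
    print(sorted(MEDOIDS))
```

Because each medoid is a full record, the Salary and IQ mid-points of a cluster always come from the same transaction, which is how the joint clustering keeps the association between attributes. On these seven records the search selects records 2, 5, and 3, whose values coincide with the mid-points listed in the deck's example.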
19
Fuzzy approach: Step 3
Classify:
• Let {m1, m2, …, mk} be the k medoids found in Step 2, where mi =
{ai1, ai2, …, ain} is the i-th medoid.
• Let the j-th attribute have the range [minj, maxj] and let {a1j, a2j, …, akj} be
the set of mid-points of the j-th attribute. The k fuzzy sets of this attribute
cover the ranges
[minj, a2j], [a1j, a3j], …, [a(i-1)j, a(i+1)j], …, and [a(k-1)j, maxj]

Medoid matrix (row i is medoid mi, column j is attribute j):
m1:  a11  …  a1j  …  a1n
…
mk:  ak1  …  akj  …  akn

[Figure: on the axis from minj to maxj, the fuzzy set around mid-point aij spans a(i-1)j to a(i+1)j]
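The range rule for one attribute can be written as a short helper. The sketch is illustrative only: `fuzzy_ranges` is a name chosen here, and the numbers in the usage example are hypothetical rather than taken from the deck.

```python
def fuzzy_ranges(mid_points, lo, hi):
    """Ranges [lo, a2], [a1, a3], ..., [a(k-1), hi] for sorted mid-points.

    Each fuzzy set reaches from the previous mid-point to the next one;
    the first and last sets are closed off by the attribute's min and max.
    """
    mids = sorted(mid_points)
    ranges = []
    for i in range(len(mids)):
        left = lo if i == 0 else mids[i - 1]
        right = hi if i == len(mids) - 1 else mids[i + 1]
        ranges.append((left, right))
    return ranges


if __name__ == "__main__":
    # Hypothetical attribute with range [0, 40] and three mid-points.
    print(fuzzy_ranges([10, 20, 30], lo=0, hi=40))
```

Adjacent ranges overlap by construction, so a value between two mid-points belongs partly to both neighbouring fuzzy sets.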
20
Fuzzy approach: Step 4
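Generate membership functions (triangular function):
\[
f_{ij}\left(x;\ \min_j,\ a_{\frac{k}{2}j},\ \max_j\right) =
\begin{cases}
1, & \text{if } a_{\frac{k}{2}j} = x \\[4pt]
\dfrac{x - \min_j}{a_{\frac{k}{2}j} - \min_j}, & \text{if } \min_j \le x < a_{\frac{k}{2}j} \\[8pt]
\dfrac{\max_j - x}{\max_j - a_{\frac{k}{2}j}}, & \text{if } a_{\frac{k}{2}j} < x \le \max_j \\[8pt]
0, & \text{otherwise}
\end{cases}
\]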
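A triangular membership function of this shape can be sketched generically. The parameter names `lo`, `peak`, and `hi` are assumptions standing in for the left bound, mid-point, and right bound of a fuzzy set, and the sample values are hypothetical. The sketch is not meant to reproduce the membership degrees in the deck's worked example, which come from the authors' own tooling.

```python
def triangular(x, lo, peak, hi):
    """Triangular membership: 1 at `peak`, falling linearly to 0 at `lo` and `hi`."""
    if not lo < peak < hi:
        raise ValueError("expected lo < peak < hi")
    if x == peak:
        return 1.0
    if lo <= x < peak:
        return (x - lo) / (peak - lo)   # rising edge
    if peak < x <= hi:
        return (hi - x) / (hi - peak)   # falling edge
    return 0.0                          # outside the fuzzy set's range


if __name__ == "__main__":
    # Hypothetical fuzzy set on [0, 10] with its mid-point at 5.
    for x in (0, 2.5, 5, 7.5, 10, 12):
        print(x, triangular(x, lo=0, peak=5, hi=10))
```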
21
Fuzzy approach: Step 5
Transform the database based on fuzzy sets:
• Let Tij be the value of the i-th transaction at
the j-th attribute
• Tij is replaced by the l-th fuzzy label of attribute j if flj(Tij) = max over k of fkj(Tij)
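The labelling rule amounts to taking the fuzzy set with the largest membership degree. A minimal sketch follows, assuming triangular membership functions and hypothetical fuzzy sets ("low", "medium", "high") that are not taken from the deck; `tri` and `assign_label` are names made up for the example.

```python
def tri(lo, peak, hi):
    """Build a triangular membership function (hypothetical helper)."""
    def f(x):
        if x <= lo or x >= hi:
            return 0.0
        if x <= peak:
            return (x - lo) / (peak - lo)
        return (hi - x) / (hi - peak)
    return f


def assign_label(value, membership_functions):
    """Return (label, degree) of the fuzzy set with the highest membership."""
    degrees = {label: f(value) for label, f in membership_functions.items()}
    best = max(degrees, key=degrees.get)  # ties go to the first label listed
    return best, degrees[best]


# Hypothetical fuzzy sets for one attribute.
SETS = {"low": tri(0, 10, 20), "medium": tri(10, 20, 30), "high": tri(20, 30, 40)}

if __name__ == "__main__":
    print(assign_label(12, SETS))
    print(assign_label(26, SETS))
```

Breaking ties in favour of the first label is an arbitrary choice made here; the slides do not say how ties are handled.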
22
Example of fuzzy approach
Original database:
ID   Salary   IQ
1    10000    120
2    7000     100
3    30000    183
4    9000     110
5    15000    140
6    20000    165
7    5000     85

Step 2:
Fuzzy label   Range           Mid-point
Low_S         4000 – 10000    7000
Medium_S      7000 – 20000    15000
High_S        15000 – 32000   30000

Fuzzy label   Range       Mid-point
Low_I         50 – 120    100
Medium_I      100 – 165   140
High_I        140 – 200   183

Steps 3, 4, 5:
ID   Salary     IQ         Salary’s membership   IQ’s membership
1    Low_S      Low_I      0.71                  0.8
2    Low_S      Low_I      0.71                  0.83
3    High_S     High_I     0.37                  0.67
4    Low_S      Low_I      0.86                  0.86
5    Medium_S   Medium_I   0.83                  0.74
6    Medium_S   Medium_I   0.56                  0.74
7    Low_S      Low_I      0.14                  0.31
23
Fuzzy mining of association rules
It is composed of two steps:
1. Find all itemsets that have fuzzy support (FS<X,A>)
above the user-specified minimum support. These
itemsets are called frequent itemsets.
2. Use the frequent itemsets to generate the desired
rules. Let X and Y be frequent itemsets. The rule
X => Y holds if the fuzzy confidence
FC<<X,A>,<Y,B>> is larger than the user-specified
minimum confidence value.
(Attilia, 2000)
24
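Fuzzy mining of association rules - cont
\[
FS_{\langle X, A\rangle} =
\frac{\sum_{t_i \in D} \prod_{x_j \in X} m_{x_j}(a_j \in A,\ t_i.x_j)}{|D|}
\]
\[
FC_{\langle\langle X, A\rangle, \langle Y, B\rangle\rangle} =
\frac{\sum_{t_i \in D} \prod_{z_j \in Z} m_{z_j}(c_j \in C,\ t_i.z_j)}
     {\sum_{t_i \in D} \prod_{x_j \in X} m_{x_j}(a_j \in A,\ t_i.x_j)}
\]
• D = {t1, t2, …, tn}: the transactions
• <X,A>: X is a set of attributes and A is the set of corresponding fuzzy sets
• Z = X ∪ Y, C = A ∪ B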
Outline
Introduction
Background
A fuzzy approach for mining fuzzy association
rules
Experimental evaluation
Conclusions
26
Experiments: Synthetic datasets
• Using synthetic datasets of varying sizes:
Name        |D|    |T|   Size (MB)
D100k.T10   100K   10    3
D100k.T20   100K   20    6
D320k.T30   320K   30    18
|D| = Number of transactions
|T| = Average number of items per transaction
27
Experiment environment
• Software
    Database: Microsoft Access 2003
    Language: C++, Visual Basic, and Matlab
    Platform: Windows
• Hardware
    PC Pentium IV 2.66 GHz, 1 GB RAM
28
Evaluate meaning of rules
From the Salary and IQ database, the approach yields the following rules with
minimum support = 43% and minimum confidence = 50%:
Rule 1: If 1st variable is low, approximately 7000 [4000, 10000],
then 2nd variable is low, approximately 100 [50, 120]
Rule 2: If 1st variable is medium, approximately 15000 [7000, 20000],
then 2nd variable is medium, approximately 140 [100, 165]

Frequent itemsets at minimum support = 43%:
The Apriori algorithm:
No frequent itemsets
Mining quantitative algorithm with fuzzy approach:
Frequent Itemset 1: 1st variable is low, approximately 7000 [4000, 10000];
2nd variable is low, approximately 100 [50, 120]
Frequent Itemset 2: 1st variable is medium, approximately 15000 [7000, 20000];
2nd variable is medium, approximately 140 [100, 165]
29
Evaluate meaning of rules - cont
Frequent itemsets at minimum support = 15%:
The Apriori algorithm:
Frequent Itemset 1: 1st variable is 5000, 2nd variable is 85
Frequent Itemset 2: 1st variable is 7000, 2nd variable is 100
Frequent Itemset 3: 1st variable is 9000, 2nd variable is 110
Frequent Itemset 4: 1st variable is 10000, 2nd variable is 120
Frequent Itemset 5: 1st variable is 15000, 2nd variable is 140
Frequent Itemset 6: 1st variable is 20000, 2nd variable is 165
Frequent Itemset 7: 1st variable is 30000, 2nd variable is 183
Mining quantitative algorithm:
Frequent Itemset 1: 1st variable is low, approximately 7000 [4000, 10000];
2nd variable is low, approximately 100 [50, 120]
Frequent Itemset 2: 1st variable is high, approximately 30000 [15000, 32000];
2nd variable is high, approximately 183 [140, 200]
Frequent Itemset 3: 1st variable is medium, approximately 15000 [7000, 20000];
2nd variable is medium, approximately 140 [100, 165]
30
Evaluate fuzziness
Ada:
ID   Salary’s membership   IQ’s membership
1    0.74                  0.85
2    0.91                  0.93
3    0.57                  0.67
4    0.9                   0.9
5    0.83                  0.84
6    0.66                  0.84
7    0.34                  0.51

New approach:
ID   Salary’s membership   IQ’s membership
1    0.71                  0.8
2    0.71                  0.83
3    0.37                  0.67
4    0.86                  0.86
5    0.83                  0.74
6    0.56                  0.74
7    0.14                  0.31

Using Yager’s fuzziness with p = 1:
\[
f_p(\tilde{A}) = 1 - \frac{D_p(\tilde{A}, \neg\tilde{A})}{|\mathrm{Supp}(\tilde{A})|},
\qquad
D_1(\tilde{A}, \neg\tilde{A}) = \sum_{i=1}^{n} \left| \mu_{\tilde{A}}(x_i) - \mu_{\neg\tilde{A}}(x_i) \right|
\]
• Ada_fuzziness_Salary ≈ 0.357 ≤ NewApproach_fuzziness_Salary ≈ 0.425
• Ada_fuzziness_IQ ≈ 0.51 ≤ NewApproach_fuzziness_IQ ≈ 0.59
The new approach is fuzzier than Ada
31
Evaluate fuzziness - cont
Frequent itemsets at minimum support = 15%:
Ada’s approach:
Frequent Itemset 1: 1st variable is low, approximately 5000 [4000, 10000];
2nd variable is low, approximately 85 [50, 120]
Frequent Itemset 2: 1st variable is high, approximately 20000 [15000, 32000];
2nd variable is high, approximately 165 [140, 200]
Frequent Itemset 3: 1st variable is medium, approximately 10000 [7000, 20000];
2nd variable is medium, approximately 120 [100, 165]
New approach:
Frequent Itemset 1: 1st variable is low, approximately 7000 [4000, 10000];
2nd variable is low, approximately 100 [50, 120]
Frequent Itemset 2: 1st variable is high, approximately 30000 [15000, 32000];
2nd variable is high, approximately 183 [140, 200]
Frequent Itemset 3: 1st variable is medium, approximately 15000 [7000, 20000];
2nd variable is medium, approximately 140 [100, 165]
In Ada’s approach, the mid-points of the ranges are shifted away from the centre values.
This changes the meaning of the frequent itemsets.
32
Execution time (sec.) with different
minimum support thresholds
Name        Min_sup = 35%        Min_sup = 40%        Min_sup = 50%
            Apriori    Fuzzy*    Apriori    Fuzzy*    Apriori    Fuzzy*
D100k.T30   80860      42558     4158       1980      485        244
D100k.T20   155440     77720     30005      15792     27012      13506
D320k.T30   329532     147673    69011      28425     52322      20259
*: does not include the transfer time

Name        Time to transform the database into fuzzy sets
D100k.T30   95
D100k.T20   5062
D320k.T30   9112
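Since the Fuzzy columns exclude the transfer time, a fair comparison adds it back. The short sketch below does that arithmetic with the figures reported on this slide; the dictionary layout and function names are assumptions made for the example.

```python
# Execution times from this slide: (Apriori, Fuzzy mining only) per Min_sup (%).
MINING = {
    "D100k.T30": {35: (80860, 42558), 40: (4158, 1980), 50: (485, 244)},
    "D100k.T20": {35: (155440, 77720), 40: (30005, 15792), 50: (27012, 13506)},
    "D320k.T30": {35: (329532, 147673), 40: (69011, 28425), 50: (52322, 20259)},
}
# Time to transform each database into fuzzy sets.
TRANSFER = {"D100k.T30": 95, "D100k.T20": 5062, "D320k.T30": 9112}


def fuzzy_total(name, min_sup):
    """Fuzzy mining time plus the one-off transfer time."""
    return MINING[name][min_sup][1] + TRANSFER[name]


def speedup(name, min_sup):
    """Apriori time divided by the total fuzzy time (above 1 means faster)."""
    return MINING[name][min_sup][0] / fuzzy_total(name, min_sup)


if __name__ == "__main__":
    for name, by_sup in MINING.items():
        for min_sup in by_sup:
            print(name, min_sup, fuzzy_total(name, min_sup),
                  round(speedup(name, min_sup), 2))
```

In every reported configuration the total stays below the Apriori time, for example 15792 + 5062 = 20854 against 30005 for D100k.T20 at Min_sup = 40%.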
33
Execution time (sec.) with different
minimum support thresholds - cont
[Figure: three charts (Min_sup=35%, Min_sup=40%, Min_sup=50%) comparing the execution time of Apriori and Fuzzy on datasets 1, 2, and 3]
• The execution time (transfer + mining time) of the fuzzy
method is better than that of Apriori.
• Moreover, the meaning of the rules is
more “understandable”
34
Conclusions
• Proposed an approach to find fuzzy sets for
quantitative attributes for mining association
rules
• An experimental evaluation shows that the
meaning of the rules and the execution time when
using the fuzzy approach in mining association
rules are better than those of other algorithms
• Future work:
    Improve the fuzzy mining approach
    Develop incremental algorithms for association
    analysis using Support Vector Machines
35
THANK YOU
36