Association Rules Hawaii International Conference on System Sciences (HICSS-40) January 2007

David L. Olson
Yanhong Li
Fuzzy Association Rules
• Association rules mining provides
information to assess significant
correlations in large databases
• IF X THEN Y
– Initial data mining analysis
– Not predictive
• SUPPORT: degree to which relationship
appears in data
• CONFIDENCE: probability that if X, then Y
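As a minimal sketch of the two measures for ordinary (crisp) transactions: support is the fraction of transactions containing an itemset, and confidence of IF X THEN Y is sup(X and Y) / sup(X). The toy data and the function names `support` and `confidence` are illustrative assumptions, not taken from the talk.

```python
# Toy market-basket data; the item names are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): sup(X and Y) / sup(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

sup_xy = support({"bread", "milk"}, transactions)        # 2 of the 4 transactions
conf_xy = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 bread transactions
```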
Association Rule Algorithms
• APriori
• Agrawal et al., 1993; Agrawal & Srikant, 1994
– Find correlations among transactions, binary
values
• Weighted association rules
• Cai et al., 1998; Lu et al. 2001
• Cardinal data
• Srikant & Agrawal, 1996
– Partitions attribute domain, combines
adjacent partitions until binary
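The level-wise idea behind APriori can be sketched as below for binary transactions. This is a simplified illustration, not the published algorithm's optimized form: the function name `apriori` and the toy data are assumptions, and support is recomputed by scanning all transactions.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    n = len(transactions)

    def sup(items):
        return sum(items <= t for t in transactions) / n

    # Level 1: frequent single items.
    level = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in level if sup(s) >= minsup}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = sup(s)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if sup(c) >= minsup}
    return frequent

toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(toy, minsup=0.6)
```

With these toy transactions every single item and every pair reaches the threshold, while the triple {a, b, c} (support 0.4) does not.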
Fuzzy Analysis
Deal with vagueness & uncertainty
• Fuzzy Set Theory
– Zadeh [1965]
• Probability Theory
– Pearl [1988]
• Rough Set Theory
– Pawlak [1982]
• Set Pair Theory
– Zhao [2000]
Fuzzy Association Rules
• Most based on APriori algorithm
• Treat all attributes as uniform
• Can increase number of rules by
decreasing minimum support, decreasing
minimum confidence
– Generates many uninteresting rules
– Software takes a lot longer
Gyenesei (2000)
• Studied weighted quantitative association rules
in fuzzy domain
– With & without normalization
– NONNORMALIZED
• Used product operator to define combined weight and fuzzy
value
• If weight small, support level small, tends to have data
overflow
– NORMALIZED
• Used geometric mean of item weights as combined weight
• Support then very small
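To illustrate only the two ways of combining item weights named above (product versus geometric mean), here is a small sketch. It is not Gyenesei's full support definition; the function names are assumptions, and the three weights are borrowed from the Age, Income and Risk attributes of the demo model.

```python
import math

def product_weight(weights):
    """Combined weight as the plain product of the item weights."""
    return math.prod(weights)

def geometric_mean_weight(weights):
    """Combined weight as the geometric mean of the item weights."""
    return math.prod(weights) ** (1 / len(weights))

weights = [0.45, 0.55, 0.70]
combined_product = product_weight(weights)         # shrinks as more items are added
combined_geomean = geometric_mean_weight(weights)  # stays between min and max weight
```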
Algorithm
• Get membership functions, minimum
support, minimum confidence
• Assign weight to each fuzzy membership
for each attribute (categorical)
• Calculate support for each fuzzy region
• If support > minimum, OK
• If confidence > minimum, OK
• If both OK, generate rules
Demo Model: Loan App
Case   Age   Income   Risk     Credit   Result
1      20    52623    -38954   Red      0
2      26    23047    -23636   Green    1
3      46    56810    45669    Green    1
4      31    38388    -7968    Amber    1
5      28    80019    -35125   Green    1
6      21    74561    -47592   Green    1
7      46    65341    58119    Green    1
8      25    46504    -30022   Green    1
9      38    65735    30571    Green    1
10     27    26047    -6       Red      1
Fuzzified Age
Figure 2: The membership functions of attribute Age
(Plot of membership value, 0 to 1, against Age, 0 to 100, with axis marks at 25, 35, 40 and 50; fuzzy sets Young, Middle and Old.)
Fuzzify Age
Case   Age   Young   Middle   Old
1      20    1.0     0.0      0.0
2      26    0.9     0.1      0.0
3      46    0.0     0.4      0.6
4      31    0.4     0.6      0.0
5      28    0.7     0.3      0.0
6      21    1.0     0.0      0.0
7      46    0.0     0.4      0.6
8      25    1.0     0.0      0.0
9      38    0.0     1.0      0.0
10     27    0.8     0.2      0.0
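The Age memberships above are consistent with piecewise-linear (trapezoidal) functions with breakpoints at 25, 35, 40 and 50. The sketch below assumes exactly that shape, inferred from the figure's axis marks and the tabulated values; the function names are illustrative.

```python
def young(age):
    """1 up to age 25, falling linearly to 0 at age 35."""
    if age <= 25:
        return 1.0
    if age >= 35:
        return 0.0
    return (35 - age) / 10

def middle(age):
    """Rises from 25 to 35, stays at 1 until 40, falls to 0 at 50."""
    if age <= 25 or age >= 50:
        return 0.0
    if age < 35:
        return (age - 25) / 10
    if age <= 40:
        return 1.0
    return (50 - age) / 10

def old(age):
    """0 up to age 40, rising linearly to 1 at age 50."""
    if age <= 40:
        return 0.0
    if age >= 50:
        return 1.0
    return (age - 40) / 10

ages = [20, 26, 46, 31, 28, 21, 46, 25, 38, 27]  # the ten demo cases
table = [(a, round(young(a), 1), round(middle(a), 1), round(old(a), 1)) for a in ages]
```

Under these assumed breakpoints, each row of `table` matches the corresponding case in the Fuzzify Age slide, for example `(26, 0.9, 0.1, 0.0)` for case 2.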
Calculate Support for Each Pair of
Fuzzy Categories
• Membership value
– Identify weights for each attribute
– Identify highest fuzzy membership category
for each case
• Membership value = minimum weight associated
with highest fuzzy membership category
• Support
– Average membership value for all cases
Support by Single Item
Category        Region   Weight   Sup(Rjk)
Age Young       R11      0.45     0.261
Age Middle      R12      0.45     0.135
Age Old         R13      0.45     0.059
Income High     R21      0.55     0.000
Income Middle   R22      0.55     0.490
Income Low      R23      0.55     0.060
Risk High       R31      0.70     0.320
Risk Middle     R32      0.70     0.146
Risk Low        R33      0.70     0.233
Credit Good     R41      0.80     0.576
Credit Bad      R42      0.80     0.244
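One reading that reproduces the Age Young and Age Middle supports is: attribute weight times the average membership over the ten cases. This is an assumption about the calculation, not a statement of the authors' exact method; the same arithmetic gives 0.054 for Age Old, slightly below the 0.059 on the slide.

```python
# Membership values for the ten demo cases, from the Fuzzify Age slide.
young = [1.0, 0.9, 0.0, 0.4, 0.7, 1.0, 0.0, 1.0, 0.0, 0.8]
middle = [0.0, 0.1, 0.4, 0.6, 0.3, 0.0, 0.4, 0.0, 1.0, 0.2]
AGE_WEIGHT = 0.45  # weight of the Age attribute on the slide

def weighted_support(memberships, weight):
    """Assumed form: weight times the mean membership value over all cases."""
    return weight * sum(memberships) / len(memberships)

sup_young = round(weighted_support(young, AGE_WEIGHT), 3)    # slide: 0.261
sup_middle = round(weighted_support(middle, AGE_WEIGHT), 3)  # slide: 0.135
```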
Support
• If support for pair of categories is above
minimum support, retain
• Identifies all pairs of fuzzy categories with
sufficiently strong relationship
• For outcomes, R51 (On Time) strong,
R52 (Default) not
Support by Pair: minsup 0.25
Pair     Support
R11R22   0.235
R11R31   0.207
R11R41   0.212
R11R51   0.230
R22R31   0.237
R22R41   0.419
R22R51   0.449
R31R41   0.266
R31R51   0.264
R41R51   0.560
Support by Triplet: minsup 0.25
Triplet     Support
R22R41R51   0.417
R22R31R41   0.198
R22R31R51   0.196
R31R41R51   0.264
Quartets
• None qualify, so algorithm stops
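The stopping behaviour can be checked with a short sketch that applies minsup 0.25 to the pair and triplet supports from the slides, then tests whether any quartet has all of its triplets above the threshold (the usual APriori pruning condition, assumed here). The helper names `key` and `all_subsets_frequent` are illustrative.

```python
from itertools import combinations

MINSUP = 0.25

def key(*regions):
    return frozenset(regions)

# Supports copied from the pair and triplet slides.
pair_support = {
    key("R11", "R22"): 0.235, key("R11", "R31"): 0.207, key("R11", "R41"): 0.212,
    key("R11", "R51"): 0.230, key("R22", "R31"): 0.237, key("R22", "R41"): 0.419,
    key("R22", "R51"): 0.449, key("R31", "R41"): 0.266, key("R31", "R51"): 0.264,
    key("R41", "R51"): 0.560,
}
triplet_support = {
    key("R22", "R41", "R51"): 0.417, key("R22", "R31", "R41"): 0.198,
    key("R22", "R31", "R51"): 0.196, key("R31", "R41", "R51"): 0.264,
}

frequent_pairs = {p for p, s in pair_support.items() if s >= MINSUP}
frequent_triplets = {t for t, s in triplet_support.items() if s >= MINSUP}

def all_subsets_frequent(candidate, frequent, k):
    """True if every k-item subset of `candidate` is in `frequent`."""
    return all(frozenset(sub) in frequent for sub in combinations(candidate, k))

# Quartets are built only from regions that appear in a frequent triplet.
regions = sorted(set().union(*frequent_triplets))
quartet_candidates = [
    frozenset(c) for c in combinations(regions, 4)
    if all_subsets_frequent(c, frequent_triplets, 3)
]
```

Five pairs and two triplets pass the threshold. The only possible quartet, R22R31R41R51, contains the infrequent triplet R22R31R41, so `quartet_candidates` is empty.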
Confidence
• Identify direction
• For those training set cases involving the
pair of attributes, what proportion came
out as predicted?
Confidence Values: Pairs
Minimum confidence 0.9
Rule       Confidence
R22→R41    0.855
R41→R22    0.727
R22→R51    0.916
R51→R22    0.697
R31→R41    0.831
R41→R31    0.462
R31→R51    0.825
R51→R31    0.410
R41→R51    0.972
R51→R41    0.870

Rule (from triplets)   Confidence
R41R22→R51             0.995
R41R51→R22             0.744
R22R51→R41             0.928
R31R41→R51             0.993
R31R51→R41             1.000
R51R41→R31             0.472
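The tabulated confidences agree, to rounding, with conf(X→Y) = sup(X and Y) / sup(X) computed from the support slides. The sketch below assumes that formula and keeps only rules whose consequent is the outcome R51 and whose confidence reaches 0.9; the helper names are illustrative, and small differences in the third decimal come from using the rounded supports.

```python
def key(*regions):
    return frozenset(regions)

# Supports copied from the single-item, pair and triplet slides.
support = {
    key("R22"): 0.490, key("R31"): 0.320, key("R41"): 0.576,
    key("R22", "R41"): 0.419, key("R22", "R51"): 0.449,
    key("R31", "R41"): 0.266, key("R31", "R51"): 0.264,
    key("R41", "R51"): 0.560,
    key("R22", "R41", "R51"): 0.417, key("R31", "R41", "R51"): 0.264,
}

def confidence(antecedent, consequent):
    """sup(antecedent and consequent) / sup(antecedent)."""
    a = key(*antecedent)
    return support[a | key(consequent)] / support[a]

MINCONF = 0.9
candidates = [
    (("R22",), "R51"), (("R31",), "R51"), (("R41",), "R51"),
    (("R22", "R41"), "R51"), (("R31", "R41"), "R51"),
]
rules = [
    (a, c, round(confidence(a, c), 3))
    for a, c in candidates
    if confidence(a, c) >= MINCONF
]
```

Four candidates survive; R31→R51 is dropped because its confidence, about 0.825, falls below 0.9.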
4 Rules
• IF Income is Middle THEN Outcome is On-Time
– R22→R51: support 0.490, confidence 0.916
• IF Credit is Good THEN Outcome is On-Time
– R41→R51: support 0.576, confidence 0.972
• IF Income is Middle AND Credit is Good THEN Outcome is On-Time
– R22R41→R51: support 0.419, confidence 0.995
• IF Risk is High AND Credit is Good THEN Outcome is On-Time
– R31R41→R51: support 0.266, confidence 0.993
Rules vs. Support
Figure 7: The relationship between number of association rules and minsup using the proposed method
(Plot of the number of association rules, 0 to 20, against minsup, 0.2 to 0.55; one curve for each of minconf = 0.55, 0.65, 0.75, 0.85, 0.95 and 1.)
Rules vs. Confidence
Figure 8: The relationship between number of association rules and minconf using the proposed method
(Plot of the number of association rules, 0 to 20, against minconf, 0.55 to 1; one curve for each of minsup = 0.2, 0.25, 0.3, 0.35, 0.4 and 0.55.)
Higher order combinations
• Try triplets
– If ambitious, sets of 4, and beyond
• Here, none
• Problems:
– Computational complexity explodes
– Doesn’t guarantee total coverage
• That also would explode complexity
• Can control coverage by lowering minsup, minconf
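To give a feel for how the candidate space grows with itemset size, the sketch below counts possible k-region combinations for the demo model. It assumes at most one fuzzy region per attribute in an itemset and uses the region counts visible on the slides (three each for Age, Income and Risk, two each for Credit and Outcome); the function name `candidate_count` is illustrative.

```python
from itertools import combinations
from math import prod

# Number of fuzzy regions per attribute in the demo model.
regions_per_attribute = {"Age": 3, "Income": 3, "Risk": 3, "Credit": 2, "Outcome": 2}

def candidate_count(k):
    """Number of k-itemsets that take one region from each of k distinct attributes."""
    return sum(prod(combo) for combo in combinations(regions_per_attribute.values(), k))

counts = [candidate_count(k) for k in range(1, 6)]  # singles, pairs, ..., quintets
```

For this small model the counts stay modest, but each added attribute multiplies the number of higher-order candidates.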
Simulation Testing
• Selected 550 cases
– Held out 100
• Randomly assigned weights to each fuzzy
region of each attribute
– minsup {0.35, 0.45, 0.55, 0.65}
– minconf {0.7, 0.8, 0.9}
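The experimental layout can be sketched as a hold-out split plus a grid over the two thresholds. The sketch assumes the 100 held-out cases are drawn from the 550 selected ones (the slide does not say whether they are additional), uses placeholder case ids rather than real data, and fixes a random seed purely for repeatability.

```python
import random
from itertools import product

N_CASES, N_HOLDOUT = 550, 100
MINSUP_GRID = [0.35, 0.45, 0.55, 0.65]
MINCONF_GRID = [0.7, 0.8, 0.9]

rng = random.Random(0)            # fixed seed, an arbitrary choice
case_ids = list(range(N_CASES))   # placeholder ids standing in for loan cases
rng.shuffle(case_ids)
test_ids = case_ids[:N_HOLDOUT]   # held-out cases for measuring accuracy
train_ids = case_ids[N_HOLDOUT:]  # cases used to mine the rules

# Every (minsup, minconf) combination to be evaluated.
settings = list(product(MINSUP_GRID, MINCONF_GRID))
```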
Simulation Results
Accuracy vs. minsup & minconf
(Plot of accuracy, 0 to 0.7, against minsup, 0.35 to 0.65; one curve for each of weighted minconf = 0.7, 0.8 and 0.9.)