FINDING FUZZY SETS FOR QUANTITATIVE ATTRIBUTES FOR MINING OF FUZZY ASSOCIATION RULES
By H.N.A. Pham, T.W. Liao, and E. Triantaphyllou
Department of Industrial Engineering, 3128 CEBA Building, Louisiana State University, Baton Rouge, LA 70803-6409
Email: hpham15@lsu.edu, ieliao@lsu.edu, and trianta@lsu.edu

Outline
• Introduction
• Background
• A fuzzy approach for mining association rules
• Experimental evaluation
• Conclusions

Introduction
• Association analysis is a new and attractive research area in data mining.
• The Apriori algorithm (R. Agrawal, IBM, 1993) is a key technique for association analysis.
• Although the Apriori principle considerably reduces the search space, the technique still requires heavy computation, particularly for large databases.
• This research proposes an approach that finds fuzzy sets for the quantitative attributes of a database using clustering techniques, and then employs them for mining fuzzy association rules.

Outline
• Introduction
• Background
  - Association rules and the Apriori algorithm
  - The necessity of finding fuzzy sets for quantitative attributes
• A fuzzy approach for mining fuzzy association rules
• Experimental evaluation
• Conclusions

Association rules: Market basket analysis
• Market basket analysis studies customer buying habits by finding associations between the different items that customers place in their "shopping baskets" (rules of the form X -> Y, where X and Y are sets of items), e.g., how often do people buy candy and beer together?
• Items: I = {I1 = beer, I2 = cake, I3 = onigiri}
• A transactional database:
  TID1: {I1, I2, I3}
  TID2: {I1, I2}
  TID3: {I2, I3}
  TID4: {I2}
  TID5: {I1, I2}
• An association rule: {I1} -> {I3}

Rule measures: Support and confidence
(Venn diagram: customers who buy beer, customers who buy onigiri, and customers who buy both.)
Transaction ID   Items bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
For an association rule X -> Y:
• support s = probability that a transaction contains both X and Y
• confidence c = conditional probability that a transaction containing X also contains Y
Examples: A -> C (s = 50%, c = 66.6%); C -> A (s = 50%, c = 100%)

Association mining: the Apriori algorithm
The algorithm consists of two steps:
1. Find all frequent itemsets. By definition, each of these itemsets occurs at least as frequently as a pre-determined minimum support count.
2. Generate strong association rules from the frequent itemsets. By definition, these rules must satisfy minimum support and minimum confidence (Agrawal, 1993).

Association mining: the Apriori principle
Transaction ID   Items bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
Minimum support = 50%, minimum confidence = 50%
Frequent itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%
For the rule A -> C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent (equivalently, if an itemset is not frequent, neither are its supersets).

The Apriori algorithm: Finding frequent itemsets using candidate generation
1. Find the frequent itemsets, i.e., the sets of items whose support is at least the minimum support.
  - A subset of a frequent itemset must also be a frequent itemset; e.g., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
  - Iteratively find the frequent k-itemsets Lk from the candidate itemsets Ck (Lk ⊆ Ck), for k = 1, 2, ...: C1 -> L1 -> C2 -> L2 -> ... -> Ck -> Lk.
2. Use the frequent itemsets to generate association rules.
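As a concrete illustration of this level-wise candidate-generation scheme, the sketch below mines frequent itemsets from the four-transaction table of the support/confidence slide. It is only an illustrative sketch under our own assumptions: the function name apriori, the data structures, and the choice of Python are ours, not part of the presentation (whose experiments were implemented in C++, Visual Basic, and Matlab).

```python
# Minimal sketch of Apriori level-wise frequent-itemset mining (illustration only).
from itertools import combinations

def apriori(transactions, min_sup_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # C1 / L1: count single items and keep those meeting the minimum support count.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join L(k-1) with itself, then prune by the Apriori
        # principle (every (k-1)-subset of a candidate must itself be frequent).
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                    frozenset(sub) in frequent for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Scan the database once per level to count candidate support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_sup_count}
        result.update(frequent)
        k += 1
    return result

# Transactions from the support/confidence slide; min support 50% of 4 transactions = 2.
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(db, 2))  # frequent itemsets: {A}, {B}, {C}, {A, C}
```

Running it with a minimum support count of 2 (50% of the four transactions) returns {A}, {B}, {C}, and {A, C}, matching the frequent itemsets listed above; the worked example that follows applies the same procedure by hand.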
Example (min_sup_count = 2)
Transactional data D:
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3
Scan D for the count of each candidate, then compare each candidate's support count with the minimum support count:
C1 (itemset: support count): {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2
L1 (itemset: support count): {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2

Example (min_sup_count = 2), continued
Generate the candidates C2 from L1 using the Apriori principle, scan D for the count of each candidate, and compare each candidate's support count with the minimum support count:
C2 (itemset: support count): {I1, I2}: 4, {I1, I3}: 4, {I1, I4}: 1, {I1, I5}: 2, {I2, I3}: 4, {I2, I4}: 2, {I2, I5}: 2, {I3, I4}: 0, {I3, I5}: 1, {I4, I5}: 0
L2 (itemset: support count): {I1, I2}: 4, {I1, I3}: 4, {I1, I5}: 2, {I2, I3}: 4, {I2, I4}: 2, {I2, I5}: 2
Generate the candidates C3 from L2 using the Apriori principle, scan D again, and compare with the minimum support count:
C3 (itemset: support count): {I1, I2, I3}: 2, {I1, I2, I5}: 2
L3 (itemset: support count): {I1, I2, I3}: 2, {I1, I2, I5}: 2

Necessity of finding fuzzy sets for quantitative attributes
Transaction ID   Age   Married   NumCars
100              33    Yes       2
200              39    Yes       2
300              35    No        1
400              20    No        0
• A quantitative association rule with min_sup = min_conf = 50%:
  (Age = 33 or 39) and (Married = Yes) -> (NumCars = 2)
• A quantitative association rule with min_sup = min_conf = 50%:
  (Age = 33..39) and (Married = Yes) -> (NumCars = 2)
• A fuzzy association rule with min_sup = min_conf = 50%:
  (Age = middle-aged) and (Married = Yes) -> (NumCars = 2)

Solution: Sharp boundary intervals (Srikant and Agrawal, 1996)
The method consists of two steps (a short sketch follows this list):
1. Partition the attribute domains into small intervals and combine adjacent intervals into larger ones so that the combined intervals have enough support.
2. Replace each original attribute by its attribute-interval pairs; the quantitative problem is thereby transformed into a Boolean one.
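The following Python sketch illustrates the second step: replacing a quantitative attribute by Boolean attribute-interval pairs so that a Boolean (Apriori-style) miner can be applied. It is a minimal sketch under our own assumptions; the helper name to_boolean_intervals and the hard-coded interval edges (chosen to match the example that follows) are ours, and the interval-combining step of Srikant and Agrawal is not shown.

```python
# Minimal sketch of the sharp-boundary transformation: each quantitative attribute
# is replaced by one Boolean column per interval [lo, hi] (interval edges assumed).
def to_boolean_intervals(records, attribute, intervals):
    """Replace a quantitative attribute by Boolean attribute-interval pairs."""
    out = []
    for rec in records:
        rec = dict(rec)
        value = rec.pop(attribute)
        for lo, hi in intervals:
            rec[f"{attribute}: {lo}-{hi}"] = lo <= value <= hi
        out.append(rec)
    return out

records = [
    {"TID": 100, "Age": 33, "Married": True, "NumCars": 2},
    {"TID": 200, "Age": 39, "Married": True, "NumCars": 2},
    {"TID": 300, "Age": 35, "Married": False, "NumCars": 1},
    {"TID": 400, "Age": 20, "Married": False, "NumCars": 0},
]
records = to_boolean_intervals(records, "Age", [(18, 30), (31, 39)])
records = to_boolean_intervals(records, "NumCars", [(0, 1), (2, 3)])
print(records[0])  # {'TID': 100, 'Married': True, 'Age: 18-30': False, 'Age: 31-39': True, ...}
```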
Example: Sharp boundary intervals
Transaction ID   Age   Married   NumCars
100              33    Yes       2
200              39    Yes       2
300              35    No        1
400              20    No        0
After partitioning into intervals:
Transaction ID   Age: 18-30   Age: 31-39   Married   NumCars: 0-1   NumCars: 2-3
100              No           Yes          Yes       No             Yes
200              No           Yes          Yes       No             Yes
300              No           Yes          No        Yes            No
400              Yes          No           No        Yes            No
• Mining algorithms either ignore or over-emphasize the elements near the boundaries of the intervals.
• The use of sharp boundary intervals is also not intuitive with respect to human perception.

Solution: Experts
• A user or a domain expert must provide the algorithm with the required fuzzy sets of the quantitative attributes and their corresponding membership functions.
• However, fuzzy sets and membership functions provided by experts may not be suitable for mining fuzzy association rules in the database.

Solution: Fuzzy sets for quantitative attributes (Ada, 1998)
The method consists of three steps:
Step 1: Transform the original database into one with positive integers.
Step 2: For each attribute i:
  - Cluster the values of the i-th attribute into k medoids.
  - Classify the i-th attribute into k fuzzy sets.
  - Generate a membership function for each fuzzy set.
Step 3: Transform the database based on the fuzzy sets.
Drawback: because each attribute is clustered separately, the associations between attributes are lost in the mining approach.

Outline
• Introduction
• Background
• A fuzzy approach for mining fuzzy association rules
  - The fuzzy approach
  - Mining fuzzy association rules
• Experimental evaluation
• Conclusions

The fuzzy approach
It consists of five steps:
Step 1: Transform the original database into one with positive integers.
Step 2: Cluster the values of the attributes into k medoids.
Step 3: Classify the attributes into k fuzzy sets.
Step 4: Generate a membership function for each fuzzy set.
Step 5: Transform the database based on the fuzzy sets.

Fuzzy approach: Step 2 (clustering)
• The clustering method treats the search space of a database with n attributes as an n-dimensional space.
• The Matlab fuzzy toolbox is used for clustering.
• Because all attributes are clustered jointly, the associations between attributes are not lost in the mining approach.

Fuzzy approach: Step 3 (classification)
• Let {m1, m2, ..., mk} be the k medoids found in Step 2, where mi = (ai1, ai2, ..., ain) is the i-th medoid.
• Let the j-th attribute have the range [minj, maxj] and let {a1j, a2j, ..., akj} be the set of mid-points of the j-th attribute. The k fuzzy sets of this attribute range over [minj, a2j], [a1j, a3j], ..., [a(i-1)j, a(i+1)j], ..., and [a(k-1)j, maxj].

Fuzzy approach: Step 4 (membership functions)
A triangular membership function is generated for each fuzzy set. For the i-th fuzzy set of the j-th attribute, with left endpoint min_j, mid-point (peak) a_{ij}, and right endpoint max_j of its range:

$$f_{ij}(x;\, \min_j, a_{ij}, \max_j) = \begin{cases} \dfrac{x - \min_j}{a_{ij} - \min_j}, & \text{if } \min_j \le x \le a_{ij},\\[4pt] \dfrac{\max_j - x}{\max_j - a_{ij}}, & \text{if } a_{ij} < x \le \max_j,\\[4pt] 0, & \text{otherwise.} \end{cases}$$

Fuzzy approach: Step 5 (transformation)
Transform the database based on the fuzzy sets:
• Let Tij be the value of the i-th transaction at the j-th attribute.
• Tij is replaced by the fuzzy label of the fuzzy set in which it has the highest membership, i.e., by label l such that f_{lj}(Tij) = max_k f_{kj}(Tij).

Example of the fuzzy approach
Original data:
ID   Salary   IQ
1    10000    120
2    7000     100
3    30000    183
4    9000     110
5    15000    140
6    20000    165
7    5000     85
Step 2 (fuzzy sets found for Salary and IQ):
Fuzzy label   Range           Mid-point
Low_S         4000 - 10000    7000
Medium_S      7000 - 20000    15000
High_S        15000 - 32000   30000
Fuzzy label   Range        Mid-point
Low_I         50 - 120     100
Medium_I      100 - 165    140
High_I        140 - 200    183
Steps 3, 4, and 5 (transformed database with membership degrees):
ID   Salary     IQ         Salary's membership   IQ's membership
1    Low_S      Low_I      0.71                  0.8
2    Low_S      Low_I      0.71                  0.83
3    High_S     High_I     0.37                  0.67
4    Low_S      Low_I      0.86                  0.86
5    Medium_S   Medium_I   0.83                  0.74
6    Medium_S   Medium_I   0.56                  0.74
7    Low_S      Low_I      0.14                  0.31
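The short Python sketch below illustrates how Steps 4 and 5 can be carried out for the Salary attribute of this example: it builds a triangular membership function for each fuzzy set from the (range, mid-point) values above and assigns each value the label in which it has the highest membership. This is our own illustrative sketch, not the authors' implementation; in the slides the membership degrees come from the clustering step in the Matlab fuzzy toolbox, so the degrees printed here are not expected to reproduce the values in the table above.

```python
# Minimal sketch of Steps 4-5: triangular membership functions and fuzzy relabeling.
# The Salary fuzzy sets are taken from the example; the degrees are illustrative only.
def triangular(x, left, peak, right):
    """Triangular membership function with support [left, right] and peak at `peak`."""
    if x < left or x > right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left) if peak > left else 1.0
    return (right - x) / (right - peak) if right > peak else 1.0

# Fuzzy sets for Salary: label -> (range_low, mid_point, range_high).
salary_sets = {
    "Low_S": (4000, 7000, 10000),
    "Medium_S": (7000, 15000, 20000),
    "High_S": (15000, 30000, 32000),
}

def fuzzify(value, fuzzy_sets):
    """Step 5: assign the label of the fuzzy set with the highest membership degree."""
    degrees = {label: triangular(value, *params) for label, params in fuzzy_sets.items()}
    best = max(degrees, key=degrees.get)
    return best, degrees[best]

for salary in [7000, 9000, 15000, 30000, 5000]:
    print(salary, fuzzify(salary, salary_sets))
```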
Mining fuzzy association rules
It consists of two steps:
1. Find all itemsets whose fuzzy support FS<X,A> exceeds the user-specified minimum support. These itemsets are called frequent itemsets.
2. Use the frequent itemsets to generate the desired rules. Let X and Y be frequent itemsets. The rule X => Y holds if its fuzzy confidence FC<<X,A>,<Y,B>> is larger than the user-specified minimum confidence. (Attilia, 2000)

Mining fuzzy association rules, continued
The fuzzy support and fuzzy confidence are defined as

$$FS_{\langle X,A\rangle} = \frac{\sum_{t_i \in D}\; \prod_{x_j \in X} m_{x_j}(a_j \in A,\; t_i.x_j)}{|D|}$$

$$FC_{\langle\langle X,A\rangle,\langle Y,B\rangle\rangle} = \frac{\sum_{t_i \in D}\; \prod_{z_j \in Z} m_{z_j}(c_j \in C,\; t_i.z_j)}{\sum_{t_i \in D}\; \prod_{x_j \in X} m_{x_j}(a_j \in A,\; t_i.x_j)}$$

where
• D = {t1, t2, ..., tn} is the set of transactions,
• <X, A> denotes a set of attributes X together with A, the corresponding fuzzy sets of the attributes in X,
• Z = X ∪ Y and C = A ∪ B,
• m_{x_j}(a_j, t_i.x_j) is the membership degree of the value t_i.x_j in the fuzzy set a_j.
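A minimal Python sketch of these two formulas is shown below. The membership table, attribute names, and helper names fuzzy_support and fuzzy_confidence are our own illustrative assumptions (the degrees are made up and are not the Salary/IQ data from the example); the code averages, over all transactions, the product of the membership degrees of the attributes in the itemset, and takes the ratio of the combined itemset's support to the antecedent's support for the confidence.

```python
# Minimal sketch of the fuzzy support and confidence formulas above.
# The membership table here is hypothetical, for illustration only.
def fuzzy_support(itemset, memberships):
    """FS<X,A>: average over transactions of the product of membership degrees."""
    total = 0.0
    for t in memberships:  # each t maps (attribute, fuzzy_set) -> membership degree
        prod = 1.0
        for attribute, fuzzy_set in itemset.items():
            prod *= t.get((attribute, fuzzy_set), 0.0)
        total += prod
    return total / len(memberships)

def fuzzy_confidence(antecedent, consequent, memberships):
    """FC<<X,A>,<Y,B>> = FS of the combined itemset divided by FS of the antecedent."""
    combined = {**antecedent, **consequent}  # Z = X U Y with fuzzy sets C = A U B
    return fuzzy_support(combined, memberships) / fuzzy_support(antecedent, memberships)

# Hypothetical fuzzy-transformed transactions (degrees chosen arbitrarily).
memberships = [
    {("Salary", "Low_S"): 0.9, ("IQ", "Low_I"): 0.8},
    {("Salary", "Low_S"): 0.7, ("IQ", "Low_I"): 0.9},
    {("Salary", "Medium_S"): 0.8, ("IQ", "Medium_I"): 0.6},
]
rule_X = {"Salary": "Low_S"}
rule_Y = {"IQ": "Low_I"}
print(fuzzy_support({**rule_X, **rule_Y}, memberships))  # (0.9*0.8 + 0.7*0.9 + 0) / 3 = 0.45
print(fuzzy_confidence(rule_X, rule_Y, memberships))     # 1.35 / 1.6 ≈ 0.84
```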
Outline
• Introduction
• Background
• A fuzzy approach for mining fuzzy association rules
• Experimental evaluation
• Conclusions

Experiments: Synthetic datasets
Synthetic datasets of varying sizes were used:
Name        |D|    |T|   Size (MB)
D100k.T10   100K   10    3
D100k.T20   100K   20    6
D320k.T30   320K   30    18
|D| = number of transactions; |T| = average number of items per transaction

Experimental environment
• Software: Microsoft Access 2003 (database); C++, Visual Basic, and Matlab (languages); Windows (platform)
• Hardware: Pentium IV 2.66 GHz PC with 1 GB of RAM

Evaluating the meaning of the rules
From the Salary/IQ database, the proposed approach yields the following rules with minimum support = 43% and minimum confidence = 50%:
• Rule 1: If the 1st variable is low (approximately 7000, range [4000, 10000]), then the 2nd variable is low (approximately 100, range [50, 120]).
• Rule 2: If the 1st variable is medium (approximately 15000, range [7000, 20000]), then the 2nd variable is medium (approximately 140, range [100, 165]).
With minimum support = 43%:
The Apriori algorithm: no frequent itemsets.
The quantitative mining algorithm with the fuzzy approach:
• Frequent itemset 1: 1st variable is low (approximately 7000, [4000, 10000]); 2nd variable is low (approximately 100, [50, 120])
• Frequent itemset 2: 1st variable is medium (approximately 15000, [7000, 20000]); 2nd variable is medium (approximately 140, [100, 165])

Evaluating the meaning of the rules, continued (minimum support = 15%)
The Apriori algorithm:
• Frequent itemset 1: 1st variable is 5000; 2nd variable is 85
• Frequent itemset 2: 1st variable is 7000; 2nd variable is 100
• Frequent itemset 3: 1st variable is 9000; 2nd variable is 110
• Frequent itemset 4: 1st variable is 10000; 2nd variable is 120
• Frequent itemset 5: 1st variable is 15000; 2nd variable is 140
• Frequent itemset 6: 1st variable is 20000; 2nd variable is 165
• Frequent itemset 7: 1st variable is 30000; 2nd variable is 183
The quantitative mining algorithm with the fuzzy approach:
• Frequent itemset 1: 1st variable is low (approximately 7000, [4000, 10000]); 2nd variable is low (approximately 100, [50, 120])
• Frequent itemset 2: 1st variable is high (approximately 30000, [15000, 32000]); 2nd variable is high (approximately 183, [140, 200])
• Frequent itemset 3: 1st variable is medium (approximately 15000, [7000, 20000]); 2nd variable is medium (approximately 140, [100, 165])

Evaluating fuzziness
Ada's approach:
ID   Salary's membership   IQ's membership
1    0.74                  0.85
2    0.91                  0.93
3    0.57                  0.67
4    0.9                   0.9
5    0.83                  0.84
6    0.66                  0.84
7    0.34                  0.51
New approach:
ID   Salary's membership   IQ's membership
1    0.71                  0.8
2    0.71                  0.83
3    0.37                  0.67
4    0.86                  0.86
5    0.83                  0.74
6    0.56                  0.74
7    0.14                  0.31
Using Yager's fuzziness measure with p = 1,

$$f_p(\tilde A) = 1 - \frac{D_p(\tilde A, \overline{\tilde A})}{\left|\mathrm{Supp}(\tilde A)\right|}, \qquad D_1(\tilde A, \overline{\tilde A}) = \sum_{i=1}^{n} \left|\tilde A(X_i) - \overline{\tilde A}(X_i)\right|,$$

where $\overline{\tilde A}$ denotes the complement of the fuzzy set $\tilde A$:
• Ada_fuzziness_Salary ≈ 0.357 ≤ NewApproach_fuzziness_Salary ≈ 0.425
• Ada_fuzziness_IQ ≈ 0.51 ≤ NewApproach_fuzziness_IQ ≈ 0.59
The new approach is fuzzier than Ada's approach.

Evaluating fuzziness, continued (minimum support = 15%)
Ada's approach:
• Frequent itemset 1: 1st variable is low (approximately 5000, [4000, 10000]); 2nd variable is low (approximately 85, [50, 120])
• Frequent itemset 2: 1st variable is high (approximately 20000, [15000, 32000]); 2nd variable is high (approximately 165, [140, 200])
• Frequent itemset 3: 1st variable is medium (approximately 10000, [7000, 20000]); 2nd variable is medium (approximately 120, [100, 165])
New approach:
• Frequent itemset 1: 1st variable is low (approximately 7000, [4000, 10000]); 2nd variable is low (approximately 100, [50, 120])
• Frequent itemset 2: 1st variable is high (approximately 30000, [15000, 32000]); 2nd variable is high (approximately 183, [140, 200])
• Frequent itemset 3: 1st variable is medium (approximately 15000, [7000, 20000]); 2nd variable is medium (approximately 140, [100, 165])
In Ada's approach, the mid-points of the ranges are shifted away from the center values, which changes the meaning of the frequent itemsets.

Execution time (sec.) for different minimum support thresholds
            Min_sup = 35%       Min_sup = 40%       Min_sup = 50%
Name        Apriori   Fuzzy*    Apriori   Fuzzy*    Apriori   Fuzzy*
D100k.T30   80860     42558     4158      1980      485       244
D100k.T20   155440    77720     30005     15792     27012     13506
D320k.T30   329532    147673    69011     28425     52322     20259
*: does not include the time to transform the database into fuzzy sets
Name        Time to transform the database into fuzzy sets (sec.)
D100k.T30   95
D100k.T20   5062
D320k.T30   9112

Execution time (sec.) for different minimum support thresholds, continued
(Bar charts comparing the Apriori algorithm and the fuzzy method on the three datasets for Min_sup = 35%, 40%, and 50%.)
• The execution time (transformation time + mining time) of the fuzzy method is lower than that of the Apriori algorithm.
• Moreover, the meaning of the rules is more understandable.

Conclusions
• We proposed an approach for finding fuzzy sets for quantitative attributes for mining association rules.
• An experimental evaluation shows that the meaning of the rules and the execution time obtained with the fuzzy approach for mining association rules are better than those of the other algorithms.
• Future work:
  - Improve the fuzzy mining approach.
  - Develop incremental algorithms for association analysis using Support Vector Machines.

THANK YOU
H.N.A. Pham, T.W. Liao, and E. Triantaphyllou
Department of Industrial Engineering, 3128 CEBA Building, Louisiana State University, Baton Rouge, LA 70803-6409
Email: hpham15@lsu.edu, ieliao@lsu.edu, and trianta@lsu.edu