Machine Learning Algorithms: Classification and Association Rules

Classification rules and how to learn them from data

Terminology: presbyopic = weakened eyesight due to age; myope = short-sighted.

Illustrative example: contact lenses data

    Person   Age             Spect. presc.  Astigm.  Tear prod.  Lenses
    O1       young           myope          no       reduced     NONE
    O2       young           myope          no       normal      SOFT
    O3       young           myope          yes      reduced     NONE
    O4       young           myope          yes      normal      HARD
    O5       young           hypermetrope   no       reduced     NONE
    O6-O13   ...             ...            ...      ...         ...
    O14      pre-presbyopic  hypermetrope   no       normal      SOFT
    O15      pre-presbyopic  hypermetrope   yes      reduced     NONE
    O16      pre-presbyopic  hypermetrope   yes      normal      NONE
    O17      presbyopic      myope          no       reduced     NONE
    O18      presbyopic      myope          no       normal      NONE
    O19-O23  ...             ...            ...      ...         ...
    O24      presbyopic      hypermetrope   yes      normal      NONE

Classes: N (none), S (soft), H (hard) contact lenses.

Decision tree for contact lenses recommendation

    tear prod. = reduced: NONE
    tear prod. = normal:
        astigmatism = no: SOFT
        astigmatism = yes:
            spect. presc. = myope: HARD
            spect. presc. = hypermetrope: NONE

Problems with decision trees?
- Decision trees can be transformed into rules, but to apply those rules we need "complete information" about the case.
- The resulting rule sets can be rather complex (1 rule = 1 branch of the tree) and difficult for a human user to understand.
- Sets of rules in DNF are sometimes easier to grasp:
    If X then C1
    If X and Y then C2
    If not X and Z and Y then C3
    If B then C2
- But learning such sets is more difficult!

Ordered or unordered sets of rules?
A disjunction of two rules does not have to improve their performance!
Example: let us have 1000 cases and two rules R1 and R2, each covering 100 cases and each correct on 90 of them. What happens when R1 and R2 are combined?
- In the best case the incorrectly covered cases are identical, and the accuracy of R1 OR R2 is (90+90)/(90+90+10) ≈ 0.95.
- In the worst case R1 and R2 are correct on the same cases and wrong on different ones. The accuracy of R1 OR R2 is then 90/(90+10+10) ≈ 0.82.

Ruleset representation
A rule base is a disjunctive set of conjunctive rules. Standard forms of rules:

    IF Conditions THEN Class
    Class IF Conditions
    Class <- Conditions

Examples:

    IF Outlook=Sunny AND Humidity=Normal THEN PlayTennis=Yes
    IF Outlook=Overcast THEN PlayTennis=Yes
    IF Outlook=Rain AND Wind=Weak THEN PlayTennis=Yes

Form of CN2 rules: IF Conditions THEN MajClass [ClassDistr]
Rule base: {R1, R2, R3, ..., DefaultRule}

Decision tree vs. rule learning: splitting vs. covering
[Figure: the same set of + and - examples, partitioned by recursive splits (splitting: ID3, C4.5, J48, See5) vs. covered rule by rule (covering: AQ, CN2).]

Classification rule learning
- Rule set representation
- Two rule learning approaches:
    - learn a decision tree and convert it to rules
    - learn a set/list of rules directly:
        - learning an unordered set of rules
        - learning an ordered list of rules
- Heuristics, overfitting, pruning

PlayTennis: training examples

    Day  Outlook   Temperature  Humidity  Wind    PlayTennis
    D1   Sunny     Hot          High      Weak    No
    D2   Sunny     Hot          High      Strong  No
    D3   Overcast  Hot          High      Weak    Yes
    D4   Rain      Mild         High      Weak    Yes
    D5   Rain      Cool         Normal    Weak    Yes
    D6   Rain      Cool         Normal    Strong  No
    D7   Overcast  Cool         Normal    Strong  Yes
    D8   Sunny     Mild         High      Weak    No
    D9   Sunny     Cool         Normal    Weak    Yes
    D10  Rain      Mild         Normal    Weak    Yes
    D11  Sunny     Mild         Normal    Strong  Yes
    D12  Overcast  Mild         High      Weak    Yes
    D13  Overcast  Hot          Normal    Weak    Yes
    D14  Rain      Mild         High      Strong  No
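To make the rule notation concrete, here is a minimal sketch (not from the original course materials; all names are illustrative) that encodes the PlayTennis examples above and measures a conjunctive rule's coverage and accuracy, the two quantities behind the R1 OR R2 discussion above.

```python
# The PlayTennis examples as Python dicts, plus a helper computing how many
# examples a conjunctive rule covers and how many of those it gets right.

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Weak", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
examples = [dict(zip(ATTRS, row)) for row in ROWS]

def rule_stats(conditions, target_class, examples):
    """conditions: dict attribute -> value, read as a conjunction."""
    covered = [e for e in examples
               if all(e[a] == v for a, v in conditions.items())]
    correct = sum(e["PlayTennis"] == target_class for e in covered)
    return len(covered), correct

# IF Outlook=Sunny AND Humidity=High THEN PlayTennis=No
print(rule_stats({"Outlook": "Sunny", "Humidity": "High"}, "No", examples))
# -> (3, 3): covers D1, D2, D8 and is correct on all of them
```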
PlayTennis: using a decision tree for classification

    Outlook = Sunny:
        Humidity = High: No
        Humidity = Normal: Yes
    Outlook = Overcast: Yes
    Outlook = Rain:
        Wind = Strong: No
        Wind = Weak: Yes

Is Saturday morning OK for playing tennis?
    Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong
    PlayTennis = No, because Outlook=Sunny AND Humidity=High.

Contact lenses: classification rules

    tear production=reduced => lenses=NONE [S=0, H=0, N=12]
    tear production=normal & astigmatism=no => lenses=SOFT [S=5, H=0, N=1]
    tear production=normal & astigmatism=yes & spect. presc.=myope => lenses=HARD [S=0, H=3, N=2]
    tear production=normal & astigmatism=yes & spect. presc.=hypermetrope => lenses=NONE [S=0, H=1, N=2]
    DEFAULT lenses=NONE

Unordered rulesets
- A rule "Class IF Conditions" is learned by first determining Class and then Conditions.
- The classes C1, ..., Cn form an ordered sequence in the RuleSet.
- But execution is unordered (rules are independent) when classifying a new instance: all rules are tried, the predictions of those covering the example are collected, and voting is used to obtain the final classification.
- If no rule fires, the DefaultClass (majority class in E) is predicted.

Contact lenses: decision list
Ordered (order-dependent) rules:

    IF tear production=reduced THEN lenses=NONE
    ELSE /* tear production=normal */
        IF astigmatism=no THEN lenses=SOFT
        ELSE /* astigmatism=yes */
            IF spect. presc.=myope THEN lenses=HARD
            ELSE /* spect. presc.=hypermetrope */ lenses=NONE

Ordered sets of rules: if-then-else decision lists
- A rule "Class IF Conditions" is learned by first determining Conditions and then Class.
- Notice: a mixed sequence of classes C1, ..., Cn in the RuleBase.
- But execution is ordered when classifying a new instance: rules are tried sequentially and the first rule that "fires" (covers the example) is used for classification.
- Decision list {R1, R2, R3, ..., D}: the rules Ri are interpreted as if-then-else rules.
- If no rule fires, the DefaultClass (majority class in Ecur) is predicted.

Original covering algorithm (AQ, Michalski 1969, 1986)
Basic covering algorithm:

    for each class Ci do
        Ei := Pi ∪ Ni  (Pi positive, Ni negative examples)
        RuleBase(Ci) := empty
        repeat                                    {learn-set-of-rules}
            learn-one-rule R covering some positive examples and no negatives
            add R to RuleBase(Ci)
            delete from Pi all positive examples covered by R
        until Pi = empty

Learning an unordered set of rules (CN2, Clark and Niblett 1989)
Top-down approach to search; specialization applies a beam search:

    RuleBase := empty
    for each class Ci do
        Ei := Pi ∪ Ni; RuleSet(Ci) := empty
        repeat                                    {learn-set-of-rules}
            R := "Class = Ci IF Conditions", Conditions := true
            repeat                                {learn-one-rule}
                R' := "Class = Ci IF Conditions AND Cond"
                      (general-to-specific beam search for the best R')
            until stopping criterion is satisfied
                  (no negatives covered, or Performance(R') < ThresholdR)
            add R' to RuleSet(Ci)
            delete from Pi all positive examples covered by R'
        until stopping criterion is satisfied
              (all positives covered, or Performance(RuleSet(Ci)) < ThresholdRS)
        RuleBase := RuleBase ∪ RuleSet(Ci)

Learn-one-rule: greedy vs. beam search
- Greedy general-to-specific search: at each step select the "best" descendant; no backtracking.
- Beam search: maintain a list of the k best candidates at each step; the descendants (specializations) of each of these k candidates are generated, and the resulting set is again reduced to the k best candidates. A sketch of this procedure follows below.

Recommended reading on search in AI: V. Mařík: Řešení úloh a využívání znalostí, a chapter in Mařík et al.: Umělá inteligence (1), Academia 1993, 2003.
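Below is a minimal sketch of learn-one-rule with general-to-specific beam search, in the spirit of CN2. The slides leave the Performance heuristic abstract; Laplace accuracy is used here as one common choice, and all names are illustrative, not CN2's actual implementation.

```python
# Learn one conjunctive rule for target_class by beam search: repeatedly
# specialize the k best rule bodies with one more attribute=value test.

def laplace(pos, neg, n_classes=2):
    return (pos + 1) / (pos + neg + n_classes)

def learn_one_rule(examples, target_attr, target_class, beam_width=3):
    attrs = [a for a in examples[0] if a != target_attr]
    values = {a: sorted({e[a] for e in examples}) for a in attrs}

    def score(conds):
        cov = [e for e in examples if all(e[a] == v for a, v in conds)]
        pos = sum(e[target_attr] == target_class for e in cov)
        return laplace(pos, len(cov) - pos)

    beam, best = [frozenset()], frozenset()   # start from the empty rule body
    improved = True
    while improved:
        improved = False
        # specialize each rule body in the beam by one more condition
        candidates = {conds | {(a, v)}
                      for conds in beam
                      for a in attrs if a not in dict(conds)
                      for v in values[a]}
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        if score(beam[0]) > score(best):
            best, improved = beam[0], True
    return dict(best)    # conjunction of attribute=value conditions

# With the PlayTennis examples from the earlier sketch:
# learn_one_rule(examples, "PlayTennis", "Yes")
# -> e.g. {'Outlook': 'Overcast'} (covers 4 positives and no negatives)
```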
Illustrative example: contact lenses data (the table from the beginning of the lecture, repeated on the slide as the search data).

Learn-one-rule as heuristic search

    Lenses = hard IF true                                                [S=..., H=..., N=...]
        Lenses = hard IF Astigmatism = no                                [S=5, H=0, N=7]
        Lenses = hard IF Astigmatism = yes                               [S=0, H=4, N=8]
        Lenses = hard IF Tear prod. = reduced                            [S=0, H=0, N=12]
        Lenses = hard IF Tear prod. = normal                             [S=5, H=4, N=3]
            Lenses = hard IF Tear prod. = normal AND Spect. presc. = myope   [S=2, H=3, N=1]
            Lenses = hard IF Tear prod. = normal AND Spect. presc. = hyperm. [S=3, H=1, N=2]
            Lenses = hard IF Tear prod. = normal AND Astigmatism = no        [S=5, H=0, N=1]
            Lenses = hard IF Tear prod. = normal AND Astigmatism = yes       [S=0, H=4, N=2]

Rule learning: summary
- Hypothesis construction: find a set of n rules; usually simplified into n separate rule constructions.
- Rule construction: find a pair (Class, Cond); either select the rule head (class) and construct the rule body, or construct the rule body and then assign the rule head (in ordered algorithms).
- Body construction: find a set of m features; usually simplified by adding one feature at a time to the rule body.

Associations and Frequent Item Analysis

Outline
- Transactions
- Frequent itemsets
- Subset property
- Association rules
- Applications

Transactions example

    TID  Produce
    1    MILK, BREAD, EGGS
    2    BREAD, SUGAR
    3    BREAD, CEREAL
    4    MILK, BREAD, SUGAR
    5    MILK, CEREAL
    6    BREAD, CEREAL
    7    MILK, CEREAL
    8    MILK, BREAD, CEREAL, EGGS
    9    MILK, BREAD, CEREAL

Transaction database: example
With items A = milk, B = bread, C = cereal, D = sugar, E = eggs, the same database (instances = transactions) becomes:

    TID  Products
    1    A, B, E
    2    B, D
    3    B, C
    4    A, B, D
    5    A, C
    6    B, C
    7    A, C
    8    A, B, C, E
    9    A, B, C

Transaction database: example
Attributes converted to binary flags:

    TID  A  B  C  D  E
    1    1  1  0  0  1
    2    0  1  0  1  0
    3    0  1  1  0  0
    4    1  1  0  1  0
    5    1  0  1  0  0
    6    0  1  1  0  0
    7    1  0  1  0  0
    8    1  1  1  0  1
    9    1  1  1  0  0

Definitions
- Item: an attribute=value pair, or simply a value.
- Itemset I: a subset of the possible items. Usually attributes are converted to binary flags for each value, e.g. Product = "A" is written simply as "A". Example: I = {A, B, E} (order unimportant).
- Transaction: a pair (TID, itemset), where TID is the transaction ID.

Support and frequent itemsets
- Support of an itemset I: sup(I) = the number of transactions t that support (i.e. contain) I.
- In the example database: sup({A,B,E}) = 2, sup({B,C}) = 4.
- A frequent itemset I is one with at least the minimum support count: sup(I) >= minsup.
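A minimal sketch (illustrative, not course code) computing itemset support over the example transaction database above; a Python set models each transaction.

```python
# Itemset support: the number of transactions containing the itemset.

transactions = [
    {"A", "B", "E"},         # TID 1
    {"B", "D"},              # TID 2
    {"B", "C"},              # TID 3
    {"A", "B", "D"},         # TID 4
    {"A", "C"},              # TID 5
    {"B", "C"},              # TID 6
    {"A", "C"},              # TID 7
    {"A", "B", "C", "E"},    # TID 8
    {"A", "B", "C"},         # TID 9
]

def support(itemset, trans):
    """Count the transactions that contain (support) the itemset."""
    return sum(itemset <= t for t in trans)   # <= is the subset test

minsup = 2
print(support({"A", "B", "E"}, transactions))   # 2  -> frequent
print(support({"B", "C"}, transactions))        # 4  -> frequent
```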
Subset property
Every subset of a frequent set is frequent!
Q: Why is it so?
Example: suppose {A,B} is frequent. Since each occurrence of {A,B} includes both A and B, both A and B must be frequent as well. A similar argument applies to larger itemsets.
Almost all association rule algorithms are based on this subset property!

Finding frequent itemsets
Start by finding one-item frequent sets (easy).
Q: How? A: Simply count the frequencies of all items.

Finding itemsets: the next level
Apriori algorithm (Agrawal & Srikant). Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, and so on.
- If {A,B} is a frequent itemset, then {A} and {B} have to be frequent itemsets as well!
- In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent.
- Compute the k-item sets by merging (k-1)-item sets.

An example
Given five frequent three-item sets (A B C), (A B D), (A C D), (A C E), (B C D); keeping them in lexicographic order improves efficiency. Which four-item sets are candidates?
- (A B C D)? Yes, because all of its 3-item subsets are frequent.
- (A C D E)? No, because (C D E) is not frequent.

Classification vs. association rules

    Classification rules                 Association rules
    focus on one target field            many target fields
    specify class in all cases           applicable in some cases
    measure: accuracy                    measures: support, confidence, lift

Association rules
An association rule R: Itemset1 => Itemset2, where Itemset1 and Itemset2 are disjoint and Itemset2 is non-empty. Reading: "if a transaction includes Itemset1 then it also has Itemset2".
Examples (over the binary transaction database above):

    A, B => C
    A, B => C, E
    A => B, C
    A, B => D

From frequent itemsets to association rules
Q: Given the frequent set {A,B,E}, what are the possible association rules?

    A => B, E
    A, B => E
    A, E => B
    B => A, E
    B, E => A
    E => A, B
    __ => A, B, E   (the empty rule, also written true => A, B, E)

Rule support and confidence
Suppose R: I => J is an association rule.
- sup(R) = sup(I ∪ J) is the support count: the support of the itemset I ∪ J.
- conf(R) = sup(I ∪ J) / sup(I) is the confidence of R: the fraction of transactions with I that also have J.
Association rules with a given minimum support and confidence are sometimes called "strong" rules.

Measures for the rule Ant => Suc
Let a be the total number of transactions containing both Ant and Suc:

              Suc      Non(Suc)
    Ant       a        b          r = a+b
    Non(Ant)  c        d          s = c+d
              k = a+c  l = b+d    n = r+s

    support = a/n, confidence = a/r, cover = a/k

4ft quantifiers in LISp-Miner: the "above average" quantifier a/r > (1+p)*k/n means: "when comparing the share of transactions meeting Suc in the full dataset with the share among the transactions meeting Ant, the latter is at least 100*p % higher."
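Returning to Apriori's candidate-generation step a few slides back, here is a minimal sketch (illustrative names, not the original algorithm's code) of the join/prune step that merges frequent (k-1)-item sets and discards candidates with an infrequent subset; it reproduces the four-item-set example above.

```python
# Apriori candidate generation: join frequent (k-1)-item sets that share
# their first k-2 items, then prune by the subset property.

from itertools import combinations

def apriori_gen(frequent):
    """frequent: set of sorted tuples, all of length k-1."""
    k = len(next(iter(frequent))) + 1
    candidates = set()
    for p in frequent:
        for q in frequent:
            # join step: identical (k-2)-prefix, p's last item before q's
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune step: every (k-1)-subset must itself be frequent
                if all(s in frequent for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

threes = {("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
          ("A", "C", "E"), ("B", "C", "D")}
print(apriori_gen(threes))
# {('A', 'B', 'C', 'D')} -- ('A', 'C', 'D', 'E') is pruned because e.g.
# ('C', 'D', 'E') is not among the frequent three-item sets
```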
Association rules: example
Recall conf(I => J) = sup(I ∪ J) / sup(I).
Q: Given the frequent set {A,B,E} in the example database above, which association rules have minsup = 2 and minconf = 50%?

Qualifying:

    A, B => E : conf = 2/4 = 50%
    A, E => B : conf = 2/2 = 100%
    B, E => A : conf = 2/2 = 100%
    E => A, B : conf = 2/2 = 100%

Not qualifying:

    A => B, E : conf = 2/6 ≈ 33% < 50%
    B => A, E : conf = 2/7 ≈ 29% < 50%
    __ => A, B, E : conf = 2/9 ≈ 22% < 50%

Find strong association rules
Problem: find all association rules with sup(R) >= minsup and conf(R) >= minconf for given parameters minsup and minconf.
First step: find all frequent itemsets.

Generating association rules
A two-stage process:
- Determine the frequent itemsets, e.g. with the Apriori algorithm.
- For each frequent itemset I and each non-empty subset J of I, determine all association rules of the form I-J => J.
The main idea used in both stages: the subset property.

Example: generating rules from an itemset
Weather (golf/tennis) data:

    Outlook   Temp  Humidity  Windy  Play
    Sunny     Hot   High      False  No
    Sunny     Hot   High      True   No
    Overcast  Hot   High      False  Yes
    Rainy     Mild  High      False  Yes
    Rainy     Cool  Normal    False  Yes
    Rainy     Cool  Normal    True   No
    Overcast  Cool  Normal    True   Yes
    Sunny     Mild  High      False  No
    Sunny     Cool  Normal    False  Yes
    Rainy     Mild  Normal    False  Yes
    Sunny     Mild  Normal    True   Yes
    Overcast  Mild  High      True   Yes
    Overcast  Hot   Normal    False  Yes
    Rainy     Mild  High      True   No

Is {Humidity=Normal, Windy=False, Play=Yes} a frequent itemset? Its support is 4.

Example: generating rules from the frequent set {Humidity=Normal, Windy=False, Play=Yes}
Seven potential rules (confidence on the right):

    If Humidity=Normal and Windy=False then Play=Yes              4/4
    If Humidity=Normal and Play=Yes then Windy=False              4/6
    If Windy=False and Play=Yes then Humidity=Normal              4/6
    If Humidity=Normal then Windy=False and Play=Yes              4/7
    If Windy=False then Humidity=Normal and Play=Yes              4/8
    If Play=Yes then Humidity=Normal and Windy=False              4/9
    If true then Humidity=Normal and Windy=False and Play=Yes     4/14

Rules for the weather data
Rules with support > 1 and confidence = 100%:

    #    Association rule                                    Sup.  Conf.
    1    Humidity=Normal, Windy=False => Play=Yes            4     100%
    2    Temperature=Cool => Humidity=Normal                 4     100%
    3    Outlook=Overcast => Play=Yes                        4     100%
    4    Temperature=Cool, Play=Yes => Humidity=Normal       3     100%
    ...  ...                                                 ...   ...
    58   Outlook=Sunny, Temperature=Hot => Humidity=High     2     100%

In total: 3 rules with support four, 5 with support three, and 50 with support two.

Weka associations
[Screenshot: the Weka Associate panel run on weather.nominal.arff with MinSupport = 0.2.]
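Stage two of the generation process above, sketched in code (illustrative names; it re-declares the small A..E transaction database from the earlier slides so it runs on its own): for a frequent itemset I, emit every rule (I-J) => J whose confidence clears minconf.

```python
# Enumerate rules (I - J) => J from a frequent itemset I and keep those
# with confidence sup(I) / sup(I - J) >= minconf.

from itertools import combinations

transactions = [{"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"},
                {"A", "C"}, {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"},
                {"A", "B", "C"}]

def support(itemset, trans):
    return sum(itemset <= t for t in trans)   # <= is the subset test

def rules_from_itemset(itemset, trans, minconf=0.5):
    items = sorted(itemset)
    sup_i = support(set(items), trans)
    rules = []
    for r in range(1, len(items) + 1):        # r = size of the consequent J
        for j in combinations(items, r):
            ant = set(items) - set(j)         # antecedent I - J
            conf = sup_i / support(ant, trans)
            if conf >= minconf:
                rules.append((sorted(ant), sorted(j), sup_i, conf))
    return rules

for ant, suc, sup, conf in rules_from_itemset({"A", "B", "E"}, transactions):
    print(f"{ant} => {suc}  sup={sup}  conf={conf:.0%}")
# Reproduces the four qualifying rules from the {A,B,E} example above;
# the empty antecedent gives conf 2/9 and is filtered out.
```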
Filtering association rules
Problem: any large dataset can lead to a very large number of association rules, even with reasonable minimum confidence and support.
Confidence by itself is not sufficient! E.g. if all transactions include Z, then any rule I => Z will have confidence 100%. Hence other measures are used to filter rules.

Further WEKA measures for the rule Ant => Suc
(Using the contingency counts a, b, c, d and the sums r, k, l, n defined earlier; support = a/n, confidence = a/r, cover = a/k.)
- lift = (a/r)/(k/n) = a*n/(r*k): "Lift estimates the increase in precision of the default prediction of Suc on the set of transactions meeting Ant, compared with that on the whole dataset."
- leverage = (a - r*k/n)/n: "The ratio of 'extra' transactions covered by the rule, compared to those that would be covered if Ant and Suc were independent."
- conviction = r*l/(b*n): "Similar to lift, but it considers transactions that are not covered by Suc."

Weka associations: output
[Screenshot: rules found by Weka's Apriori associator, with their support, confidence and further measures.]

Association rule lift
The lift of an association rule I => J is defined as lift = P(J|I) / P(J), i.e. the ratio of confidence to expected confidence. Note: P(I) = (support of I) / (number of transactions).
Interpretation:
- lift > 1: I and J are positively correlated,
- lift < 1: I and J are negatively correlated,
- lift = 1: I and J are independent.

Other issues
- The ARFF format is very inefficient for typical market-basket data: attributes represent the items in a basket, and most items are usually missing.
- Interestingness of associations: find unusual associations, e.g. milk usually goes with bread, but soy milk does not.

Beyond binary data
- Hierarchies: drink, milk, low-fat milk, Stop&Shop low-fat milk, ...; find associations on any level of the hierarchy.
- Sequences over time
- ...

Applications
- Market basket analysis: store layout, client offers; bookstores offering similar titles (see e.g. Amazon); the "diapers and beer" urban legend.
- Recommendations concerning new services or new customers, e.g.
    if (Car=Porsche & Gender=Male & Age < 20) then (Risk=high & Insurance=high)
- Finding unusual events: WSARE (What is Strange About Recent Events), ...

Summary
- Frequent itemsets
- Association rules
- Subset property
- Apriori algorithm
- Application difficulties
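To close with a worked example (illustrative sketch; it assumes the a, b, c, d contingency counts from the "Measures for the rule Ant => Suc" table above), the filtering measures discussed in this part can be computed directly from those four counts:

```python
# Rule measures from the 2x2 contingency counts, as defined in the slides.

def rule_measures(a, b, c, d):
    """a: Ant&Suc, b: Ant&non-Suc, c: non-Ant&Suc, d: non-Ant&non-Suc."""
    r, s = a + b, c + d            # transactions with / without Ant
    k, l = a + c, b + d            # transactions with / without Suc
    n = r + s                      # all transactions
    return {
        "support":    a / n,
        "confidence": a / r,
        "cover":      a / k,
        "lift":       a * n / (r * k),
        "leverage":   (a - r * k / n) / n,
        "conviction": r * l / (b * n) if b else float("inf"),
    }

# Rule B => C in the A..E example database: a=4, b=3, c=2, d=0
for name, value in rule_measures(4, 3, 2, 0).items():
    print(f"{name:10s} {value:6.3f}")
# lift = 4*9/(7*6) is about 0.857 < 1: B and C are negatively correlated here
```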