Association Rules (Market Basket Analysis)

Market basket: the collection of items purchased by a customer in a single transaction (e.g. in a supermarket or on a web site).

Association rules:
- Unsupervised learning
- Used for pattern discovery
- Each rule has the form A -> B (Left -> Right)
For example: "70% of customers who purchase 2% milk will also purchase whole wheat bread."

Data mining with association rules is the process of looking for strong rules:
1. Find the large itemsets (i.e. the most frequent combinations of items). The most frequently used algorithm is the Apriori algorithm.
2. Generate association rules from those itemsets.

How do we measure the strength of an association rule?
1. Using support/confidence
2. Using the dependence framework

Support/confidence

Support shows how frequently the pattern in the rule occurs; it is the percentage of transactions that contain both A and B:
   Support = P(A and B)
           = (# of transactions involving both A and B) / (total # of transactions)

Confidence is the strength of implication of the rule; it is the percentage of transactions containing A that also contain B:
   Confidence = P(B | A)
              = (# of transactions involving both A and B) / (# of transactions involving A)

Example:

   Customer   Items purchased
   1          pizza, beer
   2          salad, soda
   3          pizza, soda
   4          salad, tea

If A is "purchased pizza" and B is "purchased soda," then
   Support = P(A and B) = 1/4
   Confidence = P(B | A) = 1/2

Confidence does not measure whether the association between A and B is random or not. For example, if milk occurs in 30% of all baskets, the information that milk occurs in 30% of all baskets that contain bread is useless. But if milk is present in 50% of all baskets that contain coffee, that is significant information.

Support allows us to weed out the most infrequent combinations, but sometimes we should not ignore them, for example when the transaction is valuable and generates large revenue, or when the products repel each other.

Example. We measure the following:
   P(Coke in a basket) = 50%
   P(Pepsi in a basket) = 50%
   P(Coke and Pepsi in a basket) = 0.001%
What does this mean? If Coke and Pepsi were independent, we would expect
   P(Coke and Pepsi in a basket) = 0.5 * 0.5 = 25%.
The fact that the joint probability is much smaller says that the products are dependent and that they repel each other. To exploit this information, work with the dependence framework.

Dependence framework

Example (continuing the previous one):
   Actual(Coke and Pepsi in a basket) = 0.001%
   Expected(Coke and Pepsi in a basket) = 50% * 50% = 25%
If items are statistically dependent, the presence of one of the items in the basket gives us a lot of information about the other items.

How do we determine the threshold of statistical dependence? Use:
- Chi-square
- Impact
- Lift

Chi_square = (ExpectedCooccurrence - ActualCooccurrence) / ExpectedCooccurrence
Pick a small alpha (e.g. 5% or 10%). The number of degrees of freedom equals the number of items minus 1.

Example:
   Chi_square(Pepsi and Coke) = (25 - 0.001)/25 = 0.999
   Degrees of freedom = 2 - 1 = 1
   Alpha = 5%
From the tables the critical value is 3.84, which is higher than our Chi_square value, so this simplified statistic does not by itself reject independence at the 5% level; the Impact and Lift measures below make the repulsion between Pepsi and Coke explicit.

Impact = ActualCooccurrence / ExpectedCooccurrence
Impact = 1 if the products are independent, and ≠ 1 if the products are dependent.

Example: Impact(Pepsi on Coke) = 0.001/25 = 0.00004, far below 1, indicating strong repulsion.

Lift(A on B) = (ActualCooccurrence - ExpectedCooccurrence) / (Frequency of occurrence of A)
   -1 ≤ Lift ≤ 1
Lift is similar to correlation: it is 0 if A and B are independent and approaches +1 or -1 when they are strongly dependent; +1 indicates attraction, -1 indicates repulsion.

Example: Lift(Coke on Pepsi) = (0.001 - 25)/50 ≈ -0.5, a strong repulsion.
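The following short Python sketch (an illustration only; the basket data and the function name are made up for this example, not taken from the notes) computes the measures defined above — support, confidence, the simplified chi-square statistic, impact, and lift — directly from a list of market baskets, using the formulas exactly as stated.

def cooccurrence_measures(baskets, a, b):
    """Support, confidence, chi-square, impact, and lift for items a and b."""
    n = len(baskets)
    f_a = sum(1 for t in baskets if a in t) / n                 # P(A)
    f_b = sum(1 for t in baskets if b in t) / n                 # P(B)
    actual = sum(1 for t in baskets if a in t and b in t) / n   # P(A and B)
    expected = f_a * f_b                                        # co-occurrence if independent
    support = actual
    confidence = actual / f_a if f_a else 0.0                   # P(B | A)
    chi_square = (expected - actual) / expected if expected else 0.0
    impact = actual / expected if expected else 0.0
    lift = (actual - expected) / f_a if f_a else 0.0            # lift as defined in these notes
    return support, confidence, chi_square, impact, lift

baskets = [{"pizza", "beer"}, {"salad", "soda"}, {"pizza", "soda"}, {"salad", "tea"}]
print(cooccurrence_measures(baskets, "pizza", "soda"))   # support 0.25, confidence 0.5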
Why do two items repel or attract each other? Are they substitutes? Are they complementary, so that a third product is needed? Or do they address different market segments? The Product Triangulation strategy examines cross-purchase skews to answer these questions. If the most significant skew occurs when triangulating with respect to promotion or pricing, the products are substitutes.

Example: Orange juice and soda repel each other (so are they substitutes?). They each exhibit a different profile when compared against whole wheat bread and potato chips, so they are not substitutes; they address two different market segments. Pepsi and Coke repel and show no cross-purchase patterns.

1. FIND FREQUENT ITEMSETS: Apriori Algorithm

The Apriori algorithm finds all frequent itemsets in a database.

Definitions:
- Itemset: a set of items
- k-itemset: an itemset consisting of k items
- Frequent itemset (i.e. large itemset): an itemset with sufficient support
- Lk or Fk: the set of large (frequent) k-itemsets
- Ck: the set of candidate k-itemsets
- Apriori property: for any itemsets X and Y, Support(X U Y) ≤ min(Support(X), Support(Y)); in other words, every subset of a frequent itemset must itself be frequent.
- Negative border: an itemset is in the negative border if it is itself infrequent but all of its proper subsets are frequent.
- Interesting rules: strong rules whose antecedent and consequent are dependent.

Apriori algorithm (outline):
Perform iterations, bottom up:
- iteration 1: find L1, all single items with Support > threshold
- iteration 2: using L1, find L2
- iteration i: using Li-1, find Li
- ... until no more frequent k-itemsets can be found.
Each iteration i consists of two phases:
1. Candidate generation: construct a candidate set of large itemsets, i.e. find all the itemsets that could qualify for further consideration, by examining only candidates in Li-1 * Li-1.
2. Candidate counting and selection: count the number of occurrences of each candidate itemset, and determine the large itemsets based on the predetermined support, i.e. select only the candidates with sufficient support.

Set Lk is defined as the set containing the frequent k-itemsets which satisfy Support > threshold.
Lk * Lk is defined as: Lk * Lk = {X U Y, where X, Y belong to Lk and |X ∩ Y| = k-1}.

Apriori algorithm, in more detail (a Python sketch of this procedure appears after the worked example below):

// find all frequent itemsets
Apriori(database D of transactions, min_support) {
    F1 = {frequent 1-itemsets}
    k = 2
    while Fk-1 ≠ EmptySet {
        Ck = AprioriGeneration(Fk-1)
        for each transaction t in the database D {
            Ct = Subset(Ck, t)                 // candidates contained in t
            for each candidate c in Ct { count_c++ }
        }
        Fk = {c in Ck such that count_c ≥ min_support}
        k++
    }
    F = union over k ≥ 1 of Fk
}

// generate and prune the candidate itemsets
AprioriGeneration(Fk-1) {
    // Insert into Ck all combinations obtained by self-joining itemsets in Fk-1.
    // Self-joining means that all but the last item of the two itemsets overlap:
    // join itemsets p, q from Fk-1 into candidate k-itemsets of the form
    // p1 p2 ... pk-1 qk-1, such that pi = qi for i = 1, 2, ..., k-2 and pk-1 < qk-1.
    // Then delete every itemset c in Ck such that some (k-1)-subset of c is not in Fk-1.
}

// find all subsets of candidates contained in t
Subset(Ck, t) { ... }

Example: http://magna.cs.ucla.edu/~hxwang/axl_manual/node20.html
Find the association rules, given min_support = 2 and the database of transactions:
   1 2 3 4
   2 3 4
   3 4 5
   1 2 5
   2 4 5

Apply the Apriori algorithm:
F1 = {1, 2, 3, 4, 5}, because every single item shows up with frequency >= 2.
K = 2:
C2 = AprioriGeneration(F1): insert into C2 the pairs 1 2, 1 3, 1 4, 1 5, 2 3, 2 4, 2 5, 3 4, 3 5, 4 5.
F2 = {1 2, 2 3, 2 4, 2 5, 3 4, 4 5}, because 1 3, 1 4, 1 5, 3 5 are not frequent.

K = 3:
C3 = AprioriGeneration(F2): insert into C3: 2 3 4, 2 3 5, 2 4 5.
Delete 2 3 5 from C3 (because 3 5 is not in F2).
F3 = {2 3 4} (because 2 4 5 shows up only once).

K = 4:
C4 = AprioriGeneration(F3): nothing can be inserted.
Since we cannot generate any more candidate sets by self-joining, the algorithm stops here. The frequent itemsets are those in F1, F2, and F3. The negative border contains the pairs deleted from C2 (1 3, 1 4, 1 5, 3 5) and the triple 2 4 5.

2. GENERATE ASSOCIATION RULES FROM FREQUENT ITEMSETS

For all pairs of frequent itemsets A and B such that A U B is also frequent, calculate c, the confidence of the rule A -> B:
   c = support(A U B) / support(A).
If c >= min_confidence, the rule is strong and should be reported.

Example (continuing the previous one): we can generate rules involving any combination of 1, 2, 3, 4, 5, 1 2, 2 3, 2 4, 2 5, 3 4, 4 5, 2 3 4. For example, the rule 1 2 -> 2 5 is not a strong rule because 1 2 5 is not a frequent itemset. The rule 2 3 -> 4 could be a strong rule, because 2 3 4 and 2 3 are frequent itemsets and c = support(2 3 4)/support(2 3) = 2/2 = 1.

Applications
Let A be the sales of item A and B the sales of item B. Then:
- A rule A -> ... tells you which products will be affected if A is affected.
- A rule ... -> B tells you what needs to be done so that B is affected.
- A rule A ... -> B tells you what to combine with A in order to affect B.

The next step: sequential pattern discovery (i.e. "association rules in time"). For example: college_degree -> professional_job -> high_salary.

Example: http://www.icaen.uiowa.edu/~comp/Public/Apriori.pdf
Assume min_support = 40% = 2/5 and min_confidence = 70%. Five transactions are recorded in a supermarket:

   #   Transaction                                   Code
   1   beer, diaper, baby powder, bread, umbrella    B D P R U
   2   diaper, baby powder                           D P
   3   beer, diaper, milk                            B D M
   4   diaper, beer, detergent                       D B G
   5   beer, milk, cola                              B M C

1. Find the frequent itemsets:
   F1 = {B, D, P, M}
   k = 2: C2 = {BD, BP, BM, DP, DM, PM}; eliminate the infrequent BP, DM, PM, so F2 = {BD, BM, DP}.
   k = 3: self-joining F2 yields the single candidate BDM, but it is pruned because its subset DM is not in F2 (and indeed BDM appears in only one transaction), so F3 is empty. (The web page above erroneously lists BDP here, and some texts list BDM; neither meets the support threshold.)

2. Generate strong rules out of the frequent itemsets B, D, P, M, BD, BM, DP:

   Rule      Support(X U Y)   Support(X)   Confidence   Strong rule?
   B -> D    3/5              4/5          3/4          yes
   B -> P    1/5              4/5          1/4          no
   B -> M    2/5              4/5          2/4          no
   B -> DP   1/5              4/5          1/4          no
   D -> B    3/5              4/5          3/4          yes
   M -> B    2/5              2/5          1            yes
   P -> B    1/5              2/5          1/2          no
   P -> D    2/5              2/5          1            yes

Interesting! What the rules are saying is that it is very likely that a customer who buys diapers or milk will also buy beer. Does that rule make sense?

Example p.170: Assume min_support = 0.4 and min_confidence = 0.6. Contingency table for high-school students (the totals are derived quantities):

                        Eat cereal: Y   Eat cereal: N   Total
   Play basketball: Y   2000            1000            3000
   Play basketball: N   1750            250             2000
   Total                3750            1250            5000

"Play basketball -> eat cereal" is a strong rule because
   support(play basketball and eat cereal) = 2000/5000,
   support(play basketball) = 3000/5000,
   c(rule) = 2/3 > min_confidence.
BUT: support(eat cereal) = 3750/5000 = 0.75 > c, so it does not make sense to link eating cereal to playing basketball.
   Lift(basketball on cereal) = (2000 - (3000*3750)/5000)/3000 ≈ -0.1,
so eating cereal is, if anything, weakly negatively associated with playing basketball rather than implied by it.
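This is the Python sketch referred to above: a minimal illustration of the Apriori procedure, not code from the notes. The function name and the use of frozensets are my own, and the self-join is done here by unioning any two (k-1)-itemsets that differ in one item rather than by the prefix-based join in the pseudocode; after pruning the result is the same. Run on the five-transaction example it reproduces F1, F2, and F3.

from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its count."""
    transactions = [frozenset(t) for t in transactions]
    # Iteration 1: frequent single items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: self-join F(k-1), then prune by the Apriori property.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(frozenset(sub) in frequent
                                           for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Candidate counting and selection.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

transactions = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4, 5}, {1, 2, 5}, {2, 4, 5}]
for itemset, count in sorted(apriori(transactions, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)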
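A matching sketch for the second phase, generating strong rules, continuing the previous sketch (it reuses its apriori function and transactions). Again an illustration under my own assumptions: each frequent itemset is split into every non-empty antecedent/consequent pair, and min_confidence is passed as a parameter.

from itertools import combinations

def generate_rules(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) for every strong rule."""
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # support(A U B) / support(A); A is frequent because B's superset is.
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    yield sorted(antecedent), sorted(itemset - antecedent), confidence

frequent = apriori(transactions, 2)
for a, b, c in generate_rules(frequent, 0.7):
    print(a, "->", b, "confidence", round(c, 2))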
Improving the Apriori algorithm
- Group items into higher conceptual groups, e.g. white and brown bread both become "bread."
- Reduce the number of scans of the entire database (Apriori needs n+1 scans, where n is the length of the longest pattern):
  o Partition-based Apriori
  o Take a subset of the database, generate candidate frequent itemsets from it, and then confirm the hypotheses on the entire database.

Alternative to Apriori, using fewer scans of the database: Frequent Pattern (FP) Growth Method

Used to find the frequent itemsets using only two scans of the database.
Algorithm:
1. Scan the database and find the items with frequency greater than or equal to a threshold T.
2. Order the frequent items in decreasing order of frequency.
3. Construct a tree which has only the root.
4. Scan the database again; for each sample:
   a. add the items from the sample to the existing tree, using only the frequent items (i.e. the items discovered in step 1);
   b. repeat a. until all samples have been processed.
5. Enumerate all frequent itemsets by examining the tree: the frequent itemsets are present in those paths for which every node has frequency ≥ T.

Example p.173: Assume threshold T = 3.

Steps 1 and 2: The frequent items and their frequencies are f:4, a:3, c:4, b:3, m:3, p:3; ordered by decreasing frequency this gives f4 c4 a3 b3 m3 p3.

   Input itemset      Frequent items in the sample   Sorted frequent sequence
   f a c d g i m p    f a c m p                      f c a m p
   a b c f l m o      a b c f m                      f c a b m
   b f h j o          b f                            f b
   b c k s p          b c p                          c b p
   a f c e l p m n    a f c p m                      f c a m p

Steps 3 and 4: Construct the tree by inserting the sorted frequent sequences (the last column of the table above) one at a time. The final tree (each node labeled item:count) is:

   root
     f:4
       c:3
         a:3
           m:2
             p:2
           b:1
             m:1
       b:1
     c:1
       b:1
         p:1

Step 5: The frequent itemsets are contained in those paths, starting from the root, whose nodes have frequency ≥ T. Therefore, at each branching, check whether an item from the branches can be added to the frequent "path," i.e. whether the branches together give frequency ≥ T for that item. In the tree above, the frequent itemsets are f c a m and c p.

WEB MINING

Mine for: content, structure, or usage. The tools analyze on-line or off-line behavior.
Efficiency of a web site = number of purchases / number of visits.
A web site is a network of related pages. Web pages can be:
- Authorities (provide the best source of information)
- Hubs (links to authorities)

HITS algorithm

Searches for authorities and hubs.
Algorithm:
1. Use search engines to search for a given term and collect a root set of pages.
2. Expand the root set by including all the pages that the root set links to, up to a cutoff point (e.g. 1000-5000 pages including links). Assume that we now have n pages in total.
3. Construct the adjacency matrix A such that aij = 1 if page i links to page j.
4. Associate an authority weight ap and a hub weight hp with each page, initially set to a uniform constant:
   a = Transpose{a1, a2, ..., an}
   h = Transpose{h1, h2, ..., hn}
5. Update a and h iteratively:
   a = Transpose(A)*h = (Transpose(A)*A)*a
   h = A*a = (A*Transpose(A))*h

Example p.181: Assume initially a = Transpose{.1, .1, .1, .1, .1, .1} and h = Transpose{.1, .1, .1, .1, .1, .1}, with pages 1-6 and

        0 0 0 1 1 1
        0 0 0 1 1 0
   A =  0 0 0 0 1 0
        0 0 0 0 0 0
        0 0 0 0 0 0
        0 0 1 0 0 0

After one update:
   a = (Transpose(A)*A)*a = Transpose(0 0 .1 .5 .6 .3)
   h = (A*Transpose(A))*h = Transpose(.6 .5 .3 0 0 .1)
It seems that document 5 is the best authority and document 1 is the best hub.
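A small Python sketch of the HITS update (an illustration, not code from the notes; it assumes numpy and normalizes the weight vectors after each step, a detail the notes do not discuss). The adjacency matrix is the six-page example above.

import numpy as np

def hits(A, iterations=20):
    """Iterate the authority/hub updates a = A^T h, h = A a."""
    n = A.shape[0]
    a = np.full(n, 1.0 / n)      # authority weights
    h = np.full(n, 1.0 / n)      # hub weights
    for _ in range(iterations):
        a = A.T @ h
        h = A @ a
        a /= a.sum()             # normalize so the weights stay bounded
        h /= h.sum()
    return a, h

A = np.array([[0, 0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0]], dtype=float)
a, h = hits(A)
print("best authority: page", a.argmax() + 1)   # page 5
print("best hub: page", h.argmax() + 1)         # page 1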
LOGSOM Algorithm

Used for finding users' navigation behavior, i.e. which pages they visit the most. For a given set of URLs url_i, i = 1, ..., n, and a set of user transactions t_j, j = 1, ..., m, assign 1 to url_i in transaction j if that transaction involved visiting the page.

Make a table of all transactions:

          url1   url2   ...   urln
   t1     1      0      ...   1
   t2     0      1      ...   1
   ...
   tm     0      0      ...   1

Use K-means clustering to group the transactions into k groups, and then record the number of hits of each group on each URL. For example:

            url1   url2   ...   urln
   group1   16     0      ...   5
   group2   12     0      ...   3
   ...
   groupk   20     10     ...   0

Mining path-traversal patterns (sequence mining)

Given a collection of sequences ordered in time, where each sequence contains a set of web pages, find the longest and most frequent sequences of navigation patterns.
1. Find the longest traversal patterns.
2. Draw them out as a tree (keep track of backward links).
3. Find the longest consecutive sequences.
4. Find the most frequent ones.

Example p.186: Path = A B C D C B E G H G W A O U O V; assume a threshold of 40% (2/5).

The traversal tree (forward moves only):
   A - B - C - D
       B - E - G - H
               G - W
   A - O - U
       O - V

Maximum forward references: ABCD, ABEGH, ABEGW, AOU, AOV
Frequent forward references: AB, BE, AO, EG, BEG, ABEG
Maximal reference sequences: ABEG, AO (by heuristics).

Text Mining

Represent each document as a vector of tokens (each word is a token); documents can then be compared by calculating the Hamming distance between their token vectors.
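A brief Python sketch of that last idea, under my own assumptions (not from the notes): documents are represented as binary presence/absence vectors over a shared vocabulary, and the Hamming distance counts the positions in which two vectors differ.

def token_vector(document, vocabulary):
    """Binary vector: 1 if the token occurs in the document, else 0."""
    tokens = set(document.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

def hamming_distance(u, v):
    """Number of positions in which the two vectors differ."""
    return sum(1 for x, y in zip(u, v) if x != y)

docs = ["the customer bought milk and bread",
        "the customer bought milk and coffee"]
vocabulary = sorted(set(" ".join(docs).lower().split()))
u, v = (token_vector(d, vocabulary) for d in docs)
print(hamming_distance(u, v))   # 2: the documents differ in "bread" and "coffee"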