Rule Mining and the Apriori Algorithm
MIT 15.097 Course Notes
Cynthia Rudin
The Apriori algorithm is often called the “first thing data miners try,” but somehow it doesn’t appear in most data mining textbooks or courses!
Start with market basket data: an m × n binary matrix M in which row i is a transaction (a market basket), column j is an item, and M_{i,j} = 1 if transaction i contains item j, 0 otherwise.
Some important definitions:
• Itemset: a subset of items, e.g., (bananas, cherries, elderberries), indexed by {2, 3, 5}.
• Support of an itemset: the number of transactions containing it, e.g.,

    Supp(bananas, cherries, elderberries) = Σ_{i=1}^{m} M_{i,2} · M_{i,3} · M_{i,5}.
• Confidence of rule a → b: the fraction of times itemset b is purchased
when itemset a is purchased.
Supp(a ∪ b) #times a and b are purchased
=
Supp(a)
#times a is purchased
= Pˆ (b|a).
Conf(a → b) =
1
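As a concrete illustration, here is a minimal sketch of both definitions, assuming the binary-matrix representation above; the toy matrix and the helper names supp and conf are mine, not from the notes.

```python
# A minimal sketch of Supp and Conf on a binary market-basket matrix M,
# where M[i, j] = 1 if transaction i contains item j. The toy data and
# helper names are illustrative, not from the notes.
import numpy as np

M = np.array([  # rows = transactions, columns = items 0..5
    [0, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 0],
])

def supp(M, itemset):
    """Supp(itemset): number of transactions containing every item in it."""
    return int(np.sum(M[:, sorted(itemset)].all(axis=1)))

def conf(M, a, b):
    """Conf(a -> b) = Supp(a ∪ b) / Supp(a), the empirical P(b|a)."""
    return supp(M, a | b) / supp(M, a)   # assumes Supp(a) > 0

print(supp(M, {2, 3, 5}))    # 2: rows 0 and 1 contain items 2, 3, and 5
print(conf(M, {2, 3}, {5}))  # 1.0: every {2,3}-transaction also has item 5
```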
We want to find all strong rules. These are rules a → b such that:
Supp(a ∪ b) ≥ θ, and Conf(a → b) ≥ minconf.
Here θ is called the minimum support threshold, and minconf is the minimum confidence threshold.
The support has a monotonicity property called downward closure:
If Supp(a ∪ b) ≥ θ, then Supp(a) ≥ θ and Supp(b) ≥ θ.
That is, if a ∪ b is a frequent itemset, then so are a and b. This holds because

    Supp(a ∪ b) = #times a and b are purchased ≤ #times a is purchased = Supp(a).
Apriori finds all frequent itemsets (all a such that Supp(a) ≥ θ). We can use Apriori’s result to get all strong rules a → b as follows (a sketch appears after this list):
• For each frequent itemset ℓ:
  – Find all nonempty proper subsets of ℓ.
  – For each such subset a, output a → (ℓ \ a) whenever

      Supp(ℓ) / Supp(a) ≥ minconf.
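A hedged sketch of this rule-generation step, assuming the frequent itemsets and their supports are already available as a dict (the names strong_rules, frequent, and minconf are illustrative):

```python
# Emit every rule a -> (l \ a) whose confidence Supp(l)/Supp(a) >= minconf.
# `frequent` maps frozenset itemsets to support counts; by downward closure,
# every nonempty subset of a frequent itemset is itself a key in `frequent`.
from itertools import combinations

def strong_rules(frequent, minconf):
    rules = []
    for ell, supp_ell in frequent.items():
        for r in range(1, len(ell)):                   # nonempty proper subsets
            for a in map(frozenset, combinations(ell, r)):
                if supp_ell / frequent[a] >= minconf:  # Supp(l)/Supp(a)
                    rules.append((set(a), set(ell - a)))
    return rules
```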
Now for Apriori itself. It uses the downward closure property: generate candidate k-itemsets (itemsets of size k) from the frequent (k − 1)-itemsets. It’s a breadth-first search over the itemset lattice.
Example: θ = 10.

[Figure: breadth-first search through the itemset lattice for items a–g, which are fruits (apples, bananas, cherries, elderberries, grapes, …).
1-itemsets (supports): a 25, b 20, c 30, d 45, e 29, f 5, g 17.
2-itemsets: {a,b} 7, {a,c} 25, {a,d} 15, {a,e} 23, …, {e,g} 3.
3-itemsets: {a,c,d} 15, {a,c,e} 22, {b,d,g} 15, …
4-itemsets: {a,c,d,e} 12.
Infrequent itemsets (support < 10), such as f, {a,b}, and {e,g}, are not extended.]
Apriori Algorithm:
Input: matrix M, minimum support θ.
L_1 = {frequent 1-itemsets: i such that Supp(i) ≥ θ}.
For k = 2, while L_{k−1} ≠ ∅ (while there are frequent (k − 1)-itemsets), k++:
• C_k = apriori_gen(L_{k−1})    (generate candidate itemsets of size k)
• L_k = {c ∈ C_k : Supp(c) ≥ θ}    (frequent itemsets of size k; loop over the transactions, scanning the database)
end
Output: ∪_k L_k.
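A minimal sketch of this main loop, assuming the database is given as a list of transaction sets and that apriori_gen is the subroutine sketched below (all names are illustrative):

```python
# Breadth-first search over itemset sizes: L1, C2, L2, C3, ... One database
# scan per level. Requires the apriori_gen sketched further down.
def apriori(transactions, theta):
    counts = {}
    for t in transactions:                    # scan D once to count 1-itemsets
        for item in t:
            c = frozenset([item])
            counts[c] = counts.get(c, 0) + 1
    L = {c: s for c, s in counts.items() if s >= theta}               # L1
    frequent = dict(L)
    while L:
        Ck = apriori_gen(set(L))              # candidate k-itemsets
        counts = {c: sum(c <= t for t in transactions) for c in Ck}   # scan D
        L = {c: s for c, s in counts.items() if s >= theta}           # Lk
        frequent.update(L)
    return frequent                           # union of all Lk, with supports
```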
The subroutine apriori_gen joins L_{k−1} to L_{k−1}.
apriori_gen Subroutine:
Input: L_{k−1}
Find all pairs of itemsets in L_{k−1} whose first k − 2 items (in lexicographic order) are identical. Union each such pair to get C_k^{too big},
e.g., {a, b, c, d, e, f}, {a, b, c, d, e, g} → {a, b, c, d, e, f, g}.
Prune: C_k = {c ∈ C_k^{too big} : all (k − 1)-subsets c_s of c obey c_s ∈ L_{k−1}}.
Output: C_k.
Example of the Prune step: consider {a, b, c, d, e, f, g}, which is in C_7^{too big}, and suppose I want to know whether it belongs in C_7. Look at its 6-subsets: {b, c, d, e, f, g}, {a, c, d, e, f, g}, {a, b, d, e, f, g}, {a, b, c, e, f, g}, etc. If any of them is not in L_6, then prune {a, b, c, d, e, f, g} so that it never enters C_7.
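The same subroutine as a sketch, under the assumptions above (itemsets as frozensets of comparable items; names illustrative):

```python
# Join step: union pairs of (k-1)-itemsets that agree on their first k-2
# items in sorted order. Prune step: drop any candidate with an infrequent
# (k-1)-subset, justified by downward closure.
from itertools import combinations

def apriori_gen(L_prev):
    prev = sorted(tuple(sorted(c)) for c in L_prev)   # lexicographic order
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:                      # first k-2 items match
                candidates.add(frozenset(a) | frozenset(b))
    return {c for c in candidates
            if all(frozenset(s) in L_prev
                   for s in combinations(c, len(c) - 1))}
```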
The Apriori Algorithm: a worked example (minimum support θ = 2).

Database D (TID: items):
100: 1 3 4
200: 2 3 5
300: 1 2 3 5
400: 2 5

Scan D → C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3.
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3   ({4} is infrequent).
apriori_gen → C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}.
Scan D → C2 supports: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2.
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2.
apriori_gen → C3: {2 3 5}.
Scan D → L3: {2 3 5}: 2.
Note: {1,2,3}, {1,2,5} and {1,3,5} are not in C3 (each contains an infrequent 2-subset, e.g., {1,2} or {1,5}).
Image by MIT OpenCourseWare, adapted from Osmar R. Zaïane.
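Combining the two sketches above reproduces this worked example (the variable names are mine):

```python
# Database D from the worked example, with minimum support theta = 2.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, theta=2)
assert frequent[frozenset({2, 3, 5})] == 2    # matches L3 above
print(strong_rules(frequent, minconf=1.0))    # e.g., ({1}, {3}) appears
```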
• Apriori scans the database at most how many times? (One scan per candidate size k.)
• It can generate a huge number of candidate sets.
• It spawned a huge number of Apriori-like papers.
What do you do with the rules after they’re generated?
• Information overload (give up)
• Order rules by “interestingness” (a sketch follows this list):
  – Confidence:

      P̂(b|a) = Supp(a ∪ b) / Supp(a).

  – “Lift”/“Interest” (with m transactions in total):

      P̂(b|a) / P̂(b) = (Supp(a ∪ b)/Supp(a)) / (Supp(b)/m).

  – Hundreds more!
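A one-function sketch of lift, reusing the supp and conf helpers assumed earlier:

```python
# Lift(a -> b) = P(b|a) / P(b); values above 1 mean a and b co-occur more
# often than independence would predict.
def lift(M, a, b):
    m = M.shape[0]                    # number of transactions
    return conf(M, a, b) / (supp(M, b) / m)
```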
Research questions:
• mining more than just itemsets (e.g., sequences, trees, graphs)
• incorporating a taxonomy over the items
• boolean logic and “logical analysis of data”
• Cynthia’s question: can we use rules within ML to get good predictive models?
MIT OpenCourseWare
http://ocw.mit.edu
15.097 Prediction: Machine Learning and Statistics
Spring 2012
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.