Convexity in Itemset Spaces
Limsoon Wong
Institute for Infocomm Research
Copyright © 2005 by Limsoon Wong

Plan
• Frequent itemsets
  – Convexity
  – Equivalence classes, generators, & closed patterns
  – Plateau representation
  – Efficient mining of generators & closed patterns
• Emerging patterns
• Odds ratio patterns
• Relative risk patterns

Frequent Itemsets

Association Rules
• Buyer's behaviour in supermarket
• Mgmt are interested in rules such as [example rule shown as a figure in the original slide]

Frequent Itemsets
• List of items: ℐ = {a, b, c, d, e, f}
• List of transactions: T = {T1, T2, T3, T4, T5}
  – T1 = {a, c, d}
  – T2 = {b, c, e}
  – T3 = {a, b, c, e, f}
  – T4 = {b, e}
  – T5 = {a, b, c, e}
• For each itemset I ⊆ ℐ, sup(I,T) = |{Ti ∈ T | I ⊆ Ti}|
• Freq itemsets: FT = F(ms,T) = {I ⊆ ℐ | sup(I,T) ≥ ms}

A Priori Property
• Freq itemsets from our example, with ms = 2 [lattice figure not reproduced]
• A priori property: I ∈ FT ⇒ ∀ I' ⊆ I, I' ∈ FT

Lattice of Freq Itemsets
• FT can be very large
• Is there a concise rep?
• Observation:
  – {a, b, c, e} is maximal
  – { } is minimal
  – everything else is betw them
• Is ⟨{ }, {a, b, c, e}⟩ a concise rep for FT?

Convexity
• An itemset space S is convex if, for all X, Y ∈ S st X ⊆ Y, we have Z ∈ S whenever X ⊆ Z ⊆ Y
• An itemset X is most general in S if there is no proper subset of X in S. These itemsets form the left bound L of S
• An itemset X is most specific in S if there is no proper superset of X in S. These itemsets form the right bound R of S
• ⟨L, R⟩ is a concise rep of S
• [L, R] = {Z | ∃X ∈ L, ∃Y ∈ R, X ⊆ Z ⊆ Y} = S

Convexity of Freq Itemsets
• Proposition 1: The freq itemset space is convex
⇒ ⟨L, R⟩ is a concise rep for a freq itemset space

Is it good enough?
• ⟨{ }, {a, b, c, e}⟩ can be a concise rep for FT
• But we can't get support values for elems in FT

What is a good concise rep?
• A good concise rep for FT should enable these tasks below efficiently, w/o accessing T again:
  – Task 1: Enumerate {I ∈ FT}
  – Task 2: Enumerate {(I, sup(I,T)) | I ∈ FT}
  – Task 3: Given I, decide if I ∈ FT, & if so report sup(I,T)
  – Task 4: Enumerate itemsets w/ sup in a given range
  – etc.

Closed Itemset Rep
• A pattern is a closed pattern if each of its proper supersets has a smaller support than it
• The closed itemset rep of FT is CR = {(I, sup(I,T)) | I ∈ FT, I is closed pattern}
• Proposition 2: {(I, sup(I,T)) | I ∈ FT} = {(I, max{sup(I',T) | (I', sup(I',T)) ∈ CR, I ⊆ I'}) | I ∈ FT}
⇒ May be inefficient for Tasks 2, 3, 4

Generator Rep
• A pattern is a generator if each of its proper subsets has a larger support than it
• The generator rep of FT is GR = ⟨{(I, sup(I,T)) | I ∈ FT, I is generator}, GBd-⟩ where GBd- are the min in-freq itemsets
• Proposition 3: {(I, sup(I,T)) | I ∈ FT} = {(I, min{sup(I',T) | I' ∈ GR, I' ⊆ I}) | I ∈ FT}
⇒ May be inefficient for Tasks 2, 3, 4

Freq Itemset Plateaus
• Decompose freq itemset lattice into plateaus wrt itemset support, S = ∪i Pi, with Pi = {I ∈ S | sup(I,T) = i}
• Proposition 6: Each Pi is convex
⇒ S = ∪i [Li, Ri], where [Li, Ri] = Pi

From Generators & Closed Patterns To Equivalence Classes
• The equivalence class of an itemset I is [I]T = {I' | {Ti ∈ T | I' ⊆ Ti} = {Tj ∈ T | I ⊆ Tj}}
• Proposition 4: [I]T is convex. Furthermore, if [L,R] = [I]T, then L = min [I]T, and R = max [I]T is a singleton
• Proposition 5:
  – An itemset I is a generator iff I ∈ min [I]T
  – An itemset I is a closed pattern iff I ∈ max [I]T

Plateaus = Generators + Closed Patterns
• Theorem 7: Let [Li,Ri] = Pi be a freq itemset plateau of FT.
Then
  – Pi = [X1]T ∪ … ∪ [Xk]T, where Ri = {X1, …, Xk}
  – Ri are the closed patterns in Pi
  – Li = ∪j min [Xj]T are the generators in Pi

Freq Itemset Plateau Rep
• The freq itemset plateau rep of FT is PR = {(Li, Ri, i) | i ≥ ms} where [Li,Ri] is plateau at support level i in FT
• Proposition 8: {(I, sup(I,T)) | I ∈ FT} = {(I, i) | (Li, Ri, i) ∈ PR, X ∈ Li, Y ∈ Ri, X ⊆ I ⊆ Y}
⇒ All 4 tasks are obviously efficient

Remarks
• PR is a good concise rep for freq itemsets
• PR is more flexible compared to other reps
• PR unifies diff notions used in data mining
• Nice ... But can we mine PR fast?

Mining PR Fast
• To mine PR fast, mine its borders fast
• To mine its borders fast, mine equiv classes in the plateau fast
• To mine equiv classes fast, mine generators & closed patterns of equivalence classes fast

From SE-Tree To Trie To FP-Tree
• Transactions T: T1 = {a,c,d}, T2 = {b,c,d}, T3 = {a,b,c,d}, T4 = {a,d}
• SE-tree of possible itemsets: { }, a, b, c, d, ab, ac, ad, bc, bd, cd, abc, abd, acd, bcd, abcd
• Order <1: right-to-left, top-to-bottom traversal of SE-tree
• Trie of transactions, and FP-tree with head table (a, b, c, d) [tree figures not reproduced]

GC-growth: Fast Simultaneous Mining of Generators & Closed Patterns

Step 1: FP-tree construction

Step 2: Right-to-left, top-to-bottom traversal

Step 5: Confirm Xi is generator
Proposition 9: Generators enjoy the apriori property. That is, every subset of a generator is also a generator

Step 7: Find closed pattern of Xi
Proposition 10: Let X be a generator. Then the closed pattern of X is ∩{X'' | X' ∈ H[last(X)], X ⊆ X', X' prefix of X'', T[X''] = true}.
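The definitions of equivalence classes, generators, closed patterns, and the plateau rep PR can be checked on the running example (T1 to T5, ms = 2) with a short brute-force sketch. This is not GC-growth: it enumerates every itemset instead of traversing an FP-tree, so it is only suitable for tiny examples. The function names `cover`, `plateau_rep`, and `support` are illustrative choices, not names from the talk.

```python
from itertools import combinations

# Running example from the "Frequent Itemsets" slide.
TRANSACTIONS = {
    "T1": frozenset("acd"),
    "T2": frozenset("bce"),
    "T3": frozenset("abcef"),
    "T4": frozenset("be"),
    "T5": frozenset("abce"),
}
ITEMS = sorted(set().union(*TRANSACTIONS.values()))


def cover(itemset):
    """Names of the transactions that contain every item of `itemset`."""
    return frozenset(t for t, items in TRANSACTIONS.items() if itemset <= items)


def plateau_rep(ms):
    """Brute-force PR as {support level i: (Li, Ri)} for all i >= ms."""
    # Itemsets with the same cover form one equivalence class [I]T.
    classes = {}
    for size in range(len(ITEMS) + 1):
        for combo in combinations(ITEMS, size):
            itemset = frozenset(combo)
            cov = cover(itemset)
            if len(cov) >= ms:
                classes.setdefault(cov, []).append(itemset)
    rep = {}
    for cov, members in classes.items():
        # Minimal members are the generators (Proposition 5).
        gens = [m for m in members if not any(o < m for o in members)]
        # The unique maximal member is the closed pattern (Proposition 4).
        closed = max(members, key=len)
        left, right = rep.setdefault(len(cov), (set(), set()))
        left.update(gens)
        right.add(closed)
    return rep


def support(itemset, rep):
    """Task 3: read sup(I,T) off the plateau borders, or None if infrequent."""
    for level, (left, right) in rep.items():
        if any(x <= itemset for x in left) and any(itemset <= y for y in right):
            return level
    return None


if __name__ == "__main__":
    pr = plateau_rep(ms=2)
    for level in sorted(pr, reverse=True):
        left, right = pr[level]
        print(level, sorted(map(sorted, left)), sorted(map(sorted, right)))
```

On this data the plateau at support 2 has generators {a,b} and {a,e} and the single closed pattern {a,b,c,e}, and `support(frozenset("ac"), pr)` is answered from the borders alone, without touching the transactions again.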
Correctness of GC-growth
• Theorem 11: GC-growth is sound and complete for mining generators and closed patterns

Performance of GC-growth
• GC-growth mines both generators and closed patterns
• But it is comparable in speed to the fastest algorithms that mine only closed patterns

Emerging Patterns

Differentiation and Contrast
[figure: EPs occur in x% of edible mushrooms and 0% of poisonous mushrooms]
Example: {odor=none, gill_size=broad, ring_number=1}: 64% (edible) vs 0% (poisonous)

Emerging Patterns
• An emerging pattern is a set of conditions
  – usually involving several features
  – that most members of a class P satisfy
  – but none or few of the other class N satisfy
⇒ I is an emerging pattern if sup(I,P) / sup(I,N) > k, for some fixed threshold k
NB: For this talk, we restrict ourselves to "jumping" emerging patterns

Convexity of Emerging Patterns
• Theorem 12: Let E be an EP space and Pi = {I ∈ E | sup(I) = i}. Then E = ∪i Pi, E is convex, and each Pi is convex. That is, E can be decomposed into convex plateaus

EP Plateau Rep
• A concise rep for E = ∪i Pi is the EP plateau rep: EP_PR = {(Li, Ri, i) | [Li, Ri] = Pi}
• Proposition 13: {(I, sup(I)) | I ∈ E} = {(I, i) | (Li, Ri, i) ∈ EP_PR, X ∈ Li, Y ∈ Ri, X ⊆ I ⊆ Y}
⇒ All 4 tasks are obviously efficient

Efficient Mining of EP_PR
• Modify GC-growth so that for each equiv class C, it outputs its support in +ve transactions Spos[C] & in -ve transactions Sneg[C]
• Then [R[C], C] are emerging patterns if Spos[C] / Sneg[C] > k
NB: Assume the threshold for EP is k

Odds Ratio Patterns

Is an emerging pattern that is absent in most of the positive transactions a "real" pattern?
[figure, as on the "Differentiation and Contrast" slide: EPs occur in x% of edible mushrooms and 0% of poisonous mushrooms]
Example: {odor=none, gill_size=broad, ring_number=1}: 64% (edible) vs 0% (poisonous)
What if this is 4%? 0.4%? 0.04%?

Odds Ratio
• Odds ratio for a (compound) factor P in a case-control study D is OR(P,D) = (P_{D,ed} / P_{D,-d}) / (P_{D,e-} / P_{D,--})
⇒ P is an odds ratio pattern if OR(P,D) > k, for some threshold k

Nonconvexity of Odds Ratio Pattern Space
• Proposition 14: Let S^OR_k(ms,D) = {P ∈ F(ms,D) | OR(P,D) ≥ k}. Then S^OR_k(ms,D) is not convex

Convexity of Odds Ratio Pattern Space Plateaus
• Theorem 15: Let S^OR_{n,k}(ms,D) = {P ∈ F(ms,D) | P_{D,ed} = n, OR(P,D) ≥ k}. Then S^OR_{n,k}(ms,D) is convex
⇒ The space of odds ratio patterns is not convex in general, but becomes convex when stratified into plateaus based on support levels
⇒ The space of odds ratio patterns can be concisely represented by plateau borders

Efficient Mining of Odds Ratio Pattern Space Plateaus
• Finding these plateaus fast is key!
• GC-growth can find these fast :-)

Performance
• FPClose* and CLOSET+
  – closed patterns only
• Our method computes
  – closed patterns
  – generators, and
  – odds ratio patterns (OR > 2.5)
⇒ Patterns that are much more statistically sophisticated than frequent patterns can now be mined efficiently

Relative Risk Patterns

Relative Risk
• Relative risk for a (compound) factor P in a prospective study D is RR(P,D), the ratio of the disease rate among subjects exposed to P to the disease rate among subjects not exposed to P [formula figure not reproduced]
⇒ P is a relative risk pattern if RR(P,D) > k, for some threshold k

Nonconvexity of Relative Risk Pattern Space
• Proposition 16: Let S^RR_k(ms,D) = {P ∈ F(ms,D) | RR(P,D) ≥ k}. Then S^RR_k(ms,D) is not convex

Convexity of Relative Risk Pattern Space Plateaus
• Theorem 17: Let S^RR_{n,k}(ms,D) = {P ∈ F(ms,D) | P_{D,ed} = n, RR(P,D) ≥ k}.
Then S^RR_{n,k}(ms,D) is convex
⇒ The space of relative risk patterns is not convex in general, but becomes convex when stratified into plateaus based on support levels
⇒ The space of relative risk patterns can be concisely represented by plateau borders

Efficient Mining of Relative Risk Pattern Space Plateaus
• Finding these plateaus fast is key!
• [algorithm listing not reproduced; surviving step: x := RR(R,D);]
• GC-growth can find these fast :-)

Concluding Remarks
• Equiv classes & plateaus are fundamental in
  – Frequent itemsets
  – Emerging patterns
  – Odds ratio patterns
  – Relative risk patterns, ...
• Equiv classes & plateaus of these complex patterns are convex spaces
⇒ Complex pattern spaces are concisely representable by borders
⇒ Complex pattern spaces can be efficiently and completely mined

Future Works

Improve Implementations
• Modular pattern mining by construction of a fast equiv class generator and multiple statistical condition filters
  – Generate borders of equiv classes & support levels
  – Test for odds ratio
  – Test for relative risk
  – Test for χ²
• Impact of item ordering
• Impact of pushing complex statistical filters deeper into equivalence class generators

Apply to Classification
• Develop classifiers based on the mined patterns
  – Simple ensemble: f(X) = Argmax_{c ∈ C} r(X), taken over rules r ∈ Rc with > 50% accuracy
  – PCL
• Impact on accuracy of using generators vs closed patterns

Enrich Data Mining Foundations
• Increase statistical sophistication of patterns mined
• Increase dimensions and size of data handled

Acknowledgements
• Haiquan Li
• Jinyan Li
• Mengling Feng
• Yap Peng Tan
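Supplementary sketch for the "Odds Ratio" and "Relative Risk" slides: the two statistics computed from the four counts P_{D,ed}, P_{D,e-}, P_{D,-d}, P_{D,--}. The odds ratio follows the formula on the slide. The relative risk formula is not preserved in this copy, so the code assumes the standard epidemiological definition (disease rate among the exposed divided by disease rate among the unexposed). The toy data, the function names, and the handling of zero denominators are all illustrative assumptions.

```python
import math


def contingency(pattern, diseased, healthy):
    """Counts (ed, e_, _d, __) for `pattern` over two lists of transactions.

    ed: exposed & diseased      e_: exposed & not diseased
    _d: unexposed & diseased    __: unexposed & not diseased
    A subject is 'exposed' when its transaction contains the whole pattern.
    """
    n_ed = sum(1 for t in diseased if pattern <= t)
    n_e0 = sum(1 for t in healthy if pattern <= t)
    return n_ed, n_e0, len(diseased) - n_ed, len(healthy) - n_e0


def _ratio(num, den):
    # Assumed convention: x/0 is +inf for x > 0, and 0/0 is NaN.
    if den == 0:
        return math.inf if num > 0 else math.nan
    return num / den


def odds_ratio(n_ed, n_e0, n_0d, n_00):
    """OR = (ed / _d) / (e_ / __), written as a single cross-product ratio."""
    return _ratio(n_ed * n_00, n_0d * n_e0)


def relative_risk(n_ed, n_e0, n_0d, n_00):
    """Standard RR = [ed / (ed + e_)] / [_d / (_d + __)] (assumed definition)."""
    return _ratio(n_ed * (n_0d + n_00), n_0d * (n_ed + n_e0))


if __name__ == "__main__":
    # Hypothetical toy study, not data from the talk.
    diseased = [frozenset("ab"), frozenset("ab"), frozenset("a"), frozenset("b")]
    healthy = [frozenset("a"), frozenset("b"), frozenset("b"), frozenset("c")]
    counts = contingency(frozenset("a"), diseased, healthy)
    print(counts, odds_ratio(*counts), relative_risk(*counts))
```

A filter such as `odds_ratio(*counts) >= k` could then be applied per equivalence class at a fixed value of P_{D,ed}, which is the stratification that Theorems 15 and 17 rely on.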