Building Global Models from Local Patterns
A.J. Knobbe

Feature continuum
- attributes, (constructed) features, patterns, classifiers, target concept

Two-phased process
- Pattern Discovery phase
  - frequent patterns
  - correlated patterns
  - interesting subgroups
  - decision boundaries
  - …
- Pattern Combination phase
  - redundancy reduction
  - dependency modeling
  - global model building
  - …
- Pattern Teams, pattern networks, global predictive models, …
- Break discovery up into two phases
- Transform a complex problem into a simpler one

Task: Subgroup Discovery
- Subgroup Discovery: find subgroups that show a substantially different distribution of the target concept
- top-down search for patterns
- inductive constraints (sometimes monotonic)
- evaluation measures: novelty, χ², information gain
- also known as rule discovery, correlated pattern mining

Novelty
- Also known as weighted relative accuracy
- Balance between coverage and unexpectedness
- nov(S,T) = p(ST) − p(S)p(T)
- between −.25 and .25; 0 means uninteresting

                 target
                 T      F
  subgroup  T   .42    .13    .55
            F   .12    .33
                .54           1.0

- nov(S,T) = p(ST) − p(S)p(T) = .42 − (.55)(.54) = .42 − .297 = .123

Demo Subgroup Discovery
- redundancy exists in the set of local patterns
[Figure: demo chart; vertical axis 0 to 500, horizontal axis 1 to 4677; axis labels not recoverable]

Pattern Combination phase
- Feature selection, redundancy reduction
  - Pattern Teams
- Dependency modeling
  - Bayesian networks
  - Association rules
- Global modeling
  - Classifiers, regression models

Pattern Teams & Pattern Networks

Pattern Teams
- Pattern Discovery typically produces very many patterns with high levels of redundancy
- Report a small informative subset with specific properties
- Promote dissimilarity of the patterns reported
- Additional value of individual patterns
- Consider extent of patterns
  - Treat patterns as binary features/items

Intuitions
- No two patterns should cover the same set of examples
- No pattern should cover the complement of another pattern
- No pattern should cover a logical combination of two or more other patterns
- Patterns should be mutually exclusive
- The pattern set should lead to the best performing classifier
- Patterns should lie on the convex hull in ROC-space

Quality measures for pattern sets
- Judge pattern sets on the basis of a quality function
- unsupervised
  - Joint Entropy (miki)
  - Exclusive Coverage
- supervised
  - Wrapper accuracy
  - Area Under Curve in ROC-space
  - Bayesian Dirichlet equivalent uniform

Pattern Teams
[Figure: two panels, "82 subgroups discovered" and "4 subgroups in pattern team"; axis labels not recoverable]

Pattern Network
- Again, treat patterns as binary features
- Bayesian networks
  - conditional independence of patterns
- Explain relationships between patterns
- Explain role of patterns in Pattern Team

Demo Pattern Team & Network
- redundancy removed to find truly diverse patterns, in this case using maximization of joint entropy

Demo Pattern Team & Network
- pattern team and related patterns can be presented in a Bayesian network
[Figure annotations: peak around 39k; peak around 16k; peak around 89k]

Properties of SD phase in PC
- What knowledge about Subgroup Discovery parameters can be exploited in Combination?
- Interestingness
  - Are interesting subgroups diverse?
  - Are interesting subgroups correlated?
- Information content
- Support of patterns

joint entropy of 2 interesting subgroups
[Figure: joint entropy (0 to 2.5 bits) against novelty (0 to 0.25). Annotations: "subgroups are relatively novel, up to 2 bits of information"; "subgroups are very novel, 1 bit of information"]

correlation of interesting subgroups
[Figure: inter novelty (−0.25 to 0.25) against novelty of subgroups (0 to 0.25). Annotations: "subgroups are very novel, and correlate"; "subgroups are novel, but potentially independent"]

Building Classifiers from Local Patterns

Combination strategies
- How to interpret a pattern set?
- Conjunctive (intersection of patterns)
- Disjunctive (union of patterns)
- Majority vote (equal weight linear separator)
- …
- Contingencies/Classifiers

Decision Table Majority (DTM)
- Treat every truth-assignment as contingency
- Classification based on conditional probability
- Use majority class for empty contingencies
- Only works with Pattern Team (else overfitting)

Support Vector Machine (SVM)
- SVM with linear kernel
- Binary data
- All dimensions have same scale
- Works with large pattern sets
- Subgroup discovery has removed XOR-like dependencies
- Interesting subgroups correlate

XOR-like dependencies
[Figure: patterns p1 and p2 over the points (0,0), (1,0), (0,1), (1,1)]

Division of labour between 2 phases
- Subgroup Discovery Phase
  - Feature selection
  - Decision boundary finding/thresholding
  - Multivariate dependencies (XOR)
- Pattern Combination Phase
  - Pattern selection
  - Combination (XOR?)
  - Class assignment

Combination-aware Subgroup Discovery
- Better global model
- Superficially uninteresting patterns can be reported
- pruning of search space (new rule-measures)
[Figure annotation: subgroups are not novel, team is optimal]

Combination-aware Subgroup Discovery
- Subgroup Discovery ++: find a set of subgroups that show a substantially different distribution of the target concept
- Considerations
  - support of pattern
  - diversity of pattern
  - …

Conclusions
- Less hasty approach to model building
- Interesting patterns serve two purposes
  - understandable knowledge
  - building blocks of global model
- Pattern discovery without combination limited
- Information exchange between phases
- Integration of two phases non-trivial
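
Backup: novelty and joint entropy, sketched in code

Two quantities from the slides lend themselves to a small worked check: the novelty measure nov(S,T) = p(ST) − p(S)p(T), and the joint entropy used to pick a pattern team. The following is a minimal sketch, not the implementation used in the demos. The function names (novelty, joint_entropy, best_team) are hypothetical, the toy pattern extents are made up for illustration, and the exhaustive search over subsets is an assumption chosen for brevity.

```python
from collections import Counter
from itertools import combinations
from math import log2


def novelty(p_st, p_s, p_t):
    """nov(S,T) = p(ST) - p(S)p(T), as defined on the Novelty slide."""
    return p_st - p_s * p_t


def joint_entropy(patterns):
    """Joint entropy in bits of binary patterns, each an equal-length 0/1 tuple."""
    n = len(patterns[0])
    # Each example is described by its tuple of pattern memberships.
    counts = Counter(zip(*patterns))
    return -sum((c / n) * log2(c / n) for c in counts.values())


def best_team(patterns, k):
    """Exhaustively pick the k patterns with maximal joint entropy.

    Hypothetical helper: exhaustive search is an assumption made here
    for illustration and only scales to small pattern sets.
    """
    return max(
        combinations(range(len(patterns)), k),
        key=lambda idx: joint_entropy([patterns[i] for i in idx]),
    )


# Worked example from the Novelty slide: p(ST) = .42, p(S) = .55, p(T) = .54.
print(round(novelty(0.42, 0.55, 0.54), 3))  # prints 0.123

# Toy extents (made up): p0 and p1 cover the same examples, p2 covers the
# complement of p0, and p3 splits the examples along a different boundary.
toy = [
    (1, 1, 1, 1, 0, 0, 0, 0),
    (1, 1, 1, 1, 0, 0, 0, 0),
    (0, 0, 0, 0, 1, 1, 1, 1),
    (1, 1, 0, 0, 1, 1, 0, 0),
]
print(best_team(toy, 2))  # prints (0, 3)
```

In the toy data, pairing p0 with its duplicate p1 or its complement p2 yields only 1 bit of joint entropy, whereas pairing it with p3 yields 2 bits. This matches the intuitions that a team should contain neither two patterns covering the same examples nor a pattern covering the complement of another.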