Efficient Statistical Learning from "Big Science" Data

Andrew Moore (Auton Lab), Andrew Connolly (U. Pittsburgh), Jeremy Kubica (Auton Lab), Ting Liu (Auton Lab)
The Auton Lab, School of Computer Science, Carnegie Mellon University, www.autonlab.org

The Auton Lab
Faculty: Andrew Moore (Prof), Jeff Schneider (Research Scientist), Artur Dubrawski (Systems Scientist)
Postdoctoral Fellows: Brigham Anderson, Alexander Gray, Paul Komarek, Dan Pelleg
Graduate Students: Brent Bryan, Kaustav Das, Khalid El-Arini, Anna Goldenberg, Jeremy Kubica, Ting Liu, Daniel Neill, Sajid Siddiqi, Purna Sarkar, Ajit Singh, Weng-Keen Wong
Head of Software Development: Jeanie Komarek
Programmers: Patrick Choi, Adam Goode, Pat Gunn, Joey Liang, John Ostlund, Robin Sabhnani, Rahul Sankathar
Executive Assistant: Kristen Schrauder
Head of Sys. Admin: Jacob Joseph
Undergraduate and Masters Interns: Kenny Daniel, Sandy Hsu, Dongryeol Lee, Jennifer Lee, Avilay Parekh, Chris Rotella, Jonathan Terleski
Recent Alumni: Drew Bagnell (RI faculty), Scott Davies (Google), David Cohn (Google), Geoff Gordon (CMU), Paul Hsiung (USC), Marina Meila (U. Washington), Remi Munos (Ecole Polytechnique), Malcolm Strens (Qinetiq)

Current Sponsors
• National Science Foundation (NSF)
• NASA
• Defense Advanced Research Projects Agency (DARPA)
• Department of Homeland Security (DHS)
• Homeland Security Advanced Research Projects Agency (HSARPA)
• United States Department of Agriculture (USDA)
• State of Pennsylvania
• Pfizer Corporation
• Caterpillar Corporation
• British Petroleum
• Psychogenics Corporation
• Transform Pharmaceuticals
• Health Canada

Collaborators: Ting Liu, Paul Hsiung, Andy Connolly, Bob Nichol, Jeff Schneider, Istvan Szapudi, Chris Genovese, Alex Szalay, Alex Gray, Brigham Anderson, Anna Goldenberg, Kan Deng, Greg Cooper, Mike Wagner, Ajit Singh, Andrew Lawson, Daniel Neill, Scott Davies, Rich Tsui, Jeremy Kubica, Sajid Siddiqi, Paul Komarek, Drew Bagnell, Weng-Keen Wong, Dan Pelleg, Larry Wasserman

Auton/SPR Deployments, 1996-2003 (timeline of projects):
• M&M Mars (line control)
• Adrenaline (NOx minimization)
• Kodak (image stabilization)
• Digital Equipment (pregnancy monitoring)
• M&M Mars (manufacturing)
• NASA/NSF (astrophysics mining)
• 3M (textile tension control)
• Caterpillar (spare parts)
• US Army (biotoxin detection)
• M&M Mars (scheduling with uncertainty)
• 3M (adhesive design)
• DigitalMC (music tastes)
• Caterpillar (emissions)
• SmartMoney (anomalies)
• Unilever (brand management)
• Phillips Petroleum (workforce optimization)
• Cellomics (screened anomaly detection)
• Biometrics company (health monitor)
• Boeing (intrusion)
• Masterfoods (new product development)
• Cellomics (proteomics screen)
• ABB (circuit-breaker supply chain)
• SwissAir (flight delays)
• 3M (secret)
• Washington Public Hospital System (ER delays)
• Unilever (targeted marketing)
• NASA (National Virtual Observatory)
• NSF (astrostatistics software)
• DARPA (national disease monitor)
• Masterfoods (biochemistry)
• Pfizer (high-throughput screen)
• Caterpillar Inc. (self-optimizing engines)
• Beverage company (ingredients/manufacturing/marketing/sales Bayes net)
• Transform Pharma (massive autonomous experiment design)
• Census Bureau (privacy protection)
• Psychogenics Inc. (effects of psychotropic drugs on rats)
• State of PA (national disease monitor [with Mike Wagner of U. Pitt])
• State of PA (anti-cancer [collaboration with CMU Biology])
• DARPA (detecting patterns in links)
• Other government departments (identifying dangerous people, potential collaborators, and aliases)
• Other government departments (detecting a class of clusters)
• Other pharma research company (life-science-specific data mining)
• United States Department of Agriculture (early warning system for food terrorism)
• NSF (biosurveillance algorithms)

Our 5 biggest applications in 2004
• Biomedical security (with Mike Wagner, University of Pittsburgh)
• Autonomous self-tweaking engines
• Drug screening
• Intelligence data
• Big astrophysics / automated science

Big Noisy Database of Links
Record #456621: "Doe and Smith were booked in the same hotel on 3/6/93."
Analyst: "There must be something useful here… but I'm drowning in noise."
Instead of raw records, give the analyst derived, objective patterns:
• "Potential collaborators of Doe: …"
• "The pattern of travel among the group consisting of (Doe, Smith, Jones, Moore) during 93-96 can't be explained by coincidence."
This is more useful, objective information.

Outline
• Cached sufficient statistics
• Kd-trees and ball trees
• K-nearest neighbor with ball trees
• Very fast non-parametric classification: skewed binary outputs, general binary outputs, multi-class outputs
• Very fast kernel-based statistics: n-point computations, clustering, non-parametric clustering (overdensity hunting)
• Active learning for anomaly hunting
• GMorph: efficient galaxy morphology fitting
• Other Auton topics

Data Analysis: The old days
Question → Answer, by scanning a table small enough to look at directly [table of records with columns Size, Ellipticity, Color: 23 / 0.96 / Red, 33 / 0.55 / Red, 36 / … / Green, …]

Data Analysis: The new days
Question → (seventeen months later…) → Answer

Cached Sufficient Statistics
Question → [cached sums: counts, sums, sums of squares and cross-products, precomputed over the data] → Answer
The expensive pass over the data happens once, up front; every later question is answered from the cached sums rather than from the raw records. A small code sketch of this idea follows.
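To make that concrete, here is a minimal sketch (an illustration for this writeup, not the Auton Lab's released code) of a kd-tree whose nodes cache sufficient statistics: each node stores its count, vector sum, and sum of outer products, so a box query can absorb an entire subtree in O(1) instead of touching each point. All class and function names are hypothetical.

```python
# Minimal sketch of cached sufficient statistics in a kd-tree (illustrative).
import numpy as np

class KDNode:
    def __init__(self, points, leaf_size=32):
        self.lo = points.min(axis=0)           # bounding-box corners
        self.hi = points.max(axis=0)
        self.count = len(points)               # cached sufficient statistics:
        self.sum = points.sum(axis=0)          #   N, sum of x, sum of x x^T
        self.sumsq = points.T @ points
        if len(points) <= leaf_size:
            self.points, self.left, self.right = points, None, None
        else:
            d = np.argmax(self.hi - self.lo)   # split on the widest dimension
            order = points[:, d].argsort()
            mid = len(points) // 2
            self.points = None
            self.left = KDNode(points[order[:mid]], leaf_size)
            self.right = KDNode(points[order[mid:]], leaf_size)

def box_stats(node, qlo, qhi):
    """Count and vector sum of the points inside the box [qlo, qhi]."""
    if np.any(node.hi < qlo) or np.any(node.lo > qhi):
        return 0, 0.0                          # prune: node entirely outside
    if np.all(qlo <= node.lo) and np.all(node.hi <= qhi):
        return node.count, node.sum            # O(1): node entirely inside
    if node.points is not None:                # leaf: check points one by one
        inside = np.all((node.points >= qlo) & (node.points <= qhi), axis=1)
        return int(inside.sum()), node.points[inside].sum(axis=0)
    cl, sl = box_stats(node.left, qlo, qhi)
    cr, sr = box_stats(node.right, qlo, qhi)
    return cl + cr, sl + sr

# Example: the mean of the points inside a query box, without a full scan.
pts = np.random.rand(100000, 2)
cnt, s = box_stats(KDNode(pts), np.array([0.2, 0.2]), np.array([0.4, 0.5]))
mean_in_box = s / max(cnt, 1)
```

Only nodes straddling the box boundary are opened; everything wholly inside or outside is settled from the cached statistics.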
A family tree of cached-statistic methods (the talk builds this diagram up layer by layer, crediting who added each piece):
• Kd-trees (Friedman, Bentley and Finkel, 1977); BSPs (as used in Doom); fast LOESS (Cleveland and Devlin, 1988)
• Variable-resolution reinforcement learning (Simons, Van Brussel, De Schutter and Verhaert, 1982; Chapman and Kaelbling, 1991)
• Multipole methods (Barnes and Hut, 1986; Greengard); WSPDs (Callahan and Kosaraju, 1993)
• Cached counts (Kan Deng)
• K-means, single kernel regressions, single kernel density estimations (Dan Pelleg)
• Cached sufficient statistics: Mahalanobis maximization, fast LWR, mixture log-likelihoods, mixture EM steps (Moore, 1998)
• Pipelined kernel operations, dual-tree cached stats, 2-point correlation, multi-tree n-point correlation (Alex Gray)
• Non-parametric cluster / clutter / cluster counts (Weng-Keen Wong)
• Ball trees (arriving via NIPS) and metric trees (arriving via the DBMS literature)
• High-dimensional versions (Ting Liu)
• Massive-scale multiple-object tracking (Jeremy Kubica)

Structure of kd-trees
In memory, every node carries summary statistics. For example:
Root: "I contain 300 points and have center of mass (55, 52) and covariance [2308, -45; -45, 1965]."
Its children: "I contain 130 points and have center of mass (23, 49) and covariance [754, 124; 124, 1965]." and "I contain 170 points and have center of mass (73, 58) and covariance [654, 24; 24, 1885]."
<more levels here>

Range Search and Range Count
[animation over many frames: the query range descends the tree; nodes wholly outside the range are pruned, nodes wholly inside contribute their cached counts in one step, and only the boundary nodes are opened]

Ball Trees (= Metric Trees)
[animation: a set of points in a metric space; the ball-tree root node; successively deeper levels of nested balls]
• J. Uhlmann, 1991
• S. Omohundro, NIPS 1991

Ball-trees: properties
Let Q be any query point and let x be a point inside ball B. Then:
|x - Q| ≥ |Q - B.center| - B.radius
|x - Q| ≤ |Q - B.center| + B.radius
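These two bounds are the engine behind every search in the rest of the talk. A minimal sketch of them as code (illustrative names, assuming Euclidean distance):

```python
# The ball-tree bounds above, both direct consequences of the triangle
# inequality applied to Q, B.center, and a point x inside the ball.
import numpy as np

def min_dist(q, center, radius):
    """Lower bound on |x - q| for any point x inside the ball."""
    return max(np.linalg.norm(q - center) - radius, 0.0)

def max_dist(q, center, radius):
    """Upper bound on |x - q| for any point x inside the ball."""
    return np.linalg.norm(q - center) + radius

def can_prune(q, center, radius, kth_best_dist):
    """A ball cannot supply a new nearest neighbor if even its nearest
    possible point is no closer than the current k-th best distance."""
    return min_dist(q, center, radius) >= kth_best_dist
```

The can_prune test is exactly the decision made at each ball in the walkthroughs that follow.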
K-nearest neighbor with ball trees
Goal: find the 2 nearest neighbors of Q.
[animation: the search descends the tree toward Q, frame by frame]
We've hit a leaf node, so we explicitly look at the points in the node.
The two nearest neighbors found so far are in pink (remember we have yet to search the blue balls).
[remaining frames: each blue ball is either pruned, because even its nearest possible point is farther than the current 2nd-best distance, or searched, with the pink pair updated whenever a closer point is found]
A code sketch of this search follows.
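A minimal sketch of the search just animated, assuming a simple Ball node (center, radius, points at leaves, two children otherwise). This illustrates the classical ball-tree k-NN recursion; it is not the Lab's KNS1 implementation, and all names are hypothetical.

```python
# Depth-first ball-tree k-NN: keep the k best distances in a max-heap and
# prune any ball whose lower-bound distance exceeds the current k-th best.
import heapq
import numpy as np

class Ball:
    """A ball-tree node: leaf balls hold points; internal balls hold children."""
    def __init__(self, center, radius, points=None, left=None, right=None):
        self.center, self.radius = np.asarray(center), radius
        self.points, self.left, self.right = points, left, right

def knn_search(node, q, k, heap=None):
    heap = [] if heap is None else heap            # entries are (-dist, point)
    kth = -heap[0][0] if len(heap) == k else np.inf
    if max(np.linalg.norm(q - node.center) - node.radius, 0.0) >= kth:
        return heap                                # prune: can't beat k-th best
    if node.points is not None:                    # leaf: examine every point
        for x in node.points:
            d = np.linalg.norm(q - x)
            if len(heap) < k:
                heapq.heappush(heap, (-d, tuple(x)))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, tuple(x)))
        return heap
    children = sorted([node.left, node.right],     # visit the nearer ball first
                      key=lambda c: np.linalg.norm(q - c.center))
    for child in children:
        heap = knn_search(child, q, k, heap)
    return heap

# Usage: knn_search(root, q, k=2) returns a list of (-distance, point)
# pairs for the k nearest neighbors found.
```

Visiting the nearer child first shrinks the k-th best distance quickly, which is what makes the later prunes fire.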
Very fast non-parametric classification: skewed binary outputs

KNS2
• Assume binary output.
• Assume the positive class is much less frequent than the negative class.
• Assume we want more than a "positive/negative" prediction: we want to know exactly how many of the k-NN are from the +ve class.
KNS2 does this without finding the k-NN.

Setup: assume we have a set of data points. Some are +ve points; the vast majority are -ve points.
Preprocessing: put all your +ve points in a small ball tree, and all your -ve points in a separate large ball tree.
Goal: find out how many of the 5 nearest neighbors of Q are positive.

Step one: find the five nearest +ve points using KNS1. (We're assuming there are far fewer +ves than -ves, so this is not the dominant cost.)

Step two: search the ball tree of -ve points starting at the root, maintaining a counter per interval. By the end of the search the first counter will contain the number of negative points closer to the query than the closest +ve point, and the i-th counter will contain the number of negative points whose distance to the query lies between the distances of the (i-1)-th and i-th closest +ve points. Crucially, a counter only needs to be exact while it is still relevant to the 5-NN query, so whole balls can be binned or discarded without being opened.

Walkthrough (condensed from the animation frames):
• The maximum possible distance from Q to any -ve point in a ball is compared with the distance from Q to the 5th closest +ve point, then the 4th: the ball's eight points can be accounted for in bulk, without opening the ball.
• Against the distance to the 3rd closest +ve point: no prune! The ball straddles an interval boundary, so we recurse into it.
• A child ball (six points) is compared with the distance to the 2nd-closest +ve point: no prune; recurse again.
• The ball is a leaf, so we explore its points: "I contain exactly one point closer than the 1st closest +ve point" (says the ball). Counter: 1. Return and try the other sibling.
• "I can't possibly have any interesting points": the remaining siblings are pruned.
• We're done. There's one -ve point closer than the closest +ve point, and there are more than 3 -ve points closer than the 2nd closest +ve point.
=> Exactly 1 of the 5 nearest neighbors is +ve.
[figures: the balls visited during the search; another worked example]
A code sketch of this counting search follows.
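A minimal sketch of the counting search just walked through, simplified relative to the real KNS2. It assumes ball-tree nodes like those in the k-NN sketch, extended with a cached node.count (subtree size), and that pos_d holds the sorted distances from Q to the k nearest +ve points found in step one; all names are illustrative.

```python
# KNS2-style step two (simplified sketch): bin negative points by which
# positive-neighbor distances bracket them, opening as few balls as possible.
import bisect
import numpy as np

def count_negatives(node, q, pos_d, bins):
    """Fill bins[i] with the number of -ve points whose distance to q falls
    below pos_d[i] but not below pos_d[i-1] (bins[0]: closer than the
    closest +ve point). Assumes node.count caches the subtree size."""
    c = np.linalg.norm(q - node.center)
    lo, hi = max(c - node.radius, 0.0), c + node.radius
    if lo >= pos_d[-1]:
        return                                 # prune: beyond the k-th +ve
    b_lo = bisect.bisect_right(pos_d, lo)      # bin of nearest possible point
    b_hi = bisect.bisect_right(pos_d, hi)      # bin of farthest possible point
    if b_lo == b_hi:
        bins[b_lo] += node.count               # whole ball in one bin: bulk add
    elif node.points is not None:              # leaf: bin each point directly
        for x in node.points:
            b = bisect.bisect_right(pos_d, np.linalg.norm(q - x))
            if b < len(bins):
                bins[b] += 1
    else:
        count_negatives(node.left, q, pos_d, bins)
        count_negatives(node.right, q, pos_d, bins)

def positives_in_knn(pos_d, bins, k=5):
    """The j-th closest +ve is inside the k-NN iff (number of -ve points
    closer than it) + j <= k; count how many j satisfy this."""
    neg_closer = 0
    for j in range(1, len(pos_d) + 1):
        neg_closer += bins[j - 1]
        if neg_closer + j > k:
            return j - 1
    return len(pos_d)

# Usage: bins = [0] * len(pos_d); count_negatives(neg_root, q, pos_d, bins)
#        answer = positives_in_knn(pos_d, bins, k=5)
```

On the worked example above this returns 1: one -ve point precedes the 1st +ve point, and more than three precede the 2nd, which pushes the 2nd and every later +ve point out of the 5-NN.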
Experimental results
[bar charts: number of distance computations, "Speedup for k-NN" (scale 0-100), and wall-clock-time speedup for k-NN (scale 0-70), comparing Naive, KNS1, KNS2 and KNS3 on the datasets ds1, ds2, J_Lee, Blanc, letter and Movie]

Very fast kernel-based statistics: clustering

Data Structures for Fast K-means
The Auton Lab, Carnegie Mellon University, www.autonlab.org
[figures: some bio assay data; GMM clustering of the assay data; the resulting density estimator]
Three classes of assay, each learned with its own mixture model. (Sorry, this will again be semi-useless in black and white.)
[figures: the resulting Bayes classifier; the same classifier using posterior probabilities to alert about ambiguity and anomalousness: yellow means anomalous, cyan means ambiguous]
[Moore, 1999], [Pelleg and Moore, 2002]

Detecting overdensities
[figures: point sets with visibly overdense regions]

Finding the overdense regions
• Many possible approaches. Examples: Dasgupta and Raftery, 1998; Byers and Raftery, 1998; Cuevas et al., 2000; Reichart et al., 1999.
• All share the same computational primitives:
• Step one: identify the high-density points.
• Step two: delete the rest.
• Step three: find connected components.
CFF assumes kernel density estimation, which can be done efficiently with a 2-point-style search. CFF also assumes A and B are in the same component if there's a path between them with all small steps, which can be done efficiently with a 2-point-style search plus an extra trick. A code sketch of the three steps follows.
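A minimal sketch of the three steps. It is deliberately brute force exactly where CFF is clever: the kernel-density pass and the small-step linking are the pieces CFF accelerates with 2-point-style tree search. Function names, the bandwidth, the density threshold, and the step size are all illustrative.

```python
# Overdensity hunting in three steps: KDE score, threshold, then connected
# components under "paths of small steps" (single linkage via flood fill).
import numpy as np
from scipy.spatial import cKDTree

def overdense_regions(points, bandwidth, density_threshold, step):
    # Step one: Gaussian kernel density estimate at every point.
    # (O(n^2) here for clarity; CFF does this with fast tree search.)
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    density = np.exp(-sq / (2 * bandwidth ** 2)).sum(axis=1)
    # Step two: delete everything below the density threshold.
    keep = points[density >= density_threshold]
    # Step three: connected components, where two kept points share a
    # component if a chain of hops of length <= step links them.
    tree = cKDTree(keep)
    labels = -np.ones(len(keep), dtype=int)
    comp = 0
    for i in range(len(keep)):
        if labels[i] != -1:
            continue
        stack = [i]
        labels[i] = comp
        while stack:
            j = stack.pop()
            for nb in tree.query_ball_point(keep[j], step):
                if labels[nb] == -1:
                    labels[nb] = comp
                    stack.append(nb)
        comp += 1
    return keep, labels
```

The flood fill realizes the "path between them with all small steps" definition above: components grow one hop of length at most step at a time.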
component if there’s a path between them with all small steps Can be done efficiently with 2-point-style search plus an extra trick A C B • Step One: Identify the high density points • Step Two: Delete the rest • Step Three: Find connected components Results 4-dimensional Sloan Astrophysics color-space data [Wong and Moore, 2002] Results 4-dimensional Sloan Astrophysics color-space data Cached Sufficient Statistics Ball Trees (= Metric Trees) K-nearest neighbor with ball trees Very fast non-parametric classification skewed binary outputs General binary outputs multi-classed outputs Very fast kernel-based statistics n-point computations clustering non-parametric clustering (overdensity hunting) Active learning for anomaly hunting GMorph: Efficient Galaxy morphology fitting Other Auton topics Active Learning of Anomalies Set of Records Ask expert to classify Spot “important” records Build Model from data and labels Run all data through model Dan Pelleg [Pelleg, Moore and Connolly 2004] Anomaly GUI Anomaly Performance Cached Sufficient Statistics Ball Trees (= Metric Trees) K-nearest neighbor with ball trees Very fast non-parametric classification skewed binary outputs General binary outputs multi-classed outputs Very fast kernel-based statistics n-point computations clustering non-parametric clustering (overdensity hunting) Active learning for anomaly hunting GMorph: Efficient Galaxy morphology fitting Other Auton topics GMorph: Fast Galaxy Morphology • How do you perform 107 large nonlinear optimizations in practical time? • How do you avoid local optima • Idea: Pre-cache a “library” of solutions. Use efficient nearest neighbor to match new problems to library as seeds. • Early tests bring galaxy morphology fits down from minutes to sub-seconds Brigham Anderson [Anderson, Bernardi, Connolly, Moore, Nichol, 2004] Cached Sufficient Statistics Ball Trees (= Metric Trees) K-nearest neighbor with ball trees Very fast non-parametric classification skewed binary outputs General binary outputs multi-classed outputs Very fast kernel-based statistics n-point computations clustering non-parametric clustering (overdensity hunting) Active learning for anomaly hunting GMorph: Efficient Galaxy morphology fitting Other Auton topics Other Relevant Auton Topics Bayesian Networks G H S D C W [Moore and Lee, 1998], [Moore and Wong, 2003] Other Relevant Auton Topics Bayesian Networks “What’s strange about recent events?” G H S Weng-Keen Wong D C W [Wong, Moore, Cooper and Wagner 2003] Other Relevant Auton Topics Bayesian Networks “What’s strange about recent events?” G H Scan Statistics Spatial S D C Daniel Neill W [Neill, Moore and Wagner, 2004] Other Relevant Auton Topics Bayesian Networks “What’s strange about recent events?” G H Scan Statistics Spatial S Massively multiple target tracking D Jeremy Kubica W C Conclusions • Geometry can help tractability of Massive Statistical Data Analysis • Cached sufficient statistics are one approach • Papers, tutorials, software, examples: www.autonlab.org