A General Framework for Mining Massive Data Streams
Geoff Hulten
Advised by Pedro Domingos

Mining Massive Data Streams
• High-speed data streams abundant
  – Large retailers
  – Long distance & cellular phone call records
  – Scientific projects
  – Large Web sites
• Build model of the process creating data
• Use model to interact more efficiently

Growing Mismatch Between Algorithms and Data
• State of the art data mining algorithms
  – One shot learning
  – Work with static databases
  – Maximum of 1 million – 10 million records
• Properties of Data Streams
  – Data stream exists over months or years
  – 10s – 100s of millions of new records per day
  – Process generating data changing over time

The Cost of This Mismatch
• Fraction of data we can effectively mine shrinking towards zero
• Models learned from heuristically selected samples of data
• Models out of date before being deployed

Need New Algorithms
• Monitor a data stream and have a model available at all times
• Improve the model as data arrives
• Adapt the model as process generating data changes
• Have quality guarantees
• Work within strict resource constraints

Solution: General Framework
• Applicable to algorithms based on discrete search
• Semi-automatically converts algorithm to meet our design needs
• Uses sampling to select data size for each search step
• Extensions to continuous searches and relational data

Outline
• Introduction
• Scaling up Decision Trees
• Our Framework for Scaling
• Other Applications and Results
• Conclusion

Decision Trees
• Examples: $\langle x_1, \ldots, x_D, y \rangle$
• Encode: $y = F(x_1, \ldots, x_D)$
• Nodes contain tests
• Leaves contain predictions
[Figure: example decision tree with a Gender? test (Male / Female), an Age? test (< 25 / >= 25), and leaves labeled True or False]

Decision Tree Induction
DecisionTree(Data D, Tree T, Attributes A)
  If D is pure
    Let T be a leaf predicting class in D
    Return
  Let X be best of A according to D and G()
  Let T be a node that splits on X
  For each value V of X
    Let D^ be the portion of D with V for X
    Let T^ be the child of T for V
    DecisionTree(D^, T^, A – X)

VFDT (Very Fast Decision Tree)
• To pick the split attribute for a node, looking at a few examples may be sufficient
• Given a stream of examples:
  – Use the first ones to pick the split at the root
  – Sort succeeding ones to the leaves
  – Pick best attribute there
  – Continue…
• Leaves predict most common class
• Very fast, incremental, any time decision tree induction algorithm

How Much Data?
• Make sure best attribute is better than second
  – That is: $G(X_1) - G(X_2) > 0$
• Using a sample so need Hoeffding bound
  – Collect data till: $G(X_1) - G(X_2) > \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$

Core VFDT Algorithm
Procedure VFDT(Stream, δ)
  Let T = Tree with single leaf (root)
  Initialize sufficient statistics at root
  For each example (X, y) in Stream
    Sort (X, y) to leaf using T
    Update sufficient statistics at leaf
    Compute G for each attribute
    If G(best) – G(2nd best) > ε, then
      Split leaf on best attribute
      For each branch
        Start new leaf, init sufficient statistics
  Return T
[Figure: example tree with a root test x1? (male / female), a leaf y=0, and a second test x2? (> 65 / <= 65) with leaves y=0 and y=1]

Quality of Trees from VFDT
• Model may contain incorrect splits, useful?
• Bound the difference with infinite data tree
  – Chance an arbitrary example takes different path
• Intuition: example on level i of tree has i chances to go through a mistaken node
  – $\Delta(DT_\infty, DT_{HT}) \le \delta / p$ (p: leaf probability)

Complete VFDT System
• Memory management
  – Memory dominated by sufficient statistics
  – Deactivate less promising leaves when needed
• Ties: $\Delta G < \epsilon < \tau$
  – Wasteful to decide between identical attributes
• Check for splits periodically
• Pre-pruning
  – Only make splits that improve the value of G(.)
• Early stop on bad attributes

VFDT (Continued)
• Bootstrap with traditional learner
• Rescan dataset when time available
• Time changing data streams
• Post pruning
• Continuous attributes
• Batch mode

Experiments
• Compared VFDT and C4.5 (Quinlan, 1993)
• Same memory limit for both (40 MB)
  – 100k examples for C4.5
• VFDT settings: δ = 10^-7, τ = 5%
• Domains: 2 classes, 100 binary attributes
• Fifteen synthetic trees, 2.2k – 500k leaves
• Noise from 0% to 30%

Running Times
• Pentium III at 500 MHz running Linux
• C4.5 takes 35 seconds to read and process 100k examples; VFDT takes 47 seconds
• VFDT takes 6377 seconds for 20 million examples: 5752 s to read, 625 s to process
• VFDT processes 32k examples per second (excluding I/O)

Real World Data Sets: Trace of UW Web requests
• Stream of Web page requests from UW
• One week: 23k clients, 170 orgs., 244k hosts, 82.8M requests (peak: 17k/min), 20GB
• Goal: improve cache by predicting requests
• 1.6M examples, 61% default class
• C4.5 on 75k exs, 2975 secs.
  – 73.3% accuracy
• VFDT ~3000 secs., 74.3% accurate

Outline
• Introduction
• Scaling up Decision Trees
• Our Framework for Scaling
• Overview of Applications and Results
• Conclusion

Data Mining as Discrete Search
• Initial state
  – Empty
  – Prior
  – Random
• Search operators
  – Refine structure
• Evaluation function
  – Likelihood
  – Many others
• Goal state
  – Local optimum, etc.

Data Mining As Search
[Figure: successive search steps, each scoring candidate states against the training data; example scores 1.8, 1.7, 1.5, 1.9, 1.9, 2.0]

Example: Decision Tree
• Initial state
  – Root node
• Search operators
  – Turn any leaf into a test on attribute
• Evaluation
  – Entropy reduction, with entropy $-\sum_{i \in val(y)} p_i \lg p_i$
• Goal state
  – No further gain
  – Post prune
[Figure: candidate trees testing X1? ... Xd?, each scored against the training data; example scores 1.7 and 1.5]
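
The split test from the "How Much Data?" and "Core VFDT Algorithm" slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the VFDT implementation: the function names (`entropy`, `entropy_reduction`, `hoeffding_bound`, `should_split`) and the leaf statistics are hypothetical, G is taken to be entropy reduction, and R is assumed to be 1 (lg of the number of classes, for a two-class problem). The values δ = 10^-7 and τ = 5% come from the Experiments slide.

```python
import math

def entropy(class_counts):
    """Entropy -sum_i p_i lg p_i of a vector of class counts, in bits."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def entropy_reduction(counts_by_value):
    """G for one attribute: parent entropy minus weighted child entropy.

    counts_by_value maps each attribute value to the class counts seen
    at the leaf for that value (assumes at least one example was seen).
    """
    branches = list(counts_by_value.values())
    parent = [sum(col) for col in zip(*branches)]
    n = sum(parent)
    remainder = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(parent) - remainder

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(gains, value_range, delta, n, tau):
    """Split if G(best) - G(2nd best) > epsilon, or on a tie (epsilon < tau)."""
    ranked = sorted(gains.values(), reverse=True)
    best, second = ranked[0], ranked[1]
    eps = hoeffding_bound(value_range, delta, n)
    return (best - second > eps) or (eps < tau)

# Hypothetical sufficient statistics at one leaf:
# attribute -> attribute value -> [count of y=0, count of y=1]
leaf_stats = {
    "x1": {"male": [400, 100], "female": [100, 400]},
    "x2": {"> 65": [250, 250], "<= 65": [250, 250]},
}
gains = {name: entropy_reduction(counts) for name, counts in leaf_stats.items()}
print(should_split(gains, value_range=1.0, delta=1e-7, n=1000, tau=0.05))
```

With the same class proportions but only 100 examples, ε is larger than the gap between the two attributes, so the sketch would keep collecting data instead of splitting.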
Overview of Framework
• Cast the learning algorithm as a search
• Begin monitoring data stream
  – Use each example to update sufficient statistics where appropriate (then discard it)
  – Periodically pause and use statistical tests
    • Take steps that can be made with high confidence
  – Monitor old search decisions
    • Change them when data stream changes

How Much Data is Enough?
[Figure: scored against the full training data, candidate X1? gets 1.65 and candidate Xd? gets 1.38]

How Much Data is Enough?
• Use statistical bounds
  – Normal distribution
  – Hoeffding bound: $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
• Applies to scores that are averages over examples
• Can select a winner if
  – Score1 > Score2 + ε
[Figure: scored against a sample of data, candidate X1? gets 1.6 +/- ε and candidate Xd? gets 1.4 +/- ε]

Global Quality Guarantee
$\delta^* = \delta b d c$
• δ – probability of error in single decision
• b – branching factor of search
• d – depth of search
• c – number of checks for winner

Identical States And Ties
• Fails if states are identical (or nearly so)
• τ – user supplied tie parameter
• Select winner early if alternatives differ by less than τ
  – Score1 > Score2 + ε, or
  – ε <= τ

Dealing with Time Changing Concepts
• Maintain a window of the most recent examples
• Keep model up to date with this window
• Effective when window size similar to concept drift rate
• Traditional approach
  – Periodically reapply learner
  – Very inefficient!
• Our approach
  – Monitor quality of old decisions as window shifts
  – Correct decisions in fine-grained manner

Alternate Searches
• When new test looks better, grow alternate sub-tree
• Replace the old when new is more accurate
• This smoothly adjusts to changing concepts
[Figure: tree with tests Gender?, Pets?, Hair?, and College?, and leaves labeled true or false]

RAM Limitations
• Each search requires sufficient statistics structure
• Decision Tree
  – O(avc) RAM
• Bayesian Network
  – O(c^p) RAM

RAM Limitations
[Figure: search states marked Active or Temporarily inactive]

Outline
• Introduction
• Data Mining as Discrete Search
• Our Framework for Scaling
• Application to Decision Trees
• Other Applications and Results
• Conclusion

Applications
• VFDT (KDD ’00) – Decision Trees
• CVFDT (KDD ’01) – VFDT + concept drift
• VFBN & VFBN2 (KDD ’02) – Bayesian Networks
• Continuous Searches
  – VFKM (ICML ’01) – K-Means clustering
  – VFEM (NIPS ’01) – EM for mixtures of Gaussians
• Relational Data Sets
  – VFREL (Submitted) – Feature selection in relational data

CVFDT Experiments
[Figure]

Activity Profile for VFBN
[Figure]

Other Real World Data Sets
• Trace of all web requests from UW campus
  – Use clustering to find good locations for proxy caches
• KDD Cup 2000 Data set
  – 700k page requests from an e-commerce site
  – Categorize pages into 65 categories, predict which a session will visit
• UW CSE Data set
  – 8 Million sessions over two years
  – Predict which of 80 level 2 directories each visits
• Web Crawl of .edu sites
  – Two data sets each with two million web pages
  – Use relational structure to predict which will increase in popularity over time

Related Work
• DB Mine: A Performance Perspective (Agrawal, Imielinski, Swami ’93)
  – Framework for scaling rule learning
• RainForest (Gehrke, Ramakrishnan, Ganti ’98)
  – Framework for scaling decision trees
• ADtrees (Moore, Lee ’97)
  – Accelerate computing sufficient stats
• PALO (Greiner ’92)
  – Accelerate hill climbing search via sampling
• DEMON (Ganti, Gehrke, Ramakrishnan ’00)
  – Framework for converting incremental algs. for time changing data streams

Future Work
• Combine framework for discrete search with frameworks for continuous search and relational learning
• Further study time changing processes
• Develop a language for specifying data stream learning algorithms
• Use framework to develop novel algorithms for massive data streams
• Apply algorithms to more real-world problems

Conclusion
• Framework helps scale up learning algorithms based on discrete search
• Resulting algorithms:
  – Work on databases and data streams
  – Work with limited resources
  – Adapt to time changing concepts
  – Learn in time proportional to concept complexity
    • Independent of amount of training data!
• Benefits have been demonstrated in a series of applications
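
As a closing illustration, the framework's winner test and global quality guarantee might be sketched as follows. This is only a sketch under stated assumptions, not the authors' code: `per_decision_delta`, `RunningMean`, and `select_winner` are hypothetical names, the per-decision δ is obtained by inverting δ* = δbdc, scores are assumed to lie in a range of R = 1, and the two candidates receive made-up constant scores so the behavior is easy to follow.

```python
import math

def per_decision_delta(delta_star, b, d, c):
    """Invert delta* = delta * b * d * c to get the per-decision error budget."""
    return delta_star / (b * d * c)

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

class RunningMean:
    """Sufficient statistics for a score that is an average over examples."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, value):
        self.n += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.n if self.n else 0.0

def select_winner(candidates, value_range, delta, tau):
    """Return the best candidate's name, or None if more data is needed.

    A winner is declared if Score1 > Score2 + epsilon, or if epsilon <= tau
    (the tie rule). Assumes at least two candidates.
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1].mean, reverse=True)
    (best_name, best), (_, second) = ranked[0], ranked[1]
    n = min(score.n for score in candidates.values())
    if n == 0:
        return None
    eps = hoeffding_bound(value_range, delta, n)
    if best.mean > second.mean + eps or eps <= tau:
        return best_name
    return None

# Hypothetical search step: two candidate states, checked every 50 examples.
delta = per_decision_delta(0.01, b=10, d=5, c=20)
candidates = {"A": RunningMean(), "B": RunningMean()}
decided_at = None
for i in range(1, 1001):
    candidates["A"].update(0.6)  # made-up per-example scores
    candidates["B"].update(0.4)
    if i % 50 == 0 and select_winner(candidates, 1.0, delta, tau=0.05) == "A":
        decided_at = i
        break
print(decided_at)
```

Checking only every 50 examples mirrors the "periodically pause and use statistical tests" step; the number of examples consumed before a decision depends on how far apart the candidates are, not on the length of the stream.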