A General Framework for Mining Massive Data Streams

Geoff Hulten
Advised by Pedro Domingos
Mining Massive Data Streams
• High-speed data streams abundant
– Large retailers
– Long distance & cellular phone call records
– Scientific projects
– Large Web sites
• Build model of the process creating data
• Use model to interact more efficiently
Growing Mismatch Between
Algorithms and Data
• State of the art data mining algorithms
– One shot learning
– Work with static databases
– Maximum of 1 million – 10 million records
• Properties of Data Streams
– Data stream exists over months or years
– 10s – 100s of millions of new records per day
– Process generating data changing over time
The Cost of This Mismatch
• Fraction of data we can effectively mine
shrinking towards zero
• Models learned from heuristically selected
samples of data
• Models out of date before being deployed
Need New Algorithms
• Monitor a data stream and have a model
available at all times
• Improve the model as data arrives
• Adapt the model as process generating
data changes
• Have quality guarantees
• Work within strict resource constraints
Solution: General Framework
• Applicable to algorithms based on discrete
search
• Semi-automatically converts algorithm to
meet our design needs
• Uses sampling to select data size for each
search step
• Extensions to continuous searches and
relational data
Outline
• Introduction
• Scaling up Decision Trees
• Our Framework for Scaling
• Other Applications and Results
• Conclusion
Decision Trees
• Examples: ⟨x1, …, xD, y⟩
• Encode: y = F(x1, …, xD)
• Nodes contain tests
• Leaves contain predictions
[Figure: example decision tree with a Gender? test (Male/Female), an Age? test (< 25 / >= 25), and leaves predicting True or False]
Decision Tree Induction
DecisionTree(Data D, Tree T, Attributes A)
  If D is pure
    Let T be a leaf predicting the class in D
    Return
  Let X be best of A according to D and G()
  Let T be a node that splits on X
  For each value V of X
    Let D_V be the portion of D with V for X
    Let T_V be the child of T for V
    DecisionTree(D_V, T_V, A – {X})
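To make the pseudocode above concrete, here is a minimal runnable Python sketch of batch induction using entropy reduction as G(); the names (entropy, gain, build_tree) and the toy dataset are illustrative, not part of the VFDT code.

# Minimal sketch of batch decision-tree induction (illustrative names).
from collections import Counter
from math import log2

def entropy(labels):
    """H(y) = -sum_i p_i lg p_i over the class distribution of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, labels, attr):
    """Entropy reduction G() obtained by splitting on attribute attr."""
    by_value = {}
    for x, y in zip(examples, labels):
        by_value.setdefault(x[attr], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

def build_tree(examples, labels, attributes):
    """Returns a class label (leaf) or (attribute, {value: subtree}) (internal node)."""
    if len(set(labels)) == 1 or not attributes:          # D is pure, or no tests left
        return Counter(labels).most_common(1)[0][0]      # leaf: predict majority class
    best = max(attributes, key=lambda a: gain(examples, labels, a))
    children = {}
    for v in set(x[best] for x in examples):
        subset = [(x, y) for x, y in zip(examples, labels) if x[best] == v]
        xs, ys = [x for x, _ in subset], [y for _, y in subset]
        children[v] = build_tree(xs, ys, attributes - {best})
    return (best, children)

# Example usage on a toy dataset (attribute dicts and boolean labels):
data = [{"gender": "m", "age": "<25"}, {"gender": "f", "age": ">=25"},
        {"gender": "f", "age": "<25"}, {"gender": "m", "age": ">=25"}]
target = [False, True, False, False]
print(build_tree(data, target, {"gender", "age"}))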
VFDT (Very Fast Decision Tree)
• To pick the split attribute for a node, looking at a few examples may be sufficient
• Given a stream of examples:
  – Use the first ones to pick the split at the root
  – Sort succeeding ones to the leaves
  – Pick the best attribute there
  – Continue…
• Leaves predict the most common class
• A very fast, incremental, anytime decision tree induction algorithm
How Much Data?
• Make sure the best attribute is really better than the second best
  – That is: ΔG = G(X1) – G(X2) > 0
• We only see a sample, so we need the Hoeffding bound
  – Collect data until: ΔG = G(X1) – G(X2) > ε, where ε = sqrt( R² ln(1/δ) / (2n) )
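A small Python sketch of this test, assuming the score G is an average over examples with range R (e.g. R = lg(#classes) for information gain); hoeffding_epsilon and should_split are illustrative names, not the VFDT source.

# Sketch of the Hoeffding-bound split test.
from math import log, sqrt

def hoeffding_epsilon(R, delta, n):
    """epsilon = sqrt(R^2 ln(1/delta) / (2n)) for an average of n observations with range R."""
    return sqrt(R * R * log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, R, delta, n):
    """Split when the observed gap exceeds epsilon, so the best attribute
    is truly best with probability at least 1 - delta."""
    return (g_best - g_second) > hoeffding_epsilon(R, delta, n)

# Example: information gain on a 2-class problem (R = lg 2 = 1), delta = 1e-7.
print(hoeffding_epsilon(R=1.0, delta=1e-7, n=10_000))        # ~0.028
print(should_split(0.30, 0.25, R=1.0, delta=1e-7, n=10_000)) # True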
Core VFDT Algorithm
Procedure VFDT(Stream, δ)
  Let T = tree with a single leaf (the root)
  Initialize sufficient statistics at the root
  For each example (X, y) in Stream
    Sort (X, y) to a leaf using T
    Update the sufficient statistics at that leaf
    Compute G for each attribute
    If G(best) – G(2nd best) > ε, then
      Split the leaf on the best attribute
      For each branch
        Start a new leaf and initialize its sufficient statistics
  Return T
[Figure: tree grown by VFDT – a root test x1? (male/female), one branch a leaf y=0, the other a test x2? (> 65 / <= 65) with leaves y=0 and y=1]
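A compact Python sketch of this loop, under simplifying assumptions (binary 0/1 attributes, information gain, majority-class leaves) and without the memory, tie, and scheduling refinements on the later slides; all class and function names here are illustrative.

# Simplified sketch of the core VFDT loop for binary attributes.
from collections import defaultdict
from math import log, log2, sqrt

def entropy(counts):
    """Entropy of a class-count dictionary."""
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values() if c) if n else 0.0

class Leaf:
    def __init__(self, attributes):
        self.attributes = set(attributes)
        self.class_counts = defaultdict(int)                  # sufficient statistics
        self.stats = defaultdict(lambda: defaultdict(int))    # stats[attr][(value, y)]
        self.split_attr = None
        self.children = None                                  # None while still a leaf

    def gain(self, attr):
        """Information gain of splitting this leaf on attr."""
        n = sum(self.class_counts.values())
        after = 0.0
        for v in (0, 1):
            part = {y: self.stats[attr][(v, y)] for y in self.class_counts}
            nv = sum(part.values())
            if nv:
                after += nv / n * entropy(part)
        return entropy(self.class_counts) - after

    def predict(self):
        """Leaves predict the most common class seen so far."""
        return max(self.class_counts, key=self.class_counts.get) if self.class_counts else 0

def vfdt(stream, attributes, delta, R=1.0):
    root = Leaf(attributes)
    for x, y in stream:                        # x: dict attr -> 0/1, y: class label
        node = root
        while node.children is not None:       # sort the example to a leaf
            node = node.children[x[node.split_attr]]
        node.class_counts[y] += 1              # update sufficient statistics at the leaf
        for a in node.attributes:
            node.stats[a][(x[a], y)] += 1
        n = sum(node.class_counts.values())
        gains = sorted((node.gain(a), a) for a in node.attributes)
        if len(gains) >= 2:
            (g2, _), (g1, best) = gains[-2], gains[-1]
            eps = sqrt(R * R * log(1.0 / delta) / (2.0 * n))
            if g1 - g2 > eps:                  # confident the best attribute really is best
                node.split_attr = best
                node.children = {v: Leaf(node.attributes - {best}) for v in (0, 1)}
    return root

# Example usage (hypothetical stream of (attribute-dict, label) pairs):
# tree = vfdt(stream, attributes=["x1", "x2"], delta=1e-7)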
Quality of Trees from VFDT
• Model may contain incorrect splits – is it still useful?
• Bound the difference from the tree learned on infinite data
  – The chance an arbitrary example takes a different path
• Intuition: an example at level i of the tree has i chances to go through a mistaken node
• Result: Δ(DT∞, DT_HT) ≤ δ / p, where p is the probability that an example reaches a leaf
Complete VFDT System
• Memory management
– Memory dominated by sufficient statistics
– Deactivate less promising leaves when needed
• Ties: G    
– Wasteful to decide between identical attributes
• Check for splits periodically
• Pre-pruning
– Only make splits that improve the value of G(.)
• Early stop on bad attributes
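A short sketch of the refined split check these points imply, with an assumed tie threshold τ and check period n_min (both names illustrative); it extends the plain Hoeffding test shown earlier.

# Refined split decision with ties, periodic checks, and pre-pruning (illustrative).
from math import log, sqrt

def check_split(g_best, g_second, n, R, delta, tau, n_min=200):
    """Return True if the leaf should split now.

    - Only check every n_min examples (cheaper than checking on every example).
    - Split if the gap beats the Hoeffding epsilon, or if epsilon has shrunk
      below the tie threshold tau (the candidates are effectively identical).
    - Pre-pruning: never split when the best attribute gives no improvement in G.
    """
    if n % n_min != 0:
        return False
    if g_best <= 0.0:
        return False
    eps = sqrt(R * R * log(1.0 / delta) / (2.0 * n))
    return (g_best - g_second) > eps or eps < tau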
VFDT (Continued)
• Bootstrap with traditional learner
• Rescan dataset when time available
• Time changing data streams
• Post pruning
• Continuous attributes
• Batch mode
Experiments
• Compared VFDT and C4.5 (Quinlan, 1993)
• Same memory limit for both (40 MB)
  – 100k examples for C4.5
• VFDT settings: δ = 10^-7, τ = 5%
• Domains: 2 classes, 100 binary attributes
• Fifteen synthetic trees, 2.2k – 500k leaves
• Noise from 0% to 30%
Running Times
• Pentium III at 500 MHz running Linux
• C4.5 takes 35 seconds to read and
process 100k examples; VFDT takes 47
seconds
• VFDT takes 6377 seconds for 20 million examples: 5752s to read, 625s to process
• VFDT processes 32k examples per
second (excluding I/O)
Real World Data Sets:
Trace of UW Web requests
• Stream of Web page requests from UW
• One week: 23k clients, 170 orgs., 244k hosts, 82.8M requests (peak: 17k/min), 20 GB
• Goal: improve cache by predicting requests
• 1.6M examples, 61% default class
• C4.5 on 75k exs, 2975 secs.
– 73.3% accuracy
• VFDT ~3000 secs., 74.3% accurate
Outline
• Introduction
• Scaling up Decision Trees
• Our Framework for Scaling
• Overview of Applications and Results
• Conclusion
Data Mining as Discrete Search
• Initial state
  – Empty, a prior model, or random
• Search operators
  – Refine the structure
• Evaluation function
  – Likelihood, or many others
• Goal state
  – Local optimum, etc.
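A minimal Python skeleton of this view of learning as greedy discrete search, with the operator and evaluation functions left abstract (all names illustrative, a sketch rather than the framework's implementation).

# Generic greedy discrete search over model states (illustrative skeleton).
from typing import Callable, Iterable, TypeVar

State = TypeVar("State")

def greedy_search(initial: State,
                  operators: Callable[[State], Iterable[State]],
                  score: Callable[[State], float]) -> State:
    """Repeatedly apply the best-scoring operator until no operator improves the score."""
    current, current_score = initial, score(initial)
    while True:
        candidates = [(score(s), s) for s in operators(current)]
        if not candidates:
            return current                       # goal state: no operators apply
        best_score, best = max(candidates, key=lambda p: p[0])
        if best_score <= current_score:
            return current                       # goal state: local optimum
        current, current_score = best, best_score

As the next slides illustrate, decision tree induction instantiates this with tree states, leaf-splitting operators, and entropy reduction as the score.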
Data Mining As Search
[Figure: a search over model states, each candidate evaluated on the training data; scores shown include 1.5, 1.7, 1.8, 1.9, and 2.0]
Example: Decision Tree
• Initial state
  – The root node
• Search operators
  – Turn any leaf into a test on an attribute
• Evaluation
  – Entropy reduction (entropy: −Σ_{i ∈ val(y)} p_i lg p_i)
• Goal state
  – No further gain
  – Post-prune
[Figure: candidate trees rooted at tests X1? … Xd?, evaluated on the training data with scores such as 1.7 and 1.5]
Overview of Framework
• Cast the learning algorithm as a search
• Begin monitoring data stream
– Use each example to update sufficient
statistics where appropriate (then discard it)
– Periodically pause and use statistical tests
• Take steps that can be made with high confidence
– Monitor old search decisions
• Change them when data stream changes
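A skeletal Python rendering of this loop, assuming a model object that exposes hooks for updating sufficient statistics, proposing and testing steps, and revisiting old decisions; every method name here is illustrative.

# Skeleton of the framework's main loop (illustrative hooks, not a complete system).
def mine_stream(stream, model, check_period=10_000):
    """Monitor the stream, taking search steps only when they are statistically confident."""
    n = 0
    for example in stream:
        model.update_sufficient_statistics(example)   # use the example, then discard it
        n += 1
        if n % check_period == 0:                     # periodically pause
            for step in model.candidate_steps():
                if model.confident(step):             # e.g. Hoeffding bound on the score gap
                    model.apply(step)
            for decision in model.past_decisions():
                if model.no_longer_valid(decision):   # the data stream has changed
                    model.revise(decision)
    return model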
How Much Data is Enough?
[Figure: with the full training data, candidate operators X1? … Xd? receive exact scores such as 1.65 and 1.38]
How Much Data is Enough?
• Use statistical bounds
  – Normal distribution
  – Hoeffding bound: ε = sqrt( R² ln(1/δ) / (2n) )
• Applies to scores that are averages over examples
• Can select a winner if
  – Score1 > Score2 + ε
[Figure: on a sample of the data, X1? scores 1.6 +/- ε and Xd? scores 1.4 +/- ε]
Global Quality Guarantee
δ* = δ·b·d·c
• δ – probability of error in single decision
• b – branching factor of search
• d – depth of search
• c – number of checks for winner
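For illustration only, plugging in assumed values (δ = 10⁻⁷ as in the experiments, with hypothetical b = 100, d = 20, c = 10):

\delta^{*} = \delta \cdot b \cdot d \cdot c = 10^{-7} \times 100 \times 20 \times 10 = 2 \times 10^{-3}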
Identical States And Ties
• The winner test fails (never decides) if states are identical (or nearly so)
• τ – user-supplied tie parameter
• Select a winner early if the alternatives differ by less than τ:
  – Score1 > Score2 + ε, or
  – ε <= τ
Dealing with Time Changing
Concepts
• Maintain a window of the most recent examples
• Keep model up to date with this window
• Effective when window size similar to concept
drift rate
• Traditional approach
– Periodically reapply learner
– Very inefficient!
• Our approach
– Monitor quality of old decisions as window shifts
– Correct decisions in fine-grained manner
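A minimal sketch of the bookkeeping this implies: as the window slides, the newest example's counts are added and the oldest example's are subtracted, so old decisions can be re-checked cheaply. WindowedCounts is an illustrative name, not part of the framework's code.

# Sliding-window sufficient statistics (illustrative sketch of the bookkeeping).
from collections import Counter, deque

class WindowedCounts:
    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.class_counts = Counter()          # sufficient statistics over the window

    def add(self, example, label):
        """Add the newest example; forget the oldest one once the window is full."""
        self.window.append((example, label))
        self.class_counts[label] += 1
        if len(self.window) > self.window_size:
            _, old_label = self.window.popleft()
            self.class_counts[old_label] -= 1  # old example no longer supports past decisions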
Alternate Searches
• When a new test starts to look better, grow an alternate sub-tree
• Replace the old sub-tree when the new one is more accurate
• This smoothly adjusts to changing concepts
[Figure: a tree with tests Gender?, Hair?, and College?, with an alternate sub-tree rooted at Pets? being grown in parallel]
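A rough Python sketch of this replacement rule, assuming each node can carry one alternate sub-tree and both are scored on recent examples; AltNode and its fields are illustrative, and the real CVFDT test is more careful than this.

# Alternate sub-tree replacement (illustrative, heavily simplified sketch).
class AltNode:
    def __init__(self, subtree):
        self.subtree = subtree            # currently deployed sub-tree
        self.alternate = None             # candidate grown when a new test looks better
        self.seen = 0
        self.subtree_errors = 0
        self.alternate_errors = 0

    def record(self, example, label, test_interval=1000):
        """Score both sub-trees on recent examples; periodically swap if the alternate wins."""
        self.seen += 1
        self.subtree_errors += int(self.subtree.predict(example) != label)
        if self.alternate is None:
            return
        self.alternate_errors += int(self.alternate.predict(example) != label)
        if self.seen % test_interval == 0:
            if self.alternate_errors < self.subtree_errors:
                self.subtree, self.alternate = self.alternate, None   # smooth replacement
            self.subtree_errors = self.alternate_errors = 0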
RAM Limitations
• Each search requires a sufficient statistics structure
• Decision Tree
  – O(a·v·c) RAM per leaf (a attributes, v values, c classes)
• Bayesian Network
  – O(c^p) RAM (c values per variable, p parents)
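An illustrative back-of-the-envelope calculation, assuming per-leaf counts with the sizes used in the earlier synthetic experiments (100 binary attributes, 2 classes, 4-byte counters, 40 MB limit):

a \cdot v \cdot c = 100 \times 2 \times 2 = 400 \ \text{counters per leaf} \approx 1.6\ \text{KB at 4 bytes each}, \ \text{so roughly } 25{,}000 \ \text{active leaves fit in 40 MB}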
RAM Limitations
[Figure: a tree with most leaves marked Active and the less promising ones marked Temporarily inactive]
Outline
• Introduction
• Data Mining as Discrete Search
• Our Framework for Scaling
• Application to Decision Trees
• Other Applications and Results
• Conclusion
Applications
• VFDT (KDD ’00) – Decision Trees
• CVFDT (KDD ’01) – VFDT + concept drift
• VFBN & VFBN2 (KDD ’02) – Bayesian Networks
• Continuous Searches
  – VFKM (ICML ’01) – K-Means clustering
  – VFEM (NIPS ’01) – EM for mixtures of Gaussians
• Relational Data Sets
  – VFREL (Submitted) – Feature selection in relational data
CVFDT Experiments
Activity Profile for VFBN
Other Real World Data Sets
• Trace of all web requests from UW campus
– Use clustering to find good locations for proxy caches
• KDD Cup 2000 Data set
– 700k page requests from an e-commerce site
– Categorize pages into 65 categories, predict which a session will
visit
• UW CSE Data set
– 8 Million sessions over two years
– Predict which of 80 level 2 directories each visits
• Web Crawl of .edu sites
– Two data sets each with two million web pages
– Use relational structure to predict which will increase in
popularity over time
Related Work
• Database Mining: A Performance Perspective (Agrawal, Imielinski, Swami ’93)
– Framework for scaling rule learning
• RainForest (Gehrke, Ramakrishnan, Ganti ‘98)
– Framework for scaling decision trees
• ADtrees (Moore, Lee ‘97)
– Accelerate computing sufficient stats
• PALO (Greiner ‘92)
– Accelerate hill climbing search via sampling
• DEMON (Ganti, Gehrke, Ramakrishnan ‘00)
– Framework for converting incremental algs. for time
changing data streams
Future Work
• Combine framework for discrete search with
frameworks for continuous search and relational
learning
• Further study time changing processes
• Develop a language for specifying data stream
learning algorithms
• Use framework to develop novel algorithms for
massive data streams
• Apply algorithms to more real-world problems
Conclusion
• Framework helps scale up learning algorithms
based on discrete search
• Resulting algorithms:
– Work on databases and data streams
– Work with limited resources
– Adapt to time changing concepts
– Learn in time proportional to concept complexity
  • Independent of amount of training data!
• Benefits have been demonstrated in a series of
applications