Mining High Speed Data Streams
Authors:
(1) Pedro Domingos, University of Washington, Seattle, WA 98195-2350, U.S.A.
(2) Geoff Hulten, University of Washington, Seattle, WA 98195-2350, U.S.A.
Presented by: Nima [Poornima Shetty]
Date: 11/15/2011
Course: Data Mining [CS332]
Computer Science Department, University of Vermont
Copyright Note:
• This presentation is based on the papers:
  – Mining High-Speed Data Streams, with Geoff Hulten. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (pp. 71-80), 2000. Boston, MA: ACM Press.
  – A General Framework for Mining Massive Data Streams, with Geoff Hulten (short paper). Journal of Computational and Graphical Statistics, 12, 2003.
• The original presentation made by the authors has been used to produce this presentation.
Overview
• Introduction
• Background Knowledge
• The Problem
• Design Criteria
• General Framework
• Hoeffding Trees
• Hoeffding bounds
• Hoeffding Tree algorithm
• Properties of Hoeffding Trees
• The Basic algorithm concepts
• The VFDT system
• Study and comparison
• A real world example
• Conclusion
Introduction
• In today's information society, the extraction of knowledge is becoming a very important task for many people. We live in an age of knowledge revolution.
• The digital universe in 2007 was estimated at 281 exabytes (10^18 bytes); by 2011 it is estimated to be 10 times the size it was 5 years before.
• Dealing with these huge amounts of data in a responsible way makes green computing a necessity.
  – A main approach to green computing is based on algorithmic efficiency.
Introduction (Contd.)
• Many organizations, such as Wal-Mart and Kmart, have very large databases that grow without limit at a rate of several million records per day.
• Mining these continuous data streams brings unique opportunities, but also new challenges.
• The most efficient algorithms available today concentrate on mining databases that do not fit in main memory by requiring only sequential scans of the disk.
Introduction (Contd.)
• Knowledge discovery systems are constrained by three main limited resources:
  – Time
  – Memory
  – Sample size
• In traditional applications of machine learning and statistics, sample size tends to be the dominant limitation.
  – Computational resources for a massive search are available, but carrying out such a search over the small samples available often leads to "overfitting".
• In today's data mining applications, the bottleneck is time and memory, not examples.
  – Examples are typically in over-supply, and it is impossible with current KDD (Knowledge Discovery and Data mining) systems to make use of all of them within the available computational resources.
  – As a result, most of the available examples go unused, resulting in "underfitting".
Background Knowledge
• Decision Tree Classification:
  – The "Traditional Decision Tree (TDT) model" can be implemented using classical algorithms (induction and information gain theories).
    • Classical algorithms such as ID3, C4.5, and CART have been very widely used in the past decades.
    • These algorithms need to scan all the data in a database multiple times in order to construct a tree-like structure.
  – An example is given in the figure below.
Background Knowledge (Contd.)
Figure 1. A typical decision tree layout: a root, internal split-attribute nodes, and class leaves.
Background Knowledge (Contd.)
• Decision Trees in Stream Mining:
  – Maron and Moore (1993) first highlighted that a small amount of available data may be sufficient as the sample used, at any given node, to pick the split attribute when building a decision tree.
  – In a stream setting, that data arrives continuously and at high speed.
  – But exactly how many streaming examples are needed at each node?
    • The Hoeffding bound (additive Chernoff bound) gives the answer.
The Problem
• Many organizations today produce an electronic record of essentially every transaction they are involved in.
• This results in tens or hundreds of millions of records being produced every day.
  – E.g., in a single day Wal-Mart records 20 million sales transactions, Google handles 150 million searches, and AT&T produces 270 million call records.
  – Scientific data collection (e.g., by earth-sensing satellites or astronomical observations) routinely produces gigabytes of data per day.
• Data rates of this level have significant consequences for data mining.
  – A few months' worth of data can easily add up to billions of records, and the entire history of transactions or observations can run to hundreds of billions.
The Problem (Contd.)
• Current algorithms for mining complex models from data (e.g., decision trees, sets of rules) cannot mine even a fraction of this data in useful time.
• Mining a day's worth of data can take more than a day of CPU time.
  – Data accumulates faster than it can be mined.
  – The fraction of the available data that we are able to mine in useful time is rapidly dwindling towards zero.
• Overcoming this state of affairs requires a shift in our frame of mind from mining databases to mining data streams.
The Problem (Contd.)
• In the traditional data mining process, data is loaded into a stable, infrequently updated database.
  – Mining it can then take weeks or months.
• A data stream mining system should instead be continuously on:
  – Processing records at the speed they arrive.
  – Incorporating them into the model it is building, even if it never sees them again.
Design Criteria for mining High Speed Data Streams
• A system capable of overcoming these problems needs to meet a number of stringent design criteria (requirements):
  1. It must be able to build a model using at most one scan of the data.
  2. It must use only a fixed amount of main memory.
  3. It must require small constant time per record.
  4. It must make a usable model available at any point in time, as opposed to only when it is done processing the data, since it may never be done processing.
• Ideally, it should produce a model equivalent to the one that would be obtained by the corresponding ordinary database mining algorithm operating without the above constraints.
• When the data-generating phenomenon is changing over time, the model at any time should be up-to-date.
Data Stream Classification Cycle
Figure 2. The data stream classification cycle: (1) input: training examples arrive (requirement 1); (2) learning: the model is updated (requirements 2 and 3); (3) model: used to make predictions on test examples (requirement 4).
Data Stream Classification Cycle (Contd.)
• The algorithm is passed the next available example from the stream (requirement 1).
• The algorithm processes the example, updating its data structures:
  – Without exceeding memory bounds (requirement 2).
  – As quickly as possible (requirement 3).
• The algorithm is then ready to accept the next example. On request, it can supply a model that can be used to predict the class of unseen examples (requirement 4).
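To make the cycle concrete, here is a minimal Python sketch of the interface such a system needs; the names (StreamClassifier, learn_one, predict_one) are ours for illustration, not from the paper:

    class StreamClassifier:
        """Minimal interface implied by the four requirements."""

        def learn_one(self, x, y):
            """Absorb one example, once (req. 1), in bounded memory (req. 2)
            and constant time per record (req. 3); it is never seen again."""
            raise NotImplementedError

        def predict_one(self, x):
            """A usable model must be available at any time (req. 4)."""
            raise NotImplementedError

    def run(stream, model):
        """Drive the cycle: learn from each arriving example in turn."""
        for x, y in stream:
            model.learn_one(x, y)   # the model stays ready to predict throughout
        return model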
General Framework for mining high speed data streams
• The authors, Pedro Domingos and Geoff Hulten, developed a general framework for mining high speed data streams that satisfies all of the above constraints.
• They designed and implemented massive-stream versions of decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians.
  – E.g., VFDT, the decision tree learning system based on Hoeffding trees.
• The probability that the Hoeffding and conventional tree learners choose different tests at any given node decreases exponentially with the number of examples.
Hoeffding trees
• Given N training examples (x, y).
• Goal: produce a model y = f(x).
• Why a new, statistically grounded learner?
  – C4.5, CART, etc. assume the data fits in RAM.
  – SPRINT and SLIQ make multiple disk scans.
  – Hence the goal is to design a decision tree learner for extremely large (potentially infinite) datasets.
Hoeffding trees (Contd.)
• In order to pick a split attribute for a node, looking at a few examples may be sufficient.
• Given a stream of examples:
  – Use the first ones to pick the root test.
  – Pass the succeeding ones down to the leaves.
  – Pick the best attributes there.
  – ... and so on, recursively.
• How many examples are sufficient?
Hoeffding bounds
• Consider a real-valued random variable r with range R.
• Make n independent observations and compute their mean r'.
• The Hoeffding bound states that, with probability 1 - δ, the true mean of the variable is at least r' - ε, where
  ε = sqrt[R^2 ln(1/δ) / 2n]
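Computing the bound is a one-liner; a minimal sketch (the function name is ours):

    import math

    def hoeffding_bound(R, delta, n):
        """epsilon such that, with probability 1 - delta, the true mean of a
        variable with range R lies within epsilon of the mean r' of n
        independent observations."""
        return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

    # E.g., for a two-class problem (R = 1) with delta = 1e-7:
    print(hoeffding_bound(1.0, 1e-7, 1000))  # ~0.0898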
Hoeffding bounds (Contd.)
• Let G(Xi) be the heuristic measure used to choose the split attribute.
  – E.g., the measure could be information gain or the Gini index.
• Goal: ensure that, with high probability, the attribute chosen using n examples is the same one that would be chosen using infinite examples.
• Assuming G is to be maximized, let Xa be the attribute with the highest observed G' and Xb the attribute with the second-highest, after seeing n examples.
• Let ΔG' = G'(Xa) - G'(Xb) >= 0 be the difference between the observed heuristic values.
• Then, given a desired δ, the Hoeffding bound guarantees that Xa is the correct choice with probability 1 - δ if n examples have been seen at this node and ΔG' > ε.
  – In other words, if the observed ΔG' > ε, the Hoeffding bound guarantees that the true ΔG >= ΔG' - ε > 0 with probability 1 - δ, and therefore that Xa is indeed the best attribute with that probability.
• Thus a node needs to accumulate examples from the stream until ε becomes smaller than ΔG'.
• The node can then be split using the current best attribute, and succeeding examples are passed to the new leaves.
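In code, this split decision reduces to a single comparison; a minimal sketch (names are ours):

    import math

    def enough_evidence_to_split(g_best, g_second, n, R=1.0, delta=1e-7):
        """True when the gap between the best and second-best observed
        heuristic values exceeds the Hoeffding bound, so the best attribute
        is the true best with probability at least 1 - delta."""
        eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
        return (g_best - g_second) > eps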
The Hoeffding tree algorithm
• The algorithm constructs the tree using the same procedure as ID3: it calculates the information gain of the attributes and determines the best two.
• At each node it checks the condition ΔG' > ε. If the condition is satisfied, it creates child nodes based on the test at the node.
• If not, it streams in more training examples and repeats the calculation until the condition is satisfied.
The Hoeffding tree algorithm (Contd.)
• If
  – X is the number of attributes,
  – v is the maximum number of values per attribute, and
  – Y is the number of classes,
  then the Hoeffding tree algorithm requires O(XvY) memory to store the necessary counts at each leaf.
• If l is the number of leaves in the tree, the total memory required is O(lXvY).
  – E.g., 100 binary attributes, 2 classes, and 1,000 leaves need on the order of 100 × 2 × 2 × 1,000 = 400,000 counts.
The Hoeffding tree algorithm (Contd.)
• Inputs:
  – S: a sequence of examples,
  – X: a set of discrete attributes,
  – G(.): a split evaluation function,
  – δ: one minus the desired probability of choosing the correct attribute at any given node.
• Output:
  – HT: a decision tree.
The Basic algorithm
• Hoeffding tree induction algorithm:
  1: Let HT be a tree with a single leaf (the root)
  2: for all training examples do
  3:   Sort example into leaf l using HT
  4:   Update sufficient statistics in l
  5:   Increment nl, the number of examples seen at l
  6:   if nl mod nmin = 0 and examples seen at l not all of same class then
  7:     Compute Gl(Xi) for each attribute
  8:     Let Xa be the attribute with highest Gl
  9:     Let Xb be the attribute with second-highest Gl
  10:    Compute the Hoeffding bound ε = sqrt[R^2 ln(1/δ) / 2n]
  11:    if Xa ≠ X0 and (Gl(Xa) - Gl(Xb) > ε or ε < T) then
  12:      Replace l with an internal node that splits on Xa
  13:      for all branches of the split do
  14:        Add a new leaf with initialized sufficient statistics
  15:      end for
  16:    end if
  17:  end if
  18: end for
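A compact Python rendering of this pseudocode follows, using (attribute, value, class) counts as the sufficient statistics and information gain as G. This is a hedged sketch, not the authors' implementation: pre-pruning against the null attribute X0 and memory management are omitted, and all names are ours.

    import math
    from collections import defaultdict

    class Leaf:
        def __init__(self):
            self.n = 0
            self.class_counts = defaultdict(int)   # class -> count
            self.stats = defaultdict(int)          # (attr, value, class) -> count

        def update(self, x, y):
            self.n += 1
            self.class_counts[y] += 1
            for attr, value in x.items():
                self.stats[(attr, value, y)] += 1

    class Node:
        def __init__(self, attr):
            self.attr = attr
            self.children = {}                     # attribute value -> subtree

    def entropy(counts):
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

    def info_gain(leaf, attr):
        """G: entropy at the leaf minus expected entropy after splitting on attr."""
        by_value = defaultdict(lambda: defaultdict(int))
        for (a, v, y), c in leaf.stats.items():
            if a == attr:
                by_value[v][y] += c
        rem = sum(sum(d.values()) / leaf.n * entropy(d) for d in by_value.values())
        return entropy(leaf.class_counts) - rem

    class HoeffdingTree:
        def __init__(self, attrs, R=1.0, delta=1e-7, nmin=200, tau=0.05):
            self.attrs = list(attrs)
            self.R, self.delta = R, delta
            self.nmin, self.tau = nmin, tau        # grace period, tie threshold T
            self.root = Leaf()                     # line 1

        def _sort(self, x):
            """Walk example x down to its leaf, keeping the parent link (line 3)."""
            node, parent, key = self.root, None, None
            while isinstance(node, Node):
                parent, key = node, x[node.attr]
                node = node.children.setdefault(key, Leaf())
            return node, parent, key

        def learn_one(self, x, y):
            leaf, parent, key = self._sort(x)
            leaf.update(x, y)                      # lines 4-5
            if leaf.n % self.nmin == 0 and len(leaf.class_counts) > 1:   # line 6
                scored = sorted((info_gain(leaf, a), a) for a in self.attrs)
                (g2, _), (g1, best) = scored[-2], scored[-1]             # lines 7-9
                eps = math.sqrt(self.R ** 2 * math.log(1 / self.delta)
                                / (2 * leaf.n))                          # line 10
                if g1 - g2 > eps or eps < self.tau:                      # line 11
                    node = Node(best)              # lines 12-14: split; new
                    if parent is None:             # leaves are created on demand
                        self.root = node           # inside _sort
                    else:
                        parent.children[key] = node

        def predict_one(self, x):
            leaf, _, _ = self._sort(x)
            cc = leaf.class_counts
            return max(cc, key=cc.get) if cc else None

    # Usage sketch:
    # tree = HoeffdingTree(attrs=["color", "shape"])
    # tree.learn_one({"color": "red", "shape": "round"}, "apple")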
The Basic algorithm concepts
• Split Confidence
• Sufficient Statistics
• Grace Period
• Pre-pruning
• Tie-breaking
Split Confidence
• The δ parameter is the one used in the Hoeffding bound.
  – It is one minus the desired probability that the correct attribute is chosen at every point in the tree.
• Since we want the correct attribute to be chosen with probability close to one, this parameter is generally set to a small value.
• For VFDT, the default value of δ is 10^-7.
• Figure 3 shows a plot of the Hoeffding bound using the default parameters for a two-class problem (R = log2(2) = 1, δ = 10^-7).
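The values plotted in that figure can be reproduced directly from the bound formula; a minimal sketch with the same parameters:

    import math

    R, delta = 1.0, 1e-7
    for n in (100, 1_000, 10_000, 100_000, 1_000_000):
        eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
        print(f"n = {n:>9,}   epsilon = {eps:.4f}")
    # epsilon decays as 1/sqrt(n): ~0.284 at n = 100 down to ~0.003 at n = 1,000,000.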
Sufficient Statistics
• The statistics kept in a leaf need to be sufficient for the split decision, and efficient storage is important.
  – Storing unnecessary information would increase the total memory requirement.
• For attributes with discrete values:
  – The required statistics are counts of the class labels that apply for each attribute value.
  – E.g., for an attribute with v unique values and c possible classes, the information can be stored in a table with vc entries.
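As a tiny illustration of that table (a sketch; the example values are made up):

    from collections import defaultdict

    # Per-leaf table for one discrete attribute: value -> class -> count.
    # With v distinct values and c classes this holds at most v*c entries,
    # exactly the sufficient statistics needed to compute G later.
    counts = defaultdict(lambda: defaultdict(int))

    for value, label in [("sunny", "play"), ("rainy", "stay"), ("sunny", "play")]:
        counts[value][label] += 1

    print(dict(counts["sunny"]))  # {'play': 2}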
Grace Period
• It is costly to evaluate the information gain of the attributes after each and every training example.
• The nmin parameter, or grace period, specifies how many examples must be seen at a leaf since the last evaluation before the split decision is revisited.
Pre-pruning
• Pre-pruning is carried out by considering at each node a null attribute X0 that consists of not splitting the node.
• A split is only made if, with confidence 1 - δ, the best split found is better according to G than not splitting.
• Nodes where X0 wins remain leaves.
Tie-Breaking
• A situation may occur where two or more competing attributes cannot be separated.
  – Even with a very small Hoeffding bound, the learner would not be able to separate them, and tree growth would stall.
• Waiting too long to decide between them may harm the accuracy of the tree.
• If the Hoeffding bound becomes sufficiently small, less than T (the tie-breaking parameter), then the node is split on the current best attribute.
Tie-Breaking (Contd.)
• Without tie-breaking, the tree grows much more slowly, ending up around five times smaller after 700 million training examples.
• Without tie-breaking, the tree also takes much longer to come close to the same level of accuracy as the tie-breaking variant.
Hoeffding trees - Theorem
• Disagreement between two decision trees:
  Δ(DT1, DT2) = Px[Path1(x) ≠ Path2(x)]
• Theorem: Let E[Δ(HTδ, DT*)] be the expected value of Δ(HTδ, DT*). If HTδ is the tree produced by the Hoeffding tree algorithm with desired probability δ given infinite examples, DT* is the asymptotic batch tree, and p is the leaf probability, then
  E[Δ(HTδ, DT*)] <= δ/p.
• [For the proof, please refer to the authors' paper "Mining High-Speed Data Streams".]
The VFDT system
• The VFDT system is based on the Hoeffding tree algorithm seen above: it uses either information gain or the Gini index as the attribute evaluation measure.
• VFDT is able to mine on the order of a billion examples per day; it mines examples in less time than it takes to read them from disk.
• VFDT allows the user to specify several parameters:
  – Ties: two attributes with very close G's would lead to the examination of a large number of examples to determine the best one, so a tie-breaking threshold can be set.
  – G computation: this is the most time-consuming part of the algorithm, and a single new example will rarely change G dramatically, so the user can specify a number nmin of new examples that must be seen before G is recomputed.
The VFDT system (contd.)
• Memory and poor attributes:
  – The VFDT system minimizes memory usage using two techniques:
    • Deactivation of non-promising leaves (see the sketch after this list).
    • Dropping of non-promising attributes.
  – This keeps memory available for new leaves.
• Rescan:
  – VFDT can rescan previously seen examples. This option can be activated if either:
    • data arrives slowly enough, or
    • the dataset is small enough that scanning it multiple times is feasible.
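A hedged sketch of the first technique, reusing the Leaf fields from the algorithm sketch earlier. The paper ranks leaves by promise p_l * e_l (the probability that an example reaches leaf l times the leaf's observed error rate); the function name and the keep-the-top-k policy here are our rendering:

    def deactivate_least_promising(leaves, total_examples, max_active):
        """Rank leaves by promise = p_l * e_l and free the sufficient
        statistics of everything outside the top max_active leaves."""
        def promise(leaf):
            if leaf.n == 0:
                return 0.0
            p = leaf.n / total_examples
            e = 1.0 - max(leaf.class_counts.values()) / leaf.n
            return p * e
        ranked = sorted(leaves, key=promise, reverse=True)
        for leaf in ranked[max_active:]:
            leaf.stats.clear()   # drop per-attribute counts; class counts stay,
                                 # so the leaf can still classify
        return ranked[:max_active]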
Study and comparison
• To be interesting, VFDT should give results at least comparable to conventional decision tree learners.
• The authors compared VFDT with C4.5 (Quinlan, 1993).
• The same memory limit (40 MB) was given to both.
  – This corresponds to 100k examples for C4.5.
• The datasets were created by sampling random trees (depth 18, between 2.2k and 61k leaves) and adding 0 to 30% noise.
• The study compares C4.5, VFDT, and VFDT-boot, a VFDT system bootstrapped with an over-pruned tree produced by C4.5.
A real world example
• The authors carried out a real-world study.
• They mined all the web page requests made from the University of Washington main campus during a week in May 1999.
• The estimated population of the university is 50,000 people (students, faculty, and staff).
• During this week they registered 23,000 active internet clients.
• The traced requests summed to 82.2 million by the end of the week, and the peak rate at which they were received was 17,400 per minute.
• The size of the trace file was around 20 GB.
A real world example (contd.)
• Testing was carried out on the last day's log.
• VFDT was run on 1.61 million examples and took 1277 seconds to learn a decision stump (a decision tree with only one node).
• C4.5 was also run; it could only use 74.5k examples (what fits in 40 MB of memory).
• C4.5 took 2975 seconds to learn its tree.
• The authors then used a machine with 1 GB of RAM:
  – The full 1.61 million examples fit in memory, so C4.5 could be run on all of them.
  – C4.5's run time then increased to 24 hours.
• VFDT is thus much faster than C4.5, and it can achieve similar accuracy in a fraction of the time.
Conclusion
• Many organizations today have databases that are more than very large: they grow without limit.
• This paper introduces Hoeffding trees and the VFDT system.
• VFDT uses Hoeffding bounds to guarantee that its output is asymptotically nearly identical to that of a conventional learner.
• Empirical studies show VFDT's effectiveness in learning from massive and continuous streams of data.
• VFDT is currently being applied to mining the continuous stream of web access data from the whole University of Washington main campus.
Questions:
Qn.1 Give the Hoeffding bound formula and describe its components.
ANS:
• Consider a real-valued random variable r with range R.
• Make n independent observations and compute their mean r'.
• The Hoeffding bound states that, with probability 1 - δ, the true mean of the variable is at least r' - ε, where
  ε = sqrt[R^2 ln(1/δ) / 2n]
Questions (Contd.)
Qn.2 Compare mining high speed data streams with database mining.
ANS:
• Database mining:
  – The database mining approach may allow larger data sets to be handled, but it still does not address the problem of a continuous supply of data.
  – Typically, a previously induced model cannot be updated when new information arrives; instead, the entire training process must be repeated with the new examples included.
• Data stream mining:
  – In data stream mining, the arriving data come in streams, which can potentially sum to infinity.
  – Algorithms written for data streams can naturally cope with data sizes many times greater than memory, and can extend to challenging real-time applications not previously tackled by machine learning or data mining.
Questions (Contd.)
Qn.3 State the design criteria (requirements) for mining high speed data streams.
ANS:
• Process one example at a time, and inspect it only once (at most).
• Use a limited amount of memory.
• Work in a limited amount of time.
• Be ready to predict at any time.
More Questions?
References
– Mining High-Speed Data Streams, with Geoff Hulten. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (pp. 71-80), 2000. Boston, MA: ACM Press.
– A General Framework for Mining Massive Data Streams, with Geoff Hulten (short paper). Journal of Computational and Graphical Statistics, 12, 2003.
– http://www.ir.iit.edu/~dagr/DataMiningCourse/Spring2001/Presentations/Summary_10.pdf
– http://www.sftw.umac.mo/~ccfong/pdf/simonfong_2011_biomed_stream_mining.pdf
– Learning Model Trees from Data Streams, by Elena Ikonomovska and Joao Gama.
– http://www.cs.waikato.ac.nz/~abifet/MOA/StreamMining.pdf