Mining from Data Streams
- Competition between Quality and Speed
Adapted from:
Wei-Guang Teng (鄧維光) and
S. Muthukrishnan’s presentations
Streaming: Finding Missing Numbers

- Paul permutes the numbers 1…n and shows all but one to Carole, in the permuted order, one after the other.
- Carole must find the missing number.
- Carole cannot remember all the numbers she has been shown.

Streaming: Finding Missing Numbers

- Carole accumulates the sum of all the numbers she has been shown. At the end she subtracts this sum from n(n+1)/2 to obtain the missing number.

Analysis
- Takes O(log n) bits to store the partial sum.
- Performs one addition each time a new number is shown (O(log n) time per number).
- Performs one subtraction at the end (O(log n) time).

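A minimal Python sketch of this trick (illustrative only; the function name and the example are assumptions, not from the slides):

def find_missing(stream, n):
    """Paul shows n-1 of the numbers 1..n; return the withheld one."""
    partial_sum = 0                          # O(log n) bits suffice for this value
    for x in stream:                         # one addition per number shown
        partial_sum += x
    return n * (n + 1) // 2 - partial_sum    # one subtraction at the end

# Example: n = 6, Paul withholds 4
# find_missing([3, 1, 6, 2, 5], 6)  ->  4
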
Data Streams (1)

- Traditional DBMS: data stored in finite, persistent data sets.
- New applications: data input arrives as continuous, ordered data streams.
  - Network monitoring and traffic engineering
  - Telecom call detail records (CDR)
  - ATM operations in banks
  - Sensor networks
  - Web logs and click-streams
  - Transactions in retail chains
  - Manufacturing processes

Data Streams (2)

- Definition
  - Continuous, unbounded, rapid, time-varying streams of data elements.
- Application characteristics
  - Massive volumes of data (can be several terabytes).
  - Records arrive at a rapid rate.
- Goal
  - Mine patterns, process queries and compute statistics on data streams in real time.

Data Stream Algorithms

- Streaming involves
  - A small number of passes over the data (typically 1?).
  - Sublinear space (sublinear in the universe or in the number of stream items?).
  - Sublinear time for computing (?).
- Similar to dynamic, online, approximation or randomized algorithms, but with more constraints.

Data Streams: Analysis Model

[Diagram: the user/application registers a query/mining target with the Stream Processing Engine, which reads the incoming stream, uses bounded scratch space (memory and/or disk), and returns results.]

Motivation

- 3 billion telephone calls in the US each day.
- 30 billion emails daily, 1 billion SMS and IMs.
- Scientific data: NASA's observation satellites generate billions of readings each day.
- IP network traffic: up to 1 billion packets per hour per router, and each ISP has many hundreds of routers!
- Compare to human-scale data: "only" 1 billion worldwide credit card transactions per month.

Network Management Application

- Monitoring and configuring network hardware and software to ensure smooth operation.

[Diagram: the Network Operations Center receives measurements from the network and issues alarms.]

IP Network Measurement Data

- IP session data

  Source      Destination   Duration   Bytes   Protocol
  10.1.0.2    16.2.3.7      12         20K     http
  18.6.7.1    12.4.0.3      16         24K     http
  13.9.4.3    11.6.8.2      15         20K     http
  15.2.2.9    17.1.2.1      19         40K     http
  12.4.3.8    14.8.7.4      26         58K     http
  10.5.1.3    13.0.0.1      27         100K    ftp
  11.1.0.6    10.3.4.5      32         300K    ftp
  19.7.1.2    16.5.5.8      18         80K     ftp

- AT&T collects 100 GB of NetFlow data each day!

Network Data Processing

- Traffic estimation/analysis
  - List the top 100 IP addresses in terms of traffic.
  - What is the average duration of an IP session?
- Fraud detection
  - Identify all sessions whose duration was more than twice the normal.
- Security/denial of service
  - List all IP addresses that have witnessed a sudden spike in traffic.
  - Identify IP addresses involved in more than 1000 sessions.

Challenges in Network Apps.

- One link at 2 Gb/s; say the average packet size is 50 bytes.
- Number of packets/sec = 5 million, i.e. 0.2 µs per packet.
- If we capture packet headers per packet (src/dest IP, time, number of bytes, etc.), that is at least 10 bytes per packet. Space per second is 50 MB; space per day is about 4.5 TB per link. ISPs have hundreds of links.

Data Streaming Models

- Input data: a1, a2, a3, …
- The input stream describes a signal A[i], a one-dimensional function (value vs. index).
- There is a mapping from the input stream to the signal.
- This is the data stream model.

Time-Series Model

- The ai's directly give the A[i]'s: each ai is the value A[i].

Cash-Register Model

- The ai's are increments to A[j]:
  - ai = (j, Ii), with Ii >= 0
  - Ai[j] = Ai-1[j] + Ii

Turnstile Model

- The ai's are updates to A[j]:
  - ai = (j, Ui)
  - Ai[j] = Ai-1[j] + Ui
- Strict turnstile model
  - Ai[j] >= 0 at all i

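A small Python sketch contrasting the three update models (illustrative; the function names and the toy signal are assumptions, not from the slides):

n = 8
A = [0] * n                              # the signal A[0..n-1]

def time_series(i, value):               # time-series: a_i directly gives A[i]
    A[i] = value

def cash_register(j, inc):               # cash-register: a_i = (j, I_i) with I_i >= 0
    assert inc >= 0
    A[j] += inc

def turnstile(j, delta):                 # turnstile: a_i = (j, U_i), U_i may be negative
    A[j] += delta
    assert A[j] >= 0, "strict turnstile requires A[j] >= 0 at all times"
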
Data Stream Algorithms

- Compute various functions on the signal A at various times.
- Performance measures
  - Processing time per item ai in the stream.
  - Space used to store the data structure on At at time t.
  - Time needed to compute the functions on A.

Outline

- Introduction & Motivation
- Issues & Techs. of Processing Data Streams
  - Sampling
  - Histogram
  - Wavelet
- Data Stream Management Systems
- Example Algorithms for Frequency Counting
  - Lossy Counting
  - Sticky Sampling

Data Stream Algorithms

- Stream processing requirements
  - Single pass: each record is examined at most once.
  - Bounded storage: limited memory for storing a synopsis.
  - Real-time: per-record processing time (to maintain the synopsis) must be low.
- Generally, algorithms compute approximate answers.
  - It is difficult to compute answers accurately with limited memory.

Approximation in Data Streams

- Approximate answers - deterministic bounds
  - Algorithms compute only an approximate answer, but with bounds on the error.
- Approximate answers - probabilistic bounds
  - Algorithms compute an approximate answer with high probability.
  - With probability at least 1 - δ, the computed answer is within a factor ε of the actual answer.

Sliding Window Approximation

[Figure: a bit stream 011000011100000101010 with a sliding window over the most recent bits]

- Why?
  - Approximation technique for bounded memory.
  - Natural in applications (emphasizes recent data).
  - Well-specified and deterministic semantics.
- Issues
  - Extend relational algebra, SQL, query optimization.
  - Algorithmic work.
  - Timestamps?

Timestamps

- Explicit
  - Injected by the data source.
  - Models the real-world event represented by the tuple.
  - Tuples may be out-of-order, but if near-ordered they can be reordered with small buffers.
- Implicit
  - Introduced as a special field by the DSMS.
  - Arrival time in the system.
  - Enables order-based querying and sliding windows.
- Issues
  - Distributed streams?
  - Composite tuples created by the DSMS?

Time

- Easiest: global system clock
  - Stream elements and relation updates are timestamped on entry to the system.
- Application-defined time
  - Streams and relation updates contain application timestamps, possibly out of order.
  - The application generates a "heartbeat".
  - Or the heartbeat is deduced from parameters: stream skew, scrambling, latency, and clock progress.
  - Query results are in application time.

Sampling: Basics

- A small random sample S of the data often well represents all the data.
- Example: select agg from R where R.e is odd (n = 12)
  - Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
  - Sample S:    9 5 1 8
  - If agg is avg, return the average of the odd elements in S.  Answer: 5
  - If agg is count, return the average over all elements e in S of (n if e is odd, 0 if e is even).  Answer: 12 * 3/4 = 9.  Unbiased!

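A quick Python sketch of the estimators above (illustrative; the variable names and the use of random.sample in place of a true streaming sampler are assumptions):

import random

stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
n = len(stream)

sample = random.sample(stream, 4)        # in a real stream, use reservoir sampling instead

# avg over odd sample elements approximates avg over odd stream elements
odd = [e for e in sample if e % 2 == 1]
avg_estimate = sum(odd) / len(odd) if odd else 0.0

# unbiased count estimate: average of (n if e is odd else 0) over the sample
count_estimate = sum(n if e % 2 == 1 else 0 for e in sample) / len(sample)

print(avg_estimate, count_estimate)
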
Histograms

- Histograms approximate the frequency distribution of element values in a stream.
- A histogram (typically) consists of
  - a partitioning of element domain values into buckets, and
  - a count C_B per bucket B (the number of elements in B).
- Long history of use for selectivity estimation within a query optimizer ([Koo80], [PSC84], etc.)

Types of Histograms

- Equi-depth histograms
  - Select buckets such that counts per bucket are equal.
  [Figure: count per bucket over domain values 1-20]
- V-optimal histograms [IP95] [JKM98]
  - Select buckets to minimize the frequency variance within buckets:
    minimize  sum over buckets B, sum over values v in B of ( f_v - C_B / V_B )^2
    where f_v is the frequency of value v, C_B the bucket count, and V_B the number of values in B.
  [Figure: count per bucket over domain values 1-20]

Answering Queries using Histograms [IP99]

- (Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation.
- Example: select count(*) from R where 4 <= R.e <= 15
  - The count is spread evenly among the values of each bucket.
  - Answer: 3.5 * C_B
- For equi-depth histograms, the maximum error is +/- 2 * C_B.

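A hedged Python sketch of this kind of range-count estimation from an equi-depth histogram (the bucket boundaries and counts below are illustrative, not the slide's exact partition):

buckets = [(1, 4), (5, 8), (9, 12), (13, 16), (17, 20)]   # bucket domain ranges [lo, hi]
counts  = [7, 7, 7, 7, 7]                                  # count C_B per bucket

def estimate_range_count(a, b):
    total = 0.0
    for (lo, hi), c in zip(buckets, counts):
        overlap = max(0, min(b, hi) - max(a, lo) + 1)       # bucket values falling inside [a, b]
        width = hi - lo + 1
        total += c * overlap / width                        # spread C_B evenly over the bucket's values
    return total

print(estimate_range_count(4, 15))    # estimate for 4 <= R.e <= 15
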
Wavelet Basics

- For hierarchical decomposition of functions/signals.
- Haar wavelets
  - Simplest wavelet basis => recursive pairwise averaging and differencing at different resolutions.

  Resolution   Averages                   Detail coefficients
  3            [2, 2, 0, 2, 3, 5, 4, 4]   ----
  2            [2, 1, 4, 4]               [0, -1, -1, 0]
  1            [1.5, 4]                   [0.5, 0]
  0            [2.75]                     [-1.25]

  Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]

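A minimal Python sketch of this pairwise averaging and differencing (the function name is an assumption; it reproduces the example above):

def haar_decompose(signal):
    coeffs = []
    current = list(signal)                # assumes len(signal) is a power of two
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details  = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = details + coeffs         # coarser-level details go in front
        current = averages
    return current + coeffs               # overall average first, then detail coefficients

# haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]) -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
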
Haar Wavelet Coefficients

- Hierarchical decomposition structure ("error tree")

[Figure: error tree with root 2.75, top detail -1.25, second-level details 0.5 and 0, leaf-level details 0, -1, -1, 0; the leaves are the original frequency distribution 2, 2, 0, 2, 3, 5, 4, 4. Edges carry + or - signs. A second panel shows each coefficient's "support", i.e. the range of the signal it affects.]

Wavelet-based Histograms [MVW98]

- Problem: range-query selectivity estimation.
- Key idea: use a compact subset of Haar wavelet coefficients to approximate the frequency distribution.
- Steps
  - Compute the cumulative frequency distribution C.
  - Compute the Haar wavelet transform of C.
  - Coefficient thresholding: only m << n coefficients can be kept.

Using Wavelet-based Histograms

- Selectivity estimation: count(a <= R.e <= b) = C'[b] - C'[a-1]
  - C' is the (approximate) "reconstructed" cumulative distribution.
  - Time: O(min{m, log N}), where m = size of the wavelet synopsis (number of coefficients) and N = size of the domain.
  - At most log N + 1 coefficients are needed to reconstruct any C' value.
- Empirical results over synthetic data show improvements over random sampling and histograms.

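A hedged Python sketch of reconstructing a single value C'[a] by walking the root-to-leaf path of the error tree, so only log N + 1 coefficients are touched (assumes the coefficient layout produced by haar_decompose above and a domain size that is a power of two; thresholded-away coefficients are simply zeros):

def reconstruct_point(coeffs, a):
    n = len(coeffs)
    value = coeffs[0]                     # overall average at the root
    lo, hi, idx = 0, n, 1                 # idx = 1 is the top detail coefficient; [lo, hi) is its support
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if a < mid:                       # a lies in the left half of this node's support: add
            value += coeffs[idx]
            hi, idx = mid, 2 * idx
        else:                             # right half: subtract
            value -= coeffs[idx]
            lo, idx = mid, 2 * idx + 1
    return value

# reconstruct_point([2.75, -1.25, 0.5, 0, 0, -1, -1, 0], 4) -> 3.0  (the 5th value of the example signal)
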
Data Streaming Systems

- Low-level, application-specific approach
- DBMS approach
- Generic data stream management systems

DBMS vs. DSMS: Meta-Questions

- Killer apps
  - Do application stream rates exceed DBMS capacity?
  - Can a DSMS handle high rates anyway?
- Motivation
  - Is there a need for a general-purpose DSMS?
  - Not ad-hoc, application-specific systems?
- Non-trivial
  - Is a DSMS merely a DBMS with enhanced support for triggers, temporal constructs, and data-rate management?

DBMS versus DSMS

  DBMS                                      DSMS
  Persistent relations                      Transient streams (and persistent relations)
  One-time queries                          Continuous queries
  Random access                             Sequential access
  Access plan determined by query           Unpredictable data characteristics
  processor and physical DB design          and arrival patterns

(Simplified) Big Picture of DSMS

[Diagram: input streams enter the DSMS; registered queries produce stored and streamed results; the DSMS uses a scratch store, an archive, and stored relations.]

(Simplified) Network Monitoring

[Diagram: network measurements and packet traces feed the DSMS; registered monitoring queries produce intrusion warnings and online performance metrics, using a scratch store, an archive, and lookup tables.]

Using Conventional DBMS

- Data streams as relation inserts, continuous queries as triggers or materialized views.
- Problems with this approach
  - Inserts are typically batched, with high overhead.
  - Expressiveness: simple conditions only (triggers), no built-in notion of sequence (views).
  - No notion of approximation or resource allocation.
  - Current systems don't scale to a large number of triggers.
  - Views don't provide streamed results.

Query 1 (self-join)

- Find all outgoing calls longer than 2 minutes.

  SELECT O1.call_ID, O1.caller
  FROM   Outgoing O1, Outgoing O2
  WHERE  (O2.time - O1.time > 2
          AND O1.call_ID = O2.call_ID
          AND O1.event = start
          AND O2.event = end)

- The result requires unbounded storage.
- Can provide the result as a data stream.
- Can output after 2 minutes, without seeing the end event.

Query 2 (join)

- Pair up callers and callees.

  SELECT O.caller, I.callee
  FROM   Outgoing O, Incoming I
  WHERE  O.call_ID = I.call_ID

- Can still provide the result as a data stream.
- Requires unbounded temporary storage …
- … unless the streams are near-synchronized.

Query 3 (group-by aggregation)

- Total connection time for each caller.

  SELECT   O1.caller, sum(O2.time - O1.time)
  FROM     Outgoing O1, Outgoing O2
  WHERE    (O1.call_ID = O2.call_ID
            AND O1.event = start
            AND O2.event = end)
  GROUP BY O1.caller

- Cannot provide the result as an (append-only) stream.
  - Output updates?
  - Provide the current value on demand?

Data Model

- Append-only: call records
- Updates: stock tickers
- Deletes: transactional data
- Meta-data: control signals, punctuations
- System internals - probably need all of the above

Related Database Technology

- A DSMS must use these ideas, but none is a substitute:
  - Triggers and materialized views in conventional DBMS
  - Main-memory databases
  - Sequence/temporal/time-series databases
  - Real-time databases
  - Adaptive, online, partial results
- Novelty in DSMS
  - Semantics: input ordering, streaming output, …
  - State: cannot store unending streams, yet need history.
  - Performance: rate, variability, imprecision, …

Outline

- Introduction & Motivation
- Data Stream Management System
- Issues & Techs. of Processing Data Streams
  - Sampling
  - Histogram
  - Wavelet
- Example Algorithms for Frequency Counting
  - Lossy Counting
  - Sticky Sampling

Problem of Frequency Counts

[Figure: an incoming stream of elements]

- Identify all elements whose current frequency exceeds the support threshold s = 0.1%.

Algorithm 1: Lossy Counting

- Step 1: Divide the stream into "windows".

[Figure: the stream split into Window 1, Window 2, Window 3, …]

- Is the window size a function of the support s? Will fix later…

Lossy Counting in Action ...

[Figure: the frequency counts start empty; the elements of the first window are added to the counters.]

- At a window boundary, decrement all counters by 1.

Lossy Counting (cont'd)

[Figure: the counters are updated with the next window's elements.]

- At a window boundary, decrement all counters by 1.

Error Analysis

- How much do we undercount?
  - If the current size of the stream = N and the window size = 1/ε,
    then the frequency error <= number of windows = εN.
- Rule of thumb: set ε = 10% of the support s.
  - Example: given support frequency s = 1%, set the error frequency ε = 0.1%.

Analysis of Lossy Counting

- Output: elements with counter values exceeding sN - εN.
- Approximation guarantees
  - Frequencies are underestimated by at most εN.
  - No false negatives.
  - False positives have true frequency at least sN - εN.
- How many counters do we need?
  - Worst case: (1/ε) log(εN) counters.

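A hedged Python sketch of the windowed variant of Lossy Counting described on these slides (decrement every counter at each window boundary; the function and parameter names are assumptions):

def lossy_counting(stream, s=0.01, eps=0.001):
    window_size = int(1 / eps)
    counts, n = {}, 0
    for item in stream:
        counts[item] = counts.get(item, 0) + 1
        n += 1
        if n % window_size == 0:                    # window boundary
            for key in list(counts):
                counts[key] -= 1                    # decrement all counters by 1
                if counts[key] == 0:
                    del counts[key]                 # drop counters that hit zero
    # output elements whose (under)counted frequency exceeds sN - eps*N
    return [item for item, c in counts.items() if c >= (s - eps) * n]
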
Algorithm 2: Sticky Sampling

[Figure: a stream of elements shown with the numbers 28, 31, 41, 23, 35, 19, 34, 15, 30]

- Create counters by sampling.
- Maintain exact counts thereafter.
- At what rate should we sample?

Sticky Sampling (cont'd)

- For a finite stream of length N
  - Sampling rate = 2/(Nε) * log(1/(sδ))
  - (δ = probability of failure)
- Output: elements with counter values exceeding sN - εN.
- Same error guarantees as Lossy Counting, but probabilistic!
- Same rule of thumb: set ε = 10% of the support s.
  - Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%.

Sampling rate?

- Finite stream of length N
  - Sampling rate: 2/(Nε) * log(1/(sδ))
- Infinite stream with unknown N
  - Gradually adjust the sampling rate.
- In either case,
  - Expected number of counters = 2/ε * log(1/(sδ))
  - Independent of N!

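A hedged Python sketch of Sticky Sampling for a finite stream of known length N (an illustrative simplification with a fixed sampling rate; names and defaults are assumptions):

import math
import random

def sticky_sampling(stream, s=0.01, eps=0.001, delta=0.0001):
    N = len(stream)
    rate = min(1.0, 2.0 / (N * eps) * math.log(1.0 / (s * delta)))   # sampling probability
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1                 # once a counter exists, count exactly
        elif random.random() < rate:
            counts[item] = 1                  # create a counter by sampling
    return [item for item, c in counts.items() if c >= (s - eps) * N]
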
New Directions

- Functional approximation theory
- Data structures
- Computational geometry
- Graph theory
- Databases
- Hardware
- Streaming models
- Data stream quality monitoring

References (1)
[AGM99] N. Alon, P.B. Gibbons, Y. Matias, M. Szegedy. Tracking Join and Self-Join Sizes in
Limited Storage. ACM PODS, 1999.
[AMS96] N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the frequency
moments. ACM STOC, 1996.
[CIK02] G. Cormode, P. Indyk, N. Koudas, S. Muthukrishnan. Fast mining of tabular data via
approximate distance computations. IEEE ICDE, 2002.
[CMN98] S. Chaudhuri, R. Motwani, and V. Narasayya. “Random Sampling for Histogram
Construction: How much is enough?”. ACM SIGMOD 1998.
[CDI02] G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan. Comparing Data Streams Using
Hamming Norms. VLDB, 2002.
[DGG02] A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries
over Data Streams. ACM SIGMOD, 2002.
[DJM02] T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining database structure or
how to build a data quality browser. ACM SIGMOD, 2002.
[DH00] P. Domingos and G. Hulten. Mining high-speed data streams. ACM SIGKDD, 2000.
[EKSWX98] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental Clustering for
Mining in a Data Warehousing Environment. VLDB 1998.
[FKS99] J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan. An approximate L1-difference
algorithm for massive data streams. IEEE FOCS, 1999.
[FM85] P. Flajolet, G.N. Martin. “Probabilistic Counting Algorithms for Data Base Applications”.
JCSS 31(2), 1985
References (2)
[Gib01] P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and
event reports, VLDB 2001.
[GGI02] A.C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. ACM STOC, 2002.
[GGRL99] J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh: BOAT-Optimistic Decision Tree
Construction. SIGMOD 1999.
[GK01] M. Greenwald and S. Khanna. “Space-Efficient Online Computation of Quantile
Summaries”. ACM SIGMOD 2001.
[GKM01] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. Surfing Wavelets on Streams:
One Pass Summaries for Approximate Aggregate Queries. VLDB 2001.
[GKM02] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. “How to Summarize the Universe:
Dynamic Maintenance of Quantiles”. VLDB 2002.
[GKS01b] S. Guha, N. Koudas, and K. Shim. “Data Streams and Histograms”. ACM STOC 2001.
[GM98] P. B. Gibbons and Y. Matias. “New Sampling-Based Summary Statistics for Improving
Approximate Query Answers”. ACM SIGMOD 1998.
[GMP97] P. B. Gibbons, Y. Matias, and V. Poosala. “Fast Incremental Maintenance of
Approximate Histograms”. VLDB 1997.
[GT01] P.B. Gibbons, S. Tirthapura. “Estimating Simple Functions on the Union of Data Streams”.
ACM SPAA, 2001.
References (3)
[HHW97] J. M. Hellerstein, P. J. Haas, and H. J. Wang. “Online Aggregation”. ACM SIGMOD 1997.
[HSD01] G. Hulten, L. Spencer, and P. Domingos. Mining Time-Changing Data Streams. ACM SIGKDD 2001.
[IKM00] P. Indyk, N. Koudas, S. Muthukrishnan. Identifying representative trends in massive
time series data sets using sketches. VLDB, 2000.
[Ind00] P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings, and Data Stream
Computation. IEEE FOCS, 2000.
[IP95] Y. Ioannidis and V. Poosala. “Balancing Histogram Optimality and Practicality for Query
Result Size Estimation”. ACM SIGMOD 1995.
[IP99] Y.E. Ioannidis and V. Poosala. “Histogram-Based Approximation of Set-Valued Query
Answers”. VLDB 1999.
[JKM98] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel.
“Optimal Histograms with Quality Guarantees”. VLDB 1998.
[JL84] W.B. Johnson, J. Lindenstrauss. Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics, 26, 1984.
[Koo80] R. P. Kooi. “The Optimization of Queries in Relational Databases”. PhD thesis, Case
Western Reserve University, 1980.
References (4)
[MRL98] G.S. Manku, S. Rajagopalan, and B. G. Lindsay. “Approximate Medians and other
Quantiles in One Pass and with Limited Memory”. ACM SIGMOD 1998.
[MRL99] G.S. Manku, S. Rajagopalan, B.G. Lindsay. Random Sampling Techniques for Space
Efficient Online Computation of Order Statistics of Large Datasets. ACM SIGMOD, 1999.
[MVW98] Y. Matias, J.S. Vitter, and M. Wang. “Wavelet-based Histograms for Selectivity
Estimation”. ACM SIGMOD 1998.
[MVW00] Y. Matias, J.S. Vitter, and M. Wang. “Dynamic Maintenance of Wavelet-based
Histograms”. VLDB 2000.
[PIH96] V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. “Improved Histograms for Selectivity
Estimation of Range Predicates”. ACM SIGMOD 1996.
[PJO99] F. Provost, D. Jenson, and T. Oates. Efficient Progressive Sampling. KDD 1999.
[Poo97] V. Poosala. “Histogram-Based Estimation Techniques in Database Systems”. PhD Thesis,
Univ. of Wisconsin, 1997.
[PSC84] G. Piatetsky-Shapiro and C. Connell. “Accurate Estimation of the Number of Tuples
Satisfying a Condition”. ACM SIGMOD 1984.
[SDS96] E.J. Stollnitz, T.D. DeRose, and D.H. Salesin. “Wavelets for Computer Graphics”.
Morgan-Kauffman Publishers Inc., 1996.
References (5)
[T96] H. Toivonen. Sampling Large Databases for Association Rules. VLDB 1996.
[TGI02] N. Thaper, S. Guha, P. Indyk, N. Koudas. Dynamic Multidimensional Histograms. ACM
SIGMOD, 2002.
[U89] P. E. Utgoff. Incremental Induction of Decision Trees. Machine Learning, 4, 1989.
[U94] P. E. Utgoff: An Improved Algorithm for Incremental Induction of Decision Trees. ICML
1994.
[Vit85] J. S. Vitter. “Random Sampling with a Reservoir”. ACM TOMS, 1985.