Randomization for Massive
and Streaming Data Sets
Rajeev Motwani
May 21, 2003
CS Forum Annual Meeting
Data Streams Management Systems
- Traditional DBMS – data stored in finite, persistent data sets
- Data Streams – distributed, continuous, unbounded, rapid, time-varying, noisy, …
- Emerging DSMS – variety of modern applications
  - Network monitoring and traffic engineering
  - Telecom call records
  - Network security
  - Financial applications
  - Sensor networks
  - Manufacturing processes
  - Web logs and clickstreams
  - Massive data sets
DSMS – Big Picture
[Figure: input streams enter the DSMS, which maintains a scratch store, an archive, and stored relations; registered queries produce streamed or stored results.]
Algorithmic Issues
- Computational Model
  - Streaming data (or, secondary memory)
  - Bounded main memory
- Techniques
  - New paradigms
  - Negative results and approximation
  - Randomization
- Complexity Measures
  - Memory
  - Time per item (online, real-time)
  - # passes (linear scans over secondary memory)
Stream Model of Computation
[Figure: a data stream of items flows past bounded main memory holding synopsis data structures.]
- Memory: poly(1/ε, log N)
- Query/Update Time: poly(1/ε, log N)
- N: # items so far, or window size
- ε: error parameter
“Toy” Example – Network Monitoring
[Figure: network measurements and packet traces stream into the DSMS (scratch store, archive, lookup tables); registered monitoring queries produce intrusion warnings and online performance metrics.]
Frequency Related Problems
Analytics on Packet Headers – IP Addresses
[Figure: histogram of element frequencies over values 1–20.]
- Find all elements with frequency > 0.1%
- Top-k most frequent elements
- Find elements that occupy 0.1% of the tail
- What is the frequency of element 3?
- What is the total frequency of elements between 8 and 14?
- How many elements have non-zero frequency?
- Mean + Variance? Median?
Example 1 – Distinct Values
- Input Sequence X = x1, x2, …, xn, …
- Domain U = {0, 1, 2, …, u−1}
- Compute D(X), the number of distinct values in X
- Remarks
  - Assume stream size n is finite/known (generally, n is the window size)
  - Domain could be arbitrary (e.g., text, tuples)
Naïve Approach
- Counter C(i) for each domain value i
- Initialize counters C(i) ← 0
- Scan X, incrementing the appropriate counters
- Problem
  - Memory size M << n
  - Space O(u) – possibly u >> n (e.g., when counting distinct words in a web crawl)
Negative Result
Theorem:
Deterministic algorithms need M = Ω(n log u) bits
Proof: Information-theoretic arguments
Note: Leaves open randomization/approximation
Randomized Algorithm
[Figure: the input stream is hashed via h: U → [1..t] into a hash table with one list per bucket.]
- Analysis
  - Random h ⇒ few collisions and average list-size O(n/t)
- Thus
  - Space: O(n) – since we need t = Ω(n)
  - Time: O(1) per item [expected]
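A minimal sketch of the hash-table scheme above, assuming hashable items. A Python set plays the role of the hash table (it resizes itself, effectively keeping t = Ω(n) buckets), so the expected time per item is O(1) but the space is O(n):

```python
def exact_distinct_count(stream):
    """Exact D(X) via hashing: O(1) expected time per item, O(n) space."""
    seen = set()                 # hash table over the distinct values seen
    for x in stream:
        seen.add(x)
    return len(seen)
```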
Improvement via Sampling?
- Sample-based Estimation
  - Take a random sample R (of size r) of the n values in X
  - Compute D(R)
  - Estimator E = D(R) × n/r
- Benefit – sublinear space
- Cost – estimation error is high
- Why? – low-frequency values are underrepresented
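A tiny sketch of this sample-and-scale estimator (names are illustrative); the next slide explains why its error can be large:

```python
import random

def sampled_distinct_estimate(X, r, seed=0):
    """Estimator E = D(R) * n / r from a random sample R of r values in X."""
    rng = random.Random(seed)
    n = len(X)
    R = rng.sample(X, r)          # uniform random sample of r values
    return len(set(R)) * n / r    # D(R) scaled up by n/r
```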
Negative Result for Sampling
- Consider an estimator E of D(X) examining r items in X
  - Possibly in an adaptive/randomized fashion
- Theorem: For any δ > e^(−r), E has relative error at least
  √( ((n − r) / 2r) · ln(1/δ) )
  with probability at least δ.
- Remarks
  - r = n/10 ⇒ error 75% with probability ½
  - Leaves open randomization/approximation on full scans
Randomized Approximation
- Simplified Problem – for a fixed t, is D(X) >> t?
- Choose a hash function h: U → [1..t]
- Initialize the answer to NO
- For each xi, if h(xi) = t, set the answer to YES
[Figure: the input stream is hashed via h: U → [1..t]; a single boolean flag records whether any item hashed to bucket t.]
- Observe – only 1 bit of memory is needed!
- Theorem:
  - If D(X) < t, P[output NO] > 0.25
  - If D(X) > 2t, P[output NO] < 0.14
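A minimal sketch of this one-bit test, assuming hashable items and using Python's built-in hash salted with a random seed as a stand-in for a truly random h: U → [1..t]:

```python
import random

def flag_test(stream, t, seed=0):
    """One-bit test for 'is D(X) >> t?' -- returns "YES" or "NO"."""
    salt = random.Random(seed).getrandbits(64)
    flag = False                          # the single bit of state
    for x in stream:
        if hash((salt, x)) % t + 1 == t:  # h(x) landed in bucket t
            flag = True
    return "YES" if flag else "NO"
```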
Analysis
- Let Y be the set of distinct elements of X
- Output NO ⇔ no element of Y hashes to t
- P[an element hashes to t] = 1/t
- Thus – P[output NO] = (1 − 1/t)^|Y|
- Since |Y| = D(X):
  - D(X) < t ⇒ P[output NO] > (1 − 1/t)^t > 0.25
  - D(X) > 2t ⇒ P[output NO] < (1 − 1/t)^(2t) < 1/e²
Boosting Accuracy
- With 1 bit we can distinguish D(X) < t from D(X) > 2t
- Running O(log 1/δ) instances in parallel
  - reduces the error probability to any δ > 0
- Running O(log n) of these in parallel, for t = 1, 2, 4, 8, …, n
  - estimates D(X) within a factor of 2
- The choice of multiplier 2 is arbitrary
  - can use a factor (1+ε) to reduce the error to ε
- Theorem: Can estimate D(X) within factor (1±ε) with probability (1−δ) using space
  O( (log n / ε²) · log(1/δ) )
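A sketch of the boosted estimator built from the one-bit test above: for each power-of-two threshold t, several independent copies are run in one pass, the fraction of NO answers is compared against a cutoff between the 0.14 and 0.25 bounds, and the largest t that still looks like D(X) > t is reported. The copy count and the 0.2 cutoff are illustrative choices, not from the slides:

```python
import random

def estimate_distinct(stream, n, copies=32, seed=0):
    """Estimate D(X) within roughly a factor of 2 in one pass (sketch)."""
    rng = random.Random(seed)
    thresholds = []
    t = 2
    while t <= n:
        thresholds.append(t)
        t *= 2
    salts = {(t, c): rng.getrandbits(64)
             for t in thresholds for c in range(copies)}
    flags = {key: False for key in salts}          # every copy starts at NO
    for x in stream:
        for (t, c), salt in salts.items():
            if hash((salt, x)) % t + 1 == t:
                flags[(t, c)] = True
    estimate = 1
    for t in thresholds:
        frac_no = sum(not flags[(t, c)] for c in range(copies)) / copies
        if frac_no < 0.2:      # looks like D(X) > t rather than D(X) < t
            estimate = t
    return estimate
```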
Example 2 – Elephants-and-Ants
[Figure: a stream of packets.]
- Identify items whose current frequency exceeds a support threshold s = 0.1%.
  [Jacobson 2000, Estan-Verghese 2001]
Algorithm 1: Lossy Counting
Step 1: Divide the stream into ‘windows’
[Figure: the stream split into Window 1, Window 2, Window 3, …]
The window size W is a function of the support s – specified later…
Lossy Counting in Action…
[Figure: frequency counts accumulate over the first window, starting from empty; at the window boundary, all counters are decremented by 1.]

Lossy Counting continued…
[Figure: counts from the next window are added to the existing counters; at each window boundary, all counters are again decremented by 1.]
Error Analysis
How much do we undercount?
If the current size of the stream is N and the window size W is 1/ε,
then frequency error ≤ # windows = εN
Rule of thumb: set ε = 10% of the support s
Example: given support frequency s = 1%, set error frequency ε = 0.1%
Putting it all together…
Output: elements with counter values exceeding (s−ε)N
Approximation guarantees
- Frequencies underestimated by at most εN
- No false negatives
- False positives have true frequency at least (s−ε)N
How many counters do we need?
- Worst-case bound: (1/ε)·log(εN) counters
- Implementation details…
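A compact sketch of Lossy Counting as described above, assuming window size W = 1/ε and that counters which fall to zero at a window boundary are discarded (the pruning that keeps the number of counters within the stated bound); names are illustrative:

```python
from collections import defaultdict

def lossy_counting(stream, s, epsilon):
    """Report items whose frequency exceeds support s (sketch)."""
    window = int(1 / epsilon)            # window size W = 1/epsilon
    counts = defaultdict(int)
    n = 0
    for x in stream:
        counts[x] += 1
        n += 1
        if n % window == 0:              # window boundary
            for item in list(counts):
                counts[item] -= 1        # decrement every counter by 1
                if counts[item] <= 0:
                    del counts[item]     # drop empty counters
    # Output: counter values exceeding (s - epsilon) * N
    return [item for item, c in counts.items() if c > (s - epsilon) * n]
```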
Algorithm 2: Sticky Sampling
[Figure: a stream of items (28, 31, 41, 23, 35, 19, 34, 15, 30, …); sampled items acquire counters.]
- Create counters by sampling
- Maintain exact counts thereafter
What is the sampling rate?
Sticky Sampling contd…
For a finite stream of length N:
Sampling rate = (2/εN) · log(1/(sδ)), where δ = probability of failure
Output: elements with counter values exceeding (s−ε)N
Approximation guarantees (probabilistic)
- Frequencies underestimated by at most εN
- No false negatives
- False positives have true frequency at least (s−ε)N
Same rule of thumb: set ε = 10% of the support s – same error guarantees as Lossy Counting, but probabilistic
Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01%
Number of counters?
- Finite stream of length N
  - Sampling rate: (2/εN) · log(1/(sδ))
- Infinite stream with unknown N
  - Gradually adjust the sampling rate
- In either case, expected number of counters = (2/ε) · log(1/(sδ))
  - Independent of N
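A sketch of Sticky Sampling for a finite stream, using the rate above: an item without a counter gets one only with probability equal to the sampling rate, but once a counter exists the item is counted exactly. Names are illustrative, and the gradual rate adjustment for unknown N is omitted:

```python
import math
import random

def sticky_sampling(stream, N, s, epsilon, delta, seed=0):
    """Approximate heavy hitters via Sticky Sampling (sketch, finite N)."""
    rng = random.Random(seed)
    rate = (2.0 / (epsilon * N)) * math.log(1.0 / (s * delta))
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1               # exact counting once sampled
        elif rng.random() < rate:
            counts[x] = 1                # counter created by sampling
    # Output: counter values exceeding (s - epsilon) * N
    return [item for item, c in counts.items() if c > (s - epsilon) * N]
```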
Example 3 – Correlated Attributes
R1
R2
R3
R4
R5
R6
R7
R8
…
C1
1
1
1
0
1
1
0
0
C2
1
1
0
0
1
1
1
1
…
C3
1
0
0
1
1
1
1
1
C4
1
1
1
0
0
1
1
1
…
C5
0
0
0
1
1
1
1
0
 Input Stream – items with boolean attributes
 Matrix – M(r,c) = 1  Row r has Attribute c
 Identify – Highly-correlated column-pairs
26
Correlation ⇒ Similarity
- View a column as the set of row-indexes where it has 1’s
- Set Similarity (Jaccard measure)
  sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
- Example

    Ci  Cj
     0   1
     1   0
     1   1
     0   0
     1   1
     0   1

  sim(Ci, Cj) = 2/5 = 0.4
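A one-function sketch of the Jaccard measure on 0/1 columns, checked against the example above (names are illustrative):

```python
def jaccard(ci, cj):
    """Jaccard similarity |Ci ∩ Cj| / |Ci ∪ Cj| of two 0/1 columns."""
    both = sum(1 for a, b in zip(ci, cj) if a and b)
    either = sum(1 for a, b in zip(ci, cj) if a or b)
    return both / either if either else 0.0

print(jaccard([0, 1, 1, 0, 1, 0], [1, 0, 1, 0, 1, 1]))   # 0.4
```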
Identifying Similar Columns?
- Goal – finding candidate pairs in small memory
- Signature Idea
  - Hash each column Ci to a small signature sig(Ci)
  - Set of signatures fits in memory
  - sim(Ci, Cj) approximated by sim(sig(Ci), sig(Cj))
- Naïve Approach
  - Sample P rows uniformly at random
  - Define sig(Ci) as the P bits of Ci in the sample
  - Problem
    - Sparsity ⇒ would miss the interesting part of the columns
    - The sample would get only 0’s in the columns
Key Observation
- For columns Ci, Cj there are four types of rows:

    Type  Ci  Cj
     A     1   1
     B     1   0
     C     0   1
     D     0   0

- Overload notation: A = # rows of type A (similarly B, C, D)
- Observation
  sim(Ci, Cj) = A / (A + B + C)
Min Hashing
- Randomly permute the rows
- Hash h(Ci) = index of the first row with a 1 in column Ci
- Surprising Property
  P[h(Ci) = h(Cj)] = sim(Ci, Cj)
- Why?
  - Both equal A/(A+B+C)
  - Look down columns Ci, Cj until the first non-Type-D row
  - h(Ci) = h(Cj) ⇔ that row is of type A
Min-Hash Signatures
- Pick k random row permutations
- Min-Hash Signature
  sig(C) = the k indexes of the first rows with a 1 in column C, one per permutation
- Similarity of signatures
  - Define: sim(sig(Ci), sig(Cj)) = fraction of permutations where the Min-Hash values agree
  - Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)
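A small sketch of Min-Hash signatures using explicit row permutations, with signature similarity computed as the fraction of agreeing positions (names are illustrative; the hash-function trick two slides below avoids materializing permutations):

```python
import random

def minhash_signatures(columns, k, n_rows, seed=0):
    """Return a k-value Min-Hash signature per column (sketch).

    `columns` maps a column name to the set of row indexes holding a 1."""
    rng = random.Random(seed)
    signatures = {c: [] for c in columns}
    for _ in range(k):
        perm = list(range(n_rows))
        rng.shuffle(perm)                        # one random row permutation
        for c, rows in columns.items():
            # position (under this permutation) of the first row with a 1
            signatures[c].append(min(perm[r] for r in rows))
    return signatures

def sig_similarity(sig_i, sig_j):
    """Fraction of permutations where the Min-Hash values agree."""
    return sum(a == b for a, b in zip(sig_i, sig_j)) / len(sig_i)
```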
Example

       C1  C2  C3
  R1    1   0   1
  R2    0   1   1
  R3    1   0   0
  R4    1   0   1
  R5    0   1   0

Signatures
                       S1  S2  S3
  Perm 1 = (12345):     1   2   1
  Perm 2 = (54321):     4   5   4
  Perm 3 = (34512):     3   5   4

Similarities
            1-2   1-3   2-3
  Col-Col  0.00  0.50  0.25
  Sig-Sig  0.00  0.67  0.00
Implementation Trick
- Permuting the rows even once is prohibitive
- Row Hashing
  - Pick k hash functions h_k: {1,…,n} → {1,…,O(n)}
  - The ordering under h_k gives a random row permutation
- One-pass implementation
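A sketch of the one-pass row-hashing implementation: each of the k hash functions induces an ordering of the rows, so per column and per hash function we only keep the minimum hash value seen so far. Salted Python hashing stands in for the k hash functions; names are illustrative:

```python
import random

def minhash_one_pass(row_stream, k, seed=0):
    """One-pass Min-Hash signatures via row hashing (sketch).

    `row_stream` yields (row_index, columns_with_a_1) pairs."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]    # k hash functions
    sigs = {}                                          # column -> k running minima
    for r, cols in row_stream:
        hashes = [hash((salt, r)) for salt in salts]   # h_1(r), ..., h_k(r)
        for c in cols:
            sig = sigs.setdefault(c, [float("inf")] * k)
            for i, h in enumerate(hashes):
                if h < sig[i]:
                    sig[i] = h                         # new minimum for h_i
    return sigs
```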
Comparing Signatures
- Signature Matrix S
  - Rows = hash functions
  - Columns = columns
  - Entries = signatures
- Need – pairwise similarity of signature columns
- Problem
  - MinHash fits the column signatures in memory
  - But comparing all signature pairs takes too much time
- Limiting candidate pairs – Locality-Sensitive Hashing
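Locality-Sensitive Hashing is only named here; as one illustration of how it limits candidate pairs, the standard banding construction splits each Min-Hash signature into bands, hashes each band to a bucket, and keeps only column pairs that collide in at least one band. This is a hedged sketch of that generic scheme, not necessarily the specific variant intended in the talk:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands, rows_per_band):
    """Candidate column pairs via banded LSH over Min-Hash signatures (sketch)."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for col, sig in signatures.items():
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[band].append(col)          # columns colliding in this band
        for cols in buckets.values():
            candidates.update(combinations(sorted(cols), 2))
    return candidates
```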
Summary
- New algorithmic paradigms are needed for streams and massive data sets
- Negative results abound
- Need to approximate
- Power of randomization
Thank You!
References
Rajeev Motwani (http://theory.stanford.edu/~rajeev)
STREAM Project (http://www-db.stanford.edu/stream)
- STREAM: The Stanford Stream Data Manager. Bulletin of the Technical Committee on Data Engineering, 2003.
- Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. CIDR 2003.
- Babcock-Babu-Datar-Motwani-Widom. Models and Issues in Data Stream Systems. PODS 2002.
- Manku-Motwani. Approximate Frequency Counts over Streaming Data. VLDB 2003.
- Babcock-Datar-Motwani-O'Callaghan. Maintaining Variance and K-Medians over Data Stream Windows. PODS 2003.
- Guha-Meyerson-Mishra-Motwani-O'Callaghan. Clustering Data Streams: Theory and Practice. IEEE TKDE 2003.
References (contd)
- Datar-Gionis-Indyk-Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing, 2002.
- Babcock-Datar-Motwani. Sampling From a Moving Window Over Streaming Data. SODA 2002.
- O'Callaghan-Guha-Mishra-Meyerson-Motwani. High-Performance Clustering of Streams and Large Data Sets. ICDE 2003.
- Guha-Mishra-Motwani-O'Callaghan. Clustering Data Streams. FOCS 2000.
- Cohen et al. Finding Interesting Associations without Support Pruning. ICDE 2000.
- Charikar-Chaudhuri-Motwani-Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000.
- Gionis-Indyk-Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.
- Indyk-Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.