Identifying Representative Trends In Massive Time Series Data Sets

Applications of Sketch Based
Techniques to Data Mining Problems
Nikos Koudas
AT&T Labs Research
joint work with:
G. Cormode, P. Indyk, S. Muthukrishnan
DIMACS Summer School
Taming Massive Data Sets
• Requirements of data mining algorithms
– operate on very large data sets
– scalability
– incremental
• Most data mining algorithms have super-linear
complexity
• Deploying mining algorithms directly on very large data
sets will most likely result in unacceptably poor
performance
Sketching
• Reduce the dimensionality of the data in a
systematic way, constructing short “data
summaries”
• Effectively reduce data volume.
• Deploy mining algorithms on the reduced data
volume.
• Main issue: preserve, in the reduced data volume, the data
properties the mining algorithm is concerned with
(e.g., distances).
Applications
• Clustering time series data
• Clustering tabular data sets
Introduction
• Time series data abound in many applications.
– Financial, performance data, geographical and
meteorological information, solar and space data etc.
• Various works deal with management and
analysis aspects of time series data:
– Indexing, storage and retrieval
– Analysis and mining (forecasting, outlier and
deviation detection, etc)
• Active research area in many research
communities.
Representative Trends
Relaxed period
Average trend
Usage Examples
• Mining/Prediction
– Identifying ‘periodic trends’
– Uncovering unexpected ‘periodic trends’
• Performance management
– Networking (routing, traffic engineering, bandwidth
allocation)
– System Tuning
• Financial databases
– Cyclic behavior
Definitions
• Given a time series V of length n and an integer T, define
$V(T) = \{ (V[iT+1], V[iT+2], \ldots, V[iT+T]) \}$, for $0 \le i \le n/T-1$
• Define, writing $v_i$ for the i-th vector of V(T):
– $C_i(V(T)) = \sum_{0 \le j \le n/T-1} D(v_i, v_j)$
• Thus:
– Relaxed Period: $\min_{T} C_0(V(T))$, for T in [l,u]
– Average Trend: $\min_{T,i} C_i(V(T))$, for T in [l,u]
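The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes D is Euclidean distance, uses 0-based indexing, and drops trailing points that do not fill a whole vector. The function names (`split`, `distance`, `cost`, `relaxed_period`, `average_trend`) are hypothetical.

```python
import math

def split(V, T):
    """Cut V into consecutive, non-overlapping vectors of length T."""
    return [V[s:s + T] for s in range(0, len(V) - T + 1, T)]

def distance(a, b):
    """Euclidean distance; the choice of D is an assumption here."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cost(vectors, i):
    """C_i(V(T)): total distance from vector i to every vector (its own term is 0)."""
    return sum(distance(vectors[i], v) for v in vectors)

def relaxed_period(V, l, u):
    """T in [l, u] minimizing C_0(V(T))."""
    return min(range(l, u + 1), key=lambda T: cost(split(V, T), 0))

def average_trend(V, l, u):
    """(T, i) minimizing C_i(V(T)) over all T in [l, u] and all vectors i."""
    candidates = (
        (cost(vectors, i), T, i)
        for T in range(l, u + 1)
        for vectors in [split(V, T)]
        for i in range(len(vectors))
    )
    _, T, i = min(candidates)
    return T, i

V = [1, 2, 3, 4] * 6          # a series that repeats exactly every 4 points
print(relaxed_period(V, 3, 5))
print(average_trend(V, 3, 5))
```

Because the example series repeats exactly every 4 points, both functions select T = 4. Note that costs for different T sum over different numbers of vectors; the definition is used here unchanged.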
Definitions
• Each time series has n points.
• Relaxed period: n/T vectors for each T, T in [l,u].
• Average trend: n/T vectors for each T, each of them a candidate average trend, T in [l,u].
Algorithms
• There exists a quadratic algorithm for identifying
relaxed periods.
• There exists a cubic algorithm for identifying
average trends.
• Simply evaluate the clustering for each T in [l,u]:
for each T it takes linear time to evaluate the relaxed
period and quadratic time to evaluate average
trends.
Algorithms
• But can we really run these?
– Consider the length of sessions in an AT&T service, recorded
every second for a year: that is more than 31M values
and approximately 256MB.
– Consider running the previous algorithms for say 10
years or on a finer time scale.
• Both brute force algorithms are impractical.
• Can we run faster on large datasets, too large to
be in memory?
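A quick back-of-the-envelope check of these figures. The 8 bytes per value is an assumption (the slide does not state the encoding), and the quadratic operation count is only an order-of-magnitude illustration for the case where the range [l,u] is comparable to n.

```python
SECONDS_PER_DAY = 24 * 60 * 60
values_per_year = 365 * SECONDS_PER_DAY        # one value per second for a year

BYTES_PER_VALUE = 8                             # assumption: not stated on the slide
size_mb = values_per_year * BYTES_PER_VALUE / 1e6

# Rough work for a quadratic algorithm when the period range is comparable to n.
quadratic_ops = values_per_year ** 2

print(values_per_year, round(size_mb), f"{quadratic_ops:.1e}")
```

Under these assumptions one year is about 31.5 million values and roughly 250 MB, in line with the slide, while a quadratic pass is on the order of 10^15 basic operations.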
Our Approach
• Identify representative trends faster but provide
approximate answers:
– General approach (expresses various notions of representative
trends)
– Provides guaranteed approximation performance, with high
probability
• We present our approach in the following steps:
– Define the sketch of a vector
– Algorithms for finding the sketch of all sub-vectors of width T
– Determine the sketch of all sub-vectors of width in a given
range.
Sketch of a vector
• Given a vector t of length l, we generate its
sketch S(t) as follows:
• Pick a random vector u of length l, by picking
each component u[i] from a normal distribution
N(0,1) (normalized to 1).
• Define:
– $S(t)[i] = t \cdot u = \sum_j t[j]\,u[j]$, with a fresh random vector u drawn for each sketch component i
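A minimal sketch of this construction in Python. It assumes that "normalized to 1" means scaling each random vector to unit length and that the sketch has k components, one per independently drawn vector. The names `random_unit_vector` and `make_sketcher` are hypothetical.

```python
import math
import random

def random_unit_vector(length, rng):
    """Draw each component from N(0, 1), then scale the vector to unit length."""
    u = [rng.gauss(0.0, 1.0) for _ in range(length)]
    norm = math.sqrt(sum(x * x for x in u))
    return [x / norm for x in u]

def make_sketcher(length, k, seed=0):
    """Return a function mapping a vector of the given length to its k-entry sketch."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(length, rng) for _ in range(k)]

    def sketch(t):
        if len(t) != length:
            raise ValueError("vector has the wrong length")
        # Component i is the dot product of t with the i-th random vector.
        return [sum(tj * uj for tj, uj in zip(t, u)) for u in vectors]

    return sketch

sketch = make_sketcher(length=4, k=3)
s = sketch([2, 1, 3, 1])
print(len(s))   # one entry per random vector
```

The same sketcher (that is, the same random vectors) has to be reused for every vector whose sketches will later be compared.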
Sketch Properties
• Theorem
– For any given set L of vectors of length l and
a fixed $\epsilon < 1/2$, if $k = 9 \log|L| / \epsilon^2$, then
for any pair of vectors u, w in L we have
– $(1-\epsilon)\,\|u-w\|^2 < \|S(u)-S(w)\|^2 < (1+\epsilon)\,\|u-w\|^2$ with
probability 1/2.
• By increasing k we can increase the probability
of success
• This is the Johnson-Lindenstrauss (JL) lemma.
Fixed window sketches
• Compute all sketches of sub-vectors of length l
in a sequence of length n.
• There are n-l+1 such sub-vectors.
• A straightforward application of JL would require
O(nlk) time, since there are O(n) sub-vectors,
each of length l, and the sketch is of size k. This is not
practical.
Key Observation
• We can compute ALL such sketches fast by
using the fast Fourier transform (FFT).
• The problem of computing sketches of all sub-vectors of length l simultaneously is exactly the
problem of computing the convolution of two
vectors t and u.
• Given two vectors A[1…a] and B[1…b], their
convolution is C[1…a+b], where
– $C[k] = \sum_{1 \le i \le b} A[k-i]\,B[i]$ for $2 \le k \le a+b$ (terms with k-i outside 1…a are taken as zero)
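The observation above can be illustrated as follows. This is a sketch under stated assumptions, not the authors' code: it uses a small radix-2 FFT written from scratch so the block has no dependencies (a production version would more likely call an FFT library), 0-based indexing, and hypothetical names `fft`, `convolve` and `sliding_dot_products`.

```python
import cmath

def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two. The inverse is unscaled."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def convolve(a, b):
    """Linear convolution of two real sequences via the FFT."""
    out_len = len(a) + len(b) - 1
    size = 1
    while size < out_len:
        size *= 2
    fa = fft([complex(x) for x in a] + [0j] * (size - len(a)))
    fb = fft([complex(x) for x in b] + [0j] * (size - len(b)))
    product = fft([x * y for x, y in zip(fa, fb)], inverse=True)
    return [(v / size).real for v in product[:out_len]]

def sliding_dot_products(t, u):
    """Dot product of u with every window of t of length len(u), using one convolution."""
    full = convolve(t, u[::-1])        # reversing u turns convolution into correlation
    return full[len(u) - 1:len(t)]     # keep only positions where u fully overlaps t

t = [2.0, 1.0, 3.0, 1.0, 2.0, 3.0, 2.0, 1.0]
u = [0.5, -1.0, 0.25]
fast = sliding_dot_products(t, u)
direct = [sum(t[s + q] * u[q] for q in range(len(u)))
          for s in range(len(t) - len(u) + 1)]
print(all(abs(f - d) < 1e-9 for f, d in zip(fast, direct)))
```

One call to `sliding_dot_products` yields one sketch component for every window; repeating it with k different random vectors gives the full sketches of all windows.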
Example
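• Series: (2, 1, 3, 1), window length 2, sketch size 2.
• Convolution with (-0.97,-0.2) = (-0.4,-2.14,-1.57,-3.11,-0.97); the middle entries are S1[0] = -2.14, S2[0] = -1.57, S3[0] = -3.11.
• Convolution with (0.11,0.99) = (1.98,1.21,3.08,1.32,0.11); the middle entries are S1[1] = 1.21, S2[1] = 3.08, S3[1] = 1.32.
• The listed values correspond to convolving with each random vector taken in reverse order, so that each middle entry is the dot product of one window with the random vector.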
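The numbers in this example can be reproduced with a direct (non-FFT) convolution. The snippet below is only a check of the arithmetic; `convolve` is a hypothetical helper using 0-based indexing, and reversing each random vector before convolving is the reading under which the slide's values come out.

```python
def convolve(a, b):
    """Direct linear convolution: c[m] is the sum of a[m - i] * b[i] (0-based)."""
    c = [0.0] * (len(a) + len(b) - 1)
    for p, x in enumerate(a):
        for q, y in enumerate(b):
            c[p + q] += x * y
    return c

t = [2, 1, 3, 1]
u1 = [-0.97, -0.2]     # random vector for sketch component 0 (from the slide)
u2 = [0.11, 0.99]      # random vector for sketch component 1 (from the slide)

for u in (u1, u2):
    full = convolve(t, u[::-1])              # convolve with u in reverse order
    windows = full[len(u) - 1:len(t)]        # entries where u fully overlaps t
    print([round(v, 2) for v in full], [round(v, 2) for v in windows])
```

The three middle entries of each result are the sketch components of the three length-2 windows (2,1), (1,3) and (3,1).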
Computing all sketches of width in a given
range
• Compute all sketches of all sub-vectors of length
between l and u.
• Brute force is cubic and prohibitive
• Applying our observation would be quadratic and
still prohibitive
• Can we compute all sketches of width in a given
range faster?
Approach
• We will construct a pool of sketches that we
pre-compute and store. After this
preprocessing we can determine
the sketch of any sub-vector in O(1) time, fairly
accurately.
• Pick an L with l <= L <= u and construct all
sketches of length L as before, using
convolutions; this is $O(n \log^2 n)$ in the worst
case. Assume for now that L is a power of 2.
We actually construct two such pools, S1 and S2.
Approach
• Consider any sub-vector t[i,…,i+j-1]. We have
two cases:
– j is a power of 2 (L): the sketch is
in the pool and we can look it up in O(1)
– $2^r < j < 2^{r+1}$: we can compute the
sketch as follows:
• $S(t[i,\ldots,i+j-1])[m] = S_1(t[i,\ldots,i+2^r-1])[m] + S_2(t[i+j-2^r,\ldots,i+j-1])[m]$ for each sketch component m; both terms belong to the pool
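The second case can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the pools are built with direct dot products for clarity (the slides build them with convolutions), the random vectors are left unnormalized, indexing is 0-based, and `window_sketches` and `compose_sketch` are hypothetical names.

```python
import random

def window_sketches(t, vectors):
    """Sketch every window of t whose length equals len(vectors[0])."""
    p = len(vectors[0])
    return [[sum(t[s + q] * u[q] for q in range(p)) for u in vectors]
            for s in range(len(t) - p + 1)]

def compose_sketch(pool1, pool2, p, i, j):
    """Approximate sketch of t[i:i+j] for p < j < 2p, from two windows of length p."""
    if not p < j < 2 * p:
        raise ValueError("need p < j < 2p")
    first = pool1[i]             # sketch of t[i : i+p], taken from pool S1
    last = pool2[i + j - p]      # sketch of t[i+j-p : i+j], taken from pool S2
    return [a + b for a, b in zip(first, last)]

rng = random.Random(0)
p, k = 4, 3                      # window length 2**r and sketch size
vectors1 = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(k)]
vectors2 = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(k)]

t = [2, 1, 3, 1, 2, 3, 2, 1]
S1 = window_sketches(t, vectors1)
S2 = window_sketches(t, vectors2)
print(compose_sketch(S1, S2, p, i=1, j=6))     # approximate sketch of t[1:7]
```

The two windows overlap in the middle of the sub-vector, which is why the result is an approximation; each lookup is O(1) once the pools exist.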
Example
• Series: 2 1 3 1 2 3 2 1
• The sketch of a sub-vector U combines one sketch from pool S1 and one from pool S2: S(U) = S1_0 + S2_1
Why is this enough?
• Theorem
– For any given set L of vectors of length l and a fixed
$\epsilon < 1/2$, if $k = 9 \log|L| / \epsilon^2$, then for any pair of vectors
u, w in L
• $(1-\epsilon)\,\|u-w\|^2 \le \|S(u)-S(w)\|^2 \le 2(1+\epsilon)\,\|u-w\|^2$ with
probability 1/2
Putting it all together
• Given V and a range [l,u]
– Relaxed period:
• Compute sketches in time $O(n \log(u-l)\, k \log u)$
• Consider every T in [l,u] and compute $C_0(V(T))$ for every
T.
• Choosing k as described guarantees that we are within a factor of
$2+\epsilon$ of the true relaxed period
– Average trends:
• Proceed similarly, evaluating $C_i(V(T))$ for every i
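A simplified sketch of the relaxed-period step, evaluating C_0 on sketches instead of raw vectors. It is not the pipeline described in the slides: sketches are computed directly for each T rather than looked up in a precomputed pool, D is assumed to be Euclidean distance, and `sketch_cost` and `approximate_relaxed_period` are hypothetical names.

```python
import math
import random

def sketch_cost(series, T, k, rng):
    """C_0(V(T)) measured between k-entry sketches instead of raw vectors."""
    us = [[rng.gauss(0.0, 1.0) for _ in range(T)] for _ in range(k)]
    vectors = [series[s:s + T] for s in range(0, len(series) - T + 1, T)]
    sketches = [[sum(x * y for x, y in zip(v, u)) for u in us] for v in vectors]
    first = sketches[0]
    return sum(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(first, s)))
        for s in sketches[1:]
    )

def approximate_relaxed_period(series, l, u, k=8, seed=0):
    """T in [l, u] with the smallest sketch-space cost."""
    rng = random.Random(seed)
    costs = {T: sketch_cost(series, T, k, rng) for T in range(l, u + 1)}
    return min(costs, key=costs.get)

series = [1, 2, 3, 4] * 6        # repeats exactly every 4 points
print(approximate_relaxed_period(series, 3, 5))
```

Identical vectors always receive identical sketches, so an exact period gives a sketch-space cost of 0; for noisy data the answer is approximate, as stated in the guarantee above.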
Implementation Issues
• Computing Sketches
– The pool of sketches can be computed with a single
pass over the data set. We only need to keep a
window worth of data across successive sketch
computations.
• Retrieving Sketches
– Required sketches are retrieved by performing
random I/O. However, across successive evaluations
for various values of T, the required sketches are related,
so random I/O can be limited by prefetching.
Experimental Evaluation
• Real data from a service AT&T provides
(utilization information).
• Size varying from 16MB (approx. 1 month) to
256MB (approx 1 year) worth of data.
• Evaluated:
– Time to construct sketches
– Scalability of sketch construction
– Efficiency of the proposed convolution based technique
– Time to compute relaxed period and average trends
• computing sketches from scratch and with pre-computed sketches
• Using brute force approaches
– Accuracy of sketching
– Comparison with other time series reduction techniques
Time to construct sketches
[Figure: time (seconds), broken down into Processor, Write and Read, versus sketch window size (4K to 128K); panels labeled log(n) and 2*log(n).]
Time to construct sketches
[Figure: time (minutes), broken down into Processor, Write and Read, versus data size (16M to 256M); panels labeled log(n) and 2*log(n).]
Time to construct without convolution
[Figure: time (minutes), broken down into Processor, Write and Read, versus window size (4K to 128K); panels labeled log(n) and 2*log(n).]
Time to construct sketches without convolution
[Figure: time (minutes), broken down into Processor, Write and Read, versus data size (16M to 256M).]
Computing relaxed periods
[Figure: time (minutes), labeled Construct/Check, versus relaxed period range (0.5K to 64K).]
Computing Relaxed Periods
[Figure: time (minutes), labeled Construct/Check, versus data size (16MB to 256MB); panels labeled log(n) and 2*log(n).]
Computing relaxed period with precomputed sketches
[Figure: check time (seconds) versus relaxed period size (0.5K to 64K).]
Computing Relaxed Periods Without Precomputed Sketches
[Figure: check time (seconds) versus data size (16MB to 256MB).]
Brute Force Algorithms
[Figure: total time (minutes) versus relaxed period size (0.5K to 64K).]
Brute Force Algorithms
[Figure: total time (minutes) versus data size (16MB to 256MB).]
Computing Average Trend
[Figure: time (seconds) versus relaxed period size (0.5K to 64K).]
Computing Average Trend
[Figure: time (minutes) versus data size (16MB to 256MB).]
Accuracy of Sketches
[Figure: absolute relative error versus sketch size (5 to 320), for Data Set 1 and Data Set 2.]
Clustering Tabular Data
• Many applications produce data in two
dimensional array form.
• Consider traditional telecommunication
applications:
– Data are collected from a variety of collection
stations across the country, recording call volume at
some temporal granularity.
– 2d call volume data set (spatial ordering of collection
stations versus time) recording temporal call activity,
approx. 18MB/day.
Clustering tabular data
• Data elements to be clustered are rectangular
data regions.
• Clustering might reveal interesting similarities
(in call volume and time) between geographical
regions.
• One month ~ 600MB of data.
• Sketch rectangular regions:
– extend sketches in 2d
– sketching with respect to any Lp norm, p in (0, 2]
Summary of results
• Sketch construction scales nicely with respect to data volume and
sketch size.
• Convolution based sketch computation is very effective.
• Sketch based approach is orders of magnitude better than brute
force for computing relaxed periods and average trends.
• Performance benefits increase for larger data sets.
• If sketches are pre-computed, clustering can be performed in
seconds even for very large data sets.
• In practice sketches of low dimensionality provide great accuracy.
• Compared with other dimensionality reduction techniques, the
sketch based approach is more accurate and effective.
Conclusions
• Scalability to large data volumes is a requirement of
the data mining process.
• Effectively reduce data volume using sketches.
• Preserve data properties required by mining
algorithms (e.g., various distances).
• These are core techniques; various algorithms could
benefit from them.
• Very large performance benefits, small loss in
accuracy.
Contact
• koudas@research.att.com
• www.research.att.com/~koudas