Applications of Sketch Based Techniques to Data Mining Problems
Nikos Koudas, AT&T Labs Research
Joint work with: G. Cormode, P. Indyk, S. Muthukrishnan
DIMACS Summer School

Taming Massive Data Sets
• Requirements of data mining algorithms
  – operate on very large data sets
  – scalability
  – incremental
• Most data mining algorithms have super-linear complexity
• Deploying mining algorithms on very large data sets will most likely result in terrible performance

Sketching
• Reduce the dimensionality of the data in a systematic way, constructing short “data summaries”
• Effectively reduce data volume.
• Deploy mining algorithms on the reduced data volume.
• Main issue: preserve, in the reduced data volume, the data properties the mining algorithm is concerned with (e.g., distances).

Applications
• Clustering time series data
• Clustering tabular data sets

Introduction
• Time series data abound in many applications.
  – Financial, performance data, geographical and meteorological information, solar and space data, etc.
• Various works deal with management and analysis aspects of time series data:
  – Indexing, storage and retrieval
  – Analysis and mining (forecasting, outlier and deviation detection, etc.)
• Active research area in many research communities.
Representative Trends
[Figure with labels: Relaxed period, Average trend]

Usage Examples
• Mining/Prediction
  – Identifying ‘periodic trends’
  – Uncovering unexpected ‘periodic trends’
• Performance management
  – Networking (routing, traffic engineering, bandwidth allocation)
  – System Tuning
• Financial databases
  – Cyclic behavior

Definitions
• Given a time series $V$ of $n$ points and an integer $T$, define $V(T) = \{v_i = (V[iT+1], V[iT+2], \dots, V[iT+T])\}$, $0 \le i \le n/T-1$
• Define:
  – $C_i(V(T)) = \sum_{1 \le j \le n/T-1} D(v_i, v_j)$, where $D$ is a distance between vectors
• Thus:
  – Relaxed Period: the $T$ in $[l,u]$ that minimizes $C_0(V(T))$
  – Average Trend: the $T$ in $[l,u]$ and the $i$ that minimize $C_i(V(T))$

Definitions
[Figure: each time series has n points and yields n/T vectors for each T in [l,u]. Relaxed period: the period T. Average trend: each of the n/T vectors, for each T in [l,u], is a candidate average trend.]

Algorithms
• There exists a quadratic algorithm for identifying relaxed periods.
• There exists a cubic algorithm for identifying average trends.
• Simply evaluate the clustering for each T in [l,u]: for each T, it takes linear time to evaluate the relaxed period and quadratic time to evaluate the average trend.

Algorithms
• But can we really run these?
  – Consider the length of sessions in an AT&T service, recorded every second for a year: more than 31M values and approximately 256MB.
  – Consider running the previous algorithms on, say, 10 years of data or on a finer time scale.
• Both brute force algorithms are impractical.
• Can we run faster on large data sets that are too large to fit in memory?
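As an illustration of the brute force evaluation described on the Definitions and Algorithms slides, the following Python sketch splits a series into V(T) and evaluates C_0(V(T)) for every T in [l,u]. It is a minimal sketch under assumptions the slides do not state: D is taken to be the Euclidean distance, trailing points that do not fill a whole period are dropped, indexing is zero-based, and all function names are hypothetical.

```python
import numpy as np


def split_into_periods(v, T):
    """Return V(T): the n // T consecutive sub-vectors of length T.

    Assumption: trailing points that do not fill a whole period are dropped.
    """
    m = len(v) // T
    return np.asarray(v[:m * T], dtype=float).reshape(m, T)


def relaxed_period_cost(v, T):
    """C_0(V(T)): summed distance from the first sub-vector to every later one.

    Assumption: D is the Euclidean distance (the slides leave D unspecified).
    """
    vecs = split_into_periods(v, T)
    return float(np.linalg.norm(vecs[1:] - vecs[0], axis=1).sum())


def brute_force_relaxed_period(v, lo, hi):
    """Evaluate C_0(V(T)) for every T in [lo, hi] and return the best T and all costs."""
    if hi > len(v) // 2:
        # With fewer than two sub-vectors the cost would be trivially zero.
        raise ValueError("hi must leave at least two sub-vectors")
    costs = {T: relaxed_period_cost(v, T) for T in range(lo, hi + 1)}
    best_T = min(costs, key=costs.get)
    return best_T, costs


# A series that repeats exactly every 4 points.
series = np.tile([1.0, 5.0, 2.0, 7.0], 6)
best_T, costs = brute_force_relaxed_period(series, 2, 5)
```

Each T costs O(n) work here, so scanning a range of T values whose width is proportional to n is quadratic overall, which is the cost the slides describe as impractical for a series of 31M values.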
Our Approach
• Identify representative trends faster, but provide approximate answers:
  – General approach (expresses various notions of representative trends)
  – Provides guaranteed approximation performance, with high probability
• We present our approach in the following steps:
  – Define the sketch of a vector
  – Algorithms for finding the sketch of all sub-vectors of width T
  – Determine the sketch of all sub-vectors of width in a given range.

Sketch of a vector
• Given a vector $t$ of length $l$, we generate its sketch $S(t)$, a vector of $k$ components, as follows:
• For each component $i$, pick a random vector $u_i$ of length $l$ by drawing each entry $u_i[j]$ from a normal distribution $N(0,1)$ (then normalize $u_i$ to length 1).
• Define:
  – $S(t)[i] = t \cdot u_i = \sum_j t[j]\, u_i[j]$

Sketch Properties
• Theorem
  – For any given set $L$ of vectors of length $l$ and a fixed $\epsilon < 1/2$, if $k = 9 \log|L| / \epsilon^2$, then for any pair of vectors $u, w$ in $L$ we have
  – $(1-\epsilon)\,\|u-w\|^2 < \|S(u)-S(w)\|^2 < (1+\epsilon)\,\|u-w\|^2$ with probability 1/2.
• By increasing $k$ we can increase the probability of success.
• This is the Johnson-Lindenstrauss (JL) lemma.

Fixed window sketches
• Compute all sketches of sub-vectors of length $l$ in a sequence of length $n$.
• There are $n-l+1$ such sub-vectors.
• A straightforward application of JL would require $O(nlk)$ time, since there are $O(n)$ sub-vectors, each of length $l$, and the sketch is of size $k$. This is not practical.

Key Observation
• We can compute ALL such sketches fast by using the Fast Fourier Transform (FFT).
• The problem of computing the sketches of all sub-vectors of length $l$ simultaneously is exactly the problem of computing the convolution of two vectors, $t$ and $u$.
• Given two vectors $A[1 \dots a]$ and $B[1 \dots b]$, their convolution is $C[1 \dots a+b]$, where
  – $C[k] = \sum_{1 \le i \le b} A[k-i]\, B[i]$ for $2 \le k \le a+b$ (entries of $A$ outside $1 \dots a$ count as 0)

Example
• Series: (2, 1, 3, 1), window length 2
• Convolution with (-0.97, -0.2) = (-0.4, -2.14, -1.57, -3.11, -0.97); the middle entries -2.14, -1.57, -3.11 are S1[0], S2[0], S3[0]
• Convolution with (0.11, 0.99) = (1.98, 1.21, 3.08, 1.32, 0.11); the middle entries 1.21, 3.08, 1.32 are S1[1], S2[1], S3[1]

Computing all sketches of width in a given range
• Compute all sketches of all sub-vectors of length between l and u.
• Brute force is cubic and prohibitive.
• Applying our observation would be quadratic and still prohibitive.
• Can we compute all sketches of width in a given range faster?

Approach
• We construct a pool of sketches that we pre-compute and store. After this preprocessing we can determine the sketch of any sub-vector in $O(1)$, fairly accurately.
• Pick an $L$ with $l \le L \le u$ and construct all sketches of length $L$ as before using convolutions; this is $O(n \log^2 n)$ in the worst case. Assume for now that $L$ is a power of 2. We actually construct two such pools, $S_1$ and $S_2$.

Approach
• Consider any vector $t[i, \dots, i+j-1]$; we have two cases:
  – $j$ is some power of 2 ($L$): the sketch is in the pool and we can look it up in $O(1)$
  – $2^r < j < 2^{r+1}$: we can compute the sketch as follows:
    • $S(t[i,\dots,i+j-1])[m] = S_1(t[i,\dots,i+2^r-1])[m] + S_2(t[i+j-2^r,\dots,i+j-1])[m]$ for each sketch component $m$; both terms belong to the pool

Example
[Figure: series 2 1 3 1 2 3 2 1 with a sub-vector U covered by one window from pool S1 and one window from pool S2; S(U) = S1_0 + S2_1]

Why is this enough?
• Theorem
  – For any given set $L$ of vectors of length $l$ and a fixed $\epsilon < 1/2$, if $k = 9 \log|L| / \epsilon^2$, then for any pair of vectors $u, w$ in $L$
    • $(1-\epsilon)\,\|u-w\|^2 \le \|S(u)-S(w)\|^2 \le 2\,(1+\epsilon)\,\|u-w\|^2$ with probability 1/2

Putting it all together
• Given $V$ and a range $[l,u]$:
  – Relaxed period:
    • Compute sketches in time $O(n \log(u-l)\, k \log u)$
    • Consider every $T$ in $[l,u]$ and compute $C_0(V(T))$ for every $T$.
    • Choosing $k$ as described guarantees that we are within a factor of at most $2+\epsilon$ of the true relaxed period
  – Average trends:
    • Proceed similarly by evaluating $C_i(V(T))$ for every $i$

Implementation Issues
• Computing sketches
  – The pool of sketches can be computed with a single pass over the data set. We only need to keep a window's worth of data across successive sketch computations.
• Retrieving sketches
  – Required sketches are retrieved by performing random I/O. However, across successive evaluations for various values of T, the required sketches are related, so random I/O can be limited by prefetching.

Experimental Evaluation
• Real data from a service AT&T provides (utilization information).
• Size varying from 16MB (approx. 1 month) to 256MB (approx. 1 year) worth of data.
• Evaluated:
  – Time to construct sketches
  – Scalability of sketch construction
  – Efficiency of the proposed convolution based technique
  – Time to compute relaxed period and average trends
    • computing sketches from scratch and with pre-computed sketches
    • using brute force approaches
  – Accuracy of sketching
  – Comparison with other time series reduction techniques

Time to construct sketches
[Figure: Time (seconds) vs. Sketch Window Size (4K to 128K); panels log(n) and 2*log(n); series: Processor, Write, Read]

Time to construct sketches
[Figure: Time (minutes) vs. Data Size (16M to 256M); panels log(n) and 2*log(n); series: Processor, Write, Read]

Time to construct without convolution
[Figure: Time (minutes) vs. Window Size (4K to 128K); panels log(n) and 2*log(n); series: Processor, Write, Read]

Time to construct sketches without convolution
[Figure: Time (minutes) vs. Data Size (16M to 256M); series: Processor, Write, Read]

Computing relaxed periods
[Figure: Time (minutes) vs. Relaxed Period Range (0.5K to 64K), two panels; series: Construct/Check]

Computing Relaxed Periods
[Figure: Time (minutes) vs. Data Size (16MB to 256MB); panels log(n) and 2*log(n); series: Construct/Check]

Computing relaxed period with precomputed sketches
[Figure: Time (seconds) vs. Relaxed Period Size (0.5K to 64K); series: Check]

Computing Relaxed Periods Without Precomputed Sketches
[Figure: Time (seconds) vs. Data Size (16MB to 256MB); series: Check]

Brute Force Algorithms
[Figure: Time (minutes) vs. Relaxed Period Size (0.5K to 64K); series: Total Time]

Brute Force Algorithms
[Figure: Time (minutes) vs. Data Size (16MB to 256MB); series: Total Time]

Computing Average Trend
[Figure: Time (seconds) vs. Relaxed Period Size (0.5K to 64K)]

Computing Average Trend
[Figure: Time (minutes) vs. Data Size (16MB to 256MB)]

Accuracy of Sketches
[Figure: Absolute Relative Error vs. Sketch Size (5 to 320); panels Data Set 1 and Data Set 2]

Clustering Tabular Data
• Many applications produce data in two-dimensional array form.
• Consider traditional telecommunication applications:
  – Data are collected from a variety of collection stations across the country, recording call volume at some temporal granularity.
  – 2D call volume data set (spatial ordering of collection stations versus time) recording temporal call activity, approx. 18MB/day.

Clustering tabular data
• Data elements to be clustered are rectangular data regions.
• Clustering might reveal interesting similarities (in call volume and time) between geographical regions.
• One month ~ 600MB of data.
• Sketch rectangular regions:
  – extend sketches to 2D
  – sketching with respect to any $L_p$ norm, $p$ in $(0, 2]$

Summary of results
• Sketch construction scales nicely with respect to data volume and sketch size.
• Convolution based sketch computation is very effective.
• Sketch based approach is orders of magnitude better than brute force for computing relaxed periods and average trends.
• Performance benefits increase for larger data sets.
• If sketches are pre-computed, clustering can be performed in seconds even for very large data sets.
• In practice sketches of low dimensionality provide great accuracy.
• Compared with other dimensionality reduction techniques, the sketch based approach is more accurate and effective.

Conclusions
• Scalability to large data volumes is a requirement of the data mining process.
• Effectively reduce data volume using sketches.
• Preserve data properties required by mining algorithms (e.g., various distances).
• These are core techniques; various algorithms could benefit from them.
• Very large performance benefits, small loss in accuracy.

Contact
• koudas@research.att.com
• www.research.att.com/~koudas
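As a supplementary illustration of the "Sketch of a vector" and "Key Observation" slides, the Python sketch below computes the sketches of all length-l windows with one FFT-based convolution per random vector, and checks the result against the direct O(nlk) computation and against the numbers on the Example slide. The function names, the use of NumPy's real FFT, and the zero-based indexing are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def random_unit_vectors(k, l, seed=0):
    """k random vectors of length l with N(0,1) entries, each normalized to length 1."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((k, l))
    return u / np.linalg.norm(u, axis=1, keepdims=True)


def sliding_sketches_direct(t, u):
    """Baseline, O(n*l*k): dot every length-l window of t with every row of u."""
    t = np.asarray(t, dtype=float)
    u = np.asarray(u, dtype=float)
    k, l = u.shape
    n = len(t)
    return np.array([u @ t[s:s + l] for s in range(n - l + 1)])


def sliding_sketches_fft(t, u):
    """Same result via FFT convolution, O(k * n log n).

    Row s of the output is the k-component sketch of the window t[s:s+l].
    """
    t = np.asarray(t, dtype=float)
    u = np.asarray(u, dtype=float)
    k, l = u.shape
    n = len(t)
    size = n + l - 1                              # length of the full linear convolution
    ft = np.fft.rfft(t, n=size)                   # zero-padded transform of the series
    fu = np.fft.rfft(u[:, ::-1], n=size, axis=1)  # reversed rows turn convolution into sliding dot products
    full = np.fft.irfft(ft * fu, n=size, axis=1)  # row i: convolution of t with reversed u_i
    return full[:, l - 1:n].T                     # keep only fully overlapping windows


# Numbers from the Example slide: series (2, 1, 3, 1), window length 2.
example_t = [2, 1, 3, 1]
example_u = [[-0.97, -0.2], [0.11, 0.99]]
example = sliding_sketches_fft(example_t, example_u)
assert np.allclose(example, [[-2.14, 1.21], [-1.57, 3.08], [-3.11, 1.32]])
```

The slice `full[:, l - 1:n]` drops the first and last l-1 convolution entries, which correspond to windows that only partially overlap the series (the -0.4 and -0.97 entries in the slide's first convolution).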