Carnegie Mellon
Data Mining on Streams
Christos Faloutsos
CMU
DB/IR '06

THANK YOU!
• Prof. Panos Ipeirotis
• Julia Mills

Outline
• Problem and motivation
• Single-sequence mining: AWSOM
• Co-evolving sequences: SPIRIT
• Lag correlations: BRAID
• Conclusions

Problem definition - example
Each sensor collects data (x1, x2, …, xt, …)

Problem definition
• Given: one or more sequences x1, x2, …, xt, … (y1, y2, …, yt, …)
• Find
  – patterns; correlations; outliers
  – incrementally!

Limitations / Challenges
Find patterns using a method that is
• nimble: limited resources
  – Memory
  – Bandwidth, power, CPU
• incremental: on-line, ‘any-time’ response
  – single pass (‘you get to see it only once’)
• automatic: no human intervention
  – e.g., in remote environments

Application domains
• Sensor devices
  – Temperature, weather measurements
  – Road traffic data
  – Geological observations
  – Patient physiological data
• Embedded devices
  – Network routers
  – Intelligent (active) disks

Motivation - Applications (cont’d)
• ‘Smart house’
  – sensors monitor temperature, humidity, air quality
• video surveillance
• civil/automobile infrastructure
  – bridge vibrations [Oppenheim+02]
  – road conditions / traffic monitoring
• Weather, environment/anti-pollution
  – volcano monitoring
  – air/water pollutant monitoring
• Computer systems
  – ‘Active Disks’ (buffering, prefetching)
  – web servers (ditto)
  – network traffic monitoring
  – ...

InteMon
• with Evan Hoke, Jimeng Sun
• self-* PetaByte data center at CMU

Single sequence mining: AWSOM
• with Spiros Papadimitriou (CMU -> IBM)
• Anthony Brockwell (CMU/Stat)

Problem definition
• Semi-infinite streams of values (time series) x1, x2, …, xt, …
• Find patterns, forecasts, outliers…
[Figure: example time series, annotated “Periodicity? (daily)”, “Periodicity? (twice daily)” and “Noise”??]

Requirements / Goals
• Adapt and handle arbitrary periodic components
and
• nimble (limited resources, single pass)
• on-line, any-time
• automatic (no human intervention/tuning)

Overview
• Introduction / Related work
• Background
• Main idea
• Experimental results

Wavelets
Example – Haar transform
[Figure: Haar decomposition of xt laid out on time (horizontal) and frequency (vertical) axes: detail coefficients W1,1 … W1,4, then W2,1 and W2,2, then W3,1, and the “constant” coefficient V4,1]

Wavelets
Why we like them
• Wavelets compress many real signals well:
  – Image compression and processing
  – Vision
  – Astronomy, seismology, …
• Wavelet coefficients can be updated as new points arrive

AWSOM
[Figure: the same time/frequency grid of wavelet coefficients (W1,1 … W1,4, W2,1, W2,2, W3,1, V4,1) computed from xt]

AWSOM - idea
[Figure: within a level, the coefficients Wl,t-2 and Wl,t-1 are used to model Wl,t; likewise Wl’,t’-2 and Wl’,t’-1 for Wl’,t’]
$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$
$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$

More details…
• Update of wavelet coefficients (incremental)
• Update of linear models (incremental; RLS)
• Feature selection (single-pass)
  – Not all correlations are significant
  – Throw away the insignificant ones (“noise”)

Complexity
• Model update
  – Space: $O(\lg N + mk^2) \approx O(\lg N)$
  – Time: $O(k^2) \approx O(1)$
Where
  – N: number of points (so far)
  – k: number of regression coefficients; fixed
  – m: number of linear models; $O(\lg N)$

Results - Synthetic data
• Triangle pulse
• Mix (sine + square)
• AR captures wrong trend (or none)
• Seasonal AR estimation fails
[Figure: forecasts by AWSOM, AR and Seasonal AR]

Results - Real data
• Automobile traffic
  – Daily periodicity
  – Bursty “noise” at smaller scales
• AR fails to capture any trend
• Seasonal AR estimation fails

Results - Real data
• Sunspot intensity
  – Slightly time-varying “period”
• AR captures wrong trend
• Seasonal ARIMA
  – wrong downward trend, despite help by human!

Conclusions
Adapt and handle arbitrary periodic components
and
• nimble
  – Limited memory (logarithmic)
  – Constant-time update
• on-line, any-time
  – Single pass over the data
• automatic: No human intervention/tuning

Part 2
SPIRIT: Mining co-evolving streams
[Papadimitriou, Sun, Faloutsos, VLDB05]

Motivation
• E.g., chlorine concentration in water distribution network

Motivation
[Figure: chlorine concentrations over Phases 1 to 3 at the sensors of a water distribution network; normal operation]
May have hundreds of measurements, but it is unlikely they are completely unrelated!

Motivation
[Figure: chlorine concentrations over Phases 1 to 3 for sensors near the leak and sensors away from the leak; normal operation, then a major leak]

Motivation
[Figure: actual measurements (n streams) next to the k hidden variable(s); shown with k=1 up to Phase 1, k=2 up to Phase 2, and k=1 up to Phase 3]
We would like to discover a few “hidden (latent) variables” that summarize the key trends

Goals
• Discover “hidden” (latent) variables for:
  – Summarization of main trends for users
  – Efficient forecasting, spotting outliers/anomalies
and the usual:
• nimble: Limited memory requirements
• on-line, any-time: (single pass etc)
• automatic: No special parameters to tune

Related work
Stream mining
• Stream SVD [Guha, Gunopulos, Koudas / KDD03]
• StatStream [Zhu, Shasha / VLDB02]
• Clustering [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
• Classification [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]
• Piecewise approximations [Palpanas, Vlachos, Keogh, et al / ICDE 2004]
• Queries on streams [Dobra, Garofalakis, Gehrke, et al / SIGMOD02], [Madden, Franklin, Hellerstein, et al / OSDI02], [Considine, Li, Kollios, et al / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03]
• …

Overview
Part 2
• Method
• Experiments
• Conclusions & Other work

Stream correlations
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when we have a very large number of points?
• Step 3: How to dynamically adjust the number of hidden variables?

1. How to capture correlations?
[Figure: temperature (20°C to 30°C) over time, first for the first sensor (Temperature t1), then for the first and second sensors (Temperature t2)]

1. How to capture correlations
[Figure: scatter plot of value-pairs, Temperature t1 against Temperature t2, both axes 20°C to 30°C, with the points for time=1, time=2 and time=3 marked]
Correlations: Let’s take a closer look at the first three value-pairs…
• First three lie (almost) on a line in the space of value-pairs…
  – O(n) numbers for the slope, and
  – One number for each value-pair (offset on line)
• Other pairs also follow the same pattern: they lie (approximately) on this line

Incremental updates
[Figure: Temperature T1 against Temperature T2, both axes 20°C to 30°C, showing the error of a new point with respect to the line]
• Algorithm runs in O(n) where n = # of streams
• no need to access old data

Stream correlations
Principal Component Analysis (PCA)
• The “line” is the first principal component (PC)
• This line is optimal: it minimizes the sum of squared projection errors

2. Incremental update
Given number of hidden variables k
• Assuming k is known
• We know how to update the slope
For each new point x and for i = 1, …, k:
• $y_i := w_i^T x$ (proj. onto $w_i$)
• $d_i \leftarrow d_i + y_i^2$ (energy ∝ i-th eigenval.)
• $e_i := x - y_i w_i$ (error)
• $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)
• $x \leftarrow x - y_i w_i$ (repeat with remainder)
[Figure: new point x, its projection y1 onto w1, the error e1, and the updated w1]

3. How to dynamically adjust k, the number of hidden variables?
Answer
• When the reconstruction accuracy is too low (say, <95%)
• then introduce another hidden variable (k++)
• [How to initialize its values: tricky]

Missing values
[Figure: Temperature T1 against Temperature T2, both axes 20°C to 30°C, showing all possible value pairs (given only t1), the true values (pair), and the best guess (given correlations: intersection)]

Forecasting
• Assume we want to forecast the next value for a particular stream (e.g. auto-regression) among n streams
• Option 1: One complex model per stream
  – Next value = function of previous values on all streams
  – Captures correlations
  – Too costly! [ ~ $O(n^3)$ ]
• Option 2: One simple model per stream
  – Next value = function of previous value on same stream
  – Worse accuracy, but maybe acceptable
  – But, still need n models

Forecasting + hidden variables
• Only k simple models, one per hidden variable (k hidden vars, k << n, for n streams)
• and already capture correlations
• Efficiency & robustness

Time/space requirements
Incremental PCA
O(nk) space (total) and time (per tuple), i.e.,
• Independent of # points
• Linear w.r.t. # streams (n)
• Linear w.r.t. # hidden variables (k)
In fact,
• Can be done in real time

Experiments
Chlorine concentration [CMU Civil Engineering]
• 166 streams
• 2 hidden variables (~4% error)
[Figure: measurements and reconstruction]

Experiments
Chlorine concentration: hidden variables [CMU Civil Engineering]
• Both capture global, periodic pattern
• Second: ~ first, but phase-shifted
• Can express any phase-shift…

Experiments
Light measurements
• 54 sensors
• 2-4 hidden variables (~6% error)
[Figure: measurement and reconstruction]

Experiments
Light measurements: hidden variables
• 1 & 2: main trend (as before)
• 3 & 4: potential anomalies and outliers (intermittent)

Conclusions
SPIRIT:
• Discovers hidden variables for
  – Summarization of main trends for users
  – Efficient forecasting, spotting outliers/anomalies
• Incremental, real time computation
• nimble: With limited memory
• automatic: No special parameters to tune

Part 3: BRAID: Discovering Lag Correlations in Multiple Streams
Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos
SIGMOD’05

Lag Correlations
• Examples
  – A decrease in interest rates typically precedes an increase in house sales by a few months
  – Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later

Lag Correlations
• Example of lag-correlated sequences
  – These sequences are correlated with lag l=1300 time-ticks
  – How to compute it quickly, cheaply, incrementally?
[Figure: two lag-correlated sequences and their CCF (Cross-Correlation Function)]

Challenging Problems
• Problem definitions
  – For given two co-evolving sequences X and Y, determine
    • Whether there is a lag correlation
    • If yes, what is the lag length l
  – For given k numerical sequences, X1,…,Xk, report
    • Which pairs have a lag correlation
    • The corresponding lag for each pair

Our solution
• Ideal characteristics:
  – ‘Any-time’ processing, and fast: computation time per time tick is constant
  – Nimble: memory space requirement is sub-linear in the sequence length
  – Accurate: approximation introduces small error

Related Work
• Sequence indexing
  – Agrawal et al. (FODO 1993)
  – Faloutsos et al. (SIGMOD 1994)
  – Keogh et al. (SIGMOD 2001)
• Compression (wavelet and random projections)
  – Gilbert et al. (VLDB 2001), Guha et al. (VLDB 2004)
  – Dobra et al. (SIGMOD 2002), Ganguly et al. (SIGMOD 2003)
• Data Stream Management
  – Abadi et al. (VLDB Journal 2003)
  – Motwani et al. (CIDR 2003)
  – Chandrasekaran et al. (CIDR 2003)
  – Cranor et al. (SIGMOD 2003)
• Pattern discovery
  – Clustering for data streams: Guha et al. (TKDE 2003)
  – Monitoring multiple streams: Zhu et al. (VLDB 2002)
  – Forecasting: Yi et al. (ICDE 2000), Papadimitriou et al. (VLDB 2003)
• None of the previously published methods focuses on this problem

Overview
• Introduction / Related work
• Background
• Main ideas
• Theoretical analysis
• Experimental results

Main Idea (1)
• Incremental computation
  – Sufficient statistics:
    • Sum of X: $Sx(1,n) = \sum_{t=1}^{n} x_t$
    • Square sum of X: $Sxx(1,n) = \sum_{t=1}^{n} x_t^2$
    • Inner-product for X and the shifted Y: $Sxy(l) = \sum_{t=l+1}^{n} x_t\, y_{t-l}$
  – Compute R(l) incrementally:
    $R(l) = \dfrac{C(l)}{\sqrt{Vx(l+1,n)}\,\sqrt{Vy(1,n-l)}}$
    • Covariance of X and Y: $C(l) = Sxy(l) - \dfrac{Sx(l+1,n)\, Sy(1,n-l)}{n-l}$
    • Variance of X: $Vx(l+1,n) = Sxx(l+1,n) - \dfrac{(Sx(l+1,n))^2}{n-l}$

Main Idea (2)
• Sequence smoothing
  – Means of windows for each level
  – Sufficient statistics computed from the means
  – CCF computed from the sufficient statistics
  – But, it allows a partial redundancy
[Figure: windows per level (from h=0) against time, up to t=n; correlation against lag]

Main Idea (3)
• Geometric lag probing
  – Use colored windows
  – Keep track of only a geometric progression of the lag values: $l = \{0, 1, 2, 4, 8, \dots, 2^h, \dots\}$
  – Use a cubic spline to interpolate
[Figure: windows per level (from h=0) against time, up to t=n; correlation against lag]

Experimental results
• Setup
  – Intel Xeon 2.8GHz, 1GB memory, Linux
  – Datasets: Sines, SpikeTrains, Humidity, Light, Temperature, Kursk, Sunspots
  – Enhanced BRAID, b=16
• Evaluation
  – Estimation error of lag correlations
  – Computation time

Detecting Lag Correlations
• SpikeTrains, Humidity, Light, Kursk: BRAID closely estimates the correlation coefficients
[Figures: CCF (Cross-Correlation Function) for each of the four datasets]

Estimation Error

Dataset        Lag correlation (Naive)   Lag correlation (BRAID)   Estimation error (%)
Sines          716                       716                       0.000
SpikeTrains    2841                      2830                      0.387
Humidity       3842                      3855                      0.338
Light          567                       570                       0.529
Kursk          1463                      1472                      0.615
Sunspots       1156                      1168                      1.038

• Largest relative error is about 1%

Performance
• Almost linear w.r.t. sequence length
• Up to 40,000 times faster

Group Lag Correlations
• Two correlated pairs from 55 Temperature sequences
• Each sensor is located in a different place
[Figure: sequences #16 and #19 with the estimation of the CCF of #16 and #19; sequences #47 and #48 with the estimation of the CCF of #47 and #48]

Conclusions
Automatic lag correlation detection on stream data
• incremental
  – online, ‘any-time’
• nimble
  – O(log n) space, O(1) time to update the statistics
  – Up to 40,000 times faster than the naive implementation
• Accurate
  – Detecting the correct lag within 1% relative error or less

Overall Conclusions
• Mining streaming numerical data: challenging!
• Extensions: streaming matrix data (e.g., network traffic matrix)
[Figure: matrix data with axes time and IP-source]

Thank you
• christos <at> cs.cmu.edu
• www.cs.cmu.edu/~christos
• [InteMon demo]
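
Sketch: incremental Haar coefficients
The wavelet slides note that wavelet coefficients can be updated as new points arrive, and the AWSOM complexity slide quotes logarithmic space. The following is a minimal sketch of one way to do that for the Haar transform, not the AWSOM implementation itself. The class name `StreamingHaar`, the orthonormal scaling by √2, and the sign convention (earlier half minus later half) are assumptions made for illustration.

```python
import math


class StreamingHaar:
    """Streaming Haar transform that keeps one pending value per level.

    Hypothetical helper for illustration; memory grows as O(lg N).
    """

    def __init__(self):
        # pending[j] holds an unpaired smooth value at level j
        # (level 0 = raw samples), or None if nothing is waiting.
        self.pending = []

    def push(self, x):
        """Feed one sample; return the (level, detail) pairs it completes."""
        completed = []
        level, value = 0, x
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                # No partner yet: park the value and stop.
                self.pending[level] = value
                return completed
            earlier = self.pending[level]
            self.pending[level] = None
            # Detail (difference) coefficient one level up.
            completed.append((level + 1, (earlier - value) / math.sqrt(2)))
            # Smooth (sum) coefficient is carried to the next level.
            value = (earlier + value) / math.sqrt(2)
            level += 1
```

Each call does a constant amount of work per completed level, so the amortized cost per sample is constant. In a design along the lines of the "AWSOM - idea" slide, the detail coefficients returned by `push` would be the inputs to the per-level linear models.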
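
Sketch: SPIRIT incremental update
The "2. Incremental update" slide lists five steps per new point (project, accumulate energy, compute error, update the estimate, continue with the remainder). The following is a minimal sketch of those steps in plain Python, assuming k is fixed. The function name `spirit_update` and the initialization (a unit vector for each w_i and a small positive d_i) are assumptions; the slides do not say how the state is initialized.

```python
import math


def spirit_update(W, d, x):
    """Apply the per-point update for k hidden variables.

    W: list of k weight vectors (lists of n floats), updated in place.
    d: list of k energy accumulators, updated in place.
    x: the new n-dimensional point; it is not modified.
    Returns the k hidden-variable values y_1..y_k.
    """
    remainder = list(x)
    ys = []
    for i, w in enumerate(W):
        # y_i := w_i^T x (projection onto w_i)
        y = sum(wj * rj for wj, rj in zip(w, remainder))
        # d_i <- d_i + y_i^2 (energy)
        d[i] += y * y
        # e_i := x - y_i w_i (error)
        e = [rj - y * wj for rj, wj in zip(remainder, w)]
        # w_i <- w_i + (1/d_i) y_i e_i (update estimate)
        for j in range(len(w)):
            w[j] += (y / d[i]) * e[j]
        # x <- x - y_i w_i (repeat with the remainder, using the updated w_i)
        remainder = [rj - y * wj for rj, wj in zip(remainder, w)]
        ys.append(y)
    return ys


def cosine(u, v):
    """Cosine of the angle between two vectors (helper for inspection)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
```

For two streams where the second is always twice the first, starting from `W = [[1.0, 0.0]]` and `d = [0.01]` and feeding points such as `[s, 2 * s]`, the direction of `W[0]` should move toward (1, 2); `cosine(W[0], [1.0, 2.0])` is a convenient way to watch that happen. Cost per point is O(nk), matching the time/space slide.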
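
Sketch: lag correlation from sufficient statistics
"Main Idea (1)" of BRAID computes R(l) from running sums rather than from the raw sequences. The following is a minimal sketch of that idea for a single fixed lag l, under the assumption that the last l values of Y may be buffered exactly. BRAID itself avoids such a buffer through sequence smoothing and geometric lag probing, which are not shown here. The class name `LagCorrelation` and its methods are hypothetical.

```python
import math
from collections import deque


class LagCorrelation:
    """Running estimate of R(l) for one fixed lag l (illustrative only)."""

    def __init__(self, lag):
        self.lag = lag
        # Holds y_{t-l}, ..., y_t so that y_{t-l} is available at tick t.
        self.ybuf = deque(maxlen=lag + 1)
        self.m = 0  # number of aligned pairs, i.e. n - l
        self.sx = self.sxx = 0.0  # Sx(l+1, n), Sxx(l+1, n)
        self.sy = self.syy = 0.0  # Sy(1, n-l), Syy(1, n-l)
        self.sxy = 0.0            # Sxy(l)

    def update(self, x, y):
        """Consume the values of X and Y at the current time tick."""
        self.ybuf.append(y)
        if len(self.ybuf) < self.lag + 1:
            return  # no aligned pair (x_t, y_{t-l}) exists yet
        y_old = self.ybuf[0]  # y_{t-l}
        self.m += 1
        self.sx += x
        self.sxx += x * x
        self.sy += y_old
        self.syy += y_old * y_old
        self.sxy += x * y_old

    def r(self):
        """Return R(l), or None while it is undefined."""
        if self.m < 2:
            return None
        c = self.sxy - self.sx * self.sy / self.m   # C(l)
        vx = self.sxx - self.sx * self.sx / self.m  # Vx(l+1, n)
        vy = self.syy - self.sy * self.sy / self.m  # Vy(1, n-l)
        if vx <= 0.0 or vy <= 0.0:
            return None
        return c / math.sqrt(vx * vy)
```

Each `update` touches a constant number of accumulators, so the per-tick cost for one lag is O(1). Tracking one such object per probed lag value in {0, 1, 2, 4, 8, …} would be a straightforward, if memory-hungry, way to experiment with the geometric probing idea before adding the smoothing step.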