BRAID: Stream Mining through Group Lag Correlations
Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos
SIGMOD 2005

Outline
Introduction
Proposed method
Experiments
Conclusions

Introduction
Data streams
Lag correlations. For example: higher amounts of fluoride in water → fewer dental cavities some years later.
Goal: monitor multiple numerical streams, determine which pairs are correlated with a lag, and report the value of that lag.

Introduction
Given k numerical sequences X_1, …, X_k, report all pairs X_i and X_j such that X_i follows X_j with lag l.

Introduction
This paper proposes BRAID, a method for handling data streams with the following properties:
Any-time processing, and fast
Nimble
Accurate
Small resource consumption

Proposed method
Data stream X: {x_1, …, x_t, …, x_n}, where x_n is the most recent value.
R(0): correlation of X and Y, which have the same length n and zero lag.
Pearson coefficient ρ:
$$R(0) = \rho = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2}\,\sqrt{\sum_{t=1}^{n} (y_t - \bar{y})^2}}$$
where \bar{x} and \bar{y} are the means of X and Y.

Proposed method
For lag l, consider the common part of X and the shifted Y.

Proposed method
R(l): correlation coefficient when X is delayed by l.
Score at lag l:
$$\mathrm{score}(l) = |R(l)|$$

Proposed method
For a large lag l ≈ n, the original and shifted time sequences overlap in too few points for R(l) to be meaningful.
Restrict the maximum lag m to n/2.

Proposed method
Naive solution: at time n, access all values of X and Y, and compute R(l) for every lag l (= 0, 1, …).
Choose the earliest maximum of the score that exceeds the threshold r, or report that there is no lag correlation.
The proposed solution is based on three major steps.

Proposed method
Sufficient statistics are needed so that R can be computed easily:
$$S_x(1,n) = \sum_{t=1}^{n} x_t$$ : sum of X over length n
$$S_{xx}(1,n) = \sum_{t=1}^{n} x_t^2$$ : sum of squares of X over length n
$$S_{xy}(l) = \sum_{t=l+1}^{n} x_t\, y_{t-l}$$ : sum of products of X and Y shifted by lag l

Proposed method
R(l) is obtained as:
$$R(l) = \frac{S_{xy}(l) - \frac{S_x(l+1,n)\, S_y(1,n-l)}{n-l}}{\sqrt{S_{xx}(l+1,n) - \frac{S_x(l+1,n)^2}{n-l}}\,\sqrt{S_{yy}(1,n-l) - \frac{S_y(1,n-l)^2}{n-l}}}$$
where S_x(l+1, n) denotes the sum of x_t over t = l+1, …, n, and S_y, S_yy are defined for Y in the same way as S_x, S_xx for X.

Proposed method
R(l) can be estimated at any point in time; only five sufficient statistics need to be tracked.
Computing the cross-correlation function between two sequences still takes linear time.

Proposed method
Keep track of only a geometric progression of lag values: l = 0, 1, 2, 4, …, 2^i, …
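As a rough illustration of probing only geometrically spaced lags, here is a minimal Python sketch. All function names (`lag_correlation`, `geometric_lags`, `best_geometric_lag`, `make_delayed_pair`) are hypothetical and do not come from the slides. For clarity the sums are recomputed from scratch at every call, rather than maintained incrementally as a streaming method would do, and no smoothing or interpolation is performed.

```python
import math
import random


def lag_correlation(x, y, lag):
    """R(lag): correlation of x[lag:] with y[:n-lag], i.e. x delayed by `lag`."""
    n = len(x)
    m = n - lag                      # length of the overlapping part
    if m < 2:
        return 0.0                   # too little overlap to correlate
    xs, ys = x[lag:], y[:m]
    # Five sufficient statistics over the overlap.
    sx, sy = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs)
    syy = sum(v * v for v in ys)
    sxy = sum(a * b for a, b in zip(xs, ys))
    cov = sxy - sx * sy / m
    var_x = sxx - sx * sx / m
    var_y = syy - sy * sy / m
    if var_x <= 0 or var_y <= 0:
        return 0.0                   # constant overlap: treated as 0 in this sketch
    return cov / math.sqrt(var_x * var_y)


def geometric_lags(n):
    """Lags 0, 1, 2, 4, ... up to the maximum lag m = n // 2."""
    lags = [0]
    lag = 1
    while lag <= n // 2:
        lags.append(lag)
        lag *= 2
    return lags


def best_geometric_lag(x, y, threshold):
    """Earliest probed lag with the highest score >= threshold, or None."""
    best = None
    for lag in geometric_lags(len(x)):
        score = abs(lag_correlation(x, y, lag))
        if score >= threshold and (best is None or score > best[1]):
            best = (lag, score)
    return best


def make_delayed_pair(n, lag, seed=0):
    """Hypothetical test data: x is y delayed by `lag` ticks."""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(n + lag)]
    return base[:n], base[lag:]      # x[t] == y[t - lag] for t >= lag
```

For two sequences of length 128, `geometric_lags` probes only eight lags (0 through 64) instead of all 65 candidate lags up to n/2. A lag that is not a power of two is not probed directly in this sketch.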
Only O(log n) lag values need to be tracked, instead of the O(n) that the naive solution requires.
The space required still grows linearly with the length n.

Proposed method
To compute R(l) at any time, a sliding window of size l must be kept; with m = n/2, this needs O(n) space.
Instead of operating only on the original time sequences, we also compute smoothed versions of them, by computing the means of non-overlapping windows.

Proposed method
Window size: a power of g = 2.
X: original time sequence.
A_x^h: smoothed version with windows of length 2^h.
A_x^0: the original sequence; A_x^1: consists of n/2 ticks; and so on.
The sufficient statistics of A_x^h need to be computed only every 2^h time ticks.
At time n, O(log n) levels are needed; the sufficient statistics are computed for each level.

Proposed method
In contrast with the small lags, the larger ones are sparse.
Use a cubic spline to interpolate the missing correlation coefficients.

Proposed method
A_x^h(t): window average at time tick t for level h.
A_x^0(t) ≡ x_t

Proposed method
Sufficient statistics:

Experiments

Conclusions
BRAID is proposed to detect lag correlations in streaming data:
Works at any time
Low resource consumption
High accuracy
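The window averaging described under the proposed method (means of non-overlapping windows of length 2^h) can be sketched as follows. The class name `WindowAverages` and its interface are hypothetical. This sketch stores every average at every level, so it illustrates how the levels are updated but does not reproduce the space savings of the method.

```python
class WindowAverages:
    """Keeps A^h for h = 0..levels-1: means of non-overlapping windows of length 2**h."""

    def __init__(self, levels):
        self.levels = [[] for _ in range(levels)]
        self.count = 0               # number of time ticks seen so far

    def append(self, value):
        self.count += 1
        self.levels[0].append(value)     # level 0 is the original sequence
        for h in range(1, len(self.levels)):
            # Level h receives a new average only every 2**h ticks.
            if self.count % (2 ** h) != 0:
                break
            prev = self.levels[h - 1]
            # Mean of the two newest level-(h-1) averages equals the mean
            # of the newest window of length 2**h.
            self.levels[h].append((prev[-1] + prev[-2]) / 2.0)
```

Feeding the values 1 through 8 into `WindowAverages(3)` gives `[1.5, 3.5, 5.5, 7.5]` at level 1 and `[2.5, 6.5]` at level 2.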