BRAID: Stream Mining through Group Lag Correlations

Yasushi Sakurai Spiros Papadimitriou Christos Faloutsos
SIGMOD 2005
Outline




Introduction
Proposed method
EXPERIMENTS
CONCLUSIONS
Introduction


Data Stream
Lag correlations :


For example:
Higher amounts of fluoride in water → fewer
dental cavities some years later
Goal :

Monitor multiple numerical streams and determine
which pairs are correlated with a lag, and the value of that lag
Introduction

Given k numerical sequences X1, …, Xk,
report all pairs Xi and Xj such that Xi
follows Xj with lag l
Introduction

This paper proposes BRAID, a method for handling data
streams with the following properties:




Any-time processing, and fast
Nimble
Accurate
Small resource consumption
Proposed method



Data stream X: {x1, …, xt, …, xn}, where xn is
the most recent value
R(0): correlation of X and Y, which have the same length n, at
zero lag
Pearson ρ coefficient:
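For reference, the standard Pearson coefficient of two length-n sequences can be written as below. The symbols \bar{x} and \bar{y} (the sample means of X and Y) are notation assumed here, not taken from the slide.

```latex
\rho = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}
            {\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2}\,
             \sqrt{\sum_{t=1}^{n} (y_t - \bar{y})^2}}
```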
Proposed method

For lag l, consider the common (overlapping) part of X and
the shifted Y
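A minimal sketch of this idea in Python, assuming X is the delayed sequence, so x_{l+1..n} is paired with y_{1..n-l}. The function name `lag_correlation` and the choice to return 0.0 for a constant overlap are illustrative assumptions, not part of the slides.

```python
import math

def lag_correlation(x, y, lag):
    """Pearson correlation over the overlap of x and y shifted by `lag`."""
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same length")
    if not 0 <= lag < n - 1:
        raise ValueError("lag must leave at least two overlapping points")
    xs = x[lag:]        # x_{l+1}, ..., x_n
    ys = y[:n - lag]    # y_1, ..., y_{n-l}
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant overlap as uncorrelated
    return cov / math.sqrt(var_x * var_y)
```

If x is an exact copy of y delayed by two ticks, `lag_correlation(x, y, 2)` comes out at 1 up to rounding.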
Proposed method


R(l): the correlation coefficient when X is delayed
by l
Score at lag l :
Proposed method

For large lags (l ≈ n), the original and shifted
time sequences have too few overlapping points
to compute R(l)

Restrict the maximum lag m to n/2
Proposed method

Naive solution :



At time n, access all values of X and Y and
compute R(l) for every lag l (= 0, 1, …)
Choose the earliest maximum score above r, or report
no lag
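The naive procedure above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the score is taken to be |R(l)|, "earliest maximum" is read as the earliest local maximum, lags are limited to n/2, and the names `pearson`, `naive_lag`, and `threshold` (standing in for r) are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def naive_lag(x, y, threshold):
    """Return the earliest lag whose score is a local maximum >= threshold, else None."""
    n = len(x)
    max_lag = n // 2  # restrict the maximum lag to n/2
    # Assumed score: absolute correlation of x delayed by l against y.
    scores = [abs(pearson(x[l:], y[:n - l])) for l in range(max_lag + 1)]
    for l, s in enumerate(scores):
        left = scores[l - 1] if l > 0 else float("-inf")
        right = scores[l + 1] if l + 1 < len(scores) else float("-inf")
        if s >= threshold and s >= left and s >= right:
            return l
    return None
```

Each call rescans both sequences for every candidate lag, which is what makes this approach unsuitable for streams.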
The proposed solution is based on three major steps
Proposed method

Some sufficient statistics are needed so that R can be
computed easily


S_x(1,n) = \sum_{t=1}^{n} x_t : sum of X over length n
S_{xx}(1,n) = \sum_{t=1}^{n} x_t^2 : sum of squares of X over length n
S_{xy}(l) = \sum_{t=l+1}^{n} x_t y_{t-l} : inner product of X and Y shifted by lag l
Proposed method

R(l) is obtained :
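One way to write R(l) in terms of the sufficient statistics follows directly from expanding the Pearson coefficient over the n - l overlapping points. The notation S_x(i,j) for the sum of x_t over t = i, …, j (and likewise S_{xx}, S_y, S_{yy}) is assumed here.

```latex
R(l) = \frac{S_{xy}(l) - \dfrac{S_x(l+1,n)\, S_y(1,n-l)}{n-l}}
            {\sqrt{S_{xx}(l+1,n) - \dfrac{S_x(l+1,n)^2}{n-l}}\;
             \sqrt{S_{yy}(1,n-l) - \dfrac{S_y(1,n-l)^2}{n-l}}}
```

The five quantities S_x, S_{xx}, S_y, S_{yy}, and S_{xy} are all running sums, so each can be updated in constant time when a new value arrives.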
Proposed method


R(l) can be estimated at any point in time; we only
need to keep track of five sufficient statistics
It still takes linear time to compute the
cross-correlation function between two
sequences
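A rough sketch of how the five sufficient statistics for one fixed lag might be maintained incrementally. The class name `LagStats` and its methods are hypothetical; the buffer of the last `lag` values of Y is an assumption about how x_t gets paired with y_{t-l} as values arrive.

```python
import math
from collections import deque

class LagStats:
    """Running sums for R(lag), where x is the delayed sequence."""

    def __init__(self, lag):
        self.lag = lag
        self.buffer = deque()  # most recent values of y, at most lag + 1 of them
        self.count = 0         # number of overlapping pairs seen so far
        self.sx = self.sxx = self.sy = self.syy = self.sxy = 0.0

    def update(self, x_t, y_t):
        """Consume one tick; constant time per update."""
        self.buffer.append(y_t)
        if len(self.buffer) > self.lag:
            y_old = self.buffer.popleft()  # this is y_{t-lag}
            self.count += 1
            self.sx += x_t
            self.sxx += x_t * x_t
            self.sy += y_old
            self.syy += y_old * y_old
            self.sxy += x_t * y_old

    def correlation(self):
        """R(lag) from the running sums (0.0 if undefined)."""
        m = self.count
        if m < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / m
        var_x = self.sxx - self.sx ** 2 / m
        var_y = self.syy - self.sy ** 2 / m
        if var_x <= 0 or var_y <= 0:
            return 0.0
        return cov / math.sqrt(var_x * var_y)
```

One such object per lag gives any-time estimates, but tracking every lag up to n/2 this way still costs linear time and space per tick.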
Proposed method



Keep track of only a geometric progression of
lag values: l = 0, 1, 2, 4, …, 2^i, …
Only O(log n) lags to keep track of, instead
of the O(n) that the naive solution requires
The space required still grows linearly with the length n
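To make the saving concrete, the probed lags can be enumerated as below. The helper name `geometric_lags` is illustrative only.

```python
def geometric_lags(max_lag):
    """Return the lags 0, 1, 2, 4, ... that do not exceed max_lag."""
    lags = [0]
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2
    return lags
```

For a maximum lag of 1024 this yields 12 lags rather than 1025.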
Proposed method


To compute R(l) at any time, we must keep a sliding
window of size l; with m = n/2 this needs O(n) space
Instead of operating only on the original time sequences,
we also compute smoothed versions of them by
taking the means of non-overlapping
windows
Proposed method






Window size: powers of g = 2
X: original time sequence
Axh: smoothed version with windows of length 2^h
Ax0: the original sequence; Ax1: consists of n/2
ticks; etc.
The sufficient statistics of Axh need to be computed only every 2^h
time ticks
At time n, O(log n) levels are needed; sufficient statistics
are computed for each level
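The smoothing hierarchy can be illustrated with a small batch sketch. It assumes the whole sequence is available in memory (a streaming version would update each level as ticks arrive), and it drops a trailing value that does not fill a window; the name `smoothed_levels` is hypothetical.

```python
def smoothed_levels(x):
    """Return [A^0, A^1, ...]; A^h holds means of non-overlapping windows of length 2**h."""
    levels = [list(x)]  # level 0 is the original sequence
    while len(levels[-1]) >= 2:
        prev = levels[-1]
        # Averaging adjacent pairs of level h-1 gives the window means of level h.
        nxt = [(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev) - 1, 2)]
        levels.append(nxt)
    return levels
```

For a sequence of 8 ticks this produces levels of 8, 4, 2, and 1 values, so the number of levels grows as O(log n).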
Proposed method

In contrast with small lags, the larger ones
are sparse

Use a cubic spline to interpolate the missing
correlation coefficients
Proposed method


Axh(t) : window average at time tick t for
level h
Ax0(t) ≡ xt
Proposed method

Sufficient statistics:
EXPERIMENTS
Conclusion

Proposed BRAID to detect lag
correlations on streaming data



At any time
Low resource consumption
High accuracy