BRAID: Stream Mining through Group Lag Correlations
Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos
SIGMOD 2005
Introduction
Lag correlations :
For example:
Higher amounts of fluoride in water → fewer dental cavities some years later
Goal:
Monitor multiple numerical streams, determine which pairs are correlated with a lag, and report the value of that lag
Introduction
Given k numerical sequences $X_1, \dots, X_k$, report all pairs $X_i$ and $X_j$ in which $X_i$ follows $X_j$ with lag $l$
Introduction
This paper proposes BRAID, which handles data streams of semi-infinite length
Any-time processing, and fast
Nimble
Accurate
Small resource consumption
Proposed method
Data stream X: $\{x_1, \dots, x_t, \dots, x_n\}$, where $x_n$ is the most recent value
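R(0): correlation coefficient ρ of X and Y, which have the same length n, at zero lag
ρ coefficient: $\rho = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2}\,\sqrt{\sum_{t=1}^{n} (y_t - \bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the means of X and Y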
Proposed method
For a lag of l, consider only the common part of X and Y, which covers n − l time ticks
Proposed method
R(l): correlation coefficient when X is delayed by l
Score at lag l: $\mathrm{score}(l) = |R(l)|$
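The lag correlation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The names `lag_correlation` and `score` are hypothetical. Pairing $x_t$ with $y_{t-l}$ follows the "X is delayed by l" convention, and taking the score to be the absolute value of R(l) is an assumption of this sketch, as is returning 0.0 when one segment is constant.

```python
import math

def lag_correlation(x, y, lag):
    """Correlation of x[lag:] with y[:n-lag], i.e. x_t paired with y_{t-lag}."""
    n = len(x)
    if len(y) != n or not 0 <= lag < n - 1:
        raise ValueError("need equal lengths and 0 <= lag < n - 1")
    xs, ys = x[lag:], y[:n - lag]          # common part: n - lag time ticks
    m = n - lag
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant segment counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def score(x, y, lag):
    """Assumed score: absolute value of the lag correlation."""
    return abs(lag_correlation(x, y, lag))
```

With `lag=0` the function reduces to the ordinary correlation coefficient of two equal-length sequences.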
Proposed method
For large lags l ≈ n, the original and shifted time sequences have too few overlapping time ticks for R(l) to be meaningful
Restrict the maximum lag m to n/2
Proposed method
Naive solution:
At time n, access all values of X and Y, and compute R(l) for every lag l (= 0, 1, …)
Choose the earliest maximum score above the threshold r, or report no lag
The proposed solution is based on three major steps
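The naive solution can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the names `lag_correlation` and `naive_lag` are hypothetical, the score is assumed to be |R(l)|, "earliest maximum" is read as the earliest local maximum of the score, and the maximum lag is capped at n/2.

```python
import math

def lag_correlation(x, y, lag):
    """Correlation of x[lag:] with y[:n-lag] (x delayed by `lag`)."""
    n = len(x)
    xs, ys = x[lag:], y[:n - lag]
    m = n - lag
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: constant segment counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def naive_lag(x, y, threshold):
    """Return (lag, score) of the earliest local maximum above threshold, else None."""
    max_lag = len(x) // 2                  # restrict the maximum lag to n/2
    scores = [abs(lag_correlation(x, y, l)) for l in range(max_lag + 1)]
    for l, s in enumerate(scores):
        left = scores[l - 1] if l > 0 else float("-inf")
        right = scores[l + 1] if l < max_lag else float("-inf")
        if s > threshold and s >= left and s >= right:
            return l, s
    return None                            # no lag correlation detected
```

Every call touches all n values for each of the n/2 candidate lags, which is the cost the rest of the talk sets out to avoid.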
Proposed method
Need some sufficient statistics so that R(l) can be computed easily
$S_x(1, n) = \sum_{t=1}^{n} x_t$: sum of X of length n
$S_{xx}(1, n) = \sum_{t=1}^{n} x_t^2$: sum of squares of X of length n
$S_{xy}(l) = \sum_{t=l+1}^{n} x_t y_{t-l}$: inner product of X and the shifted Y
Proposed method
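R(l) is obtained from the sufficient statistics: $R(l) = \frac{S_{xy}(l) - S_x(l+1, n)\, S_y(1, n-l) / (n-l)}{\sqrt{S_{xx}(l+1, n) - S_x(l+1, n)^2 / (n-l)}\,\sqrt{S_{yy}(1, n-l) - S_y(1, n-l)^2 / (n-l)}}$, where $S_x(a, b)$ denotes the sum of $x_t$ over $t = a, \dots, b$ (similarly $S_{xx}$, $S_y$, $S_{yy}$) and $S_{xy}(l) = \sum_{t=l+1}^{n} x_t y_{t-l}$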
Proposed method
R(l) can be computed at any point in time; we only need to keep track of five sufficient statistics
However, computing the whole cross-correlation function between two sequences still takes linear time
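For a single fixed lag, the five sufficient statistics can be maintained incrementally. The sketch below is an illustration, not the authors' implementation. The class name `LagStats` and its methods are hypothetical, and the five statistics are taken to be the sums of x, x², y, y² and the lagged products over the overlapping part. The buffer of the last `lag` values of Y is what makes the space grow with the lag.

```python
import math
from collections import deque

class LagStats:
    """Running sums for one lag; x_t is paired with y_{t-lag}."""

    def __init__(self, lag):
        self.lag = lag
        self.recent_y = deque()            # values of y still waiting for a partner
        self.count = 0                     # number of overlapping time ticks
        self.sx = self.sxx = self.sy = self.syy = self.sxy = 0.0

    def update(self, x_t, y_t):
        self.recent_y.append(y_t)
        if len(self.recent_y) <= self.lag:
            return                         # x_t has no partner y_{t-lag} yet
        y_old = self.recent_y.popleft()    # this is y_{t-lag}
        self.count += 1
        self.sx += x_t
        self.sxx += x_t * x_t
        self.sy += y_old
        self.syy += y_old * y_old
        self.sxy += x_t * y_old

    def correlation(self):
        m = self.count
        if m < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / m
        var_x = self.sxx - self.sx ** 2 / m
        var_y = self.syy - self.sy ** 2 / m
        if var_x <= 0 or var_y <= 0:
            return 0.0  # assumption: constant segment counts as uncorrelated
        return cov / math.sqrt(var_x * var_y)
```

Each update costs constant time per lag, so tracking every lag up to n/2 still costs linear time per time tick.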
Proposed method
Propose to keep track of only a geometric progression of the lag values: l = 0, 1, 2, …, $2^i$, …
Only O(log n) numbers to keep track of, instead of the O(n) that the "naïve solution" requires
However, the space required still grows linearly with the length n
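The geometric lag progression is easy to enumerate. A minimal sketch, with the hypothetical helper name `geometric_lags`:

```python
def geometric_lags(max_lag):
    """Lags 0, 1, 2, 4, ..., 2**i that do not exceed max_lag."""
    lags = [0]
    step = 1
    while step <= max_lag:
        lags.append(step)
        step *= 2
    return lags
```

For example, `geometric_lags(20)` yields the six lags 0, 1, 2, 4, 8, 16, so the number of tracked lags grows logarithmically with the maximum lag.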
Proposed method
To compute R(l) at any time, keep a sliding window of size l; with m = n/2 this needs O(n) space
Instead of operating only on the original time sequences, also compute smoothed versions of them by averaging over non-overlapping windows
Proposed method
Window size: powers of g = 2
X: original time sequence
$A_x^h$: smoothed version with windows of length $2^h$
$A_x^0$: the original sequence; $A_x^1$: consists of n/2 ticks; etc.
The sufficient statistics of $A_x^h$ need to be computed only every $2^h$ time ticks
At time n, O(log n) levels are needed; compute the sufficient statistics for each level
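The multi-level smoothing can be sketched as a streaming update in which level h receives a new window average only once every $2^h$ ticks. This is an illustration, not the authors' code. The class name `StreamSmoother` is hypothetical, and for easy inspection it stores every average at every level, which a real O(log n) implementation would not do.

```python
class StreamSmoother:
    """Non-overlapping window averages with window length 2**h at level h."""

    def __init__(self):
        self.levels = []    # levels[h]: all completed averages at level h
        self.pending = []   # pending[h]: average waiting for its sibling, or None

    def push(self, value):
        h = 0
        while True:
            if h == len(self.levels):      # first value ever to reach level h
                self.levels.append([])
                self.pending.append(None)
            self.levels[h].append(value)
            if self.pending[h] is None:
                self.pending[h] = value    # wait for the next level-h value
                return
            # Two adjacent level-h averages form one level-(h+1) average.
            value = (self.pending[h] + value) / 2
            self.pending[h] = None
            h += 1
```

Because the windows at each level have equal length, averaging two adjacent level-h averages gives exactly the average over the combined window of length $2^{h+1}$.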
Proposed method
In contrast with small lags, the larger lags are probed only sparsely
Use a cubic spline to interpolate the missing correlation coefficients
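The interpolation step can be sketched with a hand-written natural cubic spline over the probed lags. This is an illustration only. The slides do not say which boundary conditions are used, so the natural boundary (zero second derivative at both ends) is an assumption, and the name `natural_cubic_spline` is hypothetical. At least three strictly increasing knots are assumed.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Return f(x) interpolating (xs, ys) with a natural cubic spline."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    m = [0.0] * n                          # second derivatives; m[0] = m[n-1] = 0
    diag = [0.0] * n
    rhs = [0.0] * n
    for i in range(1, n - 1):
        diag[i] = 2 * (h[i - 1] + h[i])
        rhs[i] = 6 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    for i in range(2, n - 1):              # forward elimination (tridiagonal system)
        w = h[i - 1] / diag[i - 1]
        diag[i] -= w * h[i - 1]
        rhs[i] -= w * rhs[i - 1]
    for i in range(n - 2, 0, -1):          # back substitution
        m[i] = (rhs[i] - h[i] * m[i + 1]) / diag[i]

    def f(x):
        i = min(max(bisect_right(xs, x) - 1, 0), n - 2)
        a, b = xs[i + 1] - x, x - xs[i]
        return (m[i] * a ** 3 + m[i + 1] * b ** 3) / (6 * h[i]) \
            + (ys[i] / h[i] - m[i] * h[i] / 6) * a \
            + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6) * b

    return f
```

Given correlation values at lags such as 0, 1, 2, 4, 8, 16, the returned function can be evaluated at the unprobed lags (3, 5, 6, 7, and so on) to estimate where the score peaks.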
Proposed method
$A_x^h(t)$: window average at time tick t for level h
$A_x^0(t) \equiv x_t$
Proposed method
Sufficient statistics:
Enhanced BRAID
For two sequences of size ≈ $2^{20}$, BRAID requires about $5 \cdot \log_2 2^{20} = 5 \cdot 20 = 100$ floating-point numbers, about 800 bytes
When more memory is available, propose a solution that probes more lags but still uses O(log n) space
Use a mix of arithmetic and geometric probing
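The memory estimate can be reproduced with a short calculation. The helper name `memory_estimate` is hypothetical, and the 8 bytes per number is an assumption implied by the 100 numbers and 800 bytes quoted on the slide.

```python
import math

def memory_estimate(n, stats_per_level=5, bytes_per_float=8):
    """Return (levels, floats, bytes) for sequences of length n."""
    levels = math.ceil(math.log2(n))       # one smoothing level per power of two
    floats = stats_per_level * levels      # five sufficient statistics per level
    return levels, floats, floats * bytes_per_float
```

Calling `memory_estimate(2 ** 20)` gives 20 levels, 100 numbers and 800 bytes, matching the figures above.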
Enhanced BRAID
BRAID uses only one window at each smoothing level
Propose using b > 1 windows instead, for example b = 4
Same algorithm as before (the b = 1 case), with the exception that the bottom level has 2b coefficients
When computing R(l), probe the lags using a mixture of geometric and arithmetic progressions
Enhanced BRAID
Example of enhanced BRAID with b = 4
With b = 1, the algorithm is identical to the earlier one
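One lag-probing scheme consistent with these statements is sketched below. The exact formula is not preserved in the slides, so this is an assumption rather than the authors' definition: the bottom level probes the 2b lags 0, …, 2b − 1, and each higher level h probes the b lags $k \cdot 2^h$ for k = b, …, 2b − 1. The helper name `enhanced_lags` is hypothetical.

```python
def enhanced_lags(max_lag, b=4):
    """Mixed arithmetic/geometric lag probing (assumed scheme, see lead-in)."""
    lags = list(range(min(2 * b, max_lag + 1)))   # bottom level: 2b lags
    h = 1
    while b * 2 ** h <= max_lag:
        for k in range(b, 2 * b):                 # arithmetic steps within a level
            lag = k * 2 ** h                      # geometric growth across levels
            if lag <= max_lag:
                lags.append(lag)
        h += 1
    return lags
```

With `b=1` the scheme yields 0, 1, 2, 4, 8, …, the plain geometric progression, which agrees with the remark that the b = 1 case equals the earlier algorithm.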
Conclusion
Proposed BRAID to detect lag correlations on streaming data
At any time
Low resource consumption
High accuracy
Thank you very much~