BRAID: Stream Mining through Group Lag Correlations

Yasushi Sakurai Spiros Papadimitriou Christos Faloutsos

SIGMOD 2005

Introduction

Lag correlations :

For example:

Higher amounts of fluoride in water → fewer dental cavities some years later

Goal:

Monitor multiple numerical streams, determine which pairs are correlated with a lag, and report the value of that lag

Introduction

Given k numerical sequences $X_1, \ldots, X_k$, report all pairs $X_i$ and $X_j$ in which $X_i$ follows $X_j$ with lag l

Introduction

In this paper, we propose BRAID, which handles data streams of semi-infinite length

Any-time processing, and fast

Nimble

Accurate

Small resource consumption

Proposed method

Data stream X: $\{x_1, \ldots, x_t, \ldots, x_n\}$; $x_n$ is the most recent value

R(0): correlation coefficient of X and Y, which have the same length n, at zero lag

ρ Coefficient :
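The formula on this slide did not survive extraction. For reference, the standard Pearson correlation coefficient of two sequences of length n is written out here; the bar notation for the means is an assumption and is not taken from the slide.

```latex
\rho = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}
            {\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2}\,
             \sqrt{\sum_{t=1}^{n} (y_t - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t,
\quad
\bar{y} = \frac{1}{n}\sum_{t=1}^{n} y_t
```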

Proposed method

For a lag of l, consider only the common part of X and Y, which spans n − l time ticks

Proposed method

R(l): correlation coefficient when X is delayed by l

Score at lag l :
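The definitions on these slides were images. A sketch consistent with the surrounding text follows. It assumes that "X is delayed by l" means $x_t$ is paired with $y_{t-l}$ over the n − l overlapping ticks, and that the score is the absolute value of R(l). Both readings are assumptions.

```latex
R(l) = \frac{\sum_{t=l+1}^{n} (x_t - \bar{x})(y_{t-l} - \bar{y})}
            {\sqrt{\sum_{t=l+1}^{n} (x_t - \bar{x})^2}\,
             \sqrt{\sum_{t=1}^{n-l} (y_t - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n-l}\sum_{t=l+1}^{n} x_t,
\quad
\bar{y} = \frac{1}{n-l}\sum_{t=1}^{n-l} y_t

\mathrm{score}(l) = |R(l)|
```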

Proposed method

R(l) for large values of lag l ≈ n: the original and shifted time sequences have too few overlapping time ticks

Restrict the maximum lag m to n/2

Proposed method

Naive solution:

At time n, access all values of X and Y, and compute R(l) for all values of lag l (= 0, 1, …)

Choose the earliest maximum score above the threshold r, or report no lag
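The naive procedure can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names, the use of the absolute correlation as the score, and the pairing of x[t] with y[t - lag] are assumptions.

```python
import math

def lag_correlation(x, y, lag):
    """R(lag): Pearson correlation of x[lag:] against y[:n - lag]."""
    n = len(x)
    xs, ys = x[lag:], y[:n - lag]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant segment
    return cov / math.sqrt(var_x * var_y)

def naive_best_lag(x, y, threshold):
    """Scan lags 0..n//2 and return the earliest lag with the maximum score.

    Returns None when no score exceeds the threshold (no lag correlation).
    """
    n = len(x)
    best_lag, best_score = None, threshold
    for lag in range(n // 2 + 1):
        score = abs(lag_correlation(x, y, lag))  # assumed score: |R(lag)|
        if score > best_score:  # strict '>' keeps the earliest maximum
            best_lag, best_score = lag, score
    return best_lag
```

Every call rescans both sequences for every lag, which is the cost the rest of the deck sets out to avoid.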

The proposed solution is based on three major steps

Proposed method

We need some sufficient statistics so that R can be computed easily

$S_x(1, n) = \sum_{t=1}^{n} x_t$ : sum of X of length n

$S_{xx}(1, n) = \sum_{t=1}^{n} x_t^2$ : sum of squares of X of length n

$S_{xy}(l) = \sum_{t=l+1}^{n} x_t y_{t-l}$ : inner product of X and Y shifted by lag l

Proposed method

R ( l ) is obtained :
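The closed form on this slide was an image. One standard way to write it is given here, using $S_x(a, b) = \sum_{t=a}^{b} x_t$ and likewise for the other sums. The index ranges (X over ticks l+1..n, Y over ticks 1..n−l) and the helper names C and V are assumptions.

```latex
C(l) = S_{xy}(l) - \frac{S_x(l+1, n)\, S_y(1, n-l)}{n-l}

V_x(l+1, n) = S_{xx}(l+1, n) - \frac{\bigl(S_x(l+1, n)\bigr)^2}{n-l}

V_y(1, n-l) = S_{yy}(1, n-l) - \frac{\bigl(S_y(1, n-l)\bigr)^2}{n-l}

R(l) = \frac{C(l)}{\sqrt{V_x(l+1, n)\, V_y(1, n-l)}}
```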

Proposed method

R(l) can be estimated at any point in time; we only need to keep track of five sufficient statistics
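A minimal sketch of how the five sums might be maintained incrementally for one fixed lag. The class name `LagStats` and its fields are hypothetical. Note that the sketch keeps a buffer of the last `lag` values of Y so that each new x can be paired with the delayed y.

```python
import math
from collections import deque

class LagStats:
    """Running sufficient statistics for R(lag); names are illustrative."""

    def __init__(self, lag):
        self.lag = lag
        self.delay = deque()   # last `lag` values of Y, so x_t meets y_{t-lag}
        self.count = 0         # number of overlapping ticks, n - lag
        self.sx = self.sxx = 0.0
        self.sy = self.syy = 0.0
        self.sxy = 0.0

    def update(self, x, y):
        """Consume one new tick (x_t, y_t) in constant time."""
        self.delay.append(y)
        if len(self.delay) <= self.lag:
            return             # not enough history yet to form a pair
        y_old = self.delay.popleft()
        self.count += 1
        self.sx += x
        self.sxx += x * x
        self.sy += y_old
        self.syy += y_old * y_old
        self.sxy += x * y_old

    def correlation(self):
        """R(lag) from the five sums; 0.0 when it is undefined."""
        c = self.count
        if c < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / c
        var_x = self.sxx - self.sx ** 2 / c
        var_y = self.syy - self.sy ** 2 / c
        if var_x <= 0 or var_y <= 0:
            return 0.0
        return cov / math.sqrt(var_x * var_y)
```

Each update is constant time, but one such object is needed per lag.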

It still needs linear time to compute the cross-correlation function between two sequences

Proposed method

We propose to keep track of only a geometric progression of the lag values: l = 0, 1, 2, …, $2^i$, …

Only O(log n) numbers to keep track of, instead of the O(n) that the naive solution requires

However, the space required still grows linearly with the length n
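The geometric progression of lags can be illustrated as follows. The helper name `geometric_lags` is hypothetical.

```python
def geometric_lags(max_lag):
    """Return the probed lags 0, 1, 2, 4, ..., 2^i that do not exceed max_lag."""
    lags = [0]
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2
    return lags

# With a maximum lag of 16, only six lags are probed instead of seventeen.
print(geometric_lags(16))
```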

Proposed method

In order to compute R(l) at any time, we must keep a sliding window of size l; with m = n/2 this needs O(n) space

Instead of operating only on the original time sequences, also compute smoothed versions of them by averaging over non-overlapping windows

Proposed method

Window size: powers of g = 2

X: original time sequence

$A_x^h$: smoothed version with windows of length $2^h$

$A_x^0$: the original sequence; $A_x^1$: consists of n/2 ticks; etc.

The sufficient statistics of $A_x^h$ need to be computed only every $2^h$ time ticks

At time n, O(log n) levels are needed; sufficient statistics are computed for each level
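The smoothing levels can be sketched in batch form as follows. A streaming implementation would update each level incrementally instead; the function name `smoothed_levels` is hypothetical.

```python
def smoothed_levels(x):
    """Return [A^0, A^1, ...]; A^h averages non-overlapping windows of 2^h ticks."""
    levels = [list(x)]
    while len(levels[-1]) >= 2:
        prev = levels[-1]
        # Average adjacent pairs of the previous level; an unpaired trailing
        # value is left out because its window is not complete yet.
        nxt = [(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev) - 1, 2)]
        levels.append(nxt)
    return levels

levels = smoothed_levels([1, 2, 3, 4, 5, 6, 7, 8])
# levels[1] holds the window-2 averages, levels[2] the window-4 averages, etc.
```

Because each level halves the number of ticks, a sequence of length n yields about log2(n) levels.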

Proposed method

In contrast with small lags, the larger lags are probed sparsely

Use a cubic spline to interpolate the missing correlation coefficients
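As a rough illustration of the interpolation step, the sketch below fits a natural cubic spline through correlation values known at a few lags and evaluates it in between. The slides only say "cubic spline", so the natural boundary condition is an assumption, and the function name is hypothetical.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Fit a natural cubic spline through (xs, ys); xs must be increasing."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward sweep of the tridiagonal system for the second-order coefficients.
    diag = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]
    # Back substitution; c[0] = c[n] = 0 is the natural boundary condition.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        # Locate the segment containing x, clamping to the first or last one.
        j = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate

# Hypothetical correlation values at geometrically spaced lags.
estimate = natural_cubic_spline([0, 1, 2, 4, 8], [1.0, 0.8, 0.5, -0.2, 0.3])
r_at_lag_3 = estimate(3)  # interpolated value for a lag that was not probed
```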

Proposed method

$A_x^h(t)$: window average at time tick t for level h

$A_x^0(t) \equiv x_t$

Proposed method

Sufficient statistics:

Enhanced BRAID

For two sequences of size ≈ $2^{20}$, BRAID requires about 5 × log₂($2^{20}$) = 5 × 20 = 100 floating-point numbers, about 800 bytes
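The arithmetic on this slide can be reproduced with a small helper. The 8 bytes per number is an assumption implied by "100 float numbers, about 800 bytes"; the function name is hypothetical.

```python
import math

def braid_memory(n, stats_per_level=5, bytes_per_float=8):
    """Approximate (floats, bytes) needed for one pair of streams of length n."""
    levels = math.ceil(math.log2(n))    # one smoothing level per power of two
    floats = stats_per_level * levels   # five sufficient statistics per level
    return floats, floats * bytes_per_float

print(braid_memory(2 ** 20))
```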

With more memory available, we propose a solution that probes more lags but still uses O(log n) space

Use mix of arithmetic plus geometric probing

Enhanced BRAID

BRAID uses only one window at each smoothing level

We propose using b > 1 windows instead, e.g. b = 4

The algorithm is the same as before (b = 1), except that the bottom level has 2b coefficients

When computing R(l), use a mixture of geometric and arithmetic progressions:

Enhanced BRAID

Example of enhanced BRAID with b = 4

With b = 1, the algorithm is identical to the earlier basic algorithm
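One way to read the mixed progression is sketched below. It assumes the bottom level probes lags 0 to 2b − 1 and that each higher level h probes b lags spaced $2^h$ apart. This layout is an assumption based on the slide text, and the function name is hypothetical.

```python
def enhanced_lags(b, max_lag):
    """Lags probed with b windows per smoothing level (assumed layout)."""
    # Bottom level: 2b consecutive lags (arithmetic progression with step 1).
    lags = list(range(min(2 * b, max_lag + 1)))
    step = 2
    while True:
        # Level with step 2^h: b lags spaced `step` apart, starting at b * step.
        level = [step * j for j in range(b, 2 * b) if step * j <= max_lag]
        if not level:
            break
        lags.extend(level)
        step *= 2  # geometric progression across levels
    return lags

print(enhanced_lags(4, 32))
print(enhanced_lags(1, 16))
```

With b = 1 the list reduces to 0, 1, 2, 4, 8, 16, the plain geometric progression.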

Conclusion

We proposed BRAID to detect lag correlations on streaming data

At any time

Low resource consumption

High accuracy

Thank you very much~
