Mining time series

Carnegie Mellon
Data Mining on Streams
Christos Faloutsos
CMU
DB/IR '06
C. Faloutsos
THANK YOU!
• Prof. Panos Ipeirotis
• Julia Mills
Outline
• Problem and motivation
• Single-sequence mining: AWSOM
• Co-evolving sequences: SPIRIT
• Lag correlations: BRAID
• Conclusions
Problem definition - example
Each sensor collects data (x1, x2, …, xt, …)
Problem definition
• Given: one or more sequences
x1 , x2 , … , xt , …
(y1, y2, … , yt, …
…)
• Find
– patterns; correlations; outliers
– incrementally!
Limitations / Challenges
Find patterns using a method that is
• nimble: limited resources
– Memory
– Bandwidth, power, CPU
• incremental: on-line, ‘any-time’ response
– single pass (‘you get to see it only once’)
• automatic: no human intervention
– e.g., in remote environments
Application domains
• Sensor devices
– Temperature, weather measurements
– Road traffic data
– Geological observations
– Patient physiological data
• Embedded devices
– Network routers
– Intelligent (active) disks
Motivation - Applications
(cont’d)
• ‘Smart house’
– sensors monitor temperature, humidity,
air quality
• video surveillance
Motivation - Applications
(cont’d)
• civil/automobile infrastructure
– bridge vibrations [Oppenheim+02]
– road conditions / traffic monitoring
Motivation - Applications
(cont’d)
• Weather, environment/anti-pollution
– volcano monitoring
– air/water pollutant monitoring
Motivation - Applications
(cont’d)
• Computer systems
– ‘Active Disks’ (buffering, prefetching)
– web servers (ditto)
– network traffic monitoring
– ...
InteMon
w/ Evan Hoke, Jimeng Sun
self-* PetaByte
data center at CMU
Outline
• Problem and motivation
• Single-sequence mining: AWSOM
• Co-evolving sequences: SPIRIT
• Lag correlations: BRAID
• Conclusions
Single sequence mining AWSOM
• with Spiros Papadimitriou (CMU -> IBM)
• Anthony Brockwell (CMU/Stat)
Problem definition
• Semi-infinite streams of values (time series) x1, x2, …, xt, …
• Find patterns, forecasts, outliers…
[Figure: example streams annotated "Periodicity? (daily)", "Periodicity? (twice daily)" and "“Noise”??"]
Requirements / Goals
• Adapt and handle arbitrary periodic
components
and
• nimble (limited resources, single pass)
• on-line, any-time
• automatic (no human intervention/tuning)
Overview
• Introduction / Related work
• Background
• Main idea
• Experimental results
Wavelets
Example – Haar transform
[Figure: Haar decomposition of x_t in the time-frequency plane; detail coefficients W1,1 to W1,4 at the finest level, then W2,1 and W2,2, then W3,1, plus the “constant” coefficient V4,1; axes: time, frequency]
Wavelets
Why we like them
• Wavelets compress many real signals well:
– Image compression and processing
– Vision
– Astronomy, seismology, …
• Wavelet coefficients can be updated as new
points arrive
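The last point can be illustrated with a minimal sketch of a streaming Haar transform. It is not taken from the slides: the class name `StreamingHaar`, the orthonormal 1/sqrt(2) scaling, and the sign convention (left minus right) are assumptions made for illustration. Each arriving value touches at most one pending value per level, so memory grows with the number of levels rather than the number of points.

```python
import math


class StreamingHaar:
    """Hypothetical helper: emits Haar detail coefficients as points arrive.

    Only one unpaired scaling value is kept per level, so the working
    state is O(log N) for N points (the emitted details are stored here
    only so they can be inspected).
    """

    def __init__(self):
        self.pending = []   # pending[l]: unpaired scaling value at level l, or None
        self.details = []   # details[l]: detail coefficients produced from level l pairs

    def update(self, x):
        level, value = 0, float(x)
        while True:
            if level == len(self.pending):
                # First time this level is reached: allocate its slots.
                self.pending.append(None)
                self.details.append([])
            if self.pending[level] is None:
                # No partner yet: park the value and wait for the next one.
                self.pending[level] = value
                return
            left = self.pending[level]
            self.pending[level] = None
            # Pair complete: emit one detail, pass the smooth value upward.
            self.details[level].append((left - value) / math.sqrt(2))
            value = (left + value) / math.sqrt(2)
            level += 1
```

Feeding the values 4, 2, 6, 8 one at a time yields two finest-level details, one second-level detail, and a single pending scaling value at the top level; the sum of squares of these coefficients equals the sum of squares of the inputs, as expected for an orthonormal transform.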
Overview
• Introduction / Related work
• Background
• Main idea
• Experimental results
AWSOM
[Figure: the stream x_t and its wavelet coefficients W1,1 to W1,4, W2,1, W2,2, W3,1 and V4,1 laid out in the time-frequency plane; axes: time, frequency]
AWSOM - idea
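[Figure: recent coefficients W_{l,t-2}, W_{l,t-1}, W_{l,t} at level l, and W_{l',t'-2}, W_{l',t'-1}, W_{l',t'} at level l']
$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$
$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$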
More details…
• Update of wavelet coefficients (incremental)
• Update of linear models (incremental; RLS)
• Feature selection
(single-pass)
– Not all correlations are significant
– Throw away the insignificant ones (“noise”)
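The incremental model update can be sketched with a textbook recursive least squares (RLS) routine. This is a minimal sketch, not the authors' implementation: the class `RLS`, the helper `fit_ar`, the initial value `delta` and the forgetting factor `lam` are illustrative assumptions. The idea is that one such small model would be kept per wavelet level, regressing each coefficient on its k predecessors; each update costs O(k^2) and never revisits old data.

```python
class RLS:
    """Textbook recursive least squares for y ~ beta . x with k regressors."""

    def __init__(self, k, delta=1e3, lam=1.0):
        self.k = k
        self.lam = lam                  # forgetting factor; 1.0 means no forgetting
        self.beta = [0.0] * k
        # P tracks the inverse of the (regularized) regressor Gram matrix.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def predict(self, x):
        return sum(b * xi for b, xi in zip(self.beta, x))

    def update(self, x, y):
        k = self.k
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = self.lam + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - self.predict(x)       # error of the prediction made before updating
        self.beta = [b + g * err for b, g in zip(self.beta, gain)]
        # P <- (P - gain * x^T P) / lam; P is symmetric, so x^T P equals Px^T.
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.lam for j in range(k)]
                  for i in range(k)]


def fit_ar(series, k=2):
    """Fit series[t] ~ sum_j beta[j-1] * series[t-j] in a single pass."""
    model = RLS(k)
    for t in range(k, len(series)):
        lagged = [series[t - j] for j in range(1, k + 1)]   # most recent first
        model.update(lagged, series[t])
    return model.beta
```

For a pure sinusoid sampled at angular step 0.3, which satisfies x_t = 2 cos(0.3) x_{t-1} - x_{t-2}, `fit_ar` recovers coefficients close to (2 cos 0.3, -1) after a couple of hundred points.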
Complexity
• Model update
Space: $O(\lg N + m k^2) \approx O(\lg N)$
Time: $O(k^2) \approx O(1)$
Where
– N: number of points (so far)
– k: number of regression coefficients; fixed
– m: number of linear models; $O(\lg N)$
Overview
• Introduction / Related work
• Background
• Main idea
• Experimental results
Results - Synthetic data
[Figure: forecasts on synthetic data; panels: AWSOM, AR, Seasonal AR]
• Triangle pulse
• Mix (sine + square)
• AR captures wrong trend (or none)
• Seasonal AR estimation fails
Results - Real data
• Automobile traffic
– Daily periodicity
– Bursty “noise” at smaller scales
• AR fails to capture any trend
• Seasonal AR estimation fails
Results - real data

• Sunspot intensity
– Slightly time-varying “period”
• AR captures wrong trend
• Seasonal ARIMA
– wrong downward trend, despite help by human!
Conclusions
• Adapt and handle arbitrary periodic components
and
• nimble
– Limited memory (logarithmic)
– Constant-time update
• on-line, any-time
– Single pass over the data
• automatic: No human intervention/tuning
Outline
• Problem and motivation
• Single-sequence mining: AWSOM
• Co-evolving sequences: SPIRIT
• Lag correlations: BRAID
• Conclusions
Part 2
SPIRIT: Mining co-evolving
streams
[Papadimitriou, Sun, Faloutsos, VLDB05]
Motivation
• E.g., chlorine concentration in a water distribution network
Motivation
[Figure: chlorine concentrations over time (Phases 1 to 3) at sensors in a water distribution network under normal operation]
May have hundreds of measurements, but it is unlikely they are completely unrelated!
Motivation
[Figure: chlorine concentrations in Phases 1 to 3 for sensors near the leak and sensors away from the leak; water distribution network under normal operation, then with a major leak]
Motivation
[Figure: chlorine concentrations; actual measurements (n streams) next to the k hidden variable(s); k=1 in Phase 1, k=2 in Phase 2, k=1 in Phase 3]
We would like to discover a few “hidden (latent) variables” that summarize the key trends
Goals
• Discover “hidden” (latent) variables for:
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies
and the usual:
• nimble: Limited memory requirements
• on-line, any-time: (single pass etc)
• automatic: No special parameters to tune
Related work
Stream mining
• Stream SVD [Guha, Gunopulos, Koudas / KDD03]
• StatStream [Zhu, Shasha / VLDB02]
• Clustering
[Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE],
[Lin, Vlachos, Keogh, Gunopulos / EDBT04],
• Classification
[Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]
Related work
Stream mining
• Piecewise approximations
[Palpanas, Vlachos, Keogh, et al / ICDE 2004]
• Queries on streams
[Dobra, Garofalakis, Gehrke, et al / SIGMOD02],
[Madden, Franklin, Hellerstein, et al / OSDI02],
[Considine, Li, Kollios, et al / ICDE04],
[Hammad, Aref, Elmagarmid / SSDBM03]
• …
Overview
Part 2
• Method
• Experiments
• Conclusions & Other work
Stream correlations
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when
we have a very large number of points?
• Step 3: How to dynamically adjust the
number of hidden variables?
1. How to capture correlations?
[Figure: first sensor; temperature T1 vs. time, between 20°C and 30°C]
1. How to capture correlations?
[Figure: first sensor and second sensor; temperatures T1 and T2 vs. time, between 20°C and 30°C]
1. How to capture correlations
Correlations: let’s take a closer look at the first three value-pairs…
[Figure: scatter plot of temperature T2 vs. temperature T1, both from 20°C to 30°C]
1. How to capture correlations
First three (time=1, time=2, time=3) lie (almost) on a line in the space of value-pairs…
• O(n) numbers for the slope, and
• One number for each value-pair (offset on line)
[Figure: scatter plot of temperature T2 vs. temperature T1, both from 20°C to 30°C, with the first three value-pairs marked]
1. How to capture correlations
Other pairs also follow the same pattern: they lie (approximately) on this line
[Figure: scatter plot of temperature T2 vs. temperature T1, both from 20°C to 30°C, with all value-pairs close to one line]
Stream correlations
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when
we have a very large number of points?
• Step 3: How to dynamically adjust the
number of hidden variables?
Incremental updates
[Figure: temperature T2 vs. temperature T1, both from 20°C to 30°C; a new value-pair and its error relative to the current line]
Incremental updates
• Algorithm runs in O(n), where n = # of streams
• no need to access old data
[Figure: temperature T2 vs. temperature T1, both from 20°C to 30°C; the line is adjusted to reduce the error]
Stream correlations
Principal Component Analysis (PCA)
• The “line” is the first principal component
(PC)
• This line is optimal: it minimizes the sum of
squared projection errors
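As a batch reference point for the claim above, the first principal component of a small set of value-pairs can be computed with power iteration on the covariance matrix. This is a minimal sketch under assumptions: the function names are hypothetical, the data fits in memory (unlike the streaming setting of the talk), and power iteration is used only because it needs no external library.

```python
def first_principal_component(points, iters=200):
    """Return (mean, w): the data mean and a unit vector along the first PC.

    Uses power iteration on the covariance matrix. It assumes the starting
    vector (all ones) is not orthogonal to the first PC, which holds for
    positively correlated streams such as the temperature example.
    """
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    w = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return mean, w


def squared_projection_error(points, mean, w):
    """Sum of squared distances from the points to the line mean + y * w."""
    total = 0.0
    for p in points:
        c = [pj - mj for pj, mj in zip(p, mean)]
        y = sum(cj * wj for cj, wj in zip(c, w))        # offset along the line
        total += sum((cj - y * wj) ** 2 for cj, wj in zip(c, w))
    return total
```

For value-pairs that lie exactly on a line of slope 1, such as (20, 21), (22, 23), (25, 26), (30, 31), the returned direction has equal components and the squared projection error is zero up to rounding.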
2. Incremental update
Given number of hidden variables k
• Assuming k is known
• We know how to update the slope
For each new point x and for i = 1, …, k:
• $y_i := w_i^T x$ (proj. onto $w_i$)
• $d_i \leftarrow \lambda d_i + y_i^2$ (energy $\propto$ i-th eigenval.)
• $e_i := x - y_i w_i$ (error)
• $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)
• $x \leftarrow x - y_i w_i$ (repeat with remainder)
[Figure: new point x, its projection y1 onto w1, the error e1, and the updated w1]
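The per-point update listed on the incremental-update slide can be sketched in a few lines. This is a sketch under stated assumptions, not the authors' code: the function name `spirit_update` is hypothetical, the forgetting factor `lam` in the energy update is read from the partly garbled slide text (use `lam=1.0` for a plain running sum), and the remainder is computed with the already updated direction.

```python
def spirit_update(x, W, d, lam=0.96):
    """One tracking step for a new point x (one value per stream).

    W is a list of k direction estimates (lists of length n) and d a list of
    k energy estimates; both are modified in place. Returns the k hidden
    variable values. Cost is O(n k) per point; no old data is accessed.
    """
    x = list(x)                                   # remainder, deflated per direction
    ys = []
    for i in range(len(W)):
        w = W[i]
        y = sum(wj * xj for wj, xj in zip(w, x))  # projection onto w_i
        d[i] = lam * d[i] + y * y                 # energy estimate
        e = [xj - y * wj for xj, wj in zip(x, w)]                  # error
        W[i] = [wj + (y / d[i]) * ej for wj, ej in zip(w, e)]      # update estimate
        x = [xj - y * wj for xj, wj in zip(x, W[i])]               # repeat with remainder
        ys.append(y)
    return ys
```

A possible usage: start from `W = [[1.0, 0.0]]` and a small positive `d = [1e-3]`, then feed points that are multiples of (0.6, 0.8); the single tracked direction turns toward (0.6, 0.8) within a few updates. Its norm approaches 1 only gradually, so comparisons are best made on direction.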
Stream correlations
• Step 1: How to capture correlations?
• Step 2: How to do it incrementally, when
we have a very large number of points?
• Step 3: How to dynamically adjust k, the
number of hidden variables?
Answer
• When the reconstruction accuracy is too
low (say, <95%)
• then introduce another hidden variable
(k++)
• [How to initialize its values: tricky]
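The rule above can be written as a small decision function. This is a hedged sketch: the name `adjust_k` and its parameters are hypothetical, "reconstruction accuracy" is taken to mean the fraction of input energy retained by the hidden variables, and the upper threshold with the matching decrement is an assumption that the slide does not state. Initializing the new hidden variable, which the slide calls tricky, is not covered.

```python
def adjust_k(k, input_energy, retained_energy, low=0.95, high=0.98, k_max=None):
    """Return the new number of hidden variables.

    input_energy: running sum of squared stream values.
    retained_energy: running sum of squared hidden-variable values.
    low: accuracy below which one hidden variable is added (from the slide).
    high: accuracy above which one is dropped (assumption, not on the slide).
    """
    if input_energy <= 0:
        return k                      # nothing observed yet
    ratio = retained_energy / input_energy
    if ratio < low and (k_max is None or k < k_max):
        return k + 1                  # reconstruction too poor: k++
    if ratio > high and k > 1:
        return k - 1                  # more hidden variables than needed
    return k
```

Keeping a gap between the two thresholds avoids flipping k back and forth when the retained energy hovers around a single cut-off.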
Missing values
[Figure: temperature T2 vs. temperature T1, both from 20°C to 30°C, showing: the true values (pair); all possible value pairs (given only t1); and the best guess (given correlations: intersection)]
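For the two-stream case in the figure, the "intersection" guess reduces to one line of arithmetic. This is a minimal sketch under assumptions: the function name is hypothetical, and the correlation line is described by a point on it (`mean`) and a direction `w`, for example the first principal component.

```python
def guess_missing_t2(t1, mean, w):
    """Best guess for T2 when only T1 = t1 was observed.

    Intersects the vertical line T1 = t1 (all pairs compatible with the
    observation) with the correlation line through `mean` with direction `w`.
    """
    if abs(w[0]) < 1e-12:
        raise ValueError("line is vertical: T1 does not determine T2")
    return mean[1] + (t1 - mean[0]) * w[1] / w[0]
```

With a line through (25, 24) and direction (0.6, 0.8), an observed T1 of 26 gives a guess of 24 + 4/3 for T2.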
Forecasting
• Assume we want to forecast the next value for a particular stream (e.g. auto-regression)
[Figure: n streams; the next value of one stream is marked “?”]
Forecasting
• Option 1: One complex model per stream
– Next value = function of previous values on all streams
– Captures correlations
– Too costly! [ ~ O(n^3) ]
[Figure: n streams]
Forecasting
• Option 1: One complex model per stream
• Option 2: One simple model per stream
– Next value = function of previous value on same stream
– Worse accuracy, but maybe acceptable
– But, still need n models
[Figure: n streams]
Forecasting
[Figure: n streams summarized by k hidden vars, k << n]
• Only k simple models, on the hidden variables
• and already capture correlations
• Efficiency & robustness
Time/space requirements
Incremental PCA
O(nk) space (total) and time (per tuple), i.e.,
• Independent of # points
• Linear w.r.t. # streams (n)
• Linear w.r.t. # hidden variables (k)
In fact,
• Can be done in real time
Overview
Part 2
• Method
• Experiments
• Conclusions & Other work
Experiments
Chlorine concentration [CMU Civil Engineering]
• 166 streams
• 2 hidden variables (~4% error)
[Figure: measurements and reconstruction]
Experiments
Chlorine concentration: hidden variables [CMU Civil Engineering]
• Both capture global, periodic pattern
• Second: ~ first, but phase-shifted
• Can express any phase-shift…
Experiments
Light measurements
• 54 sensors
• 2-4 hidden variables (~6% error)
[Figure: measurement and reconstruction]
Experiments
Light measurements: hidden variables
• 1 & 2: main trend (as before)
• 3 & 4: potential anomalies and outliers
[Figure: the four hidden variables; two of them are labeled “intermittent”]
Conclusions
SPIRIT:
Discovers hidden variables for
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies
Incremental, real time computation
nimble: With limited memory
automatic: No special parameters to tune
Outline
• Problem and motivation
• Single-sequence mining: AWSOM
• Co-evolving sequences: SPIRIT
• Lag correlations: BRAID
• Conclusions
Part 3:
BRAID: Discovering Lag
Correlations in Multiple Streams
Yasushi Sakurai,
Spiros Papadimitriou,
Christos Faloutsos
SIGMOD’05
Lag Correlations
• Examples
– A decrease in interest rates typically precedes
an increase in house sales by a few months
– Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later
Lag Correlations
• Example of lag-correlated sequences
These sequences are correlated
with lag l=1300 time-ticks
CCF (Cross-Correlation Function)
Lag Correlations
• Example of lag-correlated sequences
How to compute it
• quickly
• cheaply
• incrementally
[Figure: CCF (Cross-Correlation Function)]
Challenging Problems
• Problem definitions
– Given two co-evolving sequences X and Y, determine
  • Whether there is a lag correlation
  • If yes, what is the lag length l
– Given k numerical sequences, X1,…,Xk, report
  • Which pairs have a lag correlation
  • The corresponding lag for each pair
Our solution
• Ideal characteristics:
– ‘Any-time’ processing, and fast
Computation time per time tick is constant
– Nimble
Memory space requirement is sub-linear in the sequence length
– Accurate
Approximation introduces small error
Related Work
• Sequence indexing
– Agrawal et al. (FODO 1993)
– Faloutsos et al. (SIGMOD 1994)
– Keogh et al. (SIGMOD 2001)
• Compression (wavelet and random projections)
– Gilbert et al. (VLDB 2001), Guha et al. (VLDB 2004)
– Dobra et al.(SIGMOD 2002), Ganguly et al.(SIGMOD 2003)
• Data Stream Management
– Abadi et al. (VLDB Journal 2003)
– Motwani et al. (CIDR 2003)
– Chandrasekaran et al. (CIDR 2003)
– Cranor et al. (SIGMOD 2003)
Related Work
• Pattern discovery
– Clustering for data streams
Guha et al. (TKDE 2003)
– Monitoring multiple streams
Zhu et al. (VLDB 2002)
– Forecasting
Yi et al. (ICDE 2000)
Papadimitriou et al. (VLDB 2003)
• None of the previously published methods focuses on this problem
Overview
• Introduction / Related work
• Background
• Main ideas
• Theoretical analysis
• Experimental results
Main Idea (1)
• Incremental computation
– Sufficient statistics
  • Sum of X: $S_x(1, n) = \sum_{t=1}^{n} x_t$
  • Square sum of X: $S_{xx}(1, n) = \sum_{t=1}^{n} x_t^2$
  • Inner-product for X and the shifted Y: $S_{xy}(l) = \sum_{t=l+1}^{n} x_t\, y_{t-l}$
– Compute R(l) incrementally:
  $R(l) = \dfrac{C(l)}{\sqrt{V_x(l+1, n)\; V_y(1, n-l)}}$
  • Covariance of X and Y: $C(l) = S_{xy}(l) - \dfrac{S_x(l+1, n)\; S_y(1, n-l)}{n-l}$
  • Variance of X: $V_x(l+1, n) = S_{xx}(l+1, n) - \dfrac{(S_x(l+1, n))^2}{n-l}$
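The sufficient statistics lend themselves to a direct incremental implementation. The sketch below is an exact version for a fixed set of lags, without the smoothing and interpolation that follow in the talk, so its memory is proportional to the largest lag rather than logarithmic. The class name `LagCorrelator` and its layout are assumptions for illustration; the formulas are the sums $S_x$, $S_{xx}$, $S_{xy}(l)$, the covariance $C(l)$ and the variances $V_x$, $V_y$, with R(l) taken as C(l) divided by the square root of the product of the two variances.

```python
from collections import deque


class LagCorrelator:
    """Track R(l) for a fixed set of lags from running sums (exact sketch)."""

    def __init__(self, lags):
        self.lags = list(lags)
        self.n = 0
        self.sx = self.sxx = self.sy = self.syy = 0.0     # sums over t = 1..n
        self.sxy = {l: 0.0 for l in self.lags}            # S_xy(l)
        self.x_head = {}                                  # l -> (S_x(1,l), S_xx(1,l))
        self.y_tail = {l: [0.0, 0.0] for l in self.lags}  # sums over the last l values of y
        self.recent_y = deque(maxlen=max(self.lags) + 1)  # recent_y[j] == y_{n-j}

    def update(self, x, y):
        self.n += 1
        self.sx += x
        self.sxx += x * x
        self.sy += y
        self.syy += y * y
        self.recent_y.appendleft(y)
        for l in self.lags:
            tail = self.y_tail[l]
            tail[0] += y
            tail[1] += y * y
            if self.n == l:
                self.x_head[l] = (self.sx, self.sxx)      # prefix of X, frozen from now on
            if self.n > l:
                old = self.recent_y[l]                    # y_{n-l}
                tail[0] -= old                            # keep only the last l values
                tail[1] -= old * old
                self.sxy[l] += x * old                    # x_n * y_{n-l}

    def correlation(self, l):
        m = self.n - l                                    # number of overlapping pairs
        if m < 2:
            return None
        hx, hxx = self.x_head.get(l, (0.0, 0.0))
        sx, sxx = self.sx - hx, self.sxx - hxx            # S_x(l+1, n), S_xx(l+1, n)
        sy = self.sy - self.y_tail[l][0]                  # S_y(1, n-l)
        syy = self.syy - self.y_tail[l][1]
        c = self.sxy[l] - sx * sy / m
        vx = sxx - sx * sx / m
        vy = syy - sy * sy / m
        if vx <= 0 or vy <= 0:
            return None
        return c / (vx * vy) ** 0.5
```

Each `update` costs O(number of lags) and never rescans the sequences. If X is a copy of Y delayed by 3 ticks, `correlation(3)` is 1 up to rounding and exceeds `correlation(0)`.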
Main Idea (2)
• Sequence smoothing
– Means of windows for each level
– Sufficient statistics computed from the means
– CCF computed from the sufficient statistics
– But, it allows a partial redundancy
[Figure: windows per level (starting at h=0) over time up to t=n, and the resulting correlation vs. lag]
Main Idea (3)
• Geometric lag probing
– Use colored windows
– Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}
– Use a cubic spline to interpolate
[Figure: windows per level (starting at h=0) over time up to t=n, and the resulting correlation vs. lag]
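The saving from geometric lag probing is easy to quantify with a short sketch. The function name `geometric_lags` is hypothetical, and only the choice of probe lags is shown; the cubic-spline interpolation between them is left out.

```python
def geometric_lags(n):
    """Probe lags 0, 1, 2, 4, ..., 2^h that are smaller than n."""
    lags = [0]
    l = 1
    while l < n:
        lags.append(l)
        l *= 2
    return lags
```

For a sequence of one million points this keeps 21 probe lags instead of one million candidate lags, which is where the logarithmic number of tracked statistics comes from.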
Overview
• Introduction / Related work
• Background
• Main ideas
• Theoretical analysis
• Experimental results
Experimental results
• Setup
– Intel Xeon 2.8GHz, 1GB memory, Linux
– Datasets: Sines, SpikeTrains, Humidity, Light, Temperature, Kursk, Sunspots
– Enhanced BRAID, b=16
• Evaluation
– Estimation error of lag correlations
– Computation time
Detecting Lag Correlations (2)
• SpikeTrains
BRAID closely estimates the correlation coefficients
[Figure: CCF (Cross-Correlation Function)]
Detecting Lag Correlations (3)
• Humidity
BRAID closely estimates the correlation coefficients
[Figure: CCF (Cross-Correlation Function)]
Detecting Lag Correlations (4)
• Light
BRAID closely estimates the correlation coefficients
[Figure: CCF (Cross-Correlation Function)]
Detecting Lag Correlations (5)
• Kursk
BRAID closely estimates the correlation coefficients
[Figure: CCF (Cross-Correlation Function)]
Estimation Error
Datasets       Lag correlation   Lag correlation   Estimation
               (Naive)           (BRAID)           error (%)
Sines           716               716              0.000
SpikeTrains    2841              2830              0.387
Humidity       3842              3855              0.338
Light           567               570              0.529
Kursk          1463              1472              0.615
Sunspots       1156              1168              1.038
• Largest relative error is about 1%
Performance
• Almost linear w.r.t. sequence length
• Up to 40,000 times faster
Group Lag Correlations
• Two correlated pairs from 55 Temperature sequences
• Each sensor is located in a different place
[Figure: sequences #16 and #19; estimation of CCF of #16 and #19]
[Figure: sequences #47 and #48; estimation of CCF of #47 and #48]
Conclusions
Automatic lag correlation detection on stream data
• incremental – online, ‘any-time’
• nimble
– O(log n) space, O(1) time to update the statistics
– Up to 40,000 times faster than the naive implementation
• Accurate
– Detecting the correct lag within 1% relative error or less
Overall Conclusions
• Mining streaming numerical data: challenging!
• Extensions: streaming matrix data (e.g., network traffic matrix)
[Figure: stream of matrices; axes: time, IP-source]
Thank you
• christos <at> cs.cmu.edu
• www.cs.cmu.edu/~christos
• [InteMon demo]