Sensor and Graph Mining Christos Faloutsos Carnegie Mellon University & IBM www.cs.cmu.edu/~christos

School of Computer Science
Carnegie Mellon
Sensor and Graph Mining
Christos Faloutsos
Carnegie Mellon University & IBM
www.cs.cmu.edu/~christos
INTEL 04
C. Faloutsos
1
Joint work with
• Anthony Brockwell (CMU/Stat)
• Deepayan Chakrabarti (CMU)
• Spiros Papadimitriou (CMU)
• Chenxi Wang (CMU)
• Yang Wang (CMU)
Outline
• Introduction - motivation
• Problem #1: Stream Mining
– Motivation
– Main idea
– Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions
Introduction
• Sensor devices
– Temperature, weather measurements
– Road traffic data
– Geological observations
– Patient physiological data
• Embedded devices
– Network routers
– Intelligent (active) disks
Introduction
• Limited resources
– Memory
– Bandwidth
– Power
– CPU
• Remote environments
– No human intervention
Introduction – problem definition
• Given a semi-infinite stream of values (time series) x1, x2, …, xt, …
• Find patterns, forecasts, outliers…
Introduction
• E.g.,
[Figure: example time series, annotated “Periodicity? (twice daily)”, “Noise”??, “Periodicity? (daily)”]
Introduction
• Can we capture these patterns
– automatically
– with limited resources?
[Figure: example time series, annotated “Periodicity? (twice daily)”, “Noise”??, “Periodicity? (daily)”]
Related work
Statistics: Time series forecasting
• Main problem:
“[…] The first step in the analysis of any time
series is to plot the data [and inspect the graph]”
[Brockwell 91]
• Typically:
• Resource intensive
• Cannot update online
• AR(I)MA and seasonal variants
• ARFIMA, GARCH, …
Related work
Databases: Continuous Queries
• Typically, different focus:
– “Compression”
– Not generative models
• Largely orthogonal problem…
– Gilbert, Guha, Indyk et al. (STOC 2002)
– Garofalakis, Gibbons (SIGMOD 2002)
– Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003)
– Gehrke, Korn, et al. (SIGMOD 2001); Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002)
– Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et al. (SODA 2002)
– Madden+ [SIGMOD02], [SIGMOD03]
Goals
• Adapt and handle arbitrary periodic components
• No human intervention/tuning
Also:
• Single pass over the data
• Limited memory (logarithmic)
• Constant-time update
Outline
• Introduction - motivation
• Problem #1: Stream Mining
– Motivation
– Main idea
– Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions
Wavelets
[Figure: “Straight” signal xt plotted against time t, split into the time intervals I1 through I8]
Wavelets
Introduction – Haar
[Figure: Haar decomposition of xt on time and frequency axes, showing the coefficients W1,1 to W1,4, W2,1, W2,2, W3,1 and V4,1]
Wavelets
• So?
• Wavelets compress many real signals
well…
– Image compression and processing
– Vision; Astronomy, seismology, …
• Wavelet coefficients can be updated as new
points arrive [Kotidis+]
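
A minimal sketch of a Haar decomposition, offered as an illustration rather than the talk's implementation. It assumes an orthonormal normalization (division by sqrt(2)), the sign convention "left minus right" for details, and a signal length that is a power of two. The name `haar_transform` and the sample signal are hypothetical.

```python
import math

def haar_transform(x):
    """Return (details, smooth) for a signal whose length is a power of two.

    details[l - 1] holds the level-l detail coefficients W_{l,1}, W_{l,2}, ...;
    smooth holds the single remaining scaling coefficient V.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    approx = list(x)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Orthonormal Haar step: scaled difference (detail), scaled sum (smooth).
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, approx

# Hypothetical periodic signal (period 4, two repetitions).
signal = [2.0, 4.0, 6.0, 8.0, 2.0, 4.0, 6.0, 8.0]
details, smooth = haar_transform(signal)
for level, coeffs in enumerate(details, start=1):
    print(f"level {level}:", [round(c, 3) for c in coeffs])
print("smooth:", [round(c, 3) for c in smooth])
```

For a periodic input like this one, coefficients within a level repeat, which is the kind of regularity the following slides on correlations exploit.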
Wavelets
Correlations
[Figure: the same Haar decomposition (W1,1 to W1,4, W2,1, W2,2, W3,1, V4,1; time and frequency axes), with an “=” sign next to W2,2 marking coefficients that are equal]
Main idea
Correlations
• Wavelets are good…
• …we can do even better
– One number…
– …and the fact that they are
equal/correlated
Proposed method
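Level l: $W_{l,t-2},\ W_{l,t-1},\ W_{l,t}$
Level l': $W_{l',t'-2},\ W_{l',t'-1},\ W_{l',t'}$
$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$
$W_{l',t'} \approx \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$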
Small windows suffice… (k~4)
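
A rough sketch of the idea on this slide: fit a small linear model that predicts a wavelet coefficient from the previous k coefficients at the same level. This is a batch least-squares fit via the normal equations, used here only for illustration; it is not the incremental procedure of the paper. The names `solve_linear` and `fit_ar`, the window size k = 2, and the synthetic coefficient sequence are all assumptions.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_ar(series, k):
    """Least-squares fit of series[t] ~ sum_i beta[i] * series[t - 1 - i]."""
    rows = [[series[t - 1 - i] for i in range(k)] for t in range(k, len(series))]
    targets = series[k:]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear(xtx, xty)

# Hypothetical coefficients at one level, generated by
# w[t] = 1.5 * w[t-1] - 0.9 * w[t-2] (a damped oscillation).
coeffs = [1.0, 0.5]
for _ in range(30):
    coeffs.append(1.5 * coeffs[-1] - 0.9 * coeffs[-2])

beta = fit_ar(coeffs, k=2)
print([round(b, 3) for b in beta])
```

Because the synthetic sequence is noise-free, the fit should recover the generating coefficients up to rounding error.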
More details…
• Update of wavelet coefficients (incremental)
• Update of linear models (incremental; RLS)
• Feature selection
(single-pass)
– Not all correlations are significant
– Throw away the insignificant ones
– very important!!
[see paper]
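
The incremental update of the linear models can be sketched with textbook recursive least squares (RLS). This is a generic RLS without a forgetting factor, written under my own assumptions; the paper's exact variant may differ. The class name `RLS`, the initial value `delta`, and the synthetic data are hypothetical.

```python
class RLS:
    """Recursive least squares for y ~ beta . x, one sample at a time."""

    def __init__(self, k, delta=1e6):
        self.beta = [0.0] * k
        # P tracks the inverse of X^T X; a large diagonal means "no prior".
        self.p = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def update(self, x, y):
        """Fold one (x, y) sample into the model; cost is O(k^2)."""
        k = len(x)
        px = [sum(self.p[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * px[i] for i in range(k))
        gain = [v / denom for v in px]
        err = y - sum(b * xi for b, xi in zip(self.beta, x))
        self.beta = [b + g * err for b, g in zip(self.beta, gain)]
        # P <- P - gain * (x^T P); P is symmetric, so x^T P equals px.
        self.p = [[self.p[i][j] - gain[i] * px[j] for j in range(k)]
                  for i in range(k)]
        return err

# Hypothetical stream: w[t] = 1.5 * w[t-1] - 0.9 * w[t-2].
series = [1.0, 0.5]
for _ in range(40):
    series.append(1.5 * series[-1] - 0.9 * series[-2])

model = RLS(k=2)
for t in range(2, len(series)):
    model.update([series[t - 1], series[t - 2]], series[t])
print([round(b, 3) for b in model.beta])
```

Each update touches only the k-by-k matrix P and the k coefficients, so no past samples need to be stored.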
SKIP
Complexity
• Model update
Space: $O(\lg N + m k^2) \approx O(\lg N)$
Time: $O(k^2) \approx O(1)$
Where
– N: number of points (so far)
– k: number of regression coefficients; fixed
– m: number of linear models; $O(\lg N)$
[see paper]
Outline
• Introduction - motivation
• Problem #1: Stream Mining
– Motivation
– Main idea
– Experimental results
• Problem #2: Graphs & Virus propagation
• Conclusions
Setup
• First half used for model estimation
• Models applied forward to forecast entire
second half
• AR, Seasonal AR (SAR): R
– Simplest possible estimation – no maximum
likelihood estimation (MLE), etc.
• … vs. Python scripts
Results
Synthetic data – Triangle pulse
• Triangle pulse
• AR captures wrong trend (or none)
• Seasonal AR (SAR) estimation fails
Results
Synthetic data – Mix
• Mix (sine + square pulse)
• AR captures wrong trend (or none)
• Seasonal AR estimation fails
Results
Real data – Automobile
(filtered)
• Automobile traffic
– Daily periodicity with rush-hour peaks
– Bursty “noise” at smaller time scales
Results
Real data – Automobile
• Automobile traffic
– Daily periodicity with rush-hour peaks
– Bursty “noise” at smaller time scales
• AR fails to capture any trend (average)
• Seasonal AR estimation fails
Results
Real data – Automobile
• Automobile traffic
– Daily periodicity with rush-hour peaks
– Bursty “noise” at smaller time scales
• AWSOM spots periodicities, automatically
Results
Real data – Automobile
• Automobile traffic
– Daily periodicity with rush-hour peaks
– Bursty “noise” at smaller time scales
• Generation with identified noise
Results
Real data – Sunspot
• Sunspot intensity – Slightly time-varying “period”
• AR captures wrong trend (average)
• Seasonal ARIMA
– Captures immediate wrong downward trend
– Requires human to determine seasonal component period (fixed)
Results
Real data – Sunspot
• Sunspot intensity – Slightly time-varying “period”
Estimation: 40 minutes (R) vs. 9 seconds (Python)
SKIP
Variance
~Hurst exponent
~ 1 hour
• Variance (log-power) vs. scale:
– “Noise” diagnostic (if decreasing linear…)
– Can use to estimate noise parameters
Running time
[Figure: running time, time (t) vs. stream size (N)]
Space requirements
Equal total number of model parameters
Conclusion
Adapt and handle arbitrary periodic
components
No human intervention/tuning
Single pass over the data
Limited memory (logarithmic)
Constant-time update
Conclusion
No human:
Adapt and handle arbitrary periodic components
No human intervention/tuning
Limited resources:
Single pass over the data
Limited memory (logarithmic)
Constant-time update
Outline
• Introduction - motivation
• Problem #1: Streams
• Problem #2: Graphs & Virus propagation
– Motivation & problem definition
– Related work
– Main idea
– Experiments
• Conclusions
Introduction
Internet Map
[lumeta.com]
Food Web
[Martinez ’91]
Protein Interactions
[genomebiology.com]
► Graphs are ubiquitous
Friendship Network
[Moody ’01]
Introduction
• What can we do with graph analysis?
– Immunization;
– Information Dissemination
– network value of a customer [Domingos+]
[Figure: “Needle exchange” networks of drug users, with “bridges” marked [Weeks et al. 2002]]
Problem definition
• Q1: How does a virus spread across an
arbitrary network?
• Q2: will it create an epidemic?
• (in a sensor setting, with a ‘gossip’
protocol, will a rumor/query spread?)
Framework
• Susceptible-Infected-Susceptible (SIS) model
– Cured nodes immediately become susceptible
[Figure: state diagram with two states, “Susceptible/healthy” and “Infected & infectious”, and two transitions, “Infected by neighbor” and “Cured internally”]
The model
• (virus) Birth rate β: probability that an infected neighbor attacks
• (virus) Death rate δ: probability that an infected node heals
[Figure: node N with neighbors N1, N2, N3; labels: Healthy, Infected, Prob. β, Prob. δ]
Epidemic threshold τ
Defined as the value of τ such that
if β/δ < τ
an epidemic cannot happen
Thus,
• given a graph
• compute its epidemic threshold
Epidemic threshold τ
What should τ depend on?
• avg. degree? and/or highest degree?
• and/or variance of degree?
• and/or determinant of the adjacency matrix?
Basic Homogeneous Model
Homogeneous graphs [Kephart-White ’91,
’93]
• Epidemic threshold = 1/<k>
• Homogeneous connectivity <k>, i.e., all nodes have ~same degree → unrealistic
Power-law Networks
• Model for Barabási-Albert networks
– [Pastor-Satorras & Vespignani, ’01, ’02]
– Epidemic threshold = <k> / <k^2>
– only for BA-type networks, with γ = 3 (γ = power-law exponent)
Epidemic threshold
• Homogeneous graphs: 1/<k>
• BA (γ=3): <k> / <k^2>
• more complicated graphs: ?
• arbitrary, REAL graphs: ?
• how many parameters??
Epidemic threshold
• [Theorem] We have no epidemic, if
$\beta / \delta < \tau = 1 / \lambda_{1,A}$
where
– β: attack prob.
– δ: recovery prob.
– τ: epidemic threshold
– $\lambda_{1,A}$: largest eigenvalue of adj. matrix A
Proof: [Wang+03]
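
A small sketch of how the threshold in the theorem could be evaluated for a given undirected graph. It estimates the largest adjacency eigenvalue by power iteration on A + I (the shift is my choice, so that bipartite graphs also converge) and assumes a connected graph stored as an adjacency-list dictionary. The function names and the example graph are hypothetical, not from [Wang+03].

```python
import math

def largest_eigenvalue(adj, iters=500):
    """Estimate the largest adjacency eigenvalue of a connected undirected graph.

    adj maps each node to the list of its neighbors.
    """
    nodes = list(adj)
    v = {u: 1.0 / math.sqrt(len(nodes)) for u in nodes}
    for _ in range(iters):
        # One step of power iteration with the shifted matrix A + I.
        w = {u: v[u] + sum(v[n] for n in adj[u]) for u in nodes}
        norm = math.sqrt(sum(val * val for val in w.values()))
        v = {u: val / norm for u, val in w.items()}
    # Rayleigh quotient v^T A v for the unit-length vector v.
    return sum(v[u] * sum(v[n] for n in adj[u]) for u in nodes)

def epidemic_threshold(adj):
    """tau = 1 / lambda_{1,A}."""
    return 1.0 / largest_eigenvalue(adj)

def no_epidemic(adj, beta, delta):
    """True when beta / delta is below the threshold tau."""
    return beta / delta < epidemic_threshold(adj)

# Hypothetical example: complete graph on 4 nodes (largest eigenvalue 3).
k4 = {i: [j for j in range(4) if j != i] for i in range(4)}
print(round(epidemic_threshold(k4), 4))
print(no_epidemic(k4, beta=0.1, delta=0.5))
print(no_epidemic(k4, beta=0.2, delta=0.5))
```

For large sparse graphs a library eigensolver would be the more practical choice; the loop above only illustrates that a single number per graph is needed.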
Epidemic threshold for various
networks
• sanity checks / older results:
• Homogeneous networks
– λ1,A = <k>; τ = 1/<k>
– where <k> = average degree
– This is the same result as that of Kephart & White!
Epidemic threshold for various
networks
• sanity checks / older results:
• Star networks
– λ1,A = sqrt(d); τ = 1/ sqrt(d)
– where d = the degree of the central node
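
The two sanity checks so far (homogeneous and star networks) can be verified numerically by exhibiting an eigenvector. For a connected graph, an eigenvector with all-positive entries belongs to the largest eigenvalue, so it suffices to check A v = λ v for a positive v. The sketch below does that; the helper names, the ring used as a homogeneous graph, and the sizes are my assumptions.

```python
import math

def matvec(adj, v):
    """Multiply the adjacency matrix (given as neighbor lists) by vector v."""
    return {u: sum(v[n] for n in adj[u]) for u in adj}

def is_eigenpair(adj, v, lam, tol=1e-9):
    """Check A v = lam * v entry by entry."""
    av = matvec(adj, v)
    return all(abs(av[u] - lam * v[u]) <= tol for u in adj)

# Star: hub 0 with d spokes; claimed lambda = sqrt(d).
d = 99
star = {0: list(range(1, d + 1)), **{i: [0] for i in range(1, d + 1)}}
v_star = {0: math.sqrt(d), **{i: 1.0 for i in range(1, d + 1)}}
print(is_eigenpair(star, v_star, math.sqrt(d)))

# Homogeneous example: a ring, where every node has degree <k> = 2.
n = 10
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
v_ring = {i: 1.0 for i in range(n)}
print(is_eigenpair(ring, v_ring, 2.0))
```

With d = 99 the star's threshold is 1/sqrt(99), roughly 0.1.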
Epidemic threshold for various
networks
• sanity checks / older results:
• Infinite, power-law networks
– λ1,A = ∞; τ = 0 : *any* virus has a chance!
[Barabasi et al]
• Finite power-law networks
– τ = 1/ λ1,A
Outline
• Introduction - motivation
• Problem #1: Streams
• Problem #2: Graphs & Virus propagation
– Motivation & problem definition
– Related work
– Main idea
– Experiments
• Conclusions
Experiments
• 2 graphs
– Star network: one “hub” + 99 “spokes”
– “Oregon” Internet AS graph:
• 10,900 nodes, 31,180 edges
• topology.eecs.umich.edu/data.html
• More in our paper: [SRDS ’03]
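
A minimal SIS simulation on the star network, sketched under stated assumptions: discrete synchronous time steps, each infected node attacks each healthy neighbor with probability β, each infected node heals with probability δ, and all nodes start infected. The talk does not spell out these details, so the update order, the initial condition, the seed, and the name `simulate_sis` are hypothetical choices.

```python
import random

def simulate_sis(adj, beta, delta, steps, initially_infected, seed=0):
    """Return the number of infected nodes after each time step."""
    rng = random.Random(seed)
    infected = set(initially_infected)
    history = [len(infected)]
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for n in adj[u]:
                # Attack each currently healthy neighbor with probability beta.
                if n not in infected and rng.random() < beta:
                    newly_infected.add(n)
        # Each infected node heals with probability delta and becomes
        # susceptible again from the next step on.
        healed = {u for u in infected if rng.random() < delta}
        infected = (infected - healed) | newly_infected
        history.append(len(infected))
    return history

# Star network: one hub (node 0) plus 99 spokes.
spokes = 99
star = {0: list(range(1, spokes + 1)), **{i: [0] for i in range(1, spokes + 1)}}

for delta in (0.04, 0.16, 0.20):
    counts = simulate_sis(star, beta=0.016, delta=delta, steps=200,
                          initially_infected=star.keys())
    print(f"delta={delta}: infected after 200 steps = {counts[-1]}")
```

A single run is noisy; averaging over several seeds gives a fairer picture of behavior above and below the threshold.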
Experiments (Star)
[Figure: number of infected nodes vs. time for the Star network, β = 0.016, one curve per δ in {0.04, 0.08, 0.12, 0.16, 0.20}; curves are labeled β/δ > τ (above threshold), β/δ = τ (at the threshold), and β/δ < τ (below threshold)]
Experiments (Oregon)
[Figure: number of infected nodes vs. time for the Oregon graph, β = 0.001, one curve per δ in {0.05, 0.06, 0.07}; curves are labeled β/δ > τ (above threshold), β/δ = τ (at the threshold), and β/δ < τ (below threshold)]
Our prediction vs. previous prediction
[Figure: number of infected nodes vs. β/δ, two panels (Oregon, Star), each comparing “Our” prediction with “PL3”]
• our predictions are more accurate
Conclusions
We found an epidemic threshold
√ that applies to any network topology
√ and it depends only on one parameter of
the graph
Overall conclusions
• Automatic stream mining: AWSOM
• graphs and virus propagation: eigenvalue
Ongoing / related work
• Streams
– how to find hidden variables on multiple
streams [w/ Spiros and Jimeng Sun]
– ‘network tomography’ [w/ Airoldi +]
• Graphs
– graph partitioning [w/ Deepay+]
– important subgraphs [w/ Tomkins + McCurley]
– graph generators [RMAT, w/ Deepay]
Thank you!
Contact info:
christos @ cs.cmu.edu
spapadim @ cs.cmu.edu
deepay @ cs.cmu.edu
Main References
• Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining, VLDB 2003, Berlin, Germany, Sept. 2003.
• [Wang+03] Yang Wang, Deepayan Chakrabarti, Chenxi
Wang and Christos Faloutsos: Epidemic Spreading in Real
Networks: an Eigenvalue Viewpoint, SRDS 2003,
Florence, Italy.
Additional References
• Connection Subgraphs, C. Faloutsos, K. McCurley, A.
Tomkins, SIAM-DM 2004 workshop on link analysis
• RMAT: A recursive graph generator, D. Chakrabarti, Y.
Zhan, C. Faloutsos, SIAM-DM 2004
• iFilter: Network tomography using particle filters,
Edoardo Airoldi, Christos Faloutsos (submitted)