Data Mining using Fractals and Power Laws

advertisement
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
THANK YOU!
Data Mining using Fractals and
Power laws
• Prof. Ed Chan
• Debbie Mustin
Christos Faloutsos
Carnegie Mellon University
Waterloo, 2006
C. Faloutsos
1
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
2
School of Computer Science
Carnegie Mellon
Thanks to
Overview
• Deepayan Chakrabarti (CMU/Yahoo)
• Goals/ motivation: find patterns in large
datasets:
– (A) Sensor data
– (B) network/graph data
• Michalis Faloutsos (UCR)
• Solutions: self-similarity and power laws
• Discussion
• George Siganos (UCR)
Waterloo, 2006
C. Faloutsos
3
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
4
School of Computer Science
Carnegie Mellon
Applications of sensors/streams
Applications of sensors/streams
• ‘Smart house’: monitoring temperature,
humidity etc
• ‘Smart house’: monitoring temperature,
humidity etc
• Financial, sales, economic series
• Financial, sales, economic series
Tamer; Ihab
Waterloo, 2006
C. Faloutsos
5
Waterloo, 2006
C. Faloutsos
6
1
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Motivation - Applications
(cont’d)
Motivation - Applications
• Medical: ECGs +; blood
pressure etc monitoring
• civil/automobile infrastructure
– bridge vibrations [Oppenheim+02]
• Scientific data: seismological;
astronomical; environment /
anti-pollution; meteorological
– road conditions / traffic monitoring
# cars2000
1800
1600
Automobile traffic
1400
1200
1000
800
600
400
200
0
Sunspot Data
300
250
200
150
100
time
50
0
Waterloo, 2006
C. Faloutsos
7
Waterloo, 2006
School of Computer Science
Carnegie Mellon
C. Faloutsos
School of Computer Science
Carnegie Mellon
Motivation - Applications
(cont’d)
Web traffic
• [Crovella Bestavros, SIGMETRICS’96]
• Computer systems
2.5*10^7
3*10^7
– web servers (buffering, prefetching)
Bytes
1.2e+06
– ...
http://repository.cs.vt.edu/lbl-conn-7.tar.Z
10^7
5*10^6
600000
400000
200000
0
number of Kb
1e+06
800000
1.5*10^7
2*10^7
– network traffic monitoring
0
0
0
Waterloo, 2006
2e+06
4e+06
6e+06
8e+06
time (in milliseconds)
C. Faloutsos
Waterloo, 2006
2000
3000
C. Faloutsos
10
School of Computer Science
Carnegie Mellon
Self-* Storage (Ganger+)
Self-* Storage (Ganger+)
“self-*” = self-managing, self-tuning, self-healing, …
Goal: 1 petabyte (PB) for CMU researchers
www.pdl.cmu.edu/SelfStar
survivable,
self-managing storage
infrastructure
Waterloo, 2006
1000
Chronological time (slots of 1000 sec)
1e+07
9
School of Computer Science
Carnegie Mellon
~1 PB
8
.
..
.
..
C. Faloutsos
a storage brick
(0.5–5 TB)
11
Figure 2: Trac Bursts over Four Orders of Magnitude; Upper Left: 1000, Upper Right: 100, Lower Left: 10, and L
1 Second
Aggegrations.
(Actual Transfers) self-tuning, self-healing, …
“self-*”
= self-managing,
this eect is by visually inspecting time series plots of trac ear and shows a slope that is distinctly dierent fr
demand.
is shown for comparison); the slope is estimated
In Figure 2 we show four time series plotsAshraf,
of the WWWIhab,
sion asKen
-0.48, yielding an estimate for H of 0.76. T
trac induced by our reference traces. The plots are produced shows an asymptotic slope that is dierent from
by aggregating byte trac into discrete bins of 1, 10, 100, or 1.0 (shown for comparision); it is estimated usin
1000 seconds.
as 0.75, which is also the corresponding estimat
The upper left plot is a complete presentation of the entire periodogram plot shows a slope of -0.66 (the regr
survivable,
trac time series using 1000 second (16.6
minute) bins. The shown), yielding an estimate of H as 0.83. Finally
storageand day
diurnal cycle of network demandself-managing
is infrastructure
clearly evident,
estimator for this dataset (not a graphical metho
to day activity shows noticeable bursts.
However, even within estimate of H = 0:82 with a 95% condence inte
the active portion of a single day there is signicant burstiness; 0.87).
brick in Section 2.1, the Whittle esti
this is shown in the upper right plot, which uses a 100 second a storage
As discussed
timescale and
taken from a typical day in the middle of the only
method
(0.5–5
TB) that yields condence intervals on H
~1is PB
dataset. Finally,
the lower left plot. shows a portion. of the 100 range dependence in the timeseries can introdu
second plot, expanded to 10 second.. detail; and the.. lower right cies in its results. These inaccuracies are minim
plot shows a portion of the lower left expanded to 1 second aggregating the timeseries for successively large
detail. These plots show signicant bursts occurring at the and looking for a value of H around which the
second-to-second level.
mator stabilizes.
Waterloo, 2006
C. Faloutsos
The results of12this method for four busy hours
Figure
4. Each hour is shown in one plot, from the
4.2.2 Statistical Analysis
in the upper left to the least busy hour in the low
We used the four methods for assessing self-similarity described these gures the solid line is the value of the Whi
in Section 2: the variance-time plot, the rescaled range (or of H as a function of the aggregation level m of
R/S) plot, the periodogram plot, and the Whittle estimator. The upper and lower dotted lines are the limits
We concentrated on individual hours from our trac series, so condence interval on H . The three level lines r
as to provide as nearly a stationary dataset as possible.
estimate of H for the unaggregated dataset as
To provide an example of these approaches, analysis of a variance-time, R-S, and periodogram methods.
single hour (4pm to 5pm, Thursday 5 Feb 1995) is shown in
The gure shows that for each dataset, the es
Figure 3. The gure shows plots for the three graphical meth- stays relatively consistent as the aggregation level
ods: variance-time (upper left), rescaled range (upper right), and that the estimates given by the three graph
and periodogram (lower center). The variance-time plot is lin-
2
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Problem definition
Problem #1
# bytes
• Find patterns, in large
datasets
• Given: one or more sequences
x1 , x2 , … , xt , …; (y1, y2, … , yt, …)
• Find
– patterns; clusters; outliers; forecasts;
time
Waterloo, 2006
C. Faloutsos
13
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
14
School of Computer Science
Carnegie Mellon
Problem #1
Problem #1
# bytes
# bytes
• Find patterns, in large
datasets
• Find patterns, in large
datasets
time
time
Poisson
indep.,
ident. distr
Poisson
indep.,
ident. distr
Waterloo, 2006
C. Faloutsos
15
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
16
School of Computer Science
Carnegie Mellon
Problem #1
Overview
# bytes
• Find patterns, in large
datasets
• Goals/ motivation: find patterns in large datasets:
– (A) Sensor data
– (B) network/graph data
• Solutions: self-similarity and power laws
• Discussion
time
Poisson
indep.,
ident. distr
Waterloo, 2006
Q: Then, how to generate
such bursty traffic?
C. Faloutsos
17
Waterloo, 2006
C. Faloutsos
18
3
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Problem #2 - network and
graph mining
Network and graph mining
• How does the Internet look like?
• How does the web look like?
• What constitutes a ‘normal’ social
network?
• What is the ‘network value’ of a
customer?
• which gene/species affects the others
the most?
Waterloo, 2006
C. Faloutsos
19
School of Computer Science
Carnegie Mellon
Friendship Network
[Moody ’01]
Food Web
[Martinez ’91]
Protein Interactions
[genomebiology.com]
Graphs are everywhere!
Waterloo, 2006
C. Faloutsos
20
School of Computer Science
Carnegie Mellon
Problem#2
Solutions
Given a graph:
• which node to market-to /
defend / immunize first?
• Are there un-natural subgraphs? (eg., criminals’ rings)?
• New tools: power laws, self-similarity and
‘fractals’ work, where traditional
assumptions fail
• Let’s see the details:
[from Lumeta: ISPs 6/1999]
Waterloo, 2006
C. Faloutsos
21
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
22
School of Computer Science
Carnegie Mellon
Overview
What is a fractal?
• Goals/ motivation: find patterns in large
datasets:
= self-similar point set, e.g., Sierpinski triangle:
– (A) Sensor data
– (B) network/graph data
...
zero area: (3/4)^inf
infinite length!
(4/3)^inf
• Solutions: self-similarity and power laws
• Discussion
Q: What is its dimensionality??
Waterloo, 2006
C. Faloutsos
23
Waterloo, 2006
C. Faloutsos
24
4
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
What is a fractal?
Intrinsic (‘fractal’) dimension
• Q: fractal dimension
of a line?
= self-similar point set, e.g., Sierpinski triangle:
...
• Q: fd of a plane?
zero area: (3/4)^inf
infinite length!
(4/3)^inf
Q: What is its dimensionality??
A: log3 / log2 = 1.58 (!?!)
Waterloo, 2006
C. Faloutsos
25
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
26
School of Computer Science
Carnegie Mellon
Intrinsic (‘fractal’) dimension
• Q: fractal dimension
of a line?
• A: nn ( <= r ) ~ r^1
(‘power law’: y=x^a)
Sierpinsky triangle
== ‘correlation integral’
• Q: fd of a plane?
• A: nn ( <= r ) ~ r^2
fd== slope of (log(nn)
vs.. log(r) )
log(#pairs
within <=r )
= CDF of pairwise distances
1.58
log( r )
Waterloo, 2006
C. Faloutsos
27
School of Computer Science
Carnegie Mellon
Waterloo, 2006
Closely related:
• fractals <=>
• self-similarity <=>
• scale-free <=>
• power laws ( y= xa ;
F=K r-2)
• (vs
Waterloo, 2006
28
School of Computer Science
Carnegie Mellon
Observations: Fractals <->
power laws
y=e-ax
C. Faloutsos
Outline
•
•
•
•
Problems
Self-similarity and power laws
Solutions to posed problems
Discussion
log(#pairs
within <=r )
1.58
or
y=xa+b)
C. Faloutsos
log( r )
29
Waterloo, 2006
C. Faloutsos
30
5
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Solution #1: traffic
Solution #1: traffic
• disk traces: self-similar: (also: [Leland+94])
• How to generate such traffic?
• disk traces (80-20 ‘law’) – ‘multifractals’
20%
#bytes
80%
#bytes
time
Waterloo, 2006
time
C. Faloutsos
31
Waterloo, 2006
School of Computer Science
Carnegie Mellon
C. Faloutsos
32
School of Computer Science
Carnegie Mellon
80-20 / multifractals
20
80-20 / multifractals
80
20
80
• p ; (1-p) in general
• yes, there are
dependencies
1.2e+06
number of Kb
1e+06
800000
600000
400000
200000
0
0
Waterloo, 2006
2e+06
C. Faloutsos
4e+06
6e+06
8e+06
time (in milliseconds)
33
School of Computer Science
Carnegie Mellon
1e+07
Waterloo, 2006
34
School of Computer Science
Carnegie Mellon
More on 80/20: PQRS
More on 80/20: PQRS
• Part of ‘self-* storage’ project
• Part of ‘self-* storage’ project
time
Waterloo, 2006
C. Faloutsos
cylinder#
C. Faloutsos
35
Waterloo, 2006
p
q
r
s
C. Faloutsos
q
r
s
36
6
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Overview
Problem #2 - topology
• Goals/ motivation: find patterns in large datasets:
How does the Internet look like? Any rules?
– (A) Sensor data
– (B) network/graph data
• Solutions: self-similarity and power laws
– sensor/traffic data
– network/graph data
• Discussion
Waterloo, 2006
C. Faloutsos
37
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
School of Computer Science
Carnegie Mellon
Patterns?
Patterns?
• avg degree is, say 3.3
• pick a node at random
– guess its degree,
exactly (-> “mode”)
count
avg: 3.3
Waterloo, 2006
degree
39
degree
Waterloo, 2006
C. Faloutsos
40
School of Computer Science
Carnegie Mellon
Patterns?
Solution#2: Rank exponent R
• avg degree is, say 3.3
• pick a node at random
- what is the degree
you expect it to have?
• A: 1!!
• A’: very skewed distr.
• Corollary: the mean is
meaningless!
• (and std -> infinity (!))
count
avg: 3.3
• avg degree is, say 3.3
• pick a node at random
– guess its degree,
exactly (-> “mode”)
• A: 1!!
count
avg: 3.3
C. Faloutsos
School of Computer Science
Carnegie Mellon
Waterloo, 2006
38
degree
C. Faloutsos
41
• A1: Power law in the degree distribution
[SIGCOMM99]
internet domains
att.com
log(degree)
ibm.com
-0.82
log(rank)
Waterloo, 2006
C. Faloutsos
42
7
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Solution#2’: Eigen Exponent E
Power laws - discussion
Eigenvalue
Exponent = slope
• do they hold, over time?
E = -0.48
• do they hold on other graphs/domains?
May 2001
Rank of decreasing eigenvalue
• A2: power law in the eigenvalues of the adjacency
matrix
Waterloo, 2006
C. Faloutsos
43
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
44
School of Computer Science
Carnegie Mellon
att.com
log(degree)
ibm.com
Power laws - discussion
do they hold, over time?
Yes! for multiple years [Siganos+]
do they hold on other graphs/domains?
Yes!
– web sites and links [Tomkins+], [Barabasi+]
– peer-to-peer graphs (gnutella-style)
– who-trusts-whom (epinions.com)
Waterloo, 2006
C. Faloutsos
0.82
log(rank)
-0.5
Rank exponent
•
•
•
•
Time Evolution: rank R
0
200
400
600
800
-0.6
-0.7
Domain
level
-0.8
-0.9
-1
Instances in time: Nov'97 and on
45
School of Computer Science
Carnegie Mellon
• The rank exponent has not changed!
[Siganos+]
C. Faloutsos
Waterloo, 2006
46
School of Computer Science
Carnegie Mellon
The Peer-to-Peer Topology
count
epinions.com
[Jovanovic+]
count
• who-trusts-whom
[Richardson +
Domingos, KDD
2001]
degree
• Number of immediate peers (= degree), follows a
power-law
Waterloo, 2006
C. Faloutsos
(out) degree
47
Waterloo, 2006
C. Faloutsos
48
8
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Why care about these patterns?
Recent discoveries [KDD’05]
• How do graphs evolve?
• degree-exponent seems constant - anything
else?
• better graph generators [BRITE, INET]
– for simulations
– extrapolations
• ‘abnormal’ graph and subgraph detection
Waterloo, 2006
C. Faloutsos
49
School of Computer Science
Carnegie Mellon
Waterloo, 2006
Evolution of diameter?
• Prior analysis, on power-law-like graphs,
hints that
diameter ~ O(log(N)) or
diameter ~ O( log(log(N)))
• i.e.., slowly increasing with network size
• Q: What is happening, in reality?
C. Faloutsos
• Prior analysis, on power-law-like graphs,
hints that
diameter ~ O(log(N)) or
diameter ~ O( log(log(N)))
• i.e.., slowly increasing with network size
• Q: What is happening, in reality?
• A: It shrinks(!!), towards a constant value
51
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
52
School of Computer Science
Carnegie Mellon
Shrinking diameter
Shrinking diameter
• Authors &
publications
• 1992
diameter
[Leskovec+05a]
• Citations among physics
papers
• 11yrs; @ 2003:
– 29,555 papers
– 352,807 citations
• For each month M, create a
graph of all citations up to
month M
– 318 nodes
– 272 edges
• 2002
– 60,000 nodes
• 20,000 authors
• 38,000 papers
– 133,000 edges
time
Waterloo, 2006
50
School of Computer Science
Carnegie Mellon
Evolution of diameter?
Waterloo, 2006
C. Faloutsos
C. Faloutsos
53
Waterloo, 2006
C. Faloutsos
54
9
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Shrinking diameter
Shrinking diameter
• Patents & citations
• 1975
• Autonomous
systems
• 1997
– 334,000 nodes
– 676,000 edges
– 3,000 nodes
– 10,000 edges
• 1999
• 2000
– 2.9 million nodes
– 16.5 million edges
– 6,000 nodes
– 26,000 edges
• Each year is a
datapoint
Waterloo, 2006
• One graph per day
C. Faloutsos
55
School of Computer Science
Carnegie Mellon
Waterloo, 2006
N
C. Faloutsos
56
School of Computer Science
Carnegie Mellon
Temporal evolution of graphs
Temporal evolution of graphs
• N(t) nodes; E(t) edges at time t
• suppose that
N(t+1) = 2 * N(t)
• Q: what is your guess for
E(t+1) =? 2 * E(t)
Waterloo, 2006
diameter
C. Faloutsos
• N(t) nodes; E(t) edges at time t
• suppose that
N(t+1) = 2 * N(t)
• Q: what is your guess for
E(t+1) =? 2 * E(t)
• A: over-doubled!
57
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
58
School of Computer Science
Carnegie Mellon
Temporal evolution of graphs
Densification Power Law
ArXiv: Physics papers
and their citations
• A: over-doubled - but obeying:
E(t) ~ N(t)a
for all t
where 1<a<2
E(t)
1.69
N(t)
Waterloo, 2006
C. Faloutsos
59
Waterloo, 2006
C. Faloutsos
60
10
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Densification Power Law
ArXiv: Physics papers
and their citations
Densification Power Law
ArXiv: Physics papers
and their citations
E(t)
‘clique’
E(t)
1
1.69
2
1.69
‘tree’
N(t)
Waterloo, 2006
C. Faloutsos
N(t)
61
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
School of Computer Science
Carnegie Mellon
Densification Power Law
Densification Power Law
U.S. Patents, citing each
other
Autonomous Systems
E(t)
E(t)
1.18
1.66
N(t)
Waterloo, 2006
62
C. Faloutsos
N(t)
63
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
64
School of Computer Science
Carnegie Mellon
Densification Power Law
Outline
ArXiv: authors & papers
•
•
•
•
E(t)
1.15
problems
Fractals
Solutions
Discussion
– what else can they solve?
– how frequent are fractals?
N(t)
Waterloo, 2006
C. Faloutsos
65
Waterloo, 2006
C. Faloutsos
66
11
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
What else can they solve?
Problem #3 - spatial d.m.
spatial d.m.
separability [KDD’02]
Ed, Ihab, Tamer
forecasting [CIKM’02]
dimensionality reduction [SBBD’00]
non-linear axis scaling [KDD’02]
disk trace modeling [PEVA’02]
selectivity of spatial/multimedia queries
[PODS’94, VLDB’95, ICDE’00]
• ...
•
•
•
•
•
•
Galaxies (Sloan Digital Sky Survey w/ B.
- ‘spiral’ and ‘elliptical’
Nichol)
galaxies
1
"elliptical.cut.dat"
"spiral.cut.dat"
0.8
0.4
0.2
-attraction/repulsion?
0
-0.2
- separability??
-0.4
-0.6
-0.8
-1
-1
Waterloo, 2006
C. Faloutsos
67
School of Computer Science
Carnegie Mellon
-0.8
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
C. Faloutsos
68
School of Computer Science
Carnegie Mellon
Solution#3: spatial d.m.
CORRELATION INTEGRAL!
log(#pairs within <=r )
1e+10
log(#pairs within <=r )
[w/ Seeger, Traina, Traina, SIGMOD00]
1e+10
"ell-ell.points.ns"
"spi-spi.points.ns"
"spi.dat-ell.dat.points"
1e+09
1e+08
- 1.8 slope
ell-ell
1e+06
"ell-ell.points.ns"
"spi-spi.points.ns"
"spi.dat-ell.dat.points"
1e+09
1e+08
- plateau!
1e+07
10000
-0.6
Waterloo, 2006
Solution#3: spatial d.m.
100000
- patterns? (not Gaussian; not
uniform)
0.6
- repulsion!
spi-spi
ell-ell
1e+06
10000
1000
- plateau!
1e+07
100000
- 1.8 slope
- repulsion!
spi-spi
1000
spi-ell
100
spi-ell
100
10
10
1
1e-081e-071e-061e-050.00010.001 0.01 0.1
Waterloo, 2006
1
10
100
1
1e-081e-071e-061e-050.00010.001 0.01 0.1
log(r)
C. Faloutsos
69
School of Computer Science
Carnegie Mellon
Waterloo, 2006
1
10
100
log(r)
C. Faloutsos
70
School of Computer Science
Carnegie Mellon
Solution#3: spatial d.m.
Solution#3: spatial d.m.
log(#pairs within <=r )
1e+10
"ell-ell.points.ns"
"spi-spi.points.ns"
"spi.dat-ell.dat.points"
1e+09
1e+09
1e+08
1e+08
r1
1e+10
"ell-ell.points.ns"
"spi-spi.points.ns"
"spi.dat-ell.dat.points"
r2
1e+07
ell-ell
1e+06
1e+06
100000
10000
10000
1000
- repulsion!
spi-spi
1000
100
100
10
1
1e-081e-071e-061e-050.00010.001 0.01 0.1
1
10
100
r2 r1
Waterloo, 2006
- plateau!
1e+07
100000
- 1.8 slope
C. Faloutsos
Heuristic on choosing # of
clusters
71
spi-ell
10
1
1e-081e-071e-061e-050.00010.001 0.01 0.1
Waterloo, 2006
C. Faloutsos
1
10
100
log(r)
72
12
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
What else can they solve?
Problem#4: dim. reduction
cc
• given attributes x1, ... xn
•
•
•
•
•
•
separability [KDD’02]
forecasting [CIKM’02]
dimensionality reduction [SBBD’00]
non-linear axis scaling [KDD’02]
disk trace modeling [PEVA’02]
selectivity of spatial/multimedia queries
[PODS’94, VLDB’95, ICDE’00]
• ...
Waterloo, 2006
– possibly, non-linearly correlated
• drop the useless ones
C. Faloutsos
73
School of Computer Science
Carnegie Mellon
Waterloo, 2006
mpg
C. Faloutsos
74
School of Computer Science
Carnegie Mellon
Problem#4: dim. reduction
Outline
cc
• given attributes x1, ... xn
– possibly, non-linearly correlated
• drop the useless ones
mpg
(Q: why?
A: to avoid the ‘dimensionality curse’)
Solution: keep on dropping attributes, until
the f.d. changes! [w/ Traina+, SBBD’00]
Waterloo, 2006
C. Faloutsos
•
•
•
•
problems
Fractals
Solutions
Discussion
– what else can they solve?
– how frequent are fractals?
75
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
76
School of Computer Science
Carnegie Mellon
Fractals & power laws:
Fractals: Brain scans
appear in numerous settings:
• medical
• geographical / geological
• social
• computer-system related
• <and many-many more! see [Mandelbrot]>
• brain-scans
Log(#octants)
16
"oct-count"
x*2.63871-2.9122
15
log2( octree leaves)
14
13
2.63 =
fd
12
11
10
9
8
7
4
Waterloo, 2006
C. Faloutsos
77
Waterloo, 2006
C. Faloutsos
4.5
5
5.5
level k
6
6.5
octree levels
7
78
13
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
More fractals
More fractals:
• periphery of malignant tumors: ~1.5
• benign: ~1.3
• [Burdet+]
Waterloo, 2006
C. Faloutsos
• cardiovascular system: 3 (!) lungs: ~2.9
79
School of Computer Science
Carnegie Mellon
Waterloo, 2006
80
School of Computer Science
Carnegie Mellon
Fractals & power laws:
More fractals:
appear in numerous settings:
• medical
• geographical / geological
• social
• computer-system related
Waterloo, 2006
C. Faloutsos
C. Faloutsos
• Coastlines: 1.2-1.58
1
1.1
1.3
81
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
82
School of Computer Science
Carnegie Mellon
More fractals:
• the fractal dimension for the Amazon river
is 1.85 (Nile: 1.4)
[ems.gphys.unc.edu/nonlinear/fractals/examples.html]
Waterloo, 2006
C. Faloutsos
83
Waterloo, 2006
C. Faloutsos
84
14
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
More fractals:
GIS points
Cross-roads of
Montgomery county:
• the fractal dimension for the Amazon river
is 1.85 (Nile: 1.4)
•any rules?
[ems.gphys.unc.edu/nonlinear/fractals/examples.html]
Waterloo, 2006
C. Faloutsos
85
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
86
School of Computer Science
Carnegie Mellon
GIS
log(#pairs(within <= r))
Examples:LB county
A: self-similarity:
• intrinsic dim. = 1.51
• Long Beach county of CA (road end-points)
log(#pairs)
1.51
1.7
log( r )
Waterloo, 2006
log(r)
C. Faloutsos
87
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
88
School of Computer Science
Carnegie Mellon
More power laws: areas –
Korcak’s law
More power laws: areas –
Korcak’s law
log(count( >= area))
Area distribution of regions (SL dataset)
10
-0.85*x +10.8
8
Any pattern?
Scandinavian lakes
area vs
complementary
cumulative count
(log-log axes)
Number of regions
Scandinavian lakes
6
4
2
log(area)
0
0
2
4
6
8
10
12
14
Area
Waterloo, 2006
C. Faloutsos
89
Waterloo, 2006
C. Faloutsos
90
15
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
More power laws: Korcak
More power laws
• Energy of earthquakes (Gutenberg-Richter
law) [simscience.org]
Area distribution of Japan Archipelago
1000
log(count( >= area))
100
Number of islands
Energy
released
log(count)
10
Japan islands;
area vs cumulative
count (log-log axes)
Waterloo, 2006
1
1
10
100
1000
10000
100000
Area
day
log(area)
C. Faloutsos
91
School of Computer Science
Carnegie Mellon
Waterloo, 2006
Magnitude = log(energy)
C. Faloutsos
92
School of Computer Science
Carnegie Mellon
A famous power law: Zipf’s
law
log(freq)
Fractals & power laws:
appear in numerous settings:
• medical
• geographical / geological
• social
• computer-system related
“a”
• Bible - rank vs.
frequency (log-log)
“the”
“Rank/frequency plot”
log(rank)
Waterloo, 2006
C. Faloutsos
93
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
School of Computer Science
Carnegie Mellon
TELCO data
SALES data – store#96
count of
customers
count of
products
“aspirin”
‘best customer’
# units sold
# of service units
Waterloo, 2006
94
C. Faloutsos
95
Waterloo, 2006
C. Faloutsos
96
16
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Olympic medals (Sidney’00,
Athens’04):
Olympic medals (Sidney’00,
Athens’04):
log(#medals)
log(#medals)
2.5
2.5
2
2
1.5
athens
1.5
athens
1
sidney
1
sidney
0.5
0.5
0
0
0
0.5
1
1.5
2
0
0.5
1
1.5
log( rank)
Waterloo, 2006
C. Faloutsos
2
log( rank)
97
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
98
School of Computer Science
Carnegie Mellon
Even more power laws:
Even more power laws:
library science (Lotka’s law of publication
count); and citation counts:
(citeseer.nj.nec.com 6/2001)
• Income distribution (Pareto’s law)
• size of firms
• publication counts (Lotka’s law)
100
’cited.pdf’
log count
log(count)
10
Ullman
1
100
Waterloo, 2006
C. Faloutsos
99
School of Computer Science
Carnegie Mellon
Waterloo, 2006
1000
log # citations
10000
log(#citations)
C. Faloutsos
100
School of Computer Science
Carnegie Mellon
Even more power laws:
Fractals & power laws:
• web hit counts [w/ A. Montgomery]
appear in numerous settings:
• medical
• geographical / geological
• social
• computer-system related
Web Site Traffic
log(count)
Zipf
“yahoo.com”
log(freq)
Waterloo, 2006
C. Faloutsos
101
Waterloo, 2006
C. Faloutsos
102
17
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Power laws, cont’d
Power laws, cont’d
• In- and out-degree distribution of web sites
[Barabasi], [IBM-CLEVER]
• In- and out-degree distribution of web sites
[Barabasi], [IBM-CLEVER]
log(freq)
log indegree
from [Ravi Kumar,
Prabhakar Raghavan,
Sridhar Rajagopalan,
Andrew Tomkins ]
Waterloo, 2006
from [Ravi Kumar,
Prabhakar Raghavan,
Sridhar Rajagopalan,
Andrew Tomkins ]
- log(freq)
C. Faloutsos
103
School of Computer Science
Carnegie Mellon
Waterloo, 2006
log indegree
C. Faloutsos
104
School of Computer Science
Carnegie Mellon
Power laws, cont’d
“Foiled by power law”
• In- and out-degree distribution of web sites
[Barabasi], [IBM-CLEVER]
• [Broder+, WWW’00]
(log) count
log(freq)
Q: ‘how can we use
these power laws?’
log indegree
Waterloo, 2006
C. Faloutsos
105
School of Computer Science
Carnegie Mellon
(log) in-degree
Waterloo, 2006
C. Faloutsos
106
School of Computer Science
Carnegie Mellon
“Foiled by power law”
Power laws, cont’d
• [Broder+, WWW’00]
(log) count
“The anomalous bump at 120
on the x-axis
is due a large clique
formed by a single spammer”
• In- and out-degree distribution of web sites
[Barabasi], [IBM-CLEVER]
• length of file transfers [Crovella+Bestavros
‘96]
• duration of UNIX jobs
(log) in-degree
Waterloo, 2006
C. Faloutsos
107
Waterloo, 2006
C. Faloutsos
108
18
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
Additional projects
Conclusions
• Find anomalies in traffic matrices [under
review]
• Find correlations in sensor/stream data
[VLDB’05]
• Fascinating problems in Data Mining: find
patterns in
– sensors/streams
– graphs/networks
– Chlorine measurements, with Civ. Eng.
– temperature measurements (INTEL/MIT)
• Virus propagation (SIS, SIR) [Wang+, ’03]
• Graph partitioning [Chakrabarti+, KDD’04]
Waterloo, 2006
C. Faloutsos
109
School of Computer Science
Carnegie Mellon
Waterloo, 2006
Resources
New tools for Data Mining: self-similarity &
power laws: appear in many cases
Bad news:
lead to skewed distributions
(no Gaussian, Poisson,
uniformity, independence,
mean, variance)
C. Faloutsos
• Manfred Schroeder “Chaos, Fractals and
Power Laws”, 1991
Good news:
• ‘correlation integral’
for separability
• rank/frequency plots
• 80-20 (multifractals)
•
•
•
•
(Hurst exponent,
strange attractors,
renormalization theory, 111
++)
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
112
School of Computer Science
Carnegie Mellon
References
References
• [vldb95] Alberto Belussi and Christos Faloutsos,
Estimating the Selectivity of Spatial Queries Using the
`Correlation' Fractal Dimension Proc. of VLDB, p. 299310, 1995
• [Broder+’00] Andrei Broder, Ravi Kumar , Farzin
Maghoul1, Prabhakar Raghavan , Sridhar Rajagopalan ,
Raymie Stata, Andrew Tomkins , Janet Wiener, Graph
structure in the web , WWW’00
• M. Crovella and A. Bestavros, Self similarity in World
wide web traffic: Evidence and possible causes ,
SIGMETRICS ’96.
Waterloo, 2006
110
School of Computer Science
Carnegie Mellon
Conclusions - cont’d
Waterloo, 2006
C. Faloutsos
C. Faloutsos
• J. Considine, F. Li, G. Kollios and J. Byers,
Approximate Aggregation Techniques for Sensor
Databases (ICDE’04, best paper award).
• [pods94] Christos Faloutsos and Ibrahim Kamel,
Beyond Uniformity and Independence: Analysis of
R-trees Using the Concept of Fractal Dimension,
PODS, Minneapolis, MN, May 24-26, 1994, pp.
4-13
113
Waterloo, 2006
C. Faloutsos
114
19
School of Computer Science
Carnegie Mellon
School of Computer Science
Carnegie Mellon
References
References
• [vldb96] Christos Faloutsos, Yossi Matias and Avi
Silberschatz, Modeling Skewed Distributions
Using Multifractals and the `80-20 Law’ Conf. on
Very Large Data Bases (VLDB), Bombay, India,
Sept. 1996.
• [sigmod2000] Christos Faloutsos, Bernhard
Seeger, Agma J. M. Traina and Caetano Traina Jr.,
Spatial Join Selectivity Using Power Laws,
SIGMOD 2000
• [vldb96] Christos Faloutsos and Volker Gaede
Analysis of the Z-Ordering Method Using the
Hausdorff Fractal Dimension VLD, Bombay,
India, Sept. 1996
• [sigcomm99] Michalis Faloutsos, Petros Faloutsos
and Christos Faloutsos, What does the Internet
look like? Empirical Laws of the Internet
Topology, SIGCOMM 1999
Waterloo, 2006
C. Faloutsos
115
School of Computer Science
Carnegie Mellon
Waterloo, 2006
116
School of Computer Science
Carnegie Mellon
References
References
• [Leskovec 05] Jure Leskovec, Jon M. Kleinberg,
Christos Faloutsos: Graphs over time:
densification laws, shrinking diameters and
possible explanations. KDD 2005: 177-187
Waterloo, 2006
C. Faloutsos
C. Faloutsos
• [ieeeTN94] W. E. Leland, M.S. Taqqu, W.
Willinger, D.V. Wilson, On the Self-Similar
Nature of Ethernet Traffic, IEEE Transactions on
Networking, 2, 1, pp 1-15, Feb. 1994.
• [brite] Alberto Medina, Anukool Lakhina, Ibrahim
Matta, and John Byers. BRITE: An Approach to
Universal Topology Generation. MASCOTS '01
117
School of Computer Science
Carnegie Mellon
Waterloo, 2006
C. Faloutsos
118
School of Computer Science
Carnegie Mellon
References
References
• [icde99] Guido Proietti and Christos Faloutsos,
I/O complexity for range queries on region data
stored using an R-tree (ICDE’99)
• Stan Sclaroff, Leonid Taycher and Marco La
Cascia , "ImageRover: A content-based image
browser for the world wide web" Proc. IEEE
Workshop on Content-based Access of Image and
Video Libraries, pp 2-9, 1997.
• [kdd2001] Agma J. M. Traina, Caetano Traina Jr.,
Spiros Papadimitriou and Christos Faloutsos: Triplots: Scalable Tools for Multidimensional Data
Mining, KDD 2001, San Francisco, CA.
Waterloo, 2006
C. Faloutsos
119
Waterloo, 2006
C. Faloutsos
120
20
School of Computer Science
Carnegie Mellon
Thank you!
Contact info:
christos <at> cs.cmu.edu
www. cs.cmu.edu /~christos
(w/ papers, datasets, code for fractal dimension
estimation, etc)
Waterloo, 2006
C. Faloutsos
121
21
Download