School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon THANK YOU! Data Mining using Fractals and Power laws • Prof. Ed Chan • Debbie Mustin Christos Faloutsos Carnegie Mellon University Waterloo, 2006 C. Faloutsos 1 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 2 School of Computer Science Carnegie Mellon Thanks to Overview • Deepayan Chakrabarti (CMU/Yahoo) • Goals/ motivation: find patterns in large datasets: – (A) Sensor data – (B) network/graph data • Michalis Faloutsos (UCR) • Solutions: self-similarity and power laws • Discussion • George Siganos (UCR) Waterloo, 2006 C. Faloutsos 3 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 4 School of Computer Science Carnegie Mellon Applications of sensors/streams Applications of sensors/streams • ‘Smart house’: monitoring temperature, humidity etc • ‘Smart house’: monitoring temperature, humidity etc • Financial, sales, economic series • Financial, sales, economic series Tamer; Ihab Waterloo, 2006 C. Faloutsos 5 Waterloo, 2006 C. Faloutsos 6 1 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Motivation - Applications (cont’d) Motivation - Applications • Medical: ECGs +; blood pressure etc monitoring • civil/automobile infrastructure – bridge vibrations [Oppenheim+02] • Scientific data: seismological; astronomical; environment / anti-pollution; meteorological – road conditions / traffic monitoring # cars2000 1800 1600 Automobile traffic 1400 1200 1000 800 600 400 200 0 Sunspot Data 300 250 200 150 100 time 50 0 Waterloo, 2006 C. Faloutsos 7 Waterloo, 2006 School of Computer Science Carnegie Mellon C. Faloutsos School of Computer Science Carnegie Mellon Motivation - Applications (cont’d) Web traffic • [Crovella Bestavros, SIGMETRICS’96] • Computer systems 2.5*10^7 3*10^7 – web servers (buffering, prefetching) Bytes 1.2e+06 – ... http://repository.cs.vt.edu/lbl-conn-7.tar.Z 10^7 5*10^6 600000 400000 200000 0 number of Kb 1e+06 800000 1.5*10^7 2*10^7 – network traffic monitoring 0 0 0 Waterloo, 2006 2e+06 4e+06 6e+06 8e+06 time (in milliseconds) C. Faloutsos Waterloo, 2006 2000 3000 C. Faloutsos 10 School of Computer Science Carnegie Mellon Self-* Storage (Ganger+) Self-* Storage (Ganger+) “self-*” = self-managing, self-tuning, self-healing, … Goal: 1 petabyte (PB) for CMU researchers www.pdl.cmu.edu/SelfStar survivable, self-managing storage infrastructure Waterloo, 2006 1000 Chronological time (slots of 1000 sec) 1e+07 9 School of Computer Science Carnegie Mellon ~1 PB 8 . .. . .. C. Faloutsos a storage brick (0.5–5 TB) 11 Figure 2: Trac Bursts over Four Orders of Magnitude; Upper Left: 1000, Upper Right: 100, Lower Left: 10, and L 1 Second Aggegrations. (Actual Transfers) self-tuning, self-healing, … “self-*” = self-managing, this eect is by visually inspecting time series plots of trac ear and shows a slope that is distinctly dierent fr demand. is shown for comparison); the slope is estimated In Figure 2 we show four time series plotsAshraf, of the WWWIhab, sion asKen -0.48, yielding an estimate for H of 0.76. T trac induced by our reference traces. The plots are produced shows an asymptotic slope that is dierent from by aggregating byte trac into discrete bins of 1, 10, 100, or 1.0 (shown for comparision); it is estimated usin 1000 seconds. as 0.75, which is also the corresponding estimat The upper left plot is a complete presentation of the entire periodogram plot shows a slope of -0.66 (the regr survivable, trac time series using 1000 second (16.6 minute) bins. The shown), yielding an estimate of H as 0.83. Finally storageand day diurnal cycle of network demandself-managing is infrastructure clearly evident, estimator for this dataset (not a graphical metho to day activity shows noticeable bursts. However, even within estimate of H = 0:82 with a 95% condence inte the active portion of a single day there is signicant burstiness; 0.87). brick in Section 2.1, the Whittle esti this is shown in the upper right plot, which uses a 100 second a storage As discussed timescale and taken from a typical day in the middle of the only method (0.5–5 TB) that yields condence intervals on H ~1is PB dataset. Finally, the lower left plot. shows a portion. of the 100 range dependence in the timeseries can introdu second plot, expanded to 10 second.. detail; and the.. lower right cies in its results. These inaccuracies are minim plot shows a portion of the lower left expanded to 1 second aggregating the timeseries for successively large detail. These plots show signicant bursts occurring at the and looking for a value of H around which the second-to-second level. mator stabilizes. Waterloo, 2006 C. Faloutsos The results of12this method for four busy hours Figure 4. Each hour is shown in one plot, from the 4.2.2 Statistical Analysis in the upper left to the least busy hour in the low We used the four methods for assessing self-similarity described these gures the solid line is the value of the Whi in Section 2: the variance-time plot, the rescaled range (or of H as a function of the aggregation level m of R/S) plot, the periodogram plot, and the Whittle estimator. The upper and lower dotted lines are the limits We concentrated on individual hours from our trac series, so condence interval on H . The three level lines r as to provide as nearly a stationary dataset as possible. estimate of H for the unaggregated dataset as To provide an example of these approaches, analysis of a variance-time, R-S, and periodogram methods. single hour (4pm to 5pm, Thursday 5 Feb 1995) is shown in The gure shows that for each dataset, the es Figure 3. The gure shows plots for the three graphical meth- stays relatively consistent as the aggregation level ods: variance-time (upper left), rescaled range (upper right), and that the estimates given by the three graph and periodogram (lower center). The variance-time plot is lin- 2 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Problem definition Problem #1 # bytes • Find patterns, in large datasets • Given: one or more sequences x1 , x2 , … , xt , …; (y1, y2, … , yt, …) • Find – patterns; clusters; outliers; forecasts; time Waterloo, 2006 C. Faloutsos 13 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 14 School of Computer Science Carnegie Mellon Problem #1 Problem #1 # bytes # bytes • Find patterns, in large datasets • Find patterns, in large datasets time time Poisson indep., ident. distr Poisson indep., ident. distr Waterloo, 2006 C. Faloutsos 15 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 16 School of Computer Science Carnegie Mellon Problem #1 Overview # bytes • Find patterns, in large datasets • Goals/ motivation: find patterns in large datasets: – (A) Sensor data – (B) network/graph data • Solutions: self-similarity and power laws • Discussion time Poisson indep., ident. distr Waterloo, 2006 Q: Then, how to generate such bursty traffic? C. Faloutsos 17 Waterloo, 2006 C. Faloutsos 18 3 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Problem #2 - network and graph mining Network and graph mining • How does the Internet look like? • How does the web look like? • What constitutes a ‘normal’ social network? • What is the ‘network value’ of a customer? • which gene/species affects the others the most? Waterloo, 2006 C. Faloutsos 19 School of Computer Science Carnegie Mellon Friendship Network [Moody ’01] Food Web [Martinez ’91] Protein Interactions [genomebiology.com] Graphs are everywhere! Waterloo, 2006 C. Faloutsos 20 School of Computer Science Carnegie Mellon Problem#2 Solutions Given a graph: • which node to market-to / defend / immunize first? • Are there un-natural subgraphs? (eg., criminals’ rings)? • New tools: power laws, self-similarity and ‘fractals’ work, where traditional assumptions fail • Let’s see the details: [from Lumeta: ISPs 6/1999] Waterloo, 2006 C. Faloutsos 21 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 22 School of Computer Science Carnegie Mellon Overview What is a fractal? • Goals/ motivation: find patterns in large datasets: = self-similar point set, e.g., Sierpinski triangle: – (A) Sensor data – (B) network/graph data ... zero area: (3/4)^inf infinite length! (4/3)^inf • Solutions: self-similarity and power laws • Discussion Q: What is its dimensionality?? Waterloo, 2006 C. Faloutsos 23 Waterloo, 2006 C. Faloutsos 24 4 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon What is a fractal? Intrinsic (‘fractal’) dimension • Q: fractal dimension of a line? = self-similar point set, e.g., Sierpinski triangle: ... • Q: fd of a plane? zero area: (3/4)^inf infinite length! (4/3)^inf Q: What is its dimensionality?? A: log3 / log2 = 1.58 (!?!) Waterloo, 2006 C. Faloutsos 25 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 26 School of Computer Science Carnegie Mellon Intrinsic (‘fractal’) dimension • Q: fractal dimension of a line? • A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a) Sierpinsky triangle == ‘correlation integral’ • Q: fd of a plane? • A: nn ( <= r ) ~ r^2 fd== slope of (log(nn) vs.. log(r) ) log(#pairs within <=r ) = CDF of pairwise distances 1.58 log( r ) Waterloo, 2006 C. Faloutsos 27 School of Computer Science Carnegie Mellon Waterloo, 2006 Closely related: • fractals <=> • self-similarity <=> • scale-free <=> • power laws ( y= xa ; F=K r-2) • (vs Waterloo, 2006 28 School of Computer Science Carnegie Mellon Observations: Fractals <-> power laws y=e-ax C. Faloutsos Outline • • • • Problems Self-similarity and power laws Solutions to posed problems Discussion log(#pairs within <=r ) 1.58 or y=xa+b) C. Faloutsos log( r ) 29 Waterloo, 2006 C. Faloutsos 30 5 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Solution #1: traffic Solution #1: traffic • disk traces: self-similar: (also: [Leland+94]) • How to generate such traffic? • disk traces (80-20 ‘law’) – ‘multifractals’ 20% #bytes 80% #bytes time Waterloo, 2006 time C. Faloutsos 31 Waterloo, 2006 School of Computer Science Carnegie Mellon C. Faloutsos 32 School of Computer Science Carnegie Mellon 80-20 / multifractals 20 80-20 / multifractals 80 20 80 • p ; (1-p) in general • yes, there are dependencies 1.2e+06 number of Kb 1e+06 800000 600000 400000 200000 0 0 Waterloo, 2006 2e+06 C. Faloutsos 4e+06 6e+06 8e+06 time (in milliseconds) 33 School of Computer Science Carnegie Mellon 1e+07 Waterloo, 2006 34 School of Computer Science Carnegie Mellon More on 80/20: PQRS More on 80/20: PQRS • Part of ‘self-* storage’ project • Part of ‘self-* storage’ project time Waterloo, 2006 C. Faloutsos cylinder# C. Faloutsos 35 Waterloo, 2006 p q r s C. Faloutsos q r s 36 6 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Overview Problem #2 - topology • Goals/ motivation: find patterns in large datasets: How does the Internet look like? Any rules? – (A) Sensor data – (B) network/graph data • Solutions: self-similarity and power laws – sensor/traffic data – network/graph data • Discussion Waterloo, 2006 C. Faloutsos 37 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos School of Computer Science Carnegie Mellon Patterns? Patterns? • avg degree is, say 3.3 • pick a node at random – guess its degree, exactly (-> “mode”) count avg: 3.3 Waterloo, 2006 degree 39 degree Waterloo, 2006 C. Faloutsos 40 School of Computer Science Carnegie Mellon Patterns? Solution#2: Rank exponent R • avg degree is, say 3.3 • pick a node at random - what is the degree you expect it to have? • A: 1!! • A’: very skewed distr. • Corollary: the mean is meaningless! • (and std -> infinity (!)) count avg: 3.3 • avg degree is, say 3.3 • pick a node at random – guess its degree, exactly (-> “mode”) • A: 1!! count avg: 3.3 C. Faloutsos School of Computer Science Carnegie Mellon Waterloo, 2006 38 degree C. Faloutsos 41 • A1: Power law in the degree distribution [SIGCOMM99] internet domains att.com log(degree) ibm.com -0.82 log(rank) Waterloo, 2006 C. Faloutsos 42 7 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Solution#2’: Eigen Exponent E Power laws - discussion Eigenvalue Exponent = slope • do they hold, over time? E = -0.48 • do they hold on other graphs/domains? May 2001 Rank of decreasing eigenvalue • A2: power law in the eigenvalues of the adjacency matrix Waterloo, 2006 C. Faloutsos 43 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 44 School of Computer Science Carnegie Mellon att.com log(degree) ibm.com Power laws - discussion do they hold, over time? Yes! for multiple years [Siganos+] do they hold on other graphs/domains? Yes! – web sites and links [Tomkins+], [Barabasi+] – peer-to-peer graphs (gnutella-style) – who-trusts-whom (epinions.com) Waterloo, 2006 C. Faloutsos 0.82 log(rank) -0.5 Rank exponent • • • • Time Evolution: rank R 0 200 400 600 800 -0.6 -0.7 Domain level -0.8 -0.9 -1 Instances in time: Nov'97 and on 45 School of Computer Science Carnegie Mellon • The rank exponent has not changed! [Siganos+] C. Faloutsos Waterloo, 2006 46 School of Computer Science Carnegie Mellon The Peer-to-Peer Topology count epinions.com [Jovanovic+] count • who-trusts-whom [Richardson + Domingos, KDD 2001] degree • Number of immediate peers (= degree), follows a power-law Waterloo, 2006 C. Faloutsos (out) degree 47 Waterloo, 2006 C. Faloutsos 48 8 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Why care about these patterns? Recent discoveries [KDD’05] • How do graphs evolve? • degree-exponent seems constant - anything else? • better graph generators [BRITE, INET] – for simulations – extrapolations • ‘abnormal’ graph and subgraph detection Waterloo, 2006 C. Faloutsos 49 School of Computer Science Carnegie Mellon Waterloo, 2006 Evolution of diameter? • Prior analysis, on power-law-like graphs, hints that diameter ~ O(log(N)) or diameter ~ O( log(log(N))) • i.e.., slowly increasing with network size • Q: What is happening, in reality? C. Faloutsos • Prior analysis, on power-law-like graphs, hints that diameter ~ O(log(N)) or diameter ~ O( log(log(N))) • i.e.., slowly increasing with network size • Q: What is happening, in reality? • A: It shrinks(!!), towards a constant value 51 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 52 School of Computer Science Carnegie Mellon Shrinking diameter Shrinking diameter • Authors & publications • 1992 diameter [Leskovec+05a] • Citations among physics papers • 11yrs; @ 2003: – 29,555 papers – 352,807 citations • For each month M, create a graph of all citations up to month M – 318 nodes – 272 edges • 2002 – 60,000 nodes • 20,000 authors • 38,000 papers – 133,000 edges time Waterloo, 2006 50 School of Computer Science Carnegie Mellon Evolution of diameter? Waterloo, 2006 C. Faloutsos C. Faloutsos 53 Waterloo, 2006 C. Faloutsos 54 9 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Shrinking diameter Shrinking diameter • Patents & citations • 1975 • Autonomous systems • 1997 – 334,000 nodes – 676,000 edges – 3,000 nodes – 10,000 edges • 1999 • 2000 – 2.9 million nodes – 16.5 million edges – 6,000 nodes – 26,000 edges • Each year is a datapoint Waterloo, 2006 • One graph per day C. Faloutsos 55 School of Computer Science Carnegie Mellon Waterloo, 2006 N C. Faloutsos 56 School of Computer Science Carnegie Mellon Temporal evolution of graphs Temporal evolution of graphs • N(t) nodes; E(t) edges at time t • suppose that N(t+1) = 2 * N(t) • Q: what is your guess for E(t+1) =? 2 * E(t) Waterloo, 2006 diameter C. Faloutsos • N(t) nodes; E(t) edges at time t • suppose that N(t+1) = 2 * N(t) • Q: what is your guess for E(t+1) =? 2 * E(t) • A: over-doubled! 57 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 58 School of Computer Science Carnegie Mellon Temporal evolution of graphs Densification Power Law ArXiv: Physics papers and their citations • A: over-doubled - but obeying: E(t) ~ N(t)a for all t where 1<a<2 E(t) 1.69 N(t) Waterloo, 2006 C. Faloutsos 59 Waterloo, 2006 C. Faloutsos 60 10 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Densification Power Law ArXiv: Physics papers and their citations Densification Power Law ArXiv: Physics papers and their citations E(t) ‘clique’ E(t) 1 1.69 2 1.69 ‘tree’ N(t) Waterloo, 2006 C. Faloutsos N(t) 61 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos School of Computer Science Carnegie Mellon Densification Power Law Densification Power Law U.S. Patents, citing each other Autonomous Systems E(t) E(t) 1.18 1.66 N(t) Waterloo, 2006 62 C. Faloutsos N(t) 63 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 64 School of Computer Science Carnegie Mellon Densification Power Law Outline ArXiv: authors & papers • • • • E(t) 1.15 problems Fractals Solutions Discussion – what else can they solve? – how frequent are fractals? N(t) Waterloo, 2006 C. Faloutsos 65 Waterloo, 2006 C. Faloutsos 66 11 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon What else can they solve? Problem #3 - spatial d.m. spatial d.m. separability [KDD’02] Ed, Ihab, Tamer forecasting [CIKM’02] dimensionality reduction [SBBD’00] non-linear axis scaling [KDD’02] disk trace modeling [PEVA’02] selectivity of spatial/multimedia queries [PODS’94, VLDB’95, ICDE’00] • ... • • • • • • Galaxies (Sloan Digital Sky Survey w/ B. - ‘spiral’ and ‘elliptical’ Nichol) galaxies 1 "elliptical.cut.dat" "spiral.cut.dat" 0.8 0.4 0.2 -attraction/repulsion? 0 -0.2 - separability?? -0.4 -0.6 -0.8 -1 -1 Waterloo, 2006 C. Faloutsos 67 School of Computer Science Carnegie Mellon -0.8 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 C. Faloutsos 68 School of Computer Science Carnegie Mellon Solution#3: spatial d.m. CORRELATION INTEGRAL! log(#pairs within <=r ) 1e+10 log(#pairs within <=r ) [w/ Seeger, Traina, Traina, SIGMOD00] 1e+10 "ell-ell.points.ns" "spi-spi.points.ns" "spi.dat-ell.dat.points" 1e+09 1e+08 - 1.8 slope ell-ell 1e+06 "ell-ell.points.ns" "spi-spi.points.ns" "spi.dat-ell.dat.points" 1e+09 1e+08 - plateau! 1e+07 10000 -0.6 Waterloo, 2006 Solution#3: spatial d.m. 100000 - patterns? (not Gaussian; not uniform) 0.6 - repulsion! spi-spi ell-ell 1e+06 10000 1000 - plateau! 1e+07 100000 - 1.8 slope - repulsion! spi-spi 1000 spi-ell 100 spi-ell 100 10 10 1 1e-081e-071e-061e-050.00010.001 0.01 0.1 Waterloo, 2006 1 10 100 1 1e-081e-071e-061e-050.00010.001 0.01 0.1 log(r) C. Faloutsos 69 School of Computer Science Carnegie Mellon Waterloo, 2006 1 10 100 log(r) C. Faloutsos 70 School of Computer Science Carnegie Mellon Solution#3: spatial d.m. Solution#3: spatial d.m. log(#pairs within <=r ) 1e+10 "ell-ell.points.ns" "spi-spi.points.ns" "spi.dat-ell.dat.points" 1e+09 1e+09 1e+08 1e+08 r1 1e+10 "ell-ell.points.ns" "spi-spi.points.ns" "spi.dat-ell.dat.points" r2 1e+07 ell-ell 1e+06 1e+06 100000 10000 10000 1000 - repulsion! spi-spi 1000 100 100 10 1 1e-081e-071e-061e-050.00010.001 0.01 0.1 1 10 100 r2 r1 Waterloo, 2006 - plateau! 1e+07 100000 - 1.8 slope C. Faloutsos Heuristic on choosing # of clusters 71 spi-ell 10 1 1e-081e-071e-061e-050.00010.001 0.01 0.1 Waterloo, 2006 C. Faloutsos 1 10 100 log(r) 72 12 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon What else can they solve? Problem#4: dim. reduction cc • given attributes x1, ... xn • • • • • • separability [KDD’02] forecasting [CIKM’02] dimensionality reduction [SBBD’00] non-linear axis scaling [KDD’02] disk trace modeling [PEVA’02] selectivity of spatial/multimedia queries [PODS’94, VLDB’95, ICDE’00] • ... Waterloo, 2006 – possibly, non-linearly correlated • drop the useless ones C. Faloutsos 73 School of Computer Science Carnegie Mellon Waterloo, 2006 mpg C. Faloutsos 74 School of Computer Science Carnegie Mellon Problem#4: dim. reduction Outline cc • given attributes x1, ... xn – possibly, non-linearly correlated • drop the useless ones mpg (Q: why? A: to avoid the ‘dimensionality curse’) Solution: keep on dropping attributes, until the f.d. changes! [w/ Traina+, SBBD’00] Waterloo, 2006 C. Faloutsos • • • • problems Fractals Solutions Discussion – what else can they solve? – how frequent are fractals? 75 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 76 School of Computer Science Carnegie Mellon Fractals & power laws: Fractals: Brain scans appear in numerous settings: • medical • geographical / geological • social • computer-system related • <and many-many more! see [Mandelbrot]> • brain-scans Log(#octants) 16 "oct-count" x*2.63871-2.9122 15 log2( octree leaves) 14 13 2.63 = fd 12 11 10 9 8 7 4 Waterloo, 2006 C. Faloutsos 77 Waterloo, 2006 C. Faloutsos 4.5 5 5.5 level k 6 6.5 octree levels 7 78 13 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon More fractals More fractals: • periphery of malignant tumors: ~1.5 • benign: ~1.3 • [Burdet+] Waterloo, 2006 C. Faloutsos • cardiovascular system: 3 (!) lungs: ~2.9 79 School of Computer Science Carnegie Mellon Waterloo, 2006 80 School of Computer Science Carnegie Mellon Fractals & power laws: More fractals: appear in numerous settings: • medical • geographical / geological • social • computer-system related Waterloo, 2006 C. Faloutsos C. Faloutsos • Coastlines: 1.2-1.58 1 1.1 1.3 81 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 82 School of Computer Science Carnegie Mellon More fractals: • the fractal dimension for the Amazon river is 1.85 (Nile: 1.4) [ems.gphys.unc.edu/nonlinear/fractals/examples.html] Waterloo, 2006 C. Faloutsos 83 Waterloo, 2006 C. Faloutsos 84 14 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon More fractals: GIS points Cross-roads of Montgomery county: • the fractal dimension for the Amazon river is 1.85 (Nile: 1.4) •any rules? [ems.gphys.unc.edu/nonlinear/fractals/examples.html] Waterloo, 2006 C. Faloutsos 85 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 86 School of Computer Science Carnegie Mellon GIS log(#pairs(within <= r)) Examples:LB county A: self-similarity: • intrinsic dim. = 1.51 • Long Beach county of CA (road end-points) log(#pairs) 1.51 1.7 log( r ) Waterloo, 2006 log(r) C. Faloutsos 87 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 88 School of Computer Science Carnegie Mellon More power laws: areas – Korcak’s law More power laws: areas – Korcak’s law log(count( >= area)) Area distribution of regions (SL dataset) 10 -0.85*x +10.8 8 Any pattern? Scandinavian lakes area vs complementary cumulative count (log-log axes) Number of regions Scandinavian lakes 6 4 2 log(area) 0 0 2 4 6 8 10 12 14 Area Waterloo, 2006 C. Faloutsos 89 Waterloo, 2006 C. Faloutsos 90 15 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon More power laws: Korcak More power laws • Energy of earthquakes (Gutenberg-Richter law) [simscience.org] Area distribution of Japan Archipelago 1000 log(count( >= area)) 100 Number of islands Energy released log(count) 10 Japan islands; area vs cumulative count (log-log axes) Waterloo, 2006 1 1 10 100 1000 10000 100000 Area day log(area) C. Faloutsos 91 School of Computer Science Carnegie Mellon Waterloo, 2006 Magnitude = log(energy) C. Faloutsos 92 School of Computer Science Carnegie Mellon A famous power law: Zipf’s law log(freq) Fractals & power laws: appear in numerous settings: • medical • geographical / geological • social • computer-system related “a” • Bible - rank vs. frequency (log-log) “the” “Rank/frequency plot” log(rank) Waterloo, 2006 C. Faloutsos 93 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos School of Computer Science Carnegie Mellon TELCO data SALES data – store#96 count of customers count of products “aspirin” ‘best customer’ # units sold # of service units Waterloo, 2006 94 C. Faloutsos 95 Waterloo, 2006 C. Faloutsos 96 16 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Olympic medals (Sidney’00, Athens’04): Olympic medals (Sidney’00, Athens’04): log(#medals) log(#medals) 2.5 2.5 2 2 1.5 athens 1.5 athens 1 sidney 1 sidney 0.5 0.5 0 0 0 0.5 1 1.5 2 0 0.5 1 1.5 log( rank) Waterloo, 2006 C. Faloutsos 2 log( rank) 97 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 98 School of Computer Science Carnegie Mellon Even more power laws: Even more power laws: library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001) • Income distribution (Pareto’s law) • size of firms • publication counts (Lotka’s law) 100 ’cited.pdf’ log count log(count) 10 Ullman 1 100 Waterloo, 2006 C. Faloutsos 99 School of Computer Science Carnegie Mellon Waterloo, 2006 1000 log # citations 10000 log(#citations) C. Faloutsos 100 School of Computer Science Carnegie Mellon Even more power laws: Fractals & power laws: • web hit counts [w/ A. Montgomery] appear in numerous settings: • medical • geographical / geological • social • computer-system related Web Site Traffic log(count) Zipf “yahoo.com” log(freq) Waterloo, 2006 C. Faloutsos 101 Waterloo, 2006 C. Faloutsos 102 17 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Power laws, cont’d Power laws, cont’d • In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER] • In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER] log(freq) log indegree from [Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins ] Waterloo, 2006 from [Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins ] - log(freq) C. Faloutsos 103 School of Computer Science Carnegie Mellon Waterloo, 2006 log indegree C. Faloutsos 104 School of Computer Science Carnegie Mellon Power laws, cont’d “Foiled by power law” • In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER] • [Broder+, WWW’00] (log) count log(freq) Q: ‘how can we use these power laws?’ log indegree Waterloo, 2006 C. Faloutsos 105 School of Computer Science Carnegie Mellon (log) in-degree Waterloo, 2006 C. Faloutsos 106 School of Computer Science Carnegie Mellon “Foiled by power law” Power laws, cont’d • [Broder+, WWW’00] (log) count “The anomalous bump at 120 on the x-axis is due a large clique formed by a single spammer” • In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER] • length of file transfers [Crovella+Bestavros ‘96] • duration of UNIX jobs (log) in-degree Waterloo, 2006 C. Faloutsos 107 Waterloo, 2006 C. Faloutsos 108 18 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon Additional projects Conclusions • Find anomalies in traffic matrices [under review] • Find correlations in sensor/stream data [VLDB’05] • Fascinating problems in Data Mining: find patterns in – sensors/streams – graphs/networks – Chlorine measurements, with Civ. Eng. – temperature measurements (INTEL/MIT) • Virus propagation (SIS, SIR) [Wang+, ’03] • Graph partitioning [Chakrabarti+, KDD’04] Waterloo, 2006 C. Faloutsos 109 School of Computer Science Carnegie Mellon Waterloo, 2006 Resources New tools for Data Mining: self-similarity & power laws: appear in many cases Bad news: lead to skewed distributions (no Gaussian, Poisson, uniformity, independence, mean, variance) C. Faloutsos • Manfred Schroeder “Chaos, Fractals and Power Laws”, 1991 Good news: • ‘correlation integral’ for separability • rank/frequency plots • 80-20 (multifractals) • • • • (Hurst exponent, strange attractors, renormalization theory, 111 ++) School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 112 School of Computer Science Carnegie Mellon References References • [vldb95] Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension Proc. of VLDB, p. 299310, 1995 • [Broder+’00] Andrei Broder, Ravi Kumar , Farzin Maghoul1, Prabhakar Raghavan , Sridhar Rajagopalan , Raymie Stata, Andrew Tomkins , Janet Wiener, Graph structure in the web , WWW’00 • M. Crovella and A. Bestavros, Self similarity in World wide web traffic: Evidence and possible causes , SIGMETRICS ’96. Waterloo, 2006 110 School of Computer Science Carnegie Mellon Conclusions - cont’d Waterloo, 2006 C. Faloutsos C. Faloutsos • J. Considine, F. Li, G. Kollios and J. Byers, Approximate Aggregation Techniques for Sensor Databases (ICDE’04, best paper award). • [pods94] Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, PODS, Minneapolis, MN, May 24-26, 1994, pp. 4-13 113 Waterloo, 2006 C. Faloutsos 114 19 School of Computer Science Carnegie Mellon School of Computer Science Carnegie Mellon References References • [vldb96] Christos Faloutsos, Yossi Matias and Avi Silberschatz, Modeling Skewed Distributions Using Multifractals and the `80-20 Law’ Conf. on Very Large Data Bases (VLDB), Bombay, India, Sept. 1996. • [sigmod2000] Christos Faloutsos, Bernhard Seeger, Agma J. M. Traina and Caetano Traina Jr., Spatial Join Selectivity Using Power Laws, SIGMOD 2000 • [vldb96] Christos Faloutsos and Volker Gaede Analysis of the Z-Ordering Method Using the Hausdorff Fractal Dimension VLD, Bombay, India, Sept. 1996 • [sigcomm99] Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, What does the Internet look like? Empirical Laws of the Internet Topology, SIGCOMM 1999 Waterloo, 2006 C. Faloutsos 115 School of Computer Science Carnegie Mellon Waterloo, 2006 116 School of Computer Science Carnegie Mellon References References • [Leskovec 05] Jure Leskovec, Jon M. Kleinberg, Christos Faloutsos: Graphs over time: densification laws, shrinking diameters and possible explanations. KDD 2005: 177-187 Waterloo, 2006 C. Faloutsos C. Faloutsos • [ieeeTN94] W. E. Leland, M.S. Taqqu, W. Willinger, D.V. Wilson, On the Self-Similar Nature of Ethernet Traffic, IEEE Transactions on Networking, 2, 1, pp 1-15, Feb. 1994. • [brite] Alberto Medina, Anukool Lakhina, Ibrahim Matta, and John Byers. BRITE: An Approach to Universal Topology Generation. MASCOTS '01 117 School of Computer Science Carnegie Mellon Waterloo, 2006 C. Faloutsos 118 School of Computer Science Carnegie Mellon References References • [icde99] Guido Proietti and Christos Faloutsos, I/O complexity for range queries on region data stored using an R-tree (ICDE’99) • Stan Sclaroff, Leonid Taycher and Marco La Cascia , "ImageRover: A content-based image browser for the world wide web" Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, pp 2-9, 1997. • [kdd2001] Agma J. M. Traina, Caetano Traina Jr., Spiros Papadimitriou and Christos Faloutsos: Triplots: Scalable Tools for Multidimensional Data Mining, KDD 2001, San Francisco, CA. Waterloo, 2006 C. Faloutsos 119 Waterloo, 2006 C. Faloutsos 120 20 School of Computer Science Carnegie Mellon Thank you! Contact info: christos <at> cs.cmu.edu www. cs.cmu.edu /~christos (w/ papers, datasets, code for fractal dimension estimation, etc) Waterloo, 2006 C. Faloutsos 121 21