Structure and models of
real-world graphs and networks
Jure Leskovec
Machine Learning Department
Carnegie Mellon University
jure@cs.cmu.edu
http://www.cs.cmu.edu/~jure/
Networks (graphs)
Examples of networks
• Internet (a)
• citation network (b)
• World Wide Web (c)
• sexual network (d)
• food web (e)
Networks of the real-world (1)
• Information networks:
– World Wide Web: hyperlinks
– Citation networks
– Blog networks
• Social networks: people + interactions
– Organizational networks
– Communication networks
– Collaboration networks
– Sexual networks
• Technological networks:
– Power grid
– Airline, road, river networks
– Telephone networks
– Internet
– Autonomous systems
[Figures: Florence families; Karate club network; Friendship network; Collaboration network]
Networks of the real-world (2)
• Biological networks
– Metabolic networks
– Food web
– Neural networks
– Gene regulatory networks
• Language networks
– Semantic networks
• Software networks
• …
[Figures: Yeast protein interactions; Semantic network; Language network; XFree86 network]
Types of networks
• Directed/undirected
• Multi graphs (multiple edges between nodes)
• Hyper graphs (edges connecting multiple nodes)
• Bipartite graphs (e.g., papers to authors)
• Weighted networks
• Different type nodes and edges
• Evolving networks:
– Nodes and edges only added
– Nodes, edges added and removed
Traditional approach
• Sociologists were first to study networks:
– Study of patterns of connections between
people to understand functioning of the
society
– People are nodes, interactions are edges
– Questionnaires are used to collect link data
(hard to obtain, inaccurate, subjective)
– Typical questions: Centrality and connectivity
• Limited to small graphs (~10 nodes) and
properties of individual nodes and edges
New approach (1)
• Large networks (e.g., web, internet, on-line
social networks) with millions of nodes
• Many traditional questions not useful anymore:
– Traditional: What happens if a node U is removed?
– Now: What percentage of nodes needs to be
removed to affect network connectivity?
• Focus moves from a single node to study of
statistical properties of the network as a whole
• Cannot draw (plot) the network and examine it
New approach (2)
• What does the network “look like” if I can’t look
at it?
• Need statistical methods and tools to quantify
large networks
• 3 parts/goals:
– Statistical properties of large networks
– Models that help understand these properties
– Predict behavior of networked systems based on
measured structural properties and local rules
governing individual nodes
Statistical properties of networks
• Features that are common to networks of
different types:
– Properties of static networks:
• Small-world effect
• Transitivity or clustering
• Degree distributions (scale free networks)
• Network resilience
• Community structure
• Subgraphs or motifs
– Temporal properties:
• Densification
• Shrinking diameter
Small-world effect (1)
• Six degrees of separation (Milgram, 1960s)
– Random people in Nebraska were asked to send letters to stockbrokers in Boston
– Letters could only be passed to first-name acquaintances
– Only 25% of the letters reached the goal
– But they reached it in about 6 steps
• Measuring path lengths:
– Diameter (longest shortest path): max d_ij
– Effective diameter: distance at which 90% of all connected pairs of nodes can be reached
– Mean geodesic (shortest) distance ℓ: the average of d_ij over all pairs of nodes, or its harmonic-mean variant
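The three path-length measures can be sketched on a small graph as follows. This is a minimal illustration, not the method used for the plots in this talk: the function names and the 5-node path graph are made up for the example, and the effective diameter is taken as an integer hop count (no interpolation), which is an assumption.

```python
import math
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_length_stats(adj, quantile=0.9):
    """Return (diameter, effective diameter, mean geodesic distance),
    computed over all connected ordered pairs of distinct nodes."""
    lengths = []
    for s in adj:
        lengths.extend(d for t, d in bfs_distances(adj, s).items() if t != s)
    lengths.sort()
    diameter = lengths[-1]
    # Smallest hop count within which at least `quantile` of the pairs fall.
    effective = lengths[math.ceil(quantile * len(lengths)) - 1]
    mean = sum(lengths) / len(lengths)
    return diameter, effective, mean

# Hypothetical toy graph: the path 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
diameter, effective, mean = path_length_stats(path)
print(diameter, effective, mean)
```

Running one BFS per node costs O(n·(n+E)), so for graphs with millions of nodes one would sample source nodes instead of using all of them.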
Small-world effect (2)
• Distribution of shortest path lengths
• Microsoft Messenger network
– 180 million people
– 1.3 billion edges
– Edge if two people exchanged at least one message in a one-month period
• Pick a random node, count how many nodes are at distance 1, 2, 3, ... hops
[Figure: number of nodes (log scale) vs. distance (hops)]
Small-world effect (3)
• Fact:
– If the number of vertices within distance r grows exponentially with r, then
mean shortest path length ℓ increases as log n
• Implications:
– Information (viruses) spread quickly
– Erdos numbers are small
– Peer-to-peer networks (for navigation purposes)
• Short paths exist
• Humans are able to find the paths:
– People only know their friends
– People do not have global knowledge of the network
• This suggests something special about the structure of the network
– In a random graph short paths exist, but no one would be able to find
them
Transitivity or Clustering
• “A friend of a friend is a friend”
• If a connects to b, and b to c, then with high probability a connects to c.
• Clustering coefficient C:
C = 3 × number of triangles / number of connected triples
• Alternative definition:
C_i = number of triangles connected to vertex i / number of triples centered on vertex i
– Clustering coefficient: $C = \frac{1}{n} \sum_i C_i$
[Figure: example graph with C_i = 1, 1, 1/6, 0, 0]
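Both definitions can be checked on a small graph. The sketch below uses a hypothetical 5-node graph (a triangle 1-2-3 with two extra leaves attached to node 3), chosen so that the local values come out as 1, 1, 1/6, 0, 0; setting C_i = 0 for vertices of degree below 2 is an assumed convention.

```python
from itertools import combinations

def clustering(adj):
    """Return (global C, per-vertex C_i, mean of the C_i).

    Summing closed triples over their center vertex counts each
    triangle three times, which is the factor 3 in the global formula.
    """
    closed_triples = 0
    triples = 0
    local = {}
    for i, nbrs in adj.items():
        pairs = list(combinations(sorted(nbrs), 2))
        closed = sum(1 for u, v in pairs if v in adj[u])
        triples += len(pairs)
        closed_triples += closed
        # Assumed convention: C_i = 0 when vertex i has fewer than 2 neighbors.
        local[i] = closed / len(pairs) if pairs else 0.0
    global_c = closed_triples / triples if triples else 0.0
    mean_local = sum(local.values()) / len(local)
    return global_c, local, mean_local

# Hypothetical example graph: triangle 1-2-3 plus leaves 4 and 5 on node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3}, 5: {3}}
global_c, local, mean_local = clustering(adj)
print(global_c, local, mean_local)
```

On this graph the two definitions disagree (3/8 versus 13/30), which is why it matters which one a paper reports.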
Transitivity or Clustering (2)
• Clustering coefficient
scales as
• It is considerably higher
than in a random graph
• It is speculated that in real networks: C = O(1) as n → ∞
• In an Erdos-Renyi random graph: C = O(n^{-1})
[Figures: Synonyms network; World Wide Web]
Degree distributions (1)
• Let p_k denote the fraction of nodes with degree k
• We can plot a histogram of p_k vs. k
• In an Erdos-Renyi random graph the degree
distribution follows a Poisson distribution
• Degrees in real networks are heavily skewed to
the right
• Distribution has a long tail of values that are far
above the mean
• Heavy (long) tail:
– Amazon sales
– word length distribution, …
Detour: how long is the long tail?
This is not directly related to graphs, but it nicely explains the “long tail” effect. It shows that there is a big market for niche products.
Degree distributions (2)
• Many real-world networks contain hubs: highly connected nodes
• We can easily distinguish between exponential and power-law tails by plotting on log-lin and log-log axes
• In scale-free networks maximum degree scales as n^{1/(α-1)}
[Figure: degree distribution p_k vs. k in a blog network, shown on lin-lin, log-lin and log-log axes]
Poisson vs. Scale-free network
• Poisson network (Erdos-Renyi random graph): degree distribution is Poisson
• Scale-free (power-law) network: degree distribution is power-law
• Function is scale free if: f(ax) = b f(x)
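As a short worked check of this condition, take a power law with an illustrative constant c and exponent α (symbols introduced only for this example). Rescaling the argument by a changes the function only by the constant factor b = a^{-α}:

```latex
f(x) = c\,x^{-\alpha}
\;\Longrightarrow\;
f(ax) = c\,(ax)^{-\alpha} = a^{-\alpha}\,f(x),
\qquad b = a^{-\alpha}
```

An exponential does not pass the same test: for f(x) = e^{-x} we get f(ax) = f(x)^a, which is not a constant multiple of f(x).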
Degree distribution
• Count the number of people a person talks to on Microsoft Messenger
[Figure: count vs. node degree; the highest degree is marked with an X]
Network resilience (1)
• We observe how the
connectivity (length of the
paths) of the network
changes as the vertices get
removed
• Vertices can be removed:
– Uniformly at random
– In order of decreasing degree
• It is important for
epidemiology
– Removal of vertices
corresponds to vaccination
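A minimal sketch of the two removal strategies, measuring the size of the largest connected component rather than path length (a simplifying assumption). The function names and the star-shaped toy graph are hypothetical, and the degree ordering uses the initial degrees without recomputing them after each removal.

```python
import random

def largest_component_size(adj, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

def removal_order(adj, strategy, rng):
    """Order in which nodes are removed: 'random' or 'degree' (highest first)."""
    nodes = list(adj)
    if strategy == "random":
        rng.shuffle(nodes)
    elif strategy == "degree":
        nodes.sort(key=lambda u: len(adj[u]), reverse=True)
    else:
        raise ValueError("unknown strategy: " + strategy)
    return nodes

# Hypothetical toy network: a hub (node 0) connected to 9 leaves.
star = {0: set(range(1, 10))}
star.update({i: {0} for i in range(1, 10)})
rng = random.Random(0)

targeted = removal_order(star, "degree", rng)[:1]      # removes the hub
after_targeted = largest_component_size(star, targeted)
random_first = removal_order(star, "random", rng)[:1]  # removes any one node
after_random = largest_component_size(star, random_first)
print(after_targeted, after_random)
```

Removing the hub shatters the star into isolated nodes, while a random removal most likely hits a leaf and leaves the rest connected.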
Network resilience (2)
• Real-world networks are resilient to random attacks
– One has to remove all web-pages of degree > 5 to disconnect the web
– But this is a very small percentage of web pages
• Random network has better resilience to targeted attacks
[Figure: mean path length vs. fraction of removed nodes, for the Internet (autonomous systems) and a random network, under preferential removal and random removal]
Community structure
• Most social networks show
community structure
– Groups have higher density of edges
within than across groups
– People naturally divide into groups
based on interests, age, occupation,
…
• How to find communities:
– Spectral clustering (embedding into a
low-dim space)
– Hierarchical clustering based on
connection strength
– Combinatorial algorithms
– Block models
– Diffusion methods
[Figure: friendship network of children in a school]
• Graphs have a “giant component”
• Distribution of connected components follows a power law
[Figure: distribution of connected components in the MSN Messenger network, count vs. size (number of nodes); the largest component is marked with an X]
[Figure: growth of largest component over time in a citation network]
Network motifs (1)
• What are the building blocks (motifs) of
networks?
• Do motifs have specific roles in networks?
• Network motifs detection process:
– Count how many times each subgraph appears
– Compute statistical significance for each subgraph: the probability of it
appearing in a random network as often as in the real network
[Figure: 3-node motifs]
Network motifs (2)
• Biological networks
– Feed-forward loop
– Bi-fan motif
• Web graph:
– Feedback with two mutual dyads
– Mutual dyad
– Fully connected triad
Network motifs (3)
[Figure: motifs in transcription networks, signal transduction networks, WWW and friendship networks, and word adjacency networks]
Networks over time: Densification
• A very basic question: What is the relation between the number of nodes and the number of edges in a network?
• Networks are becoming denser over time
• The number of edges grows faster than the number of nodes – average degree is increasing: $E(t) \propto N(t)^a$
• a … densification exponent: 1 ≤ a ≤ 2:
– a=1: linear growth – constant outdegree (assumed in the literature so far)
– a=2: quadratic growth – clique
[Figures: E(t) vs. N(t) for the Internet (a=1.2) and Citations (a=1.7)]
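A densification exponent such as the a=1.2 and a=1.7 values above can be estimated as the slope of log E(t) against log N(t). The sketch below is an ordinary least-squares fit on synthetic data; the function name and the numbers are made up for illustration and are not the measurements behind the plots.

```python
import math

def fit_densification_exponent(nodes, edges):
    """Least-squares slope of log E(t) against log N(t)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic snapshots that follow E = 2 * N^1.2 exactly (illustrative only).
nodes = [100, 200, 400, 800, 1600]
edges = [2 * n ** 1.2 for n in nodes]
a = fit_densification_exponent(nodes, edges)
print(round(a, 3))
```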
Densification & degree distribution
• How does densification affect degree distribution?
• Given densification exponent a, the degree exponent γ (with $p_k \propto k^{-\gamma}$) is:
– (a) For γ=const over time, we obtain densification only for 1<γ<2, then γ=2/a
– (b) For γ>2 the degree exponent has to evolve over time
• Power-law: $y = b\,x^{-\gamma}$; for γ<2, E[y] = ∞
[Figures: degree exponent γ(t) over time: (a) a=1.1, (b) a=1.6]
Shrinking diameters
• Intuition says that
distances between the
nodes slowly grow as the
network grows (like log n)
• But as the network grows
the distances between
nodes slowly decrease
[Figures: Internet; Citations]
Models of network
generation and evolution
Recap (1)
• Last time we saw:
– Large networks (web, on-line social networks) are
here
– Many traditional questions not useful anymore
– We cannot plot the network, so we need statistical
methods and tools to quantify large networks
– 3 parts/goals:
• Statistical properties of large networks
• Models that help understand these properties
• Predict behavior of networked systems based on measured
structural properties and local rules governing individual
nodes
Recap (2)
• We also saw features that are common to
networks of various types:
• Properties of static networks:
– Small-world effect
– Transitivity or clustering
– Degree distributions (scale free networks)
– Network resilience
– Community structure
– Subgraphs or motifs
• Temporal properties:
– Densification
– Shrinking diameter
Outline for today
• We will see the network generative models for
modeling networks’ features:
– Erdos-Renyi random graph
– Exponential random graphs (p*) model
– Small world model
– Preferential attachment
– Community guided attachment
– Forest fire model
• Fitting models to real data
– How to generate a synthetic realistic looking network?
(Erdos-Renyi) Random graphs
• Also known as Poisson random graphs or
Bernoulli graphs
– Given n vertices connect each pair i.i.d. with
probability p
• Two variants:
– $G_{n,p}$: a graph with m edges appears with probability
$p^m (1-p)^{M-m}$, where $M = n(n-1)/2$ is the max number of
edges
– $G_{n,m}$: graphs with n nodes, m edges
• Very rich mathematical theory: many properties
are exactly solvable
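The two variants can be sampled directly from their definitions. This is a straightforward O(n²) sketch with hypothetical function names `gnp` and `gnm`; it is meant to show the difference between the variants, not to be an efficient generator.

```python
import random
from itertools import combinations

def gnp(n, p, rng):
    """G(n,p): include each of the n(n-1)/2 possible edges independently with prob. p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def gnm(n, m, rng):
    """G(n,m): choose exactly m distinct edges uniformly at random."""
    return rng.sample(list(combinations(range(n), 2)), m)

rng = random.Random(42)
n, p = 200, 0.05
M = n * (n - 1) // 2           # maximum number of edges
edges_np = gnp(n, p, rng)      # edge count fluctuates around p * M
edges_nm = gnm(n, 1000, rng)   # edge count is exactly 1000
print(len(edges_np), len(edges_nm))
```

In G(n,p) the number of edges is itself random (binomial with mean pM), while in G(n,m) it is fixed.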
Properties of random graphs
• Degree distribution is Poisson since the presence and absence of edges is independent:
$$p_k = \binom{n}{k} p^k (1-p)^{n-k} \simeq \frac{z^k e^{-z}}{k!}$$
where z is the mean degree
• Giant component: average degree k=2m/n:
– k=1-ε: all components are of size log n
– k=1+ε: there is 1 component of size n
• All others are of size log n
• They are a tree plus an edge, i.e., cycles (for non-GCC vertices)
• Diameter: log n / log k
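The giant-component transition around average degree 1 can be observed in a small simulation. The sketch below samples G(n,p) with p set from a target mean degree and tracks components with a union-find structure; the function name, n = 1000, and the two mean degrees are arbitrary choices for illustration.

```python
import random
from itertools import combinations

def largest_component_fraction(n, mean_degree, rng):
    """Sample G(n,p) with p = mean_degree / (n - 1) and return the
    fraction of nodes that lie in the largest connected component."""
    p = mean_degree / (n - 1)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            ru, rv = find(u), find(v)
            if ru != rv:
                if size[ru] < size[rv]:
                    ru, rv = rv, ru
                parent[rv] = ru
                size[ru] += size[rv]
    return max(size[find(x)] for x in range(n)) / n

rng = random.Random(1)
below = largest_component_fraction(1000, 0.5, rng)  # subcritical: only small components
above = largest_component_fraction(1000, 2.0, rng)  # supercritical: a giant component
print(below, above)
```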
Evolution of a random graph
[Figure: evolution of a random graph as a function of average degree k]
Subgraphs in random graphs
Expected number of subgraphs H(v,e) in $G_{n,p}$ is
$$E(X) = \binom{n}{v} \frac{v!}{a}\, p^e \approx \frac{n^v p^e}{a}$$
a … number of isomorphic graphs
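As a worked instance of this formula, take H to be a triangle, so v = 3, e = 3 and a = 6 (the six ways of relabeling a triangle's vertices). The subscript on X and the substitution p = z/n, with z the mean degree, are notation added for this example:

```latex
E(X_{\triangle}) = \binom{n}{3}\,\frac{3!}{6}\,p^{3} = \binom{n}{3}\,p^{3} \approx \frac{n^{3}p^{3}}{6},
\qquad
p = \frac{z}{n} \;\Longrightarrow\; E(X_{\triangle}) \approx \frac{z^{3}}{6}
```

At fixed mean degree the expected number of triangles stays constant as n grows, which is consistent with the clustering coefficient of a random graph vanishing as O(n^{-1}).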
Random graphs: conclusion
• Pros:
– Simple and tractable model
– Phase transitions
– Giant component
• Cons:
– Degree distribution
– No community structure
– No degree correlations
• Extensions:
– Configuration model
• Random graphs with arbitrary degree sequence
• Excess degree: degree of a vertex at the end of a
random edge: $q_k \propto k\,p_k$
[Figure: Configuration model]
Exponential random graphs
(p* models)
• Comes from social sciences
• Let {ε_i} be a set of measurable properties
of a graph (number of edges,
number of nodes of a given
degree, number of triangles, …)
• Exponential random graph model
defines a probability distribution
over graphs:
$$P(G) = \frac{1}{Z} \exp\Big(-\sum_i \beta_i \varepsilon_i\Big)$$
[Figure: examples of ε_i]
Exponential random graphs
• Includes Erdos-Renyi as a special
case
• Assume parameters βi are
specified
– No analytical solutions for the model
– But can use simulation to sample the
graphs:
• Define local moves on a graph:
– Addition/removal of edges
– Movement of edges
– Edge swaps
• Parameter estimation:
– Maximum likelihood
• Problem:
– Can’t solve for transitivity (produces
cliques)
– Used to analyze small networks
[Figure: example of parameter estimates]
Small-world model
• Used for modeling network transitivity
• Many networks assume some kind of geographical
proximity
• Small-world model:
– Start with a low-dimensional regular lattice
– Rewire:
• Add/remove edges to create shortcuts to join remote parts of the
lattice
• For each edge with prob p move the other end to a random vertex
• Rewiring allows us to interpolate between a regular lattice
and a random graph
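The rewiring step can be sketched on a one-dimensional ring lattice. In this illustration each node starts linked to its k nearest neighbors on each side (so degree 2k, an assumed convention), and each edge has its far end moved to a random vertex with probability p. The function names are hypothetical, and the sketch avoids self-loops and duplicate edges, which the slide does not specify.

```python
import random

def ring_lattice(n, k):
    """Edges of a ring where each node links to k neighbors on each side (needs n > 2k)."""
    return {(u, (u + j) % n) for u in range(n) for j in range(1, k + 1)}

def rewire(n, k, p, rng):
    """Small-world graph as an adjacency dict: for each lattice edge (u, v),
    with probability p move the end v to a uniformly chosen new vertex."""
    edges = ring_lattice(n, k)
    adj = {u: set() for u in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for u, v in sorted(edges):
        if rng.random() < p:
            # Candidates exclude u itself and its current neighbors.
            candidates = [w for w in range(n) if w != u and w not in adj[u]]
            if not candidates:
                continue
            w = rng.choice(candidates)
            adj[u].discard(v)
            adj[v].discard(u)
            adj[u].add(w)
            adj[w].add(u)
    return adj

lattice = rewire(20, 2, 0.0, random.Random(0))   # p = 0: regular lattice
rewired = rewire(20, 2, 0.3, random.Random(0))   # some shortcuts added
print(sorted(lattice[0]), sorted(rewired[0]))
```

With p = 0 the lattice is returned unchanged, and with p = 1 every edge is rewired, giving an almost random graph with the same number of edges.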
Small-world model
• Regular lattice (p=0):
– Clustering coefficient C = (3k-3)/(4k-2) ≈ 3/4 for large k
– Mean distance L/4k
• Almost random graph (p=1):
– Clustering coefficient C = 2k/L
– Mean distance log L / log k
• No power-law degree distribution
[Figures: dependence on rewiring probability p; degree distribution]
Models of evolution
• Models of network evolution:
– Preferential attachment
– Edge copying model
– Community Guided Attachment
– Forest Fire model
• Models for realistic network generation:
– Kronecker graphs
Preferential attachment
• Models the growth of the network
• Preferential attachment (Price 1965, Albert & Barabasi
1999):
– Add a new node, create m out-links
– Probability of linking to a node i is
proportional to its degree k_i
• Based on Herbert Simon’s result
– Power-laws arise from “Rich get richer” (cumulative advantage)
• Examples (Price 1965 for modeling citations):
– Citations: new citations of a paper are proportional to the number
it already has
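A minimal sketch of the growth process, treating links as undirected and starting from a small clique so that every node has nonzero degree (both are assumptions; the slide does not fix the initial graph). The function name is hypothetical. Sampling uniformly from a list that holds every edge endpoint picks a node with probability proportional to its degree.

```python
import random

def preferential_attachment(n, m, rng):
    """Grow a graph to n nodes; each new node adds m links to distinct
    existing nodes chosen with probability proportional to their degree."""
    # Assumed seed graph: a clique on m + 1 nodes.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # A node appears in `endpoints` once per incident edge.
    endpoints = [x for edge in edges for x in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges

edges = preferential_attachment(500, 2, random.Random(7))
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(len(edges), max(degree.values()))
```

The maximum degree ends up far above the minimum degree m, which is the "rich get richer" effect.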
Preferential attachment
• Leads to power-law degree distributions: $p_k \propto k^{-3}$
• But:
– all nodes have equal (constant) outdegree
– one needs a complete knowledge of the
network
• There are many generalizations and
variants, but the preferential
selection is the key ingredient that
leads to power-laws
Edge copying model
• Copying model:
– Add a node and choose k the number of edges to add
– With prob β select k random vertices and link to them
– With prob 1-β edges are copied from a randomly
chosen node
• Generates power-law degree distributions with
exponent 1/(1-β)
• Generates communities
• Related: random-surfer model
Community guided attachment
• Want to model/explain
densification in
networks
• Assume community
structure
• One expects many
within-group
friendships and fewer
cross-group ones
[Figure: self-similar university community structure (University; Arts, Science; CS, Math, Drama, Music)]
Community guided attachment
• Assuming cross-community linking probability $f(h) = c^{-h}$ for nodes at tree distance h
• The Community Guided Attachment leads to
Densification Power Law with exponent $a = 2 - \log_b(c)$
– a … densification exponent
– b … community tree branching factor
– c … difficulty constant, 1 ≤ c ≤ b
• If c = 1: easy to cross communities
– Then: a=2, quadratic growth of edges – near clique
• If c = b: hard to cross communities
– Then: a=1, linear growth of edges – constant out-degree
Forest Fire Model
• Want to model graphs that densify and
have shrinking diameters
• Intuition:
– How do we meet friends at a party?
– How do we identify references when
writing papers?
Forest Fire Model for
directed graphs
• The model has 2 parameters:
– p … forward burning probability
– r … backward burning probability
• The model:
– Each turn a new node v arrives
– Uniformly at random chooses an “ambassador” w
– Flip two geometric coins to determine the number of in- and out-links of w to follow (burn)
– Fire spreads recursively until it dies
– Node v links to all burned nodes
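The steps above can be sketched as follows. This is an illustrative reading of the model, not the exact generator behind the experiments: the function names are hypothetical, the geometric draw counts successes before the first failure, and r is used directly as the success probability of the backward coin, which is an assumption.

```python
import random

def geometric(prob, rng):
    """Number of successes before the first failure; mean prob / (1 - prob)."""
    count = 0
    while rng.random() < prob:
        count += 1
    return count

def forest_fire(n, p, r, rng):
    """Directed Forest Fire sketch; returns {node: set of nodes it links to}.
    Requires 0 <= p < 1 and 0 <= r < 1 so the geometric draws terminate."""
    out_links = {0: set()}
    in_links = {0: set()}
    for v in range(1, n):
        ambassador = rng.randrange(v)          # uniform over existing nodes
        burned = {ambassador}
        frontier = [ambassador]
        while frontier:                        # fire spreads until it dies
            w = frontier.pop()
            fwd = sorted(x for x in out_links[w] if x not in burned)
            chosen_fwd = rng.sample(fwd, min(geometric(p, rng), len(fwd)))
            burned.update(chosen_fwd)
            bwd = sorted(x for x in in_links[w] if x not in burned)
            chosen_bwd = rng.sample(bwd, min(geometric(r, rng), len(bwd)))
            burned.update(chosen_bwd)
            frontier.extend(chosen_fwd + chosen_bwd)
        out_links[v] = set(burned)             # v links to all burned nodes
        in_links[v] = set()
        for w in burned:
            in_links[w].add(v)
    return out_links

graph = forest_fire(200, 0.35, 0.3, random.Random(3))
print(sum(len(targets) for targets in graph.values()))
```

Each burned node is visited at most once per arriving node, so the spreading loop always terminates.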
Forest Fire Model
• Simulation experiments
• Forest Fire generates graphs that densify
and have shrinking diameter
[Figures: densification, E(t) vs. N(t), with exponent 1.32; diameter vs. N(t)]
Forest Fire Model
• Forest Fire also generates graphs with
heavy-tailed degree distribution
[Figures: count vs. in-degree; count vs. out-degree]
Forest Fire: Parameter Space
• Fix backward
probability r and vary
forward burning
probability p
• We observe a sharp
transition between
sparse and clique-like
graphs
• Sweet spot is very
narrow
[Figure labels: sparse graph, increasing diameter; constant diameter; clique-like graph, decreasing diameter]
Kronecker graphs
• Want to have a model that can generate a
realistic graph:
– Static Patterns
• Power Law Degree Distribution
• Small Diameter
• Power Law Eigenvalue and Eigenvector Distribution
– Temporal Patterns
• Densification Power Law
• Shrinking/Constant Diameter
• For Kronecker graphs all these properties can
actually be proven
Kronecker Product – a Graph
[Figure: intermediate stage; adjacency matrices]
Kronecker Product – Definition
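• The Kronecker product of matrices A (N×M) and B (K×L) is
given by the (N·K)×(M·L) matrix
$$C = A \otimes B = \begin{pmatrix} a_{1,1}B & \cdots & a_{1,M}B \\ \vdots & \ddots & \vdots \\ a_{N,1}B & \cdots & a_{N,M}B \end{pmatrix}$$
• We define a Kronecker product of two graphs as
a Kronecker product of their adjacency matrices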
Stochastic Kronecker Graphs
• Create $N_1 \times N_1$ probability matrix P1
• Compute the k-th Kronecker power Pk
• For each entry p_uv of Pk include an
edge (u,v) with probability p_uv
P1 =
0.5 0.2
0.1 0.3
P2 (Kronecker multiplication of P1 with itself) =
0.25 0.10 0.10 0.04
0.05 0.15 0.02 0.06
0.05 0.02 0.15 0.06
0.01 0.03 0.03 0.09
Flip biased coins to obtain an instance matrix G2
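The three steps can be reproduced in a few lines. The sketch below builds the Kronecker power explicitly from lists of rows and then flips one biased coin per entry; the function names are hypothetical, and materializing the full matrix is only practical for small k.

```python
import random

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def kron_power(p1, k):
    """k-th Kronecker power of the initiator matrix p1 (k >= 1)."""
    result = p1
    for _ in range(k - 1):
        result = kron(result, p1)
    return result

def sample_graph(pk, rng):
    """Include edge (u, v) independently with probability pk[u][v]."""
    return [[1 if rng.random() < puv else 0 for puv in row] for row in pk]

P1 = [[0.5, 0.2],
      [0.1, 0.3]]
P2 = kron_power(P1, 2)
G2 = sample_graph(P2, random.Random(0))
for row in P2:
    print([round(x, 2) for x in row])
```

The printed rows match the P2 matrix shown on the slide.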
Fitting Kronecker to Real Data
• Given a graph G and Kronecker matrix P1 we
can calculate the probability that P1 generated G, P(G|P1):
$$P(G \mid P_1) = \prod_{(u,v) \in G} P_k[\sigma_u, \sigma_v] \prod_{(u,v) \notin G} \big(1 - P_k[\sigma_u, \sigma_v]\big)$$
σ … node labeling
P1 =
0.5 0.2
0.1 0.3
Pk =
0.25 0.10 0.10 0.04
0.05 0.15 0.02 0.06
0.05 0.02 0.15 0.06
0.01 0.03 0.03 0.09
G =
1 1 0 0
1 1 1 0
0 1 1 1
0 0 1 1
Fitting Kronecker: 2 challenges
• Invariance to node labeling σ
(there are N! labelings):
$$P(G \mid P_1) = \sum_{\sigma} P(G \mid P_1, \sigma)\, P(\sigma)$$
• Calculating P(G|P1) takes
O(N^2) (since one needs to
consider every cell of the
adjacency matrix)
[Figure: P1, Pk and G, with the same graph shown under two different node labelings]
Fitting Kronecker: Solutions
σ … node labeling
P =
0.25 0.10 0.10 0.04
0.05 0.15 0.02 0.06
0.05 0.02 0.15 0.06
0.01 0.03 0.03 0.09
G =
1 1 0 0
1 1 1 0
0 1 1 1
0 0 1 1
• Node labeling: can use MCMC sampling to average over
(all) node labelings
• P(G|P1) takes O(N^2): Real graphs are sparse, so calculate
P(G_empty) and then “add” the edges. This takes O(E).
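The “empty graph plus edges” idea can be checked on the 4×4 example, assuming a fixed identity labeling σ. The sketch compares the cell-by-cell log-likelihood with the version that starts from the empty-graph term and corrects only the edges. The function names are hypothetical, and the empty-graph term is computed densely here; evaluating that term efficiently is not shown.

```python
import math

def log_likelihood_dense(P, G):
    """Sum over every cell: log p for an edge, log(1 - p) for a non-edge."""
    total = 0.0
    for row_p, row_g in zip(P, G):
        for p, g in zip(row_p, row_g):
            total += math.log(p) if g else math.log(1 - p)
    return total

def log_likelihood_from_empty(P, edges, log_empty):
    """Start from log P(G_empty) and, for each edge, swap the
    factor (1 - p) for p. The loop touches only the E edges."""
    return log_empty + sum(math.log(P[u][v]) - math.log(1 - P[u][v])
                           for u, v in edges)

P = [[0.25, 0.10, 0.10, 0.04],
     [0.05, 0.15, 0.02, 0.06],
     [0.05, 0.02, 0.15, 0.06],
     [0.01, 0.03, 0.03, 0.09]]
G = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 1, 1],
     [0, 0, 1, 1]]

edges = [(u, v) for u in range(4) for v in range(4) if G[u][v]]
# Computed densely here only to keep the example self-contained.
log_empty = sum(math.log(1 - p) for row in P for p in row)
dense = log_likelihood_dense(P, G)
sparse = log_likelihood_from_empty(P, edges, log_empty)
print(dense, sparse)
```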
Experiments on real AS graph
[Figures: degree distribution; adjacency matrix eigenvalues; hop plot; network value]
Why fit generative models?
• Parameters tell us about the structure of a graph
• Extrapolation: given a graph today, how will it
look in a year?
• Sampling: can I get a smaller graph with similar
properties?
• Anonymization: instead of releasing real graph
(e.g., email network), we can release a synthetic
version of it
Processes taking place
on networks
Epidemiological processes
• The simplest way to spread a virus
over the network
– S: Susceptible
– I: Infected
– R: Recovered (removed)
• SIS model: 2 parameters
– β … virus birth rate
– δ … virus death (recovery) rate
• SIR model: once a node is cured, it cannot get infected again
[Figure: SIS model; ζ_it depends on β and topology]
Epidemic threshold for SIS model
• How infectious does the virus
need to be to survive in the
network?
• First results on power-law
networks suggested that any
virus will prevail
• New result that works for any
topology: $s = \lambda_1 \beta / \delta$
• For s>1 virus prevails
• For s<1 virus dies
λ1 … largest eigenvalue
of the graph adjacency matrix
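The threshold test needs only the largest eigenvalue of the adjacency matrix, which a power iteration can approximate. The sketch below assumes the score has the form s = λ1·β/δ and that the graph is connected and undirected; iterating with A + I instead of A is a design choice so that bipartite graphs, such as the toy star used here, also converge. The function names and parameter values are hypothetical.

```python
def largest_eigenvalue(adj_matrix, iterations=200):
    """Approximate the largest eigenvalue of a symmetric 0/1 adjacency
    matrix of a connected graph by power iteration on (A + I)."""
    n = len(adj_matrix)
    x = [1.0] * n
    shifted = 0.0
    for _ in range(iterations):
        # y = (A + I) x
        y = [x[i] + sum(adj_matrix[i][j] * x[j] for j in range(n))
             for i in range(n)]
        shifted = max(abs(value) for value in y)
        x = [value / shifted for value in y]
    return shifted - 1.0   # undo the +I shift

def epidemic_score(adj_matrix, beta, delta):
    """Assumed form of the score: s = lambda_1 * beta / delta."""
    return largest_eigenvalue(adj_matrix) * beta / delta

# Hypothetical toy graph: a star with one hub and 4 leaves (lambda_1 = 2).
star = [[0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]
lam = largest_eigenvalue(star)
s = epidemic_score(star, beta=0.2, delta=0.5)
print(lam, s)
```

With these illustrative rates s is below 1, so under the stated threshold the virus would die out on this graph.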
Navigation in small-world networks
• Milgram’s experiment showed:
– (a) short paths exist in networks
– (b) humans are able to find them
• Assume the following setting:
– Nodes of a graph are scattered on a plane
– Given a starting node u, we want to reach a target
node v
– A small-world navigation algorithm navigates the
network by always moving to the neighbor that is
closest (in Manhattan distance) to the target node v
[Figure: starting node u and target node v]
Navigation in small-world networks
Network creation
• Start with a regular lattice:
– Each node connects to its 4
immediate neighbors
– Long-range links are added with
probability decaying as a power of the
distance between the points
(p(u,v) ~ d^{-α})
• It can be shown that only for α=2
the delivery time is poly-log in the
number of nodes n
[Figure: delivery time T < n^β]
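The network creation and the greedy navigation rule can be combined in a small sketch. It assumes a square grid, exactly one directed long-range link per node drawn with probability proportional to d^{-α}, and Manhattan distance; the function names and the grid size are hypothetical.

```python
import random

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def build_grid(side, alpha, rng):
    """side x side lattice; every node keeps its (up to 4) lattice neighbors
    and gets one long-range link to v chosen with probability ~ d(u,v)^-alpha."""
    nodes = [(x, y) for x in range(side) for y in range(side)]
    links = {}
    for u in nodes:
        x, y = u
        nbrs = [(x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < side and 0 <= y + dy < side]
        others = [v for v in nodes if v != u]
        weights = [manhattan(u, v) ** -alpha for v in others]
        nbrs.append(rng.choices(others, weights=weights)[0])
        links[u] = nbrs
    return links

def greedy_route(links, source, target):
    """Always step to the neighbor closest to the target. Some lattice
    neighbor is always one step closer, so the loop terminates."""
    path = [source]
    current = source
    while current != target:
        current = min(links[current], key=lambda v: manhattan(v, target))
        path.append(current)
    return path

links = build_grid(10, 2, random.Random(5))
path = greedy_route(links, (0, 0), (9, 9))
print(len(path) - 1, "hops")
```

The route uses only local information: each node knows its own links and the coordinates of the target, not the whole network.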
Navigation in a real-world network
• Take a social network of
500k bloggers where for
each blogger we know
their geographical location
• Pick two nodes at random
and navigate the network greedily by geography
• Results:
– 13% success rate (vs. 18%
for Milgram)
[Figures: distribution of path lengths; friendships vs. distance]
Navigation in real-world network
• Geographical distance may
not be the right kind of
distance
• Since the population is nonuniform, let’s use rank-based
friendship distance:
i.e., we measure the
distance d(u,v) by the
number of people living
closer to v than u does
• Then
And the proof still works
• Some references used to prepare this talk:
– The Structure and Function of Complex Networks, by Mark Newman
– Statistical mechanics of complex networks, by Reka Albert and Albert-Laszlo
Barabasi
– Graph Mining: Laws, Generators and Algorithms, by Deepayan Chakrabarti and
Christos Faloutsos
– An Introduction to Exponential Random Graph (p*) Models for Social Networks,
by Garry Robins, Pip Pattison, Yuval Kalish and Dean Lusher
– Graph Evolution: Densification and Shrinking Diameters, by Jure Leskovec, Jon
Kleinberg and Christos Faloutsos
– Realistic, Mathematically Tractable Graph Generation and Evolution, Using
Kronecker Multiplication, by Jure Leskovec, Deepayan Chakrabarti, Jon
Kleinberg and Christos Faloutsos
– Navigation in a Small World, by Jon Kleinberg
– Geographic routing in social networks, by David Liben-Nowell, Jasmine Novak,
Ravi Kumar, Prabhakar Raghavan, and Andrew Tomkins
– Some plots and slides borrowed from Lada Adamic, Mark Newman, Mark
Joseph, Albert Barabasi, Jon Kleinberg, David Liben-Nowell, Sergi Valverde and
Ricard Sole
Rough random material
that did not make it into the
presentation
Bow-tie structure of the web
[Figure: bow-tie structure with TENDRILS (44M), IN (44M), SCC (56M), OUT (44M), DISC (17M)]
Broder & al. WWW 2000, Dill & al. VLDB 2001
Study of 3 websites
• Study over three universities’ publicly indexable Web
sites
[Figures: in- and out-degree distributions for Australia, New Zealand and United Kingdom]
We assume this node would like to connect to a
centrally located node: a node whose distance to
other nodes is minimized. The new node i links to the node j that minimizes $\alpha \cdot d_{ij} + h_j$, where:
d_ij is the Euclidean distance
h_j is some measure of the “centrality” of node j
α is a parameter (a function of the final number n of
points) gauging the relative importance of the two
objectives
Fabrikant et al. define 3 possible measures of
“centrality”
1. The average number of hops from other nodes
2. The maximum number of hops from another node
3. The number of hops from a fixed center of the tree
α is the crux of the theorem!
Why? Here are some examples:
Fabrikant & al.
If α is too low, then the Euclidean distances become unimportant, and the network resembles a star:
Fabrikant & al.
But if α grows at least as fast as √n, where n is the
final number of points, then distance becomes too
important, and minimum spanning trees with high
degree occur, but with exponentially vanishing
probability – thus not a power law.
If α is anywhere in between, we have a power law.
Through a rather complex and elaborate proof,
Fabrikant & al. prove this initial assumption will
produce a power law distribution – I’ll save you the
math!