Data Mining:
Concepts and Techniques
— Chapter 6 —
Network Mining (Social Networks)
Jianlin Cheng
Department of Computer Science
University of Missouri, Columbia
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgements: Based on the slides by Sangkyum Kim and Chen Chen
March 19, 2016
Data Mining: Concepts and Techniques
1
Network Mining

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary
Society
Nodes: individuals
Links: social relationship
(family/work/friendship/etc.)
S. Milgram (1967)
Six Degrees of Separation
John Guare
Social networks: Many individuals with
diverse social interactions between them.
Communication networks
The Earth is developing an electronic nervous system,
a network with diverse nodes and links:
-computers
-phone lines
-routers
-TV cables
-satellites
-EM waves
Communication networks: Many non-identical components with diverse connections between them.
Complex systems
Made of
many non-identical elements
connected by diverse interactions.
NETWORK
“Natural” Networks and Universality








Consider many kinds of networks:
- social, technological, business, economic, content, …
These networks tend to share certain informal properties:
- large scale; continual growth
- distributed, democratic growth: vertices “decide” whom to link to
- mixture of local and long-distance connections
- abstract notions of distance: geographical, content, social, …
Do natural networks share more quantitative universals?
What would these “universals” be?
How can we make them precise and measure them?
How can we explain their universality?
This is the domain of social network theory
Sometimes also referred to as link analysis
Some Interesting Quantities

Connected components:
- how many, and how large?
Network diameter:
- maximum (worst-case) or average?
- exclude infinite distances? (disconnected components)
- the small-world phenomenon
Clustering:
- to what extent do links tend to cluster “locally”?
- what is the balance between local and long-distance connections?
- what roles do the two types of links play?
Degree distribution:
- what is the typical degree in the network?
- what is the overall distribution?
A “Canonical” Natural Network has…

Few connected components:
- often only 1 or a small number, independent of network size
Small diameter:
- often a constant independent of network size (like 6)
- or perhaps growing only logarithmically with network size, or even shrinking?
- typically exclude infinite distances
A high degree of clustering:
- considerably more so than for a random network
- in tension with small diameter
A heavy-tailed degree distribution:
- a small but reliable number of high-degree vertices
- often of power-law form
Probabilistic Models of Networks



All of the network generation models we will study are probabilistic or statistical in nature
They can generate networks of any size
They often have various parameters that can be set:
- size of network generated
- average degree of a vertex
- fraction of long-distance connections
The models generate a distribution over networks
Statements are always statistical in nature:
- with high probability, diameter is small
- on average, degree distribution has heavy tail
Thus, we’re going to need some basic statistics and probability theory
Social Network Analysis

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary
The Normal Distribution

The normal or Gaussian density:
- applies to continuous, real-valued random variables
- characterized by mean (average) μ and standard deviation σ
- density at x is defined as
  $$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
- peaks at x = μ, then dies off exponentially rapidly
- the classic “bell-shaped curve”
  - exam scores, human body temperature, …
remarks:
- can control mean and standard deviation independently
- can make as “broad” as we like, but always have finite variance
The Normal Distribution
The Binomial Distribution

- coin with Pr[heads] = p, flip n times
- probability of getting exactly k heads:
  - choose(n,k) p^k (1-p)^(n-k)
- for large n and p fixed:
  - approximated well by a normal with μ = np, σ = sqrt(np(1-p))
  - σ/μ → 0 as n grows
  - leads to strong large deviation bounds
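The binomial formula and its normal approximation can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the choice n = 1000, p = 0.3 are illustrative assumptions, not part of the slides.

```python
import math

def binomial_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k heads in n flips with Pr[heads] = p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Illustrative parameters (assumed): n = 1000 flips, p = 0.3.
n, p = 1000, 0.3
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

# Compare the exact binomial probability with the normal density near the mean.
for k in (280, 300, 320):
    print(k, binomial_pmf(n, k, p), normal_pdf(k, mu, sigma))

def relative_spread(n: int, p: float) -> float:
    """sigma / mu for a binomial with parameters n and p."""
    return math.sqrt(n * p * (1 - p)) / (n * p)

# The relative spread shrinks as n grows.
for size in (10, 1000, 100000):
    print(size, relative_spread(size, p))
```

The shrinking ratio σ/μ is what makes large deviations from the mean so unlikely for large n.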
The Binomial Distribution
How many wins in 21 bets?
www.professionalgambler.com/binomial.html
The Poisson Distribution

- like binomial, applies to variables taking integer values ≥ 0
- often used to model counts of events
  - number of phone calls placed in a given time period
  - number of times a neuron fires in a given time period
- single free parameter λ
- probability of exactly x events:
  - exp(-λ) λ^x / x!
  - mean and variance are both λ
- binomial distribution with n large, p = λ/n (λ fixed)
  - converges to Poisson with mean λ
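The convergence of the binomial to the Poisson can be seen numerically. This is a small sketch under assumed values (λ = 3, comparison over x = 0..10); the helper names are hypothetical.

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """Probability of exactly x events for a Poisson with mean lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def binomial_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def max_gap(n: int, lam: float, upto: int = 10) -> float:
    """Largest pointwise gap between Binomial(n, lam/n) and Poisson(lam) on 0..upto."""
    return max(abs(binomial_pmf(n, x, lam / n) - poisson_pmf(x, lam))
               for x in range(upto + 1))

lam = 3.0  # assumed mean, held fixed while n grows
for n in (10, 100, 10000):
    print(n, max_gap(n, lam))
```

The printed gap shrinks as n grows with λ fixed, which is the limit stated on the slide.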
The Poisson Distribution
single photoelectron distribution
Heavy-tailed Distributions

Distributions vs. Data






All these distributions are idealized models
In practice, we do not see distributions, but data
Thus, there will be some largest value we observe
Also, can be difficult to “eyeball” data and choose model
So how do we distinguish between Poisson, power law, etc.?
Typical procedure:
- might restrict our attention to a range of values of interest
- accumulate counts of observed data into equal-sized bins
- look at counts on a log-log plot
- note that
  - power law:
    - log(Pr[X = x]) = log(1/x^α) = -α log(x)
    - linear, slope -α
  - Normal:
    - log(Pr[X = x]) = log(a exp(-x^2/b)) = log(a) - x^2/b
    - non-linear, concave near mean
  - Poisson:
    - log(Pr[X = x]) = log(exp(-λ) λ^x / x!)
    - also non-linear
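The log-log test can be illustrated without plotting, by computing the slope of log Pr[X = x] against log x between successive points. This is a sketch with assumed parameters (α = 2.5, λ = 4) and an unnormalized power law; none of these values come from the slides.

```python
import math

def local_slopes(log_prob, xs):
    """Slopes of log Pr[X = x] versus log x between consecutive points in xs."""
    return [(log_prob(b) - log_prob(a)) / (math.log(b) - math.log(a))
            for a, b in zip(xs, xs[1:])]

alpha, lam = 2.5, 4.0  # assumed exponent and Poisson mean

def power_law_log_prob(x):
    # log(1 / x^alpha), ignoring the normalizing constant
    return -alpha * math.log(x)

def poisson_log_prob(x):
    # log(exp(-lam) * lam^x / x!), using lgamma(x + 1) = log(x!)
    return -lam + x * math.log(lam) - math.lgamma(x + 1)

xs = [1, 2, 4, 8, 16, 32]
power_slopes = local_slopes(power_law_log_prob, xs)
poisson_slopes = local_slopes(poisson_log_prob, xs)
print(power_slopes)    # the same slope, -alpha, everywhere: a straight line
print(poisson_slopes)  # slope keeps falling: the curve bends downward
```

A constant slope on log-log axes is the signature of a power law; the steadily falling slope is what a Poisson looks like under the same transformation.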
Zipf’s Law



Look at the frequency of English words:
- “the” is the most common, followed by “of”, “to”, etc.
- claim: frequency of the n-th most common ~ 1/n (power law, α = 1)
General theme:
- rank events by their frequency of occurrence
- resulting distribution often is a power law!
Other examples:
- North America city sizes
- personal income
- file sizes
Zipf’s Law
The same data plotted on linear and logarithmic scales.
Both plots show a Zipf distribution with 300 data points.
[Figure panels: linear scales on both axes; logarithmic scales on both axes]
Social Network Analysis

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary
Some Models of Network Generation

Random graphs (Erdös-Rényi models):
- gives few components and small diameter
- does not give high clustering and heavy-tailed degree distributions
- is the mathematically most well-studied and understood model
Watts-Strogatz models:
- give few components, small diameter and high clustering
- does not give heavy-tailed degree distributions
Scale-free Networks:
- gives few components, small diameter and heavy-tailed distribution
- does not give high clustering
Hierarchical networks:
- few components, small diameter, high clustering, heavy-tailed
Affiliation networks:
- models group-actor formation
Models of Social Network Generation

Random Graphs (Erdös-Rényi models)

Watts-Strogatz models

Scale-free Networks
The Erdös-Rényi (ER) Model
(Random Graphs)





All edges are equally probable and appear independently
Network size N > 1 and probability p: distribution G(N,p)
- each edge (u,v) chosen to appear with probability p
- N(N-1)/2 trials of a biased coin flip
The usual regime of interest is when p ~ 1/N, N is large
- e.g. p = 1/(2N), p = 1/N, p = 2/N, p = 10/N, p = log(N)/N, etc.
- in expectation, each vertex will have a “small” number of neighbors
- will then examine what happens when N → infinity
- can thus study properties of large networks with bounded degree
Degree distribution of a typical G drawn from G(N,p):
- draw G according to G(N,p); look at a random vertex u in G
- what is Pr[deg(u) = k] for any fixed k?
- approximately Poisson distribution with mean degree λ = p(N-1) ~ pN
- sharply concentrated; not heavy-tailed
Especially easy to generate networks from G(N,p)
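Generating a network from G(N,p) takes one biased coin flip per vertex pair. The sketch below assumes an adjacency-set representation and illustrative values N = 500, p = 4/N; the function name `sample_gnp` is hypothetical.

```python
import random
from itertools import combinations

def sample_gnp(n: int, p: float, rng: random.Random) -> dict:
    """Draw one graph from G(N,p) as a dict mapping each vertex to its neighbor set."""
    adj = {u: set() for u in range(n)}
    for u, v in combinations(range(n), 2):  # N(N-1)/2 independent coin flips
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

rng = random.Random(0)   # fixed seed so the run is repeatable
n, p = 500, 4 / 500      # assumed: p ~ 1/N regime, expected degree about 4
adj = sample_gnp(n, p, rng)

degrees = [len(neighbors) for neighbors in adj.values()]
mean_degree = sum(degrees) / n
print("mean degree:", mean_degree, "expected:", p * (n - 1))
print("max degree:", max(degrees))  # stays close to the mean: no heavy tail
```

In a typical draw the largest degree is only a few times the mean, in line with the sharply concentrated degree distribution described above.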
Erdös-Rényi Model (1960)
Connect with probability p
Pál Erdös (1913-1996)
p = 1/6, N = 10, <k> ~ 1.5
Poisson distribution
- Democratic
- Random
A Closely Related Model

For any fixed m <= N(N-1)/2, define distribution G(N,m):
- choose uniformly at random from all graphs with exactly m edges
- G(N,m) is “like” G(N,p) with p = m/(N(N-1)/2) ~ 2m/N^2
- this intuition can be made precise, and is correct
- if m = cN then p = 2c/(N-1) ~ 2c/N
- mathematically trickier than G(N,p)
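Sampling from G(N,m) amounts to choosing m distinct vertex pairs uniformly at random. This is a minimal sketch; `sample_gnm` and `equivalent_p` are hypothetical helper names, and materializing all pairs is only practical for small N.

```python
import random
from itertools import combinations

def sample_gnm(n: int, m: int, rng: random.Random) -> list:
    """Draw one graph from G(N,m): m distinct edges chosen uniformly at random."""
    all_pairs = list(combinations(range(n), 2))  # every possible edge (u, v) with u < v
    return rng.sample(all_pairs, m)              # sampling without replacement

def equivalent_p(n: int, m: int) -> float:
    """Edge probability p for which G(N,p) has m edges in expectation."""
    return m / (n * (n - 1) / 2)

rng = random.Random(0)
edges = sample_gnm(100, 200, rng)
print(len(edges), equivalent_p(100, 200))
```

With m = cN the matching density is p = 2c/(N-1), as stated on the slide; here c = 2 and N = 100 give p = 4/99.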
Another Closely Related Model

Graph process model:
- start with N vertices and no edges
- at each time step, add a new edge
  - choose new edge randomly from among all missing edges
Allows study of the evolution or emergence of properties:
- as the number of edges m grows in relation to N
- equivalently, as p is increased
Evolution of a Random Network

We have a large number n of vertices
We start randomly adding edges one at a time
At what time t will the network:
- have at least one “large” connected component?
- have a single connected component?
- have “small” diameter?
- have a “large” clique?
How gradually or suddenly do these properties appear?
Combining and Formalizing Familiar Ideas
Explaining universal behavior through statistical models
- our models will always generate many networks
- almost all of them will share certain properties (universals)
Explaining tipping through incremental growth
- we gradually add edges, or gradually increase edge probability p
- many properties will emerge very suddenly during this process
[Figures: crime rate vs. size of police force; probability that the network is connected vs. number of edges]
So Which Properties Tip?

Just about all of them!
The following properties all have threshold functions:
- having a “giant component”
- being connected
- having a perfect matching (N even)
- having “small” diameter
With remarkable consistency (N = 50):
- giant component (~ 40 edges), connected (~ 100 edges), small diameter (~ 180 edges)
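The tipping behavior can be explored by simulating the graph process and recording when each property first holds. The sketch below tracks components with a union-find structure; treating a “giant component” as one holding at least half the vertices is an assumption made here for illustration, and `tipping_points` is a hypothetical name.

```python
import random
from itertools import combinations

def tipping_points(n: int, rng: random.Random):
    """Add all possible edges in random order. Return the edge counts at which
    (a) some component first holds at least n/2 vertices (assumed 'giant' cutoff)
    and (b) the graph first becomes connected."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Follow parent pointers to the component root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = list(combinations(range(n), 2))
    rng.shuffle(edges)  # a uniformly random order of the missing edges
    giant_at = connected_at = None
    components = n
    for t, (u, v) in enumerate(edges, start=1):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge falls inside an existing component
        if size[ru] < size[rv]:
            ru, rv = rv, ru
        parent[rv] = ru
        size[ru] += size[rv]
        components -= 1
        if giant_at is None and 2 * size[ru] >= n:
            giant_at = t
        if components == 1:
            connected_at = t
            break
    return giant_at, connected_at

giant_at, connected_at = tipping_points(50, random.Random(1))
print("giant component after", giant_at, "edges; connected after", connected_at, "edges")
```

Repeating the run with different seeds for N = 50 gives a feel for how tightly these edge counts cluster around the values quoted on the slide.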
Ever More Precise…

Erdos-Renyi Summary






A model in which all connections are equally likely
- each of the N(N-1)/2 edges chosen randomly & independently
As we add edges, a precise sequence of events unfolds:
- graph acquires a giant component
- graph becomes connected
- graph acquires small diameter
Many properties appear very suddenly (tipping, thresholds)
All statements are mathematically precise
But is this how natural networks form?
If not, which aspects are unrealistic?
- maybe all edges are not equally likely!
The Clustering Coefficient of a Network



Let nbr(u) denote the set of neighbors of u in a graph
- all vertices v such that the edge (u,v) is in the graph
The clustering coefficient of u:
- let k = |nbr(u)| (i.e., number of neighbors of u)
- choose(k,2) = k(k-1)/2: max possible # of edges between vertices in nbr(u)
- c(u) = (actual # of edges between vertices in nbr(u))/choose(k,2)
- 0 <= c(u) <= 1; measure of cliquishness of u’s neighborhood
Clustering coefficient of a graph:
- average of c(u) over all vertices u
Example (figure: vertex u with four neighbors): k = 4, choose(k,2) = 6, c(u) = 4/6 = 0.666…
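The definition of c(u) translates directly into code. This sketch assumes an adjacency-set representation and sets c(u) = 0 for vertices with fewer than two neighbors, a convention not stated on the slide. The example graph is an assumed one that reproduces the k = 4, c(u) = 4/6 case.

```python
from itertools import combinations

def clustering_coefficient(adj: dict, u) -> float:
    """c(u): fraction of pairs of u's neighbors that are themselves linked."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs to check
    links = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
    return links / (k * (k - 1) / 2)

def average_clustering(adj: dict) -> float:
    """Clustering coefficient of the graph: average of c(u) over all vertices."""
    return sum(clustering_coefficient(adj, u) for u in adj) / len(adj)

# Assumed example: u has 4 neighbors (a, b, c, d) joined in a ring a-b-c-d-a,
# so 4 of the 6 possible neighbor pairs are linked.
adj = {
    "u": {"a", "b", "c", "d"},
    "a": {"u", "b", "d"},
    "b": {"u", "a", "c"},
    "c": {"u", "b", "d"},
    "d": {"u", "c", "a"},
}
print(clustering_coefficient(adj, "u"))
print(average_clustering(adj))
```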
The Clustering Coefficient of a Network
Clustering: My friends will likely know each other!
Probability to be connected: C ≫ p
C = (# of links between the 1, 2, …, n neighbors) / (n(n-1)/2)
L: diameter?
Networks are clustered [large C(p)] but have a small characteristic path length [small L(p)].

Network      | C        | Crand   | L        | N
WWW          | 0.1078   | 0.00023 | 3.1      | 153127
Internet     | 0.18-0.3 | 0.001   | 3.7-3.76 | 3015-6209
Actor        | 0.79     | 0.00027 | 3.65     | 225226
Coauthorship | 0.43     | 0.00018 | 5.9      | 52909
Metabolic    | 0.32     | 0.026   | 2.9      | 282
Foodweb      | 0.22     | 0.06    | 2.43     | 134
C. elegans   | 0.28     | 0.05    | 2.65     | 282
Erdos-Renyi: Clustering Coefficient






Generate a network G according to G(N,p)
Examine a “typical” vertex u in G
- choose u at random among all vertices in G
- what do we expect c(u) to be?
Answer: exactly p.
In G(N,m), expect c(u) to be 2m/(N(N-1))
Both cases: c(u) entirely determined by overall density
Baseline for comparison with “more clustered” models
- Erdos-Renyi has no bias towards clustered or local edges
Models of Social Network Generation

Random Graphs (Erdös-Rényi models)

Watts-Strogatz models

Scale-free Networks
Caveman and Solaria



Erdos-Renyi:
- sharing a common neighbor makes two vertices no more likely to be directly connected than two very “distant” vertices
- every edge appears entirely independently of existing structure
But in many settings, the opposite is true:
- you tend to meet new friends through your old friends
- two web pages pointing to a third might share a topic
- two companies selling goods to a third are in related industries
Watts’ Caveman world:
- overall density of edges is low
- but two vertices with a common neighbor are likely connected
Making it (Somewhat) Precise: the α-model
The α-model has the following parameters or “knobs”:
- N: size of the network to be generated
- k: the average degree of a vertex in the network to be generated
- p: the default probability two vertices are connected
- α: adjustable parameter dictating bias towards local connections
For any vertices u and v:
- define m(u,v) to be the number of common neighbors (so far)
Key quantity: the propensity R(u,v) of u to connect to v
- if m(u,v) >= k, R(u,v) = 1 (share too many friends not to connect)
- if m(u,v) = 0, R(u,v) = p (no mutual friends → no bias to connect)
- else, R(u,v) = p + (m(u,v)/k)^α (1-p)
Generate network incrementally
- using R(u,v) as the edge probability; details omitted
Note: α = infinity is “like” Erdos-Renyi (but not exactly)
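The propensity rule is a three-case function of the number of common neighbors. The sketch below takes R = 1 once m(u,v) >= k, which is what the remark “share too many friends not to connect” and the limiting value of the middle case imply. The function name and sample values are illustrative assumptions, and the incremental generation procedure (omitted on the slide) is not reproduced.

```python
def propensity(m_uv: int, k: float, p: float, alpha: float) -> float:
    """Propensity R(u,v) of u to connect to v, given m_uv common neighbors."""
    if m_uv >= k:
        return 1.0   # share too many friends not to connect
    if m_uv == 0:
        return p     # no mutual friends: default probability only
    return p + (m_uv / k) ** alpha * (1 - p)

# Assumed illustrative values: k = 10, p = 0.01.
for alpha in (0.5, 2.0, 1000.0):
    print(alpha, [round(propensity(m, 10, 0.01, alpha), 4) for m in (0, 1, 5, 9, 10)])
```

For small alpha even one shared friend pushes the propensity well above p, while for very large alpha it stays near p until m(u,v) reaches k, which is the sense in which the model becomes “like” Erdos-Renyi.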
Small Worlds and Occam’s Razor




For small α, should generate large clustering coefficients
- we “programmed” the model to do so
- Watts claims that proving precise statements is hard…
But we do not want a new model for every little property
- Erdos-Renyi → small diameter
- α-model → high clustering coefficient
In the interests of Occam’s Razor, we would like to find
- a single, simple model of network generation…
- … that simultaneously captures many properties
Watts’ small world: small diameter and high clustering
Meanwhile, Back in the Real World…



Watts examines three real networks as case studies:
- the Kevin Bacon graph
- the Western states power grid
- the C. elegans nervous system
For each of these networks, he:
- computes its size, diameter, and clustering coefficient
- compares diameter and clustering to best Erdos-Renyi approximation
- shows that the best α-model approximation is better
- important to be “fair” to each model by finding best fit
Overall moral:
- if we care only about diameter and clustering, α is better than p
Case 1: Kevin Bacon Graph


Vertices: actors and actresses
Edge between u and v if they appeared in a film together
Kevin Bacon
- No. of movies: 46
- No. of actors: 1811
- Average separation: 2.79
Is Kevin Bacon the most connected actor? NO!

Rank | Name              | Average distance | # of movies | # of links
1    | Rod Steiger       | 2.537527         | 112         | 2562
2    | Donald Pleasence  | 2.542376         | 180         | 2874
3    | Martin Sheen      | 2.551210         | 136         | 3501
4    | Christopher Lee   | 2.552497         | 201         | 2993
5    | Robert Mitchum    | 2.557181         | 136         | 2905
6    | Charlton Heston   | 2.566284         | 104         | 2552
7    | Eddie Albert      | 2.567036         | 112         | 3333
8    | Robert Vaughn     | 2.570193         | 126         | 2761
9    | Donald Sutherland | 2.577880         | 107         | 2865
10   | John Gielgud      | 2.578980         | 122         | 2942
11   | Anthony Quinn     | 2.579750         | 146         | 2978
12   | James Earl Jones  | 2.584440         | 112         | 3787
…
876  | Kevin Bacon       | 2.786981         | 46          | 1811
…
[Figure: #1 Rod Steiger, #2 Donald Pleasence, #3 Martin Sheen, #876 Kevin Bacon]
Case 2: New York State Power Grid



Vertices: generators and substations
Edges: high-voltage power transmission lines and transformers
Line thickness and color indicate the voltage level
- Red 765 kV, 500 kV; brown 345 kV; green 230 kV; grey 138 kV
Case 3: C. Elegans Nervous System


Vertices: neurons in the C. elegans worm
Edges: axons/synapses between neurons
Two More Examples

M. Newman on scientific collaboration networks
- coauthorship networks in several distinct communities
- differences in degrees (papers per author)
- empirical verification of
  - giant components
  - small diameter (mean distance)
  - high clustering coefficient
Alberich et al. on the Marvel Universe
- purely fictional social network
- two characters linked if they appeared together in an issue
- “empirical” verification of
  - heavy-tailed distribution of degrees (issues and characters)
  - giant component
  - rather small clustering coefficient
One More (Structural) Property…

A properly tuned α-model can simultaneously explain
- small diameter
- high clustering coefficient
But what about heavy-tailed degree distributions?
- α-model and simple variants will not explain this
- intuitively, no “bias” towards large degree
- all vertices are created equal
As always, we want a “natural” model
Models of Social Network Generation

Random Graphs (Erdös-Rényi models)

Watts-Strogatz models

Scale-free Networks
World Wide Web
Nodes: WWW documents
Links: URL links
800 million documents
(S. Lawrence, 1999)
ROBOT: collects all URLs found in a document and follows them recursively
R. Albert, H. Jeong, A-L Barabasi, Nature, 401 130 (1999)
World Wide Web
Expected Result
- <k> ~ 6
- P(k=500) ~ 10^-99
- N_WWW ~ 10^9 ⇒ N(k=500) ~ 10^-90
Real Result
- P_out(k) ~ k^(-γ_out), γ_out = 2.45
- P_in(k) ~ k^(-γ_in), γ_in = 2.1
- P(k=500) ~ 10^-6
- N_WWW ~ 10^9 ⇒ N(k=500) ~ 10^3
J. Kleinberg, et al., Proceedings of the ICCC (1999)
World Wide Web
[Figure: example graph on nodes 1-7 with shortest paths l_15 = 2 (1→2→5) and l_17 = 4 (1→3→4→6→7); … <l> = ??]
Finite size scaling: create a network with N nodes with P_in(k) and P_out(k)
<l> = 0.35 + 2.06 log(N) (base-10 logarithm)
l is the length of shortest paths
19 degrees of separation, based on 800 million webpages [S. Lawrence et al. Nature (99)]
R. Albert et al. Nature (99)
[Figure: <l> versus N; labels: nd.edu; IBM, A. Broder et al. WWW9 (00)]
What does that mean?
Poisson distribution: Exponential Network
Power-law distribution: Scale-free Network
Scale-free Networks

The number of nodes (N) is not fixed
- Networks continuously expand by adding new nodes
  - WWW: addition of new nodes
  - Citation: publication of new papers
The attachment is not uniform
- A node is linked with higher probability to a node that already has a large number of links
  - WWW: new documents link to well-known sites (CNN, Yahoo, Google)
  - Citation: well-cited papers are more likely to be cited again
Scale-Free Networks






Start with (say) two vertices connected by an edge
For i = 3 to N:
- for each 1 <= j < i, d(j) = degree of vertex j so far
- let Z = Σ d(j) (sum of all degrees so far)
- add new vertex i with k edges back to {1, …, i-1}:
  - i is connected back to j with probability d(j)/Z
Vertices j with high degree are likely to get more links!
“Rich get richer”
Natural model for many processes:
- hyperlinks on the web
- new business and social contacts
- transportation networks
Generates a power law distribution of degrees
- exponent depends on value of k
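The growth procedure can be sketched in a few lines. The version below makes two simplifying assumptions that the slide does not specify: vertices are numbered from 0, and the k targets of a new vertex are drawn independently, so repeated edges to the same target are possible. `preferential_attachment` is a hypothetical name.

```python
import random
from collections import Counter

def preferential_attachment(n: int, k: int, rng: random.Random) -> list:
    """Grow a graph on vertices 0..n-1, starting from the single edge (0, 1).
    Each new vertex i adds k edges to earlier vertices j chosen with
    probability d(j)/Z (degree-proportional)."""
    edges = [(0, 1)]
    # Each vertex appears in this list once per incident edge, so a uniform
    # draw from it picks vertex j with probability d(j)/Z.
    endpoints = [0, 1]
    for i in range(2, n):
        targets = [rng.choice(endpoints) for _ in range(k)]  # drawn before i is added
        for j in targets:
            edges.append((i, j))
            endpoints.extend((i, j))
    return edges

edges = preferential_attachment(2000, 2, random.Random(0))
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

mean_degree = sum(degree.values()) / len(degree)
print("mean degree:", mean_degree, "max degree:", max(degree.values()))
```

Unlike the G(N,p) case, the largest degree here is many times the mean: a few early vertices become hubs.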
Scale-Free Networks


Preferential attachment explains
- heavy-tailed degree distributions
- small diameter (~log(N), via “hubs”)
Will not generate high clustering coefficient
- no bias towards local connectivity, but towards hubs
Case 1: Internet Backbone
Nodes: computers, routers
Links: physical lines
(Faloutsos, Faloutsos and Faloutsos, 1999)
Case 2: Actor Connectivity
Days of Thunder (1990), Far and Away (1992), Eyes Wide Shut (1999)
Nodes: actors
Links: cast jointly
N = 212,250 actors
<k> = 28.78
P(k) ~ k^(-γ), γ = 2.3
Case 3: Science Citation Index
Nodes: papers
Links: citations
[Figure labels: Witten-Sander, PRL 1981; 1736 PRL papers (1988); 25; 2212]
P(k) ~ k^(-γ) (γ = 3)
(S. Redner, 1998)
Case 4: Science Coauthorship
Nodes: scientists (authors)
Links: write paper together
(Newman, 2000, H. Jeong et al 2001)
Case 5: Food Web
Nodes: trophic species
Links: trophic interactions
R. Sole (cond-mat/0011195)
R.J. Williams, N.D. Martinez Nature (2000)
Case 6: Sex-Web
Nodes: people (Females; Males)
Links: sexual relationships
4781 Swedes; 18-74;
59% response rate.
Liljeros et al. Nature 2001
Robustness of Random vs. Scale-Free Networks
- The accidental failure of a number of nodes in a random network can fracture the system into non-communicating islands.
- Scale-free networks are more robust in the face of such failures.
- Scale-free networks are highly vulnerable to a coordinated attack against their hubs.
Social Network Analysis

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary
Bio-Map
GENOME: protein-gene interactions
PROTEOME: protein-protein interactions
METABOLISM: bio-chemical reactions
Citrate Cycle
Metabolic Network
METABOLISM: bio-chemical reactions
Citrate Cycle
Metabolic Network
Nodes: chemicals (substrates)
Links: bio-chemical reactions
Metabolic Network
Archaea, Bacteria, Eukaryotes
The metabolic networks of organisms from all three domains of life are scale-free networks!
H. Jeong, B. Tombor, R. Albert, Z.N. Oltvai, and A.L. Barabasi, Nature, 407 651 (2000)
Protein Network
PROTEOME: protein-protein interactions
Yeast Protein Network
Nodes: proteins
Links: physical interactions (binding)
P. Uetz, et al. Nature 403, 623-7 (2000).
Topology of the Protein Network
$$P(k) \sim (k + k_0)^{-\gamma} \exp\left(-\frac{k + k_0}{k_\tau}\right)$$
H. Jeong, S.P. Mason, A.-L. Barabasi, Z.N. Oltvai, Nature 411, 41-42 (2001)
p53 Network
Nature 408 307 (2000)
…
“One way to understand the p53 network is to compare it to the Internet. The cell, like the Internet, appears to be a ‘scale-free network’.”
p53 Network (mammals)
Social Network Analysis

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary
Information on the Social Network


Heterogeneous, multi-relational data represented as a graph or network
- Nodes are objects
  - May have different kinds of objects
  - Objects have attributes
  - Objects may have labels or classes
- Edges are links
  - May have different kinds of links
  - Links may have attributes
  - Links may be directed, are not required to be binary
Links represent relationships and interactions between objects: rich content for mining
What is New for Link Mining Here

Traditional machine learning and data mining approaches assume:
- A random sample of homogeneous objects from a single relation
Real world data sets:
- Multi-relational, heterogeneous and semi-structured
Link Mining
- Newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming
A Taxonomy of Common Link Mining Tasks


Object-Related Tasks
- Link-based object ranking
- Link-based object classification
- Object clustering (group detection)
- Object identification (entity resolution)
Link-Related Tasks
- Link prediction
What Is a Link in Link Mining?




Link: relationship among data
Two kinds of linked networks
- homogeneous vs. heterogeneous
Homogeneous networks
- Single object type and single link type
- Single-mode social networks (e.g., friends)
- WWW: a collection of linked Web pages
Heterogeneous networks
- Multiple object and link types
- Medical network: patients, doctors, diseases, contacts, treatments
- Bibliographic network: publications, authors, venues
Link-Based Object Ranking (LBR)




LBR: Exploit the link structure of a graph to order or prioritize the set of objects within the graph
- Focused on graphs with single object type and single link type
This is a primary focus of the link analysis community
Web information analysis
- PageRank and HITS are typical LBR approaches
In social network analysis (SNA), LBR is a core analysis task
- Objective: rank individuals in terms of “centrality”
- Degree centrality vs. eigenvector/power centrality
- Rank objects relative to one or more relevant objects in the graph vs. rank objects over time in dynamic graphs
PageRank: Capturing Page Popularity (Brin & Page’98)



Intuitions
- Links are like citations in literature
- A page that is cited often can be expected to be more useful in general
PageRank is essentially “citation counting”, but improves over simple counting
- Consider “indirect citations” (being cited by a highly cited paper counts a lot…)
- Smoothing of citations (every page is assumed to have a non-zero citation count)
PageRank can also be interpreted as random surfing (thus capturing popularity)
The PageRank Algorithm (Brin & Page’98)
Random surfing model: at any page,
- with prob. α, randomly jump to a page
- with prob. (1 - α), randomly pick a link to follow
Example graph with pages d1, d2, d3, d4 and “transition matrix” M (entry m_ij is the probability of following a link from d_i to d_j):
$$M = \begin{pmatrix} 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{pmatrix}$$
Update rule:
$$p_{t+1}(d_i) = (1-\alpha) \sum_{d_j \in IN(d_i)} m_{ji}\, p_t(d_j) + \alpha \sum_{k} \frac{1}{N}\, p_t(d_k)$$
- the second term is the same as α/N (why?)
Stationary (“stable”) distribution, so we ignore time:
$$p(d_i) = \sum_{k} \left[\frac{1}{N}\alpha + (1-\alpha)\, m_{ki}\right] p(d_k)$$
$$p = (\alpha I + (1-\alpha) M)^T p, \quad I_{ij} = 1/N$$
Initial value p(d) = 1/N
Iterate until convergence
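The iteration can be sketched as a plain power iteration on the four-page example. The jump probability 0.15, the convergence tolerance, and the function name are assumptions chosen for illustration; the slide does not fix a value for the jump probability. Because the scores always sum to 1, the random-jump term reduces to alpha/N.

```python
def pagerank(M, alpha=0.15, tol=1e-12, max_iter=1000):
    """Power iteration for p = (alpha*I + (1 - alpha)*M)^T p with I_ij = 1/N.
    M[j][i] is the probability of following a link from page j to page i."""
    n = len(M)
    p = [1.0 / n] * n  # initial value p(d) = 1/N
    for _ in range(max_iter):
        new_p = [
            alpha / n + (1 - alpha) * sum(M[j][i] * p[j] for j in range(n))
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Transition matrix for pages d1..d4 from the example graph.
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
scores = pagerank(M)
for name, score in zip(("d1", "d2", "d3", "d4"), scores):
    print(name, round(score, 4))
```

With these assumed settings d1 ranks highest, since it collects all of d2's weight and half of d4's, while d3 and d4 tie because each receives only half of d1's weight.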
Link-Based Object Classification (LBC)





Predicting the category of an object based on its attributes, its links and the attributes of linked objects
- Web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc.
- Citation: Predict the topic of a paper, based on word occurrence, citations, co-citations
- Epidemics: Predict disease type based on characteristics of the patients infected by the disease
- Communication: Predict whether a communication contact is by email, phone call or mail
Group Detection

Cluster the nodes in the graph into groups that share common characteristics
- Web: identifying communities
- Citation: identifying research communities
Entity Resolution



Predicting when two objects are the same, based on their attributes and their links
Also known as: deduplication, reference reconciliation, coreference resolution, object consolidation
Applications
- Web: predicting when two sites are mirrors of each other
- Citation: predicting when two citations refer to the same paper
- Epidemics: predicting when two disease strains are the same
- Biology: learning when two names refer to the same protein
Entity Resolution Methods



Earlier viewed as a pair-wise resolution problem: pairs of references resolved based on the similarity of their attributes
Importance of considering links
- Coauthor links in bibliographic data, hierarchical links between spatial references, co-occurrence links between name references in documents
Use of links in resolution
- Collective entity resolution: one resolution decision affects another if they are linked
- Propagating evidence over links in a dependency graph
- Probabilistic models interact with different entity recognition decisions
Link Prediction



Predict whether a link exists between two entities, based on attributes and other observed links
Applications
- Web: predict if there will be a link between two pages
- Citation: predicting if a paper will cite another paper
- Epidemics: predicting who a patient’s contacts are
Methods
- Often viewed as a binary classification problem
- Local conditional probability model, based on structural and attribute features
- Difficulty: sparseness of existing links
- Collective prediction, e.g., Markov random field model
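One simple structural feature for link prediction is the number of neighbors two unlinked nodes share. The sketch below ranks unlinked pairs by that count. It illustrates a single feature on an assumed toy graph, not the classification or collective methods listed above, and the function name is hypothetical.

```python
from itertools import combinations

def common_neighbor_scores(adj: dict) -> list:
    """Score every currently unlinked pair by its number of shared neighbors,
    highest score first."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores[(u, v)] = len(adj[u] & adj[v])
    return sorted(scores.items(), key=lambda item: -item[1])

# Assumed toy graph: a and d are not linked but share neighbors b and c;
# e is isolated.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
    "e": set(),
}
ranked = common_neighbor_scores(adj)
print(ranked)
```

The pair (a, d) comes out on top. Since most pairs in a sparse graph score 0, the ranking also shows the sparseness difficulty noted above.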
Link Cardinality Estimation


Predicting the number of links to an object
- Web: predict the authority of a page based on the number of in-links; identifying hubs based on the number of out-links
- Citation: predicting the impact of a paper based on the number of citations
- Epidemics: predicting the number of people that will be infected based on the infectiousness of a disease
Predicting the number of objects reached along a path from an object
- Web: predicting number of pages retrieved by crawling a site
- Citation: predicting the number of citations of a particular author in a specific journal
Social Network Analysis

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary
Ref: Mining on Social Networks








D. Liben-Nowell and J. Kleinberg. The Link Prediction Problem for Social Networks. CIKM’03.
P. Domingos and M. Richardson. Mining the Network Value of Customers. KDD’01.
M. Richardson and P. Domingos. Mining Knowledge-Sharing Sites for Viral Marketing. KDD’02.
D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the Spread of Influence through a Social Network. KDD’03.
P. Domingos. Mining Social Networks for Viral Marketing. IEEE Intelligent Systems, 20(1), 80-82, 2005.
S. Brin and L. Page. The anatomy of a large scale hypertextual Web search engine. WWW7.
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the link structure of the World Wide Web. IEEE Computer’99.
D. Cai, X. He, J. Wen, and W. Ma. Block-level Link Analysis. SIGIR’2004.