PPT - Mining of Massive Datasets

Note to other teachers and users of these slides: We would be delighted if you found our
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Can we identify node groups (communities, modules, clusters)?
Nodes: Football Teams
Edges: Games played
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
NCAA conferences
Nodes: Football Teams
Edges: Games played
Can we identify functional modules?
Nodes: Proteins
Edges: Physical interactions
Functional modules
Nodes: Proteins
Edges: Physical interactions
Can we identify social communities?
Nodes: Facebook Users
Edges: Friendships
Social communities: Summer internship, High school, Stanford (Basketball), Stanford (Squash)
Nodes: Facebook Users
Edges: Friendships

Non-overlapping vs. overlapping communities
[Figure: a network and its adjacency matrix (nodes by nodes)]

What is the structure of community overlaps:
Edge density in the overlaps is higher!
Communities as “tiles”
Communities in a network
This is what we want!

1) Given a model, we generate the network:
[Figure: a generative model for networks produces a network on nodes A to H]

2) Given a network, find the “best” model
[Figure: a network on nodes A to H, from which the generative model is inferred]

Goal: Define a model that can generate networks
 • The model will have a set of “parameters” that we will later want to estimate (and use to detect communities)
[Figure: a generative model for networks and a network on nodes A to H]

Q: Given a set of nodes, how do communities
“generate” edges of the network?
[Figure: affiliation model with communities C (edge probabilities pA, pB), memberships M, and nodes V, next to the network it generates]
Generative model $B(V, C, M, \{p_c\})$ for graphs:
 • Nodes $V$, communities $C$, memberships $M$
 • Each community $c$ has a single probability $p_c$
 • Later we fit the model to networks to detect communities
[Figure: community affiliations (communities C, memberships M, nodes V) and the generated network]
AGM generates the links:
 • For each pair of nodes in community $A$, we connect them with prob. $p_A$
 • The overall edge probability is:
   $P(u,v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c)$
   where $M_u$ … set of communities node $u$ belongs to
 • If $u, v$ share no communities: $P(u,v) = \varepsilon$
Think of this as an “OR” function: If at least 1 community says “YES” we create an edge
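The “OR” rule for AGM edge probabilities can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `agm_edge_prob`, the dictionary layout for the community probabilities, and the default value of `eps` are assumptions made for the example.

```python
def agm_edge_prob(M_u, M_v, p, eps=1e-4):
    """Edge probability under AGM: 1 - product over shared communities of (1 - p_c).

    M_u, M_v: sets of community labels the two nodes belong to.
    p: dict mapping community label -> edge probability p_c.
    eps: background probability when no community is shared
         (hypothetical default; the slides only call it epsilon).
    """
    shared = M_u & M_v
    if not shared:
        return eps
    prob_no_edge = 1.0
    for c in shared:
        prob_no_edge *= 1.0 - p[c]  # community c fails to create the edge
    return 1.0 - prob_no_edge


p = {"A": 0.5, "B": 0.5}
both = agm_edge_prob({"A", "B"}, {"A", "B"}, p)  # two shared communities
one = agm_edge_prob({"A"}, {"A", "B"}, p)        # only A is shared
none = agm_edge_prob({"A"}, {"B"}, p)            # no shared community
```

With two shared communities of probability 0.5 each, the pair gets two independent chances at an edge, so the probability rises from 0.5 to 0.75.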

AGM can express a variety of community structures: non-overlapping, overlapping, nested

Detecting communities with AGM:
[Figure: a network on nodes A to H]
Given a graph $G(V, E)$, find the model:
1) Affiliation graph $M$
2) Number of communities $C$
3) Parameters $p_c$

Maximum Likelihood Principle (MLE):
 • Given: Data $X$
 • Assumption: Data is generated by some model $f(\Theta)$
   • $f$ … model
   • $\Theta$ … model parameters
 • Want to estimate $P_f(X \mid \Theta)$:
   • The probability that our model $f$ (with parameters $\Theta$) generated the data
 • Now let’s find the most likely model that could have generated the data: $\arg\max_{\Theta} P_f(X \mid \Theta)$
Imagine we are given a set of coin flips
Task: Figure out the bias of a coin!
 • Data: Sequence of coin flips: $X = [1, 0, 0, 0, 1, 0, 0, 1]$
 • Model: $f(\Theta)$ = return 1 with prob. $\Theta$, else return 0
 • What is $P_f(X \mid \Theta)$? Assuming coin flips are independent:
   $P_f(X \mid \Theta) = P_f(1 \mid \Theta) \cdot P_f(0 \mid \Theta) \cdot P_f(0 \mid \Theta) \cdots P_f(1 \mid \Theta)$
 • What is $P_f(1 \mid \Theta)$? Simple: $P_f(1 \mid \Theta) = \Theta$, and likewise $P_f(0 \mid \Theta) = 1 - \Theta$
 • Then, with three 1s and five 0s: $P_f(X \mid \Theta) = \Theta^3 (1 - \Theta)^5$
 • For example:
   • $P_f(X \mid \Theta = 0.5) = 0.003906$
   • $P_f(X \mid \Theta = 3/8) = 0.005029$
 • What did we learn? Our data was most likely generated by a coin with bias $\Theta^* = 3/8$
[Figure: plot of $P_f(X \mid \Theta)$ against $\Theta$, with its maximum at $\Theta^* = 3/8$]
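The coin example can be checked numerically. The sketch below evaluates the likelihood at the two values from the slide and then searches a grid of candidate biases; the grid search and the names `coin_likelihood` and `theta_star` are choices made for this illustration (the maximum can also be found analytically).

```python
def coin_likelihood(flips, theta):
    """P_f(X | theta) for independent flips: theta per 1, (1 - theta) per 0."""
    like = 1.0
    for x in flips:
        like *= theta if x == 1 else 1.0 - theta
    return like


X = [1, 0, 0, 0, 1, 0, 0, 1]

like_half = coin_likelihood(X, 0.5)           # theta = 0.5
like_three_eighths = coin_likelihood(X, 3 / 8)  # theta = 3/8

# Brute-force search over candidate biases 0.001, 0.002, ..., 0.999
grid = [i / 1000 for i in range(1, 1000)]
theta_star = max(grid, key=lambda t: coin_likelihood(X, t))
```

The grid maximum lands on 0.375, which is the fraction of 1s in the data (3 out of 8).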

How do we do MLE for graphs?
 • Model generates a probabilistic adjacency matrix
 • We then flip all the entries of the probabilistic matrix (as biased coins) to obtain the binary adjacency matrix $A$
For every pair of nodes $u, v$, AGM gives the prob. $p_{uv}$ of them being linked:

  Probabilistic adjacency matrix      Binary adjacency matrix $A$ (after flipping biased coins)
  0     0.10  0.10  0.04              0  1  0  0
  0.10  0     0.02  0.06              1  0  1  1
  0.10  0.02  0     0.06              0  1  0  1
  0.04  0.06  0.06  0                 0  1  1  0

The likelihood of AGM generating graph $G$:
$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$

Given graph $G(V, E)$ and $\Theta$, we calculate the likelihood that $\Theta$ generated $G$: $P(G \mid \Theta)$

  $\Theta = B(V, C, M, \{p_c\})$ (edge probabilities)      $G$ (adjacency matrix)
  0    0.9  0.9  0                           0  1  1  0
  0.9  0    0.9  0                           1  0  1  0
  0.9  0.9  0    0.9                         1  1  0  1
  0    0    0.9  0                           0  0  1  0

$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$
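As a rough illustration of the graph likelihood, the sketch below multiplies $P(u,v)$ over edges and $1 - P(u,v)$ over non-edges. The 4-node matrices are the values shown on the slide. Two assumptions are made here that the slide does not state: the graph is undirected without self-loops, and each unordered pair is counted once. The function name `graph_likelihood` is hypothetical.

```python
def graph_likelihood(P, A):
    """P(G | Theta): product of P[u][v] over edges and 1 - P[u][v] over non-edges."""
    n = len(A)
    like = 1.0
    for u in range(n):
        for v in range(u + 1, n):  # each unordered pair once, no self-loops
            like *= P[u][v] if A[u][v] == 1 else 1.0 - P[u][v]
    return like


# Edge probabilities given by the model
P = [[0.0, 0.9, 0.9, 0.0],
     [0.9, 0.0, 0.9, 0.0],
     [0.9, 0.9, 0.0, 0.9],
     [0.0, 0.0, 0.9, 0.0]]

# Observed binary adjacency matrix of G
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

likelihood = graph_likelihood(P, A)
```

Under these assumptions the four edges each contribute 0.9 and the two non-edges contribute 1, so the likelihood is 0.9^4, about 0.6561.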

Our goal: Find $\Theta = B(V, C, M, \{p_c\})$ such that:
$\arg\max_{\Theta} P(G \mid \Theta)$ (where $\Theta$ is the AGM)
How do we find the $B(V, C, M, \{p_c\})$ that maximizes the likelihood?

Our goal is to find $B(V, C, M, \{p_c\})$ such that:
$\arg\max_{B(V, C, M, \{p_c\})} \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$
Problem: Finding $B$ means finding the bipartite affiliation network.
 • There is no nice way to do this.
 • Fitting $B(V, C, M, \{p_c\})$ is too hard, so let’s change the model (so it is easier to fit)!

Relaxation: Memberships have strengths
[Figure: nodes $u$ and $v$ attached to community $A$, with membership strength $F_{uA}$]
 • $F_{uA}$: The membership strength of node $u$ to community $A$ ($F_{uA} = 0$: no membership)
 • Each community $A$ links nodes independently:
   $P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$
Community membership strength matrix $F$ (rows: nodes; columns: communities)
 • $F_{uA}$ … strength of $u$’s membership to $A$
 • $F_u$ … vector of community membership strengths of $u$
 • The probability of connection increases with the product of strengths:
   $P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$
 • Notice: If one node doesn’t belong to the community ($F_{uC} = 0$) then $P_C(u,v) = 0$
 • Prob. that at least one common community $C$ links the nodes:
   $P(u,v) = 1 - \prod_C (1 - P_C(u,v))$



Community $A$ links nodes $u, v$ independently:
$P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$
Then the prob. that at least one common community $C$ links them:
$P(u,v) = 1 - \prod_C (1 - P_C(u,v))$
$= 1 - \exp(-\sum_C F_{uC} \cdot F_{vC})$
$= 1 - \exp(-F_u \cdot F_v^T)$
Example $F$ matrix (node community membership strengths):
  $F_u$:  0    1.2  0    0.2
  $F_v$:  0.5  0    0    0.8
  $F_w$:  0    1.8  1    0
Then: $F_u \cdot F_v^T = 0.16$
And: $P(u,v) = 1 - \exp(-0.16) = 0.15$
But: $P(u,w) = 0.88$ and $P(v,w) = 0$
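The worked example is easy to reproduce. The sketch below computes $1 - \exp(-F_u \cdot F_v^T)$ for the three rows of the example matrix; the function name `edge_prob` and the dictionary holding the rows are assumptions made for the illustration.

```python
import math


def edge_prob(F_u, F_v):
    """P(u, v) = 1 - exp(-F_u . F_v) for two membership-strength vectors."""
    return 1.0 - math.exp(-sum(a * b for a, b in zip(F_u, F_v)))


# Rows of the example F matrix from the slide
F = {
    "u": [0.0, 1.2, 0.0, 0.2],
    "v": [0.5, 0.0, 0.0, 0.8],
    "w": [0.0, 1.8, 1.0, 0.0],
}

p_uv = edge_prob(F["u"], F["v"])  # dot product 0.16
p_uw = edge_prob(F["u"], F["w"])  # dot product 2.16
p_vw = edge_prob(F["v"], F["w"])  # dot product 0: no shared community
```

Nodes $u$ and $w$ share a community in which both have large strengths, so their link probability is high, while $v$ and $w$ share none and get probability 0.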

Task: Given a network $G(V, E)$, estimate $F$
 • Find $F$ that maximizes the likelihood:
   $\arg\max_F \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} (1 - P(u,v))$
 • where: $P(u,v) = 1 - \exp(-F_u \cdot F_v^T)$
 • Many times we take the logarithm of the likelihood, and call it log-likelihood: $l(F) = \log P(G \mid F)$
Goal: Find $F$ that maximizes $l(F)$:
$l(F) = \sum_{(u,v) \in E} \log(1 - \exp(-F_u F_v^T)) - \sum_{(u,v) \notin E} F_u F_v^T$
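Taking the logarithm turns the product into a sum: each edge contributes $\log(1 - \exp(-F_u F_v^T))$, and each non-edge contributes $\log \exp(-F_u F_v^T) = -F_u F_v^T$. A minimal sketch of that computation follows. It assumes an undirected graph with each unordered pair counted once, and the function name, the toy matrix, and the small clamp that avoids $\log 0$ are all choices made for this example.

```python
import math

TINY = 1e-10  # clamp to avoid log(0) on an edge; an assumption of this sketch


def log_likelihood(F, edges):
    """l(F) = sum over edges of log(1 - exp(-x)) minus sum over non-edges of x,
    where x = F_u . F_v and edges holds pairs (u, v) with u < v."""
    n = len(F)
    total = 0.0
    for u in range(n):
        for v in range(u + 1, n):
            x = sum(a * b for a, b in zip(F[u], F[v]))
            if (u, v) in edges:
                total += math.log(1.0 - math.exp(-max(x, TINY)))
            else:
                total -= x
    return total


# Toy data (hypothetical): 3 nodes, 2 communities, a single edge 0-1
F_toy = [[1.0, 0.0], [0.5, 0.2], [0.0, 0.3]]
edges_toy = {(0, 1)}
ll = log_likelihood(F_toy, edges_toy)
```

Here the edge 0-1 has dot product 0.5, and the non-edges 0-2 and 1-2 have dot products 0 and 0.06.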

Compute gradient of a single row $F_u$ of $F$:
$\nabla l(F_u) = \sum_{v \in \mathcal{N}(u)} F_v \frac{\exp(-F_u F_v^T)}{1 - \exp(-F_u F_v^T)} - \sum_{v \notin \mathcal{N}(u)} F_v$
where $\mathcal{N}(u)$ … set of outgoing neighbors of $u$
Coordinate gradient ascent:
 • Iterate over the rows of $F$:
   • Compute gradient $\nabla l(F_u)$ of row $u$ (while keeping others fixed)
   • Update the row $F_u$: $F_u \leftarrow F_u + \eta \nabla l(F_u)$
   • Project $F_u$ back to a non-negative vector: If $F_{uC} < 0$: $F_{uC} = 0$
This is slow! Computing $\nabla l(F_u)$ takes time linear in the number of nodes!

However, we notice:
$\sum_{v \notin \mathcal{N}(u)} F_v = \sum_v F_v - F_u - \sum_{v \in \mathcal{N}(u)} F_v$
 • We cache $\sum_v F_v$
 • So, computing $\sum_{v \notin \mathcal{N}(u)} F_v$ now takes linear time in the degree $|\mathcal{N}(u)|$ of $u$
 • In networks the degree of a node is much smaller than the total number of nodes in the network, so this is a significant speedup!
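The caching trick can be sketched as follows. This is an illustrative implementation under stated assumptions, not the BigCLAM reference code: the helper names, the toy graph, the step size `eta`, and the `EPS` clamp that avoids division by zero are all hypothetical. The naive gradient visits every node; the cached version visits only the neighbors of $u$ and obtains the non-neighbor sum from the cached column sums of $F$.

```python
import math

EPS = 1e-8  # guards against division by zero; an assumption of this sketch


def edge_weight(F_u, F_v):
    """exp(-x) / (1 - exp(-x)) for x = F_u . F_v, the neighbor factor in the gradient."""
    x = max(sum(a * b for a, b in zip(F_u, F_v)), EPS)
    e = math.exp(-x)
    return e / (1.0 - e)


def gradient_naive(F, neighbors, u):
    """Row gradient by visiting every node: linear in the number of nodes."""
    grad = [0.0] * len(F[u])
    for v in range(len(F)):
        if v == u:
            continue
        if v in neighbors[u]:
            w = edge_weight(F[u], F[v])
            grad = [g + w * f for g, f in zip(grad, F[v])]
        else:
            grad = [g - f for g, f in zip(grad, F[v])]
    return grad


def gradient_cached(F, neighbors, u, col_sums):
    """Same gradient using cached column sums of F: linear in the degree of u."""
    # sum over non-neighbors = (sum over all v) - F_u - (sum over neighbors)
    grad = [F[u][c] - col_sums[c] for c in range(len(F[u]))]
    for v in neighbors[u]:
        w = edge_weight(F[u], F[v])
        grad = [g + (w + 1.0) * f for g, f in zip(grad, F[v])]
    return grad


def update_row(F, neighbors, u, col_sums, eta=0.05):
    """One coordinate-ascent step on row u, projected back to non-negative values."""
    grad = gradient_cached(F, neighbors, u, col_sums)
    new_row = [max(0.0, f + eta * g) for f, g in zip(F[u], grad)]
    for c in range(len(new_row)):  # keep the cached sums consistent
        col_sums[c] += new_row[c] - F[u][c]
    F[u] = new_row


# Toy graph (hypothetical data): edges 0-1, 0-2, 1-2, 2-3
neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
F = [[0.8, 0.1], [0.6, 0.3], [0.5, 0.7], [0.1, 0.9]]
col_sums = [sum(row[c] for row in F) for c in range(2)]
```

Calling `update_row` for each node in turn, and repeating until the log-likelihood stops improving, gives the coordinate ascent loop described on the slide. Updating `col_sums` inside `update_row` keeps the cache valid without recomputing it from scratch.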

BigCLAM takes 5 minutes for networks with 300k nodes
 • Other methods take 10 days
Can process networks with 100M edges!

Extension:
Make community membership edges directed!
 • Outgoing membership: Node “sends” edges
 • Incoming membership: Node “receives” edges

Everything is almost the same except now we have 2 matrices: $F$ and $H$
 • $F$ … out-going community memberships
 • $H$ … in-coming community memberships
Edge prob.: $P(u,v) = 1 - \exp(-F_u H_v^T)$
Network log-likelihood:
$l(F, H) = \sum_{(u,v) \in E} \log(1 - \exp(-F_u H_v^T)) - \sum_{(u,v) \notin E} F_u H_v^T$
which we optimize the same way as before
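In the directed extension the edge probability is no longer symmetric, because the sender's row of $F$ is paired with the receiver's row of $H$. A small sketch with made-up strengths shows this; the function name and the numbers are assumptions for illustration only.

```python
import math


def directed_edge_prob(F_u, H_v):
    """P(u -> v) = 1 - exp(-F_u . H_v): u's outgoing vs. v's incoming strengths."""
    return 1.0 - math.exp(-sum(a * b for a, b in zip(F_u, H_v)))


# Hypothetical strengths for two nodes and two communities:
# u only sends edges in community 0, v only receives edges in community 0.
F_u, H_u = [1.5, 0.0], [0.0, 0.0]
F_v, H_v = [0.0, 0.0], [1.0, 0.0]

p_uv = directed_edge_prob(F_u, H_v)  # u -> v
p_vu = directed_edge_prob(F_v, H_u)  # v -> u
```

Here $u \to v$ is likely while $v \to u$ has probability 0, which a single symmetric matrix $F$ could not express.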

Overlapping Community Detection at Scale: A Nonnegative Matrix
Factorization Approach by J. Yang, J. Leskovec. ACM International
Conference on Web Search and Data Mining (WSDM), 2013.

Detecting Cohesive and 2-mode Communities in Directed and
Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM
International Conference on Web Search and Data Mining (WSDM),
2014.

Community Detection in Networks with Node Attributes by J. Yang,
J. McAuley, J. Leskovec. IEEE International Conference On Data
Mining (ICDM), 2013.