Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org

Can we identify node groups? (communities, modules, clusters)
Nodes: Football teams
Edges: Games played

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

NCAA conferences
Nodes: Football teams
Edges: Games played

Can we identify functional modules?
Nodes: Proteins
Edges: Physical interactions

Functional modules
Nodes: Proteins
Edges: Physical interactions

Can we identify social communities?
Nodes: Facebook users
Edges: Friendships

Social communities
Communities shown: Summer internship, High school, Stanford (Basketball), Stanford (Squash)
Nodes: Facebook users
Edges: Friendships

Non-overlapping vs. overlapping communities
[Figure: a network and its adjacency matrix, with nodes on both axes]

What is the structure of community overlaps?
Edge density in the overlaps is higher!
Communities as “tiles”

Communities in a network
This is what we want!

Generative model for networks
1) Given a model, we generate the network.
2) Given a network, find the “best” model.
[Figure: a model and an example network with nodes A through H]

Goal: Define a model that can generate networks.
The model will have a set of “parameters” that we will later want to estimate (and use to detect communities).
Q: Given a set of nodes, how do communities “generate” edges of the network?

Generative model $B(V, C, M, \{p_c\})$ for graphs:
Nodes $V$, Communities $C$, Memberships $M$
Each community $c$ has a single probability $p_c$.
Later we fit the model to networks to detect communities.
[Figure: community affiliations (communities $C$ with probabilities $p_A$, $p_B$; memberships $M$; nodes $V$) and the network they generate]

AGM (the Affiliation Graph Model) generates the links:
For each pair of nodes in community $A$, we connect them with probability $p_A$.
The overall edge probability is:
$$P(u, v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c)$$
$M_u$: the set of communities node $u$ belongs to.
If $u, v$ share no communities: $P(u, v) = \varepsilon$.
Think of this as an “OR” function: if at least one community says “YES”, we create an edge.

AGM can express a variety of community structures: non-overlapping, overlapping, nested.
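
The AGM link rule can be made concrete with a short simulation. The sketch below is a minimal illustration, not the authors' implementation: it computes $P(u,v) = 1 - \prod_{c \in M_u \cap M_v}(1 - p_c)$ and flips one biased coin per node pair. The function names, the toy memberships, the values of $p_c$, and the default $\varepsilon = 0$ are all assumptions made for this example.

```python
import itertools
import random


def agm_edge_prob(u, v, memberships, p, eps=0.0):
    """P(u, v) = 1 - prod over shared communities of (1 - p_c); eps if none are shared."""
    shared = memberships[u] & memberships[v]
    if not shared:
        return eps
    prob_no_edge = 1.0
    for c in shared:
        prob_no_edge *= 1.0 - p[c]  # community c fails to create the edge
    return 1.0 - prob_no_edge


def sample_agm(memberships, p, eps=0.0, seed=0):
    """Flip one biased coin per unordered node pair and return the sampled edges."""
    rng = random.Random(seed)
    edges = set()
    for u, v in itertools.combinations(sorted(memberships), 2):
        if rng.random() < agm_edge_prob(u, v, memberships, p, eps):
            edges.add((u, v))
    return edges


# Hypothetical toy instance: two overlapping communities A and B.
memberships = {
    "a": {"A"}, "b": {"A"}, "c": {"A", "B"},
    "d": {"A", "B"}, "e": {"B"}, "f": {"B"},
}
p = {"A": 0.8, "B": 0.6}

print(agm_edge_prob("a", "b", memberships, p))  # pair sharing only A
print(agm_edge_prob("c", "d", memberships, p))  # pair sharing A and B
print(agm_edge_prob("a", "e", memberships, p))  # pair sharing nothing
print(sorted(sample_agm(memberships, p, seed=42)))
```

In this toy setup the pair sharing both communities gets probability $1 - 0.2 \cdot 0.4 = 0.92$, higher than the $0.8$ of a pair sharing only A, which is consistent with the observation that edge density is higher in the overlaps.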

Detecting communities with AGM:
Given a graph $G(V, E)$, find the model:
1) Affiliation graph $M$
2) Number of communities $C$
3) Parameters $p_c$
[Figure: example network with nodes A through H]

Maximum Likelihood Principle (MLE):
Given: data $X$
Assumption: the data is generated by some model $f(\Theta)$
$f$: the model
$\Theta$: the model parameters
We want to estimate $P_f(X \mid \Theta)$: the probability that our model $f$ (with parameters $\Theta$) generated the data.
Now let’s find the most likely model that could have generated the data:
$$\arg\max_{\Theta} P_f(X \mid \Theta)$$

Imagine we are given a set of coin flips.
Task: Figure out the bias of a coin!
Data: sequence of coin flips $X = [1, 0, 0, 0, 1, 0, 0, 1]$
Model: $f(\Theta)$ returns 1 with probability $\Theta$, else returns 0.
What is $P_f(X \mid \Theta)$? Assuming the coin flips are independent:
$$P_f(X \mid \Theta) = P_f(1 \mid \Theta) \cdot P_f(0 \mid \Theta) \cdot P_f(0 \mid \Theta) \cdots P_f(1 \mid \Theta)$$
What is $P_f(1 \mid \Theta)$? Simple: $P_f(1 \mid \Theta) = \Theta$.
Then, with three ones and five zeros in $X$:
$$P_f(X \mid \Theta) = \Theta^3 (1 - \Theta)^5$$
For example:
$P_f(X \mid \Theta = 0.5) = 0.003906$
$P_f(X \mid \Theta = 3/8) = 0.005029$
What did we learn? Our data was most likely generated by a coin with bias $\Theta = 3/8$.
[Figure: $P_f(X \mid \Theta)$ plotted against $\Theta$, with the maximum at $\Theta^* = 3/8$]

How do we do MLE for graphs?
The model generates a probabilistic adjacency matrix.
We then flip all the entries of the probabilistic matrix (as biased coins) to obtain the binary adjacency matrix $A$.
For every pair of nodes $u, v$, AGM gives the probability $p_{uv}$ of them being linked.

Probabilistic adjacency matrix:
$$\begin{pmatrix} 0 & 0.10 & 0.10 & 0.04 \\ 0.10 & 0 & 0.02 & 0.06 \\ 0.10 & 0.02 & 0 & 0.06 \\ 0.04 & 0.06 & 0.06 & 0 \end{pmatrix}$$
Binary adjacency matrix $A$ after flipping the biased coins:
$$\begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}$$

The likelihood of AGM generating graph $G$:
$$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} \bigl(1 - P(u, v)\bigr)$$
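
Both likelihoods above are small enough to check numerically. The following is a minimal sketch under stated assumptions: the function names are made up for this example, the grid search merely stands in for solving the maximization analytically, and the 4-node matrices are the slide's example values as best they can be read from the extracted text. The graph likelihood is computed in log space and counts each unordered pair once.

```python
import math


def coin_likelihood(flips, theta):
    """P_f(X | theta) for independent flips: theta^(#ones) * (1 - theta)^(#zeros)."""
    ones = sum(flips)
    zeros = len(flips) - ones
    return theta ** ones * (1.0 - theta) ** zeros


def graph_log_likelihood(prob, adj):
    """log P(G | Theta): log p_uv over edges plus log(1 - p_uv) over non-edges.

    Each unordered pair u < v is counted once (undirected graph, no self-loops).
    Assumes p_uv > 0 on every edge and p_uv < 1 on every non-edge.
    """
    n = len(adj)
    total = 0.0
    for u in range(n):
        for v in range(u + 1, n):
            p_uv = prob[u][v]
            total += math.log(p_uv) if adj[u][v] else math.log(1.0 - p_uv)
    return total


flips = [1, 0, 0, 0, 1, 0, 0, 1]
print(round(coin_likelihood(flips, 0.5), 6))    # 0.003906
print(round(coin_likelihood(flips, 3 / 8), 6))  # 0.005029

# Grid search over theta; the maximizer is the fraction of ones, 3/8.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: coin_likelihood(flips, t))
print(theta_hat)  # 0.375

# Example matrices (assumed reading of the slide's 4-node example).
prob = [
    [0.00, 0.10, 0.10, 0.04],
    [0.10, 0.00, 0.02, 0.06],
    [0.10, 0.02, 0.00, 0.06],
    [0.04, 0.06, 0.06, 0.00],
]
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
]
print(math.exp(graph_log_likelihood(prob, adj)))
```

Working with the log of the likelihood avoids multiplying many small probabilities together, which is the same reason the log-likelihood is the usual objective when fitting such models.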

Given a graph $G(V, E)$ and $\Theta$, we calculate the likelihood that $\Theta$ generated $G$: $P(G \mid \Theta)$
$\Theta = B(V, C, M, \{p_c\})$

Edge probabilities under $\Theta$:
$$\begin{pmatrix} 0 & 0.9 & 0.9 & 0 \\ 0.9 & 0 & 0.9 & 0 \\ 0.9 & 0.9 & 0 & 0.9 \\ 0 & 0 & 0.9 & 0 \end{pmatrix}$$
Adjacency matrix of $G$:
$$\begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
$$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} \bigl(1 - P(u, v)\bigr)$$

Our goal: Find $\Theta = B(V, C, M, \{p_C\})$ that maximizes the likelihood of the observed graph:
$$\arg\max_{\Theta} P(G \mid \Theta)$$
How do we find the $B(V, C, M, \{p_C\})$ that maximizes the likelihood?

Written out, our goal is to find $B(V, C, M, \{p_C\})$ such that:
$$\arg\max_{B(V, C, M, \{p_C\})} \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} \bigl(1 - P(u, v)\bigr)$$
Problem: Finding $B$ means finding the bipartite affiliation network. There is no nice way to do this.
Fitting $B(V, C, M, \{p_C\})$ is too hard, so let’s change the model (so it is easier to fit)!

Relaxation: Memberships have strengths $F_{uA}$
$F_{uA}$: the membership strength of node $u$ to community $A$ ($F_{uA} = 0$: no membership)
Each community $A$ links nodes independently:
$$P_A(u, v) = 1 - \exp(-F_{uA} \cdot F_{vA})$$

Community membership strength matrix $F$ (rows: nodes, columns: communities)
$F_{uA}$: strength of $u$’s membership to $A$
$F_u$: vector of community membership strengths of $u$
$$P_A(u, v) = 1 - \exp(-F_{uA} \cdot F_{vA})$$
The probability of connection grows with the product of strengths (and is approximately proportional to it when the product is small).
Notice: If one node doesn’t belong to the community ($F_{uC} = 0$), then $P_C(u, v) = 0$.
Probability that at least one common community $C$ links the nodes:
$$P(u, v) = 1 - \prod_{C} \bigl(1 - P_C(u, v)\bigr)$$

Community $A$ links nodes $u, v$ independently:
$$P_A(u, v) = 1 - \exp(-F_{uA} \cdot F_{vA})$$
Then the probability that at least one common community $C$ links them is:
$$P(u, v) = 1 - \prod_{C} \bigl(1 - P_C(u, v)\bigr) = 1 - \exp\Bigl(-\sum_{C} F_{uC} \cdot F_{vC}\Bigr) = 1 - \exp(-F_u \cdot F_v^T)$$
Example $F$ matrix (node community membership strengths):
$F_u$: 0, 1.2, 0, 0.2
$F_v$: 0.5, 0, 0, 0.8
$F_w$: 0, 1.8, 1, 0
Then: $F_u \cdot F_v^T = 0.2 \cdot 0.8 = 0.16$
And: $P(u, v) = 1 - \exp(-0.16) \approx 0.15$
But: $P(u, w) = 1 - \exp(-2.16) \approx 0.88$ and $P(v, w) = 0$

Task: Given a network $G(V, E)$, estimate $F$.
Find $F$ that maximizes the likelihood:
$$\arg\max_{F} \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} \bigl(1 - P(u, v)\bigr)$$
where $P(u, v) = 1 - \exp(-F_u \cdot F_v^T)$.
Many times we take the logarithm of the likelihood, and call it the log-likelihood: $l(F) = \log P(G \mid F)$
Goal: Find $F$ that maximizes $l(F)$. Since $\log\bigl(1 - P(u, v)\bigr) = -F_u \cdot F_v^T$:
$$l(F) = \sum_{(u,v) \in E} \log\bigl(1 - \exp(-F_u F_v^T)\bigr) - \sum_{(u,v) \notin E} F_u F_v^T$$

Compute the gradient of a single row $F_u$ of $F$:
$$\nabla l(F_u) = \sum_{v \in \mathcal{N}(u)} F_v \frac{\exp(-F_u F_v^T)}{1 - \exp(-F_u F_v^T)} - \sum_{v \notin \mathcal{N}(u)} F_v$$
$\mathcal{N}(u)$: the set of outgoing neighbors of $u$
Coordinate gradient ascent: iterate over the rows of $F$:
Compute the gradient $\nabla l(F_u)$ of row $u$ (while keeping the others fixed).
Update the row $F_u$: $F_u \leftarrow F_u + \eta \nabla l(F_u)$
Project $F_u$ back to a non-negative vector: if $F_{uC} < 0$, set $F_{uC} = 0$.
This is slow! Computing $\nabla l(F_u)$ takes time linear in the number of nodes, because of the sum over all non-neighbors.

However, we notice:
$$\sum_{v \notin \mathcal{N}(u)} F_v = \sum_{v} F_v - F_u - \sum_{v \in \mathcal{N}(u)} F_v$$
We cache $\sum_{v} F_v$.
So computing $\sum_{v \notin \mathcal{N}(u)} F_v$ now takes time linear in the degree $|\mathcal{N}(u)|$ of $u$.
In networks, the degree of a node is much smaller than the total number of nodes, so this is a significant speedup!

BigCLAM takes 5 minutes for networks with 300k nodes.
Other methods take 10 days.
Can process networks with 100M edges!

Extension: Make community membership edges directed!
Outgoing membership: the node “sends” edges.
Incoming membership: the node “receives” edges.
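
Stepping back to the undirected fitting procedure: the row-wise update with a cached sum can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' BigCLAM code. It uses $P(u,v) = 1 - \exp(-F_u \cdot F_v^T)$, the row gradient obtained by differentiating $l(F) = \log P(G \mid F)$ with respect to $F_u$, and the identity $\sum_{v \notin \mathcal{N}(u)} F_v = \sum_v F_v - F_u - \sum_{v \in \mathcal{N}(u)} F_v$. The function names, step size, number of sweeps, clamp constant, toy graph, and initial values are all hypothetical choices.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def edge_prob(F, u, v):
    """P(u, v) = 1 - exp(-F_u . F_v)."""
    return 1.0 - math.exp(-dot(F[u], F[v]))


def log_likelihood(F, neighbors):
    """l(F): log(1 - exp(-F_u.F_v)) over edges minus F_u.F_v over non-edges."""
    total = 0.0
    for u in range(len(F)):
        for v in range(u + 1, len(F)):
            x = dot(F[u], F[v])
            if v in neighbors[u]:
                total += math.log(1.0 - math.exp(-max(x, 1e-10)))
            else:
                total -= x
    return total


def row_gradient(F, u, neighbors, col_sum):
    """Gradient of l with respect to row F_u; loops only over the neighbors of u."""
    k = len(F[u])
    grad = [0.0] * k
    nbr_sum = [0.0] * k
    for v in neighbors[u]:
        x = dot(F[u], F[v])
        # exp(-x) / (1 - exp(-x)) = 1 / (exp(x) - 1); clamp x to avoid dividing by 0.
        w = 1.0 / math.expm1(max(x, 1e-10))
        for c in range(k):
            grad[c] += F[v][c] * w
            nbr_sum[c] += F[v][c]
    # Non-neighbor sum recovered from the cached column sums of F.
    return [grad[c] - (col_sum[c] - F[u][c] - nbr_sum[c]) for c in range(k)]


def fit(F, neighbors, eta=0.01, sweeps=100):
    """Projected coordinate gradient ascent over the rows of F (updates F in place)."""
    k = len(F[0])
    col_sum = [sum(row[c] for row in F) for c in range(k)]
    for _ in range(sweeps):
        for u in range(len(F)):
            grad = row_gradient(F, u, neighbors, col_sum)
            new_row = [max(0.0, F[u][c] + eta * grad[c]) for c in range(k)]
            for c in range(k):
                col_sum[c] += new_row[c] - F[u][c]  # keep the cache in sync
            F[u] = new_row
    return F


# Membership strengths from the slide example (rows u, v, w; four communities).
F_example = [[0, 1.2, 0, 0.2], [0.5, 0, 0, 0.8], [0, 1.8, 1, 0]]
print(round(edge_prob(F_example, 0, 1), 3))  # 0.148
print(round(edge_prob(F_example, 0, 2), 3))  # 0.885
print(edge_prob(F_example, 1, 2))            # 0.0

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
F = [[1.0, 0.2] for _ in range(3)] + [[0.2, 1.0] for _ in range(3)]
before = log_likelihood(F, neighbors)
fit(F, neighbors)
print(before, log_likelihood(F, neighbors))
```

A fixed step size is used here only for brevity; the cache update inside `fit` is what keeps each row update proportional to the degree of the node rather than to the size of the network.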

Everything is almost the same, except now we have 2 matrices: $F$ and $H$
$F$: out-going community memberships
$H$: in-coming community memberships
Edge probability: $P(u, v) = 1 - \exp(-F_u H_v^T)$
Network log-likelihood:
$$l(F, H) = \sum_{(u,v) \in E} \log\bigl(1 - \exp(-F_u H_v^T)\bigr) - \sum_{(u,v) \notin E} F_u H_v^T$$
which we optimize the same way as before.

Further reading:
J. Yang, J. Leskovec. Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach. ACM International Conference on Web Search and Data Mining (WSDM), 2013.
J. Yang, J. McAuley, J. Leskovec. Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks. ACM International Conference on Web Search and Data Mining (WSDM), 2014.
J. Yang, J. McAuley, J. Leskovec. Community Detection in Networks with Node Attributes. IEEE International Conference on Data Mining (ICDM), 2013.
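
Returning to the directed variant with the two matrices $F$ and $H$: the sketch below evaluates $P(u, v) = 1 - \exp(-F_u H_v^T)$ and the corresponding log-likelihood over ordered node pairs. It is a minimal illustration only; the function names and the three-node instance are assumptions made for this example, and the log-likelihood is written as the log of the edge probability on edges minus $F_u H_v^T$ on non-edges, which follows from that edge probability.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def directed_edge_prob(F, H, u, v):
    """P(u -> v) = 1 - exp(-F_u . H_v): u's outgoing times v's incoming strengths."""
    return 1.0 - math.exp(-dot(F[u], H[v]))


def directed_log_likelihood(F, H, edges):
    """Sum log P(u -> v) over directed edges and -F_u . H_v over ordered non-edges."""
    n = len(F)
    total = 0.0
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            x = dot(F[u], H[v])
            if (u, v) in edges:
                total += math.log(1.0 - math.exp(-x))  # assumes x > 0 on every edge
            else:
                total -= x
    return total


# Hypothetical toy instance: node 0 "sends" edges, nodes 1 and 2 "receive" them.
F = [[1.5], [0.0], [0.0]]  # outgoing membership strengths
H = [[0.0], [1.0], [1.0]]  # incoming membership strengths
edges = {(0, 1), (0, 2)}

print(directed_edge_prob(F, H, 0, 1))  # 1 - exp(-1.5), about 0.777
print(directed_edge_prob(F, H, 1, 0))  # 0.0: the reverse direction gets no support
print(directed_log_likelihood(F, H, edges))
```

Because the product pairs an outgoing row with an incoming row, $P(u, v)$ and $P(v, u)$ can differ, which is what lets a single community describe nodes that mostly send edges and nodes that mostly receive them.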