Community Detection with Edge Content in Social Media Networks
Paper presented by Konstantinos Giannakopoulos

Outline
• Definitions
  – Social Networks & Big Data
  – Community Detection
• The framework of Matrix Factorization algorithms
  – Steps, Goals, Solution
  – The PCA approach
• The EIMF algorithm
  – Description, Performance Metrics, Evaluation
• Other Approaches
  – Algorithms, Models, Metrics

From Social Networks to Big Data
[Figure: Network, Social Network, Big Data]

Social Networks
• Users act (conversations, likes, shares)
• Users are connected

Community Detection
• Density of links
• Links and content: some of the less strongly linked vertices may belong to the same community if they share similar content.

General Methodology of MF models
• Decomposition of a matrix into a product of matrices.
• M: a matrix representation of the social network.
  $M_{[m \times n]} = A_{[m \times k]} \, B_{[k \times n]}$
• Product of two low-rank matrices
• k-dimensional feature vector

What's next?
• Two sub-models:
  – Link matrix factorization $F_L$
  – Content matrix factorization $F_C$
  Each matrix contains a k-dimensional feature vector.
  $F = \min \|M - P\|_F^2 + \text{regularization term}$
• P: product of matrices that approximates M
• Content is incorporated in $F_C$ using:
  – cosine similarity
  – normalized Laplacian matrix
• Regularization term
  – Improves robustness
  – Prevents overfitting

Goal & Solution
• Goal: find an optimal representation of the latent vectors. This is an optimization problem.
• The squared Frobenius norm $\|\cdot\|_F^2$ measures the discrepancy between matrices.
• When $F_L$ and $F_C$ are convex functions, the minimization problem is solved using conjugate gradient or quasi-Newton methods.
• Then $F_L$ and $F_C$ are combined into one objective function, which is usually convex too.
• Obtain high-quality communities
  – using traditional clustering or classification methods such as k-means or SVMs

The PCA Algorithm
• State-of-the-art method for this model:
  – PCA (similar to LSI).
Optimization problem: $\min \|M - ZU^T\|_F^2 + \gamma \|U\|$
  – Z: [n × l] matrix with each row being the l-dimensional feature vector of a node,
  – $U^T$: [l × n] matrix, and
  – $\|\cdot\|_F^2$: the squared Frobenius norm.
Goal: approximate M by $ZU^T$, a product of two low-rank matrices, with a regularization on U.

Edge-Induced Matrix Factorization (EIMF)
• Partition the edge set into k communities based on both linkage and content.
  – Edges: latent vector space based on link structure.
  – Content is incorporated into edges, so that the latent vectors of edges with similar content are clustered together.
• Two objective functions
  – Linkage-based connectivity/density, captured by $O_l$
  – Content-based similarity among the messages, captured by $O_c$

$O_l$: link structure for any vertex and its incident edges
• Approximate the link matrix Γ: [m × n]
  $O_l(E) = \|E^T V - \Gamma\|_F^2$  or  $O_l(E) = \|E^T E \Delta - \Gamma\|_F^2$
  Example with m = 5 edges and n = 6 vertices:

        v1  v2  v3  v4  v5  v6
    e1   1   0   0   1   0   0
    e2   0   1   0   1   0   0
    e3   0   0   1   1   0   0
    e4   0   0   0   1   1   0
    e5   0   0   0   0   1   1

  [Figure: E: [k × m] with columns e1 ... e5, and V: [k × n] with columns v1 ... v6; rows k1, k2]

$O_c$: links incorporating edge content
• Each edge has content associated with it.
  – Each document is represented by a d-dimensional feature vector.
    C: [d × m], with columns e1 ... e5 and rows term1, term2, ..., termd
  – Cosine similarity: similarity measure between the feature vectors of two edges.
  – Normalized Laplacian matrix L: used to minimize the content-based objective function.
    $O_c(E) = \min_E \mathrm{tr}(E^T L E)$

To Sum Up
• Two objective functions:
  – Linkage-based connectivity/density. The link structure for any vertex and its incident edges is:
    $O_l(E) = \|E^T V - \Gamma\|_F^2$
  – Content-based similarity among text documents:
    $O_c(E) = \min_E \mathrm{tr}(E^T L E)$
• Goal
  – Minimize the objective function $O(E) = O_l(E) + \lambda \cdot O_c(E)$
• Solution
  – Convex functions => every local minimum is a global minimum => gradient-based optimization
  – Apply k-means to detect the final communities

Experiments
• Characteristics of the datasets
  – Enron Email Dataset
    • # of messages: 200,399
    • # of users: 158
    • # of communities: 53
  – Flickr Social Network Dataset
    • # of users: 4,703
    • # of communities: 15
    • # of images: 26,920

Performance Metrics
• Supervised
  – Precision:
    • The fraction of retrieved docs that are relevant.
    • E.g. high precision: every result on the first page is relevant.
  – Recall:
    • The fraction of relevant docs that are retrieved.
    • E.g. high recall: all the relevant results are retrieved.
  – Pairwise F-measure:
    • A higher value suggests that the clustering is of good quality.
• Average Cluster Purity (ACP)
  – The average percentage of the dominant community in the different clusters.

Evaluation
• Four sets of experiments with other algorithms
  – Link only
    • Newman
    • LDA-Link
  – Content
    • LDA-Word
    • NCUT-Content
  – Link + node content
    • LDA-Link-Word
    • NCUT-Link-Content
  – Link + edge content
    • EIMF-Lap
    • EIMF-LP
• Tuning the balancing parameter λ

Strong/Weak points
• Strong points
  – Incorporation of message content into link connectivity.
  – Detection of overlapping communities.
• Weak points
  – Tested mainly on email datasets (directed communication) and on a dataset with tags, not on a social network (broadcast communication).
  – The experiments do not evaluate it as a unified model.

More Link-Based Algorithms
• Modularity
  Measures the strength of division of a network into modules.
  High modularity => dense inner connections & sparse outer connections.
  Example: $k_1 = k_2 = 1$, $k_3 = k_4 = k_5 = 2$, $M = 2|E| = 8$, $P_{ij} = k_i k_j / M$

Even More Link-based Algorithms
• Betweenness
  Measures a node's centrality in a network. It is the number of shortest paths from all vertices to all others that pass through that node.
• Normalized Cut (Spectral Clustering)
  Uses the eigenvalues of the similarity matrix S of the data points to perform dimensionality reduction before clustering in fewer dimensions.
  It partitions points into two sets based on the eigenvector corresponding to the second-smallest eigenvalue of the normalized Laplacian matrix of S,
  $L = I - D^{-1/2} S D^{-1/2}$, where D is the diagonal matrix with $D_{ii} = \sum_j S_{ij}$.
• Clique-based
• PHITS

Other Algorithms
• Node-Content
  – PLSA
    Probabilistic model. The data are treated as observations that arise from a generative probabilistic process that includes hidden variables. Posterior inference is used to infer the hidden structure.
  – LDA
    Each piece of content is a mixture of various topics.
  – SVM (on content and/or on links-content)
    A vector-based method that finds a decision boundary between two classes.
• Combined Link Structure and Node-Content Analysis
  – NCUT-Link-Content
  – LDA-Link-Content

Other Community Detection Models
• Discriminative
  – Given a value vector c that the model aims to predict, and a vector x that contains the values of the input features, the goal is to find the conditional distribution p(c | x).
  – p(c | x) is described by a parametric model.
  – Maximum Likelihood Estimation is used to find optimal values of the model parameters.
  – State-of-the-art approach: PLSI
• Generative
  – Given some hidden parameters, the model randomly generates data. The goal is to find the joint probability distribution p(x, c).
  – The conditional probability p(c | x) can be estimated through the joint distribution p(x, c),
    e.g. $P(c, u, z, \omega) = P(\omega | u) \, P(u | z) \, P(z | c) \, P(c)$
  – State-of-the-art approach: LDA

Bayesian Models
1. Estimate prior distributions for model parameters (e.g. Dirichlet distribution with Gamma function, Beta distribution).
2. Estimate the joint probability of the complete data.
3. A Bayesian inference framework is used to maximize the posterior probability.
4. The problem is intractable, thus optimization is necessary.
Apply a Gibbs sampling approach for parameter estimation, to compute the conditional probability.
1. Compute statistics with initial assignments.
2. For each iteration and for each node:
   a. Estimate the objective function.
   b. Sample the community assignment for node i according to the above distribution.
   c. Update the statistics.

Additional Evaluation Metrics
• Normalized Mutual Information (NMI)
  – Measures the information shared between the detected clusters and the ground-truth communities, normalized to [0, 1].
• Modularity
• NCUT
• Perplexity
  – A metric for evaluating language models (topic models).
  – A higher value of perplexity implies a lower model likelihood and hence lower generative power of the model.

Comparative Analysis
• Three models
  – MF, Discriminative (D), Generative (G)
• Parameter estimation
  – Minimization of an objective function (MF)
    • Frobenius norm, cosine similarity, normalized Laplacian, quasi-Newton
  – EM & MLE (D)
  – Gibbs sampling (entropy-based, blocked) (G)
• Metrics
  – PWF, ACP (MF)
  – NMI, PWF, Modularity (D)
  – NMI, Modularity, Perplexity, running time, # of iterations (G)
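
Sketch: computing PWF and ACP

The two metrics listed for the MF model, pairwise F-measure (PWF) and Average Cluster Purity (ACP), can be illustrated with a few lines of plain Python. This is a minimal sketch under stated assumptions, not the evaluation code used in the paper: the function names and the toy labels are hypothetical, each item is assumed to have exactly one predicted cluster and one ground-truth community, and ACP is assumed to be an unweighted mean over clusters.

```python
from collections import Counter
from itertools import combinations


def pairwise_f_measure(predicted, truth):
    """Pairwise precision, recall and F-measure over all item pairs.

    A pair is "retrieved" when both items share a predicted cluster and
    "relevant" when both items share a ground-truth community.
    """
    same_pred = same_true = same_both = 0
    for i, j in combinations(range(len(truth)), 2):
        in_pred = predicted[i] == predicted[j]
        in_true = truth[i] == truth[j]
        same_pred += in_pred
        same_true += in_true
        same_both += in_pred and in_true
    precision = same_both / same_pred if same_pred else 0.0
    recall = same_both / same_true if same_true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def average_cluster_purity(predicted, truth):
    """Unweighted mean, over clusters, of the dominant community's share."""
    members = {}
    for cluster, community in zip(predicted, truth):
        members.setdefault(cluster, []).append(community)
    purities = [
        Counter(labels).most_common(1)[0][1] / len(labels)
        for labels in members.values()
    ]
    return sum(purities) / len(purities)


# Hypothetical toy labels: six items, two communities, two predicted clusters.
truth = ["a", "a", "a", "b", "b", "b"]
predicted = [0, 0, 1, 1, 1, 1]

precision, recall, f1 = pairwise_f_measure(predicted, truth)
acp = average_cluster_purity(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} PWF={f1:.3f} ACP={acp:.3f}")
```

In the toy example one item of community "a" is placed in the wrong cluster, which lowers both the pairwise scores and the purity of the second cluster. Weighting each cluster's purity by its size would be an equally plausible reading of ACP and would give a different value.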