Community Detection with Edge Content in Social Media Networks

Paper presented by
Konstantinos Giannakopoulos
Outline
• Definitions
– Social Networks & Big Data
– Community Detection
• The framework of Matrix Factorization algorithms.
– Steps, Goals, Solution
– The PCA approach
• The EIMF algorithm
– Description, Performance Metrics, Evaluation
• Other Approaches
– Algorithms, Models, Metrics
From Social Networks to Big Data
Network
Social Network
BIG DATA
Social Networks
• Users act (conversations, like, share)
• Users are connected
Community Detection
Density of Links
Links and Content
Some of the less strongly linked vertices
may belong to the same community
if they share similar content
General Methodology of MF models
• Decomposition of a matrix into a product of
matrices.
• M: A matrix representation of the social network.
$M_{[m \times n]} = A_{[m \times k]} \cdot B_{[k \times n]}$
• Product of two low-rank matrices
• k-dimensional feature vector
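One way to obtain such a pair of factors is a truncated SVD. The sketch below is only an illustration: the toy matrix and the function name `low_rank_factors` are hypothetical and not part of the presented paper. When the rank of M exceeds k, the product A · B is an approximation of M rather than an exact decomposition.

```python
import numpy as np

def low_rank_factors(M, k):
    """Return A [m x k] and B [k x n] whose product is the best rank-k
    approximation of M in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = U[:, :k] * s[:k]      # first k left singular vectors, scaled
    B = Vt[:k, :]             # first k right singular vectors
    return A, B

# Hypothetical toy network matrix: two groups of densely linked nodes.
M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

A, B = low_rank_factors(M, k=2)
error = np.linalg.norm(M - A @ B, "fro")   # close to zero: M has rank 2
```

Each row of A is then a k-dimensional feature vector for the corresponding row of M.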
What’s next?
• Two sub-models:
– Link matrix factorization FL
– Content matrix factorization FC
Each matrix contains a k-dimensional feature vector.
$F = \min \|M - P\|_F^2 + \text{regularization term}$
• P: Product of matrices that approximates M
• Content is incorporated in FC using:
– cosine similarity
– normalized Laplacian matrix
• Regularization term
– Improves robustness
– Prevents overfitting
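A minimal sketch of minimizing such an objective with P = A · B. Two assumptions are made here that the slides do not state: the regularization term is taken to be γ(||A||²F + ||B||²F), and plain gradient descent is used instead of conjugate gradient or quasi-Newton solvers. The names `objective` and `factorize` are illustrative only.

```python
import numpy as np

def objective(M, A, B, gamma):
    """||M - A B||_F^2 plus an assumed L2 penalty on both factors."""
    return (np.linalg.norm(M - A @ B, "fro") ** 2
            + gamma * (np.linalg.norm(A, "fro") ** 2
                       + np.linalg.norm(B, "fro") ** 2))

def factorize(M, k, gamma=0.1, lr=0.01, steps=2000, seed=0):
    """Plain gradient descent on the regularized objective (a sketch)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    A = 0.1 * rng.standard_normal((m, k))
    B = 0.1 * rng.standard_normal((k, n))
    history = []
    for _ in range(steps):
        R = A @ B - M                          # residual P - M
        grad_A = 2 * R @ B.T + 2 * gamma * A
        grad_B = 2 * A.T @ R + 2 * gamma * B
        A, B = A - lr * grad_A, B - lr * grad_B
        history.append(objective(M, A, B, gamma))
    return A, B, history

# Hypothetical toy matrix with two blocks.
M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
A, B, history = factorize(M, k=2)
```

The penalty shrinks the factors slightly, so A · B does not reproduce M exactly; a larger γ trades reconstruction accuracy for smaller factors.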
Goal & Solution
• Goal: Find an optimal representation of the
latent vectors. This is an optimization problem.
• The squared Frobenius norm $\|\cdot\|_F^2$ measures the discrepancy
between matrices.
• When FL and FC are convex functions, the minimization
problem is solved using
– conjugate gradient or quasi-Newton methods.
• Then, FL and FC are combined into one objective
function that is usually convex too.
• Obtain high-quality communities.
– Apply traditional methods such as k-means or SVMs to the latent vectors
The PCA Algorithm
• State-of-the-art method for this model:
– PCA (similar to LSI). Optimization problem:
$\min \|M - Z U^T\|_F^2 + \gamma \|U\|$
– Z: [n × l] matrix with each row being the l-dim
feature vector of a node,
– U^T: [l × n] matrix, and
– $\|\cdot\|_F^2$: the squared Frobenius norm.
Goal: Approximate M by $Z U^T$, a product of two low-rank matrices, with a regularization on U.
Edge-Induced Matrix-Factorization
(EIMF)
• The partitioning of the edge set into k
communities which are based both on their
linkages and content.
– Edges : latent vector space based on link structure.
– Content is incorporated into edges, so that the latent
vectors of the edges with similar content are clustered
together.
• Two Objective Functions
– Linkage-based connectivity/density, captured by Ol
– Content-based similarity among the messages, OC
Ol: link structure for any vertex and its incident
edges
• Approximate the link matrix Γ: [m x n]
$O_l(E) = \|E^T V - \Gamma\|_F^2$
or
$O_l(E) = \|E^T E \Delta - \Gamma\|_F^2$
Example Γ (rows: edges e1 to e5, columns: vertices v1 to v6):
     v1  v2  v3  v4  v5  v6
e1    1   0   0   1   0   0
e2    0   1   0   1   0   0
e3    0   0   1   1   0   0
e4    0   0   0   1   1   0
e5    0   0   0   0   1   1
E: [k x m], columns e1 ... e5 (edges), rows k1, k2 (latent dimensions)
V: [k x n], columns v1 ... v6 (vertices), rows k1, k2 (latent dimensions)
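The link objective can be evaluated directly on the example graph. In the sketch below the edge list is read off the 0/1 matrix above, while the definition of ∆ is an assumption: each column of Γ is divided by the vertex degree, so that a vertex vector is the mean of its incident edge vectors. The random E is only a hypothetical starting point.

```python
import numpy as np

# Edges from the example matrix: e1=(v1,v4), e2=(v2,v4), e3=(v3,v4),
# e4=(v4,v5), e5=(v5,v6); vertices are 0-indexed here.
edges = [(0, 3), (1, 3), (2, 3), (3, 4), (4, 5)]
m, n, k = len(edges), 6, 2

# Gamma [m x n]: entry (e, v) is 1 when vertex v is an endpoint of edge e.
Gamma = np.zeros((m, n))
for e, (u, v) in enumerate(edges):
    Gamma[e, u] = Gamma[e, v] = 1.0

# Assumption: Delta averages the latent vectors of a vertex's incident
# edges, i.e. every column of Gamma is divided by the vertex degree.
Delta = Gamma / Gamma.sum(axis=0, keepdims=True)

def link_objective(E):
    """O_l(E) = ||E^T E Delta - Gamma||_F^2 for E of shape [k x m]."""
    V = E @ Delta                      # vertex latent vectors, [k x n]
    return np.linalg.norm(E.T @ V - Gamma, "fro") ** 2

rng = np.random.default_rng(0)
E = rng.standard_normal((k, m))        # hypothetical random starting point
value = link_objective(E)
```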
Oc: link incorporating edge content
• Each edge has content associated with it.
– Each document is represented with a d-dim feature vector.
C: [d x m], columns e1 ... e5 (edges), rows term1 ... termd
– Cosine similarity: similarity measure of two corresponding
feature vectors $c_i$, $c_j$ (columns of C): $\cos(c_i, c_j) = \frac{c_i \cdot c_j}{\|c_i\| \, \|c_j\|}$
– Normalized Laplacian matrix L: used to minimize the content-based objective function.
$O_c(E) = \min_E \mathrm{tr}(E^T \cdot L \cdot E)$
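A rough sketch of the content term, under stated assumptions: the similarity graph is built from the cosine similarity of the columns of C, L is taken to be I − D^(-1/2) S D^(-1/2), and the trace is computed as tr(E L E^T) so that the shapes agree with E: [k x m]. The toy term counts and both function names are hypothetical.

```python
import numpy as np

def normalized_laplacian(C):
    """C is [d x m]: one d-dimensional content vector per edge (column)."""
    norms = np.linalg.norm(C, axis=0, keepdims=True)
    unit = C / np.where(norms == 0, 1.0, norms)
    S = unit.T @ unit                     # cosine similarities, [m x m]
    np.fill_diagonal(S, 0.0)              # ignore self-similarity
    deg = S.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    return np.eye(len(deg)) - inv_sqrt[:, None] * S * inv_sqrt[None, :]

def content_objective(E, L):
    """Trace form of O_c for E of shape [k x m] (assumed ordering)."""
    return np.trace(E @ L @ E.T)

# Hypothetical term counts: edges 0,1 share terms, edges 2,3 share a term.
C = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 3]], dtype=float)
L = normalized_laplacian(C)

E_grouped = np.array([[1.0, 1.0, 0.0, 0.0],    # follows the content groups
                      [0.0, 0.0, 1.0, 1.0]])
E_mixed = np.array([[1.0, 0.0, 1.0, 0.0],      # cuts across the groups
                    [0.0, 1.0, 0.0, 1.0]])
low = content_objective(E_grouped, L)
high = content_objective(E_mixed, L)
```

Latent vectors that agree for edges with similar content give a small trace, which is what the minimization rewards.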
To Sum Up
• Two Objective Functions:
– Linkage-based connectivity/density. The link structure for any
vertex and its incident edges is:
$O_l(E) = \|E^T V - \Gamma\|_F^2$
– Content-based similarity among text documents.
$O_c(E) = \min_E \mathrm{tr}(E^T \cdot L \cdot E)$
• Goal
– Minimize the objective function
$O(E) = O_l(E) + \lambda \cdot O_c(E)$
• Solution
– Convex functions => every local minimum is global => gradient-based methods
– Apply k-means for the detection of final communities
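The final k-means step might look like the sketch below, which clusters the columns of a learned E. The matrix E shown is a hypothetical result, and the farthest-point seeding is a choice made here only to keep the example deterministic; the slides do not specify an initialization.

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Lloyd's algorithm on the rows of X with farthest-point seeding."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical learned E [k x m]; each column is one edge's latent vector.
E = np.array([[1.0, 0.9, 1.1, 0.0, 0.1],
              [0.0, 0.1, 0.0, 1.0, 0.9]])
labels = kmeans(E.T, k=2)     # one community label per edge
```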
Experiments
• Characteristics of the Datasets
– Enron Email Dataset
• # of messages: 200,399
• # of users: 158
• # of communities: 53
– Flickr Social Network Dataset
• # of users: 4,703
• # of communities: 15
• # of images: 26,920
Performance Metrics
• Supervised
– Precision:
• The fraction of retrieved docs that are relevant.
• e.g. high precision: every result on the first page is relevant.
– Recall:
• The fraction of relevant docs that are retrieved.
• e.g. high recall: all the relevant results are retrieved.
– Pairwise F-measure:
• A higher value suggests that
the clustering is of good quality.
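For a clustering, these measures are commonly computed over pairs of items: a pair counts as "retrieved" when both items fall in the same predicted cluster and as "relevant" when both share a ground-truth community, and the F-measure is the harmonic mean of the two. The sketch below follows that common definition, which may differ in detail from the one used in the paper; the labels are hypothetical.

```python
from itertools import combinations

def pairwise_scores(predicted, truth):
    """Pairwise precision, recall and F-measure of a clustering."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f

# Hypothetical labels for six items.
truth     = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
precision, recall, f_measure = pairwise_scores(predicted, truth)
# precision = 4/7, recall = 2/3, f_measure = 8/13
```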
Performance Metrics
• Average Cluster Purity (ACP)
– The average percentage of the dominant
community in the different clusters.
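A small sketch of ACP as described above, assuming an unweighted mean over the clusters (the slides do not say whether clusters are weighted by size). The labels are hypothetical.

```python
from collections import Counter

def average_cluster_purity(predicted, truth):
    """Mean, over predicted clusters, of the share held by the dominant
    ground-truth community in that cluster."""
    clusters = {}
    for p, t in zip(predicted, truth):
        clusters.setdefault(p, []).append(t)
    purities = [Counter(members).most_common(1)[0][1] / len(members)
                for members in clusters.values()]
    return sum(purities) / len(purities)

truth     = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
acp = average_cluster_purity(predicted, truth)   # (1.0 + 0.75) / 2 = 0.875
```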
Evaluation
• Four sets of experiments with other algorithms
– Link only
• Newman
• LDA-Link
– Content
• LDA-Word
• NCUT-Content
– Link + node content
• LDA-Link-Word
• NCUT-Link-Content
– Link + edge content
• EIMF-Lap
• EIMF-LP
• Tuning the balancing parameter λ
Strong/Weak points
• Strong Points
– Incorporation of message content into link
connectivity.
– Detection of overlapping communities.
• Weak Points
– Tested mainly on email datasets (directed
communication) and on a dataset with tags. Not on
a social network (broadcast communication).
– Experiments do not see it as a unified model.
More Link-Based Algorithms
• Modularity
Measures the strength of division of a network into
modules.
High modularity => dense inner connections &
sparse outer connections.
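Example: $k_1 = k_2 = 1$, $k_3 = k_4 = k_5 = 2$, $M = 2|E| = 8$,
$P_{ij} = \frac{k_i k_j}{M}$
where $k_i$ is the degree of vertex i and $P_{ij}$ is the expected number of edges between i and j.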
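Modularity can be computed from these quantities as Q = (1/M) Σ_ij (A_ij − P_ij) δ(c_i, c_j). The graph drawn on the slide is not recoverable here, so the sketch below uses a hypothetical path graph whose degrees match the example (two vertices of degree 1, three of degree 2); the community assignment is likewise an assumption.

```python
def modularity(edges, community):
    """Q = (1/M) * sum_ij (A_ij - k_i k_j / M) * [c_i == c_j], M = 2|E|.
    Assumes a simple undirected graph (no self-loops, no repeated edges)."""
    nodes = sorted(community)
    degree = {v: 0 for v in nodes}
    adjacent = set()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacent.add((u, v))
        adjacent.add((v, u))
    M = 2 * len(edges)
    q = 0.0
    for i in nodes:
        for j in nodes:
            if community[i] == community[j]:
                a_ij = 1.0 if (i, j) in adjacent else 0.0
                q += a_ij - degree[i] * degree[j] / M
    return q / M, degree, M

# Hypothetical path graph 1 - 3 - 4 - 5 - 2 with matching degrees.
edges = [(1, 3), (3, 4), (4, 5), (5, 2)]
community = {1: "a", 3: "a", 4: "b", 5: "b", 2: "b"}
Q, degree, M = modularity(edges, community)   # Q = 0.21875
```

Placing all vertices in a single community gives Q = 0, which is why a positive Q indicates more internal edges than the degree-based expectation.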
Even More Link-based Algorithms
• Betweenness
Measures a node’s centrality in a network. It is the number of
shortest paths from all vertices to all others that pass through that
node.
• Normalized Cut (Spectral Clustering)
Uses the eigenvalues of the similarity matrix S of the data points to
perform dimensionality reduction before clustering in fewer
dimensions.
It partitions the points into two sets based on the eigenvector
corresponding to the second-smallest eigenvalue of the
normalized Laplacian matrix $L = I - D^{-1/2} S D^{-1/2}$
of S, where D is the diagonal matrix with $D_{ii} = \sum_j S_{ij}$.
• Clique-based
• PHITS
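The two-way split described under Normalized Cut can be sketched as follows. The similarity matrix is a hypothetical example (two triangles joined by one weak link), and splitting by the sign of the eigenvector entries is one simple thresholding choice among several.

```python
import numpy as np

def spectral_bipartition(S):
    """Split nodes by the sign of the eigenvector that belongs to the
    second-smallest eigenvalue of L = I - D^{-1/2} S D^{-1/2}."""
    d = S.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(d)          # assumes every node has degree > 0
    L = np.eye(len(d)) - inv_sqrt[:, None] * S * inv_sqrt[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(L)   # ascending order
    return eigenvectors[:, 1] >= 0

# Hypothetical similarity matrix: triangles {0,1,2} and {3,4,5},
# connected by a weak link between nodes 2 and 3.
S = np.array([[0, 1, 1, 0,   0, 0],
              [1, 0, 1, 0,   0, 0],
              [1, 1, 0, 0.1, 0, 0],
              [0, 0, 0.1, 0, 1, 1],
              [0, 0, 0,   1, 0, 1],
              [0, 0, 0,   1, 1, 0]], dtype=float)
side = spectral_bipartition(S)   # boolean side per node
```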
Other Algorithms
• Node-Content
– PLSA
Probabilistic model. Data is observations that arise from a
generative probabilistic process that includes hidden
variables. Posterior inference to infer the hidden structure.
– LDA
Each document is a mixture of various topics.
– SVM (on content and/or on links-content)
A vector-based method that finds a decision boundary
between two classes.
• Combined Link Structure and Node-Content Analysis
– NCUT-Link-Content
– LDA-Link-Content
Other Community Detection Models
• Discriminative
– Given a value vector c that the model aims to
predict, and a vector x that contains the values of
the input features, the goal is to find the
conditional distribution p(c | x).
– p(c | x) is described by a parametric model.
– Usage of Maximum Likelihood Estimation
technique for finding optimal values of the model
parameters.
– State-of-the-art approach: PLSI
• Generative
– Given some hidden parameters, it randomly
generates data. The goal is to find the joint
probability distribution p(x, c).
– the conditional probability p(c|x) can be
estimated through the joint distribution p(x, c).
e.g. $P(c, u, z, \omega) = P(\omega \mid u) \, P(u \mid z) \, P(z \mid c) \, P(c)$
– State-of-the-art approach: LDA
Bayesian Models
1. Estimate prior distributions for model parameters (e.g. Dirichlet
distribution with Gamma function, Beta distribution).
2. Estimate the joint probability of the complete data.
3. A Bayesian inference framework is used to maximize the posterior
probability. The problem is intractable, thus optimization is
necessary.
4. Apply a Gibbs sampling approach for parameter estimation, to
compute the conditional probability.
   1. Compute statistics with initial assignments.
   2. For each iteration and for each node:
      a. Estimate the objective function.
      b. Sample the community assignment for node i according to the
      above distribution.
      c. Update the statistics.
Additional Evaluation Metrics
• Normalized Mutual Information NMI
– Measures the information shared between the detected communities and
the ground-truth communities, normalized to [0, 1]; higher is better.
• Modularity
• NCUT
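NMI can be computed from label counts alone. The sketch below normalizes the mutual information by the geometric mean of the two entropies; other normalizations (for example the arithmetic mean) are also in use, so this choice is an assumption rather than the definition used by any particular paper.

```python
from collections import Counter
from math import log, sqrt

def nmi(predicted, truth):
    """Mutual information normalized by sqrt(H(predicted) * H(truth))."""
    n = len(truth)
    count_p, count_t = Counter(predicted), Counter(truth)
    joint = Counter(zip(predicted, truth))
    mi = sum(c / n * log(c * n / (count_p[p] * count_t[t]))
             for (p, t), c in joint.items())
    h_p = -sum(c / n * log(c / n) for c in count_p.values())
    h_t = -sum(c / n * log(c / n) for c in count_t.values())
    return mi / sqrt(h_p * h_t) if h_p > 0 and h_t > 0 else 0.0

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # about 1.0
unrelated = nmi([0, 1, 0, 1], [0, 0, 1, 1])             # 0.0
```

The score does not depend on how the clusters are named, only on how the items are grouped.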
Additional Evaluation Metrics
• Perplexity
– A metric for evaluating language models (topic
models).
– A higher perplexity implies a lower
model likelihood and hence weaker generative
power of the model.
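Perplexity is the exponential of the negative average log-likelihood per token. A minimal sketch, assuming the model has already assigned a probability to every held-out token (the probabilities below are hypothetical):

```python
from math import exp, log

def perplexity(token_probabilities):
    """exp of the negative mean log-likelihood of the held-out tokens."""
    n = len(token_probabilities)
    return exp(-sum(log(p) for p in token_probabilities) / n)

uniform = perplexity([0.25] * 8)   # model spreads mass over 4 options
sharper = perplexity([0.5] * 8)    # model assigns higher likelihood
```

A model that gives every token probability 1/4 has perplexity 4, and one that gives probability 1/2 has perplexity 2, matching the reading that lower perplexity means higher likelihood.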
Comparative Analysis
• Three models
– MF, Discriminative (D), Generative (G)
• Parameter Estimation
– Objective Function min (MF)
• Frobenius norm, cosine similarity, normalized Laplacian, quasi-Newton.
– EM & MLE (D)
– Gibbs Sampling (Entropy-based, Blocked) (G)
• Metrics
– PWF, ACP (MF)
– NMI, PWF, Modularity (D)
– NMI, Modularity, Perplexity, Running Time, # of iterations (G)