SOCIAL NETWORKS and COMMUNITY DETECTION
“Networks” is a pervasive term:
• Networked Economy
• Ego Networks
• Networking
• Regional Networks
• Social Networks
• National Innovation Networks
• Entrepreneurial Networks
• Immigrant Networks
• Infrastructure Networks
Social Networks
• What is Social Network Analysis?
Network analysis is the keyword for the 21st century.
Researchers, politicians, and the general public all talk about social networks.
What is a Network?
(Web definition) A set of nodes, points, or locations connected by means of data,
voice, and video communications for the purpose of exchange.
Real World Web Networks
• Internet
• World Wide Web
• Citation Networks
• Transportation Networks
• Food Webs
• Social Networks
• Biochemical Networks
Social Networks
A social network is a
description of the social
structure between actors,
mostly individuals or
organizations. It indicates
the ways in which they
are connected through
various social familiarities
ranging from casual
acquaintance to close
familiar bonds.
A paleo-social network
Marriage ties among Renaissance Florentine families
Social Network Analysis
• Social network analysis [SNA] is the mapping and measuring of
relationships and flows between people, groups, organizations,
computers or other information/knowledge processing entities. It
also includes community detection.
• The nodes in the network are the people and groups while the links
(ties) show relationships or flows between the nodes.
• Community detection is discovering groups in the network where
individuals’ group memberships are not explicitly given
Social Network Data
The unit of interest in a network is the combined
set of actors and their relations.
We represent actors with points and relations with
lines.
Actors are referred to variously as:
Nodes, vertices or points
Relations are referred to variously as:
Edges, Arcs, Lines, Ties
Example:
[Figure: a small graph with five actors, a to e]
SN = graph
• A network can then be represented as a
graph data structure
• We can apply a variety of measures and
analysis to the graph representing a
given SN
• Ties in a SN can be directed or
undirected (e.g. friendship, coauthorship are usually undirected,
emails are directed)
Social Network
Basic Data Structures
From graphs to matrices
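[Figure: the five-node example graph (a to e), drawn once with undirected ties and once with directed ties]
Undirected, binary (0,1)
    a b c d e
a   0 1 0 0 0
b   1 0 1 0 0
c   0 1 0 1 1
d   0 0 1 0 1
e   0 0 1 1 0
Directed, binary
[Matrix: same layout; cell (i, j) is 1 when there is an arc from i to j, so the matrix need not be symmetric]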
Social Network
Basic Data Structures
From matrices to lists
    a b c d e
a   0 1 0 0 0
b   1 0 1 0 0
c   0 1 0 1 1
d   0 0 1 0 1
e   0 0 1 1 0
Adjacency List
a: b
b: a c
c: b d e
d: c e
e: c d
Arc List
a b
b a
b c
c b
c d
c e
d c
d e
e c
e d
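The conversion from a matrix to the two list formats can be sketched in a few lines of Python. This is only an illustration: the function names `to_adjacency_list` and `to_arc_list` are made up for this example, and the matrix is the undirected five-node graph whose adjacency list is given above.

```python
# Adjacency matrix of the undirected example graph (rows and columns a..e).
nodes = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 1, 0, 0, 0],  # a
    [1, 0, 1, 0, 0],  # b
    [0, 1, 0, 1, 1],  # c
    [0, 0, 1, 0, 1],  # d
    [0, 0, 1, 1, 0],  # e
]

def to_adjacency_list(nodes, matrix):
    """Map each node to the list of nodes it has a tie to."""
    return {
        nodes[i]: [nodes[j] for j, cell in enumerate(row) if cell]
        for i, row in enumerate(matrix)
    }

def to_arc_list(nodes, matrix):
    """List one (sender, receiver) pair per nonzero cell."""
    return [
        (nodes[i], nodes[j])
        for i, row in enumerate(matrix)
        for j, cell in enumerate(row)
        if cell
    ]

adjacency_list = to_adjacency_list(nodes, matrix)
arc_list = to_arc_list(nodes, matrix)
```

For an undirected graph every tie appears twice in the arc list (once per direction), which is why 5 ties give 10 arcs.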
Social Network as a graph
In general, a relation can be:
Binary or Valued
Directed or Undirected
[Figure: the example graph on nodes a to e drawn four ways: undirected binary; directed binary; undirected valued; directed valued (tie values shown on the lines)]
Social Network
• Measuring the flow of information
– Topology
• Connectivity
• Centrality
– Time
• Structure & Social Space
Measuring Networks: Flow
In addition to the simple probability that one actor
passes information on to another (pij), two factors affect
flow through a network:
Topology
-the shape, or form, of the network
- Example: one actor cannot pass information to
another unless they are either directly or indirectly
connected
Time
- the timing of contact matters
- Example: an actor cannot pass information he has
not received yet
Measuring Networks: Topology
Two features of the network’s topology are known to be
important: connectivity and centrality
Connectivity refers to how actors in one part of the network
are connected to actors in another part of the network.
• Reachability: Is it possible for actor i to reach actor j?
This can only be true if there is a chain of contact from
one actor to another.
• Distance: Given they can be reached, how many
steps are they from each other?
• Number of paths: How many different paths connect
each pair?
Measuring Networks: Connectivity
Without full network data, you can’t distinguish actors with
limited information potential from those more deeply
embedded in a setting.
Measuring Networks: Connectivity
Reachability
Indirect connections are what make networks systems.
One actor can reach another if there is a path in the
graph connecting them.
[Figure: example graphs on nodes a to f]
Paths can be directed, leading to a distinction between
“strong” and “weak” components
Measuring Networks: Connectivity
Reachability
If you can trace a sequence of relations from one
actor to another, then the two are reachable. If there
is at least one path connecting every pair of actors in
the graph, the graph is connected and is called a
component.
Intuitively, a component is the set of people who
are all connected by a chain of relations.
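A minimal sketch of how components can be found in an undirected network, using breadth-first search. The network `adj` below is hypothetical (it is not one of the figures), and `components` is a name chosen for this illustration.

```python
from collections import deque

def components(adj):
    """Return the connected components of an undirected graph.

    adj maps each node to the list of its neighbours.
    """
    seen = set()
    result = []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        seen.add(start)
        queue = deque([start])
        comp = set()
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        result.append(comp)
    return result

# Hypothetical network: a-b-c form a chain, d-e a pair, f is isolated.
adj = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],
    "d": ["e"], "e": ["d"],
    "f": [],
}
comps = components(adj)
```

Two actors are reachable from each other exactly when they fall in the same set in `comps`.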
Measuring Networks: Connectivity
Reachability
This example contains many components.
Measuring Networks: Connectivity
Reachability
In general, components can be directed or undirected.
For a graph with any directed edges, there are two types of
components:
Strong components consist of the set(s) of all nodes that
are mutually reachable
Weak components consist of the set(s) of all nodes that are
connected when the direction of the ties is ignored.
Measuring Networks: Connectivity
Reachability
There are only 2 strong components with more than 1 person in this network.
Measuring Networks: Connectivity
Distance & number of paths
Distance is measured by the (weighted) number of relations
separating a pair:
Actor “a” is:
1 step from 4 actors
2 steps from 5 actors
3 steps from 4 actors
4 steps from 3 actors
5 steps from 1 actor
Measuring Networks: Flow
Distance & number of paths
Paths are the different routes one can take. Node-independent paths are particularly important.
There are 2 independent paths connecting a and b.
There are many non-independent paths.
Measuring Networks: Connectivity
Distance & number of paths
Probability of transfer by distance and number of paths, assuming a constant pij of 0.6
[Figure: probability (y-axis, 0 to 1.2) against path distance (x-axis, 2 to 6), one curve each for 1, 2, 5, and 10 paths]
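One simple way to reason about these curves is sketched below. It rests on assumptions not stated on the slide: every tie transmits independently with the same probability p, all paths have the same length, and the paths are node-independent. Under those assumptions a single path of length d succeeds with probability p^d, and at least one of k paths succeeds with probability 1 - (1 - p^d)^k. The function name is hypothetical.

```python
def transfer_probability(p, distance, n_paths):
    """Chance that information gets through at least one path.

    Assumes independent ties with constant probability p and
    n_paths node-independent paths, each of length `distance`.
    """
    single_path = p ** distance
    return 1 - (1 - single_path) ** n_paths

# With p = 0.6, compare 1, 2, 5 and 10 paths at distances 2..6.
table = {
    k: [transfer_probability(0.6, d, k) for d in range(2, 7)]
    for k in (1, 2, 5, 10)
}
```

The sketch shows the two qualitative effects: probability drops quickly with distance and rises with the number of independent paths.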
Measuring Networks: Centrality
Centrality
Centrality refers to (one dimension of) location, identifying
where an actor resides in a network.
• For example, we can compare actors at the edge of
the network to actors at the center.
• In general, this is a way to formalize intuitive notions
about the distinction between insiders and outsiders.
Measuring Networks: Centrality
Conceptually, centrality is fairly straightforward: we
want to identify which nodes are in the ‘center’ of the
network. In practice, identifying exactly what we mean
by ‘center’ is somewhat complicated.
Three standard centrality measures capture a wide
range of “importance” in a network:
•Degree
•Closeness
•Betweenness
Measuring Networks: Centrality
Centrality Degree
The most intuitive notion of centrality focuses on
degree. Degree is the number of ties, and the actor
with the most ties is the most important:
$$C_D(n_i) = d(n_i) = X_{i+} = \sum_j X_{ij}$$
Measuring Networks: Centrality
If we want to measure the degree to which the graph as a
whole is centralized, we look at the dispersion of
centrality:
Simple: variance of the individual centrality scores.
$$S_D^2 = \left[\sum_{i=1}^{g} \left(C_D(n_i) - \bar{C}_D\right)^2\right] / g$$
Or, using Freeman’s general formula for centralization:
$$C_D = \frac{\sum_{i=1}^{g} \left[C_D(n^*) - C_D(n_i)\right]}{(g-1)(g-2)}$$
$C_D(n^*)$ is the largest degree centrality attained in the network,
so we are measuring the dispersion around that value.
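As a rough sketch, the degree, its variance, and Freeman's centralization can be computed directly from an adjacency matrix. The helper names (`degree_centrality`, `degree_variance`, `freeman_centralization`, `star`) are invented for this example; the test case is a star with 8 nodes, for which Freeman's score is 42/42 = 1.

```python
def degree_centrality(matrix):
    """Row sums of a binary adjacency matrix: C_D(n_i) = sum_j X_ij."""
    return [sum(row) for row in matrix]

def degree_variance(degrees):
    """Variance of the individual degree scores (divide by g)."""
    g = len(degrees)
    mean = sum(degrees) / g
    return sum((d - mean) ** 2 for d in degrees) / g

def freeman_centralization(degrees):
    """Sum of gaps to the largest degree, over (g-1)(g-2)."""
    g = len(degrees)
    top = max(degrees)
    return sum(top - d for d in degrees) / ((g - 1) * (g - 2))

def star(g):
    """Adjacency matrix of a star: node 0 is tied to everyone else."""
    m = [[0] * g for _ in range(g)]
    for j in range(1, g):
        m[0][j] = m[j][0] = 1
    return m

degrees = degree_centrality(star(8))
freeman = freeman_centralization(degrees)
variance = degree_variance(degrees)
```

For the 8-node star this gives degrees 7, 1, 1, ..., a Freeman score of 1.0 and a variance of about 3.9.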
Measuring Networks: Centrality
Degree Centralization Scores
Freeman: $\frac{(7-1) \times 7 + (7-7)}{(8-1)(8-2)} = \frac{42}{42} = 1$
Freeman: 1.0
Variance: 3.9
Freeman: .02
Variance: .17
Freeman: .07
Variance: .20
Freeman: 0.0
Variance: 0.0
Measuring Networks: Centrality
Degree Centralization Scores
Freeman: 0.1
Variance: 4.84
Measuring Networks: Centrality
A second measure of centrality is closeness centrality. An
actor is considered important if he/she is relatively close to all
other actors.
Closeness is based on the inverse of the distance of each
actor to every other actor in the network.
Closeness Centrality:
$$C_C(n_i) = \left[\sum_{j=1}^{g} d(n_i, n_j)\right]^{-1}$$
Normalized Closeness Centrality
$$C'_C(n_i) = \left(C_C(n_i)\right)(g-1)$$
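A small sketch of the closeness computation for an unweighted, connected network, using breadth-first search for the distances. The function names and the two test graphs (an 8-node star and a 7-node line) are chosen here for illustration; the code assumes every node can reach every other node.

```python
from collections import deque

def distances_from(adj, source):
    """Geodesic distance from source to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def closeness(adj, node, normalized=False):
    """Inverse of the summed distances; times (g - 1) if normalized.

    Assumes the graph is connected, so all distances are defined.
    """
    total = sum(distances_from(adj, node).values())
    score = 1 / total
    return score * (len(adj) - 1) if normalized else score

# Star with 8 nodes (node 0 is the hub) and a line of 7 nodes.
star = {0: list(range(1, 8)), **{i: [0] for i in range(1, 8)}}
line = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
```

In the star, the hub scores 1/7 (about .143, normalized 1.00) and each spoke 1/13 (about .077, normalized .538).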
Measuring Networks: Centrality
Closeness Centrality in the examples
Distance             Closeness   normalized
0 1 1 1 1 1 1 1      .143        1.00
1 0 2 2 2 2 2 2      .077        .538
1 2 0 2 2 2 2 2      .077        .538
1 2 2 0 2 2 2 2      .077        .538
1 2 2 2 0 2 2 2      .077        .538
1 2 2 2 2 0 2 2      .077        .538
1 2 2 2 2 2 0 2      .077        .538
1 2 2 2 2 2 2 0      .077        .538

Distance             Closeness   normalized
0 1 2 3 4 4 3 2 1    .050        .400
1 0 1 2 3 4 4 3 2    .050        .400
2 1 0 1 2 3 4 4 3    .050        .400
3 2 1 0 1 2 3 4 4    .050        .400
4 3 2 1 0 1 2 3 4    .050        .400
4 4 3 2 1 0 1 2 3    .050        .400
3 4 4 3 2 1 0 1 2    .050        .400
2 3 4 4 3 2 1 0 1    .050        .400
1 2 3 4 4 3 2 1 0    .050        .400
Measuring Networks: Centrality
Closeness Centrality in the examples
Distance Closeness normalized
0 1 2 3 4 5 6 .048 .286
1 0 1 2 3 4 5 .063 .375
2 1 0 1 2 3 4 .077 .462
3 2 1 0 1 2 3 .083 .500
4 3 2 1 0 1 2 .077 .462
5 4 3 2 1 0 1 .063 .375
6 5 4 3 2 1 0 .048 .286
Measuring Networks: Centrality
Closeness Centrality in the example
Distance                     Closeness   normalized
0 1 1 2 3 4 4 5 5 6 5 5 6    .021        .255
1 0 1 1 2 3 3 4 4 5 4 4 5    .027        .324
1 1 0 1 2 3 3 4 4 5 4 4 5    .027        .324
2 1 1 0 1 2 2 3 3 4 3 3 4    .034        .414
3 2 2 1 0 1 1 2 2 3 2 2 3    .042        .500
4 3 3 2 1 0 2 3 3 4 1 1 2    .034        .414
4 3 3 2 1 2 0 1 1 2 3 3 4    .034        .414
5 4 4 3 2 3 1 0 1 1 4 4 5    .027        .324
5 4 4 3 2 3 1 1 0 1 4 4 5    .027        .324
6 5 5 4 3 4 2 1 1 0 5 5 6    .021        .255
5 4 4 3 2 1 3 4 4 5 0 1 1    .027        .324
5 4 4 3 2 1 3 4 4 5 1 0 1    .027        .324
6 5 5 4 3 2 4 5 5 6 1 1 0    .021        .255
Measuring Networks: Centrality
Graph-theoretic center
Identify the set of all vertices A where the greatest distance
d(A,B) to other vertices B is minimal.
Value = longest distance to any other node.
The graph theoretic center is ‘3’
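The definition above can be sketched as follows: compute, for each node, its longest geodesic distance to any other node, then keep the nodes where that value is smallest. The 7-node line used here is a hypothetical example (not the network in the figure), and the function names are made up for this sketch.

```python
from collections import deque

def longest_distance(adj, source):
    """Largest geodesic distance from source to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return max(dist.values())

def graph_center(adj):
    """Nodes whose longest distance to others is minimal, plus that value."""
    value = {v: longest_distance(adj, v) for v in adj}
    best = min(value.values())
    return {v for v, d in value.items() if d == best}, best

# Hypothetical connected example: a line of 7 nodes labelled 0..6.
line = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
center, radius = graph_center(line)
```

On the line, only the middle node is at most 3 steps from everyone, so it is the center.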
Measuring Networks: Centrality
Graph Theoretic Center (Barycenter or Jordan Center).
Measuring Networks: Centrality
Betweenness Centrality:
Model based on communication flow: A person who lies on
communication paths can control communication flow, and
is thus important.
Betweenness centrality counts the number of geodesic
paths between i and k that actor j resides on. Geodesics
are defined as the shortest paths between points.
[Figure: example network with nodes a to h]
Measuring Networks: Centrality
Betweenness Centrality:
[Figure: example networks with nodes a to m illustrating betweenness]
Measuring Networks: Centrality
Betweenness Centrality:
$$C_B(n_i) = \sum_{j<k} g_{jk}(n_i) / g_{jk}$$
Where $g_{jk}$ = the number of geodesics connecting j and k, and
$g_{jk}(n_i)$ = the number of those geodesics that actor i is on.
Usually normalized by:
$$C'_B(n_i) = C_B(n_i) / [(g-1)(g-2)/2]$$
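A compact way to compute this for unweighted, undirected networks is Brandes' accumulation scheme, sketched below. This is one possible implementation, not necessarily what any particular SNA package does; the function names and the two test graphs are chosen for illustration.

```python
from collections import deque

def betweenness(adj):
    """C_B for every node: sum over pairs j<k of g_jk(n_i) / g_jk."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in cb.items()}

def normalized_betweenness(adj):
    """Divide by (g-1)(g-2)/2, the maximum possible value."""
    g = len(adj)
    scale = (g - 1) * (g - 2) / 2
    return {v: c / scale for v, c in betweenness(adj).items()}

# Star with 8 nodes (hub 0) and a line of 7 nodes.
star = {0: list(range(1, 8)), **{i: [0] for i in range(1, 8)}}
line = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
```

In the star the hub lies on all 21 geodesics between spokes, so its normalized score is 1.0 and every spoke scores 0.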
Measuring Networks: Centrality
Betweenness Centrality:
Centralization: 1.0
Centralization: .59
Centralization: .31
Centralization: 0
Measuring Networks: Centrality
Centralization: .183
Measuring Networks: Centrality
Information Centrality:
It is quite likely that information can flow through paths other
than the geodesic. The Information Centrality score uses all
paths in the network, and weights them based on their length.
Measuring Networks: Centrality
Actors that appear very different when seen individually are
comparable in the global network.
(Node size proportional to betweenness centrality)
Measuring Networks: Time
Time
Two factors that affect network flows:
Topology
- the shape, or form, of the network
- simple example: one actor cannot pass information
to another unless they are either directly or indirectly
connected
Time
- the timing of contacts matters
- simple example: an actor cannot pass information
he has not yet received.
Measuring Networks: Time
Timing in networks
A focus on contact structure has often slighted the
importance of network dynamics, though a number of
recent works are addressing this.
Time affects networks in two important ways:
1) The structure itself evolves, in ways that will
affect the topology and thus flow.
2) The timing of contact constrains information flow
Measuring Networks: Time
Drug Relations, Colorado Springs, Year 1
Data on drug users
in Colorado Springs,
over 5 years
Measuring Networks: Time
Drug Relations, Colorado Springs, Year 2
Current year in red, past relations in gray
Measuring Networks: Time
Drug Relations, Colorado Springs, Year 3
Current year in red, past relations in gray
Measuring Networks: Time
Drug Relations, Colorado Springs, Year 4
Current year in red, past relations in gray
Measuring Networks: Time
Drug Relations, Colorado Springs, Year 5
Current year in red, past relations in gray
Measuring Networks: Time
What impact does timing have on flow through the network?
[Figure: hypothetical contact network on nodes A to F; numbers above lines indicate contact periods (for example 2-5, 8-9, 3-5)]
Measuring Networks: Time
The path graph for the hypothetical contact network
[Figure: path graph on nodes A to F]
While clearly important, this is not often handled well by
current SNA software.
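A minimal sketch of time-respecting reachability, under assumptions made only for this illustration: contacts are undirected, each contact is active during a closed period [begin, end], and transmission within an active period is instantaneous. The contact list is hypothetical and is not the network in the figure.

```python
def earliest_arrival(contacts, source, start_time=0):
    """Earliest time information from `source` can reach each actor.

    contacts: list of (u, v, begin, end) tuples. An actor holding the
    information at time t can pass it over a contact only if t <= end;
    the other actor then has it at max(t, begin).
    """
    arrival = {source: start_time}
    changed = True
    while changed:  # relax contacts until no arrival time improves
        changed = False
        for u, v, begin, end in contacts:
            for a, b in ((u, v), (v, u)):
                if a in arrival and arrival[a] <= end:
                    t = max(arrival[a], begin)
                    if t < arrival.get(b, float("inf")):
                        arrival[b] = t
                        changed = True
    return arrival

# Hypothetical contacts: A-B during 2-5, B-C during 8-9, C-D during 3-5.
contacts = [("A", "B", 2, 5), ("B", "C", 8, 9), ("C", "D", 3, 5)]
from_a = earliest_arrival(contacts, "A")
from_d = earliest_arrival(contacts, "D")
```

Although A and D are connected in the static graph, A cannot reach D: the information gets to C at time 8, after the C-D contact has ended.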
Measuring Networks: Structure & Social Space
The second broad division for measuring networks steps
back to generalized features of the global network.
These factors almost always are of interest because of
what they imply about how information moves through
the network, but have resulted in a distinct line of
methods and substantive research.
The study of these generalized features is also known as
community detection (small worlds analysis, etc.)
Community
• Community: It is formed by individuals such that those
within a group interact with each other more frequently
than with those outside the group
– a.k.a. group, cluster, cohesive subgroup, module in different
contexts
• Community detection: discovering groups in a network
where individuals’ group memberships are not explicitly
given
• Why communities in social media?
– Human beings are social
– Easy-to-use social media allows people to extend their social
life in unprecedented ways
– Difficult to meet friends in the physical world, but much easier to
find friends online with similar interests
– Interactions between nodes can help determine communities
Communities in Social Media
• Two types of groups in social media
– Explicit Groups: formed by user subscriptions
– Implicit Groups: implicitly formed by social
interactions
• Some social media sites allow people to join groups. Is it still
necessary to extract groups based on network topology?
– Not all sites provide a community platform
– Not all people want to make the effort to join groups
– Groups can change dynamically
• Network interaction provides rich information about the
relationship between users
– Can complement other kinds of information, e.g. user profile
– Help network visualization and navigation
– Provide basic information for other tasks, e.g. recommendation
Subjectivity of Community Definition
A densely-knit
community
Each component is a
community
Definition of a community
can be subjective.
Taxonomy of Community Criteria
• Criteria vary depending on the tasks
• Roughly, community detection methods can be divided
into 4 categories (not exclusive):
• Node-Centric Community
– Each node in a group satisfies certain properties
• Group-Centric Community
– Consider the connections within a group as a whole. The group
has to satisfy certain properties without zooming into node-level
• Network-Centric Community
– Partition the whole network into several disjoint sets
• Hierarchy-Centric Community
– Construct a hierarchical structure of communities
Node-Centric Community
Detection
• Nodes satisfy different properties
– Complete Mutuality
• cliques
– Reachability of members
• k-clique, k-clan, k-club
– Nodal degrees
• k-plex, k-core
– Relative frequency of Within-Outside Ties
• LS sets, Lambda sets
• Commonly used in traditional social network analysis
• Here, we discuss some representative ones
Complete Mutuality: Cliques
• Clique: a maximal complete subgraph in which
all nodes are adjacent to each other
Nodes 5, 6, 7 and 8 form a clique
• NP-hard to find the maximum clique in a network
• Straightforward implementation to find cliques is
very expensive in time complexity
Finding the Maximum Clique
• In a clique of size k, each node maintains degree >=
k-1
– Nodes with degree < k-1 will not be included in the maximum
clique
• Recursively apply the following pruning procedure
– Sample a sub-network from the given network, and find a
clique in the sub-network, say, by a greedy approach
– Suppose the clique above is size k, in order to find out a
larger clique, all nodes with degree <= k-1 should be
removed.
• Repeat until the network is small enough
• Many nodes will be pruned as social media networks
follow a power law distribution for node degrees
Maximum Clique Example
• Suppose we sample a sub-network with nodes {1-9} and
find a clique {1, 2, 3} of size 3
• In order to find a clique >3, remove all nodes with degree
<=3-1=2
– Remove nodes 2 and 9
– Remove nodes 1 and 3
– Remove node 4
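The pruning rounds in this example can be reproduced with a short sketch. The edge list is an assumption: it is a 9-node toy network inferred from the cliques and removal steps listed in these examples, since the original figure is not available. The function names are hypothetical.

```python
# Assumed toy network (inferred from the examples, not given explicitly).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adj(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def prune(adj, k):
    """Repeatedly remove nodes with degree <= k - 1.

    Returns the surviving nodes and the nodes removed in each round.
    Only the survivors can belong to a clique larger than k.
    """
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    rounds = []
    while True:
        low = sorted(v for v, nb in adj.items() if len(nb) <= k - 1)
        if not low:
            return set(adj), rounds
        rounds.append(low)
        for v in low:
            for nb in adj.pop(v):
                if nb in adj:
                    adj[nb].discard(v)

survivors, rounds = prune(build_adj(edges), k=3)
```

With k = 3 the rounds remove {2, 9}, then {1, 3}, then {4}, leaving nodes 5, 6, 7 and 8 as the only candidates for a larger clique.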
Clique Percolation Method
(CPM)
• Clique is a very strict definition, unstable
• Normally use cliques as a core or a seed to find
larger communities
• CPM is such a method to find overlapping
communities
– Input
• A parameter k, and a network
– Procedure
• Find out all cliques of size k in a given network
• Construct a clique graph. Two cliques are adjacent if
they share k-1 nodes
• Each connected component in the clique graph forms a
community
CPM Example
Cliques of size 3:
{1, 2, 3}, {1, 3, 4}, {4, 5, 6},
{5, 6, 7}, {5, 6, 8}, {5, 7, 8},
{6, 7, 8}
Communities:
{1, 2, 3, 4}
{4, 5, 6, 7, 8}
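The CPM procedure can be sketched as below. Two caveats: the edge list is an assumed 9-node toy network, inferred from the cliques listed in this example, and the clique search is brute force over all node subsets of size k, which is only workable for tiny graphs. Function names are made up for this illustration.

```python
from itertools import combinations

# Assumed toy network (inferred from the cliques listed in the example).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def k_cliques(adj, k):
    """All node sets of size k in which every pair is adjacent."""
    return [set(c) for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]

def clique_percolation(adj, k):
    """Merge k-cliques that share k-1 nodes; return the communities."""
    cliques = k_cliques(adj, k)
    parent = list(range(len(cliques)))  # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(communities.values(), key=min)

adj = build_adj(edges)
communities = clique_percolation(adj, 3)
```

Node 4 ends up in both communities, which is how CPM produces overlapping groups; node 9 is in no 3-clique and so belongs to none.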
Reachability : k-clique, k-club
• Any node in a group should be reachable in k hops
• k-clique: a maximal subgraph in which the largest
geodesic distance between any two nodes <= k
• k-club: a substructure of diameter <= k
Cliques: {1, 2, 3}
2-cliques: {1, 2, 3, 4, 5}, {2, 3, 4, 5, 6}
2-clubs: {1,2,3,4}, {1, 2, 3, 5}, {2, 3, 4, 5, 6}
• A k-clique might have diameter larger than k in the
subgraph
– E.g. {1, 2, 3, 4, 5}
• Commonly used in traditional SNA
• Often involves combinatorial optimization
Group-Centric Community Detection:
Density-Based Groups
• The group-centric criterion requires the whole group
to satisfy a certain condition
– E.g., the group density >= a given threshold
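• A subgraph $G_s(V_s, E_s)$ is a $\gamma$-dense quasi-clique if
$$\frac{2|E_s|}{|V_s|(|V_s|-1)} \ge \gamma,$$
where the denominator is the maximum possible sum of degrees.
• A similar strategy to that of cliques can be used
– Sample a subgraph, and find a maximal $\gamma$-dense
quasi-clique (say, of size $|V_s|$)
– Remove nodes with degree less than the average degree:
$< |V_s|\gamma \le \frac{2|E_s|}{|V_s|-1}$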
Network-Centric Community
Detection
• Network-centric criterion needs to consider
the connections within a network globally
• Goal: partition nodes of a network into disjoint
sets
• Approaches:
– (1) Clustering based on vertex similarity
– (2) Latent space models (multi-dimensional
scaling )
– (3) Block model approximation
– (4) Spectral clustering
– (5) Modularity maximization
(1) Clustering based on vertex similarity
Clustering based on Vertex
Similarity
• Apply k-means or similarity-based clustering to nodes
• Vertex similarity is defined in terms of the similarity of
their neighborhood
• Structural equivalence: two nodes are structurally
equivalent iff they are connected to the same set of
actors
Nodes 1 and 3 are
structurally equivalent;
So are nodes 5 and 6.
• Structural equivalence is too restrictive for practical use.
(1) Clustering based on vertex similarity
Vertex Similarity
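• Jaccard similarity: $Jaccard(v_i, v_j) = \frac{|N_i \cap N_j|}{|N_i \cup N_j|}$
• Cosine similarity: $Cosine(v_i, v_j) = \frac{|N_i \cap N_j|}{\sqrt{|N_i| \cdot |N_j|}}$
where $N_i$ is the set of neighbors of node $v_i$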
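Both neighborhood-based similarities are easy to sketch with sets. Jaccard divides the number of shared neighbors by the size of the union of the two neighborhoods; cosine divides it by the geometric mean of the two degrees. The edge list is an assumed 9-node toy network consistent with the other examples in this deck, and the code assumes both nodes have at least one neighbor.

```python
from math import sqrt

# Assumed toy network (not given explicitly in the source).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def jaccard(adj, u, v):
    """Shared neighbours over the union of the two neighbourhoods."""
    return len(adj[u] & adj[v]) / len(adj[u] | adj[v])

def cosine(adj, u, v):
    """Shared neighbours over the geometric mean of the two degrees."""
    return len(adj[u] & adj[v]) / sqrt(len(adj[u]) * len(adj[v]))

adj = build_adj(edges)
```

Nodes 4 and 6 share only neighbor 5, giving a Jaccard score of 1/7 and a cosine score of 1/4. Note that adjacent nodes with otherwise identical ties, such as 1 and 3, do not reach 1.0 here, because each appears in the other's neighborhood.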
Vertex Similarity (2)
Clustering based on vertex
similarity (K-means)
• Each cluster is associated with a centroid
• Each node is assigned to the cluster with the
closest centroid
Cut
• Most interactions are within group whereas
interactions between groups are few
• Community detection → minimum cut problem
• Cut: A partition of vertices of a graph into two disjoint
sets
• Minimum cut problem: find a graph partition such that
the number of edges between the two sets is
minimized
Ratio Cut & Normalized Cut
• Minimum cut often returns an imbalanced partition, with
one set being a singleton, e.g. node 9
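• Change the objective function to consider community
size
$$Ratio\ Cut(\pi) = \frac{1}{k} \sum_{i=1}^{k} \frac{cut(C_i, \bar{C}_i)}{|C_i|}$$
$$Normalized\ Cut(\pi) = \frac{1}{k} \sum_{i=1}^{k} \frac{cut(C_i, \bar{C}_i)}{vol(C_i)}$$
$C_i$: a community
$|C_i|$: number of nodes in $C_i$
$vol(C_i)$: sum of degrees in $C_i$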
Ratio Cut & Normalized Cut
Example
For partition in red:
For partition in green:
Both ratio cut and normalized cut prefer a balanced partition
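A sketch of the two objectives, averaging cut(C, rest)/|C| or cut(C, rest)/vol(C) over the communities of a partition. The network and both partitions are assumptions made for illustration: an inferred 9-node toy network, an imbalanced split that isolates node 9, and a balanced split {1, 2, 3, 4} versus {5, ..., 9}. They may not match the red and green partitions in the original figure.

```python
# Assumed toy network (not given explicitly in the source).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def cut_size(adj, community):
    """Number of edges leaving the community."""
    return sum(1 for u in community for v in adj[u] if v not in community)

def ratio_cut(adj, partition):
    """Average of cut(C, rest) / |C| over the communities."""
    return sum(cut_size(adj, c) / len(c) for c in partition) / len(partition)

def normalized_cut(adj, partition):
    """Average of cut(C, rest) / vol(C) over the communities."""
    def vol(c):
        return sum(len(adj[u]) for u in c)
    return sum(cut_size(adj, c) / vol(c) for c in partition) / len(partition)

adj = build_adj(edges)
imbalanced = [{9}, {1, 2, 3, 4, 5, 6, 7, 8}]
balanced = [{1, 2, 3, 4}, {5, 6, 7, 8, 9}]
```

The imbalanced split has the smaller plain cut (1 edge versus 2), yet it scores worse on both objectives: ratio cut 9/16 versus 9/20, normalized cut 14/27 versus 7/48.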
Hierarchy-Centric Community
Detection
• Goal: build a hierarchical structure of
communities based on network topology
• Allow the analysis of a network at different
resolutions
• Representative approaches:
– Divisive Hierarchical Clustering (top-down)
– Agglomerative Hierarchical clustering
(bottom-up)
Divisive Hierarchical
Clustering
• Divisive clustering
– Partition nodes into several sets
– Each set is further divided into smaller ones
– Network-centric partition can be applied for the partition
• One particular example: recursively remove the
“weakest” tie
– Find the edge with the least strength
– Remove the edge and update the corresponding strength of
each edge
• Recursively apply the above two steps until a network
is decomposed into desired number of connected
components.
• Each component forms a community
Edge Betweenness
• The strength of a tie can be measured by edge
betweenness
• Edge betweenness: the number of shortest paths that
pass along the edge
The edge betweenness of
e(1, 2) is 4 (=6/2 + 1), as
all the shortest paths
from 2 to {4, 5, 6, 7, 8, 9}
have to either pass e(1, 2)
or e(2, 3), and e(1,2) is
the shortest path
between 1 and 2
• The edge with higher betweenness tends to be the
bridge between two communities.
Divisive clustering based on edge
betweenness
Idea: progressively remove the edges with the highest betweenness
[Figure: initial betweenness value of each edge]
After removing e(4,5), the betweenness of e(4,6)
becomes 20, which is the highest.
After removing e(4,6), the edge e(7,9) has the highest betweenness
value, 4, and should be removed.
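The removal steps can be checked with a sketch of edge betweenness based on Brandes-style accumulation for unweighted graphs. The edge list is an assumed 9-node toy network, chosen because it reproduces the values quoted in this example (4 for e(1,2), 20 for e(4,6) after the first removal). Function names are hypothetical.

```python
from collections import deque

# Assumed toy network (inferred from the examples).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def edge_betweenness(adj):
    """Shortest paths through each edge; ties split the credit."""
    eb = {}
    for s in adj:
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:  # BFS counting shortest paths from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while order:  # push credit back toward s, edge by edge
            w = order.pop()
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                edge = (min(v, w), max(v, w))
                eb[edge] = eb.get(edge, 0.0) + share
                delta[v] += share
    # Each pair of endpoints was processed from both sides.
    return {e: b / 2 for e, b in eb.items()}

def remove_edge(adj, u, v):
    adj[u].discard(v)
    adj[v].discard(u)

adj = build_adj(edges)
initial = edge_betweenness(adj)
```

In this assumed network e(4,5) and e(4,6) start tied at 10; removing either one pushes all 20 cross-group shortest paths onto the other.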
Agglomerative Hierarchical
Clustering
• Initialize each node as a community
• Merge communities successively into larger
communities following a certain criterion
– E.g., based on vertex similarity
Dendrogram according to Agglomerative Clustering based on Modularity
Summary of Community
Detection
• Node-Centric Community Detection
– cliques, k-cliques, k-clubs
• Group-Centric Community Detection
– quasi-cliques
• Network-Centric Community Detection
– Clustering based on vertex similarity
• Hierarchy-Centric Community Detection
– Divisive clustering
– Agglomerative clustering
COMMUNITY EVALUATION
Evaluating Community Detection
(1)
• For groups with clear definitions
– E.g., Cliques, k-cliques, k-clubs, quasicliques
– Verify whether extracted communities
satisfy the definition
• For networks with ground truth
information
– Normalized mutual information
– Accuracy of pairwise community
memberships
Measuring a Clustering Result
Ground Truth: {1, 2, 3}, {4, 5, 6}
Clustering Result: {1, 3}, {2}, {4, 5, 6}
How to measure the clustering quality?
• The number of communities after grouping can be
different from the ground truth
• No clear community correspondence between clustering
result and the ground truth
• Normalized Mutual Information can be used
Accuracy of Pairwise Community
Memberships
• Consider all the possible pairs of nodes and check whether they
reside in the same community
• An error occurs if
– Two nodes belonging to the same community are assigned
to different communities after clustering
– Two nodes belonging to different communities are assigned
to the same community
• Construct a contingency table or confusion matrix
Accuracy Example
Ground Truth: {1, 2, 3}, {4, 5, 6}
Clustering Result: {1, 3}, {2}, {4, 5, 6}

                                    Ground Truth
                                    C(vi) = C(vj)    C(vi) ≠ C(vj)
Clustering Result   C(vi) = C(vj)   4                0
                    C(vi) ≠ C(vj)   2                9

Accuracy = (4+9)/ (4+2+9+0) = 13/15
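The pair counting behind this table can be sketched as follows. Labels are given per node in node order (nodes 1 to 6), and the function names are invented for this example.

```python
from itertools import combinations

def pairwise_counts(truth, result):
    """Count node pairs by (same in result?, same in ground truth?)."""
    counts = {(True, True): 0, (True, False): 0,
              (False, True): 0, (False, False): 0}
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_result = result[i] == result[j]
        counts[(same_result, same_truth)] += 1
    return counts

def pairwise_accuracy(truth, result):
    """Share of pairs on which the two groupings agree."""
    counts = pairwise_counts(truth, result)
    agree = counts[(True, True)] + counts[(False, False)]
    return agree / sum(counts.values())

truth = [1, 1, 1, 2, 2, 2]    # {1, 2, 3}, {4, 5, 6}
result = [1, 2, 1, 3, 3, 3]   # {1, 3}, {2}, {4, 5, 6}
counts = pairwise_counts(truth, result)
accuracy = pairwise_accuracy(truth, result)
```

Only the label pattern matters, not the label values, so the result does not depend on how communities are numbered.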
Normalized Mutual Information
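• Entropy: the information contained in a distribution
$$H(X) = -\sum_{x} p(x) \log p(x)$$
• Mutual Information: the shared information between two
distributions
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$
• Normalized Mutual Information (between 0 and 1)
$$NMI(X; Y) = \frac{I(X; Y)}{\sqrt{H(X)\,H(Y)}}$$
or
$$NMI(\pi^a, \pi^b) = \frac{\sum_{h=1}^{k^a} \sum_{l=1}^{k^b} n_{h,l} \log\left(\frac{n \cdot n_{h,l}}{n_h^a\, n_l^b}\right)}{\sqrt{\left(\sum_{h=1}^{k^a} n_h^a \log\frac{n_h^a}{n}\right)\left(\sum_{l=1}^{k^b} n_l^b \log\frac{n_l^b}{n}\right)}}$$
• Consider a partition as a distribution (probability of
one node falling into one community), we can compute
the matching between the clustering result and the
ground truth
$k^a$, $k^b$ = the numbers of clusters produced by the partitions $\pi^a$, $\pi^b$; $h$ and $l$ are
the indices of the clusters in the two partitions; $n$ is the number of nodes, $n_h^a$ and $n_l^b$ are
the cluster sizes, and $n_{h,l}$ is the number of nodes shared by cluster $h$ of $\pi^a$ and cluster $l$ of $\pi^b$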
NMI-Example
Ground Truth: {1, 2, 3}, {4, 5, 6}
Clustering Result: {1, 3}, {2}, {4, 5, 6}
• Partition a: [1, 1, 1, 2, 2, 2]
• Partition b: [1, 2, 1, 3, 3, 3]

n_h^a:    h=1: 3    h=2: 3
n_l^b:    l=1: 2    l=2: 1    l=3: 3

n_{h,l} (contingency table or confusion matrix):
        l=1   l=2   l=3
h=1     2     1     0
h=2     0     0     3

NMI = 0.8278
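The example value can be reproduced with a short sketch that builds the cluster sizes and the contingency counts from the two label lists. It assumes the count-based form of NMI (mutual information divided by the geometric mean of the two entropies, all expressed in node counts); the function name `nmi` is chosen for this illustration. The code is undefined when either partition has a single cluster, because that entropy is zero.

```python
from collections import Counter
from math import log, sqrt

def nmi(a, b):
    """Normalized mutual information between two label lists."""
    n = len(a)
    size_a = Counter(a)            # n_h^a: cluster sizes in partition a
    size_b = Counter(b)            # n_l^b: cluster sizes in partition b
    joint = Counter(zip(a, b))     # n_{h,l}: contingency table (nonzero cells)
    mutual = sum(c * log(n * c / (size_a[h] * size_b[l]))
                 for (h, l), c in joint.items())
    ent_a = sum(c * log(c / n) for c in size_a.values())  # negative
    ent_b = sum(c * log(c / n) for c in size_b.values())  # negative
    return mutual / sqrt(ent_a * ent_b)

partition_a = [1, 1, 1, 2, 2, 2]
partition_b = [1, 2, 1, 3, 3, 3]
score = nmi(partition_a, partition_b)
```

The base of the logarithm cancels between numerator and denominator, so natural logs give the same score as base 2.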
Reference: http://www.cse.ust.hk/~weikep/notes/NormalizedMI.m
Evaluation using Semantics
• For networks with semantics
– Networks come with semantic or attribute
information of nodes or connections
– Human subjects can verify whether the extracted
communities are coherent
• Evaluation is qualitative
• It is also intuitive and helps understand a community
An animal
community
A health
community
Evaluation without Ground
Truth
• For networks without ground truth or semantic
information
• This is the most common situation
• An option is to resort to cross-validation
– Extract communities from a (training) network
– Evaluate the quality of the community structure on a network
constructed from a different date or based on a related type
of interaction
• Quantitative evaluation functions
– Modularity (M.Newman. Modularity and community structure
in networks. PNAS 06.)
– Link prediction (the predicted network is compared with the
true network)
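As an illustration of the first evaluation function, modularity compares the share of edges inside each community with the share expected if ties were placed at random with the same degrees. The sketch below uses the per-community form Q = sum over communities of [l_c/m - (vol_c/2m)^2], where l_c is the number of internal edges and vol_c the sum of degrees. The edge list and the partition are assumptions (an inferred 9-node toy network), and the function names are hypothetical.

```python
# Assumed toy network (not given explicitly in the source).
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def modularity(adj, communities):
    """Q = sum over communities of (l_c / m) - (vol_c / 2m)^2."""
    two_m = sum(len(nb) for nb in adj.values())  # sum of degrees = 2m
    q = 0.0
    for community in communities:
        # Each internal edge is seen from both endpoints, so this is 2 * l_c.
        internal = sum(1 for u in community for v in adj[u] if v in community)
        volume = sum(len(adj[u]) for u in community)
        q += internal / two_m - (volume / two_m) ** 2
    return q

adj = build_adj(edges)
q_split = modularity(adj, [{1, 2, 3, 4}, {5, 6, 7, 8, 9}])
q_whole = modularity(adj, [set(adj)])
```

Putting every node in one community always gives Q = 0, so a positive score such as the roughly 0.35 obtained for the two-group split indicates more internal ties than chance would predict.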