Patrick_Xiaofei_presentation2

Social network partition
Presenters:
Xiaofei Cao
Patrick Berg
Problem Statement
• Say we have a graph of nodes. It could represent anything, but
for our purposes let’s say it represents a population spread out
across a country. Some of these nodes are clustered closely
together, representing a “community”, while others are further
away, representing a different “community”. We want to detect
the communities in this graph. So how do we do this?
Some Definitions
• Before we try to develop a solution to this problem, we
should get a few common definitions out of the way.
• Degree – The degree of a node is the number of edges
connected to it.
• Community – A community is a set of nodes, grouped by some
similarity, that are densely connected to one another internally.
[Figure: example graph with nodes A through H. The degree of D is 2 and the degree of C is 3.]
Why find Communities?
• We want to find communities to see the relations within
groups and their connections to other groups.
• We can use this to easily find groups of people who share
particular traits, such as terrorist organizations (or groups in
any other social network).
How do we find Communities?
• Vertex Betweenness – This is a measure of a vertex’s (or
node’s) centrality within the graph. It counts the number of
times a node acts as a bridge on a shortest path between two
other nodes.
Use BFS to find shortest-paths
• We use the BFS (Breadth First Search) Algorithm to find the
shortest paths between each node and every other node.
• From this we can calculate the vertex betweenness for each
node.
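As a minimal sketch of the BFS step, the function below computes hop distances from one starting node in an unweighted graph and is then run from every node. The adjacency-list dictionary and the four-node path graph are assumptions made for illustration; they are not taken from the slides.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return the hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:          # first visit is along a shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Hypothetical path graph A - B - C - D.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

# One BFS per starting node gives the shortest distance between every pair.
all_pairs = {s: bfs_distances(adj, s) for s in adj}
```

Here `all_pairs["A"]["D"]` is 3, the length of the only path from A to D.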
Girvan Newman Algorithm
• We can use the Girvan Newman algorithm to detect
communities in the graph.
• Girvan Newman takes the “Betweenness” score and extends
the definition to edges.
• So an edge’s “Betweenness” score is the number of shortest
paths between pairs of nodes that run along it.
• If there is more than one shortest path between a pair of
nodes, each path is given an equal weight, so that the weights
for that pair sum to 1.
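To make the equal-weight rule concrete, here is a brute-force sketch that lists every shortest path for each pair of nodes and gives each path a share of 1/(number of shortest paths). The function names and the four-node cycle are assumptions for illustration; this enumeration is far slower than the BFS-based method described later and is only meant to show the definition.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations

def shortest_paths(adj, s, t):
    """List every shortest path from s to t as a list of nodes."""
    dist, preds = {s: 0}, {s: []}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                preds[v] = [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)      # another equally short way into v
    if t not in dist:
        return []

    def walk(node):
        if node == s:
            return [[s]]
        return [path + [node] for p in preds[node] for path in walk(p)]

    return walk(t)

def edge_betweenness_bruteforce(adj):
    scores = {}
    for s, t in combinations(sorted(adj), 2):
        paths = shortest_paths(adj, s, t)
        for path in paths:
            share = Fraction(1, len(paths))   # equal weight per path
            for u, v in zip(path, path[1:]):
                edge = tuple(sorted((u, v)))
                scores[edge] = scores.get(edge, 0) + share
    return scores

# Hypothetical 4-cycle A - B - C - D - A.
adj = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}
scores = edge_betweenness_bruteforce(adj)
```

In the cycle, opposite nodes such as A and C have two shortest paths, so each path contributes 1/2 to the edges it uses, and every edge ends with a score of 2.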
Girvan Newman Algorithm Continued
With this method of edge “betweenness” scoring, edges between
nodes inside a community have lower scores, and edges that
connect a community to other communities have higher scores.
To find the communities, we remove the highest-scoring edge
and recalculate the “betweenness” score of each affected
edge.
Example
The highest edge score is 6,
connecting node A to node C.
So we remove this edge first.
[Figure: four-node example graph with nodes A, B, C, and D.]
Girvan Newman Algorithm Continued
• Now we continue to remove the highest-scoring edge from the
graph and recalculate, until no edges remain.
• The end result is a dendrogram that shows the clusters of
communities in our graph.
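The remove-and-recalculate loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenters' implementation: the graph is a dictionary of adjacency sets with no self-loops, all scores are recomputed after each removal, ties are broken arbitrarily, and the function names and the two-triangle test graph are hypothetical. Each entry of the returned list is one level of the dendrogram.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness of an undirected, unweighted graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Top-down pass: BFS layers and shortest-path counts from s.
        dist, paths, order = {s: 0}, {s: 1}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    paths[v] = 0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]
        # Bottom-up pass: push each node's score to the layer above.
        node_score = {v: 1.0 for v in order}
        for v in reversed(order):
            for u in adj[v]:
                if dist[u] == dist[v] - 1:
                    credit = node_score[v] * paths[u] / paths[v]
                    score[frozenset((u, v))] += credit
                    node_score[u] += credit
    # Every pair of endpoints was counted once from each end.
    return {e: c / 2 for e, c in score.items()}

def components(adj):
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            u = stack.pop()
            if u not in part:
                part.add(u)
                stack.extend(adj[u] - part)
        seen |= part
        parts.append(part)
    return parts

def girvan_newman_levels(adj):
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    levels = [components(adj)]
    while any(adj.values()):                      # until no edges remain
        scores = edge_betweenness(adj)
        u, v = max(scores, key=scores.get)        # highest-scoring edge
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        if len(parts) > len(levels[-1]):          # a community split off
            levels.append(parts)
    return levels

# Hypothetical graph: triangles {1,2,3} and {4,5,6} joined by edge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
levels = girvan_newman_levels(graph)
```

On this graph the bridge 3-4 has the highest score, so the first split separates the two triangles, and the last level contains six single-node communities.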
Sequential Algorithm
• Proposed by Girvan and Newman in the paper: "Community
structure in social and biological networks." Proceedings of the
National Academy of Sciences 99.12 (2002): 7821-7826.
• The complete algorithm is given in the paper: "Finding and
evaluating community structure in networks." Physical Review E
69.2 (2004): 026113.
Girvan Newman algorithm
• Goal: find the edge with the highest betweenness score and
remove it. Continue doing that until the graph has been
partitioned.
• Input: the graph for every iteration (adjacency matrix).
• Output: the betweenness score for every edge (betweenness
matrix).
• The algorithm can be separated into 2 parts.
Part I: Find the number of shortest paths from one node to every other node
• Work from the top down.
• Use the breadth-first algorithm to generate a new view of the
graph from that node.
• Find the number of shortest paths to each node.
[Figure: the example graph (nodes 1 through 8) redrawn as a layered view from node 1, with each node labeled by its number of shortest paths.]
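Part I can be sketched as a BFS over the adjacency matrix that records, for each node, its layer in the view and its number of shortest paths from the starting node. The function name and the five-node graph are assumptions for illustration; the graph from the slide's figure could not be recovered, so a different small graph is used.

```python
from collections import deque

def count_shortest_paths(matrix, source):
    """Return (level, paths): BFS layer and shortest-path count per node."""
    n = len(matrix)
    level = [None] * n
    paths = [0] * n
    level[source], paths[source] = 0, 1
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if not matrix[u][v]:
                continue
            if level[v] is None:            # first time v is reached
                level[v] = level[u] + 1
                queue.append(v)
            if level[v] == level[u] + 1:    # u is an upper-layer neighbor of v
                paths[v] += paths[u]
    return level, paths

# Hypothetical graph with edges 0-1, 0-2, 1-3, 2-3, 3-4.
matrix = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
level, paths = count_shortest_paths(matrix, 0)
```

In the view from node 0, node 3 sits in layer 2 and can be reached through node 1 or node 2, so it has 2 shortest paths, and node 4 below it inherits the same count.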
Part II: Calculate the edge betweenness scores for every iteration
• Work from the bottom up.
• Every node holds one score: 1 plus the sum of the scores on
the edges to its lower-layer neighbors (so a bottom node has
score 1).
• The score of each edge from a node to an upper-layer node is
Score = Node_score / (# of shortest paths to the node) * (# of
shortest paths to the upper-layer node).
• Sum up the edge scores over every iteration (one iteration per
starting node).
[Figure: view from node 1 of the example graph (nodes 1 through 8), with node scores and edge betweenness scores filled in from the bottom layer up.]
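The bottom-up scoring for a single starting node can be sketched as below. It assumes, as in the Newman and Girvan formulation, that each node's score is 1 plus the scores of the edges below it. The function name, the adjacency-list dictionary, and the five-node graph are hypothetical; exact fractions are used so the values can be compared by hand.

```python
from collections import deque
from fractions import Fraction

def edge_scores_from(adj, source):
    """Edge scores for the BFS view from one starting node."""
    # Part I (top down): layers and shortest-path counts.
    level, paths, order = {source: 0}, {source: 1}, []
    queue = deque([source])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                paths[v] = 0
                queue.append(v)
            if level[v] == level[u] + 1:
                paths[v] += paths[u]
    # Part II (bottom up): every node starts with a score of 1.
    node_score = {v: Fraction(1) for v in order}
    edge_score = {}
    for v in reversed(order):
        for u in adj[v]:
            if level[u] == level[v] - 1:    # u is in the upper layer
                credit = node_score[v] * Fraction(paths[u], paths[v])
                edge_score[(u, v)] = credit  # key is (upper node, lower node)
                node_score[u] += credit
    return edge_score

# Hypothetical graph with edges 0-1, 0-2, 1-3, 2-3, 3-4.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
scores = edge_scores_from(adj, 0)
```

From node 0, the bottom node 4 has score 1 and passes 1 * 2/2 = 1 up edge 3-4. Node 3 then has score 2 and splits it as 2 * 1/2 = 1 to each of edges 1-3 and 2-3, and edges 0-1 and 0-2 each receive 2. Summing these scores over every starting node counts each pair of nodes twice, which does not change which edge ranks highest.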
Analysis of time complexity
• Number of iterations in the big loop: n (the number of nodes)
• Time complexity of finding the shortest paths: O(n^2)
• Time complexity of calculating the betweenness scores: O(n)
• Adding to the betweenness matrix: O(n^2)
• Total time complexity: n*(n^2+n+n^2) = O(n^3)
Parallel algorithm (Intuitively)
• Assign every processor a copy of the same adjacency matrix
of the original network.
• The processors start from different nodes, generating views
and calculating the betweenness matrix for each of their starting
nodes, then sum those matrices locally.
• Perform a parallel prefix sum and update the original network
by removing the highest-scoring edge.
[Figure: parallel workflow on 8 processors (P1 to P8). Each processor takes three starting nodes (P1: G1, G2, G3 through P8: G22, G23, G24, where Gn means starting from node n in the graph), runs the breadth-first algorithm to build views V1 to V24, and sums the betweenness scores locally into B1 to B8. A parallel prefix sum then combines B1 to B8, and the value at B8 is used to update the network.]
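The workflow above can be simulated sequentially to check the logic. This is only a sketch under stated assumptions: the p processors are modeled as loop iterations, starting nodes are assigned in contiguous blocks, and the global sum is modeled as a pairwise tree reduction that takes about log2(p) rounds (the slides call this step a prefix sum). All function names and the five-node graph are hypothetical.

```python
from collections import deque

def single_source_matrix(matrix, s):
    """Betweenness contribution of the view from node s, as an n x n matrix."""
    n = len(matrix)
    level, paths, order = [None] * n, [0] * n, []
    level[s], paths[s] = 0, 1
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in range(n):
            if matrix[u][v]:
                if level[v] is None:
                    level[v] = level[u] + 1
                    queue.append(v)
                if level[v] == level[u] + 1:
                    paths[v] += paths[u]
    node_score = [1.0] * n
    out = [[0.0] * n for _ in range(n)]
    for v in reversed(order):
        for u in range(n):
            if matrix[u][v] and level[u] == level[v] - 1:
                credit = node_score[v] * paths[u] / paths[v]
                out[u][v] += credit
                out[v][u] += credit
                node_score[u] += credit
    return out

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def parallel_betweenness(matrix, p):
    n = len(matrix)
    chunk = -(-n // p)                      # ceil(n / p) starting nodes each
    local = []
    for rank in range(p):                   # simulated processors
        total = [[0.0] * n for _ in range(n)]
        for s in range(rank * chunk, min(n, (rank + 1) * chunk)):
            total = add(total, single_source_matrix(matrix, s))
        local.append(total)                 # local sum on this processor
    rounds, step = 0, 1
    while step < p:                         # pairwise tree reduction
        for rank in range(0, p - step, 2 * step):
            local[rank] = add(local[rank], local[rank + step])
        step *= 2
        rounds += 1
    return local[0], rounds

# Hypothetical graph with edges 0-1, 0-2, 1-3, 2-3, 3-4.
matrix = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
total, rounds = parallel_betweenness(matrix, p=2)

# Update the network: remove the highest-scoring edge.
n = len(matrix)
u, v = max(((i, j) for i in range(n) for j in range(i + 1, n)),
           key=lambda e: total[e[0]][e[1]])
matrix[u][v] = matrix[v][u] = 0
```

With two simulated processors the reduction needs one round, and edge 3-4 has the largest summed score, so it is the edge removed. A real implementation would replace the outer loop and the reduction with message passing or shared-memory primitives.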
Analysis of time complexity (parallel)
• Number of iterations per processor: n/p
• Finding the number of shortest paths: O(n^2)
• Finding the betweenness scores: O(n)
• Adding the betweenness scores locally: O(n^2)
• Adding the betweenness scores globally (prefix sum): O(n^2*log(p))
• Time complexity: n/p*(n^2+n+n^2) + n^2*log(p)
= O(n^2*(n/p+log(p)))
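Analysis of time complexity (parallel) Continued
• Speedup: n^3 / (n^2*(n/p+log(p))) = n/(n/p+log(p))
• When n = p*log(p), n/p = log(p), so the speedup is
p*log(p)/(2*log(p)) = p/2 = Θ(p). The algorithm is cost optimal.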
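The speedup formula can be checked numerically with a short sketch. It assumes the logarithm is base 2 and drops all constant factors, so the numbers illustrate only the growth rate; the function names are hypothetical.

```python
import math

def sequential_time(n):
    return n ** 3

def parallel_time(n, p):
    # n^2 * (n/p + log p), with log taken as base 2 (an assumption).
    return n ** 2 * (n / p + math.log2(p))

def speedup(n, p):
    return sequential_time(n) / parallel_time(n, p)

p = 16
n = p * int(math.log2(p))      # n = p * log2(p) = 64
s = speedup(n, p)              # 64 / (4 + 4)
efficiency = s / p
```

With p = 16 and n = 64 the speedup is 8, which is p/2, so the efficiency stays at a constant 0.5 as p grows along n = p*log2(p).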
Question