Parallel Subgraph Listing in a
Large-Scale Graph
Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, Ning Xu
School of EECS, Peking University
Hong Kong University of Science and Technology
1
Outline
• Subgraph listing operation
• Related work
• PSgL framework
• Evaluation
• Conclusion
2
Introduction
Motivation
Motif Detection in Bioinformatics
Triangle Counting in SN
Cascades Counting in RN
3
Introduction
Problem Definition
Subgraph Listing Operation
o Input: pattern graph, data graph [both are undirected]
o Output: all the occurrences of pattern graph in the data graph.
[Figure: example pattern graph and data graph]
Goal of our work
o Efficiently listing subgraphs in a large-scale graph
4
Related Work
Centralized algorithms
 Enumerate instances one by one [Chiba ’85, Wernicke ’06, Grochow ’07]
Streaming algorithms
 Counting only, and the results are approximate [Buriol ’06, Bordino ’08, Zhao ’10]
MapReduce-based parallel algorithms
 Decompose the pattern graph + explicit join operation [Afrati ’13]
 Fixed exploration plan + implicit join operation [Plantenga ’13]
Other efficient algorithms for specific pattern graphs
 Triangle [Suri ’11, Chu ’11, Hu ’13]
5
Related Work
Drawbacks of existing parallel solutions
• MapReduce is not well suited to graph processing.
• The join operation is expensive.
• They do not balance the distribution of
  • the data graph,
  • the intermediate results.
The novel PSgL framework lists subgraphs via graph traversal
over an in-memory native graph.
6
Contributions
• We propose an efficient parallel subgraph listing framework, PSgL.
• We introduce a cost model for the subgraph listing in PSgL.
• We propose a simple but effective workload-aware distribution
strategy, which helps PSgL achieve good workload balance.
• We design three independent mechanisms to reduce the size of
intermediate results.
7
Preliminaries
Partial subgraph instance
• A data structure that records the mapping between the pattern
graph and the data graph.
• Denoted by 𝐺𝑝𝑠𝑖.
• Assuming the vertices of 𝐺𝑝 are numbered from 1 to |𝑉𝑝|, we
simply write 𝐺𝑝𝑠𝑖 as {map(1), map(2), ..., map(|𝑉𝑝|)}.
[Figure: example 𝐺𝑝𝑠𝑖 s for a 4-vertex pattern on the data graph, e.g. {?,?,?,?}, {1,5,6,?}, {2,3,4,5}]
8
Preliminaries
Independence Property
• 𝐺𝑝𝑠𝑖 Tree
• A node is a 𝐺𝑝𝑠𝑖
• The children of a node are derived from
expanding one mapped data vertex in the
node.
• Characteristics
• A 𝐺𝑝𝑠𝑖 encodes a set of results.
• 𝐺𝑝𝑠𝑖 s are independent of each other,
except those on the same generation path.
[Figure: 𝐺𝑝𝑠𝑖 tree rooted at {?,?,?,?}]
9
PSgL
PSgL: Parallel Subgraph Listing Framework
• PSgL follows the popular graph
processing paradigms
• vertex-centric model
• BSP model
• PSgL iteratively generates 𝐺𝑝𝑠𝑖 s in
parallel;
• each 𝐺𝑝𝑠𝑖 is expanded by a data
vertex.
[Figure: per iteration, 𝐺𝑝𝑠𝑖 s are processed in parallel across partitions P1–P3 on Worker-1, Worker-2, Worker-3]
10
PSgL
Vertex program
Algorithm of Expanding a 𝐺𝑝𝑠𝑖 - I
• The partial pattern graph 𝐺𝑝𝑝 encodes
  • the pattern graph,
  • the 𝐺𝑝𝑠𝑖,
  • the progress state.
• Three types of pattern vertices
  • BLACK: has already been expanded.
  • GRAY: has a mapped data vertex, but
    has not been expanded yet.
  • WHITE: has not been mapped to any
    data vertex.
[Figure: 𝐺𝑝𝑝 = 𝐺𝑝 + 𝐺𝑝𝑠𝑖 = {3, 5, ?, 2}, i.e. mappings <1,3>, <2,5>, <3,?>, <4,2>]
11
PSgL
Vertex program
Algorithm of Expanding a Gpsi - II
• Main logic
  • Change one GRAY vertex into BLACK;
  • Validate the expanding vertex’s GRAY neighbors;
  • Turn the expanding vertex’s WHITE neighbors
    GRAY.
• Two observations
  • In each expansion, at least one pattern vertex is
    processed.
  • All GRAY vertices are valid candidates for the next
    expansion.
[Figure: 𝐺𝑝𝑝 with 𝐺𝑝𝑠𝑖 = {3, 5, ?, 2}]
Example: expanding vertex <4, 2>
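The main logic above can be sketched as follows. This is a simplified, hypothetical illustration (adjacency stored as plain dicts, no explicit BLACK/GRAY/WHITE bookkeeping), not the actual PSgL vertex program:

```python
from itertools import product

def expand(gpsi, u, pattern_adj, data_adj):
    """Expand a Gpsi at pattern vertex u (1-based), mapped to data vertex gpsi[u-1].

    Validates u's already-mapped pattern neighbors against the data graph,
    then maps each still-unmapped (WHITE) neighbor of u to a candidate
    drawn from the data neighbors of gpsi[u-1]. Returns the child Gpsi s.
    """
    dv = gpsi[u - 1]
    mapped = [w for w in pattern_adj[u] if gpsi[w - 1] is not None]
    white = [w for w in pattern_adj[u] if gpsi[w - 1] is None]

    # Validation: each mapped pattern neighbor must map to a data neighbor of dv.
    if any(gpsi[w - 1] not in data_adj[dv] for w in mapped):
        return []

    used = {v for v in gpsi if v is not None}
    children = []
    for cand in product(data_adj[dv], repeat=len(white)):
        # Candidates must be distinct data vertices not used elsewhere in the Gpsi.
        if len(set(cand)) < len(cand) or set(cand) & used:
            continue
        child = list(gpsi)
        for w, c in zip(white, cand):
            child[w - 1] = c
        children.append(child)
    return children

# Triangle pattern on a triangle data graph, expanding at pattern vertex 1:
pattern = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
data = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(expand([1, None, None], 1, pattern, data))  # [[1, 2, 3], [1, 3, 2]]
```

Note that both children here list the same triangle; the automorphism-breaking partial order described in the optimization slides is what removes such duplicates.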
12
PSgL
Analysis
Efficiency of PSgL
Total cost: T = Σ_{i=1}^{S} max_{1≤k≤K} Σ_j load(𝐺𝑝𝑠𝑖 j),
where S is the number of iterations, K the number of workers, the inner sum
ranges over the 𝐺𝑝𝑠𝑖 s processed by worker k in iteration i, and load(𝐺𝑝𝑠𝑖)
is the cost of processing a 𝐺𝑝𝑠𝑖.
Three metrics:
• The number of iterations.
• S is bounded: |MVC| ≤ S ≤ |Vp| - 1, where MVC is a minimum vertex cover of 𝐺𝑝.
• Workload balance.
• Required by the max function.
• The number of 𝐺𝑝𝑠𝑖 s
*Refer to the paper for the details of estimating load(𝐺𝑝𝑠𝑖 ).
13
Optimization
Workload balance - I
• Partial subgraph instance distribution problem
• N 𝐺𝑝𝑠𝑖 s are to be processed by K workers; the goal is to find a
distribution strategy that minimizes the maximum workload among the workers.
• An NP-hard problem!
• Naive Solutions
• Random distribution strategy
• Roulette wheel distribution strategy
• A 𝐺𝑝𝑠𝑖 has a higher probability of being expanded by a data vertex with a smaller degree.
14
Optimization
Workload balance - II
• Workload aware distribution strategy
• A general greedy-based heuristic rule.
α        Description                                                               Drawback
1        Select the worker 𝑗 with the minimal overall workload 𝑊𝑖 for the 𝑖-th 𝐺𝑝𝑠𝑖    locally optimal
0        Select the worker 𝑗 where the 𝐺𝑝𝑠𝑖 incurs the least increased workload 𝑤𝑖𝑗    imbalance
0.5 (*)  Trade-off between local optimality and imbalance                          -

All three strategies share the same worst-case bound, K·|OPT|,
but in practice α = 0.5 performs best.
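The α-weighted rule can be sketched as a one-pass greedy assignment (hypothetical names and inputs; the real strategy uses the paper's load estimates): each 𝐺𝑝𝑠𝑖 goes to the worker minimizing α·(current overall workload) + (1-α)·(increased workload):

```python
def distribute(inc_matrix, num_workers, alpha=0.5):
    """Greedily assign Gpsi s to workers under the alpha-weighted rule.

    inc_matrix[i][k] is the workload increase w_ik if the i-th Gpsi is
    placed on worker k. alpha = 1 favors the least-loaded worker overall;
    alpha = 0 favors the smallest local increase; alpha = 0.5 trades off both.
    """
    workloads = [0.0] * num_workers
    assignment = []
    for inc in inc_matrix:
        scores = [alpha * W + (1 - alpha) * w for W, w in zip(workloads, inc)]
        k = min(range(num_workers), key=scores.__getitem__)
        assignment.append(k)
        workloads[k] += inc[k]
    return assignment, workloads

# Three equal-cost Gpsi s on two workers alternate under alpha = 0.5:
assignment, loads = distribute([[3, 3], [3, 3], [3, 3]], 2)
print(assignment, loads)  # [0, 1, 0] [6.0, 3.0]
```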
15
Optimization
Comparison among various approaches
[Figure: workload balance comparison of the Random, Roulette, α = 0, and α = 1 strategies]
16
Optimization
Partial subgraph instance reduction - I
• Pattern graph automorphism breaking
  • Use DFS to find the equivalent vertex groups
  • Assign a partial order within each equivalent vertex group (e.g., 2 < 3)
• Initial pattern vertex selection
  • Introduce a cost model
  • General pattern graph: enumerate all possible selections based on the cost model
  • Cycle and clique: the vertex with the lowest rank is the best one.
[Figure: initial pattern vertex selection based on the cost model — Automorphism Breaking → Cost Model → Best Initial Pattern Vertex]
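For instance, if pattern vertices 2 and 3 fall in the same equivalent group, enforcing map(2) < map(3) keeps exactly one of each pair of automorphic listings. A hypothetical filter for such constraints:

```python
def respects_partial_order(gpsi, orders):
    """Check a Gpsi against symmetry-breaking constraints.

    orders holds pairs (a, b) of pattern vertices with the constraint
    map(a) < map(b); constraints on still-unmapped vertices are deferred.
    """
    for a, b in orders:
        ma, mb = gpsi[a - 1], gpsi[b - 1]
        if ma is not None and mb is not None and not ma < mb:
            return False
    return True

# In a triangle pattern all three vertices are equivalent, so the
# constraints 1 < 2 and 2 < 3 keep one of the 6 automorphic listings.
print(respects_partial_order([4, 7, 9], [(1, 2), (2, 3)]))  # True
print(respects_partial_order([7, 4, 9], [(1, 2), (2, 3)]))  # False
```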
17
Optimization
Partial subgraph instance reduction - II
• Online pruning of invalid 𝑮𝒑𝒔𝒊 s
  • Filter by the partial order and degree restrictions
  • Prune with the help of a lightweight global edge index
    • A Bloom filter indexes the two ends of each edge
[Figure: pattern graphs PG1 (triangle), PG4, PG5]

Data Graph   PG           𝐺𝑝𝑠𝑖 # w/ index  𝐺𝑝𝑠𝑖 # w/o index  Pruning Ratio
LiveJournal  PG1(v1)      2.86 x 10^8      6.81 x 10^8       58.01%
LiveJournal  PG4(v1)      9.93 x 10^9      OOM               unknown
UsPatent     PG5(v1)      2.26 x 10^7      3.17 x 10^8       92.87%
UsPatent     PG5(v3; v4)  7.38 x 10^9      2.04 x 10^10      63.89%
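A Bloom filter over edges gives a compact membership test: a negative answer is exact (the candidate edge is surely absent, so the 𝐺𝑝𝑠𝑖 can be pruned), while a rare false positive only costs a missed pruning opportunity, never a wrong result. A hypothetical stand-alone sketch, not the paper's implementation:

```python
import hashlib

class EdgeBloomFilter:
    """Hypothetical lightweight edge index: a Bloom filter over undirected edges."""

    def __init__(self, num_bits=1 << 20, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, u, v):
        u, v = min(u, v), max(u, v)  # undirected: canonical vertex order
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{u}-{v}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, u, v):
        """Index the two ends of an edge (u, v)."""
        for p in self._positions(u, v):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, u, v):
        """False means the edge is surely absent; True may be a false positive."""
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(u, v))
```

A 𝐺𝑝𝑠𝑖 is pruned as soon as one of its required pattern edges maps to a data-vertex pair for which `might_contain` returns False.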
18
Evaluation
Evaluation - Comparing to MR solutions
 Afrati and SGIA-MR are the state-of-the-art MapReduce solutions.
 Ratios exceeding 100x are not visualized.
PSgL: 4302s
Afrati: 7291s
[Figure: speedup ratios of PSgL over Afrati and SGIA-MR on the triangle and square pattern graphs]
19
Evaluation
Evaluation - Comparing to GraphLab
Data Graph   Pattern Graph  Afrati   PowerGraph  PSgL
Twitter      𝑃𝐺1            432min   2min        12.5min
Wikipedia    𝑃𝐺1            871s     36s         125s
WikiTalk     𝑃𝐺2            4402s    48s         318s
WikiTalk     𝑃𝐺3            13743s   100s        494s
WikiTalk     𝑃𝐺3            13743s   OOM*        494s
WikiTalk     𝑃𝐺4            1785s    127s        38s
LiveJournal  𝑃𝐺4            2749s    OOM         1330s
* using a different traversal order.
[Figure: pattern graphs 𝑃𝐺1 (triangle) and 𝑃𝐺2–𝑃𝐺4 (4-vertex patterns)]
20
Conclusion
• Subgraph listing is a fundamental operation for massive graph
analysis.
• We propose an efficient parallel subgraph listing framework, PSgL.
• Various distribution strategies
• Cost model
• Light-weight global edge index
• The workload-aware distribution strategy can be extended to other
balance problems.
• A new execution engine is required for larger pattern graphs.
21
Thanks!
22
Backup Expr. – Scalability of PSgL
Performance vs. Worker Number
23
Backup Expr. – Initial pattern vertex
selection
LiveJournal
Random graph
Influences of the Initial Pattern Vertex on Various Data Graphs
24