Mizan: A System for Dynamic Load Balancing in Large-Scale Graph Processing
Presented by: Armend Hoxha, Trevor Hodde, Kexin Shi
A System for Dynamic Load Balancing

Several large-scale graph processing systems exist:
 Pregel
 HADI
 Pegasus
 X-RIME
We'll focus on Pregel, since Mizan is a refined Pregel system.
Mizan (Arabic): a double-pan scale
Pregel
 A scalable graph mining system
 Message passing based programming model
Pregel improved overall performance by 1-2 orders of magnitude over a traditional MapReduce implementation for a wide array of graph mining algorithms:
• PageRank
• Shortest paths problems
• Bipartite matching
• Semi-clustering
Existing implementations focus primarily
on graph partitioning as a preprocessing
step to balance computation.
Pregel is built on the BSP programming model
BSP = Bulk Synchronous Parallel
This picture is taken from the paper
Pregel is built on the BSP programming model
 Each vertex runs an algorithm and can send
messages asynchronously to any other vertex
 During each superstep vertices run in parallel
across a distributed infrastructure
 Each vertex processes incoming messages from
the previous superstep
 A vertex is said to be active if it is processing
and/or sending messages to other vertices
 This process continues until all vertices have no
messages to send – they all become inactive
This picture is taken from the paper
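To make the superstep cycle concrete, here is a minimal single-process sketch of the BSP loop in C++. The names (Vertex, compute, outbox) and the toy algorithm are illustrative assumptions, not the actual Pregel or Mizan API.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// Minimal single-process sketch of the BSP superstep loop described above.
// Names (Vertex, compute, outbox) are illustrative, not the real Pregel API.
struct Message { uint64_t target; double value; };

struct Vertex {
    uint64_t id;
    std::vector<uint64_t> out;    // outgoing edge targets
    bool active = true;
    std::vector<Message> outbox;  // messages produced this superstep

    // Toy algorithm: forward a running sum to all neighbors for three
    // supersteps, then vote to halt.
    void compute(const std::vector<double>& inbox, int superstep) {
        if (superstep >= 3) { active = false; return; }  // vote to halt
        double sum = 1.0;
        for (double v : inbox) sum += v;
        for (uint64_t t : out) outbox.push_back({t, sum});
    }
};

int main() {
    std::vector<Vertex> graph = {{0, {1}}, {1, {0}}};
    std::map<uint64_t, std::vector<double>> inbox;  // delivered at the barrier
    for (int step = 0;; ++step) {
        bool anyRan = false;
        for (auto& v : graph) {
            if (!v.active && inbox[v.id].empty()) continue;  // stays inactive
            v.active = true;  // an incoming message reactivates a vertex
            v.compute(inbox[v.id], step);
            anyRan = true;
        }
        inbox.clear();
        // Synchronization barrier: deliver messages for the next superstep.
        std::size_t sent = 0;
        for (auto& v : graph) {
            for (const auto& m : v.outbox) { inbox[m.target].push_back(m.value); ++sent; }
            v.outbox.clear();
        }
        if (!anyRan && sent == 0) break;  // all inactive, no messages: done
        std::cout << "superstep " << step << ": " << sent << " messages\n";
    }
}
```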
Graph Algorithms
 Stationary
An algorithm is stationary if its active vertices send and receive the same distribution of messages across supersteps. At the end of each superstep, all active vertices become inactive.
Examples: PageRank, diameter estimation, finding weakly connected components, etc.
            Superstep 1   Superstep 2   Superstep 3
Worker 1    ~N            ~N            ~N
Worker 2    ~N            ~N            ~N
Worker 3    ~N            ~N            ~N
No. Msg     ~3N           ~3N           ~3N
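As a concrete stationary example, here is a sketch of PageRank in a Pregel-style vertex model. The damping factor (0.85), the fixed iteration count (30), and all names are illustrative assumptions; the point is that every active vertex sends one message per out-edge in every superstep, matching the constant ~N per worker above.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of a stationary algorithm: PageRank in the Pregel vertex model.
// The damping factor (0.85) and fixed iteration count (30) are assumptions.
struct PageRankVertex {
    double rank = 1.0;
    std::vector<uint64_t> outEdges;

    void compute(const std::vector<double>& incoming, int superstep,
                 std::vector<std::pair<uint64_t, double>>& outbox) {
        if (superstep > 0) {
            double sum = 0.0;
            for (double v : incoming) sum += v;
            rank = 0.15 + 0.85 * sum;           // standard PageRank update
        }
        if (superstep < 30 && !outEdges.empty()) {
            double share = rank / outEdges.size();
            for (uint64_t t : outEdges)
                outbox.emplace_back(t, share);  // same fan-out every superstep
        }
        // After superstep 30 the vertex sends nothing and votes to halt,
        // so all vertices become inactive in the same superstep.
    }
};
```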
 Non-Stationary
An algorithm is non-stationary if the size of its outgoing messages changes across supersteps. Such variations can create workload imbalances across supersteps.
Examples: Distributed Minimal Spanning Tree construction (DMST), graph queries, and various simulations on social network graphs, such as advertisement propagation.
            Superstep 1   Superstep 2   Superstep 3
Worker 1    ~N            ~3N           ~5N
Worker 2    ~0.5N         ~4N           ~6N
Worker 3    ~0.5N         ~2N           ~4N
No. Msg     ~2N           ~9N           ~15N
PageRank (stationary) against DMST (non-stationary)
This picture is taken from the paper
Difference between stationary and non-stationary algorithms. Total represents the sum across all workers and Max represents the maximum amount on a single worker.
Both algorithms were run on a cluster of 21 machines and processed the same dataset (LiveJournal1). The input graph was partitioned using a hash function.
What causes imbalance in non-stationary algorithms?
There are two sources of imbalance:
- one originating from the graph structure, and
- another from the algorithm's behavior
This picture is taken from the paper
Graph partitioning
Common approaches to partitioning the data are:
 Hash-based and range-based
Divide the dataset using a simple heuristic: evenly distribute vertices across compute nodes, irrespective of their edge connectivity.
 Min-cuts
Considers vertex connectivity and partitions the data so that strongly connected vertices are placed close to each other (on the same compute node).
Both of these partitioning strategies result in graph-dependent performance.
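A minimal sketch of the two simple schemes in C++ (min-cuts requires a full partitioner such as METIS and is not shown); the function names are illustrative:

```cpp
#include <cstdint>
#include <functional>

// Sketch of the two simple partitioning heuristics. Both balance vertex
// counts while ignoring edge connectivity, which is why performance ends
// up depending on the graph's structure.

// Hash-based: scatter vertices pseudo-randomly across numWorkers.
int hashPartition(uint64_t vertexId, int numWorkers) {
    return static_cast<int>(std::hash<uint64_t>{}(vertexId) % numWorkers);
}

// Range-based: give each worker a contiguous block of the vertex ID space.
int rangePartition(uint64_t vertexId, uint64_t maxId, int numWorkers) {
    uint64_t blockSize = maxId / numWorkers + 1;
    return static_cast<int>(vertexId / blockSize);
}
```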
Graph partitioning
[Figure: run time in minutes of hash, range, and min-cuts partitioning for the LiveJournal1, kg4m68m, and arabic-2005 graphs]

G(N,E)         |N|          |E|
kg4m68m        4,194,304    68,671,566
LiveJournal1   4,847,571    68,993,773
arabic-2005    22,774,080   639,999,458

Because of its size, arabic-2005 couldn't be partitioned using min-cuts in the local cluster where the experiments were conducted.
Balanced Computation and Communication
Balancing the computation and communication is fundamental to the efficiency of a Pregel system.
Existing implementations of Pregel take one or more of the following five approaches to achieving a balanced workload:
1. Provide simple graph partitioning schemes, like hash- or range-based partitioning (e.g. Giraph)
2. Allow developers to set their own partitioning scheme or pre-partition the graph data (e.g. Pregel)
3. Provide more sophisticated partitioning techniques (e.g. GraphLab, GoldenOrb and Surfer use min-cuts)
4. Utilize distributed data stores and graph indexing on vertices and edges (e.g. GoldenOrb and Hama)
5. Perform coarse-grained load balancing (e.g. Pregel)
All of the approaches above assume that:
• The structure of the graph is static,
• The algorithm has predictable behavior, or
• The developer knows the runtime characteristics of the algorithm.
New Approaches Are Needed
From the experiments presented before, we saw that when a graph mining algorithm:
- has unpredictable communication needs,
- frequently changes the structure of the graph, or
- has variable computational complexity,
a single static solution is not enough.
We want to build a system that is:
1. Adaptive,
2. Agnostic to the graph structure, and
3. Requires no a priori knowledge of the behavior of the algorithm.
MIZAN
What is Mizan?
• A BSP (Bulk Synchronous Parallel) graph processing framework
  • Computation proceeds in synchronized supersteps over a partitioned graph, following the Pregel model described earlier
  • Open source, written in C++
• Extends the Pregel API
  • Pregel is a scalable system for "Large-Scale Graph Processing"
  • Known to improve MapReduce performance by 1-2 orders of magnitude
• Focuses on efficient load balancing across workers in a cluster
  • Two major steps in this process:
    • Monitoring
    • Migration planning
Monitoring (The easy part)
• Mizan monitors certain metrics for each vertex to keep the load balanced (a sketch follows below):
  • Number of outgoing messages to other nodes
    • Only messages sent to remote workers are counted, not messages rerouted within the same worker
  • Total incoming messages
    • This includes messages rerouted within the same worker, because queue size can affect performance
  • Response time
    • Measured from the time the vertex begins processing a message until it finishes
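A sketch of the per-vertex bookkeeping a worker might do for these three metrics; all type and function names are assumptions, not Mizan's actual implementation:

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>

// Sketch of per-vertex monitoring state; names are illustrative.
struct VertexStats {
    uint64_t outgoingRemote = 0;  // messages sent to other workers only
    uint64_t incomingAll    = 0;  // includes messages from the same worker
    double   responseTime   = 0;  // seconds spent computing this superstep
};

class WorkerMonitor {
public:
    void onSend(uint64_t vertex, bool remote) {
        if (remote) stats_[vertex].outgoingRemote++;  // local sends not counted
    }
    void onReceive(uint64_t vertex) { stats_[vertex].incomingAll++; }

    // Wrap a vertex's compute call to measure its response time.
    template <typename ComputeFn>
    void timedCompute(uint64_t vertex, ComputeFn fn) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        stats_[vertex].responseTime += dt.count();
    }

private:
    std::unordered_map<uint64_t, VertexStats> stats_;
};
```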
Migration Planning
• Objectives:
  • Decentralized
    • All planning happens in parallel across workers to improve performance
  • "Simple" <- the authors' words...not mine
  • Transparent
    • Extends the Pregel API but does not need to change it
    • Does not assume anything about the graph structure or the algorithm
Migration Planning (more)
• 5 easy steps to find the “strongest cause” of workload imbalance
among the three monitored metrics
• Provides a way to detect and (hopefully) prevent instability due to
workload imbalance
Migration Planning (the steps)
• Step 1: Identify the source of imbalance
• Compare the worker’s execution time against a normal distribution and flag
any outliers
• Remember: Mizan monitors each vertex for those 3 metrics
• Outgoing Messages to remote nodes
• All incoming messages
• Response time
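One way to flag such outliers is a z-score test over the per-worker execution times; this sketch and its threshold (one standard deviation) are illustrative assumptions, not the paper's exact criterion:

```cpp
#include <cmath>
#include <vector>

// Sketch of step 1: flag workers whose execution time is a statistical
// outlier against the (assumed normal) distribution of all workers.
std::vector<int> flagOutliers(const std::vector<double>& execTime) {
    if (execTime.empty()) return {};
    const double n = static_cast<double>(execTime.size());
    double mean = 0, var = 0;
    for (double t : execTime) mean += t;
    mean /= n;
    for (double t : execTime) var += (t - mean) * (t - mean);
    double stddev = std::sqrt(var / n);

    std::vector<int> outliers;
    for (std::size_t w = 0; w < execTime.size(); ++w)
        if (stddev > 0 && (execTime[w] - mean) / stddev > 1.0)  // assumed cutoff
            outliers.push_back(static_cast<int>(w));
    return outliers;
}
```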
Migration Planning (step 1 picture)
This picture is taken from the paper
Migration Planning (the steps)
• Step 1: Identify the source of imbalance
• Step 2: Select the migration objective
• Detects the largest cause of workload imbalance by comparing metrics for incoming and
outgoing messages for each worker with the worker’s execution time
• The result of this comparison will tell the system what needs optimization:
• Optimize outgoing messages
• Optimize incoming messages
• Optimize response time
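This comparison can be sketched as a correlation test: whichever message metric best tracks the execution-time imbalance becomes the objective, falling back to response time otherwise. The Pearson correlation and the 0.5 threshold are illustrative assumptions:

```cpp
#include <cmath>
#include <vector>

enum class Objective { OutgoingMessages, IncomingMessages, ResponseTime };

// Pearson correlation between a per-worker metric and execution time.
static double correlation(const std::vector<double>& x,
                          const std::vector<double>& y) {
    double n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i]; sxy += x[i] * y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
    }
    double denom = std::sqrt(n * sxx - sx * sx) * std::sqrt(n * syy - sy * sy);
    return denom == 0 ? 0 : (n * sxy - sx * sy) / denom;
}

// Sketch of step 2: pick the metric that best explains the execution-time
// imbalance; the correlation test and threshold are assumed illustrations.
Objective selectObjective(const std::vector<double>& outgoing,
                          const std::vector<double>& incoming,
                          const std::vector<double>& execTime) {
    double co = correlation(outgoing, execTime);
    double ci = correlation(incoming, execTime);
    if (co >= ci && co > 0.5) return Objective::OutgoingMessages;
    if (ci > co && ci > 0.5) return Objective::IncomingMessages;
    return Objective::ResponseTime;  // neither metric explains the imbalance
}
```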
Migration Planning (the steps)
• Step 1: Identify the source of imbalance
• Step 2: Select the migration objective
• Step 3: Pair over-utilized workers with under-utilized ones
• All workers create and execute the migration plan in parallel
• Each worker can be paired with up to one other worker
This picture is taken from the paper
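Pairing can be sketched as a greedy matching: sort workers by load and match the most loaded with the least loaded, each worker in at most one pair. Since every worker sees the same statistics, each can compute the identical plan locally, keeping the step decentralized. The greedy strategy is an illustrative assumption:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sketch of step 3: greedily pair the most over-utilized worker with the
// most under-utilized one, each worker appearing in at most one pair.
std::vector<std::pair<int, int>> pairWorkers(const std::vector<double>& load) {
    if (load.size() < 2) return {};
    std::vector<int> order(load.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return load[a] > load[b]; });

    std::vector<std::pair<int, int>> pairs;  // (overloaded, underloaded)
    std::size_t lo = 0, hi = order.size() - 1;
    while (lo < hi)
        pairs.emplace_back(order[lo++], order[hi--]);
    return pairs;
}
```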
Migration Planning (the steps)
• Step 1: Identify the source of imbalance
• Step 2: Select the migration objective
• Step 3: Pair over-utilized workers with under-utilized ones
• Step 4: Select vertices to migrate
• This is based on the migration objective determined by the previous steps
Migration Planning (step 4 explained)
• Example:
• Migration Objective: balance outgoing messages
• Solution: select vertices with a large number of outgoing messages and
migrate to an underutilized worker
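A sketch of that selection for the outgoing-messages objective: rank the overloaded worker's vertices by outgoing message count and take the heaviest senders until the surplus is covered. The stopping rule is an illustrative assumption:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of step 4 for the "balance outgoing messages" objective: pick the
// heaviest senders until the estimated load surplus is covered.
std::vector<uint64_t> selectVertices(
        std::vector<std::pair<uint64_t, uint64_t>> vertexOutMsgs,  // (id, count)
        uint64_t surplus) {
    std::sort(vertexOutMsgs.begin(), vertexOutMsgs.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    std::vector<uint64_t> chosen;
    uint64_t moved = 0;
    for (const auto& [id, count] : vertexOutMsgs) {
        if (moved >= surplus) break;  // enough load scheduled for migration
        chosen.push_back(id);
        moved += count;
    }
    return chosen;
}
```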
Migration Planning (the steps)
• Step 1: Identify the source of imbalance
• Step 2: Select the migration objective
• Step 3: Pair over-utilized workers with under-utilized ones
• Step 4: Select vertices to migrate
• Step 5: Migrate the vertices
• Challenges:
• Migrating vertices across workers while maintaining vertex ownership and fast updates
• Minimizing migration costs to ensure that a migration actually helps reduce workload
Maintaining Vertex Ownership
• Mizan uses a distributed hash table (DHT) to implement a lookup service:
  • Vertex v can execute at any worker
  • v's home worker ID = hash(v's ID) mod N
  • Workers ask the home worker of v for its current location
  • The home worker is notified of the new location as v migrates
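A sketch of that lookup service: the home worker is fixed by hashing the vertex ID and only records where the vertex currently lives. Names are illustrative:

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>

// Sketch of the DHT-based lookup service. Each vertex has a fixed "home"
// worker, derived from its ID, that tracks its current location as it
// migrates. Names are illustrative, not Mizan's actual types.
class LocationService {
public:
    explicit LocationService(int numWorkers) : numWorkers_(numWorkers) {}

    // The home worker never changes, even when the vertex migrates.
    int homeWorker(uint64_t vertexId) const {
        return static_cast<int>(std::hash<uint64_t>{}(vertexId) % numWorkers_);
    }

    // Run on the home worker when v migrates to a new worker.
    void updateLocation(uint64_t vertexId, int newWorker) {
        current_[vertexId] = newWorker;
    }

    // Run on the home worker to answer "where does v live now?".
    int lookup(uint64_t vertexId) const {
        auto it = current_.find(vertexId);
        return it != current_.end() ? it->second : homeWorker(vertexId);
    }

private:
    int numWorkers_;
    std::unordered_map<uint64_t, int> current_;  // vertexId -> worker
};
```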
Maintaining Vertex Ownership
This picture is taken from the paper
Vertices with Large Message Size
• Delayed migration is possible for vertices with very large message sizes
  • During the migration phase, only the ownership of vertex v is moved to its new worker
  • Then, at a later time, the actual vertex is moved to the new worker
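A sketch of the two-phase split: ownership (and thus message routing) moves during the migration phase, while the heavy vertex state follows later. All names are illustrative assumptions:

```cpp
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Sketch of delayed migration: ownership moves immediately so routing is
// correct, while the heavy vertex state is shipped later. Illustrative only.
struct VertexState {
    uint64_t id;
    std::vector<double> pendingMessages;  // may be very large
};

class DelayedMigrator {
public:
    // Phase 1 (during the migration phase): move only the ownership record,
    // so new messages for the vertex reach newWorker right away.
    void transferOwnership(uint64_t vertexId, int newWorker) {
        // ...notify the vertex's home worker of the new location here...
        deferred_.push({vertexId, newWorker});
    }

    // Phase 2 (at a later time, between supersteps): ship the actual vertex,
    // including its large message queue, to the new worker.
    void transferNextState() {
        if (deferred_.empty()) return;
        std::pair<uint64_t, int> next = deferred_.front();
        deferred_.pop();
        // ...serialize the VertexState for next.first, send it to worker
        // next.second (e.g. over MPI), then free the local copy...
        (void)next;
    }

private:
    std::queue<std::pair<uint64_t, int>> deferred_;  // (vertexId, newWorker)
};
```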
Vertices with Large Message Size
This picture is taken from the paper
Experimental Setup
• Mizan is implemented in C++ with MPI, with three variations:
• Static Mizan: Emulates Giraph, disables dynamic migration
• Work stealing (WS): Emulates Pregel's coarse-grained dynamic load balancing
• Mizan: Activates dynamic migration
Experimental Setup
• Runs on a local cluster of 21 machines:
  • Mix of i5 and i7 processors with 16GB RAM each
• IBM Blue Gene/P supercomputer with 1024 compute nodes:
  • Each has a 4-core PowerPC 450 CPU at 850MHz with 4GB RAM
Dataset
• Experiments on public datasets:
• Stanford Network Analysis Project (SNAP)
• The Laboratory for Web Algorithmics (LAW)
• Kronecker generator
This picture is taken from the paper
Giraph vs. Static Mizan
Both pictures are taken from the paper
Effectiveness of Dynamic Vertex Migration
This picture is taken from the paper
• PageRank on a social graph (LiveJournal1)
• The shaded columns: algorithm runtime
• The unshaded columns: initial partitioning cost
Effectiveness of Dynamic Vertex Migration
• Comparing work stealing (Pregel clone) vs. Mizan using DMST and an ad propagation simulation, on a METIS-partitioned social graph (LiveJournal1)
This picture is taken from the paper
Scalability of Mizan
This picture is taken from the paper
Scalability of Mizan
This picture is taken from the paper
Conclusion
• Mizan is a Pregel system that uses fine-grained vertex migration to load balance computation and communication across supersteps
• Mizan is an open source project developed within the InfoCloud group at KAUST in collaboration with IBM, written in C++ with MPI
• Mizan scales up to thousands of workers
• Mizan improves overall computation cost by between 40% and two orders of magnitude, with less than 10% migration overhead
References
• http://www.cs.cornell.edu/~djwill/pubs/mizan.pdf
• http://kowshik.github.io/JPregel/pregel_paper.pdf
• http://thegraphsblog.files.wordpress.com/2012/11/zpresentation.pdf
Thank you!