Managing Large Graphs on
Multi-Cores With Graph Awareness
Vijayan Prabhakaran, Ming Wu, Xuetian Weng, Frank McSherry, Lidong Zhou, Maya Haridasan
Microsoft Research
Motivation
• Tremendous increase in graph data and applications
– New class of graph applications that require real-time responses
– Even batch-processed workloads have strict time constraints
• Multi-core revolution
– Now the default on most machines
– Large-scale multi-cores with terabytes of main memory
– Can run workloads that are traditionally run on distributed systems
• Existing graph-processing systems lack support for both
Outline
• Overview: a high-level description of Grace, an in-memory graph management and processing system
• Details of optimizations: graph-specific and multi-core-specific
• Details on transactions: snapshots and transactional updates on graphs
• Subset of results: evaluation shows that the optimizations help Grace run several times faster than other alternatives
An Overview of Grace
Keeps an entire graph in memory, split into smaller parts.
Exposes a C-style API for writing graph workloads, iterative workloads, and updates:

v = GetVertex(Id)
for (i = 0; i < v.degree; i++)
    neigh = v.GetNeighbor(i)

[Figure: iterative programs (e.g., PageRank) drive the Grace API, with remote access over RPC]
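As a concrete illustration of that API style, here is a minimal, self-contained sketch of the traversal loop above in plain C, assuming a toy in-memory adjacency store; the Vertex struct and the GetVertex/GetNeighbor signatures are hypothetical stand-ins, not Grace's actual interface.

    #include <stdio.h>

    typedef struct Vertex {
        int  id;
        int  degree;      /* number of outgoing edges */
        int *neighbors;   /* ids of adjacent vertices */
    } Vertex;

    /* A toy 3-vertex graph standing in for Grace's in-memory store. */
    static int nA[] = {1, 2}, nB[] = {0}, nC[] = {0};
    static Vertex graph[] = { {0, 2, nA}, {1, 1, nB}, {2, 1, nC} };

    static Vertex *GetVertex(int id)             { return &graph[id]; }
    static Vertex *GetNeighbor(Vertex *v, int i) { return GetVertex(v->neighbors[i]); }

    int main(void) {
        Vertex *v = GetVertex(0);
        for (int i = 0; i < v->degree; i++)        /* the slide's loop */
            printf("neighbor of %d: %d\n", v->id, GetNeighbor(v, i)->id);
        return 0;
    }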
Graph and Multi-core Optimizations
Design driven by two trends:
- Graph-specific locality
- Partitionable and parallelizable workloads

[Figure: a graph (vertices A–E) split across Core 0 and Core 1, with cut edges crossing the network]
Data Structures
Vertex
Log
Edge
Log
A B C
Edges of A Edges of B Edges of C
A 0
Edge Pointer Array
B 1
C 2
1 1 1 0
Vertex Index Vertex Allocation Map
Data Structures in a Partition
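A rough sketch of how these per-partition structures could fit together, assuming fixed-size logs and small integer vertex ids; the field names and layout are illustrative guesses, not Grace's actual representation.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_V 1024
    #define MAX_E 4096

    typedef struct {
        uint64_t vertex_log[MAX_V];   /* vertex records, appended in order */
        uint64_t edge_log[MAX_E];     /* edge records, grouped per vertex  */
        uint32_t edge_ptr[MAX_V + 1]; /* edge pointer array: vertex slot i's
                                         edges live in edge_log[edge_ptr[i]
                                         .. edge_ptr[i+1]) */
        uint32_t vertex_index[MAX_V]; /* vertex id -> slot in vertex_log
                                         (assumes ids < MAX_V)             */
        uint8_t  alloc_map[MAX_V];    /* 1 if the slot holds a live vertex */
        uint32_t num_vertices, num_edges;
    } Partition;

    /* Append a vertex and its out-edges to the logs. */
    static void AddVertex(Partition *p, uint64_t vid,
                          const uint64_t *edges, uint32_t degree) {
        uint32_t slot = p->num_vertices++;
        p->vertex_log[slot]  = vid;
        p->vertex_index[vid] = slot;
        p->alloc_map[slot]   = 1;
        p->edge_ptr[slot]    = p->num_edges;
        for (uint32_t i = 0; i < degree; i++)
            p->edge_log[p->num_edges++] = edges[i];
        p->edge_ptr[slot + 1] = p->num_edges;
    }

    int main(void) {
        static Partition p;
        uint64_t ea[] = {1, 2};
        AddVertex(&p, 0, ea, 2);
        /* Edges of a vertex are contiguous in the edge log: */
        uint32_t s = p.vertex_index[0];
        for (uint32_t i = p.edge_ptr[s]; i < p.edge_ptr[s + 1]; i++)
            printf("edge 0 -> %llu\n", (unsigned long long)p.edge_log[i]);
        return 0;
    }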
Graph-Aware Partitioning & Placement
• Partitioning and placement – are they useful on a single machine?
– Yes, to take advantage of multi-cores and memory hierarchies
• Solve them using graph partitioning algorithms
– Divide a graph into sub-graphs, minimizing edge-cuts
• Grace provides an extensible library
– Graph-aware: heuristic-based, spectral partitioning, Metis
– Graph-agnostic: hash partitioning
• Achieve a better layout by recursive graph partitioning (see the sketch below)
– Recursively run graph partitioning until a sub-graph fits in a cache line
– Recompose all the sub-graphs to get the vertex layout
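A minimal sketch of that recursive layout step, where the Bisect() placeholder stands in for any real graph-aware partitioner (e.g., Metis) and a fixed number of vertex records is assumed to fit in one cache line.

    #include <stdio.h>

    #define CACHE_LINE_VERTICES 8   /* assumed vertices per cache line */

    /* Placeholder bisection: a real one would minimize edge cuts. */
    static int Bisect(int *vs, int n) { (void)vs; return n / 2; }

    /* Append vertex ids to `layout` in recursive-bisection order. */
    static void LayoutRecursive(int *vs, int n, int *layout, int *pos) {
        if (n <= CACHE_LINE_VERTICES) {   /* sub-graph fits a cache line */
            for (int i = 0; i < n; i++)
                layout[(*pos)++] = vs[i];
            return;
        }
        int cut = Bisect(vs, n);          /* split the sub-graph */
        LayoutRecursive(vs, cut, layout, pos);
        LayoutRecursive(vs + cut, n - cut, layout, pos);
    }

    int main(void) {
        int vs[20], layout[20], pos = 0;
        for (int i = 0; i < 20; i++) vs[i] = i;
        LayoutRecursive(vs, 20, layout, &pos);
        for (int i = 0; i < 20; i++) printf("%d ", layout[i]);
        printf("\n");
        return 0;
    }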
Platform for Parallel Iterative Computations
The iterative computation platform implements the "bulk synchronous parallel" (BSP) model, sketched below.
[Figure: each iteration runs parallel computations, then propagates updates, then waits at a barrier before the next iteration begins]
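A minimal sketch of such a BSP loop using POSIX threads (compile with -pthread); the per-partition compute and update functions are hypothetical placeholders, not Grace's platform code.

    #include <pthread.h>
    #include <stdio.h>

    #define WORKERS 4
    #define ITERATIONS 3

    static pthread_barrier_t barrier;

    static void ComputeOnPartition(int part, int iter) {
        printf("iter %d: worker %d computing\n", iter, part);
    }

    static void PropagateUpdates(int part) { (void)part; /* ship updates */ }

    static void *Worker(void *arg) {
        int part = (int)(long)arg;
        for (int iter = 0; iter < ITERATIONS; iter++) {
            ComputeOnPartition(part, iter);  /* parallel computation       */
            PropagateUpdates(part);          /* propagate updates          */
            pthread_barrier_wait(&barrier);  /* all workers sync per iter  */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[WORKERS];
        pthread_barrier_init(&barrier, NULL, WORKERS);
        for (long i = 0; i < WORKERS; i++)
            pthread_create(&t[i], NULL, Worker, (void *)i);
        for (int i = 0; i < WORKERS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }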
Load Balancing and Updates Batching
Problem 1: Overloaded partitions can affect performance.
Solution 1: Load balancing is implemented by sharing a portion of an overloaded partition's vertices.

Problem 2: Updates applied in arbitrary order can increase cache misses.
Solution 2: Updates batching is implemented by
- grouping updates by their destination partition
- issuing updates in a round-robin fashion (see the sketch below)
[Figure: cache lines holding vertices A–D spread across Part0, Part1, and Part2, assigned to Core0, Core1, and Core2, separated by a barrier]
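A minimal sketch of updates batching under these two rules, assuming a toy modulo vertex-to-partition map; the round-robin drain starts one partition past the caller's own so that concurrently draining workers tend to hit different destinations first. All names and sizes are illustrative.

    #include <stdio.h>

    #define PARTS 3
    #define MAX_UPDATES 16

    typedef struct { int dst_vertex; double value; } Update;

    typedef struct {
        Update q[PARTS][MAX_UPDATES];  /* one batch per destination partition */
        int    len[PARTS];
    } Batches;

    static int PartOf(int vertex) { return vertex % PARTS; } /* assumed map */

    static void Enqueue(Batches *b, Update u) {
        int p = PartOf(u.dst_vertex);          /* group by destination */
        b->q[p][b->len[p]++] = u;
    }

    static void Drain(Batches *b, int self) {
        /* Round-robin: start from the partition after our own. */
        for (int k = 1; k <= PARTS; k++) {
            int p = (self + k) % PARTS;
            for (int i = 0; i < b->len[p]; i++)
                printf("apply %.2f to vertex %d (partition %d)\n",
                       b->q[p][i].value, b->q[p][i].dst_vertex, p);
            b->len[p] = 0;
        }
    }

    int main(void) {
        Batches b = {0};
        Enqueue(&b, (Update){4, 0.50});
        Enqueue(&b, (Update){1, 0.25});
        Enqueue(&b, (Update){7, 0.10});
        Drain(&b, 0);
        return 0;
    }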
Transactions on Graphs
Grace supports structural changes to a graph
BeginTransaction()
AddVertex(X)
AddEdge(X, Y)
EndTransaction()
Transactions use snapshot isolation:
- Instantaneous snapshots using copy-on-write (CoW) techniques (sketched below)
- CoW can affect the careful memory layout!
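A single-threaded sketch of the copy-on-write idea behind these snapshots: readers keep a pointer to the old version while a transaction mutates a private copy that is published at commit. The types and functions here are illustrative, not Grace's snapshot machinery.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NV 4

    typedef struct { double rank[NV]; } VertexData;

    static VertexData *current;   /* version visible to readers */

    static VertexData *BeginTransaction(void) {
        VertexData *copy = malloc(sizeof *copy);  /* CoW: private copy */
        memcpy(copy, current, sizeof *copy);
        return copy;
    }

    static void EndTransaction(VertexData *copy) {
        current = copy;   /* publish the new version; old snapshots stay valid */
    }

    int main(void) {
        current = calloc(1, sizeof *current);
        VertexData *snapshot = current;   /* reader pins the old version  */
        VertexData *tx = BeginTransaction();
        tx->rank[0] = 1.0;                /* writer mutates only its copy */
        EndTransaction(tx);
        printf("snapshot sees %.1f, committed version sees %.1f\n",
               snapshot->rank[0], current->rank[0]);
        free(snapshot);
        free(tx);
        return 0;
    }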
Evaluation
Graphs:
- Web (v:88M, e:275M), sparse
- Orkut (v:3M, e:223M), dense
Workloads:
- N-hop-neighbor queries, BFS, DFS, PageRank, Weakly Connected Components, Shortest Path
Architecture:
- Intel Xeon: 12 cores (2 chips with 6 cores each)
- AMD Opteron: 48 cores (4 chips with 12 cores each)
Questions:
- How well do partitioning and placement work?
- How useful are load balancing and updates batching?
- How does Grace compare to other systems?
Partitioning and Placement Performance On Intel
[Chart: PageRank speedup vs. number of partitions (3, 6, 12) on the Orkut and Web graphs, comparing Hash, Heuristic, and Metis partitioning, each with and without careful placement]
Observation: Careful arrangement of vertices (placing neighboring vertices close together) improves performance significantly, and works better when a graph partitioning algorithm is used.
Reason: Graph partitioning puts neighbors under the same partition, minimizing communication cost.

Observation: For a smaller number of partitions, the partitioning algorithm didn't make a big difference for sparse graphs.
Reason: All partitions fit within the cores of a single chip; the reduced L1, L2, and L3 cache and Data-TLB misses are helping better placement.
Load Balancing and Updates Batching On Intel
[Chart: PageRank speedup on 3, 6, and 12 partitions of the Orkut and Web graphs, with load balancing alone and with load balancing + updates batching]
Observation: Load balancing and updates batching give a better performance improvement for the Orkut graph.
Reason: Updates batching reduces remote cache accesses.

Observation: Batching updates didn't improve performance for the web graph.
Reason: Sparse graphs can be partitioned better, so there are fewer updates to send.
[Chart: retired load operations served by a sibling core vs. a remote chip, with load balancing alone and with load balancing + updates batching]
Comparing Grace, BDB, and Neo4j
[Chart: running time in seconds (log scale, 0.1 to 10000) of the workloads on BDB, Neo4j, and Grace]
Conclusion
Grace explores graph-specific and multi-core-specific optimizations
What worked and what didn’t (in our setup; your mileage
might differ)
– Careful vertex placement in memory gave good
improvements
– Partitioning and updates batching worked in most
cases, but not always
– Load balancing wasn’t as useful