Managing Large Graphs on Multi-Cores With Graph Awareness
Vijayan, Ming, Xuetian, Frank, Lidong, Maya
Microsoft Research

Motivation
• Tremendous increase in graph data and applications
  – New class of graph applications that require real-time responses
  – Even batch-processed workloads have strict time constraints
• Multi-core revolution
  – The default standard on most machines
  – Large-scale multi-cores with terabytes of main memory
  – Run workloads that were traditionally run on distributed systems
• Existing graph-processing systems lack support for both

Outline
• A high-level description of Grace
  – Grace is an in-memory graph management and processing system
• Details of optimizations
  – Implements several optimizations, both graph-specific and multi-core-specific
• Details on transactions
  – Supports snapshots and transactional updates on graphs
• Subset of results
  – Evaluation shows that the optimizations help Grace run several times faster than the alternatives

An Overview of Grace
• Keeps an entire graph in memory, divided into smaller parts
• Exposes a C-style API for writing graph workloads, iterative workloads, and updates (a traversal written against this API is sketched below):

    v = GetVertex(Id);
    for (i = 0; i < v.degree; i++)
        neigh = v.GetNeighbor(i);

• Design driven by two trends:
  – Graph-specific locality
  – Partitionable and parallelizable workloads
[Figure: iterative programs (e.g., PageRank) run on top of the Grace API, which sits on the graph and multi-core optimizations layer; an example graph with vertices A–E is split across Core 0 and Core 1 and connected through RPC over the network]

Data Structures
[Figure: data structures in a partition — a vertex log (A, B, C) and an edge log (edges of A, edges of B, edges of C), linked by an edge pointer array; a vertex index mapping A→0, B→1, C→2; and a vertex allocation map (1 1 1 0); a code sketch of this layout appears below]

Graph-Aware Partitioning & Placement
• Partitioning and placement: are they useful on a single machine?
  – Yes, to take advantage of multi-cores and memory hierarchies
• Solve them using graph partitioning algorithms
  – Divide a graph into sub-graphs, minimizing edge cuts
• Grace provides an extensible library
  – Graph-aware: heuristic-based, spectral partitioning, Metis
  – Graph-agnostic: hash partitioning
• Achieve a better layout by recursive graph partitioning (see the sketch below)
  – Recursively partition until a sub-graph fits in a cache line
  – Recompose all the sub-graphs to get the vertex layout

Platform for Parallel Iterative Computations
• The iterative computation platform implements the bulk synchronous parallel (BSP) model (see the sketch below)
[Figure: iteration 1 — parallel computations, then propagated updates, then a barrier; iteration 2 follows the same pattern]

Load Balancing and Updates Batching
• Problem 1: overloaded partitions can hurt performance
  – Solution: load balancing, implemented by sharing a portion of a partition's vertices with other cores
• Problem 2: updates issued in arbitrary order can increase cache misses
  – Solution: updates batching (see the sketch below), implemented by
    – grouping updates by their destination part
    – issuing updates in a round-robin fashion
[Figure: vertices A–D packed into a cache line; parts Part0, Part1, Part2 assigned to Core0, Core1, Core2, synchronized at a barrier]

Transactions on Graphs
• Grace supports structural changes to a graph:

    BeginTransaction();
    AddVertex(X);
    AddEdge(X, Y);
    EndTransaction();

• Transactions use snapshot isolation
  – Instantaneous snapshots using copy-on-write (CoW) techniques
  – CoW can affect the careful memory layout!

Evaluation
• Graphs:
  – Web (v: 88M, e: 275M), sparse
  – Orkut (v: 3M, e: 223M), dense
• Workloads:
  – N-hop-neighbor queries, BFS, DFS, PageRank, weakly connected components, shortest path
• Architecture:
  – Intel Xeon, 12 cores: 2 chips with 6 cores each
  – AMD Opteron, 48 cores: 4 chips with 12 cores each
• Questions:
  – How well do partitioning and placement work?
  – How useful are load balancing and updates batching?
  – How does Grace compare to other systems?
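To make the C-style API from the overview slide concrete, here is a breadth-first search written against the GetVertex/GetNeighbor calls shown there. This is a minimal sketch in C++; the Vertex type and the exact signatures are our assumptions, not Grace's published API.

    #include <queue>
    #include <unordered_set>

    // Assumed shape of the C-style API from the overview slide;
    // Grace's real signatures may differ.
    struct Vertex {
        int degree;
        int GetNeighbor(int i);  // id of the i-th neighbor
    };
    Vertex GetVertex(int id);

    // Breadth-first search from a root vertex using only the slide's calls.
    void Bfs(int root) {
        std::queue<int> frontier;
        std::unordered_set<int> visited{root};
        frontier.push(root);
        while (!frontier.empty()) {
            Vertex v = GetVertex(frontier.front());
            frontier.pop();
            for (int i = 0; i < v.degree; i++) {
                int n = v.GetNeighbor(i);
                if (visited.insert(n).second)  // true if newly visited
                    frontier.push(n);
            }
        }
    }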
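The per-partition layout on the data-structures slide translates roughly into the following C++. Every name here (PartitionData, edge_ptr, alloc_map, and so on) is our own illustration of the vertex log, edge log, edge pointer array, vertex index, and vertex allocation map, not code from Grace.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Illustrative per-partition layout: vertices and edges live in two
    // contiguous logs so neighboring vertices can be placed close together.
    struct PartitionData {
        std::vector<uint64_t> vertex_log;  // vertex records in layout order (A, B, C, ...)
        std::vector<uint64_t> edge_log;    // edge lists back to back (edges of A, of B, of C)
        std::vector<size_t> edge_ptr;      // edge pointer array: slot -> start offset in edge_log
        std::unordered_map<uint64_t, size_t> vertex_index;  // vertex id -> slot (A->0, B->1, C->2)
        std::vector<bool> alloc_map;       // vertex allocation map: which slots are live (1 1 1 0)

        // A vertex's degree falls out of two consecutive edge pointers.
        size_t Degree(size_t slot) const {
            size_t end = slot + 1 < edge_ptr.size() ? edge_ptr[slot + 1] : edge_log.size();
            return end - edge_ptr[slot];
        }
    };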
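Recursive partitioning for vertex placement can be sketched as below. Bisect and FitsInCacheLine are placeholders for the library's real cut (heuristic, spectral, or Metis) and size check; the cache-line and record sizes are assumed values.

    #include <cstddef>
    #include <utility>
    #include <vector>

    using VertexSet = std::vector<int>;

    // Placeholder cut: a real run would use the library's heuristic, spectral,
    // or Metis partitioner to minimize edge cuts; here we just split in half.
    std::pair<VertexSet, VertexSet> Bisect(const VertexSet& vs) {
        VertexSet a(vs.begin(), vs.begin() + vs.size() / 2);
        VertexSet b(vs.begin() + vs.size() / 2, vs.end());
        return {a, b};
    }

    // Placeholder size check with assumed cache-line and record sizes.
    bool FitsInCacheLine(const VertexSet& vs) {
        const size_t kCacheLineBytes = 64, kRecordBytes = 8;
        return vs.size() * kRecordBytes <= kCacheLineBytes;
    }

    // Recursively bisect until a sub-graph fits in a cache line, then
    // recompose the sub-graphs left to right to get the vertex layout.
    void LayoutVertices(const VertexSet& vs, std::vector<int>& layout) {
        if (FitsInCacheLine(vs)) {
            layout.insert(layout.end(), vs.begin(), vs.end());
            return;
        }
        auto halves = Bisect(vs);
        LayoutVertices(halves.first, layout);
        LayoutVertices(halves.second, layout);
    }

The recursion is what yields locality: vertices that survive many cuts together end up adjacent in the final order, so a cache line tends to hold mutual neighbors.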
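The BSP loop on the iterative-computations slide could be realized along these lines (C++20). ComputeOnPartition and PropagateUpdates are hypothetical stand-ins for the user program's per-iteration step and Grace's update propagation.

    #include <barrier>
    #include <thread>
    #include <vector>

    // Hypothetical per-partition steps; in Grace these would run the user's
    // iterative program (e.g., one PageRank step) and ship updates to peers.
    void ComputeOnPartition(int part) { /* ... */ }
    void PropagateUpdates(int part)   { /* ... */ }

    // Illustrative BSP skeleton, not Grace's actual API: each core computes
    // on its partition, propagates updates, then blocks at a barrier so all
    // parts finish iteration i before any part starts iteration i+1.
    void RunBsp(int num_parts, int num_iters) {
        std::barrier sync(num_parts);
        std::vector<std::jthread> workers;
        for (int p = 0; p < num_parts; ++p)
            workers.emplace_back([&, p] {
                for (int it = 0; it < num_iters; ++it) {
                    ComputeOnPartition(p);
                    PropagateUpdates(p);
                    sync.arrive_and_wait();  // iteration boundary
                }
            });
    }  // jthreads join on destruction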
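Updates batching, as described on the load-balancing slide, groups outgoing updates by destination part and issues the batches round-robin. A sketch with hypothetical Update and SendBatch types:

    #include <vector>

    struct Update { int dst_vertex; double value; };  // hypothetical update record

    // Hypothetical delivery of one batch to a destination partition's inbox.
    void SendBatch(int dst_part, const std::vector<Update>& batch) { /* ... */ }

    // Group updates by destination part, then issue one batch per part in
    // round-robin order so the cores don't all hit the same part at once.
    void FlushUpdates(int my_part, int num_parts,
                      const std::vector<Update>& pending,
                      int (*PartOf)(int vertex)) {
        std::vector<std::vector<Update>> batches(num_parts);
        for (const Update& u : pending)
            batches[PartOf(u.dst_vertex)].push_back(u);  // group by destination
        for (int i = 1; i <= num_parts; ++i) {           // start after my own part
            int dst = (my_part + i) % num_parts;
            SendBatch(dst, batches[dst]);
        }
    }

PartOf would be whichever partitioner is in use, for example a hash function or a lookup produced by the graph-aware library above.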
Partitioning and Placement Performance on Intel
[Figure: PageRank speedup for 3, 6, and 12 partitions on the Orkut graph (y-axis 0–12) and the Web graph (y-axis 0–30), comparing Hash, Heuristic, and Metis partitioning, each with and without placement]
• Observations:
  – For a smaller number of partitions, the partitioning algorithm didn't make a big difference
  – Careful vertex arrangement works better when graph partitioning is used
  – Placing neighboring vertices close together improves performance significantly for sparse graphs
• Reasons:
  – Graph partitioning puts neighbors under the same part, minimizing communication cost
  – All partitions fit within the cores of a single chip; reduced L1, L2, and L3 cache and data-TLB misses help the better placement

Load Balancing and Updates Batching on Intel
[Figure: PageRank speedup (y-axis 0–2) for 3, 6, and 12 partitions on the Orkut and Web graphs, and retired load instructions (y-axis 0–1500) served by a sibling core vs. a remote chip, each with load balancing alone and with load balancing plus updates batching]
• Observations:
  – Batching updates gives a better performance improvement for the Web graph
  – Load balancing and updates batching didn't improve performance for the Orkut graph
• Reasons:
  – Updates batching reduces remote cache accesses
  – Sparse graphs can be partitioned better, and there are fewer updates to send

Comparing Grace, BDB, and Neo4j
[Figure: running time in seconds on a log scale (0.1 to 10,000 s) for BDB, Neo4j, and Grace]

Conclusion
• Grace explores graph-specific and multi-core-specific optimizations
• What worked and what didn't (in our setup; your mileage may differ):
  – Careful vertex placement in memory gave good improvements
  – Partitioning and updates batching worked in most cases, but not always
  – Load balancing wasn't as useful