A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing
Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo, Onur Mutlu+, Kiyoung Choi
Seoul National University   *Oracle Labs   +Carnegie Mellon University

Graphs
• Abstract representation of object relationships
– Vertex: object (e.g., person, article, …)
– Edge: relationship (e.g., friendship, hyperlink, …)
• Recent trend: explosive increase in graph size
– 36 million Wikipedia pages, 300 million Twitter users, 1.4 billion Facebook users, 30 billion Instagram photos

Large-Scale Graph Processing
• Example: Google's PageRank
  R[w] = α + Σ_{v : w ∈ Succ(v)} w_{vw} · R[v]
• The update of each vertex is independent of the others → vertex-parallel abstraction

  for (v: graph.vertices) {
    for (w: v.successors) {
      w.next_rank += weight * v.rank;
    }
  }
  for (v: graph.vertices) {
    v.rank = v.next_rank;
    v.next_rank = alpha;
  }
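The kernel above is pseudocode. As a rough illustration only, here is a minimal single-threaded C++ sketch of the same push-style iteration; the Graph type, the fixed 10-iteration loop, and the uniform edge weight (1 - alpha) / |Succ(v)| are assumptions of this sketch, not details from the talk.

  #include <cstdio>
  #include <vector>

  // Illustrative minimal graph: adjacency lists of successor indices.
  struct Graph {
      std::vector<std::vector<int>> succ;  // succ[v] = successors of v
  };

  int main() {
      // Tiny 3-vertex example: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
      Graph g{{{1, 2}, {2}, {0}}};
      const int n = static_cast<int>(g.succ.size());
      const double alpha = 0.15;  // the constant the slide calls alpha
      std::vector<double> rank(n, 1.0 / n), next_rank(n, alpha);

      for (int iter = 0; iter < 10; ++iter) {
          // Scatter phase: each vertex pushes weighted rank to its successors.
          for (int v = 0; v < n; ++v) {
              const double weight = (1.0 - alpha) / g.succ[v].size();  // assumed w_vw
              for (int w : g.succ[v])
                  next_rank[w] += weight * rank[v];
          }
          // Commit phase: publish next_rank and reset the accumulators.
          for (int v = 0; v < n; ++v) {
              rank[v] = next_rank[v];
              next_rank[v] = alpha;
          }
      }
      for (int v = 0; v < n; ++v)
          std::printf("rank[%d] = %.4f\n", v, rank[v]);
      return 0;
  }

Note how every w.next_rank update in the scatter phase is a fine-grained, data-dependent access; this is exactly the access pattern the rest of the talk targets.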
PageRank Performance
[Chart: PageRank speedup, normalized to 32 cores with DDR3 (102.4 GB/s) — 128 cores with DDR3 (102.4 GB/s): +42%; 128 cores with HMC (640 GB/s): +89%; 128 cores exploiting HMC internal bandwidth (8 TB/s): 5.3x]
• Lessons learned:
1. High memory bandwidth is the key to the scalability of graph processing
2. Conventional systems do not fully utilize high memory bandwidth

Bottleneck of Graph Processing
• The inner loop (w.next_rank += weight * v.rank) exhibits:
1. Frequent random memory accesses: following &w to w.rank, w.next_rank, w.edges, …
2. Little computation: one multiply-accumulate (weight * v.rank) per edge
→ High memory bandwidth demand

Challenges in Scalable Graph Processing
• Challenge 1: How to provide high memory bandwidth to computation units in a practical way?
– Processing-in-memory based on 3D-stacked DRAM
• Challenge 2: How to design computation units that efficiently exploit large memory bandwidth?
– Specialized in-order cores called Tesseract cores
• Latency-tolerant programming model
• Graph-processing-specific prefetching schemes

Tesseract System
[Diagram: a host processor accesses Tesseract through a memory-mapped accelerator interface (noncacheable, physically addressed); each vault pairs a DRAM controller with an in-order core, a list prefetcher (LP), a prefetch (PF) buffer, a message-triggered prefetcher (MTP), a message queue, and a network interface (NI), all connected by a crossbar network]
• Cores communicate via remote function calls

Communications in Tesseract
• In the inner loop, vertex v and its successor w may reside in different vaults (e.g., v in Vault #1, w in Vault #2), so the update of w.next_rank is expressed as a non-blocking remote function call (see the sketch after the Benefits slide below):

  for (v: graph.vertices) {
    for (w: v.successors) {
      put(w.id, function() { w.next_rank += weight * v.rank; });
    }
  }
  barrier();

• Execution of a put() can be delayed until the nearest barrier

Non-blocking Remote Function Call
1. The local core's NI sends the function address and arguments (&func, &w, value) to the remote core
2. The remote core stores the incoming message in its message queue (MQ)
3. The message queue is flushed when it is full or a synchronization barrier is reached

Benefits of Non-blocking Remote Function Call
• Latency hiding through fire-and-forget: local cores are not blocked by remote function calls
• Localized memory traffic: no off-chip traffic during remote function call execution
• No need for mutexes: non-blocking remote function calls are atomic
• Prefetching: covered shortly
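To make the put()/barrier model above concrete, here is a minimal single-threaded C++ sketch of its semantics, assuming per-vault software queues of closures, an illustrative vault count, and a modulo id-to-vault mapping; none of these names or choices are Tesseract's actual hardware interface.

  #include <cstdio>
  #include <functional>
  #include <utility>
  #include <vector>

  constexpr int kVaults = 4;  // illustrative vault count

  // One message queue per vault; the real hardware buffers these in the MQ
  // and flushes when the queue fills or a barrier is reached.
  std::vector<std::function<void()>> message_queue[kVaults];

  int vault_of(int id) { return id % kVaults; }  // assumed id-to-vault mapping

  // Non-blocking remote function call: enqueue and return immediately.
  void put(int id, std::function<void()> fn) {
      message_queue[vault_of(id)].push_back(std::move(fn));
  }

  // Barrier: drain every queue; each vault runs its messages one at a time
  // (sequentially here), which is why no mutex is needed on the shared data.
  void barrier() {
      for (auto& q : message_queue) {
          for (auto& fn : q) fn();
          q.clear();
      }
  }

  int main() {
      std::vector<double> next_rank(8, 0.0);
      const double contribution = 0.25;  // stands in for weight * v.rank
      for (int w = 0; w < 8; ++w)
          put(w, [&next_rank, w, contribution] { next_rank[w] += contribution; });
      barrier();  // all updates become visible here
      std::printf("next_rank[3] = %.2f\n", next_rank[3]);
      return 0;
  }

The fire-and-forget property is visible in the sketch: put() does no work at the call site, so the caller proceeds immediately, and all remote updates are deferred to the barrier.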
Tesseract System: Prefetching
[The same system diagram, now highlighting the prefetch units (LP, PF buffer, MTP) next to each in-order core]

Memory Access Patterns in Graph Processing
• Sequential (local): scanning v and its successor list within a vault
• Random (remote): accesses through &w to vertices in other vaults

Message-Triggered Prefetching
• Prefetching random memory accesses is difficult
• Opportunities in Tesseract:
– Domain-specific knowledge of the target data address
– Time slack of non-blocking remote function calls, whose execution can be delayed until the nearest barrier
• put() is extended with the address the function will touch:

  put(w.id, function() { w.next_rank += value; }, &w.next_rank);

• Operation:
1. Message M1 is received
2. A prefetch of &w.next_rank is requested
3. M1 is marked ready when the prefetch is serviced
4. Multiple ready messages (M1, M2, M3, …) are processed at once
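A minimal C++ sketch of this four-step flow, modeling the hardware in software: Message, prefetch_buffer, receive, and drain are names invented for the sketch, and the prefetch itself is only simulated by recording the hinted address.

  #include <cstdio>
  #include <deque>
  #include <functional>
  #include <unordered_set>

  // A remote-function-call message annotated with the address it will access,
  // i.e. the third argument of put(w.id, fn, &w.next_rank) above.
  struct Message {
      std::function<void()> fn;
      const void* target;   // prefetch hint supplied by the put() call
      bool ready = false;   // set once the prefetch is serviced
  };

  std::unordered_set<const void*> prefetch_buffer;  // stands in for the 4KB PF buffer
  std::deque<Message> queue;

  // Steps 1-2: on message arrival, request a prefetch of its target.
  void receive(Message m) {
      prefetch_buffer.insert(m.target);  // simulates issuing the DRAM prefetch
      queue.push_back(std::move(m));
  }

  // Steps 3-4: mark messages ready once their data has arrived, then process
  // the ready ones back to back so each one hits in the prefetch buffer.
  void drain() {
      for (auto& m : queue)
          if (prefetch_buffer.count(m.target)) m.ready = true;
      for (auto& m : queue)
          if (m.ready) m.fn();
      queue.clear();
      prefetch_buffer.clear();
  }

  int main() {
      double next_rank = 0.0;
      receive({[&] { next_rank += 0.5; }, &next_rank});
      receive({[&] { next_rank += 0.25; }, &next_rank});
      drain();
      std::printf("next_rank = %.2f\n", next_rank);
      return 0;
  }

The point of the batching in drain() is timeliness: because a message may legally wait until the barrier, its data can be fetched long before the function runs, turning random remote accesses into buffer hits.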
Other Features of Tesseract
• Blocking remote function calls
• List prefetching
• Prefetch buffer
• Programming APIs
• Application mapping
Please see the paper for details.

Evaluated Systems
• DDR3-OoO (with FDP): 32 out-of-order cores (8 per socket × 4 sockets) @ 4 GHz, 102.4 GB/s DDR3
• HMC-OoO (with FDP): 32 out-of-order cores (8 per socket × 4 sockets) @ 4 GHz, 640 GB/s HMC
• HMC-MC: 512 in-order cores (128 per socket × 4 sockets) @ 2 GHz, 640 GB/s HMC
• Tesseract (32-entry MQ, 4 KB PF buffer): 32 in-order Tesseract cores @ 2 GHz per cube, 8 TB/s internal memory bandwidth

Workloads
• Five graph processing algorithms:
– Average teenage follower
– Conductance
– PageRank
– Single-source shortest path
– Vertex cover
• Three real-world large graphs (4–7M vertices, 79–194M edges):
– ljournal-2008 (social network)
– enwiki-2013 (Wikipedia)
– indochina-2004 (web graph)

Performance
[Chart: speedup over DDR3-OoO — HMC-OoO: +25%; HMC-MC: +56%; Tesseract: 9.0x; Tesseract-LP (+ list prefetching): 11.6x; Tesseract-LP-MTP (+ message-triggered prefetching): 13.8x]

Memory Bandwidth Consumption
[Chart: memory bandwidth — DDR3-OoO: 80 GB/s; HMC-OoO: 190 GB/s; HMC-MC: 243 GB/s; Tesseract: 1.3 TB/s; Tesseract-LP: 2.2 TB/s; Tesseract-LP-MTP: 2.9 TB/s]

Iso-Bandwidth Comparison
[Chart: speedup over HMC-MC — HMC-MC + PIM bandwidth (8 TB/s): 2.3x; Tesseract (no prefetching) + conventional bandwidth (640 GB/s): 3.0x; Tesseract (no prefetching): 6.5x]
• Bandwidth alone buys 2.3x; the Tesseract programming model alone buys 3.0x; together they yield 6.5x

Execution Time Breakdown
[Chart: execution time of AT.LJ, CT.LJ, PR.LJ, SP.LJ, and VC.LJ broken into normal mode, interrupt mode, interrupt switching, network, and barrier]
• Network backpressure due to message passing (future work: message combiner, …)
• Workload imbalance (future work: load balancing)

Prefetch Efficiency
• Coverage: 88% of demand accesses hit in the prefetch buffer; the remaining 12% are demand L1 misses
• Timeliness: an ideal configuration where prefetches take zero cycles is only 1.04x faster than Tesseract with prefetching

Scalability
[Chart: speedup within each system — DDR3-OoO: 32 → 128 cores yields +42%; Tesseract with prefetching: 32 cores (8 GB) → 128 cores (32 GB) yields 4x, → 512 cores (128 GB) yields 12x, and 13x with partitioning]
• Memory-capacity-proportional performance

Memory Energy Consumption
[Chart: energy of memory layers, logic layers, and cores, normalized to HMC-OoO — Tesseract with prefetching: −87%]

Conclusion
• Revisiting the PIM concept in a new context
– Cost-effective 3D integration of logic and memory
– Graph processing workloads demanding high memory bandwidth
• Tesseract: scalable PIM for graph processing
– Many in-order cores in a memory chip
– New message passing mechanism for latency hiding
– New hardware prefetchers for graph processing
– Programming interface that exploits our hardware design
• Evaluations demonstrate the benefits of Tesseract
– 14x performance improvement & 87% energy reduction
– Scalable: memory-capacity-proportional performance