A Scalable Processing-in-Memory
Accelerator for Parallel Graph Processing
Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo,
Onur Mutlu+, Kiyoung Choi
Seoul National University
*Oracle Labs
+Carnegie Mellon University
Graphs
• Abstract representation of object relationships
– Vertex: object (e.g., person, article, …)
– Edge: relationship (e.g., friendships, hyperlinks, …)
• Recent trend: explosive increase in graph size
– 36 Million Wikipedia Pages
– 1.4 Billion Facebook Users
– 300 Million Twitter Users
– 30 Billion Instagram Photos
Large-Scale Graph Processing
• Example: Google’s PageRank
R[i] = α + Σ_{j∈Succ(i)} M_{ji} · R[j]
for (v: graph.vertices) {
for (w: v.successors) {
w.next_rank += weight * v.rank;
}
}
for (v: graph.vertices) {
v.rank = v.next_rank; v.next_rank = alpha;
}
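The two loops above can be made concrete with a minimal, single-threaded Python sketch. The slide leaves `alpha` and `weight` implicit; here they are filled in from the standard PageRank formulation (`alpha` as the damping offset, `weight = (1 - alpha)/out_degree(v)`), which is an assumption, not something the slide states.

```python
# Minimal PageRank iteration in the scatter style of the pseudocode above.
# Assumed (not on the slide): alpha is the damping offset and weight is
# (1 - alpha) / out_degree(v), as in the standard formulation.

def pagerank(successors, alpha=0.15, iterations=20):
    n = len(successors)
    rank = [1.0 / n] * n
    next_rank = [alpha / n] * n                   # accumulator pre-seeded with alpha
    for _ in range(iterations):
        for v, succs in enumerate(successors):
            if not succs:
                continue
            weight = (1.0 - alpha) / len(succs)
            for w in succs:
                next_rank[w] += weight * rank[v]  # w.next_rank += weight * v.rank
        # v.rank = v.next_rank; v.next_rank = alpha  (done for all v at once)
        rank, next_rank = next_rank, [alpha / n] * n
    return rank

# Tiny 3-vertex cycle: by symmetry, every rank stays at 1/3.
ranks = pagerank([[1], [2], [0]])
```

Note how every `w.next_rank` update is independent of the others, which is exactly what the vertex-parallel abstraction on the next slide exploits.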
Large-Scale Graph Processing
• Example: Google’s PageRank
R[i] = α + Σ_{j∈Succ(i)} M_{ji} · R[j] (independent for each vertex)
• Vertex-parallel abstraction:
for (v: graph.vertices) {
for (w: v.successors) {
w.next_rank += weight * v.rank;
}
}
for (v: graph.vertices) {
v.rank = v.next_rank; v.next_rank = alpha;
}
PageRank Performance
[Chart: speedup — 32 Cores / DDR3: 1x (baseline); 128 Cores / DDR3: +42%]
Bottleneck of Graph Processing
for (v: graph.vertices) {
for (w: v.successors) {
w.next_rank += weight * v.rank;
}
}
1. Frequent random memory accesses
(pointer chasing: v → &w → w.rank, w.next_rank, w.edges, …)
2. Little amount of computation
(just weight * v.rank per edge)
Bottleneck of Graph Processing
for (v: graph.vertices) {
for (w: v.successors) {
w.next_rank += weight * v.rank;
}
}
1. Frequent random memory accesses
2. Little amount of computation
→ High memory bandwidth demand
PageRank Performance
[Chart: speedup — 32 Cores / DDR3 (102.4GB/s): 1x; 128 Cores / DDR3 (102.4GB/s): +42%; 128 Cores / HMC (640GB/s): +89%]
PageRank Performance
Lessons Learned:
1. High memory bandwidth is the key to the scalability of graph processing
2. Conventional systems do not fully utilize high memory bandwidth
[Chart: speedup — 32 Cores / DDR3 (102.4GB/s): 1x; 128 Cores / DDR3 (102.4GB/s): +42%; 128 Cores / HMC (640GB/s): +89%; 128 Cores / HMC internal bandwidth (8TB/s): 5.3x]
Challenges in Scalable Graph Processing
• Challenge 1: How to provide high memory
bandwidth to computation units in a practical way?
– Processing-in-memory based on 3D-stacked DRAM
• Challenge 2: How to design computation units that
efficiently exploit large memory bandwidth?
– Specialized in-order cores called Tesseract cores
• Latency-tolerant programming model
• Graph-processing-specific prefetching schemes
Tesseract System
Host Processor
Memory-Mapped Accelerator Interface
(Noncacheable, Physically Addressed)
[Diagram: the host processor accesses Tesseract through this interface. Each vault pairs a DRAM controller with an in-order core, a list prefetcher (LP), a message-triggered prefetcher (MTP), a prefetch (PF) buffer, a message queue, and a network interface (NI); the vaults are connected by a crossbar network.]
Tesseract System
• Communications via remote function calls
Communications in Tesseract
for (v: graph.vertices) {
for (w: v.successors) {
w.next_rank += weight * v.rank;
}
}
[Diagram: v resides in Vault #1; the pointer &w leads to vertex w in Vault #2, so the update crosses vaults]
Communications in Tesseract
• Non-blocking remote function call:
for (v: graph.vertices) {
for (w: v.successors) {
put(w.id, function() { w.next_rank += weight * v.rank; });
}
}
barrier();
• put can be delayed until the nearest barrier
[Diagram: Vault #1 issues put messages that are applied to w in Vault #2]
Non-blocking Remote Function Call
1. Send function address & args to the remote core
2. Store the incoming message to the message queue
3. Flush the message queue when it is full or a synchronization barrier is reached
[Diagram: the local core sends (&func, &w, value) through its NI to the remote core’s NI, which stores put(w.id, function() { w.next_rank += value; }) in its message queue (MQ)]
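The buffering and flush semantics of these three steps can be sketched in software. The `Vault` class, its queue capacity, and the sequential `flush` are illustrative assumptions standing in for the hardware mechanism; only the put/barrier behavior follows the slides.

```python
# Sketch of the non-blocking remote function call (put/barrier) semantics.
# Vault, its 32-entry capacity, and the flush policy are software stand-ins
# for the hardware message queue described above.

class Vault:
    def __init__(self, capacity=32):
        self.queue = []                        # message queue (MQ)
        self.capacity = capacity

    def put(self, func, *args):
        """Fire-and-forget: buffer the call; the sender never blocks."""
        self.queue.append((func, args))
        if len(self.queue) >= self.capacity:   # flush when the MQ is full...
            self.flush()

    def flush(self):
        # Messages run one at a time on the owning core, so each update is
        # atomic with respect to this vault's data: no mutexes needed.
        for func, args in self.queue:
            func(*args)
        self.queue.clear()

def barrier(vaults):
    """...or when a synchronization barrier is reached."""
    for vault in vaults:
        vault.flush()

# Usage: a remote increment is delayed until the barrier, then applied.
next_rank = [0.0, 0.0]
vaults = [Vault(), Vault()]

def add(w, value):
    next_rank[w] += value

vaults[1].put(add, 1, 0.5)   # buffered, not yet applied
barrier(vaults)              # applied here
```

Because each vault applies its own messages serially, concurrent `put`s to the same vertex never race, which is why the next slide can claim "no need for mutexes".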
Benefits of Non-blocking Remote Function Call
• Latency hiding through fire-and-forget
– Local cores are not blocked by remote function calls
• Localized memory traffic
– No off-chip traffic during remote function call execution
• No need for mutexes
– Non-blocking remote function calls are atomic
• Prefetching
– Will be covered shortly
Tesseract System
• Prefetching
Memory Access Patterns in Graph Processing
for (v: graph.vertices) {
for (w: v.successors) {
put(w.id, function() { w.next_rank += weight * v.rank; });
}
}
• Sequential (local): scanning the vault’s own vertices v
• Random (remote): following &w to vertex w in another vault (Vault #1 → Vault #2)
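The local/remote split can be illustrated with a small sketch. The block partitioning of vertex IDs across vaults (`vault_of`, `VERTS_PER_VAULT`) is an illustrative assumption, not the paper's actual data mapping.

```python
# Sketch separating the two access patterns above. vault_of and the block
# distribution of vertex IDs are illustrative assumptions.

VERTS_PER_VAULT = 1000

def vault_of(vertex_id):
    return vertex_id // VERTS_PER_VAULT      # assumed block partitioning

def classify_accesses(my_vault, successors):
    """Count, for the edges owned by my_vault, how many w.next_rank updates
    stay local versus how many must become remote put() messages."""
    local, remote = 0, 0
    for v, succs in successors.items():
        assert vault_of(v) == my_vault       # v itself is scanned sequentially
        for w in succs:
            if vault_of(w) == my_vault:
                local += 1
            else:
                remote += 1                  # random access to another vault
    return local, remote

# Vertex 5000 lives in vault 5, so its update would cross the network.
local, remote = classify_accesses(0, {0: [1, 5000], 1: [2]})
```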
Message-Triggered Prefetching
• Prefetching random memory accesses is difficult
• Opportunities in Tesseract
– Domain-specific knowledge of target data address
– Time slack of non-blocking remote function calls
for (v: graph.vertices) {
for (w: v.successors) {
put(w.id, function() { w.next_rank += weight * v.rank; });
}
}
barrier();
• put can be delayed until the nearest barrier
Message-Triggered Prefetching
• Each put message now carries the address of its target data as a prefetch hint:
put(w.id, function() { w.next_rank += value; }, &w.next_rank)
[Diagram: the message arrives at the NI and is stored in the message queue, while the MTP uses the hint to fill the PF buffer through the DRAM controller]
Message-Triggered Prefetching
1. Message M1 received
2. Request prefetch of &w.next_rank
3. Mark M1 as ready when the prefetch is serviced
4. Process multiple ready messages at once
[Diagram: messages M1–M4 wait in the message queue while the MTP fetches their target data (d1, d2, d3, …) into the PF buffer and marks each message ready]
put(w.id, function() { w.next_rank += value; }, &w.next_rank)
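Steps 1–4 can be sketched in software. `MTPCore`, the string addresses, and the instantly serviced prefetch are illustrative assumptions standing in for the hardware MTP, the PF buffer, and real DRAM timing.

```python
# Software sketch of message-triggered prefetching (steps 1-4 above).
# MTPCore and the instant prefetch service are assumptions; the real MTP
# overlaps DRAM latency with other work using the messages' time slack.
from collections import deque

class MTPCore:
    def __init__(self):
        self.message_queue = deque()   # pending remote function calls
        self.pf_buffer = set()         # addresses held in the PF buffer

    def receive(self, func, args, hint_addr):
        # 1. Message received: enqueue it, initially not ready.
        msg = {"func": func, "args": args, "addr": hint_addr, "ready": False}
        self.message_queue.append(msg)
        # 2. Request a prefetch of the target data (the hint shipped with put).
        self.prefetch(hint_addr)

    def prefetch(self, addr):
        # 3. Once the prefetch is serviced (modeled as instant here), mark
        #    every message waiting on this address as ready.
        self.pf_buffer.add(addr)
        for msg in self.message_queue:
            if msg["addr"] == addr:
                msg["ready"] = True

    def drain(self):
        # 4. Process multiple ready messages at once; their operands now hit
        #    in the prefetch buffer instead of stalling on DRAM.
        processed = 0
        while self.message_queue and self.message_queue[0]["ready"]:
            msg = self.message_queue.popleft()
            msg["func"](*msg["args"])
            processed += 1
        return processed

# Usage: two buffered rank updates are applied together at drain time.
next_rank = [0.0, 0.0, 0.0]
core = MTPCore()

def add(w, value):
    next_rank[w] += value

core.receive(add, (1, 0.5), "&next_rank[1]")
core.receive(add, (2, 0.25), "&next_rank[2]")
done = core.drain()
```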
Other Features of Tesseract
• Blocking remote function calls
• List prefetching
• Prefetch buffer
• Programming APIs
• Application mapping
Please see the paper for details
Evaluated Systems
• DDR3-OoO (with FDP): 4 × (8 OoO cores @ 4GHz), 102.4GB/s
• HMC-OoO (with FDP): 4 × (8 OoO cores @ 4GHz), 640GB/s
• HMC-MC: 4 × (128 in-order cores @ 2GHz), 640GB/s
• Tesseract (32-entry MQ, 4KB PF buffer): 32 Tesseract cores per cube (in-order, 2GHz), 8TB/s internal bandwidth
Workloads
• Five graph processing algorithms
– Average teenage follower
– Conductance
– PageRank
– Single-source shortest path
– Vertex cover
• Three real-world large graphs
– ljournal-2008 (social network)
– enwiki-2003 (Wikipedia)
– indochina-2004 (web graph)
– 4~7M vertices, 79~194M edges
Performance
[Chart: speedup over DDR3-OoO — HMC-OoO: +56%; HMC-MC: +25%; Tesseract: 9.0x; Tesseract-LP: 11.6x; Tesseract-LP-MTP: 13.8x]
Performance and Memory Bandwidth Consumption
[Charts: the speedup results alongside memory bandwidth consumption — DDR3-OoO: 80GB/s; HMC-OoO: 190GB/s; HMC-MC: 243GB/s; Tesseract: 1.3TB/s; Tesseract-LP: 2.2TB/s; Tesseract-LP-MTP: 2.9TB/s]
Iso-Bandwidth Comparison
[Chart: speedup over HMC-MC — HMC-MC: 1x; HMC-MC + PIM BW: 2.3x (bandwidth alone); Tesseract + Conventional BW: 3.0x (programming model alone); Tesseract (no prefetching): 6.5x]
Execution Time Breakdown
[Chart: normalized execution time of AT.LJ, CT.LJ, PR.LJ, SP.LJ, and VC.LJ, broken down into Normal Mode, Interrupt Mode, Interrupt Switching, Network, and Barrier]
Execution Time Breakdown
• Network backpressure due to message passing
(Future work: message combiner, …)
Execution Time Breakdown
• Workload imbalance
(Future work: load balancing)
Prefetch Efficiency
[Charts: prefetch coverage — 88% of accesses hit in the prefetch buffer, 12% remain demand L1 misses; an ideal system where prefetches take zero cycles is only 1.04x faster than Tesseract with prefetching]
Scalability
[Charts: speedup over 32 cores — DDR3-OoO reaches only +42% at 128 cores; Tesseract with prefetching reaches 4x at 128 cores (32GB), 12x at 512 cores (128GB), and 13x at 512 cores with partitioning; baseline is 32 cores (8GB)]
• Memory-capacity-proportional performance
Memory Energy Consumption
[Chart: normalized memory energy, split into memory layers, logic layers, and cores — Tesseract with prefetching consumes 87% less memory energy than HMC-OoO]
Conclusion
• Revisiting the PIM concept in a new context
– Cost-effective 3D integration of logic and memory
– Graph processing workloads demanding high memory bandwidth
• Tesseract: scalable PIM for graph processing
– Many in-order cores in a memory chip
– New message passing mechanism for latency hiding
– New hardware prefetchers for graph processing
– Programming interface that exploits our hardware design
• Evaluations demonstrate the benefits of Tesseract
– 14x performance improvement & 87% energy reduction
– Scalable: memory-capacity-proportional performance