A Scalable Processing-in-Memory
Accelerator for Parallel Graph Processing
Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo, Onur Mutlu+, Kiyoung Choi
Seoul National University   *Oracle Labs   +Carnegie Mellon University
Graphs
• Abstract representation of object relationships
– Vertex: object (e.g., person, article, …)
– Edge: relationship (e.g., friendship, hyperlink, …)
• Recent trend: explosive increase in graph size
– 36 Million Wikipedia Pages
– 1.4 Billion Facebook Users
– 300 Million Twitter Users
– 30 Billion Instagram Photos
Large-Scale Graph Processing
• Example: Google’s PageRank
R[v] = α + Σ_{w ∈ Succ(v)} w_{wv} · R[w]
for (v: graph.vertices) {
for (w: v.successors) {
w.next_rank += weight * v.rank;
}
}
for (v: graph.vertices) {
v.rank = v.next_rank; v.next_rank = alpha;
}
Large-Scale Graph Processing
• Example: Google’s PageRank
R[v] = α + Σ_{w ∈ Succ(v)} w_{wv} · R[w]   (independent for each vertex)
⇒ Vertex-Parallel Abstraction
for (v: graph.vertices) {
for (w: v.successors) {
w.next_rank += weight * v.rank;
}
}
for (v: graph.vertices) {
v.rank = v.next_rank; v.next_rank = alpha;
}
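For concreteness, the loop nest above can be fleshed out into a small runnable C++ sketch of one vertex-parallel iteration; the Graph layout and the out-degree-based edge weight are illustrative assumptions, not the paper's implementation:

#include <cstddef>
#include <vector>

// One push-style PageRank iteration, mirroring the pseudocode above.
struct Graph {
    std::vector<std::vector<int>> successors;  // successors[v] = out-neighbors of v
    std::vector<double> rank, next_rank;       // next_rank must start at alpha
};

void pagerank_iteration(Graph& g, double alpha) {
    for (std::size_t v = 0; v < g.rank.size(); ++v) {
        if (g.successors[v].empty()) continue;
        double weight = 1.0 / g.successors[v].size();  // assumed: rank split over out-edges
        for (int w : g.successors[v])
            g.next_rank[w] += weight * g.rank[v];      // "w.next_rank += weight * v.rank"
    }
    for (std::size_t v = 0; v < g.rank.size(); ++v) {
        g.rank[v] = g.next_rank[v];  // publish this iteration's ranks
        g.next_rank[v] = alpha;      // reset accumulators for the next iteration
    }
}

Each w.next_rank update is independent of every other vertex's update, which is what makes the vertex-parallel abstraction applicable.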
PageRank Performance
[Bar chart: PageRank speedup. 32 Cores + DDR3 (102.4GB/s): 1x; 128 Cores + DDR3 (102.4GB/s): only +42% from 4x more cores]
Bottleneck of Graph Processing
for (v: graph.vertices) {
for (w: v.successors) {
w.next_rank += weight * v.rank;
}
}
1. Frequent random memory accesses
[Diagram: pointer chasing — v → &w → w's fields (w.rank, w.next_rank, w.edges, …), updated with weight * v.rank]
2. Very little computation per access
⇒ High memory bandwidth demand
PageRank Performance
[Bar chart: PageRank speedup over 32 Cores + DDR3 (102.4GB/s). 128 Cores + DDR3 (102.4GB/s): +42%; 128 Cores + HMC (640GB/s): +89% over that; 128 Cores + HMC Internal BW (8TB/s): 5.3x]
Lessons Learned:
1. High memory bandwidth is the key to the scalability of graph processing
2. Conventional systems do not fully utilize high memory bandwidth
Challenges in Scalable Graph Processing
• Challenge 1: How to provide high memory
bandwidth to computation units in a practical way?
– Processing-in-memory based on 3D-stacked DRAM
• Challenge 2: How to design computation units that
efficiently exploit large memory bandwidth?
– Specialized in-order cores called Tesseract cores
• Latency-tolerant programming model
• Graph-processing-specific prefetching schemes
Tesseract System
[Diagram: a host processor accesses a network of HMC cubes through a memory-mapped accelerator interface (noncacheable, physically addressed). Each vault pairs a DRAM controller with an in-order core, a message queue, a prefetch buffer (PF Buffer), a list prefetcher (LP), a message-triggered prefetcher (MTP), and a network interface (NI); vaults communicate over a crossbar network]
Tesseract System
[Same diagram, highlighting the message queue and network interface: communications via remote function calls]
Communications in Tesseract
for (v: graph.vertices) {
for (w: v.successors) {
w.next_rank += weight * v.rank;
}
}
[Diagram: v resides in Vault #1; the pointer &w leads to w in Vault #2, so updating w.next_rank requires a remote access]
Communications in Tesseract
for (v: graph.vertices) {
for (w: v.successors) {
put(w.id, function() { w.next_rank += weight * v.rank; });
}
}
barrier();
put is a non-blocking remote function call whose execution can be delayed until the nearest barrier.
[Diagram: Vault #1 issues put messages for v's successors; they are applied to w in Vault #2]
Non-blocking Remote Function Call
1. Send function address & args to the remote core
2. Store the incoming message in the message queue
3. Flush the message queue when it is full or a synchronization barrier is reached
[Diagram: the local core's NI sends (&func, &w, value) to the remote core's NI, which stores it in the message queue (MQ)]
put(w.id, function() { w.next_rank += value; })
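A minimal software analogy of this flow, assuming a vault-indexed array of queues and a modulo vertex-to-vault mapping (all names here are illustrative, not Tesseract's actual API):

#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

constexpr int kNumVaults = 4;  // assumed vault count, for illustration only

struct Message {
    std::function<void()> fn;  // the function to run at the remote vault
};

std::vector<Message> message_queue[kNumVaults];  // one queue per vault

int vault_of(uint64_t vertex_id) {
    return static_cast<int>(vertex_id % kNumVaults);  // assumed vertex-to-vault mapping
}

// Fire-and-forget: enqueue the function for the owning vault and return
// immediately; the caller is never blocked on the remote execution.
void put(uint64_t dst_vertex_id, std::function<void()> fn) {
    message_queue[vault_of(dst_vertex_id)].push_back({std::move(fn)});
}

// Results become visible only after the barrier. Each queue is drained by
// its owning vault, so updates to a given vertex are serialized there,
// which is why no mutexes are needed.
void barrier() {
    for (auto& q : message_queue) {
        for (auto& m : q) m.fn();
        q.clear();
    }
}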
Benefits of Non-blocking Remote Function Call
• Latency hiding through fire-and-forget
– Local cores are not blocked by remote function calls
• Localized memory traffic
– No off-chip traffic during remote function call execution
• No need for mutexes
– Non-blocking remote function calls are atomic
• Prefetching
– Will be covered shortly
Tesseract System
[Same diagram, highlighting the prefetchers (LP, MTP) and the prefetch buffer: prefetching]
Memory Access Patterns in Graph Processing
for (v: graph.vertices) {
for (w: v.successors) {
put(w.id, function() { w.next_rank += weight * v.rank; });
}
}
Sequential (local): scanning v and its successor list within the local vault (Vault #1)
Random (remote): accessing w through &w in another vault (Vault #2)
Message-Triggered Prefetching
• Prefetching random memory accesses is difficult
• Opportunities in Tesseract
– Domain-specific knowledge of target data address
– Time slack of non-blocking remote function calls
for (v: graph.vertices) {
for (w: v.successors) {
put(w.id, function() { w.next_rank += weight * v.rank; });
}
}
barrier();
The remote function call can be delayed until the nearest barrier.
Message-Triggered Prefetching
[Diagram: the NI delivers incoming put messages to the message queue; the message-triggered prefetcher (MTP) issues prefetches through the DRAM controller into the prefetch buffer]
put(w.id, function() { w.next_rank += value; }, &w.next_rank)
The put call now carries a prefetch hint: the address (&w.next_rank) that the remote function will access.
Message-Triggered Prefetching
[Diagram: message queue holding messages M1~M4; prefetched data d1~d3 in the prefetch buffer]
1. Message M1 received
2. Request a prefetch of the hinted address
3. Mark M1 as ready when the prefetch is serviced
4. Process multiple ready messages at once
put(w.id, function() { w.next_rank += value; }, &w.next_rank)
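A hypothetical software rendering of these four steps (the real message-triggered prefetcher is a hardware unit in the logic layer; the structures and names below are assumptions for illustration):

#include <cstdint>
#include <deque>
#include <functional>
#include <unordered_set>

struct PendingMessage {
    std::function<void()> fn;  // remote function body carried by put()
    uint64_t prefetch_addr;    // prefetch hint, e.g. &w.next_rank
    bool ready = false;        // set once the hinted data has been prefetched
};

std::deque<PendingMessage> message_queue;
std::unordered_set<uint64_t> prefetch_buffer;  // models lines held in the PF buffer

// Stand-in for an asynchronous DRAM read into the prefetch buffer.
void issue_prefetch(uint64_t addr) { prefetch_buffer.insert(addr); }

// Steps 1-2: on arrival, enqueue the message and immediately request a
// prefetch of the address its function will touch.
void on_message(PendingMessage m) {
    issue_prefetch(m.prefetch_addr);
    message_queue.push_back(std::move(m));
}

// Step 3: when a prefetch is serviced, mark the matching messages ready.
void on_prefetch_done(uint64_t addr) {
    for (auto& m : message_queue)
        if (m.prefetch_addr == addr) m.ready = true;
}

// Step 4: process multiple ready messages at once; each runs against data
// already resident in the prefetch buffer. (std::erase_if is C++20.)
void process_ready_messages() {
    for (auto& m : message_queue)
        if (m.ready) m.fn();
    std::erase_if(message_queue, [](const PendingMessage& m) { return m.ready; });
}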
Other Features of Tesseract
• Blocking remote function calls
• List prefetching
• Prefetch buffer
• Programming APIs
• Application mapping
Please see the paper for details
Evaluated Systems
DDR3-OoO (with FDP): 32 OoO cores @ 4GHz — 102.4GB/s
HMC-OoO (with FDP): 32 OoO cores @ 4GHz — 640GB/s
HMC-MC: 512 in-order cores @ 2GHz — 640GB/s
Tesseract (32-entry MQ, 4KB PF buffer): 512 in-order Tesseract cores @ 2GHz, 32 per cube — 8TB/s internal bandwidth
Workloads
• Five graph processing algorithms
– Average teenage follower
– Conductance
– PageRank
– Single-source shortest path
– Vertex cover
• Three real-world large graphs (4~7M vertices, 79~194M edges)
– ljournal-2008 (social network)
– enwiki-2003 (Wikipedia)
– indochina-2004 (web graph)
Performance
[Bar chart: speedup over DDR3-OoO. HMC-OoO: +56%; HMC-MC: +25%; Tesseract: 9.0x; Tesseract-LP: 11.6x; Tesseract-LP-MTP: 13.8x (LP: list prefetching, MTP: message-triggered prefetching)]
Memory Bandwidth Consumption
[Bar chart: memory bandwidth consumed by the same systems. DDR3-OoO: 80GB/s; HMC-OoO: 190GB/s; HMC-MC: 243GB/s; Tesseract: 1.3TB/s; Tesseract-LP: 2.2TB/s; Tesseract-LP-MTP: 2.9TB/s]
Iso-Bandwidth Comparison
[Bar chart: speedup over HMC-MC, all without prefetching. HMC-MC (640GB/s): 1x; HMC-MC + PIM BW (8TB/s): 3.0x; Tesseract + Conventional BW (640GB/s): 2.3x; Tesseract (8TB/s): 6.5x — both the internal bandwidth and the programming model contribute to the speedup]
Execution Time Breakdown
[Stacked bar chart: execution time of AT.LJ, CT.LJ, PR.LJ, SP.LJ, and VC.LJ, broken down into Normal Mode, Interrupt Mode, Interrupt Switching, Network, and Barrier]
Two major overheads remain:
• Network backpressure due to message passing (future work: message combiner, …)
• Workload imbalance (future work: load balancing)
Prefetch Efficiency
[Pie chart (prefetch timeliness & coverage): 88% of accesses hit in the prefetch buffer; 12% remain demand L1 misses. Bar chart: an ideal system in which prefetches take zero cycles is only 1.04x faster than Tesseract with prefetching]
Scalability
[Bar charts: DDR3-OoO improves by only +42% from 32 to 128 cores. Tesseract with prefetching shows memory-capacity-proportional performance: 4x at 128 cores (32GB) and 12x at 512 cores (128GB), relative to 32 cores (8GB); 13x at 512 cores with partitioning]
Memory Energy Consumption
[Bar chart: memory energy normalized to HMC-OoO, broken into memory layers, logic layers, and cores: Tesseract with prefetching consumes 87% less memory energy than HMC-OoO]
Conclusion
• Revisiting the PIM concept in a new context
– Cost-effective 3D integration of logic and memory
– Graph processing workloads demanding high memory bandwidth
• Tesseract: scalable PIM for graph processing
– Many in-order cores in a memory chip
– New message passing mechanism for latency hiding
– New hardware prefetchers for graph processing
– Programming interface that exploits our hardware design
• Evaluations demonstrate the benefits of Tesseract
– 14x performance improvement & 87% energy reduction
– Scalable: memory-capacity-proportional performance
A Scalable Processing-in-Memory
Accelerator for Parallel Graph Processing
Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo, Onur Mutlu+, Kiyoung Choi
Seoul National University   *Oracle Labs   +Carnegie Mellon University