Slide - University of British Columbia

advertisement
Rich Miler – www.datacenterknowledge.com
Characterizing and Evaluating a Key-value
Store Application on Heterogeneous
CPU-GPU Systems
Tayler H. Hetheringtonɣ
Timothy G. Rogersɣ
Lisa Hsu*
Mike O’Connor*
Tor M. Aamodtɣ
ɣUBC
*AMD
University of British Columbia
In Proc. 2012 ACM/IEEE Int’l Symp. On Performance Analysis of Systems and Software (ISPASS)
Bruno Giussani – ww.wired.com
SIMD
Efficiency
Motivation
50%
40%
30%
20%
10%
0%
Intuition
Actual
New
types
ofrequire
workloads
Programmer’s
initial intuition
into an application’s
Server
applications
Server
farms
a lot of power
–
– Non-HPC
Need for efficient, cost-effective solutions
behavior
– Memcached
–
applications
– Server
GPU/APUs
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Memcached Key-value Store on GPU/APU
2
3
Background
Memcached
*Slide from HPCA-18, 2012 Facebook Keynote, Sanjeev Kumar
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Memcached Key-value Store on GPU/APU
4
•
•
•
•
Memcached - Compatible with GPU?
Irregular control flow
Irregular memory access patterns
Large memory requirements
Highly input data dependent
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Memcached Key-value Store on GPU/APU
5
Porting Memcached
Simple key-value lookup
Server2
• READ (GET)
Hash
GET
requests on
Memory
GPU
• WRITE (SET)
requests on
CPU
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Miss
Key Comparison
Hash chaining
Hit
Memcached Key-value Store on GPU/APU
Return
Hit/Miss
6
Porting Memcached - Batching
Servern
Server2
Server
2
Server
2
GET
GET
Memory
GET
GET
GET
Hash
Hash
Hash
Hash
Hash
Memory
Memory
Memory
Memory
Miss
Key Comparison
Miss
MissMiss
Key
Comparison
Hash
chaining
Miss
Comparison
chaining
Key Key
Comparison
HashHash
chaining
Key Comparison
Hash chaining
Hit
Hit
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Hash chaining
Return
Return
Hit
Return
Hit
Hit/Miss
Hit
Hit/Miss
Return Hit/Miss
Hit/Miss
Memcached Key-value Store on GPU/APU
Return
Hit/Miss
7
Porting Memcached
• Main Goals
– Increase request throughput
– Keep request latency reasonable
• Main Challenges
– Irregular memory access patterns
– Irregular control flow
– Data transfer overheads
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Memcached Key-value Store on GPU/APU
8
Methodology
• Hardware
– AMD Radeon HD 5870 (Discrete)
– AMD Llano A8-3850 (Fusion)
– AMD Zacate E-350 (Fusion)
• Simulators
– GPGPU-Sim v3.x
– In-house GPU control flow simulator
• Testing and Simulation
– Traces of Wikipedia accesses
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Memcached Key-value Store on GPU/APU
Porting Memcached
Memory Access
• One request per work item
• Data accesses for GET requests are input data
dependent
• Data can be anywhere in memory
– Poor performance on GPU?
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Memcached Key-value Store on GPU/APU
9
Porting Memcached
10
Memory Divergence
Percentage of Peak IPC
35%
30%
25%
20%
15%
10%
5%
0%
No L1
Cache
8k 8way
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
32k 8- 64k 8- 128k 8- 256k 8- 1M 8- 1M FA No
No
way
way
way
way
way
Mem Mem
Latency Stalls
Memcached Key-value Store on GPU/APU
Porting Memcached
Control Flow
• Recall the control flow graph
Work item ID
1–2–3–4–5
3–4
1–2–5
1–5
2
3–4
• Many branch outcomes are input data dependent
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Memcached Key-value Store on GPU/APU
11
Porting Memcached
12
Control Flow
100%
90%
80%
70%
60%
50%
40%
30%
20%
10%
0%
SIMD Efficiency Breakdown
29%
62%
MC (Pes)
Overall 15%
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
51%
MC (Aug)
MC (Act)
40%
Memcached Key-value Store on GPU/APU
# Active
Work-items
1-4
5-8
9-12
13-16
17-20
21-24
25-28
29-32
13
Porting Memcached
Data Management
• Dynamic memory
manager
• Transfer memory regions
to device
• Virtual addresses
different on host and
device
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Memcached Key-value Store on GPU/APU
Porting Memcached
Data Transfer Reduction
• Fusion Systems
– Physical shared memory region between host and device
– Zero-copy data
• Discrete Systems
– Possible transfer reduction techniques
• Reduction in unnecessary transfers
• Acyclic data transfers (Overlap comm. with comp.)
• Automatic data transfer frameworks
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Memcached Key-value Store on GPU/APU
14
15
Porting Memcached
Performance vs. CPU
35
Performance vs. CPU
Speed up (X)
Speed up (X)
30
40
3025
2020
1015
100%
80%010
60%
5
40%
20% 0
0%
No Data
No Data
Transfers
Transfers
Execution Breakdown
AMD Radeon HD
5870
Llano A8-3850
Data
Data
Transfers
Transfers
Data
Zacate E-350
Transfer
Execution
AMD Radeon HD
Llano A8-3850
5870
AMD Radeon HD
Llano A8-3850
5870
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Zacate E-350
Zacate E-350
Memcached Key-value Store on GPU/APU
16
Results
Radeon HD 5870
Normalized Throughput (Requests/Second)
Normalized Latency
7
Latency - 0.5ms
14
12
6
10
5
8
4
6
3
4
2
1
2
0
0
0
10000 20000 30000 40000 50000 60000 70000 80000 90000
Requests / Batch
Normalized Latency
Normalized Throughput
8
• ~8000 requests yields highest ratio of throughput to latency
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Memcached Key-value Store on GPU/APU
Rich Miler – www.datacenterknowledge.com
17
Summary
• Programmer intuition doesn’t always paint the
whole picture
• We exploited the available parallelism on
GPUs by batching requests, showing a 7.5X
performance increase on the Llano system
• Data transfer overheads can have a large
impact on overall performance
• Thank you – Questions?
Tayler Hetherington, Timothy Rogers,
Lisa Hsu, Mike O'Connor, Tor M. Aamodt
Memcached Key-value Store on GPU/APU
Download