Redefining the Role of the CPU in
the Era of CPU-GPU Integration
Manish Arora, Siddhartha Nath, Subhra Mazumdar,
Scott Baden and Dean Tullsen
Computer Science and Engineering, UC San Diego
IEEE Micro, Nov-Dec 2012
AMD Research, August 20, 2012
Overview
- Motivation
- Benchmarks and Methodology
- Analysis
  - CPU Criticality
  - ILP
  - Branches
  - Loads and Stores
  - Vector Instructions
  - TLP
- Impact on CPU Design
Historical Progression
[Figure: General-purpose applications drove multicore CPUs; throughput applications drove energy-efficient GPUs and then GPGPU; performance/energy gains from chip integration led to the APU. The focus of improvements so far: improved memory systems, improved GPGPU scaling, easier programming. The open question for the next-gen APU: the CPU architecture itself.]
The CPU-GPU Era

AMD APU Product | Year | CPU Component | GPU Component   | CPU-Only Parts with the Same Cores
Llano           | 2011 | Husky (K10)   | NI GPU          | Consumer: Phenom/Athlon II; Server: Barcelona ...
Trinity         | 2012 | Piledriver    | SI GPU          | Consumer: Vishera; Server: Delhi/Abu Dhabi ...
Kaveri          | 2013 | Steamroller   | Sea Islands GPU |

APUs have essentially the same CPU cores as CPU-only parts.
Example CPU-GPU Benchmark

KMeans (implementation from Rodinia):
1. Randomly pick centers.
2. Find the closest center for each point: easy data parallelism over each point (GPU).
3. Find new centers: only a few centers, each aggregating a possibly different number of points (CPU).
Steps 2 and 3 repeat every iteration.
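To make the partitioning concrete, here is a minimal C sketch of the KMeans loop described above (not the Rodinia source; the sizes and data are illustrative). The per-point assignment phase is the data-parallel work that would map to the GPU; the center update stays on the CPU.

```c
#include <stdio.h>
#include <stdlib.h>
#include <float.h>

#define N 1024   /* points     */
#define K 8      /* centers    */
#define D 2      /* dimensions */

static float pts[N][D], ctr[K][D];
static int   label[N];

int main(void) {
    /* Phase 1 (CPU): randomly pick initial centers and generate points. */
    for (int k = 0; k < K; k++)
        for (int d = 0; d < D; d++) ctr[k][d] = rand() / (float)RAND_MAX;
    for (int i = 0; i < N; i++)
        for (int d = 0; d < D; d++) pts[i][d] = rand() / (float)RAND_MAX;

    for (int iter = 0; iter < 10; iter++) {
        /* Phase 2: find the closest center for each point.
         * Independent across points -> easy data parallelism -> GPU kernel. */
        for (int i = 0; i < N; i++) {
            float best = FLT_MAX; int bestk = 0;
            for (int k = 0; k < K; k++) {
                float dist = 0.0f;
                for (int d = 0; d < D; d++) {
                    float diff = pts[i][d] - ctr[k][d];
                    dist += diff * diff;
                }
                if (dist < best) { best = dist; bestk = k; }
            }
            label[i] = bestk;
        }
        /* Phase 3: recompute centers. Only K centers, each aggregating a
         * data-dependent number of points -> little parallelism -> CPU. */
        int   cnt[K]    = {0};
        float sum[K][D] = {{0}};
        for (int i = 0; i < N; i++) {
            cnt[label[i]]++;
            for (int d = 0; d < D; d++) sum[label[i]][d] += pts[i][d];
        }
        for (int k = 0; k < K; k++)
            if (cnt[k])
                for (int d = 0; d < D; d++) ctr[k][d] = sum[k][d] / cnt[k];
    }
    printf("first center: (%f, %f)\n", ctr[0][0], ctr[0][1]);
    return 0;
}
```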
Properties of KMeans

Metric                                                  | CPU Only | With GPU
Time fraction running kernel code                       | ~50%     | ~16% (kernel speedup 5x)
Time spent on the CPU                                   | 100%     | ~84%
Perfect instruction-level parallelism (window size 128) | 7.0      | 4.8
"Hard" branches                                         | 2.3%     | 4.6%
"Hard" loads                                            | 36.2%    | 64.5%
Application speedup on an 8-core CPU                    | 1.5x     | 1.0x

CPU performance remains critical, and adding the GPU drastically changes the properties of the remaining CPU code.
Aim: understand and evaluate this "new" CPU workload.
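A quick sanity check of the time fractions above, using Amdahl's law with the reported numbers (kernel is ~50% of CPU-only time, kernel speedup is 5x):

```latex
T_{\text{with GPU}} = 0.5 + \frac{0.5}{5} = 0.6, \qquad
\frac{0.1}{0.6} \approx 16\% \ \text{(kernel fraction)}, \qquad
\frac{0.5}{0.6} \approx 84\% \ \text{(CPU fraction)}
```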
The Need to Rethink CPU Design
- APUs are a prime example of heterogeneous systems
- Heterogeneity: compose cores that each run a subset of the workload well
- The CPU need not be fully general-purpose
  - It is sufficient to optimize it for the non-GPU code
- Investigate the non-GPU code and use it to guide CPU design
Overview: Benchmarks and Methodology
Benchmarks
[Figure: Three classes of applications. CPU-Heavy: serial and parallel apps that run on the CPU only. Mixed: partitioned apps that split work between the CPU and GPU. GPU-Heavy: apps whose work maps almost entirely to the GPU.]
Benchmarks
- CPU-Heavy (11 apps)
  - Important computing apps with no evidence of GPU ports
  - SPEC: Parser, Bzip, Gobmk, MCF, Sjeng, GemsFDTD [serial]
  - Parsec: Povray, Tonto, Facesim, Freqmine, Canneal [parallel]
- Mixed and GPU-Heavy (11 + 11 apps)
  - Rodinia (7 apps)
  - SPEC/Parsec mapped to GPUs (15 apps)
Mixed

Benchmark      | Suite   | GPU Kernels | Kernel Speedup
Kmeans         | Rodinia | 2           | 5.0
H264           | SPEC    | 2           | 12.1
SRAD           | Rodinia | 2           | 15.0
Sphinx3        | SPEC    | 1           | 17.7
Particlefilter | Rodinia | 2           | 32.0
Blackscholes   | Parsec  | 1           | 13.7
Swim           | SPEC    | 3           | 25.3
Milc           | SPEC    | 18          | 6.0
Hmmer          | SPEC    | 1           | 19.0
LUD            | Rodinia | 1           | 13.5
Streamcluster  | Parsec  | 1           | 26.0
GPU-Heavy

Benchmark    | Suite   | GPU Kernels | Kernel Speedup
Bwaves       | SPEC    | 1           | 18.0
Equake       | SPEC    | 1           | 5.3
Libquantum   | SPEC    | 3           | 28.1
Ammp         | SPEC    | 2           | 6.8
CFD          | Rodinia | 5           | 5.5
Mgrid        | SPEC    | 4           | 34.3
LBM          | SPEC    | 1           | 31.0
Leukocyte    | Rodinia | 3           | 70.0
Art          | SPEC    | 3           | 6.8
Heartwall    | Rodinia | 6           | 7.9
Fluidanimate | Parsec  | 6           | 3.9
Methodology
- We are interested in the non-GPU portions of CPU-GPU code
- Ideal scenario: port every application to the GPU and use hardware counters
  - Requires man-hours, domain expertise, and platform- and architecture-dependent code
- Instead, CPU-GPU partitioning is based on expert information
  - Publicly available source code (Rodinia)
  - Details of GPU portions from publications and our own implementations (SPEC/Parsec)
Methodology
- Microarchitectural simulations
  - Marked the GPU portions in the application code
  - Ran the marked applications through Pin-based microarchitectural simulators (ILP, branches, loads and stores)
- Machine measurements
  - Used the marked code (CPU criticality)
  - Used parallel CPU source code when available (TLP studies)
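The talk does not show the marking mechanism itself; below is a minimal sketch of one way GPU-mapped regions could be delimited so a Pin-based tool can attribute instructions to them. The marker functions (gpu_region_begin/gpu_region_end) and the example kernel are hypothetical, not the authors' tooling.

```c
#include <stdio.h>

/* Hypothetical markers: empty, non-inlined routines whose call sites a
 * Pin-based tool can recognize by name to start/stop attributing
 * instructions to the GPU-mapped region (GCC/Clang attribute syntax). */
void __attribute__((noinline)) gpu_region_begin(void) {}
void __attribute__((noinline)) gpu_region_end(void)   {}

static void scale(float *data, int n) {
    gpu_region_begin();              /* kernel that would be offloaded to the GPU */
    for (int i = 0; i < n; i++)
        data[i] *= 2.0f;
    gpu_region_end();
    /* everything outside the markers counts as the "new" CPU workload */
}

int main(void) {
    float v[4] = {1, 2, 3, 4};
    scale(v, 4);
    printf("%f\n", v[0]);
    return 0;
}
```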
Overview: Analysis (CPU Criticality, ILP, Branches, Loads and Stores, Vector Instructions, TLP)
CPU Criticality
CPU Time
[Chart: Proportion of total application time (%) spent as CPU-only non-kernel time, with reported kernel speedups and with conservative kernel speedups, for Mixed and GPU-Heavy apps. Future averages are weighted by the conservative CPU time.]
- Mixed: even though ~80% of the code is mapped to the GPU, the CPU is still the bottleneck; more time is spent on the CPU than on the GPU.
- GPU-Heavy: the CPU still executes 7-14% of the time.
Instruction Level Parallelism
- Measures the inherent parallelism of the instruction stream
- ILP measured assuming perfect memory and perfect branch prediction
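A minimal sketch of the kind of limit study this implies (not the authors' simulator): with perfect branches and memory, only register dependences and the finite window constrain issue, so ILP is the instruction count divided by the length of the dataflow-limited schedule. The trace format, register count, and unit latency are assumptions.

```c
#include <stdio.h>

#define W    128   /* instruction window size (assumed)           */
#define NREG 32    /* number of architectural registers (assumed) */

typedef struct { int src1, src2, dst; } Inst;   /* register ids, -1 = none */

/* Dataflow-limited ILP within a sliding window of W instructions:
 * unbounded issue width, unit latency, perfect branches and memory. */
double window_ilp(const Inst *trace, long n) {
    long ready[NREG] = {0};   /* cycle at which each register becomes available */
    long issued[W]   = {0};   /* issue cycles of the W most recent instructions */
    long last_cycle  = 0;

    for (long i = 0; i < n; i++) {
        long c = 0;
        /* wait for register operands */
        if (trace[i].src1 >= 0 && ready[trace[i].src1] > c) c = ready[trace[i].src1];
        if (trace[i].src2 >= 0 && ready[trace[i].src2] > c) c = ready[trace[i].src2];
        /* wait for a window slot: instruction i-W must have issued first */
        if (i >= W && issued[i % W] > c) c = issued[i % W];
        c += 1;                               /* unit execution latency */
        if (trace[i].dst >= 0) ready[trace[i].dst] = c;
        issued[i % W] = c;
        if (c > last_cycle) last_cycle = c;
    }
    return last_cycle ? (double)n / (double)last_cycle : 0.0;
}

int main(void) {
    /* toy trace: a dependent chain through r1 plus independent producers */
    Inst trace[6] = {
        {-1, -1, 1}, {1, -1, 1}, {1, -1, 1},
        {-1, -1, 2}, {-1, -1, 3}, {-1, -1, 4},
    };
    printf("ILP = %.2f\n", window_ilp(trace, 6));   /* prints 2.00 */
    return 0;
}
```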
Instruction Level Parallelism
[Chart: Parallel instructions within the instruction window for CPU-Heavy apps: 9.6 at window size 128, 12.7 at window size 512.]
Instruction Level Parallelism
[Chart: Parallel instructions within the instruction window for Mixed and GPU-Heavy apps, CPU-only vs. with GPU, at window sizes 128 and 512. ILP falls once the kernels move to the GPU; annotated drops: 9.9 -> 9.5 and 10.3 -> 9.2 at window size 128, 13.7 -> 12.2 and 15.3 -> 11.1 at window size 512, and 14.6 -> 13.7.]
Instruction Level Parallelism
- ILP dropped in 17 of 22 applications
  - By 4% at window size 128 and 10.9% at window size 512
  - Dropped by half for 5 applications
  - Mixed apps' ILP dropped by as much as 27.5%
- Common case
  - Independent loops mapped to the GPU
  - Less regular, dependence-heavy code left on the CPU
- Occasionally, long dependent chains end up on the GPU
  - Blackscholes (5 of the 22 apps are such outliers)
- Potential gains from larger instruction windows will be degraded
Branches
- Branches are categorized into four classes:
  - Biased: >95% in the same direction
  - Patterned: >95% accuracy on a very large local predictor
  - Correlated: >95% accuracy on a very large gshare predictor
  - Hard: the remainder
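A small sketch of this taxonomy as code (the thresholds are from the slide; the per-branch statistics are treated as already measured, since the predictor configurations are not specified here):

```c
#include <stdio.h>

/* Classify a static branch from its measured per-branch statistics.
 * The categories and 95% thresholds follow the slide; gathering the bias
 * and predictor accuracies is left to the measurement tool. */
typedef enum { BIASED, PATTERNED, CORRELATED, HARD } BranchClass;

BranchClass classify_branch(double bias,       /* fraction going the majority direction     */
                            double local_acc,  /* accuracy of a very large local predictor  */
                            double gshare_acc) /* accuracy of a very large gshare predictor */
{
    if (bias       > 0.95) return BIASED;     /* >95% same direction            */
    if (local_acc  > 0.95) return PATTERNED;  /* captured by local history      */
    if (gshare_acc > 0.95) return CORRELATED; /* captured by global correlation */
    return HARD;                              /* remainder                      */
}

int main(void) {
    /* example: 60% taken, but a local predictor gets 97% right -> PATTERNED */
    printf("%d\n", classify_branch(0.60, 0.97, 0.99));
    return 0;
}
```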
Branch Distribution
[Chart: Percentage of dynamic branches for CPU-Heavy apps, stacked by class. Reading the stack: Biased 55.2%, Patterned 13.1%, Correlated 7.0%, Hard 24.7%.]
Branch Distribution
[Chart: Percentage of dynamic branches (Hard/Correlated/Patterned/Biased) for Mixed and GPU-Heavy apps, CPU-only vs. with GPU. The hard-branch fraction grows once the kernels move to the GPU (called-out values: 5.1%, 11.3%, 9.4%, 18.6%); annotations attribute the shift on GPU-Heavy apps to data-dependent branches.]
- Overall: branch predictors tuned for generic CPU execution may not be sufficient.
Loads and Stores
- Loads and stores are categorized into four classes:
  - Static: >95% to the same address
  - Strided: >95% accuracy on a very large stride predictor
  - Patterned: >95% accuracy on a very large Markov predictor
  - Hard: the remainder
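As one concrete piece of this taxonomy, here is a minimal per-PC stride detector sketch (an assumption about how "Strided" could be measured, not the authors' predictor): a load whose next address is predicted by last_addr + last_stride more than 95% of the time falls into the Strided class.

```c
#include <stdint.h>
#include <stdio.h>

/* One table entry per static load (per PC). */
typedef struct {
    uint64_t last_addr;
    int64_t  last_stride;
    uint64_t hits, total;
} StrideEntry;

/* Record one dynamic access and count whether the stride prediction
 * (last_addr + last_stride) would have been correct. */
static void stride_observe(StrideEntry *e, uint64_t addr) {
    if (e->total > 1 && e->last_addr + (uint64_t)e->last_stride == addr)
        e->hits++;
    e->last_stride = (int64_t)(addr - e->last_addr);
    e->last_addr   = addr;
    e->total++;
}

int main(void) {
    StrideEntry e = {0};
    for (uint64_t a = 0x1000; a < 0x2000; a += 16)   /* fixed 16-byte stride */
        stride_observe(&e, a);
    double acc = e.total ? (double)e.hits / (double)e.total : 0.0;
    printf("stride accuracy: %.2f -> %s\n", acc, acc > 0.95 ? "Strided" : "not Strided");
    return 0;
}
```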
Distribution of Loads
[Chart: Percentage of non-trivial loads for CPU-Heavy apps, stacked into Hard, Patterned, and Strided; called-out segment values: 77.5%, 5.9%, and 16.6%.]
Distribution of Stores
[Chart: Percentage of non-trivial stores for CPU-Heavy apps, stacked into Hard, Patterned, and Strided; called-out segment values: 71.7%, 10.2%, and 18.1%.]
Distribution of Loads
[Chart: Percentage of non-trivial loads (Hard/Patterned/Strided) for Mixed and GPU-Heavy apps, CPU-only vs. with GPU; called-out segment values: 44.4%, 61.6%, 47.3%, and 27.0%. An annotation notes the effect of kernels with irregular accesses moving to the GPU.]
- Overall: stride or next-line predictors will struggle.
Distribution of Stores
[Chart: Percentage of non-trivial stores (Hard/Patterned/Strided) for Mixed and GPU-Heavy apps, CPU-only vs. with GPU; called-out segment values: 38.6%, 51.3%, 48.6%, and 34.9%.]
- Overall: slightly less pronounced, but similar results as for loads.
Vector Instructions
[Chart: Percentage of dynamic instructions that are SSE instructions for CPU-Heavy apps: 7.3%.]
Vector Instructions
[Chart: Fraction of dynamic instructions that are SSE instructions for Mixed and GPU-Heavy apps, CPU-only vs. with GPU; called-out values: 16.9% and 15.0% CPU-only, dropping to 9.6% and 8.5% with the GPU.]
- Vector ISA enhancements target the same regions of code as the GPU.
Thread Level Parallelism
[Chart: Speedup of CPU-Heavy apps on 8 and 32 cores.]
Thread Level Parallelism
[Chart: Speedup on 8 and 32 cores for Mixed and GPU-Heavy apps, CPU-only vs. with GPU.]
- GPU-Heavy: the abundant parallelism disappears; the 32-core speedup drops from 14.0x to 2.1x, and there is no gain going from 8 to 32 cores.
- Mixed: gains drop from 4x to 1.4x.
- Overall: only a 10% gain going from 8 to 32 cores; 32-core TLP drops 60%, from 5.5x to 2.2x.
Overview: Impact on CPU Design
CPU Design in the post-GPU Era
- Only modest gains from increasing window sizes
- Considerably increased pressure on the branch predictor
  - In spite of fewer static branches
  - Adopt techniques targeting difficult branches (e.g., L-TAGE, Seznec 2007)
- Memory accesses will continue to be a major bottleneck
  - Stride or next-line prefetching is significantly less relevant
  - Plenty of literature, but never adopted in real machines (e.g., helper-thread prefetching or mechanisms targeting pointer chains)
- SSE rendered significantly less important
  - Every core need not have it; cores could share SSE hardware
- Extra CPU cores/threads are not of much use because of the lack of TLP
CPU Design in the post-GPU Era
(1) A clear case for big cores (focused on loads, stores, and branches rather than ILP) paired with GPUs
(2) Need to start adopting proposals for few-thread performance
(3) Start by revisiting old techniques from this new perspective
Backup: On Using Unmodified Source Code
- The most common memory layout change for GPU ports is AOS -> SOA
  - It is still a change in stride value
  - AOS is well captured by stride/Markov predictors
  - CPU-only code has even better locality, also well captured by stride/Markov predictors
  - But the locality-enhanced accesses are the ones that map to the GPU
- Minimal impact on the CPU code that remains with the GPU: it still has irregular accesses
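To illustrate the AOS -> SOA point, here is a small C sketch (the particle type and fields are hypothetical, chosen only to show the stride change):

```c
#define N 1024

/* Array of structures: walking one field strides by sizeof(struct particle),
 * a fixed stride that a stride predictor still captures. */
struct particle { float x, y, z, mass; };
static struct particle aos[N];

/* Structure of arrays: each field becomes a unit-stride stream, the
 * GPU-friendly layout a port would typically switch to. */
static struct particles_soa { float x[N], y[N], z[N], mass[N]; } soa;

static float sum_mass_aos(void) {   /* stride: 16 bytes */
    float s = 0.0f;
    for (int i = 0; i < N; i++) s += aos[i].mass;
    return s;
}

static float sum_mass_soa(void) {   /* stride: 4 bytes */
    float s = 0.0f;
    for (int i = 0; i < N; i++) s += soa.mass[i];
    return s;
}

int main(void) { return (int)(sum_mass_aos() + sum_mass_soa()); }
```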