Exploiting Heterogeneous Architectures
Alex Beutel, John Dickerson, Vagelis Papalexakis
15-740/18-740 Computer Architecture, Fall 2012, In-Class Discussion
Tuesday 10/16/2012
Heterogeneous Hardware Systems
• Multiple-CPU, single-GPU systems
• Asymmetric Multicore Processors (AMPs)
– Combination of general-purpose big and small cores
– Trade-off between performance and power consumption
– Usually “on-chip” AMPs
• Single-ISA architectures
– Similar to AMPs, but all cores share the same instruction set
– “Small” processors can support in-order execution
– “Big” ones can support out-of-order execution
Overview

| Paper   | Optimization               | Hardware          | Location            |
|---------|----------------------------|-------------------|---------------------|
| BIS     | Speed – remove bottlenecks | ACMP (single ISA) | SW/HW               |
| PIE     | Speed – ILP and MLP        | ACMP (single ISA) | Compiler (SW)       |
| YinYang | Speed – async and power    | ACMP (single ISA) | VM/SW               |
| SMS     | Speed – memory             | GPU & CPU         | Mem Controller (HW) |
Outline
1. BIS: José A. Joao et al., “Bottleneck identification and scheduling in multithreaded applications,” in Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’12).
2. YinYang: Ting Cao et al., “The yin and yang of power and performance for asymmetric hardware and managed software,” in Proceedings of the 39th International Symposium on Computer Architecture (ISCA ’12).
3. PIE: Kenzo Van Craeynest et al., “Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE),” in Proceedings of the 39th International Symposium on Computer Architecture (ISCA ’12).
4. SMS: Rachata Ausavarungnirun et al., “Staged memory scheduling: achieving high performance and scalability in heterogeneous systems,” in Proceedings of the 39th International Symposium on Computer Architecture (ISCA ’12).
Bottleneck Identification and Scheduling in Multithreaded Applications
• Focuses on the problem of removing bottlenecks
– A big problem in many systems, which can’t scale well to many threads
• Critical sections, pipeline stalls, and barriers are a few examples of bottlenecks
• In ACMPs, previous research shows that “big” cores can be used to handle (serializing) bottlenecks
– Limited fine-grain adaptivity and generality
• Authors propose BIS
– Key insight: the costliest bottlenecks are those that make other threads wait longest
– Involves cooperation of software and hardware to detect bottlenecks
– Accelerates them using one or more “big” cores of the ACMP
Bottlenecks
• Amdahl’s serial portion
• Critical sections
• Barriers
• Pipeline stages

[Figure 1 of the paper: examples of bottleneck execution, with threads T1–T4 idling over time in three panels: (a) critical section, (b) barrier, (c) pipeline stages.]

From the paper, three main contributions: (1) a cooperative hardware-software mechanism to identify the most critical bottlenecks of different types (critical sections, barriers, and pipeline stages), allowing the programmer, the software, or the hardware to remove or accelerate them; to the authors’ knowledge, the first such proposal. (2) An automatic acceleration mechanism that decides which bottlenecks to accelerate and where, using the fast cores of an ACMP as hardware-controlled fine-grained accelerators for serializing bottlenecks; hardware support minimizes the overhead of execution migration and allows quick adaptation to changing critical bottlenecks. (3) The first exploration of the trade-offs of bottleneck acceleration using ACMPs with multiple large cores, showing that multiple large cores improve performance when multiple bottlenecks are similarly critical and need to be accelerated simultaneously. The evaluation shows that BIS improves performance by 37% over a symmetric CMP and by 32% over an ACMP (which accelerates only serial sections) with the same area budget, outperforms recently proposed dynamic mechanisms (ACS for non-pipelined workloads and FDP for pipelined workloads) by 15% on a 1-large-core, 28-small-core ACMP, and improves further as the number of cores grows.
Bottleneck Identification
• Software is used to identify bottlenecks
• Instructions such as BottleneckCall, BottleneckReturn, and BottleneckWait give feedback to the BIS system
• The BIS system keeps track of bottlenecks and thread waiting cycles (TWC), with optimizations (see the sketch below)
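A minimal sketch of the identification side, with Python standing in for the hardware; the BottleneckTable class, its method names, and the cycle counts are our own illustration of the TWC idea, not the paper’s design:

```python
# Sketch of BIS-style bottleneck identification (hypothetical model):
# each bottleneck (critical section, barrier, pipeline stage) gets an ID,
# and threads that stall on it charge their waiting cycles to that ID.

from collections import defaultdict

class BottleneckTable:
    """Toy stand-in for the hardware table that tracks thread waiting
    cycles (TWC) per bottleneck."""

    def __init__(self):
        self.twc = defaultdict(int)  # bottleneck id -> accumulated TWC

    def bottleneck_wait(self, bid, cycles_waited):
        # Invoked when a thread stalls on bottleneck `bid`; in BIS this
        # feedback comes from the BottleneckWait instruction.
        self.twc[bid] += cycles_waited

    def top_bottlenecks(self, n):
        # The costliest bottlenecks are those that made other threads
        # wait the longest -- the paper's key insight.
        return sorted(self.twc, key=self.twc.get, reverse=True)[:n]

table = BottleneckTable()
table.bottleneck_wait(bid=1, cycles_waited=300)   # e.g. a critical section
table.bottleneck_wait(bid=2, cycles_waited=1200)  # e.g. a barrier
print(table.top_bottlenecks(1))  # -> [2]
```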
Scheduling in Multithreaded Applications (with ACMPs)
• Take the N bottlenecks with the highest TWC and accelerate them
• Many methods to accelerate; the authors focus on assigning bottlenecks to bigger cores in the ACMP
• The worst bottlenecks are sent from the small cores to the big core and kept in a Scheduling Buffer (see the sketch below)
• Lots of edge cases are dealt with, such as avoiding false serialization
• Also extended to the multiple-large-core context
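A matching sketch of the acceleration decision under the same assumptions (SchedulingBuffer and its methods are hypothetical names): the highest-TWC bottlenecks are queued for the big core, which drains them worst-first:

```python
# Sketch of the acceleration side (hypothetical model): bottlenecks chosen
# by the identification step land in the big core's Scheduling Buffer,
# which serves the entry with the highest TWC first.

import heapq

class SchedulingBuffer:
    def __init__(self):
        self._heap = []  # max-heap on TWC, via negated keys

    def enqueue(self, bid, twc):
        heapq.heappush(self._heap, (-twc, bid))

    def next_to_accelerate(self):
        # The big core picks the bottleneck that made threads wait longest.
        neg_twc, bid = heapq.heappop(self._heap)
        return bid, -neg_twc

buf = SchedulingBuffer()
buf.enqueue(bid=2, twc=1200)
buf.enqueue(bid=1, twc=300)
print(buf.next_to_accelerate())  # -> (2, 1200)
```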
Bottleneck Identification and Scheduling in Multithreaded Applications
The Yin/Yang Metaphor
• Hardware: heterogeneous multi-core balances power and performance
– Everyone cares about performance-per-energy (PPE) instead of absolute performance
• Software: move toward managed programming languages with virtual machines, like Java (JVM), C# (.NET), JavaScript
• Yang of heterogeneous hardware: exposed hardware adds complexity
• Yin of managed languages: the VM handles all that exposed complexity for the programmer
• Yang of VM languages: overhead
• Yin of heterogeneous hardware: small cores can alleviate that overhead problem
Yin and Yang of Power and Performance: Overview
• Virtual machines consume a ton of extra computation time and energy (~40%)
• Java VM-related numbers (~37%):
– 10% garbage collection
– 12% JIT
– 15% executing untouched instructions via the interpreter
• The paper exploits the GC, JIT, and interpreter tasks by placing them on the right types of cores, using a combination of parallelism, asynchrony, non-criticality, and hardware sensitivity
Yin and Yang of Power and Performance: Overview
• Garbage collection: asynchronous, can use many cores, does not benefit from a high clock rate. Use low-power cores with high memory bandwidth
• JIT: async, some parallelism, and non-critical. Use a small core, because it is powerful enough
• Interpreter: on the critical path and not async; it rides the application’s parallelism. Again, use low-power cores generally (see the sketch below)
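A toy placement policy capturing these three bullets; the trait table and core labels are our own illustration, not the paper’s implementation:

```python
# Toy policy mapping each VM service to a core type based on the traits
# listed above (illustrative only; the paper does this inside the VM).

VM_SERVICES = {
    # name          (asynchronous, parallel, memory_bound, on_critical_path)
    "gc":           (True,  True,  True,  False),
    "jit":          (True,  True,  False, False),
    "interpreter":  (False, False, False, True),
}

def place(name):
    is_async, parallel, mem_bound, critical = VM_SERVICES[name]
    # None of the services benefits from a big core's high clock rate,
    # so all of them land on low-power cores; the traits refine the pick.
    if mem_bound:
        return "low-power cores with high memory bandwidth"
    if critical:
        return "low-power cores (rides the application's own parallelism)"
    return "a low-power core (powerful enough for a non-critical task)"

for name in VM_SERVICES:
    print(f"{name:12s} -> {place(name)}")
```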
YinYang Experimental Evaluation
• Power: they measure the power overhead of the VM services; the VM does eat power, so it is a good candidate for heterogeneous systems
• Performance-per-energy (PPE): lots of results are reported this way instead of as absolute performance
• Moving the JIT and GC to lower-clocked cores increases PPE (by 9-13%)
• GC is very memory-bound, so it is great on low-power cores
• JIT is less memory-bound, but embarrassingly parallel, so it is still great on low-power cores
• The interpreter’s PPE improvement is less stark, but still there
YinYang Experimental Evaluation
PIE: Performance Impact Estimation
• A heterogeneous multi-core architecture is one that features big, powerful, power-hungry core(s) and small, weak, energy-efficient core(s)
• How do we map workloads onto the appropriate cores to maximize “speed-per-energy”?
• PIE is a static or dynamic scheduler that takes both memory-level parallelism and instruction-level parallelism into account to predict how well a job will do on different types of cores
– Static: schedule jobs once for the duration of the job
– Dynamic: push parts of jobs to appropriate cores
PIE (contd.)
• Motivation: intuition is wrong
– “Compute-heavy jobs should go on the heavyweight ‘big’ cores, while memory-heavy jobs can do well enough on ‘small’ cores.”
– The authors run experiments and find that big cores do well on MLP-intensive jobs, while small cores do well on ILP-intensive jobs
More PIE
• One way to schedule is to randomly sample job-core mappings, learn, and choose the best
– High overhead!
• PIE instead tries to estimate a job’s performance on core type B while the job is running on core type A
• The estimate is based on an aggregate of:
– a base instruction-level parallelism (ILP) score
– a memory-level parallelism (MLP) component that is a function of the processor’s architecture and the cache misses observed for the specific job
• The scheduler takes the estimate of the job’s performance on the different processor types and decides where to put the job (see the sketch below)
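A simplified sketch of the flavor of this estimate; the two-term split, function name, and numbers are our own simplification, not the paper’s exact model, which is built from hardware counters:

```python
# Simplified PIE-flavored estimate: split CPI into a base (ILP) part and a
# memory (MLP) part, then rescale each by how much better the other core
# type exploits that kind of parallelism. (Our simplification, not the
# paper's exact formulas.)

def estimate_cpi_on_b(cpi_base_a, cpi_mem_a, ilp_gain_b, mlp_gain_b):
    """cpi_base_a: non-miss CPI measured while running on core A.
    cpi_mem_a:     miss-induced CPI measured on core A.
    ilp_gain_b / mlp_gain_b: how much more ILP / MLP core B extracts than A.
    """
    return cpi_base_a / ilp_gain_b + cpi_mem_a / mlp_gain_b

# A memory-heavy job: the big out-of-order core overlaps misses (large MLP
# gain), so moving it there helps more than the "compute goes to big cores"
# intuition suggests.
print(estimate_cpi_on_b(cpi_base_a=1.0, cpi_mem_a=3.0,
                        ilp_gain_b=1.5, mlp_gain_b=2.5))  # -> ~1.87
```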
PIE works well
Multiple CPU Single GPU Systems
• Memory becomes the critical resource
– GPU accesses are vastly different from CPU ones
– GPUs generate significantly more requests
– The GPU spawns many different threads
– Increased contention between the GPU and the CPUs
• Need to design a memory controller that:
– schedules the memory accesses
– ensures fairness
– is scalable and easy to implement
• Current approaches are not robust to the presence of both a GPU and CPUs
Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
• Proposes a multi-stage approach to application-aware memory scheduling
– Handles interference between bandwidth-demanding apps and non-demanding ones (e.g., the GPU and the CPUs, respectively)
– Simplified hardware implementation, thanks to decoupling the memory controller across multiple stages
• Improves CPU performance without degrading GPU performance
– The authors test many settings (e.g., CPU only, GPU only, CPU & GPU)
– And compare to existing approaches
Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems
• Sophisticated approaches that prioritize memory accesses require overly complex logic
– e.g., CAM memories
• SMS instead uses a three-stage approach (see the sketch below):
1. Batch Formation: per-source aggregation of memory requests into batches
2. Batch Scheduler: prioritizes batches coming from latency-critical apps (e.g., CPU ones)
3. DRAM Command Scheduler: FIFO queues per DRAM bank; each batch from Stage 2 is placed on these FIFOs
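A toy model of the three stages; class and method names are ours, and we simplify batch formation to a fixed batch size (the real design also closes batches on row changes and timeouts):

```python
# Toy three-stage SMS model (our simplification, not the paper's hardware).

from collections import deque

class StagedMemoryScheduler:
    def __init__(self, num_banks, batch_size=4):
        self.batch_size = batch_size
        self.forming = {}     # Stage 1: source -> batch being formed
        self.ready = []       # Stage 2 input: (latency_critical, batch)
        self.bank_fifos = [deque() for _ in range(num_banks)]  # Stage 3

    def add_request(self, source, addr, bank, latency_critical):
        # Stage 1: batch formation, one batch per source (a CPU core or
        # the GPU); here a batch closes after batch_size requests.
        batch = self.forming.setdefault(source, [])
        batch.append((addr, bank))
        if len(batch) >= self.batch_size:
            self.ready.append((latency_critical, self.forming.pop(source)))

    def schedule_batch(self):
        # Stage 2: pick a whole batch, favoring latency-critical (CPU)
        # sources over the bandwidth-hungry GPU.
        if not self.ready:
            return
        i = max(range(len(self.ready)), key=lambda j: self.ready[j][0])
        _, batch = self.ready.pop(i)
        # Stage 3: simple FIFO per DRAM bank; no request reordering here.
        for addr, bank in batch:
            self.bank_fifos[bank].append(addr)

sms = StagedMemoryScheduler(num_banks=2)
for i in range(4):
    sms.add_request("cpu0", addr=i, bank=i % 2, latency_critical=True)
sms.schedule_batch()
print([list(q) for q in sms.bank_fifos])  # -> [[0, 2], [1, 3]]
```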
SMS works well
Discussion
• Leaving aside the ISA assumption, how can we combine ideas from the first three papers?
– It seems like we could incorporate ILP and MLP into the queuing decisions in BIS
– The VM could also become more dynamic
• BIS, PIE, and YinYang assume heterogeneous multi-core systems with the same instruction set architecture (ISA). How would these papers change if we assumed different ISAs?
– I’m thinking not fundamentally. It would make the prediction part of PIE more complicated (you’d need some context-aware performance scaling between ISAs), but wouldn’t break it.
– Similarly, you’d need some scaling for the VM work, but it wouldn’t break anything there, either.
– Maybe the class has an opinion on this? Do we need to keep ISAs homogeneous across a heterogeneous multi-core? What do we gain or lose from this?