Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping

Chi-Keung Luk
Sunpyo Hong
Hyesoon Kim
Presented by Chris Spain
Heterogeneous Computing
Current CPU+GPU systems expose multiple levels of hardware parallelism to
software: the GPU has tens to hundreds of special-purpose cores, while the CPU
has a few general-purpose cores. Within each CPU core, the SIMD extension of
the ISA provides short-vector parallelism.
Problem: Dot Product of Large Vectors
• Multiplication of the elements of large vectors can be done in parallel on the GPU.
• Summation of the products can also be done on the GPU, using reduction.
• After the reduction, each block holds a partial sum.
• How should we add the partial sums?
  • GPU: wasteful if there are only a few partial sums.
  • CPU: lets the GPU get on with the next set.
Example
$A \cdot B = \sum_{i=1}^{N} A_i B_i$
[Figure: A and B are split across Block 1, Block 2, …, Block N; array reduction leaves one partial sum per block. Final sum on GPU or CPU?]
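The block-wise dot product can be sketched in plain C++ as follows. This is only an illustration of the structure: the per-block loop stands in for what a GPU reduction kernel would leave behind, the final addition is done serially on the host, and the function names and the `block_size` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// One partial sum per block, mimicking the result of a per-block reduction.
// block_size is a hypothetical tuning parameter, not taken from the slides.
std::vector<double> block_partial_sums(const std::vector<double>& a,
                                       const std::vector<double>& b,
                                       std::size_t block_size) {
    assert(block_size > 0);
    assert(a.size() == b.size());
    std::vector<double> partial;
    for (std::size_t start = 0; start < a.size(); start += block_size) {
        const std::size_t end = std::min(start + block_size, a.size());
        double sum = 0.0;
        for (std::size_t i = start; i < end; ++i) sum += a[i] * b[i];
        partial.push_back(sum);
    }
    return partial;
}

// Final step on the host: add the partial sums serially.
double final_sum_on_cpu(const std::vector<double>& partial) {
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

With only a few blocks, the final addition is a short serial loop, which is the case where handing it to the CPU looks attractive.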
Not an Either/Or Decision
Matrix multiplication example:
• The most efficient mapping depends on the data size and may share work between the CPU and the GPU.
How do we know whether to run on GPU or CPU?
• It can depend on the problem type: is it mostly serial, and does it need large amounts of cache? Recall last week's paper showing that sort and search are faster on the CPU.
• It also depends on the size of the data set: would we bother summing 30 elements on a GPU? We would have to allocate memory, transfer the data, run the kernel, and transfer the result back.
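The data-size argument can be made concrete with a toy cost model. The sketch below assumes each device costs a fixed overhead plus a per-element cost; the struct, the function names, and any constants fed to it are hypothetical illustrations, not measurements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cost model: values are illustrative, not measured.
struct DeviceCost {
    double fixed_us;        // allocation, transfers, kernel launch
    double per_element_us;  // marginal cost of one more element
};

// Projected time for n elements under the linear model above.
double projected_us(const DeviceCost& d, std::size_t n) {
    assert(d.fixed_us >= 0.0 && d.per_element_us >= 0.0);
    return d.fixed_us + d.per_element_us * static_cast<double>(n);
}

// True when the GPU's lower per-element cost outweighs its fixed overhead.
bool gpu_is_faster(const DeviceCost& cpu, const DeviceCost& gpu, std::size_t n) {
    return projected_us(gpu, n) < projected_us(cpu, n);
}
```

Under such a model, a 30-element sum never amortizes the GPU's fixed overhead, while a sufficiently large input does.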

Mapping Computations

Manually mapping to GPU or CPU:
• Labor intensive
• Does not adapt to changes during runtime
• Does not adapt to hardware changes
Qilin: Adaptive Mapping
• API for writing parallelizable operations in C/C++
• Automatic mapping
• Responds to runtime changes in data size
• Adapts to new hardware
• Quicker and easier than manual mapping
Qilin API
• API on top of C/C++
• Built on Intel Threading Building Blocks (TBB) and CUDA
• Qilin compiler generates TBB and CUDA source code on the fly
• Defines new Qilin types: QArray and QArrayList
• Allows the option to specify which device to use, for example:
  • Add(Qx, Qy, PE_SELECTOR_GPU)
  • Add(Qx, Qy, PE_SELECTOR_CPU)
  • Add(Qx, Qy, PE_SELECTOR_DEFAULT)
• Uses dynamic compilation to compile API calls into native machine code while the program runs.
Two Possible Approaches Can Be Used
• Stream API
• Threading API

Stream API operations
Example 1: Stream API
1) Convert normal arrays into QArrays.
2) All operations here are elementwise and run in parallel (sum does the reduction).
3) Convert the QArrays back to normal arrays.
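The three steps can be illustrated with a small stand-in type. `MockQArray`, `Sum`, `CopyToNormalArray`, and `stream_style_dot` below are hypothetical names for this sketch; they only mimic the shape of a stream-style API (wrap, operate elementwise, reduce, unwrap) and are not Qilin's actual classes or signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in, NOT Qilin's QArray.
class MockQArray {
public:
    // Step 1: wrap a normal array.
    MockQArray(const float* src, std::size_t n) : data_(src, src + n) {}

    // Step 2a: elementwise product. Each element is independent, so a real
    // backend could evaluate all of them in parallel.
    friend MockQArray operator*(const MockQArray& x, const MockQArray& y) {
        assert(x.data_.size() == y.data_.size());
        MockQArray out(x);
        for (std::size_t i = 0; i < out.data_.size(); ++i) out.data_[i] *= y.data_[i];
        return out;
    }

    // Step 2b: reduction to a one-element array.
    friend MockQArray Sum(const MockQArray& x) {
        const float total = std::accumulate(x.data_.begin(), x.data_.end(), 0.0f);
        return MockQArray(&total, 1);
    }

    // Step 3: copy the contents back into a normal array.
    void CopyToNormalArray(float* dst, std::size_t n) const {
        assert(n == data_.size());
        std::copy(data_.begin(), data_.end(), dst);
    }

private:
    std::vector<float> data_;
};

// Dot product written in the wrap / operate / unwrap style.
float stream_style_dot(const float* a, const float* b, std::size_t n) {
    MockQArray qa(a, n), qb(b, n);
    MockQArray qs = Sum(qa * qb);
    float s = 0.0f;
    qs.CopyToNormalArray(&s, 1);
    return s;
}
```

The mock evaluates eagerly on the host; the point is only that user code names whole-array operations and never writes the loops itself.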
Example 2: Threading API
CPU Implementation
GPU Implementation
Threading API Continued
1) Convert normal arrays into 2D QArrays.
2) Create a glued implementation of the function from the CPU and GPU versions.
3) Allow the work to be divided between CPU and GPU, and create the argument list.
4) Run the function with the argument list and the default mapping.
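The idea of gluing two implementations and dividing the work between them can be sketched as below. Everything here is an assumption made for illustration: `GluedOp`, `apply_partitioned`, and the row-scaling functions are hypothetical, the "GPU" version runs on the host, and the two halves run one after the other rather than concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Shared signature: process rows [row_begin, row_end) of a row-major matrix
// with the given number of columns, in place.
using RowRangeFn =
    std::function<void(std::vector<float>&, std::size_t, std::size_t, std::size_t)>;

// Hypothetical "glued" operation: two implementations of the same function.
struct GluedOp {
    RowRangeFn cpu_impl;
    RowRangeFn gpu_impl;  // stand-in only; runs on the host in this sketch
};

// CPU version: doubles every element in the row range.
void scale_rows_cpu(std::vector<float>& m, std::size_t cols,
                    std::size_t r0, std::size_t r1) {
    for (std::size_t i = r0 * cols; i < r1 * cols; ++i) m[i] *= 2.0f;
}

// Placeholder for a kernel launch: same result, written per row and column.
void scale_rows_gpu_standin(std::vector<float>& m, std::size_t cols,
                            std::size_t r0, std::size_t r1) {
    for (std::size_t r = r0; r < r1; ++r)
        for (std::size_t c = 0; c < cols; ++c) m[r * cols + c] *= 2.0f;
}

// Give the first cpu_fraction of the rows to the CPU version, the rest to
// the GPU stand-in.
void apply_partitioned(const GluedOp& op, std::vector<float>& m,
                       std::size_t rows, std::size_t cols, double cpu_fraction) {
    assert(cpu_fraction >= 0.0 && cpu_fraction <= 1.0);
    assert(m.size() == rows * cols);
    const std::size_t split =
        static_cast<std::size_t>(cpu_fraction * static_cast<double>(rows));
    op.cpu_impl(m, cols, 0, split);
    op.gpu_impl(m, cols, split, rows);
}
```

Because both implementations compute the same function, the result is independent of `cpu_fraction`; only the running time would change on real hardware.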
Compilation of Qilin Programs

Uses dynamic compilation at runtime with the following steps:
1) Build directed acyclic graphs (DAGs) from API calls according to data dependencies.
2) Decide the mapping from computations to processing elements (CPU and/or GPU).
3) Optimize the DAGs via coalescing and removal of temporary arrays.
4) Generate code.
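Step 1 can be illustrated with a small dependency builder. This is a sketch under stated assumptions, not Qilin's compiler: `Call`, `build_dag`, and `topo_order` are hypothetical names, only read-after-write dependencies are tracked, and each array is assumed to be written by a single call.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

// One recorded API call: it reads some arrays and writes one.
struct Call {
    std::string name;
    std::vector<std::string> reads;
    std::string writes;
};

// For each call, list the later calls that read the array it writes.
std::vector<std::vector<std::size_t>> build_dag(const std::vector<Call>& calls) {
    std::vector<std::vector<std::size_t>> succ(calls.size());
    std::map<std::string, std::size_t> last_writer;
    for (std::size_t i = 0; i < calls.size(); ++i) {
        for (const std::string& r : calls[i].reads) {
            auto it = last_writer.find(r);
            if (it != last_writer.end()) succ[it->second].push_back(i);
        }
        last_writer[calls[i].writes] = i;
    }
    return succ;
}

// Kahn's algorithm: an execution order that respects every dependency.
std::vector<std::size_t> topo_order(const std::vector<std::vector<std::size_t>>& succ) {
    std::vector<std::size_t> indegree(succ.size(), 0);
    for (const auto& out : succ)
        for (std::size_t v : out) ++indegree[v];
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < succ.size(); ++i)
        if (indegree[i] == 0) ready.push(i);
    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t u = ready.front();
        ready.pop();
        order.push_back(u);
        for (std::size_t v : succ[u])
            if (--indegree[v] == 0) ready.push(v);
    }
    assert(order.size() == succ.size());  // edges only point forward, so no cycles
    return order;
}
```

Calls with no path between them in the graph are independent, which is what leaves room for mapping them to different processing elements.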
Adaptive Mapping
• Adaptive mapping automatically finds the near-optimal mapping from computations to processing elements.
• The mapping is near-optimal for the given application, problem size, and system configuration.
• Qilin stores a database of execution-time projections for every program it has ever executed.
• The first execution of a program is treated as a training run.

First Execution of Program
1) Take input data of size Nt and divide it into two parts, N1 and N2.
2) Use N1 on the CPU and N2 on the GPU.
3) Subdivide each part and run the pieces on their respective devices.
4) Measure the execution time of each piece.
5) Fit a line to the time data points for each device.
6) Each line then becomes the projection of how execution time scales with data size.
Projection of Execution Time as Data Size Increases
Using the Curves to Determine Division of Work
Equation of projection lines
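A minimal sketch of how two fitted lines could determine the split, assuming linear projections T_C(N) = a_c + b_c·N for the CPU and T_G(N) = a_g + b_g·N for the GPU, and assuming both devices run concurrently so the total time is the larger of the two. Giving a fraction β of the N elements to the CPU, the two projected times are equal when a_c + b_c·β·N = a_g + b_g·(1 − β)·N, which gives β = (a_g − a_c + b_g·N) / ((b_c + b_g)·N). The names `Line`, `fit_line`, and `cpu_fraction` are hypothetical, and this simplified model is an assumption rather than the exact equation from the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Line {
    double a;  // intercept: fixed cost
    double b;  // slope: cost per element
};

// Ordinary least-squares fit of times[i] = a + b * sizes[i].
Line fit_line(const std::vector<double>& sizes, const std::vector<double>& times) {
    assert(sizes.size() == times.size() && sizes.size() >= 2);
    const double n = static_cast<double>(sizes.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        sx += sizes[i];
        sy += times[i];
        sxx += sizes[i] * sizes[i];
        sxy += sizes[i] * times[i];
    }
    const double denom = n * sxx - sx * sx;
    assert(denom != 0.0);  // needs at least two distinct sizes
    const double b = (n * sxy - sx * sy) / denom;
    return Line{(sy - b * sx) / n, b};
}

// Fraction of the n elements to give to the CPU so that the two projected
// times are equal, clamped to [0, 1] when one device should take everything.
double cpu_fraction(const Line& cpu, const Line& gpu, double n) {
    assert(n > 0.0 && cpu.b > 0.0 && gpu.b > 0.0);
    const double beta = (gpu.a - cpu.a + gpu.b * n) / ((cpu.b + gpu.b) * n);
    return std::min(1.0, std::max(0.0, beta));
}
```

Clamping is valid here because the CPU's projected time grows with β while the GPU's shrinks: if the lines do not cross inside [0, 1], the larger of the two times is smallest at one of the endpoints.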
Experimental Setup
Results 1: Effect on Power
Results 2: Training Set Size
Impact of the training set size on adaptive mapping performance. (Note: the
y-axis is in logarithmic scale. The legend "X%" means the training set size is
X% of the reference set size.)
Results 3: Change in Hardware
Issues:
• Is it really worth the trouble? Many problems have a single loop or task that is suited to the GPU.
• Data size is often known and fixed.
• Threading API: not fun writing two implementations of the same code (CPU and GPU).
• Compilation overhead at runtime.

Conclusion
• Automated mapping is preferable to manual techniques.
• Qilin works almost as well as manual mapping for:
  • Power consumption
  • Execution time
• Qilin is adaptable to changes in:
  • Input size
  • Hardware configuration
Questions?