Compiling with multicore Jeehyung Lee 15-745 Spring 2009 1

advertisement
Compiling with multicore
Jeehyung Lee
15-745 Spring 2009
1
Papers

Automatic Thread Extraction with Decoupled
Software Pipelining



Fully automatic
Fine grained pipelining
A Practical Approach to Exploring CoarseGrained Pipeline Parallelism in C Programs


Semi-automatic
Coarse grained pipelining
2
First paper

Automatic Thread Extraction with Decoupled
Software Pipelining
 Guilherme Ottoni, Ram Rangan, Adam
Stoler and David August
 From Princeton University
3
What is the paper about?

Despite increasing uses of multiprocessors,
many single threaded applications do not
benefit

Let the compiler automatically extract threads
and exploit lurking pipeline parallelism

Extract non-speculative and truly decoupled
threads through Decoupled Software
Pipelining(DSWP)
4
Why decoupled pipelining?
Example
Linked list traversal
5
Why decoupled pipelining?
DOACROSS
Iteration * (LD latency + communication latency)
6
Why decoupled pipelining?
DSWP
One way pipelining
Iteration * LD latency
7
DSWP

Flow of data (dependency) is acyclic among
cores
 With use of inter-core queue, threads can be
decoupled

Efficiency + high tolerance for latency
8
DSWP Algorithm






Build dependence graph
Find strongly connected components (SCC)
Create DAG of SCC
Partition DAG
Split codes into partitions
Add flows to partitions
9
Build dependence graph
Include every traditional
dependence (data, control,
and memory) & extensions
10
Find SCC

SCC : Instructions that form a dependency
cycle in a loop
 Instructions in SCC cannot be parallelized
1
1
2
2
1
2
11
Create DAG of SCCs

Merge instructions within each SCC and
update dependency arrows
12
Partition DAG

Partition DAG nodes into n partitions
( n <= # of processors)
 Use heuristic to maximize load balance




Decide # of partitions (threads)
Start filling in from partition 1 with nodes from the
top of DAG.
When the partition is stuffed (estimated by # of
cycles), move on to next partition
Find the best # of threads and its partition
13
Split codes and insert flows (done!)

For each partition, insert code basic blocks
relevant to its contained SCC node

Add in codes for dependency flow
14
Result

19.4% speedup on important benchmark loops,
9.2% overall
 When core bandwidth is halved



Single threaded code slows down by 17.1%
DSWP code is still slightly faster than singlethreaded code running on full-bandwidth core
Promising enabler for Thread-LevelParallelism(TLP)?
15
Second Paper

A Practical Approach to Exploring CoarseGrained Pipeline Parallelism in C Programs
 William Thies, Vikram Chandrasekhar and
Saman Amaransinghe
 From MIT
16
What is the paper about?

Despite increasing uses of multiprocessors,
many single threaded… (Repeated)

Coarse grained pipelining is more desirable,
but is especially hard with obfuscated C codes

Let people define pipeline, and learn practical
dependencies in runtime
17
What is the paper about?

Despite increasing uses of multiprocessors,
many single threaded… (Repeated)

Coarse grained pipelining is more desirable,
but is especially hard with obfuscated C codes

Let people define stages, and learn practical
dependencies in runtime …for streaming
applications
18
Interface

Add annotations in the body of top loop
19
Dynamic analysis

The system creates a stream graph according
to annotations.
How do they find dependencies?
20
Dynamic analysis

Streaming applications tend to have a fixed
pattern of dataflow (stable flow) among
pipeline stages
21
Dynamic analysis

Run the application on training examples, and
record every relevant store-load pair across
pipeline boundaries
This gives us practical dependencies
22
Interface

Program shows a complete stream graph
User decides if he/she likes this
pipelining or not
• If yes, done!
• else, redo annotations. Iterate
over until satisfied
23
Actual pipelining

When compiled, annotation macros emit
codes that will fork original program for each
pipeline stage
24
Result

Average 2.78x speedup, max 3.89x on 4-core
 Seems unsound but practical (?)
25
Download