sai-isca06.ppt

Program Demultiplexing:
Data-flow based Speculative Parallelization
Saisanthosh Balakrishnan
Guri Sohi
University of Wisconsin-Madison
Speculative Parallelization
• Construct threads from sequential program
– Loops, methods, …
• Execute threads speculatively
– Hardware support to enforce program order
• Application domain
– Irregularly parallel
• Why it matters now
– Single-core performance gains are incremental
Speculative Parallelization Execution
• Execution model
– Fork threads in program order for execution
– Commit tasks in that order
Control-flow Speculative Parallelization
• Limitation
– Reaching distant parallelism
[Figure: tasks T1–T4 forked and committed in program order]
Outline
• Program Demultiplexing Overview
• Program Demultiplexing Execution Model
• Hardware Support
• Evaluation
Program Demultiplexing Framework
• Trigger
– Begins execution of handler
• Handler
– Sets up the execution (parameters)
• Demultiplexed execution
– Speculative
– Stored in Execution Buffer (EB)
• At call site
– Search EB for the execution
• Dependence violations
– Invalidate executions
[Figure: sequential execution calls M() at its call site; in PD, the trigger starts the handler and the demultiplexed execution of M() earlier, and the result waits in the EB]
Program Demultiplexing Highlights
• Method granularity
– Well defined
• Parameters
• Stack for local communication
• Trigger forks execution
– A means of reaching a distant method
– Different from call site
• Independent speculative executions
– No control dependence with other executions
– Triggers lead to unordered execution
• Not according to program order
Outline
• Program Demultiplexing Overview
• Program Demultiplexing Execution Model
• Hardware Support
• Evaluation
Example: 175.vpr, update_bb ()
..
x_from = block [b_from].x;
y_from = block [b_from].y;
find_to (x_from, y_from, block [b_from].type, rlim, &x_to, &y_to);
..
for ( k = 0; k < num_nets_affected; k++ ) {
    inet = nets_to_update [k];
    if (net_block_moved [k] == FROM_AND_TO)
        continue;
    ..
    if ( net [inet].num_pins <= SMALL_NET ) {
        get_non_updateable_bb (inet, &bb_coord_new [bb_index]);
    } else {
        if ( net_block_moved [k] == FROM )
            /* Call Site 2 */
            update_bb ( inet, &bb_coord_new [bb_index],
                        &bb_edge_new [bb_index], x_from, y_from, x_to, y_to );
        else
            /* Call Site 1 */
            update_bb ( inet, &bb_coord_new [bb_index],
                        &bb_edge_new [bb_index], x_to, y_to, x_from, y_from );
    }
    ..
    bb_index++;
}
Handlers
• Provides parameters to execution
update_bb
(inet, &bb_coord_new [bb_index], &bb_edge_new
[bb_index], x_from, y_from, x_to, y_to);
• Achieves separation of call site and execution
• Handler code
– Slice of dependent instructions from call site
– Many variants possible
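One such slice can be written out as ordinary C. The sketch below is hypothetical — the struct layout, the argument struct, and the handler's signature are assumptions, not vpr's real code — but it shows the idea: the handler re-executes only the instructions the call's arguments depend on and packages the parameters for the demultiplexed execution.

```c
/* Hypothetical mirror of the program state the handler reads; field
 * and variable names follow the vpr excerpt but the layout is assumed. */
typedef struct { int x, y; } block_t;

typedef struct {
    int inet;            /* net index                  */
    int bb_index;        /* index into bb_coord_new[]  */
    int x1, y1, x2, y2;  /* from/to coordinates        */
} update_bb_args;

/* Handler slice for one update_bb call site: re-executes only the
 * instructions the arguments depend on (the loads of block[b_from] and
 * the outputs of find_to), then packages the parameters for the
 * demultiplexed execution. */
void handler_update_bb(const block_t *block, int b_from,
                       int x_to, int y_to,      /* produced by find_to */
                       int inet, int bb_index,
                       update_bb_args *out)
{
    int x_from = block[b_from].x;   /* sliced from the loop preheader */
    int y_from = block[b_from].y;
    out->inet = inet;
    out->bb_index = bb_index;
    out->x1 = x_from; out->y1 = y_from;
    out->x2 = x_to;   out->y2 = y_to;
}
```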
Handlers Example
[Same update_bb listing as above, with handler slices H1 and H2 highlighting the instructions that compute the two call sites' arguments]
Triggers
• Fork demultiplexed execution
– Usually when method and handler are ready
• i.e., when data dependences are satisfied
• Begins execution of the handler
Identifying Triggers
• Generate memory profile
• Identify trigger point
– Program state for H + M is available there
• Collect for many executions
– Good coverage
• Represent trigger points
– Use instruction attributes
• PCs, memory write addresses
[Figure: sequential execution timeline — the trigger point marks where program state for H + M becomes available, ahead of the Handler and M()]
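A trigger point represented by instruction attributes can be modeled as a (PC, write address) pair checked against committed stores. A minimal sketch — the struct, the one-shot arming policy, and all names are assumptions, not the paper's encoding:

```c
#include <stdint.h>
#include <stdbool.h>

/* A trigger point identified from the memory profile: it fires when a
 * committed store at this PC writes this address, i.e. when the last
 * input of the handler + method (H + M) becomes available. */
typedef struct {
    uint64_t pc;
    uint64_t write_addr;
    bool     armed;
} trigger_t;

/* Evaluate registered triggers against one committed store.
 * Returns the index of the trigger that fires, or -1. */
int evaluate_triggers(trigger_t *t, int n, uint64_t pc, uint64_t addr)
{
    for (int i = 0; i < n; i++) {
        if (t[i].armed && t[i].pc == pc && t[i].write_addr == addr) {
            t[i].armed = false;   /* one execution forked per firing */
            return i;             /* fork handler + method here      */
        }
    }
    return -1;
}
```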
Triggers Example
[Same update_bb listing as above; triggers T1 and T2 fire at least 400 cycles before the call sites, while each update_bb execution takes about 90 cycles]
Handlers Example … (2)
[Same update_bb listing, highlighting the stack references among the call sites' arguments]
Outline
• Program Demultiplexing Overview
• Program Demultiplexing Execution Model
• Hardware Support
• Evaluation
Hardware Support Outline
• Support for triggers
• Demultiplexed execution
• Maintaining executions
– Storage
– Invalidation
– Committing
(dealt with in other speculative parallelization proposals)
Support for Triggers
• Triggers are registered with hardware
– ISA extensions
– Similar to debug watchpoints
• Evaluation of triggers
– Only by committed instructions
• PC, address
– Fast lookup with filters
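The "fast lookup with filters" bullet suggests a hashed filter in front of the trigger table, so most committed instructions are rejected without a full search. The slide does not specify the filter's organization, so the Bloom-style sketch below, including its size and hash, is an assumption:

```c
#include <stdint.h>
#include <stdbool.h>

#define FILTER_BITS 256

typedef struct { uint8_t bits[FILTER_BITS / 8]; } filter_t;

static unsigned filter_hash(uint64_t key)
{
    return (unsigned)((key ^ (key >> 7)) % FILTER_BITS);
}

/* Record a trigger attribute (PC or write address) in the filter. */
void filter_insert(filter_t *f, uint64_t key)
{
    unsigned h = filter_hash(key);
    f->bits[h / 8] |= (uint8_t)(1u << (h % 8));
}

/* May return true for a key that was never inserted (false positive),
 * but never false for one that was: safe as a pre-check before the
 * full trigger-table search. */
bool filter_may_contain(const filter_t *f, uint64_t key)
{
    unsigned h = filter_hash(key);
    return (f->bits[h / 8] >> (h % 8)) & 1u;
}
```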
Demultiplexed Execution
• Hardware: Typical MP system
– Private cache for speculative data
– Extend cache line with “access” bit
• Misses serviced by Main processor
– No communication with other executions
• On completion
– Collect read set (R)
• Accessed lines
– Collect write set (W)
• Dirty lines
– Invalidate write set in cache
[Figure: a main processor and auxiliary processors P0–P3, each with a private cache C]
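The access-bit scheme can be sketched in C: each private-cache line carries an "accessed" bit alongside the usual dirty bit, and on completion the read set is the accessed lines, the write set the dirty ones, which are then invalidated so speculative data never escapes the cache. Field names and structure are illustrative, not the hardware's actual layout:

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-line state in the auxiliary processor's private cache,
 * extended with an "accessed" bit as on the slide. */
typedef struct {
    uint64_t tag;
    bool valid, accessed, dirty;
} line_t;

/* Mark a line as it is touched by the demultiplexed execution. */
void touch(line_t *l, uint64_t tag, bool is_write)
{
    l->tag = tag; l->valid = true; l->accessed = true;
    if (is_write) l->dirty = true;
}

/* On completion: read set R = all accessed lines, write set W = the
 * dirty ones; the write set is invalidated in the cache so the
 * speculative data never becomes architecturally visible from here.
 * Returns the read-set size; *nw gets the write-set size. */
int collect_sets(line_t *cache, int nlines,
                 uint64_t *rset, uint64_t *wset, int *nw)
{
    int nr = 0; *nw = 0;
    for (int i = 0; i < nlines; i++) {
        if (!cache[i].valid || !cache[i].accessed) continue;
        rset[nr++] = cache[i].tag;
        if (cache[i].dirty) {
            wset[(*nw)++] = cache[i].tag;
            cache[i].valid = false;   /* invalidate speculative write */
        }
    }
    return nr;
}
```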
Execution buffer pool
• Holds speculative executions
• Execution entry contains
– Read and write set (<tag>, <data>)
– Parameters and return value
• Alternatives
– Use cache (may be more efficient)
– Similar to other proposals
– Not the focus of this paper
[Figure: pool of execution entries — Method (Parameters), Read Set, Write Set, Return value]
Invalidating Executions
• For a committed store address
– Search read and write sets
– Invalidate matching executions
[Figure: execution-buffer entries — Method (Parameters), Read Set <tag>/<data>, Write Set <tag>/<data>, Return value — with one matching entry being invalidated]
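The invalidation check reduces to a set-membership test per entry. A simplified sketch, with small tag arrays standing in for the real entries (which also hold parameters, data, and the return value):

```c
#include <stdint.h>
#include <stdbool.h>

#define SET_MAX 8

typedef struct {
    bool     valid;
    uint64_t rset[SET_MAX]; int nr;   /* read-set line tags  */
    uint64_t wset[SET_MAX]; int nw;   /* write-set line tags */
} eb_entry_t;

static bool set_contains(const uint64_t *s, int n, uint64_t tag)
{
    for (int i = 0; i < n; i++)
        if (s[i] == tag) return true;
    return false;
}

/* On every committed store, search all entries' read and write sets
 * and invalidate any execution that touched the stored-to line.
 * Returns the number of executions invalidated. */
int invalidate_on_store(eb_entry_t *eb, int n, uint64_t store_tag)
{
    int killed = 0;
    for (int i = 0; i < n; i++) {
        if (!eb[i].valid) continue;
        if (set_contains(eb[i].rset, eb[i].nr, store_tag) ||
            set_contains(eb[i].wset, eb[i].nw, store_tag)) {
            eb[i].valid = false;
            killed++;
        }
    }
    return killed;
}
```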
Using Executions
• For a given call site
– Search method name, parameters
– Get write and read set
– Commit
– Use
• If accessed by program
• If accessed by another method
– Nested methods
[Figure: call-site search selects a matching entry among the execution-buffer entries]
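The call-site search can be sketched as a lookup keyed by method and parameter values; committing the write set is elided here, and the consumed entry is simply invalidated. All names and the fixed-size parameter array are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NPARAMS 4

/* Minimal EB entry for the call-site search: identified by method and
 * parameter values; the write set is omitted for brevity. */
typedef struct {
    bool     valid;
    int      method_id;
    uint64_t params[NPARAMS];
    uint64_t retval;
} eb_entry_t;

/* At a call site, search the pool for a valid execution of the same
 * method with identical parameters.  On a hit the program would commit
 * the entry's write set and use its return value instead of
 * re-executing; on a miss it calls the method as usual.
 * Returns the entry index, or -1 on a miss. */
int eb_lookup(eb_entry_t *eb, int n, int method_id,
              const uint64_t *params, uint64_t *retval)
{
    for (int i = 0; i < n; i++) {
        if (eb[i].valid && eb[i].method_id == method_id &&
            memcmp(eb[i].params, params, sizeof eb[i].params) == 0) {
            *retval = eb[i].retval;
            eb[i].valid = false;   /* entry is consumed once committed */
            return i;
        }
    }
    return -1;   /* miss: execute the method normally */
}
```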
Outline
• Program Demultiplexing Overview
• Program Demultiplexing Execution Model
• Hardware Support
• Evaluation
Reaching distant parallelism
[Figure: log-scale distances (0.01 to 1000) between fork, call site, and M() for crafty, gap, gzip, mcf, parser, twolf, vortex, and vpr; A spans fork to call site, B spans call site to M()'s execution]
Performance evaluation
• Performance benefits limited by
– Methods in the program
– Handler implementation
[Figure: speedup (1x to 2.8x) for crafty, gap, gzip, mcf, parser, twolf, vortex, and vpr with 2 to 6 processors (2p–6p)]
Summary of other results (refer to the paper)
• Method sizes
– Tens to thousands of instructions; usually in the low hundreds
• Demultiplexed execution overheads
– Common case 1.1x to 2.0x
• Trigger points
– 1 to 3; outliers exist due to macro usage
• Handler length
– 10 to 50 instructions on average
• Cache lines
– Read ~20, written ~10
• Demultiplexed execution
– Held for hundreds of cycles on average
Conclusions
• Method granularity
– Exploit modularity in program
• Trigger and handler to allow “earliest” execution
– Data-flow based
• Unordered execution
– Reach distant parallelism
• Orthogonal to other speculative parallelization techniques
– Can be used to further speed up demultiplexed executions
Backup
Average trigger points in call site
[Figure: average and maximum trigger-point PCs per call site for crafty, gap, gzip, mcf, parser, twolf, vortex, and vpr; averages of 1–4 PCs, maxima up to 27]
• Small set of trigger points for a given call site
– Defines reachability from trigger to the call site
Evaluation
• Full-system execution-based simulator
– Intel x86 ISA and Virtutech Simics
– 4-wide out-of-order processors
– 64K Level 1 caches (2 cycle), 1 MB Level 2 (12 cycle)
– MSI coherence
• Software toolchain
– Modified gcc-compiler and lancet tool
• Debugging information, CFG, program dependence graph
– Simulator based memory profile
– Generates triggers and handlers
• No mis-speculations occur
Reaching distant parallelism
A = cycles between fork and call site
[Figure: average and maximum A for each benchmark; averages of a few hundred to a few thousand cycles, maxima up to 9,600]
Execution Buffer Entries
[Figure: average and maximum execution-buffer entries per benchmark (maxima up to 52), and average cycles each entry is held (70 to 900)]
• Storage requirements
– Max case 284 KB
• Minimize entries by better scheduling
Read and write set
Cache lines written (64 B): [Figure: min/avg/max per benchmark; averages of roughly 3–10 lines, maxima up to 18]
Cache lines read (64 B): [Figure: min/avg/max per benchmark; averages of roughly 7–24 lines, maxima up to about 50]
Demultiplexed execution overheads
• Overheads due to
– Handler
– Cache misses during the demultiplexed execution
• Common case
– Between 1.1x and 2.0x
• Small methods → high overheads
[Figure: min/avg/max execution-time overhead per benchmark; averages roughly 1.1x–2.2x, maxima up to 5.0x]
Length of handlers
[Figure: min/avg/max handler length per benchmark — averages of about 10–50 instructions, maxima up to 240; handler instruction count as a fraction of the method ranges from 4% to 100%]
Method sizes
[Figure: min/avg/max method size per benchmark — averages from tens to hundreds of instructions, maxima up to 1,200]
Methods
                crafty  gap  gzip  mcf  parser  twolf  vortex  vpr
Methods             24   16     9    8      12     10      11   11
Call Sites         206   59    27    9      84     26     106   20
Exec. time (%)      85   90    51   30      55     92      88   99

– Runtime includes frequently called methods
Loop-level Parallelization
• Unit: loop iterations
• Live-ins from
– P-slice
• Similar to handler
• Fork instruction
– Restricted: same basic block level, method
– Program-order dependent
– Ordered forking
[Figure: Mitosis-style ordered fork of a loop (fork … loop … endl)]
Method-level parallelization
• Unit: method continuations
– Program after the method returns
• Orthogonal to PD
[Figure: method-level fork at the call; the continuation runs past M()'s return]
Reaching distant parallelism
[Figure: log-scale (0.01 to 10000) distribution of A and B for methods M1() and M2(); per-benchmark percentage with A > 1: crafty 60, gap 72, gzip 30, mcf 80, parser 70, twolf 40, vortex 63, vpr 47]
Reaching distant parallelism
B = call time to earliest execution time
[Figure: average and maximum B per benchmark (maxima up to 8,000 cycles); multiplying ratios R1 = C/B with executions restricted to one outstanding, and R2 = C(no params)/C, mostly between 1.1 and 2.7]
Issues with Stack
• Stack pointer is position dependent
– Handler has to insert parameters at right position
• Same stack addresses denote different variables
– Affects triggers
• Different stack pointers in program and execution
– Stack may be discarded
– Committing requires relocating stack results
• Example: parameters passed by reference
Benchmarks
• SPECint2000 benchmarks
– C programs
• Did not evaluate gcc, perl, bzip2, and eon
– Written with no intention of creating concurrency
– No specific or clean programming style
• Many methods perform several tasks
– May have fewer opportunities
Hardware System
• Intel x86 simulation
– Virtutech Simics based full-system, Bochs decoder
– 4-processors at 3 GHz
– Simple memory system
• Micro-architecture model
– 4-wide out-of-order, without cracking into micro-ops
– Branch predictors
– 32K L1 (2-cycle), 1 MB L2 (12-cycle)
– MSI coherence, 15-cycle cache-to-cache communication
– Infinite execution buffer pool
Software
• Modified gcc-compiler tool chain and lancet tool
• Extract from compiled binary
– Debugging information
– CFG, Program Dependence Graph
• Software
– Dynamic information from simulator
– Generates handler, trigger for call site as encountered
• Control-flow in handler not included [ongoing work]
• Perfect control transfer from trigger to method
– Handler does not execute if a branch causes the method not to be called
Generating Handlers
• Cannot easily identify and demarcate code
– Heuristic to demarcate
– Terminate when load address is from heap
• Handler has
– Loads and stores to stack
– No stores to heap
– Limitation: heuristic; does not always work
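The heuristic can be sketched as a backward walk from the call site that stops at the first non-stack load (and admits no heap stores). This is a toy model: the stack-range test and the instruction representation stand in for the tool's real dependence analysis.

```c
#include <stdint.h>
#include <stdbool.h>

/* One memory instruction in the region behind the call site. */
typedef struct {
    bool     is_load, is_store;
    uint64_t addr;
} insn_t;

static bool on_stack(uint64_t addr, uint64_t sp_lo, uint64_t sp_hi)
{
    return addr >= sp_lo && addr < sp_hi;
}

/* Grow the handler backwards from the call site (walking insns from
 * last to first), per the slide's heuristic: the handler may load and
 * store the stack, but the first heap load terminates the slice, and
 * heap stores are never admitted.  Returns the number of instructions
 * admitted into the handler. */
int grow_handler(const insn_t *insns, int n, uint64_t sp_lo, uint64_t sp_hi)
{
    int taken = 0;
    for (int i = n - 1; i >= 0; i--) {
        if (insns[i].is_load && !on_stack(insns[i].addr, sp_lo, sp_hi))
            break;                      /* heap load: terminate slice */
        if (insns[i].is_store && !on_stack(insns[i].addr, sp_lo, sp_hi))
            break;                      /* heap store: not allowed    */
        taken++;
    }
    return taken;
}
```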
Generating Handlers
• 1: Specify parameters to method
– Pushed into stack by program
• Introduces dependency
• Prevents separation
• 2: Computing parameters
– Program performs it near call site
– Need to identify the code
– Deal with
• Use of stack
• Control-flow
• Inter-method dependence
G = F (N)
if (…)
    X = G + 2
else
    X = G * 2
M (X)
Control-flow in Handlers
• Depends on the call site's control flow
• Handler for D
– Call site in C(), BB 3
– Include the loop: back edge BB 4 to BB 1
– Include the branch in BB 1
• Inclusion depends on the trigger
– Multiple iterations, different triggers
• Ongoing work
[Figure: CFG of C() with basic blocks 1–4 and the call graph; D is called from BB 3]
Other dependencies in Handlers
• C calls D; A or B calls C
– Dependence (X) extends up the call graph
• May need multiple handlers
– If multiple call sites
[Figure: call graph — A(X) and B(X) call C(X), which calls D(X)]
Buffering Handler Writes
• General case
– Writes in the handler are buffered
– Provided to the execution
– Discarded after the execution
• Current implementation
– Only stack writes
[Figure: processors P1–P3 with private caches; buffered handler writes are supplied to the execution through the EB]
Methods for Speculative Execution
• Well encapsulated
– Defined by parameters and return value
– Stack for local computation
– Heap for global state
• Often perform specific tasks
– Access limited global state
– Limits side-effects