Parallel Execution Models for Future Multicore Architectures

Guri Sohi
University of Wisconsin
Outline
• Retrospective
• The road ahead
• Review existing parallel execution models
• New parallel execution models and opportunities
– Program demultiplexing
– Instrumented redundant multithreading
The Road Behind
• Hardware has continued to get faster
– Mostly transparent to software
• Added software functionality directly impacts
performance
– Consequence of uni-processor execution
– Limits additional software functionality
The Secret of Hardware Success
• Transparency to higher-level software
• Very low level parallel execution
• Appearance of sequential execution
– Software written with a sequential assumption
• Easier to express
• Easier to get right
The Road Ahead: Part 1
• Multicore architectures
– Likely low core complexity to conserve power
• Limited exploitation of low-level parallelism
– Will need to achieve concurrency
• Increasing hardware unreliability
– Will likely need help from software to enhance system
reliability
• Continuing software unreliability
– Will likely result in additional (overhead) functionality
in software
The Road Ahead: Part 2
• Lots of multimedia applications
– Possibly amenable to traditional forms of concurrency
• Heavier use of modularity, encapsulation,
information hiding, etc.
– Amenable to traditional parallelization?
– Benefit from or match different parallel execution
models?
• Heavier use of dynamic actions/decisions
The Big Challenges
• Execution models to achieve execution
concurrency on multicore architectures
– Concurrent processing of core work
• Building reliable software/hardware systems from
unreliable software and hardware components
– Redundancy: additional overhead work
– Redundancy as opportunity for concurrent processing
Stepping Back
Given an ordered sequence of tasks
• Process them in the given order: Sequential
• Try to come up with unordered sequences that
accomplish the same: Traditional Parallel
• Process in arbitrary order; give appearance of
processing in given order: Proposed
– Separate processing from giving appearance
Traditional Parallel
• Hardware people build a multiprocessor
• Throw it to software people to use
• Come up with correct unordered sequences
– This is very hard
• Use synchronization to ease reasoning
– i.e., create order; restrict unorder
• Very difficult to parallelize transparently in the
general case
Rethinking Traditional Parallelization
• Typically use speculation to alleviate ordering
constraints
– Speculative multithreading
– Transactions
Speculative Multithreading
• Speculatively parallelize an application
– Speculatively create unordered sequences from
ordered one
– Use speculation to overcome ambiguous dependences
– Use hardware support to recover from mis-speculation
– E.g., multiscalar
• Speculatively acquire a lock
Transactions
• Simplify expression of unordered sequences
• Very high overhead to implement semantics in software
• Hardware support for transactions will exist
– Speculative multithreading is transactions with restrictions on
ordering
– No software overhead to implement semantics
– More applications likely to be written with transactions
• Lots of similarities to speculative multithreading
– Similar opportunities and limitations
Control-Driven vs. Data-Driven Models
• Sequential execution is control-driven at the instruction
level
– Instruction available to process (on ALU) when control gets to it
• Traditional parallelization, speculative multithreading,
transactions, etc., are also control-driven
– Initiate execution of task/transaction when program control
reaches it
– Concurrently-executing entities can be ordered or unordered
– Limits ability to reach distant parallelism
• Can we have a usable data-driven parallel execution
model?
Out-of-Order Superscalar
• Instructions fetched in control-driven (sequential)
order
• Instructions executed in data-driven order
• Instructions committed in control-driven
(sequential) order
• Low-level 2-4X parallel execution with high-level
sequential view
• Maintaining high-level sequential view critical to
software and hardware development
Program Demultiplexing
• New opportunities for parallelism
– High-level 2-4X parallelism
• Program is a multiplexing of methods (or
functions) onto single control flow
– Convenience of expression
– Matched contemporary processing hardware
• De-multiplex methods of program
• Execute methods in parallel in dataflow fashion
• Give appearance of ordered execution
Program Demultiplexing (PD)
[Figure: a sequential program multiplexes methods 1–7 onto one control flow; PD execution spreads the same methods across nodes]
• Program Demultiplexing
– Programs: sequential
– Execution: near-dataflow on methods
– Nodes: methods on processors
Execution Framework
[Figure: in sequential execution, M() runs when control reaches its call site; in PD, a trigger fires a handler that runs M() early and saves the result in an execution buffer (EB). Triggers are the means of reaching distant parallelism ahead of the call site.]
Execution Framework
[Figure: sequential execution on the main processor vs. demultiplexed execution of M() on an auxiliary processor]
• Trigger fires: the handler sets up the execution and provides parameters
• Handler begins execution of M, speculatively, on an auxiliary processor
– Overheads of demultiplexed execution fall on the auxiliary processor
• Completed execution is saved in the buffer pool (EB)
• At the call site, search the buffer pool; use the execution if a valid entry is found
• Invalidate buffered executions that violate data dependences
• Assumption: model used for compiled C programs
Example
ucxx2.c in 300.twolf:

wire_chg = cost - funccost ;
truth = acceptt (...);
if( truth == 1 ) {
  . . .
  new_assgnto_old2( . . . );
  dbox_pos( atermptr );
  dbox_pos( btermptr );
}

Triggers for the two dbox_pos calls are slices of instructions that provide their parameters:

Trig 1:
  mov %eax,0xffffffcc(%ebp)
  mov 0xffffffcc(%ebp),%ecx
  mov %ecx,(%esp,1)
Trig 2:
  mov %eax,0xffffffd0(%ebp)
  mov 0xffffffd0(%ebp),%edx
  mov %edx,(%esp,1)

[Figure annotates the demultiplexed executions of dbox_pos(atermptr) and dbox_pos(btermptr) at 40 cycles each, against 60 cycles of other work.]
Execution Framework
[Figure: sequential execution of P() calling M() vs. PD execution with M() demultiplexed]
• Handler per call site of M
– Separates call site from program
– May have control flow
• Every call is a demultiplexed execution
• Trigger per call site
– Usually fires when method and handler are ready
– Begins demultiplexed execution(s)
• Unordered executions
– Data-flow based
Methods
• Well encapsulated
– Defined by parameters and return value
– Stack for local computation
– Heap for global state
• Often perform specific tasks
– Access limited global state
• Now: Don’t care how the computation is implemented
• Proposed: Don’t care where, when(?), and how the computation is carried out
Handlers
• Task
– Begin demultiplexed execution(s) of a method
– Provide parameters to the execution(s)
• Specifying the handler
– Not explicitly specified in the program, but part of it
• Evaluating compiled sequential programs
• Generate handler from the program
– Slice of instructions from the call site that provides the parameters
Triggers
• Indicates readiness of method and handler
– Data dependencies satisfied
• Fires when method and handler are ready
• Begins executing the handler
Demultiplexed Execution
[Figure: main processor and auxiliary processors P1–P3, each with a cache (C), sharing the execution buffer (EB)]
• Demultiplexed execution begins immediately on an auxiliary processor
• Better scheduling possible with extra book-keeping
– Intelligent policies
Reaching distant parallelism
[Figure: log-scale plot (0.01 to 10000) of the distances A and B between a trigger in M1() and the call site of M2(). Values labeled "A > 1 (%)" per benchmark: crafty 60, gap 72, gzip 30, mcf 80, parser 70, twolf 40, vortex 63, vpr 47.]
Reliable Systems
• Software is unreliable and error-prone
• Hardware will be unreliable and error-prone
• Improving hardware/software reliability will result
in significant software redundancy
– Redundancy will be source of parallelism
Software Reliability & Security
• Reliability & security via dynamic monitoring
– Many academic proposals for C/C++ code: CCured, Cyclone, SafeC, etc.
– VM performs checks for Java/C#
• High overheads!
– Encourages use of unsafe code
Instrumented Redundant Multithreading
• Insert instrumentation/checking functionality
(redundancy) into code without commensurate
performance impact
• Use parallelism to alleviate performance impact
– Non-traditional model for parallelism
• Successful parallelization will encourage more
use of novel (overhead) functionality
• Likely crucial techniques for overall system
(software AND hardware) reliability
Execution Model
[Figure: program divided into tasks A, B, C, D; monitoring code A', B', C' runs alongside]
• Divide program into tasks
• Fork a monitor thread to check the computation of each task
• Monitor thread instrumented with safety-checking code
Task Commits & Aborts
[Figure: tasks A–D with monitors A'–C'; A commits once A' validates it, B aborts when B' detects an error and is re-executed with inlined checks as B'']
• Commit/abort at task granularity
• Precise error detection achieved by re-executing code with inlined checks
– Also provides precise exceptions
IRMT Implementation
• Hardware support
– Hardware thread contexts
– Register checkpointing
– Speculative buffering
– PC translation
• Software support
– Task selection
– Instrumentation
Other Opportunities
• Spreading single thread computation(s) to
multiple processing cores
– Special form of demultiplexing
– Similar to cohort scheduling
• Example: user and OS, different interrupts, etc.
– Significant instruction cache benefits
– Significant branch predictor benefits
– Potential data cache benefits
• Can this be done in a manner transparent to OS?
Summary
• CMPs will require as well as allow for innovative
models for software concurrency
• Data-driven, method-level concurrency is a
promising model
– Likely good match for anticipated programming styles
• Techniques for enhancing software and hardware
reliability will afford new forms of concurrency
– Now is the time to start thinking about future
opportunities for concurrency