The Problem with Threads

Based on work by Edward A. Lee (2006)
Presented by Leeor Peled, June 2010
Seminar in VLSI Architectures (048879)
Asynchronous computing
 During this course, we learned how to design asynchronous
logic, how to coordinate and time its elements, and how to build
async elements, controllers and data paths.
 It’s now time to investigate further layers of computing systems
and see if we can utilize what we learned there.
[Figure: timing and synchronization concerns by design level (signal level, RTL/CL level, SOC level, SW domain): wire delays, gate delays, CEs, data dependency, clock skewing, handshake protocols, GALS, OS scheduling, interrupts, threads.]
SW Parallelism
 Most applications are serial
 HW manipulates Inst/mem/data level parallelism
 Superscalar, OOO, vectorization (SIMD)
 Dependencies still limit the parallelism.
 Still a high penalty on memory access and I/O
 Thread-level parallelism –
 Software manipulation: on a high-latency stall, switch context
 Good for multiple tasks (e.g. servers), but can we boost a single
app?

Yes. Write concurrent code!
 But Very hard to develop
 Bug prone
 Few software paradigms / programming models
SW Parallelism (cont.)
 Interesting similarity between SW and HW:
 Asynchronous ≈ parallel?
 Faster, more efficient, but also non-deterministic
 Various orders of occurrence are possible; must be prepared for each.
 Race conditions may occur between threads, just like between signals
 So why not use similar methods?
Parallelism examples –
Fine Grain Parallelization
(Taken from Ginosar, “many-cores” slides)
 Convert (independent) loop iterations
 for ( i=0; i<10000; i++ ) { a[i] = b[i]*c[i]; }
 Into parallel tasks
 duplicable task XX(…) 10000
{
ii = INSTANCE;
a[ii] = b[ii]*c[ii];
}
 All tasks, or any subset, can be executed in parallel
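The conversion above can be sketched in Python. This is only an illustration: the names `multiply_task` and `parallel_multiply` are hypothetical, and a thread pool stands in for the many-core task scheduler. In CPython the global interpreter lock means this shows the structure of the decomposition rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_task(instance, a, b, c):
    """One duplicable task: handles exactly one loop iteration."""
    a[instance] = b[instance] * c[instance]


def parallel_multiply(b, c, workers=4):
    """Run one independent task per index; any subset may run concurrently."""
    a = [0] * len(b)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every task and re-raises any exception from a task.
        list(pool.map(lambda i: multiply_task(i, a, b, c), range(len(b))))
    return a


b = list(range(10000))
c = [2] * 10000
a = parallel_multiply(b, c)
```

No locking is needed here because each task writes a different element of `a` and only reads `b` and `c`.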
Linear Solver: Simulation snap-shots
(Taken from Ginosar, “many-cores” slides)
Parallelism examples (cont.)
 Unfortunately, not all applications are “embarrassingly parallel”.
 In reality we employ various “design patterns” that have been thoroughly
investigated (and are available in libraries)
 Producer-Consumer model :
procedure producer() {
    while (true) {
        item = produceItem()
        if (itemCount == BUFFER_SIZE) {
            sleep()
        }
        putItemIntoBuffer(item)
        itemCount = itemCount + 1
        if (itemCount == 1) {
            wakeup(consumer)
        }
    }
}

procedure consumer() {
    while (true) {
        if (itemCount == 0) {
            sleep()
        }
        item = removeItemFromBuffer()
        itemCount = itemCount - 1
        if (itemCount == BUFFER_SIZE - 1) {
            wakeup(producer)
        }
        consumeItem(item)
    }
}
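The pseudocode above checks itemCount and calls sleep() as two separate steps, so a wakeup that arrives in between can be lost. A minimal Python sketch of the same pattern, assuming the standard library's bounded `queue.Queue` as the buffer (it performs the check and the wait under one lock), could look as follows. The `DONE` sentinel and the `run` helper are illustrative additions.

```python
import queue
import threading

BUFFER_SIZE = 4
DONE = object()  # hypothetical sentinel telling the consumer to stop


def producer(buf, n_items):
    for item in range(n_items):
        buf.put(item)  # blocks while the buffer is full
    buf.put(DONE)


def consumer(buf, consumed):
    while True:
        item = buf.get()  # blocks while the buffer is empty
        if item is DONE:
            break
        consumed.append(item)


def run(n_items=20):
    buf = queue.Queue(maxsize=BUFFER_SIZE)
    consumed = []
    threads = [
        threading.Thread(target=producer, args=(buf, n_items)),
        threading.Thread(target=consumer, args=(buf, consumed)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed


result = run()
```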
Producer-Consumer visualization
Looks familiar?
http://www.eonclash.com/Tutorials/Multithreading/MartinHarvey1.1/Ch9.html
Threads: problem statement
 Real workloads must work very hard to synchronize concurrent code.
 The following example shows the problem with unprotected access:
A:
    St [x],1
    St [y],1
B:
    S = ld [x]
    T = ld [y]
    Print S,T
 Serial: functions A and B can be called in any order.
 Possible outputs are 0,0 and 1,1
 Concurrent: 0,1 is also possible (what about 1,0?).
 How would the program react?
 Design issues:
 Memory ordering
 Coherency
 Consistency
 Debuggability
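One way to explore the concurrent case is to enumerate every interleaving of A and B. The sketch below assumes the simplest memory model: each instruction is atomic, each thread keeps its own program order, and x and y start at 0. Real hardware with weaker memory ordering may behave differently. The tuple encoding of the instructions is an assumption made for this example.

```python
from itertools import combinations

# Hypothetical encoding of the two functions: (operation, address, argument).
A = [("store", "x", 1), ("store", "y", 1)]
B = [("load", "x", "S"), ("load", "y", "T")]


def interleavings(p, q):
    """Yield every merge of p and q that keeps each program's own order."""
    n = len(p) + len(q)
    for slots in combinations(range(n), len(p)):
        it_p, it_q = iter(p), iter(q)
        yield [next(it_p) if k in slots else next(it_q) for k in range(n)]


def run(schedule):
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, addr, arg in schedule:
        if op == "store":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["S"], regs["T"]


outcomes = sorted({run(s) for s in interleavings(A, B)})
print(outcomes)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Under these assumptions the six interleavings already produce all four output pairs, including the two mixed ones.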
Threads: problem statement (cont.)
 Invalid results are bad, but some problems are worse –
 Deadlock
 Livelocks
 Example – observer pattern (in Java):
public class ValueHolder {
    public void addListener(listener) {…}
    public void setValue(newValue) {
        myValue = newValue;
        for (int i = 0; i < myListeners.length; i++) {
            myListeners[i].valueChanged(newValue);
        }
    }
}
 What’s the problem?
Threads: problem statement (cont.)
 Same example, with both methods synchronized:
public class ValueHolder {
    public synchronized void addListener(listener) {…}
    public synchronized void setValue(newValue) {
        myValue = newValue;
        for (int i = 0; i < myListeners.length; i++) {
            myListeners[i].valueChanged(newValue);
        }
    }
}
 What’s the problem?
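A hedged Python analogue of the synchronized version shows the hazard: the listener callback runs while the holder's lock is still held. In this sketch the listener waits for a helper thread that itself needs the lock. The 0.2 second timeout is an assumption that stands in for "blocks forever", so the example terminates instead of deadlocking. All names are illustrative and are not taken from the Java code.

```python
import threading


class ValueHolder:
    """One lock guards every method, similar to Java's synchronized."""

    def __init__(self):
        self.lock = threading.RLock()
        self.listeners = []
        self.value = None

    def add_listener(self, listener):
        with self.lock:
            self.listeners.append(listener)

    def set_value(self, new_value):
        with self.lock:
            self.value = new_value
            for listener in self.listeners:
                listener(new_value)  # callback runs while the lock is held


holder = ValueHolder()
attempts = []


def helper():
    # A second thread needs the holder's lock; the timeout replaces an endless wait.
    acquired = holder.lock.acquire(timeout=0.2)
    attempts.append(acquired)
    if acquired:
        holder.lock.release()


def listener(new_value):
    # The listener waits for the helper, which waits for the lock
    # that the notifying thread still holds.
    worker = threading.Thread(target=helper)
    worker.start()
    worker.join()


holder.add_listener(listener)
holder.set_value(42)
print(attempts)  # [False]
```

Without the timeout, the helper would wait for the lock, the listener would wait for the helper, and the notifying thread would never release the lock.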
Threads: problem statement (cont.)
 Same example, notifying the listeners outside the synchronized block:
public synchronized void addListener(listener) {…}
public void setValue(newValue) {
    synchronized(this) {
        myValue = newValue;
        listeners = myListeners.clone();
    }
    for (int i = 0; i < listeners.length; i++) {
        listeners[i].valueChanged(newValue);
    }
}
 What’s the problem?
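One possible schedule for this last version can be replayed by hand in a single thread. The sketch below is a Python analogue, not the Java code: `begin_set` is a hypothetical name for the synchronized part and `finish_set` for the unsynchronized notification loop. Two callers, A and B, are interleaved explicitly.

```python
class ValueHolder:
    def __init__(self):
        self.value = None
        self.listeners = []

    def begin_set(self, new_value):
        """The synchronized part: store the value and snapshot the listeners."""
        self.value = new_value
        return new_value, list(self.listeners)

    @staticmethod
    def finish_set(snapshot):
        """The unsynchronized part: notify the listeners from the snapshot."""
        new_value, listeners = snapshot
        for listener in listeners:
            listener(new_value)


seen = []  # every value the listener was told about, in order
holder = ValueHolder()
holder.listeners.append(seen.append)

snap_a = holder.begin_set(1)    # thread A sets 1 and leaves the lock
snap_b = holder.begin_set(2)    # thread B sets 2 and leaves the lock
ValueHolder.finish_set(snap_b)  # B happens to notify first
ValueHolder.finish_set(snap_a)  # A notifies last

print(holder.value, seen)  # 2 [2, 1]
```

In this schedule the holder ends with value 2, but the last notification the listener receives is 1, so the listener's view of the value is stale.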
Threads: the bleak reality
[Figure: sets of programmers: all programmers; programmers who use threads; those who want to do it properly.]

Threads: current methods
 Currently, the only defenses against such problems are –
 The technical aspect –
 Analyze software structure using dedicated tools (formal verification)
 BLAST, Intel Thread Checker
 Use protected languages
 Cilk, Split-C (also various SW TM flavors) – lock/sync semantics
 Guava (private memory space for unsynchronized objects)
 Use predefined design patterns
 Transactions (DB), TM
 The human aspect –
 Employ experienced programmers
 Apply a strict software design process (code reviews, debug sessions)
 Coding rules (lock acquiring order)
 The business aspect – be prepared to recall and compensate often…
Parallel objects - solutions
 Lee’s observation: it’s not concurrency that is inherently difficult,
it’s just the thread model!
 Key issue: a thread shares everything, so everything might change
for it between two atomic actions.
 Threads may interleave in any way (memory ordering leaves many options),
and each can change the state seen by all other threads
[Figure labels: threads t0 and t1, states A and A’]
 Parallel computation with threads can be shown to explode exponentially
in the number of outcomes
 Long, boring mathematical proof ahead…
 But in fact, we usually only need to share a single message or data stream!
Some math
 Let:
 $N = \{0, 1, 2, 3, \dots\}$
 $B = \{0, 1\}$
 $B^*$: the set of all finite bit sequences
 $B^\omega = (N \to B)$: the set of all infinite bit sequences
 $B^{**} = B^* \cup B^\omega$ represents the state of the computing machine
 $Q = (B^{**} \rightharpoonup B^{**})$: the set of partial functions on $B^{**}$
 An imperative machine $M = (A, c)$ is composed of a finite set of atomic
“instructions” $A \subset Q$ and a control function $c: B^{**} \to N$ that represents
how they are sequenced.
 A “halt” instruction $h \in A$ is defined by $\forall b \in B^{**},\ h(b) = b$
 A sequential program of length $m$ is a function $p: N \to A$ such that $\forall n \ge m,\ p(n) = h$
 The set of all programs is countably infinite ($|P| = \aleph_0$)
 An execution of $p$ starts with $b_0 \in B^{**}$, and $\forall n \in N,\ b_{n+1} = p(c(b_n))(b_n)$
Some math (cont’d)
 Now, for multiple threads, we replace the program execution with –
 $b_{n+1} = p_i(c(b_n))(b_n),\ i \in \{1, 2\}$
 Each action is atomic, but at each step the active context $i$ is chosen
arbitrarily (we assume no simultaneous execution, for simplicity).
 More precisely: $b_{n+1} = p_{i_n}(c(b_n))(b_n),\ i_n \in \{1, 2\}$
 Let $S = (\{1, \dots, m\} \to \{1, 2\})$ be the set of context vectors $(i_1, \dots, i_m)$, so $|S| = 2^m$
 Interleaving leads to exponential growth in possible outcomes, even
for a given set of programs and initial state.
 Further advantages of sequential programs:
 The sequence $b_n$ is well defined.
 The program computes a partial function, defined for each
input that leads to halt.
 $p_1$ and $p_2$ can be compared
 Multithreading makes these exponentially harder as well.
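The growth in the number of schedules can be checked numerically. The sketch below is a simplification of the model above: each thread repeats a single hypothetical atomic action (`p1` doubles the state, `p2` doubles it and adds one), chosen so that every schedule ends in a different state. Real programs usually collapse many schedules into the same outcome, so this is the worst case.

```python
from itertools import product


def p1(x):
    return 2 * x  # hypothetical atomic action of thread 1


def p2(x):
    return 2 * x + 1  # hypothetical atomic action of thread 2


ACTIONS = {1: p1, 2: p2}


def execute(schedule, b0=1):
    """Apply the action of the active context i_n at every step n."""
    b = b0
    for i in schedule:
        b = ACTIONS[i](b)
    return b


m = 10
schedules = list(product((1, 2), repeat=m))  # all context vectors of length m
outcomes = {execute(s) for s in schedules}
print(len(schedules), len(outcomes))  # 1024 1024
```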
Parallel objects - solutions (cont.)
 What other solutions do we have to activate multiple objects concurrently?
 Move from object-oriented design to actor-oriented
 Also similar to the async logic we discussed – each logical element is in
charge of its own input/output
 By comparison, the OO equivalent in VLSI would mean that the signals have
to be “responsible” for their own correct transfer
 Let us study the following 4 actor oriented models of computation (MOCs)
 Rendezvous
 PN (process network)
 SR (synchronous/reactive)
 DE (discrete events)
 These MOCs are alternatives with similar computational power,
but one may suit a given design pattern better than another
Actor oriented design - Rendezvous
 Based on Reo (a coordination language). Same functionality as before
 Each actor (producer/consumer/observer) is a process
 (No more process per dataflow / data object)
 Communication is through rendezvous
 Producers are mutually exclusive (consumers are not)
 Two possible 3-way rendezvous
 Merge is now the only non-deterministic element
 No deadlocks, no consistency (values ordering) problem
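Python has no built-in rendezvous channel, so the sketch below builds a simple two-party one from two single-slot queues. This is an assumption made for illustration, not the Reo or Ptolemy implementation, and it does not model the 3-way rendezvous. `send` returns only after a receiver has taken the item, and a lock keeps the producers mutually exclusive.

```python
import queue
import threading


class Rendezvous:
    """Hypothetical synchronous channel: send() completes only when recv() has met it."""

    def __init__(self):
        self._items = queue.Queue(maxsize=1)
        self._acks = queue.Queue(maxsize=1)
        self._send_lock = threading.Lock()

    def send(self, item):
        with self._send_lock:  # producers are mutually exclusive
            self._items.put(item)
            self._acks.get()  # wait until a receiver has taken the item

    def recv(self):
        item = self._items.get()
        self._acks.put(None)
        return item


def producer(channel, items):
    for item in items:
        channel.send(item)


def run():
    channel = Rendezvous()
    producers = [
        threading.Thread(target=producer,
                         args=(channel, [f"p{k}-{n}" for n in range(3)]))
        for k in (1, 2)
    ]
    for t in producers:
        t.start()
    received = [channel.recv() for _ in range(6)]
    for t in producers:
        t.join()
    return received


received = run()
```

The order in which the two producers are merged may differ between runs, which matches the merge being the only non-deterministic element. Each producer's own items always arrive in order.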
Actor oriented design - PN (process network)
 Based on PN model of concurrency by Kahn & MacQueen (‘77)
 Communication is through streams
 Unbounded FIFOs
 Blocking reads
 Same benefits, plus: queuing allows the observer/consumer to operate at
a different speed (unless we explicitly add a dependency), or to delay the
observation indefinitely
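A small process network can be sketched with threads connected by unbounded `queue.Queue` objects, whose `get` is a blocking read. The three processes and the `END` marker are hypothetical names chosen for this example. Because every process only performs blocking reads on its input stream, the output does not depend on how the threads are scheduled.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker


def source(out, values):
    for v in values:
        out.put(v)
    out.put(END)


def scale(inp, out, factor):
    while True:
        v = inp.get()  # blocking read
        if v is END:
            out.put(END)
            return
        out.put(v * factor)


def sink(inp, collected):
    while True:
        v = inp.get()  # blocking read
        if v is END:
            return
        collected.append(v)


def run_network(values):
    a, b = queue.Queue(), queue.Queue()  # maxsize 0 means unbounded FIFOs
    collected = []
    processes = [
        threading.Thread(target=source, args=(a, values)),
        threading.Thread(target=scale, args=(a, b, 10)),
        threading.Thread(target=sink, args=(b, collected)),
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return collected


result = run_network([1, 2, 3, 4])
```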
Actor oriented design - SR (sync/reactive)
 Concept based on synchronous languages such as Esterel, SIGNAL and Lustre (mostly
used for RT/embedded systems like aircraft control, nuclear plants)
 Synchronous: time is an ordered sequence of instants
 Actual evaluation assumed to be zero time – instant reactions
 Reactive: Instants initiated by environmental events (Harel/Pnueli)
 “When is just as important as what”
 At each clock tick, every signal is evaluated (iteratively if needed) or is absent
 Provides deterministic concurrency, events are ordered
 Scheduler picks the order of evaluation (may be done at compile time, Edwards ‘98).
Mutual dependencies are handled by iteration.
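The per-tick evaluation can be sketched as a fixed-point iteration: every signal starts the tick unknown, and the actors are re-evaluated until no signal changes. The two actors and the signal names below are hypothetical. The point of the sketch is that the result does not depend on the order in which the scheduler visits the actors.

```python
UNKNOWN = None  # a signal that has not been resolved yet in this tick


def adder(sig):
    if sig["a"] is UNKNOWN or sig["b"] is UNKNOWN:
        return {}
    return {"sum": sig["a"] + sig["b"]}


def doubler(sig):
    if sig["sum"] is UNKNOWN:
        return {}
    return {"out": 2 * sig["sum"]}


def tick(actors, inputs, names=("a", "b", "sum", "out")):
    """Evaluate one instant: iterate the actors until a fixed point is reached."""
    sig = {name: UNKNOWN for name in names}
    sig.update(inputs)
    changed = True
    while changed:
        changed = False
        for actor in actors:
            for name, value in actor(sig).items():
                if sig[name] is UNKNOWN:
                    sig[name] = value
                    changed = True
    return sig


forward = tick([adder, doubler], {"a": 1, "b": 2})
backward = tick([doubler, adder], {"a": 1, "b": 2})
print(forward == backward, forward["out"])  # True 6
```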
Actor oriented design - DE (discrete events)
 Concept based on VHDL/Verilog or the OPNET network modeler
 Exact timing specification with rigorous semantics
 Each event is timed and processed chronologically.
 Merge (and the entire system) are deterministic.
 Unlike SR, here every evaluation takes a certain time delta
 More realistic
 However, evaluation order might introduce non-determinism if not
defined properly
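A minimal discrete-event kernel can be sketched with a priority queue ordered by timestamp. The `Simulator` class and the event names are hypothetical. The sequence counter is the assumption that matters here: it defines the evaluation order of events that share a timestamp, so the run stays deterministic.

```python
import heapq
from itertools import count


class Simulator:
    def __init__(self):
        self._queue = []
        self._seq = count()  # tie-breaker for events with equal timestamps
        self.now = 0.0
        self.log = []

    def schedule(self, delay, name, action=None):
        """Post an event `delay` time units after the current time."""
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), name, action))

    def run(self):
        while self._queue:
            self.now, _, name, action = heapq.heappop(self._queue)
            self.log.append((self.now, name))
            if action is not None:
                action(self)  # an event may schedule further events


sim = Simulator()
sim.schedule(2.0, "b")
sim.schedule(1.0, "a", lambda s: s.schedule(1.0, "a-done"))
sim.schedule(2.0, "c")
sim.run()
print(sim.log)
# [(1.0, 'a'), (2.0, 'b'), (2.0, 'c'), (2.0, 'a-done')]
```

Three events land on time 2.0; they are processed in the order in which they were scheduled.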
The road ahead
 Actor oriented design is not new, various languages exist:
 CORBA event service (distributed push-pull)
 ROOM and UML-2 (dataflow, Rational, IBM)
 VHDL, Verilog (discrete events, Cadence, Synopsys, ...)
 LabVIEW (structured dataflow, National Instruments)
 Modelica (continuous-time, constraint-based, Linköping)
 OPNET (discrete events, Opnet Technologies)
 SDL (process networks)
 Occam (rendezvous)
 Simulink (continuous-time, The MathWorks)
 SPW (synchronous dataflow, Cadence, CoWare)
 However, most are domain specific, and the few general purpose
ones never caught on
 Programmers don’t like new syntax
 Adding libs to existing languages is not enough
 UML case study?
 Lee’s suggested solution is “coordination languages”
 Polymorphic objects from other languages, general type-system
Ptolemy II Design environment
 Actors/components can be defined in C/C++, Java, MATLAB, Python, Perl, …
 Visual editor, abstract syntax
 Varying concurrency models
Models of Computation in Ptolemy II
 CI – Push/pull component interaction
 Click – Push/pull with method invocation
 CSP – concurrent threads with rendezvous
 Continuous – continuous-time modeling with fixed-point semantics
 CT – continuous-time modeling
 DDF – Dynamic dataflow
 DE – discrete-event systems
 DDE – distributed discrete events
 DPN – distributed process networks
 FSM – finite state machines
 DT – discrete time (cycle driven)
 Giotto – synchronous periodic
 GR – 3-D graphics
 PN – process networks
 Rendezvous – extension of CSP
 SDF – synchronous dataflow
 SR – synchronous/reactive
 TM – timed multitasking
Actor oriented design - examples
 Two implementations of sequential interleaving based on rendezvous
 Both are deterministic
 Barrier allows rendezvous to occur only when both inputs are ready
 Buffer can rendezvous with input OR with output.
 Commutator chooses one input for rendezvous (round robin)
Conclusions
 The bottom line from Lee’s work is –
Instead of working with non-deterministic threads and
attempting to prune this non-determinism, we should start with
deterministic models, and add non-determinism only when
needed.
 The remaining obstacle to adoption is lack of cooperation from users (same
as with async VLSI design, in fact)
 Users accept only general-purpose languages, and no new syntax
 A transparent solution would be simpler to enforce: library based, compiler, HW…
 Can we take something back to the VLSI level?
 Some synchronization schemes can be built in HW (which?)
 Actor oriented approach – are we there?
 Design methodology / tools?