Based on work by Edward A. Lee (2006)
Presented by Leeor Peled, June 2010
Seminar in VLSI Architectures (048879)

Asynchronous computing

- During this course we learned how to design asynchronous logic, how to coordinate and time its elements, and how to build asynchronous elements, controllers and data paths.
- It is now time to look at the higher layers of computing systems and see whether what we learned can be applied there.

[Figure labels: Wire delays, gate delays; Signal level; CE's; Data dependency; Clock skewing; RTL/CL level; SOC level; Handshake protocols; GALS (?); SW domain; OS scheduling, interrupts, threads (!?)]

SW Parallelism

- Most applications are serial.
- HW exploits instruction/memory/data-level parallelism: superscalar execution, out-of-order (OOO) execution, vectorization (SIMD).
  - Dependencies still limit the parallelism.
  - Memory access and IO still carry a high penalty.
- Thread-level parallelism is managed by software: on a high-latency stall, switch context.
  - Good for multiple tasks (e.g. servers), but can we boost a single app?
- Yes. Write concurrent code! But:
  - Very hard to develop
  - Bug prone
  - Few software paradigms / programming models

SW Parallelism (cont.)

- Interesting similarity between SW and HW: asynchronous ≈ parallel?
  - Faster and more efficient, but also non-deterministic
  - Events may occur in various orders; we must be prepared for each.
  - Race conditions may occur between threads, just as between signals.
- So why not use similar methods?

Parallelism examples - Fine Grain Parallelization
(Taken from Ginosar, “many-cores” slides)

- Convert (independent) loop iterations

    for ( i=0; i<10000; i++ ) {
        a[i] = b[i]*c[i];
    }

- Into parallel tasks

    duplicable task XX(…) 10000 {
        ii = INSTANCE;
        a[ii] = b[ii]*c[ii];
    }

- All tasks, or any subset, can be executed in parallel.

[Figure: Linear Solver: simulation snapshots (taken from Ginosar, “many-cores” slides)]

Parallelism examples (cont.)

- Unfortunately, not all applications are “embarrassingly parallel”.
- In practice we employ various “design patterns” that have been thoroughly investigated (and are available in libraries).
- Producer-Consumer model:

    procedure producer() {
        while (true) {
            item = produceItem()
            if (itemCount == BUFFER_SIZE) {
                sleep()
            }
            putItemIntoBuffer(item)
            itemCount = itemCount + 1
            if (itemCount == 1) {
                wakeup(consumer)
            }
        }
    }

    procedure consumer() {
        while (true) {
            if (itemCount == 0) {
                sleep()
            }
            item = removeItemFromBuffer()
            itemCount = itemCount - 1
            if (itemCount == BUFFER_SIZE - 1) {
                wakeup(producer)
            }
            consumeItem(item)
        }
    }

Producer-Consumer visualization

- Looks familiar?
- http://www.eonclash.com/Tutorials/Multithreading/MartinHarvey1.1/Ch9.html

Threads: problem statement

- Real workloads must work very hard to synchronize concurrent code.
- The following example shows the problem with unprotected access:

    A: St [x],1
       St [y],1

    B: S = ld [x]
       T = ld [y]
       Print S,T

- Serial: functions A and B can be called in any order. Possible outputs are 0,0 and 1,1.
- Concurrent: 0,1 is also possible (what about 1,0?). How would the program react?
- Design issues:
  - Memory ordering
  - Coherency
  - Consistency
  - Debuggability

Threads: problem statement (cont.)

- Invalid results are bad, but some problems are worse:
  - Deadlocks
  - Livelocks
- Example - observer pattern (in Java):

    public class ValueHolder {
        public void addListener(listener) {…}
        public void setValue(newValue) {
            myValue = newValue;
            for (int i = 0; i < myListeners.length; i++) {
                myListeners[i].valueChanged(newValue);
            }
        }
    }

- What’s the problem?

Threads: problem statement (cont.)

- Example - observer pattern (in Java), with synchronized methods:

    public class ValueHolder {
        public synchronized void addListener(listener) {…}
        public synchronized void setValue(newValue) {
            myValue = newValue;
            for (int i = 0; i < myListeners.length; i++) {
                myListeners[i].valueChanged(newValue);
            }
        }
    }

- What’s the problem?

Threads: problem statement (cont.)
- Example - observer pattern (in Java), notifying the listeners outside the lock:

    public synchronized void addListener(listener) {…}
    public void setValue(newValue) {
        synchronized(this) {
            myValue = newValue;
            listeners = myListeners.clone();
        }
        for (int i = 0; i < listeners.length; i++) {
            listeners[i].valueChanged(newValue);
        }
    }

- What’s the problem?

[Figure label: Other Synchronizing Object]

Threads: the bleak reality

[Figure labels: All programmers; Programmers who use threads; Those who want to do it properly]

Threads: current methods

- Currently, the only defenses against such problems are:
- The technical aspect
  - Analyze software structure using dedicated tools (formal verification): Blast, Intel Thread Checker
  - Use protected languages
    - Cilk, Split-C (also various SW TM flavors): lock/sync semantics
    - Guava (private memory space for unsynchronized objects)
  - Use predefined design patterns: transactions (DB), TM
- The human aspect
  - Employ experienced programmers
  - Apply a strict software design process (code reviews, debug sessions)
  - Coding rules (lock acquiring order)
- The business aspect
  - Be prepared to recall and compensate often…

Parallel objects - solutions

- Lee’s observation: it is not concurrency that is inherently difficult, it is just the thread model!
- Key issues here:
  - A thread shares everything, so everything might change for it between two atomic actions.
  - Threads may interleave in any way (memory ordering has vast options) and can change state on all other threads.

[Figure labels: t0, t1; A, A’]

- Parallel computation with threads can be shown to explode exponentially in the number of outcomes.
  - Long, boring mathematical proof ahead…
- But in fact we usually only need to share a single message or data stream!
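
The growth in outcomes can be checked by brute force on the earlier store/load example (A stores to x and y, B loads them into S and T). The sketch below is only an illustration: it assumes each store or load is atomic and that memory is sequentially consistent, and the helper names (`interleavings`, `run`, `count_interleavings`) are made up for this example. It counts order-preserving merges of the two threads, which is a slightly different count from the context-vector bound in the proof that follows, but it grows just as fast.

```python
from itertools import combinations
from math import comb

def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        slots = set(positions)          # time slots given to thread a
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute one interleaving on shared memory; return what B prints."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, dst, src in schedule:
        if op == "st":                  # ("st", address, value)
            mem[dst] = src
        else:                           # ("ld", register, address)
            regs[dst] = mem[src]
    return regs["S"], regs["T"]

def count_interleavings(steps_per_thread):
    """Number of order-preserving merges of two threads of equal length."""
    return comb(2 * steps_per_thread, steps_per_thread)

A = [("st", "x", 1), ("st", "y", 1)]
B = [("ld", "S", "x"), ("ld", "T", "y")]

outcomes = {run(s) for s in interleavings(A, B)}
print(sorted(outcomes))         # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(count_interleavings(2))   # 6 schedules for 2 steps per thread
print(count_interleavings(10))  # 184756 schedules for 10 steps per thread
```

Under these assumptions all four outputs are reachable with only two steps per thread, and the number of schedules a test suite would have to cover already exceeds 180,000 at ten steps per thread.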
Some math

Let:
- $N = \{0, 1, 2, 3, \dots\}$
- $B = \{0, 1\}$
- $B^{*}$: the set of all finite bit sequences
- $B^{\omega} = (N \to B)$: the set of all infinite bit sequences
- $B^{**} = B^{*} \cup B^{\omega}$ will represent the state of the computing machine
- $Q = (B^{**} \rightharpoonup B^{**})$: the set of (partial) functions from $B^{**}$ to $B^{**}$

Definitions:
- An imperative machine $M = (A, c)$ is composed of a finite set of atomic “instructions” $A \subset Q$ and a control function $c : B^{**} \to N$ that represents how they are sequenced.
- A “halt” instruction $h \in A$ is defined by $\forall b \in B^{**},\ h(b) = b$.
- A sequential program of length $m$ is a function $p : N \to A$ such that $\forall n \ge m,\ p(n) = h$.
- The set $P$ of all programs is countably infinite ($|P| = \aleph_0$).
- An execution of $p$ starts with $b_0 \in B^{**}$, and $\forall n \in N,\ b_{n+1} = p(c(b_n))(b_n)$.

Some math (cont’d)

- Now, for multiple threads, we replace the program execution with
  $b_{n+1} = p_i(c(b_n))(b_n),\ i \in \{1, 2\}$
- Each action is atomic, but at each step the active context $i$ is chosen arbitrarily (we assume no simultaneous execution, for simplicity). The correct notation should therefore be
  $b_{n+1} = p_{i_n}(c(b_n))(b_n),\ i_n \in \{1, 2\}$
- Let $S = (\{0, \dots, m-1\} \to \{1, 2\})$ be the set of context vectors $(i_0, i_1, \dots, i_{m-1})$, so $|S| = 2^m$.
- Interleaving leads to exponential growth in the number of possible outcomes, even for a given set of programs and a given initial state.

Further advantages of sequential programs

- The sequence $b_n$ is well defined.
- The function computed by the program is a partial function, defined for every input that leads to a halt.
- Two programs $p_1$ and $p_2$ can be compared.
- Multithreading also makes these exponentially harder.

Parallel objects - solutions (cont.)

- What other solutions do we have to activate multiple objects concurrently?
- Move from object-oriented design to actor-oriented design.
  - This is also similar to the async logic we discussed: each logical element is in charge of its own input/output.
  - To compare: the OO equivalent in VLSI would mean that the signals are “responsible” for their own correct transfer.
- Let us study the following 4 actor-oriented models of computation (MOCs):
  - Rendezvous
  - PN (process network)
  - SR (synchronous/reactive)
  - DE (discrete events)
- These MOCs are alternatives with similar computational power, but one may suit some design patterns better than another.

Actor oriented design - Rendezvous

- Based on Reo (a coordination language).
- Same functionality as before
- Each actor (producer/consumer/observer) is a process (no more process per dataflow / data object).
- Communication is through rendezvous.
  - Producers are mutually exclusive (consumers are not).
  - Two possible 3-way rendezvous
- Merge is now the only non-deterministic element.
- No deadlocks, no consistency (value ordering) problem

Actor oriented design - PN (process network)

- Based on the PN model of concurrency by Kahn & MacQueen (‘77)
- Communication is through streams:
  - Unbounded FIFOs
  - Blocking reads
- Same benefits, plus: queuing allows the observer/consumer to operate at different speeds (unless we explicitly add a dependency), or to delay the observation indifferently.

Actor oriented design - SR (sync/reactive)

- Concept based on synchronous languages such as Esterel, SIGNAL and Lustre (mostly used for real-time/embedded systems like aircraft control and nuclear plants)
- Synchronous: time is an ordered sequence of instants.
  - Actual evaluation is assumed to take zero time: instant reactions.
- Reactive: instants are initiated by environmental events (Harel/Pnueli).
- “When is just as important as what”
- At each clock tick, every signal is evaluated (iteratively if needed) or is absent.
- Provides deterministic concurrency; events are ordered.
- The scheduler picks the order of evaluation (may be done at compile time, Edwards ‘98). Mutual dependency is handled by iterations.
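
The process-network (PN) model above can be approximated in a few lines with ordinary threads, as long as the actors share nothing but FIFOs and only ever read with a blocking `get`. This is a minimal sketch, not the Ptolemy II implementation: the actor names (`producer`, `scale`, `zip_add`, `run_network`) and the `DONE` end-of-stream sentinel are assumptions of this example, and Python's `queue.Queue` stands in for the unbounded FIFO.

```python
import queue
import threading

DONE = object()  # end-of-stream marker (an assumption of this sketch)

def producer(out, values):
    """Actor: write a finite stream to its output FIFO."""
    for v in values:
        out.put(v)
    out.put(DONE)

def scale(inp, out, factor):
    """Actor: multiply every token by a constant."""
    while True:
        v = inp.get()              # blocking read
        if v is DONE:
            out.put(DONE)
            return
        out.put(v * factor)

def zip_add(in_a, in_b, out):
    """Actor: read one token from each input, in a fixed order, and add them."""
    while True:
        a = in_a.get()             # blocks until a token arrives on input a
        b = in_b.get()             # then blocks on input b
        if a is DONE or b is DONE:
            out.put(DONE)
            return
        out.put(a + b)

def run_network():
    q1, q2, q3, q4 = (queue.Queue() for _ in range(4))  # unbounded FIFOs
    result = []

    def sink(inp):
        while True:
            v = inp.get()
            if v is DONE:
                return
            result.append(v)

    actors = [
        threading.Thread(target=producer, args=(q1, range(5))),
        threading.Thread(target=producer, args=(q2, range(10, 15))),
        threading.Thread(target=scale, args=(q1, q3, 2)),
        threading.Thread(target=zip_add, args=(q3, q2, q4)),
        threading.Thread(target=sink, args=(q4,)),
    ]
    for t in actors:
        t.start()
    for t in actors:
        t.join()
    return result

print(run_network())  # [10, 13, 16, 19, 22]
```

The threads may be scheduled in any order, yet the output stream is the same on every run, because no actor can observe whether a token arrived early or late: a read simply blocks until the data is there. A non-deterministic merge would need an extra primitive (for example a non-blocking or timed read), which is exactly the kind of explicit, local non-determinism the actor models aim for.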
Actor oriented design - DE (discrete events)

- Concept based on VHDL/Verilog or the OPNET network modeler
- Exact timing specification with rigorous semantics
- Each event is timed and processed chronologically.
- Merge (and the entire system) is deterministic.
- Unlike SR, here every evaluation takes a certain time delta.
  - More realistic
  - However, the evaluation order might introduce non-determinism if not defined properly.

The road ahead

- Actor-oriented design is not new; various languages exist:
  - CORBA event service (distributed push-pull)
  - ROOM and UML-2 (dataflow, Rational, IBM)
  - VHDL, Verilog (discrete events, Cadence, Synopsys, ...)
  - LabVIEW (structured dataflow, National Instruments)
  - Modelica (continuous-time, constraint-based, Linköping)
  - OPNET (discrete events, Opnet Technologies)
  - SDL (process networks)
  - Occam (rendezvous)
  - Simulink (continuous-time, The MathWorks)
  - SPW (synchronous dataflow, Cadence, CoWare)
- However, most are domain specific, and the few general-purpose ones never caught on.
  - Programmers don’t like new syntax.
  - Adding libraries to existing languages is not enough.
  - UML case study?
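
Returning briefly to the discrete-event (DE) model described earlier, its core is a time-ordered event queue. The sketch below is a generic textbook-style event loop, not how VHDL, Verilog or OPNET are actually implemented; the `Simulator` class, its methods and the `demo` function are hypothetical names. It also shows one way to pin down the evaluation order of simultaneous events: a sequence counter breaks ties between equal timestamps.

```python
import heapq
import itertools

class Simulator:
    """Minimal discrete-event kernel: events are (time, seq, action) tuples."""

    def __init__(self):
        self._queue = []
        self._seq = itertools.count()  # tie-breaker for equal timestamps
        self.now = 0

    def schedule(self, delay, action):
        """Post an action to run `delay` time units after the current time."""
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), action))

    def run(self):
        """Process events in chronological order until none are left."""
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action(self)

def demo():
    sim = Simulator()
    log = []

    def source(name, period, count):
        """Build an actor that fires `count` times, `period` units apart."""
        def fire(sim, k=0):
            log.append((sim.now, name, k))
            if k + 1 < count:
                sim.schedule(period, lambda s: fire(s, k + 1))
        return fire

    sim.schedule(0, source("A", 2, 3))  # A fires at t = 0, 2, 4
    sim.schedule(1, source("B", 2, 3))  # B fires at t = 1, 3, 5
    sim.run()
    return log

print(demo())
# [(0, 'A', 0), (1, 'B', 0), (2, 'A', 1), (3, 'B', 1), (4, 'A', 2), (5, 'B', 2)]
```

The merged log is the same on every run because it is ordered by timestamp rather than by thread scheduling. Without the sequence counter, two events posted for the same instant would have no defined order, which is the source of non-determinism the DE slide warns about.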
- Lee’s suggested solution is “coordination languages”.
  - Polymorphic objects from other languages, general type system

Ptolemy II

- Design environment
- Actors/components can be defined in C/C++, Java, Matlab, Python, Perl, …
- Visual editor, abstract syntax
- Varying concurrency models

Models of Computation in Ptolemy II

- CI: push/pull component interaction
- Click: push/pull with method invocation
- CSP: concurrent threads with rendezvous
- Continuous: continuous-time modeling with fixed-point semantics
- CT: continuous-time modeling
- DDF: dynamic dataflow
- DE: discrete-event systems
- DDE: distributed discrete events
- DPN: distributed process networks
- FSM: finite state machines
- DT: discrete time (cycle driven)
- Giotto: synchronous periodic
- GR: 3-D graphics
- PN: process networks
- Rendezvous: extension of CSP
- SDF: synchronous dataflow
- SR: synchronous/reactive
- TM: timed multitasking

Actor oriented design - examples

- Two implementations of sequential interleaving based on rendezvous
- Both are deterministic.
- Barrier allows a rendezvous to occur only when both inputs are ready.
- Buffer can rendezvous with its input OR with its output.
- Commutator chooses one input for rendezvous (round robin).

Conclusions

- The bottom line from Lee’s work: instead of working with non-deterministic threads and attempting to prune the non-determinism, we should start with deterministic models and add non-determinism only when needed.
- The problem in adopting this is still the lack of cooperation from users (same as with async VLSI design, in fact).
  - Users accept only general-purpose languages, with no new syntax.
  - A transparent solution would be simpler to enforce: library based, compiler, HW…
- Can we take something back to the VLSI level?
  - Some synchronization schemes can be built in HW (which?)
  - Actor-oriented approach: are we there?
  - Design methodology / tools?