Concurrency Motivations
- To capture the logical structure of a problem (servers, graphical applications)
- To exploit extra processors, for speed (ubiquitous multi-core processors)
- To cope with separate physical devices (Internet applications)

HTC vs HPC
- High throughput computing: environments that can deliver large amounts of processing capacity over long periods of time
- High performance computing: uses supercomputers and computer clusters to solve advanced computation problems
- Examples: DACTAL, Condor

Concurrency
- Any system in which two or more tasks may be underway at the same time (at an unpredictable point in their execution)
- Parallel: more than one task physically active; requires multiple processors
- Distributed: processors are associated with people or devices that are physically separated from one another in the real world

Levels of Concurrency
- Instruction level: two or more machine instructions
- Statement level: two or more source-language statements
- Unit level: two or more subprogram units
- Program level: two or more programs

Fundamental Concepts
- A task or process is a program unit that can be in concurrent execution with other program units
- Tasks differ from ordinary subprograms in that:
  - a task may be implicitly started
  - when a program unit starts the execution of a task, it is not necessarily suspended
  - when a task's execution is completed, control may not return to the caller
- Tasks usually work together

Task Categories
- Heavyweight tasks execute in their own address space and have their own run-time stacks
- Lightweight tasks all run in the same address space and use the same run-time stack
- A task is disjoint if it does not communicate with or affect the execution of any other task in the program in any way

Synchronization
- A mechanism that controls the order in which tasks execute
- Cooperation: task A must wait for task B to complete some specific activity before task A can continue its execution (e.g., the producer-consumer problem)
- Competition: two or more tasks must use some resource that cannot be used simultaneously (e.g., a shared counter, dining philosophers)
- Competition synchronization is usually provided by mutually exclusive access

The Producer-Consumer Problem
- One of a number of "classic" synchronization problems
- There are M producers that put items into a fixed-size buffer; the buffer is shared with N consumers that remove items from it
- The problem is to devise a solution that synchronizes the producers' and consumers' accesses to the buffer
- Accesses to the buffer must be synchronized, because if multiple producers and/or consumers access it simultaneously, values may get lost, retrieved twice, etc.
- In the bounded-buffer version, the buffer has some fixed capacity N

Dining Philosophers
- Five philosophers sit at a table, alternating between eating noodles and thinking
- In order to eat, a philosopher must have two chopsticks; however, there is a single chopstick between each pair of plates, so if one philosopher is eating, neither neighbor can eat
- A philosopher puts down both chopsticks when thinking
- Devise a solution that ensures: no philosopher starves; and a hungry philosopher is only prevented from eating by his neighbor(s)

A first attempt (in Smalltalk):

    philosopher := [
        [true] whileTrue: [
            self get: left.
            self get: right.
            self eat.
            self release: left.
            self release: right.
            self think. ] ]

Deadlock! How about this instead?

    philosopher := [
        [true] whileTrue: [
            [self have: left and: right] whileFalse: [
                self get: left.
                right notInUse
                    ifTrue: [self get: right]
                    ifFalse: [self release: left] ].
            self eat.
            self release: left.
            self release: right.
            self think. ] ]

Livelock!
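For readers more comfortable with Java than Smalltalk, here is a minimal sketch of the first, deadlock-prone attempt, using java.util.concurrent.locks.ReentrantLock to model each chopstick. The class name, eat(), and think() are invented for the example; the point is only that "take left, then take right" can leave every philosopher holding one chopstick forever.

    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    // Deadlock-prone dining philosophers: each philosopher grabs the left
    // chopstick, then the right, exactly as in the first Smalltalk attempt.
    // If all five grab their left chopstick at once, no right chopstick
    // ever becomes available and the program hangs.
    public class DeadlockProneDiners {
        static final int N = 5;
        static final Lock[] chopsticks = new Lock[N];

        public static void main(String[] args) {
            for (int i = 0; i < N; i++) chopsticks[i] = new ReentrantLock();
            for (int i = 0; i < N; i++) {
                final Lock left = chopsticks[i];
                final Lock right = chopsticks[(i + 1) % N];
                new Thread(() -> {
                    while (true) {
                        left.lock();      // get: left
                        right.lock();     // get: right -- may block forever
                        eat();
                        right.unlock();   // release: right
                        left.unlock();    // release: left
                        think();
                    }
                }, "philosopher-" + i).start();
            }
        }

        static void eat()   { /* eat noodles */ }
        static void think() { /* think */ }
    }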
Liveness and Deadlock
- Liveness is a characteristic that a program unit may or may not have
- In sequential code, it means the unit will eventually complete its execution
- In a concurrent environment, a task can easily lose its liveness
- If all tasks in a concurrent environment lose their liveness, the result is called deadlock (or livelock)

Race Conditions
- A race condition occurs when two different threads of a program write to the same variable, and its resulting value depends on which thread writes first
- Transient errors, hard to debug
- Example: c = c + 1 compiles to three steps (1. load c, 2. add 1, 3. store c), and two threads can interleave these steps and lose an update
- Solution: acquire access to the shared resource before execution can continue
- Issues: lockout, starvation

Task Execution States
Assuming some mechanism for synchronization (e.g., a scheduler), tasks can be in a variety of states:
- New: created but not yet started
- Runnable (or ready): ready to run but not currently running (no available processor)
- Running
- Blocked: has been running, but cannot now continue (usually waiting for some event to occur)
- Dead: no longer active in any sense

Design Issues
- Competition and cooperation synchronization
- Controlling task scheduling
- How and when tasks start and end execution
- Alternatives: semaphores, monitors, message passing

Semaphores
- A simple mechanism that can be used to provide synchronization of tasks
- Devised by Edsger Dijkstra in 1965 for competition synchronization, but can also be used for cooperation synchronization
- A data structure consisting of an integer and a queue that stores task descriptors
- A task descriptor is a data structure that stores all the relevant information about the execution state of a task

Semaphore Operations
Two atomic operations, P and V. Consider a semaphore s:
- P (from the Dutch "passeren", to pass)
  - P(s): if s > 0 then assign s = s - 1; otherwise block (enqueue) the thread that calls P
  - Often referred to as "wait"
- V (from the Dutch "vrijgeven", to release)
  - V(s): if a thread T is blocked on s, then wake up T; otherwise assign s = s + 1
  - Often referred to as "signal"

Dining Philosophers, with a semaphore:

    wantBothSticks := Semaphore forMutualExclusion.  "one initial signal, so the first wait proceeds"
    philosopher := [
        [true] whileTrue: [
            [self haveBothSticks] whileFalse: [
                wantBothSticks wait.
                (left available and: [right available]) ifTrue: [
                    self get: left.
                    self get: right. ].
                wantBothSticks signal. ].
            self eat.
            self release: left.
            self release: right.
            self think. ] ]

The Trouble with Semaphores
- There is no way to statically check for the correctness of their use
- Leaving out a single wait or signal can create many different problems
- Getting them just right can be tricky
- Per Brinch Hansen (1973): "The semaphore is an elegant synchronization tool for an ideal programmer who never makes mistakes."
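As a concrete illustration of P and V, the sketch below (in Java rather than the slides' Smalltalk; the class name and counter variable are invented for the example) uses java.util.concurrent.Semaphore as a binary semaphore to make the c = c + 1 update from the race-condition slide atomic. acquire plays the role of P/wait, release the role of V/signal.

    import java.util.concurrent.Semaphore;

    // A binary semaphore (initial value 1) guards the shared counter c,
    // so the load / add 1 / store sequence cannot be interleaved.
    public class SemaphoreCounter {
        static int c = 0;                                // shared variable
        static final Semaphore mutex = new Semaphore(1); // s = 1

        public static void main(String[] args) throws InterruptedException {
            Runnable increment = () -> {
                for (int i = 0; i < 100_000; i++) {
                    mutex.acquireUninterruptibly();  // P(s): s > 0 ? s-- : block
                    try {
                        c = c + 1;                   // critical section
                    } finally {
                        mutex.release();             // V(s): wake a waiter, or s++
                    }
                }
            };
            Thread t1 = new Thread(increment), t2 = new Thread(increment);
            t1.start(); t2.start();
            t1.join();  t2.join();
            System.out.println(c);                   // 200000 every time
        }
    }

Without the semaphore, the two threads race and the final value of c is usually less than 200000; leaving out either the acquire or the release reintroduces exactly the kinds of errors the slide warns about.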
Locks and Condition Variables
- A semaphore may be used for either of two purposes:
  - mutual exclusion: guarding access to a critical section
  - synchronization: making processes suspend/resume
- This dual use can lead to confusion: it may be unclear which role a semaphore is playing in a given computation
- For this reason, newer languages may provide distinct constructs for each role:
  - locks: guarding access to a critical section
  - condition variables: making processes suspend/resume
- Locks provide for mutually exclusive access to shared memory; condition variables provide for thread/process synchronization

Locks
Like a semaphore, a lock has two associated operations:
- acquire(): try to lock the lock; if it is already locked, suspend execution
- release(): unlock the lock; awaken a waiting thread (if any)

These can be used to 'guard' a critical section. Given shared declarations

    Lock sharedLock;
    Object sharedObj;

each thread brackets its accesses as:

    sharedLock.acquire();
    // access sharedObj
    sharedLock.release();

A Java class has a hidden lock accessible via the synchronized keyword.

Condition Variables
- A Condition is a predefined type, available in some languages, that can be used to declare variables for synchronization
- When a thread needs to suspend execution inside a critical section until some condition is met, a Condition can be used
- There are three operations on a Condition:
  - wait(): suspend immediately; enter a queue of waiting threads
  - signal(), aka notify() in Java: awaken a waiting thread (usually the first in the queue), if any
  - broadcast(), aka notifyAll() in Java: awaken all waiting threads, if any
- Every Java object has an anonymous condition variable that can be manipulated via wait, notify and notifyAll (newer Java versions also provide an explicit Condition interface in java.util.concurrent.locks)

Monitor Motivation
- A Java class has a hidden lock accessible via the synchronized keyword
- Deadlocks, livelocks, and failures of mutual exclusion are easy to produce
- Just as control structures were "higher level" than the goto, language designers began looking for higher-level ways to synchronize processes
- In 1973, Brinch Hansen and Hoare proposed the monitor: a class whose methods are automatically accessed in a mutually exclusive manner
- A monitor prevents simultaneous access by multiple threads

Monitors
- The idea: encapsulate the shared data and its operations to restrict access
- A monitor is an abstract data type for shared data
- Shared data is resident in the monitor (rather than in the client units)
- All access is resident in the monitor
- The monitor implementation guarantees synchronized access by allowing only one access at a time
- Calls to monitor procedures are implicitly queued if the monitor is busy at the time of the call

Monitor Visualization
(Figure: a buffer monitor buf with public entry points put(obj) and get(obj); hidden inside are the lock, the data myValues, myHead, myTail, mySize, N, and the condition variables notEmpty and notFull.)
- The compiler 'wraps' each call to put() or get() between buf.lock.acquire() and buf.lock.release()
- If the lock is locked, the calling thread enters the monitor's entry queue
- Each condition variable has its own internal queue, in which waiting threads wait to be signaled

Evaluation of Monitors
- A better way to provide competition synchronization than semaphores
- Equally powerful as semaphores:
  - semaphores can be used to implement monitors
  - monitors can be used to implement semaphores
- Support for cooperation synchronization is very similar to that of semaphores, so it has the same reliability issues
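Because a Java object's hidden lock (synchronized) and anonymous condition variable (wait/notifyAll) already give it monitor-like behavior, a small monitor can be written directly in Java. The sketch below is a minimal capacity-1 buffer; the class and field names are invented for illustration.

    // A minimal monitor-style, capacity-1 buffer.  synchronized uses the
    // object's hidden lock; wait/notifyAll use its anonymous condition
    // variable, so the class behaves like a small monitor.
    public class OneSlotBuffer<T> {
        private T slot;                // shared data lives inside the monitor
        private boolean full = false;

        public synchronized void put(T item) throws InterruptedException {
            while (full) {
                wait();                // suspend until the slot is empty
            }
            slot = item;
            full = true;
            notifyAll();               // wake consumers waiting in get()
        }

        public synchronized T get() throws InterruptedException {
            while (!full) {
                wait();                // suspend until the slot is filled
            }
            T item = slot;
            slot = null;
            full = false;
            notifyAll();               // wake producers waiting in put()
            return item;
        }
    }

The while loops re-check the condition after every wakeup, so the buffer works with multiple producers and consumers; the strict empty/non-empty alternation it enforces is the same behavior the Ada capacity-1 Buffer task provides later in these notes.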
Distributed Synchronization
- Semaphores, locks, condition variables, and monitors are shared-memory constructs, and so are only useful on a tightly coupled multiprocessor; they are of no use on a distributed multiprocessor
- On a distributed multiprocessor, processes can communicate via message passing, using send() and receive() primitives
- If the message-passing system has no storage, then the send and receive operations must be synchronized: the sender and receiver must both be ready before the message is transmitted
- If the message-passing system has storage to buffer the message, then send() can proceed asynchronously: the message is buffered even if the receiver is not ready, and the receiver retrieves it when it is ready

Tasks
- In 1980, Ada introduced the task, with three characteristics: its own thread of control; its own execution state; and mutually exclusive subprograms (entry procedures)
- Entry procedures are self-synchronizing subprograms that another task can invoke for task-to-task communication
- If task t has an entry procedure p, then another task t2 can execute: t.p( argument-list );
- In order for p to execute, t must execute: accept p ( parameter-list );
  - if t executes accept p and t2 has not called p, t will automatically wait
  - if t2 calls p and t has not accepted p, t2 will automatically wait

Rendezvous
When t and t2 are both ready, p executes:
- t2's argument-list is evaluated and passed to t.p's parameters
- t2 suspends
- t executes the body of p (between accept p(params) do and end p), using its parameter values
- return values (out or in out parameters) are passed back to t2
- t continues execution; t2 resumes execution

This interaction is called a rendezvous between t and t2. It does not depend on shared memory, so t and t2 can be on a uniprocessor, a tightly coupled multiprocessor, or a distributed multiprocessor.

Example Problem
How can we rewrite what's below to complete more quickly?

    procedure sumArray is
       N: constant integer := 1000000;
       type RealArray is array(1..N) of float;
       anArray: RealArray;

       function sum(a: RealArray; first, last: integer) return float is
          result: float := 0.0;
       begin
          for i in first..last loop
             result := result + a(i);
          end loop;
          return result;
       end sum;

    begin
       -- code to fill anArray with values omitted
       put( sum(anArray, 1, N) );
    end sumArray;

Divide-and-Conquer via Tasks

    procedure parallelSumArray is
       -- declarations of N, RealArray, anArray, sum() as before
       ...
       task type ArraySliceAdder is
          entry SumSlice(Start: in Integer; Stop: in Integer);
          entry GetSum(Result: out Float);
       end ArraySliceAdder;

       task body ArraySliceAdder is
          i, j: Integer;
          Answer: Float;
       begin
          accept SumSlice(Start: in Integer; Stop: in Integer) do
             i := Start; j := Stop;          -- get ready
          end SumSlice;
          Answer := sum(anArray, i, j);      -- do the work
          accept GetSum(Result: out Float) do
             Result := Answer;               -- report outcome
          end GetSum;
       end ArraySliceAdder;
       -- continued below…

Divide-and-Conquer via Tasks (ii)

       -- continued from above…
       firstHalfSum, secondHalfSum: Float;
       T1, T2 : ArraySliceAdder;             -- T1, T2 start & wait on accept

    begin
       -- code to fill anArray with values omitted
       T1.SumSlice(1, N/2);                  -- start T1 on the 1st half
       T2.SumSlice(N/2 + 1, N);              -- start T2 on the 2nd half
       T1.GetSum( firstHalfSum );            -- get the 1st half's sum from T1
       T2.GetSum( secondHalfSum );           -- get the 2nd half's sum from T2
       put( firstHalfSum + secondHalfSum );  -- we're done!
    end parallelSumArray;

Using two tasks T1 and T2, this parallelSumArray version requires roughly 1/2 the time required by sumArray (on a multiprocessor). Using three tasks, the time would be roughly 1/3 that of sumArray, and so on.
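The same divide-and-conquer idea can be expressed with threads in a shared-memory language. The sketch below is a rough Java analogue of parallelSumArray (class and variable names are invented), splitting the array between two worker threads and combining their partial sums in the main thread.

    // Rough Java analogue of parallelSumArray: two worker threads each sum
    // one half of the array, and the main thread combines the results.
    public class ParallelSumArray {
        static final int N = 1_000_000;
        static final double[] anArray = new double[N];

        // Worker that sums anArray[start..stop) and remembers the answer.
        static class SliceAdder extends Thread {
            final int start, stop;
            double answer;
            SliceAdder(int start, int stop) { this.start = start; this.stop = stop; }
            public void run() {
                double result = 0.0;
                for (int i = start; i < stop; i++) result += anArray[i];
                answer = result;
            }
        }

        public static void main(String[] args) throws InterruptedException {
            java.util.Arrays.fill(anArray, 1.0);   // fill with sample values

            SliceAdder t1 = new SliceAdder(0, N / 2);
            SliceAdder t2 = new SliceAdder(N / 2, N);
            t1.start();                            // start T1 on the 1st half
            t2.start();                            // start T2 on the 2nd half
            t1.join();                             // wait for T1's partial sum
            t2.join();                             // wait for T2's partial sum
            System.out.println(t1.answer + t2.answer);
        }
    }

Here join() plays roughly the role of the GetSum rendezvous: it both waits for the worker to finish and makes the worker's result safely visible to the main thread.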
Producer-Consumer in Ada
To give the producer and the consumer separate threads, we can define the behavior of one in the 'main' procedure and the behavior of the other in a separate task. We can then build a Buffer task with put() and get() as (auto-synchronizing) entry procedures:

    procedure ProducerConsumer is
       buf: Buffer;
       it: Item;

       task consumer;
       task body consumer is
          it: Item;
       begin
          loop
             buf.get(it);
             -- consume Item it
          end loop;
       end consumer;

    begin   -- producer task
       loop
          -- produce an Item in it
          buf.put(it);
       end loop;
    end ProducerConsumer;

Capacity-1 Buffer
A single-value buffer is easy to build using an Ada task type. As a task type, variables of this type (e.g., buf) automatically have their own thread of execution. The body of the task is a loop that accepts calls to put() and get() in strict alternation, which causes the buffer to alternate between being empty and non-empty.

    task type Buffer is
       entry get(it: out Item);
       entry put(it: in Item);
    end Buffer;

    task body Buffer is
       B: Item;
    begin
       loop
          accept put(it: in Item) do
             B := it;
          end put;
          accept get(it: out Item) do
             it := B;
          end get;
       end loop;
    end Buffer;

Capacity-N Buffer
An N-value buffer is a bit more work: we can accept any call to get() so long as we are not empty, and any call to put() so long as we are not full. Ada provides the select-when statement to guard an accept, performing it if and only if a given condition is true:

    -- task declaration is as before
    ...
    task body Buffer is
       N: constant integer := 1024;
       package B is new Queue(N, Items);
    begin
       loop
          select
             when not B.isFull =>
                accept put(it: in Item) do
                   B.append(it);
                end put;
          or
             when not B.isEmpty =>
                accept get(it: out Item) do
                   it := B.first;
                   B.delete;
                end get;
          end select;
       end loop;
    end Buffer;

The Importance of Clusters
- Scientific computation is increasingly performed on clusters
- Cost-effective: created from commodity parts
- Scientists want more computational power
- A cluster's computational power is easy to increase by adding processors
- Cluster sizes keep increasing

Clusters Are Not Perfect
- Failure rates are increasing
- The number of moving parts is growing (processors, network connections, disks, etc.)
- Mean Time Between Failures (MTBF) is shrinking (anecdotally, every 20 minutes for Google's cluster)
- How can we deal with these failures?

Options for Fault Tolerance
- Redundancy in space: each participating process has a backup process; expensive!
- Redundancy in time: processes save state and then roll back for recovery; lighter-weight fault tolerance

Today's Answer: Redundancy in Time
- Programmers place checkpoints
- Small checkpoint size
- Synchronous: every process checkpoints at the same place in the code, with global synchronization before and after checkpoints

What's the Problem?
- Future systems will be larger
- Checkpointing will hurt program performance: many processes checkpointing synchronously will result in network and file-system contention
- Checkpointing to local disk is not viable
- Application programmers are only willing to pay about 1% overhead for fault tolerance
- The solution: avoid synchronous checkpoints

Understanding Staggered Checkpointing
(Figure: timelines for processes 0 through 64K. Today, synchronous checkpoints are no problem; tomorrow, with more processes and more data, synchronous checkpoints cause contention. Staggering the checkpoints avoids the contention, but not every staggered cut is safe: a recovery line [Randell 75] is valid only if the saved states could have existed together. A state in which a receive is saved but the corresponding send is not could never have existed (inconsistent); a state in which a send is saved but the receive is not is consistent.)

Identify All Possible Valid Recovery Lines
There are so many!
(Figure: timelines for processes 0, 1, and 2, with each checkpoint labeled by a vector timestamp such as [1,0,0], [2,0,0], [3,2,0], [1,1,0], [1,2,0], [2,3,2], [2,4,2], [4,5,2], [2,0,1], [2,0,2], [2,5,2], [2,4,3], from which the valid recovery lines can be enumerated.)

Coroutines
A coroutine is two or more procedures that share a single thread of execution, each exercising mutual control over the other:

    procedure A;
    begin
       -- do something
       resume B;
       -- do something
       resume B;
       -- do something
       -- …
    end A;

    procedure B;
    begin
       -- do something
       resume A;
       -- do something
       resume A;
       -- …
    end B;

Summary
- Concurrent computations consist of multiple entities:
  - processes in Smalltalk
  - tasks in Ada
  - threads in Java
  - OS-dependent facilities in C++
- On a shared-memory multiprocessor:
  - the semaphore was the first synchronization primitive
  - locks and condition variables separated a semaphore's mutual-exclusion usage from its synchronization usage
  - monitors are higher-level, self-synchronizing objects; Java classes have an associated (simplified) monitor
- On a distributed system:
  - Ada tasks provide self-synchronizing entry procedures