Multicores, Multiprocessors, and Clusters
P. A. Wilsey
Univ of Cincinnati


Classification of Parallelism

Classification from the textbook, by software category and hardware category:

- Sequential software, serial hardware: some problem written as a sequential program (the MATLAB example from the textbook), executed on a serial platform.
- Concurrent software, serial hardware: some problem written as a concurrent program (the O/S example from the textbook), executed on a serial platform.
- Sequential software, parallel hardware: some problem written as a sequential program (the MATLAB example from the textbook), executed on a parallel platform.
- Concurrent software, parallel hardware: some problem written as a concurrent program (the O/S example from the textbook), executed on a parallel platform.


Flynn's Classification of Parallelism

- CU: control unit
- PU: processor unit
- MM: memory module
- SM: shared memory
- IS: instruction stream
- DS: data stream

[Figure: block diagrams of (a) SISD, (b) SIMD, (c) MISD, and (d) MIMD computers; control units issue instruction streams to processor units, which exchange data streams with memory modules/shared memory.]

Single program, multiple data (SPMD): a single program run on every node of a MIMD-style platform.


Parallelism is Hard / Amdahl's Law

[Figure: Amdahl's Law, idealized parallelism; speedup (0 to 20) vs. number of processors (1 to 65536) for programs that are 50%, 75%, 90%, and 95% parallel.]

  Speedup = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Restating:

  Speedup = 1 / ((1 - % affected) + % left after optimization)

Hence the drive to find scalable problem sizes (see the textbook example on pg. 506) or embarrassingly parallel applications.


Speed-up Challenges

Strong scaling: increasing the parallelism in the hardware by itself achieves increased speedup.
Weak scaling: increased speedup is achieved only by increasing the problem size as the hardware parallelism increases.


Shared Memory Multiprocessors (SMP)

Parallel hardware that presents the programmer with a single physical address space across all processors.

Uniform Memory Access (UMA; nobody uses this acronym): all memory is accessed at near-uniform speed.
NonUniform Memory Access (NUMA; widely used): memory access speed depends on the location of the memory vis-a-vis the position of the processor.

The shared memory programming model requires mechanisms (in the hardware) that enable the programmer to protect shared objects from simultaneous updates:

- locks and other atomic primitives
- non-blocking algorithms and atomic operations
- transactional memory (forthcoming in the Intel Haswell processor)


Locks and Other Atomic Primitives

Mutexes: a locking mechanism to synchronize access to a shared resource.
Semaphores: a signaling mechanism to control access to a shared resource.
Deadlock: a programming error in which a collection of tasks all wait on a lock/semaphore that is held but will never be released, so that the locked data structure is never unlocked and no task can proceed.


Mutex Operation

Think of a mutex as a lock/unlock mechanism. A task locks an object, accesses it, and then unlocks the object. While different tasks can lock the same object, only the task that acquires the lock can unlock the resource.

  task 1:
    while (!done1) {
      lock(x);
      shared_variable++;
      unlock(x);
    }

  task 2:
    while (!done2) {
      lock(x);
      shared_variable--;
      unlock(x);
    }

In this example, the lock variable x is used to protect access to the shared variable shared_variable (a POSIX-threads version of this pattern is sketched below).
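As a concrete sketch of the same pattern, here is what the two tasks might look like with POSIX threads in C. This is an illustration, not part of the original slides: the names x, shared_variable, done1, and done2 mirror the pseudocode above, and the thread-creation boilerplate is omitted.

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;  /* the lock variable */
    static long shared_variable = 0;            /* the protected shared object */
    /* simplification: production code should set/read these flags atomically */
    static volatile bool done1 = false, done2 = false;

    /* task 1: repeatedly increments the shared variable while holding the lock */
    static void *task1(void *unused) {
        while (!done1) {
            pthread_mutex_lock(&x);
            shared_variable++;
            pthread_mutex_unlock(&x);
        }
        return NULL;
    }

    /* task 2: repeatedly decrements the shared variable while holding the lock */
    static void *task2(void *unused) {
        while (!done2) {
            pthread_mutex_lock(&x);
            shared_variable--;
            pthread_mutex_unlock(&x);
        }
        return NULL;
    }

Because both critical sections lock the same mutex x, the increment and decrement can never interleave mid-update, which is exactly the property the pseudocode relies on.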
Semaphores

Most often structured as a counting semaphore that is initialized to the number of simultaneous accesses that are allowed. Counting semaphores are variables that are accessed using the wait (P) and signal (V) operations.

  producer:
    wait(emptyCount);
    wait(useQueue);
    addItemInToQueue(item);
    signal(useQueue);
    signal(fullCount);

  consumer:
    wait(fullCount);
    wait(useQueue);
    removeItemFromQueue();
    signal(useQueue);
    signal(emptyCount);

This example illustrates the classic producer/consumer problem. The tasks communicate through a queue of maximum size N; queue overruns/underruns are not permitted. The solution uses three semaphores: emptyCount, initially set to N; fullCount, initially set to 0; and useQueue, initially set to 1.


Non-blocking Algorithms

Lock-free: guarantees system-wide progress. Basically, the system guarantees that some parallel task will make progress, but could conceptually starve one (or more) of the tasks.
Wait-free: guarantees that every operation completes in a bounded number of steps. Basically, this ensures that no task starves (a much stronger constraint than lock-free).


Transactional Memory

Transactional memory allows collections of operations on memory to be grouped so that they occur as a single atomic operation. The grouped operations are committed only if no other operation touches that region of memory while the transaction is in flight. Thus, the collection of operations is either committed in its entirety or not at all; when the transaction fails, the programmer must take some action to retry or otherwise accomplish the operation.

1. Execute an instruction to denote the beginning of a transaction.
2. Execute one or more memory operations (e.g., x++; y--).
3. Execute an instruction to denote the end of the transaction.
4. Test whether the transaction was committed.


Message Passing Multiprocessors

Parallel hardware where each processor has its own private address space.

- Information is shared by transmitting messages between processing nodes.
- The most common form is a Beowulf cluster: a collection (tightly coupled or loosely coupled) of computers.
- Communication often uses a standardized message passing library such as MPI or PVM, typically carried over an interconnect such as InfiniBand; of course, TCP/IP and UDP can also be used. A minimal MPI sketch follows below.
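To make the private-address-space model concrete, here is a minimal MPI sketch in C; this is an illustration, not part of the original slides, and the message value and tag are arbitrary choices. Rank 0 sends an integer from its own memory to rank 1, which receives it into a separate private copy.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I? */

        if (rank == 0) {
            value = 42;  /* lives only in rank 0's private address space */
            MPI_Send(&value, 1, MPI_INT, 1 /* dest */, 0 /* tag */,
                     MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0 /* src */, 0 /* tag */,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched with, e.g., mpirun -np 2 ./a.out, each rank runs as a separate process with its own memory, so the only way value reaches rank 1 is through the message itself.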