Communication Models for Parallel Computer Architectures

Two distinct models have been proposed for how the CPUs in a parallel computer system should communicate.
• In the first model, all CPUs share a common physical memory. This kind of system is called a multiprocessor or shared memory system.
• In the second design, each CPU has its own private memory. Such a design is called a multicomputer or distributed memory system.

Multiprocessors

Consider a program to find all of the objects in a bitmap image.
• One copy of the image is kept in memory.
• Each CPU runs a single process which inspects one section of the image.
• Some objects occupy multiple sections, so it is essential that each process have access to the entire image.

Example multiprocessors include:
• Sun Enterprise 10000
• Sequent NUMA-Q
• SGI Origin 2000
• HP/Convex Exemplar

Multicomputers

In a multicomputer solving the same problem, each CPU has a section of the image in its local memory.
• If a CPU needs to follow an object across the border of its section, it must request the information from a neighboring CPU. This is done via message passing.

Programming multicomputers is more difficult than programming multiprocessors, but they are more scalable.
• Building a multicomputer with 10,000 CPUs is straightforward.

Example multicomputers include:
• IBM SP/2
• Intel/Sandia Option Red
• Wisconsin COW

Much research focuses on hybrid systems combining the best of both worlds. Shared memory might be implemented at a higher level than the hardware.
• The operating system might simulate a shared memory by providing a single system-wide paged shared address space. This approach is called DSM (Distributed Shared Memory).

Shared Memory

• Each machine has its own virtual memory and its own page table.
• When a CPU does a LOAD or STORE on a page it does not have, a trap to the OS occurs.
• The OS locates the page and asks the CPU currently holding it to unmap the page and send it over the interconnection network.
• When the page arrives, it is mapped in and the faulting instruction is restarted.
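The page-migration protocol just described can be sketched as a small simulation. This is a minimal sketch, not the mechanism of any particular OS: the node names, the owner table, and the access function are hypothetical, and a real DSM system does this with hardware page tables and traps rather than Python dictionaries.

```python
# Minimal sketch of page-based DSM migration (hypothetical names and structures).
# Each node has a local page table; a missing page triggers a "fault" that asks
# the current owner to unmap the page and ship it over.

class Node:
    def __init__(self, name):
        self.name = name
        self.pages = {}          # page number -> page contents (mapped pages)

class DSM:
    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self.owner = {}          # page number -> name of node currently holding it

    def create_page(self, page, data, at):
        self.nodes[at].pages[page] = data
        self.owner[page] = at

    def access(self, node_name, page):
        node = self.nodes[node_name]
        if page in node.pages:                      # local hit: no further delay
            return node.pages[page]
        # "Page fault": locate the owner, unmap there, map here, then retry.
        holder = self.nodes[self.owner[page]]
        data = holder.pages.pop(page)               # owner unmaps and sends the page
        node.pages[page] = data                     # faulting node maps it in
        self.owner[page] = node_name
        return node.pages[page]                     # faulting access is "restarted"

dsm = DSM([Node("A"), Node("B")])
dsm.create_page(7, b"hello", at="A")
print(dsm.access("B", 7))   # faults on B, page migrates from A to B
print(dsm.access("B", 7))   # now a local hit
```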
A third possibility is to have a user-level runtime system implement a form of shared memory.

Shared Memory

The programming language provides a shared memory abstraction implemented by the compiler and runtime system.
• The Linda model is based on the abstraction of a shared space of tuples. Processes can input a tuple from the shared tuple space or output a tuple to the shared tuple space.
• The Orca model allows shared generic objects. Processes can execute object-specific methods on shared objects. When a change occurs to the internal state of some object, it is up to the runtime system to update all copies of the object simultaneously.

Interconnection Networks

• Multicomputers are held together by interconnection networks which move packets between CPUs and memory. The CPUs and memory modules of multiprocessors are also interconnected.

Interconnection networks consist of:
• CPUs
• Memory modules
• Interfaces
• Links
• Switches

The links are the physical channels over which bits move. They can be
• electrical or optical fiber
• serial or parallel
• simplex, half-duplex, or full-duplex

The switches are devices with multiple input ports and multiple output ports.
• When a packet arrives at an input port on a switch, some of its bits are used to select the output port to which the packet is sent.

Topology and Switching

An interconnection network consists of switches and the wires connecting them. The following figure shows an example.
• Each switch has four input ports and four output ports.
• In addition, each switch has some CPUs and interconnect circuitry.
• The job of the switch is to accept packets arriving on any input port and send each one out on the correct output port.
• Each output port is connected to an input port of another switch by a parallel or serial line.

Switching

Several switching strategies are possible.
• In circuit switching, before a packet is sent, the entire path from the source to the destination is reserved in advance. All ports and buffers are claimed, so that when transmission starts, all necessary resources are guaranteed to be available and the bits can move at full speed from the source, through the switches, to the destination.
• In store-and-forward packet switching, no advance reservation is needed. The source sends a complete packet to the first switch, where it is stored in its entirety. The switches may need to buffer packets if an output port is busy.

Communication Methods

When a program is split up into pieces, the pieces (processes) often need to communicate with one another. This communication can be done in one of two ways:
• shared variables
• explicit message passing
Logical sharing of variables is possible even on a multicomputer, and message passing is easy to implement on a multiprocessor by simply copying from the sender to the receiver.

Message Passing Modes

• Messaging systems can be either persistent or transient: are messages retained when the senders and/or receivers stop executing?
• They can also be either synchronous or asynchronous: blocking vs. non-blocking.

Persistent Communication

• Persistent communication works like letters back in the days of the Pony Express: the message is stored by the system until it can be delivered.

Persistence and Synchronicity in Communication
(Figure: a) persistent asynchronous communication, b) persistent synchronous communication, c) transient asynchronous communication, d) receipt-based transient synchronous communication, e) delivery-based transient synchronous communication (at message delivery), f) response-based transient synchronous communication.)

Remote Procedure Call (RPC)

• Developed by Birrell and Nelson (1984).

With explicit message passing, the client code for something like copying a file is quite different from the normal centralized (uniprocessor) code. Let's make the client-server request-reply look like a normal procedure call and return.

Notice that getchar in the centralized version turns into a read system call. The following is for Unix:
• read looks like a normal procedure to its caller.
• read is a user-mode program.
• read manipulates registers and then does a trap to the kernel.
• After the trap, the kernel manipulates registers and then runs a C-language routine, and lots of work gets done (drivers, disks, etc.).
• After the I/O, the process gets unblocked, the kernel part of read manipulates registers and returns, and the user-mode read manipulates registers and returns to its original caller.

Let's do something similar with request-reply. The user (client) does a subroutine call to getchar (or read).
• The client knows nothing about messages.

We link in a user-mode program called the client stub (analogous to the user-mode read above). It
• takes the parameters to read and converts them to a message (marshals the arguments),
• sends the message to the machine containing the server, directed to the server stub, and
• does a blocking receive (of the reply message).
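To make the marshalling concrete, here is a minimal sketch of a matching client stub and server stub. It is an illustration under stated assumptions, not a real RPC package: the wire format (network byte order via struct), the single add procedure, and the direct function call standing in for the network transport are all invented for the example.

```python
import struct

# Server side: the real procedure plus a server stub that unmarshals the
# request, calls the procedure, and marshals the reply.
def add(a, b):                     # ordinary procedure; knows nothing about messages
    return a + b

def server_stub(request: bytes) -> bytes:
    a, b = struct.unpack("!ii", request)    # unmarshal (network byte order)
    result = add(a, b)                      # plain local call
    return struct.pack("!i", result)        # marshal the reply

# Client side: the client stub marshals the arguments, "sends" the message,
# blocks for the reply, and unmarshals it. Here the transport is just a
# function call standing in for the send and the blocking receive.
def add_stub(a, b):
    request = struct.pack("!ii", a, b)      # marshal arguments into a message
    reply = server_stub(request)            # send + blocking receive (simulated)
    (result,) = struct.unpack("!i", reply)  # unmarshal the reply
    return result

print(add_stub(2, 3))   # the caller sees an ordinary procedure call: prints 5
```

Packing in network byte order ("!" in struct) is one way of adopting a standard representation on the wire, which anticipates the heterogeneity issue discussed below.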
Remote Procedure Call (RPC)

The server stub is linked with the server.
• It receives the message from the client stub.
• It unmarshals the arguments and calls the server (as a subroutine).

The server procedure does what it does and returns (to the server stub).
• The server knows nothing about messages.

The server stub now converts the result into a reply message sent to the client stub.
• It marshals the arguments.

The client stub unblocks and receives the reply.
• It unmarshals the arguments.
• It returns to the client.
The client believes (correctly) that the routine it called has returned just like a normal procedure does.

Passing Value Parameters (1)
(Figure: the steps involved in doing a remote computation through RPC.)

Heterogeneity: machines have different data formats. How can we handle these differences in RPC?
• Have conversions between all possibilities, done during marshalling and unmarshalling.
• Adopt a standard and convert to/from it.

Passing Value Parameters (2)
(Figure: a) the original message on the Pentium, b) the message after receipt on the SPARC, c) the message after being inverted. The little numbers in boxes indicate the address of each byte.)

Pointers: avoid them for RPC!
• You can put the object pointed to into the message itself (assuming you know its length).
• You can convert call-by-reference to copy-in/copy-out. If we have in-only or out-only parameters (instead of in-out), one of the copies can be eliminated.
• You can change the server to handle pointers in a special way, with a callback to the client stub.

Registering and Name Servers

As we said before, we can use a name server. This permits the server to move using the following process:
• deregister from the name server
• move
• reregister
This is sometimes called dynamic binding.

The client stub calls the name server (binder) the first time, to get a handle to use in the future.
• There is a callback from the binder to the client stub if the server deregisters, or we could have the attempt to use the handle fail so that the client stub goes to the binder again. (A small sketch of this interaction appears at the end of this RPC discussion.)

How does a programmer create a program with RPC?
• uuidgen generates a unique identifier for the RPC.
• Include it in an IDL (interface definition language) file and describe the interface for the RPC in the file as well.
• Write the client and server code.
• The client and server stubs are generated from the IDL file automatically.
• Link things together and run on the desired machines.

Writing a Client and a Server
(Figure: the steps in writing a client and a server in DCE RPC.)
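Below is a minimal sketch of the binder interaction described above. The Binder class, its register/lookup/deregister methods, and the handle format are all hypothetical; a real system such as DCE RPC uses a daemon and an endpoint map rather than an in-process dictionary.

```python
# Hypothetical in-process "binder" (name server) used for dynamic binding.
class Binder:
    def __init__(self):
        self.table = {}                      # interface id -> (host, port) handle

    def register(self, interface_id, handle):
        self.table[interface_id] = handle

    def deregister(self, interface_id):
        self.table.pop(interface_id, None)

    def lookup(self, interface_id):
        return self.table.get(interface_id)  # None means "not currently registered"

binder = Binder()

# The server registers; later it deregisters, moves, and reregisters elsewhere.
binder.register("file-server-v1", ("hostA", 5000))

# The client stub asks the binder once and caches the handle.
handle = binder.lookup("file-server-v1")
print("client will send requests to", handle)

# If a later request fails because the server moved, the client stub simply
# goes back to the binder for a fresh handle.
binder.deregister("file-server-v1")
binder.register("file-server-v1", ("hostB", 5000))
handle = binder.lookup("file-server-v1")
print("after the move, requests go to", handle)
```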
Processor Allocation

Processor allocation means deciding which processes should run on which processors. It could also be called process allocation. We assume that any process can run on any processor.

Often the only difference between the processors is:
• CPU speed
• CPU speed and amount of memory

What if the processors are not homogeneous?
• Assume that we have binaries for all the different architectures.
• If we have only PowerPC binaries, restrict the process to PowerPC machines.
• If we need machines very close together for fast communication, restrict the processes to a group of close machines.

What if not all machines are directly connected?
• Send the process via intermediate machines.

Can you move a running process, or are processor allocations done only at process creation time?
• Migratory vs. non-migratory allocation algorithms.

What is the figure of merit, i.e., what do we want to optimize in order to find the best allocation of processes to processors?
• This is similar to CPU scheduling in centralized operating systems.
• Minimizing response time is one possibility.

We are not assuming all machines are equally fast. Consider two processes: P1 executes 100 million instructions, P2 executes 10 million instructions, and both enter the system at time t=0. Consider two machines: A executes 100 MIPS, B executes 10 MIPS.
• If we run P1 on A and P2 on B, each takes 1 second, so the average response time is 1 sec.
• If we run P1 on B and P2 on A, P1 takes 10 seconds and P2 takes 0.1 sec, so the average response time is 5.05 sec.
• If we run P2 then P1, both on A, they finish at times 0.1 and 1.1, so the average response time is 0.6 seconds!

Another possibility is to minimize the response ratio.
• The response ratio is the time to run on some machine divided by the time to run on a standardized (benchmark) machine, assuming the benchmark machine is unloaded.
• This takes into account the fact that long jobs should take longer.

Other figures of merit are to maximize CPU utilization, or to maximize throughput:
• jobs per hour
• weighted jobs per hour
If the weighting is CPU time, we get CPU utilization. This is the way to justify CPU utilization as a (user-centric) metric.

Design issues

Deterministic vs. heuristic
• Use deterministic algorithms for embedded applications, when all requirements are known a priori, e.g., patient monitoring in a hospital or nuclear reactor monitoring.

Centralized vs. distributed
• We have a tradeoff of accuracy vs. fault tolerance and bottlenecks.

Optimal vs. best effort
• Optimal normally requires off-line processing.
• Similar requirements as for deterministic.
• The usual tradeoff of system effort vs. result quality.

Transfer policy
• Does a node decide to shed jobs based just on its own load, or does it have (and use) knowledge of other loads?
• Also called local vs. global.
• The usual tradeoff of system effort (gathering data) vs. result quality.

Location policy
• Sender-initiated vs. receiver-initiated.
  - Sender-initiated: uploading programs to a compute server.
  - Receiver-initiated: downloading Java applets.
• Look for help vs. look for work. Both are done.

Implementation issues

Determining the local load
• Normally use a weighted mean of recent loads, with more recent loads weighted more heavily.

Example algorithms

Min-cut deterministic algorithm
• Define a graph with processes as nodes and IPC traffic as arcs.
• Goal: cut the graph (i.e., some arcs) into pieces so that
  - all the nodes in one piece can be run on one processor (subject to memory constraints and processor completion times), and
  - the total value of the cut arcs is minimized.
• Variations on the objective:
  - Minimize the max: minimize the maximum traffic for any process pair that is cut.
  - Minimize the sum: minimize the total traffic across the cut.
  - Minimize the sum to/from a piece: don't overload a processor.
  - Minimize the sum between pieces: minimize the traffic for each processor pair.
• This tends to get hard as you make it more realistic.

Up-down centralized algorithm
• A centralized table keeps "usage" data for each user, where the users are defined to be the workstation owners. Call this the score for the user. The goal is to give each user a fair share.
• When a user requests a remote job, if a workstation is available it is assigned.
• For each process a user has running remotely, the user's score increases by a fixed amount each time interval.
• When a user has an unsatisfied request pending (and none being satisfied), the score decreases (it can go negative).
• If no requests are pending and none are being satisfied, the score is bumped towards zero.
• When a processor becomes free, it is assigned to a requesting user with the lowest score.
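A minimal sketch of the up-down bookkeeping described above follows. The tick interface, the charge/credit constants, and the data structures are invented for illustration; the actual algorithm maintains this table in a centralized usage server.

```python
# Sketch of up-down score bookkeeping (constants and interfaces are hypothetical).
class UpDown:
    def __init__(self, users, charge=1, credit=1):
        self.score = {u: 0 for u in users}     # usage score per workstation owner
        self.running = {u: 0 for u in users}   # remote processes currently running
        self.pending = {u: 0 for u in users}   # unsatisfied requests
        self.charge, self.credit = charge, credit

    def tick(self):
        # Called once per time interval to update every user's score.
        for u in self.score:
            if self.running[u] > 0:
                self.score[u] += self.charge * self.running[u]   # pay for remote CPUs
            elif self.pending[u] > 0:
                self.score[u] -= self.credit                     # waiting builds credit
            elif self.score[u] > 0:
                self.score[u] = max(0, self.score[u] - self.credit)   # drift toward zero
            elif self.score[u] < 0:
                self.score[u] = min(0, self.score[u] + self.credit)

    def processor_freed(self):
        # Give the free processor to a requesting user with the lowest score.
        waiting = [u for u in self.pending if self.pending[u] > 0]
        if not waiting:
            return None
        u = min(waiting, key=lambda x: self.score[x])
        self.pending[u] -= 1
        self.running[u] += 1
        return u

sched = UpDown(["alice", "bob"])
sched.pending["alice"] = 1
sched.running["bob"] = 2
for _ in range(3):
    sched.tick()
print(sched.score)              # alice has gone negative, bob positive
print(sched.processor_freed())  # 'alice' gets the next free processor
```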
Hierarchical algorithm
• Goal: assign multiple processors to a job.
• Quick idea of the algorithm:
  - The processors are arranged in a tree.
  - Requests go up the tree until some subtree has enough resources.
  - The request is then split and the parts go back down the tree.
• Arrange the processors in a hierarchy (tree).
  - This is a logical tree, independent of how the machines are physically connected.
  - Each node keeps (imperfect) track of how many available processors are below it. If a processor can run more than one process, the bookkeeping must be more sophisticated and must keep track of how many processes can be allocated (without overload) in the subtree below.
• If a new request appears at a node in the tree, the current node sees if it can be satisfied by the processors below it (plus itself). If so, do it. If not, pass the request up the tree.
  - Actually, since machines may be down or the data on availability may be out of date, you try to find more processors than requested.
• Once a request has gone high enough to be satisfied, the current node splits the request into pieces and sends each piece to the appropriate child.
• What if a node dies? Promote one of its children, say C. Now C's children are peers with the previous peers of C. If this is considered too unbalanced, we can promote one of C's children to take C's place.
• How can we decide which child C to promote?
  - The peers of the dead node hold an election.
  - The children of the dead node hold an election.
  - The parent of the dead node decides.
• What if the root dies?
  - We must use its children, since it has no peers or parent.
  - If we want to use peers, then we do not have a single root, i.e., the top level of the hierarchy is a collection of roots that communicate. This is a forest, not a tree.
• What if multiple requests are generated simultaneously? This gets hard fast, as information gets stale and race conditions and deadlocks are possible.

Distributed heuristic algorithm
• Goal: find a lightly loaded processor to migrate the job to.
• Send a probe to a random processor.
• If the remote load is low, ship the job.
• If the remote load is high, try another random probe.
• After k probes (k is a parameter of the implementation) all say the load is too high, give up and run the job locally.
• This has been modelled analytically and seen to work fairly well.
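The sender-initiated probing just described can be sketched as follows. The load numbers, the threshold, and the function names are hypothetical; the point is only the control flow of probing up to k times before running the job locally.

```python
import random

# Hypothetical cluster state: load (e.g., run-queue length) per machine.
loads = {"m0": 4, "m1": 0, "m2": 3, "m3": 1, "m4": 5}

def probe(machine, threshold):
    """Ask a remote machine whether its load is below the threshold."""
    return loads[machine] < threshold

def place_job(local, k=3, threshold=2):
    """Sender-initiated location policy: try up to k random probes, else run locally."""
    candidates = [m for m in loads if m != local]
    for _ in range(k):
        target = random.choice(candidates)
        if probe(target, threshold):
            return target          # lightly loaded machine found: ship the job there
    return local                   # all k probes said "too busy": run the job locally

print(place_job("m0"))
```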
Scheduling

The general goal is to have processes that communicate frequently run simultaneously.
• If they don't, and we use busy waiting for messages, we will have a huge disaster.
• Even if we use context switching, we may have a small disaster, as only one message transfer can occur per scheduling time slot.
• Co-scheduling (a.k.a. gang scheduling): the processes belonging to a job are scheduled together.
  - Time slots are coordinated among the processors.
  - Some slots are for gangs; other slots are for regular processes.

Taxonomy of Parallel Computers

• Although many researchers have tried to come up with a taxonomy of parallel computers, the only one which is widely used is that of Flynn (1972).
• This classification is based on two concepts:
  - instruction streams, each corresponding to a program counter, and
  - data streams, each consisting of a set of operands.

Memory Semantics

Even though all multiprocessors present the CPUs with the image of a single shared address space, often there are many memory modules present, each holding some portion of the physical memory.
• The CPUs and memories are often interconnected by a complex interconnection network.
• Several CPUs may be attempting to read a memory word at the same time several other CPUs are attempting to write the same word.
• Multiple copies of some blocks may be in caches.

One view of memory semantics is to see it as a contract between the software and the memory hardware.
• The rules are called consistency models, and many different ones have been proposed and implemented.
• For example, suppose that CPU 0 writes the value 1 to some memory word and a little later CPU 1 writes the value 2 to the same word. Now CPU 2 reads the word and gets the value 1. Is this an error?

The simplest model is strict consistency.
• With this model, any read to a location x always returns the value of the most recent write to x.
• This model is great for programmers, but almost impossible to implement.

The next best model is called sequential consistency.
• The basic idea is that in the presence of multiple read and write requests, some interleaving of all the requests is chosen by the hardware (nondeterministically), but all CPUs see the same order.

A looser consistency model, but one that is easier to implement on large multiprocessors, is processor consistency. It has two properties:
• Writes by any CPU are seen by all CPUs in the order they were issued.
• For every memory word, all CPUs see all writes to it in the same order.
If CPU 1 issues writes with values 1A, 1B, and 1C to some memory location in that sequence, then all other processors see them in that order too. Every memory word has an unambiguous value after several CPUs write to it and stop.

Weak consistency does not even guarantee that writes from a single CPU are seen in that order.
• One CPU might see 1A before 1B and another CPU might see 1A after 1B.
• However, to add some order, weakly consistent memories provide synchronization variables. When a synchronization operation is executed, all pending writes are finished and no new ones are started until all the old ones are done and the synchronization itself is done.
• In effect, a synchronization "flushes the pipeline" and brings the memory to a stable state with no operations pending. Time is divided into epochs delimited by the synchronizations.

Weak consistency has the problem that it is quite inefficient, because it must finish off all pending memory operations and hold all new ones until the current ones are done. Release consistency improves matters by adopting a model akin to critical sections.
• The idea behind this model is that when a process exits a critical region, it is not necessary to force all writes to complete immediately. It is only necessary to make sure that they are done before any process enters the critical region again.
• In this model, the synchronization operation offered by weak consistency is split into two different operations:
  - To read or write a shared data variable, a CPU must first do an acquire operation on the synchronization variable to get exclusive access to the shared data.
  - When it is done, the CPU does a release operation on the synchronization variable to indicate that it is finished.
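The acquire/release discipline can be illustrated with a small sketch. This is only a programming-level analogy, using a Python lock to stand in for a synchronization variable; a real release-consistent machine enforces the rule in the memory hardware, not in a library.

```python
import threading

# A synchronization variable guarding some shared data (the analogy is loose:
# acquire/release here are lock operations, not hardware memory-ordering events).
sync_var = threading.Lock()
shared = {"counter": 0}

def worker():
    for _ in range(10_000):
        sync_var.acquire()          # "acquire": gain access; earlier remote writes visible
        shared["counter"] += 1      # reads/writes of shared data happen inside the region
        sync_var.release()          # "release": local writes must be pushed out before
                                    # any other CPU's next acquire succeeds

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(shared["counter"])            # always 40000 when the discipline is followed
```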
UMA Bus-Based SMP Architectures

• The simplest multiprocessors are based on a single bus. Two or more CPUs and one or more memory modules all use the same bus for communication.
• If the bus is busy when a CPU wants to read memory, it must wait. Adding more CPUs results in more waiting.
• This can be alleviated by having a private cache for each CPU.

Snooping Caches

With caches, a CPU may have stale data in its private cache. This problem is known as the cache coherence or cache consistency problem, and it can be controlled by algorithms called cache coherence protocols.
• In all solutions, the cache controller is specially designed to allow it to eavesdrop on the bus, monitoring all bus requests and taking action in certain cases.
• These devices are called snooping caches.

MESI Cache Coherence Protocol

When a protocol has the property that not all writes go directly through to memory (a bit is set instead and the cache line is eventually written back to memory), we call it a write-back protocol. One popular write-back protocol is called the MESI protocol.
• It is used by the Pentium II and other CPUs.
• Each cache entry can be in one of four states:
  - Invalid: the cache entry does not contain valid data.
  - Shared: multiple caches may hold the line; memory is up to date.
  - Exclusive: no other cache holds the line; memory is up to date.
  - Modified: the entry is valid; memory is invalid; no other copies exist.
• Initially all cache entries are invalid.
• The first time memory is read, the cache line is marked E (exclusive).
• If some other CPU reads the data, the first CPU sees this on the bus, announces that it holds the data as well, and both entries are marked S (shared).
• If one of the CPUs writes the cache entry, it tells all other CPUs to invalidate their entries (I), and its own entry is now in the M (modified) state.
• If some other CPU now wants to read the modified line, the cached copy is sent to memory, and all CPUs needing it read it from memory. They are marked as S.
• If we write to an uncached line and write-allocate is in use, we load the line, write to it, and mark it M.
• If write-allocate is not in use, the write goes directly to memory and the line is not cached anywhere.
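The read/write transitions listed above can be captured in a tiny simulation. This is a deliberately simplified sketch (a single cache line, no write-allocate case, invented method names), not the Pentium II implementation.

```python
# Simplified MESI simulation for a single cache line shared by several CPUs.
# States: 'I' (invalid), 'S' (shared), 'E' (exclusive), 'M' (modified).

class CacheLine:
    def __init__(self, n_cpus):
        self.state = ['I'] * n_cpus      # per-CPU state of the one cache line

    def read(self, cpu):
        if self.state[cpu] == 'I':
            if any(s != 'I' for s in self.state):
                # Someone else holds the line: a modified copy would be written
                # back to memory, and all holders (old and new) end up in S.
                self.state = ['S' if s != 'I' else 'I' for s in self.state]
                self.state[cpu] = 'S'
            else:
                self.state[cpu] = 'E'    # first read anywhere: exclusive
        return self.state

    def write(self, cpu):
        # The writer invalidates all other copies and goes to M.
        self.state = ['I'] * len(self.state)
        self.state[cpu] = 'M'
        return self.state

line = CacheLine(3)
print(line.read(0))    # ['E', 'I', 'I']  first read: exclusive
print(line.read(1))    # ['S', 'S', 'I']  second reader: both shared
print(line.write(1))   # ['I', 'M', 'I']  write: other copies invalidated
print(line.read(2))    # ['I', 'S', 'S']  modified copy written back, readers shared
```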
UMA Multiprocessors Using Crossbar Switches

Even with all possible optimizations, the use of a single bus limits the size of a UMA multiprocessor to about 16 or 32 CPUs.
• To go beyond that, a different kind of interconnection network is needed.
• The simplest circuit for connecting n CPUs to k memories is the crossbar switch. Crossbar switches have long been used in telephone switches.
• At each intersection is a crosspoint, a switch that can be opened or closed.
• The crossbar is a nonblocking network.

Sun Enterprise 10000

An example of a UMA multiprocessor based on a crossbar switch is the Sun Enterprise 10000.
• This system consists of a single cabinet with up to 64 CPUs.
• The crossbar switch is packaged on a circuit board with eight plug-in slots on each side.
• Each slot can hold up to four UltraSPARC CPUs and 4 GB of RAM.
• Data is moved between memory and the caches on a 16 × 16 crossbar switch.
• There are four address buses used for snooping.

UMA Multiprocessors Using Multistage Switching Networks

In order to go beyond the limits of the Sun Enterprise 10000, we need a better interconnection network. We can use 2 × 2 switches to build large multistage switching networks.
• One example is the omega network.
• The wiring pattern of the omega network is called the perfect shuffle.
• The labels of the memory modules can be used for routing packets through the network.
• The omega network is a blocking network.

NUMA Multiprocessors

To scale to more than 100 CPUs, we have to give up uniform memory access time. This leads to the idea of NUMA (NonUniform Memory Access) multiprocessors.
• They share a single address space across all the CPUs, but unlike in UMA machines, local access is faster than remote access.
• All UMA programs run without change on NUMA machines, but the performance is worse.
• When the access time to the remote machine is not hidden (by caching), the system is called NC-NUMA.
• When coherent caches are present, the system is called CC-NUMA. It is also sometimes known as hardware DSM, since it is basically the same as software distributed shared memory but implemented by the hardware using a small page size.
• One of the first NC-NUMA machines was the Carnegie Mellon Cm*. This system was implemented with LSI-11 CPUs (the LSI-11 was a single-chip version of the DEC PDP-11). A program running out of remote memory took ten times as long as one using local memory. Note that there is no caching in this type of system, so there is no need for cache coherence protocols.

Cache Coherent NUMA Multiprocessors

Not having a cache is a major handicap. One of the most popular approaches to building large CC-NUMA (Cache Coherent NUMA) multiprocessors currently is the directory-based multiprocessor.
• Maintain a database telling where each cache line is and what its status is.
• The database is kept in special-purpose hardware that responds in a fraction of a bus cycle.

DASH Multiprocessor

The first directory-based CC-NUMA multiprocessor, DASH (Directory Architecture for SHared Memory), was built at Stanford University as a research project.
• It has heavily influenced a number of commercial products, such as the SGI Origin 2000.
• The prototype consists of 16 clusters, each one containing a bus, four MIPS R3000 CPUs, 16 MB of global memory, and some I/O equipment.
• Each CPU snoops on its local bus, but not on any other buses, so global coherence needs a different mechanism.
• Each cluster has a directory that keeps track of which clusters currently have copies of its lines.
• Each cluster in DASH is connected to an interface that allows the cluster to communicate with other clusters. The interfaces are connected in a rectangular grid.
• A cache line can be in one of three states: UNCACHED, SHARED, or MODIFIED.
• The DASH protocols are based on ownership and invalidation.
• At every instant each cache line has a unique owner:
  - For UNCACHED or SHARED lines, the line's home cluster is the owner.
  - For MODIFIED lines, the cluster holding the one and only copy is the owner.
• Requests for a cache line work their way out from the cluster to the global network.
• Maintaining memory consistency in DASH is fairly complex and slow. A single memory access may require a substantial number of packets to be sent.
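Here is a minimal sketch of directory-based lookups of the kind DASH uses. The three states follow the description above, but the data structures, the message sequence, and the cluster numbering are simplified inventions; the real machine exchanges hardware messages between per-cluster directories.

```python
# Sketch of a per-home-cluster directory for cache lines (states UNCACHED,
# SHARED, MODIFIED). Structures and message flow are simplified for illustration.

class Directory:
    def __init__(self, home):
        self.home = home
        self.entries = {}   # line -> {"state": ..., "holders": set of cluster ids}

    def read_request(self, line, requester):
        e = self.entries.setdefault(line, {"state": "UNCACHED", "holders": set()})
        if e["state"] == "MODIFIED":
            owner = next(iter(e["holders"]))
            # The dirty copy is fetched from its owner and memory is updated.
            print(f"fetch line {line} from cluster {owner}, write back to home {self.home}")
        e["holders"].add(requester)
        e["state"] = "SHARED"
        return e

    def write_request(self, line, requester):
        e = self.entries.setdefault(line, {"state": "UNCACHED", "holders": set()})
        # Invalidate every other cached copy before granting ownership.
        for holder in e["holders"] - {requester}:
            print(f"invalidate line {line} in cluster {holder}")
        e["holders"] = {requester}
        e["state"] = "MODIFIED"
        return e

d = Directory(home=0)
d.read_request(42, requester=1)
d.read_request(42, requester=2)
print(d.write_request(42, requester=2))   # cluster 1's copy is invalidated first
```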
Sequent NUMA-Q Multiprocessor

The DASH was an important project, but it was never a commercial system. As an example of a commercial CC-NUMA multiprocessor, consider the Sequent NUMA-Q 2000.
• It uses an interesting and important cache coherence protocol called SCI (Scalable Coherent Interface).
• The NUMA-Q is based on the standard quad board sold by Intel, containing four Pentium Pro CPU chips and up to 4 GB of RAM. The caches on a board are kept coherent by using the MESI protocol.
• Each quad board is extended with an IQ-Link board plugged into a slot designed for network controllers.
  - The IQ-Link primarily implements the SCI protocol.
  - It holds 32 MB of cache, a directory for the cache, a snooping interface to the local quad board bus, and a custom chip called the data pump that connects it with other IQ-Link boards. The data pump moves data from its input side to its output side, keeping data aimed at its own node and passing other data along unmodified.
  - Together, all the IQ-Link boards form a ring.

Distributed Shared Memory

A collection of CPUs sharing a common paged virtual address space is called DSM (Distributed Shared Memory).
• When a CPU accesses a page in its own local RAM, the read or write just happens without any further delay.
• If the page is in a remote memory, a page fault is generated.
• The runtime system or OS sends a message to the node holding the page to unmap it and send it over.
• Read-only pages may be shared.

Pages, however, are an unnatural unit for sharing, so other approaches have been tried. Linda provides processes on multiple machines with a highly structured distributed shared memory.
• The memory is accessed through a small set of primitive operations that can be added to existing languages such as C and FORTRAN.
• The unifying concept behind Linda is that of an abstract tuple space.
• Four operations are provided on tuples:
  - out puts a tuple into the tuple space.
  - in retrieves a tuple from the tuple space. Tuples are addressed by content, rather than by name.
  - read is like in, but it does not remove the tuple from the tuple space.
  - eval causes its parameters to be evaluated in parallel and the resulting tuple to be deposited in the tuple space.
• Various implementations of Linda exist on multicomputers. Broadcasting and directories are used for distributing the tuples.
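Here is a minimal sketch of a tuple space with content-addressed matching. It runs in one process and supports only out, in, and read; it is not an actual Linda implementation (real ones distribute the tuples over many machines with broadcasting or directories, and in blocks until a match appears).

```python
# Minimal single-process tuple space sketch with content-based matching.
# None in a template means "match any value in this position".

class TupleSpace:
    def __init__(self):
        self.tuples = []

    def out(self, *tup):
        """Deposit a tuple into the tuple space."""
        self.tuples.append(tuple(tup))

    def _match(self, template):
        for t in self.tuples:
            if len(t) == len(template) and all(
                    want is None or want == have for want, have in zip(template, t)):
                return t
        return None   # a real 'in' would block here until a matching tuple appears

    def read(self, *template):
        """Return a matching tuple without removing it."""
        return self._match(template)

    def in_(self, *template):          # 'in' is a Python keyword, hence the underscore
        """Return a matching tuple and remove it from the space."""
        t = self._match(template)
        if t is not None:
            self.tuples.remove(t)
        return t

ts = TupleSpace()
ts.out("point", 4, 17)
ts.out("task", "invert-row", 3)
print(ts.read("point", None, None))   # ('point', 4, 17), tuple stays in the space
print(ts.in_("task", None, None))     # ('task', 'invert-row', 3), tuple removed
print(ts.in_("task", None, None))     # None: no matching tuple left
```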
Orca uses full-blown objects rather than tuples as the unit of sharing. Objects consist of internal state plus operations for changing the state.
• Each Orca operation consists of a list of (guard, block-of-statements) pairs. A guard is a Boolean expression that does not contain any side effects, or the empty guard, which is simply true.
• When an operation is invoked, all of its guards are evaluated in an unspecified order.
• If all of them are false, the invoking process is delayed until one becomes true.
• When a guard is found that evaluates to true, the block of statements following it is executed.
• Orca has a fork statement to create a new process on a user-specified processor.
• Operations on shared objects are atomic and sequentially consistent.
• Orca integrates shared data and synchronization in a way not present in page-based DSM systems.
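To give a flavor of guarded operations, here is a sketch of a shared object whose operation blocks until its guard holds. The condition-variable machinery and the class itself are inventions for illustration; Orca's runtime additionally replicates the object across machines and keeps the copies consistent.

```python
import threading

# Sketch of an Orca-style shared object: an operation is a (guard, body) pair,
# and the caller is delayed until some guard evaluates to true.

class SharedCounter:
    def __init__(self):
        self.value = 0
        self.cond = threading.Condition()

    def _operation(self, guard, body):
        with self.cond:                       # operations are atomic w.r.t. each other
            while not guard():                # the invoking process is delayed...
                self.cond.wait()              # ...until a guard becomes true
            result = body()
            self.cond.notify_all()            # state changed: re-evaluate waiting guards
            return result

    def increment(self):
        # Empty guard (always true): just update the state.
        return self._operation(lambda: True,
                               lambda: setattr(self, "value", self.value + 1))

    def await_at_least(self, n):
        # Guarded operation: blocks until value >= n, then reads it.
        return self._operation(lambda: self.value >= n, lambda: self.value)

c = SharedCounter()
t = threading.Thread(target=lambda: print("saw", c.await_at_least(3)))
t.start()
for _ in range(3):
    c.increment()
t.join()    # prints "saw 3" once the guard value >= 3 becomes true
```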