Multiprocessors
- Speed of execution is a paramount concern, always. If feasible, the more simultaneous execution that can be done on multiple computers, the better.
- This is easier to do in the server and embedded processor markets, where the applications and algorithms exhibit natural parallelism; it is less so in the desktop market.

Multiprocessors and Thread-Level Parallelism
- Chapter 6 delves deeply into the issues surrounding multiprocessors.
- Thread-level parallelism is a necessary adjunct to the study of multiprocessors.

Outline
- Intro to problems in parallel processing
- Taxonomy
- MIMDs
- Communication
- Shared-Memory Multiprocessors
  - Multicache coherence
  - Implementation
  - Performance

Outline - continued
- Distributed-Memory Multiprocessors
  - Coherence protocols
  - Performance
- Synchronization
  - Atomic operations, spin locks, barriers
- Thread-Level Parallelism

Low-Level Issues in Parallel Processing
- Consider the following generic code:

    y = x + 3
    z = 2*x + y
    w = w*w
    Lock file M
    Read file M

- Naively splitting up the code between two processors leads to big problems.

Low-Level Issues in Parallel Processing - continued

    Processor A      Processor B
    y = x + 3        z = 2*x + y
    w = w*w          Lock file M
    Read file M

- Problems: the commands must be executed so as not to violate the original sequential nature of the algorithm (B's z = 2*x + y needs the y computed by A), a processor has to wait on a file (A cannot read file M while B holds the lock), etc.

Low-Level Issues in Parallel Processing - continued
- This was a grossly bad example, of course, but the underlying issues appear in good multiprocessing applications.
- Two key issues are:
  - Shared memory (shared variables)
  - Interprocessor communication (e.g. current shared-variable updates, file locks)

Computation/Communication
- "A key characteristic in determining the performance of parallel programs is the ratio of computation to communication." (bottom of page 546)
- "Communication is the costly part of parallel computing" ... and also the slow part.
- A table on page 547 shows this ratio for some DSP calculations, which normally have a good ratio.

Computation/Communication - best and worst cases
- Problem: add 6 to each component of vector x[n]. Three processors: A, B, and C.
- Best case: give A the first n/3 components, B the next n/3, and C the last n/3. One message at the beginning, and the results passed back in one message at the end.
- Computation/Communication ratio = n/2

Computation/Communication - best and worst cases
- Worst case: have processor A add 1 to x[k], pass it to B, which adds 2 and passes it to C to add 3. That is two messages per effective computation.
- Computation/Communication ratio = n/(2n) = 1/2
- Of course this is terrible coding, but it makes the point. Real examples are found on page 547.
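To make the best case concrete, here is a minimal sketch in C with pthreads (my rendering, not the text's code; N, NPROC, and all names are illustrative). Each worker owns one contiguous n/3 block, so the only communication is handing out the blocks at thread creation and the single join at the end: the n computations against two "messages" that give the n/2 ratio.

```c
/* Best-case partitioning sketch: add 6 to each element of x[n]
   with one worker per block.  Illustrative only. */
#include <pthread.h>
#include <stdio.h>

#define N     12      /* vector length (illustrative) */
#define NPROC 3       /* the three "processors" A, B, C */

static double x[N];

struct block { int lo, hi; };           /* half-open range [lo, hi) */

static void *add6(void *arg) {
    struct block *b = (struct block *)arg;
    for (int i = b->lo; i < b->hi; i++)
        x[i] += 6.0;                    /* pure computation, no sharing */
    return NULL;
}

int main(void) {
    pthread_t tid[NPROC];
    struct block blk[NPROC];
    for (int i = 0; i < N; i++) x[i] = i;

    /* the one "message" out: hand each processor its n/3 block */
    for (int p = 0; p < NPROC; p++) {
        blk[p].lo = p * N / NPROC;
        blk[p].hi = (p + 1) * N / NPROC;
        pthread_create(&tid[p], NULL, add6, &blk[p]);
    }
    /* the one "message" back: collect each block's result */
    for (int p = 0; p < NPROC; p++)
        pthread_join(tid[p], NULL);

    for (int i = 0; i < N; i++) printf("%g ", x[i]);
    printf("\n");
    return 0;
}
```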
Taxonomy
- SISD - single instruction stream, single data stream (uniprocessors)
- SIMD - single instruction stream, multiple data streams (vector processors)
- MISD - multiple instruction streams, single data stream (no commercial processors of this type have been built to date)
- MIMD - multiple instruction streams, multiple data streams

MIMDs
- MIMDs have emerged as the architecture of choice for general-purpose multiprocessors.
- They are often built with off-the-shelf microprocessors.
- Flexible designs are possible.

Two Classes of MIMD
- Two basic structures will be studied:
  - Centralized shared-memory multiprocessors
  - Distributed-memory multiprocessors

Why Focus on Memory?
- Communication or data sharing can be done at several levels in our basic structure.
- Sharing disks is no problem, and sharing cache between processors is probably not feasible.
- Hence our main distinction is whether or not to share memory.

Centralized Shared-Memory Multiprocessors
- Main memory is shared. This has many advantages - much faster message passing!
- It also forces many issues to be dealt with:
  - Block write contention
  - Coherent memory

Distributed-Memory Multiprocessors
- Each processor has its own memory.
- An interconnection network aids the message passing.

Communication
- Algorithms or applications that can be parsed completely into independent streams of computation are very rare.
- Usually, in order to divide an application among n processors, a great deal of inter-processor information must be communicated.
- Examples: which data a processor is working on, how far it has processed the data it is working on, computed values that are needed by another processor, etc.
- Message passing, shared memory, and RPCs are all methods of communication for multiprocessors.

The Two Biggest Challenges in Using Multiprocessors (pages 537 and 540)
- Insufficient parallelism (in the algorithms or code)
- Long-latency remote communication
- "Much of this chapter focuses on techniques for reducing the impact of long remote communication latency." (page 540, 2nd paragraph)

Advantages of Different Communication Mechanisms
- Since this is a key distinction, both in terms of system performance and cost, you should be aware of the comparative advantages.
- Know the issues on pages 535-6.

SMPs - Shared-Memory Multiprocessors
- Centralized shared-memory multiprocessors are usually just called SMPs.
- We now look at the coherent-memory problem.

Multiprocessor Cache Coherence - the key problem

    Time | Event               | Cache for A | Cache for B | Memory contents for X
      0  |                     |             |             | 1
      1  | CPU A reads X       | 1           |             | 1
      2  | CPU B reads X       | 1           | 1           | 1
      3  | CPU A stores 0 in X | 0           | 1           | 0

  (Write-through caches are assumed here, which is why memory holds 0 after the store.)
- The problem is that CPU B is still using a value of X = 1, whereas A is not. Obviously we can't allow this ... but how do we stop it?

Basic Schemes for Enforcing Coherence - Section 6.3
- Look over the definitions of coherence and consistency (page 550).
- Coherence protocols (page 552): directory based and snooping.
- We concentrate on snooping with invalidation, implemented with a write-back cache.
- Understand the basics in figures 6.8 and 6.9.
- Study the finite-state transition diagram on page 557.

A Cache Coherence Protocol

Performance of Symmetric Shared-Memory Multiprocessors
- Comments: this is not an easy topic, and definitions can vary, just as in the single-processor case.
- Results of studies are given in section 6.4.
- Review the specialized definitions on page 561 first:
  - Coherence misses
  - True sharing misses
  - False sharing misses

Example: CPU execution on a four-processor system
- Study figure 6.13 (page 563) and the accompanying explanation.

What is considered in CPU time measurements
- Note that these benchmarks include substantial I/O time, which is ignored in the CPU time measurements.
- Of course the cache access time is included in the CPU time measurements, since a process will not be switched out on a cache access, as opposed to a memory miss or I/O request.
- L2 hits, L3 hits, and pipeline stalls add time to the execution; these are shown graphically.

Commercial Workload Performance

OLTP Performance and L3 Caches
- Online transaction processing workloads (part of the commercial benchmark) demand a lot from memory systems.
- This graph focuses on the impact of L3 cache size.

Memory Access Cycles vs. Processor Count
- Note the increase in memory access cycles as the processor count increases.
- This is mainly due to true and false sharing misses, which increase as the processor count increases.

Distributed Shared-Memory Architectures
- Coherence is again an issue.
- Study pages 576-7, where some of the disadvantages of leaving cache coherence out of the hardware are discussed.

Directory-Based Cache Coherence Protocols
- Just as with a snooping protocol, there are two primary operations that a directory protocol must handle: read misses and writes to shared, clean blocks.
- Basics: a directory is added to each node.

Directory Protocols
- We won't spend as much time in class on these, but look over the state transition diagrams and browse the performance section.

Synchronization
- The key ability needed to synchronize in a multiprocessor setup: the ability to atomically read and modify a memory location.
- That means no other process can context-switch in and modify the memory location after our process reads it and before our process modifies it.

Synchronization
- "These hardware primitives are the basic building blocks that are used to build a wide variety of user-level synchronization operations, including locks and barriers." (page 591)
- Examples of these atomic operations are given on pages 591-3, in both code and text form.
- Read over and understand both the spin lock and barrier concepts (minimal sketches follow below). Problems on the next exam may well include one of these.
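First the spin lock: a minimal sketch, assuming C11's atomic test-and-set stands in for the atomic exchange / load-linked & store-conditional primitives the text builds locks from (this is a rendering of the idea, not the book's code).

```c
/* Minimal spin-lock sketch using C11 atomics. */
#include <stdatomic.h>

static atomic_flag lk = ATOMIC_FLAG_INIT;

void acquire(void) {
    /* atomically set the flag and get its old value; if the old
       value was already set, another processor holds the lock */
    while (atomic_flag_test_and_set_explicit(&lk, memory_order_acquire))
        ;  /* spin */
}

void release(void) {
    atomic_flag_clear_explicit(&lk, memory_order_release);
}
```

The test-and-set returns the flag's previous value, so a processor loops only while some other processor already holds the lock.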
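And a sense-reversing barrier in the same spirit (again a sketch; NPROC and the variable names are assumptions, and the text's version is written with the atomic primitives of pages 591-3).

```c
/* Minimal sense-reversing barrier sketch for NPROC processors. */
#include <stdatomic.h>

#define NPROC 4

static atomic_int count = 0;               /* how many have arrived   */
static atomic_int sense = 0;               /* global release flag     */
static _Thread_local int local_sense = 0;  /* flips on every crossing */

void barrier(void) {
    local_sense = !local_sense;
    if (atomic_fetch_add(&count, 1) + 1 == NPROC) {
        atomic_store(&count, 0);           /* last one in: reset ...  */
        atomic_store(&sense, local_sense); /* ... and release the rest */
    } else {
        while (atomic_load(&sense) != local_sense)
            ;  /* spin until the last processor arrives */
    }
}
```

The last processor to arrive resets the count and flips the global sense; everyone else spins until the flip, which keeps a fast processor from racing into the next barrier before the slow ones have left this one.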
Synchronization Examples
- Check out the examples on pages 596 and 603-4. They bring out key points in the operation of multiprocessor synchronization that you need to know.

Threads
- Threads are "lightweight processes".
- Thread switches are much faster than process or context switches.
- For this study (page 608), a thread is: thread = { copy of registers, separate PC, separate page table }.

Threads and SMT
- SMT (simultaneous multithreading) exploits TLP (thread-level parallelism) at the same time it exploits ILP (instruction-level parallelism).
- And why is SMT good? It turns out that most modern multiple-issue processors have more functional-unit parallelism available than a single thread can effectively use. (See section 3.6 for more; basically, multiple-issue processors allow multiple instructions to issue in a single clock cycle. Superscalar and VLIW are two basic flavors, but more on that later in the course.)
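As a closing sketch (mine, not the text's): two threads whose instruction streams are completely independent, one integer-heavy and one floating-point-heavy, which is exactly the extra functional-unit-filling work an SMT processor can issue alongside each thread's own ILP.

```c
/* Two independent instruction streams: the thread-level parallelism
   an SMT processor can draw on while it also exploits ILP inside
   each stream.  Purely illustrative. */
#include <pthread.h>
#include <stdio.h>

static void *int_stream(void *arg) {      /* integer-heavy work */
    (void)arg;
    long sum = 0;
    for (long i = 0; i < 1000000; i++) sum += i;
    printf("integer stream: %ld\n", sum);
    return NULL;
}

static void *fp_stream(void *arg) {       /* floating-point-heavy work */
    (void)arg;
    double prod = 1.0;
    for (int i = 0; i < 50; i++) prod *= 1.01;
    printf("fp stream: %f\n", prod);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, int_stream, NULL);
    pthread_create(&b, NULL, fp_stream, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```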