CS 252 Graduate Computer Architecture
Lecture 11: Multiprocessors-II
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs252
10/16/2007

Recap: Sequential Consistency, A Memory Model

"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program" -- Leslie Lamport

Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

Recap: Sequential Consistency

Sequential consistency imposes more memory-ordering constraints than those imposed by uniprocessor program dependencies. What are the additional SC requirements in our example?

  T1: Store (X), 1     (X = 1)
      Store (Y), 11    (Y = 11)

  T2: Load R1, (Y)
      Store (Y'), R1   (Y' = Y)
      Load R2, (X)
      Store (X'), R2   (X' = X)

Recap: Mutual Exclusion and Locks

We want to guarantee that only one process is active in a critical section:
• Blocking atomic read-modify-write instructions, e.g. Test&Set, Fetch&Add, Swap
vs
• Non-blocking atomic read-modify-write instructions, e.g. Compare&Swap, Load-reserve/Store-conditional
vs
• Protocols based on ordinary Loads and Stores

Issues in Implementing Sequential Consistency

Implementation of SC is complicated by two issues:
• Out-of-order execution capability. On a uniprocessor, which pairs of accesses may be reordered?
    Load(a); Load(b)     yes
    Load(a); Store(b)    yes, if a ≠ b
    Store(a); Load(b)    yes, if a ≠ b
    Store(a); Store(b)   yes, if a ≠ b
• Caches. Caches can prevent the effect of a store from being seen by other processors.

These SC complications motivate architects to consider weak or relaxed memory models.

Memory Fences: Instructions to Serialize Memory Accesses

Processors with relaxed or weak memory models (i.e., models that permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses.

Examples of processors with relaxed memory models:
• Sparc V8 (TSO, PSO): Membar
• Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
• PowerPC (WO): Sync, EIEIO

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.

Using Memory Fences

Producer posting item x:
  Load Rtail, (tail)
  Store (Rtail), x
  MembarSS            -- ensures that the tail pointer is not updated before x has been stored
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
  Load Rhead, (head)
  spin:
  Load Rtail, (tail)
  if Rhead == Rtail goto spin
  MembarLL            -- ensures that R is not loaded before x has been stored
  Load R, (Rhead)
  Rhead = Rhead + 1
  Store (head), Rhead
  process(R)

Data-Race Free Programs, a.k.a. Properly Synchronized Programs

Process 1                    Process 2
  ...                          ...
  Acquire(mutex);              Acquire(mutex);
  <critical section>           <critical section>
  Release(mutex);              Release(mutex);

• Synchronization variables (e.g. mutex) are disjoint from data variables
• Accesses to writable shared data variables are protected in critical regions ⇒ no data races except for locks (a formal definition is elusive)
In general, it cannot be proven whether a program is data-race free.

Fences in Data-Race Free Programs

Process 1                    Process 2
  ...                          ...
  Acquire(mutex);              Acquire(mutex);
  membar;                      membar;
  <critical section>           <critical section>
  membar;                      membar;
  Release(mutex);              Release(mutex);

• A relaxed memory model allows reordering of instructions by the compiler or the processor, as long as the reordering is not done across a fence
• The processor also should not speculate or prefetch across fences

Mutual Exclusion Using Load/Store

A protocol based on two shared variables, c1 and c2. Initially, both c1 and c2 are 0 (not busy).

Process 1
  ...
  c1 = 1;
  L: if c2 == 1 then goto L;
  <critical section>
  c1 = 0;

Process 2
  ...
  c2 = 1;
  L: if c1 == 1 then goto L;
  <critical section>
  c2 = 0;

What is wrong?

Mutual Exclusion: Second Attempt

To avoid deadlock, let a process give up its reservation (i.e., Process 1 sets c1 back to 0) while waiting.

Process 1
  ...
  L: c1 = 1;
  if c2 == 1 then { c1 = 0; goto L; }
  <critical section>
  c1 = 0;

Process 2
  ...
  L: c2 = 1;
  if c1 == 1 then { c2 = 0; goto L; }
  <critical section>
  c2 = 0;

What can go wrong now?
• Deadlock is not possible, but with a low probability a livelock may occur
• An unlucky process may never get to enter the critical section ⇒ starvation

A Protocol for Mutual Exclusion (T. Dekker, 1966)

A protocol based on three shared variables: c1, c2, and turn. Initially, both c1 and c2 are 0 (not busy).

Process 1
  ...
  c1 = 1;
  turn = 1;
  L: if c2 == 1 && turn == 1 then goto L;
  <critical section>
  c1 = 0;

Process 2
  ...
  c2 = 1;
  turn = 2;
  L: if c1 == 1 && turn == 2 then goto L;
  <critical section>
  c2 = 0;

• turn = i ensures that only process i can wait
• variables c1 and c2 ensure mutual exclusion

The solution for n processes was given by Dijkstra and is quite tricky!

Analysis of Dekker's Algorithm

Scenario 1 and Scenario 2 step through the two processes' code above under different interleavings of the accesses to c1, c2, and turn.

N-process Mutual Exclusion: Lamport's Bakery Algorithm

Process i. Initially num[j] = 0, for all j.

Entry code:
  choosing[i] = 1;
  num[i] = max(num[0], ..., num[N-1]) + 1;
  choosing[i] = 0;
  for (j = 0; j < N; j++) {
    while (choosing[j]);
    while (num[j] && ((num[j] < num[i]) || (num[j] == num[i] && j < i)));
  }

Exit code:
  num[i] = 0;

CS252 Administrivia

• Project meetings next week (10/23-25), same schedule as before (M 1-3PM, Tu/Th 9:40-11AM)
  – Schedule on website
  – All in 645 Soda, 20 mins/group
• Hope to see:
  – Project web site
  – At least one initial result (some delta from "hello world")
  – Grasp of related work
• Midterm review

CS252 Midterm 1 Results

[Score-distribution histograms for Problem 1 (22 points), Problem 2 (19 points), Problem 3 (19 points), and Problem 4 (20 points) omitted.]

Overall (80 points total): average 39.8, median 40.5.

EECS Graduate Grading Guidelines

• A+, A, A-: quality expected from a PhD student
• B+, B: quality expected from an MS student, not a PhD student
• B- and below: below the quality expected from an MS student
• Class average somewhere in the range 3.2 - 3.6
http://www.eecs.berkeley.edu/Policies/grad.grading.shtml

Memory Consistency in SMPs

CPU-1 and CPU-2 each hold a cached copy of A = 100, and memory also holds A = 100. Suppose CPU-1 updates A to 200.
• write-back: memory and cache-2 have stale values
• write-through: cache-2 has a stale value
Do these stale values matter?
What is the view of shared memory for programming?

Write-back Caches & SC

Program T1 (ST X, 1; ST Y, 11) runs on the processor with cache-1; program T2 (LD Y, R1; ST Y', R1; LD X, R2; ST X', R2) runs on the processor with cache-2. Initially memory holds X = 0, Y = 10.
• T1 is executed: cache-1 holds X = 1, Y = 11, while memory still holds X = 0, Y = 10
• cache-1 writes back Y: memory now holds Y = 11 but still X = 0
• T2 is executed: cache-2 loads Y = 11 and X = 0, so it stores Y' = 11 and X' = 0
• cache-1 writes back X; cache-2 writes back X' & Y'
The result Y' = 11 with X' = 0 is not produced by any sequentially consistent interleaving: if T2 saw Y = 11, it should also have seen X = 1.

Write-through Caches & SC

With the same programs, cache-2 initially holds a copy of X = 0.
• T1 executed: the writes go through to memory (X = 1, Y = 11), but cache-2 still holds the stale X = 0
• T2 executed: cache-2 loads Y = 11 from memory, yet its load of X hits the stale cached copy, so it stores Y' = 11 and X' = 0
Write-through caches don't preserve sequential consistency either.

Maintaining Sequential Consistency

SC is sufficient for correct producer-consumer and mutual exclusion code (e.g., Dekker). Multiple copies of a location in various caches can cause SC to break down. Hardware support is required such that:
• only one processor at a time has write permission for a location
• no processor can load a stale copy of the location after a write
⇒ cache coherence protocols

Cache Coherence Protocols for SC

• write request: the address is invalidated (updated) in all other caches before (after) the write is performed
• read request: if a dirty copy is found in some cache, a write-back is performed before the memory is read
We will focus on Invalidation protocols, as opposed to Update protocols.

Warmup: Parallel I/O

The processor's cache and a DMA engine (serving the disk) both sit on the memory bus in front of physical memory. Page transfers occur while the processor is running; either the cache or the DMA engine can be the bus master and effect transfers. (DMA stands for Direct Memory Access.)

Problems with Parallel I/O

• Memory → Disk: physical memory may be stale if the cache copy is dirty
• Disk → Memory: the cache may hold stale data and not see the memory writes

Snoopy Cache (Goodman 1983)

• Idea: have the cache watch (or snoop upon) DMA transfers, and then "do the right thing"
• Snoopy cache tags are dual-ported: one port is used to drive the memory bus when the cache is bus master, and a snoopy read port is attached to the memory bus

Snoopy Cache Actions for DMA

  Observed Bus Cycle           Cache State          Cache Action
  DMA Read (Memory → Disk)     Address not cached   No action
                               Cached, unmodified   No action
                               Cached, modified     Cache intervenes
  DMA Write (Disk → Memory)    Address not cached   No action
                               Cached, unmodified   Cache purges its copy
                               Cached, modified     ???

Shared Memory Multiprocessor

Processors M1, M2, and M3, each with a snoopy cache, share a memory bus with physical memory and DMA-attached disks. Use the snoopy mechanism to keep all processors' view of memory coherent.

Cache State Transition Diagram: the MSI Protocol

Each cache line has state bits in its tag: M (Modified), S (Shared), I (Invalid). Transitions for the line in processor P1:
• I → M on a write miss; I → S on a read miss
• M → M on P1 reads or writes
• M → S when another processor reads (P1 writes back)
• S → S on a read by any processor
• M → I and S → I on another processor's intent to write

Two Processor Example (reading and writing the same cache line)

The sequence P1 reads, P1 writes, P2 reads, P2 writes, P1 reads, P1 writes, P2 writes drives both caches around the MSI diagram: for instance, "P2 reads, P1 writes back" takes P1's copy from M to S, and "P2 intent to write" then takes it to I while P2's copy moves to M.

Observation

• If a line is in the M state, then no other cache can have a copy of the line!
  – Memory stays coherent; multiple differing copies cannot exist

MESI: An Enhanced MSI Protocol (increased performance for private data)

Each cache line has a tag with state bits: M (Modified Exclusive), E (Exclusive, unmodified), S (Shared), I (Invalid). Transitions for the line in processor P1:
• I → E on a read miss when the line is not shared; I → S on a read miss when it is shared; I → M on a write miss
• E → E on a P1 read; E → M on a P1 write (no bus transaction needed, since the line is exclusive)
• M → M on a P1 write or read; M → S when another processor reads (P1 writes back)
• S → S on a read by any processor
• E → I, S → I (and M → I) on another processor's intent to write

Optimized Snoop with Level-2 Caches

Each CPU has an L1 cache backed by an L2 cache whose snooper sits on the memory bus.
• Processors often have two-level caches: small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must be in L2, so an invalidation in L2 forces an invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?

Intervention

Suppose cache-1 holds A = 200 (modified) while memory still holds the stale value A = 100. When a read miss for A occurs in cache-2, a read request for A is placed on the bus.
• Cache-1 needs to supply the data and change its state to Shared
• The memory may respond to the request also! Does memory know it has stale data?
Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2.

False Sharing

A cache block contains more than one word:

  state | blk addr | data0 | data1 | ... | dataN

Cache coherence is done at the block level and not the word level. Suppose M1 writes word_i and M2 writes word_k, and both words have the same block address. What can happen?
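The MSI transitions above can be exercised with a small simulation. The sketch below is not from the lecture: the names (MSICoherence, read, write, coherent) are invented, and bus requests are abstracted into direct state updates. It models only the per-line state of each cache and checks the single-writer invariant from the Observation slide.

```python
# Minimal model of the MSI states for one cache line across N caches.
# Sketch only: names are invented; bus traffic is abstracted away.
M, S, I = "M", "S", "I"

class MSICoherence:
    def __init__(self, n_caches):
        self.state = [I] * n_caches        # per-cache state for one line

    def read(self, p):
        if self.state[p] == I:             # read miss goes on the bus
            for q, st in enumerate(self.state):
                if q != p and st == M:
                    self.state[q] = S      # other processor reads => write back, M -> S
            self.state[p] = S              # I -> S on a read miss
        # reads that hit in M or S cause no transition

    def write(self, p):
        if self.state[p] != M:             # write miss / intent-to-write on the bus
            for q in range(len(self.state)):
                if q != p:
                    self.state[q] = I      # other processor intent to write => I
            self.state[p] = M

    def coherent(self):
        # Observation: a line in M means no other cache holds a copy.
        if M in self.state:
            return sum(st != I for st in self.state) == 1
        return True

line = MSICoherence(2)
line.read(0)                               # P1 read miss:  I -> S
line.write(0)                              # P1 write:      S -> M, others invalidated
line.read(1)                               # P2 read:       P1 writes back, both end in S
assert line.state == [S, S]
line.write(1)                              # P2 write:      P1's copy invalidated
assert line.state == [I, M] and line.coherent()
```

False sharing shows up directly in this model: two processors writing different words of the same block call write() on the same line object, so the block ping-pongs between their caches even though no word is actually shared.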
Synchronization and Caches: Performance Issues

Processors 1, 2, and 3 each run the same spin-lock code:

  R ← 1;
  L: swap (mutex), R;
  if <R> then goto L;
  <critical section>
  M[mutex] ← 0;

Cache-coherence protocols will cause mutex to ping-pong between P1's and P2's caches. Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero.

Performance Related to Bus Occupancy

In general, a read-modify-write instruction requires two memory (bus) operations without intervening memory operations by other processors. In a multiprocessor setting, the bus needs to be locked for the entire duration of the atomic read and write operation:
• expensive for simple buses
• very expensive for split-transaction buses
⇒ modern ISAs use load-reserve/store-conditional

Load-reserve & Store-conditional

Special register(s) hold the reservation flag and address, and the outcome of store-conditional:

  Load-reserve R, (a):
    <flag, adr> ← <1, a>;
    R ← M[a];

  Store-conditional (a), R:
    if <flag, adr> == <1, a> then
      cancel other procs' reservation on a;
      M[a] ← <R>;
      status ← succeed;
    else
      status ← fail;

If the snooper sees a store transaction to the address in the reserve register, the reservation flag is set to 0.
• Several processors may reserve 'a' simultaneously
• These instructions are like ordinary loads and stores with respect to the bus traffic
• The reservation can be implemented using cache hit/miss, so no additional hardware is required (problems?)
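The load-reserve/store-conditional semantics above can be modeled in software. This is only a sketch with invented names (Word, load_reserve, store_conditional, plain_store); real LR/SC is a pair of ISA instructions, and the snooper clearing the reservation on an observed store is played here by plain_store.

```python
# Software model of load-reserve / store-conditional for one memory word.
# Sketch only: all names are invented; bus behavior is abstracted away.
class Word:
    def __init__(self, value=0):
        self.value = value
        self.reservations = set()          # processor ids that have reserved

    def load_reserve(self, proc):
        self.reservations.add(proc)        # <flag, adr> <- <1, a>; several procs may reserve
        return self.value                  # R <- M[a]

    def store_conditional(self, proc, value):
        if proc not in self.reservations:
            return False                   # status <- fail
        self.reservations.clear()          # cancel other procs' reservations on a
        self.value = value                 # M[a] <- <R>
        return True                        # status <- succeed

    def plain_store(self, value):
        # What the snooper does on an ordinary store to the reserved address.
        self.reservations.clear()
        self.value = value

a = Word(5)
r = a.load_reserve(proc=1)                 # P1 reserves and reads 5
a.plain_store(7)                           # another processor stores: reservation lost
assert a.store_conditional(proc=1, value=r + 1) is False
r = a.load_reserve(proc=1)                 # retry: reserve again, read 7
assert a.store_conditional(proc=1, value=r + 1) is True
assert a.value == 8
```

A lock-free fetch-and-add is just this pattern in a retry loop: load-reserve, compute, store-conditional, and start over on failure.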
Performance: Load-reserve & Store-conditional

The total number of memory (bus) transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:
• increases bus utilization (and reduces processor stall time), especially on split-transaction buses
• reduces the cache ping-pong effect, because processors trying to acquire a semaphore do not have to perform a store each time

Out-of-Order Loads/Stores & CC

The CPU's load/store buffers feed a cache (line states I/S/E) whose snooper handles Wb-req, Inv-req, and Inv-rep on the bus; pushouts (Wb-rep) and S-req/E-req with their S-rep/E-rep responses connect the cache to memory.
• Blocking caches: one request at a time; together with CC ⇒ SC at the CPU/memory interface
• Non-blocking caches: multiple requests (to different addresses) concurrently; together with CC ⇒ relaxed memory models
CC ensures that all processors observe the same order of loads and stores to an address.
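The earlier ping-pong reduction, spinning on ordinary loads of the mutex and swapping only when it looks free (test-and-test-and-set), can be sketched with threads. All names here are invented, and Python's threading.Lock stands in for the bus-level atomic swap, so this models the code structure of the optimization rather than the actual cache traffic.

```python
import threading

# Test-and-test-and-set spin lock (sketch; names invented).
class SpinLock:
    def __init__(self):
        self.word = 0                       # 0 = free, 1 = held
        self._atomic = threading.Lock()     # models atomicity of the swap only

    def _swap_in_1(self):
        with self._atomic:                  # atomic read-modify-write
            old, self.word = self.word, 1
            return old

    def acquire(self):
        while True:
            while self.word != 0:           # spin on ordinary loads (cache hits)
                pass
            if self._swap_in_1() == 0:      # swap only when the lock looked free
                return                      # old value was 0: we got the lock

    def release(self):
        self.word = 0                       # M[mutex] <- 0

counter = 0
lock = SpinLock()

def worker():
    global counter
    for _ in range(500):
        lock.acquire()
        counter += 1                        # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 1000                      # no increments lost
```

The inner while loop is the point: while the lock is held, waiting processors hit in their own caches instead of issuing bus-locking swaps, which is exactly the occupancy problem the slides attribute to naive read-modify-write spinning.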