Programming Multi-Core Processors based Embedded Systems
A Hands-On Experience on Cavium Octeon based Platforms
Lecture 3: Complexities of Parallelism

Course Outline
- Introduction
- Multi-threading on multi-core processors
- Multi-core applications and their complexities
  - Multi-core parallel applications
  - Complexities of multi-threading and parallelism
- Application layer computing on multi-core
- Performance measurement and tuning

Agenda for Today
- Multi-core parallel applications space
  - Scientific/engineering applications
  - Commercial applications
- Complexities due to parallelism
  - Threading related issues
  - Memory consistency and cache coherence
  - Synchronization

Parallel Applications
- Science/engineering, general-purpose, and desktop applications
- Reference: David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998

Parallel Application Trends
- There is an ever-increasing demand for high-performance computing in a number of application areas
- Scientific and engineering applications:
  - Computational fluid dynamics
  - Weather modeling
  - Numerous applications from physics, chemistry, biology, etc.
- General-purpose computing applications:
  - Video encoding/decoding, graphics, games
  - Database management
  - Networking applications

Application Trends (2)
- Demand for cycles fuels advances in hardware, and vice versa; this cycle drives the exponential increase in microprocessor performance
- The most demanding applications push parallel architecture hardest
- A range of system performance is needed, with progressively increasing cost (the "platform pyramid")
- Goal of applications in using multi-core machines: speedup
    Speedup(p cores) = Performance(p cores) / Performance(1 core)
- For a fixed problem size (input data set), performance = 1/time:
    Speedup_fixed(p cores) = Time(1 core) / Time(p cores)

Scientific Computing Demand
(figure: compute demands of representative scientific applications)

Engineering Application Demands
- Large parallel machines are a mainstay in many industries:
  - Petroleum (reservoir analysis)
  - Automotive (crash simulation, drag analysis, combustion efficiency)
  - Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
  - Computer-aided design
  - Pharmaceuticals (molecular modeling)
- Visualization in all of the above, plus:
  - Entertainment (films like Toy Story)
  - Architecture (walk-throughs and rendering)

Application Trends Example: ASCI
- Accelerated Strategic Computing Initiative (ASCI): a US DoE program that proposed the use of high-performance computing for 3D modeling and simulation
- Promised five orders of magnitude greater computing power in 8 years (1996 to 2004) than the state of the art (1 GFLOPS to 100 TFLOPS)

Application Trends Example (2)
- Platforms:
  - ASCI Red: 3.1 TFLOPS peak performance; developed by Intel with 4,510 nodes
  - ASCI Blue Mountain: 3 TFLOPS peak performance; developed by SGI with 48 128-node Origin2000s
  - ASCI White: 12 TFLOPS peak performance; developed by IBM as a cluster of SMPs

Commercial Applications
- Databases, online transaction processing, decision support, data mining
- Also rely on parallelism at the high end
- Scale not as large, but use much more widespread
- High performance means performing more work (transactions) in a fixed time

Commercial Applications (2)
- TPC benchmarks (TPC-C order entry, TPC-D decision support)
  - Explicit scaling criteria provided: the size of the enterprise scales with the size of the system
  - Problem size is no longer fixed as p increases, so throughput is used as the performance measure (transactions per minute, or tpm)
- Desktop applications
  - Video applications
  - Secure computing and web services

Parallel Applications Landscape
- HPCC (science/engineering)
- Data center applications (search, e-commerce, enterprise, SOA)
- Desktop applications (WWW browser, office, multimedia applications)
- Embedded applications (wireless and mobile devices, PDAs, consumer electronics)

Summary of Application Trends
- The transition to parallel computing has occurred for scientific and engineering computing
- Rapid progress is underway in commercial computing
  - Desktops also use multithreaded programs, which are a lot like parallel programs
  - Demand for improving throughput on sequential workloads
  - Greatest use in small-scale multiprocessors, which currently employ multi-core processors
- Solid application demand exists and will increase

Solutions to Common Parallel Programming Problems using Multiple Threads
Chapter 7, Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006

Common Problems
- Too many threads
- Data races, deadlocks, and livelocks
- Heavily contended locks
- Non-blocking algorithms
- Thread-safe functions and libraries
- Memory issues
  - Cache related issues
  - Pipeline stalls
  - Data organization

Too Many Threads
- "If a little threading is good, a lot must be great" is not always true: excessive threading can degrade performance
- Two kinds of impact from excess threads:
  - Too little work per thread: the overhead of starting and maintaining threads dominates, and too fine a granularity of work hides any performance benefit
  - Excessive contention for hardware resources: the OS uses time-slicing for fair scheduling, which may cause excessive context-switching overhead and thrashing at the virtual memory level

Data Races, Deadlocks, and Livelocks
- Race condition
  - Due to unsynchronized accesses to shared data
  - Program results are non-deterministic
  - Can be handled through locking
- Deadlock
  - Depends on the relative timing of threads
  - A problem due to incorrect locking
  - Results from a cyclic dependence that stops forward progress by threads
- Livelock
  - Threads continuously conflict with each other and back off
  - No thread makes any progress
  - Solution: back off and release acquired locks, so that at least one thread can make progress

Races among Unsynchronized Threads
(figure: possible interleavings of two unsynchronized threads)

Race Conditions Hiding Behind Language Syntax
(figure: a single source statement compiles to several machine operations that can interleave)

A Higher-Level Race Condition Example
- Race conditions are possible even with synchronization, if the synchronization is at too low a level; a higher level may still have data races
- Example: each key should occur only once in a list
  - Individual list operations have locks
  - Problem: two threads may simultaneously find that a key does not exist and then insert the same key, one after the other
  - Solution: lock not just the individual list operations but also the invariant (no repeated keys); see the race sketch below
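As a concrete companion to the race-condition slides, here is a minimal pthreads sketch (ours, not from the slides; the names counter, racy, and locked are illustrative): the unsynchronized increment loses updates, while the mutex version always produces the expected total.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NITERS   1000000

    static long counter = 0;                 /* shared state */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    /* Racy version: counter++ compiles to load/add/store, so two
     * threads can read the same old value and one increment is lost. */
    static void *racy(void *arg) {
        for (int i = 0; i < NITERS; i++)
            counter++;                       /* unsynchronized access */
        return NULL;
    }

    /* Correct version: the lock makes the read-modify-write atomic. */
    static void *locked(void *arg) {
        for (int i = 0; i < NITERS; i++) {
            pthread_mutex_lock(&m);
            counter++;
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, locked /* or racy */, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld (expect %d)\n", counter, NTHREADS * NITERS);
        return 0;
    }

Running the racy variant typically prints a smaller, run-to-run varying total: exactly the non-determinism the slide describes.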
Deadlock Caused by Cycle
(figure: two threads, each holding one lock and requesting the other)

Conditions for a Deadlock
- Deadlock can occur only if all four of the following conditions hold:
  - Access to each resource is exclusive
  - A thread is allowed to hold one resource while requesting another
  - No thread is willing to relinquish a resource that it has acquired
  - There is a cycle of threads trying to acquire resources, where each resource is held by one thread and requested by another

Locks Ordered by their Addresses
- Consistent ordering of lock acquisition prevents deadlock: it breaks the cycle condition, since no thread can hold a "later" lock while waiting for an "earlier" one (sketch below)
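A minimal sketch of address-ordered acquisition (our illustration; lock_pair is a hypothetical helper, not the book's code):

    #include <pthread.h>
    #include <stdint.h>

    /* Acquire two mutexes in a globally consistent order (here: by
     * address). Every thread that needs both locks follows the same
     * order, so no cycle of waiters can form. */
    static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
        if (a == b) {                       /* same lock: take it once */
            pthread_mutex_lock(a);
            return;
        }
        if ((uintptr_t)a > (uintptr_t)b) {  /* sort by address */
            pthread_mutex_t *tmp = a; a = b; b = tmp;
        }
        pthread_mutex_lock(a);              /* lower address first */
        pthread_mutex_lock(b);
    }

Any total order works (lock level, object id, address); the address is simply always available.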
Try and Backoff Logic
- One reason for deadlocks: no thread is willing to give up a resource
- Solution: a thread gives up the resources it holds if it cannot acquire the next one, as in the sketch below
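One possible rendering of try-and-backoff with POSIX trylock (illustrative; lock_both and the brief pause are our choices, not the book's):

    #include <pthread.h>
    #include <unistd.h>

    /* Try-and-backoff: take the first lock, then *try* the second.
     * On failure, release everything and retry, so at least one
     * thread always makes progress and no waiter holds a resource. */
    static void lock_both(pthread_mutex_t *first, pthread_mutex_t *second) {
        for (;;) {
            pthread_mutex_lock(first);
            if (pthread_mutex_trylock(second) == 0)
                return;                      /* got both locks */
            pthread_mutex_unlock(first);     /* give up and back off */
            usleep(1);                       /* brief pause reduces livelock */
        }
    }

Growing the pause (e.g. doubling it) further reduces the chance that two threads retry in lockstep.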
Heavily Contended Locks
- Locks ensure correctness by preventing race conditions and deadlocks
- Performance suffers when locks become heavily contended among threads:
  - Threads try to acquire the lock at a rate faster than the rate at which a thread can execute the corresponding critical section
  - If the lock holder falls asleep (is descheduled), all threads have to wait for it

Priority Inversion Scenario
(figure: a low-priority thread holding a lock blocks a high-priority thread)

Solution: Spreading out Contention
(figure: splitting one heavily contended lock into several independent locks)

Hash Table with Fine-Grained Locking
- A mutex protects each bucket, so threads touching different buckets do not contend (sketch below)
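One plausible shape for such a table (hypothetical layout and names, sketched by us):

    #include <pthread.h>
    #include <string.h>

    #define NBUCKETS 64

    struct node { char key[32]; int value; struct node *next; };

    /* One mutex per bucket: threads that hash to different buckets
     * never contend, spreading out what would be one hot lock. */
    struct hash_table {
        struct node     *bucket[NBUCKETS];
        pthread_mutex_t  lock[NBUCKETS];
    };

    static int lookup(struct hash_table *ht, const char *key, unsigned hash) {
        unsigned b = hash % NBUCKETS;
        int found = -1;
        pthread_mutex_lock(&ht->lock[b]);
        for (struct node *n = ht->bucket[b]; n != NULL; n = n->next)
            if (strcmp(n->key, key) == 0) { found = n->value; break; }
        pthread_mutex_unlock(&ht->lock[b]);
        return found;
    }

Note the connection to the higher-level race slide: an insert-if-absent must hold the bucket lock across both the search and the insertion, not around each separately.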
Non-Blocking Algorithms
- Why not avoid locks altogether? Algorithms that do so are called non-blocking: stopping one thread does not prevent the rest of the system from making progress
- Non-blocking guarantees, weakest to strongest:
  - Obstruction freedom: a thread makes progress as long as there is no contention; livelock is possible, so exponential backoff is used to avoid it
  - Lock freedom: the system as a whole makes progress
  - Wait freedom: every thread makes progress even when faced with contention; practically difficult to achieve
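A small C11 sketch of a lock-free counter (our illustration): retry a compare-and-swap until our update wins. A stalled thread cannot block the others, and some thread always succeeds, which is the lock-freedom guarantee (not wait freedom: an individual thread can retry indefinitely under contention).

    #include <stdatomic.h>

    static atomic_long counter = 0;

    static long lockfree_inc(void) {
        long old = atomic_load(&counter);
        /* On failure, compare_exchange refreshes 'old' with the
         * current value, so the loop simply retries with fresh data. */
        while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
            ;
        return old + 1;
    }

For a plain counter, atomic_fetch_add does this in one step; the explicit CAS loop is shown because it generalizes to arbitrary read-modify-write updates.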
Thread-Safe Functions
- A function is thread-safe when it can be concurrently called on different objects
- The implementer should ensure thread safety of any hidden shared state

Memory Issues
- Speed disparity: processing is fast, memory access is slow, and multiple cores can exacerbate the problem
- Specific memory issues:
  - Bandwidth
  - Working in the cache
  - Memory contention
  - Memory consistency

Bandwidth
(figure)

Working in the Cache
(figure)

Memory Contention
- Types of memory accesses: between a core and main memory, and between two cores
- Two types of data dependences between cores:
  - Read-write dependency: a core writes a cache line and then a different core reads it
  - Write-write dependency: a core writes a cache line and then a different core writes it
- Interactions among cores consume bandwidth
  - They are avoided when multiple cores only read from cache lines
  - They can be reduced by minimizing the shared locations

False Sharing
- Cache blocks may also introduce artifacts: two distinct variables in the same cache block
- Technique: allocate the data used by each processor contiguously, or at least avoid interleaving variables in memory
- Example problem: an array of ints, one element written frequently by each processor (many ints per cache line)

Performance Impact of False Sharing
(figure: slowdown when per-thread counters share a cache line; a padding sketch follows)
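One common fix, sketched with C11 alignment (the 64-byte line size is an assumption; check the target processor, e.g. the Octeon's actual line size):

    #include <stdalign.h>

    #define NTHREADS   4
    #define CACHE_LINE 64   /* assumed line size */

    /* Without padding, four adjacent counters share one cache line and
     * every increment by one core invalidates the other cores' copies.
     * Padding and alignment give each counter its own line. */
    struct padded_counter {
        alignas(CACHE_LINE) long count;
        char pad[CACHE_LINE - sizeof(long)];
    };

    static struct padded_counter hits[NTHREADS];  /* one per thread */

Each thread updates only hits[my_id].count; the layout, not the code, is what removes the coherence traffic.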
What is Memory Consistency?
(figure)

Itanium Architecture
(figure: Itanium memory-ordering example)

Shared Memory without a Lock
(figure: communicating through shared variables without a lock)

Memory Consistency and Cache Coherence
David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998 (advanced topics; can be skipped)

Memory Consistency for Multi-Core Architectures
- The memory consistency issue:
  - Programs are written for a conceptual sequential machine with memory
  - Programs for parallel architectures are written as multiple concurrent instruction streams whose memory accesses may occur in any order, which may result in incorrect computation
- This is a well-known problem that traditional parallel architectures deal with; multi-core architectures inherit the complexity
- Presented in this section for the sake of completeness
  - More relevant for HPCC applications
  - Not as complex for multi-threading, where thread-level solutions exist

Memory Consistency
- Consistency requirement: writes to a location become visible to all processes in the same order
- But when does a write become visible?
- How do we establish order between a write and a read by different processes? Typically through event synchronization, using more than one location

Memory Consistency (2)

    P1                  P2
    /* assume the initial values of A and flag are 0 */
    A = 1;              while (flag == 0);   /* spin idly */
    flag = 1;           print A;

- We sometimes expect memory to respect the order between accesses to different locations issued by a given processor, and to preserve the order among accesses to the same location by different processes
- Coherence doesn't help: it pertains only to a single location
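The same flag idiom in portable C11 atomics (our restatement of the slide's P1/P2 example): the release store and acquire load re-create exactly the cross-location ordering the code implicitly assumes, which plain int accesses do not guarantee on a weakly ordered machine.

    #include <stdatomic.h>

    static int A = 0;
    static atomic_int flag = 0;

    static void producer(void) {             /* P1 */
        A = 1;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    static void consumer(void) {             /* P2 */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                                /* spin idly */
        /* If we saw flag == 1, the acquire/release pair guarantees
         * we also see A == 1. */
    }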
An Example of Orders

    P1                  P2
    /* assume the initial values of A and B are 0 */
    (1a) A = 1;         (2a) print B;
    (1b) B = 2;         (2b) print A;

- We need an ordering model with clear semantics across different locations as well, so that programmers can reason about what results are possible
- This is the memory consistency model

Memory Consistency Model
- Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to one another: which orders are preserved?
- Given a load, it constrains the possible values the load can return
- Without it, we can't tell much about an SAS (shared address space) program's execution

Memory Consistency Model (2)
- Implications for both programmer and system designer:
  - The programmer uses it to reason about correctness and possible results
  - The system designer uses it to decide how much accesses can be reordered by the compiler or hardware
- A contract between programmer and system

Sequential Consistency
(figure: processors P1..Pn issue memory references in program order to a single memory; the "switch" is randomly set after each memory reference, as if there were no caches and a single memory)

Sequential Consistency (2)
- Total order achieved by interleaving accesses from different processes
- Maintains program order, and memory operations from all processes appear to issue, execute, and complete atomically with respect to one another
- The programmer's intuition is maintained
- "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]

What Really is Program Order?
- Intuitively, the order in which operations appear in source code: a straightforward translation of source code to assembly, with at most one memory operation per instruction
- But that is not the same as the order presented to the hardware by the compiler
- So which is program order? It depends on which layer you consider, and who is doing the reasoning
- We assume the order as seen by the programmer

Sequential Consistency: Example

    P1                  P2
    /* assume the initial values of A and B are 0 */
    (1a) A = 1;         (2a) print B;
    (1b) B = 2;         (2b) print A;

- Possible outcomes for (A,B): (0,0), (1,0), (1,2); impossible under SC: (0,2)
  - We know 1a before 1b and 2a before 2b, by program order
  - A = 0 implies 2b before 1a, which implies 2a before 1b
  - B = 2 implies 1b before 2a, which leads to a contradiction
- BUT the actual execution order 1b, 1a, 2b, 2a is SC, despite not being program order: as visible from the results, it appears just like 1a, 1b, 2a, 2b
- The actual execution order 1b, 2a, 2b, 1a is not SC

Implementing SC
- Two kinds of requirements:
  - Program order: memory operations issued by a process must appear to become visible (to others and to itself) in program order
  - Atomicity: in the overall total order, one memory operation should appear to complete with respect to all processes before the next one is issued; needed to guarantee that the total order is consistent across processes; the tricky part is making writes atomic

Write Atomicity
- Write atomicity: the position in the total order at which a write appears to perform should be the same for all processes
- Nothing a process does after it has seen the new value produced by a write W should become visible to other processes until they too have seen W
- In effect, this extends write serialization to writes from multiple processes

Write Atomicity (2)

    P1            P2                  P3
    A = 1;        while (A == 0);     while (B == 0);
                  B = 1;              print A;

- Transitivity implies A should print as 1 under SC
- Problem if P2 leaves its loop and writes B, and P3 sees the new B but an old A (from its cache, say)

Formal Definition of SC
- Each process's program order imposes a partial order on the set of all operations
- Interleaving these partial orders defines a total order on all operations
- Many total orders may be SC (SC does not define a particular interleaving)

Formal Definition of SC (2)
- SC execution: an execution of a program is SC if the results it produces are the same as those produced by some possible total order (interleaving)
- SC system: a system is SC if any possible execution on that system is an SC execution

Sufficient Conditions for SC
- Every process issues memory operations in program order
- After a write operation is issued, the issuing process waits for the write to complete before issuing its next operation
- After a read operation is issued, the issuing process waits for the read to complete, and for the write whose value is returned by the read to complete, before issuing its next operation (this provides write atomicity)

Sufficient Conditions for SC (2)
- These are sufficient, not necessary, conditions
- Clearly, compilers should not reorder for SC, but they do: loop transformations, register allocation (which can eliminate accesses entirely!)
- Even if operations are issued in order, hardware may violate SC for better performance: write buffers, out-of-order execution
- Reason: uniprocessors care only about dependences to the same location, which makes the sufficient conditions very restrictive for performance

Summary of SC Implementation
- Assume for now that the compiler does not reorder
- Hardware needs mechanisms to detect write completion (read completion is easy) and to ensure write atomicity
- For all protocols and implementations, we will see:
  - How they satisfy coherence, particularly write serialization
  - How they satisfy the sufficient conditions for SC (write completion and write atomicity)
  - How they can ensure SC without going through the sufficient conditions
- We will see that a centralized bus interconnect makes this easier
Cache Coherence
- CC for SMP architectures: one memory location may reside in multiple caches
  - Not a problem for read accesses; write accesses drive the coherence requirements
  - On a write, memory need not be updated immediately and computation can continue on the local processor, but cache copies in other processors must be invalidated, and memory must eventually be updated
- Multiple ways to deal with updates:
  - Update memory immediately: write-through caches
  - Update later: write-back caches

Cache Coherence (2)
- CC is a well-known problem for traditional SMP-style multiprocessors, inherited by multi-core processors
- Multiple solutions: it can be resolved in software, but traditionally it is resolved in hardware
- Hardware supports CC protocols: a mechanism to detect coherence-related events, plus mechanisms to keep the caches coherent
- Presented here for the sake of completeness: the programmer does not have to worry about it, but it is a key consideration for a multi-core architecture

SC in Write-through
- Write-through provides SC, not just coherence; extend the arguments used for coherence:
  - Writes and read misses to all locations are serialized by the bus into bus order
  - If a read obtains the value of write W, then W is guaranteed to have completed, since it caused a bus transaction
  - When write W is performed with respect to any processor, all previous writes in bus order have completed

Design Space for Snooping Protocols
- No need to change processor, main memory, or cache
- Extend the cache controller and exploit the bus (which provides serialization)
- Focus on protocols for write-back caches
  - The dirty state now also indicates exclusive ownership
  - Exclusive: the only cache with a valid copy
  - Owner: responsible for supplying the block upon a request for it
- Design space: invalidation- versus update-based protocols; the set of states

Invalidation-based Protocols
- Exclusive means the block can be modified without notifying anyone else, i.e., without a bus transaction
- Must first get the block in exclusive state before writing into it
- Even if the block is already in valid state, a transaction is needed, so this is called a write miss
- A store to non-dirty data generates a read-exclusive bus transaction

Invalidation-based Protocols (2)
- The read-exclusive (RdX) bus transaction:
  - Tells others about the impending write and obtains exclusive ownership
  - Makes the write visible, i.e., the write is performed; it may actually be observed (by a read miss) only later
  - A write hit is made visible (performed) when the block is updated in the writer's cache
- Only one RdX can succeed at a time for a block: serialized by the bus
- Read and read-exclusive bus transactions drive coherence actions; write-back transactions occur too, but they are not caused by memory operations and are quite incidental to the coherence protocol (note: a replaced block that is not in modified state can simply be dropped)

Update-based Protocols
- A write operation updates the values in other caches, via a new update bus transaction
- Advantages:
  - Other processors don't miss on their next access (reduced latency); in invalidation protocols, they would miss and cause more transactions
  - A single bus transaction can update several caches, saving bandwidth; also, only the word written is transferred, not the whole block

Update-based Protocols (2)
- Disadvantages:
  - Multiple writes by the same processor cause multiple update transactions; in invalidation, the first write gets exclusive ownership and the rest stay local
- The detailed tradeoffs are more complex

Invalidate versus Update
- The basic question is one of program behavior: is a block written by one processor read by others before it is rewritten?
  - Invalidation: if yes, readers will take a miss; if no, multiple writes proceed without additional traffic, and invalidation clears out copies that won't be used again
  - Update: if yes, readers will not miss if they previously had a copy, and a single bus transaction updates all copies; if no, there are multiple useless updates, even to dead copies
- Invalidation protocols are much more popular; some systems provide both, or even hybrids

Protocols
- 3-state write-back invalidation protocol (MSI)
- 4-state write-back invalidation protocol (MESI)
- 4-state write-back update protocol (Dragon)

Basic MSI Write-back Invalidation Protocol
- States:
  - Invalid (I)
  - Shared (S): one or more copies
  - Dirty or Modified (M): exactly one copy
- Processor events: PrRd (read), PrWr (write)
- Bus transactions:
  - BusRd: asks for a copy with no intent to modify
  - BusRdX: asks for a copy with intent to modify
  - BusWB: updates memory
- Actions: update state, perform a bus transaction, flush the value onto the bus

State Transition Diagram
(figure: MSI state diagram; its transitions are:)
- M: PrRd/—, PrWr/—; on BusRd, flush and go to S; on BusRdX, flush and go to I
- S: PrRd/—, BusRd/—; on PrWr, issue BusRdX and go to M; on BusRdX, go to I
- I: on PrRd, issue BusRd and go to S; on PrWr, issue BusRdX and go to M
- Write to a shared block: we already have the latest data, so an upgrade (BusUpgr) can be used instead of BusRdX
- Replacement changes the state of two blocks: outgoing and incoming
(A toy state-machine sketch follows.)
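To make the transition list concrete, here is a toy next-state function (our simplification: no bus signaling, no data movement, no BusUpgr optimization):

    /* Toy model of the MSI state machine for one cache block,
     * following the transition list above. */
    enum msi_state { MSI_I, MSI_S, MSI_M };
    enum msi_event { PR_RD, PR_WR, BUS_RD, BUS_RDX };

    static enum msi_state msi_next(enum msi_state s, enum msi_event e) {
        switch (s) {
        case MSI_M:
            if (e == BUS_RD)  return MSI_S;   /* flush block, demote   */
            if (e == BUS_RDX) return MSI_I;   /* flush and invalidate  */
            return MSI_M;                     /* PrRd/PrWr hit locally */
        case MSI_S:
            if (e == PR_WR)   return MSI_M;   /* issue BusRdX (or upgrade) */
            if (e == BUS_RDX) return MSI_I;   /* another cache will write  */
            return MSI_S;                     /* PrRd or BusRd: no change  */
        case MSI_I:
            if (e == PR_RD)   return MSI_S;   /* issue BusRd  */
            if (e == PR_WR)   return MSI_M;   /* issue BusRdX */
            return MSI_I;
        }
        return s;
    }

A real controller also generates the bus transactions and flushes shown on the diagram's edge labels; only the state bookkeeping is modeled here.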
Satisfying Coherence
- Write propagation is clear
- Write serialization?
  - All writes that appear on the bus (BusRdX) are ordered by the bus
  - A write is performed in the writer's cache before the cache handles other transactions, so it is ordered the same way even with respect to the writer
  - Reads that appear on the bus are ordered with respect to these

Satisfying Coherence (2)
- Write serialization (continued): writes that don't appear on the bus:
  - A sequence of such writes between two bus transactions for the block must come from the same processor, say P
  - In the serialization, the sequence appears between those two bus transactions
  - Reads by P see them in this order with respect to other bus transactions
  - Reads by other processors are separated from the sequence by a bus transaction, which places them in serialized order with respect to the writes
  - So reads by all processors see the writes in the same order

Satisfying Sequential Consistency
- Appeal to the definition:
  - The bus imposes a total order on bus transactions for all locations
  - Between transactions, processors perform reads/writes locally in program order
  - So any execution defines a natural partial order: Mj is subsequent to Mi if (i) they follow in program order on the same processor, or (ii) Mj generates a bus transaction that follows the memory operation for Mi
  - In a segment between two bus transactions, any interleaving of operations from different processors leads to a consistent total order
  - Within such a segment, writes observed by processor P are serialized as: writes from other processors by the previous bus transaction P issued, then writes from P in program order

Satisfying Sequential Consistency (2)
- Show that the sufficient conditions are satisfied:
  - Write completion: we can detect when a write appears on the bus
  - Write atomicity: if a read returns the value of a write, that write has already become visible to all others (one can reason through the different cases)

Lower-level Protocol Choices
- BusRd observed in M state: which transition to make? It depends on the expected access patterns
  - Go to S: assumes I'll read again soon, rather than the other processor writing; good for mostly-read data
  - But what about "migratory" data? (I read and write, then you read and write, then X reads and writes...) Then it is better to go to I, so I don't have to be invalidated on your write
  - Synapse transitioned to I; Sequent Symmetry and MIT Alewife use adaptive protocols
- These choices can affect the performance of the memory system

MESI (4-state) Invalidation Protocol
- Problem with the MSI protocol: reading and then modifying data costs two bus transactions, even if nobody is sharing (e.g., even in a sequential program): BusRd (I to S) followed by BusRdX or BusUpgr (S to M)
- Add an exclusive state: write locally without a transaction, even though the block is not modified
- Main memory is up to date, so the cache is not necessarily the owner

MESI (4-state) Invalidation Protocol (2)
- States:
  - Invalid (I)
  - Exclusive or exclusive-clean (E): only this cache has a copy, and it is not modified
  - Shared (S): two or more caches may have copies
  - Modified (M, dirty)
- I goes to E on PrRd if no one else has a copy
  - Requires a "shared" signal on the bus: a wired-OR line asserted in response to BusRd

MESI State Transition Diagram
(figure: MESI state diagram; its transitions are:)
- M: PrRd/—, PrWr/—; BusRd/Flush (to S); BusRdX/Flush (to I)
- E: PrRd/—; PrWr/— (to M, no transaction); BusRd/Flush' (to S); BusRdX/Flush (to I)
- S: PrRd/—; PrWr/BusRdX (to M); BusRd/Flush'; BusRdX/Flush (to I)
- I: PrRd/BusRd(S) (to S if the shared line is asserted, to E if not); PrWr/BusRdX (to M)
- BusRd(S) means the shared line is asserted on the BusRd transaction
- Flush': if cache-to-cache sharing is used (see next slide), only one cache flushes data
- MOESI protocol: adds an Owned state, exclusive but with memory not valid

Lower-level Protocol Choices
- Who supplies the data on a miss when the block is not in M state: memory or a cache?
  - Original (Illinois) MESI: a cache, since caches were assumed faster than memory (cache-to-cache sharing)
  - Not true in modern systems: intervening in another cache is more expensive than getting the block from memory

Lower-level Protocol Choices (2)
- Cache-to-cache sharing also adds complexity:
  - How does memory know it should supply the data? (it must wait for the caches)
  - A selection algorithm is needed if multiple caches have valid data
- But it is valuable for cache-coherent machines with distributed memory:
  - It may be cheaper to obtain data from a nearby cache than from distant memory
  - Especially when the machine is constructed out of SMP nodes (Stanford DASH)

Dragon Write-back Update Protocol
- 4 states:
  - Exclusive-clean or exclusive (E): I and memory have it
  - Shared-clean (Sc): I and others, and maybe memory, but I'm not the owner
  - Shared-modified (Sm): I and others but not memory, and I'm the owner; Sm and Sc can coexist in different caches, with only one Sm
  - Modified or dirty (M): I have it, and no one else does

Dragon Write-back Update Protocol (2)
- No invalid state
  - If a block is in the cache, it cannot be invalid
  - If it is not present in the cache, we can view it as being in a not-present or invalid state
- New processor events: PrRdMiss, PrWrMiss, introduced to specify actions when the block is not present in the cache
- New bus transaction: BusUpd, which broadcasts the single word written on the bus and updates the other relevant caches

Dragon State Transition Diagram
(figure: Dragon state diagram; its transitions are:)
- Not present: PrRdMiss/BusRd(S) to Sc if shared, to E if not; PrWrMiss/(BusRd(S); BusUpd) to Sm if shared, to M if not
- E: PrRd/—; PrWr/— (to M); BusRd/— (to Sc)
- Sc: PrRd/—; BusUpd/Update; PrWr/BusUpd(S) to Sm if still shared, to M if not
- Sm: PrRd/—; BusRd/Flush; BusUpd/Update (to Sc); PrWr/BusUpd(S) stays Sm if shared, goes to M if not
- M: PrRd/—, PrWr/—; BusRd/Flush (to Sm)

Lower-level Protocol Choices
- Can the shared-modified state be eliminated? Yes, if memory is updated on BusUpd transactions as well (DEC Firefly); Dragon doesn't do this (it assumes DRAM memory is slow to update)
- Should the replacement of an Sc block be broadcast?
  - It would allow the last copy to go to E state and stop generating updates
  - The replacement bus transaction is not in the critical path; a later update may be

Lower-level Protocol Choices (2)
- Shouldn't update the local copy on a write hit before the controller gets the bus: it can mess up serialization
- Coherence and consistency considerations are much like the write-through case
- In general, there are many subtle race conditions in these protocols
- But first, let's illustrate quantitative assessment at the logical level
Synchronization
David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998 (advanced topic; can be skipped)

Synchronization
- Synchronization is a fundamental concept of parallel computing: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
- Types:
  - Mutual exclusion
  - Event synchronization: point-to-point, group, global (barriers)

Synchronization (2)
- Synchronization is a well-known problem in traditional parallel computing, inherited by multi-core architectures; resolving it requires both hardware and software
  - The processor instruction set needs to provide an atomic test-and-set instruction
  - System software uses it to provide synchronization mechanisms
  - Multithreading software provides synchronization primitives
- Presented here for the sake of completeness, to give exposure to the ideas behind those primitives

History and Perspectives
- Much debate over hardware primitives over the years; conclusions depend on technology and machine style (speed versus flexibility)
- Most modern methods use a form of atomic read-modify-write
  - IBM 370: included atomic compare&swap for multiprogramming
  - x86: any instruction can be prefixed with a lock modifier

History and Perspectives (2)
- Atomic read-modify-write (continued):
  - High-level language advocates want hardware locks/barriers, but that goes against the "RISC" flow
  - SPARC: atomic register-memory operations (swap, compare&swap)
  - MIPS, IBM Power: no atomic operations, but a pair of instructions, load-locked and store-conditional; later used by PowerPC and DEC Alpha too
- A rich set of tradeoffs

Components of a Synchronization Event
- Acquire method: acquire the right to the synchronization (enter the critical section, go past the event)
- Waiting algorithm: wait for the synchronization to become available when it isn't
- Release method: enable other processors to acquire the right to the synchronization
- The waiting algorithm is independent of the type of synchronization

Waiting Algorithms
- Blocking
  - Waiting processes are descheduled
  - High overhead, but it allows the processor to do other things
- Busy-waiting
  - Waiting processes repeatedly test a location until it changes value; the releasing process sets the location
  - Lower overhead, but it consumes processor resources and can cause network traffic

Waiting Algorithms (2)
- Busy-waiting is better when:
  - The scheduling overhead is larger than the expected wait time
  - Processor resources are not needed for other tasks
  - Scheduler-based blocking is inappropriate (e.g., in the OS kernel)
- Hybrid methods: busy-wait a while, then block

Role of System and User
- The user wants to use high-level synchronization operations (locks, barriers, ...) and doesn't care about the implementation
- The system designer asks: how much hardware support belongs in the implementation?
  - Speed versus cost and flexibility
  - The waiting algorithm is difficult to do in hardware, so provide support for the other components

Role of System and User (2)
- The popular trend: the system provides simple hardware primitives (atomic operations), and software libraries implement lock and barrier algorithms using them
- But some propose and implement full-hardware synchronization

Challenges
- The same synchronization may have different needs at different times
  - A lock may be accessed with low or high contention
  - Different performance requirements: low latency or high throughput
  - Different algorithms are best for each case, and they need different primitives
- Multiprogramming can change synchronization behavior and needs
  - Process scheduling and other resource interactions
  - May need more sophisticated algorithms that are not as good in the dedicated case

Challenges (2)
- A rich area of software-hardware interactions
  - Which primitives are available affects which algorithms can be used
  - Which algorithms are effective affects which primitives to provide
- Need to evaluate using workloads

Mutual Exclusion
- Mutual exclusion = lock-unlock operations, with a wide range of algorithms to implement them
- Role of contention for locks:
  - Simple algorithms are fast when contention for locks is low
  - Sophisticated algorithms deal with contention in a better way, but have higher cost
- Types of locks: hardware locks, simple lock algorithms, advanced lock algorithms

Hardware Locks
- Separate lock lines on the bus: the holder of a lock asserts the line; locking algorithm: busy-wait with timeout
- Lock registers (Cray XMP): a set of registers shared among processors, with a priority mechanism for multiple requestors
- Inflexible, so not popular for general-purpose use:
  - Few locks can be in use at a time (one per lock line)
  - Hardwired waiting algorithm
- Primarily used to provide atomicity for higher-level software locks

First Attempt at Simple Software Lock

    lock:    ld   register, location   /* copy location to register */
             cmp  location, #0         /* compare with 0 */
             bnz  lock                  /* if not 0, try again */
             st   location, #1          /* store 1 to mark it locked */
             ret                        /* return control to caller */

    unlock:  st   location, #0          /* write 0 to location */
             ret                        /* return control to caller */

First Attempt at Simple Software Lock (2)
- Problem: the lock needs atomicity in its own implementation: the read (test) and write (set) of the lock variable by a process are not atomic
- Solution: atomic read-modify-write or exchange instructions: atomically test the value of a location and set it to another value, and return success or failure somehow

Atomic Exchange Instruction
- Specifies a location and a register. In one atomic operation:
  - The value in the location is read into the register
  - Another value (possibly a function of the value read) is stored into the location
- Many variants, with varying degrees of flexibility in the second part

Atomic Exchange Instruction (2)
- Simple example: test&set
  - The value in the location is read into a specified register
  - The constant 1 is stored into the location
  - Successful if the value loaded into the register is 0
  - Other constants could be used instead of 1 and 0
- Can be used to build locks

Simple Test&Set Lock

    lock:    t&s  register, location   /* atomically read location, set it to 1 */
             bnz  lock                 /* if not 0, try again */
             ret                       /* return control to caller */

    unlock:  st   location, #0         /* write 0 to location */
             ret                       /* return control to caller */

Simple Test&Set Lock (2)
- Other read-modify-write primitives can be used too:
  - Swap
  - Fetch&op
  - Compare&swap: three operands (location, register to compare with, register to swap with); not commonly supported by RISC instruction sets
- Locks can be cacheable or uncacheable (we assume cacheable)

Simple Test&Set Lock (3)
- Measured on the SGI Challenge with the code: lock; delay(c); unlock;
- Same total number of lock calls as p increases; measure the time per transfer

T&S Lock Microbenchmark Performance
(figure: time per lock transfer in microseconds versus number of processors, for test&set with c = 0, test&set with exponential backoff and c = 3.64 microseconds, test&set with exponential backoff and c = 0, and the ideal curve)
- Performance degrades because unsuccessful test&sets generate traffic

Enhancements to Simple Lock Algorithm
- Reduce the frequency of issuing test&sets while waiting: test&set lock with backoff
  - Don't back off too much, or you will still be backed off when the lock becomes free
  - Exponential backoff works quite well empirically: the delay for the i-th attempt is k*c^i

Enhancements to Simple Lock Algorithm (2)
- Busy-wait with read operations rather than test&set: the test-and-test&set lock
  - Keep testing with an ordinary load; the cached lock variable will be invalidated when a release occurs
  - When the value changes (to 0), try to obtain the lock with test&set; only one attemptor will succeed, and the others will fail and start testing again
  - (A C sketch combining both enhancements follows)
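A C11 sketch of test-and-test&set with exponential backoff (ours; the backoff constants are arbitrary and would need per-machine tuning):

    #include <stdatomic.h>
    #include <sched.h>

    static atomic_int lockvar = 0;

    static void acquire(void) {
        int delay = 1;
        for (;;) {
            /* "test": spin on an ordinary cached read; no bus traffic
             * until a release invalidates our copy. */
            while (atomic_load(&lockvar) != 0)
                ;
            /* "test&set": the lock looks free, try the atomic exchange. */
            if (atomic_exchange(&lockvar, 1) == 0)
                return;                       /* we got the lock */
            for (int i = 0; i < delay; i++)   /* lost the race: back off */
                sched_yield();
            if (delay < 1024)
                delay *= 2;                   /* exponential growth, capped */
        }
    }

    static void release(void) {
        atomic_store(&lockvar, 0);
    }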
Performance Criteria (T&S Lock)
- Uncontended latency: very low if the lock is repeatedly accessed by the same processor; independent of p
- Traffic: lots if many processors compete, with poor scaling in p: each t&s generates invalidations, and all waiters rush out again to t&s
- Storage: very small (a single variable); independent of p
- Fairness: poor; can cause starvation

Performance Criteria (2)
- Test&set with backoff: similar, but less traffic
- Test-and-test&set: slightly higher latency, much less traffic
  - But still, all waiters rush out to read-miss and test&set on a release
  - Traffic for p processors to acquire the lock once each: O(p^2)
- Luckily, better hardware primitives as well as better algorithms exist

Improved Hardware Primitives: LL-SC
- Goals:
  - Test with reads
  - Failed read-modify-write attempts should not generate invalidations
  - Nice if a single primitive can implement a range of read-modify-write operations
- Two instructions: Load-Locked (or -Linked, LL) and Store-Conditional (SC)
  - LL reads the variable into a register

Improved Hardware Primitives (2)
- The LL is followed by arbitrary instructions that manipulate the value
- SC tries to store back to the location if and only if no one else has written the variable since this processor's LL
  - If the SC succeeds, it means all three steps happened atomically
  - If it fails, it doesn't write or generate invalidations (retry from the LL)
  - Success is indicated in condition codes

Simple Lock with LL-SC

    lock:    ll    reg1, location    /* LL location to reg1 */
             bnz   reg1, lock        /* if locked, try again */
             sc    location, reg2    /* SC reg2 (holding 1) into location */
             beqz  reg2, lock        /* if SC failed, start again */
             ret

    unlock:  st    location, #0      /* write 0 to location */
             ret

Simple Lock with LL-SC (2)
- More elaborate atomic operations can be built by changing what sits between the LL and the SC
  - But keep it small, so the SC is likely to succeed
  - Don't include instructions that would need to be undone (e.g., stores)
- SC can fail (without putting a transaction on the bus) if:
  - It detects an intervening write even before trying to get the bus
  - It tries to get the bus, but another processor's SC gets the bus first
- LL and SC are not lock and unlock: they only guarantee that there is no conflicting write to the lock variable between them
  - But they can be used directly to implement simple operations on shared variables
- (A portable compare-and-swap rendering follows)
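Portable code usually reaches LL-SC through a compare-and-swap interface: on LL-SC machines (MIPS, PowerPC, Alpha style), compilers map a C11 compare-exchange onto an LL/SC pair. A sketch of the same lock in that style (ours, not the book's code):

    #include <stdatomic.h>

    static atomic_int lockvar = 0;

    static void lock(void) {
        int expected;
        do {
            expected = 0;   /* only swap in 1 if the lock reads as free */
        } while (!atomic_compare_exchange_weak(&lockvar, &expected, 1));
    }

    static void unlock(void) {
        atomic_store(&lockvar, 0);
    }

The "weak" variant is the natural fit here: it is allowed to fail spuriously, exactly as an SC may, and we are looping anyway.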
More Efficient SW Locking Algorithms
- Problem with the simple LL-SC lock: no invalidations on failure, but read misses by all waiters after both the release and the winner's successful SC
  - There is no test-and-test&set analog, but backoff can be used to reduce burstiness
  - It doesn't reduce traffic to the minimum, and it is not a fair lock

More Efficient SW Locking (2)
- Better software algorithms for a bus (for r-m-w instructions or LL-SC):
  - Only one process tries to get the lock upon a release: valuable when using test&set instructions (LL-SC does this already)
  - Only one process has a read miss upon a release: valuable with LL-SC too
- The ticket lock achieves the first; the array-based queueing lock achieves both
- Both are fair (FIFO) locks as well

Ticket Lock
- Only one r-m-w operation (from only one processor) per acquire
- Works like the waiting line at a bank: two counters per lock (next_ticket, now_serving)
  - Acquire: fetch&inc next_ticket, then wait for now_serving to equal it; the atomic operation happens on arrival at the lock, not when the lock is free (so less contention)
  - Release: increment now_serving
- FIFO order; low latency for low contention if fetch&inc is cacheable

Ticket Lock (2)
- Still O(p) read misses at a release, since all waiters spin on the same variable
  - Like the simple LL-SC lock, but with no invalidation when the SC succeeds, and fair
- It can be difficult to find a good amount to delay on backoff
  - Exponential backoff is not a good idea, due to the FIFO order
  - Backoff proportional to now_serving - next_ticket may work well
- Wouldn't it be nice to poll different locations? (see the array lock on the next slide; a C sketch of the ticket lock follows)
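A C11 sketch of the ticket lock (illustrative names, ours):

    #include <stdatomic.h>

    /* Ticket lock: fetch&inc hands out tickets on arrival;
     * now_serving releases waiters in FIFO order. */
    struct ticket_lock {
        atomic_uint next_ticket;   /* next ticket to hand out */
        atomic_uint now_serving;   /* ticket allowed to proceed */
    };

    static void ticket_acquire(struct ticket_lock *l) {
        unsigned me = atomic_fetch_add(&l->next_ticket, 1);
        while (atomic_load(&l->now_serving) != me)
            ;   /* spin; a delay proportional to me - now_serving fits here */
    }

    static void ticket_release(struct ticket_lock *l) {
        atomic_fetch_add(&l->now_serving, 1);
    }

Note that the single atomic operation happens in the acquire, before any waiting, which is exactly the property the slide highlights.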
Array-based Queuing Locks
- Waiting processes poll different locations in an array of size p
- Acquire: fetch&inc to obtain the address to spin on (the next array element); ensure that these addresses fall in different cache lines or memories
- Release: set the next location in the array, thus waking up the process spinning on it
- O(1) traffic per acquire with coherent caches

Array-based Queuing Locks (2)
- FIFO ordering, as in the ticket lock; but O(p) space per lock
- Good performance for bus-based machines
- Not so great for non-cache-coherent machines with distributed memory: the array location I spin on is not necessarily in my local memory (a sketch follows)
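A sketch with one flag per cache line (ours; it assumes at most MAXPROCS simultaneous waiters, and initialization with slot[0].must_wait = 0 and all other slots set to 1):

    #include <stdatomic.h>

    #define MAXPROCS 16
    #define LINE     64    /* assumed cache line size */

    /* Each waiter spins on its own slot, one slot per cache line, so a
     * release invalidates only the next waiter's line: O(1) traffic. */
    struct queue_lock {
        struct {
            atomic_int must_wait;
            char pad[LINE - sizeof(atomic_int)];
        } slot[MAXPROCS];
        atomic_uint next;          /* fetch&inc to claim a slot */
    };

    static unsigned q_acquire(struct queue_lock *l) {
        unsigned my = atomic_fetch_add(&l->next, 1) % MAXPROCS;
        while (atomic_load(&l->slot[my].must_wait))
            ;                      /* spin on my own location */
        return my;                 /* caller passes this to q_release */
    }

    static void q_release(struct queue_lock *l, unsigned my) {
        atomic_store(&l->slot[my].must_wait, 1);                   /* re-arm mine */
        atomic_store(&l->slot[(my + 1) % MAXPROCS].must_wait, 0);  /* wake next  */
    }

The O(p) space cost and the FIFO hand-off from slot to slot are both visible directly in the structure.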
Lock Performance on SGI Challenge
(figure: time per acquire in microseconds versus number of processors for the array-based, LL-SC, LL-SC with exponential backoff, ticket, and ticket with proportional backoff locks; the loop is lock; delay(c); unlock; delay(d); three panels: (a) null, c = 0, d = 0; (b) critical section, c = 3.64 microseconds, d = 0; (c) delay, c = 3.64 microseconds, d = 1.29 microseconds)

Lock Performance on SGI Challenge (2)
- The simple LL-SC lock does best at small p, due to unfairness; not so when there is a delay between unlock and the next lock
- Need to be careful with backoff
- The ticket lock with proportional backoff scales well, as does the array lock
- Methodologically challenging; one needs to look at real workloads

Point-to-Point Event Synchronization
- Software methods:
  - Interrupts
  - Busy-waiting: use ordinary variables as flags
  - Blocking: use semaphores
- Full hardware support: a full-empty bit with each word in memory
  - Set when the word is "full" with newly produced data (i.e., when written)
  - Unset when the word is "empty" due to being consumed (i.e., when read)

Point-to-Point Event Synchronization (2)
- Full hardware support (continued):
  - Natural for word-level producer-consumer synchronization: the producer writes if empty and sets to full; the consumer reads if full and sets to empty
  - Hardware preserves the atomicity of the bit manipulation with the read or write
  - Problem: flexibility; what about multiple consumers, or multiple writes before the consumer reads? (needs language support to specify when to use it) And what about composite data structures?

Barriers
- Software algorithms are implemented using locks, flags, and counters
- Hardware barriers:
  - A wired-AND line separate from the address/data bus: set the input high on arrival, then wait for the output to go high before leaving
  - In practice, multiple wires allow reuse
  - Useful when barriers are global and very frequent

Barriers (2)
- Hardware barriers (continued):
  - Difficult to support an arbitrary subset of processors
  - Difficult to dynamically change the number and identity of participants; even harder with multiple processes per processor (e.g., due to process migration)
  - Not common today on bus-based machines
- Let's look at software algorithms built from simple hardware primitives
A Simple Centralized Barrier
- A shared counter maintains the number of processes that have arrived: increment on arrival (under a lock), then check until it reaches numprocs

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;            /* reset flag if first to reach */
        mycount = bar_name.counter++;     /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {               /* last to arrive */
            bar_name.counter = 0;         /* reset for next barrier */
            bar_name.flag = 1;            /* release waiters */
        }
        else
            while (bar_name.flag == 0) {};  /* busy wait for release */
    }

- Problem?

A Working Centralized Barrier
- Consecutively entering the same barrier doesn't work:
  - Must prevent a process from entering until all have left the previous instance
  - Could use another counter, but that increases latency and contention
- Sense reversal: wait for the flag to take a different value in consecutive barriers; toggle the value only when all processes have reached the barrier

A Working Centralized Barrier (2)

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);     /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;     /* mycount is private */
        if (bar_name.counter == p) {
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;         /* reset for next barrier */
            bar_name.flag = local_sense;  /* release waiters */
        }
        else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};
        }
    }

(A runnable C11 restatement follows.)
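The same sense-reversing barrier restated as runnable C11 (ours; atomics stand in for the LOCK/UNLOCK pseudocode, and local_sense lives in each thread, e.g. as a stack variable initialized to false):

    #include <stdatomic.h>
    #include <stdbool.h>

    struct barrier {
        atomic_int  counter;   /* arrivals so far, starts at 0 */
        atomic_bool flag;      /* current release sense, starts false */
    };

    static void barrier_wait(struct barrier *b, int p, bool *local_sense) {
        *local_sense = !*local_sense;             /* toggle private sense */
        if (atomic_fetch_add(&b->counter, 1) + 1 == p) {
            atomic_store(&b->counter, 0);         /* last one resets */
            atomic_store(&b->flag, *local_sense); /* release the waiters */
        } else {
            while (atomic_load(&b->flag) != *local_sense)
                ;                                 /* busy-wait for release */
        }
    }

Because each episode waits for a different flag value, a fast thread cannot re-enter the barrier and sneak past waiters from the previous instance, which is exactly the bug sense reversal fixes.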
Centralized Barrier Performance
- Latency: we want a short critical path in the barrier; the centralized barrier has a critical path length at least proportional to p
- Traffic: barriers are likely to be highly contended, so we want traffic to scale well; about 3p bus transactions in the centralized barrier
- Storage cost: very low (a centralized counter and flag)

Centralized Barrier Performance (2)
- Fairness: the same processor should not always be last to exit the barrier; there is no such bias in the centralized barrier
- The key problems for the centralized barrier are latency and traffic, especially with distributed memory, where all traffic goes to the same node

Improved Barrier Algorithms for a Bus
- Software combining tree:
  - Flat: all processors access the same location, so contention is heavy
  - Tree-structured: only k processors access any one location, where k is the degree of the tree, so there is little contention

Improved Barrier Algorithms for a Bus (2)
- Separate arrival and exit trees, and use sense reversal
- Valuable in a distributed network: communication proceeds along different paths
- On a bus, however, all traffic goes on the same bus, and there is no less total traffic
  - Higher latency (log p steps of work, and O(p) serialized bus transactions)
  - The advantage on a bus is the use of ordinary reads/writes instead of locks

Barrier Performance on SGI Challenge
(figure: time in microseconds versus number of processors, 1 to 8, for the centralized, combining tree, tournament, and dissemination barriers)
- The centralized barrier does quite well

Synchronization Summary
- A rich interaction of hardware-software tradeoffs
  - Must evaluate hardware primitives and software algorithms together: the primitives determine which algorithms perform well
- Evaluation methodology is challenging: use of delays and microbenchmarks; one should use both microbenchmarks and real workloads
- Simple software algorithms with common hardware primitives do well on a bus

Key Takeaways for this Session
- Multi-core processors are here
  - They are multiprocessor/MIMD systems, so we need to understand parallel programming
  - Strengths, weaknesses, opportunities, and threats
  - No "free lunch" for performance improvement
- System support for multi-core is available
  - OS: both Linux and Windows support them
  - Compilers/language support: gcc, C#, Java
- Two types of development tracks, each with its own unique challenges: high performance computing and high throughput computing

Key Takeaways (2)
- High performance computing
  - Most scientific/engineering applications
  - Available programming models: message passing (MPI) or shared-memory processing (OpenMP)
  - Challenge: performance scalability with cores and problem size, while dealing with data/function partitioning
- High throughput computing
  - Most business applications
  - Available programming model: multi-threading (shared-memory processing)
  - Challenge: performance scalability while dealing with deadlocks, locking, cache, and memory issues