COMPUTER ARCHITECTURE
Thread-Level Parallelism
Rajeshwari B
Department of Electronics and Communication Engineering

UNIT 5
• Multiprocessor Architecture: Challenges of Parallel Processing
• Centralized Shared-Memory Architectures: Cache Coherence, Snoopy Protocol
• Distributed Shared-Memory and Directory-Based Coherence
• Synchronization: The Basics
• Models of Memory Consistency

Importance of Multiprocessing
• Exploiting more ILP became inefficient, since power and silicon costs grew faster than performance.
• A growing interest in high-end servers as cloud computing and software-as-a-service become more important.
• A growth in data-intensive applications driven by the availability of massive amounts of data on the Internet.
• The insight that increasing single-desktop performance is less important.
• An improved understanding of how to use multiprocessors effectively, especially in servers, where there is significant natural TLP: parallelism among large numbers of independent requests (request-level parallelism).
• The advantage of leveraging a design investment by replication rather than unique design.

Exploiting TLP on Multiprocessors
• Our focus in this chapter is on tightly coupled shared-memory multiprocessors: a set of processors whose coordination and usage are typically controlled by a single operating system and that share memory through a shared address space.
• Two different software models:
  – Parallel processing: the execution of a tightly coupled set of threads collaborating on a single task.
  – Request-level parallelism: the execution of multiple, relatively independent processes that may originate from one or more users (multiprogramming).
• How is thread-level parallelism exploited efficiently?
  – Independent threads identified by the compiler? (No)
  – Independent threads identified at a high level by the software system or the programmer? (Yes)

Architecture for Multiprocessors: Two Models for Communication
1. Message-passing multiprocessors: communication occurs by explicitly passing messages among the processors.
2. Shared-memory multiprocessors: communication occurs through a shared address space (via loads and stores).

Two classes of multiprocessors with respect to memory:
1. Centralized-memory multiprocessor
   • Fewer than a few dozen cores
   • Small enough to share a single, centralized memory with uniform memory access latency (UMA)
2. Physically distributed-memory multiprocessor
   • Larger number of chips and cores than class 1
   • Bandwidth demands mean memory is distributed among the processors, with non-uniform memory access/latency (NUMA)

Shared-Memory Multiprocessor
Symmetric multiprocessors (SMP)
– A centralized memory
– Small number of cores (< 32)
– Share a single memory with uniform memory latency
– Becomes less attractive as the number of processors sharing the centralized memory increases

Distributed shared memory (DSM)
– Memory is distributed among the processors
– Non-uniform memory access/latency (NUMA)
– Processors connected via direct (switched) or indirect (multihop) interconnection networks
– Communicating data between processors is more complex

Challenges of Parallel Processing
• The first challenge is the limited parallelism available in programs.
  Example: Suppose we want an 80x speedup from 100 processors. What fraction of the original program can be sequential?
  Answer: To achieve a speedup of 80 with 100 processors, only 0.25% of the original computation can be sequential.
• The second challenge is the relatively high cost of communication.
  Example: Suppose a 32-core MP running at 4 GHz with a 100 ns remote memory access time; all local accesses hit in the memory hierarchy and the base CPI is 0.5. What is the performance impact if 0.2% of instructions involve a remote access?
  Answer:
  – Remote access cost = 100 ns / 0.25 ns per cycle = 400 clock cycles
  – CPI = Base CPI + Remote request rate x Remote request cost
  – CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3
  – The multiprocessor with all local references (no communication) is 1.3/0.5 = 2.6 times faster than the one in which 0.2% of instructions involve a remote access.
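The two figures quoted above follow from Amdahl's law; as a quick sketch of the arithmetic behind the 0.25% and 2.6x results (no new data, only the intermediate steps):

\[
\text{Speedup}=\frac{1}{\dfrac{F_{\text{parallel}}}{100}+\left(1-F_{\text{parallel}}\right)}
\qquad\Rightarrow\qquad
80=\frac{1}{\dfrac{F}{100}+(1-F)}
\]
\[
\frac{F}{100}+1-F=\frac{1}{80}=0.0125
\;\Rightarrow\; 0.99F=0.9875
\;\Rightarrow\; F\approx 0.9975,
\]
so at most \(1-F\approx 0.25\%\) of the original computation can be sequential. For the second example, a 4 GHz clock gives a 0.25 ns cycle time, so
\[
\text{Remote cost}=\frac{100\ \text{ns}}{0.25\ \text{ns/cycle}}=400\ \text{cycles},\qquad
\text{CPI}=0.5+0.002\times 400=1.3,\qquad \frac{1.3}{0.5}=2.6\times .
\]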
Addressing the Challenges
1. Limited application parallelism: addressed primarily via new algorithms that have better parallel performance.
2. Long remote latency impact: addressed both by the architect and by the programmer.
   • For example, reduce the frequency of remote accesses either by
     – caching shared data (hardware), or
     – restructuring the data layout to make more accesses local (software).
   • Much of this lecture focuses on techniques for reducing the impact of long remote latency:
     – How can caching be used to reduce remote access frequency while maintaining a coherent view of memory?
     – Efficient synchronization
     – Latency-hiding techniques

Cache Coherence Problem
• Multiprocessors usually cache both private data (used by a single processor) and shared data (used by multiple processors).
  – Caching shared data can reduce latency and required memory bandwidth (due to local access).
  – But caching shared data introduces the cache coherence problem.
  – Processors see different values for u after event 3 (unacceptable for programming, and it happens frequently!).

Coherence vs. Consistency
1. Coherence defines the values returned by a read.
   – Writes to the same location by any two processors are seen in the same order by all processors.
2. Consistency determines when a written value will be returned by a read.
   – If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A.
• Coherence defines the behavior of reads and writes to the same location.
• Consistency defines the behavior of reads and writes with respect to other locations.

Defining a Coherent Memory System
1. Preserve program order: a read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P.
2. Coherent view of memory: a read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses.
3. Write serialization: two writes to the same location by any two processors are seen in the same order by all processors.
   – If not, a processor could keep the value 1 indefinitely because it saw it as the last write.
   – For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.

Basic Schemes for Enforcing Coherence (tracking the sharing status)
Two classes of hardware cache coherence protocols (hardware tracks the status of the shared data):
1. Directory based: the sharing status of a block of physical memory is kept in just one location, the directory (a centralized-control protocol).
2. Snooping: every cache with a copy of the data also has a copy of the sharing status of the block, but no centralized state is kept (a distributed-control protocol).

Snooping Coherence Protocols
• Write-invalidate protocol:
  – Multiple readers, single writer.
  – Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies.
  – Read miss:
    » Write-through: memory is always up to date.
    » Write-back: snoop in the caches to find the most recent copy.
• Write-broadcast (write-update) protocol:
  – Write to shared data: broadcast on the bus; processors snoop and update any copies.
  – Read miss: memory is always up to date.
• Write serialization: the bus serializes requests.
  – The bus is a single point of arbitration.

(Figure: (a) invalidate protocol; (b) update protocol for shared variables.)

Basic Implementation Techniques
• Write-through: get the up-to-date copy from the lower-level cache or memory (simpler, but enough memory bandwidth is necessary).
• Write-back: it is harder to get the up-to-date copy of the data.
• The same snooping mechanism can be used:
  1. Snoop every address placed on the bus.
  2. If a processor has a dirty copy of the requested cache block, it provides the block in response to a read request and aborts the memory access.
  – Complexity comes from retrieving the cache block from a processor's cache, which can take longer than retrieving it from memory.

Basic Implementation Techniques (continued)
• Normal cache tags can be used for snooping.
• A valid bit per block makes invalidation easy.
• Read misses are easy, since they rely on snooping.
• Writes: need to know whether any other copies of the block are cached.
  – No other copies: no need to place the write on the bus (for write-back caches).
  – Other copies: need to place an invalidate on the bus.

Cache Resources for Write-Back Snooping
• To track whether a cache block is shared, add an extra state bit associated with each cache block, alongside the valid bit and dirty bit.
  – Write to a Shared block: need to place an invalidate on the bus and mark the cache block as private (exclusive).
  – No further invalidations will be sent for that block.
  – This processor is called the owner of the cache block.
  – The owner then changes the state from shared to unshared (or exclusive).

A Write-Back Snoopy Protocol
• Invalidation protocol for write-back caches:
  – Snoops every address on the bus.
  – If a cache has a dirty copy of the requested block, it provides that block in response to the read request and aborts the memory access.
• Each memory block is in one of three states:
  – Clean in all caches and up to date in memory (Shared), OR
  – Dirty in exactly one cache (Exclusive), OR
  – Not in any cache (Invalid).
• Each cache block is in one of three states (the cache tracks these):
  – Shared: the block can be read, OR
  – Exclusive: this cache has the only copy, it is writable, and it is dirty, OR
  – Invalid: the block contains no data (as in a uniprocessor cache).
• Read misses cause all caches to snoop the bus.
• Writes to clean blocks are treated as misses.

(Figure: protocol example; in the Shared state, memory is also kept up to date.)

Exercise
Four processors share memory using the UMA model. Apply the snooping-based protocol to the following sequence of memory operations and write the state of the cache blocks after each operation, where X is a block in main memory:
P1 reads X, P4 reads X, P3 writes to X, P2 reads X, P1 writes to X, P2 writes to X.

Extensions to the Basic Coherence Protocol
• The basic protocol is a simple three-state MSI (Modified, Shared, Invalid) protocol.
• The protocol can be optimized by adding states to improve performance:
  – MESI (Modified, Exclusive, Shared, Invalid) protocol, e.g., Intel i7:
    • Adds an Exclusive state to indicate that a cache block is resident in only a single cache but is clean. (It can be written without generating any invalidates.)
  – MOESI (Modified, Owned, Exclusive, Shared, Invalid) protocol, e.g., AMD Opteron:
    • Adds an Owned state to indicate that the main-memory copy is out of date and that the designated cache is the owner.
    • When there is an attempt to share a block in the Modified state:
      – in the original protocol, the block must be written back to memory;
      – with MOESI, the block can be changed from Modified to Owned without writing it to memory.
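As a concrete illustration of the basic three-state (MSI-style) write-back snooping protocol summarized above, here is a minimal sketch in C of the state transitions for a single cache block in one cache. The bus actions (bus_read_miss, bus_write_miss, write_back_block) are placeholder stubs invented for the example, not a real controller interface.

/* Minimal MSI-style sketch for one cache block in one cache (illustrative only). */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

/* Stub bus actions: a real controller would drive the shared bus here. */
static void bus_read_miss(void)    { puts("  bus: read miss placed on bus"); }
static void bus_write_miss(void)   { puts("  bus: write miss / invalidate placed on bus"); }
static void write_back_block(void) { puts("  bus: dirty block written back to memory"); }

/* Processor-side events for this cache's copy of the block. */
BlockState cpu_read(BlockState s) {
    if (s == INVALID) { bus_read_miss(); return SHARED; }  /* miss: fetch, become Shared */
    return s;                                              /* Shared/Exclusive: read hit */
}

BlockState cpu_write(BlockState s) {
    if (s != EXCLUSIVE) bus_write_miss();                  /* must obtain ownership first */
    return EXCLUSIVE;                                      /* now the only, dirty copy */
}

/* Snooper-side events: another processor's miss is seen on the bus. */
BlockState snoop_read_miss(BlockState s) {
    if (s == EXCLUSIVE) write_back_block();                /* supply the data, then demote */
    return (s == INVALID) ? INVALID : SHARED;
}

BlockState snoop_write_miss(BlockState s) {
    if (s == EXCLUSIVE) write_back_block();
    return INVALID;                                        /* our copy is invalidated */
}

int main(void) {
    BlockState p1 = INVALID;
    p1 = cpu_read(p1);          /* P1 reads X  -> Shared            */
    p1 = cpu_write(p1);         /* P1 writes X -> Exclusive         */
    p1 = snoop_read_miss(p1);   /* P2 reads X  -> P1 back to Shared */
    printf("P1's final state: %d (0=Invalid, 1=Shared, 2=Exclusive)\n", p1);
    return 0;
}

The same three functions can be applied by hand to the exercise above (P1 reads X, P4 reads X, ...) to fill in the state of each cache after every operation.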
Limitations in SMPs and Snooping Protocols
• As the number of processors grows, a single shared bus soon becomes a bottleneck, because every cache must examine every miss; adding interconnection bandwidth only pushes the problem into the caches.

Techniques for Increasing the Snoop Bandwidth
• The tags can be duplicated. This doubles the effective cache-level snoop bandwidth.
• If the outermost cache on a multicore (typically L3) is shared, we can distribute that cache so that each processor is responsible for a portion of it and handles snoops for that portion of the address space.
• We can place a directory at the level of the outermost shared cache (say, L3). L3 then acts as a filter on snoop requests and must be inclusive.

True and False Sharing Misses
• Coherence influences the cache miss rate.
• Coherence misses can be broken into two separate sources:
  – True sharing misses:
    • a write to a shared block (transmission of an invalidation), or
    • a read of an invalidated block.
  – False sharing misses:
    • a block is invalidated because some word in the block, other than the one being read, was written into; a subsequent reference to the block causes a miss.

Example: Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Prior to time step 1, x1 was read by P2. Assume the following sequence of events:
1. P1 writes x1
2. P2 reads x2
3. P1 writes x1
4. P2 writes x2
5. P1 reads x2
Identify the true and false sharing misses in each of the five steps.

Answer: Here are the classifications by time step:
1. This event is a true sharing miss, since x1 was read by P2 and needs to be invalidated from P2.
2. This event is a false sharing miss, since x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2.
3. This event is a false sharing miss, since the block containing x1 is marked shared due to the read in P2, but P2 did not read x1. The cache block containing x1 will be in the shared state after the read by P2; a write miss is required to obtain exclusive access to the block. In some protocols this will be handled as an upgrade request, which generates a bus invalidate but does not transfer the cache block.
4. This event is a false sharing miss for the same reason as step 3.
5. This event is a true sharing miss, since the value being read was written by P2.

Preventing False Sharing
• Hardware changes: change the cache block size, or the amount of the cache block covered by a given valid bit.
• Software solution: allocate data objects so that
  1. only one truly shared object occurs per cache-block frame in memory, and
  2. no non-shared objects are located in the same cache-block frame as any shared object.
  If this is done, then even with just a single valid bit per cache block, false sharing is impossible.
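The software remedy above can be made concrete with a small sketch. The 64-byte block size and the pthread setup are assumptions chosen for illustration, not values from the slides; padding each counter to its own cache-block frame keeps the two threads from repeatedly invalidating each other's copies.

/* Sketch of avoiding false sharing by padding (compile with: cc -pthread file.c). */
#include <pthread.h>
#include <stdio.h>

#define ASSUMED_BLOCK_SIZE 64      /* typical cache-block size; an assumption */
#define ITERS 10000000L

/* Without the pad[] member, c0 and c1 would likely sit in the same cache block,
 * and every write by one thread would invalidate the other thread's copy. */
struct padded_counter {
    volatile long value;
    char pad[ASSUMED_BLOCK_SIZE - sizeof(long)];   /* one counter per block frame */
};

static struct padded_counter counters[2];

static void *worker(void *arg) {
    int id = *(int *)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].value++;       /* each thread touches only its own block */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    int id0 = 0, id1 = 1;
    pthread_create(&t0, NULL, worker, &id0);
    pthread_create(&t1, NULL, worker, &id1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter0=%ld counter1=%ld\n", counters[0].value, counters[1].value);
    return 0;
}

Removing the pad member and re-timing the program is a simple way to observe the cost of the false sharing misses described above.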
Distributed Shared-Memory and Directory-Based Coherence
• Snoopy schemes broadcast requests over the memory bus.
• They are difficult to scale to large numbers of processors.
• They require additional bandwidth to the cache tags for snoop requests.

Alternative idea: avoid broadcast by storing information about the status of each line in one place, a "directory".
– The directory entry for a cache line contains information about the state of that line in all caches.
– Caches look up information from the directory as necessary.
– Cache coherence is maintained by point-to-point messages between the caches on a "need to know" basis (not by broadcast mechanisms).
– To prevent the directory itself from becoming a bottleneck, directory entries are distributed along with memory, each node keeping track of which processors have copies of its blocks.

(Figure: a directory is added to each node to implement cache coherence in a distributed-memory multiprocessor.)

Terminology
– Local node: the node where a request originates.
– Home node: the node where the memory location of an address resides.
  Example: node 0 is the home node of the yellow line, node 1 is the home node of the blue line.
– Remote node: a node that has a copy of a cache block, whether exclusive or shared.

Directory Protocol
• Similar to the snoopy protocol: three states
  – Shared: one or more processors have the data cached; memory is up to date.
  – Uncached: no processor has the data; the block is not valid in any cache.
  – Exclusive: exactly one processor (the owner) has the data; memory is out of date.
• In addition to the cache state, the directory must track which processors have the data when in the shared state (usually a bit vector, with bit i set if processor i has a copy).
• Writes to non-exclusive data cause a write miss.
  – The processor blocks until the access completes.
  – Assume messages are received and acted upon in the order sent.

Example Directory Protocol
• Block is in the Uncached state: the copy in memory is the current value; the only possible requests for that block are:
  – Read miss: the requesting processor is sent the data from memory and becomes the only sharing node; the state of the block is made Shared.
  – Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
• Block is Shared (the memory value is up to date):
  – Read miss: the requesting processor is sent the data from memory and is added to the sharing set.
  – Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.
• Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner), so there are three possible directory requests:
  – Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
  – Data write-back: the owner processor is replacing the block and therefore must write it back. This makes the memory copy up to date (the home directory essentially becomes the owner), the block is now Uncached, and the Sharers set is empty.
  – Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.

Example 1: read miss to a clean line
▪ A read miss message is sent to the home node of the requested line.
▪ The home directory checks its entry for the line:
  – If the dirty bit for the cache line is OFF, respond with the contents from memory and set presence[0] to true (to indicate the line is now cached by processor 0).

If the dirty bit is ON, the data must be sourced by another processor:
1. The home node responds with a message providing the identity of the line's owner.
2. The requesting node requests the data from the owner.
3. The owner responds to the requesting node and changes the state in its cache to SHARED (read only).
4. The owner also responds to the home node; the home clears the dirty bit, updates the presence bits, and updates memory.

Example 2: read miss to a dirty line
▪ Read by processor 0 of the blue line: the line is dirty (its contents are in P2's cache), so the dirty-line steps above apply.

Example 3: write miss
▪ Write by processor 0: the line is clean but resident (shared) in P1's and P2's caches, so those copies must be invalidated before the write can complete.

(Figures: state transition diagram for an individual cache block in a directory-based system. The state transition diagram for the directory has the same states and structure as the transition diagram for an individual cache.)

Advantage of directories
▪ On reads, the directory tells the requesting node exactly where to get the line from:
  – either from the home node (if the line is clean),
  – or from the owning node (if the line is dirty).
  Either way, retrieving the data involves only point-to-point communication.
▪ On writes, the advantage of directories depends on the number of sharers.
  – In the limit, if all caches are sharing the data, all caches must be communicated with (just like a broadcast in a snooping protocol).
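A minimal sketch of what a per-block directory entry might look like, and of how the home node could handle a read miss in the Uncached, Shared, and Exclusive cases described above. The structure layout, field names, and printf-modeled messages are assumptions made for illustration, not the slides' definitions.

/* Sketch of a directory entry with a presence bit vector (illustrative only). */
#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, SHARED, EXCLUSIVE } DirState;

typedef struct {
    DirState state;
    uint32_t sharers;      /* bit i set => node i has a copy (or is the owner) */
} DirEntry;

/* Home-node handling of a read miss from 'requester' for one block.
 * Message sends are modeled as printfs. */
void handle_read_miss(DirEntry *e, int requester) {
    switch (e->state) {
    case UNCACHED:
        printf("home -> node %d: data from memory\n", requester);
        e->sharers = 1u << requester;
        e->state = SHARED;
        break;
    case SHARED:
        printf("home -> node %d: data from memory\n", requester);
        e->sharers |= 1u << requester;          /* add requester to the sharing set */
        break;
    case EXCLUSIVE: {
        /* The owner must supply the data and downgrade its copy to Shared. */
        int owner = __builtin_ctz(e->sharers);  /* index of the single owner bit */
        printf("home -> node %d: fetch block, write back, go to Shared\n", owner);
        printf("home -> node %d: data (after owner write-back)\n", requester);
        e->sharers |= 1u << requester;          /* owner keeps a readable copy */
        e->state = SHARED;
        break;
    }
    }
}

int main(void) {
    DirEntry x = { UNCACHED, 0 };
    handle_read_miss(&x, 0);    /* node 0 reads: Uncached -> Shared            */
    handle_read_miss(&x, 2);    /* node 2 reads: stays Shared, sharers = {0,2} */
    printf("state=%d sharers=0x%x\n", x.state, x.sharers);
    return 0;
}

The write-miss and write-back cases from the slides would be handled analogously: invalidate every bit set in sharers, or clear the entry back to Uncached.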
Exercise
Apply the directory-based protocol to the following sequence of memory operations and show the status of the directories. Assume three processors A, B, and C, each having 1 GB of memory. Let data location x be in A's memory and location y in B's memory. The sequence of memory operations is: A reads x, B reads x, C reads x, A writes x, C writes x, B reads x, A reads y, B writes x, B reads y, B writes x, and B reads y.

Synchronization: The Basics
• Synchronization: necessary for the coordination of complex activities and for multi-threading.
• Atomic actions: actions that cannot be interrupted.
• Locks: mechanisms to protect a critical section, code that can be executed by only one thread at a time.
• Hardware support for locks.
• Thread coordination: multiple threads of a thread group need to act in concert.
• Mutual exclusion: only one thread at a time should be allowed to perform an action.

Atomicity requires hardware support:
• Test-and-set: an instruction that writes to a memory location and returns the old contents of that memory cell, non-interruptibly.
• Compare-and-swap: an instruction that compares the contents of a memory location to a given value and, only if the two values are the same, modifies the contents of that memory location to a given new value.
• The pair of load linked (or load locked) and a special store called store conditional. These instructions are used in sequence: if the contents of the memory location specified by the load linked are changed before the store conditional to the same address, the store conditional fails.

Atomic exchange on the memory location specified by the contents of R1:
try:  MOV   R3,R4      ;move exchange value
      LL    R2,0(R1)   ;load linked
      SC    R3,0(R1)   ;store conditional
      BEQZ  R3,try     ;branch if store fails
      MOV   R4,R2      ;put loaded value in R4
At the end of this sequence, the contents of R4 and the memory location specified by R1 have been atomically exchanged.

Atomic fetch-and-increment using LL and SC:
try:  LL     R2,0(R1)   ;load linked
      DADDUI R3,R2,#1   ;increment
      SC     R3,0(R1)   ;store conditional
      BEQZ   R3,try     ;branch if store fails

Implementing Locks
• Spin lock: a lock that a processor keeps trying to acquire in a loop.
• If the system supports cache coherence, the locks can be cached:
  – When a thread tries to acquire a lock, it spins on the local (cached) copy rather than requiring a global memory access.
  – Locality of lock access: a thread that used a lock is likely to need it again.
• Convention: lock = 0 means unlocked, lock = 1 means locked.
• Advantage of this scheme: it reduces memory traffic.
(Figure: cache coherence steps and bus traffic for three processors P0, P1, and P2.)
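A sketch of the cached spin lock described above, written with C11 atomics rather than the LL/SC assembly (an implementation-style assumption, not the slides' code). The inner load spins on the locally cached copy, and atomic_exchange plays the role of the atomic-exchange sequence shown earlier.

/* Test-and-test-and-set spin lock using C11 atomics (illustrative sketch). */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock = 0;            /* 0 => unlocked, 1 => locked */

void acquire(atomic_int *l) {
    for (;;) {
        /* Spin on the cached copy; no bus/ownership request while it reads 1. */
        while (atomic_load_explicit(l, memory_order_relaxed) != 0)
            ;                           /* busy-wait */
        /* Atomic exchange: old value 0 means we obtained the lock. */
        if (atomic_exchange_explicit(l, 1, memory_order_acquire) == 0)
            return;
    }
}

void release(atomic_int *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}

int main(void) {
    acquire(&lock);
    puts("in critical section");
    release(&lock);
    return 0;
}

The release is a plain store of 0, matching the lock = 0/1 convention on the slide; only when the holder releases does each waiter's cached copy get invalidated and the exchange retried.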
Models of Memory Consistency
• Multiple processors/cores should see a consistent view of memory.
• What properties should be enforced among reads and writes to different locations in a multiprocessor?
• Sequential consistency: the result of any execution is the same as if the memory accesses by each processor were kept in order and the accesses from different processors were arbitrarily interleaved.
  – Enforcing it requires a processor to delay any memory access until all invalidations caused by that access are completed.
  – Simple from the programmer's point of view.
  – Performance disadvantage.

Models of Memory Consistency: Example
Two threads are running on different processors; A and B are cached by each processor and have initial values of 0:
  P1: A = 1;  if (B == 0) ...        P2: B = 1;  if (A == 0) ...
It should be impossible for both if-statements to evaluate as true; the hardware should give this intuitive result.
Sequential consistency: the result of execution should be the same as long as
– accesses on each processor were kept in order, and
– accesses on different processors were arbitrarily interleaved.

• The simplest way to implement sequential consistency is to require a processor to delay the completion of any memory access until all the invalidations caused by that access are completed.
• Although sequential consistency presents a simple programming paradigm, it reduces potential performance.
• A more efficient implementation is to assume that programs are synchronized. A program is synchronized if all accesses to shared data are ordered by synchronization operations.
• Program-enforced synchronization forces the write on one processor to occur before the read on the other processor:
  – it requires a synchronization object for A and another for B,
  – "unlock" after the write,
  – "lock" before the read.

Problem
Suppose we have a processor where a write miss takes 50 cycles to establish ownership, 10 cycles to issue each invalidate after ownership is established, and 80 cycles for an invalidate to complete and be acknowledged once it is issued. Assuming that four other processors share a cache block, how long does a write miss stall the writing processor if the processor is sequentially consistent? Assume that the invalidates must be explicitly acknowledged before the coherence controller knows they are completed. Suppose we could continue executing after obtaining ownership for the write miss without waiting for the invalidates; how long would the write take?

Answer
When we wait for invalidates, each write takes the sum of the ownership time plus the time to complete the invalidates. Since the invalidates can overlap, we need only worry about the last one, which starts 10 + 10 + 10 + 10 = 40 cycles after ownership is established. Hence, the total time for the write is 50 + 40 + 80 = 170 cycles. In comparison, the ownership time is only 50 cycles. With appropriate write-buffer implementations, it is even possible to continue before ownership is established.

Relaxed Consistency Models
Rule X → Y means operation X must complete before operation Y is done.
Sequential consistency requires all four orderings: R → W, R → R, W → R, W → W.
Relaxations:
1. Relax W → R: "total store ordering".
2. Relax W → W: "partial store order".
3. Relax R → W and R → R: "weak ordering" and "release consistency".

THANK YOU
Rajeshwari B
Department of Electronics and Communication Engineering
rajeshwari@pes.edu