Token Coherence: Decoupling Performance and Correctness
Jason Bosko
January 30, 2008
Based on "Token Coherence: Decoupling Performance and Correctness" by Martin, Hill, and Wood

Outline
- Background and motivation for Token Coherence
- Decoupling performance and correctness
  - Correctness substrate
  - Performance protocol
- TokenB: an implementation of Token Coherence

Cache Coherence Background
We've already looked at two types of coherence protocols: snooping (MOESI) and directory.

Snooping
- Negatives? Requires a totally-ordered interconnect; broadcasts use up bandwidth.
- Positives? Avoids indirection on cache-to-cache misses.

Directory
- Negatives? Added level of indirection (data must be requested from the home node first).
- Positives? Works on any interconnect.

Incorrect Behavior Example
[Diagram: P0, P1, and memory exchanging messages 1-3]
Snooping needs a totally-ordered interconnect to guarantee correctness! Assuming no ordering on the interconnect:
1) P0 requests write access to memory block A, which is broadcast on the network. P1 hears it first, but does not have the data, so it enters the invalid state. The interconnect is slow to deliver the request to memory...
2) Now P1 wants read-only access to block A, and because P0's request has not yet reached the memory, the memory gives the block to P1.
3) P0's request finally arrives at the memory (out of order), and the memory block is sent to P0. P0 believes it has exclusive access to block A, while P1 believes it has a shared copy of A. This is incorrect!

Cache Coherence Background
Token Coherence wants to combine the best of both worlds: avoid indirection on cache misses (like snooping) while allowing any type of interconnect (like directory protocols).

Glueless Interconnect
- A "glued" interconnect includes discrete switch chips and can provide a totally-ordered interconnect.
- A glueless interconnect can be high-bandwidth and low-latency, but cannot provide ordering.
We want to be able to use a glueless interconnect, so we need a design that works for any interconnect.
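The race in the example above can be walked through as a toy script. This is a minimal sketch of the message ordering, not a real protocol implementation; the cache states and step comments simply replay the narration:

```python
# Toy replay of the incorrect-behavior example on an unordered
# interconnect (illustrative only; not a full MOESI model).

class Cache:
    def __init__(self, name):
        self.name = name
        self.state = "I"  # MSI-style state for block A

p0, p1 = Cache("P0"), Cache("P1")

# 1) P0 broadcasts a write request for block A. P1 snoops it first and
#    invalidates, but the copy headed to memory is delayed in the network.
p1.state = "I"

# 2) P1's later read request reaches memory first, so memory supplies a
#    shared copy to P1.
p1.state = "S"

# 3) P0's delayed write request finally arrives, and memory grants P0
#    exclusive access.
p0.state = "M"

# Incoherent outcome: one cache believes it is exclusive while another
# simultaneously holds a shared copy.
assert p0.state == "M" and p1.state == "S"
```

No ordering rule was available to stop step 2 from overtaking step 1, which is exactly the property a totally-ordered interconnect provides for snooping.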
Decoupling Performance and Correctness
The big idea: if you can separate correctness from performance, then any performance protocol can be used, because the correctness substrate will guarantee correct behavior.
Why is this important? The performance protocol can do whatever it wants without worrying about race conditions and corner cases, because the correctness substrate will take care of any problems.

Correctness Substrate
We need safety (coherent reads and writes) and starvation avoidance (every request eventually completes).
Use tokens! But how?

Tokens
- Each block of shared memory has a fixed set of tokens associated with it, with at least one token per processor in the system.
- To read a block of memory, you need at least one of its tokens.
- To write a block of memory, you need all of its tokens.
- In MSI terms: holding all tokens is like the modified state, holding at least one but not all tokens is like the shared state, and holding no tokens is like the invalid state.
But what about the actual data? How is it passed from cache to cache?

Tokens - Optimization
For each block of shared memory, exactly one of its tokens is the owner token. The owner token is always sent with valid data; other tokens may optionally be sent with data (to conserve bandwidth). A valid bit attached to each token message signifies whether the attached data is valid.
So tokens provide safety: never read unless you hold a token, never write unless you hold every token. But what about avoiding starvation?

Avoiding Starvation - Persistent Requests
When a processor detects starvation (via a timeout period, etc.), it initiates a persistent request. The request is broadcast, all nodes record the persistent request, and each forwards all of its tokens for that block to the requester. Once the processor gets its tokens, it performs its load or store, exits persistent mode, and broadcasts the deactivation so the other nodes can go back to their business.
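The token-counting safety rules can be sketched in a few lines. This is a minimal illustration of the invariants, assuming a hypothetical 4-processor system; the class and field names are mine, not from the paper:

```python
TOTAL_TOKENS = 4  # at least one token per processor; 4 processors assumed

class CacheBlock:
    """Per-block token state held by one cache (illustrative sketch)."""

    def __init__(self):
        self.tokens = 0          # tokens currently held for this block
        self.has_owner = False   # whether the owner token is among them
        self.data_valid = False  # valid bit for the attached data

    def can_read(self):
        # Safety rule: reading requires at least one token (and valid data).
        return self.tokens >= 1 and self.data_valid

    def can_write(self):
        # Safety rule: writing requires holding every token for the block.
        return self.tokens == TOTAL_TOKENS

    def msi_state(self):
        # The MSI analogy from the text.
        if self.tokens == TOTAL_TOKENS:
            return "M"
        return "S" if self.tokens >= 1 else "I"

blk = CacheBlock()
blk.tokens, blk.data_valid = 1, True
assert blk.can_read() and not blk.can_write() and blk.msi_state() == "S"
blk.tokens = TOTAL_TOKENS
assert blk.can_write() and blk.msi_state() == "M"
```

Because the total token count per block is fixed, no interleaving of messages can ever let one cache hold all tokens while another holds any, which is what makes the rules safe on an unordered interconnect.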
Any problems with this?

Persistent Requests
What if two processors try to initiate persistent requests at the same time? Each processor has an arbiter state machine to monitor persistent requests. Requests are forwarded to the home node of the requested memory block, whose arbiter is responsible for activating only one request at a time and keeps track of pending requests in a hardware table. When a request is activated and broadcast, each node must respond with an acknowledgment. Similarly, when the persistent request completes, the arbiter broadcasts the deactivation, and all nodes acknowledge it so that a new persistent request can be issued.

Performance Protocols
Technically, any protocol can work on top of the correctness substrate, because correctness is guaranteed; that is the beauty of decoupling performance and correctness! In general, high performance is achieved through the use of transient requests.

Transient Requests
A transient request is a request for a memory block that is not guaranteed to succeed, due to race conditions. Most transient requests succeed, and those that don't will eventually be replaced by persistent requests.

TokenB Performance Protocol
Everything so far has been plain "Token Coherence," which is really the idea behind the correctness substrate. The Token-Coherence-using-Broadcast (TokenB) performance protocol uses Token Coherence to guarantee correctness while using broadcast transient requests for high performance.
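The arbitration described above can be sketched as a small state machine. This is a hypothetical software model of a home-node arbiter, assuming one arbiter per home node and ignoring the acknowledgment handshakes; all names are illustrative:

```python
from collections import deque

class PersistentArbiter:
    """Home-node arbiter: activates one persistent request at a time."""

    def __init__(self):
        self.pending = deque()  # models the hardware table of queued requests
        self.active = None      # at most one active persistent request

    def request(self, processor, block):
        # A starving processor's persistent request arrives at the home node.
        self.pending.append((processor, block))
        self._activate_next()

    def _activate_next(self):
        if self.active is None and self.pending:
            self.active = self.pending.popleft()
            # In hardware, activation is broadcast here; every node must
            # acknowledge and forward its tokens to the requester.

    def complete(self):
        # The requester finished its load/store; deactivation is broadcast
        # and acknowledged before the next request can activate.
        self.active = None
        self._activate_next()

arb = PersistentArbiter()
arb.request("P0", "A")
arb.request("P1", "A")   # queued behind P0's active request
assert arb.active == ("P0", "A")
arb.complete()
assert arb.active == ("P1", "A")
```

Serializing activations through the home node is what resolves the "two simultaneous persistent requests" race: the second requester simply waits in the table until the first deactivates.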
Transient requests that are not fulfilled are reissued (after an exponential backoff) until they are fulfilled or the correctness substrate invokes a persistent request.

TokenB
How a component should respond to transient requests:

  State of component                     | Response to shared request      | Response to exclusive request
  ---------------------------------------+---------------------------------+-------------------------------
  Holds no tokens                        | Ignore request                  | Ignore request
  Holds only non-owner tokens            | Ignore request                  | Send all tokens (but no data)
  Holds owner token, but not every token | Send one token along with data  | Send all tokens with the data
  Holds all tokens                       | Send all tokens with the data   | Send all tokens with the data

Why not just send one token? Should the owner token be sent?

Example, Revisited
[Diagram: P0, P1, and memory exchanging messages 1-3]
No rules were broken, but P0 didn't get its write permission yet. What can P0 do?
1) P0 requests write access to memory block A, which is broadcast as a transient request. P1 hears it first, but can do nothing. The interconnect is slow to deliver the request to memory...
2) Now P1 broadcasts a transient read request, which arrives at memory first, so memory sends the owner token along with the data to P1.
3) P0's request finally arrives at the memory (out of order), so the memory is obligated to send the rest of its tokens to P0. Because P1 heard P0's request before it got the data, it never sends its newly acquired data or token. P0 must reissue its transient request (and, if that keeps failing, the correctness substrate falls back on a persistent request).

Obvious Issue?
Reissuing transient requests and issuing persistent requests seem like a huge hassle. Is it worth it? The majority of the time, transient requests are fulfilled; very rarely does it go so far as to issue a persistent request!

Performance
A glueless interconnect has a huge impact on bandwidth! TokenB still relies on broadcasts, though, so traffic is higher than in a directory protocol.

Things to Consider
- Is TokenB scalable? Can we avoid broadcasting?
- What kind of overhead do tokens have?
- This is a relatively recent paper (2003). In 2005, "Improving Multiple-CMP Systems Using Token Coherence" was published, so the idea might be catching on.
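The response table above can be captured as a small dispatch function. This is a minimal sketch of the table's logic only; the function name, parameters, and return convention are mine, and real responses would of course be interconnect messages:

```python
def respond(tokens, total, has_owner, request):
    """Return (tokens_to_send, send_data) per the TokenB response table.

    tokens:    how many tokens this component holds for the block
    total:     total tokens that exist for the block
    has_owner: whether the owner token is among those held
    request:   "shared" or "exclusive"
    """
    if tokens == 0:
        return (0, False)                # holds no tokens: ignore either request
    if not has_owner:
        # Only non-owner tokens: ignore shared requests; for exclusive
        # requests, send all tokens but no data.
        return (tokens, False) if request == "exclusive" else (0, False)
    if request == "shared" and tokens < total:
        # Owner but not every token: send one token along with the data.
        return (1, True)
    # Owner responding to an exclusive request, or holder of all tokens:
    # send every token with the data.
    return (tokens, True)

assert respond(0, 4, False, "shared") == (0, False)
assert respond(2, 4, False, "exclusive") == (2, False)
assert respond(2, 4, True, "shared") == (1, True)
assert respond(4, 4, True, "exclusive") == (4, True)
```

Note that data travels only with responses from the owner, which is exactly why the slide asks whether the owner token itself should be handed over on a shared request.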