Directory-based Cache Coherence Large-scale multiprocessors Directory contents • for each memory block, the directory shows: – is it cached? – if cached, where? – if cached, clean or dirty? • full directory has complete info for all blocks – n processor coherence => (n+1) bit vector Terminology • cache line, L • home node, j: every block has a fixed “home” • directory vector: a bit vector, V – really, V(L) – bit 0 set if dirty, reset if clean – bit i set => block is cached at processor i The easy cases • read hit (clean or dirty), or • write hit (dirty) – no directory action required – serve from local cache Handling read misses at processor i • read miss => send request to home node, j – L is not cached, or cached and clean => send it to processor i, set bit i in vector – L is cached and dirty => read owner from vector – if cached on the home node, write back to memory, reset dirty bit, send data to i, set bit i – if cached elsewhere, j asks owner to write back to home node memory, resets dirty bit on receipt, then j sends line directly to i, home node sets bit i Handling write misses at processor i • i sends the request for block L to the home node, j • if L is not cached anywhere, j sends L to i and sets bits 0 and i • if L is cached and clean, j sends invalidation to all sharers, resets their bits, sends L to i, and sets bits 0 and i • if L is cached and dirty, it can only be dirty at one node. – if L is cached at the home node, write back to memory, invalidate L in cache j, clear bit j, send data to i, set bits 0 and i – if L is cached at node k, request a write back from k to home node memory, clear bit k, send data to i, set bits 0 and i Handling write hits at processor i • if clean, same process as write miss on a clean line – except data does not need to be forwarded from j to i. So: invalidate, reset bits, and set bits 0 and i • if dirty, it can only be dirty in one cache – and that must be cache i – just return the value from the local cache Replacement (at any processor) • if the line is dirty, write back to the home node and clear the bit vector • if the line is clean, just reset bit i – this avoids unnecessary invalidations later Directory organization • Centralized vs. distributed – centralized directory helps to resolve many races, but becomes a bandwidth bottleneck – one solution is to provide a banked directory structure: associate each directory bank with its memory bank – but: memory is distributed, so this leads to a distributed directory structure where each node holds the directory entries corresponding to the memory blocks for which it is the home Hierarchical organization • organize the processors as the leaves of a logical tree (need not be binary) • each internal node stores the directory entries for the memory lines local to its children • a directory entry indicates which of its children subtrees are caching the line and whether a nonchild subtree is caching the line • finding a directory entry requires a tree traversal • inclusion is maintained between level k and k+1 directory nodes • in the worst case may have to go to the root • hierarchical schemes are not used much due to high latency and volume of messages (up and down the tree); also the root may become a bottleneck Format of a directory entry • many possible variations • in-memory vs. cache-based are just two possibilities • memory-based bit vector is very popular: invalidations can be overlapped or multicast • cache-based schemes incur serialized message chain for invalidation In-memory directory entries • directory entry is co-located in the home node with the memory line – most popular format is a simple bit vector – with 128 B lines, storage overhead for 64 nodes is 6.35%, for 256 nodes 25%, for 1024 nodes 100% Cache-based directory entries • directory is a distributed linked-list where the sharer nodes form a chain – cache tag is extended to hold a node number – home node only knows the ID of the first sharer – on a read miss the requester adds itself to the head (involves home and first sharer) – a write miss requires the system to traverse list and invalidate (serialized chain of messages) – distributes contention and does not make the home node a hot-spot, and storage overhead is fixed; but very complex (IEEE SCI standard) Directory overhead • quadratic in number of processors for bit vector – assume P processors, each with M bytes of local memory (so total shared memory is M*P) – let coherence granularity (memory block size) = B – number of memory blocks per node = M/B = number of directory entries per node – size of one directory entry = P + O(1) – total size of directory across all processors = (M/B)(P+O(1))*P = O(P2) Reducing directory storage overhead • common to group a number of nodes into a cluster and have one bit per cluster – for example, all cores on the same chip • leads to useless invalidations for nodes in the cluster that aren’t sharing the invalidated block • trade-off is between precision of information and performance Overflow schemes • How can we make the directory size independent of the number of processors? – use a vector with a fixed number of entries, where the entries are now node IDs rather than just bits – use the usual scheme until the total number of sharers equals the number of available entries – when the number of sharers overflows the directory, the hardware resorts to an “overflow scheme” • DiriB: i sharer bits, broadcast invalidation on overflow • DiriNB: pick one sharer and invalidate it • DiriCV: assign one bit to a group of nodes of size P/i; broadcast invalidations to that group on a write Overflow schemes • DiriDP (Stanford FLASH) – DP stands for dynamic pointer – allocate directory entries from a free list pool maintained in memory – how do you size it? – may run into reclamation if free list pool is not sized properly at boot time – need replacement hints – if replacement hints are not supported, assume k sharers on average per memory block (k=8 is found to be good) – reclamation algorithms? • pick a random cache line and invalidate it Overflow schemes • DiriSW (MIT Alewife) – trap to software on overflow – software maintains the information about sharers that overflow – MIT Alewife has directory entry of five pointers plus a local bit (i.e. overflow threshold is five or six) • remote read before overflow takes 40 cycles and after overflow takes 425 cycles • five invalidations take 84 cycles while six invalidations take 707 cycles Sparse directory • Observation: total number of cache lines in all processors is far less than total number of memory blocks • Assume a 32 MB L3 cache and 4 GB memory: less than 1% of directory entries are active at any point in time • Idea is to organize directory as a highly associative cache • On a directory entry “eviction” send invalidations to all sharers or retrieve line if dirty When is a directory useful? • before looking up the directory you cannot decide what to do - even if you start reading memory speculatively • directory introduces one level of indirection in every request that misses in processor’s cache • snooping is preferable if the system has enough memory controller and router bandwidth to handle broadcast messages; AMD Opteron adopted this scheme, but target is small scale • directory is preferable if number of sharers is small because in this case a broadcast would waste enormous amounts of memory controller and router bandwidth • in general, directory provides far better utilization of bandwidth for scalable MPs compared to broadcast Extra slides (not covered in class) • more details about directory protocols (mostly to do with implementation and performance) on the following slides for those interested Watch the writes! • frequently written cache lines exhibit a small number of sharers; so small number of invalidations • widely shared data are written infrequently; so large number of invalidations, but rare • synchronization variables are notorious: heavily contended locks are widely shared and written in quick succession generating a burst of invalidations; require special solutions such as queue locks or tree barriers Interventions • interventions are very problematic because they cannot be sent before looking up the directory; any speculative memory lookup would be useless • few interventions in scientific applications due to one producer-many consumer pattern • many interventions for database workloads due to migratory pattern; number tends to increase with cache size Optimizing for sharing • optimizing interventions related to migratory sharing has been a major focus of high-end scalable servers – AlphaServer GS320 employs few optimizations to quickly resolve races related to migratory hand-off – some research looked at destination or owner prediction to speculatively send interventions even before consulting the directory (Martin and Hill 2003, Acacio et al 2002) – another idea was to generate an early writeback so the next request can find it in home memory instead of coming to the owner to get it (Lai and Falsafi 2000) Path of a read miss • assume that the line is not shared by anyone – load issues from load queue (for data) or fetcher accesses icache; looks up TLB and gets PA – misses in L1, L2, L3,… caches – launches address and request type on local system bus – request gets queued in memory controller and registered in OTT or TTT (Outstanding Transaction Table or Transactions in Transit Table) – memory controller eventually schedules the request – decodes home node from upper few bits of address – local home: access directory and data memory – remote home: request gets queued in network interface Path of a read miss • from the network interface onward – eventually the request gets forwarded to the router and through the network to the home node – at the home the request gets queued in the network interface and waits for scheduling by the memory controller – after scheduling, the home memory controller looks up directory and data memory – reply returns through the same path Correctness issues • serialization to a location – schedule order at home node – use NACKs creating extra traffic and possible livelock – or smarter techniques like back-off (NACK-free) • flow control deadlock – avoid buffer dependence cycles – avoid network queue dependence cycles – virtual networks multiplexed on physical networks Virtual networks • consider a two-node system with one incoming and one outgoing queue on each node P0 P1 Virtual channels Physical network • single queue is not enough to avoid deadlock – single queue forms a single virtual network Virtual networks • similar deadlock issues as multi-level caches – incoming message may generate another message e.g., request generates reply, ReadX generates reply and invalidation requests, request may generate intervention request – memory controller refuses to schedule a message if the outgoing queue is full – same situation may happen on all nodes: deadlock – one incoming and one outgoing queue is not enough – what if we have two in each direction? One for request and one for reply – what about requests generating requests? Virtual networks • what is the length of the longest transaction in terms of number of messages? – this decides the number of queues needed in each direction – one type of message is usually assigned to a queue – one queue type connected across the system forms a virtual network of that type e.g. request network, reply network, third party request (invalidations and interventions) network – virtual networks are multiplexed over a physical network • sink message type must get scheduled eventually – resources should be sized properly so that scheduling of these messages does not depend on anything – avoid buffer shortage (and deadlock) by keeping reserved buffer for the sink queue Three-lane protocols • quite popular due to simplicity – let the request network be R, reply network Y, intervention/invalidation network be RR – network dependence (aka lane dependence) graph looks something like this Y R RR Performance issues • latency optimizations – overlap activities: protocol processing and data access, invalidations, invalidation acknowledgments – make critical path fast: directory cache, integrated memory controller, smart protocol – reduce occupancy of protocol engine • throughput optimizations – pipeline the protocol processing – multiple coherence engines – protocol decisions: where to collect invalidation acknowledgments, existence of clean replacement hints