CS 152 Computer Architecture and Engineering
Lecture 14 - Cache Design and Coherence
2014-3-6
John Lazzaro (not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
CS 152 L14: Cache Design and Coherency -- UC Regents Spring 2014 © UCB

Today: Shared Cache Design and Coherence
Multithreading: CPU multi-threading keeps the memory system busy (Sun Microsystems Niagara series).
Crossbars and Rings: How to do on-chip sharing -- concurrent requests, and interfaces that don’t stall.
Coherency Protocols: Building coherent caches.
[Figure: several multi-threaded CPUs with private caches share lower-level caches, DRAM, and I/O ports.]

The case for multithreading
Some applications spend their lives waiting for memory (C = compute, M = waiting). Amdahl’s Law tells us that optimizing C is the wrong thing to do ...
Idea: Create a design that can multiplex threads onto one pipeline.
Goal: Maximize throughput of a large number of threads.

Multi-threading: Assuming perfect caches
Interleave four threads on one pipeline, so each behaves like a CPU running at 1/4 the clock (S. Cray, 1962). Pipeline stage labels show threads T1-T4 in flight at once.
Because adjacent pipeline stages hold instructions from different threads, the bypass network is no longer needed. Result: the critical path shortens -- this can be traded for speed or power.
[Figure: the pipeline (ID, EX, MEM, WB) with per-stage IR registers and the bypass muxes removed.]

Multi-threading: Supporting cache misses
A thread scheduler keeps track of information about all threads that share the pipeline. When a thread experiences a cache miss, it is taken off the pipeline during the miss penalty period.

Sun Niagara II: how many threads per core? 8 threads/core: enough to keep one core busy, given clock speed, memory system latency, and target application characteristics.
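The thread scheduler described above can be sketched in software. This is a minimal illustrative model, not Niagara's actual design; the class and method names are invented for the sketch. Ready threads are issued round-robin, and a thread parked on a cache miss is skipped until its data returns.

```python
# Minimal sketch (assumption: round-robin selection) of a scheduler that
# takes a thread off the pipeline while its cache miss is outstanding.
class ThreadScheduler:
    def __init__(self, n_threads):
        self.n = n_threads
        self.stalled = [False] * n_threads   # True while a miss is outstanding
        self.last = n_threads - 1            # last thread issued

    def miss(self, tid):
        """Thread tid took a cache miss: remove it from the rotation."""
        self.stalled[tid] = True

    def fill(self, tid):
        """Miss data returned: thread tid may issue again."""
        self.stalled[tid] = False

    def next_thread(self):
        """Pick the next ready thread round-robin; None if all are stalled."""
        for i in range(1, self.n + 1):
            tid = (self.last + i) % self.n
            if not self.stalled[tid]:
                self.last = tid
                return tid
        return None
```

With 4 threads, issue order is 0, 1, 2, 3, 0, ...; after `miss(2)`, thread 2 is skipped until `fill(2)` is called.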
Crossbar Networks

Shared-memory multiprocessing
CPUs share the lower levels of the memory system, and I/O: a common address space, one operating system image. Communication occurs through the memory system (100 ns latency, 20 GB/s bandwidth).
Sun’s Niagara II: a single-chip implementation (SPC == SPARC Core). Only DRAM is not on the die.

Crossbar: like N ports on an N-register file
Flexible, but reads slow down as O(N^2). Why? The number of loads on each register’s Q output grows as O(N), and the wire length to the port mux grows as O(N).
[Figure: a 32-entry register file (R0 is the constant 0) with write decoding on sel(ws) and read-port muxes on sel(rs1)/sel(rs2) driving rd1 and rd2.]

Design challenge: High-performance crossbar
Niagara II: 8 cores, 8 L2 banks, 4 DRAM channels. Apps are locality-poor. Goal: saturate DRAM BW.
Each DRAM channel: 50 GB/s read, 25 GB/s write BW. Crossbar BW: 270 GB/s total (read + write).

Sun Niagara II 8 x 9 crossbar
A tri-state distributed mux, as in the microcode talk. Every crossing of blue and purple wires is a tri-state buffer with a unique control signal: 72 control signals (if distributed unencoded).
8 ports on the CPU side (one per core), 100-200 wires per port (each way); 8 ports for the L2 banks, plus one for I/O.
4-cycle latency (715 ps/cycle): cycles 1-3 are for arbitration, and data transmits on cycle 4. Pipelined.

A complete switch transfer (4 epochs)
Epoch 1: All input ports (that are ready to send data) request an output port.
Epoch 2: Allocation algorithm decides which inputs get to write.
Epoch 3: Allocation system informs the winning inputs and outputs.
Epoch 4: Actual data transfer takes place.
Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, for different sets of requests.

The Allocation Problem (4 x 4)
A 1 codes that an input has data ready to send to an output:

Requests         W X Y Z
Input port A:    0 0 1 0
Input port B:    1 0 0 0
Input port C:    0 0 1 0
Input port D:    1 0 0 0

The allocator returns a matrix with at most one 1 in each row and column, to set the switches:

Grants           W X Y Z
Input port A:    0 0 1 0
Input port B:    0 0 0 0
Input port C:    0 0 0 0
Input port D:    1 0 0 0

The algorithm should be “fair”, so no port always loses ... it should also “scale” to run large matrices fast.

Sun Niagara II crossbar notes
Low latency: 4 cycles (less than 3 ns). Uniform latency between all port pairs. The crossbar defines the floorplan: all port devices should be equidistant to the crossbar.
Energy: the crossbar is only 1% of total power.
It did not scale up for the 16-core Rainbow Falls: Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port between two cores.

Design alternatives to the crossbar ...

CLOS Networks: From the telecom world ...
Build a high-port-count switch by tiling fixed-size shuffle units. Pipeline registers fit naturally between the tiles. Accepts extra latency in exchange for scalability.
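The 4 x 4 allocation example above can be sketched as a single-pass grant loop. This is a minimal illustrative sketch, not Niagara's allocator: a rotating priority pointer is one simple (assumed here) way to approximate fairness, and real hardware allocators iterate and pipeline this decision over several cycles.

```python
# Minimal sketch of the allocation step: given a request matrix,
# return a grant matrix with at most one 1 per row and per column.
# `priority` is a rotating pointer so no input port always loses.
def allocate(requests, priority=0):
    n_in, n_out = len(requests), len(requests[0])
    grants = [[0] * n_out for _ in range(n_in)]
    out_taken = [False] * n_out
    # Visit input ports starting at the rotating priority pointer.
    for k in range(n_in):
        i = (priority + k) % n_in
        for j in range(n_out):
            if requests[i][j] and not out_taken[j]:
                grants[i][j] = 1      # input i wins output j
                out_taken[j] = True
                break                 # at most one grant per input
    return grants
```

With the lecture's request matrix and `priority=0`, A wins Y and B wins W while C and D lose; rotating the pointer to `priority=2` lets C and D win instead, which is the "fairness" property the slide asks for.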
CLOS Networks: An example route
Numbers on the left and right are port numbers; colors show the routing paths for an exchange. Arbitration is still needed to prevent blocking.

Ring Networks

Intel Xeon: the data center server chip
20% of Intel’s revenues, 40% of profits. Why? The cloud is growing, and Xeon is dominant.
Compiled chips: Xeon is a chip family, varying by # of cores and L3 cache size. The family’s mask layouts are generated automatically, by adding core/cache slices.
Ring bus: a bi-directional ring bus connects cores, cache banks, DRAM controllers, and off-chip I/O. The chip compiler might size the ring bus to scale bandwidth with the # of cores. Ring latency increases with the # of cores, but compared to the baseline latency the increase is small.
Ring stop: [Die photo: a 2.5 MB L3 cache slice from a Xeon E5; the tiles along the x-axis are the 20 ways of the cache.] The ring stop interface lives in the Cache Control Box (CBOX).
The ring bus (perhaps 1024 wires) carries address, data, and header fields (sender #, recipient #, command).
[Figure: three ring stops on the 1024-wire ring; the Ring Stop #2 interface senses Data Out, Empty, and Control, and drives Data In.]
Reading: Sense Data Out to see if the message is for Ring Stop #2. If so, latch the data, and mux Empty onto the ring.
Writing: Check if Data Out is Empty. If so, mux a message onto the ring via the Data In port.
In practice: “extreme EE” to co-optimize bandwidth and reliability.
Debugging: a “network analyzer” built into the chip captures ring messages of a particular kind, sent off chip via an aux port.
A derivative of this ring bus is also used on laptop and desktop chips.
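The reading and writing rules above can be sketched with a toy model. This is an illustrative sketch only, not Xeon's implementation: the ring is a list of slots that rotates one position per cycle, `EMPTY` marks a free slot, and the message field names are invented.

```python
# Minimal sketch of ring-stop behavior: latch messages addressed to this
# stop (replacing them with EMPTY), and inject waiting messages only
# into empty slots, as described in the reading/writing rules above.
EMPTY = None

class RingStop:
    def __init__(self, stop_id):
        self.stop_id = stop_id
        self.received = []       # messages latched at this stop
        self.to_send = []        # messages waiting for an empty slot

    def on_slot(self, slot):
        """Called once per cycle with the slot now passing this stop."""
        # Reading: if the message is for us, latch it and mux EMPTY on.
        if slot is not EMPTY and slot["dest"] == self.stop_id:
            self.received.append(slot)
            slot = EMPTY
        # Writing: if the slot is (now) empty, mux our message onto it.
        if slot is EMPTY and self.to_send:
            slot = self.to_send.pop(0)
        return slot

def tick(ring, stops):
    """Advance every message one stop, then let each stop act on its slot."""
    ring = ring[-1:] + ring[:-1]
    return [stops[i].on_slot(ring[i]) for i in range(len(ring))]
```

For example, a message injected at stop 0 for stop 2 circulates past stop 1 (which ignores it) and is latched at stop 2, leaving its slot empty again.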
Break

Hit-over-Miss Caches

Recall: a CPU-cache port that doesn’t stall on a miss
The CPU makes a request by placing the following items in Queue 1 (CPU to cache):
CMD: Read, write, etc ...
MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit.
TAG: 9-bit number identifying the request.
MADDR: Memory address of the first byte.
STORE-DATA: For stores, the data to store.
(This cache is used in an ASPIRE CPU, Rocket.)
When the request is ready, the cache places the following items in Queue 2 (cache to CPU):
TAG: Identity of the completed command.
LOAD-DATA: For loads, the requested data.
The CPU saves info about requests, indexed by TAG. Why use the TAG approach? Multiple misses can proceed in parallel, and loads can return out of order.

Today: How a read request proceeds in the L1 D-Cache
The CPU requests a read by placing MTYPE, TAG, and MADDR in Queue 1. “We” == the L1 D-Cache controller. We do a normal cache access. If there is a hit, we place the load result in Queue 2 ... In the case of a miss, we use the Inverted Miss Status Holding Register.

Inverted MSHR (Miss Status Holding Register) (1)
To look up a memory address: associatively look up the block # of the memory address in the table. If there are no hits, do a memory request.
[Figure: a 512-entry table, so that every 9-bit TAG value has an entry. Each entry holds a valid bit, MTYPE, the cache block # (43 bits), and the first byte in the block (5 bits); a hardwired Tag ID ROM (0-511) names each entry, and Valid qualifies Hit. Assumptions: 32-byte blocks, 48-bit physical address space.]

Inverted MSHR (Miss Status Holding Register) (2)
Index into the table using the 9-bit TAG, and set all fields using the MADDR and MTYPE queue values. This indexing always finds V=0, because the CPU promises not to reuse in-flight tags.
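Steps (1) and (2) above, together with the fill path when the memory system returns data, can be sketched behaviorally. This is an illustrative software model (class and method names invented), not the hardware's comparator array; 32-byte blocks are assumed as in the slides.

```python
# Minimal behavioral sketch of the inverted MSHR: a 512-entry table
# indexed by the 9-bit TAG, associatively searched by cache-block number.
BLOCK_BYTES = 32

class InvertedMSHR:
    def __init__(self, n_tags=512):
        self.valid = [False] * n_tags
        self.block = [0] * n_tags      # cache block # of the miss
        self.offset = [0] * n_tags     # first byte in block
        self.mtype = [None] * n_tags

    def miss(self, tag, maddr, mtype):
        """Record a miss under `tag`. Returns True if a memory request
        is needed (no hit: no in-flight miss already covers this block)."""
        blk = maddr // BLOCK_BYTES
        # (1) Associative search: is this block already in flight?
        already = any(self.valid[t] and self.block[t] == blk
                      for t in range(len(self.valid)))
        # (2) Index by TAG; the CPU promises the entry is free (V=0).
        assert not self.valid[tag]
        self.valid[tag] = True
        self.block[tag] = blk
        self.offset[tag] = maddr % BLOCK_BYTES
        self.mtype[tag] = mtype
        return not already

    def fill(self, blk):
        """(3) Memory returned block `blk`: find all pending transactions
        on it, clear their valid bits, and return their TAGs (which would
        be placed in Queue 2 along with the load data)."""
        done = [t for t in range(len(self.valid))
                if self.valid[t] and self.block[t] == blk]
        for t in done:
            self.valid[t] = False
        return done
```

Note how two misses to the same block generate only one memory request, and one fill completes both: this is how multiple misses proceed in parallel under the TAG scheme.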
Inverted MSHR (Miss Status Holding Register) (3)
Whenever the memory system returns data, associatively look up the block # to find all pending transactions. Place the transaction data for all hits in Queue 2, and clear the valid bits. Also update the L1 cache.

Inverted MSHR notes
Structural hazards only occur when the TAG space is exhausted by the CPU.
High cost (# of comparators + SRAM cells). See Farkas and Jouppi, on the class website, for low-cost designs that are often good enough.
We will return to MSHRs to discuss CPI performance later in the semester.

Coherency Hardware

Cache Placement

Two CPUs, two caches, shared DRAM ...
CPU0: LW R2, 16(R0)
CPU1: LW R2, 16(R0)
CPU1: SW R0, 16(R0)   (write-through)
Afterwards, CPU0’s cache holds (addr 16, value 5), while CPU1’s cache and shared main memory hold (addr 16, value 0). The view of memory is no longer “coherent”: loads of location 16 from CPU0 and CPU1 see different values!

The simplest solution ... one cache!
The CPUs have no internal caches; a memory switch connects them to a shared multi-bank cache in front of shared main memory. With only one cache, different values for a memory address cannot appear in two caches! The multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank; in that case, one CPU waits.

Not a complete solution ... but good for L2. (Sequent Systems, 1980s.)
For modern clock rates, access to a shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing the clock down 10X for LWs. Not good. This approach was a complete solution in the days when DRAM row access time and CPU clock period were well matched.

Modified form: Private L1s, shared L2
Each CPU keeps private L1 caches; a memory switch or bus connects them to a shared multi-bank L2 cache and shared main memory. Thus, we need to solve the cache coherency problem for the L1 caches.
Advantages of a shared L2 over private L2s: processors communicate at cache speed, not DRAM speed, and there is constructive interference if both CPUs need the same data/instructions. Disadvantage: the CPUs share bandwidth to the L2 cache ...

IBM Power 4 (2001): dual core; shared, multi-bank L2 cache; private L1 caches; off-chip L3 caches.

Cache Coherency

Cache coherency goals ...
1. Only one processor at a time has write permission for a memory location.
2. No processor can load a stale copy of a location after a write.
Simple implementation: Snoopy caches
Each CPU’s cache has a snooper on the shared memory bus, in front of the shared main memory hierarchy. Each cache has the ability to “snoop” on the memory bus transactions of the other CPUs. The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate cache lines of other CPUs.

Writes from 10,000 feet ... for write-thru L1 caches
1. The writing CPU takes control of the bus.
2. The address to be written is invalidated in all other caches. Reads will no longer hit in those caches and get stale data.
3. The write is sent to main memory. Reads of the address will cache miss, and retrieve the new value from main memory.
To first order, reads will “just work” if write-thru caches implement this policy. This is a “two-state” protocol (cache lines are “valid” or “invalid”).

Limitations of the write-thru approach
Every write goes to the bus. In modern practice, total bus write bandwidth does not support more than 2 CPUs. To scale further, we need to use write-back caches.
The write-back big trick: add extra states. The simplest version is MSI -- Modified, Shared, Invalid. More efficient versions add more states (MESI adds Exclusive). The state definitions are subtle ... Figure 5.5, page 358, is the best starting point.

Read misses ... for a MESI protocol, with write-back caches
1. A cache requests a cache-line fill for a read miss.
2. Another cache with an exclusive copy of this line responds with fresh data, so the read miss will not hit main memory and retrieve stale data.
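The two-state write-through protocol sketched in the steps above can be modeled in a few lines. This is a minimal behavioral sketch (class and method names invented, with the bus serializing one write at a time), replaying the lecture's two-CPU example for location 16.

```python
# Minimal sketch of the two-state (valid/invalid) write-thru snoopy
# protocol: every write goes on the bus, snoopers invalidate their copy,
# and the write goes through to main memory.
class WriteThruCache:
    def __init__(self):
        self.lines = {}                 # addr -> value (valid lines only)

    def snoop_write(self, addr):
        """Another CPU wrote addr: invalidate our copy, so future reads
        miss and fetch the new value instead of stale data."""
        self.lines.pop(addr, None)

class Bus:
    def __init__(self, memory):
        self.memory = memory            # shared main memory (addr -> value)
        self.caches = []

    def write(self, writer, addr, value):
        # 1. Writing CPU takes control of the bus.
        # 2. The address is invalidated in all other caches.
        for c in self.caches:
            if c is not writer:
                c.snoop_write(addr)
        # 3. The write is sent to main memory (write-through).
        self.memory[addr] = value
        writer.lines[addr] = value

    def read(self, reader, addr):
        if addr not in reader.lines:    # miss: fetch the current value
            reader.lines[addr] = self.memory[addr]
        return reader.lines[addr]
```

Replaying the earlier incoherence example: both CPUs load address 16 (value 5); when CPU1 stores 0, CPU0's copy is invalidated, so CPU0's next load misses and sees 0 rather than the stale 5.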
3. The responding cache changes the line from exclusive to modified. Future writes will go to the bus to be ...
These sketches are just to give you a sense of how coherency protocols work. Deep understanding requires understanding the complete state machines.

The snoopy mechanism doesn’t scale ...
Single-chip implementations have moved to a centralized “directory” service that tracks the status of each line of each private cache. Multi-socket systems use distributed directories.

Directories attached to the on-chip cache network ...
A 2-socket system ... each socket a multi-core chip, and each chip with its own bank of DRAM. Distributed directories for multi-socket systems: each chip holds the directory for its own DRAM (one for Chip 0’s DRAM, one for Chip 1’s DRAM), attached to the on-chip network alongside the L1s and L2s.
Figure 5.21, page 381, covers directory message basics. Conceptually similar to snoopy caches ... but the different mechanisms require rethinking the protocol to get correct behaviors.

Other Machine Architectures

NUMA: Non-uniform Memory Access
CPU 0 ... CPU 1023, each with its own cache, and with part of main memory (DRAM) attached to it. To access other parts of main memory, a CPU uses the interconnection network. The network uses a coherent global address space: directory protocols over fiber networking. For best results, applications take the non-uniform memory latency into account.

Clusters: The supercomputing version of WSCs
Connect large numbers of 1-CPU or 2-CPU rack-mount computers together with high-end network technology (not normal Ethernet).
Instead of using hardware to create a shared memory abstraction, let the application build its own memory model. Example: the University of Illinois cluster of 650 2-CPU Apple Xserve machines, connected with Myrinet (3.5 μs ping time - low latency).

On Tuesday: we return to CPU design ...
Have a good weekend!