CS 152
Computer Architecture and Engineering
Lecture 14 - Cache Design and Coherence
2014-3-6
John Lazzaro
(not a prof - “John” is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/
Today: Shared Cache Design and Coherence

CPU multi-threading: Keeps the memory system busy.
Crossbars and Rings: How to do on-chip sharing.
Concurrent requests: Interfaces that don't stall.
Coherency Protocols: Building coherent caches.
Multithreading case study: Sun Microsystems Niagara series.

[Figure: CPUs with private caches sharing lower-level caches, DRAM, shared ports, and I/O.]
The case for multithreading

Some applications spend their lives waiting for memory.

[Figure: execution timeline; C = compute, M = waiting for memory.]

Amdahl's Law tells us that optimizing C is the wrong thing to do ...

Idea: Create a design that can multiplex threads onto one pipeline.
Goal: Maximize throughput of a large number of threads.
Multi-threading: Assuming perfect caches

4 CPUs running @ 1/4 clock. S. Cray, 1962.

[Figure: multi-threaded pipeline; labels T1-T4 show which thread occupies each stage in this state.]
Bypass network is no longer needed ...

Result: Critical path shortens -- can trade for speed or power.

[Figure: pipeline datapath (ID/Decode, EX, MEM, WB) with per-stage IR registers; the WE and MemToReg control from WB remains, but the bypass muxes and logic are gone.]
Multi-threading: Supporting cache misses

A thread scheduler keeps track of information about all threads that share the pipeline. When a thread experiences a cache miss, it is taken off the pipeline during the miss penalty period.

[Figure: pipeline with a thread scheduler choosing which thread issues.]
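A behavioral sketch of this policy (Python; the thread states, round-robin pick, and cycle-countdown miss model are illustrative assumptions, not Niagara's actual scheduler):

```python
# Hedged sketch: scheduler that parks a thread on a miss and issues
# from the remaining ready threads, round-robin.

READY, WAITING = "ready", "waiting"

class Thread:
    def __init__(self, tid):
        self.tid = tid
        self.state = READY
        self.miss_countdown = 0   # cycles left in the miss penalty period

class ThreadScheduler:
    def __init__(self, threads):
        self.threads = threads
        self.next_idx = 0

    def on_cache_miss(self, thread, penalty_cycles):
        # Take the thread off the pipeline during the miss penalty period.
        thread.state = WAITING
        thread.miss_countdown = penalty_cycles

    def tick(self):
        # Wake threads whose miss penalty has elapsed.
        for t in self.threads:
            if t.state == WAITING:
                t.miss_countdown -= 1
                if t.miss_countdown <= 0:
                    t.state = READY

    def pick_next(self):
        # Issue from the next ready thread, round-robin; None = bubble.
        for i in range(len(self.threads)):
            t = self.threads[(self.next_idx + i) % len(self.threads)]
            if t.state == READY:
                self.next_idx = (self.next_idx + i + 1) % len(self.threads)
                return t
        return None
```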
Sun Niagara II: # threads/core?

8 threads/core: Enough to keep one core busy, given clock speed, memory system latency, and target application characteristics.
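A back-of-the-envelope view of why a fixed thread count can cover memory latency. All numbers below are assumed for illustration, not Niagara II's actual parameters:

```python
# Illustrative arithmetic only: how many threads hide a cache miss?
miss_penalty_cycles = 140   # assumed DRAM access time, in core cycles
cycles_between_misses = 20  # assumed compute per thread before it misses

# While one thread waits, the others must supply enough work to fill
# the pipeline: threads ~= 1 + penalty / compute-per-thread.
threads_needed = 1 + miss_penalty_cycles / cycles_between_misses
print(threads_needed)  # 8.0 threads, with these assumed numbers
```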
Crossbar Networks
Shared-memory

[Figure: CPUs with private caches connected to shared caches, DRAM, and shared I/O ports.]

CPUs share the lower levels of the memory system, and I/O. Common address space, one operating system image. Communication occurs through the memory system (100 ns latency, 20 GB/s bandwidth).
Sun's Niagara II: Single-chip implementation ...
SPC == SPARC Core. Only DRAM is not on-chip.
Crossbar: Like N ports on an N-register file

Flexible, but ... reads slow down as O(N^2). Why? The number of loads on each register's Q output grows as O(N), and the wire length to the port mux grows as O(N).

[Figure: 32-bit register file (R0 hard-wired to the constant 0; R1-R31 with write-enable decode from sel(ws) and WE) feeding per-port read muxes, sel(rs1) -> rd1 and sel(rs2) -> rd2.]
Design challenge: High-performance crossbar
Niagara II: 8 cores, 8 L2 banks, 4 DRAM channels.
Apps are locality-poor. Goal: saturate DRAM BW.
Each DRAM channel: 50 GB/s Read, 25 GB/s Write BW.
Crossbar BW: 270 GB/s total (Read + Write).
Sun Niagara II: 8 x 9 Crossbar

Tri-state distributed mux, as in the microcode talk.

[Figure: every crossing of the blue and purple wires is a tri-state buffer with a unique control signal -- 72 control signals (if distributed unencoded).]
Sun Niagara II: 8 x 9 Crossbar

8 ports on the CPU side (one per core); 8 ports for L2 banks, plus one for I/O. 100-200 wires/port (each way). 4-cycle latency (715 ps/cycle): cycles 1-3 are for arbitration, data transmits on cycle 4. Pipelined.
A complete switch transfer (4 epochs)

Epoch 1: All input ports (that are ready to send data) request an output port.
Epoch 2: Allocation algorithm decides which inputs get to write.
Epoch 3: Allocation system informs the winning inputs and outputs.
Epoch 4: Actual data transfer takes place.

Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, for different sets of requests.
Epoch 3: The Allocation Problem (4 x 4)

A 1 codes that an input has data ready to send to an output.

Request matrix (input ports A, B, C, D; output ports W, X, Y, Z):

        W  X  Y  Z
    A   0  0  1  0
    B   1  0  0  0
    C   0  0  1  0
    D   1  0  0  0

The allocator returns a matrix with at most one 1 in each row and column, to set the switches. One valid allocation for the request matrix above:

        W  X  Y  Z
    A   0  0  1  0
    B   0  0  0  0
    C   0  0  0  0
    D   1  0  0  0

The algorithm should be "fair", so that no port always loses ... it should also "scale" to run large matrices fast.
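One simple policy that satisfies the at-most-one-1-per-row-and-column contract is sequential round-robin arbitration. This Python sketch is illustrative only; real crossbar allocators (e.g., wavefront or separable allocators) compute grants combinationally in hardware:

```python
# Hedged sketch: sequential round-robin allocator for an N x N request
# matrix. request[i][j] == 1 means input i has data ready for output j.
# Returns a grant matrix with at most one 1 per row and per column.

def allocate(request, start=0):
    n = len(request)
    grant = [[0] * n for _ in range(n)]
    output_taken = [False] * n
    # Rotate which input gets first pick, so no port always loses ("fair").
    for k in range(n):
        i = (start + k) % n
        for j in range(n):
            if request[i][j] and not output_taken[j]:
                grant[i][j] = 1
                output_taken[j] = True
                break   # at most one grant per input (row)
    return grant

# The 4 x 4 example from the slide: inputs A-D, outputs W-Z.
request = [[0, 0, 1, 0],   # A wants Y
           [1, 0, 0, 0],   # B wants W
           [0, 0, 1, 0],   # C wants Y (conflicts with A)
           [1, 0, 0, 0]]   # D wants W (conflicts with B)
# With D given first pick, the grants match the slide: A->Y and D->W;
# B and C retry next cycle.
print(allocate(request, start=3))
```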
Sun Niagara II: Crossbar Notes

Low latency: 4 cycles (less than 3 ns). Uniform latency between all port pairs. The crossbar defines the floorplan: all port devices should be equidistant to the crossbar.
Sun Niagara II: Energy Facts

The crossbar is only 1% of total power.
Sun Niagara II: Crossbar Notes (continued)

Did not scale up for the 16-core Rainbow Falls. Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port with two cores.

Design alternatives to the crossbar ...
CLOS Networks: From the telecom world ...

Build a high-port-count switch by tiling fixed-size shuffle units. Pipeline registers fit naturally between tiles. Trades latency for scalability.
CLOS Networks: An example route

[Figure: numbers on the left and right are port numbers; colors show the routing paths for an exchange.]

Arbitration is still needed to prevent blocking.
Ring Networks
Intel Xeon: Data Center server chip

20% of Intel's revenues, 40% of its profits. Why? The cloud is growing, and Xeon is dominant.
Compiled Chips

Xeon is a chip family, varying by # of cores and L3 cache size. The family's mask layouts are generated automatically, by adding core/cache slices.
Ring Bus

A bi-directional ring bus connects cores, cache banks, DRAM controllers, and off-chip I/O. The chip compiler might size the ring bus to scale bandwidth with the # of cores. Ring latency increases with the # of cores, but the increase is small compared to the baseline latency.
Ring Stop

[Figure: 2.5 MB L3 cache slice from a Xeon E5; tiles along the x-axis are the 20 ways of the cache. The ring stop interface lives in the Cache Control Box (CBOX).]
Ring bus (perhaps 1024 wires), with address, data, and header fields (sender #, recipient #, command).

[Figure: Ring Stops #1, #2, #3 in series; each stop's interface has Data Out, Data In, and Control, and a slot on the ring may be Empty.]

Reading: Sense Data Out to see if the message is for Ring Stop #2. If so, latch the data and mux Empty onto the ring.

Writing: Check if Data Out is Empty. If so, mux a message onto the ring via the Data In port.
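A behavioral sketch of those read/write rules (Python; the slot format and queue bookkeeping are illustrative assumptions, not Intel's actual encoding):

```python
# Hedged sketch of one ring stop's per-cycle behavior. A slot is either
# EMPTY (None) or an assumed (sender, recipient, command, data) tuple.

EMPTY = None

class RingStop:
    def __init__(self, my_id):
        self.my_id = my_id
        self.tx_queue = []     # messages waiting to get onto the ring
        self.rx_queue = []     # messages delivered to this stop

    def cycle(self, slot_in):
        """Take the incoming slot; return the slot forwarded onward."""
        # Reading: if the message is for us, latch it, mux Empty onto ring.
        if slot_in is not EMPTY and slot_in[1] == self.my_id:
            self.rx_queue.append(slot_in)
            slot_in = EMPTY
        # Writing: if the passing slot is Empty, mux our message onto it.
        if slot_in is EMPTY and self.tx_queue:
            slot_in = self.tx_queue.pop(0)
        return slot_in

# Three stops on a ring; stop 0 sends a read command to stop 2.
stops = [RingStop(i) for i in range(3)]
stops[0].tx_queue.append((0, 2, "read", 0x40))
slot = EMPTY
for _ in range(6):                 # circulate the slot around the ring
    for s in stops:
        slot = s.cycle(slot)
print(stops[2].rx_queue)           # [(0, 2, 'read', 64)]
```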
In practice: "Extreme EE" to co-optimize bandwidth and reliability.

Debugging: A "network analyzer" built into the chip captures ring messages of a particular kind and sends them off-chip via an aux port.
A derivative of this ring bus is also used on laptop and desktop chips.
Break
Hit-over-Miss Caches
Recall: CPU-cache port that doesn't stall on a miss

[Figure: the CPU and cache are connected by two queues: Queue 1 carries requests from the CPU, Queue 2 carries responses to the CPU.]

The CPU makes a request by placing the following items in Queue 1:

CMD: Read, write, etc ...
MTYPE: 8-bit, 16-bit, 32-bit, or 64-bit.
TAG: 9-bit number identifying the request.
MADDR: Memory address of first byte.
STORE-DATA: For stores, the data to store.

When the request is ready, the cache places the following items in Queue 2:

TAG: Identity of the completed command.
LOAD-DATA: For loads, the requested data.

The CPU saves info about requests, indexed by TAG. This cache is used in an ASPIRE CPU (Rocket).

Why use the TAG approach? Multiple misses can proceed in parallel, and loads can return out of order.
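A sketch of the CPU side of this tagged interface (Python; the CMD/MTYPE/TAG/MADDR names come from the slide, while the free-tag list and dictionary bookkeeping are illustrative assumptions):

```python
# Hedged sketch of the tagged, non-stalling CPU/cache interface.
from collections import deque

queue1 = deque()                  # CPU -> cache requests
queue2 = deque()                  # cache -> CPU responses

free_tags = deque(range(512))     # the 9-bit TAG space
in_flight = {}                    # TAG -> saved request info

def cpu_issue_load(maddr, mtype="32-bit", dest_reg=None):
    # CPU allocates a TAG and promises not to reuse it while in flight.
    tag = free_tags.popleft()
    in_flight[tag] = {"maddr": maddr, "mtype": mtype, "dest": dest_reg}
    queue1.append({"CMD": "read", "MTYPE": mtype,
                   "TAG": tag, "MADDR": maddr})
    return tag

def cpu_handle_response():
    # Responses may arrive out of order; TAG recovers request context.
    resp = queue2.popleft()
    info = in_flight.pop(resp["TAG"])
    free_tags.append(resp["TAG"])
    return info["dest"], resp["LOAD-DATA"]
```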
Today: How a read request proceeds in the L1 D-Cache

The CPU requests a read by placing MTYPE, TAG, and MADDR in Queue 1.

"We" == the L1 D-Cache controller. We do a normal cache access. If there is a hit, we place the load result in Queue 2 ... In the case of a miss, we use the Inverted Miss Status Holding Register (MSHR).
Inverted MSHR (Miss Status Holding Register)

(1) Associatively look up the block # of the memory address in the table. If there are no hits, do a memory request.

[Figure: a 512-entry table, so that every 9-bit TAG value has an entry. Each entry holds a valid bit, cache block #, MTYPE, and 1st byte in block; a comparator per entry matches the incoming block #, with the valid bit qualifying the hit, and a ROM supplies each entry's Tag ID. Assumptions: 32-byte blocks, 48-bit physical address space.]
Inverted MSHR (Miss Status Holding Register)

(2) Index into the table using the 9-bit TAG, and set all fields using the MADDR and MTYPE queue values. This indexing always finds V=0, because the CPU promises not to reuse in-flight tags.
Inverted MSHR (Miss Status Holding Register)

(3) Whenever the memory system returns data, associatively look up the block # to find all pending transactions. Place transaction data for all hits in Queue 2, and clear the valid bits. Also update the L1 cache.
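Putting steps (1)-(3) together, a behavioral sketch of the inverted MSHR (Python; lists and dicts stand in for the 512-entry table and its per-entry comparators):

```python
# Hedged sketch of the inverted MSHR's three operations. A Python list
# stands in for the 512-entry table; the associative block-number scan
# models the per-entry comparators. 32-byte blocks, as on the slide.

BLOCK_BYTES = 32
TABLE_SIZE = 512                       # one entry per 9-bit TAG value

table = [{"valid": False} for _ in range(TABLE_SIZE)]

def block_number(maddr):
    return maddr // BLOCK_BYTES

def handle_miss(tag, maddr, mtype, issue_memory_request):
    blk = block_number(maddr)
    # (1) Associative lookup: request the block only if no entry has it.
    already_pending = any(e["valid"] and e["block"] == blk for e in table)
    if not already_pending:
        issue_memory_request(blk)
    # (2) Index by TAG; the CPU promises this entry has valid == False.
    table[tag] = {"valid": True, "block": blk, "mtype": mtype,
                  "offset": maddr % BLOCK_BYTES}

def memory_returned(blk, block_data, queue2):
    # (3) Match ALL pending transactions on this block, reply, clear valid.
    for tag, e in enumerate(table):
        if e["valid"] and e["block"] == blk:
            # Select bytes starting at the offset (MTYPE width omitted).
            data = block_data[e["offset"]:]
            queue2.append({"TAG": tag, "LOAD-DATA": data})
            e["valid"] = False
    # (also update the L1 cache with block_data here)
```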
Inverted MSHR notes

Structural hazards only occur when the TAG space is exhausted by the CPU. High cost (# of comparators + SRAM cells). See Farkas and Jouppi on the class website for low-cost designs that are often good enough. We will return to MSHRs to discuss CPI performance later in the semester.
Coherency Hardware
Cache Placement
Two CPUs, two caches, shared DRAM ...

CPU0: LW R2, 16(R0)
CPU1: LW R2, 16(R0)
CPU1: SW R0, 16(R0)

[Figure: after the two loads, both write-through caches hold (Addr 16, Value 5). CPU1's store writes through, updating its own cache and shared main memory from 5 to 0, while CPU0's cache still holds 5.]

The view of memory is no longer "coherent". Loads of location 16 from CPU0 and CPU1 see different values!
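The failure is easy to reproduce in a toy model (Python; the caches are dicts that fill on first access and write through on stores):

```python
# Toy model of the slide's example: write-through caches with no
# coherence machinery. Address 16 initially holds 5 in shared memory.
memory = {16: 5}
cache0, cache1 = {}, {}

def load(cache, addr):
    if addr not in cache:          # miss: fill from shared memory
        cache[addr] = memory[addr]
    return cache[addr]

def store_writethru(cache, addr, value):
    cache[addr] = value            # update own cache ...
    memory[addr] = value           # ... and write through to memory

load(cache0, 16)                   # CPU0: LW R2, 16(R0) -> 5
load(cache1, 16)                   # CPU1: LW R2, 16(R0) -> 5
store_writethru(cache1, 16, 0)     # CPU1: SW R0, 16(R0)
print(load(cache0, 16), load(cache1, 16))   # 5 0  <- not coherent!
```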
The simplest solution ... one cache!

[Figure: CPU0 and CPU1 connect through a memory switch to a shared multi-bank cache, backed by shared main memory.]

The CPUs do not have internal caches. There is only one cache, so different values for a memory address cannot appear in 2 caches!

Multiple cache banks support reads/writes by both CPUs in a switch epoch, unless both target the same bank. In that case, one CPU must wait.
Not a complete solution ... good for L2. (Sequent Systems, 1980s)

For modern clock rates, access to the shared cache through the switch takes 10+ cycles. Using the shared cache as the L1 data cache is tantamount to slowing down the clock 10X for LWs. Not good.

This approach was a complete solution in the days when DRAM row access time and CPU clock period were well matched.
Modified form: Private L1s, shared L2

[Figure: CPU0 and CPU1 each have private L1 caches, connected by a memory switch or bus to a shared multi-bank L2 cache and shared main memory.]

Advantages of a shared L2 over private L2s: processors communicate at cache speed, not DRAM speed; and there is constructive interference if both CPUs need the same data/instructions.

Disadvantage: the CPUs share bandwidth to the L2 cache ...

Thus, we need to solve the cache coherency problem for the L1 caches.
IBM Power 4 (2001)

Dual core. Shared, multi-bank L2 cache. Private L1 caches. Off-chip L3 caches.
Cache Coherency
Cache coherency goals ...

[Figure: CPU0's cache holds (Addr 16, Value 5) while CPU1's cache holds (16, 0), above a shared memory hierarchy whose copy of 16 has gone from 5 to 0.]

1. Only one processor at a time has write permission for a memory location.

2. No processor can load a stale copy of a location after a write.
Simple Implementation: Snoopy Caches

[Figure: CPU0 and CPU1, each with a cache and a snooper attached to a shared memory bus, above the shared main memory hierarchy.]

Each cache has the ability to "snoop" on the memory bus transactions of the other CPUs. The bus also has mechanisms to let a CPU intervene to stop a bus transaction, and to invalidate the cache lines of other CPUs.
Writes from 10,000 feet ... for a write-thru L1

1. The writing CPU takes control of the bus.

2. The address to be written is invalidated in all other caches. Reads will no longer hit in those caches and get stale data.

3. The write is sent to main memory. Reads will cache miss and retrieve the new value from main memory.

To first order, reads will "just work" if write-thru caches implement this policy. A "two-state" protocol (cache lines are "valid" or "invalid").
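A toy model of the two-state protocol (Python; the list of caches stands in for the snooping bus and its invalidate mechanism):

```python
# Hedged sketch of the two-state (valid/invalid) write-thru protocol:
# a write invalidates the address in every other cache, then writes
# through to memory. Absence from a dict models the invalid state.
memory = {16: 5}
caches = [dict(), dict()]          # one private cache per CPU

def load(cpu, addr):
    if addr not in caches[cpu]:    # invalid or never loaded: miss
        caches[cpu][addr] = memory[addr]
    return caches[cpu][addr]

def store(cpu, addr, value):
    # 1. Writing CPU takes the bus. 2. Invalidate all other copies.
    for other, c in enumerate(caches):
        if other != cpu:
            c.pop(addr, None)
    # 3. Write-through to main memory (keeping our own copy).
    caches[cpu][addr] = value
    memory[addr] = value

load(0, 16); load(1, 16)           # both caches hold (16, 5)
store(1, 16, 0)                    # CPU1 writes 0; CPU0's copy invalidated
print(load(0, 16))                 # 0 -- CPU0 misses, reloads fresh value
```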
Limitations of the write-thru approach

Every write goes to the bus. Total bus write bandwidth does not support more than 2 CPUs, in modern practice.

To scale further, we need to use write-back caches. The write-back big trick: add extra states. The simplest version is MSI -- Modified, Shared, Invalid. More efficient versions add more states (MESI adds Exclusive). The state definitions are subtle ...
Figure 5.5, page 358 ... the best starting point.
Read misses ... for a MESI protocol ... (write-back caches)

1. A cache requests a cache-line fill for a read miss.

2. Another cache with an exclusive copy of this line responds with fresh data. The read miss will not go to main memory and retrieve stale data.

3. The responding cache changes the line from exclusive to shared. Future writes will go to the bus, to be seen by the other caches.

These sketches are just to give you a sense of how coherency protocols work. Deep understanding requires studying the complete state machine.
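The transitions touched by this sketch can be written as a small table (Python; a deliberately partial view of the MESI state machine -- see H&P Figure 5.5 for the complete protocol):

```python
# Hedged, partial MESI transition table: only the transitions touched
# by the read-miss sketch above. The full protocol has more events
# (write misses, write-backs, upgrades) than shown here.
# Key: (current_state, event) -> next_state
mesi_partial = {
    # Local CPU events
    ("I", "local_read_miss"): "S",  # fill; another cache may supply data
    ("E", "local_write"):     "M",  # silent upgrade: no bus transaction
    ("S", "local_write"):     "M",  # must broadcast invalidate on the bus
    # Snooped bus events (another CPU's read miss on our line)
    ("E", "bus_read_miss"):   "S",  # respond with data, drop exclusivity
    ("M", "bus_read_miss"):   "S",  # respond with fresh data, write back
}

state = "E"
state = mesi_partial[(state, "bus_read_miss")]
print(state)  # 'S' -- future local writes now require a bus transaction
```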
Snoopy mechanism doesn't scale ...

Single-chip implementations have moved to a centralized "directory" service that tracks the status of each line of each private cache. Multi-socket systems use distributed directories.
Directories attached to the on-chip cache network ...

[Figure: a 2-socket system; each socket is a multi-core chip with private L1s and L2s, and each chip has its own bank of DRAM. Chip 0 holds the directory for Chip 0's DRAM, and Chip 1 the directory for Chip 1's DRAM: distributed directories for multi-socket systems.]
Figure 5.21, page 381 ... directory message basics
Conceptually similar to snoopy caches ... but
the different mechanisms require rethinking
the protocol to get correct behaviors.
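A sketch of what a directory entry tracks and how a read miss is served (Python; the owner-plus-sharers encoding is the standard textbook scheme, simplified, and ignores transient states and message ordering):

```python
# Hedged sketch of a directory entry and a read-miss lookup. Real
# directories also track transient states and pending messages.

class DirectoryEntry:
    def __init__(self):
        self.state = "uncached"   # 'uncached' | 'shared' | 'modified'
        self.sharers = set()      # which caches hold the line
        self.owner = None         # cache with the modified copy, if any

def read_miss(entry, requester):
    """Return where the data comes from; update directory bookkeeping."""
    if entry.state == "modified":
        # Fetch fresh data from the owner and downgrade it to shared.
        source = ("cache", entry.owner)
        entry.sharers = {entry.owner, requester}
        entry.owner = None
    else:
        # Memory's copy is up to date; add the requester as a sharer.
        source = ("memory", None)
        entry.sharers.add(requester)
    entry.state = "shared"
    return source
```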
Other Machine Architectures
NUMA: Non-uniform Memory Access

[Figure: CPU 0 ... CPU 1023, each with its own cache and DRAM, connected by an interconnection network.]

Each CPU has part of main memory attached to it. To access other parts of main memory, it uses the interconnection network. For best results, applications take the non-uniform memory latency into account.

The network uses a coherent global address space, with directory protocols running over fiber networking.
Clusters: Supercomputing version of WSC

Connect large numbers of 1-CPU or 2-CPU rack-mount computers together with high-end network technology (not normal Ethernet). Instead of using hardware to create a shared-memory abstraction, let each application build its own memory model.

[Photo: University of Illinois, 650 2-CPU Apple Xserve cluster, connected with Myrinet (3.5 μs ping time - low latency).]
On Tuesday

We return to CPU design ...

Have a good weekend!