CS 252 Graduate Computer Architecture
Lecture 11: Multiprocessors-II
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs252
Recap: Sequential Consistency
A Memory Model

[Figure: several processors P connected to a single shared memory M]
“A system is sequentially consistent if the result of
any execution is the same as if the operations of all
the processors were executed in some sequential
order, and the operations of each individual processor
appear in the order specified by the program”
Leslie Lamport
Sequential Consistency =
arbitrary order-preserving interleaving
of memory references of sequential programs
10/16/2007
Recap: Sequential Consistency

Sequential consistency imposes more memory ordering
constraints than those imposed by uniprocessor program
dependencies (additional SC requirements).
What are these in our example?

T1:
  Store (X), 1   (X = 1)
  Store (Y), 11  (Y = 11)

T2:
  Load R1, (Y)
  Store (Y’), R1 (Y’ = Y)
  Load R2, (X)
  Store (X’), R2 (X’ = X)
Recap: Mutual Exclusion and Locks
Want to guarantee only one process is active in a critical
section
• Blocking atomic read-modify-write instructions
e.g., Test&Set, Fetch&Add, Swap
vs
• Non-blocking atomic read-modify-write instructions
e.g., Compare&Swap, Load-reserve/Store-conditional
vs
• Protocols based on ordinary Loads and Stores
Issues in Implementing Sequential Consistency

[Figure: several processors P connected to a single shared memory M]

Implementation of SC is complicated by two issues

• Out-of-order execution capability
  (which pairs of accesses may be reordered?)
  Load(a); Load(b)     yes
  Load(a); Store(b)    yes if a ≠ b
  Store(a); Load(b)    yes if a ≠ b
  Store(a); Store(b)   yes if a ≠ b

• Caches
  Caches can prevent the effect of a store from
  being seen by other processors

SC complications motivate architects to consider weak or
relaxed memory models
Memory Fences
Instructions to sequentialize memory accesses
Processors with relaxed or weak memory models (i.e.,
permit Loads and Stores to different addresses to be
reordered) need to provide memory fence instructions
to force the serialization of memory accesses
Examples of processors with relaxed memory models:
Sparc V8 (TSO,PSO): Membar
Sparc V9 (RMO):
Membar #LoadLoad, Membar #LoadStore
Membar #StoreLoad, Membar #StoreStore
PowerPC (WO): Sync, EIEIO
Memory fences are expensive operations; however, one
pays the cost of serialization only when it is required
Using Memory Fences

[Figure: producer and consumer sharing a circular buffer indexed by
head and tail pointers]

Producer posting Item x:
  Load Rtail, (tail)
  Store (Rtail), x
  MembarSS              (ensures that tail ptr is not updated
  Rtail=Rtail+1          before x has been stored)
  Store (tail), Rtail

Consumer:
        Load Rhead, (head)
  spin: Load Rtail, (tail)
        if Rhead==Rtail goto spin
        MembarLL        (ensures that R is not loaded
        Load R, (Rhead)  before x has been stored)
        Rhead=Rhead+1
        Store (head), Rhead
        process(R)
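The queue above can be sketched with C11 atomics, where a store-release on the tail pointer plays the role of MembarSS and a load-acquire of the tail plays the role of MembarLL. This is a minimal single-producer/single-consumer sketch; the queue size, item count, and function names are illustrative, not from the lecture.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define QSIZE 1024
static int buf[QSIZE];
static atomic_size_t head, tail;   /* consumer pops at head, producer pushes at tail */

static void produce(int x) {
    size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
    buf[t % QSIZE] = x;                        /* Store (Rtail), x */
    /* MembarSS: the data store must be visible before the tail update */
    atomic_store_explicit(&tail, t + 1, memory_order_release);
}

static int consume(void) {
    size_t h = atomic_load_explicit(&head, memory_order_relaxed);
    while (atomic_load_explicit(&tail, memory_order_acquire) == h)
        ;                                      /* spin: queue empty */
    /* MembarLL: the acquire above orders the data load after the tail load */
    int x = buf[h % QSIZE];
    atomic_store_explicit(&head, h + 1, memory_order_relaxed);
    return x;
}

static void *producer_thread(void *arg) {
    (void)arg;
    for (int i = 1; i <= 100; i++) produce(i);
    return NULL;
}

/* Runs one producer against the calling (consumer) thread and
   returns the sum of the 100 items received. */
long queue_demo(void) {
    pthread_t p;
    pthread_create(&p, NULL, producer_thread, NULL);
    long sum = 0;
    for (int i = 0; i < 100; i++) sum += consume();
    pthread_join(p, NULL);
    return sum;
}
```

Without the release/acquire pair, the compiler or a weakly ordered processor could move the data access across the pointer update, which is exactly the reordering the slide's fences forbid.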
Data-Race Free Programs
a.k.a. Properly Synchronized Programs
Process 1
...
Acquire(mutex);
< critical section>
Release(mutex);
Process 2
...
Acquire(mutex);
< critical section>
Release(mutex);
Synchronization variables (e.g. mutex) are disjoint
from data variables
Accesses to writable shared data variables are
protected in critical regions
⇒ no data races except for locks
(Formal definition is elusive)
In general, it cannot be proven whether a program is
data-race free.
Fences in Data-Race Free Programs
Process 1
...
Acquire(mutex);
membar;
< critical section>
membar;
Release(mutex);
Process 2
...
Acquire(mutex);
membar;
< critical section>
membar;
Release(mutex);
• Relaxed memory model allows reordering of instructions
by the compiler or the processor as long as the reordering
is not done across a fence
• The processor also should not speculate or prefetch
across fences
Mutual Exclusion Using Load/Store
A protocol based on two shared variables c1 and c2.
Initially, both c1 and c2 are 0 (not busy)
Process 1
...
c1=1;
L: if c2=1 then go to L
< critical section>
c1=0;
Process 2
...
c2=1;
L: if c1=1 then go to L
< critical section>
c2=0;
What is wrong?
Mutual Exclusion: second attempt
To avoid deadlock, let a process give up the reservation
(i.e. Process 1 sets c1 to 0) while waiting.
Process 1
...
L: c1=1;
if c2=1 then
{ c1=0; go to L}
< critical section>
c1=0
Process 2
...
L: c2=1;
if c1=1 then
{ c2=0; go to L}
< critical section>
c2=0
What can go wrong now?
• Deadlock is not possible, but with low probability a
livelock may occur
• An unlucky process may never get to enter the
critical section ⇒ starvation
A Protocol for Mutual Exclusion
T. Dekker, 1966
A protocol based on 3 shared variables c1, c2 and turn.
Initially, both c1 and c2 are 0 (not busy)
Process 1
...
c1=1;
turn = 1;
L: if c2=1 & turn=1
then go to L
< critical section>
c1=0;
Process 2
...
c2=1;
turn = 2;
L: if c1=1 & turn=2
then go to L
< critical section>
c2=0;
• turn = i ensures that only process i can wait
• variables c1 and c2 ensure mutual exclusion
Solution for n processes was given by Dijkstra
and is quite tricky!
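This two-flag-plus-turn protocol can be sketched in C11, using seq_cst atomics to supply the sequential consistency the protocol assumes. The thread bodies, iteration count, and the `dekker_demo` harness below are illustrative, not from the lecture.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int c[2];     /* c[i] = 1: process i wants the critical section */
static atomic_int turn;     /* turn = i: process i is the one that waits      */
static int shared_count;    /* plain variable protected by the protocol      */

static void dekker_lock(int i) {
    atomic_store(&c[i], 1);
    atomic_store(&turn, i);
    /* spin while the other is interested and it is our turn to wait */
    while (atomic_load(&c[1 - i]) && atomic_load(&turn) == i)
        ;
}

static void dekker_unlock(int i) {
    atomic_store(&c[i], 0);
}

static void *dekker_worker(void *arg) {
    int i = *(int *)arg;
    for (int n = 0; n < 50000; n++) {
        dekker_lock(i);
        shared_count++;          /* critical section */
        dekker_unlock(i);
    }
    return NULL;
}

/* Two contending threads; returns the final count, which equals
   100000 iff mutual exclusion held throughout. */
int dekker_demo(void) {
    int id0 = 0, id1 = 1;
    pthread_t t0, t1;
    shared_count = 0;
    pthread_create(&t0, NULL, dekker_worker, &id0);
    pthread_create(&t1, NULL, dekker_worker, &id1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return shared_count;
}
```

On hardware with a relaxed memory model, compiling the same algorithm with ordinary loads and stores would let the c/turn accesses be reordered and break mutual exclusion, which is the point of the fence discussion earlier.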
Analysis of Dekker’s Algorithm

Scenario 1 and Scenario 2 step through two interleavings of the
same code for the two processes:

Process 1                     Process 2
...                           ...
c1=1;                         c2=1;
turn = 1;                     turn = 2;
L: if c2=1 & turn=1           L: if c1=1 & turn=2
   then go to L                  then go to L
< critical section>           < critical section>
c1=0;                         c2=0;
N-process Mutual Exclusion
Lamport’s Bakery Algorithm

Process i

Initially num[j] = 0, for all j

Entry Code:
  choosing[i] = 1;
  num[i] = max(num[0], …, num[N-1]) + 1;
  choosing[i] = 0;
  for(j = 0; j < N; j++) {
    while( choosing[j] );
    while( num[j] &&
           ( ( num[j] < num[i] ) ||
             ( num[j] == num[i] && j < i ) ) );
  }

Exit Code:
  num[i] = 0;
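A hedged C11 rendering of the bakery entry/exit code; seq_cst atomics again supply the sequential consistency the algorithm assumes, and the thread harness, NPROC, and iteration counts are illustrative.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NPROC 4

static atomic_int choosing[NPROC];
static atomic_int num[NPROC];
static int bakery_count;          /* plain variable protected by the lock */

static void bakery_lock(int i) {
    atomic_store(&choosing[i], 1);
    int max = 0;                  /* num[i] = max(num[0..N-1]) + 1 */
    for (int j = 0; j < NPROC; j++) {
        int n = atomic_load(&num[j]);
        if (n > max) max = n;
    }
    atomic_store(&num[i], max + 1);
    atomic_store(&choosing[i], 0);
    for (int j = 0; j < NPROC; j++) {
        while (atomic_load(&choosing[j]))
            ;                     /* wait while j is picking a number */
        while (atomic_load(&num[j]) &&
               (atomic_load(&num[j]) < atomic_load(&num[i]) ||
                (atomic_load(&num[j]) == atomic_load(&num[i]) && j < i)))
            ;                     /* wait for holders of smaller tickets */
    }
}

static void bakery_unlock(int i) { atomic_store(&num[i], 0); }

static void *bakery_worker(void *arg) {
    int i = *(int *)arg;
    for (int n = 0; n < 20000; n++) {
        bakery_lock(i);
        bakery_count++;           /* critical section */
        bakery_unlock(i);
    }
    return NULL;
}

/* NPROC contending threads; returns NPROC * 20000 iff mutual
   exclusion held throughout. */
int bakery_demo(void) {
    pthread_t t[NPROC];
    int ids[NPROC];
    bakery_count = 0;
    for (int i = 0; i < NPROC; i++) {
        ids[i] = i;
        pthread_create(&t[i], NULL, bakery_worker, &ids[i]);
    }
    for (int i = 0; i < NPROC; i++) pthread_join(t[i], NULL);
    return bakery_count;
}
```

The (num[j], j) pair acts as a totally ordered ticket, which is why ties on the ticket number are broken by the smaller process index.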
CS252 Administrivia
• Project meetings next week (10/23-25), same
schedule as before (M 1-3PM, Tu/Th 9:40-11AM)
– Schedule on website
– All in 645 Soda, 20mins/group
• Hope to see:
– Project web site
– At least one initial result (some delta from “hello world”)
– Grasp of related work
• Midterm review
CS252 Midterm 1 Results

[Histograms of score distributions by problem omitted]

Problem 1: 22 points total
Problem 2: 19 points total
Problem 3: 19 points total
Problem 4: 20 points total
Overall (80 points total): Average 39.8, Median 40.5
EECS Graduate Grading Guidelines

A+, A, A-:    quality expected from a PhD student
B+, B:        quality expected from an MS student, not a PhD student
B- and below: below the quality expected from an MS student

• Class average somewhere in range 3.2 - 3.6
http://www.eecs.berkeley.edu/Policies/grad.grading.shtml
Memory Consistency in SMPs

[Figure: CPU-1 and CPU-2 on a CPU-Memory bus; location A holds 100 in
cache-1, cache-2, and memory]

Suppose CPU-1 updates A to 200.
  write-back:    memory and cache-2 have stale values
  write-through: cache-2 has a stale value

Do these stale values matter?
What is the view of shared memory for programming?
Write-back Caches & SC

prog T1: ST X, 1; ST Y, 11
prog T2: LD Y, R1; ST Y’, R1; LD X, R2; ST X’, R2

Initially memory holds X=0, Y=10; X’ and Y’ are unwritten.

• T1 is executed:
    cache-1: X=1, Y=11         memory: X=0, Y=10
• cache-1 writes back Y:
    memory: X=0, Y=11
• T2 executed:
    cache-2: Y=11, Y’=11, X=0, X’=0
• cache-1 writes back X:
    memory: X=1, Y=11
• cache-2 writes back X’ & Y’:
    memory: X=1, Y=11, X’=0, Y’=11

T2 ends up with Y’=11 but X’=0: it observed the new value of Y but
the old value of X, an outcome no sequentially consistent
interleaving allows.
Write-through Caches & SC

prog T1: ST X, 1; ST Y, 11
prog T2: LD Y, R1; ST Y’, R1; LD X, R2; ST X’, R2

• Initially:
    cache-1: X=0, Y=10   cache-2: X=0   memory: X=0, Y=10
• T1 executed (stores write through):
    cache-1: X=1, Y=11   cache-2: X=0 (stale)   memory: X=1, Y=11
• T2 executed:
    cache-2: Y=11, Y’=11, X=0, X’=0
    memory: X=1, Y=11, X’=0, Y’=11

Even though every store goes straight to memory, cache-2 still
holds and loads a stale copy of X.

Write-through caches don’t preserve
sequential consistency either
Maintaining Sequential Consistency

SC is sufficient for correct producer-consumer
and mutual exclusion code (e.g., Dekker)

Multiple copies of a location in various caches
can cause SC to break down.

Hardware support is required such that
• only one processor at a time has write
permission for a location
• no processor can load a stale copy of
the location after a write

⇒ cache coherence protocols
Cache Coherence Protocols for SC
write request:
the address is invalidated (updated) in all other
caches before (after) the write is performed
read request:
if a dirty copy is found in some cache, a writeback is performed before the memory is read
We will focus on Invalidation protocols
as opposed to Update protocols
Warmup: Parallel I/O

[Figure: processor with cache, physical memory, and a disk DMA engine
all attached to the memory bus (address, data, and R/W lines). Either
the cache or the DMA engine can be the bus master and effect
transfers.]

(DMA stands for Direct Memory Access)

Page transfers occur while the processor is running
Problems with Parallel I/O

[Figure: cached portions of a page sit in the processor cache while
DMA transfers move the same page between disk and physical memory]

Memory → Disk: physical memory may be stale if the cache copy is dirty
Disk → Memory: the cache may hold stale data and not see memory writes
Snoopy Cache (Goodman 1983)

• Idea: Have cache watch (or snoop upon) DMA
transfers, and then “do the right thing”
• Snoopy cache tags are dual-ported: one port drives
the memory bus when the cache is bus master; a second,
snoopy read port is attached to the memory bus

[Figure: cache with dual-ported tags and state bits shared between the
processor port and the snoopy bus port]
Snoopy Cache Actions for DMA

Observed Bus Cycle          Cache State          Cache Action
DMA Read (Memory → Disk)    Address not cached   No action
                            Cached, unmodified   No action
                            Cached, modified     Cache intervenes
DMA Write (Disk → Memory)   Address not cached   No action
                            Cached, unmodified   Cache purges its copy
                            Cached, modified     ???
Shared Memory Multiprocessor

[Figure: processors M1, M2, M3, each with a snoopy cache, plus a disk
DMA engine, all sharing a memory bus to physical memory]

Use snoopy mechanism to keep all processors’
view of memory coherent
Cache State Transition Diagram
The MSI protocol

Each cache line has state bits and an address tag.
  M: Modified   S: Shared   I: Invalid

Transitions (cache state in processor P1):
  M → M:  P1 reads or writes
  M → S:  other processor reads; P1 writes back
  M → I:  other processor intent to write
  S → S:  read by any processor
  S → I:  other processor intent to write
  I → M:  write miss
  I → S:  read miss
Two Processor Example
(Reading and writing the same cache line)

Access sequence:
  P1 reads
  P1 writes
  P2 reads
  P2 writes
  P1 reads
  P1 writes
  P2 writes

Each cache follows the MSI diagram, with “other processor” meaning
the peer: for P1’s cache, “P2 reads, P1 writes back” takes M to S,
and “P2 intent to write” takes M or S to I (and symmetrically for
P2’s cache).
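The MSI arcs can be condensed into a tiny next-state function for two snoopy caches sharing one line. This is an illustrative model only: bus transactions, write-backs, and address tags are abstracted away.

```c
#include <assert.h>

typedef enum { MSI_I, MSI_S, MSI_M } msi_t;

/* One processor reads or writes the line; 'self' and 'other' are the
   MSI states of that line in the two snoopy caches. */
void msi_access(msi_t *self, msi_t *other, int is_write) {
    if (is_write) {
        /* "intent to write" invalidates the other copy (which writes
           back first if it was MODIFIED); self takes the line in M */
        *other = MSI_I;
        *self = MSI_M;
    } else if (*self == MSI_I) {
        /* read miss: a MODIFIED peer writes back and downgrades to S */
        if (*other == MSI_M) *other = MSI_S;
        *self = MSI_S;
    }
    /* read hit in S or M: no state change */
}
```

Stepping this function through the access sequence on the slide reproduces the diagram walk, e.g. P2's read while P1 is in M forces P1's write-back and leaves both caches in S.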
Observation

• If a line is in the M state then no other cache
can have a copy of the line!
– Memory stays coherent; multiple differing copies
cannot exist
MESI: An Enhanced MSI Protocol
increased performance for private data

Each cache line has state bits and an address tag.
  M: Modified Exclusive    E: Exclusive, unmodified
  S: Shared                I: Invalid

Transitions (cache state in processor P1):
  M → M:  P1 write or read
  M → S:  other processor reads; P1 writes back
  M → I:  other processor intent to write
  E → M:  P1 write (no bus transaction needed)
  E → E:  P1 read
  E → S:  other processor reads
  E → I:  other processor intent to write
  S → S:  read by any processor
  S → I:  other processor intent to write
  I → S:  read miss, shared
  I → E:  read miss, not shared
  I → M:  write miss
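As with MSI, the MESI arcs can be condensed into an illustrative next-state function for two caches (bus transactions and write-backs again abstracted away). The payoff for private data is visible in the code: the E → M upgrade touches no peer state, so it needs no bus transaction.

```c
#include <assert.h>

typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } mesi_t;

/* One processor reads or writes the line; 'self' and 'other' are the
   MESI states of that line in the two snoopy caches. */
void mesi_access(mesi_t *self, mesi_t *other, int is_write) {
    if (is_write) {
        *other = MESI_I;          /* intent to write invalidates the peer */
        *self = MESI_M;           /* E → M is a silent, bus-free upgrade  */
    } else if (*self == MESI_I) {
        if (*other == MESI_I) {
            *self = MESI_E;       /* read miss, not shared */
        } else {
            /* a MODIFIED peer writes back; any valid peer ends in S */
            *other = MESI_S;
            *self = MESI_S;       /* read miss, shared */
        }
    }
    /* read hit in S, E, or M: no state change */
}
```

A read of private data thus lands in E rather than S, and the later write upgrades E to M locally, whereas MSI would have to broadcast an intent-to-write even though no other cache holds the line.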
Optimized Snoop with Level-2 Caches

[Figure: four CPUs, each with an L1 cache backed by an L2 cache whose
snooper watches the shared bus]

• Processors often have two-level caches
  • small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must be in L2
  ⇒ invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth

What problem could occur?
Intervention

[Figure: CPU-1’s cache holds A=200 (modified); CPU-2’s cache does not
hold A; memory still holds the stale value A=100]

When a read-miss for A occurs in cache-2,
a read request for A is placed on the bus
• Cache-1 needs to supply A and change its state to shared
• The memory may respond to the request also!
  Does memory know it has stale data?

Cache-1 needs to intervene through the memory
controller to supply the correct data to cache-2
False Sharing

A cache block contains more than one word:

  state | blk addr | data0 | data1 | ... | dataN

Cache coherence is done at the block level,
not the word level.

Suppose M1 writes word i and M2 writes word k, and
both words have the same block address.
What can happen?
Synchronization and Caches: Performance Issues

[Figure: processors 1, 2, 3, each with a cache, on the CPU-Memory bus;
mutex=1 is held in one cache]

Each processor runs the same spin-lock code on the shared mutex:

  R ← 1
  L: swap (mutex), R;
     if <R> then goto L;
     <critical section>
     M[mutex] ← 0;

Cache-coherence protocols will cause mutex to ping-pong
between P1’s and P2’s caches.

Ping-ponging can be reduced by first reading the mutex
location (non-atomically) and executing a swap only if it is
found to be zero.
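The read-before-swap optimization is the classic test-and-test-and-set lock: spin on an ordinary load (which hits in the local cache) and only issue the invalidating swap when the lock looks free. A C11 sketch, with illustrative function names and iteration counts:

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int ttas_mutex;     /* 0 = free, 1 = held */
static int ttas_count;            /* protected by the lock */

static void ttas_lock(void) {
    for (;;) {
        /* read-only spin: no bus traffic while the line stays Shared */
        while (atomic_load_explicit(&ttas_mutex, memory_order_relaxed))
            ;
        /* only now attempt the (invalidating) swap */
        if (atomic_exchange_explicit(&ttas_mutex, 1,
                                     memory_order_acquire) == 0)
            return;
    }
}

static void ttas_unlock(void) {
    atomic_store_explicit(&ttas_mutex, 0, memory_order_release);
}

static void *ttas_worker(void *arg) {
    (void)arg;
    for (int n = 0; n < 25000; n++) {
        ttas_lock();
        ttas_count++;             /* critical section */
        ttas_unlock();
    }
    return NULL;
}

/* Four contending threads; returns 100000 iff the lock works. */
int ttas_demo(void) {
    pthread_t t[4];
    ttas_count = 0;
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, ttas_worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return ttas_count;
}
```

While the lock is held, every waiter spins on a Shared copy of the line; only the release and the winning exchange generate coherence traffic, instead of one invalidation per swap attempt.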
Performance Related to Bus Occupancy

In general, a read-modify-write instruction
requires two memory (bus) operations without
intervening memory operations by other
processors.

In a multiprocessor setting, the bus needs to be
locked for the entire duration of the atomic read
and write operation:
  expensive for simple buses
  very expensive for split-transaction buses

⇒ modern ISAs use load-reserve / store-conditional
Load-reserve & Store-conditional

Special register(s) hold the reservation flag and
address, and the outcome of store-conditional:

Load-reserve R, (a):
  <flag, adr> ← <1, a>;
  R ← M[a];

Store-conditional (a), R:
  if <flag, adr> == <1, a>
  then cancel other procs’
         reservation on a;
       M[a] ← <R>;
       status ← succeed;
  else status ← fail;

If the snooper sees a store transaction to the address
in the reserve register, the reserve bit is set to 0

• Several processors may reserve ‘a’ simultaneously
• These instructions are like ordinary loads and stores
with respect to the bus traffic

Can implement reservation by using cache hit/miss; no
additional hardware required (problems?)
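C11 exposes no load-reserve/store-conditional directly, but a compare-and-swap retry loop is the standard portable stand-in for the same optimistic read-modify-write: load, compute, and "conditionally store" only if no other processor intervened. The semaphore-decrement use case and names below are illustrative; the succeed/fail outcome mirrors the pseudocode's status flag.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int sem;            /* counting semaphore value */

/* Decrement sem if it is positive; returns 1 (succeed) or 0 (fail). */
static int try_decrement(atomic_int *v) {
    int old = atomic_load(v);     /* plays the load-reserve role */
    while (old > 0) {
        /* "store-conditional": fails, and refreshes 'old' with the
           current value, if another processor changed v meanwhile */
        if (atomic_compare_exchange_weak(v, &old, old - 1))
            return 1;             /* status <- succeed */
    }
    return 0;                     /* status <- fail */
}

static void *drain(void *arg) {
    int *taken = arg;
    while (try_decrement(&sem))
        (*taken)++;
    return NULL;
}

/* Two threads race to drain the semaphore; returns the total number
   of successful decrements, which is exactly the initial value. */
int lrsc_demo(void) {
    atomic_store(&sem, 100000);
    int taken[2] = {0, 0};
    pthread_t t0, t1;
    pthread_create(&t0, NULL, drain, &taken[0]);
    pthread_create(&t1, NULL, drain, &taken[1]);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return taken[0] + taken[1];
}
```

Like load-reserve/store-conditional, the loop is non-blocking: the bus is never locked across the read-modify-write, at the cost of occasionally retrying when another processor wins the race.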
Performance: Load-reserve & Store-conditional

The total number of memory (bus) transactions
is not necessarily reduced, but splitting an
atomic instruction into load-reserve & store-conditional:
• increases bus utilization (and reduces
processor stall time), especially in split-transaction buses
• reduces the cache ping-pong effect, because
processors trying to acquire a semaphore do
not have to perform a store each time
Out-of-Order Loads/Stores & CC

[Figure: CPU with load/store buffers in front of a cache (states
I/S/E); the snooper handles Wb-req, Inv-req, and Inv-rep on the bus
side, while the cache issues S-req/E-req, receives S-rep/E-rep from
memory, and pushes out write-backs (Wb-rep) through the CPU/Memory
interface]

Blocking caches:
  one request at a time + CC ⇒ SC

Non-blocking caches:
  multiple requests (to different addresses) concurrently + CC
  ⇒ relaxed memory models

CC ensures that all processors observe the same
order of loads and stores to an address