The Performance of
Spin Lock Alternatives for
Shared-Memory Multiprocessors
THOMAS E. ANDERSON
Presented by Daesung Park
Introduction



In shared-memory multiprocessors, each processor can
directly access memory
For consistency of the data structure, we need a method
to serialize the operations done on it
Shared-memory multiprocessors provide some form of
hardware support for mutual exclusion - atomic
instructions
Why is a lock needed?

If the operations on critical sections are simple enough




Encapsulate these operations into single atomic instruction
Mutual exclusion is directly guaranteed
Each processor attempting to access the shared data waits its
turn without returning control back to software
If the operations are not simple



A LOCK is needed
If the lock is busy, waiting is done in software
Two choices, block or spin
The topics of this paper

Are there efficient algorithms for software spinning on a
busy lock?


5 software solutions are presented
Are more complex kinds of hardware support needed for
performance?

Hardware solutions for ‘Multistage Interconnection Network
Multiprocessors’ and ‘Single Bus Multiprocessors’ are
presented
Multiprocessor Architectures

How processors are connected to memory


Whether or not each processor has a coherent private
cache


Multistage interconnection network or Bus
Yes or No
What is the coherence protocol

Invalidation-based or Distributed-write
For the performance



Minimize the communication bandwidth
Minimize the delay between when a lock is released and
when it is reacquired
Minimize latency by using a simple algorithm

When there is no lock contention
The problem of spinning

Spin on Test-and-Set


The performance of spinning on test-and-set degrades as the
number of spinning processors increases
The lock holder must contend with spinning processors to
access the lock location and other locations for normal
operation
The problem of spinning – Spin on TAS
[Diagram: processors P1–P4 connected to MEMORY over a shared bus]
lock := CLEAR;
while (TestAndSet(lock) = BUSY)
lock := CLEAR;
BUS, Write-Through, Invalidation-based, Spin on Test-and-Set
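The pseudocode above can be sketched in C11 atomics. This is an illustrative sketch, not the paper's code; the names spin_init, spin_lock_tas, and spin_unlock are made up for the example, and atomic_exchange plays the role of the TestAndSet instruction:

```c
#include <stdatomic.h>

/* 0 = CLEAR, 1 = BUSY */
typedef atomic_int spinlock_t;

static void spin_init(spinlock_t *l) { atomic_store(l, 0); }

/* Spin on test-and-set: every retry is an atomic read-modify-write,
 * so waiting processors keep generating bus traffic while the lock
 * is held. */
static void spin_lock_tas(spinlock_t *l) {
    while (atomic_exchange(l, 1) == 1)
        ;  /* lock was BUSY; retry the test-and-set */
}

static void spin_unlock(spinlock_t *l) { atomic_store(l, 0); }
```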
The problem of spinning

Spin on Read (Test-and-Test-and-Set)





Use cache to reduce the cost of spinning
When lock is released, each cache is updated or invalidated
The waiting processor sees the change and performs a test-and-set
When the critical section is small, this is as poor as spin on test-and-set
This is most pronounced for systems with invalidation-based
cache coherence, but also occurs with distributed-write
The problem of spinning – Spin on read
[Diagram: processors P1–P4 on a shared bus to MEMORY; each
processor's cache holds a copy of the lock, valid or invalid]
while (lock = BUSY or
TestAndSet(lock) = BUSY)
BUS, Write-Through, Invalidation-based
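The spin-on-read loop above can be sketched in C11 atomics (an illustrative sketch, not the paper's code; spin_lock_ttas is a made-up name for the test-and-test-and-set variant):

```c
#include <stdatomic.h>

typedef atomic_int spinlock_t;  /* 0 = CLEAR, 1 = BUSY */

/* Spin on read (test-and-test-and-set): wait with plain loads that
 * hit the local cache copy, and issue the expensive test-and-set
 * only once the lock is observed free. */
static void spin_lock_ttas(spinlock_t *l) {
    for (;;) {
        while (atomic_load(l) == 1)
            ;  /* spin in-cache; no bus traffic while the lock is held */
        if (atomic_exchange(l, 1) == 0)
            return;  /* won the race for the released lock */
        /* another processor won; go back to spinning on reads */
    }
}
```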
Reasons for the poor performance of
spin on read

There is a separation between detecting that the lock is
released and attempting to acquire it with a test-and-set
instruction



More than one test-and-set can occur
Caches are invalidated by a test-and-set even if the value is
not changed
Invalidation-based cache coherence requires O(P) bus or
network cycles to broadcast each invalidation
Problem of spinning
Measurement Result 1
Problem of spinning
Measurement Result 2
Software solutions
Delay Alternatives


Insert delay into the spinning loop
Where to insert delay

After the lock has been released
After every separate access to the lock

The length of delay

Static or dynamic

Lock latency is not affected, because processors attempt to
acquire the lock before delaying
Delay Alternatives

Delay after Spinning processor Notices Lock has been
Released






Reduces the number of test-and-sets when spinning on read
Each processor can be statically assigned a separate slot, or
amount of time to delay
The spinning processor with smallest delay gets the lock
Others may resume spinning without test-and-set
When there are few spinning processors, using fewer slots is
better
When there are many spinning processors, using fewer slots
results in many attempts to test-and-set
Delay Alternatives





Vary spinning behavior based on the number of waiting
processors
The number of collisions approximates the number of waiting processors
Initially assume that there are no other waiting
processors
Try a test-and-set; a failure indicates a collision
Double the delay up to some limit
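The backoff scheme above can be sketched in C11 atomics (an illustrative sketch, not the paper's code; spin_lock_backoff is a made-up name, and the initial delay and cap are assumed values):

```c
#include <stdatomic.h>

/* Test-and-set with exponential backoff: each failed attempt is taken
 * as evidence of a collision, so the delay doubles up to some limit. */
static void spin_lock_backoff(atomic_int *l) {
    unsigned delay = 1;               /* initially assume no other waiters */
    const unsigned max_delay = 1024;  /* backoff cap (illustrative value) */
    while (atomic_exchange(l, 1) == 1) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;  /* wait out the backoff period */
        if (delay < max_delay)
            delay *= 2;  /* failure implies a collision: back off further */
    }
}
```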
Delay Alternatives





Delay Between Each Memory Reference
Can be used on architectures without cache or with
invalidation-based cache
Reduce bandwidth consumption of spinning processors
Mean delay can be set statically or dynamically
More frequent polling improves performance when
there are few spinning processors
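This variant can be sketched in C11 atomics as well (an illustrative sketch, not the paper's code; spin_lock_poll_delay is a made-up name, and the busy-wait loop stands in for a real timed pause):

```c
#include <stdatomic.h>

/* Delay between each memory reference: suitable for architectures
 * without caches or with invalidation-based coherence. Every poll of
 * the lock is followed by a pause, so each spinning processor consumes
 * only a bounded share of bus bandwidth. The delay could be set
 * statically or tuned dynamically. */
static void spin_lock_poll_delay(atomic_int *l, unsigned delay) {
    while (atomic_exchange(l, 1) == 1) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;  /* pause before the next reference to the lock */
    }
}
```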
Software Solutions
Queuing in Shared Memory




Each processor inserts itself into a queue, then spins on a
flag in a separate memory location
When a processor finishes the critical section, it sets the
flag of the next processor in the queue
Only one cache read miss occurs
Maintaining the queue is expensive – much worse for small
critical sections
Queuing
Init
Lock
flags[0] := HAS_LOCK;
flags[1..P-1] := MUST_WAIT;
queueLast := 0;
myPlace := ReadAndIncrement(queueLast);
while(flags[myPlace mod P] = MUST_WAIT)
;
CRITICAL SECTION;
Unlock
flags[myPlace mod P] := MUST_WAIT;
flags[(myPlace + 1) mod P] := HAS_LOCK;
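The pseudocode above can be sketched in C11 atomics. This is an illustrative, single-file sketch, not the paper's code; the fixed array size P and the function names are assumptions, and atomic_fetch_add plays the role of ReadAndIncrement:

```c
#include <stdatomic.h>

#define P 8  /* maximum number of processors (illustrative value) */
enum { MUST_WAIT = 0, HAS_LOCK = 1 };

static atomic_int flags[P];
static atomic_int queueLast;

static void queue_init(void) {
    atomic_store(&flags[0], HAS_LOCK);
    for (int i = 1; i < P; i++)
        atomic_store(&flags[i], MUST_WAIT);
    atomic_store(&queueLast, 0);
}

/* Returns myPlace, which the caller passes to queue_unlock. */
static int queue_lock(void) {
    int myPlace = atomic_fetch_add(&queueLast, 1);  /* ReadAndIncrement */
    while (atomic_load(&flags[myPlace % P]) == MUST_WAIT)
        ;  /* each waiter spins on its own flag */
    return myPlace;
}

static void queue_unlock(int myPlace) {
    atomic_store(&flags[myPlace % P], MUST_WAIT);       /* reset my slot */
    atomic_store(&flags[(myPlace + 1) % P], HAS_LOCK);  /* pass the lock */
}
```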
Queuing
Implementation across architectures

Distributed-write cache coherence

All processors share a counter
To release the lock, a processor writes its sequence number into
the shared counter
Each cache is updated, directly notifying the next processor to
get the lock

Invalidation-based cache coherence

Each processor should wait on a flag in a separate cache block
One cache is invalidated and one read miss occurs
Multistage network without coherence

Each processor waits on a flag in a separate memory module
Processors must poll the flag to learn when it is their turn
Queuing
Implementations

Bus without coherence

Processors must poll to find out if it is their turn
This can swamp the bus
A delay can be inserted between each poll, based on the
processor's position in the queue and the execution time of the
critical section

Without an atomic read-and-increment instruction

A lock is needed to protect the queue counter
One of the delay alternatives above may help under contention

Problem: Increased lock latency

Incrementing the counter
Resetting its own flag and setting the next processor's flag
If there is no contention, this latency is a pure loss of
performance
Measurement Results of
Software Alternatives 1
Measurement Results of
Software Alternatives 2
Measurement Results of
Software Alternatives 3
Hardware Solutions
Multistage Interconnection Network Multiprocessors

Combining networks

For spin on test-and-set
Only one test-and-set request is forwarded to memory;
all other requests are returned with the value already set
Lock latency may increase

Hardware queuing at the memory module

Eliminates polling across the network without coherence
Processors issue ‘enter’ and ‘exit’ instructions to the memory
module
Lock latency is likely to be better than with software queuing
Caches to hold queue links

Stores the name of the next processors in the queue directly
in each processor’s cache
Hardware Solutions
Single Bus Multiprocessors

Read broadcast




Eliminates duplicate read miss requests
When data for a location that is invalid in a processor’s cache
appears on the bus, that cache takes the data and marks its
copy valid
Thus one processor’s read miss can revalidate the invalid
copies in other processors’ caches
Special handling of test-and-set requests in the cache


A processor can spin on test-and-set, acquiring the lock quickly
when it is free without consuming bus bandwidth while it is busy
A test-and-set that would fail (the lock is observed busy) is not
propagated onto the bus
Conclusion




Simple methods of spin-waiting degrade performance as
the number of spinning processors increases
Software queuing and backoff have good performance
even for large numbers of spinning processors
Backoff performs better when there is no contention;
queuing performs best when there is contention
Special hardware support can improve performance, too