Token Coherence: Decoupling Performance and Correctness Jason Bosko

January 30, 2008
Based on the paper “Token Coherence: Decoupling Performance and Correctness” by Martin, Hill, and Wood
Outline
Background and Motivation for Token Coherence
Decoupling Performance and Correctness
Correctness Substrate
Performance Protocol
TokenB: an implementation of Token Coherence

Cache Coherence Background


We’ve already looked at two types of coherence protocols: Snooping (MOESI) and Directory
Snooping
Negatives? Requires a totally-ordered interconnect; broadcasts use up bandwidth.
Positives? Avoids indirection on cache-to-cache misses.
Directory
Negatives? Adds a level of indirection (data must be requested from the home node first).
Positives? Works on any interconnect.
Incorrect Behavior Example
[Figure: P0, P1, and Mem exchanging messages labeled 1, 2, and 3]
Needs a totally-ordered interconnect to guarantee correctness!
Assuming no ordering on the interconnect:
1) P0 requests write access to memory block A, and the request is broadcast on the network. P1 hears it first, but does not have the data, so it enters the invalid state. The interconnect is slow to deliver the request to memory…
2) Now P1 wants read-only access to memory block A. Because P0’s request has not yet reached the memory, the memory gives the block to P1.
3) P0’s request finally arrives at the memory (out of order), and the memory block is sent to P0. P0 believes it has exclusive access to block A, while P1 believes it has a shared copy of A. This is incorrect!
Cache Coherence Background
Token Coherence wants to combine the best of both worlds…
Avoid indirection on cache misses (like snooping) while allowing any type of interconnect (like directory protocols)

Glueless Interconnect
A glue interconnect includes discrete switch chips and can provide a totally-ordered interconnect.
A glueless interconnect can be high-bandwidth and low-latency, but cannot provide ordering.
We want to be able to use a glueless interconnect, so we need a design that works on any interconnect.
Decoupling Performance and Correctness
The big idea: if you can separate correctness and performance, then any performance protocol can be used, because the correctness substrate will guarantee correct behavior.
Why is this important?
The performance protocol can do whatever it wants without worrying about race conditions and corner cases, because the correctness substrate will take care of any problems.
Correctness Substrate

We need safety (coherent reads and writes) and starvation avoidance (every request eventually completes).
Use tokens! But how?
Tokens





Each block of shared memory has a set of tokens associated with it. There is at least one token for every processor in the system.
To read a block of memory, you need at least one of its tokens.
To write a block of memory, you need all of its tokens.
If you want to think in terms of MSI, then holding all tokens is like the modified state, holding at least one but not all tokens is like the shared state, and holding no tokens is like the invalid state.
But what about the actual data? How is it passed from cache to cache?
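The token-counting rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the names `TOTAL_TOKENS`, `can_read`, `can_write`, and `msi_state` are made up for this sketch, and it assumes a four-processor system with exactly one token per processor.

```python
TOTAL_TOKENS = 4  # assumption: 4 processors, one token each for this block


def can_read(tokens_held: int) -> bool:
    """A node may read a block only while holding at least one token."""
    return tokens_held >= 1


def can_write(tokens_held: int) -> bool:
    """A node may write a block only while holding every token."""
    return tokens_held == TOTAL_TOKENS


def msi_state(tokens_held: int) -> str:
    """Map a token count onto the MSI analogy from the slide."""
    if tokens_held == TOTAL_TOKENS:
        return "M"  # all tokens: like modified
    if tokens_held >= 1:
        return "S"  # some but not all tokens: like shared
    return "I"      # no tokens: like invalid
```

Because the number of tokens per block is fixed, two nodes can never both satisfy `can_write`, and no node can satisfy `can_read` while another satisfies `can_write`.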
Tokens – Optimization




For each block of shared memory, exactly one of its tokens is the owner token.
The owner token is always sent with valid data. Other tokens may be sent with data, but this is optional (omitting the data saves bandwidth).
A valid bit travels with the tokens to signify whether the attached data is valid.
So now tokens provide safety (never read unless you have a token, never write unless you have every token).
But what about avoiding starvation?
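One way to picture the owner-token rule is as a constraint on the messages that carry tokens. The sketch below is an assumption-laden illustration: the `TokenMessage` class and its fields are hypothetical names, and the valid bit is modeled simply as "data is attached".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenMessage:
    tokens: int                   # tokens carried, including the owner token if present
    has_owner: bool               # True if the owner token is among them
    data: Optional[bytes] = None  # block contents, if attached

    def __post_init__(self) -> None:
        if self.tokens < 1:
            raise ValueError("a token message must carry at least one token")
        if self.has_owner and self.data is None:
            raise ValueError("the owner token must travel with valid data")

    @property
    def data_valid(self) -> bool:
        """Plays the role of the valid bit: is usable data attached?"""
        return self.data is not None
```

For example, `TokenMessage(tokens=2, has_owner=False)` is a legal data-less message, while `TokenMessage(tokens=1, has_owner=True)` is rejected because the owner token would arrive without data.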
Avoiding Starvation – Persistent Requests
When a processor detects possible starvation (for example, after a timeout period), it initiates a persistent request.
The request is broadcast; every node recognizes the persistent request and forwards all of its tokens for that block to the requester.
Once the processor gets the tokens, it performs its load or store, exits persistent mode, and broadcasts this to the other nodes so they can go back to their business.
Any problems with this?
Persistent Requests





What if two processors try to initiate persistent requests at the same time?
Each processor has an arbiter state machine to monitor persistent requests.
The requests are forwarded to the home node of the requested memory block.
The arbiter is responsible for activating only one request at a time, and it keeps track of requests in a hardware table.
When the request is finally broadcast, each node must respond with an acknowledgment. Similarly, when the persistent request is over, the arbiter broadcasts that, and all nodes acknowledge it so that a new persistent request can be issued!
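The "only one active request at a time" behavior can be sketched as a small queue. This is a simplified model under stated assumptions: the `PersistentArbiter` class is a hypothetical name, requests are served in FIFO order (the slides do not specify the order), and the broadcast/acknowledgment handshake is not modeled.

```python
from collections import deque
from typing import Deque, Optional


class PersistentArbiter:
    """Hypothetical arbiter: activates at most one persistent request at a time."""

    def __init__(self) -> None:
        self.pending: Deque[str] = deque()  # waiting requesters (FIFO is an assumption)
        self.active: Optional[str] = None   # requester currently in persistent mode

    def request(self, node: str) -> None:
        """A starving node asks for a persistent request."""
        self.pending.append(node)
        self._activate_next()

    def complete(self, node: str) -> None:
        """The active node finished its load or store and left persistent mode."""
        if node != self.active:
            raise ValueError(f"{node} does not hold the active request")
        self.active = None
        self._activate_next()

    def _activate_next(self) -> None:
        # Activate the next waiting request only if none is active.
        if self.active is None and self.pending:
            self.active = self.pending.popleft()
```

If P0 and P1 both call `request`, only P0 becomes active; P1 is activated after `complete("P0")`.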
Performance Protocols
Technically, any protocol can work on top of the correctness substrate, because correctness is guaranteed. That is the beauty of decoupling performance and correctness!
In general, high performance is achieved through the use of transient requests.

Transient Requests
A transient request is a request for a memory block that is not guaranteed to succeed, due to race conditions.
Most transient requests succeed, and those that don’t will eventually be replaced by persistent requests.

TokenB Performance Protocol



Everything so far has been simply “Token Coherence”, which is more of the idea behind the correctness substrate.
The Token-Coherence-using-Broadcast (TokenB) performance protocol uses Token Coherence to guarantee correctness, while using broadcast transient requests for high performance.
A transient request that is not fulfilled is reissued (after an exponential backoff) until it is fulfilled or the correctness substrate invokes a persistent request.
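The reissue-then-escalate policy can be sketched as a simple loop. The function name `issue_with_backoff`, the limit of four attempts, and the abstract time units are all assumptions made for illustration; the slides do not give the actual thresholds.

```python
from typing import Callable, List


def issue_with_backoff(
    try_transient: Callable[[], bool],
    max_attempts: int = 4,  # assumed limit before escalating
    base_delay: int = 1,    # abstract time units, recorded but not slept
) -> List[str]:
    """Return a log of the actions taken to satisfy one miss.

    try_transient() models broadcasting a transient request and reports
    whether enough tokens (and data) arrived.
    """
    log: List[str] = []
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if try_transient():
            log.append(f"attempt {attempt}: satisfied")
            return log
        log.append(f"attempt {attempt}: failed, back off {delay}")
        delay *= 2  # exponential backoff before reissuing
    # Transient requests kept failing: fall back to the correctness substrate.
    log.append("persistent request")
    return log
```

A request that always loses its races ends with a `"persistent request"` entry, which is what guarantees forward progress.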
TokenB
How a component should respond to transient requests:
State of component | Response to shared request | Response to exclusive request
Holds no tokens | Ignore request | Ignore request
Holds only non-owner tokens | Ignore request | Send all tokens (but no data)
Holds owner token, but not every token | Send one token along with the data | Send all tokens with the data
Holds all tokens | Send all tokens with the data | Send all tokens with the data
Why not just send one token?
Should the owner token be sent?
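The response table above can be transcribed directly into a function. This is a sketch with hypothetical names (`respond`, `Response`); it returns how many tokens to send and whether data is attached, and it deliberately does not decide which token is sent when only one is given up.

```python
from typing import Optional, Tuple

# (tokens sent, data attached), or None when the request is ignored
Response = Optional[Tuple[int, bool]]


def respond(tokens_held: int, has_owner: bool, total: int, exclusive: bool) -> Response:
    """Transcribe the TokenB response table for one component."""
    if tokens_held == 0:
        return None  # holds no tokens: ignore every request
    if not has_owner:
        # Holds only non-owner tokens: ignore shared requests,
        # send all tokens (but no data) on exclusive requests.
        return (tokens_held, False) if exclusive else None
    if exclusive or tokens_held == total:
        return (tokens_held, True)  # send all tokens with the data
    return (1, True)                # shared request: one token with the data
```

For instance, `respond(3, True, 4, exclusive=False)` gives `(1, True)`, matching the "holds owner token, but not every token" row.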
Example, revisited
[Figure: P0, P1, and Mem exchanging messages labeled 1, 2, and 3]
No rules were broken, but P0 hasn’t gotten its write permission yet. What can P0 do?
1) P0 requests write access to memory block A by broadcasting a transient request. P1 hears it first, but can do nothing. The interconnect is slow to deliver the request to memory…
2) Now P1 broadcasts a transient read request, which arrives at memory first, so memory sends the owner token along with the data to P1.
3) P0’s request finally arrives at the memory (out of order), so the memory is obligated to send the rest of its tokens to P0. Because P1 heard P0’s request before it received the data, it never sends its newly acquired data or token to P0.
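The token bookkeeping for these three steps can be traced in a short script. It is only an illustration: it assumes block A has four tokens that all start at memory, and the `transfer` helper is a hypothetical name.

```python
TOTAL = 4  # assumption: block A has 4 tokens, all initially held by memory

# tokens[node] = tokens currently held for block A
tokens = {"Mem": TOTAL, "P0": 0, "P1": 0}


def transfer(src: str, dst: str, count: int) -> None:
    """Move tokens between nodes; tokens are never created or destroyed."""
    assert 0 < count <= tokens[src]
    tokens[src] -= count
    tokens[dst] += count


# Step 1: P0's exclusive request reaches P1, which holds no tokens: ignored.
# Step 2: P1's shared request reaches memory first: owner token + data go to P1.
transfer("Mem", "P1", 1)
# Step 3: P0's request reaches memory late: memory sends its remaining tokens.
transfer("Mem", "P0", tokens["Mem"])

p0_can_write = tokens["P0"] == TOTAL  # needs every token
p1_can_read = tokens["P1"] >= 1       # needs at least one token
print(tokens, "P0 can write:", p0_can_write, "P1 can read:", p1_can_read)
```

P0 ends up with three of the four tokens, so it cannot write while P1 holds a readable copy. Safety is preserved, and P0 must reissue its request (or eventually use a persistent request) to collect the last token.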
Obvious Issue?

Reissuing transient requests and falling back to persistent requests seems like a huge hassle. Is it worth it?
The majority of the time, transient requests are fulfilled. Very rarely does a miss go so far as to issue a persistent request!
Performance
Glueless interconnect has a huge impact on bandwidth!
Performance
TokenB still relies on broadcasts, so its traffic is higher than in a directory protocol.
Things to Consider




Is TokenB scalable?
Can we avoid broadcasting?
What kind of overhead do tokens have?
This is a relatively recent paper (2003). In 2005, “Improving Multiple-CMP Systems Using Token Coherence” was published, so the idea might be catching on.