Shared memory can support simple programming models
However, shared memory means complex consistency and cache coherence protocols
Consistency - must provide rules for ordering loads/stores
Tradeoff between ease of use and performance (e.g. sequential consistency vs. release consistency)
Cache coherence - must track ownership of cache lines
Requires low latency for many small coherence messages
Latency of small coherence messages unlikely to scale well
Interprocessor bandwidth likely to scale better
Message passing models can take advantage of this bandwidth, but are hard to program
Can we design a shared memory model with:
Simple programming model
Simple hardware
Good performance
...and take advantage of available bandwidth?
Transaction - sequence of instructions that executes speculatively on a processor and completes only as an atomic unit
All writes are buffered locally and only committed to shared memory when the transaction completes
Commit is atomic: entire effect of transaction committed to shared memory state at once
Transaction’s effects not visible to other processors until commit is performed
Processor broadcasts entire write buffer in one large commit packet
High broadcast bandwidth needed, latency not as important
Interconnect need not provide ordering
System-wide view: transactions appear to execute in commit order
Each processor must snoop on commit packets to detect violations
A transaction that detects a violation must roll back and restart
Register checkpointing mechanism needed
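The mechanism above (local write buffering, atomic commit, snooping, rollback) can be sketched in a few lines. This is a minimal software model, not the hardware design: the `Transaction` class, the `commit` function, and the dictionary used as shared memory are all hypothetical names chosen for illustration, and rollback simply discards state instead of restoring a register checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """One speculative transaction on one (hypothetical) processor."""
    name: str
    read_set: set = field(default_factory=set)        # addresses loaded so far
    write_buffer: dict = field(default_factory=dict)  # stores buffered locally
    violated: bool = False

    def load(self, memory, addr):
        # A transaction sees its own buffered writes first, then shared memory.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        self.read_set.add(addr)
        return memory.get(addr, 0)

    def store(self, addr, value):
        # Stores never reach shared memory before commit.
        self.write_buffer[addr] = value

    def snoop(self, commit_packet):
        # Having read an address that another transaction just committed
        # means this transaction used stale data.
        if not self.read_set.isdisjoint(commit_packet):
            self.violated = True

    def rollback(self):
        # Stand-in for restoring a register checkpoint: drop speculative state.
        self.read_set.clear()
        self.write_buffer.clear()
        self.violated = False


def commit(memory, txn, others):
    """Publish the whole write buffer at once and let the others snoop it."""
    packet = dict(txn.write_buffer)
    memory.update(packet)
    for other in others:
        other.snoop(packet)
    txn.read_set.clear()
    txn.write_buffer.clear()


memory = {"x": 1}
t0, t1 = Transaction("T0"), Transaction("T1")

t1_seen = t1.load(memory, "x")   # T1 speculatively reads x
t0.store("x", 10)                # T0 buffers a write to x
hidden = memory["x"]             # shared memory is unchanged before commit
commit(memory, t0, [t1])         # T0 commits; T1 snoops the commit packet
needs_restart = t1.violated      # T1 read x before T0's commit

t1.rollback()
t1_retry = t1.load(memory, "x")  # the re-executed T1 sees the committed value
```

In this model a processor never asks for ownership of a line; the only interprocessor traffic is the commit packet.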
Consistency is simplified:
Other consistency models: ordering rules between individual memory references
TCC: sequential ordering only between transaction commits
All references from an earlier commit appear to occur “before” all references from a later commit
Interleaving between memory accesses by different processors only allowed at transaction boundaries
Can provide illusion of uniprocessor execution by imposing original program’s transaction order
Coherence is simplified:
No ownership of cache lines
Writes are buffered
Invalidation or update only occurs by snooping commit packets
Don’t rely on many latency-sensitive coherence messages
TCC automatically handles WAW (write-after-write) and WAR (write-after-read) hazards, because writes stay buffered until commit
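A small sketch can make the hazard cases concrete. It assumes a simplified model in which two transactions run concurrently, the earlier one commits first, and the later one is violated only if it has already read an address found in the earlier commit packet. The helper names `execute` and `later_is_violated` and the tuple encoding of operations are hypothetical.

```python
def execute(ops):
    """Run one transaction speculatively; return its read set and write buffer."""
    read_set, write_buffer = set(), {}
    for kind, addr, *value in ops:
        if kind == "load":
            if addr not in write_buffer:   # reads of own writes need no tracking
                read_set.add(addr)
        else:                              # "store"
            write_buffer[addr] = value[0]
    return read_set, write_buffer


def later_is_violated(earlier_ops, later_ops, memory):
    """Both transactions run concurrently; the earlier one commits first."""
    _, earlier_writes = execute(earlier_ops)
    later_reads, later_writes = execute(later_ops)
    memory.update(earlier_writes)                            # earlier commit
    violated = not later_reads.isdisjoint(earlier_writes)    # later one snoops
    if not violated:
        memory.update(later_writes)                          # later commit
    return violated


# WAW: both write x. No violation; the later commit simply overwrites.
mem_waw = {"x": 0}
waw = later_is_violated([("store", "x", 1)], [("store", "x", 2)], mem_waw)

# WAR: earlier reads x, later writes x. The later write was buffered, so the
# earlier transaction could not have seen it; no violation.
mem_war = {"x": 0}
war = later_is_violated([("load", "x")], [("store", "x", 2)], mem_war)

# RAW: earlier writes x, later already read the old x. Violation.
mem_raw = {"x": 0}
raw = later_is_violated([("store", "x", 1)], [("load", "x")], mem_raw)
```

Only the true dependence (RAW) forces a rollback in this model.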
Programmer inserts transaction boundaries
Similar to threading, but with no locks, so fewer errors
One hard rule: transaction breaks cannot be placed between a load and a subsequent store of a shared value
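The hard rule can be illustrated with a shared counter. The sketch below is a toy model under stated assumptions: `increment_split` and `increment_atomic` are hypothetical names, all processors start at the same time, and a violated transaction is modeled by simply re-reading the counter before its next attempt.

```python
def increment_split(memory, n_procs):
    """WRONG: a transaction boundary sits between the load and the store."""
    # Transaction A on each processor: load the counter into a register, commit.
    registers = [memory["counter"] for _ in range(n_procs)]
    # Transaction B on each processor: store register + 1, commit.
    # The load belongs to an already committed transaction, so nothing tracks
    # it any more and no violation can be detected.
    for reg in registers:
        memory["counter"] = reg + 1
    return memory["counter"]


def increment_atomic(memory, n_procs):
    """Load and store of the shared value stay inside one transaction."""
    pending = list(range(n_procs))
    snapshots = {p: memory["counter"] for p in pending}   # speculative loads
    while pending:
        committer = pending.pop(0)
        memory["counter"] = snapshots[committer] + 1      # commit
        # Every other transaction has read the counter, so it is violated,
        # rolls back, and re-executes its load against the new value.
        for p in pending:
            snapshots[p] = memory["counter"]
    return memory["counter"]


lost_update = increment_split({"counter": 0}, 2)
correct = increment_atomic({"counter": 0}, 2)
```

With two processors the split version loses one increment, while the atomic version applies both.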
Steps for parallelizing code with TCC:
Divide into potentially parallel transactions
Examples: loop iterations, after function calls
Transactions need not be independent
Dependence violations caught at runtime
Specify order
Tune performance
Most programs require ordering between certain transactions
Solution: assign phase number to each transaction
Only transactions from the “oldest” phase can commit
Can implement barriers or full ordering
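Phase-based ordering can be sketched as a small arbiter that grants commits only to transactions in the oldest outstanding phase. The function name `arbitrate` is hypothetical, and the sketch assumes the arbiter knows all outstanding transactions up front, which a real system would have to track incrementally.

```python
from collections import Counter


def arbitrate(finish_order):
    """finish_order: (name, phase) pairs in the order they finish executing.

    Returns the order in which the transactions are allowed to commit.
    """
    outstanding = Counter(phase for _, phase in finish_order)
    waiting, committed = [], []
    for txn in finish_order:
        waiting.append(txn)
        while outstanding:
            oldest = min(outstanding)
            grantable = [t for t in waiting if t[1] == oldest]
            if not grantable:
                break                      # oldest phase is still executing
            name, phase = grantable[0]
            waiting.remove(grantable[0])
            committed.append(name)
            outstanding[phase] -= 1
            if outstanding[phase] == 0:
                del outstanding[phase]     # phase done: the next one becomes oldest
    return committed


# Barrier: two phases, any commit order within a phase.
barrier = arbitrate([("B1", 1), ("A0", 0), ("C1", 1), ("D0", 0)])

# Full ordering: every transaction gets its own phase number.
ordered = arbitrate([("t2", 2), ("t0", 0), ("t1", 1)])
```

In the barrier case both phase-0 transactions commit before either phase-1 transaction, even though a phase-1 transaction finished first.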
How to choose transactions:
Large transactions amortize startup and commit overhead
Smaller transactions should be used when violations are frequent
Minimize amount of lost work
TCC system provides feedback about violations to facilitate tuning
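The size tradeoff can be seen in a toy cost model. Everything in it is an assumption made for illustration, not a result from the evaluation: each attempt is violated independently, a violated attempt loses all of its work, the fixed startup and commit overhead is paid on every attempt, and the chance of a violation grows with transaction length at a hypothetical per-cycle conflict rate `q`.

```python
def violation_probability(work, q):
    """Chance that a transaction of `work` cycles is violated at least once."""
    return 1.0 - (1.0 - q) ** work


def efficiency(work, overhead, p_violation):
    """Fraction of execution time that ends up as committed, useful work."""
    expected_attempts = 1.0 / (1.0 - p_violation)
    return work / (expected_attempts * (work + overhead))


OVERHEAD = 20  # hypothetical startup + commit cost, in cycles

# No conflicts: the larger transaction amortizes the overhead better.
large_quiet = efficiency(1000, OVERHEAD, violation_probability(1000, 0.0))
small_quiet = efficiency(100, OVERHEAD, violation_probability(100, 0.0))

# Frequent conflicts: the larger transaction loses far more work per rollback.
large_noisy = efficiency(1000, OVERHEAD, violation_probability(1000, 0.002))
small_noisy = efficiency(100, OVERHEAD, violation_probability(100, 0.002))
```

Under these assumptions the preferred size flips once violations become frequent, which is the kind of tuning the violation feedback is meant to guide.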
Write buffer
Read bit(s), modified bit, optional renamed bits for L1 cache lines
System wide commit arbitration
Double buffering
Allow a processor to work on next transaction while previous transaction waits to commit
Use additional write buffers and sets of read and modified bits
Figure panels: without double buffering; with an extra write buffer; with an extra write buffer and read bits
Hardware-controlled transactions
Automatically divide program into transactions as buffers overflow
Take full advantage of available buffer space
Programmer must still mark critical regions where transaction boundaries cannot occur
Automatically merge small transactions into larger ones
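Automatic division on buffer overflow might look like the following sketch. It rests on assumptions the slides do not state: the instruction stream is reduced to lists of store addresses, each inner list is an indivisible unit (a single store or a programmer-marked critical region), and the hardware commits just before a unit that would overflow the write buffer. The name `split_into_transactions` and the `capacity` parameter are hypothetical.

```python
def split_into_transactions(units, capacity):
    """Group indivisible units into transactions that fit the write buffer.

    units:    list of lists of store addresses
    capacity: number of distinct addresses the write buffer can hold
    """
    transactions, current, buffered = [], [], set()
    for unit in units:
        needed = buffered | set(unit)
        if current and len(needed) > capacity:
            transactions.append(current)   # buffer would overflow: commit here
            current, buffered = [], set()
            needed = set(unit)
        current.append(unit)               # a unit is never split internally
        buffered = needed
    if current:
        transactions.append(current)
    return transactions


stream = [["a"], ["b"], ["c", "d"], ["e"], ["a"]]   # ["c", "d"] is a critical region
chunks = split_into_transactions(stream, capacity=3)
```

Because consecutive small units keep accumulating until the buffer is full, the same loop also models merging small transactions into larger ones.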
Must guarantee no rollback after input is read
Obtain commit permission before reading input
If ordering of outputs is important, the same idea applies: request commit permission immediately
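The input rule can be sketched with a single commit token. This is an assumed model, not the actual arbitration hardware: `CommitArbiter` is a hypothetical class, and holding the token stands in for having been granted system-wide commit permission. While one transaction holds it, no other transaction can commit, so nothing can violate the holder after it has consumed input.

```python
class CommitArbiter:
    """Hypothetical arbiter: at most one transaction holds commit permission."""

    def __init__(self):
        self.holder = None

    def request(self, txn):
        # Grant permission only if nobody else holds it.
        if self.holder is None:
            self.holder = txn
        return self.holder == txn

    def release(self, txn):
        if self.holder == txn:
            self.holder = None


arbiter = CommitArbiter()
device = [7, 8]                              # pending input values

granted_io = arbiter.request("T_io")         # take permission before the read
granted_other = arbiter.request("T_other")   # denied: T_other cannot commit now
value = device.pop(0)                        # irreversible read is now safe
arbiter.release("T_io")                      # stands in for T_io's commit finishing
granted_later = arbiter.request("T_other")   # T_other may commit afterwards
```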
Can TCC extract parallelism for shared memory benchmarks?
How large are the read and write states which must be buffered?
What is the broadcast bandwidth requirement?
Optimal TCC model (infinite bus bandwidth, no memory delays) extracted parallelism well for many benchmarks
Speedups for automatically and manually parallelized benchmarks with optimal TCC model
Most benchmarks needed 6-12 KB of read state and 4-8 KB of write state
Reasonable for current caches and on-chip write buffers
Most of the benchmarks requiring large read and write state can probably be divided into smaller transactions (e.g. radix_l vs. radix_s)
Write state for smallest 10%, 50%, and 90% of iterations
If invalidate protocol is used with 32-bit addresses, average of 0.5 bytes/cycle for 32 processors
For update protocol, up to 16 bytes/cycle for 32 processors
If only dirty data is sent, only 8 bytes/cycle
Average bytes/cycle broadcast by 1 IPC system with an update protocol
Snooping requirement: significantly less than 1 address/cycle
Single snoop port per processor sufficient for up to 32 processors
Commit arbitration overhead: compiler-parallelized apps were insensitive, while performance suffered for apps with smaller transaction sizes
Extensions did not yield benefits in most cases
Extra read state bits (per-word rather than per-line) only mattered for a few applications
Double buffering did not help under the optimal model
It should be useful when bandwidth is limited
TCC simplifies consistency and coherence
No need for rules for ordering individual memory references
No need for latency-sensitive coherence messages
TCC provides a simple and flexible programming model
Correctness is guaranteed: no error-prone locks
Tuning performance based on observed violations is straightforward
Uniprocessor ordering can be achieved by ordering all transactions
An optimal TCC implementation extracts parallelism well for a wide range of benchmarks
Real implementations will be limited by broadcast bandwidth and commit arbitration overhead
Evaluation on a realistic hardware model necessary