Transactional Memory Coherence and Consistency



Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun

Presented by Peter Gilbert

ECE259 Spring 2008


Shared memory can support simple programming models

However, shared memory means complex consistency and cache coherence protocols

Consistency - must provide rules for ordering loads/stores

Tradeoff between ease of use and performance (e.g. sequential consistency (SC) vs. release consistency (RC))

Cache coherence - must track ownership of cache lines

Requires low latency for many small coherence messages


Latency of small coherence messages unlikely to scale well

Interprocessor bandwidth likely to scale better

Message passing models can take advantage, but hard to program

Can we design a shared memory model with:

Simple programming model

Simple hardware

Good performance

...and take advantage of available bandwidth?


Transaction - sequence of instructions that executes speculatively on a processor and completes only as an atomic unit

All writes are buffered locally and only committed to shared memory when the transaction completes

Commit is atomic: entire effect of transaction committed to shared memory state at once

Transaction’s effects not visible to other processors until commit is performed

Processor broadcasts entire write buffer in one large commit packet

High broadcast bandwidth needed, latency not as important

Interconnect need not provide ordering

System-wide view: transactions appear to execute in commit order

Dependence violations

Each processor must snoop on commit packets to detect violations

Transaction detecting violation must roll back and restart

Register checkpointing mechanism needed
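
The transaction mechanism described above can be sketched as a small software model. This is only an illustration under stated assumptions, not the hardware design: the class and method names (`SharedMemory`, `Processor`, `snoop`, `commit`) are hypothetical, register checkpointing is not modeled, and a violated transaction here notices the violation only when it tries to commit, rather than restarting as soon as the violation is detected.

```python
class SharedMemory:
    """Committed state visible to all processors (hypothetical model)."""

    def __init__(self):
        self.data = {}
        self.processors = []

    def broadcast_commit(self, sender, write_buffer):
        # The whole write buffer is applied at once: the commit is atomic.
        self.data.update(write_buffer)
        # Every other processor snoops the commit packet.
        for p in self.processors:
            if p is not sender:
                p.snoop(write_buffer.keys())


class Processor:
    def __init__(self, name, memory):
        self.name = name
        self.memory = memory
        memory.processors.append(self)
        self.begin()

    def begin(self):
        # Start (or restart) a transaction with empty speculative state.
        self.read_set = set()
        self.write_buffer = {}
        self.violated = False

    def load(self, addr):
        if addr in self.write_buffer:
            # Forward the transaction's own buffered write.
            return self.write_buffer[addr]
        self.read_set.add(addr)
        return self.memory.data.get(addr, 0)

    def store(self, addr, value):
        # Buffered locally; invisible to other processors until commit.
        self.write_buffer[addr] = value

    def snoop(self, committed_addrs):
        # Violation: another transaction committed an address we read.
        if self.read_set & set(committed_addrs):
            self.violated = True

    def commit(self):
        if self.violated:
            self.begin()  # roll back: discard speculative state
            return False
        self.memory.broadcast_commit(self, self.write_buffer)
        self.begin()
        return True
```

For example, if two processors each read `x` and buffer `x + 1`, the first commit succeeds and marks the second transaction as violated. The second must re-execute and then sees the committed value.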


Consistency is simplified:

Other consistency models: ordering rules between individual memory references

TCC: sequential ordering only between transaction commits

All references from an earlier commit appear to occur “before” all references from a later commit

Interleaving between memory accesses by different processors only allowed at transaction boundaries

Can provide illusion of uniprocessor execution by imposing original program’s transaction order


Coherence is simplified:

No ownership of cache lines

Writes are buffered

Invalidation or update only occurs by snooping commit packets

Don’t rely on many latency-sensitive coherence messages



TCC automatically handles WAW (output) and WAR (anti) dependences, because writes are buffered until commit

Programming model

Programmer inserts transaction boundaries

Similar to threading, but no locks… fewer errors

One hard rule: transaction breaks cannot be placed between a load and a subsequent store of a shared value

Steps for parallelizing code with TCC:


Divide into potentially parallel transactions

Examples: loop iterations, after function calls

Transactions need not be independent

Dependence violations caught at runtime


Specify order


Tune performance

Transaction Ordering

Most programs require ordering between certain transactions

Solution: assign phase number to each transaction

Only transactions from the “oldest” phase can commit

Can implement barriers or full ordering
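
The phase-number rule can be sketched as a simple commit arbiter. This is a minimal model, assuming the arbiter knows the phase of every outstanding transaction. The function name `commit_order` and the tuple layout are hypothetical and not taken from the paper.

```python
def commit_order(finish_order):
    """Return the order in which transactions are allowed to commit.

    finish_order: list of (name, phase) tuples, in the order the
    transactions finish executing. Only transactions belonging to the
    oldest (smallest) outstanding phase may commit.
    """
    outstanding = list(finish_order)  # not yet committed
    waiting = []                      # finished, waiting for permission
    committed = []
    for txn in finish_order:
        waiting.append(txn)
        # Let as many waiting transactions commit as the rule allows.
        while outstanding:
            oldest = min(phase for _, phase in outstanding)
            ready = [t for t in waiting if t[1] == oldest]
            if not ready:
                break
            t = ready[0]
            waiting.remove(t)
            outstanding.remove(t)
            committed.append(t[0])
    return committed
```

Giving several transactions the same phase number acts as a barrier: they may commit in any order among themselves, but all of them commit before any transaction from the next phase. Giving every transaction a distinct phase number forces a full ordering.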

Performance tuning

How to choose transactions:

Large transactions amortize startup and commit overhead

Smaller transactions should be used when violations are frequent

Minimize amount of lost work

TCC system provides feedback about violations to facilitate tuning
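
The size tradeoff can be illustrated with a toy cost model. Everything in it is an assumption made for illustration, not a result from the paper: each loop iteration costs `work` cycles, each transaction pays a fixed `overhead`, each iteration independently causes a violation with probability `p_violate`, and a violated transaction is re-executed from the start until it succeeds.

```python
def expected_cycles(n_iters, k, work, overhead, p_violate):
    """Expected cycles to run n_iters iterations, k per transaction (toy model)."""
    n_txns = n_iters / k
    # A transaction succeeds only if none of its k iterations is violated.
    p_success = (1.0 - p_violate) ** k
    # Expected number of attempts for a geometric retry is 1 / p_success.
    return n_txns * (overhead + k * work) / p_success


def best_size(n_iters, sizes, work, overhead, p_violate):
    """Pick the transaction size with the lowest expected cost."""
    return min(
        sizes,
        key=lambda k: expected_cycles(n_iters, k, work, overhead, p_violate),
    )
```

With no violations the largest candidate size wins, because the fixed overhead is amortized over more work. As the violation probability rises, the lost work per rollback dominates and smaller transactions become cheaper.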

Hardware requirements

Write buffer

Read bit(s), modified bit, optional renamed bits for L1 cache lines

System-wide commit arbitration


Double buffering

Allow a processor to work on next transaction while previous transaction waits to commit

Use additional write buffers and sets of read and modified bits

[Figure: execution timelines for three configurations: without double buffering, with an extra write buffer, and with an extra write buffer and read bits]


Hardware-controlled transactions

Automatically divide program into transactions as buffers overflow

Take full advantage of available buffer space

Programmer must still mark critical regions where transaction boundaries cannot occur

Automatically merge small transactions into larger ones
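
A rough sketch of buffer-driven transaction splitting follows. It rests on several assumptions: the operation encoding (`"store"`, `"begin_critical"`, `"end_critical"`) and the function name are hypothetical, capacity is counted in distinct addresses, and a split that would fall inside a critical region is simply deferred until the region ends. The sketch therefore lets the buffer grow past its capacity inside a critical region, which real hardware would have to handle some other way.

```python
def split_into_transactions(ops, capacity):
    """Split a stream of operations into transactions at buffer overflow.

    ops: list of ("store", addr), ("begin_critical",), ("end_critical",).
    capacity: number of distinct addresses the write buffer can hold.
    Returns a list of transactions, each a list of stored addresses.
    """
    transactions = []
    current = []       # stores in the transaction being built
    buffered = set()   # distinct addresses held in the write buffer
    in_critical = False
    for op in ops:
        if op[0] == "begin_critical":
            in_critical = True
        elif op[0] == "end_critical":
            in_critical = False
        else:
            addr = op[1]
            would_overflow = addr not in buffered and len(buffered) >= capacity
            # Place a boundary only where the programmer allows one.
            if would_overflow and not in_critical:
                transactions.append(current)
                current, buffered = [], set()
            current.append(addr)
            buffered.add(addr)
    if current:
        transactions.append(current)
    return transactions
```

With a capacity of two addresses, five stores to distinct addresses are split into transactions of sizes 2, 2, and 1. Wrapping some of the stores in a critical region moves the boundary to the first store after the region.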


Must guarantee no rollback after input is read

Obtain commit permission before reading input

If ordering of outputs is important

Same idea: request commit permission immediately


Can TCC extract parallelism for shared memory benchmarks?

How large are the read and write states which must be buffered?

What is the broadcast bandwidth requirement?

Simulation results

Optimal TCC model (infinite bus bandwidth, no memory delays) extracted parallelism well for many benchmarks

Speedups for automatically and manually parallelized benchmarks with optimal TCC model

Read and write state

Most benchmarks needed 6-12 KB of read state and 4-8 KB of write state

Reasonable for current caches and on-chip write buffers

Most of the benchmarks requiring large read and write state can probably be divided into smaller transactions (e.g. radix_l vs. radix_s)

Write state for smallest 10%, 50%, and 90% of iterations

Broadcast bandwidth

If invalidate protocol is used with 32-bit addresses, average of 0.5 bytes/cycle for 32 processors

For update protocol, up to 16 bytes/cycle for 32 processors

If only dirty data is sent, only 8 bytes/cycle
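
The difference between the three commit-packet options can be made concrete with a small size calculation. This is a sketch under stated assumptions: the 4-byte address follows from the 32-bit addresses mentioned above, while the 64-byte line size, the 4-byte word size, and the function name `packet_bytes` are hypothetical. Any per-word mask needed by the dirty-only variant is ignored, and the sketch does not attempt to reproduce the measured bytes/cycle figures.

```python
ADDRESS_BYTES = 4   # 32-bit addresses, as stated in the text
LINE_BYTES = 64     # assumed cache line size (hypothetical)
WORD_BYTES = 4      # assumed word size (hypothetical)


def packet_bytes(dirty_words_per_line, protocol):
    """Size of one commit packet.

    dirty_words_per_line: one entry per written cache line, giving the
    number of modified words in that line.
    """
    n_lines = len(dirty_words_per_line)
    if protocol == "invalidate":
        # Only the addresses of written lines are broadcast.
        return n_lines * ADDRESS_BYTES
    if protocol == "update":
        # Each address is followed by the full line of data.
        return n_lines * (ADDRESS_BYTES + LINE_BYTES)
    if protocol == "update_dirty_only":
        # Each address is followed only by the modified words.
        return sum(ADDRESS_BYTES + d * WORD_BYTES for d in dirty_words_per_line)
    raise ValueError(f"unknown protocol: {protocol}")
```

For a transaction that wrote three lines with 16, 4, and 1 dirty words, this gives 12 bytes for invalidate, 204 bytes for update, and 96 bytes when only dirty data is sent.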

Average bytes/cycle broadcast by 1 IPC system with an update protocol

Other parameters

Snooping requirement: significantly less than 1 address/cycle

Single snoop port per processor sufficient for up to 32 processors

Commit arbitration overhead: compiler-parallelized apps were insensitive, while performance suffered for apps with smaller transaction sizes

Extensions did not yield benefits in most cases

Extra read state bits (per-word rather than per-line) only mattered for a few applications

Double-buffering did not help

Will be useful when bandwidth is limited


TCC simplifies consistency and coherence

No need for rules for ordering individual memory references

No need for latency-sensitive coherence messages

TCC provides a simple and flexible programming model

Correctness is guaranteed: no error-prone locks

Tuning performance based on observed violations is straightforward

Uniprocessor ordering can be achieved by ordering all transactions

An optimal TCC implementation extracts parallelism well for a wide range of benchmarks


Performance of a real implementation will be limited by broadcast bandwidth and commit arbitration overhead

Evaluation on a realistic hardware model necessary