Signatures in Transactional Memory Systems Dissertation Defense Luke Yen

Signatures in Transactional
Memory Systems
Dissertation Defense
Luke Yen
1/29/2009
Key Contributions
Trend: Transactional memory (TM) is an emerging parallel programming
paradigm: programmer-annotated transactions that execute
atomically (all or nothing).
Challenge #1: Hardware TM (HTM) systems may restrict
transactions or incur overheads on common events (e.g., cache
evictions).
Contribution: LogTM-SE HTM: simple hardware that interacts with the
operating system to virtualize transactions. No overhead on
cache evictions.
2
Key Contributions Cont.
Challenge #2: (1) H3 signatures have high area & power overheads &
(2) thread-private references cause false conflicts.
Contribution: Notary: (1) Page-Block-XOR performs similarly to H3
but with lower overheads; (2) stack- & heap-based privatization.
Challenge #3: Difficult to understand HTM system performance.
Contribution: TMProf: Lightweight hardware performance counters
help HTM designers & TM programmers.
Challenge #4: Signatures suffer from false conflicts.
Contribution: Six hardware/software signature extensions to
mitigate false conflicts.
3
Outline
• Introduction and Background
  • Transactional Memory background
  • LogTM-SE [HPCA 2007] (Contribution #1)
• Notary [MICRO 2008] (Contribution #2)
• TMProf (Submitted for publication) (Contribution #3)
• Conclusion
* Skip “Extensions to Signatures” (Contribution #4)
Focus of presentation: Notary and TMProf
4
Transactional Memory (TM)
• Locks do not compose
• Can lead to deadlocks
• TM programmer says
• “I want this atomic”
• TM system
• “Makes it so”
void move(T s, T d, Obj key){
  atomic {
    tmp = s.remove(key);
    d.insert(key, tmp);
  }
}
• Focus on Hardware TM (HTM) Implementations
• Fast
• Leverage cache coherence & speculation
• But hardware finite & should be policy-free
5
LogTM Signature Edition (LogTM-SE) at
50,000 feet
• HTMs Fast
• Version management – for transaction commits & aborts
• HW handles old/new versions (e.g., write buffer)
• Conflict detection – commit only non-conflicting transactions
• HW handles conflict detection (R/W bits & coherence)
• But Closely Coupled to L1 cache
• On critical paths & hard for SW to save/restore
• Our Approach: Decoupled, Simple HW, SW control
• LogTM-SE
• HW: LogTM’s Log + Signatures (from Illinois Bulk)
• SW: Unbounded nesting, thread switching, & paging
6
Signature Background
• Signatures used to summarize and detect conflicts
with a transaction’s read- and write-sets
  • Inspired by Bulk system [Ceze, ISCA’06]
  • Imprecise, can be implemented with Bloom filters
  • Can have false positives, but never false negatives
  • Also proposed for non-TM purposes (e.g., SC violation
  detection, atomicity violation detection, race recording)
• Ex: Use k Bloom filters of size m/k, with independent
hash functions
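A minimal sketch of such a signature, written as k banks of m/k bits with one hash function per bank. The class name, the default sizes, and the use of `hashlib` as a stand-in for hardware hash functions are assumptions for illustration, not the LogTM-SE design:

```python
import hashlib


class Signature:
    """Parallel Bloom filter: k banks of m/k bits, one hash function per bank."""

    def __init__(self, m=2048, k=2):
        assert m % k == 0
        self.k = k
        self.bank_bits = m // k
        self.banks = [0] * k  # each bank is a Python int used as a bit array

    def _index(self, bank, addr):
        # Stand-in hash (assumption): hardware would use bit-selection, H3 or PBX.
        digest = hashlib.sha256(f"{bank}:{addr}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.bank_bits

    def insert(self, addr):
        for b in range(self.k):
            self.banks[b] |= 1 << self._index(b, addr)

    def might_contain(self, addr):
        # True only if the address's bit is set in every bank.
        return all((self.banks[b] >> self._index(b, addr)) & 1
                   for b in range(self.k))


write_sig = Signature()
for block_addr in (0x1000, 0x1040, 0x2080):
    write_sig.insert(block_addr)
```

Every inserted address is reported as present (no false negatives); an address that was never inserted may still be reported as present when its bits collide with earlier inserts (a false positive).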
7
Outline
• Introduction and Background
• Notary
  • Signature Background
  • Entropy & Page-Block-XOR
  • Privatization
  • Methodology & Results
  • Conclusions
• TMProf
• Conclusion
8
Notary Executive Summary
Tackle 2 problems with hardware signatures:
• Problem 1: Best signature hashing (i.e., H3) has high area &
power overheads
• Solution 1: Use entropy analysis to guide lower-cost hashing
(Page-Block-XOR, PBX) that performs similar to H3
• Ex: 8x fewer gates - 160 gates for H3 vs 20 gates for PBX
• Problem 2: Spurious signature conflicts caused by signature
bits set by private memory addrs
• Solution 2: Avoid inserting private stack addrs, propose
privatization interface for higher performance
9
Outline
• Introduction and Background
• Notary
  • Signature Background
  • Entropy & Page-Block-XOR
  • Privatization
  • Methodology & Results
  • Conclusions
• TMProf
• Conclusion
10
Signature hash functions
• Which hash function is best? [Sanchez, YEN, MICRO’07]
  • Bit-selection? Hash simply decodes some number of input bits
  • H3? Each bit of a hash value is an XOR of (on avg.) half of the input
  address bits
[Figure: LogTM-SE w/ 2kb signatures]
• Result: H3 better with >=2 hash functions
• However, H3 uses many multi-level XOR trees
  • Can we improve this?
11
H3 implementation
• Num XOR = $c \cdot k \cdot \frac{\text{addr length in bits}}{4}$
• Ex: 2kb signatures, k=2, c=10, 32-bit addr =
160 XOR gates per signature
• Can we reduce the total gate count?
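A rough sketch of an H3 hash and of the gate-count estimate from this slide. The function names, the seeded random masks, and the 32-bit default are illustrative assumptions; the gate formula is the one stated on the slide:

```python
import random


def make_h3(c, addr_bits=32, seed=0):
    """Return an H3 hash producing c output bits from an addr_bits-wide address."""
    rng = random.Random(seed)
    # One random mask per output bit; each address bit is included with
    # probability 1/2, so on average half of the address bits feed each XOR tree.
    masks = [rng.getrandbits(addr_bits) for _ in range(c)]

    def h3(addr):
        value = 0
        for bit, mask in enumerate(masks):
            parity = bin(addr & mask).count("1") & 1  # XOR of the selected bits
            value |= parity << bit
        return value

    return h3


def h3_xor_gates(c, k, addr_bits=32):
    # Gate-count estimate from the slide: c * k * (addr length in bits) / 4.
    return c * k * addr_bits // 4


h = make_h3(c=10)
gates = h3_xor_gates(c=10, k=2)  # the slide's 2kb, k=2 example
```

Because every output bit is a parity over a fixed subset of input bits, the hash is linear over XOR, which is why it maps naturally onto XOR trees in hardware.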
12
Outline
• Introduction and Background
• Notary
  • Signature Background
  • Entropy & Page-Block-XOR
  • Privatization
  • Methodology & Results
  • Conclusions
• TMProf
• Conclusion
13
Entropy defined
• Insight: Use most random bits for hashing
• Use entropy to measure bit randomness
• Entropy = $-\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$
• p(xi) = the probability of the occurrence of value xi
• N = number of sample values random variable x can
take on
• Entropy = amount of information required on average
to describe outcome of variable x (in bits)
• Ex: What is the best possible lossless compression?
Entropy value of n-bit field:
• max = n bits: all bit patterns in n-bit field equally probable
• min = 0 bits: constant value in n-bit field with probability 1
14
Our measures of entropy
• For our workloads, we care about:
• Q1: What is the best achievable entropy?
• Global entropy – upper bound on entropy of
address
• Q2: How does entropy change within an
address?
• Local entropy – entropy of bit-field within the
address
[Figure: address bits 31..6. Global entropy covers the whole field; local entropy covers a bit-window within it, positioned by skipping NSkip bits]
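The two measures can be sketched as follows, assuming a trace of addresses is available as a list of integers. The function names, the 16-bit default window, and the choice to count NSkip upward from bit 6 are assumptions for illustration:

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def global_entropy(addrs, low_bit=6):
    """Entropy of the whole block address (all bits at or above `low_bit`)."""
    return entropy_bits([a >> low_bit for a in addrs])


def local_entropy(addrs, nskip, width=16, low_bit=6):
    """Entropy of a `width`-bit window starting nskip bits above `low_bit`."""
    mask = (1 << width) - 1
    return entropy_bits([(a >> (low_bit + nskip)) & mask for a in addrs])


# Four distinct, equally likely block addresses carry 2 bits of entropy.
trace = [i << 6 for i in range(4)]
```

Sliding `nskip` across the address shows which bit-fields carry most of the randomness, which is the information the hash design needs.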
15
Entropy results
• Workloads to be described later
• Global entropy is at most 16 bits
• Bit-window for local entropy is 16 bits wide (NSkip from 0-10)
• Smaller windows (<16b) may not reach global entropy value
• Larger windows (>16b) hide some fine-grain info
16
Page-Block-XOR (PBX)
• Motivated by 3 findings:
• (1) Lower-order bits have most entropy
• Follows from our entropy results
• (2) XORing two bit-fields produces random hash values
• From prior work on XOR hashing (e.g., data placement in caches,
DRAM)
• (3) Bit-field overlaps can lead to higher false positives
• Correlation between the two bit-fields can reduce the range of
hash values produced (worse for larger signatures)
17
PBX implementation
• For 2kb signatures with 2 hash functions:
• 20 XOR gates for PBX vs 160 XOR gates for H3!
• PPN and Cache-index fields not tied to system
params:
• Use entropy to find two non-overlapping bit-fields with
high randomness
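A minimal sketch of a PBX-style hash, assuming two adjacent, non-overlapping c-bit fields directly above a 6-bit block offset. The field positions, the function names, and the reading of the 20-gate figure as one two-input XOR per hash bit are assumptions; the slides say only that the fields are chosen by entropy analysis:

```python
def pbx_hash(addr, c=10, low_shift=6):
    """XOR two adjacent, non-overlapping c-bit fields of a physical address."""
    mask = (1 << c) - 1
    low_field = (addr >> low_shift) & mask          # cache-index-like bits
    high_field = (addr >> (low_shift + c)) & mask   # page-number-like bits
    return low_field ^ high_field                   # c two-input XOR gates


def pbx_xor_gates(c, k):
    # Assumed model: one two-input XOR per hash bit, for each of k hash functions.
    return c * k


pbx_gates = pbx_xor_gates(c=10, k=2)  # 2kb signature with 2 hash functions
```

Under this model the gate count depends only on the hash width and not on the address length, which is where the saving over H3 comes from.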
18
Summary thus far
• Problem 1: H3 has high area & power overheads
• Solution 1: Use entropy analysis to guide lower-cost
PBX
• Ex: 160 gates for H3 vs 20 gates for PBX
• Problem 2: Spurious signature conflicts caused by
signature bits set by private memory addrs
• Solution 2: To be described
19
Outline
• Introduction and Background
• Notary
  • Signature Background
  • Entropy & Page-Block-XOR
  • Privatization
  • Methodology & Results
  • Conclusions
• TMProf
• Conclusion
20
Privatization
• Problem: False conflicts caused by thread-private
addrs
• Avoid conflicts if addrs not inserted in thread’s signatures
Two privatization solutions:
• (1) Remove private stack references from sigs.
• Very little work for programmer/compiler
• Benefits depend on fraction of stack addresses versus all
transactional references
• (2) Language-level interface (e.g., private_malloc(),
shared_malloc())
• Even higher performance boost
• WARNING: Incorrectly marking shared objects as private can lead
to program errors!
21
Page-based implementation
• Each page is assigned a status, private or
shared
• Invariant: Page is shared if any object is shared
• If stack is private, library marks stack pages as
private
• If using privatization heap functions, mark heap
pages accordingly
22
OS support
• OS allocates different physical page frames for
shared and private pages
• Sets a per-frame bit in translation entry if shared
• Reduce number of page frames used by packing
objects with same status together
• Signatures insert memory addresses of
transactional references to shared pages
• Query page sharing bit in HW TLB & current
transactional status
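The insert filter described above can be sketched as follows. A Python set stands in for the signature, and the `TlbEntry` type, the dictionary TLB, and the 8 kB page size are hypothetical choices made only for this example:

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    frame: int    # physical page frame number
    shared: bool  # per-frame sharing bit set by the OS


def maybe_insert(signature, vaddr, tlb, in_transaction, page_bits=13):
    """Insert the physical address into `signature` only for transactional
    references to shared pages; return True if an insert happened."""
    entry = tlb[vaddr >> page_bits]
    if not (in_transaction and entry.shared):
        return False
    offset = vaddr & ((1 << page_bits) - 1)
    signature.add((entry.frame << page_bits) | offset)
    return True


tlb = {0: TlbEntry(frame=5, shared=True), 1: TlbEntry(frame=9, shared=False)}
read_sig = set()  # stand-in for a hardware read signature
inserted_shared = maybe_insert(read_sig, 0x10, tlb, in_transaction=True)
inserted_private = maybe_insert(read_sig, (1 << 13) + 4, tlb, in_transaction=True)
```

References to private pages never set signature bits, so they cannot contribute to false conflicts.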
23
Outline
• Introduction and Background
• Notary
  • Signature Background
  • Entropy & Page-Block-XOR
  • Privatization
  • Methodology & Results
  • Conclusions
• TMProf
• Conclusion
24
Methodology
• Full-system simulation (GEMS)
• Transistor-level design for area & power of XOR gates
• CACTI for Bloom filter bit array area & power
• Linear scaling to 65nm or 90nm for area, original 400nm
for power
• Single-chip CMP
  • 16 single-threaded, in-order cores
  • 32kB, 4-way private L1 I & D
  • 8MB, 8-way shared L2 cache
  • MESI directory protocol
  • Signatures from 64b-64kb (8B-8kB) & “perfect”
25
Workloads
• Micro-benchmarks
• SPLASH-2 apps
• Barnes & Raytrace – exert most signature pressure
• Stanford STAMP apps
• Vacation, Genome, Delaunay, Bayes, Labyrinth, Yada,
Intruder
• DNS server
• BIND
26
PBX vs H3 area & power
• Area & power overheads (2kb, k=4):
Type of overhead   Bloom filter bit array   H3 hash   PBX hash   H3 sig.   PBX sig.   % savings for PBX sig.
Area (mm^2)        4.67e-3                  1.35e-3   7.83e-5    6.02e-3   4.75e-3    21
Power (mW)         1.80e2                   1.04e1    1.02       1.90e2    1.81e2     4.7
27
PBX vs H3 execution time
PBX performs similar to H3
28
Privatization results summary
• Removing private stack references from
signatures did not help
• Most addr references not to stack
• Most likely because running with SPARC ISA. Other
ISAs (e.g., x86) likely have more benefits
• Privatization interface helps five workloads
• Remainder either does not have private heap
structures or does not have high transactional duty
cycle
29
Privatization interface results
Can improve
execution time
30
Outline
• Introduction and Background
• Notary
  • Signature Background
  • Entropy & Page-Block-XOR
  • Privatization
  • Methodology & Results
  • Conclusions
• TMProf
• Conclusion
31
Conclusions
• Tackle 2 problems with signature designs:
• (1) Area and power overheads of H3 hashing
• E.g., 160 XOR gates for H3, 20 for PBX
• (2) False conflicts due to signature bits set by private
memory references
• Our solutions:
• (1) Use entropy analysis to guide hashing function (PBX), a
low-cost alternative that performs similarly to H3
• (2) Prevent private stack references from entering
signatures, and propose a privatization interface for heap
allocations
• Notary can be applied to non-TM uses:
• PBX hashing can directly transfer
• Privatization may transfer if addr filtering applies
32
Outline
• Introduction and Background
• Notary
• TMProf
  • Motivation
  • Background
  • TMProf
  • Two Case Studies
  • Future Directions for TMProf
  • Conclusions
• Conclusion
33
TMProf Executive Summary
• TM exposes more parallelism than lock-based programs
• Complex thread interactions
• How can HTM designer understand HTM
performance?
• How can TM programmer understand TM
program performance?
• TMProf: Per-processor hardware performance
counters to count cumulative event frequencies &
overheads in HTM system
34
Outline
• Introduction and Background
• Notary
• TMProf
  • Motivation
  • Background
  • TMProf
  • Two Case Studies
  • Future Directions for TMProf
  • Conclusions
• Conclusion
35
Critical-section Parallelism
• TM enables critical-section parallelism – more
thread interleavings
[Figure: With Locks, Thread 0 and Thread 1 each acquire Lock A. With TM, Thread 0 and Thread 1 each execute xact_begin]
36
Hard to Predict Program Performance
• TM programmers may not have mastered
intricacies of HTM system
• Programs run faster on specific HTM
• Example:
37
Profiling with TMProf
• Allows HTM designers & TM programmers to
understand HTM performance
• With TMProf:
38
Outline
• Introduction and Background
• Notary
• TMProf
  • Motivation
  • Background
  • TMProf
  • Two Case Studies
  • Future Directions for TMProf
  • Conclusions
• Conclusion
39
Background on Conflicts
• Three types: RW, WR, and WW
• Analogous to WAR, RAW, and WAW dependencies in uniprocessors

Conflict   Thread 0                 Thread 1
RW         xact_begin … LD A …      xact_begin … ST A …
WR         xact_begin … ST B …      xact_begin … LD B …
WW         xact_begin … ST C …      xact_begin … ST C …
40
Conflict Detection & Resolution
• Conflicts detected eagerly or lazily
• Eagerly – when requests occur
• Lazily – at transaction commit
• Conflict resolution
• Stall or abort on conflict
• Choose set of procs to take action
41
Outline
• Introduction and Background
• Notary
• TMProf
  • Motivation
  • Background
  • TMProf
  • Two Case Studies
  • Future Directions for TMProf
  • Conclusions
• Conclusion
42
TMProf
• Per-processor HW counters measuring cumulative
event frequencies and cumulative event overheads
• Two implementations: Base & Extended
• Base (BaseTMProf): Breaks down HTM execution
cycles into common components
• Extended (ExtTMProf): Builds on BaseTMProf & adds
HTM-specific transaction-level profiling
43
BaseTMProf & ExtTMProf
• BaseTMProf:
• Total cycles = stalls + aborts + wasted_trans +
useful_trans + committing + nontrans + implementation
specific
• Assume in-order procs, but can extend for out-of-order
procs
• ExtTMProf: BaseTMProf profiling plus
• Size of aborted transactions
• Amount of transactional work after write-set prediction
• HTMs may add more detailed profiling in future
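The BaseTMProf breakdown can be pictured as one set of cycle counters per processor. The sketch below is a software model only; the class and field names are illustrative, and the real counters would be hardware registers:

```python
from dataclasses import dataclass, fields


@dataclass
class BaseTMProfCounters:
    """One instance per processor; every field is a cumulative cycle count."""
    stalls: int = 0
    aborts: int = 0
    wasted_trans: int = 0
    useful_trans: int = 0
    committing: int = 0
    nontrans: int = 0
    implementation_specific: int = 0

    def total_cycles(self):
        # Total cycles is the sum of all components.
        return sum(getattr(self, f.name) for f in fields(self))

    def breakdown(self):
        """Fraction of total cycles spent in each component."""
        total = self.total_cycles()
        if total == 0:
            return {}
        return {f.name: getattr(self, f.name) / total for f in fields(self)}


cpu0 = BaseTMProfCounters(stalls=10, useful_trans=30, nontrans=60)
```

Comparing the `breakdown()` of two HTM configurations on the same workload shows which component a design change moved.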
44
Outline
• Introduction and Background
• Notary
• TMProf
  • Motivation
  • Background
  • TMProf
  • Two Case Studies
  • Future Directions for TMProf
  • Conclusions
• Conclusion
45
Two Case Studies
• TMProf profiling two HTMs:
• LogTM-SE (eager conflict detection & version management,
EE)
• Approximation of Stanford’s TCC (lazy conflict detection &
version management, LL)
• Examine key parameters of eager & lazy conflict detection
• Idealize version management
• Same system parameters as Notary
• 16-processor CMP w/ in-order, single-issue processor cores
• Perfect signatures
• Same workloads
46
EE: Different Conflict Resolutions
• Three different conflict resolutions:
• Base, Timestamp, Hybrid
• All use timestamps
• Base: Requestor stalls until possible deadlock
• Timestamp: Older requestors always abort
younger transactions. Younger requestors stalled
by older transactions.
• Hybrid: Base, except RW from older writer aborts
younger reader
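The three policies can be compared with a small decision function. This is a simplified model: the function name, the string results, and the `possible_deadlock` flag are assumptions, and which transaction aborts on a possible deadlock in Base is assumed here to be the requestor:

```python
def resolve(policy, conflict, requestor_ts, responder_ts, possible_deadlock=False):
    """Decide the action for one conflict.

    conflict is "RW", "WR" or "WW"; the first letter is the responder's earlier
    access and the second is the requestor's access. Smaller timestamp = older.
    """
    requestor_older = requestor_ts < responder_ts
    if policy == "timestamp":
        # Older requestors abort younger transactions; younger requestors stall.
        return "abort_responder" if requestor_older else "stall_requestor"
    if policy == "hybrid" and conflict == "RW" and requestor_older:
        # Older writer requests a block read by a younger transaction.
        return "abort_responder"
    # Base behavior (also Hybrid's default): stall until a possible deadlock.
    return "abort_requestor" if possible_deadlock else "stall_requestor"


actions = {p: resolve(p, "RW", requestor_ts=1, responder_ts=2)
           for p in ("base", "timestamp", "hybrid")}
```

For an older writer that conflicts with a younger reader, Base stalls the requestor while Timestamp and Hybrid abort the reader.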
47
EE: Write-set Prediction
• Avoid aborts from load then store pattern from
thread
• Predict & serialize on these conflicts
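[Figure: timelines for T0, T1 and T2. Without prediction, the transactions issue GetS and later GetX for the same block, and ABORTs follow. With prediction, T0 issues GetX up front, and T1 and T2 STALL on their GetS requests]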
48
Results from Conflict Resolutions
Trends:
1) Timestamp & Hybrid better than Base
49
Timestamp & Hybrid Better than Base
Fewer total stalls & eliminates all RW Requestor older stalls
50
EE Summary with BaseTMProf
• BaseTMProf helps HTM designer understand
performance of conflict resolution schemes
• Lightweight, fast, dynamic profiling
• Can be implemented in prototype HTM systems
51
Write-set Prediction Results
• Focus on workloads that degrade from prediction
Prediction increases Stall cycles
52
ExtTMProf’s Transaction-level Profiling
Predictions Help
Prediction helps short
transactions
Predictions Hurt
Prediction hurts large
transactions – reduces
concurrency
53
EE Summary with ExtTMProf
• Helps HTM designers understand why write-set
prediction degrades (or improves) performance
• Offline analysis (e.g., traces) unable to determine
performance implications of dynamic conflicts
• How can TMProf help analyze LL systems?
54
LL: Parallel Versus Serial Commit
• Serial = Only one committer at a time
• Parallel = Multiple concurrent committers
• Faster than Serial
• We idealize its implementation
55
LL: More Prefetching than EE
• Eager conflict detection:
• Progress bounded by location of conflicts
• Early conflicts  abort transactions early (little prefetching)
• Late conflicts  abort transactions late (lots of prefetching)
• Lazy conflict detection:
• Committers finish transaction before detecting conflicts
• High probability for lots of prefetching
56
Parallel Commit Results
Parallel commit removes commit token bottleneck
57
Conflicts with Parallel Commit
All conflicts either RW or WR – no WW conflicts
58
LL Summary with BaseTMProf
• BaseTMProf clearly shows why parallel commit
helps
• Stall breakdown shows mostly WR conflicts
• BaseTMProf helps HTM designers decide
whether to implement parallel commit
• Parallel commit more complex than serial commit
59
Prefetching Results
Useful Trans should be similar for EE & LL, but
LL incurs fewer cycles
Why?
60
ExtTMProf’s Transaction-level Profiling
LL’s aborted transactions prefetch farther than EE
61
LL Summary with ExtTMProf
• Explains why workloads execute faster on LL
than on EE
• May influence HTM design decision to implement
LL rather than EE
• Helps TM programmer understand why programs
run faster on some HTMs
62
Outline
• Introduction and Background
• Notary
• TMProf
  • Motivation
  • Background
  • TMProf
  • Two Case Studies
  • Future Directions for TMProf
  • Conclusions
• Conclusion
63
Software Rollback Better than Hardware
Rollback
Software rollback reduces Stalls & Wasted Trans
May reduce contention in HTM?
64
Hardware for Critical-path Profiling
• Counter-based profiling is not sufficient
• Multi-threaded programs exhibit variability:
• Different dynamic code paths
• Inter-thread dependencies
• Memory latencies
• Factors change critical-path – longest control flow
that determines execution time
• Hardware critical-path profiling can aid in
understanding performance
• Faster than offline, software analyses
65
Outline
• Introduction and Background
• Notary
• TMProf
  • Motivation
  • Background
  • TMProf
  • Two Case Studies
  • Future Directions for TMProf
  • Conclusions
• Conclusion
66
Conclusions
• TMProf – lightweight per-processor hardware
counters for understanding HTM performance
• Cumulative event frequencies & overheads
• Two implementations: Base & Extended
• Two case studies: LogTM-SE & Approximation of
TCC
• Future TMProf might add hardware support for
critical-path profiling
67
Outline
• Introduction and Background
• Notary
• TMProf
• Conclusion
68
Conclusions
• Challenge #1: Hardware TM (HTM) systems may
restrict transactions or incur overheads on
common events.
• Contribution: LogTM-SE HTM
• Challenge #2: (1) H3 signatures high area &
power overheads & (2) Thread-private references
cause false conflicts.
• Contribution: Notary
69
Conclusions Cont.
• Challenge #3: Difficult to understand HTM
system performance.
• Contribution: TMProf
• Challenge #4: Signatures suffer from false
conflicts.
• Contribution: Six hardware/software extensions to
signatures
70
Other Research & Contributions
• OS Support for Virtualizing Transactional Memory
[Swift et al. TRANSACT ‘08]
• Implementing Signatures for Transactional Memory
[Sanchez et al. MICRO ‘07]
• Performance Pathologies in Hardware Transactional
Memory [Bobba et al., ISCA ’07 & Top Picks ‘08]
• Supporting Nested Transactional Memory in LogTM
[Moravan et al., ASPLOS ‘06]
• GEMS 2.X development & support
• SMT in Opal, LogTM-SE in Ruby
71
Thank You!
Questions?
72
Backup Slides
73
LogTM-SE Processor Hardware
• Segmented log, like LogTM
• Track R / W sets with R / W signatures
  • Over-approximate R / W sets
  • Tracks physical addresses
• Summary signature used for virtualization
• Conflict detection by coherence protocol
  • Check signatures on every memory access for SMT
[Figure: SMT thread context holding Registers, Register Checkpoint, LogFrame, TMcount, LogPtr, and the Read, Write, SummaryRead and SummaryWrite signatures; data caches (Tag, Data) hold NO TM STATE]
74
Thread Switching Support
• Why?
• Support long-running transactions
• What?
• Conflict Detection for descheduled transactions
• How?
• Summary Read / Write signatures:
If thread t of process P is scheduled to use an active signature,
the corresponding summary signature holds the union of the saved
signatures from all descheduled threads from process P.
Updated using TLB-shootdown-like mechanism
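The union can be sketched with integers used as bit vectors. The function name and the thread label T5 with its value are hypothetical; 01001000 is one of the signature values shown in the animation slides:

```python
from functools import reduce


def summary_signature(saved_signatures):
    """Bitwise-OR the saved signatures (ints used as bit vectors)."""
    return reduce(lambda acc, sig: acc | sig, saved_signatures, 0)


# Saved write signatures of the descheduled threads of one process.
saved_write = {"T1": 0b01001000, "T5": 0b00000011}
summary_write = summary_signature(saved_write.values())
```

Because a union only sets bits, checking the summary signature can report extra conflicts but never misses one involving a descheduled transaction.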
75
Handling Thread Switching
[Animation frame: the OS holds W and R summary signatures, all 00000000. Processors P1-P4 each hold W and R signatures plus W and R summary signatures; threads T1, T2 and T3 are running. Signature values shown include 01001000, 01010010, 0100000 and 01000010]
76
Handling Thread Switching
[Animation frame: T1 is descheduled from P1. Its W (01001000) and R (01010010) signatures are merged into the OS summary signatures]
77
Handling Thread Switching
[Animation frame: after the deschedule of T1, the OS summary signatures (W 01001000, R 01010010) are sent out to the processors]
78
Handling Thread Switching
[Animation frame: the OS summary signatures are W 01001000 and R 01010010. P1's signatures and summary signatures are 00000000; P2 and P3 now hold summary signatures W 01001000 and R 01010010; P4's summary signatures remain 00000000]
79
Thread Switching Support Summary
• Summary Read / Write signatures
• Summarizes descheduled threads with
active transactions
• One OS structure per process
Coherence
• Check summary signature on every
memory access
• Updated on transaction deschedule
• Similar to TLB shootdown
80
Paging Support Summary
Problem:
• Changing page frames
• Need to maintain isolation on transactional blocks
Solution:
On Page-Out:
• Save Virtual -> Physical mapping
On Page-In:
• If different page frame, update signatures with physical
address of transactional blocks in new page frame.
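A minimal sketch of the page-in step, using a Python set as an exact stand-in for a signature. The function name, the 8 kB page and 64 B block sizes, and the approach of probing every block of the old frame are assumptions made for this example:

```python
def remap_on_page_in(signature, old_frame, new_frame, page_bits=13, block_bits=6):
    """After a page moves to a new frame, insert the new physical address of
    every block of the old frame that the signature reports as present."""
    if old_frame == new_frame:
        return
    for block in range(1 << (page_bits - block_bits)):
        offset = block << block_bits
        if ((old_frame << page_bits) | offset) in signature:
            signature.add((new_frame << page_bits) | offset)


write_sig = {(3 << 13) | (2 << 6)}  # one transactional block in frame 3
remap_on_page_in(write_sig, old_frame=3, new_frame=7)
```

The old addresses are left in place, mirroring the fact that bits cannot be cleared selectively from a Bloom filter; the signature then covers the blocks in both frames.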
81
Paging Support Animation
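[Animation: virtual page VP1 is paged out of physical page PP1 and paged in to PP2. Blocks A-D are checked against the read and write signatures, and the signatures are updated with the corresponding blocks A'-D' in the new frame]
Read & Write signatures isolate memory blocks from PP1 & PP2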
82
BaseTMProf for LogTM-SE (1 of 3)
• Differentiate between read dependent & write
dependent aborts
• Meta-data (e.g., 3 bits for conflict types + 1 bit
indicating if responder older) on NACK messages
• Per-processor tables to track conflicts with other procs
• RW conflict only = read-dependent
• Stall cycles = cycle conflict detected – cycle
request sent to memory subsystem
• Abort cycles = cycle abort completes – cycle
abort initiates
83
BaseTMProf for LogTM-SE (2 of 3)
• Wasted_trans cycles = cycle abort initiates – cycle
transaction begins
• Store transaction begin cycle in separate register
• Commit cycles = cycle commit completes – cycle
commit initiates
• No commit actions = no commit cycles
• Track cycle of start of commit action in separate register
84
BaseTMProf for LogTM-SE (3 of 3)
• Nontrans cycles = cycle of transaction begin –
cycle after last transaction commit
• Track cycle of last transaction commit in separate
register
• Backoff cycles = cycle retry transaction – cycle
abort completes
• Barrier cycles = cycle exit barrier – cycle enter
barrier
85
ExtTMProf for LogTM-SE
• Work remaining after write-set prediction:
• Store transaction size (read+write-set sizes) at each
prediction - lazily copy to software or use many
registers
• At commit, subtract saved transaction size from final
transaction size at commit
• Differences processed by software to produce
histograms
• Size of aborted transactions:
• Store read- and write-set sizes of aborted transaction
in separate registers
86
BaseTMProf for TCC
• Stall cycles recorded at transaction commit
• When write-sets broadcast or commit request sent
to directory
• No breakdown of read-dependent & write-dependent abort cycles
• Since aborts do not stall winner (abortee)
• Committing cycles = cycle commit phase
completes – cycle commit phase begins
• Between cycle all stores flushed from write buffer &
broadcasting write-set
87
ExtTMProf for TCC
• Size of aborted transactions:
• Track read- and write-set sizes of aborted transactions
• Just like for LogTM-SE
88
Extensions to Signatures Overview
• Six extensions to reduce false conflicts
  • Static Transaction Identifier (XID) Independence
  • Object Identifiers (IDs)
  • Spatial locality with static signatures
  • Spatial locality with dynamic signatures
  • Coarse-fine hashing
  • Dynamic re-hashing
[Slide annotation near the Object IDs and spatial-locality entries: Best performance]
• Evaluate using ideal hardware & software
89
XID Independence, Object IDs
• XID Independence:
• Programmer declares set of static XIDs that conflict
with each other
• Information passed to hardware for conflict detection
• Signature check only for XIDs that possibly conflict
• Object IDs:
• High-level objects accessed by each transaction
• E.g., Trees, hash buckets, nodes
• Programmer declares set of objects accessed in
transaction
• Designed to handle dynamic, fine-grain conflicts
90
Optimizing for Spatial locality
• Spatial locality exists in many programs
• High probability of accessing memory addresses
neighboring current address in future
• Spatially local addresses may form a set that sets only
a single signature bit
• Static signatures:
• Signature hashes operate on fixed, larger granularity
(i.e., greater than cache-block)
• Granularity may not be suitable for all workloads
• Dynamic signatures:
• A set of signatures that hash on different granularities
& set of hit counters
• Dynamically select which signature is “best” to use
91
Coarse-fine hashing, Dynamic re-hashing
• Coarse-fine hashing:
• Split addresses into two regions: Coarse & Fine
• Coarse – High-order address bits (e.g., page number)
• Fine – Low-order address bits (e.g., multiple cache blocks)
• Assign signature hashes to operate on Coarse & Fine
bits
• Dynamic re-hashing:
• False conflicts can be caused by bad luck
• Dynamically alter hash functions – rotate input address
bits before hashing
• Transform persistent false conflicts into transient false conflicts
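Dynamic re-hashing by rotating the input bits can be sketched as a wrapper around any existing hash function. The helper names, the 26-bit default width, and the bit-selection hash used in the demonstration are illustrative assumptions:

```python
def rotate_left(value, amount, width=26):
    """Rotate the low `width` bits of `value` left by `amount` positions."""
    amount %= width
    mask = (1 << width) - 1
    value &= mask
    return ((value << amount) | (value >> (width - amount))) & mask


def rehashed(hash_fn, rotation, width=26):
    """Wrap `hash_fn` so that it sees a rotated version of the block address."""
    return lambda block_addr: hash_fn(rotate_left(block_addr, rotation, width))


def bit_select(addr):
    return addr & 0xF  # toy hash: the low four bits


rotated_hash = rehashed(bit_select, rotation=4, width=8)
before, after = bit_select(0x12), rotated_hash(0x12)
```

Two addresses that collide under one rotation will usually land on different bits under another, so changing the rotation between retries keeps an unlucky collision from recurring every time.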
92
Privatization interface
Privatization function: Usage
• shared_malloc(size), private_malloc(size): Dynamic allocation of shared
and private memory objects
• shared_free(ptr), private_free(ptr): Frees up memory allocated by
shared or private allocators
• privatize_barrier(num_threads, ptr, size),
publicize_barrier(num_threads, ptr, size): Program threads come to a
common point to privatize or publicize an object. Must be
used outside of transactions
93
Dynamic privatization
• Dynamically switch from private to shared, and
vice versa
• If transitioning from private -> shared, safe to
mark page as shared (at cost of performance)
• If transitioning from shared -> private, default
policy is to disallow if there exist other shared
objects on same page
• Otherwise, trap to user software and let programmer
call shared_free(), followed by private_malloc() on
object
94
Bit-field overlaps hurt PBX
95
Removing stack refs doesn’t help
96
Entropy of commercial workloads
97
Type of Hash Functions
• In real programs, addresses neither independent nor
uniformly distributed (key assumptions to derive
PFP(n))
• But can generate hash values that are almost
uniformly distributed and uncorrelated with good
(universal/almost universal) hash functions
• Hash functions considered:
  • Bit-selection (inexpensive, low quality)
  • H3 [Carter, CSS79] (moderate, higher quality)
98
Notary Related Work
• Hash functions for memory hierarchy designs
  • Used to reduce cache, bank, or row-buffer contention
  • XOR hashes [Gonzales ‘97, Seznec ‘93, Zhang ‘00]
  • Polynomial hashes [Rau ‘91]
• Alternatives to XOR hashing [Kharbutli ‘04,’05]
  • Prime modulo & odd-multiplier displacement hashing
  • Reduce probability of bad hash values
  • Can require modifying existing hardware (e.g., additional TLB bits or adders)
• Detailed analysis of XOR hashes [Vandierendonck ‘05]
  • Linear-algebra based analysis
  • Replacing & swapping columns can minimize the fan-in and maximum fan-out of
  XOR gates
• Previous uses of entropy
  • Overheads of addressing memory in ISA [Hammerstrom ‘77]
  • Base Register Cache to reduce size of transferred address [Park ‘90]
  • Mechanisms which compact & expand address & data values [Citron ‘95]
  • Low-power TLB design [Ballapuram ’06]
99
Notary Related Work Cont.
• Software-only privatization
• Four pointer types for STMs [Scott ‘07]
• exclude & only keywords for transactional OpenMP
[Milovanovic ‘07]
• private & shared keywords in OpenTM [Baek ‘07]
• protect() and unprotect() for transactional C#
[Abadi ‘08]
• Hardware support for privatization
• Virtual Memory Filter [Matveev ‘07]
• More general than Notary’s privatization
• Programmer declares memory regions to be transactional
100
TMProf Related Work
• Profiling transaction characteristics &
implementation-specific features [Hammond ‘04]
• Ex: Read- and write-set sizes, nesting depth, commit
bandwidth
• Disadvantage: Does not profile common, high-level
HTM overheads
• Transactional Application Profiling Environment
(TAPE) [Chafi ‘05]
• Profiles TCC HTM & summarizes problem areas back
to source code lines
• Disadvantage: Tied to TCC-specific overheads
101
TMProf Related Work Cont.
• Performance Pathologies [Bobba ‘07]
• Identified several pathologies affecting performance of
eager & lazy HTM systems
• Disadvantage: Pathologies identified offline using
detailed traces
• Additional profiling [Perfumo ’08, Porter ‘08]
• Metrics like read-to-write ratio, abort rate
• Statically predicting TM performance using Syncchar
• Can be added to TMProf implementations
102
Results from Conflict Resolutions
Trends:
1) Timestamp & Hybrid better than Base
2) Hybrid sometimes better than Timestamp
103
Hybrid Better than Timestamp
Fewer Stalls &
Wasted Trans
Fewer Wasted Trans
104
Results from Conflict Resolutions
Trends:
3) Timestamp can be worse than Base
4) Hybrid can be worse than Base
105
Timestamp Worse than Base
More Wasted Trans
Fewer WW Req.
Older Stalls
(i.e., more younger
thread aborts)
106
Hybrid Worse than Base
More RW Req.
younger stalls
Leads to load
imbalance
(more Barrier cycles)
107
Stall Breakdown
Prediction serializes read requests from older transactions
108
Stall Breakdown
Prediction serializes write requests (perhaps unnecessarily)
109
Locks are Hard
// WITH LOCKS
void move(T s, T d, Obj key){
  LOCK(s);
  LOCK(d);
  tmp = s.remove(key);
  d.insert(key, tmp);
  UNLOCK(d);
  UNLOCK(s);
}
Thread 0: move(a, b, key1);
Thread 1: move(b, a, key2);
DEADLOCK!
Moreover
• Coarse-grain locking limits concurrency
• Fine-grain locking difficult
110
Motivation
111
Background on Aborts
• Read-dependent & Write-dependent
• Read-dependent – conflict is RW only
• Write-dependent – conflicts include WR or WW
• HTM system may optimize for read-dependent
aborts
• E.g., Eager conflict detection can release read-isolation
early on aborts (no nesting)
• Does not stall requestor
112
Notary Future Work
• Dynamic entropy calculation:
• How to adapt PBX hashing to entropy changes over time?
• Dynamic privatization characteristics:
• How common is it for objects to change sharing status?
113
Sun’s Rock HTM [Dice et al., ASPLOS’09]
• Best-effort HTM – 1st general-purpose processor
with HTM support
• Profiling targets why transactions fail
• TMProf profiles higher-level categories, including
successes
• Aborts update Checkpoint Status Register (CPS)
• Version R2 includes more detailed breakdowns of
CPS than R1
• Different reasons for failure given same CPS status in
R1
• Profiling in common with ExtTMProf: read- and
write-set sizes of aborted transactions
114