Signatures in Transactional Memory Systems
Dissertation Defense
Luke Yen
1/29/2009

Key Contributions
• Trend: Transactional memory (TM) is an emerging parallel programming paradigm.
  • Programmer-annotated transactions that execute atomically (all or nothing).
• Challenge #1: Hardware TM (HTM) systems may restrict transactions or incur overheads on common events (e.g., cache evictions).
  • Contribution: LogTM-SE HTM: Simple hardware that interacts with the operating system to virtualize transactions. No overhead on cache evictions.

Key Contributions Cont.
• Challenge #2: (1) H3 signatures have high area & power overheads & (2) thread-private references cause false conflicts.
  • Contribution: Notary: (1) Page-Block-XOR performs similarly to H3 but with lower overheads; (2) stack- & heap-based privatization.
• Challenge #3: Difficult to understand HTM system performance.
  • Contribution: TMProf: Lightweight hardware performance counters help HTM designers & TM programmers.
• Challenge #4: Signatures suffer from false conflicts.
  • Contribution: Six hardware/software signature extensions to mitigate false conflicts.
Outline
• Introduction and Background
  • Transactional Memory background
  • LogTM-SE [HPCA 2007] (Contribution #1)
• Notary [MICRO 2008] (Contribution #2)
• TMProf (Submitted for publication) (Contribution #3)
• Conclusion
Focus of presentation: Notary & TMProf
* Skip “Extensions to Signatures” (Contribution #4)

Transactional Memory (TM)
• Locks do not compose
  • Can lead to deadlocks
• TM programmer says
  • “I want this atomic”
• TM system
  • “Makes it so”
Example:
    void move(T s, T d, Obj key){
      atomic {
        tmp = s.remove(key);
        d.insert(key, tmp);
      }
    }
• Focus on Hardware TM (HTM) Implementations
  • Fast
  • Leverage cache coherence & speculation
  • But hardware finite & should be policy-free

LogTM Signature Edition (LogTM-SE) at 50,000 feet
• HTMs Fast
  • Version management – for transaction commits & aborts
    • HW handles old/new versions (e.g., write buffer)
  • Conflict detection – commit only non-conflicting transactions
    • HW handles conflict detection (R/W bits & coherence)
• But Closely Coupled to L1 cache
  • On critical paths & hard for SW to save/restore
• Our Approach: Decoupled, Simple HW, SW control
• LogTM-SE
  • HW: LogTM’s Log + Signatures (from Illinois Bulk)
  • SW: Unbounded nesting, thread switching, & paging

Signature Background
• Signatures used to summarize and detect conflicts with a transaction’s read- and write-sets
  • Inspired by Bulk system [Ceze, ISCA’06]
  • Imprecise, can be implemented with Bloom filters
  • Can have false positives, but never false negatives
  • Also proposed for non-TM purposes (e.g., SC violation detection, atomicity violation detection, race recording)
• Ex: Use k Bloom filters of size m/k, with independent hash functions

Outline
• Introduction and Background
• Notary
  • Signature Background
  • Entropy & Page-Block-XOR
  • Privatization
  • Methodology & Results
  • Conclusions
• TMProf
• Conclusion

Notary Executive Summary
Tackle 2 problems with hardware signatures:
• Problem 1: Best signature hashing (i.e., H3) has high area & power overheads
• Solution 1: Use entropy analysis to guide lower-cost hashing (Page-Block-XOR, PBX) that performs similar to H3
  • Ex: 8x fewer gates - 160 gates for H3 vs 20 gates for PBX
• Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs
• Solution 2: Avoid inserting private stack addrs, propose privatization interface for higher performance

Signature hash functions
• Which hash function is best? [Sanchez, YEN, MICRO’07]
  • Bit-selection? Hash simply decodes some number of input bits
  • H3? Each bit of a hash value is an XOR of (on avg.) half of the input address bits
[Figure: LogTM-SE w/ 2kb signatures]
• Result: H3 better with >=2 hash functions
• However, H3 uses many multi-level XOR trees
• Can we improve this?

H3 implementation
• $\text{Num XOR} = \dfrac{\text{addr length in bits}}{4} \cdot c \cdot k$
• Ex: 2kb signatures, k=2, c=10, 32-bit addr = 160 XOR gates per signature
• Can we reduce the total gate count?

Entropy defined
• Insight: Use most random bits for hashing
• Use entropy to measure bit randomness
• $\text{Entropy} = -\sum_{i=1}^{N} p(x_i)\,\log_2 p(x_i)$
  • p(xi) = the probability of the occurrence of value xi
  • N = number of sample values random variable x can take on
• Entropy = amount of information required on average to describe outcome of variable x (in bits)
  • Ex: What is the best possible lossless compression?
• Entropy value of n-bit field:
  • Min (0 bits): constant value in n-bit field with probability 1
  • Max (n bits): all bit patterns in n-bit field equally probable
  • Other cases fall in between

Our measures of entropy
• For our workloads, we care about:
• Q1: What is the best achievable entropy?
  • Global entropy – upper bound on entropy of address
• Q2: How does entropy change within an address?
  • Local entropy – entropy of bit-field within the address
[Figure: global entropy is measured over address bits 31–6; local entropy is measured over a bit-field within bits 31–6, offset from bit 6 by NSkip]

Entropy results
• Workloads to be described later
• Global entropy is at most 16 bits
• Bit-window for local entropy is 16 bits wide (NSkip from 0-10)
  • Smaller windows (<16b) may not reach global entropy value
  • Larger windows (>16b) hide some fine-grain info

Page-Block-XOR (PBX)
• Motivated by 3 findings:
• (1) Lower-order bits have most entropy
  • Follows from our entropy results
• (2) XORing two bit-fields produces random hash values
  • From prior work on XOR hashing (e.g., data placement in caches, DRAM)
• (3) Bit-field overlaps can lead to higher false positives
  • Correlation between the two bit-fields can reduce the range of hash values produced (worse for larger signatures)

PBX implementation
• For 2kb signatures with 2 hash functions:
  • 20 XOR gates for PBX vs 160 XOR gates for H3!
• PPN and Cache-index fields not tied to system params:
  • Use entropy to find two non-overlapping bit-fields with high randomness

Summary thus far
• Problem 1: H3 has high area & power overheads
• Solution 1: Use entropy analysis to guide lower-cost PBX
  • Ex: 160 gates for H3 vs 20 gates for PBX
• Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs
• Solution 2: To be described

Privatization
• Problem: False conflicts caused by thread-private addrs
  • Avoid conflicts if addrs not inserted in thread’s signatures
Two privatization solutions:
• (1) Remove private stack references from sigs.
  • Very little work for programmer/compiler
  • Benefits depend on fraction of stack addresses versus all transactional references
• (2) Language-level interface (e.g., private_malloc(), shared_malloc())
  • Even higher performance boost
  • WARNING: Incorrectly marking shared objects as private can lead to program errors!

Page-based implementation
• Each page is assigned a status, private or shared
  • Invariant: Page is shared if any object on it is shared
• If stack is private, library marks stack pages as private
• If using privatization heap functions, mark heap pages accordingly

OS support
• OS allocates different physical page frames for shared and private pages
  • Sets a per-frame bit in translation entry if shared
  • Reduce number of page frames used by packing objects with same status together
• Signatures insert memory addresses of transactional references to shared pages
  • Query page sharing bit in HW TLB & current transactional status

Methodology
• Full-system simulation (GEMS)
• Transistor-level design for area & power of XOR gates
• CACTI for Bloom filter bit array area & power
• Linear scaling to 65nm or 90nm for area, original 400nm for power
• Single-chip CMP
  • 16 single-threaded, in-order cores
  • 32kB, 4-way private L1 I & D
  • 8MB, 8-way shared L2 cache
  • MESI directory protocol
  • Signatures from 64b-64kb (8B-8kB) & “perfect”

Workloads
• Micro-benchmarks
• SPLASH-2 apps
  • Barnes & Raytrace – exert most signature pressure
• Stanford STAMP apps
  • Vacation, Genome, Delaunay, Bayes, Labyrinth, Yada, Intruder
• DNS server
  • BIND

PBX vs H3 area & power
• Area & power overheads (2kb, k=4):
Type of overhead | Bloom filter bit array | H3 hash | PBX hash | H3 sig. | PBX sig. | % savings for PBX sig.
Area (mm2) | 4.67e-3 | 1.35e-3 | 7.83e-5 | 6.02e-3 | 4.75e-3 | 21
Power (mW) | 1.80e2 | 1.04e1 | 1.02 | 1.90e2 | 1.81e2 | 4.7

PBX vs H3 execution time
[Figure: execution time of PBX vs H3]
• PBX performs similar to H3

Privatization results summary
• Removing private stack references from signatures did not help
  • Most addr references not to stack
  • Most likely because running with SPARC ISA. Other ISAs (e.g., x86) likely have more benefits
• Privatization interface helps five workloads
  • Remainder either does not have private heap structures or does not have high transactional duty cycle

Privatization interface results
[Figure: execution time with the privatization interface]
• Can improve execution time

Conclusions
• Tackle 2 problems with signature designs:
  • (1) Area and power overheads of H3 hashing
    • E.g., 160 XOR gates for H3, 20 for PBX
  • (2) False conflicts due to signature bits set by private memory references
• Our solutions:
  • (1) Use entropy analysis to guide hashing function (PBX), a low-cost alternative that performs similarly to H3
  • (2) Prevent private stack references from entering signatures, and propose a privatization interface for heap allocations
• Notary can be applied to non-TM uses:
  • PBX hashing can directly transfer
  • Privatization may transfer if addr filtering applies

Outline
• Introduction and Background
• Notary
• TMProf
  • Motivation
  • Background
  • TMProf
  • Two Case Studies
  • Future Directions for TMProf
  • Conclusions
• Conclusion

TMProf Executive Summary
• TM more parallelism than lock-based programs
  • Complex thread interactions
• How can HTM designer understand HTM performance?
• How can TM programmer understand TM program performance?
• TMProf: Per-processor hardware performance counters to count cumulative event frequencies & overheads in HTM system

Critical-section Parallelism
• TM enables critical-section parallelism – more thread interleavings
[Figure: with locks, Thread 0 and Thread 1 each acquire Lock A; with TM, Thread 0 and Thread 1 each execute xact_begin]

Hard to Predict Program Performance
• TM programmers may not have mastered intricacies of HTM system
• Programs run faster on specific HTM
• Example: [figure not recoverable]

Profiling with TMProf
• Allows HTM designers & TM programmers to understand HTM performance
• With TMProf: [figure not recoverable]

Background on Conflicts
• Three types: RW, WR, and WW
  • Analogous to WAR, RAW, and WAW dependencies in uniprocessors
Type | Thread 0 | Thread 1
RW | xact_begin … LD A … | xact_begin … ST A …
WR | xact_begin … ST B … | xact_begin … LD B …
WW | xact_begin … ST C … | xact_begin … ST C …

Conflict Detection & Resolution
• Conflicts detected eagerly or lazily
  • Eagerly – when requests occur
  • Lazily – at transaction commit
• Conflict resolution
  • Stall or abort on conflict
  • Choose set of procs to take action

TMProf
• Per-processor HW counters measuring cumulative event frequencies and cumulative event overheads
• Two implementations: Base & Extended
  • Base (BaseTMProf): Breaks down HTM execution cycles into common components
  • Extended (ExtTMProf): Builds on BaseTMProf & adds HTM-specific transaction-level profiling

BaseTMProf & ExtTMProf
• BaseTMProf:
  • Total cycles = stalls + aborts + wasted_trans + useful_trans + committing + nontrans + implementation specific
  • Assume in-order procs, but can extend for out-of-order procs
• ExtTMProf: BaseTMProf profiling plus
  • Size of aborted transactions
  • Amount of transactional work after write-set prediction
• HTMs may add more detailed profiling in future

Two Case Studies
• TMProf profiling two HTMs:
  • LogTM-SE (eager conflict detection & version management, EE)
  • Approximation of Stanford’s TCC (lazy conflict detection & version management, LL)
• Examine key parameters of eager & lazy conflict detection
  • Idealize version management
• Same system parameters as Notary
  • 16-processor CMP w/ in-order, single-issue processor cores
  • Perfect signatures
  • Same workloads

EE: Different Conflict Resolutions
• Three different conflict resolutions:
  • Base, Timestamp, Hybrid
  • All use timestamps
• Base: Requestor stalls until possible deadlock
• Timestamp: Older requestors always abort younger transactions. Younger requestors stalled by older transactions.
• Hybrid: Base, except RW from older writer aborts younger reader

EE: Write-set Prediction
• Avoid aborts from load then store pattern from thread
• Predict & serialize on these conflicts
[Figure: two timelines for threads T0, T1, and T2. In the first, GetS requests followed by GetX requests end in ABORTs; in the second, GetX is issued first and the other threads' GetS requests STALL]

Results from Conflict Resolutions
Trends:
1) Timestamp & Hybrid better than Base

Timestamp & Hybrid Better than Base
• Fewer total stalls & eliminates all RW Requestor older stalls

EE Summary with BaseTMProf
• BaseTMProf helps HTM designer understand performance of conflict resolution schemes
• Lightweight, fast, dynamic profiling
• Can be implemented in prototype HTM systems

Write-set Prediction Results
• Focus on workloads that degrade from prediction
• Prediction increases Stall cycles

ExtTMProf’s Transaction-level Profiling
• Predictions Help: Prediction helps short transactions
• Predictions Hurt: Prediction hurts large transactions – reduces concurrency

EE Summary with ExtTMProf
• Helps HTM designers understand why write-set prediction degrades (or improves) performance
• Offline analysis (e.g., traces) unable to determine performance implications of dynamic conflicts
• How can TMProf help analyze LL systems?
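The three EE conflict-resolution policies (Base, Timestamp, Hybrid) can be read as a small decision function. The sketch below is one hedged interpretation, not the LogTM-SE implementation: the names `Action` and `resolve` are hypothetical, a lower timestamp is taken to mean an older transaction, and the behavior on a possible deadlock (the requestor aborts) is an assumption, since the slides only say that the requestor stalls "until possible deadlock".

```python
from enum import Enum


class Action(Enum):
    STALL_REQUESTOR = "stall requestor"
    ABORT_REQUESTOR = "abort requestor"
    ABORT_RESPONDER = "abort responder"


def resolve(policy, conflict, requestor_ts, responder_ts, possible_deadlock=False):
    """Pick an action for one conflict (hypothetical helper).

    policy:   "base", "timestamp", or "hybrid"
    conflict: "RW", "WR", or "WW"; in an RW conflict the requestor is a
              writer that hits the responder's read set
    Lower timestamp means older transaction.
    """
    requestor_older = requestor_ts < responder_ts

    if policy == "timestamp":
        # Older requestors abort younger transactions;
        # younger requestors are stalled by older transactions.
        if requestor_older:
            return Action.ABORT_RESPONDER
        return Action.STALL_REQUESTOR

    if policy == "hybrid" and conflict == "RW" and requestor_older:
        # Hybrid: like Base, except an older writer aborts a younger reader.
        return Action.ABORT_RESPONDER

    # Base behavior (also the fallback for Hybrid): stall the requestor.
    # ASSUMPTION: on a possible deadlock the requestor aborts.
    if possible_deadlock:
        return Action.ABORT_REQUESTOR
    return Action.STALL_REQUESTOR
```

For example, `resolve("hybrid", "RW", 1, 2)` aborts the younger reader, whereas `resolve("base", "RW", 1, 2)` stalls the requestor.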
LL: Parallel Versus Serial Commit
• Serial = Only one committer at a time
• Parallel = Multiple concurrent committers
  • Faster than Serial
  • We idealize its implementation

LL: More Prefetching than EE
• Eager conflict detection:
  • Progress bounded by location of conflicts
  • Early conflicts abort transactions early (little prefetching)
  • Late conflicts abort transactions late (lots of prefetching)
• Lazy conflict detection:
  • Committers finish transaction before detecting conflicts
  • High probability for lots of prefetching

Parallel Commit Results
• Parallel commit removes commit token bottleneck

Conflicts with Parallel Commit
• All conflicts either RW or WR – no WW conflicts

LL Summary with BaseTMProf
• BaseTMProf clearly shows why parallel commit helps
  • Stall breakdown shows mostly WR conflicts
• BaseTMProf helps HTM designers decide whether to implement parallel commit
  • Parallel commit more complex than serial commit

Prefetching Results
• Useful Trans should be similar for EE & LL, but LL incurs fewer cycles
• Why?

ExtTMProf’s Transaction-level Profiling
• LL’s aborted transactions prefetch farther than EE

LL Summary with ExtTMProf
• Explains why workloads execute faster on LL than on EE
• May influence HTM design decision to implement LL rather than EE
• Helps TM programmer understand why programs run faster on some HTMs

Software Rollback Better than Hardware Rollback
• Software rollback reduces Stalls & Wasted Trans
• May reduce contention in HTM?
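The Stalls, Wasted Trans, and Useful Trans categories in these results are BaseTMProf components. A minimal sketch of how such per-processor counters might be accumulated follows. The class and method names are hypothetical. The cycle formulas are the ones given on the backup slides for LogTM-SE: wasted_trans is the cycle the abort initiates minus the cycle the transaction begins, abort cycles are abort completion minus abort initiation, and commit cycles are commit completion minus commit initiation.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseTMProfCounters:
    """Cumulative cycle counters for one processor (hypothetical layout)."""
    stalls: int = 0
    aborts: int = 0
    wasted_trans: int = 0
    useful_trans: int = 0
    committing: int = 0
    nontrans: int = 0
    implementation_specific: int = 0

    def record_abort(self, begin_cycle, abort_start, abort_end):
        # Wasted work: from transaction begin until the abort initiates.
        self.wasted_trans += abort_start - begin_cycle
        # Abort overhead: from abort initiation until it completes.
        self.aborts += abort_end - abort_start

    def record_commit(self, commit_start, commit_end):
        # Commit overhead: from commit initiation until it completes.
        self.committing += commit_end - commit_start

    def total_cycles(self):
        # Total cycles = sum of all components.
        return sum(getattr(self, f.name) for f in fields(self))

    def breakdown(self):
        # Fraction of total cycles spent in each component.
        total = self.total_cycles()
        if total == 0:
            return {}
        return {f.name: getattr(self, f.name) / total for f in fields(self)}
```

Calling `record_abort(100, 160, 175)` adds 60 wasted cycles and 15 abort cycles; `breakdown()` then gives the per-component fractions that a stacked-bar plot of execution time would show.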
Hardware for Critical-path Profiling
• Counter-based profiling is not sufficient
• Multi-threaded programs exhibit variability:
  • Different dynamic code paths
  • Inter-thread dependencies
  • Memory latencies
• Factors change critical-path – longest control flow that determines execution time
• Hardware critical-path profiling can aid in understanding performance
  • Faster than offline, software analyses

Conclusions
• TMProf – lightweight per-processor hardware counters for understanding HTM performance
  • Cumulative event frequencies & overheads
• Two implementations: Base & Extended
• Two case studies: LogTM-SE & Approximation of TCC
• Future TMProf might add hardware support for critical-path profiling

Conclusions
• Challenge #1: Hardware TM (HTM) systems may restrict transactions or incur overheads on common events.
  • Contribution: LogTM-SE HTM
• Challenge #2: (1) H3 signatures high area & power overheads & (2) Thread-private references cause false conflicts.
  • Contribution: Notary

Conclusions Cont.
• Challenge #3: Difficult to understand HTM system performance.
  • Contribution: TMProf
• Challenge #4: Signatures suffer from false conflicts.
  • Contribution: Six hardware/software extensions to signatures

Other Research & Contributions
• OS Support for Virtualizing Transactional Memory [Swift et al., TRANSACT ‘08]
• Implementing Signatures for Transactional Memory [Sanchez et al., MICRO ‘07]
• Performance Pathologies in Hardware Transactional Memory [Bobba et al., ISCA ’07 & Top Picks ‘08]
• Supporting Nested Transactional Memory in LogTM [Moravan et al., ASPLOS ‘06]
• GEMS 2.X development & support
  • SMT in Opal, LogTM-SE in Ruby

Thank You! Questions?
Backup Slides

LogTM-SE Processor Hardware
• Segmented log, like LogTM
• Track R / W sets with R / W signatures
  • Over-approximate R / W sets
  • Tracks physical addresses
  • Summary signature used for virtualization
• Conflict detection by coherence protocol
  • Check signatures on every memory access for SMT
[Figure: an SMT thread context holding Registers, Register Checkpoint, LogFrame, TMcount, LogPtr, and the Read, Write, SummaryRead, and SummaryWrite signatures; the data caches hold Tag and Data with no TM state]

Thread Switching Support
• Why?
  • Support long-running transactions
• What?
  • Conflict Detection for descheduled transactions
• How?
  • Summary Read / Write signatures: If thread t of process P is scheduled to use an active signature, the corresponding summary signature holds the union of the saved signatures from all descheduled threads from process P.
  • Updated using TLB-shootdown-like mechanism

Handling Thread Switching
[Animation, four frames: processors P1–P4 each hold W, R, and Summary signatures, and the OS holds the process's summary W/R signatures. When the OS deschedules thread T2, its W/R signatures (01001000 / 01010010) are saved into the OS summary signatures and then propagated to the summary signatures on the processors.]

Thread Switching Support Summary
• Summary Read / Write signatures
  • Summarizes descheduled threads with active transactions
  • One OS structure per process
Coherence
• Check summary signature on every memory access
• Updated on transaction deschedule
  • Similar to TLB shootdown

Paging Support Summary
Problem:
• Changing page frames
• Need to maintain isolation on transactional blocks
Solution:
On Page-Out:
• Save Virtual -> Physical mapping
On Page-In:
• If different page frame, update signatures with physical address of transactional blocks in new page frame.

Paging Support Animation
[Animation: virtual page VP1 is paged out of physical page PP1 and paged in to physical page PP2; blocks A–D in PP1 correspond to blocks A’–D’ in PP2, and the read and write signatures are queried for these blocks]
• Read & Write signatures isolate memory blocks from PP1 & PP2

BaseTMProf for LogTM-SE (1 of 3)
• Differentiate between read-dependent & write-dependent aborts
  • Meta-data (e.g., 3 bits for conflict types + 1 bit indicating if responder older) on NACK messages
  • Per-processor tables to track conflicts with other procs
  • RW conflict only = read-dependent
• Stall cycles = cycle conflict detected – cycle request sent to memory subsystem
• Abort cycles = cycle abort completes – cycle abort initiates

BaseTMProf for LogTM-SE (2 of 3)
• Wasted_trans cycles = cycle abort initiates – cycle transaction begins
  • Store transaction begin cycle in separate register
• Commit cycles = cycle commit completes – cycle commit initiates
  • No commit actions = no commit cycles
  • Track cycle of start of commit action in separate register

BaseTMProf for LogTM-SE (3 of 3)
• Nontrans cycles = cycle of transaction begin – cycle after last transaction commit
  • Track cycle of last transaction commit in separate register
• Backoff cycles = cycle retry transaction – cycle abort completes
• Barrier cycles = cycle exit barrier – cycle enter barrier

ExtTMProf for LogTM-SE
• Work remaining after write-set prediction:
  • Store transaction size (read+write-set sizes) at each prediction - lazily copy to software or use many registers
  • At commit, subtract saved transaction size from final transaction size at commit
  • Differences processed by software to produce histograms
• Size of aborted transactions:
  • Store read- and write-set sizes of aborted transaction in separate registers

BaseTMProf for TCC
• Stall cycles recorded at transaction commit
  • When write-sets are broadcast or commit request sent to directory
• No breakdown of read-dependent & write-dependent abort cycles
  • Since aborts do not stall winner (abortee)
• Committing cycles = cycle commit phase completes – cycle commit phase begins
  • Between cycle all stores flushed from write buffer & broadcasting write-set

ExtTMProf for TCC
• Size of aborted transactions:
  • Track read- and write-set sizes of aborted transactions
  • Just like for LogTM-SE

Extensions to Signatures Overview
• Six extensions to reduce false conflicts
  • Static Transaction Identifier (XID) Independence
  • Object Identifiers (IDs)
  • Spatial locality with static signatures
  • Spatial locality with dynamic signatures
  • Coarse-fine hashing
  • Dynamic re-hashing
  [Callout: “Best performance”; which extensions it marks is not recoverable]
• Evaluate using ideal hardware & software

XID Independence, Object IDs
• XID Independence:
  • Programmer declares set of static XIDs that conflict with each other
  • Information passed to hardware for conflict detection
  • Signature check only for XIDs that possibly conflict
• Object IDs:
  • High-level objects accessed by each transaction
    • E.g., Trees, hash buckets, nodes
  • Programmer declares set of objects accessed in transaction
  • Designed to handle dynamic, fine-grain conflicts

Optimizing for Spatial locality
• Spatial locality exists in many programs
  • High probability of accessing memory addresses neighboring current address in future
  • Spatially local addresses may form a set that sets only a single signature bit
• Static signatures:
  • Signature hashes operate on fixed, larger granularity (i.e., greater than cache-block)
  • Granularity may not be suitable for all workloads
• Dynamic signatures:
  • A set of signatures that hash on different granularities & set of hit counters
  • Dynamically select which signature is “best” to use

Coarse-fine hashing, Dynamic re-hashing
• Coarse-fine hashing:
  • Split addresses into two regions: Coarse & Fine
    • Coarse – High-order address bits (e.g., page number)
    • Fine – Low-order address bits (e.g., multiple cache blocks)
  • Assign signature hashes to operate on Coarse & Fine bits
• Dynamic re-hashing:
  • False conflicts can be caused by bad luck
  • Dynamically alter hash functions – rotate input address bits before hashing
  • Transform persistent false conflicts into transient false conflicts

Privatization interface
Privatization function | Usage
shared_malloc(size), private_malloc(size) | Dynamic allocation of shared and private memory objects
shared_free(ptr), private_free(ptr) | Frees up memory allocated by shared or private allocators
privatize_barrier(num_threads, ptr, size), publicize_barrier(num_threads, ptr, size) | Program threads come to a common point to privatize or publicize an object. Must be used outside of transactions

Dynamic privatization
• Dynamically switch from private to shared, and vice versa
• If transitioning from private -> shared, safe to mark page as shared (at cost of performance)
• If transitioning from shared -> private, default policy is to disallow if there exist other shared objects on same page
  • Otherwise, trap to user software and let programmer call shared_free(), followed by private_malloc() on object

Bit-field overlaps hurt PBX
[Figure not recoverable]

Removing stack refs doesn’t help
[Figure not recoverable]

Entropy of commercial workloads
[Figure not recoverable]

Type of Hash Functions
• In real programs, addresses neither independent nor uniformly distributed (key assumptions to derive PFP(n))
• But can generate hash values that are almost uniformly distributed and uncorrelated with good (universal/almost universal) hash functions
• Hash functions considered:
  • Bit-selection (inexpensive, low quality)
  • H3 [Carter, CSS79] (moderate, higher quality)

Notary Related Work
• Hash functions for memory hierarchy designs
  • Used to reduce cache, bank, or row-buffer contention
  • XOR hashes [Gonzales ‘97, Seznec ‘93, Zhang ‘00]
  • Polynomial hashes [Rau ‘91]
• Alternatives to XOR hashing
  • [Kharbutli ‘04,’05] Prime modulo & odd-multiplier displacement hashing
  • Reduce probability of bad hash values
  • Can require modifying existing hardware (e.g., additional TLB bits or adders)
• Detailed analysis of XOR hashes [Vandierendonck ‘05]
  • Linear-algebra based analysis
  • Replacing & swapping columns can minimize the fan-in and maximum fan-out of XOR gates
• Previous uses of entropy
  • Overheads of addressing memory in ISA [Hammerstrom ‘77]
  • Base Register Cache to reduce size of transferred address [Park ‘90]
  • Mechanisms which compact & expand address & data values [Citron ‘95]
  • Low-power TLB design [Ballapuram ’06]

Notary Related Work Cont.
• Software-only privatization
  • Four pointer types for STMs [Scott ‘07]
  • exclude & only keywords for transactional OpenMP [Milovanovic ‘07]
  • private & shared keywords in OpenTM [Baek ‘07]
  • protect() and unprotect() for transactional C# [Abadi ‘08]
• Hardware support for privatization
  • Virtual Memory Filter [Matveev ‘07]
    • More general than Notary’s privatization
    • Programmer declares memory regions to be transactional

TMProf Related Work
• Profiling transaction characteristics & implementation-specific features [Hammond ‘04]
  • Ex: Read- and write-set sizes, nesting depth, commit bandwidth
  • Disadvantage: Does not profile common, high-level HTM overheads
• Transactional Application Profiling Environment (TAPE) [Chafi ‘05]
  • Profiles TCC HTM & summarizes problem areas back to source code lines
  • Disadvantage: Tied to TCC-specific overheads

TMProf Related Work Cont.
• Performance Pathologies [Bobba ‘07]
  • Identified several pathologies affecting performance of eager & lazy HTM systems
  • Disadvantage: Pathologies identified offline using detailed traces
• Additional profiling [Perfumo ’08, Porter ‘08]
  • Metrics like read-to-write ratio, abort rate
  • Statically predicting TM performance using Syncchar
  • Can be added to TMProf implementations

Results from Conflict Resolutions
Trends:
1) Timestamp & Hybrid better than Base
2) Hybrid sometimes better than Timestamp

Hybrid Better than Timestamp
• Fewer Stalls & Wasted Trans
• Fewer Wasted Trans

Results from Conflict Resolutions
Trends:
3) Timestamp can be worse than Base
4) Hybrid can be worse than Base

Timestamp Worse than Base
• More Wasted Trans
• Fewer WW Req. Older Stalls (i.e., more younger thread aborts)

Hybrid Worse than Base
• More RW Req. younger stalls
• Leads to load imbalance (more Barrier cycles)

Stall Breakdown
• Prediction serializes read requests from older transactions

Stall Breakdown
• Prediction serializes write requests (perhaps unnecessarily)

Locks are Hard
    // WITH LOCKS
    void move(T s, T d, Obj key){
      LOCK(s);
      LOCK(d);
      tmp = s.remove(key);
      d.insert(key, tmp);
      UNLOCK(d);
      UNLOCK(s);
    }
Thread 0: move(a, b, key1);
Thread 1: move(b, a, key2);
DEADLOCK!
Moreover:
• Coarse-grain locking limits concurrency
• Fine-grain locking difficult

Motivation
[Figure not recoverable]

Background on Aborts
• Read-dependent & Write-dependent
  • Read-dependent – conflict is RW only
  • Write-dependent – conflicts include WR or WW
• HTM system may optimize for read-dependent aborts
  • E.g., Eager conflict detection can release read-isolation early on aborts (no nesting)
  • Does not stall requestor

Notary Future Work
• Dynamic entropy calculation:
  • How to adapt PBX hashing to entropy changes over time?
• Dynamic privatization characteristics:
  • How common is it for objects to change sharing status?

Sun’s Rock HTM [Dice et al., ASPLOS’09]
• Best-effort HTM – 1st general-purpose processor with HTM support
• Profiling targets why transactions fail
  • TMProf profiles higher-level categories, including successes
• Aborts update Checkpoint Status Register (CPS)
• Version R2 includes more detailed breakdowns of CPS than R1
  • Different reasons for failure given same CPS status in R1
• Profiling in common with ExtTMProf: read- and write-set sizes of aborted transactions
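As a companion to the entropy slides and the dynamic-entropy question under Notary Future Work, the sketch below computes the entropy of a bit-field over a set of addresses, using the definition from the talk (the negated sum of p(x) times log2 p(x) over observed field values). It is illustrative only. The function name `field_entropy` and its parameters are hypothetical, and the 6-bit block offset is an assumption taken from the entropy figure, which measures address bits 31 down to 6.

```python
import math
from collections import Counter

# ASSUMPTION: 64-byte cache blocks, so the block address starts at bit 6.
BLOCK_OFFSET_BITS = 6


def field_entropy(addrs, nskip=0, width=16):
    """Entropy (in bits) of a `width`-bit field taken from each address.

    The field starts `nskip` bits above bit 6. With nskip=0 and width=26
    this covers bits 31..6 of a 32-bit address (the "global" measure);
    a 16-bit window with nskip in 0..10 gives the "local" measure.
    """
    shift = BLOCK_OFFSET_BITS + nskip
    mask = (1 << width) - 1
    counts = Counter((addr >> shift) & mask for addr in addrs)
    total = sum(counts.values())
    # Entropy = -sum over observed values of p * log2(p).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A field that never changes has entropy 0, and four equally likely values give 2 bits. Sliding `nskip` across a recorded address trace is one way to look for the high-entropy, non-overlapping bit-fields that PBX wants; recomputing it over time windows would be a starting point for the dynamic-entropy question.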