Collaborator: Prof. Stark C. Draper
Advisor: Prof. Mark D. Hill
University of Wisconsin, Madison
MICRO-41 - November 11, 2008 www.cs.wisc.edu/multifacet/papers/micro08_notary.pdf
Tackle 2 problems with hardware signatures:
• Problem 1: Best signature hashing (i.e., H3) has high area & power overheads
• Solution 1: Use entropy analysis to guide lower-cost hashing (Page-Block-XOR, PBX) that performs similarly to H3
– Ex: 160 gates for H3 vs 20 gates for PBX
• Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs
• Solution 2: Avoid inserting private stack addrs, propose privatization interface for higher performance
University of Wisconsin-Madison
4/11/2020 2
• Signature background
• Entropy
• Entropy results & PBX
• Privatization
• Methodology & workloads
• Results
• Conclusions & Future Work
• Signatures (hardware Bloom filters) used to summarize and detect conflicts with a transaction’s read- and write-sets
– Inspired by Bulk system [Ceze,ISCA’06]
– Implemented in LogTM-SE [Yen,HPCA’07]
– Can have false positives, but never false negatives
– Also proposed for non-TM purposes (e.g., SC violation detection, atomicity violation detection, race recording)
• Ex: Use k Bloom filters of size m/k, with independent hash functions
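As a rough illustration of the k-bank organization above, the following Python sketch models a signature as k bit vectors of m/k bits, each indexed by its own hash function. The class name, method names, and the simple shift-based hash functions are assumptions made for this example; they are not taken from Bulk or LogTM-SE.

```python
class Signature:
    """Partitioned Bloom filter: k banks of m/k bits, one hash function per bank."""

    def __init__(self, m_bits, hash_fns):
        self.k = len(hash_fns)
        if m_bits % self.k != 0:
            raise ValueError("m must be divisible by k")
        self.bank_bits = m_bits // self.k
        self.hash_fns = hash_fns
        self.banks = [0] * self.k  # each bank is an int used as a bit vector

    def _index(self, i, addr):
        # Reduce the i-th hash value to a bit position inside bank i.
        return self.hash_fns[i](addr) % self.bank_bits

    def insert(self, addr):
        for i in range(self.k):
            self.banks[i] |= 1 << self._index(i, addr)

    def may_contain(self, addr):
        # True only if the address hits a set bit in every bank.
        return all((self.banks[i] >> self._index(i, addr)) & 1
                   for i in range(self.k))


def make_shift_hash(shift):
    """Hypothetical placeholder hash: drop the low `shift` address bits."""
    return lambda addr: addr >> shift


# 2kb signature, k = 2 banks of 1024 bits each.
sig = Signature(2048, [make_shift_hash(6), make_shift_hash(16)])
sig.insert(0x10000040)
```

With these two illustrative hashes, 0x14000040 maps to the same bit as 0x10000040 in both banks, so a lookup for it reports a false positive, while an address that was actually inserted is never missed.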
• Which hash function is best? [Sanchez, MICRO’07]
– Bit-selection? Hash simply decodes some number of input bits
– H3? Each bit of a hash value is an XOR of (on avg.) half of the input address bits
[Figure: LogTM-SE w/ 2kb signatures]
• Result: H3 better with >=2 hash functions
• However, H3 uses many multi-level XOR trees
• Can we improve this?
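To make the difference between the two hash families concrete, here is a small Python sketch of both. For H3 it draws one random bit mask per output bit, so on average half of the address bits feed each XOR tree. The seed, the function names, and the 32-bit default address width are assumptions of this example, not parameters from [Sanchez, MICRO'07].

```python
import random


def make_bit_select(c_bits, shift):
    """Bit-selection: the hash value is simply c address bits starting at `shift`."""
    mask = (1 << c_bits) - 1
    return lambda addr: (addr >> shift) & mask


def make_h3(c_bits, addr_bits=32, seed=0):
    """H3-style hash: output bit j is the XOR (parity) of the address bits
    selected by a random mask; each mask bit is set with probability 1/2."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(addr_bits) for _ in range(c_bits)]

    def h3(addr):
        value = 0
        for j, mask in enumerate(masks):
            parity = bin(addr & mask).count("1") & 1  # XOR of the selected bits
            value |= parity << j
        return value

    return h3, masks


h3, masks = make_h3(c_bits=10)
bit_select = make_bit_select(c_bits=10, shift=6)
```

Because every output bit is a parity over selected input bits, the H3 sketch is linear over XOR: `h3(a ^ b) == h3(a) ^ h3(b)`. In hardware each mask corresponds to one multi-level XOR tree, which is where the gate cost comes from.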
• Num XOR gates for H3: [equation]
• Ex: 2kb signatures, k=2, c=10, 32-bit addr = 160 XOR gates per signature
• Can we reduce the total gate count?
• Not all address bits have equal randomness
– Ex: High-level address bits unlikely to change if working set size is small
• Key insight: If input bits are random and those bits are used as inputs to hash functions, random hash values result
– Use entropy to measure bit randomness
• Entropy – measure of the uncertainty of a random variable x
• Entropy = $-\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$
• $p(x_i)$ = the probability of the occurrence of value $x_i$
• N = number of sample values random variable x can take on
• Entropy = amount of information required on average to describe outcome of variable x (in bits)
– Ex: What is the best possible lossless compression?
[Figure: Entropy value of an n-bit field. Min = 0 bits (n-bit field has constant value); max = n bits (all bit patterns in n-bit field equally likely); other cases fall in between]
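The entropy definition and its two extreme cases can be checked with a few lines of Python. This is only a sketch: the function name is an assumption, and probabilities are estimated from the observed frequency of each value in a sample list.

```python
import math
from collections import Counter


def entropy_bits(samples):
    """Shannon entropy (in bits) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


constant_field = [5] * 8          # a field that never changes
uniform_field = list(range(4))    # all 2-bit patterns, each seen once
skewed_field = [0, 0, 0, 1]       # an in-between case
```

A constant field yields 0 bits, a uniformly distributed 2-bit field yields 2 bits, and the skewed sample lands strictly between 0 and 1 bit.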
• For our workloads, we care about:
• Q1: What is the best achievable entropy?
– Global entropy – upper bound on entropy of address
• Q2: How does entropy change within an address?
– Local entropy – entropy of bit-field within the address
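[Figure: global entropy is measured over address bits 31-6; local entropy is measured over a bit-field within bits 31-6, offset by NSkip]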
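A possible way to compute the two quantities from an address trace is sketched below. It assumes that the low 6 bits are the block offset and are ignored, that NSkip counts bits skipped above the block offset, and that the window width is a free parameter (16 bits is used as a default here); the function names are illustrative.

```python
import math
from collections import Counter


def entropy_bits(samples):
    """Shannon entropy (in bits) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def global_entropy(addrs, block_bits=6):
    """Entropy of the whole block address (everything above the block offset)."""
    return entropy_bits([a >> block_bits for a in addrs])


def local_entropy(addrs, nskip, width=16, block_bits=6):
    """Entropy of a `width`-bit window starting `nskip` bits above the block offset."""
    mask = (1 << width) - 1
    return entropy_bits([(a >> (block_bits + nskip)) & mask for a in addrs])


# Synthetic trace: 1024 consecutive 64-byte blocks.
trace = [i << 6 for i in range(1024)]
```

For this synthetic trace the global entropy is 10 bits, and the local entropy drops from 10 bits (NSkip = 0) to 5 bits (NSkip = 5) to 0 bits (NSkip = 10) as the window slides toward the high-order bits.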
• Workloads to be described later
• Global entropy is at most 16 bits
• Bit-window for local entropy is 16 bits wide (NSkip from 0-10)
– Smaller windows (<16b) may not reach global entropy value
– Larger windows (>16b) hide some fine-grain info
• More entropy results in our MICRO paper
• In summary, for our workloads entropy monotonically decreases when moving towards high-order bits
– We calculate the average entropy across the entire workload’s execution
– May miss entropy changes due to program phase behavior
• Our Page-Block-XOR (PBX) hash takes advantage of this overall trend
• Motivated by 3 findings:
– (1) Lower-order bits have most entropy
• Follows from our entropy results
– (2) XORing two bit-fields produces random hash values
• From prior work on XOR hashing (e.g., data placement in caches, DRAM)
– (3) Bit-field overlaps can lead to higher false positives
• Correlation between the two bit-fields can reduce the range of hash values produced (worse for larger signatures)
• For 2kb signatures with 2 hash functions:
– 20 XOR gates for PBX vs 160 XOR gates for H3!
• PPN and Cache-index fields not tied to system params:
– Use entropy to find two non-overlapping bit-fields with high randomness
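A minimal sketch of a PBX-style hash follows, assuming that each hash value is the bitwise XOR of two non-overlapping c-bit fields and that one two-input XOR gate is needed per output bit. The field positions (bits 6-15 and 16-25 for c = 10) and the function names are placeholders chosen for illustration; in practice the fields would be picked from the entropy results.

```python
def make_pbx(c_bits, index_low=6, ppn_low=16):
    """Return a c-bit hash that XORs a low-order 'cache-index' field with a
    higher-order 'PPN' field. Field positions are assumptions of this sketch."""
    if ppn_low < index_low + c_bits:
        raise ValueError("bit-fields overlap")
    mask = (1 << c_bits) - 1

    def pbx(addr):
        index_field = (addr >> index_low) & mask
        ppn_field = (addr >> ppn_low) & mask
        return index_field ^ ppn_field

    return pbx


def pbx_xor_gates(k, c_bits):
    """One two-input XOR gate per hash output bit, for k hash functions."""
    return k * c_bits


pbx = make_pbx(c_bits=10)
```

Under these assumptions, k = 2 hash functions of c = 10 bits cost 2 x 10 = 20 XOR gates, which matches the PBX gate count quoted above.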
• Problem 1: H3 has high area & power overheads
• Solution 1: Use entropy analysis to guide lower-cost PBX
– Ex: 160 gates for H3 vs 20 gates for PBX
• Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs
• Solution 2: To be described
• False conflicts caused by thread-private addrs
– Avoid conflicts if addrs not inserted in thread’s signatures
• Two solutions proposed:
– (1) Remove private stack references from sigs.
• Very little work for programmer/compiler
• Benefits depend on fraction of stack addresses versus all transactional references
– (2) Language-level interface (e.g., private_malloc() , shared_malloc() )
• Even higher performance boost
• For skilled programmer
• WARNING: Incorrectly marking shared objects as private can lead to program errors!
• Each page is assigned a status, private or shared
– Invariant: Page is shared if any object is shared
• If stack is private, library marks stack pages as private
• If using privatization heap functions, mark heap pages accordingly
• OS allocates different physical page frames for shared and private pages
– Sets a per-frame bit in translation entry if shared
– Reduce number of page frames used by packing objects with same status together
• Signatures insert memory addresses of transactional references to shared pages
– Query page sharing bit in HW TLB & current transactional status
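The filtering step can be sketched as follows, assuming a simplified TLB entry that carries the per-frame sharing bit and using a Python set as a stand-in for the hardware signature. The `TlbEntry` type and both function names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    frame: int    # physical page frame number
    shared: bool  # per-frame sharing bit set by the OS


def should_insert(entry, in_transaction):
    """Only transactional references to shared pages enter the signature."""
    return in_transaction and entry.shared


def record_access(signature, addr, entry, in_transaction):
    if should_insert(entry, in_transaction):
        signature.add(addr)  # stand-in for hashing addr into the Bloom filter


read_sig = set()
record_access(read_sig, 0x40, TlbEntry(frame=1, shared=True), in_transaction=True)
record_access(read_sig, 0x80, TlbEntry(frame=2, shared=False), in_transaction=True)
record_access(read_sig, 0xC0, TlbEntry(frame=1, shared=True), in_transaction=False)
```

Only the first access ends up in the signature: the second targets a private page and the third is non-transactional.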
• Full-system simulation using Simics and Wisconsin GEMS timing modules
• Transistor-level design for area & power of XOR gates
• CACTI for Bloom filter bit array area & power
• Simulated system
– Single-chip CMP
– 16 single-threaded, in-order cores
– 32kB, 4-way private L1 I & D, write-back
– 8MB, 8-way shared L2 cache
– MESI directory protocol
– Signatures from 64b-64kb (8B-8kB) & “Perfect”
• Micro-benchmarks
– BTree – read and write ops on shared tree
– Sparse Matrix – algorithm from dense column vector multiplication kernel
• SPLASH-2 apps
– Barnes & Raytrace – exert most signature pressure
• Stanford STAMP apps
– Vacation, Genome, Delaunay, Bayes, Labyrinth
• DNS server
– BIND
• Area & power overheads (2kb, k=4):
Type of overhead | Bloom filter bit array | H3 hash | PBX hash | H3 sig. | PBX sig. | % savings for PBX sig.
Area (mm^2)      | 2.70e-2                | 8.10e-3 | 4.70e-4  | 3.50e-2 | 2.70e-2  | 23
Power (mW)       | 1.80e2                 | 1.04e1  | 1.02     | 1.90e2  | 1.81e2   | 4.7
PBX performs similarly to H3
Additional workload results in paper
• Removing private stack references from signatures did not help much
– Most addr references not to stack
– Most likely because we run on the SPARC ISA; other ISAs (e.g., x86) would likely see more benefit
• Privatization interface helps four workloads
– The remaining workloads either lack private heap structures or do not have a high transactional duty cycle
• Tackle 2 problems with signature designs:
– (1) Area and power overheads of H3 hashing
• E.g., 160 XOR gates for H3, 20 for PBX
– (2) False conflicts due to signature bits set by private memory references
• Our solutions:
– (1) Use entropy analysis to guide hashing function (PBX), a low-cost alternative that performs similarly to H3
– (2) Prevent private stack references from entering signatures, and propose a privatization interface for heap allocations
• Notary can be applied to non-TM uses:
– PBX hashing can directly transfer
– Privatization may transfer if addr filtering applies
• Dynamic entropy calculation:
– How to adapt PBX hashing to entropy changes over time?
• Dynamic privatization characteristics:
– How common is it for objects to change sharing status (i.e., from private to shared, and vice versa)?
Privatization function | Usage
shared_malloc(size), private_malloc(size) | Dynamic allocation of shared and private memory objects
shared_free(ptr), private_free(ptr) | Frees up memory allocated by shared or private allocators
privatize_barrier(num_threads, ptr, size), publicize_barrier(num_threads, ptr, size) | Program threads come to a common point to privatize or publicize an object. Must be used outside of transactions
• Dynamically switch from private to shared, and vice versa
• If transitioning from private -> shared, safe to mark page as shared (at cost of performance)
• If transitioning from shared -> private, default policy is to disallow if other shared objects exist on the same page
• Otherwise, trap to user software and let programmer call shared_free(), followed by private_malloc() on object
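One way to read the transition rules above, together with the invariant that a page is shared if any object on it is shared, is sketched below. The `PageStatus` class and its methods are hypothetical bookkeeping for illustration only; the slides do not specify how the runtime or OS tracks this state, and a `False` result from `privatize` stands for the case that would be handed to user software.

```python
class PageStatus:
    """Tracks which objects on each page are shared.
    Invariant: a page is shared if any object on it is shared."""

    def __init__(self):
        self.shared_objects = {}  # page number -> set of shared object ids

    def is_shared(self, page):
        return bool(self.shared_objects.get(page))

    def publicize(self, page, obj):
        # private -> shared is always safe: the page simply becomes shared.
        self.shared_objects.setdefault(page, set()).add(obj)

    def privatize(self, page, obj):
        # shared -> private is disallowed (default policy) while other
        # shared objects still live on the same page.
        others = self.shared_objects.get(page, set()) - {obj}
        if others:
            return False
        self.shared_objects.pop(page, None)
        return True


status = PageStatus()
status.publicize(1, "a")
status.publicize(1, "b")
blocked = status.privatize(1, "a")   # "b" is still shared on page 1
status.publicize(2, "c")
allowed = status.privatize(2, "c")   # "c" was the only shared object on page 2
```

When `privatize` is refused, the fallback described above is for the programmer to call shared_free() and then private_malloc() on the object, so that it moves to a private page.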
[Figure: a program transaction (xbegin; LD A; ST B; LD C; LD D; ST C; …) passes addresses through the hash function(s) into the read (R) and write (W) signatures; example lookups result in NO CONFLICT, CONFLICT!, or an ALIAS]
• In real programs, addresses are neither independent nor uniformly distributed (key assumptions to derive $P_{FP}(n)$)
• But good (universal/almost universal) hash functions can generate hash values that are almost uniformly distributed and uncorrelated
• Hash functions considered:
– Bit-selection (inexpensive, low quality)
– H3 [Carter, CSS79] (moderate, higher quality)