Architecture Support for Data Isolation & Memory Monitoring

Arrvindh Shriraman, Sandhya Dwarkadas, and Michael L. Scott
Department of Computer Science, University of Rochester

Motivation

Multi-core processors based on shared-memory programming will soon dominate the computing spectrum.
[Figure: processors (P) sharing a memory (M)]
Coordinating and synchronizing data shared across multiple threads is hard!
➡ Tracking memory location accesses is difficult because of transparent coherence events
➡ Cannot issue speculative operations to memory because the hardware protocol does not support undoing of writes

Shared Memory ++

Memory Monitoring (MM)
➡ provides read/write access summaries of code blocks
➡ event-style notification of desired coherence events
Apps: Reliability, Security, Watchpoints, and Debugging

Rochester Transactional Memory

Integrated hardware-software approach to flexible transactional memory
➡ DIMM mechanisms accelerate common STM operations
➡ software makes policy decisions
➡ software routines support uncommon events
[Figure: HTM, hardware-software, and STM transaction organization; version management and conflict management & resolution]

RTM [ISCA'07]

FlexTM [ISCA'08]
FlexTM deploys
➡ Signatures for detecting and notifying conflicts
➡ CSTs for noticing and managing conflicts
➡ Lazy caches for in-cache data isolation and Redo-Buffer for handling cache overflows
➡ AOU for propagating abort events to remote transactions
FlexTM software
➡ checkpoints registers at Begin_Tx
➡ manages conflicts; controls Tx aborts using AOU trigger
➡ controls commit phase

Memory Monitoring
Programmer's view

Alert-On-Update (AOU)
New instruction, ALoad, loads and marks cache line
A-tagged line on invalidation jumps to handler
➡ trigger event type can be capacity eviction or coherence
[Figure: execution pipeline and a cache line with an A tag; ALoad sets the tag, Clear removes it, and a remote store or eviction invokes the handler]

Signatures (access summary)
Insert addresses accessed by thread in hardware bloom filters.
(Reads update Rsig & Writes update Wsig)
+ unboundedness, decouples tracking from caches
- false positives
Special instructions access cache blocks and insert physical address into bloom filter
Coherence requests snoop signatures, test for membership, and piggy-back conflict type on response message
[Figure: signature lookup; the address 0xff83ff48 of a FWD_REQ is hashed (h1, h2) into the bloom filter]
[Figure: FlexTM L2 directory example with Wsig:{A}, Rsig:{}, a W-W conflict, and directory states A;M@C0 and A;M@C0,C1]
[Figure: RTM metadata for fast-path and overflow transactions; transaction descriptors (TxD_1, TxD_2) with COMMIT status, object header OH(A) with Owner, #S, and Overflow Readers fields, and the current version of A]

Memory Monitoring primitives
➡ Alert-On-Update: precise but bounded size
➡ Signatures: imprecise but unbounded
➡ CST: track inter-processor conflicts for all watched locations

Data Isolation primitives
➡ PDI: private caches act as speculative-write buffer
➡ Redo-Log: holds cache overflows in virtual memory
Track stores to cache-line
Issue TLoad/TStore for speculative memory operations
[Figure: processor context; registers and control registers, read/write location summary (Read Signature, Write Signature), conflict tables (R-W, W-R, W-W) for inter-processor data conflicts, per-block cache tag, data, and isolation state, overflow signature, and overflow table controller (Base Address, Hash Param., Overflow Count)]
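As a rough illustration of the signature mechanism described above, the following Python sketch models Rsig and Wsig as Bloom filters and returns a conflict type when a remote request hits. It is a software model under stated assumptions, not the hardware design: the names `Signature` and `Core`, the 1024-bit filter width, the 64-byte line size, the BLAKE2-based hash functions, and the convention that a conflict name lists the local access first are all illustrative choices that do not come from the poster.

```python
import hashlib

LINE_BITS = 6  # assumption: 64-byte cache lines


class Signature:
    """Bloom-filter model of a hardware access signature (illustrative)."""

    def __init__(self, m_bits=1024, num_hashes=2):
        self.m_bits = m_bits          # filter width (assumed value)
        self.num_hashes = num_hashes  # two hashes, like h1 and h2
        self.bits = 0                 # bit vector kept in a Python int

    def _positions(self, addr):
        # Hash the cache-line address to num_hashes bit positions.
        # BLAKE2 is a stand-in; hardware would use far cheaper hashes.
        line = (addr >> LINE_BITS).to_bytes(8, "little")
        for i in range(self.num_hashes):
            h = hashlib.blake2b(line, digest_size=8, salt=bytes([i]))
            yield int.from_bytes(h.digest(), "little") % self.m_bits

    def insert(self, addr):
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def member(self, addr):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))

    def clear(self):
        self.bits = 0


class Core:
    """One processor with a read signature and a write signature."""

    def __init__(self):
        self.rsig = Signature()
        self.wsig = Signature()

    def read(self, addr):
        self.rsig.insert(addr)   # reads update Rsig

    def write(self, addr):
        self.wsig.insert(addr)   # writes update Wsig

    def snoop(self, addr, remote_is_write):
        """Conflict type to piggy-back on the response, or None."""
        if self.wsig.member(addr):
            return "W-W" if remote_is_write else "W-R"
        if remote_is_write and self.rsig.member(addr):
            return "R-W"
        return None
```

In this model a core that has read address 0x1000 answers a remote write to the same line with `"R-W"` and a remote read with `None`. Because the filters are imprecise, an unrelated address can occasionally report a conflict as well, which is the false-positive cost noted above.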
PDI In Cache

Lazy Coherence
Caches detach lines selectively from coherence protocol
➡ track coherence messages and choose time to enforce rules
Cache protocol extended by two 'T' bit tagged states
TMI allows concurrent sharers & isolates data in cache
➡ TMI buffers TStores; TI allows incoherence with remote TMI
TMI & TI require just a flash-clear to convert lines to MESI
[Figure: L1 data cache of C0 executing Store %o1, (80) /* o1 = 35 */; the line with tag 80 is held in state TMI with the new version 35]

Commit
Foreach i set in W-R or W-W: CAS (Status[i], ACT, ABORT)
➡ iterate over CSTs and update status word of conflicting transactions
CAS-Commit Status[id]
➡ logically commit on status word; start physical commit of hardware state

Redo-Buffer
A per-thread hash-table in virtual memory
Hardware controller
➡ fills table with "TMI" write-back data blocks
➡ performs look-aside transparently on L1 misses
[Figure: Redo-Buffer controller registers; Base, OSig, Config (Sets, Ways), Count, Commtd]

Extend ISA to support signatures and AOU as first-class entities
➡ insert, member, activate, clear, etc.

FlexWatcher Memory Debugger
Compiler/Programmer specifies addresses to be tracked
Hardware triggers trampoline on snoop hits

Benchmark   Bug   FlexWatcher   Discover
BC          BO    1.5X          75X
GZIP        BO    1.15X         17X
GZIP2       IV    1.05X         N/A
Man         BO    1.80X         65X
Squid       ML    2.50X         N/A

Discover is a SPARC binary instrumentation tool from OpenSPARC
Discover overheads were estimated on a Sun T1000 server

Other Uses
Debugging
➡ watchpoints and race detectors
Synchronization
➡ fast mutexes and asynchronous messages

DIMM aids improve software-controlled TMs
[Figure: performance plots for HashTable, RBTree, Delaunay, Vacation-L, Vacation-H, and RandomGraph at 1, 2, 4, 8, and 16 threads, with STM as a reference; annotations on the plots include 2.3X, 1.8X, and 3.8X]
Lazy encourages progress
[Figure: Lazy vs. Eager at 16 threads; annotations include 1.9X and 4.1X]

Memory Monitoring primitives
Record inter-processor R-W, W-W & W-R conflicts
Decouples access
conflict tracking from access tracking (Conflict Summary Tables, CST)

DIMM Hardware Support
Decoupled hardware primitives for DIMM help
➡ refine architecture incrementally
➡ software evolve the API and use in varying applications
➡ decouple policy from mechanism

Hardware-acceleration of Software-controlled transactions
➡ Programmable-Data-Isolation for data versioning
➡ Alert-On-Update for conflict detection
Begin_Tx abort_pc: Checkpoint processor registers and record abort handler PC

Data Isolation (DI)
➡ allows control over propagation of writes to remote threads
➡ buffer written locations and commit or undo as an atomic unit
Apps: Sand-boxing, Transactional programming, Speculation

Security
➡ buffer overflow attacks, information-flow trackers & drivers/plugin isolation
Speculation
➡ Thread-level speculation and lock elision

Buffer Overflow (BO): Pad all heap-allocated buffers with 64 bytes, watch padded locations.
Memory Leak (ML): Monitor all heap-allocated objects and update the address's timestamp on access.
Invariant Violation (IV): ALoad cache line for variable X of interest. On AOU handler trigger, assert program-specific invariants.

Conclusion
➡ Data-Isolation and Memory-Monitoring primitives will help multi-core chips achieve widespread use across traditional and emerging application domains
➡ Decoupling the hardware components will help refine the architecture incrementally and help software evolve the API
➡ Use simple hardware to accelerate the common case, minimize hardware state, and employ software for the uncommon case

Web: http://www.cs.rochester.edu/research/cosyn/
Email: {ashriram, sandhya, scott}@cs.rochester.edu