Analytical Evaluation of Shared-Memory Systems with Commercial Workloads
Jichuan Chang <chang@cs.wisc.edu>
CS747 1 05-10-2002

Outline
• A Case for Analytical Models
• Existing Models and Their Limitations
• What Kind of Tools Do We Need

Background
• Shared-memory multiprocessor servers
  – Important: the computing infrastructure of our society
  – Complex systems (ILP processors + caches + interconnection)
• Commercial workloads
  – Important: 80% of the server market, supporting our daily business
  – Behave differently from scientific workloads
    • Large code size and data set, different cache behaviors
    • Lots of OS interactions (context switches), higher I/O rate
  – Hard to study (complex, hard to set up, no source code, moving target)

A Motivating Example
• Bob is designing a next-generation multiprocessor server for commercial workloads. Assume that the largest benchmark he can set up now is a 10G database.
• How can Bob predict the performance (IPC or tpm) of running a 100G database TPC-D benchmark on the future machine?
• What is the ideal cache hierarchy design for this workload, given his prediction of future technology constants?
• We need tools to characterize the workloads!
• We need tools to prune the vast design space!

Performance Evaluation Tools
• Hardware monitors, binary instrumentation tools
  + Realistic, dynamic information
  - Only work for existing systems, aggregated info
• Program analysis tools (e.g. compilers)
  + Can do global analysis, work well for arrays/loops
  - Little dynamic info, not good for irregular (pointer-based) programs, need source code
• (Full-system) architecture simulators
  + Detailed simulation, realistic results, can simulate future HW
  - Slow (can't extrapolate), complex, can't simulate future SW
• Analytical models
  + Fast, give insights, can predict for future SW/HW combinations
  + Need to create models of multiprocessors with new workloads

Sorin et al. MVA for ILP Multiprocessors
[Figure: an ILP processor with L1$, L2$, and MSHR, connected to the rest of the system (bus, NI, switches, DRAM, directories); annotation: "when MSHR not full"]
• Application input parameters
  – CV, fM, fsync-write, Pread, Pwrite, ...
• Iterate between 2 submodels
  – SB (fraction of time the CPU stalls due to synchronization operations)
  – MB (fraction of time the CPU stalls due to limited MSHR size)
  – Surrogate service time inflation

Sorin et al. MVA Model
+ Targets system design, answers questions about
  + MSHR size, directory organization, NI latencies, etc.
+ Insight into application behavior
  + Miss rate, burstiness (CV), degree of parallelism (fM)
– Some application parameters (miss rate, fM, fsync-write) depend on architectural parameters
– Most parameters are insensitive to changes outside the CPU/cache
– Needs input parameters for each CPU/cache configuration
– Caches also interact with the system design (e.g. update protocol)
– Fixed problem size, does not characterize the workload
• Can we break the processor/cache black box into two submodels, one for the processor and one for the cache?
• What would be the application input parameters?

Cache Models (1)
• Stack distance model
  – Estimates capacity misses, based on one access trace
  – Works for inclusive, fully associative caches
  – Has extensions for direct-mapped and set-associative caches
[Figure: two plots of hit rate (0% to 100%) versus cache size, one for the example trace ABBACA and one for "a typical access trace" with cache sizes up to 16K]

Cache Models (2)
• Agarwal et al. 1989
  – Models cache block size, working-set transitions, conflict misses, and multiprogramming interference
• Data Reference Model (Tsai/Agarwal 1993)
  – Configuration-independent model for multiprocessors
    • Problem size, number of processors, and block size as parameters
  – Models the sharing pattern for each shared block
  – Assumes a certain data distribution for data-dependent applications (e.g. parallel quick-sort)
  – Limitation: simple and iterative programs, well-known algorithms, no significant synchronization

Cache Models (3)
• Mathematical Cache Miss Equations
  – Compiler-generated equations for loop-based array accesses
  – Model reuse along array dimensions with "reuse vectors"
  – Extended to model pointer data structures
    • Singly linked lists and binary trees on a uniprocessor
    • Must understand the malloc() implementation
  – Ultimate aim is to model B-trees for databases

Architects' Workload Characterization
• Observe for different configurations
  – Busy/stall time breakdown
  – Kernel/user time breakdown
  – Miss breakdown (4C)
  – Last-touch prediction
• Observe for different problem sizes
  – Working set and working-set transitions
  – Sharing degree (producer-consumer, migratory)

What Tools Do We Need
• Application models for commercial workloads
  – What to model? (working set, sharing, communication, etc.)
  – Include problem size as an input parameter
  – Configuration independent (or less dependent)
  – Algorithm-based (needs source code)
  – Or observation-based (on simulations)
• Architectural models
  – Separate processor core and caches
  – Separate CPU and the rest of the system [Sorin et al]
• Model vs. simulation
  – Analytical models to simplify simulator design [CAECW 01]
  – Simulators to ease the acquisition of model parameters

Configuration Independent Analysis
• What to characterize? [Abandah/Davidson]
  – General characteristics
  – Working set (access age, footprint)
  – Concurrency (serial / imbalance / contention / busy)
  – Communication pattern (sharing degree / invalidation degree)
  – Communication phases and locality, sharing behavior
  – Possible parameters for workload characterization
• An example: working-set sizes in DSS systems
  – Application parameters (for each node i in the query plan)
    • Ni = number of tuples in a scan; Hi = probability that a tuple matches
    • QD = depth of the query tree
    • DB_REi = fraction of a relation accessed
  – Model the reuse after working-set transitions (instructions, private, meta-data, index, tuple-locks, tuples)

A (Simplistic?) Model for TPC-C
• Uses a stack distance curve to derive miss rates
• L1 cache accesses are totally overlapped with execution
• An M/G/1 queue models bus/memory contention
• Things not being modeled
  – Query algorithms
  – Communication misses
  – Overlap between computation and memory access
• The paper reports <10% errors. [Zhang et al 99]

Conclusion
• Analytical models are needed to
  – Characterize commercial workloads
  – Predict their performance on multiprocessors
• We need models that
  – Perform configuration-independent analysis
  – Can use the output from workload models

Thank You! Questions?
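
Worked example (illustrative only): the stack distance model from Cache Models (1) and the M/G/1 contention term from the TPC-C model can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method of any cited paper: it assumes LRU replacement in a fully associative cache, counts cache size in blocks, and uses the standard Pollaczek-Khinchine formula for the mean M/G/1 queueing delay. The function names (stack_distances, hit_rate, mg1_wait) and all numeric inputs are hypothetical.

```python
def stack_distances(trace):
    """LRU stack distance of each access; None marks a first touch (cold miss)."""
    stack = []        # assumed LRU stack: most recently used block is last
    distances = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            # Depth from the top of the stack: 1 means "reused immediately".
            distances.append(len(stack) - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(block)
    return distances


def hit_rate(trace, cache_size):
    """Hit rate of a fully associative LRU cache that holds cache_size blocks."""
    hits = sum(1 for d in stack_distances(trace)
               if d is not None and d <= cache_size)
    return hits / len(trace)


def mg1_wait(arrival_rate, mean_service, cv_service):
    """Mean queueing delay of an M/G/1 queue (Pollaczek-Khinchine formula)."""
    rho = arrival_rate * mean_service          # utilization of the bus/memory
    if rho >= 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    return rho * mean_service * (1.0 + cv_service ** 2) / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    trace = "ABBACA"                           # example trace from the slide
    for size in (1, 2, 3):
        print(size, hit_rate(trace, size))
    # Hypothetical contention inputs: 0.5 requests per time unit,
    # mean service time 1.0, exponential service (CV = 1).
    print(mg1_wait(0.5, 1.0, 1.0))
```

One pass over the trace yields the hit rate for every cache size at once, which is why a single stack distance curve can feed a miss-rate estimate into a queueing model of the bus and memory.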

Backup Slides
• References
• Acknowledgement

References
• Cache Models
  – An Analytical Cache Model, Agarwal et al, ACM Transactions on Computer Systems, 1989
  – Analyzing Multiprocessor Cache Behavior Through Data Reference Modeling, Tsai and Agarwal, SIGMETRICS 93
  – An Analytical Model for Designing Memory Hierarchies, Jacob et al, IEEE Transactions on Computers, 1996
  – Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior, Ghosh et al, ACM Transactions on Programming Languages and Systems, 1999
  – A Mathematical Cache Miss Analysis for Pointer Data Structures, Zhang and Martonosi, SIAM
• Commercial Workloads Overview
  – Trends in Shared Memory Multiprocessing, Stenstrom et al, IEEE Computer 97
  – Memory System Characterization of Commercial Workloads, Barroso et al, ISCA 98
• Configuration Independent Analysis
  – Configuration Independent Analysis for Characterizing Shared-memory Applications, Abandah and Davidson, UMich TR 1997
• Shared Memory Multiprocessor Models
  – Analytical Evaluation of Shared-memory Systems with ILP Processors, Sorin et al, ISCA 98
  – A Customized MVA Model for Shared-memory Systems with Heterogeneous Applications, Sorin et al, UWisc TR, 2000
• Commercial Workload Specific Models
  – An Analytical Model of the Working-set Sizes in Decision-Support Systems, Karlsson et al, SIGMETRICS 2000
  – Analysis of Commercial Workload on SMP Multiprocessors, Zhang et al, Proceedings of Performance 99
• Evaluation of Commercial Workloads
  – A Processor Queueing Simulation Model for Multiprocessor System Performance Analysis, Tsuei and Yamamoto, CAECW 2001
  – Evaluating the Non-determinism in Commercial Workloads, Multifacet group, CAECW 2001