Analytical Evaluation of Shared-Memory
Systems with Commercial Workloads
Jichuan Chang <chang@cs.wisc.edu>
CS747
05-10-2002
Outline
• A Case for Analytical Models
• Existing Models and Their Limitations
• What Kind of Tools do We Need
Background
• Shared-memory multiprocessor servers
– Important: the computing infrastructure of our society
– Complex systems (ILP processors + caches + interconnect)
• Commercial workloads
– Important: 80% of the server market, supporting our daily business
– Behave differently from scientific workloads
• Large code size and data set, different cache behaviors
• Lots of OS interactions (context switches), higher I/O rate
– Hard to study (complex, hard to set up, no code, moving target)
A Motivating Example
• Bob is designing a next-generation multiprocessor server for
commercial workloads. Assume that the largest benchmark he
can set up now is a 10G database.
• How can Bob predict the performance (IPC, or tpm) of running
a 100G database TPC-D benchmark on the future machine?
• What’s the ideal cache hierarchy design for this workload given
his prediction of future technology constants?
• We need tools to characterize the workloads!
• We need tools to prune the vast design space!
Performance Evaluation Tools
• Hardware Monitors, Binary Instrumentation Tools
+ Realistic, dynamic information
- Only work for existing systems; provide only aggregated information
• Program Analysis Tools (e.g., compilers)
+ Can do global analysis; work well for arrays/loops
- Little dynamic information; not good for irregular (pointer-based) programs;
need source code
• (Full System) Architecture Simulators
+ Detailed simulation, realistic result, can simulate future HW
- Slow (can’t extrapolate), complex, can’t simulate future SW
• Analytical Models
+ Fast, gives insights, can predict for future SW/HW combinations
- Need to create models of multiprocessors with new workloads
Sorin et al. MVA for ILP Multiprocessors
[Figure: an ILP processor with its L1$, L2$, and MSHRs issues requests (when the MSHRs are not full) to the rest of the system (bus, NI, switches, DRAM, directories).]
• Application input parameters
– τ, CVτ, fM, fsync-write, Pread, Pwrite, …
• Iterate between 2 submodels
– SB (fraction of time CPU stalls due to synch operations)
– MB (fraction of time CPU stalls due to limited MSHR size)
– Surrogate service time inflation
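For readers unfamiliar with mean value analysis, the recursion at the heart of MVA can be sketched as below. This is the textbook exact MVA for a single-class closed queueing network, not the customized model of Sorin et al.: it has no MSHR or synchronization submodels and no iteration between them. The function name `exact_mva` and the parameters `demands`, `think_time`, and `customers` are hypothetical names chosen for illustration.

```python
def exact_mva(demands, think_time, customers):
    """Textbook exact MVA for a single-class closed queueing network.

    demands:    service demand at each queueing center (time units)
    think_time: delay spent outside the queueing centers
    customers:  number of circulating customers
    Returns (throughput, residence times, mean queue lengths).
    """
    queue = [0.0] * len(demands)
    residence = [0.0] * len(demands)
    throughput = 0.0
    for n in range(1, customers + 1):
        # Arrival theorem: an arriving customer sees the queue lengths
        # of the same network with n - 1 customers.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        # Throughput follows from Little's law applied to the whole network.
        throughput = n / (think_time + sum(residence))
        # Little's law again, per center.
        queue = [throughput * r for r in residence]
    return throughput, residence, queue


if __name__ == "__main__":
    # Assumed numbers: two centers (say, bus and memory), two customers.
    x, r, q = exact_mva([1.0, 1.0], think_time=0.0, customers=2)
    print(x, r, q)
```

The cost grows linearly with the number of customers and centers, which is why models of this kind can sweep a design space far faster than simulation.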
Sorin et al. MVA Model
+ Targets system design; answers questions about
+ MSHR size, directory organization, NI latencies, etc.
+ Gives insight into application behavior
+ Miss rate (τ), burstiness (CVτ), degree of parallelism (fM)
– Some application parameters (τ, fM, fsync-write) depend on architectural parameters
– Most parameters are insensitive to changes outside the CPU/cache
– Needs input parameters for each CPU/cache configuration
– Caches also interact with the system design (e.g., update protocols)
– Fixed problem size; does not characterize the workload
• Can we split the processor/cache black box into two submodels,
one for the processor and one for the cache?
• What would be the application input parameters?
Cache Models (1)
• Stack distance model
– Estimate capacity misses, based on one access trace
– Works for inclusive, fully associative caches
– Has extensions for direct-mapped and set-associative caches
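The stack distance idea can be illustrated with a short sketch. It assumes a fully associative LRU cache and measures cache size in blocks; the function names `stack_distances` and `hit_rate` are hypothetical. The trace ABBACA is the one used in the slide's example.

```python
import math


def stack_distances(trace):
    """Return the LRU stack distance of every reference in the trace.

    The distance is the 1-based depth of the block in the LRU stack at
    the time of the access, or infinity for a first (cold) reference.
    """
    stack = []  # most recently used block first
    distances = []
    for block in trace:
        if block in stack:
            depth = stack.index(block) + 1
            stack.remove(block)
        else:
            depth = math.inf
        stack.insert(0, block)
        distances.append(depth)
    return distances


def hit_rate(distances, cache_blocks):
    """A reference hits in an LRU cache of cache_blocks blocks
    exactly when its stack distance is at most cache_blocks."""
    hits = sum(1 for d in distances if d <= cache_blocks)
    return hits / len(distances)


if __name__ == "__main__":
    d = stack_distances("ABBACA")
    for size in (1, 2, 3):
        print(size, hit_rate(d, size))
```

One pass over the trace yields the hit rate for every cache size at once, which is what makes the model attractive for exploring a cache hierarchy. For ABBACA the distances are infinity, infinity, 1, 2, infinity, 2, so a one-block cache hits once in six references and any cache of two or more blocks hits three times in six.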
[Figure: two plots of hit rate (0% to 100%) versus cache size. One panel shows the access trace ABBACA with cache sizes 1, 2, 3, and infinity; the other shows a typical access trace with cache sizes up to 16K.]
Cache Models (2)
• Agarwal et al. 1989
– Model cache block size, working-set transitions,
conflict misses and multi-programming interference
• Data Reference Model (Tsai/Agarwal 1993)
– Configuration-independent model for multiprocessors
• Problem size, number of processors, and block size as parameters
– Models the sharing pattern for each shared block
– Assumes a certain data distribution for data-dependent
applications (e.g., parallel quicksort)
– Limitations: applies to simple, iterative programs with well-known
algorithms and no significant synchronization
Cache Models (3)
• Mathematical Cache Miss Equations
– Compiler-generated equations for loop-based array
accesses
– Model reuse along array dimensions with a “reuse vector”
– Extended to model pointer data structures
• Singly linked lists and binary trees on a uniprocessor
• Must understand malloc() implementation
– Ultimate aim is to model B-tree for databases
Architects’ Workload Characterization
• Observe for different configurations
– Busy/stall time breakdown
– Kernel/user time breakdown
– Misses breakdown (4C)
– Last touch prediction
• Observe for different problem sizes
– Working set and working set transition
– Sharing degree (producer-consumer, migratory)
What Tools do We Need
• Application models for commercial workloads
– What to model? (working set, sharing, communication, etc.)
– Include problem size as input parameter
– Configuration independent (or less dependent)
– Algorithm-based (need source code)
– Or observation-based (on simulations)
• Architectural Models
– Separate processor core and caches
– Separate CPU and the rest of the system [Sorin et al]
• Model vs. Simulation
– Analytical models to simplify simulator design [CAECW 01]
– Simulators to ease the acquisition of model parameters
Configuration Independent Analysis
• What to characterize? [Abandah/Davidson]
– General characteristics
– Working set (access-age, footprint)
– Concurrency (serial / imbalance / contention / busy)
– Communication pattern (sharing degree/invalidation degree)
– Communication phases and locality, sharing behavior
– Possible parameters for workload characterization
• An example: working-set sizes in DSS systems
– Application parameters (for each node i in the query plan)
• Ni = # tuples in a scan; Hi = probability that a tuple matches
• QD = depth of the query tree
• DB_REi = fraction of a relation accessed
– Model the reuse after working-set transitions (instructions,
private, meta-data, index, tuple-locks, tuples)
A (simplistic?) Model for TPC-C
• Use stack distance curve to derive miss rates
• L1 cache accesses totally overlapped with execution
• M/G/1 queue to model bus/memory contention
• Things not being modeled
– Query algorithms
– Communication misses
– Overlapping between computation and memory access
• The paper reports errors below 10% [Zhang et al. 99]
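As an illustration of the queueing step, the mean waiting time in an M/G/1 queue follows from the standard Pollaczek-Khinchine formula. The sketch below is generic and is not taken from Zhang et al.; the function name `mg1_mean_wait` and its parameters are hypothetical, and the numbers in the usage lines are made up.

```python
def mg1_mean_wait(arrival_rate, mean_service, cv_service):
    """Mean time a request waits in queue (excluding its own service)
    in an M/G/1 queue, from the Pollaczek-Khinchine formula:

        W = rho * S * (1 + CV^2) / (2 * (1 - rho)),  rho = lambda * S

    arrival_rate: Poisson arrival rate of requests (lambda)
    mean_service: mean service time of the bus or memory (S)
    cv_service:   coefficient of variation of the service time (CV)
    """
    rho = arrival_rate * mean_service
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return rho * mean_service * (1.0 + cv_service ** 2) / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    # Exponential service (CV = 1) reduces to the M/M/1 result.
    print(mg1_mean_wait(0.5, 1.0, 1.0))
    # Deterministic service (CV = 0) halves the waiting time.
    print(mg1_mean_wait(0.5, 1.0, 0.0))
```

In a model of this kind the arrival rate would come from the miss rates derived from the stack distance curve, and the waiting time would be added to the memory service time to obtain the effective miss latency.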
Conclusion
• Analytical models are needed to
– Characterize commercial workloads
– Predict their performance on multiprocessors
• We need models that
– Perform configuration independent analysis
– Can use the output from workload models
Thank You!
Questions?
Backup Slides
• References
• Acknowledgement
References
• Cache Models
– An Analytical Cache Model, Agarwal et al., ACM Transactions on Computer Systems, 1989
– Analyzing Multiprocessor Cache Behavior Through Data Reference Modeling, Tsai and
Agarwal, SIGMETRICS 93
– An Analytical Model for Designing Memory Hierarchies, Jacob et al., IEEE Transactions on
Computers, 1996
– Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory
Behavior, Ghosh et al., ACM Transactions on Programming Languages and Systems, 1999
– A Mathematical Cache Miss Analysis for Pointer Data Structures, Zhang and Martonosi,
SIAM
• Commercial Workloads Overview
– Trends in Shared Memory Multiprocessing, Stenstrom et al, IEEE Computer 97
– Memory System Characterization of Commercial Workloads, Barroso et al, ISCA 98
References (cont.)
• Configuration Independent Analysis
– Configuration Independent Analysis for Characterizing Shared-memory Applications,
Abandah and Davidson, UMich TR 1997.
• Shared Memory Multiprocessor Models
– Analytical Evaluation of Shared-memory Systems with ILP Processors, Sorin et al, ISCA 98
– A Customized MVA Model for Shared-memory Systems with Heterogeneous Applications,
Sorin et al, UWisc TR, 2000
• Commercial Workload Specific Models
– An Analytical Model of the Working-set Sizes in Decision-Support Systems, Karlsson et al,
SIGMETRICS 2000
– Analysis of Commercial Workload on SMP Multiprocessors, Zhang et al, Proceedings of
Performance 99
• Evaluation of Commercial Workloads
– A Processor Queueing Simulation Model for Multiprocessor System Performance Analysis,
Tsuei and Yamamoto, CAECW 2001
– Evaluating the Non-determinism in Commercial Workloads, Multifacet group, CAECW 2001