Workload Selection and Characterization
Andy Wang
CIS 5930 Computer Systems Performance Analysis

Workloads
• Types of workloads
• Workload selection

Types of Workloads
• What is a workload?
• Instruction workloads
• Synthetic workloads
• Real-world benchmarks
• Application benchmarks
• "Standard" benchmarks
• Exercisers and drivers

What is a Workload?
• Workload: anything a computer is asked to do
• Test workload: any workload used to analyze performance
• Real workload: any workload observed during normal operations
• Synthetic workload: a workload created for controlled testing

Real Workloads
• Advantage: they represent reality
• Disadvantage: uncontrolled
  – Can't be repeated
  – Can't be described simply
  – Difficult to analyze
• Nevertheless, often useful for "final analysis" papers
  – E.g., "We ran system foo and it works well"

Synthetic Workloads
• Advantages:
  – Controllable
  – Repeatable
  – Portable to other systems
  – Easily modified
• Disadvantage: can never be sure the real world will behave the same way

Instruction Workloads
• Useful only for CPU performance
  – But they teach useful lessons for other situations
• Development over decades:
  – "Typical" instruction (ADD)
  – Instruction mix (weighted by frequency of use)
    • Sensitive to compiler, application, and architecture
    • Still used today (GFLOPS)
  – Processor clock rate
    • Only valid within a processor family

Instruction Workloads (cont'd)
• Modern complexity makes instruction mixes invalid:
  – Pipelining
  – Data/instruction caching
  – Prefetching
• A kernel is an inner loop that does useful work:
  – Sieve, matrix inversion, sort, etc.
  – Ignores setup and I/O, so it can be timed by analysis if desired (at least in theory); a timing sketch follows the benchmark list below

Synthetic Workloads
• Complete programs
  – Designed specifically for measurement
  – May do real or "fake" work
  – May be adjustable (parameterized)
• Two major classes:
  – Benchmarks
  – Exercisers

Real-World Benchmarks
• Pick a representative application
• Pick sample data
• Run it on the system to be tested
• The Modified Andrew Benchmark (MAB) is a real-world benchmark
• Easy to do, and accurate for that sample data
• Fails to consider other applications and data

Application Benchmarks
• Variation on real-world benchmarks
• Choose the most important subset of functions
• Write a benchmark to test those functions
• Tests what the computer will actually be used for
• Need to be sure important characteristics aren't missed
• Mix of functions must reflect reality

"Standard" Benchmarks
• Often need to compare general-purpose computer systems for general-purpose use
  – E.g., should I buy a Sony or a Dell PC?
  – Tougher: Mac or PC?
• Desire for an easy, comprehensive answer
• People writing articles often need to compare tens of machines

"Standard" Benchmarks (cont'd)
• Often need to make comparisons over time
  – Is this year's CPU faster than last year's?
    • Probably yes, but by how much?
• Don't want to spend time writing your own code
  – Could be buggy or not representative
  – Need to compare against other people's results
• "Standard" benchmarks offer a solution

Popular "Standard" Benchmarks
• Sieve, 8 queens, etc.
• Whetstone
• Linpack
• Dhrystone
• Debit/credit
• TPC
• SPEC
• MAB
• Winstone, Webstone, etc.
• ...
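To make the kernel idea from the instruction-workload slides concrete, here is a minimal sketch (not from the slides) that times only the inner loop of a small kernel, a prime sieve in this case, repeats the measurement, and reports the best and mean times. The kernel choice, problem size, and repetition count are illustrative assumptions.

```python
import time

def sieve(n):
    """Kernel: Sieve of Eratosthenes up to n; returns the count of primes."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, n + 1, i):
                is_prime[j] = False
    return sum(is_prime)

def time_kernel(kernel, arg, repetitions=5):
    """Time only the kernel itself (setup and I/O excluded), several times."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        kernel(arg)
        samples.append(time.perf_counter() - start)
    return min(samples), sum(samples) / len(samples)

if __name__ == "__main__":
    best, mean = time_kernel(sieve, 1_000_000)   # illustrative problem size
    print(f"sieve(1e6): best {best:.4f} s, mean {mean:.4f} s")
```

As the slides note, a kernel like this says nothing about I/O or overall system behavior; it only times a small compute loop.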
Sieve, etc.
• Prime number sieve (Eratosthenes)
  – Nested for loops
  – Often uses such a small array that it's silly
• 8 queens
  – Recursive
• Many others
• Generally not representative of real problems

Whetstone
• Dates way back (can compare against the 1970s)
• Based on real observed instruction frequencies
• Entirely synthetic (no useful result)
  – Modern optimizers may delete the code
• Mixed data types, but best for floating point
• Be careful of incomparable variants!

LINPACK
• Based on real programs and data
• Developed by supercomputer users
• Great if you're doing serious numerical computation

Dhrystone
• Bad pun on "Whetstone"
• Motivated by Whetstone's perceived excessive emphasis on floating point
• Dates to when microprocessors were integer-only
• Very popular in the PC world
• Again, watch out for version mismatches

Debit/Credit Benchmark
• Developed for transaction processing environments
  – CPU processing is usually trivial
  – Remarkably demanding I/O and scheduling requirements
• Models real TPS workloads synthetically
• Modern version is the TPC benchmark

SPEC Suite
• Result of a multi-manufacturer consortium
• Addresses flaws in existing benchmarks
• Uses 10 real applications, trying to characterize specific real environments
• Considers multiple CPUs
• Geometric mean gives the SPECmark for a system (see the short example below)
• Becoming the standard comparison method

Modified Andrew Benchmark
• Used in research to compare file system and operating system designs
• Based on a software engineering workload
• Exercises copying, compiling, linking
• Probably ill-designed, but common use makes it important
• Needs scaling up for modern systems

Winstone, Webstone, etc.
• "-stone" has become a suffix meaning "benchmark"
• Many specialized suites to test specialized applications
  – Too many to review here
• Important to understand strengths & drawbacks
  – Bias toward certain workloads
  – Assumptions about the system under test

Exercisers and Drivers
• For I/O, network, and other non-CPU measurements
• Generate a workload and feed it to an internal or external measured system
  – I/O on local OS
  – Network
• Sometimes uses a dedicated system and interface hardware

Advantages of Exercisers
• Easy to develop and port
• Can incorporate measurement
• Easy to parameterize and adjust

Disadvantages of Exercisers
• High cost if external
• Often too small compared to real workloads
  – Thus not representative
  – E.g., may use caches "incorrectly"
• Internal exercisers often don't have real CPU activity
  – Affects overlap of CPU and I/O
• Synchronization effects caused by loops

Workload Selection
• Services exercised
• Completeness
  – Sample service characterization
• Level of detail
• Representativeness
• Timeliness
• Other considerations

Services Exercised
• What services does the system actually use?
  – Network performance is useless for matrix work
• What metrics measure these services?
  – MIPS/GIPS for CPU speed
  – Bandwidth/latency for network, I/O
  – TPS for transaction processing

Completeness
• Computer systems are complex
  – Effect of interactions is hard to predict
    • Dynamic voltage scaling can speed up heavy loads (e.g., accessing encrypted files)
  – So must be sure to test the entire system
• Important to understand the balance between components
  – I.e., don't use a 90%-CPU mix to evaluate an I/O-bound application
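As a small illustration of the geometric-mean step mentioned on the SPEC slide, the sketch below combines per-benchmark speed ratios into a single composite rating. The benchmark names and ratio values are invented for illustration; they are not actual SPEC results.

```python
import math

# Hypothetical per-benchmark speed ratios (reference time / measured time).
# Values > 1 mean the system under test is faster than the reference machine.
ratios = {
    "compiler":      2.1,
    "circuit_sim":   2.6,
    "matrix_solver": 1.4,
    "text_compress": 1.8,
}

def geometric_mean(values):
    """Geometric mean: the nth root of the product of n values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

mark = geometric_mean(ratios.values())
print(f"Composite rating (geometric mean of {len(ratios)} ratios): {mark:.2f}")
```

The geometric mean is the conventional choice for combining ratios: a 2x improvement on any one benchmark moves the composite by the same factor, so no single long-running program dominates.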
Component Testing
• Sometimes only individual components are compared
  – Would a new CPU speed up our system?
  – How does IPv6 affect Web server performance?
• But the component may not be directly related to performance
  – So be careful, do ANOVA (analysis of variance), and don't extrapolate too much

Service Testing
• May be possible to isolate interfaces to just one component
  – E.g., instruction mix for CPU
• Consider services provided and used by that component
• System often has layers of services
  – Can cut at any point and insert a workload

Characterizing a Service
• Identify the service provided by a major subsystem
• List factors affecting performance
• List metrics that quantify demands and performance
• Identify the workload provided to that service

Example: Web Server
• Layered services, each driven by the workload from the layer above:
  – Web Page Visits → Web Client
  – TCP/IP Connections → Network
  – HTTP Requests → Web Server
  – Web Page Accesses → File System
  – Disk Transfers → Disk Drive

Web Client Analysis
• Services: visit page, follow hyperlink, display page information
• Factors: page size, number of links, fonts required, embedded graphics, sound
• Metrics: response time (both definitions)
• Workload: a list of pages to be visited and links to be followed

Network Analysis
• Services: connect to server, transmit request, transfer data
• Factors: packet size, protocol used
• Metrics: connection setup time, response latency, achieved bandwidth
• Workload: a series of connections to one or more servers, with data transfer

Web Server Analysis
• Services: accept and validate connection, fetch & send HTTP data
• Factors: network performance, CPU speed, system load, disk subsystem performance
• Metrics: response time, connections served
• Workload: a stream of incoming HTTP connections and requests

File System Analysis
• Services: open file, read file (writing often doesn't matter for a Web server)
• Factors: disk drive characteristics, file system software, cache size, partition size
• Metrics: response time, transfer rate
• Workload: a series of file-transfer requests

Disk Drive Analysis
• Services: read sector, write sector
• Factors: seek time, transfer rate
• Metrics: response time
• Workload: a statistically generated stream of read/write requests (see the sketch below)
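The disk-drive workload above is described as a statistically generated stream of read/write requests. Here is a minimal sketch of such a generator; the distributional choices (exponential interarrival times, uniformly random sectors, a roughly 2:1 read/write mix) and all parameter values are assumptions made for illustration, not values from the slides.

```python
import random

def generate_disk_workload(n_requests=20, num_sectors=1_000_000,
                           mean_interarrival_ms=5.0, read_fraction=0.67,
                           seed=42):
    """Generate a synthetic stream of (time_ms, op, sector) disk requests."""
    rng = random.Random(seed)          # fixed seed keeps the workload repeatable
    t = 0.0
    stream = []
    for _ in range(n_requests):
        t += rng.expovariate(1.0 / mean_interarrival_ms)  # exponential gaps
        op = "read" if rng.random() < read_fraction else "write"
        sector = rng.randrange(num_sectors)               # uniform over the disk
        stream.append((t, op, sector))
    return stream

if __name__ == "__main__":
    for t, op, sector in generate_disk_workload(n_requests=5):
        print(f"{t:8.2f} ms  {op:5s}  sector {sector}")
```

Fixing the random seed keeps the generated stream repeatable, which matters for the repeatability concern raised later under Other Considerations.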
Level of Detail
• Detail trades off accuracy vs. cost
• Highest detail is a complete trace
• Lowest is one request, usually the most common one
• Intermediate approach: weight by frequency
• We will return to this when we discuss workload characterization

Representativeness
• Obviously, the workload should represent the desired application:
  – Arrival rate of requests
  – Resource demands of each request
  – Resource usage profile of the workload over time
• Again, accuracy and cost trade off
• Need to understand whether detail matters

Timeliness
• Usage patterns change over time
  – File sizes grow to match disk size
  – Web pages grow to match network bandwidth
• If using "old" workloads, must be sure user behavior hasn't changed
• Even worse, behavior may change after the test, as a result of installing the new system
  – "Latent demand" phenomenon

Other Considerations
• Loading levels
  – Full capacity
  – Beyond capacity
  – Actual usage
• External components not considered as parameters
• Repeatability of workload

Workload Characterization
• Terminology
• Averaging
• Specifying dispersion
• Single-parameter histograms
• Multi-parameter histograms
• Principal-component analysis
• Markov models
• Clustering

Workload Characterization Terminology
• A user (maybe nonhuman) requests service
  – Also called a workload component or workload unit
• Workload parameters or workload features model or characterize the workload

Selecting Workload Components
• Most important: components should be external, i.e., at the interface of the SUT (system under test)
• Components should be homogeneous
• Should characterize activities of interest to the study

Choosing Workload Parameters
• Select parameters that depend only on the workload (not on the SUT)
• Prefer controllable parameters
• Omit parameters that have no effect on the system, even if important in the real world

Averaging
• The basic character of a parameter is its average value
• Not just the arithmetic mean
• Good for uniform distributions or gross studies

Specifying Dispersion
• Most parameters are non-uniform
• Specifying the variance or standard deviation brings a major improvement over the average alone
• Average and s.d. (or C.O.V.) together allow workloads to be grouped into classes
  – Still ignores the exact distribution

Single-Parameter Histograms
• Make a histogram or kernel density estimate (see the sketch below)
• Fit a probability distribution to the shape of the histogram
• Chapter 27 (not covered in course) lists many useful shapes
• Ignores multiple-parameter correlations

Multi-Parameter Histograms
• Use a 3-D plotting package to show 2 parameters
  – Or plot each datum as a 2-D point and look for "black spots"
• Shows correlations
  – Allows identification of important parameters
• Not practical for 3 or more parameters
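To ground the dispersion and histogram ideas, the sketch below computes the mean, standard deviation, and coefficient of variation (C.O.V. = s.d. / mean) of a single workload parameter and then bins it into a simple text histogram. The sample values (request sizes in KB) are invented for illustration.

```python
import statistics

# Hypothetical samples of one workload parameter: request size in KB.
request_kb = [4, 4, 8, 8, 8, 16, 16, 32, 64, 64, 128, 256, 512, 1024]

mean = statistics.mean(request_kb)
sd = statistics.stdev(request_kb)        # sample standard deviation
cov = sd / mean                          # coefficient of variation
print(f"mean = {mean:.1f} KB, s.d. = {sd:.1f} KB, C.O.V. = {cov:.2f}")

# Single-parameter histogram: count samples falling into fixed-width bins.
bin_width = 128
histogram = {}
for x in request_kb:
    bin_start = (x // bin_width) * bin_width
    histogram[bin_start] = histogram.get(bin_start, 0) + 1

for bin_start in sorted(histogram):
    bar = "*" * histogram[bin_start]
    print(f"[{bin_start:4d}-{bin_start + bin_width - 1:4d}] KB: {bar}")
```

A C.O.V. well above 1, as here, signals that the average alone badly mischaracterizes the parameter, which is the point of the dispersion slide.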
Principal-Component Analysis (PCA)
• How to analyze more than 2 parameters?
• Could plot endless pairs
  – Still might not show complex relationships
• Principal-component analysis solves the problem mathematically
  – Rotates the parameter set to align with the axes
  – Sorts the axes by importance

Advantages of PCA
• Handles more than two parameters
• Insensitive to scale of original data
• Detects dispersion
• Combines correlated parameters into a single variable
• Identifies variables by importance

Disadvantages of PCA
• Tedious computation (if no software)
• Still requires hand analysis of the final plotted results
• Often difficult to relate results back to the original parameters

Markov Models
• Sometimes, the distribution isn't enough
• Requests come in sequences
• Sequencing affects performance
• Example: disk bottleneck
  – Suppose jobs need 1 disk access per CPU slice
  – A CPU slice is much faster than a disk access
  – Strict alternation uses the CPU better
  – Long strings of disk accesses slow the system

Introduction to Markov Models
• Represent the model as a state diagram
• Probabilistic transitions between states
• Requests generated on transitions
• [State diagram: CPU, Network, and Disk states; the transition probabilities are those shown in the matrix on the next slide]

Creating a Markov Model
• Observe a long string of activity
• Use a matrix to count pairs of states
• Normalize rows to sum to 1.0

              CPU   Network   Disk
    CPU       0.0     0.6     0.4
    Network   0.3     0.4     0.3
    Disk      0.8     0.0     0.2

Example Markov Model
• Reference string of opens, reads, closes: ORORRCOORCRRRRCC
• Pairwise frequency matrix:

              Open   Read   Close   Sum
    Open        1      3      0      4
    Read        1      4      3      8
    Close       1      1      1      3

Markov Model for I/O String
• Divide each row by its sum to get the transition matrix:

              Open   Read   Close
    Open      0.25   0.75   0.00
    Read      0.13   0.50   0.37
    Close     0.33   0.33   0.34

• Model: [state diagram with Open, Read, and Close states labeled with these transition probabilities]

Clustering
• Often useful to break the workload into categories
• A "canonical example" of each category can be used to represent all samples
• If there are many samples, generating categories is difficult
• Solution: clustering algorithms

Steps in Clustering
• Select a sample
• Choose and transform parameters
• Drop outliers
• Scale observations
• Choose a distance measure
• Do the clustering
• Use results to adjust parameters, repeat
• Choose representative components

Selecting a Sample
• Clustering algorithms are often slow
  – Must use a subset of all observations
• Can test the sample after clustering: does every observation fit into some cluster?
• Sampling options
  – Random
  – Heaviest users of the component under study

Choosing and Transforming Parameters
• Goal is to limit the complexity of the problem
• Concentrate on parameters with high impact and high variance
  – Use principal-component analysis (see the sketch below)
  – Drop a parameter, re-cluster, see if the result differs
• Consider transformations such as those in Sec. 15.4 (logarithms, etc.)
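Since the slide above points back to principal-component analysis, here is a minimal numpy sketch of the rotate-and-rank idea: standardize the parameters, take the eigenvectors of the covariance matrix of the standardized data, and order them by the variance they explain. The three-parameter data set is invented for illustration.

```python
import numpy as np

# Hypothetical workload observations: columns are CPU time, disk I/Os, bytes sent.
X = np.array([
    [2.1,  40, 1200],
    [1.9,  35, 1100],
    [8.0, 160, 5000],
    [7.5, 150, 4800],
    [4.0,  80, 2500],
    [4.2,  85, 2600],
], dtype=float)

# Standardize each parameter to zero mean and unit variance (otherwise PCA is
# dominated by whichever parameter happens to have the largest numeric range).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized data (proportional to the correlation
# matrix of the original parameters), then its eigen-decomposition.
cov = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort components by decreasing eigenvalue, i.e., by variance explained.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()
for i, (frac, vec) in enumerate(zip(explained, eigenvectors.T), start=1):
    print(f"component {i}: {frac:.1%} of variance, direction {vec}")

# Observations expressed in principal-component coordinates.
scores = Z @ eigenvectors
```

With these made-up numbers the three parameters move together, so the first component captures almost all the variance, illustrating how PCA combines correlated parameters into a single variable.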
Dropping Outliers
• Must get rid of observations that would skew results
  – Need great judgment here
  – No firm guidelines
• Drop things that you know are "unusual"
• Keep things that consume major resources
  – E.g., daily backups

Scale Observations
• Cluster analysis is often sensitive to parameter ranges, so scaling affects results
• Options:
  – Scale to zero mean and unit variance
  – Weight based on importance or variance
  – Normalize range to [0, 1]
  – Normalize 95% of data to [0, 1]

Choosing a Distance Measure
• Endless possibilities available
• Represent observations as vectors in k-space
• Popular measures include:
  – Euclidean distance, weighted or unweighted
  – Chi-squared distance
  – Rectangular distance

Clustering Methods
• Many algorithms available
• Computationally expensive (NP-hard)
• Can be simple or hierarchical
• Many require you to specify the number of desired clusters
• Minimum spanning tree is not the only option!

Minimum Spanning Tree Clustering
• Start with each point in its own cluster
• Repeat until a single cluster remains:
  – Compute the centroid of each cluster
  – Compute intercluster distances
  – Find the smallest distance
  – Merge the clusters with the smallest distance
• Method produces stable results
  – But not necessarily the optimum

K-Means Clustering
• One of the most popular methods (see the sketch at the end of this section)
• Number of clusters is an input parameter
• First randomly assign points to clusters
• Repeat until no change:
  – Calculate the center (mean of each coordinate) of each cluster
  – Assign each point to the cluster with the nearest center

Interpreting Clusters
• Art, not science
• Drop small clusters (if they have little impact on performance)
• Try to find meaningful characterizations
• Choose representative components
  – Number proportional to cluster size or to total resource demands

Drawbacks of Clustering
• Clustering is basically an AI problem
• Humans will often see patterns where the computer sees none
• The result is extremely sensitive to:
  – Choice of algorithm
  – Parameters of the algorithm
  – Minor variations in the points clustered
• Results may not have functional meaning
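To close, here is a minimal sketch of the k-means procedure as the K-Means Clustering slide states it: randomly assign points to k clusters, then alternate between computing cluster centers and reassigning each point to the nearest center until nothing changes. The two-dimensional sample points and the choice k = 2 are illustrative assumptions; real studies would typically scale the parameters first and often use a library implementation.

```python
import math
import random

def kmeans(points, k, seed=0, max_iters=100):
    """Plain k-means on a list of (x, y) points; returns (assignments, centers)."""
    rng = random.Random(seed)
    # Start by randomly assigning points to clusters (as on the slide).
    assign = [rng.randrange(k) for _ in points]
    for _ in range(max_iters):
        # Compute the centroid (mean of each coordinate) of each cluster.
        centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if not members:                      # guard against an empty cluster
                members = [rng.choice(points)]
            centers.append((sum(x for x, _ in members) / len(members),
                            sum(y for _, y in members) / len(members)))
        # Reassign each point to the cluster with the nearest center.
        new_assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_assign == assign:                 # stop when nothing changes
            break
        assign = new_assign
    return assign, centers

if __name__ == "__main__":
    pts = [(1, 1), (1.5, 2), (2, 1.2), (8, 8), (8.5, 9), (9, 8.2)]
    labels, centers = kmeans(pts, k=2)
    print("labels:", labels)
    print("centers:", centers)
```

As the Scale Observations slide warns, unscaled parameters let one dimension dominate the distance measure, so scaling normally precedes a run like this.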