Performance Evaluation Course
Manijeh Keshtgary
Shiraz University of Technology
Spring 1392

Goals of This Course
- A comprehensive course on performance analysis
- Includes measurement, statistical modeling, experimental design, simulation, and queuing theory
- How to avoid common mistakes in performance analysis
- Graduate course (advanced topics): a lot of independent reading, plus a project/survey paper

Text Books
Required: Raj Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, John Wiley and Sons, Inc., New York, NY, 1991. ISBN 0471503363

Grading
- Midterm: 40%
- Final: 40%
- HW & Paper: 20%

Objectives: What You Will Learn
- Specifying performance requirements
- Evaluating design alternatives
- Comparing two or more systems
- Determining the optimal value of a parameter (system tuning)
- Finding the performance bottleneck (bottleneck identification)
- Characterizing the load on the system (workload characterization)
- Determining the number and sizes of components (capacity planning)
- Predicting the performance at future loads (forecasting)

What Is Performance Evaluation?
- Performance evaluation is about quantifying the service delivered by a computer or communication system.
- For example, we might be interested in comparing the power consumption of several server farm configurations.
- It is important to carefully define the load and the metric, and to be aware of the goals of the evaluation.

Introduction
- Computer system users, administrators, and designers are all interested in performance evaluation, since their goal is to obtain or provide the highest performance at the lowest cost.
- A system can be any collection of hardware, software, and firmware components, e.g., a CPU, a database system, or a network.
- Computer performance evaluation is of vital importance in the selection of computer systems, the design of applications and equipment, and the analysis of existing systems.
Basic Terms
- System: any collection of hardware, software, and firmware components
- Metrics: the criteria used to evaluate the performance of the system
- Workloads: the requests made by the users of the system

Main Parts of the Course
- Part I: An Overview of Performance Evaluation
- Part II: Measurement Techniques and Tools
- Part III: Probability Theory and Statistics
- Part IV: Experimental Design and Analysis
- Part V: Simulation
- Part VI: Queuing Theory

Part I: An Overview of Performance Evaluation
- Introduction
- Common Mistakes and How to Avoid Them
- Selection of Techniques and Metrics

Part II: Measurement Techniques and Tools
- Types of Workloads
- Popular Benchmarks
- The Art of Workload Selection
- Workload Characterization Techniques
- Monitors
- Accounting Logs
- Monitoring Distributed Systems
- Load Drivers
- Capacity Planning
- The Art of Data Presentation

Part III: Probability Theory and Statistics
- Probability and Statistics Concepts
- Four Important Distributions
- Summarizing Measured Data by a Single Number
- Summarizing the Variability of Measured Data
- Graphical Methods to Determine Distributions of Measured Data
- Sample Statistics
- Confidence Intervals
- Comparing Two Alternatives
- Measures of Relationship
- Simple Linear Regression Models

Part IV: Experimental Design and Analysis
- Introduction to Experimental Design
- One-Factor Experiments

Part V: Simulation
- Introduction to Simulation
- Types of Simulations
- Model Verification and Validation
- Analysis of Simulation Results
- Random-Number Generation
- Testing Random-Number Generators
- Commonly Used Distributions

Part VI: Queuing Theory
- Introduction to Queuing Theory
- Analysis of a Single Queue
- Queuing Networks
- Operational Laws
- Mean Value Analysis and Related Techniques
- Advanced Techniques

Purpose of Evaluation
Three general purposes of performance evaluation:
- Selection evaluation: the system exists elsewhere
- Performance projection: the system does not yet exist
- Performance monitoring: the system is in operation

Selection Evaluation
- Performance is included as a major criterion in the decision to obtain a particular system from a vendor; this is the most frequent case
- To determine which of the available alternatives is suitable for a given application
- To choose according to some specified selection criteria
- At least one prototype of the proposed system must exist

Performance Projection
- Oriented toward designing a new system: estimating the performance of a system that does not yet exist
- Secondary goal: projecting the performance of an existing system on a new workload, i.e., modifying the system to increase its performance, decrease its costs, or both (tuning)
- Upgrading of a system: replacement or addition of one or more hardware components

Performance Monitoring
- Usually performed for a substantial portion of the lifetime of an existing, running system
- Performance monitoring is done:
  - to detect bottlenecks
  - to predict future capacity shortcomings
  - to determine the most cost-effective way to upgrade the system to overcome performance problems and to cope with increasing workload demands

Evaluation Metrics
A computer system, like any other engineering machine, can be measured and evaluated in terms of how well it meets the needs and expectations of its users. It is desirable to evaluate the performance of a computer system because we want to make sure that it is suitable for its intended applications and that it satisfies the given efficiency and reliability requirements.
We also want to operate the computer system near its optimal level of processing power under the given resource constraints.

Three Basic Issues
All performance measures deal with three basic issues:
1. How quickly a given task can be accomplished,
2. How well the system can deal with failures and other unusual situations, and
3. How effectively the system uses the available resources.

Performance Measures
We can categorize performance measures as follows:
- Responsiveness
- Usage level
- Missionability
- Dependability
- Productivity

Responsiveness
These measures are intended to evaluate how quickly a given task can be accomplished by the system. Possible measures are waiting time, processing time, queue length, etc.

Usage Level and Missionability
- Usage level: these measures are intended to evaluate how well the various components of the system are being used. Possible measures are throughput and the utilization of various resources.
- Missionability: these measures indicate whether the system would remain continuously operational for the duration of a mission. Possible measures are interval availability (the probability that the system keeps performing satisfactorily throughout the mission time) and lifetime (the time at which the probability of unacceptable behavior rises beyond some threshold). These measures are useful when repair/tuning is impractical or when unacceptable behavior may be catastrophic.

Dependability
These measures indicate how reliable the system is over the long run. Possible measures are the number of failures per day, MTTF (mean time to failure), MTTR (mean time to repair), long-term availability, and the cost of a failure. These measures are useful when repairs are possible and failures are tolerable.
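To make the long-term availability measure above concrete, here is a small calculation sketch. The MTTF and MTTR values are made-up example numbers, not data from the slides; the relation used is the standard one, availability = MTTF / (MTTF + MTTR).

```python
# Illustrative only: assumed MTTF/MTTR values, standard availability formula.
mttf_hours = 2000.0   # mean time to failure (assumed example value)
mttr_hours = 4.0      # mean time to repair (assumed example value)

availability = mttf_hours / (mttf_hours + mttr_hours)
failures_per_year = 8760.0 / (mttf_hours + mttr_hours)  # one failure per MTTF+MTTR cycle, on average

print(f"long-term availability = {availability:.4f}, "
      f"about {failures_per_year:.1f} failures per year")
```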
Productivity
These measures indicate how effectively a user can get his or her work accomplished. Possible measures are user friendliness, maintainability, and understandability.

Which Measures for What System?
The relative importance of the various measures depends on the application involved. The following is a broad classification of computer systems according to application domain, indicating which measures are most relevant:
1. General purpose computing: These systems are designed for general purpose problem solving. Relevant measures are responsiveness, usage level, and productivity.
2. High availability: Such systems are designed for transaction processing environments (bank, airline, or telephone databases; switching systems; etc.). The most important measures are responsiveness and dependability. Both requirements are more severe than for general purpose computing systems; moreover, any data corruption or destruction is unacceptable. Productivity is also an important factor.
3. Real-time control: Such systems must respond to both periodic and randomly occurring events within some (possibly hard) timing constraints. They require high levels of responsiveness and dependability for most workloads and failure types and are therefore significantly overdesigned. Utilization and throughput play little role in such systems.
4. Mission oriented: These systems require extremely high levels of reliability over a short period, called the mission time. Little or no repair/tuning is possible during the mission. Such systems include fly-by-wire airplanes, battlefield systems, and spacecraft. Responsiveness is also important, but usually not difficult to achieve. Such systems may try to achieve high reliability during the short term at the expense of poor reliability beyond the mission period.
5. Long-life: Systems like those used in unmanned spaceships need a long life without provision for manual diagnostics and repairs. Thus, in addition to being highly dependable, they should have considerable intelligence built in to do diagnostics and repair either automatically or by remote control from a ground station. Responsiveness is important but not difficult to achieve.

1.2 Techniques of Performance Evaluation
1. Measurement
2. Simulation
3. Analytic modeling
The latter two techniques can also be combined to produce what is usually known as hybrid modeling.

Measurement
Measurement is the most fundamental technique and is needed even in analysis and simulation to calibrate the models. Some measurements are best done in hardware, some in software, and some in a hybrid manner.

Simulation Modeling
Simulation involves constructing a model of the behavior of the system and driving it with an appropriate abstraction of the workload. The major advantage of simulation is its generality and flexibility: almost any behavior can be easily simulated. However, there are several important issues that must be considered in simulation:
1. It must be decided what not to simulate and at what level of detail. Simply duplicating the detailed behavior of the system is usually unnecessary and prohibitively expensive.
2. Simulation, like measurement, generates much raw data, which must be analyzed using statistical techniques.
3. As with measurement, a careful experimental design is essential to keep the simulation cost down.

Analytic Modeling
Analytic modeling involves constructing a mathematical model of the system behavior (at the desired level of detail) and solving it. The main difficulty is that the domain of tractable models is rather limited. Thus, analytic modeling will fail if the objective is to study the behavior in great detail. However, for an overall characterization of behavior, analytic modeling is an excellent tool.

Advantages of Analytic Modeling over the Other Two Techniques
- It generates good insight into the workings of the system, which is valuable even if the model is too difficult to solve.
- Simple analytic models can usually be solved easily, yet provide surprisingly accurate results.
- Results from analysis have better predictive value than those obtained from measurement or simulation.

Hybrid Modeling
A complex model may consist of several submodels, each representing a certain aspect of the system. Only some of these submodels may be analytically tractable; the others must be simulated. For example, the fraction of memory wasted due to fragmentation may be difficult to estimate analytically, even though other aspects of the system can be modeled analytically.

Hybrid Model (cont.)
In such a case we can take the hybrid approach, which proceeds as follows (a sketch of this iteration is given below):
1. Solve the analytic model assuming no fragmentation of memory and determine the distribution of memory holding time.
2. Simulate only memory allocation, holding, and deallocation, and determine the average fraction of memory that could not be used because of fragmentation.
3. Recalibrate the analytic model of step 1 with the reduced memory and solve it.
(It may be necessary to repeat these steps a few times to get convergence.)
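The three steps above form a simple fixed-point iteration between the two submodels. The sketch below is only illustrative: solve_analytic and simulate_fragmentation are hypothetical stand-ins (toy formulas) for the analytic and simulation submodels, and only the structure of the loop is the point.

```python
# Hedged sketch of the hybrid (analytic + simulation) iteration, with toy submodels.

def solve_analytic(usable_memory_mb):
    """Step 1 stand-in: mean memory holding time (s) assuming no fragmentation."""
    return 2.0 * (1024.0 / usable_memory_mb)      # illustrative relation only

def simulate_fragmentation(mean_holding_time_s):
    """Step 2 stand-in: fraction of memory lost to fragmentation."""
    return min(0.4, 0.05 * mean_holding_time_s)   # illustrative relation only

total_memory_mb = 1024.0
usable_mb = total_memory_mb                       # first pass assumes no fragmentation
for iteration in range(20):
    holding_time = solve_analytic(usable_mb)               # step 1: analytic model
    wasted = simulate_fragmentation(holding_time)          # step 2: simulation
    new_usable = total_memory_mb * (1.0 - wasted)          # step 3: recalibrate
    if abs(new_usable - usable_mb) < 1e-3:                 # repeat until convergence
        break
    usable_mb = new_usable

print(f"converged after {iteration + 1} iterations: usable = {usable_mb:.1f} MB, "
      f"wasted = {wasted:.1%}")
```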
1.3 Applications of Performance Evaluation
- System design
- System selection
- System upgrade
- System tuning
- System analysis

System Design
In designing a new system, one typically starts out with certain performance/reliability objectives and a basic system architecture, and then decides how to choose the various parameters to achieve those objectives. This involves constructing a model of the system behavior at the appropriate level of detail and evaluating it to choose the parameters. An analytic model is adequate if we just want to eliminate bad choices; simulation is used if we need more detail.

System Selection
Here the problem is to select the "best" system from among a group of systems that are under consideration for reasons of cost, availability, compatibility, etc. Although direct measurement is the ideal technique to use here, there may be practical difficulties in doing so (e.g., not being able to use the systems under realistic workloads, or not having a system available locally). Therefore, it may be necessary to make projections based on available data and some simple modeling.

System Upgrade
This involves replacing either the entire system or parts thereof with a newer but compatible unit. Compatibility and cost considerations may dictate the vendor, so the only remaining problem is to choose the quantity, speed, and the like. Often, analytic modeling is adequate here; however, in large systems involving complex interactions between subsystems, simulation modeling may be essential.

System Tuning
The purpose of tuning is to optimize performance by appropriately changing the various resource management policies. Some examples are the process scheduling mechanism, context switching, buffer allocation schemes, the cluster size for paging, and contiguity in file space allocation. It is necessary to decide which parameters to consider changing and how to change them to get the maximum potential benefit. Direct experimentation is the simplest technique to use here, but it may not be feasible in a production environment. An analytic model usually cannot capture the effect of such policy changes, so simulation is better.

System Analysis
Suppose that we find a system to be unacceptably sluggish. The reason could be either inadequate hardware resources (CPU, memory, disk, etc.) or poor system management. In the former case we need a system upgrade, and in the latter a system tune-up. Either way, the first task is to determine which of the two cases applies. This involves monitoring the system and examining the behavior of the various resource management policies under different loading conditions. Experimentation coupled with simple analytic reasoning is usually adequate to identify the trouble spots; however, in some cases complex interactions may make a simulation study essential.

Performance Evaluation Metrics
Performance metrics can be categorized into three classes based on their utility function:
- Higher is Better (HB), e.g., throughput
- Lower is Better (LB), e.g., response time
- Nominal is Best (NB), e.g., utilization
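As a small illustration of these three utility classes, the sketch below picks the preferred of two measured values for each class. The nominal target used in the NB case (0.7 utilization) is an assumed example value, not something given on the slides.

```python
# Minimal sketch of HB / LB / NB metric comparison.

def better(kind, x, y, nominal=None):
    if kind == "HB":                      # higher is better, e.g. throughput
        return max(x, y)
    if kind == "LB":                      # lower is better, e.g. response time
        return min(x, y)
    if kind == "NB":                      # nominal is best, e.g. utilization
        return min((x, y), key=lambda v: abs(v - nominal))
    raise ValueError(kind)

print(better("HB", 120, 150))                 # 150: higher throughput wins
print(better("LB", 2.0, 3.5))                 # 2.0: lower response time wins
print(better("NB", 0.5, 0.95, nominal=0.7))   # 0.5: closer to the assumed 0.7 target
```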
Outline
- Objectives: what kind of problems will you be able to solve after taking this course?
- The Art
- Common Mistakes
- Systematic Approach
- Case Study

Objectives (1 of 6)
Select appropriate evaluation techniques, performance metrics, and workloads for a system.
- Techniques: measurement, simulation, analytic modeling
- Metrics: criteria used to study performance (e.g., response time)
- Workloads: requests made by users/applications to the system
Example: What performance metrics should you use for the following systems?
a) Two disk drives
b) Two transaction processing systems
c) Two packet retransmission algorithms

Objectives (2 of 6)
Conduct performance measurements correctly. Two tools are needed:
- Load generator: a tool to load the system
- Monitor: a tool to measure the results
Example: Which type of monitor (software or hardware) would be more suitable for measuring each of the following quantities?
a) Number of instructions executed by a processor
b) Degree of multiprogramming on a timesharing system
c) Response time of packets on a network

Objectives (3 of 6)
Use proper statistical techniques to compare several alternatives.
- Find the best among a number of alternatives
- One run of a workload is often not sufficient: many non-deterministic events affect performance
- Comparing the averages of several runs may also not lead to correct results, especially if the variance is high
Example: Packets lost on a link. Which link is better?

File Size | Link A | Link B
1000      | 5      | 10
1200      | 7      | 3
1300      | 3      | 0
50        | 0      | 1
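One standard way to answer this kind of question (comparing two alternatives is covered in Part III of the course) is a paired comparison: compute the per-row differences and a confidence interval for their mean; if the interval includes zero, neither link can be declared better. The sketch below applies this to the table above. The 90% confidence level and the t-based interval are choices made for this example, not something stated on the slide.

```python
# Paired comparison of the packet-loss data above (sketch).
import statistics
from math import sqrt

link_a = [5, 7, 3, 0]
link_b = [10, 3, 0, 1]

diffs = [a - b for a, b in zip(link_a, link_b)]   # [-5, 4, 3, -1]
n = len(diffs)
mean = statistics.mean(diffs)                      # 0.25
sd = statistics.stdev(diffs)                       # sample standard deviation
t_90_3df = 2.353                                   # t quantile for a 90% CI with 3 degrees of freedom
half_width = t_90_3df * sd / sqrt(n)

print(f"mean difference = {mean:.2f}, "
      f"90% CI = ({mean - half_width:.2f}, {mean + half_width:.2f})")
# The interval includes zero, so this data does not show one link to be better.
```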
Objectives (4 of 6)
Design measurement and simulation experiments to provide the most information with the least effort.
- There are often many factors that affect performance; separate out the effects of the individual factors.
Example: The performance of a system depends on three factors:
A) garbage collection technique: G1, G2, or none
B) type of workload: editing, compiling, AI
C) type of CPU: P2, P4, Sparc
How many experiments are needed? How can the effect of each factor be estimated?

Objectives (5 of 6)
Perform simulations correctly.
- Select the correct language, seeds for random numbers, length of the simulation run, and analysis
- Before all of that, you may need to validate the simulator
Example: To compare the performance of two cache replacement algorithms:
A) What type of simulation model should be used?
B) How long should the simulation be run?
C) What can be done to get the same accuracy with a shorter run?
D) How can one decide whether the random-number generator in the simulation is a good generator?

Objectives (6 of 6)
Use simple queuing models to analyze the performance of systems.
- Queuing models are commonly used for analytic modeling of computer systems
- A computer system can often be modeled by the arrival rate of the load and the service rate, possibly with multiple servers and multiple queues
Example: The average response time of a database system is 3 seconds. During a 1-minute observation interval, the idle time on the system was 10 seconds. Using a queuing model for the system, determine the system utilization, the average service time per query, the number of queries completed during the observation interval, the average number of jobs in the system, and so on.
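A worked version of this example might look like the sketch below. It uses the utilization law and Little's law; the service-time step additionally assumes an M/M/1 queue, since the slide does not name a specific queuing model.

```python
# Worked sketch of the database example above (operational laws + M/M/1 assumption).
T = 60.0          # observation interval (s)
idle = 10.0       # measured idle time (s)
R = 3.0           # average response time (s)

U = (T - idle) / T        # utilization = busy time / total time = 0.833
S = R * (1 - U)           # M/M/1 assumption: R = S / (1 - U)  =>  S = 0.5 s per query
X = U / S                 # utilization law: U = X * S  =>  X = 1.667 queries/s
C = X * T                 # queries completed during the interval = 100
N = X * R                 # Little's law: mean number of jobs in the system = 5

print(f"U = {U:.3f}, S = {S:.2f} s, X = {X:.2f}/s, C = {C:.0f} queries, N = {N:.1f} jobs")
```

Only the service-time step depends on the M/M/1 assumption; the utilization, throughput, and Little's law steps hold for any queuing model under the operational laws.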
Outline: Objectives, The Art, Common Mistakes, Systematic Approach, Case Study

The Art of Performance Evaluation
- An evaluation cannot be produced mechanically; it requires intimate knowledge of the system and careful selection of methodology, workload, and tools
- There is no single correct answer: two performance analysts may choose different metrics or workloads
- Like any art, there are techniques, and one must learn how to use them and when to apply them

Example: Comparing Two Systems
Two systems, two workloads; we measure transactions per second.

System | Workload 1 | Workload 2
A      | 20         | 10
B      | 10         | 20

Which is better?

Example: Comparing Two Systems (cont.)
Adding the averages:

System | Workload 1 | Workload 2 | Average
A      | 20         | 10         | 15
B      | 10         | 20         | 15

They are equally good! … but is A better than B?

The Ratio Game
Take system B as the base:

System | Workload 1 | Workload 2 | Average
A      | 2          | 0.5        | 1.25
B      | 1          | 1          | 1

A is better! … but is B better than A?

The Ratio Game (cont.)
Take system A as the base:

System | Workload 1 | Workload 2 | Average
A      | 1          | 1          | 1
B      | 0.5        | 2          | 1.25

B is better!?
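The ratio game above is easy to reproduce. The small sketch below normalizes the same throughput numbers to each system in turn and shows how averaging the ratios flips the "winner" with the choice of base.

```python
# Reproducing the ratio game: the winner depends on which system is the base.
A = {"W1": 20, "W2": 10}   # transactions/s, system A
B = {"W1": 10, "W2": 20}   # transactions/s, system B

def mean_ratio(sys, base):
    return sum(sys[w] / base[w] for w in sys) / len(sys)

print("base B:", mean_ratio(A, B), mean_ratio(B, B))   # 1.25 vs 1.0  -> "A is better"
print("base A:", mean_ratio(A, A), mean_ratio(B, A))   # 1.0 vs 1.25 -> "B is better"
```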
Outline: Objectives, The Art, Common Mistakes, Systematic Approach, Case Study

Common Mistakes (1-4)
1. Undefined goals (don't shoot and then draw the target): there is no such thing as a general model; describe the goals and then design the experiments.
2. Biased goals (performance analysis is like a jury): don't set out to show that YOUR system is better than HERS.
3. Unsystematic approach: arbitrary selection of system parameters, factors, metrics, and workloads will lead to inaccurate conclusions.
4. Analysis without understanding the problem ("a problem well stated is half solved"): don't rush to modeling before defining the problem.

Common Mistakes (5-8)
5. Incorrect performance metrics, e.g., MIPS.
6. Unrepresentative workload: the wrong workload will lead to inaccurate conclusions.
7. Wrong evaluation technique (don't have a hammer and see everything as a nail): use the most appropriate of modeling, simulation, and measurement.
8. Overlooking important parameters: start from a complete list of the system and workload parameters that affect performance.

Common Mistakes (9-12)
9. Ignoring significant factors: parameters that are varied are called factors; the others are fixed. Identify the parameters that, when varied, have a significant impact on performance.
10. Inappropriate experimental design: relates to the number of measurement or simulation experiments to be conducted.
11. Inappropriate level of detail: it can be too much (e.g., modeling a disk in detail) or too little (e.g., an analytic model for a congested router).
12. No analysis: having a measurement expert is desirable but not enough; expertise in analyzing the results is crucial.

Common Mistakes (13-16)
13. Erroneous analysis, e.g., taking averages over simulations that are too short.
14. No sensitivity analysis: the analysis is evidence, not fact; determine how sensitive the results are to the settings.
15. Ignoring errors in input: often the parameters of interest cannot be measured and are instead estimated from other variables; adjust the level of confidence in the model output accordingly.
16. Improper treatment of outliers: outliers are values that are too high or too low compared to the majority of values; if they are possible in real systems or workloads, do not ignore them.

Common Mistakes (17-20)
17. Assuming no change in the future: the workload may change.
18. Ignoring variability: if variability is high, the mean performance alone may be misleading.
19. Too complex an analysis: a simpler, easier-to-explain analysis should be preferred.
20. Improper presentation of results: what matters is not the number of graphs, but the number of graphs that help make decisions.

Common Mistakes (21-22)
21. Ignoring social aspects: writing and speaking are social skills.
22. Omitting assumptions and limitations: e.g., assuming most traffic is TCP whereas some links carry significant UDP traffic; this may lead to applying the results where the assumptions do not hold.

Checklist for Avoiding Common Mistakes in Performance Evaluation
1. Is the system correctly defined and are the goals clearly stated?
2. Are the goals stated in an unbiased manner?
3. Have all the steps of the analysis been followed systematically?
4. Is the problem clearly understood before analyzing it?
5. Are the performance metrics relevant for this problem?
6. Is the workload correct for this problem?
7. Is the evaluation technique appropriate?
8. Is the list of parameters that affect performance complete?
9. Have all the parameters that affect performance been chosen as factors to be varied?
10. Is the experimental design efficient in terms of time and results?
11. Is the level of detail proper?
12. Is the measured data presented with analysis and interpretation?
13. Is the analysis statistically correct?
14. Has the sensitivity analysis been done?
15. Would errors in the input cause an insignificant change in the results?
16. Have the outliers in the input or the output been treated properly?
17. Have future changes in the system and workload been modeled?
18. Has the variance of the input been taken into account?
19. Has the variance of the results been analyzed?
20. Is the analysis easy to explain?
21. Is the presentation style suitable for its audience?
22. Have the results been presented graphically as much as possible?
23. Are the assumptions and limitations of the analysis clearly documented?

Outline: Objectives, The Art, Common Mistakes, Systematic Approach, Case Study

A Systematic Approach
1. State the goals and define the system boundaries
2. List the services and outcomes
3. Select performance metrics
4. List system and workload parameters
5. Select factors and their values
6. Select the evaluation technique
7. Select the workload
8. Design the experiments
9. Analyze and interpret the data
10. Present the results. Repeat.

State Goals and Define Boundaries
- Just "measuring performance" or "seeing how it works" is too broad; e.g., the goal may be to decide which ISP provides better throughput
- The definition of the system may depend on the goals: if measuring CPU instruction speed, the system may be CPU + cache; if measuring response time, the system may include CPU + memory + ... + OS + user workload

List Services and Outcomes
- List the services provided by the system: a computer network allows users to send packets to specified destinations; a database system responds to queries; a processor performs a number of tasks
- A user request for any of these services results in a number of possible outcomes, desirable or not: a database system may answer correctly, incorrectly (due to inconsistent updates), or not at all (due to deadlocks)

Select Metrics
- Criteria to compare performance; in general, related to the speed, accuracy, and/or availability of system services
- E.g., network performance: speed (throughput and delay), accuracy (error rate), availability (data packets sent do arrive)
- E.g., processor performance: speed (time to execute instructions)

List Parameters
- List all parameters that affect performance
- System parameters (hardware and software), e.g., CPU type, OS type, ...
- Workload parameters, e.g., number of users, type of requests
- The list may not be complete initially, so keep a working list and let it grow as the study progresses

Select Factors to Study
- Divide the parameters into those that will be studied and those that will not; e.g., vary the CPU type but fix the OS type, or fix the packet size but vary the number of connections
- Select appropriate levels for each factor: typical values plus ones with potentially high impact; for the workload, often a smaller (1/2 or 1/10th) and a larger (2x or 10x) range
- Start small, or the number of experiments can quickly exceed the available resources!

Select Evaluation Technique
- Depends on the time, resources, and desired level of accuracy
- Analytic modeling: quick, less accurate
- Simulation: medium effort, medium accuracy
- Measurement: typically the most effort, most accurate
- Note that these are typical trade-offs and can be reversed in some cases!

Select Workload
- The set of service requests made to the system; its form depends on the evaluation technique
- An analytic model may use the probabilities of the various requests; a simulation may use a trace of requests from a real system; a measurement may use scripts that impose transactions
- The workload should be representative of real life

Design Experiments
- Aim to maximize the information obtained with minimal effort
- Phase 1: many factors, few levels; see which factors matter
- Phase 2: few factors, more levels; see where the range of impact of those factors lies

Analyze and Interpret Data
- Compare the alternatives, taking the variability of the results into account (statistical techniques)
- Interpret the results: the analysis produces results, not conclusions; different analysts may come to different conclusions

Present Results
- Make the results easily understood: use graphs
- Disseminate (the entire methodology!)
- "The job of a scientist is not merely to see: it is to see, understand, and communicate. Leave out any of these phases, and you're not doing science. If you don't see, but you do understand and communicate, you're a prophet, not a scientist. If you don't understand, but you do see and communicate, you're a reporter, not a scientist. If you don't communicate, but you do see and understand, you're a mystic, not a scientist."
Outline: Objectives, The Art, Common Mistakes, Systematic Approach, Case Study

Case Study
- Consider remote pipes (rpipe) versus remote procedure calls (rpc)
- An rpc is like a procedure call, but the procedure is executed on a remote server; the client (caller) blocks until it returns
- An rpipe is like a pipe, but the server receives the output on a remote machine; the client process can continue (non-blocking), and results are returned asynchronously
- Goal: compare the performance of applications using rpipes to that of similar applications using rpcs

System Definition
- Client, server, and network; the key component is the "channel", either an rpipe or an rpc
- Only the subset of the client and server that handles the channel is part of the system: Client - Network - Server
- Try to minimize the effect of components outside the system

Services
- A variety of services can be carried over an rpipe or an rpc; choose data transfer as a common one, data being a typical result of most client-server interactions
- Classify the amount of data as either large or small
- Thus, two services: small data transfer and large data transfer

Metrics
- Limit the metrics to correct operation only (no failures or errors); study the service rate and the resources consumed
- Performance metrics:
A) elapsed time per call
B) maximum call rate per unit time
C) local CPU time per call
D) remote CPU time per call
E) number of bytes sent per call

Parameters
- System parameters: speed of the CPUs (local and remote), network speed and reliability (retransmissions), operating system overhead for interfacing with the channels, operating system overhead for interfacing with the network
- Workload parameters: time between calls, number and sizes of the call parameters and results, type of channel (rpc or rpipe), other loads on the CPUs and on the network

Key Factors
A) Type of channel: rpipe or rpc
B) Speed of the network: choose a short-haul (LAN) and an across-country (WAN) network
C) Size of the parameters: small or large
D) Number of consecutive calls: 11 values: 8, 16, 32, ..., 1024
E) All other parameters are fixed (note: try to run during "light" network load)

Evaluation Technique
- Since prototypes exist, use measurement
- Use analytic modeling based on the measured data for values outside the scope of the experiments conducted

Workload
- A synthetic program generating the specified channel requests; it also monitors the resources consumed and logs the results
- Use "null" channel requests to get a baseline for the resources consumed by the logging itself
- Compare with the Heisenberg uncertainty principle in physics: "the measurement of position necessarily disturbs a particle's momentum, and vice versa—i.e., the uncertainty principle is a manifestation of the observer effect"

Experimental Design
- Full factorial design (all possible combinations of the factor levels): 2 channels x 2 network speeds x 2 sizes x 11 numbers of calls = 2 x 2 x 2 x 11 = 88 experiments (a short enumeration sketch is given at the end of these notes)

Data Analysis
- Analysis of variance will be used to quantify the effects of the first three factors: are they different?
- Regression will be used to quantify the effect of the number n of consecutive calls: is performance linear or exponential in n?

Data Presentation
- The final results will be plotted as a function of the block size n.
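As a closing illustration, the sketch below enumerates the full factorial design described in the Experimental Design slide. The eleven values of the number of calls are placeholders, since the slides list them only as 8, 16, 32, ...

```python
# Enumerating the 2 x 2 x 2 x 11 full factorial design of the case study (sketch).
from itertools import product

channels = ["rpipe", "rpc"]              # factor A: type of channel
networks = ["short", "long"]             # factor B: LAN vs. across-country network
sizes = ["small", "large"]               # factor C: size of parameters
calls = [f"n{i}" for i in range(11)]     # placeholders for the 11 values of n (8, 16, 32, ...)

design = list(product(channels, networks, sizes, calls))
print(len(design))     # 2 * 2 * 2 * 11 = 88 experiments
print(design[0])       # ('rpipe', 'short', 'small', 'n0')
```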