Performance Analysis of Parallel Processing

Dr. Subra Ganesan
Department of Computer Science and Engineering
School of Engineering and Computer Science
Oakland University

CSE664 Project Report, Winter 2000
April 29, 2000

By Ahmad Milhim

Contents

1. Introduction
2. Definitions
   2.1 Performance analysis
   2.2 Performance analysis techniques
   2.3 Performance analysis metrics
3. Performance means
   3.1 Arithmetic mean performance
   3.2 Geometric mean performance
   3.3 Harmonic mean performance
4. Speedup performance laws
   4.1 Asymptotic speedup
   4.2 Harmonic speedup
   4.3 Amdahl's speedup law
5. Workloads
   5.1 Types of test workloads
   5.2 HINT
6. References

1. Introduction

The goal of all computer system users and designers is to get the highest performance at the lowest cost. Basic knowledge of performance evaluation terminology and techniques is a must for computer professionals who need to evaluate their systems. Performance analysis is required at every stage of the life cycle of a computer system, from design through manufacturing, sales, and upgrade. Designers use performance analysis to compare a number of alternative designs and pick the best one; system administrators use it to choose one system from a set of candidates for a given application; and users need to know how well their systems are performing and whether an upgrade is needed.

It is said that performance analysis is an art: every analysis requires an intimate knowledge of the system being evaluated. Computer applications are so numerous and varied that no single standard measure of performance fits all of them.

Three techniques are used for performance analysis: analytical modeling, simulation, and measurement. Several considerations help analysts choose among them. The key consideration is the stage the system is in: if the system is a new concept, analytical modeling and simulation are the only possible techniques, although both can be based on previous measurements of similar systems. Considerations of lesser importance are the time available for the analysis, the tools at hand, the accuracy required, and the cost.

Users of computer systems seek ways to increase the productivity of their systems, in terms of both hardware and programmers, and thereby reduce the cost of computing. Computer throughput was increased by making the operating system handle resource sharing. There has always been a desire to evaluate how well computer systems are performing and to find ways of improving their performance.

Measurement and analysis of parallel processing is a newer field. The analysis of parallel processing helps users and researchers answer key questions about the configuration of parallel systems: What is the best way to configure the processing elements to solve a particular problem? How much overhead is generated by parallelism? What is the effect of varying the number of processors on performance? Is it better to use shared memory or local memory? And so on.

2. Definitions

In this section the most common terms and concepts relating to the performance analysis of systems in general, and of parallel processing in particular, are introduced, along with how these terms apply to computer systems and parallel processing. First the definition of performance analysis is given; then performance techniques and metrics are discussed in more detail.
2.1 Performance analysis

The ideal performance of a computer system is achieved when there is a perfect match between hardware capability and software behavior.

2.2 Performance analysis techniques

Three techniques are used for performance analysis: analytical modeling, simulation, and measurement. The choice among them depends on certain considerations. The most important is the stage at which the analysis is to be performed: measurement cannot be performed unless the system, or at least a similar system, already exists; if the proposed design is a new idea, then only analytical modeling or simulation can be used.

2.2.1 Analytical modeling. Used when the system is in the early design stages and results are needed soon. It provides insight into the underlying system, but it may not be precise because of the simplifications made in the model and its mathematical equations.

2.2.2 Simulation. A useful technique for analysis. Simulation models provide an easy way to predict the performance of computer systems before they exist, and simulation is also used to validate the results of analytical modeling and measurement. A simulation provides snapshots of system behavior.

2.2.3 Measurement. It can be used only if the system is available for measurement (postprototype), or at least if systems similar to the one under design exist. Its cost is high compared to the other two techniques, because hardware and software instrumentation is needed to perform the measurements, and it takes a long time to monitor the system thoroughly.

The most important considerations used to choose among these three techniques are listed, in order of importance, in Figure 1 below.

  Criterion        Analytical modeling   Simulation           Measurement
  ---------        -------------------   ----------           -----------
  Stage            Any                   Any                  Postprototype
  Time required    Small                 Medium               Varies
  Tools            Analysts              Computer languages   Instruments
  Accuracy         Low                   Moderate             Varies
  Cost             Small                 Medium               High

Figure 1. Considerations in order of importance

It is said that the results of one technique should not be trusted until they have been validated by the other two. This is to emphasize that using a single technique may be misleading, or at least may not give accurate results.

2.3 Performance metrics

Each performance study has its own metrics, defined for the specific system under study. Although these metrics vary from one system to another, some common metrics are used in most general computer system evaluations. The most common ones are introduced here.

2.3.1 Response time. The interval between a user's request and the system's response. In general the response time increases as the load increases. Related terms are turnaround time and reaction time: turnaround time is the time elapsed between submitting a job and the completion of its output, while reaction time is the time elapsed between submitting a request and the beginning of its execution by the system. The stretch factor is the ratio of the response time at a particular load to the response time at minimum load.

2.3.2 Throughput. The rate (requests per unit time) at which requests can be serviced by the system. For CPUs the rate is measured in millions of instructions per second (MIPS) or millions of floating-point operations per second (Mflops); for networks it is measured in bits per second (bps) or packets per second; and for transaction systems it is measured in transactions per second (TPS). The sketch below makes these two definitions concrete.
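Here is a minimal sketch in Python (not part of the original report) that computes mean response time, mean reaction time, and throughput from a hypothetical request log; the log entries and timestamps are invented purely for illustration.

    # Minimal sketch (not from the report): the metrics of Sections 2.3.1
    # and 2.3.2 computed from a hypothetical request log. Each entry is
    # (submit, begin, done), timestamps in seconds; values are invented.
    log = [
        (0.0, 0.1, 0.9),
        (0.5, 0.9, 1.6),
        (1.0, 1.6, 2.4),
    ]

    # Response time: interval between the user's request and the response.
    response_times = [done - submit for (submit, begin, done) in log]

    # Reaction time: interval between submission and the start of execution.
    reaction_times = [begin - submit for (submit, begin, done) in log]

    # Throughput: requests serviced per unit time over the observation period.
    period = max(done for (_, _, done) in log) - min(s for (s, _, _) in log)
    throughput = len(log) / period

    print(f"mean response time: {sum(response_times) / len(log):.2f} s")
    print(f"mean reaction time: {sum(reaction_times) / len(log):.2f} s")
    print(f"throughput:         {throughput:.2f} requests/s")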
In general the throughput of a system increases as the load initially increases; after a certain load the throughput stops increasing and in some cases even starts decreasing. The maximum achievable throughput under ideal workload conditions is called the nominal capacity of the system; for computer networks the nominal capacity is the bandwidth, measured in bits per second. The response time at the nominal capacity is often too high, which leads to the definition of a new term, the usable capacity: the maximum throughput achievable without exceeding a pre-specified response time. The optimal operating point is the point after which the response time increases rapidly as a function of load while the gain in throughput is small; this is the knee capacity shown in Figure 2 below.

[Figure 2. Capacity of the system: throughput and response time as functions of load, marking the nominal capacity, usable capacity, and knee capacity.]

2.3.3 System efficiency E(n). Defined as the ratio of the maximum achievable throughput (the usable capacity) to the maximum achievable capacity under ideal workload conditions. It indicates the actual degree of speedup achieved compared with the maximum possible for parallel processing; in other words, it is the ratio of the performance of an n-processor system to that of a single-processor system:

  E(n) = S(n) / n

The minimum efficiency corresponds to the case where the entire program is executed sequentially on a single processor; the maximum efficiency corresponds to the case where all processors are fully utilized throughout the execution period, as illustrated in Figure 3.

[Figure 3. Efficiency as a function of the number of processors.]

2.3.4 Speedup S(n). An indication of the degree of speed gain in parallel computation; it is discussed in more detail later in this paper. There are three different speedup measures: asymptotic speedup, harmonic speedup, and Amdahl's speedup law. Simply put, the speedup is the ratio of the time taken by one processor to execute a program to the time taken by n processors to execute the same program:

  S(n) = T(1) / T(n)

2.3.5 Redundancy R(n). The ratio of the total number of unit operations performed by an n-processor system, O(n), to the total number of unit operations performed by a single-processor system, O(1):

  R(n) = O(n) / O(1)

2.3.6 Utilization U(n). The fraction of time a resource is busy, that is, the ratio of busy time to total time over an observation period; idle time is the time during which a resource is not used. Balancing the load between resources produces more efficient systems when a parallel processing system is being designed. In terms of redundancy and efficiency, the utilization is the system redundancy times its efficiency; it indicates the percentage of resources that were kept busy during the execution of a parallel program:

  U(n) = R(n) * E(n)

2.3.7 Quality Q(n). This metric combines the effects of speedup, efficiency, and redundancy to assess the relative merits of parallel computations. It is directly proportional to the speedup and efficiency and inversely proportional to the redundancy:

  Q(n) = S(n) * E(n) / R(n)

Since the efficiency is always a fraction and the redundancy is at least one, the quality of a parallel computation is upper-bounded by the speedup. The sketch following these definitions ties the last four metrics together numerically.
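A minimal sketch (not from the original report); the timing and operation-count values below are invented for illustration:

    # Minimal sketch of the parallel metrics of Sections 2.3.3-2.3.7,
    # evaluated on invented measurements.
    n = 8          # number of processors
    T1 = 100.0     # execution time on a single processor (seconds)
    Tn = 20.0      # execution time on n processors (seconds)
    O1 = 1.0e9     # unit operations in the single-processor run
    On = 1.2e9     # unit operations in the n-processor run (parallel overhead)

    S = T1 / Tn    # speedup, Section 2.3.4
    E = S / n      # efficiency, Section 2.3.3
    R = On / O1    # redundancy, Section 2.3.5
    U = R * E      # utilization, Section 2.3.6
    Q = S * E / R  # quality, Section 2.3.7

    print(f"S(n)={S:.2f}  E(n)={E:.2f}  R(n)={R:.2f}  U(n)={U:.2f}  Q(n)={Q:.2f}")
    # Since E(n) <= 1 and R(n) >= 1, Q(n) never exceeds S(n).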
2.3.8 Reliability. Measured by the probability of errors or by the mean time between errors.

2.3.9 Availability. The availability of a resource is the fraction of time the resource is available to service users' requests. Two other terms used in system analysis are downtime and uptime: downtime is the period during which the system is not available, and uptime is the period during which it is [1].

2.3.10 Cost/performance ratio. This metric is used to compare two or more systems in terms of their cost and performance. The cost should include the cost of hardware, software, and so on; the performance is measured in terms of throughput for a pre-specified response time.

3. Performance means

To understand the concepts of performance analysis and the significance of the output of the performance techniques, one should understand the general concepts of mathematical performance laws and performance means. In this section performance means are discussed briefly.

3.1 Arithmetic mean performance

Defined as the sum of the execution rates of all programs divided by the number of programs. Symbolically, let Rj be the execution rate of program j, where j = 1, 2, ..., m; then the arithmetic mean performance is

  Ra = (1/m) Σ Rj,  j = 1 to m

This is valid only if all programs have equal weight. If the programs are weighted, the weighted arithmetic mean is

  Ra* = Σ fj Rj,  j = 1 to m

where fj is the weight of program j and the weights sum to one.

3.2 Geometric mean performance

The geometric mean of m values is obtained by multiplying the values together and taking the m-th root of the product. It is used only if the product of the values is a quantity of interest. The geometric mean of the execution rates Rj is

  Rg = (Π Rj)^(1/m),  j = 1 to m

where m is the number of programs. Again, this value is correct only if all the programs have equal weight. Neither the arithmetic mean nor the geometric mean represents the real performance of the execution rates of benchmark programs, which leads to the definition of a third kind of performance mean.

3.3 Harmonic mean performance

The harmonic mean of m values is defined as the ratio of the number of values, m, to the sum of the fractions 1/xj, where xj is the j-th value in the set. In the context of performance analysis, it gives the average performance across a large number of programs running in various execution modes; these modes correspond to scalar, vector, sequential, or parallel processing of different program parts. Symbolically, the harmonic mean performance is

  Rh = m / Σ (1/Rj),  j = 1 to m

where Rj is the execution rate of program j, m is the total number of programs, and equal weighting of all programs is assumed. The harmonic mean is the closest of the three means to the real performance; the sketch below shows the three means side by side.
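A minimal sketch (not from the original report), using invented execution rates in Mflops; the closing comment explains why the harmonic mean equals the true aggregate rate when all programs do equal work:

    # Minimal sketch of the three performance means of Section 3.
    from math import prod

    rates = [10.0, 40.0, 80.0]   # Rj: execution rate of program j (invented)
    m = len(rates)

    arithmetic = sum(rates) / m                   # Ra
    geometric = prod(rates) ** (1.0 / m)          # Rg
    harmonic = m / sum(1.0 / r for r in rates)    # Rh

    print(f"Ra={arithmetic:.2f}  Rg={geometric:.2f}  Rh={harmonic:.2f}")

    # Why the harmonic mean tracks real performance: if each program does
    # the same amount of work W, the total time is sum(W/Rj), so the true
    # overall rate is m*W / sum(W/Rj) = m / sum(1/Rj), which is exactly Rh.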
4. Speedup performance laws

4.1 Asymptotic speedup law

Consider a parallel system with n processors, and let wi denote the amount of work done with a degree of parallelism (DOP) equal to i. The execution time of wi on a single processor is ti(1); similarly, the execution time of wi on k processors is ti(k). T(1) is the response time of a single-processor system executing the workload, and T(∞) is the response time of executing the same workload when an unlimited number of processors is available. The asymptotic speedup S∞ is defined as the ratio of T(1) to T(∞):

  S∞ = T(1) / T(∞)

4.2 Harmonic speedup law

Suppose a workload of multiple programs is to be executed on an n-processor system; the workload may use different numbers of processors at different times during its execution. The program is said to execute in mode i if i processors are used, and Ri is the corresponding execution rate, reflecting the collective speed of i processors. The weighted harmonic speedup S is defined as the ratio of the sequential execution time T1 to the weighted arithmetic mean execution time T* across the n execution modes:

  S = T1 / T*

4.3 Amdahl's speedup law

This law can be derived from the previous laws under the assumption that the system works in only two modes: fully sequential mode with probability α, or fully parallel mode with probability 1 - α. Amdahl's speedup Sn is then the ratio of the number of processors to 1 + (n - 1)α:

  Sn = n / (1 + (n - 1)α)

This implies that, under the above assumption, the best speedup we can get is upper-bounded by 1/α regardless of how many processors the system actually has, because Sn → 1/α as n → ∞. In Figure 4, Sn is plotted as a function of n for different values of α. Note that the ideal speedup is achieved when α = 0, and that the speedup drops sharply as α increases; the sketch after Figure 4 tabulates these curves.

[Figure 4. Amdahl's speedup Sn versus the number of processors n, on logarithmic axes from 1 to 10000, for α = 0, 0.01, 0.1, and 0.9.]
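A minimal sketch (not from the original report) that evaluates Sn = n / (1 + (n - 1)α) for the same four values of α shown in the figure:

    # Minimal sketch of Amdahl's speedup law, Section 4.3.
    def amdahl_speedup(n: int, alpha: float) -> float:
        """Speedup on n processors when a fraction alpha of the work is sequential."""
        return n / (1.0 + (n - 1) * alpha)

    for alpha in (0.0, 0.01, 0.1, 0.9):
        row = [amdahl_speedup(n, alpha) for n in (1, 10, 100, 1000, 10000)]
        print(f"alpha={alpha:<5} " + "  ".join(f"{s:8.1f}" for s in row))

    # With alpha = 0.01 the speedup saturates near 1/alpha = 100 no matter
    # how many processors are added.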
5. Workloads

The term test workload denotes any workload used in a performance study. Two types of workload exist. A real workload is one observed in normal operation; it cannot be repeated, and therefore it is not suitable as a test workload. A synthetic workload is a load that can be applied repeatedly and that models the real workload.

5.1 Types of test workloads

5.1.1 Addition instruction. Most early computers were designed around arithmetic logic units that used the addition instruction to perform most of the computation needed at the time, so using the addition instruction to measure performance was good enough: the computer with the faster addition instruction was considered better.

5.1.2 Instruction mix. The addition instruction was no longer sufficient to measure the performance of computer systems as the number of instructions supported by processors increased, so instruction mix workloads were introduced.

5.1.3 Kernels. Sets of instructions that constitute a higher-level function performed by the processor. Kernels are used mainly to measure processor performance; a kernel measure does not reflect total system performance, because most kernel workloads perform no I/O operations.

5.1.4 Synthetic programs. Simple programs written in a high-level language that make a specified number of I/O or service-call requests. This workload measures the CPU time and the time for the I/O requests. Synthetic programs do not make representative memory and secondary-storage references.

5.1.5 Application benchmarks. Application programs that represent the real workload. They make use of all available resources, including processors, networks, I/O devices, and databases. Sieve, Whetstone, Linpack, Dhrystone, and SPEC are some of the well-known benchmarks that have been used to evaluate the performance of computer systems. In the next section a newer benchmark, called HINT, is discussed in more detail.

5.2 HINT

HINT is a benchmark developed by John L. Gustafson and Quinn O. Snell. "It is a practical approach that provides mathematically sound comparison of computational performance even when the algorithms, computer, and precision are changed" [4]. It can be used to compare computing as slow as a human calculator to computing as fast as the best supercomputers.

Most benchmarks are based on measuring the time various computers take to complete a fixed-size task; others, such as database benchmarks, fix the time and vary the job size. HINT (Hierarchical INTegration) is based on a different concept: it fixes neither the time nor the problem size. Instead it produces a speed measure called QUIPS, QUality Improvements Per Second. HINT reveals memory-bandwidth effects; it is scalable, comparing computing as slow as hand calculation with computing as fast as the largest supercomputers; it is portable, porting to every sequential and parallel environment with very little effort; and it has a low cost, permitting inexpensive comparison of any architecture. A computer with twice the QUIPS rating is twice as powerful: it must have more arithmetic speed, precision, storage, or bandwidth.

The HINT plots referenced below use a logarithmic scale for time and illustrate the power of HINT as a performance analysis and evaluation benchmark, comparing many different kinds of computer systems.

[Figure 6. Comparison of different precisions.]
[Figure 7. Comparison of different clock speeds.]
[Figure 9. Comparison of different main memory sizes.]
[Figure 10. Comparison of a scalable parallel computer.]
[Figure 11. Comparison of several parallel systems.]
[Figure 12. Comparison of various workstations.]

References

1. Jain, Raj (1991). The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley and Sons.
2. Hwang, Kai (1993). Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill.
3. McKerrow, Phillip (1988). Performance Measurement of Computer Systems. Addison-Wesley.
4. Gustafson, John L. and Snell, Quinn O. HINT: A New Way to Measure Computer Performance. http://www.scl.ameslab.gov/HINT