Performance Analysis of Parallel Processing
Dr. Subra Ganesan
Department of Computer Science and Engineering
School of Engineering and Computer Science
Oakland University
CSE664
Project report
Winter 2000
4/29/2000
By Ahmad Milhim
Contents

1. Introduction
2. Definitions
   2.1 Performance analysis
   2.2 Performance analysis techniques
   2.3 Performance analysis metrics
3. Performance means
   3.1 Arithmetic mean performance
   3.2 Geometric mean performance
   3.3 Harmonic mean performance
4. Speedup performance laws
   4.1 Asymptotic speedup
   4.2 Harmonic speedup
   4.3 Amdahl's speedup law
5. Workloads
   5.1 Types of test workloads
   5.2 HINT
6. References
1. Introduction
The goal of all computer system users and designers is to get the highest
performance at the lowest cost. A basic knowledge of performance evaluation
terminology and techniques is a must for computer professionals who need to
evaluate their systems. Performance analysis is required at every stage of the
life cycle of a computer system, from the design stage through manufacturing,
sales, and upgrades.
Designers perform analysis when they want to compare a number of alternative
designs and pick the best one; system administrators use performance analysis
to choose a system, from a set of candidates, for a certain application; and
users need to know how well their systems are performing and whether an
upgrade is needed.
It is said that performance analysis is an art: every analysis requires an
intimate knowledge of the system being evaluated. Computer applications are
so numerous and so different that it is not possible to have one standard
measure of performance for all of them. Three techniques are used for
performance analysis: analytical modeling, simulation, and measurement.
Several considerations help analysts choose among these techniques. The key
consideration is the stage the system is in: if the system is a new concept,
analytical modeling and simulation are the only possible techniques, although
both can be based on previous measurements of other, similar systems.
Considerations of lesser importance are the time available for the analysis,
the tools at hand, the accuracy required, and the cost.
Users of computer systems seek ways to increase the productivity of their
systems, in terms of both hardware and programmers, and thereby reduce the
cost of computing. Computer throughput was increased, for example, by making
the operating system handle resource sharing. There has always been a desire
to evaluate how well computer systems are performing, and to find ways of
improving their performance.
Measurement and analysis of parallel processing is a newer field. The analysis
of parallel processing helps users and researchers answer key questions about
the configuration of parallel systems: What is the best way to configure the
processing elements to solve a particular problem? How much overhead is
generated by parallelism? What is the effect of varying the number of
processors on performance? Is it better to use shared memory or local memory?
And so on.
2. Definitions. In this section the most common terms and concepts relating
to the performance analysis of systems in general, and of parallel processing
in particular, are introduced, along with how these terms are applied to
computer systems and parallel processing. First the definition of performance
analysis is introduced; then performance metrics and techniques are discussed
in more detail.
2.1 Performance analysis. The ideal performance of a computer system is
achieved when a perfect match between hardware capability and software
behavior is reached.
2.2 Performance analysis techniques. Three techniques are used for
performance analysis: analytical modeling, simulation, and measurement. The
choice among them depends on certain considerations. The most important is
the stage at which the analysis is to be performed: measurement cannot be
used unless the system, or at least a similar one, already exists, so if the
proposed design is a new idea then only analytical modeling or simulation can
be used.
2.2.1 Analytical modeling. Used if the system is in the early design stages
and the results are needed soon. It provides insight into the underlying
system, but it may not be precise because of the simplifications made in the
models and mathematical equations.
2.2.2 Simulation. A useful technique for analysis. Simulation models provide
easy ways to predict the performance of computer systems before they exist,
and simulation is also used to validate the results of analytical modeling
and measurement. Simulation provides snapshots of system behavior.
2.2.3 Measurement. It can be used only if the system is available for
measurement (postprototype) or at least other systems similar to the system
under design exist. Its cost is high compared with the other two techniques,
because hardware and software instruments are needed to perform the
measurements, and it also takes a long time to monitor a system properly. The
most important considerations used to choose among these three techniques are
listed, in order of importance, in Figure 1 below.
Criterion        Analytical modeling   Simulation           Measurement
---------------  --------------------  -------------------  -------------
Stage            Any                   Any                  Postprototype
Time required    Small                 Medium               Varies
Tools            Analysts              Computer languages   Instruments
Accuracy         Low                   Moderate             Varies
Cost             Small                 Medium               High

Figure 1. Considerations in order of importance
It is said that one should not trust the results of a single technique until
they have been validated by the other two. This is to emphasize that relying
on one technique may be misleading, or at least may not give accurate results.
2.3 Performance metrics. Each performance study has its own metrics, defined
and used to evaluate the specific system under study. Although these metrics
vary from one system to another, common metrics exist and are used in most
general computer system evaluations. The most common of these metrics are
introduced here.
2.3.1 Response time. The interval between a user's request and the system's
response. In general the response time increases as the load increases. Two
related terms are turnaround time and reaction time: turnaround time is the
time elapsed between submitting a job and the completion of its output, while
reaction time is the time elapsed between submitting a request and the
beginning of its execution by the system. The stretch factor is the ratio of
the response time at a particular load to the response time at minimum load.
2.3.2 Throughput. The rate (requests per unit time) at which requests can be
serviced by the system. For CPUs the rate is measured in millions of
instructions per second (MIPS) or millions of floating-point operations per
second (Mflops); for networks it is measured in bits per second (bps) or
packets per second; and for transaction systems it is measured in
transactions per second (TPS). In general the throughput of a system
increases as the load initially increases; then, beyond a certain load, the
throughput stops increasing and in some cases starts decreasing.
The maximum achievable throughput under ideal workload conditions is called
the nominal capacity of the system; for computer networks it is the
bandwidth, measured in bits per second. The response time at the nominal
capacity is often too high, which leads to the definition of another term,
the usable capacity: the maximum throughput achievable without exceeding a
prespecified response time. The optimal operating point is the point beyond
which the response time increases rapidly as a function of the load while the
gain in throughput is small; it is the knee capacity in Figure 2 below.
Figure 2. Capacity of the system: throughput and response time versus load,
showing the nominal capacity, usable capacity, and knee capacity
2.3.3 System efficiency E(n). Defined as the ratio of the maximum achievable
throughput (usable capacity) to the maximum achievable capacity under ideal
workload conditions. It is an indication of the actual degree of speedup
achieved compared with the maximum value; for parallel processing, in other
words, it is the ratio of the performance of an n-processor system to that of
a single-processor system. The minimum efficiency corresponds to the case
where the entire program is executed sequentially on a single processor; the
maximum efficiency corresponds to the case where all processors are fully
utilized throughout the execution period, as illustrated in Figure 3.

E(n) = S(n)/n
Figure 3. Efficiency as a function of the number of processors
2.3.4 Speedup S(n). An indication of the degree of speed gain in parallel
computation; it is discussed in more detail later in this paper. There are
three different speedup measures: asymptotic speedup, harmonic speedup, and
Amdahl's speedup law. In its simplest form the speedup is defined as the
ratio of the time taken by one processor to execute a program to the time
taken by n processors to execute the same program.

S(n) = T(1)/T(n)
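As a concrete illustration of these two metrics, here is a minimal Python
sketch that computes S(n) and E(n) from measured execution times; the timing
values are hypothetical, not from the paper.

```python
def speedup(t1, tn):
    """S(n) = T(1) / T(n): one-processor time over n-processor time."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E(n) = S(n) / n: speedup normalized by the processor count."""
    return speedup(t1, tn) / n

# Hypothetical measurements: 100 s on 1 processor, 30 s on 4 processors.
t1, t4 = 100.0, 30.0
print(speedup(t1, t4))        # 3.33: the 4-processor run is 3.33x faster
print(efficiency(t1, t4, 4))  # 0.83: each processor is ~83% effective
```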
2.3.5 Redundancy R(n). The ratio of the total number of unit operations
performed by an n-processor system, O(n), to the total number of unit
operations performed by a single-processor system, O(1).

R(n) = O(n)/O(1)
2.3.6 Utilization U(n). The fraction of the time a resource is busy, that is,
the ratio of busy time to total time over an observation period; idle time is
the time during which a resource is not used. Balancing the load between
resources produces more efficient systems when a parallel processing system
is being designed. In terms of the system's redundancy and efficiency, the
utilization is expressed as the redundancy times the efficiency; it indicates
the percentage of the resources that were kept busy during the execution of a
parallel program.

U(n) = R(n)*E(n)
2.3.7 Quality Q(n). This metric combines the effects of speedup, efficiency,
and redundancy to assess the relative merits of parallel computations. It is
directly proportional to the speedup and efficiency and inversely
proportional to the redundancy.

Q(n) = S(n)*E(n)/R(n)

Since the efficiency is always a fraction and the redundancy is at least one,
the quality of a parallel computation is upper-bounded by the speedup.
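To make the relationships among these metrics concrete, the following sketch
chains the definitions together and shows that Q(n) stays below S(n); the
operation counts and speedup for this hypothetical 4-processor run are made
up for the example.

```python
def redundancy(ops_n, ops_1):
    """R(n) = O(n) / O(1): extra work introduced by parallelism."""
    return ops_n / ops_1

def utilization(r_n, e_n):
    """U(n) = R(n) * E(n)."""
    return r_n * e_n

def quality(s_n, e_n, r_n):
    """Q(n) = S(n) * E(n) / R(n)."""
    return s_n * e_n / r_n

n, s = 4, 3.2                  # hypothetical: speedup 3.2 on 4 processors
e = s / n                      # E(4) = 0.8
r = redundancy(1.1e9, 1.0e9)   # 10% more unit operations in parallel: R = 1.1
print(utilization(r, e))       # 0.88
print(quality(s, e, r))        # 2.33, indeed below S(4) = 3.2
```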
2.3.8 Reliability. It is measured by the probability of errors or by the mean
time between errors.
2.3.9 Availability. The availability of a resource is the fraction of the
time the resource is available to service users' requests. Two other terms
used in system analysis are downtime and uptime: the downtime is the period
during which the system is not available, and the uptime is the time during
which it is available [1].
2.3.10 Cost/performance ratio. This metric is used to compare two or more
systems in terms of their cost and performance. The cost should include the
cost of hardware, software, etc.; the performance is measured in terms of
throughput for a prespecified response time.
3. Performance means.
To understand the concepts of performance analysis and the significance of
the output of the performance techniques, one should understand the general
concepts of mathematical performance laws and performance means. In this
section performance means are discussed briefly.
3.1 Arithmetic mean performance. Defined as the ratio of the sum of the
execution rates of all programs to the total number of programs used.
Symbolically, let Rj be the execution rate of program j, where j = 1, 2, …, m;
then the arithmetic mean performance is

Ra = (1/m) Σ Rj, for j = 1 to m.

This is true only if all programs have equal weight. If the programs are
weighted, the weighted arithmetic mean is defined as

Ra* = Σ fj Rj, for j = 1 to m,

where fj is the weight of program j and the weights sum to one.
3.2 Geometric mean performance. The geometric mean of m values is obtained by
multiplying the values together and taking the mth root of the product. It is
used only if the product of the values is a quantity of interest. The
geometric mean of the execution rates Rj is

Rg = (Π Rj)^(1/m), for j = 1 to m,

where m is the number of programs. Again, this value is correct only if all
the programs have equal weights. Neither the arithmetic mean nor the
geometric mean represents the real performance of the execution rates of
benchmark programs, which leads to the definition of a third kind of
performance mean.
3.3 Harmonic mean performance. The harmonic mean of m values is defined as
the ratio of the number of values, m, to the sum of the reciprocals 1/xj,
where xj is the jth value in the set. In the context of performance analysis,
it is the average performance across a large number of programs running in
various execution modes; these modes correspond to scalar, vector,
sequential, or parallel processing of different program parts. Symbolically,
the harmonic mean performance is

Rh = m / Σ (1/Rj),

where Rj is the execution rate of program j, m is the total number of
programs, and Rh is the harmonic mean under equal weighting of all programs.
The harmonic mean is the closest of the three means to the real performance.
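The three means are easy to compare side by side in a few lines of Python;
the execution rates below are arbitrary sample values, not measurements from
the paper. For unequal positive rates the ordering Rh < Rg < Ra always holds,
which is why the arithmetic mean tends to overstate performance.

```python
import math

rates = [10.0, 50.0, 200.0]           # hypothetical execution rates (Mflops)
m = len(rates)

ra = sum(rates) / m                   # arithmetic mean
rg = math.prod(rates) ** (1.0 / m)    # geometric mean: (product)^(1/m)
rh = m / sum(1.0 / r for r in rates)  # harmonic mean

print(ra, rg, rh)                     # 86.67  46.42  24.0  (Ra > Rg > Rh)
```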
4. Speedup performance laws.
4.1 Asymptotic speedup law. Consider a parallel system of n processors, and
denote by wi the amount of work done with a degree of parallelism (DOP) equal
to i. The execution time of wi on a single processor is ti(1); similarly, the
execution time of wi on k processors is ti(k). T(1) is the response time of a
single-processor system executing the whole workload, and T(∞) is the
response time for executing the same workload when an unlimited number of
processors is available. The asymptotic speedup S is then defined as the
ratio of T(1) to T(∞).

S = T(1) / T(∞)
4.2 Harmonic speedup law. Suppose a workload of multiple programs is to be
executed on an n-processor system; the workload may use a different number of
processors at different times during its execution. The program is executed
in mode i if i processors are used, and Ri is the corresponding execution
rate, reflecting the collective speed of i processors. The weighted harmonic
speedup S is defined as the ratio of the sequential execution time T1 to the
weighted arithmetic mean execution time T* across the n execution modes.

S = T1 / T*
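In this formulation, if fi is the fraction of the work executed in mode i and
the rates are normalized so that the sequential rate R1 = 1 (hence T1 = 1),
then T* = Σ fi/Ri and S = 1/Σ(fi/Ri), a weighted harmonic mean of the rates.
A small sketch with assumed mode fractions:

```python
# Hypothetical execution modes: (fraction of work f_i, execution rate R_i),
# with rates normalized so that the sequential rate R_1 = 1.
modes = [(0.4, 1.0), (0.4, 4.0), (0.2, 16.0)]

t_star = sum(f / r for f, r in modes)  # weighted mean execution time T*
print(1.0 / t_star)                    # harmonic speedup S = T1/T* ~ 1.95
```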
4.3 Amdahl's speedup law. This law can be derived from the previous laws
under the assumption that the system works in only two modes: fully
sequential mode with probability α, or fully parallel mode with probability
1 − α. Amdahl's speedup Sn is then the ratio of the number of processors n to
1 + (n − 1)α:

Sn = n / (1 + (n − 1)α)

This implies that, under the above assumption, the best speedup we can get is
upper-bounded by 1/α regardless of how many processors the system actually
has, because Sn → 1/α as n → ∞. In Figure 4, Sn is plotted as a function of n
for different values of α. Note that the ideal speedup is achieved when
α = 0, and that the speedup drops sharply as α increases.
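The limiting behavior is easy to verify numerically. This short sketch
tabulates Sn for the same values of α used in Figure 4 below and shows each
row leveling off near 1/α:

```python
def amdahl(n, alpha):
    """Amdahl's speedup: S_n = n / (1 + (n - 1) * alpha)."""
    return n / (1 + (n - 1) * alpha)

for alpha in (0.0, 0.01, 0.1, 0.9):
    row = [round(amdahl(n, alpha), 1) for n in (1, 10, 100, 1000)]
    print(alpha, row)  # as n grows, S_n -> 1/alpha (unbounded when alpha = 0)
```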
Figure 4. Amdahl's speedup Sn versus the number of processors n (both axes
logarithmic) for α = 0, α = 0.01, α = 0.1, and α = 0.9
5. Workloads.
The term test workload denotes any workload used in a performance study. Two
types of workload exist. A real workload is one observed in normal operation;
it cannot be repeated, and therefore it is not suitable for use as a test
workload. The other type is the synthetic workload, a load that can be
applied repeatedly and that models the real workload.
5.1 Types of test workloads
5.1.1 Addition instruction. Most early computers were designed around
arithmetic logic units, and the addition instruction performed most of the
computation needed at that time. Using the addition instruction to measure
the performance of those computers was therefore good enough: the computer
with the faster addition instruction was considered the better one.
5.1.2 Instruction mix. The addition instruction was no longer sufficient to
measure the performance of computer systems as the number of supported
instructions increased, so instruction mix workloads were introduced.
5.1.3 Kernels. Sets of instructions that constitute a higher-level function
performed by the processor; kernels are used mainly to measure processor
performance. A kernel performance measure does not reflect total system
performance, because most kernel workloads perform no I/O operations.
5.1.4 Synthetic programs. Simple programs written in a high-level language
that make a specified number of I/O requests or service calls. This workload
measures the CPU time and the time for the I/O requests. Synthetic programs
do not make representative memory and secondary storage references.
5.1.5 Application benchmarks. Application programs that represent the real
workload. They make use of all available resources, including processors,
networks, I/O devices, and databases. Sieve, Whetstone, Linpack, Dhrystone,
and SPEC are some of the well-known benchmarks that have been used to
evaluate the performance of computer systems. In the next section a newer
benchmark, HINT, is discussed in more detail.
5.2 HINT.

HINT is a benchmark developed by John L. Gustafson and Quinn O. Snell. "It is
a practical approach that provides mathematically sound comparison of
computational performance even when the algorithms, computer, and precision
are changed" [4]. It can be used to compare computing as slow as a human with
a calculator to computing as fast as the best supercomputers. Most benchmarks
are based on measuring the time various computers take to complete a
fixed-size task; others, like database benchmarks, fix the time and vary the
job size. HINT, which stands for Hierarchical INTegration, is based on a
different concept: it fixes neither the time nor the problem size.
HINT produces a speed measure called QUIPS, QUality Improvement Per Second.
Because it fixes neither time nor problem size, it reveals memory bandwidth.
It is scalable: it compares computing as slow as hand calculation with
computing as fast as the largest supercomputers. It is portable: it ports to
every sequential and parallel environment with very little effort. And it has
low cost: it permits low-cost comparison of any architecture.
A computer with twice the QUIPS rating is twice as powerful, so it must have
more of some combination of
 arithmetic speed
 precision
 storage
 bandwidth
The following HINT plots use a logarithmic scale for time. They illustrate
the power of HINT as a performance analysis and evaluation benchmark,
comparing many different kinds of computer systems.

Figure 6. Comparison of Different Precisions
Figure 7. Comparison of Different Clock Speeds
Figure 9. Comparison of Different Main Memory Sizes
Figure 10. Comparison of a Scalable Parallel Computer
Figure 11. Comparison of Several Parallel Systems
Figure 12. Comparison of Various Workstations
References

1. Jain, Raj. (1991). The Art of Computer Systems Performance Analysis:
Techniques for Experimental Design, Measurement, Simulation, and Modeling.
John Wiley and Sons.
2. Hwang, Kai. (1993). Advanced Computer Architecture: Parallelism,
Scalability, Programmability. McGraw-Hill.
3. McKerrow, Phillip. (1988). Performance Measurement of Computer Systems.
Addison-Wesley.
4. Gustafson, John L. and Snell, Quinn O. HINT: A New Way To Measure Computer
Performance. (http://www.scl.ameslab.gov/HINT)