CS 1104
Help Session V
Performance Analysis
Colin Tan,
ctank@comp.nus.edu.sg
S15-04-15
Benchmarking
• Your boss wants you to make a purchasing
decision between two systems.
– How will you decide which system is better for your
company?
• Benchmarks are programs that help you to decide
– Measures various aspects of a computer’s performance
• CPU speed
• I/O speed
• Etc.
Pitfalls of Benchmarks
• Benchmarks are sensitive to various hardware and
software configurations:
– For portability, benchmarks typically come in source-code form.
• Execution time (and hence system “performance”) can be
affected by how clever the compiler is (e.g. can the compiler
do loop-index-transformations, inline expansions, loop
unrolling, common sub-expression elimination, intelligent
register spilling).
• Execution time can also be affected by compiler “switches”
– Switches control how the compiler produces the object code.
– Can optimize for size
» Produces small but slow code => seemingly poor
performance
» Produces big but fast code => seemingly good performance
Pitfalls of Benchmarks
– Cache Effects
• Some benchmarks make use of tight loops
• Extremely good performance on machines with large block
sizes.
• This is because the entire loop may be contained in a cache
block, resulting in 100% instruction cache hits.
• Likewise, benchmarks may give good results because data
accesses are confined to a small portion of memory.
• Can give unrealistic results, since real programs may not have
tight loops at all, and may access data from many parts of
memory.
Pitfalls of Benchmarks
– I/O Effects
• I/O intensive benchmarks may perform well on
machines with slow CPU but fast I/O.
• CPU-bound benchmarks may perform well on
machines with fast CPU but slow I/O.
Pitfalls of Benchmarks
• Moral:
– Benchmarks chosen must reflect behavior of actual
programs to be used.
• E.g. Cache access patterns, I/O access patterns.
– Standardize on a compiler
• Don’t use slow, stupid and clunky Microsoft compiler on one
machine and fast, intelligent GNU compiler on another.
– Standardize on compiler switches
• Don’t optimize for size on one machine (slow) and optimize
for speed on another.
Optional Reading
Why Are Fast Programs Fat?
• When a compiler is set to optimize for speed, it
often produces large code:
– Loops are unrolled.
• Aim of loop unrolling is to generate more instructions for a
program so that there is more opportunity for instruction
scheduling and for superscalar execution.
• Can lead to large code
– See example on the next slide.
Why Are Fast Programs Fat?
Unrolling Example (Optional)
• Original Code:
for (i = 0; i < 100; i++)
{
    a[i] = b[i] + c;
}
• Unrolled Code (by a factor of 4, which divides the
trip count of 100 evenly, so the loop does not run
past the end of the arrays):
for (i = 0; i < 100; i = i + 4)
{
    a[i]   = b[i]   + c;
    a[i+1] = b[i+1] + c;
    a[i+2] = b[i+2] + c;
    a[i+3] = b[i+3] + c;
}
Optional Reading
Why Are Fast Programs Fat?
• Short functions are in-lined
– Function calls are replaced by the actual function code,
removing the need to actually do a function call.
– This saves time, as function calls are very expensive
• Need to save the context (register values, etc.) of the caller to
the stack.
• Need to save the Program Counter value of the caller.
• Need to do jump to called function’s code.
• Need to restore caller context before jumping back to caller’s
code.
Why Are Fast Programs Fat?
Function In-Lining Example
(Optional Reading)
• Original Code
int f(int x)
{
    int y = x + 3;
    return cos(y) + sin(y);
}
main()
{
int x1 = f(1);
int x2 = f(2);
int x3 = f(3);
}
Why Are Fast Programs Fat?
Function In-Lining Example
(Optional Reading)
• In-lined Code:
main()
{
int y1=1+3;
int x1 = cos(y1)+sin(y1);
int y2=2+3;
int x2 = cos(y2)+sin(y2);
int y3=3+3;
int x3=cos(y3)+sin(y3);
}
Amdahl’s Law
• Suppose you had a program that runs in Y
seconds, of which X seconds are caused by an
instruction class C.
• Suppose you wanted to improve Y by a certain
percentage, by improving the running time of
instructions in class C.
• How much improvement must be made to class
C’s execution time?
Amdahl’s Law
• TNEW = TC/speedup_c + TNC
– Here TC is the time taken by the class C instructions,
speedup_c is the improvement that you must make to
class C instructions, and TNC is the timing of all the rest
of the instructions.
– TNEW is the new execution time.
Amdahl’s Law Example
• Suppose a program runs in 100 seconds, of which
80 seconds are taken up by multiply instructions.
How much improvement must be made to the
multiply instructions' execution time in order to
improve overall execution time by 4 times?
• What about improving overall execution time by 5
times?
Solution
4-times Speedup
• TC = 80 seconds, TNC = 100 - 80 = 20 seconds,
TNEW = 100/4 = 25 seconds, speedup_c to be
determined.
• TNEW = TC/speedup_c + TNC
• 25 = 80/speedup_c + 20
• speedup_c = 16.
• Therefore multiply instructions must be sped up
by 16x to get a 4x improvement in overall
execution time.
Solution
5-times Speedup
• TC=80 seconds, TNC=100-80 = 20 seconds, TNEW
= 100/5 = 20 seconds
• Amdahl’s Law:
20 = 80/speedup_c + 20
0 = 80 / speedup_c
0 = 0 ??
• If the speedup_c is a strange answer (0 = 0), or if
it is negative, then the new timing TNEW is not
possible. In this case, it is not possible to improve
execution time by 5x from 100s to 20s.
Optimizing Strategy
Making the Common Case Fast
• Suppose your program runs in 100 seconds, and
rotate operations take 5 seconds while add
instructions take 75 seconds. Find the
improvement in execution time if:
– Rotates are improved by 4x
– Adds are improved by 4x
Optimizing Strategy
Making the Common Case Fast
• Improving Rotate Timings
– TC = 5s, speedup_c = 4x, TNC = 100-5 = 95
TNEW = 5/4 + 95
= 96.25 seconds (Improvement: 3.75 seconds)
• Improving Add Timings
– TC = 75s, speedup_c = 4x, TNC = 100-75 = 25
TNEW = 75/4 + 25
= 43.75 seconds (Improvement: 56.25 seconds)
Optimizing Strategy
Making the Common Case Fast
• The add instruction was used more frequently than
the rotate instruction, and accounted for more of
the execution time (75% vs. 5%) than the rotate
instructions.
• By speeding up the add instruction, we actually
speed up the bulk of the execution time.
• This results in better performance improvement
than if we sped up the rotate instructions.
Cycles Per Instruction
• Each instruction takes a certain amount of time to
execute
– Time must be spent fetching the instruction,
understanding what it means, fetching data, performing
the operation on the data, and storing the results
• When this time is measured in clock cycles, the
unit of measure is the Cycles Per Instruction or
CPI.
Cycles Per Instruction
• CPI may be divided into two categories:
– Class CPIs
• These are the CPI measurements for various classes
of instructions.
– E.g. for multiply instructions, add instructions, load/store
instructions etc.
• Class CPIs are affected by processor organization
– Affected by things like how fast we can fetch data from
the registers, how fast the ALU is, etc.
• Different hardware implementations of the same
processor may have different Class CPIs.
– E.g. a pipelined implementation of a processor typically
has lower class CPIs than a non-pipelined implementation.
Cycles Per Instruction
– Overall CPI
• This is affected by the class CPIs, as well as by the
compiler.
• For a processor with 4 classes of instructions (A, B,
C, D), the overall CPI is given by:
CPIoverall = fA/N * CPIA + fB/N * CPIB + fC/N * CPIC + fD/N * CPID
fX = # of instructions in program taken from class X, CPIX = CPI of
class X, N is the total number of instructions in the program.
• Usually the ratio fX/N is given for class X.
• E.g. table may say that class X has a frequency of 0.05. This
means that if the program has 100 instructions, 5 of the
instructions come from class X
• Overall CPI varies between programs
– Each program would have different proportions (fX/N) for each
class of instructions.
Cycles Per Instructions
Example 1
• Given the following data about a program P, find
the overall CPI of the program:
– Class A, CPI 3, Frequency 0.1
– Class B, CPI 4, Frequency 0.5
– Class C, CPI 5, Frequency 0.2
– Class D, CPI 6, Frequency 0.2
Overall CPI = 0.1 * 3 + 0.5 * 4 + 0.2 * 5 + 0.2 * 6
= 4.5
Cycles Per Instruction
Improving Performance
• An improvement in either overall or class CPIs
will result in better performance, all else being
equal.
• Overall CPI
– Overall CPI can be improved by using a better
compiler, or by using better hardware design.
• Better compiler improves overall CPI by having a higher
frequency of fast instructions.
• Better hardware design improves overall CPI by improving
individual class CPIs.
Cycles Per Instruction
Example 2
• The program from the previous example was
modified such that 40% of the total set of
instructions come from class A, 40% from class B,
10% from class C and 10% from class D. Find the
new overall CPI.
Cycles Per Instruction
Example 2
Overall CPI = 0.4 * 3 + 0.4 * 4 + 0.1 * 5 + 0.1 * 6
= 3.8
• What is the lower bound on the overall CPI?
– The lower bound occurs when all instructions come
from the fastest class. So the lower bound for our
example is:
1.0 * 3 + 0.0 * 4 + 0.0 * 5 + 0.0 * 6 = 3.0
Cycles Per Instruction
Improving Performance
• Improving Class CPIs
– Class CPIs are affected by hardware implementation.
– Hence the only way to improve class CPIs is to
improve the hardware design.
– CANNOT BE ACHIEVED BY SOFTWARE MEANS!
• Changing compiler etc will not improve class CPIs.
Cycles Per Instruction
Example 3
• In our Example 1, suppose we improved the CPI
of class B instructions to 1.0. What will be the
new overall CPI?
Overall CPI = 0.1 * 3 + 0.5 * 1 + 0.2 * 5 + 0.2 * 6
= 3.0
• What is the new lower-bound on the overall CPI?
– Again the lower-bound is achieved if all instructions came from
the fastest class (here it is class B)
Overall CPI = 0.0 * 3 + 1.0 * 1 + 0.0 * 5 + 0.0 * 6
= 1.0
Cycles Per Instruction
Lower Bound for Class CPIs
• What is the lower bound for class CPIs?
– The answer, surprisingly, is 0!
– Reason:
• The hardware design (i.e. the organization) for certain types of
instructions can actually be optimized such that execution of
that instruction takes much less than 1 cycle.
• This effectively gives that instruction a lower bound CPI of 0!
Cycles Per Instruction
Example 4
• Take the Intel mov ax, bx instruction, for
example.
– This instruction copies the contents of register
bx into register ax.
– Traditional data paths connect all the registers
to the ALU inputs, and the ALU output to all
the registers.
– Electronic switches control which register will
load its data onto the ALU, and which register
the ALU result will be written to.
Cycles Per Instruction
Example 4
• Traditional implementation of the mov instruction:
• Read register BX - 1 cycle
• Pass through ALU, don't do anything - 1 cycle
• Write to register AX - 1 cycle
• Total = 3 cycles (i.e. CPI for mov instructions is 3)
• The Read and Write operations both take 1 cycle,
as they are synchronous operations that take place
only at the rising edge of a clock cycle.
• ALU pass-through also takes 1 cycle, to simplify
timing design of the CPU, even if ALU does not
do anything.
Cycles Per Instruction
Example 4
• New Implementation
– 1. Connect registers directly to each other, as well as to ALU
inputs (i.e. AX connects directly to BX)
– 2. If a mov ax, bx instruction is encountered, close the direct
connection between registers AX and BX.
• This is fast, as it consists just of closing a transistor switch
– 1/10th of a cycle?
– 3. Perform asynchronous copy of AX
• Independent of clock cycles => no need to wait for next rising edge.
• This is again fast, typically about 2 transistor times. Around 2/10th of
a cycle.
– Total timing = 3/10th of a cycle => effectively 0 cycles lower
bound!
Pitfalls Of Performance
Prediction
• Trying to predict whether machine A is better than
machine B based on single measurements is
dangerous
– Single measurements: This means that you consider
only CPI or clock rate in determining who is faster.
– This can lead to very unpredictable results.
Pitfalls of Performance
Prediction - CPI
• CPI gives a good idea of performance of a
program on a given processor.
– Lower CPI means better performance, on a given
processor.
• The term “given processor” means that factors like clock speed
remain constant!
• However it is a fallacy to assume that low CPI =
fast execution time!
Pitfalls of Performance
Prediction - Clock Frequency
• Chip manufacturers are fond of quoting clock
frequencies as an indication of speed.
– E.g. Intel 500MHz Pentium II
• Unfortunately increasing clock speeds can
sometimes slow down a system rather than
speeding it up!
– All design requires compromise.
– Increasing clock speed may compromise on CPI,
leading to larger overall CPIs. This may actually give
worse performance.
Pitfalls of Performance
Prediction - Execution Time
• Execution time of a program is the only semi-reliable way of telling which machine is faster.
– It is only semi-reliable, as it is still subject to the
pitfalls of benchmarks mentioned earlier
• A benchmark may run very fast on machine A and slowly on
machine B, yet the actual applications that you use may be just
the reverse!
• This problem is caused by choosing inappropriate benchmarks.
E.g. choosing processor-intensive benchmarks even though
your actual applications are I/O intensive.
Execution Time
• To compute execution time:
– 1. Compute the total number of clock cycles taken by
the program:
a) Usually we have to start off from an average (i.e. "overall")
CPI value
b) Find out how many instructions IC are executed.
c) The total number of cycles taken is just IC x CPI
– 2. Multiply by the clock period TCLK to get the
execution time:
• TCLK = 1/clock_rate, if clock_rate is specified in Hz.
• Combining (1) and (2)
• TEXEC = (CPI * IC) / clock_rate
Execution Time
Example 5
• In the previous four examples, given that each of
them runs on a 500MHz machine, find the
execution times.
– TExec1 = 4.5IC/(500 x 10^6) seconds
– TExec2 = 3.8IC/(500 x 10^6) seconds
– TExec3 = 3.0IC/(500 x 10^6) seconds
• Find the relative speedups between TExec1, TExec2
and TExec3.
Execution Time
Example 5
• Note that it is not possible to compare TExec2 vs.
TExec1 or TExec3
– This is because a different compiler was used, and the
instruction count IC for TExec2 may not be the same as
the instruction count IC for TExec1 and TExec3!
• Please read the question carefully before proceeding!
– Speedup of example 3 over example 1:
Speedup = TExec1/TExec3
= [4.5IC/(500x10^6)]/[3.0IC/(500 x 10^6)]
= 1.5x speedup
Execution Time
Example 6
• For examples 1 and 3 above, suppose that the
program in Example 1 runs on a computer with a
600 MHz clock, while the program in Example 3
runs on a computer with a 300 MHz clock. Which
program runs faster, and by how much?
Execution Time
Example 6
• TExec1 = 4.5IC/(600 x 10^6)
• TExec3 = 3.0IC/(300 x 10^6)
• Speedup of TExec3 over TExec1 is:
TExec1/TExec3 = [4.5IC/(600x10^6)] /
[3.0IC/(300x10^6)]
= 0.75
• TExec1 is actually smaller than TExec3!
• This is despite the fact that the CPI for example 1 was
bigger!
• Example of why CPI is not a reliable indication of which
computer is faster.
Execution Time
Example 8
• Given the following information:
Machine A: clock rate 400 MHz, overall CPI of program = 4.5
Machine B: clock rate 300 MHz, overall CPI of program = 2.0
Find out which machine is faster, and by how
much. Assume that both programs are
identical.
Execution Time
Example 8
• If both programs are identical, then the instruction
count IC is the same.
• Texec_A = 4.5IC/(400 x 10^6)
• Texec_B = 2.0IC/(300 x 10^6)
• Speedup = Texec_A/Texec_B
= 1.6875
• Thus Texec_A > Texec_B even though the clock rate
for machine A is faster!
• Example of how clock rate is not a reliable
estimate of relative performance.
Instruction Frequencies
A Pitfall
• When we are computing overall CPIs, we must
ALWAYS check that the instruction frequencies (if
they are given in fractions like 0.1, etc.) add up to
1.0!
• If they don’t, the final CPI must be normalized.
Instruction Frequencies
Example 9
• Given the following information, compute the
overall CPI
– Class A, CPI 1.0, Frequency 0.2
– Class B, CPI 2.0, Frequency 0.3
– Class C, CPI 3.0, Frequency 0.1
– Class D, CPI 4.0, Frequency 0.2
• Overall CPI = 0.2 * 1.0 + 0.3 * 2.0 + 0.1 * 3.0 +
0.2 * 4.0 = 1.9
• This answer is WRONG!
Instruction Frequencies
Example 9
• The answer is wrong because the frequencies do
not add up to 1.0!
– 0.2 + 0.3 + 0.1 + 0.2 = 0.8! NOT 1.0!
• To get the correct CPI, divide by the sum of the
frequencies:
– Correct CPI = Wrong CPI / Total Frequency
= 1.9 / 0.8
= 2.375
Summary
• Benchmarks help us to choose which system is
better for our needs.
• Unfortunately benchmarks are subject to many
pitfalls, and we must choose benchmarks that
closely resemble our actual application load.
• Amdahl’s Law implies that to get the best
improvement in performance, we should make the
common case fast.
– The examples covered show that making uncommon
cases fast produces little speedup.
Summary
• CPI is a measure of how many cycles each
instruction takes to execute.
• Two types of CPI
– Class CPIs
• These are the CPIs of instructions in individual
classes.
• Affected by hardware design
– Changing the hardware design often changes the class
CPIs.
– E.g. we may shift from a serial adder to a parallel adder,
resulting in fewer cycles required to add 2 numbers
Summary
• Two types of CPI
– Overall (or average) CPI
• Affected by compiler
• Also affected by hardware organization changes
– These will cause individual class CPIs to change, causing
change in the overall CPI.
• If the designer changes only the clock frequency
(BUT NOT ANYTHING ELSE!), then both types
of CPI will not be affected.
Summary
• Neither clock rate nor CPI are reliable measures of
which computer is faster.
• The only semi-reliable measure is execution time
– However execution time is subject to the pitfalls of
benchmarking.
• If you choose benchmarks that are not representative of your
expected work load, or if the actual work load ends up being
different from your expected workload, then the execution time
of the benchmarks no longer form the basis for reliable
comparisons.