CMSC 611: Advanced Computer Architecture Performance & Benchmarks

advertisement
CMSC 611: Advanced
Computer Architecture
Performance & Benchmarks
Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides
Some material adapted from Hennessy & Patterson / © 2003 Elsevier Science
2
Calculation of CPU Time
CPU time =
Instructions Clock cycles
Seconds
´
´
Program
Instruction Clock cycle
CPU time = Instruction count ´ CPI´ Clock cycle time
Or
Instruction count ´ CPI
CPU time =
Clock rate
Component of performance
CPU execution time for a program
Instruction count
Clock cycles per instructions (CPI)
Clock cycle time
Units of measure
Seconds for the program
Instructions executed for the program
Average number of clock cycles/instruction
Seconds per clock cycle
3
CPU Time (Cont.)
• CPU execution time can be measured by
running the program
• The clock cycle is usually published by the
manufacture
• Measuring the CPI and instruction count is not
trivial
– Instruction counts can be measured by: software
profiling, using an architecture simulator, using
hardware counters on some architecture
– The CPI depends on many factors including:
processor structure, memory system, the mix of
instruction types and the implementation of these
instructions
4
CPU Time (Cont.)
• Designers sometimes uses the following
formula:
n
CPU clock cycles = åCPIi ´ Ci
i =1
Where: Ci
is the count of number of instructions of class i executed
CPIi is the average number of cycles per instruction for that instruction class
n
is the number of different instruction classes
5
Example
Suppose we have two implementation of the same instruction set architecture.
Machine “A” has a clock cycle time of 1 ns and a CPI of 2.0 for some program, and
machine “B” has a clock cycle time of 2 ns and a CPI of 1.2 for the same program.
Which machine is faster for this program and by how much?
Both machines execute the same instructions for the program. Assume the
number of instructions is “I”,
CPU clock cycles (A) = I  2.0
CPU clock cycles (B) = I  1.2
The CPU time required for each machine is as follows:
CPU time (A) = CPU clock cycles (A)  Clock cycle time (A)
= I  2.0  1 ns = 2  I ns
CPU time (B) = CPU clock cycles (B)  Clock cycle time (B)
= I  1.2  2 ns = 2.4  I ns
Therefore machine A will be faster by the following ratio:
CPU Performanc e (A) CPU time (B) 2.4 ´ I ns
=
=
= 1.2
CPU Performanc e (B) CPU time (A) 2 ´ I ns
6
Comparing Code Segments
A compiler designer is trying to decide between two code sequences for a
particular machine. The hardware designers have supplied the following facts:
Instruction class
A
B
C
CPI for this instruction class
1
2
3
For a particular high-level language statement, the compiler writer is
considering two code sequences that require the following instruction counts:
Code sequence
1
2
Instruction count for instruction class
A
B
C
2
1
2
4
1
1
Which code sequence executes the most instructions? Which will be faster?
What is the CPI for each sequence?
Answer:
Sequence 1:
executes 2 + 1 + 2 = 5 instructions
Sequence 2:
executes 4 + 1 + 1 = 6 instructions

7
Comparing Code Segments
n
Using the formula:
Sequence 1:
Sequence 2:
CPU clock cycles = åCPIi ´ Ci
i =1
CPU clock cycles = (2 1) + (1 2) + (2 3) = 10 cycles
CPU clock cycles = (4 1) + (1 2) + (1 3) = 9 cycles
 Therefore Sequence 2 is faster although it executes more instructions
Using the formula:
Sequence 1:
Sequence 2:
CPI =
CPU clock cycles
Instruction count
CPI = 10/5 = 2
CPI = 9/6 = 1.5
 Since Sequence 2 takes fewer overall clock cycles but has more
instructions it must have a lower CPI
8
The Role of Performance
• Hardware performance is a key to the
effectiveness of the entire system
• Performance has to be measured and
compared to evaluate designs
• To optimize the performance, major affecting
factors have to be known
• For different types of applications
– different performance metrics may be appropriate
– different aspects of a computer system may be
most significant
• Instructions use and implementation, memory
hierarchy and I/O handling are among the
factors that affect the performance
9
Calculation of CPU Time
Instruction count ´ CPI
CPU time =
Clock rate
Instr. Count
CPI
Program
X
Compiler
X
X
Instruction Set
X
X
Organization
X
Technology
Clock Rate
X
X
n
CPU clock cycles = åCPIi ´ Ci
i =1
Where: Ci
is the count of number of instructions of class i executed
CPIi is the average number of cycles per instruction for that instruction class
n
is the number of different instruction classes
10
Important Equations (so far)
1
Performance =
Execution time
Performance (B) Time (A)
Speedup =
=
Performance (A) Time (B)
Instructions
Cycles
Seconds
CPU time =
´
´
Program
Instruction
Cycle
n
CPU clock cycles =
åCPIi ´ Instructionsi
i=1
11
Amdahl’s Law
The performance enhancement possible with a given improvement
is limited by the amount that the improved feature is used
Execution time after improvement =
Execution time affected by the improvement
Amount of improvement
+ Execution time unaffected
• A common theme in Hardware design is to make the common case fast
• Increasing the clock rate would not affect memory access time
• Using a floating point processing unit does not speed integer ALU operations
Example: Floating point instructions improved to run 2X; but only 34% of
actual instructions are floating point
Exec-Timenew = Exec-Timeold x (0.66 + .34/2) = 0.83 x Exec-Timeold
Speedupoverall = Exec-Timeold / Exec-Timenew = 1/0.83 = 1.205
12
Ahmdal’s Law Visually
Timeold
Timenew
13
Ahmdal’s Law for Speedup
(
Timeold = Timeold * Fractionunchanged + Fractionenhanced
)
æ
Fractionenhanced ö
Timenew = Timeold * çFractionunchanged +
÷
Speedupenhanced ø
è
Speedupoverall =
Timeold
=
Timenew
Timeold
æ
Fractionenhanced ö
Timeold * çFractionunchanged +
÷
Speedupenhanced ø
è
1
=
Fractionenhanced
Fractionunchanged +
Speedupenhanced
Speedupoverall =
1
(
)
1- Fractionenhanced +
Fractionenhanced
Speedupenhanced
14
Performance Variations
•
•
•
•
Performance is dependent on workload
Task dependent
Need a measure of workload
Best = run your program
– Often cannot
• Benchmark = “typical” workload
– Standardized for comparison
Synthetic Benchmarks
• Synthetic benchmarks are artificial programs
that are constructed to match the
characteristics of large set of programs
• Whetstone (scientific programs), Dhrystone
(systems programs), LINPACK (linear
algebra), …
Synthetic Benchmark
Drawbacks
1. They do not reflect the user interest
since they are not real applications
2. They do not reflect real program
behavior (e.g. memory access pattern)
3. Compiler and hardware can inflate the
performance of these programs far
beyond what the same optimization can
achieve for real-programs
17
Application Benchmarks
• Real applications typical of expected
workload
• Applications and mix important
The SPEC Benchmarks
• System Performance Evaluation
Cooperative
• Suite of benchmarks
•
•
•
•
– Created by a set of companies to improve the
measurement and reporting of CPU
performance
SPEC2006 is the latest suite
SPECint = 12 integer programs
SPECfloat = 17 floating point programs
Since SPEC requires running applications
on real hardware, the memory system has
a significant effect on performance
– Reported with results
Performance Reports
Hardware
Model number
CPU
FPU (floating point)
Number of CPU
Cache size per CPU
Memory
Disk subsystem
Network interface
Powerstation 550
41.67-MHz POWER 4164
Integrated
1
64K data/8k instruction
64 MB
2 400-MB SCSI
N/A
Software
OS type and revision
Compiler revision
Other software
File system type
Firmware level
AIX Ver. 3.1.5
AIX XL C/6000 Ver. 1.1.5
AIX XL Fortran Ver. 2.2
None
AIX
N/A
System
Tuning parameters
Background load
System state
None
None
Multi-user (single-user login)
Guiding principle is reproducibility (report environment & experiments setup)
The SPEC Benchmarks
Execution time on SUN SPARCstation 10/40
SPEC ratio =
Execution time on the measure machine
• Bigger numeric values of the SPEC ratio
indicate faster machine
• 1997 “historical” reference machine
SPEC95 for Pentium and
Pentium Pro
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3
3
2
2
1
1
0
0
50
100
150
Clock rate (MHz)
200
Pentium
Pentium Pro
250
50
100
150
Clock rate (MHz)
200
250
Pentium
Pentium Pro
• The performance measured may be different on other
Pentium-based hardware with different memory
system and using different compilers
– At the same clock rate, the SPECint95 measure shows that
Pentium Pro is 1.4-1.5 times faster while the SPECfp95
shows that it is 1.7-1.8 times faster
– When the clock rate is increased by a certain factor, the
processor performance increases by a lower factor
Comparing & Summarizing
Performance
Program 1 (seconds)
Program 2 (seconds)
Total time (seconds)
Computer A
1
1000
1001
Computer B
10
100
110
• Wrong summary can present a confusing picture
– A is 10 times faster than B for program 1
– B is 10 times faster than A for program 2
• Total execution time is a consistent summary measure
• Relative execution times for the same workload
– Assuming that programs 1 and 2 are executing for the same number of
times on computers A and B
CPU Performance (B) Total execution time (A) 1001
=
=
= 9.1
CPU Performance (A) Total execution time (B) 110
Execution time is the only valid and unimpeachable measure of performance
Effect of Compilation
800
700
600
500
400
300
200
100
0
gcc
espresso
spice
doduc
nasa7
li
eqntott
matrix300
fpppp
tomcatv
Benchmark
Compiler
Enhanced compiler
App. and arch. specific optimization can dramatically impact performance
Performance Summary
1n
Arithmetic Mean (AM) = å Execution_ Time i
n i =1
n
Weighted Arithmetic Mean (WAM) = å wi ´ Execution_Timei
i =1
Where: n is the number of programs executed
wi is a weighting factor that indicates the frequency of executing program i
n
with å w i = 1 and
i =1
0 £ wi £ 1
• Weighted arithmetic means summarize performance while tracking exec. time
• Never use AM for normalizing time relative to a reference machine
Time on A Time on B
Program 1
Program 2
AM of normalized time
AM of time
1
1000
10
100
500.5
55
Norm. to A Norm. to B
A
B
A
B
1
10
0.1
1
1
0.1
10
1
1
5.05 5.05
1
1
0 . 1 1 9.1
1
Price-Performance Metric
SPECbase CINT2000
• Different results are obtained for other benchmarks, e.g. SPEC CFP2000
• With the exception of the Sunblade price-performance metrics were
consistent with performance
SPEC CINT2000 per $1000 in price
Prices reflects those of July 2001
Download