What is Execution Time - Department of Electrical Engineering

CSE2021: Computer Organization
Instructor: Dr. Amir Asif
Department of Computer Science
York University
Handout # 2: Measuring Performance
Topics:
1. Performance: Definition
2. Performance Metrics: CPU Execution Time and Throughput
3. Benchmarks: SPEC’95
4. Alternative Performance Metrics: MIPS and FLOPS
Patterson: Sections 1.4 – 1.9.
Analogy with Commercial Airplanes
What does it mean to say that one computer is better than another?
| Airplane | Passenger capacity | Cruising range (miles) | Cruising speed (mph) | Passenger throughput (passengers × mph) |
|---|---|---|---|---|
| Boeing 777 | 375 | 4630 | 610 | 375 × 610 = 228,750 |
| Boeing 747 | 470 | 4150 | 610 | 470 × 610 = 286,700 |
| BAC/Sud Concorde | 132 | 4000 | 1350 | 132 × 1350 = 178,200 |
| Douglas DC-8-50 | 146 | 8720 | 544 | 146 × 544 = 79,424 |
- To know which of the four planes exhibits the best performance, we need to define a criterion for measuring “performance”.

| Performance criterion | Winner |
|---|---|
| Speed | BAC/Sud Concorde |
| Capacity | Boeing 747 |
| Range | Douglas DC-8-50 |
| Throughput | Boeing 747 (the slide lists Airbus A3xx, which is not among the four planes compared) |
Computer Performance (1)
- Performance of a computer is based on the following criteria:
  1. Response/Execution Time: Elapsed time between the start and the end of one task.
  2. Throughput: Total number of tasks finished in a given interval of time.
- An IT manager will be interested in a higher overall throughput, while a computer user will want a lower execution time for their own task.
- Using execution time as the criterion, the performance of a machine X is defined as
  $$\text{Performance}_X = \frac{1}{\text{Execution time}_X}$$
- The performance ratio (n) between two machines X and Y is defined as
  $$n = \frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X}$$
Computer Performance (2)
Activity 1: If machine X runs a program in 30 seconds and machine Y runs the same program in 45
seconds, how much faster is X than Y?
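A minimal Python sketch of the performance ratio, applied to the numbers in Activity 1. The function name `performance_ratio` is an illustrative choice and is not part of the handout.

```python
def performance_ratio(exec_time_x: float, exec_time_y: float) -> float:
    """Return n = Performance_X / Performance_Y = exec_time_y / exec_time_x."""
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Activity 1: X runs the program in 30 s, Y runs it in 45 s.
n = performance_ratio(30.0, 45.0)
print(f"X is {n:.2f} times as fast as Y")
```

A ratio above 1 means X is the faster machine. A ratio below 1 means Y is faster.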
Activity 2: Discuss which of the following two options is better suited to enhance performance from a
user’s perspective:
(a) Upgrading a machine to a faster CPU
(b) Adding processors to the machine so that multiple processors are used for
different tasks.
Repeat the discussion from an IT manager’s perspective.
What is Execution Time (1)?
1. Elapsed Time/Response Time/Wall Clock Time: the clock time it takes from the start to the end of a program or a task.
   - Since a computer is a timeshared machine (running several programs simultaneously), the elapsed time depends on the number and complexity of the other programs running on the machine.
2. CPU Time: the execution time that the CPU spends on completing a task.
   - CPU time does not include time spent running other programs or waiting for I/O to become free.
   - CPU time can be further broken down into:
     (a) User CPU time: CPU time spent executing the measured program
     (b) System CPU time: CPU time spent in the operating system performing tasks on behalf of the measured program
3. The Unix command time can be used to determine the elapsed time and CPU time for a particular program.
   Syntax:
       time <name_of_program>
   Result:
       1.180u 0.130s 0:0:44.95 2.9%
   where
       1.180u     = CPU user time (seconds)
       0.130s     = CPU system time (seconds)
       0:0:44.95  = elapsed time in h:m:s
       2.9%       = CPU time / elapsed time
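The same three quantities can be observed from inside a program. The sketch below uses Python's standard `os.times()` and `time.perf_counter()`. The helper `measure` and the workload `sample_task` are hypothetical names chosen for illustration, and the printed line only imitates the layout of the `time` output.

```python
import os
import time


def measure(task):
    """Run task() once; return elapsed, user CPU and system CPU time in seconds."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    task()
    wall_after = time.perf_counter()
    cpu_after = os.times()
    return {
        "elapsed": wall_after - wall_before,
        "user": cpu_after.user - cpu_before.user,
        "system": cpu_after.system - cpu_before.system,
    }


def sample_task():
    """Hypothetical workload: some computation followed by a short wait."""
    sum(i * i for i in range(200_000))
    time.sleep(0.05)  # waiting adds to elapsed time but not to CPU time


result = measure(sample_task)
cpu_time = result["user"] + result["system"]
print(f"{result['user']:.3f}u {result['system']:.3f}s "
      f"{result['elapsed']:.2f} elapsed "
      f"{100 * cpu_time / result['elapsed']:.1f}%")
```

Because the task spends part of its time sleeping, the CPU time is expected to be smaller than the elapsed time. The exact figures vary from run to run and from machine to machine.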
What is Execution Time (2)?
4. Performance based on User CPU time is called the CPU performance:
   $$\text{CPU Performance}_X = \frac{1}{\text{CPU Execution time}_X}$$
5. Performance based on System time is called the system performance:
   $$\text{System Performance}_X = \frac{1}{\text{System Execution time}_X}$$
6. Vendors specify the speed of a computer in terms of clock rate (clock cycles per second).
   For example, a 1 GHz Pentium is generally believed to be faster than a 500 MHz Pentium.
   We define the clock cycles formally next.

Multiples defined:
$1\text{K} = 2^{10} \approx 10^{3}$ (kilo)
$1\text{M} = 2^{20} \approx 10^{6}$ (mega)
$1\text{G} = 2^{30} \approx 10^{9}$ (giga)
$1\text{T} = 2^{40} \approx 10^{12}$ (tera)
$1\text{P} = 2^{50} \approx 10^{15}$ (peta)
Clock (1)
[Figure: clock signal alternating between binary 1 and binary 0, with one cycle, the leading edge, and the trailing edge marked]
- All events in a computer are synchronized to the clock signal.
- The clock signal is therefore received by every HW component in a computer.
- Clock cycle time: the duration of 1 cycle of the clock signal.
- Clock cycle rate: the inverse of the clock cycle time.
- CPU execution time is therefore defined as
  $$\text{CPU Execution Time for a program} = \text{CPU Clock Cycles for a program} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles for a program}}{\text{Clock Rate}}$$
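As a small illustration of the relation between clock cycles, clock rate and CPU execution time, the Python sketch below evaluates it for the 500 MHz and 1 GHz clock rates mentioned earlier. The cycle count `CYCLES` is an assumed value chosen only to give round numbers.

```python
def cpu_execution_time(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU execution time in seconds = clock cycles / clock rate."""
    if clock_rate_hz <= 0:
        raise ValueError("clock rate must be positive")
    return clock_cycles / clock_rate_hz


CYCLES = 2e9  # assumed number of clock cycles needed by some program

t_500mhz = cpu_execution_time(CYCLES, 500e6)
t_1ghz = cpu_execution_time(CYCLES, 1e9)
print(f"500 MHz: {t_500mhz:.1f} s, 1 GHz: {t_1ghz:.1f} s")
```

Doubling the clock rate halves the execution time only if the program needs the same number of clock cycles on both machines.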
Clock (2)
- Timing programs generally return the average number of clock cycles needed per instruction (denoted by CPI)
  $$\text{CPU Execution Time for a program} = \frac{\text{CPU Clock Cycles for a program}}{\text{Clock Rate}} = \frac{\sum_{i=1}^{n} \left(\text{Instruction Count}_i \times \text{CPI}_i\right)}{\text{Clock Rate}}$$
  where Instruction Count_i is the number of instructions of class i in the program and CPI_i is the CPI of class i.

Activity 3: For a CPU, instructions from a high-level language are classified in 3 classes

| Instruction class | A | B | C |
|---|---|---|---|
| CPI for the instruction class | 1 | 2 | 3 |

Two SW implementations with the following instruction counts are being considered

| Instruction counts per instruction class | A | B | C |
|---|---|---|---|
| Implementation 1 | 2 | 1 | 2 |
| Implementation 2 | 4 | 1 | 1 |

Which implementation executes the higher number of instructions? Which runs faster?
What is the CPI for each implementation?
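One way to check an answer to Activity 3 is to sum instruction count × CPI over the classes, as in the Python sketch below. The helper name `summarize` is illustrative. Both implementations are assumed to run at the same clock rate, so the one with fewer clock cycles runs faster.

```python
CPI = {"A": 1, "B": 2, "C": 3}

IMPLEMENTATIONS = {
    "Implementation 1": {"A": 2, "B": 1, "C": 2},
    "Implementation 2": {"A": 4, "B": 1, "C": 1},
}


def summarize(counts, cpi):
    """Return (instruction count, clock cycles, average CPI) for one implementation."""
    instructions = sum(counts.values())
    cycles = sum(counts[cls] * cpi[cls] for cls in counts)
    return instructions, cycles, cycles / instructions


for name, counts in IMPLEMENTATIONS.items():
    instructions, cycles, avg_cpi = summarize(counts, CPI)
    print(f"{name}: {instructions} instructions, {cycles} cycles, CPI = {avg_cpi:.2f}")
```

Implementation 2 executes more instructions (6 against 5) but needs fewer clock cycles (9 against 10), so instruction count alone does not decide which one is faster.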
Performance Comparison (1)
To compare performance between two computers,
1. Select a set of programs that represent the workload
2. Run these programs on each computer
3. Compare the average execution time of each computer
4. At times, the geometric mean is used:
   $$\text{Geometric Mean} = \sqrt[n]{\prod_{i=1}^{n} \text{Execution Time}_i}$$
Activity 4: Based on the arithmetic and geometric means of the execution times, which of the two
computers is faster?

| Execution Time | Computer A | Computer B |
|---|---|---|
| Program 1 | 2 | 10 |
| Program 2 | 100 | 105 |
| Program 3 | 1000 | 100 |
| Program 4 | 25 | 75 |
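The two means in Activity 4 can be computed with a few lines of Python. This is a sketch using the execution times from the table. The function names are illustrative, and `math.prod` assumes Python 3.8 or later.

```python
import math


def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))


# Execution times for Programs 1 to 4, taken from the Activity 4 table.
TIMES = {
    "Computer A": [2, 100, 1000, 25],
    "Computer B": [10, 105, 100, 75],
}

for name, times in TIMES.items():
    print(f"{name}: arithmetic mean = {arithmetic_mean(times):.2f}, "
          f"geometric mean = {geometric_mean(times):.2f}")
```

For these numbers the two means disagree. Computer B has the lower arithmetic mean, while Computer A has the lower geometric mean, because the single long run of Program 3 dominates the arithmetic mean of A.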
Performance Comparison: Benchmarks (2)
— Benchmarks: are standard programs chosen to compare performance between different
computers.
— Benchmarks are generally chosen from the applications that a user would typically use the
computer to execute.
— Benchmarks can be classified in three categories:
1. Real applications reflecting the expected workload, e.g., multimedia, computer
visualization, database, or macromedia director applications
2. Small benchmarks are specialized code segments with a mixture of different types of
instructions
3. Benchmark suites containing a standard set of real programs and applications. A
commonly used suite is SPEC (System Performance Evaluation Cooperative) with
different versions available, e.g., SPEC89, SPEC’92, SPEC’95, SPEChpc96, SPEC
CPU2006 and SPECINTC2006 suites.
Performance Comparison: SPEC Suite (2)
- The SPEC’95 suite has a total of 18 programs (integer and floating point). However, SPEC CPU2000 has a total of 32 programs.
- The SPEC ratio for a program is defined as the ratio of the execution time of the program on a Sun Ultra 5/10 (300 MHz processor) to the execution time on the measured machine.
- CINT2000 is the geometric mean of the SPEC ratios obtained from the integer programs. The geometric mean is defined as
  $$\text{CINT2000} = \sqrt[n]{\prod_{i=1}^{n} \text{SPEC ratio}_i}$$
- CFP2000 is the geometric mean of the SPEC ratios from the floating-point programs.
Activity 5: Complete the following table to predict the performance of machines A and B

| | Time on A (seconds) | Time on B (seconds) | Normalized to A: A | Normalized to A: B | Normalized to B: A | Normalized to B: B |
|---|---|---|---|---|---|---|
| Program 1 | 5 | 25 | | | | |
| Program 2 | 125 | 25 | | | | |
| Arithmetic Mean | | | | | | |
| Geometric Mean | | | | | | |
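The normalized columns of Activity 5 can be filled in programmatically. The Python sketch below divides each machine's times by those of the chosen reference machine and then takes both means. The helper names are illustrative, and `math.prod` assumes Python 3.8 or later.

```python
import math

# Execution times in seconds for Program 1 and Program 2 (Activity 5).
TIMES = {"A": [5, 125], "B": [25, 25]}


def normalize(times, reference):
    """Divide every machine's times by the reference machine's times."""
    ref = times[reference]
    return {m: [t / r for t, r in zip(vals, ref)] for m, vals in times.items()}


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1.0 / len(values))


for reference in ("A", "B"):
    for machine, ratios in normalize(TIMES, reference).items():
        print(f"normalized to {reference}, machine {machine}: {ratios}, "
              f"AM = {arithmetic_mean(ratios):.2f}, GM = {geometric_mean(ratios):.2f}")
```

With these times the arithmetic mean of the normalized values changes its verdict with the reference machine: the non-reference machine always comes out at 2.6. The geometric mean is 1 for both machines whichever reference is used.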
Performance Comparison: CINT2006 for Opteron X4 2356 (3)
| Name | Description | Instruct. count ×10^9 | CPI | Clock cycle time (ns) | Exec time (s) | Reference time (s) | SPECratio = Ref. time / Exec time |
|---|---|---|---|---|---|---|---|
| perl | Interpreted string processing | 2,118 | 0.75 | 0.40 | 637 | 9,777 | 15.3 |
| bzip2 | Block-sorting compression | 2,389 | 0.85 | 0.40 | 817 | 9,650 | 11.8 |
| gcc | GNU C Compiler | 1,050 | 1.72 | 0.40 | 724 | 8,050 | 11.1 |
| mcf | Combinatorial optimization | 336 | 10.00 | 0.40 | 1,345 | 9,120 | 6.8 |
| go | Go game (AI) | 1,658 | 1.09 | 0.40 | 721 | 10,490 | 14.6 |
| hmmer | Search gene sequence | 2,783 | 0.80 | 0.40 | 890 | 9,330 | 10.5 |
| sjeng | Chess game (AI) | 2,176 | 0.96 | 0.40 | 837 | 12,100 | 14.5 |
| libquantum | Quantum computer simulation | 1,623 | 1.61 | 0.40 | 1,047 | 20,720 | 19.8 |
| h264avc | Video compression | 3,102 | 0.80 | 0.40 | 993 | 22,130 | 22.3 |
| omnetpp | Discrete event simulation | 587 | 2.94 | 0.40 | 690 | 6,250 | 9.1 |
| astar | Games/path finding | 1,082 | 1.79 | 0.40 | 773 | 7,020 | 9.1 |
| xalancbmk | XML parsing | 1,058 | 2.70 | 0.40 | 1,143 | 6,900 | 6.0 |
| Geometric mean | | | | | | | 11.7 |

Note on the slide: high cache miss rates (presumably marking the programs with high CPI, such as mcf).

$$\text{CINT2006} = \sqrt[n]{\prod_{i=1}^{n} \text{SPEC ratio}_i}$$
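The table can be cross-checked in two ways: execution time should be close to instruction count × CPI × clock cycle time, and the geometric mean of the SPEC ratios should come out near 11.7. The Python sketch below does both. It assumes a clock cycle time of 0.40 ns for every row and reads the gcc and sjeng execution times as 724 s and 837 s, which is what the product of the other columns suggests.

```python
import math

CYCLE_TIME_NS = 0.40  # assumed identical for every program

# (name, instruction count in 10^9, CPI, exec time in s, SPECratio)
ROWS = [
    ("perl", 2118, 0.75, 637, 15.3),
    ("bzip2", 2389, 0.85, 817, 11.8),
    ("gcc", 1050, 1.72, 724, 11.1),
    ("mcf", 336, 10.00, 1345, 6.8),
    ("go", 1658, 1.09, 721, 14.6),
    ("hmmer", 2783, 0.80, 890, 10.5),
    ("sjeng", 2176, 0.96, 837, 14.5),
    ("libquantum", 1623, 1.61, 1047, 19.8),
    ("h264avc", 3102, 0.80, 993, 22.3),
    ("omnetpp", 587, 2.94, 690, 9.1),
    ("astar", 1082, 1.79, 773, 9.1),
    ("xalancbmk", 1058, 2.70, 1143, 6.0),
]


def predicted_exec_time(ic_billions, cpi, cycle_ns=CYCLE_TIME_NS):
    """(ic * 1e9 instructions) * CPI * (cycle_ns * 1e-9 s) = ic * CPI * cycle_ns seconds."""
    return ic_billions * cpi * cycle_ns


def geometric_mean(values):
    """Geometric mean computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


for name, ic, cpi, exec_s, ratio in ROWS:
    print(f"{name:11s} predicted {predicted_exec_time(ic, cpi):7.1f} s, table {exec_s} s")

cint2006 = geometric_mean([row[4] for row in ROWS])
print(f"Geometric mean of SPEC ratios: {cint2006:.1f}")
```

The predicted times differ slightly from the tabulated ones because the CPI column is rounded to two decimals.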
Improving Performance
$$\text{CPU Execution Time for a program} = \frac{\text{Instruction Count in a program} \times \text{CPI}}{\text{Clock Rate}}$$
Performance of a CPU can be improved by:
1. Increasing the clock rate (decreasing the clock cycle time)
2. Enhancements in the compiler to decrease the instruction count in a program
3. Improvement in the CPU to decrease the clock cycles per instruction (CPI)
Unfortunately, factors 1 to 3 are not independent. For example, if you increase the clock
frequency then the CPI may also increase.
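To see why the three factors have to be considered together, the Python sketch below compares a baseline design with one that has a higher clock rate but also a higher CPI. All numbers are hypothetical and serve only to illustrate the formula.

```python
def cpu_time(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU execution time in seconds = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


INSTRUCTIONS = 1e9  # hypothetical program size

# Hypothetical baseline: CPI of 2.0 at 1 GHz.
baseline = cpu_time(INSTRUCTIONS, cpi=2.0, clock_rate_hz=1.0e9)

# Hypothetical redesign: clock rate up 25%, but CPI up 30%.
faster_clock = cpu_time(INSTRUCTIONS, cpi=2.6, clock_rate_hz=1.25e9)

print(f"baseline: {baseline:.2f} s, faster clock: {faster_clock:.2f} s")
```

In this made-up case the design with the faster clock is slightly slower overall, since the CPI grew by more than the clock rate did.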
Power Trends (1)
— One way of enhancing performance is to increase clock rate (implying a reduction in clock
cycle time).
— A consequence of increasing clock rate is an increase in power dissipation.
Power Trends (2)
- Dynamic power dissipation is computed from the expression
  $$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$
- In the previous figure, note that while clock rates have increased by a factor of roughly 100 (25 MHz to 2667 MHz), the power dissipation has only increased by a factor of 20 (4.9 W to 95 W).
- The above is largely a consequence of the operating voltage in CMOS technology going down from 5 V to 1 V.
Activity: What is the impact on dynamic power dissipation if a new processor reduces the
capacitive load, voltage, and clock frequency each by 15%?
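The effect of scaling each term of the dynamic power expression can be checked with a short Python sketch. The function name `power_ratio` is illustrative. A scale of 0.85 stands for a 15% reduction.

```python
def power_ratio(load_scale: float, voltage_scale: float, frequency_scale: float) -> float:
    """New power / old power when load, voltage and frequency are each scaled.

    Power = capacitive load * voltage^2 * frequency, so the voltage scale is squared.
    """
    return load_scale * voltage_scale ** 2 * frequency_scale


ratio = power_ratio(0.85, 0.85, 0.85)
print(f"new power / old power = {ratio:.3f}")
```

Because voltage enters squared, the ratio is 0.85 to the fourth power, so the new processor dissipates a little over half the dynamic power of the old one.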
Uniprocessor Performance
Constrained by power, instruction-level parallelism, and memory latency
SPEC Power Benchmark
- SPECpower reports the power consumption of servers at different workload levels, in 10% increments of target load, over a period of time.

| Target Load % | Performance (ssj_ops/sec) | Average Power (Watts) |
|---|---|---|
| 100% | 231,867 | 295 |
| 90% | 211,282 | 286 |
| 80% | 185,803 | 275 |
| 70% | 163,427 | 265 |
| 60% | 140,160 | 256 |
| 50% | 118,324 | 246 |
| 40% | 92,035 | 233 |
| 30% | 70,500 | 222 |
| 20% | 47,126 | 206 |
| 10% | 23,066 | 180 |
| 0% | 0 | 141 |
| Overall sum | 1,283,590 | 2,605 |
| ∑ssj_ops / ∑power | 493 | |

$$\text{Overall ssj\_ops per Watt} = \frac{\sum_{i=0}^{10} \text{ssj\_ops}_i}{\sum_{i=0}^{10} \text{power}_i}$$
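The overall figure can be reproduced from the per-load measurements. The Python sketch below sums both columns of the table and divides them. The variable names are illustrative.

```python
# (target load %, performance in ssj_ops/sec, average power in Watts)
MEASUREMENTS = [
    (100, 231_867, 295),
    (90, 211_282, 286),
    (80, 185_803, 275),
    (70, 163_427, 265),
    (60, 140_160, 256),
    (50, 118_324, 246),
    (40, 92_035, 233),
    (30, 70_500, 222),
    (20, 47_126, 206),
    (10, 23_066, 180),
    (0, 0, 141),
]

total_ops = sum(ops for _load, ops, _watts in MEASUREMENTS)
total_power = sum(watts for _load, _ops, watts in MEASUREMENTS)
overall_ops_per_watt = total_ops / total_power

print(f"sum of ssj_ops = {total_ops:,}, sum of power = {total_power:,} W")
print(f"overall ssj_ops per Watt = {overall_ops_per_watt:.0f}")
```

The server still draws 141 W at 0% load, so the idle measurement lowers the overall ssj_ops per Watt even though it contributes no work.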
PITFALL: Amdahl’s Law (1)
- Pitfall: improving an aspect of a computer and expecting a proportional improvement in overall performance.
  $$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$
- Suppose a program runs in 100 s on a computer, with multiply operations responsible for 80% of the time. How much should the speed of multiplication be improved to make the program run twice as fast? Five times as fast?
Solution:
For 2 times faster, the target time is 50 s, so 50 = 80/n + 20 and the speed of multiplication should be improved by a factor of n = 8/3.
Five times faster is not possible: the target time is 20 s, and 20 = 80/n + 20 has no solution because the unaffected 20 s alone already uses up the whole budget.
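The same reasoning can be written as a small Python function that solves the Amdahl's Law relation for the improvement factor. The name `required_improvement` and the choice to return `None` when the target is unreachable are assumptions made for this sketch.

```python
from typing import Optional


def required_improvement(total_time: float, affected_time: float,
                         target_speedup: float) -> Optional[float]:
    """Solve T_improved = T_affected / factor + T_unaffected for factor.

    Returns None when the unaffected part alone already takes at least
    as long as the target time, so no finite factor is enough.
    """
    unaffected_time = total_time - affected_time
    target_time = total_time / target_speedup
    remaining = target_time - unaffected_time
    if remaining <= 0:
        return None
    return affected_time / remaining


# Program of 100 s with 80 s spent in multiply operations.
print(required_improvement(100.0, 80.0, 2.0))  # factor needed for a 2x speedup
print(required_improvement(100.0, 80.0, 5.0))  # None: a 5x speedup is unreachable
```

The unaffected part sets a ceiling on the overall speedup, which in this example is 100 / 20 = 5 and can only be approached, never reached.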
PITFALL: MIPS as a Performance Metric (2)
Do not use MIPS (million instructions per second) as a performance metric
$$\text{MIPS} = \frac{\text{Instruction Count in a program}}{\text{Execution time} \times 10^{6}}$$
Activity 6: For a CPU, instructions from a high-level language are classified in 3 classes

| Instruction class | A | B | C |
|---|---|---|---|
| CPI for the instruction class | 1 | 2 | 3 |

Two SW implementations with the following instruction counts are being considered

| Instruction counts (in billions) for each instruction class | A | B | C |
|---|---|---|---|
| Implementation 1 | 5 | 1 | 1 |
| Implementation 2 | 10 | 1 | 1 |

Assuming that the clock rate is 500 MHz, which code sequence will run faster based on (a)
MIPS and (b) execution time?
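A Python sketch for checking Activity 6: it computes the execution time and the MIPS rating of each implementation from the instruction counts, the per-class CPI and the 500 MHz clock rate. The function names are illustrative.

```python
CLOCK_RATE_HZ = 500e6
CPI = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class (the table gives them in billions).
COUNTS = {
    "Implementation 1": {"A": 5e9, "B": 1e9, "C": 1e9},
    "Implementation 2": {"A": 10e9, "B": 1e9, "C": 1e9},
}


def execution_time(counts) -> float:
    """Seconds = sum over classes of (count * CPI) / clock rate."""
    cycles = sum(counts[cls] * CPI[cls] for cls in counts)
    return cycles / CLOCK_RATE_HZ


def mips(counts) -> float:
    """MIPS = instruction count / (execution time * 10^6)."""
    return sum(counts.values()) / (execution_time(counts) * 1e6)


for name, counts in COUNTS.items():
    print(f"{name}: {execution_time(counts):.0f} s, {mips(counts):.0f} MIPS")
```

Implementation 2 gets the higher MIPS rating (400 against 350) and yet takes longer to run (30 s against 20 s), which is the reason MIPS is listed as a pitfall.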
PITFALLS: Do Not’s (3)
4. Do not use MFLOPS (million floating-point operations per second) as a performance metric
   $$\text{MFLOPS} = \frac{\text{Number of floating-point operations in a program}}{\text{Execution time} \times 10^{6}}$$
5. Do not use peak performance as a performance metric
   Example:
   i860 @ 50 MHz: 100 MFLOPS and 150 MOPS
   MIPS R3000: 16 MFLOPS and 33 MOPS
   Running SPEC benchmarks, the R3000 was 15% faster in execution time based on the geometric mean
PITFALLS: Do Not’s (4)
4. Do not use synthetic benchmarks (vendor created) to predict performance
   - Commonly used synthetic benchmarks are Whetstone and Dhrystone
   - Whetstone was based on Algol in an engineering environment and later converted to Fortran
   - Dhrystone was written in Ada for systems programming environments and later converted to C
   - Such synthetic benchmarks do not reflect the applications typically run by a user
5. Do not use the arithmetic mean of execution time to predict performance. Geometric means provide better estimates.