Lecture Set 4

Computer Systems Organization:
Lecture 4
Ankur Srivastava
University of Maryland, College Park
Based on Slides from Mary Jane Irwin ( www.cse.psu.edu/~mji )
www.cse.psu.edu/~cg431
[Adapted from Computer Organization and Design,
Patterson & Hennessy, © 2005, UCB]
ENEE350
Performance Metrics

Purchasing perspective

given a collection of machines, which has the
- best performance ?
- least cost ?
- best cost/performance?

Design perspective

faced with design options, which has the
- best performance improvement ?
- least cost ?
- best cost/performance?

Both require

- basis for comparison
- metric for evaluation
Our goal is to understand what factors in the architecture
contribute to overall system performance and the relative
importance (and cost) of these factors
Defining (Speed) Performance

Normally interested in reducing

Response time (aka execution time) – the time between the start
and the completion of a task
- Important to individual users

Thus, to maximize performance, need to minimize execution time
performance_X = 1 / execution_time_X
If X is n times faster than Y, then
performance_X / performance_Y = execution_time_Y / execution_time_X = n

Throughput – the total amount of work done in a given time
- Important to data center managers
- Decreasing response time almost always improves throughput
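
As a small sketch of the performance ratio defined on this slide, the following Python helper computes how many times faster X is than Y from two execution times. The function name `speedup` and the sample times (10 s and 15 s) are illustrative assumptions, not values from the lecture.

```python
def speedup(execution_time_x: float, execution_time_y: float) -> float:
    """Return n such that X is n times faster than Y.

    performance = 1 / execution_time, so
    performance_x / performance_y = execution_time_y / execution_time_x.
    """
    if execution_time_x <= 0 or execution_time_y <= 0:
        raise ValueError("execution times must be positive")
    return execution_time_y / execution_time_x


# Hypothetical measurements: X finishes in 10 s, Y in 15 s.
n = speedup(10.0, 15.0)
print(f"X is {n:.2f} times faster than Y")
```

A result of exactly 1 means the two machines perform equally on that task; a result below 1 means X is the slower machine.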
Performance Factors
1. ECE244 Recall: Sequential Systems Need Synchronizing Clocks
2. A Computer is a Sequential System and has a Clock
3. Each Instruction Takes up a few Clock Cycles to Execute

CPU execution time for a program = # CPU clock cycles for a program x clock cycle time
or
CPU execution time for a program = # CPU clock cycles for a program / clock rate

Can improve performance by reducing either the length
of the clock cycle or the number of clock cycles required
for a program
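
The two forms of the CPU execution time relation can be checked against each other with a short Python sketch. The function names and the sample program (2 billion clock cycles on a 2 GHz clock) are assumptions chosen only for illustration.

```python
def cpu_time_from_cycle_time(clock_cycles: float, clock_cycle_s: float) -> float:
    """CPU execution time = # CPU clock cycles x clock cycle time."""
    return clock_cycles * clock_cycle_s


def cpu_time_from_clock_rate(clock_cycles: float, clock_rate_hz: float) -> float:
    """CPU execution time = # CPU clock cycles / clock rate."""
    return clock_cycles / clock_rate_hz


# Hypothetical program: 2 billion cycles on a 2 GHz clock (0.5 ns cycle).
cycles = 2_000_000_000
t_a = cpu_time_from_cycle_time(cycles, 0.5e-9)
t_b = cpu_time_from_clock_rate(cycles, 2e9)
print(f"{t_a:.3f} s via cycle time, {t_b:.3f} s via clock rate")

# Halving the cycle count or halving the cycle time both halve the CPU time.
print(f"{cpu_time_from_cycle_time(cycles / 2, 0.5e-9):.3f} s with half the cycles")
print(f"{cpu_time_from_cycle_time(cycles, 0.25e-9):.3f} s with half the cycle time")
```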
Review: Machine Clock Rate

Clock rate (MHz, GHz) is inverse of clock cycle time
(clock period)
CC = 1 / CR
[Figure: clock waveform marking one clock period]
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec clock cycle => 1 GHz clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
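
The conversions listed on this slide all follow from CC = 1 / CR. A minimal Python sketch that recomputes them is given below; the helper name `clock_rate_hz` is an assumption made for this example.

```python
def clock_rate_hz(clock_cycle_s: float) -> float:
    """Clock rate is the inverse of the clock cycle time: CR = 1 / CC."""
    if clock_cycle_s <= 0:
        raise ValueError("clock cycle time must be positive")
    return 1.0 / clock_cycle_s


# Clock cycle times from the slide, expressed in seconds.
cycle_times = [
    ("10 nsec", 10e-9),
    ("5 nsec", 5e-9),
    ("2 nsec", 2e-9),
    ("1 nsec", 1e-9),
    ("500 psec", 500e-12),
    ("250 psec", 250e-12),
    ("200 psec", 200e-12),
]

for label, cc in cycle_times:
    rate_mhz = clock_rate_hz(cc) / 1e6  # report in MHz; 1000 MHz = 1 GHz
    print(f"{label} clock cycle => {rate_mhz:g} MHz clock rate")
```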
Clock Cycles per Instruction

Not all instructions take the same amount of time to
execute (different instructions need different numbers of
clock cycles). For example, MUL takes more cycles than
ADD.

# CPU clock cycles for a program = # Instructions for a program x Average clock cycles per instruction

Clock cycles per instruction (CPI) – the average number
of clock cycles each instruction takes to execute

A way to compare two different implementations of the same ISA
                                  A    B    C
CPI for this instruction class    1    2    3
THE Performance Equation

Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
or
CPU time = Instruction_count x CPI / clock_rate
These equations separate the three key factors that
affect performance
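
To see how the three factors interact, here is a short Python sketch that applies the performance equation to two implementations of the same ISA. The machine parameters (instruction count, CPI values, and cycle times) are hypothetical and are not taken from the lecture.

```python
def cpu_time(instruction_count: float, cpi: float, clock_cycle_s: float) -> float:
    """CPU time = Instruction_count x CPI x clock_cycle."""
    return instruction_count * cpi * clock_cycle_s


def cpu_time_from_rate(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instruction_count x CPI / clock_rate."""
    return instruction_count * cpi / clock_rate_hz


# Same program (same ISA, so same instruction count) on two hypothetical machines.
instruction_count = 1_000_000_000
time_a = cpu_time(instruction_count, cpi=2.0, clock_cycle_s=250e-12)  # 4 GHz, CPI 2.0
time_b = cpu_time(instruction_count, cpi=1.2, clock_cycle_s=500e-12)  # 2 GHz, CPI 1.2

print(f"Machine A: {time_a:.2f} s, Machine B: {time_b:.2f} s")
print(f"A is {time_b / time_a:.2f} times faster than B")
```

In this sketch the machine with the higher CPI still wins because its clock cycle is half as long, which is why no single factor can be judged in isolation.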
THE Performance Equation

Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
Instruction Count: Depends on the kind of instructions supported by the
Architecture.
For example, a multiply operation in C could be implemented as a
sequence of Adds in assembly code, but the number of instructions
would be large. Having a dedicated Mul instruction reduces the
total number of instructions in the program.
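
The effect on instruction count can be illustrated with a rough Python sketch that multiplies using additions only and counts how many adds it performs. The function name is hypothetical, and the count deliberately ignores loop and branch overhead, so it understates the real instruction count of such a sequence.

```python
from typing import Tuple


def multiply_by_repeated_add(a: int, b: int) -> Tuple[int, int]:
    """Multiply two non-negative integers using additions only.

    Returns (product, number_of_add_operations). Loop overhead
    (compare, branch, counter update) is not counted.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative operands only")
    product = 0
    add_count = 0
    for _ in range(b):
        product += a      # one Add per iteration
        add_count += 1
    return product, add_count


product, adds = multiply_by_repeated_add(7, 12)
print(f"7 x 12 = {product} using {adds} Adds (a dedicated Mul would be 1 instruction)")
```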
THE Performance Equation

Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
CPI: Depends on how complicated the instructions are. More complex
instructions need more clocks to execute. For example, the Mul instruction
in MIPS takes more clocks than the Add instruction in MIPS.
Hence, if the Compiler and Assembler choose more complex
instructions, they will increase the CPI but may reduce the
total number of instructions.
Computing the Effective CPI

Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
Given a specific Computer Architecture (MIPS for instance), each
instruction i can be associated with the number of clocks that it
needs, Ci.
Given a C/Java program, the compiler and assembler decide which
instructions from the available instruction set to choose; this affects
both the number of instructions and the CPI. Let us suppose that
they end up choosing ICi instances of instruction i.
Then the effective CPI becomes (here n is the total number of
instructions in the program, i.e. the sum of all ICi)
Overall effective CPI = Σ_i (Ci x ICi) / n
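
A minimal Python sketch of the effective CPI calculation follows. The helper name `effective_cpi` and the instruction counts in the sample mix are assumptions made for illustration; each pair holds the clocks per instruction (Ci) and the number of times that instruction was chosen (ICi).

```python
from typing import Sequence, Tuple


def effective_cpi(classes: Sequence[Tuple[float, int]]) -> float:
    """Overall effective CPI = sum(Ci x ICi) / n, with n = sum(ICi).

    Each entry is (Ci, ICi): clocks needed by instruction i and how
    many times the compiler/assembler chose it.
    """
    n = sum(ic for _, ic in classes)
    if n == 0:
        raise ValueError("the program must contain at least one instruction")
    total_clocks = sum(c * ic for c, ic in classes)
    return total_clocks / n


# Hypothetical program of 1000 instructions.
mix = [(1, 500), (5, 200), (3, 100), (2, 200)]
print(f"effective CPI = {effective_cpi(mix):.2f}")
```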
Computing the Effective CPI

Hence the effective CPI depends on

- The kind of instructions (instruction set) supported by the
  Architecture
- The choice of instructions from this instruction set by the
  compiler and assembler
Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle

                          Instruction_count   CPI   clock_cycle
Algorithm                 X                   X
Programming language      X                   X
Compiler                  X                   X
ISA (Instruction Set)     X                   X     X
Processor organization                        X     X
Technology                                          X
A Simple Example
Op       Freq   CPIi   Freq x CPIi   Load = 2 cycles   Branch = 1 cycle   2 ALU at once
ALU      50%    1      .5            .5                .5                 .25
Load     20%    5      1.0           .4                1.0                1.0
Store    10%    3      .3            .3                .3                 .3
Branch   20%    2      .4            .4                .2                 .4
Σ =                    2.2           1.6               2.0                1.95

How much faster would the machine be if a better architecture
reduced the average load time to 2 cycles?
CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster

How does this compare with using branch prediction to shave
a cycle off the branch time?
CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster

What if two ALU instructions could be executed at once?
CPU time new = 1.95 x IC x CC so 2.2/1.95 means 12.8% faster
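
The three what-if questions can be recomputed with a short Python sketch. The frequencies and CPI values come from the table in this example; the names `BASE_MIX`, `mix_cpi`, and `with_cpi` are assumptions introduced here.

```python
# (frequency, CPI) per instruction class, from the example table.
BASE_MIX = {
    "ALU": (0.50, 1.0),
    "Load": (0.20, 5.0),
    "Store": (0.10, 3.0),
    "Branch": (0.20, 2.0),
}


def mix_cpi(mix):
    """Effective CPI = sum over classes of Freq x CPIi."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix, op, new_cpi):
    """Return a copy of the mix with one class's CPI replaced."""
    changed = dict(mix)
    freq, _ = changed[op]
    changed[op] = (freq, new_cpi)
    return changed


base_cpi = mix_cpi(BASE_MIX)
scenarios = {
    "load takes 2 cycles": with_cpi(BASE_MIX, "Load", 2.0),
    "branch takes 1 cycle": with_cpi(BASE_MIX, "Branch", 1.0),
    "two ALU ops at once": with_cpi(BASE_MIX, "ALU", 0.5),
}

# IC and CC are unchanged, so the speedup is just the ratio of CPIs.
for name, mix in scenarios.items():
    new_cpi = mix_cpi(mix)
    gain = (base_cpi / new_cpi - 1.0) * 100.0
    print(f"{name}: CPI {new_cpi:.2f}, {gain:.1f}% faster")
```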
Another Example

Let us suppose that the ISA has 3 kinds of instructions

- A: CPI = 1
- B: CPI = 2
- C: CPI = 3

Let the Compiler/Assembler generate 2 kinds of code
Code 1: Has 2 from A, 1 from B and 2 from C
Code 2: Has 4 from A, 1 from B and 1 from C
Which is better?
Code 1: Total number of clocks = 2x1 + 1x2 + 2x3 = 10
Code 2: Total number of clocks = 4x1 + 1x2 + 1x3 = 9
Code 1 execution time = 10 x clock cycle
Code 2 execution time = 9 x clock cycle
Therefore Code 2 is faster even though it has more
instructions
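
The comparison of the two code sequences can be reproduced with a brief Python sketch. The class CPIs and instruction counts are those given in the example; the names `CLASS_CPI` and `total_clocks` are assumptions for this illustration.

```python
# CPI of each instruction class in the example ISA.
CLASS_CPI = {"A": 1, "B": 2, "C": 3}


def total_clocks(counts):
    """Total clock cycles = sum over classes of (count x CPI of that class)."""
    return sum(CLASS_CPI[cls] * n for cls, n in counts.items())


code1 = {"A": 2, "B": 1, "C": 2}
code2 = {"A": 4, "B": 1, "C": 1}

for name, counts in (("Code 1", code1), ("Code 2", code2)):
    instructions = sum(counts.values())
    clocks = total_clocks(counts)
    print(f"{name}: {instructions} instructions, {clocks} clocks, "
          f"average CPI {clocks / instructions:.2f}")
```

Code 2 executes one more instruction but has a lower average CPI, so with the same clock cycle it needs fewer clocks overall.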
SPEC Benchmarks www.spec.org
Integer benchmarks
gzip       compression
vpr        FPGA place & route
gcc        GNU C compiler
mcf        Combinatorial optimization
crafty     Chess program
parser     Word processing program
eon        Computer visualization
perlbmk    perl application
gap        Group theory interpreter
vortex     Object oriented database
bzip2      compression
twolf      Circuit place & route

FP benchmarks
wupwise    Quantum chromodynamics
swim       Shallow water model
mgrid      Multigrid solver in 3D fields
applu      Parabolic/elliptic pde
mesa       3D graphics library
galgel     Computational fluid dynamics
art        Image recognition (NN)
equake     Seismic wave propagation simulation
facerec    Facial image recognition
ammp       Computational chemistry
lucas      Primality testing
fma3d      Crash simulation fem
sixtrack   Nuclear physics accel
apsi       Pollutant distribution
Example SPEC Ratings