COURSE NOTES:
ECE8405 COMPUTER ARCHITECTURE
Spring 2004
Ed Char
1. INTRODUCTION
Summary: Hierarchy as a way of “controlling” complexity, overview.
Architecture is at the heart of computer engineering
 Computers are driving technology, the economy, entertainment and
society.
 Advances are coming at a breakneck pace:
Moore’s Law: processor and memory performance roughly doubles
every 1.5 years! (Due to technology AND organization).
 Most nontrivial digital systems are based on computers.
 The computer has become the most complicated (centralized) human-made object, with hundreds of millions of identifiable parts.
 How can such a complicated system work reliably? By…
Hierarchical organization (object-oriented design) makes it possible
 Many different systems have to come together to allow not only
functionality, but high PERFORMANCE metrics.
 Central is the instruction set: the key component of the hardware-software interface. This is the set of instructions implemented by the
processor, and the bit patterns that encode them.
 Let’s consider the major component categories of this hierarchy:
SOFTWARE
 Networking
 Operating Systems
 High-level languages and compilers
 Assembly language and assemblers, linkers, loaders
              |
       INSTRUCTION SET
              |
HARDWARE
 Architecture (registers, datapath and ALU)
 Systems design (CPU, memory, I/O, exceptions)
 Gate-level design, simulation
 VLSI design and validation
… not to mention …
ECONOMICS
 Performance evaluation
 Chip and system packaging
 Clock speeds, CPU size, etc.
Standards define the interface between layers:
 Like in OOP, a public definition hides the complexity at a given layer
e.g. Network protocols, HLL language syntax, O.S. calls, the instruction
set of a microprocessor, inputs/outputs of a VLSI circuit module.
 A “simple” interface specification hides the underlying complexity.
 Allows designers at any level to work at a reasonable complexity.
 Permits design changes and migration to different platforms.
Think of computer engineering from an object-oriented perspective
To conclude this train of thought, let’s consider the design of a new uP:
The design of a new high-performance processor requires:
 Fast time to market to leapfrog competition
 Innovate using the newest and fastest VLSI fabrication line
 New instruction set design, BUT
Instruction-set choices influence performance at all other levels!
(e.g. the OS and HLL compilers may be inefficient if certain instructions are
not included, but hardware implementation of certain instructions may
require large VLSI “real estate”). SO
IT’S AN OPTIMIZATION PROBLEM AT ALL LEVELS!
What actually happens is that the WHOLE PROJECT (instruction set,
compilers, O.S., hardware) is SIMULATED over and over, until the best set
of compromises is reached, giving the best performance at reasonable cost.
Only then is the processor committed to silicon.
Some processors have worked the first time they were integrated, others
have been sold to the public with serious bugs….
Major processors today:
80x86/Pentium/K6, PowerPC, DEC Alpha, MIPS, SPARC, HP
The five basic components of a computer
 Datapath (Chs. 4-6)
 Control (Chs. 5-6)
 Memory (Ch. 7)
 Input and Output (Ch. 8), which count as two of the five
In more detail, what will we cover in this course?
 Performance metrics, because these allow us to determine whether one
machine is better than another, or whether a design modification is
worthwhile.
 Instruction set design principles and examples (MIPS RISC mainly)
 MIPS assembly language and mapping ‘C’ onto MIPS
 ALU implementation concepts (integer and floating point)
 Block-diagram implementation of a basic MIPS processor
 Pipelining the implementation for better performance:
Approaching one instruction per clock
 Superscalar implementation for better performance
Issuing MORE than one instruction per clock!
 Memory hierarchy, including cache and virtual memory
 Multiprocessor systems (time permitting)
Thus, we will look mainly at the hardware aspects of microprocessors, at a
block diagram level.
Why is this stuff important?
 You will learn how today’s computers work.
 Performance analysis will allow you to make educated choices in
purchasing systems
 Software design is helped by (and sometimes requires) an understanding
of the underlying hardware concepts
2. Apples and Oranges: The importance of PERFORMANCE
Summary: performance is key. However, it is quite difficult to evaluate. We
will present and critique various measures that are in use, both within and
between architectures.
Performance measures are key
 They influence numerous purchasing decisions every day!
 Can give a measure of usefulness to the purchaser: how long will you have to wait for results? (e.g. a finite-element analysis)
 Within an architecture, can evaluate the effectiveness of design changes
Textbooks would have you believe that people carefully compare the
price/performance of different systems. Unlikely….
 Most users know what architecture they want (wintel, mac)
 Many then go for the cheapest systems they can buy, leading to problems
with reliability, a bad monitor, inadequate cache memory, software
incompatibilities, etc.
 Some go to a reputable dealer and pay more to get some assurances
 Very few actually shop around much! Why? …..
Performance is difficult to characterize between different architectures
 Every user takes advantage of different capabilities
(e.g. integer vs. floating point)
 Instruction sets differ between architectures
 Superscalarity further muddies the picture (how often are 2/4 instructions
actually issued?)
Consider a first-order model of a processor, built from five blocks:
CONTROL, DATAPATH, I/O, MEMORY SYSTEM, and a CLOCK.
(The clock rate is measured in cycles per second = hertz; the clock cycle time is 1/rate, in seconds.)
Key Features:
 Control determines path of data through system
 Instructions decoded by control
 CLOCK sets rate at which instructions are:
1. Fetched
2. Decoded
3. Executed
How about clock rate as a performance measure?
 One instruction could take several clock cycles
 Different machines have different instructions: one machine may require
half as many instructions to execute a program as another
 DRAM reads may take many clocks, so cache memory performance is
very important
 Newer CPUs can issue multiple instructions each clock cycle!
WITHIN an architecture, it is OK to use clock rate as a performance measure:
Performance(80 MHz) / Performance(60 MHz) = 80 / 60 = 1.33
So we say that the 80 MHz system is 33% (or 1.33 times) faster than the 60
MHz system. Note that we can't simply turn this around and say the 60 MHz
system is 33% slower; with the 80 MHz machine as the base, it is 25% slower.
Another measure is CPI: average clocks per instruction.
 Can be determined by simulation or timing program execution
 OK within a family and using the same program:
CPI = CPUClockCycles / InstructionsInProgram
    = (ExecutionTimeOfProgram x ClockRate) / InstructionsInProgram
Perf(A)/Perf(B) = ExecutionTime(B)/ExecutionTime(A) = CPI(B)/CPI(A)
(assuming the same clock rate and the same instruction count)
 Which program should one execute?
 Somewhat poor comparison between families, although most instruction
sets are becoming more similar (RISC)
How about MIPS: millions of instructions per second?
 Combines clock rate and CPI: MIPS = ClockRate / (CPI x 10^6)
 PEAK MIPS is the fastest rate obtainable; it is not measured MIPS
 Relative MIPS = (ExecTime(ref) x MIPS(ref)) / ExecTime(test)
This works only for the reference program and input
 So again, it depends on the program (or on theory, for peak MIPS); a small
sketch of both metrics follows below
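To make these definitions concrete, here is a minimal C sketch of the two metrics. The clock rate, CPI, reference MIPS rating, and execution times used in main() are made-up illustration values, not measurements of any real machine.

#include <stdio.h>

/* Native MIPS: millions of instructions executed per second. */
double native_mips(double clock_rate_hz, double cpi)
{
    return clock_rate_hz / (cpi * 1e6);
}

/* Relative MIPS: the reference machine's MIPS rating scaled by how much
   faster the test machine runs the same reference program. */
double relative_mips(double exec_time_ref, double mips_ref,
                     double exec_time_test)
{
    return exec_time_ref * mips_ref / exec_time_test;
}

int main(void)
{
    /* hypothetical values for illustration only */
    printf("native MIPS   = %.0f\n", native_mips(500e6, 2.0));         /* 250 */
    printf("relative MIPS = %.0f\n", relative_mips(10.0, 100.0, 4.0)); /* 250 */
    return 0;
}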
You, as user, want the machine to respond as fast as possible, so let’s define
performance in terms of execution time of some program on machine X:
Performance(X) = 1/ExecutionTime(X)
Then we can compare two different machines X and Y using that program:
Performance(X)/Performance(Y)=ExecTime(Y)/ExecTime(X)=n
So X is n times faster than Y (if Y is faster switch X and Y).
Which program do we use? You can measure performance using the
program (or algorithm type) which YOU will run, or use an industry-standard
program mix that exercises different machine functions (integer,
float, memory access, I/O, etc.). We'll consider benchmarks shortly.
How can we measure execution time? On UNIX machines, it's possible to
obtain an estimate using the time command: time someCommand args
Returns: 90.7u 12.9s 2:39 65%
First value is seconds spent executing user (application) code
Second value is seconds spent executing (operating) system code
Third value is total elapsed (wall-clock) time as seen by the user
Fourth value is the percentage of elapsed time the program was actually using
the CPU (user + system time divided by elapsed time)
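To time a region of code from inside a program (rather than timing the whole process from the shell), a minimal C sketch like the following can be used. clock() is standard C and reports CPU time (roughly the "user" time above); its resolution is system dependent, so treat the numbers it gives as estimates.

#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t start = clock();            /* CPU time used so far */

    /* ---- workload to be timed (placeholder busy loop) ---- */
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)
        x += 1.0;
    /* ------------------------------------------------------ */

    clock_t end = clock();
    double cpu_seconds = (double)(end - start) / CLOCKS_PER_SEC;
    printf("CPU time used: %.2f seconds\n", cpu_seconds);
    return 0;
}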
Let’s consider CPU execution time only (user + sys). Then for a program:
CPUexecTime = CPUclockCycles x clockCycleTime
CPUclockCycles = InstructsInProgram x AvgClockCyclesPerInstruct
Can improve performance by decreasing clock cycle time (faster clock rate),
or reducing the number of clock cycles required to execute program. The
latter can be addressed by:
 More “powerful” instructions (maybe) – gives fewer instructions
 Fewer clocks per instruction – more efficient instruction execution
 More instructions running in parallel
 Faster memory access (many cycles “wasted” waiting for memory)
NOTE: This is often ignored!
 Better compiler – issues fewer instructions
 More efficient O.S. – executes fewer instructions per system call
….. and so on!
This gives a hint of where we are going in this course. We will study the
techniques that are being used in today’s computers to improve performance.
EXAMPLE: We are designing an architecture, and find that in order to
reduce the average clocks per instruction from 3.5 to 2.7 (due to an
improvement in the ALU), we have to slow the clock so that the clock cycle
time increases by 25%. Is this worth it?
CPUtime = InstructionCount x CPI x ClockCycleTime
So, divide original by modified:
CPUtime(o)/CPUtime(m) = 3.5 x 1.0 / (2.7 x 1.25) = 1.037
So the modified version is almost 4% faster than the original.
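A minimal C sketch of this comparison, using the numbers from the example above; the instruction count cancels in the ratio, so it is set to an arbitrary value.

#include <stdio.h>

/* CPU time = instruction count x CPI x clock cycle time */
double cpu_time(double instr_count, double cpi, double cycle_time)
{
    return instr_count * cpi * cycle_time;
}

int main(void)
{
    double ic = 1e9;                            /* arbitrary; cancels in the ratio */
    double t_orig = cpu_time(ic, 3.5, 1.00);    /* original design                 */
    double t_mod  = cpu_time(ic, 2.7, 1.25);    /* better CPI, 25% longer cycle    */
    printf("speedup = %.3f\n", t_orig / t_mod); /* prints 1.037 */
    return 0;
}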
BENCHMARK SUITES – a group of programs used to gauge performance
To use a BENCHMARK suite of programs to more fairly allow comparison
between different architectures, we need to consider:
1. Which programs should we use, and
2. How can we combine them into one performance measure?
Unfortunately, neither of these questions is trivial!
8
Which programs should we use?
 Every user uses a different WORKLOAD (i.e. set of programs run)
 Different programs may have very different performance measures
 Want to exercise different capabilities of computer in one (or two)
benchmark(s)
 BEST: you evaluate systems using the exact mix of programs you will
use on your computer!
How should they be combined?
 Arithmetic mean: average all execution times on a machine
AM = (1/n) Σ Time_i
Problem: longer-running programs wash out the shorter-running ones.
 Geometric mean: independent of running time; each program gets the same
weight
GM = nth root of ( Π ExecutionTimeRatio_i )
where the execution time ratio is ExecTime(test) / ExecTime(ref), i.e. the ratio
of the time on the computer under test to the time on a reference machine
Neither is intrinsically better: depends what you want from your benchmark!
Example (these numbers are reproduced in the sketch after the bottom line below):

          App1        App2
Test:     20 secs     240 secs
Ref:      100 secs    120 secs

Arithmetic mean: Test = (20 + 240)/2 = 130 sec, Ref = (100 + 120)/2 = 110 sec
-> ref machine is about 18% faster
Geometric mean of the test/ref execution time ratios: sqrt(0.2 x 2.0) = 0.632
(equivalently, ref/test: sqrt(5.0 x 0.5) = 1.58)
-> by the geometric mean the test machine is about 1.58 times faster!!
BOTTOM LINE: Be careful how you slice your performance evaluation
data.
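A minimal C sketch that reproduces the arithmetic and geometric mean numbers from the example above (compile with -lm for pow):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* example execution times in seconds (App1, App2) */
    double test_t[2] = {  20.0, 240.0 };
    double ref_t[2]  = { 100.0, 120.0 };
    int n = 2;

    /* arithmetic mean of the raw times on each machine */
    double am_test = 0.0, am_ref = 0.0;
    for (int i = 0; i < n; i++) { am_test += test_t[i]; am_ref += ref_t[i]; }
    am_test /= n;
    am_ref  /= n;

    /* geometric mean of the test/ref execution time ratios */
    double gm_ratio = 1.0;
    for (int i = 0; i < n; i++)
        gm_ratio *= test_t[i] / ref_t[i];
    gm_ratio = pow(gm_ratio, 1.0 / n);

    printf("AM: test = %.0f s, ref = %.0f s (ref faster by %.2fx)\n",
           am_test, am_ref, am_test / am_ref);    /* 130, 110, 1.18 */
    printf("GM ratio (test/ref) = %.3f (test faster by %.2fx)\n",
           gm_ratio, 1.0 / gm_ratio);             /* 0.632, 1.58 */
    return 0;
}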
Amdahl’s Law
It is wrong to expect that improving one aspect of a machine will increase
overall performance in proportion to the size of the improvement.
For example: Suppose a program runs in 100 seconds on a machine,
with multiply instructions responsible for 80 seconds of this time. How
much faster do I have to improve the speed of multiplication if I want my
program to run 5 times faster?
The execution time after the improvement is the (improved) multiply time plus
the time for everything else:
Exec time after improvement = 80/n + (100 - 80) seconds
where n is the factor by which multiplication must be sped up.
Since we want execution to be 5 times faster, the new time must be 100/5 = 20 sec:
20 = 80/n + 20
0 = 80/n
which has no solution: no matter how fast multiplication becomes, the program
cannot run 5 times faster. You need to work on the other sections as well to get a
five-fold increase in performance.
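The general statement of this idea is Amdahl's Law: overall speedup = 1 / ((1 - f) + f/s), where f is the fraction of the original time affected by the improvement and s is the speedup of that fraction. A minimal C sketch using the fractions from the example above:

#include <stdio.h>

/* Amdahl's Law: overall speedup when a fraction f of the execution time
   is sped up by a factor s and the remainder is unchanged. */
double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    /* multiplies are 80 of the 100 seconds, so f = 0.8 */
    for (double s = 2.0; s <= 2048.0; s *= 4.0)
        printf("multiply speedup %6.0fx -> overall speedup %.3fx\n",
               s, amdahl(0.8, s));
    /* as s grows, the overall speedup only approaches 1/(1 - 0.8) = 5x */
    return 0;
}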
Example Problems:
1) (2.3) If the clock rates of machines M1 and M2 are 200 MHz and 300
MHz, find the clock cycles per instruction for program 1 given the
following:
Program    Time on M1    Time on M2
1          10 sec        5 sec
2          3 sec         4 sec

Program    Instructions on M1    Instructions on M2
1          200 x 10^6            160 x 10^6
Solution:
CPI = cycles per second / instructions per second (i.e. clock rate / instruction
execution rate). So
M1: CPI = 200 x 10^6 / (200 x 10^6 / 10 sec) = 10 cycles per instruction
M2: CPI = 300 x 10^6 / (160 x 10^6 / 5 sec) = 9.4 cycles per instruction
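A minimal C sketch of the same calculation, using the clock rates, times, and instruction counts from the problem statement:

#include <stdio.h>

/* CPI = total clock cycles / instruction count
       = (clock rate x execution time) / instruction count */
double cpi(double clock_rate_hz, double exec_time_s, double instr_count)
{
    return clock_rate_hz * exec_time_s / instr_count;
}

int main(void)
{
    printf("M1: CPI = %.2f\n", cpi(200e6, 10.0, 200e6));  /* 10.00 */
    printf("M2: CPI = %.2f\n", cpi(300e6,  5.0, 160e6));  /*  9.38 */
    return 0;
}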
2) (2.10) Consider 2 different implementations, M1 and M2 of the same
instruction set. There are four classes of instructions (A, B, C and D) in the
instruction set.
M1 has a clock rate of 500MHz. The average number of cycles for
each instruction class on M1 is:
Class    CPI for this class
A        1
B        2
C        3
D        4
M2 has a clock rate of 750 MHz. The average number of cycles for
each instruction class on M2 is:
Class    CPI for this class
A        2
B        2
C        4
D        4
Assume that peak performance is defined as the fastest rate that a
machine can execute an instruction sequence chosen to maximize that rate.
What are the peak performances of M1 and M2 expressed as instructions per
second?
Solution:
For M1 the peak performance will be achieved with a sequence on
instructions of class A, which have a CPI of 1. The peak performance is
thus 500 MIPS.
For M2 , a mixture of A and B instructions both of which have a CPI
of 2, will achieve peak performance which is 375 MIPS.
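A minimal C sketch of the peak-MIPS calculation, using the clock rates and the minimum CPI of each machine from the tables above:

#include <stdio.h>

/* Peak MIPS: run nothing but the cheapest (lowest-CPI) instruction class. */
double peak_mips(double clock_rate_hz, double min_cpi)
{
    return clock_rate_hz / (min_cpi * 1e6);
}

int main(void)
{
    printf("M1 peak = %.0f MIPS\n", peak_mips(500e6, 1.0));  /* 500 */
    printf("M2 peak = %.0f MIPS\n", peak_mips(750e6, 2.0));  /* 375 */
    return 0;
}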
3) (2.13) Consider 2 different implementations, M1 and M2, of the same
instruction set. There are three classes of instructions (A, B and C) in the
instruction set. M1 has a clock rate of 400 MHz and M2 has a clock rate of
200 MHz. The average number of cycles for each instruction class is given in
the following table:
Class    CPI M1    CPI M2    C1 Usage    C2 Usage    3rd party
A        4         2         30%         30%         50%
B        6         4         50%         20%         30%
C        8         3         20%         50%         20%
The table also contains information about how the three different
compilers use the instruction set. C1 is a compiler produced by the makers
of M1, C2 is produced by the makers of M2 and the other is a third party
compiler. Assume that each compiler uses the same number of instructions
for a given program but that the instruction mix is as described in the table.
Using C1 on both M1 and M2 how much faster can the makers of M1 claim
that M1 is over M2? Using C2 how much faster is M2 over M1? If you
purchase M1 which compiler should you use? For M2?
Solution:
Using C1, the CPI on M1 = 4*.3 + 6*.5 + 8*.2 = 5.8
          the CPI on M2 = 2*.3 + 4*.5 + 3*.2 = 3.2
Comparing times per instruction (CPI / clock rate): (3.2/200E6) / (5.8/400E6) = 1.10, or
M1 is 10% faster than M2 using C1.
Using C2, the CPI on M1 = 4*.3 + 6*.2 + 8*.5 = 6.4
          the CPI on M2 = 2*.3 + 4*.2 + 3*.5 = 2.9
Now (6.4/400E6) / (2.9/200E6) = 1.10, or M2 is 10% faster than M1 using C2.
For the 3rd party compiler: CPI on M1 = 4*.5 + 6*.3 + 8*.2 = 5.4
                            CPI on M2 = 2*.5 + 4*.3 + 3*.2 = 2.8
Since each compiler executes the same number of instructions, the lower CPI tells
us that the 3rd party compiler is the best choice on both machines. If we try M2 as
the faster machine we get:
(5.4/400E6) / (2.8/200E6) = .964
So M1 is the faster machine. How much faster? Reverse the ratio:
(2.8/200E6) / (5.4/400E6) = 1.037, or M1 is 3.7% faster than M2 with the 3rd
party compiler.
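A minimal C sketch that reproduces these comparisons; the per-class CPIs, clock rates, and instruction mixes come from the table above, and a printed ratio greater than 1 means M1 is faster for that compiler.

#include <stdio.h>

/* Average CPI for a machine, given per-class CPIs and an instruction mix. */
double avg_cpi(const double cpi[3], const double mix[3])
{
    double sum = 0.0;
    for (int i = 0; i < 3; i++)
        sum += cpi[i] * mix[i];
    return sum;
}

int main(void)
{
    double cpi_m1[3] = { 4, 6, 8 };          /* classes A, B, C on M1 */
    double cpi_m2[3] = { 2, 4, 3 };          /* classes A, B, C on M2 */
    double rate_m1 = 400e6, rate_m2 = 200e6; /* clock rates in Hz     */

    const double mixes[3][3] = { { .3, .5, .2 },    /* C1        */
                                 { .3, .2, .5 },    /* C2        */
                                 { .5, .3, .2 } };  /* 3rd party */
    const char *names[3] = { "C1", "C2", "3rd party" };

    for (int c = 0; c < 3; c++) {
        /* time per instruction = CPI / clock rate; smaller is faster */
        double t1 = avg_cpi(cpi_m1, mixes[c]) / rate_m1;
        double t2 = avg_cpi(cpi_m2, mixes[c]) / rate_m2;
        printf("%-9s : M1 is %.3fx the speed of M2\n", names[c], t2 / t1);
    }
    return 0;
}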
4) (2.15) Consider a program P, with the following mix of operations:
floating-point multiply    10%
floating-point add         15%
floating-point divide       5%
integer instructions       70%
Machine MFP has floating point hardware and can implement the floating
point operations directly. It requires the following number of clock cycles
for each instruction class:
floating-point multiply     6
floating-point add          4
floating-point divide      20
integer instructions        2
Machine MNFP has no floating point hardware and so must emulate the
floating-point operations using integer instructions. The integer instructions
take 2 clock cycles. The number of integer instructions needed to implement
each of the floating point operations is:
floating-point multiply    30
floating-point add         20
floating-point divide      50
Both machines have a clock rate of 1000 MHz. Find the native MIPS rating
for both machines.
Solution:
MIPS = ClockRate / (CPI x 10^6)
CPI for MFP = .1*6 + .15*4 + .05*20 + .7*2 = 3.6
CPI for MNFP = 2 (every instruction it executes is an integer instruction)
MIPS for MFP = 1000 / 3.6 = 278
MIPS for MNFP = 1000 / 2 = 500
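A minimal C sketch of the MFP calculation, using the instruction frequencies and cycle counts from the tables above:

#include <stdio.h>

int main(void)
{
    /* MFP instruction mix and cycles per class:
       FP multiply, FP add, FP divide, integer */
    double freq[4]   = { 0.10, 0.15, 0.05, 0.70 };
    double cycles[4] = { 6, 4, 20, 2 };
    double clock_hz  = 1000e6;

    double cpi_mfp = 0.0;
    for (int i = 0; i < 4; i++)
        cpi_mfp += freq[i] * cycles[i];

    double cpi_mnfp = 2.0;   /* MNFP executes integer instructions only */

    printf("MFP : CPI = %.1f, MIPS = %.0f\n",
           cpi_mfp, clock_hz / (cpi_mfp * 1e6));    /* 3.6, 278 */
    printf("MNFP: CPI = %.1f, MIPS = %.0f\n",
           cpi_mnfp, clock_hz / (cpi_mnfp * 1e6));  /* 2.0, 500 */
    return 0;
}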
5) (2.16) If the machine MFP in the last exercise needs 300 million
instructions for a program, how many integer instructions does the machine
MNFP require for the same program?
Solution:
This is really just a ratio type problem:
INSTRUCTION                Count on MFP in 10^6    Count on MNFP in 10^6
floating-point multiply    30                      900
floating-point add         45                      900
floating-point divide      15                      750
integer instructions       210                     210
TOTALS                     300                     2760
6) (2.26) The table below shows the number of floating point operations
executed in two different programs and the runtime for those programs on
three different machines: (times given in seconds)
Program    FP ops           Computer A    Computer B    Computer C
1          10,000,000       1             10            20
2          100,000,000      1000          100           20
Which machine is faster given total execution time? How much faster is it
than the other two?
Solution:
Total execution time for Computer A is 1001 seconds, Computer B is 110
seconds and C 40 seconds. So C is the fastest. It is 1001/40 = 25 times
faster than A and 110/40 = 2.75 times faster than B.
Homework problems:
1) Using the information from Example Problem 2 above on M1 and M2: if the
number of instructions executed in a certain program is divided equally
among the classes of instructions, how much faster is M2 than M1?
(Hint: find the CPI of each machine first)
2) (2.18) You are the lead designer of a new processor. The processor
design and compiler are complete, and now you must decide whether to
produce the current design as it stands or spend additional time to improve
it. Your options are:
a) Leave the design as it stands. Call this base machine Mbase. It has
a clock rate of 500 MHz and the following measurements have been made
using a simulator:
Instruction Class    CPI    Frequency
A                    2      40%
B                    3      25%
C                    3      25%
D                    5      10%
b) Optimize the hardware. The hardware team claims that it can
improve the processor design to give it a clock rate of 600 MHz. Call this
machine MOPT. The following measurements were made for MOPT:
Instruction Class    CPI    Frequency
A                    2      40%
B                    2      25%
C                    3      25%
D                    4      10%

What is the CPI for each machine?
3) (2.21) Using the material for Mbase and MOPT above, a compiler team
proposes to improve the compiler for the machine to further enhance
performance. Call this combination of the improved compiler and the base
machine Mcomp. The instructions improvements from this enhanced
compiler have been estimated as follows:
Instruction Class    % of instructions executed vs. base machine
A                    90%
B                    90%
C                    85%
D                    95%
For example, if Mbase executed 500 Class A instructions, Mcomp would
execute .9*500 = 450 instructions. What is the CPI for Mcomp?
4) (2.27) Given the material from Example Problem 6 (problem 2.26), which
machine is fastest using the geometric mean?
5) (2.44) You are going to enhance a machine, and there are two possible
improvements: either make multiply instructions run four times faster than
before, or make memory access instructions run two times faster than before.
You repeatedly run a program that takes 100 seconds to execute. Of this
time, 20% is used for multiplications, 50% for memory instructions and 30%
for other tasks. What will the speedup be if you improve only
multiplication? What will the speedup be if you improve only memory
access? What will the speedup be if both improvements are made?