In class notes

Computer Architecture
• We will use a quantitative approach to analyze
architectures and potential improvements and see
how well they work (if at all)
– We study RISC instruction sets to promote instruction-level, block-level and thread-level parallelism
– Pipelining, superscalar, branch speculation, vector
processing, multi-core & parallel processing
– Out of order completion architectures
– Compiler optimizations
– Improving cache performance (and virtual memory
performance if time permits)
– Early on, we concentrate on the MIPS 5-stage pipeline,
later we will also look at other approaches including
the Pentium processor
Performance Measures
• Many different values can be used
– MIPS, MegaFLOPS – misleading values
– Clock Speed – does not take into account
parallelism/pipeline, stalls, cache, etc
– Execution time – we use this to compare benchmarks but
we have to make sure the benchmarks were run equally
(loaded system, unloaded system, etc)
– Throughput – number of programs per unit of time,
possibly useful for servers
– CPU time, user CPU time, system CPU time
– CPU performance = 1 / execution time
• What does it really mean for 1 computer to be faster
than another?
– If we use a benchmark suite and 1 computer consistently
outperforms the other, this is useful, otherwise we have to
take into account the types of programs where one
computer was better than the other
SPEC 2006 Benchmarks
Design Concepts
• Take advantage of parallelism
– There are many opportunities for parallelism in code
• Use multiple hardware components (ALU functional units,
register ports, caches/memory modules, disk drive access, etc)
• Distribute instructions to hardware components in an overlapped
(pipelined) or distributed (parallel) fashion
• Use hardware and software approaches
• Principle of locality of reference
– Design memory systems to support this aspect of program
and data access
• Focus on the common case
– Amdahl’s Law (next slide) demonstrates that minor
improvements to the common case are usually more useful
than large improvements to rare cases
– Find ways to enhance the hardware for common cases
over rare cases
Amdahl’s Law
• When comparing two systems, we view the speedup as
– CPU time of old system / CPU time of new system
• E.g., old system takes 10.5, new system takes 5.25, speedup = 10.5 /
5.25 = 2
• Speedup of one enhancement
– 1 / (1 – F + F / S)
• F = fraction of the time the enhancement can be used
• S = the speedup of the enhancement itself (that is, how much faster the
computer runs when the enhancement is in use)
– Example: an integer processor performs FP operations in
software routines, a benchmark consists of 14% FP operations,
a co-processor could perform all FP operations 4 times faster, if
we add the co-processor, our speedup is
• 1 / (1 - .14 + .14 / 4) = 1.12, or a 12% speedup
• Why does Amdahl’s Law promote the “common case”?
– Since we have a reciprocal, the smaller the value, the greater the
speedup
– The denominator subtracts F from 1 and adds F / S, so –F will
have a larger impact than F / S
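The formula is easy to check numerically. The following is a minimal Python sketch, not part of the original slides; the function name amdahl_speedup is our own choice.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the time is sped up by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / (1.0 - f + f / s)


# FP co-processor example: usable 14% of the time, 4 times faster.
fp_speedup = amdahl_speedup(0.14, 4.0)
print(f"{fp_speedup:.2f}")  # 1.12

# Even an infinitely fast co-processor is bounded by 1 / (1 - F).
fp_limit = 1.0 / (1.0 - 0.14)
print(f"{fp_limit:.2f}")  # 1.16
```

The second value shows why F matters more than S: however large S becomes, the speedup cannot exceed 1 / (1 – F).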
Examples
• Web server enhancements:
– Faster CPU (10 times faster on computation, 30% CPU
operations, 70% I/O)
• speedup = 1 / (1 - .3 + .3 / 10) = 1.37 (37% speedup)
– More hard disk space that improves I/O performance by
2.5
• speedup = 1 / (1 - .7 + .7 / 2.5) = 1.72 (72% speedup)
– Select the common case
• A benchmark has 20% FP square root operations,
50% total FP operations, 50% other
– Add an FP sqrt unit with a speedup of 10
• speedup = 1 / (1 - .2 + .2 / 10) = 1.22
– Add a new FP ALU with a speedup of 1.6 for all FP ops
• speedup = 1 / (1 - .5 + .5 / 1.6) = 1.23
– Again, the common case is the better way to go (slightly)
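The four alternatives above can be compared side by side. This is a small Python sketch under the same assumptions as the slide (the labels and the helper name are ours); it also prints the upper bound 1 / (1 – F) that each enhancement could reach with an unlimited S.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the time is sped up by a factor s."""
    return 1.0 / (1.0 - f + f / s)


# (label, F, S) for each enhancement discussed above.
options = [
    ("faster CPU", 0.30, 10.0),
    ("faster I/O", 0.70, 2.5),
    ("FP sqrt unit", 0.20, 10.0),
    ("new FP ALU", 0.50, 1.6),
]

results = {}
for label, f, s in options:
    results[label] = amdahl_speedup(f, s)
    bound = 1.0 / (1.0 - f)  # best case as S grows without limit
    print(f"{label}: speedup {results[label]:.2f}, upper bound {bound:.2f}")
```

Note that the faster CPU, even with an unlimited S, could not reach the 1.72 speedup of the I/O improvement, since its bound is 1 / .7 = 1.43.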
Another Example
• Architects have suggested a new feature that can be used
20% of the time and offers a speedup of 3
• One architect, though, feels that he can provide a better
enhancement that will offer a 7-times speedup for that
particular feature
• What percentage of the time would the second feature have
to be used to match the first enhancement?
– Speedup from feature 1 = 1 / (1 - .2 + .2 / 3) = 1.154
– For speedup from feature 2 = 1.154, we need to solve for x
where 1.154 = 1 / (1 – x + x / 7)
• Algebra gives us the following
– 1 – x + x / 7 = 1 / 1.154 = .867
– 1 - .867 = x – x / 7, so .133 = (7x – x) / 7 = 6x / 7
– 7 * .133 / 6 = x = .156, so the second feature would have to be
used 15.6% of the time to offer the same speedup
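The same algebra works for any target speedup: rearranging 1 / target = 1 – x + x / S gives x = (1 – 1 / target) / (1 – 1 / S). A minimal Python sketch follows; the function names are our own.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the time is sped up by a factor s."""
    return 1.0 / (1.0 - f + f / s)


def required_fraction(target, s):
    """Fraction of time an enhancement of speedup s must be used to reach target."""
    if s <= 1.0:
        raise ValueError("s must be greater than 1")
    if target < 1.0:
        raise ValueError("target speedup must be at least 1")
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / s)


target = amdahl_speedup(0.20, 3.0)   # feature 1: used 20% of the time, speedup 3
x = required_fraction(target, 7.0)   # feature 2: speedup 7
print(f"{target:.3f} {x:.3f}")  # 1.154 0.156
```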
CPU Performance Formulae
• Another way to compute speedup is to compute the
CPU’s performance before and after some
enhancement(s)
• We will use the following formulae
– CPU time = CPU clock cycles * clock cycle time
• CPU clock cycles = number of elapsed clock cycles
• CPU clock cycles = instruction count (IC) * clock cycles per
instruction (CPI) = IC * CPI
– not all instructions will have the same CPI, so we might have to
compute this as Σ (CPI_i * IC_i) over all classes of instructions i
– For instance, we might have loads, stores, branches, ALU (integer)
operations, FP operations with CPIs of 5, 4, 3, 2 and 10 respectively
• Clock cycle time = 1 / clock rate (we will abbreviate clock cycle
time as CCT going forward)
– Given two machines (or enhancements), compute the CPU execution
time of each; the speedup of machine 2 over machine 1 =
• CPU time machine1 / CPU time machine2
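These formulae translate directly into code. Below is a minimal Python sketch; the CPIs are the ones listed above, but the instruction fractions, the instruction count and the clock rate are invented purely for illustration.

```python
def average_cpi(mix):
    """Weighted CPI: sum of CPI_i * (IC_i / IC) over instruction classes.

    `mix` maps a class name to (fraction of the instruction count, CPI).
    """
    total_fraction = sum(fraction for fraction, _ in mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cpi for fraction, cpi in mix.values())


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = IC * CPI * CCT, where CCT = 1 / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical mix: only the CPI values come from the notes.
mix = {
    "load": (0.25, 5),
    "store": (0.10, 4),
    "branch": (0.15, 3),
    "alu": (0.40, 2),
    "fp": (0.10, 10),
}
cpi = average_cpi(mix)
seconds = cpu_time(instruction_count=1_000_000_000, cpi=cpi, clock_rate_hz=1e9)
print(f"CPI = {cpi:.2f}, CPU time = {seconds:.2f} s")
```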
Example
• Consider that we can either enhance the FP sqrt unit or
enhance all FP units
– IC breakdown: 25% FP operations (FP square root operations
alone are 2% of all instructions), 75% all other instructions
– CPI: 4.0 for FP operations (on average across all FP operations),
20 for FP sqrt, 1.33 for all other instructions
• CPI original machine = 25% * 4.0 + 75% * 1.33 = 2.00
– If we enhance all FP units, the overall CPI for FP operations
becomes 2.5; if we enhance just the FP sqrt unit, the FP sqrt CPI
drops from 20 to 2.0
• Compute the CPU time of each (note that IC and clock rate
(CCT) remain the same)
– CPI all FP = 75% * 1.33 + 25% * 2.5 = 1.625
– Speedup enhancing all FP = (IC * 2.00 * CCT) / (IC * 1.625 *
CCT) = 1.23
– CPI FP sqrt = CPI original – 2% * (20 – 2) = 1.64
– Speedup enhancing FP sqrt = (IC * 2.00 * CCT) / (IC * 1.64 *
CCT) = 1.22
– Enhancing all FP is better by 1.64 / 1.625 = 1.01, or about 1%
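The arithmetic can be reproduced with a few lines of Python. This sketch keeps the unrounded CPI values (the slide rounds 1.9975 to 2.00), so intermediate numbers differ slightly in the third decimal place while the rounded speedups stay the same. Variable names are ours.

```python
FP_FRACTION = 0.25      # FP operations as a fraction of all instructions
SQRT_FRACTION = 0.02    # FP sqrt operations as a fraction of all instructions
OTHER_FRACTION = 0.75
CPI_FP, CPI_SQRT, CPI_OTHER = 4.0, 20.0, 1.33

cpi_original = FP_FRACTION * CPI_FP + OTHER_FRACTION * CPI_OTHER

# Option 1: every FP operation now averages a CPI of 2.5.
cpi_all_fp = FP_FRACTION * 2.5 + OTHER_FRACTION * CPI_OTHER

# Option 2: only FP sqrt improves, from a CPI of 20 down to 2.
cpi_sqrt_only = cpi_original - SQRT_FRACTION * (CPI_SQRT - 2.0)

# IC and CCT are unchanged, so they cancel out of the speedup ratio.
speedup_all_fp = cpi_original / cpi_all_fp
speedup_sqrt_only = cpi_original / cpi_sqrt_only
print(f"{speedup_all_fp:.2f} {speedup_sqrt_only:.2f}")  # 1.23 1.22
```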
Another Example
• Our current machine has a load-store architecture
and we want to know whether we should introduce
a register-memory mode for ALU operations
– Assume a benchmark of 21% loads, 12% stores, 43%
ALU operations and 24% branches
– CPI is 2 for all instructions except ALU which is 1
• The new mode will lengthen the CPI of the ALU operations that
use it to 2, and it also, as a side effect, lengthens Branch CPI to 3
– The IC will be reduced because we need fewer loads,
let’s assume this new mode will be used in 25% of all
ALU operations
• Use the CPU execution time formula to determine
the speedup of the new addressing mode
Solution
• The number of ALU operations that will use this new
mode is 25%, or 43% * 25% = 11%
– This means that we will have 11% fewer instructions so ICnew
= 89% * ICold
– Those dropped instructions will all be loads, so we will have a
different breakdown of instruction mix
• Loads = (21% - 11%) / 89% = 11%
• Stores = 12% / 89% = 13%
• ALU = 43% / 89% = 48% (36% unchanged at CPI 1, 12% using the new mode at CPI 2)
• Branches = 24% / 89% = 27%
– CPIold = 43% * 1 + 57% * 2 = 1.57
– CPInew = 36% * 1 + (12% + 11% + 13%) * 2 + 27% * 3 = 1.89
– CPU execution time old = IC * 1.57 * CCT
– CPU execution time new = .89 * IC * 1.89 * CCT
• Speedup = 1.57 / (.89 * 1.89) = .933, which is actually a
slowdown!
– We would not want to use this enhancement
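The solution can be re-run without rounding the instruction mix to whole percentages. This Python sketch assumes, as the CPInew value of 1.89 implies, that only the ALU operations using the new mode have a CPI of 2 while the remaining ALU operations keep a CPI of 1. It counts cycles per original instruction, so the reduced IC is already folded into the totals.

```python
# Old machine: fraction of the instruction count and CPI per class.
old_mix = {"load": 0.21, "store": 0.12, "alu": 0.43, "branch": 0.24}
old_cpi = {"load": 2, "store": 2, "alu": 1, "branch": 2}

# 25% of ALU operations use the new mode, and each one removes a load.
converted = 0.25 * old_mix["alu"]

# New machine: instruction counts relative to the OLD instruction count.
new_counts = {
    "load": old_mix["load"] - converted,
    "store": old_mix["store"],
    "alu_register": old_mix["alu"] - converted,
    "alu_memory": converted,
    "branch": old_mix["branch"],
}
new_cpi = {"load": 2, "store": 2, "alu_register": 1, "alu_memory": 2, "branch": 3}

cycles_old = sum(old_mix[k] * old_cpi[k] for k in old_mix)
cycles_new = sum(new_counts[k] * new_cpi[k] for k in new_counts)
ic_ratio = sum(new_counts.values())      # ICnew / ICold
speedup = cycles_old / cycles_new        # CCT is unchanged and cancels

print(f"ICnew/ICold = {ic_ratio:.4f}, speedup = {speedup:.3f}")
```

Without the rounding, the speedup comes out near .92 rather than .933, so the conclusion is unchanged: the new mode is a slowdown.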
Which Formula?
• In a previous example, we used the CPU time formula to solve the
FP sqrt versus all FP units problem, which we had solved earlier
with Amdahl’s Law (slide 6)
• Which approach should we use?
– Depends on what information we are given, notice in using
Amdahl’s law, we know the fraction of time an enhancement
can be used and how much speedup that enhancement gives us
– We can compute the same thing in the CPU time formula
• Let’s try another example to prove it
– Benchmark consists of 35% loads, 15% stores, 40% ALU and
10% branch operations with a CPI breakdown of 5
(loads/stores), 4 (ALU/branches)
– Enhancement: since we have separate INT and FP registers and
this benchmark does not use the FP registers, can we use a
compiler to move values from INT to FP registers and back
rather than using the slower loads & stores? Yes. How much
speedup will this give us?
– Assume the compiler can reduce the loads/stores by 20%
because of this enhancement
Solution
• CPI goes down, IC remains the same, clock cycle
time (CCT) is unchanged
– Solution using CPU Time formula
• CPIold = 50% * 5 + 50% * 4 = 4.5
– 20% of the loads/stores now become register moves, so our new
breakdown of instructions is 40% load/store and 60% ALU
• CPInew = 40% * 5 + 60% * 4 = 4.4
• Speedup = (4.5 * IC * CCT) / (4.4 * IC * CCT)
= 4.5 / 4.4 = 1.023 or 2.3% speedup
– Amdahl’s Law
• Speedup of enhancement is 5/4 (5 cycles down to 4) = 1.25
• Fraction the enhancement can be used
– the enhancement is used in 20% of the loads/stores which were 50% of
the total instruction mix and these instructions took up 5 cycles of time
each, so it is used .20 * .50 * 5 / 4.5 (the original CPI) = .111
• Speedup = 1 / (1 - .111 + .111 / 1.25) = 1.023
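That the two approaches agree can be confirmed numerically. A minimal Python sketch with our own variable names:

```python
LOAD_STORE_FRACTION = 0.50   # 35% loads + 15% stores
CPI_LOAD_STORE = 5
CPI_OTHER = 4                # ALU operations and branches

# 20% of the loads/stores become register moves with a CPI of 4.
moved = 0.20 * LOAD_STORE_FRACTION

# Approach 1: CPU time formula (IC and CCT cancel).
cpi_old = LOAD_STORE_FRACTION * CPI_LOAD_STORE + (1 - LOAD_STORE_FRACTION) * CPI_OTHER
cpi_new = (LOAD_STORE_FRACTION - moved) * CPI_LOAD_STORE \
    + (1 - LOAD_STORE_FRACTION + moved) * CPI_OTHER
speedup_cpu_time = cpi_old / cpi_new

# Approach 2: Amdahl's Law. F is a fraction of TIME, so weight by cycles.
f = moved * CPI_LOAD_STORE / cpi_old
s = CPI_LOAD_STORE / CPI_OTHER
speedup_amdahl = 1.0 / (1.0 - f + f / s)

print(f"{speedup_cpu_time:.3f} {speedup_amdahl:.3f}")  # 1.023 1.023
```

The key step is that F is a fraction of execution time, not of instruction count, which is why the cycle weighting by 5 / 4.5 is needed.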
Instruction Set Principles
• We studied instruction set design issues in 362
– Here, we develop a RISC instruction set to be used
throughout the course, called MIPS
• We want a fixed-length instruction format and a
load-store instruction set both of which will help
support a pipeline
• What other issues should we consider?
– Number of operands (2-operand or 3-operand)?
– Number of registers and what type (should we
differentiate between data and address registers?)
• Memory issues
– What addressing modes should we allow?
– How many bits should we allow for address
displacements, for immediate data?
Comparisons
Addressing Modes
• Data will either be
– Constants (immediate data)
– Stored in registers
– Stored in memory
• For data stored in memory, there are numerous ways to specify
the address
– Direct, indirect (pointers), register indirect (pointers in registers), base
displacement (sum of displacement and value in register), indexed (sum
of values in two registers), etc – see the next slide
– Complex modes can impact CPI because of the time it takes to obtain
or compute the address
• Design issues
– How many bits should be allowed for an immediate datum or a
displacement? Analysis of SPEC benchmarks indicates no more than
15 bits are needed for a displacement (displacements are < 32K) and 8
bits for most immediate data
– Which modes? Again, an analysis of SPEC benchmarks indicates that
immediate and displacement modes are most common (see figure A.7)
Branch Issues
• Branches typically use PC-relative branching
– The branch target location is computed as PC ← PC +
offset rather than PC ← offset, this keeps the offset to
fewer bits in the instruction
• also, with PC + offset, we do not need to know absolute memory
locations at compile time allowing code to be moved in memory
during execution
• Branches break down into
– Conditional branches (test a condition first)
– Unconditional branches
– Procedure calls and returns (require storing a return
address, probably parameter passing as well)
• register windows are used in some high performance computers
for parameter passing (this is explained in the out-of-class notes)
• Conditional branches make up 75-82% of all branches
– Distance (offsets) for most branches can be limited to about
8 bits (255 locations) – see figure A.15 on page A-18
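The PC-relative computation described above can be sketched in Python. This is a simplification for illustration: the offset is treated as a signed byte distance from the PC of the branch itself, and the function names and addresses are invented.

```python
def fits_signed(value, bits):
    """True if value is representable as a bits-wide 2's complement number."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def branch_target(pc, offset, offset_bits=16):
    """PC-relative target: PC + offset, with the offset limited to offset_bits."""
    if not fits_signed(offset, offset_bits):
        raise ValueError(f"offset {offset} does not fit in {offset_bits} bits")
    return pc + offset


# The same offset works wherever the code happens to be loaded.
for load_address in (0x1000, 0x80000):
    branch_pc = load_address + 40
    target = branch_target(branch_pc, -24)
    print(hex(branch_pc), "->", hex(target), "distance", target - branch_pc)
```

Because only the distance is encoded, relocating the code changes both the branch address and its target by the same amount and the instruction itself does not need to change.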
Continued
• For procedure calls/returns, how is the state saved/restored?
– Through memory or register windows
• What is the form of condition for conditional branches?
– Complex conditions can be time consuming
– Using condition codes is problematic with pipelines
– A simple zero test (value == 0 or value != 0) is the simplest and
fastest approach but requires additional instructions
• e.g., to compare x == y + 1, compute x – (y + 1) first, then compare the result to 0
• When is the comparison performed?
– With the branch instruction or prior to the branch?
Types of Instructions
• Arithmetic (integer) and logic operations
• FP operations (+, -, *, /) and conversion between int and FP
– We separate FP and integer operations for several reasons
• they have different CPIs
• we will use different register files
• we will use different execution units in the pipeline
• Data transfer (loads, stores)
• Control (conditional, unconditional branches, procedure calls,
returns, traps)
• I/O
• Strings (move, compare, search)
• OS operations (OS system calls, virtual memory, other)
• Graphics (pixel operations, compression/decompression, others)
– In this course we will only concentrate on the first 4 classes although
we will briefly consider vector operations as well, which are often
used to support graphics
– See figure A.13 on page A-16 for a breakdown of the SPECInt92
benchmark programs as executed on the Intel 80x86 architecture
Embedded Application Instruction Sets
• With RISC, instruction sets were being restricted
– Fewer instructions in the instruction set
– Fixed length instructions
– Fewer and simpler addressing modes
– Load-store instruction sets
• With the popularity of embedded applications due to
handheld devices, new restrictions are being
introduced
– 16-bit and 32-bit instruction sizes to accommodate
narrower buses
• Requires smaller memory addresses, smaller immediate data,
fewer registers
• This also improves cache performance because we can fit more
in the caches
• An alternative is to use compression on instructions: compress an
instruction, fetch it, uncompress it in the CPU – IBM follows this approach
Compiler Optimizations
• In order to support the increasingly complex hardware,
we need compiler support in the form of machine code
optimizations, here are some examples:
– High-level optimizations on source code
• example: procedure in-lining, loop transformation
– Local optimizations on single-lines of code
• example: change the order of references in a block or expression
– Global optimizations extend local across branches
• example: loop unrolling
– Register allocation to optimize the storage of variables in
registers and minimize memory fetches
– Machine-dependent optimizations
• take advantage of the specific architecture
– see Figure A.19, page A-25 and A.20, page A-28
• Two examples
Continued
– Sub-expression elimination – assume that a particular
expression is used in several expressions, the value can be
computed one time and stored in a register, later uses can
reference the register and not have to re-compute the same
expression
– Graph coloring – an algorithm used to determine the best (or a
good) allocation of local variables to registers
• this is an NP-complete problem, so compilers use a graph coloring
approximation or heuristic algorithm instead
• There is a problem with compiler optimizations: phase ordering
– Since compiler optimizations are made in a particular order, one
optimization might impact and wipe out the gain by an earlier
optimization
• consider that register allocation is performed near the end of
optimization, but sub-expression elimination, performed earlier in the
process, needs some registers, so the earlier optimization relies on
having access to registers which may be re-allocated later!
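To make the graph coloring idea concrete, here is a minimal Python sketch of one greedy heuristic. It is an illustration only, not the algorithm of any particular compiler: nodes are variables, an edge joins two variables that are live at the same time, and colors are register numbers. The variable names and the example graph are invented.

```python
def greedy_register_allocation(interference, num_registers):
    """Assign a register to each variable so that neighbours never share one.

    `interference` maps each variable to the set of variables live at the
    same time (the graph must be symmetric). Returns (assignment, spilled).
    """
    assignment = {}
    spilled = []
    # Heuristic: handle the most constrained variables first.
    order = sorted(interference, key=lambda v: len(interference[v]), reverse=True)
    for var in order:
        taken = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)  # no register left: keep this one in memory
    return assignment, spilled


graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
regs3, spilled3 = greedy_register_allocation(graph, 3)
regs2, spilled2 = greedy_register_allocation(graph, 2)
print(regs3, spilled3)
print(regs2, spilled2)
```

With three registers every variable fits; with two, the triangle a, b, c cannot be colored, so one variable is spilled to memory. A heuristic like this is fast but does not guarantee the minimum number of registers.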
Introduction to MIPS
• Developed in 1985, since then, there have been many
versions, here, we examine a subset called MIPS64
• RISC architecture designed for pipeline efficiency
– optimizing compiler essential to improve efficiency
• General-purpose register set and load-store architecture
– 32 64-bit general purpose (integer) registers
• labeled R0, …, R31
• R0 is always 0
• 8-, 16-, 32-bit values are sign extended to become 64 bit values
– 32 64-bit floating point registers
• labeled F0, …, F31 where only half the register is used for floats
• No explicit character or string types
– characters treated as ints, à la C
– strings as arrays of ints
• Arrays are available, using base-displacement addressing
Continued
• Two addressing modes used: displacement,
immediate
– direct addressing can be accomplished by using R0 as the
base register
– register indirect can be accomplished by using a displacement of 0
– displacements of 12-16 bits and immediate data of 8-16 bits
– memory is byte addressable and 64-bit addresses are used
• Approximately 100 operations
– op code requires 7 bits, however we will reduce this to 6 bits by
using one op code for all integer ALU operations
– Fixed length 32-bit instructions
– 3 instruction formats used (shown on the next slide)
• I-type for immediate data, used for loads, stores, conditional branches and ALU
operations that have an immediate datum as an operand
• R-type for register type, used for all other ALU operations and FP operations
• J-type for jump type, used for jump, jump and link (procedure call), trap, return
– Immediate data and displacements are limited to 16 bits except for
Jump instructions in which case displacements are limited to 26 bits
Continued
• 3 operand instructions are available as long as all operands are in
registers (R-type) or 2 registers and immediate datum (I-type)
• immediate datum (which is also used for displacement offsets) is
limited to 16 bits (2’s complement) but extended to 32 bits
• funct is the specific type of ALU or FP function
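The field layout can be illustrated by packing and unpacking a 32-bit word in Python. This sketch assumes the layout implied by the notes: a 6-bit op code in the most significant bits, two 5-bit register fields (32 registers need 5 bits each), and a 16-bit immediate, or a 26-bit offset for J-type. Treating the J-type offset as signed is our assumption, and the op code value used below is made up.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def encode_i_type(opcode, rs, rt, imm):
    """Pack a 6-bit op code, two 5-bit register numbers and a 16-bit immediate."""
    return ((opcode & 0x3F) << 26) | ((rs & 0x1F) << 21) \
        | ((rt & 0x1F) << 16) | (imm & 0xFFFF)


def decode_i_type(word):
    """Split a 32-bit I-type word back into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": sign_extend(word & 0xFFFF, 16),
    }


def decode_j_type(word):
    """J-type: 6-bit op code followed by a 26-bit offset."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "offset": sign_extend(word & 0x3FFFFFF, 26),
    }


word = encode_i_type(opcode=0x19, rs=2, rt=1, imm=-4)  # op code value is made up
fields = decode_i_type(word)
print(f"{word:#010x}", fields)
```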
MIPS Instructions
• See Figure A.26, page A-40 for full list, here we look at the
instructions we will be using
• Loads/Stores
– LD, SD – load/store double word (we could also use LW, SW
for word sized data movements)
• LD R2, 204(R3) – load item from M[204 + R3] into R2
– L.S, L.D, S.S, S.D – load and store single and double precision
floats (S = single, D = double)
• L.S F3, 0(R5) – note the use of integer register for the base
• ALU operations (integer only)
– DADD/DSUB, DADDI/DSUBI, DMUL, DDIV –
add/subtract/multiply/divide with 3 registers (or 2 registers and
an immediate datum for add/subtract)
• DADD R1, R2, R3
• DADDI R1, R2, #1
– also similar operations for AND, OR, shift, rotate
– SLT, SLTI – set less than – used for comparison
• SLT R1, R2, R3 – if R2 < R3, set R1 to 1, else R1 = 0
• Branch operations
Continued
– BEQZ, BNEZ – branch if register tested is 0/not zero
• BEQZ R1, foo – PC = PC + foo if R1 = 0
– J – unconditional jump to given location
• J foo – sets PC = PC + foo
– JR – unconditional jump where the target address is in the given register
• JR R3 – sets PC = R3
• Floating Point operations
– ADD.D, ADD.S, SUB.D, SUB.S, MUL.D, MUL.S, DIV.D,
DIV.S
• floating point operations, 3 FP registers
• ADD.S F1, F2, F3
– C.__.D, C.__.S – FP comparisons, __ is LT, GT, LE, GE, EQ,
NE
– CVT.__.__ - converts from one type to another using two
registers
• CVT.D.L F2, R4 – convert double in F2 to long in R4
MIPS 5-Stage Architecture
See section C.3 pages C-31-C-34
IF & ID Stages
• IF:
– PC sent to instruction cache
– PC incremented by 4, stored in
NPC temporarily
• a MUX in the MEM stage
determines if the PC should get
the value in NPC or the value
computed in EX
– Instruction stored in IR
• ID:
– Bits 6..10 denote one source
register (I-type and R-type
instructions)
– Bits 11..15 denote one source
register (R-type)
– Bits 16..31 store immediate
datum or displacement, sign
extended to 32 bits
• NPC, A, B and IMM are
temporary registers used
in later stages
EX Stage
• This stage
– computes ALU operations
• using register A & B or A & IMM, result
from ALU placed in ALU output register
and passed on to next stage
– computes effective addresses for loads
and stores
• A + IMM, stored in ALU output and
passed on to next stage
– computes branch target locations and
performs the zero test to determine if
a branch is taken or not
• A zero tested
• ALU adds PC + IMM, value sent to ALU
output and passed to next stage
MEM and WB Stages
• MEM:
– If load, ALU output stores address,
sent to data cache, resulting datum
stored in LMD
– If store, ALU output stores address
and B register stores datum, both
are sent to data cache
– If branch, based on condition,
MUX either selects NPC or branch
target location (as computed in the
ALU in EX) to send back to PC
– If ALU, forward result from ALU
output directly to LMD
• WB:
– If a datum in LMD (load or ALU),
store in the register file
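The flow of one instruction through the five stages can be modeled in a few lines of Python. This is a teaching sketch under several assumptions, not a description of the real datapath: instructions are already-decoded tuples (op, rd, rs, rt, imm) rather than 32-bit words, memory is a dictionary, only five operations are handled, and the branch target is computed from NPC without scaling the immediate.

```python
def run_instruction(pc, regs, mem, instr):
    """Push one instruction through IF, ID, EX, MEM and WB; return the next PC."""
    op, rd, rs, rt, imm = instr

    # IF: `instr` plays the role of the IR; NPC holds the incremented PC.
    npc = pc + 4

    # ID: read the source registers into the temporaries A and B.
    a = regs[rs]
    b = regs[rt]

    # EX: ALU result, effective address, or branch target plus zero test.
    cond = False
    if op == "DADD":
        alu_output = a + b
    elif op in ("DADDI", "LW", "SW"):
        alu_output = a + imm
    elif op == "BEQZ":
        alu_output = npc + imm
        cond = (a == 0)
    else:
        raise ValueError(f"unsupported operation {op}")

    # MEM: data cache access; the MUX chooses between NPC and the target.
    lmd = None
    if op == "LW":
        lmd = mem[alu_output]
    elif op == "SW":
        mem[alu_output] = b
    elif op in ("DADD", "DADDI"):
        lmd = alu_output          # ALU result forwarded to LMD
    next_pc = alu_output if cond else npc

    # WB: write LMD to the register file (R0 always stays 0).
    if lmd is not None and rd != 0:
        regs[rd] = lmd
    return next_pc


regs = [0] * 32
mem = {100: 7}
pc = 0
pc = run_instruction(pc, regs, mem, ("LW", 4, 0, 0, 100))    # LW R4, 100(R0)
pc = run_instruction(pc, regs, mem, ("DADDI", 4, 4, 0, 1))   # DADDI R4, R4, #1
pc = run_instruction(pc, regs, mem, ("SW", 0, 0, 4, 104))    # SW R4, 104(R0)
pc = run_instruction(pc, regs, mem, ("BEQZ", 0, 0, 0, 8))    # BEQZ R0, +8
print(regs[4], mem[104], pc)
```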
Comments on the MIPS Architecture
• The simplified nature of MIPS means that many tasks
will require more than a single assembly/machine
operation to complete
– in CISC instruction sets, some operations can be done in 1
instruction, such as indirect addressing and compare-and-branch operations
– registers must be pre-loaded with the data before
performing an ALU operation
– two or more instructions to perform scaled or indexed
modes
• The CPI of MIPS operations is lower than that of operations in
other instruction sets, which makes up for this
– all operations have a CPI of 4 except Loads and ALU
operations which have a CPI of 5 (because they must write
their results to registers in the WB stage)
• The static size of all MIPS operations makes it easier
to deal with pre-fetching and pipelining
Continued
• The architecture requires the following hardware
elements to implement:
– the ALU should have all integer operations (arithmetic,
logic)
• we address floating point operations later in the semester
– an additional adder is needed for the IF stage (PC
increment)
– several temporary registers are needed
• IR, A, B, Imm, NPC, ALUOutput, LMD
– multiplexors to select the following
• what to do after a condition is evaluated
• whether a computed value is to be used later in temporary
registers A or B
• whether to use a register value or the immediate datum
• multiplexors in the ALU to select the output based on the specific
ALU operation (not shown in the figure)
• multiplexors in the register file to select which register to send on
to the A or B temporary registers, and a demultiplexor to pass
along the LMD value into one of the registers (not shown in the
figure)
MIPS Code Example
• Write a set of code to compute the average of the elements in an int array,
assuming the array starts at memory location 50000 and that the number
of elements in the array is stored at location 10000
• Store the resulting float value at location 10004
      DADDI   R1, R0, #50000   // R1 is our array index
      DADDI   R2, R0, #0       // R2 is our sum
      LW      R3, 10000(R0)    // R3 is our loop counter
      CVT.W.S F2, R3           // copy number of elements into F2 as a float
Loop: BEQZ    R3, Out          // If R3 == 0, then exit loop
      LW      R4, 0(R1)        // R4 is the next array element
      DADD    R2, R2, R4       // add the element to the sum
      DADDI   R1, R1, #4       // advance to the next array element
      DSUBI   R3, R3, #1       // one fewer element left to add
      J       Loop
Out:  CVT.W.S F1, R2           // convert sum to floating point
      DIV.S   F3, F1, F2       // average = sum / number of elements
      S.S     F3, 10004(R0)    // store resulting average
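As a cross-check on the loop logic, the same computation can be modeled in Python. This is a reference sketch rather than part of the original notes: memory is modeled as a dictionary from byte address to value, the function name is invented, and the comments name the register that each variable stands in for.

```python
def average_of_array(memory, count_addr=10000, array_addr=50000,
                     result_addr=10004, word_size=4):
    """Mirror the MIPS loop: sum the elements, then divide by the count."""
    address = array_addr              # R1: address of the current element
    total = 0                         # R2: running sum
    remaining = memory[count_addr]    # R3: loop counter
    divisor = float(remaining)        # F2: element count as a float
    while remaining != 0:             # BEQZ R3, Out
        total += memory[address]      # LW R4 / DADD R2
        address += word_size          # DADDI R1, R1, #4
        remaining -= 1                # DSUBI R3, R3, #1
    # Like the assembly, this divides by zero if the array is empty.
    memory[result_addr] = float(total) / divisor
    return memory[result_addr]


memory = {10000: 4, 50000: 1, 50004: 2, 50008: 3, 50012: 6}
print(average_of_array(memory))  # 3.0
```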
Another Example
• Write a set of code that will find the largest and smallest items in
an array, the array’s starting location is stored in R5 and the array
contains 500 elements, store the min in R1, the max in R2
        LW    R1, 0(R5)      // R1 is min
        LW    R2, 0(R5)      // R2 is max
        DADDI R3, R0, #499   // R3 is our loop counter (499 elements remain after the first)
Loop:   BEQZ  R3, Out
        DADDI R5, R5, #4     // reset array pointer to next element
        LW    R4, 0(R5)      // load next array element
        DSUBI R3, R3, #1
        SLT   R6, R4, R1     // if R4 < R1, set R6 (new min to take care of)
        BNEZ  R6, SetMin
        SLT   R6, R2, R4     // if R2 < R4, set R6 (new max to take care of)
        BNEZ  R6, SetMax
        J     Loop
SetMin: DADDI R1, R4, #0     // new min
        J     Loop
SetMax: DADDI R2, R4, #0     // new max
        J     Loop
Out:    …
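The control flow of this example can also be checked against a Python model. This is a sketch with invented names: memory is a dictionary from byte address to value, and because the first element seeds both the min and the max, the loop visits only the n - 1 elements that follow it.

```python
def min_max(memory, base, n=500, word_size=4):
    """Mirror the MIPS loop: return (min, max) of n words starting at base."""
    smallest = memory[base]           # R1: min, seeded with the first element
    largest = memory[base]            # R2: max, seeded with the first element
    address = base                    # R5: array pointer
    remaining = n - 1                 # R3: elements still to examine
    while remaining != 0:             # BEQZ R3, Out
        address += word_size          # DADDI R5, R5, #4
        value = memory[address]       # LW R4, 0(R5)
        remaining -= 1                # DSUBI R3, R3, #1
        if value < smallest:          # SLT R6, R4, R1 / BNEZ R6, SetMin
            smallest = value
        elif largest < value:         # SLT R6, R2, R4 / BNEZ R6, SetMax
            largest = value
    return smallest, largest


values = [5, -3, 12, 7, 0]
memory = {2000 + 4 * i: v for i, v in enumerate(values)}
print(min_max(memory, base=2000, n=len(values)))  # (-3, 12)
```

The elif mirrors the jump back to Loop after SetMin: a value that becomes the new min cannot also be a new max, so the second comparison is skipped.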