UNIT-II CENTRAL PROCESSING UNIT

• INTRODUCTION
• ARITHMETIC LOGIC UNIT
• FIXED POINT ARITHMETIC
• FLOATING POINT ARITHMETIC
• EXECUTION OF A COMPLETE INSTRUCTION
• BASIC CONCEPTS OF PIPELINING
Slides courtesy of Carl Hamacher, "Computer Organization," Fifth Edition, McGraw-Hill
The arithmetic logic unit (ALU)
The central processing unit (CPU) performs operations on
data. In most architectures it has three parts: an arithmetic
logic unit (ALU), a control unit, and a set of registers, which
are fast storage locations (see the figure below).
Figure: Central processing unit (CPU)
Data Representation
• The basic forms of information handled by a computer are instructions and data
• Data can be in the form of numbers or nonnumeric data
• Data in number form can be further classified as fixed point and floating point
Digit Sets and Encodings
Conventional and unconventional digit sets
• Decimal: digits in [0, 9]; encoded in 4-bit BCD or 8-bit ASCII
• Hexadecimal, or hex for short: digits 0-9 & a-f
• Conventional digit set for radix r is [0, r − 1]
• Conventional binary digit set is [0, 1]
Positional Number Systems
Representations of natural numbers {0, 1, 2, 3, …}
Four representations of the number twenty-seven:
||||| ||||| ||||| ||||| ||||| ||   (sticks or unary code)
27                                 (radix-10 or decimal code)
11011                              (radix-2 or binary code)
XXVII                              (Roman numerals)
Fixed-radix positional representation with k digits
Value of a number: x = (x_{k−1} x_{k−2} … x_1 x_0)_r = Σ_{i=0}^{k−1} x_i r^i
For example:
27 = (11011)two = (1×2^4) + (1×2^3) + (0×2^2) + (1×2^1) + (1×2^0)
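As an aside (not from the slides), the positional value formula maps directly onto Horner's rule; a minimal C sketch, with the function name radix_value my own:

#include <stdio.h>
#include <string.h>

/* Evaluate x = sum of x_i * r^i for a digit string written
   most-significant digit first, using Horner's rule. */
static long radix_value(const char *digits, int r) {
    long x = 0;
    for (size_t i = 0; i < strlen(digits); i++)
        x = x * r + (digits[i] - '0');   /* assumes digits 0-9 only */
    return x;
}

int main(void) {
    printf("%ld\n", radix_value("11011", 2));  /* 27, as in the example */
    printf("%ld\n", radix_value("27", 10));    /* 27 */
    return 0;
}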
Fixed Point Representation
• Fixed-point numbers represent real data types.
• Because the radix point is fixed, the number system is called a fixed-point number system.
• Fixed-point numbers have a defined number of digits before and after the radix point.
Fixed-Point Numbers
Positional representation: k whole and l fractional digits
Value of a number: x = (x_{k−1} x_{k−2} … x_1 x_0 . x_{−1} x_{−2} … x_{−l})_r = Σ_{i=−l}^{k−1} x_i r^i
For example:
2.375 = (10.011)two = (1×2^1) + (0×2^0) + (0×2^−1) + (1×2^−2) + (1×2^−3)
Numbers in the range [0, r^k − ulp] are representable, where ulp = r^−l
Fixed-point arithmetic same as integer arithmetic
(radix point implied, not explicit)
Two’s complement properties (including sign change) hold here as well:
(01.011)_{2's-compl} = (−0×2^1) + (1×2^0) + (0×2^−1) + (1×2^−2) + (1×2^−3) = +1.375
(11.011)_{2's-compl} = (−1×2^1) + (1×2^0) + (0×2^−1) + (1×2^−2) + (1×2^−3) = −0.625
Unsigned Integer
• Unsigned integers represent nonnegative numbers
• The decimal range of unsigned 8-bit binary numbers is 0 to 255
Unsigned Binary Integers
[Figure: schematic representation of the 4-bit code for integers in [0, 15]. A number wheel shows the 4-bit encodings 0000 through 1111 on the outside and the natural numbers 0 through 15 on the inside. Turning x notches counterclockwise adds x; turning y notches clockwise subtracts y.]
Signed Integers
• We dealt with representing the natural numbers
• Signed or directed whole numbers = integers
{ . . . , −3, −2, −1, 0, 1, 2, 3, . . . }
• Signed magnitude for 8-bit numbers ranges from −127 to +127
• Signed-magnitude representation:
+27 in 8-bit signed-magnitude binary code: 0 0011011
−27 in 8-bit signed-magnitude binary code: 1 0011011
−27 in 2-digit decimal code with BCD digits: 1 0010 0111
Introduction to Fixed Point Arithmetic
• Fixed-point numbers can be used to simulate floating-point numbers
• A fixed-point processor is usually cheaper
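A hedged illustration of the point above: a minimal C sketch of Q16.16 fixed-point arithmetic (the format choice and helper names are assumptions, not from the slides), where the radix point is implied 16 bits from the right:

#include <stdint.h>
#include <stdio.h>

/* Q16.16 fixed point: 16 integer bits, 16 fraction bits (radix point implied). */
typedef int32_t q16_16;

#define Q_FRAC_BITS 16
#define Q_ONE (1 << Q_FRAC_BITS)

static q16_16 q_from_double(double x) { return (q16_16)(x * Q_ONE); }
static double q_to_double(q16_16 x) { return (double)x / Q_ONE; }

/* Addition is plain integer addition; multiplication needs a widening
   multiply followed by a shift to re-align the radix point. */
static q16_16 q_add(q16_16 a, q16_16 b) { return a + b; }
static q16_16 q_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * b) >> Q_FRAC_BITS);
}

int main(void) {
    q16_16 a = q_from_double(2.375);   /* (10.011)two from the earlier example */
    q16_16 b = q_from_double(1.5);
    printf("%f\n", q_to_double(q_add(a, b)));  /* 3.875 */
    printf("%f\n", q_to_double(q_mul(a, b)));  /* 3.5625 */
    return 0;
}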
Addition
Subtraction
A Serial Multiplier
Example of Multiplication Using Serial Multiplier
Serial Divider
Division Example Using Serial Divider
Floating‐Point Numbers
To accommodate very large integers and very
small fractions, a computer must be able to
represent numbers and operate on them in such
a way that the position of the binary point is
variable and is automatically adjusted as
computation proceeds.
• Floating-point representation is like scientific notation:
−20 000 000 = −2 × 10^7
0.000 000 007 = 7 × 10^−9 (also written 7E−9)
Here −2 and 7 are the significands, 10 is the exponent base, and 7 and −9 are the exponents.
Floating‐point Computations
• Representation: (fraction, exponent); a floating-point number has three fields: sign, significant digits (significand), and exponent
e.g. 111101.100110 = 1.11101100110 × 2^5
• Value represented = ±1.M × 2^(E′ − 127)
In the case of a 32-bit number:
1 bit represents the sign
8 bits represent the exponent, E′ = E + 127 (bias) [excess-127 format]
23 bits represent the mantissa
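To make the field layout concrete, a small C sketch (my own, not from the slides) that unpacks the sign, excess-127 exponent, and 23-bit mantissa of a single-precision value; 61.59375 is 111101.100110 in binary, matching the example above:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pull apart an IEEE 754 single-precision value into its three fields. */
int main(void) {
    float f = 61.59375f;             /* 111101.100110 in binary */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* reinterpret the 32-bit pattern */

    unsigned sign     = bits >> 31;           /* 1 bit */
    unsigned exponent = (bits >> 23) & 0xFF;  /* 8 bits, excess-127 */
    unsigned mantissa = bits & 0x7FFFFF;      /* 23 bits, hidden leading 1 not stored */

    printf("sign=%u  E'=%u  E=%d  mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    /* Prints: sign=0  E'=132  E=5  mantissa=0x766000 */
    return 0;
}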
Floating‐point Computations
• Arithmetic operations (decimal examples):

Addition (exponents must be aligned first):
  .5372400 × 10^2        .5372400 × 10^2
+ .1580000 × 10^−1  →  + .0001580 × 10^2
                         ---------------
                         .5373980 × 10^2

Subtraction (the result must be normalized):
  .56780 × 10^5
− .56430 × 10^5
  -------------
  .00350 × 10^5 = .35000 × 10^3

Multiplication (fractions multiply, exponents add):
  .5372400 × 10^2
× .1580000 × 10^−1
Floating‐point Computations
• Biased Exponent
– Bias: an excess number added to the exponent so that all exponents become positive
– Advantages
• Only positive exponents
• Simpler to compare the relative magnitude
Floating‐point Computations
• Standard Operand Format of floating‐point numbers
– Single-precision data type: 32 bits
• ADDFS
– Double-precision data type: 64 bits
• ADDFL
IEEE Floating-Point Operand Format
Floating‐point Computations
• Significand
– A leading bit to the left of the implied binary point, together with the fraction in the field
f field    Significand    Decimal equivalent
100…0      1.100…0        1.50
010…0      1.010…0        1.25
000…0      1.000…0        1.00
ANSI/IEEE Standard Floating‐Point Format (IEEE 754)
Revision (IEEE 754R) is being considered by a committee
Short (32-bit) format:
• 1 sign bit
• 8 exponent bits, bias = 127, exponent range −126 to 127
  (the encoded range is −127 to 128, but the two extreme values are reserved for special operands; similarly for the long format)
• 23 bits for the fractional part (plus hidden 1 in the integer part)

Long (64-bit) format:
• 1 sign bit
• 11 exponent bits, bias = 1023, exponent range −1022 to 1023
• 52 bits for the fractional part (plus hidden 1 in the integer part)

The two ANSI/IEEE standard floating-point formats.
Short and Long IEEE 754 Formats: Features
Table: Some features of ANSI/IEEE standard floating-point formats

Feature               Single/Short                  Double/Long
Word width in bits    32                            64
Significand in bits   23 + 1 hidden                 52 + 1 hidden
Significand range     [1, 2 − 2^−23]                [1, 2 − 2^−52]
Exponent bits         8                             11
Exponent bias         127                           1023
Zero (±0)             e + bias = 0, f = 0           e + bias = 0, f = 0
Denormal              e + bias = 0, f ≠ 0           e + bias = 0, f ≠ 0
                      represents ±0.f × 2^−126      represents ±0.f × 2^−1022
Infinity (±∞)         e + bias = 255, f = 0         e + bias = 2047, f = 0
Not-a-number (NaN)    e + bias = 255, f ≠ 0         e + bias = 2047, f ≠ 0
Ordinary number       e + bias ∈ [1, 254]           e + bias ∈ [1, 2046]
                      e ∈ [−126, 127]               e ∈ [−1022, 1023]
                      represents 1.f × 2^e          represents 1.f × 2^e
min                   2^−126 ≅ 1.2 × 10^−38         2^−1022 ≅ 2.2 × 10^−308
max                   ≅ 2^128 ≅ 3.4 × 10^38         ≅ 2^1024 ≅ 1.8 × 10^308
Floating Point Arithmetic
• Floating point arithmetic differs from integer arithmetic in that
exponents must be handled as well as the magnitudes of the
operands.
• The exponents of the operands must be made equal for
addition and subtraction. The fractions are then added or
subtracted as appropriate, and the result is normalized.
• Eg: Perform the floating point operation: (.101 × 2^3 + .111 × 2^4)_2
• Start by adjusting the smaller exponent to be equal to the larger exponent, and adjust the fraction accordingly. Thus we have .101 × 2^3 = .010 × 2^4, losing .001 × 2^3 of precision in the process.
• The resulting sum is (.010 + .111) × 2^4 = 1.001 × 2^4 = .1001 × 2^5, and rounding to three significant digits, .100 × 2^5, and we have lost another .001 × 2^4 in the rounding process.
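A minimal C sketch of this align-add-normalize sequence on a toy 3-bit-fraction format (the toyfp type and its encoding are my own simplification, not IEEE 754):

#include <stdio.h>

/* Toy floating-point value: frac is a 3-bit fraction (.f2 f1 f0) scaled
   by 8, so .101 is stored as 5; exp is the power of two. Illustration
   only -- real hardware works on the packed IEEE fields. */
typedef struct { int frac; int exp; } toyfp;

static toyfp toy_add(toyfp a, toyfp b) {
    /* Align: shift the fraction of the smaller-exponent operand right. */
    while (a.exp < b.exp) { a.frac >>= 1; a.exp++; }  /* precision is lost here */
    while (b.exp < a.exp) { b.frac >>= 1; b.exp++; }
    toyfp s = { a.frac + b.frac, a.exp };
    /* Normalize back to 3 fraction bits if the sum overflowed. */
    while (s.frac >= 8) { s.frac >>= 1; s.exp++; }    /* rounding by truncation */
    return s;
}

int main(void) {
    toyfp a = { 5, 3 };  /* .101 x 2^3 */
    toyfp b = { 7, 4 };  /* .111 x 2^4 */
    toyfp s = toy_add(a, b);
    printf("frac=%d/8, exp=%d\n", s.frac, s.exp);  /* 4/8 x 2^5 = .100 x 2^5 */
    return 0;
}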
Floating Point Multiplication/Division
• Floating point multiplication/division are performed in a manner
similar to floating point addition/subtraction, except that the
sign, exponent, and fraction of the result can be computed
separately.
• Like/unlike signs produce positive/negative results, respectively.
Exponent of result is obtained by adding exponents for
multiplication, or by subtracting exponents for division.
Fractions are multiplied or divided according to the operation,
and then normalized.
• Ex: Perform the floating point operation: (+.110 × 2^5) / (+.100 × 2^4)_2
• The source operand signs are the same, which means that the result will have a positive sign. We subtract exponents for division, and so the exponent of the result is 5 − 4 = 1.
• We divide fractions, producing the result: 110/100 = 1.10.
• Putting it all together, the result of dividing (+.110 × 2^5) by (+.100 × 2^4) produces (+1.10 × 2^1). After normalization, the final result is (+.110 × 2^2).
Floating point Arithmetic
• Represent a binary number in floating-point format:
10011101011.001 = 1.0011101011001 × 2^10
• In single-precision format: sign = 0, exponent = E + 127 = 10 + 127 = 137 = 10001001
• 0 10001001 0011101011001…0
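A short C sketch (my own, not from the slides) that packs these fields back into a 32-bit word and checks the encoding; the constant 0x1D6400 is the 23-bit fraction 0011101011001 followed by zeros:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pack the fields from the worked example above into a 32-bit pattern and
   verify that it decodes to 10011101011.001two = 1259.125. */
int main(void) {
    uint32_t sign = 0;
    uint32_t expo = 10 + 127;        /* excess-127 exponent: 137 */
    uint32_t frac = 0x1D6400;        /* fraction field, hidden 1 not stored */

    uint32_t bits = (sign << 31) | (expo << 23) | frac;
    float f;
    memcpy(&f, &bits, sizeof f);
    printf("%.3f\n", f);             /* prints 1259.125 */
    return 0;
}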
Floating Point Addition
• A = 0 10001001 0010000…0
• B = 0 10000101 0100000…0
• Exponent of A = 10001001 = 137
• Actual exponent = 137 − 127 = 10
• Exponent of B = 10000101 = 133
• Actual exponent = 133 − 127 = 6
• Number B has the smaller exponent, with difference 4. Hence its mantissa is shifted right by 4 bits.
• Shifted mantissa of B = 00000100…0
• Add mantissas:
A = 00100000…0
B = 00000100…0
Result = 00100100…0
• Result = 0 10001001 00100100…0
Adders and Simple ALUs
Addition is the most important arithmetic operation in computers:
– Even the simplest computers must have an adder
– An adder, plus a little extra logic, forms a simple ALU
• Simple Adders
• Carry Lookahead Adder
• Counting and Incrementing
• Design of Fast Adders
• Logic and Shift Operations
• Multifunction ALUs
Simple Adders
Half-adder (HA): inputs x, y; outputs carry c and sum s.

x y | c s
0 0 | 0 0
0 1 | 0 1
1 0 | 0 1
1 1 | 1 0

Full-adder (FA): inputs x, y, cin; outputs cout and s.

x y cin | cout s
0 0 0   | 0    0
0 0 1   | 0    1
0 1 0   | 0    1
0 1 1   | 1    0
1 0 0   | 0    1
1 0 1   | 1    0
1 1 0   | 1    0
1 1 1   | 1    1
Binary half-adder (HA) and full-adder (FA).
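As a cross-check of the FA table, a small C sketch using the standard logic equations s = x ⊕ y ⊕ cin and cout = xy + x·cin + y·cin (function names are mine):

#include <stdio.h>

/* One-bit full adder as logic equations: s = x XOR y XOR cin,
   cout = xy + x cin + y cin (the majority function). */
static void full_adder(int x, int y, int cin, int *s, int *cout) {
    *s    = x ^ y ^ cin;
    *cout = (x & y) | (x & cin) | (y & cin);
}

int main(void) {
    int s, cout;
    for (int x = 0; x <= 1; x++)
        for (int y = 0; y <= 1; y++)
            for (int cin = 0; cin <= 1; cin++) {
                full_adder(x, y, cin, &s, &cout);
                printf("%d %d %d | %d %d\n", x, y, cin, cout, s);
            }
    return 0;
}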
Full‐Adder Implementations
[Figure: (a) FA built of two HAs; (b) CMOS mux-based FA; (c) two-level AND-OR FA]
Full adder implemented with two half-adders, by means of two 4-input multiplexers, and as a two-level gate network.
Ripple‐Carry Adder: Slow But Simple
The ripple-carry adder is slow because of the carry-propagation time to the most significant bit (MSB) position, which is linearly proportional to the length n of the adder.
[Figure: a chain of FAs from (x0, y0, c0 = cin) through (x31, y31, c31), producing s0 … s31 and c32 = cout; the critical path runs along the carry chain.]
Ripple-carry binary adder with 32-bit inputs and output.
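A minimal C model of the ripple chain (names are mine); the loop makes the linear dependence on n explicit, since each carry waits on the previous one:

#include <stdio.h>

/* Ripple-carry addition of two n-bit operands: the carry out of each
   full adder feeds the carry in of the next, so delay grows linearly
   with n, as the slide notes. */
static unsigned ripple_add(unsigned x, unsigned y, int n, int c0) {
    unsigned sum = 0;
    int c = c0;
    for (int i = 0; i < n; i++) {
        int xi = (x >> i) & 1, yi = (y >> i) & 1;
        sum |= (unsigned)(xi ^ yi ^ c) << i;       /* sum bit s_i */
        c = (xi & yi) | (xi & c) | (yi & c);       /* carry c_{i+1} */
    }
    return sum;  /* final carry c_n is discarded in this sketch */
}

int main(void) {
    printf("%u\n", ripple_add(27u, 5u, 8, 0));  /* 32 */
    return 0;
}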
Carry Look ahead adder
The carry-lookahead adder generates the carry for every position in parallel, using
additional logic referred to as the carry-lookahead block.
gi = xi yi (carry generate)
pi = xi ⊕ yi (carry propagate)

gi  pi | carry is:
0   0  | annihilated or killed
0   1  | propagated
1   0  | generated
1   1  | (impossible)
[Figure: carry network with inputs g0 p0 … g_{k−1} p_{k−1} and c0, producing the carries c1 … c_k; each sum bit si is formed from pi and ci.]
The main part of an adder is the carry network. The rest is just a set
of gates to produce the g (carry generate function) and p (carry
propagate function) signals and the sum bits.
Carry‐Lookahead Addition
• Carries are represented in terms of Gi (generate) and Pi (propagate) expressions:
Gi = ai bi and Pi = ai + bi
c0 = 0
c1 = G0
c2 = G1 + P1G0
c3 = G2 + P2G1 + P2P1G0
c4 = G3 + P3G2 + P3P2G1 + P3P2P1G0
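These flattened expressions translate directly into code; a small C sketch (my own) that computes all four group carries from the G/P signals:

#include <stdio.h>

/* Carry-lookahead carries for a 4-bit group, computed directly from the
   flattened expressions above (Gi = ai & bi, Pi = ai | bi). All carries
   are available in two gate levels once the G/P signals exist. */
static void cla_carries(unsigned a, unsigned b, int c[5]) {
    int G[4], P[4];
    for (int i = 0; i < 4; i++) {
        G[i] = (a >> i) & (b >> i) & 1;
        P[i] = ((a >> i) | (b >> i)) & 1;
    }
    c[0] = 0;
    c[1] = G[0];
    c[2] = G[1] | (P[1] & G[0]);
    c[3] = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]);
    c[4] = G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
                | (P[3] & P[2] & P[1] & G[0]);
}

int main(void) {
    int c[5];
    cla_carries(0xB, 0x5, c);  /* 1011 + 0101 */
    for (int i = 0; i <= 4; i++) printf("c%d=%d\n", i, c[i]);
    return 0;
}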
Ripple‐Carry Adder Revisited
The carry recurrence: c_{i+1} = g_i + p_i c_i
Latency of k-bit adder is roughly 2k gate delays:
1 gate delay for production of p and g signals, plus
2(k – 1) gate delays for carry propagation, plus
1 XOR gate delay for generation of the sum bits
[Figure: a chain of g/p cells from (g0, p0, c0) up to (g_{k−1}, p_{k−1}), producing c1 … c_k.]
The carry propagation network of a ripple-carry adder.
The Complete Design of a Carry Look Ahead Adder
gi = xi yi,  pi = xi ⊕ yi

gi  pi | carry is:
0   0  | annihilated or killed
0   1  | propagated
1   0  | generated
1   1  | (impossible)

[Figure: k-bit carry-lookahead adder. For each bit position, logic computes gi = xi yi and pi = xi ⊕ yi; the carry network produces c1 … c_k from g0 p0 … g_{k−1} p_{k−1} and c0; the sum bits si are then formed from pi and ci.]
Carry Lookahead Adder
• The maximum gate delay for the carry generation is only 3. The full adders introduce two more gate delays, so the worst-case path is 5 gate delays.
16‐bit Group Carry Lookahead Adder
• A 16-bit GCLA is composed of four 4-bit CLAs, with additional logic that generates the carries between the four-bit groups:
GG0 = G3 + P3G2 + P3P2G1 + P3P2P1G0
GP0 = P3P2P1P0
c4 = GG0 + GP0c0
c8 = GG1 + GP1c4 = GG1 + GP1GG0 + GP1GP0c0
c12 = GG2 + GP2c8 = GG2 + GP2GG1 + GP2GP1GG0 + GP2GP1GP0c0
c16 = GG3 + GP3c12 = GG3 + GP3GG2 + GP3GP2GG1 + GP3GP2GP1GG0 + GP3GP2GP1GP0c0
16‐Bit Group Carry Lookahead Adder
• Each CLA has a longest path of 5 gate delays.
• In the GCLA section, GG and GP signals are generated in 3 gate delays; carry signals are generated in 2 more gate delays, resulting in 5 gate delays to generate the carry out of each GCLA group and 10 gate delays on the worst-case path (which is s15, not c16).
The Booth Algorithm
• Booth multiplication reduces the number of additions for intermediate results, but can sometimes make it worse, as we will see.
• Positive and negative numbers are treated alike.
A Worst Case Booth Example
• A worst-case situation in which the simple Booth algorithm requires twice as many additions as serial multiplication.
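A minimal C sketch of radix-2 Booth recoding (my own encoding choices, 8-bit operands): a 0-to-1 transition in the multiplier subtracts the shifted multiplicand and a 1-to-0 transition adds it, so a long run of 1s costs only one subtraction and one addition, while alternating bits (the worst case above) force an operation at every position:

#include <stdint.h>
#include <stdio.h>

/* Radix-2 Booth multiplication for 8-bit two's-complement operands:
   scan bit pairs (y_i, y_{i-1}); 01 adds the shifted multiplicand,
   10 subtracts it, 00 and 11 do nothing. */
static int16_t booth_mul(int8_t x, int8_t y) {
    int16_t acc = 0;
    int prev = 0;                       /* implicit y_{-1} = 0 */
    for (int i = 0; i < 8; i++) {
        int bit = (y >> i) & 1;
        if (bit == 0 && prev == 1) acc += (int16_t)((int16_t)x << i);  /* end of a 1-run */
        if (bit == 1 && prev == 0) acc -= (int16_t)((int16_t)x << i);  /* start of a 1-run */
        prev = bit;
    }
    return acc;
}

int main(void) {
    printf("%d\n", booth_mul(13, -6));   /* -78 */
    printf("%d\n", booth_mul(-7, -7));   /* 49 */
    return 0;
}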
Bit‐Pair Recoding (Modified Booth Algorithm)
Coding of Bit Pairs
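For reference, the standard radix-4 bit-pair recoding convention: each multiplier bit pair (y_{i+1}, y_i), examined together with the bit y_{i−1} to its right, is replaced by one signed digit d = y_{i−1} + y_i − 2·y_{i+1}, so each pair costs at most one add or subtract of the multiplicand M or 2M:

y_{i+1} y_i | y_{i−1} | digit d | operation
0 0         | 0       |  0      | add nothing
0 0         | 1       | +1      | add M
0 1         | 0       | +1      | add M
0 1         | 1       | +2      | add 2M
1 0         | 0       | −2      | subtract 2M
1 0         | 1       | −1      | subtract M
1 1         | 0       | −1      | subtract M
1 1         | 1       |  0      | add nothing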
Multifunction ALUs
[Figure: general structure of a simple arithmetic/logic unit. Operand 1 and Operand 2 feed both a logic unit (AND, OR, …) and an arithmetic unit (add, sub, …); a select signal for the function type (logic or arithmetic) steers one of the two results to the output.]
General structure of a simple arithmetic/logic unit.

An ALU for MiniMIPS
[Figure: a 32-bit multifunction ALU. An adder takes x and y with c0 = 0 or 1 (Add′Sub) and produces x ± y, with overflow logic driven by c31 and c32; a shifter supports no shift, logical left, logical right, and arithmetic right, with a constant or variable (Const′Var) 5-bit shift amount taken from the 5 LSBs of y; a logic unit computes AND, OR, XOR, or NOR; a "set less" output is derived from the adder's MSB. A 2-bit function class (00 shift, 01 set less, 10 arithmetic, 11 logic) selects the result s; a 32-input NOR produces the Zero flag, and Ovfl is the overflow flag. A shorthand symbol for the ALU is also shown.]
Figure: A multifunction ALU with 8 control signals (2 for function class, 1 arithmetic, 3 shift, 2 logic) specifying the operation.
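A software model of such a multifunction ALU (the encodings below are illustrative, not MiniMIPS-exact):

#include <stdint.h>
#include <stdio.h>

/* A 2-bit function class selects among shift, set-less, arithmetic,
   and logic results, mirroring the figure above. */
typedef enum { FN_SHIFT = 0, FN_SETLESS = 1, FN_ARITH = 2, FN_LOGIC = 3 } fn_class;

static uint32_t alu(fn_class cls, int sub, int logic_fn, int shift_fn,
                    uint32_t x, uint32_t y, unsigned amount) {
    uint32_t sum = x + (sub ? ~y + 1 : y);      /* add, or two's-complement subtract */
    switch (cls) {
    case FN_SHIFT:
        switch (shift_fn) {                      /* 0: none, 1: <<, 2: >>, 3: arith >> */
        case 1: return y << amount;
        case 2: return y >> amount;
        case 3: return (uint32_t)((int32_t)y >> amount);
        default: return y;
        }
    case FN_SETLESS: return sum >> 31;           /* sign of x - y: 1 if x < y */
    case FN_ARITH:   return sum;
    case FN_LOGIC:
        switch (logic_fn) {                      /* 0: AND, 1: OR, 2: XOR, 3: NOR */
        case 0: return x & y;
        case 1: return x | y;
        case 2: return x ^ y;
        default: return ~(x | y);
        }
    }
    return 0;
}

int main(void) {
    printf("%u\n", alu(FN_ARITH, 1, 0, 0, 7, 9, 0));    /* 7 - 9 as unsigned (two's complement of -2) */
    printf("%u\n", alu(FN_SETLESS, 1, 0, 0, 7, 9, 0));  /* 1, since 7 < 9 */
    return 0;
}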
Machine Cycle
The CPU uses repeating machine cycles to execute
instructions in the program, one by one, from
beginning to end. A simplified cycle can consist of
three phases: fetch, decode and execute
The steps of a cycle
Load Fetch/Execute Cycle
1. PC -> MAR          Transfer the address from the PC to the MAR
2. MDR -> IR          Transfer the instruction to the IR
3. IR(address) -> MAR Address portion of the instruction loaded in MAR
4. MDR -> A           Actual data copied into the accumulator
5. PC + 1 -> PC       Program Counter incremented
Store Fetch/Execute Cycle
1. PC -> MAR          Transfer the address from the PC to the MAR
2. MDR -> IR          Transfer the instruction to the IR
3. IR(address) -> MAR Address portion of the instruction loaded in MAR
4. A -> MDR*          Accumulator copies data into MDR
5. PC + 1 -> PC       Program Counter incremented
*Notice how Step #4 differs for LOAD and STORE
ADD Fetch/Execute Cycle
1. PC -> MAR          Transfer the address from the PC to the MAR
2. MDR -> IR          Transfer the instruction to the IR
3. IR(address) -> MAR Address portion of the instruction loaded in MAR
4. A + MDR -> A       Contents of MDR added to contents of accumulator
5. PC + 1 -> PC       Program Counter incremented
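A toy C simulation of these fetch/execute register transfers (the opcodes, word layout, and three-instruction program are invented for illustration):

#include <stdio.h>

/* A toy accumulator machine that mirrors the register-transfer steps
   above (PC -> MAR, memory read -> MDR, MDR -> IR, and so on). */
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

int main(void) {
    /* instruction word: opcode * 100 + address (decimal, for readability) */
    int mem[16] = { 110, 211, 312, 0, 0, 0, 0, 0, 0, 0, 5, 7, 0 };
    int pc = 0, mar, mdr, ir, a = 0;

    for (;;) {
        mar = pc;            /* PC -> MAR */
        mdr = mem[mar];      /* memory read */
        ir  = mdr;           /* MDR -> IR */
        pc  = pc + 1;        /* PC + 1 -> PC */
        int op = ir / 100, addr = ir % 100;
        if (op == OP_HALT) break;
        mar = addr;          /* IR(address) -> MAR */
        mdr = mem[mar];      /* memory read (not used by STORE) */
        if      (op == OP_LOAD)  a = mdr;            /* MDR -> A */
        else if (op == OP_ADD)   a = a + mdr;        /* A + MDR -> A */
        else if (op == OP_STORE) mem[addr] = a;      /* A -> MDR -> memory */
    }
    printf("mem[12] = %d\n", mem[12]);  /* 5 + 7 = 12 */
    return 0;
}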
The Fetch/Execute Cycle
• A five‐step cycle:
1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Data Fetch (DF)
4. Instruction Execution (EX)
5. Result Return (RR)
Instruction Interpretation
• Process of executing a program
– Computer is interpreting our commands, but in its own language
• Execution begins by moving the instruction at the address given by the PC from memory to the control unit
Instruction Interpretation (cont'd)
• Bits of the instruction are placed into the decoder circuit of the CU
• Once an instruction is fetched, the Program Counter (PC) can be readied for fetching the next instruction
• The PC is “incremented”
Instruction Interpretation (cont'd)
• In the Instruction Decode step, the ALU is set up for the indicated operation
• The Decoder will find the memory address of the instruction's data (source operands)
– Most instructions operate on 2 data values stored in memory (like ADD), so most instructions have addresses for two source operands
– These addresses are passed to the circuit that fetches the values from memory during the next step, Data Fetch
• The Decoder finds destination address for the Result Return step, and places it in RR circuit
• Decoder determines what operation the ALU will perform, and sets it up appropriately
Instruction Interpretation (cont'd)
• Instruction Execution: the actual computation is performed.
• For the ADD instruction, the addition circuit adds the two source operands together to produce their sum.
Instruction Interpretation (cont'd)
• Result Return: the result of execution is returned to the memory location specified by the destination address.
• Once the result is returned, the cycle begins again (this is a loop).
Execution of complete Instructions
• Consider the instruction Add (R3), R1, which adds the contents of the memory location pointed to by R3 to register R1.
• Executing this instruction requires the following actions:
• Fetch the instruction
• Fetch the first operand
• Perform the addition
• Load the result into R1
FETCH OPERATION
• Load the contents of PC into MAR and send a Read request to the memory.
• The Select signal is set to Select 4, which causes the MUX to select the constant 4 and add it to the operand at input B (the contents of PC); the result is stored in register Z.
• The updated value is moved from register Z back into PC.
• The word fetched from memory is loaded into IR.
DECODE and EXECUTING PHASE
• Interpret the content of IR
• Enable the control circuitry to activate the control signals
• The content of register R3 is transferred to MAR and a memory Read is initiated
• The content of R1 is transferred to register Y to prepare for the addition operation
• When the memory operand is available in register MDR, the addition is performed
• The sum is stored in register Z, then transferred to R1
What Is A Pipeline?
• Pipelining is used by virtually all modern microprocessors to enhance performance by overlapping the execution of instructions.
• A common analogy for a pipeline is a factory assembly line. Assume that there are three stages:
1. Welding
2. Painting
3. Polishing
• For simplicity, assume that each task takes one hour.
What Is A Pipeline?
• If a single person were to work on the product it would take three hours to produce one product.
• If we had three people, one person could work on each stage, upon completing their stage they could pass their product on to the next person (since each stage takes one hour there will be no waiting).
• We could then produce one product per hour assuming the assembly line has been filled.
Characteristics Of Pipelining
• If the stages of a pipeline are not balanced and one stage is slower than another, the entire throughput of the pipeline is affected.
• In terms of a pipeline within a CPU, each instruction is broken up into different stages. Ideally, if each stage is balanced (all stages are ready to start at the same time and take an equal amount of time to execute), the time taken per instruction (pipelined) is defined as:
Time per instruction (unpipelined) / Number of stages
Characteristics Of Pipelining
• The previous expression is ideal. We will see later that there are many ways in which a pipeline cannot function in a perfectly balanced fashion.
• In terms of a CPU, the implementation of pipelining has the effect of reducing the average instruction time, therefore reducing the average CPI.
• EX: If each instruction in a microprocessor takes 5 clock cycles (unpipelined) and we have a 4-stage pipeline, the ideal average CPI with the pipeline will be 1.25.
Instruction Pipelining
• Break the instruction cycle into stages
• Simultaneously work on each stage
Two Stage Instruction Pipeline
Break instruction cycle into two stages:
• FI: Fetch instruction
• EI: Execute instruction
Clock cycle →    1   2   3   4   5   6
Instruction i    FI  EI
Instruction i+1      FI  EI
Instruction i+2          FI  EI
Instruction i+3              FI  EI
Instruction i+4                  FI  EI
Two Stage Instruction Pipeline
• But not doubled:
– Fetch is usually shorter than execution
– If execution involves memory access, the fetch stage has to wait
– Any jump or branch means that prefetched instructions are not the required instructions
• Add more stages to improve performance
Six Stage Pipelining
• Fetch instruction (FI)
• Decode instruction (DI)
• Calculate operands (CO)
• Fetch operands (FO)
• Execute instructions (EI)
• Write operand (WO)
MIPS Pipeline
• Pipeline stages:
– IF (instruction fetch)
– ID (instruction decode + register fetch)
– EX (execute)
– MEM (memory access)
– WB (write back)

Clock number →   1    2    3    4    5    6    7    8    9
Instruction i    IF   ID   EX   MEM  WB
Instruction i+1       IF   ID   EX   MEM  WB
Instruction i+2            IF   ID   EX   MEM  WB
Instruction i+3                 IF   ID   EX   MEM  WB
Instruction i+4                      IF   ID   EX   MEM  WB

On each clock cycle another instruction is fetched and begins its five-step execution. If an instruction is started every clock cycle, the performance will be five times that of a machine that is not pipelined.
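A standard consequence of this diagram (not stated on the slide, but easy to verify from it): with k pipeline stages and n instructions, the first instruction completes after k cycles and each later one completes one cycle after its predecessor, so

Total cycles = k + (n − 1)
Speedup over the unpipelined machine = n·k / (k + n − 1), which approaches k as n grows

For example, 100 instructions on this 5-stage pipeline take 5 + 99 = 104 cycles, versus 500 cycles unpipelined.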
Looking At The Big Picture
• Overall, the most time that a non-pipelined instruction can take is 5 clock cycles. Below is a summary:
• Branch ‐ 2 clock cycles
• Store ‐ 4 clock cycles
• Other ‐ 5 clock cycles
• EX: Assuming branch instructions account for 12% of all instructions and stores account for 10%, what is the average CPI of a non-pipelined CPU?
ANS: 0.12 × 2 + 0.10 × 4 + 0.78 × 5 = 4.54
The Classical RISC 5 Stage Pipeline
• In an ideal case to implement a pipeline we just need to start a new instruction at each clock cycle.
• Unfortunately there are many problems with trying to implement this. Obviously we cannot have the ALU performing an ADD operation and a MULTIPLY at the same time. But if we look at each stage of instruction execution as being independent, we can see how instructions can be “overlapped”.