Arithmetic for Computers
Chapter 3
Sections 3.1 – 3.5 & 3.8
Appendix C.1 – C.3, C.5 – C.6
Dr. Iyad F. Jafar
Outline
 Addition and Subtraction
 Overflow Detection
 Faster Addition
 The 1-Bit ALU
 The 32-bit MIPS ALU
 Shift Operations
 Multiplication
 Division
 Floating Point Numbers
 Fallacies and Pitfalls
Addition and Subtraction
 Add corresponding bits including the sign bit
and ignore the carry out of the MSB
 For subtraction, add the negative
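Examples (4-bit two's complement):
   4 + 3 = 7          0100 + 0011 = 0111
  -4 + 3 = -1         1100 + 0011 = 1111
   4 + (-3) = 1       0100 + 1101 = 1 0001   (the carry out of the MSB is ignored: 0001)
  -4 - (-3) = -1      1100 - 1101 becomes 1100 + 0011 = 1111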
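The examples above can be reproduced with a short sketch. The function names and the 4-bit default width are illustrative choices, not part of the slides.

```python
def to_signed(value, bits=4):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def add(a, b, bits=4):
    """Add two bit patterns and ignore the carry out of the MSB."""
    return (a + b) & ((1 << bits) - 1)

def sub(a, b, bits=4):
    """Subtract by adding the two's complement (invert, then add 1) of b."""
    negated_b = (~b + 1) & ((1 << bits) - 1)
    return add(a, negated_b, bits)

print(format(add(0b0100, 0b1101), "04b"))  # 4 + (-3)
print(to_signed(sub(0b1100, 0b1101)))      # -4 - (-3)
```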
Detecting Overflow
 When do we get overflow?
 When we add two positive numbers and get a negative number
 When we add two negative numbers and get a positive number
 Investigate the sign bit!
Sign-bit cases (Cin is the carry into the sign bit, Cout is the carry out of it):
A sign   B sign   Cin   Result sign   Cout   Overflow?
0 (+)    0 (+)    0     0 (+)         0      No overflow
0 (+)    0 (+)    1     1 (-)         0      Overflow
1 (-)    1 (-)    0     0 (+)         1      Overflow
1 (-)    1 (-)    1     1 (-)         1      No overflow
0 (+)    1 (-)    0     1 (-)         0      No overflow
0 (+)    1 (-)    1     0 (+)         1      No overflow
Overflow occurs when the carry into the sign bit does not equal the carry out of it:
Overflow = Cin XOR Cout
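A minimal sketch of the carry-in/carry-out rule, assuming unsigned bit patterns as inputs. The function name and the 4-bit default are hypothetical.

```python
def add_with_overflow(a, b, bits=4):
    """Return (result, overflow) where overflow = Cin XOR Cout at the sign bit."""
    mask = (1 << bits) - 1
    low_mask = mask >> 1                      # every bit except the sign bit
    # Carry into the sign bit: add the lower bits only and look at bit (bits - 1).
    carry_in = ((a & low_mask) + (b & low_mask)) >> (bits - 1)
    total = (a & mask) + (b & mask)
    carry_out = total >> bits                 # carry out of the sign bit
    return total & mask, carry_in != carry_out

print(add_with_overflow(0b0100, 0b0100))  # 4 + 4 does not fit in 4 bits
```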
Addition and Subtraction
 How to perform addition in hardware?
 Design 32-bit adder (two 32-bit inputs !!!!)
 Cell design !
 1-bit Full Adder
[Figure: 1-bit full adder with inputs A, B and CarryIn (Cin), and outputs Sum and CarryOut (Cout)]
Truth table:
A  B  Cin   Cout  Sum
0  0  0     0     0
0  0  1     0     1
0  1  0     0     1
0  1  1     1     0
1  0  0     0     1
1  0  1     1     0
1  1  0     1     0
1  1  1     1     1
From the Karnaugh maps:
Cout = AB + BCin + ACin
Sum = A ⊕ B ⊕ Cin
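The two equations can be checked against ordinary addition for all eight input combinations. This is an illustrative sketch; `full_adder` is a name chosen here.

```python
from itertools import product

def full_adder(a, b, cin):
    """1-bit full adder: returns (sum, cout)."""
    total = a ^ b ^ cin                       # Sum = A xor B xor Cin
    cout = (a & b) | (b & cin) | (a & cin)    # Cout = AB + BCin + ACin
    return total, cout

for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    # The pair (cout, sum) is the 2-bit binary value of a + b + cin.
    assert 2 * cout + s == a + b + cin
```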
Addition and Subtraction
 32-bit ripple-carry adder
 Cascade 32 copies and wire them up
through the Cin and Cout
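[Figure: 32 full adders (FA) in a chain; bit i takes Ai and Bi and produces Si, the Cout of each stage feeds the Cin of the next, the first Cin is 0 and the last Cout is C32]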
 How long does it take to get the result ?
Addition and Subtraction
 32-bit ripple-carry Subtractor
 Subtraction is addition of the negative!
 Compute the 2s complement = 1s complement + 1
[Figure: 32-bit ripple-carry subtractor; each Bi is inverted before entering its full adder, the Cin of bit 0 is 1, and the outputs are D0 to D31 and B32]
Addition and Subtraction
 32-bit ripple-carry adder/subtractor
 Redundancy in hardware!! Subtraction is addition of the
negative!
 Use one adder and configure the second input
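 Remember X ⊕ 1 = X’ and X ⊕ 0 = X
[Figure: 32-bit ripple-carry adder/subtractor; each Bi is XORed with the Add/Sub control before entering its full adder, and Add/Sub also drives the Cin of bit 0; outputs S0 to S31 and C32]
Add/Sub = 0 → ADD
Add/Sub = 1 → Subtract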
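A bit-level sketch of the shared adder/subtractor, assuming non-negative Python integers as bit patterns. The names `add_sub` and `subtract` are illustrative.

```python
def full_adder(a, b, cin):
    return a ^ b ^ cin, (a & b) | (b & cin) | (a & cin)

def add_sub(a, b, subtract, bits=32):
    """subtract=0 computes a + b, subtract=1 computes a - b (modulo 2**bits)."""
    carry = subtract                          # Cin of bit 0 supplies the "+1"
    result = 0
    for i in range(bits):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ subtract       # XOR with 1 inverts B, XOR with 0 passes it
        s, carry = full_adder(a_i, b_i, carry)
        result |= s << i
    return result, carry                      # carry is the final carry out (C32)

print(add_sub(7, 5, 0)[0], add_sub(7, 5, 1)[0])
```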
Faster Addition
 The ripple-carry adder is slow!
 We have to wait until the carry is propagated to the
final position in order to read out the addition or
subtraction result.
 Carry generation takes two levels of gates at each bit position
Cout_i = Ai.Bi + Ai.Cin_i + Bi.Cin_i
 Total delay = gate delay x 2 x number of bits
 Example
 16-bit adder → delay is 2 x 16 = 32 gate delays
 Can we go faster?
 What if we generate the carries in parallel?
Faster Addition
 Every carry can be expressed in terms of the adder's inputs and c0 alone!
 Add separate hardware to compute the carries in parallel!
 Carry-lookahead Adder
[Figure: adder with inputs A31 – A0 and B31 – B0; a separate carry unit produces c1, c2, c3, c4, ... from the inputs and c0]
Faster Addition
 In a 4-bit adder, the equations of the carries are
c1 = (b0 . c0) + (a0 . c0) + (a0 . b0)
c2 = (b1 . c1) + (a1 . c1) + (a1 . b1)
c3 = (b2 . c2) + (a2 . c2) + (a2 . b2)
c4 = (b3 . c3) + (a3 . c3) + (a3 . b3)
 By substitution
c2 = (a1 . a0 . b0) + (a1 . a0 . c0) + (a1 . b0 . c0) + (b1 . a0 . b0) + (b1 . a0 . c0)
   + (b1 . b0 . c0) + (a1 . b1)
c3 = (b2 . a1 . a0 . b0) + (b2 . a1 . a0 . c0) + (b2 . a1 . b0 . c0)
   + (b2 . b1 . a0 . b0) + (b2 . b1 . a0 . c0) + (b2 . b1 . b0 . c0)
   + (b2 . a1 . b1) + (a2 . a1 . a0 . b0) + (a2 . a1 . a0 . c0)
   + (a2 . a1 . b0 . c0) + (a2 . b1 . a0 . b0) + (a2 . b1 . a0 . c0)
   + (a2 . b1 . b0 . c0) + (a2 . a1 . b1) + (a2 . b2)
c4 = ……
 All carries require two gate delays !
 However, imagine the equation/cost if the adder is 32 bits ??
Faster Addition
 We can reduce the logic cost by factoring the carry equation
 ci+1 = (ai . bi) + (bi . ci) + (ai . ci)
        = (ai . bi) + (ai + bi) . ci
        = gi + pi . ci
 gi = ai . bi : carry generate
 pi = ai + bi : carry propagate
 Carry equations for 4 bit adder
 c1 = g0 + p0 . c0
 c2 = g1 + p1. c1 = g1 + (p1 . g0) + (p1 . p0 . c0)
 c3 = g2 + p2. c2 = g2 + (p2 . g1) + (p2 . p1 . g0) +
(p2 . p1 . p0 . c0)
 c4 = g3 + p3. c3= g3 + (p3 . g2) + (p3 . p2 . g1) +
(p3 . p2 . p1 . g0) + (p3 . p2 . p1 . p0 . c0)
 Delay to generate c4 is 3 gate delays (one level for gi and pi, two levels for the sum of products)
 Still cost is high for large adders ! ! !
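The four carry equations can be written out directly and compared with ordinary addition. This is a sketch under the definitions gi = ai . bi and pi = ai + bi; the function name `cla4` is hypothetical.

```python
def cla4(a, b, c0=0):
    """4-bit carry-lookahead add: returns (4-bit sum, c4)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]        # generate
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]      # propagate
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        # Each sum bit only needs its own carry, which no longer ripples.
        total |= (((a >> i) ^ (b >> i) ^ carries[i]) & 1) << i
    return total, c4

print(cla4(0b1011, 0b0110))
```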
Faster Addition
 2nd Level of Abstraction
 Example: 16-bit adder. Assume that we have four 4-bit carry-lookahead adders
 These 4-bit adders will be designed to produce super generate (G) and super propagate (P) signals
 P → the four bits propagate a carry to the next four bits
 G → the four bits generate a carry to the next four bits
 The super carry signals are fed to a separate carry generation unit
[Figure: 4-bit CLA block with inputs A3-A0, B3-B0 and c0; internal generate/propagate signals g0 p0 ... and carries c1, c2, c3; outputs S3-S0, G0 and P0]
Faster Addition
 Need to generate the carry propagate and generate
signals at higher level
 Think of each 4-bit adder block as a single unit that
can either generate or propagate a carry.
[Figure: 16-bit adder built from four 4-bit CLA blocks (bits 3-0, 7-4, 11-8 and 15-12); each block sends its Gi and Pi to the Carry Generation Unit, which takes C0, returns C1, C2 and C3 to the blocks, and produces C4]
Faster Addition
 Super propagate signals
 P0 = p3⋅p2⋅p1⋅p0
(how can the first 4-bit adder propagate c0?)
 P1 = p7⋅p6⋅p5⋅p4
 P2 = p11⋅p10⋅p9⋅p8
 P3 = p15⋅p14⋅p13⋅p12
 Super generate signals
 G0 = g3+(p3 ⋅ g2)+(p3⋅p2⋅g1)+(p3⋅p2⋅p1⋅g0)
 G1 = g7+(p7 ⋅ g6)+(p7⋅p6⋅g5)+(p7⋅p6⋅p5⋅g4)
 G2 = g11+(p11 ⋅ g10)+(p11⋅p10⋅g9)+(p11⋅p10⋅p9⋅g8)
 G3 = g15+(p15 ⋅ g14)+(p15⋅p14⋅g13)+(p15⋅p14⋅p13⋅g12)
 Carry signals at the higher level are
 C1 = G0 + (P0 ⋅ c0)
 C2 = G1 + (P1 ⋅ G0) + (P1⋅P0⋅c0)
 C3 = G2 + (P2 ⋅ G1) + (P2⋅P1⋅G0) + (P2⋅P1⋅P0⋅c0)
 C4 = G3 + (P3 ⋅ G2) + (P3⋅P2⋅G1) + (P3⋅P2⋅P1⋅G0) + (P3⋅P2⋅P1⋅P0⋅c0)
Faster Addition
 Each super carry signal is a two-level implementation in terms of Pi and Gi
 Pi is one level of gates, while Gi is two levels, both expressed in terms of pi and gi
 pi and gi are one level of gates
 Total delay is 2 + 2 + 1 = 5 gate delays
 The 16-bit CLA is ~6 times faster than the 16-bit ripple-carry adder (32 / 5 ≈ 6.4)
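A sketch of the second-level lookahead, assuming the super propagate and generate definitions given earlier. The names `group_pg` and `cla16` are hypothetical, and the sum bits inside each 4-bit block are computed with integer addition as a stand-in for the block's own logic.

```python
def group_pg(a4, b4):
    """Super propagate P and super generate G of one 4-bit block."""
    g = [(a4 >> i) & (b4 >> i) & 1 for i in range(4)]
    p = [((a4 >> i) | (b4 >> i)) & 1 for i in range(4)]
    P = p[3] & p[2] & p[1] & p[0]
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return P, G

def cla16(a, b, c0=0):
    """16-bit add using block-level carries C1..C4: returns (16-bit sum, C4)."""
    blocks = [((a >> 4 * k) & 0xF, (b >> 4 * k) & 0xF) for k in range(4)]
    P, G = zip(*(group_pg(x, y) for x, y in blocks))
    C = [c0,
         G[0] | (P[0] & c0),
         G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0),
         G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0),
         G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1]) | (P[3] & P[2] & P[1] & G[0])
              | (P[3] & P[2] & P[1] & P[0] & c0)]
    total = 0
    for k, (x, y) in enumerate(blocks):
        total |= ((x + y + C[k]) & 0xF) << (4 * k)   # stand-in for the block's sum logic
    return total, C[4]

print(cla16(0xFFFF, 0x0001))
```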
Designing the ALU
 We want to design an ALU that
 Supports logic operations
 Supports arithmetic operations
 Supports the set-on-less-than instruction
 Supports test for equality
 With special handling for
 sign extension
 zero extension
 overflow detection
[Figure: ALU block with 32-bit inputs A and B, a 4-bit operation select m, a 32-bit result, and 1-bit zero and ovf outputs]
Designing the ALU
 We start with a 1-bit ALU
 Starting with logical operations is easier since they map directly to hardware
 Two operands, two results (A.B and A+B). We need only one result, so use a 2-to-1 MUX
[Figure: A and B feed an AND gate and an OR gate; a 2-to-1 MUX controlled by Operation selects the Result]
Function    Operation
A and B     0
A or B      1
The Operation input comes from logic that looks at the opcode
Designing the ALU
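 How about addition?
 Add an adder
 Connect Cin (from the previous bit) and Cout (to the next bit)
 Expand the MUX to 3-to-1 (Operation is now 2 bits)
[Figure: 1-bit ALU with an AND gate, an OR gate and a full adder (Cin, Cout); a 3-to-1 MUX controlled by the 2-bit Operation selects the Result]
Function    Operation
A and B     00
A or B      01
A + B       10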
Designing the ALU
 How about subtraction?
 Use the same adder for subtraction
 Depending on the operation, choose whether to compute the 2s complement of B or not (MUX or XOR)
 For the 2s complement, define the BInvert signal and set the Cin of the LSB to 1
[Figure: 1-bit ALU where a 2-to-1 MUX controlled by BInvert selects B or B’ as the second operand; the 2-bit Operation selects the Result]
Function    Operation   BInvert   Cin
A and B     00          0         x
A or B      01          0         x
A + B       10          0         0
A - B       10          1         1
Designing the ALU
 Can we add the NOR instruction?
 No need to add a NOR gate!!
 Use De Morgan’s theorem, (A + B)’ = A’ . B’, with an inverter and a 2-to-1 MUX on A
 Define the AInvert signal
[Figure: 1-bit ALU where 2-to-1 MUXes controlled by AInvert and BInvert select A or A’ and B or B’; the 2-bit Operation selects the Result]
Function    Operation   BInvert   Cin   AInvert
A and B     00          0         x     0
A or B      01          0         x     0
A + B       10          0         0     0
A - B       10          1         1     0
A nor B     00          1         x     1
Designing the ALU
 Building the 32-bit ALU
 Simply, we need to wire up 32 copies of the 1-bit ALU we designed earlier, with special care for the LSB ALU
 The Cin of the LSB and the BInvert signal always take the same value, so tie them together into one signal, BNegate
[Figure: LSB ALU with inputs A, B, AInvert, BNegate (driving both the B inverter MUX and Cin) and the 2-bit Operation; outputs Result and Cout]
Designing the ALU
 Building the 32-bit ALU
[Figure: ALU0 to ALU31 chained through Cin/Cout; bit i takes Ai and Bi and produces Resulti; BNegate and Operation go to every 1-bit ALU, and BNegate also drives the Cin of ALU0]
Note that the Cin and BNegate for the LSB are the same in order to compute the 2s complement in case of subtraction
Designing the ALU
 Supporting SLT instruction
 Expand the multiplexer for one more input (Less).
 Subtract the two registers and feed the sign bit (the result of bit 31) back to the Less input of the LSB ALU.
 The Less inputs of the remaining ALUs are 0.
Designing the ALU
 The second version of the 32-bit ALU
 For the SLT instruction, the MSB is fed back to the LSB while the other bits are set to zero!
 The operation is basically subtraction
[Figure: ALU0 to ALU31 with BNegate and Operation inputs; the Less input of ALU0 is driven by the Set output of ALU31, the Less inputs of ALU1 to ALU31 are 0; ALU31 also produces OverFlow]
Designing the ALU
 Supporting Branch instructions
 Basically, subtract two registers!
 However, we need to generate a signal that indicates whether the result is zero or not.
 Simply OR the result bits and take the complement.
 This signal will be used to make the selection between the branch address and the PC.
[Figure: Example of using the Zero signal to select the address for the BEQ instruction]
Designing the ALU
[Figure: The 32-bit ALU; ALU0 to ALU31 with BNegate and Operation inputs, Set fed back to the Less input of ALU0, outputs Result0 to Result31 and OverFlow]
Designing the ALU
 The 32-bit ALU
List of Supported Operations
Function    Operation   BNegate   AInvert
A and B     00          0         0
A or B      01          0         0
A + B       10          0         0
A - B       10          1         0
A nor B     00          1         1
SLT         11          1         0
BEQ         10          1         0
BNE         10          1         0
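The control table can be exercised with a behavioral sketch. This is a word-level model, not the bit-sliced hardware; the function name `alu` and its return tuple are assumptions made here, and SLT uses the sign bit of the subtraction as described on the slides.

```python
MASK = 0xFFFFFFFF

def alu(a, b, operation, bnegate=0, ainvert=0):
    """Return (result, zero, overflow) for 32-bit inputs a and b."""
    a_in = (~a & MASK) if ainvert else (a & MASK)
    b_in = (~b & MASK) if bnegate else (b & MASK)
    sum32 = (a_in + b_in + bnegate) & MASK    # BNegate also drives the LSB carry-in
    # Signed overflow: both adder inputs have a sign that differs from the sum's sign.
    overflow = (((a_in ^ sum32) & (b_in ^ sum32)) >> 31) & 1
    if operation == 0b00:
        result = a_in & b_in                  # AND, or NOR when both inputs are inverted
    elif operation == 0b01:
        result = a_in | b_in
    elif operation == 0b10:
        result = sum32                        # add, or subtract when BNegate = 1
    else:
        result = (sum32 >> 31) & 1            # SLT: sign bit of a - b goes to bit 0
    return result, int(result == 0), overflow

print(alu(3, 5, 0b11, bnegate=1))   # SLT
print(alu(9, 9, 0b10, bnegate=1))   # BEQ uses the zero output
```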
Shift Operations
 Shift operations are commonly needed!
 MIPS ISA specifies three shift instructions
 Two logical shift instructions
sll $rd, $rt, shift_amount    # R[rd] = R[rt] << shift_amount
srl $rd, $rt, shift_amount    # R[rd] = R[rt] >> shift_amount (zero fill)
 One arithmetic shift instruction
sra $rd, $rt, shift_amount    # R[rd] = R[rt] >> shift_amount (sign fill)
 What is the difference?
 Unlike the SRL, the SRA instruction preserves the sign of
the number!
 Encoding (R-type)
op (6 bits) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
Shift Operations
 Example 1.
1. You need to extract the 2nd byte of a 4-byte word in $t1
$t1 = 0010 0011 0111 0110 1010 1111 0000 1101
srl $t1, $t1, 8
$t1 = 0000 0000 0010 0011 0111 0110 1010 1111
andi $t1, $t1, 0x00FF    # mask = 0000 0000 0000 0000 0000 0000 1111 1111
$t1 = 0000 0000 0000 0000 0000 0000 1010 1111
2. You want to multiply $t3 by 8 (note: 8 equals 2^3)
$t3 = 0000 0000 0000 0000 0000 0000 0000 0101    (equals 5)
sll $t3, $t3, 3          # move 3 places to the left
$t3 = 0000 0000 0000 0000 0000 0000 0010 1000    (equals 40)
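The two examples, and the difference between SRL and SRA, can be sketched on 32-bit patterns. The helper names mirror the MIPS mnemonics but are plain Python functions defined here for illustration.

```python
MASK = 0xFFFFFFFF

def sll(x, n):
    return (x << n) & MASK

def srl(x, n):
    return (x & MASK) >> n                    # vacated bits are filled with 0

def sra(x, n):
    x &= MASK
    # Fill the vacated bits with copies of the sign bit.
    sign_fill = (MASK << (32 - n)) & MASK if x >> 31 else 0
    return (x >> n) | sign_fill

word = 0x2376AF0D
print(hex(srl(word, 8) & 0x00FF))             # 2nd byte of the word
print(sll(5, 3))                              # 5 * 8
print(hex(srl(0xFFFFFFF0, 4)), hex(sra(0xFFFFFFF0, 4)))
```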
Shift Operations
 How are these instructions implemented?
 Outside the ALU
 Shift registers → slow; shifting by one bit requires one cycle!
 Barrel Shifters
 A digital circuit that can shift a data word by a specified number of bits in one clock cycle, provided the cycle is long enough!
 Simply a set of multiplexors!
Shift Operations
 Example 2. 4-bit barrel shifter
(rotate to left by 0, 1, 2, or 3 bits)
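[Figure: 4-bit barrel shifter block with 4-bit input D, select inputs S1 S0, and 4-bit output Y]
Shift Value     Output
S1 S0           Y3 Y2 Y1 Y0
0  0            D3 D2 D1 D0
0  1            D2 D1 D0 D3
1  0            D1 D0 D3 D2
1  1            D0 D3 D2 D1
Each output bit is a 4-to-1 multiplexor (inputs listed for S1 S0 = 00, 01, 10, 11):
Y0: D0, D3, D2, D1
Y1: D1, D0, D3, D2
Y2: D2, D1, D0, D3
Y3: D3, D2, D1, D0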
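The multiplexor wiring can be modeled directly from the table. This is an illustrative sketch; `mux4` and `rotate_left4` are names chosen here.

```python
def mux4(inputs, s1, s0):
    """4-to-1 multiplexor: inputs are listed for S1 S0 = 00, 01, 10, 11."""
    return inputs[2 * s1 + s0]

def rotate_left4(d, s1, s0):
    """d is [D3, D2, D1, D0]; returns [Y3, Y2, Y1, Y0]."""
    d3, d2, d1, d0 = d
    y3 = mux4([d3, d2, d1, d0], s1, s0)
    y2 = mux4([d2, d1, d0, d3], s1, s0)
    y1 = mux4([d1, d0, d3, d2], s1, s0)
    y0 = mux4([d0, d3, d2, d1], s1, s0)
    return [y3, y2, y1, y0]

print(rotate_left4(["D3", "D2", "D1", "D0"], 0, 1))
```

All four outputs settle in parallel, which is why the shift amount does not change the delay.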
Multiplication
Multiplying two 3-digit numbers A and B
      421      Multiplicand
    x 123      Multiplier
     1263
     842
  + 421
    51783
n partial products, where B is n digits long
n - 1 additions
In Binary...
      110
    x 101
      110
     000
  + 110
    11110      6 x 5 equals 30
Each partial product is either: 110 (A*1) or 000 (A*0)
Note: The product may take as many as two times the number of bits!
Multiplication
 Multiplication Steps
      110      multiplicand
    x 101      multiplier
      110
     0000
  + 11000
    11110
Step 1: LSB of multiplier is 1 → Add a copy of multiplicand
Step 2: Shift multiplier right to reveal new LSB
        Shift multiplicand left to multiply by 2
Step 3: LSB of multiplier is 0 → Add zero
Step 4: Shift multiplier right, multiplicand left
Step 5: LSB of multiplier is 1 → Add a copy of multiplicand
Step 6: Add partial products
Done!
Thus, we need hardware to:
1. Hold multiplier (32 bits) and shift it right
2. Hold multiplicand (32 bits) and shift it left (requires 64 bits)
3. Hold product (result) (64 bits)
4. Add the multiplicand to the current result
Multiplication
 Multiplication Hardware
1. Hold multiplier (32 bits) and shift it right
2. Hold multiplicand (32 bits) and shift it left (requires 64 bits)
3. Hold product (result) (64 bits)
4. Add the multiplicand to the current result
5. Control the whole process
[Figure: 64-bit Multiplicand register (Shift Left), 32-bit Multiplier register (Shift Right, LSB sent to Control), 64-bit ALU, 64-bit Product register (Write), and Control]
Multiplication
 Example 3. (4-bit multiplication)
Step                                     Multiplicand   Multiplier   Product
Initial Values                           xxxx1101       0101         00000000
LSB = 1 → Add Multiplicand to Product    xxxx1101       0101         00001101
Shift M’cand left, M’plier right         xxx11010       0010         00001101
LSB = 0 → Do nothing                     xxx11010       0010         00001101
Shift M’cand left, M’plier right         xx110100       0001         00001101
LSB = 1 → Add Multiplicand to Product    xx110100       0001         01000001
Shift M’cand left, M’plier right         x1101000       0000         01000001
LSB = 0 → Do nothing                     x1101000       0000         01000001
Shift M’cand left, M’plier right         11010000       0000         01000001
[Figure: 8-bit Multiplicand register (ShLeft), 4-bit Multiplier register (ShRight), 8-bit ALU, 8-bit Product register (Write), and Control]
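The shift-and-add loop traced above can be sketched for unsigned operands. The function name and the `bits` parameter are illustrative.

```python
def multiply(multiplicand, multiplier, bits=32):
    """Unsigned shift-and-add multiplication; the product has 2 * bits bits."""
    product = 0
    for _ in range(bits):
        if multiplier & 1:                    # test the LSB of the multiplier
            product += multiplicand           # add the (shifted) multiplicand
        multiplicand <<= 1                    # shift multiplicand left
        multiplier >>= 1                      # shift multiplier right
    return product & ((1 << (2 * bits)) - 1)

print(format(multiply(0b1101, 0b0101, bits=4), "08b"))
```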
Multiplication
 A Cheaper Implementation
 Even though we’re only adding 32 bits at a time, the first design needs a 64-bit adder
 Instead, hold the multiplicand still and shift the product register right!
 Now we’re only adding 32 bits each time
[Figure: 32-bit Multiplicand register, 32-bit ALU, 32-bit Multiplier register (Shift Right), 64-bit Product register split into LH Product and RH Product (Shift Right, Write) with an extra bit for the carry out, and Control]
Multiplication
 An Even Cheaper Implementation
 Note that we’re shifting bits out of the multiplier and into the product
 Why not put these together into the same register?!!
 As space opens up in the multiplier, overwrite it with the product bits
[Figure: 32-bit Multiplicand register, 32-bit ALU, one 64-bit register holding LH Product and Multiplier (Shift Right, Write, LSB sent to Control), and Control]
Multiplication
 Fast Multiplication
 Use 31 32-bit adders to sum the partial products
 One input of each adder is the multiplicand ANDed with one multiplier bit, and the other is the partial sum from the previous adder.
 Question?
Show the multiplication tree to compute 5 x 3. Assume unsigned numbers represented using 3 bits and that we have 4-bit ALUs.
Multiplication
 MIPS Multiplication
 Two multiplication instructions
mult  $s0, $s1    # hi||lo = $s0 * $s1 (signed)
multu $s0, $s1    # hi||lo = $s0 * $s1 (unsigned)
R-type: op (6 bits) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
 The result is 64 bits and it is stored in two special registers
 LO → holds the lower 32 bits of the result
 Hi → holds the upper 32 bits of the result
 The contents of these registers can be read using two special instructions
mfhi $t5    # move Hi to register $t5
mflo $t6    # move Lo to register $t6
Multiplication
 MIPS Multiplication (NOTES)
 Both multiplication instructions ignore overflow!
 It is the responsibility of the software to check if the
result fits into 32 bits !
 For MULTU, there is no overflow if hi is 0
 For MULT, there is no overflow if hi is the replicated sign
of lo
 Question!
 Modify the designed multiplier to
support signed multiplication.
Division
Dividend = Divisor * Quotient + Remainder
Decimal example (48323 / 15):
           3221      quotient
     15 | 48323      divisor | dividend
         -45
           33
          -30
            32
           -30
             23
            -15
              8      remainder
Binary example (73 / 5 = 14, remainder 3):
            01110    quotient
    101 | 1001001    divisor | dividend
         -000
          1001
          -101
           1000
           -101
             110
            -101
               11
             -000
               11    remainder
Idea: Repeatedly subtract divisor. Shift as appropriate.
Division
Looking at the alignment a little differently…
Make the dividend 8 bits and the divisor 4 bits by filling in with 0’s
            01110    quotient
0101 | 01001001      divisor | dividend
      -01010000      (does not fit, restore)
       01001001
      -00101000
       00100001
      -00010100
       00001101
      -00001010
       00000011
      -00000101      (does not fit, restore)
       00000011      remainder
Each iteration, re-express the entire remainder as 8 bits
Note: At any step, the dividend = divisor * quotient + current remainder
Try subtracting the divisor from the current remainder each time; if it doesn’t fit, restore the remainder
Division
Division Hardware
1. Hold divisor (32 bits) and shift it right (requires 64 bits)
2. Hold remainder (64 bits)
3. Hold quotient (result) (32 bits) and shift it left
4. Subtract the divisor from the current result
5. Control the whole process
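[Figure: 64-bit Divisor register (Shift Right), 64-bit ALU, 32-bit Quotient register (Shift Left), 64-bit Remainder register (Write), and Control]
Algorithm
initialize registers (divisor in the left half of its register, dividend in the remainder);
for (i = 0; i < 33; i++) {
    remainder -= divisor;
    if (remainder < 0) {
        remainder += divisor;              /* restore */
        left shift quotient 1, LSB = 0;
    } else {
        left shift quotient 1, LSB = 1;
    }
    right shift divisor 1;
}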
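The algorithm above can be run as a short sketch for unsigned operands. It assumes the dividend is smaller than divisor x 2^bits, so the first quotient bit is always 0; the function name and the `bits` parameter are illustrative.

```python
def divide(dividend, divisor, bits=32):
    """Unsigned restoring division: returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("software must check the divisor")
    remainder = dividend
    divisor <<= bits                          # divisor starts in the left half
    quotient = 0
    for _ in range(bits + 1):                 # 33 iterations for 32-bit operands
        remainder -= divisor
        if remainder < 0:
            remainder += divisor              # it did not fit: restore
            quotient = quotient << 1          # shift in a 0
        else:
            quotient = (quotient << 1) | 1    # shift in a 1
        divisor >>= 1
    return quotient, remainder

print(divide(73, 5, bits=4))                  # the 8-bit / 4-bit example above
```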
Division
 Read pages 236 -242
Division
 MIPS Division
 Two division instructions
div  $s0, $s1    # lo = $s0 / $s1, hi = $s0 mod $s1 (signed)
divu $s0, $s1    # lo = $s0 / $s1, hi = $s0 mod $s1 (unsigned)
R-type: op (6 bits) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
 As with multiply, divide ignores overflow, so software must determine if the quotient is too large.
 Software must also check the divisor to avoid division by 0
 Signed division
 Remember the signs of the dividend and divisor and use them to determine the sign of the quotient
 The sign of the remainder is always the same as that of the dividend
(Check by yourself the division of 5/2 using different combinations of the signs of the dividend and the divisor)
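The sign rules can be tried on the 5/2 exercise with a small sketch. It divides the magnitudes and then applies the signs; `signed_divide` is a hypothetical helper, not a MIPS instruction.

```python
def signed_divide(dividend, divisor):
    """Divide magnitudes, then fix the signs: returns (quotient, remainder)."""
    quotient = abs(dividend) // abs(divisor)
    remainder = abs(dividend) % abs(divisor)
    if (dividend < 0) != (divisor < 0):       # signs differ: quotient is negative
        quotient = -quotient
    if dividend < 0:                          # remainder takes the dividend's sign
        remainder = -remainder
    return quotient, remainder

for dividend, divisor in [(5, 2), (-5, 2), (5, -2), (-5, -2)]:
    q, r = signed_divide(dividend, divisor)
    assert dividend == divisor * q + r        # Dividend = Divisor * Quotient + Remainder
    print(dividend, divisor, q, r)
```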
Floating Point Numbers
 Numbers used so far are 32-bit integers!
 How about larger and smaller values? How
about fractions?
 4,600,000,000
or 4.6 x 109
 0.0000000000000000000000000166 or 1.6 x 10-27
 3.5 , - 0.0213
 The IEEE 754 FP Standard !
 Uses 32 (single precision) or 64 bits (double precision)
to represent numbers
 Any number is represented by 3 parts: sign,
significand, and exponent
 Used in most computers
Floating Point Numbers
 The IEEE 754 FP Standard
 Single precision (32 bits)
Sign
Exponent
Fraction
1 bit
8 bits
23 bits
 Normalized representation (no leading zeros and one nonzero bit to the left of the binary point in the significand)
 Since the bit to the left of the binary point is always 1, it is implied and not stored in the fraction (WHY!)
Value = (-1)^Sign x (1 + Fraction) x 2^Exponent
 Smallest normalized positive number is 1.175494350822288e-038
 Largest number is 3.402823466385289e+038
Floating Point Numbers
 The IEEE 754 FP Standard
 Double precision (64 bits)
Sign
Exponent
Fraction
1 bit
11 bits
52 bits
 Normalized representation (no leading zeros and one nonzero bit to the left of the binary point in the significand)
 Since the bit to the left of the binary point is always 1, it is implied and not stored in the fraction (WHY!)
Value = (-1)^Sign x (1 + Fraction) x 2^Exponent
 Smallest normalized positive number is 2.225073858507201e-308
 Largest number is 1.797693134862316e+308
Floating Point Numbers
 The IEEE 754 FP Standard !
 The way numbers are represented simplifies sorting of floating point numbers using integer comparison
 The number is stored in sign-magnitude form, with the sign bit first
 The exponent is placed before the fraction
 A 2s complement exponent would make negative exponents look like large values, so the exponent is biased instead
 A constant value (the bias) is added so that all exponents are represented with non-negative numbers
 In single precision, the bias is 127
 Exponent -3 is represented as -3 + 127 = 124
 Exponent 5 is represented as 5 + 127 = 132
 In double precision, the bias is 1023
 So in biased notation
Value = (-1)^Sign x (1 + Fraction) x 2^(Exponent - Bias)
Floating Point Numbers
 Example 4. Show the IEEE754 representation of -0.75 using single and double precision formats
 (0.75)ten = (0.11)two
 (-0.75)ten = (-0.11)two (we use sign and magnitude)
 in binary scientific notation: -0.11two x 2^0
 in normalized binary scientific notation: -1.1two x 2^-1
 add the bias to the exponent
 In single precision add 127 → biased exponent = -1 + 127 = 126
 In double precision add 1023 → biased exponent = -1 + 1023 = 1022
 convert the exponent into binary
 126 = (01111110)2
 1022 = (01111111110)2
 drop the 1 on the left of the binary point and fill the corresponding fields
Floating Point Numbers
 Example 4. Show the IEEE754 representation of -
0.75 using single and double precision formats
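 Single precision
Sign = 1 | Exponent = 01111110 | Fraction = 100 0000 0000 0000 0000 0000 (23 bits)
 Double precision
Sign = 1 | Exponent = 01111111110 | Fraction = 1 followed by 51 zeros (52 bits)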
Floating Point Numbers
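 Example 5. What is the value represented by the following IEEE754 number?
[Figure: 32-bit pattern; from the computation below, S = 1, Exponent = 129 (10000001), Fraction = 0.25 (0100...0)]
N = (-1)^S x (1 + Fraction) x 2^(Exponent – Bias)
= (-1)^1 x (1 + 0.25) x 2^(129 – 127)
= -1 x 1.25 x 2^2
= -1.25 x 4
= -5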
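Examples 4 and 5 can be cross-checked with Python's standard `struct` module, which packs floats in IEEE 754 format. The helper names are illustrative, and `fields_to_value` covers normalized single precision numbers only.

```python
import struct

def float_fields(x):
    """Split a single precision value into (sign, biased exponent, fraction bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fields_to_value(sign, exponent, fraction):
    """Value = (-1)^Sign x (1 + Fraction) x 2^(Exponent - 127), normalized numbers only."""
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)

print(float_fields(-0.75))                    # Example 4, single precision
print(struct.pack(">d", -0.75).hex())         # Example 4, double precision
print(fields_to_value(1, 129, 0x200000))      # Example 5: fraction 0.25 = 0x200000 / 2^23
```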
Floating Point Numbers
 Special Numbers in IEEE 754 Standard
Single Precision          Double Precision          Object Represented
E (8)      F (23)         E (11)      F (52)
0          0              0           0             true zero (0)
0          nonzero        0           nonzero       ± denormalized number
1-254      anything       1-2046      anything      ± floating point number
255        0              2047        0             ± infinity
255        nonzero        2047        nonzero       not a number (NaN)
Floating Point Numbers
 Addition of floating numbers
 Analogy to adding decimal floating point numbers
 Example: 9.999 x 10^1 + 1.610 x 10^-1 (using four digits)
 Steps to perform (F1 x 2^E1) + (F2 x 2^E2) = F3 x 2^E3
 Step 0: Restore the hidden bit in F1 and in F2
 Step 1: Align fractions by right shifting F2 by E1 - E2 positions (assuming E1 ≥ E2)
 Step 2: Add the resulting F2 to F1 to form F3
 Step 3: Normalize F3 (so it is in the form 1.XXXXX …) and check for overflow/underflow in the exponent
 Step 4: Round F3 and possibly normalize F3 again
 Step 5: Rehide the most significant bit of F3 before storing the result
Floating Point Numbers
 Example 6. Show how to add 0.625 and -0.125
using floating point binary representation
 In normalized scientific notation this is equivalent to
1.010 x 2^-1 + (-1.000 x 2^-3)
 Align exponents
1.010 x 2^-1 + (-0.010 x 2^-1)
 Add significands
1.000 x 2^-1
 Normalize the sum (if necessary) and check for overflow/underflow
 Round the sum and normalize again
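The align, add and normalize steps of Example 6 can be sketched with integer significands. Here a significand such as 1.010 is held as the integer 0b1010 (three fraction bits), signs are +1 or -1, and bits shifted out during alignment are simply dropped, so rounding is not modeled. The function name and argument layout are assumptions made for this sketch.

```python
def fp_add(sign1, mag1, exp1, sign2, mag2, exp2, frac_bits=3):
    """Add sign*mag/2^frac_bits * 2^exp values; returns (sign, mag, exp), normalized."""
    if exp1 < exp2:                           # make operand 1 the one with the larger exponent
        sign1, mag1, exp1, sign2, mag2, exp2 = sign2, mag2, exp2, sign1, mag1, exp1
    mag2 >>= exp1 - exp2                      # align by shifting the smaller operand right
    total = sign1 * mag1 + sign2 * mag2       # add the signed significands
    sign = -1 if total < 0 else 1
    mag, exp = abs(total), exp1
    while mag >= 1 << (frac_bits + 1):        # normalize: too large, shift right
        mag >>= 1
        exp += 1
    while mag and mag < 1 << frac_bits:       # normalize: leading zeros, shift left
        mag <<= 1
        exp -= 1
    return sign, mag, exp

# 0.625 + (-0.125): 1.010 x 2^-1 + (-1.000 x 2^-3)
sign, mag, exp = fp_add(1, 0b1010, -1, -1, 0b1000, -3)
print(sign * mag / 8 * 2.0 ** exp)
```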
Floating Point Numbers
 Addition hardware of floating numbers
Floating Point Numbers
 Accurate Arithmetic
 With a limited number of bits, we may need to truncate the operand with the smaller exponent to fit into the available bits
 The IEEE754 standard keeps two extra bits to the right of the number during intermediate operations: the guard and round bits
 Decimal example: 2.56 x 10^0 + 2.34 x 10^2
 Assume the significand is represented in 3 digits only
 Without guard and round digits (the two shifted-out digits are truncated)
(2.34 + 0.02) x 10^2 = 2.36 x 10^2
 With guard and round digits, we don’t have to truncate the small number when it is shifted to the right to match the large number
(2.3400 + 0.0256) x 10^2 = 2.3656 x 10^2 = 2.37 x 10^2 (after rounding)
 Sticky bit!
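The decimal example can be replayed with the standard `decimal` module. This only illustrates the effect of keeping two extra digits; it is not a model of the hardware, and the variable names are chosen here.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

big = Decimal("2.34")                         # significand of 2.34 x 10^2
small = Decimal("2.56").scaleb(-2)            # 2.56 x 10^0 aligned to 10^2: 0.0256

# Without guard and round digits: only two fraction digits survive the shift.
truncated = small.quantize(Decimal("0.01"), rounding=ROUND_DOWN)
no_guard = big + truncated

# With guard and round digits: keep two extra digits, round once at the end.
kept = small.quantize(Decimal("0.0001"), rounding=ROUND_DOWN)
with_guard = (big + kept).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)

print(no_guard, with_guard)                   # significands, both times 10^2
```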
Floating Point Numbers
 MIPS Floating Point Support
 MIPS ISA defines a separate floating point register
file
 Registers $f0 - $f31 (each is 32 bits)
 Registers are combined in pairs for double precision arithmetic
 Some instructions
lwc1  $f1, 54($s2)      # $f1 = Memory[$s2+54]
swc1  $f1, 58($s4)      # Memory[$s4+58] = $f1
add.s $f2, $f4, $f6     # $f2 = $f4 + $f6
add.d $f2, $f4, $f6     # $f2||$f3 = $f4||$f5 + $f6||$f7
Floating Point Numbers
 MIPS Floating Point Support
 Compare instructions
c.x.s $f2, $f4    # if ($f2 x $f4) cond = 1; else cond = 0
c.x.d $f2, $f4    # if ($f2||$f3 x $f4||$f5) cond = 1; else cond = 0
(x is a comparison condition such as eq, lt, or le)
 Branch instructions
bc1t 25    # if (cond == 1) go to PC + 4 + 100
bc1f 25    # if (cond == 0) go to PC + 4 + 100
Fallacies and Pitfalls
 Fallacy 1. Only theoretical mathematicians care about floating point accuracy (counterexample: the Pentium bug, 1994)
 Pitfall 1. Assuming that, just as a left shift instruction can replace an integer multiply by a power of 2, a right shift is the same as integer division by a power of 2. This holds for unsigned numbers but not for negative signed numbers.
 Pitfall 2. Forgetting that the MIPS instruction addiu sign-extends its 16-bit immediate, despite the "unsigned" in its name
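Both pitfalls can be seen with small numbers. In this sketch Python's `>>` on integers plays the role of an arithmetic right shift, `div_trunc` imitates a division that truncates toward zero, and `sign_extend16` shows what happens to a 16-bit immediate; all three names are hypothetical helpers.

```python
def sra(x, n):
    return x >> n                             # arithmetic shift: rounds toward -infinity

def div_trunc(x, d):
    q = abs(x) // abs(d)                      # truncating division: rounds toward zero
    return -q if (x < 0) != (d < 0) else q

def sign_extend16(imm):
    """Interpret a 16-bit immediate as a signed value, as sign extension does."""
    return imm - 0x10000 if imm & 0x8000 else imm

print(sra(-5, 1), div_trunc(-5, 2))           # the two results differ for negative inputs
print(sign_extend16(0xFFFF))                  # the immediate 0xFFFF acts as -1
```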