Design and implementation of Decimal Floating Point (DFP) unit

Faculty of Engineering
VLSI Laboratory
Design and implementation
of Decimal Floating Point
(DFP) unit
Ariel Burg
Hillel Rosensweig
B.Sc. Graduation Project - Computer Engineering
Advisor: Yifat Manzor
Academic Advisor: Dr. Osnat Keren
Tishrei 5772, October 2011
Abstract:
Over the past few years, there has been growing interest in the development of
Decimal Floating Point Units (DFPU) due to precision and timing constraints. We
first present a summary of the IEEE standard, in order to comply with its
requirements. We then present our design for a DFPU framework open to expansion,
which performs basic DFP operations, the most complex of which are
addition/subtraction. These operations were designed with "high performance over
silicon cost" in mind. Algorithmic simulation schemes are presented, as well as our
low-level design verification process. Finally, our synthesis results are presented.
Introduction
It is our pleasure to present the following project book. This project book is the
product of a year's work of research and development.
The research was performed using an array of tools. We employed the academic
knowledge gained throughout our degree in a variety of courses in several fields:
• Logic Design - Digital Logic Circuitry.
• Architecture - Computer Architecture, MicroComputer and Assembly Language.
• Arithmetic - Computer Arithmetic Algorithms.
We would be remiss if we did not mention the wealth of information gained from
Prof. Mike Cowlishaw's 'Speleotrove' website [1].
Software tools used included: MATLAB, Cadence Simvision, Xilinx ISE Design
Suite 13.2.
Hardware implementation was achieved on Virtex®-6 FPGA ML605 Evaluation
Board.
For verification purposes we used our own test bench, the IBM FPgen testing
suite [2,3], and Prof. Mike Cowlishaw's test vectors.
Ariel Burg
Hillel Rosensweig
Contents
1. Overview
   1.1. Purpose
   1.2. Motivation
2. Definition & Specification
   2.1. The Decimal Floating Point Format
   2.2. Infinity and NaNs
   2.3. Exception Handling
   2.4. Normalizing & Rounding
3. High-level Design
   3.1. General Operation Scheme
   3.2. Interface
   3.3. Arithmetic Algorithm
   3.4. Data Path
   3.5. Instruction Set Architecture (ISA)
4. Simulation
   4.1. Simulation Introduction
   4.2. Translate, Inverse Translate Simulation
   4.3. Exponent Comparator Simulation
   4.4. Full Path Simulation
   4.5. Simulation Results
   4.6. Full Path Graphic User Interface
5. Implementation
   5.1. Low-level Design
   5.2. Program Counter
   5.3. Register File
   5.4. Translate & Inverse Translate
   5.5. Exponent Comparator
   5.6. Check Needed
   5.7. Right Shifter
   5.8. Adder/Subtractor
   5.9. Normalizer
   5.10. Rounder
   5.11. Sign Decision
6. Integration
   6.1. Composing the Complete System
   6.2. Creating a Pipelined Datapath
   6.3. Creating a Control Unit
   6.4. Pipeline Hazards
7. Arithmetic and System Verification
   7.1. Verification Properties
   7.2. Verification Conclusions and Results
8. Synthesis
   8.1. Implementation on FPGA
   8.2. Design Evaluation
9. Summary
   9.1. DFPU Review
   9.2. Future Expansions
Appendices
   A. DFP History
Bibliography
1. Overview
1.1 Purpose
The objective of this project is to design, implement and test a decimal floating-point
arithmetic unit, based on the formats and methods for floating-point arithmetic
specified in the IEEE 754-2008 standard.
1.2 Motivation
Currently, most arithmetic hardware units perform operations on numbers in binary
format. As the most basic memory unit (the 'bit') is itself binary, binary arithmetic
implementations are the natural and intuitive choice. Despite this, drawbacks of
binary arithmetic implementations have created renewed interest in developing
arithmetic units capable of performing operations on numbers in a decimal format:

• Speed: Unlike computers, users prefer decimal notation to binary notation. In
certain applications the need for decimal-to-binary and binary-to-decimal
conversions is so great that these conversions require 50%-90% of processor time.
A system with a direct decimal representation and hardware support would save
this overhead.

• Accuracy: With floating-point numbers in binary format, accuracy problems
prevail. For instance, the decimal term 0.1 has no finite binary representation:

  0.1 (decimal) = 0.0001100110011... (binary)

Due to limited memory, a 32 bit representation will truncate the infinite
expansion and round, leading to obvious accuracy errors.
Examples of such errors are extensively documented [14]. For instance, the following
C program (compiled with Visual C++):
for (i=0.1; i<0.5; i=i+0.1) printf ("%f\n",100000000*i);
will print out:
100000001.490116
200000002.980232
300000011.920929
400000005.960464
□
Similarly, using C (compiled with Visual C++), the following two loops will not
run the same number of iterations due to rounding errors:
The loop:
for (num=1.1; num<=1.5; num=num+0.1)
printf ("%f\n",num);
prints:
1.100000
1.200000
1.300000
1.400000
whereas the loop:
for (num=0.1; num<=0.5; num=num+0.1)
printf ("%f\n",num);
prints:
0.100000
0.200000
0.300000
0.400000
0.500000
□
One important example of the implications of such an error occurred in the Gulf
War:
On February 25, 1991, during the Gulf War, an American Patriot Missile battery
in Dharan, Saudi Arabia, failed to track and intercept an incoming Iraqi Scud
missile. Specifically, the time in tenths of second as measured by the system's
internal clock was multiplied by 1/10 to produce the time in seconds. This
calculation was performed using a 24 bit fixed point register. In particular, the
value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits
after the radix point. The small chopping error, when multiplied by the large
number giving the time in tenths of a second, led to a significant error, and
consequently caused severe damage and human casualties.
• Uniformity: Today's floating-point computations may yield different results on
different processors.
Proposed solution
The proposed solution is to encode data in a decimal format: a format which gives
each digit a distinct representation, separate from the other digits. The format is
based on the IEEE 754-2008 standard. This solution addresses all of the previously
mentioned problems:
• Speed: Each digit has a unique representation in the encoding scheme, so that
converting numbers to machine code becomes a simple, direct table translation
instead of a costly conversion between bases.
• Accuracy: Each digit has a unique representation; therefore any number that can
be expressed visually (within a given accuracy) by the user can be expressed
precisely in the decimal encoding.
• Uniformity: Using the format and methods specified in the IEEE 754-2008
standard, results of computation will be identical, independent of implementation,
given the same input data. Errors, and error conditions, in the mathematical
processing will be reported in a consistent manner regardless of implementation.
The arithmetic unit designed in this project complies with the IEEE 754-2008
specifications for decimal floating-point; therefore, computations done in the
arithmetic unit will yield the same results as in any other implementation which
complies with the IEEE 754-2008 standard.
Note: because 10 is not a prime number but the product of 2 and 5, working in base
10 provides a wider range of finitely representable fractions than working in base 2:
any fraction whose denominator contains only the factors 2 and 5 terminates in base
10, whereas a fraction with a factor of 5 in its denominator has no finite
representation in base 2.
For a detailed summary of decimal floating-point history solutions see Appendix A –
DFP History.
2. Definition & Specification
2.1 The Decimal Floating Point Format
The IEEE 754-2008 standard specifies two decimal floating-point formats:
• DEC128 - uses 128 bits for representation.
• DEC64 - uses 64 bits for representation.
Although the DEC128 format provides better precision, the DEC64 format is faster
and provides a sufficient precision level for various applications. Therefore, the
format chosen for this project is DEC64.
A general structure of a decimal floating-point number is:

  r = (S, Exp + bias, Sig),   v = (-1)^S × 10^Exp × Sig

where S is the sign of the number, Exp (exponent) is the integer power to which the
radix (10) is raised, and Sig (significand) holds the digits that comprise the
significant portion of the number. These are the three elements needed to construct a
floating-point number. Figure 2.1 shows the 64 bit format for a decimal floating-point
number.
Fields, from MSB to LSB:

  S (sign): 1 bit | G (combination field): 13 bits, G0 (MSB) ... G12 (LSB) | T (trailing significand field): 50 bits = 5 declets

Figure 2.1. DEC64 format for a 64 bit decimal floating-point number.
As Figure 2.1 shows, each 64 bit operand is built of 3 fields:
• S - Sign bit
• G - Combination Field
• T - Trailing Significand Field
The three elements which construct a floating-point number are encoded in the
(S, G, T) fields:

• Sign
The sign field represents the sign of the number (sign = (-1)^S).

• Exponent
The exponent is encoded entirely in the combination field. It is 10 bits long, in
the range [emin, emax] = [-383, 384]. The exponent is biased so that it is
represented with positive values; therefore the bias is 383 and the biased
exponent range is [0, 767].

• Significand
The significand's precision is 16 digits. In its decoded form, it is 64 bits long
(BCD representation). In its coded form, it is split into 15 digits encoded in the
field T (50 bits = 5×10 = 5 declets), and an MSD (most significant digit)
encoded in the combination field (G).
The format also supports representation of ±∞ and NaN (Not a Number).
Decoding and Encoding the Combination Field
The combination field is encoded/decoded using its 5 MSBs (G0...G4). These five
bits hold the status of the number (Inf, NaN or finite) as well as the significand's
MSD and the two MSBs of the exponent (for finite numbers). The remaining 8 bits
hold the remainder of the exponent. Encoding/decoding of the combination field is
described in Table 2.1.
Combination field G0G1G2G3G4 | Type     | Exponent MSBs (2 bits) | Coefficient MSD (4 bits)
abcde                        | Finite   | ab                     | 0cde
11cde                        | Finite   | cd                     | 100e
11110                        | Infinity | --                     | ----
11111                        | NaN      | --                     | ----

Table 2.1. The first five bits of the combination field indicate the
type of the number, the significand's MSD and the 2 MSBs of the
exponent (for finite numbers).
• NaNs:
G5 differentiates between quiet NaNs (qNaN) and signaling NaNs (sNaN).
Signaling NaNs signal uninitialized variables and arithmetic enhancements that
are outside the scope of the standard; quiet NaNs carry retrospective diagnostic
information inherited from invalid operations.
For NaNs:

  G0G1G2G3G4 = 11111  =>  v = NaN,  r = sNaN if G5 = 1, qNaN if G5 = 0,  T = payload

where v = actual value, r = format representation.
• Infinity:
For ±Infinity:

  G0G1G2G3G4 = 11110  =>  v = (-1)^s × ∞

• Finite numbers:
For all remaining patterns (G0G1 ≠ 11, or G0G1 = 11 with G2G3 ≠ 11):

  r = (S, E + bias, C),   v = (-1)^s × 10^E × C
Densely-Packed Decimal (DPD)
In order to allow for Decimal representations of numbers without adding a memory
overhead to the implementation, significands are stored in a Densely Packed Decimal
format.
Using DPD coding takes advantage of the BCD representation redundancy.
Decoding 10-bit densely-packed decimal to 3 decimal digits
Decoding a Densely Packed Decimal declet is performed according to Table 2.2:
Table 2.2. Decoding 10-bit Densely-Packed Decimal to 3 decimal digits.
____________________________________________________________________
Example 2.1.
For the following declet:

b(0) b(1) b(2) b(3) b(4) b(5) b(6) b(7) b(8) b(9)
  1    0    1    1    0    0    1    1    0    1

We use the table entry selected by b(6) b(7) b(8) b(3) b(4) = 1 1 0 1 0.

Therefore:
d(1) = 8 + b(2) = 8 + 1 = 9
d(2) = 4·b(3) + 2·b(4) + b(5) = 4·1 + 2·0 + 0 = 4
d(3) = 4·b(0) + 2·b(1) + b(9) = 4·1 + 2·0 + 1 = 5

Therefore the decoded number is 945.
□
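The decoding rule of Example 2.1 generalizes to all eight selector rows. The sketch below is our own Python illustration; since the body of Table 2.2 is not reproduced in this text, the remaining selector rows are taken from the IEEE 754-2008 decoding table rather than from this document:

```python
def dpd_decode(b):
    """Decode one 10-bit DPD declet into 3 decimal digits.

    b is a list of bits, b[0] = b(0) (the MSB). The selector bits are
    b(6) b(7) b(8) b(3) b(4), exactly as in Example 2.1; the row
    actions follow the IEEE 754-2008 decoding table.
    """
    def small(x, y, z):               # a 'small' digit 0-7 from 3 bits
        return 4 * b[x] + 2 * b[y] + b[z]

    if b[6] == 0:                     # all three digits are small
        return (small(0, 1, 2), small(3, 4, 5), small(7, 8, 9))
    if (b[7], b[8]) == (0, 0):
        return (small(0, 1, 2), small(3, 4, 5), 8 + b[9])
    if (b[7], b[8]) == (0, 1):
        return (small(0, 1, 2), 8 + b[5], small(3, 4, 9))
    if (b[7], b[8]) == (1, 0):        # the row used in Example 2.1
        return (8 + b[2], small(3, 4, 5), small(0, 1, 9))
    # b(7)b(8) = 11: further selection by b(3)b(4)
    if (b[3], b[4]) == (0, 0):
        return (8 + b[2], 8 + b[5], small(0, 1, 9))
    if (b[3], b[4]) == (0, 1):
        return (8 + b[2], small(0, 1, 5), 8 + b[9])
    if (b[3], b[4]) == (1, 0):
        return (small(0, 1, 2), 8 + b[5], 8 + b[9])
    return (8 + b[2], 8 + b[5], 8 + b[9])
```

Running the declet of Example 2.1 through this sketch reproduces the digits (9, 4, 5).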
_____________________________________________________________________
Encoding 3 decimal digits to 10-bit Densely-Packed Decimal
Encoding Decimal numbers in Densely Packed Decimal format is done using
Table 2.3.
Table 2.3. Encoding 3 decimal digits to 10-bit Densely-Packed Decimal.
_____________________________________________________________________
Example 2.2.
For the number 683, the BCD representation is:

         d(1)       d(2)       d(3)
bit:   0 1 2 3    0 1 2 3    0 1 2 3
       0 1 1 0    1 0 0 0    0 0 1 1

Using the first bit of each digit:
d(1,0) = 0 ; d(2,0) = 1 ; d(3,0) = 0

We use the appropriate table entry:
Bits 1,2,3 of d(1) are 110, therefore b(0)b(1)b(2) = 110.
Bits 1,2 of d(3) are 01, therefore b(3)b(4) = 01.
Bit 3 of d(2) is 0, therefore b(5)b(6)b(7)b(8) = 0101.
Bit 3 of d(3) is 1, therefore b(9) = 1.

The final encoding is:

b(0) b(1) b(2) b(3) b(4) b(5) b(6) b(7) b(8) b(9)
  1    1    0    0    1    0    1    0    1    1
□
_____________________________________________________________________
Note: using DPD (Densely Packed Decimal) coding, 15 BCD digits (60 bits) are
packed into 50 bits in field T, taking advantage of the BCD representation
redundancy.
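Encoding can be sketched the same way. The following Python function is our own illustration of the IEEE 754-2008 encoding table (the `bcd` helper and the dictionary-of-rows layout are ours, not the project's):

```python
def bcd(digit):
    """Return the 4 BCD bits of a decimal digit, MSB first."""
    return ((digit >> 3) & 1, (digit >> 2) & 1, (digit >> 1) & 1, digit & 1)

def dpd_encode(x, y, z):
    """Encode 3 decimal digits into a 10-bit DPD declet (list, MSB first).

    The row is selected by the BCD MSB of each digit - the d(1,0),
    d(2,0), d(3,0) bits used in Example 2.2; the row actions follow
    the IEEE 754-2008 encoding table.
    """
    a, b, c, d = bcd(x)               # digit 1 = a b c d
    e, f, g, h = bcd(y)               # digit 2 = e f g h
    i, j, k, m = bcd(z)               # digit 3 = i j k m
    rows = {
        (0, 0, 0): [b, c, d, f, g, h, 0, j, k, m],
        (0, 0, 1): [b, c, d, f, g, h, 1, 0, 0, m],
        (0, 1, 0): [b, c, d, j, k, h, 1, 0, 1, m],   # Example 2.2's row
        (1, 0, 0): [j, k, d, f, g, h, 1, 1, 0, m],
        (1, 1, 0): [j, k, d, 0, 0, h, 1, 1, 1, m],
        (1, 0, 1): [f, g, d, 0, 1, h, 1, 1, 1, m],
        (0, 1, 1): [b, c, d, 1, 0, h, 1, 1, 1, m],
        (1, 1, 1): [0, 0, d, 1, 1, h, 1, 1, 1, m],
    }
    return rows[(a, e, i)]
```

Encoding 683 with this sketch reproduces the declet 1100101011 of Example 2.2.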
2.2 Infinity and NaNs
Not a Number (NaN)
There are two different kinds of NaN, signaling and quiet.
• Signaling NaNs (sNaN) represent uninitialized variables and other unique
situations.
• Quiet NaNs (qNaN) supply diagnostic information inherited from invalid or
unavailable data and results.
qNaN Propagation
To allow propagation of the diagnostic information, as much information as possible
should be preserved in NaN results of operations. In other words, operations
performed on NaNs should preserve in the result as much of the original NaN operand
as possible.
If two or more inputs are NaN, then the payload of the resulting NaN should be
identical to the payload of one of the input NaNs, if representable in the destination
format. The standard does not specify which of the input NaNs provides the
payload.
qNaN Generation
In general, operations that signal an invalid operation exception (see Para. 2.3) shall
generate a quiet NaN.
Infinity
The approach to infinites in floating-point arithmetic is equivalent to the approach to
Overflow (see Para. 2.3). In general, an Overflow in the result will itself raise an OVF
flag and the result will be coded as Infinity.
Operations on infinite operands usually don't signal exceptions and return an Infinite
result (for infinite coding, see IEEE Standards section). This applies to the following
operations:
Addition(∞, x), Addition(x, ∞), Subtraction(∞, x), or Subtraction(x, ∞), for finite x.
The exceptions that do pertain to infinities are signaled (see Para. 2.3) only when:
• ∞ is an invalid operand (in certain operations).
• ∞ is created from finite operands by overflow.
• Infinities are subtracted, as in Addition(+∞, −∞).
2.3 Exception Handling
Invalid operation (IEEE 754-2008 clause 7.2)
The invalid operation exception is signaled if and only if the arithmetic operation
provides no useful result. The default result of an operation that signals the invalid
operation exception shall be a quiet NaN that should provide some diagnostic
information (see Para. 2.2).
Operations that signal the invalid operation flag:
• Any operation on a signaling NaN.
• Addition or subtraction of infinities, such as Addition(+∞, −∞).
Overflow (IEEE 754-2008 clause 7.4)
The overflow exception is signaled if and only if the result format's largest finite
number is exceeded in magnitude by what would have been the rounded floating-point
result were the exponent range unbounded. The default result shall be
determined by the rounding-direction attribute and the sign of the intermediate result.
Specifically, in accordance with the DFPU rounding scheme - roundTiesToEven - all
overflows are rounded to ∞ with the sign of the intermediate result. In addition, under
default exception handling for overflow, the overflow flag shall be raised and the
inexact exception shall be signaled.
Inexact (IEEE 754-2008 clause 7.6)
Unless stated otherwise, if the rounded result of an operation is inexact - that is, it
differs from what would have been computed were both the exponent range and
precision unbounded - then the inexact exception shall be signaled. The rounded or
overflowed result shall be delivered to the destination.
Note: the underflow and divide-by-zero exceptions are included in the standard, but
were not fully implemented in the current design, as they are not necessary in this
context.
2.4 Normalizing & Rounding
When executing an instruction, the result operand should be represented in a
normalized form, i.e. with no leading zeros.
Using the normalized form simplifies the comparison of two decimal floating-point
operands. The normalized form gives finite operands (≠0) a unique representation,
which is helpful for comparison: a larger exponent indicates a larger operand, and
significands need to be compared only when the exponents are equal.
Note: when comparing with 0, one should only check whether sig == 0.
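The comparison rule above can be illustrated with a short sketch (ours, not part of the project code), assuming normalized, positive, nonzero operands represented as (significand, exponent) pairs:

```python
def dfp_compare(a, b):
    """Compare two normalized decimal floating-point magnitudes.

    a and b are (significand, exponent) pairs of positive, nonzero,
    normalized operands. Because normalization makes the representation
    unique, the exponents decide the comparison; the significands are
    consulted only when the exponents are equal. Returns -1, 0 or 1.
    """
    sig_a, exp_a = a
    sig_b, exp_b = b
    if exp_a != exp_b:
        return 1 if exp_a > exp_b else -1
    if sig_a != sig_b:
        return 1 if sig_a > sig_b else -1
    return 0
```

Note how the first test never inspects the significands: with normalized operands a larger exponent alone guarantees a larger magnitude.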
There are three possible normalization scenarios in addition and subtraction:
1. Significand ≥ 10: the significand should be shifted to the right and the exponent
increased by one (possible in case of addition).
2. 1 ≤ significand < 10: no shifting needed (possible in case of addition or
subtraction).
3. Significand < 1: the significand should be shifted to the left and the exponent
decreased for as long as there are leading zeros (possible in case of subtraction).
The first case may lead to overflow, since increasing the exponent may exceed the
maximum exponent for a finite number.
The third case may lead to underflow, since decreasing the exponent may exceed the
minimum exponent for a finite number.
Shifting is done using a barrel shifter, which speeds up the operation.
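The three scenarios can be sketched with an integer-significand model (our own simplification; the actual datapath works on BCD digits and raises overflow/underflow flags, both omitted here):

```python
def normalize(sig, exp, precision=16):
    """Normalize an intermediate result modeled as an integer significand.

    A significand of precision+1 digits models scenario 1 (carry out of
    an addition): shift right once and increase the exponent. Fewer than
    `precision` digits models scenario 3 (leading zeros after a
    subtraction): shift left and decrease the exponent per leading zero.
    Scenario 2 needs no adjustment. Returns (sig, exp, guard), where
    guard is the digit shifted out to the right in scenario 1 (it feeds
    the rounding step; 0 otherwise).
    """
    guard = 0
    if sig >= 10 ** precision:            # scenario 1: carry out
        guard = sig % 10
        sig //= 10
        exp += 1
    else:                                 # scenario 3: remove leading zeros
        while sig != 0 and sig < 10 ** (precision - 1):
            sig *= 10
            exp -= 1
    return sig, exp, guard
```

For instance, the 17-digit sum 10000000000000051 normalizes to significand 1000000000000005 with the exponent increased by one and guard digit 1, matching Case 1 of the proof below in section 2.4.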
Rounding is done using the roundTiesToEven attribute: the floating-point number
nearest to the infinitely precise result shall be delivered; if the two nearest
floating-point numbers bracketing an unrepresentable infinitely precise result are
equally near, the one with an even least significant digit shall be delivered.
Choosing this attribute gives an average rounding error of 0.
_____________________________________________________________________
Example 2.3.
If the exact result significand is 1.23456789012345678 (precision is p=16
digits), then the returned significand should be 1.234567890123457.
Example 2.4.
If the exact result significand is 1.23456789012345650, then the returned
significand should be 1.234567890123456.
Example 2.5.
If the exact result significand is 1.23456789012345651, then the returned
significand should be 1.234567890123457.
□
_____________________________________________________________________
Rounding is done using three rounding digits:
• Guard digit
• Round digit
• Sticky digit
If (R > 5), or (R = 5 and S ≠ 0), or (R = 5, S = 0 and the LSD is an odd number),
then the significand is increased by 1, as can be seen in Examples 2.3 and 2.5.
The sticky digit serves as a tie-breaker in the roundTiesToEven attribute.
The role of the guard digit is to guard against loss of information in case of
post-normalization (scenario 3), as explained in the next proof.
Claim: three rounding digits are sufficient when using the roundTiesToEven attribute.
_____________________________________________________________________
Proof: consider the three possible normalization scenarios mentioned above:

• Case 1: In the worst case of this scenario, the exponent difference of the original
operands is 1 (see Para. 3.3), i.e. one shift on pre-alignment, so that there is a carry
out.
For example: sigA = 9900000000000000, sigB = 1000000000000051,
exponent difference = 1.

                                             R  S
  A                      9900000000000000
  B aligned            +  100000000000005   1
  A+B                  1 0000000000000005   1
  post-normalization     1000000000000000   5  1
  rounding               1000000000000001

Therefore two extra digits are needed for rounding.
• Case 2: No shifting is done; therefore there is no need for rounding digits.
• Case 3: The significand is shifted to the left and the exponent is decreased as long
as there are leading zeros. Let us concentrate on two possible cases in this
scenario:
o The subtrahend is shifted more than one position to the right (pre-alignment).
The difference has at most one leading zero => at most one
shifted-out digit is required for post-normalization.
Sticky Digit = 0 if all the rightmost shifted digits, starting from the 19th
place, are zero; if at least one of them is nonzero, then Sticky Digit = 1.
For example: sigA = 1000000000000000, sigB = 9999999999994002,
exponent difference = 5. sigB is shifted 5 positions to the right.
=> The digits in the 19th, 20th, 21st places are 0, 0, 2 => S = 1.

                                            G  R  S
  A                      1000000000000000
  B aligned            - 0000099999999999   9  4  1
  A-B                    0999900000000000   0  5  9
                                               R  S
  post-normalization     9999000000000000   5  9
  rounding               9999000000000001

Note: the sticky digit participates in the subtraction only to generate a borrow.
After subtracting the aligned operands, the true value of the rightmost result
digit is not important; what matters is whether it is zero or not.
After post-normalization the guard digit serves as the round digit and the
round digit serves as the sticky digit.
Therefore three extra digits are needed for rounding.
o The subtrahend is shifted at most one position to the right (pre-alignment); at
most one digit is pre-aligned out of the 16 digit range.
For example: sigA = 1200000000000000, sigB = 1000000000000004,
exponent difference = 1.

                                  R
  A            1200000000000000
  B aligned  -  100000000000000   4
  A-B          1099999999999999   6
  rounding     1100000000000000

Therefore one extra digit is needed for rounding.
In Conclusion: considering the worst case, three extra digits are needed for rounding.
□
_____________________________________________________________________
After normalizing and rounding the result, another post-normalization may be
needed (in case rounding leads to significand ≥ 10). Therefore another normalizing
component is set after the result is rounded.
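The rounding rule can be sketched as follows (our own Python illustration; after post-normalization the guard digit has already taken the role of the round digit, so the function sees only the R and S digits):

```python
def round_ties_to_even(sig, r, s):
    """Round a 16-digit integer significand with round digit r and
    sticky digit s, per the rule in the text: increment when r > 5,
    or r = 5 with s != 0, or r = 5, s = 0 and the least significant
    digit is odd. (A carry out of the increment would trigger another
    post-normalization, which is handled by a separate component.)
    """
    lsd = sig % 10
    round_up = (r > 5) or (r == 5 and s != 0) or \
               (r == 5 and s == 0 and lsd % 2 == 1)
    return sig + 1 if round_up else sig
```

Applied to Examples 2.3-2.5: with r = 7 the significand 1234567890123456 is incremented; with r = 5, s = 0 it is kept (the LSD 6 is even); with r = 5, s = 1 it is incremented.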
3. High-level Design
3.1 General Operation Scheme
The general operation of the system is described in Figure 3.1.
Figure 3.1. General operation of the system. Shows the progress of a command.
A designated compiler translates DFPU commands into the correct 74 bit format. It
also translates data (operands) to DEC64 format and creates DFPU instructions for
data transfer into the DFPU register file. These commands are sent to the CPU as
payload for a Load Word operation, which writes the commands to a designated
memory segment in the RAM.
Upon writing DFPU commands in the designated memory segment, the CPU
commands the DMAC (Direct Memory Access Controller) to load the DFPU
commands to the internal DFPU memory. The CPU sends a 'go' signal to the DFPU
(see Fig. 3.2) and the DFPU subsequently begins reading the internal memory and
processing commands. Another form of communication from CPU to DFPU is
through Interrupt request (see Fig. 3.2).
Upon completion of the loaded DFPU commands, and upon certain exception
occurrences (see Para. 2.3), an exception notice is sent to the CPU.
Note: the Designated Compiler delivers numbers in a normalized form.
3.2 Interface
The DFPU (Decimal Floating Point Unit) serves as a peripheral computation unit.
Its interface includes four input signals (nrst, clk, go, interrupt) and one output signal
(Exception). Figure 3.2 describes the DFPU interface.
Figure 3.2. The DFPU interface.
• nrst – reset signal (negative reset).
• clk – unit clock signal.
• go – CPU signal to DFPU; kick-starts DFPU operation.
• interrupt – CPU signal to DFPU (e.g. soft reset).
• Exception – DFPU feedback to CPU.
The Exception signal is sent in the following cases:
• Finished - the DFPU completed performance of the loaded tasks.
• Invalid operation (see Para. 2.3).
• Overflow (see Para. 2.3).
• Underflow, Divide by Zero (see Para. 2.3; should be available in future
designs - not necessary in this context).
Note: the 'Inexact' signal does not raise the interface exception flag, because inexact
results are an acceptable and regular condition.
3.3 Arithmetic Algorithm
Since addition/subtraction is the most complicated operation in the current design,
and its implementation covers other, simpler operations (negation, increment,
decrement) from both an arithmetic and an architectural point of view, the arithmetic
algorithm was developed around it.
For any addition/subtraction of a pair of standardized decimal operands A, B, the
following expansion holds:

  A ± B = sigA × 10^ExpA ± sigB × 10^ExpB = (sigA ± sigB × 10^(ExpB - ExpA)) × 10^ExpA

The term sigB × 10^(ExpB - ExpA) represents sigB shifted ExpA - ExpB positions to
the right (assume, without loss of generality, that ExpB ≤ ExpA). Since each
significand has limited precision, we can conclude that for operands whose exponent
difference exceeds the significand precision, the smaller operand cannot affect the
result. With all that in mind, an addition algorithm emerges (Fig. 3.3).
The diagram in Figure 3.3 does not relate to the Sign bit in each operand. The sign bit
is dealt with separately, and its main function is to determine the type of operation
performed during Addition/Subtraction (example: subtraction of a negative from a
positive is performed as addition).
Adding/subtracting two signed operands gives:

  A ± B = (-1)^sA × sigA × 10^ExpA ± (-1)^sB × sigB × 10^ExpB
        = (sigA ± (-1)^(sB - sA) × sigB × 10^(ExpB - ExpA)) × (-1)^sA × 10^ExpA

Using the fact that (-1)^(sB - sA) = (-1)^(sB + sA) = (-1)^(sB ⊕ sA), we conclude:

  A ± B = (sigA ± (-1)^(sB ⊕ sA) × sigB × 10^(ExpB - ExpA)) × (-1)^sA × 10^ExpA
The actual type of operation carried out is decided by the original operation code
(add/sub) and the signs of the operands. Therefore, if the 'add' operation is coded as
op=0 and the 'sub' operation as op=1, the actual operation can be derived:

  Actual operation = addition if sB ⊕ sA ⊕ op = 0; subtraction if sB ⊕ sA ⊕ op = 1

Figure 3.3. The addition algorithm.
3.4 Data Path
In essence, the datapath manages the three elements - Sign, Significand and Exponent
- using separate paths with some interaction between them:
• Sign - the result sign depends on the input operand signs, the type of
operation performed, and the sign of the result of the significand
addition/subtraction.
• Significand - the result significand is formed by addition/subtraction of the
aligned significands (shifted according to the exponent difference), rounding
and normalizing.
• Exponent - the result exponent is formed by choosing the larger exponent and
revaluing it according to the normalization.
Accordingly, the above algorithm can be divided into smaller sub-algorithms, and
each one can be organized as a separate resource ('black boxes'):
• Program Counter - holds the address of the current instruction. The address
advances with each clock cycle.
• Register File - a collection of registers capable of read/write.
• Translate - decode a DEC64 operand into (Sign, Exponent, Significand).
• Exponent Comparator - compare the operand exponents and return the
exponent difference, which exponent is bigger, and its value.
• Check Needed - check whether significand shifting and addition/subtraction
are needed at all (due to limited precision).
• Right Shifter - align one of the significands according to the exponent
difference.
• Add/Sub - perform the actual operation (addition/subtraction) on the
significands.
• Normalizer - adjust the result significand and exponent values to avoid
leading zeros in the significand.
• Rounder - GRS rounding using the roundTiesToEven scheme.
• Inverse Translate - encode the (Result Sign, Result Exponent, Result
Significand) values into a DEC64 operand.
• Sign Decision - conclude the result sign according to the input operand signs,
the type of operation performed, and the sign of the result of the significand
addition/subtraction.
Each of the above mentioned resources was built as a function in a MATLAB script
for simulation (Chapter 4) and later implemented in a low-level Verilog design
(Chapter 5).
As discussed in Chapter 6, an instruction is divided into four stages, i.e. moving from
single-cycle datapath to a four-stage-pipelined datapath.
Therefore, the complete performance of an operation with a DFPU involves the
following stages:
1. Instruction Fetch (IF): Retrieval of DFPU command.
2. Decode (D): Retrieval and translation of DEC64 Operands to Sign,
Significand and Exponent fields.
3. Execution (E): Performing the arithmetic algorithm mentioned above.
4. Write Back (WB): Result Sign, Significand and Exponent are encoded into
DEC64 format and written to register file or result Memory.
These four stages are implemented as pipe stages. Further in this design, Pipeline
Registers are set between every two consecutive stages in order to store intermediate results.
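The staging described above can be visualized with a toy Python model; the stage names come from the text, but the scheduling function is our simplification of the Verilog pipeline of Chapter 6.

```python
# Toy model of the four-stage pipeline (IF, D, E, WB): three pipeline
# registers hold intermediate results between consecutive stages.
# Instruction names are placeholders.
STAGES = ["IF", "D", "E", "WB"]

def schedule(instructions):
    """Return, per clock cycle, which instruction occupies each stage."""
    n = len(instructions)
    cycles = []
    for clk in range(n + len(STAGES) - 1):
        occupancy = {}
        for s, stage in enumerate(STAGES):
            i = clk - s                      # instruction index in this stage
            if 0 <= i < n:
                occupancy[stage] = instructions[i]
        cycles.append(occupancy)
    return cycles

timeline = schedule(["add_r", "sub_r", "inc_r", "mov_r"])
assert timeline[0] == {"IF": "add_r"}
# Once full, all four stages are busy: throughput is one instruction per cycle.
assert timeline[3] == {"IF": "mov_r", "D": "inc_r", "E": "sub_r", "WB": "add_r"}
assert len(timeline) == 7   # n + 3 extra cycles to drain the pipe
```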
3.5 Instruction Set Architecture (ISA)
General Information
- Instruction length: 74 bits.
- Register address: 5 bits (32 registers).
- Opcode length: 5 bits.
The DFP unit supports the following operations:
- Arithmetic operations: add_r, add_m, sub_r, sub_m, inc_r, inc_m, dec_r, dec_m, neg.
- Data handling operations: mov_i, mov_r.
Note: the number of bits allocated for the opcode is larger than necessary in order to enable future expansion of the instruction set.
Arithmetic operations
The instruction format for arithmetic operations is:

opcode (5 bits) | ri (5 bits) | rj (5 bits, optional) | rk (5 bits, optional) | unused (54 bits)
add_r:
o Operation Description: dual operand addition; result written to the Register File.
o Command Format: add_r ri,rj,rk
o Actual operation: ri=rj+rk
o Datapath Description:
- Decode: two operands are read from the Register File at the locations indexed by rj and rk, and are translated to spread form. The spread operands are saved in the pipeline register, together with the result address and control signals.
- Execute: the exponents are compared and the significands are aligned accordingly. The significands are added, and the addition result, together with the bigger exponent derived from the Exponent Comparator, goes through normalization and rounding. The final sign is derived from Sign Decision. The result {Sign, Sig, Exp} is saved in the pipeline register, together with the result address and control signals.
- Write Back: finally, the result is converted back (Inverse Translate) to DEC64 format and written to the Register File.
add_m:
o Operation Description: dual operand addition; result written to the Register File and Result Memory.
o Command Format: add_m ri,rj,rk
o Actual operation: ri=rj+rk ; Mem[mem_addr]=rj+rk ; mem_addr++
o Datapath Description: identical to the description of add_r, except that the result is written to both the Result Memory and the Register File.
sub_r:
o Operation Description: operands subtraction; result written to the Register File.
o Command Format: sub_r ri,rj,rk
o Actual operation: ri=rj-rk
o Datapath Description:
- Decode: two operands are read from the Register File at the locations indexed by rj and rk, and are translated to spread form. The spread operands are saved in the pipeline register, together with the result address and control signals.
- Execute: the exponents are compared and the significands are aligned accordingly. The significands are subtracted, and the subtraction result, together with the bigger exponent derived from the Exponent Comparator, goes through normalization and rounding. The final sign is derived from Sign Decision. The result {Sign, Sig, Exp} is saved in the pipeline register, together with the result address and control signals.
- Write Back: finally, the result is converted back (Inverse Translate) to DEC64 format and written to the Register File.
sub_m:
o Operation Description: operands subtraction; result written to the Register File and Result Memory.
o Command Format: sub_m ri,rj,rk
o Actual operation: ri=rj-rk ; Mem[mem_addr]=rj-rk ; mem_addr++
o Datapath Description: identical to the description of sub_r, except that the result is written to both the Result Memory and the Register File.
inc_r:
o Operation Description: increase operand by one; result written to the Register File.
o Command Format: inc_r ri
o Actual operation: ri=ri+1
o Datapath Description: identical to add_r, except that the second operand that is added is an artificially created constant whose value is +1, and that both the source and destination register is ri.
inc_m:
o Operation Description: increase operand by one; result written to the Register File and Result Memory.
o Command Format: inc_m ri
o Actual operation: ri=ri+1 ; Mem[mem_addr]=ri+1 ; mem_addr++
o Datapath Description: identical to inc_r, except that the result is written to both the Result Memory and the Register File.
dec_r:
o Operation Description: decrease operand by one; result written to the Register File.
o Command Format: dec_r ri
o Actual operation: ri=ri-1
o Datapath Description: identical to add_r, except that the second operand (the subtrahend) is artificially created to equal -1, and that both the source and destination register is ri.
dec_m:
o Operation Description: decrease operand by one; result written to the Register File and Result Memory.
o Command Format: dec_m ri
o Actual operation: ri=ri-1 ; Mem[mem_addr]=ri-1 ; mem_addr++
o Datapath Description: identical to dec_r, except that the result is written to both the Result Memory and the Register File.
neg:
o Operation Description: change the sign of a register operand; result written to the Register File.
o Command Format: neg ri
o Actual operation: ri=-ri
o Datapath Description:
- Decode: an operand is read from the Register File at the location indexed by ri and is saved in the pipeline register as a DPD (Densely Packed Decimal) operand, together with the result address and control signals.
- Execute: the DPD operand, result address and control signals are saved in the next pipeline register.
- Write Back: the first bit of the DPD operand is complemented and, along with the rest of the DPD operand bits, is written to the Register File.
Data handling operations
mov_i:
o Operation Description: transfer an immediate value to a register.
o Command Format: mov_i ri,imm
o Actual operation: ri=imm
o Datapath description:
- Decode: the immediate data is saved directly into the pipeline register as a DPD operand, together with the result address and control signals.
- Execute: the DPD operand, result address and control signals are transferred to the next pipeline register.
- Write Back: the DPD operand is written back to the register file at the address given by the result address in the write back pipeline register.
o Instruction format:

opcode (5 bits) | ri (5 bits) | immediate (64 bits)
mov_r:
o Operation Description: transfer one register's value to another.
o Command Format: mov_r ri,rj
o Actual operation: ri=rj
o Datapath description:
- Decode: an operand is read from the Register File at the location indexed by rj and is saved in the pipeline register as a DPD operand, together with the result address (ri) and control signals.
- Execute: the DPD operand, result address and control signals are transferred to the next pipeline register.
- Write Back: the DPD operand is written back to the register file at the address given by the result address in the write back pipeline register.
o Instruction format:

opcode (5 bits) | ri (5 bits) | rj (5 bits) | unused (59 bits)

Table 3.1 shows a summary of the ISA properties.
Operation | Description | Command Format | Actual operation
add_r | Dual operand addition; result written to Register File | add_r ri,rj,rk | ri=rj+rk
add_m | Dual operand addition; result written to Register File and Result Memory | add_m ri,rj,rk | ri=rj+rk ; Mem[mem_addr]=rj+rk ; mem_addr++
sub_r | Operands subtraction; result written to Register File | sub_r ri,rj,rk | ri=rj-rk
sub_m | Operands subtraction; result written to Register File and Result Memory | sub_m ri,rj,rk | ri=rj-rk ; Mem[mem_addr]=rj-rk ; mem_addr++
inc_r | Increase operand by one; result written to Register File | inc_r ri | ri=ri+1
inc_m | Increase operand by one; result written to Register File and Result Memory | inc_m ri | ri=ri+1 ; Mem[mem_addr]=ri+1 ; mem_addr++
dec_r | Decrease operand by one; result written to Register File | dec_r ri | ri=ri-1
dec_m | Decrease operand by one; result written to Register File and Result Memory | dec_m ri | ri=ri-1 ; Mem[mem_addr]=ri-1 ; mem_addr++
neg | Change sign of register operand; result written to Register File | neg ri | ri=-ri
mov_i | Transfer immediate value to register | mov_i ri,imm | ri=imm
mov_r | Transfer one register's value to another | mov_r ri,rj | ri=rj

Instruction Format:
arithmetic: opcode (5 bits) | ri (5 bits) | rj (5 bits, optional) | rk (5 bits, optional) | unused (54 bits)
mov_i: opcode (5 bits) | ri (5 bits) | immediate (64 bits)
mov_r: opcode (5 bits) | ri (5 bits) | rj (5 bits) | unused (59 bits)

Table 3.1. Summary of the ISA properties.
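To make the field layout concrete, here is a hypothetical Python encoder for the 74-bit arithmetic format. The opcode values are invented for illustration only; the report does not list the actual binary opcodes.

```python
# Hypothetical encoder for the 74-bit arithmetic instruction word:
# opcode(5) | ri(5) | rj(5) | rk(5) | unused(54). Opcode values below
# are invented, not the project's actual codes.
OPCODES = {"add_r": 0, "add_m": 1, "sub_r": 2, "sub_m": 3}

def encode_arith(op: str, ri: int, rj: int, rk: int) -> int:
    assert 0 <= ri < 32 and 0 <= rj < 32 and 0 <= rk < 32  # 5-bit fields
    word = OPCODES[op]
    word = (word << 5) | ri
    word = (word << 5) | rj
    word = (word << 5) | rk
    return word << 54          # 54 unused low bits

word = encode_arith("add_r", 1, 2, 3)
assert word.bit_length() <= 74
assert (word >> 54) & 0x1F == 3          # rk field
assert (word >> 69) & 0x1F == 0          # opcode field
```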
4. Simulation
4.1 Simulation Introduction
The following section describes the construction of MATLAB simulations matching the
arithmetic and encoding/decoding algorithms, and the tests run on them in order to assess
their practical implementation.
The importance of such simulations is in the simple application of the algorithms in a way
that mirrors a practical implementation. Similarly, tests run on the simulations can reveal
flaws in the practical application of the algorithms.
4.2 Translate, Inverse Translate Simulation
Relevant standard sections (referring to Fig. 2.1):
"The representation r of the floating-point datum, and value v of the floating-point datum represented, are inferred from the constituent fields as follows:
a) If G0 through G4 are 11111, then v is NaN regardless of S. Furthermore, if G5 is 1, then r is sNaN; otherwise r is qNaN. The remaining bits of G are ignored, and T constitutes the NaN's payload, which can be used to distinguish various NaNs. The NaN payload is encoded similarly to finite numbers described below, with G treated as though all bits were zero. The payload corresponds to the significand of finite numbers, interpreted as an integer with a maximum value of 10^(3×J) − 1, and the exponent field is ignored (it is treated as if it were zero). A NaN is in its preferred (canonical) representation if the bits G6 through Gw+4 are zero and the encoding of the payload is canonical.
b) If G0 through G4 are 11110, then r and v = (−1)^S × (+∞). The values of the remaining bits in G, and T, are ignored. The two canonical representations of infinity have bits G5 through Gw+4 = 0, and T = 0.
c) For finite numbers, r is (S, E − bias, C) and v = (−1)^S × 10^(E−bias) × C, where C is the concatenation of the leading significand digit or bits from the combination field G and the trailing significand field T, and where the biased exponent E is encoded in the combination field. The encoding within these fields depends on whether the implementation uses the decimal or the binary encoding for the significand."9
Simulation Method
Testing Translate / Inverse Translate covered three distinct cases:
1. Combination field = 11111 (NaN).
2. Combination field = 11110 (Infinity).
3. Combination field = other (finite numbers).
In the first two cases: the Combination field bits are preset, and 500 sets of 59 additional random bits are generated. A correct simulation of the Translate function activates the NaN/Inf flags accordingly.
In the final case: 64 random bits are generated and testing is performed as follows:
1. For each random binary vector x1, the Translate function is used to find (sign1, significand1, exponent1).
2. Inverse Translate the parameters (sign, significand, exponent) back to a binary vector 'res'.
3. For the binary vector 'res', the Translate function is used to find the parameters (sign2, significand2, exponent2), which are compared with (sign1, significand1, exponent1).
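The shape of this round-trip test can be sketched in Python. Note the bit packing below is a deliberately simplified stand-in for DEC64 (plain sign/exponent/significand fields, no DPD and no combination field), used only to show the test structure.

```python
# Round-trip test shape: translate, inverse-translate, translate again,
# then compare fields. The packing here is a simplified stand-in for
# DEC64, NOT the real DPD encoding.
import random

def translate(vec: int):
    sign = (vec >> 63) & 1
    exp = (vec >> 53) & 0x3FF
    sig = vec & ((1 << 53) - 1)
    return sign, exp, sig

def inverse_translate(sign: int, exp: int, sig: int) -> int:
    return (sign << 63) | (exp << 53) | sig

random.seed(0)
for _ in range(500):
    x1 = random.getrandbits(64)
    s1, e1, c1 = translate(x1)
    res = inverse_translate(s1, e1, c1)
    assert translate(res) == (s1, e1, c1)   # round trip preserves the fields
```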
4.3 Exponent Comparator Simulation
In accordance with the Arithmetic Algorithm (see Para. 3.3), addition/subtraction of operands includes finding the bigger exponent and the exponent difference. According to the standard:
"The set of finite floating-point numbers representable within a particular format is determined by the following integer parameters:
― b = the radix, 2 or 10
― p = the number of digits in the significand (precision)
― emax = the maximum exponent e
― emin = the minimum exponent e
emin shall be 1 − emax for all formats."9
In the decimal64 format: emax=+384, b=10, p=16. Therefore, the dynamic exponent range [emin,emax] = [−383,384]. It is important to note that all exponents in the IEEE 754-2008 format are biased, that is:
"For finite numbers, r is (S, E − bias, C) and v = (−1)^S × 10^(E−bias) × C ... where the biased exponent E is encoded in the combination field."9
In our case, the bias is 383. Therefore the actual range of the exponent E is [0,767].
Simulation Method:
Run all possible combinations of e1,e2 to test exponent_comparator function.
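An exhaustive check in this style can be sketched in Python (function and output names are ours; the project's test ran in MATLAB over the same biased range):

```python
# Sketch of the exponent comparator and its exhaustive test over the
# biased exponent range [0, 767] (bias 383, as stated in the text).
def exponent_comparator(e1: int, e2: int):
    diff = abs(e1 - e2)
    is_bigger = 1 if e1 < e2 else 0     # 0: e1 >= e2, 1: e1 < e2
    bigger = e2 if is_bigger else e1
    return diff, is_bigger, bigger

# Run all possible combinations of e1, e2.
for e1 in range(768):
    for e2 in range(768):
        diff, is_bigger, bigger = exponent_comparator(e1, e2)
        assert bigger == max(e1, e2)
        assert bigger - diff == min(e1, e2)
```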
4.4 Full Path Simulation
Using all the resources simulated in MATLAB, one full path can be constructed, creating a
full addition/subtraction path that can be simulated.
Simulation Method
The simulation of the full addition/subtraction path consists of 3 stages:
1. Initialization: 1000 pairs of 64-bit, DEC64-coded operands are randomly created. Each pair is translated into spread form.
2. Run: each pair of 64-bit DEC64 operands is input into 'Full_Path'. For each pair in 'Full_Path', a matching DEC64-format addition result is created and translated to spread format.
3. Result Analysis: the initial operands are added externally in MATLAB and compared to the result output of 'Full_Path' in its spread form. If one of the random operands is a NaN or Inf, the result operand should reflect it in its Combination field.
Note: Due to precision limitations of MATLAB, these simulations needed to employ the use
of Variable Precision Arithmetic functions (VPA) in the Symbolic Math Toolbox. These
functions allow for variable precision and provide more flexibility and control in
manipulating numbers.
4.5 Simulation Results
1. Translate, Inverse Translate: of 1000 cases, there were no cases found where 'x1' differs from 'res' (i.e. where the original vector and the result vector differ).
2. Exponent Comparator: all combinations of the 'Exponent Comparator' variables were examined - no errors were found.
3. Full Path: in 1000 cases (each case using 2 random variables), the operand addition/subtraction created the expected result using the above algorithm.
In conclusion
In 100% of the cases, simulation results matched the expected values.
4.6 Full Path Graphic User Interface
In addition to MATLAB simulation of the full path, a Graphic User Interface (GUI) was
designed in order to have a user-friendly simulation tool for decimal floating-point
computation that complies with the IEEE 754-2008 standard.
Figure 4.1 shows the simulation GUI for decimal floating-point computation.
The input operands and result are also displayed in DEC64 format (in hexadecimal form).
Figure 4.1. Simulation GUI for decimal floating-point computation.
5. Implementation
5.1 Low-level Design
The low-level design of the DFPU is implemented in Verilog, using Cadence Simvision
simulation tool. Each resource mentioned in Chapter 3.4 is implemented as a separate Verilog
file, and is checked against its own test bench.
5.2 Program Counter
The Program Counter (Fig. 5.1) consists of a simple 8 bit counter that produces the address of
current instruction. Address advances with each clock cycle.
Figure 5.1. Program Counter.
A 'jump to address' option is provided for future designs: the jmp_en bit enables the jump, and an 8 bit offset value defines the jump amount.
5.3 Register File
The Register File (Fig. 5.2) consists of 32 registers, each one with a 64 bit width.
Read: Two registers can be read simultaneously (Dual-Port Register File), using the registers
index (5 bit).
Write: 64 bit of data can be written to a register, using the register's index and setting the
Write enable bit.
A register can be written while reading from a different indexed register, i.e. results are written back to a register in parallel with reading operands during the Decode stage.
Figure 5.2. Register File.
5.4 Translate & Inverse Translate
The Translate component (Fig. 5.3) decodes a DEC64 input operand to sign, exponent, and
significand. If the given input is a NaN/Infinity, the isNaN/isInf output bit is set.
Figure 5.3. Translate component.
The Inverse Translate component (Fig. 5.4) encodes sign, exponent, and significand to a
DEC64 output operand. If the encoded operand is a NaN/Infinity, the input isNaN/isInf bit
declares it.
Figure 5.4. The Inverse Translate component.
5.5 Exponent Comparator
The Exponent Comparator (Fig. 5.5) subtracts the input exponents and returns:
diff – the difference between the input exponents.
isBigger – a bit that indicates which exponent is bigger. (0: if exp1≥exp2. 1: if exp1<exp2).
biggerexp – the bigger exponent.
Figure 5.5. Exponent Comparator.
5.6 Check Needed
The Check Needed component (Fig. 5.6) simply checks whether the input diff > 17 (decimal). If it does, en=0 and there is no need for shifting, adding or subtracting; otherwise en=1.
Figure 5.6. The Check Needed component.
5.7 Right Shifter
The Right Shifter component (Fig. 5.7) aligns the input significands according to the other inputs:
en_in - shift enable bit.
isBigger - indicates which operand has the bigger exponent.
diff - the shift amount (the difference between the exponents of the operands).
The output significands consist of 76 bits.
If en_in is set, a Right Shift must be performed. The significand to be shifted is concatenated with 64 bits (16 trailing zero digits) and goes through a Barrel Shifter.
The significand to be shifted is chosen according to the value of isBigger:
If isBigger=0 - sig2 is shifted by 'diff' positions.
If isBigger=1 - sig1 is shifted by 'diff' positions.
The output of the Barrel Shifter is truncated to 76 bits (19 digits). The 19th digit of the truncated output serves as the Sticky Digit, signifying the existence of non-zero trailing digits. The Sticky Digit is constructed according to the following rule: if the 19th digit is not zero, the Sticky Digit retains its value; otherwise, it retains a 0 value if and only if the 56 least significant bits are 0, and is set to decimal 1 otherwise.
The unshifted significand is concatenated with three trailing zero digits (12 bits).
If en_in is cleared, the output significand of the bigger operand is concatenated with three trailing zero digits. The other output significand is zero (it won't be used later).
en_out=en_in and is provided for timing reasons.
Shifting is done using a Barrel Shifter, which speeds up the operation.
Figure 5.8 shows the implementation of the Right Shifter component.
Figure 5.7. The Right Shifter component.
Figure 5.8. Implementation of the Right Shifter component.
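The digit-level effect of the shift and the sticky rule can be sketched as follows. This Python model works on decimal digits rather than 4-bit BCD groups and is our simplification of the hardware.

```python
# Model a significand as a list of decimal digits, most significant
# first. Shifting right by `diff` digit positions drops digits off the
# end; the last kept digit acts as the sticky digit.
def right_shift_sticky(digits, diff):
    shifted = [0] * diff + digits
    kept, dropped = shifted[:len(digits)], shifted[len(digits):]
    # Sticky rule: a non-zero last kept digit retains its value;
    # otherwise it becomes 1 iff any dropped digit was non-zero.
    if kept[-1] == 0 and any(d != 0 for d in dropped):
        kept[-1] = 1
    return kept

# The digit 1 lands in the sticky position and is kept as-is.
assert right_shift_sticky([1, 2, 3], 2) == [0, 0, 1]
# Only zeros were dropped, so nothing changes.
assert right_shift_sticky([1, 0, 0], 1) == [0, 1, 0]
# A non-zero digit was dropped past a zero sticky position -> set to 1.
assert right_shift_sticky([1, 0, 0, 3], 2) == [0, 0, 1, 1]
```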
5.8 Adder/Subtractor
Given two unsigned significands, the main goal is to produce a new result significand, which
is the output of one of the following scenarios:
1. Addition: adding the two significands.
2. Subtraction: subtracting the two significands (return the absolute difference).
3. No operation (return one specific significand out of the two input significands).
Figure 5.9 shows a general description of the Adder/Subtractor.
Figure 5.9. Adder/Subtractor.
Inputs:
- sig1, sig2 - the input significands consist of 19 digits each, where each digit is represented by 4 bits (BCD - binary-coded decimal representation); thus an input significand consists of 76 bits.
- en - operation enable bit. When set, Addition/Subtraction is carried out; when cleared, no operation takes place.
- add/sub - indicates the type of operation. 1 - Addition; 0 - Subtraction.
- isBigger - indicates which of the operands that include the significands is bigger. 1: if operand1 < operand2. 0: if operand1 ≥ operand2. Note: isBigger gives no information about the relation between sig1 and sig2.
Outputs:
- res. sig - the output significand, consisting of 19 digits (76 bits).
- c_out - the output carry of the operation.
- op_sign - the sign of the output result.
Addition or Subtraction of two significands cannot be done bitwise, but must be performed in groups of 4 bits due to the use of BCD representation.
Note: BCD representation is a 4 bit binary representation of the decimal digits in the range {0,…,9}10 → {0000,…,1001}2.
Figure 5.10 describes the implementation of 4 bit BCD Adder for calculation of a single
digit.
Similarly to binary subtraction using 1's complement, subtraction is carried out using 9's complement representation. Therefore, if the current operation is subtraction (add/sub = 0), the complemented digit B (which is 9−B) is chosen by the multiplexer at the entrance of the upper 4 bit Full Adder. The reason for complementing B is that subtraction is simply addition with a 9's complemented subtrahend.10
Whenever the sum of the upper 4 bit Full Adder exceeds (1001)2=9, the output sum has to be fixed so that the output will equal (sum−10) and the carry out will equal 1. This can be achieved by adding (0110)2=6 to the sum.
___________________________________________________________________________
Proof: (sum-10) = sum + 6 - 16 = (sum+6) - 16. Subtracting 16 from (sum+6) is the same as
taking the 4 rightmost bits and omitting the MSB (which is the carry out).
□
___________________________________________________________________________
The check of whether a fix is needed can be obtained by a rather simple circuit.
A fix is needed whenever carry out = 1. Examination of the Truth Table in Table 5.1 concludes:

$carry\ out = (s_1 + s_2)\cdot s_3 + c\_out$

The carry out bit also serves as the decision bit of the output Multiplexer.
Figure 5.10. Implementation of 4 bit BCD Adder.
If the carry out bit is cleared, the output is the sum of the upper 4 bit Full Adder.
If the carry out bit is set, the output is the sum of the lower 4 bit Full Adder (the fixed sum).
s3 | s2 | s1 | s0 | c_out | carry out
0 | 0 | 0 | 0/1 | 0 | 0
0 | 0 | 0 | 0/1 | 1 | 1
0 | 0 | 1 | 0/1 | 0 | 0
0 | 0 | 1 | 0/1 | 1 | 1
0 | 1 | 0 | 0/1 | 0 | 0
0 | 1 | 0 | 0/1 | 1 | 1
0 | 1 | 1 | 0/1 | 0 | 0
0 | 1 | 1 | 0/1 | 1 | 1
1 | 0 | 0 | 0/1 | 0 | 0
1 | 0 | 0 | 0/1 | 1 | 1
1 | 0 | 1 | 0/1 | 0 | 1
1 | 0 | 1 | 0/1 | 1 | 1
1 | 1 | 0 | 0/1 | 0 | 1
1 | 1 | 0 | 0/1 | 1 | 1
1 | 1 | 1 | 0/1 | 0 | 1
1 | 1 | 1 | 0/1 | 1 | 1

Table 5.1. Truth Table for the carry out bit.
_____________________________________________________________________
Example 5.1.
Adding 9 and 5 needs a fix:
1001 + 0101 = 1110 = (14)dec ≥ 10 → a fix is needed!
1110 + 0110 = 1 0100
Answer is (sum = 0100 = 4) and (carry = 1), which corresponds to 14.

Example 5.2.
Adding 3 and 4 doesn't need a fix:
0011 + 0100 = 0111 = (7)dec < 10 → no fix is needed!
Answer is (sum = 0111 = 7) and (carry = 0), which corresponds to 7.
□
_____________________________________________________________________
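A behavioural Python sketch of the single-digit correction (ours; the hardware realizes this with the two Full Adders and the multiplexer of Fig. 5.10):

```python
# One-digit BCD addition with the +6 correction described above
# (behavioural sketch, not the gate-level structure).
def bcd_digit_add(a: int, b: int, carry_in: int = 0):
    s = a + b + carry_in          # plain binary sum, range 0..19
    if s > 9:
        s = (s + 6) & 0xF         # add 6, keep 4 LSBs: same as s - 10
        return s, 1               # carry out to the next digit
    return s, 0

assert bcd_digit_add(9, 5) == (4, 1)   # Example 5.1: 9 + 5 = 14
assert bcd_digit_add(3, 4) == (7, 0)   # Example 5.2: 3 + 4 = 7
```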
Both addition and subtraction are executed using Carry-Select Adder with groups of 4 bits
(Fig. 5.11).
Reminder: calculations are carried out using BCD representation; therefore calculation of a
certain digit (4 bits) should be separated from calculations of the other digits.
Carry-Select Adder saves the carry ripple time11 in exchange for adding another Full Adder for the calculation of each digit (except for the least significant digit). This approach follows the principle: "High Performance over Silicon Cost".
The added digits (each consists of 4 bits) enter two identical 4 bit BCD Adders, where the
input carry of one adder is logic '0' and the input carry of the other adder is logic '1'. A
Multiplexer chooses one of the two sums produced and one of the two output carries. The
decision bit is the carry out that is chosen in the previous Multiplexer.
Addition:
Adding two significands is rather simple when compared to Subtraction.
The output op_sign that should indicate the sign of the result is always 0, since the result is
always positive.
In case of a result significand that is ≥ 10, carry out = 1.
Subtraction:
When subtracting two significands there are two possible scenarios:
1. sig1 > sig2
2. sig1 ≤ sig2
The result of subtraction should be displayed in an absolute value.
The output op_sign should indicate the sign of the result.
The first subtraction scenario leads to (carry out = 1). This wrap-around-carry should be
added to the result significand.
_____________________________________________________________________
Proof: sig1−sig2 = sig1 + sig2_complemented = sig1 + (99……999 − sig2) =
= (sig1−sig2) + 99……999 ⇒ carry out = 1, since (sig1−sig2) > 0.
Adding the wrap-around-carry ⇒ (sig1−sig2) + 99……999 + 1 =
= sig1−sig2 (omitting the MSB) = |sig1−sig2|.
□
_____________________________________________________________________
The second subtraction scenario leads to (carry out = 0), and the answer needs 9's complementing.
_____________________________________________________________________
Proof: sig1−sig2 = sig1 + sig2_complemented = sig1 + (99……999 − sig2) =
= (sig1−sig2) + 99……999 (⇒ carry out = 0, since (sig1−sig2) ≤ 0)
= 99……999 − (sig2−sig1) = (sig2−sig1) complemented ⇒
⇒ complementing the answer will give (sig2−sig1) = |sig1−sig2|.
□
_____________________________________________________________________
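Both proofs can be exercised with a small Python sketch over 16-digit integers (our names; the digit width is illustrative):

```python
# Decimal subtraction via 9's complement, covering both scenarios
# proved above: end-around carry vs. re-complementing the answer.
NINES = 10**16 - 1   # 99......9 (16 nines)

def sub_nines(sig1: int, sig2: int):
    """Return (|sig1 - sig2|, op_sign); op_sign = 1 iff sig1 - sig2 <= 0."""
    raw = sig1 + (NINES - sig2)       # add the 9's complement of sig2
    if raw >= 10**16:                 # carry out = 1: first scenario
        return raw + 1 - 10**16, 0    # add wrap-around carry, drop the MSB
    return NINES - raw, 1             # carry out = 0: re-complement

assert sub_nines(500, 123) == (377, 0)   # sig1 > sig2: end-around carry
assert sub_nines(123, 500) == (377, 1)   # sig1 < sig2: complemented answer
```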
In order to save the time of adding the wrap-around-carry (first scenario) or complementing the answer (second scenario), the Adder/Subtractor component calculates the result according to the three scenarios (one in Addition, two in Subtraction) in parallel, and a Multiplexer chooses the output significand.
Using two 76 bit Carry-Select BCD Adders (named pipe1 and pipe2), one with (carry in = 0) and the other with (carry in = 1), and a 9's complement unit, a correct result can be obtained for each of the three scenarios:
for each of the three scenarios:
- For Addition: the result is the output of pipe1 (carry in = 0).
- For Subtraction (first subtraction scenario): the result is the output of pipe2 (carry in = 1). Adding a wrap-around-carry is the same as setting (carry in = 1) in the first place.
- For Subtraction (second subtraction scenario): the result is the 9's complement of the output of pipe1 (carry in = 0).
Note: creation and complementing of the output of pipe1 is carried out in parallel (once an output digit is calculated, it is complemented) and not after the entire output significand is calculated.
Figure 5.12 shows the implementation of the Adder/Subtractor component.
46
No operation:
When the input en bit is cleared, no operation is needed. Therefore the output significand and op_sign are chosen according to the other input bits, add/sub and isBigger:
- isBigger is cleared (operand1 ≥ operand2): the output significand is sig1 and the result is positive (op_sign = 0).
- isBigger is set (operand1 < operand2): the output significand is sig2. If the operation is Addition (add/sub = 1), then op_sign = 0; if the operation is Subtraction (add/sub = 0), then op_sign = 1.
The Truth Table in Table 5.2 summarizes the relations between the outputs and the inputs:

en | isBigger | add/sub | p1_cout | res. sig | c_out | op_sign
0 | 0 | 0 | 0 | sig1 | 0 | 0
0 | 0 | 0 | 1 | sig1 | 0 | 0
0 | 0 | 1 | 0 | sig1 | 0 | 0
0 | 0 | 1 | 1 | sig1 | 0 | 0
0 | 1 | 0 | 0 | sig2 | 0 | 1
0 | 1 | 0 | 1 | sig2 | 0 | 1
0 | 1 | 1 | 0 | sig2 | 0 | 0
0 | 1 | 1 | 1 | sig2 | 0 | 0
1 | 0 | 0 | 0 | pipe1c | 0 | 1
1 | 0 | 0 | 1 | pipe2 | 0 | 0
1 | 0 | 1 | 0 | pipe1 | 0 | 0
1 | 0 | 1 | 1 | pipe1 | 1 | 0
1 | 1 | 0 | 0 | pipe1c | 0 | 1
1 | 1 | 0 | 1 | pipe2 | 0 | 0
1 | 1 | 1 | 0 | pipe1 | 0 | 0
1 | 1 | 1 | 1 | pipe1 | 1 | 0

Table 5.2. Truth Table for the outputs of the Adder/Subtractor component, where: p1_cout is the carry out of the pipe1 Adder, pipe1 is the pipe1 output significand, pipe1c is the pipe1 output significand complemented, and pipe2 is the pipe2 output significand.
Therefore:

$op\_sign = \overline{en}\cdot isBigger\cdot \overline{add/sub} + en\cdot \overline{add/sub}\cdot \overline{p1\_cout}$

$c\_out = en\cdot (add/sub)\cdot p1\_cout$

$res.sig = \overline{en}\cdot \overline{isBigger}\cdot sig1 + \overline{en}\cdot isBigger\cdot sig2 + en\cdot (add/sub)\cdot sig\_out\_pipe1 + en\cdot \overline{add/sub}\cdot p1\_cout\cdot sig\_out\_pipe2 + en\cdot \overline{add/sub}\cdot \overline{p1\_cout}\cdot sig\_out\_pipe1\_completed$
Figure 5.11. Implementation of a 76 bit Carry-Select BCD Adder.
Figure 5.12. Implementation of the Adder/Subtractor component.
5.9 Normalizer
The Normalizer (Fig. 5.13) shifts the input significand according to the Normalizing
specifications (see Para. 2.4).
c_out indicates that a carry out has occurred in the previous resource (significand ≥ 10).
OVF/UNDF bit is set if the normalization causes an overflow/underflow.
Figure 5.13. Normalizer.
Shifting is done using a Barrel Shifter, which speeds up the operation.
5.10 Rounder
The Rounder (Fig. 5.14) rounds the input significand according to the Rounding
specifications (see Para. 2.4).
Figure 5.14. Rounder.
5.11 Sign Decision
The Sign Decision component (Fig. 5.15) finds the final sign of the result operand.
Figure 5.15. The Sign Decision component.
If add/sub=1: the actual operation carried out in the Adder/Subtractor was addition. Therefore the final sign is sign1.
If add/sub=0: the actual operation carried out in the Adder/Subtractor was subtraction. Therefore the final sign is (sign1 XOR op_sign), where op_sign is the sign of the result of the subtraction.
Figure 5.16 shows the implementation of the Sign Decision component.
Figure 5.16. Implementation of the Sign Decision component.
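The rule above reduces to a two-line Python sketch (our names, purely illustrative):

```python
# Sign Decision: for an effective addition the sign of the first
# operand wins; for a subtraction it is corrected by the sign produced
# by the significand subtraction (op_sign).
def sign_decision(sign1: int, add_sub: int, op_sign: int) -> int:
    return sign1 if add_sub == 1 else sign1 ^ op_sign

assert sign_decision(0, 1, 0) == 0  # addition: keep sign1
assert sign_decision(0, 0, 1) == 1  # subtraction came out negative
assert sign_decision(1, 0, 1) == 0  # two sign flips cancel out
```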
6. Integration
6.1 Composing the Complete System
In order to compose the complete system, the implemented units (Chapter 5) were integrated
into a full, connected datapath.
6.2 Creating a Pipelined Datapath
An instruction is divided into four stages, i.e. moving from a single-cycle datapath to a four-stage-pipelined datapath, which means that up to four instructions will be in execution during any single clock cycle.
By creating three Pipeline Registers, stages are separated and information is saved with each
rising edge of the clock.
The pipeline execution throughput is one instruction per cycle.
Figure 6.1 shows the Pipelined Datapath.
Note: certain signals (OVF, UNDF, isNaN, isInf etc.) were omitted from Figure 6.1 in order to describe the main flow of data.
6.3 Creating a Control Unit
In order to manage the advancement of the pipeline and the control signals (enable bits, multiplexer decision bits) of the different concurrent operations, a central Control Unit is necessary. The control unit is implemented as a Finite State Machine (FSM).
Given six basic states (Instruction Fetch, Decode, Execute, Write Back, an Idle state for system reset, and wait4inst, in which the system is out of reset and waiting for the input cache to load) and a pipelined datapath, several stages can coexist simultaneously (each combination of IF, D, E, WB). For each of the possible stage combinations, the control unit allocates a unique state and sends signals to the datapath according to the current state. Overall there are 17 states available.
Figure 6.1. The Pipelined Datapath.
Figure 6.2 describes the transfer function between states.
For simplicity, states in the diagram were joined according to similar next states.
Transitions between states are represented by <condition>/<next state>. For example: NI/EW
means that no new instruction has arrived and the next state is EW.
Each transfer between states depends on the previous state and on whether or not there is a following instruction to perform (in that case the FSM receives mem_valid=1). So long as there is a following instruction to perform, a valid bit is sent to the Program Counter in order to fetch the new instruction. Once there are no more instructions to perform, the Program Counter valid bit is cleared and no more instructions are fetched.
Note: the pipeline continues to execute the existing instructions.
For each new instruction Fetched, the opcode is analyzed by the FSM, and it returns the
appropriate control signals. Similarly, with each state transfer, write enable signals are sent to
the pipeline registers.
Table 6.1 describes the values of the control signals for each type of input instruction opcode.
The control signals are:
 DPD source – selects the source of the densely packed decimal operand.
 wsource – selects the source of the data to be written.
 incdec – selects between increment and decrement.
 unbinop – selects between a unary and a binary operation.
 wbmethod – selects the Write Back method.
 negator – enables the negate operation.
 wen – write enable for the Register File.
 sub_op – selects the subtract operation.
 selfwrite – enables reading from and writing to the same register.
An extra signal is reserved for further design expansion.
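As a sanity reference, the opcode decoder described above can be modeled as a simple lookup table. The following Python sketch is a hypothetical behavioral model, not the FSM's actual implementation; the signal values are taken from Table 6.1, and the function and key names are assumptions.

```python
# Hypothetical Python model of the FSM's opcode decoder as a lookup table.
# Signal values follow Table 6.1; function and key names are assumptions.

SIGNALS = ("dpd_source", "wsource", "incdec", "unbinop", "wbmethod",
           "negator", "wen", "sub_op", "selfwrite", "reserved")

CONTROL_TABLE = {
    "add_r": (1, 1, 0, 0, 0, 0, 1, 1, 0, 0),
    "add_m": (1, 1, 0, 0, 1, 0, 1, 1, 0, 0),
    "sub_r": (1, 1, 0, 0, 0, 0, 1, 0, 0, 0),
    "sub_m": (1, 1, 0, 0, 1, 0, 1, 0, 0, 0),
    "inc_r": (1, 1, 1, 1, 0, 0, 1, 1, 1, 0),
    "inc_m": (1, 1, 1, 1, 1, 0, 1, 1, 1, 0),
    "dec_r": (1, 1, 0, 1, 0, 0, 1, 1, 1, 0),
    "dec_m": (1, 1, 0, 1, 1, 0, 1, 1, 1, 0),
    "neg":   (1, 0, 0, 0, 0, 1, 1, 0, 1, 0),
    "mov_i": (0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
    "mov_r": (1, 0, 0, 0, 0, 0, 1, 0, 0, 0),
}

def decode(opcode):
    """Return the control-signal assignment for one instruction opcode."""
    return dict(zip(SIGNALS, CONTROL_TABLE[opcode]))
```

A table-driven model like this is also convenient as a golden reference when verifying the FSM's decoded outputs in simulation.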
Figure 6.2. Control Unit: transition function between states.
         DPD source  wsource  incdec  unbinop  wbmethod  negator  wen  sub_op  selfwrite  reserved
add_r        1          1       0        0        0         0      1      1        0          0
add_m        1          1       0        0        1         0      1      1        0          0
sub_r        1          1       0        0        0         0      1      0        0          0
sub_m        1          1       0        0        1         0      1      0        0          0
inc_r        1          1       1        1        0         0      1      1        1          0
inc_m        1          1       1        1        1         0      1      1        1          0
dec_r        1          1       0        1        0         0      1      1        1          0
dec_m        1          1       0        1        1         0      1      1        1          0
neg          1          0       0        0        0         1      1      0        1          0
mov_i        0          0       0        0        0         0      1      0        0          0
mov_r        1          0       0        0        0         0      1      0        0          0
Table 6.1. Control Signals for each type of instruction.
6.4 Pipeline Hazards
A Read-after-Write (RAW) data hazard may occur in the designed pipeline, which can result
in incorrect computation. This hazard occurs when an instruction refers to a result that has
not yet been calculated or retrieved.
For example:
add_r r1,r2,r3
add_r r5,r1,r4
Here r1 is read before its true value is written, because the second instruction starts its
Execution stage when the first instruction starts its Write Back stage.
Some possible future solutions to this problem are:
1. Stalling the pipeline (increases latency).
2. Forwarding: once an instruction finishes its Execution stage, its result can be used
immediately in the Execution stage of the next instruction.12
3. Reordering instructions to avoid hazards (done by the designated compiler).
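The hazard condition and the forwarding remedy are easy to model in software. The Python sketch below is a hypothetical behavioral model; the instruction record format and function names are assumptions, not the DFPU's actual decode format.

```python
# Hypothetical Python model of RAW-hazard detection and forwarding; the
# instruction record format is an assumption, not the DFPU's decode format.

def raw_hazard(first, second):
    """True if `second` reads a register that `first` has not yet written back."""
    return first["dest"] in second["srcs"]

def read_operands(first, second, regfile, alu_result):
    """Read `second`'s operands, forwarding `first`'s ALU result when needed."""
    return [alu_result if reg == first["dest"] else regfile[reg]
            for reg in second["srcs"]]

# The example from the text: r1 is written by the first instruction and
# read by the second before Write Back completes.
i1 = {"op": "add_r", "dest": "r1", "srcs": ["r2", "r3"]}
i2 = {"op": "add_r", "dest": "r5", "srcs": ["r1", "r4"]}
```

In hardware, the same comparison of destination and source register indices is what drives the bypass multiplexers at the Execution stage inputs.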
7. Arithmetic and System Verification
7.1 Verification Properties
The main properties necessary for verification and validation of the DFP unit are:
 Correct calculation of arithmetic operations – includes arithmetic testing and correct
datapath operation.
 Compliance with the IEEE 754-2008 standard specifications.
Correct calculation of arithmetic operations
The following types of tests were performed:
1. Correct addition/subtraction of operands.
2. Correct operation for large exponent differences.
3. Correct handling of Overflow/Infinity.
Compliance with the IEEE 754-2008 standard specifications
The following specifications were tested:
1. Correct translation to/from the DEC64 format to the spread decimal floating-point format.
2. Correct result rounding according to the chosen IEEE 754-2008 rounding scheme.
3. Correct encoding/decoding of Infinity/NaN.
The guidelines for the verification were taken from test vectors published by Prof. Mike
Cowlishaw1. His work was written prior to the publication of the new IEEE 754-2008
standard, and therefore could not be used in full, but the principles of his verification
technique were adapted for this project.
Initially, as a quick confidence check, a sample assembly program was loaded into the
instruction cache and the results were validated. This initial test examined the basic
operation of each available command.
The next step was to build a more robust testing array, based on the scheme in Figure 6.3.
Figure 6.3. Verification Scheme.
The test vectors used for comparison were taken from Cowlishaw's website and from
IBM Haifa's Floating Point test Generator2,3. The IBM test vectors were translated to DFP
commands according to the DFPU ISA using AWK scripts.
The commands were loaded into the UUT (Unit Under Test – the DFPU), and the results
were printed out and compared to the results given by IBM.
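The comparison step of this scheme amounts to a line-by-line match of the printed results against the reference vectors. The Python fragment below is a hypothetical model of that step; the function name and line format are assumptions, not the project's actual scripts.

```python
# Hypothetical Python model of the comparison step in the verification
# scheme: DFPU outputs are matched line-by-line against reference vectors.
# Function name and line format are assumptions.

def compare_results(uut_lines, reference_lines):
    """Return (line number, uut, expected) tuples wherever outputs differ."""
    mismatches = []
    pairs = zip(uut_lines, reference_lines)
    for lineno, (got, expected) in enumerate(pairs, start=1):
        if got.strip() != expected.strip():
            mismatches.append((lineno, got.strip(), expected.strip()))
    return mismatches
```

Collecting every mismatch, rather than stopping at the first one, makes it easier to spot the systematic differences (such as rounding-scheme disagreements) discussed in the next section.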
7.2 Verification Conclusions and Results
Correct calculation of arithmetic operations
1. Correct addition/subtraction of operands - verified. In cases where results differed, close
examination showed that the cause was different rounding schemes.
2. Correct operation for large exponent difference - verified.
3. Correct handling of Overflow/Infinity - verified. In cases where results differed, close
examination showed that the cause was different rounding schemes.
Compliance with IEEE 754-2008 standard specifications
1. Correct translation to/from the DEC64 format to the spread decimal floating-point
format – verified.
2. Correct result rounding according to the chosen IEEE 754-2008 standard scheme –
despite the different rounding schemes, in some cases the result is rounded to the same
value. Of all the cases examined, some errors in rounding were identified and corrected.
In the other cases, the DFPU result agreed with the chosen rounding scheme, and the
differences between the DFPU and the IBM test vectors were due to the different
rounding schemes.
3. Correct encoding/decoding of infinity/NaN – verified.
8. Synthesis
8.1 Implementation on FPGA
The integrated system was implemented on a Virtex®-6 FPGA ML605 Evaluation Board.
The design was loaded using the Xilinx ISE Design Suite 13.2. The *.list files, used in
the Cadence Simvision environment to simulate the instruction and result memories, were
implemented on the Virtex-6 using Distributed RAM loaded from *.coe files.
inst_mem.coe represents our instruction memory and is the basis of our Test Bench.
8.2 Design Evaluation
Running the design on the Virtex-6 provided the possibility to test the design on real-life
hardware, with real-life hardware constraints. Specifically, it allows testing timing and
clock-frequency constraints.
Solving synthesis problems
A significant synthesis problem was that the designed Normalizer included a while loop,
which is not synthesizable. Converting the while loop into a series of conditional if
statements solved this issue.
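The idea behind the fix can be illustrated in software. In the Python sketch below (a behavioral model only; the 16-digit significand width and function names are assumptions, not the DFPU's exact Normalizer), the data-dependent while loop is replaced by a fixed chain of conditional shifts, the form that maps to synthesizable hardware because its iteration count is bounded at compile time.

```python
# Python illustration of the Normalizer rewrite; the 16-digit significand
# width is an assumption. A data-dependent while loop has no fixed bound,
# so it cannot be synthesized; a fixed chain of conditionals can.

DIGITS = 16  # assumed significand width

def normalize_while(sig, exp):
    """Behavioral form: shift left until the leading digit is nonzero."""
    while sig != 0 and sig < 10**(DIGITS - 1):
        sig *= 10
        exp -= 1
    return sig, exp

def normalize_unrolled(sig, exp):
    """Synthesizable form: a fixed chain of conditional shifts (8, 4, 2, 1)."""
    if sig != 0:
        for shift in (8, 4, 2, 1):        # fixed bound, fully unrollable
            if sig < 10**(DIGITS - shift):
                sig *= 10**shift
                exp -= shift
    return sig, exp
```

The unrolled version behaves like a barrel shifter: each stage conditionally shifts by a power-of-two digit count, so any shift amount from 0 to 15 is reached in exactly four stages.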
Identifying the optimal clock rate
The process of identifying the optimal clock rate for the DFPU involved running the unit
at higher and higher clock rates until incorrect results were returned, due to commands
failing to complete in time. Using a PLL (phase-locked loop), multiple clocks with
different rates were created, and the working clock was selected by on-board switches.
The optimal clock rate identified for the DFPU is 66 MHz.
9. Summary
9.1 DFPU Review
The DFPU is a hardware implementation of decimal arithmetic algorithms (specifically
addition, subtraction and related operations). Its high-level design is integrated into the
low-level design, and it has undergone algorithmic simulation, verification and final
hardware synthesis.
The design is unique in terms of several parameters:
 The design is built to comply with the IEEE 754-2008 standard definitions.
 The design includes an advanced Adder/Subtractor, which provides equal runtime for
addition and subtraction and avoids the wasteful (in both time and silicon area)
comparison of significands found in earlier designs13; this also provides modularity.
 The design provides addition/subtraction with a latency of 4 clock cycles and a
throughput of one instruction per clock cycle.
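The comparison-free behavior can be illustrated with the standard ten's-complement technique. The Python sketch below models only that general idea (the 16-digit width and function names are assumptions, not the DFPU's exact datapath): subtraction is performed as a single addition, and the carry-out alone decides whether the result must be recomplemented, so the significands never need to be compared first.

```python
# Python model of the standard ten's-complement subtraction technique
# (an illustration of the general idea, not the DFPU's exact datapath).
# One addition plus a carry-out check replaces the magnitude comparison.

DIGITS = 16  # assumed significand width

def tens_complement(x):
    return 10**DIGITS - x

def subtract_magnitudes(a, b):
    """Return (|a - b|, negative?) without comparing a and b first."""
    s = a + tens_complement(b)
    if s >= 10**DIGITS:                # carry out: a >= b
        return s - 10**DIGITS, False
    return tens_complement(s), True    # no carry: recomplement, result negative
```

Because both the add path and the subtract path go through the same adder, addition and subtraction take the same time, which is the equal-runtime property claimed above.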
9.2 Future Expansions
Potential expansions to the DFPU range from functionality to efficiency.
Functionality
 Additional DFP functions should be made available, such as Multiply, Divide, Fused
Multiply-Add and Compare.
 Additional control functionality should be made available, such as loop support and
branch support.
 Additional hardware should be made available for creating a detailed data payload in
case of an Invalid Operation exception.
Efficiency
 The Adder/Subtractor can be enhanced using carry-look-ahead in each 4-bit BCD adder.
 Further attempts to create an even pipeline should be made. For example, it is possible
to take advantage of distributed-RAM capabilities to speed up the Fetch stage and append
it to the following Decode stage, thus forming a 3-stage pipeline.
 Support for advanced data-hazard solutions can be added (forwarding, reordering;
see Para. 6.4).
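For context on the carry-look-ahead suggestion, one stage of a classic BCD adder can be modeled as below. This is a Python behavioral sketch of the textbook +6-correction scheme, not necessarily the DFPU's exact adder; the carry-look-ahead proposed above would speed up the binary carry chain inside each such 4-bit stage.

```python
# Behavioral Python sketch of one 4-bit BCD digit adder stage (textbook
# +6-correction scheme; not necessarily the DFPU's exact adder). The
# proposed carry-look-ahead would accelerate the binary carry chain
# inside this 4-bit addition.

def bcd_digit_add(a, b, carry_in=0):
    """Add two BCD digits (0-9); return (sum digit, carry out)."""
    s = a + b + carry_in        # plain 4-bit binary addition
    if s > 9:                   # result left the BCD range (10..19)
        s += 6                  # +6 correction skips the six unused codes
        return s & 0xF, 1
    return s, 0
```

A 16-digit operand chains sixteen such stages, so shortening the carry path inside each stage shortens the adder's overall critical path.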
Appendices
A. DFP History
The suggested DFP Unit is not the first decimal floating-point unit implemented, but it is
unique in that it complies with the new IEEE 754-2008 standard.
Hardware solutions
Selected past attempts:
 ENIAC – The United States Military began construction of the ENIAC during WWII
(1943); it was designed to calculate artillery firing tables for the United States Army's
Ballistic Research Laboratory. The ENIAC could store a ten-digit decimal number in
memory, but could not perform decimal floating-point computations.4
 Bell Laboratories Mark V – The first documented decimal floating-point processor
was the Bell Laboratories Mark V computer, designed in 1946.5
 Burroughs 2500 & 3500 – Another important decimal floating-point computer was
the Burroughs 2500, developed in 1966. It used strings of up to 100 digits, with two
4-bit BCD (Binary-Coded Decimal) digits per byte.
These examples were developed before the existence of a floating-point standard.
The 754-1985 standard was the first to define formats for representing floating-point numbers
and special values (NaN, Inf), floating-point operations, rounding modes and exceptions. The
standard in use today, IEEE 754-2008, revised and replaced IEEE 754-1985. The revision
extended the previous standard by including, among other things, decimal arithmetic and
formats, and merged in IEEE 854 (1987), the radix-independent floating-point standard.
Two examples of a standardized Decimal Floating Point Unit are the IBM Z9 (2005-2006)
and Z10 (2008). The Z9 utilized an encoded decimal representation for data, instructions for
performing decimal floating-point computations, and an instruction which performed data
conversions to and from the decimal floating-point representation. The System Z9 was the
first commercial server to add IEEE 754 decimal floating-point instructions, although these
instructions were implemented in microcode with some hardware assists.
The Z10 introduced full hardware support in the form of a Hardware Decimal Floating-point
Unit (HDFU): it implemented the main IEEE 754 decimal floating-point operations as a
built-in, integral component of each processor core and instruction set architecture.6
Note: the Z10 was developed before the publication of the IEEE 754-2008 standard.
Software solutions
For reasons of backwards compatibility, and in order to gain software flexibility, several
software libraries capable of handling decimal floating-point operations were developed.
Some of the better-known ones are:
 Intel® Decimal Floating-Point Math Library
 decNumber/decNumber++ by Mike Cowlishaw1
These solutions indeed solve the precision issue, but fall short (and actually worsen the
situation) with regard to the speed requirement.
Research performed at the University of Wisconsin shows that when using the decNumber
library for DFP arithmetic, most benchmarks spend more than 75% of their execution time in
DFP functions.7,8 The research also showed that providing fast hardware support for DFP
instructions results in speedups for the same benchmarks ranging from 1.3 to 31.2.
Bibliography
1. www.speleotrove.com
2. www.haifa.il.ibm.com/projects/verification/fpgen/
3. www.haifa.il.ibm.com/projects/verification/fpgen/ieeets.html
4. www.computerhistory.org
5. Harvey G. Cragon, Computer Architecture and Implementation (Cambridge University
Press, Feb. 2003)
6. www.ibm.com/systems/z/hardware/
7. Liang-Kai Wang, Charles Tsen, Michael J. Schulte, and Divya Jhalani, Benchmarks and
Performance Analysis of Decimal Floating-Point Applications (University of Wisconsin -
Madison, Department of Electrical and Computer Engineering, Oct. 2007)
8. Michael J. Schulte, Nick Lindberg, Anitha Laxminarain, Performance Evaluation of
Decimal Floating-Point Arithmetic (University of Wisconsin - Madison, Department of
Electrical and Computer Engineering, 2005)
9. IEEE Standard for Floating-Point Arithmetic (IEEE Computer Society, Aug 2008)
10. Anshul Singh, Aman Gupta, Sreehari Veeramachaneni, M.B. Srinivas, A High
Performance Unified BCD and Binary Adder/Subtractor (IEEE Computer Society
Annual Symposium on VLSI, 2009)
11. Israel Koren, Computer Arithmetic Algorithms (A. K. Peters/CRC Press, 2nd edition,
Dec. 2001)
12. David A. Patterson, John L. Hennessy, Computer Organization and Design, The
Hardware/Software Interface (Morgan Kaufmann, 4th edition, Nov. 2008)
13. John Thompson, Nandini Karra, Michael J. Schulte, A 64-bit Decimal Floating-Point
Adder (IEEE Computer Society Annual Symposium on VLSI: Emerging Trends in VLSI
Systems Design (ISVLSI'04), 2004)
14. http://speleotrove.com/decimal/decifaq.html
15. Michael F. Cowlishaw, Decimal Floating-Point: Algorism for Computers (Proceedings of
the 16th IEEE Symposium on Computer Arithmetic (ARITH'03), 2003)