Faculty of Engineering VLSI Laboratory Design and implementation of Decimal Floating Point (DFP) unit Ariel Burg Hillel Rosensweig B.Sc. Graduation Project - Computer Engineering Advisor: Academic Advisor: Yifat Manzor Dr. Osnat Keren תשרי תשע"ב,October 2011 Abstract: Over the past few years, there has been a growing interest in the development of Decimal Floating Point Units (DFPU) due to precision and timing constraints. We present a summarized IEEE standard in order to comply with Standard limitations. We then present our design for a DFPU Framework available for expansion, which performs basic DFP operations, the most complex of which are Addition/Subtraction. These operations were designed with "High Performance over Silicon Cost" in mind. Algorithmic simulation schemes are presented as well as our low-level design verification process. Finally, our synthesis results are presented. 2 Introduction It is our pleasure to present the following project book. This project book is the product of a year's work of research and development. The Research was performed using an array of tools. We employed the academic knowledge gained throughout our degree in a variety of courses in several fields: Logic Design - Digital Logic Circuitry. Architecture - Computer Architecture, MicroComputer and Assembly Language. Arithmetic - Computer Arithmetic Algorithms. We would be remiss if we did not mentioned the wealth of information gained from taking advantage of Prof. Mike Cowlishaw's 'Speleotrove' website.1 Software tools used included: MATLAB, Cadence Simvision, Xilinx ISE Design Suite 13.2. Hardware implementation was achieved on Virtex®-6 FPGA ML605 Evaluation Board. For verification purposes we used our own Test Bench, IBM FPgen Testing Suite2,3 and made use of Prof. Mike Cowlishaws' test vectors. Ariel Burg Hillel Rosensweig 3 Contents 1. Overview ...................................................................................................................6 1.1. Purpose .............................................................................................................6 1.2. Motivation ........................................................................................................6 2. Definition & Specification ......................................................................................10 2.1. The Decimal Floating Point Format ...............................................................10 2.2. Infinity and NaNs ...........................................................................................14 2.3. Exception Handling ........................................................................................15 2.4. Normalizing & Rounding ...............................................................................16 3. High-level Design ...................................................................................................20 3.1. General Operation Scheme .............................................................................20 3.2. Interface ..........................................................................................................21 3.3. Arithmetic Algorithm .....................................................................................22 3.4. Data Path ........................................................................................................24 3.5. Instruction Set Architecture (ISA) .................................................................25 4. 
Simulation ...............................................................................................................32 4.1. Simulation Introduction………………..........................................................32 4.2. Translate, Inverse Translate Simulation .........................................................32 4.3. Exponent Comparator Simulation ..................................................................33 4.4. Full Path Simulation .......................................................................................34 4.5. Simulation Results ..........................................................................................34 4.6. Full Path Graphic User Interface ....................................................................35 5. Implementation ........................................................................................................36 5.1. Low-level Design ...........................................................................................36 5.2. Program Counter ............................................................................................36 5.3. Register File ...................................................................................................36 5.4. Translate & Inverse Translate ........................................................................37 5.5. Exponent Comparator ....................................................................................38 5.6. Check Needed ................................................................................................38 5.7. Right Shifter ...................................................................................................39 5.8. Adder/Subtractor ............................................................................................40 5.9. Normalizer ......................................................................................................51 5.10. Rounder ........................................................................................................51 4 5.11. Sign Decision ...............................................................................................52 6. Integration ...............................................................................................................53 6.1. Composing the Complete System ..................................................................53 6.2. Creating a Pipelined Datapath ........................................................................53 6.3. Creating a Control Unit ..................................................................................53 6.4. Pipeline Hazards .............................................................................................57 7. Arithmetic and System Verification ........................................................................58 7.1. Verification Properties ...................................................................................58 7.2. Verification Conclusions and Results ............................................................59 8. Synthesis ..................................................................................................................61 8.1. Implementation on FPGA ..............................................................................61 8.2. Design Evaluation ..........................................................................................61 9. 
Summary .................................................................................................................62 9.1. DFPU Review ................................................................................................62 9.2. Future Expansions ..........................................................................................62 Appendices ..................................................................................................................63 A. DFP History .....................................................................................................63 Bibliography ................................................................................................................65 5 Design and implementation of Decimal Floating Point (DFP) unit 1. Overview 1.1 Purpose The objective of this project is to design, implement and test a decimal floating-point arithmetic unit, based on formats and methods for floating-point arithmetic as specified in IEEE 754-2008 standard. 1.2 Motivation Currently, most arithmetic hardware units perform operations on numbers in binary format. As the most basic memory unit ('bit') is itself binary, binary arithmetic implementations would be the natural and intuitive choice. Despite this, drawbacks of binary arithmetic implementations have created renewed interest in developing arithmetic units, capable of performing operations on numbers set in a decimal format: Speed: As opposed to computers, users prefer the use of the decimal notation as opposed to the binary one. On certain applications the need for decimal to binary conversions and binary to decimal conversions is so great that decimal operations require 50%-90% of processor time. A system with a direct decimal representation and hardware support would save this overhead. Accuracy: With floating-point numbers in binary format, accuracy problems prevail. For instance, the decimal term 0.1 has no finite binary representation: Dec: 0.1 = Bin: 0.0001100110011.... Due to limited memory, a 32 bit representation will truncate the infinite edge and round up, leading to obvious accuracy errors. 6 Examples of such errors are extensively documented14. For instance, the following C program (compiled with Visual C++): for (i=0.1; i<0.5; i=i+0.1) printf ("%f\n",100000000*i); will print out: 100000001.490116 200000002.980232 300000011.920929 400000005.960464 □ Similarly, using C (compiled with Visual C++), the following two loops will not run the same amount of iterations due to rounding errors: The loop: for (num=1.1; num<=1.5; num=num+0.1) printf ("%f\n",num); prints: 1.100000 1.200000 1.300000 1.400000 whereas the loop: for (num=0.1; num<=0.5; num=num+0.1) printf ("%f\n",num); prints: 0.100000 0.200000 0.300000 0.400000 0.500000 □ One important example of the implications of such an error occurred in the Gulf War: 7 On February 25, 1991, during the Gulf War, an American Patriot Missile battery in Dharan, Saudi Arabia, failed to track and intercept an incoming Iraqi Scud missile. Specifically, the time in tenths of second as measured by the system's internal clock was multiplied by 1/10 to produce the time in seconds. This calculation was performed using a 24 bit fixed point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. 
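A rough worked estimate of the magnitude involved (the figures below are consistent with published analyses of the incident and are added here only for illustration): the 24 bit chopped value of 1/10 is low by about 0.000000095. After 100 hours of continuous operation the clock has counted 100 × 60 × 60 × 10 = 3,600,000 tenths of a second, so the computed time is off by roughly 3,600,000 × 0.000000095 ≈ 0.34 seconds - during which a Scud travelling at about 1,676 m/s covers more than half a kilometre.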
The small chopping error, when multiplied by the large number giving the time in tenths of a second, led to a significant error, and consequently caused severe damage and human casualties.
Uniformity: Today's computation with floating-point numbers might yield different results on different processors.
Proposed solution
The proposed solution is to encode data in a decimal format: a format which gives each digit a distinct representation, separate from the other digits. The format is based on the IEEE 754-2008 standard. This solution addresses all of the previously mentioned problems:
Speed: Each digit has a unique representation in the encoding scheme, so converting numbers to computer code becomes a direct translation using tables instead of a costly conversion between bases.
Accuracy: Each digit has a unique representation; therefore any number that can be expressed visually (within a given accuracy) by the user can be expressed precisely in the decimal encoding.
Uniformity: Using the format and methods specified in the IEEE 754-2008 standard, results of computation will be identical, independent of implementation, given the same input data. Errors, and error conditions, in the mathematical processing will be reported in a consistent manner regardless of implementation.
The arithmetic unit designed in this project complies with the IEEE 754-2008 specifications for decimal floating-point. Therefore, computations done in the arithmetic unit will yield the same results as in any other implementation which complies with the IEEE 754-2008 standard.
Note: because 10 is not a prime number, but is built of 2 and 5, working in base 10 gives a wider range of finitely representable fractions than working in base 2, where any fraction with a factor of 5 in its denominator (such as 1/10) has no finite representation.
For a detailed summary of the history of decimal floating-point solutions see Appendix A – DFP History.
2. Definition & Specification
2.1 The Decimal Floating Point Format
The IEEE 754-2008 standard specifies two decimal floating-point formats:
DEC128 - uses 128 bits for representation.
DEC64 - uses 64 bits for representation.
Although the DEC128 format provides better precision, the DEC64 format is faster and provides a sufficient precision level for various applications. Therefore, the format chosen for this project is DEC64.
The general structure of a decimal floating-point number is:
r = (S, Exp − bias, Sig),   v = (−1)^S × 10^(Exp − bias) × Sig
where S is the Sign of the number, Exp (Exponent) is the integer power to which the radix (10) is raised, and Sig (Significand) holds the digits that comprise the significant portion of the number. These are the three elements needed to construct a floating-point number. Figure 2.1 shows the 64 bit format for a decimal floating-point number.
Figure 2.1. DEC64 format for a 64 bit decimal floating-point number: a 1 bit sign field S, a 13 bit combination field G (G0...G12) and a 50 bit (5 declet) trailing significand field T.
As Figure 2.1 shows, each 64 bit operand is built of 3 fields:
S - Sign bit
G - Combination Field
T - Trailing Significand Field.
The three elements which construct a floating-point number are encoded in the (S,G,T) fields:
Sign
The Sign field in the format represents the sign of the number: sign = (−1)^S.
Exponent
The exponent is one of the elements encoded in the Combination Field. The Exponent is 10 bits long, in the range [emin, emax] = [−383, 384]. The Exponent is biased so that it is represented with non-negative values.
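As a small illustration of the layout in Figure 2.1, the three fields can be unpacked from a 64-bit word as follows. This is a C sketch added for clarity (the function names are ours; the bit positions follow Figure 2.1, with the sign in the most significant bit):

    #include <stdint.h>

    /* Unpack the three DEC64 fields of Figure 2.1 from a 64-bit word. */
    static unsigned dec64_sign(uint64_t w)        { return (unsigned)(w >> 63); }            /* S: 1 bit               */
    static unsigned dec64_combination(uint64_t w) { return (unsigned)((w >> 50) & 0x1FFF); } /* G: 13 bits (G0 = MSB)  */
    static uint64_t dec64_trailing(uint64_t w)    { return w & ((UINT64_C(1) << 50) - 1); }  /* T: 50 bits = 5 declets */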
The bias is therefore 383, and the biased Exponent range is [0, 767]. The Exponent is encoded entirely in the Combination Field.
Significand
The Significand's precision is 16 digits. In its decoded form it is 64 bits long (BCD representation). In its coded form it is split into 15 digits encoded in the field T (50 bits = 5 × 10 = 5 declets), and an MSD (most significant digit) encoded in the combination field (G).
The format also supports representation of ±∞ and NaN (Not a Number).
Decoding and Encoding the Combination Field
The combination field is encoded/decoded using its 5 MSBs (G0...G4). These five bits hold the status of the number (Infinity, NaN or finite) as well as the Significand's MSD and the two MSBs of the exponent (for finite numbers). The remaining 8 bits hold the remainder of the exponent. Encoding/decoding of the Combination field is described in Table 2.1.

G0G1G2G3G4         Type       Exponent MSBs (2 bits)   Coefficient MSD (4 bits)
abcde (ab ≠ 11)    Finite     ab                       0cde
11cde (cd ≠ 11)    Finite     cd                       100e
11110              Infinity   --                       ----
11111              NaN        --                       ----

Table 2.1. The first five bits of the Combination Field indicate the type of the number, the Significand's MSD and the 2 MSBs of the exponent (for finite numbers).
NaNs: G5 differentiates between quiet NaNs (qNaN) and signaling NaNs (sNaN). Signaling NaNs signal uninitialized variables and arithmetic enhancements that are not in the scope of the standard. Quiet NaNs afford retrospective diagnostic information inherited from invalid operations.
For NaNs: G0G1G2G3G4 = 11111 and v = NaN; r = sNaN if G5 = 1, r = qNaN if G5 = 0; T holds the payload. (v = actual value, r = format representation.)
Infinity: For ±Infinity: G0G1G2G3G4 = 11110 and r = v = (−1)^S × ∞.
Finite numbers: for G0G1G2G3G4 = 0XXXX, 10XXX or 11XXX (excluding 11110 and 11111): r = (S, E − bias, C) and v = (−1)^S × 10^(E − bias) × C.
Densely-Packed Decimal (DPD)
In order to allow decimal representation of numbers without adding a memory overhead to the implementation, significands are stored in a Densely Packed Decimal format. DPD coding takes advantage of the redundancy of the BCD representation.
Decoding 10-bit Densely-Packed Decimal to 3 decimal digits
Decoding a Densely Packed Decimal declet is performed according to Table 2.2:
Table 2.2. Decoding 10-bit Densely-Packed Decimal to 3 decimal digits.
____________________________________________________________________
Example 2.1. For the following declet:
b(0) b(1) b(2) b(3) b(4) b(5) b(6) b(7) b(8) b(9)
 1    0    1    1    0    0    1    1    0    1
We use the appropriate table entry:
b(6) b(7) b(8) b(3) b(4)
 1    1    0    1    0
Therefore:
d(1) = 8 + b(2) = 8 + 1 = 9 ;
d(2) = 4*b(3) + 2*b(4) + b(5) = 4*1 + 2*0 + 0 = 4 ;
d(3) = 4*b(0) + 2*b(1) + b(9) = 4*1 + 2*0 + 1 = 5
Therefore the decoded number is 945. □
_____________________________________________________________________
Encoding 3 decimal digits to 10-bit Densely-Packed Decimal
Encoding decimal numbers in Densely Packed Decimal format is done using Table 2.3.
Table 2.3. Encoding 3 decimal digits to 10-bit Densely-Packed Decimal.
_____________________________________________________________________
Example 2.2. For the number 683, the BCD representation is:
          d(1)        d(2)        d(3)
bit:   0 1 2 3     0 1 2 3     0 1 2 3
       0 1 1 0     1 0 0 0     0 0 1 1
Using the first bit of each digit: d(1,0) = 0 ; d(2,0) = 1 ; d(3,0) = 0
We use the appropriate table entry:
Bits 1,2,3 in d(1) are 110, therefore b(0)b(1)b(2) = 110
Bits 1,2 in d(3) are 01, therefore b(3)b(4) = 01
Bit 3 in d(2) is 0, therefore b(5)b(6)b(7)b(8) = 0101
Bit 3 in d(3) is 1, therefore b(9) is 1.
The final encoding is: b(0) b(1) b(2) b(3) b(4) b(5) b(6) b(7) b(8) b(9) 1 1 0 0 1 0 1 0 1 1 □ _____________________________________________________________________ Note: using DPD (Densely Packed Decimal) coding, 15 BCD digits (60 bits) are packed into 50 bits in field, taking advantage of the BCD representation redundancy. 2.2 Infinity and NaNs Nota Number (NaNs) There are two different kinds of NaN, Signaling and Quiet. Signaling NaNs (sNaN) represent uninitialized variables and other unique situations. Quiet NaNs (qNaN) supply diagnostic information inherited from invalid or unavailable data and results. qNaN Propagation To allow propagation of the diagnostic information, as much information as possible should be preserved in NaN results of operations. In other words, operations performed on NaNs should preserve in the result as much of the original NaN operand as possible. If two or more inputs are NaN, then the payload of the resulting NaN should be identical to the payload of one of the input NaNs if representable in the destination 04 format. The standard does not specify which of the input NaNs will provide the payload. qNaN Generation In general, operations that signal an invalid operation exception (see Para. 2.3) shall generate a quiet NaN. Infinity The approach to infinites in floating-point arithmetic is equivalent to the approach to Overflow (see Para. 2.3). In general, an Overflow in the result will itself raise an OVF flag and the result will be coded as Infinity. Operations on infinite operands usually don't signal exceptions and return an Infinite result (for infinite coding, see IEEE Standards section). This applies to the following operations: Addition(∞, x), Addition(x, ∞), Subtraction(∞, x), or Subtraction(x, ∞), for finite x. The exceptions that do pertain to infinities are signaled (see Para. 2.3) only when: ∞ is an invalid operand (in certain operations). ∞ is created from finite operands by overflow. Subtraction of infinities, such as: Addition(+∞, −∞). 2.3 Exception Handling Invalid operation 7.2.0 The invalid operation exception is signaled if and only if the arithmetic operation provides no useful result. The default result of an operation that signals the invalid operation exception shall be a quiet NaN that should provide some diagnostic information (see Para.2.2). Operations that signal Invalid operation flag: Any operation on a signaling NaN. Addition or Subtraction of infinities, such as: Addition(+∞, −∞). 05 Overflow 7.4.0 The overflow exception is signaled if and only if the result format’s largest finite number is exceeded in magnitude by what would have been the rounded floatingpoint result were the exponent range unbounded. The default result shall be determined by the rounding-direction attribute and the sign of the intermediate result. Specifically, in accordance to the DFPU rounding scheme - roundTiesToEven - all overflows are rounded to ∞ with the sign of the intermediate result. In addition, under default exception handling for overflow, the overflow flag shall be raised and the inexact exception shall be signaled. Inexact 7.6.0 Unless stated otherwise, if the rounded result of an operation is inexact - that is, it differs from what would have been computed were both exponent range and precision unbounded - then the inexact exception shall be signaled. The rounded or overflowed result shall be delivered to the destination. 
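To make the default results above concrete, the following C fragment sketches how the default result and flags could be selected for the exceptions the unit handles. It is an illustrative model only; the names and the helper are ours, not part of the design (the two constant encodings simply follow Table 2.1, with combination bits 11111 for a quiet NaN and 11110 for infinity and all remaining bits zero):

    #include <stdint.h>

    enum { FLAG_INVALID = 1, FLAG_OVERFLOW = 2, FLAG_INEXACT = 4 };

    #define DEC64_QNAN       UINT64_C(0x7C00000000000000)                            /* G0..G4 = 11111, G5 = 0 */
    #define DEC64_INF(sign)  ((((uint64_t)(sign)) << 63) | UINT64_C(0x7800000000000000)) /* G0..G4 = 11110 */

    /* Default result selection, per the rules above: invalid operation -> quiet NaN;
       overflow -> infinity with the sign of the intermediate result, and inexact is
       signaled as well (roundTiesToEven rounds all overflows to infinity). */
    static uint64_t default_result(int invalid, int overflow, int sign, unsigned *flags)
    {
        if (invalid)  { *flags |= FLAG_INVALID;                 return DEC64_QNAN; }
        if (overflow) { *flags |= FLAG_OVERFLOW | FLAG_INEXACT; return DEC64_INF(sign); }
        return 0; /* no exceptional default result */
    }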
Note: underflow, divide by zero exceptions are included in the standard, but were not fully implemented in the current design as they are not necessary in this context. 2.4 Normalizing & Rounding When executing an instruction, the result operand should be represented in a normalized form, i.e. with no leading zeros. Using the normalized form simplifies the comparison of two Decimal Floating Point operands. A normalized form allows finite operands (≠0) to have a unique representation, which is helpful for comparison: a larger exponent indicates a larger operand and significands should be compared only in case of equal exponents. Note: In case of comparison with 0 one should only check if sig==0. There are three possible Normalization scenarios in Addition and Subtraction: 1. Significand ≥ 10, therefore the significand should be shifted to the right and the exponent should be increased by one (possible in case of Addition). 2. 1 ≤ significand < 10. No shifting needed (possible in case of Addition or Subtraction). 06 3. Significand < 1, therefore the significand should be shifted to the left and the exponent should be decreased as long as there are leading zeros (possible in case of Subtraction). The first case may lead to overflow, since increasing the exponent may cause exceeding the maximum exponent for a finite number. The third case may lead to underflow, since decreasing the exponent may cause exceeding the minimum exponent for a finite number. Shifting is done using Barrel Shifter, which fasten the operation. Rounding is done using roundTiesToEven attribute: The floating-point number nearest to the infinitely precise result shall be delivered; if the two nearest floatingpoint numbers bracketing an unrepresentable infinitely precise result are equally near, the one with an even least significant digit shall be delivered. Choosing this attribute gives an average rounding error = 0. _____________________________________________________________________ Example 2.3. If the exact result significand is 1.23456789012345678 (precision is p=16 digits), then the returned significand should be 1.234567890123457 Example 2.4. If the exact result significand is 1.23456789012345650, then the returned significand should be1.234567890123456 Example 2.5. If the exact result significand is 1.23456789012345651, then the returned significand should be 1.234567890123457 □ _____________________________________________________________________ Rounding is done using three Rounding Digits: Guard digit Round digit Sticky digit 07 If (R>5) or (R=5 and S≠0) or (R=5 and S=0 and LSD=odd number) then the significand is increased by 1, as can be seen in example 1, 3. The Sticky Digit serves as a tie-breaker in the roundTiesToEven attribute. The role of the Guard Digit is to guard against loss of information in case of postnormalization (Scenario 2), as explained in the next proof. Three Rounding Digits are sufficient when using roundTiesToEven attribute. _____________________________________________________________________ Proof: consider the three possible Normalization scenarios mentioned above: Case 1: In the worst case of this scenario the exponent difference of the original operands is 1 (see Para.3.3), i.e. one shift on pre-alignment, so that there is a carry out. For example: sigA = 9900000000000000, sigB = 1000000000000051, exponent difference = 1. 
R S A 9900000000000000 + B aligned 100000000000005 1 A+B 1 0000000000000005 1 Post-normalization 1000000000000000 5 rounding 1000000000000001 1 Therefore two extra digits are needed for rounding. Case 2: No shifting is done. Therefore there is no need of rounding digits. Case 3: The significand is shifted to the left and the exponent is decreased as long as there are leading zeros. Let us concentrate on two possible cases in this scenario: o The subtrahend is shifted more than one position to the right (prealignment).The difference has at most one leading zero => at most one shifted-out digit required for post-normalization. Sticky Digit = 0 if all the rightmost shifted digits starting from the 19th place are zero. If at least one of them is bigger than zero then Sticky Digit = 1. 08 For example: sigA = 1000000000000000, sigB = 9999999999994002, exponent difference = 5. sigB is shifted 5 positions to the right. => Digits in 19th, 20th, 21th places: 0, 0, 2 => S=1. G R S A B aligned A-B postnormalization 1000000000000000 99999999999 9 0999900000000000 0 9999000000000000 5 4 5 9 1 9 R S 9999000000000000 5 9 9999000000000001 rounding Note: The Sticky digit participates in subtraction only to generate borrow. After subtracting the aligned operands, the true value of the rightmost result digit is not important. What matters is if it is zero or not. After post-normalization the Guard Digit serves as the Round Digit and Round Digit serves as the Sticky Digit. Therefore three extra digits are needed for rounding. o The subtrahend is shifted up to one position to the right (pre-alignment); at most one digit is pre-aligned out of the 16 digit range. For example: sigA = 1200000000000000, sigB = 1000000000000004, exponent difference = 1. A B aligned A-B rounding R 1200000000000000 100000000000000 4 1099999999999999 6 1100000000000000 Therefore one extra digit is needed for rounding. In Conclusion: considering the worst case, three extra digits are needed for rounding. □ _____________________________________________________________________ After Normalizing and Rounding the result, another post-normalization may be needed (in case rounding lead to significand ≥ 10). Therefore another Normalizing component is set after the result is rounded. 09 3. High-level Design 3.1 General Operation Scheme The general operation of the system is described in Figure 3.1. Figure 3.1. General operation of the system. Shows the progress of a command. A Designated Compiler transfers DFPU commands to the correct 74 bit format. It also translates data (operands) to DEC64 format and creates DFPU instructions for data transfer into the DFPU register file. These commands are sent to the CPU as payload for a Load Word operation, which writes the commands to a designated memory segment in the RAM. Upon writing DFPU commands in the designated memory segment, the CPU commands the DMAC (Direct Memory Access Controller) to load the DFPU commands to the internal DFPU memory. The CPU sends a 'go' signal to the DFPU (see Fig. 3.2) and the DFPU subsequently begins reading the internal memory and processing commands. Another form of communication from CPU to DFPU is through Interrupt request (see Fig. 3.2). Upon completion of running DFPU commands, and upon certain exception occurrence (see Para. 2.3), an exception notice is sent to the CPU. Note: the Designated Compiler delivers numbers in a normalized form. 21 3.2 Interface The DFPU (Decimal Floating Point Unit) serves as a peripheral computation unit. 
Its interface includes four input signals (nrst, clk, go, interrupt) and one output signal (Exception). Figure 3.2 describes the DFPU interface.
Figure 3.2. The DFPU interface.
nrst – reset signal (negative reset).
clk – unit clock signal.
go – CPU signal to DFPU; kick-starts DFPU operation.
interrupt – CPU signal to DFPU (e.g. soft reset).
Exception – DFPU feedback to CPU. The Exception signal is sent in the following cases:
Finished - DFPU completed performance of the loaded tasks.
Invalid operation (see Para. 2.3).
Overflow (see Para. 2.3).
Underflow, Divide by Zero (see Para. 2.3; should be available in future designs - not necessary in this context).
Note: the 'Inexact' signal does not raise the interface exception flag, because it is an acceptable and regular condition.
3.3 Arithmetic Algorithm
Since Addition/Subtraction is the most complicated operation in the current design, and its implementation covers the other, simpler operations (negation, increment, decrement) from both an arithmetic and an architectural point of view, the arithmetic algorithm was developed around it.
For any addition/subtraction of a pair of standardized decimal operands A, B, the following expansion is true:
A + B = sigA × 10^ExpA + sigB × 10^ExpB = (sigA + sigB × 10^(ExpB − ExpA)) × 10^ExpA
Assume, without loss of generality, that ExpB ≤ ExpA; the term sigB × 10^(ExpB − ExpA) then signifies sigB shifted (ExpA − ExpB) positions to the right. Since each significand has a limited precision, we can conclude that for some operands, where the exponent difference exceeds the significand precision, addition/subtraction is irrelevant (the result is determined by the larger operand). With all that in mind, an addition algorithm emerges (Fig. 3.3).
The diagram in Figure 3.3 does not relate to the Sign bit in each operand. The sign bit is dealt with separately, and its main function is to determine the type of operation performed during Addition/Subtraction (example: subtraction of a negative from a positive is performed as addition). Adding/subtracting two signed operands gives:
A + B = (−1)^sA × sigA × 10^ExpA + (−1)^sB × sigB × 10^ExpB = (−1)^sA × (sigA + (−1)^(sB − sA) × sigB × 10^(ExpB − ExpA)) × 10^ExpA
Using the fact that:
(−1)^(sB − sA) = (−1)^(sB + sA) = (−1)^(sB ⊕ sA)
we conclude:
A + B = (−1)^sA × (sigA + (−1)^(sB ⊕ sA) × sigB × 10^(ExpB − ExpA)) × 10^ExpA
The actual type of operation carried out is decided by the original operation code (add/sub) and the signs of the operands. Therefore, if the 'add' operation is coded as op=0 and the 'sub' operation as op=1, the actual operation can be derived:
Figure 3.3. The addition algorithm.
Actual operation = addition if sB ⊕ sA ⊕ op = 0; subtraction if sB ⊕ sA ⊕ op = 1.
3.4 Data Path
In essence, the datapath manages the three elements - Sign, Significand and Exponent - using separate paths with some interaction between them:
Sign - the result sign depends on the input operand Signs, the type of operation performed, and the Sign of the result of the significand addition/subtraction.
Significand - the result significand is formed by addition/subtraction of the aligned significands (shifted according to the exponent difference), rounding and normalizing.
Exponent - the result exponent is formed by choosing the larger exponent and adjusting it according to the normalization.
Accordingly, the above algorithm can be divided into smaller sub-algorithms, and each one can be organized as a separate resource ('black box'):
Program Counter - holds the address of the current instruction. The address advances with each clock cycle.
Register file - collection of registers capable of Read/Write.
Translate - decode DEC64 operand to (Sign,Exponent,Significand). Exponent Comparator - compare operand exponents and return exponent difference, which exponent is bigger and its value. Check Needed - check whether there is need for significand shifting and addition/subtraction (due to limited precision). Right Shifter - Aligning one of the significands according to the exponent difference. Add/Sub - conclude the actual operation (addition/subtraction) performed on the significands. Normalizer - adjusting the result significand and exponent values to avoid leading zeros in significand. Rounder - GRS rounding using roundTiesToEven scheme. 24 Inverse Translate – encode (Result Sign, Result Exponent, Result Significand) values into DEC64 operand. Sign Decision - Conclude Result Sign according to the input operand Signs, the type of operation performed, and the Sign of the result of the significand addition/subtraction. Each of the above mentioned resources was built as a function in a MATLAB script for simulation (Chapter 4) and later implemented in a low-level Verilog design (Chapter 5). As discussed in Chapter 6, an instruction is divided into four stages, i.e. moving from single-cycle datapath to a four-stage-pipelined datapath. Therefore, the complete performance of an operation with a DFPU involves the following stages: 1. Instruction Fetch (IF): Retrieval of DFPU command. 2. Decode (D): Retrieval and translation of DEC64 Operands to Sign, Significand and Exponent fields. 3. Execution (E): Performing the arithmetic algorithm mentioned above. 4. Write Back (WB): Result Sign, Significand and Exponent are encoded into DEC64 format and written to register file or result Memory. These four stages are implemented as pipe stages. Further in this design, Pipeline Registers are set between each two stages in order store Intermediate results. 3.5 Instruction Set Architecture(ISA) General Information Instruction length: 74 bits. Register address: 5 bits (32 registers). Opcode length: 5 bits. The DFP unit supports the following operations: Arithmetic operations: add_r, add_m, sub_r, sub_m, inc_r, inc_m, dec_r, dec_m, neg. Data handling operations: mov_i, mov_r. 25 Note: the number of bits allocated for opcode is bigger than necessary in order to enable future expansion of instruction set. Arithmetic operations The instruction format for arithmetic operation is: opcode ri 5 bits 5 bits rj (optional) 5 bits rk (optional) 5 bits 54 bits add_r: o Operation Description: dual operand addition; result written to Register File. o Command Format: add_r ri,rj,rk o Actual operation: ri=rj+rk o Datapath Description: Decode: two operands are read from the Register File in location set by index of ri,rj. These operands are translated to spread form. Result address and spread operands are saved in pipeline register as well as result address and control signals. Execute: Exponents are compared and significands are aligned accordingly. Significands are added and addition result together with the bigger exponent derived from Exponent Comparator go through normalization and rounding. The final sign is derived from Sign decision. Result {Sign,Sig,Exp} are saved in pipeline register as well as result address and control signals. Write Back: finally, the correct result is inverted to DEC64 format. Final result is written to the Register File. add_m: o Operation Description: dual operand addition; result written to Register File and Result Memory. 
o Command Format: add_m ri,rj,rk o Actual operation: ri=rj+rk ; Mem[mem_addr]=rj+rk ; mem_addr++ o Datapath Description: Identical to description of add_r, except that the result is written to both Result Memory and Register File. 26 sub_r: o Operation Description: operands subtraction; result written to Register File. o Command Format: sub_r ri,rj,rk o Actual operation: ri=rj-rk o Datapath Description: Decode: two operands are read from the Register File in location set by index of ri,rj. These operands are translated to spread form. Result address and spread operands are saved in pipeline register as well as result address and control signals. Execute: Exponents are compared and significands are aligned accordingly. Significands are subtracted and subtraction result together with the bigger exponent derived from Exponent Comparator go through normalization and rounding. The final sign is derived from Sign decision. Result {Sign,Sig,Exp} are saved in pipeline register as well as result address and control signals. Write Back: finally, the correct result is inverted to DEC64 format. Final result is written to the Register File. sub_m: o Operation Description: operands subtraction; result written to Register File and Result Memory. o Command Format: sub_m ri,rj,rk o Actual operation: ri=rj-rk ; Mem[mem_addr]=rj-rk ; mem_addr++ o Datapath Description: Identical to description of sub_r, except that the result is written to both Result Memory and Register File. inc_r: o Operation Description: increase operand by one; result written to Register File. o Command Format: inc_r ri o Actual operation: ri =ri+1 o Datapath Description: Identical to add_r, except that the second operand that is added is an artificially created constant whose value is +1, and that both source and destination register is ri. 27 inc_m: o Operation Description: increase operand by one; result written to Register File and Result Memory. o Command Format: inc_m ri o Actual operation: ri= ri+1 ; Mem[mem_addr]= ri+1 ; mem_addr++ o Datapath Description: Identical to inc_r, except that the result is written to both Result Memory and Register File. dec_r: o Operation Description: decrease operand by one; result written to Register File. o Command Format: dec_r ri o Actual operation: ri= ri-1 o Datapath Description: Identical to add_r, except that the second operand (the subtrahend) is artificially created to equal -1, and that both source and destination register is ri. dec_m: o Operation Description: decrease operand by one; result written to Register File and Result Memory. o Command Format: dec_m ri o Actual operation: ri= ri-1 ; Mem[mem_addr]=ri-1 ; mem_addr++ o Datapath Description: Identical to dec_m, except that the result is written to both Result Memory and Register File. neg: o Operation Description: change sign of register operand; result written to Register File. o Command Format: neg ri o Actual operation: ri= -ri o Datapath Description: Decode: an operand is read from the Register File in location set by index of ri and is saved in pipeline register as DPD (Densely Packed Decimal) operand as well as result address and control signals. Execute: the DPD operand, result address and control signals saved in the next pipeline register. 28 Write Back: the first bit of the DPD operand is complemented and, along with the rest of the DPD operand bits, is written to the Register File. Data handling operations mov_i: o Operation Description: transfer immediate value to register. 
o Command Format: mov_i ri,imm o Actual operation: ri=imm o Datapath description: Decode: Immediate data is saved directly into Pipeline register, as DPD operand as well as result address and control signals. Execute: the DPD operand, result address and control signals are transferred to next pipeline register. Write Back: the DPD operand is written back to register file in address mentioned by result address in write back pipeline register. Instruction format: opcode ri 5 bits immediate 5 bits 64 bits mov_r: o Operation Description: transfer one registers' value to another. o Command Format: mov_r ri,rj o Actual operation: ri=rj o Datapath description: Decode: an operand is read from the Register File in location set by index of ri and is saved in pipeline register as DPD operand as well as the result address (rj) and control signals. Execute: the DPD operand, result address and control signals are transferred to next pipeline register. Write Back: the DPD operand is written back to register file in address mentioned by result address in write back pipeline register. 29 opcode 5 bits Instruction format: ri 5 bits rj 5 bits Table 3.1 shows a summary of the ISA properties. 31 59 bits Operation Description add_r Dual operand addition; result written to Register File Dual operand addition; result written to add_m Register File and Result Memory Command Format Actual operation add_r ri,rj,rk ri=rj+rk add_m ri,rj,rk ri=rj+rk Mem[mem_addr]=rj+rk mem_addr++ sub_r Operands subtraction; result written to Register File sub_r ri,rj,rk ri=rj-rk sub_m Operands subtraction; result written to Register File and Result Memory sub_m ri,rj,rk ri=rj-rk Mem[mem_addr]=rj-rk mem_addr++ inc_r Increase operand by one; result written to Register File inc_r ri ri =ri+1 inc_m Increase operand by one; result written to Register File and Result Memory inc_m ri ri= ri+1 Mem[mem_addr]= ri+1 mem_addr++ dec_r ri ri= ri-1 dec_m ri ri= ri-1 Mem[mem_addr]=ri-1 mem_addr++ neg ri ri= -ri mov_i ri,imm ri=imm dec_r dec_m neg mov_i Decrease operand by one; result written to Register File Decrease operand by one; result written to Register File and Result Memory Change sign of register operand; result written to Register File Transfer immediate value to register Instruction Format opcode 5 bits opcode 5 bits mov_r Transfer one registers' value to another mov_r ri,rj opcode ri=rj 5 bits Table 3.1. Summary of the ISA properties. 30 ri 5 bits rj(optional) rk(optional) 5 bits 5 bits ri immediate 5 bits 64 bits ri rj 5 bits 5 bits 59 bits 54 bits 4. Simulation 4.1 Simulation Introduction The Following section describes the construction of MATLAB simulations matching the arithmetic and encoding/decoding algorithms, and the tests run on them in order to assess their practical implementation. The importance of such simulations is in the simple application of the algorithms in a way that mirrors a practical implementation. Similarly, tests run on the simulations can reveal flaws in the practical application of the algorithms. 4.2 Translate, Inverse Translate Simulation Relevant standard sections (referring to Fig. 2.1): "The representation r of the floating-point datum, and value v of the floating-point datum represented, are inferred from the constituent fields as follows: a) If G0 through G4 are 11111, then v is NaN regardless of S. Furthermore, if G5 is 1, then r is sNaN; otherwise r is qNaN. The remaining bits of G are ignored, and T constitutes the NaN’s payload, which can be used to distinguish various NaNs. 
The NaN payload is encoded similarly to finite numbers described below, with G treated as though all bits were zero. The payload corresponds to the significand of finite numbers, interpreted as an integer with a maximum value of 10 (3×J) − 1, and the exponent field is ignored (it is treated as if it were zero). A NaN is in its preferred (canonical) representation if the bits G6 through Gw + 4 are zero and the encoding of the payload is canonical. b) If G0 through G4 are 11110 then r and v = (−1) S × (+∞). The values of the remaining bits in G, and T, are ignored. The two canonical representations of infinity have bits G5 through Gw +4 = 0, and T = 0. c) For finite numbers, r is (S, E − bias, C) and v = (−1) S × 10 (E−bias) × C, where C is the concatenation of the leading significand digit or bits from the combination field G and the trailing significand field T, and where the biased exponent E is encoded in the combination field. The encoding within these fields depends on whether the implementation uses the decimal or the binary encoding for the significand."9 Simulation Method Testing Translation / inv. translation surrounded three distinct cases: 32 1. Combination field = 1 1 1 1 1 (NaN). 2. Combination field = 1 1 1 1 0 (Infinity). 3. Combination field = other (finite numbers). In the first two cases: The Combination field bits are preset and 500 sets of additional 59 random bits are generated. Correct Simulation of the translate function activates NaN/Inf flags accordingly. In the final case: 64 random bits are generated and testing is performed as followed: 1. For each random binary vector x1- Translate command is used to find (sign1, significand1 and exponent1). 2. Inverse Translate parameters (sign, significand and exponent) back to binary vector 'res'. 3. For Binary vector 'res' - Translate command is used to find parameters (sign2, significand2 and exponent2) and compare with (sign1, significand1 and exponent1). 4.3 Exponent Comparator Simulation In accordance with the Arithmetic Algorithm (see Para. 3.3) addition/subtraction of operands, includes finding the bigger exponent, and exponent difference. According to standard: "The set of finite floating-point numbers representable within a particular format is determined by the following integer parameters: ― b = the radix, 2 or 10 ― p = the number of digits in the significand (precision) ― emax = the maximum exponent e ― emin = the minimum exponent e emin shall be 1 − emax for all formats."9 In the decimal 64 format: emax=+384, b=10, p=16. Therefore, the dynamic exponent range [emin,emax] = [-383,384]. It is important to note that all exponents in IEEE 754_2008 format are biased, that is: "For finite numbers, r is (S, E − bias, C) and v = (−1) S × 10 (E−bias) × C ... where the biased exponent E is encoded in the combination field.."9 In our case, the bias is 383. Therefore the actual range of the exponent E is [0,767]. 33 Simulation Method: Run all possible combinations of e1,e2 to test exponent_comparator function. 4.4 Full Path Simulation Using all the resources simulated in MATLAB, one full path can be constructed, creating a full addition/subtraction path that can be simulated. Simulation Method The simulation of the full addition / subtraction path consists of 3 stages: 1. Initialization: 1000 pairs of 64 bit, DEC64 coded operands are randomly created. Each pair is translated into spread form. 2. Run: Each pair of 64 bit DEC64 operands is input into 'Full_Path'. 
For each pair in 'Full_Path', a matching DEC64 format addition result is created, and translated to spread format. 3. Result Analysis: The initial operands are added externally in MATLAB and compared to the result output of the 'Full_path' in its spread form. If one of the random operands is a NaN or Inf, the result operand should reflect it in its Combination field. Note: Due to precision limitations of MATLAB, these simulations needed to employ the use of Variable Precision Arithmetic functions (VPA) in the Symbolic Math Toolbox. These functions allow for variable precision and provide more flexibility and control in manipulating numbers. 4.5 Simulation Results 1. Translate, Inverse translate: of 1000 cases, there were no cases found where 'x1' differs from 'res' (i.e. the original vector and the result vector differ). 2. Exponent Comparator: All cases of 'Exponent Comparator' variables were examined - no errors were found. 3. Full Path: of 1000 cases (each case using 2 random variables), all cases proved the operand addition/subtraction creates the expected result using the above algorithm. 34 In conclusion In 100% of the cases, simulation results matched the expected values. 4.6 Full Path Graphic User Interface In addition to MATLAB simulation of the full path, a Graphic User Interface (GUI) was designed in order to have a user-friendly simulation tool for decimal floating-point computation that complies with the IEEE 754-2008 standard. Figure 4.1 shows the simulation GUI for decimal floating-point computation. The input operands and result are also displayed in DEC64 format (in hexadecimal form). Figure 4.1. Simulation GUI for decimal floating-point computation. 35 5. Implementation 5.1 Low-level Design The low-level design of the DFPU is implemented in Verilog, using Cadence Simvision simulation tool. Each resource mentioned in Chapter 3.4 is implemented as a separate Verilog file, and is checked against its own test bench. 5.2 Program Counter The Program Counter (Fig. 5.1) consists of a simple 8 bit counter that produces the address of current instruction. Address advances with each clock cycle. Figure 5.1. Program Counter. A 'jump to address' option is created for further design. jmp_en bit is used to enable the jump and 8 bit offset value defines the jump amount. 5.3 Register File The Register File (Fig. 5.2) consists of 32 registers, each one with a 64 bit width. Read: Two registers can be read simultaneously (Dual-Port Register File), using the registers index (5 bit). Write: 64 bit of data can be written to a register, using the register's index and setting the Write enable bit. A register can be written while reading from a different indexed register, i.e. results are written back to register in parallel to reading operands during decode stage. 36 Figure 5.2. Register File. 5.4 Translate & Inverse Translate The Translate component (Fig. 5.3) decodes a DEC64 input operand to sign, exponent, and significand. If the given input is a NaN/Infinity, the isNaN/isInf output bit is set. Figure 5.3. Translate component. The Inverse Translate component (Fig. 5.4) encodes sign, exponent, and significand to a DEC64 output operand. If the encoded operand is a NaN/Infinity, the input isNaN/isInf bit declares it. 37 Figure 5.4. The Inverse Translate component. 5.5 Exponent Comparator The Exponent Comparator (Fig. 5.5) subtracts the input exponents and returns: diff – the difference between the input exponents. isBigger – a bit that indicates which exponent is bigger. (0: if exp1≥exp2. 
1: if exp1<exp2). biggerexp – the bigger exponent. Figure 5.5. Exponent Comparator. 5.6 Check Needed The Check Needed component (Fig.5.6) simply checks if the input diff > 17decimal. If it does: en=0 and there is no need of shifting, adding or subtracting. Else en=1. Figure 5.6. The Check Needed component. 38 5.7 Right Shifter The Right Shifter component (Fig.5.7) aligns the input significands according to the other inputs: en_in –shift enable bit. isBigger –indicates which operand has a bigger exponent. diff –the amount of shifts (difference between exponents of the operands). The output significands consist of 76 bits. If en_in is set: Right Shift must be performed. The significand to be shifted is concatenated with 64 bits (16 trailing zeroes) and goes through a Barrel Shifter. The significand to be shifted is chosen according to the value in isBigger: If isBigger=0 - sig2 is shifted by 'diff' positions. If isBigger=1 - sig1 is shifted by 'diff' positions. The output of the Barrel Shifter is truncated to 76 bits (19 digits).The 19th Digit of the truncated output must serve as the Sticky Digit - signifying the existence of non-zero trailing digits. The Sticky Digit is constructed according to the following rule: If the 19thDigit is not zero, then the Sticky Digit retains its value. The Sticky Digit will retain a 0 value if and only if the 56 least significant bits are 0. Otherwise it is set to decimal 1. The unshifted significand is concatenated with three trailing zero digits (12 bits). If en_in is cleared: the output significand of the bigger operand is concatenated with three trailing zero digits. The other output significand is zero (won't be used later). en_out=en_in and is designed for timing reasons. Shifting is done using Barrel Shifter, which fasten the operation. Figure 5.8 shows the implementation of the Right Shifter component. 39 Figure 5.7. The Right Shifter component. Figure 5.8. Implementation of the Right Shifter component. 5.8 Adder/Subtractor Given two unsigned significands, the main goal is to produce a new result significand, which is the output of one of the following scenarios: 1. Addition: adding the two significands. 2. Subtraction: subtracting the two significands (return the absolute difference). 3. No operation (return one specific significand out of the two input significands). Figure 5.9 shows a general description of the Adder/Subtractor. 41 Figure 5.9. Adder/Subtractor. Inputs: sig1, sig2 – The input significands consist of 19 digits each, while each digit is represented by 4 bits (BCD - binary-coded decimal representation), thus an input significand consists of 76 bits. en – Operation enable bit. When set - Addition/Subtraction is carried out. When cleared no operation is taking place. add/sub – Indicates the type of operation. 1 - Addition. 0 – Subtraction. isBigger – Indicates which of the operands that include the significands is bigger. 1: if operand1 < operand2. 0: if operand1 ≥ operand2. Note: isBigger gives no information about the relation between sig1 and sig2. Outputs: res. sig – The output significand consisting of 19 digits (76 bits). c_out – The output carry of the operation. op_sign – The sign of the output result Addition or Subtraction of two significands cannot be done bitwise, but must be performed in groups of 4 bits due to the use of BCD representation. Note: BCD representation is a 4 bit binary representation for decimal digits in range: {0:9}10→{0000 - 1001}2. 
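Before looking at the hardware of Figure 5.10, the digit-wise behavior it implements (binary addition of two BCD digits followed by a +6 correction whenever the sum exceeds 9) can be sketched in C. This is an illustrative software model we add for clarity, not the Verilog implementation:

    /* Add two BCD digits (0-9) with a carry-in.
       Returns the BCD sum digit and sets *cout to the decimal carry-out. */
    static unsigned bcd_digit_add(unsigned a, unsigned b, unsigned cin, unsigned *cout)
    {
        unsigned s = a + b + cin;   /* plain binary sum, range 0..19 */
        if (s > 9) {                /* not a valid BCD digit:        */
            s += 6;                 /* add 6 so the low 4 bits become (sum - 10) */
            *cout = 1;              /* and propagate a decimal carry */
            return s & 0xF;
        }
        *cout = 0;
        return s;
    }
    /* Examples 5.1 and 5.2 below: bcd_digit_add(9, 5, 0, &c) returns 4 with c = 1 (i.e. 14);
       bcd_digit_add(3, 4, 0, &c) returns 7 with c = 0. */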
40 Figure 5.10 describes the implementation of 4 bit BCD Adder for calculation of a single digit. Similarly to binary subtraction using 1's complement, calculation of Subtraction is carried out using 9's Complement representation. Therefore, if the current operation is subtraction (add/sub = 0), the complemented digit B (which is 9-B) is chosen by the multiplexer at the entrance of the upper 4 bit Full Adder. The reason for complementing B is that subtraction is simply addition with 9's complemented Subtrahend.10 Whenever the sum of the upper 4 bit Full Adder exceeds (1001)2=9, the output sum has to be fixed so that the output will equal (sum-10) and carry out will equal 1.This can be achieved by adding (0110)2=6to the sum. ___________________________________________________________________________ Proof: (sum-10) = sum + 6 - 16 = (sum+6) - 16. Subtracting 16 from (sum+6) is the same as taking the 4 rightmost bits and omitting the MSB (which is the carry out). □ ___________________________________________________________________________ The check whether a fix is needed can be obtained by a rather simple circuit. A fix is needed whenever carry out=1. Examination of the Truth Table in Table 5.1 concludes: carry out s1 s2 s3 c _ out The carry out bit also serves as the input decision bit of the output Multiplexer. 42 Figure 5.10. Implementation of 4 bit BCD Adder. If the carry out bit is cleared, the output is the sum of the upper 4 bit Full Adder. If the carry out bit is set, the output is the sum of the lower 4 bit Full Adder (the fixed sum). 43 s3 s2 s1 s0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 c_out carry out 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 Table 5.1. Truth Table for carry out bit. _____________________________________________________________________ Example 5.1. Adding 9 and 5 needs a fix: 0101 1001 1110 0110 0100 10 dec fix is needed ! Answer is (sum = 4) and (carry = 1) which corresponds to 14. Example 5.2. Adding 3 and 4 doesn’t need a fix: 0011 0100 10 dec no fix is needed ! 0111 Answer is (sum = 7) and (carry = 0) which corresponds to 7. □ _____________________________________________________________________ 44 Both addition and subtraction are executed using Carry-Select Adder with groups of 4 bits (Fig. 5.11). Reminder: calculations are carried out using BCD representation; therefore calculation of a certain digit (4 bits) should be separated from calculations of the other digits. Carry-Select Adder saves the carry ripple time11 in exchange of adding another Full Adder for calculation of each digit (except for the least significant digit). This approach follows the principle: "High Performance over Silicon Cost". The added digits (each consists of 4 bits) enter two identical 4 bit BCD Adders, where the input carry of one adder is logic '0' and the input carry of the other adder is logic '1'. A Multiplexer chooses one of the two sums produced and one of the two output carries. The decision bit is the carry out that is chosen in the previous Multiplexer. Addition: Adding two significands is rather simple when compared to Subtraction. The output op_sign that should indicate the sign of the result is always 0, since the result is always positive. In case of a result significand that is ≥ 10, carry out = 1. Subtraction: When subtracting two significands there are two possible scenarios: 1. sig1 > sig2 2. 
sig1 ≤ sig2
The result of subtraction should be displayed as an absolute value. The output op_sign should indicate the sign of the result.
The first subtraction scenario leads to (carry out = 1). This wrap-around carry should be added to the result significand.
_____________________________________________________________________
Proof: sig1 − sig2 = sig1 + sig2_complemented = sig1 + (99……999 − sig2) = (sig1 − sig2) + 99……999 => carry out = 1, since (sig1 − sig2) > 0.
Adding the wrap-around carry => (sig1 − sig2) + 99……999 + 1 = sig1 − sig2 (omitting the MSB) = |sig1 − sig2|. □
_____________________________________________________________________
The second subtraction scenario leads to (carry out = 0). The answer needs to be 9's complemented.
_____________________________________________________________________
Proof: sig1 − sig2 = sig1 + sig2_complemented = sig1 + (99……999 − sig2) = (sig1 − sig2) + 99……999 (=> carry out = 0, since (sig1 − sig2) ≤ 0) = 99……999 − (sig2 − sig1) = (sig2 − sig1) complemented =>
=> complementing the answer will give (sig2 − sig1) = |sig1 − sig2|. □
_____________________________________________________________________
In order to save the time of adding the wrap-around carry (first scenario) or complementing the answer (second scenario), the Adder/Subtractor component calculates the result according to the three scenarios (one in Addition, two in Subtraction) in parallel, and a Multiplexer chooses the output significand. Using two 76 bit Carry-Select BCD Adders (named pipe1 and pipe2), one with (carry in = 0) and the other with (carry in = 1), and a 9's complement unit, a correct result can be obtained for each of the three scenarios:
For Addition: the result is the output of pipe1 (carry in = 0).
For Subtraction (first subtraction scenario): the result is the output of pipe2 (carry in = 1). Adding a wrap-around carry is the same as setting (carry in = 1) in the first place.
For Subtraction (second subtraction scenario): the result is the 9's complement of the output of pipe1 (carry in = 0).
Note: generation and complementation of the output of pipe1 are carried out in parallel (once an output digit is calculated, it is complemented), and not after the entire output significand is calculated. Figure 5.12 shows the implementation of the Adder/Subtractor component.
No operation:
When the input en bit is cleared, no operation is needed. Therefore the output significand and op_sign are chosen according to the other input bits, add/sub and isBigger:
isBigger is cleared (operand1 ≥ operand2): the output significand is sig1 and the result is positive (op_sign = 0).
isBigger is set (operand1 < operand2): the output significand is sig2. If the operation is Addition (add/sub = 1) then op_sign = 0. If the operation is Subtraction (add/sub = 0) then op_sign = 1.
The Truth Table in Table 5.2 summarizes the relations between the outputs and the inputs: for each of the 16 combinations of en, isBigger, add/sub and p1_cout it lists the selected result significand, c_out and op_sign according to the rules above.
Table 5.2. Truth Table for the outputs of the Adder/Subtractor component, where: p1_cout is the carry out of the pipe1 Adder; pipe1 is the pipe1 output significand; pipe1c is the pipe1 output significand complemented; pipe2 is the pipe2 output significand.
Therefore:
op_sign = (¬en · isBigger · ¬(add/sub)) + (en · ¬(add/sub) · ¬p1_cout)
c_out = en · (add/sub) · p1_cout
Figure 5.11. Implementation of a 76 bit Carry-Select BCD Adder.

Figure 5.12. Implementation of the Adder/Subtractor component.

5.9 Normalizer
The Normalizer (Fig. 5.13) shifts the input significand according to the Normalizing specifications (see Para. 2.4). c_out indicates that a carry out has occurred in the previous resource (significand ≥ 10). The OVF/UNDF bit is set if the normalization causes an overflow/underflow. Shifting is done using a Barrel Shifter, which speeds up the operation.

Figure 5.13. Normalizer.

5.10 Rounder
The Rounder (Fig. 5.14) rounds the input significand according to the Rounding specifications (see Para. 2.4).

Figure 5.14. Rounder.

5.11 Sign Decision
The Sign Decision component (Fig. 5.15) finds the final sign of the result operand.

Figure 5.15. The Sign Decision component.

If add/sub = 1: the actual operation carried out in the Adder/Subtractor was addition. Therefore the final sign is sign1.
If add/sub = 0: the actual operation carried out in the Adder/Subtractor was subtraction. Therefore the final sign is (sign1 XOR op_sign), where op_sign is the sign of the result of the subtraction.
Figure 5.16 shows the implementation of the Sign Decision component.

Figure 5.16. Implementation of the Sign Decision component.

6. Integration
6.1 Composing the Complete System
In order to compose the complete system, the implemented units (Chapter 5) were integrated into a full, connected datapath.

6.2 Creating a Pipelined Datapath
An instruction is divided into four stages, i.e. moving from a single-cycle datapath to a four-stage pipelined datapath, which means that up to four instructions may be in execution during any single clock cycle. By creating three Pipeline Registers, the stages are separated and information is saved with each rising edge of the clock. The pipeline execution throughput is one instruction per cycle. Figure 6.1 shows the Pipelined Datapath.
Note: certain signals (OVF, UNDF, isNaN, isInf etc.) were omitted from Figure 6.1 in order to describe the main flow of data.

6.3 Creating a Control Unit
In order to manage the advancement of the pipeline, and to manage the control signals (enable bits, multiplexer decision bits) of the different concurrent operations, a central Control Unit is necessary. The control unit is implemented as a Finite State Machine (FSM). Given six basic states - Instruction Fetch, Decode, Execute, Write Back, Idle (for system reset) and wait4inst (system out of reset and waiting for the input cache to load) - and a pipelined datapath, several pipeline stages can be occupied simultaneously (each combination of IF, D, E, WB). For each of the possible combinations, the control unit allocates a unique state and sends signals to the datapath according to the current state. Overall there are 17 states available.

Figure 6.1. The Pipelined Datapath.
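As an illustration only (the real control unit is an HDL FSM, and the instruction count below is arbitrary), the following Python sketch steps instructions through the four stages. After the pipeline fills, one instruction completes per cycle while up to four are in flight; each such combination of occupied stages corresponds to one of the FSM states.

    STAGES = ["IF", "D", "E", "WB"]

    def run_pipeline(num_instructions, cycles):
        """Toy model of the 4-stage pipeline: one instruction enters per cycle,
        so up to four instructions are in flight in any single clock cycle."""
        in_flight = {}                               # instruction id -> stage index
        next_id = 0
        for cycle in range(cycles):
            # rising clock edge: every instruction moves one stage; WB retires
            in_flight = {i: s + 1 for i, s in in_flight.items() if s + 1 < len(STAGES)}
            if next_id < num_instructions:           # mem_valid = 1: fetch a new instruction
                in_flight[next_id] = 0
                next_id += 1
            view = {st: "--" for st in STAGES}
            for i, s in in_flight.items():
                view[STAGES[s]] = f"i{i}"
            print(f"cycle {cycle}: " + "  ".join(f"{st}:{view[st]}" for st in STAGES))

    run_pipeline(num_instructions=6, cycles=8)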
Figure 6.2 describes the transfer function between the states. For simplicity, states in the diagram were joined according to similar next states. Transitions between states are represented by <condition>/<next state>. For example: NI/EW means that no new instruction has arrived and the next state is EW. Each transfer between states depends on the previous state, and on whether or not there is a following instruction to perform (in which case the FSM receives mem_valid = 1). So long as there is a following instruction to perform, a valid bit is sent to the Program Counter in order to fetch the new instruction. Once there are no more instructions to perform, the Program Counter valid bit is cleared, and no more instructions are fetched. Note: the pipeline continues to execute the existing instructions.

For each new instruction fetched, the opcode is analyzed by the FSM, which returns the appropriate control signals. Similarly, with each state transfer, write enable signals are sent to the pipeline registers. Table 6.1 describes the values of the control signals for each type of input instruction opcode. The control signals are:
DPD source - decision for the source of the densely packed decimal operand.
wsource - decision for the source of the data written.
incdec - decision for inc/dec.
unbinop - decision for unary/binary operation.
wbmethod - decision for the Write Back method.
negator - decision for the negate operation.
wen - write enable for the Register File.
sub_op - decision for the subtract operation.
selfwrite - decision for read & write to the same register.
An extra signal is reserved for further design.

Figure 6.2. Control Unit: transfer function between states.

            add_r  add_m  sub_r  sub_m  inc_r  inc_m  dec_r  dec_m  neg  mov_i  mov_r
DPD source    1      1      1      1      1      1      1      1     1     0      1
wsource       1      1      1      1      1      1      1      1     0     0      0
incdec        0      0      0      0      1      1      0      0     0     0      0
unbinop       0      0      0      0      1      1      1      1     0     0      0
wbmethod      0      1      0      1      0      1      0      1     0     0      0
negator       0      0      0      0      0      0      0      0     1     0      0
wen           1      1      1      1      1      1      1      1     1     1      1
sub_op        1      1      0      0      1      1      1      1     0     0      0
selfwrite     0      0      0      0      1      1      1      1     1     0      0
reserved      0      0      0      0      0      0      0      0     0     0      0

Table 6.1. Control Signals for each type of instruction.

6.4 Pipeline Hazards
A Read-after-Write (RAW) data hazard may occur in the designed Pipeline, which can result in an incorrect computation. This hazard occurs when an instruction refers to a result that has not yet been calculated or retrieved. For example:
add_r r1,r2,r3
add_r r5,r1,r4
r1 is read before its true value is written, because the second instruction starts the Execution stage when the first instruction starts the Write Back stage. Some of the possible future solutions for this problem are (a sketch of the forwarding condition appears after this list):
1. Stalling the pipeline (will increase latency).
2. Forwarding: once an instruction finishes its Execution stage, the result can be used immediately in the Execution stage of the next instruction.12
3. Reordering instructions to avoid hazards (done by the designated compiler).
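As an illustration of the second solution (a behavioral sketch only, not part of the implemented design; the tuple encoding of instructions is invented for the example), the hazard can be detected by comparing the destination register of the previous instruction with the source registers of the current one, and the Execute-stage result forwarded when they match.

    def raw_hazard(prev_instr, curr_instr):
        """Illustrative RAW hazard check. Instructions are modelled as
        (opcode, dest, src1, src2) tuples; a hazard exists when the current
        instruction reads a register the previous one has not yet written back."""
        _, prev_dest, _, _ = prev_instr
        _, _, src1, src2 = curr_instr
        return prev_dest in (src1, src2)

    # The hazard from the example above: add_r r1,r2,r3 followed by add_r r5,r1,r4
    i1 = ("add_r", "r1", "r2", "r3")
    i2 = ("add_r", "r5", "r1", "r4")
    if raw_hazard(i1, i2):
        # forwarding: reuse i1's Execute-stage result as i2's source instead of stalling
        print("forward the Execute-stage result of", i1[1], "into the next instruction")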
7. Arithmetic and System Verification
7.1 Verification Properties
The main properties necessary for the verification and validation of the DFP unit are:
Correct calculation of arithmetic operations - includes arithmetic testing and correct Datapath operation.
Compliance with the IEEE 754-2008 standard specifications.

Correct calculation of arithmetic operations
The following types of test were performed:
1. Correct addition/subtraction of operands.
2. Correct operation for a large exponent difference.
3. Correct handling of Overflow/Infinity.

Compliance with the IEEE 754-2008 standard specifications
The following specifications were tested:
1. Correct translation between the DEC64 format and the spread decimal floating point format.
2. Correct result rounding according to the chosen IEEE 754-2008 standard scheme.
3. Correct encoding/decoding of infinity/NaN.

The guidelines for the verification were taken from test vectors published by Prof. Mike Cowlishaw1. His work was written prior to the publication of the new IEEE 754-2008 standard, and therefore could not be used in full, but the principles of his verification technique were adapted for this project.
Initially, as a 'quick confidence check', a sample assembly program was loaded into the instruction cache and the results were validated. This initial test examined the basic operation of each available command. The following step was to build a more robust testing array, based on the scheme in Figure 6.3.

Figure 6.3. Verification Scheme.

The test vectors used for comparison were taken from Cowlishaw's website and also from IBM Haifa's Floating Point test Generator2,3. The IBM test vectors were translated to DFP commands according to the DFPU ISA using AWK scripts. The commands were loaded into the UUT (Unit Under Test - the DFPU) and the results were printed out and compared to the results given by IBM.

7.2 Verification Conclusions and Results
Correct calculation of arithmetic operations
1. Correct addition/subtraction of operands - verified. In cases where results differed, close examination showed that the cause was different rounding schemes.
2. Correct operation for a large exponent difference - verified.
3. Correct handling of Overflow/Infinity - verified. In cases where results differed, close examination showed that the cause was different rounding schemes.

Compliance with the IEEE 754-2008 standard specifications
1. Correct translation between the DEC64 format and the spread decimal floating point format - verified.
2. Correct result rounding according to the chosen IEEE 754-2008 standard scheme - despite the different rounding schemes, in some cases the result is rounded to the same value. Of all the cases examined, some errors in rounding were identified and corrected. In the remaining cases, the DFPU result agreed with the chosen rounding scheme, and the differences between the DFPU and the IBM test vectors were due to different rounding schemes.
3. Correct encoding/decoding of infinity/NaN - verified.

8. Synthesis
8.1 Implementation on FPGA
The integrated system was implemented using the Virtex®-6 FPGA ML605 Evaluation Board. In order to load the design, the Xilinx ISE Design Suite 13.2 was used. The *.list files, used in the Cadence Simvision environment to simulate the instruction and result memories, were implemented in the Virtex-6 system using Distributed RAM, loaded with *.coe files. inst_mem.coe represents our instruction memory and is the basis of our Test Bench.

8.2 Design Evaluation
Running the design on the Virtex-6 provided the possibility to test the actual ability to run the design on real-life hardware, with real-life hardware constraints. Specifically, it allows testing Timing and Clock Frequency constraints.

Solving synthesis problems
A significant problem with the synthesis was that the designed Normalizer included a While loop, which is not synthesizable. Converting the While loop to a series of conditional-if statements solved this issue (a behavioral sketch of this conversion appears at the end of this chapter).

Identifying the optimal clock rate
The process of identifying the optimal clock rate for the DFPU involved running the unit at increasingly higher clock rates until incorrect results were returned, because commands could no longer complete within the clock period. Using a PLL (Phase-Locked Loop), multiple clocks with different rates were created, and the working clock was chosen by on-board switches. The optimal clock rate identified for the DFPU is 66 MHz.
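The nature of the Normalizer fix mentioned above can be illustrated with a small behavioral sketch (Python, illustrative only; the real Normalizer is an HDL block that also adjusts the exponent and raises OVF/UNDF, and its exact shift network is not reproduced here). A data-dependent while loop has no fixed iteration bound, so synthesis rejects it; the same leading-zero shift can be expressed as a fixed cascade of conditional shifts, which is the form a barrel shifter takes.

    def normalize_while(digits):
        """Leading-zero removal with a data-dependent while loop - the style that
        turned out not to be synthesizable. 'digits' is a BCD significand, MSD first."""
        shift = 0
        while digits[0] == 0 and any(digits):
            digits = digits[1:] + [0]                # shift left by one digit
            shift += 1
        return digits, shift

    def normalize_unrolled(digits):
        """The same shift expressed as a fixed cascade of conditional shifts by
        8, 4, 2 and 1 digits (barrel-shifter style), which synthesis can handle."""
        shift = 0
        for s in (8, 4, 2, 1):                       # static, fully unrolled levels
            if all(d == 0 for d in digits[:s]) and any(digits[s:]):
                digits = digits[s:] + [0] * s        # conditional left shift by s digits
                shift += s
        return digits, shift

    sig = [0, 0, 0, 7, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]            # 16-digit significand
    assert normalize_while(list(sig)) == normalize_unrolled(list(sig))  # both shift by 3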
9. Summary
9.1 DFPU Review
The DFPU is a hardware implementation of decimal arithmetic algorithms (specifically addition, subtraction and related operations). Its high-level design is integrated into the low-level design. It has undergone algorithmic simulation, verification and final hardware synthesis. The design is unique in terms of several parameters:
The design is built to comply with the IEEE 754-2008 standard definitions.
The design includes an advanced Adder/Subtractor, which provides equal runtime for addition and subtraction calculations, and avoids the wasteful (both in terms of time and silicon area) comparison of significands that existed in earlier designs13; this also provides modularity.
The design provides addition/subtraction with a latency of 4 clock cycles and a throughput of one instruction per clock cycle.

9.2 Future Expansions
Potential expansions to the DFPU range from functionality to efficiency.

Functionality
Additional DFP functions should be made available, such as: Multiply, Divide, Fused Multiply-Add, Compare, etc.
Additional control functionality should be made available, such as Loop support and Branch support.
Additional hardware for the creation of a detailed data payload in case of an Invalid Operation exception should be made available.

Efficiency
The Adder/Subtractor can be enhanced using carry-look-ahead in each 4 bit BCD adder.
Further attempts to create a more balanced pipeline should be made. For example, it is possible to take advantage of Distributed RAM capabilities to speed up the Fetch stage and merge it into the following Decode stage, thus forming a 3-stage pipeline.
Support for advanced data hazard solutions can be added (Forwarding, Reordering; see Para. 6.4).

Appendices
A. DFP History
The suggested DFP Unit is not the first decimal floating-point unit implemented, but it is unique in that it complies with the new IEEE 754-2008 standard.

Hardware solutions
Selected past attempts:
ENIAC - The United States Military began construction of the ENIAC during WWII (1943); it was designed to calculate artillery firing tables for the United States Army's Ballistic Research Laboratory. The ENIAC could store a ten digit decimal number in memory, but could not perform decimal floating-point computations.4
Bell Laboratories Mark V - The first documented decimal floating-point processor was the Bell Laboratories Mark V computer, designed in 1946.5
Burroughs 2500 & 3500 - Another important decimal floating-point computer was the Burroughs 2500, developed in 1966. It used strings of up to 100 digits, with two 4-bit BCD (Binary-Coded Decimal) digits per byte.
These examples were developed before the existence of a floating-point standard. The 754-1985 standard was the first to define formats for representing floating-point numbers and special values (NaN, Inf), floating-point operations, rounding modes and exceptions. The standard in use today - IEEE 754-2008 - revised and replaced IEEE 754-1985. The revision extended the previous standard by including, among other things, decimal arithmetic and formats, and merged in IEEE 854 (1987) - the radix-independent floating-point standard.
Two examples of a standardized Decimal Floating Point Unit are the IBM Z9 (2005-2006) and Z10 (2008). The Z9 utilized an encoded decimal representation for data, instructions for performing decimal floating point computations, and an instruction which performed data conversions to and from the decimal floating point representation. The System Z9 was the first commercial server to add IEEE 754 decimal floating point instructions, although these instructions were implemented in microcode with some hardware assists.
The Z10 introduced full hardware support in the form of a Hardware Decimal Floating-point Unit (HDFU): it implemented the main IEEE 754 decimal floating point operations as a built-in, integral component of each processor core and its instruction set architecture.6
Note: it is important to note that the Z10 was developed before the publication of the IEEE 754-2008 standard.

Software solutions
For reasons of backwards compatibility, and in order to gain software flexibility, several software libraries capable of handling decimal floating-point operations were developed. Some of the better-known ones are:
Intel® Decimal Floating-Point Math Library
decNumber/decNumber++ by Mike Cowlishaw1
These solutions indeed solve the precision issue, but fall short (and actually worsen the situation) with regard to the speed requirement. Research performed at the University of Wisconsin shows that when using the decNumber library for DFP arithmetic, most benchmarks spend more than 75% of their execution time in DFP functions.7,8 The research also showed that providing fast hardware support for DFP instructions results in speedups for the same benchmarks ranging from 1.3 to 31.2.

Bibliography
1. www.speleotrove.com
2. www.haifa.il.ibm.com/projects/verification/fpgen/
3. www.haifa.il.ibm.com/projects/verification/fpgen/ieeets.html
4. www.computerhistory.org
5. Harvey G. Cragon, Computer Architecture and Implementation (Cambridge University Press, Feb. 2003)
6. www.ibm.com/systems/z/hardware/
7. Liang-Kai Wang, Charles Tsen, Michael J. Schulte, and Divya Jhalani, Benchmarks and Performance Analysis of Decimal Floating-Point Applications (University of Wisconsin - Madison, Department of Electrical and Computer Engineering, Oct. 2007)
8. Michael J. Schulte, Nick Lindberg, Anitha Laxminarain, Performance Evaluation of Decimal Floating-Point Arithmetic (University of Wisconsin - Madison, Department of Electrical and Computer Engineering, 2005)
9. IEEE Standard for Floating-Point Arithmetic (IEEE Computer Society, Aug. 2008)
10. Anshul Singh, Aman Gupta, Sreehari Veeramachaneni, M.B. Srinivas, A High Performance Unified BCD and Binary Adder/Subtractor (IEEE Computer Society Annual Symposium on VLSI, 2009)
11. Israel Koren, Computer Arithmetic Algorithms (A. K. Peters/CRC Press, 2nd edition, Dec. 2001)
12. David A. Patterson, John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface (Morgan Kaufmann, 4th edition, Nov. 2008)
13. John Thompson, Nandini Karra, Michael J. Schulte, A 64-bit Decimal Floating-Point Adder (IEEE Computer Society Annual Symposium on VLSI: Emerging Trends in VLSI Systems Design (ISVLSI'04), 2004)
14. http://speleotrove.com/decimal/decifaq.html
15. Michael F. Cowlishaw, Decimal Floating-Point: Algorism for Computers (Proceedings of the 16th IEEE Symposium on Computer Arithmetic (ARITH'03), 2003)