Floating point for VHDL and Verilog

Fixed- and floating-point packages for VHDL 2005
David Bishop, Eastman Kodak Company, Rochester, NY
Abstract
The pending update to the VHDL LRM contains several new packages and functions. The new packages include support for
both fixed-point and floating-point binary math. These fully synthesizable packages will raise the level of abstraction in
VHDL. DSP applications, which previously needed an independent processor core or required very difficult manual
translation, can now be performed within your VHDL source code. In addition, schematic-based DSP algorithms can
now be translated directly to VHDL. This paper describes these packages and gives examples of their use.
Introduction:
For the past 15 years we have been using HDL to increase the level of abstraction in our ASIC and FPGA designs. HDL
was a major leap from schematics. What have we done since? Little.
Attempts have been made. E, System-C and System-Verilog are good examples. These are great ideas, but they do not
give the designer the control and tool maturity that VHDL and Verilog provide.
Why not simply increase the level of abstraction in a language that is already well known? The potential of VHDL has
not yet been fully tapped. Designed from the ground up as a software language, it is easily extendable and flexible.
Constructed at a higher level than Verilog, it has the ability to provide higher levels of abstraction directly, with already
mature tools.
Typically designers use integer math in their RTL code. For fixed point they tend to just “remember” where the decimal
point is. For floating point they use a DSP, which may even be off chip. Designers tend to reach for math solutions in the order
“integer math”, “fixed point math” and “floating point math”: about 80% of designs are done in integer math, and of the
remaining 20%, about 80% are done in fixed point. Note that the complexity of fixed point math is not that much higher than integer math,
but that floating point is about 3x as complex as integer math.
The integer math problem has been effectively solved with the NUMERIC_STD packages (1076.3, now part of VHDL-200X-FT). This package has been well adopted and has been in use for many years.
In this paper, I describe a new set of packages that are being added to the VHDL language in the VHDL-2005 update. These packages include VHDL overloads that allow you to do fixed- and floating-point math directly, without
the user having to perform any conversions. These packages raise the level of abstraction in VHDL and give the user the
flexibility and power of an HDL.
Fixed-point package:
Fixed-point math is basically integer math with numbers that can be less than 1.0. A fixed-point number has an assigned
width and an assigned location for the decimal point. As long as the number is big enough to provide enough precision then
fixed point is fine for most DSP applications. Since it is based on integer math it is extremely efficient, as long as
the data does not vary too much in magnitude.
The fixed-point math packages are based on the VHDL 1076.3 numeric_std package and use the signed and unsigned
arithmetic from within that package. This makes them highly efficient as the numeric_std package is well supported by
simulation and synthesis tools. This package defines two new types “ufixed” which is unsigned fixed point, and “sfixed”
which is signed fixed point.
Usage model:
use ieee.fixed_pkg.all;
....
signal a, b : sfixed (7 downto -6);
signal c: sfixed (8 downto -6);
begin
....
a <= to_sfixed (-3.125, 7, -6);
b <= to_sfixed (inp1, b'high, b'low);
c <= a + b;
The two data types are defined as follows:
type ufixed is array (integer range <>) of std_logic;
-- base Unsigned fixed point type, downto direction assumed
type sfixed is array (integer range <>) of std_logic;
-- base Signed fixed point type, downto direction assumed
This data type uses a negative index to show you where the decimal point is. The decimal point is assumed to be
between the "0" and "-1" index. Thus if we assume "signal y : ufixed (4 downto -5)" as the data type (unsigned fixed
point, 10 bits wide, 5 bits of fraction), then y = 6.5 = "00110.10000", or simply:
y <= "0011010000";
You can also say:
y <= to_ufixed (6.5, 4, -5);
where "4" is the upper index, and "-5" is the lower index, so you could also say:
y <= to_ufixed (6.5, y'high, y'low);
The signed version uses two's complement to represent a negative number, just like the "numeric_std" package.
The index range does not have to include zero. Thus:
signal z : ufixed (-2 downto -3);
z <= "11"; -- 0.375 = 0.011
signal x : sfixed (4 downto 1);
x <= "1111"; -- -2 = 11110.0
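The negative-index encoding can be checked with a small Python model. This is only a sketch: the helper names to_fixed_bits and from_fixed_bits are hypothetical (they are not package functions), and the package's own rounding and saturation behavior is not modeled.

```python
def to_fixed_bits(value, high, low, signed=False):
    """Encode 'value' as a bit string indexed 'high downto low'.

    Hypothetical helper for illustration; not part of the fixed-point package.
    """
    width = high - low + 1
    # Scale so that bit 'low' becomes the integer LSB (Python's round, not the package's).
    scaled = round(value * 2 ** (-low))
    lo_lim = -(1 << (width - 1)) if signed else 0
    hi_lim = (1 << (width - 1)) - 1 if signed else (1 << width) - 1
    if not lo_lim <= scaled <= hi_lim:
        raise ValueError("value does not fit in the requested range")
    # Mask to 'width' bits so negative values appear in two's complement.
    return format(scaled & ((1 << width) - 1), "0{}b".format(width))


def from_fixed_bits(bits, high, low, signed=False):
    """Decode a bit string indexed 'high downto low' back to a float."""
    width = high - low + 1
    if len(bits) != width:
        raise ValueError("bit string length does not match the index range")
    raw = int(bits, 2)
    if signed and bits[0] == "1":
        raw -= 1 << width          # undo two's complement
    return raw * 2.0 ** low        # weight of the LSB is 2**low


print(to_fixed_bits(6.5, 4, -5))           # unsigned, 4 downto -5
print(from_fixed_bits("11", -2, -3))       # unsigned, -2 downto -3
print(from_fixed_bits("1111", 4, 1, True)) # signed, 4 downto 1
```

The model makes the point of the index convention explicit: the bit string carries no binary point of its own, and the value depends entirely on the declared range.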
The data widths in the fixed-point package were designed (by Ryan Hilton) so that there is no possibility of an overflow.
This is a departure from the “numeric_std” model which simply throws away underflow and overflow bits.
For unsigned fixed point:
ufixed(a downto b) + ufixed(c downto d) = ufixed(max(a,c)+1 downto min(b,d))
ufixed(a downto b) - ufixed(c downto d) = ufixed(max(a,c)+1 downto min(b,d))
ufixed(a downto b) * ufixed(c downto d) = ufixed(a+c+1 downto b+d)
ufixed(a downto b) / ufixed(c downto d) = ufixed(a-d+1 downto b-c-1)
reciprocal (ufixed(a downto b)) = ufixed(a-b+1 downto b-a-1)
ufixed(a downto b) rem ufixed(c downto d) = ufixed(c downto d)
ufixed(a downto b) mod ufixed(c downto d) = ufixed(a downto b)
For signed fixed point:
sfixed(a downto b) + sfixed(c downto d) = sfixed(max(a,c)+1 downto min(b,d))
sfixed(a downto b) - sfixed(c downto d) = sfixed(max(a,c)+1 downto min(b,d))
sfixed(a downto b) * sfixed(c downto d) = sfixed(a+c downto b+d)
sfixed(a downto b) / sfixed(c downto d) = sfixed(a-d downto b-c)
reciprocal (sfixed(a downto b)) = sfixed(a-b downto b-a)
sfixed(a downto b) rem sfixed(c downto d) = sfixed(c downto d)
sfixed(a downto b) mod sfixed(c downto d) = sfixed(a downto b)
Unsigned Example:
signal x : ufixed (7 downto -3);
signal y : ufixed (2 downto -9);
If we multiply x by y we would get a signal which would be:
x * y = ufixed (7+2+1 downto -3+(-9)) or ufixed (10 downto -12);
Signed Example:
signal x : sfixed (-1 downto -3);
signal y : sfixed (3 downto 1);
If we divide x by y we would get a signal which would be:
x/y = sfixed (-1-1 downto -3-3) or sfixed (-2 downto -6);
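The sizing rules listed above are simple index arithmetic, so they can be tabulated in a few lines of Python. This is a sketch of the rules as stated in this paper only; the function names are hypothetical, and each function returns the (high, low) pair of the result for operands (a downto b) and (c downto d).

```python
def add_range(a, b, c, d):
    """Result range of + and - (same rule for ufixed and sfixed)."""
    return (max(a, c) + 1, min(b, d))


def ufixed_mul_range(a, b, c, d):
    return (a + c + 1, b + d)


def sfixed_mul_range(a, b, c, d):
    return (a + c, b + d)


def ufixed_div_range(a, b, c, d):
    return (a - d + 1, b - c - 1)


def sfixed_div_range(a, b, c, d):
    return (a - d, b - c)


# Unsigned example from the text: (7 downto -3) * (2 downto -9)
print(ufixed_mul_range(7, -3, 2, -9))
# Signed example from the text: (-1 downto -3) / (3 downto 1)
print(sfixed_div_range(-1, -3, 3, 1))
# Usage model: (7 downto -6) + (7 downto -6)
print(add_range(7, -6, 7, -6))
```

A table like this is handy when declaring the signal that receives a result, since the declared range has to match the computed one exactly unless "resize" is used.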
The “resize” function can be used to fix the size of the output. Note, however, that the rounding and saturation rules are applied:
X <= resize (x * y, x'high, x'low);
What about an accumulator? An accumulator is a fixed width number that you continually add to. To implement an
accumulator in the fixed-point packages, you can use the “resize” function as follows:
signal X : ufixed (7 downto 0);
X <= resize (X + 1, X'high, X'low, false, false);
Where the first “false” is the round_style. Since we do not need to do any rounding, we set this to false. The second
“false” is the overflow_style. If this is set to true, we saturate, or go to the maximum possible number. When set to “false”
we wrap, meaning that the upper most bit is dropped and the number simply recycles. Note that the default for both
overflow_style and round_style is “true”.
All operators are overloaded for integer and real, thus you can say:
signal x : sfixed (4 downto -5);
signal y : real;
…
Z := x + y;
When an operation includes both a fixed-point number and an integer or real, the
sizing rules are modified. A real is converted to a fixed-point number that is the same size as the
fixed-point number passed as the other argument. Thus in the above example:
Z := x + to_sfixed (y, 4, -5);
would be called, which would result in Z being an “sfixed (5 downto -5)” type. An integer is also converted
to a fixed-point number, but the size is only “downto 0”, as an integer can never have a fraction. Thus, if “y” were an integer
the above example would look like:
Z := x + to_sfixed (y, 4, 0);
Which in this case would not affect the resultant number’s size. However this has a fairly large effect on the size of the
output numbers in the multiply and divide routines.
The following operations are defined for ufixed:
+, -, *, /, rem, mod, =, /=, <, >, >=, <=, sll, srl, rol, ror, sla, sra
The following functions are defined for ufixed:
divide, reciprocal, scalb, maximum, minimum, find_lsb, find_msb, resize, To_01, Is_X
Conversion functions are defined for ufixed:
to_ufixed (natural), to_ufixed (real), to_ufixed (unsigned), to_ufixed(signed), remove_sign (sfixed), to_unsigned,
to_real, to_integer, to_UFix
The following operations are defined for sfixed:
+, -, *, /, rem, mod, =, /=, <, >, >=, <=, sll, srl, rol, ror, sla, sra, abs, - (unary)
The following functions are defined for sfixed:
divide, reciprocal, scalb, maximum, minimum, find_lsb, find_msb, resize, to_01, Is_X
Conversion functions are defined for sfixed:
to_sfixed (natural), to_sfixed (real), to_sfixed (unsigned), to_sfixed(signed), add_sign (ufixed), to_signed, to_real,
to_integer, to_Fix
All of the operators are overloaded for “real” and “integer” data types. In each case the number is converted into fixed
point before the operation is done. Thus the fixed-point operand must be of a format large enough to accommodate the
converted input or a “vector truncated” warning is produced. In the case of an integer, the number is converted in the form
“integer_width downto 0” which causes the size of the output vector to change accordingly. In these functions
“fixed_saturate” is set to true regardless of what the “overflow_style” constant is set to.
This package defines 3 constants that are used to manipulate fixed-point numbers:
constant fixed_round : boolean := true; -- Round or truncate
constant fixed_saturate : boolean := true; -- saturate or wrap
constant fixed_guard_bits : natural := 3; -- guard bits for rounding
These constants are defaults, and can be overridden everywhere they are used.
"round_style" defaults to fixed_round (true), which turns on the rounding routines. If false, the number is truncated. When
rounding is on, the result is rounded up if the MSB of the remainder is a "1" AND either the LSB of the unrounded result is a '1' or
the lower bits of the remainder include a '1'. This is similar to the floating-point “round_nearest” style.
"overflow_style" defaults to fixed_saturate (true), which returns the maximum possible number if the number is too large to
represent; otherwise a "wrap" routine is used which simply truncates the top bits. Unlike the way it is done in “numeric_std”,
the sign bit is not preserved when wrapping. Thus it is possible to get a positive result when resizing a negative number in this
mode.
Finally "guard_bits" defaults to "fixed_guard_bits", which defaults to 3. Guard bits are used in the rounding routines. If
guard_bits is set to 0, then rounding is automatically turned off. These extra bits are added to the end of numbers in the
division and “to_real” functions to make the numbers more accurate.
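The rounding condition described for "round_style" can be sketched in Python on plain integers. This is a minimal model under the assumption that the value is a non-negative integer whose low "drop" bits form the remainder; the function name round_nearest_bits is hypothetical and is not the package's implementation.

```python
def round_nearest_bits(value, drop):
    """Drop the low 'drop' bits of a non-negative integer.

    Rounds up when the MSB of the remainder is 1 AND either the LSB of the
    unrounded result is 1 or any lower remainder bit is 1 (ties go to even).
    """
    if drop <= 0:
        return value
    result = value >> drop                        # unrounded (truncated) result
    remainder = value & ((1 << drop) - 1)         # the bits being discarded
    msb = (remainder >> (drop - 1)) & 1           # top bit of the remainder
    lower = remainder & ((1 << (drop - 1)) - 1)   # remainder bits below the MSB
    if msb and ((result & 1) or lower):
        result += 1
    return result


print(round_nearest_bits(0b0101100, 3))  # exact tie, odd result: rounds up
print(round_nearest_bits(0b0100100, 3))  # exact tie, even result: stays
print(round_nearest_bits(0b0100101, 3))  # above the tie: rounds up
```

The tie case is the only one where the LSB of the result matters, which is why the condition needs both terms.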
The “resize” function is defined as follows:
function resize (
  arg : sfixed;
  constant integer_width : INTEGER;
  constant fraction_width : INTEGER;
  constant round_style : BOOLEAN := fixed_round;
  constant overflow_style : BOOLEAN := fixed_saturate)
  return sfixed;
In “saturate” mode (where overflow_style is true) if the output size is smaller than the input number then the number will
“saturate”. An unsigned fixed-point number will saturate to all “1”, a signed positive number will be all “1” with the first bit a “0”,
and a signed negative number will saturate to be all “0” with the first bit a “1”.
If in “wrap” mode (where overflow_style is false) the number will be truncated. In this case the top of the number is
simply truncated without regard to the sign bits, so you can truncate a negative number to be a positive one. The rounding
routines are left intact in “wrap” mode.
If “round_style” is true, then the rounding routines are turned on. Otherwise the number is simply truncated.
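The difference between the two overflow modes can be illustrated with a short Python model on integers. This is a sketch only: resize_int is a hypothetical name, rounding is left out, and "width" stands for the number of bits in the target range.

```python
def resize_int(value, width, signed, saturate=True):
    """Fit integer 'value' into 'width' bits, saturating or wrapping on overflow."""
    lo = -(1 << (width - 1)) if signed else 0
    hi = (1 << (width - 1)) - 1 if signed else (1 << width) - 1
    if lo <= value <= hi:
        return value                          # fits, nothing to do
    if saturate:
        return hi if value > hi else lo       # clamp to the nearest limit
    wrapped = value & ((1 << width) - 1)      # keep only the low 'width' bits
    if signed and wrapped > hi:
        wrapped -= 1 << width                 # reinterpret as two's complement
    return wrapped


# 8-bit unsigned accumulator stepping past its maximum
print(resize_int(255 + 1, 8, signed=False, saturate=False))
print(resize_int(255 + 1, 8, signed=False, saturate=True))
# Wrapping a negative number can produce a positive one
print(resize_int(-9, 4, signed=True, saturate=False))
print(resize_int(-9, 4, signed=True, saturate=True))
```

The third call shows the behavior the text warns about: -9 is 10111 in five bits, and dropping the top bit leaves 0111.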
Shift operators are functionally the same as the 1076-1993 shift operators with the exception of the arithmetic shift
operations. An arithmetic shift (“sra”, or “sla”) on an unsigned number is the same as a logical shift. An arithmetic shift on a
signed number is a logical shift if you are shifting left, and an arithmetic shift (sign bit replicated) if you are shifting right.
The divide function is defined as follows:
function divide (
  l, r : sfixed;
  guard_bits : NATURAL := fixed_guard_bits;
  round_style : BOOLEAN := fixed_round)
  return sfixed;
The output is sized with the same rules as the “/” operator. The function allows you to override the number of guard bits
and the rounding operation. Note that the output size is calculated so that overflow is not possible.
The reciprocal function is defined in a very similar manner to the divide function:
function reciprocal (
  arg : ufixed;
  guard_bits : NATURAL := fixed_guard_bits;
  round_style : BOOLEAN := fixed_round)
  return ufixed;
This function performs a “1/X” function, with the output vector following the sizing rules as noted above. This function
is very useful for dividing by a constant, example:
A := B/Cons;
Can be rewritten as:
A := B*reciprocal(Cons);
Since a multiply uses less logic than a divide, this can save you significant hardware resources.
The “scalb” function is a fixed-point version of a very common floating-point function. The function looks like this:
function scalb (y : ufixed; N : SIGNED) return ufixed;
This function computes y * 2**N without computing 2**N by using a shift operator. The size of the output number is the
same as the input. For this function overflow and rounding functions are ignored, as this is treated like a shift operator. The
“N” input is also overloaded for the type “INTEGER”.
The “maximum” and “minimum” functions do a compare operation and return the appropriate value. These functions
are not overloaded for integer and real inputs. The size of the inputs does not need to match.
The “find_lsb” and “find_msb” functions are used to find the least significant bit or most significant bit of a fixed-point
number. The function looks like the following:
function find_msb (arg : ufixed; y : STD_ULOGIC) return INTEGER;
In this case, “y” can be any “std_ulogic” value. These functions search for the first occurrence of “y” in the fixed-point
number. “find_msb” starts at the MSB (arg'high) and goes down. “find_lsb” starts at the LSB (arg'low) and goes up. If
the value is not found in the “find_msb” function, then “arg'low-1” is returned. If the value is not found in the “find_lsb”
function then “arg'high+1” is returned.
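The search behavior, including the two "not found" return values, can be modeled in Python as follows. This is a sketch under the assumption that the number is given as a bit string with the leftmost character at index "high"; the Python functions only mirror the described behavior and are not the package code.

```python
def find_msb(bits, high, low, y):
    """Index of the first 'y' scanning from 'high' downward; low-1 if absent."""
    for offset, bit in enumerate(bits):
        if bit == y:
            return high - offset
    return low - 1


def find_lsb(bits, high, low, y):
    """Index of the first 'y' scanning from 'low' upward; high+1 if absent."""
    for offset, bit in enumerate(reversed(bits)):
        if bit == y:
            return low + offset
    return high + 1


six_point_five = "0011010000"                 # ufixed (4 downto -5) holding 6.5
print(find_msb(six_point_five, 4, -5, "1"))   # highest set bit
print(find_lsb(six_point_five, 4, -5, "1"))   # lowest set bit
print(find_msb("0000", 3, 0, "1"))            # not found
```

Returning an index just outside the range lets a caller test for "not found" with a single comparison against the declared bounds.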
“to_01” and “Is_X” are similar in function to the numeric_std functions with the same name.
Most synthesis tools do not support any I/O format other than “std_logic_vector” and “std_logic”. Thus functions have
been created to convert between std_logic_vector and ufixed or sfixed and vice versa:
Uf7_3 <= to_ufixed (slv7, uf7_3'high, uf7_3'low);
and
Slv7 <= to_slv (sf7_3);
One of the changes made to all packages in VHDL-2005 is that the read and write routines for all data types are now
defined in the same package that defines that type. Thus the READ, WRITE, HREAD, HWRITE, OREAD, and OWRITE
routines are defined for fixed-point data types. A “.” separator is added between the integer part and the fractional part of the
fixed-point number. Thus if you write out our “6.5” example from above you will get the string "00110.10000", which you
can also read into that data type.
New to vhdl-2005 are the functions “to_string”, “to_ostring” and “to_hstring”. These are very useful in “assert”
statements. Example:
assert x = y
report to_string(x) & " /= " & to_string(y) severity error;
Or, if you prefer to see the numbers as “real” numbers, you can use:
assert x = y
report to_string(to_real(x)) & " /= " & to_string(to_real(y)) severity error;
MathWorks Simulink is these days the most common way to define a fixed-point DSP algorithm. It would seem to
be a major step into the past, as it is schematic based. In Simulink an unsigned fixed-point number is described as ufix[14,10],
which specifies a 14-bit word with 10 bits of fraction. This translates into “ufixed (3 downto -10)” in the
unsigned fixed-point type. The Simulink “sfix” notation translates much better because of the extra sign bit that must be
generated. Sfix(14, 10) will translate into “sfixed (4 downto -10)” in the notation of the “fixed_pkg”.
Issues:
A negative or “to” index is flagged as an error by the fixed point routines. Thus if you define a number as “ufixed (-1 to 5)”
the routines will automatically error out.
String literals are also a problem. By default, if you do the following:
Z <= a + "011011";
The index of the fixed-point number is undefined. The VHDL compiler will assume that this number has the
range “integer'low to integer'low+5”, making it very small. To avoid crashing the simulator with a 32,000 bit wide number
this also will automatically error out.
Floating-point numbers:
After fixed point the next step is floating point. Floating-point numbers are well defined by the IEEE-754 (32 and 64 bit)
and IEEE-854 (variable width) specifications. Floating point has been used in processors and IP for years and is a well-understood format.
There are many concepts in floating point that make it different from our well-understood signed and unsigned number
notations. These come from how a floating-point number is defined. Let's first take a look at a 32-bit floating-point number:
S  EEEEEEEE  FFFFFFFFFFFFFFFFFFFFFFF
31 30    23  22                    0
+/- exp.     Fraction
Basically, a floating-point number comprises a sign bit (+ or -), a biased exponent, and a fraction. To convert this
number back into a real number, the following equation can be used:
S * 2 ** (exponent - exponent_base) * (1.0 + Fraction/Max fraction)
where the “exponent_base” is 2**(exponent_width - 1) - 1 (127 for an 8-bit exponent), “Max fraction” is 2**fraction_width
(8388608 for a 23-bit fraction), and Fraction/Max fraction is always a number less than one. Thus for 32
bit floating point an example would be:
0 10000001 10100000000000000000000
= +1 * 2** (129 - 127) * (1.0 + 5242880/8388608) = +1 * 4.0 * 1.625 = 6.5
There are also “denormal numbers”, which are numbers smaller than can be represented with this structure. The tag
for a denormal number is that the exponent is “0”. This forces you to invoke another formula:
0 00000000 10000000000000000000000
= +1 * 2** -126 * (4194304/8388608) = +1 * 2**-126 * 2**-1 = 2**-127
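Both formulas can be checked against a host machine's own single-precision format. The Python sketch below assumes the standard 32-bit layout (1 sign bit, 8 exponent bits, 23 fraction bits); decode_fp32 and fields_of are hypothetical helper names, and the all-ones exponent (infinity and NAN) is deliberately not handled.

```python
import struct


def decode_fp32(sign, exponent, fraction):
    """Value of a 32-bit float from its three fields (exponent 255 not handled)."""
    max_fraction = 1 << 23                    # 2**fraction_width
    s = -1.0 if sign else 1.0
    if exponent == 0:                         # denormal: no implied 1.0
        return s * 2.0 ** -126 * (fraction / max_fraction)
    return s * 2.0 ** (exponent - 127) * (1.0 + fraction / max_fraction)


def fields_of(x):
    """Split a Python float, stored as IEEE single precision, into its fields."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    return raw >> 31, (raw >> 23) & 0xFF, raw & 0x7FFFFF


print(decode_fp32(0, 0b10000001, 0b101 << 20))   # exponent 129, fraction 0.625
print(decode_fp32(0, 0, 1 << 22) == 2.0 ** -127) # denormal with fraction 0.5
print(fields_of(6.5))
```

Running the fields of a known value back through decode_fp32 is a quick way to confirm that the exponent base and fraction scaling are right.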
Next are the “constants” that exist in the floating-point context:
0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0 (which = 0)
0 11111111 00000000000000000000000 = positive infinity
1 11111111 00000000000000000000000 = negative infinity
If you get a number with an infinite (all “1”s) exponent and anything other than an all zero fraction, then it is said to
be a NAN, or “Not A Number”. NANs come in two types, signaling and non-signaling. For the purposes of these packages I
chose a fraction with an MSB of “1” to be a signaling NAN and anything else to be a quiet NAN.
Thus you wind up with the following classes (or states) that each floating-point number can fall into:
- nan           Signaling NaN
- quiet_nan     Quiet NaN
- neg_inf       Negative infinity
- neg           Negative normalized nonzero
- neg_denormal  Negative denormalized
- neg_zero      -0
- zero          +0
- denormal      Positive denormalized
- normal        Positive normalized nonzero
- infinity      Positive infinity
In the packages I use these states to both examine and create numbers needed for floating point operations. This defines
the type “valid_fpstate” . The constants zerofp, nanfp, qnanfp, pos_inffp, neginf_fp, neg_zerofp are also defined.
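The classification can be sketched in Python for the 32-bit case. This is an illustrative model only: classify_fp32 is a hypothetical name, the returned strings are the state names listed above, and the NAN split follows the convention chosen in this paper (fraction MSB of “1” means signaling).

```python
def classify_fp32(sign, exponent, fraction):
    """Return the state name for a 32-bit float given its three fields."""
    if exponent == 0xFF:                       # all-ones exponent
        if fraction == 0:
            return "neg_inf" if sign else "infinity"
        # Convention stated in this paper: fraction MSB set means signaling.
        return "nan" if (fraction >> 22) & 1 else "quiet_nan"
    if exponent == 0:                          # zero or denormal
        if fraction == 0:
            return "neg_zero" if sign else "zero"
        return "neg_denormal" if sign else "denormal"
    return "neg" if sign else "normal"


print(classify_fp32(0, 129, 0b101 << 20))   # 6.5
print(classify_fp32(1, 0, 0))               # -0
print(classify_fp32(0, 0xFF, 1 << 22))      # NAN with fraction MSB set
```

Only the exponent and fraction need to be inspected to pick the class; the sign bit just selects the positive or negative variant.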
Rounding comes in 4 different flavors:
- Round nearest
- Round positive infinity
- Round negative infinity
- Round zero
“Round nearest” has the extra caveat that if the remainder is exactly ½ then you need to round so that the LSB of the number
you will get is a zero. The implementation of this feature requires two compare operations, but they can be consolidated.
Round negative infinity rounds down, and round positive infinity always rounds up. Round zero is a mix of the two, and has
the effect of doing a truncation (no rounding).
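The four rounding flavors can be compared side by side with a small Python model. This sketch assumes a sign-magnitude value (as in a floating-point fraction) whose low "drop" bits are being discarded; the function name and the mode strings are illustrative assumptions, not the package's identifiers.

```python
def round_magnitude(negative, magnitude, drop, mode):
    """Drop the low 'drop' bits of a sign-magnitude value under a rounding mode."""
    result = magnitude >> drop
    remainder = magnitude & ((1 << drop) - 1)
    if remainder == 0:
        return result                          # exact, every mode agrees
    half = 1 << (drop - 1)                     # remainder value of exactly 1/2
    if mode == "round_nearest":
        # Ties go to the result whose LSB is zero.
        up = remainder > half or (remainder == half and (result & 1) == 1)
    elif mode == "round_positive_infinity":
        up = not negative                      # grow magnitude only if positive
    elif mode == "round_negative_infinity":
        up = negative                          # grow magnitude only if negative
    elif mode == "round_zero":
        up = False                             # plain truncation
    else:
        raise ValueError("unknown rounding mode")
    return result + 1 if up else result


for mode in ("round_nearest", "round_positive_infinity",
             "round_negative_infinity", "round_zero"):
    print(mode, round_magnitude(False, 0b1001, 2, mode),
          round_magnitude(True, 0b1001, 2, mode))
```

Because the value is sign-magnitude, "rounding up" a negative number means truncating its magnitude, which is why the two infinity modes depend on the sign.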
The floating point packages:
The new floating-point packages take advantage of a new feature in VHDL-2005 called package generics. The 32 bit
floating point package looks like the following:
package fphdl32_pkg is new IEEE.fphdl_pkg
  generic map (
    fp_fraction_width => 23,            -- 23 bits of fraction
    fp_exponent_width => 8,             -- exponent 8 bits
    fp_round_style    => round_nearest, -- round nearest algorithm
    fp_denormalize    => true,          -- Turn on Denormalized numbers
    fp_check_error    => true,          -- Turn on NAN and overflow processing
    fp_guard_bits     => 3);            -- number of guard bits
Package generics allow you to specify any data width or size of floating point number you like.
The resulting data type will be called “fp”. Thus you have the following use model:
signal a, b, c : fp;
signal x : unsigned (5 downto 0);
constant PI : real := 3.14;
begin
b <= to_fp (x);
c <= a + PI;
The actual floating-point type is defined as follows:
type fp is array (fp_exponent_width downto -fp_fraction_width) of STD_LOGIC;
Once again we are using the negative index trick to separate the fraction part of the floating-point number from the exponent.
The top bit is the sign bit ('high), the next bits are the exponent ('high-1 downto 0), and the negative bits are the fraction (-1
downto 'low). For a 32-bit representation that specification makes the number look as follows:
0  00000000  00000000000000000000000
8  7      0  -1                  -23
+/- exp.     fraction
where the sign is bit 8, the exponent is contained in bits 7 downto 0 (8 bits) with bit 7 being the MSB, and the mantissa is contained
in bits -1 downto -23 (32 - 8 - 1 = 23 bits) where bit -1 is the MSB.
The negative index format turns out to be a very natural format for the floating-point number, as the fraction is always
assumed to be a number between 1.0 and 2.0 (unless we are denormalized). Thus the implied “1.0” can be assumed on the
positive side of the index, and the negative side represents the fraction less than one.
Valid values for fp_exponent_width and fp_fraction_width are 3 and up. Thus the smallest (width wise) number
that can be made is fp (3 downto -3), a 7-bit floating-point number.
A generic called "fp_denormalize" is also provided for all operations. This parameter allows you to disable the
creation of denormalized numbers. In normal (aka poor man's) floating point, the number closest to "0" consists of an
exponent of "1" and a mantissa of "0" (2**-126 in the 32 bit case). Denormal numbers allow for numbers smaller than this
by assuming that if the exponent is "0" then the mantissa represents a fraction less than 1. This adds a great deal of overhead
to the floating point operations, and was thus left as an option that defaults to "true" in the IEEE 32 and 64 bit implementations
but can be shut off.
Setting “fp_check_error” to false turns off overflow and NAN processing. As every number must go through this check for every
operation according to IEEE-754, turning it off represents a significant hardware savings.
“fp_guard_bits” are bits that are added to the end of every operation to maintain precision. Most implementations of
floating point use 3 bits. Any number of bits (including 0) is valid. Note that setting the number of guard bits to 0 is similar
to turning off rounding with the “round_zero” round_type.
Defined operations for floating point numbers are:
Unary -, abs, “+”, “-”, “*”, “/”, “rem”, “mod”, “=”, “/=”, “<”, “>”, “<=”, “>=”
All of these operations are overloaded for “integer” and “real” types. The non floating-point type is first converted into
floating point and the operation is performed. If the number is out of bounds for that number then the appropriate infinity or
zero is returned. Errors from these routines are treated as described in IEEE-754.
Defined functions for floating-point numbers are dividbyp2 (divide by a power of 2), reciprocal (1/x), maximum,
minimum, to_unsigned, to_signed, to_ufixed, to_sfixed, to_real, to_integer, To_fp(SIGNED), To_fp(UNSIGNED),
To_fp(ufixed), To_fp(sfixed), To_fp(integer), To_fp(real), to_01. These functions operate silently; that is to say, they give
no warnings for overflow or underflow. Errors in the to_fp routines are signaled by returning either infinity or NAN. Errors from
the routines that read FP numbers are returned the same way.
Functions recommended by IEEE-854:
Copysign (x, y) – Returns x with the sign of y.
Scalb (y, N) – Returns y*(2**n) (where N is an integer or SIGNED) without computing 2**n.
Logb (x) – Returns the unbiased exponent of x
Nextafter(x,y) – Returns the next representable number after x in the direction of y.
Finite(x) – Boolean, true if X is not positive or negative infinity
Isnan(x) – Boolean, true if X is a NAN or quiet NAN.
Unordered(x, y) – Boolean, returns true if either X or Y is some type of NAN.
Class(x) – valid_fpstate, returns the type of floating point number (see valid_fpstate definition above)
Two extra functions named break_number and normalize are also provided. “break_number” takes a floating-point
number and returns a “SIGNED” exponent (biased by –1) and an “ufixed” fixed point number. “normalize” takes a SIGNED
exponent and a fixed-point number and returns a floating-point number. These functions are useful for times when you want
to operate on the fraction of a floating-point number without having to do the shifts on every operation.
To_slv (aliased to to_std_logic_vector and to_StdLogicVector) as well as to_fp(std_logic_vector) are used to
convert between std_logic_vector and fp types. These should be used on the interface of your designs. The result of “to_slv”
is a std_logic_vector with the length of the input fp type.
Procedures for reading and writing floating-point numbers are also included in this package. Procedures read,
write, oread, owrite (octal), bread, bwrite (binary), hread and hwrite (hex) are defined. To_string, to_ostring, and to_hstring
are also provided for string results. Floating-point numbers are written in the format “0:000:000” (for a 7-bit FP). They can
be read as a simple string of bits, or with a “.” or “:” separator.
Changing from one floating point format to another can be done through the “resize” function provided. Example:
use ieee.fphdl32_pkg.all;
architecture RTL of XXX is
alias fp32 is ieee.fphdl32_pkg.fp; -- or just "fp"
alias fp64 is ieee.fphdl64_pkg.fp;
signal x : fp32;
signal y : fp64;
begin
x <= ieee.fphdl64_pkg.resize (arg => y, exponent_width => fp_exponent_width,
fraction_width => fp_fraction_width, denormalize => fp_denormalize,
round_style => fp_round_style);
Challenges for Synthesis vendors:
Now that we are bringing numbers that are less than 1.0 into the realm of synthesis, the type “REAL” becomes
meaningful. To_fp (MATH_PI) will now evaluate to a string of bits. This means that synthesis vendors will now have to not
only understand the “real” type, but the functions in the “math_real” IEEE package as well.
Both of these packages depend on a negative index. Basically, everything that is at an index that is less than zero is
assumed to be to the right of the decimal point. By doing this we were able to avoid using record types. This also represents
a challenge for some synthesis vendors, but it makes these functions portable to Verilog.
References
1. Floating point for VHDL and Verilog - David Bishop, Eastman Kodak - DVCon 2003.
2. IEEE Std 754-1985 - IEEE Standard for Binary Floating-Point Arithmetic.
3. IEEE Std 854-1987 - IEEE Standard for Radix-Independent Floating-Point Arithmetic.
4. Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic - Prof. W. Kahan, University of California.
5. “What Every Computer Scientist Should Know About Floating-Point Arithmetic” - David Goldberg.
6. Floating point types for Synthesis - Dr. Alex Zamfirescu.
7. RSVP based bandwidth allocation - Ananda Rangan and Vignesh Nandakumar, Washington University in St. Louis.
8. http://babbage.cs.qc.edu/courses/cs341/IEEE-754.html - IEEE-754 Floating-Point Conversion.
9. http://www.markworld.com/showfloat.html - Decompose IEEE Floating Point Number.
10. http://www.ecs.umass.edu/ece/koren/arith/simulator/FPAdd/ - Floating-point addition and subtraction.
11. IEEE 1076.3 - VHDL Standard Synthesis packages.
12. Cadence's Verilog-XL Reference Manual.