HIGH SPEED CONVOLUTION USING RESIDUE NUMBER SYSTEMS

by

KURT ANTHONY LOCHER

Submitted to the
DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
in partial fulfillment of the requirements for the degrees of

BACHELOR OF SCIENCE IN ELECTRICAL ENGINEERING
and
MASTER OF SCIENCE IN ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
January 1989

© Kurt A. Locher 1989. All rights reserved

The author hereby grants to MIT permission to reproduce and to
distribute copies of this thesis document in whole or in part.

Signature of Author _____________________________________________
                    Department of Electrical Engineering and Computer Science
                    January 21, 1989

Certified by ____________________________________________________
             Bruce R. Musicus
             Thesis Supervisor (Academic)

Certified by ____________________________________________________
             Paul Hogan
             Thesis Supervisor (Raytheon Corporation)

Accepted by _____________________________________________________
            Arthur C. Smith
            Chairman, Department Committee on Graduate Students
HIGH SPEED CONVOLUTION USING RESIDUE NUMBER SYSTEMS

by

KURT ANTHONY LOCHER

Submitted to the Department of Electrical Engineering and Computer Science
on January 23, 1988 in partial fulfillment of the requirements for the degrees of
Bachelor of Science in Electrical Engineering and
Master of Science in Electrical Engineering and Computer Science

Abstract

This thesis investigates architectures for high speed convolution using residue number system techniques. The focus is on VLSI implementations for signal processing. One of the largest disadvantages of the residue number system is the large computational overhead associated with conversion into and out of the residue representation. Although the focus is on the standard residue number system, the same concepts can be directly applied to quadratic residue designs as well. Two general classes of residue number system architectures are investigated, and detailed implementations are developed in each class. A detailed comparison between conventional binary implementations and several of the residue architectures gives two parameters, hardware size and speed, by which to judge the different implementations. Using the size and speed results of the detailed designs, a design aid was developed that prunes the design space, leaving a core group of designs with optimal speed/size combinations.

Thesis Supervisor: Bruce R. Musicus
Title: Assistant Professor of Electrical Engineering
Acknowledgements

First, I would like to thank Bruce Musicus for all of the great ideas that flowed during our discussions on residue architectures. I'm sure that I would not have gotten as far if not for his inspiration. I would also like to thank Frank Horrigan for introducing me to this subject and Paul Hogan for agreeing to supervise me on the Raytheon side. Finally, to Claire who put up with me through the entire process, I'm finally finished...
Table of Contents

List of Abbreviations
List of Figures
Chapter 1  Introduction
Chapter 2  FIR Filter Background
     Conventional Implementations
          Direct Form
          Transpose Form
     Systolic Data Flow Graph Architectures
     Bitwise Decomposition
Chapter 3  RNS Background
     Basic Residue Arithmetic Units
          Residue Adders
          Residue Multiply by 2 Block
          Residue Multipliers
          Residue General Function Units
     Problems with RNS
          Conversion into and out of residue representation
          Scaling and Magnitude Comparison
     Processing Complex Quantities (QRNS)
Chapter 4  Modular Efficient RNS FIR filter
     Residue FIR filter
          Brute Force
          Coefficient Decomposition
               Base 2 - Bitwise
               Balanced Ternary
               Offset Radix 4
               Balanced Quinary
               Subtap Summary
          Scaling and Summing
     New Algorithm with Fewer Buses
          The Algorithm
          The Hardware
               Balanced Ternary
               Balanced Quinary
               Modified Balanced Septary
               Subtap Summary
          Putting it all Together
     Binary to RNS Conversion Block
          Table Lookup Approach
          No Table Approach
     Residue to Binary Conversion
          Chinese Remainder Theorem
          Mixed Radix Conversion
               The Algorithm
               The Hardware
Chapter 5  Design Aid
     Moduli Selection Algorithm (Basic RNS)
     The Algorithm
     Discussion
Chapter 6  Standard Binary Arithmetic with Pipelining
     Development of architecture
          Filter Tap
               Binary Subtap
               Balanced Ternary
               Higher Radices
          Shift Add Reconstruction
     Hardware Required/Contrast with RNS architecture
     General Discussion
Chapter 7  Conclusion
Appendix 1  Dynamic Range for Optimum Moduli Sets
Appendix 2  Design Aid Code
References
List of Abbreviations

CRT     Chinese Remainder Theorem
FIR     finite impulse response (filter)
IIR     infinite impulse response (filter)
MSI     medium scale integration
MUX     multiplexor
RNS     residue number system
SSI     small scale integration
VLSI    very large scale integration
WSI     wafer scale integration
XNOR    exclusive nor (gate)
XOR     exclusive or (gate)
List of Figures

Figure 1   Direct Form FIR Filter (N=4)
Figure 2   Binary Tree of Adders
Figure 3   Transpose Form FIR Filter
Figure 4   Data Flow Graph for 4 point Convolution
Figure 5   Direct Form Systolic Sweep
Figure 6   Transpose Form Systolic Sweep
Figure 7   Multiply Accumulate Systolic Sweep
Figure 8   Bitwise FIR Filter
Figure 9   Subtap for Bitwise FIR Filter
Figure 10  ROM residue adder
Figure 11  b bit adder + ROM
Figure 12  Graphic Example of cases of residue sum
Figure 13  binary adder and conditional subtractor
Figure 14  final residue adder
Figure 15  preadd p to one of the inputs before adding
Figure 16  modulo accumulator
Figure 17  Multiply by 2 block
Figure 18  Binary shift+add multiplier
Figure 19  3x3 array multiplier
Figure 20  Enhanced Multiply by 2 Block (2x, 2x+p)
Figure 21  Residue Conditional Accumulator #1
Figure 22  Residue Conditional Accumulator #2
Figure 23  General Function of two variables (1 bit)
Figure 24  General Function of two variables (2 bit)
Figure 25  Block Diagram of an FIR Filter
Figure 26  Balanced Ternary Subtap
Figure 27  Balanced Quaternary Subtap (positive)
Figure 28  Balanced Quaternary Subtap (negative)
Figure 29  Balanced Quinary Subtap (positive)
Figure 30  Balanced Quinary Subtap (negative)
Figure 31  Size vs b -- Biased/Unbiased
Figure 32  Transistors vs b -- Biased/Unbiased
Figure 33  Computational Procedure #1
Figure 34  Computational Procedure #2
Figure 35  Horner's Algorithm #1
Figure 36  Horner's Algorithm #2
Figure 37  Latency 2, multiply by 3
Figure 38  General Block Diagram of New Algorithm
Figure 39  Balanced Ternary (New Algorithm) b+1 bit MUX
Figure 40  Balanced Ternary (New Algorithm)
Figure 41  Balanced Quinary (New Algorithm)
Figure 42  Balanced Septary (New Algorithm)
Figure 43  Size vs b -- New Algorithm
Figure 44  Transistors vs b -- New Algorithm
Figure 45  Normalization algorithm for New Algorithm
Figure 46  Block Diagram of a norm box
Figure 47  Binary to RNS, large table lookup
Figure 48  Top Level Block diagram of no lookup conversion
Figure 49  Top Level Block Diagram of mixed radix conversion
Figure 50  Residue Subtractor
Figure 51  6 bit Adder/Register Combination
Figure 52  6 bit Pipelined Subtap
Figure 53  Final Binary Stage
Figure 54  Input adder/register combination
Chapter 1

Introduction

Residue Number Systems (RNS) have been investigated by mathematicians as an alternate number representation since the 1800's, but the concept was not applied to digital computation until the early 1950's. A conclusive book on the subject, by Szabo and Tanaka, was published in 1967, and since then the basic ideas have not advanced significantly. Recently there has been a renewed interest in RNS, not because of new theory, but instead as a result of the advances in digital circuit technology. Current VLSI and WSI technologies are ideally suited to the regular structure of RNS designs, allowing high speed and circuit density, and also open the possibility for other, more advanced designs.

There are a few applications for which the RNS representation is especially suited; the most common is the evaluation of a long convolution sum or, equivalently, an FIR filter. The ability of RNS to add parallelism to the computation makes it attractive where the highest throughput is required. Unfortunately, there is a significant amount of overhead associated with RNS: the conversion into and out of the residue representation from standard binary adds both hardware and latency. As a result, the computation must be very large for the benefits of RNS to overcome the conversion overhead.

Convolution is used for both filtering and correlation in radar, sonar, and communications applications. Frequently, the operation is performed using analog techniques because of the speed limitations of digital circuitry. Maybe RNS can change this. This thesis will focus on the RNS implementation of the FIR filter.

A goal of this thesis is to create a design aid that searches the possible architectures for an optimum design for a given convolution problem. The user will be able to input the dynamic range of the filter coefficients and input and the length of the filter, and the design aid will return the optimum set of designs. The design aid itself is a simple exercise in programming a search algorithm. Most of the effort is aimed toward creating a sufficiently rich set of implementations that provide some variety in speed and hardware size.
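As a concrete illustration of the residue idea discussed above (this sketch is not from the thesis; the moduli below are arbitrary illustrative choices):

```python
# Sketch of a residue number system: forward conversion, channelwise
# arithmetic, and reverse conversion via the Chinese Remainder Theorem.
# The moduli here are an assumed example set, not one selected in the text.
from math import prod

MODULI = (7, 11, 13)                 # pairwise coprime; dynamic range M = 1001
M = prod(MODULI)

def to_rns(x):
    """Forward conversion: one small residue per modulus."""
    return tuple(x % m for m in MODULI)

def from_rns(residues):
    """Reverse conversion using the CRT reconstruction formula."""
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(Mi, -1, m) is the modular inverse
    return x % M

# Addition and multiplication act independently on each narrow channel,
# with no carries between channels -- the source of the RNS parallelism.
a, b = 123, 456
s = tuple((x + y) % m for x, y, m in zip(to_rns(a), to_rns(b), MODULI))
p = tuple((x * y) % m for x, y, m in zip(to_rns(a), to_rns(b), MODULI))
assert from_rns(s) == (a + b) % M
assert from_rns(p) == (a * b) % M
```

The conversions at either end are exactly the overhead described above: the channel operations are cheap, but `to_rns` and `from_rns` must amortize over a large computation.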
Chapter 2

FIR Filter Background

A finite impulse response (FIR) filter performs the convolution of an input signal with a fixed (finite length) discrete time system impulse response. Mathematically, the expression for a length N convolution is written as follows:

           N-1
    y[n] =  Σ  h[i] x[n-i]                              (1)
           i=0
This operation is commonly denoted shorthand by the expression y[n] = x[n] * h[n] where x[n] is the input sequence, h[n] is the system impulse response, and y[n] is the output sequence. This convolution becomes an extremely computation intensive operation for large N because N multiplies are needed to compute each output point.
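Equation (1) can be evaluated directly; the short sketch below (with made-up coefficients, not values from the text) makes the N multiplies per output point explicit:

```python
# Direct evaluation of equation (1): each output point costs N multiplies
# and N-1 adds.  The coefficients and signal are arbitrary illustrations.
def fir_direct(h, x):
    N = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for i in range(N):
            if n - i >= 0:            # treat the signal as zero before n = 0
                acc += h[i] * x[n - i]
        y.append(acc)
    return y

h = [1, 2, 3, 4]                      # length N = 4 impulse response
x = [1, 0, 0, 0, 5]
print(fir_direct(h, x))               # → [1, 2, 3, 4, 5]
```

An impulse at the input reproduces h[n] at the output, which is a quick sanity check on any implementation discussed later.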
Finite impulse response filters have a number of advantages over their close cousins, infinite impulse response (IIR) filters1. First, FIR filters can be realized nonrecursively, which guarantees a stable implementation. Although stability issues appear to be a filter design problem that could be solved on paper, a filter that is stable on paper can become unstable in a recursive implementation because of the limited dynamic range of finite register lengths and the coefficient truncation. Second, FIR filters can be designed to have linear phase; the frequency dispersion caused by the nonlinear phase of IIR filters can be harmful to the performance of radar and communications applications. Finally, roundoff noise caused by finite register lengths can be minimized with an FIR structure.
In order to obtain the good properties of an FIR filter, however, one gives up the degrees of freedom afforded by the recursive coefficients of an IIR filter (see prior footnote; the FIR filter is actually a constrained version of the IIR filter). To obtain a similar frequency magnitude response from an FIR filter, large values of N are needed relative to the order of a similar IIR filter.
As a result it has been proposed that FIR filters be implemented in the

1 The output of an IIR filter depends on past values of the output as well as past values of the input. The difference equation for an IIR filter is commonly written as follows:

               N-1               M-1
    y[n] =  -   Σ  a_i y[n-i]  +  Σ  b_i x[n-i]
               i=1               i=0

The recursive nature of this computation forces any implementation to be recursive and causes stability to be an issue.
frequency domain using an FFT algorithm to reduce the processing requirements. Unfortunately, several of the good properties of an FIR filter are not completely preserved if an FFT implementation is used.
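The footnote's IIR difference equation can be made concrete in a few lines; the coefficients below are illustrative only, chosen to show the stable geometric impulse response that the recursion produces:

```python
# Direct evaluation of the footnote's IIR difference equation: the output
# at time n depends on previous *outputs*, which is what forces any
# implementation to be recursive.  Coefficients are assumed examples.
def iir(a, b, x):
    """y[n] = -sum_{i=1}^{N-1} a[i]*y[n-i] + sum_{i=0}^{M-1} b[i]*x[n-i]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(1, len(a)):          # feedback terms (past outputs)
            if n - i >= 0:
                acc -= a[i] * y[n - i]
        for i in range(len(b)):             # feedforward terms (past inputs)
            if n - i >= 0:
                acc += b[i] * x[n - i]
        y.append(acc)
    return y

# A single feedback coefficient of magnitude 0.5 gives the stable impulse
# response 1, 0.5, 0.25, ...; a magnitude greater than 1 would diverge.
print(iir(a=[1.0, -0.5], b=[1.0], x=[1, 0, 0, 0]))   # → [1.0, 0.5, 0.25, 0.125]
```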
Conventional Implementations

If an FIR filter is going to be implemented in the time domain, the convolution sum has to be computed. The computation requires at least N multiplies and N-1 adds to calculate each output point. In the following discussion I will assume that a sufficient number of multipliers and adders are available to perform the entire computation each time cycle. FIR filters could be implemented to use fewer arithmetic units by time multiplexing1; however, these designs follow directly from the complete designs and would only add unnecessary complication to the discussion.
Because there are a large number of ways to compute a convolution sum, I will discuss the two most common general forms in some detail and then give an introduction to the systolic architecture design methodology to characterize the other possibilities.
Direct Form

The textbook FIR filter architecture is the direct form design that results from bluntly translating equation (1) into hardware. A chain of delay registers forms a length N-1 first-in-first-out (FIFO) shift register that stores the previous N-1 (delayed) values of the input. These N-1 delayed values of the input, along with the current input value, are multiplied by the appropriate weights and summed to form the result. The total design contains N multipliers and N-1 two input adders, which are the minimum quantities necessary to compute the result without using time multiplexing. A length 4 (N=4) direct form filter is shown in figure 1.
1 A time multiplexed design uses the same arithmetic unit(s) more than once per time
period with different data. For example, an FIR filter of arbitrary length N could be
designed using only one multiplier and one adder; each time period would then be, at
minimum, greater than N multiply times. Because we are aiming for the highest
throughput possible, it is safe to assume that time multiplexing is not a viable option.
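The direct form structure just described can be sketched in software as an explicit FIFO plus a bank of multiplies; this is an illustrative behavioral model, not the thesis's circuit:

```python
# Hardware-style view of the direct form: an N-1 deep FIFO of past inputs,
# N multipliers, and a summation, clocked once per output point.
from collections import deque

class DirectFormFIR:
    def __init__(self, h):
        self.h = list(h)
        # N-1 delay registers, initialized to zero.
        self.delay = deque([0] * (len(h) - 1), maxlen=len(h) - 1)

    def clock(self, x_in):
        taps = [x_in] + list(self.delay)       # current + N-1 delayed values
        y = sum(c * v for c, v in zip(self.h, taps))
        self.delay.appendleft(x_in)            # shift the FIFO by one
        return y

f = DirectFormFIR([1, 2, 3, 4])
print([f.clock(v) for v in [1, 0, 0, 0, 5]])   # → [1, 2, 3, 4, 5]
```

The `deque` with a fixed `maxlen` plays the role of the shift register: each `appendleft` pushes the newest sample in and silently drops the oldest.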
Figure 1    Direct Form FIR Filter

The figure shows N-1
adders in a linear chain for clarity;
design would probably use a binary tree of adders (figure 2).
advantage of a tree structure is its reduced latency.
an actual
The obvious
The latency of a linear
chain is N-1 adder delays; for a tree structure the latency is ⌈log2 N⌉ adder delays, where ⌈x⌉ denotes the smallest integer greater than or equal to x.
The less apparent
advantage is that a tree structure is more easily pipelined.
Pipeline registers
can be placed on each level of the tree (where the dashed line crosses the data
paths in the figure).
If a register is placed anywhere between adders in the
linear chain, an additional register is needed in all successive adders to delay
multiplier results until the correct partial sum arrives.

Figure 2    Tree Adder
The tree adder appears to be an obvious choice; however, as with all bonuses there is an equal and opposite penalty attached. The tree adder removes the regular structure that is shown in figure 1. To add an additional stage to the implementation in figure 1, an additional stage is merely attached to the end. To extend a design that uses a tree adder, the entire adder structure must be modified; in this case, a new level must be affixed to the tree.
As discussed in the section on systolic architectures, the direct form implementation is not necessarily the optimal choice for all designs; however, it is interesting to note that LSI Logic chose the transpose form with pipelining to implement their new FIR filter chip.
Transpose Form

The second most common FIR filter design is the transpose form architecture shown in figure 3. At first it is not at all obvious that the two architectures perform the same algorithm, but an easy way to show this is to assume the architecture is correct as drawn and to reverse engineer the processing equation out of it. Letting the partial sum formed by the ith processing element be denoted p_i[n-1], we have the following equation for each processing element:

    p_i[n-1] = p_{i-1}[n-2] + h[N-i] * x[n]             (2)

Starting at the first processing element and proceeding inductively, we have

    p_1[n-1] = h[N-1] * x[n]
    p_2[n-1] = p_1[n-2] + h[N-2] * x[n] = h[N-1] * x[n-1] + h[N-2] * x[n]

               i-1
    p_i[n-1] =  Σ  h[N-i+j] * x[n-j]
               j=0

Finally, setting y[n-1] = p_N[n-1] yields the convolution expression in equation (1).

Overall, the basic transpose form uses more hardware than the direct form because the delay registers must be large enough to contain the sums of the scaled versions of the input. Assuming that the coefficients are represented by the same number of bits as is the input, the registers will be twice as wide. If pipeline registers are added to the adder tree of a direct form implementation to achieve a similar throughput, the hardware requirements become very similar. The different connection scheme does, however,
significantly change the properties of the design. The largest difference is that all additions are localized so that no "global" N input addition is needed.
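The induction above can be checked numerically; the sketch below simulates the delay registers of the transpose form under equation (2) and compares the outputs against the direct convolution of equation (1) (coefficients and input are arbitrary test values):

```python
# Numerical check of the transpose-form recursion
#     p_i[n-1] = p_{i-1}[n-2] + h[N-i] * x[n]
# against the direct convolution of equation (1).
N = 4
h = [3, 1, 4, 1]
x = [2, 7, 1, 8, 2, 8]

p = [0] * (N + 1)              # p[0] is a permanent zero input
out = []
for xn in x:
    # Update stages in reverse order so each stage reads its neighbor's
    # *old* value, mimicking the delay registers between adders.
    for i in range(N, 0, -1):
        p[i] = p[i - 1] + h[N - i] * xn
    out.append(p[N])           # the output appears at the last stage

# Direct form of equation (1) for comparison (zero-padded input).
direct = [sum(h[i] * x[n - i] for i in range(N) if n - i >= 0)
          for n in range(len(x))]
assert out == direct
```

The register update order is the whole trick: updating p[N] first is what makes each stage see the value its predecessor held one clock earlier.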
Figure 3    Transpose Form FIR Filter

One result of the transpose structure is that the throughput is simply one multiply plus one add delay, and additional stages can be attached to the end of the linear structure without increasing that delay. The direct form can achieve this throughput also, but only with the addition of pipeline registers to the adder structure. As always there is a cost for these benefits; in this case it is that the input must be broadcast to each tap of the filter. However, as discussed later, this cost may not be too severe.
Systolic Data Flow Graph Architectures

Because FIR filters exhibit a high degree of regularity in the computations required, a good general framework within which to investigate FIR filter designs is that of systolic architectures. The systolic methodology provides a general technique for mapping a regular computational problem into an architecture. In addition, the systolic concepts provide a structure against which to evaluate other possible designs.
Systolic architectures became very popular in the late 1970's and early 1980's because of work done at Carnegie-Mellon by H.T. Kung. The basic philosophy of a systolic design is a rhythmic flow of data through a series of
processing elements (PE's). A good analogy for a systolic architecture is a production line where partial results move regularly from one worker to another, each of whom adds something to the final result.1 The goal is to achieve 100% efficiency (each PE occupied), a simple data flow between PE's, a simple control structure, and a modularly expandable design. All of these characteristics minimize the design time and the control requirements. For wafer scale integration2 these properties become fundamental: the design must consist of regular, simply interconnected PE's with very little control (preferably no control at all) to maximize the usable circuitry.

In the early days of systolic architectures there was a drive to push more and more complex problems into the systolic framework; however, it is now generally recognized that systolic concepts are better applied to simple, primitive computational problems exhibiting a regular computational structure, such as those common in signal processing. More complex problems, requiring more general dataflow, are better mapped to a wavefront architecture3 or some other selectively interconnected processing system model. Because there are a number of simple computational problems that are ideal for systolic implementation, it is important to examine them within the systolic framework. One of these is the convolution sum.
Because there are an infinite number of systolic architectures that could be used to solve a particular problem, a data flow graph can be used to visualize the characteristics of the different possible designs. The idea behind a data flow graph is to list all of the primitive operations of the larger computation in a geometrically regular grid. As an example, the data flow graph for a four point convolution is shown in figure 4.
All of the computation for a single output point is listed on one line; the computation for the next (chronological) output point is listed on the next (consecutive) line.
1 Analogy due to H.T. Kung, one of the fathers of systolic architectures
2 Wafer Scale Integration (WSI) is a fabrication technique that uses an entire silicon wafer
to provide ultrahigh circuit densities. Because the yields for wafer size designs would be
unacceptably low, extra processing elements are included, and the top level of
metalization is configured to allow selective interconnection of processing elements.
3 Kung, S.Y. VLSI Array Processors, IEEE ASSP Magazine, July 1985, Vol. 2, No. 3, pp 422
y[3] = Σ:  h[0]x[3]  h[1]x[2]  h[2]x[1]  h[3]x[0]
y[4] = Σ:  h[0]x[4]  h[1]x[3]  h[2]x[2]  h[3]x[1]
y[5] = Σ:  h[0]x[5]  h[1]x[4]  h[2]x[3]  h[3]x[2]
y[6] = Σ:  h[0]x[6]  h[1]x[5]  h[2]x[4]  h[3]x[3]
y[7] = Σ:  h[0]x[7]  h[1]x[6]  h[2]x[5]  h[3]x[4]
y[8] = Σ:  h[0]x[8]  h[1]x[7]  h[2]x[6]  h[3]x[5]

Figure 4    Data Flow Graph

Once the primitive operations have been laid out in an acceptable manner, the available processors can be assigned the operations required to perform at each time step. Every different processor arrangement and sweep pattern defines a different architecture; a design includes the initial processor placement on the data flow graph and the pattern in which the processors are swept across it. If the processors map to computations on the data flow graph in a linear pattern and are swept across the data flow graph in a regular pattern, the resulting architecture will have a simple data flow between processors and a simple timing of the primitive operations. Some of these obviously may be more desirable than others. An example of a processor arrangement and sweep is shown in figure 5.

Figure 5    Direct Form Processor Arrangement and Sweep

All of the processing for a single output point is performed in a single time period. Each of the PE's, multipliers in this case, computes one of the four multiplies; an additional adder is needed to sum up the results of the four PE's.
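The sweep of figure 5 can be read directly off the data flow graph; the snippet below (an illustration for N = 4, not from the text) enumerates the product each PE computes as the row of processors moves down one line per time step:

```python
# Reading the direct-form sweep off the data flow graph: at time n the
# horizontal row of PEs computes the products for output y[n], with PE k
# holding coefficient h[k] and operating on the delayed input x[n-k].
N = 4

def row(n):
    """Products on line n of the data flow graph, one per PE."""
    return [f"h[{k}]x[{n-k}]" for k in range(N)]

for n in range(3, 6):
    print(f"y[{n}] =", " + ".join(row(n)))
# Each step down the graph keeps h[k] fixed in its PE while the x
# operands shift one position to the right, as described above.
```

Running this reproduces the first lines of the figure 4 grid, e.g. y[3] = h[0]x[3] + h[1]x[2] + h[2]x[1] + h[3]x[0].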
Moving the row of processors down the data flow graph by one line (one time step) causes the coefficient, h[i], in each processor to remain the same and causes the delayed versions of the input to shift right one PE as a new input point enters PE #1. An architecture that has these properties has already been described as the direct form implementation.
Figure 6    Transpose Form Processor Arrangement and Sweep

Another example processor arrangement and sweep is shown in figure 6. Because the processors are arranged on a diagonal and the sweep direction is downward, all of the processors operate on the same input point at the same step of the computation. Also, the coefficients remain attached to their respective processors from time step to time step. Focusing on the computation for a particular output point, PE #1 does the first multiply at time i; PE #2 does the second multiply at time i+1; PE #3 does the third multiply at time i+2; and PE #4 does the final multiply at time i+3.1
If PE #1 passes the result of the first multiply to PE #2 at time i+1, the two products can be summed to form a partial result that is passed to PE #3 at time i+2. In this way the four PE's, which would be multiply-adders, operate at one particular time on four different output points, with results exiting PE #4. This architecture is the same as the transpose form that was described before and shown in figure 3.
At this point it is possible to characterize the different possible architectures directly by examining the data flow graph rather than by trial and error. The placement of the processors on the data flow graph determines how the delayed versions of the input must be made available to the processing elements. Consider two examples. A horizontal processor arrangement will be seen to require the previous N delayed

1 At this point the reader should be thankful that a longer FIR filter is not being used as an example
versions of the input to be made available, with the current input always used by PE #1 and the oldest version of the input always used by PE #4. A diagonal processor arrangement with slope = -1 will result in the input being broadcast to all processing elements, PE #1 through PE #4. Other arrangements will have different requirements on the input. For example, a diagonal processor arrangement with slope = 1 would require 2N-1 = 7 delayed versions of the input to be available, with every other one used at any particular time.
While the processors are swept through the data flow graph, the sweep direction determines the requirements on the coefficients. Both the direct form and the transpose form had a downward sweep direction, and in both cases the coefficients remained attached to their respective processing elements. Other sweeps are possible, however. For example, if the processor layout remained the same but the sweep direction were changed, the coefficients would circularly shift through the different processing elements on each time step. The processor layout and the sweep direction therefore determine the input requirements and the coefficient requirements, respectively, and together the two specify a particular design. The easiest way to analyze a design seems to be focusing on a single output point and examining how the primitive computations for this point are performed. As a final example of a systolic architecture, a design in which the coefficients are shifted through the processing elements will be examined to show the general use of the data flow graph.
Figure 7    Multiply/Accumulate Processor Arrangement and Sweep

A good exercise is to analyze the processor arrangement and sweep direction shown in figure 7. First, the processor layout, diagonal with slope = -1, indicates that the current input will be broadcast to all processing elements, and the sweep direction dictates that the coefficients will shift through the processing elements. Now, focusing on a single output point, it
becomes apparent that all computation for this output point will be performed by the same processing element, in this case a multiply/accumulator, over four consecutive time periods. Each processing element will accumulate partial sums and consecutively produce an output point (as the final product is accumulated). An enhanced version of the multiply/accumulate architecture is used in the Zoran Digital Filter Processor family.
The previous example architecture is interesting because of the way the coefficients shift through the filter each time step. Because the implementation that was described shifts all of the filter coefficients by one position each time cycle, an adaptive filter that updates the coefficients every N time periods can be implemented with the architecture that was chosen. If the output is to be downsampled, the results of all but every Mth processing element can be ignored. In this case M-1 out of M processing elements do not even need to be built; these voided places need only a register to shift the coefficients by each time cycle. A final advantage is that point failures in an arithmetic unit affect only the output points that would be computed by this element (1 of N output points). But as always there are also disadvantages with this design.
control
First,
added
is
there
are also disadvantages
necessary
to
determine
processing element should output on a particular clock cycle.
would be solved by adding a tag bit to the coefficient path.
In general, this
A tag attached to
h[O] would indicate to a PE to output and clear its accumulator.
serious problem
to
A second more
would be routing the output from each PE to common output
pins for the VLSI chip.
used
which
selectively
Some form
drive
the
bus,
of tristate driver arrangement
but
loading
the
on
this
bus
could be
could
be
excessive.
Although the problems of the architecture could be solved, the advantages of the design apply only to a specific class of applications that require downsampling or adaptability. For the residue FIR filter, a simple architecture with maximum modularity and throughput is required, and the added complication and hardware is not warranted. Other more specialized applications should probably investigate this form of the architecture; I will therefore focus exclusively on the transpose form architecture described previously.
Bitwise Decomposition

As will become obvious later, it is advantageous to minimize the number of residue multipliers that are needed for the residue FIR filter. It is possible to build a binary fixed point filter that uses only adders and no multipliers. A similar design, discussed later, can be used for a residue FIR filter.
[Figure 8: Bitwise FIR Filter Architecture]

The idea behind the adder only FIR filter is recognizing that a bxb bit multiply consists of b recursive left shifts and b-1 conditional adds, as shown in equation (3):

    Let y = Σ_{i=0}^{b-1} 2^i·y_i; then x·y = x·Σ_{i=0}^{b-1} 2^i·y_i = Σ_{i=0}^{b-1} 2^i·(x·y_i)    (3)

If the results of several multiplies are being summed, then the shifts and adds do not need to be performed until after the final sum.
If b is the number of bits needed to represent the coefficients, the convolution equation can be broken into b mini convolution equations, the results of which are shifted and added:

    y[n] = Σ_i x[n-i]·h[i] = Σ_i x[n-i]·Σ_{j=0}^{b-1} h_j[i]·2^j = Σ_{j=0}^{b-1} 2^j·(Σ_i x[n-i]·h_j[i])
Because h_j[i] in the final equation is either 0 or 1, no general multiplies need to be performed in the mini-convolution chains. Using the transpose form for the mini-convolutions, the h_j[i]'s are employed to condition the adder in each processing element. If h_j[i] equals 1, the current value of the input is added to the previous result; if h_j[i] is 0, the previous result is passed on unchanged. A top level block diagram of the bitwise FIR filter is shown in figure 8; an individual processing element is shown in figure 9.
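The decomposition can be checked numerically. The following sketch is mine, not from the thesis; the function names are illustrative. It compares a direct convolution against the b multiplier-free mini-convolutions, shifted and added as in the equation above.

```python
def direct_fir(x, h):
    # y[n] = sum_i x[n-i]*h[i], full-length convolution
    N = len(h)
    return [sum(h[i] * x[n - i] for i in range(N) if 0 <= n - i < len(x))
            for n in range(len(x) + N - 1)]

def bitwise_fir(x, h, b):
    # Split each coefficient into b one-bit planes h_j[i], run b
    # adder-only mini-convolutions, then shift and add the results.
    N = len(h)
    y = [0] * (len(x) + N - 1)
    for j in range(b):                      # one channel per coefficient bit
        hj = [(h[i] >> j) & 1 for i in range(N)]
        for n in range(len(y)):
            acc = 0
            for i in range(N):
                if hj[i] and 0 <= n - i < len(x):
                    acc += x[n - i]         # conditional add, no multiply
            y[n] += acc << j                # weight the channel by 2^j
    return y

x = [3, 1, 4, 1, 5]
h = [7, 2, 6]                               # 3-bit coefficients -> b = 3
assert bitwise_fir(x, h, 3) == direct_fir(x, h)
```

The inner loop is exactly the conditioned adder of the processing element: h_j[i] = 1 adds the input, h_j[i] = 0 passes the partial sum unchanged.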
[Figure 9: Bitwise FIR Filter Processing Element]

The idea of breaking the FIR computation into parallel channels of accumulators can be extended to decompositions of the coefficients besides base 2. This concept is used in both the RNS designs and the conventional design that follow. A deeper discussion is included in the sections on the residue and conventional filter designs.
Chapter 3
RNS Background
Residue arithmetic, based on simple principles of number theory, is a possible alternative to conventional arithmetic for large integer operations. Starting with several relatively prime¹ numbers m1, m2, ..., mr as a moduli set, it is possible to represent an integer x by its residues or remainders to the members of the moduli set: x mod m1, x mod m2, ..., x mod mr. This new representation of x is unique for any integer x that is less than the product of the moduli in the moduli set (0 ≤ x ≤ M-1, where M = m1·m2···mr). The proof of this result is from the Chinese Remainder Theorem.
Residue arithmetic is most useful for the operations of addition, subtraction, and multiplication. A basic result of number theory is that these operations "commute" with the conversion into residue representation; for example, ((x mod mi) - (y mod mi)) mod mi = (x - y) mod mi, and similarly for addition and multiplication. Each operation is performed independently in the r moduli channels; there is no coupling between the channels. Because the moduli can typically be selected to be much smaller than the integers x and y, the operations can be executed in parallel more rapidly than if x and y were in their conventional representation.
It is because of these properties that residue arithmetic is appealing for FIR filters. Unfortunately, residue arithmetic also has a number of disadvantages. Because division is not a straightforward operation², it is not simple to round or truncate a number; it is equally difficult to compare the magnitude of two numbers in residue representation. A more fundamental result of the uncoupling of the digits in each moduli channel is that there is no longer any significance that can be attached to a particular digit position. In general, to perform either of these operations a number must be converted out of residue representation into a conventional integer representation and back. This leads to the final difficulty: conversion into and out of residue representation. The conversion into residue representation is usually done by a table lookup. The conversion out of residue representation uses a result of the Chinese Remainder Theorem and is not as simple.

1 Two numbers are relatively prime if they contain no common integer factors other than 1. For example, the numbers 10 and 21 are relatively prime; 10 and 14 are not.
2 The set of integers is not closed under division. For example, what integer equals 5 divided by 3?
All of these problems are a topic of current research, but for now let's assume that these difficulties can be overcome and examine possibilities for basic residue arithmetic units.

Basic Residue Arithmetic Units
In order to design FIR filters, it is necessary to implement some basic RNS arithmetic building blocks. One requirement that will be imposed on these units is that they be "programmable" for any modulus less than a certain size. The term programmable implies that the same hardware can be used for computations with different moduli, either by rewriting entries into a table or by asserting some constant(s) to one or more inputs. If there is a choice between a design that involves a table lookup and one that does not, all else equal, the latter would be preferred. In addition, there are some tricks that can be used with certain classes of moduli to optimize arithmetic computation; however, membership in these classes is fairly restrictive, and a design will not be practical if specialized arithmetic units must be designed for each modulus. The overall design goals for the arithmetic units are high throughput and minimum size.

Several residue units will be needed in order to implement a RNS FIR filter; although some of the techniques used in these designs will not be apparent until the actual filter design is begun, it is worthwhile to develop a set of primitive arithmetic units. First, the design of a programmable residue adder will be addressed. With the adder, the more complex multiply by 2 block and a general multiply block can be designed. Finally, although it involves a table lookup, a general function unit will be discussed briefly.
Adders
One of the earliest proposals for a residue adder was to use a conventional
ROM as a table lookup (figure
10).
For moduli that can be represented within a b bit binary channel (i.e. m ≤ 2^b), it is necessary to have a 2^2b x b ROM. For example, a 6 bit modulus (m ≤ 64) would require a 4Kx6 ROM. There was some early research into exploiting the symmetry inherent in the addition tables to reduce the size of the ROM. A first stab is realizing that the operation of addition is commutative; this reduces the size of the ROM by a factor of two. Unfortunately, as the design is optimized to reduce the size of the ROM, the external circuitry required to implement the design increases and the throughput decreases. In general, the large area and access time of these memories prohibits this approach for all but the smallest moduli¹. Although a table lookup would not be practical for a residue adder, it is important to note that any integer operation on two variables can be performed using a table lookup.
[Figure 10: Table Lookup Residue Adder]
Focusing on the addition problem, the size of the ROM in the previous design can be significantly reduced by using a standard b bit binary adder as shown in figure 11. The output of the adder is b+1 bits wide, including both the b bit result and a carry bit. Because this b+1 bit output may exceed the modulus, the ROM is necessary to correct the result to lie in the normalized range [0, m-1]. In this case the size of the ROM is 2^(b+1) x b. For a 6 bit modulus, the ROM would be 128x6 bits. Although this is significantly better than the previous design, decreasing the size of the ROM by a factor of 32, a closer examination of the ROM's contents shows that this design can also be improved.

1 Chiang C-L & Jonsson Lennart, "Residue Arithmetic and VLSI", 1983 IEEE Computer Design, pgs 80-83
[Figure 11: Residue Add using a Binary Adder with Correction]

Assuming the inputs to our residue adder are in normalized residue form¹, the output of the binary adder falls into three cases (see figure 12).
First, if the sum of the two numbers is greater than or equal to 0 and less than the modulus, the result is already in normalized residue form, and the ROM passes the result unchanged. Second, if the sum of the two numbers is greater than or equal to the modulus and less than 2^b (where b is the width of the binary adder), the b bit result exceeds its normalized representation by the value of the modulus, and the ROM subtracts the modulus from the output of the binary adder. Finally, if the sum of the two numbers is greater than or equal to 2^b (carry bit set), the b+1 bit result, including the carry bit as the b+1th bit, exceeds the normalized form by the value of the modulus, and the ROM subtracts the modulus from the b+1 bit result.

1 A residue is in normalized residue form if its magnitude is between 0 and the modulus. A residue is not in normalized residue form if its magnitude is greater than or equal to the modulus or less than 0.
[Figure 12: Residue Addition with a Binary Adder. Worked example with 2^b = 64, m = 43, and modulus bias µ = 64 - 43 = 21. Case 1: 15 + 10 = 25, already normalized. Case 2: 25 + 27 = 52; the ROM outputs 52 - 43 = 9 (equivalently, 52 + 21 = 73 ≡ 9 mod 64). Case 3: 25 + 41 = 66 = 2 + carry; the ROM outputs 66 - 43 = 23 (equivalently, 2 + 21 = 23).]
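The three cases can be replayed in software. The sketch below is a behavioral model I am adding, not the thesis hardware; it uses the figure's values, 2^b = 64 and m = 43, and also checks that adding the bias µ = 21 modulo 64 is equivalent to subtracting the modulus.

```python
def residue_add(x, y, m):
    # Behavioral model of the binary-adder-plus-correction residue adder.
    u = x + y                        # b+1 bit binary sum (the carry is bit b)
    return u if u < m else u - m     # cases 2 and 3: subtract m once

m, b = 43, 6
mu = (1 << b) - m                    # bias: 64 - 43 = 21
assert residue_add(15, 10, m) == 25  # case 1: sum already normalized
assert residue_add(25, 27, m) == 9   # case 2: 52 - 43 = 9
assert residue_add(25, 41, m) == 23  # case 3: 66 - 43 = 23
# subtracting m is the same as adding mu and reducing modulo 2^b:
assert (52 + mu) % (1 << b) == 9
assert (66 + mu) % (1 << b) == 23
```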
The ROM entries can be reduced to two operations: either the output of the binary adder is passed unchanged or the modulus is subtracted from it. In fact, the ROM can be eliminated entirely as shown in figure 13. The first binary adder performs as before, with an output that may or may not be normalized. A b+1 bit binary subtracter subtracts the modulus from the result of the first binary adder. This subtracter serves the dual purpose of providing the other possible final result and indicating (by its overflow bit) whether the output of the first adder is greater than m. If the overflow is set, the output of the binary adder was in the normalized range [0, m-1]; if no overflow is set, the output of the binary adder was greater than m. The overflow can be used to select between the outputs of the adder and the subtracter.
27
x
y
b
b
Input A
Input B
b bit Adder
Carry Out
Result
b
b+1
Input
A
Input B
b+1 bit Subtractor
Result (A-B)
Overflow
b
b
P
\
_ 2-1 MUXA
IB
A
b
<x + y>
Residue Adder without ROM
Figure 13
One final optimization results from the finite wordlength, modulo 2^b nature of the binary channel. Adding two normalized residues in a binary adder yields a result, u, in the range [0, 2(m-1)]. If m is representable in a b bit channel, then 2(m-1) ≤ 2^(b+1). When m is subtracted from u, a number v is obtained that is always less than m (consider only the case u - m ≥ 0) and is therefore representable in b bits. The number v can be obtained alternatively by adding µ = 2^b - m to the low order b bits of u and ignoring the carry; this is a result of the mod 2^b nature of the channel. The advantage of this approach is that a b bit binary adder can be used instead of a b+1 bit binary subtracter; one stage of carry propagation is saved. The final residue adder is shown in figure 14.
14.
y
x
<x
Figure
14
Final Residue
+y>m
Adder
Design
29
Because the carry is not input into the second binary adder, the carry out of the second adder will only be set if u is in the range [m, 2^b - 1]. If u is greater than or equal to 2^b (the carry out of the first adder is set), the carry out of the second adder will not be set. The logical OR of the two carries is used to select the multiplexer; if either carry is set, the subtracted version is chosen.
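This structure can be modeled bit-exactly in software. The sketch below is my own model, not the thesis netlist: two b bit additions, with the OR of the two carry outs selecting between the uncorrected and corrected sums, as in figure 14.

```python
def final_residue_add(x, y, m, b):
    mask = (1 << b) - 1
    mu = (1 << b) - m                  # adding mu mod 2^b == subtracting m
    s1 = x + y
    c1, r1 = s1 >> b, s1 & mask        # first adder: carry out + b-bit result
    s2 = r1 + mu
    c2, r2 = s2 >> b, s2 & mask        # second adder: corrected version
    return r2 if (c1 | c2) else r1     # either carry set -> corrected sum

# exhaustive check over one channel
b, m = 6, 43
for x in range(m):
    for y in range(m):
        assert final_residue_add(x, y, m, b) == (x + y) % m
```

The two-adder structure trades the b+1 bit subtracter of figure 13 for a second b bit adder, saving one stage of carry propagation.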
At this point it is useful to include a hardware summary¹ of the final version of the residue adder. Similar summaries will be generated for other final blocks to permit simple comparisons between more complex architecture sections. The basic components that will go into the summaries are 1 bit full adders, MUX's, and simple gates. For the final residue adder the summary is as follows:

    Part Type          Number    Sizing                 Transistors
    1 bit Full Adder   2b        19584b µm²             64b
    2-1 MUX            b         4896b + 1632 µm²       10b + 4
    OR gate            1         4896 µm²               6
    Totals             --        24480b + 6528 µm²      74b + 10

    Hardware Summary for Final Adder Architecture
Examining the operation of the final adder design, we can gain some helpful insight into performing modulo addition for RNS with binary arithmetic units. The basic result is that modulo addition is the same as binary addition unless the result exceeds the modulus, in which case the modulus, m, is subtracted from the binary sum or, equivalently, µ = 2^b - m is added to the binary sum. As a result of the previous discussion for the final modulo adder, we will focus on performing the correction, if necessary, by adding µ. Now, instead of possibly performing the correction later, preadd µ to one of the inputs and use a single binary adder as shown in figure 15.
Because xi is initially normalized, it falls in the range [0, m-1]; xi + µ will correspondingly be in the range [µ, 2^b - 1]. Because xi + µ can be represented entirely in a b bit binary channel, the carry out of the preadder can be ignored.

1 The space estimates were derived from an existing standard cell library. The transistor count numbers were derived from simple designs in CMOS and include both p and n type transistors. A more detailed discussion of these hardware estimates is included in the Appendix.
By the previous result, the output of the main binary adder will now either equal the correct modulo sum or exceed this sum by µ. The carry out of the binary adder provides a flag to indicate which case the answer is in. If x1 + x2 is greater than or equal to m, then x1 + x2 + µ will be greater than or equal to 2^b and the carry will be set. Since x1 + x2 ≥ m is the case that needed correction, the output of the adder, ignoring the carry, is the proper normalized modulo sum. If x1 + x2 is less than m, then x1 + x2 + µ will be less than 2^b and the carry will not be set; the output of the binary adder will exceed the correct normalized modulo sum by µ.
[Figure 15: Preadding µ to one of the Inputs]
At first it appears that preadding one of the inputs trades one problem for another very similar problem. Without preadding, the binary sum, ignoring the carry, can fall short of the correct modulo sum by µ; with preadding, the binary sum can exceed the correct modulo sum by µ. However, if a series of numbers xi is being accumulated and the preadd of µ to each can be performed at minimal expense, we can increase the performance over that of the general residue adder. It is always possible to guarantee that one of the inputs to the binary adder is a biased residue (one that exceeds its normalized value by µ) and that the other is a proper normalized residue. If the current partial sum is a biased residue, the carry out, 0, is used to select the normalized version of xi; if the current partial sum is normalized, the carry out, 1, is used to select the biased version of the input. The completed modulo accumulator, including a register, is shown in figure 16.

[Figure 16: Residue Accumulator]
A single addition in the modulo accumulator requires only one b bit adder delay, while an addition in the final modulo adder requires two b bit adder delays. On the surface it appears that the accumulator is an improved modulo adder somehow; however, it is important to realize that the accumulator solves a rather constrained form of the addition problem. Also, the one b bit adder delay assumes that the preadds can be performed with no overhead and is only an average delay value. An additional correction stage must be included at the output of the accumulator because the final sum may be in biased form. Nevertheless, even with all of these caveats, this configuration is very useful when several numbers are being accumulated (for example an FIR filter).
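A running accumulation with this bias toggling can be sketched in software. The model below is mine (not the thesis circuit): the partial sum alternates between normalized and biased (+µ) form, the carry out of each b bit add selects which version of the next input to use, and a final correction stage removes any leftover bias.

```python
def residue_accumulate(xs, m, b):
    mask = (1 << b) - 1
    mu = (1 << b) - m
    acc, biased = 0, False          # partial sum starts normalized at 0
    for x in xs:                    # each x assumed normalized (0 <= x < m)
        # keep exactly one mu of bias in every binary add: bias the input
        # when the partial sum is normalized, and vice versa
        s = acc + (x if biased else x + mu)
        carry = s >> b              # carry set -> 2^b (= m + mu) dropped out
        acc = s & mask
        biased = (carry == 0)       # no carry -> the bias is still present
    return acc - mu if biased else acc   # final correction stage

b, m = 6, 43
vals = [15, 40, 27, 0, 42]
assert residue_accumulate(vals, m, b) == sum(vals) % m
```

Each step costs one b bit add, which is the latency advantage over the two-adder final residue adder.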
Residue Multiply by 2 Block

The next arithmetic unit to examine is a modulo multiply by 2 block that takes a normalized residue input and generates a normalized residue output. This block is very useful when building the more complex general modulo multiplier block. Now, in the standard binary number system it is simple to multiply a number by two: nothing more than a straightforward left shift by one place. Unfortunately, residue computations are more complex, and this is no exception.
The obvious way to implement a modulo multiply by two block is to build upon what we already know by using a modulo adder with both inputs tied together. Looking at the final modulo adder in figure 14, the output of the first b bit adder will just be a left shifted version of the input. The left shift function, however, can be hardwired, making the first adder unnecessary. To eliminate the adder, the high order bit of the input is routed to "carry out," and the remaining b-1 bits are left shifted with a 0 inserted as the low order bit to form the b bit "result." The basic multiply by two block is shown in figure 17.
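A behavioral model of the hardwired-shift doubler (my sketch, not the thesis schematic) shows how the shifted-out high bit plays the role of the eliminated first adder's carry:

```python
def residue_double(x, m, b):
    mask = (1 << b) - 1
    mu = (1 << b) - m
    c1 = (x >> (b - 1)) & 1        # high bit of x routed to "carry out"
    r1 = (x << 1) & mask           # remaining bits left shifted, 0 into LSB
    s = r1 + mu                    # correction adder, as in the final adder
    c2, r2 = s >> b, s & mask
    return r2 if (c1 | c2) else r1 # either carry set -> corrected version

# exhaustive check over one channel
b, m = 6, 43
for x in range(m):
    assert residue_double(x, m, b) == (2 * x) % m
```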
[Figure 17: Residue Multiply by 2 Block]

Residue Multipliers
The final residue arithmetic block to be added to our toolbox is a general modulo multiplier. A multiplier block is considerably more complex than the adder block or the multiply by two block. Because several modulo multipliers are needed in the RNS to binary converter, the overall system design goals of latency, throughput, and hardware real estate apply to the multipliers. Hopefully, the multipliers can be designed in such a way as to prevent them from being the system bottleneck. At this point it is instructive to investigate the design of standard binary multipliers¹ before tackling the more complicated modulo multiplier problem. Binary multiplier designs can be divided into two classes: shift and add multipliers and array multipliers. Shift and add designs are by their nature clocked and tend to be slower overall; array designs, which are not necessarily clocked (although pipeline registers could be inserted into carry chains), are faster because carries are propagated more efficiently.
[Figure 18: Shift and Add Multiplication]

1 Material in this section was obtained from Rabiner and Gold, Theory and Application of Digital Signal Processing, pgs 514-540. See this reference for a more exhaustive discussion of binary multiplier design. Other ideas can be obtained by using the systolic design techniques discussed earlier.
A shift and add multiplier forms its product exactly as its name implies, by accumulating the following sum:

    Let y = Σ_{i=0}^{b-1} 2^i·y_i; then x·y = x·Σ_{i=0}^{b-1} 2^i·y_i = Σ_{i=0}^{b-1} (2^i·x)·y_i

Shifted values of the multiplicand x are accumulated conditioned on the appropriate bits of the multiplier y. An unwrapped, nonrecursive version of the shift and add multiplier is shown in figure 18.¹
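The sum above is short enough to write out directly. This illustrative sketch mirrors the accumulate-shifted-multiplicand structure, with bit y_i conditioning each add:

```python
def shift_add_multiply(x, y, b):
    acc = 0
    for i in range(b):             # one stage per multiplier bit
        if (y >> i) & 1:           # bit y_i conditions the add
            acc += x << i          # accumulate the shifted multiplicand 2^i * x
    return acc

assert shift_add_multiply(13, 11, 4) == 13 * 11
```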
The major disadvantage of a shift and add multiplier is that the carry bits do not propagate efficiently. To solve this problem, array multipliers attempt to minimize the length of the longest carry propagation path. A simple 3x3 bit array multiplier is shown in figure 19. In the figure the circles represent 1 bit full adder cells. Better array multipliers can be created using more complicated carry propagation schemes, but the basic structure remains the same with n² full adder cells, and the simpler version is easier to understand.²
[Figure 19: 3x3 Array Multiplier]

1 A design that uses a smaller adder and is recursive is shown in Rabiner and Gold, pg 516.
2 Again, see Rabiner and Gold for a discussion of various implementations.
With some knowledge of binary multipliers, we are ready to attack the modulo multiplier problem. Of the two classes of binary multipliers, the shift and add type seems most conducive to the modulo problem because the partial accumulations can be normalized by the modulus after each step. Although an array multiplier would be faster, the result of a full bxb bit binary multiply could exceed the modulus by several times its value; in order to normalize this result, several stages of correction circuitry would be needed. In order to implement a modulo shift and add multiplier, both multiply by two blocks and residue adder blocks must be available. Although both blocks have already been designed, the final residue adder unfortunately has almost twice the latency of the multiply by 2 block. If a multiply by 2 block could be designed that provided both biased and unbiased versions of its output, a structure similar to the residue accumulator in figure 16 could be used that would have the reduced latency that we desire.
An enhanced version of the multiply by two block (figure 20) can be designed with minimal hardware cost and no additional delay. To understand the biased side of the multiply by two block, two cases must be examined, 2x[n] ≥ m and 2x[n] < m. Regardless, the binary output of the left shifter equals 2x[n] + 2µ. If 2x[n] ≥ m, then 2x[n] is an unnormalized residue and one of the two µ's is needed to normalize the residue, yielding the desired result, 2x[n] + µ. If 2x[n] < m, then the output of the left shifter exceeds the desired result by µ. Fortunately, in the case 2x[n] < m the output of the adder in the unbiased side of the doubler has the desired result. The additional components required for this modification are a b bit 2-1 MUX and a b bit register. The only disadvantage of the design is that both biased and unbiased versions of the input must be available, but if several multiply by two blocks are chained together, this is only a problem for the first one.
The modified accumulator is very similar to the previous accumulator except that the accumulation will not occur in place (it is not recursive) and that the add will be conditional. The accumulator will take the partial results from the previous stage, add another value to the partial sum, and pass the result to the next stage. The carry out from the previous stage must also be passed to indicate whether the partial result is biased. The appropriate bit of the multiplier, yi, must also be asserted to determine whether to perform the add or to pass the previous partial result to the next stage.
[Figure 20: Modified Multiply by 2 Block]

[Figure 21: Residue Conditional Accumulator #1]
A possible design for this accumulator is shown in figure 21. The 4-1 MUX in the figure performs the following function:

    yi   Cin   MUX output   Explanation
    0    0     0            previous result biased, result stays the same
    0    1     µ            previous result unbiased, result biased
    1    0     Input        previous result biased, use the unbiased input
    1    1     Input + µ    previous result unbiased, use the biased input

Both biased and unbiased versions of zero are needed to assure that the carry out of the accumulator properly indicates the state of the result. If the previous result is unbiased and a 0 is added to it, there will be no carry from the adder and the next stage will assume a biased input. Instead, a biased zero is added to the unbiased previous result, which biases the result; the carry out again will not be set, but this time it correctly indicates that the result is biased.
    Part Type          Number    Sizing                 Transistors
    1 bit Full Adder   b         9792b µm²              32b
    4-1 MUX            b         8160b + 9792 µm²       18b + 32
    1 bit Register     b+1       8160b + 8160 µm²       24b + 24
    Totals             --        26112b + 17952 µm²     74b + 56

    Hardware Summary for Residue Accumulator #1

Unfortunately, it is not aesthetically pleasing to use rungs of a multiplexor to decode zeros or to require a biased zero to be added to the previous result. If the carry out from the accumulator can be fixed up, the multiplexor can be reduced to a 2-1 multiplexor by placing an AND gate after the multiplexor to condition the add. The biased zero was only necessary in the case when no add is being performed (yi = 0) and the previous result is unbiased (Cin = 1). If the previous result is passed unchanged to the next stage, the carry out must be set to 1. With the addition of the necessary gates, the new accumulator design is shown in figure 22.
    Part Type          Number    Sizing                 Transistors
    1 bit Full Adder   b         9792b µm²              32b
    2-1 MUX            b         4896b + 1632 µm²       10b + 4
    1 bit Register     b+1       8160b + 8160 µm²       24b + 24
    AND gate           2         9792 µm²               12
    OR gate            1         4896 µm²               6
    Inverter           1         3264 µm²               2
    Totals             --        22848b + 27744 µm²     66b + 48

    Hardware Summary for Residue Accumulator #2
[Figure 22: Residue Conditional Accumulator #2]
Finally, we are ready to put all of the blocks together to form the general purpose programmable residue multiplier. The only block of the multiplier which has not been exhaustively discussed is the "first stage." The first section of the multiplier contains a b bit adder that generates a biased form of x[n] and two b bit pipeline registers. b AND gates determine whether an unbiased 0 or an unbiased x[n] is passed to the second section. Because there is no previous result, the first section does not need an accumulator; the carry in of the second section is accordingly hardwired to 1 to expect an unbiased previous result. A final section that unbiases the potentially biased output of the final accumulator has not been included, but may be needed.
The complete multiplier requires a large amount of circuitry. In addition to the first section there are b-1 other sections, the ith one of which requires a b bit doubler, a b bit accumulator, and a b-1 bit register. Using either accumulator design, the number of 1 bit full adders needed for the residue multiplier is 2b² - b. This can be compared to the b² 1 bit full adders needed for a bxb bit binary array multiplier discussed earlier. An abbreviated hardware summary of the residue multiplier is shown below.

    Part Type          Number
    1 bit Full Adder   2b² - b
    2-1 MUX            3b² - 2b
    1 bit Register     3.5b² + 1.5b - 1
    AND gate           b² + b - 1
    OR gate            3b - 2
    Inverter           b - 1

    Abbreviated Hardware Summary for the Residue Multiplier

On the positive side, the modulo multiplier has fairly high performance. The latency through the multiplier is b+1 clock cycles, and the throughput is 1 complete multiply per clock cycle. The limiting factor on the clock cycle time is the delay through the accumulator, consisting of a b bit binary adder delay and a few gate delays. Even with the performance that can be obtained, the amount of hardware required encourages any residue design to avoid multipliers if at all possible.
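The assembled multiplier can be modeled functionally. The sketch below is my own behavioral model, not the thesis gate design: stage i holds 2^i·x mod m (standing in for the chained doublers) and conditionally adds it to the running sum via the biased accumulation scheme, with a final unbiasing step.

```python
def residue_multiply(x, y, m, b):
    mask = (1 << b) - 1
    mu = (1 << b) - m
    acc, biased = 0, False
    p = x                               # 2^i * x mod m, advanced by doubling
    for i in range(b):
        if (y >> i) & 1:                # bit y_i conditions the add
            s = acc + (p if biased else p + mu)
            acc, biased = s & mask, (s >> b) == 0
        p = (2 * p) % m                 # stand-in for the multiply by 2 block
    return acc - mu if biased else acc  # final unbiasing section

# exhaustive check over one channel
b, m = 6, 43
for x in range(m):
    for y in range(m):
        assert residue_multiply(x, y, m, b) == (x * y) % m
```

Skipping the add when y_i = 0 models the conditional accumulator passing the previous partial result (and its bias flag) on unchanged.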
General Residue Function Units

To implement a general integer function, it may be more efficient, in some cases, to use a table lookup rather than a more complex combination of binary arithmetic units. The general function of two variables has already been discussed as the first possible adder design. There a 2^2b x b ROM was used, which is the most general implementation of an integer function of two b bit variables. If the function exhibits any special properties, the table lookup can be broken into several smaller pieces. In this case the hardware requirements and throughput of the table lookup approach may become competitive with those of custom designs.
[Figure 23: Four bit General Linear Function Evaluation (d=1)]
First, let's examine the case of a function of one variable. If the function is linear¹, the b bit input can be broken into ⌈b/d⌉ groups of d bits, and smaller d bit lookups can be performed with the results added. For example, to scale a four bit residue, x, by a four bit constant, K, the bits of x can be used to select precalculated multiples of the constant K from tables or hardwired to multiplexors. This is shown more clearly in figure 23 for the case d = 1. Instead of selecting between 0 and a multiple of K, the 2-1 MUX's could be replaced with AND gates when d=1. If d>1, MUX's or some other table addressing scheme will have to be used. An example of d=2 is shown in figure 24. It is interesting to notice that with d=1 the design is very similar to the general residue multiplier except that, since K is known, multiples of it can be precalculated.
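A software sketch of the d-bit group decomposition (mine; the function names are illustrative) evaluates f(x) = K·x mod m from ⌈b/d⌉ small precalculated tables, as in figures 23 and 24:

```python
def make_tables(K, m, b, d):
    # one table per d-bit group, holding precalculated multiples of K
    groups = (b + d - 1) // d          # ceil(b/d) lookups
    return [[(K * g * (1 << (d * i))) % m for g in range(1 << d)]
            for i in range(groups)]

def scale_by_K(x, tables, m, d):
    y = 0
    for i, table in enumerate(tables):
        g = (x >> (d * i)) & ((1 << d) - 1)   # d bits address the table
        y = (y + table[g]) % m                # residue adder combines lookups
    return y

m, b = 13, 4
K = 7
tables = make_tables(K, m, b, d=2)     # the d = 2 case of figure 24
for x in range(1 << b):
    assert scale_by_K(x, tables, m, 2) == (K * x) % m
```

Linearity is what makes this work: K·x splits into a sum of K·(group)·2^(d·i) terms, each small enough for a 2^d entry table.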
The evaluation of a multivariable function could also use this trick. However, for an integer function of two variables to be performed by partial lookups, the function must also be linear.¹ Considering that a linear function of two variables has the property f(x,y) = f(x + y) = f(x) + f(y), the two inputs could be evaluated in parallel and the results added to obtain the final result. So, in general, the partial evaluation scheme is most useful for single variable functions.

1 A linear function is one which satisfies the equation f(ax + by) = af(x) + bf(y). An example of a linear function is f(x) = Kx; an example of a function which is not is f(x) = x + K.
[Figure 24: Four bit General Linear Function Evaluation (d=2)]
1 Actually, the function has to satisfy a slightly weaker requirement than linearity in two variables. This requirement is as follows: f(x,y) = f(X,Y) + f(X-x, Y-y) for all (X,Y) within the range of f. The author challenges the reader to find a useful function satisfying the above condition that is not linear.

Problems with RNS

The discussion to this point has focused on the advantages of RNS and how parallelism can be developed using modulus arithmetic units. Unfortunately, there are also several disadvantages. First, binary numbers must be converted into their residue form, the computation performed with programmable modulo arithmetic units, and the results converted back to binary; the substantial size and latency of the conversion units tend to rule out the use of RNS for all but very large computational tasks. Second, because the residue digits are uncoupled, a number cannot be scaled in any simple manner; to scale a number in residue form, all the residue digits must be modified consistently. Finally, also because the digits are uncoupled, magnitude comparison is not possible in the residue form. These problems drastically limit the number of possible applications of RNS and to some extent explain the reason that RNS has not been widely used.
Conversion into and out of residue

The conversion into residue representation is simple, and the conversion for each modulus can operate independently. The residue mod m of a number is the remainder obtained when the number is divided by the modulus. If r moduli are included in an RNS design, r similar conversion units can operate in parallel.
The conversion out of residue representation is not simple, and the conversion process cannot operate independently for each modulus. The conversion is based on a classic theorem known as the Chinese Remainder Theorem (CRT).1 Given the residue representation {x1, x2, ..., xr} of x, the value of x can be computed using the following identity:

    x = ( M1*<M1^-1>m1*x1 + M2*<M2^-1>m2*x2 + ... + Mr*<Mr^-1>mr*xr ) mod M

where M = m1*m2*...*mr, Mi = M/mi, <Mi^-1>mi is the multiplicative inverse of Mi modulo mi, and xi = x mod mi.
Although conversion hardware
will be addressed in section 4.3, the large mod M computations do not look very
promising.
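The CRT identity above is easy to verify numerically. The following is a minimal sketch (my own, not the thesis hardware); the three moduli are illustrative, and Python's three-argument `pow` is used to compute the multiplicative inverses:

```python
# Sketch (illustrative): residue encode/decode for a small RNS,
# with decoding done via the CRT identity above.
from math import prod

MODULI = [3, 5, 7]                  # pairwise relatively prime moduli
M = prod(MODULI)                    # dynamic range M = 105

def to_residue(x):
    """Conversion into residue form: r independent mod units."""
    return [x % m for m in MODULI]

def from_residue(digits):
    """CRT decode: x = (sum of Mi * <Mi^-1>mi * xi) mod M."""
    total = 0
    for m, xi in zip(MODULI, digits):
        Mi = M // m
        total += Mi * pow(Mi, -1, m) * xi   # <Mi^-1>mi = modular inverse
    return total % M

assert from_residue(to_residue(93)) == 93
```

The large mod M reduction at the end is exactly the computation the text flags as unpromising in hardware.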
To avoid mod M calculations, the Mixed Radix Conversion (MRC) algorithm is conventionally used. Although the details of the algorithm will be discussed in a later section, it is a two part conversion process that passes through an intermediate mixed radix representation.2

Scaling and Magnitude Comparison
Because RNS is not a weighted number system, it is not possible to round a number or to compare the magnitude of one number to another. For either of these operations the residue digits must be converted out of the residue
1 In fact, the mathematical validity of the residue number system relies on the results of
the theorem.
2 Mixed radix number systems are similar to fixed radix except that the radix ai can vary from place to place. If xi are the digits of a mixed radix number with radices ai, the value of the number is computed as follows:

    x = sum (i = 0 to J-1) of xi * ( a0 * a1 * ... * a(i-1) )

(the empty product for i = 0 is 1).
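The mixed radix value formula can be checked with a short sketch (my own illustration; digit order and radices are assumptions for the example):

```python
# Sketch (illustrative): value of a mixed radix number with digits x[i]
# and radices a[i], least significant place first.
def mixed_radix_value(digits, radices):
    value, weight = 0, 1
    for xi, ai in zip(digits, radices):
        value += xi * weight
        weight *= ai          # weight of place i is a0*a1*...*a(i-1)
    return value

# digits [1, 2, 3] with radices [2, 3, 5]: 1*1 + 2*2 + 3*6 = 23
assert mixed_radix_value([1, 2, 3], [2, 3, 5]) == 23
```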
representation. Rather than converting all the way to a general weighted number, however, the first part of the conversion algorithm can be used to convert the number to a mixed radix representation, which is itself a weighted number system. All of the computation can be performed within the modulo channels, and the algorithm can be reversed to return to a residue representation.

Processing Complex Quantities (QRNS)

To this point only real fixed point integer quantities have been considered; however, complex quantities are encountered in many signal filtering applications. Several approaches to computing with complex residue representations have been developed, but these approaches tend to fall into two general categories. The first is simply to use three parallel real residue channels. The second is to use one of the quadratic residue number systems (QRNS).
Processing of complex quantities with three parallel real channels is performed using an innovative trick. If (a + bi) is the complex input and (c + di) is a coefficient, the inputs to the three real channels are a, b, and a+b, and the coefficients of the three channels are c, d, and c+d, respectively. The output of the second channel (bd) is subtracted from the output of the first channel (ac) to form the real part of the result (ac - bd). The outputs of both the first (ac) and the second (bd) channels are subtracted from the output of the third channel (ac + bc + ad + bd) to form the complex part of the result (ad + bc).
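The three-channel trick above can be sketched in a few lines (my own illustration, in plain integers rather than residue channels):

```python
# Sketch (illustrative): complex multiply via three real channels,
# as described above: channels compute ac, bd, and (a+b)(c+d).
def complex_mult_3ch(a, b, c, d):
    """Return (re, im) of (a + bi)(c + di) using three real multiplies."""
    ch1 = a * c                  # first channel:  ac
    ch2 = b * d                  # second channel: bd
    ch3 = (a + b) * (c + d)      # third channel:  ac + ad + bc + bd
    real = ch1 - ch2             # ac - bd
    imag = ch3 - ch1 - ch2       # ad + bc
    return real, imag

assert complex_mult_3ch(3, 4, 5, 6) == (-9, 38)   # (3+4i)(5+6i)
```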
The hardware expense of adding a third channel can be significant, however. To avoid this expense the quadratic residue number system (QRNS) has been developed that can uncouple the real and imaginary part of complex operations.1
QRNS is a complex modulus number system isomorphic to the extension fields of primes of the form 4k + 1 where k is an integer. For primes of this form, -1 is a quadratic residue.2 Letting I represent the quadratic residue of -1 modulo p, the two quadratic residues of the input (a+bi) are formed as follows: A = (a + bI) mod p and B = (a - bI) mod p. With the coefficients also in a quadratic residue form, C = (c + dI) mod p and D = (c - dI)
1 For more detail on QRNS see the reference section for some interesting papers.
2 A number r is a quadratic residue modulo p iff there is a solution to the equation
x^2 = r mod p
mod p, multiplication and addition are uncoupled: (A, B) * (C, D) = (A * C, B * D) and (A, B) + (C, D) = (A + C, B + D), with the computations being performed in independent modulo p channels.
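A minimal QRNS sketch follows (my own illustration: the modulus p = 13 = 4*3 + 1 and I = 5 are assumptions for the example, and the decode constants are included only to verify the round trip):

```python
# Sketch (illustrative): QRNS encode / channel multiply / decode for one
# modulus p of the form 4k+1. Here 5*5 mod 13 = 12 = -1, so I = 5.
P = 13
I = 5
INV2 = pow(2, -1, P)                 # 1/2 mod p
INV2I = pow(2 * I, -1, P)            # 1/(2I) mod p

def qrns_encode(a, b):
    """(a + bi) -> (A, B) = ((a + bI) mod p, (a - bI) mod p)."""
    return ((a + b * I) % P, (a - b * I) % P)

def qrns_decode(A, B):
    """(A, B) -> (a, b): a = (A + B)/2, b = (A - B)/(2I), all mod p."""
    return ((A + B) * INV2 % P, (A - B) * INV2I % P)

# Complex multiply becomes two independent channel multiplies:
A1, B1 = qrns_encode(3, 4)
A2, B2 = qrns_encode(2, 5)
prod = qrns_decode(A1 * A2 % P, B1 * B2 % P)
# (3+4i)(2+5i) = -14 + 23i, i.e. (12, 10) mod 13
assert prod == ((-14) % P, 23 % P)
```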
At first QRNS seems to be an advantage over using three conventional residue channels, but deeper investigation reveals that the advantage is not as significant as expected. Although only two channels are used to represent complex numbers, the moduli in each channel come from a significantly limited set. The moduli in a QRNS system must be primes of the form 4k+1; the moduli in a standard RNS system only need to be relatively prime to one another. This limitation causes a much smaller dynamic range from a set of moduli. To avoid this problem, other number systems such as the modified quadratic residue number system (MQRNS) have been developed that allow a richer set of moduli, but these systems have other problems.
For a more detailed discussion
of QRNS and its extensions see the references listed at the end of the thesis.
Chapter 4

Modular Efficient RNS FIR Filter

The discussion that follows focuses on the architecture of a FIR filter computing with real integer quantities, using the residue techniques derived in the previous chapter. Although the inclusion of either complex data and/or complex coefficients increases the number of residue channels and slightly complicates the conversion into and out of the residue representation, the resulting hardware designs are very similar. The inclusion of the complex cases would only add unnecessary confusion.
It is also assumed that the designs will be implemented in VLSI or WSI.
Practically, there is no massive MSI platform that could support the interconnectivity required for the RNS designs. However, this assumption places some restrictions on the hardware design. Some of these have been mentioned in the section on systolic designs. The most significant is the interconnection or busing constraint between the different blocks in the design. A printed circuit card populated with SSI or MSI components can use several signal layers for interconnection between discrete components; for comparison, a typical VLSI process includes only two layers of metal for interconnection between the custom blocks.

Figure 25  Residue FIR Filter System
Interconnects in VLSI or WSI tend to consume area. As a result, a design that uses more gates but has simple interconnects may occupy less space than a design that has been optimized for gate count at the expense of a complex interconnection scheme.
A top level block diagram of a real residue FIR filter is shown in figure
25.
The complete system should require only three basic designs:
a binary to RNS conversion, a FIR filter tap, and an RNS to binary conversion. As a system level consideration, the designs should be programmable to be used for any of the moduli in the system. The designs can be programmed for a particular modulus either by asserting a value to an input or by loading a value(s) into a register(s). This limits the number of specialized designs and makes the system expandable by adding an additional conversion block and filter tap chain. Ideally, the RNS to binary conversion could be programmed to be used with any standard moduli set, but, because of the very structured computation that includes a set number of moduli for the system, it may be worthwhile to design the optimal moduli set into this component.
Residue FIR filter tap
The major focus of the design is the residue FIR filter tap.
It is assumed that any filter being implemented with residue techniques will have a large number of taps to justify the added overhead of two conversion stages. As a result, the filter tap will be discussed in depth in this chapter. Two general classes of designs result from the biased/unbiased approach that was developed in chapter 3. One class normalizes the output of the filter chains; the other does not, but instead constrains the output to a slightly more limited range of values. Several filter tap designs will be presented in the remaining sections of this chapter, progressing from a simple design to more complex architectures, with hardware size and throughput estimated for each design type.
Brute Force
The obvious first approach to a residue FIR filter chain is to replace all of
the arithmetic units in the transpose FIR filter block diagram with the corresponding residue arithmetic units developed in chapter 3. Each tap will need one residue adder, one residue multiplier, and a multiplier output fixup block. Although this design will work, a lot of hardware is required for each filter tap. Not including 2-1 MUXes, registers, or auxiliary gates, 2b² + b individual 1 bit adders are needed for each tap.

Coefficient Decomposition

Most of the complexity of the brute force approach comes with the residue multiplier. In the multiplier, partial results (multiplies of left shifted versions1 of the input) are selected with each of the b coefficient bits and summed. If the partial results are not added together at each tap, but instead passed on to the next tap, the residue adds of the partial results need to be performed only once, at the end of the filter chain. In addition, because convolution is a linear operation, the left shifts of the input also only need to be performed once.
This approach was already examined in chapter 2 as a way to build a binary FIR filter without explicit multipliers. The same top level design can be used for an RNS filter by replacing the binary arithmetic units with their residue equivalents.
At each subtap either the current input or 0 is added to the result of the previous stage. The subtaps are able to operate without multiplies because both the input and 0 are available without computation.
Looking closer, the basic idea exploited here is that the b individual bits
of the coefficients
select numbers (either 0 or current input) that are added to
the b previous partial results.
Now, if more versions of the current input are available, such as 1*input, 2*input, and 3*input, then the coefficient bits can be taken in groups of two to select between these three numbers and zero that could be added to the result of the prior stage. In this case the number of subtaps needed per tap is ⌈b/2⌉. In general, it is possible to represent the

1 More exactly, versions of the input that have been recursively doubled. In the binary domain doubling is performed by a left shift; in the residue domain the result of the left shift must also be normalized.
coefficients in any fixed radix1 integer number representation and precompute all multiples of the input that a single digit in this representation can span. For a base a decomposition of the coefficients, ⌈log_a max(m)⌉ subtaps2 are needed per tap.

To add some formalism to the development, it is useful to examine the mathematics involved. The coefficients are represented in base a notation as follows:

    h[i] = sum (j = 0 to J-1) of hj[i] * a^j        (1)

where J = ⌈log_a max(m)⌉ and hj[i] is the jth digit in the base a representation of h[i]. Inserting equation (1) into the convolution equation yields

    y[n] = sum (i) [ sum (j = 0 to J-1) of hj[i] * a^j ] * x[n-i]
By reversing the order of summation, two different computational procedures are generated.

    y[n] = sum (j = 0 to J-1) of a^j * [ sum (i) of hj[i] * x[n-i] ]        (2)

    y[n] = sum (j = 0 to J-1) of [ sum (i) of hj[i] * (a^j * x[n-i]) ]      (3)

The first procedure (eq 2) dictates that the mini-convolutions are computed with the results scaled by powers of a and then added. The second procedure (eq 3) dictates that the input is prescaled by powers of a and the results of the mini-convolutions are directly added. The difference between these two procedures was not significant in the binary case with a = 2 because multiplying by factors of two in a binary representation is equivalent to left shifting. In the residue case, where computation must be performed to scale
1 A fixed radix (base a) integer number system is defined by the following rule:

    (... d3 d2 d1 d0)_a = ... + d3*a^3 + d2*a^2 + d1*a + d0

where di are the digits of the radix a representation.
2 Remember, m is the modulus of the channel, and therefore a range of distinct numbers is
needed for the representation of the coefficients that meets or exceeds the maximum
modulus.
by any number, one procedure may be better than the other. This, however, will be addressed later when the scaling units are developed.
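The equivalence of procedures (2) and (3) can be checked numerically. A minimal sketch (my own, in plain integer arithmetic; the thesis performs these sums inside residue channels, which is not modeled here):

```python
# Sketch (illustrative): eq (2) postscales the mini-convolutions,
# eq (3) prescales the input; both reproduce the full convolution.
def base_a_digits(h, a, J):
    """hj digits of coefficient h in base a, least significant first."""
    return [(h // a**j) % a for j in range(J)]

def digits_needed(hmax, a):
    J = 1
    while a**J <= hmax:
        J += 1
    return J

def conv_eq2(h, x, a):
    J = digits_needed(max(h), a)
    H = [base_a_digits(c, a, J) for c in h]
    y = [0] * (len(x) + len(h) - 1)
    for j in range(J):                       # one mini-convolution per digit
        for n in range(len(y)):
            mini = sum(H[i][j] * x[n - i] for i in range(len(h))
                       if 0 <= n - i < len(x))
            y[n] += a**j * mini              # postscale by a**j, then add
    return y

def conv_eq3(h, x, a):
    J = digits_needed(max(h), a)
    H = [base_a_digits(c, a, J) for c in h]
    y = [0] * (len(x) + len(h) - 1)
    for j in range(J):
        xs = [a**j * v for v in x]           # prescale the input by a**j
        for n in range(len(y)):
            y[n] += sum(H[i][j] * xs[n - i] for i in range(len(h))
                        if 0 <= n - i < len(x))
    return y
```

Both routines agree with the direct convolution for any radix a, which is the linearity argument made above.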
Base 2 - Bitwise

The simplest case of coefficient decomposition is a = 2, the bitwise algorithm. The unbiased/biased architecture is identical to the binary design shown in figure 8, with the subtaps in figure 9 replaced by residue conditional accumulators using the biased/unbiased trick (figure 22). In the binary case the number of subtaps per tap was determined by the precision of the coefficients; in the residue case the number of subtaps for the channel is determined by the modulus (which effectively sets the precision and magnitude of the coefficients within a particular channel). As shown above, ⌈log2 max(m)⌉ = b subtaps1 are needed per channel. The complexity per tap for this design is b * b bit adders, or b² 1 bit full adders, which is less than one-half the number needed by the brute force design. Unfortunately, there is the added expense of broadcasting both the b bit current input and the b bit biased current input to all subtaps. A complete summary of the hardware required per subtap is shown below:
Part Type        | Number | Sizing            | Transistors
1 bit Full Adder | b      | 9792b µ²          | 32b
2-1 MUX          | b      | 4896b + 1632 µ²   | 10b + 4
1 bit Register   | b+2    | 8160b + 16320 µ²  | 24b + 48
AND gate         | b+1    | 4896b + 4896 µ²   | 6b + 6
OR gate          | 1      | 4896 µ²           | 6
Inverter         | 1      | 3264 µ²           | 2
Totals           | ---    | 27744b + 31008 µ² | 72b + 66

Global Bus: 2b signal lines
Critical Path: 2-1 MUX + AND gate + b bit adder + OR gate + register
Throughput: (1.9b + 8.06 ns)^-1

Architecture Summary for the base 2 subtap
1 Because arithmetic is being performed using b bit binary arithmetic units, it seems
logical to set the maximum modulus to that number which uses the full dynamic range of
these units. In general, max(m) = 2^b where b is the chosen width of the channel.
Balanced Ternary

The advantage of going to a higher radix decomposition of the coefficients is that fewer subtaps are needed; the disadvantage is that more scaled versions of the input must be broadcast to each subtap. For example, if the coefficients are represented as standard radix 3 (ternary) with digits {0, 1, 2}, four b bit numbers must be broadcast to each subtap. Fortunately, there is no reason to use the standard digits. If the digits {-1, 0, 1} are used instead (balanced ternary), a simple trick permits a design that needs only two b bit numbers to be broadcast to each tap.
With the coefficients in a balanced ternary representation, there are four possible versions of the input (<x>m, <x>m + µ, <-x>m, or <-x>m + µ, selected by the coefficient and carry out of the prior stage) that could be added to the prior result at each subtap. However, in the balanced case the latter two can be easily derived from the former two. If a normalized unbiased residue in a b bit binary channel is two's complemented1, the result is the biased version of the negative of the residue. If a normalized biased residue is two's complemented, the result is the normalized unbiased version of the negative of the residue.2

    2^b - <x>m = m + µ - <x>m = <-x>m + µ
    2^b - (<x>m + µ) = m - <x>m = <-x>m

Two's complementing can be built into hardware by using XOR gates to invert bits and using the carry in of the adder to perform the add 1. The final balanced ternary subtap using this technique is shown in figure 26. The coefficients are coded as follows:
h1 | h0 | ternary digit
0  | 0  | 0
0  | 1  | 1
1  | 1  | -1
1  | 0  | undefined
1 Two's complement is a convenient way to represent both positive and negative numbers in a binary system. The two's complement of a b bit number x is obtained by subtracting x from 2^b: -x = 2^b - x. In the binary system this computation is equivalent to inverting each digit of x (0->1, 1->0) and adding 1 to the result.
2 Remember when looking at the equations: µ = 2^b - m, so 2^b = m + µ, and <-x>m = m - <x>m.
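The balanced ternary recoding of the coefficients can be sketched as follows (my own illustration; the thesis stores the recoded digits in the two-bit subtap registers, which is not modeled here):

```python
# Sketch (illustrative): recode an integer coefficient into balanced
# ternary digits {-1, 0, 1}, least significant first, and check that
# the digits reproduce the coefficient.
def to_balanced_ternary(h):
    digits = []
    while h != 0:
        r = h % 3
        if r == 2:               # digit 2 becomes -1 with a carry of 1
            r = -1
        digits.append(r)
        h = (h - r) // 3
    return digits or [0]

def from_digits(digits, radix=3):
    return sum(d * radix**j for j, d in enumerate(digits))

assert to_balanced_ternary(5) == [-1, -1, 1]      # 9 - 3 - 1 = 5
assert from_digits(to_balanced_ternary(100)) == 100
```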
Because three digits are being represented, two binary bits of coefficient register are necessary to hold the precalculated coefficient digits in a subtap. Since the coefficient digits can be placed in any unique representation, the digit codings above were chosen specifically to simplify the coefficient decoding hardware.
Figure 26  Balanced Ternary Subtap

At this point a logical concern is that the balanced ternary system includes negative numbers. Since m can be subtracted from any residue without changing its value, the negative numbers in the coefficient range can be used. All that needs to be guaranteed is that the span of the range of coefficients is greater than or equal to the maximum modulus. The range of numbers for a J digit balanced ternary number system is [-(3^J - 1)/2, (3^J - 1)/2]. The number of subtaps needed per tap is ⌈log3 2^b⌉, where b is the number of bits in the binary channels. The chart below shows the number of subtaps needed for typical values of b.
b           | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
⌈log3 2^b⌉  | 2 | 2 | 3 | 4 | 4 | 5 | 6 | 6

For b > 2, the balanced ternary representation of the coefficients does lower the number of subtaps needed per channel without increasing the number of global broadcast buses. This benefit is obtained at the expense of a slightly more complicated subtap, a two XOR gate increase in propagation delay, and a two bit coefficient at each subtap. The summary of the design is shown below:
Part Type        | Number | Sizing            | Transistors
1 bit Full Adder | b      | 9792b µ²          | 32b
2-1 MUX + XOR    | b      | 4896b + 3264 µ²   | 16b + 6
1 bit Register   | b+3    | 8160b + 24480 µ²  | 24b + 72
AND gate         | b+1    | 4896b + 4896 µ²   | 6b + 6
XOR gate         | 1      | 4896 µ²           | 10
OR gate          | 1      | 4896 µ²           | 6
Inverter         | 1      | 3264 µ²           | 2
Totals           | ---    | 27744b + 45696 µ² | 78b + 102

Global Bus: 2b signal lines
Critical Path: XOR + (2-1 MUX + XOR) + AND + b bit adder + OR + register
Throughput: (1.9b + 9.98 ns)^-1

Architecture Summary for the balanced ternary subtap
Offset Radix 4

If we are willing to broadcast additional versions of the input, we can achieve a radix 4 decomposition of the coefficients. As in the radix three case, the standard digit set {0, 1, 2, 3} is not optimal. For radix 4 one of the offset digit sets {-2, -1, 0, 1} or {-1, 0, 1, 2} should be used. For the design the former digit set will be (arbitrarily) chosen. With the coefficients in the offset radix 4 representation, six versions of the input (<-2x>m, <-2x>m + µ, <-x>m, <-x>m + µ, <x>m, or <x>m + µ, selected by the coefficient and carry out of the prior stage) could be added to the result of the prior stage. Using the two's complementing trick, only four need to be broadcast to each subtap.
Figure 27  Offset Quaternary Subtap

The design is shown in an easy to understand positive logic form in figure 27. The coefficients are coded as follows:
h1 | h0 | quaternary digit
0  | 0  | 0
0  | 1  | 1
1  | 0  | -2
1  | 1  | -1

Using the same coefficient coding, the design can be slightly optimized (two inverters removed) by drawing the circuit in a less intuitive manner as shown in figure 28.
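The offset radix 4 recoding of the coefficients can be sketched as follows (my own illustration of the digit set {-2, -1, 0, 1} chosen above):

```python
# Sketch (illustrative): recode an integer coefficient into offset
# radix 4 digits {-2, -1, 0, 1}, least significant first.
def to_offset_radix4(h):
    digits = []
    while h != 0:
        r = h % 4
        if r > 1:                 # digits 2, 3 become -2, -1 with a carry
            r -= 4
        digits.append(r)
        h = (h - r) // 4
    return digits or [0]

def value(digits):
    return sum(d * 4**j for j, d in enumerate(digits))

assert value(to_offset_radix4(11)) == 11
assert all(-2 <= d <= 1 for d in to_offset_radix4(11))
```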
Figure 28  Offset Quaternary Subtap (Negative Logic)
The span of the coefficients in a J digit offset radix 4 representation is [-2(4^J - 1)/3, (4^J - 1)/3]. The total span equals 4^J as it does in any radix 4 system. The corresponding number of subtaps per tap is J = ⌈log4 2^b⌉ = ⌈b/2⌉. The chart below lists J for some typical values of b.
b      | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
⌈b/2⌉  | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 5

The number of subtaps needed per tap decreases from the balanced ternary design. Unfortunately, the subtaps are more complicated, and the requisite number of global bus lines has doubled with the offset quaternary design. The summary of the optimized hardware is shown below:
Part Type        | Number | Sizing            | Transistors
1 bit Full Adder | b      | 9792b µ²          | 32b
4-1 MUX + XNOR   | b      | 11424b + 11424 µ² | 26b + 34
1 bit Register   | b+3    | 8160b + 24480 µ²  | 24b + 72
NOR gate         | b+1    | 4896b + 4896 µ²   | 4b + 4
AND gate         | 1      | 4896 µ²           | 6
XOR gate         | 1      | 4896 µ²           | 10
OR gate          | 1      | 4896 µ²           | 6
Totals           | ---    | 34272b + 55488 µ² | 86b + 134

Global Bus: 4b signal lines
Critical Path: XOR + (4-1 MUX + XNOR) + NOR + b bit adder + OR + register
Throughput: (1.9b + 12.14 ns)^-1

Architecture Summary for the offset quaternary subtap

Balanced Quinary

The logical extension of the offset quaternary representation of the coefficients is a balanced quinary representation, radix 5 with digit set {-2, -1,
0, 1, 2}. No new innovations are needed for the design, which is shown in positive logic form in figure 29. The coefficients for the positive logic form are coded as follows:
h2 | h1 | h0 | quinary digit
0  | 0  | 0  | 0
0  | 0  | 1  | 1
0  | 1  | 1  | 2
1  | 0  | 1  | -1
1  | 1  | 1  | -2
Figure 29  Balanced Quinary Subtap
By modifying the design slightly, it is possible to eliminate the inverter.
The final version is shown in figure 30. This version requires a different coefficient coding as follows:
h2 | h1 | h0 | quinary digit
0  | 0  | 1  | 0
0  | 0  | 0  | 1
0  | 1  | 0  | 2
1  | 0  | 0  | -1
1  | 1  | 0  | -2

The only difference between the two coefficient digit sets is that h0 has been inverted.
Figure 30  Balanced Quinary Subtap (Negative Logic)
The span of the coefficients in a J digit balanced radix 5 representation is [-(5^J - 1)/2, (5^J - 1)/2]. The total span equals 5^J as it does in any radix 5 system. The corresponding number of subtaps per tap is J = ⌈log5 2^b⌉. The chart below lists J for some typical values of b.
b           | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
⌈log5 2^b⌉  | 1 | 2 | 2 | 3 | 3 | 4 | 4 | 4

Unfortunately, the advantage of using balanced quinary is not realized until b is greater than or equal to 9. Although the balanced quinary subtap requires the same number of global buses and has the same throughput as the offset quaternary design, the marginally extra hardware would not be warranted unless b > 9.1 For those truly massive dynamic range requirements, the architecture summary is shown below:

Part Type        | Number | Sizing            | Transistors
1 bit Full Adder | b      | 9792b µ²          | 32b
4-1 MUX + XNOR   | b      | 11424b + 11424 µ² | 26b + 34
1 bit Register   | b+4    | 8160b + 32640 µ²  | 24b + 96
NOR gate         | b      | 4896b µ²          | 4b
AND gate         | 1      | 4896 µ²           | 6
XOR gate         | 1      | 4896 µ²           | 10
OR gate          | 1      | 4896 µ²           | 6
Totals           | ---    | 34272b + 58752 µ² | 86b + 152

Global Bus: 4b signal lines
Critical Path: XOR + (4-1 MUX + XNOR) + NOR + b bit adder + OR + register
Throughput: (1.9b + 12.14 ns)^-1

Architecture Summary for the balanced quinary subtap
1 Proof that aesthetics and symmetry are not the only things that are important.
Subtap Summary

The coefficients could be decomposed using even higher radix representations, but the disadvantages of having more global buses and larger multiplexors would outweigh any advantages that would be obtained by having fewer subtaps. At this point it is most instructive to examine the four designs for different values of b. The tables below list the size and number of transistors per subtap and for the entire tap for each of the four designs. The values for each entire tap were calculated from the architecture summary of each design.

Binary

b | J | subtap size (µ²) | subtap transistors | tap size (µ²) | tap transistors
2 | 2 |  86496 | 210 |  172992 |  420
3 | 3 | 114240 | 282 |  342720 |  846
4 | 4 | 141984 | 354 |  567936 | 1416
5 | 5 | 169728 | 426 |  848640 | 2130
6 | 6 | 197472 | 498 | 1184832 | 2988
7 | 7 | 225216 | 570 | 1576512 | 3990
8 | 8 | 252960 | 642 | 2023680 | 5136
9 | 9 | 280704 | 714 | 2526336 | 6426

Balanced Ternary

b | J | subtap size (µ²) | subtap transistors | tap size (µ²) | tap transistors
2 | 2 | 101184 | 258 |  202368 |  516
3 | 2 | 128928 | 336 |  257856 |  672
4 | 3 | 156672 | 414 |  470016 | 1242
5 | 4 | 184416 | 492 |  737664 | 1968
6 | 4 | 212160 | 570 |  848640 | 2280
7 | 5 | 239904 | 648 | 1199520 | 3240
8 | 6 | 267648 | 726 | 1605888 | 4356
9 | 6 | 295392 | 804 | 1772352 | 4824
Offset Quaternary

b | J | subtap size (µ²) | subtap transistors | tap size (µ²) | tap transistors
2 | 1 | 124032 | 306 |  124032 |  306
3 | 2 | 158304 | 392 |  316608 |  784
4 | 2 | 192576 | 478 |  385152 |  956
5 | 3 | 226848 | 564 |  680544 | 1692
6 | 3 | 261120 | 650 |  783360 | 1950
7 | 4 | 295392 | 736 | 1181568 | 2944
8 | 4 | 329664 | 822 | 1318656 | 3288
9 | 5 | 363936 | 908 | 1819680 | 4540

Balanced Quinary

b | J | subtap size (µ²) | subtap transistors | tap size (µ²) | tap transistors
2 | 1 | 127296 | 324 |  127296 |  324
3 | 2 | 161568 | 410 |  323136 |  820
4 | 2 | 195840 | 496 |  391680 |  992
5 | 3 | 230112 | 582 |  690336 | 1746
6 | 3 | 264384 | 668 |  793152 | 2004
7 | 4 | 298656 | 754 | 1194624 | 3016
8 | 4 | 332928 | 840 | 1331712 | 3360
9 | 4 | 367200 | 926 | 1468800 | 3704
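The per-tap totals follow directly from the architecture summaries; a small sketch (my own) reproduces them from the formulas, where J is the subtap count for each design:

```python
# Sketch (illustrative): per-subtap size/transistor formulas from the
# architecture summaries, and J = subtaps per tap for each design.
from math import ceil, log

DESIGNS = {
    # name: (size(b) in um^2, transistors(b), J(b))
    "binary":            (lambda b: 27744*b + 31008, lambda b: 72*b + 66,
                          lambda b: b),
    "balanced ternary":  (lambda b: 27744*b + 45696, lambda b: 78*b + 102,
                          lambda b: ceil(b * log(2, 3))),
    "offset quaternary": (lambda b: 34272*b + 55488, lambda b: 86*b + 134,
                          lambda b: ceil(b / 2)),
    "balanced quinary":  (lambda b: 34272*b + 58752, lambda b: 86*b + 152,
                          lambda b: ceil(b * log(2, 5))),
}

def tap_totals(name, b):
    size, trans, subtaps = DESIGNS[name]
    J = subtaps(b)
    return J, J * size(b), J * trans(b)

assert tap_totals("binary", 9) == (9, 2526336, 6426)
```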
Figures 31 and 32 provide a graphic comparison of the size and number of transistors, respectively, for the different designs. Although the balanced base 3 appears to be marginally better for b = 3 and the balanced base 5 appears to be clearly better for b = 9, the offset base four design seems to have an advantage for all values of b in between. Assuming a large number of filter taps, it is possible to neglect the scaling and summing units needed for the different implementations.1 The differences between the global busing requirements, however, can not be neglected. Both the binary and the balanced ternary designs need 2b global bus lines2; the offset quaternary and the balanced quinary need 4b lines.
1 Remember, the output of each mini-convolution (subtap) chain must be scaled and
summed with the scaled outputs of the other chains.
2 In reality, the clock must be globally broadcast also, but this is common to all designs.
Figure 31  Size (µ²) versus Number of Bits in Binary Channel
[bar chart comparing the Binary, Balanced Ternary, Offset Quaternary, and Balanced Quinary designs]
7000 '
6000 '
5000
Binary
4000
Balanced Ternary
3000
Offset Quaternary
M
2000
Balanced Quinary
1000
0 1
2
Figure
The
3
4
5
6
32
Transistors verses Number
delay
through
the
subtaps
increases or the number of bits
7
8
9
of Bits in Binary Channel
increases
monotonically
in the binary channel
increases.
as
the
radix
A summary
of the latencyl through a single subtap is shown in the following table:
1 As are the hardware sizing estimates, the latency estimates are derived from a standard
cell library. See the appendix...
b | binary   | balanced ternary | offset quaternary | balanced quinary
2 | 11.86 ns | 13.78 ns         | 15.94 ns          | 15.94 ns
3 | 13.76 ns | 15.68 ns         | 17.84 ns          | 17.84 ns
4 | 15.66 ns | 17.58 ns         | 19.74 ns          | 19.74 ns
5 | 17.56 ns | 19.48 ns         | 21.64 ns          | 21.64 ns
6 | 19.46 ns | 21.38 ns         | 23.54 ns          | 23.54 ns
7 | 21.36 ns | 23.28 ns         | 25.44 ns          | 25.44 ns
8 | 23.26 ns | 25.18 ns         | 27.34 ns          | 27.34 ns
9 | 25.16 ns | 27.08 ns         | 29.24 ns          | 29.24 ns

Scaling and Summing

Although the scaling and summing operations can be neglected for the comparison between the different coefficient decomposition designs, they are an essential part of the complete design.
Earlier in this chapter two computational procedures were discussed. Either the input can be premultiplied by powers of the radix (figure 33), or the results of the mini-convolution chains can be postmultiplied by powers of the radix as shown in figure 34.

Figure 33  Premultiplication by Powers of the Radix

To simplify the computation and decrease
the necessary hardware, a form of Horner's Algorithm1 can be used as seen in figures 35 and 36. The use of Horner's Algorithm has the complication of requiring that the data arrival times in the mini-convolution channels be skewed2 so that the inputs to the final adder chain arrive at the proper times. To minimize the number of registers needed to skew the data, the first computational procedure is used in the form shown in figure 35.
The latency of the multiply by a blocks equalizes the delay needed between outputs of consecutive subtaps. If the second computational procedure were used, a chain of registers would be needed to provide delayed versions of the input to each mini-convolution chain.

Figure 34  Postmultiplication by Powers of the Radix

With a computational procedure finally chosen, scaling units must be developed to multiply by the radices {2, 3, 4, 5} used in the designs. Obviously, the minimum latency designs are desired, but the scaling blocks must operate with a throughput equal to that of the filter subtaps or, equivalently, equal to
1 Horner's Algorithm is usually associated with polynomial evaluation, where

    sum (i = 0 to n) of ai * x^i

is evaluated as a0 + x(a1 + x(a2 + x(a3 + x( ... )))).
2 The ith subtap in each mini-convolution chain will be operating on data from different
input times.
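Applied to the filter, Horner's rule combines the mini-convolution outputs with one multiply by the radix per stage instead of explicit powers. A minimal sketch (my own illustration, in plain integers):

```python
# Sketch (illustrative): Horner combination of mini-convolution outputs
# y_j, so y = y_0 + a*(y_1 + a*(y_2 + ...)) needs no explicit a**j.
def horner_combine(mini_outputs, a):
    """mini_outputs[j] is the j-th mini-convolution result."""
    acc = 0
    for yj in reversed(mini_outputs):
        acc = acc * a + yj
    return acc

ys = [5, 1, 3]                       # y_0, y_1, y_2
assert horner_combine(ys, 4) == sum(y * 4**j for j, y in enumerate(ys))
```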
one clock cycle. A multiply by 2 block that meets the timing requirements has already been developed (figure 20). It has a throughput and latency1 equal to one clock cycle and outputs both biased and unbiased forms of the product. The multiply by 4 block would consist of two consecutive multiply by two blocks. It would have a throughput of one clock cycle, but a latency of two (three) clock cycles. The *4 block would also output both biased and unbiased forms of the product.
Figure 35  Horner's Algorithm for Premultiplication
Unfortunately, a multiply by 3 unit is not as simple as the previous two.
A combination of one-half of a *2 unit with additional hardware is needed to multiply by 3. A design with a latency of two is shown in figure 37. The number of adders could be reduced by adding x to 2x and conditionally adding µ to the result; the cost of this hardware optimization, however, is an increase in latency to 3 clock cycles.
1 Assuming both x and x+m are available. Otherwise, an extra stage would be necessary to
generate both unbiased and biased versions of the multiplicand. This extra stage would
have a throughput of one clock cycle, but would increase the total latency to two clock
cycles. One half of the multiply by two block will operate on an unbiased input to
produce only an unbiased result in one clock cycle.
Figure 36  Horner's Algorithm for Postmultiplication

Figure 37  Multiply by 3 Block
The multiply by 5 unit is very similar to the multiply by three block.
A
design with a latency of three can be implemented by adding an
additional multiply by 2 block to the *3 design and an additional delay register
for the input.
As with the *3 block, the number of adders could be reduced by
adding x to 4x and conditionally adding g to the result at the cost of increased
latency.
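The arithmetic performed by these scaling units can be sketched as follows (my own illustration; only the residue arithmetic is modeled, not the pipelining, latency, or biased outputs):

```python
# Sketch (illustrative): residue scaling by 2, 3, and 5 built from the
# doubling structure described above. Inputs are normalized in [0, m).
def times2(x, m):
    """Multiply by 2: left shift, then subtract m if out of range."""
    y = x << 1
    return y - m if y >= m else y

def times3(x, m):
    return (times2(x, m) + x) % m                # 2x + x, renormalized

def times5(x, m):
    return (times2(times2(x, m), m) + x) % m     # 4x + x, renormalized

m = 13
assert times2(9, m) == 5
assert times3(9, m) == 1
assert times5(9, m) == 6
```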
New Algorithm

To reduce the number of globally broadcast signals needed for the larger radix decompositions, it is necessary to back up and examine the basis of the biased/unbiased designs in more depth. In each design the number of bits in the binary channels is equal to the maximum number of bits that a normalized residue could occupy. This dynamic range restriction forces the output of each subtap to be normalized1, because unnormalized numbers could not be uniquely represented. The normalization is achieved by guaranteeing that a normalized unbiased residue is always added to a normalized biased one. The output of the previous stage will always be in normalized2 form, but may or may not be biased. Because of the uncertain state of the previous stage's output, both unbiased and biased versions of each normalized digit multiple of the input must be precalculated and bused to the subtaps. This clever procedure used to keep the residues normalized within a binary channel has led to the large number of buses.
If more bits than are needed to uniquely represent the normalized residues are included in the binary channels, unnormalized residues can also be represented. Instead of requiring that the output of each subtap be normalized, the output can be restricted to some range of values modulo m.
At each subtap, positive or negative multiples of m can be added in with the new product to keep the result within some restricted range. With a sufficient number of bits in the binary channels, the sum of two unbiased residues,
1 Generally, normalized residue implies that the magnitude of the residue falls within the range [0, m-1]. In this case, because the dynamic range of the channel is equal to the dynamic range needed for the maximum modulus, the "normalized range" could actually be any offset of the standard normalized range (ie [a, m+a-1] | a ∈ Z). The point is that the span of the output cannot exceed m uniquely.
2 Inductive reasoning
Since the output of the previous stage will always be unbiased, we only need to provide the subtaps with unbiased digit multiples of the input. If these digits are powers of two, the unbiased digit multiples of the coefficient can be calculated at the subtap by a left shifter, and negative multiples can be calculated by two's complementing. However, in the biased/unbiased designs all numbers were considered positive: the carry out of the adder merely indicated whether the sum was biased or unbiased, and even the two's complement trick used to invert residues generated positive values2. To support negative values in an unbiased environment, a two's complement representation of binary numbers must be used. With true negative numbers in the system, a method exists to ensure that the output of a subtap lies within a certain range.
In order to keep the temporary results from growing in magnitude, multiples of m must be added to or subtracted from the accumulation. By adding and subtracting multiples of both x (the current input) and x - m (now a true negative number), the multiples of m can be automatically added or subtracted to keep the result within a specified range. If the previous result is negative, a positive number is added to it; if the previous result is positive, a negative number is added to it. The subtap algorithm is listed below with the following notation3: pi[n] equals the result of the ith stage at the nth time step, and hj[N-i] equals the jth digit in the balanced radix a decomposition of the N-i th coefficient.
/* case 0 */
if (hj[N-i] == 0)
    pi[n] = pi-1[n-1]

/* case 1 */
if (hj[N-i] > 0 && pi-1[n-1] > 0)
    pi[n] = pi-1[n-1] + hj[N-i] * (x-m)

/* case 2 */
if (hj[N-i] > 0 && pi-1[n-1] < 0)
    pi[n] = pi-1[n-1] + hj[N-i] * x

/* case 3 */
if (hj[N-i] < 0 && pi-1[n-1] > 0)
    pi[n] = pi-1[n-1] + hj[N-i] * x

/* case 4 */
if (hj[N-i] < 0 && pi-1[n-1] < 0)
    pi[n] = pi-1[n-1] + hj[N-i] * (x-m)

1 Inductive reasoning, again
2 When <x>m was inverted to generate <-x>m, the result <-x>m equaled m - <x>m, not -<x>m.
3 Some of this notation was developed in section 2.2.2 in the discussion of the transpose filter.
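The case analysis above can be exercised with a short simulation. The following Python sketch is illustrative only — the modulus, digit values, and input sequence are invented here, not taken from the thesis — and checks that the running result stays congruent to the true sum modulo m while remaining bounded by m times the largest digit magnitude.

```python
# Simulation of the subtap update cases above. The modulus, digits,
# and inputs are illustrative values, not taken from the thesis.

def subtap(p_prev, hj, x, m):
    """One subtap: add a digit multiple of x or x - m, chosen so a
    negative quantity is added to a positive running sum and vice versa."""
    if hj == 0:                                                # case 0
        return p_prev
    if hj > 0:
        return p_prev + hj * ((x - m) if p_prev >= 0 else x)   # cases 1, 2
    return p_prev + hj * (x if p_prev >= 0 else (x - m))       # cases 3, 4

m = 11                       # channel modulus
h = [1, -2, 4, -1]           # power-of-two digits of the coefficients
xs = [3, 7, 0, 10]           # unbiased residue inputs, 0 <= x < m

p = 0
for hj, x in zip(h, xs):
    p = subtap(p, hj, x, m)
    assert abs(p) <= m * max(abs(d) for d in h)   # bounded by m*max|h|

assert p % m == sum(hj * x for hj, x in zip(h, xs)) % m  # correct mod m
```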
Examining the magnitudes of the quantities involved in the above algorithm shows the range that the output, pi[n], can span and therefore the number of bits needed in the binary channels. The magnitude of pi[n] will be largest when pi[n-1] is close to zero and the magnitude of the coefficient hj[N-i] equals its maximum. Also, because x spans from 0 to m-1 and x-m spans from -m to -1, it is expected that the cases including x-m would contribute to the largest results. Regardless, each case in the algorithm will be examined separately. For case #1, since a negative number is being added to pi-1[n-1], the maximum magnitude of pi[n] will equal -m*max(hj) when pi-1[n-1] = 0. For case #2, since a positive number is being added to pi-1[n-1], the maximum magnitude of pi[n] will equal (m-1)*max(hj) - 1 when pi-1[n-1] = -1. For case #3, the maximum magnitude of pi[n] will equal (m-1)*max(hj). Finally, for case #4, the maximum magnitude will equal -m*min(hj) - 1. Collecting these results, the output spans the range -m*max(hj) to -m*min(hj) - 1.
If h[n] is decomposed in a balanced radix system with all digits equal to powers of two and the maximum value of m equal to 2^b, the span of the output can be efficiently1 represented in a b+c+1 bit two's complement binary channel, where c = log2(max(hj)) and the extra bit is used for the sign.
Because the temporary values in each mini-convolution chain can span the range ± m*max(hj), some method is needed to normalize the values after the final tap. Also, as in the original algorithm, the outputs of the final subtaps must be scaled and summed to form the final result for the residue channel. Although this normalization, which consists of adding/subtracting multiples of m, will not require a significant amount of hardware, it is part of the calculation and therefore must be considered. Again, for a large number of taps this hardware will not be significant, but it still must be incorporated into the system; it cannot be ignored. After the subtap designs have been completed, we will return to these two components of the system.

1 Efficient implies unique representation with no wasted dynamic range for the case m = 2^b.
The Hardware

The hardware required to implement the algorithm of the previous section is very similar to the design of the coefficient decomposition subtaps. The major differences are that the binary channels are slightly wider, and that some method is needed to perform the left shifts. Three implementations will be examined: balanced ternary, balanced quinary, and balanced septary (base 7). A general block diagram of the component elements is shown in figure 38.
Figure 38: New Algorithm Subtap

The first implementation, balanced ternary, does not need the left shift capability because the only digits are -1, 0, and 1. The basic design of the subtap is shown in figure 39. It is virtually the same as the coefficient decomposition subtap architecture, except that no carry forwarding circuitry is needed. The coefficients are decoded as follows:
72
h1  h0  ternary digit
 0   0       0
 0   1       1
 1   1      -1
 1   0   undefined
Figure 39: New Balanced Ternary Subtap

The high order bit of the previous result indicates whether the previous result is negative, and the high order bit of the coefficient indicates whether the coefficient is negative. The XNOR combination of these signals assures that a positive number is always added to a negative one and a negative always added to a positive. The design can be slightly optimized by eliminating one stage of the 2-1 MUX. Because the maximum value of m is 2^b, the maximum value of x is 2^b - 1 and the minimum value of x-m is -2^b. So, the high order bit of x is always set to 0, and the high order bit of x-m is always set to 1. Because the output of the XNOR is 0 to select x and 1 to select x-m, it can provide the high order bits of x and x-m. In addition, the high order bits of x and x-m do not need to be globally broadcast. The final design is shown in figure 40.

Figure 40: Improved New Balanced Ternary Subtap

Although the carry forwarding circuitry has been removed, and with it an OR gate in the critical path, the overall design is probably worse than the corresponding unbiased/biased design. More gates were added than were removed, and the propagation delay of the additional carry stage in the adder is longer than the delay of the removed OR gate. Also, the same number of bus lines are needed. At least this design shows proof of concept. The architecture summary is listed below.
Part Type         Number   Sizing              Transistors
1 bit Full Adder   b+1     9792b + 9792 μ2     32b + 32
2-1 MUX + XOR      b       4896b + 3264 μ2     16b + 6
1 bit Register     b+3     8160b + 24480 μ2    24b + 72
AND gate           b+1     4896b + 4896 μ2     6b + 6
XOR gate           1       4896 μ2             10
XNOR gate          1       4896 μ2             8
Totals                     27744b + 52224 μ2   78b + 134

Global Bus:     2b signal lines
Critical Path:  XNOR + (2-1 MUX + XOR) + AND + b+1 bit adder + register
Throughput:     (1.9b + 10.36 ns)^-1

Architecture Summary for the balanced ternary subtap #2
Balanced Quinary

The purpose of the new subtap algorithm is to reduce the number of global buses, not to reduce the amount of hardware per subtap. Because the balanced ternary algorithm already requires only two numbers to be globally broadcast, the new implementation could not have been expected to be an improvement there. The balanced quinary implementation, however, will lower the number of globally broadcast numbers from four to two.
Figure 41: New Balanced Quinary Subtap

The basic design will be very similar to the balanced ternary design, except that hardware must be added to perform the left shifts. Because the standard cell library that I am using does not include a left shifter, a b+1 bit MUX with hardwired left shifted versions of x and x-m is used. The b+1th bit of the x input to the MUX is tied to 0, and the b+1th bit of the x-m input is tied to 1; the low order bits of both the 2x and 2x-2m inputs are tied to 0. The high order bit (the b+2 bit) of all inputs is added by the selector line after the MUX. The final design is shown in figure 41. The coefficients are decoded as follows:
h2  h1  h0  quinary digit
 0   0   0       0
 0   0   1       1
 0   1   1       2
 1   0   1      -1
 1   1   1      -2
Once again, the new subtap algorithm results in more hardware per subtap, but the number of globally broadcast lines has been reduced from 4b to 2b. At this point a custom VLSI layout of both designs is necessary to determine the relative hardware sizes and propagation delays. Also, the left shifter could be implemented as a partially populated grid of transmission gates1, and a b bit 2-1 MUX could be used to choose between x and x-m, at the cost of an additional bus line. Regardless, within the available technology, the architecture summary is as follows:

Part Type         Number   Sizing               Transistors
1 bit Full Adder   b+2     9792b + 19584 μ2     32b + 64
4-1 MUX + XOR      b+1     8160b + 19584 μ2     24b + 58
1 bit Register     b+5     8160b + 40800 μ2     24b + 120
AND gate           b+2     4896b μ2             6b + 12
XOR gate           1       4896 μ2              10
XNOR gate          1       4896 μ2              8
Totals                     31008b + 89760 μ2    86b + 272

Global Bus:     2b signal lines
Critical Path:  XNOR + (4-1 MUX + XOR) + AND + b+2 bit adder + register
Throughput:     (1.9b + 14.06 ns)^-1

Architecture Summary for the balanced quinary subtap #2
1 The grid would be rectangular with transmission gates on the two central diagonals to
allow the rows (input) to be directly passed to the columns (output) or to allow the rows to
be passed shifted left one place. Once the select lines have been set, the propagation
delay through the shifter would only be one transmission gate delay. A similar design
with a fully populated grid is frequently used as a barrel shifter.
Modified Balanced Septary

Because the new subtap algorithm only requires two b bit numbers to be broadcast, a radix seven design can be implemented. With the biased/unbiased algorithm, six b bit numbers would have been needed. The standard balanced septary digits would be the set {-3, -2, -1, 0, 1, 2, 3}. Because we are trying to avoid real multiplications, the modified digit set {-4, -2, -1, 0, 1, 2, 4} will be used. Although the modified digit set allows all multiplications to be performed by left shifts, it does put some restrictions on the range of numbers spanned.
Figure 42: New Modified Septary Subtap
A good way to understand the modified digit set is to examine a number in the standard balanced septary representation and convert it to the modified digit set. A simple rule exists for the conversion: working from right to left, whenever a 3 occurs replace it with a -4 and carry 1, and whenever a -3 occurs replace it with a 4 and carry -1. The positive numbers of a two digit modified system are listed below.
decimal   standard (a1 a0)   modified (b1 b0)
   0          0  0               0  0
   1          0  1               0  1
   2          0  2               0  2
   3          0  3               1 -4
   4          1 -3               0  4
   5          1 -2               1 -2
   6          1 -1               1 -1
   7          1  0               1  0
   8          1  1               1  1
   9          1  2               1  2
  10          1  3               2 -4
  11          2 -3               1  4
  12          2 -2               2 -2
  13          2 -1               2 -1
  14          2  0               2  0
  15          2  1               2  1
  16          2  2               2  2
  17          2  3             Not Possible
Examining the table of modified balanced radix 7 representations, the largest positive number that can be represented is (22)7. In general, for any number of digits the largest positive number that can be represented is (2...2)7. Correspondingly, the largest magnitude negative number is that with all digits equal to -2. The total span of a J digit modified radix 7 number system is

    Σ_{i=0}^{J-1} 2(7^i) + 1 + Σ_{i=0}^{J-1} 2(7^i) = (2/3)(7^J - 1) + 1

where the first term accounts for the negative numbers, the second term for zero, and the final term for the positive numbers. Unfortunately, we lose approximately one-third of the span of the standard balanced radix 7 representation to make all of the digits powers of two. The resulting number of subtaps per tap is listed in the following table for several values of b.
b   Subtaps
2      1
3      2
4      2
5      2
6      3
7      3
8      4
9      4
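Both the digit-replacement rule and the subtaps-per-tap table can be checked mechanically. The Python sketch below is illustrative only — the helper names and the span test are assumptions, not the thesis's notation.

```python
# Convert balanced septary digits {-3..3} (least significant first) to
# the modified power-of-two set {-4,-2,-1,0,1,2,4} by the rule above.

def to_modified(digits):
    out, carry = [], 0
    for d in digits:
        d += carry
        carry = 0
        if d == 3:            # 3 -> -4, carry 1
            d, carry = -4, 1
        elif d == -3:         # -3 -> 4, carry -1
            d, carry = 4, -1
        out.append(d)
    if carry:
        out.append(carry)
    return out

assert to_modified([3, 1]) == [-4, 2]    # 10 = (1 3)7 -> (2 -4)
assert to_modified([-3, 2]) == [4, 1]    # 11 = (2 -3)7 -> (1 4)

# Subtaps per tap: the smallest J whose modified span
# (2/3)(7^J - 1) + 1 covers 2^b coefficient values.
def subtaps(b):
    J = 1
    while 2 * (7**J - 1) // 3 + 1 < 2**b:
        J += 1
    return J

assert [subtaps(b) for b in range(2, 10)] == [1, 2, 2, 2, 3, 3, 4, 4]
```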
The modified radix 7 representation requires fewer subtaps per tap than the balanced radix 5 representation for b equal to 5 or 7. Once again, the actual hardware design is very similar to the previous two designs. Using the formula at the end of section 4.1.3.1, b+3 bits are needed in the binary channels. The left shifts are performed by a hardwired 8-1 MUX, although a custom left shifter1 would be more efficient in hardware size and speed. One small advantage of using the 8-1 MUX is that the delay of an AND gate in the critical path is removed, although the added delay of the 8-1 MUX and 3 additional adder stages more than compensates. The final design is shown in figure 42. The coefficients are decoded as follows:
h2  h1  h0  coefficient
 0   0   0       0
 0   0   1       1
 0   1   0       2
 0   1   1       4
 1   0   1      -1
 1   1   0      -2
 1   1   1      -4
Any comparison of this design with other designs should consider that only two b bit busses are used. Limiting the number of global buses was the primary goal, and it has been achieved. The complete hardware summary is shown below.
1 In this case the rectangular grid would have transmission gates on three central
diagonals.
Part Type         Number   Sizing                Transistors
1 bit Full Adder   b+3     9792b + 29376 μ2      32b + 96
8-1 MUX            b+2     17952b + 84864 μ2     50b + 220
1 bit Register     b+6     8160b + 48960 μ2      24b + 144
XOR gate           b+3     4896b + 14688 μ2      10b + 30
XNOR gate          1       4896 μ2               8
Totals                     40800b + 182784 μ2    116b + 498

Global Bus:     2b signal lines
Critical Path:  XNOR + 8-1 MUX + b+3 bit adder + register
Throughput:     (1.9b + 16.78 ns)^-1

Architecture Summary for the modified balanced septary subtap
Subtap Summary

The tables below list the size in μ2, the number of transistors needed, and the latency for each of the three subtap designs using the new algorithm. All of the numbers were obtained from a standard cell library. Figures 43 and 44 graphically summarize the size and transistor data.
Balanced Ternary
                per subtap                  entire tap
b   J      size (μ2)   transistors    size (μ2)   transistors
2   2       107712         290          215424         580
3   2       135456         368          270912         736
4   3       163200         446          489600        1338
5   4       190944         524          763776        2096
6   4       218688         602          874752        2408
7   5       246432         680         1232160        3400
8   6       274176         758         1645056        4548
9   6       301920         836         1811520        5016
Balanced Quinary
                per subtap                  entire tap
b   J      size (μ2)   transistors    size (μ2)   transistors
2   1       151776         444          151776         444
3   2       182784         530          365568        1060
4   2       213792         616          427584        1232
5   3       244800         702          734400        2106
6   3       275808         788          827424        2364
7   4       306816         874         1227264        3496
8   4       337824         960         1351296        3840
9   4       368832        1046         1475328        4184
Modified Balanced Septary
                per subtap                  entire tap
b   J      size (μ2)   transistors    size (μ2)   transistors
2   1       264384         730          264384         730
3   2       305184         846          610368        1692
4   2       345984         962          691968        1924
5   2       386784        1078          773568        2156
6   3       427584        1194         1282752        3582
7   3       468384        1310         1405152        3930
8   4       509184        1426         2036736        5704
9   4       549984        1542         2199936        6168
Figure 43: Size (μ2) versus Number of Bits in the Binary Channels (Balanced Ternary, Offset Quaternary, Radix 7)
Figure 44: Transistors versus Number of Bits in the Binary Channels (Balanced Ternary, Offset Quaternary, Radix 7)
b   Modified Balanced Septary   Balanced Quinary   Balanced Ternary
2          20.58 ns                 17.86 ns           14.16 ns
3          22.48 ns                 19.76 ns           16.06 ns
4          24.38 ns                 21.66 ns           17.96 ns
5          26.28 ns                 23.56 ns           19.86 ns
6          28.18 ns                 25.46 ns           21.76 ns
7          30.08 ns                 27.36 ns           23.66 ns
8          31.98 ns                 29.26 ns           25.56 ns
9          33.88 ns                 31.16 ns           27.46 ns

Latency through Subtap
In each case the new algorithm subtaps appear to be both larger and slower than the corresponding old algorithm designs. Unfortunately, the numbers can only be considered rough estimates of the actual hardware and speed of the new algorithm designs. Since all of the RNS designs discussed in this paper are intended for full custom implementation, the actual implementations would not be restricted to the parts in a standard cell library, and the subtap designs, both old and new algorithm, would be both smaller and faster.1 However, even in full custom, the new algorithm subtaps would be physically larger or comparable, with more and slower multiplexors. The most discrepancy occurs for the modified radix 7 designs, which are especially disadvantaged by the standard cell library: because there was no left shift block available, the left shifts were implemented by an 8-1 MUX instead of a 2-1 MUX, a left shifter, and an AND gate. If a left shifter was used, the new designs would compare more favorably in both hardware and speed.
Putting it all Together

Earlier in this chapter the scaling and summing computation performed for each mini-convolution channel was discussed for the biased/unbiased algorithm. The same scaling must be performed for the new algorithm also. Fortunately, the same boxes can be used. The primary difference between the two cases is the form of the output at the final subtaps. For the biased/unbiased algorithm the output of a subtap is always normalized, but may or may not be biased; to compute an unbiased normalized version of the output requires only one clock cycle. Using the new algorithm, the output of a subtap is always unbiased, but may or may not be normalized. Depending on the range of unnormalized values that the output can span2, several clock cycles are needed to generate a normalized version of the output.
Earlier, the output of a subtap was shown to vary within the range ± m*max(hj). One of the fundamental assumptions of the new algorithm is that the members of the digit set are powers of two; therefore, max(hj) can be equivalently written as 2^k, and the output range written as ±2^k m. Normalizing the output is performed by successively adding or subtracting decreasing powers of two times m. The top level block diagram of this algorithm is shown in figure 45.
1 The area given for a standard cell part includes a boundary around the edges of the actual part to prevent violations of design rules. The area for a custom design including several parts will be significantly less than the sum of the areas of each part and their respective boundary layers.
2 Determined by the maximum allowed digit in the coefficient decomposition.
Figure 45: Normalizing Stage
A block diagram of a norm box is shown in figure 46. Each of the norm boxes operates in a manner similar to the subtaps of the filter. If the input to a norm box is positive, a negative multiple of the modulus is added to it; if the input is negative, a positive multiple of the modulus is added to it. At each step the range of the output is reduced by a power of two until the output falls into the range ± m. The fix block at the end of the chain operates in a similar manner, but adds the capability to output both unbiased and biased1 versions of the output.
Figure 46: Norm Box
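The norm box chain and fix block described above can be sketched in a few lines; the function names and test values below are illustrative, not from the thesis.

```python
# Sketch of the normalizing chain: each norm box halves the range of
# the running value by adding or subtracting a power-of-two multiple
# of m. Illustrative values, not taken from the thesis.

def normalize(y, m, k):
    """Reduce y from (-2^k * m, 2^k * m) down to [-m, m)."""
    for j in range(k - 1, -1, -1):       # boxes 2^(k-1)m, ..., 2m, m
        step = (1 << j) * m
        y = y - step if y >= 0 else y + step
    return y

def fix(y, m):
    """Fix block: emit unbiased [0, m) and biased [-m, 0) versions."""
    unbiased = y if y >= 0 else y + m
    return unbiased, unbiased - m

m, k = 11, 2
y = -43                                  # subtap output in (-2^k m, 2^k m)
r = normalize(y, m, k)
assert -m <= r < m and (r - y) % m == 0  # reduced range, same residue
assert fix(r, m)[0] == y % m             # unbiased output equals y mod m
```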
At the beginning of this section it was implied that the same scaling units designed for the biased/unbiased algorithm can be used here also. This is true; however, it should be mentioned that scaling units can also be designed within the philosophy of the new algorithm. The new scaling units would add x, x-m, and left shifted versions of each together to obtain an unnormalized scaled value that spans a certain range. A chain of norm boxes and a fix box would return the scaled value to the normalized range. There may be slight hardware and aesthetic advantages to the new algorithm scaling units; however, they do exhibit an increased latency because of the added correction stages.

1 Biased version of x is equal to x-m ignoring the b+1st bit.
Binary to RNS Conversion Block

Now that architectures have been developed to compute a high-speed convolution sum within RNS, a method is needed to convert the data into residue form at an equally high throughput and a low latency. Because a binary to RNS converter is needed for each filter chain and each modulus, programmability is necessary to allow the same design to operate for any modulus. Programmability can be obtained at several different levels. The designs developed so far for each modulus used the simplest form of programming: a single b bit number that could be loaded into a register or asserted to input pins. More complex programming would consist of loading several values into registers or blocks of memory. When comparing designs for the binary to RNS converter, the primary consideration becomes the level of programming. How much is it worth to eliminate tables?
Table Lookup Approach

The table lookup approach is useful because a binary to RNS conversion is a linear function, ignoring possible normalization. If the binary input to the filter is d bits and the residue channels are b bits, the residue value of the input is equal to the sum of the low order b bits of the binary input with the residue value of the high order (d-b) bits. The conversion of the high order bits can be performed by a table; the resulting conversion unit is shown in figure 47. Unfortunately, the 2^(d-b) x b bit table used could be very large and slow for large values of d. Because the large table is implementing a linear function, however, it could be replaced by a number of smaller tables. A similar approach for general linear functions of one variable has already been discussed in chapter 3.
Figure 47: Table Lookup Approach for Binary to RNS Conversion
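The linearity property that makes the table approach work can be demonstrated in a few lines of Python; the modulus and bit widths here are illustrative values, not from the thesis.

```python
# Linearity of binary-to-RNS conversion: only the high order (d - b)
# bits need a table, since v mod m equals the low b bits plus the
# residue of the high bits, reduced mod m. Illustrative values only.
m, b, d = 13, 4, 10

# table for the high order bits: residue of (high << b) for each value
table = [(hi << b) % m for hi in range(1 << (d - b))]

def to_residue(v):
    low = v & ((1 << b) - 1)            # low b bits pass straight through
    return (low + table[v >> b]) % m    # one lookup plus a modulo add

assert all(to_residue(v) == v % m for v in range(1 << d))
```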
Using several smaller tables seems to be the best way to achieve a high throughput conversion, but some method is needed to efficiently modulo add the outputs of these tables. Since several versions of residue accumulators have been developed for the filter chains, these can be used as a starting point. In order to avoid drastic recoding of the high order d-b bits of the input, only the radix 2 and radix 4 accumulators1 will be considered.
Because the radix 4 accumulator designs used an offset digit set {-2, -1, 0, 1}, the input must be recoded. The recoding algorithm is as follows: taking the high order (d-b) bits of the input in pairs, obtain the standard radix four representation for the number; then, starting at the low order digit, if the digit is 2 replace it with -2 and carry 1 to the next place, if the digit is 3 replace it with -1 and carry 1, and otherwise let the digit remain unchanged.

1 A valid observation is that radix 2 and 4 accumulators were only developed for the biased/unbiased algorithm. Versions can also be developed using the new algorithm. A radix 2 accumulator can be implemented by removing the two's complement circuitry from the new radix 3 design. A radix 4 accumulator has a data path that is identical to the new radix 5 design; the only difference is a more complex coefficient decoding.
An alternate method to recode the input is to add (22...2)4, or (1010...10)2. Although this method does generate the desired carry if a digit is 2 or 3, the result must be interpreted correctly. An example for a single radix four digit is shown below:

00 (0) + 10 (2) = 10 (0')
01 (1) + 10 (2) = 11 (1')
10 (2) + 10 (2) = 1 00 (-2')
11 (3) + 10 (2) = 1 01 (-1')
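The first recoding algorithm can be sketched as follows; the digit ordering (least significant first) and helper names are assumptions for illustration, not the thesis's notation.

```python
# Recode standard radix 4 digits {0,1,2,3} into the offset digit set
# {-2,-1,0,1}: replacing a 2 or 3 subtracts 4 from the digit and
# carries 1 to the next place. Illustrative sketch only.

def recode(digits):
    out, carry = [], 0
    for d in digits:
        d += carry
        carry = 1 if d >= 2 else 0
        out.append(d - 4 if d >= 2 else d)
    if carry:
        out.append(carry)     # the offset form may gain one extra digit
    return out

def value(digits):            # evaluate a radix 4 digit string
    return sum(d * 4**i for i, d in enumerate(digits))

# the recoding preserves the value and stays within {-2,-1,0,1}
for v in range(64):
    std = [(v >> (2 * i)) & 3 for i in range(3)]
    off = recode(std)
    assert value(off) == v and all(-2 <= d <= 1 for d in off)
```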
Examining either conversion algorithm, the high order bits in the offset form may contain one additional radix 4 digit over the standard radix 4 representation.
Using the biased/unbiased algorithm, four values are needed at each accumulator: {x, x+p, 2x, 2x+p}, where x denotes the normalized mod m value of 2^i and i is the place of the low order bit of the bit pair in the total input. These values must be loaded into a tapped shift register or, more practically, into addressable memory. For a d bit binary input and b bit residue channel, the total number of b bit stored values is 4⌈(d-b)/2⌉.

If the new algorithm is used instead, only x and x-m are needed at each accumulator. These values would also be loaded into addressable memory or a tapped shift register. For a d bit input and b bit residue channel, the total number of b bit stored values is only 2⌈(d-b)/2⌉.
Unfortunately, every bonus has an equal and opposite penalty. Although only one-half as many registers are needed, the new algorithm uses an extra correction stage, which not only adds hardware but also increases the latency of the conversion. Both of the offset radix 4 designs require a (d-b) bit binary adder to perform the conversion between the standard digit set and the offset digit set. The adder increases both the hardware size and latency of the conversion. If d-b is on the order of b, the binary add can probably occur within a single clock cycle; otherwise, the adder must be pipelined, increasing the latency even more.
The digit set conversion can be avoided entirely if the standard radix 4 digit set {0, 1, 2, 3} is used. Although either radix 4 accumulator can be modified to use the standard digit set, both designs will require more stored values and contain larger selectors. The decrease in latency is paid for in hardware size.
No Table Approach

If programmability is more important than both hardware size and latency, simple binary to RNS conversion units can be developed using the residue doubler and an extended residue multiplier. Because the b bit residue value of 2^b is known1, it can be multiplied by the d-b high order bits of the input using the residue multiplier design from chapter 3; the b low order bits of the input can be used as an unbiased seed for the first accumulator in the multiplier. The result is a simple programmable binary to RNS converter; only g needs to be loaded or asserted to an input.

Figure 48: No Table Binary to RNS Conversion

1 2^b = g mod m, think about it... Also, if the moduli set is chosen so that all moduli are greater than 2^(b-1), the biased form of 2^b can be formed by left shifting g.
Another simple programmable conversion unit can be designed by segmenting the binary input, starting at bit 0, into b bit blocks. Using *2 blocks, the higher order segments can be multiplied by the appropriate powers of two and the results summed. A block diagram of the design for ⌈d/b⌉ = 3 is shown in figure 48. This design would be competitive primarily for small ⌈d/b⌉.
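The segmented *2-block scheme can be modeled in software; the widths and modulus below are illustrative values, and double_mod stands in for a single *2 block.

```python
# Split a d bit input into b bit segments and combine them Horner-style:
# multiply the running total by 2^b mod m (b doubler stages), then
# modulo add the next segment. Illustrative values, not from the thesis.
m, b, d = 11, 4, 12

def double_mod(x):            # one *2 block: doubling modulo m
    x = 2 * x
    return x - m if x >= m else x

def to_residue(v):
    total = 0
    for i in reversed(range((d + b - 1) // b)):   # segments, MSB first
        seg = (v >> (i * b)) & ((1 << b) - 1)
        for _ in range(b):                        # total *= 2^b (mod m)
            total = double_mod(total)
        total = (total + seg) % m                 # modulo add the segment
    return total

assert all(to_residue(v) == v % m for v in range(1 << d))
```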
Residue to Binary Conversion

An RNS to binary conversion unit is needed to put the results from all of the residue channels back together to form a binary number. Because the RNS digits are uncoupled and unordered1, there is no simple algorithm that would allow each digit to be converted independently; RNS to binary conversion is significantly more complex than the binary to RNS case. Two conversion algorithms that have been discussed extensively in the literature are the Chinese Remainder Theorem and the Mixed Radix conversion algorithms. In chapter 3 the Chinese Remainder Theorem was presented to show the mathematical link between the residue representation and a conventional weighted number representation. This equivalence is as follows:

    x = < M1<M1^-1>m1 x1 + M2<M2^-1>m2 x2 + ... + Mr<Mr^-1>mr xr >M

where M = Π mi, Mi = M/mi, and xi = x mod mi. The computations required to evaluate the CRT expression are multiply-accumulates modulo M; however, M is on the order of 2^rb, which could be very large. To avoid the modulo M arithmetic, other algorithms must be investigated.
1 If each residue in an RNS system is considered to be a digit, the digits are uncoupled
because there are no carries. This feature allows us to perform smaller computations on
each digit without any links between the digits. A more familiar weighted number system
such as the decimal number system has coupled digits.
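The CRT expression above can be checked numerically; the moduli here are small illustrative values, not from the thesis.

```python
# Numerical check of the CRT reconstruction formula.
from math import prod

m = [3, 5, 7]                  # pairwise relatively prime moduli
M = prod(m)                    # M = 105
Mi = [M // mi for mi in m]     # Mi = M / mi

def crt(residues):
    """Reconstruct x from xi = x mod mi via the CRT sum, reduced mod M."""
    return sum(Mi[i] * pow(Mi[i], -1, m[i]) * residues[i]
               for i in range(len(m))) % M

x = 52
assert crt([x % mi for mi in m]) == x
```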
Mixed Radix Conversion

A standard way to avoid the modulo M arithmetic in the CRT is to use the mixed radix conversion algorithm. With this algorithm the residues are first converted to an intermediate mixed radix1 representation that is then evaluated to find a standard fixed radix value. The calculations required to convert from residue to mixed radix are all b bit modulo mi, and the remaining calculations are conventional binary. The radices in the mixed radix representation are chosen to be equal to the moduli in the moduli set.

The Algorithm
The best way to motivate the conversion algorithm is to reverse engineer it. First, to simplify the process the following notation must be defined:

Fi  = ith mixed radix coefficient
mi  = ith modulus
xi  = ith residue
γij = <F(i-1)>mj
mij = <mi^-1>mj
Assume x is already in the mixed radix form

    x = F0 + F1 m1 + F2 m1 m2 + F3 m1 m2 m3 + ...

If there are r moduli in the moduli set, there will be r digits in the associated mixed radix representation. Taking residues of both sides of this equation for each modulus in the moduli set, the left side (x mod mi) equals xi, and the right side is some function of the Fi's. With the xi's known we can solve for the Fi using residue arithmetic. The first three digits are shown below:
The first three digits are shown below
rF
0 = x1
x1 = 0
X2= <F 0 +
Flml>m2
1 1 = <x 2 - FO>m2<m1- 1 >m2 = [(x 2 - 712) * m12] mod m 2
X3= <FO +
Fimi + F2mim2>m3
F 2 = [((x3
-
713) * m 1 3
-
723) * M2 3 ] mod m 3
1 Previously, for coefficient decomposition, fixed radix number systems were defined. Mixed radix number systems are similar except that the radix a can vary from place to place. If xi are the digits of a mixed radix number with radices ai, the value of the number is computed as follows:

    x = Σ_{i=0}^{J-1} xi (a0 a1 ... a(i-1))
Much of the computation for the residue to mixed radix conversion can be
performed in parallel.
The example below shows the conversion process for a
simple 3 moduli RNS.
Example

m1 = 3    m2 = 4    m3 = 5
x1 = 1    x2 = 0    x3 = 4

γ12 = 1, γ13 = 1, γ23 = 1; m12 = 3, m13 = 2, m23 = 4

F0 = x1 = 1
F1 = [(x2 - γ12) * m12] mod m2 = [(0 - 1) * 3] mod 4 = 1
F2 = [((x3 - γ13) * m13 - γ23) * m23] mod m3 = [((4 - 1) * 2 - 1) * 4] mod 5 = 0
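The recursion used in the example generalizes directly to any number of moduli. This Python sketch (the function names are illustrative) reproduces the example's digits.

```python
# Residue-to-mixed-radix conversion: solve for the digits Fi using only
# modulo-mi arithmetic. The moduli and residues are the example values.

def mixed_radix(m, residues):
    F = []
    for i in range(len(m)):
        t = residues[i]
        for j in range(i):
            # peel off digit Fj, then multiply by <mj^-1> mod mi
            t = ((t - F[j]) * pow(m[j], -1, m[i])) % m[i]
        F.append(t)
    return F

m = [3, 4, 5]
F = mixed_radix(m, [1, 0, 4])          # the example: x1=1, x2=0, x3=4
assert F == [1, 1, 0]

x = F[0] + F[1] * m[0] + F[2] * m[0] * m[1]   # evaluate mixed radix form
assert [x % mi for mi in m] == [1, 0, 4]
```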
A top level block diagram of the required computation is shown in figure
49.
Figure 49: Mixed Radix Conversion Algorithm (for 4 residue channels)
The Hardware

The primary consideration in the conversion hardware design process is programmability. Because the conversion hardware includes the moduli set, the design sets a limit on the number of allowable moduli. Although it would be nice for testing if the moduli set could be arbitrarily programmed, the feature is really not required, and the optimum moduli set1 can be hardwired into the design; the addition of a new modulus then requires a new design. The advantage of hardwiring the moduli set is that the residue multiplies can be performed by table lookups without having to load the tables2. To simplify the presentation, the following discussion will use the original biased/unbiased algorithm; the new unbiased range algorithm could also be applied, but it would only complicate the presentation.
Except for timing registers, the data flow graph in figure 49 is a top level block diagram for the mixed radix conversion hardware. Two designs are needed, one a residue subtractor and the other a residue scaler. The residue subtractor is a very simple design (figure 50). Because the inputs are in the unbiased form, the subtraction can be performed by two's complementing the subtrahend and adding: the inversion generates the biased form of the negative residue, so the biased/unbiased addition requirement is satisfied, and the carry out of the adder will indicate the state of the output. The only problem with this subtractor is that an overflow will cause undefined results. To guarantee that no overflow occurs, the moduli set must be ordered (i.e., m1 < m2 < m3 < ... < mr).

Figure 50: Residue Subtractor (Unbiased/Biased Algorithm)
1 The optimum moduli set is that set of relatively prime moduli representable in a b bit binary channel that has the maximum product. See chapter 5.
2 The number of multiplies that would occur in a practically sized MRC would require a large amount of memory in which to store tables.
The residue scaling unit is also a simple design. Because scaling is a linear operation, the design is very similar to the high order bits accumulator of the standard RNS to binary conversion described earlier; a configuration similar to that in figure 24 can be used for the radix 4 conversion.

Putting the two units together, the conversion is completed. To place the conversion into its final binary form, several binary multiplies must be performed to multiply out the mixed radix coefficients. These multiplies can be performed by standard pipelined binary shift and add multipliers with a throughput equal to that of the individual filter chains. Because the mixed radix digits are weighted1, the low order digits may be neglected.

1 Assuming that all moduli are on the order of 2^b, the most significant mixed radix digit multiplies ~2^rb; the least significant multiplies 1. If several moduli are used, the low order digits will be noise.
Chapter 5

Design Aid

Now that some basic architectures have been designed and analyzed, a tool can be developed to present the size and speed data in a more friendly form. Given the number of bits in the input and coefficients, and the length of the filter, the program will output the optimal architectures for both old and new algorithms. Originally, the intention was to have the user input a hardware cost size/latency function. After some consideration, I decided to output all possible optimal size/speed designs, and allow the user to weight the alternatives. In addition, no distinction will be made between the two types of designs, because it was not possible to accurately estimate the hardware size or speed of these designs or characterize the advantage of having fewer buses.
The Moduli Selection Algorithm (Basic RNS)

The algorithm consists of two fairly distinct sections: an algorithm to compute the minimum number of moduli channels for each bitwidth in the range [3, 9], and an algorithm to prune the design space. The number of bits in the coefficients and input and the length of the filter determine the required dynamic range for the filter. If the coefficients are e bits, the input is d bits, and the length of the filter is N, the total number of bits of dynamic range is (d + e + log2 N).1 Starting with this dynamic range requirement, an optimum moduli set2 can be chosen for each b within the range [3, 9]. The optimum moduli set itself is not so important; however, the number of moduli, r, in the sets for different values of b determines the number of residue channels that are needed. However, it is much simpler to solve the reversed problem: given b and r, find the optimum moduli set and its product. The moduli selection algorithm can then be used to solve for r given b and a target moduli product.

A first attempt at optimal moduli selection uses an exhaustive search3 of all sets of r relatively prime numbers requiring b or fewer bits. The search proved to be extensive for large b and r and required a significant amount of computer time.

1 The log2 N term can be significant for a long FIR filter; it is the price of exact calculation (no rounding).
2 Optimum moduli set implies the set of relatively prime moduli representable in b bits that has the highest product of any set of relatively prime moduli representable in b bits with the same number of moduli in the set.
3 See Appendix for code.
This prompted a search for a more efficient algorithm that used some of the properties of an optimal moduli set. To improve the speed of the algorithm, it is worthwhile to consider some of the properties of the optimum moduli set and find a simple way to limit the search a bit.

A first observation is that 2^b should always be included in any optimum moduli set. For a modulus mi to be included in the set, the advantages of its being included must be greater than the disadvantages of other potential moduli being excluded because they are not relatively prime to mi. Because 2^b contains only factors of 2, it can only exclude other even numbers. Since 2^b is the highest even number in our potential moduli set, and only one even number can be included in the final moduli set, 2^b should be in this set. A second observation is that the remainder of the moduli set will contain odd numbers that are as large as possible. This information can be added to the exhaustive search algorithm to substantially reduce the search time; however, the search still becomes unwieldy for larger b and r.
To improve on the reduced exhaustive search algorithm, it is possible to directly choose the moduli set and avoid a lengthy search. As a first attempt at direct selection, start with 2^b - 1 as the first odd in the set and step down the remaining odd numbers, adding those that are relatively prime to the set until the requisite r moduli have been chosen. The inspiration for this algorithm is similar to that above which included 2^b because it is the highest number with a factor of 2. Only the highest number having a factor of 3 will be included, along with the highest having a factor of each prime 5, 7, 11, etc. Unfortunately, this algorithm does not always give the best answer. It works for b = 2, 3, 4, 5, and 7, but for some cases of b = 6 or 8 it gives a suboptimal result.
A deeper investigation of optimum moduli sets shows that it is occasionally optimal to exclude a larger odd number in order to include two smaller ones. For example, if the highest number having a factor of 3 also contains a factor of 5 (which is the case for b = 8), then this number not only excludes all smaller factors of 3 but also all smaller factors of 5. The problem can be more easily visualized if a moduli set with r moduli is thought of as having been generated by adding a modulus to the moduli set of r-1 moduli. At some point the product of the moduli will be increased more by excluding the highest factor of 3 and 5 and including the second highest factor of 3 and the second highest factor of 5, rather than including the next smaller odd number that is relatively prime to the existing set. A numerical example is shown below:
8 Bit Moduli --- 6 Moduli in Set
256, 255, 253, 251, 247, 241
8 Bit Moduli --- 7 Moduli in Set
256, 253, 251, 249, 247, 245, 241
In this example 255 contains both factors of 3 and 5. When the seventh modulus is added it becomes advantageous to omit 255 from the moduli set and add the next highest factor of three, 249, and the next highest factor of five, 245. This is because 255*239 = 60945 and 249*245 = 61005. The replacement could have been anticipated by realizing that 255 is a double factor odd1 and calculating LOW, which equals the product of the second highest odds containing the factors in the double factor odd divided by the double factor odd. In this case LOW = 249*245/255 = 239.2. If the next number that is relatively prime to the existing set (239 in this case) is less than LOW, then the double factor odd should be omitted instead.
A heuristic algorithm was developed to avoid double factor odds without explicit factoring. First, a moduli set consisting of 2^b and the largest r-1 relatively prime odds less than 2^b is generated. Then, each odd in the set is sequentially excluded, and the largest r-1 relatively prime odds are chosen (omitting the excluded one from the search space). The product of the new set is calculated, and if it is greater than the previous product, this set becomes the current one. When all odds have been excluded from a current set without any of the resulting moduli sets having a greater product, the search ends with the current set as the result.

1 More properly, 255 contains two small factors (3 and 5).

The heuristic algorithm gives the correct moduli set for practical values of r and b. For some very large values of r, the exclusions must be ordered in order for the algorithm to converge on the optimal set. The first error occurs for b = 8 and r = 38. The error can be ignored for all practical uses.1
Now that a method to find the optimum moduli set given r and b has been developed, we need a way to find the sufficient number of b bit moduli needed to achieve a given dynamic range. Since all moduli within a b bit channel will be less than or equal to 2^b, an initial lower guess is r = (# bits of range / b). Starting with this value of r, the heuristic moduli selection algorithm can be called with successively higher values of r until a moduli set is found that has the sufficient range.
The Design Aid

Given the number of moduli needed for the different bitwidth moduli and the size/latency data for the different bitwidth residue subtap designs, a set of (size, delay) pairs can be formed that can be searched to find the potential optimum designs. First, the size data listed in chapter 4 is multiplied by the appropriate number of moduli required for the respective bitwidths to obtain a size number for the entire filter tap. Because all of the subtaps within a tap operate in parallel, the filter latency is equal to the subtap latency for any number of moduli channels. Second, the scaled size and latency figures are grouped into pairs that make up the possible design space. Four design types were investigated for implementations using the biased/unbiased algorithm: binary, balanced ternary, offset quaternary, and balanced quinary. Three design types were investigated using the new modified balanced algorithm: balanced ternary, balanced quinary, and modified balanced septary. For each design type, seven designs are considered, one for each moduli width from b=3 to b=9. So, the design space of the old algorithm contains up to 28 designs, and the design space of the new algorithm contains up to 21 designs2.
The search algorithm used to prune the space of designs operates on a principle of finding dominant designs that exclude others. When two designs are compared, if one has both a smaller size and a lower latency than the other, the first is said to dominate. The dominant design would always be chosen, and the other can be pruned. If one has a lower delay and the other has a smaller size, the two are said to coexist; neither excludes the other. If each design in the design space is compared to the others, a group of coexisting designs will remain. These designs will have the optimum size/speed combinations. A good way to visualize the process is to imagine a 2-D scatter plot of the design space with size on one axis and delay on the other. We only want to keep those designs that plot closest to the origin.
Discussion

Running the algorithm gives the optimum set, results that would be anticipated from looking at the charts in chapter 4. For the biased/unbiased algorithm, the charts show that the offset quaternary design usually has the smallest size and the largest delay of the set; the offset quaternary designs have the smallest size for b=4 to 8. The other designs in the old algorithm are usually the balanced ternary and the binary. Because the delay increases with the radix, this is also expected.

One design type in each algorithm is always dominated by others. For the old algorithm, the balanced radix 5 design is dominated by the offset radix 4 design for b less than 9. Although the subtap delays are the same for all values of b, the advantage of the radix 5 design is not realized until 9 bit moduli are used. For the new algorithm, the modified balanced septary design is dominated by the balanced quinary design for all values of b.
Chapter 6

Standard Binary Arithmetic with Pipelining

Now that the RNS designs have been presented and analyzed, another alternative will be presented using standard binary arithmetic. Using some of the techniques developed for residue filter taps and extensive pipelining, a conventional design can be implemented that operates at a higher throughput than the residue designs. Although no general decisions will be made between the residue and standard implementations, this design would be a good starting point for any comparisons.
Development of the architecture

The architecture design is based on two concepts. The first is coefficient decomposition to avoid multiplies, and the second is registers inserted in the carry chain of long binary adders to increase throughput. The coefficient decomposition concept has been thoroughly discussed within the residue filter discussion. The primary difference here is that overflow cannot be permitted within standard binary arithmetic. A sufficient number of bits must be included in the binary channels to prevent overflow.

The pipelined adder is the real innovation of the binary design. The adders can be pipelined to a granularity of single bit adders for a throughput of (1 adder delay + 1 register delay)^-1. For the residue problem the adders can not be pipelined, because the high order carry out is needed to condition the next stage for the old algorithm designs, and the high order bit is needed to condition the next stage in the new algorithm designs. For the residue designs, all of the calculation has to be performed within a single clock cycle.

Filter Tap

Using the transpose filter form, the temporary results move along a data path consisting of N adder/register pairs. A single adder/register combination is shown in figure 51. By shifting some of the registers to before the adder, the adder becomes pipelined as in figure 52. Performing the same register shift to all adder/register combinations in the data path preserves the overall data flow.
[Figure 51: Throughput Limiting Data Path in a Transpose Form FIR Filter]

[Figure 52: Increased Throughput Binary Adder/Register Combination]

In the figures, a six bit adder is pipelined into three two bit sections. A first observation is that the number of registers increases when they are shifted to the inputs of the adder. This should be expected because extra registers are needed for the inputs and the carries. In general, a b bit adder/register pipelined into d bit sections, where d|b1, requires b one bit adders and 2b + (b/d) - d - 1 one bit registers.2 Increases in throughput are being bought with increases in hardware.

Another observation is that the adder is operating on three different inputs at the same time. While the current data is being clocked through, the first section adds the two current inputs, the second section is operating on the two previous inputs, and the final section on the two twice-delayed inputs. Because there are two inputs for each output, something is needed to guarantee that the inputs to the adder are staggered in time.

1 d|b read d divides b (integer divide without remainder)
2 For d=1, the number of registers = 3b - 2; for d = b/2, the number of registers = 3b/2 + 1.
Binary Subtap

To actually design a filter tap, a coefficient decomposition must be chosen. The hardware is simplest if the binary decomposition is used. Each subtap consists of a b bit adder and c AND gates as shown in figure 53. An input stage of registers to stagger the input is shown in figure 54. With (b/d) sections per adder, ((b/d)^2 + (b/d))/2 registers are needed for the input stage. Because all subtaps receive the staggered input, the adders will always receive the proper inputs.

[Figure 53: Pipelined Binary Subtap using Binary Arithmetic (staggered input, new adder/register combination, output to previous stage)]
Balanced Ternary

The advantage of the higher radix is that fewer subtaps are needed per tap. The balanced ternary design uses the same method used by the residue designs to two's complement the input. The design needs an additional c XOR gates to invert the input; the same signal gating the XOR gates is fed into the carry in of the adder/register combination.
Higher Radices

It is possible to go to even higher radix decompositions of the coefficients than the balanced ternary; however, the designs complicate slightly. Although the input is broadcast to the taps in different time slices, the coefficient remains the same for all time slices, and scaling all time slices by factors of two can be performed by shifting the input. Using this form and the appropriate coefficient decoding, a general form for the higher radix designs adds a left shifter to the balanced ternary design. Offset quaternary, balanced quinary, and modified balanced septary subtap designs can be implemented.1 Because the additional hardware is feedforward, it does not decrease the throughput of the subtap; the throughput is limited only by the d bit adder section and a register. There is, however, some difficulty in synchronizing the staggered input versions from the previous subtap that have been scaled by the digits of the decomposition radix.
Shift Add Reconstruction

With filter taps constructed out of the parallel subtaps described above, the output of the final stage will not be in a "friendly" form. Each of the J mini-convolution chains will output a b bit result that consists of b/d slices of different time results. There is the temptation to place an output stage2 after each subtap to put the final convolution chains back together in time. When the scaling and summing of each mini-convolution result and the throughput of the filter are considered, however, this output stage does not seem as appealing. To use pipelined adders, the inputs must be staggered in time in the same d bit form in which they are output.
The following discussion focuses on the simplest case: the binary coefficient decomposition with single bit adder sections. In this case the number of mini-convolution chains, J, is equal to the width of the coefficient, e. Assuming that there are J mini-convolution chains, each with a b bit output, the data arrives at the outputs as b/d d bit sections of time delayed data. The 0th and e-1st outputs are shown below.

1 Because no left shift block is available in the standard cell library, an explicit design is omitted. For all of these designs the number of subtaps needed for e bit coefficients can be found in the tables of chapter 4.
2 The output stage would consist of register chains similar to the input stage.
[Diagram: the h0 channel outputs y0[i-1], y0[i-2], ..., y0[i-(b-1)], y0[i-b]; the he-1 channel outputs ye-1[i-1], ..., ye-1[i-(b-1)], ye-1[i-b]; the slices are staggered in time.]
Focusing on the i-1st time slice, one bit can be grabbed from each channel's output. If these bits are grouped in descending order {ye-1[i-1], ye-2[i-1], ..., y1[i-1], y0[i-1]}, the first slice into the adder is obtained. Because the output of the hj channel is multiplied by 2^j, the group of bits is properly ordered in an e bit binary word. At the next output time, the i-2nd group contains results from this same time slice. The e bit binary word obtained from this output must be left shifted one place and added to the previously obtained e bit binary word. At the next output time the i-3rd group contains results from this same time slice. The e bit word obtained must be left shifted two places and added to the previous partial sum. This process continues until the final group that contains results from the same time slice arrives at the high order bits. All total, b shifted e bit numbers are added together to generate a single binary result. The hardware required to perform the reconstruction is b e bit adders; the latency of this hardware is b+e clock cycles.
A similar algorithm can be used with the offset quaternary coefficient decomposition and 2 bit adder sections, and the results can be reconstructed for any of the other combinations. However, using any other decomposition significantly complicates the reconstruction process, especially any scaling by numbers other than powers of two. For the pipelined case, the reconstruction must run at the same throughput as the filter chain.
Hardware Required/Contrast with RNS

Similar to the hardware comparison between the different residue designs, the comparison that follows will focus primarily on the filter taps, because of the assumed large number of taps. In any case, the input stage and reconstruction hardware will be smaller than their RNS architecture counterparts, the binary to RNS converter and the RNS to binary converter. One difference is that the size of the RNS hardware was determined by the total dynamic range; the size of the massively pipelined binary design is a function of both the coefficient width and the combination of the input width and the filter length.

For any exact1 FIR filter the total required dynamic range is d + e + log2 N, where d = input width, e = coefficient width, and N = length of the filter. The complete residue FIR filter uses r b bit moduli channels with r and b chosen such that the product rb is just greater than d + e + log2 N. With the offset quaternary implementation, each residue filter tap uses a (b/2)b bit data path, and the complete filter is r(b/2)b bits wide. The pipelined binary filter (binary implementation) uses e channels, each (d + log2 N) bits wide; the complete filter is (d + log2 N)e bits wide.

For a rough comparison, assume d = 8, e = 8, and N = 10000. The optimum size RNS design has 5 6 bit moduli with the offset quaternary implementation. The total width of the RNS data path is r(b/2)b = 5(6/2)6 = 90 bits wide. The pipelined binary filter with the offset quaternary implementation would have 4 23 bit channels2 for a total width of 92 bits. Both designs would involve similar shift and add stages, but the RNS design would require 2 or 4 buses to broadcast versions of the input to the subtaps. In addition, the RNS design would require large conversion stages at the beginning and end of the filter.
General Discussion

More detailed comparisons would be needed to make any decisions between the RNS and the pipelined binary designs. Based on the very limited comparison that was performed, it would seem that the pipelined binary designs are competitive with the residue designs from a hardware sizing standpoint. Also, the comparable3 investigation shows the pipelined binary throughput is significantly higher than that of the RNS designs. Whether further investigation actually shows the pipelined binary design to be superior or inferior, it would provide a yardstick with which to measure the RNS designs.
1 Exact implies that no rounding occurs.
2 The coefficients are decomposed into e/2 quaternary digits, and the width of the channel data paths must be increased by one because the maximum coefficient digit now has a magnitude of 2.
3 Using the same decomposition radix.
Chapter 7

Conclusion

With a cursory introduction to RNS, it is easy to become excited about the great potential for high speed computation of any signal processing algorithm requiring only addition and multiplication. After a deeper study of the topic, it rings clear, however, that RNS is only useful for a very limited number of applications. In part, the limitation stems from the problems with scaling and magnitude comparison, but the largest constraint on the general use of RNS is the huge overhead of the conversion units. Even for applications for which RNS is ideally suited, such as the FIR filter problem, the size of the problem must be sufficiently large to warrant applying RNS techniques.

If more efficient conversion units can be designed, maybe the space of RNS applications could increase. The designs in chapter 4 are in the proper direction, away from the infamous table lookup solution to any RNS problem.
However, a substantial number of computations are required for either the CRT or the MRC, and, regardless of the efficiency of the computational units, there is a large amount of computation to be performed.

The massively pipelined binary concept may prove to be superior to RNS techniques even on the problems that RNS is good at. The massively pipelined designs are faster in all cases, and the initial hardware estimates seem comparable. Obviously, more research should be done in this area before any global decisions are made concerning the two methods.
Assuming RNS does have merit, there is much more work to be done to extend the little that I have done. If custom VLSI versions of the designs are compared, I believe that the new algorithm hardware designs would be better than the biased/unbiased designs in all size and speed comparisons. More research should be done on the new algorithm designs in general.

Throughout all of the architecture discussions, it was assumed that the arithmetic units were designed to operate on two's complement numbers. But, what if a signed number representation more efficient than two's complement is used for all numbers? As a result, could a more efficient form for the balanced ternary, offset quaternary, and balanced quinary arithmetic units, and possibly for all of the designs, be developed?

Another idea is to use redundant number representations for the arithmetic units. Although more logic is required, the shorter propagation delay could be used to increase the throughput of the RNS design. Maybe this technique could give RNS the needed edge in the battle with pipelined conventional filters.
Appendix 1

Dynamic Range for Optimum Moduli Sets

(For each channel bitwidth, the table lists the product of the optimum moduli set at each set length, the bits of precision log2(product), the bits used (length x bitwidth), the efficiency (precision / bits used), and the factor by which the product increases over the previous length.)

3 Bit Moduli
Length  Product          Bits of Precision  Bits Used  Efficiency  Increase
1       8.00000000E+00     3.0000             3        1.0000
2       5.60000000E+01     5.8074             6        0.9679        7.00
3       2.80000000E+02     8.1293             9        0.9033        5.00
4       8.40000000E+02     9.7142            12        0.8095        3.00

4 Bit Moduli
Length  Product          Bits of Precision  Bits Used  Efficiency  Increase
1       1.60000000E+01     4.0000             4        1.0000
2       2.40000000E+02     7.9069             8        0.9884       15.00
3       3.12000000E+03    11.6073            12        0.9673       13.00
4       3.43200000E+04    15.0668            16        0.9417       11.00
5       2.40240000E+05    17.8741            20        0.8937        7.00
6       7.20720000E+05    19.4591            24        0.8108        3.00

5 Bit Moduli
Length  Product          Bits of Precision  Bits Used  Efficiency  Increase
1       3.20000000E+01     5.0000             5        1.0000
2       9.92000000E+02     9.9542            10        0.9954       31.00
3       2.87680000E+04    14.8122            15        0.9875       29.00
4       7.76736000E+05    19.5671            20        0.9784       27.00
5       1.94184000E+07    24.2109            25        0.9684       25.00
6       4.46623200E+08    28.7345            30        0.9578       23.00
7       8.48584080E+09    32.9824            35        0.9424       19.00
8       1.44259294E+11    37.0699            40        0.9267       17.00
9       1.87537082E+12    40.7703            45        0.9060       13.00
10      2.06290790E+13    44.2297            50        0.8846       11.00
11      1.44403553E+14    47.0371            55        0.8552        7.00

6 Bit Moduli
Length  Product          Bits of Precision  Bits Used  Efficiency  Increase
1       6.40000000E+01     6.0000             6        1.0000
2       4.03200000E+03    11.9773            12        0.9981       63.00
3       2.45952000E+05    17.9080            18        0.9949       61.00
4       1.45111680E+07    23.7907            24        0.9913       59.00
5       7.98114240E+08    29.5720            30        0.9857       55.00
6       4.23000547E+10    35.2999            36        0.9806       53.00
7       1.98810257E+12    40.8545            42        0.9727       47.00
8       8.81392140E+13    46.3248            48        0.9651       44.33
9       3.78998620E+15    51.7511            54        0.9584       43.00
10      1.55389434E+17    57.1087            60        0.9518       41.00
11      5.74940907E+18    62.3181            66        0.9442       37.00
12      1.78231681E+20    67.2723            72        0.9343       31.00
13      5.16871875E+21    72.1303            78        0.9247       29.00
14      1.18880531E+23    76.6539            84        0.9125       23.00
15      2.02096900E+24    80.7413            90        0.8971       17.00
16      2.62725974E+25    84.4418            96        0.8796       13.00

7 Bit Moduli
Length  Product          Bits of Precision  Bits Used  Efficiency  Increase
1       1.28000000E+02     7.0000             7        1.0000
2       1.62560000E+04    13.9887            14        0.9992      127.00
3       2.03200000E+06    20.9545            21        0.9978      125.00
4       2.49936000E+08    27.8970            28        0.9963      123.00
5       3.02422560E+10    34.8158            35        0.9947      121.00
6       3.59882846E+12    41.7107            42        0.9931      119.00
7       4.06667616E+14    48.5308            49        0.9904      113.00
8       4.43267702E+16    55.2990            56        0.9875      109.00
9       4.74296441E+18    62.0405            63        0.9848      107.00
10      4.88525334E+20    68.7270            70        0.9818      103.00
11      4.93410588E+22    75.3852            77        0.9790      101.00
12      4.78608270E+24    81.9851            84        0.9760       97.00
13      4.25961360E+26    88.4609            91        0.9721       89.00
14      3.53547929E+28    94.8359            98        0.9677       83.00
15      2.79302864E+30   101.1397           105        0.9632       79.00
16      2.03891091E+32   107.3295           112        0.9583       73.00
17      1.44762674E+34   113.4792           119        0.9536       71.00
18      9.69909918E+35   119.5453           126        0.9488       67.00
19      5.91645050E+37   125.4761           133        0.9434       61.00
20      3.49070580E+39   131.3587           140        0.9383       59.00
21      1.85007407E+41   137.0866           147        0.9326       53.00
22      8.69534814E+42   142.6412           154        0.9262       47.00
23      3.73899970E+44   148.0675           161        0.9197       43.00
24      1.45820988E+46   153.3529           168        0.9128       39.00
25      5.39537657E+47   158.5623           175        0.9061       37.00
26      1.67256674E+49   163.5165           182        0.8984       31.00
27      4.85044353E+50   168.3745           189        0.8909       29.00
28      1.11560201E+52   172.8981           196        0.8821       23.00
29      2.11964382E+53   177.1460           203        0.8726       19.00

8 Bit Moduli
Length  Product          Bits of Precision  Bits Used  Efficiency  Increase
1       2.56000000E+02     8.0000             8        1.0000
2       6.52800000E+04    15.9944            16        0.9996      255.00
3       1.65158400E+07    23.9773            24        0.9991      253.00
4       4.14547584E+09    31.9489            32        0.9984      251.00
5       1.02393253E+12    39.8973            40        0.9974      247.00
6       2.46767740E+14    47.8101            48        0.9960      241.00
7       5.90355529E+16    55.7124            56        0.9949      239.24
8       1.41094972E+19    63.6133            64        0.9940      239.00
9       3.28751284E+21    71.4775            72        0.9927      233.00
10      7.52840440E+23    79.3167            80        0.9915      229.00
11      1.70894780E+26    87.1432            88        0.9903      227.00
12      3.81095359E+28    94.9441            96        0.9890      223.00
13      8.04111207E+30   102.6652           104        0.9872      211.00
14      1.67370004E+33   110.3667           112        0.9854      208.14
15      3.33066308E+35   118.0033           120        0.9834      199.00
16      6.56140627E+37   125.6253           128        0.9814      197.00
17      1.26635141E+40   133.2178           136        0.9795      193.00
18      2.41873119E+42   140.7952           144        0.9777      191.00
19      4.37790346E+44   148.2951           152        0.9756      181.00
20      7.83644720E+46   155.7789           160        0.9736      179.00
21      1.35570536E+49   163.2135           168        0.9715      173.00
22      2.26402796E+51   170.5972           176        0.9693      167.00
23      3.69036557E+53   177.9460           184        0.9671      163.00
24      5.79387395E+55   185.2406           192        0.9648      157.00
25      8.74874967E+57   192.4790           200        0.9624      151.00
26      1.30356370E+60   199.6981           208        0.9601      149.00
27      1.81195354E+62   206.8171           216        0.9575      139.00
28      2.48237635E+64   213.9151           224        0.9550      137.00
29      3.25191302E+66   220.9485           232        0.9524      131.00
30      4.12992954E+68   227.9372           240        0.9497      127.00
31      4.66682038E+70   234.7574           248        0.9466      113.00
32      5.08683422E+72   241.5256           256        0.9435      109.00
33      5.44291260E+74   248.2671           264        0.9404      107.00
34      5.60619999E+76   254.9536           272        0.9373      103.00
35      5.66226199E+78   261.6118           280        0.9343      101.00
36      5.49239413E+80   268.2117           288        0.9313       97.00
37      4.77044208E+82   274.6522           296        0.9279       86.86
38      1.39678594E+84   279.5241           304        0.9195       29.28
39      1.01965374E+86   285.7139           312        0.9157       73.00
40      7.23954154E+87   291.8636           320        0.9121       71.00
41      4.85049283E+89   297.9297           328        0.9083       67.00
42      2.95880063E+91   303.8605           336        0.9043       61.00
43      1.74569237E+93   309.7431           344        0.9004       59.00
44      9.25216956E+94   315.4710           352        0.8962       53.00
45      4.34851969E+96   321.0256           360        0.8917       47.00
46      1.86986347E+98   326.4519           368        0.8871       43.00
47      7.66644022E+99   331.8094           376        0.8825       41.00
48      2.83658288E+101  337.0189           384        0.8777       37.00
49      8.22609036E+102  341.8769           392        0.8721       29.00
50      6.66313319E+104  348.2167           400        0.8705       81.00
Appendix 2 -- Final Design Aid Code

/**************************************************************
 * Design Aid Program
 *
 * This program is an attempt at creating an RNS design aid.
 * It combines the heuristic moduli selection algorithm with
 * the architecture design data from section 4 of the thesis.
 *
 * Lightspeed C
 * Written by Kurt A. Locher January 6, 1989
 * Debugged by Kurt A. Locher until January 7, 1989 1:08am
 **************************************************************/
#include <stdio.h>
#include <math.h>
typedef char string[80];
/* maxmoduli indicates the maximum number of relatively prime moduli */
/* that can be selected from the set of numbers less than or equal to */
/* 2^b.  b is the index of maxmoduli, which starts at 0.              */
int maxmoduli[] = {0, 0, 0, 4, 6, 11, 16, 29, 50, 80};
main()
{
    typedef struct
    {
        int type;
        int bits;
        double size;
        double delay;
        char *next;
    } rec;
    int inputbits;       /* number of bits in the input */
    int coeffbits;       /* number of bits in the coefficients */
    int b;               /* number of bits in the binary channels */
    int i;
    int r_estimate;      /* initial guess at # of moduli needed */
    int r[10];           /* number of moduli needed for different b */
    int firstb;          /* lowest bitwidth with the required dynamic range */
    int *moduli;
    int *findmoduli();
    double filterlength; /* length of the filter in taps */
    double drange;       /* dynamic range in bits of moduli sets */
    double product;
double
FILE
FILE
FILE
FILE
FILE
FILE
FILE
totaldrange;
/* total dynamic range needed (in bits) */
1*
1*
1*
1*
1*
1*
*nsd;
*osd;
*ntd;
*otd;
*nld;
*old;
*fopeno;
fp
fp
fp
fp
fp
fp
for
for
for
for
for
for
newalgorithmsizedata */
oldalgorithmsizedata */
newalgorithmtransdata */
oldalgorithmtransdata */
newalgorithm latency-data */
oldalgorithmjlatency-data */
string otype[6];
/* text description of old design type */
double osize[6][9]; /* size of old designs for different bitwidths */
double odelay[6][9];
/* latency of old designs for different b */
string ntype[6];
/* text description of new design type */
double nsize[6][9]; /* size of new designs for different bitwidths */
/* latency of new designs for different b */
double ndelay[6][9];
string t[10];
string
dummy;
int j;
long int atol();
double
int
atofO;
design;
char
addflag;
char
char
tempflag;
*mallocO;
rec
rec
rec
*start;
*pointer;
*lastpointer;
/* First, find the number of moduli needed for each value of b */
/* from 3 to 9 */
printf("Enter number of bits in the input
scanf("%d", &inputbits);
printf("Enter number of bits in the coefficients
scanf("%d", &coeffbits);
printf("Enter length of the filter (# of taps)
scanf("%lf",
&filterlength);
totaldrange
=
(double)inputbits
+ (double)coeffbits
+ (log(filterlength)/log(2));
    /* for each value of b */
    firstb = 2;     /* will hold the largest b that cannot span the range */
    for(b=3; b<=9; b++) {
        r_estimate = (int)(totaldrange / b) - 1;
        do {
            r_estimate++;
            if (r_estimate > maxmoduli[b]) {
                r_estimate = 0;
                firstb = b;
                break;  /* the do loop */
            }
            moduli = findmoduli(r_estimate, b, &product);
            drange = log(product)/log(2);
        } while (drange < totaldrange);
        r[b] = r_estimate;
        if (r[b] != 0) {
            printf("\nThe optimal moduli set for b = %d is ... ", b);
            for(i=0; i<r[b]; i++)
                printf("%d\t", moduli[i]);
            printf("\nThe product is %.8e for %lf bits of precision\n\n",
                   product, (log(product)/log(2)));
        }
    }
    firstb++;
    /* Now, load the size and latency data from files */
    if ((osd = fopen("old_algorithm_size_data", "r")) == NULL)
        printf("Error opening old_algorithm_size_data\n");
    for(i=0; i<4; i++) {
        fscanf(osd, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s", otype[i],
               t[3], t[4], t[5], t[6], t[7], t[8], t[9]);
        for (j=3; j<=9; j++)
            osize[i][j] = r[j] * atof(t[j]);
    }
    fclose(osd);

    if ((old = fopen("old_algorithm_latency_data", "r")) == NULL)
        printf("Error opening old_algorithm_latency_data\n");
    for(i=0; i<4; i++) {
        fscanf(old, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s", dummy,
               t[3], t[4], t[5], t[6], t[7], t[8], t[9]);
        for (j=3; j<=9; j++)
            odelay[i][j] = atof(t[j]);
    }
    fclose(old);
    /* Now lets prune the number of alternatives a bit */
    start = (rec *)malloc(sizeof(rec));
    start->size = 9e30;
    start->delay = 9e30;
    start->next = NULL;
    for(design=0; design<4; design++) {
        for(b=firstb; b<=9; b++) {
            pointer = start;
            lastpointer = NULL;
            addflag = 1;
            do {
                tempflag = 0;
                if (pointer->size > osize[design][b])
                    tempflag++;
                if (pointer->delay > odelay[design][b])
                    tempflag++;
                if (tempflag == 0)
                    addflag = 0;
                if (tempflag == 2) {
                    /* delete current record */
                    if (lastpointer == NULL) {  /* if first in list */
                        start = (rec *)pointer->next;
                        free(pointer);
                        pointer = start;
                    }
                    else {
                        lastpointer->next = pointer->next;
                        free(pointer);
                        pointer = (rec *)lastpointer->next;
                    }
                }
                else {
                    /* otherwise, just step to the next record */
                    lastpointer = pointer;
                    pointer = (rec *)pointer->next;
                }
            } while (pointer != NULL);
            if (addflag == 1) {
                pointer = (rec *)malloc(sizeof(rec));
                pointer->type = design;
                pointer->bits = b;
                pointer->size = osize[design][b];
                pointer->delay = odelay[design][b];
                pointer->next = NULL;
                if (start == NULL)
                    start = pointer;
                else
                    lastpointer->next = (char *)pointer;
            }
        }
    }

    /* And print the results */
    pointer = start;
    printf("\n\nThe old designs which provide the best size/speed combinations are\n");
    do {
        printf("%s design with %d %d bit moduli --> size %.0f, delay %.2f\n",
               otype[pointer->type], r[pointer->bits], pointer->bits,
               pointer->size, pointer->delay);
        pointer = (rec *)pointer->next;
    } while (pointer != NULL);
    /* And now for something a little different */
    if ((nsd = fopen("new_algorithm_size_data", "r")) == NULL)
        printf("Error opening new_algorithm_size_data\n");
    for(i=0; i<3; i++) {
        fscanf(nsd, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s", ntype[i],
               t[3], t[4], t[5], t[6], t[7], t[8], t[9]);
        for (j=3; j<=9; j++)
            nsize[i][j] = r[j] * atof(t[j]);
    }
    fclose(nsd);

    if ((nld = fopen("new_algorithm_latency_data", "r")) == NULL)
        printf("Error opening new_algorithm_latency_data\n");
    for(i=0; i<3; i++) {
        fscanf(nld, "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s", dummy,
               t[3], t[4], t[5], t[6], t[7], t[8], t[9]);
        for (j=3; j<=9; j++)
            ndelay[i][j] = atof(t[j]);
    }
    fclose(nld);
/* Now lets prune the number of alternatives a bit */
start = (rec *)malloc(sizeof(rec));
start->size = 9e30;
start->delay = 9e30;
start->next = NULL;
for(design=0; design<3; design++) {
for(b=firstb; b<=9; b++) {
            pointer = start;
            lastpointer = NULL;
            addflag = 1;
            do {
                tempflag = 0;
                if (pointer->size > nsize[design][b])
                    tempflag++;
                if (pointer->delay > ndelay[design][b])
                    tempflag++;
                if (tempflag == 0)
                    addflag = 0;
                if (tempflag == 2) {
                    /* delete current record */
                    if (lastpointer == NULL) {  /* if first in list */
                        start = (rec *)pointer->next;
                        free(pointer);
                        pointer = start;
                    }
                    else {
                        lastpointer->next = pointer->next;
                        free(pointer);
                        pointer = (rec *)lastpointer->next;
                    }
                }
                else {
                    /* otherwise, just step to the next record */
                    lastpointer = pointer;
                    pointer = (rec *)pointer->next;
                }
            } while (pointer != NULL);
            if (addflag == 1) {
                pointer = (rec *)malloc(sizeof(rec));
                pointer->type = design;
                pointer->bits = b;
                pointer->size = nsize[design][b];
                pointer->delay = ndelay[design][b];
                pointer->next = NULL;
                if (start == NULL)
                    start = pointer;
                else
                    lastpointer->next = (char *)pointer;
            }
        }
    }

    /* And print the results */
    pointer = start;
    printf("\n\nThe new designs which provide the best size/speed combinations are\n");
    do {
        printf("%s design with %d %d bit moduli --> size %.0f, delay %.2f\n",
               ntype[pointer->type], r[pointer->bits], pointer->bits,
               pointer->size, pointer->delay);
        pointer = (rec *)pointer->next;
    } while (pointer != NULL);
}
/* Function findmoduli returns the "optimal" set of moduli with the  */
/* desired number of bits and elements.  It operates by recursively  */
/* calling a sub-optimal moduli selection function with a list of    */
/* excluded numbers.                                                 */
/* Inputs:  # of moduli, # of bits/modulus                           */
/* Outputs: moduli set, log2(product of moduli set)                  */
int *findmoduli(number, bits, product)
int number;       /* number of moduli in set */
int bits;         /* maximum number of bits per modulus */
double *product;  /* product of best moduli set, returned */
{
    char doitagain;    /* done flag */
    char *malloc();    /* dynamic memory allocation */
    char *realloc();   /* dynamic memory allocation */
    int i, j;          /* looping variables */
    int numexc;        /* size of exclusion set passed */
    int *exclude;      /* exclusion set */
    int *result;       /* best set of moduli */
    int *mod;          /* current set of moduli */
    int power();       /* integer power function */
    double temp;       /* product of current moduli set */

    /* Initialize stuff... */
    mod = (int *)malloc(number * sizeof(int));
    result = (int *)malloc(number * sizeof(int));
    exclude = (int *)malloc(sizeof(int));
    numexc = 0;               /* start with no exclusions */
    mod[0] = power(2, bits);  /* always include highest power of 2 */

    /* Take first stab with sub-optimal algorithm */
    findset(mod, number, exclude, numexc);
    *product = mod[0];
    for(i=1; i<number; i++)
        *product = *product * mod[i];

    /* recursively call suboptimal algorithm, successively excluding */
    /* each member of the current best moduli set */
    do {
        doitagain = 0;
        for(i=0; i<number; i++)
            result[i] = mod[i];
        numexc++;
        exclude = (int *)realloc(exclude, numexc*sizeof(int));
        for(i=1; i<number; i++) {
            exclude[numexc - 1] = result[i];
            findset(mod, number, exclude, numexc);
            temp = mod[0];
            for(j=1; j<number; j++)
                temp = temp * mod[j];
            /* Is the current moduli set better than the best? */
            if (temp > *product) {
                *product = temp;
                doitagain = 1;
                break;  /* break out of the for loop */
            } /* if */
        } /* for */
    } while(doitagain);
    return(result);
}
/* Function findset executes the basic flawed algorithm for finding  */
/* a moduli set, with the addition of select number exclusion.  The  */
/* basic algorithm starts with the highest possible modulus in a     */
/* b bit binary representation (2 to the power of b).  It then       */
/* counts down the odd numbers less than this highest modulus,       */
/* including those that are relatively prime to the currently        */
/* existing set.  The exclusion feature allows a calling function to */
/* compensate for the flaw in this algorithm by excluding multiple   */
/* factor numbers.                                                   */
int findset(modset, lenset, exclude, lenexc)
int *modset;   /* predimensioned array to hold moduli, modset[0] = 2^n */
int lenset;    /* length of requested moduli set */
int *exclude;  /* predimensioned and initialized exclude set */
int lenexc;    /* length of exclude set */
{
    char success;  /* flag */
    int i, j;      /* looping variables */
    int newmod;    /* potential new member of moduli set */
    int gcd();     /* greatest common divisor function */
    /* find first odd number that does not conflict with exclude set */
    newmod = modset[0] - 1;
    if (lenexc != 0) {
        do {
            success = 1;
            for(j=0; j<lenexc; j++) {
                if (newmod == exclude[j]) {
                    newmod -= 2;  /* go to next greatest odd */
                    success = 0;
                    break;  /* the for loop */
                }
            }
        } while(!success);
    }
    modset[1] = newmod;

    /* Using first odd element as a seed, find the remaining relatively */
    /* prime elements to complete the set.  For these elements not only */
    /* must the exclude list be checked, but they must also be          */
    /* relatively prime to the existing elements.                       */
    for(i=2; i<lenset; i++) {
        newmod = modset[i-1] - 2;
        do {
            /* check exclude list */
            if (lenexc != 0) {
                do {
                    success = 1;
                    for(j=0; j<lenexc; j++) {
                        if (newmod == exclude[j]) {
                            newmod -= 2;  /* go to next greatest odd */
                            success = 0;
                            break;  /* the for loop */
                        }
                    }
                } while(!success);
            }
            /* check for relative primality */
            success = 1;
            for(j=1; j<i; j++) {
                if (gcd(modset[j], newmod) != 1) {
                    newmod -= 2;  /* go to next greatest odd */
                    success = 0;
                    break;  /* the for loop */
                }
            }
        } while(!success);
        modset[i] = newmod;
    } /* for */
}
int power(num, pow)
int num;
int pow;
{
int i;
int res;
res = num;
for(i=1; i<pow; i++)
res = num * res;
return(res);
}
/* function gcd returns the greatest common divisor of two integers */
/* using Euclid's Algorithm...                                      */
int gcd(a, b)
int a;
int b;
{
int temp;
int r;
    /* make sure a > b */
    if (a < b) {
        temp = a;
        a = b;
        b = temp;
    }
    /* Euclid's Algorithm */
    do {
        r = a % b;
        a = b;
        b = r;
    } while(r > 0);
    return(a);
}
Moduli Selection Algorithm

/****************************************************************/
/* This program is an attempt at creating an RNS design aid.    */
/* It uses a recursive method that is based on some of the      */
/* basic characteristics of an optimal moduli set.              */
/*                                                              */
/* Written by Kurt A. Locher August 29, 1988                    */
/* Ported to the Macintosh November 12, 1988                    */
/****************************************************************/
#include <stdio.h>
#include <math.h>
main()
{
    int nummod;
    int numbits;
    int i;
    int *moduli;
    int *findmoduli();
    double product;

    do {
        printf("Enter bitlength of moduli ");
        scanf("%d", &numbits);
        printf("Enter number of moduli    ");
        scanf("%d", &nummod);
        moduli = findmoduli(nummod, numbits);
        product = 1;
        printf("\nThe optimal moduli set is ... ");
        for(i=0; i<nummod; i++) {
            product = product * moduli[i];
            printf("%d\t", moduli[i]);
        }
        printf("\nThe product is %.8e for %lf bits of precision\n\n",
               product, (log(product)/log(2)));
    } while(numbits > 2);
}
/* Function findmoduli returns the "optimal" set of moduli with the  */
/* desired number of bits and elements.  It operates by recursively  */
/* calling a sub-optimal moduli selection function with a list of    */
/* excluded numbers.                                                 */
int *findmoduli(number, bits)
int number;  /* number of moduli in set */
int bits;    /* maximum number of bits per modulus */
{
    char doitagain;   /* done flag */
    char *malloc();   /* dynamic memory allocation */
    char *realloc();  /* dynamic memory allocation */
    int i, j;         /* looping variables */
    int numexc;       /* size of exclusion set passed */
    int *exclude;     /* exclusion set */
    int *mod;         /* current set of moduli */
    int *result;      /* best set of moduli */
    int power();      /* integer power function */
    double product;   /* product of best moduli set */
    double temp;      /* product of current moduli set */
    /* Initialize stuff... */
    mod = (int *)malloc(number * sizeof(int));
    result = (int *)malloc(number * sizeof(int));
    exclude = (int *)malloc(sizeof(int));
    numexc = 0;               /* start with no exclusions */
    mod[0] = power(2, bits);  /* always include highest power of 2 */

    /* Take first stab with sub-optimal algorithm */
    findset(mod, number, exclude, numexc);
    product = mod[0];
    for(i=1; i<number; i++)
        product = product * mod[i];

    /* recursively call suboptimal algorithm, successively excluding */
    /* each member of the current best moduli set */
    do {
        doitagain = 0;
        for(i=0; i<number; i++)
            result[i] = mod[i];
        numexc++;
        exclude = (int *)realloc(exclude, numexc*sizeof(int));
        for(i=1; i<number; i++) {
            exclude[numexc - 1] = result[i];
            findset(mod, number, exclude, numexc);
            temp = mod[0];
            for(j=1; j<number; j++)
                temp = temp * mod[j];
            /* Is the current moduli set better than the best? */
            if (temp > product) {
                product = temp;
                doitagain = 1;
                break;  /* break out of the for loop */
            } /* if */
        } /* for */
    } while(doitagain);
    return(result);
}
/* Function findset executes the basic flawed algorithm for finding  */
/* a moduli set, with the addition of select number exclusion.  The  */
/* basic algorithm starts with the highest possible modulus in a     */
/* b bit binary representation (2 to the power of b).  It then       */
/* counts down the odd numbers less than this highest modulus,       */
/* including those that are relatively prime to the currently        */
/* existing set.  The exclusion feature allows a calling function to */
/* compensate for the flaw in this algorithm by excluding multiple   */
/* factor numbers.                                                   */
int findset(modset, lenset, exclude, lenexc)
int *modset;   /* predimensioned array to hold moduli, modset[0] = 2^n */
int lenset;    /* length of requested moduli set */
int *exclude;  /* predimensioned and initialized exclude set */
int lenexc;    /* length of exclude set */
{
    char success;  /* flag */
    int i, j;      /* looping variables */
    int newmod;    /* potential new member of moduli set */
    int gcd();     /* greatest common divisor function */
    /* find first odd number that does not conflict with exclude set */
    newmod = modset[0] - 1;
    if (lenexc != 0) {
        do {
            success = 1;
            for(j=0; j<lenexc; j++) {
                if (newmod == exclude[j]) {
                    newmod -= 2;  /* go to next greatest odd */
                    success = 0;
                    break;  /* the for loop */
                }
            }
        } while(!success);
    }
    modset[1] = newmod;

    /* Using first odd element as a seed, find the remaining relatively */
    /* prime elements to complete the set.  For these elements not only */
    /* must the exclude list be checked, but they must also be          */
    /* relatively prime to the existing elements.                       */
    for(i=2; i<lenset; i++) {
        newmod = modset[i-1] - 2;
        do {
            /* check exclude list */
            if (lenexc != 0) {
                do {
                    success = 1;
                    for(j=0; j<lenexc; j++) {
                        if (newmod == exclude[j]) {
                            newmod -= 2;  /* go to next greatest odd */
                            success = 0;
                            break;  /* the for loop */
                        }
                    }
                } while(!success);
            }
            /* check for relative primality */
            success = 1;
            for(j=1; j<i; j++) {
                if (gcd(modset[j], newmod) != 1) {
                    newmod -= 2;  /* go to next greatest odd */
                    success = 0;
                    break;  /* the for loop */
                }
            }
        } while(!success);
        modset[i] = newmod;
    } /* for */
}
int power(num, pow)
int num;
int pow;
{
int i;
int res;
res = num;
for(i=1; i<pow; i++)
res = num * res;
return(res);
}
/* function gcd returns the greatest common divisor of two integers */
/* using Euclid's Algorithm...                                      */
int gcd(a, b)
int a;
int b;
{
int temp;
int r;
    /* make sure a > b */
    if (a < b) {
        temp = a;
        a = b;
        b = temp;
    }
    /* Euclid's Algorithm */
    do {
        r = a % b;
        a = b;
        b = r;
    } while(r > 0);
return(a);
}