A Pipeline Fast Fourier Transform

IEEE TRANSACTIONS ON COMPUTERS, VOL. C-19, NO. 11, NOVEMBER 1970
HERBERT L. GROGINSKY, SENIOR MEMBER, IEEE, AND GEORGE A. WORKS
Abstract-This paper describes a novel structure for a hardwired
fast Fourier transform (FFT) signal processor that promises to permit
digital spectrum analysis to achieve throughput rates consistent with
extremely wide-band radars. The technique is based on the use of
serial storage for data and intermediate results and multiple arithmetic units each of which carries out a sparse Fourier transform.
Details of the system are described for data sample sizes that are
binary multiples, but the technique is applicable to any composite
number.
Index Terms-Cascade Fourier transform, digital signal processor,
Doppler radar, fast Fourier transform, radar-sonar signal processor,
radix-two fast Fourier transform, real-time signal processor.
Manuscript received November 7, 1969; revised April 27, 1970. This
work was supported by Raytheon research and development funding. A
patent has been filed on the basic structure of this signal processor. This
paper was presented at EASCON'69 (Electronics and Aerospace Systems
Convention), Washington, D. C., October 27-29, 1969.
The authors are with the Raytheon Company, Sudbury, Mass.
INTRODUCTION
This paper describes a novel structure for a hardwired
FFT signal processor that promises to permit digital
spectrum analysis to achieve throughput rates consistent with extremely wide-band radars.
The processor consists of a number of modular units connected in cascade through switches that direct the flow of
information from memory to arithmetic units. The switching required to carry out the process is simple and is controlled by a binary counter. The processor is similar to the
binary analyzer described by Bergland and Hale [1], but
employs only N complex words of storage to compute the
FFT of N complex data samples.¹ Bergland [2] has listed
many alternative organizations of FFT processors. Recently O'Leary [10] has also proposed a similar structure.
We show that the Cooley-Tukey algorithm does a natural interleaving of data gathered by the time multiplexing
of a number of independent channels, typical of radars and
sonars. In this concept, the successive stages or iterations of
the fundamental algorithm are each carried out in the separate cascaded modules. Using shift registers as digital delay
lines permits new data to be entered into the processor while
the processing of earlier data blocks is carried out. In effect
the overall delay required is equal to the time required to
gather the analysis sample block N in each of the separate
channels. As the Nth complex data sample is loaded into
the digital delay line, the first analysis frequency appears at
the output. The output appears in precisely the same channel sequence as the data when they were loaded into the delay line. The output frequencies, however, appear in the
scrambled sequence associated with the algorithm.
The control device, namely the binary counter, yields a digital number identifying both the channel number and the frequency currently being outputted. In addition, it specifies the instants at which the separate modules are to be switched and a digital number identifying the sine/cosine values needed by each of the stages. This structure, although hardwired, does permit the flexible interchange of channels processed for data sample length per channel. Thus a system capable of processing N complex samples in a single channel is also capable of processing N/L samples in each of L channels provided L is a factor of N. The modular design of the device permits the duplicate arithmetic units to weight the input prior to the FFT operation and to present the output in magnitude. Furthermore, it allows computation of Fourier transforms at the rate at which new data can be inserted into the digital delay line. Fundamentally, the signal processing rate is independent of the data sample length.
A pipeline FFT configured to process radar data from a pulse Doppler tracking radar has been designed and tested. The system uses MOSFET shift registers as the digital delay lines, TTL in the arithmetic units and MOS LSI READ ONLY memory to store sine/cosine tables and filter shaping weight functions.
The system processes eight range channels taking 512 complex samples per channel. It is designed to obtain subclutter visibility of 60 dB and achieves this using 12-bit fixed-point internal operations in the arithmetic unit. The throughput rate achieved by the system is 128K samples per second.
¹ Although the processor described here is cascade in structure, we prefer the pipeline designation used by computer designers [11] because this structure permits direct application of pipeline arithmetic techniques.
THEORY OF OPERATION
In this section, the method of operation of the pipeline FFT is explained in terms of the fundamental mathematical operations that must be carried out. An extensive literature now exists [3] explaining the basic principles of the FFT. The discussion here emphasizes certain features of the analysis, permitting a hardwired realization of the algorithm to achieve the goals set forth in the introduction.
The discrete Fourier transform (DFT) is defined by
$$X_m = \sum_{n=0}^{N-1} x_n W_N^{mn} \qquad (1)$$
where
$$W_N = e^{-j(2\pi/N)},$$
$x_n$ is a complex data sample, and $X_m$ is the complex image of the data at frequency m/N.
Theory shows that when N is a composite number
$$N = \prod_{k=1}^{M} r_k \qquad (2)$$
where the $r_k$ are a set of integers (possibly with repeats), (1) may be calculated iteratively in M stages as follows.
$$a^{m}_{\mu,\nu} = \sum_{l=0}^{r_m-1} a^{m-1}_{p,q} W_{R_m}^{\mu l} \qquad (3)$$
with
$$p = \mu \bmod R_{m-1}, \qquad q = \nu + l C_m$$
and the ranges
$$0 \le \nu < C_m, \qquad 1 \le m \le M, \qquad 0 \le \mu < R_m,$$
where
$$R_m = \prod_{k=1}^{m} r_k, \qquad C_m = N/R_m, \qquad a^{0}_{0,n} = x_n, \qquad a^{M}_{\mu,0} = X_\mu,$$
and
$$\mu \bmod r = \mu - r[\mu/r], \qquad [\mu/r] = \text{greatest integer in } \mu/r.$$
In fact, when this is done, the number of calculations drops from $N^2$ complex operations (MULTIPLY and ADD) to $N(r_1 + r_2 + \cdots + r_M)$ such operations. This iterative process reduces the amount of hardware required to realize the operation, and provides the basic pipeline structure permitting new calculations to proceed before the results of earlier calculations are completed. This form of the algorithm is known as the Cooley-Tukey version.
When $r_k = 2$ for all k, the algorithm is conveniently summarized by the flow diagram shown in Fig. 1. In this figure, the input data enter the left-hand column and each successive column corresponds to a later stage of the iterative process. The coefficients indicated at the input then correspond to the index of the input data $x_n$ and give its order in the data stream. The figures shown at each later stage indicate the coefficients of the rotation vector $W_{2^m}$ that must be applied to the lower branch entering each node.
A number of important features of the algorithm may be seen by examining Fig. 1. First, we observe that each stage needs only the data generated from the preceding stage. Second, if each stage is processed in order of arrival, the first stage examines data points displaced by half the data length (N/2), the second by one quarter of the data length (N/4), etc. Third, if the data were available in a continuous stream, the first stage could be processing one block of data while the second stage processed the next earlier block and so on through all M stages. Fourth, the rotation vector required, $W_{2^m}$, has the same periodicity as the data displacement interval. Finally, we note that the output appears in the usual scrambled order of frequency associated with this version of the algorithm.
The significance of these remarks is that each of the stages may be realized with a basic component whose general form is shown in Fig. 2. Any m module alternately transfers blocks of $2^{M-m}$ data samples into the delay line and into the arithmetic unit. When the data block just fills the delay line, the arithmetic unit obtains a rotation vector (from a READ ONLY memory) and begins its operation. The next block of $2^{M-m}$ input data samples is sent to the arithmetic unit, which now produces two complex outputs in response to the two complex inputs it receives. One of the outputs is immediately transferred to the next stage while the other output is sent to the delay line. Thus in the interim period when the delay line is filled with fresh input data, the contents of the
line containing the results of processing the earlier blocks are
transferred to the next stage. The arithmetic unit, of course,
computes the complex two-point transforms, shown below.
$$x_{out} = x_{in} + y_{in} W, \qquad y_{out} = x_{in} - y_{in} W \qquad (4)$$
Fig. 1. Flow diagram of a Cooley-Tukey FFT.
Fig. 2. Pipeline FFT m module (switch and control, data input, delay line of $2^{M-m}$ samples, rotation vector W, data output).
Fig. 3. Pipeline FFT processor (data input, cascaded stages, spectrum output; the counter supplies the output frequency, rotation vector address, and channel number).
Fig. 4. FFT interleaving mechanism.
This module design may be assembled into a system for
computing DFTs in blocks of $N = 2^M$ samples as shown in
Fig. 3. The rotation vector storage shown in the figure is a
table of roots of unity, or sines and cosines, which is shared
by all m modules. Fig. 1 shows that N/2 different rotation
vectors are read to process one block of N samples. Indeed
if they are produced in the order required for the last stage,
the rotation vectors required for the earlier stages may be
obtained by strobing this list at the proper instant in advance of its need. Thus the sines and cosines required for
each stage may be obtained by providing a register for each
arithmetic unit, all driven from a common bus. Note that
exactly M arithmetic units and exactly N-1 complex data
points of storage are needed in this system and that the first
transform output is obtained immediately after the last data
sample in the block of N is received.
The theory cited above leads to still another important
observation, namely that the output at any intermediate
stage of the process has a simple interpretation. Theory
shows that
$$a^{m}_{\mu,\nu} = \sum_{k=0}^{R_m-1} x_{\nu+kC_m} W_{R_m}^{\mu k} \qquad (5)$$
which is precisely the discrete Fourier transform of all
groups of data points separated by an interval Cm. This may
be better understood in reference to Fig. 4, which shows the
natural FFT interleaving mechanism that results from the
Cooley-Tukey algorithm. In terms of a sequential process,
the output of the first stage results in N/2 independent two-point transforms; the output of the second stage yields N/4
independent four-point transforms, etc. Thus, if two independent streams of complex data were entered into the input
interleaved with one another, the module 1 stage of the
cascade processor would produce two independent DFTs
of each data stream. The spectral component of each
channel of data is outputted before the spectral frequency is
changed. In particular for pulsed radar or sonar application, where the data for many range samples are received
before a new sample may be taken, this system permits the
data to be processed in order of arrival with no modification
of the control circuits and without requiring the data to be
reassembled first into consecutive (noninterleaved) data
streams.
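The interleaving property stated in (5) can be checked numerically. The following is a minimal sketch, assuming radix two ($r_k = 2$) and using software dictionaries in place of the hardware delay lines; the names `dft`, `stage`, and `run_stages` are illustrative and do not appear in the paper. It applies the iteration (3) stage by stage to two interleaved four-sample channels and compares the result after two stages with the directly computed DFT of each channel.

```python
import cmath

def dft(x):
    """Direct DFT as in (1): X_m = sum_n x_n * W_N^(m*n), W_N = exp(-2j*pi/N)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]

def stage(a_prev, n_total, m):
    """One radix-2 iteration of (3): builds a^m from a^(m-1).

    Both tables are dicts keyed by (mu, nu).
    """
    r_m = 2 ** m            # R_m
    c_m = n_total // r_m    # C_m = N / R_m
    r_prev = r_m // 2       # R_(m-1)
    a = {}
    for nu in range(c_m):
        for mu in range(r_m):
            p = mu % r_prev
            w = cmath.exp(-2j * cmath.pi * mu / r_m)   # W_(R_m)^mu
            # l = 0 term plus l = 1 term of the sum in (3)
            a[(mu, nu)] = a_prev[(p, nu)] + w * a_prev[(p, nu + c_m)]
    return a

def run_stages(x, stages):
    """Apply the first `stages` iterations, starting from a^0_(0,n) = x_n."""
    a = {(0, n): x[n] for n in range(len(x))}
    for m in range(1, stages + 1):
        a = stage(a, len(x), m)
    return a

# Two hypothetical channels, interleaved sample by sample (N = 8, M = 3).
chan_a = [1, 2, 3, 4]
chan_b = [0, 1j, -1, 2]
x = [v for pair in zip(chan_a, chan_b) for v in pair]

a2 = run_stages(x, 2)                      # R_2 = 4, C_2 = 2
spec_a = [a2[(mu, 0)] for mu in range(4)]  # nu = 0 selects channel A
spec_b = [a2[(mu, 1)] for mu in range(4)]  # nu = 1 selects channel B
```

Here the spectra are indexed in natural order for easy comparison; the hardware described in the text emits the same values in scrambled frequency order.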
Equation (5) shows that the index $\mu$, which is the rotation
vector coefficient in (3), can also be regarded as the current
frequency being outputted, namely $\mu/R_m$. When these normalized frequencies (chosen to make $R_m$ a unit period) are
modified to account for the expanded sampling interval $C_m$,
the current frequency being outputted may be regarded on
absolute scales as the frequency $\mu/(R_m C_m) = \mu/N$. Thus, in
Fig. 1, the coefficient given at every stage in the process indicates not only the coefficient of the rotation vector applied
to the lower branch but the current frequency (in absolute
terms) as well.
The control mechanism for this system is perhaps its
most elegant feature. Fig. 1 shows clearly that a binary
counter driven in synchronism with the data stream generates a signal designating the processing interval (i.e., the
switch position) for each stage. However, if any output is
taken at any intermediate stage, say at the (k+1)st stage, then
the lower k bits of the counter, in normal order, give the
channel number of the data currently being outputted, while
the upper M - k bits, taken in bit-reversed order, give the
frequency that is currently being outputted and, in addition,
describe the address (0) of the sines and cosines needed in
the current computation of the (k+1)st stage. Thus, the
control mechanism contains all the information needed to
carry out the spectral analysis as well as to descramble the
output data. In many applications, it is unnecessary to
descramble the output data provided one can identify the
frequency of any component.
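The counter decoding described above can be sketched in a few lines. This is an illustrative sketch only: the function names and the example sizes (a 3-bit counter with one channel bit) are assumptions, not values taken from the paper.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def decode_counter(count, total_bits, channel_bits):
    """Split a counter value into (channel number, frequency index).

    The lower `channel_bits` bits, in normal order, give the channel;
    the remaining upper bits, bit-reversed, give the frequency index.
    """
    channel = count & ((1 << channel_bits) - 1)
    upper = count >> channel_bits
    frequency = bit_reverse(upper, total_bits - channel_bits)
    return channel, frequency

# Hypothetical example: M = 3 counter bits, k = 1 channel bit (two channels).
labels = [decode_counter(c, 3, 1) for c in range(8)]
```

With these example sizes the frequency index runs 0, 0, 2, 2, 1, 1, 3, 3 while the channel number alternates, so each frequency is emitted for every channel before the frequency changes.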
Indeed the structure shown in Fig. 3 may be readily
modified to calculate transforms when the data samples are
given in scrambled order. This structure also permits the
trade-off of channels processed for data length per channel
by taking outputs at an intermediate stage. The modified
structure produces the output sequence in natural order in
both the channels and in time.
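The module behavior described in this section can be simulated in software. The following is a rough sketch under stated assumptions, not the authors' hardware design: it handles a single channel, radix two, and floating-point arithmetic; the class `Module` and the function `pipeline_fft` are hypothetical names; each module counts its own valid inputs instead of sharing one global counter; and the rotation vector is computed on the fly instead of being read from a shared table.

```python
import cmath
from collections import deque

def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

class Module:
    """One pipeline module: a delay line of 2^(M-m) samples plus a butterfly."""

    def __init__(self, m, num_stages):
        self.m = m
        self.num_stages = num_stages
        self.delay = 2 ** (num_stages - m)
        self.line = deque([None] * self.delay)   # serial delay line
        self.count = 0                           # valid samples received so far

    def step(self, y):
        x = self.line.popleft()                  # sample stored `delay` steps ago
        pos = self.count % (2 ** self.num_stages)
        self.count += 1
        if (pos // self.delay) % 2 == 0:
            # Fill phase: store the input, pass earlier results to the next stage.
            self.line.append(y)
            return x
        # Compute phase: two-point transform (4) on delayed and current samples.
        p = bit_reverse(pos >> (self.num_stages - self.m + 1), self.m - 1)
        w = cmath.exp(-2j * cmath.pi * p / 2 ** self.m)
        self.line.append(x - y * w)              # difference waits in the delay line
        return x + y * w                         # sum goes on immediately

def pipeline_fft(samples, num_stages):
    """Stream samples through the cascade; return the last stage's valid outputs."""
    modules = [Module(m, num_stages) for m in range(1, num_stages + 1)]
    outputs = []
    for value in samples:
        for module in modules:
            value = module.step(value)
            if value is None:                    # pipeline still filling
                break
        else:
            outputs.append(value)
    return outputs

def dft(x):
    """Direct DFT as in (1), used only as a reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]

M = 3
N = 2 ** M
block = [1, 2 - 1j, 0, -1, 3j, 1, 1, -2]         # arbitrary test data
out = pipeline_fft(block + [0] * (N - 1), M)     # N - 1 zeros flush the pipeline
```

In this sketch the first valid output appears as the Nth sample is loaded, the outputs arrive in bit-reversed frequency order, and the delay lines together hold N - 1 samples, in line with the statements in the text.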
COMPUTATIONAL ERRORS AND QUANTIZATION
If the input signal to an FFT machine is obtained from an
analog-to-digital converter, it must be sufficiently finely
quantized so that quantization noise is uncorrelated to
avoid distortion of the Fourier transform. Widrow [4] has
shown that for signals in the presence of Gaussian noise,
choosing the quantization grain or value of the least significant bit of the quantizer equal to three times the noise
standard deviation provides a good approximation to this
condition. The value of the most significant bit of the quantizer must be greater than or equal to one half the peak value
of the input signal (without noise) to avoid peak limiting.
The minimum number of bits Q required to represent a
signal of peak value $\pm V_p$ with additive Gaussian noise of
standard deviation $\sigma_n$ is therefore given by
$$Q = \log_2 (V_p / 3\sigma_n) + 1. \qquad (6)$$
One or more bits may be added to further reduce quantization noise.
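As a worked example of (6), the short sketch below evaluates the expression and rounds up to a whole number of bits. The function name, the rounding step, and the example voltages are assumptions made for illustration.

```python
import math

def min_quantizer_bits(v_peak, sigma_noise):
    """Evaluate (6): Q = log2(V_p / (3 * sigma_n)) + 1.

    Returns the raw value and the value rounded up to an integer word
    length (the rounding is an assumption, not stated in the text).
    """
    q = math.log2(v_peak / (3.0 * sigma_noise)) + 1.0
    return q, math.ceil(q)

# Hypothetical example: 1 V peak signal with 1 mV rms noise.
q_raw, q_bits = min_quantizer_bits(1.0, 1e-3)
```

For these example values the raw result is a little over nine bits, so a ten-bit converter would be the smallest that satisfies (6).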
The number of bits per data word required in an FFT
processor is usually greater than the number of bits Q in
the input word to reduce computational noise. The two-point transform (4), which is the building block of larger
Fourier transforms, may be accomplished by four real
multiplications and six real additions. Following each two-point transform, words may be truncated or rounded to
maintain constant word size or allowed to grow.
Welch [5] has shown that rounding after each two-point
transform leads to a relative rms output error E, which is
bounded above by
$$E \le (0.3)\, 2^{-B + (M+3)/2} / \mathrm{rms(input)} \qquad (7)$$
for a transform of $2^M$ samples using B-bit arithmetic. The
error may be reduced by allowing words to lengthen from
stage to stage. This can be particularly attractive in a cascade machine because most of the storage in such a machine
may be associated with the first few stages of the transform.
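The bound (7) is easy to evaluate for candidate word lengths. The sketch below is illustrative only; the function name is hypothetical, and the input rms value is assumed to be expressed as a fraction of full scale.

```python
def welch_error_bound(bits, log2_n, rms_input):
    """Upper bound (7) on the relative rms output error.

    bits      -- B, the arithmetic word length
    log2_n    -- M, where the transform length is 2**M
    rms_input -- rms value of the input (assumed fraction of full scale)
    """
    return 0.3 * 2.0 ** (-bits + (log2_n + 3) / 2.0) / rms_input

# Hypothetical example: 12-bit arithmetic, 512-point transform (M = 9).
bound_12 = welch_error_bound(12, 9, 0.3)
bound_14 = welch_error_bound(14, 9, 0.3)
```

Each added bit halves the bound, while each doubling of the transform length raises it by a factor of the square root of two.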
ALTERNATE CONFIGURATIONS
The computing module circuitry may be configured in a
number of different ways to meet different requirements.
The arithmetic unit that computes two-point transforms
may employ either four real multiplications and six real
additions or three real multiplications and nine real additions. The multiplications may be performed either before
data are stored in the shift register or after data emerge from
the shift register. Goldstone [6] has shown that multipliers
may be time-shared between adjacent modules by performing half of the required multiplications as data enter the
shift register and the other half as data leave. Some of the
total shift register delay may be incorporated in a pipeline
multiplier for high-speed operation.
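The two operation counts quoted above can be illustrated as follows. The paper does not say which three-multiplication arrangement it has in mind; the sketch uses one well-known arrangement as an assumption, and the function names are illustrative. Complex values are passed as (real, imaginary) pairs so that every real operation is visible.

```python
def butterfly_4mul(x, y, w):
    """Two-point transform (4) with 4 real multiplications and 6 real additions."""
    xr, xi = x
    yr, yi = y
    wr, wi = w
    pr = yr * wr - yi * wi          # 2 multiplications, 1 addition
    pi = yr * wi + yi * wr          # 2 multiplications, 1 addition
    return (xr + pr, xi + pi), (xr - pr, xi - pi)   # 4 additions

def butterfly_3mul(x, y, w):
    """Same transform with 3 real multiplications and 9 real additions."""
    xr, xi = x
    yr, yi = y
    wr, wi = w
    k1 = wr * (yr + yi)             # 1 multiplication, 1 addition
    k2 = yr * (wi - wr)             # 1 multiplication, 1 addition
    k3 = yi * (wr + wi)             # 1 multiplication, 1 addition
    pr = k1 - k3                    # real part of y * w
    pi = k1 + k2                    # imaginary part of y * w
    return (xr + pr, xi + pi), (xr - pr, xi - pi)   # 4 additions

# Small integer example so both forms can be compared exactly.
result_4 = butterfly_4mul((1, 2), (3, 4), (5, 6))
result_3 = butterfly_3mul((1, 2), (3, 4), (5, 6))
```

When the rotation vector comes from a stored table, the sums and differences of its real and imaginary parts could be stored as well, which would reduce the additions performed at run time; that variation is not discussed in the text.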
In order to make use of the speed possible in a pipeline
processor, one word delays must be inserted between the
computing modules. These delays permit each module to
begin computation at the start of a word time rather than
to wait for the preceding modules to compute the input it
requires. These intermodule delays do not appreciably
complicate the control circuitry of the processor, since they
may be compensated by delaying the control and rotation
vector inputs to the module by a delay equal to the total
data delay.
A cascade processor requires no multipliers in the first
two modules. The rotation vectors ($W_4^0$ and $W_4^1$) used in
these modules may be implemented by, at most, a switching
circuit to interchange the real and imaginary part of the
data word and invert the sign of the resulting real part when
rotation by W4 is required. The multipliers may be implemented if desired and used for data windowing.
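A quarter-turn rotation needs no multiplier, as the following minimal sketch shows. The function names are illustrative. Which part is negated after the interchange depends on the direction of rotation, so both directions are given; with $W_N = e^{-j(2\pi/N)}$ as in (1), $W_4$ equals -j.

```python
def rotate_minus_j(re, im):
    """(re + j*im) * (-j) = im - j*re: interchange, then negate the new imaginary part."""
    return im, -re

def rotate_plus_j(re, im):
    """(re + j*im) * (+j) = -im + j*re: interchange, then negate the new real part."""
    return -im, re

# Example word 3 + j4 rotated in each direction.
minus = rotate_minus_j(3, 4)
plus = rotate_plus_j(3, 4)
```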
Data windowing, or multiplying input data samples by a
data window function, is a technique used to change the
frequency response of the equivalent N filters whose output
is computed by the FFT. If an FFT of N samples taken at
equal intervals T is computed, a data window that is exactly
zero everywhere except over sampling interval of duration
NT must be used. If no data windowing is intentionally
performed, input data samples are weighted by a data
window function that is unity over the sampling interval
and zero elsewhere.
The frequency response of the equivalent FFT filters is
given by the discrete Fourier transform of the data window
function. The frequency response of an FFT filter when no
data windowing is performed is therefore the DFT of a unit-amplitude pulse of duration NT, which (normalized) is
$$H(f) = \frac{\sin \pi N T f}{N \sin \pi T f}. \qquad (8)$$
The relatively slow decrease in amplitude with increasing
frequency of this function makes it undesirable for Fourier
analyzer applications, in which data windowing is commonly used. Frequently used data window functions include the Hamming [7], Hanning [7], Dolph-Tchebyscheff
[8], and Taylor [9] functions. The properties of these functions have been extensively described.
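The slow decay of (8) can be seen by evaluating it between the nulls. The sketch below is an illustration with assumed values (N = 512 and a 16 kHz sample rate, matching the figures quoted for the implemented machine); the function name is hypothetical.

```python
import math

def rect_window_response(f, n, t):
    """Normalized response (8) of an unwindowed N-point FFT filter.

    f -- frequency offset in Hz (assumed |f| < 1/t)
    n -- number of samples N
    t -- sampling interval T in seconds
    """
    denom = n * math.sin(math.pi * t * f)
    if abs(denom) < 1e-12:
        return 1.0                      # limiting value at f = 0
    return math.sin(math.pi * n * t * f) / denom

N, T = 512, 1.0 / 16000.0
bin_width = 1.0 / (N * T)               # spacing between adjacent filters

# Response midway between the first and second nulls, in dB relative to the peak.
first_sidelobe_db = 20 * math.log10(abs(rect_window_response(1.5 * bin_width, N, T)))
# Response one filter spacing away, where (8) has a null.
null_value = rect_window_response(bin_width, N, T)
```

The first sidelobe is only about 13 dB below the main response, far short of the 60-dB subclutter visibility sought in the implementation, which is why a data window is applied.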
IMPLEMENTATION
The pipeline structure was used to implement a real-time
FFT machine to provide spectral analysis for a tracking
pulse-Doppler radar. This machine simultaneously processes eight channels of Doppler data at a sample rate of
16 000 complex samples per second per channel with a transform length of 512 complex samples and a quantization of
24 bits per complex sample. The FFT throughput rate is
slightly over three million bits per second. A Taylor data
window is employed to permit 60-dB subclutter visibility of
high-speed targets.
The machine is composed of nine processing modules, a
synchronizer, A-D converter, and display. The processing
modules employ a total of 2600 TTL integrated circuits for
switching and arithmetic functions and a total of 500 200-bit MOSFET shift registers for delay. Words are stored in
bit-serial form in the shift registers, but arithmetic operations are performed in bit-parallel. Interstage delay is incorporated in the arithmetic units. Data windowing is performed by the "spare" multiplier in the first processing
module. An eight complex sample input buffer following the
A-D converter allows the FFT machine to operate at a constant word rate equal to eight times the 8-16 kHz radar
PRF, independent of the ranges at which samples are collected. Fig. 5 shows a photograph of the complete breadboard FFT cascade processor.
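The figures quoted in this section can be cross-checked with simple arithmetic. The sketch below is a rough consistency check, not a description of the actual design: the delay-line estimate assumes that each module's delay line is eight times longer than in the single-channel case so that it holds the interleaved channels, and it ignores interstage delay and the input buffer.

```python
channels = 8
sample_rate_per_channel = 16_000        # complex samples per second
bits_per_complex_sample = 24
transform_length = 512                  # complex samples per channel

word_rate = channels * sample_rate_per_channel        # complex words per second
bit_rate = word_rate * bits_per_complex_sample        # bits per second
stages = transform_length.bit_length() - 1            # log2(512) radix-2 stages

# Assumed delay-line storage: 8 * (256 + 128 + ... + 1) complex words.
delay_words = channels * (transform_length - 1)
delay_bits = delay_words * bits_per_complex_sample
available_bits = 500 * 200              # 500 shift registers of 200 bits each
```

Under these assumptions the word rate matches the 128K samples per second quoted in the introduction, the bit rate is slightly over three million bits per second, the stage count matches the nine processing modules, and the estimated delay storage fits within the stated shift register capacity.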
CONCLUSIONS
The pipeline FFT processor has proven to be an effective
tool to meet the real-time spectral analysis requirements of
many radar and sonar systems. Its structure permits sufficient paralleling of operations, such that the processing time
is limited solely by the time it takes to collect the data. It is
efficient in storage requirement since it requires only as
much storage as the number of samples to be processed. It
is simple to control because basically all of the control information can be generated by a binary counter. Furthermore the use of this counter guarantees proper synchronization of the control function with the data stream passing
through the device.
Perhaps its greatest disadvantage is the scrambled order
in which the output appears. In the radar/sonar applications, for which the device was designed, this fault was
transparent. It does, however, make certain operations in
the frequency domain, such as smoothing over frequency,
somewhat more difficult. Descrambling is possible in a
pipeline processor at the cost of additional memory less
than the number of words in the processed data block.
Fig. 5. Pipeline processor machine.
The remarkable flexibility with which the structure may
be reconfigured to trade channels for data sample length
and to carry out the inverse transform functions is one of its
most satisfying properties. Properly utilized, the device is
able to carry out auto- and cross-correlation, block convolutions, and cross-spectral density calculations.
REFERENCES
[1] G. D. Bergland and H. W. Hale, "Digital real-time spectral analysis,"
IEEE Trans. Electronic Computers, vol. EC-16, pp. 180-185, April
1967.
[2] G. D. Bergland, "Fast Fourier transform hardware implementations-An overview," IEEE Trans. Audio Electroacoust., vol.
AU-17, pp. 104-108, June 1969.
[3] G. D. Bergland, "A guided tour of the fast Fourier transform," IEEE Spectrum, vol. 6, pp. 41-52, July 1969.
[4] B. Widrow, "Statistical analysis of amplitude-quantized sampled-data systems," AIEE Trans., vol. 79, pt. 2, pp. 555-567, January 1961.
[5] P. D. Welch, "A fixed-point fast Fourier transform error analysis,"
IEEE Trans. Audio Electroacoust., vol. AU-17, pp. 151-157, June
1969.
[6] B. J. Goldstone, "Serial FFT-More efficient utilization of the multiplier," Raytheon Co. internal memo BFX-R-29, October 1968.
[7] R. B. Blackman and J. W. Tukey, The Measurement of Power
Spectra. New York: Dover, 1958.
[8] C. L. Dolph, "A current distribution for broadside arrays which
optimizes the relationship between beam width and side-lobe level,"
Proc. IRE, vol. 34, pp. 335-348, June 1946.
[9] T. T. Taylor, "Design of line-source antennas for narrow beamwidth
and low sidelobes," IRE Trans. Antennas Propag., vol. AP-3, pp.
16-28, January 1955.
[10] G. O'Leary, "A high-speed cascade fast Fourier transformer," presented at IEEE Arden House Workshop on Digital Filtering, January 1970.
[11] W. R. Graham, "The parallel, pipeline and conventional computer,"
Datamation, vol. 16, pp. 68-71, April 1970.