IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. 42, NO. 4, AUGUST 1995
Linear Array Implementation of the EM Algorithm
for PET Image Reconstruction
K. Rajan, L. M. Patnaik, Fellow, IEEE, and J. Ramakrishna
Abstract-The PET image reconstruction based on the EM
algorithm has several attractive advantages over the conventional
convolution backprojection algorithms. However, the PET image
reconstruction based on the EM algorithm is computationally
burdensome for today’s single processor systems. In addition, a
large memory is required for the storage of the image, projection
data, and the probability matrix. Since the computations are
easily divided into tasks executable in parallel, multiprocessor
configurations are the ideal choice for fast execution of the EM
algorithms. In this study, we attempt to overcome these two
problems by parallelizing the EM algorithm on a multiprocessor
system. The parallel EM algorithm on a linear array topology
using the commercially available fast floating point digital signal
processor (DSP) chips as the processing elements (PE’s) has been
implemented. The performance of the EM algorithm on a 386/387
machine, IBM 6000 RISC workstation, and on the linear array
system is discussed and compared. The results show that the
computational speed performance of a linear array using 8 DSP
chips as PE’s executing the EM image reconstruction algorithm
is about 15.5 times better than that of the IBM 6000 RISC
workstation. The novelty of the scheme is its simplicity. The linear
array topology is expandable with a larger number of PE’s. The
architecture is not dependent on the DSP chip chosen, and the
substitution of the latest DSP chip is straightforward and could
yield better speed performance.
I. INTRODUCTION
POSITRON Emission Tomography (PET) is an imaging
technique to visualize the spatial and temporal distribution
of radionuclides inside the human body by measuring
the event counts of positron-electron annihilation. There are
two main approaches for PET image reconstruction: analytic
methods such as the Convolution Back Projection (CBP)
algorithm [13], which was originally devised for computer
aided tomography (CAT), and the iterative algorithms such
as expectation maximization (EM) algorithms. An analytic
algorithm usually consists of two main computations. One
is filtering and the other is back projection. An iterative
algorithm, on the other hand, starts with an initial guess
of the solution and iteratively updates (corrects) the object
according to the computed pseudo-projection and the measured
projection data, till convergence is reached. The stopping rule
proposed in [18] is based on a statistical approach, where
after each iteration, the estimate is accepted or rejected as
the final image based on the result of a statistical hypothesis
test. The major computations in an iterative method
are forward (pseudo-projection) and backward (correction)
projections. The EM algorithm requires longer computation
time than the CBP method. However, the image reconstructed
using the EM algorithm is less noisy than the CBP image and
the EM algorithm does not require the projection data to be
equally spaced.
Manuscript received July 13, 1994; revised December 6, 1994 and April
6, 1995.
K. Rajan is with the Department of Physics, Indian Institute of Science,
Bangalore 560 012, India.
L. M. Patnaik is with the Microprocessor Applications Laboratory,
Indian Institute of Science, Bangalore 560 012, India.
J. Ramakrishna is with the Department of Physics, Indian Institute of
Science, Bangalore 560 012, India.
IEEE Log Number 9413066.
0018-9499/95$04.00 © 1995 IEEE
Various efforts have been made to speed up the image
reconstruction tasks. These efforts essentially fall into three
categories, i.e., algorithmic improvement, dedicated hardware,
and parallel processing. In the first category, most of the
attempts for iterative reconstruction methods have been
concerned with reduction of the number of iterations, i.e., to
make convergence faster [7], [10], [13], [15]. In the second
category, dedicated hardware techniques have been employed
to speed up the computation [8], [9], [17]. To overcome the
two major problems that impede the routine clinical use of the
EM algorithm, i.e., the long computation time
and very large memory requirement, it is imperative to rely
on parallel processing techniques which have the potential to
speed up the reconstruction by several orders of magnitude.
Several attempts at improving the speedup using a multiprocessor
approach have been reported [2], [3], [5], [6]. Chen
et al. have studied the parallelization of the EM algorithm
on a message passing system (Intel iPSC/2) and on a shared
memory system (BBN Butterfly GP 1000) [2]. A data and task
partitioning scheme called partition-by-box is proposed in that
study. The partition-by-box scheme proposed by Chen uses the
broadcast and partial result integration algorithms. The binary
tree architecture is more efficient to perform the broadcast
and integration algorithms. Though Chen et al. have used the
iPSC/2 hypercube system, the pseudo-binary tree embedded
in the hypercube has been used for the EM algorithm. In [3],
Chen et al. have proposed new integration and broadcasting
algorithms for hypercube, ring, and n-D mesh topologies,
which are more efficient than conventional algorithms. A
close look at the EM reconstruction algorithm shows that
most of the computation time is spent in executing multiply
and multiply-accumulate types of instructions. Digital Signal
Processors (DSP's) are processors optimized to execute fast
multiply and multiply-accumulate instructions. In our earlier
study [14], we investigated the implementation of the parallel
EM algorithm on an Extended Hypercube (EH) topology. The
EH is a hierarchical, expansive, recursive structure with a
constant predefined building block. The EH (k, l) (l is the
degree of the EH) is built using basic modules consisting of a
k-cube of processor elements (PE's) and a Network Controller
(NC). The NC is used as a communication processor to handle
intermodule communication; 2^k such basic modules can be
interconnected via 2^k NC's, forming a k-cube among the
NC's. In general, an EH (k, l) consists of one NC at the lth
level, and a k-cube of 2^k NC's/PE's at the (l - 1)st level.
The NC's/PE's at the (l - 2)nd level of hierarchy form 2^k
distinct k-cubes. Thus, we have k-cubes at all levels j for
0 ≤ j ≤ l. An EH (3, 1) with ADSP 21020 DSP devices
as PE's was implemented. Eight DSP chips formed the 3D
cube and one DSP chip was configured as an NC; the NC
was connected to all the eight PE's through direct links. The
PE/NC link was created through a message channel generated
out of a dual-port RAM (DPR). This DPR memory channel
allows overlapped computation and communication tasks. In
addition, the EH executes the integration of partial results
efficiently. The EH topology supports efficient single node
broadcast and multi-node broadcast. These features were used
in the execution of the EM algorithm. It was found that
the computational speed performance of an EH (3, 1) using
DSP chips as PE's executing the EM image reconstruction
algorithm is about 100 times better than that of the IBM RISC
6000 workstation for image sizes of 16*16, 32*32, 64*64, and
128*128. However, the algorithmic complexity is much higher
in the EH. Also, the hardware complexity of the EH with a memory
channel is very high. In this study, the implementation
of the parallel version of the EM algorithm on a linear array
of processors was investigated. The novelty of the scheme is
its simplicity. The linear array topology is very easy to build.
The expansion of the linear array is quite straightforward. With
the announcement of the latest DSP devices with serial and link
ports, it is possible to have glueless interconnection of PE's
to generate large networks. The algorithm also breaks up into
simple subtasks.
II. THE EM ALGORITHM
The EM algorithm is the basic approach used to maximize
the log likelihood objective function for the PET image
reconstruction problem. PET images are used to study
human physiology and organ functions. The patient is given
a tagged substance which emits positrons. Each positron
annihilates with an electron and emits two photons in opposite
directions. The patient is surrounded by a ring of detectors and
the two photons are detected in time coincidence by a pair of
detector elements defining a detector unit or detector tube. The
reconstruction problem in PET is to determine the memory
map of the annihilations from which information about the
regional physiology can be obtained.
In the recent literature, much attention is given to maximum
likelihood reconstruction based on expectation maximization
(EM). These algorithms are appealing because, unlike other
methods such as CBP, they take into account the statistical
nature of the measurements. Dempster et al. [4] presented a
general algorithm to produce maximum likelihood estimates
from incomplete data. Shepp and Vardi [16], and Lange [11]
applied this technique to image reconstruction from PET
measurements. The measurement system is shown in Fig. 1.
Fig. 1. The PET measurement system.
The EM algorithm for image reconstruction can be written
as [16], [11]
$$X[i]^{(n+1)} = X[i]^{(n)} \sum_{j=1}^{N_t} \frac{y[j]\, p(i,j)}{\sum_{i'=1}^{N} X[i']^{(n)}\, p(i',j)}, \qquad i = 1, \ldots, N \tag{1}$$
where
X[i]     number of photon pairs emitted from box i (the
         image to be reconstructed),
n        iteration index,
p(i, j)  probability that a photon emitted from box i is
         detected by tube j,
y[j]     number of photon pairs detected by tube j (projection data),
N        total number of reconstruction boxes, and
Nt       total number of detector tubes.
The standard EM iteration step given by (1) can be rewritten
in an additive form [12] as
$$X[i]^{(n+1)} = X[i]^{(n)} + X[i]^{(n)} \sum_{j=1}^{N_t} \frac{y[j] - \sum_{i'=1}^{N} X[i']^{(n)}\, p(i',j)}{\sum_{i'=1}^{N} X[i']^{(n)}\, p(i',j)}\, p(i,j), \qquad i = 1, \ldots, N. \tag{2}$$
Equation (2) has been implemented on the linear array
system.
The EM algorithm converges toward a possibly unique
maximum of the likelihood, and the image obtained after
convergence is independent of the initial estimate X[i]^(0).
However, if the procedure is stopped before the maximum
likelihood is reached, the initial estimate can strongly
influence the result. All the reconstructions carried out in
this study were started with identical X[i] values.
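The relation between the multiplicative update (1) and the additive form (2) can be checked numerically. The following is a minimal sketch in plain Python, not the authors' DSP code; the function names, the toy 3-box, 2-tube probability matrix, and the counts are illustrative assumptions. It also assumes that each row of p(i, j) sums to 1, which is the normalization under which the two forms coincide.

```python
def pseudo_projection(x, p):
    """Forward step: q[j] = sum over boxes i of x[i] * p[i][j]."""
    n_boxes, n_tubes = len(p), len(p[0])
    return [sum(x[i] * p[i][j] for i in range(n_boxes)) for j in range(n_tubes)]


def em_step_multiplicative(x, y, p):
    """One iteration in the multiplicative form (1)."""
    q = pseudo_projection(x, p)
    return [x[i] * sum(y[j] * p[i][j] / q[j] for j in range(len(y)))
            for i in range(len(x))]


def em_step_additive(x, y, p):
    """One iteration in the additive form (2), using the relative error dP[j]."""
    q = pseudo_projection(x, p)
    dp = [(y[j] - q[j]) / q[j] for j in range(len(y))]
    return [x[i] + x[i] * sum(dp[j] * p[i][j] for j in range(len(y)))
            for i in range(len(x))]


# Toy data (assumed): 3 boxes, 2 tubes, each row of p sums to 1.
p = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]]
x0 = [1.0, 1.0, 1.0]      # identical starting values for all boxes
y = [4.0, 6.0]            # measured counts per tube

a = em_step_multiplicative(x0, y, p)
b = em_step_additive(x0, y, p)
print(a)
print(b)
```

Under the stated normalization both functions return the same image, and the total of the updated box values equals the total measured count.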
III. COMPUTATIONAL COMPLEXITY
The complexity of the EM algorithm is given in [16]. For a
128 x 128 square object, there are 16384 object boxes. The
probability p(i, j) that an emission in box i is detected in
tube j depends on a number of physical factors such as the
geometry of the measurement system, the decomposition of the
object space, the physical properties of the medium, and the
response of the detector system. In this study, it is assumed that
the probability of an emission in box i and its detection in tube
j depends only on the geometry of the measurement system. In
such a case an annihilation event in box i is detected in tube
j with the probability p(i, j) proportional to the angle of view
from the center of box i into the detector tube j. Shepp
et al. [16] have shown that the choice of p(i, j) based only on
the geometry of the measurement system is reasonable, and
that the results of the reconstruction do not depend critically
on the choice of p(i, j). Since there are Nt detector
tubes and N object boxes, the dimension of the probability matrix
p(i, j) is N x Nt. For a circular ring measurement geometry
with Nd detector elements equally spaced around the circle of
radius √2 circumscribing the display boxes, the total number
of detector tubes Nt is given by Nt = (Nd/2) * (Nd/2 + 1)
since there are (Nd/2 + 1) detector intervals opposite each
one [16]. For a system with 128 detectors and an object space
of 128 x 128, the dimension of the probability matrix p(i, j)
is 16384 x 4160. Since only a very small percentage of the
total detector tubes passes through a box, the probability matrix
p(i, j) is highly sparse. To minimize the storage requirement,
only the nonzero p(i, j)'s are stored.
The EM algorithm represents a family of important problems
that are rich in data parallelism. The data parallelism in
the EM algorithm may be described by two spaces, namely,
box and tube spaces. There are three possible data and task
partitioning schemes [2] to solve the image reconstruction
problem based on the EM algorithm. The first approach,
the partition-by-box scheme, is based on partitioning the EM
algorithm based on the box space. In this approach, for both
the forward and backward steps, a box and all the task and
data associated with that box are assigned to a PE.
The second approach, the partition-by-tube scheme, uses the
detector tube as a subpartition. In this scheme, for both forward
and backward steps, a detector tube is assigned to a PE. So
all the task and data associated with that tube are assigned
to a PE. The partition-by-box and partition-by-tube schemes
give almost similar performance. The potential problem of
the partition-by-box and partition-by-tube schemes is that the
computational load is not well balanced. For the partition-by-box
scheme, the computational load associated with each box
in the partition region is different. For the partition-by-tube
scheme, the number of pixels in each of the detector tubes is
different.
In this study, we use the partition-by-tube scheme for
implementation on a linear array. The same logic can be used
for the implementation of the partition-by-box scheme on a
linear array.
The third approach, the partition-by-tube-and-box scheme,
uses the partition-by-tube scheme for the forward step, and
the partition-by-box scheme for the backward step. So the
storage overhead for the partition-by-tube-and-box scheme is
roughly twice that of the other two schemes, because we have
to store the task and data corresponding to the tube space
for the forward step, and the box space data and task for the
backward step.
In order to implement the partition-by-tube scheme, two
approaches are possible. First, the indices of the pixels in
each of the Nt tubes are precomputed and stored. Second,
the indices of the pixels in each of the tubes are computed
in each step. The first approach is faster, but it requires more
memory. Most of the memory is required to store the indices
of the pixels in each of the Nt tubes, the corresponding pixel
values, and the pixel-tube probabilities. The parallel algorithm
formulated is based on the first approach. We introduce
three 1D arrays: integer pixelindices[] to hold the indices
of the pixels in each of the Nt tubes, the corresponding real
pixel-tube-probability[], and integer pixel-count[] to hold the
number of pixels in each tube. The array pixelindices[] is
stored in numerically increasing order to use binary search to
check if a particular pixel lies in the tube processed by that PE.
The partition-by-tube scheme can be easily divided into
tasks executable in parallel. In this study, we are investigating
the implementation of the partition-by-tube scheme on the
linear array multiprocessor system. Each PEt in the linear array
executes the following algorithm:
Linear array implementation of the EM algorithm based
on partition-by-tube
for each detector tube t = 1 to Nt
begin
    for i = 1 to N                    /* ΔP computation */
    begin
        receive X[i] from PEt-1
        transmit X[i] to PEt+1
        update the index of the pixel in PEt register
        binary search 1D array pixelindices[]
        if pixel index matches with any element
        in the array pixelindices[]
            get the corresponding pixel-tube-probability p
            pseudo-projection[t]
                := pseudo-projection[t] + X[i].p
    end
    ΔP[t] = (projection[t] - pseudo-projection[t])
            / pseudo-projection[t]
    for i = 1 to N                    /* correction phase */
    begin
        receive X[i] from PEt-1
        update the index of the pixel in PEt register
        binary search 1D array pixelindices[]
        if pixel index matches with any element
        in the array pixelindices[]
            get the corresponding pixel-tube-probability p
            X[i] = X[i] + X[i].ΔP[t].p    /* partial pixel
                                             value update */
        transmit X[i] to PEt+1
    end
end.
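The per-tube data layout and the ΔP computation described in this section can be sketched in Python as follows. This is an illustrative model of one PE rather than the DSP implementation: the class name `TubePE`, its attribute names, and the toy tube contents are assumptions, and the standard-library `bisect_left` stands in for the binary search over the sorted pixelindices[] array.

```python
from bisect import bisect_left


class TubePE:
    """State held by one PE for a single detector tube (hypothetical layout)."""

    def __init__(self, pixel_indices, probabilities, projection):
        # pixel_indices must be sorted ascending so binary search applies;
        # probabilities[k] is p(i, t) for the pixel i = pixel_indices[k].
        self.pixel_indices = pixel_indices
        self.probabilities = probabilities
        self.projection = projection          # measured count for this tube
        self.pseudo_projection = 0.0
        self.delta_p = 0.0

    def lookup(self, i):
        """Return p(i, t) if pixel i lies in this tube, otherwise None."""
        k = bisect_left(self.pixel_indices, i)
        if k < len(self.pixel_indices) and self.pixel_indices[k] == i:
            return self.probabilities[k]
        return None

    def forward(self, image):
        """Accumulate the pseudo-projection, then form the relative error."""
        self.pseudo_projection = 0.0
        for i, x_i in enumerate(image):       # pixels arrive in index order
            p = self.lookup(i)
            if p is not None:
                self.pseudo_projection += x_i * p
        self.delta_p = ((self.projection - self.pseudo_projection)
                        / self.pseudo_projection)
        return self.delta_p


# Toy tube (assumed): passes through pixels 1, 4, and 6 of an 8-pixel image.
pe = TubePE([1, 4, 6], [0.2, 0.5, 0.3], projection=3.0)
dp = pe.forward([1.0] * 8)
print(dp)
```

Keeping only the nonzero probabilities of a tube, indexed by a sorted pixel list, is what makes the per-pixel membership test logarithmic in the number of pixels in the tube.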
Fig. 2. Linear array multiprocessor system.
The image reconstruction algorithm based on the EM technique
is iterative in nature. The major computations in an
iterative scheme are forward (pseudo-projection) and backward
(correction) operations. From an estimate of the object,
the forward algorithm finds the pseudo-projection on each
detector tube t. The ratio of the difference between the
projection data (actual measurement data) and the
pseudo-projection data (computed value) to the pseudo-projection
data, i.e., ΔP(t), gives the extent of resemblance between the
projection of the object and the projection of the reconstructed
object. Based on ΔP, corrections are made to the initial guess
of the object. In the correction operation, ΔP in the projection
domain is back-projected onto the object domain.
IV. MULTIPROCESSOR ARCHITECTURE
The multiprocessor architecture used for the implementation
of the parallel EM algorithm is shown in Fig. 2. It consists of
a linear connection of Nt identical processors. PEj is connected
to PEj-1 and PEj+1 through a set of high speed
first-in-first-out (FIFO) buffers. PEj can send a 32-b data item to
PEj+1 and receive a 32-b data item from PEj-1 through
the FIFO's. In computing the forward step (pseudo-projection
computation), once the pipeline is filled, all the Nt identical
processors operate in parallel with each processor computing
the pseudo-projection on the tube assigned to that PE. There are
Nt detector tubes; the task and data corresponding
to tube j are preloaded onto PEj. Each PE holds four 1D
arrays: integer pixelindices[] that holds the indices of the
boxes that lie in tube j, real pixel-values of the pixel indices
stored in the array pixelindices[], the corresponding real
box-tube-probability[], and integer pixel-count[] that holds the
number of boxes in tube j. It also gets preloaded with the
projection data corresponding to the tube it processes. For
maximum processing speed, there should be one processor
per detector tube; otherwise each processor will hold data
corresponding to multiple detector tubes. In a configuration
with Np processors and Nt detector tubes, each processor will
be assigned the data and tasks corresponding to Nt/Np tubes.
In order to minimize the potential load imbalance problem,
the data and tasks corresponding to the Nt tubes were
distributed cyclically over the Np PE's.
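As a small worked example of the tube count and the cyclic distribution, the sketch below computes Nt = (Nd/2)(Nd/2 + 1) for Nd = 128 and deals the tubes out to Np = 8 PE's. The round-robin rule (tube t goes to PE t mod Np) is one plausible reading of "distributed cyclically" and is an assumption here, as are the function names.

```python
def tube_count(n_detectors):
    """Nt = (Nd/2) * (Nd/2 + 1) for a ring of Nd detectors."""
    half = n_detectors // 2
    return half * (half + 1)


def cyclic_assignment(n_tubes, n_pes):
    """Assign tube t to PE (t mod n_pes); returns one tube list per PE."""
    owners = [[] for _ in range(n_pes)]
    for t in range(n_tubes):
        owners[t % n_pes].append(t)
    return owners


nt = tube_count(128)                 # 64 * 65 tubes
owners = cyclic_assignment(nt, 8)
print(nt, [len(tubes) for tubes in owners])
```

With these values every PE receives 520 tubes, and neighbouring tubes, which tend to contain similar numbers of pixels, end up on different PE's.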
The host processor sends the pixel intensities to the processor
array. PEj receives the box intensity, and sends the
intensity data to the neighboring node, PEj+1. Each PEj keeps
track of the index of the pixel it currently processes using
a processor register. PEj computes the contribution to the
pseudo-projection on tube j from the current box intensity.
The integer array pixelindices[] is stored in numerically
increasing order. So each PEj uses binary search to find out
whether the current pixel lies in tube j. If the current pixel
lies in tube j, the corresponding probability value is used to
compute the contribution to the pseudo-projection on tube j.
Once PEj completes the processing of a pixel, it receives the
next pixel intensity data from PEj-1. This process is continued
till all the N image pixels are processed. When the complete
image has been processed, the pseudo-projection of the image
is distributed in the local memories of the Np processors in
the linear array with Nt/Np pseudo-projections in each PEj. The
complexity of the binary search for the average case and the
worst case is the same and is given by O(log2 n) where n is
the number of entries in the table. The binary search is quite
simple to implement and is quite efficient.
Each PEj in the array computes ΔP(j), the ratio of the
difference between the projection data (actual measurement
data) and the pseudo-projection data (computed value) on tube
j to the pseudo-projection data on tube j. ΔP(j) gives the extent
of resemblance between the projection of the object and the
projection of the reconstructed object.
For handling the backprojection, the architecture employs
the same N p processors. In the optimal case, with one PE
per detector tube, each processor holds AP for one detector
tube. The host sends the box intensities to the processor array
through the FIFO. PEj receives the box intensity from PEj-1
and keeps track of the index of the pixel it currently processes.
PEj uses binary search to find out whether the current pixel lies
in tube j. If the current pixel lies in tube j, the corresponding
probability value and AP are used to compute the correction
to the pixel intensity. The partially corrected pixel intensity is
passed down the pipeline through the FIFO. PEj receives the
next box intensity data from PEj-1. This process is continued
till all the N box intensities have traversed the pipeline. The
last PE in the linear array gives out the reconstructed image.
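The backward pass through the array can be modeled in software as below. This is a sequential sketch under simplifying assumptions, not a model of the hardware timing: the PE's are visited one after another for each pixel instead of working concurrently, a dictionary lookup replaces the binary search for brevity, and the function name and toy data are hypothetical. The update follows the per-PE rule as listed, so each PE scales the value it receives before passing it on.

```python
def correction_pass(image, tubes):
    """Send every pixel through the chain of PE's and collect the output.

    tubes: one dict per PE with
      "prob"    -> {pixel index: p(i, t)} for the pixels lying in that tube
      "delta_p" -> relative projection error computed in the forward step
    """
    corrected = []
    for i, x_i in enumerate(image):               # host feeds pixels in order
        for tube in tubes:                        # PE_1, PE_2, ... in sequence
            p = tube["prob"].get(i)
            if p is not None:
                x_i = x_i + x_i * tube["delta_p"] * p   # partial update
            # the value moves on to the next PE whether or not it was changed
        corrected.append(x_i)                     # emitted by the last PE
    return corrected


# Toy data (assumed): two PE's, two pixels.
tubes = [
    {"prob": {0: 0.5}, "delta_p": 1.0},
    {"prob": {0: 0.5, 1: 1.0}, "delta_p": -0.5},
]
out = correction_pass([2.0, 4.0], tubes)
print(out)
```

In this toy run pixel 0 is adjusted by both PE's and pixel 1 only by the second, which mirrors how a pixel is touched only by the tubes that contain it.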
V. IMPLEMENTATION OF THE LINEAR ARRAY
The linear array has been implemented using ADSP 21020
[1] DSP chips. Fig. 3 shows the schematic of a PE. 256
Kwords of program memory and 256 Kwords of data memory
are attached to the DSP device through the program memory
space and data memory space respectively. Eight PE's are linearly linked using high-speed first-in-first-out (FIFO) buffers
(Am 7200 High Density FIFO 256 x 9 CMOS Memory).
Fig. 4 shows the connection between the PE's. Four Am 7200
devices are connected in width expansion mode to form a
256 x 36 FIFO buffer. The FIFO buffers are mapped in the
data memory address space of the DSP's. PEj can send data
to PEj+l by writing into the FIFOj+l. PEj can receive data
from PEj-1 by reading the FIFOj.
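A software stand-in for one such PE-to-PE link is sketched below. It is illustrative only: the class name `FifoLink` and its methods are hypothetical, and the model assumes that the sender polls a full flag before writing and the receiver polls an empty flag before reading, which is the usual way a hardware FIFO of fixed depth is used.

```python
from collections import deque


class FifoLink:
    """Bounded first-in-first-out buffer with full and empty status flags."""

    def __init__(self, depth=256):
        self.depth = depth            # 256 words, matching the 256 x 36 buffer
        self._words = deque()

    @property
    def full(self):
        return len(self._words) >= self.depth

    @property
    def empty(self):
        return len(self._words) == 0

    def write(self, word):
        """Sender side: refuse the write while the full flag is set."""
        if self.full:
            return False
        self._words.append(word)
        return True

    def read(self):
        """Receiver side: return None while the empty flag is set."""
        if self.empty:
            return None
        return self._words.popleft()


link = FifoLink(depth=2)              # tiny depth, only to exercise the flags
accepted = [link.write(w) for w in (1.5, 2.5, 3.5)]
received = [link.read(), link.read(), link.read()]
print(accepted, received)
```

Because the buffer decouples the two sides, a PE that needs longer for one pixel only stalls its neighbours once the intervening FIFO has filled or drained.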
The Am 7200 CMOS FIFO is a 256 x 9 dual-port static
RAM array and it stores the data written into it in sequential
order. The dual-port RAM array has dedicated read and write
pointers. The built-in flag logic allows the FIFO to accept and
output data asynchronously and simultaneously. The PEj is
connected to one of the ports of the FIFO and is controlled
by the FIFO control signal full, which gives the status of the
FIFO. It is asserted when the FIFO is full. Similarly, PEj+1 is
connected to the other port of the FIFO and is controlled by the
control signal empty. The bit empty is asserted when the FIFO
is empty, which indicates that no more reads should be made
until PEj writes into the FIFO. The full and empty control
signals of the FIFO are in turn connected to the FLAG-1 and
FLAG-0 of the DSP device respectively. These two DSP flag
bits are programmed as inputs and are tested through program
instructions. Thus the FIFO allows the linear array to operate
asynchronously. During the forward step, PEj receives the
pixel-value from PEj-1 and transmits the same to PEj+1. PEj
starts processing the pixel. The computational load at each of
the PE's is not the same. So PEj may take a little longer to
process a pixel. But as and when it completes, it picks up the
next pixel-value and starts processing.
During the backward step, a pixel-value arrives at a PEj.
The PEj processes the pixel, updates the pixel-value depending
upon the contribution from the tube j and then transmits the
result to PEj+1. The PEj+1 waits till the empty signal changes
state, which indicates that a new pixel-value has arrived.
A typical DSP such as ADSP 21020 has three independent
computation units: an arithmetic and logic unit (ALU), a
multiplier, and a shifter. The computation units perform single
cycle operations. The three units are connected in parallel
and they operate in parallel. In a multi-function instruction,
multiple functional units operate in parallel. A 10-port register
file is used for transferring data among computation units and
data buses, and for storing intermediate results.
ADSP 21020 has two independent memories, one for data
and the other for program instructions and data. Two
independent data address generators (DAG's) and a program sequencer
supply addresses for memory access. A program sequencer with
a 32-word instruction cache allows the ADSP 21020 to access
data from both data memory and program memory and fetch
an instruction. In addition, the DAG's are updated to point to
the next operands. So all these five operations are executed in
the same clock cycle. In addition, the PE has a zero overhead
loop facility with a single cycle setup and exit.
Fig. 3. Schematic diagram of a PE (ADSP 21020 with 256 Kwords of program memory and 256 Kwords of data memory, each on its own address and data buses).
Fig. 4. PE-to-PE FIFO buffer.
TABLE I
EXECUTION TIME FOR ONE ITERATION OF THE EM ALGORITHM (PARTITION-BY-TUBE)
Image size   Linear array (1 PE)   Linear array (8 PE's)   386/387     IBM 6000 workstation
16*16        11.6 ms               2.48 ms                 185 ms      39.38 ms
32*32        91 ms                 19.8 ms                 1.44 s      309 ms
64*64        753.6 ms              162 ms                  12.23 s     2.49 s
128*128      6.97 s                1.41 s                  105.43 s    21.48 s
TABLE II
PERFORMANCE OF THE EM ALGORITHM ON MULTIPLE PROCESSORS (IMAGE SIZE: 64*64)
No. of processors   1          4          8        12         16
Execution time      753.6 ms   323.7 ms   162 ms   107.8 ms   87.5 ms
The ADSP 21020 has an integer multiply-accumulate (MAC) unit
but lacks a floating point MAC unit. Since most of the
computation time of the EM algorithm is spent in executing
multiply and multiply-accumulate instructions, a floating point
MAC would have sped up the execution of the EM algorithm
still further.
In the execution of the EM algorithm, the array pixel-val[]
and floating point array pixel-tube-probability[ ] are stored in
data memory and program memory, respectively. This allows
the PE to read two operands, one from data memory and the
other from the program memory simultaneously. The linear
array is connected to a PC/386 system to use the resources
such as memory, keyboard, display and the disk storage. The
ADSP 21020 operates at 33 MHz with a cycle time of 33 ns.
The fast DSP memories are made up of 20-ns static memories.
Table I gives the execution time for one iteration of the
EM algorithm on a linear array having a single PE, on a linear
array with 8 PE’s, on a 386/387 running at 33 MHz under Unix
operating system and on an IBM 6000 RISC workstation for
the partition-by-tube scheme.
The results show that the computational speed of the 8-PE
linear array for the EM image reconstruction algorithm
is about 15.5 times better than that of an IBM 6000 RISC
workstation. The speed-up of the linear array with 8 nodes
over a single PE is approximately 5. The EM algorithm was
executed on a linear array with 4, 8, 12, and 16 nodes for an
image size of 64*64. Table II shows the performance of the EM
algorithm on multiple processors. The execution time falls
steadily as nodes are added, with the speed-up growing almost
linearly with the number of nodes. The total number
of detector tubes processed is constant for a fixed image size.
The number of detector tubes processed in each of the nodes
decreases in inverse proportion to the number of nodes. In
addition, the memory required to store the data at each node
reduces in the same proportion as the number of PE's increases.
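The speed-up figures quoted above can be recomputed from the tabulated times. The short Python sketch below does this for the 64*64 image; the variable names are illustrative, the times are those of Tables I and II, and the 16-node entry is read here as 87.5 ms.

```python
# Execution times for one iteration on the 64*64 image, in milliseconds.
linear_array_ms = {1: 753.6, 4: 323.7, 8: 162.0, 12: 107.8, 16: 87.5}
ibm_6000_ms = 2490.0                  # 2.49 s on the IBM 6000 workstation

base = linear_array_ms[1]
speedup = {n: base / t for n, t in linear_array_ms.items()}
efficiency = {n: speedup[n] / n for n in speedup}     # speed-up per PE
ratio_vs_ibm = ibm_6000_ms / linear_array_ms[8]

for n in sorted(speedup):
    print(f"{n:2d} PE's: speed-up {speedup[n]:.2f}, "
          f"efficiency {efficiency[n]:.2f}")
print(f"IBM 6000 time / 8-PE array time: {ratio_vs_ibm:.1f}")
```

With these numbers the 8-node array is roughly 4.7 times faster than a single PE and about 15 times faster than the workstation, and the efficiency per PE stays between about 0.5 and 0.6 from 4 to 16 nodes.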
The algorithm can also be implemented on a linear array
of fixed point DSP devices which are available for as low as
$10 per chip.
VI. CONCLUSIONS
In this paper, we have described the parallelization of
the EM algorithm on a linear array topology. The linear
array topology is expandable with a larger number of PE's.
In this current study, a FIFO buffer between the stages has
been used. But with more advanced devices such as an
ADSP 21060 containing ADSP 21020 core, 4 Mb of static
memory with cross bar switch buses, two serial ports, six
4-b link ports and a powerful DMA processor, it is possible
to have glueless interconnection between the processors to
build a large network. The architecture is not dependent
on the DSP chip chosen, and the substitution of the latest
DSP chip is straightforward and could yield better speed
performance. The EM algorithm breaks down to a sequence
of multiply and multiply/accumulate type of instructions. Since
DSP chips are optimized processors for executing multiply and
multiply/accumulate instructions, they give high performance.
It has been found that the computational speed performance
of the 8-node linear array executing the EM image
reconstruction algorithm is about 15.5 times better than that
of the IBM 6000 RISC workstation.
ACKNOWLEDGMENT
The authors would like to thank the reviewers for their
valuable comments and suggestions.
REFERENCES
[1] Analog Devices, User's Manual, ADSP 21020, 1992.
[2] C. M. Chen and S. Y. Lee, "Parallelization of the EM algorithm for
3-D PET image reconstruction," IEEE Trans. Med. Imag., vol. 10, no.
4, pp. 513-522, Dec. 1991.
[3] C. M. Chen and S. Y. Lee, "On parallelizing the EM algorithm for
PET image reconstruction," IEEE Trans. Parallel Dist. Syst., vol. 5,
no. 8, pp. 860-873, Aug. 1994.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood
from incomplete data via the EM algorithm," J. R. Stat. Soc. Series B,
vol. 39, pp. 1-38, 1977.
[5] E. Shieh et al., "High-speed computation of the radon transform and
back projection using an expandable multiprocessor architecture," IEEE
Trans. Circuits Syst. Video Technol., vol. 2, no. 4, pp. 347-359, Dec.
1992.
[6] R. Hartz, D. Bristow, and N. Mullani, "A real-time TOFPET slice back
projection engine employing dual AM 29116 microprocessors," IEEE
Trans. Nucl. Sci., vol. NS-32, no. 1, pp. 839-842, Feb. 1985.
[7] T. Hebert and R. Leahy, "Fast methods for incorporating attenuation in
the EM algorithm," IEEE Trans. Nucl. Sci., vol. 37, no. 2, pp. 754-758,
1990.
[8] W. F. Jones, L. G. Byars, and M. E. Casey, "Design of a super fast
three-dimensional projection system for positron emission tomography," IEEE
Trans. Nucl. Sci., vol. 35, no. 2, pp. 800-804, Apr. 1990.
[9] W. F. Jones, L. G. Byars, and M. E. Casey, "Positron emission
tomographic and expectation maximization: A VLSI architecture for
multiple iterations per second," IEEE Trans.
Nucl. Sci., vol. 35, no. 1, pp. 620-624, Feb. 1988.
[10] L. Kaufman, "Implementing and accelerating the EM algorithm for
PET," IEEE Trans. Med. Imag., vol. 6, pp. 37-50, Mar. 1987.
[11] K. Lange and R. Carson, "EM reconstruction algorithm for emission
and transmission tomography," J. Comput. Assisted Tomography, vol. 8,
pp. 306-316, 1984.
[12] K. Lange, M. Bahn, and R. Little, "A theoretical study of some
maximum likelihood algorithm for emission and transmission tomography,"
IEEE Trans. Med. Imag., vol. MI-6, no. 2, pp. 106-114, 1987.
[13] R. M. Lewitt and G. Muehllehner, "Accelerated iterative reconstruction
for positron emission tomography based on EM algorithm for maximum
likelihood estimation," IEEE Trans. Med. Imag., vol. MI-5, no. 1, pp.
16-22, 1986.
[14] K. Rajan, L. M. Patnaik, and J. Ramakrishna, "High-speed computation
of the EM algorithm for PET image reconstruction," IEEE Trans. Nucl.
Sci., vol. 41, no. 5, Oct. 1994.
[15] N. Rajeevan, K. Rajagopal, and G. Krishna, "Vector-extrapolated fast
maximum likelihood estimation algorithms for emission tomography,"
IEEE Trans. Med. Imag., vol. 11, no. 1, pp. 9-20, Mar. 1992.
[16] L. A. Shepp and Y. Vardi, "Maximum likelihood reconstruction for
emission tomography," IEEE Trans. Med. Imag., vol. MI-1, no. 2, pp.
113-121, Oct. 1982.
[17] C. J. Thompson and T. M. Peters, "A fractional address accumulator
for fast back projection," IEEE Trans. Nucl. Sci., vol. NS-28, no. 4, pp.
3648-3650, Aug. 1981.
[18] E. Veklerov and J. Llacer, "Stopping rule for the MLE algorithm based
on statistical hypothesis testing," IEEE Trans. Med. Imag., vol. MI-6,
pp. 313-319, 1987.