MorphoSys: An Integrated Reconfigurable System for Data-Parallel Computation-Intensive Applications
Hartej Singh, Ming-Hau Lee, Guangming Lu, Fadi J. Kurdahi and Nader Bagherzadeh,
Department of Electrical and Computer Engineering,
University of California, Irvine, CA 92697
Abstract: In this paper, we propose the MorphoSys reconfigurable system, which is targeted at data-parallel and computation-intensive applications. This architecture combines a reconfigurable array
of processor cells with a RISC processor core and a high bandwidth memory interface unit. We
introduce the system-level model and describe the array architecture, its configuration memory,
interconnection network, role of the control processor and related components. We demonstrate the
flexibility and efficacy of MorphoSys by simulating video compression (MPEG-2) and target-recognition applications on its behavioral VHDL model. Upon evaluating the performance of these
applications in comparison with other implementations and processors, we find that MorphoSys
achieves performance improvements of more than an order of magnitude. The MorphoSys architecture
demonstrates the effectiveness of utilizing reconfigurable processors for general-purpose as well as
embedded applications.
Index Terms: Reconfigurable processors, reconfigurable cell array, SIMD (single instruction
multiple data), context switching, automatic target recognition, template matching, multimedia
applications, video compression, MPEG-2.
1. Introduction
Reconfigurable computing systems are systems that combine reconfigurable hardware with software
programmable processors. These systems have some ability to configure or customize a part of the
hardware unit for one or more applications [1]. Reconfigurable computing is a hybrid approach
between the extremes of ASICs (Application-specific ICs) and general-purpose processors. A
reconfigurable system would generally have wider applicability than an ASIC and better
performance than a general-purpose processor.
The significance of reconfigurable systems can be illustrated using an example. Many applications
have a heterogeneous nature, and comprise several sub-tasks with different characteristics. Thus, a
multi-media application may include a data-parallel task, a bit-level task, irregular computations,
some high-precision word operations and perhaps a real-time component. For these heterogeneous
applications with wide-ranging sub-tasks, the ASIC approach would mandate a large number of
separate chips, which is uneconomical. Also, most general-purpose processors would very likely not
satisfy the performance constraints for the entire application. However, a reconfigurable system
may be designed so that it can be optimally reconfigured for each sub-task through a configuration
plane. Such a system would have a very high probability of meeting the application constraints within
the same chip. Moreover, it would be useful for general-purpose applications, too.
Conventionally, the most common devices used for reconfigurable computing are field
programmable gate arrays (FPGAs) [2]. FPGAs allow designers to manipulate gate-level devices
such as flip-flops, memory and other logic gates. However, FPGAs have full utility only for bit-level
operations. They are slower than ASICs, have lower logic density and perform inefficiently for
8-bit or wider datapath operations. Hence, many researchers have proposed other
models of reconfigurable computing systems that target different applications. PADDI [3], rDPA
[4], DPGA [5], MATRIX [6], Garp [7], RaPiD [8], REMARC [9], and RAW [10] are some of the
systems that have been developed as prototypes of reconfigurable computing systems. These are
discussed briefly in a following section.
1.1 MorphoSys: An Integrated System with Reconfigurable Array Processor
In this paper, we propose MorphoSys, as an implementation of a novel model for reconfigurable
computing systems. This design model, shown in Figure 1, involves having a reconfigurable SIMD
component on the same die with a powerful general-purpose RISC processor, and a high bandwidth
memory interface. The intent of MorphoSys architecture is to demonstrate the viability of this
model. This integrated architecture model has the potential to satisfy the increasing
demand for low-cost stream/frame data processing needed for multimedia applications.
Figure 1: An Integrated Architectural Model for Processors with Reconfigurable Systems. [Block diagram: a MorphoSys chip containing a main processor (e.g., an advanced RISC) with instruction and data caches, a reconfigurable processor array, and a high bandwidth memory interface, connected over a system bus to external memory (e.g., SDRAM, RDRAM).]
For the current implementation, the reconfigurable component is in the form of an array of
processor cells which is controlled by a basic version of a RISC processor. Thus, MorphoSys may
also be classified as a reconfigurable array processor. MorphoSys targets applications with inherent
parallelism, high regularity, computation-intensive nature and word-level granularity. Some
examples of these applications are video compression (DCT/IDCT, motion estimation), graphics
and image processing, DSP transforms, etc. However, MorphoSys is flexible and robust to also
support complex bit-level applications such as ATR (Automatic Target Recognition) or irregular
tasks such as zig-zag scan, and provide high precision multiply-accumulates for DSP applications.
1.2 Organization of paper
Section 2 provides brief explanations of some terms and concepts used frequently in reconfigurable
computing. Section 3 presents a brief review of relevant research contributions. Section 4 introduces
the system model for MorphoSys, our prototype reconfigurable computing system. Section 5
describes the architecture of the MorphoSys reconfigurable cell array and associated
components. Section 6 describes the programming and simulation environment, including mView, a
graphical user interface for programming and simulating MorphoSys. Section 7 illustrates the
mapping of a set of applications (video compression and ATR) to MorphoSys. We provide
performance estimates for these applications, as obtained from simulation of behavioral VHDL
models and compare them with other systems and processors. Finally, we present conclusions from
this research effort in Section 8.
2. Taxonomy for Reconfigurable Systems
In this section, we provide definitions for parameters that are frequently used to characterize the
design of a reconfigurable computing system.
(a) Granularity (fine versus coarse): This refers to the data size for operations. Bit-level
operations correspond to fine granularity, while coarse granularity implies operations
on word-size data. Depending upon the granularity, the reconfigurable component may be a
look-up table, a gate, an ALU-multiplier, etc.
(b) Depth of Programmability (single versus multiple): This pertains to the number of
configuration planes resident in a reconfigurable system. Systems with a single configuration
plane have limited functionality. Other systems with multiple configuration planes may
perform different functions without having to reload configuration data.
(c) Reconfigurability (static versus dynamic): A system may need to be frequently reconfigured
for executing different applications. Reconfiguration is either static (execution is interrupted)
or dynamic (in parallel with execution). Single configuration systems typically have static
reconfiguration. Dynamic reconfiguration is very useful for multi-configuration systems.
(d) Interface (remote versus local): A reconfigurable system has a remote interface if the
system’s host processor is not on the same chip/die as the programmable hardware. A local
interface implies that the host processor and programmable logic reside on the same chip.
(e) Computation model: For most reconfigurable systems, the computation model may be
described as either SIMD or MIMD. Some systems may also follow the VLIW model.
3. Related Research Contributions
There has been considerable research effort to develop reconfigurable computing systems. Research
prototypes with fine-grain granularity include Splash [11], DECPeRLe-1 [12], DPGA [5] and Garp
[7]. Array processors with coarse-grain granularity, such as PADDI [3], rDPA [4], MATRIX [6], and
REMARC [9] form another class of reconfigurable systems. Other systems with coarse-grain
granularity include RaPiD [8] and RAW [10].
• The Splash [11] and DECPeRLe-1 [12] computers were among the first research efforts in
reconfigurable computing. Splash, a linear array of processing elements with limited routing
resources, is useful mostly for linear systolic applications. DECPeRLe-1 is organized as a two-dimensional
array of 16 FPGAs with more extensive routing. Both systems are fine-grained,
with remote interface, single configuration and static reconfigurability.
• PADDI [3] has a set of concurrently executing 16-bit functional units (EXUs). Each of these has
an eight-word instruction memory. The EXU communication network uses crossbar switches.
Each EXU has dedicated hardware for fast arithmetic operations. Memory resources are
distributed among EXUs. PADDI targets real-time DSP applications (filters, convolvers, etc.).
• rDPA: The reconfigurable data-path architecture (rDPA) [4] consists of a regular array of
identical data-path units (DPUs). Each DPU consists of an ALU, a micro-programmable control
and four registers. The rDPA array is dynamically reconfigurable and scalable. The ALUs are
intended for parallel and pipelined implementation of complete expressions and statement
sequences. The configuration is done through mapping of statements in high-level languages to
rDPA using DPSS (Data Path Synthesis System).
• MATRIX: This architecture [6] is unique in that it aims to unify resources for instruction storage
and computation. The basic unit (BFU) can serve either as a memory or a computation unit. The
8-bit BFUs are organized in an array, and each BFU has a 256-word memory, an ALU-multiply
unit and reduction control logic. The interconnection network has a hierarchy of three levels. It
can deliver up to 10 GOPS (Giga-operations/s) with 100 BFUs when operating at 100 MHz.
• RaPiD: This is a linear array (8 to 32 cells) of functional units [8], configured to form a linear
computation pipeline. Each array cell has an integer multiplier, three ALUs, registers and local
memory. Segmented buses are used for efficient utilization of interconnection resources. It
achieves performance close to its peak of 1.6 GOPS for applications such as FIR filters or motion
estimation.
• REMARC: This system [9] consists of a reconfigurable coprocessor, which has a global control
unit for 64 programmable blocks (nano processors). Each 16-bit nano processor has a 32-entry
instruction RAM, a 16-bit ALU, a 16-entry data RAM, an instruction register, and several registers
for program data, input data and output data. The interconnection is two-level (2-D mesh and
global buses across rows and columns). The global control unit (a 1024-instruction RAM with
data and control registers) controls the execution of the nano processors and transfers data
between the main processor and the nano processors. This system performs remarkably well for
multimedia applications, such as MPEG encoding and decoding (though it is not specified whether it
satisfies the real-time constraints).
• RAW: The main idea of this approach [10] is to implement a highly parallel architecture and
fully expose low-level details of the hardware architecture to the compiler. The Reconfigurable
Architecture Workstation (RAW) is a set of replicated tiles, where each tile contains a simple
RISC processor, some bit-level reconfigurable logic and some memory for instructions and data.
Each RAW tile has an associated programmable switch which connects the tiles in a wide-channel
point-to-point interconnect. When tested on benchmarks ranging from encryption and
sorting to FFT and matrix operations, it provided gains from 1X to 100X, as compared to a Sun
SparcStation 20.
• DPGA: A fine-grain prototype system, the Dynamically Programmable Gate Array (DPGA) [5]
uses traditional 4-input lookup tables as the basic array element. DPGA supports rapid run-time
reconfiguration. Small collections of array elements are grouped as sub-arrays that are tiled to
form the entire array. A sub-array has complete row and column connectivity. Reconfigurable
crossbars are used for communication between sub-arrays. The authors suggest that DPGAs may
be useful for implementing systolic pipelines, utility functions and even FSMs, with utilization
gains of 3-4X.
• Garp: This fine-grained approach [7] has been designed to fit into an ordinary processing
environment, where a host processor manages the main thread of control while only certain loops
and subroutines use the reconfigurable array for a speedup in performance. The array is composed
of rows of blocks, which resemble the CLBs of the Xilinx 4000 series [13]. There are at least 24
columns of blocks, while the number of rows is implementation-specific. The blocks operate on
2-bit data. There are vertical and horizontal block-to-block wires for data movement within the
array. Separate memory buses move information (data as well as configuration) in and out of the
array. Speedups ranging from 2X to 24X are obtained for applications such as encryption, image
dithering and sorting.
4. MorphoSys: Components, Features and Program Flow
Figure 2 shows the organization of the integrated MorphoSys reconfigurable computing system. It
is composed of an array of reconfigurable cells (RC Array) with its configuration data memory
(Context Memory), a control processor (Tiny RISC), a data buffer (Frame Buffer) and a DMA
controller.
Figure 2: Block diagram of MorphoSys (M1 chip). [The M1 chip integrates the Tiny RISC core processor with its instruction and data cache, the 8 x 8 RC Array, the Context Memory (2 x 8 x 16 context words of 32 bits), the Frame Buffer (2 x 2 x 64 x 64) and the DMA Controller, with a bus interface to external SDRAM main memory.]
The correspondence between this figure and the architectural model in Figure 1 is as follows: the
RC Array with its Context Memory corresponds to the reconfigurable processor array (SIMD coprocessor), the Tiny RISC corresponds to the Main Processor, and the high-bandwidth memory
interface is implemented as the Frame Buffer and the DMA Controller.
4.1 System Components
Reconfigurable Cell Array: The main component of MorphoSys is the 8 x 8 RC (Reconfigurable
Cell) Array, shown in Figure 3. Each RC has an ALU-multiplier and a register file and is configured
through a 32-bit context word. The context words for the RC Array are stored in Context Memory.
Figure 3: MorphoSys 8 x 8 RC Array with 2-D Mesh and Complete Quadrant Connectivity. [The array consists of 64 identical Reconfigurable Cells (RC) arranged as an 8 x 8 grid.]
Host/Control processor: The controlling component of MorphoSys is a 32-bit processor, called
Tiny RISC. This is based on the design of a RISC processor in [14]. Tiny RISC handles general-purpose operations and also controls the operation of the RC Array. It initiates all data transfers to and
from the Frame Buffer and configuration data load for the Context Memory.
Frame Buffer: An important component is the two-set Frame Buffer, which is analogous to a data
cache. It makes memory accesses transparent to the RC Array by overlapping computation
with data loads and stores, alternating between the two sets. MorphoSys performance benefits
tremendously from this data buffer. A dedicated data buffer has been missing in most
contemporary reconfigurable systems, with a consequent degradation of performance.
4.2 Features of MorphoSys
The RC Array follows the SIMD model of computation. All the RCs in the same row (or column) share the
same configuration data, but each RC operates on different data. Sharing the context across a
row or column is useful for data-parallel applications. In brief, the important features of MorphoSys are:
Coarse-level granularity: MorphoSys differs from bit-level FPGAs and other fine-grain
reconfigurable systems, in that it operates on 8 or 16-bit data. This ensures better silicon utilization
(higher logic density), and faster performance for word-level operations as compared to FPGAs.
MorphoSys is free from variable wire propagation delays, an undesirable characteristic of FPGAs.
Configuration: The RC Array is configured through context words. Each context word specifies an instruction
opcode for the RC and provides control bits for its input multiplexers. It can also specify a constant value
for computations. The configuration data is stored as context words in the Context Memory.
Considerable depth of programmability: The Context Memory can store up to 32 planes of
configuration. The user has the option of broadcasting contexts across rows or columns.
Dynamic reconfiguration capability: MorphoSys supports dynamic reconfiguration. Context data
may be loaded into a non-active part of the Context Memory without interrupting RC Array
operation. Context loads and reloads are specified through Tiny RISC and actually done by the
DMA controller.
Local/Host Processor and High-Speed Memory Interface: The control processor (Tiny RISC) and
the RC Array are resident on the same chip. This prevents I/O limitations from affecting
performance. In addition, the memory interface is through an on-chip DMA Controller, for faster
data transfers between main memory and Frame Buffer. It also helps to reduce reconfiguration time.
4.3 Tiny RISC Instructions for MorphoSys
Several new instructions were introduced in the Tiny RISC instruction set for effective control of
the MorphoSys RC Array operations. These instructions are summarized in Table 1. They perform
the following functions:
• data transfer between main memory (SDRAM) and the Frame Buffer,
• loading of context words from main memory into the Context Memory, and
• control of execution of the RC Array.
There are two categories of these instructions: DMA instructions and RC instructions. The DMA
instruction fields specify load/store, memory address, number of bytes to be transferred and Frame
Buffer or Context Memory address. The RC instruction fields specify context for execution, Frame
Buffer address and broadcast mode (row/column, broadcast versus selective).
Table 1: Modified Tiny RISC Instructions for MorphoSys M1 Chip

| Mnemonic | Description of Operation |
|---|---|
| LDCTXT | Initiate loading of context into Context Memory |
| LDFB, STFB | Initiate data transfers between Frame Buffer and memory |
| CBCAST | Execute (broadcast) specific context in RC Array |
| SBCB | Execute RC context, and read one operand data from Frame Buffer into RC Array |
| DBCBSC, DBCBSR | Execute RC context on one specific column or row, and read two operand data from Frame Buffer into RC column |
| DBCBAC, DBCBAR | Execute RC context, and read two operand data from Frame Buffer into RC column |
| WFB, WFBI | Write data from specific column of RC Array into Frame Buffer (with indirect or immediate address) |
| RCRISC | Write data from RC Array to Tiny RISC |
4.4 MorphoSys Program Flow
Next, we illustrate the typical operation of the MorphoSys system. The Tiny RISC processor
handles the general-purpose operations itself. Specific parts of applications, such as multimedia
tasks, are mapped to the RC Array. The Tiny RISC processor initiates the loading of the context
words (configuration data) for these operations from external memory into the Context Memory
through the DMA Controller (Figure 2) using the LDCTXT instruction. Next, it issues the LDFB
instruction to signal the DMA Controller to load application data, such as image frames, from main
memory to the Frame Buffer. At this point, both configuration and application data are ready.
Now, Tiny RISC issues CBCAST, SBCB or one of the four DBCB instructions to start execution of
the RC Array. These instructions specify the particular context (among the multiple contexts in
Context Memory) to be executed by the RCs. As shown in Table 1, there are two modes of
specifying the context: column broadcast and row broadcast. For column (row) broadcast, all the
RCs in the same column (row) are configured by the same context word. Tiny RISC can also enable
selective functioning of a row/column, and can access data from selected RC outputs.
While the RC Array is executing instructions, and using data from the first set of the Frame Buffer,
the Tiny RISC initiates transfer of data for the next computation into the second set of the Frame
Buffer using the LDFB instruction. The DMA Controller conducts the actual transfer of data, so the
execution of the RC Array is interrupted minimally. The RC Array can meanwhile also write back
to the first set of the Frame Buffer using the WFB instruction. When the RC Array execution on the
first data set completes, fresh data is available in the second set. Thus the RC Array does not have to
wait for data load/store, but can continue execution, while the previous output data is written out to
memory and another set of data is loaded into the first set of the Frame buffer.
When the context data needs to be changed, it can also be done in the background using the DMA
Controller. While the RC Array is operating on some context data, other parts of the Context
Memory can be updated, providing fast run-time reconfiguration.
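To make this flow concrete, the sketch below models the double-buffered control loop in Python. The helper functions are hypothetical stand-ins for the Tiny RISC instruction sequences of Table 1, not an actual API, and the loop structure is a simplification under those assumptions.

```python
# Sketch of the double-buffered MorphoSys program flow described above.
# The helpers are hypothetical stand-ins for Tiny RISC instructions:
# dma_load_context ~ LDCTXT, dma_load_frame ~ LDFB,
# rc_execute ~ CBCAST/SBCB/DBCB, rc_write_back ~ WFB.

def dma_load_context(program):     print("LDCTXT", len(program), "context planes")
def dma_load_frame(frame, fb_set): print("LDFB  ", frame, "-> FB set", fb_set)
def rc_execute(plane, fb_set):     print("CBCAST plane", plane, "on FB set", fb_set)
def rc_write_back(fb_set):         print("WFB    results <- FB set", fb_set)

def run_kernel(frames, context_program):
    dma_load_context(context_program)       # load all context planes once
    dma_load_frame(frames[0], fb_set=0)     # prime Frame Buffer set 0
    for i, frame in enumerate(frames):
        active, spare = i % 2, 1 - (i % 2)
        if i + 1 < len(frames):
            # Prefetch the next frame into the spare set; the DMA Controller
            # performs the transfer while the RC Array keeps executing.
            dma_load_frame(frames[i + 1], fb_set=spare)
        for plane in context_program:       # execute each context plane
            rc_execute(plane, fb_set=active)
        rc_write_back(fb_set=active)        # write results back, then swap sets

run_kernel(frames=["frame0", "frame1", "frame2"], context_program=[0, 1, 2])
```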
5. Design of RC Array, Context Memory and Interconnection Network
In this section, we describe three major components of MorphoSys: the reconfigurable cell, the
context memory, and the three-level interconnection network of the RC array.
5.1 Architecture of Reconfigurable Cell
The reconfigurable cell (RC) array is the programmable core of MorphoSys. It consists of an 8x8
array (Figure 3) of identical Reconfigurable Cells (RC). Each RC (Figure 4) is the basic unit of
reconfiguration. Its functional model is similar to the data-path of a conventional processor.
As Figure 4 shows, each RC comprises an ALU-multiplier, a shift unit, and two multiplexers for
ALU inputs. It has an output register, a feedback register, and a small register file. A context word,
loaded from Context Memory and stored in the context register (Section 5.2), defines the
functionality of the ALU. It also provides control bits to input multiplexers, determines where the
operation result is to be stored and the direction/amount of shift at the output. In addition, the
context word can also specify an immediate value (constant).
Figure 4: Reconfigurable Cell Architecture. [Each RC contains two input multiplexers (MUX A, MUX B), an ALU-multiplier (ALU+MULT), a shift unit, an output register (O/P REG), a feedback register (FB REG), a four-entry register file (R0-R3), a flag register, a packing register, and a context register that holds the 32-bit context word from the Context Memory.]
ALU-Multiplier unit: The ALU has 16-bit inputs, and the multiplier has 16 x 12-bit inputs.
Externally, the ALU-multiplier has four input ports. Two ports, Port A and Port B, receive data from the
outputs of the input multiplexers. The third input (12 bits) takes a value from the constant field in the
context register (Figure 5). The fourth port takes its input from the output register. The ALU adder
has been designed for 28-bit inputs. This prevents loss of precision during multiply-accumulate
operations, even though each multiplier output may be much more than 16 bits, i.e. a maximum of 28
bits. Besides standard logic and arithmetic functions, the ALU has several additional functions,
e.g. computing the absolute value of the difference of two numbers and a single-cycle multiply-accumulate
operation. The total number of ALU functions is about thirty.
Input multiplexers: The two input multiplexers (Figure 4) select one of several inputs for the ALU.
Mux A (16-to-1) provides values from the outputs of the four nearest neighbors, and outputs of
other cells in the same row and column (within the quadrant). It also has an express lane input (as
explained in sub-section on interconnection network), an array data bus input, a feedback input, a
cross-quadrant input and four inputs for register file. Mux B (8-to-1) takes its inputs from four
register file outputs, an array data bus input and from the outputs of three of the nearest neighbors.
Registers: The register file is composed of four 16-bit registers. The output register is 32 bits wide
(to accommodate intermediate results of multiply-accumulate instructions). The shift unit is also 32
bits wide. A flag register indicates sign of input operand at port A of ALU. It is useful when the
operation to be performed depends upon the sign of the operand, as in the quantization step during
image compression. The feedback register makes it possible to reuse previous operands.
Custom hardware: This is used to implement special functions, e.g. a one's counter and a packing
register (which packs binary data into words). The one's counter and packing register are used for applications
that require processing of binary image data, such as automatic target recognition (ATR).
5.2 Context Memory
Each RC is configured through a context word stored in the Context Register. The context word is
provided from the Context Memory. The Context Memory is organized into two blocks (for row
and column contexts) with each block having eight sets of sixteen context words. The RC Array
configuration plane comprises eight context words (one from each set) from either the row or
column block. Thus the Context Memory can store 32 configuration planes.
Context register: This 32-bit register contains the context word for configuring each RC. It is a part
of each RC, whereas the Context Memory is separate from the RC Array (Figure 2).
The different fields for the context word are defined in Figure 5. The field ALU_OP specifies ALU
function. The control bits for Mux A and Mux B are specified in the fields MUX_A and MUX_B.
Other fields determine the registers to which the result of an operation is written (REG #), and the
direction (RS_LS) and amount of shift (ALU_SFT) applied to output.
One interesting feature is that the context includes a 12-bit field for a constant. This makes it
possible to provide an operand to a row/column of RCs directly from the context. It is used for
operations involving constants, such as multiplication by a constant. However, if the ALU-Multiplier
function does not need a constant, the extra bits in the constant field specify an ALU-Multiplier sub-operation.
These sub-operations are used to expand the functionality of the ALU unit.
Figure 5: RC Context word definition

| Bits | 31 | 30 | 29-28 | 27 | 26-23 | 22-19 | 18-16 | 15-12 | 11-0 |
|---|---|---|---|---|---|---|---|---|---|
| Field | WR_Bus | WR_Exp | REG # | RS_LS | ALU_SFT | MUX_A | MUX_B | ALU_OP | Constant |
The context word also specifies whether a particular RC writes to its row/column express lane
(WR_Exp). Whether or not the RC Array writes its result out to the Frame Buffer is also specified
by the context data (WR_BUS).
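As a concrete illustration of the encoding, the sketch below packs and unpacks a 32-bit context word using the bit positions shown in Figure 5; the field layout follows that figure, while the specific field values in the example are purely hypothetical.

```python
# Packing/unpacking a 32-bit RC context word, assuming the Figure 5 layout:
# bit 31 WR_Bus, bit 30 WR_Exp, bits 29-28 REG #, bit 27 RS_LS,
# bits 26-23 ALU_SFT, bits 22-19 MUX_A, bits 18-16 MUX_B,
# bits 15-12 ALU_OP, bits 11-0 Constant.

FIELDS = {  # name: (least-significant bit, width)
    "CONSTANT": (0, 12), "ALU_OP": (12, 4), "MUX_B": (16, 3),
    "MUX_A": (19, 4), "ALU_SFT": (23, 4), "RS_LS": (27, 1),
    "REG": (28, 2), "WR_EXP": (30, 1), "WR_BUS": (31, 1),
}

def pack_context(**values):
    """Assemble a context word from named fields (unspecified fields are 0)."""
    word = 0
    for name, value in values.items():
        lsb, width = FIELDS[name]
        assert 0 <= value < (1 << width), f"{name} out of range"
        word |= value << lsb
    return word

def unpack_context(word):
    """Split a 32-bit context word back into its fields."""
    return {name: (word >> lsb) & ((1 << width) - 1)
            for name, (lsb, width) in FIELDS.items()}

# Hypothetical example: ALU opcode 0x3 on MUX_A input 2 with constant 41,
# result left-shifted by 4, written to R1, and driven onto the write bus.
ctx = pack_context(ALU_OP=0x3, MUX_A=2, CONSTANT=41,
                   ALU_SFT=4, RS_LS=0, REG=1, WR_BUS=1)
assert unpack_context(ctx)["CONSTANT"] == 41
```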
The programmability of the interconnection network is derived from the context word. Depending
upon the context, an RC can access the input of any other RC in its column or row within the same
quadrant, or else select an input from its own register file. The context word provides functional
programmability by configuring the ALUs of each RC to perform specific functions.
Context broadcast: For this implementation of MorphoSys, the major focus is on data-parallel
applications, which exhibit a definite regularity. Based on this idea of regularity and parallelism, the
context is broadcast to a row (column) of RCs. That implies that all eight RCs in a row (column)
share the same context, and perform the same operations. For example, for DCT computation, eight
1-D DCTs need to be computed, across eight rows. This is easy to achieve with just eight context
words to program the entire RC Array, for each step of the computation. Thus it takes only 10
cycles to complete a 1-D DCT (as illustrated in Section 7.1.2).
Context Memory organization: Corresponding to either row or column broadcast of the context word,
a set of eight context words specifies one configuration plane for the RC Array. The computation model
for the RC Array specifies multiple contexts. To provide this depth of programmability, there are sixteen
configuration planes for each broadcast mode, which implies 128 context words per block. Based on
studies of relevant applications, a depth of sixteen for each context block (a total configuration depth of
32) has been found sufficient for most of the applications considered in this project. Since there are two
blocks (one for each broadcast mode), the Context Memory can store a total of 256 context words
of 32 bits each.
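The storage figures quoted above follow directly from this organization; a quick check (with the 32-bit word size from Section 5.2):

```python
# Context Memory capacity check: two blocks (row and column broadcast),
# sixteen configuration planes per block, eight 32-bit context words per plane.
blocks, planes_per_block, words_per_plane, word_bits = 2, 16, 8, 32
total_planes = blocks * planes_per_block        # 32 configuration planes
total_words = total_planes * words_per_plane    # 256 context words
total_bytes = total_words * word_bits // 8      # 1024 bytes of context storage
print(total_planes, total_words, total_bytes)
```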
Dynamic reconfiguration: When the Context Memory needs to be changed in order to perform
some different part of an application, the Tiny RISC signals the DMA Controller to load in the
required context data from main memory. The context update can be performed concurrently with
RC Array execution, provided that the RC Array is not allowed to access the parts that are being
changed. There are 32 context planes and this depth facilitates dynamic (run-time) reloading of the
contexts. Dynamic reconfiguration makes it possible to reduce the effective reconfiguration time to
zero.
Selective context enabling: This feature implies that it is possible to enable one specific row or
column for operation in the RC Array. This feature is primarily useful in loading data into the RC
Array. Since the context can be used selectively, and because the data bus limitations allow loading
of only one column at a time, the same set of context words can be used repeatedly to load data into
all the eight columns of the RC. Without this feature, eight context planes (out of the 32 available)
would have been required just to read or write data. This feature also allows irregular operations in
the RC Array, e.g. a zigzag re-arrangement of array elements.
5.3 Interconnection Network
The RC Array interconnection network comprises three hierarchical levels.
RC Array mesh: The underlying network throughout the array (Figure 3) is a 2-D mesh. It provides
nearest neighbor connectivity.
Intra-quadrant (complete row/column) connectivity: The second layer of connectivity is at the
quadrant level (a quadrant is a 4 by 4 RC group). In the current MorphoSys specification, the RC
array has four quadrants (Figure 3). Within each quadrant, each cell can access the output of any
other cell in its row and column, as shown in Figure 3.
Inter-quadrant (express lane) connectivity: At the highest or global level, there are buses for
routing connections between adjacent quadrants. These buses, also called express lanes, run across
rows as well as columns. Figure 6 shows two express lanes going in each direction across a row.
These lanes can supply data from any one cell (out of four) in a row (column) of a quadrant to other
cells in adjacent quadrant but in same row (column). Thus, up to four cells in a row (column) may
access the output value of any one of four cells in the same row (column) of an adjacent quadrant.
Figure 6: Express lane connectivity (between cells in the same row, but adjacent quadrants). [Row express lanes run in both directions across the quadrant boundary, linking the cells of a row.]
The express lanes greatly enhance global connectivity. Even irregular communication patterns, which
otherwise require extensive interconnections, can be handled quite efficiently. For example, an eight-point
butterfly is accomplished in only three cycles.
Data bus: A 128-bit data bus from the Frame Buffer to the RC Array is linked to the column elements of the
array. It provides two 8-bit operands to each of the eight column cells, so two operands
(Port A and Port B) can be loaded into an entire column in one cycle. Eight cycles are required to load
the entire RC Array. The outputs of the RC elements of each column are written back to the Frame Buffer
through the Port A data bus.
Context bus: When a Tiny RISC instruction specifies that a particular context be executed, the context must
be distributed from the Context Memory to the Context Register in each RC. The context bus transmits
the context data to each RC in a row/column depending upon the broadcast mode. Each context
word is 32 bits wide, and there are eight rows (columns), hence the context bus is 256 bits wide.
6. Programming and Simulation Environment
6.1 Behavioral VHDL Model
The MorphoSys reconfigurable system has been specified in behavioral VHDL. The system
components, namely the 8x8 Reconfigurable Cell Array, the 32-bit Tiny RISC host processor, the
Context Memory, the Frame Buffer and the DMA Controller, have been modeled for complete
functionality. The unified model has been simulated for various applications using the
QuickVHDL simulation environment. These simulations utilize several test-benches, real world
input data sets, a simple assembler-like parser for generating the context/configuration instructions
and assembly code for Tiny RISC.
6.2 Context Generation
Each application has to be coded into the context words and Tiny RISC instructions for simulation.
For the former, an assembler-parser, mLoad, generates the contexts from programs written in the
RC instruction set by the user. The next step is to determine the sequence of Tiny RISC instructions
for appropriate operation of the RC Array, timely data input and output, and to provide sample data
files. Once a sequence has been determined, and the data procured, test-benches are used to
simulate the system. Figure 7 depicts the simulation environment with its different components.
6.3 GUI for MorphoSys: mView
A graphical user interface, mView, has been prepared for programming the MorphoSys RC Array. It
is also used for studying MorphoSys simulation behavior. This GUI is based on Tcl/Tk [15]. It
displays graphical information about the functions being executed at each RC, the active
interconnections, the sources and destination of operands, usage of data buses and the express lanes,
values of RC outputs, etc. It has several built-in features that allow visualization of RC execution,
interconnect usage patterns for different applications, and single-step simulation runs with
backward, forward and continuous execution. It operates in one of two modes: programming mode
or simulation mode.
Figure 7: Simulation Environment for MorphoSys, with mView display
In the programming mode, the user sets functions and interconnections for each row/column of the
RC Array corresponding to each context (row/column broadcasting) for the application. mView then
generates a context file for representing the user-specified application.
In the simulation mode, mView takes a context file, or a simulation output file as input. For either of
these, it provides a graphical display of the state of each RC as it executes the application
represented by the context/simulation file. The display includes comprehensive information relating
to the functions, interconnections, operand sources, and output values for each RC.
mView is a valuable aid to the designer in mapping algorithms to the RC Array. Not only does
mView significantly reduce the programming time, but it also provides low-level information about
the actual execution of applications in the RC Array. This feature, coupled with its graphical nature,
makes it a convenient tool for verifying and debugging simulation runs.
6.4 Code Generation for MorphoSys
An important aspect of our research is an effort to develop a programming environment for
automatic mapping and code generation for MorphoSys. Eventually, we hope to be able to compile
hybrid code for the host processor and MorphoSys co-processor using the SUIF compiler
environment [16]. Initially, we will partition the application between the host processor and
MorphoSys manually, for example by inserting pragma directives. C code will be mapped into
MorphoSys configuration words based on the mLoad assembler. At an advanced development
stage, MorphoSys would perform online profiling of applications and dynamically adjust the
reconfiguration profile for enhanced efficiency.
7. Mapping Applications to MorphoSys
In this section, we discuss the mapping of video compression and automatic target recognition
(ATR) to the MorphoSys architecture. Video compression has a high degree of data-parallelism and
tight real-time constraints. ATR is one of the most computation-intensive applications. We also
provide performance estimates based on VHDL simulations.
7.1 Video Compression (MPEG)
Video compression is an integral part of many multi-media applications. In this context, MPEG
standards for video compression [17] are important for realization of digital video services, such as
video conferencing, video-on-demand, HDTV and digital TV. MPEG Standards [17] specify the
syntax of the coded bit stream and the decoding process. Based on this, Figure 8 shows the block
diagram of an MPEG encoder.
Figure 8: Block diagram of an MPEG Encoder. [The encoder comprises preprocessing, motion estimation and motion compensation, DCT, quantization, zig-zag scan, VLC encoding and an output buffer with rate regulation, together with a reconstruction loop (inverse quantization, IDCT, motion compensation, frame memory) that supplies the predictive frame.]
As depicted in Figure 8, the functions required of a typical MPEG encoder are:
• Preprocessing: for example, color conversion to YCbCr, prefiltering and subsampling.
• Motion Estimation and Compensation: After preprocessing, motion estimation is used to remove temporal redundancies in successive frames (predictive coding) of P and B type.
• DCT and Quantization: Each macroblock (typically consisting of 6 blocks of size 8x8 pixels) is then transformed using the discrete cosine transform. The resulting DCT coefficients are quantized to enable compression.
• Zigzag scan and VLC: The quantized coefficients are rearranged in a zigzag manner (in order of low to high spatial frequency) and compressed using variable-length encoding.
• Inverse Quantization and Inverse DCT: The quantized blocks of I and P type frames are inverse quantized and transformed back into the spatial domain by an inverse DCT. This operation yields a copy of the picture which is used for future predictive coding, i.e. motion estimation.
Next, we discuss two major functions (motion estimation and DCT) of the MPEG video encoder, as
mapped to MorphoSys. Finally, we discuss the overall performance of MorphoSys for the entire
compression encoder sequence (except VLC).
It is notable that, because of the computation-intensive nature of motion estimation, only dedicated
processors or ASICs have been used to implement MPEG video encoders. Most reconfigurable
systems, DSP processors and multimedia processors (e.g. [18]) consider only MPEG decoding or
a sub-task (e.g. the IDCT). Our mapping of the MPEG encoder to MorphoSys is perhaps the first time
that a reconfigurable system has been used to successfully implement an MPEG video encoder.
7.1.1 Video Compression: Motion Estimation for MPEG
Motion estimation is widely adopted in video compression to identify redundancy between frames.
The most popular technique for motion estimation is the block-matching algorithm because of its
simple hardware implementation [19]. Some standards also recommend this algorithm. Among the
different block-matching methods, full search block matching (FSBM) involves the maximum
computations. However, FSBM gives an optimal solution with low control overhead.
Typically, FSBM is formulated using the mean absolute difference (MAD) criterion as follows:
MAD(m, n) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| R(i, j) - S(i+m, j+n) \right|, \qquad -p \le m, n \le q,
where p and q are the maximum displacements, R(i, j) is the reference block of size N x N pixels at
coordinates (i, j), and S(i+m, j+n) is the candidate block within a search area of size (N+p+q)^2
pixels in the previous frame. The displacement vector is represented by (m, n), and the motion
vector is determined by the least MAD(m, n) among all the (p+q+1)^2 possible displacements within
the search area.
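For reference, the sketch below is a direct software transcription of the FSBM criterion above, not of the RC Array mapping described next; the array names and the assumption that the search window stays inside the previous frame are ours.

```python
# Full-search block matching by the MAD criterion above (software reference,
# not the RC Array mapping). `ref` is the N x N reference block; `prev` is the
# previous frame; (row, col) is the block's position; the caller must ensure
# the displaced candidate blocks stay inside `prev`.

def fsbm_motion_vector(ref, prev, row, col, p, q):
    n = len(ref)
    best_mad, best_mv = None, (0, 0)
    for m in range(-p, q + 1):
        for d in range(-p, q + 1):
            mad = sum(abs(ref[i][j] - prev[row + i + m][col + j + d])
                      for i in range(n) for j in range(n))
            if best_mad is None or mad < best_mad:
                best_mad, best_mv = mad, (m, d)
    return best_mv    # displacement (m, n) with the least MAD
```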
Figure 9 shows the configuration of RC Array for FSBM computation. Initially, one reference block
and the search area associated with it are loaded into one set of the frame buffer. The RC array
starts the matching process for the reference block resident in the frame buffer. During this
computation, another reference block and the search area associated with it are loaded into the other
set of the frame buffer. In this manner, data loading and computation time are overlapped.
Figure 9: Configuration of RC Array for Full Search Block Matching. [Alternate rows of the RC Array compute |R - S| and accumulate, the adjacent rows accumulate the partial sums, registers act as delay elements between the row groups, and the results are sent to Tiny RISC.]
For each reference block, three consecutive candidate blocks are matched concurrently in the RC
Array. As depicted in Figure 9, each RC in the first, fourth, and seventh row performs the computation

P_j = \sum_{i=1}^{16} \left| R(i, j) - S(i+m, j+n) \right|,

where P_j is the partial sum. Data from a row of the reference block is sent to the first row of the RC
Array and passed to the fourth row and seventh row through delay elements. The eight partial sums
(P_j) generated in these rows are then passed to the second, third, and eighth row respectively to
perform

MAD(m, n) = \sum_{j} P_j.
Subsequently, three MAD values corresponding to three candidate blocks are sent to Tiny RISC for
comparison, and the RC array starts block matching for the next three candidate blocks.
Computation cost: Based on the computation model shown above, and using N=16 for a reference
block size of 16x16, it takes 36 clock cycles to finish the matching of three candidate blocks. There
are (8+8+1)^2 = 289 candidate blocks in each search area, and VHDL simulation results show that a
total of 102 x (36+16) = 5304 cycles are required to finish the matching of the whole search area.
The 16 extra cycles are for comparing the MAD results after each set of three block comparisons
and updating the motion vectors for the best match. If the image size is 352x288 pixels at 30 frames
per second (MPEG-2 main profile, low level), the number of reference blocks per frame is 22x18 =
396 (each reference block size is 16x16). Processing of an entire image frame would take
5304 x 396 = 2.1 x 10^6 cycles.
Figure 10: MorphoSys M1 Performance for Motion Estimation - Giga Operations/s (GOPS). [Bar chart: motion-estimation GOPS of 4, 8 and 16 against peak GOPS of 6.4, 12.8 and 25.6 for M1 with 64, 128 and 256 RCs respectively.]
At the anticipated clock rate of 100 MHz for MorphoSys, the computation time is approximately 21.0 ms. This is
much smaller than the frame period of 33.33 ms. The context loading time is only 71 cycles, and since a
huge number of actual computation cycles are required before changing the configuration, its effect
is negligible. Figure 10 illustrates the performance of different generations of MorphoSys for
motion estimation in terms of giga-operations (10^9 operations) per second. We extrapolate the
performance results for future generations of M1, assuming future technologies will allow more
RCs on a single chip. These estimates are conservative and assume a fixed clock of 100 MHz
throughout. The GOPS figure for motion estimation is more than 60% of the peak value.
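The frame-rate claim can be checked with a few lines of arithmetic using the constants quoted above (64-RC M1 at 100 MHz):

```python
# Motion-estimation cycle budget for MPEG-2 main profile, low level,
# reproducing the figures quoted in the text for a 64-RC M1 at 100 MHz.
CLOCK_HZ = 100e6
cycles_per_search_area = 102 * (36 + 16)        # 5304 cycles (VHDL result)
blocks_per_frame = (352 // 16) * (288 // 16)    # 22 x 18 = 396 reference blocks
cycles_per_frame = cycles_per_search_area * blocks_per_frame   # ~2.1e6 cycles
time_ms = cycles_per_frame * 1e3 / CLOCK_HZ     # ~21 ms per frame
frame_period_ms = 1e3 / 30                      # 33.3 ms at 30 frames/s
print(cycles_per_frame, round(time_ms, 1), round(frame_period_ms, 1))
```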
Performance Analysis: MorphoSys performance is compared with three ASIC architectures
implemented in [19], [20], [21] and with Intel MMX instructions [22] for matching one 8x8 reference
block against its search area of 8 pixels displacement. The result is shown in Figure 11. The ASIC
architectures have the same processing power (in terms of processing elements) as MorphoSys, though
they employ customized hardware units such as parallel adders to enhance performance. The
number of processing cycles for MorphoSys is comparable to the cycles required by the ASIC
designs. The Pentium with MMX takes almost 29,000 cycles for the same task, which is almost thirty times
more than MorphoSys.
Figure 11: Performance Comparison for Motion Estimation. [Bar chart: cycles required to match one 8x8 reference block for the three ASIC designs [19], [20], [21] and MorphoSys M1 (64 RCs); the values shown are 1159, 1020, 631 and 581 cycles.]
Since MorphoSys is not an ASIC, its performance relative to these ASICs is significant. A
subsequent sub-section shows that this performance level enables the implementation of
an MPEG-2 encoder on MorphoSys.
7.1.2 Video Compression: Discrete Cosine Transform (DCT) for MPEG
The forward and inverse DCT are used in MPEG encoders and decoders. In the following analysis,
we consider an algorithm for fast 8-point 1-D DCT [23]. It involves 16 multiplications and 26
additions, leading to 256 multiplications and 416 additions for a 2-D implementation. The 1-D
algorithm is first applied to the rows (columns) of an input 8x8 image block, and then to the
columns (rows). The eight row (column) DCTs may be computed in parallel.
Mapping to RC Array: The standard block size for DCT in most image and video compression
standards is 8x8. Since the RC array has the same size, each pixel of the image block may be
directly mapped to each RC. Each pixel of the input block is stored in one RC.
Sequence of steps:
Loading input data: The 8x8 pixel block is loaded from the frame buffer to the RC Array. The data
bus between the frame buffer and the RC array allows concurrent loading of eight pixels at a time.
The entire block is loaded in eight cycles.
Row-column approach: Using the separability property, 1-D DCT along rows is computed. For row
(column) mode of operation, the configuration context is broadcast along columns (rows). Different
RCs within a row (column) of the array communicate using the three-layer interconnection network
to compute outputs for 1-D DCT. The coefficients needed for the computation are provided as
constants in context words. When 1-D DCT along rows (columns) is complete, the 1-D DCT along
columns (rows) are computed in a similar manner (Figure 12).
Each sequence of 1-D DCT [21] involves:
i. Butterfly computation: It takes three cycles to perform this using the inter-quad connectivity
layer of express lanes.
ii. Computation and re-arrangement: For 1-D DCT (row/column), the computation takes six
cycles. An extra cycle is used for re-arrangement of computed results.
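A functional sketch of this row-column computation is given below; it uses the standard DCT-II matrix rather than the specific fast algorithm of [23], and models only the arithmetic, not the cycle-accurate RC Array mapping or the context broadcasting.

```python
import numpy as np

# Separable row-column 2-D DCT on an 8x8 block, with no explicit transpose:
# a 1-D DCT is applied along the rows and then along the columns. This is a
# software reference using the standard DCT-II matrix, not the fast algorithm
# of [23] mapped onto the RC Array.

def dct_matrix(n=8):
    k = np.arange(n).reshape(-1, 1)      # output (frequency) index
    i = np.arange(n).reshape(1, -1)      # input (sample) index
    c = np.sqrt(2.0 / n) * np.cos((2 * i + 1) * k * np.pi / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)           # DC row uses the 1/sqrt(n) scale
    return c

def dct2(block):
    c = dct_matrix(block.shape[0])
    row_dct = block @ c.T                # 1-D DCT along each row
    return c @ row_dct                   # 1-D DCT along each column

coeffs = dct2(np.arange(64, dtype=float).reshape(8, 8))
```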
Figure 12: Computation of 2-D DCT across rows/columns (without transposing). [The 1-D DCT is first computed along the eight rows of the 8 x 8 RC Array (10 cycles), and then along the eight columns (10 cycles), with each pixel of the block mapped to one RC.]
Computation cost: The cost of computing the 2-D DCT of an 8x8 image block is as follows: 6
cycles for the butterfly, 12 cycles for the two 1-D DCT computations and 3 cycles for re-arrangement
and scaling of data, giving a total of 21 cycles. This estimate is verified by VHDL
simulation. Assuming the data blocks to be present in the RC Array (through overlapping of data
load/store with computation cycles), it would take 0.49 ms for MorphoSys at 100 MHz to compute
the DCT for all 8x8 blocks (396x6) in one frame of a 352x288 image. The cost of computing the
2-D IDCT is the same, because the steps involved are similar. The context loading time is quite
significant at 270 cycles. However, this effect is minimized by transforming a large number of
blocks (typically 2376 blocks) before a different configuration is loaded.
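The per-frame DCT time quoted above follows from these figures:

```python
# 2-D DCT time per frame on the 64-RC M1 at 100 MHz, using the figures above.
CLOCK_HZ = 100e6
cycles_per_block = 6 + 12 + 3       # butterfly + two 1-D DCT passes + re-arrange
blocks_per_frame = 396 * 6          # 2376 8x8 blocks (6 blocks per macroblock)
time_ms = cycles_per_block * blocks_per_frame * 1e3 / CLOCK_HZ
print(round(time_ms, 2))            # ~0.50 ms, consistent with the 0.49 ms estimate
```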
Performance analysis: MorphoSys requires 21 cycles to complete a 2-D DCT (or IDCT) on an 8x8
block of pixel data. This is in contrast to 240 cycles required by the Pentium MMX [22]. Even a
dedicated superscalar multimedia processor [24] requires 201 clocks for the IDCT. REMARC [9]
takes 54 cycles to implement the IDCT, even though it uses 64 nano-processors. The DSP
multimedia video processor [18] computes the IDCT in 320 cycles. The relative performance
figures for MorphoSys and other implementations are given in Figure 13.
Figure 13: DCT/IDCT Performance Comparison (cycles). [Bar chart: 320 cycles for the TMS320C80 MVP [18], 240 for the Pentium MMX [22], 201 for the V830R/AV [24], 54 for REMARC [9] and 21 for MorphoSys M1.]
Notably, MorphoSys performance scales linearly with the array size. For a 256-element RC Array,
the number of operations possible per second would increase fourfold, with a corresponding effect on
throughput for the 2-D DCT and other algorithms. The performance figures (in GOPS) are summarized
in Figure 14 and are more than 50% of the peak values. Once again, the figures are scaled for
future generations of MorphoSys M1, conservatively assuming a constant clock of 100 MHz.
Figure 14: Performance for DCT/IDCT - Giga Operations per Second (GOPS). [Bar chart: DCT GOPS of 3.36, 6.72 and 13.44 against peak GOPS of 6.4, 12.8 and 25.6 for MorphoSys M1 with 64, 128 and 256 RCs respectively.]
Some other points are worth noting. First, all rows (columns) perform the same computations, hence
they can be configured by a common context (thus enabling broadcast of the context word), which
leads to savings in Context Memory space. Second, the RC Array provides the option of broadcasting
context either across rows or across columns. This allows computation of the second 1-D DCT without
transposing the data. Eliminating the transpose operation saves a considerable number of cycles
and is important for high performance, since this operation generally consumes valuable cycle time. For
example, even a hand-optimized version of the IDCT code for the Pentium MMX (which uses 64-bit registers)
needs at least 25 register-memory instructions to complete the transpose [22]. Processors such as
the TMS320 series [18] also expend valuable cycle time on transposing data.
Precision analysis for IDCT: We conducted experiments on the precision of the IDCT output of
MorphoSys as specified in the IEEE Standard [25]. Considering that MorphoSys is not a custom
design and performs fixed-point operations, the results were impressive. MorphoSys satisfied the
worst-case pixel error requirement. The Overall Mean Square Error (OMSE) was within 15% of the
reference value. The majority of pixel locations also satisfied the worst-case reference values for
mean error and mean square error.
Zigzag Scan: We also implemented the zig-zag scan function, even though MorphoSys is not
designed for applications that involve irregular accesses. Interestingly, we were able to use the
selective context enabling feature of the RC Array to design a reasonable implementation. It is
evidence of the flexibility of the MorphoSys model that we could map an application that is quite
different from the targeted applications for this architecture.
7.1.3 Mapping MPEG-2 Video Encoder to MorphoSys
We mapped all the functions for MPEG-2 video encoder, except the VLC encoding, to MorphoSys.
We assume that the Main profile at the low level is used. The maximum resolution required
for this level is 352x288 pixels per frame at 30 frames per second. We further assume that a group
of pictures consists of a sequence of four frames in the order IBBP (a typical choice for
broadcasting applications). The number of cycles required to compute each sub-task of the MPEG
encoder, for each macroblock type are listed in Table 2. Besides the actual computation cycles, we
also take into account the configuration load cycles and the cycles for loading the data from
memory.
Table 2: Performance Figures of MorphoSys M1 (64 RCs) for I, P and B Macro-blocks (in clock cycles). The first three columns give Motion Estimation cycles; the last three give cycles for Motion Comp., DCT and Quant. (the value after the slash is for Inv. Quant., IDCT and inv. MC).

| Macroblock type | ME: Context Mem Ld | ME: Data Ld | ME: Compute | MC/DCT/Q: Context Mem Ld | MC/DCT/Q: Data Ld | MC/DCT/Q: Compute |
|---|---|---|---|---|---|---|
| I type macroblock | 0 | 0 | 0 | 270 / 270 | 234 / 234 | 264 / 264 |
| P type macroblock | 71 | 334 | 5304 | 270 / 270 | 351 / 351 | 264 / 264 |
| B type macroblock | 71 | 597 | 10608 | 270 / 0 | 468 / 0 | 306 / 0 |
All the macro-blocks in each P and B frame are first subjected to motion estimation, then we
perform motion compensation, DCT and quantization for all macroblocks of a frame. These are
written out to frame storage in main memory. Finally, we perform inverse quantization, inverse
DCT and inverse motion compensation for each macroblock of I and P type frames. Each frame has
396 macroblocks, and the total number of cycles required to encode each frame type is depicted
in Figure 15. It may be noted that motion estimation takes up almost 90% of the computation time
for P and B type frames.
Figure 15: MorphoSys performance for I, P and B frames (MPEG Video Encoder). [Stacked bar chart of cycles per frame, split into Motion Estimation, MC/DCT/Q, and Inv Q/IDCT/Inv MC components, for I, P and B frames.]
From the data in Figure 15, and assuming an IBBP frame sequence, the total encoding
time is 117.3 ms. This is 88% of the total available time (133.3 ms). From empirical data values in
[24], the remaining 12% of the available time is sufficient to compute the VLC. We compare the MPEG
video encoder performance with that of REMARC [9] in Table 3. Even though the MorphoSys figures
do not include VLC, they are almost two orders of magnitude lower than those of REMARC. The Motion
Estimation algorithm (the major computation) is the same for REMARC and MorphoSys (FSBM).
Table 3: Comparison of MorphoSys MPEG Encoder with REMARC [9] MPEG Encoder

| Frame Type | Total clock cycles for MorphoSys M1 (64 RCs) | Clock cycles for REMARC [9] (64 nano-processors) |
|---|---|---|
| I frame | 209,628 | 52.9 x 10^6 |
| P frame | 2,378,987 | 69.6 x 10^6 |
| B frame | 4,572,035 | 81.5 x 10^6 |
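The 117.3 ms total quoted above follows directly from Table 3 for one IBBP group of pictures:

```python
# Total MPEG-2 encoding time per IBBP group of pictures on the 64-RC M1
# at 100 MHz, using the per-frame cycle counts from Table 3 (VLC excluded).
CLOCK_HZ = 100e6
cycles = {"I": 209_628, "P": 2_378_987, "B": 4_572_035}
group = ["I", "B", "B", "P"]                     # assumed group of pictures
total_cycles = sum(cycles[f] for f in group)     # 11,732,685 cycles
encode_ms = total_cycles * 1e3 / CLOCK_HZ        # ~117.3 ms
budget_ms = len(group) * 1e3 / 30                # 4 frames at 30 fps = 133.3 ms
print(round(encode_ms, 1), round(encode_ms / budget_ms, 2))   # 117.3 ms, 0.88
```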
7.2 Automatic Target Recognition (ATR)
Automatic Target Recognition (ATR) is the machine function of automatically detecting,
classifying, recognizing, and identifying an object. The ACS Surveillance challenge has been
quantified as the ability to search 40,000 square nautical miles per day at one-meter resolution
[26]. The computation load when targets are partially obscured reaches the hundreds-of-teraflops
range. There are many algorithmic choices available to implement an ATR system.
Figure 16: ATR Processing Model. [A 128x128x8-bit chip is bit-sliced into eight bitplanes; each bitplane is correlated with the 8x8x1-bit bright template and the weighted sum of the eight results forms the shapesum; the chip is thresholded against the shapesum, and the resulting binary image is correlated with the bright and surround templates, followed by peak detection.]
The ATR processing model developed at Sandia National Laboratory is shown in Figure 16 ([27]
and [28]). This model was designed to detect partially obscured targets in Synthetic Aperture Radar
(SAR) images generated by the radar imager in real time. SAR images (8-bit pixels) are input to a
focus-of-attention processor to identify the regions of interest (chips). These chips are thresholded
to generate binary images, and the binary images are matched against binary target templates.
The first step is to generate the shapesum. The 128 x 128 x 8-bit chip is bit-sliced into eight
bitplanes. The system generates the shapesum by correlating each bitplane with the bright template
and then computing a weighted sum of the eight results. The chip is subsequently thresholded to
generate the binary image. Each pixel of the chip is compared with the shapesum and is set to a
binary value based on the following rule:
A_{ij} = 1   if A_{ij} - shapesum > 0
A_{ij} = 0   if A_{ij} - shapesum < 0

where A_{ij} represents an 8-bit pixel of the chip.
The most significant bit of the output register represents the result of the thresholding (in 2's
complement representation). Each RC in the first column of the RC Array has an 8-bit packing
register. These registers collect the thresholding results of the RCs in each row. The data in the
packing registers is sent back to the Frame Buffer, and another set of 64 pixels of the chip is loaded
into the RC Array for thresholding.
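A functional sketch of the thresholding and packing step is shown below (a software model of the operation, not the per-cycle RC Array behavior); the row-major packing order is an assumption.

```python
# Software model of the thresholding described above: each 8-bit pixel is
# compared with the shapesum via a subtraction, the sign of the result gives
# the binary pixel, and eight results per row are packed into one byte, as
# done by the packing registers in the first column of the RC Array.

def threshold_and_pack(pixels, shapesum):
    """pixels: 8x8 block of 8-bit values; returns one packed byte per row."""
    packed = []
    for row in pixels:
        byte = 0
        for pixel in row:
            bit = 1 if (pixel - shapesum) > 0 else 0   # sign of A_ij - shapesum
            byte = (byte << 1) | bit
        packed.append(byte)
    return packed

packed_rows = threshold_and_pack([[17 * (r + c) % 256 for c in range(8)]
                                  for r in range(8)], shapesum=100)
```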
Figure 17: Matching Process in Each RC. [Eight bits of the 16-bit binary image data are ANDed with the 8-bit template data, and a one's counter produces the correlation result.]
After the thresholding, a 128x128 binary image is generated and stored in the Frame Buffer. This
binary image is then matched against the target template using the bit correlator shown in Figure
17. This template matching is similar to the FSBM described in a previous sub-section. Each row of the
8x8 target template is packed as an 8-bit number and loaded into the RC Array. All the candidate blocks
in the chip are correlated with the target template. One column of the RC Array performs matching of one
target template, and eight blocks are matched concurrently in the RC Array.
In order to perform the bit-level correlation, two bytes (16 bits) of image data are input to each RC. In the first step, the 8 most significant bits of the image data are ANDed with the template data, and a special adder tree (implemented as custom hardware in each RC) counts the number of ones in the ANDed output to generate the correlation result. Then, the image data is shifted left by one bit and the process is repeated to match the second block. After the image data has been shifted eight times, a new 16-bit word is loaded and the RC starts the correlation of the next eight consecutive candidate blocks.
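This sliding AND-and-count operation can be sketched in Python as follows. This is a software analogue only: popcount8 stands in for the RC's custom adder tree, and the function names are illustrative, not part of the MorphoSys code base.

def popcount8(x):
    # Count the set bits in an 8-bit value (stand-in for the RC's custom adder tree).
    return bin(x & 0xFF).count("1")

def correlate_row(template_row, image_word16):
    # Correlate one 8-bit template row against 8 consecutive bit positions of a
    # 16-bit binary image word, shifting left one bit per candidate block.
    results = []
    word = image_word16 & 0xFFFF
    for _ in range(8):
        window = (word >> 8) & 0xFF            # the 8 most significant bits
        results.append(popcount8(window & (template_row & 0xFF)))
        word = (word << 1) & 0xFFFF            # shift left for the next block
    return results

# Example: one template row against a 16-bit slice of the binary image.
print(correlate_row(0b10110010, 0b1100101011110001))

The left shift by one bit per step mirrors how each RC reuses the same 16-bit word to evaluate eight consecutive candidate blocks before a new word is loaded.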
Performance analysis: For performance analysis, we choose the same system parameters that are used for the ATR systems implemented with Xilinx XC4010 FPGAs [27] and the Splash 2 system [28]. The image size of each chip is 128x128 pixels, and the template size is 8x8 bits. For 16 pairs of target templates, the processing time is 21 ms for MorphoSys (at 100 MHz), 210 ms for the Xilinx FPGA system [27], and 195 ms for the Splash 2 system [28]. Figure 18 depicts the relative performance.
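In other words, MorphoSys is about ten times faster than the Xilinx FPGA system (210/21 = 10) and roughly nine times faster than Splash 2 (195/21 = 9.3) for this workload.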
Figure 18: Performance Comparison of MorphoSys for ATR (processing time in ms: MorphoSys M1 with 64 RCs, 21; Splash 2 [28], 195; Xilinx FPGAs [27], 210)
ATR System Specification: A quantified measure of the ATR problem states that 100 chips have to be processed each second for a given target. The target has a pair of bright and surround templates for each five-degree rotation (72 pairs for a full 360-degree rotation). Considering these requirements, Table 4 compares the number of MorphoSys chips needed to meet this specification with the number of boards used by the systems described in [27] and [28]. Only nine MorphoSys M1 (64 RC) chips would be needed, compared to 90 boards for the FPGA-based system [27] and 84 boards for the Splash 2 system [28]. Once again, the figures for the future-generation MorphoSys chips (128 and 256 RCs) assume a constant 100 MHz clock.
Table 4: ATR Performance Comparison (MorphoSys @ 100 MHz); GOPS = Giga-Operations/second

                        M1 (64 RCs)   M1 (128 RCs)   M1 (256 RCs)   Xilinx FPGA [27]   Splash 2 [28]
ATR GOPS                14            28             56             1.4                1.52
No. of chips / boards   9 chips       5 chips        3 chips        90 boards          84 boards
8. Conclusions and Future Work
In this paper, we presented a new model of reconfigurable architecture in the form of MorphoSys and mapped several applications to it. The results validate this architectural model, with strong performance on the target applications. We plan to implement MorphoSys as an actual chip for practical evaluation.
Extensions to the MorphoSys model: The MorphoSys architectural model is not limited to a basic RISC core as the main control processor. In the current implementation, Tiny RISC is used only to validate the design model. However, several extensions to this model are envisioned. One is to use an advanced general-purpose processor in conjunction with Tiny RISC (which would then function as an I/O processor for the RC Array). Another is to use an advanced processor with multi-threading capability as the main processor, which would enable the RC Array and the main processor to operate concurrently.
Another potential focus is the RC Array. In this implementation, the array has been fine-tuned for data-parallel, computation-intensive tasks. However, the design model allows other versions, too. For example, a suitably designed RC Array could target a different application class, such as stream processing, high-precision signal processing, bit-level operations, or control-intensive applications.
Based on the above, we envision that MorphoSys may be the precursor of a generation of general-purpose processors that include a specialized reconfigurable component designed for multimedia or some other significant class of applications.
9. Acknowledgments
This research is supported by the Defense Advanced Research Projects Agency (DARPA) of the
Department of Defense under contract number F-33615-97-C-1126. We express thanks to Prof.
Eliseu M.C. Filho, Prof. Tomas Lang, and Prof. Walid Najjar for their useful and incisive
comments, Robert Heaton (Obsidian Technology) for his contributions towards the physical design
of MorphoSys and Ms. Kerry Hill of Air Force Research Laboratory, for her constructive feedback.
We acknowledge the contributions of Maneesha Bhate, Matthew Campbell, Benjamin U-Tee
Cheah, Alexander Gascoigne, Nambao Van Le, Rafael Maestre, Robert Powell, Rei Shu, Lingling
Sun, Cesar Talledo, Eric Tan, Timothy Truong, and Tom Truong; all of whom have been associated
with the development of MorphoSys models and application mapping.
References:
1. W. H. Mangione-Smith, B. Hutchings, D. Andrews, A. DeHon, C. Ebeling, R. Hartenstein, O.
Mencer, J. Morris, K. Palem, V. K. Prasanna, H. A. E. Spaaneburg, “Seeking Solutions in
Configurable Computing,” IEEE Computer, Dec 1997, pp. 38-43
2. S. Brown and J. Rose, “Architecture of FPGAs and CPLDs: A Tutorial,” IEEE Design and Test
of Computers, Vol. 13, No. 2, pp. 42-57, 1996
3. D. Chen and J. Rabaey, “Reconfigurable Multi-processor IC for Rapid Prototyping of Algorithmic-Specific High-Speed Datapaths,” IEEE Journal of Solid-State Circuits, Vol. 27, No. 12, Dec 1992
4. R. Hartenstein and R. Kress, “A Datapath Synthesis System for the Reconfigurable Datapath
Architecture,” Proc. of Asia and South Pacific Design Automation Conf., 1995, pp. 479-484
5. E. Tau, D. Chen, I. Eslick, J. Brown and A. DeHon, “A First Generation DPGA
Implementation,” FPD’95, Canadian Workshop of Field-Programmable Devices, May 1995
6. E. Mirsky and A. DeHon, “MATRIX: A Reconfigurable Computing Architecture with
Configurable Instruction Distribution and Deployable Resources,” Proceedings of IEEE
Symposium on FPGAs for Custom Computing Machines, 1996, pp.157-66
7. J. R. Hauser and J. Wawrzynek, “Garp: A MIPS Processor with a Reconfigurable Coprocessor,” Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines, 1997
8. C. Ebeling, D. Cronquist, and P. Franklin, “Configurable Computing: The Catalyst for High-Performance Architectures,” Proceedings of IEEE International Conference on Application-Specific Systems, Architectures and Processors, July 1997, pp. 364-72
9. T. Miyamori and K. Olukotun, “A Quantitative Analysis of Reconfigurable Coprocessors for
Multimedia Applications,” Proceedings of IEEE Symposium on Field-Programmable Custom
Computing Machines, April 1998
10. J. Babb, M. Frank, V. Lee, E. Waingold, R. Barua, M. Taylor, J. Kim, S. Devabhaktuni, A.
Agrawal, “The RAW Benchmark Suite: computation structures for general-purpose computing,”
Proc. IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 97,
1997, pp. 134-43
11. M. Gokhale, W. Holmes, A. Kopser, S. Lucas, R. Minnich, D. Sweely, D. Lopresti, “Building
and Using a Highly Parallel Programmable Logic Array,” IEEE Computer, pp. 81-89, Jan. 1991
12. P. Bertin, D. Roncin, and J. Vuillemin, “Introduction to Programmable Active Memories,” in
Systolic Array Processors, Prentice Hall, 1989, pp. 300-309
13. Xilinx, the Programmable Logic Data Book, 1994
14. A. Abnous, C. Christensen, J. Gray, J. Lenell, A. Naylor and N. Bagherzadeh, “ Design and
Implementation of the Tiny RISC microprocessor,” Microprocessors and Microsystems, Vol.
16, No. 4, pp. 187-94, 1992
15. B. B. Welch, Practical Programming in Tcl and Tk, 2nd edition, Prentice-Hall, 1997
16. SUIF Compiler system, The Stanford SUIF Compiler Group, http://suif.stanford.edu
17. ISO/IEC JTC1 CD 13818. Generic coding of moving pictures, 1994 (MPEG-2 standard)
18. F. Bonomini, F. De Marco-Zompit, G. A. Mian, A. Odorico, D. Palumbo, “Implementing an
MPEG2 Video Decoder Based on TMS320C80 MVP,” SPRA 332, Texas Instr., Sep 1996
19. C. Hsieh, T. Lin, “VLSI Architecture For Block-Matching Motion Estimation Algorithm,” IEEE
Trans. on Circuits and Systems for Video Tech., vol. 2, pp. 169-175, June 1992
20. S.H Nam, J.S. Baek, T.Y. Lee and M. K. Lee, “ A VLSI Design for Full Search Block Matching
Motion Estimation,” Proc. of IEEE ASIC Conference, Rochester, NY, Sep 1994, pp. 254-7
21. K-M Yang, M-T Sun and L. Wu, “ A Family of VLSI Designs for Motion Compensation Block
Matching Algorithm,” IEEE Trans. on Circuits and Systems, V. 36, No. 10, Oct 89, pp. 1317-25
22. Intel Application Notes for Pentium MMX, http://developer.intel.com/drg/mmx/appnotes/
23. W-H Chen, C. H. Smith and S. C. Fralick, “A Fast Computational Algorithm for the Discrete
Cosine Transform,” IEEE Trans. on Comm., vol. COM-25, No. 9, September 1977
24. T. Arai, I. Kuroda, K. Nadehara and K. Suzuki, “V830R/AV: Embedded Multimedia
Superscalar RISC Processor,” IEEE MICRO, Mar/Apr 1998, pp. 36-47
25. “IEEE Standard Specifications for the Implementation of 8x8 Inverse Discrete Cosine
Transform,” Std. 1180-1990, IEEE, Dec. 1990
26. Challenges for Adaptive Computing Systems, Defense Advanced Research Projects Agency
(DARPA), www.darpa.mil/ito/research/acs/challenges.html
27. J. Villasenor, B. Schoner, K. Chia, C. Zapata, H. J. Kim, C. Jones, S. Lansing, and B.
Mangione-Smith, “ Configurable Computing Solutions for Automatic Target Recognition,”
Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, April 1996
28. M. Rencher and B.L. Hutchings, " Automated Target Recognition on SPLASH 2," Proceedings
of IEEE Symposium on FPGAs for Custom Computing Machines, April 1997