JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 14, 605-632 (1998)
Rapid Prototyping of Hardware/Software Codesign for Embedded
Signal Processing
Yin-Tsung Hwang, Yuan-Hung Wang and Jer-Sho Hwang
Department of Electronic Engineering
National Yunlin University of Science & Technology
Touliu, Yunlin 640, Taiwan, R.O.C.
E-mail: hwangyt@cad.el.yuntech.edu.tw
In this paper, we propose a target board architecture suitable for embedded signal
processing applications based on hardware/software codesign. The target board,
which serves as a system attached to a host PC via a PCI bus interface, contains a
TMS320C30 DSP processor and up to four Xilinx XC5204 FPGAs. The software
and hardware sections of the codesign can be easily implemented using C and VHDL
programming in the C30 processor and FPGAs, respectively. Based on the proposed
target board architecture, the interface circuitry and the communication protocols
between the hardware (FPGAs) and software (C30) sections are first derived. The
interface circuitry is described in VHDL code and will be added to the FPGA design
for high level synthesis. Five types of HW/SW communications are supported. A
HW/SW codesign flow is also exploited, and a partitioning verification procedure is
developed. To illustrate the merits of the proposed system, a HW/SW codesign
implementation example based on the G.728 LD-CELP decoder for speech
compression is described.
Keywords: hardware/software codesign, communication interface, embedded system,
hardware/software partitioning, hardware description language, target board, rapid
prototyping, field programmable gate array.
1. INTRODUCTION
Large and complex DSP systems are often composed of multiple and
heterogeneous processing blocks. Some of these blocks may involve massive
computations, where hardwired implementation in ASIC may provide the best speed
performance. Other blocks may need to handle complex control flows or
communication protocols, where a software programming approach is most flexible
and cost effective. Therefore, in contrast to pure software programming or pure
hardwired implementations, an alternative approach is to partition the system into
software and hardware sections. Each section can then be implemented using
respective software or hardware technology. Since these processing blocks interact
with one another, they cannot be designed independently. This gives rise to a new
design methodology, where the constituent hardware and software subsystems are
developed concurrently to meet specified performance and cost constraints. This is
known as hardware/software codesign.
Received October 31, 1997; revised March 18, 1998.
Communicated by Jin-Yang Jou.
1.1 HW/SW Codesign for Embedded Systems
HW/SW codesign has been commonly adopted in designing embedded systems.
Basically, such embedded systems react in real time to external asynchronous events
and process incoming data as in classical digital signal processing. Products in
applications, such as personal communication, automotive control, consumer
electronics or office automation, are often implemented in embedded systems. An
embedded system usually incorporates a programmable microprocessor core with
memory plus some hardwired (or field programmable) devices for peripherals and
dedicated computations. The programmable microprocessor (digital signal processor
in most cases) is mainly responsible for realization of the software section while the
hardwired logic devices implement the hardware section of the system. So far, the
HW/SW codesign environment for embedded systems is still very primitive. The
design is basically achieved in an ad-hoc manner. This, however, is not appropriate
for the dynamic and fast changing nature of the embedded system market, where
products must be developed within a very short cycle. To resolve this problem, two
factors are important:
1) a rapid prototyping embedded system and
2) a good hardware/software codesign environment.
A rapid prototyping system can provide a platform for early implementation and
verification of the system. It can also serve as an architectural template which
enhances design reuse and reduces the effort required to develop a system. A good
codesign environment can help the designer tackle the new and emerging design
issues encountered in the HW/SW codesign. These include hardware/software
co-specification, partitioning, communication, co-simulation and co-verification.
Among them, the HW/SW partitioning problem has probably received the most
research attention. Other issues, nonetheless, are often solved in an ad-hoc and
application dependent way.
1.2 Rapid Prototyping of HW/SW Codesign
In this paper, we propose a rapid prototyping embedded system based on
HW/SW codesign. The rapid prototyping system assumes an architectural template
which is implemented using a programmable and configurable target board. This
architectural template is designed to be flexible and capable of accommodating the
computing needs for various embedded signal processing applications, in particular in
the arena of speech, acoustics and audio. This target board serves not only as an
implementation vehicle to rapidly prototype and verify HW/SW codesigned systems,
but also as a practical platform for addressing the design issues in HW/SW codesign.
Without such an architectural platform, the design space will be too large to exploit
efficiently. Many design constraints, such as HW/SW communication models, cannot
be characterized precisely, either. This may lead to underestimation of the
incurred hardware complexity and time overhead, which makes the design impractical
for real implementation. Based on the proposed target board architecture, we then
address the HW/SW communication interface problem, which is largely dependent on
the architectural platform. Communication interfaces and their protocols are
essential to efficiently integrate the HW and SW sections into a complete system.
They are also prerequisites for constructing a HW/SW partitioning model. With the
proposed architectural template and the HW/SW communication interfaces, a
practical HW/SW codesign environment can then be established. In this paper,
instead of developing a brand new CAD system, our focus is on integrating existing
CAD tools and devising a design flow for the HW/SW codesign problem.
2. PREVIOUS WORK AND PROPOSED APPROACHES
Numerous works on HW/SW co-design related issues have been presented.
Our review will focus basically on two issues, i.e., the target board architecture along
with its communication interface, and the design methodology.
2.1 Target Architectures and Communication Interfaces
CASTLE [1] provided an architectural library containing several common
co-design architectural templates as well as communication and synchronization
mechanisms. It is up to the designer to decide on the appropriate architecture. A
sequence of architectural refinements also has to be employed manually before real
implementation can take place. A file compression algorithm example uses a
SPARC processor and a Xilinx XC4005 FPGA as a coprocessor with a local memory.
In [2], the target architecture is a PC-based development board which contains an
Intel i960 microprocessor, a Xilinx 4008 FPGA, a program/data memory, a single
serial I/O and an AT bus interface. The system is basically a single bus architecture,
where all components are hooked to a system bus. The communication between the
i960 microprocessor and the FPGA is achieved through dual port parameter memory,
where batch data is transferred to and from the program/data memory via DMA
operations. This architecture is clearly defined but can only support batch type
HW/SW communication. In the COBRA project [3], the base hardware module
carries four Xilinx 4025 FPGAs connected as a mesh. Multiple base modules can be
connected together, and an I/O module supports parallel SPARC S-Bus connection to
a host. Since only one FPGA (root) can be connected directly to the host processor
(SPARC), communication data between the host processor and any destination
FPGAs (other than the root) must be relayed and routed through the intermediate
FPGAs. This implies dramatic HW/SW communication overhead. The
architecture is only suitable for master-slave type operations, where FPGA modules
passively perform dedicated function calls from the host processor. In [4], a
master/master target architecture was adopted. The generic hardware architecture
includes a Motorola 68000 microprocessor and a root Xilinx 4005 FPGA. The root
FPGA shares the memory with the microprocessor.
Communication/synchronization between the processor and FPGAs is implemented
using both interrupt and polling techniques. Additional FPGAs are connected to the
root FPGA via a serial bus, which uses a token ring mechanism to control access.
Basically, its limited communication bandwidth and the significant communication
overhead between the microprocessor and the non-root FPGAs have constrained its
application. In [5, 6], the COSYMA target architecture was presented, which
consists of a standard RISC processor core (the SPARC processor) and an application
specific co-processor. The hardware and software components execute in mutual
exclusion. Communication is done through shared memory with a CSP type
protocol. Since this is a master-slave type architecture, it is generally not preferred
in embedded applications, where all the processing blocks should work concurrently.
The CSP type protocol is also not suitable for batch data movements. In [7-9], the
target architecture contained a processor embedded with ASICs. The interface
between the hardware (ASIC) and software (processor) sections is a control FIFO
buffer, which serves as a mechanism to enforce the scheduling of both hardware and
software. Data transfer from hardware to software is explicitly synchronized. The
processor uses a polling strategy to perform premeditated transfer from the hardware
section. This architecture requires careful scheduling in both the hardware and
software sections so that they can retrieve appropriate data from the FIFO. It is also
not efficient for batch communication. The target architecture presented in [10] is a
hypothetical multi-processor plus one ASIC configuration. The processors and
ASIC are assumed to be fully connected and to perform in full synchronization
(multi-rate). Communication between the hardware and software sections is, thus,
determined by means of static scheduling. The synchronized communication
between the HW and SW sections, however, makes this architecture less feasible for
real implementation.
2.2 Codesign System Development Environments
System development environments differ mainly in their initial system
specifications and the way in which hardware/software partitioning is performed. In
the VULCAN synthesis tool suite [7-9], a hardware oriented design strategy was
adopted. It starts from an initial hardware specification in HardwareC. Portions of
the design are later migrated into the software section if the design constraints are
satisfied. The migration process is iterative. Candidates are selected so that the
communication cost is lowered while the timing constraint is maintained. The
COSYMA system [5, 6] features a software oriented approach, where the system
specification is in the form of communicating processes described in C*, a super set
of the C language. After data profiling, partitioning is performed using a simulated
annealing algorithm. In [10], the HMS (Hardware/multi-software) starts with the
system specification in a data flow graph. Partitioning is performed at the fine grain
level. Starting with an all software implementation, if the system cannot be
scheduled within the specified time, the algorithm begins to add the most needed
specialized hardware. The process proceeds until both the timing and silicon
requirements are met. In [11], the co-design starts with a system specification in
VHDL. Partitioning is then carried out at both the coarse and fine grain levels.
Pre-partitioning, performed in interaction with the designer, is done to obtain proper
partitioning granularity. Partitioning performed later takes the estimated speed and
cost into account and is achieved using a simulated annealing algorithm. All the
above mentioned tools support automatic partitioning while the following tools
require manual partitioning.
CASTLE [1] starts with a mixed algorithmic
specification in C++/VHDL. For the partitioning step, CASTLE displays the list of
functions of the algorithmic specifications. The designer manually partitions them
into hardware and software components and refines the architecture step by step.
After each partitioning decision, the system estimates the consequences. In [12], a
more general implementation independent specification is achieved via a
co-specification language using object-oriented functional notation. Design objects
are classified into three groups, i.e., hardware, software and codesign. Codesign
objects are generic in specification and can be compiled into hardware or software
sections. In [13], the system is specified in the SDL language, and various high level
synthesis tools are employed to implement the hardware, the software and the
interface designs. In the COBRA environment [3], the partitioning uses the
specification to extract data dependencies and to define coprocessors for the
application. Partitioning can be performed either automatically using a clustering
method or manually but guided with the tool's assistance in analysis and verification.
In COSMOS [14-16], the design process starts with a system-level specification
language SDL. The specification is translated to a common intermediate form
capable of modeling concurrency, high-level communication, synchronization and
exceptions. Partitioning is performed interactively by the designers using the
transformation primitives provided by the system.
2.3 Our Approaches
In this paper, our rapid prototyping HW/SW codesign system is different from
the previous works in the following aspects. First, the proposed target board
architecture aims to be very flexible in implementing various embedded systems. A
master-master type architecture is adopted, where both the hardware section and the
software section can run and communicate concurrently. Second, the architecture is
designed to support a wide range of communication interfaces and protocols, ranging
from block data movement to asynchronous single data transfer. HW/SW
communication can, thus, be achieved efficiently, which improves system integration.
Third, our partitioning verification model is constructed based on the proposed target
board architecture. It can precisely characterize the HW/SW design constraints and
the communication overhead. This facilitates better partitioning and performance
verification. The organization of the remainder of this paper is as follows: In Section
3, the proposed target board architecture is described. In Section 4, we present the
HW/SW communication interfaces supported by the target board. The codesign
environment, which includes input specification, design flow and a partitioning model,
is described in Section 5. An LD-CELP speech decoder example is presented in
Section 6 to illustrate the usefulness of this rapid prototyping system.
3. THE CODESIGN TARGET BOARD ARCHITECTURE
The target architecture of the embedded system is usually application dependent.
A typical embedded system architecture, however, may consist of three major
modules, i.e.,
‧a processor core,
‧one or several hardware accelerators, and
‧a peripheral block.
The processor core can be either a commercial digital signal processor, such as
TI TMS320C30/40, or an in house ASIP (application specific instruction set
processor). The processor core implements as much as possible of the signal
processing and control functions. The hardware accelerators are specialized data
paths that can be used to extend the instruction set of the processor core in an
application specific way.
Hardware accelerators usually implement time critical
functions, the performance of which cannot be realized in programmable technology
today. The peripheral blocks are primarily responsible for communicating with the
outside world. These include memory, timers, serial and parallel interfaces, DMA
controllers, A/D and D/A converters, etc. These three main modules are then
connected by various types of interconnection media, which can be a global shared
data bus, a direct port to port connection, a switched channel or a larger
interconnection network. Data transactions can be in bit serial or bit parallel format
and can be controlled by a bus arbitration mechanism or a DMA controller.
3.1 Design Concerns
Since different embedded applications possess different computing
characteristics, a "universal" architecture often turns out to be the least efficient
one for any given application. Therefore, we focus on applications in the area of
signal compression for digital transmission networks. These can be speech, audio or
image signals. (For the time being, video signals are not considered due to their
overwhelming computing requirements.) Signal compression can be used to minimize the
communication capacity required for transmission of high quality signals or,
equivalently, to get the best possible fidelity over an available digital communication
channel. With such applications in mind, we chose a TI TMS320C30 digital signal
processor as the processor core of the system. The C30 is a 32-bit floating point
processor which has been adopted successfully and widely in audio/speech processing.
It is equipped with one DMA controller, two serial ports and two timers, which can
greatly simplify the peripheral block design of the target board. The C30 also has
two 32-bit external buses, one primary and one expansion, which provide wide
communication bandwidth between the hardware and software sections. For the
hardware accelerators, field programmable logic devices must be incorporated in
place of a hardwired ASIC. We chose Xilinx FPGAs due to their high capacity and
architectural efficiency in data path implementation.
For real time signal
compression, the target board should also be equipped with data acquisition and
playback peripherals, which include A/D and D/A converters, microphones and
speakers (for speech and audio signals). To achieve high performance computing,
we prefer a master-master type configuration so that both the HW and SW sections
can operate in parallel. For the interconnection structure of the system, a shared bus
architecture is used as it is the most flexible (when compared with the port-to-port
direct connection) and cost effective (when compared with the switching network)
one among the alternatives. It can also eliminate physical data movement across the
boundary between the HW and the SW sections. It can, then, be used as the
backbone of the system's interconnections. A shared bus alone, however, cannot
satisfy all the different kinds of communication needs between the HW and the SW
sections. Bus contention among all the hooked up devices may also degrade
performance. Auxiliary communication channels are, thus, needed to enable
implementation of specialized communication and to reduce traffic on the shared bus.
To simplify system development, the target board can be designed as a system
attached to a host machine, e.g., a PC or work station. A host bus interface is,
therefore, needed, through which the host can initialize the target board, download the
kernel program to the DSP processor, configure field programmable logic devices,
and upload processed signal data from the target board.
3.2 Proposed Architecture
Based on the design concerns mentioned in section 3.1, the target board is
designed as a PC add-on card with a PCI bus interface. It is regarded as an
embedded system in the PC (host machine) and behaves like a service provider for
specific functions. To support the general paradigm of HW/SW codesign, the target
board architecture contains a TI TMS320C30 DSP processor and up to four Xilinx
XC5204 FPGAs. Unlike some other designs [1, 3], we do not use the host
processor to perform the functions of the software section. The induced heavy
communication overhead between the host processor and the target board would make
the design less efficient. The FPGAs, via the VHDL specification and high level
synthesis, can implement the hardwired functions for the hardware section. The
block diagram of the target board architecture is shown in Fig. 1.
To ensure that the target board is adaptable to a wide range of applications, a
universal shared bus architecture is adopted. This facilitates a basic communication
model between the DSP processor and the FPGAs via access to a shared memory
location. The size of the shared memory is 8MB. It can be accessed by the C30
processor, FPGAs and PCI interface controller. Through the PCI interface, the
shared memory can serve as a data buffer to exchange data with the host machine.
The PCI interface circuitry, implemented in an Altera EPM7096, controls an AMCC
S5933 PCI controller so as to communicate with the PC host. Since all these devices
hooked up to the same bus must compete for access privileges, a bus control unit
(BCU), also in an EPM7096 CPLD, arbitrates among the memory access requests
received from different devices. The C30 processor is assigned higher priority than
are the FPGAs. Once a device (including C30 and FPGAs) is granted bus control,
no other device can take control unless it is relinquished by the current owner. To
reduce traffic in the shared bus, each FPGA is supplied with a 2K local memory
module which serves as a private work sheet. The C30 processor's on-chip DMA
controller can perform batch data communication between the shared memory and a
specific local memory module. The memory map of the target board is shown in Fig. 2.
[Figure 1 (block diagram): PCI controller (AMCC S5933) with PCI interface, shared
memory, bus control unit, address decoder, interrupt control unit, queue decoder,
TMS320C30, four XC5204 FPGAs each with a multiplexed local memory, peripheral bus
connector and two serial port connectors.]
Fig. 1. Block diagram of the target board architecture.
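The arbitration rule described for the bus control unit (the C30 outranks the FPGAs, and a granted device keeps the bus until it releases it) can be illustrated with a small software model. This is only a sketch under stated assumptions: the type and function names are hypothetical, and the relative priority among the four FPGAs is assumed to follow their numbering, which the text does not specify.

```c
#include <stdbool.h>

/* Hypothetical device indices: the C30 first, then the four FPGAs. */
enum { DEV_C30 = 0, DEV_FPGA1, DEV_FPGA2, DEV_FPGA3, DEV_FPGA4, DEV_COUNT, DEV_NONE = -1 };

typedef struct {
    int owner;  /* current bus owner, or DEV_NONE when the bus is idle */
} bus_arbiter;

void arbiter_init(bus_arbiter *a)
{
    a->owner = DEV_NONE;
}

/* One arbitration step. request[i] is true when device i asks for the bus.
 * If the bus is owned, the owner keeps it (no preemption). If the bus is
 * idle, the requesting device with the lowest index wins, so the C30 always
 * beats the FPGAs. The order among FPGAs is an assumption of this sketch. */
int arbiter_step(bus_arbiter *a, const bool request[DEV_COUNT])
{
    if (a->owner != DEV_NONE)
        return a->owner;
    for (int i = 0; i < DEV_COUNT; i++) {
        if (request[i]) {
            a->owner = i;
            break;
        }
    }
    return a->owner;
}

/* Only the current owner can relinquish the bus. */
void arbiter_release(bus_arbiter *a, int dev)
{
    if (a->owner == dev)
        a->owner = DEV_NONE;
}
```

In this model a late request from the C30 does not displace an FPGA that already holds the bus; the C30 only wins when the bus is idle at the time of arbitration.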
TMS320C30 memory map:
  0h - 0BFh             Interrupt locations and reserved (192)
  0C0h - 0FFFh          ROM (internal)
  1000h - 7FFFFFh       External, STRB active
  800000h - 801FFFh     Expansion bus, MSTRB active (8K)
  802000h - 803FFFh     Reserved (8K)
  804000h - 805FFFh     Expansion bus, IOSTRB active (8K)
  806000h - 807FFFh     Reserved (8K)
  808000h - 8097FFh     Peripheral bus memory-mapped registers (internal) (6K)
  809800h - 809BFFh     RAM block 0 (1K) (internal)
  809C00h - 809FFFh     RAM block 1 (1K) (internal)
  80A000h - 0FFFFFFh    External, STRB active

Peripheral bus memory-mapped registers:
  808000h - 80800Fh     DMA channel
  808020h - 80802Fh     Timer 0
  808030h - 80803Fh     Timer 1
  808040h - 80804Fh     Serial port 0
  808050h - 80805Fh     Serial port 1

Target board assignments:
  1000h - 3FFFFFh       Reserved
  400000h - 7FFFFFh     Shared memory (4M)
  800000h - 8007FFh     Local memory 1 (2K)
  800800h - 800FFFh     Local memory 2 (2K)
  801000h - 8017FFh     Local memory 3 (2K)
  801800h - 801FFFh     Local memory 4 (2K)
  80A000h               FPGA 1 address vector
  80A004h               FPGA 2 address vector
  80A008h               FPGA 3 address vector
  80A00Ch               FPGA 4 address vector
  (up to 0C00000h)      Reserved
  0C00000h - 0FFFFFFh   Shared memory (4M)
Fig. 2. Memory map of the target board architecture.
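To make the board-specific part of the memory map concrete, the following sketch classifies a C30 address into the shared memory, one of the four local memories, or one of the four FPGA address vectors. It is an illustration only: the function and type names are hypothetical, and the pairing of address ranges with regions is our reading of the labels in Fig. 2 (two 4M shared memory windows, four 2K local memories starting at 800000h, and address vectors at 80A000h through 80A00Ch).

```c
#include <stdint.h>

typedef enum {
    REGION_SHARED_MEM,   /* shared memory windows */
    REGION_LOCAL_MEM,    /* unit holds 1..4 */
    REGION_FPGA_VECTOR,  /* unit holds 1..4 */
    REGION_OTHER         /* anything not specific to the target board */
} region_kind;

typedef struct {
    region_kind kind;
    int unit;            /* FPGA number for local memory and vectors, else 0 */
} region;

/* Classify an address according to the ranges read from Fig. 2 (assumed). */
region decode_address(uint32_t addr)
{
    region r = { REGION_OTHER, 0 };

    if ((addr >= 0x400000u && addr <= 0x7FFFFFu) ||
        (addr >= 0xC00000u && addr <= 0xFFFFFFu)) {
        r.kind = REGION_SHARED_MEM;
    } else if (addr >= 0x800000u && addr <= 0x801FFFu) {
        /* Four consecutive 2K (800h) blocks, one per FPGA. */
        r.kind = REGION_LOCAL_MEM;
        r.unit = (int)((addr - 0x800000u) / 0x800u) + 1;
    } else if (addr == 0x80A000u || addr == 0x80A004u ||
               addr == 0x80A008u || addr == 0x80A00Cu) {
        /* One address vector per FPGA, spaced 4 apart. */
        r.kind = REGION_FPGA_VECTOR;
        r.unit = (int)((addr - 0x80A000u) / 4u) + 1;
    }
    return r;
}
```

For example, under these assumptions `decode_address(0x800800)` reports local memory 2, and `decode_address(0x80A008)` reports the address vector of FPGA 3.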
Note that the local memory for each FPGA also appears in the map and supports
DMA access. A 2-to-1 multiplexing unit, however, must be placed before each local
memory module so that access can be switched between the FPGA and the DMA
controller. Each FPGA is also assigned an address so that the C30 can perform
memory mapped I/O to access the FPGAs. In case a hardware section design is split
over several FPGAs, a local signal bus, configured with a ring structure, is employed
to provide point-to-point direct communication among the FPGAs. Besides the
communication achieved by the shared memory, a FIFO structure with a width of 16 bits
and a depth of 4 is provided between the C30 processor and the leading FPGA. The
FIFO is actually implemented in the FPGA and connected to the C30 processor's
expansion data bus (the lower 16 bits). The two serial port connectors on the target
board are mainly used for connection to an external audio signal interface. The
peripheral bus connector, along with an 8K x 32 peripheral memory, provides a 16-bit
wide DSP-Link interface with an external data acquisition device. Basic communication
from an FPGA to the DSP processor is achieved by means of an interrupt, which the
C30 processor then acknowledges. Since implementing an interrupt mechanism in the
FPGA would be too expensive, communication from the C30 to a specific FPGA is
checked using a polling scheme and is accomplished by means of a memory mapped I/O
operation. When the C30 writes to an FPGA's address, the address decoder will
generate an activate signal and send it to the FPGA. Once the activate signal is
recognized by the FPGA, it will notify the C30 via the signal flag XF1.
Communication between FPGAs is mostly achieved by means of the local bus
interconnection. The ring structure can provide a fully interconnected network
for up to three FPGAs.
4. HW/SW COMMUNICATION INTERFACE & PROTOCOLS
A critical issue in HW/SW codesign is the efficiency of communication between
the hardware and software sections. Synchronous communication between them is
virtually impractical in that software execution must be monitored cycle by cycle.
Our target board supports two forms of asynchronous communication, i.e.,
handshaking and queue, as well as batch communication via the DMA controller.
Since the communication time overhead is much longer than the normal execution
cycle, our target board is not suitable for frequent, fine grain communication.
4.1 Types of Communication
We may classify the communication patterns encountered in a HW/SW codesign
system into three categories. The first one is simply for control transfer or
synchronization purposes; e.g., the DSP processor invokes the specific function of the
FPGAs. This usually does not involve a large volume of data exchange. The
incurred small amount of data exchange will be referred to as a message. The second
pattern is for constant rate data transfer, where a sequence of data with each item
separated by a specified period, e.g., speech samples, is transferred. The third
pattern is for bursty data transfer, where a block of data, e.g., coefficients of filters,
has to be moved. Such data transfers often occur at the beginning or end of a
computing module. In our target board design, five types of communication are
supported:
1. Asynchronous communication by means of handshaking: A handshaking protocol
   must be followed to carry out the communication. It is usually used for
   communication between the HW and SW, for occasional and small amounts of
   message exchange. To implement the communication interface, each FPGA
   reserves two address locations in the shared memory as the message buffer.
   One is for the outgoing message (interrupt vector) sent to the C30 processor, and
   the other one is for the incoming message (control information) received from
   the C30 processor. Each message is 32 bits wide, and the format is application
   dependent.
2. Asynchronous communication by means of a queue: A queue is a unidirectional
   communication channel, where data are inserted and retrieved in order. It is
   most suitable for constant data rate transfers and can be used in both HW/SW
   and intra-HW (i.e., among FPGAs) communication. The queue itself (a dual
   port memory) as well as two control pointers and the pointer update mechanism
   are implemented in the FPGA hardware.
3. Batch communication: One possible drawback of the shared bus memory access
   scheme is bus contention. If both the C30 processor and FPGAs need to access
   the shared memory frequently, the performance will be degraded. A better way
   is to copy data from the shared memory to the FPGA's local memory, where the
   FPGA has exclusive access rights. Batch communication is carried out by the
   C30's DMA controller. It can occur between the shared memory and the local
   memory or between the shared memory and the PC host. In the former case, a
   request is initiated by the HW (FPGA), and in the latter case, a request is
   initiated by the SW (C30).
4. Synchronous communication: This is supported only in intra-HW communication.
   In our design, it is performed by the local interconnection bus between the
   FPGAs. The send and receive signals are both latched. There is basically no
   communication interface circuitry except for the I/O buffers. This requires
   careful static scheduling to ensure correct data transfer.
5. Direct communication: This is similar to the synchronous communication case
   except that the signals are not latched. This provides a direct point-to-point
   connection between two FPGAs and is useful when a large combinational circuit
   is split into two FPGAs.
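The relation between the three communication patterns and the five supported communication types can be summarized in a few lines of C. This is a hypothetical summary: the enum and function names are ours, and the mapping encodes only the suitability statements made in the text (messages by handshaking, constant rate data by a queue, bursty data by batch transfer).

```c
#include <stdbool.h>

/* The three communication patterns identified in Section 4.1. */
typedef enum { PATTERN_MESSAGE, PATTERN_CONSTANT_RATE, PATTERN_BURST } comm_pattern;

/* The five communication types supported by the target board. */
typedef enum {
    COMM_HANDSHAKE,
    COMM_QUEUE,
    COMM_BATCH,
    COMM_SYNCHRONOUS,
    COMM_DIRECT
} comm_type;

/* Communication type the text describes as suited to each pattern
 * across the HW/SW boundary. */
comm_type preferred_hw_sw_type(comm_pattern p)
{
    switch (p) {
    case PATTERN_CONSTANT_RATE: return COMM_QUEUE;
    case PATTERN_BURST:         return COMM_BATCH;
    case PATTERN_MESSAGE:
    default:                    return COMM_HANDSHAKE;
    }
}

/* Synchronous and direct communication run over the local bus between
 * FPGAs, so they are available for intra-HW communication only. */
bool intra_hw_only(comm_type t)
{
    return t == COMM_SYNCHRONOUS || t == COMM_DIRECT;
}
```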
4.2 HW/SW Communication Protocols
We will first define the hand-shaking type asynchronous communication
protocols as follows.
1. Software send: The software program sends a message to a specific FPGA via
   memory mapped I/O write. It then enters an indefinite loop, which polls input
   pin XF1 for the acknowledge signal IACK from the FPGA. It is, therefore, a
   blocking send.
2. Software receive: The software program keeps on polling an internal flag to
   check if a message from a specific FPGA has been received via interrupt service.
   If this is the case, the program will exit the loop and read the outgoing message
   buffer of the corresponding FPGA.
3. Hardware send: The FPGA sends a message to the DSP processor by means of
   an interrupt. After arbitration, the address of the FPGA with the highest
   priority will be saved in the ICU. In the interrupt service routine, the C30
   processor will check the address, read the corresponding interrupt vector and
   send back an IACK signal.
4. Hardware receive: The finite state machine of the FPGA enters an idle state to
   poll the ACT signal. If the signal is asserted, the FPGA will then read the
   incoming message buffer and raise the IACK pin.
The protocols for the two cases, i.e., hardware send/software receive and
software send/hardware receive, are illustrated in Figs. 3 and 4, respectively.
Step  Unit                    Condition: action       Comment
1     FPGA                    IRQ ← 1                 FPGA interrupts C30
2     Interrupt Control Unit  IRQn = 1: INTn ← 0      ICU arbitrates the interrupt requests
3     TMS320C30               INTn = 0: IACK ← 0      C30 acknowledges and performs ISR
4     Interrupt Control Unit  ACK = 0: An ← 1         ICU relays ACK to the interrupting FPGAn
5     FPGA                    ACK = 1: receive ACK    FPGA receives ACK from C30
Fig. 3. The protocol illustration of FPGA send vs. C30 receive.

Step  Unit             Condition: action               Comment
1     TMS320C30        send address vector: XF0 ← 1    C30 writes to activate an FPGA
2     Address Decoder  EN = 1: ACTn ← 1                address decoder generates ACT signal
3     FPGA             ACT = 1: IACK ← 1               FPGAn polls & acknowledges
4     TMS320C30        XF1 = 1: receive ACK            C30 receives ACK from FPGAn
Fig. 4. The protocol illustration of C30 send vs. FPGA receive.
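The software send and hardware receive halves of the handshaking protocol can be mimicked with a small software model. This is a sketch, not the board's interface code: the struct and function names are hypothetical, the ACT and IACK lines are modeled as plain flags, the polling loop is bounded so that the model always terminates (the protocol as described polls indefinitely), and clearing ACT after the message is read is an assumption of the model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the signals between the C30 and one FPGA. */
typedef struct {
    uint32_t incoming_msg;   /* incoming message buffer (control information) */
    bool     act;            /* ACT line from the address decoder */
    bool     iack;           /* IACK pin of the FPGA, seen by the C30 on XF1 */
    uint32_t last_received;  /* message as latched by the FPGA */
} fpga_channel;

/* C30 side, step 1: memory mapped write of the message. In the model the
 * address decoder's reaction (asserting ACT) happens immediately. */
void sw_send_start(fpga_channel *ch, uint32_t msg)
{
    ch->incoming_msg = msg;
    ch->iack = false;
    ch->act = true;
}

/* FPGA side: one visit to the idle (hardware receive) state. Returns true
 * when a message was taken. Clearing ACT here is a modeling assumption. */
bool hw_receive_poll(fpga_channel *ch)
{
    if (!ch->act)
        return false;
    ch->last_received = ch->incoming_msg;
    ch->act = false;
    ch->iack = true;          /* raise IACK to acknowledge */
    return true;
}

/* C30 side, step 2: poll XF1 for the acknowledge. Each iteration also
 * steps the FPGA model, standing in for the concurrently running hardware.
 * Returns true if the acknowledge arrived within max_polls iterations. */
bool sw_send_wait(fpga_channel *ch, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (ch->iack)
            return true;
        hw_receive_poll(ch);
    }
    return ch->iack;
}
```

Because the send does not return until the acknowledge is seen, the model reproduces the blocking behaviour of software send; an FPGA that never reaches its idle state would stall the C30 in the real protocol.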
As for communication via a queue, the protocols are as follows:
1. Software write: The software program first reads the queue's "full" flag through
the expansion bus. If the flag is set, the program keeps polling the flag
until it is cleared. A write operation to the address designated for the queue's
input port is then performed. Besides the enqueued data, an extra control bit is
included in the write operation to signal the control mechanism to update the
queue pointer.
2. Software read: Similar to the write procedure, a software read first waits for
the queue's "empty" flag to be cleared. After the data is read, an extra write
operation which contains only the control bit is performed to signal the pointer
update.
3. Hardware write: This procedure is the same as that of software write except that
the finite state machine of the FPGA first enters a wait state rather than
an indefinite polling loop. Once the flag is cleared, the finite state
machine automatically moves to the write state.
4. Hardware read: This procedure is the same as that of software read except that it
is conducted under FSM control.
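The software side of the queue protocol can be sketched as follows. The class HandshakeQueue, its method names and the bounded polling count are hypothetical; the text describes indefinite polling, and a bound is used here only so that the example terminates.

```python
class HandshakeQueue:
    """Toy FIFO with 'full' and 'empty' flags, as seen from the bus."""

    def __init__(self, depth):
        self.mem = [None] * depth
        self.ptr_r = 0
        self.ptr_w = 0
        self.count = 0          # bookkeeping used to derive the two flags

    @property
    def full(self):
        return self.count == len(self.mem)

    @property
    def empty(self):
        return self.count == 0

    def write(self, data, ctrl=True):
        # Data and control bit travel in one bus write; the control bit
        # tells the queue controller to advance the write pointer.
        self.mem[self.ptr_w] = data
        if ctrl:
            self.ptr_w = (self.ptr_w + 1) % len(self.mem)
            self.count += 1

    def read(self):
        return self.mem[self.ptr_r]     # data only; pointer not yet moved

    def ack_read(self):
        # The extra write that carries only the control bit.
        self.ptr_r = (self.ptr_r + 1) % len(self.mem)
        self.count -= 1


def software_write(q, data, max_polls=1000):
    for _ in range(max_polls):          # poll until 'full' is cleared
        if not q.full:
            q.write(data)
            return True
    return False                        # gave up: the queue stayed full


def software_read(q, max_polls=1000):
    for _ in range(max_polls):          # poll until 'empty' is cleared
        if not q.empty:
            data = q.read()
            q.ack_read()
            return data
    return None


q = HandshakeQueue(depth=2)
ok = [software_write(q, x, max_polls=3) for x in (10, 20, 30)]
out = [software_read(q), software_read(q)]
```

The third write fails because nothing drains the two-entry queue in the meantime, which is the situation in which the real software write would keep polling.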
4.3 Communication Interfaces
To support the above mentioned protocols, communication interfaces must be
incorporated in both the HW and SW designs. In the FPGA part, the interface is
under the control of a finite state machine (FSM). The state diagram and the VHDL
code of the FSM are shown in Figs. 5 and 6, respectively. When in the idle (or
equivalently hardware receive) state, the FPGA is free from having to perform the
specific computation and can poll the flag to determine if any further communication
request from the C30 processor exists. As a result, the call from the C30 is only
checked in this state, and tasks performed by the FPGAs are basically non-preemptive.
There are two hardware execution states - one with and the other one without shared
bus access control. When competing for bus control, an FPGA will enter the bus
request state. Since the hardware send procedure needs to write the FPGA's output
message buffer in the shared memory, it can be initiated only while the FPGA owns
the bus; the FPGA relinquishes bus control afterward. The interface in the software section is
mainly implemented via communication library functions. Fig. 7 shows the state
diagram for SW execution.
In performing the software send (receive) procedure, the program will enter the
wait for FPGA acknowledge (interrupt) state. Both states correspond to indefinite
looping, which will be terminated when proper flags are set. Fig. 8 shows the state
diagram of the queue's control mechanism.
[Figure 5: state diagram with the states Idle, Execute Hardware, Wait BUS, FPGA
using BUS and Interrupt C30. The transitions are conditioned on ACT, ACK,
BUS_idle, GRANT, RDY and the completion of the function, and they drive IACK,
IRQ, BRQ and BUSY; the same behavior is given as VHDL code in Fig. 6.]
Fig. 5. State diagram of the FPGA interface circuitry
ENTITY fpga IS
  PORT (clock, ACT, ACK, RDY, BUS_idle, GRANT : IN  BIT;
        IRQ, BRQ, BUSY, IACK                  : OUT BIT);
END fpga;

ARCHITECTURE behavioral OF fpga IS
  TYPE state IS (idle, exec, wait_bus, using_bus, int_c30);
  SIGNAL current : state := idle;
  SIGNAL finish  : BIT;
BEGIN
  PROCESS
  BEGIN
    -- the FSM advances on the falling edge of the clock
    WAIT UNTIL clock='0' AND NOT clock'STABLE;
    CASE current IS
      WHEN idle =>                 -- hardware receive: poll ACT
        IACK   <= '0';
        IRQ    <= '0';
        BUSY   <= '0';
        BRQ    <= '0';
        finish <= '0';
        IF ACT='1' THEN
          IACK    <= '1';          -- acknowledge the C30
          current <= exec;
        END IF;
      WHEN exec =>                 -- execute the hardware function
        IACK <= '0';
        IF BUS_idle='1' AND finish='0' THEN
          BRQ     <= '1';          -- request the shared bus
          current <= wait_bus;
        END IF;
        IF finish='1' THEN
          IRQ     <= '1';          -- hardware send: interrupt the C30
          current <= int_c30;
        END IF;
      WHEN wait_bus =>             -- wait for the bus grant
        IF GRANT='1' THEN
          BUSY    <= '1';
          current <= using_bus;
        END IF;
      WHEN using_bus =>            -- shared memory access
        IF RDY='1' THEN
          BRQ     <= '0';          -- release the bus
          BUSY    <= '0';
          current <= exec;
          finish  <= '1';
        END IF;
      WHEN int_c30 =>              -- wait for the C30 acknowledge
        IF ACK='1' THEN
          IRQ     <= '0';
          IACK    <= '0';
          current <= idle;
        END IF;
    END CASE;
  END PROCESS;
END behavioral;
Fig. 6. VHDL code for the FPGA interface circuitry
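To trace one complete transaction through the interface FSM, the VHDL process of Fig. 6 can be mirrored in a small Python step function. This is a hedged behavioral sketch, not a substitute for simulation: the function name fsm_step, the dictionary representation and the stimulus sequence are assumptions, and each call models one falling clock edge in which, as with VHDL signals, all reads see the values from before the edge.

```python
def fsm_step(s, inp):
    """One falling clock edge of the Fig. 6 FSM.

    s   : dict with 'state', 'finish' and the outputs IRQ, BRQ, BUSY, IACK
    inp : dict with ACT, ACK, RDY, BUS_idle, GRANT (0 or 1)
    Returns a new dict; all conditions are evaluated on the old values.
    """
    n = dict(s)
    st = s["state"]
    if st == "idle":
        n.update(IACK=0, IRQ=0, BUSY=0, BRQ=0, finish=0)
        if inp["ACT"]:
            n.update(IACK=1, state="exec")
    elif st == "exec":
        n["IACK"] = 0
        if inp["BUS_idle"] and not s["finish"]:
            n.update(BRQ=1, state="wait_bus")
        if s["finish"]:
            n.update(IRQ=1, state="int_c30")
    elif st == "wait_bus":
        if inp["GRANT"]:
            n.update(BUSY=1, state="using_bus")
    elif st == "using_bus":
        if inp["RDY"]:
            n.update(BRQ=0, BUSY=0, finish=1, state="exec")
    elif st == "int_c30":
        if inp["ACK"]:
            n.update(IRQ=0, IACK=0, state="idle")
    return n


# Hypothetical stimulus: activate, bus idle, grant, ready, (nothing), ack.
quiet = dict(ACT=0, ACK=0, RDY=0, BUS_idle=0, GRANT=0)
stimulus = [dict(ACT=1), dict(BUS_idle=1), dict(GRANT=1),
            dict(RDY=1), dict(), dict(ACK=1)]

s = dict(state="idle", finish=0, IRQ=0, BRQ=0, BUSY=0, IACK=0)
trace = []
for step in stimulus:
    s = fsm_step(s, {**quiet, **step})
    trace.append(s["state"])
```

The trace visits exec twice: once before the bus access and once after it, when the finish flag causes the interrupt to be raised.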
[Figure 7: state diagram with the states Software Execution, wait for FPGA ACK and
wait for FPGA Interrupt. A software send writes the message to the FPGA input
buffer, sets XF0 and waits for the FPGA IACK on XF1; a software receive waits
until rcv_flag is set.]
Fig. 7. State diagram of the C30 program execution
[Figure 8: state diagram with the states Initial, EMPTY, FIFO and FULL. The
initial state resets ptr_R and ptr_W to 1. A write (wr='1' and full='0') stores
datain at mem(ptr_W) and increments ptr_W; a read (rd='1' and empty='0') outputs
mem(ptr_R) and increments ptr_R. The condition ptr_R = ptr_W leads either to EMPTY
(empty <= '1', full <= '0') or to FULL (full <= '1', empty <= '0').]
Fig. 8. State diagram of the queue's control mechanism
Fig. 9 shows the state diagram of the bus control unit. In our design, the C30 is
guaranteed access without arbitration when the bus is free. If the bus is in use by an
FPGA, the C30 will be denied external bus access automatically through internal
hardware interlocking. A bus request is, thus, transparent to software execution.
Note that the C30, FPGAs and other control units all work at the same clock rate, i.e.,
30MHz. The bus controller, however, works on the rising edges of the clock while
the FPGAs work on the negative edges of the clock. Likewise, different FSMs are
needed for the PCI bus controller and DMA controller. In Table 1, we list the
estimated communication delays of the proposed communication protocols. We
assume that the DRAM access time is 66ns, i.e., one instruction cycle for the C30
processor. The simplified communication protocols lead to a delay only 4 to 6 times
longer than a memory access delay.
[Figure 9: state diagram with the states BUS idle, C30 using BUS and FPGA using
BUS. The transitions are conditioned on C30_use_BUS, BRQ and BUSY, and they drive
GRANT, BUS_idle and RDY.]
Fig. 9. The state diagram of the bus control unit
Note that the numbers for the communication between the HW and SW sections
do not include the extra delay incurred due to mismatch between the send and receive
operations. Because of the bus arbitration delay, the shared memory access time by
the FPGA is longer than that by the C30 processor. Once the FPGA is granted bus
control, the access time, however, will also be 66ns (two clock cycles). Even though
the local memory access time is also two clock cycles long, no bus contention
overhead will occur as opposed to the case of shared memory access. Table 2 shows
the compiled FPGA interface circuitry overheads. The memory access circuitry for
both the shared and local memory modules are also included. The interface circuitry
occupies less than 10% of the CLB resources. It uses about 62% of the I/O pins to
support both the shared and local bus interfaces. Since the two bus interfaces suffice
to provide all the required HW/SW communication, the remaining pins can be
reserved for direct or synchronous HW/HW communication, which facilitates
hardware implementation across several FPGAs.
Table 1. The estimated communication delays
Communication type              Delay (ns)
HW send, SW receive             413
SW send, HW receive             231
Shared memory R/W (HW)          99
Shared memory R/W (C30)         66
FPGA local memory R/W           66
DMA setup                       264
DMA delay/w                     132
Table 2. FPGA interface circuitry overheads
Resource          interface  addr bus  data bus  addr gen  available  % used
CLBs              5          0         0         6         120        9.2
I/O pads          9          12+12     32+32     0         156        62.2
CLB FG            14         0         0         24        480        7.9
CLB FFs           8          0         0         24        480        6.7
3-state Buffer    0          12+12     64+64     0         656        23.2
CLB carry MUX     0          0         0         0         480        0
CLB 5-inp f MUX   0          0         0         0         240        0
5. HW/SW CODESIGN ENVIRONMENT
Based on the proposed target board design and the communication protocols, we
propose a new HW/SW codesign environment for rapid prototyping of embedded
applications. The design flow is shown in Fig. 10.
5.1 HW/SW Codesign Flow
The codesign begins with an algorithmic specification in VHDL. Since the
process is a major modeling construct in VHDL used to describe the function or
algorithm of a design entity, the system specification is described as a collection of
processes. To model the communication among the processes, we define a send
procedure and a receive procedure.
[Figure 10: flow chart with the blocks VHDL Program, Process Profiling, Construct
Process Communication Graph, Hardware/Software Partition (with a Heuristic Cost
Function), Performance & Constraints Verification (FAIL returns to partitioning,
PASS continues), Static Process Scheduling, FPGA Partitioning, Functional Level
Cosimulation, Interface mapping, Synopsys Synthesis, Implement FPGAs, VHDL to C
Translation, Add Interface Code, Simulated Evolution Code Generator and System
Integration.]
Fig. 10. The flow of the HW/SW co-design.
The send procedure is designed to be non-blocking, so that computing concurrency
among the processes can be faithfully preserved. This is achieved by declaring all
inter-process communication data in
signals. Each signal is associated with a tag to indicate its availability. The send
operation is performed simply by setting the tag and then advancing to the next
instruction. On the other hand, the receive operation checks the flag before using the
data. This provides a generic description of the inter-process communication.
After partitioning, these communications are then mapped to the appropriate
communication interfaces mentioned in section 4. Note that the tags are for
simulation purposes only and will not be synthesized in the design. An example of a
system input specification in VHDL is shown in Fig. 11. Since each process is
treated as indivisible in HW/SW partitioning, we first profile each process to extract
its hardware and software implementation attributes. For software profiling, the
VHDL code is first translated into an equivalent C code and compiled into an
assembly code using the C compiler of the C30 processor. The attributes extracted
include the program code size and the execution time. For hardware profiling, we
count the different types of distinct operations, e.g., multiply, add and divide, and the
number of occurrences for each type of operation. We also calculate two indices, i.e.,
parallelism and uniformity, to assess the benefit of hardware implementation. The
parallelism index is obtained by dividing the total number of operations (arithmetic
operations plus memory access) by the length of the ASAP scheduling (assuming no
resource constraints). The uniformity index is obtained by dividing the total number
of arithmetic operations by the number of distinct operation types. We further
classify each send and receive operation in all the processes as either batch mode or
discrete mode operations. After profiling, a process communication graph (PCG) is
constructed. A PCG closely resembles a signal flow graph or a block diagram
used to describe a DSP system. Each node corresponds to a process while each link
corresponds to an inter-process communication. Each node of the PCG is annotated
with the profiling attributes mentioned above plus its invocation frequency. Each
link is tagged with the variables for communication and their invocation frequencies.
Note that these frequencies are all measured with respect to the data input rate. The
PCG is then partitioned subject to the performance and resource constraints.
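The two hardware profiling indices can be illustrated on a toy operation graph. The sketch below assumes unit delay for every operation when computing the ASAP schedule length, which the text does not state, and the function names asap_length and profile as well as the example expression are hypothetical.

```python
def asap_length(ops, deps):
    """Length of the ASAP schedule with unit delays and no resource limit."""
    level = {}

    def visit(v):
        if v not in level:
            preds = deps.get(v, ())
            level[v] = 1 + max((visit(p) for p in preds), default=0)
        return level[v]

    return max(visit(v) for v in ops)


def profile(ops, deps):
    """Return (parallelism index, uniformity index) for one process."""
    arithmetic = [kind for kind in ops.values() if kind != "mem"]
    # total operations (arithmetic plus memory access) / ASAP length
    parallelism = len(ops) / asap_length(ops, deps)
    # arithmetic operations / number of distinct operation types
    uniformity = len(arithmetic) / len(set(arithmetic))
    return parallelism, uniformity


# y = (a*b) + (c*d): four memory reads, two multiplies, one add
ops = {"la": "mem", "lb": "mem", "lc": "mem", "ld": "mem",
       "m1": "mul", "m2": "mul", "s": "add"}
deps = {"m1": ["la", "lb"], "m2": ["lc", "ld"], "s": ["m1", "m2"]}

parallelism, uniformity = profile(ops, deps)
```

Here seven operations fit in three ASAP steps, and three arithmetic operations are spread over two operation types.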
[Process communication graph: processes P1 to P4 exchange data over the signals
line1 to line5, each guarded by a valid flag (flag1 to flag5); g_bar and d8 are the
inputs and coeff is the output.]

PROCEDURE receive_2 (SIGNAL flag1, flag2 : INOUT BIT) IS
BEGIN
  WAIT UNTIL flag1 = '1' AND flag2 = '1';
  flag1 <= '0'; flag2 <= '0';
END receive_2;

PROCEDURE send_2 (SIGNAL flag1, flag2 : INOUT BIT) IS
BEGIN
  flag1 <= '1';
  flag2 <= '1';
END send_2;

ENTITY test_model IS
  PORT (g_bar, d8 : IN INTEGER; coeff : OUT INTEGER);
END test_model;

ARCHITECTURE behavioral OF test_model IS
  SIGNAL line1, line2, line3, line4, line5 : INTEGER;
  SIGNAL flag1, flag2, flag3, flag4, flag5 : BIT;
BEGIN
  -- concurrent statement
  coeff <= line4;

  P1 : PROCESS
  BEGIN
    :
    WAIT UNTIL flag4 = '1';
    -- clear the valid flag of line4
    flag4 <= '0';
    line1 <= line4 + g_bar;
    -- set valid flag of line1
    flag1 <= '1';
    :
  END PROCESS P1;

  P2 : PROCESS
    VARIABLE temp : INTEGER;
  BEGIN
    :
    receive_2(flag1, flag2);
    temp := line1 + line2;
    line4 <= temp;
    line5 <= temp;
    send_2(flag4, flag5);
    :
  END PROCESS P2;

  P3 : PROCESS
  BEGIN
    :
    receive_1(flag3);
    line2 <= line3 * d8;
    send_1(flag2);
    :
  END PROCESS P3;

  P4 : PROCESS
  BEGIN
    :
    receive_1(flag5);
    line3 <= line3 + line4;
    send_1(flag3);
    :
  END PROCESS P4;
END behavioral;
Fig. 11. An example of VHDL input specification.
5.2 HW/SW Partitioning and Verification
In this study, we adopted a software oriented approach which starts with all
processes in the software section and then begins to migrate processes from the
software section to the hardware section.
Process selection is guided by a heuristic cost function, and the resultant partitioning
of each move is verified against both the performance and resource constraints. For
the time being, the selection process is performed manually by the user, and our focus
is on the verification part. Our verification scheme is based on the proposed target
board architecture. It can precisely characterize the hardware and software design
constraints, and the communication overhead so that infeasible partitioning can be
detected early without working out the implementation details. In the application
domain of signal compression, the system must periodically process the input data
stream. Therefore, the codesign system must complete one iteration of computation
within a specified period, called an initiation interval, which is the reciprocal of the
system's throughput rate. Given the initiation interval T, the partitioning verification
procedure is as follows:
Procedure Partition_Verification
1. Subject to the partitioning result, assign a communication type to every edge
in the PCG. The rules are as follows. A PCG edge crossing the partitioning
boundary is assigned batch communication if it is classified as batch mode,
queue communication if it is classified as discrete mode and its invocation
frequency is greater than 1, and hand-shaking communication otherwise. A PCG
edge within the hardware section is assigned synchronous communication if it is
inter-iteration and direct communication if it is intra-iteration.
2. Calculate the delay of each communication based on the estimates derived from
the communication protocols. The estimates are obtained by measuring a real
implementation while ignoring the indefinite delay in a wait state or due to bus
contention.
3. Perform a preliminary performance check. For the software section, if the
sum of the profiled computation delays of all processes plus the calculated
communication delays is greater than the initiation interval T, the partitioning
is concluded to be infeasible. For the hardware section, the demanding factor for
each type of operation is first calculated. The demanding factor
$\alpha_i = n_i \times t_i / T$
represents the lower bound on the number of type i function units needed,
where $n_i$ is the total number of type i operations in all hardware section
processes and $t_i$ is the measured delay of a type i function unit in the FPGA
implementation. If $\sum_i \alpha_i \times \beta_i > 1$, the hardware section
design has exceeded the FPGA capacity, where $\beta_i$ is the normalized hardware
complexity ratio with the total FPGA capacity (after taking away the interface
circuitry) equal to 1.
4. Allocate hardware function units. The hardware resource partitioning among
different types of function units is proportional to the respective demanding
factor. $s \times \alpha_i$ type i function units are allocated, where the scaling
factor $s = \max\{ s \mid \sum_i s \times \alpha_i \times \beta_i \le k \}$ and k is
an empirical value (around 0.8) for the maximum FPGA CLB utilization ratio.
5. Subject to the hardware allocation, conduct a resource constrained scheduling
on the modified PCG. To obtain the modified PCG, each node in the original
PCG is split into a set of nodes, each corresponding to a code fragment
delimited by the send and receive operations. The communication edges are
adjusted accordingly. However, all the inter-iteration communication edges are
removed. The software section has only one resource, i.e., the C30 processor.
All the software section nodes are scheduled statically and executed
sequentially on the C30 processor. A simple list scheduler can serve the
purpose. If the entire modified PCG cannot be scheduled in T, the partitioning
is also infeasible.
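Steps 1, 3 and 4 of the procedure lend themselves to a compact sketch. In the code below the demanding factor is written alpha and the normalized hardware complexity ratio beta; these names, the function names, and all numbers in the example are hypothetical. The source does not say how fractional unit counts are rounded, so the sketch leaves them unrounded, in which case the largest admissible scaling factor is simply k divided by the weighted sum. A non-empty hardware section (a positive weighted sum) is assumed.

```python
def comm_type(crosses_boundary, mode, freq, inter_iteration=False):
    """Step 1: map a PCG edge to a communication type."""
    if crosses_boundary:
        if mode == "batch":
            return "batch"
        return "queue" if freq > 1 else "hand-shaking"
    # edge inside the hardware section
    return "synchronous" if inter_iteration else "direct"


def demanding_factors(n, t, T):
    """Step 3: alpha_i = n_i * t_i / T for every operation type i."""
    return {i: n[i] * t[i] / T for i in n}


def hardware_check(alpha, beta, k=0.8):
    """Steps 3 and 4: capacity test, then proportional allocation."""
    load = sum(alpha[i] * beta[i] for i in alpha)
    if load > 1:
        return None                 # design exceeds the FPGA capacity
    s = k / load                    # largest s with s * load <= k
    return {i: s * alpha[i] for i in alpha}


# Hypothetical hardware section: operation counts, unit delays in ns,
# and the fraction of the FPGA taken by one unit of each type.
n = {"mul": 1000, "add": 2000}
t = {"mul": 500, "add": 250}
beta = {"mul": 0.5, "add": 0.1}

alpha = demanding_factors(n, t, T=2_500_000)   # T = 2.5 ms in ns
units = hardware_check(alpha, beta)
used = sum(units[i] * beta[i] for i in units)  # equals k by construction
```

A practical implementation would round the unit counts to integers and re-check the capacity constraint afterwards.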
5.3 HW/SW Implementation
After HW/SW partitioning, the hardware section design is further divided into
different FPGAs by means of algorithms such as bipartite partitioning. Since the
entire design is still expressed in VHDL, a functional level co-simulation using a
VHDL simulator is performed here as a check point before proceeding with physical
implementation. The partitioned VHDL program for each FPGA is augmented with
a predefined VHDL code to synthesize the HW/SW communication interface. High
level synthesis in the Synopsys(TM) system is next employed to derive the FPGA design.
Each FPGA design is based on a structural template which contains 4 basic modules,
i.e., a processing unit (PU), control unit (CU), memory unit (MU) and communication
interface unit (CIU). The processing units adopt fixed point arithmetic and may also
include the number conversion and data alignment circuitry. The memory unit
includes registers, address generators and read/write circuitry of the local memory.
The control unit is simply synthesized by a finite state machine. The communication
interface unit includes data storage buffers for the communication channel and an
FSM based controller that implements the communication protocols. For the
software section, each VHDL process is first converted into a data flow graph (DFG).
A code generation tool based on simulated evolution [17] is used to generate the
software assembly code of the corresponding DFG.
To implement the communication interface, extra code for the communication
subroutines is inserted. The code for the different processes is merged into a
single program according to the static scheduling result.
6. CODESIGN EXAMPLE OF LD-CELP DECODER
To demonstrate the usefulness of the proposed target board, a large and practical
codesign example for the LD-CELP (low-delay code excited linear prediction)
speech decoder based on the CCITT G.728 recommendation [18] is currently under
development. It can support a data compression ratio of 4 and yield a 16 kbit/s data
rate. The recommendation has been widely adopted in applications such as
teleconferencing systems and digital answering machines. Fig. 12 shows the simplified
block diagram of the speech decoder system. At the encoding site, the system takes
unquantized speech inputs at an 8 kHz sampling rate. Every five consecutive samples
are assembled into a speech vector and encoded as a 10-bit index of the codebook.
At the receiving end, for each received 10-bit index, the decoder performs a table
look-up to extract the corresponding codevector from the excitation codebook. The
extracted codevector is then passed through a gain scaling unit and a synthesis filter to
produce the current decoded signal vector. The synthesis filter coefficients and the
gain are then updated via backward adaptation. The decoded signal vector is then
passed through an adaptive post-filter to enhance the perceptual quality. The system
is described as a collection of nine processes.
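The rate figures quoted above follow from simple arithmetic, reproduced here as a check. The constant names are hypothetical; the last line only restates what a compression ratio of 4 implies for the uncompressed stream.

```python
SAMPLE_RATE_HZ = 8000       # sampling rate of the speech input
SAMPLES_PER_VECTOR = 5      # samples assembled into one speech vector
BITS_PER_INDEX = 10         # codebook index transmitted per vector
COMPRESSION_RATIO = 4

vectors_per_second = SAMPLE_RATE_HZ // SAMPLES_PER_VECTOR
bit_rate = vectors_per_second * BITS_PER_INDEX            # bit/s
vector_period_ms = 1000 * SAMPLES_PER_VECTOR / SAMPLE_RATE_HZ
uncompressed_rate = bit_rate * COMPRESSION_RATIO          # bit/s
bits_per_input_sample = uncompressed_rate // SAMPLE_RATE_HZ
```

The decoder therefore receives one 10-bit index every 0.625 ms, and the 16 kbit/s stream corresponds to an uncompressed rate of 64 kbit/s, or 8 bits per sample.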
6.1 HW/SW Partitioning of the Decoder
To implement the system, we first conducted a process profiling of the G.728
decoder module. The results shown in Table 3 are based on the computations
needed to process a data frame which consists of 4 vectors, i.e., 20 input samples. A
data frame is, therefore, considered as one iteration in this case. To decode the
encoded speech data in real time, all the computations must be finished in 2.5ms (the
interval for a data frame).
[Figure 12: block diagram with the blocks numbered as in the G.728
recommendation: excitation VQ codebook, gain scaling, synthesis filter,
postfilter and postfilter adapter; the backward synthesis filter adapter (hybrid
windowing module, Levinson-Durbin recursion module, bandwidth expansion module);
and the backward vector gain adapter (1-vector delay, RMS calculator, logarithm
calculator, log-gain offset value holder, hybrid windowing module,
Levinson-Durbin recursion module, bandwidth expansion module, log-gain linear
predictor, log-gain limiter, inverse logarithm calculator).]
Fig. 12. The block diagram of a G.728 LD-CELP speech decoder
The software profiling shows that a pure software
implementation takes about 44.6ms, which is about 18 times the allowed time slot,
i.e., the initiation interval. Our partitioning algorithm first picks the most time
consuming process, i.e., the Levinson-Durbin (L-D) recursion module in the
backward synthesis filter adapter, and moves it to the hardware section.
Algorithmically, the L-D recursion module consists of a 3-level nested computing
loop. It computes the predictor coefficients from the auto-correlation matrix
recursively frame by frame. In the profiling, it can be seen that the uniformity index
of the L-D is good because the same module but with lower computing order also
appears in the backward vector gain adapter. The parallelism index, however, is
poor due to tightly coupled recursive computation. In this example, we manually
replace it with a more parallelized Schur algorithm [19].
Table 3. Profiling results of G.728 decoder system
                                                   HW Profiling                          SW Profiling
Process                                            Add/Sub  Multiply  Division  Memory   Instructions
Excitation VQ Codebook                             2        7         1         5        108
Gain Scale                                         0        5         0         1        116
Backward vector gain adapter
  (except for L-D recursion)                       31       128       2         228      8556
Synthesis filter                                   261      256       0         240      8676
Backward synthesis filter adapter
  (except for L-D recursion)                       3577     3703      0         519      17750
Postfilter                                         46       46        1         114      3752
Postfilter adapter                                 7289     7300      2         310      53134
L-D recursion for log-gain linear predictor        334      334       10        11       4840
L-D recursion for synthesis filter                 33147    33147     50        51       579107
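The software timing quoted in Section 6.1 can be cross-checked against the instruction counts of Table 3. The sketch assumes that every counted instruction takes one 66 ns instruction cycle, and the pairing of each count with a process name is an assumption recovered from the table layout; the total does not depend on that pairing.

```python
INSTRUCTION_CYCLE_NS = 66       # one C30 instruction cycle
FRAME_INTERVAL_MS = 2.5         # one data frame of 20 samples

instructions = {
    "Excitation VQ codebook": 108,
    "Gain scale": 116,
    "Backward vector gain adapter (except L-D)": 8556,
    "Synthesis filter": 8676,
    "Backward synthesis filter adapter (except L-D)": 17750,
    "Postfilter": 3752,
    "Postfilter adapter": 53134,
    "L-D recursion for log-gain linear predictor": 4840,
    "L-D recursion for synthesis filter": 579107,
}

total = sum(instructions.values())
sw_time_ms = total * INSTRUCTION_CYCLE_NS / 1e6     # ns -> ms
slowdown = sw_time_ms / FRAME_INTERVAL_MS

# Share of the software time spent in the largest process.
largest = max(instructions, key=instructions.get)
largest_share = instructions[largest] / total
```

Under these assumptions the total comes to about 44.6 ms, roughly 18 times the frame interval, and the L-D recursion for the synthesis filter alone accounts for more than 85% of it, which is why it is the first candidate for migration to hardware.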
6.2 Partitioning Verification and Hardware Design of the Schur Algorithm
In partitioning verification, we first perform a communication classification.
The communication between the Schur algorithm process and the synthesis filter
process, and the communication between the Schur algorithm process and the backward
synthesis filter adapter process, are classified as batch communication. In the former
case, autocorrelation matrix coefficients, and in the latter case, synthesis filter
coefficients will be transferred. Both are mapped to batch communications via the
DMA. The exchanged coefficients are first moved from the shared memory to the
local memory and are moved back again after the Toeplitz system is solved. The
next step in the codesign flow is to allocate hardware function units; we obtained an
allocation consisting of three multipliers, one divider and four adders. Based on this
allocation, the systolic array design of the Schur algorithm is derived manually
instead of by using a high level synthesis tool. The design contains a reflection
coefficient calculation (RCC) stage, followed by forward and backward substitution
stages. The RCC stage is implemented by means of a pipelined-lattice structure as
shown in Fig. 13.
[Figure 13: pipelined lattice of butterfly modules, shown for stages i = 1 to 3.
Each stage forms the reflection coefficient $k^{(i+1)}$ from $u_1^{(i)}$ and
$v_0^{(i)}$. Each butterfly combines $v_j^{(i)}$ with $k^{(i+1)} u_{j+1}^{(i)}$ to
produce $v_j^{(i+1)}$, and $u_{j+1}^{(i)}$ with $k^{(i+1)} v_j^{(i)}$ to produce
$u_j^{(i+1)}$. The initial inputs are $u_j^{(1)} = t_j$ for j = 0 to 3.]
Fig. 13. Pipelined lattice structure design for reflection coefficient calculation
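The benefit of replacing the L-D recursion by the Schur algorithm can be seen in a short numerical sketch. The code below uses the textbook (Le Roux-Gueguen) form of the Schur recursion with one common sign convention; it is not the authors' systolic design, and the function names and test data are hypothetical. Both routines return the same reflection coefficients, but the L-D recursion needs an inner product in every stage, whereas all updates within one Schur stage are independent of each other and can be computed by parallel butterfly modules.

```python
def levinson_durbin(r, p):
    """Reflection coefficients k_1..k_p from autocorrelation r[0..p]."""
    a = [1.0]                 # predictor polynomial, a[0] = 1
    err = r[0]                # prediction error energy
    ks = []
    for m in range(1, p + 1):
        # inner product: a sequential bottleneck in every stage
        acc = sum(a[j] * r[m - j] for j in range(m))
        k = -acc / err
        a = [1.0] + [a[j] + k * a[m - j] for j in range(1, m)] + [k]
        err *= 1.0 - k * k
        ks.append(k)
    return ks


def schur(r, p):
    """Same coefficients via the Schur recursion (no inner products)."""
    gp = list(r[:p + 1])      # forward sequence
    gm = list(r[:p + 1])      # backward sequence
    ks = []
    for m in range(1, p + 1):
        k = -gp[m] / gm[m - 1]
        ks.append(k)
        new_gp, new_gm = gp[:], gm[:]
        for n in range(m, p + 1):          # independent butterfly updates
            new_gp[n] = gp[n] + k * gm[n - 1]
            new_gm[n] = gm[n - 1] + k * gp[n]
        gp, gm = new_gp, new_gm
    return ks


r = [1.0, 0.5, 0.2, 0.05]     # hypothetical autocorrelation values
k_ld = levinson_durbin(r, 3)
k_schur = schur(r, 3)
```

For this input both routines give the reflection coefficients -0.5, 1/15 and 0.03125.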
To obtain the maximum degree of parallelism, the number of required
butterfly-like modules should be equal to the filter tap order in both adapters, i.e., 10
and 50, respectively. This apparently exceeds the FPGA capacity in our target board.
Currently, only two butterfly modules are incorporated in one XC5204 FPGA. Each
module contains a multifunction arithmetic array for high speed multiplication and
division: a carry-free accumulator. The systolic array design for the forward &
backward substitution (FBS) stage is shown in Fig. 14. Each module contains
function units similar to those in the RCC butterfly module. To match the data
bandwidth of the RCC stage, only two modules are incorporated and implemented in
another XC5204 FPGA.
[Figure 14: systolic array of processing modules for a 4 x 4 example. The matrix
elements $u_{ij}$ are fed in on successive time steps (t = 1 to 7), with each port
shared between the forward (f) and backward (b) passes, e.g., (f) $u_{41}$ | (b)
$u_{14}$. The modules accumulate partial sums such as $u_{41} g_1 + u_{42} g_2 +
u_{43} g_3$. The vector ports carry y and g in the forward pass and g and x in the
backward pass.]
Fig. 14. Systolic array design for the forward & backward substitution
6.3 HW/SW Communication Performance Analysis
In this implementation, the HW/SW communication overhead includes 1) the
DMA delay in moving the auto-correlation matrices from the shared memory to the
local memory of the FPGA for RCC, 2) the delay caused by the C30 when it activates
the FPGAs to perform the Schur algorithm, 3) the delay caused by the FPGA when
it interrupts the C30 to signal completion of the Schur algorithm, and 4) the DMA
delay in moving the filter coefficients from the local memory to the shared memory.
The total communication overheads are compiled in Table 4.
Table 4. Communication time overhead for the HW sections in LD-CELP decoder
           Execution cycle   No. of comm. clks   Timing overhead   % of comm. overhead
Block 44   2.5 ms            112                 3.68 us           0.15%
Block 50   2.5 ms            432                 14.24 us          0.57%
Again, the overheads are quite small (less than 1%) in this case. Fig. 15 shows
the HW/SW communication flow of the LD-CELP decoder example implemented in
our target board. The types of communication used are also indicated in the figure.
Downloading and uploading of data between the shared memory and the FPGA's local
memory are types of batch communication. Sixty-two data items are exchanged in
total (51 for the synthesis filter and 11 for the gain adapter) during each data frame.
The signaling between the C30 and the FPGAs basically follows the asynchronous
communication protocols. The data passing between the RCC and FBS FPGAs falls
into the synchronous communication category. In addition, the executions between
the RCC and FBS are pipelined, and the communication is synchronized by the
registers. Even though the entire example is still under development, initial analyses
did show that the proposed target board architecture facilitates very efficient HW/SW
communications.
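The overhead percentages of Table 4 can be reproduced directly from the timing overheads and the 2.5 ms execution cycle. The variable names in this small check are hypothetical.

```python
FRAME_US = 2500.0                                   # 2.5 ms execution cycle
overhead_us = {"Block 44": 3.68, "Block 50": 14.24} # timing overheads, Table 4

percent = {blk: 100.0 * t / FRAME_US for blk, t in overhead_us.items()}
total_percent = sum(percent.values())
```

Even when both hardware blocks are counted together, the communication overhead stays below 1% of the frame interval.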
[Figure 15: timing chart with the columns TMS320C3x, FPGA1 and FPGA2 over the
frames n-1, n and n+1, each frame comprising vectors #1 to #4. In every frame, DMA
transfers move data to and from the local memory, the C30 activates the FPGAs
(ACT<=1, IACK<=1) to run the Schur algorithm for the synthesis filter and for the
gain adapter (solving the Toeplitz system, followed by forward & backward
substitution), and interrupts (INT<=0) signal that the coefficients for the
synthesis filter or the gain adapter are ready. Legend: synchronous communication
type, asynchronous communication type, batch communication type, pipelined stage.]
Fig. 15. HW/SW communication flow of the LD-CELP decoder
7. CURRENT RESULTS AND SUMMARY
The proposed HW/SW codesign system is currently under development at
National Yunlin University of Science & Technology, Taiwan, ROC. Even though
the target board is equipped with a C30 processor and Xilinx FPGAs, the proposed
interface module and communication protocols can be equally applied to different
DSP processors and FPGAs. Likewise, the partitioning and the verification
procedures in the co-design system can be easily adapted to different FPGA and DSP
processor models. We have so far finished 1) the software code generation module
of the C30 processor, 2) definition and behavioral simulation of the communication
interface and protocols, 3) the architectural design of the target board and 4) both the
software and hardware section designs of the LD-CELP decoder example. The
HW/SW partitioning module is still under development. Therefore, partitioning of
the decoder example is done manually at this moment. We are also constructing an
FPGA reference library for both design estimation and synthesis purposes.
In summary, in this paper, we have presented a novel embedded prototyping
system based on hardware/software co-design. The proposed target board consists
of a popular DSP processor and several large capacity Xilinx FPGAs. It features a
shared bus architecture with two levels of memory hierarchies, i.e., shared main
memory and FPGA local memory. Communication interfaces between the hardware
and software sections have been carefully defined to support various types of HW/SW
communications efficiently. The communication interfaces for the HW and the SW
sections are described in VHDL code and C communication routines, respectively.
This leads to code augmentation in both sections which supports the communication
interface. Based on this prototyping system, we have also proposed a HW/SW
co-design environment which takes VHDL code as the initial design specification and
performs coarse grain partitioning at the process level. A partitioning result
verification procedure has also been developed. High level synthesis and parallel
DSP code generation are then employed to realize the respective designs in the target
board. Preliminary results for the codesign example of an LD-CELP speech decoder
indicate that the proposed prototyping system does support very efficient HW/SW
communications.
ACKNOWLEDGMENT
This work was financially supported by the NSC, R.O.C., under Grant
NSC86-2221-E-224-007.
REFERENCES
1. M. Theißinger, P. Stravers, and H. Veit, "Castle: An interactive environment for
HW-SW co-design," in Proceedings of Third International Workshop on
Hardware/Software Codesign, 1994, pp. 203-209.
2. M.D. Edwards, J. Forrest, "Software acceleration using programmable hardware
devices," IEE Proceedings Computer Digital Technology, Vol. 143, 1996, pp.
55-63.
3. G. Koch, U. Kebschull and W. Rosenstiel, "A prototyping environment for
hardware/software codesign in the CORBA project," in Proceedings of Third
International Workshop Hardware/Software Codesign, 1994, pp. 10-16.
4. J.P. Calvez, D. Isidoro and D. Jeuland, "A codesign experience with the MCSE
methodology," in Proceedings of Third International Workshop on
Hardware/Software Codesign, 1994, pp. 140-147.
5. R. Ernst, J. Henkel, T. Benner, "Hardware-software cosynthesis for
microcontrollers," IEEE Design & Test of Computers, Vol. 10, No. 4, 1993, pp.
64-75.
6. D. Herrmann, J. Henkel and R. Ernst, "An approach to the adaptation of estimated
cost parameters in the COSYMA system," in Proceedings of Third International
Workshop Hardware/ Software Codesign, 1994, pp. 100-107.
7. R.K. Gupta and G. De Micheli, "Hardware-software cosynthesis for digital
system," IEEE Design & Test of Computer, Vol. 10, No. 3, 1994, pp. 29-41.
8. R.K. Gupta, C.N. Coelho Jr. and G. De Micheli, "Program implementation
schemes for hardware-software systems," Computer, Vol. 27, No. 1, 1994, pp.
48-55.
9. R. Gupta, Claudionor Coelho and G. De Micheli, "Synthesis and simulation of
digital systems containing interacting hardware and software components," in
Proceedings of DAC, 1992, pp. 225-230.
10. M. Sheliga and E. Sha, "Hardware/software co-Design with the HMS framework,"
Journal of VLSI Signal Processing, Vol. 13, No. 1, 1996, pp. 37-56.
11. P. Eles, Z. Peng and A. Doboli, "VHDL system-level specification and
partitioning in a hardware/software co-synthesis environment," in Proceedings of
the 3rd International Workshop on HW/SW Codesign, 1994, pp. 49-55.
12. N.S. Woo, A.E. Dunlop and W. Wolf, "Codesign from co-specification,"
Computer, Vol. 1, No. 1, 1994, pp. 42-7.
13. A. Baganne, J.L. Phillipe and E. Martin, "A codesign methodology for
telecommunication systems: A case study of an acoustic echo canceller," in
Proceedings of the 1997 IEEE Workshop on Signal Processing Systems - Design
and Implementation, Leicester, UK, 1997, pp. 273-282.
14. T. Ben Ismail and A.A. Jerraya, "Synthesis steps and design models for codesign,"
Computer, Vol. 28, No. 2, 1995, pp. 44-53.
15. T.B. Ismail, M. Abid and A. Jerraya, "COSMOS: A codesign approach for
communicating systems," in Proceedings of the 3rd International Workshop on
HW/SW Codesign, 1994, pp. 17-24.
16. T.B. Ismail, M. Abid, K. O'Brien and A. Jerraya, "An approach for hardware-software
codesign," in Proceedings of the 5th International Workshop on Rapid System
Prototyping, 1994, pp. 73-80.
17. Y.-T. Hwang and J.-S. Hwang, "Simulated evolution based code generation for
programmable DSP processors," in Proceedings of ISCAS '97, Vol. IV, 1997, pp.
2593-2596.
18. J.-H. Chen, R. Cox, Y.-C. Lin and others, "A low-delay CELP coder for the
CCITT 16 kb/s speech coding standard," IEEE Journal on Selected Areas in
Communications, Vol. 10, No. 5, 1992, pp. 830-849.
19. Y.H. Hu and S.Y. Kung, "Toeplitz eigensystem solver," IEEE Transactions on
ASSP, Vol. 33, 1985, pp. 1264-1271.
Yin-Tsung Hwang (黃穎聰) obtained his B.S. and M.S. degrees, both in
electronic engineering, from National Chiao Tung University, Hsinchu, Taiwan,
R.O.C., in 1983 and 1985, respectively. He received his Ph.D. degree from the
Department of Electrical & Computer Engineering, University of Wisconsin,
Madison, in 1993. He then joined the Department of Electronic Engineering,
National Yunlin University of Science & Technology, and is now an associate
professor. Dr. Hwang's research interests include code generation for high
performance digital signal processors, hardware/software codesign and VLSI digital
signal processing.
Yuan-Hung Wang (王元鴻) graduated with a B.S. degree in electronic
engineering from National Yunlin University of Science & Technology in 1995.
After his graduation, Mr. Wang worked for the Optical-Electronic Research Lab.,
Industrial Technology Research Institute, for one year. He joined the Institute of
Electronic and Information Engineering, National Yunlin University of Science &
Technology in 1996, and is currently working toward his master's degree. Mr.
Wang's research interests include FPGA implementation and hardware/software
codesign.
Jer-Sho Hwang (黃晢修) received his B.S. degree in electronic engineering from
Chung-Yuan Christian University in 1995. In the same year, he joined the Institute
of Electronic and Information Engineering, National Yunlin University of Science &
Technology, and studied under the supervision of Prof. Yin-Tsung Hwang. He
received his M.S. degree in 1997 and is now serving in the R.O.C. Army. Mr.
Hwang's research interests include DSP code generation and hardware/software
codesign.