JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 14, 605-632 (1998)

Rapid Prototyping of Hardware/Software Codesign for Embedded Signal Processing

Yin-Tsung Hwang, Yuan-Hung Wang and Jer-Sho Hwang
Department of Electronic Engineering
National Yunlin University of Science & Technology
Touliu, Yunlin 640, Taiwan, R.O.C.
E-mail: hwangyt@cad.el.yuntech.edu.tw

In this paper, we propose a target board architecture suitable for embedded signal processing applications based on hardware software codesign. The target board, which serves as a system attached to a host PC via a PCI bus interface, contains a TMS320C30 DSP processor and up to four Xilinx XC5204 FPGAs. The software and hardware sections of the codesign can be easily implemented using C and VHDL programming in the C30 processor and FPGAs, respectively. Based on the proposed target board architecture, the interface circuitry and the communication protocols between the hardware (FPGAs) and software (C30) sections are first derived. The interface circuitry is described in VHDL code and will be added to the FPGA design for high level synthesis. Five types of HW/SW communications are supported. A HW/SW codesign flow is also exploited, and a partitioning verification procedure is developed. To illustrate the merits of the proposed system, a HW/SW codesign implementation example based on the G.728 LD-CELP decoder for speech compression is described.

Keywords: hardware/software codesign, communication interface, embedded system, hardware/software partitioning, hardware description language, target board, rapid prototyping, field programmable gate array.

1. INTRODUCTION

Large and complex DSP systems are often composed of multiple and heterogeneous processing blocks. Some of these blocks may involve massive computations, where hardwired implementation in ASIC may provide the best speed performance.
Other blocks may need to handle complex control flows or communication protocols, where a software programming approach is most flexible and cost effective. Therefore, in contrast to pure software programming or pure hardwired implementations, an alternative approach is to partition the system into software and hardware sections. Each section can then be implemented using respective software or hardware technology. Since these processing blocks interact with one another, they cannot be designed independently. This gives rise to a new design methodology, where the constituent hardware and software subsystems are developed concurrently to meet specified performance and cost constraints. This is known as hardware/software codesign.

Received October 31, 1997; revised March 18, 1998. Communicated by Jin-Yang Jou.

1.1 HW/SW Codesign for Embedded Systems

HW/SW codesign has been commonly adopted in designing embedded systems. Basically, such embedded systems react in real time to external asynchronous events and process incoming data as in classical digital signal processing. Products in applications such as personal communication, automotive control, consumer electronics or office automation are often implemented as embedded systems. An embedded system usually incorporates a programmable microprocessor core with memory plus some hardwired (or field programmable) devices for peripherals and dedicated computations. The programmable microprocessor (a digital signal processor in most cases) is mainly responsible for realization of the software section while the hardwired logic devices implement the hardware section of the system. So far, the HW/SW codesign environment for embedded systems is still very primitive. The design is basically achieved in an ad-hoc manner. This, however, is not appropriate for the dynamic and fast changing nature of the embedded system market, where products must be developed within a very short cycle.
To resolve this problem, two factors are important: 1) a rapid prototyping embedded system and 2) a good hardware/software codesign environment. A rapid prototyping system can provide a platform for early implementation and verification of the system. It can also serve as an architectural template which enhances design reuse and reduces the effort required to develop a system. A good codesign environment can help the designer tackle the new and emerging design issues encountered in the HW/SW codesign. These include hardware/software co-specification, partitioning, communication, co-simulation and co-verification. Among them, the HW/SW partitioning problem has probably received the most research attention. Other issues, nonetheless, are often solved in an ad-hoc and application dependent way.

1.2 Rapid Prototyping of HW/SW Codesign

In this paper, we propose a rapid prototyping embedded system based on HW/SW codesign. The rapid prototyping system assumes an architectural template which is implemented using a programmable and configurable target board. This architectural template is designed to be flexible and capable of accommodating the computing needs of various embedded signal processing applications, in particular in the arena of speech, acoustics and audio. This target board serves not only as an implementation vehicle to rapidly prototype and verify HW/SW codesigned systems, but also as a practical platform for addressing the design issues in HW/SW codesign. Without such an architectural platform, the design space will be too large to exploit efficiently. Many design constraints, such as HW/SW communication models, cannot be characterized precisely either. This may lead to underestimation of the incurred hardware complexity and time overhead, which makes the design impractical for real implementation.
Based on the proposed target board architecture, we then address the HW/SW communication interface problem, which is largely dependent on the architectural platform. Communication interfaces and their protocols are essential to efficiently integrate the HW and SW sections into a complete system. They are also prerequisites for constructing a HW/SW partitioning model. With the proposed architectural template and the HW/SW communication interfaces, a practical HW/SW codesign environment can then be established. In this paper, instead of developing a brand new CAD system, our focus is on integrating existing CAD tools and devising a design flow for the HW/SW codesign problem.

2. PREVIOUS WORK AND PROPOSED APPROACHES

Numerous works on HW/SW co-design related issues have been presented. Our review will focus basically on two issues, i.e., the target board architecture along with its communication interface, and the design methodology.

2.1 Target Architectures and Communication Interfaces

CASTLE [1] provided an architectural library containing several common co-design architectural templates as well as communication and synchronization mechanisms. It is up to the designer to decide on the appropriate architecture. A sequence of architectural refinements also has to be employed manually before real implementation can take place. A file compression algorithm example uses a SPARC processor and a Xilinx XC4005 FPGA as a coprocessor with a local memory. In [2], the target architecture is a PC-based development board which contains an Intel i960 microprocessor, a Xilinx 4008 FPGA, a program/data memory, a single serial I/O and an AT bus interface. The system is basically a single bus architecture, where all components are hooked to a system bus. The communication between the i960 microprocessor and the FPGA is achieved through dual port parameter memory, where batch data is transferred to and from the program/data memory via DMA operations.
This architecture is clearly defined but can only support batch type HW/SW communication. In the COBRA project [3], the base hardware module carries four Xilinx 4025 FPGAs connected as a mesh. Multiple base modules can be connected together, and an I/O module supports parallel SPARC S-Bus connection to a host. Since only one FPGA (root) can be connected directly to the host processor (SPARC), communication data between the host processor and any destination FPGAs (other than the root) must be relayed and routed through the intermediate FPGAs. This implies dramatic HW/SW communication overhead. The architecture is only suitable for master-slave type operations, where FPGA modules passively perform dedicated function calls from the host processor. In [4], a master/master target architecture was adopted. The generic hardware architecture includes a Motorola 68000 microprocessor and a root Xilinx 4005 FPGA. The root FPGA shares the memory with the microprocessor. Communication/synchronization between the processor and FPGAs is implemented using both interrupt and polling techniques. Additional FPGAs are connected to the root FPGA via a serial bus, which uses a token ring mechanism to control access. Basically, its limited communication bandwidth and the significant communication overhead between the microprocessor and the non-root FPGAs have constrained its application. In [5, 6], the COSYMA target architecture was presented, which consists of a standard RISC processor core (the SPARC processor) and an application specific co-processor. The hardware and software components execute in mutual exclusion. Communication is done through shared memory with a CSP type protocol. Since this is a master-slave type architecture, it is generally not preferred in embedded applications, where all the processing blocks should work concurrently. The CSP type protocol is also not suitable for batch data movements. 
In [7-9], the target architecture contained a processor embedded with ASICs. The interface between the hardware (ASIC) and software (processor) sections is a control FIFO buffer, which serves as a mechanism to enforce the scheduling of both hardware and software. Data transfer from hardware to software is explicitly synchronized. The processor uses a polling strategy to perform premeditated transfer from the hardware section. This architecture requires careful scheduling in both the hardware and software sections so that they can retrieve appropriate data from the FIFO. It is also not efficient for batch communication. The target architecture presented in [10] is a hypothetical multi-processor plus one ASIC configuration. The processors and ASIC are assumed to be fully connected and to perform in full synchronization (multi-rate). Communication between the hardware and software sections is, thus, determined by means of static scheduling. The synchronized communication between the HW and SW sections, however, makes this architecture less feasible for real implementation.

2.2 Codesign System Development Environments

System development environments differ mainly in their initial system specifications and the way in which hardware/software partitioning is performed. In the VULCAN synthesis tool suite [7-9], a hardware oriented design strategy was adopted. It starts from an initial hardware specification in HardwareC. Portions of the design are later migrated into the software section if the design constraints are satisfied. The migration process is iterative. Candidates are selected so that the communication cost is lowered while the timing constraint is maintained. The COSYMA system [5, 6] features a software oriented approach, where the system specification is in the form of communicating processes described in C*, a super set of the C language. After data profiling, partitioning is performed using a simulated annealing algorithm.
In [10], the HMS (Hardware/multi-software) starts with the system specification in a data flow graph. Partitioning is performed at the fine grain level. Starting with an all software implementation, if the system cannot be scheduled within the specified time, the algorithm begins to add the most needed specialized hardware. The process proceeds until both the timing and silicon requirements are met. In [11], the co-design starts with a system specification in VHDL. Partitioning is then carried out at both the coarse and fine grain levels. Pre-partitioning, performed in interaction with the designer, is done to obtain proper partitioning granularity. Partitioning performed later takes the estimated speed and cost into account and is achieved using a simulated annealing algorithm. All the above mentioned tools support automatic partitioning while the following tools require manual partitioning. CASTLE [1] starts with a mixed algorithmic specification in C++/VHDL. For the partitioning step, CASTLE displays the list of functions of the algorithmic specifications. The designer manually partitions them into hardware and software components and refines the architecture step by step. After each partitioning decision, the system estimates the consequences. In [12], a more general implementation independent specification is achieved via a co-specification language using object-oriented functional notation. Design objects are classified into three groups, i.e., hardware, software and codesign. Codesign objects are generic in specification and can be compiled into hardware or software sections. In [13], the system is specified in the SDL language, and various high level synthesis tools are employed to implement the hardware, the software and the interface designs. In the COBRA environment [3], the partitioning uses the specification to extract data dependencies and to define coprocessors for the application. 
Partitioning can be performed either automatically using a clustering method or manually but guided with the tool's assistance in analysis and verification. In COSMOS [14-16], the design process starts with a system-level specification language SDL. The specification is translated to a common intermediate form capable of modeling concurrency, high-level communication, synchronization and exceptions. Partitioning is performed interactively by the designers using the transformation primitives provided by the system.

2.3 Our Approaches

In this paper, our rapid prototyping HW/SW codesign system is different from the previous works in the following aspects. First, the proposed target board architecture aims to be very flexible in implementing various embedded systems. A master-master type architecture is adopted, where both the hardware section and the software section can run and communicate concurrently. Second, the architecture is designed to support a wide range of communication interfaces and protocols, ranging from block data movement to asynchronous single data transfer. HW/SW communication can, thus, be efficiently achieved to improve system integration. Third, our partitioning verification model is constructed based on the proposed target board architecture. It can precisely characterize the HW/SW design constraints and the communication overhead. This facilitates better partitioning and performance verification.

The organization of the remainder of this paper is as follows: In Section 3, the proposed target board architecture is described. In Section 4, we present the HW/SW communication interfaces supported by the target board. The codesign environment, which includes input specification, design flow and a partitioning model, is described in Section 5. An LD-CELP speech decoder example is presented in Section 6 to illustrate the usefulness of this rapid prototyping system.

3. THE CODESIGN TARGET BOARD ARCHITECTURE

The target architecture of the embedded system is usually application dependent. A typical embedded system architecture, however, may consist of three major modules, i.e., a processor core, one or several hardware accelerators, and a peripheral block. The processor core can be either a commercial digital signal processor, such as TI TMS320C30/40, or an in house ASIP (application specific instruction set processor). The processor core implements as much as possible of the signal processing and control functions. The hardware accelerators are specialized data paths that can be used to extend the instruction set of the processor core in an application specific way. Hardware accelerators usually implement time critical functions, the performance of which cannot be realized in programmable technology today. The peripheral blocks are primarily responsible for communicating with the outside world. These include memory, timers, serial and parallel interfaces, DMA controllers, A/D and D/A converters, etc. These three main modules are then connected by various types of interconnection media, which can be a global shared data bus, a direct port to port connection, a switched channel or a larger interconnection network. Data transactions can be in bit serial or bit parallel format and can be controlled by a bus arbitration mechanism or a DMA controller.

3.1 Design Concerns

Since different embedded applications possess different computing characteristics, a "universal" architecture often turns out to be least efficient for all applications. Therefore, we focus on applications in the area of signal compression for digital transmission networks. These can be speech, audio or image signals. (For the time being, video signals are not considered due to their overwhelming computing requirements.)
Signal compression can be used to minimize the communication capacity required for transmission of high quality signals or, equivalently, to get the best possible fidelity over an available digital communication channel. With such applications in mind, we chose a TI TMS320C30 digital signal processor as the processor core of the system. The C30 is a 32-bit floating point processor which has been adopted successfully and widely in audio/speech processing. It is equipped with one DMA controller, two serial ports and two timers, which can greatly simplify the peripheral block design of the target board. The C30 also has two 32-bit external buses, one primary and one expansion, which provide wide communication bandwidth between the hardware and software sections. For the hardware accelerators, field programmable logic devices must be incorporated in place of a hardwired ASIC. We chose Xilinx FPGAs due to their high capacity and architectural efficiency in data path implementation. For real time signal compression, the target board should also be equipped with data acquisition and playback peripherals, which include A/D and D/A converters, microphones and speakers (for speech and audio signals). To achieve high performance computing, we prefer a master-master type configuration so that both the HW and SW sections can operate in parallel. For the interconnection structure of the system, a shared bus architecture is used as it is the most flexible (when compared with the port-to-port direct connection) and cost effective (when compared with the switching network) one among the alternatives. It can also eliminate physical data movement across the boundary between the HW and the SW sections. It can, then, be used as the backbone of the system's interconnections. A shared bus alone, however, cannot satisfy all the different kinds of communication needs between the HW and the SW sections. Bus contention among all the hooked up devices may also degrade performance.
Auxiliary communication channels are, thus, needed to enable implementation of specialized communication and to reduce traffic on the shared bus. To simplify system development, the target board can be designed as a system attached to a host machine, e.g., a PC or workstation. A host bus interface is, therefore, needed, through which the host can initialize the target board, download the kernel program to the DSP processor, configure field programmable logic devices, and upload processed signal data from the target board.

3.2 Proposed Architecture

Based on the design concerns mentioned in Section 3.1, the target board is designed as a PC add-on card with a PCI bus interface. It is regarded as an embedded system in the PC (host machine) and behaves like a service provider for specific functions. To support the general paradigm of HW/SW codesign, the target board architecture contains a TI TMS320C30 DSP processor and up to four Xilinx XC5204 FPGAs. Unlike some other designs [1, 3], we do not use the host processor to perform the functions of the software section. The induced heavy communication overhead between the host processor and target board would make the design less efficient. The FPGAs, via the VHDL specification and high level synthesis, can implement the hardwired functions for the hardware section. The block diagram of the target board architecture is shown in Fig. 1. To ensure that the target board is adaptable to a wide range of applications, a universal shared bus architecture is adopted. This facilitates a basic communication model between the DSP processor and the FPGAs via access to a shared memory location. The size of the shared memory is 8MB. It can be accessed by the C30 processor, FPGAs and PCI interface controller. Through the PCI interface, the shared memory can serve as a data buffer to exchange data with the host machine.
The PCI interface circuitry, implemented in an Altera EPM7096, controls an AMCC S5933 PCI controller so as to communicate with the PC host. Since all these devices hooked up to the same bus must compete for access privileges, a bus control unit (BCU), also in an EPM7096 CPLD, arbitrates among the memory access requests received from different devices. The C30 processor is assigned higher priority than are the FPGAs. Once a device (including C30 and FPGAs) is granted bus control, no other device can take control unless it is relinquished by the current owner. To reduce traffic in the shared bus, each FPGA is supplied with a 2K local memory module which serves as a private work sheet. The C30 processor's on-chip DMA controller can perform batch data communication between the shared memory and a specific local memory module. The memory map of the target board is shown in Fig. 2.

Fig. 1. Block diagram of the target board architecture. (Blocks shown: PCI bus, AMCC S5933 PCI controller, PCI interface, shared memory, bus control unit, address decoder, interrupt control unit, queue decoder, TMS320C30, four XC5204 FPGAs each with a multiplexed local memory, two serial port connectors and a peripheral bus connector.)
TMS320C30 memory map:
  0h - 0BFh            Interrupt locations and reserved (192)
  0C0h - 0FFFh         ROM (internal)
  1000h - 7FFFFFh      External, STRB active
  800000h - 801FFFh    Expansion bus, MSTRB active (8K)
  802000h - 803FFFh    Reserved (8K)
  804000h - 805FFFh    Expansion bus, IOSTRB active (8K)
  806000h - 807FFFh    Reserved (8K)
  808000h - 8097FFh    Peripheral bus memory-mapped registers (internal) (6K)
  809800h - 809BFFh    RAM block 0 (1K) (internal)
  809C00h - 809FFFh    RAM block 1 (1K) (internal)
  80A000h - 0FFFFFFh   External, STRB active

Peripheral bus registers:
  808000h - 80800Fh    DMA channel
  808020h - 80802Fh    Timer 0
  808030h - 80803Fh    Timer 1
  808040h - 80804Fh    Serial port 0
  808050h - 80805Fh    Serial port 1

Target board assignments:
  1000h - 3FFFFFh      Reserved
  400000h - 7FFFFFh    Shared memory (4M)
  800000h - 8007FFh    Local memory 1 (2K)
  800800h - 800FFFh    Local memory 2 (2K)
  801000h - 8017FFh    Local memory 3 (2K)
  801800h - 801FFFh    Local memory 4 (2K)
  80A000h              FPGA 1 address vector
  80A004h              FPGA 2 address vector
  80A008h              FPGA 3 address vector
  80A00Ch              FPGA 4 address vector
  (remaining locations below 0C00000h)  Reserved
  0C00000h - 0FFFFFFh  Shared memory (4M)

Fig. 2. Memory map of the target board architecture.

Note that the local memory for each FPGA also appears in the map and supports DMA access. A 2 to 1 multiplexing unit, however, must be placed before the local memory module so that it can switch access between an FPGA and the DMA controller. Each FPGA is also assigned an address so that the C30 can perform memory mapped I/O to access the FPGAs. In case a hardware section design is split over several FPGAs, a local signal bus, configured with a ring structure, is employed to provide point to point direct communication among the FPGAs. Besides the communication achieved by the shared memory, a FIFO structure with a width of 16 and a depth of 4 is provided between the C30 processor and the leading FPGA. The FIFO is actually implemented in the FPGA and connected to the C30 processor's expansion data bus (the lower 16 bits). The two serial port connectors in the target board are mainly used for connection to an external audio signal interface.
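As an illustration, the target board address assignments of Fig. 2 can be captured in a few C definitions for use on the software side. The following is only a minimal sketch and is not taken from the original design: all macro and function names are hypothetical, and we assume that each FPGA address vector occupies the four locations up to the next listed vector.

```c
#include <assert.h>
#include <stdint.h>

/* Address constants copied from Fig. 2; all identifiers are illustrative. */
#define SHARED_LO_BASE  0x400000u  /* shared memory, first 4M block     */
#define SHARED_LO_END   0x7FFFFFu
#define SHARED_HI_BASE  0xC00000u  /* shared memory, second 4M block    */
#define SHARED_HI_END   0xFFFFFFu
#define LOCAL_BASE      0x800000u  /* local memory 1                    */
#define LOCAL_SIZE      0x800u     /* 2K locations per FPGA             */
#define VECTOR_BASE     0x80A000u  /* FPGA 1 address vector             */
#define VECTOR_STRIDE   0x4u       /* spacing between listed vectors    */
#define NUM_FPGAS       4u

typedef enum { REGION_OTHER, REGION_SHARED, REGION_LOCAL, REGION_VECTOR } region_t;

/* Base address of the local memory attached to FPGA n (n = 1..4). */
uint32_t local_mem_base(unsigned n)
{
    assert(n >= 1u && n <= NUM_FPGAS);
    return LOCAL_BASE + (n - 1u) * LOCAL_SIZE;
}

/* Memory mapped I/O address used to activate FPGA n (n = 1..4). */
uint32_t fpga_vector_addr(unsigned n)
{
    assert(n >= 1u && n <= NUM_FPGAS);
    return VECTOR_BASE + (n - 1u) * VECTOR_STRIDE;
}

/* Classify an address; *fpga receives the FPGA number for local memory
   and vector addresses, and 0 otherwise. Assumption: a vector covers
   the VECTOR_STRIDE locations starting at its listed address. */
region_t classify(uint32_t addr, unsigned *fpga)
{
    *fpga = 0u;
    if ((addr >= SHARED_LO_BASE && addr <= SHARED_LO_END) ||
        (addr >= SHARED_HI_BASE && addr <= SHARED_HI_END))
        return REGION_SHARED;
    if (addr >= LOCAL_BASE && addr < LOCAL_BASE + NUM_FPGAS * LOCAL_SIZE) {
        *fpga = (unsigned)((addr - LOCAL_BASE) / LOCAL_SIZE) + 1u;
        return REGION_LOCAL;
    }
    if (addr >= VECTOR_BASE && addr < VECTOR_BASE + NUM_FPGAS * VECTOR_STRIDE) {
        *fpga = (unsigned)((addr - VECTOR_BASE) / VECTOR_STRIDE) + 1u;
        return REGION_VECTOR;
    }
    return REGION_OTHER;
}
```

A decoder of this kind mirrors what the address decoder of Fig. 1 does in hardware: for example, classify(0x80A008, &n) reports a vector access with n equal to 3.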
The peripheral bus connector, along with 8K x 32 peripheral memory, provides a 16-bit wide DSP-Link interface with an external data acquisition device. The basic communication between an FPGA and the DSP processor is achieved by means of an interrupt. The C30 processor then acknowledges the interrupt. Since implementation of the interrupt mechanism in an FPGA would be too expensive, communication from the C30 to a specific FPGA is checked using a polling scheme and is accomplished by means of a memory mapped I/O operation. When the C30 writes to an FPGA's address, the address decoder will generate an activate signal and send it to the FPGA. Once the activate signal is recognized by the FPGA, it will notify the C30 via the signal flag XF1. Communication between FPGAs is mostly achieved by means of the local bus interconnection. The ring structure can provide a fully interconnected network for up to three FPGAs.

4. HW/SW COMMUNICATION INTERFACE & PROTOCOLS

A critical issue in HW/SW codesign is the efficiency of communication between the hardware and software sections. Synchronous communication between them is virtually impractical in that software execution must be monitored cycle by cycle. Our target board supports two forms of asynchronous communication, i.e., handshaking and queue, as well as batch communication via the DMA controller. Since the communication time overhead is much longer than the normal execution cycle, our target board is not suitable for frequent, fine grain communication.

4.1 Types of Communication

We may classify the communication patterns encountered in a HW/SW codesign system into three categories. The first one is simply for control transfer or synchronization purposes; e.g., the DSP processor invokes the specific function of the FPGAs. This usually does not involve a large volume of data exchange. The incurred small amount of data exchange will be referred to as a message.
The second pattern is for constant rate data transfer, where a sequence of data with each item separated by a specified period, e.g., speech samples, is transferred. The third pattern is for bursty data transfer, where a block of data, e.g., coefficients of filters, has to be moved. Such data transfers often occur at the beginning or end of a computing module. In our target board design, five types of communication are supported:

1. Asynchronous communication by means of handshaking: A handshaking protocol must be followed to carry out the communication. It is usually used for communication between the HW and SW, for occasional and small amounts of message exchange. To implement the communication interface, each FPGA reserves two address locations in the shared memory as the message buffer. One is for the outgoing message (interrupt vector) sent to the C30 processor, and the other one is for the incoming message (control information) received from the C30 processor. Each message is 32-bit wide, and the format is application dependent.
2. Asynchronous communication by means of a queue: A queue is a unidirectional communication channel, where data are inserted and retrieved in order. It is most suitable for constant data rate transfers and can be used in both HW/SW and intra-HW (i.e., among FPGAs) communication. The queue itself (a dual port memory) as well as two control pointers and the pointer update mechanism is implemented in the FPGA hardware.
3. Batch communication: One possible drawback of the shared bus memory access scheme is bus contention. If both the C30 processor and FPGAs need to access the shared memory frequently, the performance will be degraded. A better way is to copy data from the shared memory to the FPGA's local memory, where the FPGA has exclusive access rights. Batch communication is carried out by the C30's DMA controller. It can occur between the shared memory and the local memory or between the shared memory and the PC host. In the former case, a request is initiated by the HW (FPGA), and in the latter case, a request is initiated by the SW (C30).
4. Synchronous communication: This is supported only in intra-HW communication. In our design, it is performed by the local interconnection bus between the FPGAs. The send and receive signals are both latched. There is basically no communication interface circuitry except for the I/O buffers. This requires careful static scheduling to ensure correct data transfer.
5. Direct communication: This is similar to the synchronous communication case except that the signals are not latched. This provides a direct point-to-point connection between two FPGAs and is useful when a large combinational circuit is split into two FPGAs.

4.2 HW/SW Communication Protocols

We will first define the hand-shaking type asynchronous communication protocols as follows.

1. Software send: The software program sends a message to a specific FPGA via memory mapped I/O write. It then enters an indefinite loop, which polls input pin XF1 for the acknowledge signal IACK from the FPGA. It is, therefore, a blocking send.
2. Software receive: The software program keeps on polling an internal flag to check if a message from a specific FPGA has been received via interrupt service. If this is the case, the program will exit the loop and read the outgoing message buffer of the corresponding FPGA.
3. Hardware send: The FPGA sends a message to the DSP processor by means of an interrupt. After arbitration, the address of the FPGA with the highest priority will be saved in the ICU. In the interrupt service routine, the C30 processor will check the address, read the corresponding interrupt vector and send back an IACK signal.
4. Hardware receive: The finite state machine of the FPGA enters an idle state to poll the ACT signal. If the signal is asserted, the FPGA will then read the incoming message buffer and raise the IACK pin.
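The four handshaking procedures can be mimicked in plain C to make the ordering of events explicit. The sketch below is a single-threaded simulation under our own assumptions, not the communication library of the target board: the structure link_t and all function names are hypothetical, the FPGA is advanced by calling fpga_idle_tick() from inside the polling loop, and the interrupt is modeled by calling c30_isr() directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated signals and buffers between the C30 and one FPGA. */
typedef struct {
    uint32_t in_msg;    /* incoming message buffer (C30 -> FPGA)         */
    uint32_t out_msg;   /* outgoing message buffer (FPGA -> C30)         */
    uint32_t latched;   /* message taken over by the FPGA                */
    bool act;           /* ACT signal from the address decoder           */
    bool iack;          /* IACK from the FPGA, seen by the C30 on XF1    */
    bool irq;           /* IRQ raised by the FPGA                        */
    bool rcv_flag;      /* internal software flag set by the ISR         */
} link_t;

/* Hardware receive: one step of the FPGA idle state, which polls ACT. */
void fpga_idle_tick(link_t *l)
{
    if (l->act) {
        l->latched = l->in_msg;   /* read the incoming message buffer */
        l->iack = true;           /* raise IACK                       */
        l->act = false;
    }
}

/* Software send (blocking): write the message, then poll XF1.
   Returns the number of polling iterations. */
unsigned sw_send(link_t *l, uint32_t msg)
{
    unsigned polls = 0;
    assert(!l->act);              /* no send may be pending            */
    l->in_msg = msg;
    l->act = true;                /* memory mapped write asserts ACT   */
    while (!l->iack) {            /* indefinite loop on XF1            */
        fpga_idle_tick(l);        /* simulation only: let the FPGA run */
        polls++;
    }
    l->iack = false;              /* FPGA drops IACK when it leaves idle */
    return polls;
}

/* Hardware send: store the message and interrupt the C30. */
void hw_send(link_t *l, uint32_t msg)
{
    l->out_msg = msg;
    l->irq = true;
}

/* Interrupt service routine: acknowledge and set the software flag. */
void c30_isr(link_t *l)
{
    if (l->irq) {
        l->irq = false;           /* models the ACK relayed to the FPGA */
        l->rcv_flag = true;
    }
}

/* Software receive: poll the internal flag, then read the buffer.
   Like the original procedure, this loops until a message arrives. */
uint32_t sw_receive(link_t *l)
{
    while (!l->rcv_flag)
        c30_isr(l);               /* simulation only: deliver the interrupt */
    l->rcv_flag = false;
    return l->out_msg;
}
```

In this model a send completes after a single poll because the simulated FPGA is always idle; on the real board the loop lasts until the FPGA returns to its idle state, which is why the send is blocking.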
The protocol illustrations for the two cases, i.e., hardware send, software receive and software send, hardware receive, are shown in Figs. 3 and 4, respectively.

  FPGA:                    IRQ ← 1                  (FPGA interrupts C30)
  Interrupt Control Unit:  IRQn = 1: INTn ← 0       (ICU arbitrates the interrupt requests)
  TMS320C30:               INTn = 0: IACK ← 0       (C30 acknowledges and performs ISR)
  Interrupt Control Unit:  ACK = 0: An ← 1          (ICU relays ACK to the interrupting FPGAn)
  FPGA:                    ACK = 1: receive ACK     (FPGA receives ACK from C30)

Fig. 3. The protocol illustration of FPGA send vs. C30 receive.

  TMS320C30:        send address vector: XF0 ← 1   (C30 writes to activate an FPGA)
  Address Decoder:  EN = 1: ACTn ← 1               (address decoder generates ACT signal)
  FPGA:             ACT = 1: IACK ← 1              (FPGAn polls & acknowledges)
  TMS320C30:        XF1 = 1: receive ACK           (C30 receives ACK from FPGAn)

Fig. 4. The protocol illustration of C30 send vs. FPGA receive.

As for communication via a queue, the protocols are as follows:

1. Software write: The software program first reads the queue's "full" flag through the expansion bus. If the flag is set, the program will keep on polling the flag until it is cleared. A write operation to the address designated for the queue's input port is next performed. Besides the enqueued data, an extra control bit is appended during the write operation to signal the control mechanism for queue pointer update.
2. Software read: Similar to the write procedure, a software read will first wait for the queue's "empty" flag to be cleared. After the data is read, an extra write operation which contains only the control bit is performed to signal the pointer update.
3. Hardware write: This procedure is the same as that of software write except that the finite state machine of the FPGA will first enter the wait state rather than enter the indefinite polling loop. Once the flag is cleared, the finite state machine will automatically move to the write state.
4. Hardware read: This procedure is the same as that of software read except that it is conducted under FSM control.
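The queue protocols can be illustrated with a small software model of the 16-bit wide, depth 4 FIFO. The following C sketch is a behavioral model only, not the FPGA implementation: the type and function names are hypothetical, pointers are zero based for convenience, and the extra control bit that triggers the pointer update is folded into the write and read functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 4u   /* depth 4, 16-bit entries, as on the target board */

typedef struct {
    uint16_t mem[QUEUE_DEPTH];
    unsigned ptr_r;      /* read pointer  */
    unsigned ptr_w;      /* write pointer */
    bool full;
    bool empty;
} queue_t;

void queue_init(queue_t *q)
{
    assert(q != NULL);
    q->ptr_r = 0u;
    q->ptr_w = 0u;
    q->full = false;
    q->empty = true;
}

/* One write attempt: fails (returns false) while the "full" flag is set. */
bool queue_try_write(queue_t *q, uint16_t data)
{
    if (q->full)
        return false;
    q->mem[q->ptr_w] = data;
    q->ptr_w = (q->ptr_w + 1u) % QUEUE_DEPTH;   /* pointer update */
    q->empty = false;
    q->full = (q->ptr_w == q->ptr_r);           /* writer caught up with reader */
    return true;
}

/* One read attempt: fails (returns false) while the "empty" flag is set. */
bool queue_try_read(queue_t *q, uint16_t *data)
{
    if (q->empty)
        return false;
    *data = q->mem[q->ptr_r];
    q->ptr_r = (q->ptr_r + 1u) % QUEUE_DEPTH;   /* pointer update */
    q->full = false;
    q->empty = (q->ptr_r == q->ptr_w);          /* reader caught up with writer */
    return true;
}
```

The blocking software write described above corresponds to repeating queue_try_write() until it succeeds, and likewise for the software read; the hardware versions differ only in that the FSM waits in a state instead of a polling loop.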
4.3 Communication Interfaces
To support the above mentioned protocols, communication interfaces must be incorporated in both the HW and SW designs. In the FPGA part, the interface is under the control of a finite state machine (FSM). The state diagram and the VHDL code of the FSM are shown in Figs. 5 and 6, respectively. When in the idle (or equivalently hardware receive) state, the FPGA is not performing its specific computation and can poll the ACT signal to determine whether a communication request from the C30 processor is pending. As a result, a call from the C30 is only checked in this state, and tasks performed by the FPGAs are basically non-preemptive. There are two hardware execution states, one with and one without shared bus access control. When competing for bus control, an FPGA will enter the bus request state. Since the hardware send procedure needs to write the FPGA's output message buffer in the shared memory, it must be initiated only while the FPGA owns the bus; the FPGA relinquishes bus control afterward.
The interface in the software section is mainly implemented via communication library functions. Fig. 7 shows the state diagram for SW execution. In performing the software send (receive) procedure, the program will enter the wait for FPGA acknowledge (interrupt) state. Both states correspond to indefinite looping, which will be terminated when proper flags are set. Fig. 8 shows the state diagram of the queue's control mechanism.

[Figure: state diagram with the states Idle, Execute Hardware, Wait BUS, FPGA using BUS and Interrupt C30; transitions are labeled with the inputs ACT, ACK, BUS_idle, GRANT and RDY and with the outputs IACK, IRQ, BRQ and BUSY.]
Fig. 5. State diagram of the FPGA interface circuitry

    ENTITY fpga IS
      PORT (clock, ACT, ACK, RDY, BUS_idle, GRANT : IN BIT;
            IRQ, BRQ, BUSY, IACK : OUT BIT);
    END fpga;

    ARCHITECTURE behavioral OF fpga IS
      TYPE state IS (idle, exec, wait_bus, using_bus, int_c30);
      SIGNAL current : state := idle;
      SIGNAL finish : BIT;
    BEGIN
      PROCESS
      BEGIN
        -- all state updates take place on the falling clock edge
        WAIT UNTIL clock='0' AND NOT clock'STABLE;
        CASE current IS
          WHEN idle =>
            -- hardware receive: clear all outputs and poll ACT
            IACK<='0'; IRQ<='0'; BUSY<='0'; BRQ<='0'; finish<='0';
            IF ACT='1' THEN
              IACK<='1'; current<=exec;
            END IF;
          WHEN exec =>
            IACK<='0';
            -- request the shared bus while the task is unfinished
            IF BUS_idle='1' and finish='0' THEN
              BRQ<='1'; current<=wait_bus;
            END IF;
            -- task finished: interrupt the C30 (hardware send)
            IF finish='1' THEN
              IRQ<='1'; current<=int_c30;
            END IF;
          WHEN wait_bus =>
            IF GRANT='1' THEN
              BUSY<='1'; current<=using_bus;
            END IF;
          WHEN using_bus =>
            -- release the bus once the access has completed
            IF RDY='1' THEN
              BRQ<='0'; BUSY<='0'; current<=exec; finish<='1';
            END IF;
          WHEN int_c30 =>
            -- wait for the acknowledge relayed from the C30
            IF ACK='1' THEN
              IRQ<='0'; IACK<='0'; current<=idle;
            END IF;
        END CASE;
      END PROCESS;
    END behavioral;

Fig. 6. VHDL code for the FPGA interface circuitry

[Figure: state diagram with the states Software Execution, wait for FPGA Interrupt and wait for FPGA ACK. A software receive enters the wait for FPGA Interrupt state, which is left when rcv_flag = 1. A software send writes the message to the FPGA input buffer, sets XF0 and enters the wait for FPGA ACK state, which is left when the FPGA IACK is observed on XF1.]
Fig. 7. State diagram of the C30 program execution

[Figure: state diagram with the states Initial, FIFO, EMPTY and FULL. With wr='1' and full='0', mem(ptr_W) <= datain and ptr_W is incremented; with rd='1' and empty='0', dataout <= mem(ptr_R) and ptr_R is incremented; the condition ptr_R=ptr_W leads to the EMPTY or the FULL state and updates the empty and full flags.]
Fig. 8. State diagram of the queue's control mechanism

Fig. 9 shows the state diagram of the bus control unit. In our design, the C30 is guaranteed access without arbitration when the bus is free. If the bus is in use by an FPGA, the C30 will be denied external bus access automatically through internal hardware interlocking. A bus request is, thus, transparent to software execution.
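To make the behavior of the interface FSM of Fig. 6 easier to follow, its CASE statement can be transcribed into a small step function and traced by hand. This is a sketch, assuming one call per falling clock edge; the class name `FpgaFsm` and the method `step` are hypothetical, and signals are read with their values from before the edge, as VHDL signal semantics require.

```python
from dataclasses import dataclass


@dataclass
class FpgaFsm:
    state: str = "idle"
    IRQ: int = 0
    BRQ: int = 0
    BUSY: int = 0
    IACK: int = 0
    finish: int = 0

    def step(self, ACT=0, ACK=0, RDY=0, BUS_idle=0, GRANT=0):
        """Advance one falling clock edge and return the new state."""
        state, finish = self.state, self.finish   # pre-edge values
        if state == "idle":
            self.IACK = self.IRQ = self.BUSY = self.BRQ = self.finish = 0
            if ACT:
                self.IACK = 1
                self.state = "exec"
        elif state == "exec":
            self.IACK = 0
            if BUS_idle and not finish:
                self.BRQ = 1
                self.state = "wait_bus"
            if finish:
                self.IRQ = 1
                self.state = "int_c30"
        elif state == "wait_bus":
            if GRANT:
                self.BUSY = 1
                self.state = "using_bus"
        elif state == "using_bus":
            if RDY:
                self.BRQ = self.BUSY = 0
                self.finish = 1
                self.state = "exec"
        elif state == "int_c30":
            if ACK:
                self.IRQ = self.IACK = 0
                self.state = "idle"
        return self.state


fsm = FpgaFsm()
trace = [
    fsm.step(ACT=1),        # C30 activates the FPGA
    fsm.step(BUS_idle=1),   # FPGA requests the shared bus
    fsm.step(GRANT=1),      # bus controller grants the bus
    fsm.step(RDY=1),        # access done, bus released
    fsm.step(),             # finished: raise IRQ
    fsm.step(ACK=1),        # C30 acknowledges
]
```

The trace visits exec, wait_bus, using_bus, exec, int_c30 and idle in turn, i.e., one activation, one shared bus access and one hardware send.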
Note that the C30, FPGAs and other control units all work at the same clock rate, i.e., 30 MHz. The bus controller, however, works on the rising edges of the clock while the FPGAs work on the negative edges of the clock. Likewise, different FSMs are needed for the PCI bus controller and DMA controller. In Table 1, we list the estimated communication delays of the proposed communication protocols. We assume that the DRAM access time is 66 ns, i.e., one instruction cycle for the C30 processor. The simplified communication protocols lead to a delay only 4 to 6 times longer than a memory access delay.

[Figure: state diagram with the states BUS idle, C30 using BUS and FPGA using BUS; transitions are conditioned on C30_use_BUS, BRQ and BUSY and drive the outputs BUS_idle, GRANT and RDY.]
Fig. 9. The state diagram of the bus control unit

Note that the numbers for the communication between the HW and SW sections do not include the extra delay incurred due to mismatch between the send and receive operations. Because of the bus arbitration delay, the shared memory access time by the FPGA is longer than that by the C30 processor. Once the FPGA is granted bus control, the access time, however, will also be 66 ns (two clock cycles). Even though the local memory access time is also two clock cycles long, no bus contention overhead will occur as opposed to the case of shared memory access.
Table 2 shows the compiled FPGA interface circuitry overheads. The memory access circuitry for both the shared and local memory modules are also included. The interface circuitry occupies less than 10% of the CLB resources. It uses about 62% of the I/O pins to support both the shared and local bus interfaces. Since the two bus interfaces suffice to provide all the required HW/SW communication, the remaining pins can be reserved for direct or synchronous HW/HW communication, which facilitates hardware implementation across several FPGAs.
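The timing figures quoted above can be checked with a few lines of arithmetic. The sketch below takes the hand-shaking delays (413 ns and 231 ns) and the DMA entries (264 ns setup, 132 ns per word) from Table 1; reading the DMA entries as a fixed setup cost plus a per-word cost is an assumption made here, and the function names are hypothetical.

```python
CLOCK_HZ = 30e6          # common clock rate of the C30, FPGAs and control units
MEM_ACCESS_NS = 66       # memory access time assumed in the text


def clock_period_ns():
    return 1e9 / CLOCK_HZ


def ratio_to_memory_access(delay_ns):
    """How many memory accesses a communication delay corresponds to."""
    return delay_ns / MEM_ACCESS_NS


def dma_block_ns(words, setup_ns=264, per_word_ns=132):
    """Assumed cost model for a batch transfer: setup plus a per-word delay."""
    return setup_ns + per_word_ns * words


hw_send_ratio = ratio_to_memory_access(413)   # HW send, SW receive
sw_send_ratio = ratio_to_memory_access(231)   # SW send, HW receive
```

Two clock periods at 30 MHz come to about 66.7 ns, which matches the quoted 66 ns access time, and the two hand-shaking delays correspond to roughly 6.3 and 3.5 memory accesses, respectively.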
Table 1. The estimated communication delays
Communication type          Delay (ns)
HW send, SW receive         413
SW send, HW receive         231
Shared memory R/W (HW)      99
Shared memory R/W (C30)     66
FPGA local memory R/W       66
DMA setup                   264
DMA delay/w                 132

Table 2. FPGA interface circuitry overheads
            CLBs   I/O pads   CLB FG   CLB FFs   3-state Buffer   CLB carry MUX   CLB 5-inp f MUX
interface   5      9          14       8         0                0               0
addr bus    0      12+12      0        0         12+12            0               0
data bus    0      32+32      0        0         64+64            0               0
addr gen    6      0          24       24        0                0               0
available   120    156        480      480       656              480             240
% used      9.2    62.2       7.9      6.7       23.2             0               0

5. HW/SW CODESIGN ENVIRONMENT
Based on the proposed target board design and the communication protocols, we propose a new HW/SW codesign environment for rapid prototyping of embedded applications. The design flow is shown in Fig. 10.

[Figure: flow chart with the stages VHDL Program; Process Profiling; Construct Process Communication Graph; Hardware/Software Partition (with Heuristic Cost Function); Performance & Constraints Verification (PASS/FAIL); Static Process Scheduling; FPGA Partitioning; Functional Level Cosimulation; Interface Mapping; Synopsys Synthesis; Implement FPGAs; VHDL to C Translation; Add Interface Code; Simulated Evolution Code Generator; System Integration.]
Fig. 10. The flow of the HW/SW co-design.

5.1 HW/SW Codesign Flow
The codesign begins with an algorithmic specification in VHDL. Since the process is a major modeling construct in VHDL used to describe the function or algorithm of a design entity, the system specification is described as a collection of processes. To model the communication among the processes, we define a send procedure and a receive procedure. The send procedure is designed to be non-blocking, so that the computing concurrency among the processes can be faithfully preserved. This is achieved by declaring all inter-process communication data in signals. Each signal is associated with a tag to indicate its availability. The send operation is performed simply by setting the tag and then advancing to the next instruction.
On the other hand, the receive operation checks the flag before using the data. This provides a generic description of the inter-process communication. After partitioning, these communications are then mapped to the appropriate communication interfaces mentioned in section 4. Note that the tags are for simulation purposes only and will not be synthesized in the design. An example of a system input specification in VHDL is shown in Fig. 11.
Since each process is treated as indivisible in HW/SW partitioning, we first profile each process to extract its hardware and software implementation attributes. For software profiling, the VHDL code is first translated into equivalent C code and compiled into assembly code using the C compiler of the C30 processor. The attributes extracted include the program code size and the execution time. For hardware profiling, we count the distinct types of operations, e.g., multiply, add and divide, and the number of occurrences of each type. We also calculate two indices, i.e., parallelism and uniformity, to assess the benefit of hardware implementation. The parallelism index is obtained by dividing the total number of operations (arithmetic operations plus memory accesses) by the length of the ASAP schedule (assuming no resource constraints). The uniformity index is obtained by dividing the total number of arithmetic operations by the number of distinct operation types. We further classify each send and receive operation in all the processes as either a batch mode or a discrete mode operation.
After profiling, a process communication graph (PCG) is constructed. A PCG greatly resembles a signal flow graph or a block diagram used to describe a DSP system. Each node corresponds to a process while each link corresponds to an inter-process communication. Each node of the PCG is annotated with the profiling attributes mentioned above plus its invocation frequency.
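The two hardware profiling indices can be computed directly from an operation dependency graph. The sketch below is illustrative only: it assumes that every operation takes one time step in the ASAP schedule, and the function names and the small example graph are hypothetical rather than taken from the decoder.

```python
def asap_length(deps):
    """Length of the ASAP schedule; deps maps each op to its predecessors."""
    level = {}

    def visit(op):
        if op not in level:
            level[op] = 1 + max((visit(p) for p in deps[op]), default=0)
        return level[op]

    return max((visit(op) for op in deps), default=0)


def parallelism_index(deps):
    """Total operations (arithmetic plus memory) over the ASAP schedule length."""
    return len(deps) / asap_length(deps)


def uniformity_index(arith_types):
    """Total arithmetic operations over the number of distinct operation types."""
    return len(arith_types) / len(set(arith_types))


# Hypothetical process body: y = a*b + c*c, with three memory reads.
deps = {
    "ld_a": [], "ld_b": [], "ld_c": [],
    "mul1": ["ld_a", "ld_b"],
    "mul2": ["ld_c"],
    "add": ["mul1", "mul2"],
}
p_index = parallelism_index(deps)                  # 6 operations in 3 steps
u_index = uniformity_index(["mul", "mul", "add"])  # 3 operations, 2 types
```

A higher parallelism index suggests that more function units can be kept busy at once, while a higher uniformity index suggests that the same function units can be reused for many operations.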
Each link is tagged with the variables for communication and their invocation frequencies. Note that these frequencies are all measured with respect to the data input rate. The PCG is then partitioned subject to the performance and resource constraints.

[Figure: process communication graph of the example. P1 takes g_bar and line4 (flag4) and produces line1 (flag1); P2 takes line1 and line2 (flag2) and produces line4, which also drives coeff, and line5 (flag5); P4 takes line5 and produces line3 (flag3); P3 takes line3 and d8 and produces line2.]

    PROCEDURE receive_2 (SIGNAL flag1, flag2 : INOUT BIT) IS
    BEGIN
      -- block until both tags are valid, then clear them
      WAIT UNTIL flag1 = '1' AND flag2 = '1';
      flag1 <= '0';
      flag2 <= '0';
    END receive_2;

    PROCEDURE send_2 (SIGNAL flag1, flag2 : INOUT BIT) IS
    BEGIN
      -- non-blocking: set both tags and continue
      flag1 <= '1';
      flag2 <= '1';
    END send_2;

    ENTITY test_model IS
      PORT (g_bar, d8 : IN INTEGER;
            coeff : OUT INTEGER);
    END test_model;

    ARCHITECTURE behavorial OF test_model IS
      SIGNAL line1, line2, line3, line4, line5 : INTEGER;
      SIGNAL flag1, flag2, flag3, flag4, flag5 : BIT;
    BEGIN
      -- concurrent statement
      coeff <= line4;

      P1 : PROCESS
      BEGIN
        :
        WAIT UNTIL flag4 = '1';
        -- clear the valid flag of line4
        flag4 <= '0';
        line1 <= line4 + g_bar;
        -- set valid flag of line1
        flag1 <= '1';
        :
      END PROCESS P1;

      P2 : PROCESS
        VARIABLE temp : INTEGER;
      BEGIN
        :
        receive_2(flag1, flag2);
        temp := line1 + line2;
        line4 <= temp;
        line5 <= temp;
        send_2(flag4, flag5);
        :
      END PROCESS P2;

      P3 : PROCESS
      BEGIN
        :
        receive_1(flag3);
        line2 <= line3 * d8;
        send_1(flag2);
        :
      END PROCESS P3;

      P4 : PROCESS
      BEGIN
        :
        receive_1(flag5);
        line3 <= line3 + line4;
        send_1(flag3);
        :
      END PROCESS P4;
    END behavorial;

Fig. 11. An example of VHDL input specification.

5.2 HW/SW Partitioning and Verification
In this study, we adopted a software oriented approach which starts with all processes in the software section and then begins to migrate a process from the software section to the hardware section. Process selection is guided by a heuristic cost function, and the resultant partitioning of each move is verified against both the performance and resource constraints. For the time being, the selection process is performed manually by the user, and our focus is on the verification part.
Our verification scheme is based on the proposed target board architecture. It can precisely characterize the hardware and software design constraints and the communication overhead, so that an infeasible partitioning can be detected early without working out the implementation details. In the application domain of signal compression, the system must periodically process the input data stream. Therefore, the codesign system must complete one iteration of computation within a specified period, called an initiation interval, which is the reciprocal of the system's throughput rate. Given the initiation interval T, the partitioning verification procedure is as follows:

Procedure Partition_Verification
1. Subject to the partitioning result, assign a communication type to every edge in the PCG. The rules are as follows. A PCG edge crossing the partitioning boundary is assigned batch communication if it is classified as batch mode; queue communication if it is classified as discrete mode and its invocation frequency is greater than 1; and hand-shaking communication otherwise. A PCG edge within the hardware section is assigned synchronous communication if it is inter-iteration and direct communication if it is intra-iteration.
2. Calculate the delay of each communication based on the estimates derived from the communication protocols. The estimates are obtained by measuring a real implementation while ignoring the indefinite delays in wait states or bus contention.
3. Perform a preliminary performance check. For the software section, if the sum of all processes' profiled computation delays plus the calculated communication delays is greater than the initiation interval T, the partitioning is concluded to be infeasible. For the hardware section, the demanding factor for each type of operation is first calculated. The demanding factor $\delta_i = n_i t_i / T$ represents the lower bound on the number of type $i$ function units needed, where $n_i$ is the total number of type $i$ operations in all hardware section processes and $t_i$ is the measured delay of a type $i$ function unit in the FPGA implementation. If $\sum_i \delta_i \gamma_i > 1$, the hardware section design has exceeded the FPGA capacity, where $\gamma_i$ is the normalized hardware complexity ratio of a type $i$ function unit, with the total FPGA capacity (after taking away the interface circuitry) equal to 1.
4. Allocate hardware function units. The hardware resources are divided among the different types of function units in proportion to their respective demanding factors: $s\,\delta_i$ type $i$ function units are allocated, where the scaling factor $s = \max\{\, s \mid \sum_i s\,\delta_i \gamma_i \le k \,\}$ and $k$ is an empirical value (around 0.8) for the maximum FPGA CLB utilization ratio.
5. Subject to the hardware allocation, conduct a resource constrained scheduling on the modified PCG. To obtain the modified PCG, each node in the original PCG is split into a set of nodes, each corresponding to a code fragment delimited by the send and receive operations. The communication edges are adjusted accordingly. However, all inter-iteration communications are removed. The software section has only one resource, i.e., the C30 processor. All the software section nodes are scheduled statically and executed sequentially on the C30 processor. A simple list scheduler serves the purpose. If the entire modified PCG cannot be scheduled within T, the partitioning is also infeasible.

5.3 HW/SW Implementation
After HW/SW partitioning, the hardware section design is further divided into different FPGAs by means of algorithms such as bipartite partitioning. Since the entire design is still expressed in VHDL, a functional level co-simulation using a VHDL simulator is performed here as a check point before proceeding with physical implementation.
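Returning briefly to the verification procedure of section 5.2, its preliminary check and allocation steps (steps 3 and 4) reduce to a few sums. The following is a minimal sketch under stated assumptions: all function names are hypothetical, the operation counts, delays and complexity ratios in the example are invented for illustration, and the rounding of the fractional allocation to whole function units is left open because the procedure does not specify it.

```python
def software_feasible(compute_delays, comm_delays, T):
    """Step 3, software side: total delay must fit in the initiation interval."""
    return sum(compute_delays) + sum(comm_delays) <= T


def demanding_factors(ops, T):
    """ops maps an operation type to (count n_i, unit delay t_i)."""
    return {name: n * t / T for name, (n, t) in ops.items()}


def hardware_load(demand, complexity):
    """Sum of demanding factor times normalized complexity ratio."""
    return sum(demand[name] * complexity[name] for name in demand)


def exceeds_capacity(demand, complexity):
    """Step 3, hardware side: a load above 1 cannot fit in the FPGAs."""
    return hardware_load(demand, complexity) > 1.0


def allocate_units(demand, complexity, k=0.8):
    """Step 4: scale the demanding factors so that the load equals k."""
    s = k / hardware_load(demand, complexity)
    return {name: s * demand[name] for name in demand}


# Invented example: T = 2.5 ms expressed in ns.
T_NS = 2_500_000
ops = {"mul": (50_000, 100), "add": (50_000, 50)}
complexity = {"mul": 0.2, "add": 0.05}

demand = demanding_factors(ops, T_NS)         # mul: 2.0, add: 1.0
allocation = allocate_units(demand, complexity)
```

Since the load grows linearly with the scaling factor, the largest admissible factor is simply k divided by the unscaled load, which is what `allocate_units` computes.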
The partitioned VHDL program for each FPGA is augmented with predefined VHDL code to synthesize the HW/SW communication interface. High level synthesis in the Synopsys system is next employed to derive the FPGA design. Each FPGA design is based on a structural template which contains 4 basic modules, i.e., a processing unit (PU), control unit (CU), memory unit (MU) and communication interface unit (CIU). The processing units adopt fixed point arithmetic and may also include the number conversion and data alignment circuitry. The memory unit includes registers, address generators and read/write circuitry of the local memory. The control unit is simply synthesized by a finite state machine. The communication interface unit includes data storage buffers for the communication channel and an FSM based controller that implements the communication protocols.
For the software section, each VHDL process is first converted into a data flow graph (DFG). A code generation tool based on simulated evolution [17] is used to generate the software assembly code of the corresponding DFG. To implement the communication interface, extra code for the communication subroutines is inserted. Codes for different processes are merged into one code according to the static scheduling result.

6. CODESIGN EXAMPLE OF LD-CELP DECODER
To demonstrate the usefulness of the proposed target board, a large and practical codesign example for the LD-CELP (low-delay code excited linear prediction) speech decoder based on the CCITT G.728 recommendation [18] is currently under development. It can support a data compression ratio of 4 and yield a 16 kbit/s data rate. The recommendation has been widely adopted in applications such as teleconferencing systems and digital answering machines. Fig. 12 shows the simplified block diagram of the speech decoder system. At the encoding site, the system takes unquantized speech inputs at an 8 kHz sampling rate.
Every five consecutive samples are assembled into a speech vector and encoded as a 10-bit index of the code book. At the receiving end, for each received 10-bit index, the decoder performs a table look-up to extract the corresponding codevector from the excitation codebook. The extracted codevector is then passed through a gain scaling unit and a synthesis filter to produce the current decoded signal vector. The synthesis filter coefficients and the gain are then updated via backward adaptation. The decoded signal vector is then passed through an adaptive post-filter to enhance the perceptual quality. The system is described as a collection of nine processes.

[Figure: block diagram with blocks numbered 29 to 35, 39 to 51 and 67, including the Excitation VQ Codebook, Gain, Synthesis filter, Postfilter, Postfilter Adapter, Backward vector gain adapter, Backward synthesis filter adapter, 1-vector delay, RMS calculator, Logarithm calculator, Log-gain offset value holder, Log-gain linear predictor, Log-gain limiter, Inverse logarithm calculator, and two sets of Hybrid windowing, Levinson-Durbin recursion and Bandwidth expansion modules.]
Fig. 12. The block diagram of a G.728 LD-CELP speech decoder

6.1 HW/SW Partitioning of the Decoder
To implement the system, we first conducted a process profiling of the G.728 decoder module. The results shown in Table 3 are based on the computations needed to process a data frame which consists of 4 vectors, i.e., 20 input samples. A data frame is, therefore, considered as one iteration in this case. To decode the encoded speech data in real time, all the computations must be finished in 2.5 ms (the interval for a data frame). The software profiling shows that a pure software implementation takes about 44.6 ms, which is about 18 times the allowed time slot, i.e., the initiation interval.
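The real-time budget quoted above follows directly from the frame structure. The short calculation below only restates numbers given in the text (8 kHz sampling, 5 samples per vector, 4 vectors per frame, 10-bit indices, 44.6 ms of software execution time); the constant and function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000
SAMPLES_PER_VECTOR = 5
VECTORS_PER_FRAME = 4
BITS_PER_INDEX = 10


def frame_interval_ms():
    """Time available to decode one data frame (the initiation interval)."""
    samples = SAMPLES_PER_VECTOR * VECTORS_PER_FRAME
    return 1000 * samples / SAMPLE_RATE_HZ


def bit_rate_bps():
    """One 10-bit index is transmitted for every 5 input samples."""
    return BITS_PER_INDEX * SAMPLE_RATE_HZ // SAMPLES_PER_VECTOR


def slowdown(software_ms):
    """Ratio of the pure software execution time to the initiation interval."""
    return software_ms / frame_interval_ms()
```

With these values the frame interval is 2.5 ms, the coded bit rate is 16 kbit/s, and `slowdown(44.6)` evaluates to about 17.8, i.e., the roughly 18-fold gap that motivates moving processes to hardware.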
Our partitioning algorithm first picks the most time consuming process, i.e., the Levinson-Durbin (L-D) recursion module in the backward synthesis filter adapter, and moves it to the hardware section. Algorithmically, the L-D recursion module consists of a 3-level nested computing loop. It computes the predictor coefficients from the auto-correlation matrix recursively frame by frame. In the profiling, it can be seen that the uniformity index of the L-D is good because the same module but with lower computing order also appears in the backward vector gain adapter. The parallelism index, however, is poor due to tightly coupled recursive computation. In this example, we manually replace it with a more parallelized Schur algorithm [19].

Table 3. Profiling results of G.728 decoder system
HW Profiling: Add/Sub, Multiply, Division, Memory. SW Profiling: Instructions.
Process: Excitation VQ Codebook; Gain Scale; Backward vector gain adapter (except for L&D's recursion); Synthesis filter; Backward synthesis filter adapter (except for L&D's recursion); Postfilter; Postfilter adapter; L&D's recursion for log-gain linear predictor; L&D's recursion for synthesis filter
2 7 1 5 0 5 0 1 31 128 2 228 108 116 8556
261 3577 256 3703 0 0 240 519 8676 17750
46 7289 334 46 7300 334 1 2 10 114 310 11 3752 53134 4840 3314 7 33147 50 51 579107

6.2 Partitioning Verification and Hardware Design of the Schur Algorithm
In partitioning verification, we first perform a communication classification. The communication between the Schur algorithm process and synthesis filter process, and the communication between the Schur algorithm process and the backward synthesis filter adapter process are classified as batch communication. In the former case, autocorrelation matrix coefficients, and in the latter case, synthesis filter coefficients will be transferred. Both are mapped to batch communications via the DMA.
The exchanged coefficients are first moved from the shared memory to the local memory and are moved back again after the Toeplitz system is solved.
The next step in the codesign flow is to allocate hardware function units; we obtained an allocation consisting of three multipliers, one divider and four adders. Based on this allocation, the systolic array design of the Schur algorithm is derived manually instead of by using a high level synthesis tool. The design contains a reflection coefficient calculation (RCC) stage, followed by forward and backward substitution stages. The RCC stage is implemented by means of a pipelined-lattice structure as shown in Fig. 13.

[Figure: lattice of butterfly-like modules that update the u and v terms with the reflection coefficient k over the stages i = 1 to 3, starting from the initial values u11 to u14.]
Fig. 13. Pipelined lattice structure design for reflection coefficient calculation

To obtain the maximum degree of parallelism, the number of required butterfly-like modules should be equal to the filter tap order in both adapters, i.e., 10 and 50, respectively. This apparently exceeds the FPGA capacity in our target board. Currently, only two butterfly modules are incorporated in one XC5204 FPGA. Each module contains a multifunction arithmetic array for high speed multiplication and division: a carry-free accumulator. The systolic array design for the forward & backward substitution (FBS) stage is shown in Fig. 14. Each module contains function units similar to those in the RCC butterfly module. To match the data bandwidth of the RCC stage, only two modules are incorporated and implemented in another XC5204 FPGA.
[Figure: systolic array cells that process the matrix entries u_ij in forward (f) and backward (b) passes over the time steps t = 1 to 7 and accumulate partial sums such as u41g1, u41g1+u42g2 and u41g1+u42g2+u43g3.]
Fig. 14. Systolic array design for the forward & backward substitution

6.3 HW/SW Communication Performance Analysis
In this implementation, the HW/SW communication overhead includes 1) the DMA delay in moving the auto-correlation matrices from the shared memory to the local memory of the FPGA for RCC, 2) the delay caused by the C30 when it activates the FPGAs to perform the Schur algorithm, 3) the delay caused by the FPGA when it interrupts the C30 to signal completion of the Schur algorithm, and 4) the DMA delay in moving the filter coefficients from the local memory to the shared memory. The total communication overheads are compiled in Table 4.

Table 4. Communication time overhead for the HW sections in LD-CELP decoder
            Execution cycle   No. of comm. clks   Timing overhead   % of comm. overhead
Block 44    2.5 ms            112                 3.68 µs           0.15%
Block 50    2.5 ms            432                 14.24 µs          0.57%

Again, the overheads are quite small (less than 1%) in this case. Fig. 15 shows the HW/SW communication flow of the LD-CELP decoder example implemented in our target board. The types of communication used are also indicated in the figure. Downloading and uploading of data between the shared memory and the FPGA's local memory are types of batch communication. Sixty-two data items are exchanged in total (51 for the synthesis filter and 11 for the gain adapter) during each data frame. The signaling between the C30 and the FPGAs basically follows the asynchronous communication protocols.
The data passing between the RCC and FBS FPGAs falls into the synchronous communication category. In addition, the executions between the RCC and FBS are pipelined, and the communication is synchronized by the registers. Even though the entire example is still under development, initial analyses did show that the proposed target board architecture facilitates very efficient HW/SW communications.

[Figure: timeline with the columns TMS320C3x, FPGA1 and FPGA2 over the data frames #n-1, #n and #n+1. In each frame the C30 processes Vectors #1 to #4; DMA transfers move data to and from the FPGA local memory; ACT and IACK signal activation of the FPGAs, which solve the Toeplitz systems (Schur algorithm for the synthesis filter and for the gain adapter) and perform forward & backward substitution; INT signals that the coefficients for the synthesis filter or the gain adapter are ready. The legend distinguishes the synchronous, asynchronous and batch communication types and the pipelined stages.]
Fig. 15. HW/SW communication flow of the LD-CELP decoder

7. CURRENT RESULTS AND SUMMARY
The proposed HW/SW codesign system is currently under development at National Yunlin University of Science & Technology, Taiwan, ROC. Even though the target board is equipped with a C30 processor and Xilinx FPGAs, the proposed interface module and communication protocols can be equally applied to different DSP processors and FPGAs. Likewise, the partitioning and the verification procedures in the co-design system can be easily adapted to different FPGA and DSP processor models.
We have so far finished 1) the software code generation module of the C30 processor, 2) definition and behavioral simulation of the communication interface and protocols, 3) the architectural design of the target board and 4) both the software and hardware section designs of the LD-CELP decoder example. The HW/SW partitioning module is still under development. Therefore, partitioning of the decoder example is done manually at this moment. We are also constructing an FPGA reference library for both design estimation and synthesis purposes.
In summary, in this paper, we have presented a novel embedded prototyping system based on hardware/software co-design. The proposed target board consists of a popular DSP processor and several large capacity Xilinx FPGAs. It features a shared bus architecture with two levels of memory hierarchies, i.e., shared main memory and FPGA local memory. Communication interfaces between the hardware and software sections have been carefully defined to support various types of HW/SW communications efficiently. The communication interfaces for the HW and the SW sections are described in VHDL code and C communication routines, respectively. This leads to code augmentation in both sections which supports the communication interface. Based on this prototyping system, we have also proposed a HW/SW co-design environment which takes VHDL code as the initial design specification and performs coarse grain partitioning at the process level. A partitioning result verification procedure has also been developed. High level synthesis and parallel DSP code generation are then employed to realize the respective designs in the target board. Preliminary results for the codesign example of an LD-CELP speech decoder indicate that the proposed prototyping system does support very efficient HW/SW communications.

ACKNOWLEDGMENT
This work was financially supported by the NSC, R.O.C., under Grant NSC86-2221-E-224-007.

REFERENCES
1. M. Theißinger, P. Stravers, and H.
Veit, "Castle: An interactive environment for HW-SW co-design," in Proceedings of Third International Workshop on Hardware/Software Codesign, 1994, pp. 203-209. 2. M.D. Edwards, J. Forrest, "Software acceleration using programmable hardware devices," IEE Proceedings Computer Digital Technology, Vol. 143, 1996, pp. 55-63 3. G. Koch, U. Kebschull and W. Rosenstiel, "A prototyping environment for hardware/software codesign in the CORBA project," in Proceedings of Third International Workshop Hardware/Software Codesign, 1994, pp. 10-16. 4. J.P. Calvez, D. Isidoro and D. Jeuland, "A codesign experience with the MCSE methodology," in Proceedings of Third International Workshop on Hardware/Software Codesign, 1994, pp. 140-147. 5. R. Ernst, J. Henkel, T. Benner, "Hardware-software cosynthesis for microcontrollers," IEEE Design & Test of Computers, Vol. 10, No. 4, 1993, pp. 64-75. 6. D. Herrmann, J. Henkel and R. Ernst, "An approach to the adaptation of estimated cost parameters in the COSYMA system," in Proceedings of Third International Workshop Hardware/ Software Codesign, 1994, pp. 100-107. 7. R.K. Gupta and G. De Micheli, "Hardware-software cosynthesis for digital system," IEEE Design & Test of Computer, Vol. 10, No. 3, 1994, pp. 29-41. 8. R.K. Gupta, C.N. Coelho Jr. and G. De Micheli, "Program implementation schemes for hardware-software systems," Computer, Vol. 27, No. 1, 1994, pp. 48-55. 9. R. Gupta, Claudionor Coelho and G. De Micheli, "Synthesis and simulation of digital systems containing interacting hardware and software components," in Proceedings of DAC, 1992, pp. 225-230 10. M. Sheliga and E. Sha, "Hardware/software co-Design with the HMS framework," Journal of VLSI Signal Processing, Vol. 13, No. 1, 1996, pp. 37-56. 11. P. Eles, Z. Peng and A. Doboli, "VHDL system-level specification and partitioning in a hardware/software co-synthesis environment," in Proceedings of the 3rd International Workshop on HW/SW Codesign, 1994, pp. 49-55. 12. N.S. Woo, A.E. 
Dunlop and W. Wolf, "Codesign from co-specification," Computer, Vol. 1, No. 1, 1994, pp. 42-7. 13. A. Baganne, J.L. Phillipe and E. Martin, "A codesign methodology for telecommunication systems: A case study of an acoustic echo canceller," in Proceedings of the 1997 IEEE Workshop on Signal Processing Systems - Design and Implementation, Leicester, UK, 1997, pp. 273-282. 14. T. Ben Ismail and A.A. Jerraya, "Synthesis steps and design models for codesign," Journal of Computer, Vol. 28, No. 2, 1995, pp. 44-53. 15. T.B. Ismail, M. Abid and A. Jerraya, "COSMOS: A codesign approach for communicating systems," in Proceedings of the 3rd International Workshop on HW/SW Codesign, 1994, pp. 17-24. 16. Ismail, M. Abid, K. O'Brien and A. Jerraya, "An approach for hardware-software codesign," in Proceedings of the 5th International Workshop on Rapid System Prototyping, 1994, pp. 73-80. 17. Y.-T. Hwang and J.-S. Hwang, "Simulated evolution based code generation for programmable DSP processors," in Proceedings of ISCAS '97, Vol. IV, 1997, pp. 2593-2596. 18. J.-H. Chen, R. Cox, Y.-C. Lin and others, "A low-delay CELP coder for the CCITT 16 kb/s speech coding standard," IEEE Journal of Selected Areas in Communications, Vol. 10, No. 5, 1992, pp. 830-849. 19. Y.H. Hu and S.Y. Kung, "Toeplitz eigen system solver," IEEE Transaction on ASSP, Vol. 33, 1982, pp. 1264-1271.

Yin-Tsung Hwang (黃穎聰) obtained his B.S. and M.S. degrees, both in electronic engineering, from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 1983 and 1985, respectively. He received the Ph.D. degree from the department of Electrical & Computer Engineering, the University of Wisconsin, Madison, in 1993. He then joined the Department of Electronic Engineering, National Yunlin University of Science & Technology, and is now an associate professor. Dr. Hwang's research interests include code generation for high performance digital signal processors, hardware/software codesign and VLSI digital signal processing.
Yuan-Hung Wang ( 王 元 鴻 ) graduated with a B.S. degree in Electronic Engineering from National Yunlin University of Science & Technology in 1995. After his graduation, Mr. Wang worked for the Optical-Electronic Research Lab., Industrial Technology Research Institute, for one year. He joined the Institute of Electronic and Information Engineering, National Yunlin University of Science & Technology in 1996, and is currently working toward his masters degree. Mr. Wang's research interests include FPGA implementation and hardware/software codesign. Jer-Sho Hwang(黃晢修)received his B.S. degree in electronic engineering from Chung-Yuan Christian University in 1995. In the same year, he joined the Institute of Electronic and Information Engineering, National Yunlin University of Science & Technology, and studied under the supervision of Prof. Yin-Tsung Hwang. He received his M.S. degree in 1997 and is now serving in the R.O.C. Army. Mr. Hwang's research interests include DSP code generation and hardware/software codesign.