Development of a Design Methodology for Partitioning Hardware and Software Components of System-on-a-Chip Designs

Bryan Shepardson
Honors Thesis Proposal
Class of 2004, Department of Electrical and Computer Engineering

Executive summary:
The increased prevalence of System-on-a-Chip designs has created interest in optimizing system performance through hardware/software partitioning of the system. However, a general partitioning methodology has not been developed to the point where a software algorithm can be analyzed at a high level to make partitioning decisions. This project will use the MP3 encoding algorithm as a case study in an effort to develop generalized methods for partitioning designs into hardware and software. Data will be taken for a benchmark system and analyzed to determine high-latency aspects of the software algorithm. Hardware components external to the microprocessor core will be developed and implemented, and the latency of the modified system will be compared to the benchmark system to determine the effect of the modifications. These results will be used in conjunction with the high-level functionality of the original algorithm to form observations regarding the effectiveness of the hardware components with respect to design efficiency.

Introduction:
As clock speeds have grown faster and physical system size has grown smaller, system designers have begun to explore ways of optimizing their systems by combining microprocessors with the external digital components needed to operate the system. The goal is to place all needed functionality on a single chip. Such systems are commonly referred to as System-on-a-Chip (or SoC) designs. Advantages of this type of design include lower parts cost per unit due to the decrease in number of silicon chips, and less complex printed circuit board designs. One platform for the development of SoC designs is the field-programmable gate array, or FPGA.
The FPGA has certain advantages over application-specific integrated circuits (ASICs) with respect to the design process, the most important of which concerns the delay inherent in redesigning and manufacturing an ASIC. Since an FPGA can be reprogrammed in-house with an updated design, each iteration of the design process can be shortened from weeks to days.

In particular, including the microprocessor as an integral part of the programmable logic of an SoC design has created new opportunities to optimize the design. One such opportunity is the modification of the soft-core processor itself. Typical modifications include changes to the instruction set, the datapath, the registers, or the control structure of the processor.[1] The result of these modifications can be a processor core that is optimized for a specific application.

Another method for optimizing the system is to design logic components external to the processor that serve as an adjunct to the normal processor functionality. Such components would target computationally expensive tasks.[2] Using this method, the processor core would remain unchanged, but the code executed would be modified to take advantage of the external components. Studies in this area have analyzed the effect of these changes in design methodology on practical applications.[3]

To take full advantage of the second method when designing an SoC system, one must understand where the application's implementation is inefficient due to limitations imposed by the processor architecture. This understanding allows design decisions to be made that maximize the benefit of the augmenting logic components. To achieve this optimization, a methodology must be developed for analyzing an algorithm in order to identify the areas to be optimized.
The MP3 encoding algorithm is a good subject for a case study of these optimization issues because of the variety of tasks it must perform. The main components of the algorithm include a filter bank that breaks the signal up into frequency bands, a psychoacoustic model that determines which components of the signal are masked due to human perception of sound, and a quantization process whose output is coded by conventional lossless means (such as Huffman coding).[4] Each of these processes uses the microprocessor in a different way. For example, the filter bank calculations involve a long multiplication-addition sequence of instructions for each spectral component, while the masking threshold is determined by evaluating conditional branches.[5] Through analysis of the MP3 encoding algorithm with respect to its inefficiencies, specific areas will be identified that are by nature inefficient when implemented in conventional microprocessors. This information will then be used to devise a methodology for analyzing other algorithms at a high level, allowing a designer to efficiently partition their design into hardware and software.

[1] B. R. Rau and M. S. Schlansker, "Embedded Computer Architecture and Automation," IEEE Computer, vol. 34, no. 4, pp. 75-83, 2001.
[2] C. Ulmer, "Configurable Computing: Practical Use of FPGAs," Georgia Institute of Technology, Ph.D. qualifying exam.
[3] J. Kempa, "Maximizing Embedded System Performance in the Era of Programmable Logic," published at http://www.chipcenter.com/pld/pldill97.htm.

Objectives:
The purpose of this research is to explore the use of external logic to accelerate certain parts of code running on a microprocessor embedded within an FPGA. With knowledge of which parts of an algorithm are likely to be inefficient, and of which types of operations are well suited to being performed outside the microprocessor core, design decisions can be made to optimize the system.
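To make the contrast drawn in the introduction concrete, the following minimal sketch places a multiply-accumulate loop of the kind a filter bank performs next to a branch-driven masking decision. It is written in Python only for brevity. The function names, the toy data, and the simplified masking rule are assumptions made for illustration; none of it is taken from the MP3 standard or from any particular encoder.

```python
def filter_tap_sum(samples, coefficients):
    """Multiply-accumulate: one multiply and one add per tap,
    with no data-dependent branching inside the loop."""
    acc = 0.0
    for x, c in zip(samples, coefficients):
        acc += x * c
    return acc


def audible_components(levels, thresholds):
    """Branch-driven selection: keep the index of every component whose
    level exceeds its masking threshold (a hypothetical, simplified rule)."""
    kept = []
    for i, (level, threshold) in enumerate(zip(levels, thresholds)):
        if level > threshold:  # outcome depends on the data at each step
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy inputs, invented purely for demonstration.
    print(filter_tap_sum([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
    print(audible_components([10.0, 2.0, 7.0], [5.0, 5.0, 5.0]))
```

The first function has a fixed, regular dataflow, which is the kind of sequence that may map well onto a dedicated multiply-accumulate unit. The second takes a different path depending on each input value, so any benefit from external logic would have to come from a different source, such as evaluating several comparisons in parallel.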
The information gathered from this research will be used to develop criteria and methods for the development of these external logic components. The goal is to justify the methods in as general a case as possible, so that the methods apply to many different types of algorithms. This compiled information could be used directly by a designer to make high-level architectural decisions, or it could be built into a compiler or device fitter to identify possible areas in a design that would benefit from external logic acceleration.

Methods:
The research will be conducted in a series of five main tasks. These tasks are: implementing the existing MP3 algorithm in an FPGA; taking benchmarking data to determine the latency associated with each part of the algorithm; identifying areas to be optimized and designing the external logic components needed to optimize them; implementing the algorithm with external logic upgrades in the FPGA; and benchmarking the new system to compare its latencies to those of the baseline system.

[4] K. Brandenburg, "MP3 and AAC Explained," presented at AES 17th International Conference on High Quality Audio Encoding, 1999.
[5] Z. Smekal, "Spectral Analysis By Digital Filter Banks," ElectronicsLetters.com, paper #2/10/2001.

The version of the algorithm to be used in this project will be the LAME encoder. The LAME project is open-source (distributed under the GNU General Public License) and has been shown to produce a more accurate waveform output than other encoders, both commercial and open-source.[6] This source code will be compiled on an x86 PC and run as the baseline system. Using a method such as a software profiler, the baseline system will be analyzed to identify high-latency areas of the algorithm. Once the inefficient areas have been identified, a low-level consideration of each area will be conducted to determine whether external logic acceleration would be practical and beneficial to the design.
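One way to frame whether accelerating a given area would be beneficial is a rough estimate built from the profile data. The sketch below is illustrative only: the stage names and timings are hypothetical placeholders, not measurements, and the speedup estimate is the standard Amdahl's-law bound extended with an assumed overhead term for moving data to and from the external logic.

```python
def rank_hotspots(stage_seconds):
    """Return (stage, share_of_total_time) pairs, largest share first."""
    total = sum(stage_seconds.values())
    shares = {name: t / total for name, t in stage_seconds.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


def estimated_speedup(fraction, accel_factor, overhead_fraction=0.0):
    """Amdahl-style estimate of the overall speedup.

    fraction          -- share of baseline run time spent in the accelerated area
    accel_factor      -- how many times faster that area runs in external logic
    overhead_fraction -- assumed extra time (as a share of baseline run time)
                         spent on data transfer and control
    """
    new_time = (1.0 - fraction) + fraction / accel_factor + overhead_fraction
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical profile in seconds; not real benchmark data.
    profile = {
        "filter_bank": 4.0,
        "psychoacoustic_model": 3.0,
        "quantization": 2.0,
        "other": 1.0,
    }
    for stage, share in rank_hotspots(profile):
        print(f"{stage}: {share:.0%} of run time, "
              f"estimated overall speedup if made 10x faster: "
              f"{estimated_speedup(share, 10.0):.2f}x")
```

The overhead term matters in a setup like the one proposed here: if the link between the processor and the external logic is slow, the transfer time can cancel much or all of the gain, so an area is a good candidate only when its share of run time is large relative to the cost of moving its data.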
Criteria to be considered include the degree of parallelism that could be achieved by using external logic, and the difference in efficiency between a particular sequence of instructions in a CPU datapath and a specialized component. One or two areas will be chosen for implementation in digital logic, and their associated components will be designed and tested using VHDL. The testing will cover accuracy of output as well as timing considerations. The components will then be programmed into an FPGA connected to the PC via the serial port.

Once the design and implementation of the new components are completed, the software of the baseline system will be modified to take advantage of the external components. This process will involve data transfer to and from the FPGA via the serial port, as well as any necessary control signals. Once this is completed, tests will be conducted to ensure that the quality of the output from the new system is comparable to that of the baseline system. Since the goal of the project is to analyze the difference between a software and a hardware implementation of particular components, the output of any particular implementation of the system should be the same, and therefore a file comparison will be used.

After the new design is determined to produce acceptable results as compared to the baseline system, the latency of the external logic will be measured by monitoring control signals. These data will be compared with the baseline measurements to determine whether any significant reduction in latency has been achieved. These observations will be weighed against the amount of area used within the FPGA in order to evaluate whether the optimizations were valuable in a practical sense. Once these conclusions are made, specific features of the new system will be analyzed relative to the original design of the algorithm in order to generalize the results for use in designing other systems.

[6] Data presented in "Analysis" section of http://www.r3rnix.net/

Timeline: