Development of a Design Methodology for Partitioning Hardware and Software
Components of System-on-a-Chip Designs
Bryan Shepardson
Honors Thesis Proposal
Class of 2004, Department of Electrical and Computer Engineering
Executive summary:
The increased prevalence of System-on-a-Chip designs has created interest in optimizing system
performance through hardware/software partitioning of the system. However, a general
partitioning methodology has not been developed to the point where a software algorithm can be
analyzed at a high level to make partitioning decisions. This project will use the MP3 encoding
algorithm as a case study in an effort to develop generalized methods for partitioning designs into
hardware and software. Data will be taken for a benchmark system and analyzed to determine
high-latency aspects of the software algorithm. Hardware components external to the
microprocessor core will be developed and implemented, and the latency of the modified system
will be compared to the benchmark system to determine the effect of the modifications. These
results will be used in conjunction with the high-level functionality of the original algorithm to
form observations regarding the effectiveness of the hardware components with respect to design
efficiency.
Introduction:
As clock speeds have grown faster and physical system size has grown smaller, system designers
have begun to explore ways of optimizing their systems by combining microprocessors with the
external digital components needed to operate the system. The goal is to place all needed
functionality on a single chip. Such systems are commonly referred to as System-on-a-Chip (or
SoC) designs. Advantages of this type of design include lower parts cost per unit due to the
decrease in number of silicon chips, and less complex printed circuit board designs.
One platform for the development of SoC designs is the field-programmable gate array, or FPGA.
The FPGA has certain advantages over application-specific integrated circuits (ASICs) with
respect to the design process, the most important of which is that it avoids the delay inherent in
redesigning and remanufacturing an ASIC. Since an FPGA can be reprogrammed in-house with an
updated design, each iteration of the design process can be shortened from weeks to days.
Specifically, the inclusion of the microprocessor as an integral part of the programmable logic of
a SoC design has created new opportunities to optimize the design. One such method is the
modification of the soft-core processor itself. Typical modifications include changes to the
instruction set, the datapath, the registers or the control structure of the processor.1 The result of
these modifications can be a processor core that is optimized for a specific application.
Another method for optimizing the system is to design logic components external to the processor
to be used as an adjunct to the normal processor functionality. Specifically, modifications such as
these would be targeted toward computationally expensive tasks.2 Using this method, the
processor core would remain unchanged, but the code executed would be modified to take
advantage of these external components. Studies in this area have included the analysis of the
effect of these changes in design methodology in practical applications.3
In order to fully take advantage of the second method in designing a SoC system, one must
understand where the application's implementation is inefficient due to limitations imposed by the
processor architecture. This allows design decisions to be made that maximize the benefit of
using augmenting logic components. In order to achieve this optimization, a methodology must
be developed to analyze an algorithm for the purposes of identifying the areas to be optimized.
The MP3 encoding algorithm is a good subject for a case study with regard to these optimization
issues due to the various tasks it is required to perform. The main components of the algorithm
include a filter bank to break the signal up into frequency bands, a psychoacoustic model to
determine which components of the signal are masked due to human perception of sound, and a
quantization process to code the information by conventional lossless means (such as Huffman
coding).4 Each of these processes uses the microprocessor in a different way. For example, the
filter bank calculations involve a long multiplication-addition sequence of instructions for each
spectral component, while the masking threshold is determined through evaluating conditional
branches.5
1 B. R. Rau and M. S. Schlansker, "Embedded Computer Architecture and Automation," IEEE Computer, vol. 34, no. 4, pp. 75-83, 2001.
2 C. Ulmer, "Configurable Computing: Practical Use of FPGAs," Georgia Institute of Technology, Ph.D. qualifying exam.
3 J. Kempa, "Maximizing Embedded System Performance in the Era of Programmable Logic," published at http://www.chipcenter.com/pld/pldill97.htm.
Through analysis of the MP3 encoding algorithm with respect to its inefficiencies, specific areas
will be identified that are by nature inefficient when implemented in conventional
microprocessors. This information will then be used to devise a methodology for analyzing other
algorithms at a high level for the purposes of allowing a designer to efficiently partition their
design into hardware and software.
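The multiplication-addition sequence attributed to the filter bank above can be illustrated with a minimal sketch. It is not the LAME filter bank: the function names and the coefficient layout are hypothetical, chosen only to show that each output is an independent sum of products, which a processor evaluates one term at a time.

```python
def multiply_accumulate(coefficients, samples):
    """Sequential multiply-accumulate: one product and one addition per term."""
    accumulator = 0.0
    for coefficient, sample in zip(coefficients, samples):
        accumulator += coefficient * sample
    return accumulator


def subband_outputs(coefficient_rows, samples):
    """One multiply-accumulate per band (hypothetical layout: one row per band).

    No row depends on the result of another, so the rows could in
    principle be evaluated concurrently by dedicated logic.
    """
    return [multiply_accumulate(row, samples) for row in coefficient_rows]
```

Because the rows share no intermediate results, this kind of loop is a plausible candidate for external logic, whereas branch-dominated code such as the masking threshold decision offers less obvious parallelism.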
Objectives:
The purpose of this research is to explore the usage of external logic to accelerate certain parts of
code running on a microprocessor embedded within an FPGA. With knowledge of which parts of
an algorithm are likely to be inefficient, and what types of operations are well suited to be
performed outside of the microprocessor core, design decisions can be made to optimize the
system.
The information gathered from this research will be used to develop criteria and methods for the
development of these external logic components. The goal is to justify the methods in as general a
case as possible, in order for the methods to apply to many different types of algorithms. This
compiled information could be used directly by a designer to make high-level architectural
decisions, or it could be built into a compiler or device fitter to identify possible areas in a design
that would benefit from external logic acceleration.
Methods:
The research will be conducted in a series of five main tasks. These tasks are: implementing the
existing MP3 algorithm in an FPGA; taking benchmarking data to determine the latency
associated with each part of the algorithm; identifying areas to be optimized and designing the
external logic components needed to optimize them; implementing the algorithm with external logic
upgrades in the FPGA; and benchmarking the new system to compare latencies to the baseline
system.
4 K. Brandenburg, "MP3 and AAC Explained," presented at AES 17th International Conference on High Quality Audio Encoding, 1999.
5 Z. Smekal, "Spectral Analysis By Digital Filter Banks," ElectronicsLetters.com, paper #2/10/2001.
The version of the algorithm to be used in this project will be the LAME encoder. The LAME
project is open-source (distributed under the GNU General Public License) and has been shown
to produce a more accurate waveform output than other encoders, both commercial and
open-source.6 This source code will be compiled on an x86 PC, and run as the baseline system.
Using a method such as a software profiler, the baseline system will be analyzed to identify
high-latency areas of the algorithm. Once the inefficient areas have been identified, a low-level
consideration of each area will be conducted to determine whether external logic acceleration
would be practical and beneficial to the design. Criteria to be considered include the degree of
parallelism that could be achieved by using external logic, and the difference in efficiency of a
particular sequence of instructions in a CPU datapath as compared to a specialized component.
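The profiling step described above might look like the following minimal sketch. It uses Python's standard cProfile module purely as a stand-in; the actual LAME encoder is written in C and would be profiled with a native tool. The stage functions filter_bank, psychoacoustic_model and encode_frame are hypothetical placeholders and do not implement the real algorithm.

```python
import cProfile
import pstats


def filter_bank(samples):
    # Placeholder for multiply-accumulate work (not the real filter bank).
    return sum(sample * 0.5 for sample in samples)


def psychoacoustic_model(samples):
    # Placeholder for branch-heavy threshold decisions (not the real model).
    return sum(1 for sample in samples if sample % 3 == 0)


def encode_frame(samples):
    return filter_bank(samples), psychoacoustic_model(samples)


def profile_stages(samples, repeats=50,
                   names=("filter_bank", "psychoacoustic_model", "encode_frame")):
    """Return {function name: cumulative seconds} for the named functions."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeats):
        encode_frame(samples)
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    return {key[2]: value[3] for key, value in stats.stats.items()
            if key[2] in names}


if __name__ == "__main__":
    timings = profile_stages(list(range(2000)))
    for name, seconds in sorted(timings.items(), key=lambda item: -item[1]):
        print(f"{name:24s} {seconds:.6f} s")
```

Sorting the stages by cumulative time gives a first ranking of candidate areas; the low-level review of parallelism and datapath efficiency would then be applied to the entries at the top of that list.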
One or two areas will be chosen to implement in digital logic, and their associated components
will be designed and tested using VHDL. The testing will include accuracy of output as well as
timing considerations. The components will then be programmed into an FPGA connected to the
PC via the serial port.
Once the design and implementation of the new components is completed, the software of the
baseline system will be modified to take advantage of the external components. This process will
involve data transfer to and from the FPGA via the serial port, as well as any necessary control
signals. Once this is completed, tests will be conducted to ensure that the quality of the output
from the new system is comparable to the baseline system. Since the goal of the project is to
analyze the difference between a software and hardware implementation of particular
components, the output of any particular implementation of the system should be the same, and
therefore a file comparison will be used.
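A file comparison of this kind could be sketched as follows, using only the Python standard library. The function names are hypothetical, and the digest helper is an optional addition for logging which outputs were compared; it is not part of the proposal.

```python
import filecmp
import hashlib


def outputs_identical(baseline_path, modified_path):
    """Byte-for-byte comparison of two encoder output files."""
    # shallow=False forces a content comparison instead of trusting file metadata.
    return filecmp.cmp(baseline_path, modified_path, shallow=False)


def sha256_of(path, chunk_size=65536):
    """Hex digest of a file, read in chunks so large outputs fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

An exact match is a strict criterion: it assumes the hardware components reproduce the arithmetic of the software version bit for bit, which may not hold if, for example, the two use different number formats.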
After the new design is determined to produce acceptable results as compared to the baseline
system, the latency of the external logic will be measured by monitoring control signals. This data
will be compared to the baseline system to determine whether any significant reduction in latency
has been observed. These observations will be compared to the amount of area used within the
FPGA in order to evaluate whether the optimizations were valuable in a practical sense. Once
these conclusions are made, specific features of the new system will be analyzed relative to the
original design of the algorithm in order to generalize the results for use in designing other
systems.
6 Data presented in "Analysis" section of http://www.r3mix.net/
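The comparison of latency against FPGA area could be quantified along the lines of the sketch below. The proposal does not define these metrics, so they are assumptions: the function names are hypothetical, the area unit is left generic, and Amdahl's law is introduced here as a standard estimate rather than as part of the proposed method.

```python
def speedup(baseline_latency, modified_latency):
    """Ratio of baseline latency to modified latency (greater than 1 is faster)."""
    return baseline_latency / modified_latency


def amdahl_speedup(accelerated_fraction, local_speedup):
    """Overall speedup predicted by Amdahl's law.

    accelerated_fraction: share of baseline run time spent in the code
    moved to external logic (0.0 to 1.0).
    local_speedup: how much faster that code runs in hardware.
    """
    return 1.0 / ((1.0 - accelerated_fraction)
                  + accelerated_fraction / local_speedup)


def latency_saved_per_area(baseline_latency, modified_latency, area_units):
    """Latency reduction per unit of FPGA area consumed (unit is an assumption)."""
    return (baseline_latency - modified_latency) / area_units
```

As an illustration with made-up numbers, a stage that accounts for half of the run time and runs twice as fast in hardware gives amdahl_speedup(0.5, 2.0), about 1.33 overall, which shows why the profiled share of each stage matters as much as the raw hardware speedup.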
Timeline: