
Paper Title: Generation of SIMD MAC Unit using Redundant Binary
Keyword: MAC (Multiplication-and-Accumulation), DSP, SIMD, Redundant Binary
First Author Information
Name: Young-Jin, Jang
Affiliation: KyungHee Univ., South Korea
Address: CANN LAB., Computer Eng., KyungHee Univ.
YongIn, KyungKi, Korea, 449-701
Phone: +82-31-201-2947
Fax: +82-31-202-1723
e-mail: ddsmurf@vlsi.khu.ac.kr / ddsmurf@empal.com
Generation of SIMD MAC Unit using Redundant Binary
Young-Jin, Jang* and Hyon-Soo, Lee**
Department of Computer Engineering
School of Electronics and Information
KyungHee University, KOREA
*ddsmurf@vlsi.khu.ac.kr, **leehs@khu.ac.kr
In this paper, we present a method for generating a SIMD MAC (Multiplication-and-Accumulation)
unit using redundant binary. Fast and area-efficient MAC units are at the heart of
real-time video and digital signal processing systems. To implement high-performance
systems, general-purpose CPUs and DSPs mostly use SIMD MAC units such as INTEL's
MMX/SSE-2, SUN's VIS, and PowerPC's AltiVec. These solutions, however, do not sufficiently
satisfy the low-power, area, and speed constraints of embedded systems. Recently, FPGA
technology has been developed as an alternative that can meet the constraints described above. FPGA
designers who want to construct a SIMD MAC unit either instantiate in parallel the MAC code generated by FPGA
vendor tools (for example, ALTERA's DSPBuilder or XILINX's IP CoreGenerator) or
write the HDL code manually.
MAC units generated with vendor tools have several problems. First, existing
tools only generate a MAC with a fixed word length and a single function, as determined by the
parameters the user specifies. Because the generated MAC lacks the scalability and flexibility to
process arbitrary data sizes, designers must regenerate a suitable MAC unit for each
operand word length. Second, a vendor MAC unit cannot process sub-words and
offers no parallelism at the sub-word level. Since only fixed-size operations are supported,
its utilization is lower than that of a SIMD MAC. Finally, customized HDL code produced with vendor tools
is not portable between FPGA vendors, because the code is dedicated to each vendor's own
tool. Although the code can provide several different architectures for different constraints, it is
hard to match the host of architectures available to the synthesis tool, and the correct architecture
must be chosen manually.
To achieve high performance and solve these problems, the proposed method has the following
features: ⅰ) code scalability, achieved by supporting multiple operations and sub-word
computation in a single MAC; ⅱ) increased portability and reusability, through HDL code driven by
user parameters; ⅲ) higher performance than existing MAC code, through the data parallelism of
sub-word computing. Additionally, we implement an efficient signed/unsigned
MAC that uses redundant binary for constant-time addition and simultaneous format conversion.
With redundant binary, a signed multiplier can be constructed simply, because sign
representation in redundant binary is simple and the additional bit manipulation required in a 2's
complement multiplier is removed.
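
The two properties claimed for redundant binary (constant-time addition and simple sign handling) can be illustrated with a small software model. The sketch below is not the authors' VHDL; it assumes the common signed-digit encoding with digits in {-1, 0, 1}, stored least significant digit first, and the function names (`int_to_sd`, `sd_to_int`, `sd_negate`, `sd_add`) are hypothetical. Each result digit depends only on digit positions i and i-1 (and, through the incoming transfer, i-2), so no carry ripples across the word.

```python
def int_to_sd(x: int, n: int) -> list[int]:
    """Encode x as n signed digits (LSB first), each in {-1, 0, 1}."""
    mag = abs(x)
    assert mag < (1 << n), "magnitude does not fit in n digits"
    sign = -1 if x < 0 else 1
    return [sign * ((mag >> i) & 1) for i in range(n)]


def sd_to_int(digits: list[int]) -> int:
    """Convert signed digits back to an ordinary integer."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def sd_negate(digits: list[int]) -> list[int]:
    """Negation is digit-wise: no '+1' correction as in 2's complement."""
    return [-d for d in digits]


def sd_add(x: list[int], y: list[int]) -> list[int]:
    """Carry-free signed-digit addition; returns n + 1 digits."""
    n = len(x)
    assert len(y) == n
    interim = [0] * n          # interim sum digit at each position
    transfer = [0] * (n + 1)   # transfer[i] is the transfer INTO position i
    for i in range(n):
        s = x[i] + y[i]
        # If both lower digits are non-negative, the incoming transfer is >= 0;
        # otherwise it is <= 0. This decides how s = +/-1 is split.
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if s == 2:
            transfer[i + 1], interim[i] = 1, 0
        elif s == 1:
            transfer[i + 1], interim[i] = (1, -1) if lower_nonneg else (0, 1)
        elif s == -1:
            transfer[i + 1], interim[i] = (0, -1) if lower_nonneg else (-1, 1)
        elif s == -2:
            transfer[i + 1], interim[i] = -1, 0
        # s == 0 leaves both at 0
    # The final digit at each position never leaves {-1, 0, 1}.
    return [interim[i] + transfer[i] for i in range(n)] + [transfer[n]]


total = sd_add(int_to_sd(5, 4), sd_negate(int_to_sd(7, 4)))
print(sd_to_int(total))  # 5 + (-7)
```

In hardware the loop body would be replicated per digit and evaluated in parallel; the loop here only enumerates positions.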
The proposed MAC can execute ⅰ) general MAC operations, ⅱ) SIMD MAC operations,
and ⅲ) summation of the results of SIMD multiplication and accumulation. The operation is
chosen by the user's function selection.
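
A behavioral model can make the three operations concrete. The sketch below is an illustration under assumptions, not the proposed hardware: the mode names (`"mac"`, `"simd"`, `"simd_sum"`), the helper names, and the default 32-bit word with 8-bit sub-words are all hypothetical, the third mode is read as "add the sum of all lane products to one accumulator", and overflow detection is not modeled because Python integers are unbounded.

```python
def pack(lanes: list[int], sub_bits: int) -> int:
    """Pack lane values (lane 0 in the low bits) into one word."""
    mask = (1 << sub_bits) - 1
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & mask) << (i * sub_bits)
    return word


def unpack(word: int, word_bits: int, sub_bits: int, signed: bool = True) -> list[int]:
    """Split a word into sub-words, optionally as 2's complement values."""
    assert word_bits % sub_bits == 0
    mask = (1 << sub_bits) - 1
    lanes = []
    for i in range(word_bits // sub_bits):
        v = (word >> (i * sub_bits)) & mask
        if signed and (v >> (sub_bits - 1)) & 1:
            v -= 1 << sub_bits
        lanes.append(v)
    return lanes


def mac_unit(a, b, acc, mode, word_bits=32, sub_bits=8, signed=True):
    """Model of the three operations selected by a function-select input."""
    if mode == "mac":            # full-width multiply-accumulate
        (x,) = unpack(a, word_bits, word_bits, signed)
        (y,) = unpack(b, word_bits, word_bits, signed)
        return acc + x * y
    xs = unpack(a, word_bits, sub_bits, signed)
    ys = unpack(b, word_bits, sub_bits, signed)
    products = [x * y for x, y in zip(xs, ys)]
    if mode == "simd":           # one accumulator per lane
        return [c + p for c, p in zip(acc, products)]
    if mode == "simd_sum":       # lane products summed into one accumulator
        return acc + sum(products)
    raise ValueError(f"unknown mode: {mode}")


a = pack([1, -2, 3, 4], 8)
b = pack([5, 6, -7, 8], 8)
print(mac_unit(a, b, [0, 0, 0, 0], "simd"))   # per-lane results
print(mac_unit(a, b, 10, "simd_sum"))         # dot-product style result
```

The `"simd_sum"` mode corresponds to one tap group of a filter, where several coefficient/sample products are reduced into a single running sum.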
The proposed generation algorithm consists of five stages. Stage 1 and stage 5 convert
operands between 2's complement and redundant binary and manipulate the operand's sign for signed
operation. Stage 2 and stage 3 form a radix-4 Booth multiplier with sub-word control and
execute signed/unsigned multiplication. Finally, stage 4 performs accumulation and detects
overflow. All stages automatically compute the bit positions of the partial products according to
the customization parameters.
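
As a reminder of what the multiplier stages compute, the following sketch shows standard radix-4 Booth recoding for a signed (2's complement) multiplier. It is a software illustration only: the function names are hypothetical, and the sub-word control, unsigned handling, and redundant binary partial-product tree of the proposed design are not modeled. Each recoded digit lies in {-2, -1, 0, 1, 2} and selects one partial product, so an n-bit multiplier yields n/2 partial products.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode an n-bit 2's complement multiplier (n even) into n/2 digits."""
    assert n % 2 == 0
    assert -(1 << (n - 1)) <= y < (1 << (n - 1))
    u = y & ((1 << n) - 1)
    # bits[0] is the implicit y[-1] = 0; bits[i + 1] is bit i of the multiplier.
    bits = [0] + [(u >> i) & 1 for i in range(n)]
    digits = []
    for k in range(n // 2):
        y_prev, y_lo, y_hi = bits[2 * k], bits[2 * k + 1], bits[2 * k + 2]
        digits.append(y_prev + y_lo - 2 * y_hi)
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the partial products d_k * x * 4**k selected by the Booth digits."""
    return sum(d * x * (4 ** k) for k, d in enumerate(booth_radix4_digits(y, n)))


print(booth_radix4_digits(-1, 4))
print(booth_multiply(7, -3, 4))
```

The weight `4 ** k` is the bit position of partial product k; in a parameterized generator this offset would be derived from the word and sub-word lengths rather than fixed.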
We have implemented the proposed method in VHDL at the structural level. Parameters are
passed from the testbench code using generic mapping. To construct the addition tree of partial
products, the VHDL code uses for-loop constructs and a 4-D array. The implemented code is
synthesizable VHDL code for any synthesis tool. With Verilog HDL 2001, the
VHDL code can be readily converted into Verilog HDL code.
To verify the effectiveness of the proposed method, we have compared its synthesis results with
those of MAC units generated by the existing tools. However, since the existing tools do not generate SIMD MAC code, a direct
comparison of functionality and scalability is impossible. Furthermore, we implement a filter design
to validate SIMD MAC operation. Consequently, the proposed method is valid for FPGA
designs requiring more scalability and flexibility on a fixed hardware architecture. We also
provide a freely available VHDL library with generic components that can be used as building
blocks in digital applications without requiring expert knowledge of SIMD and redundant binary.