Paper Title: Generation of SIMD MAC Unit using Redundant Binary
Keyword: MAC (Multiplication-and-Accumulation), DSP, SIMD, Redundant Binary

First Author Information
Name: Young-Jin, Jang
Affiliation: KyungHee Univ., South Korea
Address: CANN LAB., Computer Eng., KyungHee Univ. YongIn, KyungKi, Korea, 449-701
Phone: +82-31-201-2947
Fax: +82-31-202-1723
e-mail: ddsmurf@vlsi.khu.ac.kr / ddsmurf@empal.com

Young-Jin, Jang* and Hyon-Soo, Lee**
Department of Computer Engineering, School of Electronics and Information, KyungHee University, KOREA
* ddsmurf@vlsi.khu.ac.kr
** leehs@khu.ac.kr

In this paper, we present a method for generating a SIMD MAC (Multiplication-and-Accumulation) unit using redundant binary. Fast and area-efficient MAC units are at the heart of real-time video and digital signal processing systems. To achieve high performance, general-purpose CPUs and DSPs mostly use SIMD MAC units such as Intel's MMX/SSE-2, Sun's VIS, and PowerPC's AltiVec. These solutions, however, do not sufficiently satisfy the low-power, area, and speed constraints of embedded systems. Recently, FPGA technology has emerged as an alternative that can meet these constraints. FPGA designers who want to construct a SIMD MAC unit either instantiate in parallel the MAC code generated by FPGA vendors' tools (for example, Altera's DSPBuilder or Xilinx's IP CoreGenerator) or write the HDL code manually.

MAC units generated with vendor tools have several problems. First, existing tools only generate a MAC with a fixed word length and a single function, as specified by the user's parameters. Because the generated MAC lacks the scalability and flexibility to process data of arbitrary size, designers must regenerate a suitable MAC unit for each operand word length. Second, a vendor MAC unit cannot process sub-words and therefore offers no parallelism at the sub-word level.
Since only fixed-size operations are supported, its utilization is lower than that of a SIMD MAC. Finally, HDL code customized with a vendor's tools is not portable between FPGA vendors, because it relies on code dedicated to that vendor's own tool. Although the code can provide several different architectures for different constraints, it is hard to match the host of architectures available to the synthesis tool, and the correct architecture must be chosen manually.

To achieve high performance and solve these problems, the proposed method has the following features: ⅰ) code scalability, achieved by supporting multiple operations and sub-word computation in a single MAC; ⅱ) HDL code with user parameters, which increases portability and reusability; ⅲ) data parallelism through sub-word computing, which provides higher performance than existing MAC code. Additionally, we implement an efficient signed/unsigned MAC that uses redundant binary for constant-time addition and simultaneous format conversion. With redundant binary, a signed multiplier is simple to construct, because the sign representation is simple and the additional bit manipulation required in a 2's complement multiplier is removed. In the proposed MAC, the executable operations are ⅰ) general MAC operations, ⅱ) SIMD MAC, and ⅲ) summation of the results of SIMD multiplication and accumulation. These operations are controlled by the user's function selection.

The proposed generation algorithm consists of five stages. Stage 1 and stage 5 convert operands between 2's complement and redundant binary and manipulate the operand's sign for signed operation. Stage 2 and stage 3 form a radix-4 Booth multiplier with sub-word control and execute signed/unsigned multiplication. Stage 4 performs accumulation and detects overflow. All stages automatically compute the bit positions of the partial products according to the customizing parameters. We have implemented the proposed method in VHDL at the structural level.
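The multiplier stages and the three operation modes can be illustrated with a small behavioral model. The Python sketch below is not the authors' VHDL. It assumes an even sub-word width, reads operation ⅲ) as adding all lane products into a single accumulator, and omits overflow detection and the redundant binary conversion of stages 1 and 5. The names `to_signed`, `booth_radix4_digits`, `booth_multiply`, `split_lanes`, `simd_mac`, and the `sum_lanes` flag are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def booth_radix4_digits(multiplier, bits):
    """Recode a `bits`-wide 2's complement multiplier into radix-4 digits
    in {-2, -1, 0, 1, 2}, least significant digit first (bits must be even)."""
    y = multiplier & ((1 << bits) - 1)
    digits = []
    prev = 0  # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x, y, bits, signed=True):
    """Multiply two `bits`-wide operands by summing Booth partial products."""
    mask = (1 << bits) - 1
    if signed:
        multiplicand = to_signed(x, bits)
        digits = booth_radix4_digits(y, bits)
    else:
        multiplicand = x & mask
        # Assumption: unsigned operands are zero-extended by two bits
        # so that the same signed recoding can be reused.
        digits = booth_radix4_digits(y & mask, bits + 2)
    # Partial product i is the digit times the multiplicand, weighted by 4**i.
    return sum((d * multiplicand) << (2 * i) for i, d in enumerate(digits))


def split_lanes(word, word_bits, lanes):
    """Split a packed word into `lanes` sub-words, lowest lane first."""
    sub = word_bits // lanes
    return [(word >> (k * sub)) & ((1 << sub) - 1) for k in range(lanes)]


def simd_mac(a, b, acc, word_bits, lanes, sum_lanes=False, signed=True):
    """Behavioral MAC: lanes=1 is a general MAC, lanes>1 is a SIMD MAC.

    With sum_lanes=False, `acc` is a list holding one accumulator per lane.
    With sum_lanes=True, `acc` is one integer that receives the sum of all
    lane products.
    """
    sub = word_bits // lanes
    products = [
        booth_multiply(x, y, sub, signed)
        for x, y in zip(split_lanes(a, word_bits, lanes),
                        split_lanes(b, word_bits, lanes))
    ]
    if sum_lanes:
        return acc + sum(products)
    return [s + p for s, p in zip(acc, products)]
```

With 8-bit lanes packed into 16-bit words, `simd_mac(0x0305, 0xFE04, [10, 20], 16, 2)` returns `[30, 14]`: the low lanes multiply 5 by 4 and the high lanes multiply 3 by -2.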
Parameters are passed from the testbench code using generic mapping. To construct the addition tree of partial products, the VHDL code uses for-loop constructs and a 4-D array. The implemented code is synthesizable VHDL for any synthesis tool. With Verilog HDL 2001, the VHDL code can readily be converted into Verilog HDL code.

To verify the effectiveness of the proposed method, we have compared the synthesis results with those of a MAC generated by the existing tools. However, since the existing tools do not generate SIMD MAC code, a direct comparison of functionality and scalability is impossible. Furthermore, we implement a filter design to validate SIMD MAC operation. Consequently, the proposed method is valid for FPGA designs that require more scalability and flexibility on a fixed hardware architecture. We also provide a freely available VHDL library of generic components that can be used as building blocks in digital applications without requiring expert knowledge of SIMD or redundant binary.
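A loop-built addition tree over redundant binary operands can be mimicked in software. The following Python sketch is a behavioral model under stated assumptions, not the authors' VHDL: `bits` plays the role of a generic parameter, the operands are signed 2's complement words standing in for partial products, and the digit-selection rule in `rb_add` is one standard carry-free signed-digit scheme chosen here for illustration. All function names are hypothetical.

```python
def twos_to_rb(value, bits):
    """2's complement word -> redundant binary digits in {-1, 0, 1}, LSB first.

    Only the sign bit changes: it has weight -2**(bits-1), so it becomes -1.
    """
    digits = [(value >> i) & 1 for i in range(bits)]
    digits[-1] = -digits[-1]
    return digits


def rb_to_int(digits):
    """Evaluate a redundant binary digit list (LSB first) as an integer."""
    return sum(d << i for i, d in enumerate(digits))


def rb_add(x, y):
    """Carry-free addition of two equal-length digit lists.

    Each result digit depends on at most three neighboring digit positions,
    so the delay does not grow with the word length. The result has one
    extra digit.
    """
    n = len(x)
    interim = [0] * (n + 1)
    carry = [0] * (n + 1)  # carry[i] is the carry entering position i
    for i in range(n):
        p = x[i] + y[i]
        # If both lower digits are non-negative, the incoming carry is 0 or 1;
        # otherwise it is 0 or -1. Pick the interim digit so no overflow occurs.
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if p == 2:
            c, s = 1, 0
        elif p == -2:
            c, s = -1, 0
        elif p == 1:
            c, s = (1, -1) if lower_nonneg else (0, 1)
        elif p == -1:
            c, s = (0, -1) if lower_nonneg else (-1, 1)
        else:
            c, s = 0, 0
        interim[i] = s
        carry[i + 1] = c
    return [s + c for s, c in zip(interim, carry)]


def rb_adder_tree(operands, bits):
    """Reduce signed `bits`-wide operands pairwise, level by level."""
    level = [twos_to_rb(v, bits) for v in operands]
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level) - 1, 2):
            nxt.append(rb_add(level[j], level[j + 1]))
        if len(level) % 2:
            # Pass an unpaired operand through, padded to the new width.
            nxt.append(level[-1] + [0])
        level = nxt
    return level[0]
```

For example, `rb_to_int(rb_adder_tree([5, -3, 127, -128, 17], 8))` evaluates to 18. The final `rb_to_int` call corresponds to the conversion back to 2's complement, which is the only step that needs carry propagation.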