RISC-V TensorCore for Edge AI

Submitted in partial fulfillment of the requirements of the degree of BACHELOR OF ENGINEERING in COMPUTER ENGINEERING

By Group No: 4
1902067 Kaku Jay Sushil
1902072 Khan Mohd Hamza Rafique
1902080 Kotadia Hrishit Jayantilal
1902114 Narwani Trushant Sunil

Guide: DR. TANUJA SARODE (Professor, Department of Computer Engineering, TSEC)

Computer Engineering Department
Thadomal Shahani Engineering College
University of Mumbai
2022-2023

CERTIFICATE

This is to certify that the project entitled "RISC-V TensorCore for Edge AI" is a bonafide work of
1902067 Kaku Jay Sushil
1902072 Khan Mohd Hamza Rafique
1902080 Kotadia Hrishit Jayantilal
1902114 Narwani Trushant Sunil
submitted to the University of Mumbai in partial fulfillment of the requirement for the award of the degree of "BACHELOR OF ENGINEERING" in "COMPUTER ENGINEERING".

Dr. Tanuja Sarode (Guide)
Dr. Tanuja Sarode (Head of Department)
Dr. G. T. Thampi (Principal)

Project Report Approval for B.E.

Project report entitled RISC-V TensorCore for Edge AI by
1902067 Kaku Jay Sushil
1902072 Khan Mohd Hamza Rafique
1902080 Kotadia Hrishit Jayantilal
1902114 Narwani Trushant Sunil
is approved for the degree of "BACHELOR OF ENGINEERING" in "COMPUTER ENGINEERING".

Examiners
1.
2.
Date:
Place:

Declaration

We declare that this written submission represents our ideas in our own words and, where others' ideas or words have been included, we have adequately cited and referenced the original sources. We also declare that we have adhered to all principles of academic honesty and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in our submission. We understand that any violation of the above will be cause for disciplinary action by the Institute and can also evoke penal action from the sources which have thus not been properly cited or from whom proper permission has not been taken when needed.

1) ________________________________ Kaku Jay Sushil - 1902067
2) ________________________________ Khan Mohd Hamza Rafique - 1902072
3) ________________________________ Kotadia Hrishit Jayantilal - 1902080
4) ________________________________ Narwani Trushant Sunil - 1902114
Date:

Abstract

Artificial Intelligence (AI) has become an integral part of various industries, and deep learning models have shown remarkable performance in tasks such as image recognition, natural language processing, and speech recognition. However, training these models poses significant challenges, including the need for large amounts of high-quality data, high computational resources, and interpretability of the models. To address these challenges, the use of Field-Programmable Gate Arrays (FPGAs) in combination with RISC-V Tensorflow, an open-source deep learning framework optimized for RISC-V processors, has gained significant attention. FPGAs offer high-speed and low-latency performance for AI computations, making them well-suited for training deep learning models. Additionally, FPGAs can be programmed and customized for specific AI workloads, improving efficiency and reducing energy consumption. The combination of RISC-V Tensorflow and FPGAs can significantly accelerate AI model training, reducing the time and resources required for training, while also providing greater transparency and interpretability of the models. FPGAs can also support high-throughput data processing, enabling real-time processing of data in applications like autonomous vehicles and robotics.
Furthermore, FPGAs can be used to create specialized accelerators for specific AI workloads, such as convolutional or recurrent neural networks, to achieve better performance and efficiency. This customization helps to reduce the reliance on expensive and energy-intensive general-purpose processors and accelerators. In conclusion, the use of RISC-V Tensorflow and FPGAs presents a promising solution to the challenges faced in AI training, including high computational requirements, interpretability, and efficiency. As AI models continue to grow in complexity, the use of these technologies is expected to become more widespread, leading to significant advancements in the field of AI.

Table of Contents

List of Figures
List of Tables
Chapter 1 Introduction
 1.1 Introduction
 1.2 Problem Statement and Objectives
 1.3 Scope
Chapter 2 Review of Literature
 2.1 Domain Explanation
 2.2 Review of Existing Systems
 2.3 Limitations of Existing System/Research Gaps
Chapter 3 Proposed System
 3.1 Design Details
 3.2 Methodology
Chapter 4 Implementation Details
 4.1 Experimental Setup
 4.2 Software and Hardware Setup
Chapter 5 Results and Discussion
 5.1 Performance Evaluation Parameters
 5.2 Implementation Results
 5.3 Results Discussion
Chapter 6 Conclusion and Future Work
References
Acknowledgement

List of Figures

Figure No. and Description
3.1 DIT Radix-2 Butterfly
3.2 Radix-2 decimation in time 32-point FFT
3.3 Basic DSP48E1 Slice Functionality
3.4 7 Series FPGA DSP48E1 Slice
3.5 I/O to the FFT compute unit or core
3.6 AXI4-Stream Handshake
3.7 Pipelining Paradigm
3.8 Dependency Example
3.9 Scalar Dependency
4.1 Audio signal decomposed into its frequency components using FFT
5.1 Scaled FFT Dataflow

List of Tables

Table No. and Description
3.1 Equations for real and imaginary parts of P' and Q'
3.2 Fixed-Point Identifier Summary
5.1 Performance Estimation Latency Details
5.2 Timing Estimates
5.3 Latency Estimates
5.4 Utilization Estimates
5.5 Resource Usage Implementation Estimates
5.6 Final Timing Implementation Estimates
5.7 RTL/Co-Simulation Performance Estimates

Chapter 1 Introduction

1.1 Introduction

The growth of computing power has been exponential over the past few decades, and this trend is likely to continue in the foreseeable future. The increasing demand for more computing power is driven by a wide range of applications, from scientific research to business analytics, gaming, and artificial intelligence. The growth of computing power has led to the development of advanced processors with specialized hardware units that can accelerate the performance of specific tasks. One such hardware unit is the convolution engine, which is commonly found in processors used in signal processing and machine learning applications. Convolution is a mathematical operation that is widely used in signal processing, image processing, and machine learning. It combines a signal with a second function, known as a kernel or filter: for discrete signals, each output sample is a weighted sum of input samples, with the kernel supplying the weights. Convolution can be thought of as a way to extract meaningful features from data by applying a set of transformations that highlight certain patterns or characteristics. Convolution engines are found in a wide range of processors used in signal processing and machine learning applications, including CPUs, GPUs, and specialized accelerators such as FPGAs and ASICs.
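To make the operation concrete, the short sketch below shows a direct 1-D discrete convolution, the multiply-accumulate pattern that a convolution engine accelerates in hardware. It is an illustrative example only; the function name and array layout are assumptions and are not part of the implemented design.

// Direct 1-D discrete convolution: y[n] = sum over k of h[k] * x[n - k].
// x has xLen samples, h has hLen filter taps, y must have room for xLen + hLen - 1 outputs.
void convolve(const float x[], int xLen, const float h[], int hLen, float y[]) {
    for (int n = 0; n < xLen + hLen - 1; n++) {
        float acc = 0.0f;
        for (int k = 0; k < hLen; k++) {
            if (n - k >= 0 && n - k < xLen) {
                acc += h[k] * x[n - k]; // multiply-accumulate: the core operation a convolution engine accelerates
            }
        }
        y[n] = acc;
    }
}
// Example: a 3-tap averaging kernel h = {1/3, 1/3, 1/3} acts as a simple blurring (smoothing)
// filter, while a kernel such as {-1, 0, 1} responds to edges in the input signal.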
Convolution engines are programmable processors specialized for the convolution-like data flow prevalent in computational photography, computer vision, and image processing [1]. The implementation of convolution engines varies depending on the processor architecture and the specific use case, but they all share the goal of accelerating the performance of convolution operations. One example of convolution in real life is in image processing, where convolution is used to enhance or filter images. For example, edge detection filters are commonly used to detect the edges of objects in an image, and blurring filters are used to remove noise or smooth out an image.

1.2 Problem Statement & Objectives

The number of convolution operations required by modern workloads has not scaled linearly with the computing power available today. This leaves a need for domain-specific and application-specific digital designs that can handle the computation efficiently and maintain high throughput in significantly less time with minimal overhead, at the cost of being slightly less precise while still giving reliable results. This is supported by the data that a CPU takes around 70 pJ to accomplish the same task, whereas an ASIC would take less than 1 pJ [2]. The principal objective of this project is to design and develop an FPGA-based softcore co-processor design coupled with a RISC-V core. To meet these demands, hardware implementations of the Fast Fourier Transform (FFT) algorithm have become crucial. In particular, a 32-point fixed-point FFT core is needed to efficiently process digital signals in real-time applications. However, designing an efficient 32-point fixed-point FFT core poses several challenges. The first challenge is to ensure that the core meets the required performance specifications, such as processing speed, power consumption, and area utilization. The second challenge is to ensure the accuracy of the fixed-point arithmetic used in the core, as any errors in the arithmetic computations can lead to significant degradation in the quality of the processed signals. Furthermore, the design of the core must also take into account the need for flexibility and scalability, as different applications may require varying FFT sizes. Managing the bit growth of the FFT computation is yet another challenge that should be addressed.

1.3 Scope

The scope of the project consists of essential aspects of hardware programming, viz. design, verification, testing, optimizing, and prototyping. The principal objective of this project is to design and develop an FPGA-based softcore co-processor design coupled with a RISC-V core. The project is meant to utilize a System on a Chip that comprises an FPGA to generate the FFT output for accelerating signal processing, which could be extended to general matrix convolution tasks and helps understand the trade-offs better when interfaced with AI applications. While various steps of the sample synthesis and processing may be offloaded to the FPGA to utilize its inherent parallelism, the post-processing compute would be handled by the SoC with minimal overhead and higher efficiency. Design involves creating a blueprint for the hardware system considering factors such as power consumption, cost, and ease of manufacturing. Verification ensures the hardware design meets requirements and specifications, identifying potential issues before manufacturing and reducing costs and time to market.
Testing identifies and fixes bugs and errors in the hardware design: functional testing ensures the intended function, while non-functional testing evaluates performance under different conditions. Optimizing improves performance, power consumption, and cost-effectiveness by identifying areas for improvement such as resource usage, power consumption, and cost reduction. Prototyping builds physical prototypes to test hardware performance, identifying design issues and refining the design before mass production.

Chapter 2 Review of Literature

2.1 Domain Explanation

Digital Signal Processing (DSP) refers to the computation of mathematically intensive algorithms applied to data signals, such as audio signal manipulation, video compression, data coding/decoding and digital communications [3]. It involves transforming signals from the time domain to the frequency domain using techniques such as Fourier analysis, and then applying signal processing techniques to achieve various functions. These functions can include filtering, smoothing, and modulation, among others. In DSP, signals are typically represented as a sequence of discrete values, and algorithms are used to manipulate these values. DSP techniques are applied to a wide range of signals, including audio, video, images, and control signals [4].

In the context of audio signals, DSP techniques can be used for tasks such as filtering to remove noise or unwanted frequencies, equalization to adjust the tonal balance, and compression to reduce the size of audio files for storage or transmission purposes. DSP is also commonly used in audio effects processing, such as reverb, chorus, and modulation effects, to create various sound effects. For video signals, DSP techniques are used for tasks such as video compression to reduce the amount of data needed to represent a video, image processing for tasks like image enhancement, and video analysis for tasks like motion detection and tracking. DSP is also used in video encoding and decoding, where it plays a crucial role in compressing and decompressing video data for efficient storage and transmission. In the field of image processing, DSP techniques are used for tasks such as image filtering to remove noise or enhance details, image compression for efficient storage and transmission, and image recognition for tasks like object detection and facial recognition. DSP is also used in medical imaging for tasks like image reconstruction, image enhancement, and image analysis for diagnostic purposes.

Control signals, which are used to manage and regulate the behavior of a system, are another important application of DSP. Control signals can be used in various engineering and automation applications to adjust system parameters, monitor system behavior, and achieve desired system performance. DSP techniques are used in the analysis and processing of control signals to design efficient control algorithms, optimize system behavior, and minimize control signal overhead. Overall, DSP techniques are incredibly versatile and can be applied to a wide range of signals, including audio, video, images, and control signals. The discrete representation of signals in DSP allows for efficient processing and analysis using mathematical algorithms, making it a powerful tool in various fields such as telecommunications, multimedia processing, medical imaging, and control systems. Waves are a type of signal that carry energy and propagate through space or a medium.
They can be classified as mechanical, electromagnetic, or quantum-mechanical based on their nature. Waves have a specific frequency, wavelength, and amplitude, and these properties can be used to analyze and manipulate them using DSP techniques. Control signal overhead refers to the additional computational and processing resources required to implement control signals in a system. Implementing control signals can add overhead to a system in several ways [5]. For example, the controller itself requires processing power to generate control signals and monitor the system's behavior. Additionally, the extra input signals needed to generate control signals can increase the complexity and cost of the system's hardware and software. Moreover, the process of measuring the system's behavior and generating control signals can introduce delays, which can impact the system's performance. Managing control signal overhead is an important consideration in system design and implementation. It requires careful optimization of the control signal generation process, efficient hardware and software design, and minimizing delays in the measurement and control loop. DSP techniques can be used to analyze and optimize control signals, as well as to design efficient control algorithms that minimize overhead while achieving the desired system behavior.

2.2 Review of Existing Systems

Static Quantized Radix-2 FFT/IFFT processors have been designed to conduct constraint analysis. Amongst the major setbacks associated with such high-resolution FFT processors is the high power consumption resulting from the structural complexity and computational inefficiency of floating-point calculations. As such, a parallel pipelined architecture was proposed to statically scale the resolution of the processor to suit adequate trade-off constraints. Quantization was applied to provide an approximation that addresses the finite word-length constraints of digital signal processing (DSP) [6]. One approach to mitigate these issues is to use a parallel pipelined architecture, which allows for efficient processing of FFT and IFFT operations. This architecture is designed to statically scale the resolution of the processor to suit trade-off constraints adequately. By using a parallel pipelined architecture, the processing tasks are divided into smaller tasks that can be processed in parallel, resulting in improved computational efficiency. In addition to the parallel pipelined architecture, quantization is applied to provide an approximation to address the finite word-length constraints of digital signal processing (DSP). Quantization involves rounding or truncating the values of signals or coefficients to a fixed number of bits, resulting in a reduced-precision representation of the original data. This quantization process helps to reduce the computational complexity and memory requirements of the processor, which in turn reduces power consumption. The use of quantization in FFT processors allows for a trade-off between computational efficiency and precision. Coarser quantization results in lower precision but also reduces power consumption and computational complexity. On the other hand, finer quantization results in higher precision but may increase power consumption and computational complexity.
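As a small, illustrative sketch of the quantization idea discussed above (not taken from any of the reviewed processors; the function name and word length are assumptions), the following routine rounds a real coefficient into a 16-bit fixed-point word. Keeping fewer fractional bits makes the arithmetic cheaper but increases the rounding error.

#include <cmath>
#include <cstdint>

// Quantize a real coefficient to a signed fixed-point word with fracBits fractional bits.
int16_t quantize(double value, int fracBits) {
    double scaled = value * std::pow(2.0, fracBits); // move the binary point
    return (int16_t)std::lround(scaled);             // round to the nearest representable value
}

// Example: quantize(0.70710678, 14) == 11585, i.e. cos(pi/4) stored with 14 fractional bits;
// the small representation error (about 1.5e-5 here) is the precision traded away for smaller,
// lower-power arithmetic. Truncating instead of rounding saves logic but increases the
// worst-case error.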
Overall, the use of Static Quantized Radix-2 FFT/IFFT processors with a parallel pipelined architecture and quantization techniques provides an effective solution to address the constraints associated with high-resolution FFT processors, such as high power consumption, structural complexity, and computational inefficiency. These approaches allow for efficient trade-offs between resolution, computational efficiency, and power consumption in DSP applications, making them valuable tools in various fields, including telecommunications, multimedia processing, and control systems. Hardware accelerators such as the GreenWaves Technologies GAP8 [7] and Esperanto ETSoC-1 are designed to provide high-performance AI inference in energy-constrained environments. These systems use RISC-V cores and integrated TensorCore units to perform complex operations such as convolution, pooling, and activation functions. Overall, the existing systems related to RISC-V TensorCore for Edge AI offer promising solutions for low-power, high-performance AI computing in resource-constrained environments. With continued research and development, these systems are expected to play an increasingly important role in enabling the next generation of edge AI applications. 2.3 Limitations of Existing System/Research Gaps Existing systems that utilize RISC-V Tensor Cores for edge AI applications face several limitations. One significant limitation is the power consumption of these systems because of their floating-point units, which can be very power-hungry. This is because floating-point arithmetic requires more computational resources and precision than fixed-point arithmetic [8]. Furthermore, the time consumed to pivot or reconfigure the system can be significant, particularly in edge AI applications where real-time responsiveness is critical. This can be especially problematic in situations where the edge AI system needs to adjust quickly to changes in the input data or the environment. The need for pivoting or reconfiguration can also increase the complexity and cost of the system, as it may require additional hardware, software, or human intervention. These limitations can make it challenging to use RISC-V Tensor Cores for edge AI applications, especially in scenarios where power consumption, flexibility, and responsiveness are critical factors. To address these limitations, researchers and engineers are exploring alternative approaches to edge AI, such as using more efficient data representations, reducing the precision of the arithmetic used, and exploring novel hardware architectures that are more flexible and reprogrammable. Additionally, advances in machine learning algorithms and software frameworks may also help to reduce the computational requirements of edge AI applications, making them more feasible for deployment on resource-constrained devices. Chapter 3 Proposed System 3.1 Design Details FFT (Fast Fourier Transform) is an algorithm used to efficiently compute the discrete Fourier transform (DFT) of a sequence of complex data points. The DFT is a mathematical transformation that converts a signal from the time domain to the frequency domain, revealing the underlying frequency components of the signal. The FFT algorithm takes advantage of the symmetry and periodicity properties of the DFT to reduce the number of computations required to compute the transform. 
The basic idea is to recursively break down the input sequence into smaller and smaller sub-sequences until each sub-sequence consists of just two data points. Then, by applying a series of mathematical operations to these smaller sub-sequences, the FFT algorithm computes the DFT of the original sequence. The efficiency of the FFT algorithm makes it an essential tool in many applications, especially in real-time signal processing, where the computational complexity of the DFT can be a bottleneck. By using the FFT algorithm, the DFT can be computed much faster, allowing for real-time processing of signals.

In the FFT algorithm, bit-reversal refers to the process of reordering the input data points in a way that makes it possible to perform the required computations in a more efficient manner. To compute the bit-reversal of an index value, we need to reverse the order of the binary digits of the index value. For example, consider the sequence of index values {0, 1, 2, 3, 4, 5, 6, 7}. The binary representations of these values are {000, 001, 010, 011, 100, 101, 110, 111}. Reversing the binary digits of these index values gives the new sequence {0, 4, 2, 6, 1, 5, 3, 7}. The bit-reversal step is essential in the FFT algorithm as it enables the computation of the DFT in a more efficient manner. By reordering the input data points, the algorithm can perform the required computations in a more optimal way, reducing the number of operations required and improving the overall performance of the algorithm (a short illustrative routine for this index computation appears below).

The FFT arithmetic [9] is broadly divided into two types: decimation-in-time (DIT) and decimation-in-frequency (DIF). The radix-2 DIT FFT is adopted in this work. An N-point discrete Fourier transform (DFT) of the input sequence x(n) is written as

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \quad \text{where } W_N = e^{-j 2\pi / N}    (1)

Using radix-2 DIT, x(n) in (1) can be divided into its even and odd parts; taking advantage of the periodicity and symmetry of W_N we obtain

X(k) = \sum_{n=0}^{N/2-1} x(2n) W_{N/2}^{nk} + W_N^{k} \sum_{n=0}^{N/2-1} x(2n+1) W_{N/2}^{nk}    (2)

The Radix-2 Butterfly is illustrated in Figure 3.1. In each butterfly structure, two complex inputs P and Q are operated upon and become complex outputs P' and Q'. Complex multiplication is performed on Q and the twiddle factor, then the product is added to and subtracted from input P to form outputs P' and Q'. The exponent of the twiddle factor W_N^k depends on the stage and group of its butterfly. The butterfly is usually represented by its flow graph, which looks like a butterfly.

Figure 3.1 DIT Radix-2 Butterfly

The mathematical meaning of this butterfly is shown in Table 3.1 with separate equations for real and imaginary parts.

Table 3.1 Equations for real and imaginary parts of P' and Q'
P' = P + Q * W:
  Real part: Pr' = Pr + (Qr * Wr - Qi * Wi)
  Imaginary part: Pi' = Pi + (Qr * Wi + Qi * Wr)
Q' = P - Q * W:
  Real part: Qr' = Pr - (Qr * Wr - Qi * Wi)
  Imaginary part: Qi' = Pi - (Qr * Wi + Qi * Wr)

Figure 3.2 Radix-2 decimation in time 32-point FFT

Communication interface: The interface communication protocol was chosen to be the AXI-Stream (AXIS) protocol, as the arrival of data is sequential and in batches. This makes the traditional bi-directional address querying model slower in terms of implementation. AXIS helps move a block of data quickly between the producer and consumer. The implementation details and the protocol are described in depth in the forthcoming subsection.
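Before moving on to the core itself, the short routine below illustrates the bit-reversed index computation described above. It is an illustrative sketch only; the function name is an assumption and this is not the exact logic used inside the core.

// Reverse the lowest 'bits' binary digits of index i (bits = log2(N)).
unsigned bit_reverse(unsigned i, unsigned bits) {
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u); // shift the result left and append the next LSB of i
        i >>= 1;
    }
    return r;
}
// For N = 8 (bits = 3) the indices 0..7 map to 0, 4, 2, 6, 1, 5, 3, 7, the ordering shown above.
// For the 32-point core (bits = 5) the same rule orders the input samples before the first stage.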
FFT core: The FFT core is selected to have a 16-bit real and imaginary fixed-point two's complement representation, with 8 bits dedicated to the signed integer part and the remaining 8 bits to the fractional part, making it suitable for applications requiring both integer and fractional computation for real-time radix-2 FFT analysis. The block size was decided to be 32, as it achieves a frequency resolution of approximately 1378 Hz (44100 / 32) for audio sampled at 44.1 kHz. Hence the total number of stages is log2(32) = 5. On each pass, the algorithm performs Radix-2 butterflies, where each butterfly picks up two complex numbers and returns two complex numbers to the same memory. The numbers returned to memory by the core are potentially larger than the numbers picked up from memory. A strategy must be employed to accommodate this dynamic range expansion. A full explanation of scaling strategies and their implications is beyond the scope of this document; for more information about this topic, see A Simple Fixed-Point Error Bound for the Fast Fourier Transform [10].

For Radix-2, the growth is by a factor of up to 1 + \sqrt{2} \approx 2.414. This implies a bit growth of up to 2 bits. This bit growth can be handled in three ways:
• Performing the calculations with no scaling and carrying all significant integer bits to the end of the computation
• Scaling at each stage using a fixed-scaling schedule
• Scaling automatically using block floating-point

All significant integer bits are retained when using full-precision unscaled arithmetic. The width of the datapath increases to accommodate the bit growth through the butterfly. The fractional bits created by the multiplication are truncated (or rounded) after the multiplication. The width of the output is (input width + log2(transform length) + 1). This accommodates the worst-case scenario for bit growth. Here, we use scaling at each stage with a fixed scaling schedule. When using scaling, a scaling schedule is used to divide by a factor of 1, 2, 4, or 8 in each stage. If scaling is insufficient, a butterfly output might grow beyond the dynamic range and cause an overflow. As a result of the scaling applied in the FFT implementation, the transform computed is a scaled transform. The scale factor s is defined as

s = 2^{\sum_{i} b_i}

where b_i is the scaling (specified in bits) applied in stage i. The scaling results in the final output sequence being modified by the factor 1/s. For the forward FFT, the output sequence X'(k), k = 0, \ldots, N-1 computed by the core is defined as

X'(k) = \frac{1}{s} \sum_{n=0}^{N-1} x(n) e^{-j 2\pi nk / N}

If a Radix-2 algorithm scales by a factor of 2 (one right shift in hardware) in each stage, the factor 1/s is equal to the factor 1/N in the inverse FFT equation.
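As a simplified software model of one such scaled butterfly (an illustrative sketch assuming a plain integer Q8.8 representation and a divide-by-2 schedule at every stage; the type and function names are not taken from the actual core), the stage computation can be written as:

#include <cstdint>

typedef int16_t q8_8; // Q8.8: 8 signed integer bits, 8 fractional bits, as chosen for the core

struct cplx { q8_8 re, im; };

// Multiply two Q8.8 values: the 32-bit product has 16 fractional bits, shift back by 8.
static inline q8_8 mul_q8_8(q8_8 a, q8_8 b) {
    return (q8_8)(((int32_t)a * b) >> 8);
}

// One radix-2 DIT butterfly (Table 3.1) with a fixed scale of 1/2 per stage
// (one right shift) to absorb the worst-case bit growth of that stage.
void butterfly(cplx &P, cplx &Q, cplx W) {
    q8_8 tr = mul_q8_8(Q.re, W.re) - mul_q8_8(Q.im, W.im); // Re(Q * W)
    q8_8 ti = mul_q8_8(Q.re, W.im) + mul_q8_8(Q.im, W.re); // Im(Q * W)
    cplx Pn, Qn;
    Pn.re = (q8_8)((P.re + tr) >> 1);  Pn.im = (q8_8)((P.im + ti) >> 1); // P' = (P + Q*W) / 2
    Qn.re = (q8_8)((P.re - tr) >> 1);  Qn.im = (q8_8)((P.im - ti) >> 1); // Q' = (P - Q*W) / 2
    P = Pn;
    Q = Qn;
}
// Over the log2(32) = 5 stages this schedule gives an overall scale factor s = 2^5 = 32,
// so the computed transform is X'(k) = X(k) / 32, as described above.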
3.2 Methodology

FPGAs (Field Programmable Gate Arrays) offer unique advantages in terms of parallelism that can be exploited to accelerate computationally intensive tasks. FPGA-based parallelism can provide a high degree of flexibility, efficiency, and performance compared to other hardware accelerators. Moreover, FPGAs are synchronous hardware with a jitter of less than one clock cycle, and they are not affected by the rather complex behavior of operating system services, interrupt handling, etc. Due to the physical parallelism, the processes do not influence each other [11].

One of the primary ways to exploit FPGA parallelism is by utilizing its reconfigurable hardware resources, which can be customized to fit the specific requirements of the application. This allows for the creation of highly optimized, parallel hardware designs that can process data at high speeds. Additionally, FPGAs can be programmed using specialized languages such as Verilog or VHDL, which enable fine-grained control over the hardware design.

COMPUTATION OF TRIGONOMETRIC FUNCTIONS FOR TWIDDLE FACTORS IN HARDWARE

Existing solutions use a CORDIC algorithm to compute sin and cos; here the computation is instead mapped onto DSP slices, which are faster and more efficient because they are hard-etched on the FPGA chip. This makes the design vendor-specific, in our case Xilinx; the DSP slice used is the DSP48E1 version of the Xilinx IP. FPGAs are efficient for digital signal processing (DSP) applications because they can implement custom, fully parallel algorithms. DSP applications use many binary multipliers and accumulators that are best implemented in dedicated DSP slices. All 7 series FPGAs have many dedicated, full-custom, low-power DSP slices, combining high speed with small size while retaining system design flexibility. The DSP slices enhance the speed and efficiency of many applications beyond digital signal processing, such as wide dynamic bus shifters, memory address generators, wide bus multiplexers, and memory-mapped I/O registers. The basic functionality of the DSP48E1 slice is shown in Figure 3.3.

Figure 3.3 Basic DSP48E1 Slice Functionality

Some highlights of the DSP functionality include:
• 25 × 18 two's-complement multiplier:
  - Dynamic bypass
• 48-bit accumulator:
  - Can be used as a synchronous up/down counter
• Power saving pre-adder:
  - Optimizes symmetrical filter applications and reduces DSP slice requirements
• Single-instruction-multiple-data (SIMD) arithmetic unit:
  - Dual 24-bit or quad 12-bit add/subtract/accumulate
• Optional logic unit:
  - Can generate any one of ten different logic functions of the two operands
• Pattern detector:
  - Convergent or symmetric rounding
  - 96-bit-wide logic functions when used in conjunction with the logic unit
• Advanced features:
  - Optional pipelining and dedicated buses for cascading

Figure 3.4 7 Series FPGA DSP48E1 Slice

The DSP slice consists of a multiplier followed by an accumulator. At least three pipeline registers are required for both multiply and multiply-accumulate operations to run at full speed. The multiply operation in the first stage generates two partial products that need to be added together in the second stage. When only one or two registers exist in the multiplier design, the M register should always be used to save power and improve performance. Add/Sub and Logic Unit operations require at least two pipeline registers (input, output) to run at full speed. The cascade capabilities of the DSP slice are extremely efficient at implementing high-speed pipelined filters built on the adder cascades instead of adder trees. Multiplexers are controlled with dynamic control signals, such as OPMODE, ALUMODE, and CARRYINSEL, enabling a great deal of flexibility. Designs using registers and dynamic opmodes are better equipped to take advantage of the DSP slice capabilities than combinatorial multiplies. In general, the DSP slice supports both sequential and cascaded operations due to the dynamic OPMODE and cascade capabilities.
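To illustrate how this multiply-followed-by-accumulate structure is used from HLS (a hedged sketch with assumed type widths and names, not code from the project), a pipelined multiply-accumulate loop such as the following is the kind of pattern the tool typically maps onto a DSP48E1 slice:

#include <ap_fixed.h>

typedef ap_fixed<16, 8> data_t;  // Q8.8 samples and coefficients, matching the core's data format
typedef ap_fixed<48, 32> acc_t;  // wide accumulator, mirroring the slice's 48-bit accumulator

// 32-tap dot product: one multiply feeding an accumulator per iteration, the
// multiply-accumulate pattern the DSP48E1's multiplier and accumulator implement.
acc_t mac32(const data_t x[32], const data_t h[32]) {
    acc_t acc = 0;
MAC_LOOP:
    for (int i = 0; i < 32; i++) {
#pragma HLS PIPELINE II=1
        acc += x[i] * h[i]; // with pipeline registers, one MAC per clock once the pipeline fills
    }
    return acc;
}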
Fast Fourier Transforms (FFTs), floating-point computation (multiply, add/sub, divide), counters, and large bus multiplexers are some applications of the DSP slice. Additional capabilities of the DSP slice include synchronous resets and clock enables, dual A input pipeline registers, pattern detection, Logic Unit functionality, single instruction/multiple data (SIMD) functionality, and MACC and Add-Acc extension to 96 bits. The DSP slice supports convergent and symmetric rounding, terminal count detection and auto-resetting for counters, and overflow/underflow detection for sequential accumulators. ALU functions are identical in the 7 series FPGA DSP48E1 slice and the Virtex-6 FPGA DSP48E1 slice.

AXI4-STREAM INTERFACE

An AXI4-Stream interface can be applied to any input argument and any array or pointer output argument. Because an AXI4-Stream interface transfers data in a sequential streaming manner, it cannot be used with arguments that are both read and written. In terms of data layout, the data type of the AXI4-Stream is aligned to the next byte. For example, if the size of the data type is 12 bits, it will be extended to 16 bits. Depending on whether a signed/unsigned interface is selected, the extended bits are either sign-extended or zero-extended. If the stream data type is a user-defined struct, the struct is aggregated and aligned to the size of the largest data element within the struct. As shown in Figure 3.5, AXI4-Stream is the communication protocol being used.

Figure 3.5 I/O to the FFT compute unit or core

The following code examples show how the packed alignment depends on your struct type. If the struct contains only char types, as shown in the following example, then it will be packed with an alignment of one byte. The total size of the struct will be two bytes:

struct A {
    char foo;
    char bar;
};

However, if the struct has elements with different data types, as shown below, then it will be packed and aligned to the size of the largest data element, or four bytes in this example. Element bar will be padded with three bytes, resulting in a total size of eight bytes for the struct:

struct A {
    int foo;
    char bar;
};

The AXI4-Stream interface is implemented as a struct type in Vitis HLS and has the following signature (defined in ap_axi_sdata.h):

template <typename T, size_t WUser, size_t WId, size_t WDest>
struct axis { .. };

Where:
T: Stream data type
WUser: Width of the TUSER signal
WId: Width of the TID signal
WDest: Width of the TDEST signal

When the stream data type (T) is a simple integer type, there are two predefined types of AXI4-Stream implementations available:
• A signed implementation of the AXI4-Stream class: hls::axis<ap_int<WData>, WUser, WId, WDest> (or more simply ap_axis<WData, WUser, WId, WDest>)
• An unsigned implementation of the AXI4-Stream class: hls::axis<ap_uint<WData>, WUser, WId, WDest> (or more simply ap_axiu<WData, WUser, WId, WDest>)

The value specified for the WUser, WId, and WDest template parameters controls the usage of side-channel signals in the AXI4-Stream interface. When the hls::axis class is used, the generated RTL will typically contain the actual data signal TDATA, and the following additional signals: TVALID, TREADY, TKEEP, TSTRB, TLAST, TUSER, TID, and TDEST. TVALID, TREADY, and TLAST are necessary control signals for the AXI4-Stream protocol. TKEEP, TSTRB, TUSER, TID, and TDEST signals are special signals that can be used to pass around additional bookkeeping data.

How AXI4-Stream Works

AXI4-Stream is a protocol designed for transporting arbitrary unidirectional data. In an AXI4-Stream, TDATA-width bits are transferred per clock cycle. The transfer is started once the producer sends the TVALID signal and the consumer responds by sending the TREADY signal (once it has consumed the initial TDATA). At this point, the producer will start sending TDATA and TLAST (TUSER if needed to carry additional user-defined sideband data). TLAST signals the last byte of the stream, so the consumer keeps consuming the incoming TDATA until TLAST is asserted.

Figure 3.6 AXI4-Stream Handshake

AXI4-Stream has additional optional features like sending positional data with the TKEEP and TSTRB ports, which makes it possible to multiplex both the data position and the data itself on the TDATA signal. Using the TID and TDEST signals, you can route streams, as these fields roughly correspond to a stream identifier and a stream destination identifier [12].
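A minimal sketch of this handshake from the HLS side is shown below (illustrative only; the pass_through function and the 32-bit TDATA width are assumptions, and this is not the FFT core itself). The hls::stream of ap_axis words hides TVALID/TREADY, and the body forwards TDATA while watching TLAST to find the end of the frame.

#include "ap_axi_sdata.h"
#include "hls_stream.h"

typedef ap_axis<32, 0, 0, 0> axis_word; // 32-bit TDATA; zero-width TUSER/TID/TDEST side channels
                                        // (accepted by recent Vitis HLS versions)

// Copy one TLAST-delimited frame from the input stream to the output stream.
void pass_through(hls::stream<axis_word> &in, hls::stream<axis_word> &out) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE ap_ctrl_none port=return
    axis_word w;
    do {
#pragma HLS PIPELINE II=1
        w = in.read();  // blocks until the producer asserts TVALID; TREADY is driven by the tool
        out.write(w);   // TDATA, TKEEP/TSTRB and TLAST are forwarded unchanged
    } while (!w.last);  // TLAST marks the final transfer of the frame
}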
PRECISION FIXED-POINT DATA TYPES

Fixed-point data types model the data as integer and fractional bits. In this example, the Vitis HLS ap_fixed type is used to define an 18-bit variable with 6 bits representing the value above the binary point and 12 bits representing the value below the binary point. The variable is specified as signed and the quantization mode is set to round to plus infinity. Because the overflow mode is not specified, the default wrap-around mode is used for overflow.

#include <ap_fixed.h>
...
ap_fixed<18,6,AP_RND> my_type;
...

When performing calculations where the variables have different numbers of bits or different precision, the binary point is automatically aligned. The behavior of the C++ simulations performed using fixed-point types matches the resulting hardware. This allows you to analyze the bit-accurate, quantization, and overflow behaviors using fast C-level simulation. Fixed-point types are a useful replacement for floating-point types, which require many clock cycles to complete. Unless the entire range of the floating-point type is required, the same accuracy can often be achieved with a fixed-point type, resulting in smaller and faster hardware. A summary of the ap_fixed type identifiers is provided in the following table.

Table 3.2 Fixed-Point Identifier Summary

W: Word length in bits.

I: The number of bits used to represent the integer value, that is, the number of integer bits to the left of the binary point. When this value is negative, it represents the number of implicit sign bits (for signed representation), or the number of implicit zero bits (for unsigned representation), to the right of the binary point. For example:
ap_fixed<2, 0> a = -0.5;           // a can be -0.5
ap_ufixed<1, 0> x = 0.5;           // 1-bit representation. x can be 0 or 0.5
ap_ufixed<1, -1> y = 0.25;         // 1-bit representation. y can be 0 or 0.25
const ap_fixed<1, -7> z = 1.0/256; // 1-bit representation for z = 2^-8

Q: Quantization mode. This dictates the behavior when greater precision is generated than can be defined by the smallest fractional bit in the variable used to store the result.
The quantization modes are:
AP_RND: Round to plus infinity
AP_RND_ZERO: Round to zero
AP_RND_MIN_INF: Round to minus infinity
AP_RND_INF: Round to infinity
AP_RND_CONV: Convergent rounding
AP_TRN: Truncation to minus infinity (default)
AP_TRN_ZERO: Truncation to zero

O: Overflow mode. This dictates the behavior when the result of an operation exceeds the maximum (or minimum in the case of negative numbers) possible value that can be stored in the variable used to store the result. The overflow modes are:
AP_SAT: Saturation
AP_SAT_ZERO: Saturation to zero
AP_SAT_SYM: Symmetrical saturation
AP_WRAP: Wrap around (default)
AP_WRAP_SM: Sign magnitude wrap around

N: This defines the number of saturation bits in overflow wrap modes.

The default maximum width allowed for ap_[u]fixed data types is 1024 bits. This default may be overridden by defining the macro AP_INT_MAX_W with a positive integer value less than or equal to 32768 before inclusion of the ap_int.h header file. The following is an example of overriding AP_INT_MAX_W:

#define AP_INT_MAX_W 4096 // Must be defined before next line
#include "ap_fixed.h"
ap_fixed<4096, 2048> very_wide_var; // 4096-bit word with 2048 integer bits

Arbitrary precision data types are highly recommended when using Vitis HLS. As shown in the earlier example, they typically have a significant positive benefit on the quality of the hardware implementation.

LOOP PIPELINING

When pipelining loops, the optimal balance between area and performance is typically found by pipelining the innermost loop. This also results in the fastest runtime. The following code example demonstrates the trade-offs when pipelining loops and functions.

#include "loop_pipeline.h"

dout_t loop_pipeline(din_t A[N]) {
    int i, j;
    static dout_t acc;
LOOP_I:
    for (i = 0; i < 20; i++) {
    LOOP_J:
        for (j = 0; j < 20; j++) {
            acc += A[i] * j;
        }
    }
    return acc;
}

If the innermost loop (LOOP_J) is pipelined, there is one copy of LOOP_J in hardware (a single multiplier). Vitis HLS automatically flattens the loops when possible, as in this case, and effectively creates a new single loop of 20*20 iterations. Only one multiplier operation and one array access need to be scheduled, then the loop iterations can be scheduled as a single loop-body entity (20x20 loop iterations). If the outer loop (LOOP_I) is pipelined, the inner loop (LOOP_J) is unrolled, creating 20 copies of the loop body: 20 multipliers and 20 array accesses must now be scheduled. Then each iteration of LOOP_I can be scheduled as a single entity. If the top-level function is pipelined, both loops must be unrolled: 400 multipliers and 400 array accesses must now be scheduled. It is very unlikely that Vitis HLS will produce a design with 400 multiplications because in most designs, data dependencies often prevent maximal parallelism; for example, even if a dual-port RAM is used for A[N], the design can only access two values of A[N] in any clock cycle. The concept to appreciate when selecting at which level of the hierarchy to pipeline is to understand that pipelining the innermost loop gives the smallest hardware with generally acceptable throughput for most applications. Pipelining the upper levels of the hierarchy unrolls all sub-loops and can create many more operations to schedule (which could impact runtime and memory capacity), but typically gives the highest performance design in terms of throughput and latency. To summarize the above options:
• Pipeline LOOP_J: Latency is approximately 400 cycles (20x20) and requires fewer than 100 LUTs and registers (the I/O control and FSM are always present).
• Pipeline LOOP_I: Latency is approximately 20 cycles but requires a few hundred LUTs and registers, about 20 times the logic of the first option, minus any logic optimizations that can be made.
• Pipeline function loop_pipeline: Latency is approximately 10 cycles (20 dual-port accesses) but requires thousands of LUTs and registers (about 400 times the logic of the first option, minus any optimizations that can be made).

Imperfect Nested Loops

When the inner loop of a loop hierarchy is pipelined, Vitis HLS flattens the nested loops to reduce latency and improve overall throughput by removing any cycles caused by loop transitioning (the checks performed on the loop index when entering and exiting loops). Such checks can result in a clock delay when transitioning from one loop to the next (entry and/or exit). Imperfect loop nests, or the inability to flatten them, result in additional clock cycles to enter and exit the loops. When the design contains nested loops, analyze the results to ensure as many nested loops as possible have been flattened: review the log file or look in the synthesis report for cases, as shown in Loop Pipelining, where the loop labels have been merged (LOOP_I and LOOP_J are now reported as LOOP_I_LOOP_J) [13].

PIPELINING PARADIGM

Pipelining is a commonly used concept that you will encounter in everyday life. A good example is the production line of a car factory, where each specific task, such as installing the engine, installing the doors, and installing the wheels, is often done by a separate and unique workstation. The stations carry out their tasks in parallel, each on a different car. Once a car has had one task performed, it moves to the next station. Variations in the time needed to complete the tasks can be accommodated by buffering (holding one or more cars in a space between the stations) and/or by stalling (temporarily halting the upstream stations) until the next station becomes available. Suppose that assembling one car requires three tasks A, B, and C that take 20, 10, and 30 minutes, respectively. Then, if all three tasks were performed by a single station, the factory would output one car every 60 minutes. By using a pipeline of three stations, the factory would output the first car in 60 minutes, and then a new one every 30 minutes. As this example shows, pipelining does not decrease the latency, that is, the total time for one item to go through the whole system. It does, however, increase the system's throughput, that is, the rate at which new items are processed after the first one. Since the throughput of a pipeline cannot be better than that of its slowest element, the programmer should try to divide the work and resources among the stages so that they all take the same time to complete their tasks. In the car assembly example above, if the three tasks A, B, and C took 20 minutes each, instead of 20, 10, and 30 minutes, the latency would still be 60 minutes, but a new car would then be finished every 20 minutes, instead of 30. The diagram below shows a hypothetical manufacturing line tasked with the production of three cars. Assuming each of the tasks A, B, and C takes 20 minutes, a sequential production line would take 180 minutes to produce three cars. A pipelined production line would take only 100 minutes to produce three cars. The time taken to produce the first car is 60 minutes and is called the iteration latency of the pipeline.
After the first car is produced, the next two cars only take 20 minutes each, and this is known as the initiation interval (II) of the pipeline. The overall time taken to produce the three cars is 100 minutes and is referred to as the total latency of the pipeline, i.e. total latency = iteration latency + II * (number of items - 1). Therefore, improving II improves total latency, but not the iteration latency. From the programmer's point of view, the pipelining paradigm can be applied to functions and loops in the design. After an initial setup cost, the ideal throughput goal will be to achieve an II of 1, i.e., after the initial setup delay, the output will be available at every cycle of the pipeline. In our example above, after an initial setup delay of 60 minutes, a car is then available every 20 minutes.

Figure 3.7 Pipelining Paradigm

Pipelining is a classical micro-level architectural optimization that can be applied to multiple levels of abstraction. We covered task-level pipelining with the producer-consumer paradigm earlier. This same concept applies at the instruction level. This is in fact key to keeping the producer-consumer pipelines (and streams) filled and busy. The producer-consumer pipeline will only be efficient if each task produces/consumes data at a high rate, and hence the need for instruction-level pipelining (ILP). Because pipelining uses the same resources to execute the same function over time, it is considered a static optimization, since it requires complete knowledge about the latency of each task. Due to this, the low-level instruction pipelining technique cannot be applied to dataflow-type networks where the latency of the tasks can be unknown, as it is a function of the input data. The next section details how to leverage the three basic paradigms that have been introduced to model different types of task parallelism [14].

PIPELINE DEPENDENCIES

Vitis HLS constructs a hardware datapath that corresponds to the C/C++ source code. When there is no pipeline directive, the execution is sequential, so there are no dependencies to take into account. But when the design has been pipelined, the tool needs to deal with the same dependencies as found in processor architectures for the hardware that Vitis HLS generates. Typical cases of data dependencies or memory dependencies are when a read or a write occurs after a previous read or write.

A read-after-write (RAW), also called a true dependency, is when an instruction (and data it reads/uses) depends on the result of a previous operation.

I1: t = a * b;
I2: c = t + 1;

The read in statement I2 depends on the write of t in statement I1. If the instructions are reordered, it uses the previous value of t.

A write-after-read (WAR), also called an anti-dependence, is when an instruction cannot update a register or memory (by a write) before a previous instruction has read the data.

I1: b = t + a;
I2: t = 3;

The write in statement I2 cannot execute before statement I1; otherwise the result of b is invalid.

A write-after-write (WAW) is a dependence where a register or memory must be written in a specific order; otherwise other instructions might be corrupted.

I1: t = a * b;
I2: c = t + 1;
I3: t = 1;

The write in statement I3 must happen after the write in statement I1. Otherwise, the statement I2 result is incorrect. A read-after-read has no dependency, as instructions can be freely reordered if the variable is not declared as volatile. If it is, then the order of instructions has to be maintained.
For example, when a pipeline is generated, the tool needs to take care that a register or memory location read at a later stage has not been modified by a previous write. This is a true dependency or read-after-write (RAW) dependency. A specific example is:

int top(int a, int b) {
    int t, c;
I1: t = a * b;
I2: c = t + 1;
    return c;
}

Statement I2 cannot be evaluated before statement I1 completes because there is a dependency on variable t. In hardware, if the multiplication takes 3 clock cycles, then I2 is delayed for that amount of time. If the above function is pipelined, then Vitis HLS detects this as a true dependency and schedules the operations accordingly. It uses data forwarding optimization to remove the RAW dependency, so that the function can operate at II = 1. Memory dependencies arise when the example applies to an array and not just variables.

int top(int a) {
    int r = 1, rnext, m, i, out;
    static int mem[256];
L1: for (i = 0; i <= 254; i++) {
#pragma HLS PIPELINE II=1
I1:     m = r * a; mem[i+1] = m;   // line 7
I2:     rnext = mem[i]; r = rnext; // line 8
    }
    return r;
}

In the above example, scheduling of loop L1 leads to a scheduling warning message:

WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 1, distance = 1) between 'store' operation (top.cpp:7) of variable 'm', top.cpp:7 on array 'mem' and 'load' operation ('rnext', top.cpp:8) on array 'mem'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.

There are no issues within the same iteration of the loop, as you write one index and read another one. The two instructions could execute at the same time, concurrently. However, observe the reads and writes over a few iterations:

// Iteration for i=0
I1: m = r * a; mem[1] = m;   // line 7
I2: rnext = mem[0]; r = rnext; // line 8
// Iteration for i=1
I1: m = r * a; mem[2] = m;   // line 7
I2: rnext = mem[1]; r = rnext; // line 8
// Iteration for i=2
I1: m = r * a; mem[3] = m;   // line 7
I2: rnext = mem[2]; r = rnext; // line 8

When considering two successive iterations, the multiplication result m (with a latency = 2) from statement I1 is written to a location that is read by statement I2 of the next iteration of the loop into rnext. In this situation, there is a RAW dependence, as the next loop iteration cannot start reading mem[i] before the previous computation's write completes.

Figure 3.8 Dependency Example

Note that if the clock frequency is increased, then the multiplier needs more pipeline stages and increased latency. This will force II to increase as well. Consider the following code, where the operations have been swapped, changing the functionality.

int top(int a) {
    int r, m, i;
    static int mem[256];
L1: for (i = 0; i <= 254; i++) {
#pragma HLS PIPELINE II=1
I1:     r = mem[i];                // line 7
I2:     m = r * a, mem[i+1] = m;   // line 8
    }
    return r;
}

The scheduling warning is:

INFO: [SCHED 204-61] Pipelining loop 'L1'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 1, distance = 1) between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem' and 'load' operation ('r', top.cpp:7) on array 'mem'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 2, distance = 1) between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem' and 'load' operation ('r', top.cpp:7) on array 'mem'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint (II = 3, distance = 1) between 'store' operation (top.cpp:8) of variable 'm', top.cpp:8 on array 'mem' and 'load' operation ('r', top.cpp:7) on array 'mem'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 4, Depth: 4.

Observe the continued reads and writes over a few iterations:

// Iteration with i=0
I1: r = mem[0];              // line 7
I2: m = r * a, mem[1] = m;   // line 8
// Iteration with i=1
I1: r = mem[1];              // line 7
I2: m = r * a, mem[2] = m;   // line 8
// Iteration with i=2
I1: r = mem[2];              // line 7
I2: m = r * a, mem[3] = m;   // line 8

A longer II is needed because the RAW dependence is via reading r from mem[i], performing the multiplication, and writing to mem[i+1].

Removing False Dependencies to Improve Loop Pipelining

False dependencies are dependencies that arise when the compiler is too conservative. These dependencies do not exist in the real code, but cannot be determined by the compiler. These dependencies can prevent loop pipelining. The following example illustrates false dependencies. In this example, the read and write accesses are to two different addresses in the same loop iteration. Both of these addresses are dependent on the input data, and can point to any individual element of the hist array. Because of this, Vitis HLS assumes that both of these accesses can access the same location. As a result, it schedules the read and write operations to the array in alternating cycles, resulting in a loop II of 2. However, the code shows that hist[old] and hist[val] can never access the same location because they are in the else branch of the conditional if (old == val).

void histogram(int in[INPUT_SIZE], int hist[VALUE_SIZE]) {
    int acc = 0;
    int i, val;
    int old = in[0];
    for (i = 0; i < INPUT_SIZE; i++) {
#pragma HLS PIPELINE II=1
        val = in[i];
        if (old == val) {
            acc = acc + 1;
        } else {
            hist[old] = acc;
            acc = hist[val] + 1;
        }
        old = val;
    }
    hist[old] = acc;
}

To overcome this deficiency, you can use the DEPENDENCE directive to provide Vitis HLS with additional information about the dependencies.

void histogram(int in[INPUT_SIZE], int hist[VALUE_SIZE]) {
    int acc = 0;
    int i, val;
    int old = in[0];
#pragma HLS DEPENDENCE variable=hist type=intra direction=RAW dependent=false
    for (i = 0; i < INPUT_SIZE; i++) {
#pragma HLS PIPELINE II=1
        val = in[i];
        if (old == val) {
            acc = acc + 1;
        } else {
            hist[old] = acc;
            acc = hist[val] + 1;
        }
        old = val;
    }
    hist[old] = acc;
}

When specifying dependencies, there are two main types:
• Inter: Specifies that the dependency is between different iterations of the same loop. If this is specified as FALSE, it allows Vitis HLS to perform operations in parallel if the loop is pipelined, unrolled, or partially unrolled, and it prevents such concurrent operation when specified as TRUE.
• Intra: Specifies a dependence within the same iteration of a loop, for example an array being accessed at the start and end of the same iteration. When intra dependencies are specified as FALSE, Vitis HLS may move operations freely within the loop, increasing their mobility and potentially improving performance or area. When the dependency is specified as TRUE, the operations must be performed in the order specified.

Scalar Dependencies

Some scalar dependencies are much harder to resolve and often require changes to the source code.
A scalar data dependency could look like the following:

while (a != b) {
    if (a > b) a -= b;
    else       b -= a;
}

The next iteration of this loop cannot start until the current iteration has calculated the updated values of a and b, as shown in Figure 3.9.

Figure 3.9 Scalar Dependency

If the result of the previous loop iteration must be available before the current iteration can begin, loop pipelining is not possible. If Vitis HLS cannot pipeline with the specified initiation interval, it increases the initiation interval. If it cannot pipeline at all, as shown by the above example, it halts pipelining and proceeds to output a non-pipelined design [15].

Chapter 4 Implementation Details

4.1 Experimental Setup

An FPGA is an integrated circuit (IC) equipped with configurable logic blocks (CLBs) and other features that can be programmed and reprogrammed by a user. The term "field-programmable" indicates that the FPGA's abilities are adjustable and not hardwired by the manufacturer like other ICs. FPGAs are integrated circuits (ICs) that fall under the umbrella of programmable logic devices (PLDs). The fundamental functionality of FPGA technology is built on adaptive hardware, which has the unique ability to be modified after manufacture. Arrays of hardware blocks, each configurable, can be connected as needed, allowing highly efficient, domain-specific architectures to be built for any application. The architecture of FPGAs makes them an efficient solution for hardware acceleration. Devices such as ASICs and GPUs use an antiquated method of jumping between programming and memory. They also don't accommodate applications where real-time information is needed, since the high amount of power required for storage and retrieval tasks causes performance lags. Unlike ASICs and GPUs, FPGAs don't need to jump between memory and programming, which makes the process of storing and retrieving data more efficient. And since FPGA architecture is more flexible, you can customize how much power you'd like an FPGA to utilize for a specific task. That flexibility can help offload energy-consuming tasks to one or several FPGAs from a conventional CPU or another device. And since many FPGAs can be reprogrammed, you can easily implement upgrades and adjustments to a hardware acceleration system. FPGA programming uses an HDL to manipulate circuits depending on what capabilities you want the device to have. The process is different from programming a GPU or CPU, since you aren't writing a program that will run sequentially. Rather, you're using an HDL to create circuits and physically change the hardware depending on what you want it to do. The process is similar to programming software in that you write code that is turned into a binary file and loaded onto the FPGA. But the outcome is that the HDL makes physical changes to the hardware, rather than strictly optimizing the device to run software. A program on an FPGA pieces together lower-level elements like logic gates and memory blocks, which work in concert to complete a task. Because you're manipulating the hardware from the ground up, FPGAs allow a great deal of flexibility. You can adjust basic functions such as memory or power usage depending on the task [16].

4.2 Software and Hardware Setup

HLS: The Xilinx® Vivado® High-Level Synthesis (HLS) compiler provides a programming environment similar to those available for application development on both standard and specialized processors.
Vivado HLS shares key technology with processor compilers for the interpretation, analysis, and optimization of C/C++ programs. The main difference is in the execution target of the application. By targeting an FPGA as the execution fabric, Vivado HLS enables a software engineer to optimize code for throughput, power, and latency without having to work around the bottleneck of a single memory space and limited computational resources. This allows computationally intensive software algorithms to be implemented in actual products, not just functionality demonstrators. This section introduces how the Vivado HLS compiler works and how it differs from a traditional software compiler. Application code targeting the Vivado HLS compiler uses the same constructs as code for any processor compiler. Vivado HLS analyzes all programs in terms of:
• Operations
• Conditional statements
• Loops
• Functions

FFT ANALYSIS
A fast Fourier transform (FFT) is a highly optimized implementation of the discrete Fourier transform (DFT), which converts discrete signals from the time domain to the frequency domain. FFT computations provide information about the frequency content, phase, and other properties of the signal.
Figure 4.1: Audio signal decomposed into its frequency components using FFT.
Popular FFT algorithms include the Cooley-Tukey algorithm, the prime-factor FFT algorithm, and Rader's FFT algorithm. The most commonly used is the Cooley-Tukey algorithm, which recursively reduces a large DFT into smaller DFTs to increase computation speed and reduce complexity (see the decimation relations summarized below). FFT analysis is a widely used technique for analyzing and processing signals, such as audio and image data, in many fields. The use of Field-Programmable Gate Arrays (FPGAs) in combination with RISC-V Tensorflow can significantly accelerate FFT analysis, enabling real-time processing of data in applications like telecommunications, imaging, and audio processing. RISC-V Tensorflow is an open-source deep learning framework optimized for RISC-V processors, which provides an efficient and flexible platform for implementing FFT analysis algorithms. The flexibility of RISC-V allows the design and implementation of customized accelerators that perform FFT computations efficiently, leading to faster processing times and lower energy consumption. FPGAs offer high-speed, low-latency performance for FFT computations, making them well suited for implementing FFT accelerators. FPGAs can be programmed and customized for specific FFT workloads, improving efficiency and reducing energy consumption. Additionally, FPGAs can support high-throughput data processing, enabling real-time processing of data in applications like audio and image processing. The combination of RISC-V Tensorflow and FPGAs can lead to specialized accelerators for FFT computations, improving the efficiency and accuracy of FFT analysis. These accelerators can be customized for specific FFT algorithms, such as the Cooley-Tukey algorithm, leading to more efficient and faster processing. Moreover, the use of FPGAs in FFT analysis enables the processing of larger data sets, making it possible to analyze complex signals in real time. This is particularly important in applications like telecommunications, where the ability to analyze and process data in real time can significantly improve the performance and reliability of communication systems.
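For reference, the radix-2 decimation-in-time relations behind the Cooley-Tukey algorithm mentioned above can be written as follows; this is the standard textbook form and is not specific to this design. The N-point DFT is

X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi n k / N}, \qquad k = 0, 1, \ldots, N-1.

Splitting the samples into even- and odd-indexed halves, whose N/2-point DFTs are E[k] and O[k], gives for k = 0, \ldots, N/2 - 1

X[k] = E[k] + W_N^{k}\, O[k], \qquad X[k + N/2] = E[k] - W_N^{k}\, O[k], \qquad W_N = e^{-j 2\pi / N}.

Applied recursively, this split reduces the cost of an N-point transform from O(N^2) to O(N \log N) operations; for the 32-point core considered in this project it corresponds to \log_2 32 = 5 butterfly stages.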
In conclusion, the use of RISC-V Tensorflow and FPGAs presents a promising solution for accelerating FFT analysis, improving the efficiency and accuracy of signal processing in various applications. As the use of FPGAs in AI and signal processing continues to grow, the development of specialized accelerators for FFT analysis is expected to become more widespread, leading to significant advancements in the field of signal processing.
FFT in MATLAB
MATLAB® provides functions such as fft, ifft, and fft2 with which an FFT can be computed directly. The MATLAB FFT implementation chooses among several FFT algorithms depending on the data size and computation. Similarly, Simulink® provides FFT blocks that can be used in Model-Based Design and simulation. MATLAB and Simulink also support implementation of the FFT on specific hardware, such as FPGAs, processors including ARM, and NVIDIA GPUs, through automatic code generation. A brief overview of how to use the FFT function in MATLAB:
1. Load your signal data into a MATLAB variable.
2. Apply a windowing function to the data if necessary. This can help reduce spectral leakage and improve the accuracy of the FFT analysis.
3. Apply the FFT function, fft(), to the signal data. The function takes the signal data as input and returns the FFT coefficients, which represent the frequency components of the signal.
4. Use the FFT coefficients to plot the frequency spectrum of the signal. You can use the abs() function to get the magnitude of the FFT coefficients and then plot the magnitude against frequency.

Chapter 5 Results and Discussion
5.1 Performance Evaluation Parameters
In an FPGA design, latency refers to the delay between when a signal is input to the FPGA and when the FPGA responds with the corresponding output. The latency of a design is affected by factors such as the number of logic elements, the clock frequency, and the routing architecture. The iteration (initiation) interval is the number of clock cycles between the starts of successive loop iterations or function invocations; together with the clock period, it determines the throughput of the design. The clock frequency determines the maximum speed at which the FPGA can operate, while the clock period is the duration of each clock cycle. Reducing the initiation interval through pipelining, or raising the clock frequency, improves throughput, but both affect the resource usage and power consumption of the design. Error tolerance in an FPGA refers to the ability of the device to handle errors that may occur during operation. FPGAs can be designed with various error-correction mechanisms, such as ECC (Error Correction Code), parity checking, and redundant logic elements. These mechanisms help detect and correct errors, improving reliability and reducing the likelihood of failures. In addition, FPGAs can be designed with built-in self-test (BIST) capabilities to detect and diagnose faults during operation. In this project, error tolerance also refers to the numerical deviation of the fixed-point FFT output from a reference result, which the testbench checks against a bound of one percent.
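To make the error-tolerance criterion concrete, the following is a minimal sketch of how a relative-error check of this kind can be expressed in a C++ testbench. It is illustrative only: the function name, the flat array layout, and the small absolute floor are assumptions, while the 1% bound corresponds to the tolerance reported in Section 5.3.

#include <cmath>
#include <cstdio>

// Sketch of a self-checking comparison: each output bin produced by the
// fixed-point design (dut) is compared against a reference (ref); the check
// fails if any bin deviates by more than rel_tol (e.g. 0.01 for 1%).
bool within_tolerance(const float *dut, const float *ref, int n, float rel_tol) {
    for (int i = 0; i < n; i++) {
        float err   = std::fabs(dut[i] - ref[i]);
        float bound = rel_tol * std::fabs(ref[i]) + 1e-6f;  // small floor for near-zero bins
        if (err > bound) {
            std::printf("Bin %d: got %f, expected %f\n", i, dut[i], ref[i]);
            return false;
        }
    }
    return true;
}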
5.2 Implementation Results
Figure 5.1 Scaled FFT Dataflow
Table 5.1 Performance Estimation Latency Details

Instance                        Module                        Latency (cycles)   Latency (absolute)    Interval (cycles)   Pipeline
                                                              min / max          min / max             min / max           Type
Loop_VITIS_LOOP_88_1_proc2_U0   Loop_VITIS_LOOP_88_1_proc2    34 / 34            0.340 us / 0.340 us   34 / 34             no
bitreverse_U0                   bitreverse                    35 / 35            0.350 us / 0.350 us   35 / 35             no
FFT0_13_U0                      FFT0_13                       18 / 18            0.180 us / 0.180 us   18 / 18             no
FFT0_14_U0                      FFT0_14                       22 / 22            0.220 us / 0.220 us   22 / 22             no
FFT0_15_U0                      FFT0_15                       22 / 22            0.220 us / 0.220 us   22 / 22             no
FFT0_16_U0                      FFT0_16                       22 / 22            0.220 us / 0.220 us   22 / 22             no
FFT0_U0                         FFT0                          22 / 22            0.220 us / 0.220 us   22 / 22             no
Loop_VITIS_LOOP_98_2_proc7_U0   Loop_VITIS_LOOP_98_2_proc7    35 / 35            0.350 us / 0.350 us   35 / 35             no

5.3 Results Discussion
Table 5.2 Timing Estimates

Solution      Clock     Target      Estimated
solution_1    ap_clk    10.00 ns    6.573 ns
scaled_fft    ap_clk    10.00 ns    6.450 ns

Table 5.3 Latency Estimates

Solution      Latency (cycles)   Latency (absolute)    Interval (cycles)
              min / max          min / max             min / max
solution_1    313 / 313          3.130 us / 3.130 us   314 / 314
scaled_fft    217 / 217          2.170 us / 2.170 us   36 / 36

Table 5.4 Utilization Estimates

Solution      BRAM_18K   DSP   FF     LUT    URAM
solution_1    0          8     1072   2209   0
scaled_fft    2          16    2047   2862   0

Table 5.5 Resource Usage Implementation

Solution      RTL       SLICE   LUT    FF     DSP   SRL
solution_1    verilog   0       930    807    8     2
scaled_fft    verilog   0       2863   1207   16    22

Table 5.6 Final Timing Implementation

Solution      RTL       CP required   CP achieved post-synthesis   CP achieved post-implementation
scaled_fft    verilog   10.000        5.752                        7.657
solution_1    verilog   10.000        -                            -

The timing of the circuit improved from 100 MHz (10 ns) to 155.04 MHz (6.450 ns), allowing a higher clock rate without violating the critical path or risking metastability in the design or its internal gates. From the report comparison generated by Vitis HLS (2021.1), the scaled version of the FFT implementation achieves roughly a 30% speed-up, reducing latency from 313 cycles to 217 cycles. As a trade-off for the improved timing, the design uses more FPGA resources, most significantly flip-flops (FF) and DSP slices. This is a significant observation, since additional hardware increases both the chip footprint and the power consumption. The optimization uses the extra hardware to parallelize the FFT stages through the dataflow optimization discussed in the sections above. As Table 5.1 shows, the closer the individual sub-processes are in latency, the better they can be scheduled together, giving a much better overall latency and throughput.
Table 5.7 RTL/Co-Simulation Performance Estimates
Fixed-point implementations are prone to some numerical error. We were able to contain the error to less than 1%, as confirmed by our self-checking testbench in the RTL/co-simulation performance estimates given above. The final verdict for this stage was Pass.

Chapter 6 Conclusion and Future Work
The FFT design was also successfully tested on audio files with 16-bit samples at 44.1 kHz, which is standard for music and human speech. The results matched the RTL/co-simulation output, maintaining our target of less than one percent error. Our implementation also finds application in embedded devices that are low-power, low-cost, and area-constrained, such as earphones and other handheld devices.
We concluded the design and implementation of a 32-point fixed-point FFT core that meets the performance, accuracy, flexibility, scalability, and compatibility requirements of real-time signal processing applications. In future work, the interface can be extended with block floating point, i.e., customized scaling of the butterfly computation at each stage of the FFT; the scaling must be decided by the programmer and depends entirely on the application, the expected output, and the data itself. Furthermore, the size of the FFT itself could be increased for finer frequency resolution, i.e., more bins. This simply means more points per frame, and as the FFT size grows the design could move fully to high-resolution audio applications. The design could also be taken forward to implement Active Noise Cancellation (ANC) in earphones, a high-potential emerging consumer market.

References
[1] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. Horowitz, "Convolution engine," Communications of the ACM, vol. 58, no. 4, pp. 85-93, 2015.
[2] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 10-14.
[3] R. Toulson and T. Wilmshurst, "An Introduction to Digital Signal Processing," in Fast and Effective Embedded Systems Design: Applying the ARM mbed, Amsterdam: Newnes, an imprint of Elsevier, 2017.
[4] "Digital Signal Processing," Wikipedia, 23-Mar-2023. [Online]. Available: https://en.wikipedia.org/wiki/Digital_signal_processing. [Accessed: 25-Mar-2023].
[5] "Overhead (computing)," Wikipedia, 12-Feb-2023. [Online]. Available: https://en.wikipedia.org/wiki/Overhead_(computing). [Accessed: 26-Mar-2023].
[6] R. Teymourzadeh, M. J. Abigo, and M. V. Hoong, "Static quantised radix-2 fast Fourier transform (FFT)/inverse FFT processor for constraints analysis," International Journal of Electronics, vol. 101, no. 2, pp. 231-240, 2013.
[7] M. Croome, "GreenWaves Technologies unveils the GAP8 IoT Application Processor," GreenWaves Technologies, 17-Jun-2019. [Online]. Available: https://greenwaves-technologies.com/greenwaves-technologies-unveils-gap8/. [Accessed: 30-Mar-2023].
[8] "Fixed-point vs. floating-point digital signal processing," Analog Devices. [Online]. Available: https://www.analog.com/en/technical-articles/fixedpoint-vs-floatingpoint-dsp.html. [Accessed: 03-Apr-2023].
[9] V. Kumar M, D. Selvakumar A, and S. P M, "Area and frequency optimized 1024 point radix-2 FFT processor on FPGA," 2015 International Conference on VLSI Systems, Architecture, Technology and Applications (VLSI-SATA), 2015.
[10] W. R. Knight and R. Kaiser, "A simple fixed-point error bound for the fast Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 6, pp. 615-620, December 1979; and L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Englewood Cliffs, NJ: Prentice-Hall, 1975.
[11] J. Arias, M. Desainte-Catherine, and C. Rueda, "Exploiting Parallelism in FPGAs for the Real-Time Interpretation of Interactive Multimedia Scores," Journées d'Informatique Musicale 2015, May 2015, Montréal, Canada. ⟨hal-01129316⟩
[12] "AXI4-Stream Interface," AMD Adaptive Computing Documentation Portal. [Online].
Available: https://docs.xilinx.com/r/2021.1-English/ug1399-vitis-hls/How-AXI4-Stream-Works. [Accessed: 13-Apr-2023].
[13] "Loop Pipelining," AMD Adaptive Computing Documentation Portal. [Online]. Available: https://docs.xilinx.com/r/2021.1-English/ug1399-vitis-hls/Loop-Pipelining. [Accessed: 13-Apr-2023].
[14] "Pipelining Paradigm," AMD Adaptive Computing Documentation Portal. [Online]. Available: https://docs.xilinx.com/r/2021.1-English/ug1399-vitis-hls/Pipelining-Paradigm. [Accessed: 13-Apr-2023].
[15] "Pipelining Dependencies," AMD Adaptive Computing Documentation Portal. [Online]. Available: https://docs.xilinx.com/r/2021.1-English/ug1399-vitis-hls/Exploiting-Task-Level-Parallelism-Dataflow-Optimization. [Accessed: 13-Apr-2023].
[16] "Programming an FPGA: An Introduction to How It Works," Xilinx. [Online]. Available: https://www.xilinx.com/products/silicon-devices/resources/programming-an-fpga-an-introduction-to-how-it-works.html. [Accessed: 11-Apr-2023].

Acknowledgement
We would like to express our gratitude and thanks to Dr. Tanuja Sarode and Mr. C. S. Kulkarni for their valuable guidance and help. We are indebted to them for their guidance and constant supervision, as well as for providing the information necessary for this project. We would also like to express our greatest appreciation to our principal, Dr. G. T. Thampi, and the head of the department, Dr. Tanuja Sarode, for their encouragement and tremendous support. We take this opportunity to thank everyone who has been instrumental in the successful completion of the project.
Kaku Jay Sushil - 1902067
Khan Mohd Hamza Rafique - 1902072
Kotadia Hrishit Jayantilal - 1902080
Narwani Trushant Sunil - 1902114