
RISC-V TensorCore for Edge AI Project Report

RISC-V TensorCore for Edge AI
Submitted in partial fulfillment of the requirements
of the degree of
BACHELOR OF ENGINEERING
In
COMPUTER ENGINEERING
By
Group No: 4
1902067  Kaku Jay Sushil
1902072  Khan Mohd Hamza Rafique
1902080  Kotadia Hrishit Jayantilal
1902114  Narwani Trushant Sunil
Guide:
DR. TANUJA SARODE
(Professor, Department of Computer Engineering, TSEC)
Computer Engineering Department
Thadomal Shahani Engineering College
University of Mumbai
2022-2023
CERTIFICATE
This is to certify that the project entitled “RISC-V TensorCore for Edge AI” is a
bonafide work of
1902067  Kaku Jay Sushil
1902072  Khan Mohd Hamza Rafique
1902080  Kotadia Hrishit Jayantilal
1902114  Narwani Trushant Sunil
Submitted to the University of Mumbai in partial fulfillment of the requirement for the award
of the degree of “BACHELOR OF ENGINEERING” in “COMPUTER
ENGINEERING”.
Dr. Tanuja Sarode (Guide)
Dr. Tanuja Sarode (Head of Department)
Dr. G. T. Thampi (Principal)
Project Report Approval for B.E
Project report entitled RISC-V TensorCore for Edge AI by
1902067  Kaku Jay Sushil
1902072  Khan Mohd Hamza Rafique
1902080  Kotadia Hrishit Jayantilal
1902114  Narwani Trushant Sunil
is approved for the degree of “BACHELOR OF ENGINEERING” in
“COMPUTER ENGINEERING”.
Examiners
1.
2.
Date:
Place:
Declaration
We declare that this written submission represents our ideas in our own
words, and where others' ideas or words have been included, we have adequately
cited and referenced the original sources. We also declare that we have adhered to
all principles of academic honesty and integrity and have not misrepresented or
fabricated or falsified any idea/data/fact/source in our submission. We understand
that any violation of the above will be cause for disciplinary action by the Institute
and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been taken when needed.
1) ________________________________
Kaku Jay Sushil - 1902067
2) ________________________________
Khan Mohd Hamza Rafique - 1902072
3) ________________________________
Kotadia Hrishit Jayantilal - 1902080
4) ________________________________
Narwani Trushant Sunil - 1902114
Date:
Abstract
Artificial Intelligence (AI) has become an integral part of various industries, and deep learning
models have shown remarkable performance in tasks such as image recognition, natural
language processing, and speech recognition. However, training these models poses significant
challenges, including the need for large amounts of high-quality data, high computational
resources, and interpretability of the models.
To address these challenges, the use of Field-Programmable Gate Arrays (FPGAs) in
combination with RISC-V TensorFlow, an open-source deep learning framework optimized for
RISC-V processors, has gained significant attention. FPGAs offer high-speed and low-latency
performance for AI computations, making them well-suited for training deep learning models.
Additionally, FPGAs can be programmed and customized for specific AI workloads, improving
efficiency and reducing energy consumption.
The combination of RISC-V TensorFlow and FPGAs can significantly accelerate AI model
training, reducing the time and resources required for training, while also providing greater
transparency and interpretability of the models. FPGAs can also support high-throughput data
processing, enabling real-time processing of data in applications like autonomous vehicles and
robotics.
Furthermore, FPGAs can be used to create specialized accelerators for specific AI workloads,
such as convolutional or recurrent neural networks, to achieve better performance and
efficiency. This customization helps to reduce the reliance on expensive and energy-intensive
general-purpose processors and accelerators.
In conclusion, the use of RISC-V TensorFlow and FPGAs presents a promising solution to the
challenges faced in AI training, including high computational requirements, interpretability, and
efficiency. As AI models continue to grow in complexity, the use of these technologies is
expected to become more widespread, leading to significant advancements in the field of AI.
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
    1.1  Introduction
    1.2  Problem Statement and Objectives
    1.3  Scope
Chapter 2  Review of Literature
    2.1  Domain Explanation
    2.2  Review of Existing System
    2.3  Limitations of Existing System/Research Gaps
Chapter 3  Proposed System
    3.1  Design Details
    3.2  Methodology
Chapter 4  Implementation Details
    4.1  Experimental Setup
    4.2  Software and Hardware Setup
Chapter 5  Results and Discussion
    5.1  Performance Evaluation Parameters
    5.2  Implementation Results
    5.3  Results Discussion
Chapter 6  Conclusion and Future Work
References
Acknowledgement
List of Figures
Figure No.   Description
3.1          DIT Radix 2 Butterfly
3.2          Radix-2 decimation in time 32-point FFT
3.3          Basic DSP48E1 Slice Functionality
3.4          7 Series FPGA DSP48E1 Slice
3.5          I/O to the FFT compute unit or core
3.6          AXI4-Stream Handshake
3.7          Pipelining Paradigm
3.8          Dependency Example
3.9          Scalar Dependency
4.1          Audio signal decomposed into its frequency components using FFT
5.1          Scaled FFT Dataflow
List of Tables
Table No.    Description
3.1          Equations for real and imaginary parts of P’ and Q’
3.2          Fixed-Point Identifier Summary
5.1          Performance Estimation Latency Details
5.2          Timing Estimates
5.3          Latency Estimates
5.4          Utilization Estimates
5.5          Resource Usage Implementation Estimates
5.6          Final Timing Implementation Estimates
5.7          RTL/Co-Simulation Performance Estimates
Chapter 1
Introduction
1.1 Introduction
The growth of computing power has been exponential over the past few decades, and
this trend is likely to continue in the foreseeable future. The increasing demand for more
computing power is driven by a wide range of applications, from scientific research to business
analytics, gaming, and artificial intelligence. The growth of computing power has led to the
development of advanced processors with specialized hardware units that can accelerate the
performance of specific tasks. One such hardware unit is the convolution engine, which is
commonly found in processors used in signal processing and machine learning applications.
Convolution is a mathematical operation that is widely used in signal processing, image
processing, and machine learning. It involves integrating a function with a reversed and shifted
version of another function, known as a kernel or filter. Convolution can be thought of as a way
to extract meaningful features from data by applying a set of transformations that highlight
certain patterns or characteristics. Convolution engines are found in a wide range of processors
used in signal processing and machine learning applications, including CPUs, GPUs, and
specialized accelerators such as FPGAs and ASICs. They are used for programmable processors
specialized for the convolution-like data-flow prevalent in computational photography,
computer vision, and image processing [1]. The implementation of convolution engines varies
depending on the processor architecture and the specific use case, but they all share the goal of
accelerating the performance of convolution operations.
One example of convolution in real life is in image processing, where convolution is used to
enhance or filter images. For example, edge detection filters are commonly used to detect the
edges of objects in an image, and blurring filters are used to remove noise or smooth out an
image.
1.2 Problem Statement and Objectives
The number of convolution operations required for modern workloads has not grown in step
with the computing power available today. This leaves a need for domain-specific and
application-specific digital designs that can handle the computation efficiently while
maintaining high throughput and minimal overhead, at the cost of slightly reduced precision
but still producing reliable results. This is supported by the data that a CPU expends around
70 pJ to accomplish a task for which an ASIC would need less than 1 pJ [2].
The principal objective of this project is to design and develop an FPGA based softcore
co-processor design coupled with a RISC-V core.
To meet these demands, hardware
implementations of the Fast Fourier Transform (FFT) algorithm have become crucial. In
particular, a 32-point fixed-point FFT core is needed to efficiently process digital signals in
real-time applications.
However, designing an efficient 32-point fixed-point FFT core poses several challenges.
The first challenge is to ensure that the core meets the required performance specifications, such
as the processing speed, power consumption, and area utilization. The second challenge is to
ensure the accuracy of the fixed-point arithmetic used in the core, as any errors in the arithmetic
computations can lead to significant degradation in the quality of the processed signals.
Furthermore, the design of the core must also take into account the need for flexibility and
scalability, as different applications may require varying FFT sizes. Optimization on managing
the bit rate growth of the FFT computation is yet another challenge that should be addressed.
1.3 Scope
The scope of the project consists of essential aspects of hardware programming viz.
design, verification, testing, optimizing, and prototyping. The principal objective of this project
is to design and develop an FPGA based softcore co-processor design coupled with a RISC-V
core. The project is meant to utilize System on a Chip that comprises an FPGA to generate the
FFT output for accelerating signal processing which could be extended to do general matrix
convolution tasks and helps understand the tradeoffs better when interfaced with AI
applications. While various steps of the sample synthesis and processing may be offloaded to
the FPGA to utilize its inherent parallelism, the post-processing compute would be handled by
the SoC with minimal overhead and higher efficiency.
Design involves creating a blueprint for the hardware system considering factors such as power
consumption, cost, and ease of manufacturing.
Verification ensures the hardware design meets requirements and specifications, identifying
potential issues before manufacturing and reducing costs and time to market.
Testing identifies and fixes bugs and errors in the hardware design. Functional testing verifies
the intended function, while non-functional testing evaluates performance under different
conditions.
Optimizing improves performance, power consumption, and cost-effectiveness by identifying
areas for improvement such as resource usage, power consumption, and cost reduction.
Prototyping builds physical prototypes to test hardware performance, identifying design issues
and refining the design before mass production.
Chapter 2
Review of Literature
2.1 Domain Explanation
Digital Signal Processing (DSP) refers to the computation of mathematically intensive
algorithms applied to data signals, such as audio signal manipulation, video compression, data
coding/decoding and digital communications [3]. It involves transforming signals from the time
domain to the frequency domain using techniques such as Fourier analysis, and then applying
signal processing techniques to achieve various functions. These functions can include filtering,
smoothing, and modulation, among others.
In DSP, signals are typically represented as a sequence of discrete values, and algorithms are
used to manipulate these values. DSP techniques are applied to a wide range of signals,
including audio, video, images, and control signals [4].
In the context of audio signals, DSP techniques can be used for tasks such as filtering to remove
noise or unwanted frequencies, equalization to adjust the tonal balance, and compression to
reduce the size of audio files for storage or transmission purposes. DSP is also commonly used
in audio effects processing, such as reverb, chorus, and modulation effects, to create various
sound effects.
For video signals, DSP techniques are used for tasks such as video compression to reduce the
amount of data needed to represent a video, image processing for tasks like image enhancement,
and video analysis for tasks like motion detection and tracking. DSP is also used in video
encoding and decoding, where it plays a crucial role in compressing and decompressing video
data for efficient storage and transmission.
In the field of image processing, DSP techniques are used for tasks such as image filtering to
remove noise or enhance details, image compression for efficient storage and transmission, and
image recognition for tasks like object detection and facial recognition. DSP is also used in
medical imaging for tasks like image reconstruction, image enhancement, and image analysis
for diagnostic purposes.
Control signals, which are used to manage and regulate the behavior of a system, are another
important application of DSP. Control signals can be used in various engineering and
automation applications to adjust system parameters, monitor system behavior, and achieve
desired system performance. DSP techniques are used in the analysis and processing of control
signals to design efficient control algorithms, optimize system behavior, and minimize control
signal overhead.
Overall, DSP techniques are incredibly versatile and can be applied to a wide range of signals,
including audio, video, images, and control signals. The discrete representation of signals in
DSP allows for efficient processing and analysis using mathematical algorithms, making it a
powerful tool in various fields such as telecommunications, multimedia processing, medical
imaging, and control systems.
Waves are a type of signal that carry energy and propagate through space or a medium. They
can be classified as mechanical, electromagnetic, or quantum-mechanical based on their nature.
Waves have a specific frequency, wavelength, and amplitude, and these properties can be used
to analyze and manipulate them using DSP techniques.
Control signal overhead refers to the additional computational and processing resources
required to implement control signals in a system. Implementing control signals can add
overhead to a system in several ways [5]. For example, the controller itself requires processing
power to generate control signals and monitor the system's behavior. Additionally, the
additional input signals needed to generate control signals can increase the complexity and cost
of the system's hardware and software. Moreover, the process of measuring the system's
behavior and generating control signals can introduce delays, which can impact the system's
performance.
Managing control signal overhead is an important consideration in system design and
implementation. It requires careful optimization of the control signal generation process,
efficient hardware and software design, and minimizing delays in the measurement and control
loop. DSP techniques can be used to analyze and optimize control signals, as well as to design
efficient control algorithms that minimize overhead while achieving the desired system
behavior.
2.2 Review of Existing Systems
Static Quantized Radix-2 FFT/IFFT processors have been designed to conduct constraint
analysis. Among the major setbacks associated with such high-resolution FFT processors is
the high power consumption resulting from the structural complexity and computational
inefficiency of floating-point calculations. To address this, a parallel pipelined architecture was
proposed to statically scale the resolution of the processor to suit adequate trade-off
constraints. Quantization was applied to provide an approximation that addresses the finite
word-length constraints of digital signal processing (DSP) [6].
One approach to mitigate these issues is to use a parallel pipelined architecture, which allows
for efficient processing of FFT and IFFT operations. This architecture is designed to statically
scale the resolution of the processor to suit trade-off constraints adequately. By using a parallel
pipelined architecture, the processing tasks are divided into smaller tasks that can be processed
in parallel, resulting in improved computational efficiency. In addition to the parallel pipelined
architecture, quantization is applied to provide an approximation to address the finite word-length constraints of digital signal processing (DSP).
Quantization involves rounding or truncating the values of signals or coefficients to a fixed
number of bits, resulting in a reduced precision representation of the original data. This
quantization process helps to reduce the computational complexity and memory requirements
of the processor, which in turn reduces power consumption. The use of quantization in FFT
processors allows for a trade-off between computational efficiency and precision. Higher
quantization levels result in lower precision but also reduce power consumption and
computational complexity.
On the other hand, lower quantization levels result in higher precision but may increase power
consumption and computational complexity. Overall, the use of Static Quantized Radix-2
FFT/IFFT processors with a parallel pipelined architecture and quantization techniques
provides an effective solution to address the constraints associated with high-resolution FFT
processors, such as high power consumption, structural complexity, and computational
inefficiency. These approaches allow for efficient trade-offs between resolution, computational
efficiency, and power consumption in DSP applications, making them valuable tools in various
fields, including telecommunications, multimedia processing, and control systems.
Hardware accelerators such as the GreenWaves Technologies GAP8 [7] and Esperanto ET-SoC-1 are designed to provide high-performance AI inference in energy-constrained
environments. These systems use RISC-V cores and integrated TensorCore units to perform
complex operations such as convolution, pooling, and activation functions.
Overall, the existing systems related to RISC-V TensorCore for Edge AI offer promising
solutions for low-power, high-performance AI computing in resource-constrained
environments. With continued research and development, these systems are expected to play
an increasingly important role in enabling the next generation of edge AI applications.
2.3 Limitations of Existing System/Research Gaps
Existing systems that utilize RISC-V Tensor Cores for edge AI applications face several
limitations. One significant limitation is the power consumption of these systems because of
their floating-point units, which can be very power-hungry. This is because floating-point
arithmetic requires more computational resources and precision than fixed-point arithmetic [8].
Furthermore, the time consumed to pivot or reconfigure the system can be significant,
particularly in edge AI applications where real-time responsiveness is critical. This can be
especially problematic in situations where the edge AI system needs to adjust quickly to
changes in the input data or the environment. The need for pivoting or reconfiguration can also
increase the complexity and cost of the system, as it may require additional hardware, software,
or human intervention.
These limitations can make it challenging to use RISC-V Tensor Cores for edge AI applications,
especially in scenarios where power consumption, flexibility, and responsiveness are critical
factors. To address these limitations, researchers and engineers are exploring alternative
approaches to edge AI, such as using more efficient data representations, reducing the precision
of the arithmetic used, and exploring novel hardware architectures that are more flexible and
reprogrammable. Additionally, advances in machine learning algorithms and software
frameworks may also help to reduce the computational requirements of edge AI applications,
making them more feasible for deployment on resource-constrained devices.
Chapter 3
Proposed System
3.1 Design Details
FFT (Fast Fourier Transform) is an algorithm used to efficiently compute the discrete
Fourier transform (DFT) of a sequence of complex data points. The DFT is a mathematical
transformation that converts a signal from the time domain to the frequency domain, revealing
the underlying frequency components of the signal. The FFT algorithm takes advantage of the
symmetry and periodicity properties of the DFT to reduce the number of computations required
to compute the transform. The basic idea is to recursively break down the input sequence into
smaller and smaller sub-sequences until each sub-sequence consists of just two data points.
Then, by applying a series of mathematical operations to these smaller sub-sequences, the FFT
algorithm computes the DFT of the original sequence. The efficiency of the FFT algorithm
makes it an essential tool in many applications, especially in real-time signal processing, where
the computational complexity of the DFT can be a bottleneck. By using the FFT algorithm, the
DFT can be computed much faster, allowing for real-time processing of signals.
In the FFT algorithm, bit-reversal refers to the process of reordering the input data points in a
way that makes it possible to perform the required computations in a more efficient manner. To
compute the bit-reversal of an index value, we need to swap the binary digits of the index value
in a specific order. For example, consider the sequence of index values {0, 1, 2, 3, 4, 5, 6, 7}.
The binary representations of these values are {000, 001, 010, 011, 100, 101, 110, 111}. To
compute the bit-reversal of these index values, we need to swap their binary digits in a specific
order, such that the new sequence becomes {0, 4, 2, 6, 1, 5, 3, 7}. The bit-reversal step is
essential in the FFT algorithm as it enables the computation of the DFT in a more
efficient manner. By reordering the input data points, the algorithm can perform the required
computations in a more optimal way, reducing the number of operations required and improving
the overall performance of the algorithm.
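To make the reordering concrete, the following sketch shows one way to compute the bit-reversed index and permute an input block in place (the helper names bit_reverse and bit_reverse_reorder are illustrative, not taken from the project source); for 3 address bits it maps {0, 1, 2, 3, 4, 5, 6, 7} to {0, 4, 2, 6, 1, 5, 3, 7} exactly as described above.

// Reverse the lowest `bits` bits of index n, e.g. bit_reverse(1, 3) == 4.
static unsigned bit_reverse(unsigned n, unsigned bits) {
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (n & 1);  // shift collected bits up, append the LSB of n
        n >>= 1;
    }
    return r;
}

// Reorder an array of N = 2^bits samples into bit-reversed order in place.
template <typename T>
void bit_reverse_reorder(T x[], unsigned bits) {
    const unsigned N = 1u << bits;
    for (unsigned i = 0; i < N; i++) {
        unsigned j = bit_reverse(i, bits);
        if (j > i) {                       // swap each pair only once
            T tmp = x[i]; x[i] = x[j]; x[j] = tmp;
        }
    }
}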
The FFT arithmetic [9] is divided into two types: decimation-in-time (DIT) and
decimation-in-frequency (DIF). The radix-2 DIT FFT is adopted in this project.
An N-point discrete Fourier transform (DFT) of the input sequence x(n) is written as

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \quad W_N = e^{-j 2\pi / N}   ...(1)

Using radix-2 DIT, x(n) in (1) can be divided into its even and odd parts; taking
advantage of the periodicity and symmetry of W_N we obtain the following equations

X(k) = \sum_{m=0}^{N/2-1} x(2m) W_{N/2}^{mk} + W_N^{k} \sum_{m=0}^{N/2-1} x(2m+1) W_{N/2}^{mk}
X(k + N/2) = \sum_{m=0}^{N/2-1} x(2m) W_{N/2}^{mk} - W_N^{k} \sum_{m=0}^{N/2-1} x(2m+1) W_{N/2}^{mk}   ...(2)
The Radix-2 Butterfly is illustrated in Figure 3.1. In each butterfly structure, two complex inputs
P and Q are operated upon and become complex outputs P’ and Q’. Complex multiplication is
performed on Q and the twiddle factor, then the product is added to and subtracted from input
P to form outputs P’ and Q’. The exponent of the twiddle factor W is dependent on the stage
and group of its butterfly. The butterfly is usually represented by its flow graph, which looks
like a butterfly.
Figure 3.1 DIT Radix 2 Butterfly
The mathematical meaning of this butterfly is shown in Table 3.1 with separate equations for
real and imaginary parts.
Table 3.1 Equations for real and imaginary parts of P’ and Q’
Complex              Real Part                           Imaginary Part
P’ = P + Q * W       Pr’ = Pr + (Qr * Wr - Qi * Wi)      Pi’ = Pi + (Qr * Wi + Qi * Wr)
Q’ = P - Q * W       Qr’ = Pr - (Qr * Wr - Qi * Wi)      Qi’ = Pi - (Qr * Wi + Qi * Wr)
Figure 3.2 Radix-2 decimation in time 32-point FFT
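As an illustration of Table 3.1, the butterfly can be written directly as the short sketch below (float is used here purely for readability; the core itself uses the fixed-point format described in the next subsection, and the names cplx and butterfly are ours, not the project's source):

struct cplx { float re, im; };

// P' = P + Q*W and Q' = P - Q*W, expanded into real and imaginary parts.
void butterfly(cplx P, cplx Q, cplx W, cplx &Pout, cplx &Qout) {
    float prod_re = Q.re * W.re - Q.im * W.im;   // Re(Q * W)
    float prod_im = Q.re * W.im + Q.im * W.re;   // Im(Q * W)
    Pout.re = P.re + prod_re;  Pout.im = P.im + prod_im;
    Qout.re = P.re - prod_re;  Qout.im = P.im - prod_im;
}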
Communication interface:
The interface communication protocol was chosen to be AXI-Stream (AXIS) protocol as the
arrival of data is sequential and in batches. This makes the traditional bi-directional address
querying model slower in terms of implementation. AXIS helps move a block of data quickly
between the producer and consumer. The implementation details and protocol are described
in depth in the forthcoming subsection.
FFT core:
The FFT core uses a 16-bit fixed-point two's complement representation for both the real and
imaginary parts, with 8 bits dedicated to the signed integer part and the remaining 8 bits to the
fractional part, making it suitable for applications that require both integer and fractional
computation for real-time radix-2 FFT analysis. The block size was chosen to be 32, which
gives a frequency resolution of approximately 1378 Hz for audio sampled at 44.1 kHz. Hence
the total number of stages is log2(32) = 5.
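A minimal sketch of the corresponding data-type and size definitions, assuming the Vitis HLS ap_fixed types discussed later in this chapter (the names sample_t and cplx_t are illustrative):

#include "ap_fixed.h"

typedef ap_fixed<16, 8> sample_t;   // 16-bit two's complement: 8 integer + 8 fractional bits
struct cplx_t { sample_t re, im; }; // one complex FFT sample

const int N      = 32;              // FFT block size
const int STAGES = 5;               // log2(32) radix-2 stages
// At a 44.1 kHz sample rate, each of the 32 bins spans 44100 / 32 ≈ 1378 Hz.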
On each pass, the algorithm performs Radix-2 butterflies, where each butterfly picks up two
complex numbers and returns two complex numbers to the same
memory. The numbers returned to memory by the core are potentially larger than the numbers
picked up from memory. A strategy must be employed to accommodate this dynamic range
expansion. A full explanation of scaling strategies and their implications is beyond the scope of
this document; for more information about this topic, see A Simple Fixed-Point Error Bound
for the Fast Fourier Transform [10].
For Radix-2, the growth is by a factor of up to 1 + √2 ≈ 2.414. This implies a bit growth of up
to 2 bits. This bit growth can be handled in three ways:
● Performing the calculations with no scaling and carrying all significant integer bits to
the end of the computation
● Scaling at each stage using a fixed-scaling schedule
● Scaling automatically using block floating-point
All significant integer bits are retained when using full-precision unscaled arithmetic. The width
of the datapath increases to accommodate the bit growth through the butterfly. The fractional
bits created by the multiplication are truncated (or rounded) after the
multiplication. The width of the output is (input width + log2(transform length) + 1). This
accommodates the worst case scenario for bit growth.
We use scaling at each stage with a fixed scaling schedule. When using scaling,
a scaling schedule is used to divide by a factor of 1, 2, 4, or 8 in each stage. If scaling is
insufficient, a butterfly output might grow beyond the dynamic range and cause an overflow.
As a result of the scaling applied in the FFT implementation, the transform computed is a scaled
transform. The scale factor s is defined as

s = 2^{\sum_{i} b_i}

where b_i is the scaling (specified in bits) applied in stage i. The scaling results in the final output
sequence being modified by the factor 1/s. For the forward FFT, the output sequence X'(k),
k = 0, ..., N-1 computed by the core is defined as

X'(k) = \frac{1}{s} \sum_{n=0}^{N-1} x(n) e^{-j 2\pi nk / N}
If a Radix-2 algorithm scales by a factor of 2 or one right shift in terms of hardware
manipulation, in each stage, the factor of 1/s is equal to the factor of 1/N in the inverse FFT
equation.
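The sketch below shows how such a fixed scaling schedule could look in HLS code, assuming one bit of scaling (a single divide-by-2 right shift) after each of the 5 stages, so that s = 2^5 = N = 32; the butterfly pass itself is elided and the names are illustrative, not the project's exact source:

#include "ap_fixed.h"

typedef ap_fixed<16, 8> sample_t;
struct cplx_t { sample_t re, im; };

// Apply one bit of scaling to a 32-point block after a butterfly stage.
void scale_stage(cplx_t x[32]) {
    for (int k = 0; k < 32; k++) {
#pragma HLS PIPELINE II=1
        x[k].re = x[k].re >> 1;   // one right shift = divide by 2
        x[k].im = x[k].im >> 1;
    }
}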
3.2 Methodology
FPGAs (Field Programmable Gate Arrays) offer unique advantages in terms of
parallelism that can be exploited to accelerate computationally intensive tasks. FPGA-based
parallelism can provide a high degree of flexibility, efficiency, and performance compared to
other hardware accelerators. Moreover, FPGAs are synchronous hardware with a jitter of less
than one clock cycle, and they are not affected by the rather complex behavior of operating
system services, interrupt handling, etc. Due to the physical parallelism, the processes
do not influence each other [11].
One of the primary ways to exploit FPGA parallelism is by utilizing its reconfigurable hardware
resources, which can be customized to fit the specific requirements of the application. This
allows for the creation of highly optimized, parallel hardware designs that can process data at
high speeds. Additionally, FPGAs can be programmed using specialized languages such as
Verilog or VHDL, which enable fine-grained control over the hardware design.
Computation of trigonometric functions for the twiddle factors in hardware
Existing solutions use a CORDIC algorithm to compute sin and cos, whereas we implement
these computations with DSP slices, which are faster and more efficient because they are hard
macros etched into the FPGA silicon. The slices are vendor specific, in our case Xilinx;
specifically, we use the DSP48E1 version of the Xilinx IP.
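One way to keep only multiplies and adds in the datapath (the operations that map directly onto DSP48E1 slices) is to precompute the twiddle factors at initialization, as in the hedged sketch below; the table covers a 32-point FFT and the names are ours, not the project's source:

#include <cmath>
#include "ap_fixed.h"

typedef ap_fixed<16, 8> sample_t;
const double PI = 3.14159265358979323846;

sample_t W_re[16], W_im[16];        // W_32^k = e^{-j*2*pi*k/32}, k = 0..15

void init_twiddles() {
    for (int k = 0; k < 16; k++) {
        W_re[k] = (sample_t)( std::cos(2.0 * PI * k / 32.0));
        W_im[k] = (sample_t)(-std::sin(2.0 * PI * k / 32.0));
    }
}
// At run time, each butterfly then needs only the four multiplies and the
// add/subtract of Table 3.1, which the tool can map onto DSP48E1 slices.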
FPGAs are efficient for digital signal processing (DSP) applications because they can
implement custom, fully parallel algorithms. DSP applications use many binary multipliers and
accumulators that are best implemented in dedicated DSP slices. All 7 series FPGAs have many
dedicated, full-custom, low-power DSP slices, combining high speed with small size while
retaining system design flexibility. The DSP slices enhance the speed and efficiency of many
applications beyond digital signal processing, such as wide dynamic bus shifters, memory
address generators, wide bus multiplexers, and memory-mapped I/O registers. The basic
functionality of the DSP48E1 slice is shown in Figure 3.3.
Figure 3.3 Basic DSP48E1 Slice Functionality
Some highlights of the DSP functionality include:
● 25 × 18 two’s-complement multiplier:
β—‹ Dynamic bypass
● 48-bit accumulator:
β—‹ Can be used as a synchronous up/down counter
● Power saving pre-adder:
β—‹ Optimizes symmetrical filter applications and reduces DSP slice requirements
● Single-instruction-multiple-data (SIMD) arithmetic unit:
β—‹ Dual 24-bit or quad 12-bit add/subtract/accumulate
● Optional logic unit:
β—‹ Can generate any one of ten different logic functions of the two operands
● Pattern detector:
β—‹ Convergent or symmetric rounding
β—‹ 96-bit-wide logic functions when used in conjunction with the logic unit
● Advanced features:
β—‹ Optional pipelining and dedicated buses for cascading
Figure 3.4 7-Series FPGA DSP48E1 Slice
The DSP slice consists of a multiplier followed by an accumulator. At least three pipeline
registers are required for both multiply and multiply-accumulate operations to run at full speed.
The multiply operation in the first stage generates two partial products that need to be added
together in the second stage.
When only one or two registers exist in the multiplier design, the M register should always be
used to save power and improve performance.
Add/Sub and Logic Unit operations require at least two pipeline registers (input, output) to run
at full speed.
The cascade capabilities of the DSP slice are extremely efficient at implementing high speed
pipelined filters built on the adder cascades instead of adder trees.
Multiplexers are controlled with dynamic control signals, such as OPMODE, ALUMODE, and
CARRYINSEL, enabling a great deal of flexibility. Designs using registers and dynamic
opmodes are better equipped to take advantage of the DSP slice capabilities than combinatorial
multiplies.
In general, the DSP slice supports both sequential and cascaded operations due to the dynamic
OPMODE and cascade capabilities. Fast Fourier Transforms (FFTs), floating point,
computation (multiply, add/sub, divide), counters, and large bus multiplexers are some
applications of the DSP slice.
Additional capabilities of the DSP slice include synchronous resets and clock enables, dual A
input pipeline registers, pattern detection, Logic Unit functionality, single instruction/multiple
data (SIMD) functionality, and MACC and Add-Acc extension to 96 bits. The DSP slice
supports convergent and symmetric rounding, terminal count detection and auto-resetting for
counters, and overflow/underflow detection for sequential accumulators.
ALU functions in the 7 series FPGA DSP48E1 slice are identical to those in the Virtex-6 FPGA
DSP48E1 slice.
AXI4-STREAM INTERFACE
AXI4-Stream interface can be applied to any input argument and any array or pointer output
argument. Because an AXI4-Stream interface transfers data in a sequential streaming manner,
it cannot be used with arguments that are both read and written. In terms of data layout, the data
type of the AXI4-Stream is aligned to the next byte. For example, if the size of the data type is
12 bits, it will be extended to 16 bits. Depending on whether a signed/unsigned interface is
selected, the extended bits are either sign-extended or zero-extended. If the stream data type is
a user-defined struct, the struct is aggregated and aligned to the size of the largest data element
within the struct. As shown in Figure 3.5, AXI4-Stream is the communication protocol being
used.
Figure 3.5 I/O to the FFT compute unit or core
The following code examples show how the packed alignment depends on your struct type. If
the struct contains only char type, as shown in the following example, then it will be packed
with alignment of one byte. Total size of the struct will be two bytes:
struct A {
char foo;
char bar;
};
However, if the struct has elements with different data types, as shown below, then it will be
packed and aligned to the size of the largest data element, or four bytes in this example. Element
bar will be padded with three bytes resulting in a total size of eight bytes for the struct:
struct A {
int foo;
char bar;
};
The AXI4-Stream interface is implemented as a struct type in Vitis HLS and has the following
signature (defined in ap_axi_sdata.h):
template <typename T, size_t WUser, size_t WId, size_t WDest> struct
axis { .. };
Where:
T: Stream data type
WUser: Width of the TUSER signal
WId: Width of the TID signal
WDest: Width of the TDest signal
When the stream data type (T) is a simple integer type, there are two predefined AXI4-Stream
implementations available:
A signed implementation of the AXI4-Stream class (or more simply ap_axis<WData,
WUser, WId, WDest>):
hls::axis<ap_int<WData>, WUser, WId, WDest>
An unsigned implementation of the AXI4-Stream class (or more simply ap_axiu<WData,
WUser, WId, WDest>):
hls::axis<ap_uint<WData>, WUser, WId, WDest>
The value specified for the WUser, WId, and WDest template parameters controls the usage of
side-channel signals in the AXI4-Stream interface.
When the hls::axis class is used, the generated RTL will typically contain the actual data
signal TDATA, and the following additional signals: TVALID, TREADY, TKEEP, TSTRB, TLAST,
TUSER, TID, and TDEST.
TVALID, TREADY, and TLAST are necessary control signals for the AXI4-Stream protocol.
TKEEP, TSTRB, TUSER, TID, and TDEST signals are special signals that can be used to pass
around additional bookkeeping data.
How AXI4-Stream Works
AXI4-Stream is a protocol designed for transporting arbitrary unidirectional data. In an
AXI4-Stream, TDATA width of bits is transferred per clock cycle. The transfer is started once the
producer sends the TVALID signal and the consumer responds by sending the TREADY signal
(once it has consumed the initial TDATA). At this point, the producer will start sending TDATA
and TLAST (and TUSER if needed to carry additional user-defined sideband data). TLAST signals
the last byte of the stream, so the consumer keeps consuming the incoming TDATA until TLAST
is asserted.
Figure 3.6 AXI4-Stream Handshake
AXI4-Stream has additional optional features like sending positional data with TKEEP and
TSTRB ports which makes it possible to multiplex both the data position and data itself on the
TDATA signal. Using the TID and TDEST signals, you can route streams, as these fields
roughly correspond to a stream identifier and a stream destination identifier [12].
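A minimal, self-contained sketch of this handshake from the HLS side is shown below; it simply forwards one packet stream to another until TLAST is seen. The function name axis_passthrough and the 32-bit payload width are assumptions, not the project's actual top function:

#include "ap_axi_sdata.h"
#include "ap_int.h"
#include "hls_stream.h"

typedef hls::axis<ap_int<32>, 0, 0, 0> pkt_t;   // TDATA only, no side channels

void axis_passthrough(hls::stream<pkt_t> &in, hls::stream<pkt_t> &out) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE ap_ctrl_none port=return
    pkt_t v;
    do {
#pragma HLS PIPELINE II=1
        v = in.read();    // blocks until the TVALID/TREADY handshake completes
        out.write(v);     // forwards TDATA; TKEEP and TLAST travel with the beat
    } while (!v.last);    // stop after the beat that asserts TLAST
}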
PRECISION FIXED-POINT DATA TYPES
Fixed-point data types model the data as an integer and fraction bits. In this example, the Vitis
HLS ap_fixed type is used to define an 18-bit variable with 6 bits representing the numbers
above the binary point and 12 bits representing the value below the binary point. The variable
is specified as signed and the quantization mode is set to round to plus infinity. Because the
overflow mode is not specified, the default wrap-around mode is used for overflow.
#include <ap_fixed.h>
...
ap_fixed<18,6,AP_RND > my_type;
...
When performing calculations where the variables have different numbers of bits or different
precision, the binary point is automatically aligned. The behavior of the C++ simulations
performed using fixed-point types matches the resulting hardware. This allows you to analyze the
bit-accurate, quantization, and overflow behaviors using fast C-level simulation. Fixed-point types
are a useful replacement for floating-point types, which require many clock cycles to complete.
Unless the entire range of the floating-point type is required, the same accuracy can often be
implemented with a fixed-point type, resulting in the same accuracy with smaller and faster
hardware. A summary of the ap_fixed type identifiers is provided in the following table.
Table 3.2 Fixed-Point Identifier Summary
Identifier   Description
W            Word length in bits.
I            The number of bits used to represent the integer value, that is, the number of
             integer bits to the left of the binary point. When this value is negative, it
             represents the number of implicit sign bits (for signed representation), or the
             number of implicit zero bits (for unsigned representation) to the right of the
             binary point. For example:
                 ap_fixed<2, 0> a = -0.5;            // a can be -0.5
                 ap_ufixed<1, 0> x = 0.5;            // 1-bit representation; x can be 0 or 0.5
                 ap_ufixed<1, -1> y = 0.25;          // 1-bit representation; y can be 0 or 0.25
                 const ap_fixed<1, -7> z = 1.0/256;  // 1-bit representation for z = 2^-8
Q            Quantization mode: this dictates the behavior when greater precision is
             generated than can be defined by the smallest fractional bit in the variable
             used to store the result.
                 AP_RND          Round to plus infinity
                 AP_RND_ZERO     Round to zero
                 AP_RND_MIN_INF  Round to minus infinity
                 AP_RND_INF      Round to infinity
                 AP_RND_CONV     Convergent rounding
                 AP_TRN          Truncation to minus infinity (default)
                 AP_TRN_ZERO     Truncation to zero
O            Overflow mode: this dictates the behavior when the result of an operation
             exceeds the maximum (or minimum in the case of negative numbers) possible
             value that can be stored in the variable used to store the result.
                 AP_SAT          Saturation
                 AP_SAT_ZERO     Saturation to zero
                 AP_SAT_SYM      Symmetrical saturation
                 AP_WRAP         Wrap around (default)
                 AP_WRAP_SM      Sign magnitude wrap around
N            The number of saturation bits in overflow wrap modes.
The default maximum width allowed for ap_[u]fixed data types is 1024 bits. This default may
be overridden by defining the macro AP_INT_MAX_W with a positive integer value less than
or equal to 32768 before inclusion of the ap_int.h header file. The following is an example of
overriding AP_INT_MAX_W:
#define AP_INT_MAX_W 4096 // Must be defined before next line
#include "ap_fixed.h"
ap_fixed<4096, 2048> very_wide_var;
Arbitrary precision data types are highly recommended when using Vitis HLS. As shown in the
earlier example, they typically have a significant positive benefit on the quality of the hardware
implementation.
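A short sketch of the automatic binary-point alignment described above, with values chosen only for illustration:

#include "ap_fixed.h"

void alignment_example() {
    ap_fixed<18, 6, AP_RND> a = 2.75;   // 6 integer bits, 12 fractional bits
    ap_ufixed<10, 2>        b = 1.5;    // 2 integer bits, 8 fractional bits
    ap_fixed<24, 8>         c = a * b;  // binary points are aligned automatically
    // c holds 4.125, and the C simulation is bit-accurate to the generated RTL.
}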
LOOP PIPELINING
When pipelining loops, the optimal balance between area and performance is typically found
by pipelining the innermost loop. This also results in the fastest runtime. The following code
example demonstrates the trade-offs when pipelining loops and functions.
#include "loop_pipeline.h"
dout_t loop_pipeline(din_t A[N]) {
int i,j;
static dout_t acc;
LOOP_I:for(i=0; i < 20; i++){
LOOP_J: for(j=0; j < 20; j++){
acc += A[i] * j;
}
}
return acc;
}
If the innermost (LOOP_J) is pipelined, there is one copy of LOOP_J in hardware, (a single
multiplier). Vitis HLS automatically flattens the loops when possible, as in this case, and
effectively creates a new single loop of 20*20 iterations. Only one multiplier operation and one
array access need to be scheduled; then the loop iterations can be scheduled as a single loop-body entity (20x20 loop iterations). If the outer loop (LOOP_I) is pipelined, the inner loop
(LOOP_J) is unrolled creating 20 copies of the loop body: 20 multipliers and 20 array accesses
must now be scheduled. Then each iteration of LOOP_I can be scheduled as a single entity.
If the top-level function is pipelined, both loops must be unrolled: 400 multipliers and 400
array accesses must now be scheduled. It is very unlikely that Vitis HLS will produce a design
with 400 multiplications because in most designs, data dependencies often prevent maximal
parallelism, for example, even if a dual-port RAM is used for A[N], the design can only access
two values of A[N] in any clock cycle.
The concept to appreciate when selecting at which level of the hierarchy to pipeline is to
understand that pipelining the innermost loop gives the smallest hardware with generally
acceptable throughput for most applications. Pipelining the upper levels of the hierarchy unrolls
all sub-loops and can create many more operations to schedule (which could impact runtime
and memory capacity), but typically gives the highest performance design in terms of
throughput and latency.
To summarize the above options:
● Pipeline LOOP_J
Latency is approximately 400 cycles (20x20) and requires less than 100 LUTs and
registers (the I/O control and FSM are always present).
● Pipeline LOOP_I
Latency is approximately 20 cycles but requires a few hundred LUTs and registers.
About 20 times the logic as the first option, minus any logic optimizations that can be
made.
● Pipeline function loop_pipeline
Latency is approximately 10 (20 dual-port accesses) but requires thousands of LUTs
and registers (about 400 times the logic of the first option minus any optimizations that
can be made).
Imperfect Nested Loops
When the inner loop of a loop hierarchy is pipelined, Vitis HLS flattens the nested loops to
reduce latency and improve overall throughput by removing any cycles caused by loop
transitioning (the checks performed on the loop index when entering and exiting loops). Such
checks can result in a clock delay when transitioning from one loop to the next (entry and/or
exit).
Imperfect loop nests, or the inability to flatten them, results in additional clock cycles to enter
and exit the loops. When the design contains nested loops, analyze the results to ensure as
many nested loops as possible have been flattened: review the log file or look in the synthesis
report for cases, as shown in Loop Pipelining, where the loop labels have been merged
(LOOP_I and LOOP_J are now reported as LOOP_I_LOOP_J) [13].
PIPELINING PARADIGM
Pipelining is a commonly used concept that you will encounter in everyday life. A good
example is the production line of a car factory, where each specific task such as installing the
engine, installing the doors, and installing the wheels, is often done by a separate and unique
workstation. The stations carry out their tasks in parallel, each on a different car. Once a car has
had one task performed, it moves to the next station. Variations in the time needed to complete
the tasks can be accommodated by buffering (holding one or more cars in a space between the
stations) and/or by stalling (temporarily halting the upstream stations) until the next station
becomes available.
Suppose that assembling one car requires three tasks A, B, and C that take 20, 10, and 30
minutes, respectively. Then, if all three tasks were performed by a single station, the factory
would output one car every 60 minutes. By using a pipeline of three stations, the factory would
output the first car in 60 minutes, and then a new one every 30 minutes. As this example shows,
pipelining does not decrease the latency, that is, the total time for one item to go through the
whole system. It does however increase the system's throughput, that is, the rate at which new
items are processed after the first one.
Since the throughput of a pipeline cannot be better than that of its slowest element, the
programmer should try to divide the work and resources among the stages so that they all take
the same time to complete their tasks. In the car assembly example above, if the three tasks A.
B and C took 20 minutes each, instead of 20, 10, and 30 minutes, the latency would still be 60
minutes, but a new car would then be finished every 20 minutes, instead of 30. The diagram
below shows a hypothetical manufacturing line tasked with the production of three cars.
Assuming each of the tasks A, B and C takes 20 minutes, a sequential production line would
take 180 minutes to produce three cars. A pipelined production line would take only 100
minutes to produce three cars.
The time taken to produce the first car is 60 minutes and is called the iteration latency of the
pipeline. After the first car is produced, the next two cars only take 20 minutes each and this is
known as the initiation interval (II) of the pipeline. The overall time taken to produce the three
cars is 100 minutes and is referred to as the total latency of the pipeline, i.e. total latency =
iteration latency + II * (number of items - 1). Therefore, improving II improves total latency,
but not the iteration latency. From the programmer's point of view, the pipelining paradigm can
be applied to functions and loops in the design. After an initial setup cost, the ideal throughput
goal will be to achieve an II of 1, i.e., after the initial setup delay, the output will be available
at every cycle of the pipeline. In our example above, after an initial setup delay of 60 minutes,
a car is then available every 20 minutes.
Figure 3.7 Pipelining Paradigm
Pipelining is a classical micro-level architectural optimization that can be applied to multiple
levels of abstraction. We covered task-level pipelining with the producer-consumer paradigm
earlier. This same concept applies to the instruction-level. This is in fact key to keeping the
producer-consumer pipelines (and streams) filled and busy. The producer-consumer pipeline
will only be efficient if each task produces/consumes data at a high rate, and hence the need for
the instruction-level pipelining (ILP).
Due to the way pipelining uses the same resources to execute the same function over time, it is
considered a static optimization since it requires complete knowledge about the latency of each
task. Due to this, the low level instruction pipelining technique cannot be applied to dataflow
type networks where the latency of the tasks can be unknown as it is a function of the input
data. The next section details how to leverage the three basic paradigms that have been
introduced to model different types of task parallelism [14].
PIPELINE DEPENDENCIES
Vitis HLS constructs a hardware datapath that corresponds to the C/C++ source code.
When there is no pipeline directive, the execution is sequential so there are no dependencies to
take into account. But when the design has been pipelined, the tool needs to deal with the same
dependencies as found in processor architectures for the hardware that Vitis HLS generates.
Typical cases of data dependencies or memory dependencies are when a read or a write occurs
after a previous read or write.
A read-after-write (RAW), also called a true dependency, is when an instruction (and data it
reads/uses) depends on the result of a previous operation.
I1: t = a * b;
I2: c = t + 1;
The read in statement I2 depends on the write of t in statement I1. If the instructions are
reordered, it uses the previous value of t.
A write-after-read (WAR), also called an anti-dependence, is when an instruction cannot update
a register or memory (by a write) before a previous instruction has read the data.
I1: b = t + a;
I2: t = 3;
The write in statement I2 cannot execute before statement I1, otherwise the result of b is invalid.
A write-after-write (WAW) is a dependence when a register or memory must be written in
specific order otherwise other instructions might be corrupted.
I1: t = a * b;
I2: c = t + 1;
I3: t = 1;
The write in statement I3 must happen after the write in statement I1. Otherwise, the statement
I2 result is incorrect.
A read-after-read has no dependency as instructions can be freely reordered if the variable is
not declared as volatile. If it is, then the order of instructions has to be maintained.
For example, when a pipeline is generated, the tool needs to take care that a register or memory
location read at a later stage has not been modified by a previous write.
This is a true dependency or read-after-write (RAW) dependency. A specific example is:
int top(int a, int b) {
    int t, c;
    I1: t = a * b;
    I2: c = t + 1;
    return c;
}
Statement I2 cannot be evaluated before statement I1 completes because there is a dependency
on variable t. In hardware, if the multiplication takes 3 clock cycles, then I2 is delayed for that
amount of time. If the above function is pipelined, then VHLS detects this as a true dependency
and schedules the operations accordingly. It uses data forwarding optimization to remove the
RAW dependency, so that the function can operate at II =1.
Memory dependencies arise when the example applies to an array and not just variables.
int top(int a) {
    int r = 1, rnext, m, i, out;
    static int mem[256];
    L1: for (i = 0; i <= 254; i++) {
#pragma HLS PIPELINE II=1
        I1: m = r * a; mem[i+1] = m;     // line 7
        I2: rnext = mem[i]; r = rnext;   // line 8
    }
    return r;
}
In the above example, scheduling of loop L1 leads to a scheduling warning message:
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint
(II = 1, distance = 1) between 'store' operation (top.cpp:7) of variable 'm',
top.cpp:7 on array 'mem' and 'load' operation ('rnext', top.cpp:8) on array 'mem'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 3.
There are no issues within the same iteration of the loop as you write an index and read another
one. The two instructions could execute at the same time, concurrently. However, observe the
read and writes over a few iterations:
// Iteration for i=0
I1: m = r * a; mem[1] = m;       // line 7
I2: rnext = mem[0]; r = rnext;   // line 8
// Iteration for i=1
I1: m = r * a; mem[2] = m;       // line 7
I2: rnext = mem[1]; r = rnext;   // line 8
// Iteration for i=2
I1: m = r * a; mem[3] = m;       // line 7
I2: rnext = mem[2]; r = rnext;   // line 8
When considering two successive iterations, the multiplication result m (with a latency = 2)
from statement I1 is written to a location that is read by statement I2 of the next iteration of
the loop into rnext. In this situation, there is a RAW dependence as the next loop iteration
cannot start reading mem[i] before the previous computation's write completes.
Figure 3.8 Dependency Example
Note that if the clock frequency is increased, then the multiplier needs more pipeline stages and
increased latency. This will force II to increase as well.
Consider the following code, where the operations have been swapped, changing the
functionality.
int top(int a) {
    int r, m, i;
    static int mem[256];
    L1: for (i = 0; i <= 254; i++) {
#pragma HLS PIPELINE II=1
        I1: r = mem[i];               // line 7
        I2: m = r * a; mem[i+1] = m;  // line 8
    }
    return r;
}
The scheduling warning is:
INFO: [SCHED 204-61] Pipelining loop 'L1'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint
(II = 1, distance = 1) between 'store' operation (top.cpp:8) of variable 'm',
top.cpp:8 on array 'mem' and 'load' operation ('r', top.cpp:7) on array 'mem'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint
(II = 2, distance = 1) between 'store' operation (top.cpp:8) of variable 'm',
top.cpp:8 on array 'mem' and 'load' operation ('r', top.cpp:7) on array 'mem'.
WARNING: [SCHED 204-68] Unable to enforce a carried dependency constraint
(II = 3, distance = 1) between 'store' operation (top.cpp:8) of variable 'm',
top.cpp:8 on array 'mem' and 'load' operation ('r', top.cpp:7) on array 'mem'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 4, Depth: 4.
Observe the continued read and writes over a few iterations:
// Iteration with i=0
I1: r = mem[0];               // line 7
I2: m = r * a; mem[1] = m;    // line 8
// Iteration with i=1
I1: r = mem[1];               // line 7
I2: m = r * a; mem[2] = m;    // line 8
// Iteration with i=2
I1: r = mem[2];               // line 7
I2: m = r * a; mem[3] = m;    // line 8
A longer II is needed because the RAW dependence is via reading r from mem[i], performing
the multiplication, and writing to mem[i+1].
Removing False Dependencies to Improve Loop Pipelining
False dependencies are dependencies that arise when the compiler is too conservative. These
dependencies do not exist in the real code, but cannot be determined by the compiler. These
dependencies can prevent loop pipelining.
The following example illustrates false dependencies. In this example, the read and write
accesses are to two different addresses in the same loop iteration. Both of these addresses are
dependent on the input data, and can point to any individual element of the hist array. Because
of this, Vitis HLS assumes that both of these accesses can access the same location. As a result,
it schedules the read and write operations to the array in alternating cycles, resulting in a loop
II of 2. However, the code shows that hist[old] and hist[val] can never access the same
location because they are in the else branch of the conditional if(old == val).
void histogram(int in[INPUT_SIZE], int hist[VALUE_SIZE]) {
    int acc = 0;
    int i, val;
    int old = in[0];
    for (i = 0; i < INPUT_SIZE; i++) {
#pragma HLS PIPELINE II=1
        val = in[i];
        if (old == val) {
            acc = acc + 1;
        } else {
            hist[old] = acc;
            acc = hist[val] + 1;
        }
        old = val;
    }
    hist[old] = acc;
}
To overcome this deficiency, you can use the DEPENDENCE directive to provide Vitis HLS with
additional information about the dependencies.
void histogram(int in[INPUT_SIZE], int hist[VALUE_SIZE]) {
    int acc = 0;
    int i, val;
    int old = in[0];
#pragma HLS DEPENDENCE variable=hist type=intra direction=RAW dependent=false
    for (i = 0; i < INPUT_SIZE; i++) {
#pragma HLS PIPELINE II=1
        val = in[i];
        if (old == val) {
            acc = acc + 1;
        } else {
            hist[old] = acc;
            acc = hist[val] + 1;
        }
        old = val;
    }
    hist[old] = acc;
}
When specifying dependencies there are two main types:
● Inter
Specifies the dependency is between different iterations of the same loop.
If this is specified as FALSE, it allows Vitis HLS to perform operations in parallel if the loop
is pipelined, unrolled, or partially unrolled, and it prevents such concurrent operation when
specified as TRUE.
● Intra
Specifies dependence within the same iteration of a loop, for example an array being
accessed at the start and end of the same iteration.
When intra dependencies are specified as FALSE, Vitis HLS may move operations
freely within the loop, increasing their mobility and potentially improving performance
or area. When the dependency is specified as TRUE, the operations must be performed
in the order specified.
Scalar Dependencies
Some scalar dependencies are much harder to resolve and often require changes to the source
code. A scalar data dependency could look like the following:
while (a != b) {
if (a > b) a -= b;
else b -= a;
}
The next iteration of this loop cannot start until the current iteration has calculated the updated
values of a and b, as shown in Figure 3.9.
Figure 3.9 Scalar Dependency
If the result of the previous loop iteration must be available before the current iteration can
begin, loop pipelining is not possible. If Vitis HLS cannot pipeline with the specified initiation
interval, it increases the initiation interval. If it cannot pipeline at all, as shown by the above
example, it halts pipelining and proceeds to output a non-pipelined design [15].
Chapter 4
Implementation Details
4.1 Experimental Setup
An FPGA is an integrated circuit (IC) equipped with configurable logic blocks (CLBs)
and other features that can be programmed and reprogrammed by a user. The term “field-programmable” indicates that the FPGA’s abilities are adjustable and not hardwired by the
manufacturer like other ICs.
FPGAs are integrated circuits (ICs) that fall under the umbrella of programmable logic devices
(PLDs). The fundamental functionality of FPGA technology is built on adaptive hardware,
which has the unique ability to be modified after manufacture. Arrays of hardware blocks, each
configurable, can be connected as needed, allowing highly efficient, domain-specific
architectures to be built for any application.
The architecture of FPGAs makes them an efficient solution for hardware acceleration. Devices
such as ASICs and GPUs use an antiquated method of jumping between programming and
memory. They also don’t accommodate applications where real-time information is needed,
since the high amount of power required for storage and retrieval tasks causes performance
lags.
Unlike ASICs and GPUs, FPGAs don’t need to jump between memory and programming,
which makes the process of storing and retrieving data more efficient. And since FPGA
architecture is more flexible, you can customize how much power you’d like an FPGA to utilize
for a specific task. That flexibility can help offload energy-consuming tasks to one or several
FPGAs from a conventional CPU or another device. And since many FPGAs can be
reprogrammed, you can easily implement upgrades and adjustments to a hardware acceleration
system.
FPGA programming uses an HDL to manipulate circuits depending on what capabilities you
want the device to have. The process is different from programming a GPU or CPU, since you
aren’t writing a program that will run sequentially. Rather, you’re using an HDL to create
circuits and physically change the hardware depending on what you want it to do.
The process is similar to programming software in that you write code that is turned into a
binary file and loaded onto the FPGA. But the outcome is that the HDL makes physical changes
to the hardware, rather than strictly optimizing the device to run software. A program on an
FPGA pieces together lower-level elements like logic gates and memory blocks, which work in
concert to complete a task. Because you’re manipulating the hardware from the ground up,
FPGAs allow a great deal of flexibility. You can adjust basic functions such as memory or
power usage depending on the task[16].
4.2 Software and Hardware Setup
HLS: The Xilinx® Vivado® High-Level Synthesis (HLS) compiler provides a
programming environment similar to those available for application development on both
standard and specialized processors. Vivado HLS shares key technology with processor
compilers for the interpretation, analysis, and optimization of C/C++ programs. The main
difference is in the execution target of the application.
By targeting an FPGA as the execution fabric, Vivado HLS enables a software engineer to
optimize code for throughput, power, and latency without the need to address the performance
bottleneck of a single memory space and limited computational resources. This allows the
implementation of computationally intensive software algorithms into actual products, not just
functionality demonstrators. This chapter introduces how the Vivado HLS compiler works and
how it differs from a traditional software compiler. Application code targeting the Vivado HLS
compiler uses the same categories as any processor compiler. Vivado HLS analyzes all
programs in terms of:
● Operations
● Conditional statements
● Loops
● Functions
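As a small, self-contained illustration (not code from this project), the function below contains all four categories: arithmetic operations, conditional statements, a loop, and a called sub-function, each of which the HLS compiler analyzes and schedules onto the FPGA fabric.

// Hypothetical example showing the four categories analyzed by an HLS compiler.
static int clamp(int x, int lo, int hi) {            // function
    return (x < lo) ? lo : (x > hi) ? hi : x;        // conditional statements
}

void scale_and_clamp(const int in[64], int out[64], int gain) {
    for (int i = 0; i < 64; i++) {                   // loop
#pragma HLS PIPELINE II=1
        out[i] = clamp(in[i] * gain, -128, 127);     // operations
    }
}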
FFT ANALYSIS
A fast Fourier transform (FFT) is a highly optimized implementation of the discrete
Fourier transform (DFT), which converts discrete signals from the time domain to the frequency
domain. FFT computations provide information about the frequency content, phase, and other
properties of the signal.
Figure 4.1: Audio signal decomposed into its frequency components using FFT.
Popular FFT algorithms include the Cooley-Tukey algorithm, prime factor FFT algorithm, and
Rader’s FFT algorithm. The most commonly used FFT algorithm is the Cooley-Tukey
algorithm, which reduces a large DFT into smaller DFTs to increase computation speed and
reduce complexity. FFT has applications in many fields.
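To make the transform concrete, the reference definition of the N-point DFT is X[k] = sum over n of x[n]·e^(-j2πkn/N), which costs O(N²) operations; the Cooley-Tukey FFT computes the same result in O(N log N). The C sketch below is a plain software reference implementation of this definition, shown only for comparison; it is not the fixed-point hardware FFT developed in this project.

#include <math.h>

// Reference O(N^2) DFT: X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N).
// xr/xi hold the real and imaginary parts of the input, Xr/Xi of the output.
void dft_reference(int n, const double xr[], const double xi[],
                   double Xr[], double Xi[]) {
    const double PI = 3.14159265358979323846;
    for (int k = 0; k < n; k++) {
        Xr[k] = 0.0;
        Xi[k] = 0.0;
        for (int t = 0; t < n; t++) {
            double ang = -2.0 * PI * (double)k * (double)t / (double)n;
            Xr[k] += xr[t] * cos(ang) - xi[t] * sin(ang);   // real part
            Xi[k] += xr[t] * sin(ang) + xi[t] * cos(ang);   // imaginary part
        }
    }
}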
Fast Fourier Transform (FFT) analysis is a widely used technique for analyzing and processing
signals, such as audio and image data, in various applications. The use of Field-Programmable
Gate Arrays (FPGAs) in combination with RISC-V Tensorflow can significantly accelerate
FFT analysis, enabling real-time processing of data in applications like telecommunications,
imaging, and audio processing.
RISC-V Tensorflow is an open-source deep learning framework optimized for RISC-V
processors, which provides an efficient and flexible platform for implementing FFT analysis
algorithms. The flexibility of RISC-V allows the design and implementation of customized
accelerators that can perform FFT computations efficiently, leading to faster processing times
and lower energy consumption.
FPGAs offer high-speed and low-latency performance for FFT computations, making them
well-suited for implementing FFT accelerators. FPGAs can be programmed and customized for
specific FFT workloads, improving efficiency and reducing energy consumption. Additionally,
FPGAs can support high-throughput data processing, enabling real-time processing of data in
applications like audio and image processing.
The combination of RISC-V Tensorflow and FPGAs can lead to the development of specialized
accelerators for FFT computations, improving the efficiency and accuracy of FFT analysis.
These accelerators can be customized for specific FFT algorithms, such as the Cooley-Tukey
FFT algorithm, leading to more efficient and faster processing times.
Moreover, the use of FPGAs in FFT analysis can enable the processing of larger data sets,
making it possible to analyze complex signals in real-time. This is particularly important in
applications like telecommunications, where the ability to analyze and process data in real-time
can significantly improve the performance and reliability of communication systems.
In conclusion, the use of RISC-V Tensorflow and FPGAs presents a promising solution for
accelerating FFT analysis, improving the efficiency and accuracy of signal processing in
various applications. As the use of FPGAs in AI and signal processing continues to grow, the
development of specialized accelerators for FFT analysis is expected to become more
widespread, leading to significant advancements in the field of signal processing.
FFT in MATLAB
MATLAB® provides many functions like fft, ifft, and fft2 with which FFT can be implemented
directly. In MATLAB, FFT implementation is optimized to choose from among various FFT
algorithms depending on the data size and computation. Similarly, Simulink® provides blocks
for FFT that can be used in Model-Based Design and simulation. MATLAB and Simulink also
support implementation of FFT on specific hardware such as FPGAs, processors including
ARM, and NVIDIA GPUs, through automatic code generation.
Here's a brief overview of how to use the FFT function in MATLAB:
1. Load your signal data into a MATLAB variable.
2. Apply a windowing function to the data if necessary. This can help reduce spectral
leakage and improve the accuracy of the FFT analysis.
3. Apply the FFT function to the signal data. The FFT function in MATLAB is called fft().
The function takes the signal data as input and returns the FFT coefficients, which
represent the frequency components of the signal.
4. Use the FFT coefficients to plot the frequency spectrum of the signal. You can use the
abs() function to get the magnitude of the FFT coefficients, and then plot the magnitude
against the frequency.
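The same post-processing steps can be written out in C. The sketch below is illustrative only: it assumes an N-point complex FFT result is already available (for example from a hardware FFT core), applies a Hann window as in step 2, and converts the coefficients into a magnitude spectrum with the usual bin-to-frequency mapping f_k = k·fs/N. The constants and buffer names are assumptions, with 32 points and 44.1 kHz chosen to match the design discussed later.

#include <math.h>

#define FFT_N 32              /* FFT size, matching the 32-point core in this report */
#define FS    44100.0         /* sample rate in Hz (CD-quality audio)                */

// Step 2: Hann window applied to the time-domain samples before the FFT.
void hann_window(double x[FFT_N]) {
    const double PI = 3.14159265358979323846;
    for (int n = 0; n < FFT_N; n++)
        x[n] *= 0.5 * (1.0 - cos(2.0 * PI * n / (FFT_N - 1)));
}

// Step 4: magnitude of each FFT coefficient and the frequency it corresponds to.
void magnitude_spectrum(const double Xr[FFT_N], const double Xi[FFT_N],
                        double mag[FFT_N], double freq[FFT_N]) {
    for (int k = 0; k < FFT_N; k++) {
        mag[k]  = sqrt(Xr[k] * Xr[k] + Xi[k] * Xi[k]);    /* |X[k]|             */
        freq[k] = k * FS / FFT_N;                          /* bin k -> k*fs/N Hz */
    }
}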
Chapter 5
Results and Discussion
5.1 Performance Evaluation Parameters
In FPGA (Field Programmable Gate Array), latency refers to the delay between when a signal
is input to the FPGA and when the FPGA responds with a corresponding output. The latency
of an FPGA can be affected by factors such as the number of logic elements, the clock
frequency, and the routing architecture.
The iteration interval (also called the initiation interval, II) of an FPGA design is the number of
clock cycles between the start of successive loop iterations or task invocations; an II of 1 means
a new iteration begins every clock cycle. Together with the clock period, the iteration interval
determines throughput: lowering the II or raising the clock frequency improves performance, but
can increase the resource usage and power consumption of the design.
Error tolerance in an FPGA refers to the ability of the FPGA to handle errors that may occur
during operation. FPGAs can be designed with various error correction mechanisms, such as
ECC (Error Correction Code), parity checking, and redundant logic elements. These mechanisms
help detect and correct errors that may occur in the FPGA, improving its reliability and reducing
the likelihood of failures. In addition, FPGAs can include built-in self-test (BIST) capabilities to
detect and diagnose faults during operation. In this work, error tolerance refers more specifically
to the numerical error introduced by the fixed-point FFT implementation relative to a reference
output, which we bound to below 1%.
5.2 Implementation Results
Figure 5.1 Scaled FFT Dataflow
Table 5.1 Performance Estimation Latency Details
Instance | Module | Latency (cycles) min/max | Latency (absolute) min/max | Interval (cycles) min/max | Pipeline Type
Loop_VITIS_LOOP_88_1_proc2_U0 | Loop_VITIS_LOOP_88_1_proc2 | 34 / 34 | 0.340 us / 0.340 us | 34 / 34 | no
bitreverse_U0 | bitreverse | 35 / 35 | 0.350 us / 0.350 us | 35 / 35 | no
FFT0_13_U0 | FFT0_13 | 18 / 18 | 0.180 us / 0.180 us | 18 / 18 | no
FFT0_14_U0 | FFT0_14 | 22 / 22 | 0.220 us / 0.220 us | 22 / 22 | no
FFT0_15_U0 | FFT0_15 | 22 / 22 | 0.220 us / 0.220 us | 22 / 22 | no
FFT0_16_U0 | FFT0_16 | 22 / 22 | 0.220 us / 0.220 us | 22 / 22 | no
FFT0_U0 | FFT0 | 22 / 22 | 0.220 us / 0.220 us | 22 / 22 | no
Loop_VITIS_LOOP_98_2_proc7_U0 | Loop_VITIS_LOOP_98_2_proc7 | 35 / 35 | 0.350 us / 0.350 us | 35 / 35 | no
5.3 Results Discussion
Table 5.2 Timing Estimates
Clock: ap_clk | solution_1 | scaled_fft
Target | 10.00 ns | 10.00 ns
Estimated | 6.573 ns | 6.450 ns
Table 5.3 Latency Estimates
Metric | solution_1 | scaled_fft
Latency (cycles), min/max | 313 / 313 | 217 / 217
Latency (absolute), min/max | 3.130 us / 3.130 us | 2.170 us / 2.170 us
Interval (cycles), min/max | 314 / 314 | 36 / 36
Table 5.4 Utilization Estimates
Resource | solution_1 | scaled_fft
BRAM_18K | 0 | 2
DSP | 8 | 16
FF | 1072 | 2047
LUT | 2209 | 2862
URAM | 0 | 0
Table 5.5 Resource Usage Implementation
Resource | solution_1 | scaled_fft
RTL | verilog | verilog
SLICE | 0 | 0
LUT | 930 | 2863
FF | 807 | 1207
DSP | 8 | 16
SRL | 2 | 22
BRAM | - | -
Table 5.6 Final Timing Implementation
Timing | solution_1 | scaled_fft
RTL | verilog | verilog
CP required | 10.000 ns | 10.000 ns
CP achieved post-synthesis | - | 5.752 ns
CP achieved post-implementation | - | 7.657 ns
The timing of the circuit improved further, from 100 MHz (10 ns) to 155.04 MHz (6.450 ns),
letting the design run at a higher clock rate without violating the critical path or risking
metastability in the internal registers.
From the report comparison generated by Vitis HLS (2021.1), we can see that the scaled version
of the FFT implementation achieves roughly a 30% reduction in latency, from 313 cycles down
to 217 cycles ((313 - 217)/313 ≈ 31%). As a tradeoff for the improved timing, the design uses
more resources on the FPGA, most significantly flip-flops (FF) and DSP slices.
This is a significant observation: as the hardware resource usage increases, the chip footprint
grows in both area and power. The optimization uses the extra hardware to parallelise the FFT
stages through the dataflow optimization discussed in the sections above.
As Table 5.1 shows, the closer the individual subprocesses are to one another in latency, the
better they can be scheduled together, giving a much better overall latency and throughput.
Table 5.7 RTL/Co-Simulation Performance Estimates
Fixed-point implementations are prone to quantization error. We were able to keep the error
below 1%, which was confirmed by our self-checking testbench in the RTL/Co-Simulation
performance results given above. The final verdict for this stage was Pass.
Chapter 6
Conclusion and Future Work
The FFT design was also successfully tested on audio files with 16-bit samples at 44.1 kHz,
the standard for music and human speech. The results matched the RTL/Co-Simulation output,
maintaining our standard of less than one percent error. Our implementation also finds
application in embedded devices that are low-power, low-cost, and area-constrained, such as
earphones or handheld devices. We conclude with the design and implementation of a 32-point
fixed-point FFT core that meets the performance, accuracy, flexibility, scalability, and
compatibility requirements for real-time signal processing applications.
In future, the interface can be extended with block floating point, a customized scaling of the
butterfly computation at each stage of the FFT; the scaling scheme must be chosen by the
programmer and depends entirely on the application, the required output, and the data itself.
Furthermore, the FFT size could be increased for finer frequency resolution, i.e. more bins: a
larger FFT simply uses more points per frame, and as the FFT block grows the design could
move fully to high-resolution audio applications. The design could also be taken forward to
implement Active Noise Cancellation (ANC) in earphones, a high-potential emerging consumer
market.
References
[1]W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. Horowitz,
“Convolution engine,” Communications of the ACM, vol. 58, no. 4, pp. 85–93, 2015.
[2]M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," 2014 IEEE
International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp.
10-14.
[3]R. Toulson and T. Wilmshurst, “An Introduction to Digital Signal Processing,” in Fast and
effective embedded systems design: Applying the ARM mbed, Amsterdam: Newnes, an imprint
of Elsevier, 2017.
[4]“Digital Signal Processing,” Wikipedia, 23-Mar-2023. [Online]. Available:
https://en.wikipedia.org/wiki/Digital_signal_processing. [Accessed: 25-Mar-2023].
[5] “Overhead (computing),” Wikipedia, 12-Feb-2023. [Online]. Available:
https://en.wikipedia.org/wiki/Overhead_(computing). [Accessed: 26-Mar-2023].
[6]R. Teymourzadeh, M. J. Abigo, and M. V. Hoong, “Static quantised radix-2 fast Fourier
transform (fft)/inverse FFT processor for constraints analysis,” International Journal of
Electronics, vol. 101, no. 2, pp. 231–240, 2013.
[7] M. Croome, “GreenWaves Technologies unveils the GAP8 IoT Application Processor,”
GreenWaves Technologies, 17-Jun-2019. [Online]. Available:
https://greenwaves-technologies.com/greenwaves-technologies-unveils-gap8/. [Accessed: 30-Mar-2023].
[8] “Fixed-point vs. floating-point digital signal processing,” Analog Devices. [Online].
Available: https://www.analog.com/en/technical-articles/fixedpoint-vs-floatingpoint-dsp.html.
[Accessed: 03-Apr-2023].
[9]V. Kumar M, D. Selvakumar A, and S. P M, “Area and frequency optimized 1024 point
radix-2 FFT processor on FPGA,” 2015 International Conference on VLSI Systems,
Architecture, Technology and Applications (VLSI-SATA), 2015.
[10] W. R. Knight and R. Kaiser, “A Simple Fixed-Point Error Bound for the Fast Fourier
Transform,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 27, no. 6, pp. 615-620,
December 1979; and L. R. Rabiner and B. Gold, Theory and Application of Digital Signal
Processing, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1975.
[11]J. Arias,M. Desainte-Catherine and C. Rueda, “Exploiting Parallelism in FPGAs for the
Real-Time Interpretation of Interactive Multimedia Scores” Journées d'Informatique
Musicale 2015, May 2015, Montréal, Canada. ⟨hal-01129316⟩
[12] AXI4-Stream Interface, AMD Adaptive Computing Documentation Portal. Available at:
https://docs.xilinx.com/r/2021.1-English/ug1399-vitis-hls/How-AXI4-Stream-Works (Accessed:
April 13, 2023).
[13] Loop Pipelining, AMD Adaptive Computing Documentation Portal. Available at:
https://docs.xilinx.com/r/2021.1-English/ug1399-vitis-hls/Loop-Pipelining (Accessed: April
13, 2023).
[14] Pipelining Paradigm, AMD Adaptive Computing Documentation Portal. Available at:
https://docs.xilinx.com/r/2021.1-English/ug1399-vitis-hls/Pipelining-Paradigm (Accessed:
April 13, 2023).
[15] Pipelining Dependencies, AMD Adaptive Computing Documentation Portal. Available at:
https://docs.xilinx.com/r/2021.1-English/ug1399-vitis-hls/Exploiting-Task-Level-Parallelism-Dataflow-Optimization
(Accessed: April 13, 2023).
[16] Programming an FPGA: An introduction to how it works, Xilinx. Available at:
https://www.xilinx.com/products/silicon-devices/resources/programming-an-fpga-an-introduction-to-how-it-works.html
(Accessed: April 11, 2023).
Acknowledgement
We would like to express our gratitude and thanks to Dr. Tanuja Sarode and Mr. C. S.
Kulkarni for their valuable guidance and help. We are indebted for their guidance and constant
supervision as well as provision of necessary information regarding the project. We would like
to express our greatest appreciation to our principal Dr. G.T. Thampi and head of the
department Dr. Tanuja Sarode for their encouragement and tremendous support. We take this
opportunity to express our gratitude to the people who have been instrumental in the successful
completion of the project.
Kaku Jay Sushil (1902067)
Khan Mohd Hamza Rafique (1902072)
Kotadia Hrishit Jayantilal (1902080)
Narwani Trushant Sunil (1902114)