TSINGHUA SCIENCE AND TECHNOLOGY
ISSN 1007-0214
DOI: 10.26599/TST.20xx.9010xxx
Volume xx, Number x, xxxxxxx 20xx
A Novel Parallel Processing Element Architecture for Accelerating ODE and AI
Abstract: Transforming complex problems, such as ordinary differential equations (ODEs) recast in matrix form, into simpler computational tasks is key to AI advancements and paves the way for more efficient computing architectures. Systolic arrays, known for their computational efficiency, low power use, and ease of implementation, address AI's computational challenges and are central to mainstream industry AI accelerators; improvements to the Processing Element (PE) significantly boost systolic array performance and streamline computing architectures, paving the way for more efficient solutions across technology fields. This research presents a novel PE design and its integration into a systolic array, based on a novel computing theory: bit-level mathematics for the Multiply-Accumulate (MAC) operation. We present three different PE architectures and provide a comprehensive comparison between them and state-of-the-art technologies, focusing on power, area, and throughput. The research also demonstrates the integration of the proposed MAC unit with systolic arrays, highlighting significant improvements in computational efficiency. Compared with state-of-the-art designs, our implementations show 2,380,952.38 times lower latency, 64.19 times fewer DSP48E1 slices, 1.26 times fewer Look-Up Tables (LUTs), and 10.76 times fewer Flip-Flops (FFs), with 99.63 times less power consumption and 15.19 times higher performance per PE.
Key words: Parallel MAC Unit; Processing Element (PE); FPGA Implementation; Computing Theory
1 Introduction
In the progression of computational technologies, particularly in fields such as Digital Signal Processing (DSP)[1], Artificial Intelligence (AI)[2], and image processing[3], the conversion of ordinary differential equations (ODEs) into matrix multiplication formats is an important process[4]. This transformation is key to making the most of today's computing architectures, underscoring the need for highly efficient Multiply-Accumulate (MAC) units. These units are essential for matrix operations, turning ODEs into tasks computers can handle and leveraging modern hardware's capability for parallel processing, enabling the simultaneous execution of multiple tasks. This is a significant step towards using computing power more effectively across a wide range of applications[5]. Therefore, the enhancement and progression of MAC units are central to advancing computational infrastructures, supporting the technological evolution of future generations.
As computing architecture advances, the role of
MAC units in overseeing large-scale matrix
multiplications becomes indispensable, particularly for
the operation of deep learning and neural networks. Their efficiency reduces unnecessary computations and storage demands, positioning MAC units as an essential element in the advancement and streamlining of a wide range of technological applications.
Contemporary architectures engaging in critical
tasks like matrix multiplication frequently employ
numerous MAC units to diminish memory interactions.
Google's Tensor Processing Unit (TPU) uses a 4×4×4×64 systolic array to accelerate large-scale tensor operations[6]. Such an architecture bolsters the computation process by enabling data reuse and reducing complexity. It connects MAC units in series to form a Processing Element (PE), enhancing parallelism and performance[7]. Similarly, Huawei's DaVinci AI accelerator integrates 16 MAC units within each PE, accompanied by an adder tree for effective
result accumulation[8]. This arrangement is calibrated
for both standard and intricate computations,
underscoring the importance of sparsity management in
AI.
In applications like Convolutional Neural Networks (CNNs), where large-scale matrix operations are commonplace, various techniques are employed to optimize computational efficiency. One such technique is the "im2col" transformation, which converts 3D convolutions into matrix multiplications, necessitating the handling of matrices with vast dimensions[9]. Another noteworthy technique is the Winograd algorithm[10], which boosts computational efficiency by reducing the number of multiplication operations needed for convolution. It can notably speed up convolution operations, especially where computing resources are limited or latency is a concern, a need driven by the ever-increasing sizes of images and feature maps. Architectures based on systolic arrays, such as the
TPU and NVDLA[11], are favoured for their capability
to accelerate computations. In these structures, MAC
units that offer lower latency or enhanced integration
are particularly valuable, especially for processing
sparse matrices in extensive AI models, where sparsity
optimization is essential for improving computational
efficiency and performance.
In this research, we propose a groundbreaking MAC unit founded on a novel computing theory. Our integrated MAC units are scalable and naturally parallel, allow multiple MAC operations to be performed efficiently within a single component, and can be integrated effectively with systolic arrays. The key contributions of this work are as follows:

• Presents a novel computing theory centred on bit-level mathematics for the design of MAC units, and introduces a novel PE tailored accordingly for Multiply-Accumulate operations.

• Develops three distinct hardware architectures, each optimised for either high throughput or reduced latency, and integrates the fastest architecture with a systolic array.

• Conducts a comprehensive comparison: evaluates the three MAC units and compares the systolic-array integration against the existing literature, focusing on power consumption, area, and throughput.
The remainder of the article is structured as follows: Section 2 reviews the current landscape of MAC units and the computational power offered by CPUs and ASICs, analysing their limitations, and then delves into systolic arrays, highlighting their potential to overcome these challenges. Section 3 offers an in-depth exploration of the computing theory and the hardware architecture, and presents the implementation with a systolic array. Section 4 evaluates our three novel MAC units built on the different architectures, then compares the systolic-array integration with other research. We further assess our architecture through an application to Automatic Modulation Classification (AMC)[12]; the experimental results demonstrate a time saving of 99.98% compared with the existing literature. Finally, the last section concludes with an analysis of our MAC unit's potential and summarizes future research directions.
2 Literature Review
Transforming ODE problems into General Matrix
Multiply (GEMM) operations has become a crucial
technique in computational studies. This process
involves breaking down ODEs into finite difference
equations, which are then represented as a matrix
equation $A \cdot y = b$. Here, $A$ contains the coefficients of the equations, $y$ represents the solution vector, and $b$ is the right-hand side of the equations[4]. By solving this matrix equation with GEMM-based techniques, such as matrix inversion or iterative methods, we can determine the values of $y$ at each time point. This method bridges the gap between ODEs and AI advancements, making computational infrastructures more efficient and streamlining technological applications.
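As a concrete illustration of this pipeline, the sketch below discretises a simple boundary-value problem into the $A \cdot y = b$ form described above. It is a minimal example of the general idea, not code from the paper; the choice of ODE ($y'' = \sin(\pi t)$ with zero boundaries) and all names are ours.

```python
import numpy as np

# Finite-difference discretisation of y''(t) = f(t), y(0) = y(1) = 0.
# The second derivative becomes the tridiagonal stencil
# (y[i-1] - 2*y[i] + y[i+1]) / h**2, so the ODE collapses into the
# matrix equation A @ y = b, the GEMM-friendly form discussed above.
n = 99                               # interior grid points
h = 1.0 / (n + 1)
t = np.linspace(h, 1.0 - h, n)

A = (np.diag(-2.0 * np.ones(n)) +
     np.diag(np.ones(n - 1), k=1) +
     np.diag(np.ones(n - 1), k=-1)) / h**2
b = np.sin(np.pi * t)                # example right-hand side f(t)

y = np.linalg.solve(A, b)            # solve A . y = b

# Analytic solution for this f(t), as a sanity check.
exact = -np.sin(np.pi * t) / np.pi**2
print("max abs error:", np.max(np.abs(y - exact)))
```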
This transformation process enables efficient
numerical solution techniques, making model training
and inference in AI faster and improving algorithmic
execution in signal processing tasks. By breaking down
ODEs and representing them as matrix equations,
computational architectures can efficiently use matrix
operations for computation. This approach underscores
the versatility and applicability of matrix operations
across various domains, highlighting the role of
GEMM in advancing computational technologies,
particularly in accelerating MAC operations, a basic
component in various computational tasks.
MAC units play a pivotal role across various
domains, significantly enhancing computational
efficiency in deep learning, signal processing, and
FPGA-based accelerations. In deep learning, MAC
units streamline computations, facilitating faster model
training and inference while optimizing energy
consumption[13]. Signal processing benefits from MAC
units through improved algorithmic execution for tasks
like filtering and audio processing, thereby elevating
operational speed and accuracy[14]. Furthermore,
FPGA-based neural network accelerators leverage
MAC units to achieve substantial gains in performance
and efficiency, demonstrating the broad applicability
and impact of MAC units in advancing computational
technologies[15].
In contemporary cloud-edge computing architectures, two primary performance requirements emerge distinctly: high throughput and low latency.
High throughput is vital for scenarios necessitating the
rapid transmission or reception of vast data volumes[16].
This capability is especially crucial in cloud data
transmission, where the efficient handling of large-scale data sets, such as video streaming and bulk data
analytics, is paramount. On the other hand, low latency
is imperative for applications involving smaller data
packets that demand immediate processing[17]. Such
requirements are paramount in the autonomous driving
sector, where milliseconds can determine the safety of
decisions made by the vehicle's AI systems[18].
Additionally, low latency is critical in real-time gaming
and financial trading applications, where delay
reduction translates directly into competitive advantage
and operational efficiency.
In recent developments within the field of
computing, both CPUs and ASICs have undergone
significant evaluation, with studies assessing their performance in INT8 operations per second and per watt. ASICs, designed for tasks requiring high
computational throughput and energy efficiency,
showcase superior performance. For instance,
NVIDIA's Tesla P4 and P40 GPUs, based on the Pascal
architecture, achieve 21.8 and 47.0 INT8 TOP/s
respectively, underlining ASICs' advantages in deep
learning inference[19].
Conversely, Intel's advancement with the
"Sapphire Rapids" 4th Gen Xeon processors, featuring
Advanced Matrix Extensions (AMX), illustrates the
efforts to enhance CPUs' capability for AI inference
tasks. These processors manage 2,048 INT8 operations
per cycle per core, marking a step towards narrowing
the performance gap between CPUs and ASICs for
specific computational tasks[20].
Additionally, Google's Coral M.2 Accelerator
emerges as a noteworthy ASIC example, offering 4
TOP/s at an energy consumption of only 2 watts. This
efficiency, translating to 2 TOPS/W, sets a new
benchmark for power-efficient machine learning
acceleration[21].
These instances reflect the ongoing technological
advancements in the realms of ASICs and CPUs, with
ASICs maintaining a lead in efficiency and
performance for specialized tasks, whereas CPUs are
catching up in versatility for a wider application
spectrum.
However, despite the advancements in MAC units
enhancing computational efficiency, scalability and
adaptability challenges persist. ASICs, tailored for
specific tasks, sacrifice flexibility, limiting their utility
outside their optimized scope. In contrast, CPUs offer
versatility but struggle with inefficiencies, especially in
energy-sensitive settings[22]-[23]. This contrast highlights the delicate balance between performance specialization and general adaptability in computational architectures. It underscores the
intricacies of choosing the appropriate technology,
emphasizing the necessity to align architectural
decisions with the specific requirements of
applications, while also considering energy efficiency
and flexibility[22]-[23].
Exploring MAC units, CPUs, and ASICs
highlights strides in processing efficiency, emphasizing
the interplay between energy efficiency, operational
flexibility, and cost. This discourse not only reveals
technological advancements but also the complex
decision-making process system architects and
engineers navigate. It shows how advancements in
MAC units, while significant, are moderated by the
practical limitations and trade-offs of choosing between
CPUs and ASICs. This intricate balance is crucial for
optimizing computational resources across applications, highlighting the pursuit of architectures
that marry high performance with adaptability and
energy efficiency[22]-[23].
In this evolving landscape of computational
architectures, the emergence and integration of systolic
arrays present an innovative approach to addressing
some of the above challenges, particularly in scenarios
demanding high throughput and energy efficiency[24].
The principle of a systolic array, anchored in parallel
computing architecture, allows for data to move
synchronously across the array between adjacent
processors. This method of operation contributes
significantly to reducing the need for external memory
access, thus enhancing computational efficiency.
This architecture excels in applications requiring
intensive parallel processing, such as artificial
intelligence and image processing, by effectively
handling data parallelism and reducing operational
steps, contrary to the limitations posed by Amdahl's
Law in traditional architectures. Furthermore,
innovations in systolic array design have expanded
their functionality to include nonlinear operations
efficiently, further enhancing their suitability for
complex neural network inference tasks[25].
At the core of a systolic array's architecture is a
two-dimensional grid of PEs, each directly connected
to its immediate neighbours. This design facilitates
efficient data transfer and reduces reliance on central
memory, offering an advantage in processing intensive
tasks such as those found in deep learning
applications[24].
Systolic arrays operate through stages, initiating
data at their edges and sequentially delivering results,
showcasing continuous operation and parallel
processing efficiency. This emphasizes their role in
enhancing energy efficiency and throughput. Design
considerations such as uniform dataflow and edge
condition management are crucial for addressing
scalability and adaptability challenges, ensuring
systolic arrays achieve optimal performance. Their
integration into computational architectures represents
a significant advancement, contributing to the
development of high-performance, adaptable, and
energy-efficient systems. Systolic arrays, alongside
MAC units, CPUs, and ASICs, enrich the diversity of
technological solutions, driving innovation and
improvement in the field[25].
Fig. 1 General Architecture of the MAC unit based on our novel computing theory
3 Methodology

3.1 Computing Theory
In mathematics, multiplying matrices requires multiplying each corresponding pair of numbers and then summing up all the results. In computer calculations, however, numbers at the data level must first be converted to the bit level. Take two vectors, A and B, each containing m numbers of k bits, as an example.
$$A_m = \sum_{i=0}^{k-1} a_i \times 2^i \qquad (1)$$

$$B_m = \sum_{j=0}^{k-1} b_j \times 2^j \qquad (2)$$
Eq. (1) and Eq. (2) give the bit-level representation of each number. Integrating this with the matrix multiplication formula yields the following result:
$$C = \sum_{0}^{2k-1} A_m \times B_m = \sum_{0}^{2k-1}\left[\left(\sum_{i=0}^{k-1} a_i \times 2^i\right) \times \left(\sum_{j=0}^{k-1} b_j \times 2^j\right)\right] = \sum_{0}^{2k-1}\left[\left(\sum_{i=0}^{k-1}\sum_{j=0}^{k-1} a_i \times b_j\right) \times 2^{i+j}\right] \qquad (3)$$
The transformation from the original result to what we see in Eq. (3) is quite straightforward. The initial sum symbol, indexed from 0 to 2k−1, can be expressed as an accumulator in the hardware. The second and third sum symbols, indexed by i, j = 0 to k−1, can be represented simply by a counter. The powers of 2 can be turned into shifters, and multiplication in binary form can be done using an AND gate; Table 1 summarizes this mapping. There is no data dependency between these sums, making it easy to scale to numbers of varying word lengths and different quantities of vector elements. Rather than multiplying separate numbers and summing them up, our computing approach fuses the entire MAC process: a counter, AND gates, and shifters replace the original multipliers, boosting efficiency.

Table 1 Mathematical operations mapped to digital logic

Mathematical operation                            Digital logic
$\sum_{0}^{2k-1}$ (outer sum)                     Accumulator
$\sum_{i=0}^{k-1}\sum_{j=0}^{k-1}$ (double sum)   Bit-count (counter)
$a_i \times b_j$                                  AND gate
$2^{i+j}$                                         Bit-shift
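To make the mapping in Table 1 concrete, here is a minimal software model of Eq. (3), written in Python purely as executable pseudocode for the hardware (the function name and structure are ours): the AND gates become a bitwise AND across the vectors' bit planes, the bit-count a summation of ones, the powers of two left shifts, and the outer sum an accumulator.

```python
import numpy as np

def bit_level_mac(A, B, k=8):
    """Dot product of two unsigned k-bit vectors computed per Eq. (3):
    AND the bit planes, count the ones, shift by i+j, accumulate."""
    acc = 0
    for s in range(2 * k - 1):                   # shift amounts 0 .. 2k-2
        count = 0
        for i in range(k):
            j = s - i
            if 0 <= j < k:
                # bit plane i of A AND bit plane j of B, then bit-count
                a_bits = (A >> i) & 1
                b_bits = (B >> j) & 1
                count += int(np.sum(a_bits & b_bits))
        acc += count << s                        # bit-shift + accumulate
    return acc

rng = np.random.default_rng(0)
A = rng.integers(0, 256, size=8)
B = rng.integers(0, 256, size=8)
assert bit_level_mac(A, B) == int(np.dot(A, B))  # matches the direct MAC
print(bit_level_mac(A, B))
```

Note how all partial products with the same shift amount $s = i + j$ share one count and one shift; this is exactly the merging captured by Table 2 below.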
3.2 Hardware Architecture

Upon attaining a comprehensive grasp of the mathematical formula and its associated digital logic, the subsequent phase entails the development of the hardware architecture. INT8 is used as an example to demonstrate the capability of the proposed computing theory along with its scalable hardware architecture. The architecture is devised in alignment with Eq. (3). Figure 1 depicts the overarching structure of the MAC unit. Upon the input of two vectors, each comprising eight 8-bit numbers, a conversion process translates these numbers from the decimal (data) level to the binary (bit) level, thereby transforming the vectors into matrices. The corresponding entries of these two matrices are then processed through the AND gates, followed by the counters and the shifters. The last step of this process is the computation of the result in the accumulator.
Table 2 Merge table for reducing the number of counters and shifters
Shift   Bit position from A and B                Number of AND Gates (in Bits)
0       0/0                                      7
1       1/0, 0/1                                 14
2       0/2, 1/1, 2/0                            21
3       0/3, 1/2, 2/1, 3/0                       28
4       0/4, 1/3, 2/2, 3/1, 4/0                  35
5       0/5, 1/4, 2/3, 3/2, 4/1, 5/0             42
6       0/6, 1/5, 2/4, 3/3, 4/2, 5/1, 6/0        49
7       0/7, 1/6, 2/5, 3/4, 4/3, 5/2, 6/1, 7/0   56
8       1/7, 2/6, 3/5, 4/4, 5/3, 6/2, 7/1        49
9       2/7, 3/6, 4/5, 5/4, 6/3, 7/2             42
10      3/7, 4/6, 5/5, 6/4, 7/3                  35
11      4/7, 5/6, 6/5, 7/4                       28
12      5/7, 6/6, 7/5                            21
13      6/7, 7/6                                 14
14      7/7                                      7
The structure's potential for straightforward pipelining is evident. Adhering to Eq. (3) in its strictest form would necessitate 64 counters and fixed shifters in the hardware, which could lead to considerable resource consumption. For 8-bit numbers the shift amount spans 0 to 14, allowing certain results to be merged and reducing the number of counters and shifters from 64 to 15. Table 2 elucidates how these counters and shifters are combined: the first column gives the number of bits requiring shifting, the second the bit positions from A and B (8-bit numbers), and the third the requisite number of 1-bit AND gates. The most prevalent scenario involves a shift of 7. Although the total number of AND gates remains congruent with the figure prior to merging, the strategy effectively reduces the quantity of counters and shifters from 64 to a mere 15.
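The pair sets in Table 2 follow mechanically from the condition $i + j = s$, so the table can be regenerated with a few lines (a sketch with our own naming; the pair ordering may differ from the table's, and each row's gate count equals seven times the number of pairs, matching the third column):

```python
# Regenerate the bit-position pairs of Table 2: for 8-bit operands the shift
# amount s = i + j runs from 0 to 14, and all partial products with the same
# s share one counter and one fixed shifter.
K = 8
for s in range(2 * K - 1):
    pairs = [(i, s - i) for i in range(K) if 0 <= s - i < K]
    positions = ", ".join(f"{i}/{j}" for i, j in pairs)
    print(f"shift {s:2d}: {positions}  ({7 * len(pairs)} AND-gate bits)")
```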
Fig. 2 Stages for LUT-based Architecture

3.3 Look-Up Table (LUT)-based Architecture
To further reduce the number of counters and shifters and to expedite execution, an alternative approach is to leverage a look-up table (LUT) that bypasses the intermediary AND-gate, bit-count, and bit-shift steps, acquiring the desired results directly from the bit-level matrix.

By dividing the numbers in the bit-level matrix into groups of 4 bits, there would be $2^{(4\times2)} = 16{,}384$ unique entries in the look-up table, given that the maximum number represented by four bits is 15 (i.e., '1111' in binary). Therefore, we can construct a table that maps these 16,384 possibilities to their corresponding results, eliminating the need for intermediary computational steps.
To ascertain the RAM size required for this look-up table, we need to consider the total number of entries and the size of each entry. Given that there are 16,384 unique entries, and each entry corresponds to the sum of the products of a 4-bit group, which can be up to 8 × 8 = 64 (as two 4-bit numbers can each represent up to the decimal number 15), we would need a RAM of size 16,384 × 7 bits, since 64 can be represented with 7 bits.

In terms of bytes, considering that 1 byte is equivalent to 8 bits, the byte size of the RAM would be (16,384 × 7) / 8 = 14,336 bytes. Note that this calculation provides an ideal estimate; real-world RAM modules usually come in powers of 2, so the actual RAM size might need to be rounded up to the closest power of 2, which is $2^{14}$ = 16,384 bytes.
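The sizing argument above, in executable form (all numbers are the ones quoted in the text):

```python
import math

entries   = 16_384                   # unique LUT inputs quoted above
max_value = 64                       # largest stored result quoted above
width     = max_value.bit_length()   # 64 needs 7 bits per entry
raw_bytes = entries * width // 8     # 16,384 x 7 / 8 = 14,336 bytes
rounded   = 2 ** math.ceil(math.log2(raw_bytes))  # next power of two

print(width, raw_bytes, rounded)     # 7 14336 16384
```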
The utilization of a look-up table in this context is
an efficient mechanism to quickly access the result for
a particular 4-bit group, thereby avoiding the time and
computational resources required by the intermediate
steps. This substantial time saving and reduced
computational load can greatly enhance the processing
speed.
However, the implementation of a look-up table
leads to increased memory requirements due to the
need to store the table. Therefore, the feasibility of this
approach would be contingent on the trade-off between
memory availability and the need for computational
efficiency and processing speed. In scenarios where
memory resources are abundant, and speed is a priority,
this look-up table method can greatly optimize the
performance of the bit-level operation-based MAC
unit.
As depicted in Fig. 2, the process consists of three
stages. In Stage 1, the 8-bit input data is split into two
sets of 4-bit data. In Stage 2, the results are retrieved
from the ROM, which serves as a substitute for the
AND gates, counters, and shifters. The final stage
encompasses the accumulator, where all the results are
summed up to yield the final outcome.
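The three stages can be modelled in software as follows. This is an illustrative reduction using a 256-entry table of nibble products rather than the paper's full 16,384-entry table, whose exact indexing is not spelled out here; the decomposition of an 8-bit product into four 4-bit-group lookups is standard, and all names are ours.

```python
# Toy model of the three stages in Fig. 2. Stage 1 splits each 8-bit operand
# into two 4-bit slices; Stage 2 replaces the AND/count/shift logic with a
# ROM of precomputed nibble products; Stage 3 accumulates the partial terms.
ROM = {(a, b): a * b for a in range(16) for b in range(16)}  # 256 entries

def lut_mac(A, B):
    acc = 0
    for x, y in zip(A, B):
        xh, xl = x >> 4, x & 0xF          # Stage 1: split into 4-bit groups
        yh, yl = y >> 4, y & 0xF
        acc += (ROM[xh, yh] << 8) + ((ROM[xh, yl] + ROM[xl, yh]) << 4) \
               + ROM[xl, yl]              # Stages 2-3: look up, shift, sum
    return acc

A = [12, 200, 7, 255, 34, 90, 1, 128]
B = [99, 3, 250, 255, 17, 60, 2, 64]
assert lut_mac(A, B) == sum(a * b for a, b in zip(A, B))
```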
The look-up table implementation for the MAC
unit offers a promising approach for future research and
development, potentially providing considerable
improvements in speed, energy efficiency, and
simplification of the hardware architecture. It
exemplifies an innovative method to further harness the
potential of bit-level computations in the field of AI
hardware acceleration.
Fig. 3 Project Structure

3.4 Systolic Array
Following the completion of the above work, the design is integrated as a PE within the systolic array. As shown in Fig. 3(a), the systolic array is configured with parameters for rows, columns, and the input bit width, facilitating flexible scalability. The module also takes clock, reset, and start signals, along with data inputs from the top and left sides and a selection signal for output processing.
It initializes a dynamic grid of PEs, connecting
them so that data flows horizontally and vertically
through the array. Each PE receives input from its left
and top neighbours, processes it, and passes the result
to its right and bottom neighbours. The array's design
enables efficient parallel processing of data, typical in
applications like matrix multiplication or digital signal
processing. The output from the array is then aggregated by an accumulator and written to a register, which outputs the final result based on the provided selection signal.
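The dataflow just described can be captured in a small cycle-by-cycle model: each PE reads operands from its left and top neighbours, performs a MAC, and forwards the operands right and down, with the input streams skewed by one cycle per row and column. This is a generic output-stationary model written for illustration, not the project's RTL.

```python
import numpy as np

def systolic_matmul(A, B):
    """Output-stationary systolic model: PE (r, c) accumulates A[r,:].B[:,c]
    while streaming A rightwards and B downwards, one hop per cycle."""
    n = A.shape[0]
    acc   = np.zeros((n, n), dtype=np.int64)   # per-PE accumulator
    h_reg = np.zeros((n, n), dtype=np.int64)   # value held for right neighbour
    v_reg = np.zeros((n, n), dtype=np.int64)   # value held for bottom neighbour

    for t in range(3 * n - 2):                 # enough cycles to drain the array
        new_h = np.zeros_like(h_reg)
        new_v = np.zeros_like(v_reg)
        for r in range(n):
            for c in range(n):
                # left/top neighbour outputs; edges get the skewed input streams
                a_in = h_reg[r, c - 1] if c > 0 else (A[r, t - r] if 0 <= t - r < n else 0)
                b_in = v_reg[r - 1, c] if r > 0 else (B[t - c, c] if 0 <= t - c < n else 0)
                acc[r, c] += a_in * b_in       # the PE's MAC
                new_h[r, c], new_v[r, c] = a_in, b_in
        h_reg, v_reg = new_h, new_v
    return acc

rng = np.random.default_rng(1)
A = rng.integers(0, 256, (4, 4))
B = rng.integers(0, 256, (4, 4))
assert np.array_equal(systolic_matmul(A, B), A @ B)
```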
3.5 Implementation
The above architecture is developed in Verilog HDL and implemented with Xilinx Vivado 2022.2. The overall structure of the design is illustrated in Fig. 3.
Based on the computing theory and LUT-based architecture discussed above, three PE architectures are designed: a pure digital circuit architecture based directly on the computing theory, and two architectures optimised with LUTs, Swap A&C and Swap All. Swap A&C replaces the AND-gate and bit-count operations with a LUT, whereas Swap All replaces the AND-gate, bit-count, and bit-shift operations with a LUT.
The novel PE is integrated into the systolic array, a key component of AI acceleration. As illustrated in Fig. 3(b), the PE module is composed of the core module and an accumulator. Within the core module lie the conversion module and the calculation module, as shown in Fig. 3(c). The conversion module converts INT8 vectors into an 8×8 binary matrix, while the calculation module performs the computational operations or accesses the look-up table data stored in ROM.
4 Results and Discussion
The hardware design in this research targets the Xilinx Zynq UltraScale+ series ZCU102 (XCZU5EV) board. We tested the PE in a 16-MAC configuration, as shown in Table 3. The table outlines the resource consumption of the three architectures: Pure Digital Circuit, Swap A&C (AND gate and bit-count), and Swap All (AND gate, bit-count and bit-shift).
Among these, Swap All exhibits the lowest latency, requiring only two clock cycles at clock periods of 8 ns and 4 ns and three cycles at 2 ns, implying that the results of 16 MAC operations can be obtained as quickly as every 6 ns. In contrast, the Pure Digital Circuit approach consumes the fewest resources, saving 77.46% in LUTs and 51.52% in DSP48E1 compared with Swap All. Compared with the pure circuit, Swap A&C reduces superfluous LUT consumption by decreasing the bit operations in the AND gates and the addition operations in the counters, leading to a slight overall reduction in resource usage.

Table 3 PE hardware resources (16 MACs)

                      Pure Digital  Swap A&C  Swap All  Swap All  Swap All
                      Circuit
Frequency (MHz)       125           125       125       250       500
Latency (clk cycles)  2             2         2         2         3
Performance (GOPs)    1             1         1         2         2.67
Area - DSP48E1        1             1         1         1         1
Area - LUTs           2591          2335      11493     11493     11493
Area - FF             1539          1539      771       771       771

Subsequently, the Swap All approach with a 4 ns clock period was chosen for integration with the systolic array. Although the 2 ns variant offers lower latency, its mere 0.019 ns of timing slack (clock period × latency − data-path delay = 2 ns × 3 cycles − 5.981 ns = 0.019 ns) might induce instability in the systolic array, hence it was not chosen for integration.
Fig. 4 Power, Performance and Efficiency of Systolic Array
Figure 4 displays the measured power, performance, and power efficiency after integrating this approach into systolic arrays of sizes 2×2 to 12×12. Since every PE performs the same operations with the same latency, the performance and power efficiency can be calculated as:

$$\text{Performance} = \left[\frac{(16+16+15)\ \text{operations} \times \text{size}^2}{4\ \text{ns} \times 4\ \text{clk cycles}} + \frac{\text{size}^2 - 1}{4\ \text{ns} \times 1\ \text{clk cycle}}\right] = \left(3.1875\ \text{size}^2 - 0.25\right)\ \text{GOP/s}$$

$$\text{Power Efficiency} = \text{Performance} / \text{Power}$$
These data were used to predict the power and performance trends for sizes extending from 12×12 to 30×30. Both power and performance grow roughly quadratically with array size, while the power efficiency has a maximum, with the size closest to this maximum being 14×14. At this size the predicted power is 0.74 W, the performance 624.50 GOPs, and the power efficiency peaks at 839.52 GOPs/W.
Table 4 lists the frequency, power, performance, and area data for the Swap All PE applied to systolic arrays of sizes 1×1 to 6×6 and 14×14, together with comparisons against FPGA accelerators from other literature. Data surpassing ours are highlighted in red, and inferior data in green.
The research by Yang[26] and Cao[27], which used a ZedBoard to implement an 8×4 systolic array, is compared with our 6×6 array. Our design uses 26.8 times less power than Yang's. Compared to [27], our design shows 54.78 times higher throughput and 1284.6 times lower latency.

Chen's research[28], which accelerates BERT, a large language model, on a U280 board with an 8×16 systolic array, is compared with our 14×14 array, which shows 59,655 times lower latency at a similar maximum clock frequency. In terms of area, despite using 3.45 times more LUTs, our design uses 9.08 times fewer DSP48E1s and 3.40 times fewer registers, and saves 9,000 kb of memory.
Xue's research[29] employs a 32×16 systolic array for the VGG-16 network. Compared to this, our 14×14 systolic array delivers 5.83 times higher throughput with fewer resources.

Chen's[30] and Lian's[31] research do not involve a systolic array but use multiple parallel PEs without a suitable dataflow for matrix multiplication. Compared to Chen's 8 PEs, our 3×3 array has 2,380,952.38 times lower latency with 99.63 times lower power consumption. Our throughput is 5.07 times lower, but the performance-power efficiency is 19.65 times higher. In terms of area, while our LUT usage is a disadvantage at 4.26 times more, we use 32.89 times fewer DSP48E1s and 3.32 times fewer registers, and save 720 kb of block memory.
Comparing our 4×4 array with Lian's research[31], which uses 16 PEs, our power consumption is 68.51 times lower. Our throughput is 14.99 times lower, but the performance-power efficiency is 4.2 times higher. In terms of area, our architecture is superior in every measured aspect, with 64.19 times fewer DSP48E1s, 1.26 times fewer LUTs, 10.76 times fewer registers, and a saving of 16,434 kb of block memory.

Overall, our architecture boasts extremely low latency and power consumption, offering better performance than other systolic-array accelerators. The high number of LUTs used as ROM for the look-up tables is unavoidable; however, for crucial FPGA components such as DSP48E1s and memory, our consumption is lower.
Finally, this research is applied to an application in communications: Zhao et al.'s AMC algorithm[12] involves extensive MAC operations, and our design is used to replace them. Multiplying two 100×100 matrices, which takes 1.989 ms in MATLAB, completes in merely 79 clock cycles on the systolic array with the Swap All architecture (79 cycles × 4 ns = 316 ns), 6,294 times faster than the MATLAB-based software implementation.
Table 4 Comparisons with other FPGA accelerators

Ours (device XCZU5EV, no fixed network, 250 MHz, 8-bit int quantization):

Size                       1×1      2×2      3×3       4×4       5×5       6×6       14×14
Latency (ns)               16       52       84        116       148       180       436
Power (W)                  0.044    0.085    0.108     0.134     0.165     0.201     0.744
Performance (GOPs)         2.94     12.50    28.44     50.75     79.44     114.50    624.50
Performance/PE (GOPs)      2.94     3.13     3.16      3.17      3.18      3.18      3.19
Power Efficiency (GOPs/W)  66.76    147.06   263.31    378.73    481.44    569.65    839.52
Area - DSP48E1             1        4        9         16        25        36        196
Area - Memory (kb)         0        0        0         0         0         0         0
Area - LUTs                11493    45972    103693    184400    288093    414772    2255700
Area - FF                  771      2572     6939      13104     20811     29548     167582

Other accelerators:

                           [26]        [27]        [28]         [29]        [30]        [31]
Device                     ZedBoard    ZedBoard    U280         PYNQ-Z2     XCZU3EG     XC7VX690T
Network                    N/A         LeNet-5     BERT         VGG-16      VGG-16      VGG-16
Size                       8×4         8×4         8×16         32×16       8 PE        16 PE
Frequency (MHz)            100         100         245          120         300         200
Quantization Strategy      16-bit int  N/A         W4A8         8-bit int   8-bit int   8-bit fp
Latency (ns)               N/A         231240      26010000     N/A         200000000   N/A
Power (W)                  5.386       N/A         N/A          N/A         10.760      9.180
Performance (GOPs)         N/A         2.09        N/A          107.21      144.20      760.83
Performance/PE (GOPs)      N/A         0.07        N/A          0.21        18.03       47.55
Power Efficiency (GOPs/W)  N/A         N/A         N/A          N/A         13.40       82.88
Area - DSP48E1             32          64          1780         N/A         296         1027
Area - Memory (kb)         108         1242        9000         N/A         720         16434
Area - LUTs                2774        28861       653000       N/A         24350       231761
Area - FF                  3336        41828       569000       N/A         23019       140971
5 Conclusion
In this research, the team proposed three innovative PE architectures designed to enhance the efficiency of MAC operations within systolic arrays, a fundamental component of AI accelerators. The novel computing theory, centred on bit-level mathematics, underpins these architectures and leads to significant improvements in computational efficiency and resource utilization. Comprehensive comparisons show the strengths and limitations of the design: thanks to its extremely low latency and high performance-power efficiency, it is particularly well suited to latency-critical and power-sensitive applications.
In the future, the team aims to improve three key areas of the architecture. The first focus is on enhancing how the design handles sparse matrices, which is essential for increasing data processing speed and efficiency in AI applications. We also plan to address the heavy use of LUTs, aiming to reduce the resource consumption they cause. Additionally, the current implementation supports only INT8, although the computing theory applies to any integer or fixed-point format; floating-point numbers can be converted to fixed point, but this inevitably incurs some precision loss. We will therefore explore adjustments to the PE size to accommodate a wider variety of data types, increasing the versatility and applicability of the architecture across different computational tasks.
References

[1] J. Sohl, J. Wang, and D. Liu, Large matrix multiplication on a novel heterogeneous parallel DSP architecture, in Advanced Parallel Processing Technologies 2009, LNCS, vol. 5737, Y. Dou, R. Gruber, and J. M. Joller, Eds. Heidelberg, Germany: Springer, 2009, pp. 408–419.

[2] D. Wu, X. Fan, W. Cao, and L. Wang, SWM: A high-performance sparse-Winograd matrix multiplication CNN accelerator, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 5, pp. 936–949, 2021.

[3] Z. Al-Qadi and M. Aqel, Performance analysis of parallel matrix multiplication algorithms used in image processing, World Appl. Sci. J., vol. 6, no. 1, pp. 45–52, 2009.

[4] J. C. Butcher, Numerical Methods for Ordinary Differential Equations, 3rd ed. Hoboken, NJ, USA: John Wiley & Sons, 2016.

[5] F. Siddiqui, S. Amiri, U. I. Minhas, T. Deng, R. Woods, K. Rafferty, and D. Crookes, FPGA-based processor acceleration for image processing applications, Journal of Imaging, vol. 5, no. 1, 16, 2019.

[6] X. He, et al., Sparse-TPU: Adapting systolic arrays for sparse matrices, presented at the 34th ACM International Conference on Supercomputing, Barcelona, Spain, 2020.

[7] X. Wei, et al., Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs, presented at the 54th Annual Design Automation Conference, Austin, TX, USA, 2017.

[8] H. Liao, J. Tu, J. Xia, and X. Zhou, DaVinci: A scalable architecture for neural network computing, presented at the 2019 IEEE Hot Chips 31 Symposium, Cupertino, CA, USA, pp. 1–44, 2019.

[9] W. Z. Tao, Y. Wang, and H. Zhang, Overview of tensor layout in modern neural network accelerator, presented at the 18th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 2021.

[10] A. Lavin and S. Gray, Fast algorithms for convolutional neural networks, presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.

[11] S. Feng, J. Wu, S. Zhou, and R. Li, The implementation of LeNet-5 with NVDLA on RISC-V SoC, presented at the 10th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 2019.

[12] Y. Zhao, W. C. J. Gavin, T. Deng, E. A. Ball, and L. Seed, A scalable and accurate chessboard-based AMC algorithm with low computing demands, IEEE Access, vol. 11, pp. 120955–120962, 2023.

[13] S. Yu, et al., Energy-efficient neural network design using memristive MAC unit, doi: 10.3389/felec.2022.877629.

[14] J. Garland and D. Gregg, Low complexity multiply-accumulate units for convolutional neural networks with weight-sharing, doi: 10.48550/arXiv.1801.10219.

[15] M. Cho and Y. Kim, FPGA-based convolutional neural network accelerator with resource-optimized approximate multiply-accumulate unit, Electronics, vol. 9, no. 1, 2020.

[16] G. Rong, Y. Xu, X. Tong, et al., An edge-cloud collaborative computing platform for building AIoT applications efficiently, J. Cloud Comp., vol. 10, 36, 2021, doi: 10.1186/s13677-021-00250-w.

[17] B. Cohen, et al., Edge computing: Next steps in architecture, design and testing, https://www.openstack.org/use-cases/edge-computing/edge-computing-next-steps-in-architecture-design-and-testing.

[18] X. Ge, Ultra-reliable low-latency communications in autonomous vehicular networks, IEEE Trans. Veh. Technol., vol. 68, no. 5, pp. 5005–5016, May 2019.

[19] M. Harris, New Pascal GPUs accelerate inference in the data center, https://developer.nvidia.com/blog/new-pascal-gpus-accelerate-inference-in-the-data-center, 2016.

[20] T. Morgan, Why AI inference will remain largely on the CPU, https://www.nextplatform.com/2023/Why-AI-Inference-Will-Remain-Largely-On-The-CPU, 2023.

[21] ASUS Global, Coral M.2 Accelerator B+M key, https://www.asus.com/AIoT-Industrial-Solutions/Coral-M2-Accelerator-BM-key, 2024.

[22] D-Central, ASIC vs FPGA vs GPU vs CPU: Understanding the differences, https://d-central.tech/asic-vs-fpga-vs-gpu-vs-cpu-understanding-the-differences, 2022.

[23] K. Anderson, Hardware for machine learning inference: CPUs, GPUs, TPUs, https://telnyx.com/resources/hardware-for-machine-learning-inference-cpus-gpus-tpus, 2023.

[24] H. T. Kung and C. E. Leiserson, Systolic arrays (for VLSI), in Sparse Matrix Proceedings 1978, I. S. Duff and G. W. Stewart, Eds. SIAM, 1979, pp. 256–282.

[25] R. Sun, Y. Ni, X. He, J. Zhao, and A. Zou, ONE-SA: Enabling nonlinear operations in systolic arrays for efficient and flexible neural network inference, doi: 10.48550/arXiv.2402.00395.

[26] K. Yang, H. Liu, Y. Zhao, and T. Deng, A new design approach of hardware implementation through natural language entry, IET Collaborative Intelligent Manufacturing, vol. 5, no. 4, e12087, 2023.

[27] Y. Cao, X. Wei, T. Qiao, and H. Chen, FPGA-based accelerator for convolution operations, presented at the 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Chongqing, China, 2019.

[28] H. Chen, J. Zhang, Y. Du, S. Xiang, Z. Yue, N. Zhang, Y. Cai, and Z. Zhang, Understanding the potential of FPGA-based spatial acceleration for large language model inference, arXiv:2312.15159, 2023.

[29] P. Xue, L. Pan, L. Sun, and M. Huang, Dual-line-systolic array for high performance CNN accelerator, presented at the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), New York City, NY, USA, 2022.

[30] X. Chen, J. Li, and Y. Zhao, Hardware resource and computational density efficient CNN accelerator design based on FPGA, presented at the 2021 IEEE International Conference on Integrated Circuits, Technologies and Applications (ICTA), Zhuhai, China, 2021.

[31] X. Lian, Z. Liu, Z. Song, J. Dai, W. Zhou, and X. Ji, High-performance FPGA-based CNN accelerator with block-floating-point arithmetic, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8, pp. 1874–1885, Aug. 2019.