TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 DOI: 10.26599/TST.20xx.9010xxx Volume xx, Number x, xxxxxxx 20xx

A Novel Parallel Processing Element Architecture for Accelerating ODE and AI

Abstract: Transforming complex problems into simpler computational tasks, such as recasting ordinary differential equations (ODEs) in matrix form, is key to AI advancement and paves the way for more efficient computing architectures. Systolic arrays, known for their computational efficiency, low power use, and ease of implementation, address AI's computational challenges and are central to mainstream industry AI accelerators; improvements to the Processing Element (PE) significantly boost systolic array performance and streamline computing architectures, paving the way for more efficient solutions across technology fields. This research presents a novel PE design and its integration into a systolic array, based on a novel computing theory, bit-level mathematics, for the Multiply-Accumulate (MAC) operation. We present three different PE architectures and provide a comprehensive comparison between them and state-of-the-art technologies, focusing on power, area, and throughput. This research also demonstrates the integration of the proposed MAC unit design with systolic arrays, highlighting significant improvements in computational efficiency. Our implementation shows 2380952.38 times lower latency, 64.19 times fewer DSP48E1s, 1.26 times fewer Look-Up Tables (LUTs), and 10.76 times fewer Flip-Flops (FFs), with 99.63 times less power consumption and 15.19 times higher performance per PE compared with state-of-the-art designs.

Key words: Parallel MAC Unit; Processing Element (PE); FPGA Implementation; Computing Theory

1 Introduction

In the progression of computational technologies, particularly in fields such as Digital Signal Processing (DSP)[1], Artificial Intelligence (AI)[2], and image processing[3], the conversion of ordinary differential equations (ODEs) into matrix multiplication formats is an important process[4]. This transformation is key to making the most of today's computing architectures, underscoring the need for highly efficient Multiply-Accumulate (MAC) units. These units are essential for matrix operations, turning ODEs into tasks computers can handle and exploiting modern hardware's capability for parallel processing, enabling simultaneous execution of multiple tasks. This is a significant step towards using computing power more effectively across a wide range of applications[5]. Therefore, the enhancement and progression of MAC units are central to advancing computational infrastructures, supporting the technological evolution of future generations.

As computing architecture advances, the role of MAC units in handling large-scale matrix multiplications becomes indispensable, particularly for the operation of deep learning and neural networks. This efficiency reduces unnecessary computations and storage demands, positioning MAC units as an essential element in the advancement and streamlining of a wide range of technological applications. Contemporary architectures engaging in critical tasks like matrix multiplication frequently employ numerous MAC units to diminish memory interactions. The Google Tensor Processing Unit (TPU) uses a 4 by 4 by 4 by 64 systolic array for accelerating large-scale tensor operations[6]. Such an architecture bolsters the computation process by enabling data reuse and reducing complexity.
It connects MAC units in series to form a Processing Element (PE), enhancing parallelism and performance[7]. Similarly, Huawei's DaVinci AI accelerator integrates 16 MAC units within each PE, accompanied by an adder tree for effective result accumulation[8]. This arrangement is calibrated for both standard and intricate computations, underscoring the importance of sparsity management in AI.

In applications like Convolutional Neural Networks (CNNs), where large-scale matrix operations are commonplace, various techniques are employed to optimize computational efficiency. One such technique is the "img2col" function, which transforms 3D convolutions into matrix multiplications, necessitating the handling of matrices with vast dimensions[9]. Another noteworthy approach is the Winograd algorithm[10], which boosts computational efficiency by reducing the number of multiplications needed for convolution. This algorithm can notably speed up convolution operations, especially when computing resources are limited or latency is a concern. The need for such optimizations is attributable to the increasing sizes of images and feature maps. Architectures based on systolic arrays, such as the TPU and NVDLA[11], are favoured for their capability to accelerate these computations. In such structures, MAC units that offer lower latency or tighter integration are particularly valuable, especially for processing sparse matrices in large AI models, where sparsity optimization is essential for improving computational efficiency and performance.

In this research, we propose a groundbreaking MAC unit founded on a novel computing theory. Our integrated MAC units are scalable and naturally parallel, allowing multiple MAC operations to be performed efficiently within a single component, and can be effectively integrated with systolic arrays. The key contributions of this work are as follows:

• Presents a novel computing theory centred on bit-level mathematics for the design of MAC units, and introduces a novel PE tailored accordingly for multiply-accumulate operations.

• Develops three distinct hardware architectures, each optimised for either high throughput or reduced latency, and integrates the fastest architecture with a systolic array.

• Conducts a comprehensive comparison: evaluates the three MAC units, compares the systolic array integration against the existing literature, and focuses on power consumption, area, and throughput.

The remainder of the article is structured as follows. Section 2 reviews the current landscape of MAC units and the computational power derived from CPUs and ASICs, analyses their limitations, and then examines systolic arrays, highlighting their potential to overcome these challenges. Section 3 offers an in-depth exploration of the computing theory and the hardware architecture, and also shows the implementation with the systolic array. Section 4 evaluates our three novel MAC units using the different architectures and then compares the systolic array integration with other research. We further assess our architecture through the application of Automatic Modulation Classification (AMC)[12].
The experimental results demonstrate that our architecture achieves a time saving of 99.98% compared with the existing literature. Finally, the last section concludes with an analysis of our MAC unit's potential and summarizes our future research directions.

2 Literature Review

Transforming ODE problems into General Matrix Multiply (GEMM) operations has become a crucial technique in computational studies. This process involves breaking ODEs down into finite difference equations, which are then represented as a matrix equation A·y = b. Here, A contains the coefficients of the equations, y represents the solution vector, and b is the right-hand side of the equations[4]. By solving this matrix equation using techniques that reduce to GEMM-like operations, such as matrix inversion or iterative methods, we can determine the values of y at each time point (a short illustrative sketch is given later in this section, just before the discussion of CPUs and ASICs). This method helps bridge the gap between ODEs and AI advancements, making computational infrastructures more efficient and streamlining technological applications. The transformation enables efficient numerical solution techniques, making model training and inference in AI faster and improving algorithmic execution in signal processing tasks. By breaking ODEs down and representing them as matrix equations, computational architectures can efficiently use matrix operations for computation. This approach underscores the versatility and applicability of matrix operations across various domains, highlighting the role of GEMM in advancing computational technologies, particularly in accelerating MAC operations, a basic component of many computational tasks.

MAC units play a pivotal role across various domains, significantly enhancing computational efficiency in deep learning, signal processing, and FPGA-based acceleration. In deep learning, MAC units streamline computations, facilitating faster model training and inference while optimizing energy consumption[13]. Signal processing benefits from MAC units through improved algorithmic execution for tasks like filtering and audio processing, thereby elevating operational speed and accuracy[14]. Furthermore, FPGA-based neural network accelerators leverage MAC units to achieve substantial gains in performance and efficiency, demonstrating the broad applicability and impact of MAC units in advancing computational technologies[15].

In contemporary cloud-edge computing architectures, two primary performance requirements emerge distinctly: high throughput and low latency. High throughput is vital for scenarios necessitating the rapid transmission or reception of vast data volumes[16]. This capability is especially crucial in cloud data transmission, where the efficient handling of large-scale data sets, such as video streaming and bulk data analytics, is paramount. On the other hand, low latency is imperative for applications involving smaller data packets that demand immediate processing[17]. Such requirements are paramount in the autonomous driving sector, where milliseconds can determine the safety of decisions made by the vehicle's AI systems[18]. Additionally, low latency is critical in real-time gaming and financial trading applications, where delay reduction translates directly into competitive advantage and operational efficiency.
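The ODE-to-GEMM mapping described at the start of this section can be made concrete with a small, purely illustrative sketch. The boundary-value problem, grid size, right-hand side, and solver call below are arbitrary choices for illustration only and are not taken from the cited works:

```python
# Illustrative only: discretising y''(x) = f(x) on [0, 1] with central differences turns
# the ODE into a tridiagonal matrix equation A @ y = b, which dense linear-algebra
# (GEMM-style) kernels and MAC-heavy hardware can then process.
import numpy as np

n, h = 50, 1.0 / 51                       # interior grid points and spacing (hypothetical)
x = np.linspace(h, 1 - h, n)
f = np.sin(np.pi * x)                     # hypothetical right-hand side f(x)

# Central difference: (y[i-1] - 2*y[i] + y[i+1]) / h**2 = f[i], with y(0) = y(1) = 0.
A = (np.diag(-2 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / h**2
b = f
y = np.linalg.solve(A, b)                 # in practice handled by GEMM/BLAS-backed solvers
```

Once the ODE is in this A·y = b form, the workload is dominated by exactly the multiply-accumulate operations discussed below.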
In recent developments within the field of computing, both CPUs and ASICs have undergone significant evaluation, with studies focusing on their INT8 performance in operations per second and per watt. ASICs, designed for tasks requiring high computational throughput and energy efficiency, showcase superior performance. For instance, NVIDIA's Tesla P4 and P40 GPUs, based on the Pascal architecture, achieve 21.8 and 47.0 INT8 TOP/s respectively, underlining the advantages of such dedicated accelerators in deep learning inference[19]. Conversely, Intel's "Sapphire Rapids" 4th Gen Xeon processors, featuring Advanced Matrix Extensions (AMX), illustrate the effort to enhance CPUs' capability for AI inference tasks. These processors manage 2,048 INT8 operations per cycle per core, marking a step towards narrowing the performance gap between CPUs and ASICs for specific computational tasks[20]. Additionally, Google's Coral M.2 Accelerator is a noteworthy ASIC example, offering 4 TOP/s at an energy consumption of only 2 watts. This efficiency, translating to 2 TOPS/W, sets a benchmark for power-efficient machine learning acceleration[21]. These instances reflect the ongoing technological advancements in ASICs and CPUs, with ASICs maintaining a lead in efficiency and performance for specialized tasks, whereas CPUs are catching up in versatility for a wider application spectrum.

However, despite the advancements in MAC units enhancing computational efficiency, scalability and adaptability challenges persist. ASICs, tailored for specific tasks, sacrifice flexibility, limiting their utility outside their optimized scope. In contrast, CPUs offer versatility but struggle with inefficiency, especially in energy-sensitive settings[22], [23]. This contrast highlights the delicate balance between performance specialization and general adaptability in computational architectures. It underscores the intricacies of choosing the appropriate technology, emphasizing the necessity to align architectural decisions with the specific requirements of applications, while also considering energy efficiency and flexibility[22], [23]. Examining MAC units, CPUs, and ASICs highlights strides in processing efficiency, emphasizing the interplay between energy efficiency, operational flexibility, and cost. This discourse reveals not only technological advancements but also the complex decision-making process that system architects and engineers navigate. It shows how advancements in MAC units, while significant, are moderated by the practical limitations and trade-offs of choosing between CPUs and ASICs. This intricate balance is crucial for optimizing computational resources across applications, highlighting the pursuit of architectures that marry high performance with adaptability and energy efficiency[22], [23].

In this evolving landscape of computational architectures, the emergence and integration of systolic arrays present an innovative approach to addressing some of the above challenges, particularly in scenarios demanding high throughput and energy efficiency[24]. The principle of a systolic array, anchored in parallel computing architecture, allows data to move synchronously across the array between adjacent processors. This method of operation contributes significantly to reducing the need for external memory access, thus enhancing computational efficiency.
This architecture excels in applications requiring intensive parallel processing, such as artificial intelligence and image processing, by effectively handling data parallelism and reducing operational steps, in contrast to the limitations posed by Amdahl's Law in traditional architectures. Furthermore, innovations in systolic array design have expanded their functionality to include nonlinear operations efficiently, further enhancing their suitability for complex neural network inference tasks[25].

At the core of a systolic array's architecture is a two-dimensional grid of PEs, each directly connected to its immediate neighbours. This design facilitates efficient data transfer and reduces reliance on central memory, offering an advantage in processing-intensive tasks such as those found in deep learning applications[24]. Systolic arrays operate in stages, injecting data at their edges and delivering results sequentially, showcasing continuous operation and parallel processing efficiency. This underpins their role in enhancing energy efficiency and throughput. Design considerations such as uniform dataflow and edge-condition management are crucial for addressing scalability and adaptability challenges, ensuring systolic arrays achieve optimal performance. Their integration into computational architectures represents a significant advancement, contributing to the development of high-performance, adaptable, and energy-efficient systems. Systolic arrays, alongside MAC units, CPUs, and ASICs, enrich the diversity of technological solutions, driving innovation and improvement in the field[25].

3 Methodology

3.1 Computing Theory

In mathematics, to multiply matrices we multiply each pair of corresponding numbers and then sum all the results. In computer hardware, however, numbers at the data level must first be converted to the bit level. Take two vectors, A and B, each containing m numbers of k bits, as an example:

$A_m = \sum_{i=0}^{k-1} a_i \times 2^{i}$   (1)

$B_m = \sum_{j=0}^{k-1} b_j \times 2^{j}$   (2)

Equations (1) and (2) express each element through its individual bits. Combining them with the matrix multiplication formula yields the following result:

$C = \sum_{0}^{2k-1} A_m \times B_m = \sum_{0}^{2k-1} \left[ \left( \sum_{i=0}^{k-1} a_i \times 2^{i} \right) \times \left( \sum_{j=0}^{k-1} b_j \times 2^{j} \right) \right] = \sum_{0}^{2k-1} \left[ \left( \sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_i \times b_j \right) \times 2^{i+j} \right]$   (3)

The transformation from the original result to what we see in Eq. (3) is quite straightforward. The initial sum symbol, indexed from 0 to 2k−1, can be expressed as an accumulator in the hardware. The second and third sum symbols, indexed by i, j = 0 to k−1, can be represented simply by a counter. The powers of 2 become shifters, and the binary multiplications are performed by AND gates. There is no data dependency between these sums, making it easy to scale to numbers of different word lengths and to different numbers of vector elements. Rather than multiplying separate numbers and then summing the products, our computing approach fuses the entire MAC process: a counter, AND gates, and shifters replace the original multipliers, boosting efficiency. Table 1 summarizes this mapping between the mathematical operations and the digital logic.

Table 1 Mathematic operation with digital logic hardware
Mathematic operation | Digital logic hardware
$\sum_{0}^{2k-1}$ | Accumulator
$\sum_{i=0}^{k-1} \sum_{j=0}^{k-1} a_i \times b_j$ | Bit-count
$a_i \times b_j$ | AND gate
$2^{i+j}$ | Bit-shift
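To make the mapping in Table 1 concrete, the following is a minimal software model of Eq. (3) in Python. It is an illustrative sketch, not the authors' Verilog implementation: the operand bits are ANDed, the AND outputs are grouped by their shift amount s = i + j and counted (which is also the merging idea behind Table 2 in the next subsection), and the counts are then shifted and accumulated.

```python
# A minimal software model of the bit-level MAC of Eq. (3): AND gates, bit-count per
# shift amount, bit-shift, and a final accumulator.

def bit_level_mac(vec_a, vec_b, k=8):
    """Dot product of two unsigned k-bit vectors using AND / bit-count / shift / accumulate."""
    assert len(vec_a) == len(vec_b)
    counts = [0] * (2 * k - 1)          # one counter per shift amount 0 .. 2k-2 (15 for INT8)
    for a, b in zip(vec_a, vec_b):
        for i in range(k):              # bit i of a
            for j in range(k):          # bit j of b
                counts[i + j] += (a >> i) & (b >> j) & 1   # AND gate output
    acc = 0
    for s, c in enumerate(counts):      # bit-shift and accumulate
        acc += c << s
    return acc

# Quick check against the ordinary multiply-accumulate result.
import random
a = [random.randrange(256) for _ in range(8)]   # eight unsigned INT8 values
b = [random.randrange(256) for _ in range(8)]
assert bit_level_mac(a, b) == sum(x * y for x, y in zip(a, b))
```

The 15 counters in this sketch correspond directly to the merged shift groups discussed next.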
3.2 Hardware Architecture

Upon attaining a comprehensive grasp of the mathematical formula and its associated digital logic, the subsequent phase entails the development of the hardware architecture. INT8 is used as an example to demonstrate the capability of our proposed computing theory along with its scalable hardware architecture. This architecture is devised in alignment with Eq. (3).

Fig. 1 General Architecture of the MAC unit based on our novel computing theory

Figure 1 depicts the overarching structure of the MAC unit. Upon the input of two vectors, each comprising eight 8-bit numbers, a conversion process translates these numbers from the decimal (data) level to the binary (bit) level, thereby transforming the vectors into matrices. The corresponding entries of the two matrices are then processed by the AND gates, followed by the counters and the shifters. The last step is the computation of the result in the accumulator.

The structure's potential for a straightforward pipeline arrangement is evident. Adhering to Eq. (3) in its strictest form would require 64 counters and fixed shifters in the hardware, which could lead to considerable resource consumption. For 8-bit numbers the shift amount ranges from 0 to 14, allowing certain results to be merged and reducing the number of counters and shifters from 64 to 15. Table 2 elucidates how these counters and shifters are combined.

Table 2 Merge table for reducing the number of counters and shifters
Shift | Bit positions from A and B | Number of AND gates (in bits)
0 | 0/0 | 7
1 | 1/0, 0/1 | 14
2 | 0/2, 1/1, 2/0 | 21
3 | 0/3, 1/2, 2/1, 3/0 | 28
4 | 0/4, 1/3, 2/2, 3/1, 4/0 | 35
5 | 0/5, 1/4, 2/3, 3/2, 4/1, 5/0 | 42
6 | 0/6, 1/5, 2/4, 3/3, 4/2, 5/1, 6/0 | 49
7 | 0/7, 1/6, 2/5, 3/4, 4/3, 5/2, 6/1, 7/0 | 56
8 | 1/7, 2/6, 3/5, 4/4, 5/3, 6/2, 7/1 | 49
9 | 2/7, 3/6, 4/5, 5/4, 6/3, 7/2 | 42
10 | 3/7, 4/6, 5/5, 6/4, 7/3 | 35
11 | 4/7, 5/6, 6/5, 7/4 | 28
12 | 5/7, 6/6, 7/5 | 21
13 | 6/7, 7/6 | 14
14 | 7/7 | 7

The first column of Table 2 gives the number of bits to shift, the second column lists the bit positions of A and B (8-bit numbers), and the third column details the required number of 1-bit AND gates. The most prevalent scenario involves a shift of 7. Although the total number of AND gates remains the same as before the merge, the strategy effectively reduces the number of counters and shifters from 64 to a mere 15.

3.3 Look-Up Table (LUT)-based Architecture

To further reduce the number of counters and shifters and to speed up execution, an alternative approach is to leverage a look-up table (LUT) to bypass the intermediate AND-gate, bit-count, and bit-shift steps and obtain the desired results directly from the bit-level matrix. By dividing the numbers in the bit-level matrix into groups of 4 bits, there would be $2^{14}$ = 16,384 unique entries in the look-up table, given that the maximum number represented by four bits is 15 (i.e., '1111' in binary). We can therefore construct a table that maps these 16,384 possibilities to their corresponding results, eliminating the intermediate computational steps. To determine the RAM size required for this look-up table, we need to consider the total number of entries and the size of each entry. Given that there are 16,384 unique entries, and each entry corresponds to the sum of the products of a 4-bit group, which can be up to 8 × 8 = 64 (as two 4-bit numbers can each represent up to the decimal value 15), we would need a RAM of 16,384 × 7 bits, since 64 can be represented with 7 bits. In bytes (1 byte = 8 bits), the RAM size would be (16,384 × 7) / 8 = 14,336 bytes. This is an ideal estimate; real-world RAM modules usually come in powers of two, so the actual size would be rounded up to the closest power of two, namely $2^{14}$ = 16,384 bytes.
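The Python fragment below illustrates only the general principle of trading arithmetic for table look-ups; the paper's ROM is organised differently (16,384 entries holding bit-count sums read directly from the bit-level matrix), whereas this sketch simply splits each 8-bit operand into 4-bit halves and reads their partial products from a hypothetical 256-entry table.

```python
# Sketch of the look-up idea only: replace multiplies with reads from a precomputed table.
# This is NOT the paper's 16,384-entry table; it is a simplified stand-in for illustration.

# Precompute all 4-bit x 4-bit products once (a small "ROM").
PROD_LUT = [[hi * lo for lo in range(16)] for hi in range(16)]

def lut_mac(vec_a, vec_b):
    acc = 0
    for a, b in zip(vec_a, vec_b):
        a_hi, a_lo = a >> 4, a & 0xF
        b_hi, b_lo = b >> 4, b & 0xF
        # Four table reads replace one 8-bit multiply; shifts re-align the 4-bit groups.
        acc += (PROD_LUT[a_hi][b_hi] << 8) + (PROD_LUT[a_hi][b_lo] << 4) \
             + (PROD_LUT[a_lo][b_hi] << 4) + PROD_LUT[a_lo][b_lo]
    return acc

assert lut_mac([3, 200], [7, 150]) == 3 * 7 + 200 * 150
```

Either way, compute logic is exchanged for memory accesses, which is precisely the trade-off discussed next.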
The utilization of a look-up table in this context is an efficient mechanism to quickly access the result for a particular 4-bit group, thereby avoiding the time and computational resources required by the intermediate steps. This substantial time saving and reduced computational load can greatly enhance the processing speed. However, a look-up table increases the memory requirement, since the table must be stored. The feasibility of this approach is therefore contingent on the trade-off between memory availability and the need for computational efficiency and processing speed. In scenarios where memory resources are abundant and speed is a priority, the look-up table method can greatly optimize the performance of the bit-level MAC unit.

Fig. 2 Stages for LUT-based Architecture

As depicted in Fig. 2, the process consists of three stages. In Stage 1, the 8-bit input data is split into two sets of 4-bit data. In Stage 2, the results are retrieved from the ROM, which serves as a substitute for the AND gates, counters, and shifters. The final stage encompasses the accumulator, where all the results are summed to yield the final outcome. The look-up table implementation of the MAC unit offers a promising direction for future research and development, potentially providing considerable improvements in speed, energy efficiency, and simplification of the hardware architecture. It exemplifies an innovative way to further harness the potential of bit-level computation in AI hardware acceleration.

Fig. 3 Project Structure

3.4 Systolic Array

Following the completion of the above work, the PE is integrated into the systolic array. As shown in Fig. 3(a), the systolic array is configured with parameters for rows, columns, and the number of input bits, facilitating flexible scalability. The module takes in clock, reset, and start signals, along with data inputs from the top and left sides, and a selection signal for output processing. It instantiates a grid of PEs, connecting them so that data flows horizontally and vertically through the array. Each PE receives input from its left and top neighbours, processes it, and passes the result to its right and bottom neighbours. This design enables efficient parallel processing of data, typical of applications such as matrix multiplication and digital signal processing. The outputs of the array are aggregated by an accumulator and written to a register, which produces the final result based on the provided selection signal.
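As a behavioural illustration of this dataflow, the Python sketch below assumes an output-stationary scheme (each PE keeps its own partial sum); it is a simulation for intuition, not a description of the Verilog module. Operands entering from the left and the top are skewed by one cycle per row/column, every PE multiply-accumulates locally while forwarding its inputs right and down, and the result is checked against an ordinary matrix product.

```python
# Behavioural sketch of a Size x Size systolic matrix multiply (output-stationary assumption).
import numpy as np

def systolic_matmul(A, B):
    n = A.shape[0]                               # square matrices for simplicity
    acc = np.zeros((n, n), dtype=A.dtype)        # one accumulator per PE
    a_reg = np.zeros((n, n), dtype=A.dtype)      # value each PE forwards to its right neighbour
    b_reg = np.zeros((n, n), dtype=A.dtype)      # value each PE forwards downward
    for t in range(3 * n - 2):                   # enough cycles to drain the array
        new_a = np.zeros_like(a_reg)
        new_b = np.zeros_like(b_reg)
        for i in range(n):
            for j in range(n):
                # Edge PEs read the skewed input streams; inner PEs read their neighbours.
                a_in = a_reg[i][j - 1] if j > 0 else (A[i][t - i] if 0 <= t - i < n else 0)
                b_in = b_reg[i - 1][j] if i > 0 else (B[t - j][j] if 0 <= t - j < n else 0)
                acc[i][j] += a_in * b_in         # the PE's MAC operation
                new_a[i][j], new_b[i][j] = a_in, b_in
        a_reg, b_reg = new_a, new_b
    return acc

A = np.random.randint(0, 256, (4, 4))            # INT8-range test data
B = np.random.randint(0, 256, (4, 4))
assert np.array_equal(systolic_matmul(A, B), A @ B)
```

The one-cycle skew is what makes matching A and B elements meet in the same PE on the same cycle, so no PE ever waits on central memory during the computation.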
3.5 Implementation

The above architecture is developed in Verilog HDL and implemented with Xilinx Vivado 2022.2. The overall structure of the design is illustrated in Fig. 3. Based on the computing theory and the LUT-based architecture discussed above, three PE architectures are designed: a pure digital circuit architecture based directly on the computing theory, and two architectures optimised with LUTs, Swap A&C and Swap All. Swap A&C replaces the AND-gate and bit-count operations with a LUT, whereas Swap All replaces the AND-gate, bit-count, and bit-shift operations with a LUT. The novel PE is then integrated into the systolic array, a key component of AI acceleration. As illustrated in Fig. 3(b), the PE module is composed of the core module and an accumulator. Within the core module lie the conversion module and the calculation module, as shown in Fig. 3(c). The conversion module converts INT8 vectors into an 8×8 binary matrix, while the calculation module either performs the computational operations or accesses the look-up table data stored in ROM.

4 Results and Discussion

The hardware design in this research targets the Xilinx Zynq UltraScale+ ZCU102 (XCZU5EV) board. We conducted tests on the PE using 16-MAC configurations, as shown in Table 3. The table outlines the resource consumption of the three architectures: Pure Digital Circuit, Swap A&C (AND gate and bit-count), and Swap All (AND gate, bit-count, and bit-shift).

Table 3 PE hardware resources (16 MACs)
Resource | Pure Digital Circuit | Swap A&C | Swap All | Swap All | Swap All
Frequency (MHz) | 125 | 125 | 125 | 250 | 500
Latency (clk cycles) | 2 | 2 | 2 | 2 | 3
Performance (GOPs) | 1 | 1 | 1 | 2 | 2.67
Area - DSP48E1 | 1 | 1 | 1 | 1 | 1
Area - LUTs | 2591 | 2335 | 11493 | 11493 | 11493
Area - FF | 1539 | 1539 | 771 | 771 | 771

Among these, Swap All exhibits the lowest latency, requiring only two clock cycles at clock periods of 8 ns and 4 ns and three clock cycles at 2 ns, implying that the results of 16 MAC operations can be obtained as quickly as every 6 ns. In contrast, the Pure Digital Circuit approach consumes the least resources and area, saving 77.46% in LUTs and 51.52% in DSP48E1s compared with Swap All. Compared with the pure circuit, Swap A&C reduces superfluous LUT consumption by decreasing the bit operations in the AND gates and the addition operations in the counters, leading to a slight overall reduction in resource usage.

Subsequently, the Swap All approach with a 4 ns clock period was selected for integration with the systolic array. Although the 2 ns configuration offers lower latency, its mere 0.019 ns timing margin (clock period × latency = 2 ns × 3 cycles = 6 ns, minus the 5.981 ns path delay, leaves 0.019 ns) might induce instability in the systolic array, so it was not chosen for integration.

Fig. 4 Power, Performance and Efficiency of Systolic Array

Figure 4 displays the measured power, performance, and power efficiency after integrating this approach into systolic arrays of sizes 2×2 to 12×12. As the operations and latency of each PE are the same, the performance and power efficiency can be calculated as follows:

$\text{Performance} = \frac{(16+16+15)\ \text{operations} \times \text{Size}^2}{4\ \text{ns} \times 4\ \text{clk cycles}} + \frac{\text{Size}^2 - 1}{4\ \text{ns} \times 1\ \text{clk cycle}} = \left(3.1875\ \text{Size}^2 - 0.25\right)\ \text{GOP/s}$

$\text{Power Efficiency} = \text{Performance} / \text{Power}$

The data above were used to predict the power trend for sizes extending from 12×12 to 30×30. Both power and performance increase with the array size, while power efficiency reaches a maximum, the size closest to this maximum being 14×14. At this size the predicted power is 0.74 W, the performance is 624.50 GOPs, and the power efficiency peaks at 839.52 GOPs/W.
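For reference, evaluating the performance expression above in Python reproduces the throughput values reported for the array sizes in Table 4 (2.94 GOPs at 1×1 up to 624.50 GOPs at 14×14, roughly 3.19 GOPs per PE); the script below is only a numerical cross-check of that formula.

```python
# Cross-check of the performance expression against the values reported in Table 4.
def performance_gops(size):
    pe_term = (16 + 16 + 15) * size**2 / (4 * 4)   # 47 ops per PE result, every 4 cycles of 4 ns
    accum_term = (size**2 - 1) / (4 * 1)           # combining PE outputs, 1 op per 4 ns cycle
    return pe_term + accum_term                    # ops per ns == GOP/s

for size in (1, 2, 3, 4, 6, 14):
    p = performance_gops(size)
    print(f"{size:>2}x{size:<2} {p:8.2f} GOPs  ({p / size**2:.2f} GOPs per PE)")
```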
Table 4 lists the frequency, power, performance, and area data of the PE with the Swap All architecture applied to systolic arrays of sizes 1×1 to 6×6 and 14×14. It also includes comparisons with FPGA accelerators from other literature. Data surpassing ours are highlighted in red, and inferior data in green.

Table 4 Comparisons with other FPGA accelerators
Metric | Ours | Ours | Ours | Ours | Ours | Ours | Ours | [26] | [27] | [28] | [29] | [30] | [31]
Device | XCZU5EV | XCZU5EV | XCZU5EV | XCZU5EV | XCZU5EV | XCZU5EV | XCZU5EV | Zedboard | Zedboard | U280 | PYNQ-Z2 | XZCU3EG | XC7VX690T
Network | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | LeNet-5 | BERT | VGG-16 | VGG-16 | VGG-16
Size | 1×1 | 2×2 | 3×3 | 4×4 | 5×5 | 6×6 | 14×14 | 8×4 | 8×4 | 8×16 | 32×16 | 8 PE | 16 PE
Frequency (MHz) | 250 | 250 | 250 | 250 | 250 | 250 | 250 | 100 | 100 | 245 | 120 | 300 | 200
Quantization strategy | 8-bit int | 8-bit int | 8-bit int | 8-bit int | 8-bit int | 8-bit int | 8-bit int | 16-bit int | N/A | W4A8 | 8-bit int | 8-bit int | 8-bit fp
Latency (ns) | 16 | 52 | 84 | 116 | 148 | 180 | 436 | N/A | 231240 | 26010000 | N/A | 200000000 | N/A
Power (W) | 0.044 | 0.085 | 0.108 | 0.134 | 0.165 | 0.201 | 0.744 | 5.386 | N/A | N/A | N/A | 10.760 | 9.180
Performance (GOPs) | 2.94 | 12.50 | 28.44 | 50.75 | 79.44 | 114.50 | 624.50 | N/A | 2.09 | N/A | 107.21 | 144.20 | 760.83
Performance/PE (GOPs) | 2.94 | 3.13 | 3.16 | 3.17 | 3.18 | 3.18 | 3.19 | N/A | 0.07 | N/A | 0.21 | 18.03 | 47.55
Power efficiency (GOPs/W) | 66.76 | 147.06 | 263.31 | 378.73 | 481.44 | 569.65 | 839.52 | N/A | N/A | N/A | N/A | 13.40 | 82.88
Area - DSP48E1 | 1 | 4 | 9 | 16 | 25 | 36 | 196 | 32 | 64 | 1780 | N/A | 296 | 1027
Area - Memory (kb) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 108 | 1242 | 9000 | N/A | 720 | 16434
Area - LUTs | 11493 | 45972 | 103693 | 184400 | 288093 | 414772 | 2255700 | 2774 | 28861 | 653000 | N/A | 24350 | 231761
Area - FF | 771 | 2572 | 6939 | 13104 | 20811 | 29548 | 167582 | 3336 | 41828 | 569000 | N/A | 23019 | 140971

The research by Yang[26] and Cao[27], which used a Zedboard to implement an 8×4 systolic array, is compared with our 6×6 array. Our design uses 26.8 times less power than Yang's. Compared with [27], our design shows 54.78 times higher throughput and 1284.6 times lower latency. Chen's research[28], which runs BERT, a large language model, on a U280 board with an 8×16 systolic array, is compared with our 14×14 array, which achieves 59,655 times lower latency at a similar maximum clock frequency. In terms of area, despite using 3.45 times more LUTs, our design uses 9.08 times fewer DSP48E1s and 3.40 times fewer registers, and saves 9000 kb of memory. Xue's research[29] uses a 32×16 systolic array for the VGG-16 network. Compared with it, our 14×14 systolic array delivers 5.83 times higher throughput with fewer resources. Chen's[30] and Lian's[31] research do not involve a systolic array but use multiple parallel PEs without a dataflow suited to matrix multiplication. Compared with Chen's 8 PEs, our 3×3 array has 2380952.38 times lower latency and 99.63 times lower power consumption. Our throughput is 5.07 times lower, but our performance-power efficiency is 19.65 times higher. In terms of area, while our LUT count is a disadvantage at 4.26 times more, we use 32.89 times fewer DSP48E1s and 3.32 times fewer registers, and save 720 kb of block memory. Comparing our 4×4 array with Lian's research[31], which includes 16 PEs, our power consumption is 68.51 times lower. Our throughput is 14.99 times lower, but our performance-power efficiency is 4.2 times higher. In area, our architecture is superior in all measured aspects, with 64.19 times fewer DSP48E1s, 1.26 times fewer LUTs, 10.76 times fewer registers, and a saving of 16434 kb of block memory.

Overall, our architecture offers extremely low latency and power consumption, and better performance than other systolic array accelerators. The high number of LUTs used as ROM for the look-up tables is unavoidable; however, for crucial FPGA components such as DSP48E1s and memory, our consumption is lower.

Finally, this research is applied to a communications application. In Zhao et al.'s work on AMC[12], which involves extensive MAC operations, our design is used to replace those operations. The multiplication of two 100×100 matrices, which takes 1.989 ms in MATLAB, is completed in merely 79 clock cycles on the systolic array with the Swap All architecture, equating to only 316 ns, i.e., 6294 times faster than the MATLAB-based software implementation.
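The arithmetic behind that comparison is spelled out below for convenience: 79 cycles of the 4 ns clock give 316 ns, and the quoted 1.989 ms MATLAB runtime is about 6294 times longer.

```python
# Sanity check of the AMC speed-up figures quoted above.
cycles, clock_ns = 79, 4
hw_ns = cycles * clock_ns                 # 316 ns on the systolic array
matlab_ns = 1.989e-3 * 1e9                # 1.989 ms expressed in nanoseconds
print(hw_ns, round(matlab_ns / hw_ns))    # 316, ~6294
```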
5 Conclusion

In this research, the team proposed three innovative PE architectures designed to enhance the efficiency of MAC operations within systolic arrays, a fundamental component of AI accelerators. The novel computing theory, centred on bit-level mathematics, underpins these architectures and leads to significant improvements in computational efficiency and resource utilization. We have carried out comprehensive comparisons to show the strengths and limitations of our design. Because of its very low latency and high performance-power efficiency, our design is well suited to latency-critical and power-sensitive applications.

In the future, the team aims to improve three key areas of the architecture. The focus will be on enhancing how the design handles sparse matrices, which is essential for increasing data processing speed and efficiency in AI applications. We also plan to address the excessive use of LUTs, aiming to reduce the resource consumption they cause. Additionally, the current architecture only works with INT8, while the computing theory applies to any integer or fixed-point format. Floating-point numbers can be converted to fixed-point, but this inevitably incurs some data loss. Exploring adjustments to the PE size will therefore be undertaken to better accommodate a wider variety of data types, increasing the versatility and applicability of our architecture across different computational tasks.

References

[1] D. Wu, X. Fan, W. Cao, and L. Wang, SWM: A high-performance sparse-Winograd matrix multiplication CNN accelerator, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 5, pp. 936–949, 2021.
[2] J. Sohl, J. Wang, and D. Liu, Large matrix multiplication on a novel heterogeneous parallel architecture, in Advanced Parallel Processing Technologies 2009, LNCS, vol. 5737, Y. Dou, R. Gruber, and J. M. Joller, Eds. Heidelberg, Germany: Springer, 2009, pp. 408–419.
[3] Z. Al-Qadi and M. Aqel, Performance analysis of parallel matrix multiplication algorithms used in image processing, World Appl. Sci. J., vol. 6, no. 1, pp. 45–52, 2009.
[4] J. C. Butcher, Numerical Methods for Ordinary Differential Equations, 3rd ed. Hoboken, NJ, USA: John Wiley & Sons, 2016.
[5] F. Siddiqui, S. Amiri, U. I. Minhas, T. Deng, R. Woods, K. Rafferty, and D. Crookes, FPGA-based processor acceleration for image processing applications, Journal of Imaging, vol. 5, no. 1, 16, 2019.
[6] X. He, et al., Sparse-TPU: Adapting systolic arrays for sparse matrices, presented at the 34th ACM International Conference on Supercomputing, Barcelona, Spain, 2020.
[7] X. Wei, et al., Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs, presented at the 54th Annual Design Automation Conference, Austin, TX, USA, 2017.
[8] H. Liao, J. Tu, J. Xia, and X. Zhou, DaVinci: A scalable architecture for neural network computing, presented at the 2019 IEEE Hot Chips 31 Symposium, Cupertino, CA, USA, 2019, pp. 1–44.
[9] W. Z. Tao, Y. Wang, and H. Zhang, Overview of tensor layout in modern neural network accelerator, presented at the 18th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 2021.
[10] A. Lavin and S. Gray, Fast algorithms for convolutional neural networks, presented at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.
[11] S. Feng, J. Wu, S. Zhou, and R. Li, The DSP implementation of LeNet-5 with NVDLA on RISC-V SoC, presented at the 10th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 2019.
[12] Y. Zhao, W. C. J. Gavin, T. Deng, E. A. Ball, and L. Seed, A scalable and accurate chessboard-based AMC algorithm with low computing demands, IEEE Access, vol. 11, pp. 120955–120962, 2023.
[13] S. Yu, et al., Energy-efficient neural network design using memristive MAC unit, doi: 10.3389/felec.2022.877629.
[14] J. Garland and D. Gregg, Low complexity multiply-accumulate units for convolutional neural networks with weight-sharing, doi: 10.48550/arXiv.1801.10219.
[15] M. Cho and Y. Kim, FPGA-based convolutional neural network accelerator with resource-optimized approximate multiply-accumulate unit, Electronics, vol. 9, no. 1, 2020.
[16] G. Rong, Y. Xu, X. Tong, et al., An edge-cloud collaborative computing platform for building AIoT applications efficiently, J. Cloud Comp., vol. 10, 36, 2021, doi: 10.1186/s13677-021-00250-w.
[17] B. Cohen, et al., Edge computing: Next steps in architecture, design and testing, https://www.openstack.org/use-cases/edgecomputing/edge-computing-next-steps-in-architecture-design-and-testing.
[18] X. Ge, Ultra-reliable low-latency communications in autonomous vehicular networks, IEEE Trans. Vehicular Technology, vol. 68, no. 5, pp. 5005–5016, May 2019.
[19] M. Harris, New Pascal GPUs accelerate inference in the data center, https://developer.nvidia.com/blog/new-pascal-gpus-accelerate-inference-in-the-data-center, 2016.
[20] T. Morgan, Why AI inference will remain largely on the CPU, https://www.nextplatform.com/2023/Why-AI-Inference-Will-Remain-Largely-On-The-CPU, 2023.
[21] ASUS Global, Coral M.2 Accelerator B+M key, https://www.asus.com/AIoT-Industrial-Solutions/Coral-M2-Accelerator-BM-key, 2024.
[22] D-Central, ASIC vs FPGA vs GPU vs CPU: Understanding the differences, https://d-central.tech/asic-vs-fpga-vs-gpu-vs-cpu-understanding-the-differences, 2022.
[23] K. Anderson, Hardware for machine learning inference: CPUs, GPUs, TPUs, https://telnyx.com/resources/hardware-for-machine-learning-inference-cpus-gpus-tpus, 2023.
[24] H. T. Kung and C. E. Leiserson, Systolic arrays (for VLSI), in Sparse Matrix Proceedings 1978, I. S. Duff and G. W. Stewart, Eds. SIAM, 1979, pp. 256–282.
[25] R. Sun, Y. Ni, X. He, J. Zhao, and A. Zou, ONE-SA: Enabling nonlinear operations in systolic arrays for efficient and flexible neural network inference, doi: 10.48550/arXiv.2402.00395.
[26] K. Yang, H. Liu, Y. Zhao, and T. Deng, A new design approach of hardware implementation through natural language entry, IET Collaborative Intelligent Manufacturing, vol. 5, no. 4, e12087, 2023.
[27] Y. Cao, X. Wei, T. Qiao, and H. Chen, FPGA-based accelerator for convolution operations, presented at the 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Chongqing, China, 2019.
[28] H. Chen, J. Zhang, Y. Du, S. Xiang, Z. Yue, N. Zhang, Y. Cai, and Z. Zhang, Understanding the potential of FPGA-based spatial acceleration for large language model inference, arXiv:2312.15159, 2023.
[29] P. Xue, L. Pan, L. Sun, and M. Huang, Dual-line-systolic array for high performance CNN accelerator, presented at the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), New York City, NY, USA, 2022.
[30] X. Chen, J. Li, and Y. Zhao, Hardware resource and computational density efficient CNN accelerator design based on FPGA, presented at the 2021 IEEE International Conference on Integrated Circuits, Technologies and Applications (ICTA), Zhuhai, China, 2021.
[31] X. Lian, Z. Liu, Z. Song, J. Dai, W. Zhou, and X. Ji, High-performance FPGA-based CNN accelerator with block-floating-point arithmetic, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 8, pp. 1874–1885, Aug. 2019.