A General Computing Architecture Based on Systolic Array

Motivation

While compute-intensive workloads have long motivated the search for higher performance and greater efficiency, the emergence of machine learning has greatly intensified the urgency of finding alternatives to the classic Von Neumann computing architecture. Even if transistor geometries were still shrinking at the pace predicted by Moore's Law, the CPU is so far outclassed by other architectures that the impact of further improvements would be negligible.

Two design elements are present in the majority of the possible alternatives to the Von Neumann architecture: true parallelism and dataflow.

True parallelism involves decomposing a problem into many subproblems that can be calculated simultaneously. Parallelism has been widely used to improve the performance of Von Neumann architectures through techniques such as microinstruction pipelining and multiple cores. These techniques do not, in themselves, contribute to the decomposition of a problem and can generally offer only linear speedup. True parallelism provides a natural framework for decomposing a computation and is theoretically capable of achieving exponential speedup.

Dataflow is the movement of data through a computation, as opposed to a fetch-and-store to and from a separate memory system on each operation in a traditional processor. The memory wall, that is, the cost of data transfer between memory and the compute elements, has emerged as a primary barrier to improving the energy and time performance of Von Neumann architectures, as technology improvements can bring only sub-linear gains. Dataflow architectures have the potential to avoid the memory wall entirely and therefore have no intrinsic limit on the throughput that may be achieved through parallelism.

A major advantage of the Von Neumann architecture over the viable alternatives presented so far is its generality. A genetic relationship among CPU instruction set architectures, grounded in evolutionary improvements in processor design, has enabled software to be portable to a wide range of computers and to remain portable over many years. Software is routinely written by programmers who have no specific knowledge of the computing system on which it will be executed. Many alternatives to the Von Neumann architecture are specific to one high-performance computing problem and, being implemented directly in purpose-built hardware, are not software-programmable at all. Somewhat more general solutions are programmable, but they require knowledge of the target machine and are usually not suited to general-purpose programming, including the implementation of a layered architecture with an operating system and an application stack. A troubling characteristic of such implementations is that they are locked into the approaches and assumptions current at the time the design is made. In an area such as machine learning, where the techniques and even the objectives are evolving at a vertiginous rate, such early lock-in will arrest the normal development of the field.

Systolic Array is a term that has been used for a number of computing architectures that combine true parallelism and dataflow, and we will use it henceforth to describe our proposal for an alternative to the Von Neumann architecture.
Our objective has been to design a Systolic Array that also achieves the generality of a Von Neumann architecture, both on the specific point of compatibility with instruction set architectures, which will enable porting of existing code, and on the broader point of not constraining the discovery and development of new algorithms. The principles of this architecture are laid out in the following sections, where the potential of the system to achieve exponential speedup is demonstrated on a well-known computing problem, Edit Distance.

General Characteristics of the Systolic Array

n x m matrix: The fabric consists of an n x m matrix, where any matrix cell may be active in a compute cycle. A computation may be resolved in a single step within the limits of the matrix, or the fabric may be configured as a torus to provide an unbounded computational surface. In general, the Systolic Array trades space for time, achieving high performance by enabling a large number of simultaneous computational branches across the large fabric that modern circuit geometries are capable of creating.

Unary computation: Computations in the matrix are bitwise; the Systolic Array therefore supports data items of arbitrary bit length. Compatibility with legacy instruction set architectures and programming languages can be achieved using bitslice composition.

Tri-state processing element: The processing element at each intersection of wires in the matrix is a tri-state device; that is, routing flow is enabled through the high-impedance state, called NULL.

Propagation of computation: Computation is self-propagated in the matrix through transmission of instructions and data, with flow control using the NULL state.

No reconfiguration: The Systolic Array is not a reconfigurable device; function, routing, and propagation of computation all happen dynamically in the fixed array fabric. The Systolic Array may be implemented in an FPGA with a single synthesis that will enable any computation to be run on it.

Asynchronous: The Systolic Array is an unclocked architecture, in which computation propagates through the fabric until the compute is resolved, with sequential operations gated through control of routing via the NULL state.

Exponential Speed-up: Edit Distance Experimental Results

Edit distance is an exhaustively studied problem in computer science, with many known algorithms giving optimal solutions to variants of the basic problem of measuring how similar two strings or sequences are. Although the problem is easy to understand and the solutions seem relatively simple, edit distance is used in many real-world applications such as spell checking, plagiarism detection, and DNA sequence alignment. We examine here the implementation on the Systolic Array of Levenshtein Distance, one of the variants of Edit Distance. Levenshtein Distance has been shown to be impossible to calculate in less than near-quadratic time (a lower bound conditional on the Strong Exponential Time Hypothesis), that is, in approximately O(n²), where n is the number of characters compared in the two strings. We demonstrated this quadratic growth in terms of instruction cycles by measuring the optimal dynamic-programming solution to Levenshtein distance implemented on an Arm Cortex M0.
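For reference, the recurrence we measured is the textbook dynamic-programming formulation. The following is a minimal C sketch of that recurrence, assuming a rolling one-row buffer; the function name levenshtein and the exact structure are illustrative, and the benchmark build on the Cortex M0 may differ in detail. The point to note is that every one of the (n+1) x (m+1) matrix cells is computed sequentially, which is the source of the quadratic cycle counts below.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Textbook dynamic-programming Levenshtein distance.
 * A rolling one-row buffer keeps space at O(m), but time remains
 * O(n*m): every cell of the conceptual (n+1) x (m+1) matrix is
 * computed one after another. */
size_t levenshtein(const char *s1, const char *s2)
{
    size_t n = strlen(s1), m = strlen(s2);
    size_t *row = malloc((m + 1) * sizeof *row);
    if (!row)
        return SIZE_MAX;                     /* allocation failure */

    for (size_t j = 0; j <= m; j++)
        row[j] = j;                          /* distance from empty s1 prefix */

    for (size_t i = 1; i <= n; i++) {
        size_t diag = row[0];                /* D[i-1][j-1] */
        row[0] = i;                          /* distance to empty s2 prefix */
        for (size_t j = 1; j <= m; j++) {
            size_t up   = row[j];            /* D[i-1][j] */
            size_t cost = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
            size_t best = diag + cost;                        /* substitute/match */
            if (up + 1 < best)         best = up + 1;         /* delete */
            if (row[j - 1] + 1 < best) best = row[j - 1] + 1; /* insert */
            diag = up;
            row[j] = best;
        }
    }
    size_t d = row[m];
    free(row);
    return d;
}

For example, levenshtein("kitten", "sitting") returns 3.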
The measured cycle counts were as follows:

len str1   len str2   Total cycles
       1          1            292
       2          2           1136
       4          4           4480
       8          8          17792
      16         16          70912
      32         32         283136
      64         64        1131520
     128        128        4524032
     256        256       18092032
     512        512       72359936
    1024       1024      289423360
    2048       2048     1157660672

Plotted, these measurements show the characteristic quadratic curve. [Figure: total cycles versus string length on the Cortex M0, showing quadratic growth.]

The Systolic Array solution transforms Levenshtein Distance from O(n²) in time to O(n x m) in space, performing the computation in an n x m array where n is the length of the first string and m is the length of the second string. The photo below shows the results captured in our lab, where we compared execution of the optimal dynamic-programming solution for Levenshtein Distance on the Arm Cortex M0 with the same computation run on the Systolic Array implemented on a 35T Artix 7 FPGA. [Photo: lab measurement setup comparing the two implementations.] The Levenshtein distance is calculated in a single cycle on the Systolic Array irrespective of input size. Further, even for a very small example the Systolic Array performs the calculation three orders of magnitude faster than dynamic programming on a sequential processor: the test case run on the ARM core takes 60 microseconds, measured pin to pin with an oscilloscope in order to capture execution time on the fabric, while the same calculation, using the same measurement technique, takes 60 nanoseconds on the Systolic Array.

Systolic Array Programming

In the current state of development, programming of the Systolic Array is performed at a low level, equivalent to assembly language in sequential instruction set architectures. At this level, Systolic Array programming is mainly concerned with pattern matching and routing, which determine the progression of execution in the array and the generation of output signals. Systolic Array algorithms are based on concepts readily mapped to constructs in high-level languages, including tree structures and both functional and object-oriented programming paradigms.

There are two approaches to programming the Systolic Array. First, assembly code for a supported ISA can be compiled directly into its Systolic Array assembly equivalent. While this approach may not make optimal use of the array's capabilities, it will accelerate the code compared to the ISA's target architecture through parallel execution, unrolling of loops into space, and asynchronous execution. Much of the ARM ISA is already supported. This is the legacy-code solution.

The second approach is to develop a new algorithm conceived specifically for the true massive parallelism of the Systolic Array. This can be difficult, since computer science is built on a long history of treating sequentiality as a fundamental constraint. It is also an opportunity to rethink our approach to many key computing problems and perhaps, in time, to expand the range of problems recognized as computable. A sketch of such a rethinking, for the Levenshtein problem itself, follows.
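As an illustration of the second approach, the dependency structure of Levenshtein distance already exposes the parallelism the array exploits. The C sketch below is a model of that dependency structure, not Systolic Array code, and the function name levenshtein_wavefront is illustrative. It fills the dynamic-programming matrix one anti-diagonal at a time: every cell on a given anti-diagonal depends only on cells of the two previous anti-diagonals, so the inner loop has no loop-carried dependence, and with one processing element per cell, as in the n x m fabric described above, each anti-diagonal resolves in a single wave.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Wavefront formulation: the (n+1) x (m+1) matrix is filled one
 * anti-diagonal at a time. Cells on the same anti-diagonal never
 * read each other, so the inner loop's iterations are mutually
 * independent and could all fire at the same instant. */
size_t levenshtein_wavefront(const char *s1, const char *s2)
{
    size_t n = strlen(s1), m = strlen(s2);
    size_t *D = malloc((n + 1) * (m + 1) * sizeof *D);
    if (!D)
        return SIZE_MAX;                     /* allocation failure */
#define AT(i, j) D[(i) * (m + 1) + (j)]

    for (size_t k = 0; k <= n + m; k++) {    /* one wave per anti-diagonal */
        size_t i_lo = (k > m) ? k - m : 0;
        size_t i_hi = (k < n) ? k : n;
        /* Independent iterations: one processing element per cell
         * would resolve this whole anti-diagonal simultaneously. */
        for (size_t i = i_lo; i <= i_hi; i++) {
            size_t j = k - i;
            if (i == 0) { AT(0, j) = j; continue; }   /* empty s1 prefix */
            if (j == 0) { AT(i, 0) = i; continue; }   /* empty s2 prefix */
            size_t cost = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
            size_t best = AT(i - 1, j - 1) + cost;                /* substitute/match */
            if (AT(i - 1, j) + 1 < best) best = AT(i - 1, j) + 1; /* delete */
            if (AT(i, j - 1) + 1 < best) best = AT(i, j - 1) + 1; /* insert */
            AT(i, j) = best;
        }
    }
    size_t d = AT(n, m);
#undef AT
    free(D);
    return d;
}

On a sequential processor this reordering changes nothing asymptotically; its value is that it makes explicit which operations may occur at the same instant.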