A General Computing Architecture Based on Systolic Array
Motivation
While compute-intensive workloads have motivated the search for high performance
and greater efficiency, the emergence of machine learning has greatly intensified the
urgency of finding alternatives to the classic Von Neumann computing architecture.
Even if transistor geometries were still shrinking at the pace predicted by Moore’s
Law, the CPU is so far outclassed by other architectures that the impact of further
improvements would be negligible.
There are two design elements present in the majority of the various possible
alternatives to the Von Neumann architecture: true parallelism and data-flow. True
parallelism involves decomposition of a problem into many subproblems that can be
calculated simultaneously. Parallelism has been widely used to improve the
performance of Von Neumann architectures through techniques such as
micro-instruction pipelining and multi-core processors. These techniques do not, in
themselves, contribute to the decomposition of a problem and, generally, can only
offer linear speedup. True parallelism provides a natural framework for the
decomposition of a computation and is theoretically capable of achieving exponential
speedup. Data-flow is the movement of data through a computation, as opposed to a
fetch-and-store to and from a separate memory system on each operation in a
traditional processor. The memory wall, that is, the cost of data transfer between
memory and the compute elements, has emerged as a primary barrier to improving
the energy and time performance of Von Neumann architectures, as technology
improvements can only bring sub-linear gains. Data-flow architectures have the
potential to avoid the memory wall entirely and therefore have no intrinsic limit on
the throughput that may be achieved through parallelism.
A major advantage of the Von Neumann architecture over the viable alternatives
that have presented themselves so far is its generality. There is a genetic
relationship between CPU instruction set architectures, grounded in evolutionary
improvements in processor design, that has enabled software to be portable to a
wide range of computers and to remain portable over many years. Software is
routinely written by programmers who have no specific knowledge of the computing
system on which it will be executed. Many alternatives to the Von Neumann
architecture are specific to a single high-performance computing problem and, being
implemented directly in purpose-built hardware, are not software-programmable at
all. Somewhat more generalized solutions are programmable, but require knowledge
of the target machine and are usually not suited to general-purpose programming,
including the implementation of a layered architecture with an operating system and
an application stack. A troubling characteristic of such implementations is that they
are locked into the approaches and assumptions that are current at the time the
design is made. In an area such as machine learning, where the techniques and even
the objectives are evolving at a vertiginous rate, such early lock-in would arrest the
normal development of the field.
Systolic Array is a term that has been used for a number of computing architectures
that combine true parallelism and data-flow, and we will use it henceforth to describe
our proposal for an alternative to the Von Neumann architecture. Our objective has
been to design a Systolic Array that also achieves the generality of a Von Neumann
architecture, both on the specific point of compatibility with instruction set
architectures, which will enable porting of existing codes, and on the broader point of
not constraining the discovery and development of new algorithms. The principles of
this architecture are laid out in the following sections, where the potential of this
system to achieve exponential speedup is demonstrated on a well-known computing
problem, Edit Distance.
General Characteristics of the Systolic Array
n × m matrix: The fabric consists of an n × m matrix, where any matrix cell may be
active in a compute cycle. A computation may be resolved in a single step within the
limits of the matrix, or the fabric may be configured as a torus to enable an infinite
computational surface. In general, the systolic array trades space for time, achieving
high performance by enabling a large number of simultaneous computational
branches across the large fabric that modern circuit geometries are capable of
creating.
Unary computation: Computations in the matrix are bitwise; the Systolic Array
therefore supports data items of arbitrary bit lengths. Compatibility with legacy
instruction set architectures and programming languages can be achieved using
bit-slice composition.
Tri-state Processing Element: The processing element at each intersection of wires
in the matrix is a tri-state device, that is, routing flow is enabled through the
high-impedance state, called NULL.
Propagation of computation: Computation is self-propagated in the matrix through
the transmission of instructions and data, with flow control using the NULL state.
No reconfiguration: The Systolic Array is not a reconfigurable device; function,
routing, and propagation of computation all happen dynamically in the fixed array
fabric. The Systolic Array may be implemented in an FPGA with a single synthesis
that will enable any computation to be run on it.
Asynchronous: The Systolic Array is an unclocked architecture, in which computation
propagates through the fabric until the computation is resolved, with sequential
operations gated through control of routing via the NULL state.
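Bit-slice composition, mentioned above as the path to legacy ISA compatibility, builds wide operations out of one-bit steps. As a generic software illustration of the idea only (this is not the Systolic Array's actual cell logic, and the function names are ours), a multi-bit addition can be composed from one-bit full-adder slices:

```python
def full_adder(a, b, cin):
    # One-bit slice: sum bit and carry-out from two operand bits and a carry-in.
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def add_bits(x_bits, y_bits):
    # Compose one-bit slices, least-significant bit first. The word width
    # is set by the number of slices, not by a fixed register size, which
    # is what allows operands of arbitrary bit length.
    carry, out = 0, []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)
    return out

# 5 + 3 = 8, operands LSB-first: [1,0,1] + [1,1,0] -> [0,0,0,1]
print(add_bits([1, 0, 1], [1, 1, 0]))
```

The same composition scheme extends to any slice function, which is how a bitwise fabric can present word-oriented operations to legacy code.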
Exponential Speed-up: Edit Distance Experimental Results
Edit distance is an exhaustively studied problem in computer science, with many
known algorithms giving optimal solutions to variants of the basic problem of
measuring how similar two strings or sequences are. Although the problem is easy to
understand and the solutions seem relatively simple, edit distance is used in many
real-world problems such as spell checking, plagiarism detection, and DNA sequence
alignment. We are going to examine the implementation of Levenshtein Distance,
one of the variants of Edit Distance, on the Systolic Array. Levenshtein Distance has
been proven impossible to calculate in strongly subquadratic time (assuming the
Strong Exponential Time Hypothesis), that is, in less than approximately O(n²),
where n is the number of characters compared in the two strings.
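For reference, the standard dynamic programming recurrence can be sketched as follows. This is an illustrative Python sketch of the textbook algorithm, not the C implementation measured on the Cortex-M0 below:

```python
def levenshtein(a, b):
    # Classic dynamic programming: conceptually fills an
    # (len(a)+1) x (len(b)+1) table. A sequential processor must visit
    # every cell in turn, which is the source of the O(n*m) time cost.
    prev = list(range(len(b) + 1))        # row 0: distance from the empty prefix
    for i, ca in enumerate(a, 1):
        cur = [i]                         # column 0: i deletions
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match / substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # -> 3
```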
We demonstrated the quadratic growth in instruction cycles by performing
measurements using the optimal dynamic programming solution to Levenshtein
distance implemented on an Arm Cortex-M0.
len str1   len str2   Total cycles
       1          1            292
       2          2           1136
       4          4           4480
       8          8          17792
      16         16          70912
      32         32         283136
      64         64        1131520
     128        128        4524032
     256        256       18092032
     512        512       72359936
    1024       1024      289423360
    2048       2048     1157660672
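The quadratic trend is visible in the measurements themselves: each doubling of the string lengths multiplies the cycle count by roughly four. A quick sketch that checks this, using the cycle counts from the table above:

```python
# Cycle counts measured on the Cortex-M0 (from the table above);
# each entry doubles the string lengths of the previous one.
cycles = [292, 1136, 4480, 17792, 70912, 283136, 1131520,
          4524032, 18092032, 72359936, 289423360, 1157660672]

# For an O(n^2) algorithm, doubling n should roughly quadruple the cycles.
ratios = [b / a for a, b in zip(cycles, cycles[1:])]
for n, r in zip((2 ** k for k in range(len(ratios))), ratios):
    print(f"n {n:>4} -> {2 * n:>4}: cycles grow x{r:.3f}")
```

Every ratio sits just under 4, with the small deficit attributable to fixed per-call overhead that matters less as n grows.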
Plotted, we see the characteristic quadratic curve:
The Systolic Array solution transforms Levenshtein Distance from O(n²) in time to
O(n²) in space, performing the computation in an n × m array where n is the length
of the first string and m is the length of the second string.
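This mapping works because, in the dynamic programming table, cell (i, j) depends only on its three upper-left neighbours, so every cell on an anti-diagonal i + j = k is independent of the others and can be computed in the same wave. The sketch below simulates that wavefront order sequentially in Python; on the array itself, each wave fires across the fabric at once:

```python
def levenshtein_wavefront(s, t):
    # Cell (i, j) depends only on (i-1, j), (i, j-1) and (i-1, j-1),
    # so all cells on the anti-diagonal i + j = k can be computed in
    # one wave. Here the waves run one after another; the Systolic
    # Array evaluates every cell of a wave simultaneously.
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for k in range(n + m + 1):                        # one wave per diagonal
        for i in range(max(0, k - m), min(n, k) + 1):
            j = k - i
            if i == 0:
                D[i][j] = j                           # first row
            elif j == 0:
                D[i][j] = i                           # first column
            else:
                cost = 0 if s[i - 1] == t[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,
                              D[i][j - 1] + 1,
                              D[i - 1][j - 1] + cost)
    return D[n][m]

print(levenshtein_wavefront("kitten", "sitting"))   # -> 3
```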
The photo below shows the results captured in our lab, where we compared
execution of the optimal dynamic programming solution for Levenshtein Distance
on the Arm Cortex-M0 to the solution run on the Systolic Array implemented on an
Artix-7 35T FPGA. The Levenshtein distance is calculated in a single cycle on the
Systolic Array irrespective of input size. Further, even for a very small example the
Systolic Array performs the calculation several orders of magnitude faster than
dynamic programming on a sequential processor.
The test case run on the Arm core takes 60 microseconds, measured pin to pin with
an oscilloscope in order to capture execution time on the fabric, while the same
calculation, using the same measurement technique, takes 60 nanoseconds on the
Systolic Array.
Systolic Array Programming
In the current state of development, programming of the Systolic Array is
performed at a low level, equivalent to assembly language in sequential instruction
set architectures. At this level, Systolic Array programming is mainly concerned
with pattern matching and routing, which determine the progression of execution in
the array and the generation of output signals. Systolic Array algorithms are based
on concepts readily mapped to constructs in high-level languages, including tree
structures and both the functional and object-oriented programming paradigms.
There are two approaches to programming the Systolic Array. Assembly code for a
supported ISA can be compiled directly into the assembly equivalent on the
Systolic Array; while this approach may not make optimal use of its capabilities, it
will accelerate the code compared to the ISA's target architecture through parallel
execution, loop unrolling into space, and asynchronous execution. Much of the ARM
ISA is already supported. This is the legacy-code solution. The second approach is
to develop a new algorithm conceived specifically for the true massive parallelism
of the Systolic Array. This can be difficult, since computer science is built on a long
history of dealing with sequentiality as a fundamental constraint. It is also an
opportunity to rethink our approach to many key computing problems and perhaps,
in time, to expand the range of problems recognized as computable.