Connex Integral Parallel Architecture & the 13 Berkeley Motifs (version 1.1)

Abstract: The Connex Integral Parallel Architecture is a many-cell engine designed to solve intense computational problems. The Connex technology is presented and analyzed from the point of view of each of the 13 computational motifs proposed in Berkeley's View [1]. We conclude that the Connex technology works efficiently on almost all of the 13 computational motifs.

1. Introduction

Parallel processing offers two distinct solutions for the everlasting problems that challenge computer science: complexity and intensity. Coarse-grain networks of a few, or a few dozen, big and complex processors performing multi-threaded computation are proposed for complex computation, while fine-grain networks of hundreds or thousands of small and simple processing cells are proposed for intense computation. The Connex architecture (see the history of the project at http://arh.pub.ro/gstefan/conexMemory.html) is designed to perform intense computation. We take into account the fundamental differences between multi-processors and many-processors [2]. We see multi-processors as performing complex computations, while many-processors are designed for intense computations (the distinction between complex computation and intense computation is defined in [12]).

Academia provides a comprehensive survey of parallel computing in the seminal research report known as Berkeley's View [1], where 13 computational motifs are identified as the main challenges for the emerging parallel paradigm. This white paper investigates how the Connex architecture behaves with respect to the 13 computational motifs emphasized in Berkeley's View.

2. Connex Integral Parallel Architecture

An Integral Parallel Architecture (IPA) is defined as a fine-grain cellular machine with structural resources for performing mainly data parallel and time parallel computations, plus resources for computations requiring speculative execution. An IPA considers two main data structures: vectors of scalars, processed in data parallel machines, and streams of scalars, processed in time parallel machines. The Connex IPA performs the following types of parallel computation:

- data parallel computation, working on vectors and producing as result vectors, scalars (by parallel reduction operations) or streams (applied as inputs to time parallel computations)
- time parallel computation, working on streams and producing as result streams, scalars (by accumulating operations), or vectors (applied as inputs to data parallel computations)
- speculative parallel computation (expanded mainly inside the time parallel computation, but sometimes inside the data parallel computation), working on scalars and producing as result vectors that are immediately reduced by selection (the simplest reduction function) to a scalar
- reduction parallel computation, working on vectors and producing scalars as results.

Almost all parallel computations are data parallel (with the associated reduction operations), but some of them involve time parallel processes, supported by speculative computations if needed.
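To make the distinction concrete, the following plain C++ sketch models, on a sequential host, the shapes of the two dominant computation types: a data parallel step over a vector followed by an add reduction, and a time parallel step that pushes a stream of scalars through a pipe of functions. The types and function names are illustrative only; they are not part of the Connex tool chain.

  #include <cstdint>
  #include <functional>
  #include <iostream>
  #include <numeric>
  #include <vector>

  // Data parallel step: one operation applied to every component of a vector
  // (conceptually, one component per EU), followed by an add reduction.
  int32_t mapAndReduce(const std::vector<int16_t>& v,
                       const std::function<int16_t(int16_t)>& f) {
      std::vector<int16_t> mapped(v.size());
      for (size_t i = 0; i < v.size(); ++i) mapped[i] = f(v[i]);        // SIMD-like step
      return std::accumulate(mapped.begin(), mapped.end(), int32_t{0}); // reduction
  }

  // Time parallel step: a stream of scalars traverses a pipe of functions
  // (conceptually, one function per pipeline stage).
  std::vector<int16_t> pipe(std::vector<int16_t> stream,
                            const std::vector<std::function<int16_t(int16_t)>>& stages) {
      for (const auto& stage : stages)          // each stage processes the whole stream;
          for (auto& x : stream) x = stage(x);  // in hardware the stages overlap in time
      return stream;
  }

  int main() {
      std::vector<int16_t> v(1024, 2);
      std::cout << mapAndReduce(v, [](int16_t x) { return int16_t(x * x); }) << "\n"; // 4096
      auto out = pipe(v, { [](int16_t x) { return int16_t(x + 1); },
                           [](int16_t x) { return int16_t(2 * x); } });
      std::cout << out[0] << "\n"; // 6
      return 0;
  }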
Data parallel engine

The Connex computational structure performing data parallel computation (with the associated reduction operations) is a fine-grain network of small & simple execution units (EUs) working as a many-cell machine. The current embodiment of the Connex many-core section (see Figure 1) is a linear array of n = 1024 EUs. Each EU is a machine working on words of p = 16 bits, with a 1 KB local data memory. This memory allows the storage of m = 512 vectors with 16-bit components. The processing array works in parallel with an IO plane (IOP) used to transfer w = 128-bit data words between the array and the external memory. The architecture is scalable: all the previous parameters scale up or down (usually n = 64 ... 4096, p = 8 ... 64, m = 64 ... 16384, w = 32 ... 1024). The array is controlled by the Instruction Sequencer (IS), a simple controller, while the IOP transfers data under the control of another machine called the IO Controller (IOC). Thus, data processing and data transfer are two independent processes performed in parallel. Data exchange between the processing array and the IO plane is performed in one clock cycle and is synchronized by hardware mechanisms. For time parallel computation, a dynamically reconfigurable network of 8 simple processors is provided outside the processing array. Speculative computation is performed in both networks.

Figure 1. Connex data parallel engine. The processing array is paralleled by the IO Plane, which performs data transfers transparently to the processing.

Data parallel architecture

The user's view of the data parallel machine is represented in Figure 2. The linear cellular machine containing 1024 EUs operates in parallel on the vectors stored in a two-dimensional array of scalars and Booleans. The number of cells, n, provides the spatial dimension of the array, while the number of words stored in each cell provides the temporal dimension of the array (for example, in Figure 2, n = 1024 and m = 512). On the spatial dimension the "distance" between two scalars in a vector is in O(n). On the temporal dimension the "distance" between two scalars is in O(1). Indeed, in order to perform an operation between two scalars stored in two different cells, the time needed to bring them into the same EU is proportional, in the worst case, to the number of cells, while two operands stored in the data memory of one EU are accessed in a few cycles using random addressing. The two dimensions of the Connex architecture (the horizontal, spatial dimension and the vertical, temporal dimension) must be used carefully by the programmer in order to squeeze the maximum performance out of the ConnexArray(TM).

Figure 2. The internal state of the Connex data parallel machine. There are m = 512 integer (horizontal) vectors, each having n = 1024 16-bit integer components (vi[j] is a 16-bit integer), and 8 selection vectors, each having 1024 Booleans (sk[j] is a Boolean).

Two kinds of vectors are defined in this array: horizontal vectors (along the spatial dimension) and vertical vectors (along the temporal dimension). In the following we consider only horizontal vectors, called simply vectors. (When vertical vectors are considered, this will be specified.) Operations on vectors are performed in a small, fixed number of cycles. Some generic operations are exemplified in the following.

PROCESSING OPERATIONS, performed in the processing array under the control of the IS:
o full vector operation: {carry, v5} = v4 + v3; the corresponding integer components of the two operand vectors (v4 and v3) are added, and the result is stored in the scalar vector v5 and in the Boolean vector carry
o Boolean operation: s3 = s3 & s5; the corresponding Boolean components of the two operand vectors, s3 and s5, are ANDed and the result is stored in the result vector s3
o predicated execution: v1 = s2 ? v3 - v2 : v1; in any position where s2 = 1 the corresponding components are operated on, while in the rest (i.e., elsewhere) the content of the result vector remains unchanged (it is a "spatial if" statement)
o vector rotate: v7 = v7 >> n; the content of vector v7 is rotated n positions to the right, i.e., v7[i] = v7[(i+n) mod 1024]
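As a rough reference for the semantics above, the following plain C++ sketch models the full vector add with carry, the predicated subtraction, and the vector rotate on ordinary arrays. It is a sequential emulation written for this note only; it is not VectorC and makes no claim about the instruction-level behavior of the hardware.

  #include <array>
  #include <cstdint>
  #include <iostream>

  constexpr int N = 1024;                       // number of EUs (spatial dimension)
  using Vec = std::array<int16_t, N>;           // one horizontal integer vector
  using Sel = std::array<bool, N>;              // one selection (Boolean) vector

  // {carry, v5} = v4 + v3
  void addWithCarry(const Vec& v4, const Vec& v3, Vec& v5, Sel& carry) {
      for (int i = 0; i < N; ++i) {
          int32_t sum = uint16_t(v4[i]) + uint16_t(v3[i]);
          v5[i] = int16_t(uint16_t(sum));       // 16-bit result
          carry[i] = (sum >> 16) & 1;           // carry out of the 16-bit addition
      }
  }

  // v1 = s2 ? v3 - v2 : v1   (the "spatial if")
  void predicatedSub(const Sel& s2, const Vec& v3, const Vec& v2, Vec& v1) {
      for (int i = 0; i < N; ++i)
          if (s2[i]) v1[i] = int16_t(v3[i] - v2[i]);   // elsewhere v1[i] is untouched
  }

  // v7 = v7 >> k : rotate k positions right, v7[i] = v7[(i + k) mod N]
  Vec rotateRight(const Vec& v7, int k) {
      Vec out{};
      for (int i = 0; i < N; ++i) out[i] = v7[(i + k) % N];
      return out;
  }

  int main() {
      Vec a{}, b{}, r{}; Sel carry{};
      a.fill(-1); b.fill(1);                          // 0xFFFF + 0x0001 per component
      addWithCarry(a, b, r, carry);
      std::cout << r[0] << " " << carry[0] << "\n";   // 0 1
      r = rotateRight(r, 1);                          // exercise the rotate as well
      return 0;
  }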
INPUT-OUTPUT OPERATIONS, performed in the IOP under the control of the IOC:
o strided load: load v5 address burst stride; the content of v5 is loaded with data from the external memory, accessed starting from address address, using bursts of size burst, on a stride of stride
o scattered load: sload v3 highAddress v9 addr stride; the content of v3 is loaded with data from the external memory, indirectly accessed using the content of the address vector v9, whose content is used starting from the index address addr, on a stride of stride; the address vector is structured in pairs of 16-bit words; each of the 512 resulting 32-bit words is organized as follows: {dummy, burst[5:0], address[24:0]}, where: if dummy == 1, then a burst of {burst[5:0], 1'b0} dummy bytes is loaded, else a burst of {burst[5:0], 1'b0} bytes from the address {highAddress, addr, 1'b0} is loaded (it is a sort of indirect load)
o strided store: store v7 address burst stride;
o gathered store: gstore v4 highAddress v3 addr stride; (it is a sort of indirect store)

(A host-side reference sketch of the strided and scattered transfers is given after the VectorC examples below.)

VectorC: the programming language for the data parallel architecture

The Connex data parallel engine is programmed in VectorC, a C++ language extension for parallel processing defined by Connex [8]. The extension is made by adding new primitive data types and by extending the existing operators to accept the new data types. In the VectorC programming language the conditional statements become predication statements. The new primitive data types are:
  int vector: vector of integers (stored as a pair of 16-bit integer vectors)
  short vector: vector of shorts (stored as a 16-bit integer vector)
  byte vector: vector of bytes (two byte vectors are stored as one integer vector)
  selection: vector of Booleans

In order to explain how VectorC works, consider the following variable declarations:

  int i1, i2, i3;
  bool b1, b2, b3;
  int vector v1, v2, v3;
  selection s1, s2, s3;

Then a VectorC statement like:

  v3 = v1 + v2;

replaces this style of for statement:

  for (int i = 0; i < VECTOR_SIZE; i++) v3[i] = v1[i] + v2[i];

and

  s3 = s1 && s2;

replaces this for statement:

  for (int i = 0; i < VECTOR_SIZE; i++) s3[i] = s1[i] && s2[i];

The scalar statement if (b1) {i3 = i1 + i2}; is written in VectorC as the following vector predication statement:

  WHERE (s1) {v3 = v1 + v2};

replacing this nested for:

  for (int i = 0; i < VECTOR_SIZE; i++) if (s1[i]) v3[i] = v1[i] + v2[i];

Similarly, i3 = (b1) ? i1 : i2; is extended to accept vectors:

  v3 = (s1) ? v1 : v2;

Here is an example in VectorC computing the absolute difference of two vectors:

  vector absdiff(vector V1, vector V2);

  int main() {
    vector V1 = 2;
    vector V2 = 3;
    vector V;
    V = absdiff(V1, V2);
    return 0;
  }

  vector absdiff(vector V1, vector V2) {
    vector V;
    V = V1 - V2;
    WHERE (V < 0) {
      V = -V;
    } ENDW
    return V;
  }

See a few introductory examples in [5], where the VectorC library is posted.
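Returning to the INPUT-OUTPUT OPERATIONS listed before the VectorC overview, here is a plain C++ sketch of a host-side reference model for the strided load and for decoding the 32-bit descriptors used by the scattered load. The byte-address arithmetic, word order, and units are our own assumptions based on the description above; the sketch says nothing about the real IOC timing or interface.

  #include <cstdint>
  #include <iostream>
  #include <vector>

  // Strided load reference: fill dst by reading burst-word chunks from external
  // memory, the k-th chunk starting at word address base + k*stride.
  // (Units are assumed to be 16-bit words here; the real IOC works on bytes and bursts.)
  void stridedLoad(std::vector<uint16_t>& dst, const std::vector<uint16_t>& extMem,
                   size_t base, size_t burst, size_t stride) {
      for (size_t k = 0; k * burst < dst.size(); ++k)
          for (size_t j = 0; j < burst && k * burst + j < dst.size(); ++j)
              dst[k * burst + j] = extMem[base + k * stride + j];
  }

  // One 32-bit scattered-load descriptor, {dummy, burst[5:0], address[24:0]},
  // built from a pair of 16-bit words of the address vector.
  struct SLoadDescriptor {
      bool     dummy;       // 1 = load dummy bytes (skip), 0 = real load
      unsigned burstBytes;  // {burst[5:0], 1'b0}: an even number of bytes
      uint32_t address;     // address[24:0], combined with highAddress by the IOC
  };

  SLoadDescriptor decodeDescriptor(uint16_t hi, uint16_t lo) {
      uint32_t w = (uint32_t(hi) << 16) | lo;   // assumed word order within the pair
      SLoadDescriptor d;
      d.dummy      = (w >> 31) & 0x1;
      d.burstBytes = ((w >> 25) & 0x3F) << 1;   // {burst[5:0], 1'b0}
      d.address    = w & 0x1FFFFFF;             // address[24:0]
      return d;
  }

  int main() {
      std::vector<uint16_t> ext(64);
      for (size_t i = 0; i < ext.size(); ++i) ext[i] = uint16_t(i);
      std::vector<uint16_t> v(8);
      stridedLoad(v, ext, /*base=*/4, /*burst=*/2, /*stride=*/8);  // words 4,5, 12,13, ...
      SLoadDescriptor d = decodeDescriptor(0x8200, 0x0010);        // dummy=1, 2 bytes, 0x10
      std::cout << v[2] << " " << int(d.dummy) << " " << d.burstBytes << "\n";  // 12 1 2
      return 0;
  }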
Connex data parallel engine by the numbers

The latest implementation of the ConnexArray(TM) provided the following measured performance figures:
  computation: 400 GOPS (16-bit operations) at 400 MHz (peak performance)
  external bandwidth: 6.4 GB/sec (peak performance)
  internal bandwidth: 800 GB/sec (peak performance)
  power: < 3 Watt
  area: < 50 mm2 (1024-EU array, including 1 MB of memory and the two controllers with their local program and data memories)

Design & technology parameters:
  65 nm technology
  fully synthesized (no custom memories or layout)
  Standard Chartered Semiconductor "G" process

Time parallel architecture

Amdahl's law is the argument used against the extensive use of parallel computation. But this very old argument (1967) was coined in the pioneering era of parallel computing. Meanwhile the theoretical and technological context has changed, and a lot of new data about how parallelism works has accumulated. In 1998 Daniel Hillis expressed his opinion as follows: "I now understand the flaw in Amdahl's argument. It lies in the assumption that a fixed portion of the computation, even just 10 percent, must be sequential. This estimate sounds plausible, but it turns out not to be true of most large computations. The false intuition came from a misunderstanding about how parallel processors would be used. ... Even problems that most people think are inherently sequential can usually be solved efficiently on a parallel computer." ([4], pp. 114-115)

Indeed, for "large computations", pipelining, speculating, or speculating in a pipe, on the structural support of a many-cell engine, provides the architectural context for executing purely sequential computations in parallel when big streams of data are waiting to be processed.

Figure 3. Connex time parallel engine. a. The pipe without speculation. b. The pipe with speculation. The i-th stage in the pipe is computed by q PEs dynamically configured as a speculative linear array. PEi+q dynamically selects as input only one of the outputs of the speculative array.

Time parallel computation is supported in the current implementation of the Connex chip by a small configurable network of processing elements called the Stream Accelerator (SA). The network works like a pipe of processors in which, at any point, two or more machines can be connected in parallel to support speculation. In the actual implementation 8 machines are used. Functions like the CABAC decoding procedure, a highly sequential computation with strong data dependencies, are executed efficiently by a small program. The computation of the SA is defined by two mechanisms:

  stream of functions, containing m programs pj(σ):
    S(σin) = <p0(σin), p1(σ0), ..., pm-1(σm-2)> = σout
  applied to a stream of scalars σin and generating a stream of scalars σout as output, where pj(σ) is a program which processes the stream σ and generates the stream σj; this is a type of MIMD computation

  vector of functions, containing q programs pj(σ):
    V(σin) = [p0(σin), ..., pq-1(σin)]
  applied to a stream of scalars σin and generating a stream of q-component vectors; this is a true MISD computation.

The latency introduced by a stream of functions is m, but the stream is computed in real time (see Figure 3a). Vectors of functions are used to perform speculation when the computation requires it, in order to preserve the possibility of real-time computation (see Figure 3b). For example:

  <p0(σin), p1(σ0), ..., pi-1(σi-2), V(σi-1), pi+q(σ?), ..., pm-1(σm-2)>

is a computation which performs a speculation in stage i of the pipe. The program pi+q(σ?) selects as its input only one component from the vectors generated by V(σi-1). There is a Connex version having the stream accelerator functionality integrated with the data parallel engine.
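The following plain C++ sketch illustrates, on a sequential host, the two mechanisms above: a pipe of stage functions applied to a stream, and one speculative stage in which several candidate programs are evaluated and a later stage selects a single result. The stage functions and the selection rule are invented for the illustration; the sketch does not model the SA's real-time, overlapped execution.

  #include <cstdint>
  #include <functional>
  #include <iostream>
  #include <vector>

  using Scalar = int16_t;
  using Stage  = std::function<Scalar(Scalar)>;

  // Stream of functions: every token of the stream traverses all stages in order.
  // (In the SA the stages work concurrently on successive tokens; here they are
  // applied one after another, which gives the same input/output relation.)
  std::vector<Scalar> streamOfFunctions(std::vector<Scalar> stream,
                                        const std::vector<Stage>& pipe) {
      for (auto& x : stream)
          for (const auto& p : pipe) x = p(x);
      return stream;
  }

  // One speculative stage: q candidate programs run on the same input (a vector of
  // functions), and a selector picks exactly one of the q results for the next stage.
  Scalar speculativeStage(Scalar in, const std::vector<Stage>& candidates,
                          const std::function<size_t(Scalar)>& choose) {
      std::vector<Scalar> results;
      for (const auto& p : candidates) results.push_back(p(in));  // MISD step
      return results[choose(in)];                                 // selection (reduction)
  }

  int main() {
      std::vector<Scalar> stream = {1, 2, 3, 4};
      std::vector<Stage> pipe = {
          [](Scalar x) { return Scalar(x + 10); },
          // the speculative stage, folded into one pipe step for the illustration
          [](Scalar x) {
              return speculativeStage(
                  x,
                  { [](Scalar y) { return Scalar(2 * y); },     // candidate 0
                    [](Scalar y) { return Scalar(y - 1); } },   // candidate 1
                  [](Scalar y) { return size_t(y % 2); });      // pick by parity of input
          }
      };
      for (Scalar s : streamOfFunctions(stream, pipe)) std::cout << s << " ";
      std::cout << "\n";   // 10 24 12 28
      return 0;
  }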
3. Berkeley's View

Berkeley's View [1] provides a comprehensive presentation of the problems to be solved by the emerging actor on the computing market: the ubiquitous parallel paradigm. An academic topic for many decades, parallelism became an important actor on the market after 2001, when the clock-rate race stopped. The research report presents 13 computational motifs (initially called dwarfs, they were renamed motifs in [9]) which cover the main aspects of parallel computing. They are defined independently of any specific parallel architecture. In the next section we make a preliminary evaluation of them in the context of the Connex IPA.

4. Connex's Performance

Connex's cellular network has the simplest possible interconnection network. This is both an advantage and a limitation. On one hand, the area of the system is minimized, and it is easy to hide the associated organization from the user, with no loss in programmability or in the efficiency of compilation. On the other hand, the limitation appears in certain application domains. What follows are short comments about how the Connex architecture works for each of the 13 Berkeley's View motifs.

Motif 1: Dense linear algebra

The computation in this domain operates mainly on N×M matrices. The operations performed are: matrix addition, scalar multiplication, transposition of a matrix, dot product of vectors, matrix multiplication, determinant of a matrix, (forward & backward) Gaussian elimination, solving systems of linear equations, and inverse of an N×M matrix. The internal representation of a matrix is decided depending on the product N×M. If the product is small enough (usually no bigger than 128), each matrix can be expanded as a vertical vector and associated with one EU, resulting in 1024 matrices represented by N×M 1024-element horizontal vectors. But if the product N×M is big, then P EUs are associated with each matrix, resulting in the parallel processing of 1024/P matrices represented in (N×M)/P 1024-element vectors. For all the operations listed above the computation is usually accelerated almost 1024 times, but not less than 1024/P times. This is possible because special hardware is provided for reduction operations in the ConnexArray(TM) (for example, adding 1024 16-bit numbers takes 4-20 clock cycles, depending on the actual implementation of the reduction network associated with the array).
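As an illustration of the style of computation assumed here, the following plain C++ sketch computes a dot product the way a 1024-EU array with a reduction network would: one per-component multiplication per EU, followed by a logarithmic-depth add reduction. The code is a host-side model written for this note; cycle counts and data layout are not modeled.

  #include <cstdint>
  #include <iostream>
  #include <vector>

  // Logarithmic-depth add reduction over n partial results
  // (models the role of the reduction network attached to the array).
  int64_t treeReduce(std::vector<int64_t> partial) {
      for (size_t step = 1; step < partial.size(); step *= 2)      // log2(n) levels
          for (size_t i = 0; i + step < partial.size(); i += 2 * step)
              partial[i] += partial[i + step];
      return partial.empty() ? 0 : partial[0];
  }

  // Dot product: each "EU" i holds a[i] and b[i] and produces one product,
  // then the products are summed by the reduction step.
  int64_t dotProduct(const std::vector<int16_t>& a, const std::vector<int16_t>& b) {
      std::vector<int64_t> partial(a.size());
      for (size_t i = 0; i < a.size(); ++i)            // fully data parallel step
          partial[i] = int32_t(a[i]) * int32_t(b[i]);
      return treeReduce(partial);                      // reduction to a scalar
  }

  int main() {
      std::vector<int16_t> a(1024, 3), b(1024, 7);
      std::cout << dotProduct(a, b) << "\n";   // 1024 * 21 = 21504
      return 0;
  }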
Motif 2: Sparse linear algebra

There are two types of sparse matrices: (1) randomly distributed sparse arrays (represented by a few types of lists) and (2) band arrays (represented by a stream of short vectors). For small random sparse arrays, converting them internally into dense arrays is a good solution. For big random sparse arrays, the associated list is operated on using efficient search operations. For band arrays, systolic-like solutions are proposed. Connex's intense computation engine handles these types of linear algebra problems very well. The acceleration is between 128 and 1024 times, depending on the density of the array.

Motif 3: Spectral methods

The typical examples are FFT or wavelet computation. Because of the "butterfly" data movement, how the FFT computation is implemented depends on the length of the sample. The spatial and the temporal dimensions of the Connex array help the programmer easily adapt the data representation so as to obtain an almost linear acceleration, i.e., in the worst case the acceleration is not less than 80% of the linear acceleration, which is 1024.

Evaluation report: Using VectorC as a simulation environment, the FFT computation was evaluated for the Connex architecture. For a ConnexArray(TM) version with n = 1024, p = 32, m = 512, w = 256, which can be implemented in 65 nm on 1 cm2, a 4096-sample FFT is computed at 1.6 clock cycles per sample, a 1024-sample FFT at 1 clock cycle per sample, a 256-sample FFT in less than 0.5 clock cycles per sample, and a 64-sample FFT in less than 0.2 clock cycles per sample. At 400 MHz this results in a 10 Watt chip. The algorithm loads each stream of samples as 16 64-element vertical vectors; thus, the array works simultaneously on 64 FFTs. This is a good example of the optimizations allowed by the two dimensions of the Connex architecture. If only the spatial dimension is used, loading all the 1024 samples as a single horizontal vector, then the same computation is done in 8 clock cycles per sample instead of only one. Almost one order of magnitude is gained just by playing with the two dimensions of our architecture.

Motif 4: N-Body method

This method fits perfectly on the Connex architecture, because for j = 0 to j = n-1 the following equation must be computed:

  U(xj) = Σi F(xj, xi)

Each function F(xj, xi) is computed on a single EU, and then the sum is a reduction operation linearly accelerated by the array. Depending on the value of n, the data is distributed in the processing array using the spatial dimension only or, for large n, both the spatial and the temporal dimensions. This motif also results in an almost linear acceleration.

Motif 5: Structured grids

The grid is distributed on the two dimensions of our array: the spatial dimension and the temporal dimension. Each processor is assigned a line of nodes (on the spatial dimension). It performs each update step locally and independently of the other lines of nodes. Each node only has to communicate with neighboring nodes on the grid, exchanging data at the end of each step. The system works as a cellular automaton. The computation is accelerated almost linearly on the Connex architecture.
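The following plain C++ sketch shows the update pattern assumed for structured grids: every interior node is recomputed from its neighbors, and all nodes advance together step by step, as in a cellular automaton. The 5-point averaging rule is an arbitrary example; the mapping of rows to EUs and the end-of-step data exchange are not modeled.

  #include <iostream>
  #include <vector>

  using Grid = std::vector<std::vector<float>>;

  // One synchronous update step: each interior node is replaced by the average of
  // itself and its four neighbors (a simple 5-point stencil chosen for illustration).
  Grid stepGrid(const Grid& g) {
      Grid next = g;                                   // boundary rows/columns kept fixed
      for (size_t i = 1; i + 1 < g.size(); ++i)
          for (size_t j = 1; j + 1 < g[i].size(); ++j)
              next[i][j] = 0.2f * (g[i][j] + g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]);
      return next;                                     // all nodes advance together
  }

  int main() {
      Grid g(8, std::vector<float>(8, 0.0f));
      g[4][4] = 100.0f;                                // one hot node
      for (int step = 0; step < 10; ++step) g = stepGrid(g);
      std::cout << g[4][4] << "\n";                    // the value has diffused to neighbors
      return 0;
  }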
Motif 6: Unstructured grids

Unstructured grid problems are described as updates on an irregular grid, where each grid element is updated from its neighboring grid elements. Parallel computation is disturbed when the problem size is large, and the non-uniformity of the data distribution is best served by special access mechanisms. In order to solve the non-uniformity problem for the Connex array, a preprocessing step is required. The algorithm for preprocessing the n-element unstructured grid representation starts from an initial list of grid elements G = {g0, ..., gn-1} and provides the minimum number of vectors, following the steps sketched here:
- the n×n interconnection matrix for the n grid elements is generated
- by interchanging elements in the list G, a minimal band matrix is generated
- each diagonal of the band represents a vector loaded into the processing array
- the result is a grid with some dummy elements, but each actual grid element has its neighborhood located in a few adjacent EUs.
Depending on the list G, the preprocessing can be performed in the Connex array or in an external standard processor. The resulting acceleration is smaller than for the structured grid (depending on the computation involved in each grid element, 10-20% of the acceleration is lost).

Motif 7: Map reduce

The typical example of a map-reduce computation is the Monte Carlo method. This method consists of many completely independent computations working on randomly generated data. This type of computation is highly parallel. Sometimes it requires the add reduction function, for which the Connex architecture has special accelerating hardware. The computation is linearly accelerated.

Motif 8: Combinational logic

There are many very different problems falling into this class. We list here only the most important and most frequently used ones:
- block processing, exemplified by AES encryption algorithms; it works on 4×4 arrays of bytes, each array is loaded into one EU, and the processing is completely SIMD-like, with linear acceleration on the Connex array
- stream processing, exemplified by convolution methods which do not use blocks, processing instead a continuous bit stream; it is computed very efficiently in Connex's time parallel accelerator (SA), with no speculation involved
- image rotation for black & white or color bit-mapped images is performed by first loading an m×m array of pixels into the processing array on both dimensions (spatial and temporal), second executing a local transformation, and third restoring the transformed image in the appropriate place; this is done very efficiently on the Connex array
- route lookup, used in networking; it involves three database-like operations (longest match, insert, delete), all performed very efficiently by the Connex processing array.

Motif 9: Graph traversal

The array of 1024 machines can be used as a big "speculative device". Each EU starts with the full graph stored in its data memory, and the computation provides the result when one EU, if any, finds the solution. Limitations are generated by the size of the data memory of each EU. More investigation is needed to evaluate the actual power of the Connex technology in solving this problem. Some problems related to graphs are easily solved if matrix computation is involved (example: computing the distances between all the elements of a graph).

Motif 10: Dynamic programming

Viterbi decoding is the example presented in [1]. It best fits the modular feed-forward architecture of the SA, built as a distinct network (as currently implemented at Connex) or integrated into the main data parallel processing array. Very long streams of bits are computed in parallel by the pipeline structure of Connex's SA.

Motif 11: Back-track and branch & bound

Motif under investigation ("Berkeley's View" is silent regarding this motif).

Motif 12: Graphical models

Motif under investigation ("Berkeley's View" is silent regarding this motif).

Motif 13: Finite state machine (FSM)

The authors of "Berkeley's View" claim that for this motif "nothing helps". But we consider that a pipe of machines featuring speculative resources [6] provides benefits in acceleration. In fact, Connex's SA solves the problem if its speculative resources are activated. Another way to use the ConnexArray(TM) for FSM-oriented applications is to add to each cell, working as a PE, specific instructions for FSM emulation. The resulting system can be used as a speculative engine for deep packet search applications. Connex's SA technology seems to be the first implementation of a machine able to deal with this rebellious motif.
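To suggest why speculation helps with the FSM motif, the following plain C++ sketch evaluates an FSM over a block of input symbols from every possible starting state, then resolves the blocks with a cheap chain of selections once the true entry state is known; enumerating the possible states is the classic way to expose parallelism in a sequential state machine. The toy transition function is invented for the illustration and is not related to any Connex instruction set.

  #include <array>
  #include <iostream>
  #include <string>
  #include <vector>

  constexpr int S = 4;                            // number of FSM states (toy example)

  // Toy transition function: next state from current state and input symbol.
  int nextState(int state, char symbol) {
      return (state + (symbol == 'a' ? 1 : 3)) % S;
  }

  // Speculative evaluation of one block: for EVERY possible entry state, run the
  // block and record the resulting exit state. These S runs are independent and
  // can be done in parallel (one per PE / pipeline branch).
  std::array<int, S> evalBlockFromAllStates(const std::string& block) {
      std::array<int, S> exitState{};
      for (int s0 = 0; s0 < S; ++s0) {
          int s = s0;
          for (char c : block) s = nextState(s, c);
          exitState[s0] = s;
      }
      return exitState;
  }

  int main() {
      std::vector<std::string> blocks = {"abba", "baab", "aaab"};  // input split into blocks
      // Speculative phase (parallel across blocks): evaluate each block from all states.
      std::vector<std::array<int, S>> tables;
      for (const auto& b : blocks) tables.push_back(evalBlockFromAllStates(b));
      // Selection phase (short and sequential): chain the tables from the real start state.
      int state = 0;
      for (const auto& t : tables) state = t[state];
      std::cout << "final state: " << state << "\n";
      return 0;
  }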
5. Concluding Remarks

1. The Connex technology covers almost all motifs. Except for motifs 11 and 12 (work on them is in progress), and possibly 9, the Connex technology performs very well.

2. The linear network connecting the EUs is not a limitation. Because intense computational problems are characterized by a high degree of locality, the simplest interconnection network is not a major limitation. In many cases the temporal dimension of the architecture helps to avoid the limitations imposed by the two simple interconnection networks.

3. The spatial & temporal dimensions do a good job together. The user's view of the machine is a two-dimensional array. Actually, one dimension is in space (the 1024 EUs), and the other dimension is in time (the 512 16-bit words stored in each local, randomly accessed memory). These two distinct dimensions allow Connex to optimize area resources, while the locality and the degree of parallelism are both kept at high values.

4. Time parallelism is rare, but unavoidable. Almost every real, complex application involves all kinds of parallelism. Some purely sequential processes sometimes represent uncomfortable corner cases, solved only by the time parallel resources provided in the Connex architecture (see the 13th motif).

5. The Connex organization is transparent. Because the interconnection network is simple, the internal organization of the machine is easy to make transparent to the user. The elegant solution offered by the VectorC language is good proof of the high organizational transparency of the Connex technology.

References

[1] K. Asanovic, et al.: "The Landscape of Parallel Computing Research: A View from Berkeley", Technical Report No. UCB/EECS-2006-183, December 18, 2006. http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
[2] Shekhar Y. Borkar, et al.: "Platform 2015: Intel Processor and Platform Evolution for the Next Decade", edited by R. M. Ramanathan and Vince Thomas, Intel Corporation, 2005.
[3] Pradeep Dubey: "A Platform 2015 Workload Model: Recognition, Mining and Synthesis Moves Computers to the Era of Tera", Technology@Intel Magazine, Feb. 2005.
[4] W. Daniel Hillis: The Pattern on the Stone. The Simple Ideas that Make Computers Work, Basic Books, 1998.
[5] Mihaela Malita: http://www.anselm.edu/internet/compsci/Faculty_Staff/mmalita/HOMEPAGE/ResearchS07/WebsiteS07/index.html
[6] Mihaela Malita, Gheorghe Stefan, Dominique Thiebaut: "Not Multi-, but Many-Core: Designing Integral Parallel Architectures for Embedded Computation", ACM SIGARCH Computer Architecture News, Vol. 35, Issue 5, Dec. 2007, Special issue: ALPS '07 Advanced Low Power Systems; communication at the International Workshop on Advanced Low Power Systems, held in conjunction with the 21st International Conference on Supercomputing, June 17, 2007, Seattle, WA, USA.
[7] Mihaela Malita, Gheorghe Stefan: "On the Many-Processor Paradigm", in: H. R. Arabnia (Ed.): Proceedings of the 2008 World Congress in Computer Science, Computer Engineering and Applied Computing, vol. PDPTA'08 (The 2008 International Conference on Parallel and Distributed Processing Techniques and Applications), 2008. http://arh.pub.ro/gstefan/pdpta08.pdf
[8] Bogdan Mîţu: "C Language Extension for Parallel Processing", BrightScale research report, 2008. http://arh.pub.ro/gstefan/VectorC.ppt
[9] David A. Patterson: "The Parallel Computing Landscape: A Berkeley View 2.0", keynote lecture at The 2008 World Congress in Computer Science, Computer Engineering and Applied Computing, Las Vegas, July 2008.
[10] Gheorghe Stefan: "The CA1024: A Massively Parallel Processor for Cost-Effective HDTV", Spring Processor Forum Japan, June 8-9, 2006, Tokyo.
[11] Gheorghe Stefan: "The CA1024: SoC with Integral Parallel Architecture for HDTV Processing", invited paper at the 4th International System-on-Chip (SoC) Conference & Exhibit, November 1-2, 2006, Radisson Hotel, Newport Beach, CA.
[12] Gheorghe Stefan, Anand Sheel, Bogdan Mitu, Tom Thomson, Dan Tomescu: "The CA1024: A Fully Programmable System-On-Chip for Cost-Effective HDTV Media Processing", Hot Chips: A Symposium on High Performance Chips, Memorial Auditorium, Stanford University, August 20-22, 2006.
[13] Gheorghe Stefan: "One-Chip TeraArchitecture", in Proceedings of the 8th Applications and Principles of Information Science Conference, Okinawa, Japan, January 11-12, 2009. http://arh.pub.ro/gstefan/teraArchitecture.pdf