
Connex Integral Parallel Architecture & the 13 Berkeley Motifs
(version 1.1)
Abstract: Connex Integral Parallel Architecture is a many-cell engine designed to
solve intense computational problems. Connex technology is presented and
analyzed from the point of view of each of the 13 computational motifs proposed in
Berkeley's View [1]. We conclude that Connex technology works efficiently for
almost all of the 13 computational motifs.
1. Introduction
Parallel processing offers two distinct solutions for the everlasting problems that
challenge computer science: complexity & intensity. Coarse-grain networks of a few or
dozens of big & complex processors performing multi-threaded computation are proposed
for complex computation, while fine-grain networks of hundreds or thousands of small &
simple processing cells are proposed for intense computation.
Connex architecture (see the history of the project at http://arh.pub.ro/gstefan/conexMemory.html)
is designed to perform intense computation.
We take into account the fundamental differences between multi-processors and
many-processors [2]. We see multi-processors as performing complex computations, while
many-processors are designed for intense computations (the distinction between complex
computation and intense computation is defined in [12]).
Academia provides a comprehensive survey of parallel computing in the seminal
research report known as Berkeley's View [1], where 13 computational motifs are
identified as the main challenges for the emerging parallel paradigm.
This white paper investigates how the Connex architecture behaves with respect to the 13
computational motifs emphasized in Berkeley's View.
2. Connex Integral Parallel Architecture
Integral Parallel Architecture (IPA) is defined as a fine-grain cellular machine with
structural resources for performing mainly data-parallel and time-parallel computations
and resources for computations requiring speculative execution. IPA considers two main
data structures: vectors of scalars, processed in data parallel machines, and streams of
scalars, processed in time parallel machines. Connex IPA performs the following types of
parallel computations:




• data parallel computation: working on vectors and having as result vectors, scalars
  (by parallel reduction operations), or streams (applied as inputs for time parallel
  computations)
• time parallel computation: working on streams and having as result streams, scalars
  (by accumulating operations), or vectors (applied as inputs for data parallel
  computations)
• speculative parallel computation (expanded mainly inside the time parallel
  computation, but sometimes inside the data parallel computation): working on scalars
  and having as result vectors reduced immediately by selection (the simplest reduction
  function) to a scalar
• reduction parallel computation: working on vectors with scalars as results.
Almost all parallel computations are data parallel (with the associated reduction
operations), but some of them involve time parallel processes, supported by speculative
computations, if needed.
Data parallel engine
The Connex computational structure performing data parallel computation (with the
associated reduction operations) is a fine-grain network of small & simple execution units
(EUs) working as a many-cell machine.
The current embodiment of the Connex many-core section (see Figure 1) is a linear array
of n = 1024 EUs. Each EU is a machine working on words of p = 16 bits, with a 1 KB local
data memory. This memory allows the storage of m = 512 vectors with 16-bit components. The
processing array works in parallel with an IO plane (IOP) used to transfer w=128-bit data
words between the array and the external memory. The architecture is scalable: all the
previous parameters scale up or down (usually: n = 64 … 4096, p = 8 … 64, m = 64 …
16384, w = 32 … 1024).
The array is controlled by an Instruction Sequencer (IS), a simple controller, while the IOP
transfers data under the control of another machine called the IO Controller (IOC). Thus,
data processing and data transfer are two independent processes performed in parallel.
Data exchange between the processing array and the IO plane is performed in one clock
cycle and is synchronized by hardware mechanisms.
For time parallel computation a dynamically reconfigurable network of 8 simple
processors is provided outside of the processing array.
Speculative computation is performed in both networks.
Figure 1. Connex data parallel engine. The processing array is paralleled by
the IO Plane, which performs data transfers transparently to the processing.
Data parallel architecture
The user's view of the data parallel machine is represented in Figure 2. The linear cellular
machine containing 1024 EUs performs parallel operations on the vectors stored in the
two-dimensional array of scalars and Booleans.
The number of cells, n, provides the spatial dimension of the array, while the number of
words stored in each cell provides the temporal dimension of the array (for example, in
Figure 2 n = 1024, m = 512). On the spatial dimension the “distance” between two
scalars in a vector is in O(n). On the temporal dimension the “distance” between two
scalars is in O(1). Indeed, for performing an operation between two scalars stored in two
different cells the time to bring them into the same EU is proportional, in the worst case,
to the number of cells, while two operands stored in the data memory of an EU are
accessed in a few cycles using random addressing.
The two dimensions of the Connex architecture – the horizontal, spatial dimension and
the vertical, temporal dimension – must be used carefully by the programmer in order to
squeeze the maximum performance out of ConnexArrayTM. Two kinds of vectors are
defined in this array: horizontal vectors (along the spatial dimension) and vertical vectors
(along the temporal dimension). In the following we consider only horizontal vectors,
called simply vectors. (When vertical vectors are considered, this will be stated explicitly.)
Operations on vectors are performed in a small fixed number of cycles. Figure 2 shows
the internal state they operate on; some generic operations are exemplified in the list below.

Figure 2. The internal state of the Connex data parallel machine. There are
m = 512 integer (horizontal) vectors, each having n = 1024 16-bit integer
components (vi[j] is a 16-bit integer), and 8 selection vectors, each having
1024 Booleans (sk[j] is a Boolean).

• PROCESSING OPERATIONS, performed in the processing array under the control of the IS:
  o full vector operation: {carry, v5} = v4 + v3;
    the corresponding integer components of the two operand vectors (v4 and v3) are
    added, and the result is stored in the scalar vector v5 and in the Boolean vector carry
  o Boolean operation: s3 = s3 & s5;
    the corresponding Boolean components of the two operand vectors, s3 and s5, are
    ANDed and the result is stored in the result vector s3
  o predicated execution: v1 = s2 ? v3 - v2 : v1;
    in any position where s2 = 1 the corresponding components are operated on, while
    elsewhere the content of the result vector remains unchanged (it is a “spatial if” statement)
  o vector rotate: v7 = v7 >> n;
    the content of vector v7 is rotated n positions to the right, i.e.,
    v7[i] = v7[(i+n) mod 1024]
• INPUT-OUTPUT OPERATIONS, performed in the IOP under the control of the IOC:
  o strided load: load v5 address burst stride;
    the content of v5 is loaded with data from the external memory accessed starting
    from the address address, using bursts of size burst, on a stride of stride
    (a reference sketch of these semantics follows this list)
  o scattered load: sload v3 highAddress v9 addr stride;
    the content of v3 is loaded with data from the external memory indirectly
    accessed using the content of the address vector v9, whose content is used
    starting from the index address addr, on a stride of stride; the address
    vector is structured in pairs of 16-bit words, and each of the 512 resulting 32-bit
    words is organized as follows:
    {dummy, burst[5:0], address[24:0]}
    where: if dummy == 1, then a burst of {burst[5:0], 1'b0} dummy bytes is loaded,
    else a burst of {burst[5:0], 1'b0} bytes from the address
    {highAddress, addr, 1'b0} is loaded (it is a sort of indirect load)
  o strided store: store v7 address burst stride;
  o gathered store: gstore v4 highAddress v3 addr stride;
    (it is a sort of indirect store).
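To make the strided load semantics above easier to follow, here is a plain C++ reference
sketch. It is only our reading of the description (word-granularity addressing, no alignment
or burst-boundary constraints assumed), not the IOC implementation.

#include <cstddef>
#include <cstdint>

// Assumed reference semantics of: load v address burst stride
// Bursts of `burst` consecutive 16-bit words are copied from external memory
// into the vector; successive bursts start `stride` words apart.
void strided_load(uint16_t v[], std::size_t n, const uint16_t* ext_mem,
                  std::size_t address, std::size_t burst, std::size_t stride) {
    if (burst == 0) return;                       // guard against an empty burst
    std::size_t i = 0;
    for (std::size_t base = address; i < n; base += stride)
        for (std::size_t k = 0; k < burst && i < n; ++k)
            v[i++] = ext_mem[base + k];
}

The scattered load follows the same pattern, except that the burst start addresses come from
the address vector v9 instead of a fixed stride.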
VectorC: the programming language for data parallel architecture
The Connex data parallel engine is programmed in VectorC, a C++ extension for
parallel processing defined by Connex [8]. The extension is made by adding new
primitive data types and by extending the existing operators to accept the new data types.
In VectorC, conditional statements become predication statements.
The new primitive data types are:
• int vector: vector of integers (stored as a pair of 16-bit integer vectors)
• short vector: vector of shorts (stored as a 16-bit integer vector)
• byte vector: vector of bytes (two byte vectors are stored as an integer vector)
• selection: vector of Booleans
In order to explain how VectorC works, consider the following variable declarations:
int i1, i2, i3;
bool b1, b2, b3;
int vector v1, v2, v3;
selection s1, s2, s3;
Then a VectorC statement like:
v3 = v1 + v2;
replaces this style of for statement:
for (int i = 0; i < VECTOR_SIZE; i++)
    v3[i] = v1[i] + v2[i];
and
s3 = s1 && s2;
replaces this for statement:
for (int i = 0; i < VECTOR_SIZE; i++)
    s3[i] = s1[i] && s2[i];
The scalar statement
if (b1) { i3 = i1 + i2; }
is written in VectorC as the following vector predication statement:
WHERE (s1) { v3 = v1 + v2; }
replacing this nested for:
for (int i = 0; i < VECTOR_SIZE; i++)
    if (s1[i])
        v3[i] = v1[i] + v2[i];
Similarly,
i3 = (b1)? i1 : i2;
is extended to accept vectors:
v3 = (s1)? v1 : v2;
Here is an example in VectorC computing the absolute difference of two vectors.
vector absdiff(vector V1, vector V2);

int main() {
    vector V1 = 2;
    vector V2 = 3;
    vector V;
    V = absdiff(V1, V2);
    return 0;
}

vector absdiff(vector V1, vector V2) {
    vector V;
    V = V1 - V2;
    WHERE (V < 0) {
        V = -V;
    }
    ENDW
    return V;
}
See a few introductory examples in [5], where the VectorC library is posted.
Connex data parallel engine by the numbers
The latest implementation of ConnexArrayTM provided the following measured
performances:
• computation: 400 GOPS (16-bit operations) at 400 MHz (peak performance)
• external bandwidth: 6.4 GB/s (peak performance)
• internal bandwidth: 800 GB/s (peak performance)
• power: < 3 W
• area: < 50 mm² (1024-EU array, including 1 MB of memory and the two
  controllers with their local program and data memories).
Design & technology parameters:
• 65 nm technology
• Fully synthesized (no custom memories or layout)
• Standard Chartered Semiconductor “G” process
Time parallel architecture
Amdahl's law is the argument most often used against the extensive use of parallel
computation. But this argument, coined in 1967, dates from the pioneering era of parallel
computing. Since then the theoretical and technological context has changed, and a lot of
new data about how parallelism works has accumulated. In 1998 Daniel Hillis expressed
his opinion as follows:
“I now understand the flaw in Amdahl’s argument. It lies in the assumption that a fixed
portion of the computation, even just 10 percent, must be sequential. This estimate
sounds plausible, but it turns out not to be true of most large computations. The false
intuition came from a misunderstanding about how parallel processors would be used. …
Even problems that most people think are inherently sequential can usually be solved
efficiently on a parallel computer.” ([4], pp. 114–115)
Indeed, for “large computations”, pipelining, speculating, or speculating in a pipe, on the
structural support of a many-cell engine, provides the architectural context for executing
purely sequential computations in parallel when big streams of data are waiting to be
processed.
Figure 3. Connex time parallel engine. a. The pipe without speculation. b.
The pipe with speculation. The i-th stage in the pipe is computed by q PEs
dynamically configured as a speculative linear array. PEi+q dynamically selects as
input only one of the outputs of the speculative array.
Time parallel computation is supported in the current implementation of the Connex chip by
a small configurable network of processing elements called the Stream Accelerator
(SA). The network works like a pipe of processors in which, at any point, two or more
machines can be connected in parallel to support speculation. In the actual implementation 8
machines are used. Functions like the CABAC decoding procedure, a highly sequential
computation with strong data dependencies, are executed efficiently by a small program.
The computation of the SA is defined by two mechanisms:
• stream of functions, containing m programs pj(σ):
  S(σin) = <p0(σin), p1(σ0), …, pm-1(σm-2)> = σout
  applied to a stream of scalars σin, generating a stream of scalars σout as output, where
  pj(σ) is a program which processes the stream σ and generates the stream σj; this is a
  type of MIMD computation
• vector of functions, containing q programs pj(σ):
  V(σin) = [p0(σin), …, pq-1(σin)]
  applied to a stream of scalars σin, generating a stream of q-component vectors; this is
  a true MISD computation.
The latency introduced by a stream of functions is m, but the stream is computed in real
time (see Figure 3a). Vectors of functions are used to perform speculation when the
computation requires it, in order to preserve the possibility of real time computation (see
Figure 3b). For example:
< p0(σin), p1(σ0), … pi-1(σi-2), V(σi-1), pi+q(σ?), … pm-1(σm-2) >
is a computation which performs speculation in stage i of the pipe. The program
pi+q(σ?) selects as its input only one component of the vectors generated by V(σi-1).
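To make the two mechanisms concrete, here is a plain C++ sketch (our illustration only;
the Scalar, Stream and Stage types and the selection argument are placeholders, not Connex
code): a stream of functions chains the stages, while a vector of functions applies several
candidate programs to the same input and keeps only one result.

#include <cstddef>
#include <functional>
#include <vector>

using Scalar = int;
using Stream = std::vector<Scalar>;
using Stage  = std::function<Stream(const Stream&)>;

// Stream of functions: S(sigma_in) = <p0(sigma_in), p1(sigma_0), ...> = sigma_out
// Each stage processes the stream produced by the previous one (MIMD pipe).
Stream stream_of_functions(const std::vector<Stage>& p, Stream sigma) {
    for (const Stage& pj : p)
        sigma = pj(sigma);
    return sigma;
}

// Vector of functions: V(sigma_in) = [p0(sigma_in), ..., p_{q-1}(sigma_in)]
// All candidates see the same input (MISD); a later stage keeps only one of
// them, which is the selection step that closes the speculation.
Stream vector_of_functions(const std::vector<Stage>& q, const Stream& sigma,
                           std::size_t selected) {
    std::vector<Stream> candidates;
    for (const Stage& pj : q)
        candidates.push_back(pj(sigma));
    return candidates.at(selected);
}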
There is a Connex version having the stream accelerator functionality integrated with the
data parallel engine.
3. Berkeley's View
Berkeley's View [1] provides a comprehensive presentation of the problems to be solved
by the emerging actor on the computing market: the ubiquitous parallel paradigm. For many
decades an academic topic, parallelism became an important actor on the market after
2001, when the clock rate race stopped. The research report presents 13 computational
motifs (initially called dwarfs, they were renamed motifs in [9]) which cover the main
aspects of parallel computing. They are defined independently of any specific parallel
architecture. In the next section we make a preliminary evaluation of each of them in the
context of Connex's IPA.
4. Connex's Performance
Connex's cellular network has the simplest possible interconnection network. This is both
an advantage and a limitation. On one hand, the area of the system is minimized, and it is
easy to hide the associated organization from the user, with no loss in programmability or
in the efficiency of compilation. The limitation appears in certain application domains.
What follows are short comments about how the Connex architecture works for each of
the 13 Berkeley's View motifs.
Motif 1: Dense linear algebra
The computation in this domain operates mainly on N×M matrices. The operations
performed are: matrix addition, scalar multiplication, matrix transposition, dot
product of vectors, matrix multiplication, determinant of a matrix, (forward & backward)
Gaussian elimination, solving systems of linear equations, and matrix inversion.
The internal representation of the matrices is decided depending on the product N×M. If the
product is small enough (usually no bigger than 128), each matrix can be expanded as a
vertical vector and associated with one EU, resulting in 1024 matrices represented by N×M
1024-element horizontal vectors. But if the product N×M is big, then P EUs are
associated with each matrix, resulting in the parallel processing of 1024/P matrices
represented in (N×M)/P 1024-element vectors.
For all the operations listed above the computation is usually accelerated almost 1024
times, and never under 1024/P times. This is possible because special hardware is provided
for reduction operations in ConnexArrayTM (for example, adding 1024 16-bit
numbers takes 4-20 clock cycles, depending on the actual implementation of the
reduction network associated with the array).
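As an illustration of the first layout (one small matrix per EU), here is a minimal indexing
sketch in plain C++. The helper name and the row-major ordering are our assumptions, not
part of the Connex tool chain.

// Hypothetical layout helper: 1024 independent N x M matrices, one per EU.
// Matrix k keeps its element (r, c) in horizontal vector number r*M + c,
// at component k of that vector (i.e., in the local memory of EU k).
struct MatrixSlot {
    int vectorIndex;   // which of the N*M horizontal vectors
    int component;     // which of the 1024 components, i.e., which EU
};

MatrixSlot locate(int k, int r, int c, int M) {
    return { r * M + c, k };
}

With this layout, an element-wise operation on all 1024 matrices reduces to N×M full-vector
operations, one per element position, which is where the roughly 1024-fold acceleration
comes from.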
Motif 2: Sparse linear algebra
There are two types of sparse matrices: (1) randomly distributed sparse arrays
(represented by a few types of lists), and (2) band arrays (represented by a stream of short
vectors).
For small random sparse arrays, converting them internally into dense arrays is a good
solution. For big random sparse arrays the associated list is operated on using efficient
search operations. For band arrays systolic-like solutions are proposed.
Connex’s intense computation engine handles these types of linear algebra problems very
well. Acceleration is between 128 and 1024 times depending on the density of the array.
Motif 3: Spectral methods
The typical examples are FFT and wavelet computation. Because of the “butterfly” data
movement, how the FFT computation is implemented depends on the length of the
sample. The spatial and temporal dimensions of the Connex array help the
programmer to easily adapt the data representation so as to obtain an almost linear
acceleration, i.e., in the worst case the acceleration is not less than 80% of the linear
acceleration, which is 1024.
Evaluation report:
Using VectorC as a simulation environment, FFT computation was evaluated for the Connex
architecture. For a ConnexArrayTM version with n = 1024, p = 32, m = 512, w = 256,
which can be implemented in 65 nm on 1 cm², a 4096-sample FFT is computed at 1.6
clock cycles per sample, a 1024-sample FFT is computed at 1 clock cycle per sample,
a 256-sample FFT is computed in less than 0.5 clock cycles per sample, and a 64-sample
FFT is computed in less than 0.2 clock cycles per sample. At 400 MHz this results
in a 10 Watt chip.
The algorithm loads each stream of samples as 16 64-element vertical vectors. Thus, the
array works simultaneously on 64 FFTs. This is a good example of the optimizations
allowed by the two dimensions of the Connex architecture. If only the spatial dimension
is used, loading all the 1024 samples as a single horizontal vector, then the same
computation is done in 8 clock cycles per sample instead of only one. Almost one order
of magnitude is gained just by playing with the two dimensions of our architecture.
Motif 4: N-Body method
This method fits the Connex architecture perfectly, because for j = 0 to j = n-1 the
following equation must be computed:
U(xj) = Σi F(xj, Xi)
Each function F(xj, Xi) is computed on a single EU, and then the sum is a reduction
operation linearly accelerated by the array. Depending on the value of n, the data is
distributed in the processing array using the spatial dimension only or, for large n, both
the spatial and the temporal dimensions. This motif also results in an almost linear
acceleration.
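A plain C++ reference of the map + reduce structure of this computation (the interaction F
below is a toy placeholder, not a physical force law):

#include <cstddef>
#include <vector>

// Toy pair interaction, used only to make the sketch self-contained.
double F(double xj, double xi) { return xi - xj; }

// U(x_j) = sum_i F(x_j, x_i). On Connex each F(x_j, x_i) would be evaluated
// in one EU and the sum would be taken by the hardware reduction network.
double U(const std::vector<double>& x, std::size_t j) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += F(x[j], x[i]);
    return sum;
}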
Motif 5: Structured grids
The grid is distributed over the two dimensions of our array: the spatial dimension and the
temporal dimension. Each processor is assigned a line of nodes (on the spatial dimension).
It performs each update step locally and independently of other lines of nodes. Each node
only has to communicate with neighboring nodes on the grid, exchanging data at the end
of each step. The system works as a cellular automaton. The computation is accelerated
almost linearly on the Connex architecture.
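A serial reference of one structured-grid update step, written in plain C++ purely for
illustration (a 1D three-point stencil with assumed weights; on Connex each EU would hold
one line of nodes and exchange only boundary values with its neighbors at the end of each
step):

#include <cstddef>
#include <vector>

// One Jacobi-style update step over a 1D grid (toy stencil, assumed weights).
// Boundary nodes are left unchanged for simplicity.
std::vector<double> grid_step(const std::vector<double>& g) {
    std::vector<double> next(g);
    for (std::size_t i = 1; i + 1 < g.size(); ++i)
        next[i] = 0.25 * g[i - 1] + 0.5 * g[i] + 0.25 * g[i + 1];
    return next;
}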
Motif 6: Unstructured grids
Unstructured grid problems are described as updates on an irregular grid, where each grid
element is updated from its neighboring grid elements. Parallel computation is disturbed
when the problem size is large, and the non-uniformity of the data distribution calls for
special access mechanisms. In order to solve the non-uniformity problem for the
Connex Array, a preprocessing step is required.
The algorithm for preprocessing the n-element unstructured grid representation starts
from an initial list of grid elements
G = {g0, … gn-1}
and provides the minimum number of vectors, following the steps sketched here:
• the n×n interconnection matrix for the n grid elements is generated
• by interchanging elements in the list G, a minimal band matrix is generated
• each diagonal of the band represents a vector loaded into the processing array
• the result is a grid with some dummy elements, but each actual grid element has
  its neighborhood located in a few adjacent EUs.
Depending on the list G, the preprocessing can be performed in the Connex array or
in an external standard processor. The resulting acceleration is smaller than for the
structured grid (depending on the computation involved in each grid element, 10-20% of
the acceleration is lost).
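The goal of the reordering step can be expressed as minimizing the bandwidth of the
interconnection matrix, so that every grid element ends up with its neighbors in a few
adjacent EUs. A minimal helper (our illustration, not part of any Connex tool) that measures
this quantity for a candidate ordering:

#include <algorithm>
#include <cstdlib>
#include <utility>
#include <vector>

// Bandwidth of the interconnection matrix under a given ordering:
// position[g] is the index assigned to grid element g after the interchanges.
// The smaller the bandwidth, the fewer diagonals (vectors) are needed.
int bandwidth(const std::vector<std::pair<int, int>>& edges,
              const std::vector<int>& position) {
    int b = 0;
    for (const auto& e : edges)
        b = std::max(b, std::abs(position[e.first] - position[e.second]));
    return b;
}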
Motif 7: Map reduce
The typical example of a map reduce computation is the Monte Carlo method. This
method consists of many completely independent computations working on randomly
generated data. This type of computation is highly parallel. Sometimes it requires the add
reduction function, for which the Connex architecture has special accelerating hardware. The
computation is linearly accelerated.
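A minimal Monte Carlo example in plain C++ (estimating π; the example and the random
number generator choice are ours, not taken from the Connex tool chain). Every trial is
independent, so on Connex the trials would be spread over the EUs and the final count would
use the hardware add reduction:

#include <random>

// Estimate pi by counting random points that fall inside the unit circle.
// Each trial is independent (the "map" part); the count is an add reduction.
double estimate_pi(unsigned trials, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    unsigned inside = 0;
    for (unsigned t = 0; t < trials; ++t) {
        double x = u(gen), y = u(gen);
        if (x * x + y * y <= 1.0)
            ++inside;                                         // map: one independent trial
    }
    return 4.0 * static_cast<double>(inside) / trials;        // reduce: sum the hits
}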
Motif 8: Combinational logic
There are a lot of very different problems falling in this class. We list here only the most
important and most frequently used ones:
• block processing, exemplified by the AES encryption algorithm; it works on 4×4
  arrays of bytes, each array is loaded in one EU, and the processing is completely
  SIMD-like, with linear acceleration on the Connex Array
• stream processing, exemplified by convolution methods which do not use blocks,
  processing instead a continuous bit stream; it is computed very efficiently in
  Connex’s time parallel accelerator (SA), with no speculation involved
• image rotation for black & white or color bit-mapped images is performed first by
  loading an m×m array of pixels into the processing array on both dimensions (spatial
  and temporal), second by executing a local transformation, and third by restoring the
  transformed image in the appropriate place; this is done very efficiently on the
  Connex Array
• route lookup, used in networking; it involves three database-like operations:
  longest match, insert, and delete, all performed very efficiently by the Connex
  processing array (a plain reference of the longest match step is sketched after this list).
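The longest match operation mentioned for route lookup can be read as a map followed by a
reduction; here is a plain C++ reference (our sketch, assuming 32-bit IPv4-style addresses;
on Connex each EU would test its own prefix and the reduction would keep the longest
matching one):

#include <cstdint>
#include <vector>

struct PrefixEntry {
    uint32_t prefix;   // network prefix, already masked
    int length;        // prefix length in bits, 0..32
    int next_hop;      // identifier of the forwarding decision
};

// Longest-prefix match: every entry tests itself against the address (map),
// and the entry with the greatest matching length wins (reduction).
int longest_match(const std::vector<PrefixEntry>& table, uint32_t addr) {
    int best_len = -1, best_hop = -1;
    for (const PrefixEntry& e : table) {
        uint32_t mask = e.length == 0 ? 0u : ~0u << (32 - e.length);
        if ((addr & mask) == (e.prefix & mask) && e.length > best_len) {
            best_len = e.length;
            best_hop = e.next_hop;
        }
    }
    return best_hop;   // -1 means no prefix matched
}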
Motif 9: Graph traversal
The array of 1024 machines can be used as a big “speculative device”. Each EU starts
with a full graph stored in its data memory, and the computation provides the result when
one EU, if any, finds the solution. Limitations are generated by the size of the data
memory of each EU. More investigation is needed to evaluate the actual power of the
Connex technology in solving this problem.
Some problems related to graphs are easily solved if matrix computation is involved
(for example, computing the distances between all the elements of a graph).
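One classic matrix-style formulation of the all-pairs distance problem is the Floyd-Warshall
recurrence; the plain C++ sketch below is our choice of example, not a Connex kernel. On
Connex the innermost loop maps naturally onto a full-vector operation over one row of the
distance matrix.

#include <algorithm>
#include <cstddef>
#include <vector>

// Floyd-Warshall: dist[i][j] starts as the edge weight (a large value when
// there is no edge) and converges to the shortest-path distance.
void all_pairs_shortest(std::vector<std::vector<double>>& dist) {
    const std::size_t n = dist.size();
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)      // row-wide update: vectorizable
                dist[i][j] = std::min(dist[i][j], dist[i][k] + dist[k][j]);
}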
Motif 10: Dynamic programming
Viterbi decoding is the example presented in [1]. It best fits the modular feed-forward
architecture of the SA, built either as a distinct network, as currently implemented at Connex,
or integrated into the main data parallel processing array. Very long streams of bits are
computed in parallel by the pipeline structure of Connex’s SA.
Motif 11: Back-track and branch & bound
Motif under investigation (“Berkeley's View” is silent regarding this motif).
Motif 12: Graphical models
Motif under investigation (“Berkeley's View” is silent regarding this motif).
Motif 13: Finite state machine (FSM)
The authors of “Berkeley's View” claim that for this motif “nothing helps”. But we
consider that a pipe of machines equipped with speculative resources [6] provides benefits
in acceleration. In fact, Connex’s SA solves the problem if its speculative resources are
activated.
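A minimal plain C++ sketch of the speculative idea (our illustration, not Connex code): while
the current state is not yet known, q units each compute the next state under a different
assumed current state, and a selector keeps only the candidate that corresponds to the actual
state once it is resolved; the selection is exactly the simple reduction mentioned earlier.

#include <array>
#include <cstddef>

constexpr std::size_t Q = 4;   // number of states covered by speculation (illustrative)
using Table = std::array<std::array<int, 2>, Q>;   // next_state[state][input bit]

// Speculative FSM step: compute the successor for every possible current state
// in parallel, then select the one matching the actual state when it arrives.
int speculative_step(const Table& next_state, int input_bit, int actual_state) {
    std::array<int, Q> candidates{};
    for (std::size_t s = 0; s < Q; ++s)             // speculation: all states at once
        candidates[s] = next_state[s][input_bit];
    return candidates[static_cast<std::size_t>(actual_state)];   // selection
}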
Another way to use ConnexArrayTM for FSM-oriented applications is to add to each cell,
working as a PE, specific instructions for FSM emulation. The resulting system can be
used as a speculative engine for deep packet search applications.
Connex’s SA technology seems to be the first implementation of a machine able to deal
with this rebellious motif.
5. Concluding Remarks
1. Connex technology covers almost all motifs. Except for motifs 11 and 12 (work
on them is in progress), and possibly 9, Connex technology performs very well.
2. The linear network connecting EUs is not a limitation. Because intense
computational problems are characterized by a high degree of locality, the simplest
interconnection network is not a major limitation. In many cases the temporal dimension
of the architecture helps to avoid the limitations imposed by the two simple
interconnection networks.
3. The spatial & temporal dimensions do a good job together. The user's view
of the machine is a two-dimensional array. Actually, one dimension is in space (the 1024
EUs), and the other dimension is in time (the 512 16-bit words stored in each local,
randomly accessed memory). These two distinct dimensions allow Connex to optimize
area resources, while the locality and the degree of parallelism are both kept at high
values.
4. Time parallelism is rare, but unavoidable. In almost any real, complex
application all kinds of parallelism are involved. Some purely sequential processes
sometimes represent uncomfortable corner cases that are solved only by the time parallel
resources provided in the Connex architecture (see the 13th motif).
5. Connex organization is transparent. Because the interconnection network is simple,
the internal organization of the machine is easy to make transparent to the user. The
elegant solution offered by the VectorC language is a good proof of the high
organizational transparency of the Connex technology.
References
[1] K. Asanovic, et al.: ”The Landscape of Parallel Computing Research: A View from Berkeley”,
Technical Report No. UCB/EECS-2006-183, December 18, 2006.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf
[2] Shekar Y. Borkar, et al.: “Platform 2015: Intel Processor and Platform Evolution for the Next
decade”, edited by R. M. Ramanathan and Vince Thomas, Intel Corporation, 2005.
[3] Pradeep Dubey: “A Platform 2015 Workload Model: Recognition, Mining and Synthesis
Moves Computers to the Era of Tera”, Technology@Intel Magazine, Feb. 2005.
[4] W. Daniel Hillis: The Pattern on the Stone. The Simple Ideas that Make Computers Work,
Basic Books, 1998.
[5] Mihaela Malita:
http://www.anselm.edu/internet/compsci/Faculty_Staff/mmalita/HOMEPAGE/ResearchS07/WebsiteS07/index.html
[6] Mihaela Malita, Gheorghe Stefan, Dominique Thiebaut: “Not Multi-, but Many-Core:
Designing Integral Parallel Architectures for Embedded Computation”, in ACM SIGARCH
Computer Architecture News, Volume 35, Issue 5, Dec. 2007, special issue: ALPS'07 -
Advanced Low Power Systems; communication at the International Workshop on Advanced Low
Power Systems held in conjunction with the 21st International Conference on Supercomputing,
June 17, 2007, Seattle, WA, USA.
[7] Mihaela Malita, Gheorghe Stefan: “On the Many-Processor Paradigm”, in: H. R. Arabina
(Ed.): Proceedings of the 2008 World Congress in Computer Science, Computer Engineering and
Applied Computing, vol. PDPTA'08 (The 2008 International Conference on Parallel and
Distributed Processing Techniques and Applications), 2008. http://arh.pub.ro/gstefan/pdpta08.pdf
[8] Bogdan Mîţu: “C Language Extension for Parallel Processing”, BrightScale research report
2008. http://arh.pub.ro/gstefan/VectorC.ppt
[9] David A. Patterson: “The Parallel Computing Landscape: A Berkeley View 2.0”, keynote
lecture at The 2008 World Congress in Computer Science, Computer Engineering and Applied
Computing, Las Vegas, July, 2008.
[10] Gheorghe Stefan: “The CA1024: A Massively Parallel Processor for Cost-Effective HDTV”,
in SPRING PROCESSOR FORUM JAPAN, June 8-9, 2006, Tokyo.
[11] Gheorghe Stefan: “The CA1024: SoC with Integral Parallel Architecture for HDTV
Processing”, invited paper at 4th International System-on-Chip (SoC) Conference & Exhibit,
November 1 & 2, 2006, Radisson Hotel Newport Beach, CA
[12] Gheorghe Stefan, Anand Sheel, Bogdan Mitu, Tom Thomson, Dan Tomescu: “The CA1024:
A Fully Programmable System-On-Chip for Cost-Effective HDTV Media Processing”, in Hot
Chips: A Symposium on High Performance Chips, Memorial Auditorium, Stanford University,
August 20 to 22, 2006.
[13] Gheorghe Stefan: “One-Chip TeraArchitecture”, in Proceedings of the 8th Applications
and Principles of Information Science Conference, Okinawa, Japan, 11-12 January 2009.
http://arh.pub.ro/gstefan/teraArchitecture.pdf