The DARPA Data Transposition Benchmark on a Reconfigurable Computer
S. Akella, D. A. Buell, L. E. Cordova
Department of Computer Science and Engineering,
University of South Carolina,
Columbia, South Carolina, 29208.
{akella | buell | cordoval}@cse.sc.edu
J. Hammes
SRC Computers, Inc.
4240 North Nevada Avenue
Colorado Springs, Colorado 80907
hammes@srccomp.com
Abstract
The Defense Advanced Research Projects Agency has recently released a set of six discrete mathematics benchmarks that can be used to measure the performance of high productivity computing systems. Benchmark 3 requires transposition, bit by bit in blocks, of a long bit-stream. In this paper we present the design and implementation of this benchmark on a high-performance reconfigurable computer, the SRC Computers SRC-6. We evaluate the performance of this machine by benchmarking our implementation against a standard C-based implementation of the same algorithm on a single-processor Pentium PC.
Index Terms— Reconfigurable Architectures, FPGA,
Verilog, C, Matrix Transposition.
1. Introduction
The six DARPA Discrete Mathematics Benchmarks can be used to measure the performance of high productivity computing systems [1]. The benchmarks have been written for 64-bit machines and coded in Fortran 77 or C. In addition, some of the benchmarks are available in MPI, Shmem, and UPC. DARPA is interested in experimenting with all six of these algorithms and seeks performance improvements through novel implementations. The six benchmarks are described briefly on the University of South Carolina reconfigurable computing research group's website [2].
In this paper we look at DARPA Benchmark 3, the Data
Transposition algorithm. The problem definition for this
benchmark is given below.
 Let {Ai} be a stream of n-bit integers of length L.
Consider each successive block of n integers as an n x n matrix of bits. For each such matrix, transpose the bits such that bit bij is interchanged with bit bji. The resulting stream {A'i} will have the property that A'1, A'n+1, A'2n+1, etc. will hold sequentially the high-order bits of {Ai}; A'2, A'n+2, A'2n+2, etc. will hold sequentially the next-to-high-order bits of {Ai}; and so on. Output the stream {A'i}.
Parameters for this benchmark are:
Length of Stream   Bit-width (n)   No. of Iterations
10^7               32              400
10^7               64              230
10^7               1024            12
We see that the Data Transposition benchmark provides three different sets of sub-benchmarks to be implemented. They vary in the bit-width (n) of the integers and in the number of iterations to be run over the input stream, whose length L is set to ten million for all three sub-benchmarks.
2. Previous Research
There has been relatively little previous research on this data transposition problem in which the transposition is done at the bit level. Most previous work has assumed that each element of the matrix was a word in itself. Several methodologies, such as parallel matrix transpose algorithms [3, 4], mesh implementations [5], and hypercube algorithms [6], have been investigated, but none of them is relevant to what we do here. We therefore implement this algorithm with a methodology suited to the underlying architecture on which we implement it. First we will look at the software implementation, and then we will discuss the SRC-6 implementation.
3. Software implementation
We first look at a software implementation of this benchmark. The DARPA benchmark "rules of the game" are that we should first build and time the original code, making only the changes necessary for correct execution. The original code was written in FORTRAN, uses a recursive transposition algorithm, and is written for a 64-bit architecture, which makes it difficult to run on the available 32-bit Pentium processor PC. It also assumes that all input words can be accessed at the same time, which is not suitable for direct porting to the SRC-6 architecture. We therefore implemented the benchmark in C using a nested loop construct, both for simplicity and for easier portability to the SRC-6 computer. It works on one word of data at a time, which matches the SRC-6 hardware architecture. The loop construct is similar for the 32-bit and 64-bit benchmarks. The code for the 1024-bit benchmark has slight modifications to work on an array of 64-bit words instead of the entire 1024-bit word. The methodology used for the software implementation is as follows:
- Read the bit-stream in words of n bits (32/64/1024 bits, depending on the benchmark).
- There are n n-bit inputs and n n-bit outputs; the outputs are the transposed values of the inputs.
- Each of the n bits of the ith input forms the ith bit of one of the n output words.
- The inner loop runs n times, once for each input word contributing a bit to the output being generated.
- In each iteration of the inner loop, the ith bit of the n-bit input word read is picked and placed in the ith position of the corresponding output.
- The outer loop runs the inner loop n times, once for each output to be generated (a C sketch of this loop structure follows).
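As an illustration, a minimal C sketch of this nested-loop transposition for the 64-bit case is given below. It assumes bits of a word are numbered from the most significant end, following the benchmark definition; the function name transpose64 is ours, not from the benchmark source.

#include <stdint.h>

/* Transpose one 64x64 block of bits: bit j of output word i is bit i
   of input word j, with bit 0 taken as the most significant bit. */
void transpose64(const uint64_t in[64], uint64_t out[64]) {
    int i, j;
    for (i = 0; i < 64; i++) {              /* outer loop: one output word per pass */
        uint64_t t = 0;
        for (j = 0; j < 64; j++) {          /* inner loop: take bit i of input j */
            uint64_t bit = (in[j] >> (63 - i)) & 1ULL;
            t |= bit << (63 - j);           /* place it at bit position j of the output */
        }
        out[i] = t;
    }
}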
This setup is slightly different for the 1024-bit
benchmark since the natural word length of the PC is restricted to 64 bits and the natural data transfer unit to the
reconfigurable resources on the SRC-6 is a 64-bit word. We
use sixteen 64-bit words to represent one 1024-bit word.
Thus while working on one 1024-bit word input we actually
work on sixteen 64-bit words each time in the inner loop.
Similarly, when generating each of the 1024-bit word
outputs, we generate sixteen 64-bit words which when
concatenated together would form the 1024-bit word.
The code for each benchmark is run on a computer
which has dual Intel Pentium 4 processors, and the timing
results collected are presented in Table 1.
4. The SRC-6 reconfigurable computer
The SRC-6, by SRC Computers [7], is one of the limited number of reconfigurable computers commercially available. This computer allows the programmer to overlook the details of the underlying hardware architecture and focus on the function being implemented. This approach helps decrease the time to solution by facilitating software development by programmers and mathematicians.
4.1. The Hardware Architecture [8]
The SRC-6 system architecture includes Intel®
microprocessors running the Linux operating system. The
reconfigurable logic resource is referred to by SRC as a
MAP®; the MAP® boards normally come in pairs attached
to these microprocessors. Each MAP® consists of two
Xilinx® FPGA chips and a control processor. Code for the
Intel processors is written in standard C or Fortran. Code
for the MAP® hardware can be written in MAP C or MAP
Fortran and compiled by an SRC-proprietary compiler that
targets the MAP® components. Calls to execute on the
MAP® are function/subroutine calls from the standard C or
Fortran modules.
The MAP® consists of both fixed and reconfigurable hardware. The fixed part includes general control logic, the DMA engine in the control processor FPGA, and six 4MB banks of on-board memory referred to as OBMs. The reconfigurable hardware consists of two Xilinx® XC2V6000 FPGA chips referred to as User logic. The architectural block diagram is given in Figure 1.
Figure 1. MAP® interface block diagram [8]
Most computations will first require a transfer of input data from the microprocessor memory to the OBMs through the control processor interface. The FPGA user logic then performs computations on the input data, reading it from the OBMs and perhaps writing intermediate or final results to the OBMs; the results are then transferred back from the OBMs to the microprocessor memory through the control processor interface.
An important point to note, especially when comparing 32-bit and 64-bit computations as in this benchmark, is that the reconfigurable hardware is strongly oriented toward 64-bit words. The OBM is organized as six banks of memory, each read and written as 64-bit words. The DMA engine transfers data to and from the microprocessor in 64-bit quantities. We would expect, therefore, that a 32-bit benchmark would actually be hampered by the 64-bit orientation of the machine, and we would have as a goal that the 1024-bit benchmark function very nearly like an array of sixteen 64-bit-word transpositions. This is something of an oversimplification but might serve to place the implementations in perspective.
4.2. The Programming Environment
The SRC-6 programming environment combines a traditional software compilation process with a specialized MAP® compilation process used to compile the code that is to run on the MAP® hardware. Code that is to run on the Intel® processors is compiled separately. The application code has two main source files, one intended to run on the Intel® processors and one intended to run on the MAP® hardware. The two source files can be written in either C or Fortran. The MAP® source file has functions that can call user hardware macros to be executed on the MAP® hardware. The hardware macros can be built-in or user-defined in either the VHDL or Verilog hardware description languages. The object files produced by the microprocessor C or Fortran compiler and by the MAP® compiler are linked together to form the final application executable.
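To make the structure concrete, a minimal sketch of the microprocessor-side source file is shown below, using the dt() function that appears in the figures later in this paper; the buffer sizes and the elided file I/O are placeholders, not the actual benchmark harness.

/* main.c -- compiled by the microprocessor C compiler */
#include <stdint.h>

/* dt() is defined in the MAP source file and compiled by the SRC
   MAP compiler; calling it dispatches execution to the MAP hardware. */
void dt(uint64_t A[], uint64_t E[], int m, int64_t *time, int mapnum);

int main(void) {
    static uint64_t A[262144], E[262144];   /* half-bank-sized buffers (illustrative) */
    int64_t t;
    /* ... fill A from the input bit stream ... */
    dt(A, E, 64, &t, 0);                    /* ordinary function call, runs on the MAP */
    /* ... write E to the output stream ... */
    return 0;
}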
5. The SRC-6 Implementation
The SRC-6 environment provides us with the options of
programming in C or Fortran. We have chosen to
implement all the DARPA benchmarks in C. We have
implemented the benchmarks in two ways:
 The transposition operation is implemented using
pure C code which we refer to as a C Map
implementation.
 The transposition operation is written in Verilog
and a function call is made to this macro from the
MAP® C source file. We refer to this as the
Verilog Map implementation.
The SRC-6 implementation differs from the FORTRAN code in that the FORTRAN implementation assumes the capability to access multiple words at the same time. The SRC-6 hardware would require additional clock cycles to access multiple words from the same bank. We therefore designed implementations suitable for this underlying architecture that work on one word, two words, and then four words of data at a time. The two methodologies have different performance results.
We initially look at the basic 32-bit and 64-bit C Map
and Verilog Map implementations, then the basic 1024-bit
C Map and Verilog Map implementations. Later we scale
up the architecture and look at multi-unit parallel implementations and implementations with 128-bit data
transfers for all three sub-benchmarks.
6. The C Map Implementation
A pure C implementation was done for each of the 32-bit, 64-bit, and 1024-bit benchmarks. The basic code
format is similar for all three. A C main program reads the
input bit-stream from an input file, calls a MAP function for
performing the transposition operation, and then writes the
results to an output file. The input bit stream is stored in
one long array in the main program. The data arrangement
within the input array for the 32-bit and 64-bit input data is
the same. Once the input data is stored within the array, it is transferred from the common memory to the OBM banks through DMA calls. The MAP® has six OBM banks of 524288 words, each word being 64 bits. The input stream of 10 million words is transferred in blocks of 262144 words at a time, since we cannot load the entire input data into the OBM banks at once.
6.1. 32-bit and 64-bit benchmarks
The main C program and the map C code for the 64-bit
benchmark are presented in Figures 2 and 3.
The code is similar for the 32-bit benchmark. The MAP
code constructs are mostly similar to the C constructs.
Certain constructs are vendor-specific and are used instead of traditional C constructs for compatibility with the current version of the compiler. In the MAP C code, we have the vendor-specific macro 'selector_64' that is used instead of a large switch statement. Sample code is presented below:
selector_64 (i==0, temp0, temp, &temp);
The macro selects variable temp0 and assigns it to
temp when the condition i==0 is true. We can use the
above construct on different conditions that represent
different cases of a switch statement. We note that
constructs like this are necessary for ensuring that efficient
code is generated; although the transposition itself is a
relatively simple operation, expressing the bit-level
selection in a 64-bit word can be textually tedious to the
point of obscuring the underlying algorithm.
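As a plain-C model of our reading of this construct (for exposition only; the actual macro is vendor-defined), selector_64 behaves like a guarded assignment:

/* If cond is true, *out receives a; otherwise *out keeps the current
   value cur. A chain of these calls thus acts like a switch statement. */
static inline void selector_64(int cond, uint64_t a, uint64_t cur, uint64_t *out) {
    *out = cond ? a : cur;
}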
The data transposition is implemented using shift and
bit-wise or operations as shown in the code of Figure 3.
void main(){
    // declarations ...
    // Assigning values
    for (i = 0; i < m; i++){
        fscanf(in, "%lld", &temp);
        A[i] = temp;
        E[i] = 0;
    }
    for (j = 0; j < 230; j++){
        for (k = 0; k < nblocks; k++){
            // assign values in blocks of half the bank capacity
            // call map function
            dt (A, E, m, &time, 0);
        }
    }
}
Figure 2. Version 1 main.c code for the 64-bit C Map implementation
void dt (uint64_t A[], uint64_t E[], int m, int64_t *time, int mapnum) {
    // ... declarations ...
    // ... DMA transfer of data from CM to OBM ...
    // Transposition operation done in blocks
    nblocks = datasize/m;
    for (block = 0; block < nblocks; block++){
        for (i = 0; i < m; i++) {
            l = block*m + i;
            EL[l] = 0;
            for (j = 0; j < m; j++) {
                k = block*m + j;
                temp = (AL[k] & (1ULL << (63 - i))) << i;
                EL[l] |= temp >> j;
            }
        }
    }
    // ... DMA transfer of output data from OBM to CM ...
    read_timer(&t2);
    *time = t2 - t1;
}
Figure 3. Version 1 map C code for the 64-bit C Map implementation
6.1.1. Code Improvements. The version 1 code presented in Figure 3 works like the software implementation. Its inner loop generates one output word at a time, and the outer loop runs the inner loop until all the outputs are generated. This code is not productive and results in little performance benefit, as shown in the timing results of Table 2. We would like to exploit the inherent parallel nature of the FPGA architecture, so we modify the code to operate on all n output values (for each block, or one n x n matrix of data) at the same time.
We use temporary variables for each of the n output values, which requires additional space on the FPGA. Each time one n-bit input word is read, we obtain one bit of each of the output values. The ith input gives us the ith bit of each of the outputs and is thus placed in the ith position of all the temporary variables using shift and or operations. Thus, when we are done reading the nth input value, we have all the transposed values. We then transfer these values onto the OBM. This modification allows us to generate the n output values in n cycles instead of n^2 cycles, as was originally the case. A sketch of this scheme is given below.
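The following C fragment sketches this one-pass scheme in the variable-distance-shift form described so far (only a few of the 64 temporaries are written out, in the elided style of the later figures; the names follow the figures but the fragment itself is illustrative, not the benchmark source).

/* Read each input word of the block once and update all 64 output
   temporaries in the same iteration; on the FPGA these updates are
   data-independent and execute concurrently. */
for (i = 0; i < 64; i++) {
    uint64_t a = AL[block*64 + i];
    temp0  |= ((a & (1ULL << 63)) << 0)  >> i;   /* MSB of input i -> position i of output 0 */
    temp1  |= ((a & (1ULL << 62)) << 1)  >> i;
    /* ... temp2 through temp62 follow the same pattern ... */
    temp63 |= ((a & 1ULL)         << 63) >> i;   /* LSB of input i -> position i of output 63 */
}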
This modification works quite well for the 32-bit
benchmark, but with the 64-bit benchmark we run into
synthesis problems because we run out of FPGA resources.
This comes from an increase in the number of temporary
variables (32 to 64) and a corresponding increase in the
number of shift and or operators. To overcome the
increase in resource usage we must modify the shift
operations portion of the code. The original code uses
variable-distance shifts and thus generates shift operators of
different sizes. This can be modified to perform shifts of
constant distances, which take little or no logic to
implement. A code sample is provided below:
temp17 |= ((AL[l] & (1ULL<<46)) << 17) >> i;
The statement can be replaced by the following
statement that uses constant distance shifts.
temp17 = (temp17 << 1) | (1 & (AL[l]>>46));
The original functionality is retained, but the resource
usage is drastically reduced, especially because we have 64
such modifications for the 64-bit transposition operation.
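As a quick standalone check that the two forms agree (illustrative code with hypothetical names, not part of the benchmark), both can be run over the same random inputs:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

int main(void) {
    uint64_t AL[64], a = 0, b = 0;
    int i;
    for (i = 0; i < 64; i++)
        AL[i] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
    for (i = 0; i < 64; i++) {
        a |= ((AL[i] & (1ULL << 46)) << 17) >> i;   /* variable-distance shifts */
        b  = (b << 1) | (1 & (AL[i] >> 46));        /* constant-distance shifts */
    }
    assert(a == b);   /* both collect bit 46 of word i into position 63-i */
    return 0;
}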
The modified code for version 2 is presented in Figure 4.
This code synthesizes well using resources well below the
total available on the XC2V6000 chip.
Finally we use parallel sections using compiler pragmas
to overlap DMA transfer of input data with computation on
the data that has in a previous step been transferred onto the
OBMs. This provides great performance benefits, as was
shown in previous work on the DARPA Boolean equation
benchmark [9], by permitting full overlap of data
movement with computation. The code for the 32-bit and
64-bit benchmarks has been run on the SRC-6 computer,
and the timing results are presented in Table 3.
The 1024-bit benchmark requires either a multi-unit
implementation where multiple 64-bit transposition units
are employed for conducting the 1024-bit transposition
operation or calling the 64-bit unit multiple times for
transposing the 1024-bit matrix. We first look at the Verilog
Map implementation of the 32-bit and 64-bit benchmarks
before we discuss the 1024-bit implementation for both the
C Map and the Verilog Map implementations.
void dt (uint64_t A[], uint64_t E[], int m, int icpl, int loops,
         int64_t *time, int mapnum) {
    // ... declarations ...
    // ... initial DMA transfer ...
    #pragma src parallel sections
    {
        #pragma src section // data transposition operation
        {
            // ... declarations ...
            for (block1 = 0; block1 < limit; block1++) {
                for (i = 0; i < mm; i++) {
                    // ...
                    temp0 = (temp0 << 1) | (1 & (AL[l] >> 63));
                    // ...
                    temp63 = (temp63 << 1) | (1 & (AL[l]));
                }
                for (i = 0; i < mm; i++) {
                    k = block1*mm + i + ioff;
                    temp = temp63;
                    selector_64 (i==0, temp0, temp, &temp);
                    // ...
                    selector_64 (i==63, temp63, temp, &temp);
                    EL[k] = temp;
                }
            }
        } // end of data transposition parallel section
        #pragma src section // DMA transfer
        {
            // ... declarations ...
            if (block < loops) {
                inputarraysub += icpl;
                joff = BLOCKSIZE - ioff;
                // ... parallel DMA transfer of data ...
            }
        }
    }
} // end of block for loop
Figure 4. Final version map C code for the 64-bit C Map implementation
6.2. The Verilog Map implementation
As we have mentioned earlier, the SRC-6 programming environment allows us to incorporate user-defined hardware macros designed in either VHDL or Verilog. The C Map implementation provides a good speed-up over a standard processor, but we wanted to explore the possibility of designing the application in a hardware description language, giving us a highly customized implementation. This is similar to the traditional software programming methodology in which critical functions are rewritten in assembly language for better performance. The hardware description languages provide the ability to operate at the bit level instead of at the word level as in C. This added advantage can be exploited in designing the transposition algorithm, which mainly requires bit manipulation. We therefore model the transposition algorithm in Verilog and implement it on the SRC-6 platform. An additional benefit of this extra implementation is that it measures the quality of the code synthesized from C by the SRC MAP C compiler.
The basic idea and setup used for the Verilog design is, for all three benchmarks, similar to that of the Map C code. The basic hardware architecture of a 64-bit transposition module is given in Figure 5.
[Figure 5 diagram: 64 parallel dt_op units, with inputs A[0] through A[63], each unit producing a 64-bit output temp[63:0]]
Figure 5. Basic architecture of the 64-bit Verilog transposition unit
The module consists of 64 units (dt_op) in parallel, each working on one of the 64 output values. Each unit holds a 64-bit shift register and a 64-bit or, and performs the following operation:

temp <= (temp << 1) | ({63'b0, A});

This differs from the C version in that we operate on one bit of the input, held in the variable A. Also, the concatenation of A with the word 63'b0 (a 63-bit word of value zero) is not possible in C as a single bit-level operation as in the code above, but in C would need shift and or operations. A setup similar to that of the 64-bit benchmark is used for the 32-bit benchmark, with the inputs and outputs of the macro being 32-bit words. The 32-bit and 64-bit benchmarks using Verilog macros were implemented on the SRC-6, and the timing results we collected are presented in Table 4.
6.3. 1024-bit benchmark
The 1024-bit benchmark differs from the previous two in that it operates on a "word size" that is 16 times the size of the word that can be stored within the OBMs. This requires us to store each 1024-bit word in 16 locations of one OBM. We must also use multiple code units, with each unit performing a 64-bit transposition, in order to perform the entire 1024-bit transposition.
Figure 6. Breaking up of a 4x4 matrix into 2x2 sub-matrices
Transposing a 1024x1024 matrix can be broken down using 256 64-bit transposition modules, with each module transposing one of the 256 64x64 sub-matrices. The resulting transposed words can then be appropriately concatenated to obtain the transposed 1024-bit words. The idea behind this method can be shown using the 4x4 example of Figure 6, which is broken down into transpositions of 2x2 matrices.
In Figure 6, the 4x4 matrix has been broken down into four 2x2 sub-matrices that we number 1, 2, 3, and 4 going anti-clockwise. The transposed sub-matrices give us 2-bit transposed values that can be appropriately concatenated to give us the final 4-bit transposition values. Figure 7 below shows the final transposed 4x4 matrix.
Figure 7. The transposed 4x4 matrix
If we look at the transposed matrix above and number the transposed versions of sub-matrices 1, 2, 3, 4 as a, b, c, d, we see that these transposed sub-matrices appear in the order a, c, b, d going anti-clockwise. What this means is that the values of sub-matrix a (sub-matrix 1 transposed) concatenated with those of sub-matrix c (sub-matrix 3 transposed) form the first two 4-bit transposed values. The next two 4-bit transposed values are obtained by concatenating the values of sub-matrix b (sub-matrix 2 transposed) with those of sub-matrix d (sub-matrix 4 transposed). This is a simple idea and is relatively obvious on examining the transposition operation; a C sketch of the decomposition is given below.
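The same decomposition can be sketched in C for the full-size case, reusing the transpose64() function from the earlier software sketch; the row-major storage layout (sixteen 64-bit words per 1024-bit row) is assumed for illustration only.

#include <stdint.h>

void transpose64(const uint64_t in[64], uint64_t out[64]);  /* from the earlier sketch */

/* Transpose a 1024x1024 bit matrix via 256 transposes of 64x64
   sub-matrices: output block (bc, br) is the transpose of input
   block (br, bc), which effects the concatenation described above. */
void transpose1024(const uint64_t in[1024][16], uint64_t out[1024][16]) {
    uint64_t blk[64], tblk[64];
    int br, bc, r;
    for (br = 0; br < 16; br++) {
        for (bc = 0; bc < 16; bc++) {
            for (r = 0; r < 64; r++)        /* gather one 64x64 sub-matrix */
                blk[r] = in[br*64 + r][bc];
            transpose64(blk, tblk);
            for (r = 0; r < 64; r++)        /* scatter to the mirrored block */
                out[bc*64 + r][br] = tblk[r];
        }
    }
}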
Now that we have decided to break the design into 64-bit units, we face the problem of resource usage and availability on the FPGA. The 64-bit transposition C Map implementation uses about 61% of the FPGA slices, including the communication overhead. The Verilog Map implementation takes 2% of the FPGA slices for the transposition unit and about 28% of the slices for the infrastructure to read/write the OBM and some additional control logic for the MAP® C overhead. The 64-bit Verilog unit also uses around 4096 register bits, which is about 6% of the total register bits available on the FPGA. The entire 1024x1024 matrix transposition would require 256 64-bit units, so it is clearly impossible to implement this on one FPGA chip. We considered some available options for using multiple 64-bit units, but even a four-unit design for the C MAP® implementation does not synthesize, running out of resources at 91,012 slices, 269% of the available 33,792. Moreover, we have realized through experimentation that the four units employed would not actually run in parallel but would execute sequentially one after another. The SRC MAP® compiler pipelines only innermost loops, and data-independent statements are executed in parallel within these loops. The four-unit design has four inner loops within one outer loop; the outer loop is not pipelined, and the four inner loops execute sequentially, eliminating the advantage of having four units. We could as well have one unit that is called 256 times to transpose the 1024x1024 matrix. This implementation in Map C synthesizes well, utilizing resources well below those available.
The idea is to use one 64-bit unit and have it work on all the 64-bit parts of the 1024-bit input words, one part at a time (since each 1024-bit word can be broken down into sixteen 64-bit words). Since each unit works on a 64x64 sub-matrix, we would have one unit working on the first 64 rows (of the 1024x1024 matrix) sixteen times to transpose those 64 rows. Thus, in 64x16 = 1024 cycles (each 64-bit unit takes 64 cycles for a 64x64 matrix transposition) we would transpose 64 rows of the 1024x1024 matrix, and in 16*1024 = 16384 cycles we would be able to transpose all 1024 rows of the matrix. This scheme would generate all the final transposed values at one time after 16384 cycles. However, the values are generated out of order: the first 64 rows give us the first 64 bits of each of the 1024 outputs, the next 64 rows give the second set of 64 bits, and so forth. The different 64-bit parts of each output are not generated in sequence. Thus we would have to transpose the whole 1024x1024-bit matrix before we could transfer the final 1024 transposed values to memory. This requires substantial temporary storage space on the FPGA (1024x1024 bits, to be exact).
If we work on columns instead of rows, we can generate 64 1024-bit transposed values every 1024 cycles. This "columnar" scheme works on the first 64 columns of the 1024x1024 matrix. Each 64-bit unit generates the transposed values of the 64 rows in the first 64 columns of the matrix. Thus, in 1024 cycles we would have the first 64 transposed values of the matrix. This operation's correctness is evident from the 4x4 example shown above: the transposed values are a concatenation of the transposed values of sub-matrices 1 and 3. Thus, if we have 2-bit transposition units working on the first 2 columns of the 4x4 matrix, we get the sub-parts of the transposed values of the first 2 columns, which we can concatenate appropriately to obtain the first 2 transposed values of the 4x4 matrix. Similarly, we can generate the first 64 transposed values of our 1024x1024 matrix by working on the first 64 columns. The entire transposition still takes 16384 cycles, but we would have generated and transferred the values without having had to store them on the FPGA, thus saving time compared to transferring all 1024 values at the end.
We have not taken into consideration in this analysis the fact that all sixteen 64-bit parts of each 1024-bit word cannot be transferred to the FPGA at once, due to certain limitations. First, we have to store the 16 values in the available six banks, so we cannot later read all 16 values at the same time. Obtaining one 1024-bit value thus needs multiple cycles and adds latency to the design. If we stored all 16 values in the same memory bank, we would need 16 cycles to read one full 1024-bit value; to read all 1024 values we would need 16384 cycles. The time required to move the output values from the FPGA onto the OBM is likewise 16384 cycles.
The map C code for the 1024-bit Verilog Map design is presented in Figure 8. The C Map implementation code structure is similar, except that it uses C shift and and operations instead of a Verilog macro to perform the transposition. The computation calls occurring within a loop are pipelined, with one computation taking one clock cycle. The data transfer is overlapped with the computation, which allows us to achieve a transfer rate of one word per cycle. The loop-carried scalar dependencies that arise in certain sections of the code are avoided by replacing them with vendor-specific code structures such as cg_accum_add_32, which replaces a conditional increment statement within a pipelined for loop.
The C Map and the Verilog Map implementations of the 1024-bit benchmark employing one 64-bit transposition unit were implemented on the SRC-6 reconfigurable computer, and the timing results collected are shown in Tables 3 and 4 respectively.
void dt (uint64_t A[], uint64_t E[], ..., int mapnum) {
    // ... declarations ...
    // ... initial DMA transfer ...
    for (block = 0; block < loops; block++) {
        // parallel sections for transposition and DMA transfer
        #pragma src parallel sections
        {
            #pragma src section
            {
                for (block1 = 0; block1 < limit/2; block1++) {
                    // ...
                    for (col = 0; col < 16; col++) {
                        // ...
                        for (part = 0; part < 16; part++) {
                            // ...
                            for (j0 = 0; j0 < bw*2; j0 += 1) {
                                k0 = block1*16384 + j0*16 + part*1024 + ioff + col;
                                dt_op (AL[k0], i0, j0, j0 == 0, &temp0);
                                cg_accum_add_32 (1, j0 > 64, 0, j0 == 0, &i0);
                                l0 = block1*16384 + part + col*1024 + i0*16 + ioff;
                                EL[l0] = temp0;
                            }
                        }
                    }
                }
            } // end computation section
            #pragma src section // DMA transfer
            {
                // ... DMA transfer of data in parallel with the computation ...
            }
        }
    } // end of block for loop
}
Figure 8. Map C code for the 1024-bit Verilog Map implementation
6.4. Parallel 3-unit implementation
After having looked at the basic implementations and obtained better performance with the Verilog Map designs for the three benchmarks, we look at scaling the architecture into multi-unit parallel Verilog Map implementations. The multi-unit designs are possible for Verilog Map because the Verilog macros do not use a lot of resources and fit well in the available FPGA space.
The parallel implementation has three transposition units with one Verilog macro call for all three. The Verilog macro has three instantiations of the transposition unit that execute concurrently. The data is transferred to three OBMs one after another, with the three parallel units taking their input from the three OBMs respectively; the output from the three parallel units is written to the remaining three OBMs. The code structure is similar for all three benchmarks. The data transfer speed of one word per cycle remains the same, so the performance improvement is obtained only in computation. The computation in the parallel implementation is three times faster than in the one-unit implementations. The one-unit implementations had an effective computational throughput of three cycles per word, set by the Map compiler through the pipeline depth and the cycle time for code block transitions; with the one cycle per word data transfer throughput, the total is four cycles per word. Since we now have three units working in parallel, we obtain a computational throughput of one cycle per word, and the data transfer speed of one word per cycle is maintained. This gives us a total throughput of two cycles per word and thus, theoretically, twice the speedup over the one-unit implementations. However, the pipeline depth and other factors affect the throughput of the design. The parallel 3-unit SRC-6 implementation results are presented in Table 5.
6.5. 128-bit Data Transfer
The SRC-6 machine has the capability of 128-bit data transfers between the common memory and the OBMs, with one 64-bit word transferred on the positive clock edge and the other 64-bit word on the negative edge to two adjacent memory banks. This enables us to transfer two 64-bit words in one cycle. The earlier implementations used 64-bit word transfers and thus did not exploit the full bandwidth. The 32-bit benchmark can pack two words into one 64-bit word and can thus have four words transferred per cycle with a 128-bit DMA transfer. The 64-bit and 1024-bit benchmark designs can have two 64-bit words transferred per cycle. The two-word and four-word per cycle transfers require us to modify the way the transposition is performed in the Verilog macro. In the case of the 64-bit benchmark, the two input words transferred can be read from the two banks concurrently and operated upon using two-bit shifts instead of the one-bit shifts employed in the previous implementations (see the sketch below). Similarly, for the 32-bit benchmark, the four 32-bit input words transferred within two 64-bit words can be read from the two banks concurrently and operated upon using four-bit shifts. The 1024-bit benchmark operation is slightly different, as the two words transferred are inputs from different blocks of data that need to be operated upon separately. In this case we employ two 64-bit transposition units that work on alternate columns of the matrix. Here each column represents 16 blocks of 64x64 matrices, and since there are 16 such columns in a 1024x1024 matrix, we have each of the two units work on 8 columns 16 times to generate the final transposed values.
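The following C sketch illustrates the two-bit-shift update for the 64-bit benchmark (illustrative names, following the earlier constant-shift example for accumulator 17); each step now consumes the pair of words delivered by one 128-bit transfer.

#include <stdint.h>

/* One update step: append one bit from each of the two words read
   concurrently from adjacent banks (a0 even-indexed, a1 odd-indexed). */
static inline uint64_t step17(uint64_t acc17, uint64_t a0, uint64_t a1) {
    return (acc17 << 2)
         | ((1ULL & (a0 >> 46)) << 1)   /* bit 46 of the even word */
         |  (1ULL & (a1 >> 46));        /* bit 46 of the odd word */
}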
MAPLD, September 7-9, 2005
The multi-bit shifts in the case of the 32-bit and 64-bit benchmarks, and the two-unit processing in the case of the 1024-bit benchmark, enable us to work on multiple words concurrently, providing speedup in both computation and data transfer. We work on four words at a time for the 32-bit benchmark and two words at a time for the 64-bit and 1024-bit benchmarks. Thus, compared to the basic one-unit, one-word-transfer-per-cycle implementation, we theoretically expect twice the speedup in the case of the 64-bit design and four times the speedup in the case of the 32-bit benchmark. However, the multi-bit shifts in the designs affect their pipeline depths and thus the overall throughputs. The 128-bit transfer SRC-6 Verilog implementation results are presented in Table 6.
6.6. Parallel 2-unit with 128-bit Data Transfer
The 128-bit data transfer implementations are scaled to a parallel implementation of two units working on two different streams of data. In these implementations we have two DMAs instead of one, to four OBMs, with each DMA transferring 128 bits to two adjacent OBMs. The 32-bit and 64-bit benchmarks have two units working in parallel, with each unit reading 128 bits from two banks. The 32-bit transposition unit performs four-bit shifts and the 64-bit transposition unit performs two-bit shifts, as in the original 128-bit word data transfer implementation. The 1024-bit benchmark has to operate on the four words separately and thus employs four 64-bit units that read concurrently from the four banks of memory. The 32-bit design is effectively working on eight 32-bit words at a time, and the 64-bit and 1024-bit designs are working on four 64-bit words at a time. Since we are using four of the six available OBMs on the MAP® for reading the input, we are left with only two OBMs for writing the output. Each unit generates four output words per cycle, which must be written to only two OBMs, with two words written to one OBM. Thus the two writes to each OBM take two cycles and add a two-cycle latency to the pipelined loop instead of just one cycle. This affects the overall throughput of the design. The 1024-bit benchmark works on four words at a time, computing twice as fast, but has the two-cycle latency on the output transfer, thus achieving the same maximum throughput as the original 128-bit transfer implementation. The implementation results are presented in Table 7.
7. Results and Analysis
The timing results for the software and SRC-6
implementations for both C Map and Verilog Map are
presented in Tables 1, 2, 3, and 4.
Table 2 presents the timing results for the C Map implementation using a nested loop structure without parallel sections. It is quite obvious that we have not obtained any performance benefit; rather, we see degradation for both the 32-bit and 64-bit benchmarks compared to the software implementation timing shown in Table 1. We have not implemented the 1024-bit benchmark this way, since it would require multiple uses of the 64-bit unit and would thus result in even lower performance.
The speedup numbers are tabulated in Table 8 below. The speedup is listed for all three benchmarks for the five different implementations: C Map, Verilog Map, parallel 3-unit Verilog Map, 128-bit data transfer Verilog Map, and parallel 2-unit 128-bit transfer Verilog Map, labeled A, B, C, D, and E respectively.
Table 1. Timing results for the software C implementation

Benchmark   No. of iterations   Time (secs)   Time per iteration (secs)
32-bit      400                 3767          9.41
64-bit      230                 3270          14.21
1024-bit    12                  2004          187.66
Table 2. Timing results for the C Map implementation with a nested loop construct

Benchmark   No. of iterations   Time (secs)   Time per iteration (secs)
32-bit      400                 6171          15.42
64-bit      230                 5132          22.31
Table 3. Timing results for the final C Map implementation

Benchmark   No. of iterations   Time (secs)   Time per iteration (secs)
32-bit      400                 244           0.61
64-bit      230                 129           0.56
1024-bit    12                  97            8.08
Table 4. Timing results for the Verilog Map implementation

Benchmark   No. of iterations   Time (secs)   Time per iteration (secs)
32-bit      400                 179           0.44
64-bit      230                 98            0.42
1024-bit    12                  72.7          6.05
Table 5. Timing results for the parallel 3-unit Verilog Map implementation

Benchmark   No. of iterations   Time (secs)   Time per iteration (secs)
32-bit      400                 95            0.23
64-bit      230                 60            0.26
1024-bit    12                  44            3.66
Table 6. Timing results for the 128-bit transfer Verilog Map implementation

Benchmark   No. of iterations   Time (secs)   Time per iteration (secs)
32-bit      400                 82            0.20
64-bit      230                 63            0.27
1024-bit    12                  30            2.5
Table 7. Timing results for the parallel 2-unit, 128-bit transfer Verilog Map implementation

Benchmark   No. of iterations   Time (secs)   Time per iteration (secs)
32-bit      400                 55            0.13
64-bit      230                 54            0.23
1024-bit    12                  30            2.5
Table 8. Speedup of different SRC-6 implementations over the software implementation

Benchmark   A    B    C    D    E
32-bit      15   21   41   46   68
64-bit      25   33   55   52   61
1024-bit    23   31   51   75   75
Table 8 shows that the C Map implementation of the 32-bit benchmark is about 15 times faster than the software implementation, and the C Map implementation of the 64-bit benchmark is 25 times faster. The C Map 1024-bit benchmark implementation shows a speedup of about 23 times over the software implementation.
The speedup over the initial implementation for all the benchmarks can easily be seen, indicating the benefits of a parallel design in which the transpositions for all the output words of a block of data are conducted concurrently rather than sequentially.
The Verilog Map implementations provide an even greater improvement in speedup compared to the C Map implementations; the advantage of working at the bit level provides better performance. The results shown in Table 8 indicate that the Verilog implementations of the 32-bit and 64-bit benchmarks are about 21 and 34 times faster, respectively, than the software implementation. The Verilog implementation of the 1024-bit benchmark shows a 31-times speedup compared to the software C implementation, indicating the advantages of using hardware design units for the 1024-bit transposition.
The Verilog Map implementations show better performance than the C Map implementations basically because in Verilog the operations are performed at the bit level and not the word level, making the transposition operations faster.
The numbers are also significant for the parallel 3-unit implementations and the 128-bit data transfer implementations. The parallel 3-unit design exploits the available FPGA space and implements the three units in parallel, working on input data concurrently to conduct the transposition operation and thus achieving speedup in the computation. However, the data transfer does not use the full bandwidth and is done at one word per cycle. The number of parallel units could be increased, but it is restricted by the number of memory banks from which data can be read concurrently without any overlap. The SRC-6 has six OBMs, and thus we use three for input and three for output. If we had more parallel units, some of them would have to read from the same bank, which would add additional cycles to the pipeline. Thus, employing three units, we achieve speedups of 41 times for the 32-bit, 55 times for the 64-bit, and about 51 times for the 1024-bit benchmark.
The 128-bit data transfer utilizes the full bandwidth between the common memory and the OBMs and thus transfers two words per cycle, twice the speed of the original implementations. The computational units also perform the transposition on two words concurrently to maintain the two-words-per-cycle throughput. These implementations obtain speedups of 46 times for the 32-bit, 52 times for the 64-bit, and 75 times for the 1024-bit benchmark. Finally, we have the parallel two-unit implementations with 128-bit data transfer, which try to exploit both the data transfer speedup and the computational speedup by having two units instead of one, working on four words of input rather than two. The throughput, however, is limited by the fact that we have to write four words of output into two banks of memory, thus overlapping two writes. This certainly limits the throughput, but we still achieve considerable speedup for the 32-bit and 64-bit benchmarks, with the 1024-bit benchmark achieving the same speedup as the one-unit 128-bit transfer implementation. The 32-bit benchmark obtains a 68-times speedup, the 64-bit a 61-times speedup, and the 1024-bit a 75-times speedup. The throughputs are mostly governed by the pipeline depths and the data transition throughputs that the compiler sets based on the design. We have not yet found a specific way of measuring these parameters except by observing the dataflow graph or the Verilog code generated during compilation; this is thus mostly a post-compilation process and is a bit tedious.
8. Conclusions
We have implemented the Data Transposition benchmark on the SRC-6 platform and compared its performance against the software implementation. We have looked at multiple architectures, with emphasis on exploiting the inherent parallelism that FPGAs provide. The SRC-6 reconfigurable computer provides the ability to implement DARPA benchmark 3 and, with the basic implementations, obtain a 34-times speedup on the 64-bit benchmark and a 31-times speedup on the 1024-bit benchmark compared to the software implementations. The higher-end architectures that exploit parallelism as well as the 128-bit data transfer capability obtain better performance, with the 32-bit, 64-bit, and 1024-bit benchmarks achieving 68-times, 61-times, and 75-times speedups over the software implementation respectively. The benefits of implementing on a reconfigurable platform using parallel architectures can be clearly observed from the improvement in performance obtained. The SRC-6 reconfigurable platform thus provides the means to implement high-data-bandwidth, computationally complex algorithms and obtain considerable performance improvements.
The parallelism has been exploited to a certain degree but is limited by the SRC-6 memory architecture. The six banks place a constraint on how many parallel units can operate on the FPGA concurrently. The other important factor, obviously, is how many of the units we can fit on the FPGA. The resource usage for the different benchmarks and implementations is given in Table 9. The numbers give the overall slice usage, including the MAP communication overhead. We have a certain amount of communication overhead due to the reads and writes to the OBMs from the FPGA, which is unavoidable.
Table 9. Resource usage of the SRC-6 implementations

            Slices used*
Benchmark   B       C        D        E
32-bit      7,024   10,155   7,312    8,875
64-bit      9,507   18,007   9,588    13,672
1024-bit    7,599   16,051   10,852   19,946
*Virtex-II 6000 has a total of 33,792 slices
The 3-unit parallel implementations for the 32-bit, 64-bit, and 1024-bit benchmarks use around 30%, 53%, and 47% of the slices respectively. The 128-bit implementations employ two units for the 32-bit and 64-bit benchmarks and four units for the 1024-bit benchmark, and use 26%, 40%, and 59% of the slices respectively. This implies that we could have more parallel units implemented on the chip. However, this would lead to bank conflicts, with multiple units reading from the same OBM.
If we look beyond the SRC-6 system architecture and assume we have more than six OBMs, with smaller capacities, we could have more units reading from these memory banks in parallel. The number of memory banks we would require depends on how many parallel units we could implement in the available FPGA space. The slice counts we have obtained tell us that we could implement around ten units for the 32-bit benchmark and around five or six units for the 64-bit and 1024-bit benchmarks. These calculations include the communication overhead of the memory interface. This overhead is compiler-specific; we have observed that a 64-bit transposition unit by itself uses only around 3% of the LUTs and about 6% of the register bits on the FPGA, so by rough calculation the FPGA could easily fit around ten of the 64-bit units. However, even if we could fit ten units on the FPGA, we would probably run out of I/O pins for interfacing with ten different memory banks. The Xilinx Virtex® II 6000 chip on the SRC-6 MAP® board has a total of 1,104 pins. The six memory banks on the SRC-6 system use up a considerable number of these pins; some pins are used for inter-chip communication between the two FPGA chips on the MAP, some for communication with the control processor, and some for the chain ports for communication between the two MAPs. Thus, in this architecture we would probably run out of pins on the FPGA to build an appropriate interface.
The Data Transposition problem allows a high degree of parallelism, in which multiple units can work on small blocks of data with no data dependency. This feature pushes us to look at implementing this algorithm on highly parallel, distributed-memory architectures. We believe that a system with multiple FPGAs, each connected to a couple of local memory banks, would be an appropriate architecture for this application. The FPGAs could be large enough to fit one unit each, including the communication overhead for memory access. Each FPGA would hold one transposition unit, with its local memory banks holding the input and output data. Many such processing elements working in parallel would achieve very high performance. However, this architecture would be specific to this application, or to applications that allow a high degree of parallelism in which each processing element is not very big.
We would like to explore these kinds of architectures for implementing highly parallel, data-independent algorithms by looking away from traditional von Neumann architectures, in which memory is considered as one big space instead of several smaller distributed units.
9. References
[1] Defense Advanced Research Projects Agency, "High productivity computing systems discrete mathematics benchmarks," 2003.
[2] University of South Carolina reconfigurable computing group website, http://www.cse.sc.edu/~buell/Public_Data/DARPA_HPCS/DARPA_discrete_math.html.
[3] J. Choi, J. J. Dongarra, D. Walker, "Parallel matrix transpose algorithms on distributed memory concurrent computers," Oak Ridge National Laboratory, October 1993.
[4] S. Krishnamoorthy, G. Baumgartner, D. Cociorva, C. Lam, P. Sadayappan, "Efficient parallel out-of-core matrix transposition," Proceedings of the IEEE International Conference on Cluster Computing (Cluster'03), IEEE Computer Society Press, December 2003, pp. 300-307.
[5] M. Kauffmann, U. Meyer, J. F. Sibeyn, "Matrix transpose on meshes: Theory and practice," 11th International Parallel Processing Symposium, 1997.
[6] A. Edelman, "Optimal matrix transposition and bit reversal on hypercubes: All to all personalized communication," Journal of Parallel and Distributed Computing, vol. 11, pp. 328-331, 1991.
[7] SRC Computers, Inc., www.srccomp.com.
[8] The SRC-6 MAP® Hardware Guide, Version 1.8, SRC Computers, Inc.
[9] D. A. Buell, S. Akella, J. P. Davis, G. Quan, "The DARPA Boolean equation benchmark on a reconfigurable computer," Proceedings of the Military Applications of Programmable Logic Devices, 2004.