BVB-IEEE
SAE-INDIA BVB Collegiate Club
Manthan ‘06
FAST MATRIX MULTIPLICATION
Nikhil Ambekar, Rohit Patil, Rajashekargouda Patil,
Subray Bhat
nik.ambekar@gmail.com
subraybhat@gmail.com
B.V.B.C.E.T Hubli.
Abstract
Matrix multiplication is a highly procedure-oriented computation: there is only one way to multiply two matrices, and it involves a large number of multiplications and additions. The size of the matrix that can be handled depends on the number of multiplier units. A design for matrix multiplication is presented.
1. Introduction: Matrix multiplication is one of the most fundamental and important problems in science and engineering. Many other important matrix problems can be solved via matrix multiplication, e.g., finding the Nth power, the inverse, the determinant, the eigenvalues, an LU factorization, and the characteristic polynomial of a matrix, and the product of a sequence of matrices. Many graph-theory problems also reduce to matrix multiplication; examples are finding the transitive closure, all-pairs shortest paths, the minimum-weight spanning tree, a topological sort, and the critical paths of a graph. Matrix multiplication also finds application in aeroplane design, graphic calculators, LCD displays, etc. Therefore, fast and processor-efficient parallel algorithms for matrix multiplication are of fundamental importance.
Matrix multiplication is a highly procedure-oriented computation: there is only one way to multiply two matrices, and it involves a large number of multiplications and additions. The simple part of matrix multiplication, however, is that each element of the resultant matrix can be evaluated independently of the others; this points to parallel computation architectures. Parallel matrix multiplication has been extensively investigated in the past two decades, and many algorithms and approaches have been designed to achieve it. Some of the approaches that have been developed are: 1D-systolic, 2D-systolic, Cannon's algorithm, Fox's algorithm, Berntsen's algorithm, the transpose algorithm, and the DNS algorithm.
Matrix multiplication has also been approached with a view to exploiting the sparsity of the matrices and hence reducing the number of additions required to multiply them; many algorithms have been developed for this approach as well. The algorithms and architectures designed face a problem when the size of the matrices keeps varying: there are issues with memory sharing, and hence an overhead in the memory-system code.
In this paper we propose an architecture that is capable of handling matrices of variable sizes. The hardware suggested incorporates a high degree of parallelism to achieve large throughput. This hardware is a dedicated system for matrix multiplication and relieves the main processor from the burden of large matrix multiplications. The design involves a novel approach to multiplying two numbers; the multiplier unit used here is capable of multiplying two numbers in a single clock cycle, which increases the speed of the computation. The system is simple to implement and is highly scalable: it can be scaled by simple repetition of the hardware, with no changes in the algorithm.
The rest of the paper is organized as follows: section 2 deals with the implementation of the design, section 3 deals with the scalability issue, section 4 presents the results of the implementation of the multiplier, section 5 outlines further work, section 6 looks into the advantages of the system, and section 7 concludes the paper.
2. Implementation of the circuit:
The architecture for the matrix multiplier is as shown below.
The unit has the following main blocks:
1. Memory blocks
2. Control unit
3. Matrix arrangement of multiplier units
4. Adders and counters
Memory blocks:
These are arrays of registers containing the data input and output for the matrix multiplier circuit. The input memory block has the data arranged in the proper order; writing to this memory block is taken care of by the OS. The output memory block contains the elements of the resultant matrix; reading the result out of this block is also taken care of by the OS.
Whenever a fresh set of data is entered into the input memory block, the OS indicates this to the control unit by making the 'fresh_in' signal high.
Control Unit:
It consists of counters, a control register, and a logic circuit to channel the data out of the input memory block. The control unit resets the 'fresh_in' bit and
appropriately channels the data to the multiplier units. Once the last set of data has been channeled from the input memory block, the control unit generates the 'write_in' signal to the OS. The occurrence of this signal indicates to the OS that it should fill fresh data into the input memory block.
Once the output memory block is full with the resultant elements, the control unit generates the 'output_ready' signal. The OS reads the data out of the output memory block on the occurrence of this signal.
The control unit thus coordinates with the OS for the data transfer to the input and from the output memory blocks; in the process it generates the handshake signals 'write_in' and 'output_ready'.
It contains a couple of registers to hold the order of the matrices. The data in these registers is used to set up counters and hence keep track of the number of data fills into the input memory block and data read-outs from the output memory block. The control unit generates a 'done' signal at the end of the complete matrix multiplication.
The channeling of data is controlled by specifying appropriate select lines for the multiplexer. The outputs of the multiplexer are hard-wired to the multiplier cells.
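A minimal VHDL sketch of this handshake is given below. The signal names 'fresh_in', 'write_in', 'output_ready' and 'done' follow the paper; the entity name, the status inputs (last_data, output_full, mult_done) and the simple registered-pulse behaviour are assumptions introduced only to make the sketch self-contained, not the authors' actual control circuit.

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch of the control-unit handshake with the OS.
entity control_handshake is
  port (
    clk          : in  std_logic;
    reset        : in  std_logic;
    fresh_in     : in  std_logic;  -- OS has filled the input memory block
    last_data    : in  std_logic;  -- last set of data channeled to the multipliers
    output_full  : in  std_logic;  -- output memory block holds the resultant elements
    mult_done    : in  std_logic;  -- complete matrix multiplication finished
    fresh_clr    : out std_logic;  -- clears the 'fresh_in' bit
    write_in     : out std_logic;  -- asks the OS for fresh input data
    output_ready : out std_logic;  -- tells the OS the result can be read
    done         : out std_logic
  );
end entity;

architecture rtl of control_handshake is
begin
  process (clk, reset)
  begin
    if reset = '1' then
      fresh_clr    <= '0';
      write_in     <= '0';
      output_ready <= '0';
      done         <= '0';
    elsif rising_edge(clk) then
      fresh_clr    <= fresh_in;     -- acknowledge and reset the fresh_in flag
      write_in     <= last_data;    -- request new data once the last set is channeled
      output_ready <= output_full;  -- result available in the output memory block
      done         <= mult_done;    -- end of the complete multiplication
    end if;
  end process;
end architecture;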
Matrix arrangement of multiplier units:
The multiplier units are arranged in matrix form so as to obtain a high degree of parallelism in the computation. We know that in matrix multiplication each element of the resultant matrix can be evaluated independently of the others; hence we load the input elements so as to compute the resultant elements in parallel.
Our design is scalable, i.e., repetition of units in a similar manner results in higher parallelism. The degree of parallelism is decided by the application for which the circuit is being used. If the input matrices are larger than the hardware allocated, the circuit is still capable of handling those matrices. We illustrate the architecture and algorithm for matrix multiplication using hardware capable of handling 4X4 input matrices.
The above figure illustrates two input matrices, each of order 4X4. Here we consider only square matrices for the purpose of explanation; if two rectangular matrices are given as inputs, the hardware checks the condition for matrix multiplication and continues the work. The hardware for the implementation of the multiplication of the matrices in figure 3 is as given in figure 4.
The block diagram shown in figure 4 is capable of multiplying two 4X4 matrices in 4 clock cycles. The first column elements Ax are given as inputs along the first row of the block; similarly, the subsequent columns are given as inputs to the subsequent rows of multipliers. The first column elements of the other matrix are given as inputs to the entire block, with one element as input to each multiplier in a row. On evaluation of the products at the end of the clock cycle we get the partial products of the first column, which are added in the 'adders and counters' block to give the first column elements of the resultant matrix. We incorporate one level of pipelining by pre-fetching one set of elements from the input memory block while the previous set of inputs is being processed in the multiplier units.
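To make the dataflow concrete, the following behavioural VHDL sketch models one step of figure 4: all sixteen products are formed in parallel and summed column-wise, so one column of the resultant matrix is produced per clock cycle and a 4X4 product completes in 4 cycles. The entity name, the flattened port encoding and the 16-bit element width are assumptions made only for illustration; the actual design is structural, built from the multiplier units described below.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Behavioural model of one column step of the 4x4 multiplier array.
entity column_step_4x4 is
  port (
    clk    : in  std_logic;
    a_flat : in  std_logic_vector(16*16-1 downto 0);  -- all sixteen 16-bit elements of A, row-major
    b_col  : in  std_logic_vector(4*16-1 downto 0);   -- one column of B, four 16-bit elements
    c_col  : out std_logic_vector(4*34-1 downto 0)    -- one column of the resultant matrix
  );
end entity;

architecture behavioural of column_step_4x4 is
begin
  process (clk)
    variable a_elem : unsigned(15 downto 0);
    variable b_elem : unsigned(15 downto 0);
    variable acc    : unsigned(33 downto 0);
  begin
    if rising_edge(clk) then
      -- each column of the multiplier array accumulates one resultant element:
      -- C(r, j) = sum over k of A(r, k) * B(k, j), for the column j currently on b_col
      for r in 0 to 3 loop
        acc := (others => '0');
        for k in 0 to 3 loop
          a_elem := unsigned(a_flat((4*r + k)*16 + 15 downto (4*r + k)*16));
          b_elem := unsigned(b_col(k*16 + 15 downto k*16));
          acc    := acc + resize(a_elem * b_elem, acc'length);
        end loop;
        c_col(34*r + 33 downto 34*r) <= std_logic_vector(acc);
      end loop;
    end if;
  end process;
end architecture;

Feeding a new column of B on each clock cycle therefore yields the four columns of the result over four cycles, matching the throughput claimed above.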
Next we look into the design of the heart of the entire circuit, i.e., the multiplier unit itself.
To implement the multiplier we use the simple shift-and-add algorithm. In this algorithm, shifted versions of one of the 'n'-bit numbers to be multiplied are formed (up to 'n' shifts); the occurrence of a '1' bit in the other number results in the addition of the correspondingly shifted version, while the occurrence of a '0' bit results in the addition of zero. The shift-and-add method is used here for the implementation of the multiplier circuit because of its simplicity and the small amount of hardware required for its implementation.
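A minimal behavioural sketch of this shift-and-add scheme is given below, assuming unsigned operands; the entity and generic names are illustrative and not taken from the paper. The loop simply adds the shifted version of 'a' for every '1' bit of 'b'.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Combinational shift-and-add multiplier sketch (illustrative only).
entity shift_add_mult is
  generic (N : positive := 16);          -- operand width in bits
  port (
    a : in  unsigned(N-1 downto 0);
    b : in  unsigned(N-1 downto 0);
    p : out unsigned(2*N-1 downto 0)
  );
end entity;

architecture comb of shift_add_mult is
begin
  process (a, b)
    variable acc : unsigned(2*N-1 downto 0);
  begin
    acc := (others => '0');
    -- for every '1' bit of b, add the correspondingly shifted version of a;
    -- a '0' bit contributes nothing (i.e. adds zero)
    for i in 0 to N-1 loop
      if b(i) = '1' then
        acc := acc + shift_left(resize(a, 2*N), i);
      end if;
    end loop;
    p <= acc;
  end process;
end architecture;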
The block diagram for the multiplier unit is as shown below in figure 5.
The internal details of the multiplier are not revealed in this paper. The block mainly contains a combinational logic circuit which allows only selected inputs to reach the adder. The adder we use here is a custom-built adder; it is capable of adding all n inputs at one instant and giving the result. Unlike conventional adders, where we need levels of adders to add n numbers, our adder is a single logic block that takes all the inputs at one instant. For the proper working of this adder we need a special arrangement of the bits in the input numbers to the adder, and this is taken care of by the 'logic circuit for bit shuffle' block.
The above circuit is capable of multiplying two n-bit numbers in a single clock cycle. The rising edge of the clock loads the two numbers to be multiplied into the registers A and B, while the falling edge of the clock loads the output into the output register. The ON period of the clock is made equal to the delay time of the entire multiplier block. From the above description we observe that the multiplication is computed only during the ON time of the clock input; the OFF time accounts for the delay in the subsequent adder and counter block. This arrangement allows the circuit to compute two different pieces of work in a single clock cycle.
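The clocking scheme described above can be sketched in VHDL as follows. The 16-bit width, the entity name and the use of a plain behavioural product in place of the shift-and-add core are assumptions made only for illustration.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch of the multiplier-unit clocking: operands captured on the rising
-- edge, product captured in the output register on the falling edge.
entity mult_unit_timing is
  port (
    clk   : in  std_logic;
    a_in  : in  unsigned(15 downto 0);
    b_in  : in  unsigned(15 downto 0);
    p_out : out unsigned(31 downto 0)
  );
end entity;

architecture rtl of mult_unit_timing is
  signal reg_a, reg_b : unsigned(15 downto 0);
  signal product      : unsigned(31 downto 0);
begin
  -- combinational multiplier core (stands in for the shift-and-add block)
  product <= reg_a * reg_b;

  -- rising edge: load the two operands into registers A and B
  process (clk)
  begin
    if rising_edge(clk) then
      reg_a <= a_in;
      reg_b <= b_in;
    end if;
  end process;

  -- falling edge: the settled product is loaded into the output register
  process (clk)
  begin
    if falling_edge(clk) then
      p_out <= product;
    end if;
  end process;
end architecture;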
Counters and Adders:
Figure 4 shows the arrangement of the multiplier blocks. The multipliers give their outputs at the end of the ON time of the current clock cycle; these outputs have to be added to get a resultant element. For handling a 4X4 matrix the multiplier outputs have to be added column-wise (addition of M11, M21, M31, M41), and we get one resultant element at the end of the addition.
The unit that handles this addition is as shown in figure 6. The above unit is only a single unit out of m such independent units; the value of m depends on the number of columns of multipliers in figure 4.
In the multiplication of two NXN matrices the evaluation of each resultant element requires N multiplications and N-1 additions. If we were to handle matrices with a maximum order of 4X4, the simple four adders would suffice for the correct evaluation of the resultant elements; but if the order is greater than 4X4, we have to account for the N multiplications (N>4) that go into the evaluation of each element. This is achieved by setting up the counter with a count of N/4, so that on every evaluation the partial sum is stored into the register shown in figure 6. The counter is decremented every clock cycle, and if its value is not zero the enable signal for the feedback buffer is high; hence the partial sum is added along with the fresh set of inputs. If the counter decrements to zero, then all the N multiplications have been taken into account, and hence the output buffer is enabled while the feedback is disabled; for the next clock cycle both buffers are disabled so that the partial sum of the next element is loaded into the register. We have taken a separate counter in this block for simplicity of understanding; the control of the buffers can easily be done by the control circuit itself.
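The behaviour of one such adder-and-counter unit can be sketched as follows. The port names, widths, the 'start' input and the way the counter is reloaded are assumptions made to keep the sketch self-contained; as noted above, in the real design this control can be folded into the main control circuit.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative accumulator: a counter initialised to N/4 decides whether the
-- running partial sum is fed back or released to the output buffer.
entity adder_counter_unit is
  generic (COUNT_INIT : positive := 2);       -- N/4 for NxN input matrices
  port (
    clk      : in  std_logic;
    start    : in  std_logic;                 -- begin accumulating a new element
    products : in  unsigned(33 downto 0);     -- column-wise sum of the four products
    result   : out unsigned(33 downto 0);
    valid    : out std_logic                  -- output buffer enabled this cycle
  );
end entity;

architecture rtl of adder_counter_unit is
  signal partial : unsigned(33 downto 0) := (others => '0');
  signal count   : natural range 0 to COUNT_INIT := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      valid <= '0';
      if start = '1' and count = 0 then
        count   <= COUNT_INIT;                -- set up the counter with N/4
        partial <= (others => '0');
      elsif count /= 0 then
        count <= count - 1;
        if count = 1 then
          -- counter reaches zero: all N products accounted for, so the
          -- feedback path is disabled and the output buffer is enabled
          result  <= partial + products;
          valid   <= '1';
          partial <= (others => '0');
        else
          -- counter not yet zero: feedback buffer enabled, the partial sum
          -- is added along with the fresh set of inputs
          partial <= partial + products;
        end if;
      end if;
    end if;
  end process;
end architecture;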
In the above multiplier circuit, for the evaluation of one resultant element we identify the necessary row and column and access the elements from only that row and column; hence we proceed with element-by-element evaluation of the resultant matrix.
3. Scalability:
One of the features of the matrix multiplier circuit presented in this paper is that the circuit is highly scalable to meet the requirements of the application for which it is being used. Scalability can be achieved on two fronts:
1. Scalability on the number of bits.
2. Scalability on the size of the matrix that can be handled.
Scalability on the number of bits:
The representation of the numbers can use a variable number of bits; this depends on the application in which the matrix multiplier is being used.
If the number of bits varies, the necessary hardware changes to be incorporated are the following (a sketch of how this can be parameterized follows the list):
1. The number of multiplexers in the multiplier block shown in figure 5 varies with the number of bits used for the representation of the input numbers.
2. The number of bits in registers A and B and the number of bits in the adder shown in figure 5 change. The adder, which is custom built, is capable of handling these issues.
3. All the design issues connected to variable bit sizes are restricted to the design of the multiplier unit.
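As a sketch of how the bit-width dependence can be isolated, the multiplier unit could be written with a VHDL generic so that changing a single value resizes registers A and B, the adder and the output together. The entity below is hypothetical and uses a plain behavioural product in place of the custom shift-and-add core and adder.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical parameterized multiplier unit: the operand width is a generic,
-- so the operand registers and the output all resize together.
entity multiplier_unit is
  generic (WIDTH : positive := 16);             -- bits per operand
  port (
    clk   : in  std_logic;
    a_in  : in  unsigned(WIDTH-1 downto 0);
    b_in  : in  unsigned(WIDTH-1 downto 0);
    p_out : out unsigned(2*WIDTH-1 downto 0)
  );
end entity;

architecture rtl of multiplier_unit is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      p_out <= a_in * b_in;                     -- width follows the generic
    end if;
  end process;
end architecture;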
Scalability on the size of the matrix that can be handled:
The size of the matrix that can be handled depends on the number of multiplier units that are incorporated in the block shown in figure 4. As the number of multiplier units increases, the number of inputs to the adder in the 'adder and counter' block increases. The adder is capable of taking and adding all N input numbers in one shot; hence, with a small increase in the adder unit, we can scale the matrix-handling capacity of the circuit to greater numbers.
As the circuit becomes capable of handling bigger matrices, the throughput of the system increases and bigger matrices can be multiplied at a much faster rate. But there is a catch: as the handling capacity increases, the hardware and the complexity of the circuit increase. Also, if matrices of lower order are given as inputs to the circuit, zero padding occurs to make the matrix size at least equal to the hardware provided; this results in wastage of hardware and power. Hence the matrix-handling capacity has to be designed according to the application for which the multiplier is used.
4. Experimentation and results:
The multiplier unit of the matrix multiplier described in this paper has been implemented by us. This being the heart of the entire system and the key block in deciding the operating clock frequency, we can derive certain conclusions from the results observed.
The multiplier unit has been designed to multiply two numbers of sixteen bits each. The code for the circuit is written in VHDL using the Xilinx ISE 7.1i tool.
The multiplier unit of the system has been implemented and tested extensively for proper working. A wide range of input combinations has been applied and the outputs have been verified.
5. Further work:
The control circuit has to be implemented, the interconnectivity of all the blocks has to be defined, and the entire circuit has to be tested for proper working.
6. Advantages and Disadvantages:
Advantages:
1. The circuit is simple to implement.
2. The design is easily scalable to meet the requirements.
3. The different blocks can be optimized and changed without affecting the other blocks in the circuit.
4. This is dedicated hardware to multiply two matrices and relieves the main processor.
Disadvantages:
1. The matrix handling capacity has to be
designed carefully.
7. Conclusion:
In this paper we have proposed a circuit capable of fast matrix multiplication. The system uses simple algorithms and is easy to implement.
8. Acknowledgements:
1. Prof. B. L. Desai, HOD, E and C, BVBCET.