BVB-IEEE SAE-INDIA BVB Collegiate Club Manthan ‘06

FAST MATRIX MULTIPLICATION

Nikhil.Ambekar, Rohit.Patil, Rajashekargouda.Patil, Subray.Bhat
nik.ambekar@gmail.com, subraybhat@gmail.com
B.V.B.C.E.T, Hubli.

Abstract

Matrix multiplication is a highly procedure-oriented computation: the standard method leaves essentially one way to multiply two matrices, and it involves a large number of multiplications and additions. The size of the matrix that can be handled depends on the number of multiplier units. A design for matrix multiplication is presented.

1. Introduction:

Matrix multiplication is one of the most fundamental and important problems in science and engineering. Many other important matrix problems can be solved via matrix multiplication, e.g., finding the Nth power, the inverse, the determinant, the eigenvalues, an LU factorization and the characteristic polynomial of a matrix, and the product of a sequence of matrices. Many graph theory problems are also reduced to matrix multiplication. Examples are finding the transitive closure, all-pairs shortest paths, the minimum-weight spanning tree, topological sort and critical paths of a graph. Matrix multiplication also finds application in aeroplane design, graphic calculators, LCD displays, etc. Therefore, fast and processor-efficient parallel algorithms for matrix multiplication are of fundamental importance.

Matrix multiplication is a highly procedure-oriented computation: the standard method leaves essentially one way to multiply two matrices, and it involves a large number of multiplications and additions. The simple part of matrix multiplication is that each element of the resultant matrix can be evaluated independently of the others; this points to parallel computation architectures. Parallel matrix multiplication has been extensively investigated in the past two decades, and many algorithms and approaches have been designed for it.
Some of the approaches that have been developed are: 1D systolic, 2D systolic, Cannon’s algorithm, Fox’s algorithm, Berntsen’s algorithm, the transpose algorithm and the DNS algorithm. Matrix multiplication has also been approached with a view to exploiting the sparsity of the matrices, thereby reducing the number of additions required to multiply them. Many algorithms have been developed for this approach as well. These algorithms and architectures face a problem when the size of the matrices varies: memory sharing becomes an issue, and with it comes an overhead in the memory management code.

In this paper we propose an architecture that is capable of handling matrices of variable sizes. The hardware suggested incorporates a high degree of parallelism to achieve large throughput. This hardware is a dedicated system for matrix multiplication and relieves the main processor of the burden of large matrix multiplications. The design involves a novel approach to multiplying two numbers; the multiplier unit used here is capable of multiplying two numbers in a single clock cycle, which increases the speed of the computation. The system is simple to implement and highly scalable: it can be scaled by simple repetition of the hardware, with no changes to the algorithm.

The rest of the paper is organized as follows: section 2 deals with the implementation of the design, section 3 deals with scalability, section 4 presents the results of the implementation of the multiplier, section 5 lists further work, section 6 looks into the advantages and disadvantages of the system and section 7 concludes the paper.

2. Implementation of the circuit:

The architecture for the matrix multiplier is as shown below. The unit has the following main blocks:
1. Memory blocks
2. Control unit
3. Matrix arrangement of multiplier units
4. Adders and counters

Memory blocks: These are arrays of registers containing the input and output data for the matrix multiplier circuit. The input memory block has the data arranged in the proper order. Writing to this memory block is taken care of by the OS. The output memory block contains the elements of the resultant matrix; reading the result out of this block is also taken care of by the OS. Whenever a fresh set of data is entered into the input memory block, the OS informs the control unit by making the ‘fresh_in’ signal high. The control unit resets the ‘fresh_in’ bit and channels the data appropriately to the multiplier units. Once the last set of data has been channeled from the input memory block, the control unit generates the ‘write_in’ signal to the OS. This signal tells the OS to fill fresh data into the input memory block. Once the output memory block is full of resultant elements, the control unit generates the ‘output_ready’ signal. The OS reads the data out of the output memory block on the occurrence of this signal.

Control unit: It consists of counters, a control register and a logic circuit to channel the data out of the input memory block. The control unit coordinates with the OS for the data transfer to the input memory block and from the output memory block. In the process it generates the handshake signals ‘write_in’ and ‘output_ready’. It contains a couple of registers to hold the order of the matrices. The data in these registers is used to set up counters and hence keep track of the number of data fills into the input memory block and data read-outs from the output memory block. The control unit generates the ‘done’ signal at the end of the complete matrix multiplication. The channeling of data is controlled by specifying appropriate select lines for the multiplexer. The outputs of the multiplexer are hard-wired to the multiplier cells.

Matrix implementation of multiplier units: The multiplier units are arranged in matrix form so as to have a high degree of parallelism in the computation. We know that in matrix multiplication the evaluation of one resultant element can be done independently of the others.
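This independence can be illustrated with a short software sketch. The Python below is only a behavioral illustration, not the hardware described in this paper, and the function names `element` and `multiply_any_order` are hypothetical. It fills in the elements of the product in a shuffled order and still obtains the usual result, because each element depends only on one row of the first matrix and one column of the second.

```python
import random

def element(a, b, i, j):
    """Dot product of row i of a with column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))

def multiply_any_order(a, b, seed=0):
    """Compute a x b, visiting the result positions in a shuffled order."""
    rows, cols = len(a), len(b[0])
    positions = [(i, j) for i in range(rows) for j in range(cols)]
    random.Random(seed).shuffle(positions)  # order of evaluation is arbitrary
    c = [[None] * cols for _ in range(rows)]
    for i, j in positions:
        c[i][j] = element(a, b, i, j)  # no dependence on other result elements
    return c

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(multiply_any_order(A, B))  # [[19, 22], [43, 50]]
```

Since no element waits on another, every call to `element` could in principle run on its own multiplier-and-adder path at the same time.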
Hence we load the input elements so as to compute the resultant elements in parallel. Our design is scalable, i.e., repeating the units in a similar manner results in higher parallelism. The degree of parallelism is decided by the application for which the circuit is being used. If the input matrices are larger than the hardware allocated, the circuit is still capable of handling them. We illustrate the architecture and algorithm for matrix multiplication using hardware capable of handling 4X4 input matrices.

The above figure illustrates two input matrices, each of order 4X4. Here we consider only square matrices for the purpose of explanation. If two rectangular matrices are given as inputs, the hardware checks the condition for matrix multiplication (the number of columns of the first matrix must equal the number of rows of the second) and continues the work. The hardware for multiplying the matrices in figure 3 is given in figure 4. The block diagram shown in figure 4 is capable of multiplying two 4X4 matrices in 4 clock cycles. The first column elements Ax are given as inputs along the first row of the block; similarly, the subsequent columns are given as inputs to the subsequent rows of multipliers. The first column elements of the other matrix are given as inputs to the entire block, with one element as input to each multiplier in a row. On evaluation of the products at the end of the clock cycle we get the partial products of the first column elements, which are added in the ‘adder and counters’ block to get the first column elements of the resultant matrix. We incorporate one level of pipelining by pre-fetching one set of elements from the input memory block while the previous set of inputs is being processed in the multiplier units.

Next we look into the design of the heart of the entire circuit, i.e., the multiplier unit itself. To implement the multiplier unit we use the simple shift-and-add algorithm.
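As a point of reference, shift-and-add multiplication can be modelled in software as below. This is a minimal behavioral sketch in Python, not the authors' multiplier circuit: the names `partial_products` and `shift_and_add` are hypothetical, the operands are assumed to be unsigned, and the default width of 16 bits is chosen only to match the sixteen-bit unit reported in section 4. The list of partial products is summed with an ordinary `sum`, whereas the hardware is described as adding all of them in one custom adder within a single clock cycle.

```python
def partial_products(a, b, n):
    """Return the n partial products of a * b.

    Bit i of b selects either a shifted left by i positions or zero.
    """
    return [(a << i) if (b >> i) & 1 else 0 for i in range(n)]

def shift_and_add(a, b, n=16):
    """Multiply two unsigned n-bit integers by adding shifted copies of a."""
    limit = 1 << n
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must be unsigned n-bit values")
    # Software stand-in for the adder stage: add every partial product.
    return sum(partial_products(a, b, n))

print(partial_products(13, 11, 4))  # [13, 26, 0, 104]
print(shift_and_add(13, 11))        # 143
```

Here 11 is 1011 in binary, so the copies of 13 shifted by 0, 1 and 3 positions are added and the copy for bit 2 is replaced by zero.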
In this algorithm one of the n-bit numbers to be multiplied is shifted successively, once for each of the n bit positions. Each occurrence of a ‘1’ bit in the other number results in the addition of the corresponding shifted version, while each occurrence of a ‘0’ bit results in the addition of zero. The shift-and-add method is used here for the multiplier circuit because of its simplicity and the small amount of hardware it requires.

The block diagram for the multiplier unit is shown below in figure 5. The internal details of the multiplier are not revealed in this paper. The block mainly contains a combinational logic circuit which allows only selected inputs to reach the adder. The adder we use here is custom built. It is capable of adding all n inputs at one instant and giving the result. Unlike conventional adders, where levels of adders are needed to add n numbers, our adder is a single logic block that takes all the inputs at one instant. For this adder to work properly, the bits of its input numbers need a special arrangement, which is taken care of by the ‘logic circuit for bit shuffle’ block.

The above circuit is capable of multiplying two n-bit numbers in a single clock cycle. The rising edge of the clock loads the two numbers to be multiplied into registers A and B, while the falling edge of the clock loads the output into the output register. The ON period of the clock is equal to the delay time of the entire multiplier block. From the above description we observe that the multiplication is computed only during the ON time of the clock input. The OFF time accounts for the delay in the subsequent ‘adder and counter’ block. This arrangement allows the circuit to carry out two different pieces of work in a single clock cycle.

Counters and Adders: Figure 4 shows the arrangement of the multiplier blocks.
The multipliers give the multiplied output at the end of the ON time of the current clock cycle; these outputs have to be added to get the resultant element. For handling a 4X4 matrix the multiplier outputs have to be added column-wise (addition of M11, M21, M31, M41), and we get one resultant element at the end of the addition. The unit that handles this addition is shown in figure 6.

The above unit is only a single unit out of m such independent units. The value of m depends on the number of columns of multipliers in figure 4. In the multiplication of two NXN matrices the evaluation of each resultant element requires N multiplications and N-1 additions. If we were to handle matrices with a maximum order of 4X4, the simple four adders would suffice for the correct evaluation of the resultant elements, but if the order is greater than 4X4 then we have to account for the N multiplications (N>4) that go into the evaluation of each element. This is achieved by setting up the counter with a count of N/4, so that on every evaluation the partial sum is stored in the register shown in figure 6. The counter is decremented every clock cycle, and if its value is not zero then the enable signal for the feedback buffer is high, hence the partial sum is added along with the fresh set of inputs. If the counter decrements to zero then all the N multiplications have been taken into account, and hence the output buffer is enabled while the feedback is disabled; for the next clock cycle both buffers are disabled so that the partial sum of the next element is loaded into the register. We have taken a separate counter in this block for simplicity of understanding; the control of the buffers can easily be done by the control circuit itself.

In the above multiplier circuit, for the evaluation of one resultant element we identify the necessary row and column and access the elements from that row and column only; hence we go with element-by-element evaluation of the resultant matrix.

3. Scalability:

One of the features of the matrix multiplier circuit presented in this paper is that it is highly scalable to meet the requirements of the application for which it is being used. Scaling can be done on two fronts:
1. Scalability on the number of bits.
2. Scalability on the size of the matrix that can be handled.

Scalability on the number of bits: The numbers can be represented with variable bit sizes. This depends on the application where the matrix multiplier is being used. If the number of bits varies, the necessary hardware changes are:
1. The number of multiplexers in the multiplier block shown in figure 5 varies with the number of bits used to represent the input numbers.
2. The number of bits in registers A and B and the number of bits in the adder shown in figure 5 change.
3. The adder, which is custom built, is capable of handling these issues.
All the design issues connected to variable bit sizes are restricted to the design of the multiplier unit.

Scalability on the size of the matrix that can be handled: The size of the matrix that can be handled depends on the number of multiplier units incorporated in the block shown in figure 4. As the number of multiplier units increases, the number of inputs to the adder in the ‘adder and counter’ block increases. The adder is capable of taking and adding all N input numbers at one shot; hence with a small increase in the adder unit we can scale the matrix handling capacity of the circuit to greater sizes.
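The counter-based accumulation described under ‘Counters and Adders’, and the effect of widening the adder, can be sketched in software as follows. This is a rough behavioral model under stated assumptions, not the circuit of figure 6: the name `accumulate_element` and the parameter `width` (the number of products the adder takes per clock cycle) are hypothetical, the count N/4 is rounded up when N is not a multiple of the width, and the last group of inputs is padded with zeros.

```python
import math

def accumulate_element(row, col, width=4):
    """Model one adder-and-counter unit computing a single result element.

    row, col: equal-length lists (one row of A and one column of B).
    width:    products added per clock cycle (assumed to be 4 by default).
    Returns (result, cycles).
    """
    n = len(row)
    if n != len(col):
        raise ValueError("row and column lengths must match")
    counter = math.ceil(n / width)  # count N/width, rounded up (assumption)
    register = 0                    # partial-sum register
    cycles = 0
    for start in range(0, n, width):
        # Pad the last group with zeros so the adder always sees `width` inputs.
        a_part = (row[start:start + width] + [0] * width)[:width]
        b_part = (col[start:start + width] + [0] * width)[:width]
        products = [x * y for x, y in zip(a_part, b_part)]
        register += sum(products)   # feedback: stored partial sum plus new products
        counter -= 1
        cycles += 1
    assert counter == 0             # counter at zero: the element is complete
    return register, cycles

print(accumulate_element([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1]))           # (21, 2)
print(accumulate_element([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1], width=8))  # (21, 1)
```

In this model a wider adder reduces the number of cycles per element, while inputs shorter than the width leave some adder inputs fed with zeros.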
As the circuit becomes capable of handling bigger matrices, the throughput of the system increases and bigger matrices can be multiplied at a much faster rate. But there is a catch: as the handling capacity increases, the hardware and the complexity of the circuit increase. Also, if many matrices of lower order are given as inputs to the circuit, zero padding is applied to make the matrix size equal to that of the hardware provided; this results in wastage of hardware and power. Hence the matrix handling capacity has to be designed according to the application for which the multiplier is used.

4. Experimentation and results:

The multiplier unit of the matrix multiplier described in this paper has been implemented by us. Since this unit is the heart of the entire system and the key block in deciding the operating clock frequency, we can derive certain conclusions from the results observed. The multiplier unit has been designed to multiply two numbers of sixteen bits each. The code for the circuit is written in VHDL using the tool XILINX ISE 7.1i. The multiplier unit has been implemented and tested extensively for proper working. A wide range of input combinations has been applied and the outputs have been verified.

5. Further work:

The control circuit has to be implemented. The interconnectivity of all the blocks has to be defined and the entire circuit has to be tested for proper working.

6. Advantages and Disadvantages:

Advantages:
1. The circuit is simple to implement.
2. The design is easily scalable to meet the requirements.
3. The different blocks can be optimized and changed without affecting the other blocks in the circuit.
4. This is dedicated hardware for multiplying two matrices and relieves the main processor.

Disadvantages:
1. The matrix handling capacity has to be designed carefully.

7. Conclusion:

In this paper we have proposed a circuit capable of doing fast matrix multiplication.
The system uses simple algorithms and is easy to implement.

8. Acknowledgements:

1. Prof. B.L. Desai, HOD E and C, BVBCET.