Hindawi Publishing Corporation
Mathematical Problems in Engineering
Volume 2010, Article ID 268093, 17 pages
doi:10.1155/2010/268093

Research Article
A Novel Parallel Algorithm Based on the Gram-Schmidt Method for Tridiagonal Linear Systems of Equations

Seyed Roholah Ghodsi and Mohammad Taeibi-Rahni
Mechanical and Aerospace Engineering Department, Science and Research Branch, Islamic Azad University (IAU), Tehran 1477-893855, Iran

Correspondence should be addressed to Seyed Roholah Ghodsi, roholahghodsi@gmail.com

Received 8 June 2010; Revised 19 September 2010; Accepted 6 December 2010

Academic Editor: David Chelidze

Copyright © 2010 S. R. Ghodsi and M. Taeibi-Rahni. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This paper introduces a new parallel algorithm based on the Gram-Schmidt orthogonalization method. The algorithm finds almost exact solutions of tridiagonal linear systems of equations in an efficient way. The system of equations is partitioned in proportion to the number of processors, and each partition is solved by a processor with a minimum request for data from the other partitions. The considerable reduction in data communication between processors yields a substantial speedup. The coupling between partitions nearly disappears when a few columns are switched; hence, the speed of computation increases and the computational cost decreases. The obtained results show that the suggested algorithm is considerably scalable. In addition, this method of partitioning can significantly decrease the computational cost on a single processor and makes it possible to solve larger systems of equations. To evaluate the performance of the parallel algorithm, speedup and efficiency are reported. The results reveal that the proposed algorithm is practical and efficient.

1. Introduction

Linear systems of equations occur frequently in many fields of engineering and mathematics. Different approaches have been used to solve linear systems of equations [1], for example, QR factorization. There are several QR factorization methods, such as Givens transformations, Householder transformations, and Gram-Schmidt orthogonalization. A considerable amount of work has been done on parallel QR factorization algorithms [2, 3]. In addition, some numerical libraries and software packages supply QR factorization, such as ScaLAPACK [4] and the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) [5]. The PLASMA tile QR factorization is provided for multicore architectures using block algorithms [6].

The tridiagonal system is a well-known form that arises in areas such as linear algebra, partial differential equations, orthogonal polynomials, and signal processing. The problem of solving tridiagonal systems on parallel computers has been studied widely. It seems that the first parallel tridiagonal system solver, referred to as cyclic reduction, was introduced by Hockney in 1965 [7]. Throughout the more than forty years since then, different parallel algorithms have been proposed and developed. The recursive doubling algorithm was presented by Stone for fine-grained parallel systems in 1973 [8]. Wang suggested a partitioning algorithm for coarse-grained systems in 1981 [9].
A classical approach is to compute the inverse of the coefficient matrix; recently, an explicit formula for the inverse of a tridiagonal matrix was given by Kilic [10].

In this paper, we consider a novel parallel algorithm based on the modified Gram-Schmidt orthogonalization method [11]. This method is a kind of QR factorization, in which orthogonal and upper-triangular matrices are first computed and then the solution of the system of equations is found using these matrices. Amodio et al. proposed a parallel factorization for tridiagonal matrices [12, 13]. On the other hand, parallel implementations of the modified Gram-Schmidt algorithm have been studied in the literature [14–17]. Although the proposed method is currently developed and verified for tridiagonal systems, the authors are working on its extension to other systems. The scalability of parallel algorithms is a vital issue, so different techniques have been developed to address it, such as processor load balancing [18] and reducing the number of synchronizations [19]. The obtained results show that the suggested parallel algorithm is practically scalable. Therefore, the objective of this paper is to propose an efficient parallel algorithm. In Section 2, the QR factorization is introduced as the basis of this algorithm. In Section 3, the procedure of the suggested parallel algorithm is described in detail. A numerical example is given in Section 4. In Section 5, the accuracy, speedup, and efficiency of the algorithm are evaluated. Finally, conclusions are given in Section 6.

2. QR Factorization

A QR factorization of a nonsingular matrix $A \in \mathbb{R}^{n \times n}$ is a decomposition of the matrix into an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ (i.e., $Q^T Q = I$) and an upper-triangular matrix $R \in \mathbb{R}^{n \times n}$ as

$$ A = QR, \qquad (2.1) $$

where the decomposition is unique. One of the most well-known QR methods is the classical Gram-Schmidt (CGS) process. To overcome numerical instability, the modified Gram-Schmidt (MGS) algorithm, obtained after a simple change in the classical algorithm, is preferred [20]. Each column of the orthogonal matrix can be normalized as

$$ q^k = \frac{u^k}{\|u^k\|}, \qquad (2.2) $$

where $q^k$ is a column of the orthogonal matrix and $1 \le k \le n$. Column $u^k$ is determined from the previously computed orthogonal columns $q^i$, where $1 \le i \le k-1$, as

$$ u^k = a^k - \sum_{i=1}^{k-1} \operatorname{proj}_{q^i} a^k, \qquad (2.3) $$

where $\operatorname{proj}_{q^i} a^k$ denotes the projection of the column $a^k$ onto the column $q^i$,

$$ \operatorname{proj}_{q^i} a^k = \frac{\langle a^k, q^i \rangle}{\langle q^i, q^i \rangle}\, q^i, \qquad (2.4) $$

and $\langle a^k, q^i \rangle$ denotes the inner product of the columns $a^k$ and $q^i$. The loss of orthogonality depends on the matrix $A$, and the deviation from orthogonality is minimized when $A$ is a well-conditioned matrix [21]. Additionally, the upper-triangular matrix is constructed column by column as

$$ r_{ik} = \langle u^k, q^i \rangle, \qquad (2.5) $$

where $r_{ik}$ is a typical element of the upper-triangular matrix and $1 \le i \le n$.

Furthermore, the QR factorization can be used as part of the solution of a linear system of equations. To obtain the solution of the linear system

$$ AX = B, \qquad X \in \mathbb{R}^n, \; B \in \mathbb{R}^n, \qquad (2.6) $$

we consider the QR factorization

$$ QRX = B. \qquad (2.7) $$

Then we multiply both sides of (2.7) by the transpose of $Q$, where $Q^T Q = I$ and $Q^T B = B'$. Finally, a linear system with an upper-triangular coefficient matrix is obtained,

$$ R'X = B', \qquad (2.8) $$

where $R' = R$. The final step is computing the matrix $X$ by the usual back substitution.

3. Parallel Algorithm

The concept of the novel parallel algorithm based on the Gram-Schmidt method is described here.
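For concreteness, the following NumPy sketch implements the modified Gram-Schmidt factorization of (2.2)–(2.5) and the solution step of (2.6)–(2.8). It is a minimal sequential illustration, not the authors' MPI implementation, and the function names and test matrix are ours.

```python
import numpy as np

def mgs_qr(A):
    """Modified Gram-Schmidt QR factorization of a nonsingular matrix (sketch)."""
    n = A.shape[1]
    Q = A.astype(float).copy()
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(Q[:, k])          # normalization, cf. (2.2)
        Q[:, k] /= R[k, k]
        for j in range(k + 1, n):
            R[k, j] = Q[:, k] @ Q[:, j]             # element of R, cf. (2.5)
            Q[:, j] -= R[k, j] * Q[:, k]            # remove the projection, cf. (2.3)-(2.4)
    return Q, R

def solve_qr(A, b):
    """Solve AX = B via R X = Q^T B, cf. (2.7)-(2.8), with back substitution."""
    Q, R = mgs_qr(A)
    bp = Q.T @ b
    x = np.zeros_like(bp)
    for i in range(len(bp) - 1, -1, -1):
        x[i] = (bp[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

# quick check on a small random tridiagonal system
n = 8
rng = np.random.default_rng(0)
A = (np.diag(rng.uniform(2, 3, n))
     + np.diag(rng.uniform(0, 1, n - 1), 1)
     + np.diag(rng.uniform(0, 1, n - 1), -1))
b = rng.uniform(size=n)
print(np.allclose(A @ solve_qr(A, b), b))
```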
The parallel implementation of the Gram-Schmidt method is not efficient compared with other methods, particularly when it is employed to solve a tridiagonal system of equations. However, some modifications create the conditions needed to parallelize the Gram-Schmidt method. In the Gram-Schmidt method, the computation of each column of the orthogonal and upper-triangular matrices depends on the previously computed columns of the orthogonal matrix.

Figure 1: The zero and nonzero terms in the equation of each orthogonal column (the expansions of $q^{(1)}$ through $q^{(6)}$), where $q^{(m)}$ is the $m$th column of the orthogonal matrix and $a^{(m)}$ is the $m$th column of matrix $A$; the zero terms are crossed out.

Most of the elements of a tridiagonal matrix are zero, except for those on its three diagonals. Each term of the projection operator in (2.3) consists of the inner product of one column of this tridiagonal matrix with a computed column of the orthogonal matrix. When most of the elements of one of these columns are zero, only a few terms of the projection operator can be nonzero. As a result, only the last two terms of the projection operator in the computation of each column of the orthogonal matrix are nonzero, that is,

$$ r_{ik} \neq 0 \quad \text{only for} \quad i = k-2, \; k-1. \qquad (3.1) $$

Figure 1 shows the terms of (2.3) for the first six columns of the orthogonal matrix $Q$. As a result, the computation of each column of the orthogonal and upper-triangular matrices depends only on the two previous adjacent columns of the orthogonal matrix $Q$. This means that, when column $k$ of the orthogonal matrix is being computed, only the two computed columns $k-1$ and $k-2$ of matrix $Q$ and column $k$ of matrix $A$ are required, as

$$ u^k = a^k - \operatorname{proj}_{q^{k-2}} a^k - \operatorname{proj}_{q^{k-1}} a^k. \qquad (3.2) $$

This property is the basis of our parallel algorithm. In this situation, if we omit two adjacent columns of matrix $A$, then all terms of the projection operator for the subsequent columns are zero.

Figure 2: The partitioning (a) and transformation (b) of the matrix's columns. The number of partitions can be related to the number of processors. The last partition is made of the columns transferred from between the other blocks.

This means that the coupling between the columns of the orthogonal matrix in (2.3) is disrupted. To exploit this feature, if two columns of matrix $A_{n \times n}$, for instance $a^{k-2}$ and $a^{k-1}$, are transferred to the end of the matrix, the corresponding subsequent columns of the orthogonal matrix are disconnected from the previous ones. That is, $\langle a^k, q^{k-2} \rangle = 0$ and $\langle a^k, q^{k-1} \rangle = 0$, and as a result,

$$ u^k = a^k. \qquad (3.3) $$

This transformation decreases the amount of computation significantly.
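The structural claims behind (3.1)–(3.3) can be checked numerically. The short sketch below is our own illustration with an arbitrary tridiagonal test matrix: it verifies that only the projections onto $q^{k-2}$ and $q^{k-1}$ are nonzero, and that a column loses all its projection terms once the two columns immediately to its left are moved to the end of the matrix.

```python
import numpy as np

# illustrative tridiagonal test matrix (not from the paper)
n = 12
rng = np.random.default_rng(0)
A = (np.diag(rng.uniform(2, 3, n))
     + np.diag(rng.uniform(0, 1, n - 1), 1)
     + np.diag(rng.uniform(0, 1, n - 1), -1))

Q = np.zeros((n, n))
for k in range(n):
    u = A[:, k].copy()
    # projection coefficients onto all previously computed columns, cf. (2.3)-(2.4)
    coeffs = [(A[:, k] @ Q[:, i]) / (Q[:, i] @ Q[:, i]) for i in range(k)]
    for i, c in enumerate(coeffs):
        u -= c * Q[:, i]
    Q[:, k] = u
    # only the terms for i = k-2 and i = k-1 survive, cf. (3.1)-(3.2)
    assert all(c == 0.0 for i, c in enumerate(coeffs) if i < k - 2)

# Moving columns a^{k-2} and a^{k-1} to the end of A leaves column a^k with no
# overlapping left neighbours, so every projection term vanishes and u^k = a^k (3.3).
k = 6
print(np.allclose(A[:, :k - 2].T @ A[:, k], 0.0))   # a^k is orthogonal to a^1..a^{k-3}
```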
The parallel algorithm is obtained by partitioning the matrix $A_{n \times n}$ into $m-1$ blocks of fairly equal size, where $m$ is the number of processors; that is, each block contains approximately $N_e = n/(m-1)$ columns, and each processor is concerned with computing only its own block. The computation can be done mostly independently by each processor, but at some points communication is required. The index of the first column owned by processor $e$ is denoted $j_e$, where $j_e < j_{e+1}$ for $e = 1, \ldots, m-1$ and $N_e = j_{e+1} - j_e$. To disconnect the relationship between adjacent blocks, the two columns between them, that is, columns $j_{e+1}-1$ and $j_{e+1}$ for $e = 1, \ldots, m-2$, are transferred to the end of the matrix as in (3.4). These transferred columns make up another partition, as shown in Figure 2. Therefore, the number of columns of this new block is $H = 2(m-2)$:

$$ j_{m+e} - 1 \longleftarrow j_{e+1} - 1, \qquad j_{m+e} \longleftarrow j_{e+1}, \qquad e = 1, \ldots, m-2. \qquad (3.4) $$

It should be noted that the position of each element of matrix $X$ corresponding to the transferred columns of matrix $A$ changes as well.

Each processor, except the last one, can start computing the columns of the orthogonal and upper-triangular matrices related to the columns of its own block without receiving any data from the other processors. This feature plays a significant role in improving the speedup of the algorithm. If processor $P_e$ is responsible for a block consisting of columns $a^{j_e}$ to $a^{j_{e+1}-1}$, it computes the columns $q^{j_e}$ to $q^{j_{e+1}-1}$ of the orthogonal matrix with a minimum request for data from the other partitions, and the same holds for columns $r^{j_e}$ to $r^{j_{e+1}-1}$ of the upper-triangular matrix. All the blocks, except the last one, only need data from the last block, because the first two and last two columns of each block are related to some columns of the last block.

To achieve more speedup, another switch is used in each block, except for the last block. The last two columns of each block are moved to lie between the second and third columns of the block. This switch causes the first four columns to be computed first and sent to the last processor before the other columns of the block are determined. As a result, all processors are in use simultaneously, and the efficiency of the parallel algorithm increases. Lastly, all columns of the orthogonal and upper-triangular matrices are computed with the least network traffic. To assemble the final result, each processor sends its block to the last processor.

The next step is to find the solution of the linear system of equations. According to (2.7), the values of $Q^T Q$ and $Q^T B$ should be determined. Because the data of each block of matrix $Q$ and also of matrix $B$ reside on each processor, this computation can be done without any request for the other processors' data. Finally, the unknown matrix is obtained by a back substitution procedure. In this step, each processor only needs to communicate with the last processor. The back substitution starts from the last processor, whose block consists of $H$ columns. The last $H$ elements of the unknown matrix are computed using only the last processor and are sent to the other ones. Because these $H$ elements of matrix $X$ are known, the last processor can independently compute the last $H$ terms of each linear equation. As a result, instead of $H$ columns, only one column needs to be broadcast to the other processors. The other processors then compute the related unknown elements with the received data. The complete flowchart of the described parallel algorithm is shown in Figure 3.
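As an illustration of this bookkeeping, the sketch below builds the column ordering implied by the partitioning and by (3.4). The 0-based indexing and the helper name are our own, and the additional within-block switch of the last two columns is only noted in a comment.

```python
def partition_permutation(n, m):
    """Column order after partitioning an n-column matrix for m processors (sketch).

    Blocks 1..m-1 hold roughly n/(m-1) columns each; the two columns at every
    interior block boundary are moved to the end and form the last block of
    H = 2*(m-2) columns, cf. (3.4).  Indices here are 0-based, unlike the text.
    """
    j = [round(e * n / (m - 1)) for e in range(m - 1)] + [n]   # block start indices
    transferred = []
    for e in range(1, m - 1):                                  # interior boundaries only
        transferred += [j[e] - 1, j[e]]
    kept = [c for c in range(n) if c not in transferred]
    # (the further switch that moves the last two columns of each block between
    #  its second and third columns is omitted here for brevity)
    return kept + transferred

perm = partition_permutation(n=20, m=4)
print(perm)          # the last 2*(m-2) = 4 entries are the transferred columns
# Ap = A[:, perm] would then be factorized block by block; the entries of X
# corresponding to the transferred columns are permuted in the same way.
```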
It may seem that the amount of computation on the last processor is greater than on the others. However, the number of columns in the last block is smaller than in the other blocks, and as the number of unknowns increases, the difference between the size of the last block and that of the other blocks becomes greater.

4. Example

Let $A$ be a $15 \times 15$ nonsingular tridiagonal matrix. The results of three cases are presented and compared here; case (a) uses one processor, case (b) two, and case (c) three. In Figure 4, the transfer of two columns in case (b) (two processors) and of four columns in case (c) (three processors) is shown. The computed orthogonal matrices for the three cases are presented in Figure 5. The number of nonzero elements is 135 in case (a), 97 in case (b), and 93 in case (c). Hence, the computational cost decreases as the computational power (i.e., the number of processors) increases. This situation is an ideal condition for a parallel algorithm. Finally, the upper-triangular matrices are shown in Figure 6. Notice that, in cases (b) and (c), the elements in the last row of each block are zero, except for those belonging to the last processor and one element equal to 1. Therefore, the back substitution procedure in each block requires data only from the last processor, not from the other processors. This feature is also significant for solving huge systems of equations.

5. Results

The testbed specification is given in Table 1. Results using up to 16 processors are presented here. The numerical experiments were done on a parallel system with an MIMD (Multiple Instruction, Multiple Data) structure and shared memory [22].

Figure 3: The flowchart of the proposed parallel algorithm (one panel per processor: read the block of $A$ and the corresponding elements of $B$, reorder the block's columns, compute the first columns of $Q$ and $R$ and exchange them with the last processor, compute the remaining columns, compute $Q^T B$, and finally compute and broadcast the corresponding elements of the unknown matrix $X$).

The parallel-programming interface used in this paper is standard MPI [23]. The results of the procedure described above have several aspects. The main purpose was to find the solution of a tridiagonal system of equations in a fast and efficient manner. The other achievements were the orthogonal and upper-triangular matrices, which are useful in different fields.
Figure 4: The matrix $A_{15 \times 15}$; (a) one processor, (b) two processors, and (c) three processors.

Table 1: Specification of the test bed.
CPU: up to 16× Intel Core 2 Duo, 3.0 GHz; RAM: 4 GB; Cache: 4 MB; Network: 100 Mb/s; OS: Windows XP; Language: MPI.

This entire procedure in sequential form requires $O(n^2)$ operations, where $n$ is the number of equations. The complexity of the parallel algorithm is

$$ 9H^2 + 26H + 32n + 130, \qquad (5.1) $$

where $H = n/p$ is the ratio of the number of equations to the number of processors. Furthermore, as mentioned in Section 3, the number of zero elements rises as the number of blocks increases. This feature can be regarded as a step toward linear complexity.
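To see how the count in (5.1) behaves, one can simply evaluate it for a fixed problem size and a growing number of processors; the sketch below is our own illustration, and the sequential reference value is taken as roughly $n^2$, in line with the $O(n^2)$ statement above.

```python
def parallel_ops(n, p):
    """Operation count of the parallel algorithm from (5.1), with H = n/p."""
    H = n / p
    return 9 * H**2 + 26 * H + 32 * n + 130

n = 50_000
for p in (2, 4, 8, 16):
    print(f"p = {p:2d}: {parallel_ops(n, p):16,.0f} operations "
          f"(sequential reference ~ n^2 = {n**2:,})")
```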
Figure 5: The orthogonal matrices; (a) one processor, (b) two processors, and (c) three processors.

Choosing a proper method to solve a tridiagonal system of equations depends on different criteria, such as the order of complexity, the accuracy, and the variety of results obtained. For instance, the order of complexity of Gaussian elimination without pivoting is $O(n)$, but it is not well suited to parallel processing [24]. On the other hand, in spite of the favorable order of complexity of parallel iterative algorithms, their focus is solely on a sufficiently accurate solution of the system, without any further outcomes.
Figure 6: The upper-triangular matrices; (a) one processor, (b) two processors, and (c) three processors.

The complexities of different QR factorization methods in sequential form are as follows [25]:

(1) Gram-Schmidt: $(2/3)n^3 + O(n^2)$,
(2) Householder: $(4/3)n^3 + O(n^2)$,
(3) Givens: $(8/3)n^3 + O(n^2)$,
(4) Fast Givens: $(4/3)n^3 + O(n^2)$.

Table 2: A norm of $X$ to evaluate the error over all computed unknowns and the norm of $Q^T Q - I$ to determine the accuracy of the orthogonal matrix.

Order:              n = 200 000   n = 300 000   n = 400 000   n = 500 000
$\|X\|$:            6.91e-7       4.48e-8       8.78e-8       3.01e-9
$\|Q^T Q - I\|$:    9.39e-12      8.62e-12      9.32e-12      8.52e-12

There are some important points in favour of our algorithm. The first is the accuracy of the obtained results. Obviously, the accuracy of direct solvers, which are mostly exact, is better than that of iterative ones. The second is the variety of results, such as the computed orthogonal and upper-triangular matrices; the QR decomposition is commonly used to solve the linear least squares problem and to find the eigenvalues of a system. The third is the scalability of the algorithm, which is a crucial parameter for the success of a parallel algorithm; that is, the proposed algorithm can employ more processors to solve larger problems efficiently. Consequently, the development of this efficient algorithm based on the QR factorization is valuable and practical. Some sequential and parallel QR factorization methods are investigated by Demmel et al. [26].

The obtained results are mostly exact. The accuracy of the algorithm was investigated using two measures. The first was a norm of $X$ to evaluate the error over all computed unknowns, and the second was the norm of $Q^T Q - I$ to determine the accuracy of the orthogonal matrix. The level of error for systems of different orders is shown in Table 2.
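Both accuracy measures are straightforward to reproduce for any QR factorization. The snippet below is our own check, with a manufactured solution and np.linalg.qr standing in for the Gram-Schmidt routine; it computes the error norm of the solution and $\|Q^T Q - I\|$.

```python
import numpy as np

n = 1000
rng = np.random.default_rng(1)
A = (np.diag(rng.uniform(2, 3, n))
     + np.diag(rng.uniform(0, 1, n - 1), 1)
     + np.diag(rng.uniform(0, 1, n - 1), -1))
x_exact = rng.uniform(size=n)              # manufactured solution
b = A @ x_exact

Q, R = np.linalg.qr(A)                     # stand-in for the MGS factorization
x = np.linalg.solve(R, Q.T @ b)            # R x = Q^T b

print("error over the unknowns ||x - x_exact|| =", np.linalg.norm(x - x_exact))
print("orthogonality           ||Q^T Q - I||   =", np.linalg.norm(Q.T @ Q - np.eye(n)))
```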
The order of the error in these cases is acceptable. The final results, that is, the computed unknowns, became more accurate for larger matrices. Moreover, the level of orthogonality is high in all cases.

Furthermore, this algorithm has a new and interesting parallel feature: it can decrease the computational cost and the requested memory on a single processor. MPI can be used on a single processor to divide the work into different parts. In other words, the parallel algorithm based on the MPI library can be executed on different parts of one processor in the same way as it is executed on a network of parallel CPUs. A very large number of equations causes the processor to return an "insufficient virtual memory" error. Increasing the number of blocks in the matrix increases the number of zero elements and decreases the size of the requested memory (physical and virtual). In Figure 7, the size of the requested memory for matrices of order n = 20 000 and n = 50 000 is shown. For instance, the computation for a matrix of order n = 50 000 was not possible on the specified processor with a single block. However, the computation was done on this processor with 4 blocks using 4.75 GB of memory, and when the number of blocks was 16, the memory used decreased to 1.48 GB. The time consumed to find the solution of systems of equations of order n = 20 000 and n = 50 000 is presented in Figure 8: when the number of blocks increased, the time consumption was reduced, and the rate of reduction nearly vanished when the number of blocks reached 16.

Figure 7: The size of the requested memory in computation by a single CPU (memory usage in GB versus the number of blocks, for n = 20 000 and n = 50 000).

The key issue in the parallel processing of a single application is speedup, especially its dependence on the number of processors used. Speedup is defined as the factor by which the execution time of the application changes:

$$ S_p = \frac{T_s}{T_p}, \qquad (5.2) $$

where $p$, $T_s$, and $T_p$ are the number of processors, the execution time of the sequential algorithm, and the execution time of the parallel algorithm with $p$ processors, respectively. On the other hand, the efficiency of parallel processing is a performance metric defined as

$$ E_p = \frac{S_p}{p}. \qquad (5.3) $$

It is a value, typically between zero and one, estimating how well the processors are utilized in solving the problem, compared with how much effort is wasted in communication and synchronization. To investigate the capability of the parallel algorithm, the CPU time $T$, the speedup $S_p$, and the efficiency $E_p$ were considered in different cases (Table 3). The mentioned orders are 20 000, 40 000, 60 000, 80 000, and 100 000. The first column shows the computational time in the sequential case. A critical problem in solving a large system of equations on one processor, as mentioned before (e.g., order greater than 50 000 in this study), is insufficient memory. To overcome this obstacle, the virtual memory was increased to 10 GB. Although using virtual memory removes the "insufficient memory" error, it slows the computation down, leading to inefficiency. The memory problem is completely solved in the parallel algorithm, because each processor only needs to hold the data of its own block. For instance, if the order of the matrix is 100 000, then with two processors the order of the data per processor is 50 000, and with three processors it falls to about 33 000. Therefore, the size of a huge system of equations does not create any computational limitation in the parallel cases.
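Relations (5.2) and (5.3) can be applied directly to the measured times. As a small worked example, the snippet below computes $S_p$ and $E_p$ from the timings in the last column of Table 3 (times in seconds); it is a simple illustration of the definitions rather than part of the authors' code.

```python
# Speedup S_p = T_s / T_p (5.2) and efficiency E_p = S_p / p (5.3),
# applied to the timings of the last column of Table 3 (in seconds).
T = {1: 0.26724, 2: 0.12916, 4: 0.06842, 8: 0.03648, 16: 0.02248}

Ts = T[1]                     # sequential time
for p, Tp in T.items():
    Sp = Ts / Tp
    Ep = Sp / p
    print(f"p = {p:2d}: S_p = {Sp:6.3f}, E_p = {Ep:5.3f}")
```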
The speedup and efficiency curves for the computation of tridiagonal systems of equations of orders 20 000 to 100 000 using different numbers of processors are presented in Figures 9–11. The comparison between the computation times for different system sizes using up to 16 processors is shown in Figure 9. The rate of time reduction was greater for larger systems; for instance, the time was reduced from 267 milliseconds to 22 milliseconds, that is, about 11 times less than the sequential form.

Figure 8: Time of computation by a single CPU (time in ms versus the number of blocks, for n = 20 000 and n = 50 000).

Table 3: The computation time T (in seconds), speedup $S_p$, and efficiency $E_p$ of the parallel solution of the systems of equations of different orders.

Processors |    n = 100 000        |    n = 200 000        |    n = 300 000        |    n = 400 000        |    n = 500 000
    p      |  T       Sp     Ep    |  T       Sp     Ep    |  T       Sp     Ep    |  T       Sp     Ep    |  T       Sp     Ep
    1      | 0.00338  —      —     | 0.02495  —      —     | 0.05973  —      —     | 0.12345  —      —     | 0.26724  —      —
    2      | 0.00175  1.927  0.964 | 0.01289  1.935  0.968 | 0.03025  1.974  0.987 | 0.06221  1.984  0.992 | 0.12916  2.069  1.034
    4      | 0.00092  3.651  0.913 | 0.00678  3.679  0.920 | 0.01601  3.731  0.933 | 0.03225  3.828  0.957 | 0.06842  3.905  0.976
    6      | 0.00067  5.027  0.838 | 0.00487  5.121  0.854 | 0.01140  5.239  0.873 | 0.02252  5.482  0.914 | 0.04718  5.664  0.944
    8      | 0.00056  5.954  0.744 | 0.00406  6.139  0.767 | 0.00931  6.413  0.802 | 0.01787  6.907  0.863 | 0.03648  7.325  0.916
   10      | 0.00051  6.528  0.653 | 0.00365  6.822  0.682 | 0.00821  7.272  0.727 | 0.01544  7.995  0.800 | 0.03094  8.635  0.864
   12      | 0.00051  6.566  0.547 | 0.00347  7.180  0.598 | 0.00761  7.845  0.654 | 0.01427  8.635  0.720 | 0.02694  9.918  0.827
   14      | 0.00052  6.466  0.462 | 0.00336  7.425  0.530 | 0.00700  8.523  0.609 | 0.01320  9.350  0.668 | 0.02397  11.147 0.796
   16      | 0.00054  6.171  0.386 | 0.00336  7.423  0.464 | 0.00662  9.017  0.564 | 0.01236  9.983  0.624 | 0.02248  11.887 0.743

The amount of speedup for different problem sizes is shown in Figure 10. When the order of the system was 20 000 and the number of processors was greater than 12, the speedup decreased. The reason for this inefficiency is the effect of network delay on the computation time; in the larger problem sizes, this problem does not exist. It is encouraging to note that, as the problem size increases, the efficiency of the parallel algorithm increases. In Figure 11, the efficiency of the different executions is presented. It is evident that the rate of reduction in the efficiency of the parallel algorithm decreases as the problem size is raised. This characteristic makes the parallel algorithm a practical subroutine in numerical computations.

Figure 9: The computation time of systems of equations of different orders (n = 20 000 to 100 000) using up to 16 processors.

Figure 10: The speedup of the parallel algorithm.

In addition, the speedup of the proposed algorithm is compared with that of the parallel cyclic reduction algorithm in Figure 12. Cyclic reduction is a well-known algorithm for solving systems of equations [27]. The speedup of both algorithms is shown for two orders of the system of equations, that is, n = 200 000 and n = 400 000.
Obviously, the speedup of the proposed algorithm is higher than that of cyclic reduction, and the gap grows as the number of equations and processors increases.

Figure 11: The efficiency of the parallel algorithm.

Figure 12: The speedup comparison between the proposed algorithm and cyclic reduction [27] for two orders of the system of equations: (a) n = 200 000 and (b) n = 400 000.

6. Conclusion

In this paper, a novel parallel algorithm based on the Gram-Schmidt method was presented to solve tridiagonal linear systems of equations in an efficient way. Solving linear systems of equations, especially systems in tridiagonal form, is a common problem in different scientific fields. Therefore, the efficiency and accuracy of the algorithm can play a vital role in the entire solution procedure. In this parallel algorithm, the coefficient matrix is partitioned according to the number of processors. After that, some columns in each block are switched. Therefore, the corresponding orthogonal and upper-triangular matrices are computed by each processor with a minimum request for data from the other partitions. Finally, the system of equations is solved almost exactly. The proposed algorithm has some remarkable advantages, which include the following:

(1) the ability to find almost exact solutions of tridiagonal linear systems of equations of different sizes, as well as the corresponding orthogonal and upper-triangular matrices;
(2) a considerable reduction of the computational cost in proportion to the number of processors and blocks;
(3) increasing efficiency of the algorithm as the size of the system of equations is raised;
(4) a significant capability on a single processor using the MPI library: the requested memory decreases by partitioning the matrix, and hence the computational speed increases.

The orders of the systems of equations presented in this paper range from 10 000 to 100 000. The results show that the suggested algorithm is scalable and completely practical.

References

[1] B. N. Datta, Numerical Linear Algebra and Applications, SIAM, Philadelphia, Pa, USA, 2nd edition, 2010.
[2] D. P. O'Leary and P. Whitman, "Parallel QR factorization by Householder and modified Gram-Schmidt algorithms," Parallel Computing, vol. 16, no. 1, pp. 99–112, 1990.
[3] A. Pothen and P. Raghavan, "Distributed orthogonal factorization: Givens and Householder algorithms," SIAM Journal on Scientific and Statistical Computing, vol. 10, no. 6, pp. 1113–1134, 1989.
[4] J. Blackford, J. Choi, A. Cleary et al., ScaLAPACK Users' Guide, SIAM, Philadelphia, Pa, USA, 1997.
[5] E. Agullo, J. Dongarra, B. Hadri et al., "PLASMA 2.0 users' guide," Tech. Rep., ICL, UTK, 2009.
[6] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, "Parallel tiled QR factorization for multicore architectures," Concurrency and Computation: Practice and Experience, vol. 20, no. 13, pp. 1573–1590, 2008.
[7] R. W. Hockney, "A fast direct solution of Poisson's equation using Fourier analysis," Journal of the Association for Computing Machinery, vol. 12, pp. 95–113, 1965.
[8] H. S. Stone, "An efficient parallel algorithm for the solution of a tridiagonal linear system of equations," Journal of the Association for Computing Machinery, vol. 20, pp. 27–38, 1973.
[9] H. H. Wang, "A parallel method for tridiagonal equations," ACM Transactions on Mathematical Software, vol. 7, no. 2, pp. 170–183, 1981.
[10] E. Kilic, "Explicit formula for the inverse of a tridiagonal matrix by backward continued fractions," Applied Mathematics and Computation, vol. 197, no. 1, pp. 345–357, 2008.
[11] G. H. Golub and C. F. van Loan, Matrix Computations, Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, Md, USA, 3rd edition, 1996.
[12] P. Amodio, L. Brugnano, and T. Politi, "Parallel factorizations for tridiagonal matrices," SIAM Journal on Numerical Analysis, vol. 30, no. 3, pp. 813–823, 1993.
[13] P. Amodio and L. Brugnano, "The parallel QR factorization algorithm for tridiagonal linear systems," Parallel Computing, vol. 21, no. 7, pp. 1097–1110, 1995.
[14] S. R. Ghodsi, B. Mehri, and M. Taeibi-Rahni, "A parallel implementation of Gram-Schmidt algorithm for dense linear system of equations," International Journal of Computer Applications, vol. 5, no. 7, pp. 16–20, 2010.
[15] R. Doallo, B. B. Fraguela, J. Touriño, and E. L. Zapata, "Parallel sparse modified Gram-Schmidt QR decomposition," in Proceedings of the International Conference and Exhibition on High-Performance Computing and Networking, vol. 1067 of Lecture Notes in Computer Science, pp. 646–653, 1996.
[16] E. L. Zapata, J. A. Lamas, F. F. Rivera, and O. G. Plata, "Modified Gram-Schmidt QR factorization on hypercube SIMD computers," Journal of Parallel and Distributed Computing, vol. 12, no. 1, pp. 60–69, 1991.
[17] L. C. Waring and M. Clint, "Parallel Gram-Schmidt orthogonalisation on a network of transputers," Parallel Computing, vol. 17, no. 9, pp. 1043–1050, 1991.
[18] B. Kruatrachue and T. Lewis, "Grain size determination for parallel processing," IEEE Software, vol. 5, no. 1, pp. 23–32, 1988.
[19] A. W. Lim and M. S. Lam, "Maximizing parallelism and minimizing synchronization with affine partitions," Parallel Computing, vol. 24, no. 3-4, pp. 445–475, 1998.
[20] L. Giraud, J. Langou, and M. Rozloznik, "The loss of orthogonality in the Gram-Schmidt orthogonalization process," Computers & Mathematics with Applications, vol. 50, no. 7, pp. 1069–1075, 2005.
[21] A. Björck and C. C. Paige, "Loss and recapture of orthogonality in the modified Gram-Schmidt algorithm," SIAM Journal on Matrix Analysis and Applications, vol. 13, no. 1, pp. 176–190, 1992.
[22] T. Wittwer, An Introduction to Parallel Programming, VSSD uitgeverij, 2006.
[23] M. Snir and W. Gropp, MPI: The Complete Reference, MIT Press, Cambridge, Mass, USA, 2nd edition, 1998.
[24] T. M. Austin, M. Berndt, and J. D. Moulton, "A memory efficient parallel tridiagonal solver," Tech. Rep. LA-UR 03-4149, Mathematical Modeling and Analysis Group, Los Alamos National Laboratory, Los Alamos, NM, USA, 2004.
[25] W. M. Lioen and D. T. Winter, "Solving large dense systems of linear equations on systems with virtual memory and with cache," Applied Numerical Mathematics, vol. 10, no. 1, pp. 73–85, 1992.
[26] J. Demmel, L. Grigori, M. F. Hoemmen, and J. Langou, "Communication-optimal parallel and sequential QR and LU factorizations," Tech. Rep. UCB/EECS-2008-89, Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, Calif, USA, 2008.
[27] P. Amodio, "Optimized cyclic reduction for the solution of linear tridiagonal systems on parallel computers," Computers & Mathematics with Applications, vol. 26, no. 3, pp. 45–53, 1993.