Computational Mechanics Analysis Tools for Parallel-Vector Supercomputers

O. O. Storaasli, D. T. Nguyen, M. A. Baddourah, and J. Qin

Table of Contents

● Abstract
● Introduction
● Parallel-Vector Linear Equation Solvers
  ❍ Shared-Memory Choleski Solver
  ❍ Shared-Memory Out-of-Core Solver
  ❍ Distributed-Memory Solver
  ❍ Shared-Memory Unsymmetric Solver
● Parallel-Vector Lanczos Eigensolver
● Parallel Element Generation and Assembly
● Parallel Geometric Nonlinear Analysis
● Parallel-Vector Design Sensitivity Analysis in Linear Dynamics
● Parallel Optimal Design Algorithms
  ❍ Parallel-Vector Simplex Method
  ❍ Parallel-Vector BFGS Optimization Method
● Domain Decomposer for Parallel Solution
● Concluding Remarks
● Acknowledgments
● References

Abstract

Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These parallel algorithms, developed by the authors, address the assembly of structural equations, "out-of-core" strategies for linear equation solution, massively parallel distributed-memory equation solution, unsymmetric equation solution, general eigensolution, geometrically nonlinear finite element analysis, design sensitivity analysis for structural dynamics, optimization search analysis and domain decomposition. The source code for many of these algorithms is available from NASA Langley by sending email to O.O.Storaasli@larc.nasa.gov. (This paper was published in the 1993 AIAA SDM meeting proceedings as well as in the International Journal of Computing Systems in Engineering, Vol. 4, Nos. 2-4, 1993.)

Introduction

The analysis and design of complex aerospace structures requires the rapid solution of large systems of linear and nonlinear equations, eigenvalue extraction for buckling, vibration and flutter modes, structural optimization and design sensitivity calculation. Computers with multiple processors and vector capabilities can offer substantial computational advantages over traditional scalar computers for these analyses [1]. Rapid progress has taken place in developing parallel-vector computers, although software to exploit their parallel and vector capability is still in its infancy. These computers fall into two categories: shared-memory computers (e.g., Convex C-240, Cray C-90) and distributed-memory computers (e.g., IBM SP-1 and SP-2, Intel Paragon, Thinking Machines CM-5). Shared-memory computers typically have only a few processors (e.g., up to 16 on a Cray C-90), each of which can address a large common memory and rapidly process vector instructions (so that add and multiply operations proceed in parallel). Information is shared among processors simply by referencing a common variable or array in shared memory. Distributed-memory computers are very different from their shared-memory counterparts, and most algorithms must be rewritten to run efficiently on them. Distributed-memory computers may have thousands of processors, each with limited memory, and information is passed between processors by explicit message-passing commands (e.g., send and receive).
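To make the shared-memory versus message-passing distinction above concrete, the short sketch below moves one column of data between two processors with explicit send/receive calls. This example is not from the paper: the authors' codes used Force on shared-memory Crays and vendor message-passing libraries on the Intel machines, so present-day MPI calls are used here purely as a generic stand-in for "send" and "receive".

! Illustrative sketch only: MPI is a modern stand-in for the explicit
! "send" and "receive" calls described in the text; the authors' codes
! used Force on shared-memory Crays and vendor message-passing libraries
! on the Intel machines.
program message_passing_sketch
  use mpi
  implicit none
  integer, parameter :: n = 4
  real(8) :: column(n)
  integer :: rank, ierr, status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) then
     ! On a shared-memory machine every processor could simply reference
     ! this array; on a distributed-memory machine it must be sent explicitly.
     column = (/ 1.0d0, 2.0d0, 3.0d0, 4.0d0 /)
     call MPI_Send(column, n, MPI_DOUBLE_PRECISION, 1, 99, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(column, n, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, &
                   status, ierr)
     print *, 'processor 1 received column:', column
  end if

  call MPI_Finalize(ierr)
end program message_passing_sketch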
In this paper, general-purpose, highly efficient algorithms are presented for the solution of systems of linear and nonlinear equations, eigenvalue analysis, generation and assembly of element matrices, design sensitivity analysis, optimization and domain decomposition. Each algorithm is briefly described with an example to illustrate its numerical performance. References are also given to papers by the authors containing detailed descriptions of most of the algorithms. The algorithms have been coded in FORTRAN for shared-memory computers (Cray, Convex) and, in many cases, for distributed-memory computers (IBM SP-1 and SP-2, Intel Paragon). The capability and numerical performance of these parallel-vector algorithms are discussed in the paper.

Parallel-Vector Linear Equation Solvers

Shared-Memory Choleski Solver [1-5]

A fast, accurate Choleski-based method, PVSOLVE, was developed to solve symmetric systems of linear equations. The method uses a variable-band storage scheme, but uses column heights to eliminate operations on zeros outside the skyline during factorization. It employs parallel computation in the outermost Choleski loop and vector computation, via the "loop unrolling" technique, in the innermost Choleski loop. A user option eliminates operations on zeros inside the band. Variable-band (row-by-row) storage is used to enable SAXPY [1,2] operations, which are significantly faster than the dot-product operations used in other Choleski skyline solvers, since both the add and multiply units are kept busy operating in parallel. PVSOLVE has been tested in the structural analysis of numerous applications.

High-Speed Civil Transport (HSCT)

To evaluate the performance of PVSOLVE, a static structural analysis was performed on several HSCT concepts [1], such as the Mach 2.4, 16,152 degree-of-freedom finite element model shown in Fig. 1.

Fig. 1. PVSOLVE solution time for HSCT

The HSCT models were symmetrically loaded (upward) at both wing tips and fixed at their nose and tail. The model in Fig. 1 contained 2,694 nodes and 7,868 triangular elements, which produced a matrix with 12.5 million terms and maximum and average bandwidths of 1,272 and 772, respectively. The solution time for this model decreased in a scalable fashion from 6 seconds on one processor to 0.8 seconds on eight processors.

Shared-Memory Out-of-Core Solver [6]

The number of equations that PVSOLVE can solve on shared-memory computers is limited by the total addressable memory of the computer. To solve more equations, an "out-of-core" version of PVSOLVE, denoted PV-OOC, was developed. PV-OOC requires only a fraction of the equations (i.e., one block) to reside in memory during the matrix factorization and back-substitution stages. The memory required to store one block is the square of the maximum bandwidth. This allows much larger systems of equations to be solved (or the same problems to be solved in a fraction of the memory) compared to PVSOLVE. An alternative version of PV-OOC requires even less memory (24 x bandwidth + 6 x number of equations) but takes slightly more time for equation solution. On a Cray computer, where input/output (I/O) operations can be overlapped with computation, BUFFER IN/OUT [7] is used instead of the much slower binary READ/WRITE. This further reduces I/O time and virtually eliminates it when the Solid-State Disk (fast semiconductor memory) is used. To achieve parallel performance, Force [8], a parallel FORTRAN language (pre-compiler), is used in PV-OOC.

The analysis time for the Mach 2.4 model in Fig. 1, using PV-OOC with a Solid-State Disk, matches the timings obtained using PVSOLVE. Thus, a larger, more complex Mach 3.0 HSCT finite element model was attempted. This Mach 3.0 HSCT model consists of 14,737 nodes and 32,448 triangular elements, as shown in Fig. 2.

Fig. 2. Mach 3.0 HSCT structural model

The assembled global stiffness matrix has 88,404 equations with a maximum bandwidth of 2,556.
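The factorization kernel underlying both PVSOLVE and PV-OOC is organized around SAXPY (column-update) operations so that the add and multiply pipelines stay busy. The FORTRAN sketch below shows that organization for a small dense matrix only; it omits the variable-band storage, skyline column heights and loop unrolling of the actual solvers, and it marks the independent column updates of this simplified right-looking form with an OpenMP directive as a present-day stand-in for the Force-based parallelism of PVSOLVE (which parallelizes the outermost Choleski loop).

! Minimal dense Choleski factorization (A = L*L^T) organized around
! SAXPY-style column updates. Simplified sketch: PVSOLVE uses variable-band
! (row-by-row) storage, skyline column heights and loop unrolling, and
! parallelizes the outermost Choleski loop with Force; the OpenMP directive
! below is only an illustration of where independent work exists.
subroutine choleski_saxpy(a, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(inout) :: a(n, n)   ! on exit, L overwrites the lower triangle
  integer :: i, j, k

  do k = 1, n
     a(k, k) = sqrt(a(k, k))
     do i = k + 1, n                  ! scale column k of L
        a(i, k) = a(i, k) / a(k, k)
     end do
     !$omp parallel do private(i)
     do j = k + 1, n                  ! each remaining column j receives an
        do i = j, n                   ! independent SAXPY update
           a(i, j) = a(i, j) - a(j, k) * a(i, k)
        end do
     end do
     !$omp end parallel do
  end do
end subroutine choleski_saxpy

! Tiny check: the factor of the 3x3 matrix below has rows (2), (1 2), (1 1 2).
program choleski_demo
  implicit none
  real(8) :: a(3, 3)
  integer :: i
  a = reshape((/ 4.d0, 2.d0, 2.d0,  2.d0, 5.d0, 3.d0,  2.d0, 3.d0, 6.d0 /), (/ 3, 3 /))
  call choleski_saxpy(a, 3)
  do i = 1, 3
     print '(3f6.2)', a(i, 1:i)
  end do
end program choleski_demo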
The PV-OOC solution times for the Mach 3.0 HSCT model are shown in Fig. 3.

Fig. 3. PV-OOC solution for refined HSCT

A reduction in solution time from 140 seconds on one processor to 18 seconds on eight processors was achieved using PV-OOC. This was accomplished using only 14 million words of memory, compared to the 98 million words required by PVSOLVE. Before a Cray C-90 was available, the corresponding reduction in solution time was from 550 seconds to 75 seconds on one and eight Cray Y-MP processors, respectively. The speedup factor of approximately 3 (Y-MP to C-90) indicates that PVSOLVE is robust and exploits computer hardware upgrades with no coding changes.

Distributed-Memory Choleski Solver [6]

While loop unrolling [1] is best for shared-memory computers, "vector unrolling" (using dot-product operations) is effective on distributed-memory computers (e.g., the Intel i860). A new Choleski-based algorithm, denoted PV-DM, with a block skyline storage scheme was developed, and its performance was evaluated on a 16,146-equation reduced model of the Mach 3.0 HSCT. This reduced model is considerably smaller than the HSCT model in Fig. 2, since the bandwidth is only 321. Displacements were calculated by PV-DM using 8, 16 and 32 Intel i860 processors, with the 32-processor solution time compared to that of the Intel ProSolver [9] in Table 1.

Table 1. HSCT Solution Time (sec)

Method          PV-DM   ProSolver
Factorization    26.0      50.2
Forward           0.8       9.0
Backward          1.0      51.0
---------------------------------
Total            27.8     110.2

The solution times shown are the sum of the computation and communication times. Level-8 loop unrolling was used and found to reduce computation time by approximately ten percent. Since the communication time becomes dominant as the number of processors increases, the time reduction from 8 to 32 processors was minimal. Computer architects recognize this communication "bottleneck", and future parallel computers will reduce latency (the setup time to send data) and increase the data communication rate. PV-DM solved this problem in 27.8 seconds, compared to 110.2 seconds for ProSolver [9]. The factorization was about twice as fast for PV-DM, and the forward/backward solution was over an order of magnitude faster. This feature is desirable for the many applications where the forward/backward solution time dominates the analysis, such as structural dynamics, eigenvalue extraction, design sensitivity, nonlinear and optimization analyses.

Shared-Memory Unsymmetric Solver [10]

A Gauss elimination solver for non-positive-definite, unsymmetric, full systems of equations, denoted PV-US, has also been developed. To take advantage of the vector speed (using SAXPY [7] operations) on shared-memory computers, the upper half of the unsymmetric coefficient matrix is stored by rows and the lower half by columns. PV-US can solve the unsymmetric systems of equations which arise in many engineering applications (e.g., panel flutter and computational fluid dynamics).

Panel Flutter Example

Large-deflection, nonlinear supersonic/hypersonic aerodynamic panel flutter analyses were performed for the model shown in Fig. 4.

Fig. 4. Finite element panel flutter analysis

At a sufficiently large velocity of airflow over the plate, which is suspended over the cavity shown, panel flutter can occur. The finite element model for this panel flutter application has 1,452 unsymmetric equations and a flutter matrix with upper and lower half-bandwidths of 778 and 727, respectively.
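Storing the upper half by rows and the lower half by columns makes the entries touched at each elimination step contiguous, so the updates vectorize as SAXPY operations. A minimal dense sketch of that elimination pattern follows; it omits the banded row/column storage and pivoting considerations of the actual PV-US code, and the storage rationale appears only in the comments.

! Minimal dense Gauss elimination (A = L*U, no pivoting) with SAXPY-style
! updates, sketching the idea behind PV-US; the real solver works on the
! banded matrix with the upper half stored by rows and the lower half by
! columns.
subroutine gauss_saxpy(a, n)
  implicit none
  integer, intent(in) :: n
  real(8), intent(inout) :: a(n, n)   ! on exit: U on/above the diagonal, L below
  integer :: i, j, k

  do k = 1, n - 1
     ! Multipliers: column k of L. In PV-US the lower half is stored by
     ! columns, so this column is a contiguous vector.
     do i = k + 1, n
        a(i, k) = a(i, k) / a(k, k)
     end do
     ! Each remaining column j receives an independent SAXPY update with
     ! column k of L, scaled by a(k,j) taken from row k of U (stored by
     ! rows in PV-US). The j-loop is the natural parallel loop.
     do j = k + 1, n
        do i = k + 1, n
           a(i, j) = a(i, j) - a(i, k) * a(k, j)
        end do
     end do
  end do
end subroutine gauss_saxpy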
The performance of PV-US in solving the flutter analysis equations is shown in Fig. 5.

Fig. 5. Panel flutter analysis on Cray Y-MP

PV-US solved the unsymmetric system of equations in 4.63 seconds on one Cray Y-MP processor compared to 0.61 seconds on eight processors (see Fig. 5). This reduction in solution time corresponds to a parallel speedup of 7.6 on eight processors, an efficiency of 95 percent relative to linear speedup.

Parallel-Vector Lanczos Eigensolver [11,12]

PVSOLVE [1] was used in a Lanczos eigensolver and found to significantly reduce the total computation time when using multiple processors. A fast matrix-vector multiplication method was also used in the Lanczos procedure.

Orbital Platform Model

The numerical accuracy and efficiency of the proposed parallel-vector Lanczos algorithm were demonstrated by calculating the eigenvalues of the orbital Control Structure Interaction (CSI) platform model shown in Fig. 6.

Fig. 6. CSI finite element model

This model contains 536 nodes, 1,647 three-dimensional beam elements and 3,096 degrees of freedom. The Lanczos algorithm is faster than the subspace iteration algorithm on one Alliant processor, and it was faster on two Cray Y-MP processors than on one (see Table 2).

Table 2. Lanczos Performance Comparison (solution time, sec)

Number of eigenvalues       10      15      20      50       100
Subspace (Alliant)        50.3    68.3    79.2   421.2    2083.9
Lanczos (Alliant)         38.3    41.3    46.9    89.1     209.5
Lanczos (Cray, 1 proc.)      -       -       -    6.78        -
Lanczos (Cray, 2 proc.)      -       -       -    3.97        -

The parallel Lanczos solver grows from 20 percent to ten times faster than the subspace iteration solver on the Alliant as the number of eigenpairs calculated increases. The 100th eigenvalue is 193.88 for both methods, indicating the excellent accuracy of the Lanczos algorithm. Fifty eigenvalues were calculated on one and two Cray Y-MP processors with a parallel speedup of 1.71.

Parallel Element Generation and Assembly [13]

A new "node-by-node" procedure was developed to generate and assemble element stiffness matrices concurrently. In this procedure, found to be effective for both shared- and distributed-memory computers, each processor is assigned a group of different nodes of the finite element model. Since nodes (rather than elements) are assigned to processors, no communication occurs between processors during element generation and assembly, as it does in the traditional "element-by-element" method. Thus, as shown in Fig. 7, near-scalable computation time reductions for the Mach 3.0 HSCT model were obtained for both shared- and distributed-memory computers.

Fig. 7. Parallel Generation and Assembly

Scalable speedup was achieved by the parallel algorithm on both Cray and Intel computers for 1 to 8 and 16 to 512 processors, respectively. This node-by-node method is one of the first structural analysis algorithms to show better performance on a distributed-memory computer (the 512-processor Intel Delta) than on a Cray C-90.

Parallel Geometric Nonlinear Analysis [14]

PVSOLVE and the parallel element generation and assembly algorithms were installed in three popular geometric nonlinear structural analysis methods: Newton-Raphson, modified Newton-Raphson and Broyden-Fletcher-Goldfarb-Shanno (BFGS). For scalar computers, many consider the BFGS method to be the most effective for nonlinear finite element analysis. However, for the CSI platform application on the Cray Y-MP, the parallel full Newton-Raphson method consistently took less time (see Fig. 8) than the other methods run in parallel.

Fig. 8. Parallel nonlinear methods
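The structure common to these methods is an equilibrium-iteration loop: form the residual (out-of-balance) force, solve a linearized system for a displacement correction, and update the displacements. The runnable FORTRAN miniature below illustrates that full Newton-Raphson loop on an invented two-degree-of-freedom "cubic spring" system; in the parallel code the residual and tangent stiffness are assembled node-by-node and the linearized system is solved with PVSOLVE, neither of which is reproduced here.

! Runnable miniature of the full Newton-Raphson equilibrium iteration.
! The 2-DOF "cubic spring" system and the 2x2 Cramer solve are illustrative
! stand-ins only: in the parallel code the residual and tangent stiffness
! are assembled node-by-node and the linearized system is solved by PVSOLVE.
program newton_raphson_sketch
  implicit none
  real(8) :: k(2,2), f(2), u(2), r(2), kt(2,2), du(2), det
  real(8), parameter :: tol = 1.0d-10
  integer :: iter

  k = reshape((/ 2.d0, -1.d0, -1.d0, 2.d0 /), (/ 2, 2 /))   ! linear stiffness
  f = (/ 1.d0, 1.d0 /)                                      ! applied load
  u = 0.d0                                                  ! initial displacements

  do iter = 1, 20
     r = f - (matmul(k, u) + u**3)          ! residual (out-of-balance) force
     if (sqrt(sum(r*r)) < tol) exit
     kt = k                                 ! tangent stiffness K_t = K + 3*diag(u**2)
     kt(1,1) = kt(1,1) + 3.d0*u(1)**2
     kt(2,2) = kt(2,2) + 3.d0*u(2)**2
     det   = kt(1,1)*kt(2,2) - kt(1,2)*kt(2,1)
     du(1) = ( kt(2,2)*r(1) - kt(1,2)*r(2)) / det   ! solve K_t*du = r (PVSOLVE's role)
     du(2) = ( kt(1,1)*r(2) - kt(2,1)*r(1)) / det
     u = u + du                             ! full Newton update
     ! Modified Newton-Raphson would reuse the first factorization of K_t and
     ! repeat only the forward/backward substitution in later iterations.
  end do
  print *, 'converged displacements:', u, '  Newton iterations:', iter - 1
end program newton_raphson_sketch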
All methods showed excellent parallel speedup as the number of processors increased, with the Newton-Raphson method further outperforming the modified Newton-Raphson method on eight processors. This is attributed to the lack of parallelism in the forward/backward elimination phase of the modified Newton-Raphson method.

Parallel-Vector Design Sensitivity Analysis in Linear Dynamics [15,16]

The mode-acceleration method, PVSOLVE and the eigensolver were combined in an alternative formulation [16] of Design Sensitivity Analysis (DSA) of structural systems under dynamic loads. The solution time for DSA of the CSI finite element model decreased as the number of Cray Y-MP processors increased (see Table 3).

Table 3. Parallel-Vector DSA for CSI

Processors     1      2      3      4
Seconds       6.98   3.85   2.79   2.41
Speedup       1.00   1.81   2.50   2.89

The times in Table 3 include all computations except the generation of the stiffness and mass matrices.

Parallel Optimal Design Algorithms [17,18]

Many structural problems are formulated as constrained optimization problems, which may be solved using different solution algorithms. Effective nonlinear constrained optimization algorithms can be designed by parallelizing the following two optimization methods: Simplex and BFGS.

Parallel-Vector Simplex Method

Both parallel and vector methods (use of Force [8] and loop unrolling [1]) were incorporated in the Simplex procedure to solve a linear programming problem with 500 design variables and 500 equality constraints. A total of 1,500 design variables resulted after introducing slack and surplus variables into the problem. Time speedups of up to 3.6 (using four Cray-2 processors) were obtained for this 1,500-design-variable system (see Table 4).

Table 4. Parallel-Vector Simplex Method

Processors     1       2       3       4
Seconds      83.40   41.41   30.43   23.17
Speedup       1.00    2.00    2.74    3.60

Parallel-Vector BFGS Optimization Method

PVSOLVE was installed in the BFGS method to solve nonlinear unconstrained optimization problems. Systems of 300 nonlinear equations, cast in the form of nonlinear unconstrained optimization problems, were solved on multi-processor computers. For this example, converged solutions were reached after 11 BFGS iterations, as indicated in Table 5.

Table 5. Parallel-Vector BFGS Method

Size      Processors   Iterations   Seconds   Speedup
300x300       1            11        0.102     1.00
300x300       2            11        0.058     1.77
300x300       4            11        0.032     3.21

The solution time was reduced from 0.102 seconds on one processor to 0.032 seconds on four processors. This reduction is a speedup of 3.21 on four processors, for an efficiency of 80 percent. This is a significant savings, since in a typical optimization the BFGS method is used hundreds of times. To make the BFGS method even more effective, a new line search method, referred to as the Parallel Golden Block (PGB) method, was found to improve the parallel performance by inserting many points (instead of just two, as in the Golden Section method) in the targeted region of the design space.

Series Example

To illustrate the PGB method, the optimum solution of a simple cosine series consisting of 600 terms [17] was calculated. The PGB method reduced the solution time from 0.540 seconds to 0.144 seconds when using four Cray processors, as shown in Table 6.

Table 6. Parallel Golden Block Method

No. of   No. of   Time    Speedup   Speedup
Proc.    Points   (sec)   (PGB)     (vs. GS)
  1         2     0.355    1.00       1.00
  1        12     0.540    1.00       0.66
  2        12     0.277    1.95       1.28
  3        12     0.187    2.89       1.90
  4        12     0.144    3.76       2.47

This is 2.47 times faster than the best sequential Golden Section method, denoted GS in Table 6 (the two-point, one-processor case in the first row). This speedup results from simultaneously calculating twelve points in the Golden Block method compared to only two points in the Golden Section method.
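The idea behind the PGB line search is thus to evaluate many interior trial points in the bracketing interval at once, one per processor, and then shrink the interval around the best trial, rather than evaluating the two interior points of the classical Golden Section search. The sketch below captures that block interval-reduction idea for an invented one-dimensional objective; the point-placement and reduction rule of the actual PGB method may differ, and the trial-evaluation loop is where the parallelism lies.

! Rough sketch of a "block" line search in the spirit of the Parallel Golden
! Block method: npts interior trial points are evaluated at once (one per
! processor in PGB) and the bracket shrinks around the best of them. The
! objective below is an invented, unimodal stand-in for the real merit function.
program golden_block_sketch
  implicit none
  integer, parameter :: npts = 12          ! trial points per sweep (cf. Table 6)
  real(8) :: a, b, h, x(npts), fx(npts)
  integer :: i, ibest, sweep

  a = 0.d0                                 ! initial bracket on the step length
  b = 4.d0
  do sweep = 1, 40
     h = (b - a) / (npts + 1)
     do i = 1, npts                        ! independent evaluations: this is the
        x(i)  = a + i*h                    ! parallel loop (one point per processor)
        fx(i) = f(x(i))
     end do
     ibest = minloc(fx, dim=1)
     a = x(ibest) - h                      ! keep the sub-interval surrounding the
     b = x(ibest) + h                      ! best trial point and sweep again
     if (b - a < 1.d-8) exit
  end do
  print *, 'minimum near x =', 0.5d0*(a + b), '  f =', f(0.5d0*(a + b))

contains
  real(8) function f(s)                    ! illustrative unimodal objective only
    real(8), intent(in) :: s
    f = (s - 1.234d0)**2 + cos(s)
  end function f
end program golden_block_sketch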
Domain Decomposer for Parallel Solution [19]

A simple but efficient algorithm was developed to automatically decompose arbitrary finite element domains into a specified number of subdomains for substructure analysis in a parallel computing environment. The algorithm balances the work load, minimizes inter-processor communication and minimizes the bandwidth of the resulting systems of equations. It avoids domain splitting by utilizing the joint coordinate information which is readily available from the finite element model. This is illustrated by its application to the 16,152 degree-of-freedom Mach 2.4 HSCT model, with 2,694 nodes and 7,868 triangular elements, shown in Fig. 1. The algorithm, which runs on shared-memory computers (e.g., Cray, Convex, Sun), created the four subdomains (fuselage and forward, center and aft wing sections) shown in Fig. 9.

Fig. 9. Domain decomposition of Mach 2.4 High-Speed Civil Transport model

Since three subdomains have exactly 1,965 elements and one has 1,964 elements, the work is ideally balanced across multiple processors. In addition, only a minimal number of boundary nodes occurred in the entire structure, and no domain splitting resulted. This small number of boundary nodes helps to reduce the communication time required when parallel structural analysis is performed based on this decomposition.

Concluding Remarks

A number of highly efficient parallel-vector algorithms have recently been developed which show significant improvements for conducting structural analysis. Software using these algorithms has been demonstrated on a variety of engineering applications (static, dynamic, linear, nonlinear, design sensitivity and optimization) executing on shared-memory and, in some cases, distributed-memory supercomputers. The performance of the algorithms was evaluated on medium- to large-scale practical applications. The software described may be used in a stand-alone mode, or it may be integrated into existing finite element codes as the authors have done.

Acknowledgments

The authors were granted early use of the Intel Delta supercomputer operated by Caltech on behalf of the Concurrent Supercomputing Consortium.

References

1. Storaasli, O. O., Nguyen, D. T. and Agarwal, T. K., "A Parallel-Vector Algorithm for Rapid Structural Analysis on High-Performance Computers", NASA TM 102614, April 1990.
2. Agarwal, T. K., Storaasli, O. O. and Nguyen, D. T., "A Parallel-Vector Algorithm for Rapid Structural Analysis on High-Performance Computers", AIAA Paper No. 90-1149, Proc. of the 31st AIAA/ASME/ASCE/AHS Structures, Structural Dynamics and Materials Conference, Long Beach, CA, April 2-4, 1990, pp. 662-672.
3. Qin, J., Agarwal, T. K., Storaasli, O. O., Nguyen, D. T. and Baddourah, M. A., "Parallel-Vector Out-of-Core Solver for Computational Mechanics", Proc. of the 2nd Symposium on Parallel Computational Methods for Large-Scale Structural Analysis and Design, Norfolk, VA, February 24-25, 1993.
4. Baddourah, M. A., Storaasli, O. O. and Bostic, S., "Linear Static Structural and Vibration Analysis on High-Performance Computers", Proc. of the 2nd Symposium on Parallel Computational Methods for Large-Scale Structural Analysis and Design, Norfolk, VA, February 24-25, 1993.
5. Storaasli, O. O., Nguyen, D. T. and Agarwal, T. K., "The Parallel Solution of Large-Scale Structural Analysis Problems on Supercomputers", AIAA Journal, Vol. 28, No. 7, July 1990, pp. 1211-1216.
6. Qin, J. and Nguyen, D. T., "A Parallel-Vector Equation Solver for Distributed Memory Computers", Proc. of the 2nd Symposium on Parallel Computational Methods for Large-Scale Structural Analysis and Design, Norfolk, VA, February 24-25, 1993.
7. CFT77 Compiling System, Vol. 1: FORTRAN Reference Manual, SR-3071 5.0, Cray Research Inc., Eagan, MN, June 1991, pp. 327-329.
8. Jordan, H. F., Benten, M. S., Arenstorf, N. S. and Ramann, A. V., "Force User's Manual: A Portable, Parallel FORTRAN", NASA CR 4265, January 1990.
9. iPSC/860 ProSolver-SES Manual, Intel Supercomputer Systems Division, Beaverton, OR, 1992.
10. Qin, J., Gray, C. E., Jr., Mei, C. and Nguyen, D. T., "Parallel-Vector Equation Solver for Unsymmetric Matrices on Supercomputers", Computing Systems in Engineering, Vol. 2, No. 2/3, 1991, pp. 197-201.
11. Belvin, W. K., Maghami, P. G. and Nguyen, D. T., "Efficient Use of High-Performance Computers for Integrated Controls and Structures Design", Int. Journal of Computing Systems in Engineering, Vol. 3, No. 1-4, 1992, pp. 181-187.
12. Storaasli, O. O., Bostic, S. W., Patrick, M., Mahajan, U. and Ma, S., "Three Parallel Computation Methods for Structural Vibration Analysis", AIAA Paper No. 88-2391, Proc. of the 29th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference, Williamsburg, VA, April 18-20, 1988, pp. 1401-1411.
13. Baddourah, M. A., Storaasli, O. O., Carmona, E. A. and Nguyen, D. T., "A Fast Parallel Algorithm for Generation and Assembly of Finite Element Stiffness and Mass Matrices", AIAA Paper No. 91-1006, Proc. of the 32nd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference, Baltimore, MD, April 8-10, 1991, pp. 1547-1553.
14. Baddourah, M. A., "Parallel-Vector Computation for Geometrically Nonlinear Frame Structural Analysis and Design Sensitivity Analysis", Ph.D. Dissertation, Department of Civil Engineering, Old Dominion University, Norfolk, VA, June 1991.
15. Zhang, Y. and Nguyen, D. T., "Parallel-Vector Sensitivity Calculations in Linear Structural Dynamics", Int. Journal of Computing Systems in Engineering, Vol. 3, No. 1-4, 1992, pp. 365-377.
16. Zhang, Y., "Parallel-Vector Design Sensitivity Analysis in Structural Dynamics", Ph.D. Dissertation, Department of Civil Engineering, Old Dominion University, Norfolk, VA, June 1991.
17. Nguyen, D. T., Storaasli, O. O., Carmona, E. A., Al-Nasra, M., Zhang, Y., Baddourah, M. A. and Agarwal, T. K., "Parallel-Vector Computation for Linear Structural Analysis and Nonlinear Unconstrained Optimization Problems", Int. Journal of Computing Systems in Engineering, Vol. 2, No. 4, Sept. 1991, pp. 175-182.
18. Baddourah, M. A. and Nguyen, D. T., "Parallel-Vector Processing for Linear Programming", Computers and Structures, Vol. 38, No. 11, 1991, pp. 269-282.
19. Moayyad, M. A. and Nguyen, D. T., "An Algorithm for Domain Decomposition in Finite Element Analysis", Computers and Structures, Vol. 39, Nos. 1-4, 1991, pp. 277-290.

Nguyen: Center for Multidisciplinary Parallel-Vector Computation, Old Dominion University, Norfolk, VA 23529, U.S.A.
Baddourah: Lockheed Engineering and Sciences Co., Hampton, VA 23669, U.S.A.
Qin: Center for Multidisciplinary Parallel-Vector Computation, Old Dominion University, Norfolk, VA 23529, U.S.A.