Computational Mechanics Analysis Tools for Parallel-Vector Supercomputers
O. O. Storaasli, D. T. Nguyen, M. A. Baddourah, and J. Qin
Table of Contents
● Abstract
● Introduction
● Parallel-Vector Linear Equation Solvers
  ❍ Shared-Memory Choleski Solver
  ❍ Shared-Memory Out-of-Core Solver
  ❍ Distributed-Memory Choleski Solver
  ❍ Shared-Memory Unsymmetric Solver
● Parallel-Vector Lanczos Eigensolver
● Parallel Element Generation and Assembly
● Parallel Geometric Nonlinear Analysis
● Parallel-Vector Design Sensitivity Analysis in Linear Dynamics
● Parallel Optimal Design Algorithms
  ❍ Parallel-Vector Simplex Method
  ❍ Parallel-Vector BFGS Optimization Method
● Domain Decomposer for Parallel Solution
● Concluding Remarks
● Acknowledgments
● References
Abstract
Computational algorithms for structural analysis on parallel-vector supercomputers are reviewed. These
parallel algorithms, developed by the authors, are for the assembly of structural equations, "out-of-core"
strategies for linear equation solution, massively distributed-memory equation solution, unsymmetric
equation solution, general eigen-solution, geometrically nonlinear finite element analysis, design
sensitivity analysis for structural dynamics, optimization search analysis and domain decomposition.
The source code for many of these algorithms is available from NASA Langley by sending e-mail to O.O.Storaasli@larc.nasa.gov. (This paper was published in the '93 AIAA SDM meeting proceedings as well as in the International Journal of Computing Systems in Engineering, Vol. 4, Nos. 2-4, 1993.)
Introduction
The analysis and design of complex aerospace structures requires the rapid solution of large systems of
linear and nonlinear equations, eigenvalue extraction for buckling, vibration and flutter modes, structural
optimization and design sensitivity calculation. Computers with multiple processors and vector
capabilities can offer substantial computational advantages over traditional scalar computers for these
analyses [1]. Rapid progress has been made in developing parallel-vector computers, although software that exploits their parallel and vector capabilities is still in its infancy. These computers fall into two categories,
namely, shared-memory computers (e.g., Convex C-240, Cray C-90) and distributed-memory computers
(e.g., IBM SP-1 and SP-2, Intel Paragon, Thinking Machines CM-5).
Shared-memory computers typically have only a few processors (e.g., up to 16 on a Cray C-90), each of which can address a large memory and rapidly process vector instructions (so that add and multiply operations are performed in parallel). Information is shared among processors simply by referencing a common variable or array in shared memory.
Distributed-memory computers are very different from their shared-memory counterparts, and most algorithms must be rewritten to run efficiently on them. Distributed-memory computers may have thousands of processors, each with limited memory. Information is passed between processors by explicit message-passing commands (e.g., send and receive).
In this paper, general-purpose, highly-efficient algorithms are presented for: solution of systems of linear
and nonlinear equations, eigenvalue analysis, generation and assembly of element matrices, design
sensitivity analysis, optimization and domain decomposition. Each algorithm is briefly described with an
example to illustrate the numerical performance. References are also given to papers by the authors
containing detailed descriptions for most of the algorithms. The algorithms have been coded in
FORTRAN for shared-memory computers (Cray, Convex) and many for distributed-memory computers
(IBM SP-1, SP-2 and Intel Paragon). The capability and numerical performance of these parallel-vector
algorithms are discussed in the paper.
Parallel-Vector Linear Equation Solvers
Shared-Memory Choleski Solver
[1-5]
A fast, accurate Choleski-based method, PVSOLVE, was developed to solve symmetric systems of linear equations. This method uses a variable-band storage scheme, but exploits column heights to eliminate
operations on zeros outside the skyline during factorization. The method employs parallel computation
in the outermost Choleski loop and vector computation via the "loop unrolling" technique in the
innermost Choleski loop. A user option eliminates operations on zeros inside the band. Variable-band (row-by-row) storage is used to enable SAXPY [1,2] operations, which are significantly faster than the dot-product operations used in other Choleski skyline solvers, since both the add and multiply units are kept busy operating in parallel. PVSOLVE was tested in the structural analyses of numerous applications.
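The kernel idea can be illustrated with a short sketch. The Python/NumPy code below is a minimal illustration, not the FORTRAN PVSOLVE source (function and variable names are illustrative): it factors a dense symmetric positive-definite matrix with a right-looking Choleski in which the trailing update is written as SAXPY operations. PVSOLVE applies the same kernel to variable-band storage, distributes the outer loop over processors and unrolls it for vector speed.

```python
import numpy as np

def choleski_saxpy(A):
    """Right-looking Choleski factorization A = L L^T (dense sketch).

    The trailing update is expressed as SAXPY operations (y <- y + a*x),
    which keep the add and multiply units busy simultaneously, rather than
    as dot products.  This sketch keeps only the numerical kernel.
    """
    A = np.array(A, dtype=float)             # work on a copy
    n = A.shape[0]
    for k in range(n):                       # outer loop: parallel in PVSOLVE
        A[k, k] = np.sqrt(A[k, k])
        A[k+1:, k] /= A[k, k]                # scale column k of the factor
        for j in range(k + 1, n):
            A[j:, j] -= A[j, k] * A[j:, k]   # SAXPY update of column j
    return np.tril(A)                        # L, so that L @ L.T == original A

# quick check on a small SPD matrix
A0 = np.array([[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]])
L = choleski_saxpy(A0)
assert np.allclose(L @ L.T, A0)
```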
High-Speed Civil Transport (HSCT)
To evaluate the performance of PVSOLVE, a static structural analysis was performed on several HSCT
concepts [1] such as the Mach 2.4 16,152 degree-of-freedom finite-element model shown in Fig. 1.
Fig. 1. PVSOLVE solution time for HSCT
The HSCT models were symmetrically loaded (upward) at both wing tips and fixed at their nose and
tail. The model in Fig. 1 contained 2,694 nodes and 7,868 triangular elements which produced a matrix
with 12.5 million terms with a maximum and average bandwidth of 1,272 and 772, respectively. The
solution time for this model decreased in a scalable fashion from 6 seconds on one processor to 0.8 seconds on eight processors.
Shared-Memory Out-of-Core Solver
[6]
The number of equations that PVSOLVE can solve on shared-memory computers is limited by the total
addressable memory of the computer. To solve more equations, an "out-of-core" version of PVSOLVE
denoted as PV-OOC was developed. PV-OOC requires only a fraction of the equations (i.e., one block)
to reside in memory during the matrix factorization and back-substitution stages. The size of memory
required to store one block is the square of the maximum bandwidth. This allows much larger systems of
equations to be solved (or the same problems in a fraction of the memory) compared to PVSOLVE. An
alternative version of PV-OOC requires even less memory (24 x bandwidth + 6 x number of equations)
but takes slightly more time for equation solution.
On a Cray computer, where input/output (I/O) operations can be overlapped with computation, BUFFER
IN/OUT [7] is used instead of the much slower binary READ/WRITE. This further reduces I/O time and virtually eliminates it if the Solid State Disk (fast semiconductor memory) is used. To achieve parallel performance, Force [8], a parallel FORTRAN language (pre-compiler), is used in PV-OOC.
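The out-of-core pattern can be sketched for a banded matrix: because rows of the factor more than one bandwidth apart never interact, only the most recent bandwidth's worth of factored rows needs to stay in memory while finished rows are streamed to disk. The Python/NumPy sketch below is a simplified, serial illustration of that window-and-stream idea, not the PV-OOC block scheme with its BUFFER IN/OUT overlap; names are illustrative.

```python
import numpy as np
from collections import OrderedDict

def out_of_core_band_choleski(A, bw, factor_file="L_rows.bin"):
    """Row-by-row Choleski of a banded SPD matrix, keeping only ~bw rows resident.

    A is assumed symmetric positive definite with half-bandwidth bw
    (A[i, j] == 0 whenever |i - j| > bw).  Each finished row of L is written
    to 'factor_file' and evicted once no later row can reference it.
    """
    n = A.shape[0]
    window = OrderedDict()                        # resident rows of L
    with open(factor_file, "wb") as f:
        for i in range(n):
            row = np.zeros(n)
            for j in range(max(0, i - bw), i):    # off-diagonal terms of row i
                s = A[i, j] - row[:j] @ window[j][:j]
                row[j] = s / window[j][j]
            row[i] = np.sqrt(A[i, i] - row[:i] @ row[:i])
            row.tofile(f)                         # stream the finished row out
            window[i] = row
            if len(window) > bw:                  # drop the row no longer needed
                window.popitem(last=False)
    return factor_file
```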
The analysis time for the Mach 2.4 model in Fig. 1, using PV-OOC with a solid state disk, matches the
timings obtained using PVSOLVE. Thus, a larger, more complex Mach 3.0 HSCT finite-element model
was attempted. This Mach 3.0 HSCT model consists of 14,737 nodes and 32,448 triangular elements as
shown in Fig. 2.
Fig. 2. Mach 3.0 HSCT structural model
The assembled global stiffness matrix has 88,404 equations with a maximum bandwidth of 2,556. The
PV-OOC solution times for the Mach 3.0 HSCT model on a Cray Y-MP are shown in Fig. 3.
Fig. 3. PV-OOC solution for refined HSCT
A reduction in solution time from 140 seconds for 1 processor to 18 seconds for 8 processors was
achieved using PV-OOC. This reduction was accomplished by using only 14 million words of memory
compared to the 98 million words required by PVSOLVE. Before a Cray C-90 was available, the
corresponding reduction in solution time was from 550 seconds to 75 seconds on 1 and 8 Cray Y-MP
processors, respectively. The speedup factor of approximately 3 (Y-MP to C-90) indicates that
PVSOLVE is robust and exploits computer hardware upgrades with no coding changes.
Distributed-Memory Choleski Solver
[6]
While loop unrolling [1] is best for shared-memory computers, "vector unrolling" (using dot-product operations) is effective on distributed-memory computers (e.g., the Intel i860). A new Choleski-based algorithm, denoted PV-DM, with a block skyline storage scheme was developed, and its performance was evaluated on a 16,146-equation reduced model of the Mach 3.0 HSCT. This reduced model is considerably smaller than the HSCT model in Fig. 2 since its bandwidth is only 321. Displacements were calculated by PV-DM using 8, 16 and 32 Intel i860 processors, with the 32-processor solution time compared with that of the Intel ProSolver [9] in Table 1.
Table 1. HSCT Solution Time (sec)

Method           PV-DM    ProSolver
Factorization     26.0       50.2
Forward            0.8        9.0
Backward           1.0       51.0
------------------------------------
Total             27.8      110.2
The solution times shown are the sum of the computation and communication times. Level 8 loop
unrolling was used and found to reduce computation time by approximately ten percent. Since the
communication time became dominant as the number of processors increased, the time reduction from 8 to 32 processors was minimal. Computer architects recognize this communication "bottleneck", and future parallel computers will reduce latency (the setup time to send data) and increase the data communication rate. PV-DM solved this problem in 27.8 seconds, compared to 110.2 seconds for ProSolver [9]. The
factorization was about twice as fast for PV-DM and the forward/backward solution was over an order
of magnitude faster. This feature is desirable for many applications where the forward/backward
solution time dominates the analysis, such as in structural dynamics, eigenvalue extraction, design
sensitivity, nonlinear and optimization analyses.
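For contrast with the SAXPY kernel used on shared memory, the dot-product ("vector unrolling") form of Choleski that suits distributed memory can be sketched as follows. This Python/NumPy sketch shows only the left-looking numerical kernel with illustrative names; PV-DM itself uses block skyline storage and explicit message passing, so that the owner of a column first receives the previously factored columns it needs and then computes locally.

```python
import numpy as np

def choleski_dot_product(A):
    """Left-looking Choleski in which every entry is formed by a dot product.

    On a distributed-memory machine each processor would own a block of
    columns, receive the already-factored columns it depends on, and then
    form these dot products locally; here everything runs in one process.
    """
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L
```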
Shared-Memory Unsymmetric Solver
[10]
A Gauss elimination solver for non-positive-definite, unsymmetric, full systems of equations, denoted PV-US, has also been developed. To take advantage of vector speed (using SAXPY [7] operations) on shared-memory computers, the upper half of the unsymmetric coefficient matrix is stored by rows and the lower half by columns. PV-US can solve the unsymmetric systems of equations which arise in many engineering applications (e.g., panel flutter and computational fluid dynamics).
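The motivation for the split storage can be seen from the elimination kernel itself. In the sketch below (a dense Python/NumPy LU without pivoting, not the banded PV-US code; names are illustrative), step k needs column k of the lower factor and row k of the upper factor, and storing the lower half by columns and the upper half by rows makes exactly those two vectors contiguous for the SAXPY updates.

```python
import numpy as np

def lu_saxpy(A):
    """In-place LU factorization (no pivoting) with SAXPY trailing updates.

    At step k the multipliers (column k of L) and the pivot row (row k of U)
    drive a rank-one update of the trailing submatrix, written here as one
    SAXPY per remaining column.  Pivoting and band storage are omitted.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n):
        A[k+1:, k] /= A[k, k]                    # multipliers: column k of L
        for j in range(k + 1, n):
            A[k+1:, j] -= A[k, j] * A[k+1:, k]   # SAXPY update of column j
    return np.tril(A, -1) + np.eye(n), np.triu(A)   # unit-lower L and U
```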
Panel Flutter Example
Large-deflection, nonlinear supersonic/hypersonic aerodynamic panel flutter analyses were performed
for the model shown in Fig. 4.
Fig. 4. Finite element panel flutter analysis
At a sufficiently large velocity of airflow over the plate, which is suspended over the cavity shown,
panel flutter can occur. The finite element model for this panel flutter application has 1,452 unsymmetric
equations and a flutter matrix with an upper and lower half bandwidth of 778 and 727, respectively. The
performance of PV-US to solve the flutter analysis equations is shown in Fig. 5.
Fig. 5. Panel flutter analysis on Cray Y-MP
PV-US solved the unsymmetric system of equations in 4.63 seconds on one Cray Y-MP processor
compared to 0.61 seconds on eight processors (see Fig. 5). This reduction in solution time demonstrates
a parallel speedup of 7.6 on eight processors, which is an efficiency, relative to linear speedup, of 95
percent.
Parallel-Vector Lanczos Eigensolver
[11,12]
PVSOLVE [1] was used in a Lanczos eigensolver and was found to reduce the total computation time significantly when multiple processors are used. A fast matrix-vector multiplication method was also used in the Lanczos procedure.
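The structure of the Lanczos procedure is a short three-term recurrence; each pass needs one application of the operator (which, in the structural eigenproblem, involves solves with the factored stiffness matrix carried out by PVSOLVE and multiplications by the mass matrix) plus a few vector operations. A bare Python/NumPy sketch for a generic symmetric operator, without the reorthogonalization a production eigensolver requires, is shown below; names are illustrative.

```python
import numpy as np

def lanczos_eigenvalues(matvec, n, m, seed=0):
    """Plain Lanczos recurrence: build an m x m tridiagonal T whose
    eigenvalues approximate the extreme eigenvalues of the symmetric
    operator 'matvec'.  Breakdown handling and reorthogonalization,
    which a real structural eigensolver needs, are omitted.
    """
    rng = np.random.default_rng(seed)
    alpha, beta = np.zeros(m), np.zeros(m)
    q_prev = np.zeros(n)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    for k in range(m):
        w = matvec(q) - (beta[k - 1] * q_prev if k > 0 else 0.0)
        alpha[k] = q @ w                  # diagonal entry of T
        w -= alpha[k] * q
        beta[k] = np.linalg.norm(w)       # off-diagonal entry of T
        q_prev, q = q, w / beta[k]
    T = np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)
    return np.linalg.eigvalsh(T)
```

The operator application is where the parallel solver and the fast matrix-vector multiplication enter in the paper's eigensolver.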
Orbital Platform Model
The numerical accuracy and efficiency of the proposed parallel-vector Lanczos algorithm was
demonstrated by calculating the eigenvalues of the orbital Control Structure Interaction (CSI) platform
model shown in Fig. 6.
Fig. 6. CSI finite element model
This model contains 536 nodes, 1,647 three-dimensional beam elements and 3,096 degrees of freedom.
The Lanczos algorithm is faster than the subspace iteration algorithm on one Alliant processor and was
faster on two Cray Y-MP processors than on one (see Table 2).
Table 2. Lanczos Performance Comparison (sec)

              Eigensolver/Alliant          Lanczos/Cray
Eigenvalues   Subspace    Lanczos       1 proc.   2 proc.
    10           50.3       38.3           -         -
    15           68.3       41.3           -         -
    20           79.2       46.9           -         -
    50          421.2       89.1          6.78      3.97
   100         2083.9      209.5           -         -
On the Alliant, the parallel Lanczos solver ranges from 20 percent faster to ten times faster than the subspace iteration solver as the number of eigenpairs calculated increases. The 100th eigenvalue is 193.88 for both
methods, indicating the excellent accuracy of the Lanczos algorithm. Fifty eigenvalues were calculated
on one and two Cray Y-MP processors with a parallel speedup of 1.71.
Parallel Element Generation and Assembly
[13]
A new "node-by-node" procedure was developed to generate and assemble element stiffness matrices
concurrently. In this procedure (found to be effective for both shared and distributed memory
computers), each processor is assigned its own group of nodes of the finite element model. Since nodes, rather than elements, are assigned to processors, no communication occurs between processors during element generation and assembly, as it does in the traditional "element-by-element" method. Thus, as shown in Fig. 7,
near-scalable computation time reductions for the Mach 3.0 HSCT model were obtained for both shared
and distributed memory computers.
Fig. 7. Parallel Generation and Assembly
Scalable speedup was achieved by the parallel algorithm on both Cray and Intel computers for 1 to 8 and
16 to 512 processors, respectively. This node-by-node method is one of the first structural analysis
algorithms to show better performance on a distributed memory computer (512-processor Intel Delta)
compared to a Cray C-90.
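The node-by-node idea can be sketched in a few lines: each processor owns a disjoint set of nodes and assembles only the global-matrix rows for those nodes, looping over the elements that touch them, so no partial sums ever have to be exchanged. The Python sketch below (one degree of freedom per node, dense rows, illustrative names) shows the ownership rule; in this sketch, an element whose nodes fall in two groups is simply generated by both processors, trading a little redundant work for zero communication.

```python
import numpy as np

def assemble_node_by_node(n_nodes, elements, elem_stiffness, my_nodes):
    """Assemble only the global-matrix rows owned by 'my_nodes'.

    'elements' is a list of node-index tuples and 'elem_stiffness(e)' returns
    the local stiffness of element e.  Every global row is summed by exactly
    one processor, so no inter-processor communication is needed during
    generation and assembly (one dof per node keeps the sketch short).
    """
    rows = {i: np.zeros(n_nodes) for i in my_nodes}
    for e, conn in enumerate(elements):
        ke = elem_stiffness(e)                    # local element matrix
        for a, i in enumerate(conn):
            if i in rows:                         # this processor owns row i
                for b, j in enumerate(conn):
                    rows[i][j] += ke[a, b]
    return rows

# two unit-stiffness bar elements in series; "this processor" owns nodes 0 and 1
k_bar = np.array([[1.0, -1.0], [-1.0, 1.0]])
rows = assemble_node_by_node(3, [(0, 1), (1, 2)], lambda e: k_bar, {0, 1})
assert np.allclose(rows[1], [-1.0, 2.0, -1.0])
```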
Parallel Geometric Nonlinear Analysis
[14]
PVSOLVE and the parallel element generation and assembly algorithms were installed in three popular
geometric nonlinear structural analysis methods: Newton-Raphson, modified Newton-Raphson and
Broyden-Fletcher-Goldfarb-Shanno (BFGS). For scalar computers, many consider the BFGS method to
be the most effective for nonlinear finite element analysis. However, for the CSI platform application on
the Cray Y-MP, the parallel full Newton-Raphson method consistently took less time (see Fig. 8) than the other methods run in parallel.
Fig. 8. Parallel nonlinear methods
All methods showed excellent parallel speedup as the number of processors increased, with the full Newton-Raphson method further outperforming the modified Newton-Raphson method on 8 processors. This is attributed to the lack of parallelism in the forward/backward elimination phase of the modified Newton-Raphson method.
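For reference, the full Newton-Raphson loop that was parallelized has the familiar form sketched below in Python, with np.linalg.solve standing in for the parallel assembly and equation-solution steps that the paper reuses each iteration; the residual, tangent and spring example are illustrative. The modified variant factors the tangent only once and reuses it, which shifts the remaining cost into the forward/backward substitutions that the paper identifies as the less parallel phase.

```python
import numpy as np

def full_newton_raphson(residual, tangent, u0, tol=1e-10, max_iter=20):
    """Full Newton-Raphson for R(u) = 0: re-form and re-solve with the tangent
    stiffness every iteration.  'residual' and 'tangent' are user-supplied."""
    u = np.array(u0, dtype=float)
    for _ in range(max_iter):
        r = residual(u)
        if np.linalg.norm(r) < tol:
            break
        u += np.linalg.solve(tangent(u), -r)   # tangent rebuilt each pass
    return u

# hardening spring:  R(u) = k*u + c*u**3 - f   (illustrative one-dof problem)
k, c, f = 2.0, 0.5, 3.0
u = full_newton_raphson(lambda u: np.array([k*u[0] + c*u[0]**3 - f]),
                        lambda u: np.array([[k + 3*c*u[0]**2]]),
                        [0.0])
```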
Parallel-Vector Design Sensitivity Analysis in Linear Dynamics
[15,16]
The mode-acceleration method, PVSOLVE, and the eigensolver were combined in an alternative formulation [16] of Design Sensitivity Analysis (DSA) for structural systems under dynamic loads. The solution time for DSA of the CSI finite element model decreased as the number of Cray Y-MP processors increased (see Table 3).
Table 3. Parallel-Vector DSA for CSI

Processors   Seconds   Speedup
    1          6.98      1.00
    2          3.85      1.81
    3          2.79      2.50
    4          2.41      2.89
The times in Table 3 include all computations except the generation of stiffness and mass matrices.
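For context, the standard undamped mode-acceleration approximation that such a formulation builds on combines a static solve (the PVSOLVE step) with a truncated modal correction (the eigensolver step); the exact sensitivity expressions actually used are those of [15,16]. In its usual textbook form,

```latex
u(t) \;\approx\; K^{-1} f(t) \;-\; \sum_{i=1}^{m} \frac{\ddot{q}_i(t)}{\omega_i^{2}}\,\phi_i ,
\qquad \ddot{q}_i + \omega_i^{2} q_i = \phi_i^{T} f(t),
```

where K is the stiffness matrix, f(t) the applied load, and (omega_i, phi_i) the natural frequencies and mass-normalized mode shapes of the m retained modes; design sensitivities follow by differentiating this expression with respect to the design variables.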
Parallel Optimal Design Algorithms
[17,18]
Many structural problems are formulated as constrained optimization problems which may be solved
using different solution algorithms. Effective nonlinear constrained optimization algorithms can be
designed based on parallelizing the following two optimization methods: Simplex and BFGS.
Parallel-Vector Simplex Method
Both parallel and vector methods (the use of Force [8] and loop unrolling [1]) were incorporated in the Simplex procedure to solve a linear programming problem with 500 design variables and 500 equality constraints. A total of 1,500 design variables resulted after introducing slack and surplus variables into the problem. Time speedups of up to 3.6 (using four Cray-2 processors) were obtained for this 1,500-design-variable system (see Table 4).
Table 4. Parallel-Vector Simplex Method

Processors   Seconds   Speedup
    1         83.40      1.00
    2         41.41      2.00
    3         30.43      2.74
    4         23.17      3.60
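The most naturally vectorizable part of each Simplex step is the tableau pivot: once the entering column and pivot row are chosen, every other row is updated by an independent SAXPY, so the rows can be split across processors and each update vectorized with loop unrolling. The Python/NumPy sketch below shows only that kernel (selection of the entering/leaving variables and the Phase I handling of the equality constraints are omitted); how the authors actually distributed the work is described in [18].

```python
import numpy as np

def simplex_pivot(T, r, c):
    """One Gauss-Jordan pivot of a simplex tableau T about row r, column c.

    After the pivot row is scaled, every other row update is an independent
    SAXPY, so the loop over rows can be split across processors and each
    update vectorized.
    """
    T = np.array(T, dtype=float)
    T[r, :] /= T[r, c]                       # scale the pivot row
    for i in range(T.shape[0]):              # independent row updates
        if i != r and T[i, c] != 0.0:
            T[i, :] -= T[i, c] * T[r, :]
    return T
```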
Parallel-Vector BFGS Optimization Method
PVSOLVE was installed in the BFGS method to solve nonlinear unconstrained optimization problems.
Systems of 300 nonlinear equations, cast in the form of nonlinear unconstrained optimization problems,
were solved on multi-processor computers. For this example, converged solutions were reached after 11
BFGS iterations as indicated in Table 5.
Table 5. Parallel-Vector BFGS Method

  Size     Proc.   Seconds   Iters.   Speedup
300x300      1      0.102      11       1.00
300x300      2      0.058      11       1.77
300x300      4      0.032      11       3.21
The solution time reduced from 0.102 seconds to 0.032 seconds for 1 and 4 processors, respectively.
This reduction is a speedup of 3.21 on 4 processors, for an efficiency of 80 percent. This is a significant
savings since in a typical optimization, the BFGS method is used hundreds of times.
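A generic BFGS loop with the inverse-Hessian update is sketched below in Python (a textbook form with illustrative names, not the authors' FORTRAN implementation). The update is dominated by matrix-vector and outer-product operations that vectorize well, and the one-dimensional line search, done here by simple backtracking, is the part addressed next by the Parallel Golden Block method.

```python
import numpy as np

def bfgs_minimize(f, grad, x0, max_iter=200, tol=1e-8):
    """Unconstrained minimization by BFGS with the inverse-Hessian update
    and a simple Armijo backtracking line search (sketch)."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                                  # inverse-Hessian estimate
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                                 # search direction
        alpha, fx, slope = 1.0, f(x), g @ d
        while f(x + alpha * d) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5                           # backtrack until Armijo holds
        x_new = x + alpha * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:                          # keep H positive definite
            rho = 1.0 / (s @ y)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# minimize a convex quadratic as a quick check
A = np.array([[3.0, 1.0], [1.0, 2.0]]); b = np.array([1.0, 1.0])
x = bfgs_minimize(lambda v: 0.5 * v @ A @ v - b @ v, lambda v: A @ v - b, [0.0, 0.0])
assert np.allclose(A @ x, b, atol=1e-6)
```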
To make the BFGS method even more effective, a new line-search method, referred to as the Parallel Golden Block (PGB) method, was developed and found to improve parallel performance by inserting many points (instead of just the two used in the Golden Section method) into the targeted region of the design space.
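The essence of the method is a bracket-shrinking line search that evaluates a whole block of interior points at once (one per processor or vector lane) instead of the two points of the classical Golden Section method. The Python sketch below keeps the subinterval around the best sampled point each pass; the exact reduction rule of the authors' PGB method may differ, so treat this as an illustrative stand-in with hypothetical names.

```python
import numpy as np

def block_line_search(f, a, b, n_points=12, tol=1e-6, max_iter=60):
    """Minimize a unimodal f on [a, b] by evaluating n_points interior points
    per pass (all of them evaluable in parallel) and shrinking the bracket to
    the neighbours of the best point.  Larger blocks shrink the interval much
    faster per pass than a two-point search.
    """
    for _ in range(max_iter):
        if b - a < tol:
            break
        xs = np.linspace(a, b, n_points + 2)[1:-1]   # interior trial points
        fs = np.array([f(x) for x in xs])            # independent evaluations
        k = int(np.argmin(fs))
        a = xs[k - 1] if k > 0 else a                # keep the bracket around
        b = xs[k + 1] if k < n_points - 1 else b     # the best sampled point
    return 0.5 * (a + b)

x_min = block_line_search(np.cos, 2.0, 4.0)          # close to pi
assert abs(x_min - np.pi) < 1e-4
```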
Series Example
To illustrate the PGB method, the optimum of a simple cosine series consisting of 600 terms [17] was calculated. The PGB method reduced the solution time from 0.540 seconds on one processor to 0.144 seconds on four Cray processors, as shown in Table 6.
Table 6. Parallel Golden Block Method

No. of   No. of    Time        Speedup for
Proc.    Points    (sec)      PGB       GS
  1         2      0.355      1.00     1.00
  1        12      0.540      1.00     0.66
  2        12      0.277      1.95     1.28
  3        12      0.187      2.89     1.90
  4        12      0.144      3.76     2.47
This is 2.47 times faster than the best sequential Golden Section method (denoted GS in Table 6). This
speedup results from simultaneously calculating twelve points in the Golden Block method compared to
only two points in the Golden Section method.
Domain Decomposer for Parallel Solution
[19]
A simple but efficient algorithm was developed to automatically decompose arbitrary finite element
domains into a specified number of subdomains for substructure analysis in a parallel computer
environment. The algorithm balances the work load, minimizes inter-processor communication and
minimizes the bandwidth of the resulting systems of equations. It avoids domain splitting by utilizing
the joint coordinate information which is readily available from the finite element model. This is
illustrated in its application to the 16,152 degree-of-freedom Mach 2.4 HSCT model with 2,694 nodes
and 7,868 triangular elements shown in Fig. 1. The algorithm, which runs on shared-memory computers (e.g., Cray, Convex and Sun), created the four subdomains (fuselage, forward, center and aft wing sections) shown in Fig. 9.
Fig. 9. Domain decomposition of the Mach 2.4 High-Speed Civil Transport model
Since three of the subdomains have exactly 1,965 elements and one has 1,964 elements, the work is ideally balanced across multiple processors. In addition, a minimum of boundary nodes occurred in the entire structure and no domain splitting resulted. This small number of boundary nodes helps to reduce the communication time required when parallel structural analysis is performed based on this decomposition.
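A coordinate-based decomposition of this flavour can be sketched in a few lines: compute element centroids from the joint (nodal) coordinates, sort them along a geometric axis, and cut the sorted list into equal-sized groups, which automatically balances element counts and keeps each subdomain geometrically compact. The Python sketch below is a generic coordinate-sorting partitioner with illustrative names, not the authors' algorithm [19], which additionally minimizes boundary nodes and the bandwidth of the resulting equations.

```python
import numpy as np

def decompose_by_coordinates(node_xyz, elements, n_sub):
    """Split 'elements' (tuples of node indices) into n_sub groups of nearly
    equal size by sorting element centroids along the longest geometric axis
    of the model.  Only the joint coordinates are used; returns a list of
    arrays of element indices, one per subdomain.
    """
    node_xyz = np.asarray(node_xyz, dtype=float)
    centroids = np.array([node_xyz[list(conn)].mean(axis=0) for conn in elements])
    extent = node_xyz.max(axis=0) - node_xyz.min(axis=0)
    axis = int(np.argmax(extent))                  # longest bounding-box axis
    order = np.argsort(centroids[:, axis])         # sweep elements along it
    return np.array_split(order, n_sub)            # near-equal element counts
```

For a long, slender model such as the HSCT, such a sweep would run along the fuselage axis and produce contiguous strips of elements; a production decomposer would then also renumber the nodes within each subdomain to reduce bandwidth.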
Concluding Remarks
A number of highly efficient parallel-vector algorithms have recently been developed which show
significant improvements for conducting structural analysis. Software using these algorithms has been
demonstrated on a variety of engineering applications (static, dynamic, linear, nonlinear, design
sensitivity and optimization) executing on shared-memory, and in some cases, distributed-memory
supercomputers. The performance of the algorithms was evaluated on medium to large-scale practical
applications.
The software described may be used in a stand-alone mode, or may be integrated into existing finite
element codes as the authors have done.
Acknowledgments
The authors were granted early use of the Intel Delta Supercomputer operated by Caltech on behalf of
the Concurrent Supercomputing Consortium.
References
1. Storaasli, O. O., Nguyen, D. T. and Agarwal, T. K., "A Parallel-Vector Algorithm For Rapid Structural Analysis on High-Performance Computers", NASA TM 102614, April 1990.
2. Agarwal, T., Storaasli, O., and Nguyen, D., "A Parallel-Vector Algorithm for Rapid Structural Analysis on High-Performance Computers", AIAA Paper No. 90-1149, Proc. of the AIAA/ASME/ASCE/AHS 31st Structures, Structural Dynamics and Materials Conference, Long Beach, CA, April 2-4, 1990, pp. 662-672.
3. Qin, J., Agarwal, T. K., Storaasli, O. O., Nguyen, D. T. and Baddourah, M. A., "Parallel-Vector Out-of-Core Solver for Computational Mechanics", Proc. of the 2nd Symposium on Parallel Computational Methods for Large-Scale Structural Analysis and Design, Norfolk, VA, February 24-25, 1993.
4. Baddourah, M. A., Storaasli, O. O. and Bostic, S., "Linear Static Structural and Vibration Analysis on
High-Performance Computers", 2nd Symposium on Parallel Computational Methods for Large-Scale
Structural Analysis and Design, Norfolk, VA, February 24-25, 1993.
5. Storaasli, O. O., Nguyen, D. T. and Agarwal, T. K., "The Parallel Solution of Large-Scale Structural
Analysis Problems on Supercomputers", AIAA Journal, Vol. 28, No. 7, July 1990, pp. 1211-1216.
6. Qin, J., and Nguyen, D. T., "A Parallel-Vector Equation Solver for Distributed Memory Computers",
Proc. of the 2nd Symposium on Parallel Computational Methods for Large-Scale Structural Analysis
and Design, Norfolk, VA, February 24-25, 1993.
7. CFT77 Compiling System, Vol. 1: FORTRAN Reference Manual, SR-3071 5.0, Cray Research Inc.,
Eagan, MN, June 1991, pp. 327-329.
8. Jordan, H. F., Benten, M. S., Arenstorf, N. S., and Ramann, A. V., "Force User's Manual: A Portable, Parallel FORTRAN", NASA CR 4265, January 1990.
9. iPSC/860 ProSolver-SES Manual, Intel Supercomputer Systems Division, Beaverton, OR, 1992.
10. Qin, J., Gray, C. E., Jr., Mei, C., and Nguyen, D. T., "Parallel-Vector Equation Solver for
Unsymmetric Matrices on Supercomputers", Computing Systems in Engineering, Vol. 2, No. 2/3, 1991,
pp. 197-201.
11. Belvin, W. K., Maghami, P. G., and Nguyen, D. T., "Efficient Use of High-Performance Computers
for Integrated Controls and Structures Design", Int. Journal of Computing Systems in Engineering, Vol. 3, No. 1-4, 1992, pp. 181-187.
12. Storaasli, O. O., Bostic, S. W., Patrick, M., Mahajan, U., and Ma, S., "Three Parallel Computation
Methods for Structural Vibration Analysis", AIAA Paper No. 88-2391, Proc. of the 29th AIAA/ASME/
ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference, Williamsburg, VA, April
18-20, 1988, pp. 1401-1411.
13. Baddourah, M. A., Storaasli, O. O., Carmona E. A. and Nguyen, D. T., "A Fast Parallel Algorithm
for Generation and Assembly of Finite Element Stiffness and Mass Matrices", AIAA Paper No. 91-1006, Proc. of the 32nd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics and Materials Conference, Baltimore, MD, April 8-10, 1991, pp. 1547-1553.
14. Baddourah, M. A., "Parallel-Vector Computation for Geometrically Nonlinear Frame Structural Analysis and Design Sensitivity Analysis", Ph.D. Dissertation, Department of Civil Engineering, Old Dominion University, Norfolk, VA, June 1991.
15. Zhang, Y., and Nguyen, D. T., "Parallel-Vector Sensitivity Calculations in Linear Structural
Dynamics", Int. Journal of Computing Systems in Engineering, Vol. 3, No. 1-4, 1992, pp. 365-377.
16. Zhang, Y., "Parallel-Vector Design Sensitivity Analysis in Structural Dynamics", Ph.D. Dissertation, Department of Civil Engineering, Old Dominion University, Norfolk, VA, June 1991.
17. Nguyen, D. T., Storaasli, O. O., Carmona, E. A., Al-Nasra, M., Zhang, Y., Baddourah, M. A., and
Agarwal, T. K., "Parallel-Vector Computation for Linear Structural Analysis and Nonlinear
Unconstrained Optimization Problems", Int. Journal of Computing Systems in Engineering, Vol. 2, No.
4, Sept. 1991, pp. 175-182.
18. Baddourah, M. A. and Nguyen, D. T., "Parallel-Vector Processing for Linear Programming",
Computers and Structures, Vol. 38, No. 11, 1991, pp. 269-282.
19. Moayyad, M. A. and Nguyen, D. T., "An Algorithm for Domain Decomposition in Finite Element
Analysis", Computers and Structures, Vol. 39, Nos. 1-4, 1991,pp. 277-290.
Nguyen: Center for Multidisciplinary Parallel-Vector Computation, Old Dominion University, Norfolk, VA 23529, U.S.A.
Baddourah: Lockheed Engineering and Sciences Co., Hampton, VA 23669, U.S.A.
Qin: Center for Multidisciplinary Parallel-Vector Computation, Old Dominion University, Norfolk, VA 23529, U.S.A.