Preliminary ADCIRC Benchmark Results (DRAFT)

Brett D. Estrade, Daniel S. Katz, Steve Brandt, Chirag Dekate

September 12, 2006

1 Introduction

1.1 ADCIRC Overview

The Advanced Circulation (ADCIRC) model is a finite element coastal ocean model that simulates the motion of fluid on a revolving Earth. ADCIRC is discretized in space using finite elements and in time using the finite difference method [2]. Because it uses finite elements, unstructured and irregularly-shaped domains may be used. This makes ADCIRC well suited for modeling highly complex geometries, such as coastlines, with variable refinement, so there is no need to nest meshes as is often done with traditional finite difference codes. For the purpose of these benchmarks, we used version 45.10.

The 2DDI (two-dimensional depth-integrated) version of ADCIRC is a two-dimensional, finite element, barotropic (that is, density depends only on pressure) hydrodynamic model capable of including wind, wave, and tidal forcings as well as river flux into the domain [4]. Following the classification of application algorithms proposed by Simon (http://www.nersc.gov/~simon/Presentations2005/ICCSE05HDSimon.ppt), ADCIRC falls under the unstructured grids category.

Both the depth-integrated and the fully 3D versions of ADCIRC solve a vertically-integrated continuity equation for water surface elevation. ADCIRC utilizes the Generalized Wave Continuity Equation (GWCE) formulation to avoid the spurious oscillations that are associated with the Galerkin finite element formulation. ADCIRC-2DDI solves the vertically-integrated momentum equations to determine the depth-averaged velocity. In 3D, ADCIRC implements the shallow water form of the momentum equations to solve for the velocity components in the coordinate directions x, y, and z over a generalized stretched vertical coordinate system. The vertical component of velocity is obtained in ADCIRC by solving the 3D continuity equation for w after u and v have been determined from the solution of the 3D momentum equations [1].

ADCIRC is parallelized using a single program multiple data (SPMD) approach, which uses data decomposition to distribute the computational grid among multiple processors. Before performing a parallel simulation, all input files must be decomposed into sub-domains for the desired number of processors using the utility adcprep. This utility uses METIS (a family of programs for partitioning unstructured graphs and hypergraphs and computing fill-reducing orderings of sparse matrices) to partition the grid in order to balance the computational load across concurrent processes. Because of this SPMD approach, most of the required communication occurs at the boundaries of the sub-domains, where nodes are shared as ghost nodes.

ADCIRC is mostly parallelized (some matrix operations, such as the tri-diagonal matrix solver in vsmy.F, which could benefit from parallelization, still operate serially within each sub-problem) and uses the Message Passing Interface (MPI) for communication. ADCIRC has been successfully run on many HPC architectures, and should compile and run on any platform that has a standard Fortran 90 compiler and an implementation of MPI. In most cases, and until parallel communication overhead dominates execution time, the time required to complete an ADCIRC simulation decreases linearly as the number of processors increases. ADCIRC has shown super-linear speedup when cache effects allow much of the required data to be accessed directly from the processors' caches.
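To make the sub-domain decomposition and ghost-node idea concrete, the short Python sketch below shows one way to identify the ghost nodes of a single sub-domain, given a triangular element list and a METIS-style node-to-partition map. This is an illustration only, with hypothetical names and a simplified ownership rule; it is not adcprep's actual algorithm.

    # Sketch: find the ghost nodes of one sub-domain of an unstructured
    # triangular mesh, given a node-to-partition map such as METIS produces.
    # Illustrative only; adcprep's real decomposition differs in detail.

    def ghost_nodes(elements, node_part, my_part):
        """Return nodes referenced by this sub-domain's elements but owned
        by another partition (the ghost nodes shared across the boundary)."""
        # An element touches this sub-domain if any of its nodes is owned here.
        local_elements = [e for e in elements
                          if any(node_part[n] == my_part for n in e)]
        referenced = {n for e in local_elements for n in e}
        return {n for n in referenced if node_part[n] != my_part}

    # Toy mesh: two triangles sharing an edge; node 3 is owned by partition 1.
    elements = [(0, 1, 2), (1, 3, 2)]
    node_part = {0: 0, 1: 0, 2: 0, 3: 1}
    print(ghost_nodes(elements, node_part, 0))   # -> {3}
    print(ghost_nodes(elements, node_part, 1))   # -> {1, 2}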
1.2 ADCIRC Development and Distribution

ADCIRC is copyrighted by Rick Luettich of the University of North Carolina and Joannes Westerink of the University of Notre Dame. New releases of ADCIRC are handled by Rick Luettich and his team, and are distributed on a per-request basis. Development of the ADCIRC code is distributed among groups at UNC, Notre Dame, the University of Texas-Austin, and the University of Oklahoma. There is no regular development cycle or code repository, though there are frequent updates and bug fixes. Typically, academic and research use of the code is permitted without fee. Use of the code by for-profit entities requires a fee and additional licensing terms.

Some ADCIRC resources include:

• Main Web Page - http://www.adcirc.org
• Theory Report - http://www.adcirc.org/adcirc_theory_2004_12_08.pdf
• User's Guide - http://www.adcirc.org/document/ADCIRC_title_page.html
• Example Problems - http://www.adcirc.org/document/Examples.html

1.3 ADCIRC Mesh and Input File Creation

The application of ADCIRC to a particular region is non-trivial, since a mesh and initial data must be generated by the user. A popular tool for generating meshes is the Surface Water Modeling System (SMS, http://www.ems-i.com/SMS/SMS_Overview/sms_overview.html). Additionally, the acquisition of data such as bathymetry (water depth) for the mesh and wind data for the generation of wind forcing files must be handled by the user. The details and formats of all input files are described fully in [2].

CCT is involved in several projects using ADCIRC, including the SURA Coastal Ocean Observing and Prediction (SCOOP) Program and the development of the Lake Pontchartrain Forecasting System (LPFS) for the Army Corps of Engineers. These use ADCIRC to model the Gulf of Mexico and Lake Pontchartrain using 50,000-node and 30,000-node grids, respectively. The LSU Hurricane Center also uses ADCIRC for predicting coastal storm surge, working primarily with a 600,000-node mesh developed in conjunction with Notre Dame.

2 Benchmarking Methodologies

This paper reflects benchmarking efforts conducted primarily during February 2006 in preparation for a response to the NSF solicitation for a high performance computing system acquisition, "Towards a Petascale Computing Environment for Science and Engineering". The methods employed for these benchmarking results were ad hoc in the sense that the cases were arbitrary, model results were not validated, and there was limited ability to affect or control the parameters of the cases tested and the environments in which they were tested.

2.1 Timing

The wall clock times reported by the batch queue systems on each HPC platform were used to time each run. No effort was made to modify the code so that only the numerically relevant portions were timed independently of the code's book-keeping functionality.
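Because the timings come directly from the batch queue systems, it is convenient to normalize their wall-clock strings before comparing machines. The Python sketch below assumes "HH:MM:SS" or "D-HH:MM:SS" strings; the exact format reported by each batch system may differ, so this is only an assumed form.

    # Sketch: convert a batch-system wall-clock string into seconds so that
    # runs on different machines can be compared. The "D-HH:MM:SS" and
    # "HH:MM:SS" forms handled here are assumptions about the report format.

    def wallclock_to_seconds(s):
        days = 0
        if "-" in s:                       # optional leading day count
            d, s = s.split("-", 1)
            days = int(d)
        parts = [int(p) for p in s.split(":")]
        while len(parts) < 3:              # tolerate "MM:SS"
            parts.insert(0, 0)
        h, m, sec = parts
        return ((days * 24 + h) * 3600) + m * 60 + sec

    print(wallclock_to_seconds("02:13:45"))     # 8025
    print(wallclock_to_seconds("1-00:00:30"))   # 86430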
3 Details

The following section details the specifics of the cases involved in this benchmarking effort.

3.1 Cases

The cases involved in the benchmarking were selected based on availability and convenience, not as a result of careful planning. The essential goal was to get a general idea of how well ADCIRC scaled across processors on the platforms specified in this paper. Another factor to consider is that the model results were not validated in any way. The only criterion for a successful run was a return status of 0 for the model run. Additionally, all of these meshes were created by globally splitting each element into 4. All of these cases were forced with tides (M2) only.

Case  Nodes      Elements   Simulation Length  Time Step  Forcing
1     1,001,424  1,968,728  0.5 days           1.0 sec    Tides (M2)
2       254,565    492,182  0.5 days           5.0 sec    Tides (M2)
3        31,435     58,369  0.5 days           5.0 sec    Tides (M2)

Cases 1 and 2 were used to determine the scaling efficiency for the platforms used in this effort. Case 3 was used primarily to test ADCIRC's reliance on communications on Bassi and Loni, the two IBM P5s used in this testing.
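As noted above, each benchmark mesh was produced by globally splitting every element into four. The Python sketch below illustrates that kind of uniform refinement (each triangle is split at its edge midpoints). It is only a sketch of the idea, not the tool used to build these meshes; a real refinement would also interpolate bathymetry and boundary information onto the new nodes.

    # Sketch: one level of uniform refinement of a triangular mesh, splitting
    # every element into 4 by inserting edge midpoints. Illustrative only.

    def refine(nodes, triangles):
        """nodes: list of (x, y); triangles: list of (i, j, k) node indices."""
        nodes = list(nodes)
        midpoint = {}                       # edge (lo, hi) -> new node index

        def mid(a, b):
            key = (min(a, b), max(a, b))
            if key not in midpoint:
                (xa, ya), (xb, yb) = nodes[a], nodes[b]
                nodes.append(((xa + xb) / 2.0, (ya + yb) / 2.0))
                midpoint[key] = len(nodes) - 1
            return midpoint[key]

        new_tris = []
        for i, j, k in triangles:
            ij, jk, ki = mid(i, j), mid(j, k), mid(k, i)
            new_tris += [(i, ij, ki), (ij, j, jk), (ki, jk, k), (ij, jk, ki)]
        return nodes, new_tris

    nodes, tris = refine([(0, 0), (1, 0), (0, 1)], [(0, 1, 2)])
    print(len(nodes), len(tris))   # 6 nodes, 4 triangles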
3.2 Platforms

Machine     Arch           Proc Type              Procs Used  Interconnect                Location
Bassi       IBM P5         1.9 GHz P5             256         HPS, 2 x 2 GB/s (peak)      NERSC
Bigben      Cray XT3       2.4 GHz Opteron        512         Cray SeaStar, 1.6 Gb/link   PSC
BlueGene/L  IBM BGL        700 MHz PPC440         1024        Torus                       ANL
Cobalt      SGI Altix      1.6 GHz Itanium 2      256         NUMAflex                    NCSA
Loni        IBM P5         1.9 GHz P5             256         HPS, 1 x 2 GB/s (peak)      LSU
Mike        Linux cluster  Intel Pentium IV Xeon  256         Myrinet, 100 Mbit Ethernet  LSU

4 Results

These results should be taken only as a small sampling of ADCIRC's performance on these machines (up-to-the-minute results may be viewed at http://www.cct.lsu.edu/~estrabd/pmwiki/index.php). Among the issues with these results are:

• Reported times were wall clock and reflect total execution time, not just the numerically relevant code.
• Timings were conducted only once per machine per processor count, so no variation in run times was examined.
• Runs were only tidally driven, which does not exercise ADCIRC's other forcing options (winds, etc.) or its 3D capabilities.

4.1 Scaling

The scaling of ADCIRC for each case run on a particular machine is shown in Figures 1-4. Generally, the scaling performance is fairly good.

[Figure 1: Scaling on Loni. Log(time) vs. number of processors for the 1000k-, 200k-, and 32k-node cases.]

[Figure 2: Scaling on Mike. Log(time) vs. number of processors for the 1000k- and 200k-node cases.]

[Figure 3: Scaling on Bassi. Log(time) vs. number of processors for the 1000k-node case.]

[Figure 4: Scaling on BlueGene/L. Log(time) vs. number of processors for the 1000k-node case.]

4.2 Efficiency

It should be reiterated that these results are preliminary, and show the performance over a range of processors for tidally forced ADCIRC runs on non-realistic meshes. Efficiency measures how much performance improves as the number of processors is increased. Ideally, one hopes for an efficiency of 1.0, which indicates that the problem scales linearly as the number of processors increases; in other words, as the number of processors doubles, the time to solution is halved. In a practical sense, efficiency is used to determine when an increase in the number of CPUs is not worth the performance gain. Indeed, it is possible for an increase in processors to have a detrimental effect, meaning that the time to solution increases as the number of processors increases. Equation 1 presents this quantity as defined in [3], where T(p) is the wall clock time required to complete a run on p processors:

    E_p = T(1) / (p T(p))                                           (1)

Figures 5-8 show the efficiency profiles for the cases run on the respective machines. Because some cases were not run on a single processor due to excessive run times, the efficiencies reported here are computed relative to the smallest number of processors actually used (ideal efficiency is assumed up to that processor count). Efficiency appears to be good until a particular point is reached. This point is different for each case and for each machine, and it appears to indicate where communication begins to dominate these fixed-size problems.

[Figure 5: Efficiency on Loni. Efficiency vs. number of processors for the 1000k-, 200k-, and 32k-node cases, with ideal efficiency shown for reference.]

[Figure 6: Efficiency on Mike. Efficiency vs. number of processors for the 1000k- and 200k-node cases, with ideal efficiency shown for reference.]

[Figure 7: Efficiency on Bassi. Efficiency vs. number of processors for the 1000k-node case, with ideal efficiency shown for reference.]

[Figure 8: Efficiency on BlueGene/L. Efficiency vs. number of processors for the 1000k-node case, with ideal efficiency shown for reference.]
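As a concrete companion to Equation 1 and Figures 5-8, the Python sketch below computes efficiency from a table of wall-clock times, normalizing to the smallest processor count actually run, as was done for the figures. The timing values in the example are invented placeholders, not measured results.

    # Sketch: parallel efficiency per Equation (1), normalized to the smallest
    # processor count p0 that was actually run (T(1) was not measured for the
    # largest cases). Example timings below are hypothetical.

    def efficiency(timings):
        """timings: dict mapping processor count p -> wall-clock time T(p).
        Returns dict p -> E_p = (p0 * T(p0)) / (p * T(p))."""
        p0 = min(timings)
        base = p0 * timings[p0]
        return {p: base / (p * t) for p, t in sorted(timings.items())}

    example = {8: 4000.0, 16: 2100.0, 32: 1150.0, 64: 700.0}   # placeholder data
    for p, e in efficiency(example).items():
        print(f"{p:4d} procs: efficiency {e:.2f}")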
4.3 Dependence on Communications

Case 3 was used to test ADCIRC's dependence on communications and available bandwidth. Loni and Bassi were used for these tests. The test consisted of timing an 8-processor ADCIRC run using the following two configurations:

1. 8 processes running on 8 CPUs of a single P5 compute node
2. 8 processes running 1 process on each of 8 P5 compute nodes

The IBM P5 has 8 processors per node. For jobs that require 8 or fewer processors, one can run entirely within a node; in this mode, MPI communications are simply shared memory operations, and the simulation effectively achieves very high bandwidth at very low latency. Alternatively, one can run 8 processes across 8 nodes and make use of the Federation switch (a dual-plane HPS on Bassi). Two ADCIRC runs, one of each type described above, were completed for the 30K-node problem. The runs were several hours long and completed within one second of each other. This indicates that, at least for this example problem, ADCIRC is completely computation bound.

5 Issues

5.1 Case Variability

Because these tests were done in an ad hoc way, little control was available for dictating the type or size of the cases used to benchmark the ADCIRC code. In the future we hope to have more control over the size and type of cases we are able to run.

5.2 Validating Model Results

The only metric for deeming a run "successful" was that it terminated with an MPI status of 0. This, however, did not tell us whether the model ran properly. Future benchmarking cases must include a reliable way to determine whether the model performed properly under the given test conditions. Improper execution will affect the accuracy of any other metrics, and it will affect how we are able to compare the performance of ADCIRC for a particular case across platforms and numbers of CPUs. It has been noted that this is hard to do, because it requires an application-specific test of accuracy. This could be a bitwise comparison of output files, but many applications either encode run-time information, such as the current date, in the output file, or do not produce identical results on different numbers of processors due to changes in rounding from different operation ordering.

6 Future Benchmarking

Future benchmarking efforts will be performed in a much more controlled way. The following sections describe the target areas and basic plan.

6.1 Validation

The only validation that will be done will be making sure that each ADCIRC case run returns reasonable results. This will be done by judging against known good results. In the event that known good results are not available, one can instead verify that all of the runs (on varying numbers of processors) return similar results, where "similar" means that a node-by-node comparison of values shows only a small difference, within some error tolerance determined by the tester, for all time steps. A minimal sketch of such a comparison appears at the end of this section.

6.2 Profiling

More accurate, detailed profiling of ADCIRC will be performed. The following tools are currently being evaluated: oprofile, valgrind, and gprof.

6.3 Goals

1. Compile enough statistics so that it is possible to approximate the most cost-effective number of processors on which to execute an N-node mesh (at least for the meshes used in this benchmarking).
2. Provide an ADCIRC benchmarking bundle that will allow others to perform their own benchmarks.
3. Gain a good understanding of what ADCIRC is doing and how well it is doing it.

6.4 Needs

1. A generic mesh with results we can validate against
2. The capability to refine the mesh and modify input files accordingly
3. adcprep working on all test machines
4. A validation program for results
5. The ability to run on 1000s of nodes (BlueGene/L)
6. Specification of the tests we will be running on each machine (cases, nodes, ADCIRC modes, etc.)
7. A detailed test plan with test specifics, milestones, and time lines

Current plans are to run in 2D and 3D, for all meshes, forcing with tides only and with tides plus winds. Another option would be to include a variation of the tests with element wetting/drying enabled and harmonic analysis turned on.
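The Python sketch below illustrates the node-by-node comparison described in Section 6.1: two runs are deemed similar if every nodal value at every recorded time step agrees within a tester-chosen tolerance. The flat text layout assumed here (one record per line of whitespace-separated nodal values) is hypothetical and would need to be adapted to ADCIRC's actual output formats.

    # Sketch: declare two runs "similar" if their nodal values agree within a
    # tolerance at every recorded time step. File format is an assumption.

    import numpy as np

    def runs_similar(file_a, file_b, rel_tol=1e-4, abs_tol=1e-6):
        """Each file: one line per record of whitespace-separated nodal values."""
        a = np.loadtxt(file_a)
        b = np.loadtxt(file_b)
        if a.shape != b.shape:             # different record/node counts
            return False
        return np.allclose(a, b, rtol=rel_tol, atol=abs_tol)

    # Usage (hypothetical file names):
    # runs_similar("run_8procs.dat", "run_64procs.dat")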
7 Conclusions

Based on profiling, communication overhead appears to dominate the total time to solution once the workload for each processor is reduced to the point where the majority of the time is spent in MPI_Allreduce. At this point, a sharp decline in speedup and efficiency is observed. Increasing the complexity of the problem by increasing the number of nodes/elements, introducing external forcing (winds, etc.), or executing in 3D mode will allow a problem to scale across a greater number of processors. However, a point will eventually be reached where communication overhead again dominates the time spent in computation. Although this was not thoroughly investigated, it should be possible to estimate how well a particular case will scale from preliminary performance metrics and case characteristics.

References

[1] R. Luettich and J. Westerink. Formulation and Numerical Implementation of the 2D/3D ADCIRC Finite Element Model Version 44.XX, December 2004.

[2] R. Luettich and J. Westerink. ADCIRC: A (Parallel) ADvanced CIRCulation Model for Ocean, Coastal, and Estuarine Waters, December 2005.

[3] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, USA, second edition, 1989.

[4] Oceanography Division, Naval Research Laboratory, Stennis Space Center, MS. A Coupled Hydrodynamic-Wave Model for Simulating Wave and Tidally-Driven 2D Circulation in Inlets, 2002.