Preliminary ADCIRC Benchmark Results DRAFT

Brett D. Estrade
Daniel S. Katz
Steve Brandt
Chirag Dekate
September 12, 2006
1 Introduction

1.1 ADCIRC Overview
The Advanced Circulation (ADCIRC) model is a finite element coastal ocean model that simulates the motion of fluid on a rotating Earth. ADCIRC is discretized in space using finite elements and in time using the finite difference method [2]. Because it uses finite elements, unstructured and irregularly-shaped domains may be used. This makes ADCIRC well suited for modeling highly complex geometries, such as coastlines with variable refinement, without the nested meshes required by traditional finite difference codes. For the purpose of these benchmarks, we used version 45.10. The 2DDI (two-dimensional depth-integrated) version of ADCIRC is a two-dimensional, finite element, barotropic (i.e., density depends only on pressure) hydrodynamic model capable of including wind, wave, and tidal forcings as well as river flux into the domain [4]. Following the classification of application algorithms proposed by Simon (http://www.nersc.gov/~simon/Presentations2005/ICCSE05HDSimon.ppt), ADCIRC falls under the unstructured grids category.
Both the depth-integrated and the fully 3D versions of ADCIRC solve a vertically-integrated continuity equation for water surface elevation. ADCIRC utilizes the Generalized Wave Continuity Equation (GWCE) formulation to avoid the spurious oscillations associated with the Galerkin finite element formulation. ADCIRC-2DDI solves the vertically-integrated momentum equations to determine the depth-averaged velocity. In 3D, ADCIRC implements the shallow water form of the momentum equations to solve for the velocity components in the coordinate directions x, y, and z over a generalized stretched vertical coordinate system. The vertical component of velocity, w, is obtained by solving the 3D continuity equation after u and v have been determined from the solution of the 3D momentum equations [1].
ADCIRC is parallelized using a single program multiple data (SPMD) approach, which uses data decomposition to distribute the computational grid among multiple processors. Before performing a parallel simulation, all input files must be decomposed into sub-domains for the desired number of processors using the utility adcprep. This utility uses METIS (a family of programs for partitioning unstructured graphs and hypergraphs and computing fill-reducing orderings of sparse matrices) to partition the grid so that the computational load is balanced across concurrent processes. Because of this SPMD approach, most of the required communication occurs at the boundaries of the sub-domains, where nodes are shared as ghost nodes.
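As an illustration of the data decomposition idea (a sketch, not ADCIRC's actual implementation), the following Python fragment identifies the ghost nodes shared between sub-domains given a hypothetical element-to-partition assignment of the kind adcprep obtains from METIS.

    # Sketch: identify ghost nodes shared between sub-domains after a
    # METIS-style element partition.  Hypothetical data; not ADCIRC code.
    from collections import defaultdict

    # Each element is a triangle given by three node ids (hypothetical mesh).
    elements = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
    # Partition id assigned to each element (e.g., by a graph partitioner).
    element_part = [0, 0, 1, 1]

    # Collect, for every node, the set of partitions that touch it.
    node_parts = defaultdict(set)
    for elem, part in zip(elements, element_part):
        for node in elem:
            node_parts[node].add(part)

    # A ghost node is one touched by more than one partition; these are the
    # nodes exchanged between sub-domains during a parallel run.
    ghost_nodes = sorted(n for n, parts in node_parts.items() if len(parts) > 1)
    print("ghost nodes:", ghost_nodes)   # -> [2, 3]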
ADCIRC is mostly parallelized (some matrix operations, such as the tri-diagonal matrix solver in vsmy.F, operate serially within each sub-problem and could benefit from parallelization), and it uses the Message Passing Interface (MPI) for communication. ADCIRC has been run successfully on many HPC architectures, and it should compile and run on any platform that has a standard Fortran 90 compiler and an implementation of MPI. In most cases, until parallel communication overhead dominates execution time, the time required to complete an ADCIRC simulation decreases roughly linearly as the number of processors increases. ADCIRC has shown super-linear speedup when cache effects allow much of the required data to be accessed directly from processor cache.
1.2 ADCIRC Development and Distribution
ADCIRC is copyrighted by Rick Luettich of the University of North Carolina and Joannes Westerink of the University of Notre Dame. New releases of ADCIRC are handled by Rick Luettich and his team and are distributed on a per-request basis. Development of the ADCIRC code is distributed among groups at UNC, Notre Dame, the University of Texas at Austin, and the University of Oklahoma. There is no regular development cycle or code repository, though there are frequent updates and bug fixes. Typically, academic and research use of the code is permitted without fee; use of the code by for-profit entities requires a fee and additional licensing terms.
Some ADCIRC resources include:
• Main Web Page - http://www.adcirc.org
• Theory Report - http://www.adcirc.org/adcirc_theory_2004_12_08.pdf
• User’s Guide - http://www.adcirc.org/document/ADCIRC_title_page.html
• Example Problems - http://www.adcirc.org/document/Examples.html
1.3 ADCIRC Mesh and Input File Creation
The application of ADCIRC to a particular region is non-trivial, since a mesh and initial data have to be generated by the user. A popular tool for generating meshes is the Surface Water Modeling System (SMS, http://www.ems-i.com/SMS/SMS_Overview/sms_overview.html). Additionally, the acquisition of data such as bathymetry (water depth) for the mesh and wind data for the generation of wind forcing files must be handled by the user. The details and formats of all input files are described fully in [2].
CCT is involved in several projects using ADCIRC, including the SURA Coastal Ocean Observing and Prediction (SCOOP) Program and the development of the Lake Pontchartrain Forecasting System (LPFS) for the Army Corps of Engineers. These use ADCIRC to model the Gulf of Mexico and Lake Pontchartrain using 50,000-node and 30,000-node grids, respectively. The LSU Hurricane Center also uses ADCIRC for predicting coastal storm surge, working primarily with a 600,000-node mesh developed in conjunction with Notre Dame.
2 Benchmarking Methodologies
This paper reflects benchmarking efforts conducted primarily during February 2006 in preparation for a response to the NSF solicitation for a high performance computing system acquisition, “Towards a Petascale Computing Environment for Science and Engineering”. The methods employed for these benchmarking results were ad hoc in the sense that the cases were arbitrary, model results were not validated, and there was limited ability to affect or control the parameters of the cases tested and the environments in which they were tested.
2.1 Timing
The wall clock times reported by the batch queue systems on each HPC platform were used to time each run. No effort was made to modify the code so that only the numerically relevant portions were timed independently of the code's bookkeeping functionality.
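If future benchmarks do isolate the numerically relevant portion, one low-effort approach is to bracket the time-stepping loop with MPI wall-clock timers. The Python/mpi4py sketch below illustrates the pattern; compute_timestep is a hypothetical stand-in for the model's per-step work, not an ADCIRC routine.

    # Sketch: timing only the numerically relevant portion of a run with
    # MPI wall-clock timers (mpi4py).  compute_timestep() is a hypothetical
    # stand-in for the model's time-stepping kernel.
    from mpi4py import MPI

    def compute_timestep(step):
        # placeholder for the real per-step computation
        return sum(i * i for i in range(10000))

    comm = MPI.COMM_WORLD
    comm.Barrier()                 # line up all ranks before timing
    t0 = MPI.Wtime()

    for step in range(100):        # the portion we actually want to measure
        compute_timestep(step)

    comm.Barrier()
    t1 = MPI.Wtime()

    if comm.Get_rank() == 0:
        print(f"time-stepping loop: {t1 - t0:.3f} s on {comm.Get_size()} ranks")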
3 Details
The following section details the specifics of the cases involved in this benchmarking effort.
3.1 Cases
The cases involved in the benchmarking were selected based on availability and convenience, not as a result of careful planning. The essential goal was to get a general idea of how well ADCIRC scaled across processors on the platforms specified in this paper. Another factor to consider is that the model results were not validated in any way; the only criterion for a successful run was a return status of 0. Additionally, all of these meshes were created by globally splitting each element into four. All of these cases were forced with tides (M2) only.
Case   Nodes     Elements   Simulation Length   Time Step   Forcing
1      1001424   1968728    0.5 days            1.0 sec     Tides (M2)
2      254565    492182     0.5 days            5.0 sec     Tides (M2)
3      31435     58369      0.5 days            5.0 sec     Tides (M2)
Cases 1 and 2 were used to determine the scaling efficiency for the platforms used in this effort. Case 3 was used primarily to test ADCIRC’s reliance on communications on Bassi and Loni,
the two IBM P5s used in this testing.
3.2 Platforms
Mach Name    Arch           Proc Type               Procs Used   Interconnect            Location
Bassi        IBM P5         1.9 GHz P5              256          HPS 2 x 2GB/s (peak)    NERSC
Bigben       Cray XT3       2.4 GHz Opterons        512          Cray SeaStar            PSC
BlueGene/L   IBM BGL        700 MHz PPC440          1024         1.6Gb/link Torus        ANL
Cobalt       SGI Altix      1.6 GHz Itanium 2       256          NUMAflex                NERSC
Loni         IBM P5         1.9 GHz P5              256          HPS 1 x 2GB/s (peak)    LSU
Mike         Linux Cluster  Intel Pentium IV Xeon   256          Myrinet 100 Mbit eth    LSU

4 Results
These results should be taken only as a small sampling of ADCIRC's performance on these machines (up-to-the-minute results may be viewed at http://www.cct.lsu.edu/~estrabd/pmwiki/index.php). Among the issues with these results are:
• Reported times were wall clock, and account for the total execution time, not just the numerically relevant code
• Timings were conducted only once per machine per processor count, so variation in run times was not examined
• Runs were only tidally driven, which does not exercise ADCIRC's other forcing options (winds, etc.) or its 3D capabilities
4.1 Scaling
The scaling of ADCIRC for each case run on a particular machine is shown in Figures 1, 2, 3, and 4. Generally, the scaling performance is fairly good.
4.2 Efficiency
It should be reiterated that these results are preliminary, and show the performance over a range of processors for tidally forced ADCIRC runs on a non-realistic mesh.
Efficiency measures how much performance increases as the number of processors is increased. Ideally, one would hope for an efficiency of 1.0, which indicates that the problem scales linearly as the number of processors increases. In other words, an efficiency of 1.0 indicates that as the number of processors doubles, the time to solution is halved. In a practical sense, efficiency is used to determine when an increase in the number of CPUs is not worth the performance gain. Indeed, it is possible for an increase in processors to have a detrimental effect on the run time, meaning that the time to solution increases as the number of processors increases. Equation 1 presents this quantity as shown in [3], where T(p) denotes the time to solution on p processors.
[Figure 1: Scaling on Loni. Log(Time) vs. number of processors for the 1000k, 200k, and 32k node cases.]
[Figure 2: Scaling on Mike. Log(Time) vs. number of processors for the 1000k and 200k node cases.]
E_p = T(1) / (p T(p))                                        (1)
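As a concrete illustration (using hypothetical timings, not the measured benchmark data), the sketch below computes speedup and efficiency from a table of wall-clock times, normalizing against the smallest processor count actually run, as is done for the efficiency figures below.

    # Sketch: compute relative speedup and efficiency from wall-clock times.
    # Timings are hypothetical placeholders, not the measured benchmark data.
    times = {          # processors -> wall-clock seconds
        8: 4000.0,
        16: 2050.0,
        32: 1100.0,
        64: 640.0,
    }

    p_min = min(times)                 # smallest processor count actually run
    t_min = times[p_min]

    for p in sorted(times):
        speedup = t_min / times[p]                       # relative to the p_min run
        efficiency = (p_min * t_min) / (p * times[p])    # E_p with a p_min baseline
        print(f"{p:4d} procs  speedup {speedup:6.2f}  efficiency {efficiency:5.2f}")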
[Figure 3: Scaling on Bassi. Log(Time) vs. number of processors for the 1000k node case.]
[Figure 4: Scaling on BlueGene/L. Log(Time) vs. number of processors for the 1000k node case.]
Figures 5, 6, 7, and 8 show the efficiency profiles for the cases run on the respective machines. Because some cases were not run on a single processor due to excessive run times, the efficiencies reported here are computed relative to the smallest number of processors actually used (ideal efficiency is assumed up to that processor count). Efficiency appears to be good until a particular point is reached. This point is different for each case and for each machine, and it appears to indicate where communication begins to dominate these fixed-size problems.
[Figure 5: Efficiency on Loni. Efficiency vs. number of processors for the 1000k, 200k, and 32k node cases, with the ideal curve.]
[Figure 6: Efficiency on Mike. Efficiency vs. number of processors for the 1000k and 200k node cases, with the ideal curve.]
[Figure 7: Efficiency on Bassi. Efficiency vs. number of processors for the 1000k node case, with the ideal curve.]
[Figure 8: Efficiency on BlueGene/L. Efficiency vs. number of processors for the 1000k node case, with the ideal curve.]
4.3 Dependence on Communications
Case 3 was utilized to test ADCIRC's dependence on communications and available bandwidth. Loni and Bassi were used for these tests. The test consisted of timing an 8-processor ADCIRC run using the following two configurations:
1. 8 processes running on the 8 CPUs of a single P5 compute node
2. 8 processes running with one process on each of 8 P5 compute nodes
The IBM P5 has 8 processors per node. For jobs that require 8 or fewer processors, one can run entirely within a node. In this mode, MPI communications are simply shared memory operations, and the simulation therefore effectively achieves very high bandwidth at very low latency. Alternatively, one can run 8 processes across 8 nodes and make use of the Federation switch (a dual-plane HPS on Bassi). The two configurations were used to test ADCIRC's reliance on communication and available bandwidth.
Two ADCIRC runs, one of each type described above, were completed for a problem size of 30K nodes. The runs were several hours long and completed within one second of each other, indicating that, at least for this example problem and processor count, ADCIRC is effectively computation bound.
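A simple way to quantify the difference between the two placements, independent of ADCIRC, is a point-to-point bandwidth microbenchmark run once within a node and once across nodes. The mpi4py sketch below is one minimal version of such a test; the message size and repetition count are arbitrary choices.

    # Sketch: point-to-point bandwidth test between ranks 0 and 1 (mpi4py).
    # Run with at least 2 MPI ranks: once with both ranks on one node and
    # once with ranks on different nodes, then compare the two results.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    nbytes = 8 * 1024 * 1024          # 8 MB message (arbitrary)
    reps = 50
    buf = np.zeros(nbytes, dtype=np.uint8)

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=1)
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=1)
    t1 = MPI.Wtime()

    if rank == 0:
        # Each repetition moves nbytes in each direction (round trip).
        mb_per_s = (2 * reps * nbytes) / (t1 - t0) / 1e6
        print(f"effective bandwidth: {mb_per_s:.1f} MB/s")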
5 Issues
5.1 Case Variability
Because these tests were done in an ad hoc way, little control was available for dictating the type
or size of the cases we used to benchmark the ADCIRC code. In the future we hope to have more
control of the size and type of cases we are able to run.
5.2 Validating Model Results
The only metric for deeming a run "successful" was that it terminated with an MPI status of 0. This, however, did not tell us whether the model ran properly. Future benchmarking cases must include a reliable way to determine whether the model performed properly under the given test conditions. Improper execution will affect the accuracy of any other metrics, and it will affect how we are able to compare the performance of ADCIRC for a particular case across platforms and numbers of CPUs. It has been noted that this is hard to do, because it requires an application-specific test of accuracy. This could be a bitwise comparison of output files, but many applications either encode run-time information such as the current date in the output file, or do not produce identical results on different numbers of processors due to changes in rounding from different operation ordering.
6 Future Benchmarking
Future benchmarking efforts will be performed in a much more controlled way. The following
sections describe the target areas and basic plan.
6.1 Validation
The only validation that will be done will be to make sure that the ADCIRC case run returns reasonable results. This will be done by judging against known good results. In the event that known good results are not available, one can at least verify that all of the runs (on varying numbers of processors) return similar results (i.e., within some error tolerance determined by the tester). In this case, similar means that a node-by-node comparison of values shows only a small difference between runs at all time steps.
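A minimal sketch of such a node-by-node comparison, assuming each run's results have been dumped as a plain whitespace-separated array of per-node values per time step (a hypothetical layout, not ADCIRC's actual fort.63 format):

    # Sketch: node-by-node comparison of two runs within a tolerance.
    # Assumes each file holds one whitespace-separated value per node per
    # time step; this is a hypothetical layout, not ADCIRC's output format.
    import numpy as np

    def runs_agree(file_a, file_b, rel_tol=1e-4, abs_tol=1e-6):
        a = np.loadtxt(file_a)
        b = np.loadtxt(file_b)
        if a.shape != b.shape:
            return False                     # different node/time-step counts
        # True only if every nodal value at every time step is within tolerance.
        return bool(np.allclose(a, b, rtol=rel_tol, atol=abs_tol))

    if __name__ == "__main__":
        # Hypothetical file names for runs on 8 and 64 processors.
        print(runs_agree("run_8procs.txt", "run_64procs.txt"))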
6.2 Profiling

More accurate, detailed profiling of ADCIRC will be performed. The following tools are currently being evaluated: oprofile, valgrind, and gprof.
6.3 Goals
1. Compile enough statistics so that it is possible to approximate the most cost-effective number of processors for an N-node mesh (at least for the meshes used in this benchmarking).
2. Provide an ADCIRC benchmarking bundle that will allow others to perform their own benchmarks.
3. Gain a good understanding of what ADCIRC is doing and how well it is doing it.
6.4 Needs
1. A generic mesh with results we can validate against
2. The capability to refine the mesh and modify input files accordingly
3. adcprep working on all test machines
4. A validation program for results
5. The ability to run on 1000s of nodes (BlueGene/L)
6. Specification of the tests we will be running on each machine (cases, nodes, ADCIRC modes, etc.)
7. A detailed test plan with test specifics, milestones, and time lines
Current plans are to run in 2D and 3D, for all meshes, forcing with tides only and with tides plus winds. Another option would be to include a variation of the tests with element wetting/drying enabled and harmonic analysis turned on.
7 Conclusions
Based on profiling, communication overhead appears to dominate the total time to solution once the workload for each processor is reduced to the point where the majority of the time is spent in MPI_Allreduce. At this point, a sharp decline in speedup and efficiency is observed. Increasing the complexity of the problem by increasing the number of nodes/elements, introducing external forcing (winds, etc.), or executing in 3D mode will allow a problem to scale across a greater number of processors. However, a point will eventually be reached where communication overhead again dominates the time spent in computation. Although this was not thoroughly investigated, it should be possible to estimate how well a particular case will scale based on some preliminary performance metrics and case characteristics.
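One way to attempt such an estimate (not something done in this study) is to fit a simple computation-plus-communication model, for example T(p) = a/p + b log2(p) + c, to a handful of preliminary timings and extrapolate. The sketch below does this with a least-squares fit over hypothetical timing data.

    # Sketch: fit a simple computation + communication model to a few
    # preliminary timings and extrapolate.  Timing data are hypothetical.
    import numpy as np

    procs = np.array([8, 16, 32, 64], dtype=float)
    times = np.array([4000.0, 2050.0, 1100.0, 640.0])   # wall-clock seconds

    # Model: T(p) = a/p + b*log2(p) + c  (compute term + a crude comm term).
    A = np.column_stack([1.0 / procs, np.log2(procs), np.ones_like(procs)])
    (a, b, c), *_ = np.linalg.lstsq(A, times, rcond=None)

    for p in (128, 256, 512):
        pred = a / p + b * np.log2(p) + c
        print(f"predicted time on {p:4d} procs: {pred:8.1f} s")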
References
[1] R. Luettich and J. Westerink. Formulation and Numerical Implementation of the 2D/3D ADCIRC Finite Element Model Version 44.XX, December 2004.
[2] R. Luettich and J. Westerink. ADCIRC: A (Parallel) ADvanced CIRCulation Model for Ocean, Coastal, and Estuarine Waters, December 2005.
[3] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, USA, second edition, 1989.
[4] Oceanography Division, Naval Research Laboratory, Stennis Space Center, MS. A Coupled Hydrodynamic-Wave Model for Simulating Wave and Tidally-Driven 2D Circulation in Inlets, 2002.