SPEC HPG Benchmarks for HPC Systems

Matthijs van Waveren (SPEC HPG Secretary; Fujitsu Systems Europe Ltd; waveren@fujitsu.fr), Kumaran Kalyanasundaram (SPEC HPG Chair; Silicon Graphics, Inc.; kumaran@sgi.com), Greg Gaertner (Hewlett-Packard Corporation), Wesley Jones (National Renewable Energy Laboratory; Wesley.Jones@nrel.gov), Rudolf Eigenmann (Purdue University; eigenman@ecn.purdue.edu), Ron Lieberman (SPEC HPG Vice Chair; Hewlett-Packard Corporation; lieb@rsn.hp.com), Matthias S. Müller (TU Dresden; matthias.mueller@tu-dresden.de), Brian Whitney (Sun Microsystems; Brian.Whitney@sun.com), and Hideki Saito (Intel Corporation; hideki.saito@intel.com)

Abstract— In this paper, we discuss results and characteristics of the benchmark suites maintained by the Standard Performance Evaluation Corporation's (SPEC) High-Performance Group (HPG). Currently, SPEC HPG has two lines of benchmark suites for measuring the performance of large-scale systems: SPEC OMP and SPEC HPC2002. SPEC OMP uses the OpenMP API and includes benchmark suites intended for measuring the performance of modern shared memory parallel systems. SPEC HPC2002 uses both OpenMP and MPI, and thus it is suitable for distributed memory systems, shared memory systems, and hybrid systems. SPEC HPC2002 contains benchmarks from three popular application areas: chemistry, seismic processing, and weather forecasting. Each of the three benchmarks in HPC2002 has a small and a medium data set, in order to satisfy the need for benchmarking a wide range of high-performance systems. We analyze published results of these benchmark suites with regard to scalability. We also discuss current efforts of SPEC HPG to develop a SPEC MPI benchmark suite based on MPI applications.

I. INTRODUCTION

SPEC (the Standard Performance Evaluation Corporation) is an organization for creating industry-standard benchmarks to measure various aspects of modern computer system performance. SPEC's High-Performance Group (SPEC HPG) is a workgroup aimed at benchmarking high-performance computer systems. In June of 2001, SPEC HPG released the first of the SPEC OMP benchmark suites, SPEC OMPM2001. This suite consists of a set of OpenMP-based [1], [2] application programs. The data sets of the SPEC OMPM2001 suite (also referred to as the medium suite) are derived from state-of-the-art computation on modern medium-scale (4- to 16-way) shared memory parallel systems. Aslot et al. [3] have presented the benchmark suite. Aslot et al. [4] and Iwashita et al. [5] have described performance characteristics of the benchmark suite. The second, large suite, SPEC OMPL2001, focusing on 32-way and larger systems, was released in May 2002. SPEC OMPL2001 shares most of the application code base with SPEC OMPM2001, but the code and the data sets have been improved and made larger to achieve better scaling and also to reflect the class of computation regularly performed on such large systems [6]. So far, SPEC OMP performance has been reported for systems with up to 128 processors.

SPEC HPC2002 is the latest release of the HPC benchmark suite. It is suitable for shared and distributed memory machines or clusters of shared memory nodes. The SPEC HPC applications have been collected from among the largest, most realistic computational applications that are available for distribution by SPEC. In contrast to SPEC OMP, they are not restricted to any particular programming model or system architecture. Both shared-memory and message passing methods are supported.
All codes of the current SPEC HPC2002 suite are available in both an MPI and an OpenMP programming model, and they include two data set sizes. The SPEC MPI benchmark suite is currently under development. It will be based on real MPI application codes. It was felt that there is a need in the marketplace for a standard MPI benchmark suite based on real applications, which uses the SPEC model of result submission and publication, and SPEC HPG aims to fill this need.

Performance characteristics of application programs on large-scale systems are often significantly different from those on smaller systems. In our previous paper [6] we discussed the scaling of the SPEC OMP benchmarks. In this paper, we characterize the performance behavior of large-scale systems (32-way and larger) using the SPEC OMPL and HPC2002 benchmark suites. In Section 2, we provide a short description of the applications contained in the benchmarks. Section 3 analyzes the published results of SPEC OMPL2001 and SPEC HPC2002 on large systems, based on application program behavior and systems' architectural features. Section 4 concludes the paper.

II. DESCRIPTION OF THE BENCHMARKS

A. Overview of the SPEC OMPL2001 Benchmark Suite

The SPEC OMPL2001 benchmark suite consists of 9 application programs, which represent the type of software used in scientific and technical computing. The applications include modeling and simulation programs from the fields of chemistry, mechanical engineering, climate modeling, and physics. Of the 9 application programs, 7 are written in Fortran, and 2 (ART and EQUAKE) are written in C. The benchmarks require a virtual address space of about 6.4 GB in a 16-processor run. The rationale for this size was to provide data sets significantly larger than those of the SPEC OMPM benchmarks, with a requirement for a 64-bit address space. Descriptions of the 9 application codes are provided in Tab. I.

TABLE I
DESCRIPTION AND NUMBER OF LINES OF THE OMPL BENCHMARK APPLICATIONS

APPLU (4000 lines): Solves 5 coupled non-linear PDEs on a 3-dimensional logically structured grid, using the Symmetric Successive Over-Relaxation implicit time-marching scheme [7].

APSI (7500 lines): Lake environmental model, which predicts the concentration of pollutants. It solves the model for the mesoscale and synoptic variations of potential temperature, wind components, and for the mesoscale vertical velocity, pressure, and distribution of pollutants.

MGRID (500 lines): Simple multigrid solver, which computes a 3-dimensional potential field.

SWIM (400 lines): Weather prediction model, which solves the shallow water equations using a finite difference method.

FMA3D (60000 lines): Crash simulation program. It simulates the inelastic, transient dynamic response of 3-dimensional solids and structures subjected to impulsively or suddenly applied loads. It uses an explicit finite element method [8].

ART (1300 lines): Adaptive Resonance Theory neural network, which is used to recognize objects in a thermal image [9]. The objects in the benchmark are a helicopter and an airplane.

GAFORT (1500 lines): Computes the global maximum fitness using a genetic algorithm. It starts with an initial population and then generates children who go through crossover, jump mutation, and creep mutation with certain probabilities.

EQUAKE (1500 lines): Earthquake-modeling program. It simulates the propagation of elastic seismic waves in large, heterogeneous valleys in order to recover the time history of the ground motion everywhere in the valley due to a specific seismic event. It uses a finite element method on an unstructured mesh [10].

WUPWISE (2200 lines): Wuppertal Wilson Fermion Solver, a program in the field of lattice gauge theory. Lattice gauge theory is a discretization of quantum chromodynamics. Quark propagators are computed within a chromodynamic background field. The inhomogeneous lattice-Dirac equation is solved.
B. Overview of the SPEC HPC2002 Benchmark Suite

SPEC HPC2002 is a benchmark suite based on high-performance computing (HPC) applications and the MPI and OpenMP standards for parallel processing. It is targeted at those who evaluate performance for HPC systems, including users, system vendors, software vendors, and researchers. It uses a set of realistic applications to measure the performance of the computing system's processors, memory architecture, and operating system. SPEC HPC2002 improves upon and replaces the SPEC HPC96 benchmark suite. The SPEC HPC2002 suite comprises three benchmarks, each with a small and a medium data set. A short description of the benchmark applications is provided in Tab. II.

TABLE II
DESCRIPTION AND NUMBER OF LINES OF THE HPC2002 BENCHMARK APPLICATIONS

SPECenv (145000 lines of F90 and 25000 lines of C): Based on the WRF weather model, a state-of-the-art, non-hydrostatic mesoscale weather model; see http://www.wrf-model.org.

SPECseis (25000 lines of F77 and C): Developed by ARCO beginning in 1995 to gain an accurate measure of the performance of computing systems as it relates to the seismic processing industry, for the procurement of new computing resources.

SPECchem (120000 lines of F77 and C): Used to simulate molecules ab initio, at the quantum level, and to optimize atomic positions. It is a research interest under the name of GAMESS at the Gordon Research Group of Iowa State University and is of interest to the pharmaceutical industry.

The SPECenv application is developed within the WRF (Weather Research and Forecasting) Modeling System development project. This is a multi-year project being undertaken by several agencies. Members of the WRF Scientific Board include representatives from EPA, FAA, NASA, NCAR, NOAA, NRL, USAF, and several universities. SPEC HPG integrated version 1.2.1 of the WRF weather model into the SPEC tools for building, running, and verifying results. This means that the benchmark runs on more systems than WRF has officially been ported to. It can run in OpenMP mode, MPI mode, or mixed MPI-OpenMP (hybrid) mode. The benchmark runs use restart files that are created after the model has run for several simulated hours. This ensures that cumulus and microphysics schemes are fully developed during the benchmark runs.

Fig. 1 shows the scalability of SPECenv for the different data sets and for the OpenMP and MPI modes on a Sun Fire 6800. The medium data set shows better scalability. The best programming model depends on the platform and data set; here, OpenMP is better for the small data set and MPI for the medium data set.

Fig. 1. Speedup of SPECenv for different data sets and programming models on a Sun Fire 6800 platform (x-axis: number of threads or processes).
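To make the three execution modes concrete, the following minimal C sketch illustrates the typical structure of a hybrid MPI-OpenMP code: MPI distributes the work across processes, and OpenMP threads share the work within each process. The sketch is purely illustrative; it is not taken from the SPECenv sources, and the identifiers and problem size in it are hypothetical.

/* Minimal sketch of a hybrid MPI+OpenMP program: MPI splits the
 * problem across processes, OpenMP threads work on each process's
 * local part. Illustrative only -- not code from the benchmarks. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* Request threaded MPI; FUNNELED suffices when only the master
       thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 1000000;        /* hypothetical global problem size */
    int local_n = N / nprocs;     /* each rank owns one block */
    double local_sum = 0.0, global_sum;

    /* OpenMP parallelism inside each MPI process */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < local_n; i++) {
        double x = (double)(rank * local_n + i);
        local_sum += x * x;       /* stand-in for the real computation */
    }

    /* MPI parallelism across processes */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("%d processes x %d threads, sum = %e\n",
               nprocs, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}

A pure MPI run corresponds to one OpenMP thread per process, and a pure OpenMP run to a single MPI process; the hybrid mode varies both factors.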
SPECseis consists of a modeling phase, which generates synthetic seismic traces for any size of data set, with flexibility in the geometry of shots and receivers, ground structures, varying lateral velocity, and many other options. A subsequent phase stacks the traces into common midpoint stacks. There are two imaging phases, which produce the valuable output seismologists use to locate resources of oil. The first of the two imaging phases is a Fourier method, which is very efficient but which does not take into account variations in the velocity profile. Yet, it is widely used and remains the basis of many methods for acoustic imaging. The second imaging technique is a much slower finite-difference method, which can handle variations in the lateral velocity. This technique is used in many seismic migration codes today. SPECseis can run in OpenMP or MPI mode.

Fig. 2 shows the scalability of SPECseis for the different data sets and for the OpenMP and MPI modes on a Sun Fire 6800. The medium data set shows better scalability. The best programming model depends on the platform and data set; in this case, OpenMP has better scalability.

Fig. 2. Speedup of SPECseis for different data sets and programming models on a Sun Fire 6800 platform (x-axis: number of threads or processes).

Like SPECseis, SPECchem is often used by computer vendors to exhibit the performance of high-performance systems. Portions of the SPECchem code date back to 1984. It comes with many built-in functionalities, such as various field molecular wave-functions, certain energy corrections for some of the wave-functions, and simulation of several different phenomena. Depending on the wave-functions chosen, SPECchem has the option to output energy gradients of these functions, find saddle points of the potential energy, compute the vibrational frequencies and IR intensities, and more. SPECchem can run in OpenMP mode, MPI mode, or mixed MPI-OpenMP (hybrid) mode.

Fig. 3 shows the scalability of SPECchem for the different data sets and for the OpenMP and MPI modes on a Sun Fire 6800. The medium data set shows better scalability, despite the fact that this data set was measured on a machine with faster processors. In this case, MPI has better scalability for the small data set.

Fig. 3. Speedup of SPECchem for different data sets and programming models on a Sun Fire 6800 platform, normalized by the 2-way speedup (small data set at 1200 MHz, medium data set at 900 MHz).

C. Development of the SPEC MPI Benchmark Suite

In May 2004, SPEC HPG decided to start development of a SPEC MPI benchmark suite. It was felt that there is a need in the marketplace for a standard MPI benchmark suite based on real applications, which uses the SPEC model of result submission and publication, and SPEC HPG aims to fill this need. The MPI benchmark will provide performance metrics that can be used to compare different hardware architectures (SMP, NUMA, clusters) and interconnects, processors, memory hierarchies, compilers, and MPI implementations.

The draft benchmark suite currently consists of 16 candidate application codes from the fields of CFD, weather forecasting, molecular dynamics, quantum chemistry, biochemistry, and finite element methods. The contents of the suite are not fixed yet. It is not expected that all of the candidate codes will be accepted, and more codes will be added to the list of candidates. SPEC HPG is still open to MPI application code submissions. MPI applications from any application area can be submitted, but SPEC prefers programs that are considered state-of-the-art in a given field. The application should also be portable across different hardware and software architectures. The SPEC tool set used is shared with the tool set of the upcoming update of the SPEC CPU benchmark suite. The MPI benchmark suite is being tested by HPG members, ensuring portability across a wide selection of computing resources.
III. LARGE SYSTEM PERFORMANCE OF THE SPEC BENCHMARK SUITES

Performance characteristics of application programs on large-scale systems are often significantly different from those on smaller systems. Figure 4 shows the scaling of Amdahl's speedup for 32 to 128 threads, normalized by the Amdahl's speedup for 16 threads. Amdahl's law predicts the speedup on p processors to be S(p) = 1 / ((1 - f) + f/p), where f is the parallel coverage. The parallel coverage is the fraction of the sequential execution time that is enclosed by a parallel construct. Amdahl's speedup assumes perfect scaling of the parallel portion of the program. Actual programs and actual hardware have additional sources of overhead, which degrade the performance obtained on a real system relative to the prediction given by Amdahl's law. On the other hand, Amdahl's speedup does not take into account that performance can depend strongly on the data size per processor.

Fig. 4. Scaling of Amdahl's speedup of OMPM2001 through 128 threads, normalized by the 16-thread speedup.
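As a worked example of why the curves in Fig. 4 fall well below the linear line, assume a parallel coverage of f = 0.99; this value is chosen purely for illustration and is not measured from the benchmarks:

\[
S(16) = \frac{1}{0.01 + 0.99/16} \approx 13.9, \qquad
S(128) = \frac{1}{0.01 + 0.99/128} \approx 56.4,
\]
\[
\frac{S(128)}{S(16)} \approx 4.1 \quad \text{versus the ideal ratio } \frac{128}{16} = 8 .
\]

Even with 99% parallel coverage, Amdahl's law already halves the normalized speedup at 128 threads, before any real-system overheads are taken into account.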
Figures 5, 6, and 7 show the scaling data for published benchmark results of SPEC OMPL2001. The numbers listed in the following figures have been obtained from the results published by SPEC as of November 2003. For the latest results published by SPEC, see http://www.spec.org/omp/results and http://www.spec.org/hpc2002/results. All results shown conform to the Base Metrics reporting rules. Base Metrics are produced by building all applications in the suite with a common set of optimizations and without any modifications to the source or directives. For better presentation of the graphs, we have normalized all results by the 32-processor results of the same type of system configuration. If the same system has faster and slower processor configurations, we used the scores with the faster processors. In order to keep the graphs readable, we have selected the systems that provided at least 32- and 64-processor results.

TABLE III
PLATFORMS USED FOR THE OMPL BENCHMARKS ("-" DENOTES NO SUCH CACHE LEVEL)

Vendor    Clock [MHz]   L1 Inst.   L1 Data   L2 Cache   L3 Cache
HP PA     875           0.75 MB    1.5 MB    -          -
HP IA64   1500          16 KB      16 KB     256 KB     6144 KB
Sun       1200          32 KB      64 KB     8 MB       -
SGI       400           32 KB      32 KB     8 MB       -
Fujitsu   1299          128 KB     128 KB    2 MB       -

A. Scalability of OMPL Benchmarks

As of November 2003, 28 results had been published for OMPL and 76 for OMPM. These results were used for the figures. Note that as of January 2006, 39 results have been published for OMPL and 141 for OMPM. In this section we focus on the results with the large data set. Figures 5, 6, and 7 show the scalability of the SPEC OMPL benchmark applications. The results of five different architectures, described in Tab. III, are shown: an HP Superdome with PA-8700+ CPUs, an HP Superdome with Itanium 2 CPUs, a Sun Fire 15K with UltraSPARC III CPUs, an SGI Origin 3800 with R12000 CPUs, and a Fujitsu PRIMEPOWER system with SPARC64 V CPUs. The benchmarks WUPWISE, SWIM, FMA3D, and ART show good scalability up to 128 processors.

Fig. 5. Scalability of the OMPL benchmarks that scale well to 128 processors (WUPWISE, SWIM, FMA3D, ART) on HP Superdome, Sun Fire 15K, and SGI Origin 3800, normalized by the 32-thread speedup.

In order for SWIM to scale well, the bandwidth to main memory needs to scale with the number of processors. To increase scalability, OMPL2001 SWIM has more parallel loops than the OMPM2001 version. In addition, some scalar computation is performed in parallel, in favor of improved locality. This is of special importance for machines with non-uniform memory access that use the first-touch algorithm to place pages. OMPL2001 ART calls malloc() more efficiently than the OMPM2001 version. This change reduces contention on malloc(), and thus improves the scalability of ART.

The benchmark APPLU shows superlinear scaling on the HP Superdome and the Fujitsu PRIMEPOWER. In our previous paper, we also presented superlinear scaling for the SGI Origin 3800 and the Fujitsu PRIMEPOWER 2000 [6]. This is due to a more efficient usage of the cache as more processors are used. The same effect is visible on the Sun Fire 15K. According to the cache sizes of these systems, the sweet spot of the aggregate cache amount is between 64 MB and 96 MB. In the OMPL2001 version, false sharing was reduced by moving one of the OpenMP DO directives from the outermost loop to the second-level loop.

Fig. 6. Superlinear scaling of APPLU on HP Superdome and Fujitsu PRIMEPOWER, normalized by the 32-thread speedup.

The benchmarks EQUAKE, MGRID, APSI, and GAFORT show good scaling up to 64 processors, but poor scaling for larger numbers of processors. The HP Superdome and SGI Origin 3800 scaled less well on EQUAKE. MGRID and EQUAKE are sparse matrix calculations, which do not scale well to large numbers of processors. In order to gain more scalability in the OMPL2001 version, we exploited more parallelism in EQUAKE, resulting in better scaling on the HP Superdome and SGI Origin 3800. The larger data set in OMPL2001 helped the scaling of MGRID. Compared to OMPM2001 APSI, OMPL2001 APSI has a larger trip count of 240 for the corresponding loop. OMPL2001 APSI also has an improved work array distribution scheme as well as improved handling of parallel reduction operations.

Fig. 7. Scalability of the OMPL benchmarks that scale well to 64 processors (MGRID, EQUAKE, APSI, GAFORT) on HP Superdome, Sun Fire 15K, and SGI Origin 3800, normalized by the 32-thread speedup.
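The remarks above on SWIM's parallel initialization and first-touch page placement can be illustrated with a small OpenMP sketch in C. This is our own schematic, not code from any of the benchmarks; the array names, sizes, and the stencil are invented for illustration.

/* Illustration of first-touch page placement on a NUMA system:
 * initializing an array inside an OpenMP loop with the same schedule
 * as the compute loop puts each page close to the thread that later
 * uses it. Hypothetical example -- not benchmark source code. */
#include <stdlib.h>

#define N (1 << 24)

int main(void)
{
    double *u    = malloc(N * sizeof *u);
    double *unew = malloc(N * sizeof *unew);

    /* Parallel initialization: each thread first-touches its own chunk,
       so the OS allocates those pages in that thread's local memory. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        u[i] = 0.0;
        unew[i] = 0.0;
    }

    /* Compute loop with the same static schedule touches the same
       chunks, so most accesses stay local and the aggregate memory
       bandwidth grows with the number of processors used. */
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < N - 1; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    free(u);
    free(unew);
    return 0;
}

Because the initialization and compute loops use the same static schedule, each thread mostly reads pages resident in its local memory, which is the kind of behavior a bandwidth-bound code such as SWIM needs in order to scale.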
B. Scalability of HPC Benchmarks

As of November 2003, 38 results had been published for SPEC HPC2002: 18 for the small and 20 for the medium data set. Results with up to 128 processes or threads have been published. These results were used for the figures. Note that as of January 2006, 104 results have been published for SPEC HPC2002. In this section we focus on the medium data set results on an IBM SP with 128 Power3 CPUs running at 375 MHz and an SGI Origin 3800 with 128 R14000A CPUs at 600 MHz. In addition, we show the results of a Sun Fire 6800 and a Sun Fire 15K with UltraSPARC III CPUs at 900 MHz or 1200 MHz (indicated in the graphs), and a 64-CPU Sun Fire 880 cluster with 8-way SMP nodes connected by Myrinet.

Fig. 8 shows the scalability of SPECenv for the medium data set. All results use the MPI model of execution. The benchmark shows good scalability up to 128 processors.

Fig. 8. Speedup of SPECenv for the medium data set, normalized by the 16-way speedup.

Fig. 9 shows the scalability of SPECseis for the medium data set. The scaling behavior depends strongly on the programming model and platform. While it shows almost perfect scaling (81% efficiency) on 16 processors, the efficiency on a Sun Fire 15K is much lower.

Fig. 9. Speedup of SPECseis for the medium data set, normalized by the 4-way speedup.

For the SPECchem benchmark, only results from IBM and Sun are available for the medium data set. For both published IBM results, the MPI model of execution was used. On the IBM, the efficiency of the 32-processor run is 82% compared to the 16-processor run. Using OpenMP, the Sun Fire 15K shows perfect, almost superlinear scaling from 16 to 32 processors, but for 64 processors the efficiency is only 57% compared to the 16-processor run. Better efficiency is achieved with the hybrid execution model, where 81% is reached with 64 CPUs on the Sun Fire 880 cluster.

Fig. 10. Speedup of SPECchem for the medium data set, normalized by the 16-way speedup.

Although two of the HPC2002 benchmarks can be used in hybrid mode, all of the published results are limited to either pure OpenMP or pure MPI mode. Tab. IV shows the potential benefit of a hybrid execution of SPECenv. A similar benefit is visible for SPECchem in Fig. 11, since the MPI and OpenMP parallelism are exploited at different levels of granularity.

TABLE IV
SCALING OF DIFFERENT EXECUTION MODELS (MPI PROCESSES x OPENMP THREADS) OF SPECENV FOR THE SMALL AND MEDIUM DATA SETS. PLEASE NOTE THE DIFFERENT BASE POINTS.

4-way Itanium, small data set:    1x1 = 1.00    4x1 = 1.72    1x4 = 1.39    2x2 = 1.98
Sun Fire 6800, medium data set:   8x1 = 1.00    16x1 = 1.76   1x16 = 1.56   8x2 = 1.83
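For clarity, the efficiency figures quoted in this subsection (82%, 57%, 81%) can be read as parallel efficiency relative to the base run; this reading is our interpretation, but it reproduces the quoted numbers. With T_p the runtime on p processors and p_0 the base processor count:

\[
E(p) = \frac{T_{p_0}/T_p}{\,p/p_0\,},
\qquad\text{e.g.}\quad
\frac{T_{16}}{T_{32}} = 1.64 \;\Rightarrow\; E(32) = \frac{1.64}{2} = 82\% .
\]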
IV. CONCLUSION AND FUTURE WORK

In this paper we have analyzed the performance characteristics of published results of the SPEC OMPL2001 and HPC2002 benchmark suites. We have found that the OMPL benchmark programs scale well up to 64 processors. For the HPC2002 benchmarks we have also demonstrated the impact of the choice of the execution model. The results show that the best choice of MPI, OpenMP, or hybrid execution depends on the hardware architecture as well as on the program and the data set. Although HPC2002 is not limited to shared memory platforms, no results for larger machines are available so far. We attribute this to the relatively recent release of HPC2002 and expect it to change in the near future.

The trends of the SPEC HPC2002 codes indicate clear limits of scalability. We conclude that, even given sizeable data sets, large-scale, realistic applications do not exhibit the near-ideal speedups that some of the smaller benchmarks suggest. While this is an expected finding for many readers, demonstrating the evidence is an important result of the SPEC HPC2002 development. The fact that SPEC benchmarks and reports are fully disclosed will allow both scientists and engineers to identify the causes that limit performance and to develop remedies.

Fig. 11. Speedup of SPECchem for MPI, OpenMP, and hybrid execution on a Sun Fire 15K. The lines connecting hybrid runs denote runs with the same number of processes times threads.

We have combined OMPM and OMPL into one benchmark providing different data set sizes. It also includes alternative sources that have been submitted to SPEC according to the run rules of the benchmark. SPEC HPG is currently developing a new benchmark suite that will be based on MPI application codes. The MPI benchmark will provide performance metrics that can be used to compare different hardware architectures (SMP, NUMA, clusters) and interconnects, processors, memory hierarchies, compilers, and MPI implementations. SPEC HPG is open for MPI application submissions. A good candidate program would represent a type of computation that is regularly performed on high-performance computers.

ACKNOWLEDGMENTS

The authors would like to thank all of those who participated in the benchmark development, especially the authors of the application programs and the contributors of the data sets. Benchmark development would have been impossible without their hard work.

REFERENCES

[1] OpenMP C and C++ Application Programming Interface, Version 2.0, OpenMP Architecture Review Board, March 2002, http://www.openmp.org/specs/mpdocuments/cspec20.pdf.
[2] OpenMP Fortran Application Programming Interface, Version 2.0, OpenMP Architecture Review Board, November 2000, http://www.openmp.org/specs/mpdocuments/fspec20.pdf.
[3] V. Aslot, M. Domeika, R. Eigenmann, G. Gaertner, W. B. Jones, and B. Parady, "SPEComp: a new benchmark suite for measuring parallel computer performance," in WOMPAT'01: Workshop on OpenMP Applications and Tools, ser. LNCS, vol. 2104. Springer, July 2001, pp. 1-10.
[4] V. Aslot and R. Eigenmann, "Performance characteristics of the SPEC OMP2001 benchmarks," in 3rd European Workshop on OpenMP, EWOMP'01, Barcelona, Spain, September 2001.
[5] H. Iwashita, E. Yamanaka, N. Sueyasu, M. van Waveren, and K. Miura, "The SPEC OMP 2001 benchmark on the Fujitsu PRIMEPOWER system," in 3rd European Workshop on OpenMP, EWOMP'01, Barcelona, Spain, September 2001.
[6] H. Saito, G. Gaertner, W. Jones, R. Eigenmann, H. Iwashita, R. Lieberman, M. van Waveren, and B. Whitney, "Large system performance of SPEC OMP benchmark suites," Int'l Journal of Parallel Programming, vol. 31, no. 3, June 2003.
[7] E. Barszcz, R. Fatoohi, V. Venkatkrishnan, and S. Weeratunga, "Solution of regular sparse triangular systems on vector and distributed-memory multiprocessors," NASA Ames Research Center, Tech. Rep. RNR-93-007, 1993.
[8] S. Key and C. C. Hoff, "An improved constant membrane and bending stress shell element for explicit transient dynamics," Computer Methods in Applied Mechanics and Engineering, vol. 124, pp. 33-47, 1995.
[9] M. Domeika, C. Roberson, E. Page, and G. Tagliarini, "Adaptive resonance theory 2 neural network approach to star field recognition," in Applications and Science of Artificial Neural Networks II, ser. Proc. SPIE, vol. 2760, S. K. Rogers and D. W. Ruck, Eds. International Society for Optical Engineering, 1996, pp. 589-596.
[10] H. Bao, J. Bielak, O. Ghattas, L. F. Kallivokas, D. R. O'Hallaron, J. R. Shewchuk, and J. Xu, "Large-scale simulation of elastic wave propagation in heterogeneous media on parallel computers," Computer Methods in Applied Mechanics and Engineering, vol. 152, no. 1-2, pp. 85-102, January 1998.
Matthijs van Waveren has contributed to the development of OpenMP Fortran and C/C++ 2.0, Fortran 95, and Fortran 2003. He is the representative of Fujitsu Ltd in the SPEC High-Performance Group, the OpenMP Architecture Review Board, and the INCITS J3 committee on Fortran 2008. He received the Ph.D. degree from the University of Amsterdam, the Netherlands, in 1989. He worked at the SARA Computer Centre in Amsterdam, at Schlumberger/Geco-Prakla R&D in Delft, and at the University of Groningen before joining Fujitsu in 1996. He has published over 20 technical papers. He has served on the programme committees of OpenMP workshops in Europe and Japan and as Secretary of the SPEC High-Performance Group. Current address: Fujitsu Systems Europe Ltd, 8, rue Maryse Hilsz, 31500 Toulouse, France. Email: waveren@fujitsu.fr

Kumaran Kalyanasundaram manages the Performance Engineering group at SGI. He is the primary representative for SGI on SPEC's High-Performance Group and Open Systems Group. He also serves as the Chair of HPG. Kumaran received his Ph.D. in Aerospace Engineering & Engineering Mechanics from Iowa State University in May 1995. His dissertation involved developing a fast, parallel, O(N log N) algorithm for generating 2-manifold surfaces in 3-D.

Greg Gaertner was a senior engineer with the High Performance Technical Computing Division of the Hewlett-Packard company. Greg worked in the High Performance Computing Expertise Center. Greg also was actively involved in performance benchmarking for technical clusters. Greg received his M.S. in Physics from the University of Minnesota. Greg has forty-plus years of experience in high performance technical computing, having worked for Control Data, Cray Research, SSI, Digital, Compaq, and Hewlett-Packard. Greg currently resides in Bedford, New Hampshire, USA.

Wesley Jones, Ph.D., is a Senior Scientist in the Computational Sciences Center at the National Renewable Energy Laboratory in Golden, Colorado. He is an expert in scientific and technical computing applications, algorithms, languages, software development, and application programming interfaces. Wesley currently collaborates with NREL scientists and engineers to address challenges of application performance, portability, and scalability.
He is also co-investigator on a DOE grant, "Predicting the Electronic Properties of 3D, Million-Atom Semiconductor Nanostructure Architectures." He spent the 7 years prior to joining NREL at the computer manufacturer SGI, where he was technical lead of the Performance Engineering Team and chair of the Standard Performance Evaluation Corporation High-Performance Group. At SGI, Wesley worked on techniques and algorithms for improving the performance of scientific and technical applications. He also worked with the computer design and production engineers on evaluating and modifying hardware, software, and networking for the application of SGI computers to scientific and technical computing workloads.

Rudolf Eigenmann is a Professor at the School of Electrical and Computer Engineering at Purdue University. His research interests include optimizing compilers, programming methodologies and tools, and performance evaluation for high-performance computers. He has published over 100 technical papers in these areas. Eigenmann received his Ph.D. in Electrical Engineering/Computer Science in 1988 from ETH Zurich, Switzerland. He is the recipient of a 1997 NSF CAREER award and serves on the editorial boards of the International Journal of Parallel Programming and the IEEE Computing in Science and Engineering magazine. He has also served as the chairman/vice-chairman of the High-Performance Group of the Standard Performance Evaluation Corporation (SPEC). He is a member of the IEEE, ACM, and ASEE. Address: Purdue University, School of Electrical and Computer Engineering, 465 Northwestern Avenue, West Lafayette, IN 47907, eigenman@purdue.edu, http://www.ece.purdue.edu/~eigenman/

Ron Lieberman is a senior engineer with the High Performance Technical Computing Division of the Hewlett-Packard company. Ron works primarily on the Fortran 90 compiler for HP-UX. Ron also is actively involved in performance benchmarking for SMP-based technical servers. Ron received his B.S. and M.S. from Central Michigan University in 1985. Ron is Hewlett-Packard's alternate representative on the High-Performance Group of the Standard Performance Evaluation Corporation (SPEC). Address: Hewlett-Packard Company, 3000 Waterview Pkwy, MS: B2W25, Richardson, Texas 75080, lieb@rsn.hp.com

Matthias S. Müller is Deputy Director and CTO of the Center for Information Services and High Performance Computing (ZIH) of the Dresden University of Technology. His research interests include programming methodologies and tools, computational science on high-performance computers, and Grid computing. He received his Ph.D. in Physics in 2001 from the University of Stuttgart, Germany. In his dissertation he studied fast algorithms for the simulation of granular particles. He is a member of the German Physical Society (DPG) and of SPEC HPG.

Brian Whitney is a Senior Staff Engineer with Sun Microsystems, Inc. His interests include parallelization, performance optimization, and machine characterization of high-performance computers. He has over 20 years of experience working with a vast array of computer architectures, including hypercubes, array processors, mini-supercomputers, and vector processors. He graduated from the University of Washington in 1982, where he received his B.S. in Computer Science and in Mathematics. He is currently a member of the SPEC HPG group and serves as the Release Manager.