Parallel Computing
Michael Young, Mark Iredell

NWS Computer History
  1968  CDC 6600
  1974  IBM 360
  1983  CYBER 205         first vector parallelism
  1991  Cray Y-MP         first shared-memory parallelism
  1994  Cray C-90         ~16 gigaflops
  2000  IBM SP            first distributed-memory parallelism
  2002  IBM SP P3
  2004  IBM SP P4
  2006  IBM SP P5
  2009  IBM SP P6
  2013  IBM iDataPlex SB  ~200 teraflops

Algorithm of the GFS Spectral Model
One time loop is divided into:
- Computation of the tendencies of divergence, surface pressure, temperature, vorticity, and tracers (grid)
- Semi-implicit time integration (spectral)
- First half of the time filter (spectral)
- Physical effects included in the model (grid)
- Damping to simulate subgrid dissipation (spectral)
- Completion of the time filter (spectral)

Algorithm of the GFS Spectral Model: Definitions
- Operational spectral truncation T574, with a physical grid of 1760 longitudes by 880 latitudes and 64 vertical levels (23 km resolution)
- θ is latitude
- λ is longitude
- l is zonal wavenumber
- n is total wavenumber (zonal + meridional)

Three Variable Spaces
- Spectral       (L x N x K)
- Fourier        (L x J x K)
- Physical grid  (I x J x K)
where I is the number of longitude points, J is the number of latitudes, and K is the number of levels.

The Spectral Technique
All fields possess a spherical harmonic representation:

    F(\lambda, \theta) = \sum_{l=0}^{J} \sum_{n=l}^{J} f_n^l \, P_n^l(\sin\theta) \, e^{il\lambda}

where J is the spectral truncation (J = 574 operationally) and the normalized associated Legendre functions are

    P_n^l(x) = \sqrt{\frac{(2n+1)(n-l)!}{2\,(n+l)!}} \; \frac{(1-x^2)^{l/2}}{2^n \, n!} \; \frac{d^{\,n+l}}{dx^{\,n+l}} (x^2 - 1)^n

Spectral to Grid Transform
Legendre transform:

    F^l(\theta) = \sum_{n=l}^{J} f_n^l \, P_n^l(\sin\theta)

Fourier transform using an FFT:

    F(\lambda, \theta) = \sum_{l=0}^{J} F^l(\theta) \, e^{il\lambda}

Grid to Spectral Transform
The spectral coefficients are given by

    f_n^l = \frac{1}{2\pi} \int_{-\pi/2}^{\pi/2} \int_0^{2\pi} F(\lambda, \theta) \, P_n^l(\sin\theta) \, e^{-il\lambda} \cos\theta \, d\lambda \, d\theta

Inverse Fourier transform (FFT):

    F^l(\theta) = \frac{1}{2\pi} \int_0^{2\pi} F(\lambda, \theta) \, e^{-il\lambda} \, d\lambda
                \approx \frac{1}{M} \sum_{j=0}^{M-1} F(\lambda_j, \theta) \, e^{-il\lambda_j}, \qquad \lambda_j = \frac{2\pi j}{M}

where M is the number of longitude points on the latitude circle.

Inverse Legendre transform (Gaussian quadrature):

    f_n^l = \sum_k w_k \, F^l(\theta_k) \, P_n^l(\sin\theta_k)

where the sum runs over the Gaussian latitudes \theta_k with quadrature weights w_k.

MPI and OpenMP
- The GFS uses a hybrid 1-dimensional MPI layout with OpenMP threading at the do-loop level.
- MPI (Message Passing Interface) is used to communicate between tasks, each of which holds a subgrid of a field.
- OpenMP supports shared-memory multiprocessor programming (threading) using compiler directives.
- Data transposes are implemented using MPI_alltoallv. They are required to switch between the variable spaces, which have different 1-D MPI decompositions.

Spectral to Physical Grid
- Call sumfln_slg_gg (Legendre transform)
- Call four_to_grid (FFT)
- A data transpose follows the Legendre transform, in preparation for the FFT to physical grid space:

      call mpi_alltoallv(works,sendcounts,sdispls,mpi_r_mpi,
     x                   workr,recvcounts,sdispls,mpi_r_mpi,
     x                   mc_comp,ierr)

Physical Grid to Spectral
- Call Grid_to_four (inverse FFT)
- Call Four2fln_gg (inverse Legendre transform)
- A data transpose is performed before the inverse Legendre transform:

      call mpi_alltoallv(works,sendcounts,sdispls,MPI_R_MPI,
     x                   workr,recvcounts,sdispls,MPI_R_MPI,
     x                   MC_COMP,ierr)

Physical Grid Space Parallelism
- 1-D MPI decomposition over latitudes; OpenMP threading over longitude points.
- Each MPI task holds a group of latitudes, all longitudes, and all levels.
- A cyclic distribution of latitudes is used to load-balance the MPI tasks, because the number of longitude points per latitude decreases as latitude increases toward the poles (see the sketch and example below).
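The cyclic assignment alternates direction on each pass across the tasks, so every task receives a mix of short polar and long equatorial latitude circles. Below is a minimal sketch of that zigzag assignment, not the operational GFS source: the program and variable names (cyclic_lat_distribution, mylats, nown) are illustrative, and the sizes are chosen only to reproduce the 5-task, 20-latitude example that follows.

    ! Minimal sketch: cyclic "zigzag" assignment of global latitudes to
    ! MPI tasks.  Sizes match the 5-task / 20-latitude example below.
    program cyclic_lat_distribution
      implicit none
      integer, parameter :: ntasks = 5, nlats = 20
      integer :: lat, task, sweep
      integer :: nown(ntasks)                    ! latitudes owned so far per task
      integer :: mylats(nlats/ntasks, ntasks)    ! latitude list for each task

      nown = 0
      do lat = 1, nlats
         sweep = (lat - 1) / ntasks              ! which pass across the tasks
         if (mod(sweep, 2) == 0) then            ! even pass: task 1 -> ntasks
            task = mod(lat - 1, ntasks) + 1
         else                                    ! odd pass: task ntasks -> 1
            task = ntasks - mod(lat - 1, ntasks)
         end if
         nown(task) = nown(task) + 1
         mylats(nown(task), task) = lat
      end do

      do task = 1, ntasks
         print '(a,i2,a,4i4)', ' task', task, ' owns latitudes:', mylats(:, task)
      end do
    end program cyclic_lat_distribution

Running this prints task 1 owning latitudes 1, 10, 11, and 20, task 2 owning 2, 9, 12, and 19, and so on, matching the example below.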
Physical Grid Space Parallelism: Cyclic Distribution Example
With 5 MPI tasks and 20 latitudes, the cyclic distribution of latitudes is:

    Task    1    2    3    4    5
    Lat     1    2    3    4    5
    Lat    10    9    8    7    6
    Lat    11   12   13   14   15
    Lat    20   19   18   17   16

Physical Grid Space Parallelism: Vector Length per OpenMP Thread
- NGPTC (a namelist variable) defines the number of longitude points per group, i.e. the block (vector length per processor) that each thread works on.
- It is typically set anywhere from 15 to 30 points (a sketch of this blocking appears at the end of this section).

Spectral Space Parallelism
- Hybrid 1-D MPI layout with OpenMP threading.
- In spectral space the 1-D MPI decomposition is over zonal wavenumbers (l's); OpenMP threading is applied over a stack of variables times the number of levels.
- Each MPI task holds a group of l's, all n's, and all levels.
- A cyclic distribution of l's is used to load-balance the MPI tasks, because the number of meridional points per zonal wavenumber decreases as the wavenumber increases.

GFS Scalability
- 1-D MPI scales to about 2/3 of the spectral truncation; for T574 that is about 400 MPI tasks.
- OpenMP threading scales to 8 threads.
- T574 therefore scales to about 400 x 8 = 3200 processors.
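To make the NGPTC blocking described above concrete, here is a minimal OpenMP sketch. It is illustrative only and not GFS source: the program name, the stand-in arithmetic inside the block loop, and the choice ngptc = 24 are assumptions; only the division of a 1760-point latitude circle into NGPTC-point blocks, each processed by one thread at a time, follows the description in these slides.

    ! Minimal sketch (not GFS source): OpenMP threading over blocks of
    ! NGPTC longitude points on one latitude circle of the T574 grid.
    program ngptc_blocking
      implicit none
      integer, parameter :: nlon = 1760, nlev = 64, ngptc = 24
      real    :: grid(nlon, nlev)
      integer :: iblk, nblks, ib, ie

      grid  = 1.0
      nblks = (nlon + ngptc - 1) / ngptc         ! number of NGPTC-point blocks

    !$omp parallel do private(ib, ie)
      do iblk = 1, nblks
         ib = (iblk - 1)*ngptc + 1               ! first longitude of this block
         ie = min(ib + ngptc - 1, nlon)          ! last longitude (short final block)
         grid(ib:ie, :) = grid(ib:ie, :) + 1.0   ! stand-in for the physics work
      end do
    !$omp end parallel do

      print *, 'longitude blocks of', ngptc, 'points:', nblks
    end program ngptc_blocking

Each thread handles one block of ngptc longitude points and all levels at a time, which is the 'vector length per processor' role that NGPTC plays in the namelist.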