Parallel Weather Prediction
University of Illinois at Chicago
Jeevan Joseph, Meriya Susan Thomas

Weather Prediction - Ancient Times
• Based on:
o Knowledge of local weather.
o Experience, memory, and a variety of empirical rules.
o Guesswork based on intuition.
o Animal behavior and responses.
• What about oceanic weather predictions?

Weather Forecasting - Early Times
• Advection: the transport of fluid characteristics and properties by the movement of the fluid itself.
• Cons:
o Advection is non-linear.
o Assumes a constant wind.
• Advancements: Vilhelm Bjerknes (1890s)
o Diagnostic step
o Prognostic step

Early Numerical Weather Prediction
• Richardson: dream of a parallel weather factory [Lynch, CUP, 2006]
o Expressed the physical principles as a system of equations and used the finite difference method to solve the system.
• Jule Charney: the Meteorology Project [Charney, J.Meteor, 1947]
o Lack of computing power.
o Quasi-geostrophic system: "eliminate unimportant acoustic and shearing-gravitational oscillations."

General Circulation Model
(Figure: general wind directions. Image from NASA.)

Numerical Weather Prediction
• Uses mathematical models of the atmosphere and oceans.
• Involves massive datasets and complex calculations that must be performed on supercomputers.
o An IFS grid containing 8 x 10^5 points on the surface, 91 levels, and 5 prognostic variables => 3 x 10^8 data values.
• Fundamental problem: the chaotic nature of the partial differential equations used to simulate the atmosphere.
"Numerical Weather Prediction." Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc., 9 May 2013. Web. 5 May 2013.

Elements of NWP
• Initialization
o The process of entering observational data into the model to generate initial conditions is called initialization.
o Methods of gathering observational data: weather satellites and radiosondes; some research projects use reconnaissance aircraft.

What is an Atmospheric Model?
• An atmospheric model is a computer program that produces meteorological information for future times at given locations and altitudes.
• Within any modern model is a set of equations, known as the primitive equations, used to predict the future state of the atmosphere.
"Atmospheric Model." Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc., 15 March 2013. Web. 5 May 2013.

Elements of NWP - Contd.
• Computation
o The equations used are nonlinear partial differential equations, which are impossible to solve exactly through analytical methods.
o Different solution methods used by global weather models:
o Finite difference method: the world is represented as discrete points on a regularly spaced grid of latitude and longitude, applied in all three spatial dimensions.
o Spectral method: solves for a range of wavelengths; applied in the horizontal dimensions, with finite-difference methods in the vertical.
o The equations are initialized from the data, and rates of change are determined.
o Examples: the UKMET Unified Model is run 6 days into the future; ECMWF's IFS and Environment Canada's GEM model are run 10 days into the future; and the GFS, run by EMC, is run 16 days into the future. [Wiley, SISSDAPDP, 1991]
"Numerical Weather Prediction." Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc., 9 May 2013. Web. 5 May 2013.

Grid Point Calculation
(Figure credit: Mann & Kump, Dire Predictions.)
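To make the grid-point approach concrete, here is a minimal sketch of one upwind finite-difference timestep for 1-D advection (dq/dt + u * dq/dx = 0). The grid size, wind speed, and timestep below are illustrative assumptions, not values from any of the models discussed.

```python
import numpy as np

# One first-order upwind step for 1-D advection on a periodic grid.
# All parameters are illustrative, not taken from an operational model.
nx = 128          # number of grid points
dx = 25_000.0     # grid spacing in metres (~25 km, echoing the IFS grid)
u = 10.0          # constant wind speed in m/s (the "constant wind" assumption)
dt = 600.0        # timestep in seconds; |u|*dt/dx < 1 keeps the scheme stable

x = np.arange(nx) * dx
q0 = np.exp(-((x - x.mean()) / (10 * dx)) ** 2)  # initial tracer blob

def upwind_step(q, u, dt, dx):
    """Advance the tracer one step with first-order upwind differences."""
    if u >= 0:
        dqdx = (q - np.roll(q, 1)) / dx   # backward difference for u >= 0
    else:
        dqdx = (np.roll(q, -1) - q) / dx  # forward difference for u < 0
    return q - u * dt * dqdx

q = q0.copy()
for _ in range(100):
    q = upwind_step(q, u, dt, dx)
print("tracer mass conserved:", np.isclose(q.sum(), q0.sum()))
```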
Integrated Forecasting System (IFS) [Barros, Parallel Computing, 1995]
• T799: triangular truncation at total wavenumber 799 (the parallel implementation was originally described for the T213 configuration [Ritchie, Monthly Weather Review, 1995]).
o Uniform spatial resolution (25 km) over the surface of the sphere.
• 91 levels in the vertical.
• 3 x 10^8 data values for each calculation.
• Three different discrete function spaces:
o Grid-point space
o Fourier space
o Spectral space

IFS - Contd.
• Data dependencies:
o FFT: the zonal wavenumbers (m) depend on the grid-point data of an entire latitude circle (i.e., the longitude direction).
o Legendre transforms: the spectral coefficients for each zonal wavenumber (m) depend on data from all latitudes.
o Physics computations involve vertical (z) data.
• Advection schemes:
o Basic: Eulerian. All grid columns can be considered independent, allowing a simple data decomposition.
o Semi-Lagrangian: requires access to data from nearby grid columns.

IFS - Contd.
• Approach to parallelization:
o Native macro-tasking is used whenever possible. Pro: makes use of the existing partitioning, reducing transpositions. Con: limited parallelism.
o Uses PARMACS, a portable message-passing library that can be used on distributed-memory computers.
o Transposition: the complete data set is redistributed across processes at various stages, so that interprocess communication is minimized. Data dependencies exist only within one coordinate direction, and that direction differs between the algorithmic components.

IFS - Contd.
• Transposition strategy (figure: grid-point space <-> Fourier space <-> spectral space, with a transposition at each stage) [Thole, Parallel Computing, 1995].

IFS - Transposition Strategy
• NB A-sets, each comprising NA processors.
• The vertical levels are partitioned across the NB A-sets.
• All transforms between spectral and Fourier space can be carried out independently within an A-set.

IFS - Transposition Strategy, Contd.
• After the inverse Fourier transform, the levels are divided among the NB A-sets, and the latitudes are divided within each A-set.
• Full grid columns become available for the physical computations.
• The transposition is carried out independently within each B-set.

IFS - Transposition Strategy, Contd.
• After the grid-point computations, the data is transposed back to Fourier space and the Fourier transform is carried out.
• This is the inverse of the previous transposition.

IFS - Transposition Strategy, Contd.
• Transposition no. 4 is carried out between the fast Fourier stage and the Legendre stage, in order to allow the Legendre transforms to be carried out locally.
• This is the inverse of the first transposition.

IFS - Transposition Strategy, Contd.
• The timestep is calculated in spectral space; the coefficients are indexed by zonal wavenumber (m) and level, which entails vertical dependencies.
• After transposition, each processor has all levels of part of the subset of spectral coefficients handled by its own set, enabling the vector-matrix multiplications to be carried out locally.

IFS - Transposition Strategy, Contd.
• The spectral data is transposed back to the previous distribution, and the time step is completed.
• Transpositions 5 and 6 are both done independently within the B-set (a toy sketch of the transposition idea follows).
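As a toy illustration of the transposition strategy (a sketch only, not IFS code; the array sizes and "processor" count are made up), a field is first split by level so that the horizontal transforms are local, then reshuffled by latitude so that full vertical columns are local:

```python
import numpy as np

# Toy model of transposition: a field f(level, lat, lon) is redistributed so
# that the direction carrying the data dependency is always local.
nproc = 4                      # illustrative processor count
nlev, nlat, nlon = 8, 16, 32   # illustrative grid sizes

field = np.random.rand(nlev, nlat, nlon)

# Distribution 1: split along the vertical axis. Each "processor" owns
# complete horizontal layers, so transforms over longitude and latitude
# need no communication.
by_level = np.array_split(field, nproc, axis=0)

# Transposition: reassemble and re-split along the latitude axis. Each
# processor now owns full vertical columns for a band of latitudes, so the
# physics (vertical dependencies only) needs no communication. The real
# code does this with message passing rather than a global gather.
by_latitude = np.array_split(np.concatenate(by_level, axis=0), nproc, axis=1)

assert all(part.shape[0] == nlev for part in by_latitude)  # full columns local
```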
Semi-Lagrangian Advection (SLT)
• SLT happens in the grid-point phase, before the physics calculation.
• Each processor is assigned the computation of a number of grid columns, referred to as its core dataset.
• A halo dataset is then constructed, comprising all the data likely to be needed by the processor for the SLT computation.
• Halo extent criteria: timestep, maximum wind, etc.
• The halo is obtained through message exchange; the amount of data depends on the grid-column decomposition.

SLT Grid Column Decomposition
• APPLE decomposition [Barros, Parallel Computing, 1995]
o Assigns grid points along latitudes.
o Long, thin partitions.
o High amount of data transfer.

SLT Grid Column Decomposition, Contd.
• ORANGE decomposition [Barros, Parallel Computing, 1995]
o Assigns grid points along latitudes, with longitudinal boundaries creating boxes.
o Box structure.
o Low amount of data transfer.

SLT Grid Column Decomposition, Contd.
• ORANGE at the poles: IGLOO [Barros, Parallel Computing, 1995]
o Assigns grid points along latitudes, with longitudinal boundaries creating boxes.
o The savings are significant closer to the poles.
o Reduces data duplication.

SLT Grid Column Decomposition, Contd.
(Figure from S.R.M. Barros, Parallel Computing, 1995.)

IFS Conclusion
• The data distribution is handled by a high-level transposition strategy, isolating the message passing into a few routines.
• Ensembles.
• The parallel efficiency of the transposition method is very high.
• SLT is tuned to lower the data communication in the halo region.
• The vast majority of the IFS contains no specialized code related to parallelism, making maintenance and optimization easier.

Medium Range Weather Forecasting Model on PARAM - a Parallel Machine [Kaginalkar, CDAC Technical Report, 1995]
• T80L18, a global weather model, is parallelized.
• Triangular truncation with a spectral resolution of 80 waves; 128 latitudes and 256 longitudes, with 18 varying pressure levels in the vertical direction.
• The equations are derived from the conservation laws of mass, momentum, and energy.
• The finite difference method is used to approximate the derivatives in the vertical direction.

Medium Range Weather Forecasting Model on PARAM - a Parallel Machine [Kaginalkar, CDAC Technical Report, 1995]
• Basic algorithm:
o Step 1: Input spectral coefficients for all m, n, and k.
o Step 2: Compute Fourier coefficients using the inverse Legendre transform for all m, j, and k.
o Step 3: Compute Gaussian grid-point values using the inverse Fourier transform for all l, j, and k.
o Step 4: Compute the non-linear terms and physics in the grid-point domain on the Gaussian grid.
o Step 5: Compute Fourier coefficients using the direct Fourier transform.
o Step 6: Compute spectral coefficients using the direct Legendre transform.

Medium Range Weather Forecasting Model on PARAM - a Parallel Machine [Kaginalkar, CDAC Technical Report, 1995]
• Index ranges:
o 0 <= m, n <= M: spectral indices.
o 1 <= j <= 128: latitude index.
o 1 <= k <= 18: vertical level index.
o 0 <= l <= 256: longitude index.

Medium Range Weather Forecasting Model on PARAM - a Parallel Machine [Kaginalkar, CDAC Technical Report, 1995]
• PARAM - a parallel machine:
o PARAM 8600 has 16 processors and 64 transputers.
o PARAM 9000SS has 64 MB of memory per processor.
o PARAM OpenFrame is the fastest, with a 100 Mbit/s Ethernet link and 1.2 Gbit/s Myrinet, for 8 dual-core processors.

Medium Range Weather Forecasting Model on PARAM - a Parallel Machine [Kaginalkar, CDAC Technical Report, 1995]
• Parallelisation strategy:
o Main task: identify the nature of the parallelism involved, the most compute-intensive parts, and their data dependencies.
o The most obvious data-independent, compute-intensive work is across the latitudes.
o Latitude-wise decomposition: a pair of latitudes is placed on each processor.
o The major computation is in the Fourier and Legendre transforms.

Medium Range Weather Forecasting Model on PARAM - a Parallel Machine [Kaginalkar, CDAC Technical Report, 1995]
• Parallelisation strategy - Contd.
o Each processor does the localized work: computing Fourier coefficients, Gaussian grid-point values, non-linear terms, and physics in the grid-point domain.
o Each processor computes Fourier coefficients only for its assigned latitudes (FFT).
o The FFT is computed independently, without any inter-processor communication.
o Partial sums of the Legendre transform are computed on each processor for its latitudes.
o These partial sums are circulated among all processors so that each processor obtains the global sum (sketched below).
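The partial-sum pattern above can be sketched as follows (a toy simulation of the communication pattern, not the T80 code; the weights, sizes, and processor count are placeholders):

```python
import numpy as np

# Each spectral coefficient is a weighted sum over ALL latitudes, but each
# "processor" holds only its own latitudes: it computes a partial sum, and
# the partial sums are then combined so every processor has the global sum
# (an all-reduce in message-passing terms).
nproc = 8
nlat = 16      # toy latitude count; the real T80 model has 128
nspec = 10     # toy number of spectral coefficients

weights = np.random.rand(nspec, nlat)   # stand-in for Legendre weights
fourier = np.random.rand(nlat)          # stand-in for Fourier coefficients

# Latitude-wise decomposition: each processor owns a small set of latitudes.
owned = np.array_split(np.arange(nlat), nproc)

# Local step: a partial sum over the processor's own latitudes only.
partials = [weights[:, idx] @ fourier[idx] for idx in owned]

# Communication step: circulate and accumulate the partial sums (modelled
# here by simply summing the list).
spectral = np.sum(partials, axis=0)

# The distributed result matches the serial transform.
assert np.allclose(spectral, weights @ fourier)
```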
Medium Range Weather Forecasting Model on PARAM - a Parallel Machine [Kaginalkar, CDAC Progress Report, 1995-1996]
• Communication pattern (figure).

Medium Range Weather Forecasting Model on PARAM - a Parallel Machine [Kaginalkar, CDAC Progress Report, 1995-1996]
• Results and performance:
o Evaluated the scientific accuracy of the parallel T80 code against the CRAY forecast output.
o The partial double-precision results of T80 on PARAM approximate the full double-precision results of the CRAY within a 5% variation.

Medium Range Weather Forecasting Model on PARAM - a Parallel Machine [Kaginalkar, CDAC Progress Report, 1995-1996]
(Results figure.)

Medium Range Weather Forecasting Model on PARAM - a Parallel Machine [Kaginalkar, CDAC Technical Report, 1995]
• Future work:
o Exploiting fast global communication routines to reduce communication overheads as the number of processors increases.
o Exploring faster intrinsic functions and better utilization of the cache.

GPU Acceleration of Numerical Weather Prediction [Michalakes, IPDPS 2008]
• Even though there is large-scale parallelism in weather models, performance increases have come more from processor speed than from increased parallelism.
• Alternative: exploit emerging architectures using the fine-grained parallelism once used in vector machines.
• This paper demonstrates a nearly 8x speedup for a computationally intensive module of the Weather Research and Forecast (WRF) model on a variety of NVIDIA Graphics Processing Units (GPUs).

GPU Acceleration of Numerical Weather Prediction [Michalakes, IPDPS 2008]
• GPU-based computing - today's scenario:
o A low-cost, low-power (watts per flop), very high performance alternative.
o CPUs are unable to exploit parallelism much finer than one subdomain, i.e., one geographic region assigned to one processor.
o GPUs introduce layers of concurrency between data-parallel threads, with fast context switching.
o GPUs also have dedicated memories that provide the bandwidth needed for high FLOP rates.

GPU Acceleration of Numerical Weather Prediction [Michalakes, IPDPS 2008]
• The Weather Research and Forecast (WRF) model: the most widely used model; uses an explicit finite-difference approximation method; represents the atmosphere over a 3-dimensional grid.
• WRF Single Moment 5-tracer (WSM5): a computationally intensive physics module; only 0.4% of the total source code, but a quarter of the total run time on a single processor.
• On average, WSM5 involves 2400 floating-point multiply-equivalent operations per cell per invocation.

GPU Acceleration of Numerical Weather Prediction [Michalakes, IPDPS 2008]
• NVIDIA GPUs used: NVIDIA 8800 GTX, Quadro 5600, and a pre-release GTX 200.
• NVIDIA 8800 GTX: eight physical processors operate as a SIMD unit in each multiprocessor, and there are 16 multiprocessors; 768 MB of multiported SDRAM device memory; a local 16 KB shared memory per multiprocessor.

GPU Acceleration of Numerical Weather Prediction [Michalakes, IPDPS 2008]
• One writes a CUDA kernel for the GPU.
• Each kernel is a collection of threads arranged into blocks and grids.
• Each block is bound to a virtual multiprocessor; the hardware time-shares the multiprocessors among the blocks.
• The more threads per block, the better the performance will generally be.
• A kernel should have enough blocks to simultaneously utilize all the multiprocessors in a given NVIDIA GPU (see the sketch below).
• Memory: data that does not fit in the fast shared memory must be stored in the slower DRAM device memory.
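A minimal sketch of this thread/block model, written here with Numba's CUDA bindings (an assumption for illustration; the paper itself used CUDA C). The toy per-cell update, grid size, and launch configuration are placeholders, and running it requires a CUDA-capable GPU with numba installed.

```python
import numpy as np
from numba import cuda

@cuda.jit
def toy_physics_kernel(temp, out):
    """Toy stand-in for a per-cell physics update (not WSM5)."""
    i = cuda.grid(1)           # absolute thread index across all blocks
    if i < temp.size:          # guard: the launch grid may overshoot the array
        out[i] = temp[i] + 0.1

ncells = 128 * 128 * 27        # illustrative 3-D domain, flattened to 1-D
temp = np.random.rand(ncells).astype(np.float32)
out = np.zeros_like(temp)

# Launch configuration: many threads per block, and enough blocks to cover
# every cell (and so keep all the multiprocessors busy).
threads_per_block = 256
blocks = (ncells + threads_per_block - 1) // threads_per_block
toy_physics_kernel[blocks, threads_per_block](temp, out)

assert np.allclose(out, temp + 0.1)
```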
GPU Acceleration of Numerical Weather Prediction [Michalakes, IPDPS 2008]
• Validation (figure).

GPU Acceleration of Numerical Weather Prediction [Michalakes, IPDPS 2008]
• Performance (figure).

GPU Acceleration of Numerical Weather Prediction [Michalakes, IPDPS 2008]
• Cost per call (figure).

Conclusion
• In the pre-computer era, weather forecasting was a nightmare.
• Computers introduced massive speed, but without parallelization, computation still lagged behind the changing weather.
• Today, with parallel computing and advanced models, a complete prediction of up to 51 ensemble members can be calculated in 10 hours.
• Sudden weather changes are still unpredictable.

References
• [Kaginalkar, CDAC Technical Report, 1995]: A. Kaginalkar and S. Purohit. Benchmarking of Medium Range Weather Forecasting Model on PARAM - A Parallel Machine. CDAC Technical Report, CDAC, India, 1995.
• [Barros, Parallel Computing, 1995]: S.R.M. Barros, et al. The IFS model: A parallel production weather code. Parallel Computing, vol. 21, no. 10, p. 1621, 1995.
• [Michalakes, IPDPS 2008]: J. Michalakes and M. Vachharajani. GPU acceleration of numerical weather prediction. IPDPS 2008: IEEE Int'l Symp. on Parallel and Distributed Processing, pages 1-7, 2008.
• [Wiley, SISSDAPDP, 1991]: R. L. Wiley. Parallel Processing and Numeric Weather Prediction. Second International Specialist Seminar on the Design and Application of Parallel Digital Processors, pages 15-19, 1991.
• [Lynch, JCP, 2008]: P. Lynch. The origins of computer weather prediction and climate modeling. Journal of Computational Physics, vol. 227, pages 3431-3444, 2008.
• [Charney, J.Meteor, 1947]: J.G. Charney. The dynamics of long waves in a baroclinic westerly current. J. Meteor., vol. 4, pages 135-162, 1947.
• [Lynch, CUP, 2006]: P. Lynch. The Emergence of Numerical Weather Prediction: Richardson's Dream. Cambridge University Press, Cambridge, 2006.
• [Ritchie, Monthly Weather Review, 1995]: H. Ritchie, C. Temperton, A. Simmons, et al. Implementation of the semi-Lagrangian method in a high resolution version of the ECMWF forecast model. Monthly Weather Review, vol. 123, pages 489-514, 1995.