Gung Ho: A code design for weather and climate prediction on exascale machines

R. Ford (a), M.J. Glover (b), D.A. Ham (c), C.M. Maynard (b), S.M. Pickles (a), G.D. Riley (d), N. Wood (b)

(a) Scientific Computing Department, STFC Daresbury Laboratory
(b) The Met Office, FitzRoy Road, Exeter, EX1 3PB
(c) Department of Computing and Grantham Institute for Climate Change, Imperial College London
(d) School of Computer Science, University of Manchester

Abstract

The Met Office’s numerical weather prediction and climate model code, the Unified Model (UM), is almost 25 years old. It is a unified model in the sense that the same global model is used to predict both short term weather and longer term climate change. More accurate and timely predictions of extreme weather events and better understanding of the consequences of climate change require bigger computers. Some of the underlying assumptions in the UM will inhibit scaling to the number of processor cores likely to be deployed at exascale. In particular, the longitude-latitude discretisation presently used in the UM would result in a 12 m grid resolution at the poles for a 10 km resolution at mid-latitudes. The Gung Ho project is a research collaboration between the Met Office, NERC-funded researchers and STFC Daresbury to design a new dynamical core that will exhibit very large data parallelism. The accompanying software design and development project to implement this new core and enable it to be coupled to, for example, physics parameterisations, ocean models and sea-ice models is presented, and some of the computational science issues are discussed.

Keywords:

Preprint submitted to Advances in Engineering Software, March 24, 2014

1. Introduction

Partial differential equations describe a wide range of physical phenomena. Typically, they exhibit no general analytic solution and certainly this is the case for the Navier-Stokes equations, which are the starting point for numerical weather prediction (NWP) and climate models. However, an approximate numerical solution can be constructed by replacing differential operators with finite difference or finite element versions of the same equations. The domain of the problem can be decomposed and the equations solved for each sub-domain. This requires only a small amount of boundary data from neighbouring sub-domains. This data parallel approach naturally maps to modern supercomputer architectures with large numbers of compute nodes and distributed memory. Each node can perform the necessary calculations on its own sub-domain and then communicate these results to its neighbouring nodes. However, the relevance of this simple picture of the massively parallel paradigm for supercomputing is diminishing as machine architectures become increasingly complex. While there is still much uncertainty about what future architectures will look like in 2018 and beyond, it is clear that the trend towards shared-memory cluster nodes with an ever-increasing number of cores, stagnant or decreasing clock frequency, and decreasing memory per core is certain to continue. For a review, see for example [1]. This is because the design of new silicon is constrained by the imperative to minimize power consumption. Furthermore, it is expected that improvements to Instruction Level Parallelism (ILP) within the execution units (e.g. Single Instruction Multiple Data (SIMD) instructions such as Streaming SIMD Extension (SSE) and Advanced Vector Extensions (AVX)) on commodity processors will be incremental at best. Consequently, the computational performance of an individual processor core may even decrease.
Thus simulation codes must be able to exploit ever increasing numbers of cores just to manage current workloads on future systems, let alone achieve the next order-of-magnitude resolution improvements demanded by scientific ambitions. Most scientific codes exploit the data parallelism expressed by the Message Passing Interface (MPI) over distributed memory. However, the arguments (and evidence) in favour of introducing an additional thread-based layer of parallelism over a shared memory node, such as OpenMP threads, into an MPI code are more compelling than ever. Exploiting the shared memory within a cluster node helps to reduce the cost of communications, thereby improving scalability, and to alleviate the impact of decreasing memory per core by reducing the need to store redundant copies in memory. On many architectures, it is also possible to hide much of the effect of memory latency by oversubscribing execution units with threads.

Extrapolating the trends described above and in [1] results in a prediction of a so-called exascale machine between 2018 and 2020. Assuming that an individual thread can run at a peak rate of 1 GFlops (10^9 floating point operations per second), then 10^9 threads would be needed to keep an exascale machine busy. To put this into context for an atmospheric model, a 1 km resolution model will have approximately 5 × 10^8 mesh elements in the horizontal. Compare this to a 10 km global model, which has only 5 × 10^6 mesh elements in the horizontal. Whilst this naive analysis may suggest that such a global model has only enough data parallelism to exploit 10^6 or 10^7 concurrent threads, this is not as bad as it first appears. Neither operational forecasts nor research jobs occupy the whole machine and, moreover, increased use of coupled models instead of atmosphere-only models can be expected to boost resource utilisation, as can ensemble runs via task parallelism. However, it is clear that in the future it will be necessary to find and exploit additional parallelism beyond horizontal mesh domain decomposition and concurrent runs of the atmosphere, ocean and IO models.

In the current Met Office code, the Unified Model [2, 3], a major factor limiting the scalability and accuracy of global models is the use of a regular latitude-longitude mesh, as the resolution increases dramatically towards the poles. A third generation UM dynamical core, ENDGame [4], is scheduled to enter operational service for the global model in summer 2014. Whilst this greatly improves the scaling behaviour of the Unified Model, the latitude-longitude mesh remains. This will fundamentally limit the ability of the global model to scale to the degree necessary for an exascale machine.

Clearly, the disruptive technology change necessary for exascale computing is a compelling driver of algorithmic and software change. In response, the UK Met Office, NERC and STFC have funded a five year project to design a new dynamical core for an atmospheric model, called Gung Ho. This project is a collaboration between meteorologists, climate scientists, computational scientists and mathematicians based at the Universities of Bath, Exeter, Leeds, Reading, Manchester and Warwick, Imperial College London, the Met Office and STFC Daresbury. This paper describes the design of the software needed to support such a dynamical core.
2. Design considerations for a dynamical core

The Gung Ho dynamical core is being designed from the outset with both numerical accuracy and computational scalability in mind, and the primary role of the computational science work package is to ensure that the solutions being proposed by other work packages in the project will scale efficiently to a large number of threads. At the same time, it must be recognised that a software infrastructure for next generation weather and climate prediction must aspire to an operational lifetime of 25 years or more if it is to match the longevity of the current Unified Model. Therefore, other aspects, such as portable performance, ease of use, maintainability and future-proofing, are also important, and are addressed here.

• Portable performance means the ability to obtain good performance on a range of different machine architectures. This should apply to the size of machine (from tablet to supercomputer), the model configuration (climate, coupled etc.) and to different classes of machine (e.g. graphics processing unit (GPU) or multi-core oriented systems). It is not known what type of architectures will dominate in 2018 and onwards but, as is the case today, architecture-specific parallelisation optimisations are expected to play an important role in obtaining good performance. Therefore the Gung Ho software architecture should be flexible enough to allow different parallelisation options to be added.

• Ease of use is primarily about making the development of new algorithms as simple as possible, and ease of maintenance will hopefully follow from this simplicity. Clearly there is a trade-off between application-specific performance optimisations and ease of use/maintainability. As will be shown in later sections, in the proposed Gung Ho software architecture this problem is addressed by separating out these different concerns into different software layers.

• Future-proofing in this context refers to being able to support changes in the dynamical core algorithm (such as a change of mesh or discretisation, for example) without having to change all of the software, as well as being able to exploit hardware developments.

The current Unified Model is written in Fortran 90 and much of the cultural heritage and experience of programming in the meteorology and climate community is in Fortran. Therefore, the Gung Ho dynamical core software will be written in Fortran, but the more modern Fortran 2003; that is, it may rely on any feature of the Fortran 2003 standard. This allows the support software to use some of the new object oriented features (procedures within objects in particular) if required, whilst keeping the performance benefits and familiarity of Fortran. Fortran 2003 also provides a standard way of calling C code, which is also desirable. Parallelisation is expected to be a mixture of MPI and OpenMP, with the addition of OpenACC if GPUs are also to be targeted. However, other programming models may emerge, and these designs do not exclude them. In particular, Partitioned Global Address Space (PGAS) models such as Co-Array Fortran (CAF), which is part of the Fortran 2008 standard, are not excluded.

2.1. Support for unstructured meshes

To avoid some of the problems associated with a regular latitude/longitude mesh, quasi-uniform structured meshes such as the cubed sphere are being considered for the Gung Ho dynamical core. However, it is highly undesirable to specialise to a particular grid prematurely.
Indeed, alternative grids to the cubed sphere are still under active consideration in Gung Ho [5–7]. Moreover, the best choice of mesh now may not be the best choice over the lifetime of the Gung Ho dynamical core. Therefore, a more general irregular data structure will be used to store the data on a mesh. This means that instead of neighbours being addressed directly in the familiar stencil style (data(i+1,j+1) etc.) of a structured mesh, they are addressed by a neighbours list, e.g. data(neighbour(i), i=1, Nneighbours). Iterating over a list of neighbours allows an arbitrary number of neighbours for each mesh element, which allows simple support for different topologies. There is a performance cost associated with this indirect memory addressing, that is, with having to look up the locations of neighbours. However, a critical design concept is that the mesh will be structured in the vertical. This is driven by the dominance of gravity in the Earth’s atmosphere, which leads to the atmosphere being highly anisotropic on a planetary scale. Thus the semi-structured mesh will be a horizontally unstructured columnar mesh where the vertical discretisation is anisotropic. By using direct addressing in the vertical and traversing the vertical dimension in the innermost loop nest, the overhead of indirect addressing can be reduced to a negligible level.

Fair comparisons between structured and unstructured approaches are hard to achieve due to the difficulty in separating the impact on performance of factors such as the order of the method; the choice of finite difference, finite volume or finite element approaches; the efficiency of the implementations; and so on. That said, in [8], a set of kernels drawn from an atmospheric finite volume icosahedral model executes to within 1% of the directly addressed speed when run in a vertically direct, horizontally indirect framework. So, the proposed data model uses indirect addressing in the horizontal, and direct addressing in the vertical, with columns contiguous in memory. In contrast, the current Unified Model uses direct addressing in all three dimensions, and it is levels that are contiguous in memory.

The change to column-contiguous (vertical innermost) from level-contiguous (horizontal innermost) will also have performance consequences, which depend on the operation being performed and the architecture on which it runs. For a cache-based memory model such as conventional CPUs, coding the vertical index as the innermost dimension is usually optimal for cache reuse because the array data in this dimension will be contiguous in memory. Thus, an inner loop over this dimension will exhibit data locality and allow the compiler to exploit the cache. However, for a non-cache based memory model such as on a GPU this arrangement may not necessarily be optimal [9–14]. On current GPUs, memory latency can be hidden by oversubscribing the number of thread teams (commonly referred to as warps for NVIDIA hardware and software) to physical cores. The low cost of switching warps whilst waiting for data amortises the cost of memory latency. Consequently the number of concurrent threads is very high. Moreover, GPUs have a vector memory access feature (known as coalesced memory access in NVIDIA terminology) for an individual warp. To access data in this way, each thread in a warp requests data from a sequential element of a contiguous array in a single memory transaction. Naively, a data layout with the horizontal index innermost would achieve this.
However, coalesced memory access could still be achieved with the vertical index innermost if a warp of threads cooperates to load a single element column into shared memory as a contiguous load. The warp then computes for the column (using a red-black ordering if write stencils overlap). In particular, a thread-wise data decomposition in the vertical is likely to be required for operations such as a tridiagonal solve, so that data segment sizes are sufficiently reduced to fit into (warp) shared memory.

Furthermore, such considerations are not limited to GPUs. Commodity processors today possess cache-based memory architectures and vector-like, pipelined execution units capable of issuing multiple floating-point instructions per cycle. Although today’s “vectors”, typified by 128 or 256 bit SSE or AVX, are much smaller than in the era of vector computers, it is still the case that an application incapable of exploiting the ILP that such units confer is likely to achieve disappointing performance. It is unclear whether the recent trend towards larger vector units in commodity processors will continue, because the returns for many applications – especially those of limited computational intensity – are diminishing.

In a recent study of the NEMO ocean code, [15] compared two alternative index orderings for 3-dimensional arrays: the level-contiguous ordering (which is used both by NEMO and the current UM) and the alternative, column-contiguous ordering, as will be the case in the Gung Ho data model. The study found that although the column-contiguous ordering is advantageous for some important operations (such as halo exchanges and the avoidance of redundant computations on land), and although most operations can be made to vectorize in either ordering, there remain some operations that are difficult to vectorize in the column-contiguous ordering. The archetypal example is a tridiagonal solve in the vertical dimension, because of loop-carried dependencies that inhibit vectorization. The dependencies remain in the column-contiguous ordering, and although the vector units can still be fed by operating on a whole level at once, this trick is no longer cache friendly because horizontally adjacent points are not contiguous in memory. As it is likely that similar concerns will arise in the Gung Ho dynamical core, alternative methods for solving tridiagonal systems such as sub-structuring and cyclic reduction are under consideration [16]. These methods have in common the idea of transforming the problem into one that exposes more parallelism, albeit at the cost of more floating point operations; the hope is that as “flops become free”, the increased parallelism will win.

Clearly, performance portability and data layout are complex issues. There are performance benefits and penalties whatever choice is made, and uncertainty in what hardware features will be available at exascale makes the consequences of this choice harder to foresee. However, the vertical-index-innermost data layout represents the best choice for achieving performance, with the awareness that hardware-specific code will be necessary to exploit whatever hardware features are available. This consideration impacts on the design of the Gung Ho software architecture.
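To make the proposed data layout concrete, the following minimal Fortran sketch iterates over a columnar field using indirect addressing in the horizontal and direct, stride-one addressing in the vertical, with the vertical index innermost. All names, sizes and the toy connectivity are illustrative assumptions only; they are not the Gung Ho data structures.

! Minimal sketch (illustrative names, not the Gung Ho API) of the proposed
! data layout: columns are contiguous in memory, the horizontal is addressed
! indirectly through a neighbour list and the vertical directly.
program column_layout_sketch
  implicit none
  integer, parameter :: ncolumns = 6, nlevels = 70, nneighbours = 3
  ! Field data: vertical (contiguous) index innermost, one column per horizontal cell.
  real    :: field(nlevels, ncolumns), flux(nlevels, ncolumns)
  ! neighbour(n,i) gives the horizontal index of the n-th neighbour of column i.
  integer :: neighbour(nneighbours, ncolumns)
  integer :: i, n, k

  call init(field, neighbour)          ! fill with some test data
  flux = 0.0

  do i = 1, ncolumns                   ! horizontal loop: indirect addressing
    do n = 1, nneighbours
      do k = 1, nlevels                ! vertical loop: direct, stride-one access
        flux(k, i) = flux(k, i) + field(k, neighbour(n, i)) - field(k, i)
      end do
    end do
  end do
  print *, 'flux(1,1) =', flux(1, 1)

contains

  subroutine init(field, neighbour)
    real,    intent(out) :: field(:, :)
    integer, intent(out) :: neighbour(:, :)
    integer :: i, n
    do i = 1, size(field, 2)
      field(:, i) = real(i)
      do n = 1, size(neighbour, 1)
        ! toy periodic connectivity, just to make the example runnable
        neighbour(n, i) = 1 + mod(i + n - 1, size(field, 2))
      end do
    end do
  end subroutine init

end program column_layout_sketch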
3. The Gung Ho software architecture

The software architecture is designed with a separation of concerns in mind to satisfy the design criteria of performance portability and future-proofing. The software will be formally separated into different layers, each with its own responsibilities. A layer can only interact with another via the defined API. These layers are the driver layer, the algorithm layer, the parallelisation system (PSy), the kernel layer and the infrastructure layer. They are illustrated in Figure 1. The identification of these layers enables a separation of concerns between substantially independent components which have different requirements and which require developers with substantially different skill sets. For example, the PSy layer is responsible for the shared memory parallelisation of the model and the placement of communication calls. The layers above (algorithm) and below (kernel) are thus isolated from the complexities of the parallelisation process.

Figure 1: Gung Ho software component diagram. The arrows represent the APIs connecting the layers and the direction shows the flow of control.

The driver encapsulates and controls all models in a coupled model and their partitions; it calls the algorithm layer in each model (a model refers to a complete physical subsystem, e.g. the atmospheric dynamics, the atmospheric chemistry or the oceanic model, which are then coupled together). The algorithm layer is effectively the top level of the Gung Ho software. It is where the scientist defines their algorithm in terms of local numerical operations, called kernels, and global predefined operators such as linear algebra operations on solution fields. The algorithm layer calls the PSy layer, which is responsible for shared memory parallelisation and intra-model communication. The PSy layer calls the scientific kernel routines. The kernel routines implement the individual scientific kernels. All layers may make use of the infrastructure layer, which provides services (such as halo operations and library functions). The following sections discuss each of the layers and their associated APIs in more detail.

Although the design of the software architecture is layered, some layers are depicted side-by-side in Figure 1. The reason for this, and for the vertical line between the algorithm layer and kernel layer on one side and the PSy layer on the other, is to emphasise the division of responsibilities. The natural scientist will deal with the algorithm and the kernel, and the computational scientist will deal with the PSy layer. The vertical line has been termed the “iron curtain” to underline the strong separation between code parallelisation and optimisation on the one hand and algorithm development on the other. The driver is mainly the responsibility of the computational scientist as it makes use of the computational infrastructure.

The kernel layer can be treated as part of a general toolkit in order to promote the re-use of functionality within Gung Ho and between Gung Ho and other models at the Met Office (including the large collection of parametrisations of small-scale processes known in atmospheric modelling as physics). In a formal instantiation of this idea, developers would be encouraged to use the toolkit functions wherever possible and would not be able to add new science into the system unless that science were added to the toolkit.

3.1. Driver layer

The driver layer is responsible for the scheduling of the individual models in a coupled model, including the output and checkpoint frequencies. There may be different implementations of the driver layer depending on the particular configuration. For example, a stand-alone driver might look different to a driver for a fully coupled Earth System Model.
The driver is also responsible for holding any state that is passed into and out of the Gung Ho dynamical core, or exchanged with other models, and for ensuring that any required coupling involving the state takes place. The driver may know little about the content of the state. For example, it may simply pass an opaque state object, as is the case with the Model for Prediction Across Scales (MPAS, http://mpas-dev.github.io/). However, these options need further investigation, and for now we postpone the decision as to which layer(s) will be responsible for allocating and initialising the state variables. The driver will control the mapping of the model onto the underlying resources, i.e. the number of MPI processes and threads and the number of executables. It is envisioned that the driver will make use of existing coupled model infrastructure provided by third parties; for example, some of the functionality of the Earth System Modelling Framework (ESMF, http://www.earthsystemmodeling.org/) may be adopted. The driver can call the algorithm layer directly.

3.2. Algorithm layer

The algorithm layer, named Algs in Figure 1, will be where the scientist specifies the Gung Ho algorithm(s). As such it should be kept as high level as possible to keep it simple to develop and manage. This layer will have a logically global data view, i.e. it should operate on whole fields. Data objects at the field level should be derived types. The same applies to other data structures, such as sparse matrices, that are visible at the algorithm layer. This layer should primarily (and potentially solely) make use of a set of calls to existing building block code and control code to implement the logic of the algorithm. These building blocks will often, but not exclusively, be defined by user-defined kernels that specify an operation to be performed, typically column-wise, on fields. The exception is common or performance-critical operations that may be implemented as library functions. Calls may be in a hierarchy; for example, a call to a Helmholtz solver may contain a number of smaller “building block” calls. These kernels form the kernel layer and the algorithm is specified as a set of kernel calls (via the PSy layer) in the algorithm layer. There will be no explicit parallel communication calls (for example halo updates) in this layer: these will be handled exclusively by the PSy layer. However, calls to communicate data via a coupler (put/get calls) will occur at this layer.

3.2.1. Algorithm to PSy interface

The interface between these layers must maintain the separation of concerns between the high level algorithmic view and a computational view. The interface defines what information the algorithm layer must present and how it is to be used by the PSy layer. Information on the kernel to be executed must be passed. This may be as a direct call to the PSy layer, or as an argument (e.g. a function pointer) of a generic execute-kernel function. The merits and defects of each approach are still under consideration, although the differences are subtle and the consequences of choosing one over the other are at this stage hard to foresee. Other arguments in the interface will contain information about the data objects to be operated on and, critically, data access descriptors stating whether and when the data is to be read only, can be overwritten or will have its values incremented.
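As an illustration of this interface, the following hypothetical sketch shows an algorithm-layer call that hands a named kernel, a set of whole-field objects and their access descriptors to a generic PSy entry point. None of these names (field_type, psy_execute, GH_READ and so on) are the Gung Ho API; they are assumptions made purely for the example.

! Hypothetical sketch of the algorithm-layer view: whole-field objects, a named
! kernel and data access descriptors are handed to a generic PSy entry point.
! None of these names are the actual Gung Ho API; they are illustrative only.
module alg_interface_sketch_mod
  implicit none
  integer, parameter :: GH_READ = 1, GH_WRITE = 2, GH_INC = 3
  type :: field_type
    character(len=16) :: name = ''
    real, allocatable :: data(:, :)      ! (nlevels, ncolumns)
  end type field_type
contains
  ! Stand-in for the PSy layer: here it only reports what it has been asked to
  ! do; a real PSy implementation would schedule halo exchanges, fuse kernels
  ! and parallelise based on the access descriptors.
  subroutine psy_execute(kernel_name, access, fields)
    character(len=*), intent(in) :: kernel_name
    integer,          intent(in) :: access(:)
    type(field_type), intent(in) :: fields(:)
    integer :: i
    print *, 'PSy layer would run kernel: ', kernel_name
    do i = 1, size(fields)
      print *, '  field ', trim(fields(i)%name), ' access descriptor ', access(i)
    end do
  end subroutine psy_execute
end module alg_interface_sketch_mod

program algorithm_layer_sketch
  use alg_interface_sketch_mod
  implicit none
  type(field_type) :: theta, u
  theta%name = 'theta';  u%name = 'u'
  allocate(theta%data(70, 1000), u%data(70, 1000))
  ! Algorithm layer: logically global field view, no explicit communication.
  call psy_execute('advect_theta', [GH_INC, GH_READ], [theta, u])
end program algorithm_layer_sketch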
Such an interface, with a generic compute-engine approach, has been deployed in a different context in [17]. Furthermore, multiple kernels may be passed to the PSy layer in a single call. This, coupled with the data access descriptors, allows the PSy layer to do two things. As the PSy layer is responsible for communication, both inter- and intra-node, this approach allows the PSy layer to parallelise the routines together, for example by reducing the number of halo swaps. This is known as kernel fusion. Moreover, the data access descriptors will allow the PSy layer to reason about the order of the operations, for example to achieve a better overlap between computation and communication. This is known as kernel re-ordering. Both offer potential performance optimisation gains.

3.3. PSy layer

The interface to the PSy layer presents a limited package of work: what operations are required and, critically, the data dependencies; in short, a computational pattern. Whilst there is no intention to formally characterise these patterns as in [18, 19], this abstraction remains a powerful software engineering concept. The interface to the PSy layer will be Fortran 2003 compliant; however, this code may never be compiled directly. The interface will be parsed (a Python parser being the most likely candidate) and then transformed to implement some of the necessary optimisations. The output may be default Fortran 2003 routines, but the approach also allows for the generation of hardware-specific code. For example, this could be language/directive specific, such as CUDA or OpenACC for a GPU hardware target, or even CPU assembler for a CPU target. Another example of such optimisations could be targeted at architecture-specific configurations, such as the different cache and AVX sizes on Intel Xeon and Xeon Phi.

The interface parser/transformer (IPT) will be a source-to-source translator. The default target will be Fortran 2003. This means that the code will always be able to be debugged using standard tools, correctness can be checked and standard compilers with their optimisation benefits can be deployed. However, where hardware and computational optimisation opportunities exist, the translator can be used to generate highly optimised code. This optimisation need only be created once, and potentially re-used whenever the computational pattern fits. In the case of multiple functions being passed, the IPT implementing the PSy layer will be able to decide the most appropriate ordering of these functions, including whether they are to be computed concurrently, that is, exploiting task or functional parallelism.

It is the responsibility of the PSy layer to access the raw data from the data arguments passed through the interface. The PSy layer calls routines from the kernel layer. The implementation will also be able to call routines from the infrastructure layer. The data access descriptors will allow the translator to reason about when communication is required and place the appropriate calls, for example a halo swap, to the infrastructure layer. The PSy layer will also be responsible for on-node, shared memory parallelism, most likely implemented with OpenMP threads or OpenACC.

3.3.1. PSy to kernel interface

The PSy layer calls the kernel routines directly, passing data in and out by argument, with field information being passed as raw arrays. This means that the data has to be unpacked from derived data type objects in the PSy layer.
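Continuing the illustrative names used above (again, not the Gung Ho API), the following sketch shows a hand-written stand-in for what a generated PSy routine might look like: it unpacks derived-type fields into raw arrays, applies OpenMP shared-memory parallelism over columns, and calls a column kernel that takes explicit array dimensions, of the kind described in the kernel layer section below.

! Hand-written stand-in for a generated PSy routine: derived-type fields are
! unpacked into raw arrays, the horizontal loop is threaded with OpenMP, and a
! single-column kernel is called.  All names are illustrative assumptions.
module psy_kernel_sketch_mod
  implicit none
  type :: field_type
    real, allocatable :: data(:, :)           ! (nlevels, ncolumns)
  end type field_type
contains
  ! PSy layer: shared-memory parallelism and (not shown) halo exchange calls
  ! live here, never in the algorithm or kernel layers.
  subroutine psy_advect_theta(theta, u)
    type(field_type), intent(inout) :: theta
    type(field_type), intent(in)    :: u
    integer :: i, nlevels, ncolumns
    nlevels  = size(theta%data, 1)
    ncolumns = size(theta%data, 2)
    !$omp parallel do default(shared) private(i)
    do i = 1, ncolumns
      call advect_theta_kernel(nlevels, theta%data(:, i), u%data(:, i))
    end do
    !$omp end parallel do
  end subroutine psy_advect_theta

  ! Kernel layer: a single-column operation on raw arrays with explicit
  ! dimensions; it knows nothing about the mesh, MPI or threading.
  subroutine advect_theta_kernel(nlevels, theta_col, u_col)
    integer, intent(in)    :: nlevels
    real,    intent(inout) :: theta_col(nlevels)
    real,    intent(in)    :: u_col(nlevels)
    integer :: k
    do k = 1, nlevels
      theta_col(k) = theta_col(k) + 0.1 * u_col(k)   ! placeholder update
    end do
  end subroutine advect_theta_kernel
end module psy_kernel_sketch_mod

program psy_kernel_demo
  use psy_kernel_sketch_mod
  implicit none
  type(field_type) :: theta, u
  allocate(theta%data(70, 1000), u%data(70, 1000))
  theta%data = 300.0;  u%data = 1.0
  call psy_advect_theta(theta, u)
  print *, 'theta(1,1) =', theta%data(1, 1)
end program psy_kernel_demo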
3.4. Kernel layer

The kernel layer is where the building blocks of an algorithm are implemented. This layer is expected to be written by natural scientists with input from computational scientists, as this layer should be written for ease of maintenance but also such that, for example, the compiler can optimise the code to run efficiently. To help the compiler optimise, explicit array dimensions should be provided where possible. In certain circumstances, where the performance of a routine is critical, compiler directives might need to be added to the kernel code to aid optimisation (such as unrolling or vectorization). In more extreme cases there may need to be more than one implementation of a particular kernel, each optimised for a particular class of architecture, but this can be avoided if the Fortran 2003 compiler can provide sufficient performance.

Two potential kernel implementations are considered for the design. These are not mutually exclusive and it may be that some mixture of the two is used. In the first implementation the parallelisable dimensions are made malleable, so that the kernel can be called with different numbers of work units. This approach is flexible in that the same kernel code can be used for smaller or larger numbers of work units per call, depending on what the architecture requires. For example, one would expect a GPU machine to require a large number of small work units and a many-core machine to require fewer, larger work units. In the second implementation the loop structures are not contained in the kernel code and any looping is performed by the PSy layer. One exception to this rule is the loop over levels, which may be explicitly written, particularly if it is not possible to parallelise the loop.

The size of the kernels, in terms of computational work, is still under consideration, although there is nothing in the architecture design that dictates kernel size. Smaller kernels are more suitable for architectures based on GPUs but larger kernels are more suitable for multi-core processors. It may be that smaller kernels can be efficiently combined to create larger kernels, which would tend to favour writing smaller ones. Ultimately it will be the unit of work defined by the algorithm developer that will dictate the size of a kernel.

Finally, the kernel layer will be managed as a toolkit of functions which can only be updated by agreement with the toolkit owner. Combined with a rule that disallows code, other than subroutine calls and control code, at the algorithm layer, this approach would promote the re-use of existing functionality (potentially across different models, not just the Gung Ho dynamical core): code developers would need to apply to add new functionality to the kernel layer, and proposed additions could be checked against the existing toolkit for duplication.

3.5. Infrastructure layer

The infrastructure layer provides generic functionality to all of the other layers. It is actually a grouping of different types of functionality, including parallelisation, communication, libraries, logging and calendar management. Parallelisation support will include functionality to support the partitioning of a mesh and mapping partitions to a processor decomposition. It is expected that ESMF or some similar software solution will be used to support this functionality. However, this functionality is not provided directly but rather through the infrastructure layer, so that only the appropriate functionality is exposed.
Communication support is related to the partitioning support, as halo operations are derived from the specification of a partitioned mesh in ESMF. Again, this functionality is not provided directly but rather through the infrastructure layer. The infrastructure will also provide the coupling put/get functionality (for example, implemented by OASIS3, https://verc.enes.org/oasis), but the coupler itself is out of scope for Gung Ho. It makes sense to use standard optimised libraries (such as the BLAS or FFTs) where possible and the infrastructure will provide an interface to a subset of these libraries. Finally, logging support is a useful way to manage the output of information, warnings and errors from code, and calendar support helps manage the time-stepping of models. ESMF provides infrastructure for both of these and the infrastructure layer will provide an interface.

4. Data model

A data model is an abstract description of the data consumed and generated by a program. The fundamental form of data employed in simulation software is the field, that is, a function of space and, potentially, time. A data model for fields therefore sets out to express field values and their relationship to space and time. In simulation software, the field is represented by a finite set of degrees of freedom (DoFs) associated with a mesh and potentially changing through time. Simulation software must therefore have a data model which encompasses meshes and fields, and gives the relationship between them.

4.1. Choice of discrete function space

The conventional approach in geoscientific simulation software is for degrees of freedom to be identified with topological mesh entities. For example, it is usual to speak of “node data”, meaning that a field has one degree of freedom associated with each mesh vertex, or “edge data”, indicating that a flux degree of freedom is associated with each edge of the mesh. Furthermore, it is conventional to select a single discretisation (or perhaps a small related set), and to develop the model with a hard-coded assumption that that discretisation will apply: for example, C-grid models are coded with field data structures which assume that velocity will always be represented by a single flux variable on each face, while tracers will be represented by a single variable at each cell centre.

In contrast, finite element models in various domains of application conventionally record an explicit relationship for each field between the degrees of freedom and the mesh topology. This permits a single computational infrastructure to support the development of models with radically different numerical discretisations. In this approach, a single infrastructure could be used to develop a C-grid model, with a single flux DoF associated with every mesh facet, or an A-grid model, with one velocity DoF per dimension associated with each vertex. The infrastructure could also support slightly more unusual cases, such as the BDFM element [introduced for weather applications in 7], which associates two flux DoF with each triangle edge and three flux DoF with the triangle interior. Spectral element schemes take this much further, with very large numbers of DoF associated with each topological entity. The Gung Ho data model is of this type, and allows for different function spaces to be represented, including those which associate more than one DoF with a given topological entity, and those in which a single field has DoF associated with more than one kind of topological object.
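As a purely illustrative sketch (the names and types below are assumptions, not the Gung Ho data model), a function space can be described by how many DoFs it attaches to each kind of topological entity, so that C-grid, A-grid and higher-order layouts can all be expressed through the same infrastructure.

! Illustrative sketch of recording, per function space, how many DoFs are
! associated with each kind of topological entity.  Names and numbers are
! assumptions for the example only.
module function_space_sketch_mod
  implicit none
  type :: function_space_type
    character(len=32) :: name = ''
    integer :: ndof_per_vertex = 0
    integer :: ndof_per_edge   = 0
    integer :: ndof_per_face   = 0
    integer :: ndof_per_volume = 0
  end type function_space_type
contains
  function total_dofs(fs, nvertices, nedges, nfaces, nvolumes) result(ndofs)
    type(function_space_type), intent(in) :: fs
    integer, intent(in) :: nvertices, nedges, nfaces, nvolumes
    integer :: ndofs
    ndofs = fs%ndof_per_vertex * nvertices + fs%ndof_per_edge * nedges &
          + fs%ndof_per_face * nfaces + fs%ndof_per_volume * nvolumes
  end function total_dofs
end module function_space_sketch_mod

program function_space_demo
  use function_space_sketch_mod
  implicit none
  ! A C-grid-like velocity space: one flux DoF per mesh face, nothing elsewhere.
  type(function_space_type) :: cgrid_velocity = &
       function_space_type('c_grid_velocity', 0, 0, 1, 0)
  ! A tracer space: one DoF per cell (volume).
  type(function_space_type) :: tracer = &
       function_space_type('tracer', 0, 0, 0, 1)
  ! Entity counts below are arbitrary illustrative values for a small mesh.
  print *, 'velocity DoFs: ', total_dofs(cgrid_velocity, 800, 2400, 1600, 400)
  print *, 'tracer   DoFs: ', total_dofs(tracer,         800, 2400, 1600, 400)
end program function_space_demo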
It is important to note that, although this data model is taken from the finite element community, it is merely an engineering mechanism which allows for the support of different DoF locations on the mesh. There is no impediment to Gung Ho, or a future dynamical core, adopting a finite difference or finite volume discretisation in association with this data model.

4.2. Vertical structure in the data model

Gung Ho meshes will be organised in columns. Section 2.1 examined the issues surrounding the data layout: indirect addressing will be employed in the horizontal, with direct addressing employed in the vertical. For this to be efficient, the data model will reflect this choice inherently by ensuring that vertically aligned degrees of freedom are arranged contiguously in memory.

This will be achieved by storing an explicit indirection list for the bottom DoFs in each column. For each column in the horizontal mesh, the indirection list will store the DoFs associated with the lowest three-dimensional cell. It will also store the offset from these DoFs to those associated with the cell immediately above. Since vertically adjacent DoFs will have adjacent numbers, this offset list will be the same for every column, and for every layer in the column.

By always iterating over the mesh column-wise, it will be possible to exploit this structure efficiently as follows: first, the DoFs associated with the bottom cell of a given column are looked up, and the numerical calculations for that cell are executed. Next, the DoFs for the cell above the first one are found by adding the vertical offsets to the bottom cell DoFs. This process is repeated until the top of the column is reached and the DoFs for the next column must be explicitly looked up. Since the vertical offsets are constant for all columns in the mesh, these offsets can be expected to be in cache, so the indexing operations for all but the bottom cell require no access to main memory. Further, the vertical adjacency of DoF values, in combination with this iteration order, results in stride-one access to the field values themselves, maximising the efficient use of the hardware memory hierarchy. Finally, it should be noted that this approach generalises directly to iterations over entities other than column cells. For example, iteration over faces to compute fluxes is straightforward.
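The following minimal sketch implements the iteration strategy just described for a toy DoF numbering; the array names, sizes and the assumption that no DoFs are shared between cells are illustrative simplifications, not the Gung Ho implementation.

! Minimal sketch of column-wise iteration using an explicit indirection list
! for the bottom-cell DoFs plus constant vertical offsets for higher cells.
program vertical_dofmap_sketch
  implicit none
  integer, parameter :: ncolumns = 4, nlayers = 3, ndof_per_cell = 2
  integer, parameter :: ndofs = ncolumns * nlayers * ndof_per_cell
  real    :: field(ndofs)                            ! all DoF values, columns contiguous
  integer :: bottom_dofmap(ndof_per_cell, ncolumns)  ! explicit indirection, bottom cell only
  integer :: voffset(ndof_per_cell)                  ! constant offset to the cell above
  integer :: dof(ndof_per_cell)                      ! working copy of the current cell's DoFs
  integer :: c, k, d
  real    :: column_sum

  ! Build a toy numbering in which vertically adjacent DoFs have adjacent numbers.
  do c = 1, ncolumns
    do d = 1, ndof_per_cell
      bottom_dofmap(d, c) = (c - 1) * nlayers * ndof_per_cell + d
    end do
  end do
  voffset = ndof_per_cell                            ! same for every column and layer
  field = 1.0

  ! Column-wise iteration: one indirect lookup per column, then cheap offsets.
  do c = 1, ncolumns
    dof = bottom_dofmap(:, c)                        ! the only main-memory indirection
    column_sum = 0.0
    do k = 1, nlayers
      do d = 1, ndof_per_cell
        column_sum = column_sum + field(dof(d))      ! stride-one access up the column
      end do
      dof = dof + voffset                            ! move to the cell above
    end do
    print *, 'column', c, 'sum =', column_sum
  end do
end program vertical_dofmap_sketch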
5. Conclusions

Exascale computing presents serious challenges to weather and climate model codes, to such an extent that it may be considered a disruptive technology. Recognising this, the Gung Ho project is considering a mathematical formulation for a dynamical core which is capable of exposing as much parallelism in the model as possible. The design of the software architecture has to support features of this model, such as the unstructured mesh, whilst exploiting the parallelism to achieve performance. In a separation-of-concerns model of software design, each component of the architecture has a clear purpose. The strongly defined interfaces between these layers promote a powerful abstraction which can be used to design software which can be developed and maintained, is portable between diverse hardware with as yet unknown features, and can be executed with high performance.

6. Acknowledgments

We acknowledge use of Hartree Centre resources in this work. The STFC Hartree Centre is a research laboratory in association with IBM providing High Performance Computing platforms funded by the UK’s investment in e-Infrastructure. The Centre aims to develop and demonstrate next generation software, optimised to take advantage of the move towards exascale computing. This work was supported by the Natural Environment Research Council Next Generation Weather and Climate Prediction Program [grant numbers NE/I022221/1 and NE/I021098/1] and the Met Office.

References

[1] J. J. Dongarra, A. J. van der Steen, High-performance computing systems: Status and outlook, Acta Numerica 21 (2012) 379–474. doi:10.1017/S0962492912000050.

[2] A. Brown, et al., Unified Modeling and Prediction of Weather and Climate: A 25-Year Journey, Bulletin of the American Meteorological Society 93 (2012) 1865–1877. doi:10.1175/BAMS-D-12-00018.1.

[3] D. N. Walters, et al., The Met Office Unified Model Global Atmosphere 4.0 and JULES Global Land 4.0 configurations, Geoscientific Model Development Discussions 6 (2) (2013) 2813–2881. doi:10.5194/gmdd-6-2813-2013. URL http://www.geosci-model-dev-discuss.net/6/2813/2013/

[4] N. Wood, et al., An inherently mass-conserving semi-implicit semi-Lagrangian discretisation of the deep-atmosphere global non-hydrostatic equations, Q. J. R. Meteorol. Soc. (2013). doi:10.1002/qj.2235.

[5] C. J. Cotter, D. A. Ham, Numerical wave propagation for the triangular P1DG-P2 finite element pair, Journal of Computational Physics 230 (8) (2011) 2806–2820.

[6] J. Thuburn, C. Cotter, A framework for mimetic discretization of the rotating shallow-water equations on arbitrary polygonal grids, SIAM Journal on Scientific Computing 34 (3) (2012) B203–B225. doi:10.1137/110850293. URL http://epubs.siam.org/doi/abs/10.1137/110850293

[7] C. Cotter, J. Shipton, Mixed finite elements for numerical weather prediction, Journal of Computational Physics 231 (21) (2012) 7076–7091. doi:10.1016/j.jcp.2012.05.020.

[8] A. MacDonald, J. Middlecoff, T. Henderson, J. Lee, A general method for modeling on irregular grids, International Journal of High Performance Computing Applications 25 (4) (2011) 392–403. doi:10.1177/1094342010385019.

[9] N. Bell, M. Garland, Efficient sparse matrix-vector multiplication on CUDA, NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation (Dec. 2008).

[10] G. I. Egri, Z. Fodor, C. Hoelbling, S. D. Katz, D. Nogradi, et al., Lattice QCD as a video game, Comput. Phys. Commun. 177 (2007) 631–639. arXiv:hep-lat/0611022, doi:10.1016/j.cpc.2007.06.005.

[11] M. Clark, R. Babich, K. Barros, R. Brower, C. Rebbi, Solving Lattice QCD systems of equations using mixed precision solvers on GPUs, Comput. Phys. Commun. 181 (2010) 1517–1528. arXiv:0911.3191, doi:10.1016/j.cpc.2010.05.002.

[12] M. de Jong, Developing a CUDA solver for large sparse matrices in Marin, Master's thesis, Delft University of Technology (2012).

[13] E. Mueller, X. Guo, R. Scheichl, S. Shi, Matrix-free GPU implementation of a preconditioned Conjugate Gradient solver for anisotropic elliptic PDEs, submitted to Comput. Vis. Sci. (2013). URL http://arxiv.org/abs/1302.7193

[14] I. Reguly, M. Giles, Efficient sparse matrix-vector multiplication on cache-based GPUs, in: Innovative Parallel Computing (InPar), 2012, pp. 1–12. doi:10.1109/InPar.2012.6339602.

[15] S. M. Pickles, A. R. Porter, Developing NEMO for large multicore scalar systems: Final report of the dCSE NEMO project, Tech. rep., HECToR dCSE report (2012). URL http://www.hector.ac.uk/cse/distributedcse/reports/nemo02/
[16] Y. Zhang, J. Cohen, J. D. Owens, Fast tridiagonal solvers on the GPU, in: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’10, ACM, New York, NY, USA, 2010, pp. 127–136. doi:10.1145/1693453.1693472. URL http://doi.acm.org/10.1145/1693453.1693472

[17] F. Rathgeber, G. R. Markall, L. Mitchell, N. Loriant, D. A. Ham, C. Bertolli, P. H. Kelly, PyOP2: A high-level framework for performance-portable simulations on unstructured meshes, in: High Performance Computing, Networking, Storage and Analysis, SC Companion (2012) 1116–1123. doi:10.1109/SC.Companion.2012.134.

[18] E. Gamma, R. Helm, R. E. Johnson, J. M. Vlissides, Design patterns: Abstraction and reuse of object-oriented design, in: Proceedings of the 7th European Conference on Object-Oriented Programming, ECOOP ’93, Springer-Verlag, London, UK, 1993, pp. 406–431. URL http://dl.acm.org/citation.cfm?id=646151.679366

[19] E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.