On Advanced Scientific Understanding, Model Componentisation and Coupling in GENIE

Sofia Panagiotidi, Eleftheria Katsiri and John Darlington
London e-Science Centre, Imperial College London, South Kensington Campus, London SW7 2AZ, UK
Email: lesc-staff@doc.ic.ac.uk

Abstract. GENIE is a scalable modular platform aiming to simulate the long-term evolution of the Earth's climate. Its modularity lies in the fact that it consists of a separate module for each of the Earth's atmosphere, oceans, land, sea-ice and land-ice. These modules need to be coupled together to produce the ensemble model. As new modules are actively being researched and developed, it is desirable that the GENIE community has the flexibility to add, modify and couple together GENIE modules easily, without undue programming effort. Despite significant progress by the GENIE community, the desired result has not yet been achieved. This paper discusses a methodology for parametrising GENIE, i.e., annotating the coupling semantics and coordinating the coupled model execution in a Grid environment. This is not trivial, due to technical constraints such as global variables and large, shared data meshes, as well as semantic dependencies, such as complex interpolations, extrapolations and model-specific execution flows. In order to address these issues, we present the OLOGEN ontology for GENIE. The proposed scheme consists of a set of semantics for coupled model development and execution; the latter is simulated with functional skeletons. Last, OLOGEN supports component coupling and promotes distribution.

1 Introduction

The Grid ENabled Integrated Earth (GENIE) System Model is a scalable modular platform that aims to simulate the long-term and paleo-climate change of the Earth.
Distinct Earth entities such as the atmosphere, the oceans, the land, the sea-ice and the land-ice (referred to, from now on, as Earth modules) are modelled separately, using large meshes of data points to represent each entity (finite elements) and abiding by the natural laws of physics that govern the exchange of energy fluxes from one entity to another. The entire climate is simulated by coupling together the above modules according to a number of scenarios. The resulting model is often referred to as an ensemble. For example, a minimal model comprises an atmospheric module and an ocean module, as portrayed in Figure 1(a). A more advanced model also includes the coupling of the sea-ice and the land-ice components (see Figure 1(b)). Coupling the modules together is not trivial: even when the individual models are correct, the coupled ensemble may be incorrect, i.e., non-convergent. We mentioned before that each model is constrained by the laws of physics that govern the energy in each module. Coupled models exchange energy fluxes from one to the other; unless the natural laws are applied correctly during the exchange process, the ensemble will not be correct. Although we appreciate that this is an important issue, it is not addressed in this paper. There are, however, several scientific coupling systems that are currently used to guarantee correct coupling. In addition to the variety of simulation scenarios, the scientist can select among different implementations for the same component, or ensemble models, such as the c-Goldstein and the EMBM models. However, those implementations differ significantly from one another, and so does the way that they are coupled together. The EMBM model requires that data is first interpolated when passed from one module to the other, and there are several choices of interpolation functions, which can be quite complex.
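The interpolation step can be illustrated with a minimal sketch: a simple bilinear scheme that regrids a 2-D flux field between two hypothetical regular latitude/longitude grids. The function name and grids are ours; the actual interpolation functions used by EMBM and the other models are considerably more elaborate.

```python
import numpy as np

def regrid_bilinear(field, src_lats, src_lons, dst_lats, dst_lons):
    """Bilinearly interpolate a 2-D flux field from a source grid onto a
    destination grid (illustrative only; assumes sorted 1-D coordinate
    arrays and destination points inside the source domain)."""
    out = np.empty((len(dst_lats), len(dst_lons)))
    for i, lat in enumerate(dst_lats):
        for j, lon in enumerate(dst_lons):
            # locate the enclosing source cell
            i0 = int(np.clip(np.searchsorted(src_lats, lat) - 1,
                             0, len(src_lats) - 2))
            j0 = int(np.clip(np.searchsorted(src_lons, lon) - 1,
                             0, len(src_lons) - 2))
            # fractional position inside the cell
            t = (lat - src_lats[i0]) / (src_lats[i0 + 1] - src_lats[i0])
            u = (lon - src_lons[j0]) / (src_lons[j0 + 1] - src_lons[j0])
            out[i, j] = ((1 - t) * (1 - u) * field[i0, j0]
                         + t * (1 - u) * field[i0 + 1, j0]
                         + (1 - t) * u * field[i0, j0 + 1]
                         + t * u * field[i0 + 1, j0 + 1])
    return out
```

A scheme of this kind conserves constant fields exactly but not, in general, integrated fluxes; that is precisely why the choice of interpolation function matters for correct coupling.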
Once the ensemble model is built, it is initialised using a set of initial parameters and then run forwards in time, simulating a time period for which actual climate data exist through observations. Those observations are used to validate the model (i.e., check the precision of the results of applying the model against the real data) and to tune it (i.e., select a new set of input parameters that produce more accurate results). The latter is a computationally exhaustive process that needs to be repeated iteratively, producing increasingly accurate values for the input parameters. Last, having been validated and tuned, the model is used for making predictions about the future. GENIE currently supports, among others, several ESMs (Environment System Models), called IGCM, c-Goldstein, EMBM, Glimmer and fixed atmosphere. Throughout the paper we refer to these systems as models. We refer to the ESM components (atmosphere, land, ocean, sea-ice and land-ice) with the term modules. For reasons of clarity, we refer to the model that results from coupling the individual modules as the ensemble model.

Fig. 1: Two GENIE ensemble models. (Panel (a) shows an atmosphere-ocean model; panel (b) adds the land, sea-ice and land-ice components. The modules exchange fluxes such as netsolar, netlong, precipitation, evaporation, runoff, latent energy, sensible energy and wind stress.)

Furthermore, new modules and ensembles are being developed as we speak. Ideally, GENIE should be able to include the new models and use them instead of, in addition to, or in comparison with the existing models. The rest of the paper is as follows: first, Section 2 discusses previous work; Section 3 discusses the motivation in more detail and Section 4 the issues that lie behind componentising and Grid-enabling systems such as GENIE.
Section 5 explains the more specific problem of high-level model coupling by investigating the current architecture of the GENIE software suite. Section 6 summarises the semantic dependencies that exist in the GENIE code and Section 7 presents OLOGEN, an ontology for GENIE that enables high-level development and correct usage. Having annotated the semantics through OLOGEN, Section 8 presents a solution for coordinating the execution flow of coupled models without undue programming overhead, through the use of functional skeletons. Last, Section 9 concludes and discusses further work.

2 Previous work

Several milestones towards the advancement of the GENIE framework have been achieved. The first task was to separate the mass of Fortran code into distinct pieces, each representing an environmental module (atmosphere, land, ocean, ocean-surface, chemistry, sea-ice etc.). The main piece of code handling and coupling the modules was disengaged and formed the so-called "genie.F" top program. Many efforts have been made to gradually move GENIE towards a Grid environment. Previous work also includes research into the way the ICENI framework [1] can be used to run parameter sweep experiments across multiple Grid resources [5]. Accordingly, GENIE was represented as a single non-componentised binary executable within ICENI, and a service-oriented data management system and a portal, submitting to a Condor pool a parameter sweep experiment consisting of many instances of the binary executable, were successfully implemented [4, 6]. Using Condor allowed several pools residing in different locations (London e-Science Centre, Southampton Regional e-Science Centre) to be combined into one single computing resource using the flocking features within Condor. In this way, it was made possible for scientists to run many ensemble experiments in parallel and monitor them, dramatically reducing the overall execution time.
Two other important works exist in the area of model coupling frameworks: the General Coupling Framework [3] and the APST [7] software suite. Although both are considerable pieces of work, neither fully addresses the above requirements.

3 Motivation and Vision

In emerging grids and Grid Computing, shared, distributed and heterogeneous resources make scientific progress possible through collaborative research among scientists. Unfortunately, as many scientific applications like GENIE are still built as monolithic structures, mostly due to the lack of suitable technologies at the time they first started being developed, restructuring them into a collection of reusable components is a major architectural effort. When moving from monolithic towards modularised applications, the main flow-of-control logic must remain at the wheel at all times. All the components are developed independently and the interactions should only occur at a higher level. The main coupling framework should be able to flexibly handle any data exchange between components (component coupling). It should be relatively simple, in order to provide different presentation layers and/or varying sets of end-user functionality. In general, monolithic applications are usually strongly platform-dependent and difficult to maintain and develop. GENIE belongs to a vast set of ESMs largely implemented in Fortran 77 and Fortran 90. As the GENIE Earth modules are low-level Fortran subroutines, there is no room for Grid-enabled execution of the whole model, including features such as distributed component execution, optimisation, parallelisation, and application steering and monitoring. Thus, we conclude that it is essential for GENIE, as a typical legacy code application, to be advanced into a Grid-based architecture. In order to do so, layered semantic knowledge about the joints/links between the modules must reside in the framework.
It is also desirable that the simulations can produce results under tight deadlines. Therefore, semantic annotations of the behavioural and executional/computational aspects of the components should additionally be encapsulated. In that way, correct and optimal execution are ensured, since the exploitation of optimisation techniques becomes possible.

4 Problems and Difficulties Faced

GENIE currently supports several components, including ocean, atmosphere and land surface modules. These modules are separate Fortran files but cannot be considered independent components, as Fortran is unable to support the desired object-oriented and distributed programming features. The interaction (exchange of energy and fluxes) between them, as well as the generalised control of the behaviour of each one, is done by a separate program, "genie.F", which is responsible for interpolating the data passed from one to another through interpolation functions. There are several issues making the task of componentising the Earth modules complex. The first and most important is that the separate Fortran subroutines representing the Earth units are implemented in such a way that they share the same memory space. This is imposed by the way Fortran exchanges arguments between subroutines: whenever a simple or complex structure is passed to or from a subroutine, it is always done by passing its memory reference rather than a copy. This leads to the need for global variables in the application representing attributes that are considered to be part of the modules. Consequently, their existence creates mutual dependencies, since they are accessed by various contexts within GENIE. Such complex dependencies add extra complexity and make it harder to create an environment where different modules could be distributed and run across multiple resources.
Another direct consequence is that intimate knowledge of the system is needed in order to identify which of these attributes are input, output, or both input and output, something which is essential in an interface-based environment. Thus, in cases where variables are shared by more than one module, their scope must be made explicit. Therefore, determining their input/output nature, in physical as well as technical terms, is a significant task to be completed. GENIE scientists have been working on models representing different aspects of the Earth system (e.g. ocean, atmosphere). At the same time, for each of these entities various physical components have been implemented, each taking into account and expressing different parameters at varying weights, and considering the Earth at varying grid resolutions and topologies. Thus, an attempt to create a common interface for every Earth entity with respect to the existing Fortran components is far from trivial. At present, when a GENIE experiment runs under some configuration, each component is initialised, executed and finalised through distinct Fortran subroutines called from "genie.F". What is more, in the case where one component needs to be reconfigured or restarted, possibly after a system failure, another subroutine responsible for doing so is called. Envisioning a component-based GENIE architecture, where multiple instances of the same component will exist at the same time, it is essential that these phases are included within the life cycle of each one. As such, reusability and resource disengagement/management can be achieved to the best extent, during and after the completion of a component task. Based on the above, a last issue of great practical interest in our case is the use of the wrapping environment in order to acquire dynamic and flexible Earth components.
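The component life cycle described above (initialise, execute, finalise, restart) can be sketched as follows. This is our illustrative rendering, not GENIE code: the class and phase names are hypothetical, and a real wrapper would drive the underlying Fortran subroutines at each phase.

```python
from enum import Enum, auto

class Phase(Enum):
    CREATED = auto()
    INITIALISED = auto()
    RUNNING = auto()
    FINALISED = auto()

class EarthComponent:
    """Life-cycle sketch for a componentised Earth module: the
    initialise/execute/finalise/restart subroutines become phases of
    each component instance, so multiple instances can coexist and
    resources can be released when a task completes."""
    def __init__(self, name):
        self.name, self.phase = name, Phase.CREATED

    def initialise(self, config):
        self.config, self.phase = dict(config), Phase.INITIALISED

    def execute(self, steps=1):
        # only a configured component may run
        assert self.phase in (Phase.INITIALISED, Phase.RUNNING)
        self.phase = Phase.RUNNING

    def finalise(self):
        # disengage resources for reuse by other instances
        self.phase = Phase.FINALISED

    def restart(self, config):
        # reconfigure after e.g. a system failure
        self.initialise(config)

ocean = EarthComponent("ocean")
ocean.initialise({"resolution": "36x36"})
ocean.execute()
ocean.finalise()
ocean.restart({"resolution": "36x36"})
```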
The two technologies which have been under research at LeSC are the JNI (Java Native Interface) wrapping tool and the ICENI binary components. Table 1 depicts the most important efficiencies/inefficiencies (+/-) of each of the two.

Table 1: JNI vs. ICENI

    Technology              JNI binary   ICENI
    exposed interface           +          -
    meta-data assignment        +          -
    reusability                 +          +
    easy implementation         -          +
    code optimisation           +          -
    component support           -          +

5 The current GENIE

Currently, GENIE consists of several components glued together through a routine called genie.F. Those components are wrappers on top of the Fortran routines that implement the Earth modules, providing in this way a separate entity for each module and generating a dual-layer architecture in genie.F. In fact, genie.F acts as the top layer, gluing the wrappers together in a hard-coded manner. genie.F contains hard-coded configuration parameters in this layer, i.e., the frequency of the coupling (timestamp), the grid sizes, the atmospheric resolution, and a set of validity flags that show which combinations of models are valid and which are not. The bottom layer consists of the code contained in the wrappers themselves, including execution routines for calculating the linear and non-linear grid points, average statistics, printing histogram and NetCDF files, and configuration parameters such as the precision of the floating-point variables, the type of the interpolation functions and (in duplication with the top layer) the grid sizes for each model. In this way, genie.F appears to be a glorified dual-layer meta-file: a huge case statement that covers all possible execution scenarios with flags and several configuration files. Figure 2 portrays the different layers of abstraction in genie.F for the IGCM ensemble model.
It is quite obvious at this stage that repeating the same abstraction process for all models can reveal several opportunities for parametrisation. At the same time, it is worth annotating the dependencies among the variables in the various layers with appropriate semantics, and coordinating the modification of these variables in a way that ensures the dependencies are maintained.

Fig. 2: The ensemble layer (a) and the individual module layer (b). (Panel (a) shows the genie.F wrappers, such as initialise_atmos, igcm_atmos_adiab, land_surflux, land_blayer and iatmos_diab, with their configuration in genie_global.F and genie_control.F; panel (b) shows the underlying IGCM routines, such as igcm3_adiab.f, igcm3_diab.f, netout2, mgrmlt and accum_means, with include files such as param1.cmn, param2.cmn, gridss.cmn, gridpf.cmn, bats.cmn and genie_precision.inc.)

6 Coupling Semantic Dependencies

In order to derive the semantic dependencies, we refer to the requirements of Section 1. Figure 1 illustrates the fact that modules can be coupled together to form different topologies.

6.1 Validation

After the topology has been defined, it is essential to validate the compatibility of the components that have been selected. For example, c-Goldstein uses a set of interpolation routines that are not used in any other model; EMBM also uses interpolations. genie.F currently models this with a set of flags.

6.2 Component and Model Semantics

Given the motivation of Section 3, i.e., to componentise GENIE, it is essential to introduce semantics taken from the component literature, such as Component, Interface, Method and ComponentImpl (see Figure 4). We use these semantics in order to define component coupling, component wrapping and the execution flow abstractions. On the other hand, because in GENIE components represent scientific models or parts (routines), we need to provide semantics that correspond to the scientific nomenclature as well. We define the entities Model, Topology, Module, IGCM, c-Goldstein, EMBM, Glimmer, atmos, land, ocean, seaice and landice. The entity Model refers to the type of scientific model used, e.g., c-Goldstein; Module refers to the parts of the planet that are modelled; and Topology to the combination of modules that form the ensemble model.

6.3 Component Wrapping

Using the component nomenclature, we propose to semantically annotate wrapper input/output interfaces as opposed to local module input/output interfaces or local variables. The former are accessible in the higher layer, where the coupling needs to take place, whereas the latter are accessible in the lower layer. The same applies to interfaces that are assigned a value during the configuration stage: we annotate those exposed to the users as wrapper control interfaces, and those that belong to the lower layer as local control interfaces or local variables. Each wrapper needs to be associated with the names of the underlying routines that are used in its implementation.

6.4 Late binding

Some interfaces are only set by components that are executed later on, and so their values are assigned during the following iteration. Such variables are typically initiated by initialisation modules. We refer to such variables with the term late-binding variables. An example of such a variable is the netsolar variable of the igcm land_surflux component. It is calculated by the netout2.f subroutine called inside the igcm_diab.F component that follows; for this reason, this variable needs to be initialised by the initialise_atmos.F subroutine.

Fig. 3: IGCM execution flow. (The flow runs igcm_adiab_wrapper, igcm_land_surflux_wrapper, igcm_land_blayer_wrapper and igcm_diab_wrapper, with pass_results, ocean_blayer and calc_surf_meshes steps in between.)

6.5 Interface Scope

It is essential in GENIE to be able to annotate the scope of an interface. Currently, all modules share the same memory space.
Furthermore, each time a new component is designed, the designer needs to know which variables/interfaces are shared, and to declare the new variables/interfaces that are introduced by each model. The same applies when two models are coupled, as their coupling often introduces new variables. It is therefore essential to distinguish global variables that are common and useful to all modules, such as the mesh of discrete points that models the whole planet, from those that are specific to each module, e.g., the grid interpolation atm, which is specific to the c-Goldstein module. We define the former as common and the latter as module-specific. The matrix iland mask is a variable that is common to GENIE, as it contains information about the globe: land is denoted by 1 and sea by 0. On the contrary, the variable interpol grid is specific to the c-Goldstein model. In Figure 4, the above semantics are associated with the entity Interface in the OLOGEN ontology, which is introduced later on.

6.6 Abstract components

We can classify abstract components in the following categories: generic modules, specific module wrappers, interpolation components, grid manipulation components and initialisation components. Ideally, generic components could be used to create abstract scenarios such as the ones of Figure 1, with specific implementations chosen later on.

6.7 Execution flow

We mentioned in Section 1 that different ensembles are coded quite differently. Figure 3 portrays the execution flow for the IGCM model. The c-Goldstein model follows a different execution flow, containing an interpolation component between each module. We have developed the following semantics to annotate execution flows: binary data flow, binary execution flow, data flow, execution flow, interface coupling, thread coupling, distribution model. The coordination of execution flows is the subject of Section 8.
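The common versus module-specific distinction of Section 6.5 could be captured in a simple scope registry. The sketch below is ours (the registry and function names are hypothetical); the interface names iland mask and interpol grid follow the examples above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class InterfaceScope:
    """Scope annotation for a GENIE interface: 'common' interfaces are
    shared by all modules, 'module-specific' ones belong to one model."""
    name: str
    scope: str                      # "common" or "module-specific"
    owner: Optional[str] = None     # owning model for module-specific ones

REGISTRY = {
    # land/sea mask of the whole globe (land = 1, sea = 0): common
    "iland_mask": InterfaceScope("iland_mask", "common"),
    # interpolation grid used only by c-Goldstein: module-specific
    "interpol_grid": InterfaceScope("interpol_grid", "module-specific",
                                    owner="c-Goldstein"),
}

def visible_to(model):
    """Interfaces a given model may access: all common ones plus its own."""
    return {name for name, s in REGISTRY.items()
            if s.scope == "common" or s.owner == model}
```

With explicit annotations of this kind, a new component's designer can query which interfaces are already shared instead of inspecting the Fortran sources.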
Section 7 (see Figure 5) discusses the OLOGEN ontology that implements the above semantics.

7 OLOGEN: An Ontology for GENIE

We have designed an ontology for GENIE, called OLOGEN, using Protege [8]. OLOGEN provides a very detailed controlled vocabulary for both GENIE users and developers, as well as a tree-like structure of semantic dependencies, covering the components, their communication and the execution flow while GENIE runs. Requiring scientists to provide semantic annotations in their code in the appropriate manner ensures the integrity of developing GENIE models, coupling them and experimenting with GENIE. Furthermore, OLOGEN is designed with support for Grid execution by providing a distribution model (see Figure 4). Currently, GENIE runs on a single host; however, there is scope for distribution and parallelism, and OLOGEN supports both. OLOGEN has been designed based on our observations and in collaboration with the GENIE team. However, we feel that it will only reach its maximum value when tested in action and reviewed by the GENIE developers. OLOGEN consists of two main modules. First, we have designed a class hierarchy of GENIE entities that aims to capture the development needs of the GENIE community; Figure 4 portrays the development-related class hierarchy. Second, a hierarchy of relations that implement execution-flow-related entities, such as execution flow, data flow, interface coupling, component coupling, component nesting, component wrapping and thread coupling; Figure 5 shows these classes in Protege. The relations Component Wrapping and Thread Coupling can be seen as algorithmic skeletons and are discussed as such in Section 8. For example, the Execution Sequence relation maps a binary execution sequence to a component, thus enabling the creation of an execution flow with more than two components; the IGCM execution flow can be modelled in this way.
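The development-related class hierarchy of Figure 4 could be mirrored in code roughly as follows. The class names come from the paper; their fields and this code realisation are our assumptions, intended only to make the hierarchy concrete.

```python
# Minimal sketch of OLOGEN's development hierarchy (Figure 4).
class Entity:
    pass

class Interface(Entity):
    def __init__(self, name, direction):
        self.name = name            # e.g. "netsolar"
        self.direction = direction  # "in", "out" or "inout"

class Component(Entity):
    def __init__(self, name, interfaces=()):
        self.name = name
        self.interfaces = list(interfaces)

class ComponentImpl(Component):
    """A concrete implementation bound to underlying Fortran routines."""
    def __init__(self, name, routines, interfaces=()):
        super().__init__(name, interfaces)
        self.routines = list(routines)

class Module(Entity):               # atmos, land, ocean, seaice, landice
    def __init__(self, name):
        self.name = name

class Model(Entity):                # IGCM, c-Goldstein, EMBM, Glimmer
    def __init__(self, name):
        self.name = name

class Topology(Entity):
    """The combination of modules that forms an ensemble model."""
    def __init__(self, modules):
        self.modules = [m.name for m in modules]

# the minimal ensemble of Figure 1(a)
minimal = Topology([Module("atmos"), Module("ocean")])
```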
A Data Flow is a binary mapping from a set of components to another component that receives input from the former, plus an interface coupling; igcm land blayer.F is such a component. Non-binary data flows can be built from the binary ones as above. To illustrate OLOGEN with an example, consider the class Interface, as shown in Figure 4. We define an entity called a Method, which is illustrated in the Wrapper skeleton, discussed next.

8 Execution Coordination through Functional Skeletons

We simulated the execution flow of the GENIE topologies using the algorithmic skeleton programming paradigm. Algorithmic skeletons, proposed in [2] as part of SCL for coordinating parallel execution, are ideal for coordinating the execution patterns of GENIE models without undue programming effort. Indeed, due to their iterative nature and their periodic communication, executing GENIE components resemble co-routines or, in thread-enabled languages, interleaved threads. In order to distinguish between the various distribution models that exist in the Grid, we define a configuration skeleton, called distribution model, that describes the different options under which GENIE can execute. We define four distribution strategies: same-host, remote-host, distributed and parallel. We also define two skeletons, the Wrap and the ThreadCoupler skeletons, discussed next.

Fig. 4: Development Ontology in OLOGEN
Fig. 5: Relations in OLOGEN

8.1 The Wrap skeleton

The Wrap skeleton is a higher-order function that, given a component c, generates a component C that can be coupled with other components through a clearly-defined interface. The Wrap skeleton uses the following semantics: input, output and input/output variables, the base component name c and a list of control interfaces (the OLOGEN wrap relation is shown in Figure 6).
The syntax of the Wrap skeleton is the following:

    Wrap(c,
         list of wrapper input interfaces,
         list of wrapper output interfaces,
         list of local input interfaces,
         list of local output interfaces,
         list of control interfaces) = C

As an example, consider the application of the skeleton Wrap to the GENIE component igcm atmos adiab.F, resulting in the component igcm atmos adiab wrapper.F. The new component is portrayed in Figure 6. The export wrapper variables is a generic interface that is implemented depending on the configuration.

8.2 The ThreadCoupler skeleton

The ThreadCoupler skeleton is a higher-order function that, given two components c1 and c2 and a set of configuration parameters, returns two interleaved threads (Thread A and Thread B) in some appropriate implementation. The two threads are interleaved so that they simulate a co-routine execution:

    IterFor k_overall {
        IterFor timestamp1 stepA
        IterFor timestamp2 stepB
    }
    where stepA = threadA.execute; threadB.notify(interface1)
          stepB = threadB.execute; threadA.notify(interface2)

While thread B listens, thread A executes a number of iterations according to a predefined timestamp; it then notifies thread B, which starts executing a number of iterations defined by component c2 while A listens. When B completes its iterations, it notifies thread A, which repeats the previous steps, and so on. Each notification between A and B is associated with access to shared data structures, e.g., the mesh that represents the heat fluxes from the atmosphere to the land. The way data is exchanged during the notification is implementation-specific and depends on the distribution model. The ThreadCoupler skeleton returns a (coupled) component, thus enabling nested coupling.
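The two skeletons can be sketched as higher-order functions in an executable form. The Python rendering below is ours: the event-based thread interleaving implements the single-host distribution model, and the toy atmosphere/ocean steps (which merely record their execution order) stand in for real GENIE components.

```python
import threading

def wrap(component, inputs, outputs, controls=()):
    """Wrap skeleton (sketch): lift a bare routine into a component
    with a declared interface of input/output/control names."""
    return {"execute": component, "inputs": list(inputs),
            "outputs": list(outputs), "controls": list(controls)}

def thread_coupler(c1, c2, shared, timestamp1, timestamp2, k_overall):
    """ThreadCoupler skeleton (sketch): run two wrapped components as
    interleaved threads. Thread A executes `timestamp1` iterations and
    notifies B; B executes `timestamp2` iterations and notifies A; the
    pattern repeats `k_overall` times. `shared` stands in for the
    shared data structures exchanged at each notification."""
    a_turn, b_turn = threading.Event(), threading.Event()
    a_turn.set()                          # thread A starts

    def run(component, steps, my_turn, other_turn):
        for _ in range(k_overall):
            my_turn.wait()                # listen until notified
            my_turn.clear()
            for _ in range(steps):
                component["execute"](shared)
            other_turn.set()              # notify the peer thread

    ta = threading.Thread(target=run, args=(c1, timestamp1, a_turn, b_turn))
    tb = threading.Thread(target=run, args=(c2, timestamp2, b_turn, a_turn))
    ta.start(); tb.start()
    ta.join(); tb.join()
    return shared                         # the coupled result

# hypothetical atmosphere/ocean steps that record their execution order
atmos = wrap(lambda s: s.append("A"), inputs=["latent"], outputs=["netsolar"])
ocean = wrap(lambda s: s.append("O"), inputs=["netsolar"], outputs=["latent"])
trace = thread_coupler(atmos, ocean, [], timestamp1=2, timestamp2=1, k_overall=3)
```

Because the events enforce strict alternation, the trace is deterministic: two atmosphere steps, one ocean step, repeated three times, exactly the co-routine pattern described above.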
The syntax of the ThreadCoupler skeleton is the following:

    ThreadCoupler(C1, C2, interface1, interface2, timestamp1, timestamp2) = C3

The notify interface is an abstract interface that is implemented depending on the distribution model and the specifics of the interface. For example, in the case where large meshes are used for inter-component communication, the distribution model should be set to single-host. Otherwise, for example when coupling the check fluxes.F routine with the IGCM ocean component, a flag may be enough to communicate whether the checking was correct or not; in this case, the coupled components can be run on remote hosts and communicate via message passing. The notify interface can be implemented through standard I/O functions. As an example, consider the application of the ThreadCoupler skeleton to the components that compose a complete IGCM atmosphere-land-ocean model. Figure 7 portrays the result; the components are executed from top to bottom.

Fig. 7: OLOGEN ThreadCoupler relation (a) and IGCM ThreadCoupling (b). (The IGCM panel couples igcm_atmos_adiab, igcm_land_surflux, igcm_land_blayer, igcm_ocean_surflux, igcm_ocean_blayer and igcm_diab through nested IterFor and ThreadCoupler constructs, with pass_results steps in between.)

9 Conclusions

The main contribution of this paper has been to identify the difficulties faced when trying to technically advance a legacy environment. OLOGEN encapsulates the semantic knowledge underlying component coupling, and our algorithmic skeletons coordinate its realisation within a Grid environment. Although created for GENIE, our tools are applicable more generally. In addition, we are currently investigating extending the ICENI framework so as to include the necessary tools to provide the coupling functionality of GENIE. This would need further development of the meta-data layer concerning the meaning (types, semantic constraints) as well as the compositional application workflow. The latter has been implemented based on YAWL (Yet Another Workflow Language) [9], though it only partly provides its functionality. It would be feasible to extend the workflow so that it supports the extra operations involved in the coupling of GENIE components, such as component iteration and component composition. This should always be done with respect to the semantics specified by the GENIE framework.

References

1. The Imperial College e-Science Networked Infrastructure (ICENI). Available at: http://www.lesc.ic.ac.uk/iceni.
2. J. Darlington. Parallel programming using skeleton functions. Lecture Notes in Computer Science, 694:146-160, 1993.
3. R.W. Ford, G.D. Riley, M.K. Bane, C.W. Armstrong, and T.L. Freeman. GCF: A general coupling framework. In Concurrency and Computation: Practice and Experience, to appear 2005.
4. M.Y. Gulamali, A.S. McGough, S.J. Newhouse, and J. Darlington. Using ICENI to run parameter sweep applications across multiple Grid resources. In Global Grid Forum 10, Case Studies on Grid Applications Workshop, Berlin, Germany, Mar. 2004.
5. M.Y. Gulamali, T.M. Lenton, A. Yool, A.R. Price, R.J. Marsh, N.R. Edwards, P.J. Valdes, J.L. Wason, S.J. Cox, M. Krznaric, S. Newhouse, and J. Darlington. GENIE: Delivering e-Science to the environmental scientist. In UK e-Science All Hands Meeting 2003, pages 145-152, Nottingham, UK, 2003.
6. M.Y. Gulamali, A.S. McGough, R.J. Marsh, N.R. Edwards, T.M. Lenton, P.J. Valdes, S.J. Cox, S. Newhouse, and J. Darlington. Performance guided scheduling in GENIE through ICENI. In UK e-Science All Hands Meeting 2004, pages 792-799, Nottingham, UK, 2004. ISBN 1-904425-21-6.
7. H. Casanova, G. Obertelli, F. Berman, and R. Wolski. The AppLeS Parameter Sweep Template: User-level middleware for the Grid. In Supercomputing Conference 2000, 2000.
8. M.A. Musen, S.W. Tu, H. Eriksson, J.H. Gennari, and A.R. Puerta. PROTEGE-II: An environment for reusable problem-solving methods and domain ontologies. Chambery, Savoie, France, 1993.
9. W. van der Aalst and A. Hofstede. YAWL: Yet Another Workflow Language. Technical Report FIT-TR-2002-06, Queensland University of Technology, Brisbane, 2002. Available at: citeseer.ist.psu.edu/vanderaalst03yawl.html.