From “lab books” to computational Earth science.
Chris Hill, MIT – cnh@mit.edu
Edinburgh, July 2007

Lab books
A lab notebook is a primary record of research. Researchers use a lab notebook to document their hypotheses, experiments and initial analysis or interpretation of these experiments. The notebook serves as an organizational tool and a memory aid, and can also have a role in protecting any intellectual property that comes from the research.
The guidelines for lab notebooks vary widely between institutions and between individual labs, but some guidelines are fairly common. The lab notebook is usually written in as the experiments progress, rather than at a later date. Many say that a lab notebook should be thought of as a diary of activities that are described in sufficient detail to allow another scientist to follow the same steps. To ensure that data cannot be easily altered, notebooks with permanently bound pages are often recommended. Researchers are often encouraged to write only with unerasable pen, to sign and date each page, and to have their notebooks inspected periodically by another scientist who can read and understand them. All of these guidelines can be useful in proving exactly when a discovery was made, in the case of a patent dispute.
Several companies now offer electronic lab notebooks. This format has gained some popularity, especially in large pharmaceutical companies, which have large numbers of researchers and great need to document their experiments.
(Wikipedia)

Lab books
• Physical, chemical and biological scientists are taught lab-book discipline from an early age.
– reproducible results are the foundation of scientific and engineering disciplines, e.g. Michelson/Morley.
– even an infamous “Journal of Irreproducible Results”
• In computational science the “lab book” discipline is not so ubiquitous, maybe because
– a program is a formal statement of applied mathematical axioms
– axioms are deterministic
– therefore reproducibility is not an issue
– however, a program, i.e. a complex collection of simple elemental statements, is hard to comprehend. If details are not recorded, reproducibility may well be an issue.

Some example computational Earth science experiments.
• Aqua-planet.
• Eddying North Atlantic.
• Global ocean with eddies and seaice.
• IPCC

A simple GFD configuration
Water-covered planet. Atmosphere-ocean-seaice. Jean-Michel Campin and David Ferreira
• Some factors that affect the solution:
– Initial conditions.
– Atmosphere: Clouds, radiation, dynamics, boundary layer, temporal and spatial discretization….
– Seaice: Thermodynamics. Aging. Stress-strain relation….
– Ocean: Dynamics, coordinate system, vertical/horizontal friction and mixing….
– Coupling: Time stepping, energetics.
– External forcings: Solar insolation, reference profiles

An eddying, ocean only configuration
Ocean-only, forced with atmospheric reanalysis for Jan-Mar. Red/blue shading: ocean heating/cooling. Cyan/magenta line: +/-17.5°C @ 200 m. Streaks: wind stress. Green thickness: ocean mixed layer depth.
• Some factors that affect the solution:
– Initial conditions.
– Atmosphere fluxes: Planetary boundary layer scheme.
– Ocean: Dynamics, coordinate system, vertical/horizontal friction and mixing….
– Coupling: Time stepping, energetics.
– External forcings: Solar insolation, reference profiles, atmospheric reanalysis.
– Non-linear/turbulent flow, so bitwise reproducibility is subject to FP round-off, parallel reduction operations etc…

Global eddying ocean, seaice decadal ensemble. 50+ members.
Ensemble perturbations:
• Numerical formulation
• Ocean parameters
• Seaice parameters
• Initial conditions
• Boundary conditions

IPCC ocean ACC transports
[Figure: ACC transport in Sv (axis 0 to 400) for CNRM-CM3, INM-CM3.0, MRI-CGCM2.3.2a, CCCMA-CGCM3.1, MIROC3.2(hires), MIROC3.2(medres), GISS-AOM and GISS-ER, alongside the observational value.]
Couples atmosphere, ocean, seaice, land, vegetation, chemistry etc…
• Could I make this plot without too much difficulty? Yes.
• Could I rerun an IPCC scenario (possibly with some parameter change)? No.
Diagnosing these results is possible today (PCMDI/ESG archives) for the broad scientific community. Rerunning experiments (with or without small changes) is still very hard. Factors affecting the solution range from bottom drag to land-surface formulation to emissions profiles.

Examples summary
• To reproduce an experiment
– a significant quantity of information needs to be stored
– it spans broad “big-picture” information (water-covered planet, atmos+ocean+seaice) to minute details (bitwise reproducibility may require a record of compiler, OS etc…)

Way Forward
• A hand record is neither practical nor ideal (i.e. not as potentially useful as an electronic record).
• Electronic information should be stored so as to be amenable to machine reasoning.
– requires defined vocabularies, precise formal structure, pattern matching, rules etc…
– W3C/semantic web technologies: XML, RDF
• In theory, using XML, RDF etc…, we could describe model systems and enable reruns for extra outputs (e.g. transport of S3 by flow) or derived runs (e.g. modified air-sea coupling coefficient or formulation).
• In practice this is hard work!

Baby steps toward a computational Earth science “model repository”.
• What is working today – PCMDI/ESG
• Steps toward future – ESC

PCMDI
• Archive of all IPCC model outputs.
• Stored in common format (netCDF with standard metadata).
• Stored on common mesh. Simplifies things, but can/does degrade information and even mislead (e.g. conservation in one coordinate system may be inexact in another).
• Very limited model metadata is held.
• Very successful and technically impressive – societal utility is a function of model quality!
Schmittner et al. (2005, GRL)

Earth System Curator (ESC)
Can we (for better or worse!) do for models what PCMDI does for datasets?
PCMDI datasets are data “wrapped” in a common/standard container (netCDF). The PCMDI container is “self-describing”. This means we can query and even combine (to some degree) the PCMDI datasets.
A container analogy for modeling technology is the “component architecture” supported by systems like ESMF.

Building a coupled model oriented solution – modeling system as a component tree
• Some mathematics
– component $M$
– no side-effects
– possible persistent internal state e, i
• Supports representation as DAG such that
  $P_{\ldots,n} = \{P_{\ldots,n},\ P_{\ldots,n,m} : m = 1, n_c\}$
  e.g. for the tree
  $M_0$
    $M_{0,1}$: $M_{0,1,1}$, $M_{0,1,2}$
    $M_{0,2}$: $M_{0,2,1}$, $M_{0,2,2}$, $M_{0,2,3}$
  $P_{0,2} = \{P_{0,2},\ P_{0,2,1},\ P_{0,2,2},\ P_{0,2,3}\}$

Example of actual component tree. Suarez et al.
• Tree of components from the GEOS-5 modeling system.
• Each box is an ESMF component.
• Components adhere to DAG semantics.

Individual components in ESMF
• ESC builds on an ESMF-like component model.
– ESMF Component
  • Container for a sequence of computation that implements a particular algorithm (physics simulation, e.g. Navier-Stokes solver, or technical function, e.g. history manager). An ESMF component exposes its external interfaces through an ESMF state.
– ESMF State
  • Container data type to transport data between components.
– ESMF Field
  • Container data type that can be used to push/pop n-dimensional data with an associated mesh from an ESMF State.

Given a component model, like the ESMF paradigm, ESC…
• Describes a component in terms of
– parameters that control the computation sequence.
– states and fields that are passed into/out of the component.
• Provides two levels of description – potential and specific.
– Potential is a list of all possible parameters and fields. It is a virtualized description in that it is not describing a specific instance.
– Specific is a description of an instantiated component in which parameters are bound to specific values and fields and states are bound to specific values.

ESC component descriptions are in terms of XML schema.
• Curator-NMM – Describes numerical model parameters, e.g. timestep, system requirements.
• Gridspec – Describes numerical mesh.
• Curator-CIAO – Describes component inputs and outputs.
• Curator-complete – Describes wiring together of components.
– A coupled component is also a component, i.e. the schema is recursive.
Some details (more at http://www.earthsystemcurator.org) …..

Curator-NMM
• The Curator-NMM schema describes model components, their content, and their connections. It is a superset of the NMM schema. The main constructs in the Curator-NMM schema are component, potential model, and model. Components are "composable" pieces of code that can be coupled together in various arrangements to form different models. A potential model consists of a group of components, and describes the set of possible models that can be built from those components. A model is a fully specified application based on a potential model and configuration choices.

Mosaic Grid Specification
• The Mosaic Grid Specification is a standardized description of multi-patch, structured grids being developed in coordination with CF activities.

Component-component compatibility checking.
• ESC can describe coupled (multi-component) systems.
• In principle ESC could support recombination of components from coupled systems, e.g. couple component A (atmosphere dynamics) with component B (land-surface).
• Ideally, for this, compatibility constraints need to be expressed in a standard way.
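As a rough illustration of what a machine-checkable compatibility constraint could look like, the sketch below compares the fields one component exports with the fields another imports. It is not the ESC schema or the ESMF API: the classes `FieldSpec` and `ComponentDescription`, the function `check_compatibility`, the grid identifiers and the choice of field names are all hypothetical, chosen only to show how a "specific" description might be tested before two components are wired together.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FieldSpec:
    """One field a component imports or exports (hypothetical structure)."""
    name: str   # CF-style standard name, used here only for illustration
    units: str
    grid: str   # identifier of the mesh the field lives on (made up)


@dataclass
class ComponentDescription:
    """Hypothetical 'specific' description: imports and exports are bound."""
    name: str
    imports: list = field(default_factory=list)
    exports: list = field(default_factory=list)


def check_compatibility(producer, consumer):
    """Return a list of problems; an empty list means the pair can be coupled."""
    problems = []
    exported = {f.name: f for f in producer.exports}
    for needed in consumer.imports:
        offered = exported.get(needed.name)
        if offered is None:
            problems.append(
                f"{consumer.name} needs '{needed.name}', "
                f"which {producer.name} does not export")
            continue
        if offered.units != needed.units:
            problems.append(
                f"'{needed.name}': units differ "
                f"({offered.units} vs {needed.units})")
        if offered.grid != needed.grid:
            problems.append(
                f"'{needed.name}': grid differs "
                f"({offered.grid} vs {needed.grid}), regridding required")
    return problems


# Component A (atmosphere dynamics) and component B (land surface).
atmosphere = ComponentDescription(
    name="A",
    exports=[
        FieldSpec("precipitation_flux", "kg m-2 s-1", "atm_c48"),
        FieldSpec("air_temperature", "K", "atm_c48"),
    ],
)
land = ComponentDescription(
    name="B",
    imports=[
        FieldSpec("precipitation_flux", "kg m-2 s-1", "land_1deg"),
        FieldSpec("air_temperature", "K", "atm_c48"),
        FieldSpec("surface_downwelling_shortwave_flux", "W m-2", "land_1deg"),
    ],
)

problems = check_compatibility(atmosphere, land)
for p in problems:
    print(p)
```

In this toy case the check reports a grid mismatch for precipitation and a missing shortwave field, which is the kind of information a coupler or a human would need before recombining A and B. A real constraint vocabulary would also have to cover time stepping, conservation properties and parameter ranges.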
Service architectures
• Standards services
– Developing standardized descriptions is a well-proven method toward a service oriented approach e.g.

Some useful URLs (an incomplete list)
Component models, metadata & standards:
http://www.esmf.ucar.edu
http://maplcode.org
http://www.earthsystemcurator.org
http://ncas-cms.nerc.ac.uk/NMM/
http://www.earthsystemgrid.org/
http://www.cgd.ucar.edu/cms/eaton/cfmetadata/
http://sbml.org/index.psp
http://cml.sourceforge.net/wiki/index.php/Main_Page
http://www.w3.org/

Summary
• The Earth System Curator project is an activity developing schema and tools to capture “semantic” information about models.
– Such information provides a basis for formally recording numerical experiments – a computational Earth science “lab book”.
– It also provides the basis for a formal approach to reproducible numerical results – fewer “Journal of Irreproducible Results” candidates.
• Other efforts, e.g. SBML (systems biology) and CML (chemistry), already have “uploads” to Science submissions.
• Maybe soon a computational Earth science challenge will become how to stop people doing dumb things with easy-to-use modeling services, rather than how to get people to use obtuse legacy modeling systems – maybe!

ESC collaboration
• NCAR (Cecelia DeLuca, Julien Chastang), MIT (Chris Hill, Constantinos Evangelinos), Georgia Tech (Spencer Rugaber, Rocky Dunlap, Angela), GFDL (Balaji, Sergey), Reading UK (Lois Steenman-Clark, Katherine Boughton), PRISM (Sophie Valcke).