Climate Knowledge Discovery Workshop Report

Reinhard Budich [1], MPI-M; Peter Fox [2], RPI; Auroop Ganguly [3], ORNL; Jim Kinter [4], COLA; Per Nyberg [5], CRAY; Tobias Weigel [6], DKRZ

[1] Corresponding author address: Reinhard Budich, reinhard.budich@zmaw.de
[2] Corresponding author address: Peter Fox, pfox@cs.rpi.edu
[3] Corresponding author address: Auroop Ganguly, gangulyar@ornl.gov
[4] Corresponding author address: Jim Kinter, kinter@cola.iges.org
[5] Corresponding author address: Per Nyberg, nyberg@cray.com
[6] Corresponding author address: Tobias Weigel, weigel@dkrz.de

1. Background

Science based on numerical simulation follows a new paradigm: its knowledge discovery process rests upon massive amounts of data. We are entering the age of data-intensive science. Data is either generated by equipment such as satellite sensors, microscopes and particle colliders, or is born digital, i.e. generated by intensive use of high performance computing. One of the largest repositories of scientific data in any discipline is the model-generated and observational geoscience data produced and used in climate science.

Climate scientists gather data faster than the data can be interpreted. Current approaches to these data volumes rely primarily on traditional methods that are best suited to large-scale phenomena and coarse-resolution data sets. The data volumes from climate modeling will increase dramatically as both the resolution and the number of processes described increase. What is needed is a suite of new techniques for interpreting and linking phenomena on and between different time and length scales, as well as across realms and processes. Such tools could provide unique insights into challenging features of the Earth system, including extreme events, nonlinear dynamics and chaotic regimes.

Based on experience, including that from other sciences, the breakthroughs needed to address these challenges will come from well-organized collaborative efforts involving several disciplines, including end-users and scientists in climate and related areas (atmosphere, ocean, land surface, …), computer and computational scientists, computing engineers, and mathematicians [xii].

This report summarizes the findings from the Climate Knowledge Discovery workshop organized by Deutsches Klimarechenzentrum GmbH (DKRZ), Max-Planck-Institut für Meteorologie (MPI-M) and Cray Inc. to bring together experts from various domains to investigate the use and application of large-scale graph analytics, semantic technologies and knowledge discovery algorithms in climate science. The workshop was held from 30 March to 1 April 2011 at DKRZ in Hamburg, Germany.

2. Workshop Objective

The aim of the workshop was to formulate science and technology strategies to further develop climate knowledge discovery methods and tools to enable a data-intensive approach to climate science. Should such advanced data analytical tools be proven viable, their methods should be used in future Coupled Model Inter-comparison Project (CMIP) [7] efforts, including the 5th round that is underway now (CMIP5). Three basic questions were posed to pursue this aim:

1) What steps are required to realize the potential of graph analytics, semantic technologies and knowledge discovery algorithms in climate science?
2) Where methods and technologies already exist, how can they be leveraged?
3) Where gaps are identified, what steps must be taken to address them?

3. Data Driven Opportunities for Climate Science

Climate science faces long-standing gaps in the understanding of key processes like convection, land-atmosphere interaction, and teleconnections. A combination of improved models, observations and computing power has been suggested as the way towards improving this understanding.

Our postulate: Confronting model simulations with observations in new ways, and learning functional relations among climate variables from observations and models (relations that may generalize to predictive insights under a non-stationary climate), would be invaluable for basic climate research and key to addressing adaptation and resource management issues in a timely manner. This provides the motivation for novel and scalable methodologies in data-guided descriptive and predictive analysis, which can handle massive volumes of "5-dimensional" data (space in x, y and z, time, and parameters) generated from nonlinear processes with complex dependencies and feedback loops.

Many application areas [vi] have recognized the need for representing and reasoning about domain and multi-disciplinary knowledge, and have recognized the reality of science today: knowledge is distributed over many resources. Computer science research areas being put into practice include information integration, distributed knowledge management, the semantic web (including the Resource Description Framework (RDF) and the Web Ontology Language (OWL), both of which are World Wide Web Consortium Recommendations, i.e. standards), and multi-agent and distributed reasoning systems. Methods of automated reasoning and advanced analytics have been used in a number of domains by employing formal ontologies expressed in logic [viii]. Ontologies are essential components of modern knowledge management systems, and are becoming more prevalent in distributed computing environments based on Web Services, and in applications such as e-Commerce and e-Science.

In information systems science, an ontology [vii] is a formal representation of knowledge as a set of concepts within a domain and the relationships among those concepts. Formalized ontologies are used to enable automated reasoning about the entities within that domain, in addition to describing the domain.
Ontologies are useful in articulating shared vocabularies, which can be used to model a domain based on the types of objects and/or concepts that exist, as well as their properties and the relations among objects. More broadly, semantic technologies comprising query, reasoning, and semantic storage and retrieval capabilities have already been implemented in community efforts such as the Earth System Curator [8] and the METAFOR project [9], the National Aeronautics and Space Administration (NASA) Semantic Web for Earth and Environmental Terminology (SWEET) [10], and the International Research Institute (IRI) Lamont-Doherty Earth Observatory (LDEO) Climate Data Library [11].

[7] http://www-pcmdi.llnl.gov/projects/cmip/index.php

February 9, 2016 CKD Workshop Report

4. Complex Network Solutions

Data-intensive science in general has been called the fourth paradigm of scientific exploration [ix], in addition to theory, experimentation and the more recently added computational modeling and simulation. As the meeting progressed there was a growing awareness that a combination of high-performance analytics of model-simulated and observed data with algorithms motivated by network science and graph theory, nonlinear dynamics and statistics, as well as data mining and machine learning, may be one way forward for the community. Recent developments in climate networks [i, ii, iii, iv, v], while impressive, have barely scratched the surface of what may eventually be possible. While statistical and dynamical downscaling have become widely applied in the climate community, challenging issues about bias-variance tradeoffs, long-range dependence or teleconnections, and the impacts of boundary or initial conditions remain largely unsolved. Here complex networks [12] may offer the next generation of solutions in climate science and climate change consequence management.
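
One construction that is common in the climate network literature treats locations as nodes and links two nodes when their time series are strongly correlated. The following Python sketch illustrates the idea on synthetic data only. The node names, the series, the helper functions `pearson` and `build_network`, and the 0.5 threshold are all assumptions made for this example and are not taken from the workshop.

```python
import math
import random


def pearson(x, y):
    """Return the Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def build_network(series, threshold):
    """Link two nodes when |correlation| of their series exceeds threshold."""
    names = sorted(series)
    edges = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(series[a], series[b])) > threshold:
                edges.add((a, b))
    return edges


# Synthetic anomaly series (hypothetical, for illustration only).
random.seed(0)
steps = 200
signal = [math.sin(2 * math.pi * t / 40) for t in range(steps)]
series = {
    # grid_a and grid_b share one oscillation with opposite sign,
    # mimicking a remote, anti-correlated response.
    "grid_a": [s + random.gauss(0, 0.3) for s in signal],
    "grid_b": [-s + random.gauss(0, 0.3) for s in signal],
    # grid_c and grid_d are independent noise.
    "grid_c": [random.gauss(0, 1) for _ in range(steps)],
    "grid_d": [random.gauss(0, 1) for _ in range(steps)],
}

edges = build_network(series, threshold=0.5)
# Node degree: the number of edges that touch each node.
degree = {name: sum(name in edge for edge in edges) for name in series}
print(sorted(edges))
print(degree)
```

Taking the absolute value of the correlation keeps anti-correlated pairs in the network. The choice of similarity measure and threshold is a modeling decision that varies between studies, and real applications would start from gridded anomaly fields, not from four hand-made series.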

Another emergent research area is to explore ways to connect network solutions from climate change science all the way to impacts science, in areas such as urban sustainability and endangered natural ecosystems, where both the climate and the impacted systems may be represented through loosely coupled network paradigms. The ability to interpret the results from the climate science perspective and to develop hybrid models that blend available process knowledge with complex networks or other data-guided approaches will remain issues of key importance.

[8] http://www.earthsystemcurator.org/ontologies/
[9] http://metaforclimate.eu
[10] http://sweet.jpl.nasa.gov/ontology/
[11] http://iridl.ldeo.columbia.edu/ontologies/
[12] A complex network is a network (graph) with non-trivial topological features, i.e. features that do not occur in simple networks such as lattices or random graphs but often occur in real graphs. (Wikipedia)

5. Construction of Graphs from Climate Data

Graphs are mathematical structures used to model pairwise relations between objects. A graph is represented by a collection of vertices or 'nodes' and a collection of edges that connect pairs of vertices. The workshop provided a venue for an open and frank discussion of the potential application of graph theory and graph analytics to climate data. It was noted that the first step, getting community information into graph form, has not been widely adopted. The construction of graphs is an information modeling process. It is important to distinguish this ‘modeling’ from the familiar modeling of climate data using mathematical equations and, increasingly, the implementation of algorithms in computer software executed on modern computer systems. To invoke this process, the climate community needs to define what is important to the graph: the types of nodes in the graph and their first order connections.
Beyond that, the construction of climate science domain ontologies would be a key next step, but to apply these ontologies it is also necessary to define the process of semantically annotating climate data and representing it in graphs.

Two basic approaches were described to construct graphs from climate data. First, the gridded values of geophysical fields representing physical phenomena such as sea ice or melt ponds can be described as nodes in the graph, and the edge between two nodes can describe domain processes. Alternatively, the edges can describe the physical phenomena and the nodes can describe the processes. This bi-directional approach seems to have a lot of potential. It was noted, however, that the graph community does not currently have good knowledge representation forms for physical processes. Encoding of science knowledge has traditionally been more successful when describing objects that can be measured. Both approaches rely on a rich set of feature detection algorithms to transform the grid-based primary data into objects in a graph as the first step.

More complex graphs can also be constructed with a named definition of the network where node types do not need to be consistent in definition. For example, one node could define a geographic area (Pacific Region) and be connected to a node with physical measurements (time series of temperatures) with a named and typed relationship (measured location). Sub-graphs can be constructed to represent different components currently implemented in coupled Earth system models (atmosphere, ocean, land surface, …). Different algorithmic approaches to climate modeling can then be manifest in multiple graph instantiations.

6. Potential Areas of Application

a. Model Inter-comparison: Model inter-comparison projects are a standard procedure that enables a diverse community of scientists to analyze climate models in a systematic fashion, which serves to facilitate model improvement.
In a model inter-comparison, many models with different assumptions are subjected to the same experimental protocol, and the output of all the models is made available to a community of researchers. This procedure requires a community-based infrastructure in support of climate model diagnosis, validation, inter-comparison, documentation and data access. The construction of graphs using model output provides another method, in addition to more traditional methods, by which to compare outputs either from different models for the same simulated period or from the same model using perturbed conditions. Graph algorithms such as graph isomorphism testing allow for powerful exploration of the encoded knowledge base (isomorphism is an equivalence relation between graphs).

b. Climate Teleconnections: Climate science features challenging characteristics such as nonlinear dynamics, chaotic regimes and multi-physics complexity. Anomalies can be related to each other at large distances (typically thousands of kilometers), a feature referred to as a teleconnection. Information is propagated between the distant points through the atmosphere or ocean by transport processes or wave dispersion. In addition, the climate system is dynamic, such that remote atmospheric and ocean responses to large-scale fluctuations in the climate system will never occur exactly the same way twice. From a computational perspective, this problem is extremely challenging, requiring the search for patterns and relationships at large distances on a graph within and between time-slices. Traversing a graph looking for what is connected to what causes every memory reference to be global and random, since every reference will lead to new connections that are not located in the same sections of memory.

c. Scale Interactions: Climate phenomena are also observed at a wide range of spatial (10^3 to 10^7 m) and temporal (10^-2 to 10^8 years) scales.
Because of the fluid dynamics of the atmosphere and oceans, and because the processes governing phenomena on different time scales are codependent, there are strong interactions among the spatial and temporal scales of climate. This complicates the analysis of climate data and presents challenges for data processing and automating inference.

7. Technology Requirements

Applications that aim to combine semantic annotation and reasoning with graph algorithms and climate data are more successful when built on sound ontologies. Many science ontologies have been developed in the past, but their consistency and methodological soundness (and completeness) are highly variable. Most importantly, realistic use cases must be defined even before starting to engineer ontologies. The time frame for such developments is therefore often multiple years.

In the METAFOR project, an ontology for model metadata has been developed, the METAFOR Common Information Model (CIM) [13]. Like the Earth System Grid (ESG) ontology, it is intended to replace existing model metadata. While ESG uses Semantic Web standards such as RDF and OWL to encode the ontology, the METAFOR project uses eXtensible Markup Language (XML) Schema for the application-level encoding and CIM support tools. An encoding of the CIM with RDF and OWL has been proposed, but there are open issues. The encoding methodology is unclear so far and the feasibility of automated translation is disputed. There are also differences between the meta-model of the Semantic Web standards and that of the International Organization for Standardization (ISO) standards on which the CIM relies, and these differences impede a direct transition. METAFOR has also made progress on the important aspect of developing controlled vocabularies for community terms.
While the CIM captures information related to the scientific workflow, the controlled vocabularies describe the physical phenomena in the data, which provides a foundation for developing domain ontologies and any subsequent graph representation of the data.

Graphs constructed from climate data could reach millions or billions of nodes. Graphs of such size are well beyond simple comprehension or visualization. Parallel software and hardware technologies will be essential to enable the complex data analytics on today’s multi-terabyte and tomorrow’s petabyte datasets that future climate analysis will require. Investigation of existing and new parallel graph algorithms and data structures capable of analyzing spatio-temporal, and possibly dynamic, data at massive scale on parallel and multithreaded systems is required. From a contemporary computing perspective, the performance of graph algorithms is typically limited by memory latency, with access patterns being highly data dependent.

[13] http://metaforclimate.eu/

8. Next Steps

There exists a set of highly articulated tools in both the graph and climate communities. Investigation is required to explore and, in some cases, jumpstart the use of knowledge discovery algorithms in climate science. These approaches would augment the traditional methods of Earth system modeling and further leverage the volumes of observational and model data. Graph and data-driven technologies have been successfully applied to social and bioinformatics areas. Work to date by groups including the Potsdam Institute for Climate Impact Research (PIK), the University of Wisconsin and Oak Ridge National Laboratory has demonstrated initial applicability to climate science. The workshop concluded that it would be valuable to further stimulate uptake and evaluation within the climate research community.

Many of the traditional methods of climate analysis originated in meteorology, where they are used to understand the weather. The equations used in numerical weather prediction and climate simulation are nearly the same insofar as geophysical fluid dynamics is at the core of atmospheric and oceanic motions. Weather and climate analysis, however, are distinct in many ways, because climate feedback processes occur on time scales that are long compared to the variations of the weather. Climate varies over weeks through millions of years. Processes exist that impact the full range of time scales, but it is not yet fully understood which of the processes are most important on what time scales and for what areas.

As an example of an area of interest, the Arctic will experience changes in climate more rapidly than other parts of the globe. Possible reasons include positive feedbacks that amplify the general warming trend due to increasing greenhouse gas concentrations: snow-ice albedo, stabilization of the atmosphere, cloud influence, and sea ice influenced by wind. What cannot be determined in climate models today, or in the traditional analysis of climate observations, is which of those is most relevant to the current changing climate.

Another critical area of concern is the degree of uncertainty in climate predictions and projections. High confidence exists in the projections of climate change at the global scale and in the understanding of climate processes at large scales, but confidence decreases with decreasing spatial scale. Decision-making, in contrast, occurs at the scale of human institutions, which is much more local or regional in scope. Furthermore, there is a growing interest in the attribution of climate events, either to proximate causes or to their ultimate origins, in particular, whether those origins are natural or related to human activity. Attribution depends on an understanding of the predictability of climate phenomena.
It is critical, therefore, to assess predictability, confidence and uncertainty at scales relevant for decision support.

Two areas for development were identified at the workshop: ontologies to describe the relationship between features and events, and a CKD test-bed based on a subset of the CMIP5 data. A key outcome of discussions among the attendees was to pursue a systematic methodology and begin by focusing on a specific use case or two that will provide some instance-level modeling. Understanding the principal processes of a specific use case (such as Arctic climate change) is key to defining representations and will have the highest probability of success.

From a methodology perspective, the use case, and the questions requiring answers that arise from it, define the required vocabulary terms, i.e. semantics and their relations. Since the ultimate knowledge base construction arises from a graph structure, the associated information modeling is best conducted in a group setting. Initially, a small multidisciplinary team including information/knowledge modelers and domain scientists, working within a conceptual model framework and suitably influenced by familiarity with the underlying data of interest, develops the knowledge model. This model can be discussed and vetted by a larger group of domain experts, and a prototype knowledge base can be constructed and analyzed to assess the suitability, consistency and completeness of the model. Subsequently a more formal evaluation process can be applied to determine iterations and improvements. This approach has been used successfully in a number of other semantically-based projects [x, xi]. The primary aim of this approach is to obtain suitably structured ‘data’, i.e. data in graph model form, upon which graph and complex network analyses can be applied.
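
To make "graph model form" more concrete, the Python sketch below stores a handful of typed nodes and named relations and answers one simple query. It is a minimal illustration and not the workshop's knowledge model: the class `KnowledgeGraph`, the node names and types, and the relation names (`measured_location`, `influences`, `located_in`) are hypothetical, loosely following the Pacific Region example in Section 5 and the Arctic use case mentioned above.

```python
from collections import defaultdict


class KnowledgeGraph:
    """A tiny typed graph: nodes carry a type, edges carry a relation name."""

    def __init__(self):
        self.node_types = {}                # node name -> type label
        self.out_edges = defaultdict(list)  # subject -> [(relation, object)]

    def add_node(self, name, node_type):
        self.node_types[name] = node_type

    def add_edge(self, subject, relation, obj):
        # Reject edges whose endpoints were never declared with a type.
        for node in (subject, obj):
            if node not in self.node_types:
                raise KeyError(f"undeclared node: {node}")
        self.out_edges[subject].append((relation, obj))

    def subjects(self, relation, obj):
        """Return all nodes linked to obj by the given relation, sorted."""
        return sorted(
            s for s, links in self.out_edges.items()
            if (relation, obj) in links
        )


# Hypothetical vocabulary terms for illustration only.
kg = KnowledgeGraph()
kg.add_node("Arctic", "GeographicArea")
kg.add_node("Pacific Region", "GeographicArea")
kg.add_node("arctic_temperature_series", "TimeSeries")
kg.add_node("pacific_temperature_series", "TimeSeries")
kg.add_node("sea_ice", "Phenomenon")
kg.add_node("snow_ice_albedo_feedback", "Process")

kg.add_edge("arctic_temperature_series", "measured_location", "Arctic")
kg.add_edge("pacific_temperature_series", "measured_location", "Pacific Region")
kg.add_edge("snow_ice_albedo_feedback", "influences", "sea_ice")
kg.add_edge("sea_ice", "located_in", "Arctic")

# Query: which time series were measured in the Arctic?
arctic_series = kg.subjects("measured_location", "Arctic")
print(arctic_series)
```

Declaring node types before adding edges is one simple way to keep the vocabulary explicit, so that a group vetting the model can review the list of types and relations separately from the data. A production system would more likely use RDF and OWL tooling than a hand-written class.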

Further examination was also proposed on a set of science topics that will bring the climate science analysis and graph analytic communities together on a given number of concrete examples. Specific test cases are to be developed but are expected to include model intercomparison, feature detection, and teleconnections. These test cases will expose the detailed technical implementation (graph construction, graph isomorphism approaches, etc.) and develop a common understanding of the concepts and terms used by the various communities (graphs, semantics, etc.). It is anticipated that such studies can advise on the right mix of automation, human control and abstraction that works best for analyzing new climate data sets.

9. CKD Workshop Attendees

Name                       Affiliation
David A. Bader             Georgia Institute of Technology
Venkatramani Balaji        NOAA
Joachim Biercamp           DKRZ
Benno Blumenthal           Columbia IRI
Michael Böttinger          DKRZ
Reinhard Budich            MPI-M
Alexey Cheptsov            HLRS
Kendall Clark              Clark & Parsia
Traute Crüger              MPI-M
Gerry Devine               University of Reading
Andi Drebes                Uni Hamburg
John Feo                   PNNL
Peter Fox                  RPI
Bernadette Fritzsch        AWI
Steven Haflich             Franz
Illia Horenko              University of Lugano
Heike Jänicke              University of Heidelberg
Elke Keup-Thiel            CSC
Stephan Kindermann         DKRZ
Jim Kinter                 COLA and George Mason University
Ingo Kirchner              FU Berlin
Kerstin Kleese van Dam     PNNL
Luis Kornblueh             MPI-M
Uwe Kuester                HLRS
Jürgen Kurths              PIK
Michael Lautenschlager     DKRZ
Tobias Lippert             ParStream
Alexander Löw              MPI-M
Thomas Ludwig              DKRZ
Jim Maltby                 Cray
Jochem Marotzke            MPI-M
Norbert Marwan             PIK
Thorsten Mauritsen         MPI-M
Philipp Metzner            University of Lugano
Shoaib Mufti               Cray
Craig Norvell              Franz
Per Nyberg                 Cray
Christian Pagé             CERFACS
Stephen Pascoe             BADC
Michael Ponater            DLR
Hans Ramthun               DKRZ
Nurcan Rasig               Cray
David Rogers               Sandia
Will Sawyer                CSCS
Gary Stanbridge            Cray
Karsten Steinhaeuser       Univ. of Notre Dame / ORNL
Anastasios Tsonis          University of Wisconsin
Tobias Weigel              DKRZ

10. References

[i] K. Steinhaeuser, N. V. Chawla and A. R. Ganguly (in press, available at doi:10.1002/sam.10100). Complex Networks as a Unified Framework for Descriptive Analysis and Predictive Modeling in Climate Science. Statistical Analysis and Data Mining.
[ii] K. Steinhaeuser, N. V. Chawla and A. R. Ganguly (2010). An Exploration of Climate Data Using Complex Networks. SIGKDD Explorations 12(1), 25-32.
[iii] Y. Zou, J. F. Donges and J. Kurths. Recent advances in complex climate network analysis. Complex Systems and Complexity Science, 2011, pp. 27-38.
[iv] J. F. Donges, Y. Zou, N. Marwan and J. Kurths. Complex networks in climate dynamics. Europhysics Letters, 87:48007, 2009.
[v] A. A. Tsonis, K. L. Swanson and P. J. Roebber. What Do Networks Have to Do with Climate? Bulletin of the American Meteorological Society, 87(5):585–595, 2006.
[vi] D. L. McGuinness, P. Fox, B. Brodaric and E. F. Kendall. The Emerging Field of Semantic Scientific Knowledge Integration. IEEE Intelligent Systems, 24(1):25-26, 2009.
[vii] T. R. Gruber. A translation approach to portable ontologies. Knowledge Acquisition, 5(2):199-220, 1993.
[viii] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi and P. F. Patel-Schneider. The Description Logic Handbook: Theory, Implementation, Applications. Cambridge University Press, Cambridge, UK, 2003.
[ix] T. Hey, S. Tansley and K. Tolle (Eds.). The Fourth Paradigm: Data Intensive Scientific Discovery. Microsoft External Research, 2009.
[x] J. L. Benedict, D. L. McGuinness and P. Fox. A Semantic Web-based Methodology for Building Conceptual Models of Scientific Information. EOS Trans. AGU, 88(52), Fall Meeting Suppl., Abstract IN53A-0950, 2007.
[xi] P. Fox and D. L. McGuinness. An Open-World Iterative Methodology for the Development of Semantically-enabled Applications. In preparation, 2011.
[xii] A. Navarra, J. L. Kinter III and J. Tribbia. Crucial Experiments in Climate Science. Bull. Amer. Meteor. Soc., 91, 343-352, 2010.