CKD_Workshop_Report_Final

Climate Knowledge Discovery
Workshop Report
Reinhard Budich¹ (MPI-M), Peter Fox² (RPI), Auroop Ganguly³ (ORNL), Jim Kinter⁴ (COLA),
Per Nyberg⁵ (Cray), Tobias Weigel⁶ (DKRZ)
1. Background
Numerical simulation based science follows a new paradigm: its knowledge discovery process
rests upon massive amounts of data. We are entering the age of data-intensive science. Data is
either generated by equipment such as satellite sensors, microscopes, particle colliders etc., or
is just born digital, i.e. generated by an intensive usage of high performance computing.
One of the largest repositories of scientific data in any discipline is the model-generated and
observational geoscience data produced and used in climate science. Climate scientists gather
data faster than it can be interpreted. Current approaches to data volumes are primarily
focused on traditional methods, best-suited for large-scale phenomena and coarse-resolution
data sets. The data volumes from climate modeling will increase dramatically due to both
increasing resolution and number of processes described. What is needed is a suite of new
techniques for interpreting and linking phenomena on and between different time and length scales
as well as realms and processes. Such tools could provide unique insights into challenging
features of the Earth system, including extreme events, nonlinear dynamics and chaotic regimes.
Based on experience, including that from other sciences, the breakthroughs needed to address
these challenges will come from well-organized collaborative efforts involving several disciplines,
including end-users and scientists in climate and related areas (atmosphere, ocean, land
surface, …), computer and computational scientists, computing engineers, and mathematicians xii.
This report summarizes the findings from the Climate Knowledge Discovery workshop organized
by Deutsches Klimarechenzentrum GmbH (DKRZ), Max-Planck-Institut für Meteorologie (MPI-M)
and Cray Inc. to bring together experts from various domains to investigate the use and
application of large-scale graph analytics, semantic technologies and knowledge discovery
algorithms in climate science. The workshop was held from 30 March to 1 April 2011 at DKRZ in
Hamburg, Germany.
1 Corresponding Author Address: Reinhard Budich, reinhard.budich@zmaw.de
2 Corresponding Author Address: Peter Fox, pfox@cs.rpi.edu
3 Corresponding Author Address: Auroop Ganguly, gangulyar@ornl.gov
4 Corresponding Author Address: Jim Kinter, kinter@cola.iges.org
5 Corresponding Author Address: Per Nyberg, nyberg@cray.com
6 Corresponding Author Address: Tobias Weigel, weigel@dkrz.de
CKD Workshop Report
2. Workshop Objective
The aim of the workshop was to formulate science and technology strategies to further develop
climate knowledge discovery methods and tools to enable a data-intensive approach to climate
science. Should such advanced data analytical tools prove viable, their methods should be
used in Coupled Model Inter-comparison Project (CMIP)⁷ efforts, both in the fifth round
that is underway now (CMIP5) and in future rounds.
Three basic questions were posed to pursue this aim:
1) What steps are required to realize the potential of graph analytics, semantic technologies and
knowledge discovery algorithms in climate science?
2) Where methods and technologies already exist, how can they be leveraged?
3) Where gaps are identified, what steps must be taken to address them?
3. Data Driven Opportunities for Climate Science
Climate science faces long-standing gaps in the understanding of key processes like convection,
land-atmosphere interaction, and teleconnections. A combination of improved models,
observations and computing power has been suggested as the way towards improving this
understanding. Our postulate: confronting model simulations with observations in new ways and
learning functional relations among climate variables from observations and models, which may
generalize to predictive insights under a non-stationary climate, would be invaluable for basic
climate research and key to addressing adaptation and resource management issues in a timely
manner. This provides the motivation for novel and scalable methodologies in data-guided
descriptive and predictive analysis, which can handle massive volumes of "5-dimensional"
(space x-, y-, and z-, time and parameters) data generated from nonlinear processes with
complex dependencies and feedback loops.
Many application areas vi have recognized the need for representing and reasoning about domain
and multi-disciplinary knowledge, and have recognized a reality of science today: knowledge is
distributed over many resources. Computer science research areas being put into practice
include information integration, distributed knowledge management, the semantic web (including
the Resource Description Framework (RDF) and the Web Ontology Language (OWL), which are
both World Wide Web Consortium Recommendations, i.e. standards), and multi-agent and
distributed reasoning systems. Methods of automated reasoning and advanced analytics have
been used in a number of domains by employing formal ontologies expressed in logic viii.
Ontologies are essential components of modern knowledge management systems, and are
becoming more prevalent in distributed computing environments based on Web Services, and in
applications such as e-Commerce and e-Science. In information systems science, an ontology vii
is a formal representation of knowledge as a set of concepts within a domain and the
relationships among those concepts. Formalized ontologies are used to enable automated
reasoning about the entities within that domain, in addition to describing the domain. Ontologies
are useful in articulating shared vocabularies, which can be used to model a domain based on
types of objects and/or concepts that exist, as well as their properties and relations among
objects. More broadly, semantic technologies comprising query and reasoning capabilities and
semantic storage and retrieval capabilities have already been implemented in community efforts
such as the Earth System Curator⁸ and the METAFOR project⁹, the National Aeronautics and
Space Administration (NASA) Semantic Web for Earth and Environmental Terminology
(SWEET)¹⁰, and the International Research Institute (IRI) Lamont-Doherty Earth Observatory
(LDEO) Climate Data Library¹¹.
7 http://www-pcmdi.llnl.gov/projects/cmip/index.php
February 9, 2016
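To make the notion of an ontology as "concepts plus relationships" concrete, the following minimal Python sketch stores a handful of subject-predicate-object triples and performs one very simple automated inference: following a subclass relation transitively. The vocabulary terms (SeaIce, MeltPond, Cryosphere, subClassOf, locatedOn) are hypothetical illustrations and are not taken from SWEET, the CIM or any other real ontology. A real system would more likely rely on RDF/OWL tooling than on hand-written code like this.

```python
# Hypothetical mini-ontology as (subject, predicate, object) triples.
# None of these terms come from a real climate ontology.
TRIPLES = {
    ("SeaIce", "subClassOf", "Cryosphere"),
    ("MeltPond", "subClassOf", "SeaIceFeature"),
    ("SeaIceFeature", "subClassOf", "Cryosphere"),
    ("Cryosphere", "subClassOf", "EarthSystemComponent"),
    ("MeltPond", "locatedOn", "SeaIce"),
}


def superclasses(concept, triples):
    """Return every class reachable from `concept` via subClassOf edges."""
    found = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in triples:
            if subj == current and pred == "subClassOf" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


# Only the direct link MeltPond -> SeaIceFeature is asserted; the other
# two superclasses are inferred by following the relation transitively.
print(sorted(superclasses("MeltPond", TRIPLES)))
```

The point of the sketch is only that facts which were never stated directly (here, that a melt pond belongs to the cryosphere) can be derived mechanically once the relationships are formalized.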
4. Complex Network Solutions
Data-intensive science in general has been called the fourth paradigm of scientific exploration ix,
in addition to theory, experimentation and the more recently added computational modeling and
simulation. As the meeting progressed there was a growing awareness that a combination of
high-performance analytics from model-simulated and observed data, with algorithms motivated
from network science and graph theory, nonlinear dynamics and statistics, as well as data mining
and machine learning, may be one way forward for the community. Recent developments in
climate networks i,ii,iii,iv,v, while impressive, have barely scratched the surface in terms of what may
eventually be possible. While statistical and dynamical downscaling have become widely applied
in the climate community, challenging issues about bias-variance tradeoffs, long-range
dependence or teleconnections and impacts of boundary or initial conditions remain largely
unsolved. Here complex networks¹² may offer the next generation of solutions in climate science
and climate change consequence management. Another emergent research area is to explore
ways to connect network solutions from climate change science all the way to impacts science
like urban sustainability and endangered natural ecosystems, where both the climate and
impacted systems may be represented through loosely coupled network paradigms. The ability
to interpret the results from the climate science perspective and develop hybrid models that
blend available process knowledge with complex networks or other data-guided approaches will
remain issues of key importance.
5. Construction of Graphs from Climate Data
Graphs are mathematical structures used to model pairwise relations between objects. The
graph is represented by a collection of vertices or 'nodes' and a collection of edges that connect
pairs of vertices. The workshop provided a venue for an open and frank discussion of the
potential role of graph theory and graph analytics in analyzing climate data. It was noted that the first step –
getting community information into graph form – has not been widely adopted. The construction
of graphs is an information modeling process. It is important to distinguish this ‘modeling’ from
the familiar modeling of climate data using mathematical equations, and increasingly with the
implementation of algorithms in computer software and execution on modern computer systems.
To invoke this process, the climate community needs to define what is important to the graph:
the types of nodes in the graph and their first-order connections. Beyond that, the construction of
climate science domain ontologies would be a key next step, but to apply these ontologies, it is
also necessary to define the process of semantically annotating climate data and representing it
in graphs.
8 http://www.earthsystemcurator.org/ontologies/
9 http://metaforclimate.eu
10 http://sweet.jpl.nasa.gov/ontology/
11 http://iridl.ldeo.columbia.edu/ontologies/
12 A complex network is a network (graph) with non-trivial topological features, i.e. features that do not occur
in simple networks such as lattices or random graphs but often occur in real graphs. (Wikipedia)
Two basic approaches were described to construct graphs from climate data. First, the gridded
values of geophysical fields representing physical phenomena such as sea-ice or melt pond can
be described as nodes in the graphs. The edge between two nodes can describe domain
processes. Alternatively, the edge can describe the physical phenomena and the nodes describe
the processes. This bi-directional approach seems to have a lot of potential. It was noted,
however, that the graph community does not currently have good knowledge representation
forms for physical processes. Encoding of science knowledge has traditionally been more
successful when describing objects that can be measured. Both approaches rely on a rich set of
feature detection algorithms to transform the grid-based primary data into objects in a graph as
the first step.
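As a hedged illustration of that first step, the sketch below thresholds a tiny two-dimensional grid and groups adjacent above-threshold cells into features by flood fill; each feature then becomes a candidate node in a graph. The grid values, the 0.5 threshold, the 4-neighbour connectivity and the node names are all assumptions made for illustration. The feature detection algorithms discussed at the workshop would be considerably richer than this.

```python
from collections import deque

# Made-up 4x5 grid of sea-ice concentration (0..1); purely illustrative.
GRID = [
    [0.9, 0.8, 0.1, 0.0, 0.0],
    [0.7, 0.6, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.8, 0.9],
    [0.0, 0.2, 0.0, 0.0, 0.6],
]


def detect_features(grid, threshold=0.5):
    """Group 4-connected cells with value >= threshold into features."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    features = []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or grid[r][c] < threshold:
                continue
            # Flood fill starting from this unvisited above-threshold cell.
            cells = []
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                i, j = queue.popleft()
                cells.append((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < rows and 0 <= nj < cols
                            and (ni, nj) not in seen
                            and grid[ni][nj] >= threshold):
                        seen.add((ni, nj))
                        queue.append((ni, nj))
            features.append(sorted(cells))
    return features


# Each detected feature becomes one graph node, keyed by a hypothetical id.
nodes = {f"ice_feature_{k}": cells
         for k, cells in enumerate(detect_features(GRID))}
for name, cells in nodes.items():
    print(name, "covers", len(cells), "grid cells")
```

Edges between such nodes (for example, a process linking two features) would have to be added in a second step, which this sketch deliberately leaves out.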
More complex graphs can also be constructed with a named definition of the network where
node types do not need to be consistent in definition. For example, one node could define a
geographic area (Pacific Region) and be connected to a node with physical measurements (time
series of temperatures) with a named and typed relationship (measured location). Sub-graphs
can be constructed to represent different components currently implemented in coupled Earth
system models (atmosphere, ocean, land surface, …). Different algorithmic approaches to
climate modeling can then be manifest in multiple graph instantiations.
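The example of a geographic area linked to a temperature time series can be sketched as a small property graph with typed nodes and named edges. The class name `TypedGraph`, the node identifiers and the temperature values below are hypothetical and chosen only for illustration; this is one possible representation, not a format proposed at the workshop.

```python
from dataclasses import dataclass, field


@dataclass
class TypedGraph:
    """Tiny property graph: typed nodes plus named, directed edges."""
    nodes: dict = field(default_factory=dict)   # node id -> attribute dict
    edges: list = field(default_factory=list)   # (source, relation, target)

    def add_node(self, node_id, node_type, **attrs):
        self.nodes[node_id] = {"type": node_type, **attrs}

    def add_edge(self, source, relation, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.append((source, relation, target))

    def neighbours(self, source, relation):
        """Targets reachable from `source` over edges named `relation`."""
        return [t for s, r, t in self.edges if s == source and r == relation]


g = TypedGraph()
# Node types differ: one is a region, the other holds measurements.
g.add_node("pacific", "GeographicArea", name="Pacific Region")
g.add_node("sst_series", "TimeSeries", variable="temperature",
           values=[26.1, 26.4, 27.0])  # invented numbers
g.add_edge("sst_series", "measured_location", "pacific")
print(g.neighbours("sst_series", "measured_location"))
```

Because node types are stored as ordinary attributes, sub-graphs for different model components could be kept in the same structure and selected by type.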
6. Potential Areas of Application
a. Model Inter-comparison:
Model inter-comparison projects are a standard procedure that enables a diverse community of
scientists to analyze climate models in a systematic fashion, which serves to facilitate model
improvement. In a model inter-comparison, many models with different assumptions are
subjected to the same experimental protocol, and the output of all the models is made available
to a community of researchers. This procedure requires a community-based infrastructure in
support of climate model diagnosis, validation, inter-comparison, documentation and data
access.
The construction of graphs using model output provides another method, in addition to more
traditional methods, by which to compare outputs either from different models for the same
simulated period or the same model using perturbed conditions. Graph algorithms such as graph
isomorphism allow for powerful exploration of the encoded knowledge base (isomorphism is an
equivalence relation between graphs).
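What an isomorphism test checks can be shown with a brute-force sketch that tries every relabelling of one graph's nodes and asks whether the edge sets then coincide. The node names are hypothetical stand-ins for features derived from two model outputs. The exhaustive search grows factorially with the number of nodes, so this is only workable for toy graphs; graphs of realistic size would need specialized algorithms.

```python
from itertools import permutations


def are_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force isomorphism test for small undirected graphs."""
    set_a = {frozenset(e) for e in edges_a}
    set_b = {frozenset(e) for e in edges_b}
    if len(nodes_a) != len(nodes_b) or len(set_a) != len(set_b):
        return False
    for perm in permutations(nodes_b):
        # Candidate one-to-one relabelling of graph A's nodes onto graph B's.
        mapping = dict(zip(nodes_a, perm))
        mapped = {frozenset(mapping[n] for n in edge) for edge in set_a}
        if mapped == set_b:
            return True
    return False


# Hypothetical graphs built from the output of two models.
model_a_nodes = ["enso", "monsoon", "sahel_rain", "nao"]
model_a_edges = [("enso", "monsoon"), ("monsoon", "sahel_rain"),
                 ("sahel_rain", "nao")]            # a chain
model_b_nodes = ["n1", "n2", "n3", "n4"]
model_b_edges = [("n3", "n1"), ("n1", "n4"), ("n4", "n2")]  # also a chain
model_c_edges = [("n1", "n2"), ("n1", "n3"), ("n1", "n4")]  # a star

print(are_isomorphic(model_a_nodes, model_a_edges, model_b_nodes, model_b_edges))
print(are_isomorphic(model_a_nodes, model_a_edges, model_b_nodes, model_c_edges))
```

The first comparison succeeds because both graphs are chains of four nodes. The second fails even though node and edge counts match, since a star has a node of degree three and a chain does not.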
b. Climate Teleconnections:
Climate science features challenging characteristics such as nonlinear dynamics, chaotic
regimes and multi-physics complexity. Anomalies can be related to each other at large distances
(typically thousands of kilometers), a feature referred to as a teleconnection. Information is
propagated between the distant points through the atmosphere or ocean by transport processes
or wave dispersion. In addition, the climate system is dynamic such that remote atmospheric and
ocean responses to large-scale fluctuations in the climate system will never occur exactly the
same way twice. From a computational perspective, this problem is extremely challenging,
requiring the search for patterns and relationships at large distances on a graph within and
between time-slices. Traversing a graph looking for what is connected to what causes every
memory reference to be global and random, since every reference will lead to new connections
that are not located in the same sections of memory.
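The traversal described above can be sketched as a breadth-first search over an adjacency list. Which entry is read next depends entirely on which node was discovered last, and this is the source of the data-dependent memory access pattern. The region names and links below are hypothetical and do not represent real teleconnections; the sketch also ignores time-slices.

```python
from collections import deque

# Hypothetical adjacency list: regions linked by strong statistical ties.
ADJACENCY = {
    "nino34": ["central_pacific", "indonesia"],
    "central_pacific": ["nino34", "north_pacific"],
    "indonesia": ["nino34", "indian_ocean"],
    "north_pacific": ["central_pacific", "alaska"],
    "indian_ocean": ["indonesia"],
    "alaska": ["north_pacific"],
    "north_atlantic": [],
}


def hops_from(source, adjacency):
    """Breadth-first search returning the hop count to each reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # The next lookup depends on which node was just dequeued, so the
        # memory locations touched are determined by the data itself.
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


hops = hops_from("nino34", ADJACENCY)
# Nodes at least three hops away serve as a stand-in for remote connections.
remote = sorted(n for n, d in hops.items() if d >= 3)
print(remote)
```

On a graph with millions or billions of nodes, the neighbour lists visited in successive steps are unlikely to sit near each other in memory, which is why such traversals tend to be limited by memory latency rather than arithmetic.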
c. Scale Interactions:
Climate phenomena are also observed over a wide range of spatial (10³ – 10⁷ m)
and temporal (10⁻² – 10⁸ years) scales. Because of the fluid dynamics of the atmosphere and
oceans, and because the processes governing phenomena on different time scales are
co-dependent, there are strong interactions among the spatial and temporal scales of climate. This
complicates the analysis of climate data and presents challenges for data processing and
automating inference.
7. Technology Requirements
Applications that aim to combine semantic annotation and reasoning with graph algorithms and
climate data are more successful when combined with sound ontologies. Many science
ontologies have been developed in the past, but their consistency and methodological
soundness (and completeness) are highly variable. Most importantly, realistic use cases must be
defined even before starting to engineer ontologies. The time frame for such developments is
therefore often multiple years.
In the METAFOR project, an ontology for model metadata has been developed, the METAFOR
Common Information Model (CIM¹³). Like the Earth System Grid (ESG) ontology, it is intended to
replace existing model metadata. While ESG uses Semantic Web standards such as RDF and
OWL to encode the ontology, the METAFOR project uses eXtensible Markup Language (XML)
Schema for the application-level encoding and CIM support tools. An encoding of the CIM with
RDF and OWL has been proposed, but there are open issues: the encoding methodology is
unclear so far and the feasibility of automated translation is disputed. There are also differences
between the meta-models of Semantic Web standards and the International Organization for
Standardization (ISO) standards that the CIM relies on, which impede a direct transition.
METAFOR has also made progress on the important task of developing controlled
vocabularies for community terms. While the CIM captures information related to the scientific
workflow, the controlled vocabularies describe the physical phenomena in the data, which
provides a foundation for developing domain ontologies and any subsequent graph
representation of the data.
Graphs constructed from climate data could reach millions or billions of nodes. Graphs of such
size are well beyond simple comprehension or visualization. Parallel software and hardware
technologies will be essential to enable the complex data analytics on today’s multi-terabyte and
tomorrow’s petabyte datasets that future climate analysis will require. Investigation of existing
and new parallel graph algorithms and data structures capable of analyzing spatio-temporal, and
possibly dynamic, data at massive scale on parallel and multithreaded systems is required.
From a contemporary computing perspective, the performance of graph algorithms is typically
limited by memory latency, with access patterns being highly data dependent.
13 http://metaforclimate.eu/
8. Next Steps
There exists a set of highly articulated tools in both the graph and climate communities.
Investigation is required to explore and, in some cases, jumpstart the use of knowledge
discovery algorithms in climate science. These approaches would augment the traditional
methods of Earth system modeling and further leverage the volumes of observational and model
data.
Graph and data-driven technologies have been successfully applied to social and bioinformatics
areas. Work to date by groups including the Potsdam Institute for Climate Impact Research (PIK),
the University of Wisconsin and Oak Ridge National Laboratory has demonstrated initial
applicability to climate science. The workshop concluded that it would be valuable to
further stimulate uptake and evaluation within the climate research community.
Many of the traditional methods of climate analysis originated in meteorology: these methods are
used to understand the weather. The equations used in numerical weather prediction and climate
simulation are nearly the same insofar as geophysical fluid dynamics are at the core of
atmospheric and oceanic motions. Weather and climate analysis, however, are distinct in many
ways, because climate feedback processes occur on time scales that are long compared to the
variations of the weather. Climate varies on time scales from weeks to millions of years. Processes exist
that impact the full range of time scales, but it is not yet fully understood which of the processes
are most important on what time scales and for what areas.
As an example of an area of interest, the Arctic will experience changes in climate more rapidly
than other parts of the globe. Possible reasons include positive feedbacks that amplify the
general warming trend due to increasing greenhouse gas concentrations: snow-ice albedo,
stabilization of atmosphere, cloud influence and sea-ice influenced by wind. What cannot be
determined in climate models today or in the traditional analysis of climate observations is which
of those is most relevant to the current changing climate.
Another critical area of concern is the degree of uncertainty in climate predictions and
projections. High confidence exists in projections of climate change at the global scale and in the
understanding of climate processes at large scales, but confidence decreases with
decreasing spatial scale. Decision-making, in contrast, occurs at the scale of human institutions,
which is much more local or regional in scope. Furthermore, there is a growing interest in
attribution of climate events, either to proximate causes or to their ultimate origins, in particular,
whether those origins are natural or related to human activity. Attribution depends on an
understanding of the predictability of climate phenomena. It is critical, therefore, to assess
predictability, confidence and uncertainty at scales relevant for decision support.
Two areas for development were identified at the workshop: ontologies to describe the
relationship between features and events and a CKD test-bed based on a subset of the CMIP5
data. A key outcome of discussions among the attendees was to pursue a systematic
methodology and begin by focusing on a specific use case or two that will provide some instance
level modeling. Understanding the principal processes of a specific use case (such as Arctic
climate change) is key to defining representations and will have the highest probability for
success.
From a methodology perspective, the use case, and the questions requiring answers that arise from
it, define the required vocabulary terms, i.e. semantics and their relations. Since the ultimate
knowledge base is constructed as a graph structure, the associated information modeling
is best conducted in a group setting. Initially, a small multidisciplinary team including
information/knowledge modelers and domain scientists, working within a conceptual model
framework and suitably influenced by familiarity with the underlying data of interest, develops the
knowledge model. This model can be discussed and vetted by a larger group of domain experts,
and a prototype knowledge base can be constructed and analyzed to assess the suitability,
consistency and completeness of the model. Subsequently, a more formal evaluation process
can be applied to determine iterations and improvements. This approach has been used
successfully in a number of other semantically-based projects x,xi. The primary aim of this
approach is to obtain suitably structured ‘data’, i.e. in graph model form, upon which graph and
complex network analyses can be applied.
Further examination was also proposed on a set of science topics that will bring the climate
science analysis and graph analytic communities together on a given number of concrete
examples. Specific test cases are to be developed but are expected to include model
inter-comparison, feature detection, and teleconnections. These test cases will expose the detailed
technical implementation (graph construction, graph isomorphism approaches, etc…) and
develop a common understanding of the concepts and terms used by the various communities
(graphs, semantics, etc…). It is anticipated that such studies can advise on the right mix of
automation, human control and abstraction that work best for analyzing new climate data sets.
9. CKD Workshop Attendees
Name                        Affiliation
David A. Bader              Georgia Institute of Technology
Venkatramani Balaji         NOAA
Joachim Biercamp            DKRZ
Benno Blumenthal            Columbia IRI
Michael Böttinger           DKRZ
Reinhard Budich             MPI-M
Alexey Cheptsov             HLRS
Kendall Clark               Clark & Parsia
Traute Crüger               MPI-M
Gerry Devine                University of Reading
Andi Drebes                 Uni Hamburg
John Feo                    PNNL
Peter Fox                   RPI
Bernadette Fritzsch         AWI
Steven Haflich              Franz
Illia Horenko               University of Lugano
Heike Jänicke               University of Heidelberg
Elke Keup-Thiel             CSC
Stephan Kindermann          DKRZ
Jim Kinter                  COLA and George Mason University
Ingo Kirchner               FU Berlin
Kerstin Kleese van Dam      PNNL
Luis Kornblueh              MPI-M
Uwe Kuester                 HLRS
Jürgen Kurths               PIK
Michael Lautenschlager      DKRZ
Tobias Lippert              ParStream
Alexander Löw               MPI-M
Thomas Ludwig               DKRZ
Jim Maltby                  Cray
Jochem Marotzke             MPI-M
Norbert Marwan              PIK
Thorsten Mauritsen          MPI-M
Philipp Metzner             University of Lugano
Shoaib Mufti                Cray
Craig Norvell               Franz
Per Nyberg                  Cray
Christian Pagé              CERFACS
Stephen Pascoe              BADC
Michael Ponater             DLR
Hans Ramthun                DKRZ
Nurcan Rasig                Cray
David Rogers                Sandia
Will Sawyer                 CSCS
Gary Stanbridge             Cray
Karsten Steinhaeuser        Univ. of Notre Dame / ORNL
Anastasios Tsonis           University of Wisconsin
Tobias Weigel               DKRZ
10. References
i K. Steinhaeuser, N. V. Chawla and A. R. Ganguly (in press, available at doi:10.1002/sam.10100).
Complex Networks as a Unified Framework for Descriptive Analysis and Predictive Modeling in Climate
Science. Statistical Analysis and Data Mining.
ii K. Steinhaeuser, N. V. Chawla and A. R. Ganguly (2010). An Exploration of Climate Data Using Complex
Networks. SIGKDD Explorations 12(1), 25-32.
iii Zou, Y.; Donges, J. F.; Kurths, J. Recent advances in complex climate network analysis, Complex
Systems and Complexity Science, 2011. 27-38 p
iv J. F. Donges, Y. Zou, N. Marwan, and J. Kurths. Complex networks in climate dynamics. Europhysics
Letters, 87:48007, 2009.
v A. A. Tsonis, K. L. Swanson, and P. J. Roebber. What Do Networks Have to Do with Climate? Bulletin
of the American Meteorological Society, 87(5):585–595, 2006.
vi Deborah L. McGuinness, Peter Fox, Boyan Brodaric, Elisa F. Kendall, The Emerging Field of Semantic
Scientific Knowledge Integration, in IEEE Intelligent Systems, 24(1): 25-26, 2009.
vii T. R. Gruber. A translation approach to portable ontologies. Knowledge Acquisition, 5(2):199-220, 1993.
viii F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, P. F. Patel-Schneider: The Description Logic
Handbook: Theory, Implementation, Applications. Cambridge University Press, Cambridge, UK, 2003.
ix The Fourth Paradigm: Data Intensive Scientific Discovery, Eds. Tony Hey, Stewart Tansley and Kristin
Tolle, Microsoft External Research, 2009.
x Benedict, J.L., McGuinness, D.L., & Fox, P. 2007, A Semantic Web-based Methodology for Building
Conceptual Models of Scientific Information, EOS Trans. AGU, 88(52), Fall Meeting Suppl., Abstract
IN53A-0950.
xi Fox, P. and McGuinness, D. L., An Open-World Iterative Methodology for the Development of
Semantically-enabled Applications, in preparation, 2011.
xii Navarra, A., J. L. Kinter III, and J. Tribbia, 2010: Crucial Experiments in Climate Science. Bull. Amer.
Meteor. Soc., 91, 343-352