CLIMATE SCIENTISTS’ BIG CHALLENGE:
REPRODUCIBILITY USING BIG DATA
Kyo Lee, Chris Mattmann, and RCMES team
Jet Propulsion Laboratory (JPL), Caltech
Reproducibility issues in climate science
• Many published papers and reports do not describe their
computations in enough detail for others to reproduce the
results.
• Even with a detailed description, it is practically impossible
to reproduce others’ climate simulation results.
• How many readers of the IPCC report can draw this plot?
(from the latest IPCC report)
Climate Science is Big Data Science
• Data sets are massive and stored in distributed
systems over many physical locations.
• Coupled Model Intercomparison Project Phase 5 (CMIP5)
for IPCC assessment: 110 different experiments, 24
modeling centers, 45 models, 3.3 petabytes of data.
• By 2020 each experiment will generate an exabyte of data.
• Use massive observational data sets to:
• Formulate hypotheses from observed empirical
relationships.
• Simulate current and past conditions under those
hypotheses using climate models.
• Test hypotheses by comparing simulations to
observations.
Our unique challenges: data change quickly over time
• Community Earth System Model (CESM), developed at the
National Center for Atmospheric Research
• Releases: CESM 1.0 (June 2010), CESM 1.0.3 (June 2011),
CESM 1.0.6 (May 2014)
• Minor updates and branch versions
• Numerous ways to configure a simulation
• Options: discretization methods, sub-grid
scale physics, coupling with ocean, and so on.
• CESM is open source, but it is practically
impossible to reproduce others’ simulation
results.
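One way to make a simulation configuration easier to compare across groups is to record the model version and every chosen option in a manifest and attach a fingerprint to it. The sketch below is only an illustration of that idea and is not part of CESM or RCMES: the function names `config_fingerprint` and `build_manifest` and the option names in the usage note are hypothetical, not real CESM settings.

```python
import hashlib
import json


def config_fingerprint(model_version, options):
    """Return a SHA-256 hex digest of a model version plus its options.

    The record is serialized with sorted keys, so two configurations
    with the same content give the same digest regardless of the order
    in which the options were written down.
    """
    record = {"model_version": model_version, "options": options}
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(model_version, options):
    """Bundle the version, the options, and their fingerprint in one dict."""
    return {
        "model_version": model_version,
        "options": dict(options),
        "fingerprint": config_fingerprint(model_version, options),
    }
```

For example, `build_manifest("1.0.3", {"discretization": "finite_volume", "ocean_coupling": True})` (hypothetical option names) yields a record that could be published alongside a paper. Matching fingerprints show only that two runs declared the same configuration; they do not guarantee identical results across machines or compilers.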
Regional Climate Model Evaluation System
(RCMES, http://rcmes.jpl.nasa.gov/)
• RCMES is an open source software package developed by NASA’s
JPL and UCLA to facilitate the evaluation of climate models. Open
Climate Workbench (OCW) is now one of the top-level projects at the
Apache Software Foundation.
• Make observational datasets, with some emphasis on NASA satellite
data, more accessible to the climate modeling community for climate
model evaluation.
• Give researchers more time to analyze results and less time spent
coding and worrying about file formats, data transfers, etc.
• Provide guidance to further improve models by visualizing collective
evaluation results of models.
• Make some basic climate model evaluations reproducible.
Regional Climate Model Evaluation System powered
by Apache Software Foundation
[Figure: RCMES architecture diagram]
• Raw data (various sources, formats, resolutions, coverage): TRMM, MODIS,
AIRS, CERES, soil moisture, etc.
• Other data centers (ESG, DAAC, ExArch Network): URL, metadata.
• RCMED (Regional Climate Model Evaluation Database): a large, scalable
database that stores data from a variety of sources in a common format.
  • Extractor for various data formats.
  • Data tables: common format, native grid, efficient architecture.
• RCMET (Regional Climate Model Evaluation Tool): a library of codes for
extracting data from RCMED and models and for calculating evaluation metrics.
  • User input and model data.
  • Data extractor (binary or netCDF): extract OBS data, extract model data.
  • Regridder: put the OBS and model data on the same time/space grid.
  • Metrics calculator: calculate evaluation metrics.
  • Visualizer: plot the metrics.
  • The regridded data can also be used for the user’s own analyses and
  visualization.
Ingest observations and models, regrid, calculate metrics (e.g., bias, RMSE, correlation,
significance, PDFs), and visualize results (e.g., contour plots, time series, Taylor diagrams).
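The regridding and metric steps can be sketched in a few lines. The example below is a minimal illustration and not the RCMES or OCW implementation: it assumes the observations sit on a finer regular grid whose dimensions are an exact multiple of the model grid, uses simple block averaging as a stand-in for whatever regridding method a real evaluation would choose, and computes only bias, RMSE, and Pearson correlation. All function names are hypothetical, and plain Python lists keep the sketch free of dependencies.

```python
import math


def block_average(field, factor):
    """Coarsen a 2-D field (list of rows) by averaging factor x factor blocks."""
    n_rows, n_cols = len(field), len(field[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    coarse = []
    for i in range(0, n_rows, factor):
        row = []
        for j in range(0, n_cols, factor):
            block = [field[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        coarse.append(row)
    return coarse


def flatten(field):
    """Turn a 2-D field into a flat list of grid-point values."""
    return [value for row in field for value in row]


def bias(model, obs):
    """Mean difference, model minus observation."""
    return sum(m - o for m, o in zip(model, obs)) / len(obs)


def rmse(model, obs):
    """Root-mean-square error between model and observation."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


def correlation(model, obs):
    """Pearson correlation between model and observation."""
    n = len(obs)
    mean_m, mean_o = sum(model) / n, sum(obs) / n
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(model, obs))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    if var_m == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant field")
    return cov / math.sqrt(var_m * var_o)


def evaluate(model_field, obs_field, factor):
    """Put observations on the model grid, then compute the three metrics."""
    obs_on_model_grid = block_average(obs_field, factor)
    if (len(obs_on_model_grid) != len(model_field)
            or len(obs_on_model_grid[0]) != len(model_field[0])):
        raise ValueError("regridded observations do not match the model grid")
    model, obs = flatten(model_field), flatten(obs_on_model_grid)
    return {
        "bias": bias(model, obs),
        "rmse": rmse(model, obs),
        "correlation": correlation(model, obs),
    }
```

Calling `evaluate(model_field, obs_field, factor)` returns a dictionary with the three metrics. Keeping the regridding choice explicit in the code is the point of the sketch: a reader who knows which method and factor were used can rerun the same comparison.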
Replication of Kim et al. (2013) using RCMES
How to make climate studies more reproducible?
• Different programming languages (Fortran, MATLAB, R,
Python, IDL, NCL, GrADS, ...): the workflow system
could facilitate replication of other studies.
• Difficulties in reproducing others’ simulation results: Earth
System Grid Federation (ESGF) provides software
infrastructure to facilitate model intercomparison projects
using observational data.
• Climate scientists need more open source software
similar to RCMES that can facilitate their analyses of
observational and model data.