(NCAR’s Data & GriD Efforts) for
COMMISSION FOR BASIC SYSTEMS
INFORMATION SYSTEMS and SERVICES
INTERPROGRAMME TASK TEAM ON THE
FUTURE WMO INFORMATION SYSTEM
KUALA LUMPUR, 20 - 24 OCTOBER 2003
Courtesy: Don Middleton
NCAR Scientific Computing Division
NCAR
“The Panel’s overarching recommendation is that the
National Science Foundation should establish and lead a large-scale, interagency, and internationally coordinated
Advanced Cyberinfrastructure Program (ACP) to create, deploy, and apply cyberinfrastructure in ways that radically empower all scientific and engineering research and allied education. We estimate that sustained new NSF funding of $1 billion per year is needed to achieve critical mass and to leverage the coordinated co-investment from other federal agencies, universities, industry, and international sources necessary to empower a revolution.
The cost of not acting quickly or at a subcritical level could be high, both in opportunities lost and in increased fragmentation and balkanization of the research.”
Atkins Report, Executive Summary
NCAR
http://www.earthsystemgrid.org
U.S. DOE SciDAC-funded R&D effort - a "Collaboratory Pilot Project"
Build an “Earth System Grid” that enables management, discovery, distributed access, processing, & analysis of distributed terascale climate research data
Build upon Globus Toolkit and DataGrid technologies and deploy (Rubber on the road)
Potential broad application to other areas
NCAR
ANL
– Ian Foster (PI)
– Veronika Nefedova
– (John Bresnahan)
– (Bill Allcock)
LBNL
– Arie Shoshani
– Alex Sim
ORNL
– David Bernholdt
– Kasidit Chanchio
– Line Pouchard
NCAR
LLNL/PCMDI
– Bob Drach
– Dean Williams (PI)
USC/ISI
– Anne Chervenak
– Carl Kesselman
– (Laura Pearlman)
NCAR
– David Brown
– Luca Cinquini
– Peter Fox
– Jose Garcia
– Don Middleton (PI)
– Gary Strand
NCAR
– 7.5GB/yr, 100 years -> .75TB
– 29GB/yr, 100 years -> 2.9TB
– 110GB/yr, 100 years -> 11TB
NCAR
Increased turnaround, model development, ensemble of runs
Increase by a factor of 10, linear data
– 7.5GB/yr, 100 years -> .75TB * 10 = 7.5TB
NCAR
Spatial Resolution: T42 -> T85 -> T170
Increase by factor of ~ 10-20, linear data
Temporal Resolution: Study diurnal cycle, 3 hour data
Increase by factor of ~ 4, linear data
CCM3 at T170 (70km)
NCAR
Quality: Improved boundary layer, clouds, convection, ocean physics, land model, river runoff, sea ice
Increase by another factor of 2-3, data flat
Scope: Atmospheric chemistry (sulfates, ozone…), biogeochemistry (carbon cycle, ecosystem dynamics), Middle Atmosphere Model…
Increase by another factor of 10+, linear data
NCAR
End 2002: 1.2 million files comprising ~75TB of data at NCAR, ORNL, LANL, NERSC, and PCMDI
End 2007: As much as 3 PB (3,000 TB) of data (!)
Current practice is already broken – the future will be even worse if something isn’t done…
NCAR
Data
– Different formats are converted to netCDF
– The netCDF files are not standardized to the CF model
– Different sites require knowledge of different methods of access
Metadata
– Most kept in online files separate from data and unsearchable unless one is “in the know”
– Some kept in people’s brains
Access control
– Manual
– Not formalized
Data requests
– Beginnings of a formal process (e.g., the PCMDI model)
– Beginnings of web portals
– Far too much done by hand
– Logging nearly non-existent
NCAR
Enabling the simulation and data management team
Enabling the core research community in analyzing and visualizing results
Enabling broad multidisciplinary communities to access simulation results
We need integrated scientific work environments that enable smooth WORKFLOW for knowledge development: computation, collaboration & collaboratories, data management, access, distribution, analysis, and visualization.
NCAR
Move data a minimal amount, keep it close to computational point of origin when possible
– Data access protocols, distributed analysis
When we must move data, do it fast and with a minimum amount of human intervention
– Storage Resource Management, fast networks
Keep track of what we have, particularly what’s on deep storage
– Metadata and Replica Catalogs (see the sketch after this list)
Harness a federation of sites, web portals
– Globus Toolkit -> The Earth System Grid -> The UltraDataGrid
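A minimal sketch of the "keep track of what we have" idea: a toy replica catalog that maps logical file names to physical copies and prefers a copy that is already local, so data only moves when it has to. This is a Python illustration, not the Globus Replica Location Service API; the catalog entries, site name, and URLs are hypothetical.

# Toy illustration (not the Globus RLS API): map logical file names to
# physical replicas and prefer one already at the local site, so data
# is moved only when no local copy exists.

# Hypothetical catalog entries: logical name -> list of physical URLs.
replica_catalog = {
    "pcm/run07/tas_monthly.nc": [
        "gsiftp://mss.ucar.edu/pcm/run07/tas_monthly.nc",
        "gsiftp://hpss.nersc.gov/pcm/run07/tas_monthly.nc",
    ],
}

LOCAL_SITE = "ucar.edu"  # assumption: the caller knows its own site

def best_replica(logical_name: str) -> str:
    """Return a local replica if one exists, otherwise the first remote one."""
    replicas = replica_catalog[logical_name]
    for url in replicas:
        if LOCAL_SITE in url:
            return url            # no transfer needed
    return replicas[0]            # fall back to a remote copy

if __name__ == "__main__":
    print(best_replica("pcm/run07/tas_monthly.nc"))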
NCAR
[Diagram: tools for reliable staging, transport, and replication. An HRM client provides selection, control, and monitoring of HRM servers that move data between tera/peta-scale archives.]
HRM
Running well across DOE/HPSS systems
New component built that abstracts the NCAR Mass Storage System
Defining next generation of requirements with climate production group
First “real” usage
“The bottom line is that it now works fine and is over 100 times faster than what I was doing before. As important as two orders of magnitude increase in throughput is, more importantly I can see a path that will essentially reduce my own time spent on file transfers to zero in the development of the climate model database” – Mike Wehner, LBNL
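Transfers like these can be scripted rather than run by hand. Below is a hedged sketch of an unattended transfer loop built on the Globus Toolkit's globus-url-copy command; the endpoints and file list are hypothetical, and this is not the DataMover/HRM code itself, only the general shape of bulk, parallel-stream staging.

# Hedged sketch: batch file movement with the Globus Toolkit's
# globus-url-copy command instead of hand-run transfers.  Host names
# and paths are hypothetical.
import subprocess

FILES = [
    "ccsm/b20.007/atm/hist/h0.0100-01.nc",   # hypothetical file list
    "ccsm/b20.007/atm/hist/h0.0100-02.nc",
]
SRC = "gsiftp://hpss.nersc.gov/home/ccsm/"    # hypothetical source endpoint
DST = "gsiftp://datamover.ucar.edu/datazone/" # hypothetical destination

for path in FILES:
    # -p 4 requests four parallel TCP streams per transfer.
    subprocess.run(
        ["globus-url-copy", "-p", "4", SRC + path, DST + path],
        check=True,
    )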
NCAR
[Diagram: three access patterns. A typical application reads local data through the netCDF library. A distributed application adds an OPeNDAP client that talks over HTTP to an OPeNDAP server in front of remote data. OPeNDAP-g pairs an ESG client with an ESG (OPeNDAP + DODS) server over the Grid, adding transparency, performance, security, authorization, and (processing) for big data at multiple remote sites.]
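A minimal sketch of the "distributed application" path in the figure: the client opens a remote dataset through an OPeNDAP URL and subsets it on the server side instead of copying whole files. The URL and variable name are hypothetical; this uses the netCDF4-python library and assumes it was built with DAP support.

# Open a remote dataset over OPeNDAP and pull only a subset across the
# network.  URL and variable name are hypothetical.
from netCDF4 import Dataset

url = "http://dataportal.ucar.edu/dods/pcm/run07/tas_monthly"  # hypothetical
ds = Dataset(url)

tas = ds.variables["tas"]          # assumed variable, (time, lat, lon) layout
first_decade = tas[:120, :, :]     # only this slab crosses the network
print(first_decade.shape)
ds.close()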
NCAR
For XML encoding of metadata (and data) of any generic netCDF file
Objects: netCDF, dimension, variable, attribute
Beta version reference implementation as Java Library
(http://www.scd.ucar.edu/vets/luca/netcdf/extract_metadata.htm)
[Schema diagram: nc:netCDFType, nc:dimension, nc:VariableType, nc:variable, nc:attribute, nc:values]
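A minimal Python sketch of the same idea as the Java reference implementation: walk a generic netCDF file and emit NcML-style XML built from the four object types above (netcdf, dimension, variable, attribute). The input file name is hypothetical and the namespace URI is an assumption for illustration.

# Walk a netCDF file and emit NcML-style XML.  File name and namespace
# URI are assumptions for this sketch.
import xml.etree.ElementTree as ET
from netCDF4 import Dataset

NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"  # assumed

def to_ncml(path: str) -> str:
    ds = Dataset(path)
    root = ET.Element("netcdf", xmlns=NS, location=path)
    for name, dim in ds.dimensions.items():
        ET.SubElement(root, "dimension", name=name, length=str(len(dim)))
    for name, var in ds.variables.items():
        v = ET.SubElement(root, "variable", name=name,
                          type=str(var.dtype), shape=" ".join(var.dimensions))
        for att in var.ncattrs():
            ET.SubElement(v, "attribute", name=att, value=str(var.getncattr(att)))
    ds.close()
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    print(to_ncml("example.nc"))   # hypothetical input file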
NCAR
[Class diagram of the metadata schema. The abstract root Object carries an id and participant roles. Activity (name, description, rights, dates, notes, participants, references) is specialized into Investigation, which in turn is specialized into Project (topics, funding), Ensemble, Campaign, and Experiment, and into Simulation (simulationInput, simulationHardware), Observation, and Analysis; activities are related by isPartOf, hasParent, hasChild, and hasSibling. Person (firstName, lastName, contact) worksFor Institution (name, type, contact). Dataset (type, conventions, dates, formats, timeCoverage, spaceCoverage) is generatedBy an activity and references a Service (name, description) via serviceId. The legend distinguishes abstract classes, classes, inheritance, and associations.]
Co-developed NcML with Unidata
– CF conventions in progress, almost done
Developed & evaluated a prototype metadata system
Finalized an initial schema for PCM/CCSM
– Address interoperability with federal standards and NASA/GCMD via the generation of DIF/FGDC/ISO
– Address interoperability with digital libraries via the creation of Dublin Core
Testing relational and native XML databases, and OGSA-DAI
Exploratory work for first-generation ontology
Authoring of discovery metadata in progress
NCAR
ESG Topology
[Diagram: the ESG web portal (Tomcat/Struts) authenticates users via MyProxy and the Community Authorization Service (CAS) at ANL and submits queries. LBNL: GridFTP server, HRM, and HPSS. NCAR: LAS server, disk cache, GridFTP server, an HRM that executes against the MSS, and a GRAM gatekeeper. LLNL: HRM, GridFTP server, and disk. ISI: OGSA-DAI over a MySQL RDBMS. ORNL: GridFTP server, HRM, and HPSS. RLS replica catalogs at the sites cross-update one another, and GridFTP moves data between sites and to visualization.]
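A hedged sketch of the portal's credential step shown in the topology: fetch a short-lived proxy certificate from a MyProxy server before queries and transfers are submitted on the user's behalf. The server name, account, and output path are hypothetical; this is a plain call to the standard myproxy-logon tool, not the ESG portal code.

# Retrieve a short-lived proxy credential from a MyProxy server.
# Server, account, and output path are hypothetical.
import subprocess

MYPROXY_SERVER = "myproxy.earthsystemgrid.org"   # hypothetical host
USER = "jdoe"                                     # hypothetical account

subprocess.run(
    ["myproxy-logon",
     "-s", MYPROXY_SERVER,    # MyProxy server to contact
     "-l", USER,              # account holding the stored credential
     "-t", "12",              # request a 12-hour proxy
     "-o", "/tmp/x509up_esg"  # where to write the proxy certificate
    ],
    check=True,   # myproxy-logon prompts for the passphrase itself
)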
NCAR
CCSM Data Management Group
The Globus Project
Other SciDAC Projects: Climate, Security & Policy for Group Collaboration, Scientific Data Management ISIC, & High-performance DataGrid Toolkit
OPeNDAP/DODS (multi-agency)
NSF National Science Digital Libraries Program (UCAR & Unidata THREDDS Project)
U.K. e-Science and British Atmospheric Data Center
NOAA NOMADS and CEOS-grid
Earth Science Portal group (multi-agency, international)
NCAR
Broaden and refine usage of DataMover
Continue building metadata catalogs
Revisit overall security model and consider simplified approaches
Redesign and implement user interface
Alpha version of OPeNDAP-g
– Test and evaluate with client applications
Develop automation for data publishing (GT3)
Deploy for IPCC runs
NCAR
Ben Kirtman, COLA
Provide a common portal to NCAR, UCAR, and university data
Provide a sustainable cyberinfrastructure that dramatically lowers the cost of sharing data (there is HUGE interest in this)
Directly couple to simulation systems and DataMonster
Begin capturing rich metadata and catalog our scientific experiments for the world
MSS -> A Petascale Mass Knowledge System
Federate internationally (ESG, THREDDS, U.K. e-Science, NOMADS, PRISM, GEON, etc.)
NCAR
Mass Storage System (1.5 PB) -> Petascale Knowledge Repository
Establish a new paradigm for managing and accessing scientific data based on semantic organization.
NCAR
Purpose:
Build an infrastructure using different methods for data exploration and delivery
Web-based retrieval and interactive analysis for MSS collections
Data sharing for multi-institution cooperative studies
Browse, select, compare, and download data sets, and specify data subsets using graphical selection, text entry, and a choice of output format
Components:
User interface: Live Access Server (LAS)
Middleware: Ferret, NCL, GrADS
File service: local or DODS
Status:
Pilot working (2 years), more middleware testing
NCAR
[Diagram: a Live Access Client talks to the Live Access Server, which drives Ferret, NCL, and other engines and reaches the data collections through DODS. The collections hold massive simulation and retrospective data: CSM, PCM, DSS, MM5, WRF, MICOM, CMIWG.]
NCAR
[Screenshot: portal interface and Reanalysis 2 sea level pressure.]
NCAR
Community Data Portal architecture
[Diagram: user interfaces (Struts on Tomcat) sit over middleware services, each running in Tomcat: a GDS, a DODS aggregation server, an LAS, and core services for catalog parsing and metadata ingestion, data search and discovery, catalog browsing, MSS data retrieval, data access (OPeNDAP, FTP, HTTP), and data visualization (NCL, Ferret). The hardware is dataportal.ucar.edu, with RAID disks and the MSS behind it.]
NCAR
Community Data Portal Metadata Software
[Diagram: a THREDDS catalog parser application ingests ESG, Dublin Core, NcML, and other metadata referenced from THREDDS catalogs. The full XML documents are stored in a native XML database (Xindice) and displayed by an XML viewer web application with schema-specific stylesheets; a THREDDS catalogs browser web application is also provided. The same XML is shredded into tables in a relational database (MySQL) that the Search & Discovery web application hits with simple SQL queries, returning results as a list of triplets (dataset id, metadata schema, metadata URL). Advanced queries (XPath, XQuery) against the XML database are a future link.]
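A toy illustration of the "shred and query" path in the figure: metadata records land in a relational table and a simple SQL query returns the (dataset id, metadata schema, metadata URL) triplets. The portal uses MySQL; sqlite3 stands in here so the sketch runs anywhere, and the table, columns, and records are assumptions rather than the portal's real schema.

# Shred metadata records into a relational table and query for triplets.
# sqlite3 stands in for MySQL; names and records are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE discovery (dataset_id TEXT, schema_name TEXT, metadata_url TEXT)"
)
conn.executemany(
    "INSERT INTO discovery VALUES (?, ?, ?)",
    [
        ("pcm.run07", "DublinCore", "http://dataportal.ucar.edu/meta/pcm.run07.dc.xml"),
        ("ccsm.b20.007", "NcML", "http://dataportal.ucar.edu/meta/ccsm.b20.007.ncml"),
    ],  # hypothetical records
)

# Simple SQL query of the kind driving the Search & Discovery application.
for triplet in conn.execute(
    "SELECT dataset_id, schema_name, metadata_url FROM discovery WHERE dataset_id LIKE ?",
    ("pcm.%",),
):
    print(triplet)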
ACD: MOZART v2.1 standard run (Louisa Emmons)
ATD: Radar almost ready for today!
CGD: CAS satellite data example (Lesley Smith)
CGD: CDAS and VEMAP data (Steve Aulenbach, Nan Rosenbloom, Dave Schimel)
CGD: CCSM 1000 year run (Lawrence Buja)
CGD: PCM 16 top datasets (Gary Strand)
SCD: DSS full data holdings (Bob Dattore, Steve Worley)
SCD: VETS example visualization catalog (Markus Stobbs, Luca Cinquini)
COLA: Jennifer Adams, Jim Kinter, Brian Doty
NCAR
Recruiting (!)
– One student for data ingest
– One software engineer
– Systems
– Expanding storage by 20TB (SCD cosponsor)
Ongoing publication of datasets
Publishing documents on plans, design, how to partner, standard services, and management procedures
Building partnerships, DMWG meeting August
NCAR
Building a sustainable infrastructure for the long-term
Difficult, expensive, and time-consuming
Requires longer-term projects
Team-building is a critical process
Collaboration technologies really help
Managing all the collaborations is a challenge
But extremely valuable
Good progress, first real usage
NCAR
– www.earthsystemgrid.org
– dataportal.ucar.edu
NCAR
END
We Will Examine Practically Every Aspect of the Earth System from Space in This Decade
Longer-term Missions: Observation of Key Earth System Interactions
Aqua
Terra
Landsat 7
Aura
ICESat, Jason-1
QuikScat
Exploratory Missions: Explore Specific Earth System Processes and Parameters and Demonstrate Technologies
GRACE
SRTM
VCL
NCAR
Cloudsat
PICASSO
Courtesy of Tim Killeen, NCAR
Triana
EO-1
Essential
– So important that it becomes ubiquitous
Reliable
– Example: the built environment of the Roman Empire
Expensive
– Nothing succeeds like excess (e.g., Interstate system)
– Inherently one-off (often, few economies of scale)
Clear factorization between research and practice
– Generally deploy what provably works
NCAR
COLA
CGD/VEMAP
ACD,HAO/WACCM
CGD/CCSM, CAM
CGD/CAS
MMM/WRF
UCAR/JOSS
UCAR/Unidata
CGD,SCD,CU/GridBGC
NOAA/NOMADS
GODAE
HAO/TIEGCM,MLSO
ATD/Radar, HIAPER
ACD/Mozart, BVOC, Aqua proposal
BioGeo/CDAS
SCD/DSS
DOE/Earth System Grid
DLESE
GIS Initiative
NCAR