PPT - WMO

The Earth System Grid (ESG) & The Community Data Portal (CDP)

(NCAR’s Data & Grid Efforts) for

COMMISSION FOR BASIC SYSTEMS

INFORMATION SYSTEMS and SERVICES

INTERPROGRAMME TASK TEAM ON THE FUTURE WMO INFORMATION SYSTEM

KUALA LUMPUR, 20 - 24 OCTOBER 2003

Courtesy: Don Middleton

NCAR Scientific Computing Division

NCAR

“Atkins Report”

“A new age has dawned…”

“The Panel’s overarching recommendation is that the National Science Foundation should establish and lead a large-scale, interagency, and internationally coordinated Advanced Cyberinfrastructure Program (ACP) to create, deploy, and apply cyberinfrastructure in ways that radically empower all scientific and engineering research and allied education. We estimate that sustained new NSF funding of $1 billion per year is needed to achieve critical mass and to leverage the coordinated co-investment from other federal agencies, universities, industry, and international sources necessary to empower a revolution.

The cost of not acting quickly or at a subcritical level could be high, both in opportunities lost and in increased fragmentation and balkanization of the research.”

Atkins Report, Executive Summary

The Earth System Grid

http://www.earthsystemgrid.org

U.S. DOE SciDAC funded R&D effort - a “Collaboratory Pilot Project”

Build an “Earth System Grid” that enables management, discovery, distributed access, processing, & analysis of distributed terascale climate research data

Build upon Globus Toolkit and DataGrid technologies and deploy (Rubber on the road)

Potential broad application to other areas

ESG Team

ANL

– Ian Foster (PI)

– Veronika Nefedova

– (John Bresnahan)

– (Bill Allcock)

LBNL

– Arie Shoshani

– Alex Sim

ORNL

– David Bernholdt

– Kasidit Chanchio

– Line Pouchard

LLNL/PCMDI

– Bob Drach

– Dean Williams (PI)

USC/ISI

– Ann Chervenak

– Carl Kesselman

– (Laura Pearlman)

NCAR

– David Brown

– Luca Cinquini

– Peter Fox

– Jose Garcia

– Don Middleton (PI)

– Gary Strand

Baseline Numbers

T42 CCSM (current, 280km)

– 7.5GB/yr, 100 years -> .75TB

T85 CCSM (140km)

– 29GB/yr, 100 years -> 2.9TB

T170 CCSM (70km)

– 110GB/yr, 100 years -> 11TB
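The totals above are the yearly output multiplied by the run length. A small sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB), which is what the slide's figures imply; the function and dictionary names are illustrative only:

```python
# Yearly output per CCSM resolution, as quoted on the slide (GB per simulated year).
GB_PER_YEAR = {"T42": 7.5, "T85": 29.0, "T170": 110.0}


def run_volume_tb(gb_per_year: float, years: int = 100) -> float:
    """Total output of one run in TB, assuming 1 TB = 1000 GB."""
    return gb_per_year * years / 1000.0


for resolution, rate in GB_PER_YEAR.items():
    print(f"{resolution}: {run_volume_tb(rate):.2f} TB per 100-year run")
```

Each halving of the grid spacing roughly quadruples the data volume in these figures (7.5 -> 29 -> 110 GB/yr).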

Capacity-related Improvements

Increased turnaround, model development, ensemble of runs

Increase by a factor of 10, linear data

Current T42 CCSM

– 7.5GB/yr, 100 years -> .75TB * 10 = 7.5TB

Capability-related Improvements

Spatial Resolution: T42 -> T85 -> T170

Increase by factor of ~ 10-20, linear data

Temporal Resolution: Study diurnal cycle, 3 hour data

Increase by factor of ~ 4, linear data

CCM3 at T170 (70km)

Capability-related Improvements

Quality: Improved boundary layer, clouds, convection, ocean physics, land model, river runoff, sea ice

Increase by another factor of 2-3, data flat

Scope: Atmospheric chemistry (sulfates, ozone…), biogeochemistry (carbon cycle, ecosystem dynamics), middle Atmosphere Model…

Increase by another factor of 10+, linear data

Model Improvement Wishlist

Grand Total:

Increase compute by a factor of O(1000-10000)
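One way to see where a total of this order comes from is to multiply the individual factors quoted on the preceding slides. The sketch below does that; which factors the author actually combined is not stated, so treating all five as independent multipliers is an assumption, and the labels are paraphrased:

```python
from math import prod

# (low, high) multipliers as quoted on the preceding slides.
FACTORS = {
    "capacity (turnaround, ensembles)": (10, 10),
    "spatial resolution": (10, 20),
    "temporal resolution": (4, 4),
    "quality (improved physics)": (2, 3),
    "scope (chemistry, biogeochemistry)": (10, 10),  # slide says "10+"
}

low = prod(lo for lo, _ in FACTORS.values())
high = prod(hi for _, hi in FACTORS.values())
print(f"Combined factor: {low:,} to {high:,}")
```

Under that assumption the product lands in the thousands to low tens of thousands, the same order of magnitude as the slide's grand total.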

ESG Scenario

End 2002: 1.2 million files comprising ~75TB of data at NCAR, ORNL, LANL, NERSC, and PCMDI

End 2007: As much as 3 PB (3,000 TB) of data (!)

Current practice is already broken – the future will be even worse if something isn’t done…
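To put the two figures above in perspective, the sketch below computes the yearly growth factor they imply. Constant compound growth over the five years is an assumption made for illustration, not something the slide states:

```python
def implied_annual_growth(start_tb: float, end_tb: float, years: int) -> float:
    """Constant yearly growth factor that takes start_tb to end_tb in `years` years."""
    return (end_tb / start_tb) ** (1.0 / years)


# Slide figures: ~75 TB at end of 2002, as much as 3 PB (3,000 TB) at end of 2007.
factor = implied_annual_growth(75.0, 3000.0, years=5)
print(f"Overall growth: {3000.0 / 75.0:.0f}x, about {factor:.2f}x per year")
```

In other words, the archive would have to roughly double every year to reach the 2007 estimate.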

ESG Scenario (cont.)

Data

– Different formats are converted to netCDF

– netCDF is not standardized to the CF model

– Different sites require knowledge of different methods of access

Metadata

– Most kept in online files separate from data and unsearchable unless one is “in the know”

– Some kept in people’s brains

Access control

– Manual

– Not formalized

Data requests

– Beginnings of a formal process (e.g., the PCMDI model)

– Beginnings of web portals

– Far too much done by hand

– Logging nearly non-existent

ESG: Challenges

Enabling the simulation and data management team

Enabling the core research community in analyzing and visualizing results

Enabling broad multidisciplinary communities to access simulation results

We need integrated scientific work environments that enable smooth WORKFLOW for knowledge development: computation, collaboration & collaboratories, data management, access, distribution, analysis, and visualization.

ESG: Strategies

Move data a minimal amount, keep it close to computational point of origin when possible

– Data access protocols, distributed analysis

When we must move data, do it fast and with a minimum amount of human intervention

– Storage Resource Management, fast networks

Keep track of what we have, particularly what’s on deep storage

– Metadata and Replica Catalogs

Harness a federation of sites, web portals

– Globus Toolkit -> The Earth System Grid -> The UltraDataGrid

Storage/Data Management

Tools for reliable staging, transport, and replication

[Diagram labels: HRM Client (Selection, Control, Monitoring); two HRM Servers; two Tera/Peta-scale Archives]

HRM aka “DataMover”

Running well across DOE/HPSS systems

New component built that abstracts NCAR Mass Storage System

Defining next generation of requirements with climate production group

First “real” usage

“The bottom line is that it now works fine and is over 100 times faster than what I was doing before. As important as two orders of magnitude increase in throughput is, more importantly I can see a path that will essentially reduce my own time spent on file transfers to zero in the development of the climate model database” – Mike Wehner, LBNL

OPeNDAP

An Open Source Project for a Network Data Access Protocol

(originally DODS, the Distributed Oceanographic Data System)
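An OPeNDAP client typically asks for a subset of a remote array by appending a constraint expression to the dataset URL. A minimal sketch of building such a DAP2-style request; the server address, file path, and variable name are hypothetical, and the response suffixes a given server supports should be checked against its documentation:

```python
from typing import Sequence, Tuple


def dap_url(dataset_url: str, variable: str,
            slices: Sequence[Tuple[int, int, int]],
            response: str = "dods") -> str:
    """Build a DAP2 request URL with an array-subset constraint expression.

    Each slice is (start, stride, stop); DAP2 index ranges are inclusive.
    """
    hyperslab = "".join(f"[{start}:{stride}:{stop}]" for start, stride, stop in slices)
    return f"{dataset_url}.{response}?{variable}{hyperslab}"


# Hypothetical dataset: first 12 time steps of "tas" on a 64 x 128 grid.
url = dap_url(
    "http://example.org/opendap/run1.nc",
    "tas",
    [(0, 1, 11), (0, 1, 63), (0, 1, 127)],
)
print(url)
```

Only the requested hyperslab crosses the network, which is the point of the "move data a minimal amount" strategy listed earlier.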

Distributed Data Access Services

[Diagram: three access paths]

Typical Application: Application -> netCDF lib -> Data (local)

Distributed Application: Application -> OPeNDAP Client -> OPeNDAP via http -> OPeNDAP Server -> Data (remote)

ESG + DODS: Application -> ESG client -> OPeNDAP via Grid -> ESG Server -> Big Data (multiple remotes)

OPeNDAP-g

– Transparency

– Performance

– Security

– Authorization

– (Processing)

ESG: NcML Core Schema

For XML encoding of metadata (and data) of any generic netCDF file

Objects: netCDF, dimension, variable, attribute

Beta version reference implementation as Java Library (http://www.scd.ucar.edu/vets/luca/netcdf/extract_metadata.htm)

[Schema diagram labels: nc:netCDFType, netCDF, nc:dimension, nc:VariableType, nc:variable, nc:attribute, nc:values]
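To illustrate the idea of describing a netCDF file's structure in XML with the four object kinds named above, here is a minimal sketch using Python's standard library. The element and attribute names are loosely modelled on NcML and are assumptions; the output is not validated against the beta schema and carries no namespace, and the sample dimensions and variable are invented:

```python
import xml.etree.ElementTree as ET


def describe_netcdf(dimensions, variables, global_attrs):
    """Encode a netCDF file's structure (not its data values) as an XML tree."""
    root = ET.Element("netcdf")
    for name, value in global_attrs.items():
        ET.SubElement(root, "attribute", name=name, value=str(value))
    for name, length in dimensions.items():
        ET.SubElement(root, "dimension", name=name, length=str(length))
    for name, (dtype, shape, attrs) in variables.items():
        var = ET.SubElement(root, "variable", name=name, type=dtype,
                            shape=" ".join(shape))
        for attr_name, attr_value in attrs.items():
            ET.SubElement(var, "attribute", name=attr_name, value=str(attr_value))
    return root


# Invented example content, for illustration only.
root = describe_netcdf(
    dimensions={"time": 12, "lat": 64},
    variables={"tas": ("float", ("time", "lat"), {"units": "K"})},
    global_attrs={"Conventions": "CF-1.0"},
)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the description is plain XML, it can be stored, searched, and transformed by generic tools without opening the binary file.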

[Class diagram: metadata object model. Classes with their attributes and cardinalities:]

Object: [1] id

Activity: [0,1] name; [0,1] description; [0,1] rights; [0,n] date type=; [0,n] note; [0,n] participant role=; [0,n] reference uri=

Person: [0,1] firstName; [0,1] lastName; [0,1] contact

Institution: [0,1] name; [0,1] type; [0,1] contact

Project: [0,n] topic type=; [0,1] funding

Service: [0,1] name; [0,1] description

Simulation: [0,n] simulationInput type=; [0,n] simulationHardware

Dataset: [0,1] type; [0,1] conventions; [0,n] date type=; [0,n] format type= uri=; [0,1] timeCoverage; [0,1] spaceCoverage

Classes shown without attributes: Investigation, Ensemble, Campaign, Observation, Experiment, Analysis

Relationship labels: isA, isPartOf, worksFor, generatedBy, participant role=, serviceId, hasParent, hasChild, hasSibling

LEGEND: AbstractClass, Class, inheritance, association
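A few of the classes named in the diagram above can be sketched as Python dataclasses to show how the cardinalities map onto types: [1] becomes a required field, [0,1] an optional one, and [0,n] a list. The inheritance chain and the target of "generatedBy" used here are assumptions read off the diagram labels, and only a subset of attributes is included:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataObject:
    """Root class; the diagram requires exactly one id ([1] id)."""
    id: str


@dataclass
class Activity(MetadataObject):
    name: Optional[str] = None                            # [0,1] name
    description: Optional[str] = None                     # [0,1] description
    rights: Optional[str] = None                          # [0,1] rights
    notes: List[str] = field(default_factory=list)        # [0,n] note
    references: List[str] = field(default_factory=list)   # [0,n] reference uri=


@dataclass
class Simulation(Activity):
    # Assumption: the diagram's intermediate Investigation class is skipped here.
    simulation_hardware: List[str] = field(default_factory=list)  # [0,n]


@dataclass
class Dataset(MetadataObject):
    dataset_type: Optional[str] = None        # [0,1] type
    conventions: Optional[str] = None         # [0,1] conventions
    time_coverage: Optional[str] = None       # [0,1] timeCoverage
    space_coverage: Optional[str] = None      # [0,1] spaceCoverage
    generated_by: Optional[Activity] = None   # assumed target of "generatedBy"


# Invented identifiers, for illustration only.
run = Simulation(id="sim-001", name="example run")
output = Dataset(id="ds-001", conventions="CF", generated_by=run)
print(output.generated_by.name)
```

Keeping the id on a shared root class is what lets datasets, people, and activities reference one another uniformly.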

ESG Metadata Progress

Co-developed NcML with Unidata

– CF conventions in progress, almost done

Developed & evaluated a prototype metadata system

Finalized an initial schema for PCM/CCSM

– Address interoperability with federal standards and NASA/GCMD via the generation of DIF/FGDC/ISO

– Address interoperability with digital libraries via the creation of Dublin Core

Testing relational and native XML databases, and OGSA-DAI

Exploratory work for first-generation ontology

Authoring of discovery metadata in progress

ESG Topology

[Topology diagram. Sites: LBNL, NCAR, LLNL, ISI, ANL, ORNL. Components shown: gridFTP servers, HRM, HPSS, MSS, disk and disk cache, RLS, LAS server, OGSA-DAI with MySQL RDBMS, GRAM gatekeeper, CAS, MyProxy, and the ESG web portal (Tomcat/Struts). Edge labels: authenticate, submit, query, execute, visualize, gridFTP, cross-update.]

Collaborations & Relationships

CCSM Data Management Group

The Globus Project

Other SciDAC Projects: Climate, Security & Policy for Group Collaboration, Scientific Data Management ISIC, & High-performance DataGrid Toolkit

OPeNDAP/DODS (multi-agency)

NSF National Science Digital Libraries Program (UCAR & Unidata THREDDS Project)

U.K. e-Science and British Atmospheric Data Center

NOAA NOMADS and CEOS-grid

Earth Science Portal group (multi-agency, intnl.)

Immediate Directions

Broaden usage of DataMover and refine

Continue building metadata catalogs

Revisit overall security model and consider simplified approaches

Redesign and implement user interface

Alpha version of OPeNDAP-g

– Test and evaluate with client applications

Develop automation for data publishing (GT3)

Deploy for IPCC runs

The Community Data Portal (CDP)

“The dataportal has changed my life…”

Ben Kirtman, COLA

Provide a common portal to NCAR, UCAR, and university data

Provide a sustainable cyberinfrastructure that dramatically lowers the cost of sharing data (there is HUGE interest in this)

Directly couple to simulation systems and DataMonster

Begin capturing rich metadata and catalog our scientific experiments for the world

MSS -> A Petascale Mass Knowledge System

Federate internationally (ESG, THREDDS, U.K. e-Science, NOMADS, PRISM, GEON, etc.)

Foster Revolutionary Change

Mass Storage System (1.5PB)

Petascale Knowledge Repository

Establish a new paradigm for managing and accessing scientific data based on semantic organization.

Community Data Portal

Purpose:

– Build an infrastructure using different methods for data exploration and delivery

– Web-based retrieval and interactive analysis for MSS collections

– Data sharing for multi-institution cooperative studies

– Browse, select, compare, and download data sets, & specify data subsets using graphical selection, text entry, and choice of output format

Components:

– User interface: Live Access Server (LAS)

– Middleware: Ferret, NCL, GrADS

– File service: local or DODS

Status:

– Pilot working (2 years), more middleware testing

Data Access

[Diagram labels: Live Access Client; Live Access Server; Ferret, NCL, Other Engines; DODS]

Data Collections

Massive Data

Simulation & Retrospective: CSM, PCM, DSS, MM5, WRF, MICOM, CMIWG

Example

… Data Analysis

Live Access Server + NCL

(Grib Data)

Interface and Reanalysis 2

Sea Level Pressure

Community Data Portal architecture

[Layered diagram]

User interface: UI components

Middleware: Struts (Tomcat); GDS (Tomcat); DODS aggregation server (Tomcat); LAS (Tomcat)

Core services: catalogs parsing & metadata ingestion; data search & discovery; catalogs browsing; MSS data retrieval; data access (OPeNDAP, FTP, HTTP); data visualization (NCL, Ferret)

Hardware: dataportal.ucar.edu; raid disks; MSS

Community Data Portal Metadata Software

[Diagram]

Metadata sources: THREDDS catalogs that reference ESG metadata, DC metadata, NcML metadata, and other metadata

THREDDS catalog parser application: parses the THREDDS catalogs; stores the full XML doc in an XML native DB (Xindice); shreds the XML doc into tables in a relational DB (MySQL)

XML viewer web application: displays XML docs from the native DB using schema-specific stylesheets

THREDDS catalogs browser web application

Search & Discovery web application: simple query (SQL) against the relational DB; links to future advanced query (XPath, XQuery)

Results: list of triplets (dataset id, metadata schema, metadata URL)
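The simple-query path that returns triplets can be sketched as follows. SQLite from the Python standard library stands in for the MySQL database named on the slide, and the table layout, column names, and sample rows are all hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database on the slide
conn.execute(
    "CREATE TABLE metadata_record ("
    " dataset_id TEXT, metadata_schema TEXT, metadata_url TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO metadata_record VALUES (?, ?, ?, ?)",
    [
        ("ds-001", "DC", "http://example.org/md/ds-001.dc.xml", "Sea level pressure"),
        ("ds-001", "NcML", "http://example.org/md/ds-001.ncml", "Sea level pressure"),
        ("ds-002", "DC", "http://example.org/md/ds-002.dc.xml", "Surface temperature"),
    ],
)


def simple_search(connection, keyword):
    """Return (dataset id, metadata schema, metadata URL) triplets matching a keyword."""
    rows = connection.execute(
        "SELECT dataset_id, metadata_schema, metadata_url"
        " FROM metadata_record WHERE title LIKE ?"
        " ORDER BY dataset_id, metadata_schema",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


results = simple_search(conn, "pressure")
for triplet in results:
    print(triplet)
```

Returning one triplet per metadata schema lets the viewer pick the matching stylesheet for each document it displays.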

CDP Data/Catalog Contributors

ACD: MOZART v2.1 standard run (Louisa Emmons)

ATD: Radar almost ready for today!

CGD: CAS satellite data example (Lesley Smith)

CGD: CDAS and VEMAP data (Steve Aulenbach, Nan Rosenbloom, Dave Schimel)

CGD: CCSM 1000 year run (Lawrence Buja)

CGD: PCM 16 top datasets (Gary Strand)

SCD: DSS full data holdings (Bob Dattore, Steve Worley)

SCD: VETS example visualization catalog (Markus Stobbs, Luca Cinquini)

COLA: Jennifer Adams, Jim Kinter, Brian Doty

Next Steps

Recruiting (!)

– One student for data ingest

– One software engineer

– Systems

– Expanding storage by 20TB (SCD cosponsor)

Ongoing publication of datasets

Publishing documents on plans, design, how to partner, standard services, and management procedures

Building partnerships, DMWG meeting August

Closing Thoughts

Building a sustainable infrastructure for the long-term

– Difficult, expensive, and time-consuming

– Requires longer-term projects

Team-building is a critical process

– Collaboration technologies really help

Managing all the collaborations is a challenge

– But extremely valuable

Good progress, first real usage

Links

Earth System Grid

– www.earthsystemgrid.org

Community Data Portal

– dataportal.ucar.edu

END

We Will Examine Practically Every Aspect of the Earth System from Space in This Decade

Longer-term Missions - Observation of Key Earth System Interactions

Aqua, Terra, Landsat 7, Aura, ICEsat, Jason-1, QuikScat

Exploratory - Explore Specific Earth System Processes and Parameters and Demonstrate Technologies

GRACE, SRTM, VCL, Cloudsat, PICASSO, Triana, EO-1

Courtesy of Tim Killeen, NCAR

Characteristics of Infrastructure

Essential

– So important that it becomes ubiquitous

Reliable

– Example: the built environment of the Roman Empire

Expensive

– Nothing succeeds like excess (e.g. Interstate system)

– Inherently one-off (often, few economies of scale)

Clear factorization between research and practice

– Generally deploy what provably works

CDP Interactions & Opportunities

COLA

CGD/VEMAP

ACD,HAO/WACCM

CGD/CCSM, CAM

CGD/CAS

MMM/WRF

UCAR/JOSS

UCAR/Unidata

CGD,SCD,CU/GridBGC

NOAA/NOMADS

GODAE

HAO/TIEGCM,MLSO

ATD/Radar, HIAPER

ACD/Mozart, BVOC, Aqua proposal

BioGeo/CDAS

SCD/DSS

DOE/Earth System Grid

DLESE

GIS Initiative
