Breakout Group #1 - National Center for Atmospheric Research

Breakout #1 – Collecting Data and Making It Available
Thursday, Oct 31, 2002
Notes by David Maidment and Bruce Wardlaw
1. What are the three highest priority near-term opportunities to speed
scientific progress, and what payoffs can be expected from investing in
these opportunities?
Unfettered access to data, real-time data distribution, internet connection to sensors
Principles of cyberinfrastructure: "if you want to play, you play this way!"
Need a process for building shared infrastructure within a community
Open up the ITR proposal process to discipline-level infrastructure proposals
Unidata is an example, with a 15-year history of data provision for UCAR; science and
technology centers are another.
Long Term Ecological Research program – the need to operate the LTER sites as a network
is a goal stated in its 20-year review
What is going on in the other Directorates? Put the observation networks into a larger
system; are there economies of scale?
Science and culture need to interact and intersect, impact of additional populations in
high-risk areas, inheritance of the North American continent
CUAHSI (Consortium of Universities for the Advancement of Hydrologic Science, Inc.) is a
community-forming process for hydrologists, as UCAR is for atmospheric science.
Hydrologic Information Systems within CUAHSI is the forthcoming cyberinfrastructure
component, as Unidata is for UCAR. The community needs to form before the
infrastructure proposal is made.
Digital libraries include scientific data – how is this considered part of cyberinfrastructure?
NSF could host a web site that prototypes cyberinfrastructure ideas and documents.
Establish a cyberinfrastructure portal?
TeraGrid Project – a Grid project with computational components (NCSA, Caltech); only one
proposal is in the environmental sciences. This is a national cyberinfrastructure.
Encourage collaborations with the TeraGrid ($53 million in the first year, $20 million in
the second year, ...). Funding for this should be "extra" funding, because if TeraGrid
doesn't pan out the domain initiative should still go forward.
What cyberinfrastructure would help you do your job better at your university, your lab?
What are the basic functions of cyberinfrastructure? How do you establish a reference
model that can support a range of disciplines, and how do you group tools? Ontologies, data
mining, data cleansing – the portal gets you to them, and the user selects the best tools.
What is the big picture? Create a tool roadmap.
Discovery, Access, Transform, Use – why do we have domains creating syntactic systems?
Data collection infrastructure: interact with databases at home while you are in the field;
online and offline access; empirical Green's functions correlated with observed data in real
time; networks to sensor bases, e.g. Major Research Equipment and Facilities Construction
projects (e.g. NEES).
Recommendations to NSF –
NSF impose a requirement that all data collected with NSF funds be made openly
available using cyberinfrastructure methods
Making sure that it happens – make this part of the proposal review process
Deposit the data in a digital library, e.g. DLESE, whose data component is not yet
populated. Relying on individual investigators to become data disseminators is a burden.
For all of the disciplines there needs to be something that deals with data archiving.
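As a rough illustration of what such a deposit might involve, here is a minimal sketch in Python of a small, Dublin Core-style metadata record written alongside a data file so an archive or digital library could index it. The field names, values, file names, and identifier are hypothetical, not a requirement of DLESE or any specific repository.

```python
import json

# Minimal sketch (hypothetical): a Dublin Core-style metadata record an investigator
# might deposit alongside a data file. All fields and file names are illustrative only.
metadata = {
    "title": "Streamflow observations, Example Creek, 1995-2001",
    "creator": "J. Investigator, Example University",
    "subject": ["hydrology", "streamflow", "time series"],
    "description": "Daily mean discharge measured at the Example Creek gauge.",
    "date": "2002-10-31",
    "format": "text/csv",
    "identifier": "doi:10.0000/example",  # placeholder identifier, not a real DOI
    "rights": "Openly available; please acknowledge the originating project.",
}

# Write the record next to the data file so a harvester can discover and index it.
with open("example_creek_discharge.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

The point of a lightweight record like this is that the deposit step stays cheap for the investigator while still making the dataset discoverable.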
How do we get to the data in a useful form? How do we make it easier to pull data from
different sources? Data interoperability – NSF is going to make this happen!
If it is easy for people to put data in a data center they’ll probably do it.
How do you get access, for research, to federal data that normal users have to pay $$$ for?
How can this be facilitated using cyberinfrastructure?
GenBank and NEPTUNE have moratoria on data of six months.
The bogeyman who is going to scoop the PI is a myth?
It takes time to clean up the data, so immediate release can't be required.
A test-bed mooring off Bermuda has real-time data on the web; it works great, with
user-beware notices on the real-time data delivery.
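As a sketch of what user-beware, real-time delivery can look like from the client side, the short Python example below fetches the latest readings from a feed and tags them as provisional. The feed URL and response layout are hypothetical, standing in for whatever a real mooring data service would provide.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical feed: this URL is a placeholder, not a real mooring data service.
FEED_URL = "http://example.edu/mooring/bermuda/latest.json"

def fetch_latest_readings(url: str = FEED_URL) -> dict:
    """Fetch the most recent readings and tag them as provisional, since
    real-time data have not yet been quality controlled."""
    with urllib.request.urlopen(url) as resp:
        readings = json.load(resp)
    return {
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "quality": "provisional - real-time, not yet quality controlled",  # the "user beware" notice
        "readings": readings,
    }

if __name__ == "__main__":
    print(fetch_latest_readings())
```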
NCDC has been through this discussion with particular communities. There are federal
requirements on this already.
NSF should request that the NRC evaluate endangered datasets.
As cyberinfrastructure improves, cost of data handling will go down.
2. What major issues require further investigation and how should these
issues be addressed?
Establish an environmental cyberinfrastructure research portal with a set of tools.
THREDDS (Thematic Real-time Environmental Distributed Data Services) is a Unidata project
to make environmental data accessible and to take metadata problems out of the hands of users.
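To make that idea concrete, here is a minimal sketch, using today's netCDF4 Python library, of how a user can open a remote dataset through an OPeNDAP/THREDDS-style service and get units and coordinates from the served metadata rather than handling files by hand. The server URL and variable name are assumptions for illustration, not a real catalog entry.

```python
from netCDF4 import Dataset  # requires a netCDF4 build with OPeNDAP (DAP) support

# Hypothetical catalog entry: this URL is a placeholder, not a real server.
url = "http://example.edu/thredds/dodsC/model/output/temperature.nc"

# Open the remote dataset by URL; the server supplies variable names, units,
# and coordinates, so the user never downloads or parses whole files by hand.
ds = Dataset(url)
temp = ds.variables["air_temperature"]  # variable name is an assumption
print(temp.units, temp.shape)           # units and shape come from the served metadata
subset = temp[0, :, :]                  # only the requested slice crosses the network
ds.close()
```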
There is a variety of personally invented cyberinfrastructures; there is a need now to adopt
community models.
Identify some key datasets that could be brought together, e.g. the USGS North American
magnetic dataset; the gravity dataset also needs to be updated.
How can the GOES dataset be modified to make it more available? It is used by a number of
communities. Focus on things that are going to have a big impact on many communities.
Make the remote sensing data more usable.
Need to capture current analog or physical data; infrastructure is needed to help with this.
The technology for moving analog data to digital is not advancing quickly.
An NRC report estimates $50 million to save and digitally archive such data for the earth sciences.
Summary Report
1. High-priority, near-term opportunities
Community recommendation: NSF should be a leader in organizing support for long-term
maintenance, built-in IT support, a data repository system, and coordination
Establish a set of principles to guide CI: "if you want to play, you play this way!"
(e.g., maintaining open access to data, real-time data distribution, internet
connection to sensors)
Establish (1) a reference model that can support a range of disciplines (how to develop
and use groups of tools – ontologies, data mining, data cleansing – and create a tool
roadmap) and (2) an environmental research and education portal to host the reference
model and tools (or tool boxes)
Suggest data models, develop a coordinated distributed network to deliver the data,
support development of virtual digital libraries to hold the data for the long term.
Further the development of cyberinfrastructure (including middleware) that enhances or
enables interoperability of data
Resolve the issues of data standardization (or a process for standards consensus),
availability, and preservation.
Payoffs: increased productivity, a better common language, cross-fertilization, sharing,
and enhanced product delivery, with an enhanced education component. This will greatly
reduce the challenges of aggregating data.
2. Issues requiring further investigation
Need a process for building and sustaining shared infrastructure within and across
communities
Why is the infrastructure so imperfect now???
Need to substantially increase community buy-in for metrics, interoperability, and a
process for standardization.
3. What are the best means of fostering computer science/environmental research and
education cooperation in this area?
Join and participate in DLESE by providing environmental/computer science
education modules (especially by supplying datasets for exercises).
Develop programs for an applied computer science degree (Master's or PhD) in
environmental science specialties.
Develop interdisciplinary research teams that will serve as models and laboratories for an
educational environment.