Breakout #1 – Collecting Data and Making It Available
Thursday, Oct 31, 2002
Notes by David Maidment and Bruce Wardlaw

1. What are the three highest priority near-term opportunities to speed scientific progress, and what payoffs can be expected from investing in these opportunities?

- Unfettered access to data, real-time data distribution, internet connection to sensors.
- Principles of cyberinfrastructure: "if you want to play, you play this way!"
- Need a process for building shared infrastructure within a community.
- Open up the ITR proposal process to discipline infrastructure proposals.
- Unidata is an example of a 15-year history of data provision for the UCAR community; also the Science and Technology Centers and the Long Term Ecological Research (LTER) program – the need to operate the LTER sites as a network is a goal stated in its 20-year review.
- What is going on in the other Directorates? Put the observation networks into a larger system; are there economies of scale?
- Science and culture need to interact and intersect: the impact of additional populations in high-risk areas, the inheritance of the North American continent.
- CUAHSI (Consortium of Universities for the Advancement of Hydrologic Science, Inc.) is a community-forming process for hydrologists, as UCAR is for atmospheric science. Hydrologic Information Systems within CUAHSI is the forthcoming cyberinfrastructure component, as Unidata is for UCAR. Need to form the community first, before the infrastructure proposal.
- Digital libraries include scientific data – how is this considered part of cyberinfrastructure?
- NSF could host a web site that prototypes cyberinfrastructure ideas and documents. Establish a cyberinfrastructure portal?
- TeraGrid Project – a Grid project with computational components at NCSA and Cal Tech; only one proposal in the environmental sciences. This is a national cyberinfrastructure. Encourage collaborations with the TeraGrid ($53 million in the first year, $20 million in the second year, …). Funding for this should be "extra" funding, because if TeraGrid doesn't pan out the domain initiative should still go forward.
- What cyberinfrastructure would help you do your job better at your university or your lab?
- What are the basic functions of cyberinfrastructure? How do you establish a reference model that can support a range of disciplines, and how do you group tools? Ontologies, data mining, data cleansing; the portal gets you to them and the user selects the best tools. What is the big picture? Create a tool roadmap: Discovery, Access, Transform, Use. Why do we have domains creating their own syntactic systems?
- Data collection infrastructure: interact with databases at home while you are in the field, online and offline access, empirical Green's functions correlated with observed data in real time, networks to sensor bases, e.g., Major Research Equipment and Facilities Construction projects (e.g., NEES).
- Recommendation to NSF – NSF should impose a requirement that all data collected with NSF funds be made openly available using cyberinfrastructure methods. Making sure that it happens – make this part of the proposal review process.
- Deposit the data in a digital library, e.g., DLESE, whose data part is not yet populated. Relying on individual investigators to become data disseminators is a burden. For all of the disciplines there needs to be something that deals with data archiving.
- How do we get to the data in a useful form? How do we make it easier to pull data from different sources? Data interoperability (a minimal sketch follows at the end of this question's notes). NSF is going to make this happen! If it is easy for people to put data in a data center, they'll probably do it.
- How do you get access to federal data for research that normal users have to pay for? How can this be facilitated using cyberinfrastructure?
- GenBank and NEPTUNE have 6-month moratoria on data. The bogeyman – someone who is going to scoop the PI – is it a myth? It takes time to clean up the data, so immediate exposure can't be required.
- The test-bed mooring off Bermuda has real-time data on the web and works great, with "user beware" notices on real-time data delivery.
- NCDC has been through this discussion with particular communities. There are federal requirements on this already.
- NSF should request that the NRC evaluate endangered datasets.
- As cyberinfrastructure improves, the cost of data handling will go down.
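A minimal sketch of the data interoperability point above, assuming two hypothetical agency feeds (the URLs, field names, and units below are made up for illustration): each source is normalized into one shared record structure so that downstream tools see a single schema.

    # Minimal interoperability sketch. The URLs, field names, and unit choices
    # below are hypothetical, used only to illustrate normalizing two
    # differently formatted sources into one common record structure.
    import csv
    import io
    import json
    import urllib.request

    SOURCES = {
        "agency_a": "http://example.org/agency_a/streamflow.csv",   # CSV: site,date,flow_cfs
        "agency_b": "http://example.org/agency_b/streamflow.json",  # JSON: station, t, q_cms
    }

    CFS_TO_CMS = 0.0283168  # cubic feet per second -> cubic meters per second

    def fetch(url):
        """Return the body of a URL as text."""
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8")

    def normalize_a(text):
        """Agency A ships CSV with flow in cubic feet per second."""
        rows = csv.DictReader(io.StringIO(text))
        return [{"site": r["site"], "time": r["date"],
                 "flow_cms": float(r["flow_cfs"]) * CFS_TO_CMS} for r in rows]

    def normalize_b(text):
        """Agency B ships JSON with flow already in cubic meters per second."""
        return [{"site": r["station"], "time": r["t"],
                 "flow_cms": float(r["q_cms"])} for r in json.loads(text)]

    if __name__ == "__main__":
        records = (normalize_a(fetch(SOURCES["agency_a"])) +
                   normalize_b(fetch(SOURCES["agency_b"])))
        print(len(records), "records in one common schema")

The point of the sketch is only that the per-source quirks are handled once, at the edge, so a data center or portal can serve one consistent view.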
2. What major issues require further investigation and how should these issues be addressed?

- Establish an environmental cyberinfrastructure research portal with a collection of tools.
- THREDDS (Thematic Real-time Environmental Distributed Data Services) is a Unidata project to make environmental data accessible and take metadata problems out of the hands of users.
- There is a variety of personally invented cyberinfrastructures; we now need to adopt community models.
- Identify some key datasets that could be brought together, e.g., the USGS North American magnetic dataset; the gravity dataset also needs to be updated.
- How can the GOES dataset be modified to make it more available? It is used by a number of communities. Focus on things that are going to have a big impact on many communities. Make the remote sensing data more usable.
- Need to take current analog or physical data to digital form; infrastructure is needed to help with this. Moving analog data to digital is not advancing quickly technologically. An NRC report says it would take $50 million to save and digitally archive such data for the earth sciences.

Summary Report

1. High priority, near-term opportunities

- The community recommends that NSF be a leader in organizing support for long-term maintenance, built-in IT support, a data repository system, and coordination.
- Establish a set of principles to guide CI: "if you want to play, you play this way!" (e.g., maintaining open access to data, real-time data distribution, internet connection to sensors).
- Establish (1) a reference model that can support a range of disciplines – how to develop and use group tools: ontologies, data mining, data cleansing; create a tool roadmap (see the pipeline sketch after section 2 below) – and (2) an environmental research and education portal to host the reference model and tools (or toolboxes).
- Suggest data models, develop a coordinated distributed network to deliver the data, and support development of virtual digital libraries to hold the data for the long term.
- Further the development of cyberinfrastructure (including middleware) enhancing or enabling interoperability of data.
- Resolve the issues of data standardization (or a process for standards consensus), availability, and preservation.
- Payoffs: increased productivity, a better common language, cross-fertilization, sharing, and enhanced product delivery, with an enhanced education component. This will greatly reduce the challenges of aggregating data.

2. Issues requiring further investigation

- Need a process for building and sustaining shared infrastructure within and across communities.
- Why is the infrastructure so imperfect now?
- Need to substantially increase community buy-in for metrics, interoperability, and a process for standardization.
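A sketch of the Discovery – Access – Transform – Use chain behind the tool roadmap referenced in section 1. Everything here is hypothetical: the small in-memory catalog stands in for a portal's discovery service, and the records and unit conversions are invented for illustration.

    # Hypothetical sketch of a Discovery -> Access -> Transform -> Use chain.
    # The in-memory CATALOG stands in for a portal's discovery service and
    # the records are invented; only the shape of the pipeline is the point.
    CATALOG = {
        "precipitation": {"id": "precip_daily", "units": "inches",
                          "records": [("2002-10-29", 0.4), ("2002-10-30", 1.2)]},
        "streamflow": {"id": "flow_hourly", "units": "cfs",
                       "records": [("2002-10-30T00:00", 530.0)]},
    }

    def discover(keyword):
        """Discovery: find catalog entries whose key matches a keyword."""
        return [entry for key, entry in CATALOG.items() if keyword in key]

    def access(entry):
        """Access: retrieve the records (here they are already in memory)."""
        return entry["records"]

    def transform(records, units):
        """Transform: convert to SI units so different sources are comparable."""
        factor = {"inches": 25.4, "cfs": 0.0283168}.get(units, 1.0)
        return [(t, v * factor) for t, v in records]

    def use(records):
        """Use: a trivial analysis step, the mean of the values."""
        values = [v for _, v in records]
        return sum(values) / len(values)

    if __name__ == "__main__":
        for entry in discover("precip"):
            si_records = transform(access(entry), entry["units"])
            print(entry["id"], "mean (mm):", round(use(si_records), 2))

In a real portal, each stage would be backed by shared services (catalogs, data servers, transformation tools) rather than local functions; the sketch only shows how the roadmap's stages compose.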
3. What are the best means of fostering computer science/environmental research and education cooperation in this area?

- Join and participate with DLESE by providing environmental/computer science education modules, especially by supplying datasets for exercises (a hedged metadata sketch closes these notes).
- Develop programs for an applied computer science degree (Masters or PhD) in environmental science specialties.
- Develop interdisciplinary research teams that will serve as models and laboratories for educational environments.
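A hedged sketch of attaching a machine-readable description to a dataset contributed for a classroom exercise, as suggested in section 3. The field names are generic, Dublin Core-like assumptions, not DLESE's actual metadata framework, and the example file name is a placeholder.

    # Hypothetical sketch: write a small metadata record next to a data file
    # before contributing it as an exercise dataset. Field names are generic,
    # Dublin Core-like assumptions, not any library's actual schema.
    import json
    from pathlib import Path

    def write_metadata(data_file, title, creator, keywords):
        """Write <data_file>.meta.json beside the data file and return its path."""
        record = {
            "title": title,
            "creator": creator,
            "subject": keywords,
            "format": Path(data_file).suffix.lstrip("."),
            "source_file": data_file,
            "rights": "Open access; cite the original investigator",
        }
        out = Path(data_file + ".meta.json")
        out.write_text(json.dumps(record, indent=2))
        return out

    if __name__ == "__main__":
        # The dataset name is a placeholder for an instructor-prepared subset.
        path = write_metadata("bermuda_mooring_temps.csv",
                              title="Bermuda test-bed mooring temperatures (exercise subset)",
                              creator="Course instructor",
                              keywords=["oceanography", "time series", "education"])
        print("metadata written to", path)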