Zoo 955, http://limnology.wisc.edu/courses/zoo955

Zoo 955: Information Management in Ecology
Spring 2008
Course Information

Instructors:
Paul Hanson, Center for Limnology, pchanson@wisc.edu
Barbara Benson, Center for Limnology, bjbenson@wisc.edu

Meeting time: 9:55 – 11:50 Wednesdays; two blocks (A, B) per week with a 10-minute break between blocks. Students are expected to attend both blocks.

Locations: Most class periods will meet in the Center for Limnology Conference room (rm 210). One or more labs may meet at computer facilities elsewhere on campus.

Course goals: In this seminar you will learn about information management issues spanning a broad range of research models, from single-investigator projects to large, international research collaborations. As a group, we will investigate the relationships between information and the research process. We will have practical activities that use the tools and technologies required for managing ecological data. As part of this seminar, students will create their own well-designed databases, using their own data and tailored to their needs.

Students' responsibilities: (a) participate in class discussions and laboratory exercises; (b) read assigned materials before class; (c) lead or co-lead a one-hour discussion; (d) present a project. More information on (c) and (d) follows.

Activities:
Lectures and guest speakers: Instructors or guest speakers will provide lectures on the topics listed and will lead a discussion. Readings may be provided and should be read before the lecture.
Labs: Instructors will guide students through hands-on information management activities. Computers and software are supplied, although students may choose to use their own computers. Software will be open source whenever possible.
Discussions: Each student will choose a discussion topic to lead during one block. Depending on the number of students, some may need to work in pairs. Sample discussion topics are listed following the syllabus.
Student projects: Each student will create a database, using data of her/his choosing. At the end of the semester, each student will have an opportunity to present the database, including the model, technology, metadata, etc.

Resources: http://limnology.wisc.edu/courses/Zoo955: Some materials for the course are available at this Web site.

Syllabus

Date    Block  Activity       Person(s)                                           Topic
23 Jan  A      Lecture        Benson                                              Introduction
        B      Lab            Benson                                              Exercise on data structures
30 Jan  A      Lecture        Hanson                                              Introduction
        B      Lab            Benson, Hanson                                      Exercise on data structures (cont.)
06 Feb  A      Lab            Luke Winslow, CFL                                   Create a database
        B      Lab            Luke Winslow, CFL                                   Create a database
13 Feb  A, B   Work day                                                           GLEON 6, instructors away
20 Feb  A      Discussion     Amy Kamarainen                                      Metadata
        B      Lab            Hanson, Winslow                                     More database tools
27 Feb  A      Lecture        Jeff Maxted, CFL                                    Spatial data
        B      Lecture        Jeff Maxted, CFL                                    Spatial data
05 Mar  A      Discussion     Hanson                                              Sensor networks
        B      Guest speaker  Michael Hamilton, James Reserve, CA                 Embedded ecological sensor networks
12 Mar  A      Discussion     Katrina Butkas                                      Current practices of documenting flow of scientific analyses
        B      Discussion                                                         Scientific workflows
19 Mar  A, B   Spring break
26 Mar  A      Discussion     Matt van de Bogert                                  Hierarchical collaborations using a long-term dataset: a local case study
        B      Discussion     Lucas Mayer-Horner                                  Collaboration technologies
02 Apr  A      Guest speaker  Deana Pennington, LTER Network Office               Structure and function of crossdisciplinary collaborations and the flow of information
        B      Discussion     Steve Powers                                        Young data
09 Apr  A      Guest speaker  Peter McCartney, NSF                                A perspective from the National Science Foundation on IM in biological sciences
        B      Discussion                                                         IM at the organizational level
16 Apr  A      Discussion                                                         Top 20 IM needs of students
        B      Guest speaker  Alain Roy and Todd Tannenbaum, UW Computer Science  Open Science Grid – International collaborative networks
23 Apr  A      Student proj.  Katrina, Lucas                                      Project presentations
        B      Cancelled                                                          CFL Field planning meeting
30 Apr  A      Student proj.  Steve, Noah, Amy                                    Project presentations
        B      Student proj.  Matt, Sarah, Ann                                    Project presentations
07 May  A      Student proj.  Hanson, Benson                                      Summary
        B      Student proj.  Hanson, Benson                                      Summary

Additional discussion leaders listed: Sarah Johnson, Ann Busche, Noah Lottig.

Readings by Date/Topic:

20 Feb 2008: Metadata
Michener, W. K., J. W. Brunt, J. J. Helly, T. B. Kirchner, and S. G. Stafford. 1997. Non-geospatial metadata for the ecological sciences. Ecol. Appl. 7:330-42.

05 Mar 2008: Sensor Networks
Collins, Scott L., Bettencourt, Luis M.A., Hagberg, Aric, Brown, Renee F., Moore, Douglas I., Bonito, Greg, D. 2006. New opportunities in ecological sensing using wireless sensor networks. Frontiers in Ecology and the Environment 4(8): 402-407 <Collins et al 2006.pdf>
Estrin, Debra et al. 2003. Environmental Cyberinfrastructure Needs for Distributed Sensor Networks: A Report from a National Science Foundation Sponsored Workshop. (Introduction; p.25 Box 6 on ENS; p.29 Box 7 on CLEANR; Chapter 6; Chapter 8) <Estrin et al. 2003>
Porter, J. et al. 2005. Wireless sensor networks for ecology. BioScience 55(7): 561-572 <Porter et al 2005.pdf>

12 Mar 2008: Scientific Workflows
Altintas, I., C. Berkeley, E. Jaeger, M. Jones, B. Ludascher, and S. Mock. 2004. Kepler: an extensible system for design and execution of scientific workflows. Proc. 16th Int.
Conf. Sci. Stat. Database Manag.
Deelman, E., and Y. Gil. Workshop on the Challenges of Scientific Workflows. National Science Foundation. http://www.isi.edu/nsf-workflows06
Pennington, D.D., D. Higgins, A. Townsend Peterson, M.B. Jones, B. Ludascher, and S. Bowers. Ecological Niche Modeling Using the Kepler Workflow System. Report.

02 Apr 2008: Large, Collaborative Networks

09 Apr 2008: National Science Foundation
NSF Cyberinfrastructure Council. 2007. Cyberinfrastructure Vision for 21st Century Discovery. National Science Foundation. http://www.nsf.gov/pubs/2007/nsf0728

Sample Discussion Topics

1. Data models: A data model refers to the way in which data are organized, including data types, relationships between variables, variable grouping, and the relationships between metadata and data. What data models exist? Which are most commonly used in ecology? How do they differ as a function of the system they represent, e.g., can bird data be organized differently from water chemistry data?

2. Metadata: Metadata is the contextual information needed to use a data set (data about the data). What metadata are important for scientific reuse of data, and why? What metadata standards have been developed for ecology? What incentives might be provided to researchers to generate good-quality metadata for their data sets?

3. Data discovery: Exploring data repositories and searching across data archives requires data to be “exposed” to the world and tools for accessing those data. How do the data model, data structure, and metadata facilitate this? What standards exist, such as EML, to help this process? What tools are being developed to facilitate discovery across multiple IM systems? What are the techniques for visualizing discovered data?

4. Sensor networks: Tremendous resources have been invested in sensor networks designed to automatically monitor the environment. The huge volumes of data and the effort required to deploy and maintain these systems require automation and standardization at many steps. What are the implications of continuously streaming data for IM? What are the unique requirements for data models, QA/QC, and data discovery?

5. Semantic mediation, controlled vocabularies, ontologies: Integrating ecological data from multiple sources can be challenging due to heterogeneity in content, format, scale, semantics, etc. What are some of the approaches to semantic mediation of data sets? How can a controlled vocabulary facilitate data integration? What are some examples from ecology of the use of controlled vocabularies? What roles can ontologies play in data integration? Why are ontologies difficult to create?

6. QA/QC: Ensuring data quality (quality assurance/quality control) is a critical, though often underappreciated, component of IM. What are the standards of QA/QC? What tools are available? Who should be responsible for this: field technicians, database administrators, or researchers? What algorithms are used to perform QA/QC, and how is data quality represented in the database?

7. IM in large, collaborative networks: LTER, NEON, GLEON, GEON, WATERS, and CLIME are examples of large collaborative networks. Big science requires big investment in IM. How do these organizations do it? What resources are required? At what level are their IMSs compatible?

8. Scientific workflows: IM and data analysis often require repetitive tasks executed in a defined, though sometimes complicated, sequence. Scientific workflows formalize, automate, and document these tasks. What are the components of a workflow? What tasks lend themselves to workflows? What tools exist to develop workflows, and who is using them?

9. Collaboration technologies: People have to communicate, and scientific collaborations increasingly occur among geographically dispersed researchers. What are some of the technologies available that enable scientists in these collaborations to analyze, discuss, annotate, and view data?

10. Document management: The lifecycle of the research process produces a variety of information, ranging from proposals to data to manuscripts. New technologies organize and serve these documents for consumption by people inside the organization, as well as the general public. What technologies exist for document management, and how do they differ from traditional file serving? What are the implications for duties and responsibilities for IM within research organizations? How does this type of technology improve the competitive advantage of research organizations?
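Topics 1, 2, and 6 above ask how data, metadata, and data quality might be represented together in a database. As a possible starting point for discussion, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration and not part of the course materials: the table and column names, the flag vocabulary ('unchecked', 'ok', 'out_of_range', 'missing'), the valid range, and the three sample readings are all hypothetical. It stores per-variable metadata (unit and plausible range) in one table, observations in another, and applies a simple range check as one example of an automated QA/QC step.

```python
import sqlite3

# Hypothetical schema: table names, column names, and flag values are
# illustrative assumptions, not a standard and not from the course.
SCHEMA = """
CREATE TABLE variable (
    variable_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    unit        TEXT NOT NULL,      -- metadata: measurement unit
    valid_min   REAL,               -- metadata: plausible range used for QA/QC
    valid_max   REAL
);
CREATE TABLE observation (
    observation_id INTEGER PRIMARY KEY,
    variable_id    INTEGER NOT NULL REFERENCES variable(variable_id),
    site           TEXT NOT NULL,
    sampled_at     TEXT NOT NULL,   -- ISO 8601 timestamp
    value          REAL,
    qc_flag        TEXT NOT NULL DEFAULT 'unchecked'
);
"""


def build_example_db():
    """Create an in-memory database with one variable and three made-up readings."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO variable (name, unit, valid_min, valid_max) VALUES (?, ?, ?, ?)",
        ("water_temp", "degC", -1.0, 40.0),
    )
    rows = [
        (1, "LakeA", "2008-01-23T10:00:00", 4.2),
        (1, "LakeA", "2008-01-23T11:00:00", 99.9),  # implausible reading
        (1, "LakeA", "2008-01-23T12:00:00", None),  # missing reading
    ]
    conn.executemany(
        "INSERT INTO observation (variable_id, site, sampled_at, value) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def apply_range_check(conn):
    """Set qc_flag on every observation from the range stored with its variable."""
    conn.execute(
        """
        UPDATE observation
        SET qc_flag = CASE
            WHEN value IS NULL THEN 'missing'
            WHEN value < (SELECT valid_min FROM variable v
                          WHERE v.variable_id = observation.variable_id)
              OR value > (SELECT valid_max FROM variable v
                          WHERE v.variable_id = observation.variable_id)
            THEN 'out_of_range'
            ELSE 'ok'
        END
        """
    )
    conn.commit()


def flags(conn):
    """Return the qc_flag of each observation in insertion order."""
    query = "SELECT qc_flag FROM observation ORDER BY observation_id"
    return [row[0] for row in conn.execute(query)]


if __name__ == "__main__":
    db = build_example_db()
    apply_range_check(db)
    print(flags(db))
```

Keeping the valid range in the variable table rather than hard-coding it in the check is one design choice among several; it lets the same QA/QC routine cover new variables, and it keeps the flagged value in place so that the original reading is never overwritten.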