Data Practices across Disciplines: Informing Collections & Curation

Carole L. Palmer
Melissa H. Cragin, Tiffany Chao, & Nic Weber
Center for Informatics Research in Science & Scholarship
Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
iConference, 9 February 2011, Seattle, WA

Data Conservancy studies of scientists
[Diagram labels] Astronomy, NCAR, Life Sciences, Earth Sciences, Social Sciences
• Task-based design and usability testing: use cases, data requirements, system recommendations
• UCLA: ethnography, oral histories; use cases, data requirements
• ILLINOIS: SMALL SCIENCE; curation requirements relating data characteristics & community data practices; reuse potentials

Small science is big, and poorly curated
12,025 NSF grants awarded in 2007 = $2,865,388,605

                   Top 20% of grants        Remaining 80% of grants
Number of Grants   2405                     9621
Total Dollars      $1,747,957,451           $1,117,431,154
Range              $300,000 - $38,131,952   $579 - $300,000

Top 254 grants received 20% of the total awarded (Heidorn, 2009)

Research questions & target domains
• What data, in what forms, are needed to advance research?
• What factors predict value for reuse of data sets?
• How do the dependencies among research communities evolve around data resources?
Earth & life science intersections, with challenging curation problems:
systems geobiology - soil ecology - oceanography . . .
• interdisciplinary research; need for data from outside fields, integration of data across fields and scales.
• production and use of compound / complex data sets.
• ingest / curation of community databases, policy and reuse issues.
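As a quick arithmetic check on the NSF grant figures from the "Small science is big, and poorly curated" slide, the sketch below recomputes the dollar share held by each group of grants. The figures are those quoted from Heidorn (2009); the variable and function names are illustrative and do not come from the source.

```python
# Figures as quoted on the slide (Heidorn, 2009). Names are illustrative only.
TOTAL_GRANTS = 12_025
TOTAL_AWARDED = 2_865_388_605

top_20 = {"grants": 2405, "dollars": 1_747_957_451}
bottom_80 = {"grants": 9621, "dollars": 1_117_431_154}


def summarize():
    """Recompute sums, dollar shares, and the mean award for the smaller grants."""
    dollars_sum = top_20["dollars"] + bottom_80["dollars"]
    grants_sum = top_20["grants"] + bottom_80["grants"]
    return {
        "dollars_sum_matches_total": dollars_sum == TOTAL_AWARDED,
        "grants_sum": grants_sum,
        "top_share": top_20["dollars"] / TOTAL_AWARDED,
        "bottom_share": bottom_80["dollars"] / TOTAL_AWARDED,
        "bottom_mean_award": bottom_80["dollars"] / bottom_80["grants"],
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

On these figures the two dollar totals add up to the stated total, and the smaller 80% of grants account for roughly 39% of the dollars awarded. The two grant counts as quoted sum to 12,026, one more than the stated 12,025; the sketch reports that sum rather than correcting either number.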
Progressive data collection
Talking shop about data - efficient exchange with the right scientists about the right things

Scientists leading research - IP, access, discovery, research context
• Pre-interview worksheets
• Semi-structured interviews
• Follow-up sessions with selected participants

Scientists managing data - stages, versions, standards, tools
(post docs, others from labs and research groups)
• Data deposit & sharing worksheet
• Data samples, related documentation

Units of analysis
Data “sets”
• Aligned with research group production and dissemination
• Workflows and services
• Policies on attribution, embargoing, etc.
Data communities
• Aligned with current and future interactions around data
• Representation, functionality, and use
• Policies for selection, appraisal, retention, description

Data communities
What are the meaningful social units for organization and use of data over the long term?
• Sub-discipline focused on particular kinds of data that produce specific measurements or analysis - (systems geobiology)
• Specialized domain focused on a research problem, often interdisciplinary in nature - (urban vulnerability)
• Developers of shared community-level data collection (i.e., “Resource Collection”, NSB 2005) - (soil science)
Core research challenge: Predict and design for communities of users, which will differ from producers, and change over time

Data curation and sharing dynamics

Geobiology
  Data units: Site-specific time series
    • reduced spreadsheets: rock, water, microbial
    • microscopy images
    • annotated digital photographs
  User communities:
    • Geology
    • Chemistry
    • Microbiology
    • Genomics
    • U.S. Park Service
  Sharing:
    • by request
    • no repository conventions
    • mostly post-publication, some unpublished

Volcanology
  Data units: Rock profile
    • physical rock
    • thin section
    • chemical analysis
    • photographs
    • field notes
  User communities:
    • Geology – igneous petrology
    • Geophysics
    • Geochemistry
  Sharing:
    • by request
    • no repository

Soil ecology
  Data units: Database
    • multiple abiotic soil measurements
    • associated metadata
  User communities:
    • Geology – biogeochemistry
    • Earthworm ecology
    • Sensor network researchers
  Sharing:
    • public resource collection

Data Curation Framework

Data Conservancy collection criteria
• Broad scope, targeted research areas / needs
  – earth sciences, life sciences, social sciences, and astronomy
• At-risk and highly unique or valuable data for target research areas
  – consistent with the traditional role of special collections
• Data with high potential for future reuse
  – Yet, producers often fail to recognize the potential for reuse by others.
(Cragin, Palmer, Carlson, & Witt. 2010. Philosophical Transactions of the Royal Society A)

Hjørland’s epistemological potential of documents
• Representation (subject analysis) should go beyond description of aboutness
• Expose ability to “transfer knowledge”
  – requires “understanding of which future problems can give rise to the use of the document in question” (p. 93)
• Documents can have an infinite number of properties capable of informing a user, therefore description must be informed by:
  – Analysis of contributions to various user groups, beyond the originally intended audience
  – Prioritization of the contributions with the most “long-term utility”
  – Categorizations that will function in the information system

Data as raw materials of research
• Do not transfer knowledge directly
• Processing and tools for intelligibility and interpretation
• Effort and resources to determine integrity and fit for new purpose
Curation roles in DC:
  – Integrity - assessed in part by applying OAIS criteria for preservation description information.
  – Fit-for-purpose - alignment with the methods and tools of a given research community.
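The integrity role above refers to OAIS criteria for preservation description information (PDI). As a rough sketch only, assuming a hypothetical metadata record held as a dictionary, a curator's checklist could flag which PDI categories are missing or empty. The category names follow the OAIS reference model (Reference, Provenance, Context, and Fixity, with Access Rights added in the 2012 revision). The record fields, sample values, and function name are hypothetical, and this is not a description of how the Data Conservancy assesses integrity.

```python
# OAIS Preservation Description Information (PDI) categories.
# Access Rights was added in the 2012 revision of the reference model.
PDI_CATEGORIES = ("reference", "provenance", "context", "fixity", "access_rights")


def missing_pdi(record):
    """Return the PDI categories that are absent or empty in a metadata record.

    `record` is a hypothetical dict keyed by category name; any missing key
    or falsy value (empty string, None) counts as missing.
    """
    return [category for category in PDI_CATEGORIES if not record.get(category)]


# Hypothetical record for a data set; every value here is made up.
example_record = {
    "reference": "hypothetical-id-001",
    "provenance": "collected by field team; reduced spreadsheet, second version",
    "context": "",
    "fixity": "sha256:<checksum>",
}

if __name__ == "__main__":
    print(missing_pdi(example_record))
```

A check like this covers only whether each category is present. It says nothing about fit-for-purpose, which depends on the methods and tools of the receiving research community.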
Analytic potential of data
[Diagram labels] Data; domains of interest; user communities; integrity; fit-for-purpose; contributions; categorization; description

Data curation expertise
As was true with bibliographic resources, understanding future uses of data involves comprehension of particulars of data functionality and application
And,
• historical and cultural dynamics of research areas
• broad cross-disciplinary epistemological trends
to address needs of current and yet unknown user groups.

Questions & comments, please
clpalmer@illinois.edu
http://cirss.lis.uiuc.edu/
Center for Informatics Research in Science and Scholarship