Data DataManagement ManagementBest BestPractices Practices Ryan RyanWomack Womackand andAletia AletiaMorgan Morgan Rutgers RutgersUniversity UniversityLibraries Libraries Research ResearchData DataServices Services October 29, 2013 Why Data Management? • Worst case scenarios – Lost data – Stolen data – Incomprehensible data – Unverifiable data Why Data Management? • Best Case Scenarios, Data is… – Robust – Recoverable – Reliable – Reusable – Reproducible – Reputable – Renowned The Data Lifecycle Understand your data • Quantity (MB, GB, TB) – will affect your decisions about how to store, package, transport, and backup your data – Large numbers of discrete files, regardless of size, may be harder to handle – Multiple versions? Frequent updates? • Format – Seek out open formats or at least commonly used formats (.xlsx is ok, consider .csv) – Both for preservation and reuse, dependency on a single option can be a problem, even if it is the most convenient in the short-term. • Rights and permissions – Reuse of other data sources (licensed or otherwise) may require investigation and restrict data sharing options – Confidentiality of human subjects or guarding of information for patent protection may be reasons to restrict data – Give appropriate credit to others’ contributions Preserving your Data What types of data need to be managed or and stored over the course of your project? • Raw Data • Working Data • Processed or Final Data • Preserved Data for Possible Re-Use Data Preservation Considerations • Access Control • Versioning • Backup – Local – Shared • Campus • Offsite Security Security includes both • Physical Security • Logical/Network/Password Security Organizing your Data • Who controls the data for which parts of the project? • A logical structure and plan will help during and after data creation, for directories and filenames (e.g. project directory has standard locations for raw data, analysis, code, graphs, etc.) • Project folders with well-documented naming conventions (e.g., date of data creation as part of file name in a structured format (ISO is yyyy-mm-dd). See ARM example. • Versioning scheme should be agreed on and documented. • Goal is to have unique and understandable identifiers so that different parts of a project can merge and be handled easily over time, even if participants and conditions change. Documenting your Data • Documentation helps – with your own reuse of the data at a later date, – others in your research group working with the data – with the long-term reuse potential of your data • A Codebook is a structured way of explaining the contents of your data file • Readme files are a well-understood way of communicating information about the contents and setup of your data • Documentation can also take the form of a simple text or Word document • Include any explanations of experimental methods, software code run on the data, and any other tools needed to work with the data • Your previous work on creating and describing a consistent organization for your data will help here. Reproducible Research • Ideally, someone else can grab your data project as a complete bundle of data, documentation, and software code, and recreate the analysis to get exactly the same results • Reports and data can be integrated so that live analysis run on actual data can be placed in reports (some R packages do this). • Many initiatives are advancing the concept of reproducible research. • This high standard of evidence and validation is an assurance that data and conclusions are not flawed (or faked). • Good data management practices lay the groundwork for success in reproducible research Data Management Plans • Many sponsored research agencies now require a "Data Management Plan" (DMP) as a component of any proposal • The DMP is a formal document that outlines what the PI will do with data during and after completion of a funded research project • The specific requirements for the DMP vary by funder, and by research subject Why The DMP Requirement? • Ensure the preservation of important research data • Support the potential re-use of grant funded data by other researchers to validate and potentially extend the value of the data • Improve accountability for use of public revenue to support research • Provide for the reasonable access to research data to be consistent with the Freedom of Information Act Benefits of a Good DMP • Improved competitiveness for grant programs; a clear and complete Data Management Plan will support the project’s evaluation plan • Enhanced PI efficiency – writing a DMP encourages the creation of a structured plan for managing research data throughout the life of the project and beyond • Long-term protection and preservation of RU research data Who’s Asking for a DMP? • National Science Foundation http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/aag_6.jsp#VID4 • • • • Centers for Disease Control and Prevention http://www.cdc.gov/od/foia/policies/sharing.htm Department of Energy http://www.cio.energy.gov/policy-guidance/federal_regulations.htm Department of Defense http://www.dtic.mil/whs/directives/corres/pdf/320014p.pdf Environmental Protection Agency http://www.epa.gov/quality/informationguidelines/documents/EPA_InfoQualityGuidelines.pdf Institute of Museum and Library Services http://www.imls.gov/applicants/forms/DigitalProducts.pdf NASA http://nasascience.nasa.gov/earth-science/earth-science-data-centers/data-and-information-policy National Endowment for the Humanities http://www.neh.gov/grants/guidelines/pdf/DataManagementPlans.pdf National Institute of Justice http://www.ojp.usdoj.gov/nij/funding/data-resources-program/welcome.htm National Institute of Standards and Technology http://www.nist.gov/director/quality_standards.htm United States Department of Agriculture http://www.csrees.usda.gov/ United State Department of Education http://ies.ed.gov/funding/datasharing_policy.asp • • • • • • • • • Many academic journals have also begun to require data sharing as part of the submission process. For a partial list of journals with data sharing mandates for their published articles, see the following: http://oad.simmons.edu/oadwiki/Journal_open-data_policies http://gking.harvard.edu/pages/datasharing-and-replication As described by Portland State University Library http://library.pdx.edu/digital-scholarship/data-mgmt-plans/who-requires-dmps.html The NSF DMP Should Include • [data attributes] The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project; • [metadata] The standards to be used for data and metadata format and content • [security policies] Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements, including the right to embargo data for a specified time period to allow first publication and thorough use of the data; • [use policies] Policies and provisions for fair re-use, re-distribution, and the production of derivatives; and • [preservation] Plans for archiving data, samples, and other research products, and for preservation of access to them. DMP Web Support Operated by the California Digital Library, the DMPTool is a site supporting general DMP development with some school-specific guidance RUL Data Resources – offers information about services and experts to support your data management efforts Discoverable Data • Publicly available archives such as Dryad, Dataverse, ICPSR, and more allow other researchers to easily discover and reuse data • Metadata provide standardized terminology for searching and discovering data matching defined characteristics. Unlike a data upload to a website, metadata provides a well-structured way for computer indexing of the data. • Your discipline may have well-defined metadata standards (or not). Check with your librarian if you are in doubt. • Databib and re3data are directories of research data repositories that can be used to discover possible locations for you to deposit and share your data. Citing Data • By publicly sharing your data via an established repository, you will typically get a DOI (Digital Object Identifier) or other persistent URL. This will serve as a permanent pointer to the data • You can cite your data, and others’ data, with DOI’s. This is more precise and easier for others to work with than a vague reference to the data by title, author, or even citing a paper that uses the data. • Data citation increases the impact of your research! This completes the data lifecycle. The Data Lifecycle Further References • Australian National Data Service: Data Management for Researchers • Australian National University: Data Management • CIESIN: Geospatial Electronic Records • ICPSR Guide to Social Science Data Preparation and Archiving (pdf): • Oak Ridge National Laboratory: Best Practices for Preparing Environmental Data Sets to Share and Archive • UK Data Archive: Create & Manage Data and Managing and Sharing Data: a Best Practice Guide for Researchers (pdf).