
Data Management Best Practices
Ryan Womack and Aletia Morgan
Rutgers University Libraries
Research Data Services
October 29, 2013
Why Data Management?
• Worst case scenarios
– Lost data
– Stolen data
– Incomprehensible data
– Unverifiable data
Why Data Management?
• Best case scenarios: data is…
– Robust
– Recoverable
– Reliable
– Reusable
– Reproducible
– Reputable
– Renowned
The Data Lifecycle
Understand your data
• Quantity (MB, GB, TB)
– Will affect your decisions about how to store, package, transport, and back up your data
– Large numbers of discrete files, regardless of size, may be harder to handle
– Will there be multiple versions? Frequent updates?
• Format
– Seek out open formats or at least commonly used formats (.xlsx is ok, consider .csv)
– For both preservation and reuse, dependency on a single option can be a problem, even
if it is the most convenient in the short term.
• Rights and permissions
– Reuse of other data sources (licensed or otherwise) may require investigation and
restrict data sharing options
– Confidentiality of human subjects or guarding of information for patent protection may
be reasons to restrict data
– Give appropriate credit to others’ contributions
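To make the "Quantity" point concrete, the sketch below counts the files under a project folder and totals their size. It is a minimal illustration, not part of the original slides; the function names `summarize_directory` and `human_readable` are hypothetical, and sizes use decimal units (1 MB = 1,000,000 bytes).

```python
from pathlib import Path


def summarize_directory(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    file_count = 0
    total_bytes = 0
    for path in Path(root).rglob("*"):
        if path.is_file():
            file_count += 1
            total_bytes += path.stat().st_size
    return file_count, total_bytes


def human_readable(num_bytes):
    """Format a byte count using decimal units (B, kB, MB, GB, TB)."""
    size = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        if size < 1000:
            return f"{size:.1f} {unit}"
        size /= 1000
    return f"{size:.1f} TB"
```

A call such as `summarize_directory("my_project")` (the folder name is a placeholder) gives both numbers the slide asks about: total volume, and how many discrete files have to be handled.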
Preserving your Data
What types of data need to be managed and stored over the
course of your project?
• Raw Data
• Working Data
• Processed or Final Data
• Preserved Data for Possible Re-Use
Data Preservation Considerations
• Access Control
• Versioning
• Backup
– Local
– Shared
• Campus
• Offsite
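One common way to check that a backup copy is still identical to the original is to compare checksums. The slides do not prescribe a method, so the sketch below is only an assumption about how this might be done; `sha256_of_file` and `backup_matches` are hypothetical helper names built on Python's standard `hashlib` module.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup):
    """True if the two files have identical contents (by SHA-256)."""
    return sha256_of_file(original) == sha256_of_file(backup)
```

Reading in chunks keeps memory use flat even for very large files, which matters once data reaches the GB or TB range.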
Security
Security includes both
• Physical Security
• Logical/Network/Password Security
Organizing your Data
• Who controls the data for which parts of the project?
• A logical structure and plan for directories and filenames will help during and after
data creation (e.g., the project directory has standard locations for raw data, analysis,
code, graphs, etc.).
• Use project folders with well-documented naming conventions (e.g., include the date of
data creation in the file name in a structured format; the ISO format is yyyy-mm-dd).
See the ARM example.
• A versioning scheme should be agreed on and documented.
• The goal is to have unique and understandable identifiers so that different parts of a
project can merge and be handled easily over time, even if participants and
conditions change.
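The organizing advice above can be sketched in a few lines. This is one possible convention, not the one used in the ARM example: the folder names in `STANDARD_DIRS`, the `project_description_date_vNN.ext` pattern, and the function names are all assumptions chosen for illustration.

```python
from datetime import date
from pathlib import Path

# Hypothetical standard locations, following the slide's examples.
STANDARD_DIRS = ("raw_data", "analysis", "code", "graphs")


def create_project_layout(root):
    """Create the standard subfolders under root and return their paths."""
    root = Path(root)
    for name in STANDARD_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return [root / name for name in STANDARD_DIRS]


def slug(text):
    """Lowercase the text and join its words with hyphens."""
    return "-".join(text.lower().split())


def build_filename(project, description, created, version, extension):
    """Build a name like project_description_yyyy-mm-dd_v02.csv."""
    stem = "_".join(
        [slug(project), slug(description), created.isoformat(), f"v{version:02d}"]
    )
    return f"{stem}.{extension.lstrip('.')}"
```

For instance, `build_filename("Soil Survey", "raw counts", date(2013, 10, 29), 2, "csv")` yields `soil-survey_raw-counts_2013-10-29_v02.csv`. Because the ISO date and zero-padded version sort correctly as plain text, a directory listing shows files in chronological and version order.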
Documenting your Data
• Documentation helps
– you, when you reuse the data at a later date
– others in your research group who work with the data
– the long-term reuse potential of your data
• A Codebook is a structured way of explaining the contents of your data file
• Readme files are a well-understood way of communicating information
about the contents and setup of your data
• Documentation can also take the form of a simple text or Word document
• Include any explanations of experimental methods, software code run on
the data, and any other tools needed to work with the data
• Your previous work on creating and describing a consistent organization
for your data will help here.
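As a starting point for a codebook, the sketch below reads the header of a CSV file and produces one entry per variable, with empty fields left for the researcher to fill in. The field names (`variable`, `example`, `description`, `units`) and the function name `codebook_skeleton` are assumptions for illustration; real codebooks often follow discipline-specific conventions.

```python
import csv
import io


def codebook_skeleton(csv_text):
    """Return one dict per column: its name, a sample value, and blank fields."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Use the first data row as an example; fall back to blanks if there is none.
    first_row = next(reader, [""] * len(header))
    return [
        {"variable": name, "example": example, "description": "", "units": ""}
        for name, example in zip(header, first_row)
    ]
```

The description and units are deliberately left blank: they are exactly the knowledge that exists only in the researcher's head and that documentation is meant to capture.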
Reproducible Research
• Ideally, someone else can grab your data project as a complete
bundle of data, documentation, and software code, and recreate
the analysis to get exactly the same results
• Reports and data can be integrated so that live analysis run on
actual data can be placed in reports (some R packages do this).
• Many initiatives are advancing the concept of reproducible
research.
• This high standard of evidence and validation is an assurance that
data and conclusions are not flawed (or faked).
• Good data management practices lay the groundwork for success in
reproducible research
Data Management Plans
• Many sponsored research agencies now require a
"Data Management Plan" (DMP) as a component
of any proposal
• The DMP is a formal document that outlines what
the PI will do with data during and after
completion of a funded research project
• The specific requirements for the DMP vary by
funder, and by research subject
Why The DMP Requirement?
• Ensure the preservation of important research data
• Support the potential re-use of grant-funded data by
other researchers to validate and potentially extend
the value of the data
• Improve accountability for use of public revenue to
support research
• Provide reasonable access to research data,
consistent with the Freedom of Information Act
Benefits of a Good DMP
• Improved competitiveness for grant programs; a clear
and complete Data Management Plan will support the
project’s evaluation plan
• Enhanced PI efficiency – writing a DMP encourages the
creation of a structured plan for managing research
data throughout the life of the project and beyond
• Long-term protection and preservation of RU research
data
Who’s Asking for a DMP?
• National Science Foundation http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/aag_6.jsp#VID4
• Centers for Disease Control and Prevention http://www.cdc.gov/od/foia/policies/sharing.htm
• Department of Energy http://www.cio.energy.gov/policy-guidance/federal_regulations.htm
• Department of Defense http://www.dtic.mil/whs/directives/corres/pdf/320014p.pdf
• Environmental Protection Agency http://www.epa.gov/quality/informationguidelines/documents/EPA_InfoQualityGuidelines.pdf
• Institute of Museum and Library Services http://www.imls.gov/applicants/forms/DigitalProducts.pdf
• NASA http://nasascience.nasa.gov/earth-science/earth-science-data-centers/data-and-information-policy
• National Endowment for the Humanities http://www.neh.gov/grants/guidelines/pdf/DataManagementPlans.pdf
• National Institute of Justice http://www.ojp.usdoj.gov/nij/funding/data-resources-program/welcome.htm
• National Institute of Standards and Technology http://www.nist.gov/director/quality_standards.htm
• United States Department of Agriculture http://www.csrees.usda.gov/
• United States Department of Education http://ies.ed.gov/funding/datasharing_policy.asp
Many academic journals have also begun to require data sharing as part of the submission process. For a
partial list of journals with data sharing mandates for their published articles, see the following:
http://oad.simmons.edu/oadwiki/Journal_open-data_policies
http://gking.harvard.edu/pages/datasharing-and-replication
As described by Portland State University Library:
http://library.pdx.edu/digital-scholarship/data-mgmt-plans/who-requires-dmps.html
The NSF DMP Should Include
• [data attributes] The types of data, samples, physical collections, software,
curriculum materials, and other materials to be produced in the course of the
project;
• [metadata] The standards to be used for data and metadata format and content;
• [security policies] Policies for access and sharing including provisions for
appropriate protection of privacy, confidentiality, security, intellectual property, or
other rights or requirements, including the right to embargo data for a specified
time period to allow first publication and thorough use of the data;
• [use policies] Policies and provisions for fair re-use, re-distribution, and the
production of derivatives; and
• [preservation] Plans for archiving data, samples, and other research products, and
for preservation of access to them.
DMP Web Support
• DMPTool: operated by the California Digital Library, a site supporting
general DMP development with some school-specific guidance
• RUL Data Resources: offers information about services and experts to
support your data management efforts
Discoverable Data
• Publicly available archives such as Dryad, Dataverse, ICPSR,
and more allow other researchers to easily discover and
reuse data
• Metadata provide standardized terminology for searching
and discovering data matching defined characteristics.
Unlike a plain data upload to a website, metadata give
computers a well-structured way to index the data.
• Your discipline may have well-defined metadata standards
(or not). Check with your librarian if you are in doubt.
• Databib and re3data are directories of research data
repositories that can be used to discover possible locations
for you to deposit and share your data.
Citing Data
• By publicly sharing your data via an established
repository, you will typically get a DOI (Digital Object
Identifier) or other persistent URL. This will serve as a
permanent pointer to the data
• You can cite your data, and others’ data, with DOIs.
This is more precise and easier for others to work with
than a vague reference to the data by title or author, or
even a citation of a paper that uses the data.
• Data citation increases the impact of your research!
This completes the data lifecycle.
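A DOI becomes a clickable, permanent link when prefixed with the `https://doi.org/` resolver. The sketch below normalizes a DOI and assembles a simple data citation. The citation layout is only loosely modeled on common author-year styles and is an assumption; check the format your repository or journal requires. The function names and the DOI `10.1234/example` in the usage note are hypothetical placeholders.

```python
def doi_url(doi):
    """Turn a DOI in any common form into a https://doi.org/ URL."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return "https://doi.org/" + doi


def format_data_citation(authors, year, title, repository, doi):
    """Assemble a one-line citation for a published data set."""
    return f"{authors} ({year}). {title} [Data set]. {repository}. {doi_url(doi)}"
```

For example, `doi_url("doi:10.1234/example")` returns `https://doi.org/10.1234/example`, so the same identifier resolves no matter how it was originally written down.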
The Data Lifecycle
Further References
• Australian National Data Service: Data Management for
Researchers
• Australian National University: Data Management
• CIESIN: Geospatial Electronic Records
• ICPSR Guide to Social Science Data Preparation and
Archiving (pdf)
• Oak Ridge National Laboratory: Best Practices for
Preparing Environmental Data Sets to Share and
Archive
• UK Data Archive: Create & Manage Data and Managing
and Sharing Data: a Best Practice Guide for
Researchers (pdf).