Data Management Plans: What You Need to Know

21 April 2015
NIH + NSF Data Sharing Policies
What is a Data Management Plan
Accountability
Data Products, Format, and Metadata
Storage, Sharing, and Metadata
Budgeting for Data Management
Resources
 Applies to the sharing of final research data* for research purposes.
 Applies to basic research, clinical studies, surveys, and other types of research
supported by NIH and to research that involves human subjects and laboratory
research that does not involve human subjects.
 Applies to applicants seeking $500,000 or more in direct costs in any year of the
proposed project period through grants, cooperative agreements, or contracts.
 Applies to research applications submitted beginning October 1, 2003.
*Final Research Data - Recorded factual material commonly accepted in the scientific community as necessary
to document and support research findings. This does not mean summary statistics or tables. It means the data on
which summary statistics and tables are based. For the purposes of this policy, final research data do not include
laboratory notebooks, partial datasets, preliminary analyses, drafts of scientific papers, plans for future research,
peer review reports, communications with colleagues, or physical objects, such as gels or laboratory specimens.
 “Investigators are expected to share with other researchers, at no more than
incremental cost and within a reasonable time, the primary data, samples, physical
collections and other supporting materials created or gathered in the course of
work under NSF grants.”
 “Grantees are expected to encourage and facilitate such sharing.”
 Data refers to any information that can be stored in digital form, including text,
numbers, images, video or movies, audio, software, algorithms, equations,
animations, models, simulations, etc. Such data may be generated by various
means including observation, computation, or experiment.
 Applies to research applications submitted on or after January 18, 2011.

Research data collections
 Products of one or a few focused research projects
Resource or community data collections
 Serve a specific research community
 Typically fall between research and reference data collections in size, scale, funding,
community of users, and duration
 Conform to community standards
Reference data collections
 Serve large segments of the research and education communities
 Conform to robust and comprehensive standards
 An opportunity for PIs to articulate how they will conform to the FEDERAL data
sharing policy for research results.
 The DMP is reviewed as an integral part of the proposal, coming under ‘Intellectual
Merit’ or ‘Broader Impacts’ or both, as appropriate for the scientific community of
relevance.
 Data management requirements and plans may vary across specific
Directorates, Offices, Divisions, Programs, or other NSF/NIH units.
 The types of data, samples, physical collections, software, curriculum
materials, publications, and other materials to be produced in the
course of the project;
 The standards to be used for data and metadata format and content
(where existing standards are absent or deemed inadequate, this
should be documented along with any proposed solutions or
remedies);
 Policies for access and sharing including provisions for appropriate
protection of privacy, confidentiality, security, intellectual property, or
other rights or requirements;
 Policies and provisions for re-use, re-distribution, and the production
of derivatives; and
 Plans for archiving data, samples, and other research products, and for
preservation of access to them.
Element | Description | ? | NSF Mapping
Data description | A description of the information to be gathered; the nature and scale of the data that will be generated or collected. | Yes | Expected Data
Existing data | A survey of existing data relevant to the project and a discussion of whether and how these data will be integrated. | Yes | Expected Data
Format | Formats in which the data will be generated, maintained, and made available, including a justification for the procedural and archival appropriateness of those formats. | Yes | Data Format and Dissemination
Metadata | A description of the metadata to be provided along with the generated data, and a discussion of the metadata standards used. | Yes | Data Format and Dissemination
Storage and backup | Storage methods and backup procedures for the data, including the physical and cyber resources and facilities that will be used for the effective preservation and storage of the research data. | Yes | Data Storage and Preservation of Access
Security | A description of technical and procedural protections for information, including confidential information, and how permissions, restrictions, and embargoes will be enforced. | Yes | Data Format and Dissemination
Responsibility | Names of the individuals responsible for data management in the research project. | Yes | Roles and Responsibility
Intellectual property rights | Entities or persons who will hold the intellectual property rights to the data, and how IP will be protected if necessary. Any copyright constraints (e.g., copyrighted data collection instruments) should be noted. | Yes | Data Format and Dissemination
Access and sharing | A description of how data will be shared, including access procedures, embargo periods, technical mechanisms for dissemination and whether access will be open or granted only to specific user groups. A timeframe for data sharing and publishing should also be provided. | Yes | Data Storage and Preservation of Access
Audience | The potential secondary users of the data. | Yes | Data Format and Dissemination
Selection and retention periods | A description of how data will be selected for archiving, how long the data will be held, and plans for eventual transition or termination of the data collection in the future. | Yes | Period of Data Retention
Archiving and preservation | The procedures in place or envisioned for long-term archiving and preservation of the data, including succession plans for the data should the expected archiving entity go out of existence. | Yes | Data Storage and Preservation of Access
Ethics and privacy | A discussion of how informed consent will be handled and how privacy will be protected, including any exceptional arrangements that might be needed to protect participant confidentiality, and other ethical issues that may arise. | Yes | Data Format and Dissemination
Budget | The costs of preparing data and documentation for archiving and how these costs will be paid. Requests for funding may be included. | |
Data organization | How the data will be managed during the project, with information about version control, naming conventions, etc. | |
Quality Assurance | Procedures for ensuring data quality during the project. | |
Legal requirements | A listing of all relevant federal or funder requirements for data management and data sharing. | |
 Explains how the responsibilities regarding the management of your data will be
delegated.
 Time allocations
 Project management of technical aspects
 Training requirements
 Contributions of non-project staff - individuals should be named where possible (custodians of
the repository/archive where you choose to store your data)
 Outlines the staff/organizational roles and responsibilities for implementing this
data management plan.
 Who will be responsible for data management and for monitoring the data management
plan?
 How will adherence to this data management plan be checked or demonstrated?
 What process is in place for transferring responsibility for the data?
 Who will have responsibility over time for decisions about the data once the original
personnel are no longer available?
 Is the data regulated by policy or law?
 Are there legal constraints (e.g., HIPAA) on sharing data?
 How will you handle informed consent with respect to communicating to respondents
that the information they provide will remain confidential when data are shared or
made available for secondary analysis?
 Determine constraints such as classified data, specific handling requirements, or IRB/human
subjects research
 If yes, how will you comply with these constraints?
 Write your compliance plan point by point
 If applicable, how will you manage disclosure risk in the data to be shared and
archived?
 Are there intellectual property rights (e.g., patent, copyright) on the datasets?
 Determine restrictions and conditions to share and disseminate
 Does someone else own the data? What are their conditions for use, sharing, and dissemination?
 Determine DMPs as established by any international research consortia or set forth
in formal science and technology agreements signed by the United States
Government and foreign counterparts.
 This should be addressed with any international research partners when first
planning a collaboration.
 Talk to the Program Officer for additional assistance.
 Inputs and outputs (existing, intermediary, and final datasets)
 Existing data and sources you are using (Digital and physical collections)
 Quantitative Social and Economic Data Sets
 Numeric data sets, geospatial data, spatio-temporal data
 Qualitative Information
 Microfilms, historical documents, oral interviews, video tapes, hand written records,
transcripts, tables, figures, flowcharts, 3D models, digital audio
 Experimental Research
 Tabulated data
 Mathematical and Computer Models
 May include descriptions in published articles or fully documented and robust versions of
these models
 Determine formats and estimated size, and if it will be shared
 Formats: RTF text, MS Excel converted to CSV, MATLAB, PNG (images), WAV audio, MPEG
video, shapefile, as well as any instrument-specific formats or software
 Size/amount: Rate produced, e.g., 1 TB/year, 50 GB/experiment
 Metadata should be machine readable for better re-usability and processing.
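As one way to picture "machine readable" metadata, the sketch below stores a small descriptive record as JSON, which other software can parse without human interpretation. The field names and values are hypothetical and are not drawn from any particular metadata standard; a real project would follow the standard chosen for its discipline.

```python
import json

# Hypothetical metadata record: field names and values are illustrative
# assumptions, not taken from any specific metadata standard.
record = {
    "title": "Example survey dataset",
    "creator": "Example Research Group",
    "format": "CSV",
    "size_estimate": "50 GB/experiment",
    "version": "1.0",
    "access": "release upon request",
}


def to_machine_readable(metadata: dict) -> str:
    """Serialize a metadata record as JSON with a stable key order."""
    return json.dumps(metadata, indent=2, sort_keys=True)


def from_machine_readable(text: str) -> dict:
    """Parse a JSON metadata record back into a dictionary."""
    return json.loads(text)


if __name__ == "__main__":
    print(to_machine_readable(record))
```

Because the record round-trips through a standard parser, a repository or script can read the format, version, and access terms automatically instead of relying on free-text notes.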
HINT: Sketching a diagram of the data workflow
helps to identify datasets and issues regarding their
management.
 Give a short description of what "data" will mean in your research
 What data will be generated in the research?
 What data types will you be creating or capturing?
 How will you capture or create the data?
 If you will be using existing data, state that fact and include where you got it.
 What is the relationship between the data you are collecting and the existing data?
 What data will be preserved and shared?
 “Data about data”
 Typical functions
 Discovery tool
 Rights management
 Version identification
 Certify authenticity
 Status indicator
 Defines content structure
 Interoperability
 Situates geospatially
 Process descriptions
 Access and transfer
[Figure: metadata diagram; recoverable labels: Objectives, Domains, Architecture, Principles, Discipline, Genre, Format, Structure, Extent, Granularity]
 What details (metadata) are necessary for others to use your data?
 List standards for formats or metadata for your datasets.
 Document why you selected them
 Describe the method by which metadata will be generated.
 Document naming conventions/schema for your data.
 List the data dictionaries/taxonomies/ontologies you will use for your data.
 Describe how you will track versions of the datasets.
 List and describe the tools that are necessary to use the datasets.
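The naming-convention and version-tracking items above can be made concrete with a small helper. This is a minimal sketch under an assumed convention, `<project>_<dataset>_<YYYYMMDD>_v<NN>.<ext>`; the pattern and the function names are hypothetical and only show how a documented convention can be generated and checked automatically.

```python
import re
from datetime import date, datetime

# Assumed (hypothetical) convention: <project>_<dataset>_<YYYYMMDD>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<dataset>[a-z0-9]+)_"
    r"(?P<date>\d{8})_v(?P<version>\d{2,})\.(?P<ext>[a-z0-9]+)$"
)


def build_name(project: str, dataset: str, created: date, version: int, ext: str) -> str:
    """Build a file name that follows the assumed convention."""
    return f"{project}_{dataset}_{created:%Y%m%d}_v{version:02d}.{ext}"


def parse_name(filename: str):
    """Return the name components, or None if the name breaks the convention."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None


def next_version(filename: str) -> str:
    """Return the file name for the next version of the same dataset."""
    parts = parse_name(filename)
    if parts is None:
        raise ValueError(f"not a conforming file name: {filename!r}")
    created = datetime.strptime(parts["date"], "%Y%m%d").date()
    return build_name(parts["project"], parts["dataset"], created,
                      int(parts["version"]) + 1, parts["ext"])


if __name__ == "__main__":
    name = build_name("survey", "wave1", date(2015, 4, 21), 1, "csv")
    print(name)
    print(next_version(name))
```

Writing the convention as a pattern means the DMP can state it once and project staff can check files against it, rather than relying on memory.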
 OAIS, Open Archival Information System
 CSDGM, Content Standard for Digital Geospatial Metadata
 ICPSR, Inter-university Consortium for Political and Social Research
 DDI, Data Documentation Initiative**
 best practices: data life cycle and longitudinal data
 SDMX, Statistical Data and Metadata eXchange
 XML, Extensible Markup Language
 Citation is the preferred form of acknowledgement
 Should include a DOI to establish the authoritative data source, or a PURL (Persistent
Uniform Resource Locator)
 Citation: Involuntary Commitment Data, public use dataset [restricted use data, if
appropriate]. Produced and distributed by the PSRDC, College of Behavioral and
Community Sciences, University of South Florida (year data were
downloaded). URL
Acknowledgement: The collection of data used in this study was partly supported
by the National Institutes of Health under grant number R01 HD069609 and the
National Science Foundation under award number 1157698.
 Document which of the digital or non-digital datasets listed will NOT be stored or
retained during the project.
 Document the type of media and the location(s) where the data will be stored and
who is responsible.
 Document how and where the data will be backed up and who is responsible.
 Document any access controls for data and/or data transfers that need to be
secured and how these controls will be applied.
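When documenting storage and backup, it can help to translate a production rate such as 1 TB/year into the capacity that must be provisioned. The sketch below is illustrative arithmetic only: it assumes a constant production rate and that every backup is a full copy, and the function name and the three-copy figure are assumptions rather than requirements of any funder.

```python
from typing import List


def cumulative_storage_tb(rate_tb_per_year: float, years: int, copies: int = 1) -> List[float]:
    """Return the storage needed at the end of each project year, in TB.

    Assumes data accumulate at a constant rate and that `copies` full
    copies are kept (the primary copy plus any backups).
    """
    if years < 0 or copies < 1:
        raise ValueError("years must be >= 0 and copies must be >= 1")
    return [rate_tb_per_year * year * copies for year in range(1, years + 1)]


if __name__ == "__main__":
    # Hypothetical project: 1 TB/year for 3 years, primary copy plus two backups.
    for year, total in enumerate(cumulative_storage_tb(1.0, 3, copies=3), start=1):
        print(f"End of year {year}: {total} TB")
```

The end-of-project figure is the one most relevant to the budget, since it sets the minimum capacity that has to be paid for and maintained.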
 Indicate which datasets used or generated will be shared
 Indicate which datasets are in proprietary formats and whether they will be converted to a
non-proprietary format for sharing.
 Determine the audience who will use the datasets.
 Determine acknowledgement protocol
 Determine sharing protocols: open access or release upon request.
 Account for any delay in the accessibility of your data after your research is done.
 Explain details of any embargo periods.
 Determine how long data will be kept beyond the life of the project
 Will a third-party service be used to archive or release data?
 Set a release date to share the data.
 Describe any restrictions on use, sharing, repurposing, etc. of datasets
 Include costs of any additional resources (3rd party services, etc.) in budget.
 Under the auspices of the PI
 Data archive: A place where machine-readable data are acquired, manipulated,
documented, and finally distributed to the scientific community for further
analysis.
 Data enclave: A controlled, secure environment in which eligible researchers can
perform analyses using restricted data* resources.
 Mixed mode sharing.
*Restricted Data - datasets that cannot be
distributed to the general public, because of, for
example, participant confidentiality concerns, third-party licensing or use agreements, or national
security considerations.
 Builds upon storage by taking additional steps toward preserving digital files.
 Safeguards data against file corruption and storage media failure.
 Includes updating from obsolete formats.
 Often includes enhanced discovery and access of datasets.
 Includes a preservation strategy and disaster recovery plan.
 Often handled by a third-party archiving service or data repository.
 Check university guidelines.
 Include deposit fees in budget.
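One common way to detect the file corruption mentioned above is to record a checksum for every file and compare against it later. The following is a minimal sketch using SHA-256 from Python's standard library; the function names and the dictionary manifest are illustrative assumptions, and a repository or archiving service may use its own tools and formats.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root) -> dict:
    """Map each file under `root` (by relative path) to its checksum."""
    root = Path(root)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root, manifest: dict) -> list:
    """List files from the manifest that are now missing or have different content."""
    current = build_manifest(root)
    return sorted(name for name in manifest if current.get(name) != manifest[name])
```

A manifest built at deposit time and re-checked periodically shows whether any stored file has changed or gone missing; it does not repair the file, so it complements backups rather than replacing them.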
Example 1
The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with
Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with
the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical
characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from
which we are recruiting subjects. Therefore, we are not planning to share the data.
Example 2
The proposed research will include data from approximately 500 subjects being screened for three bacterial sexually transmitted diseases
(STDs) at an inner city STD clinic. The final dataset will include self-reported demographic and behavioral data from interviews with the
subjects and laboratory data from urine specimens provided. Because the STDs being studied are reportable diseases, we will be collecting
identifying information. Even though the final dataset will be stripped of identifiers prior to release for sharing, we believe that there
remains the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and associated
documentation available to users only under a data-sharing agreement that provides for: (1) a commitment to using the data only for
research purposes and not to identify any individual participant; (2) a commitment to securing the data using appropriate computer
technology; and (3) a commitment to destroying or returning the data after analyses are completed.
Example 3
This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years.
Data products from this study will be made available without cost to researchers and analysts at https://ssl.isr.umich.edu/hrs/. User
registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use
governing access to the public release data, including restrictions against attempting to identify study participants, destruction
http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm
 It is acceptable to state in the DMP that the project is not anticipated to generate
data or samples that require management and/or sharing.
 PIs should note that the statement will be subject to peer review.
 If data you generate is owned by your institution, the data access plan must address
the institutional strategy for providing access to relevant data and supporting
materials.
 Open-access publishing is not addressed in the implementation of the data
management plan requirement.
Expenses
 Documenting
 Preparing
 Publishing
 Disseminating
 Sharing research findings and supporting material
 Data sharing and archiving
Types of Activities Covered
• Reports
• Reprints
• Page charges or other journal costs
• Does not cover costs for prior or early publication
• Illustrations
• Cleanup
• Storage and indexing of data and databases
• Documentation
• Development, documentation and debugging of software
• Storage, preservation, documentation, indexing, etc., of
physical specimens, collections or fabricated items.
NOTE: If the data have been collected already, a competitive or
administrative supplement may be available.
 DMPTool (Argonne Laboratories)
 NIH
 Data Sharing Policy and Implementation Guidance
 8.2 Availability of Research Results
 NSF
 NSF Data Sharing Policy
 NSF Data Management Plan Requirements
 NSF Social, Behavioral and Economic (SBE) Directorate-wide Guidance
 ICPSR
 Effective Data Management
 Databib
 Registry of Research Data Repositories
 DataONE
 Best Practices