Data Management Issues for Postdocs

Introduction to Data Management
Data Management
• Overview of research data
– Joel Roselin, Office of Research Compliance and
Training
• Data Storage and Retention
– Danianne Mizzy, Engineering Librarian
• Data Sharing
– Kathryn Pope, Center for Digital Research and
Scholarship
2
Goals of research
• The primary goals of research are:
– To advance knowledge
– To improve life for people (or animals)
• Secondary goals of research:
– Career advancement
– Professional recognition
– Financial gain
3
When you conduct research…
• …You are entrusted with:
– Human subjects
– Animals
– Access to specialized materials and technology
• Chemicals
• Drugs
• Machinery
• Information (personal or confidential)
– Funding from government or industry
4
When you conduct research…
• Not everyone is granted the privilege to conduct
research:
– Qualifications include:
• Advanced degree (or enrolled in a degree program)
• Position in a research institution
– Promise to:
• Be responsible in the conduct of the research
• Be responsible stewards of the research dollars and other
resources
• Share the results of the research for the good of society
5
When you conduct research…
• The privilege can be revoked for failing to fulfill
professional responsibilities:
– Loss of funding
– Debarment
– Loss of position
6
What are data?
• What counts as data in your field?
7
What are data?
• What counts as data in your field?
– Subject data (humans or animals)
• Blood cell counts
• Observational
• Survey responses
– Lab data
• Test results
• Assays
– Other data
• Library information
• Photographs
8
What are data?
True or False
In scientific research, only the information and
observations that are made as part of scientific
inquiry are considered data.
9
It’s ALL data
• False!
• Data are not only the information and
observations made as part of scientific inquiry
but also the materials, the means, and the
products of that inquiry (sometimes called data
sources).
• Examples:
– Cell lines
– Survey instruments
– Associated software
– Specimens
10
Everything is Data
Everything is data and
data is everything!
11
Sensitive Data
• Some data are highly sensitive
– Protected Health Information (PHI), including insurance information
– Personal information such as Social Security numbers, financial data
• Inappropriate release of sensitive information can lead to
harms:
– Privacy violations
– Identity theft
– Financial liability for the University
• Sensitive information is highly regulated and requires security,
e.g. encryption
• University resources:
– HIPAA website
– IRB Website
– Policy on Electronic Data Security Breach Reporting and Response
12
13
Takeaways
• Everything is data and data is everything!
• The PI has stewardship (control) of a
project's data, with regard to publication and
copyright.
14
Data Management & Retention
Danianne Mizzy
Engineering Librarian
15
Data Management & Retention
• Funder requirements
– Minimum or maximum?
– Even if a funder does not require it, you still need
to consider and address long-term access
• Columbia Data Retention Policy
– Research data must be archived for a minimum of three
years after the final project close-out, with original data
retained wherever possible.
16
Relevant Policies
• CU Policies & Procedures
– Administrative Code of Conduct
– Statement of Ethical Conduct
– Faculty Handbook
– Sponsored Projects Handbook
– Clinical Research Handbook
– Electronic Information Resources
Security
• Funder Requirements
17
Agency Retention Periods
• HIPAA – At least 6 years
• NIH – 3 years
• NSF – What constitutes reasonable procedures
will be determined by the community of
interest through the process of peer review
and program management.
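As a rough illustration of how these periods combine, the sketch below takes the longest fixed period that applies to a project and adds it to the close-out date. The function name, the dictionary, and the choice to count every period from the close-out date are assumptions made for this example. NSF is left out because it states no fixed number of years, and the Columbia figure is the three-year minimum from the retention policy slide. Confirm the actual rule with the funder and the University before disposing of anything.

```python
from datetime import date

# Minimum retention periods in years, as listed on the slides.
# NSF is omitted because it does not state a fixed period.
RETENTION_YEARS = {"HIPAA": 6, "NIH": 3, "Columbia": 3}


def earliest_disposal_date(close_out, rules):
    """Return the close-out date shifted by the longest applicable period.

    Assumption for this sketch: every period is counted from close-out.
    """
    years = max(RETENTION_YEARS[rule] for rule in rules)
    try:
        return close_out.replace(year=close_out.year + years)
    except ValueError:
        # Close-out fell on 29 February and the target year is not a leap year.
        return close_out.replace(year=close_out.year + years, day=28)


if __name__ == "__main__":
    # A project subject to both NIH and HIPAA rules keeps data for 6 years.
    print(earliest_disposal_date(date(2012, 6, 30), ["NIH", "HIPAA"]))
```

Taking the maximum reflects the "minimum or maximum?" question on the earlier slide: each listed period is a floor, so the longest one governs.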
18
Data Storage Planning
• Need to plan for entire life-cycle
• Establish a baseline and project the rate of
growth for the duration of the project.
• Active
– Frequent additions & updates
• Archival
– In fixed form - only need periodic access
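One way to "establish a baseline and project the rate of growth" is a quick back-of-the-envelope calculation. The two functions below are a minimal sketch, not a prescribed method: one assumes a constant amount of new data each month, the other assumes the data set grows by a fixed fraction each month. The function names and the example figures are illustrative assumptions.

```python
def project_storage_gb(baseline_gb, monthly_growth_gb, months):
    """Linear projection: baseline plus a constant amount added each month."""
    return baseline_gb + monthly_growth_gb * months


def project_storage_compound_gb(baseline_gb, monthly_growth_rate, months):
    """Compound projection: the data set grows by a fixed fraction each month."""
    return baseline_gb * (1 + monthly_growth_rate) ** months


if __name__ == "__main__":
    # Hypothetical project: 50 GB today, 10 GB of new data per month, 3 years.
    print(project_storage_gb(50, 10, 36))
    # Same baseline, but growing 5% per month instead.
    print(round(project_storage_compound_gb(50, 0.05, 36), 1))
```

Running both models gives a low and a high estimate, which is usually enough to decide between the active and archival storage options.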
19
Data Storage Considerations
• Size
• Retention period
• Privacy or security requirements?
• Sharing?
20
Data Storage Options at CU
Active (Working) Storage
CUIT
– 500 MB personal critical data
– Workgroup Space on Central
• $400 per gigabyte per year, with a minimum
of half a gigabyte (500 MB)
– Research Computing Services
• High Performance Cluster
• For more information contact
rcs@columbia.edu
School & Departmental servers
21
Data Storage Options at CU
Active Storage
Library
Center for Digital Research &
Scholarship (CDRS)
– Consultation available
22
Data Storage Options at CU
Archival Storage
•Library – Academic Commons
23
Data Management Planning
• What file formats? Are they long-lived?
– Long-lived
– Non-proprietary
• Storage and backup strategy?
– Media – CDs and DVDs not long-lived
• What project and data identifiers will be assigned?
• Naming conventions, file/directory structure
• Version Control
• Is there a metadata scheme or other community
standard for data sharing/integration?
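A naming convention is easier to follow when it is written down as a pattern that can be checked automatically. The sketch below shows one possible convention (project identifier, date, short description, two-digit version); the pattern, the function names, and the example file names are all hypothetical and should be replaced by whatever the research group agrees on.

```python
import re
from datetime import date

# Hypothetical convention: <project>_<YYYYMMDD>_<description>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<date>\d{8})_(?P<description>[a-z0-9-]+)"
    r"_v(?P<version>\d{2})\.(?P<ext>[a-z0-9]+)$"
)


def build_name(project, day, description, version, ext):
    """Compose a file name that follows the convention above."""
    return f"{project}_{day:%Y%m%d}_{description}_v{version:02d}.{ext}"


def follows_convention(name):
    """Return True if the file name matches the agreed pattern."""
    return NAME_PATTERN.match(name) is not None


if __name__ == "__main__":
    name = build_name("soilph", date(2012, 3, 15), "site-a-raw", 1, "csv")
    print(name, follows_convention(name))
    print(follows_convention("final data (2).xlsx"))
```

Putting the date in year-month-day order makes files sort chronologically, and an explicit version field avoids names such as "final_final2".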
24
CU Security Policy
• Individuals who access or control University
electronic information resources must take
appropriate and necessary measures to ensure
the security, integrity, and protection of these
resources, using appropriate physical and
logical security measures.
25
Data Security and Data Integrity
• Unencrypted vs. Encrypted
– Keep passwords & keys on paper in a secure location
and in an encrypted digital file
• Uncompressed vs. Compressed
26
Security - Physical
• Restrict access to computers, offices and
storage media
• Store lab notebooks, samples in locked
cabinets
• Only let trusted individuals troubleshoot
computer problems
• Appropriate environmental controls
27
Security - Network
• Keep confidential and sensitive data on
computers not connected to the Internet
• Keep virus protection up to date
• Don't send confidential data via e-mail or FTP
(use encryption if you must)
• Use passwords on files and computers
• Plan for data disposition at the end of the retention period
28
Security – CU Encryption Options
CUIT
• BitLocker for removable storage devices
• Can purchase Guardian Hard Disk Encryption
through CUIT
• Windows Encrypting File System (native)
• Apple – FileVault (native)
• WinZip/7-Zip/TrueCrypt
• Savant Application Whitelist software
29
Back-ups
• Make 3 copies
– Original
– External/local
– External/remote – different geographic area
• Verify recovery is possible
– Checksum validation
– Test file restore after initial set-up
– Periodically thereafter
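Checksum validation can be done with nothing more than the Python standard library. The sketch below computes a SHA-256 checksum for every file under the original folder and the backup folder and reports any file that is missing or different in the backup. The function names and the choice of SHA-256 are assumptions for this example; any stable hash and any manifest layout would serve the same purpose.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 checksum of a file, read in chunks to limit memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) onto its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(original_root, copy_root):
    """Return relative paths that are missing or differ in the copy."""
    original = build_manifest(original_root)
    copy = build_manifest(copy_root)
    return sorted(
        name for name, checksum in original.items() if copy.get(name) != checksum
    )
```

An empty list from `verify_copy` means every original file has an identical counterpart in the backup. Running the check after the initial set-up and then on a schedule covers the "periodically thereafter" step.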
30
Data Back-up Options
• Hard Drive
• Tape Back-up
• Server
• Cloud Storage
– Amazon S3
– Subject Repository/ Data Centers
• (PubChem, Dryad, IRI/LDEO)
31
Metadata
Structured information that describes, explains,
locates, and otherwise makes it easier to retrieve
and use an information resource.
3 main types:
Descriptive
Administrative
Structural
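To make the three types concrete, the sketch below groups a metadata record into descriptive, administrative, and structural sections and checks that none is empty. Every field name and value is a made-up placeholder rather than part of any standard; a real project would follow one of the community schemes for its discipline.

```python
import json

# Every value below is a made-up placeholder for illustration only.
record = {
    "descriptive": {        # what the data set is about
        "title": "Example soil pH survey",
        "creator": "A. Researcher",
        "keywords": ["soil", "pH"],
    },
    "administrative": {     # how it was created and may be used
        "created": "2012-03-15",
        "file_format": "text/csv",
        "rights": "Contact the PI before reuse",
    },
    "structural": {         # how the pieces fit together
        "files": ["readings.csv", "codebook.txt"],
        "described_by": {"readings.csv": "codebook.txt"},
    },
}

SECTIONS = ("descriptive", "administrative", "structural")


def missing_sections(candidate):
    """Return the names of metadata sections that are absent or empty."""
    return [name for name in SECTIONS if not candidate.get(name)]


if __name__ == "__main__":
    print(json.dumps(record, indent=2))
    print(missing_sections(record))
```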
32
Major Research Metadata Standards
• Darwin Core (Biology)
• DDI (Data Documentation Initiative, for data
sets in social and behavioral sciences)
• DIF (Directory Interchange Format for scientific
data sets)
• EML (Ecological Metadata Language)
• FGDC/CSDGM (geographic data)
• National Biological Information Infrastructure
(NBII)
33
Other DMP elements
• Who in the research group will be responsible
for data management?
• Are there tools or software needed to
create/process/visualize the data?
34
Writing Data Management Plans
• Follow CU and funder policies and guidelines
• Can use CUL template as starting point
• Visit SCP web site for further information
http://scholcomm.columbia.edu/
35
Data Management Plans - NSF
1. TYPES of data, samples, physical collections, software,
curriculum materials, and other materials to be
produced in the course of the project
2. STANDARDS to be used for data and metadata format
and content (where existing standards are absent or
deemed inadequate, this should be documented along
with any proposed solutions or remedies)
3. ACCESS and sharing policies including provisions for
appropriate protection of privacy, confidentiality,
security, intellectual property, or other rights or
requirements
4. Policies and provisions for RE-USE, re-distribution, and
the production of derivatives
5. Plans for ARCHIVING data, samples, and other research
products, and for preservation of access to them
6. OR justification why no plan is needed
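A draft plan can be checked against this list before submission. The sketch below is one hypothetical way to do that: the draft is held as a dictionary keyed by the five element names, and a non-empty justification stands in for item 6. The keys, the function name, and the sample text are assumptions for illustration and are not an NSF format.

```python
# Short keys for the five elements on the slide (names chosen for this sketch).
NSF_ELEMENTS = ("types", "standards", "access", "reuse", "archiving")


def unaddressed_elements(plan, justification=""):
    """Return elements with no text; a justification (item 6) waives them all."""
    if justification.strip():
        return []
    return [name for name in NSF_ELEMENTS if not plan.get(name, "").strip()]


if __name__ == "__main__":
    draft = {
        "types": "Field measurements stored as CSV files",
        "standards": "",
        "access": "Released through the institutional repository",
    }
    print(unaddressed_elements(draft))
```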
36
Data Sharing Plan - NIH
1. Expected schedule for data sharing
2. Format of the final dataset
3. Documentation to be provided
4. Whether or not any analytic tools
will be provided
5. Whether or not a data-sharing
agreement will be required and, if
so, a brief description of such an
agreement
6. Mode of data sharing
37
Takeaways
• Create a plan to manage your research data
before the project begins
• Follow the plan
• At the end of the project, securely archive data
of long-term value
• Properly dispose of obsolete or sensitive data
• Guidance available from OVPR and Scholarly
Communications Program
38
Sharing your data
Emerging practices
39
Why isn’t data sharing the norm?
• not common in many disciplines
• not recognized in promotion/tenure
• researcher gives up control of data
• worries about being scooped or
misinterpreted
• time required to present data in usable
format
• lack of infrastructure and standards
40
Sharing increasingly seen as valuable
“More and more often these days,
a research project's success is
measured not just by the
publications it produces, but also
by the data it makes available to
the wider community.”
- Nature editorial 9.10.09
“It is obvious that making data
widely available is an essential
element of scientific research.”
- Science editorial 2.11.11
41
New need for openness
“Science has always been about open
debate. But incidents such as the UEA email
leaks have prompted the Royal Society to
look at how open science really is. With the
advent of the Internet, the public now
expect a greater degree of transparency.
The impact of science on people’s lives, and
the implications of scientific assessments for
society and the economy are now so great
that people won’t just believe scientists
when they say “trust me, I’m an expert.” …
Science has to adapt.”
- Geoffrey Boulton, chair Royal Society working
group for study: Science as a public enterprise:
opening up scientific information, 5.13.11
42
Sharing advances science
Sharing can help produce significant
advances in research, as these projects
have demonstrated.
• Sloan Digital Sky Survey
• Human Genome Project
• NIH-funded Alzheimer’s study published in April 2011
43
Sharing benefits researchers
Rewards of sharing may include:
• opportunities to do innovative research
• research with higher impact
• support for transparency in research
• recognition, reciprocity from colleagues
• more opportunities to preserve data
44
You may have to share
More funders are requiring it
The National Science Foundation now
asks researchers requesting funding to
show how they will share data.
• Grant applications must include a
two-page data management plan.
• Data management and access plans
will be evaluated “through the
process of peer review and program
management.”
45
You may have to share
More journals are requiring it
“…authors are required to make materials,
data and associated protocols promptly
available to readers….Nature journals reserve
the right to refuse publication in cases where
authors do not provide adequate assurances
that they can comply...”
46
What do you share?
NSF says data covered by its
data management and sharing
requirements will “be
determined by the community
of interest.”
This “may include, but is not
limited to: data, publications,
samples, physical collections,
software and models.”
47
Some data are not shareable
Be aware of reasons you may NOT
want to share your data:
• Data must be scrubbed of
confidential information before
sharing.
• You may be able to justify not
sharing if your data includes
proprietary licenses or patentable
items, is useful for further analyses,
etc.
48
How and when do you share?
“How” depends on…
• the format of your data
• funder and publisher requirements
• any restrictions on your data
“When” depends on…
• customary embargo periods
• whether relevant guidelines specify a time
within which data must be shared
49
Guidelines from the NSF
Division of Earth Sciences (EAR)
Data should be provided at lowest possible cost.
Data may be made available via
• a national data center
• a widely available journal, book, or website
• institutional archives that are standard for the discipline
• other EAR-specified repositories.
Data should be made available as soon as
possible, but no later than two years after
collection.
50
Online repositories
Repositories are:
• organized around institutions or subjects
• often open access
• archival, not active, storage for digital data
• may offer:
– long-term preservation and access
– search engine optimization
– permanent URL or DOI
51
Columbia’s repository
Academic Commons (AC) accepts data and other materials from
Columbia faculty, students, and staff, and
provides:
• a permanent URL
• secure replicated storage
• accurate metadata
• globally accessible repository
• option for contextual linking between data
and published research results
52
Some subject-based repositories
• Cryospheric data repository run by U of Colorado
• NASA’s space science mission repository
• NOAA’s marine data repository
• Biological activities of small molecules data repository run
by NCBI at Nat’l Library of Medicine
• Macromolecular structural data repository run by
international consortium
53
More subject-based repositories
• Social science data repository run by consortium
• Basic and applied biosciences data repository run by
consortium of publishers
• Geodesy data repository run by university consortium
• Data repository for archeology and related disciplines
run by nonprofit consortium
• Deep-sea core samples repository housed at LDEO
54
Data licenses
• Copyright issues around data can be
complex
• These groups offer “ready-made” licenses
for data that help clarify any restrictions on
reuse
55
Data sharing is here to stay
Initiatives are underway to:
• establish norms for sharing
• create sharing and preservation infrastructure
• establish standards for interoperability
• clarify copyright and licensing issues
Data Conservancy
Digital Curation Centre
56
Takeaways
• Data sharing requirements are being
implemented by more funders and
publishers.
• Norms and standards for sharing are not set
and vary across disciplines.
• Be aware of sharing requirements and
restrictions on your data.
• Find links to a variety of institutional and
data repositories at
http://scholcomm.columbia.edu/
57
Contacts
• Joel Roselin
– Office of Research Compliance and Training
– JR2644@columbia.edu
• Danianne Mizzy
– Engineering Librarian
– dmizzy@columbia.edu
• Kathryn Pope
– Center for Digital Research and Scholarship
– kp2002@columbia.edu