Data Management …
a “nuts-and-bolts” part of
Responsible Conduct of Research
March 21, 2015
Enid Karr, Sr. Bibliographer for Biology, Earth & Environmental
Sciences, Environmental Studies
enid.karry@bc.edu
Sally Wyman, Collection Development Librarian, Sr. Bibliographer
for Chemistry, Physics, Environmental Studies
sally.wyman@bc.edu
Barbara Mento, Data/GIS Librarian, Sr. Bibliographer for Computer
Science, Economics, Mathematics
barbara.mento@bc.edu
First – Why?
• Fits into “responsible conduct of research”
• Reduces the risk of data loss for you and the University
• Facilitates fulfillment of requests from others to see your data
• Shared data (“open access”) → higher citation rate!
More (Really Good) Reasons:
• Increasingly, grants require a “Data Management Plan”
  o NSF
  o NIH
  o All larger agencies, coming soon, per White House Directive on Open Data -- Feb. 22, 2013
• More scholarly journal policies (Nature, Science, PNAS, PLoS…) require that data must be:
  o Clearly documented … available for sharing … detailed enough to permit replication of analysis
• New “data journals” are starting to appear, including Nature’s Scientific Data, which publishes data sets
A “Typical” Data Management Plan
• 1-2 pages describing the project and how data will be:
  o Collected (including formats, size, etc.) … Secured … Analyzed … Shared … Preserved
• Details about access/sharing
  o Potential audience(s) for the data
  o How access will be provided and how others will find it: “Access” (freely available) vs. “Sharing” (by request)
  o Stipulations for privacy, confidentiality, IP or other rights
  o Allowed re-use of the data, derivative products
• Metadata standards to be used
• How long data will be retained -- archiving, long-term preservation and format migration
From the NSF FAQ on Data Management Plans:
“DMP” covers recorded factual material commonly accepted in the [specific] scientific community as necessary to validate research findings. May include, but is not limited to:
• Data
• Publications
• Samples
• Physical collections
• Software and models
But not: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues (Office of Management and Budget (OMB) Circular A-110).
Boston College Libraries Data Management Plan Research Guide
http://libguides.bc.edu/dataplan
• Guidance on content
• Templates/examples
• Additional resources
• To arrange a consultation with a subject specialist
Data Management in Action
Some “best practices” while collecting or generating your data
• Storage
• Documentation
• Loss Prevention
• Security
Image: digitalart / FreeDigitalPhotos.net
Handling … Storing … and Backing Up Your Data
Data Storage Elements to Consider:
• File Formats and Naming
• Directory Structure
• Version Control
• Assign Responsibility
• Document your practices
• Think about all of this EARLY
File Formats
Whenever possible, save your data using open standards. Avoid proprietary formats. Some examples:
• TXT or PDF/A (archival PDF), not Word (doc, docx)
• ASCII text, not Excel (xls, xlsx)
• MPEG-4, not QuickTime (qtff)
• TIFF or JPEG2000, not GIF or JPG
• XML or RDF, not RDBMS
Ideally, save files in both the original format AND one of the preferred ones listed above.
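One way to keep an open-format copy next to the original is sketched below. It assumes the original data sits in a SQLite database and that a plain-text CSV file is an acceptable open format for your project. The function name `export_table_to_csv` is hypothetical, as are any table and file names you pass in.

```python
import csv
import sqlite3


def export_table_to_csv(db_path, table, csv_path):
    """Copy every row of one SQLite table into a CSV file; return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        # Quote the identifier so unusual table names cannot break the query.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = conn.execute(f"SELECT * FROM {quoted}")
        header = [column[0] for column in cursor.description]
        rows = cursor.fetchall()
    finally:
        conn.close()

    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)  # NULL values are written as blank cells
    return len(rows)
```

The database file stays untouched, so both the original and the open-format copy can be archived together.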
Why Use Open File Formats?
• No restrictions on their use
• Open source code → easier future migration
• Proprietary formats are offered by companies that may go out of business, carrying the code knowledge with them
• Facilitates sharing
Organization
File Naming Conventions/Best Practices
• Consistent, descriptive, UNIQUE … avoid spaces and special characters
• Use brief names
• Can contain:
  o Project acronyms
  o Researchers’ initials
  o File type information
  o Version number
  o Date
  o File status
Example: IUS_v02_092011_final.csv
(Internet Usage Study, version 2, Sept. 2011, final draft, in CSV format)
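A naming convention is easier to keep consistent when a small script builds and checks the names. The sketch below mirrors the pattern of the example name above (acronym, two-digit version, MMYYYY date, status, extension). `NAME_PATTERN`, `build_name` and `parse_name` are hypothetical helpers, not part of any BC tool, and your group's convention may differ.

```python
import re
from datetime import date

# Assumed pattern: ACRONYM_vNN_MMYYYY_status.ext (e.g. IUS_v02_092011_final.csv).
NAME_PATTERN = re.compile(
    r"^(?P<project>[A-Za-z0-9]+)"
    r"_v(?P<version>\d{2})"
    r"_(?P<month>\d{2})(?P<year>\d{4})"
    r"_(?P<status>[A-Za-z0-9]+)"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def build_name(project, version, when, status, ext):
    """Assemble a file name from its parts, zero-padding the version number."""
    return f"{project}_v{version:02d}_{when:%m%Y}_{status}.{ext}"


def parse_name(filename):
    """Return the parts of a conforming name as a dict, or None if it does not conform."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None


print(build_name("IUS", 2, date(2011, 9, 1), "final", "csv"))
print(parse_name("my data (final).csv"))  # spaces and parentheses are rejected
```

Running `parse_name` over a folder listing is a quick way to spot files that drifted from the convention.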
Organization
Directory Structure
• Use folders!
• Possible ways to organize:
  o By types of data (IR, NMR, etc.)
  o By experiment
  o By collection method
• Choose the option that works best for your research group … it should be understandable to others
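Creating the folder layout from a script helps every project in a group start from the same structure. This is a minimal sketch. The layout shown (by type of data, then by experiment) is only one of the options listed above, and `create_project_tree` and `EXAMPLE_LAYOUT` are hypothetical names.

```python
from pathlib import Path


def create_project_tree(root, layout):
    """Create root/<folder>/<subfolder> directories and return the paths, sorted."""
    created = []
    for folder, subfolders in layout.items():
        # A folder with no subfolders is still created on its own.
        for sub in subfolders or [""]:
            path = Path(root) / folder / sub
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return sorted(created)


# Hypothetical layout: one folder per data type, one subfolder per experiment.
EXAMPLE_LAYOUT = {
    "IR": ["experiment_01", "experiment_02"],
    "NMR": ["experiment_01"],
    "docs": [],
}
```

Calling `create_project_tree("my_project", EXAMPLE_LAYOUT)` is safe to repeat, because existing folders are left as they are.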
Version Control
• Keep an archival (unmodified) version, and updated versions (clearly labelled)
• Use whole numbers (1, 2, 3) for major changes and decimals for minor changes (V1.1, V1.2 …)
• Version control software can help, and some software has this built in … especially instrument software
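The numbering scheme above can be sketched in a few lines. `bump_version` and `version_key` are hypothetical helpers. They assume labels look like `V1`, `V1.1`, `V1.2`, and they compare the parts as integers so that V1.10 sorts after V1.2.

```python
def version_key(label):
    """Turn 'V1.10' into (1, 10) so labels sort numerically, not alphabetically."""
    major, _, minor = label.lstrip("Vv").partition(".")
    return (int(major), int(minor or 0))


def bump_version(label, major=False):
    """Return the next label: a minor change adds a decimal step, a major change a new number."""
    major_num, minor_num = version_key(label)
    if major:
        return f"V{major_num + 1}"
    return f"V{major_num}.{minor_num + 1}"


print(bump_version("V1.1"))              # minor change
print(bump_version("V1.2", major=True))  # major change
```

Dedicated version control software does far more than this, but a shared labelling rule is still useful for files kept outside such a system.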
Data Entry and Quality Control
• Whatever you use, be consistent
• Define abbreviations in a readme.txt file or in a “codebook”
• Record dates in a format that sorts well (YYYYMMDD)
• Check periodically for data corruption/integrity, using a checksum, for example
• Flag problematic data
• Handling of null values: problematic in moving across software platforms
  o Consider using blanks: treated as null values by R, Python (e.g. pandas), Excel
  o Don’t use text (as in, “no data”) in a data column formatted for numbers
• Avoid manual data entry whenever possible
• Consider making your raw data files “read only”
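Three of the practices above (checksums, read-only raw files, blanks for null values) are sketched here using only the Python standard library. SHA-256 is an assumed choice of checksum, and the function names are hypothetical. On Windows, `os.chmod` can only toggle the read-only flag.

```python
import hashlib
import os
import stat


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 checksum, reading in chunks to handle large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """Compare the file's current checksum with one recorded earlier."""
    return sha256_of(path) == recorded_checksum


def make_read_only(path):
    """Clear the write permission bits so raw data is not edited by accident."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


def to_number(cell):
    """Treat a blank cell as a null value; anything else must parse as a number."""
    text = cell.strip()
    return None if text == "" else float(text)
```

Record the checksum (for example in readme.txt) when the raw file is first saved, then rerun `is_unchanged` periodically. `to_number("no data")` raises `ValueError`, which is one way to flag text that slipped into a numeric column.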
Data Documentation (“Metadata”)
• What is metadata?
• Benefits of good documentation
• What elements should be documented?
For help, contact your subject specialist:
www.bc.edu/libraries/help/askalib.html
ISO suggested Minimum Data Elements
o Title
o Creator (Principal Investigators)
o Date Created (also versions)
o Instrument and model
o Format (and software required)
o Subject
o Unique Identifier
o Description of the specific data resource
o Coverage of the data (spatial or temporal)
o Publishing Organization
o Type of Resource
o Rights
o Funding or Grant
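A quick completeness check against the elements listed above might look like the following sketch. The element names are copied from the list (shortened where it carries a parenthetical), and `missing_elements` is a hypothetical helper. It does not implement any particular ISO schema.

```python
# Element names taken from the list above.
REQUIRED_ELEMENTS = [
    "Title",
    "Creator",
    "Date Created",
    "Instrument and model",
    "Format",
    "Subject",
    "Unique Identifier",
    "Description",
    "Coverage",
    "Publishing Organization",
    "Type of Resource",
    "Rights",
    "Funding or Grant",
]


def missing_elements(record):
    """Return the required element names that are absent or blank in a metadata record."""
    return [
        name
        for name in REQUIRED_ELEMENTS
        if not str(record.get(name, "")).strip()
    ]
```

For example, `missing_elements({"Title": "Internet Usage Study"})` returns the twelve elements that still need to be filled in.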
Why Metadata?
• It helps others discover your research when you share your data.
• This “data about your data” captures the most critical information about a particular project. Capture it early on … you think you will remember, but …
• Metadata may be required for journal publication/data deposit.
Metadata Standards
These vary …
• by discipline
• by type of data
• by repository (for example: GenBank)
We can help.
Sample GenBank Record –
example of a standard
Data Documentation – What do you do with it once you have it?
• Record it in a readme.txt file
• In some fields, “codebooks” are used to record methodology and other data management notes (e.g. IRB compliance statements)
• Consider including a “data dictionary”
• Deposited alongside the data, these files facilitate “discovery” of your data on the Web
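As a rough illustration, a readme.txt with a small data dictionary can be generated from the metadata you have already recorded. The layout below is an assumption, not a required standard, and `render_readme` is a hypothetical helper.

```python
def render_readme(metadata, data_dictionary):
    """Build readme text: metadata as 'Key: value' lines, then one line per data column."""
    lines = ["METADATA"]
    lines += [f"{key}: {value}" for key, value in metadata.items()]
    lines += ["", "DATA DICTIONARY (column: description [units])"]
    for column, (description, units) in data_dictionary.items():
        lines.append(f"{column}: {description} [{units}]")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
text = render_readme(
    {"Title": "Internet Usage Study", "Format": "CSV"},
    {"age": ("Respondent age", "years")},
)
print(text)
```

Writing the result to readme.txt in the same folder as the data keeps the documentation and the files together when they are deposited.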
Data Loss Prevention
• Regular backups protect against data loss
• Your backup strategy will depend on your needs:
  o Back up all versions of the files or certain ones?
  o How often will you back up files?
• Have at least two backup locations
  o internal (your computer)
  o external (e.g. the BC Research Data Archive or departmental servers)
• Assign responsibility for backing up
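The “at least two locations” rule can be sketched as a copy-and-verify step. This is not a replacement for the managed backup services described in this deck. `back_up` is a hypothetical helper, and the destination folders stand in for whatever internal and external locations you actually use.

```python
import hashlib
import shutil
from pathlib import Path


def _sha256(path):
    """Checksum of a whole file (fine for modest sizes; chunk the read for very large files)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(source, destinations):
    """Copy one file into each destination folder and verify every copy by checksum."""
    source = Path(source)
    expected = _sha256(source)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / source.name
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if _sha256(target) != expected:
            raise OSError(f"Backup copy does not match the original: {target}")
        copies.append(target)
    return copies
```

Scheduling a script like this, and naming the person who checks that it ran, covers the “assign responsibility” point as well.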
Physical Storage Options
Local: convenient but less secure (especially external media)
• On your own computer’s hard drive
• External media (hard drive, CD/DVD, flash drive)
• Departmental server, local network access
Centralized: more secure, with automatic back-up … and more space
• ITS
• Departmental server, local network access
Remote: permanent, someone else takes responsibility for future migration
• Disciplinary repositories, e.g. GenBank, Cambridge Structural Database
• Secure cloud options are in use at other institutions
Data Storage
ITS offers a remote, automated backup of faculty and staff
computers using a product called Connected Backup by Autonomy.
Users have the ability to recover files from any location using a web
browser.
http://www.bc.edu/offices/help/essentials/backup/ironmtn.html
Research Services provides secure archive space for research
data that is backed up nightly.
http://www.bc.edu/offices/researchservices/dataresources/archive.html
Your department may provide its own storage options.
Funding Long-term Data Storage
• Who will pay for this? NSF DMP guidelines encourage inclusion of cost information … and grants may pay.
• How much of your data will you save? Raw data (untouched) always …
• In general, data must be stored for three years (contact Dr. Stephen Erickson at the Boston College Office for Research Integrity and Compliance for more information).
Data Security
For additional assistance with security
planning, consult the Computer Policy &
Security Office of the IT Assurance
Department.
Director: David Escalante
www.bc.edu/offices/its/depts/assurance/policysecurity.html
Data Access and Sharing
Options include:
• Personal website
• Journal “supplementary materials” (ACS, etc.)
• Institutional repository, e.g. eScholarship@BC
• Disciplinary (or multidisciplinary) repository
• Or a combination (journal-designated repository; Nature example)
eScholarship@BC
• A repository for BC data sets and
publications
• A portal for pointing to your data wherever it
is stored (at BC or beyond)
Data Sharing Options Beyond BC
• Subject-based archives – ask your subject librarian
• Directories of data repositories:
  o DataBib (Beta): http://databib.org/index.php#
  o Simmons Data Repositories Listing: http://oad.simmons.edu/oadwiki/Data_repositories
Examples of Repositories
• Biomedicine:
  o GenBank -- sequence data
  o RCSB Protein Data Bank -- biomolecule crystal structure coordinates, etc.
• Chemistry:
  o Cambridge Structural Database (CSD)
  o PubChem (part of NCBI Entrez, covering biological activities of small molecules)
• Multidisciplinary: FigShare.com (Open, Free)
DMPs: Data Sharing … also Archiving
What does the Data Sharing Policy Mean?
Example (NSF): “plans for archiving data, samples, and other research products, and for preservation of access to them.”
Archiving data means preserving it not just in the original format but also in a platform-independent format, using a standard that ensures the data can be re-used in the future.
Metadata is vital to ensure data is findable.
Ethics and Privacy
• Sensitive data should be redacted before depositing in a public archive or repository.
• Access to data may be embargoed (access limited for a time) for confidentiality, legal, patentability or other reasons.
• Dark archives ensure permanent protection of confidentiality.
• Where human subjects/privacy is involved, BC’s Institutional Review Board (IRB) must approve.
http://www.bc.edu/research/oric/human.html
Data Ownership
You may have copyright or ownership
concerns when planning to share your data.
For assistance and more information, please
contact the Boston College
Office for Research Integrity and Compliance:
http://www.bc.edu/content/bc/research/oric/compliance.html
Intellectual Property/Technology Transfer Concerns
• Funders/journals expect that you will share your data within a reasonable amount of time …
• However, they also recognize the need to protect intellectual property rights and potential commercial value
• The DMP should describe your plans to protect those rights
• Contact the Boston College Office for Technology Transfer and Licensing as part of your DMP writing process
Research Output
Data Citations
Why should I cite data?
• Ensures that original producers of the data (you!) are credited in citation indexes*
• Allows researchers to locate research data used in an article
• May be required by the archive that stored the data you have repurposed
*Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
Citing Data Sets
Essential citation elements; style will vary:
• author or creator
• title or description
• year of publication
• publisher and/or the database/archive from which it was retrieved
• the URL or DOI if the data set is online
Examples:
National Center for Biotechnology Information. PubChem Compound Database; CID=5934766, http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=5934766 (accessed Feb. 22, 2011).
Mackey, R.A., Mackey, E.F., and O’Brien, B.A. (1990). Lasting relationships research data archive (eScholarship version) [Data file]. Boston College School of Social Work. http://hdl.handle.net/2345/2228
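Assembling a citation from the essential elements can be sketched as below. The output follows the layout of the second example above. That layout is only one of many possible styles, and `format_data_citation` is a hypothetical helper rather than an official citation tool.

```python
def format_data_citation(author, year, title, publisher, locator):
    """Join the essential elements into one citation string; refuse blank elements."""
    elements = {
        "author": author,
        "year": str(year),
        "title": title,
        "publisher": publisher,
        "locator": locator,  # URL or DOI
    }
    blank = [name for name, value in elements.items() if not value.strip()]
    if blank:
        raise ValueError(f"Missing citation element(s): {', '.join(blank)}")
    return f"{author} ({year}). {title} [Data file]. {publisher}. {locator}"


print(
    format_data_citation(
        "Mackey, R.A., Mackey, E.F., and O'Brien, B.A.",
        1990,
        "Lasting relationships research data archive (eScholarship version)",
        "Boston College School of Social Work",
        "http://hdl.handle.net/2345/2228",
    )
)
```

Check the target journal or archive for its required style before reusing the output as is.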
Additional Support
• The Libraries
  o The Data Management LibGuide: libguides.bc.edu/dataplan
  o Subject Specialists: www.bc.edu/libraries/help/askalib.html
  o eScholarship@BC: escholarship.bc.edu
• The Office for Sponsored Programs Research: http://www.bc.edu/research/osp.html
• ITS/Research Services: http://www.bc.edu/offices/researchservices/
• Office for Research Integrity and Compliance: http://www.bc.edu/research/oric/compliance.html
• The Office for Technology Transfer and Licensing: http://www.bc.edu/research/ottl/
Some Useful Links
• Data Management and Sharing Snafu in 3 Short Acts (NYU Health Sciences Library)
  https://www.youtube.com/watch?v=N2zK3sAtr-4
• DataOne Best Practices
  https://www.dataone.org/all-best-practices-download-pdf
• DCC (Digital Curation Centre) Disciplinary Metadata Standards
  http://www.dcc.ac.uk/resources/metadata-standards
• DCC (Digital Curation Centre) Metadata Standards – Physical Sciences
  http://www.dcc.ac.uk/resources/subject-areas/physical-science
• Guide to Writing “Readme” Style Metadata (Cornell)
  http://data.research.cornell.edu/content/readme
Questions?