Introduction to Data Management

Data Management
• Overview of research data – Joel Roselin, Office of Research Compliance and Training
• Data Storage and Retention – Danianne Mizzy, Engineering Librarian
• Data Sharing – Kathryn Pope, Center for Digital Research and Scholarship

Goals of research
• The primary goals of research are:
  – To advance knowledge
  – To improve life for people (or animals)
• Secondary goals of research:
  – Career advancement
  – Professional recognition
  – Financial gain

When you conduct research…
• …You are entrusted with:
  – Human subjects
  – Animals
  – Access to specialized materials and technology
    • Chemicals
    • Drugs
    • Machinery
    • Information (personal or confidential)
  – Funding from government or industry

When you conduct research…
• Not everyone is granted the privilege to conduct research:
  – Qualifications include:
    • Advanced degree (or enrollment in a degree program)
    • Position in a research institution
  – Promise to:
    • Be responsible in the conduct of the research
    • Be a responsible steward of the research dollars and other resources
    • Share the results of the research for the good of society

When you conduct research…
• The privilege can be revoked for failing to fulfill professional responsibilities:
  – Loss of funding
  – Debarment
  – Loss of position

What are data?
• What counts as data in your field?
  – Subject data (humans or animals)
    • Blood cell counts
    • Observational data
    • Survey responses
  – Lab data
    • Test results
    • Assays
  – Other data
    • Library information
    • Photographs

What are data?
True or False: In scientific research, only the information and observations that are made as part of scientific inquiry are considered data.

It’s ALL data
• False!
• Data are not only the information and observations made as part of scientific inquiry but also the materials, the means, and the products of that inquiry (sometimes called data sources).
• Examples:
  – Cell lines
  – Survey instruments
  – Associated software
  – Specimens

Everything is Data
Everything is data and data is everything!

Sensitive Data
• Some data are highly sensitive
  – Protected Health Information (PHI), including insurance information
  – Personal information such as Social Security numbers and financial data
• Inappropriate release of sensitive information can lead to harms:
  – Privacy violations
  – Identity theft
  – Financial liability for the University
• Sensitive information is highly regulated and requires security, e.g. encryption
• University resources:
  – HIPAA website
  – IRB website
  – Policy on Electronic Data Security Breach Reporting and Response

Takeaways
• Everything is data and data is everything!
• The PI has stewardship (control) of a project's data, with regard to publication and copyright.

Data Management & Retention
Danianne Mizzy
Engineering Librarian

Data Management & Retention
• Funder requirements
  – Minimum or maximum?
  – Just because it is not required doesn’t mean you don’t need to consider and address long-term access
• Columbia Data Retention Policy
  – Research data must be archived for a minimum of three years after the final project close-out, with original data retained wherever possible.

Relevant Policies
• CU Policies & Procedures
  – Administrative Code of Conduct
  – Statement of Ethical Conduct
  – Faculty Handbook
  – Sponsored Projects Handbook
  – Clinical Research Handbook
  – Electronic Information Resources Security
• Funder Requirements

Agency Retention Periods
• HIPAA – At least 6 years
• NIH – 3 years
• NSF – What constitutes reasonable procedures will be determined by the community of interest through the process of peer review and program management.

Data Storage Planning
• Need to plan for the entire life-cycle
• Establish a baseline and project the rate of growth for the duration of the project.
• Active – Frequent additions & updates
• Archival – In fixed form; only needs periodic access

Data Storage Considerations
• Size
• Retention period
• Privacy or security requirements?
• Sharing?

Data Storage Options at CU
Active (Working) Storage
• CUIT
  – 500 MB personal critical data
  – Workgroup Space on Central
    • $400 per gigabyte per year with a minimum of a half gigabyte (500 MB)
  – Research Computing Services
    • High Performance Cluster
    • For more information contact rcs@columbia.edu
• School & Departmental servers

Data Storage Options at CU
Active Storage
• Library
  – Center for Digital Research & Scholarship (CDRS)
    • Consultation available

Data Storage Options at CU
Archival Storage
• Library
  – Academic Commons

Data Management Planning
• What file formats? Are they long-lived?
  – Long-lived
  – Non-proprietary
• Storage and backup strategy?
  – Media: CDs and DVDs are not long-lived
• What project and data identifiers will be assigned?
• Naming conventions, file/directory structure
• Version control
• Is there a metadata scheme or other community standard for data sharing/integration?

CU Security Policy
• Individuals who access or control University electronic information resources must take appropriate and necessary measures to ensure the security, integrity, and protection of these resources, using appropriate physical and logical security measures.

Data Security and Data Integrity
• Unencrypted vs. Encrypted
  – Keep passwords & keys on paper in a secure location and in an encrypted digital file
• Uncompressed vs. Compressed

Security - Physical
• Restrict access to computers, offices and storage media
• Store lab notebooks and samples in locked cabinets
• Only let trusted individuals troubleshoot computer problems
• Appropriate environmental controls

Security - Network
• Keep confidential and sensitive data on computers not connected to the Internet
• Keep virus protection up to date
• Don't send confidential data via e-mail or FTP (use encryption, if you must)
• Use passwords on files and computers
• Data disposition at the end of the retention period

Security – CU Encryption Options
CUIT
• BitLocker for removable storage devices
• Can purchase Guardian Hard Disk Encryption through CUIT
• Windows Encrypting File System (native)
• Apple – FileVault (native)
• WinZip/7-Zip/TrueCrypt
• Savant Application Whitelist software

Back-ups
• Make 3 copies
  – Original
  – External/local
  – External/remote – different geographic area
• Verify recovery is possible
  – Checksum validation
  – Test file restore after initial set-up
  – Periodically thereafter

Data Back-up Options
• Hard Drive
• Tape Back-up
• Server
• Cloud Storage
  – Amazon S3
  – Subject Repository/Data Centers
    • (PubChem, Dryad, IRI/LDEO)

Metadata
Structured information that describes, explains, locates, and otherwise makes it easier to retrieve and use an information resource.
3 main types:
• Descriptive
• Administrative
• Structural

Major Research Metadata Standards
• Darwin Core (Biology)
• DDI (Data Documentation Initiative, for data sets in social and behavioral sciences)
• DIF (Directory Interchange Format for scientific data sets)
• EML (Ecological Metadata Language)
• FGDC/CSDGM (geographic data)
• National Biological Information Infrastructure (NBII)

Other DMP elements
• Who in the research group will be responsible for data management?
• Are there tools or software needed to create/process/visualize the data?
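The "Checksum validation" step listed under Back-ups can be illustrated with a short script. This is only a minimal sketch, assuming the backup copy mirrors the directory layout of the original; the function names `sha256_of` and `verify_backup` are hypothetical and do not refer to any CU or CUIT tool. It compares a SHA-256 digest of every original file against the corresponding file in the backup.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(original_dir: Path, backup_dir: Path) -> list[str]:
    """List relative paths that are missing from the backup or differ from the original.

    Assumption for this sketch: the backup mirrors the original directory layout.
    """
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        copy = backup_dir / relative
        # A file fails the check if it is absent or its digest does not match.
        if not copy.is_file() or sha256_of(copy) != sha256_of(source):
            problems.append(relative.as_posix())
    return problems
```

Calling `verify_backup(Path("project_data"), Path("/mnt/backup/project_data"))` (both paths are placeholders) returns an empty list when every file matches. A check like this could be run after the initial set-up and periodically thereafter, alongside an actual test restore, since matching checksums alone do not prove that the restore procedure works.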

Writing Data Management Plans
• Follow CU and funder policies and guidelines
• Can use CUL template as starting point
• Visit SCP web site for further information: http://scholcomm.columbia.edu/

Data Management Plans - NSF
1. TYPES of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
2. STANDARDS to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
3. ACCESS and sharing policies including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
4. Policies and provisions for RE-USE, re-distribution, and the production of derivatives
5. Plans for ARCHIVING data, samples, and other research products, and for preservation of access to them
6. OR justification why no plan is needed

Data Sharing Plan - NIH
1. Expected schedule for data sharing
2. Format of the final dataset
3. Documentation to be provided
4. Whether or not any analytic tools will be provided
5. Whether or not a data-sharing agreement will be required and, if so, a brief description of such an agreement
6. Mode of data sharing

Takeaways
• Create a plan to manage your research data before the project begins
• Follow the plan
• At the end of the project, securely archive data of long-term value
• Properly dispose of obsolete or sensitive data
• Guidance available from OVPR and Scholarly Communications Program

Sharing your data
Emerging practices

Why isn’t data sharing the norm?
• not common in many disciplines
• not recognized in promotion/tenure
• researcher gives up control of data
• worries about being scooped or misinterpreted
• time required to present data in usable format
• lack of infrastructure and standards

Sharing increasingly seen as valuable
“More and more often these days, a research project's success is measured not just by the publications it produces, but also by the data it makes available to the wider community.”
- Nature editorial 9.10.09
“It is obvious that making data widely available is an essential element of scientific research.”
- Science editorial 2.11.11

New need for openness
“Science has always been about open debate. But incidents such as the UEA email leaks have prompted the Royal Society to look at how open science really is. With the advent of the Internet, the public now expect a greater degree of transparency. The impact of science on people’s lives, and the implications of scientific assessments for society and the economy are now so great that people won’t just believe scientists when they say “trust me, I’m an expert.” … Science has to adapt.”
- Geoffrey Boulton, chair, Royal Society working group for the study "Science as a public enterprise: opening up scientific information", 5.13.11

Sharing advances science
Sharing can help produce significant advances in research, as these projects have demonstrated.
• Sloan Digital Sky Survey
• Human Genome Project
• NIH-funded Alzheimer’s study published in April 2011

Sharing benefits researchers
Rewards of sharing may include:
• opportunities to do innovative research
• research with higher impact
• support for transparency in research
• recognition, reciprocity from colleagues
• more opportunities to preserve data

You may have to share
More funders are requiring it
The National Science Foundation now asks researchers requesting funding to show how they will share data.
• Grant applications must include a two-page data management plan.
• Data management and access plans will be evaluated “through the process of peer review and program management.”

You may have to share
More journals are requiring it
“…authors are required to make materials, data and associated protocols promptly available to readers….Nature journals reserve the right to refuse publication in cases where authors do not provide adequate assurances that they can comply...”

What do you share?
NSF says data covered by its data management and sharing requirements will “be determined by the community of interest.”
This “may include, but is not limited to: data, publications, samples, physical collections, software and models.”

Some data are not shareable
Be aware of reasons you may NOT want to share your data:
• Data must be scrubbed of confidential information before sharing.
• You may be able to justify not sharing if your data include proprietary licenses or patentable items, are useful for further analyses, etc.

How and when do you share?
“How” depends on…
• the format of your data
• funder and publisher requirements
• any restrictions on your data
“When” depends on…
• customary embargo periods
• whether relevant guidelines specify a time within which data must be shared

Guidelines from the NSF Division of Earth Sciences (EAR)
Data should be provided at the lowest possible cost.
Data may be made available via
• national data center
• widely available journal, book, or website
• institutional archives standard for the discipline
• other EAR-specified repositories.
Data should be made available as soon as possible, but no later than two years after collection.

Online repositories
Repositories are:
• organized around institutions or subjects
• often open access
• archival, not active, storage for digital data
Repositories may offer:
• long-term preservation and access
• search engine optimization
• permanent URL or DOI

Columbia’s repository
Academic Commons (AC) accepts data and other materials from Columbia faculty, students, and staff, and provides:
• a permanent URL
• secure replicated storage
• accurate metadata
• globally accessible repository
• option for contextual linking between data and published research results

Some subject-based repositories
• Cryospheric data repository run by U of Colorado
• NASA’s space science mission repository
• NOAA’s marine data repository
• Biological activities of small molecules data repository run by NCBI at Nat’l Library of Medicine
• Macromolecular structural data repository run by international consortium

More subject-based repositories
• Social science data repository run by consortium
• Basic and applied biosciences data repository run by consortium of publishers
• Geodesy data repository run by university consortium
• Data repository for archeology and related disciplines run by nonprofit consortium
• Deep-sea core samples repository housed at LDEO

Data licenses
• Copyright issues around data can be complex
• These groups offer “ready-made” licenses for data that help clarify any restrictions on reuse

Data sharing is here to stay
Initiatives are underway to:
• establish norms for sharing
• create sharing and preservation infrastructure
• establish standards for interoperability
• clarify copyright and licensing issues
Examples:
• Data Conservancy
• Digital Curation Centre

Takeaways
• Data sharing requirements are being implemented by more funders and publishers.
• Norms and standards for sharing are not set and vary across disciplines.
• Be aware of sharing requirements and restrictions on your data.
• Find links to a variety of institutional and data repositories at http://scholcomm.columbia.edu

Contacts
• Joel Roselin
  – Office of Research Compliance and Training
  – JR2644@columbia.edu
• Danianne Mizzy
  – Engineering Librarian
  – dmizzy@columbia.edu
• Kathryn Pope
  – Center for Digital Research and Scholarship
  – kp2002@columbia.edu