TEMPLATE FOR A DATA MANAGEMENT PLAN
The following template can be used to develop a Data Management Plan to accompany a research proposal. This template is based on the format suggested by the MRC. The notes (in italics) provide further context and guidance for its completion. Text from submitted applications is shown in red to provide an example of how sections can be addressed. Please do not cut/paste directly from these examples. Text highlighted yellow provides suggested text for University of Sheffield researchers to cut/paste. If you opt NOT to use the template the topics listed in the template should be addressed if requested by the funder.
You can also use the DMPonline tool to create a Data Management Plan. DMPonline outputs are available in a range of formats. Sign in with your institutional credentials by choosing 'University of Sheffield', then log in using MUSE. To create a new plan, select your research funder, ensuring University of Sheffield is selected for institutional guidance, and tick the box for DCC guidance.
0. Proposal name
Exactly as in the proposal that the DMP accompanies
1. Description of the data
1.1 Type of study
Up to three lines of text that summarise the type of study (or studies) for which the data are being collected.
Example: “This application encompasses experimental laboratory-based studies, a genomics study and an experimental medicine study.
The aim is to explore the role of xxx in respiratory function and immunity from the genetic through to the biochemical and functional levels.”
Example: “This project will comprise three areas of study - in vitro, clinical cross-sectional and in silico experiments - addressing the role of xxx in colorectal pathophysiology.”
Example: “This is a pre-clinical laboratory-based study using in vitro and in vivo methodologies to model breast cancer metastasis in the skeleton.”
Example: “This project combines laboratory work, epidemiological and qualitative research with an investigation into the cultural, socioeconomic, environmental and political factors that influence both the prevalence of caries and the effectiveness of local oral healthcare programmes.”
1.2 Types of data
Outline the types of research data that will be managed in the following terms: quantitative, qualitative; generated from surveys, clinical measurements, models, interviews, medical records, electronic health records, administrative records, genotypic data, images, audiovisual data, tissue samples. Include the raw data arising directly from the research, the reduced data derived from it, and published data.
Example: “We will collect qualitative and quantitative data from interviews, completion of questionnaires, laboratory measurements of blood samples and the measurement of bone mineral density.”
Example: “Qualitative and quantitative CT and MR imaging data will be recorded.
Demographic (age, sex), clinical measurements (WHO functional class, blood results, walk test distance, lung function test data), imaging data (qualitative and quantitative cardiac and pulmonary MRI and CT measurements), right heart catheter data (ASPIRE and National Cohort datasets), genotypic data (National Cohort dataset) and outcome data (time to death and deceased/alive) will be recorded.”
MRC Template for a Data Management Plan, v01-00, 1 May 2012 P A G E 1
Example: “Quantitative data generated from in vitro experiments will include data files from plate readers, flow cytometry plots, light and fluorescent image files. Quantitative data of all in vivo experiments will include uCT images, both raw and reconstructed, of mouse bones (longitudinal and end of procedure) and images of sections. Sequence data will also be generated from RNAseq experiments. Sample types will include human derived cells, murine tissue samples and human patient tissue samples in addition to limited medical anonymised data (disorder, age, sex, etc).”
1.3 Format and scale of the data
Outline and justify your choice of file formats, software used, number of records, databases, sweeps, repetitions, etc. in terms that are meaningful in your field of research. Decisions may be based on staff expertise, a preference for open formats, the standards accepted by data centres or widespread usage within a given community. Do formats and software enable sharing and long-term validity of data? Using standardised and interchangeable or open lossless data formats ensures the long-term usability of data. See UKDS Guidance on recommended formats. Estimate the volume of data in MB/GB/TB and how this will grow to make sure any additional storage and technical support required can be provided.
Example: “The majority of raw quantitative data will be stored in Microsoft Excel format with statistical analysis performed in GraphPad Prism. Images of western blots and PCR gels will be stored as jpegs.
A slide scanning microscope, used for processed tissue sections, uses proprietary Zen software but images are exported as jpegs. A Filemaker database is used to store physiological data (and genotype) of animals used in experiments. Colorimetric/fluorometric data (ELISA/proliferation assays), quantitative PCR data are collected using proprietary software but all software used can export data into sharable formats (.txt, .xls, .jpeg, .png). Data volume will range from small <1 kB text files to image files exceeding 10 GB. The expected volume will total under 2 TB.”
Example: “A master Access spreadsheet (.mdb) will link to study-specific excel spreadsheets (.xls) and will be backed up on a University server to maintain the long-term validity of the data. The data will be in an anonymised format with data sharing procedures in place to enable sharing of the data with other interested research groups. A master spreadsheet linking to the patient identifiers will be stored on a password-protected NHS Trust computer.”
Example: “Quantitative biological data will be stored as excel and Graphpad PRISM files and images as jpeg and pdf files. Optical, super-resolution images, uCT and multiphoton images will be stored as tiff/jpeg files. Average amount of data for in vivo metastasis experiments will be 20 MB and for in vivo experiments 4 GB (Total across the project 37 GB). This project will generate a large number of tissue samples from in vivo studies including primary (mammary tumours), tibiae, femurs, serum, bone marrow, protein and RNA from 300 mice. These samples will be stored in paraffin wax or at -80 °C. After all data have been collected surplus material will be made available to other researchers via the SEARCHBreast initiative (http://searchbreast.org).”
2. Data collection / generation
Justification for why new data collection or long term management is needed should be included in the Case for Support.
Use this section to focus on good practice and standards for ensuring new data are of high quality and processing is well documented.
2.1 Methodologies for data collection / generation
How the data will be collected/generated and which community data standards (if any) will be used at this stage. Indicate how the data will be organised during the project, mentioning for example naming conventions, version control and folder structures.
Example: “Data will be collected by specific software or handwritten (in indexed laboratory notebooks) and transferred into spreadsheets. Files will indicate the date of data acquisition, an identifier for each experiment, replicate and/or time point and file version number. Files will be stored in folders named by experiment, containing subfolders of original/raw and edited/analysed data. When necessary for blinded analysis, independent investigators will assign randomly generated codes for data/files and de-code these on completion of the experimental analysis.”
Example: “Data analysis standard operating procedures (SOPs) have been developed for generation of imaging data and all analysis will be carried out in accordance with SOP documents. All data is electronic and will be stored in a central Access database, linking together Excel spreadsheets.”
Example: “...The interviews will be conducted by trained post-doctoral researchers and recorded on electronic recorders, transcribed and translated. Interview guides will be used to direct the conversations and to enable the study team to obtain the information needed to inform the project. All participants will be given a pseudonym, which will be used when referring directly to their answers.”
Example: “....
Data collected by our partners in xxx will be securely transferred to the UK on completion of the study and subsequently imported into the Access database.”
Example: “Data will be generated from the study protocols, mainly using the Hologic Scanner software and GIMIAS based computer-assisted tool. The information collected from these tools will be combined with anonymised patient extracts from hospital information systems and expert assessments of the vertebral fracture assessment (VFA) images. All recruited patients will be anonymised and a pseudo patient identifier (e.g. VERDICT_0001) will be generated for each patient. All data collected as part of the study will be assigned to the corresponding pseudo patient identifier. Patient demographics and clinical measurements will be collected as Excel spreadsheets and published as records using the pseudo patient identifier into the XNAT database. All imaging data will also be published as folders using the same patient identifier in the XNAT database. On completion of any subsequent image processing or expert-labelling step, the relevant data will be published using the same pseudo patient identifier. The proposed project data hierarchy is as follows:
VERDICT_0001/
  Demographics/...
  Clinical measurements/...
  Imaging/…
    DXA/…
    Spine radiograph/…
  Assessments/…
    VERDICT Tool/…
    Experts/…”
2.2 Data quality and standards
Explain how the consistency and quality of data collection / generation will be controlled and documented. What quality assurance processes will you adopt? This may include processes such as calibration, repeat samples or measurements, standardised data capture or recording, data entry validation, peer review of data or representation with controlled vocabularies.
Example: “Consistency and quality of the data collection will be tested on a monthly basis by [the study data coordinator]. This will include scatter plots of the data and testing for normality.
Repeated sample testing will also be performed as described in the Case for Support.”
Example: “Each in vitro experiment will be carried out at least four times using appropriate positive and negative controls. Where feasible, technical replicates will also be carried out to ensure consistency, robustness and data quality. Robustness of data from in vivo experiments will be ensured through the use of littermate controls with cohort sizes as dictated by power calculations. Acquisition parameters and experimental protocols will be standardised and cross-referenced to the raw data. Where analyses of immunostaining are required these will be performed in a blinded manner and assessed by more than one person to generate a kappa statistic.”
Example: “Data quality standards will be met through using the Hologic scanner, which is calibrated at regular intervals in accordance with standard hospital practice. Data capture is standardised using appropriate software, e.g. Hologic scanner exporting DXA images and GIMIAS based computer-assisted tool generating VTK files. Data extraction from hospital systems will be automated to minimise errors due to manual handling of data. Where manual entry is necessary, consistency will be maintained using peer review and cross checking of results with existing measurements.”
Example: “... RNASeq data will first be checked for analysis suitability using Illumina Q scores and the software package RNA-SeQC which generates a series of quality control metrics as an HTML report and tab delimited files. It provides information on read counts (total, unique, duplicate ends, rRNA reads, strand specificity), coverage (mean coverage; reads per base, mean coefficient of variation, coverage plots) as well as downsampling, GC bias and correlation to reference expression profile.
Reads will be processed and analysed using established best practice tools such as STARAlign (mapping), HTSeq/RSEM (gene and transcript quantification) and DESeq2 (differential expression between two or more replicated groups). Alternative methods that emerge during the project will also be considered if deemed to significantly impact outputs.”
3. Data management, documentation and curation
Keep this section concise and accessible to readers who are not data-management experts. Focus on principles, systems and major standards. Focus on the main kind(s) of study data. Give brief examples and avoid long lists.
3.1 Managing, storing and curating data.
Outline briefly how and where data will be stored, backed-up, managed and curated in the short to medium term to ensure that data and metadata are stored securely for the lifetime of the project. Specify any community agreed or other formal data standards used (with URL references). [Enter data security standards in Section 4].
Note: Storing data on laptops, computer hard drives or external storage devices alone is not recommended. The use of robust, managed storage with automatic backup is preferred by the University and by funders. See UKDA guidance on data storage and backup.
*All requests for research data storage in the Faculty of Medicine, Dentistry and Health should be made to the Faculty IT Hub in the first instance (med-it@sheffield.ac.uk). They will work with you to create an appropriate folder structure and give access to authorised users.
“Data and definitive project documentation will be stored on centrally provisioned University of Sheffield virtual servers and research storage infrastructure (https://www.sheffield.ac.uk/cics/research) throughout the lifetime of the project. Both Windows and Linux Virtual Servers with up to 10TB of storage are made available to research projects. Access control is by authorised University computer account username and password.
Off-site access is facilitated by secure VPN connection authenticated by University username and remote password. By default, two copies of data are kept across two physical plant rooms, with a 28 day snapshot made of data and backed up securely offsite at least daily. This service is maintained by the University’s Corporate Information and Computing Services. Storage of data on local hard drives and devices will be limited to xxxxxx. All mobile devices e.g. laptops, tablets, mobile phones, external storage devices are encrypted as standard. Google Drive may be used for more flexible collaborative working but only where non personal-sensitive information is involved. Where Google Drive is used, copies of complete and definitive documents will be transferred to the main project repository on the University research storage infrastructure.”
3.2 Metadata standards and data documentation
Describe plans for documenting, annotating and describing data so that research data are usable by others than your own team. This may include information on the methodology used to collect the data, analytical and procedural information, definitions of variables, units of measurement, any assumptions made, the format and file type of the data. Include description of documentation that will accompany the data to provide secondary users with any necessary details to prevent misuse, misinterpretation or confusion, and will allow the data to be read and interpreted in the future. Consider also how you will capture and create the metadata and where it will be recorded e.g. in a database with links to each item, in a ‘readme’ text file, in file headers etc. Researchers are strongly encouraged to use community standards to describe and structure data, where these are in place. The DCC offers a catalogue of disciplinary metadata standards.
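As an illustration only, the documentation items listed above (methodology, definitions of variables, units of measurement, assumptions, file format) could be gathered into a ‘readme’ text file with a short script. The sketch below is written in Python for a hypothetical dataset; the `Variable` class, the `build_readme` function and every example value are invented for illustration and are not part of the MRC template or of any University system.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    """One variable in the dataset (hypothetical structure)."""
    name: str
    definition: str
    unit: str


def build_readme(title, methodology, file_format, assumptions, variables):
    """Return the text of a plain-text readme describing a dataset."""
    lines = [
        f"Title: {title}",
        f"Methodology: {methodology}",
        f"File format: {file_format}",
        "Assumptions:",
    ]
    # List each assumption, or state explicitly that none were recorded.
    lines += [f"  - {item}" for item in assumptions] or ["  - none recorded"]
    lines.append("Variables:")
    for var in variables:
        lines.append(f"  {var.name} [{var.unit}]: {var.definition}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example values, not taken from a real study.
    text = build_readme(
        title="Walk test distances",
        methodology="Six-minute walk test, one measurement per visit",
        file_format="CSV (comma separated values)",
        assumptions=["Distances are rounded to the nearest metre"],
        variables=[Variable("walk_distance", "Distance walked in six minutes", "m")],
    )
    print(text)
```

Saving the returned text as a file in the same folder as the data it describes would keep the documentation alongside the files, which is one of the recording options the guidance mentions.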
Suggested minimum text: “Methods and SOPs will be stored electronically in Microsoft Word documents (.doc) with the spreadsheets containing data”
Example: “...We have operational documents for qualitative and quantitative image analyses, detailing the image acquisition and image analysis methods in Microsoft Word documents (.doc) that are available to third parties.”
Example: “Methods used to generate and pre-process the data will be described and stored in an open file format. Metadata information regarding scanner settings, software settings, software version and operator information are captured with each scan of a patient and stored as part of XNAT database and can easily be exported to an open format.”
Example: “Data will be analysed graphically and statistically using Graphpad software, and metadata will be presented using standard statistical inferences (mean and significance tests). Files containing raw data will be labelled logically and stored in folders in a logical hierarchical fashion. Explanation of methods used (experimental and analytical) will be stored alongside the raw data in the same folders, in simple text documents. For genomic data, variant scores will be calibrated according to GATK best practice and annotated using ANNOVAR.”
3.3 Data preservation strategy and standards
Outline your plans for long-term storage, preservation and planned retention period for the research data. Include formal preservation standards, if any. Indicate which data may not be retained (if any). Consider any additional resources needed to prepare data for deposit or meet charges from data repositories if not using an established repository. Most research funders expect data to be retained for a minimum of 10 years from the end of the project. For data that by their nature cannot be re-measured, efforts should be made to retain them indefinitely. See the DCC guide: How to appraise and select research data for curation.
Long term preservation and access may be best managed by using a specialist data repository. See the Library RDM webpage on Data repositories. Look in re3data.org and at Wellcome Trust - Data repositories and database resources to find an appropriate repository. If no suitable repository is available you may deposit data in ORDA. Alternatively, data may be retained in the University’s research storage infrastructure and registered in ORDA. This is suitable if you need to regulate users through ‘Data sharing agreements’.
Suggested text in all cases: “Data will be archived in line with the University of Sheffield’s Research Data Management Policy, which is a component of the University's Policy on Good R&I Practices (the 'GRIP' Policy)”
Where data is in paper format: “Data collected in paper form will be routinely digitised and the paper form disposed of / stored for at least 10 years at our universities in secured areas.”
For data deposited in external data repositories: “Research data selected for long-term preservation and sharing will be deposited in [name of repository/weblink]. The [name of repository] is openly accessible and searchable and will guarantee preservation of these data for ten years or more.
Metadata records describing these data will be created in ORDA, the University of Sheffield research data registry and repository”
Where some research data are being deposited in ORDA: “Data that are not deposited in [name of repository/weblink] will be deposited in ORDA, a repository and registry of research data produced at the University of Sheffield, which will preserve data for ten years or more.”
Where data is deposited in ORDA only: “Data selected for long-term preservation and sharing will be deposited in ORDA, a repository and registry of research data produced at the University of Sheffield, which will guarantee preservation for ten years or more.”
Where data is being retained locally, but not made ‘openly’ accessible: “Data selected for long-term preservation and sharing will be stored on centrally provisioned University of Sheffield virtual servers and research storage infrastructure (https://www.sheffield.ac.uk/cics/research) for at least ten years. Records of these data will be published in ORDA, a registry of research data produced at the University of Sheffield.”
Example: “Laboratory notebooks will be stored by the University of Sheffield for an indefinite period. Digital data will be stored for a minimum of 10 years after the completion of the study, on University of Sheffield research storage infrastructure. All personal identifiable and study data will also be kept for a minimum of 15 years by Sheffield Teaching Hospitals NHS Foundation Trust in a secure off-site facility run by [name of management company].”
Example: “We plan to make processed image data available via our [xxxxx] server for at least 5 years after the end of this Programme. Raw image files for important experiments will also be made available in this way.
For other data, a single archive copy will be kept for 5 years after the end of the Programme.”
Example: “Data will be retained for the recommended 20 year period as set out in MRC guidance ‘Personal Information in Medical Research’ section 7. The quantitative data will be stored using the simplest data standard format of comma separated values (CSV) and the qualitative data will be stored as anonymised MS Word transcripts to ensure the best chance of longevity. Once the storage period has expired, all data held by [Department/School/Centre] will be destroyed according to [Department/School/Centre] Information Governance procedures.”
4. Data security and confidentiality of potentially disclosive personal information
Complete this section only if your research data include personal data relating to human participants in research. Information provided will be in line with your ethical review.
4.1 Formal information/data security standards
Identify formal information standards with which your study is or will be compliant. An example is ISO 27001 (ISO standard for data security).
Note: Although the University of Sheffield is not an accredited ISO 27001 institution, its information standards comply with ISO/IEC 27001:2013, demonstrating a strong and robust information security structure.
“The University of Sheffield requires its users to adhere, as a minimum, to the following security standards, http://www.shef.ac.uk/cics/policies/infosec and where necessary, for example where patient data is involved, more secure system policies are defined.”
4.2 Main risks to data security
All personal data has an element of risk. Summarise the main risks to the confidentiality and security of information related to human participants, the level of risk and how these risks will be managed.
Cover the main processes or facilities for storage and processing of personal data, data access, with controls put in place and any auditing of user compliance with consent and security conditions. If your research data include personal data relating to human participants in research, it is not sufficient to write ‘not applicable’ under this heading. See UKDS guidance on data security.
Example: “Clinical data in this project will be de-identified and obtained via application to the [name of biobank]. All information is stored in locked filing cabinets or password-encrypted computers which are located in locked rooms.”
5. Data sharing and access
5.1 Suitability for sharing
Indicate whether the data you propose to collect (or existing data you propose to use) in the study will be suitable for sharing. (“Yes” or “No”) If “No,” indicate why they will not be suitable for sharing and then go to Section 6.
Example: “Data collected through this proposal (lab and genomic data) will be suitable to share in an anonymised format for other interested researchers. Any non-anonymised data which is patient-identifiable will not be shared unless explicit consent is provided by families.”
Example: “The Access database will be shareable as the data will be anonymised and in a format suitable for preservation. The paper CRFs will not be shared to protect patient confidentiality.”
5.2 Discovery by potential users of the research data
Indicate how potential new users can find out about your data and identify whether they could be suitable for their research purposes, e.g. through basic discovery metadata (i.e. the title, author, subjects, keywords and publisher) being readily available on the study website, or in other databases or catalogues. Also, indicate whether your policy or approach to data sharing is (or will be) published on your study website (or by other means).
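As a rough sketch, the basic discovery metadata named above (title, author, subjects, keywords and publisher) could be held as a small structured record that is checked for completeness before it is published. The Python below assumes a simple JSON layout chosen purely for illustration; it is not the schema of any particular repository or registry, and the `make_discovery_record` function and the example values are hypothetical.

```python
import json

# Field names follow the discovery metadata listed in the guidance above.
REQUIRED_FIELDS = ("title", "author", "subjects", "keywords", "publisher")


def make_discovery_record(**fields):
    """Return a JSON string holding the discovery metadata for one dataset.

    Raises ValueError if any required field is missing or empty.
    Fields other than the required ones are ignored in this sketch.
    """
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError("missing discovery metadata: " + ", ".join(missing))
    record = {name: fields[name] for name in REQUIRED_FIELDS}
    return json.dumps(record, indent=2)


if __name__ == "__main__":
    # Hypothetical example values.
    print(make_discovery_record(
        title="Example imaging dataset",
        author="A. Researcher",
        subjects=["medical imaging"],
        keywords=["MRI", "CT"],
        publisher="University of Sheffield",
    ))
```

A record of this kind could sit on a study website or be adapted to whatever fields the chosen repository or catalogue actually requires.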
Most research funders recommend the use of established data repositories, community databases and related initiatives to aid data preservation, sharing and reuse. Identify any that will be entrusted with storing, curating and/or sharing data from your study. An international list of data repositories is available via Databib or Re3data. See also the list maintained by Wellcome.
Suggested text in all cases: “Records of datasets will be published in ORDA, the University of Sheffield’s registry of research data produced at the University, which will issue DataCite DOIs for registered datasets and promote discovery.”
Example: “At the end of the project all published models will be uploaded onto the [name of database and link to repository] and remain available via GitHub. Additional datasets that we believe may be of future value to the research community (for instance, those collected using specialist equipment such as [xxxxx], that may not be widely available) will be selected for longer term preservation and sharing and deposited in ORDA.”
Example: “All data will be archived on the University research storage infrastructure for 10 years after the end of the project. Gene expression profiles will also be deposited in the [name of database and link to repository]. These data will be MIAME compliant and allow subsequent meta-analyses to be performed using the data.”
5.3 Governance of access
The methods used to share data will be dependent on a number of factors such as the type, size, complexity and sensitivity of data. Identify who makes or will make the decision on whether to supply research data to a potential new user. For population health and patient-based research, indicate how independent oversight of data access and sharing works (or will work) in compliance with MRC policy.
Indicate whether the research data will be deposited in and available from an identified community database, repository, archive or other infrastructure established to curate and share data.
Suggested text for use when data will not be placed in a repository: “The lead PI and project team [including collaborators if applicable] will review applications to access experimental data and make the decision on whether to supply research data to potential applicants. Data will then be released on a case by case basis.”
Additional text [optional]: “Data will be made available through shared research platforms [insert examples relevant to project] with the relevant permissions in place.”
5.4 The study team’s exclusive use of the data
Data (with accompanying metadata) should be shared in a timely fashion. It is generally expected that timely release would be no later than publication of the main findings and should be in-line with established best practice in the field. Research funders typically allow embargoes in line with practice in the field, but expect these to be outlined up-front and justified. MRC’s requirement is for timely data sharing, with the understanding that a limited, defined period of exclusive use of data for primary research is reasonable according to the nature and value of the data, and that this restriction on sharing should be based on simple, clear principles.
Suggested text in all cases: “The project group (including collaborators) will have exclusive use of the data until the main research findings are published or patent applications have been filed [if potentially relevant to project]” and/or “...or for a period of x months/years.”
Additional text [optional]: “Following publication, data will be made available on request or shared through the relevant research platforms.”
5.5 Restrictions or delays to sharing, with planned actions to limit such restrictions
Outline any expected difficulties in data sharing, along with causes and possible measures to overcome these. Restriction to data sharing may be due to participant confidentiality, consent agreements or IPR. Strategies to limit restrictions may include data being anonymised or aggregated; gaining participant consent for data sharing; gaining copyright permissions. For prospective studies, consent procedures should include provision for data sharing to maximise the value of the data for wider research use, while providing adequate safeguards for participants. As part of the consent process, proposed procedures for data sharing should be set out clearly and current and potential future risks associated with this explained to research participants.
If no restrictions are foreseen: “At present we do not foresee any delays in data sharing following publication of the main research findings.”
For patient-based studies: “Patients will be made aware of our data sharing procedures at the time of consent.”
Additional text [optional]: “Delays in sharing data may arise through a delayed ability to analyse or publish the research findings.” and/or “Delays in sharing data may arise due to IPR and if this is a factor, advice will be sought from the University’s Research & Innovation Services.”
5.6 Regulation of responsibilities of users
Indicate whether external users are (will be) bound by data sharing agreements, licenses or end-user agreements setting out their main responsibilities. If so, set out the terms and key responsibilities to be followed. Note how access will be controlled, for example by the use of specialist services. A data enclave provides a controlled secure environment in which eligible researchers can perform analyses using restricted data resources. Where a managed access process is required, the procedure should be clearly described and transparent.
“External users will be bound by data sharing agreements as specified by the [name of funder] Data Sharing Policy.”
[where an external collaborator is involved] “Data sharing agreements will be put in place with [name of collaborator], who will be a primary re-user of data”
“The University of Sheffield’s Good Research and Innovation Practice (GRIP) Policy follows RCUK principles for data sharing (http://www.rcuk.ac.uk/research/datapolicy/)”
6. Responsibilities
Specify who, alongside the PI, is responsible for ensuring the study-wide data management, as well as for specific roles such as data capture, metadata production, data quality, storage and backup, data archiving & data sharing.
For collaborative projects you should explain the co-ordination of data management responsibilities across partners. See UKDS guidance on data management roles and responsibilities.
Example: “In addition to the PI [name], the data capture and data management will be supported by the Co-applicant [name] who will oversee the [xxxx] aspects of the project.”
Example: “Each member of the research team will be responsible for the management of data relevant to this study according to guidelines set by the study protocol.”
Example: “The data co-ordinator (50% FTE) will be responsible on a day-to-day basis for ensuring the study-wide data management, data security and quality assurance of data. The PI will have overall responsibility for ensuring that the data management plan is adhered to.”
7. Relevant institutional, departmental or study policies on data sharing and data security
Please complete, where such policies are (i) relevant to your study, and (ii) are in the public domain, e.g. accessible through the internet. Add any others that are relevant. Some of the information you give in the remainder of the DMP will be determined by the content of other policies. If so, point/link to them here.
Policy | URL or Reference
Data Management Policy & Procedures | University of Sheffield Research Data Management Policy: http://www.shef.ac.uk/polopoly_fs/1.553350!/file/GRIPPolicyextractRDM.pdf
Data Security Policy | University of Sheffield Data protection policy: http://www.shef.ac.uk/cics/policies/infosecpolicy
Data Sharing Policy | The study will adhere to the MRC and RCUK principles: http://www.mrc.ac.uk/research/policies-and-guidance-for-researchers/datasharing/
Institutional Information Policy | University of Sheffield Good Research and Innovation Practice (GRIP) Policy: http://www.sheffield.ac.uk/polopoly_fs/1.356709!/file/GRIPPolicySenateapproved.pdf; University of Sheffield Data protection policy: http://www.shef.ac.uk/cics/policies/infosecpolicy
Other: |
Other |
8. Author of this Data Management Plan (Name) and, if different to that of the Principal Investigator, their telephone & email contact details