Data Management Plan

1. Introduction and Context

1.1. Basic project information
1.1.1. Name: An Infrastructure for Platform Technology in Synthetic Biology
1.1.2. Funding: EPSRC
1.1.3. Budget: £5 million over 5 years
1.1.4. Lead PI and organisation: Prof D Kitney/Prof P Freemont, Imperial College
1.1.5. Other partner organisations: Cambridge, Edinburgh, Kings, Newcastle

1.2. Short description of the project's fundamental aims and purpose
The aim of the project is to develop integrated platform technology and an infrastructure for synthetic biology. Five British universities (Imperial College, Cambridge, Edinburgh, LSE/Kings and Newcastle), which are amongst the international leaders in synthetic biology, have formed a Consortium to address this aim. The Consortium will provide the critical mass and synergy necessary to address a range of synthetic biology research and its translation to industry. Synthetic biology can be defined as a field that "aims to design and engineer biologically based parts, novel devices and systems, as well as redesigning existing natural biological systems" (The Royal Academy of Engineering Synthetic Biology Inquiry Report, May 2009). It is a rapidly developing field which, it is now recognised, will have major relevance to enhancing the UK's industrial base. The scientific drivers are the accumulation of bio-knowledge over the last sixty years and the application of methods and concepts from engineering, for example, and importantly, modularisation, standardisation and characterisation. The project will create platform technology, comprising tools, processes etc., which can be applied across a range of fields. The platform technology will be used as part of a systematic design process involving the design cycle for synthetic biology (specifications, design, modelling, implementation, testing and validation) to produce biologically based parts, devices and systems, with a range of applications in different fields.
An important aspect of the approach is the incorporation of ethical, societal and environmental considerations into the design process. A number of detailed exemplar application projects will be carried out in close collaboration with industry to test the effectiveness of the infrastructure and platform technology as a vehicle for translating university-based research into industrial applications. The Consortium aims to provide knowledge hubs where firms in the field can share information and build strong networks and clusters. The objective is for the platform technology to be: (a) compatible with the work and aspirations of the international community and (b) readily accessible by a wide range of users (both academic and industrial).

1.3. Related data plan and sharing policies
The project aims to share data as required by the EPSRC. The project will adhere to the Imperial College Backup Policy and Research Data Management Policy, and to the Imperial College policy on Freedom of Information.

1.4. Basic Data Management Plan information
1.4.1. Date of creation of plan
1.4.2. Aims and purpose of this plan
1.4.3. Target audience for this plan
The plan was created on 1st November 2012. Its aim is to ensure that all active data at all partner institutions is properly protected, and to share that data, when appropriate to do so, with sufficient documentation and metadata to aid discovery and usability. The target audience for the plan is the project participants at all partner organisations and EPSRC.

2. Data Collection/Generation and Standards

2.1. Give a short description and estimated size (GB) of the data being generated or reused
Data will be imported from existing databases, generated from observations, and generated automatically, e.g. from Theonyx robots.
2.2. Existing Data
The project will utilise existing data from public repositories which have been (and will be) integrated into SynBioMine (a database instance built on the InterMine data warehouse platform).

2.3. New Data to be used
2.3.1. The data that will be captured/created
2.3.2. The process for capturing/creating the new data
2.3.3. The file formats to be used, and why
2.3.4. The criteria to be used for data quality assurance/management
The data to be created will consist of models of biological parts, e.g. promoters. The new data will be created by Theonyx robots. A number of MSSQL databases will be used to hold the data. The supervisors will regularly check the data, and the Theonyx robots will be regularly calibrated.

2.4. Relationship between old and new data
Existing data will be useful for reference, validation and quality control. The new data will provide further data points to enable more accurate models to be defined.

2.5. Data Documentation and Metadata
2.5.1. The method used to make the datasets understandable in isolation
2.5.2. If not understandable, what metadata will be created and the standard used
2.5.3. How these metadata will be created or captured
2.5.4. The form the extrinsic descriptive and intrinsic technical metadata will take
A significant part of the project is to develop/extend existing documentation standards to optimise understanding. In addition, a rigorous standard for metadata has been developed and will be followed (see example). The project has developed its own metadata schema to describe the parts registry (see example of technical metadata).

3. Legal, Ethical and Intellectual Property Issues

3.1. Ethical and privacy issues
3.1.1. How any ethical/privacy issues that may prohibit data sharing will be resolved
3.1.2. How "personal data" in terms of the Data Protection Act (1998) will be protected
The intention is to share data subject to protecting the IPR of the College and its employees and students.
It is not envisaged that there will be any ethical issues with respect to the data, as there is no personal data as defined by the Data Protection Act.

3.2. Intellectual property rights
3.2.1. Who owns the copyright and other Intellectual Property?
3.2.2. How will the dataset be licensed?
3.2.3. The dispute resolution process/mechanism
The copyright and IPR are retained by the members of the consortium. The datasets will be licensed by ???. Any disputes arising will be addressed by the Project Governance Board.

4. Data Sharing, Access Methods and Embargoes

4.1. Access and Data Sharing
4.1.1. How the RC's requirement for sharing all or part of the data will be met, and when
4.1.2. Which groups/organisations are likely to be interested in the data
4.1.3. How this new data might be reused
4.1.4. The reasons why, if it is proposed not to share the data
The project is committed to sharing data from month 50 onwards. The data will be of interest to metabolic engineers and synthetic biologists. The group would hope to run its own repository funded by on-going research.

4.2. Exploitation
4.2.1. When, and for how long, the data will be made available
4.2.2. Any embargo to be imposed, e.g. for political/commercial/patent reasons
The data will be made available after month 50 of the project for an initial period of 10 years.

4.3. Publications
4.3.1. Describe the plan to publish findings which rely on the data, and any restrictions publishers place on sharing of data
It is planned to publish in Nature; there are no known restrictions on data sharing.

5. Active Data Management: Short/Medium-Term Storage, Backup and Security Arrangements

5.1. Quantify Short-Term Storage Requirements
The data will be less than 1 GB in the first instance.

5.2. Storage Media
5.2.1. Where (physically) will the data be stored during the project's lifetime?
5.2.2. How, if needed, data will be transferred to other sites/collaborators
Data will be stored at Imperial, Cambridge, Edinburgh, and Newcastle. The master data sets will reside at Imperial. Data will be transferred between the four sites across JANET using the Data Cube for encryption.

5.3. Back-Up
5.3.1. How the data will be backed up during the project's lifetime
5.3.2. Plans for off-site storage
At Imperial College, the data will be backed up by the central IT organisation, ICT, which includes remote back-up to a site at the Imperial College London Hammersmith Campus. See http://www3.imperial.ac.uk/ict/services/computerroom/file_and_backup_services
At Cambridge, data will be stored on a fast RAID6 array to provide protection against disk failure, and off-site backups for all but the raw data (which can be downloaded again) will be provided by synchronising the main storage array with one housed in the Department of Genetics. All machines will be protected by two levels of backup power supply: in-rack UPS batteries plus an external generator. The machine room has redundant chilling units with over-temperature alarms signalling via email, text message and directly to the University's 24/7 security centre.
At Edinburgh
At Newcastle, the team utilise Linux virtual machines (VMs) to host their software and data resources. The VMs are hosted on a variety of servers depending on performance requirements. The Newcastle University servers are maintained and backed up by Computing Science (CS) and Newcastle Information Services and Systems (ISS) according to their policies (see http://research.ncl.ac.uk/rdm/rdmncl/). Within ISS and CS, servers are maintained in a physically secure environment with appropriate arrangements for fire and accidental damage. As a research group, the team also back up their data and VMs manually on a server in a separate building (the Centre for Bacterial Cell Biology).

5.4. Security
5.4.1. The access restrictions and security measures to be used
Each server is accessed by non-sharable login name/password combinations.
5.4.2. Any access permissions, restrictions and embargoes?
None.
5.4.3. Any other security issues
None.
5.4.4. Any issues with transferring this data across an unsecured network
The current data sets do not include sensitive data. Data will be transferred between the four sites across JANET using the Data Cube for encryption.

6. Archive Data Management: Long-Term Storage and Preservation

6.1. Describe the long-term strategy for maintaining, curating and archiving the data

6.2. Long-Term Specifics
6.2.1. The data to be selected for preservation for the long term
6.2.2. How long the data should be kept beyond the life of the project
6.2.3. How, if the dataset includes sensitive data, it will be managed
6.2.4. Any transformations necessary to prepare data for preservation/sharing
The data to be offered for long-term storage and preservation include promoters. The data will be kept for 10 years. The data sets do not include sensitive data. No transformation is necessary to prepare the data for sharing.

6.3. Metadata and Documentation for Long-Term Preservation
6.3.1. The metadata/documentation provided to make the datasets reusable
The project includes the setting up of documentation standards for the data and metadata. See the attached Biopart Data Sheet as developed by CSynBI.
6.3.2. The links to published papers and/or outcomes, and strategies for maintaining persistent citation, e.g. using Digital Object Identifiers (DOIs)
Appropriate DOIs will be obtained from DataCite for the data.

6.4. Longer-Term Stewardship
6.4.1. Who or what will have responsibility over time for decisions about the data once the original personnel have gone
6.4.2. The formal process for transferring responsibility for the data in the event of the long-term place of deposit closing
The then-current HoD for Bioengineering will have responsibility over time for decisions about the data once the original personnel have moved on. Should the repository close, the College Records Office will be charged with ensuring the data is relocated.

7. Resourcing
7.1. The staff/organisational roles/responsibilities for implementing this data management plan
7.2. How data management activities will be funded during the project's lifetime
7.3. How longer-term data management will be funded after the project ends
The responsibility for implementing the data management plan rests with the project manager, Dr R Dickinson, with Mr A Spirling reporting specifically to him on this matter. The project has sufficient funds to meet the data management activities during the lifetime of the project. Longer-term data management will be funded from subsequent project bids.

8. Adherence and Review
8.1. How adherence to this plan will be checked/demonstrated, and by whom
The plan will be reviewed for adherence at the monthly project meeting and modifications made where necessary. The activity will be led by the project manager, Dr R Dickinson.
8.2. Review
8.2.1. When the data management plan will be reviewed, and by whom
8.2.2. Does this version supersede an earlier plan?
There will be a formal six-monthly review led by the PI, Professor Dick Kitney.

9. Agreement Ratification by Stakeholders
9.1. Statement of agreement (with signatures if required)

10. Annexes
10.1. Contact details and expertise of nominated data managers/named individuals
Imperial College: Dr R Dickinson
Cambridge: Dr G. Micklem, Cambridge Systems Biology Centre, Tennis Court Road, Cambridge CB2 1QR. Phone: (+44) 1223 760240. Fax: (+44) 1223 333992. Email: g.micklem@gen.cam.ac.uk
Expertise: Dr Micklem specialises in bioinformatics/genomics with a background in both academia and the biotechnology industry.
His group has developed the large-scale data integration/analysis platform InterMine, works on disease data integration (metabolicMine, WT; HumanMine, WT), provides a platform and tools to five of the main model organism databases (NIH), and is part of the modENCODE project Data Coordination Centre (NIH).
Edinburgh: Dr Alistair Elphick
Newcastle: Prof Anil Wipat

10.2. Glossary of terms

10.3. Other annexes as required

A Spirling, 28th February 2014
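Illustrative annex. Section 2.3.4 states that supervisors will regularly check data quality, and Section 2.5 refers to a project metadata schema for the parts registry that is not reproduced in this plan. The following Python sketch shows the kind of automated completeness check that could support such quality assurance before a record is accepted into the registry; it is a minimal sketch only, and all field names are hypothetical, not taken from the project's actual schema.

```python
# Minimal completeness check for a biopart metadata record.
# NOTE: these field names are hypothetical examples, not the
# project's actual metadata schema (which is documented separately).
REQUIRED_FIELDS = {
    "part_id",       # hypothetical: unique identifier in the parts registry
    "part_type",     # hypothetical: category of part, e.g. "promoter"
    "sequence",      # hypothetical: DNA sequence of the part
    "creator",       # hypothetical: originating institution
    "date_created",  # hypothetical: ISO 8601 date of creation
}

def validate_record(record: dict) -> list:
    """Return a sorted list of missing required fields (empty if complete)."""
    return sorted(REQUIRED_FIELDS - set(record))

# A hypothetical, fully populated record passes the check.
example = {
    "part_id": "BP-0001",
    "part_type": "promoter",
    "sequence": "TTGACAGCTAGCTCAGTCCTAGGTATAAT",
    "creator": "Imperial College London",
    "date_created": "2013-06-01",
}

missing = validate_record(example)  # → [] (no missing fields)
```

A check of this form could be run automatically as robot-generated records are loaded into the project databases, flagging incomplete entries for the supervisors' regular data checks.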