Data Management Plan

1. Introduction and Context
1.1 Basic project information
1.1.1. Name: An Infrastructure for Platform Technology in Synthetic Biology
1.1.2. Funding: EPSRC
1.1.3. Budget: £5 million over 5 years
1.1.4. Lead PI and organisation: Prof D Kitney/Prof P Freemont, Imperial College
1.1.5. Other partner organisations: Cambridge, Edinburgh, Kings, Newcastle
1.2. Short description of the project's fundamental aims and purpose
The aim of the project is to develop integrated platform technology and an infrastructure
for synthetic biology. Five British universities (Imperial College, Cambridge, Edinburgh,
LSE/Kings and Newcastle), which are among the international leaders in synthetic
biology, have formed a Consortium to address this challenge. The Consortium will provide
the critical mass and synergy necessary to address a range of synthetic biology research
and its translation to industry. Synthetic Biology can be defined as a field that "aims to
design and engineer biologically based parts, novel devices and systems, as well as
redesigning existing natural biological systems" (The Royal Academy of Engineering
Synthetic Biology Inquiry Report, May 2009). It is a rapidly developing field which, it is
now recognised, will be of major relevance in enhancing the UK's industrial base.
The scientific drivers are the accumulation of bio-knowledge over the last sixty years and
the application of methods and concepts from engineering – for example, and
importantly, modularisation, standardisation and characterisation.
The project will create platform technology, comprising tools and processes, which can
be applied across a range of fields. The platform technology will be used as part of a
systematic design process involving the design cycle for synthetic biology (specifications,
design, modelling, implementation, testing and validation) to produce biologically based
parts, devices and systems - with a range of applications in different fields. An important
aspect of the approach is the incorporation of ethical, societal and environmental
considerations into the design process. A number of detailed exemplar applications
projects will be carried out in close collaboration with industry to test the effectiveness of
the infrastructure and platform technology as a vehicle for translating university-based
research into industrial applications. The Consortium aims to provide knowledge hubs
where firms in the field can share information and build strong networks and clusters.
The objective is for the platform technology to be: (a) compatible with the work and
aspirations of the international community and (b) readily accessible by a wide range of
users (both academic and industrial).
1.3. Related data plan and sharing policies
The project aims to share data as required by the EPSRC.
The project will adhere to the Imperial College Backup Policy and Research Data
Management Policy.
The project will adhere to the Imperial College policy on Freedom of Information.
1.4. Basic Data Management Plan information
1.4.1. Date of creation of plan
1.4.2. Aims and purpose of this plan
1.4.3. Target audience for this plan
The plan was created on 1st November 2012.
Its aim is to ensure that all active data at all partner institutions are properly protected,
and to share that data, when appropriate, with sufficient documentation and metadata
to aid discovery and usability.
The target audience for the plan comprises the project participants at all partner
organisations and EPSRC.
2. Data Collection/generation and standards
2.1. Give a short description and estimated size (GB) of the data being
generated or reused
Data will be drawn from existing databases, generated from observations, and
generated automatically, e.g. by Theonyx robots.
2.2. Existing Data
The project will utilise existing data from public repositories which have been
(and will be) integrated into SynBioMine, a database instance built on the
InterMine data warehouse platform.
2.3. New Data to be used
2.3.1. The data that will be captured/created
2.3.2. The process for capturing/creating the new data
2.3.3. The file formats to be used, and why
2.3.4. The criteria to be used for data quality assurance/management.
The data to be created will consist of models of biological parts, e.g. promoters.
The new data will be created by Theonyx robots.
A number of MSSQL databases will be used to hold the data.
The supervisors will regularly check the data, and the Theonyx robots will be
regularly calibrated.
2.4. Relationship between old and new data
Existing data will be useful for reference, validation and quality control.
The new data will provide further data points to enable more accurate models to be
defined.
2.5. Data Documentation and Metadata
2.5.1. The method used to make the datasets understandable in isolation.
2.5.2. If not understandable, what metadata will be created and standard used.
2.5.3. How these metadata will be created or captured?
2.5.4. The form the extrinsic descriptive and intrinsic technical metadata will
take.
A significant part of the project is to develop/extend existing documentation standards to
optimise understanding.
In addition, a rigorous standard for metadata has been developed and will be followed.
See example.
The project has developed its own metadata schema to describe the parts registry.
See example of technical metadata.
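Purely for illustration, a part-metadata record of the kind described above might take the following shape. The field names and values below are assumptions made for the sake of example, not the project's actual schema (which is defined in the attached examples):

```python
import json

# Illustrative part-metadata record; all field names and values here are
# hypothetical, not the project's actual parts-registry schema.
part_record = {
    "part_id": "EXAMPLE_0001",       # hypothetical identifier
    "part_type": "promoter",
    "description": "Constitutive promoter (illustrative entry)",
    "sequence": "TTGACGGCTAGCTCAGTCCTAGG",
    "provenance": {
        "created_by": "Theonyx robot run",  # automated capture, per section 2.3
        "date_created": "2012-11-01",
    },
    "quality_control": {
        "checked_by_supervisor": True,      # per section 2.3.4
        "robot_calibration_date": "2012-10-28",
    },
}

def validate_record(record):
    """Check that the minimal descriptive fields are present and non-empty."""
    required = ("part_id", "part_type", "description", "sequence")
    return all(record.get(field) for field in required)

# A record missing any required field fails validation.
print(validate_record(part_record))
print(json.dumps(part_record, indent=2)[:20])
```

A simple completeness check of this kind is one way the metadata standard could be enforced automatically before records enter the registry.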
3. Legal, Ethical and Intellectual Property issues
3.1. Ethical and privacy issues
3.1.1. How any ethical/privacy issues, that may prohibit data sharing, will be
resolved.
3.1.2. How "personal data" in terms of the Data Protection Act (1998) will be
protected.
The intention is to share data subject to protecting the IPR of the College and its
employees and students.
It is not envisaged there will be any ethical issues with respect to the data as there is no
personal data as defined by the Data Protection Act.
3.2. Intellectual property rights
3.2.1. Who owns the copyright and other Intellectual Property?
3.2.2. How the dataset will be licensed?
3.2.3. The dispute resolution process/mechanism
The copyright and IPR are retained by the members of the Consortium.
The licensing arrangements for the datasets are yet to be determined.
Any disputes arising will be addressed by the Project Governance Board.
4. Data Sharing, Access Methods and Embargoes
4.1. Access and Data Sharing
4.1.1. How the RC’s requirement for sharing all or part of the data will be met
and when.
4.1.2. Which groups/organisations are likely to be interested in the data.
4.1.3. How this new data might be reused
4.1.4. The reasons why, if it is proposed to not share the data.
The project is committed to sharing data from month 50 onwards.
The data will be of interest to metabolic engineers and synthetic biologists.
The group hopes to run its own repository, funded by ongoing research.
4.2. Exploitation
4.2.1. When, and for how long, the data will be made available
4.2.2. Any embargo to be imposed e.g. for political/commercial/patent reasons
The data will be made available after month 50 of the project for an initial period of 10
years.
4.3. Publications
4.3.1. Describe the plan to publish findings which rely on the data and any
restrictions publishers(s) place on sharing of data.
It is planned to publish in Nature; there are no known restrictions on data sharing.
5. Active data management: short/medium-term storage, backup and security arrangements
5.1. Quantify Short-Term Storage Requirements
The data will be less than 1 GB in the first instance.
5.2. Storage Media
5.2.1. Where (physically) will the data be stored during the project's lifetime?
5.2.2 How, if needed, data will be transferred to other sites/collaborators
Data will be stored at Imperial, Cambridge, Edinburgh, and Newcastle. The master data
sets will reside at Imperial.
Data will be transferred between the four sites across JANET, using the Data Cube for
encryption.
5.3. Back-Up
5.3.1. How the data will be backed up during the project's lifetime.
5.3.2 Plans for off-site storage
At Imperial College the data will be backed up by the central IT organisation (ICT),
which includes remote back-up to a site at the Imperial College London Hammersmith Campus.
See http://www3.imperial.ac.uk/ict/services/computerroom/file_and_backup_services
At Cambridge, data will be stored on a fast RAID6 array to provide protection against
disk failure, and offsite backups for all but the raw data (which can be downloaded
again) will be provided by synchronising the main storage array with one housed in the
Department of Genetics. All machines will be protected by two levels of backup power
supply: in-rack UPS batteries plus an external generator. The machine room has
redundant chilling units with over-temperature alarms signalling via email, text message
and directly to the University's 24/7 security centre.
At Edinburgh
At Newcastle
The Newcastle team utilise Linux virtual machines (VMs) to host their software and data
resources. The VMs are hosted on a variety of servers depending on performance
requirements. The Newcastle University servers are maintained and backed up by
Computing Science (CS) and Newcastle Information Services and Systems (ISS)
according to their policies (see http://research.ncl.ac.uk/rdm/rdmncl/).
Within ISS and CS, servers are maintained in a physically secure environment with
appropriate arrangements for fire and accidental damage. As a research group, we also
back up our data and VMs manually on a server in a separate building (the Centre for
Bacterial Cell Biology).
5.4. Security
5.4.1. The access restrictions and security measures to be used.
Each server is accessed by non-sharable login name/password combinations.
5.4.2. Any access permissions, restrictions and embargoes? None.
5.4.3. Any other security issues None
5.4.4. Any issues with transferring this data across an unsecured network
The current data sets do not include sensitive data. Data will be transferred between the
four sites across JANET, using the Data Cube for encryption.
6. Archive data management. Long-Term storage and Preservation
6.1. Describe the long-term strategy for maintaining, curating and archiving the
data.
6.2. Long-Term Specifics
6.2.1. The data to be selected for preservation for the long term
6.2.2. How long the data should be kept beyond the life of the project.
6.2.3. How, if the dataset includes sensitive data, it will be managed.
6.2.4. Any transformations necessary to prepare data for preservation/sharing.
The data to be offered for long-term storage and preservation include models of
biological parts such as promoters.
The data will be kept for 10 years.
The data sets do not include sensitive data. No transformation is necessary to prepare
the data for sharing.
6.3. Metadata and Documentation for Long-Term Preservation
6.3.1. The metadata/documentation provided to make the datasets reusable.
The project includes the setting up of documentation standards for the data and metadata.
See the attached Biopart Data Sheet as developed by CSynBI.
6.3.2. The links to published papers and/or outcomes and strategies for
maintaining persistent citation e.g. using Digital Object Identifiers DOIs
Appropriate DOIs will be obtained from DataCite for the data.
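As an illustrative sketch only, DOI registration with DataCite involves submitting a metadata payload describing the dataset. The attribute names below follow the general shape of the DataCite REST API, but the titles, creators and values are placeholders, and the exact attribute set should be checked against the current DataCite documentation:

```python
import json

# Hypothetical metadata payload for registering a dataset DOI with DataCite.
# All titles, names and values are placeholders, not the project's actual data.
payload = {
    "data": {
        "type": "dois",
        "attributes": {
            "titles": [{"title": "Characterised biopart models (example dataset)"}],
            "creators": [{"name": "Synthetic Biology Consortium"}],
            "publisher": "Imperial College London",
            "publicationYear": 2014,
            "types": {"resourceTypeGeneral": "Dataset"},
        },
    }
}

# In practice this payload would be submitted to the DataCite REST API using
# repository credentials; here we only show that it serialises to valid JSON.
body = json.dumps(payload)
print(len(body) > 0)
```

Once minted, the DOI provides the persistent citation linking publications back to the underlying data.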
6.4. Longer-Term Stewardship
6.4.1. Who or what will have responsibility over time for decisions about the data
once the original personnel have gone
6.4.2. The formal process for transferring responsibility for the data in the event
of the long-term place of deposit closing.
The then current HoD for Bioengineering will have responsibility over time for decisions
about the data once the original personnel have moved on.
Should the repository close, the College Records Office will be charged with ensuring the
data is relocated.
7. Resourcing
7.1. The staff/organisational roles/responsibilities for implementing this data
management plan
7.2. How data management activities will be funded during the project's lifetime?
7.3 How longer-term data management will be funded after the project ends.
The responsibility for implementing the data management plan rests with the project
manager, Dr R Dickinson, with Mr A Spirling reporting specifically to him on this matter.
The project has sufficient funds to meet the data management activities during the
lifetime of the project.
Longer-term data management will be funded from subsequent project bids.
8. Adherence and Review
8.1. How adherence to this plan will be checked/demonstrated and by whom
The plan will be reviewed for adherence at the monthly project meeting and
modifications made where necessary.
The activity will be led by the project manager Dr R Dickinson.
8.2. Review
8.2.1. When the data management plan will be reviewed and by whom
8.2.2. Does this version supersede an earlier plan?
There will be a formal six-monthly review led by the PI, Professor Dick Kitney.
9. Agreement ratification by stakeholders
9.1. Statement of agreement (with signatures if required)
10. Annexes
10.1. Contact details and expertise of nominated data managers/named
individuals
Imperial College: Dr R Dickinson
Cambridge:
Dr. G. Micklem
Cambridge Systems Biology Centre
Tennis Court Road
Cambridge
CB2 1QR
Phone: (+44) 1223 760240
Fax: (+44) 1223 333992
Email: g.micklem@gen.cam.ac.uk
Expertise: Dr. Micklem specializes in bioinformatics/genomics with a background in both
academia and the biotechnology industry. His group has developed the large-scale data
integration/analysis platform, InterMine, and works on disease data integration
(metabolicMine, WT; HumanMine, WT), providing a platform and tools to five of the main
model organism databases (NIH), and is part of the modENCODE project Data
Coordination Centre (NIH).
Edinburgh
Dr Alistair Elphick
Newcastle
Prof Anil Wipat
10.2. Glossary of terms
10.3. Other annexes as required
A Spirling
28th February 2014