An e-Science Resource for High Throughput Protein Crystallography

Rob Allan, Gregory Diakun, Martyn Guest, Ronan Keegan, Colin Nave, Miroslav Papiz,
Martyn Winn, Graeme Winter, CLRC Daresbury Laboratory
Jonathan Diprose, Robert Esnouf, Chris Mayo, David Stuart, University of Oxford, The Wellcome
Trust Centre for Human Genetics, Oxford
Ludovic Launer, Martin Walsh, MRC France, c/o ESRF, Grenoble
Joel Fillon, Kim Henrick, Anne Pajon, European Bioinformatics Institute, Cambridge
Kevin Cowtan, Paul Young, York Structural Biology Laboratory
Randy Read, Dept. of Haematology, Cambridge
Omer Rana, Department of Computer Science, Cardiff University
Abstract
The aim of this project is to make it routine to obtain reliable information on protein structure using X-ray
crystallography in a high-throughput mode by introducing easy access to all facilities together with
automation where appropriate. This will allow the biologist to concentrate on the scientific questions
rather than the technical details.
Scientific Background
The vast amounts of data coming from the
genome projects have generated a demand for
new methods to determine structural and
functional information about the proteins and
other macromolecules in living systems. At the
same time, advances in biotechnology are
making it easier to obtain pure samples of these
molecules. This has led to a demand for, and the
possibility of, high throughput structural
biology and has given rise to the term
“structural genomics”. High-throughput
techniques also open up the possibility of
carrying out mass binding studies on each
protein in order to investigate possible function
or to develop inhibitors as potential
pharmaceutical compounds or functional
probes. Increasingly, scientists who are not
experts in the various techniques individually
will carry out the complete investigations. It is
vital that high-throughput techniques do not
lead to a reduction in the quality of the
information obtained.
In the UK, several large projects involving high
throughput protein structure determination using
crystallographic methods have recently started
or been announced. As an example, the recently
announced BBSRC SPORT initiative has, as
one of its aims, to "establish in the UK an
internationally competitive capability in high
throughput, parallel approaches to the
expression, isolation, purification and
structure/function determination of biological
macromolecules. This capability must be
achieved as part of a programme aimed at
producing detailed descriptions of the structure
and molecular functions of cellular complexes
and/or pathways of major biological
importance."
The diagram underneath shows the major steps
in a protein crystallography project from
selecting the target protein to investigate to
depositing and analyzing the structure.
Aims of the Project
A large amount of information will be generated
by this work. This information will have to be
passed through the structure determination
process and stored in accessible databases so
that functional information can be obtained and
experiments repeated, perhaps with
modifications. Many of the investigations will
be initiated and carried out by biologists with
little experience in physical techniques such as
x-ray crystallography. Where applicable,
automation of experiments and data analysis
will be necessary. It is therefore essential that
reliable methods of handling and analysing the
large amounts of data generated by high
throughput protein crystallography are
developed if the appropriate benefits are to be
realised.
[Figure: major steps in a protein crystallography project. Labels: Target Selection, Protein Production, Crystallisation, Data Collection, Phasing, Protein Structure, Structure analysis, Deposition]
The e-HTPX project aims to unify the procedures
of protein structure determination into a single
all-encompassing interface from which users
can initiate, plan, direct and document their
experiments, either locally or remotely, from a
desktop computer.
The main aims are:
• To develop a user interface to allow
structural biologists to interact easily with all
the required resources
• To implement a portal for managing and
analysing projects submitted to high-throughput
protein production
• To develop systems for controlling the
diffraction data collection and analysis based on
rules as currently used by experts in the field of
protein crystallography
• To extend and develop structure
determination software to take advantage of
low-cost, highly parallel computing facilities so
that feedback can be provided on the success, or
otherwise, of structure determination.
• To implement portals for x-ray data
collection and structure determination, enabling
access to all facilities over the internet.
• To implement an automated system for
collating the data from all stages and
transferring these data to the EBI for deposition
in public databases.
• To liaise with industrial users concerning
their needs.
The UML diagram underneath defines the
interaction between the various modules.
e-Science issues
There are several features of this project which
are worth emphasizing. Firstly, there will be
real samples which have to be transferred
between the various facilities. These will have
to be tracked (e.g. with bar codes) and the
relevant information transferred from the
corresponding database. Access to these
databases will be a major issue for the project
and it is possible that the standards defined by
OGSA-DAI will be important for this. Access to
instruments (e.g. for protein production,
crystallization and x-ray data collection) will be
required. In this case it is intended to implement
a large degree of automation so that the various
procedures can be carried out with minimum
intervention. However, the ability to monitor the
course of the experiments and modify the
procedures will be important. Finally, security
will be a significant concern. Although most of
the information from academic projects will
eventually be deposited in public databases,
scientists will wish to protect the information
prior to publication and deposition.
Pharmaceutical companies using the facilities
will also regard this aspect as being of prime
importance.
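The sample tracking described above can be sketched as a small registry keyed by bar code. This is only an illustration under stated assumptions: the class names, the bar code format and the facility names are hypothetical, and an in-memory dictionary stands in for the project database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Transfer:
    """One movement of a physical sample between facilities."""
    source: str
    destination: str
    timestamp: datetime


@dataclass
class Sample:
    """A physical sample identified by its bar code (hypothetical fields)."""
    barcode: str
    location: str
    history: list = field(default_factory=list)


class SampleRegistry:
    """In-memory stand-in for the tracking database (an assumption)."""

    def __init__(self):
        self._samples = {}

    def register(self, barcode, location):
        """Add a new sample; bar codes must be unique."""
        if barcode in self._samples:
            raise ValueError(f"duplicate bar code: {barcode}")
        sample = Sample(barcode=barcode, location=location)
        self._samples[barcode] = sample
        return sample

    def transfer(self, barcode, destination):
        """Record that a scanned sample has arrived at a new facility."""
        sample = self._samples[barcode]  # KeyError if the bar code is unknown
        sample.history.append(
            Transfer(sample.location, destination, datetime.now(timezone.utc))
        )
        sample.location = destination
        return sample

    def locate(self, barcode):
        """Return the facility where the sample was last scanned."""
        return self._samples[barcode].location

    def history(self, barcode):
        """Return the recorded transfers, oldest first."""
        return list(self._samples[barcode].history)


registry = SampleRegistry()
registry.register("XTAL-0001", "protein production")
registry.transfer("XTAL-0001", "crystallisation")
registry.transfer("XTAL-0001", "beamline")
```

Keeping the transfer history alongside the current location means the information associated with a sample can be followed from one facility to the next. Access control, which the text identifies as a significant concern, is deliberately left out of this sketch.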
Part of the project involves modification of
standard protein crystallography analysis
software to run on parallel computers. When
implemented, resource location and load
balancing may be applicable for this part of the
project. However, the other resources for high
throughput structure determination will be
located at defined experimental facilities and
identifiable in advance.
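As an illustration of what load balancing could mean for the parallel analysis software, the sketch below assigns independent jobs to workers with a simple greedy rule: take the largest job first and give it to the least-loaded worker. The job names and cost estimates are hypothetical, and this is one possible policy rather than the scheme adopted by the project.

```python
import heapq


def assign_jobs(job_costs, n_workers):
    """Greedy longest-job-first assignment to the least-loaded worker.

    job_costs maps a job name to an estimated cost (arbitrary units).
    Returns one (total_load, [job names]) pair per worker.
    """
    if n_workers < 1:
        raise ValueError("need at least one worker")
    # Heap entries are (current load, worker index), so the least-loaded
    # worker is popped first and the index breaks ties deterministically.
    heap = [(0.0, i) for i in range(n_workers)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(n_workers)]
    # Largest jobs first; ties are broken by job name for reproducibility.
    for job, cost in sorted(job_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        assignments[idx].append(job)
        heapq.heappush(heap, (load + cost, idx))
    loads = [0.0] * n_workers
    for load, idx in heap:
        loads[idx] = load
    return list(zip(loads, assignments))


# Hypothetical analysis jobs with estimated costs.
jobs = {
    "integrate_A": 8,
    "integrate_B": 5,
    "phase_A": 4,
    "phase_B": 3,
    "refine_A": 2,
}
plan = assign_jobs(jobs, n_workers=2)
```

The rule needs only a cost estimate per job, which suits a setting where the computing resources are known in advance but the individual calculations vary in length.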
Finally, it is worth pointing out that, although
the process has similarities to a manufacturing
process involving several sites, each procedure
is subject to significant uncertainty. Feedback
from later stages of the process to earlier stages
will be necessary when problems occur.
Obtaining the appropriate mix of automation
and user decision making will be a major factor
in the project.
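The feedback from later to earlier stages can be pictured as a small state machine. The sketch below is an assumption-laden illustration: the stage names follow the steps named in this document, but their exact order and the fallback table are hypothetical and would in practice be set by expert rules or by the user.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Stages of a structure determination project (order assumed)."""
    TARGET_SELECTION = 0
    PROTEIN_PRODUCTION = 1
    CRYSTALLISATION = 2
    DATA_COLLECTION = 3
    PHASING = 4
    STRUCTURE_ANALYSIS = 5
    DEPOSITION = 6


# Hypothetical policy: the earlier stage to return to when a stage fails.
FALLBACK = {
    Stage.CRYSTALLISATION: Stage.PROTEIN_PRODUCTION,
    Stage.DATA_COLLECTION: Stage.CRYSTALLISATION,
    Stage.PHASING: Stage.DATA_COLLECTION,
}


def next_stage(current, succeeded):
    """Return the stage to run next, or None when the project is finished.

    On success the project advances by one stage. On failure it falls back
    to an earlier stage if the policy names one, otherwise it repeats the
    current stage.
    """
    if succeeded:
        if current is Stage.DEPOSITION:
            return None
        return Stage(current + 1)
    return FALLBACK.get(current, current)
```

For example, `next_stage(Stage.PHASING, succeeded=False)` returns `Stage.DATA_COLLECTION`, which corresponds to collecting further data when phasing fails. A real system would let the user override the table, in line with the mix of automation and user decision making discussed above.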
Progress so far
A major effort has gone into developing a
comprehensive data model for the project. This
is a key requirement as it will enable the
information to be exchanged between different
stages of the process in a way which is
independent of the precise implementation of
each stage. The model describing the process of
protein production, from target selection
through expression and purification to
crystallisation, is at an advanced stage
(http://www.ebi.ac.uk/msdsrv/docs/ehtpx/lims/index.html).
The SQL schema to create tables inside a
database will soon be autogenerated from the
UML Data Model. At present, the SQL file is
generated from another version of the Data
Model, the Torque Database Schema. Torque is
part of the Apache DB Project and is used here
to generate the required database resources from
an XML file containing the database schema.
Compliance between the two models is
maintained by hand for the moment. The
relational database schema is available for three
databases: MySQL, PostgreSQL and Oracle.
Detailed documentation for each table and each
table field, autogenerated from the Torque
Database Schema, is also available (see
http://www.ebi.ac.uk/msdsrv/docs/ehtpx/lims/downloads.html).
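To make the idea of generating SQL from an XML schema concrete, the sketch below reads a small schema and emits `CREATE TABLE` statements. It is not Torque's own generator, which is what the project uses. The element and attribute names (`table`, `column`, `primaryKey`, `required`, `size`) follow the conventions of Torque schema files as far as we are aware, and the two tables are hypothetical examples rather than part of the e-HTPX data model.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema in the style of a Torque database schema file.
SCHEMA_XML = """
<database name="ehtpx_demo">
  <table name="target">
    <column name="target_id" type="INTEGER" primaryKey="true" required="true"/>
    <column name="name" type="VARCHAR" size="80" required="true"/>
  </table>
  <table name="crystal_trial">
    <column name="trial_id" type="INTEGER" primaryKey="true" required="true"/>
    <column name="target_id" type="INTEGER" required="true"/>
    <column name="score" type="INTEGER"/>
  </table>
</database>
"""


def column_ddl(col):
    """Build the SQL fragment for one <column> element."""
    sql_type = col.get("type", "VARCHAR")
    size = col.get("size")
    if size:
        sql_type += f"({size})"
    parts = [col.get("name"), sql_type]
    if col.get("required") == "true":
        parts.append("NOT NULL")
    return " ".join(parts)


def schema_to_ddl(xml_text):
    """Return one CREATE TABLE statement per <table> element."""
    root = ET.fromstring(xml_text.strip())
    statements = []
    for table in root.findall("table"):
        cols = table.findall("column")
        lines = [column_ddl(c) for c in cols]
        keys = [c.get("name") for c in cols if c.get("primaryKey") == "true"]
        if keys:
            lines.append(f"PRIMARY KEY ({', '.join(keys)})")
        body = ",\n  ".join(lines)
        statements.append(f"CREATE TABLE {table.get('name')} (\n  {body}\n);")
    return statements


ddl = schema_to_ddl(SCHEMA_XML)
```

Because the XML file is the single description of the tables, the same source can in principle drive both the SQL for each supported database and the per-table documentation. A real generator would also need to handle database-specific type names, which this sketch ignores.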
Procedures have been developed to allow
remote users to request images from the
automatic crystallisation facilities at the Oxford
Protein Production Facility and view the images
remotely. The images (one is shown in the top
left hand part of the first diagram) are colour
coded according to a scoring scheme for
successful crystallization. Automation of x-ray
data collection is progressing, with the ability to
produce and implement an automatic strategy
for data collection from some initial x-ray
diffraction images of the crystal [1].
Interfacing this software to robotic sample
changers is proceeding at both the ESRF and the
SRS. Progress has also been made on
implementing parallel processors for some of
the x-ray data analysis software so that rapid
feedback can be given to the experiments.
Options for the graphical user interface and
workflow engines have been investigated.
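The colour coding of crystallisation images mentioned above amounts to mapping a score to a colour band. The sketch below shows one way to do this; the 0 to 6 scale, the band boundaries and the colour names are all hypothetical, since the actual scoring scheme is not described here.

```python
import bisect

# Hypothetical scale: 0 = clear drop, 6 = single crystals.
MIN_SCORE, MAX_SCORE = 0, 6
# Lower bounds of the second, third and fourth colour bands.
_BAND_STARTS = [2, 4, 6]
_COLOURS = ["grey", "yellow", "orange", "green"]


def colour_for_score(score):
    """Map a crystallisation score to a display colour."""
    if not MIN_SCORE <= score <= MAX_SCORE:
        raise ValueError(f"score out of range: {score}")
    # bisect_right counts how many band boundaries the score has reached.
    return _COLOURS[bisect.bisect_right(_BAND_STARTS, score)]


def colour_plate(scores):
    """Map each well identifier to the colour for its score."""
    return {well: colour_for_score(s) for well, s in scores.items()}


# Hypothetical scores for three wells of a crystallisation plate.
plate = colour_plate({"A1": 0, "A2": 3, "A3": 6})
```

Keeping the band boundaries in a single table makes it straightforward to adjust the colour scheme if the scoring scale changes.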
In order to test the procedures, a trial of a web
service will be carried out in summer 2003. A
sequence diagram showing the main flow of
information is given below. Various workflow
packages (e.g. Triana, Taverna) are being
assessed for use with the web services
implementation. During subsequent stages of
the project, the automation aspect (including
feedback from parallel computing) will be
developed further, and the web services will be
extended to the complete procedure and placed
in a grid environment, probably using the
recently released Globus Toolkit 3 technology.
The automation for x-ray data collection
includes a collaboration (www.dna.ac.uk),
involving Andrew Leslie and Harry Powell
(MRC Cambridge), Sean McSweeney, Olof
Svensson and Darren Spruce (ESRF), Raimond
Ravelli and Pierre Legrand (EMBL Grenoble),
Karen Ackroyd, Dave Love and Steve Kinder
(CCLRC Daresbury) and Liz Duke (Diamond).
Julie Wilson at York has collaborated on the
scoring scheme for crystallisation trials.
Acknowledgements
This project is funded by the BBSRC e-science
programme, with additional funds from the DTI
for outreach to industry.
There are a large number of links to and
collaborations with other projects working in
related areas.
The e-HTPX Protein Production Model
incorporates ideas from many sources in
addition to those named on the poster. Helen
Berman (RCSB), John Westbrook (RCSB),
Cathy Lawson (Rutgers), Rosalind Kim
(Berkeley), John Ionides (EBI), the Halx Project
(Anne Poupon, Emeline Deleury and Isabelle
Krimm, http://halx.genomics.eu.org/), the Mole
Project (Alun Ashton and Peter Wood,
http://www.ccp4.ac.uk/lims/) and the CCPN Project
(Wayne Boucher, Rasmus Fogh, Tim Stevens and
Wim Vranken) were involved in defining or
testing the data model.
Reference
1. Automation of the collection and processing
of X-ray diffraction data – a generic approach,
A.G.W. Leslie, H.R. Powell, G. Winter, O.
Svensson, D. Spruce, S. McSweeney, D. Love,
S. Kinder, E. Duke and C. Nave. Acta Cryst.
(2002) D58, 1924-1928.