An e-Science Resource for High Throughput Protein Crystallography

Rob Allan, Gregory Diakun, Martyn Guest, Ronan Keegan, Colin Nave, Miroslav Papiz, Martyn Winn, Graeme Winter, CLRC Daresbury Laboratory
Jonathan Diprose, Robert Esnouf, Chris Mayo, David Stuart, University of Oxford, The Wellcome Trust Centre for Human Genetics, Oxford
Ludovic Launer, Martin Walsh, MRC France, c/o ESRF, Grenoble
Joel Fillon, Kim Henrick, Anne Pajon, European Bioinformatics Institute, Cambridge
Kevin Cowtan, Paul Young, York Structural Biology Laboratory
Randy Read, Dept. of Haematology, Cambridge
Omer Rana, Department of Computer Science, Cardiff University

Abstract

The aim of this project is to make it routine to obtain reliable information on protein structure using X-ray crystallography in a high-throughput mode by introducing easy access to all facilities together with automation where appropriate. This will allow the biologist to concentrate on the scientific questions rather than the technical details.

Scientific Background

The vast amounts of data coming from the genome projects have generated a demand for new methods to determine structural and functional information about the proteins and other macromolecules in living systems. At the same time, advances in biotechnology are making it easier to obtain pure samples of these molecules. This has led to a demand for, and the possibility of, high-throughput structural biology and has given rise to the term “structural genomics”. High-throughput techniques also open up the possibility of carrying out mass binding studies on each protein in order to investigate possible function or to develop inhibitors as potential pharmaceutical compounds or functional probes. Increasingly, complete investigations will be carried out by scientists who are not experts in each of the individual techniques. It is vital that high-throughput techniques do not lead to a reduction in the quality of the information obtained.
In the UK, several large projects involving high-throughput protein structure determination using crystallographic methods have recently started or been announced. As an example, the recently announced BBSRC SPORT initiative has, as one of its aims, to "establish in the UK an internationally competitive capability in high throughput, parallel approaches to the expression, isolation, purification and structure/function determination of biological macromolecules. This capability must be achieved as part of a programme aimed at producing detailed descriptions of the structure and molecular functions of cellular complexes and/or pathways of major biological importance."

The diagram underneath shows the major steps in a protein crystallography project, from selecting the target protein to investigate through to depositing and analysing the structure.

[Figure: major steps in a protein crystallography project: Target Selection, Protein Production, Crystallisation, Data Collection, Phasing, Protein Structure, Structure analysis, Deposition]

Aims of the Project

A large amount of information will be generated by this work. This information will have to be passed through the structure determination process and stored in accessible databases so that functional information can be obtained and experiments repeated, perhaps with modifications. Many of the investigations will be initiated and carried out by biologists with little experience in physical techniques such as X-ray crystallography. Where applicable, automation of experiments and data analysis will be necessary. It is therefore essential that reliable methods of handling and analysing the large amounts of data generated by high-throughput protein crystallography are developed if the appropriate benefits are to be realised.

The e-HTPX project aims to unify the procedures of protein structure determination into a single all-encompassing interface from which users can initiate, plan, direct and document their experiment, either locally or remotely from a desktop computer.
The main aims are:
• To develop a user interface to allow structural biologists to interact easily with all the required resources.
• To implement a portal for managing and analysing projects submitted to high-throughput protein production.
• To develop systems for controlling the diffraction data collection and analysis, based on rules as currently used by experts in the field of protein crystallography.
• To extend and develop structure determination software to take advantage of low-cost, highly parallel computing facilities so that feedback can be provided on the success, or otherwise, of structure determination.
• To implement portals for X-ray data collection and structure determination, enabling access to all facilities over the internet.
• To implement an automated system for collating the data from all stages and transferring these data to the EBI for deposition in public databases.
• To liaise with industrial users concerning their needs.

The UML diagram underneath defines the interaction between the various modules.

e-Science issues

There are several features of this project which are worth emphasising. Firstly, there will be real samples which have to be transferred between the various facilities. These will have to be tracked (e.g. with bar codes) and the relevant information transferred from the corresponding database. Access to these databases will be a major issue for the project and it is possible that the standards defined by OGSA-DAI will be important for this. Access to instruments (e.g. for protein production, crystallisation and X-ray data collection) will be required. In this case it is intended to implement a large degree of automation so that the various procedures can be carried out with minimum intervention. However, the ability to monitor the course of the experiments and modify the procedures will be important. Finally, security will be a significant concern.
Although most of the information from academic projects will eventually be deposited in public databases, scientists will wish to protect the information prior to publication and deposition. Pharmaceutical companies using the facilities will also regard this aspect as being of prime importance.

Part of the project involves modification of standard protein crystallography analysis software to run on parallel computers. When implemented, resource location and load balancing may be applicable for this part of the project. However, the other resources for high-throughput structure determination will be located at defined experimental facilities and identifiable in advance.

Finally, it is worth pointing out that, although the process has similarities to a manufacturing process involving several sites, each procedure is subject to significant uncertainty. Feedback from later stages of the process to earlier stages will be necessary when problems occur. Obtaining the appropriate mix of automation and user decision making will be a major factor in the project.

Progress so far

Considerable effort has gone into developing a comprehensive data model for the project. This is a key requirement as it will enable the information to be exchanged between different stages of the process in a way which is independent of the precise implementation of each stage. The model describing the process of protein production, from target selection through expression and purification to crystallisation, is at an advanced stage (http://www.ebi.ac.uk/msdsrv/docs/ehtpx/lims/index.html).

The SQL schema used to create the tables inside a database will soon be autogenerated from the UML Data Model. At present the SQL file is generated from another representation of the Data Model, the Torque Database Schema. Torque is part of The Apache DB Project and it is used here to generate the required database resources from an XML file containing the database schema.
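The idea of deriving SQL from an XML schema description can be illustrated with a minimal sketch. This is not Torque's own generator and not the e-HTPX schema: the table name, column names and the type mapping below are hypothetical, chosen only to show how a Torque-style XML fragment (with `table` and `column` elements carrying `type`, `size`, `required` and `primaryKey` attributes) could be turned into `CREATE TABLE` statements.

```python
import xml.etree.ElementTree as ET

# Hypothetical Torque-style schema fragment. The table and column names are
# invented for illustration and are not taken from the e-HTPX data model.
SCHEMA_XML = """
<database name="ehtpx_demo">
  <table name="target">
    <column name="target_id" type="INTEGER" primaryKey="true" required="true"/>
    <column name="name" type="VARCHAR" size="80" required="true"/>
    <column name="sequence" type="LONGVARCHAR"/>
  </table>
</database>
"""

# Assumed mapping from schema types to generic SQL types. A real generator
# would use a separate mapping for each target database.
TYPE_MAP = {"INTEGER": "INTEGER", "VARCHAR": "VARCHAR", "LONGVARCHAR": "TEXT"}


def column_sql(col):
    """Build the SQL definition for one <column> element."""
    sql_type = TYPE_MAP[col.get("type", "VARCHAR")]
    size = col.get("size")
    if size:
        sql_type = f"{sql_type}({size})"
    parts = [col.get("name"), sql_type]
    if col.get("required") == "true":
        parts.append("NOT NULL")
    return " ".join(parts)


def table_sql(table):
    """Build a CREATE TABLE statement for one <table> element."""
    cols = table.findall("column")
    lines = [column_sql(c) for c in cols]
    keys = [c.get("name") for c in cols if c.get("primaryKey") == "true"]
    if keys:
        lines.append(f"PRIMARY KEY ({', '.join(keys)})")
    body = ",\n  ".join(lines)
    return f"CREATE TABLE {table.get('name')} (\n  {body}\n);"


def schema_to_sql(xml_text):
    """Convert every <table> in the schema document to SQL."""
    root = ET.fromstring(xml_text.strip())
    return "\n\n".join(table_sql(t) for t in root.findall("table"))


if __name__ == "__main__":
    print(schema_to_sql(SCHEMA_XML))
```

Keeping the schema in a single XML description and generating the SQL from it means that one source can, in principle, serve several database back ends by swapping the type mapping.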
The compliance between these two models is maintained by hand for the moment. The relational database schema is available for three databases: MySQL, PostgreSQL and Oracle. Detailed documentation for each table and each table field, autogenerated from the Torque Database Schema, is also available (see http://www.ebi.ac.uk/msdsrv/docs/ehtpx/lims/downloads.html).

Procedures have been developed to allow remote users to request images from the automatic crystallisation facilities at the Oxford Protein Production Facility and view the images remotely. The images (one is shown in the top left hand part of the first diagram) are colour coded according to a scoring scheme for successful crystallisation.

Automation of X-ray data collection is progressing, with the ability to produce and implement an automatic strategy for data collection from some initial X-ray diffraction images of the crystal [1]. Interfacing this software to robotic sample changers is proceeding at both the ESRF and the SRS. Progress has also been made on running some of the X-ray data analysis software on parallel processors so that rapid feedback can be given to the experiments.

Options for the graphical user interface and workflow engines have been investigated. In order to test the procedures, a trial of a web service will be carried out in summer 2003. Various workflow packages (e.g. Triana, Taverna) are being assessed for use with the web services implementation. During subsequent stages of the project, the automation aspect of the project (including feedback from parallel computing) will be developed further and the web services extended to the complete procedure and placed in a grid environment, probably using the recently released Globus Toolkit 3 technology.

A sequence diagram showing the main flow of information is given below.
The automation of X-ray data collection involves a collaboration (www.dna.ac.uk) with Andrew Leslie and Harry Powell (MRC Cambridge), Sean McSweeney, Olof Svensson and Darren Spruce (ESRF), Raimond Ravelli and Pierre Legrand (EMBL Grenoble), Karen Ackroyd, Dave Love and Steve Kinder (CCLRC Daresbury) and Liz Duke (Diamond). Julie Wilson at York has collaborated on the scoring scheme for crystallisation trials.

Acknowledgements

This project is funded by the BBSRC e-Science programme, with additional funds from the DTI for outreach to industry. There are a large number of links to and collaborations with other projects working in related areas. The e-HTPX Protein Production Model incorporates ideas from many sources in addition to those named on the poster. Helen Berman (RCSB), John Westbrook (RCSB), Cathy Lawson (Rutgers), Rosalind Kim (Berkeley), John Ionides (EBI), the Halx Project (Anne Poupon, Emeline Deleury and Isabelle Krimm, http://halx.genomics.eu.org/), the Mole Project (Alun Ashton and Peter Wood, http://www.ccp4.ac.uk/lims/) and the CCPN Project (Wayne Boucher, Rasmus Fogh, Tim Stevens and Wim Vranken) were involved in defining or testing the data model.

Reference

1. Automation of the collection and processing of X-ray diffraction data – a generic approach. A.G.W. Leslie, H.R. Powell, G. Winter, O. Svensson, D. Spruce, S. McSweeney, D. Love, S. Kinder, E. Duke and C. Nave. Acta Cryst. (2002) D58, 1924-1928.