Structural Genomics UK: an Oxford perspective
Dave Stuart, UK-China meeting, June 2002

Structural Genomics in Britain
Structure of presentation:
- e-science considerations (information content)
- survey
- some tasks and status

Structural Genomics: Information content – the scale of the problem
Human genomic DNA: 3.2 Gbases = 6.4 Gbits (~1 Gbyte)
Translated proteins: 100,000 (conservative estimate?)
Amino acids: >30,000,000
Non-H atoms: 0.2 Gatoms
Parameters: 1 Gparameters
Parameter data: 2 Gbyte
Experimental data: 200 Gbyte
This would require: 3.2 Million Gbyte of X-ray data
Of course this assumes 1 structure / protein. For druggable targets, e.g. HIV-1 RT, one may expect >100 data sets to be collected for 1 protein!

Structural Genomics: Information content – what we can do NOW
X-ray data collection at a 3rd generation synchrotron (present technologies!)
3 sec / image
1,200 images / hour
~1,000 Gbyte / station / day
So for the planned day 1 beamlines at the new UK synchrotron, Diamond: up to 1,000 Tbytes / year of data, to be shipped (GRID) / analysed (HPC) / archived (???)

Structural Genomics
The ultimate aim is to tackle more complex proteins, macromolecular complexes and macromolecular machines. For example:
The sheer complexity of these systems poses problems – e.g. the BTV core has 1000 protein subunits, another system has a mass of 66 MDa. Detailed analysis of these is now possible due to beam characteristics at 3rd generation synchrotron sources (not just viruses, e.g. ribosome).
ESRF Grenoble. Grimes et al., Nature, 1998; Gouet et al., Cell, 1999

Structural Genomics: International context, synchrotron end – e.g. NIH programme example
Software Development: Complex Instrumentation to simple GUI

Structural Genomics in Britain
Despite a strong history of structural biology, there is still rather little coordinated activity in structural genomics in the UK.
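The figures on the information-content slides can be checked with back-of-envelope arithmetic. The sketch below is illustrative only and rests on assumptions not stated in the slides: 2 bits per base, decimal units (1 Gbyte = 1000 Mbyte), three stations (taken from the Diamond slide, "3 out of 7 day one beamlines for HTP PX") and continuous operation for 365 days a year. The per-image size is not given in the talk; it is derived here from the quoted daily volume.

```python
# Back-of-envelope check of the data volumes quoted in the slides.
# All constants marked "assumption" are NOT from the talk.

GENOME_BASES = 3.2e9          # from slide: 3.2 Gbases
BITS_PER_BASE = 2             # assumption: 4 bases -> 2 bits each

genome_bits = GENOME_BASES * BITS_PER_BASE      # 6.4 Gbits
genome_gbyte = genome_bits / 8 / 1e9            # 0.8 Gbyte, i.e. ~1 Gbyte

SECONDS_PER_IMAGE = 3         # from slide: 3 sec / image
images_per_hour = 3600 / SECONDS_PER_IMAGE      # 1,200 images / hour
images_per_day = images_per_hour * 24

GBYTE_PER_STATION_DAY = 1000  # from slide: ~1,000 Gbyte / station / day
# Image size implied by the two slide figures above (derived, not quoted).
implied_mbyte_per_image = GBYTE_PER_STATION_DAY * 1000 / images_per_day

STATIONS = 3                  # from slide: 3 day-one HTP PX beamlines
DAYS_PER_YEAR = 365           # assumption: continuous operation
tbyte_per_year = STATIONS * GBYTE_PER_STATION_DAY * DAYS_PER_YEAR / 1000

if __name__ == "__main__":
    print(f"genome: {genome_bits / 1e9:.1f} Gbits = {genome_gbyte:.1f} Gbyte")
    print(f"images/hour: {images_per_hour:.0f}")
    print(f"implied image size: {implied_mbyte_per_image:.1f} Mbyte")
    print(f"three stations: {tbyte_per_year:.0f} Tbyte/year")
```

Under these assumptions three stations produce roughly 1,100 Tbyte per year, consistent with the "up to 1,000 Tbytes / year" quoted for Diamond.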
Ongoing:
- BBSRC: current review of SB
- Wellcome Trust / Industry: SGC
- Oxford: MRC funded OPPF
- Daresbury: NWSGC
- The European perspective: SPINE
- e-science, BBSRC: SRS, EBI, Oxford, York

Diamond – the new UK synchrotron
• Joint project: Office of Science and Technology, Wellcome Trust, France
• Science: roughly 50/50 Biology/Physical Sciences
• 3 out of 7 day one beamlines for HTP PX

Extant UK synchrotron resources (i)
• SRS: new MAD beamline under construction
• ESRF to build HTP public PX beamline

Extant UK synchrotron resources (ii)
BM14 at the ESRF
• Owned and run by UK in collaboration with EMBL (UK: MRC France)
• Broad energy range MAD beamline
• Intention is to provide a test-bed for automation
• Automation investigated: expect to order EMBL microdiffractometer and automatic sample changer

Oxford Protein Production Facility (OPPF)
• MRC funding, 3 years initially (~6 M UK £)
• ‘Pilot project’ for larger scale activity associated with the Diamond synchrotron

Oxford Protein Production Facility
WWW.OPPF.OX.AC.UK
Management Group: John Bell, Iain Campbell, Simon Davis, Robert Esnouf, Jon Grimes, Karl Harlos, Louise Johnson, Yvonne Jones, Ian Jones, David Kerr, Tony Monaco, Gavin Screaton, Dave Stammers, Dave Stuart
Executive Committee: Rob Esnouf, Jon Grimes, Karl Harlos, Yvonne Jones, Dave Stammers, Dave Stuart (chair)
Project Manager: Dr R. Owens
Staff: David Alderton, Rene Assenberg, Nick Berrow, Jon Diprose, Sally Greening, Jo Nettleship, Nahid Rahman-Huq, Tom Walter, (Lester Carter, Mike Pickford) WTC
OPPF Building handover – April 2002

Aims / Philosophy of the OPPF
• Link in with existing biomedical research programmes which are using, for instance, microarray and SAGE technologies
• Targets: mainly human proteins relevant to human health, plus human viruses, driven by input from existing biology programmes (funded by MRC and other bodies)
• Establish expression in bacteria/insect/mammalian systems
• Provide protein as a resource for functional studies and structural studies (e.g. use GFP to track protein expression)
• Proteins will be ‘reagents’ for programmes aiming to look at assemblies of several components (co-expression)
• Link in with NMR and cryo-EM
• Target 1000 clones per year into pipeline

Target definition
• Herpes viruses
• Proteins characteristic of immune cell function
• Zinc finger containing proteins / transcription factors
• The cancer genome
• Protein modules (largely extracted from above)

OPPF tasks
• Bioinformatics. Data base construction, LIMS integration.
• Protein expression/purification. Standardization to pipeline 1000 target proteins per year.
• Crystallization. Automation of screening, detection and optimization.
• Data collection/phasing.
  Data base integration with synchrotron (no direct support for X-ray, NMR, or EM)

OPPF – management structure (Executive)

Tracking and scheduling with a Laboratory Information Management System (LIMS)
Virtues evident - from financial accounting to data mining
Effort considerable - after investigation decided to go with a commercial system

Barcodes
• Coding symbology likely to be adopted by the OPPF: 128C encoding 12 numeric digits
• Suggested usage format of the 12 digits: XX YYY ZZZZZZ C
  XX: Organisation identifier – the OPPF would take 44. The range 90-99 is reserved, and will be used for three digit organisation identifiers when this becomes necessary.
  YYY: Object identifier – e.g. 998 for normal Greiner plates, 999 for shallow Greiner plates, 000 for people, etc. (Gives a range of 1000 object types.)
  ZZZZZZ: Object content identifier – a unique identifier for this item within this object set. (Gives a range of 1000000 uniquely identified items per object.)
  C: Triple-add-triple checksum – helps prevent typos if manually entered.

Global identification
A flexible solution would be:
• Coding symbology: Unspecified
• Alphanumeric/Numeric: The string must start with the numeric organisation identifier (first two or three characters), otherwise no constraint is imposed
• Length of encoded string: >2 (>3 for organisation identifiers beginning 90-99)
• Each organisation then provides a web-based facility to translate their own code string into the relevant documentation for each item made available to other groups.
• A centrally-maintained list maps the organisation identifier (first two digits of string) to the organisation name and the URL of the documentation-providing facility

Cloning issues
Cloning strategy: Gateway (present activity) or Infusion (LIC – ability to switch expression system readily)
Expression: Bacteria / insects / mammalian
Tags: 1 or 2 tags plus fusion protein option via Gateway (initial tests done)

Expression & purification issues
• Initial work with E. coli
• Activate Baculovirus infection route in year 1
• Mammalian cell route within 2 years (developments)
• Initial N-term His tag construct, HRV 3C protease cleavable (Gateway issues, topo adaptation or Infusion, under investigation – to avoid current nested PCR and long primers)

Qiagen Biorobot 8000
- 96 well expression screening
- 96 well parallel protein purification – magnetic & vacuum manifold technologies
- In particular Ni NTA
- or simply protein detection

Crystallisation issues
Technology: 96 well sitting drop
Drop volumes: nanolitre
Plates: currently Greiner
Reservoir dispensing for screens: Robbins hydra, in-house adaptations
Drop dispensing: Cartesian microsys (8 head), in-house adaptations

Cartesian mods
Current drop size 100 nl (will scale down)
Using lab scientists to test system. They have voted with their feet!
Rate of equilibration…
Crystals are large enough…
Thanks to C. Nicholls

Storage: TAP 10,000 plate robot
Imaging: Integrated Veeco-Optimag system, 1 tray - 96 images / minute

SPINE
Structural Proteomics IN Europe
SPINE 3 Year grant – 13.7 M Euros
Technologies development: cloning and expression, through to crystallography and NMR
Biomedical targets:
- Human pathogens:
  Bacterial: TB & Campylobacter
  Viral: Herpes viruses & enzyme targets
- Human proteins:
  Cancer related targets
  Neurological development/disorders
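
The 12-digit layout suggested on the Barcodes slide (XX YYY ZZZZZZ C) can be sketched in a few lines of Python. This is a minimal illustration, not the OPPF implementation. The slides do not define "triple-add-triple", so the checksum here is an assumption: a UPC/EAN-style rule that weights the 11 payload digits 3, 1, 3, 1, … from the left and takes the complement modulo 10. The function names `check_digit`, `encode` and `decode` are hypothetical, and three-digit organisation identifiers (the reserved 90-99 range) are not handled.

```python
def check_digit(payload: str) -> int:
    """ASSUMED checksum rule (not defined in the slides): weights 3,1,3,1,...
    over the 11 payload digits, then the complement modulo 10."""
    if len(payload) != 11 or not (payload.isascii() and payload.isdigit()):
        raise ValueError("payload must be exactly 11 numeric digits")
    total = sum(int(d) * (3 if i % 2 == 0 else 1) for i, d in enumerate(payload))
    return (10 - total % 10) % 10


def encode(org: int, obj: int, item: int) -> str:
    """Build XX YYY ZZZZZZ C as a 12-digit string."""
    if not 0 <= org <= 89:
        # 90-99 is reserved for future three-digit organisation identifiers.
        raise ValueError("organisation identifier must be in 00-89")
    if not 0 <= obj <= 999:
        raise ValueError("object identifier must be in 000-999")
    if not 0 <= item <= 999999:
        raise ValueError("content identifier must be in 000000-999999")
    payload = f"{org:02d}{obj:03d}{item:06d}"
    return payload + str(check_digit(payload))


def decode(code: str) -> tuple:
    """Validate a 12-digit code and return (org, obj, item)."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        raise ValueError("code must be exactly 12 numeric digits")
    payload, check = code[:11], int(code[11])
    if check_digit(payload) != check:
        raise ValueError("checksum mismatch (possible typing error)")
    return int(payload[:2]), int(payload[2:5]), int(payload[5:])


if __name__ == "__main__":
    # Organisation 44 (OPPF), object type 998 (normal Greiner plate), item 1.
    code = encode(44, 998, 1)
    print(code, decode(code))
```

A checksum of this kind catches any single mistyped digit, which matches the stated purpose of guarding against typos on manual entry.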