DRAFT REPORT OF THE BBSRC-FUNDED GENOMICS MEETS GRID MEETING HELD AT THE NATIONAL e-SCIENCE CENTRE, EDINBURGH ON 11 – 12 NOVEMBER 2002

EXECUTIVE SUMMARY

In response to a request from the genomics groups supported by BBSRC for more information on Grid technology and what it might offer, BBSRC organised a workshop to bring together key members of the genomics and e-Science communities plus the main data providers. The workshop aimed to inform the e-Science community about the special data needs of genomics research and then to brainstorm possible Grid pilot projects to develop Grid technology for use in the genomics area. The key message was that the genomics community could not begin to share data on a large scale (via “the Grid”) until a set of standards had been agreed and methods for interoperability put in place.

The output from the meeting was a series of suggested projects and activities:

• Preparation of a case for a community project to drive biological resource connectivity and data exchange projects and to promote the formation of standards. This project should be taken forward:
  − BBSRC to meet with other funding agencies (MRC, Wellcome Trust, EPSRC, NERC) to consider a mechanism by which this project might be supported. The group was anxious to avoid double jeopardy in peer review.
  − A joint funding-agency-supported Town meeting to bring together interested parties to start putting the proposal together (should include key data providers in the UK and overseas – eg the research councils, Wellcome Trust, EMBL/EBI, NCBI, USDA, Whitehead).
  − Establishment of a steering group to write the proposal for the community project.
• A collection of Grid pilot projects, some of which were described in the report-back session, to be submitted to BBSRC’s research committees (next closing date: 20 January 2003).
• The community to identify training needs and seek input from the National e-Science Centre with help and advice from the Grid Support Centre team.
Background and Purpose of the Meeting

1. In July 2002, BBSRC hosted a Forum at Robinson College, Cambridge, for the groups funded through its Investigating Gene Function (IGF) Initiative. One of the talks at the Forum was a presentation from Dr David Boyd, Rutherford Appleton Laboratory, on the scope of the Grid Support Centre Post funded by BBSRC through the Bioinformatics and e-Science Initiative. The audience heard that this “post” (Pete Oliver and Richard Wong) had the remit to provide support for the BBSRC research community to help develop Grid technology. In particular, Pete Oliver and Richard Wong had been asked to help the genomics community supported through IGF and the BBSRC Institutes to establish some pump-priming, pilot projects to fill the gaps in the current BBSRC e-Science portfolio.

2. Immediately after David Boyd’s talk, the bioinformaticians funded on the IGF projects were invited to attend a breakout session to consider key data issues for the community and to discuss how these issues might be addressed. In particular, the discussion focussed on opportunities and barriers to developing Grid technology for the genomics community, given that BBSRC expected to be taking forward a further e-Science initiative on integrating data following the announcement of the Spending Review 2002.

3. At the breakout session, it emerged that one of the main barriers to developing Grid technology for genomics was a lack of understanding about the Grid and what it might offer this community over and above what could already be achieved using existing tools. Many of the bioinformaticians had attended Grid meetings featuring presentations and demonstrations from the Grid pilot projects, but this had not helped them to understand the basic concept of “the Grid”, the technology behind it and how it might work in the genomics context.
The group asked if BBSRC could organise a meeting to help address this problem by offering a forum for open debate between a small number of individuals from the Genomics and Grid communities.

Genomics Meets Grid Meeting

4. The meeting was arranged for 11 and 12 November 2002 at the National e-Science Centre, Edinburgh and was by invitation only, limited to around 30 participating delegates (link to delegates list) plus observers. The aims of the meeting were:

• To help the Grid community understand the key data issues for the IGF/BBSRC Institute scientists (through a series of short presentations)
• To obtain a view of the key issues from the major data providers (EBI and the Sanger Centre)
• To help the genomics community understand Grid technology and what elements of it could offer solutions to its data problems (through open discussion)
• To identify what issues needed to be addressed before Grid technology could be encompassed by this community (through brainstorming)
• To identify a series of activities to establish Grid projects in the IGF community and the Institutes, including pilot projects suitable for funding through responsive mode (in breakout groups)
• To establish links with the Grid Support Centre and identify how Pete Oliver and Richard Wong might help

Talks

Details of talks from the meeting can be found on the NeSC Web site (link to talks).

Using the Grid for Genomics

5. The meeting started with an introduction from David Boyd on the concept of the Grid, its components, capabilities and the services offered both by the Grid Support Centre and the National e-Science Centre.

6. His definition of the Grid was: “technology that enables persistent shared use of distributed resources – computing, data, visualisation, instruments, network – without needing to know in advance where these are or who owns them”.

7.
He explained that the Grid would work by combining technology at several levels and then applying this package to large-scale research problems (applications, eg protein simulation, climate modelling). The Grid middleware was now being developed as a service model of distributed computing based on Web services – the Open Grid Services Architecture (OGSA). This new toolkit would also include support for accessing structured data in databases – Data Access and Integration (OGSA-DAI) – an important development for genomics.

8. The Grid Engineering Task Force was co-ordinating the linkage of computing resources at all UK e-Science centres, interconnected by the SuperJANET4 multi-gigabit backbone and Regional Networks. The Hinxton site would be included in the network.

9. The UK Grid Support Centre had been established to help all e-Science programme participants to install and use Grid software; it also had a role in raising awareness about the Grid, offering help through a Help Desk and running a Certificate Authority.

10. BBSRC had provided funding for a support facility for BBSRC-supported researchers, especially those not funded through the e-Science programme. As well as providing technical support, the facility would assist groups in developing Grid demonstration projects, establishing mechanisms to share Beowulf clusters and organising training courses at NeSC. Pete Oliver and Richard Wong would be proactive within the community, identifying opportunities and providing technical support for installing and running Grid software. The group would work in close collaboration with BBSRC IT Services (BITS) and would be overseen by a steering group chaired by Roger Gillam.

11.
Three possible projects had already been identified following a meeting with the Institute of Grassland and Environmental Research (IGER):

• Remote BLAST jobs – the aim is to use the Grid to get faster turnaround on multi-processor resources, enabling larger databases to be searched using the Beowulf cluster at RAL
• Hyper-spectral image analysis – aims to speed up image analysis and provide access to data archiving, making use of parallel algorithms running on Beowulf clusters; the project especially focuses on the machine-learning end of data analysis, which it is hoped can be improved using the Grid approach
• PHYLIP – aims to improve turnaround time of phylogenetic analysis of large multi-gene families using a parallel version of the PHYLIP programme

12. The key challenges for BBSRC scientists were:

• Most Grid software is implemented on Linux platforms; genomics data tends to be handled in a Windows-only environment
• Network bandwidth – many Grid sites now have 1 Gbit/sec; BBSRC researchers often only have 2 Mbit/sec available to them
• Groups will need to provide access through firewalls to enable remote Grid-based use of machines; a mechanism has now been identified but it needs implementing widely

Genomics Challenges

13. The next part of the meeting comprised a series of 10-minute presentations from members of the genomics community to try to convey the complexity of the genomics data and the key issues facing those wishing to use it.
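The Remote BLAST pilot listed above is, at heart, an embarrassingly parallel problem: a set of queries is split into chunks, each chunk is searched on a separate node, and the results are gathered. A minimal sketch of that pattern (in Python; `run_blast` is a hypothetical stand-in for invoking the real BLAST binary on a node, and a local thread pool stands in for the Beowulf cluster’s job scheduler):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_queries(queries, n_workers):
    """Split the query set into one chunk per worker node."""
    chunks = [queries[i::n_workers] for i in range(n_workers)]
    return [c for c in chunks if c]  # drop empty chunks

def run_blast(query):
    """Hypothetical stand-in for a real BLAST search on one node;
    here it simply 'scores' the query by its length."""
    return query, len(query)

def remote_blast(queries, n_workers=4):
    """Farm the chunks out to the workers and gather all results."""
    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for chunk in chunk_queries(queries, n_workers):
            for query, score in pool.map(run_blast, chunk):
                results[query] = score
    return results
```

The turnaround gain comes purely from the split: searching n chunks on n nodes takes roughly the time of one chunk, which is why access to a shared multi-processor resource pays off even without changing BLAST itself.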
Early talks also contained a description of some of the main techniques and terminologies:

Andy Law, Roslin Institute: Farm Animal Genomics
Jo Dicks, John Innes Centre: Analysis of Crop Plant Genomes
Alan Archibald, Roslin Institute: Farm Animal Genomics
David Marshall, Scottish Crops Research Institute: Some Jolly Fun with Barley ESTs
Helen Ougham, Institute of Grassland and Environmental Research: IGER Research
Gary Barker, Bristol University: Mining SNPs from Public Sequence Databases
Sean May, Nottingham Arabidopsis Stock Centre: Nottingham Arabidopsis Stock Centre
Gos Micklem, Cambridge University:
Tom Oldfield, EBI: Macromolecular Structure Database Project EMSD
James Cuff, Sanger Centre: Ensembl Compute: Grid Issues

14. In responding to these talks, the “Grid technologists” made the following comments and observations:

• “Integration” includes both data and instrumentation.
• There was generally a large amount of data that was widely distributed and fluid (frequently changing); issues relating to computation were: lack of resources, ease of parallelisation and lack of expertise.
• One of the main challenges was to track queries so that they could be repeated; machine learning and artificial intelligence would be important techniques to employ.
• Policy on sharing information varied between groups, often depending on whether industry was involved in the project. Data might be embargoed for up to a year and needed to be well protected in that time. Security was therefore a major issue for this area, especially for medical data.
• A number of key issues were orthogonal to whether the Grid was used or not. For example, development of tools for data integration and data mining was important for the genomics community, but these could be developed independently of the Grid. Grid technology would then become the “plumbing” when data had to be linked over distance.
• The EBI had spent 6 years (100 person-years) developing a relational database that recognised the hierarchy found in protein structures. This database was clean and self-consistent, and required legacy data to be corrected before it could be included. This project had identified the necessity of cleaning and organising all aspects of biological data, as well as defining a data interchange method – probably as XML.
• Access to sufficient computing resource was an issue for most groups. Many labs had Linux farms of around 20 nodes but would require access to higher-performance computing in the future. IBM was currently trying to encourage the use of HPC(X) by life science groups and this was an opportunity for the genomics groups.
• Bandwidth was also a major consideration for data sharing across the Grid.
• Some of the EPSRC-funded pilot projects were already developing technology that might be appropriate for genomics data sharing and there was scope for collaboration, eg Comb-e-Chem, MyGrid and DiscoveryNet.

Plenary Brain-Storming

15. Norman Paton led the initial part of the discussion session, which aimed to identify topics for the breakout groups to discuss. He steered the group towards thinking about problems and possible solutions in terms of services. The assumption was that Web and Grid services would come to have no obvious boundary. The challenge would be to identify the kinds of services used by the genomics community and then attempt to understand the relevant implementation technologies.

16. He suggested that the current status of genomics services is:

• There are many of them
• There is little standardisation
• Few support complex queries – compute resources are an issue
• Few support integration between resources
• Many are organism-specific and thus re-invention is an issue

17.
He also explained that services are provided on a number of different levels:

• Generic storage or analysis services, eg database access, machine learning
• Data-type-specific services, eg access to transcriptome data, clustering of transcriptome data
• Multiple-data-type services, eg correlating protein and mRNA expression data
• Bespoke services, eg correlating protein and mRNA expression data in yeast

18. The following points were raised in discussion:

• Ease of use is important, but users should also be able to understand what is in the “black box”.
• What are the pros and cons of Web services vs CORBA? Web services emphasise easy access to data, ie text, and currently have the momentum, but standards are an issue: there is no alternative to XML at present. Overheads are also an issue for Web services. CORBA is already widely used by the community and the OMG has defined standards.
• Definition of standards will be key to developing Grid technology and the bulk data providers (eg EBI and Sanger) need to be involved. It was noted that the EBI, Sanger and NCBI all have different standards that need to be integrated.

19. The Group agreed that the breakout groups should be based around the main biological drivers: Comparative Genomics, Functional Genomics and Annotation. Issues that the groups should consider included: computing power, networking requirements, security, databases, scalability, data quality and visualisation.

Breakout Sessions

20. The breakout groups met briefly at the end of the first day to consider how they would operate on the following day and to appoint a rapporteur. Membership of each group was – click on each heading for summaries of the discussions at each group:

Comparative Genomics: Chair: Jo Dicks, Rapporteur: Neil Wipat; Dave Marshall, Andy Law, Pete Oliver, Richard Wong, Chris Thompson.

Functional Genomics: Chair: Gos Micklem, Rapporteur: Norman Paton; Helen Ougham, David Boyd (?),
Adrian Pugh.

Annotation: Chair: James Cuff, Rapporteur: Alan Robinson; Kerstin Kleese, Thorsten Forster, Tom Oldfield, David Goodwin, Colin Edwards, Gavin McCance, Alex Gray, Liz Ourmozdi.

Report-Back

Functional Genomics

21. The Functional Genomics group identified the key issues facing functional genomics:

• The main need for a functional genomics project was an efficient sequence analysis service to describe sequence, matrix and alignment. The service would include a collection of operations, ranging from a simple search (algorithm, sequence, matrix), returning a list of alignments, to an “All-against-All” search (algorithm, genome (x2), matrix), also returning a list of alignments. The support service must be able to operate over multiple implementation environments. Options include Grid middleware (Condor farm), standard facilities and supercomputer facilities. CPU time might be “bought” from undergraduate labs.
• Computational issues included the need for replication of performance (mirrors, multiple tiers) and programmable interfaces (some support at NCBI; BioX experiences with clean interfaces). Replication is a classic Grid functionality, requiring specialised transport for shipping large amounts of data across the network, plus replication services. For All-against-All searches, current computational requirements exceed supply.

22. The group highlighted the transcriptome as a special case. Microarrays can measure many thousands of transcript levels at a time. Transcript levels can also be compared at different points in time. A transcriptome service would therefore need: normalisation and quality assurance; clustering of individual points and collections; spot histories and clustering (what’s happening with a particular gene?); and search/query for an array of interest. Computational issues include the need to perform analyses over large experiments (eg clustering over 1000 arrays) and there may be tiered data replication.

23.
In a forward look, the group predicted that a service might be developed with interfaces for all the “omes”. The first step would be to develop individual services, each with specific computational challenges. Beyond this, there would be a need to perform requests over multiple “omes”. It would be essential to ensure good metadata and to plan for interoperation. There will be specific and difficult challenges associated with data on phenotypes, populations etc.

24. In discussion, the group suggested that the Grid pilot to develop a BLAST service with IGER was one step towards developing an individual service. Data replication was an issue across the different sites regardless of “the Grid”, and this activity would require increased bandwidth and disc space.

25. In summary, the functional genomics group identified the need to develop consistent and “cleaned-up” interfaces and to identify the computational problems that needed to be addressed.

Comparative Genomics

26. The comparative genomics group highlighted the fact that this community would also need to draw on a large amount of data: sequences, ESTs, whole genomes, proteins. The key issues to be addressed were: availability of databases and compute power for analysis.

27. In terms of data, the key issues raised were:

• Results of data searches needed to be published and this could be achieved through a Grid service resource.
• Data needs to be closely coupled to computer hardware (a special problem for comparative genomics).
• The area needs more databases and they need to be interoperable.
• Standardisation is key to interoperability (sharing data) and the Grid could drive the process of standardisation. Standards for sequence representation are particularly important.
• Comparative genomics is used for gap-filling in incomplete sequences and for spotting errors/mis-assemblies. There is therefore a link to annotation. However, ontologies have limited use for genome comparison.
• Mapping data will require special attention because it includes studies on gene order and content, high-throughput techniques such as SNPs, and haplotype construction.

28. Scalability issues were:

• There is a need to build in access to additional computer resources as the data problem grows.
• Databases need to scale and querying needs to happen across multiple databases. This could be facilitated by developing links with the Polar* project: http://www.ncl.ac.uk/polar/

29. A fully operational Grid system would need to address security issues, especially for industry, in the following areas:

• Access control.
• Accountability, especially for resource utilisation.
• Data provenance, quality and tracking of versions.
• Defining sensible strategies for notification.

30. In order for projects to be established and taken forward in the Grid area, the community would require education on what Grid technology is and how to use it. Help with code parallelisation would also be required.

31. The group had identified some possible Grid projects:

• Explore the potential for linking and building on the BBSRC-funded Microbase project (Wipat) to explore distributed querying.
• ComparaGrid – a project to develop a Grid resource for storing and presenting the results of comparative genomics studies to allow further analysis, ie a support service for data retrieval and analysis.
• PhyloGrid – a pilot in establishing a shared resource for phylogeny studies and alignments, encompassing simultaneous phylogeny alignment and parallelisation of MCMC algorithms.

32. In discussion, the group agreed that Grid technology could offer comparative genomics increased compute power, data interoperability and a service to run code remotely with appropriate visualisation. In order to move towards encompassing Grid technology, data resources would first need to be hooked together in a programmable form.

Annotation

33.
The annotation group first defined annotation as: adding value to data; facilitating interpretation of, and cross-referencing within, data; and helping quality control by allowing some provenance information to be stored. It was acknowledged that different users would require different views of annotation depending on what they were trying to do.

34. The path taken through a cross-database enquiry would depend on what annotation was seen on the way and what cross-references to other resources were provided. The query would also pass through a range of different archives, systems, architectures and data types, ie a biologist needs to navigate through a matrix of resources. It would also be important to ensure information on different versions was stored.

35. Grid technology could be used to establish interfaces to allow easier navigation between resources. Service providers needed to agree how to put this in place, what links needed to be made and how to present the links.

36. In order to take this forward, the group suggested a community project to drive biological resource connectivity and data exchange projects and to promote the formation of standards. Features of the project would include:

• A large project with community buy-in: 3 years with 5 – 10 posts and funding for meetings.
• Issues to address: how to compute similarity searches; organisation of output and quality; provision of network requirements (bandwidth) and networking between people (video-conferencing and Access Grid); security issues; scalability; replication and mirroring (use of XML for interoperation and data exchange).
• One or a few user-driven projects, with a view to evolving into collaborations for more generic application.
• Address IPR and security issues.
• Alpha and beta testers to try code.
• Registry and discovery services, starting with UDDI and increasing in complexity.
• Definition of the minimum data needed to capture.
• Databases: development of “wrappers”, XML standards for exchange, handling of mirrors and replication.

SUMMARY

The output from the meeting was a series of suggested projects and activities:

• Preparation of a case for a community project to drive biological resource connectivity and data exchange projects and to promote the formation of standards. This project should be taken forward:
  − BBSRC to meet with other funding agencies (MRC, Wellcome Trust, EPSRC, NERC) to consider a mechanism by which this project might be supported. The group was anxious to avoid double jeopardy in peer review.
  − A joint funding-agency-supported Town meeting to bring together interested parties to start putting the proposal together (should include key data providers in the UK and overseas – eg the research councils, Wellcome Trust, EMBL/EBI, NCBI, USDA, Whitehead).
  − Establishment of a steering group to write the proposal for the community project.
• A collection of Grid pilot projects, some of which were described in the report-back session, to be submitted to BBSRC’s research committees (next closing date: 20 January 2003).
• The community to identify training needs and seek input from the National e-Science Centre with help and advice from the Grid Support Centre team.

Other Actions

• David Boyd to ensure contact details and appropriate links were included in the Grid Support Centre database.
• Chris Thompson to establish a web page for Grid technology on the BBSRC site and consider setting up a discussion board.

ATTENDEES AT THE BBSRC “GENOMICS MEETS GRID” WORKSHOP
11-12 NOVEMBER, NATIONAL e-SCIENCE CENTRE, EDINBURGH
Rob Allan, CLRC e-Science Centre
Daresbury Laboratory, Daresbury, Warrington, WA4 4AD
e-mail: r.j.allan@dl.ac.uk; Tel: 01925 603207; Fax: 01925 603634

Alan L Archibald, Roslin Institute
Roslin, Midlothian EH25 9PS
e-mail: alan.archibald@bbsrc.ac.uk; Tel: 0131-527-4200; Fax: 0131-440-0434

Malcolm Atkinson, National e-Science Centre
e-mail: mpa@dcs.gla.ac.uk

Gary Barker, Rothamsted Research
Wild Country Lane, Long Ashton, Bristol BS41 9AF
e-mail: gary.barker@bristol.ac.uk; Tel: 01275-549417; Fax: 01275-394007

Rob Baxter, EPCC & NeSC, the University of Edinburgh
e-mail: R.Baxter@epcc.ed.ac.uk; Tel: +44 (0)131 650 4989 or +44 (0)131 651 4041

David Boyd, CLRC Rutherford Appleton Laboratory
Chilton, Didcot, Oxon OX11 0QX
e-mail: d.r.s.boyd@rl.ac.uk; Tel: 01235-446167; Fax: 01235-445945

James Cuff, Wellcome Trust Sanger Institute
Wellcome Trust Genome Campus, Hinxton, Cambridge
e-mail: james@sanger.ac.uk; Tel: 01223-494880; Fax: 01223-494919
Govind Chandra, JIC
Norwich Research Park, Colney, Norwich NR4 7UH
e-mail: Govind.chandra@bbsrc.ac.uk

Paul Donachy, Queens University Belfast
e-mail: P.Donachy@qub.ac.uk

Jo Dicks, John Innes Centre
Norwich Research Park, Colney, Norwich NR4 7UH
e-mail: jo.dicks@bbsrc.ac.uk; Tel: 01603-450597; Fax: 01603-450595

Colin Edwards, BBSRC Bioscience IT Services
West Common, Harpenden, Herts AL5 2JE
e-mail: Colin.Edwards@bbsrc.ac.uk; Tel: 01582-714941; Fax: 01582-714901

Thorsten Forster, Scottish Centre for Genomic Technology and Informatics
The University of Edinburgh, Medical School, Little France Crescent, Edinburgh EH16 4SB
e-mail: Thorsten.Forster@ed.ac.uk; Tel: 0131 242 6287

Peter Ghazal, Scottish Centre for Genomic Technology and Informatics
The University of Edinburgh, Medical School, Little France Crescent, Edinburgh EH16 4SB
e-mail: P.Ghazal@ed.ac.uk; Tel: 0131 242 6288

Graeme Gill, NASC, Nottingham University
LE12 5RD
e-mail: graeme@arabidopsis.info; Tel: 0115-9513091; Fax: 0115-9513297
Roger Gillam, BBSRC Bioscience IT Services
West Common, Harpenden, Herts AL5 2JE
e-mail: Roger.Gillam@bbsrc.ac.uk; Tel: 07702-562866; Fax: 01582-714951

David Goodwin, Institute of Biological Science
Cledwyn Building, University of Wales Aberystwyth, Aberystwyth, Ceredigion SY23 2JS
e-mail: dwg@aber.ac.uk; Tel: 01970-622284

W Alec Gray, University of Wales, Cardiff: Cardiff e-Science Centre
e-mail: w.a.gray@cs.cardiff.ac.uk

Mark Hayes, Cambridge e-Science Centre
Centre for Mathematical Sciences, Wilberforce Road, Cambridge, CB3 0WA
e-mail: Mah1002@cam.ac.uk; Tel: 01223 756 251; Fax: 01223 756 900

Kerstin Kleese van Dam, CLRC e-Science Centre
CLRC, Daresbury Laboratory, Warrington WA4 4AD
e-mail: k.kleese@dl.ac.uk; Tel: 01925-603832; Fax: 01925-603634

Andy Law, Roslin Institute
Roslin BioCentre, Roslin, Midlothian EH25 9PS
e-mail: Andy.Law@bbsrc.ac.uk; Tel: 0131-527-4241; Fax: 0131-440-0434

David Marshall, Scottish Crop Research Institute
SCRI, Invergowrie, Dundee, DD2 5DA
e-mail: d.marshall@scri.sari.ac.uk; Tel: 01382-562731
Sean May, NASC, Nottingham University
Plant Sciences, Sutton Bonington Campus, Loughborough LE12 5RD
e-mail: sean@arabidopsis.info; Tel: 0115-9513091; Fax: 0115-9513297

Gavin McCance, University of Glasgow
Dept of Physics and Astronomy, Kelvin Building, G12 8QQ
e-mail: g.mccance@physics.gla.ac.uk; Tel: 0141 330 5316; Fax: 0141 330 5881

Gos Micklem, Cambridge University
Genetics Dept., Cambridge University, Downing Street, Cambridge CB2 3EH
e-mail: gos.micklem@gen.cam.ac.uk; Tel: 01223-765281; Fax: 01223-333992

Tom Oldfield, European Bioinformatics Institute
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD
e-mail: oldfield@ebi.ac.uk; Tel: 01223 492526; Fax: 01223 494487

Peter Oliver, CLRC
Atlas Centre, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX
e-mail: p.m.oliver@rl.ac.uk; Tel: 01235-445164; Fax: 01235-446646

Liz Ourmozdi, BBSRC
Polaris House, North Star Avenue, Swindon SN2 1UH
e-mail: Elizabeth.ourmozdi@bbsrc.ac.uk; Tel: 01793-413282; Fax: 01793-413234

Helen Ougham, IGER
Plas Gogerddan, Aberystwyth, Ceredigion SY23 3RE
e-mail: helen.ougham@bbsrc.ac.uk; Tel: 01970-823094; Fax: 01970-823242
Norman Paton, Department of Computer Science, Manchester University
Oxford Road, Manchester M13 9PL
e-mail: norm@cs.man.ac.uk; Tel: 0161-275-6910; Fax: 0161-275-6236

Adrian Pugh, BBSRC
Polaris House, North Star Avenue, Swindon SN2 1UH
e-mail: Adrian.pugh@bbsrc.ac.uk; Tel: 01793-413229; Fax: 01793-413234

Alan J Robinson, EMBL European Bioinformatics Institute
Wellcome Trust Genome Campus, Hinxton CB10 1SD
e-mail: Alan@ebi.ac.uk; Tel: 01223-494444; Fax: 01223-494468

Andrew Simpson, Software Engineering Centre
Wolfson Building, Parks Road, Oxford, OX1 3QD
e-mail: Andrew.simpson@comlab.ox.ac.uk; Tel: 01865 283514

Chris Thompson, BBSRC
Polaris House, North Star Avenue, Swindon SN2 1UH
e-mail: Chris.thompson@bbsrc.ac.uk; Tel: 01793-413393; Fax: 01793-413234

Anil Wipat (Neil), University of Newcastle
Computing Science, Claremont Tower, Newcastle-upon-Tyne, NE1 7RU
e-mail: Anil.wipat@newcastle.ac.uk; Tel: 0191-222-8213; Fax: 0191-222-8232

Richard Wong, Rutherford Appleton Laboratory
Atlas Centre, e-Science Centre, R27, Chilton, Oxfordshire OX11 0QX
e-mail: r.y.m.wong@rl.ac.uk; Tel: 01235-446075

FUNCTIONAL GENOMICS BREAKOUT
Gos Micklem – Chair; Norman Paton – Rapporteur

1. The main topics considered were BLAST and array data. It was felt that the two had very similar Grid needs. Examples of the needs were worked through.

2. A ‘club’ could be formed as a backbone to the services. This might be the IGF centres, for instance. Its needs would be:
a. more dynamic scheduling of computing power to deal with analyses
b. the ability to move code more dynamically between centres, and to replicate information (to help with a)
c. high-speed links
d. people with (Grid) expertise to run them – although this could be centralised.

3. Computing power of undergraduate teaching nodes could be utilised to increase compute power. Condor might be used. (There is an issue of ownership here.)

4. Most user interest (in order of demand) from array users is for:
a. spot history
b. pairwise comparisons
c. clustering

5. In terms of security, the main issue was embargo of data before general release.
Authentication was mainly needed between members of the ‘club’.

6. There was a suggestion that there should be an ‘IGF hackathon’ to further discuss the issues. (This would be subsumed by the wider meeting.)

7. I have mocked up a description of the possible ‘Genomics Grid’, and some of the issues.

[Diagram: the mocked-up ‘Genomics Grid’. ‘The Club’ connects the IGF/genomics databases and their mirrors (other IGF/genomics databases), sharing authentication, data replication, compute and high bandwidth between members; a user with their own database connects by low-bandwidth query/response; other external databases attach as nodes.]

ANNOTATION BREAKOUT
(Chair: James Cuff, Rapporteur: Alan Robinson)

The group discussed the following:

1. The meaning of annotation

• Capturing, interpreting and adding value to raw data.
• Providing understanding.
• Allowing users to find things out by distilling information and data.
• Providing user knowledge of how, why and what was done initially.

An example given of what annotation means: bioinformaticists generally don’t understand protein structure, therefore we can’t have a system where too much information is given to the user at once. There should be a ‘folded-down’ annotation system so the user can delve deeper if specific information needs to be obtained.

Summary: Annotation has a two-tier meaning: (i) the linking of one thing to another and (ii) explaining things to the user. This is where the Grid comes in.

2. How the Grid can help annotation

• Present problems include the lack of tools and systems to integrate data – with sequencing data it is possible to put data on top of existing data, but this can’t be done with structural data.
• All disparate data needs to be merged in ways that wouldn’t normally be done – this should be driven by the biology users.
• Large sites need to be convinced to work together and ‘talk’ the same language; this will involve standardising existing (meta)data and communication between the databases so they are the same.
• Before approaching this hurdle, a critical mass of users needs to be brought on board.
Summary: (i) the need to deal with conflicting information and (ii) the need for commonality of codes and data among databases.
3. Pilot Grid projects
• The community requires a fully interconnected and navigable 'mesh' of resources, as workflow may change depending on what is discovered en route. There will therefore be a need to jump from one database to another through reference points.
• Electronic submission of papers is an issue of concern, as some databases have richer information than others; 'related' articles need to be linked by accession numbers.
• Before considering a Grid pilot project there are three bottlenecks to overcome: (i) talking to people, (ii) asking users what information they want to link and (iii) establishing how users want this information presented.
• Suggested project: a linkage analysis pilot project (5–10 FTE for 3 years, approx £2M). The key issues to cover include:
1. Defining standards for data exchange and interfaces/wrappers to resources.
2. User- and device-driven visualisation.
3. Similarity searches, clustering, output/quality, low-bandwidth networking, security of data, collaborations – all these issues need to be considered.
4. The project will be user-driven, and community buy-in is essential for success.
5. Typical users will test out new programmes (they will require persuading, as initially this will be frustrating, thus emphasising the need to bring users on board).
6. A simple registry service is required to describe this meshwork of services, before leading to the development of a more advanced ontology-based registry service (progressive over time).

COMPARATIVE GENOMICS BREAKOUT
Chair: Jo Dicks, Rapporteur: Neil Wipat
The brainstorming started with a preamble about the main issues for comparative genomics, summarised:
• BBSRC institutes have poor connectivity – there was concern that this might negatively affect their chances of getting funding for Grid technology development.
• Compute power needs to be central and next to a big data store.
• The community needs to be able to store, archive and curate data, and to be able to store the results of past searches.
• One of the main problems was incomplete data sets (partial genomes); comparative genomics could be used to fill data in.
• Standards need to be established for sequence data – perhaps the Grid system could build on a pre-existing standard? Could the research councils press grant holders to use set ontologies? Should standards be based on ontologies that are largely subjective?
• Phylogeny comparisons require high compute power and are therefore suitable for a Grid project.
• Security is an issue – as are accountability, provenance and the need for audit trails.
• Scalability is an issue – the current infrastructure cannot accommodate complex queries, and this is made worse by naïve users not using it efficiently.
• Database searching will be facilitated by a more efficient architecture – it cannot be done on a national scale at present; the community should collaborate with the Polar* project on distributed database searching.
• The community should wait until Globus 3 and OGSA-DAI are operational.
• The community was not in a position to use high performance computing; help was needed with code – the IBM funding to support "centres of excellence" for Life Sciences was essential.

Parallelisation
• Scalability and increased speed need to be addressed locally before moving to a Grid environment.
• Memory is an issue, as well as increased CPU.
• Phylogeny alignment studies are generally tightly coupled to local machines, but alignment could be done on the Grid.

Data Integration
• Mapping will increase the amounts of data available, eg haplotype studies, and software will be needed to integrate it for analysis.
• Data include information on ESTs, sequences and protein sequences.
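The integration of these heterogeneous record types can be illustrated with a toy normaliser that maps source-specific entries onto one common schema. This is a hedged sketch only: the record layouts and field names for the EST, protein and genomic entries are invented for illustration and do not come from any real data provider:

```python
# Hypothetical sketch: merging heterogeneous sequence records (ESTs,
# genomic sequences, protein sequences) into one common form for analysis.

def normalise(record):
    """Map a source-specific record onto a common schema (invented fields)."""
    if record.get("type") == "est":
        return {"id": record["clone_id"], "kind": "est", "seq": record["seq"]}
    if record.get("type") == "protein":
        return {"id": record["acc"], "kind": "protein", "seq": record["residues"]}
    return {"id": record["accession"], "kind": "genomic", "seq": record["sequence"]}

sources = [
    {"type": "est", "clone_id": "EST001", "seq": "ATGGC"},
    {"type": "protein", "acc": "P12345", "residues": "MKV"},
    {"accession": "X54321", "sequence": "ATGAAA"},
]
merged = [normalise(r) for r in sources]
print([r["kind"] for r in merged])  # ['est', 'protein', 'genomic']
```

The agreed standards discussed elsewhere in this report would replace these ad hoc mappings with community-defined exchange formats.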
• Integration will require: robust databases; compute power for analysis; recording and storing the results of past searches as a resource; and data storage closely coupled to the computing power (a special problem for comparative genomics).
• Standards will be key to facilitating interoperability: the Grid might be able to drive the community to conform to standards.
• Ontologies have limited use in comparative genomics, which is often used to fill in gaps in missing data, because this use is open to erroneous interpretation.
• Security, accountability, charging, provenance, data quality, access control, limits on notification, privacy and resource utilisation are all issues that need to be resolved in the context of "the Grid".
• Links to external data sources are essential.

Scalability
• Databases and algorithms need to be able to scale as resources grow. The issue of how to query across distributed databases also needs to be addressed. The Polar* project might help address this issue.

Education and Awareness
• If there is to be a move towards Grid applications, the bioinformatics sector needs extensive education about Grid technology and its capabilities.
• The community will also need help with code development and parallelisation in relation to the use of high performance computing.

Mapping Data
• Analytical techniques for special problems in mapping, such as gene order and content studies, high-throughput techniques (eg SNPs) and haplotype construction, will need a large amount of compute power.

Possible Grid Projects
• Explore the potential for linking to and building on the BBSRC-funded Microbase project (Wipat) to explore parallelisation.
• A project to develop a Grid resource for handling and analysing comparative genomics data.
• A pilot in establishing a shared resource for phylogeny studies and alignments.