DRAFT REPORT OF THE BBSRC-FUNDED GENOMICS MEETS GRID
MEETING HELD AT THE NATIONAL e-SCIENCE CENTRE, EDINBURGH
ON 11 – 12 NOVEMBER 2002
EXECUTIVE SUMMARY
In response to a request from the genomics groups supported by BBSRC for more
information on Grid technology and what it might offer, BBSRC organised a
workshop to bring together key members of the genomics and e-Science communities
plus the main data providers. The workshop aimed to inform the e-Science
community about the special data needs of genomics research and then to brainstorm
some possible Grid pilot projects to develop Grid technology for use in the genomics
area. The key message was that the genomics community could not begin to share
data on a large scale (via “the Grid”) until a set of standards had been agreed and
methods for interoperability put in place.
The output from the meeting was a series of suggested projects and activities:
• Preparation of a case for a community project to drive biological resource connectivity and data exchange projects and to promote formation of standards. This project should be taken forward:
− BBSRC to meet with other funding agencies (MRC, Wellcome Trust, EPSRC, NERC) to consider a mechanism by which this project might be supported. The group was anxious to avoid double jeopardy in peer review.
− A joint funding agency-supported Town meeting to bring together interested parties to start putting the proposal together (should include key data providers in the UK and overseas – eg the research councils, Wellcome Trust, EMBL/EBI, NCBI, USDA, Whitehead).
− Establishment of a steering group to write the proposal for the community project.
• A collection of Grid pilot projects, some of which were described in the report-back session, to be submitted to BBSRC’s research committees (next closing date: 20 January 2003).
• The community to identify training needs and seek input from the National e-Science Centre, with help and advice from the Grid Support Centre team.
DRAFT REPORT OF THE BBSRC-FUNDED GENOMICS MEETS GRID
MEETING HELD AT THE NATIONAL e-SCIENCE CENTRE, EDINBURGH
ON 11 – 12 NOVEMBER 2002
Background and Purpose of the Meeting
1. In July 2002, BBSRC hosted a Forum at Robinson College, Cambridge, for the
groups funded through its Investigating Gene Function (IGF) Initiative. One of
the talks at the Forum was a presentation from Dr David Boyd, Rutherford
Appleton Laboratory, on the scope of the Grid Support Centre Post funded by
BBSRC through the Bioinformatics and e-Science Initiative. The audience heard
that this “post” (Pete Oliver and Richard Wong) had the remit to provide support
for the BBSRC research community to help develop Grid technology. In
particular, Pete Oliver and Richard Wong had been asked to help the genomics
community supported through IGF and the BBSRC Institutes to establish some
pump-priming, pilot projects to fill the gaps in the current BBSRC e-Science
portfolio.
2. Immediately after David Boyd’s talk, the bioinformaticians funded on the IGF
projects were invited to attend a breakout session to consider key data issues for
the community and to discuss how these issues might be addressed. In particular,
the discussion focussed on opportunities and barriers to developing Grid
technology for the genomics community given that BBSRC expected to be taking
forward a further e-Science initiative on integrating data following the
announcement of the Spending Review 2002.
3. At the breakout session, it emerged that one of the main barriers to developing
Grid technology for genomics was a lack of understanding about the Grid and
what it might offer this community over and above what could already be
achieved using existing tools. Many of the bioinformaticians had attended Grid
meetings featuring presentations and demonstrations from the Grid pilot projects
but this had not helped them to understand the basic concept of “the Grid”, the
technology behind it and how it might work in the genomics context. The group
asked if BBSRC could organise a meeting to help address this problem by
offering a forum for open debate between a small number of individuals from the
Genomics and Grid communities.
Genomics Meets Grid Meeting
4. The meeting was arranged for 11 and 12 November 2002 at the National e-Science Centre, Edinburgh and was by invitation only, limited to around 30
participating delegates (link to delegates list) plus observers. The aims of the
meeting were:
• To help the Grid community understand the key data issues for the IGF/BBSRC Institute scientists (through a series of short presentations)
• To obtain a view of the key issues from the major data providers (EBI and the Sanger Centre)
• To help the genomics community understand Grid technology and what elements of it could offer solutions to its data problems (through open discussion)
• To identify what issues needed to be addressed before Grid technology could be encompassed by this community (through brainstorming)
• To identify a series of activities to establish Grid projects in the IGF community and the Institutes, including pilot projects suitable for funding through responsive mode (in breakout groups)
• To establish links with the Grid Support Centre and identify how Pete Oliver and Richard Wong might help
Talks
Details of talks from the meeting can be found on the NeSC Web site (link to talks).
Using the Grid for Genomics
5. The meeting started with an introduction from David Boyd on the concept of the
Grid, its components, capabilities and services offered both by the Grid Support
Centre and the National e-Science Centre.
6. His definition of the Grid was: “technology that enables persistent shared use of
distributed resources – computing, data, visualisation, instruments, network –
without needing to know in advance where these are or who owns them”.
7. He explained that the Grid would work by combining technology at several levels
and then applying this package to large-scale research problems (applications, eg
protein simulation, climate modelling). The Grid middleware was now being
developed as a service model of distributed computing based on Web services –
the Open Grid Services Architecture (OGSA). This new toolkit would also
include support for accessing structured data in databases through OGSA-DAI (Data Access and Integration), an important
development for genomics.
8. The Grid Engineering Task Force was co-ordinating the linkage of computing
resources at all UK e-Science centres, interconnected by the SuperJANET4 multi-gigabit backbone and Regional Networks. The Hinxton site would be included in
the network.
9. The UK Grid Support Centre had been established to help all e-Science
programme participants to install and use Grid software and also had a role in
raising awareness about the Grid, offering help through a Help Desk and running a
Certificate Authority.
10. BBSRC had provided funding for a support-facility for BBSRC-supported
researchers, especially those not funded through the e-Science programme and, as
well as providing technical support, would assist groups in developing Grid
demonstration projects, establishing mechanisms to share Beowulf clusters and
organising training courses at NeSC. Pete Oliver and Richard Wong would be
proactive within the community, identifying opportunities and providing technical
support for installing and running Grid software. The group would work in close
3
collaboration with BBSRC IT Services (BITS) and would be overseen by a
steering group chaired by Roger Gillam.
11. Three possible projects had already been identified following a meeting with the
Institute of Grassland and Environmental Research (IGER):
• Remote BLAST jobs – the aim is to use the Grid to get faster turnaround on multi-processor resources, enabling larger databases to be searched using the Beowulf cluster at RAL (a minimal submission sketch follows this list)
• Hyper-spectral image analysis – aims to speed up image analysis and provide access to data archiving, making use of parallel algorithms running on Beowulf clusters; the project especially focuses on the machine-learning end of data analysis, which it is hoped can be improved using the Grid approach
• PHYLIP – aims to improve turn-around time of phylogenetic analysis of large multi-gene families using a parallel version of the PHYLIP programme
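To make the remote BLAST pilot concrete, the sketch below shows one way a batch of searches might be handed to a Condor-managed cluster. It is a minimal illustration only: the binary path, database location and query layout are assumptions, not details of the RAL setup.

import subprocess
from pathlib import Path

# Assumed, site-specific locations; a real cluster configuration would differ.
BLAST_BINARY = "/usr/local/bin/blastall"
DATABASE = "/data/blastdb/nr"
QUERY_DIR = Path("queries")  # one FASTA file per query batch

def write_submit_file(query_files, submit_path="blast_jobs.sub"):
    """Write a Condor submit description that queues one BLAST job per query file."""
    lines = [
        "universe   = vanilla",
        f"executable = {BLAST_BINARY}",
        "output     = blast_$(Process).out",
        "error      = blast_$(Process).err",
        "log        = blast_jobs.log",
    ]
    for query in query_files:
        lines.append(f"arguments  = -p blastp -d {DATABASE} -i {query}")
        lines.append("queue")
    Path(submit_path).write_text("\n".join(lines) + "\n")
    return submit_path

if __name__ == "__main__":
    queries = sorted(QUERY_DIR.glob("*.fasta"))
    submit_file = write_submit_file(queries)
    # Hand the batch to the local Condor scheduler; Grid middleware would be
    # responsible for routing jobs to remote multi-processor resources.
    subprocess.run(["condor_submit", submit_file], check=True)

Splitting the query set rather than the database keeps each job independent, which is what makes this kind of search an easy first target for Grid scheduling.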
12. The key challenges for BBSRC scientists were:
• Most Grid software is implemented on LINUX platforms; genomics data tends to be handled in a Windows-only environment
• Network bandwidth – many Grid sites now have 1 Gbit/sec; BBSRC researchers often only have 2 Mbit/sec available to them
• Groups will need to provide access through firewalls to enable remote Grid-based use of machines; a mechanism has now been identified but it needs implementing widely
Genomics Challenges
13. The next part of the meeting comprised a series of 10-minute presentations from
members of the genomics community to try to convey the complexity of the
genomics data and the key issues facing those wishing to use it. Early talks also
contained a description of some of the main techniques and terminologies:
Andy Law, Roslin Institute: Farm Animal Genomics
Jo Dicks, John Innes Centre: Analysis of Crop Plant Genomes
Alan Archibald, Roslin Institute: Farm Animal Genomics
David Marshall, Scottish Crop Research Institute: Some Jolly Fun with Barley ESTs
Helen Ougham, Institute of Grassland and Environmental Research: IGER
Research
Gary Barker, Bristol University: Mining SNPs from Public Sequence Databases
Sean May, Nottingham Arabidopsis Stock Centre: Nottingham Arabidopsis Stock
Centre
Gos Micklem, Cambridge University:
Tom Oldfield, EBI: Macromolecular Structure Database Project EMSD
James Cuff, Sanger Centre: Ensembl Compute: Grid Issues
14. In responding to these talks, the “Grid technologists” made the following
comments and observations:
• “Integration” includes both data and instrumentation. There was generally a large amount of data that was widely distributed and fluid (frequently changing); issues relating to computation were: lack of resources, ease of parallelisation and lack of expertise.
• One of the main challenges was to track queries so that they could be repeated; machine learning and artificial intelligence would be important techniques to employ.
• Policy on sharing information varied between groups, often depending on whether industry was involved in the project. Data might be embargoed for up to a year and needed to be well protected in that time. Security was therefore a major issue for this area, especially for medical data.
• A number of key issues were orthogonal to whether the Grid was used or not. For example, development of tools for data integration and data mining was important for the genomics community but could be pursued independently of the Grid. Grid technology would then become the “plumbing” when data had to be linked over distance.
• The EBI had spent 6 years (100 person-years) developing a relational database that recognised the hierarchy found in protein structures. This database was clean and self-consistent, and legacy data still needed to be added and corrected. The project had identified the necessity of cleaning and organising all aspects of biological data, as well as defining a data interchange method – probably as XML (a minimal sketch follows this list).
• Access to sufficient computing resource was an issue for most groups. Many labs had Linux farms of around 20 nodes but would require access to higher-performance computing in the future. IBM was currently trying to encourage the use of HPC(X) by life science groups and this was an opportunity for the genomics groups. Bandwidth was also a major consideration for data sharing across the Grid.
• Some of the EPSRC-funded pilot projects were already developing technology that might be appropriate for genomics data sharing and there was scope for collaboration, eg Comb-e-Chem, MyGrid and DiscoveryNet.
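As an illustration of the point about a common interchange format (“probably as XML”), the short sketch below serialises and re-reads a minimal sequence record. The element names are invented for the example and are not a proposed standard.

import xml.etree.ElementTree as ET

def sequence_record_to_xml(accession, organism, residues):
    """Serialise a minimal sequence record to XML for exchange between sites."""
    record = ET.Element("sequenceRecord")  # hypothetical element names
    ET.SubElement(record, "accession").text = accession
    ET.SubElement(record, "organism").text = organism
    ET.SubElement(record, "residues").text = residues
    return ET.tostring(record, encoding="unicode")

def xml_to_sequence_record(xml_text):
    """Parse the same structure back into a plain dictionary."""
    record = ET.fromstring(xml_text)
    return {child.tag: child.text for child in record}

if __name__ == "__main__":
    xml_text = sequence_record_to_xml("Q12345", "Hordeum vulgare", "MSTNPKPQRK")
    print(xml_text)
    print(xml_to_sequence_record(xml_text))

Agreeing element names and their meanings across the major data providers is the standards work the meeting kept returning to; the serialisation itself is the easy part.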
Plenary Brain-Storming
15. Norman Paton led the initial part of the discussion session that aimed to identify
topics for the breakout groups to discuss. He steered the group towards thinking
about problems and possible solutions in terms of services. The assumption was
that Web and Grid services would come to have no obvious boundary. The
challenge would be to identify the kinds of services used by the genomics
community and then attempt to understand the relevant implementation
technologies.
16. He suggested that the current status of genomics services is:
• There are many of them
• There is little standardisation
• Few support complex queries – compute resources are an issue
• Few support integration between resources
• Many are organism-specific and thus re-invention is an issue
17. He also explained that services are provided on a number of different levels:
• Generic storage or analysis services, eg database access, machine learning
• Data-type-specific services, eg access to transcriptome data, clustering of transcriptome data
• Multiple data-type services – correlate protein and mRNA expression data
• Bespoke services – correlate protein and mRNA expression data in yeast (a toy sketch of these levels follows this list)
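The four levels can be pictured as layers of code, each building on the one below. The toy sketch that follows uses invented function names and a placeholder clustering routine purely to show the layering; it is not an API proposal.

from typing import Dict, List

# Level 1 – generic analysis service: knows nothing about biology.
def cluster(vectors: List[List[float]], k: int) -> List[int]:
    """Placeholder for a generic clustering service (a real one would wrap a proper algorithm)."""
    return [i % k for i in range(len(vectors))]

# Level 2 – data-type-specific service: knows about transcriptome data.
def cluster_transcriptome(expression: Dict[str, List[float]], k: int) -> Dict[str, int]:
    genes = list(expression)
    labels = cluster([expression[g] for g in genes], k)
    return dict(zip(genes, labels))

# Level 3 – multiple data-type service: correlates protein and mRNA expression.
def correlate_protein_mrna(protein: Dict[str, float], mrna: Dict[str, float]) -> float:
    """Pearson correlation of protein and mRNA levels over the genes present in both."""
    shared = sorted(set(protein) & set(mrna))
    xs = [protein[g] for g in shared]
    ys = [mrna[g] for g in shared]
    n = len(shared)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Level 4 – bespoke service: the same correlation restricted to one organism's data sets.
def correlate_protein_mrna_yeast(yeast_protein, yeast_mrna):
    return correlate_protein_mrna(yeast_protein, yeast_mrna)

The further down the stack a capability sits, the more communities can reuse it, which is the argument for standardising the generic layers first.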
18. The following points were raised in discussion:
• Ease of use is important but users should also be able to understand what is in the “black box”.
• What are the pros and cons of Web services vs CORBA? Web services emphasise easy access to data, ie text, and currently have the momentum, but standards are an issue: there is no alternative to XML at present. “Overheads” are also an issue for Web services. CORBA is already widely used by the community and the OMG has defined standards. (A minimal sketch of a text-based request follows this list.)
• Definition of standards will be key to developing Grid technology and the bulk data providers (eg EBI and Sanger) need to be involved. It was noted that the EBI, Sanger and NCBI all have different standards that need to be integrated.
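To illustrate why Web services were seen as the lighter-weight option, the sketch below posts a small XML query over plain HTTP using only the standard library. The endpoint URL and message format are invented for the example.

import urllib.request
import xml.etree.ElementTree as ET

def post_xml_query(endpoint: str, accession: str) -> ET.Element:
    """POST a small XML query document and parse the XML response."""
    query = ET.Element("query")  # hypothetical message format
    ET.SubElement(query, "accession").text = accession
    payload = ET.tostring(query)
    request = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "text/xml"}
    )
    with urllib.request.urlopen(request) as response:
        return ET.fromstring(response.read())

# Example call (would only work against a real service):
# result = post_xml_query("http://example.org/sequence-service", "Q12345")

A CORBA equivalent would need an agreed IDL interface and an ORB at both ends; the trade-off discussed above is that the XML route is easier to adopt but leaves the standards question open.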
19. The Group agreed that the breakout groups should be based around the main
biological drivers: Comparative Genomics, Functional Genomics and Annotation.
Issues that the groups should consider included: computing power, networking
requirements, security, databases, scalability, data quality and visualisation.
Breakout Sessions
20. The breakout groups met briefly at the end of the first day to consider how they
would operate on the following day and to appoint a rapporteur. Membership of
each group was – click on each heading for summaries of the discussions at each
group:
Comparative Genomics: Chair: Jo Dicks, Rapporteur: Neil Wipat, Dave Marshall,
Andy Law, Pete Oliver, Richard Wong, Chris Thompson.
Functional Genomics: Chair: Gos Micklem, Rapporteur: Norman Paton, Helen Ougham, David Boyd(?), Adrian Pugh.
Annotation: Chair: James Cuff, Rapporteur: Alan Robinson, Kerstin Kleese, Thorsten
Forster, Tom Oldfield, David Goodwin, Colin Edwards, Gavin McCance, Alex Gray,
Liz Ourmozdi.
Report-Back
Functional Genomics
21. The Functional Genomics group identified the key issues facing functional
genomics:
• The main need for a functional genomics project was to have an efficient sequence analysis service to describe sequence, matrix and alignment. The service would include a collection of operations, including simple search (algorithm, sequence, matrix) to provide a list of alignments and “All against All” search (algorithm, genome (x2), matrix) to provide a list of alignments. The support service must be able to operate over multiple implementation environments; options include Grid middleware (Condor farm), standard facilities and supercomputer facilities. CPU time might be “bought” from undergraduate labs. (A sketch of this service interface follows this list.)
• Computational issues included the need for replication of performance (mirrors, multiple tiers) and programmable interfaces (some support at NCBI, BioX experiences with clean interfaces). Replication is a classic Grid functionality, requiring specialised transport for shipping large amounts of data across the network and replication services. For All against All searches, current computational requirements exceed supply.
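The two operations described in the first bullet above can be written down as a service interface. The sketch below is only a shape for discussion: the type names are invented, and real implementations could sit behind it on a Condor farm, a standard facility or a supercomputer.

from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class Alignment:
    query_id: str
    subject_id: str
    score: float

class SequenceAnalysisService(Protocol):
    """Abstract shape of the sequence analysis service discussed above."""

    def simple_search(self, algorithm: str, sequence: str, matrix: str) -> List[Alignment]:
        """(algorithm, sequence, matrix) -> list of alignments."""
        ...

    def all_against_all(self, algorithm: str, genome_a: str, genome_b: str, matrix: str) -> List[Alignment]:
        """(algorithm, genome x2, matrix) -> list of alignments."""
        ...

Because the interface says nothing about where the work runs, the same client code could be pointed at any of the implementation environments listed above.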
22. The group highlighted the Transcriptome as a special case. Microarrays can
measure many thousands of transcript levels at a time. Transcript levels can also
be compared at different points in time. A transcriptome service would
therefore need: normalisation and quality assurance, clustering of individual
points and collections, spot histories and clustering (what’s happening with a
particular gene?), search/query for array of interest. Computational issues include
the need to perform analyses over large experiments (eg clustering over 1000
arrays) and there may be tiered data replication.
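Two of the simpler operations such a transcriptome service would need – a spot history and a comparison between two time points – can be sketched as follows. The data layout is an assumption made only for the example.

from typing import Dict, List

# Hypothetical layout: each array is a mapping from gene identifier to a
# normalised expression ratio for that hybridisation.
Array = Dict[str, float]

def spot_history(gene: str, arrays: List[Array]) -> List[float]:
    """'What's happening with a particular gene?' -- its value across all arrays."""
    return [array.get(gene, float("nan")) for array in arrays]

def pairwise_difference(gene: str, earlier: Array, later: Array) -> float:
    """Compare transcript levels for one gene at two time points."""
    return later[gene] - earlier[gene]

if __name__ == "__main__":
    arrays = [{"geneA": 1.2, "geneB": 0.4}, {"geneA": 2.1, "geneB": 0.5}]
    print(spot_history("geneA", arrays))          # [1.2, 2.1]
    print(pairwise_difference("geneA", *arrays))  # ~0.9

Clustering over 1000 arrays, as mentioned above, is where the computational demand appears; the per-gene queries shown here are cheap but must be answerable across large, possibly replicated, collections.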
23. In a forward look, the group predicted that a service might be developed with
interfaces for all the “omes”. The first step would be to develop individual
services and each would have specific computational challenges. Beyond this,
there would be a need to perform requests over multiple “omes”. It would be
essential to ensure good metadata and to plan for interoperation. There will be
specific and difficult challenges associated with data on phenotype, populations
etc.
24. In discussion, the group suggested that the Grid Pilot to develop a BLAST service
with IGER was one step towards developing an individual service. Data
replication was an issue around the different sites regardless of “the Grid” and this
activity would require increased bandwidth and disc space.
25. In summary, the functional genomics group identified the need to develop
consistent and “cleaned up” interfaces and to identify the computational problems
that needed to be addressed.
Comparative Genomics
26. The comparative genomics group highlighted the fact that this community would
also need to draw on a large amount of data: sequences, ESTs, whole genome,
protein. The key issues to be addressed were: availability of databases and
compute power for analysis.
27. In terms of data, the key issues raised were:
• Results of data searches needed to be published and this could be achieved through a Grid service resource.
• Data needs to be closely coupled to computer hardware (a special problem for comparative genomics).
• The area needs more databases and they need to be interoperable.
• Standardisation is key to interoperability (sharing data) and the Grid could drive the process of standardisation. Standards for sequence representation are particularly important.
• Comparative genomics is used for gap filling for incomplete sequences and for spotting errors/mis-assemblies. There is therefore a link to annotation. However, ontologies have limited use for genome comparison.
• Mapping data will require special attention because it includes studies on gene order and content, high-throughput techniques such as SNPs, and haplotype construction.
28. Scalability issues were:
• There is a need to build in access to additional computer resources as the data problem grows.
• Databases need to scale and querying needs to happen across multiple databases. This could be facilitated by developing links with the Polar* project (http://www.ncl.ac.uk/polar/). A minimal sketch of fanning a query out across several databases follows this list.
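The sketch below illustrates that fan-out pattern. The two “databases” are stand-in callables; a real deployment would reach remote services (for example through OGSA-DAI or the Polar* work) over the network.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Each "database" is represented here by a local callable; in a Grid setting
# these would be remote query services.
QueryFn = Callable[[str], List[dict]]

def query_all(databases: Dict[str, QueryFn], term: str) -> Dict[str, List[dict]]:
    """Send the same query to every database in parallel and collect the answers."""
    with ThreadPoolExecutor(max_workers=len(databases)) as pool:
        futures = {name: pool.submit(fn, term) for name, fn in databases.items()}
        return {name: future.result() for name, future in futures.items()}

if __name__ == "__main__":
    databases = {
        "cereals_db": lambda term: [{"hit": f"{term} in barley"}],
        "brassica_db": lambda term: [{"hit": f"{term} in brassica"}],
    }
    print(query_all(databases, "EST cluster 42"))

The merging step is where scalability problems surface: result sizes, time-outs and partial failures all have to be handled once more than a handful of databases are involved.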
29. A fully-operational Grid system would need to address security issues, especially
for industry, in the following areas:
• Access control.
• Accountability, especially for resource utilisation.
• Data provenance, quality and tracking of versions.
• Defining sensible strategies for notification.
30. In order for projects to be established and taken forward in the Grid area, the
community would require education on what Grid technology is and how to use
it. Help with code parallelisation would also be required.
31. The group had identified some possible Grid projects:
• Explore potential for linking and building on the BBSRC-funded Microbase project (Wipat) to explore distributed querying.
• A project to develop a Grid resource for storing and presenting the results of comparative genomics studies to allow further analysis, ie a support service for data retrieval and analysis (ComparaGrid).
• A pilot in establishing a shared resource for phylogeny studies and alignments, encompassing simultaneous phylogeny alignment and parallelisation of MCM algorithms (PhyloGrid).
32. In discussion, the group agreed that Grid technology could offer comparative
genomics: increased compute power, data interoperability and a service to run
code remotely with appropriate visualisation. In order to move towards
encompassing Grid technology, data resources would first need to be hooked
together in a programmable form.
Annotation
33. The annotation group first defined annotation as: adding value to data, facilitating
interpretation of data and cross-referencing within data, helping quality control by
allowing some provenance information to be stored. It was acknowledged that
different users would require different views of annotation dependent on what
they were trying to do.
34. The path taken through a cross-database enquiry would depend on what
annotation was seen on the way and what cross-references to other resources were
provided. The query would also pass through a range of different archives,
systems, architectures and data types ie a biologist needs to navigate through a
matrix of resources. It would also be important to ensure information on different
versions was stored.
35. Grid technology could be used to establish interfaces to allow easier navigation
between resources. Service providers needed to agree how to put this in place and
agree what links needed to be made and how to present the links.
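As a toy illustration of that navigation, the sketch below follows a cross-reference from a gene record to a protein record. Both “databases” are in-memory dictionaries and all identifiers are invented.

from typing import Dict, List

# Toy stand-ins for two annotation resources; in practice these would be
# remote databases exposed through agreed Grid/Web service interfaces.
GENE_DB: Dict[str, dict] = {
    "geneA": {"protein_xref": "P001", "description": "putative kinase"},
}
PROTEIN_DB: Dict[str, dict] = {
    "P001": {"structure_xref": "1ABC", "function": "ATP binding"},
}

def navigate(gene_id: str) -> List[dict]:
    """Follow cross-references from a gene record to its protein record."""
    path = []
    gene = GENE_DB[gene_id]
    path.append({"resource": "gene_db", "record": gene})
    protein = PROTEIN_DB[gene["protein_xref"]]
    path.append({"resource": "protein_db", "record": protein})
    return path

if __name__ == "__main__":
    for step in navigate("geneA"):
        print(step["resource"], step["record"])

The agreement the group asks for is precisely over which cross-reference fields exist and what they point to, so that such a path can be followed mechanically across real resources.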
36. In order to take this forward, the group suggested a community project to drive
biological resource connectivity and data exchange projects and to promote
formation of standards. Features of the project would include:
• A large project with community buy-in: 3 years with 5–10 posts and funding for meetings.
• Issues to address: how to handle the compute for similarity searches, organisation of output and quality, provision of network requirements (bandwidth) and networking between people (video-conferencing and Access Grid), security issues, scalability, replication and mirroring (use of XML for interoperation and data exchange).
• One or a few user-driven projects with the view to evolve into collaborations for more generic application. Address IPR and security issues.
• Alpha and Beta testers to try code.
• Registry and discovery services, starting with UDDI and increasing in complexity (a toy registry sketch follows this list).
• Definition of the minimum data that needs to be captured.
• Databases: development of "wrappers", XML standards for exchange, handling mirrors and replication.
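For a sense of what the simplest possible registry and discovery service looks like before any UDDI machinery is added, here is a toy in-memory sketch; the entry fields and example endpoint are invented.

from typing import List, NamedTuple

class ServiceEntry(NamedTuple):
    name: str
    data_types: List[str]  # e.g. ["sequence", "transcriptome"]
    endpoint: str          # hypothetical URL of the service

class SimpleRegistry:
    """Toy in-memory stand-in for a UDDI-style registry and discovery service."""

    def __init__(self) -> None:
        self._entries: List[ServiceEntry] = []

    def register(self, entry: ServiceEntry) -> None:
        self._entries.append(entry)

    def discover(self, data_type: str) -> List[ServiceEntry]:
        return [e for e in self._entries if data_type in e.data_types]

if __name__ == "__main__":
    registry = SimpleRegistry()
    registry.register(ServiceEntry("blast-service", ["sequence"], "http://example.org/blast"))
    print(registry.discover("sequence"))

UDDI and, later, ontology-based registries add richer descriptions and remote access, but the register/discover pair of operations stays the same.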
SUMMARY
The output from the meeting was a series of suggested projects and activities:
• Preparation of a case for a community project to drive biological resource connectivity and data exchange projects and to promote formation of standards. This project should be taken forward:
− BBSRC to meet with other funding agencies (MRC, Wellcome Trust, EPSRC, NERC) to consider a mechanism by which this project might be supported. The group was anxious to avoid double jeopardy in peer review.
− A joint funding agency-supported Town meeting to bring together interested parties to start putting the proposal together (should include key data providers in the UK and overseas – eg the research councils, Wellcome Trust, EMBL/EBI, NCBI, USDA, Whitehead).
− Establishment of a steering group to write the proposal for the community project.
• A collection of Grid pilot projects, some of which were described in the report-back session, to be submitted to BBSRC’s research committees (next closing date: 20 January 2003).
• The community to identify training needs and seek input from the National e-Science Centre, with help and advice from the Grid Support Centre team.
Other Actions
• David Boyd to ensure contact details and appropriate links were included in the Grid Support Centre database.
• Chris Thompson to establish a web page for Grid technology on the BBSRC site and consider setting up a discussion board.
ATTENDEES AT THE BBSRC “GENOMICS MEETS GRID” WORKSHOP
11-12 NOVEMBER
NATIONAL e-SCIENCE CENTRE, EDINBURGH
(Each entry gives: name, institution, postal address; e-mail; Tel.; Fax, where supplied.)

Rob Allan, CLRC e-Science Centre, Daresbury Laboratory, Daresbury, Warrington, WA4 4AD; r.j.allan@dl.ac.uk; Tel. 01925 603207; Fax 01925 603634
Alan L Archibald, Roslin Institute, Roslin, Midlothian EH25 9PS; alan.archibald@bbsrc.ac.uk; Tel. 0131-527-4200; Fax 0131-440-0434
Malcolm Atkinson, National e-Science Centre; mpa@dcs.gla.ac.uk
Gary Barker, Rothamsted Research, Wild Country Lane, Long Ashton, Bristol BS41 9AF; gary.barker@bristol.ac.uk; Tel. 01275-549417; Fax 01275-394007
Rob Baxter, EPCC & NeSC – the University of Edinburgh; R.Baxter@epcc.ed.ac.uk; Tel. +44 (0)131 650 4989 or +44 (0)131 651 4041
David Boyd, CLRC Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX; d.r.s.boyd@rl.ac.uk; Tel. 01235-446167; Fax 01235-445945
James Cuff, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge; james@sanger.ac.uk; Tel. 01223-494880; Fax 01223-494919
Govind Chandra, JIC, Norwich Research Park, Colney, Norwich NR4 7UH; Govind.chandra@bbsrc.ac.uk
Paul Donachy, Queens University Belfast; P.Donachy@qub.ac.uk
Jo Dicks, John Innes Centre, Norwich Research Park, Colney, Norwich NR4 7UH; jo.dicks@bbsrc.ac.uk; Tel. 01603-450597; Fax 01603-450595
Colin Edwards, BBSRC Bioscience IT Services, West Common, Harpenden, Herts AL5 2JE; Colin.Edwards@bbsrc.ac.uk; Tel. 01582-714941; Fax 01582-714901
Thorsten Forster, Scottish Centre for Genomic Technology and Informatics, The University of Edinburgh, Medical School, Little France Crescent, Edinburgh EH16 4SB; Thorsten.Forster@ed.ac.uk; Tel. 0131 242 6287
Peter Ghazal, Scottish Centre for Genomic Technology and Informatics, The University of Edinburgh, Medical School, Little France Crescent, Edinburgh EH16 4SB; P.Ghazal@ed.ac.uk; Tel. 0131 242 6288
Graeme Gill, NASC, Nottingham University, LE12 5RD; graeme@arabidopsis.info; Tel. 0115-9513091; Fax 0115-9513297
Roger Gillam, BBSRC Bioscience IT Services, West Common, Harpenden, Herts AL5 2JE; Roger.Gillam@bbsrc.ac.uk; Tel. 07702-562866; Fax 01582-714951
David Goodwin, Institute of Biological Science, Cledwyn Building, University of Wales Aberystwyth, Aberystwyth, Ceredigion SY23 2JS; dwg@aber.ac.uk; Tel. 01970-622284
W Alec Gray, University of Wales, Cardiff: Cardiff e-Science Centre; w.a.gray@cs.cardiff.ac.uk
Mark Hayes, Cambridge e-Science Centre, Centre for Mathematical Sciences, Wilberforce Road, Cambridge, CB3 0WA; Mah1002@cam.ac.uk; Tel. 01223 756 251; Fax 01223 756 900
Kerstin Kleese van Dam, CLRC e-Science Centre, CLRC, Daresbury Laboratory, Warrington WA4 4AD; k.kleese@dl.ac.uk; Tel. 01925-603832; Fax 01925-603634
Andy Law, Roslin Institute, Roslin BioCentre, Roslin, Midlothian EH25 9PS; Andy.Law@bbsrc.ac.uk; Tel. 0131-527-4241; Fax 0131-440-0434
David Marshall, Scottish Crop Research Institute, SCRI, Invergowrie, Dundee, DD2 5DA; d.marshall@scri.sari.ac.uk; Tel. 01382-562731
Sean May, NASC, Nottingham University, Plant Sciences, Sutton Bonnington Campus, Loughborough LE12 5RD; sean@arabidopsis.info; Tel. 0115-9513091; Fax 0115-9513297
Gavin McCance, University of Glasgow, Dept of Physics and Astronomy, Kelvin Building, G12 8QQ; g.mccance@physics.gla.ac.uk; Tel. 0141 330 5316; Fax 0141 330 5881
Gos Micklem, Cambridge University, Genetics Dept., Cambridge University, Downing Street, Cambridge CB2 3EH; gos.micklem@gen.cam.ac.uk; Tel. 01223-765281; Fax 01223-333992
Tom Oldfield, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD; oldfield@ebi.ac.uk; Tel. 01223 492526; Fax 01223 494487
Peter Oliver, CLRC, Atlas Centre, Rutherford Appleton Laboratory, Chilton, Didcot, Oxon OX11 0QX; p.m.oliver@rl.ac.uk; Tel. 01235-445164; Fax 01235-446646
Liz Ourmozdi, BBSRC, Polaris House, North Star Avenue, Swindon SN2 1UH; Elizabeth.ourmozdi@bbsrc.ac.uk; Tel. 01793-413282; Fax 01793-413234
Helen Ougham, IGER, Plas Gogerddan, Aberystwyth, Ceredigion SY23 3RE; helen.ougham@bbsrc.ac.uk; Tel. 01970-823094; Fax 01970-823242
Norman Paton, Department of Computer Science, Manchester University, Oxford Road, Manchester M13 9PL; norm@cs.man.ac.uk; Tel. 0161-275-6910; Fax 0161-275-6236
Adrian Pugh, BBSRC, Polaris House, North Star Avenue, Swindon SN2 1UH; Adrian.pugh@bbsrc.ac.uk; Tel. 01793-413229; Fax 01793-413234
Alan J Robinson, EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SD; Alan@ebi.ac.uk; Tel. 01223-494444; Fax 01223-494468
Andrew Simpson, Software Engineering Centre, Wolfson Building, Parks Road, Oxford, OX1 3QD; Andrew.simpson@comlab.ox.ac.uk; Tel. 01865 283514
Chris Thompson, BBSRC, Polaris House, North Star Avenue, Swindon SN2 1UH; Chris.thompson@bbsrc.ac.uk; Tel. 01793-413393; Fax 01793-413234
Anil Wipat (Neil), University of Leeds, Computing Science, Claremont Tower, Newcastle-upon-Tyne, NE1 7RU; Anil.wipat@newcastle.ac.uk; Tel. 0191-222-8213; Fax 0191-222-8232
Richard Wong, Rutherford Appleton Laboratory, Atlas Centre, eScience Centre, R27, Chilton, Oxfordshire OX11 0QX; r.y.m.wong@rl.ac.uk; Tel. 01235-446075
FUNCTIONAL GENOMICS BREAKOUT
Gos Micklem – Chair; Norman Paton - Rapporteur.
1. The main topics considered were BLAST and array data. It was felt that the
two had very similar Grid-needs. Examples of the needs were worked through.
2. A ‘club’ could be formed as a backbone to the services. This might be the IGF centres, for instance. Its needs would be:
a. more dynamic scheduling of computing power to deal with analyses
b. the ability to move code more dynamically between centres, and to replicate information (to help with a)
c. high-speed links
d. people with (Grid) expertise to run them – although this could be centralised.
3. Computing power of undergraduate teaching nodes could be utilised to
increase compute power. Condor might be used. (There is an issue of
ownership here).
4. Most user interest (in order of demand) from array users is for:
a. spot history
b. pairwise comparisons
c. clustering
5. In terms of security the main issue was embargo of data before general release.
Authentication was mainly needed between members of the ‘club’.
6. There was a suggestion that there should be an ‘IGF-hackerthon’ to further
discuss the issues. (Would be subsumed by wider meeting).
7. I have mocked up a description of the possible ‘Genomics Grid’, and some of
the issues.
[Figure: mock-up of a possible ‘Genomics Grid’. A ‘club’ of IGF/genomics databases and their mirrors, plus other external database nodes, share data replication, compute and high-bandwidth links, with authentication between members; individual users reach the club and their own databases over lower-bandwidth query/response links.]
ANNOTATION BREAKOUT
(Chair: James Cuff, Reporter: Alan Robinson)
The group discussed the following:
1. The meaning of annotation
• Capturing, interpreting and adding value to raw data.
• Providing understanding.
• Allowing users to find things out by distilling information and data.
• Providing user knowledge of how, why and what was done initially.
The example given of what annotation means was that bioinformaticists generally don’t understand protein structure, so a system cannot give too much information to the user at once. There should be a ‘folded down’ annotation system so the user can delve deeper if specific information needs to be obtained.
Summary: Annotation has a two-tier meaning: (i) the linking of one thing to another and (ii) explaining things to the user. This is where the Grid comes in.
2. How the Grid can help annotation
• Present problems include the lack of tools and systems to integrate data – with sequencing data it is possible to put data on top of existing data, but this can’t be done with structural data.
• Disparate data that wouldn’t normally be brought together needs to be merged – this should be driven by the biology users.
• Large sites need to be convinced to work together and ‘talk’ the same language; this will involve standardising existing (meta)data and communication between the databases so they are the same.
• Before approaching this hurdle, a critical mass of users needs to be brought on
board.
Summary: (i) need to deal with the conflict of information and (ii) need commonality
of codes and data among databases.
3. Pilot Grid projects
• The community requires a fully interconnected and navigable ‘mesh’ of resources
as work-flow may change depending on what is discovered en route. Therefore,
there will be a need to jump from one database to another through reference
points.
• Electronic submission of papers is an issue of concern as some databases have
richer information than others; ‘related’ articles need to be linked by accession
numbers.
• Before considering a Grid pilot project there are three bottlenecks to overcome: (i) talking to people, (ii) asking users what information they want to link and (iii) establishing how users want this information presented.
• Suggested project: Linkage analysis pilot project (5–10 FTE for 3 years, approx £2M). The key issues to cover include:
1. Defining standards for data exchange and interfaces/wrappers to resources
2. User and device driven visualisation
3. Similarity searches, clustering, output/quality, low-bandwidth networking, security on data, collaborations – all these issues need to be considered.
4. Project will be user-driven and community buy-in is essential for success.
5. Typical users will test out new programmes (will require persuading as
initially this will be frustrating, thus emphasising the need to bring users on
board)
6. Simple registry service required to describe this meshwork of services before
leading to the development of a more advanced ontology based registry
service (progressive over time)
COMPARATIVE GENOMICS BREAKOUT
Chair: Jo Dicks, Rapporteur: Neil Wipat
The brainstorming started with a preamble about the main issues for comparative genomics, summarised below:
• BBSRC institutes have poor connectivity – there was concern that this might negatively affect their chances of getting funding for Grid technology development.
• Compute power needs to be central and next to a big data store.
• The community needs to be able to store, archive and curate data and to be able to store results of past searches.
• One of the main problems was incomplete data-sets (partial genomes); comparative genomics could be used to fill data in.
• Standards need to be established for sequence data – perhaps the Grid system could build on a pre-existing standard? Could the research councils press grant holders to use set ontologies? Should standards be based on ontologies that were largely subjective?
• Phylogeny comparisons require high compute power and are therefore suitable for a Grid project.
• Security is an issue – as are accountability, provenance and the need for audit trails.
• Scalability is an issue – current infrastructure can’t accommodate complex queries and this is made worse by naïve users not using it efficiently.
• Database searching will be facilitated by a more efficient architecture – it can’t be done on a national scale at present; the community should collaborate with the Polar* project on distributed database searching.
• The community should wait until Globus 3 and OGSA-DAI are operational.
• The community was not in a position to use high performance computing; help was needed with code – the IBM funding to support “centres of excellence” for Life Sciences was essential.
Parallelisation
• Scalability and increased speed needed to be addressed locally before moving to a Grid environment (a minimal local-parallelisation sketch follows this list).
• Memory is an issue as well as increased CPU.
• Phylogeny alignment studies are generally tightly coupled to local machines, but alignment could be done on the Grid.
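As referenced in the first bullet, a minimal way to address speed locally is to farm independent pairwise tasks over local CPU cores; the alignment function here is a trivial stand-in for a real aligner.

from multiprocessing import Pool

def align_pair(pair):
    """Placeholder for an expensive pairwise alignment of two sequences."""
    a, b = pair
    # A trivially cheap stand-in score: shared residues at matching positions.
    return sum(1 for x, y in zip(a, b) if x == y)

def align_all_pairs(sequences, processes=4):
    """Run all pairwise alignments across local CPU cores."""
    pairs = [(a, b) for i, a in enumerate(sequences) for b in sequences[i + 1:]]
    with Pool(processes=processes) as pool:
        return pool.map(align_pair, pairs)

if __name__ == "__main__":
    print(align_all_pairs(["MSTNPK", "MSTQPK", "MATNPK"]))

The same decomposition into independent pairs is what would later be handed to a Grid scheduler, so the local exercise is not wasted effort.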
Data Integration
• Mapping will increase the amounts of data available, eg haplotype studies, and software will be needed to integrate it for analysis.
• Data includes information on ESTs, sequences and protein sequences. Integration will require: robust databases, compute power for analysis, recording and storing results of past searches as a resource, and data storage closely coupled to the computing power (a special problem for comparative genomics).
• Standards will be key to facilitating interoperability: the Grid might be able to drive the community to conform to standards. Ontologies have limited use in comparative genomics, which is often used to fill in gaps for missing data, because they are open to erroneous interpretation.
• Security, accountability, charging, provenance, data quality, access control, limits on notification, privacy and resource utilisation are all issues that need to be resolved in the context of “the Grid”.
• Links to external data sources are essential.
Scalability
• Databases and algorithms need to be able to scale as resources grow. The issue of how to query across distributed databases also needs to be addressed; the Polar* project might help address this issue.
Education and Awareness
• If there is to be a move towards Grid applications, the bioinformatics sector needs extensive education about Grid technology and its capabilities.
• The community will also need help with code development and parallelisation in relation to the use of high performance computing.
Mapping Data
• Analytical techniques for special problems in mapping, such as gene order and content studies, high-throughput techniques (eg SNPs) and haplotype construction, will need a large amount of compute power.
Possible Grid Projects
• Explore potential for linking and building on the BBSRC-funded Microbase project (Wipat) to explore parallelisation.
• A project to develop a Grid resource for handling and analysing comparative genomics data.
• A pilot in establishing a shared resource for phylogeny studies and alignments.