Using the Grid for Genomics

David Boyd
CLRC e-Science Centre
d.r.s.boyd@rl.ac.uk
http://www.e-science.clrc.ac.uk/
11 November 2002
BBSRC Genomics meets Grid Workshop

Outline
• Brief introduction to the Grid
• UK e-Science Grid
• Grid Support Centre
• BBSRC Grid Support Service

What is the Grid?
[Figure labels: The Grid; Experiments, Computers, Sensors, Data, Scientists, Displays]
Technology that enables persistent shared use of distributed resources – computing, data, visualisation, instruments, networks – without needing to know in advance where these are or who owns them

How does the Grid work?
Applications - e.g. climate modelling, protein simulation, aircraft design
Grid toolkits - e.g. data discovery, experiment control, visualisation
Grid services - e.g. resource scheduling, data transfer, security
Grid resources - e.g. computers, data archives, instruments, networks

Some components of the Grid
• Globus Toolkit v2 (GT2)
  – security based on PKI X.509 digital certificates (GSI)
  – directory service to publish information on resources (MDS)
  – resource allocation and job submission (GRAM)
  – efficient file transfer process (GridFTP)
• Condor
  – distributes and monitors work across a network of machines
  – mature workload management system for compute-intensive jobs
  – can harvest unused compute cycles with checkpointing
• Storage Resource Broker
  – uniform interface to heterogeneous distributed data resources
  – incorporates metadata catalogue for attribute-based data location

Service model of distributed computing
• Web Services
  – cross-platform distributed computing model based on Web standards, particularly XML
  – message-passing protocol for interacting with Web services (SOAP)
  – operational description of Web services (WSDL)
  – registry for publishing and discovering available Web
    services (UDDI)
  – mechanism for combining Web services into a workflow (WSFL)
• And the next version of Globus (GT3) aka . . .
• Grid Services (Open Grid Services Architecture – OGSA)
  – extension of Web services incorporating Grid security model
  – enables dynamic creation and termination of customised services
  – supports instantiated services with memory for long-lived tasks
  – can compose Grid services into complex workflows
  – will include support for accessing structured data in databases (DAI)

UK e-Science Grid
• Being assembled now - coordinated by the Grid Engineering Task Force
• Linking computing resources at all UK e-Science Centres
  – National Centre (Edinburgh+Glasgow)
  – 8 Regional Centres
  – 2 CLRC Centres (RAL+DL)
  – EBI Hinxton
  – further Centres joining soon
• Interconnected by the SuperJANET4 multi-gigabit backbone and Regional Networks
[Map labels: Glasgow, Edinburgh, Newcastle, Belfast, Manchester, DL, Oxford, Cardiff, RAL, Cambridge, Hinxton, London, Southampton]

UK Core Grid Support Centre
• Part of the e-Science Core Programme
• Led by CLRC e-Science Centre
• Team of 6 based at CLRC (RAL+DL) and Edinburgh and Manchester Universities (but actually providing access to the expertise of more than 25 people)
• Helps all e-Science programme participants to install and use Grid software quickly, easily and productively
• Offers to meet all projects to discuss requirements

Grid Support Centre Services
• Helpdesk - support@grid-support.ac.uk
  – provides access to expert technical support
• Web information resource - http://www.grid-support.ac.uk
  – offers Grid awareness and education material
• Grid Starter Kit
  – supports self installation of Grid software
• National Grid Directory Service
  – supports Grid resource discovery and access to current status
• Certification Authority (CA) – http://www.grid-support.ac.uk/ca
  – issues
    digital certificates to UK e-scientists
  – assigns a trustable digital identity to an individual
  – you need one to use the Grid!

BBSRC Grid Support Service (1)
• What it will offer:
  – support for all BBSRC researchers and institute staff
  – support for learning about, installing and applying Grid technology
  – source of digital certificates for using Grid software
  – access to technical expertise about all aspects of the Grid
  – assistance in developing demonstration Grid applications
  – support for sharing high performance Beowulf clusters between institutes
  – support for accessing large-scale data resources distributed across the Grid
  – organisation of customised Grid training courses at NeSC to meet demand
  – skills transfer to BBSRC support staff and research scientists

BBSRC Grid Support Service (2)
• How it will work:
  – dedicated BBSRC Grid support staff at CLRC/RAL
  – exploit close links with UK Grid Support Centre
  – community workshops to identify requirements
  – discussions with institutes and leading research groups to discuss demonstrator applications of Grid technology
  – technical support for installing Grid software and developing Grid-based applications
  – close collaboration with BBSRC IT support service
  – steering group of representatives from BBSRC institutes and IGF centres, to be chaired by Roger Gillam

BBSRC Grid Support Service (3)
• Progress to date:
  – consultation meeting at IGF Forum in July
  – link established with BBSRC Research Computing Committee
  – promoted at BBSRC grant holders workshop at Warwick on 28/29 October
  – staff at RAL now in place
    • Peter Oliver & Richard Wong
  – meeting with IGER and BITS at Aberystwyth
    • Helen Ougham (IGER), Colin Edwards (BITS)
    • three possible demonstrator projects identified – in priority order . . .

BBSRC Grid – demonstrator projects (1)
• Project 1 - remote BLAST jobs
  – IGER currently carries out FASTA and BLAST DNA homology searches using the MoBiCS service provided by BITS (single processor)
  – aim is to use the Grid to get faster turnround on multi-processor resources, enabling larger databases to be searched
  – initially submit jobs to the Beowulf cluster at RAL
    • 32 CPU Beowulf Cluster (wulfkit), 32 GB memory, 1 TB disk
    • RAL setting up static copy of EMBL database and installing BLAST software
  – enable client side access
    • using Java CoG kit on Windows
    • set up Registration Authorities at IGER and BITS (verify user identity)
    • get user certificates from UK e-Science CA
    • write Globus scripts to submit jobs from IGER to Beowulf Cluster at RAL
  – then install Globus server software at another site

BBSRC Grid – demonstrator projects (2)
• Project 2 - hyper-spectral image analysis
  – DEFRA-funded project on hyperspectral imaging of leaves as a diagnostic of nutritional and developmental status
  – generates 60 GB of image files per experiment
  – aim is to speed up image analysis and provide access to data archiving
  – make use of parallel algorithms running on Beowulf clusters
  – Alan Gay (IGER) is contact
• Project 3 - PHYLIP
  – phylogenetic analysis of large multigene families using the program PHYLIP
  – large run times – 4-5 days
  – aim is to improve turnround using a parallel version of the program

BBSRC Grid – Some issues
• Windows-only environment
  – need to install OpenSSL as well as Java CoG (Commodity Grid) kit to handle certificate conversion
  – most Grid software is implemented on Unix platforms
• Network bandwidth low (~2 Mbit/s)
  – many Grid sites now have 1 Gbit/s
  – depends on what is required to support individual Grid projects
• Firewalls
  – need to provide access through firewalls to enable
    remote Grid-based use of machines
  – mechanism for this now identified using tables of Grid machine IP addresses and port numbers

Conclusions
• The Grid can potentially . . .
  – enable bigger and better science
  – enhance the capabilities of individual researchers
  – provide access to resources you need but don’t have yourself
  – transcend traditional organisational and disciplinary boundaries to make new collaborations possible
  – drive change
• Now is the time to identify if and how the Grid can help the genomics community
  – this is what we are here to discover
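
Worked example: the bandwidth issue in numbers

The "Some issues" slide quotes a site link of about 2 Mbit/s against 1 Gbit/s at many Grid sites, and Project 2 generates 60 GB of image files per experiment. The sketch below is a rough, back-of-envelope estimate of what that gap could mean for moving one experiment's data. It assumes decimal units (1 GB = 10^9 bytes) and an ideal link with no protocol overhead or contention, so real transfers would take longer. The function name `transfer_seconds` is purely illustrative and does not come from any Grid toolkit.

```python
# Back-of-envelope transfer times for one experiment's image data.
# Figures taken from the slides: 60 GB per experiment, ~2 Mbit/s site link,
# 1 Gbit/s at many Grid sites.
# Assumptions (not from the slides): decimal units (1 GB = 10**9 bytes)
# and an ideal link with no protocol overhead or contention.


def transfer_seconds(size_gb: float, link_mbit_per_s: float) -> float:
    """Return the ideal time in seconds to move size_gb over the given link."""
    bits = size_gb * 1e9 * 8              # gigabytes -> bits
    return bits / (link_mbit_per_s * 1e6)  # bits / (bits per second)


slow = transfer_seconds(60, 2)      # ~2 Mbit/s link
fast = transfer_seconds(60, 1000)   # 1 Gbit/s link

print(f"2 Mbit/s: {slow / 3600:.1f} hours")
print(f"1 Gbit/s: {fast / 60:.1f} minutes")
```

Under these assumptions the estimate comes out at roughly 67 hours on the 2 Mbit/s link and about 8 minutes at 1 Gbit/s, which suggests why the required bandwidth depends on what each individual Grid project needs to move.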