NGS induction - case study: the BRIDGES project
Micha Bayer, Grid Services Developer, BRIDGES project
National e-Science Centre, Glasgow Hub

The BRIDGES project
- Biomedical Research Informatics Delivered by Grid-Enabled Services
- 2-year e-Science project, started 1st October 2003
- aim: provide data integration and grid-based compute power for the Cardiovascular Functional Genomics (CFG) project
- the CFG project investigates genetic predisposition for hypertensive heart disease
- my role on the project: develop grid applications for end users

BRIDGES requirements and the NGS
- functional: high-throughput compute tasks, e.g. large BLAST jobs
- non-functional:
  - interfaces to applications should be targeted at the less computer-literate - users range in computer literacy from fairly advanced to mildly technophobic
  - security requirements should not cause any extra work or inconvenience for users, as this may put them off altogether
  - resources provided by BRIDGES compete with familiar, similar resources already on offer at established bioinformatics institutions (EBI, NCBI, EMBL) -> need to make things "palatable" so people do use them

How to get your job onto the NGS
- standard solutions: NGS portal, GSI-SSH
- custom solutions: project portal, standalone GUI client
- [diagram: NGS clusters at Leeds, Oxford, RAL and Manchester]

Custom grid applications
- if possible/appropriate, get a developer to write a bespoke interface to a grid app running on the NGS
- only worthwhile if the application is used frequently and/or by many users, and is relatively unchanging/simple
- best to hide the complexity of the grid from users altogether
- users should not even have to choose between resources - automatic scheduling of jobs to resources that currently have spare capacity is desirable
- best option for delivery is a portlet in a project-specific web portal - users then only need a web browser for access

Project web portals
- portals are configurable, personalised collections of web applications delivered to a web browser as a single page
- NGS encourages projects to maintain their own web portals to deliver apps to their users
- applications can then be provided through user-friendly, specific portlet interfaces
- allows the hiding of grid complexity from users
- requires developer time
- the BRIDGES portal currently uses IBM Websphere (free to academia)

More on portals
- increasingly important technology - not just for grid computing (cf. Yahoo)
- gives end users a customised view of software and hardware resources specific to their particular application domain
- also provides a single point of access to grid-based resources following user authentication ("single sign-on")
- content is provided by portlets (a Java servlet extension) - the JSR168 standard provides for exchangeability
- some portal packages currently available: IBM Websphere, Gridsphere, JetSpeed, uPortal, Jportlet, Apache Pluto

Gridsphere and Websphere
- two commonly used portal server packages
- APIs are almost identical but use different sets of libraries
- pros/cons:

feature | Gridsphere | Websphere
complexity | low | high
re-use of publicly available components | good - deploys into Apache Tomcat and uses log4j | poor - proprietary implementations of application server, logging package and HTTP server
documentation | good - about the right amount | too much - things are very hard to find
installation/configuration | easy & quick | poor - has to be downloaded as lots of separate files
debugging your own portlets | easy - log4j can be used | hard - in-code debugging statements need to be changed to support the IBM logging mechanism
JSR168 compliance | full | supposedly full from version 5.1 onwards, but the name of the logged-in user cannot be extracted using JSR168-compliant code - poor!
overall impression | lightweight, easy-to-use, basic | complex, monolithic, feature-rich
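To make the portlet model concrete, here is a minimal JSR168 portlet sketch (a hypothetical example, not BRIDGES code; it assumes only the javax.portlet API on the classpath). The portal container - Websphere, Gridsphere or any other JSR168-compliant server - calls doView() and assembles the returned HTML fragment into the page; request.getRemoteUser() is the JSR168 call for the logged-in user name flagged in the table above.

import java.io.IOException;
import java.io.PrintWriter;
import javax.portlet.GenericPortlet;
import javax.portlet.PortletException;
import javax.portlet.RenderRequest;
import javax.portlet.RenderResponse;

public class HelloGridPortlet extends GenericPortlet {

    // Called by the portal in view mode; portlets render HTML fragments,
    // not whole pages - the portal composes the final page.
    protected void doView(RenderRequest request, RenderResponse response)
            throws PortletException, IOException {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();

        // The JSR168 way of obtaining the authenticated portal user
        // (null if the user is not logged in).
        String user = request.getRemoteUser();
        out.println("<p>Hello, " + (user == null ? "guest" : user) + "</p>");
    }
}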
Authentication and User Management (1)
- model adopted in BRIDGES: the requirement was for users not to have to obtain and manage certificates
- we applied for a single project account at the NGS - users do not need individual NGS accounts
- this account maps to a single user ("BRIDGES") on the NGS with home directories on all nodes (like normal users)
- authentication for this user on the NGS is by means of the host certificate of the machine the jobs are submitted from (under the control of the BRIDGES project)
- users authenticate via the BRIDGES web portal using standard username and password pairs

Authentication and User Management (2)
- users can create accounts for themselves in the BRIDGES Websphere portal ("self-care"); alternatively one could of course give the users usernames and passwords
- the information gathered is kept in Websphere's secure user database
- current info is very basic but will be extended to include more detail (e.g. the URL of the user's project or departmental website where the user is listed)
- provides at least a basic means of accounting for user activity
- no need to physically visit the Registration Authority or present ID
- may need to resort to stricter security if the system is abused, e.g. if impersonation takes place
- probably no less secure than certificates managed by an inexperienced user on an unsecured Windows machine

Authorisation with PERMIS
- PERMIS = grid authorisation software developed at Salford University (http://sec.isi.salford.ac.uk/permis/)
- BRIDGES uses PERMIS to differentially allow users access to resources
- typical use is with a GT3.3 service, but lookup-type use is also possible with other services (in our case GT3.0.2)
- code in our service calls a PERMIS authorisation service running on a machine at NeSC; the user's roles are queried and access to the resource is permitted or denied accordingly
- gives BRIDGES staff full control over who is allowed to use NGS resources through our applications
- [diagram: resources covered - ScotGRID, the NeSC Condor pool and the NGS clusters at Leeds, Oxford, RAL and Manchester]

Security in BRIDGES - summary
[diagram: end-to-end flow]
1. the end user authenticates at the BRIDGES web portal with username and password only
2. the portal gets the user's authorisations from the NeSC machine running the PERMIS authorisation service (GT3.3)
3. the job request is passed on securely, with the username, to the NeSC grid server, which holds the host credentials
4. the grid server makes a host proxy, authenticates with the NGS and submits the job to the NGS clusters (Leeds, Oxford, RAL, Manchester)

Host authentication for job submission
- allows us to submit jobs to the NGS as user "BRIDGES"
- apply for a host certificate for the grid server machine as normal (UK e-Science Certification Authority)
- results in a passwordless private key and host certificate for the machine
- Java CoG kit code can then be used to generate a host proxy locally; this is used for job submission
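A rough sketch of that proxy-generation step, assuming Java CoG kit / jglobus classes of this vintage (class name, file paths and proxy parameters are illustrative, and exact method signatures may vary between CoG kit versions - this is not the actual BRIDGES code):

import org.globus.gsi.GSIConstants;
import org.globus.gsi.GlobusCredential;
import org.globus.gsi.bc.BouncyCastleCertProcessingFactory;
import org.globus.gsi.gssapi.GlobusGSSCredentialImpl;
import org.ietf.jgss.GSSCredential;

public class HostProxyExample {

    public static GSSCredential createHostProxy() throws Exception {
        // Load the passwordless host key and certificate issued by the
        // UK e-Science CA (paths are the conventional defaults).
        GlobusCredential hostCred = new GlobusCredential(
                "/etc/grid-security/hostcert.pem",
                "/etc/grid-security/hostkey.pem");

        // Derive a short-lived proxy from the host credential, much as
        // grid-proxy-init does for a user certificate (12 h, 512-bit key).
        GlobusCredential proxy = BouncyCastleCertProcessingFactory.getDefault()
                .createCredential(
                        hostCred.getCertificateChain(),
                        hostCred.getPrivateKey(),
                        512,
                        12 * 3600,
                        GSIConstants.GSI_2_PROXY);

        // Wrap as a GSSCredential so it can be handed to the GT3 client
        // stubs when submitting jobs to the NGS.
        return new GlobusGSSCredentialImpl(proxy, GSSCredential.INITIATE_ONLY);
    }
}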
Use case: microarray reporter sequence BLAST jobs
"Job processing - please wait...." (and wait.... and wait....)
- microarray chips contain up to 400,000 reporter sequences
- these need to be compared to existing annotated sequence databases
- takes approx. 3 weeks to compute against the human genome on an average desktop machine

BLAST
- Basic Local Alignment Search Tool
- used for comparing biological sequences (DNA, protein) against a set of target sequences; returns a sorted list of matches
- the most widely used algorithm for this sort of thing
- compute intensive
- for more details refer to the NCBI website: http://www.ncbi.nlm.nih.gov/blast/

How do I get my application to run efficiently on a grid?
- applications to be deployed on a compute grid need to be parallelised to really benefit (you can of course just run them as single jobs too)
- for this one must be able to partition a job into several subjobs
- these then get processed separately, at the same time, on multiple processors
- the results of the individual subjobs need to be combined at the end

Parallel BLAST - grid style
- partition your job by putting one or several query sequences into a separate input file (= 1 subjob); a minimal splitting sketch appears further below
- distribute all input files, the executable and the target data onto your grid clusters ("stage-in")
- subjobs get executed there; results are returned to the server and combined there
- if 100 free processors are available and 100 subjobs are to be run, the time taken is 1/100th of the time it would have taken to run the whole job on a single machine (plus overheads for scheduling, data transfer and result combining)

To stage or not to stage?
- file staging is the copying - at runtime - of files onto the remote resource
- example: for BLAST jobs we need an input file, a target data file (the "database" - really a flat text file) and the executable (BLAST)
- target files and the executable are unchanging components; for this kind of job it is best to store these locally on the remote resources to avoid the staging overhead (target data are in the region of several GB in size and growing exponentially)
- rather than individual users keeping multiple copies of publicly available data in their home directories, get sys admins to put up copies visible to all
- input files must be staged in, since these vary from job to job

BRIDGES GridBLAST job submission
[diagram: the end user's web browser talks via HTTP to the GridBLAST client portlet (plus a GT3 client class) running in IBM Websphere on the BRIDGES portal server (Cassini); the portlet sends the job request on via SOAP to a GT3 core grid service and the BRIDGES MetaScheduler running in Apache Tomcat on the NeSC grid server (Titania); PBS, Condor and GT2.4 wrappers dispatch subjobs to the ScotGRID masternode (PBS + BLAST), the NeSC Condor pool (Condor Central Manager, Condor + BLAST on worker nodes) and the NGS headnodes at Leeds and Oxford (GT2.4 + BLAST on worker nodes); results travel back via SOAP and an HTTP response to the portlet output in the browser]

Current status of our system
- software is still at the prototype stage - we haven't benchmarked any really big jobs yet
- medium-size jobs (<100 input sequences) can be run
- job submission is from a dedicated portlet on the BRIDGES portal

How we worked with the NGS
- BRIDGES was one of the first projects running bio jobs on the NGS
- we established the basic infrastructure needed for BLAST on the NGS clusters in collaboration with NGS user support
- good collaboration on our security requirements - very helpful and accommodating
- our project account is the first of its kind and we jointly tailored a solution that would fit BRIDGES
- ask for what you need! Things are not cast in stone and it is a public service
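Referring back to the partitioning step under "Parallel BLAST - grid style": a minimal sketch of how a multi-FASTA query file might be split into per-subjob input files (a hypothetical helper, not the actual BRIDGES code; file names and layout are illustrative).

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class FastaSplitter {

    // Writes query_0.fa, query_1.fa, ... into outDir, each holding up to
    // seqsPerJob query sequences; returns the number of subjob files produced.
    public static int split(String fastaFile, String outDir, int seqsPerJob)
            throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(fastaFile));
        PrintWriter out = null;
        int seqCount = 0;
        int fileCount = 0;
        String line;
        while ((line = in.readLine()) != null) {
            // A '>' header line marks the start of a new query sequence.
            if (line.startsWith(">")) {
                if (seqCount % seqsPerJob == 0) {
                    if (out != null) {
                        out.close();
                    }
                    out = new PrintWriter(new FileWriter(
                            outDir + "/query_" + (fileCount++) + ".fa"));
                }
                seqCount++;
            }
            if (out != null) {
                out.println(line);
            }
        }
        if (out != null) {
            out.close();
        }
        in.close();
        return fileCount;
    }
}

Each resulting file becomes one subjob's staged-in input; the executable and target databases stay resident on the remote resources as described above.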
Public bioinformatics infrastructure on NGS - current status
- we are in the process of establishing an infrastructure for BLAST jobs that can be used by all
- this includes:
  - making BLAST and mpiBLAST executables publicly available
  - mirroring the entire NCBI BLAST databases repository
- currently trialling this on the Leeds node - it will be replicated at the other nodes later
- data replication on all nodes is necessary to avoid severe performance hits
- input from others is needed and welcome!

mpiBLAST
- mpiBLAST will be installed on all nodes as part of the bioinformatics infrastructure
- trials on the Leeds node have not been very encouraging: performance is much poorer than advertised in the papers
- best performance (within the given limits) is achieved only when the number of database fragments matches the number of available processors (+2 for managing the job)
- this means the job has to queue until the required number of processors is available - which can take ages
- the alternative is to split and formatdb the database at runtime - this takes about 30 minutes for the nt database, which is a poor solution when the actual job only takes several minutes

Contact details
- BRIDGES website: http://www.brc.dcs.gla.ac.uk/projects/bridges/
- code repository - contains reusable components for job submission, GSI security etc: http://www.brc.dcs.gla.ac.uk/projects/bridges/public/code.htm
- BRIDGES web portal: http://cassini.nesc.gla.ac.uk:9081/wps/portal
- contacts:
  - Micha Bayer at NeSC in Glasgow -- michab@dcs.gla.ac.uk
  - Richard Sinnott at NeSC in Glasgow -- ros@dcs.gla.ac.uk