NGS induction --- case study:
the BRIDGES project
Micha Bayer
Grid Services Developer, BRIDGES project
National e-Science Centre, Glasgow Hub
The BRIDGES project
- Biomedical Research Informatics Delivered by Grid-Enabled Services
- 2-year e-Science project, started 1st October 2003
- aim: provide data integration and grid-based compute power for the Cardiovascular Functional Genomics (CFG) project
- the CFG project investigates genetic predisposition to hypertensive heart disease
- my role on the project: develop grid applications for end users
BRIDGES requirements and the NGS
functional:
- high-throughput compute tasks, e.g. large BLAST jobs

non-functional:
- interfaces to applications should be targeted at the less computer-literate: users range in computer literacy from fairly advanced to mildly technophobic
- security requirements should not cause any extra work or inconvenience for users, as this may put them off altogether
- resources provided by BRIDGES compete with familiar, similar resources already on offer at established bioinformatics institutions (EBI, NCBI, EMBL) -> need to make things “palatable” so people actually use them
How to get your job onto the NGS
[Diagram: routes onto the NGS clusters (Leeds, Oxford, RAL, Manchester)]
- standard solutions: the NGS portal, or GSI-SSH
- custom solutions: a project portal, or a standalone GUI client
Custom grid applications
- if possible/appropriate, get a developer to write a bespoke interface to a grid app running on the NGS
- only worthwhile if the application is used frequently and/or by many users and is relatively unchanging/simple
- best to hide the complexity of the grid from users altogether
- users should not even have to choose between resources
- automatic scheduling of jobs to resources that currently have spare capacity is desirable
- best option for delivery is a portlet in a project-specific web portal: users then only need a web browser for access
Project web portals
- portals are configurable, personalized collections of web applications delivered to a web browser as a single page
- the NGS encourages projects to maintain their own web portals to deliver apps to their users
- applications can then be provided through user-friendly, application-specific portlet interfaces
- this allows the complexity of the grid to be hidden from users
- but it requires developer time
- the BRIDGES portal currently uses IBM Websphere (free to academia)
More on portals
- increasingly important technology, and not just for grid computing (cf. Yahoo)
- gives end users a customized view of software and hardware resources specific to their particular application domain
- also provides a single point of access to Grid-based resources following user authentication (“single sign-on”)
- content is provided by portlets (an extension of Java servlets); the JSR168 standard provides for exchangeability between containers (see the minimal example below)
- some portal packages currently available: IBM Websphere, Gridsphere, JetSpeed, uPortal, Jportlet, Apache Pluto
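
To make the JSR168 model concrete, here is a minimal sketch of a portlet (the class name and output are invented for illustration). A JSR168 portlet extends javax.portlet.GenericPortlet and should deploy unchanged into any compliant container:

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.portlet.GenericPortlet;
import javax.portlet.PortletException;
import javax.portlet.RenderRequest;
import javax.portlet.RenderResponse;

// Minimal JSR168 portlet: the container calls doView() to render
// this portlet's fragment of the portal page.
public class HelloPortlet extends GenericPortlet {
    protected void doView(RenderRequest request, RenderResponse response)
            throws PortletException, IOException {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        // getRemoteUser() returns the portal login name once the
        // user has authenticated (single sign-on at the portal)
        String user = request.getRemoteUser();
        out.println("Hello, " + (user == null ? "guest" : user));
    }
}
```

Incidentally, getRemoteUser() is exactly the JSR168 call that, as the comparison on the next slide notes, did not behave as expected under Websphere 5.1.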
Gridsphere and Websphere
- two commonly used portal server packages
- APIs are almost identical but use different sets of libraries
- pros/cons:

| feature | Gridsphere | Websphere |
| --- | --- | --- |
| complexity | low | high |
| re-use of publicly available components | good – deploys into Apache Tomcat and uses log4j | poor – proprietary implementations of application server, logging package and HTTP server |
| documentation | good – about the right amount | too much – things are very hard to find |
| installation/configuration | easy & quick | poor – has to be downloaded as lots of separate files |
| debugging your own portlets | easy – log4j can be used | hard – in-code debugging statements need to be changed to support the IBM logging mechanism |
| JSR168 compliance | full | supposedly full from version 5.1 onwards, but the name of the logged-in user cannot be extracted using JSR168-compliant code – poor! |
| overall impression | lightweight, easy-to-use, basic | complex, monolithic, feature-rich |
Authentication and User Management (1)
model adopted in BRIDGES:
- requirement was for users not to have to obtain and manage certificates
- we applied for a single project account at the NGS; users do not need individual NGS accounts
- this account maps to a single user (“BRIDGES”) on the NGS with home directories on all nodes (like normal users)
- authentication for this user on the NGS is by means of the host certificate of the machine the jobs are submitted from (under the control of the BRIDGES project)
- users authenticate via the BRIDGES web portal using standard username and password pairs
Authentication and User Management (2)
- users can create accounts for themselves in the BRIDGES Websphere portal (“self-care”)
- alternatively, one could of course give the users usernames and passwords
- the information gathered is kept in Websphere's secure user database
- current info is very basic but will be extended to include more detail (e.g. the URL of the user's project or departmental website where the user is listed)
- provides at least a basic means of accounting for user activity
- no need to physically visit the Registration Authority or present ID
- may need to resort to stricter security if the system is abused, e.g. if impersonation takes place
- probably no less secure than certificates managed by an inexperienced user on an unsecured Windows machine
Authorisation with PERMIS
- PERMIS = grid authorisation software developed at the University of Salford (http://sec.isi.salford.ac.uk/permis/)
- BRIDGES uses PERMIS to differentially allow users access to resources
- typical use is with a GT3.3 service, but lookup-type use is also possible with other services (in our case GT3.0.2)
- code in our service calls a PERMIS authorisation service running on a machine at NeSC
- the user's roles are queried and access to the resource is permitted or denied accordingly
- gives BRIDGES staff full control over who is allowed to use NGS resources through our applications

[Diagram: end user → PERMIS authorisation service at NeSC → access to the NGS clusters (Leeds, Oxford, RAL, Manchester), ScotGRID and the NeSC Condor pool]
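
The decision pattern in our service looks roughly like the sketch below. PermisAuthzClient and its authorise() method are hypothetical stand-ins, invented here to make the grant/deny flow concrete; the real PERMIS Java API differs:

```java
// Hypothetical stand-in for the real PERMIS client API (invented for
// illustration; the actual PERMIS Java interface differs).
interface PermisAuthzClient {
    boolean authorise(String userDn, String action, String target);
}

public class BlastServiceAuthz {
    private final PermisAuthzClient permis;

    public BlastServiceAuthz(PermisAuthzClient permis) {
        this.permis = permis;
    }

    // called before any job submission on behalf of a portal user
    public void checkAccess(String userDn, String action) {
        // PERMIS queries the user's role attributes and evaluates its
        // policy; our service only acts on the resulting allow/deny
        if (!permis.authorise(userDn, action, "NGS-BLAST")) {
            throw new SecurityException(
                "user " + userDn + " is not authorised for " + action);
        }
        // ...proceed with job submission...
    }
}
```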
Security in BRIDGES – summary
[Diagram: the BRIDGES security chain]
1. the end user authenticates at the BRIDGES web portal with username and password only
2. the portal gets the user's authorisations from the NeSC machine running the PERMIS authorisation service (GT3.3)
3. the job request is passed on securely, together with the username, to the NeSC grid server, which holds the host credentials
4. the grid server makes a host proxy, authenticates with the NGS and submits the job to the NGS clusters (Leeds, Oxford, RAL, Manchester)
Host authentication for job submission
- allows us to submit jobs to the NGS as user “BRIDGES”
- apply for a host certificate for the grid server machine as normal (UK e-Science Certification Authority)
- this results in a passwordless private key and a host certificate for the machine
- Java CoG Kit code can then be used to generate a host proxy locally
- this proxy is used for job submission
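
A minimal sketch of what that proxy-generation code might look like, assuming the GT2/GT3-era Java CoG Kit (org.globus.gsi) classes; the file paths, key strength and lifetime are illustrative:

```java
import java.security.cert.X509Certificate;
import org.globus.gsi.CertUtil;
import org.globus.gsi.GSIConstants;
import org.globus.gsi.GlobusCredential;
import org.globus.gsi.OpenSSLKey;
import org.globus.gsi.bc.BouncyCastleCertProcessingFactory;
import org.globus.gsi.bc.BouncyCastleOpenSSLKey;

// Sketch: build a proxy credential from the machine's host certificate
// and its passwordless private key, for use in job submission.
public class HostProxy {
    public static GlobusCredential create() throws Exception {
        X509Certificate hostCert =
            CertUtil.loadCertificate("/etc/grid-security/hostcert.pem");
        OpenSSLKey key =
            new BouncyCastleOpenSSLKey("/etc/grid-security/hostkey.pem");
        // the host key is passwordless, so no decrypt step is needed
        BouncyCastleCertProcessingFactory factory =
            BouncyCastleCertProcessingFactory.getDefault();
        return factory.createCredential(
            new X509Certificate[] { hostCert },
            key.getPrivateKey(),
            512,                       // proxy key strength (bits)
            12 * 3600,                 // proxy lifetime (seconds)
            GSIConstants.GSI_2_PROXY); // legacy GT2-style proxy
    }
}
```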
Use case: Microarray reporter sequence BLAST jobs
“Job processing – please wait....” (and wait....and wait....)
- microarray chips contain up to 400,000 reporter sequences
- these need to be compared to existing annotated sequence databases
- this takes approx. 3 weeks to compute against the human genome on an average desktop machine
BLAST
- Basic Local Alignment Search Tool
- used for comparing biological sequences (DNA, protein) against a set of target sequences
- returns a sorted list of matches
- the most widely used tool for this kind of sequence searching
- compute-intensive
- for more details refer to the NCBI website: http://www.ncbi.nlm.nih.gov/blast/
How do I get my application to run efficiently on a grid?
- applications deployed on a compute grid need to be parallelised to really benefit (they can of course just be run as single jobs too)
- for this, one must be able to partition a job into several subjobs
- these then get processed separately, at the same time, on multiple processors
- the results of the individual subjobs need to be combined at the end
Parallel BLAST – grid style
- partition your job by putting one or several query sequences into each separate input file (= 1 subjob); see the splitter sketch after this list
- distribute all input files, the executable and the target data onto your grid clusters (“stage-in”)
- the subjobs get executed there
- results are returned to the server and combined there
- if 100 free processors are available and 100 subjobs are to be run, the time taken is 1/100th of the time it would have taken to run the whole job on a single machine (plus overheads for scheduling, data transfer and result combining)
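
The partitioning step needs nothing grid-specific. Here is a minimal sketch (the file names and the chunk size of 10 sequences are arbitrary) that writes each run of query sequences from a multi-FASTA file into its own subjob input file:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

// Split a multi-FASTA query file into subjob input files of
// SEQS_PER_JOB sequences each (chunk size is arbitrary here).
public class FastaSplitter {
    static final int SEQS_PER_JOB = 10;

    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("queries.fasta"));
        PrintWriter out = null;
        int seqCount = 0, fileCount = 0;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) {             // FASTA header = new sequence
                if (seqCount % SEQS_PER_JOB == 0) { // start a new subjob file
                    if (out != null) out.close();
                    out = new PrintWriter(
                        String.format("query_%03d.fasta", fileCount++));
                }
                seqCount++;
            }
            if (out != null) out.println(line);
        }
        if (out != null) out.close();
        in.close();
    }
}
```

Each query_NNN.fasta then becomes the staged-in input file of one subjob.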
To stage or not to stage?
- file staging is the copying, at runtime, of files onto the remote resource
- example: BLAST jobs need
  - an input file
  - a target data file (the “database” – really a flat text file)
  - the executable (BLAST)
- the target files and the executable are unchanging components for this kind of job
- it is best to store these locally on the remote resources to avoid the staging overhead (target data are in the region of several GB in size and growing exponentially)
- rather than individual users keeping multiple copies of publicly available data in their home directories, get sys admins to put up copies visible to all
- input files must be staged in, since these vary from job to job (see the submission sketch below)
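
Putting this together for one subjob, here is a hedged sketch of a pre-WS GRAM submission via the Java CoG Kit: only the query file is staged in, while blastall and the target database are assumed to be pre-installed on the cluster. All hostnames, paths and RSL details are illustrative, not real NGS endpoints, and HostProxy refers to the earlier proxy sketch:

```java
import org.globus.gram.GramJob;
import org.globus.gsi.GlobusCredential;
import org.globus.gsi.gssapi.GlobusGSSCredentialImpl;
import org.ietf.jgss.GSSCredential;

// Sketch: submit one BLAST subjob via pre-WS GRAM, staging in only
// the per-job query file (executable and database live on the cluster).
public class SubmitSubjob {
    public static void main(String[] args) throws Exception {
        GlobusCredential proxy = HostProxy.create(); // earlier sketch
        GSSCredential cred = new GlobusGSSCredentialImpl(
            proxy, GSSCredential.INITIATE_AND_ACCEPT);

        // illustrative RSL: stage in the query file, then run blastall
        // against a locally installed database
        String rsl =
            "&(executable=/usr/local/bin/blastall)" +
            "(arguments=-p blastn -d /data/blastdb/nt -i query_001.fasta)" +
            "(count=1)" +
            "(file_stage_in=(gsiftp://gridserver.example.ac.uk/jobs/query_001.fasta" +
            " query_001.fasta))";

        GramJob job = new GramJob(cred, rsl);
        job.request("headnode.example.ac.uk/jobmanager-pbs"); // illustrative contact
    }
}
```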
BRIDGES GridBLAST Job Submission

[Diagram: GridBLAST job submission path]
- end user PC: web browser showing the GridBLAST client portlet output; job requests leave as HTTP requests and results come back as HTTP responses
- BRIDGES Portal Server (“Cassini”): IBM Websphere hosting the GridBLAST client portlet plus a GT3 client class; sends the job request on via SOAP and receives the result via SOAP
- NeSC Grid Server (“Titania”): Apache Tomcat running a GT3 core grid service and the BRIDGES MetaScheduler, with PBS, Condor and GT2.4 wrappers
- compute back-ends:
  - ScotGRID masternode: PBS server side + BLAST, with worker nodes
  - NeSC Condor pool: Condor Central Manager, Condor + BLAST
  - NGS: Leeds and Oxford headnodes (GT2.4 + BLAST), each with worker nodes
Current status of our system
- software is still at the prototype stage – we haven't benchmarked any really big jobs yet
- medium-size jobs (<100 input sequences) can be run
- job submission is from a dedicated portlet on the BRIDGES portal
How we worked with the NGS
- BRIDGES was one of the first projects running bio jobs on the NGS
- we established the basic infrastructure needed for BLAST on the NGS clusters in collaboration with NGS user support
- good collaboration on our security requirements – very helpful and accommodating
- our project account is the first of its kind, and we jointly tailored a solution that would fit BRIDGES
- ask for what you need! things are not cast in stone and it is a public service
Public bioinformatics infrastructure on NGS – current status
- we are in the process of establishing an infrastructure for BLAST jobs that can be used by all
- this includes:
  - making BLAST and mpiBLAST executables publicly available
  - mirroring the entire NCBI BLAST databases repository
- currently trialling this on the Leeds node – it will be replicated at the other nodes later
- data replication on all nodes is necessary to avoid severe performance hits
- input from others is needed and welcome!
mpiBLAST
- mpiBLAST will be installed on all nodes as part of the bioinformatics infrastructure
- trials on the Leeds node have not been very encouraging:
  - performance is much poorer than advertised in the papers
  - best performance (within the given limits) is achieved only when the number of database fragments matches the number of available processors (+2 for managing the job)
  - this means the job has to queue until the required number of processors is available – which can take ages
  - the alternative is to split and formatdb the database at runtime – this takes about 30 mins for the nt database
  - this is a poor solution when the actual job only takes several minutes
Contact details
- BRIDGES website: http://www.brc.dcs.gla.ac.uk/projects/bridges/
- code repository – contains reusable components for job submission, GSI security etc.: http://www.brc.dcs.gla.ac.uk/projects/bridges/public/code.htm
- BRIDGES web portal: http://cassini.nesc.gla.ac.uk:9081/wps/portal
- contacts:
  - Micha Bayer at NeSC in Glasgow – michab@dcs.gla.ac.uk
  - Richard Sinnott at NeSC in Glasgow – ros@dcs.gla.ac.uk