Grid enabled e-Research in the Life Sciences Prof Richard Sinnott Anthony Stell

Grid enabled e-Research in
the Life Sciences
Prof Richard Sinnott
Technical Director National e-Science Centre
Anthony Stell
University of Glasgow, Scotland, UK
26th October 2006
Life Sciences
• Some of the Big Questions
– How does a cell/brain work?
– Which genes/pathways are involved in which
diseases and can we develop drugs to target
them?
– Why do people who eat less tend to live longer?
– Is this drug effective (for these individuals)?
– How important are genetic / social / environmental
factors to specific diseases?
• …how clinically significant is the consumption of deep
fried Mars Bar and Pizza Crunch in Scotland?
Life Science Grids
• Extensive Research Community
– > 4000 at Glasgow
• Extensive Applications
– Many people care about them
• Health, Food, Environment
• Interacts with virtually every discipline
– Physics, Chemistry, Maths/Stats, Nano-engineering, …
• MANY databases relevant to bioinformatics (and
growing!)
– Heterogeneity, Interdependence, Complexity, Change, …
Database Growth
• DBs growing exponentially!!!
• Yesterday the EMBL Database contained 147,881,486,173 nucleotides in 81,229,974 entries.
• Bibliographic (MedLine, PubMed, …)
• Protein Seq (UniProt, …)
• 3D Molecular Structure (PDB, …)
• Nucleotide Seq (GenBank, EMBL, …)
• Pathways (KEGG, WIT, …)
• Molecular Classifications (SCOP, …)
• Motif Libraries (PROSITE, Blocks, …)
• …
[Chart: PDB Content Growth]
[Chart labels, by organism: Homo sapiens, Mus musculus, Rattus norvegicus, Bos taurus, Pan troglodytes, Canis familiaris, Monodelphis domestica, Macaca mulatta, Danio rerio, Aedes aegypti, Other]
More genomes …
[Figure labels: Yersinia pestis, Arabidopsis thaliana, Buchnera sp. APS, Caenorhabditis elegans, Campylobacter jejuni, Chlamydia pneumoniae, Helicobacter pylori, Mycobacterium leprae, rat, Rickettsia prowazekii, mouse, Aquifex aeolicus, Vibrio cholerae, Archaeoglobus fulgidus, Borrelia burgdorferi, Mycobacterium tuberculosis, Drosophila melanogaster, Escherichia coli, Thermoplasma acidophilum, Neisseria meningitidis Z2491, Plasmodium falciparum, Pseudomonas aeruginosa, Ureaplasma urealyticum, Saccharomyces cerevisiae, Salmonella enterica, Bacillus subtilis, Thermotoga maritima, Xylella fastidiosa]
Distributed and Heterogeneous data
[Figure labels: Structure, Sequence, Function, Morphology, Gene expression, Pathways; illustrated with sample protein and nucleotide sequences]
Translational Research
[Figure labels, levels of data: Populations, Organisms, Physiology, Tissues, Protein-protein interaction (pathways), Protein Structures, Gene expressions, Nucleotide structures]
• Systems-Biology…
• Just one example!
– + links to plant/crops, environmental, health, … information sources
Is Grid the Answer?
• Key problems to be addressed
– Tools that simplify access to and usage of data
• Internet hopping is not ideal!
– Tools that simplify access to and usage of large
scale HPC facilities
• qsub [-a date_time] [-A account_string] [-c
interval] [-C directive_prefix] [-e path] [-h] [-I] [-j join]
[-k keep] [-l resource_list] [-m mail_options] [-M
user_list] [-N name] [-o path] [-p priority] [-q
destination] [-r c] [-S path_list] [-u user_list] [-v
variable_list] [-V] [-W additional_attributes] [-z] [script]
•…
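A tool that hides the option list above could build the command from a small job description. The following is a minimal sketch, assuming a PBS-style `qsub` as the target. `JobSpec` and `build_qsub_command` are hypothetical names introduced for illustration, and only the `-N`, `-q`, `-l`, `-o` and `-e` flags from the usage line above are used. The command is only constructed, never executed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class JobSpec:
    """Hypothetical, simplified description of a batch job."""
    script: str
    name: Optional[str] = None
    queue: Optional[str] = None
    resources: Dict[str, str] = field(default_factory=dict)
    stdout_path: Optional[str] = None
    stderr_path: Optional[str] = None


def build_qsub_command(job: JobSpec) -> List[str]:
    """Translate a JobSpec into a qsub argument list."""
    cmd = ["qsub"]
    if job.name:
        cmd += ["-N", job.name]
    if job.queue:
        cmd += ["-q", job.queue]
    if job.resources:
        # -l takes a comma-separated resource list, e.g. walltime=01:00:00
        pairs = sorted(job.resources.items())
        cmd += ["-l", ",".join(f"{key}={value}" for key, value in pairs)]
    if job.stdout_path:
        cmd += ["-o", job.stdout_path]
    if job.stderr_path:
        cmd += ["-e", job.stderr_path]
    cmd.append(job.script)
    return cmd


if __name__ == "__main__":
    # Script name and resource values are made up for the example.
    job = JobSpec(script="blast.sh", name="blast",
                  resources={"walltime": "01:00:00"})
    print(" ".join(build_qsub_command(job)))
```

A portal could fill in a structure like `JobSpec` from a web form, so the researcher never sees the flags.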
Is Grid the Answer …ctd?
• Key problems …ctd
– Tools designed to aid understanding of complex
data sets and relationships between them
• e.g. through visualisation
– Support different kinds of collaborative research
• break down the silos
• be multi-discipline
• support the research process
– Provide access to many more computational
resources
• to expedite scientific process (or to make it feasible!)
Access to and Usage of Data
• Grid technology should make it possible to
– hide heterogeneity,
– deal with location transparency,
– address security concerns,
– support data provenance
– …
• Data Access and Integration Specification (DAIS) being
defined by GGF
– OGSA-DAI/DAIT projects play a key role in shaping these standards
• Other commercial solutions
– IBM Information Integrator, SRS, …
Access to and Usage of HPC facilities
• Consider whole genome-genome comparisons
between two species
– Current strategy essentially chops up one genome and
fires searches for those fragments in the other, then reassembles the results
• messy approximate matching - re-assembly difficult
• important correlations can be lost
– to make this tractable, so-called junk DNA is ignored
– chopping may introduce artefacts or hide phenomena
• Better to put both full genomes in memory and perform a useful
complete comparison
– Only possible with very high-end machines (available via grids)
– Should not have to be a script writer/Linux sysadmin to use these facilities
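The artefact risk of the chop-and-search strategy can be shown with a toy example. This is a minimal sketch, assuming exact substring matching in place of the approximate matching real tools use. The function names and the sequences are made up for illustration.

```python
from typing import List, Tuple


def chop(genome: str, size: int) -> List[str]:
    """Split a genome string into consecutive, non-overlapping fragments."""
    return [genome[i:i + size] for i in range(0, len(genome), size)]


def fragment_hits(query: str, target: str, size: int) -> List[Tuple[int, str]]:
    """Return (offset, fragment) for every query fragment found in target."""
    hits = []
    for index, fragment in enumerate(chop(query, size)):
        if fragment in target:  # exact match: a simplification
            hits.append((index * size, fragment))
    return hits


if __name__ == "__main__":
    query = "TTTCGTACGAAA"   # made-up sequences
    target = "GGGCGTACGCCC"
    shared = "CGTACG"        # present in both sequences
    print(shared in query and shared in target)
    print(fragment_hits(query, target, 6))
```

Here the shared region `CGTACG` straddles the boundary between the two 6-base fragments, so neither fragment matches and the search reports nothing, even though a whole-genome comparison would find it.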
Cognitive aspects of Data
• Life science data can be “ugly” (they are!!!)
– Raw data sets messy
– Requires significant effort to understand
– Schemas/data models evolving
– …
• Tools needed to
– Simplify understanding
– Improve analysis
– Navigate through potentially huge data sets
• e.g. to find genes of interest in chromosomes of different
species, …
Collaborative Aspects
• Should provide tools that automate the way
researchers wish to work
– User driven workflows
– Linking compute and data resources “on the fly”
• Where is the “best place” to submit these jobs right now?
– MyGrid workbench gaining widespread acceptance
• 20,000+ downloads
• 3000+ bio-services
Collaborative Aspects …ctd
• Break down the silos and be multi-disciplinary
– We are all looking at possible genetic factors in cancer,
mental health, cardiovascular…
• …so we should co-ordinate our efforts and share data,
knowledge, …
– Has anyone generated results like these?
– Can I see them now
» …rather than waiting 2 years for the Nature / Science
publication
– I need input from a physicist, a chemist, a statistician, a …
• to explain this,
• to process these results,
• to simulate this phenomenon,
• to verify these results
• …
[Figure: projects (GEMEPS, BRIDGES, SBRN, DyVOSE, GLASS, ESP-Grid, GS SFHS, VOTES) placed against the levels of data: Populations, Organisms, Physiology, Tissues, Protein-protein interaction (pathways), Protein Structures, Gene expressions, Nucleotide structures]
BRIDGES Project
[Figure labels: CFG Virtual Organisation with sites at Glasgow, Edinburgh, Oxford, London, Leicester and the Netherlands, each holding private data; VO Authorisation; Data Hub; OGSA-DAI; Information Integrator; Synteny Service; MagnaVista Service; Bridges Portal]
[Figure labels, Publicly Curated Data: Ensembl, OMIM, SWISS-PROT, MGI, HUGO, RGD, …]
[Screenshots: MagnaVista, GeneVista]
www.nesc.ac.uk
Grid Blast Interface
• Allows ‘genome scale’ blasting
• Transparently uses NGS,
ScotGrid, other GU clusters,
Condor pools
• Many databases already
deployed across nodes
• No user certificates
• Fine grained security at
back-end
Grid Enabled Microarray Expression
Profile Search (GEMEPS)
• 1 year BBSRC project started 1st March 2006
– Involves Glasgow, Cornell University, US, Riken Institute, Japan
– Aim to provide tools for discovery, comparison and analysis of
microarray data sets
• How does my data compare to others?
– Species, disease, platform, results, …
• How do these experiments compare?
• Can we improve the way we establish how genes in different species
are linked?
– Requires data access, integration and move towards data mining
– Built upon fine grained security
• Microarrays expensive and contain potentially important (valuable) data
sets
Experiences
• Currently exploring microarray data sets in detail
– GEO, ArrayExpress, local in-house microarray storage
solutions at Riken, Cornell, SHWFGF …
• Investigating/Grid enabling CellMontage software
(http://cellmontage.cbrc.jp/)
– system for searching gene expression databases for
cells or tissues similar to a query gene expression profile
– similarity of two profiles computed by comparing the
order of genes ranked by expression (Spearman Rank)
• simple measure but sufficient to characterize cell types across
different microarray platforms
• gene sets/expression value ranges differ between platforms,
making direct comparison difficult/impossible
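The rank-based similarity described above can be sketched as a Spearman rank correlation over the genes two profiles have in common. This is a minimal illustration, not CellMontage's own implementation. The function names and the dictionary-based profile format (gene name to expression value) are assumptions.

```python
from math import sqrt
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def spearman(profile_a: Dict[str, float], profile_b: Dict[str, float]) -> float:
    """Spearman rank correlation over the genes shared by both profiles."""
    genes = sorted(set(profile_a) & set(profile_b))
    if len(genes) < 2:
        raise ValueError("need at least two shared genes")
    ranks_a = average_ranks([profile_a[g] for g in genes])
    ranks_b = average_ranks([profile_b[g] for g in genes])
    return pearson(ranks_a, ranks_b)
```

Because only the ordering of genes is compared, two platforms reporting values on very different scales can still yield a high similarity, and genes missing from one platform are simply left out.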
Microarray Data Resources
• Various standards and interoperability issues
– MIAME
– MAGE-ML
– MINiML
– SOFTtext
– SOFTmatrix
–…
• What’s in a name?
– Gene names, probe names, platforms, species
names in experiments, …
• Life Science Identifiers
Grid Enabled Microarray Expression
Profile Search (GEMEPS)
Overview of VOTES
• Grids
– Compute and Data Grids
– Accessing Grids
– Grid Security
• Clinical Trials and VOTES
– Clinical Trials
– VOTES Goals
– Security Issues
– Classification Issues
• Project so far
– Implementation and Technologies
• Conclusion
– Application to life sciences
– The main challenge: the human one
Grids – what are they?
• Use existing resources to solve large-scale compute
or data problems more efficiently
– Rather than throwing money at hardware solutions…
– Develop applications that intelligently use available
resources whilst maintaining security between all parties
involved
• Compute grids
– Aggregation of CPU cycles and storage for better
performance
• Data grids
– Enhance quality and value of distributed information
• Virtual Organisations
– Where parties share data and resources but in a limited
sense, so who can access what must be strictly controlled.
Soundbites
• “Next generation of the Internet” [Various]
– Knitting together network infrastructures…
• “Internet on steroids” [Me at social events]
– Getting better performance with what you’ve got…
• “More bang for your buck” [BBC Magazine]
– Do the same as could be achieved with lots of
hardware, but doing it more efficiently…
• “Co-ordinated resource sharing and problem
solving in dynamic, multi-institutional virtual
organizations” [“Anatomy of the Grid”, Ian Foster]
–…
Accessing Grids
• Want an open, usable interface to access grid
applications…
• As intuitive and easy to use as browsers to
access applications on the Internet…
• Portal technology is one possible way
forward:
– Developed as stateful web applications.
– Communicate to middleware solutions which do
their magic to allocate and use underlying grid
infrastructures.
Grid Security - 1
• Security is often classified as: “AAA”
– Authentication
• “Who are you?”
– Authorization
• “What are you allowed to do?”
– Accounting
• “Where were you on the night of…?”
• But there are other aspects to be considered:
– Anonymisation
– Confidentiality
– Non-repudiation
Grid Security - 2
• Not just server checking on client, but vice
versa:
– Because a server might be a client to another
process…
• PKI – digital keys/certificates for
authentication
– Clever mathematics provide useful encryption
and signature tools…
– “The Code Book” by Simon Singh
• Proxy certificates
– Pushing your credentials further down a path of
trust than your immediate neighbour…
– Delegation of trust
Grid Technologies
• Range of middleware solutions:
– Globus Toolkit
– Open Middleware Infrastructure Institute (OMII)
– Shibboleth
– GridSphere
• Not mature
– Difficult to implement…
• No clear leaders
– Our job is to pick, choose and develop…
Clinical Trials
• Research studies into new drugs, medical
devices or other interventions on patients in
scientifically-controlled environment.
• Required for regulatory authority approval of
new therapies.
• Generally speaking: they help improve quality
of life.
VOTES
• Virtual Organisations for Trials and Epidemiological Studies
• 3 year (£2.8 million) MRC funded project started in October
2005
• Collaboration between various UK universities:
– Glasgow, Oxford, Nottingham/Leicester, Manchester,
Imperial College London
• Focuses on three key areas of clinical trials:
– Patient Recruitment
– Data Collection
– Study Management
Key Areas
• Patient Recruitment
– How many men aged between 45 and 65 had a
heart attack last year? How many of them would
be willing to participate in the trial of a new drug?
• Data Collection
– Are the participants taking their drug/placebo on a
regular basis? Have there been any incidents
relating to the trial?
• Study Management
– Who can see the trial data (e.g. consultants,
nurses)? Who ensures the trial is in the patient’s
interest? Can we simplify the ethical review
process?
Data Grids
• Falls into the remit of the “Information Grid”:
– “… which provides a way for information
resources to be joined with related information
resources to greater exploit the value of the
inherent relationships among information, then for
new connections to be made as situations
change.” [Grid Computing with Oracle, technical
white paper, 2005]
• Two main challenges:
– Security
– Data classification
• And we want to “plug in” to the existing NHS
IT infrastructure…
Additional Clinical Security
• Anonymisation
– De-identifying data
– Only interested in the statistical data => don’t
need to know the patient’s identity
– So the identifying data is encrypted
• Statistical Inference
– When two bits of seemingly innocuous data are
joined, the result can identify a patient
– E.g. an unusual condition in a particular postcode
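Both points can be illustrated with a short sketch. It assumes a keyed hash (HMAC-SHA256) as a stand-in for the encryption mentioned above, so identifiers become stable pseudonyms rather than recoverable ciphertext. The function names, field names and sample records are hypothetical.

```python
import hashlib
import hmac
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (same input, same pseudonym)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def risky_groups(records: List[Dict[str, str]],
                 fields: Sequence[str],
                 min_group_size: int) -> Dict[Tuple[str, ...], int]:
    """Return combinations of field values shared by too few records."""
    counts = Counter(tuple(record[f] for f in fields) for record in records)
    return {group: n for group, n in counts.items() if n < min_group_size}


if __name__ == "__main__":
    # Made-up records: no names, yet one combination is unique.
    records = [
        {"postcode": "G12", "condition": "asthma"},
        {"postcode": "G12", "condition": "asthma"},
        {"postcode": "G12", "condition": "rare-x"},
    ]
    print(risky_groups(records, ["postcode", "condition"], 2))
```

A check like `risky_groups` could run before query results are released, suppressing or coarsening any group smaller than an agreed threshold.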
Data Classification
• Main problem here is one of language and
definition across domains.
• Solutions proposed include:
– Global schema
• Essentially an overall description of data that all
parties must subscribe to.
– Ontology
• Methods of translating the idiosyncratic description to
a common description used by all parties.
• No clear solution to this yet…
– Current method is to join distributed databases
on CHI number (Community Health Index).
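Joining on CHI number can be sketched as an inner join over records returned from each database. This is a minimal illustration, assuming each source yields at most one record per CHI number and stores it under a `chi` key. The function names, field names and CHI values are placeholders.

```python
from typing import Dict, List


def index_by_chi(records: List[dict]) -> Dict[str, dict]:
    """Index one source's records by CHI number."""
    return {record["chi"]: record for record in records}


def join_on_chi(*sources: List[dict]) -> List[dict]:
    """Inner join: keep only patients present in every source."""
    if not sources:
        return []
    indexed = [index_by_chi(source) for source in sources]
    shared = set(indexed[0])
    for index in indexed[1:]:
        shared &= set(index)
    joined = []
    for chi in sorted(shared):
        merged: dict = {}
        for index in indexed:
            merged.update(index[chi])  # later sources win on name clashes
        joined.append(merged)
    return joined


if __name__ == "__main__":
    gp_records = [{"chi": "0000000001", "age": 52},
                  {"chi": "0000000002", "age": 61}]
    lab_records = [{"chi": "0000000002", "cholesterol": 6.1}]
    print(join_on_chi(gp_records, lab_records))
```

The join only works because every source uses the same identifier. It does nothing to reconcile differing field names or units, which is where a global schema or ontology would come in.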
VOTES Portal Overview
• Developed on local test-bed of distributed
servers and databases:
– Users log in and are assigned privileges based on role
– Select clinical trial
– Select parameters to view and apply conditions (if
desired)
– Results of this query are brought back from the
databases distributed over the test-bed (or VO if
you will…), joined and presented as a unified
resource.
• Demo available at break…
VOTES Portal Snapshots
Architecture
[Figure labels: User Authentication; Portal; Grid Server (Globus Container) with Access Security Policies; Data Server (OGSA-DAI Service) with Authorisation, Access Matrix and Security Policies; Driving DB; Local Trust Policies; Remote Trust Policies; Transfer; Other Grid Nodes]
[Figure labels, Glasgow data sources: GPASS, SCI Store 1 (SQL Server), SCI Store 2 (SQL Server), Consent DB (Oracle 10g), RCB Test Trials DB (SQL Server)]
Technologies
• Technologies
– GridSphere (2.1)
– Globus Toolkit (4.0)
– OGSA-DAI (2.2)
• Security Framework
– Database user management (Resource-level)
• Local restrictions on local resources
– Access Control matrix (VO-level)
• A bit-wise privilege matrix that will be available to the whole VO
• Representative NHS Databases
– GPASS
– SCI Store
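A bit-wise privilege matrix of the kind listed above can be sketched with one bit per privilege and one entry per (role, resource) pair. This is an illustration only, not the VOTES implementation. The privilege names, roles and resource names are hypothetical.

```python
from enum import IntFlag


class Privilege(IntFlag):
    """One bit per privilege (names are made up for the example)."""
    NONE = 0
    VIEW_DEMOGRAPHICS = 1
    VIEW_LAB_RESULTS = 2
    VIEW_CONSENT = 4


# VO-level matrix: (role, resource) -> combined privilege bits.
ACCESS_MATRIX = {
    ("nurse", "trial_a"): Privilege.VIEW_LAB_RESULTS,
    ("consultant", "trial_a"): (Privilege.VIEW_DEMOGRAPHICS
                                | Privilege.VIEW_LAB_RESULTS),
}


def is_allowed(role: str, resource: str, needed: Privilege) -> bool:
    """True only if every bit in `needed` has been granted."""
    granted = ACCESS_MATRIX.get((role, resource), Privilege.NONE)
    return (granted & needed) == needed


if __name__ == "__main__":
    print(is_allowed("nurse", "trial_a", Privilege.VIEW_LAB_RESULTS))
    print(is_allowed("nurse", "trial_a", Privilege.VIEW_DEMOGRAPHICS))
```

Unknown role and resource pairs default to no privileges, so access has to be granted explicitly. Resource-level database accounts can then restrict access further at each site.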
Conclusions
• Grid Computing is a challenging field…
• We provide one *possible* solution to applying the
technology paradigms to clinical trials and studies.
• And it is hopefully a worthwhile effort, as it potentially
brings:
– Efficient use of distributed resources and data.
– Enhanced analysis and understanding of said data.
– Closer collaboration between participants.
– Peace, prosperity and general happiness to human-kind...
• Maybe…
The main challenge…
• … is the human one.
• Encouraging technological uptake…
• Challenging techno-phobic attitudes…
Further Information
• Website: http://www.nesc.ac.uk/hub/projects/votes
• Portal: http://labpc12.nesc.gla.ac.uk:18080/gridsphere
• Contact:
– Prof. Richard Sinnott – [email protected]
– Anthony Stell – [email protected]
– Oluwafemi Ajayi – [email protected]