Grid enabled e-Research in the Life Sciences
Prof Richard Sinnott, Technical Director, National e-Science Centre
Anthony Stell
University of Glasgow, Scotland, UK
26th October 2006

Life Sciences
• Some of the Big Questions
  – How does a cell/brain work?
  – Which genes/pathways are involved in which diseases and can we develop drugs to target them?
  – Why do people who eat less tend to live longer?
  – Is this drug effective (for these individuals)?
  – How important are genetic / social / environmental factors to specific diseases?
    • …how clinically significant is the consumption of deep fried Mars Bar and Pizza Crunch in Scotland?

Life Science Grids
• Extensive Research Community
  – > 4000 at Glasgow
• Extensive Applications
  – Many people care about them
    • Health, Food, Environment
• Interacts with virtually every discipline
  – Physics, Chemistry, Maths/Stats, Nano-engineering, …
• MANY databases relevant to bioinformatics (and growing!)
  – Heterogeneity, Interdependence, Complexity, Change, …

Database Growth
[Figure: PDB Content Growth]
[Figure legend: Homo sapiens, Mus musculus, Rattus norvegicus, Bos taurus, Pan troglodytes, Canis familiaris, Monodelphis domestica, Macaca mulatta, Danio rerio, Aedes aegypti, Other]
• Yesterday the EMBL Database contained 147,881,486,173 nucleotides in 81,229,974 entries.
• DBs growing exponentially!!!
• Bibliographic (MedLine, PubMed…)
• Protein Seq (UniProt, …)
• 3D Molecular Structure (PDB, …)
• Nucleotide Seq (GenBank, EMBL…)
• Pathways (KEGG, WIT…)
• Molecular Classifications (SCOP, …)
• Motif Libraries (PROSITE, Blocks, …)
• …

More genomes …
Figure labels: Yersinia pestis, Arabidopsis thaliana, Buchnera sp.
APS, Caenorhabditis elegans, Campylobacter jejuni, Chlamydia pneumoniae, Helicobacter pylori, Mycobacterium leprae, rat, Rickettsia prowazekii, mouse, Aquifex aeolicus, Vibrio cholerae, Archaeoglobus fulgidus, Borrelia burgdorferi, Mycobacterium tuberculosis, Drosophila melanogaster, Escherichia coli, Thermoplasma acidophilum, Neisseria meningitidis Z2491, Plasmodium falciparum, Pseudomonas aeruginosa, Ureaplasma urealyticum, Saccharomyces cerevisiae, Salmonella enterica, Bacillus subtilis, Thermotoga maritima, Xylella fastidiosa

Distributed and Heterogeneous data
[Figure: kinds of data: Structure, Sequence, Function, Morphology, Gene expression, Pathways]
[Figure: example protein sequence LPSYVDWRSAGAVVDIKSQG ECGGCWAFSAIATVEGINKI TSGSLISLSEQELIDCGRTQQD NTRGCDGGYI TDGFQFIIND GGINTEENYPYTAQDGDCDV]
[Figure: example nucleotide sequence AGGTATAGCGCGCGCGATATATA AAATGTACGTACGGGCCCTTATA CGCGCGCGATATATAGCGCGCG]

Translational Research
• Just one example!
  – + links to plant/crops, environmental, health, … information sources
[Figure: levels of Systems-Biology…: Populations, Organisms, Physiology, Tissues, Protein-protein interaction (pathways), Protein Structures, Gene expressions, Nucleotide structures]

Is Grid the Answer?
• Key problems to be addressed
  – Tools that simplify access to and usage of data
    • Internet hopping is not ideal!
  – Tools that simplify access to and usage of large scale HPC facilities
    • qsub [-a date_time] [-A account_string] [-c interval] [-C directive_prefix] [-e path] [-h] [-I] [-j join] [-k keep] [-l resource_list] [-m mail_options] [-M user_list] [-N name] [-o path] [-p priority] [-q destination] [-r c] [-S path_list] [-u user_list] [-v variable_list] [-V] [-W additional_attributes] [-z] [script]
    • …

Is Grid the Answer …ctd?
• Key problems …ctd
  – Tools designed to aid understanding of complex data sets and relationships between them
    • e.g. through visualisation
  – Support different kinds of collaborative research
    • break down the silos
    • be multi-discipline
    • support the research process
  – Provide access to many more computational resources
    • to expedite scientific process (or to make it feasible!)
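As a rough illustration of the "simplify access to HPC" point, a thin wrapper can hide most of the qsub options in the usage line above behind a few named parameters. This is only a sketch under stated assumptions: the helper name `build_qsub_command` and its parameters are hypothetical, only the flags `-N`, `-q`, `-l`, `-o` and `-e` are taken from the usage line, and the exact resource strings accepted by `-l` depend on the local batch system.

```python
from typing import List, Optional


def build_qsub_command(
    script: str,
    name: Optional[str] = None,
    queue: Optional[str] = None,
    resources: Optional[List[str]] = None,
    stdout_path: Optional[str] = None,
    stderr_path: Optional[str] = None,
) -> List[str]:
    """Build a qsub argument list from a few named options.

    Hypothetical helper: it only assembles the command, it does not submit it.
    """
    cmd = ["qsub"]
    if name:
        cmd += ["-N", name]            # job name
    if queue:
        cmd += ["-q", queue]           # destination queue
    if resources:
        # Resource strings are site-specific (assumption: comma-separated list).
        cmd += ["-l", ",".join(resources)]
    if stdout_path:
        cmd += ["-o", stdout_path]     # where standard output goes
    if stderr_path:
        cmd += ["-e", stderr_path]     # where standard error goes
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # The job script name and queue name here are made up for the example.
    print(" ".join(build_qsub_command("blast_job.sh", name="blast", queue="long")))
```

A portal or workflow tool could build such a command on the user's behalf, so the researcher never sees the flags at all.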
Access to and Usage of Data
• Grid technology should allow us to
  – hide heterogeneity,
  – deal with location transparency,
  – address security concerns,
  – support data provenance
  – …
• Data Access and Integration Specification (DAIS) being defined by GGF
  – OGSA-DAI/DAIT projects play a key role in shaping these standards
• Other commercial solutions
  – IBM Information Integrator, SRS, …

Access to and Usage of HPC facilities
• Consider whole genome-genome comparisons between two species
  – Current strategy essentially chops up one genome, fires searches for those fragments in the other, then reassembles the results
    • messy approximate matching, re-assembly difficult
    • important correlations can be lost
  – to make this tractable, so-called junk DNA is ignored
  – chopping may introduce artefacts or hide phenomena
• Better to put both full genomes in memory and perform a useful complete comparison
• Only possible with very high-end machines (available via grids)
  – Should not have to be a script writer/Linux sysadmin to use these facilities

Cognitive Aspects of Data
• Life science data can be “ugly” (they are!!!)
  – Raw data sets messy
  – Requires significant effort to understand
  – Schemas/data models evolving
  – …
• Tools needed to
  – Simplify understanding
  – Improve analysis
  – Navigate through potentially huge data sets
    • e.g. to find genes of interest in chromosomes of different species, …

Collaborative Aspects
• Should provide tools that automate the way researchers wish to work
  – User driven workflows
  – Linking compute and data resources “on the fly”
    • Where is the “best place” to submit these jobs right now?
  – MyGrid workbench gaining widespread acceptance
    • 20,000+ downloads
    • 3000+ bio-services

Collaborative Aspects …ctd
• Break down the silos and be multi-disciplinary
  – We are all looking at possible genetic factors in cancer, mental health, cardiovascular…
    • …so we should co-ordinate our efforts and share data, knowledge, …
  – Has anyone generated results like these?
  – Can I see them now
    • …rather than waiting 2 years for the Nature / Science publication
  – I need input from a physicist, chemist, a statistician, a …
    • to explain this,
    • to process these results,
    • to simulate this phenomenon,
    • to verify these results
    • …

[Figure: projects placed against the levels Populations, Organisms, Physiology, Tissues, Protein-protein interaction (pathways), Protein Structures, Gene expressions, Nucleotide structures. Projects shown: GEMEPS, BRIDGES, SBRN, DyVOSE, GLASS, ESP-Grid, GS SFHS, VOTES]

BRIDGES Project
[Figure: BRIDGES architecture. CFG Virtual Organisation with sites Glasgow, Edinburgh, Oxford, London, Leicester and Netherlands, each holding private data; publicly curated data from Ensembl, OMIM, SWISS-PROT, MGI, HUGO, …, RGD; a data hub built on OGSA-DAI and Information Integrator; VO Authorisation; Synteny Service; Magna Vista Service; Bridges Portal. www.nesc.ac.uk]

MagnaVista

GeneVista

Grid Blast Interface
• Allows ‘genome scale’ blasting
• Transparently uses NGS, ScotGrid, other GU clusters, Condor pools
• Many databases already deployed across nodes
• No user certificates
• Fine grained security at back-end

Grid Enabled Microarray Expression Profile Search (GEMEPS)
• 1 year BBSRC project started 1st March 2006
  – Involves Glasgow; Cornell University, US; Riken Institute, Japan
  – Aim to provide tools for discovery, comparison and analysis of microarray data sets
    • How does my data compare to others?
      – Species, disease, platform, results, …
    • How do these experiments compare?
    • Can we improve the way we establish how genes in different species are linked?
  – Requires data access, integration and a move towards data mining
  – Built upon fine grained security
    • Microarrays are expensive and contain potentially important (valuable) data sets

Experiences
• Currently exploring microarray data sets in detail
  – GEO, ArrayExpress, local in-house microarray storage solutions at Riken, Cornell, SHWFGF …
• Investigating/Grid enabling CellMontage software (http://cellmontage.cbrc.jp/)
  – system for searching gene expression databases for cells or tissues similar to a query gene expression profile
  – similarity of two profiles computed by comparing the order of genes ranked by expression (Spearman Rank)
    • simple measure but sufficient to characterize cell types across different microarray platforms
    • gene sets/expression value ranges differ between platforms, making direct comparison of raw values difficult/impossible

Microarray Data Resources
• Various standards and interoperability issues
  – MIAME
  – MAGE-ML
  – MINiML
  – SOFTtext
  – SOFTmatrix
  – …
• What’s in a name?
  – Gene names, probe names, platforms, species names in experiments, …
    • Life Science Identifiers

Grid Enabled Microarray Expression Profile Search (GEMEPS)

Overview of VOTES
• Grids
  – Compute and Data Grids
  – Accessing Grids
  – Grid Security
• Clinical Trials and VOTES
  – Clinical Trials
  – VOTES Goals
  – Security Issues
  – Classification Issues
• Project so far
  – Implementation and Technologies
• Conclusion
  – Application to life sciences
  – The main challenge: the human one

Grids – what are they?
• Use existing resources to solve large-scale compute or data problems more efficiently
  – Rather than throwing money at hardware solutions…
  – Develop applications that intelligently use available resources whilst maintaining security between all parties involved
• Compute grids
  – Aggregation of CPU cycles and storage for better performance
• Data grids
  – Enhance quality and value of distributed information
• Virtual Organisations
  – Where parties share data and resources but in a limited sense, so who can access what must be strictly controlled.

Soundbites
• “Next generation of the Internet” [Various]
  – Knitting together network infrastructures…
• “Internet on steroids” [Me at social events]
  – Getting better performance with what you’ve got…
• “More bang for your buck” [BBC Magazine]
  – Do the same as could be achieved with lots of hardware, but doing it more efficiently…
• “Co-ordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations” [“Anatomy of the Grid”, Ian Foster]
  – …

Accessing Grids
• Want an open, usable interface to access grid applications…
• As intuitive and easy to use as browsers to access applications on the Internet…
• Portal technology is one possible way forward:
  – Developed as stateful web applications.
  – Communicate with middleware solutions which do their magic to allocate and use underlying grid infrastructures.
Grid Security - 1
• Security is often classified as: “AAA”
  – Authentication
    • “Who are you?”
  – Authorization
    • “What are you allowed to do?”
  – Accounting
    • “Where were you on the night of…?”
• But there are other aspects to be considered:
  – Anonymisation
  – Confidentiality
  – Non-repudiation

Grid Security - 2
• Not just server checking on client, but vice versa:
  – Because a server might be a client to another process…
• PKI: digital keys/certificates for authentication
  – Clever mathematics provide useful encryption and signature tools…
  – “The Code Book” by Simon Singh
• Proxy certificates
  – Pushing your credentials further down a path of trust than your immediate neighbour…
  – Delegation of trust

Grid Technologies
• Range of middleware solutions:
  – Globus Toolkit
  – Open Middleware Infrastructure Institute (OMII)
  – Shibboleth
  – GridSphere
• Not mature
  – Difficult to implement…
• No clear leaders
  – Our job is to pick, choose and develop…

Clinical Trials
• Research studies into new drugs, medical devices or other interventions on patients in a scientifically controlled environment.
• Required for regulatory authority approval of new therapies.
• Generally speaking: they help improve quality of life.

VOTES
• Virtual Organisations for Trials and Epidemiological Studies
• 3 year (£2.8 million) MRC funded project started in October 2005
• Collaboration between various UK universities:
  – Glasgow, Oxford, Nottingham/Leicester, Manchester, Imperial College London
• Focuses on three key areas of clinical trials:
  – Patient Recruitment
  – Data Collection
  – Study Management

Key Areas
• Patient Recruitment
  – How many men aged between 45 and 65 had a heart attack last year? How many of them would be willing to participate in the trial of a new drug?
• Data Collection
  – Are the participants taking their drug/placebo on a regular basis? Have there been any incidents relating to the trial?
• Study Management
  – Who can see the trial data (e.g. consultants, nurses)?
    Who ensures the trial is in the patient’s interest? Can we simplify the ethical review process?

Data Grids
• Falls into the remit of the “Information Grid”:
  – “… which provides a way for information resources to be joined with related information resources to greater exploit the value of the inherent relationships among information, then for new connections to be made as situations change.” [Grid Computing with Oracle, technical white paper, 2005]
• Two main challenges:
  – Security
  – Data classification
• And we want to “plug in” to the existing NHS IT infrastructure…

Additional Clinical Security
• Anonymisation
  – De-identifying data
  – Only interested in the statistical data => don’t need to know the patient’s identity
  – So the identifying data is encrypted
• Statistical Inference
  – When two bits of seemingly innocuous data are joined, the combination can identify an individual
  – E.g. an unusual condition in a particular postcode

Data Classification
• Main problem here is one of language and definition across domains.
• Solutions proposed include:
  – Global schema
    • Essentially an overall description of data that all parties must subscribe to.
  – Ontology
    • Methods of translating each party’s idiosyncratic description to a common description used by all parties.
• No clear solution to this yet…
  – Current method is to join distributed databases on CHI number (Community Health Index).

VOTES Portal Overview
• Developed on local test-bed of distributed servers and databases:
  – Users log in and are assigned privileges based on role
  – Select clinical trial
  – Select parameters to view and apply conditions (if desired)
  – Results of this query are brought back from the databases distributed over the test-bed (or VO if you will…), joined and presented as a unified resource.
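The join on CHI number and the de-identification step described on the preceding slides can be sketched roughly as follows. This is an illustration only: the record fields, the sample values and the helpers `pseudonymise` and `join_on_chi` are invented for the example, and a keyed hash (HMAC-SHA256) is swapped in here for whatever encryption scheme the project actually uses.

```python
import hashlib
import hmac
from typing import Dict, List

# Assumption: an illustrative key; in practice this would be held by the data owner.
SECRET_KEY = b"example-key-held-by-data-owner"


def pseudonymise(chi: str, key: bytes = SECRET_KEY) -> str:
    """Replace a CHI number with a keyed hash.

    The same CHI always maps to the same token, so rows stay linkable,
    but the token does not reveal the CHI without the key.
    """
    return hmac.new(key, chi.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def join_on_chi(left: List[Dict], right: List[Dict]) -> List[Dict]:
    """Inner-join two lists of records on 'chi', then drop the raw identifier."""
    right_by_chi = {row["chi"]: row for row in right}
    joined = []
    for row in left:
        match = right_by_chi.get(row["chi"])
        if match is None:
            continue  # no matching record in the other database
        merged = {**row, **match}
        merged["patient_id"] = pseudonymise(merged.pop("chi"))
        joined.append(merged)
    return joined


if __name__ == "__main__":
    # Made-up records standing in for two distributed databases.
    gp_records = [{"chi": "0000000001", "age": 52}, {"chi": "0000000002", "age": 61}]
    hospital_records = [{"chi": "0000000001", "event": "heart attack"}]
    for record in join_on_chi(gp_records, hospital_records):
        print(record)
```

Note that removing the identifier does not by itself address the statistical inference problem: a rare combination of the remaining fields may still single out a patient.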
• Demo available at break…

VOTES Portal Snapshots

Architecture
[Figure: VOTES architecture. Portal, Grid Server and Data Server; access and security policies; Authorisation with Access Matrix; Globus Container; User Authentication; OGSA-DAI Service; Driving DB; data sources Glasgow GPASS, Glasgow SCI Store 1 (SQL Server), SCI Store 2 (SQL Server), Consent DB (Oracle 10g), RCB Test Trials DB (SQL Server), each under Local Trust Policies; Remote Trust Policies; Transfer; Other Grid Nodes]

Technologies
• Technologies
  – GridSphere (2.1)
  – Globus Toolkit (4.0)
  – OGSA-DAI (2.2)
• Security Framework
  – Database user management (Resource-level)
    • Local restrictions on local resources
  – Access Control matrix (VO-level)
    • A bit-wise privilege matrix that will be available to the whole VO
• Representative NHS Databases
  – GPASS
  – SCI Store

Conclusions
• Grid Computing is a challenging field…
• We provide one *possible* solution to applying the technology paradigms to clinical trials and studies.
• And it is hopefully a worthwhile effort, as it potentially brings:
  – Efficient use of distributed resources and data.
  – Enhanced analysis and understanding of said data.
  – Closer collaboration between participants.
  – Peace, prosperity and general happiness to human-kind...
    • Maybe…

The main challenge…
• … is the human one.
• Encouraging technological uptake…
• Challenging techno-phobic attitudes…

Further Information
• Website: http://www.nesc.ac.uk/hub/projects/votes
• Portal: http://labpc12.nesc.gla.ac.uk:18080/gridsphere
• Contact:
  – Prof. Richard Sinnott: r.sinnott@nesc.gla.ac.uk
  – Anthony Stell: a.stell@nesc.gla.ac.uk
  – Oluwafemi Ajayi: o.ajayi@nesc.gla.ac.uk
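Returning to the VO-level access control matrix listed under Technologies: a bit-wise privilege matrix might look roughly like the sketch below. Everything here is an assumption made for illustration, including the privilege names, the resource name `trials_db`, the matrix contents and the helper `is_allowed`; only the roles (consultants, nurses) echo the study management question raised earlier. It is not the VOTES implementation.

```python
from enum import IntFlag
from typing import Dict, Tuple


class Privilege(IntFlag):
    """One bit per privilege (hypothetical set of privileges)."""
    NONE = 0
    VIEW_AGGREGATE = 1      # may see summary statistics
    VIEW_RECORDS = 2        # may see de-identified patient-level rows
    VIEW_IDENTIFIERS = 4    # may see identifying fields


# (role, resource) -> privilege bits; the entries are made up for the example.
ACCESS_MATRIX: Dict[Tuple[str, str], Privilege] = {
    ("nurse", "trials_db"): Privilege.VIEW_AGGREGATE,
    ("consultant", "trials_db"): Privilege.VIEW_AGGREGATE | Privilege.VIEW_RECORDS,
}


def is_allowed(role: str, resource: str, needed: Privilege) -> bool:
    """Return True only if every bit in `needed` is granted to the role."""
    granted = ACCESS_MATRIX.get((role, resource), Privilege.NONE)
    return (granted & needed) == needed


if __name__ == "__main__":
    print(is_allowed("consultant", "trials_db", Privilege.VIEW_RECORDS))
    print(is_allowed("nurse", "trials_db", Privilege.VIEW_RECORDS))
```

Because each cell is a single integer, the whole matrix is compact enough to share across the VO, while resource-level database accounts can still apply tighter local restrictions.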