Life Sciences & Grid Technologies: Challenges & Opportunities Dr Richard Sinnott



Technical Director, National e-Science Centre

Deputy Director (Technical), Bioinformatics Research Centre

University of Glasgow

16th September 2004

HPCInform,

16 Sep, 2004

e-Science and the Grid

‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’

‘e-Science will change the dynamic of the way science is undertaken.’

John Taylor

Director General of Research Councils

Office of Science and Technology

Grid is infrastructure used for e-Science

• Metaphor of the power grid: compute and data resources on demand (revisited later)


From presentation by Tony Hey

Foundation for e-Science

e-Science methodologies transforming science, engineering, medicine and business, driven by exponential growth in data and compute demands

• enabling a whole-system approach

[Figure: the Grid linking computers, software, sensor nets, instruments, shared data archives and colleagues]


Data Grids for High Energy Physics

[Figure: tiered data grid for high energy physics. Labels recoverable from the diagram, top to bottom:]

• Online System: ~PBytes/sec, ~100 MBytes/sec
• Offline Processor Farm: ~20 TIPS
• Tier 0: ~100 MBytes/sec
• Regional centres: France, Germany, Italy, FermiLab (~4 TIPS); ~622 Mbits/sec or Air Freight (deprecated)
• Tier2 Centres: Caltech and others, ~1 TIPS each; ~622 Mbits/sec
• Institutes: physics data cache; ~1 MBytes/sec
• Physicist workstations

Notes on the figure:

• 1 TIPS is approximately 25,000 SpecInt95 equivalents
• There is a “bunch crossing” every 25 nsecs
• There are 100 “triggers” per second
• Each triggered event is ~1 MByte in size
• Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server
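The rates quoted on this slide are consistent with one another, and a short calculation makes that explicit. The sketch below uses only the figures from the slide (25 ns bunch spacing, 100 triggers per second, ~1 MByte per event). The variable names and the per-day extrapolation are illustrative assumptions, with 1 TByte taken as 10^6 MBytes.

```python
# Figures taken from the slide; names are illustrative, not from the talk.
BUNCH_SPACING_NS = 25       # one bunch crossing every 25 ns
TRIGGERS_PER_SEC = 100      # events kept per second
EVENT_SIZE_MBYTES = 1       # approximate size of one triggered event

# Integer arithmetic avoids floating-point noise in the crossing rate.
crossings_per_sec = 1_000_000_000 // BUNCH_SPACING_NS

# How many crossings occur for each event that is kept.
kept_one_in = crossings_per_sec // TRIGGERS_PER_SEC

# Data rate after triggering, matching the ~100 MBytes/sec label.
rate_mbytes_per_sec = TRIGGERS_PER_SEC * EVENT_SIZE_MBYTES

# Assumed extrapolation: continuous running for 24 hours.
tbytes_per_day = rate_mbytes_per_sec * 86_400 / 1_000_000

print(crossings_per_sec, kept_one_in, rate_mbytes_per_sec, tbytes_per_day)
```

The trigger therefore keeps roughly one crossing in several hundred thousand, which is what brings the stream down to a rate that can be shipped to the tiers.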


NeSC in the UK

NeSC: Prof Malcolm Atkinson (Director)

• Two UK OGSA Test Grid projects started in January
• Previous work on UK e-Science Grid based on GT2
• Universities of Portsmouth, Reading, Manchester, Westminster and CCLRC
• Brief outline of current and future developments
• The next Grid
• Integrated Earth system modelling
• OGSA definition and delivery
• …and Technologies GT3, GT4…
• Hosting environments & Platforms
• Combinations of services supported
• Material and grids to support adopters

[Map of UK e-Science centres and facilities: Belfast, Daresbury Lab, CSAR, Cardiff, Oxford, RAL, Edinburgh, HPC(x), Newcastle, White Rose Grid, London, Cambridge, Hinxton, Southampton]


Glasgow e-Science Hub

Externally

• Glasgow end of NeSC
– Involved in UK wide activities
» ETF: In May 2003 became first UK e-Science Centre to run integration tests across every site of the UK (Level 2) Grid. Therefore 100% access to UK Grid resources at this time
– Public visibility of NeSC
» responsible for NeSC web site

Internally

• Focal point for e-Science research/activities at Glasgow
• Work closely with foundation departments
– Department of Computing Science
– Department of Physics & Astronomy
• Also working closely with other groups including
– Bioinformatics Research Centre
– Electronics and Electrical Engineering
– Biostatistics


Glasgow e-Science Activities

Consolidating resources

• Originally built around ScotGrid
• Providing shared Grid resource for wide variety of scientists inside/outside Glasgow
– HEP, CS, BRC, EEE, …
» Target shares established
– Focal point for e-Science at Glasgow
» University wide involvement (steering, mgt, users, tech. groups)

Hardware

• 59 IBM X Series 330 dual 1 GHz Pentium III with 2GB memory

• 2 IBM X Series 340 dual 1 GHz Pentium III with 2GB memory

• 3 IBM X Series 340 dual 1 GHz Pentium III with 2GB memory and 100 + 1000 Mbit/s ethernet

• 1TB disk

LTO/Ultrium Tape Library

Cisco ethernet switches

New..

• IBM X Series 370 PIII Xeon with 32 x 512 MB RAM

• 5TB FastT500 disk 70 x 73.4 GB IBM FC Hot-Swap HDD

• eDIKT 28 IBM blades dual 2.4 GHz Xeon with 1.5GB memory

• eDIKT 6 IBM X Series 335 dual 2.4 GHz Xeon with 1.5GB memory

• CDF 10 Dell PowerEdge 2650 2.4 GHz Xeon with 1.5GB memory

• CDF 7.5TB Raid disk

[Diagram: ScotGrid (disk ~15TB, CPU ~330 1GHz) consolidated with the University SAN, EEE compute clusters, Condor pools, the Comp. Serv. 2nd HPC facility and shared sys-admin staff]


Life Sciences

Extensive Research Community

>1000 per research university

Extensive Applications

Many people care about them

• Health, Food, Environment

Interacts with virtually every discipline

Physics, Chemistry, Maths/Stats, Nano-engineering, …

450+ databases relevant to bioinformatics (and growing!)

Heterogeneity, Interdependence, Complexity, Change, …

Wonderful Scientific Questions

How does a cell work?

How does a brain work?

How does an organism develop?

What happens to the biosphere when the earth warms up?

Why do people who eat less tend to live longer?


More genomes …

[Figure: sequenced genomes] Yersinia pestis, Arabidopsis thaliana, Buchnera sp. APS, Aquifex aeolicus, Archaeoglobus fulgidus, Borrelia burgdorferi, Mycobacterium tuberculosis, Caenorhabditis elegans, Campylobacter jejuni, Chlamydia pneumoniae, Vibrio cholerae, Drosophila melanogaster, Escherichia coli, Thermoplasma acidophilum, Helicobacter pylori, Mycobacterium leprae, mouse, Neisseria meningitidis Z2491, Plasmodium falciparum, Pseudomonas aeruginosa, Ureaplasma urealyticum, rat, Salmonella enterica, Bacillus subtilis, Thermotoga maritima, Xylella fastidiosa

Distributed and Heterogeneous data

• Structure
• Function
• Sequence, e.g.

LPSYVDWRSA GAVVDIKSQG
ECGGCWAFSA IATVEGINKI
TSGSLISLSE QELIDCGRTQ
NTRGCDGGYI TDGFQFIIND
GGINTEENYP YTAQDGDCDV

• Gene expression
• Morphology


Complexity of Biological Data


+ links to plant/crops, environmental, health, … information sources

Database Growth

[Chart: PDB Content Growth]

• DBs growing exponentially!
• Bibliographic (MedLine, …)
• Amino Acid Seq (SWISS-PROT, …)
• 3D Molecular Structure (PDB, …)
• Nucleotide Seq (GenBank, EMBL, …)
• Biochemical Pathways (KEGG, WIT, …)
• Molecular Classifications (SCOP, CATH, …)
• Motif Libraries (PROSITE, Blocks, …)


Is Grid the Answer?

Key problems to be addressed

Tools that simplify access to and usage of data

• Internet hopping is not ideal!

Tools that simplify access to and usage of large scale HPC facilities

• qsub [-a date_time] [-A account_string] [-c interval] [-C directive_prefix] [-e path] [-h] [-I] [-j join] [-k keep] [-l resource_list] [-m mail_options] [-M user_list] [-N name] [-o path] [-p priority] [-q destination] [-r c] [-S path_list] [-u user_list] [-v variable_list] [-V] [-W additional_attributes] [-z] [script]

Tools designed to aid understanding of complex data sets and relationships between them

• e.g. through visualisation
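One way to read the qsub point is that a tool could accept a small job description and generate the command line on the scientist's behalf. The following is a minimal sketch of that idea, not part of the talk: the `Job` class and `qsub_args` function are hypothetical names, only flags that appear in the synopsis above are used (`-N`, `-q`, `-l`, `-o`, `-e`), and the resource names accepted by `-l` depend on the site. The sketch only builds the argument list and does not submit anything.

```python
import shlex
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Job:
    """Hypothetical job description a portal might collect from a user."""
    script: str
    name: Optional[str] = None
    queue: Optional[str] = None
    resources: Dict[str, str] = field(default_factory=dict)
    stdout: Optional[str] = None
    stderr: Optional[str] = None


def qsub_args(job: Job) -> List[str]:
    """Translate a Job into a qsub argument list (nothing is executed)."""
    args = ["qsub"]
    if job.name:
        args += ["-N", job.name]
    if job.queue:
        args += ["-q", job.queue]
    if job.resources:
        # -l takes a comma-separated resource list of name=value pairs.
        args += ["-l", ",".join(f"{k}={v}" for k, v in job.resources.items())]
    if job.stdout:
        args += ["-o", job.stdout]
    if job.stderr:
        args += ["-e", job.stderr]
    args.append(job.script)
    return args


if __name__ == "__main__":
    job = Job(script="blast.sh", name="blast", resources={"walltime": "01:00:00"})
    # Show the command a user would otherwise have had to type by hand.
    print(shlex.join(qsub_args(job)))
```

Keeping the translation in one function means the user-facing form can stay small while the site-specific details live in a single place.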


Access to and Usage of Data

Grid technology should make it possible to hide heterogeneity, provide location transparency, address security concerns, …

Data Access and Integration Specification (DAIS) being defined by GGF

OGSA-DAI and DAIT projects play a key role in shaping these standards

Other commercial solutions (IBM Information Integrator, …)

More later!
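To make "hiding heterogeneity" and "location transparency" concrete, here is a toy stand-in using Python's built-in `sqlite3` module. It is not OGSA-DAI, DAIS or Information Integrator: two separate in-memory databases play the role of a private project database and a publicly curated one, and a single function answers a question that spans both. All table names, column names and rows are placeholders invented for the illustration.

```python
import sqlite3


def build_federation() -> sqlite3.Connection:
    """Create two independent in-memory databases behind one connection."""
    conn = sqlite3.connect(":memory:")
    # A second, separate database attached under the schema name 'curated'.
    conn.execute("ATTACH DATABASE ':memory:' AS curated")
    conn.execute("CREATE TABLE main.qtl (gene TEXT, trait TEXT)")
    conn.execute("CREATE TABLE curated.genes (symbol TEXT, species TEXT)")
    # Placeholder rows only; these are not real gene records.
    conn.executemany(
        "INSERT INTO main.qtl VALUES (?, ?)",
        [("GENE_A", "trait_1"), ("GENE_B", "trait_1")],
    )
    conn.executemany(
        "INSERT INTO curated.genes VALUES (?, ?)",
        [("GENE_A", "rat"), ("GENE_C", "mouse")],
    )
    return conn


def genes_for_trait(conn: sqlite3.Connection, trait: str):
    """Answer one question across both sources; the caller never names them."""
    rows = conn.execute(
        "SELECT q.gene, g.species FROM main.qtl AS q "
        "JOIN curated.genes AS g ON g.symbol = q.gene "
        "WHERE q.trait = ? ORDER BY q.gene",
        (trait,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(genes_for_trait(build_federation(), "trait_1"))
```

The caller of `genes_for_trait` sees one logical data set; where each table lives, and how its columns are named, is confined to the federation layer.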


Access to and Usage of HPC facilities

Consider whole genome-genome (2*3*10^9 bp) comparisons between two species

Current strategy essentially chops up one genome, fires searches for those fragments in the other, then re-assembles results

• messy approximate matching
• re-assembly difficult; important correlations can be lost
– to make this tractable so called junk DNA ignored
– chopping may introduce artefacts or hide phenomena

→ Better to put both full genomes in memory and perform a useful complete comparison

→ Only possible with very high-end machines (available via grids)

Should not have to be script writer/Linux sys-admin to use these facilities
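The chop-and-search strategy, and the way it can hide a shared feature, can be sketched in a few lines. This is an illustration under stated assumptions rather than the method used in the project: exact substring search stands in for BLAST-style approximate matching, and the function names, fragment sizes and toy sequences are all hypothetical. The last function gives the memory needed to hold 2*3*10^9 bases for a chosen encoding.

```python
def chop(genome, size, overlap=0):
    """Yield (offset, fragment) pairs that together cover the genome."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    for start in range(0, len(genome), step):
        yield start, genome[start:start + size]
        if start + size >= len(genome):
            break  # the last fragment already reaches the end


def shared_fragments(query, target, size, overlap=0):
    """Return {offset_in_query: [offsets_in_target]} for exact fragment hits."""
    hits = {}
    for offset, fragment in chop(query, size, overlap):
        found, pos = [], target.find(fragment)
        while pos != -1:
            found.append(pos)
            pos = target.find(fragment, pos + 1)
        if found:
            hits[offset] = found
    return hits


def bytes_needed(bases, bits_per_base):
    """Memory needed to hold a number of bases at a given encoding density."""
    return bases * bits_per_base // 8


if __name__ == "__main__":
    query, target = "AAAACCCC", "TTAACCTT"
    # 'AACC' is common to both but straddles a fragment boundary in the query.
    print(shared_fragments(query, target, size=4))
    print(shared_fragments(query, target, size=4, overlap=2))
    # Two genomes of 3*10^9 bp each, at one byte per base and at two bits per base.
    print(bytes_needed(2 * 3 * 10**9, 8), bytes_needed(2 * 3 * 10**9, 2))
```

Without overlap the shared stretch is missed entirely, which is the kind of artefact the slide warns about; overlapping fragments recover it at the cost of more searches and more re-assembly work.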


Cognitive aspects of Data

Life science data can be “ugly”

Raw data sets messy

Requires significant effort to understand

Schemas/data models evolving

Tools needed to

Simplify understanding

Improve analysis

Navigate through potentially huge data sets

• e.g. to find genes of interest in chromosomes of different species


Overview of BRIDGES

Biomedical Research Informatics Delivered by Grid Enabled Services (BRIDGES)

NeSC (Edinburgh and Glasgow) and IBM

Started October 2003

Supporting project for CFG project

Generating data on hypertension

Rat, Mouse, Human genome databases

Variety of tools used

BLAST, BLAT, Gene Prediction, visualisation, …

Variety of data sources and formats

Microarray data, genome DBs, project partner research data, …

Aim is integrated infrastructure supporting

Data federation

Security


Bridges Project

[Diagram: the CFG Virtual Organisation. Partner sites (Glasgow, Oxford, London, Leicester, Netherlands) each hold private data. A data hub built on Information Integrator and OGSA-DAI draws on publicly curated data (Ensembl, OMIM, SWISS-PROT, MGI, HUGO and others). Also shown: VO Authorisation, Synteny Grid Service, blast.]


Grid Security

OGSA security

Single sign-on based on (X.509) digital certificates

• establish credentials
– Certification authority based (RAL in UK)

Services (and clients) have APIs for fine grained security

• Based on GSS-API

Provides for authentication but need authorisation

• Various technologies for authorisation including PERMIS, CAS, …

Collaborating with PrivilEge and Role Management Infrastructure Standards Validation (PERMIS) team

• Led by Prof David Chadwick, University of Salford
– (www.permis.org)


Security Authorisation

PERMIS makes it possible to

Define roles for who can do what on what

• Policy = { Role x Target x Action }
– Can user X invoke service Y and access or change data Z?
» Policies created with PERMIS PolicyEditor (output is XML based policy)
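The Policy = { Role x Target x Action } idea can be illustrated as a set of permitted triples with everything else refused. The sketch below is a generic role-based check written for this purpose; it is not the PERMIS API and does not reproduce the PERMIS XML policy format. The `Rule` type, the `is_allowed` function and the role, target and action names are all hypothetical.

```python
import logging
from typing import Iterable, NamedTuple, Set

log = logging.getLogger("authz-sketch")


class Rule(NamedTuple):
    """One permitted combination: this role may perform this action on this target."""
    role: str
    target: str
    action: str


def is_allowed(policy: Set[Rule], user_roles: Iterable[str], target: str, action: str) -> bool:
    """Grant only if some role the user holds is explicitly permitted; deny otherwise."""
    roles = set(user_roles)
    for role in roles:
        if Rule(role, target, action) in policy:
            return True
    # Deny by default, and keep a record of the refusal.
    log.warning("rejected: roles=%s target=%s action=%s", sorted(roles), target, action)
    return False


if __name__ == "__main__":
    policy = {
        Rule("researcher", "blast-service", "invoke"),
        Rule("curator", "project-data", "change"),
    }
    print(is_allowed(policy, {"researcher"}, "blast-service", "invoke"))
    print(is_allowed(policy, {"researcher"}, "project-data", "change"))
```

Because the policy lists only what is permitted, a request that nobody thought to mention is refused rather than silently allowed.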


Security Authorisation

PERMIS Privilege Allocator then used to sign policies

• Associates roles with specific users
• Policies stored as attribute certificates in LDAP server

When is authorisation done?

Two main choices

• Portal personalised for users based on their policies
– If not allowed to invoke service then they do not get to see it
• Actions of users (with given role) are authorised every time the service is invoked
– They can see the service but potentially not be allowed to invoke it
» Performance issues… but more likely scenario for authorisation
• In both cases, if not explicitly agreed in policy then rejected and logged!
– Both cases being explored

Plan to exploit the GGF SAML AuthZ specification

• Based on GT3.3 – currently have BLAST service in GT3.2Final
– Identified issues with standards…
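The two choices for when authorisation happens differ mainly in where the check sits. The sketch below contrasts them under simple assumptions: a policy is a set of (role, service, action) tuples, and the function names, service names and log structure are hypothetical. It does not use PERMIS, SAML or Globus Toolkit interfaces.

```python
def may_invoke(policy, roles, service):
    """True if any of the user's roles is explicitly allowed to invoke the service."""
    return any((role, service, "invoke") in policy for role in roles)


def visible_services(policy, roles, services):
    """Choice 1: personalise the portal so users only see what they may invoke."""
    return [s for s in services if may_invoke(policy, roles, s)]


def invoke(policy, roles, service, handler, rejected):
    """Choice 2: authorise on every invocation; reject and log anything not agreed."""
    if not may_invoke(policy, roles, service):
        rejected.append((sorted(roles), service))
        raise PermissionError(f"{service}: not authorised")
    return handler()


if __name__ == "__main__":
    policy = {("researcher", "blast", "invoke")}
    rejected = []
    print(visible_services(policy, {"researcher"}, ["blast", "synteny"]))
    print(invoke(policy, {"researcher"}, "blast", lambda: "job accepted", rejected))
    try:
        invoke(policy, {"researcher"}, "synteny", lambda: "job accepted", rejected)
    except PermissionError as err:
        print(err, rejected)
```

The first choice pays the cost once per login, while the second repeats the check on each call, which is where the performance concern on the slide comes from.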


Where we are today!

Information Integrator DB repository established and populated

… with public data sets (OMIM, HUGO, RGD, SWISS-PROT)

… linked to relevant resources (ENSEMBL- rat, human, mouse, MGI)

GT3 based Grid services developed (BLAST) using own meta-scheduler

General usage of ScotGrid and local Condor pool

Portal developed using IBM WebSphere

Genome visualisation browsers

SyntenyVista – for viewing synteny between local/remote data sets

MagnaVista – for exploring genetic information across multiple (remote) resources

Gaining experience with security technologies

Setting up policies with Grid security authorisation software etc

Rolled-out Alpha version of system to CFG group July ‘04


Lessons learned

Public data resources openness

• Often cannot query directly
• Often not easy/possible to find schemas

Joint Data Standards Study investigating this

• Started on 1st June and involves
– Digital Archiving Consultancy
– Bioinformatics Research Centre (Glasgow)
– NeSC (Edinburgh and Glasgow)
• Look at technical, political, social, ethical etc. issues involved in accessing and using public life science resources
– Will liaise with NDCC
– Interview relevant scientists, data curators/providers
• 8 month project with final report in January
– Funded by MRC, BBSRC, Wellcome Trust, JISC, NERC, DTI

GT3 not without pain! (… understatement!)

Hopefully GT4 will be better?


Other NeSC Glasgow Projects

Dynamic Virtual Organisations for e-Science Education (DyVOSE) project

Two year project started 1st May 2004 funded by JISC

Exploring advanced authorisation infrastructures for security

• … in Grid Computing Module as part of advanced MSc at Glasgow
– Will provide insight into rolling Grid out to the masses!

[Diagram: PERMIS based Education VO policies; authorisation checks and authorisation decisions for ScotGrid, the GU Condor pool and other (known!) Grid resources]


Scottish Bioinformatics Research Network

Four year proposal expected to start November 2004

Funded by Scottish Enterprise, Scottish Higher Education Funding Council, Scottish Executive Environment and Rural Affairs Department

• Involves Glasgow, Dundee, Edinburgh, Scottish Bioinformatics Forum

Aim to provide bioinformatics infrastructure for Scottish health, agriculture and industry

• Infrastructure support at Dundee, Edinburgh and Glasgow to support first-rate research in bioinformatics at each academic institute
• Infrastructure support at three institutes, to support inter-institutional sharing of compute and data resources through application of Grid computing
• Outreach and training activities mediated by the Scottish Bioinformatics Forum


VOTES

Virtual Organisations for Trials and Epidemiological Studies

3 year MRC funded project expected to start imminently

Plans to develop Grid infrastructure to address key components of clinical trial/observational study

• Recruitment of potentially eligible participants
• Data collection during the study
• Study administration and coordination
– Involves Glasgow, Oxford, Leicester, Nottingham, Manchester

[Diagram: a Clinical Virtual Organisation Framework is used to realise clinical virtual organisations such as CVO-1 (e.g. for data collection) and CVO-2 (e.g. for recruitment). Data sources shown: disease registries, GPs, hospital databases, clinical trial data sets. Other labels: Transfer, Grid, GLA, Lei-, Nott, OX, IMP.]


Conclusions

NeSC Glasgow focussed largely on life sciences and security

The GRID is happening and will influence life sciences

Government SR allocated £115M for eScience

Life sciences are key

Grid middleware and standards still “evolving”

Non-trivial learning curve

Immaturity of technology

Courses offered at NeSC

Web services, grid services, XML, Java, DB technologies, bioinformatics, others…?

• www.nesc.ac.uk for registration and suggesting events

Grid Computing modules under development at Glasgow as part of advanced MSc


Conclusions

Main take home message: need coordination of

• GRID software engineering/standards
• Life science needs/standards

Without this, metaphor of electric power GRID doomed


[Screenshots: Bridges Portal; MagnaVista; QTL upload; QTL browsing]

Grid Blast Client

Allows ‘genome scale’ blasting

• Uses ScotGrid and idle compute resources of training lab Condor pool
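Running a 'genome scale' search over more than one resource implies splitting the query set and deciding where each piece goes. The sketch below shows one plausible way to do that and is not the project's meta-scheduler: it parses FASTA text into records and assigns them greedily, longest first, to whichever resource would have the lowest load per free slot. The function names, resource names and slot counts are assumptions made for the illustration.

```python
def split_fasta(text):
    """Split FASTA text into (header, sequence) records."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def assign(records, free_slots):
    """Greedy plan: longest sequences first, each to the least loaded resource per slot."""
    load = {name: 0 for name, slots in free_slots.items() if slots > 0}
    if not load:
        raise ValueError("no resource has free slots")
    plan = {name: [] for name in load}
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        # Pick the resource whose load per slot would be lowest after adding this record.
        name = min(load, key=lambda n: (load[n] + len(seq)) / free_slots[n])
        plan[name].append(header)
        load[name] += len(seq)
    return plan


if __name__ == "__main__":
    fasta = ">a\nAAAA\n>b\nCC\n>c\nGG\n"
    # Hypothetical resources and free slot counts.
    print(assign(split_fasta(fasta), {"scotgrid": 2, "condor": 1}))
```

Sequence length is used here as a crude proxy for search cost; a real scheduler would also need to account for queue waits and for machines in an idle pool being reclaimed.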
