Dr Richard Sinnott
Technical Director National e-Science Centre
|||
Deputy Director (Technical) Bioinformatics Research Centre
University of Glasgow
16 th September 2004
HPCInform,
16 Sep, 2004
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’
‘e-Science will change the dynamic of the way science is undertaken.’
John Taylor
Director General of Research Councils
Office of Science and Technology
•Metaphor of Power Grid: compute and data resources on demand …revisited later
HPCInform,
16 Sep, 2004
From presentation by Tony Hey
e-Science methodologies transforming science, engineering, medicine and business driven by exponential growth in data, compute demands
X enabling a whole-system approach computers software
Grid sensor nets instruments
Shared data archives colleagues
HPCInform,
16 Sep, 2004
~PBytes/sec
Online System ~100 MBytes/sec
Offline Processor Farm
1 TIPS is approximately 25,000
SpecInt95 equivalents
There is a “bunch crossing” every 25 nsecs.
There are 100 “triggers” per second
Each triggered event is ~1 MByte in size
~20 TIPS
~622 Mbits/sec or Air Freight (deprecated)
Tier 0
~100 MBytes/sec
France Regional
Centre
Germany Regional
Centre
Italy Regional
Centre
FermiLab ~4 TIPS
~622 Mbits/sec
Caltech
~1 TIPS ~1 TIPS ~1 TIPS
Tier2 Centre
~1 TIPS
Tier2 Centre
~1 TIPS
~622 Mbits/sec
Physics data cache
Institute Institute
~1 MBytes/sec
Physicists work on analysis “channels”.
Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server
Physicist workstations
HPCInform,
16 Sep, 2004
NeSC
X
X
X
Prof Malcolm Atkinson (Director)
X
Two UK OGSA Test Grid projects started in January
Previous work on UK e-Science Grid based on GT2
X
Universities of Portsmouth, Reading, Manchester, Westminster and CCLRC
X
X
X
Brief outline current and future developments
The next Grid
X
Integrated Earth system modelling
OGSA definition and delivery Belfast
?
Daresbury Lab
X
…and Technologies GT3, GT4…
Hosting environments & Platforms
Combinations of services supported
CSAR
Material and grids to support adopters Cardiff
Oxford
RAL
Edinburgh
HPC(x)
Newcastle
White Rose Grid
London
Cambridge
Hinxton
Southampton
HPCInform,
16 Sep, 2004
E-Science Hub
Externally
X
Glasgow end of NeSC
– Involved in UK wide activities
» ETF: In May 2003 became first UK e-Science Centre to run integration tests across every site of the UK (Level 2) Grid.
Therefore 100% access to UK Grid resources at this time
– Public visibility of NeSC
» responsible for NeSC web site
Internally
X
X
X
Focal point for e-Science research/activities at Glasgow
Work closely with foundation departments
– Department of Computing Science
– Department of Physics & Astronomy
Also working closely with other groups including
– Bioinformatics Research Centre
– Electronics and Electrical Engineering
– Biostatistics
16 Sep, 2004
Consolidating resources
Originally built around ScotGrid
X
Providing shared Grid resource for wide variety of scientists inside/outside Glasgow
– HEP, CS, BRC, EEE, …
» Target shares established
– Focal point for e-Science at Glasgow
» University wide involvement (steering, mgt, users, tech. groups)
Hardware
• 59 IBM X Series 330 dual 1 GHz Pentium III with 2GB memory
• 2 IBM X Series 340 dual 1 GHz Pentium III with 2GB memory
• 3 IBM X Series 340 dual 1 GHz Pentium III with 2GB memory and 100 + 1000 Mbit/s ethernet
•
•
• 1TB disk
LTO/Ultrium Tape Library
Cisco ethernet switches
New..
• IBM X Series 370 PIII Xeon with 32 x 512 MB RAM
• 5TB FastT500 disk 70 x 73.4 GB IBM FC Hot-Swap HDD
• eDIKT 28 IBM blades dual 2.4 GHz Xeon with 1.5GB memory
• eDIKT 6 IBM X Series 335 dual 2.4 GHz Xeon with 1.5GB memory
• CDF 10 Dell PowerEdge 2650 2.4 GHz Xeon with 1.5GB memory
• CDF 7.5TB Raid disk
ScotGrid [ Disk ~15TB
…
CPU ~ 330 1GHz ]
Comp. Serv. 2
University SAN
EEE compute clusters
Condor pools nd HPC facility
Shared sys-admin staff
HPCInform,
16 Sep, 2004
Extensive Research Community
>1000 per research university
Extensive Applications
Many people care about them
X
Health, Food, Environment
Interacts with virtually every discipline
Physics, Chemistry, Maths/Stats, Nano-engineering, …
450+ databases relevant to bioinformatics (and growing!)
Heterogeneity, Interdependence, Complexity, Change, …
Wonderful Scientific Questions
How does a cell work?
How does a brain work?
How does an organism develop?
What happens to the biosphere when the earth warms up?
Why do people who eat less tend to live longer?
…
HPCInform,
16 Sep, 2004
Yersinia pestis
Arabidopsis thaliana
Buchnerasp.
APS
Aquifex aeolicus
Archaeoglobus fulgidus
Borrelia burgorferi
Mycobacterium tuberculosis
Caenorhabitis elegans
Campylobacter jejuni
Chlamydia pneumoniae
Vibrio cholerae
Drosophila melanogaster
Escherichia
Thermoplasma coli acidophilum
Helicobacter pylori
Mycobacterium leprae mouse
Neisseria meningitidis
Z2491
Plasmodium falciparum
Pseudomonas aeruginosa
Ureaplasma urealyticum rat
Salmonella enterica
Bacillus subtilis
Thermotoga maritima
Xylella fastidiosa
Structure
Function
Sequence
LPSYVDWRSA GAVVDIKSQG
ECGGCWAFSA IATVEGINKI
TSGSLISLSE QELIDCGRTQ
NTRGCDGGYI TDGFQFIIND
GGINTEENYP YTAQDGDCDV
Gene expression Morphology
HPCInform,
16 Sep, 2004
HPCInform,
16 Sep, 2004
+ links to plant/crops, environmental, health, … information sources
PDB Content Growth
•DBs growing exponentially!!!
•Biobliographic (MedLine, …)
•Amino Acid Seq (SWISS-PROT, …)
•3D Molecular Structure (PDB, …)
•Nucleotide Seq (GenBank, EMBL, …)
•Biochemical Pathways (KEGG, WIT…)
•Molecular Classifications (SCOP, CATH,…)
•Motif Libraries (PROSITE, Blocks, …)
HPCInform,
16 Sep, 2004
Tools that
access to and usage of data
X
Internet hopping is not ideal!
Tools that
scale HPC facilities access to and usage of large
X qsub [-a date_time] [-A account_string] [-c interval] [-C directive_prefix] [-e path] [-h] [-I] [-j join] [-k keep] [-l resource_list]
[-m mail_options] [-M user_list] [-N name] [-o path] [-p priority] [-q destination] [-r c] [-S path_list] [-u user_list] [-v variable_list] [-V]
[-W additional_attributes] [-z] [script]
Tools designed to
sets and relationships between them of complex data
X e.g. through visualisation
HPCInform,
16 Sep, 2004
Grid technology should allow to hide heterogeneity, deal with location transparency, address security concerns,
…
Data Access and Integration Specification (DAIS) being defined by GGF
OGSA-DAI and DAIT projects key role in shaping these standards
Other commercial solutions (IBM Information Integrator, …)
More later!
HPCInform,
16 Sep, 2004
Current strategy essentially chops up one genome and fires searches for those fragments in the other then re-assembles results
X
X messy approximate matching - re-assembly difficult important correlations can be lost
– to make this tractable so called junk DNA ignored
– chopping may introduce artefacts or hide phenomena
¾ Better to put both full genomes in memory and perform a useful complete comparison
¾ Only possible with very high-end machines (available via grids)
Should not have to be script writer/Linux sys-admin to use these facilities
HPCInform,
16 Sep, 2004
Raw data sets messy
Requires significant effort to understand
Schemas/data models evolving
…
Simplify understanding
Improve analysis
Navigate through potentially huge data sets
X e.g. to find genes of interest in chromosomes of different species
HPCInform,
16 Sep, 2004
Biomedical Research Informatics Delivered by Grid
Enabled Services (BRIDGES)
NeSC (Edinburgh and Glasgow) and IBM
Started October 2003
Supporting project for CFG project
Generating data on hypertension
Rat, Mouse, Human genome databases
Variety of tools used
BLAST, BLAT, Gene Prediction, visualisation, …
Variety of data sources and formats
Microarray data, genome DBs, project partner research data, …
Aim is integrated infrastructure supporting
Data federation
Security
HPCInform,
16 Sep, 2004
VO Authorisation
Synteny
Grid
Service
b la st
P riv a te d a ta
O x fo rd
O rg a n is a tio n
G la sg o w
C F G V irtu a l
P riv a te d a ta
Information
Integrator
D A T A
H U B
OGSA-DAI
L o n d o n
P riv a te d a ta
P u b lica lly C u ra te d D a ta
P riv a te d a ta
Le ice ste r
P riv a te d a ta
N e th e rla n d s
P riv a te d a ta
E n se m bl
O M IM
S W IS S -P R O T
M G I
H U G O
+
HPCInform,
16 Sep, 2004
Single sign-on based on (X.509) digital certificates
X establish credentials
– Certification authority based (RAL in UK)
Services (and clients) have APIs for fine grained security
X
Based on GSS-API
Provides for authentication but need authorisation
X
Various technologies for authorisation including PERMIS, CAS, …
Collaborating with P rivil E ge and R ole M anagement
I nfrastructure S tandards Validation (PERMIS) team
X
Lead by Prof David Chadwick, University of Salford
– (www.permis.org)
HPCInform,
16 Sep, 2004
Define roles for who can do what on what
X
Policy = { Role x Target x Action }
– Can user X invoke service Y and access or change data Z?
» Policies created with PERMIS PolicyEditor (output is XML based policy)
HPCInform,
16 Sep, 2004
PERMIS Privilege Allocator then used to sign policies
Associates roles with specific users
X
Policies stored as attribute certificates in LDAP server
When is authorisation done?
Two main choices
X
X
X
Portal personalised for users based on their policies
– If not allowed to invoke service then they do not get to see it
Actions of users (with given role) are authorised every time the service is invoked
– They can see the service but potentially not be allowed to invoke it
» Performance issues… but more likely scenario for authorisation
In both cases, if not explicitly agreed in policy then rejected and logged!
– Both cases being explored
Plan to exploit the GGF SAML AuthZ specification
X
Based on GT3.3 – currently have BLAST service in GT3.2Final
– Identified issues with standards…
HPCInform,
16 Sep, 2004
Information Integrator DB repository established and populated
… with public data sets (OMIM, HUGO, RGD, SWISS-PROT)
… linked to relevant resources (ENSEMBL- rat, human, mouse, MGI)
GT3 based Grid services developed (BLAST) using own meta-scheduler
General usage of ScotGrid and local Condor pool
Portal developed using IBM WebSphere
Genome visualisation browsers
SyntenyVista – for viewing synteny between local/remote data sets
MagnaVista – for exploring genetic information across multiple
(remote) resources
Gaining experience with security technologies
Setting up policies with Grid security authorisation software etc
Rolled-out Alpha version of system to CFG group July ‘04
HPCInform,
16 Sep, 2004
HPCInform,
16 Sep, 2004 www.nesc.ac.uk
Public data resources openness
Often cannot query directly
Often not easy/possible to find schemas
Joint Data Standards Study investigating this
X
X
X
Started on 1 st June and involves
– Digital Archiving Consultancy
– Bioinformatics Research Centre (Glasgow)
– NeSC (Edinburgh and Glasgow)
Look at technical, political, social, ethical etc issues involved in accessing and using public life science resources
– Will liase with NDCC
– Interview relevant scientists, data curators/providers
8 month project with final report in January
– Funded by MRC, BBSRC, Wellcome Trust, JISC, NERC, DTI
GT3 not without pain! (… understatement!!!!)
Hopefully GT4 will be better?
HPCInform,
16 Sep, 2004
Dynamic Virtual Organisations for e-Science Education
(DyVOSE) project
Two year project started 1 st May 2004 funded by JISC
Exploring advanced authorisation infrastructures for security
X
… in Grid Computing Module as part of advanced MSc at Glasgow
– Will provide insight into rolling Grid out to the masses!
ScotGrid GU Condor pool
Other (known!)
Grid resources
PERMIS based
Education
VO policies
Authorisation checks
Authorisation decisions
HPCInform,
16 Sep, 2004
Four year proposal expected to start November 2004
Funded by Scottish Enterprise, Scottish Higher Education Funding
Council, Scottish Executive Environment and Rural Affairs
Department
X
Involves Glasgow, Dundee, Edinburgh, Scottish Bioinformatics Forum
Aim to provide bioinformatics infrastructure for Scottish health, agriculture and industry
X
X
X
Infrastructure support at Dundee, Edinburgh and Glasgow to support first-rate research in bioinformatics at each academic institute
Infrastructure support at three institutes, to support inter-institutional sharing of compute and data resources through application of Grid computing
Outreach and training activities mediated by the Scottish Bioinformatics Forum
HPCInform,
16 Sep, 2004
Virtual Organisations for Trials and Epidemiological Studies
3 year MRC funded project expected to start imminently
Plans to develop Grid infrastructure to address key components of clinical trial/observational study
X
X
X
Recruitment of potentially eligible participants
Data collection during the study
Study administration and coordination
– Involves Glasgow, Oxford, Leicester, Nottingham, Manchester
Clinical Virtual Organisation Framework
Used to realise
CVO-2
(e.g. for recruitment)
CVO-1
(e.g. for data collection)
Disease registries
GPs
GLA
Transfer
Grid
Lei-
Nott
OX
IMP
Hospital databases
Clinical trial data sets
HPCInform,
16 Sep, 2004
NeSC Glasgow focussed largely on life sciences and security
The GRID is happening and will influence life sciences
Government SR allocated £115M for eScience
Life sciences are key
Grid middleware and standards still “evolving“
Non-trivial learning curve
Immaturity of technology
Courses offered at NeSC
Web services, grid services, XML, Java, DB technologies, bioinformatics, others…?
X www.nesc.ac.uk
for registration and suggesting events
Grid Computing modules under development at Glasgow as part of advanced MSc
HPCInform,
16 Sep, 2004
Main take home message: need coordination of
X
X
GRID software engineering/standards
Life science needs/standards
Without this, metaphor of electric power GRID doomed
HPCInform,
16 Sep, 2004
HPCInform,
16 Sep, 2004 www.nesc.ac.uk
HPCInform,
16 Sep, 2004 www.nesc.ac.uk
HPCInform,
16 Sep, 2004
HPCInform,
16 Sep, 2004 www.nesc.ac.uk
HPCInform,
16 Sep, 2004
HPCInform,
16 Sep, 2004
HPCInform,
16 Sep, 2004
HPCInform,
16 Sep, 2004
•
Allows ‘genome scale’ blasting
• Uses ScotGrid and idle compute resources of training lab Condor pool
HPCInform,
16 Sep, 2004
HPCInform,
16 Sep, 2004
HPCInform,
16 Sep, 2004
HPCInform,
16 Sep, 2004