IBM UK & Ireland Technical Consultancy Group

IBM
UK & Ireland
Technical Consultancy Group
Prof. Malcolm Atkinson
Director
www.nesc.ac.uk
22nd May 2003
Outline
What is e-Science?
UK e-Science
UK e-Science: Roles and Resources
Scientific Data Curation
Data Access & Integration
Data Analysis & Interpretation
e-Science driving Disruptive Technology
Economic impact, Mobile Code, Decomposition
Global infrastructure, optimisation & management
“Don’t care where” computing
Foundation for e-Science
e-Science methodologies will rapidly transform
science, engineering, medicine and business
driven by exponential growth (×1000/decade)

enabling a whole-system approach
computers
software
Grid
sensor nets
instruments
colleagues
Shared data
archives
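The growth rate quoted above (×1000 per decade) can be restated as a doubling time. The following is a minimal sketch of that arithmetic; the variable names are illustrative and not taken from the slides.

```python
import math

# Growth factor quoted on the slide: x1000 per decade.
growth_per_decade = 1000

# Number of doublings needed to reach x1000, and the time each one takes.
doublings_per_decade = math.log2(growth_per_decade)
doubling_time_years = 10 / doublings_per_decade

# Equivalent year-on-year growth factor.
annual_factor = growth_per_decade ** (1 / 10)

print(f"doubling time: {doubling_time_years:.2f} years")
print(f"annual growth factor: {annual_factor:.2f}")
```

Under this reading, ×1000 per decade corresponds to roughly one doubling per year.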
Convergence & Ubiquity
Multi-national, Multi-discipline, Computer-enabled
Consortia, Cultures & Societies
Experiment & Advanced Data Collection → Shared Data
Theory, Models & Simulations → Shared Data
Requires Much Engineering, Much Innovation
Computing Science: Systems, Notations & Formal Foundation
Changes Culture, New Mores, New Behaviours → Trust
New Opportunities, New Results, New Rewards
UCSF
UIUC
From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign
global in-flight engine diagnostics
in-flight data
airline
ground station
100,000 engines
2-5 Gbytes/flight
5 flights/day = up to 2.5 petabytes/day
global network
eg SITA
DS&S Engine Health Center
internet, e-mail, pager
maintenance centre
data centre
Distributed Aircraft Maintenance Environment: Universities of Leeds, Oxford, Sheffield & York
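The daily data volume on this slide follows from multiplying the three quoted figures. A minimal sketch of the calculation, assuming decimal units (1 petabyte = 1,000,000 gigabytes); the function name is illustrative:

```python
GB_PER_PB = 1_000_000  # assumption: decimal units


def daily_volume_pb(engines, gb_per_flight, flights_per_day):
    """Total in-flight data produced per day, in petabytes."""
    return engines * gb_per_flight * flights_per_day / GB_PER_PB


# Figures from the slide: 100,000 engines, 2-5 Gbytes/flight, 5 flights/day.
low = daily_volume_pb(100_000, 2, 5)
high = daily_volume_pb(100_000, 5, 5)
print(f"{low} to {high} petabytes/day")
```

The 2.5 petabytes/day on the slide corresponds to the upper end of the 2-5 Gbytes/flight range.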
Tera → Peta Bytes
                     Terabyte                  Petabyte
RAM time to move     15 minutes                2 months
1Gb WAN move time    10 hours ($1000)          14 months ($1 million)
Disk Cost            7 disks = $5000 (SCSI)    6800 Disks + 490 units + 32 racks = $7 million
Disk Power           100 Watts                 100 Kilowatts
Disk Weight          5.6 Kg                    33 Tonnes
Disk Footprint       Inside machine            60 m²
May 2003 Approximately Correct
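Some of the petabyte figures can be checked by scaling the terabyte figures by 1000. The sketch below covers only the WAN and disk power rows, which scale linearly; it assumes an average month of 730 hours, and the variable names are illustrative.

```python
SCALE = 1000           # terabyte -> petabyte
HOURS_PER_MONTH = 730  # assumption: average month length

# Terabyte figures from the slide.
tb_wan_hours = 10
tb_wan_cost_usd = 1000
tb_disk_power_watts = 100

# Scaled petabyte figures.
pb_wan_months = tb_wan_hours * SCALE / HOURS_PER_MONTH
pb_wan_cost_usd = tb_wan_cost_usd * SCALE
pb_disk_power_kw = tb_disk_power_watts * SCALE / 1000

print(f"WAN move time: {pb_wan_months:.1f} months, cost ${pb_wan_cost_usd:,}")
print(f"Disk power: {pb_disk_power_kw:.0f} kW")
```

The result of about 13.7 months is consistent with the 14 months and $1 million quoted for the petabyte case.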
See also: Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24
Additional UK e-Science Funding
First Phase: 2001–2004
Application Projects: £74M
All areas of science and engineering
>60 Projects
340 at first All Hands Meeting
Core Programme: £35M
Collaborative industrial projects
~80 Companies
> £30 Million
Second Phase: 2003–2006
Application Projects: £96M
All areas of science and engineering
Core Programme: £16M + £25M (?)
Core Grid Middleware
e-Science and SR2002
Research Council        2004-6        (2001-4)
Medical                 £13.1M        (£8M)
Biological              £10.0M        (£8M)
Environmental           £8.0M         (£7M)
Eng & Phys              £18.0M        (£17M)
HPC                     £2.5M         (£9M)
Core Prog.              £16.2M + ?    (£15M) + £20M
Particle Phys & Astro   £31.6M        (£26M)
Economic & Social       £10.6M        (£3M)
Central Labs            £5.0M         (£5M)
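The per-council figures can be cross-checked against the Application Projects totals on the funding slide (£74M for the first phase, £96M for the second). This is a minimal sketch that assumes the application totals exclude the HPC and Core Programme lines; that grouping is an inference from the numbers, not something the slides state.

```python
# Figures in £M as listed on the slide; HPC and Core Prog. are left out
# on the assumption that they are not counted as application projects.
phase_2001_4 = {
    "Medical": 8, "Biological": 8, "Environmental": 7, "Eng & Phys": 17,
    "Particle Phys & Astro": 26, "Economic & Social": 3, "Central Labs": 5,
}
phase_2004_6 = {
    "Medical": 13.1, "Biological": 10.0, "Environmental": 8.0, "Eng & Phys": 18.0,
    "Particle Phys & Astro": 31.6, "Economic & Social": 10.6, "Central Labs": 5.0,
}

application_total_2001_4 = sum(phase_2001_4.values())
application_total_2004_6 = round(sum(phase_2004_6.values()), 1)
core_total_2001_4 = 15 + 20

print(application_total_2001_4, application_total_2004_6, core_total_2001_4)
```

Under that assumption the sums come to £74M and £96.3M for application projects and £35M for the first-phase Core Programme, which agrees with the funding slide.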
NeSC in the UK
Map labels: You are here, HPC(x), Glasgow, Edinburgh, Newcastle, Belfast, Manchester, Daresbury Lab, Cambridge, Oxford, Hinxton, RAL, Cardiff, London, Southampton
Directors’ Forum
Helped build a community
Engineering Task Force
Grid Support Centre
Architecture Task Force
UK Adoption of OGSA
OGSA Grid Market
Workflow Management
Database Task Force
OGSA-DAI
GGF DAIS-WG
e-SI Programme
training, coordination, community building, workshops, pioneering
GridNet
e-Storm
UK Grid: Operational & Heterogeneous
Currently a Level-2 Grid based on Globus Toolkit 2
Transition to OGSI/OGSA will prove worthwhile
There are still issues to be resolved
OGSA definition / delivery
Hosting environments & Platforms
Combinations of Services supported
Material and grids to support adopters
A schedule of transitions should be (approximately & provisionally) published
Expected time line
Now: GT2 L2 service + GT3 M/W development & evaluation
Q3 & Q4 2003: GT2 L3 + GT3 L1
Q1 & Q2 2004: significant project transitions to GT3 L2/L3
Late Q4 2004: most projects have transitioned + end GT2 L3
Biology & Medicine
Extensive Research Community
>1000 per research university
Extensive Applications
Many people care about them

Health, Food, Environment
Interacts with virtually every discipline
Physics, Chemistry, Nanoengineering, …
450 Databases relevant to bioinformatics
Heterogeneity, Interdependence, Complexity, Change, …
Wonderful Scientific Questions
How does a cell work?
How does a brain work?
How does an organism develop?
Why is the biosphere so stable?
What happens to the biosphere when the earth warms up?
…
Database Growth
PDB Content Growth
39,856,567,747
Infrastructure Architecture
Data Intensive X Scientists
Data Intensive Applications for Science X
Simulation, Analysis & Integration Technology for Science X
Generic Virtual Data Access and Integration Layer
OGSA: Job Submission, Brokering, Registry, Banking, Data Transport, Workflow, Structured Data Integration, Authorisation, Resource Usage, Transformation, Structured Data Access
OGSI: Interface to Grid Infrastructure
Compute, Data & Storage Resources
Structured Data: Relational, Distributed, XML, Semi-structured
Virtual Integration Architecture
Data Access & Integration Services
Participants: Client, Registry, Factory, Grid Data Service (GDS), XML / Relational database
Interactions: SOAP/HTTP, service creation, API interactions
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML
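The interaction sequence 1a to 3c can be sketched in a few lines of code. The following is an illustration of the pattern only, not the OGSA-DAI API: the classes `Registry`, `Factory` and `GridDataService`, their methods, and the toy `genes` table are all hypothetical, and an in-memory SQLite database stands in for the remote data resource.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS: manages access to one database."""

    def __init__(self, connection):
        self._connection = connection

    def query(self, sql):
        # 3b. The GDS interacts with the database on the client's behalf.
        cursor = self._connection.execute(sql)
        columns = [d[0] for d in cursor.description]
        # 3c. Results are returned to the client as XML.
        root = ET.Element("results")
        for row in cursor:
            row_el = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_el, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class Factory:
    """Hypothetical factory: creates a GDS on request (steps 2a to 2c)."""

    def __init__(self, connection):
        self._connection = connection

    def create_service(self):
        return GridDataService(self._connection)


class Registry:
    """Hypothetical registry: maps a topic to a factory (steps 1a and 1b)."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def find_factory(self, topic):
        return self._factories[topic]


def run_demo():
    # Toy data, invented for this sketch.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE genes (name TEXT, length INTEGER)")
    db.executemany("INSERT INTO genes VALUES (?, ?)", [("abc1", 1200), ("xyz2", 800)])

    registry = Registry()
    registry.register("x", Factory(db))

    factory = registry.find_factory("x")  # 1a, 1b
    gds = factory.create_service()        # 2a, 2b, 2c
    return gds.query("SELECT name, length FROM genes ORDER BY name")  # 3a to 3c


if __name__ == "__main__":
    print(run_demo())
```

In the sketch everything runs in one process; in the architecture on the slide each arrow is a SOAP/HTTP exchange between separately hosted services.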
Data Access and Integration Services
Participants: Analyst, Client (Problem Solving Environment), Data Registry, Factory, Semantic Meta data, GridDataServices network (GDS1, GDS2, GDS3, GDTS1, GDTS2), XML database, Relational database, resources Sx and Sy
Interactions: SOAP/HTTP, service creation, API interactions
Analyst: “scientific” Application coding, scientific insights
1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration from resources Sx and Sy
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits a sequence of scripts, each with a set of queries, to GDS with XPath, SQL, etc
3b. Client tells analyst
3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation
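The integration step, combining results from resources Sx and Sy, can be illustrated with a small sketch. It assumes two independent databases that share a key column; the table names, columns and rows are toy values invented for the example, and the helper functions are hypothetical rather than part of any Grid middleware.

```python
import sqlite3


def make_source(table, schema, rows):
    """Create a toy in-memory database standing in for one resource."""
    db = sqlite3.connect(":memory:")
    db.execute(f"CREATE TABLE {table} ({schema})")
    placeholders = ", ".join("?" for _ in rows[0])
    db.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return db


def integrate(sx, sy):
    """Query each source separately, then join the result sets on the shared key."""
    measurements = dict(sx.execute("SELECT gene, level FROM measurements"))
    annotations = dict(sy.execute("SELECT gene, role FROM annotations"))
    shared = sorted(measurements.keys() & annotations.keys())
    return [(gene, measurements[gene], annotations[gene]) for gene in shared]


# Toy data for the two resources.
sx = make_source("measurements", "gene TEXT, level REAL", [("g1", 0.5), ("g2", 1.5)])
sy = make_source("annotations", "gene TEXT, role TEXT", [("g2", "kinase"), ("g3", "transporter")])

combined = integrate(sx, sy)
print(combined)
```

Here the join happens in the client; the slide's architecture instead places it inside the network of services created by the Factory, so that the analyst sees a single integrated result.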
ODD-Genes
Components: PSE, registry, database engine 1, database engine 2
Scientific Data
Opportunities
Global Production of
Published Data
Volume Diversity
Combination → Analysis → Discovery
Opportunities
Specialised Indexing
New Data Organisation
New Algorithms
Varied Replication
Shared Annotation
Intensive Data &
Computation
Challenges
Data Huggers
Meagre metadata
Ease of Use
Optimised integration
Dependability
Challenges
Fundamental Principles
Approximate Matching
Multi-scale optimisation
Autonomous Change
Legacy structures
Scale and Longevity
Privacy and Mobility
Mohammed & Mountains
Petabytes of Data cannot be moved
It stays where it is produced or curated

Hospitals, observatories, European Bioinformatics Institute, …
Distributed collaborating communities
Expertise in curation, simulation & analysis
Distributed & diverse data collections
Discovery depends on insights
Tested by combining data from many sources
Using sophisticated models & algorithms
What can you do?
Move computation to the data
Assumption: code size << data size
Develop the database philosophy for this?
Queries are dynamically re-organised & bound
Develop the storage architecture for this?
Compute closer to disk?

System on a Chip using free space in the on-disk controller (Dave Patterson, SIGMOD 98)
Safe hosting of arbitrary computation
Proof-carrying code for data and compute intensive tasks + robust hosting environments
Provision combined storage & compute resources
Decomposition of applications
To ship behaviour-bounded sub-computations to data
Co-scheduling & co-optimisation
Data & Code (movement), Code execution
Recovery and compensation
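The assumption "code size << data size" can be made concrete with a small sketch that compares the bytes moved when the data is shipped to the computation with the bytes moved when the computation is shipped to the data. The `DataSite` class, the helper functions and the sizes used are hypothetical; safe hosting, scheduling and recovery are not modelled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DataSite:
    """Hypothetical site that curates a collection and hosts computation."""
    name: str
    records: List[float]

    def run(self, computation: Callable[[List[float]], float]) -> float:
        # The computation executes next to the data; only the result leaves.
        return computation(self.records)


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def bytes_moved_shipping_data(n_records: int, record_bytes: int) -> int:
    """Network cost of moving the whole collection to the client."""
    return n_records * record_bytes


def bytes_moved_shipping_code(code_bytes: int, result_bytes: int) -> int:
    """Network cost of sending the code to the site and the result back."""
    return code_bytes + result_bytes


site = DataSite("observatory", [1.0, 2.0, 3.0, 4.0])
result = site.run(mean)

# Illustrative sizes: a billion 1 KB records versus 10 KB of code and an 8-byte result.
data_cost = bytes_moved_shipping_data(10**9, 1000)
code_cost = bytes_moved_shipping_code(10_000, 8)
print(result, data_cost, code_cost)
```

With these illustrative sizes, shipping the code moves about eight orders of magnitude fewer bytes than shipping the data, which is the motivation for the storage and hosting questions raised on this slide.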
Software Changes
Integrated Problem Solving Environments
Users & application developers see
Abstract computer and storage system
Where and how things are executed can be ignored
Diversity, detail, ownership, dependability, cost: explicit and visible
Increasing sophistication of description
Metadata for discovery
Metadata for management and optimisation
Raising the semantic level of discourse
Applications developed dynamically by composition
Mobile, Safe & Re-organisable Code
Predictable & Guaranteed behaviour
Decomposition & re-composition
New programming languages & understanding needed
Organisational & Cultural Changes
Access to Computation & Data must be simple
All use a computational, semantic, data-rich web
i.e. it's invisible: the portal / browser lets you do more
Responsibility of data publishers
Cost, dependability, trustworthiness, capability, flexibility, …
Shared contributions compose indefinitely
Knowledge accumulation and interdependence
Contributor recognition and IPR
Complexity and management of infrastructure
Always on
Must be sustained
Paid for
Hidden
www.ogsadai.org.uk
www.nesc.ac.uk