IBM TCG Symposium
Prof. Malcolm Atkinson
Director, National e-Science Centre
www.nesc.ac.uk
21st May 2003
Outline
What is e-Science?
UK e-Science
UK e-Science: Roles and Resources
UK e-Science: Projects
NeSC & e-Science Institute
Events
Visitors
International Programme
Scientific Data Curation
Data Access & Integration
Data Analysis & Interpretation
e-Science driving Disruptive Technology
Economic impact, Mobile Code, Decomposition
Global infrastructure, optimisation & management
“Don’t care where” computing
Foundation for e-Science
e-Science methodologies will rapidly transform science, engineering, medicine and business
Driven by exponential growth (×1000/decade)
Enabling a whole-system approach spanning:
computers, software, the Grid, sensor nets, instruments, colleagues, shared data archives
Convergence & Ubiquity
Multi-national, Multi-discipline, Computer-enabled Consortia, Cultures & Societies
Experiment & Advanced Data Collection → Shared Data
Theory, Models & Simulations → Shared Data
Requires Much Computing Science: Engineering, Systems, Notations & Formal Foundation; Much Innovation
Changes Culture: New Mores, New Behaviours → Trust
New Opportunities, New Results, New Rewards
[Images: UCSF, UIUC]
From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign
Global In-flight Engine Diagnostics
[Diagram: in-flight data from ~100,000 engines flows via airline ground stations and a global network (e.g. SITA) to the DS&S Engine Health Center and its data centre; diagnoses reach the maintenance centre via internet, e-mail and pager]
100,000 engines × 2–5 GB/flight × 5 flights/day ≈ 2.5 PB/day
Distributed Aircraft Maintenance Environment (DAME): Universities of Leeds, Oxford, Sheffield & York
Tera → Peta Bytes
                      1 Terabyte               1 Petabyte
RAM time to move      15 minutes               2 months
1 Gb/s WAN move time  10 hours ($1000)         14 months ($1 million)
Disk cost             7 disks = £3500 (SCSI)   6800 disks + 490 units + 32 racks = £4.7 million
Disk power            100 Watts                100 Kilowatts
Disk weight           5.6 Kg                   33 Tonnes
Disk footprint        Inside machine           60 m²
(May 2003, approximately correct)
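These WAN figures are easy to sanity-check. Below is a back-of-envelope sketch in Python; the ~20% sustained link utilisation is an assumption chosen here to reproduce the slide's numbers, not a figure from the slide:

```python
# Back-of-envelope check of the WAN move times in the table above.
# Assumption (not from the slide): a 1 Gb/s link sustained at ~20%
# effective utilisation, i.e. about 0.2 Gb/s of useful throughput.

LINK_BPS = 1e9          # 1 Gb/s link
UTILISATION = 0.2       # assumed sustained fraction
effective_bps = LINK_BPS * UTILISATION

def move_time_s(n_bytes: float) -> float:
    """Seconds to push n_bytes through the effective link."""
    return n_bytes * 8 / effective_bps

TB, PB = 1e12, 1e15
print(f"1 TB: {move_time_s(TB) / 3600:.0f} hours")               # ~11 hours
print(f"1 PB: {move_time_s(PB) / (30 * 24 * 3600):.0f} months")  # ~15 months
```

Both come out within rounding of the slide's 10 hours and 14 months, which is the point: at petabyte scale, wide-area moves take over a year.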
UK 2000 Spending Review
e-Science and the Grid
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’
‘e-Science will change the dynamic of the way science is undertaken.’
John Taylor
Director General of Research Councils
Office of Science and Technology
From presentation by Tony Hey
Additional UK e-Science Funding
First Phase: 2001–2004
Application Projects: £74M
All areas of science and engineering
>60 projects; 340 at the first All Hands Meeting
Core Programme: £35M
Collaborative industrial projects: ~80 companies, >£30 million
Second Phase: 2003–2006
Application Projects: £96M
All areas of science and engineering
Core Programme: £16M + £25M (?)
Core Grid middleware
e-Science and SR2002
              2001–04         2004–06
MRC           (£8M)           £13.1M
BBSRC         (£8M)           £10.0M
NERC          (£7M)           £8.0M
EPSRC         (£17M)          £18.0M
HPC           (£9M)           £2.5M
Core Prog.    (£15M) + £20M   £16.2M + ?
PPARC         (£26M)          £31.6M
ESRC          (£3M)           £10.6M
CLRC          (£5M)           £5.0M
NeSC in the UK
[Map: UK e-Science centres and sites: Edinburgh & Glasgow ("you are here"), Newcastle, Belfast, Manchester, Daresbury Lab, Cambridge, Oxford, Hinxton, RAL, Cardiff, London, Southampton; HPC(x)]
Helped build a community:
Directors' Forum
Engineering Task Force
Grid Support Centre
Architecture Task Force: UK adoption of OGSA
OGSA Grid Market
Workflow Management
Database Task Force
OGSA-DAI
GGF DAIS-WG
GridNet
e-Storm
e-SI Programme: training, coordination, community building, workshops, pioneering
www.nesc.ac.uk
UK Grid: Operational & Heterogeneous
Currently a Level-2 Grid based on Globus Toolkit 2
Transition to OGSI/OGSA will prove worthwhile
There are still issues to be resolved:
OGSA definition / delivery
Hosting environments & platforms
Combinations of services supported
Material and grids to support adopters
A schedule of transitions should be published (approximately & provisionally)
Expected timeline:
Now: GT2 L2 service + GT3 middleware development & evaluation
Q3 & Q4 2003: GT2 L3 + GT3 L1
Q1 & Q2 2004: significant project transitions to GT3 L2/L3
Late Q4 2004: most projects have transitioned; end of GT2 L3 service
e-Science Institute
Past Programme of Events
Planned: 6 two-week research workshops per year
Actually ran 48 events in the first 12 months!
Highlights:
GGF5, HPDC11 and a cluster of workshops
Protein Science, Neuroinformatics, …
Major training events: Steve Tuecke: Grid & Globus (2); Web Services, DiscoveryLink, Relational DB design, …
[Chart: GGF1–GGF7, scale 0–900]
e-SI Clientele and Outreach (year 1)
>2600 individuals from >500 organisations
236 speakers
Many participants return frequently
Event diversity: conferences, summer schools, workshops, training, community building
Company sponsorship & international engagement
Biology & Medicine
Extensive research community: >1000 per research university
Extensive applications that many people care about: health, food, environment
Interacts with virtually every discipline: physics, chemistry, nanoengineering, …
450 databases relevant to bioinformatics: heterogeneity, interdependence, complexity, change, …
Wonderful scientific questions:
How does a cell work?
How does a brain work?
How does an organism develop?
Why is the biosphere so stable?
What happens to the biosphere when the earth warms up?
…
Database Growth
[Chart: PDB content growth]
ODD-Genes
[Diagram: PSE, registry, database engine 1, database engine 2]
Scientific Data
Opportunities
Global production of published data: volume ↑, diversity ↑
Combination ⇒ Analysis ⇒ Discovery
Specialised indexing
New data organisation
New algorithms
Varied replication
Shared annotation
Intensive data & computation
Challenges
Data huggers
Meagre metadata
Ease of use
Optimised integration
Dependability
Fundamental principles
Approximate matching
Multi-scale optimisation
Autonomous change
Legacy structures
Scale and longevity
Privacy and mobility
Infrastructure Architecture
[Diagram: layered architecture, top to bottom]
Data Intensive Scientists for science X
Data Intensive Applications for science X
Simulation, Analysis & Integration Technology for science X
Generic Virtual Data Access and Integration Layer
OGSA services: job submission, brokering, registry, banking, data transport, workflow, structured data integration, authorisation, resource usage, transformation, structured data access
OGSI: interface to Grid infrastructure
Compute, data & storage resources: structured data (relational, XML, semi-structured), distributed
Virtual Integration Architecture: draft specification for GGF 7
Mohammed & Mountains
Petabytes of data cannot be moved
They stay where they are produced or curated: hospitals, observatories, the European Bioinformatics Institute, …
Distributed collaborating communities
Expertise in curation, simulation & analysis
Distributed & diverse data collections
Discovery depends on insights
Tested by combining data from many sources
Using sophisticated models & algorithms
What can you do?
Move computation to the data
Assumption: code size << data size
Develop the database philosophy for this?
Queries are dynamically re-organised & bound
Develop the storage architecture for this?
Compute on disk? (SoC + space on disk chips??)
Safe hosting of arbitrary computation
Proof-carrying code for data- and compute-intensive tasks + robust hosting environments
Provision combined storage & compute resources
Decomposition of applications
To ship behaviour-bounded sub-computations to data (sketched below)
Co-scheduling & co-optimisation
Data & code (movement), code execution
Recovery and compensation
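A minimal sketch of the "ship the query, not the data" idea behind this list. Everything named here is hypothetical (DataHost and its run method are inventions for illustration, and an in-memory SQLite table stands in for a curated petabyte archive): the client sends a small, behaviour-bounded computation to the data and gets back only a small result.

```python
import sqlite3

class DataHost:
    """Hypothetical site that curates data too large to move."""

    def __init__(self):
        # The data stays here; only queries and results cross the network.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE genes (name TEXT, chromosome TEXT, expression REAL)")
        self.db.executemany(
            "INSERT INTO genes VALUES (?, ?, ?)",
            [("BRCA1", "17", 0.82), ("TP53", "17", 0.64), ("MYC", "8", 0.91)])

    def run(self, sql: str, params=()):
        """Execute a shipped computation, bounded here to read-only SQL."""
        if not sql.lstrip().upper().startswith("SELECT"):
            raise ValueError("only read-only queries are hosted")
        return self.db.execute(sql, params).fetchall()

# Client side: a few hundred bytes of code move; the petabytes do not.
host = DataHost()
result = host.run(
    "SELECT name, expression FROM genes WHERE chromosome = ? AND expression > ?",
    ("17", 0.7))
print(result)  # [('BRCA1', 0.82)]: a small result set, not a bulk transfer
```

The "behaviour bound" here is the crude read-only check; proof-carrying code, as the slide suggests, would let the host verify much richer properties before agreeing to run a shipped computation.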
Software Changes
Integrated problem-solving environments
Users & application developers see:
An abstract computer and storage system: where and how things are executed can be ignored
Diversity, detail, ownership, dependability, cost: explicit and visible
Increasing sophistication of description:
Metadata for discovery
Metadata for management and optimisation
Applications developed dynamically by composition
Mobile, safe & re-organisable code:
Predictable behaviour
Decomposition & re-composition
New programming languages & understanding needed
Organisational & Cultural Changes
Access to computation & data must be simple
All use a computational, semantic, data-rich web
Responsibility of data publishers: cost, dependability, trustworthiness, capability, flexibility, …
Shared contributions compose indefinitely
Knowledge accumulation and interdependence
Contributor recognition and IPR
Complexity and management of infrastructure: always on, sustained, paid for, hidden
www.ogsadai.org.uk
www.nesc.ac.uk
DAI Basic Services
1a. Client sends a request to the Registry for sources of data about "x" (SOAP/HTTP)
1b. Registry responds with a Factory handle
2a. Client sends a request to the Factory for access to the database
2b. Factory creates a GridDataService (GDS) to manage access (service creation via API interactions)
2c. Factory returns the handle of the GDS to the client
3a. Client queries the GDS with XPath, SQL, etc.
3b. GDS interacts with the XML / relational database
3c. Results of the query are returned to the client as XML
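The sequence above reads naturally as code. Below is a minimal in-memory mock of the pattern; the class and method names (Registry.find, Factory.create_gds, GridDataService.perform) are illustrative stand-ins, not the real OGSA-DAI API, and a Python dict plays the database.

```python
# Mock of the DAI interaction pattern: Registry -> Factory -> GridDataService.

class GridDataService:
    def __init__(self, database):
        self.database = database               # 2b: one GDS manages one database

    def perform(self, query):
        """3a/3b: run the query against the database; 3c: results as XML text."""
        rows = self.database.get(query, [])
        return "<results>" + "".join(f"<row>{r}</row>" for r in rows) + "</results>"

class Factory:
    def __init__(self, database):
        self.database = database

    def create_gds(self):
        return GridDataService(self.database)  # 2b/2c: create GDS, hand back handle

class Registry:
    def __init__(self, factories):
        self.factories = factories             # topic -> factory handle

    def find(self, topic):
        return self.factories[topic]           # 1a/1b: who holds data about topic?

# Client side, following steps 1a-3c:
database = {"genes like x": ["BRCA1", "TP53"]}
registry = Registry({"x": Factory(database)})
factory = registry.find("x")                   # 1a/1b
gds = factory.create_gds()                     # 2a-2c
print(gds.perform("genes like x"))             # 3a-3c: XML back to the client
```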
DAIT Basic Services
1a. Client sends a request to the Data Registry for sources of data about "x" & "y" (SOAP/HTTP)
1b. Registry responds with a Factory handle
2a. Client sends a request to the Factory for access to, and integration of, resources Sx and Sy
2b. Factory creates a network of GridDataServices (GDS1, GDS2, GDS3) and GridDataTransportServices (GDTS1, GDTS2) spanning Sx (an XML database) and Sy (a relational database)
2c. Factory returns the handle of the GDS to the client
3a. Client submits a sequence of scripts, each with a set of queries (XPath, SQL, etc.), to the GDS
3b. Client tells the analyst
3c. Sequences of result sets are returned to the analyst as formatted binary described in a standard XML notation
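Continuing the mock from the previous slide, a sketch of what the DAIT factory adds: a single request fans out across a created network of services and the per-source results are merged before delivery. Again, every name is an illustrative stand-in, and the transport services (GDTS1/GDTS2) are elided.

```python
# Illustrative extension of the previous mock: one request, two sources.
# A coordinator GDS fans queries out to services wrapping Sx (XML) and
# Sy (relational) and merges the result sets for the analyst (steps 3a-3c).

class SourceGDS:
    def __init__(self, name, data):
        self.name, self.data = name, data      # 2b: one service per source

    def query(self, key):
        return [(self.name, value) for value in self.data.get(key, [])]

class CoordinatorGDS:
    def __init__(self, sources):
        self.sources = sources

    def perform_script(self, queries):
        # Run every query of the script at every source, then merge.
        return [row for q in queries for s in self.sources for row in s.query(q)]

sx = SourceGDS("Sx", {"x": ["<gene>BRCA1</gene>"]})   # XML database
sy = SourceGDS("Sy", {"y": [("TP53", 0.64)]})         # relational database
coordinator = CoordinatorGDS([sx, sy])                # 2b: factory-built network
print(coordinator.perform_script(["x", "y"]))         # merged results for analyst
```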