BNCOD Coventry Databases and the Grid: Who Challenges Whom

Prof. Malcolm Atkinson
Director
www.nesc.ac.uk
15th July 2003
Outline
What is e-Science?
Grids, Collaboration, Virtual Organisations
Structured Data at its Foundation
Key Uses of Distributed Data Resources
Challenges
Scientific Digital Data Curation
Data Access & Integration
OGSA-DAI: Progress and Plans
DAIS-WG
Composition of Analysis & Interpretation
It’s Easy to Forget
How Different 2003 is From 1993
Enormous quantities of data: Petabytes
For an increasing number of communities,
gating step is not collection but analysis
Ubiquitous Internet: >100 million hosts
Collaboration & resource sharing the norm
Ultra-high-speed networks: >10 Gb/s
Global optical networks
Huge quantities of computing: >100 Top/s
Moore’s law gives us all supercomputers
Ubiquitous computing
Moore’s law everywhere
Instruments, detectors, sensors, scanners, …
Derived from Ian Foster’s slide at SSDBM, July 03
Foundation for e-Science
e-Science methodologies will rapidly transform
science, engineering, medicine and business
driven by exponential growth (×1000/decade)
enabling a whole-system approach
computers
software
Grid
sensor nets
instruments
Diagram derived from
Ian Foster’s slide
colleagues
Shared data
archives
e-Science and SR2002
Research Council         2004-6        2001-4
Medical                  £13.1M        (£8M)
Biological               £10.0M        (£8M)
Environmental            £8.0M         (£7M)
Eng & Phys               £18.0M        (£17M)
HPC                      £2.5M         (£9M)
Core Prog.               £16.2M + ?    (£15M) + £20M
Particle Phys & Astro    £31.6M        (£26M)
Economic & Social        £10.6M        (£3M)
Central Labs             £5.0M         (£5M)
NeSC in the UK
National
e-Science
Centre
HPC(x)
Glasgow
Edinburgh
Directors’ Forum
Newcastle
Helped build a community
Belfast
Engineering Task Force
Grid Support Centre
Manchester
Daresbury Lab
Architecture Task Force
UK Adoption of OGSA
Cambridge
OGSA Grid Market
Oxford
Workflow Management
Hinxton
RAL
Database Task Force
Cardiff
London
OGSA-DAI
GGF DAIS-WG
Southampton
e-SI Programme
training, coordination,
community building,
workshops, pioneering
GridNet
e-Storm
www.nesc.ac.uk
UK Grid
Currently a Level-2 Grid based on GT2
Transition to OGSI/OGSA will prove worthwhile
There are still issues to be resolved
OGSA definition / delivery
Hosting environments & Platforms
Combinations of Services supported
Material and grids to support adopters
A schedule of transitions should be (approximately
& provisionally) published
Expected time line
Now GT2 L2 service + GT3 M/W development & evaluation
Q3 & Q4 2003 GT2 L3 + GT3 L1
Q1 & Q2 2004 significant project transitions to GT3 L2/L3
Late Q4 2004 most projects have transitioned + end GT2 L3
Three-way Alliance
Multi-national, Multi-discipline, Computer-enabled
Consortia, Cultures & Societies
Theory, Models & Simulations → Shared Data
Experiment & Advanced Data Collection → Shared Data
Computing Science: Systems, Notations & Formal Foundation → Process & Trust
Requires Much Engineering, Much Innovation
Changes Culture, New Mores, New Behaviours
New Opportunities, New Results, New Rewards
Biochemical Pathway Simulator
(Computing Science, Bioinformatics, Beatson Cancer Research Labs)
Closing the information loop – between lab and computational model.
DTI Bioscience Beacon Project
Harnessing Genomics Programme
Slide from Muffy Calder, Glasgow
UCSF
UIUC
From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign
Emergence of
Global Knowledge Communities
Teams organised around common goals
Communities: “Virtual organisations”
Overlapping memberships, resources and activities
Essential diversity is a strength & challenge
membership & capabilities
Geographic and political distribution
No location/organisation/country possesses all
required skills and resources
Dynamic: adapt as a function of their situation
Adjust membership, reallocate responsibilities,
renegotiate resources
Slide derived from Ian Foster’s SSDBM 03 keynote
The Emergence of
Global Knowledge Communities
Slide from Ian Foster’s SSDBM 03 keynote
Global Knowledge Communities
Often Driven by Data: E.g., Astronomy
No. & sizes of data sets as of mid-2002,
grouped by wavelength
• 12 waveband coverage of large
areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins
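
As a rough illustration of what "doubling every 12 months" implies, the sketch below projects archive size forward from the slide's figure of about 200 TB in mid-2002. It assumes the doubling period stays constant, which the slide does not claim, and the function name is purely illustrative.

```python
def projected_size_tb(initial_tb: float, years: float, doubling_years: float = 1.0) -> float:
    """Projected size after `years`, assuming steady exponential growth (an assumption)."""
    return initial_tb * 2 ** (years / doubling_years)


if __name__ == "__main__":
    # 200 TB is the slide's mid-2002 total; the horizons are arbitrary.
    for years in (1, 5, 10):
        print(f"after {years:2d} years: {projected_size_tb(200, years):,.0f} TB")
```

Doubling every year for ten years is a factor of 2^10 = 1024, which is in line with the ×1000/decade growth quoted earlier in the talk.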
Wellcome Trust: Cardiovascular
Functional Genomics
Glasgow
Shared data
BRIDGES
IBM
Edinburgh
Public curated
data
Leicester
Oxford
London
Netherlands
Database-mediated Communication
Experimentation
Communities
Curated
& Shared
Databases
Data
Carries knowledge
Simulation
Communities
Analysis & Theory
Communities
Carries knowledge
Data
knowledge
Discoveries
global in-flight engine diagnostics
in-flight data
airline
ground
station
100,000 engines
2-5 Gbytes/flight
5 flights/day =
2.5 petabytes/day
global network
eg SITA
DS&S Engine Health Center
internet, e-mail, pager
maintenance centre
data centre
Distributed Aircraft Maintenance Environment: Universities of Leeds, Oxford, Sheffield & York
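
The daily volume quoted for in-flight engine data follows from simple multiplication. The sketch below redoes the arithmetic, assuming decimal units (1 petabyte = 1,000,000 Gbytes) and that the 2-5 Gbytes figure is per engine per flight; neither assumption is stated on the slide.

```python
GB_PER_PB = 1_000_000  # decimal units (assumed)


def daily_volume_pb(engines: int, flights_per_day: int, gb_per_flight: float) -> float:
    """Total data produced per day, in petabytes."""
    return engines * flights_per_day * gb_per_flight / GB_PER_PB


# Slide figures: 100,000 engines, 5 flights/day, 2-5 Gbytes per flight.
low = daily_volume_pb(100_000, 5, 2)
high = daily_volume_pb(100_000, 5, 5)
print(f"{low} to {high} petabytes/day")
```

Under these assumptions the quoted 2.5 petabytes/day corresponds to the upper end of the 2-5 Gbytes range.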
Database Growth
Bases 39,856,567,747
PDB Content Growth
Distributed Structured Data
Key to Integration of Scientific Methods
Key to Large-scale Collaboration
Many Data Resources
Independently managed
Geographically distributed
Primary Data, Data Products, Meta Data,
Administrative data, …
Discovery and Decisions!
Extracting nuggets from multiple sources
Combining them using sophisticated models
Analysis on scales required by statistics
Repeated Processes
Tera → Peta Bytes
                     Terabyte                  Petabyte
RAM time to move     15 minutes                2 months
1Gb WAN move time    10 hours ($1000)          14 months ($1 million)
Disk Cost            7 disks = $5000 (SCSI)    6800 Disks + 490 units + 32 racks = $7 million
Disk Power           100 Watts                 100 Kilowatts
Disk Weight          5.6 Kg                    33 Tonnes
Disk Footprint       Inside machine            60 m2
May 2003 Approximately Correct
See also Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24
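
The WAN move times above can be sanity-checked with a back-of-envelope calculation. The sketch below computes the ideal transfer time over a fully used 1 Gb/s link and the effective utilisation implied by the quoted figures. It assumes decimal units (1 terabyte = 10^12 bytes) and a 30-day month, neither of which the slide states.

```python
BITS_PER_BYTE = 8
LINK_BPS = 1e9           # 1 Gb/s link, as on the slide
HOUR = 3600              # seconds
MONTH = 30 * 24 * HOUR   # 30-day month (assumed)


def ideal_seconds(size_bytes: float, link_bps: float = LINK_BPS) -> float:
    """Transfer time if the link ran at full nominal rate with no overhead."""
    return size_bytes * BITS_PER_BYTE / link_bps


def implied_utilisation(size_bytes: float, quoted_seconds: float,
                        link_bps: float = LINK_BPS) -> float:
    """Fraction of the nominal rate needed to match a quoted transfer time."""
    return ideal_seconds(size_bytes, link_bps) / quoted_seconds


tb_ideal = ideal_seconds(1e12)                       # terabyte
pb_ideal = ideal_seconds(1e15)                       # petabyte
tb_util = implied_utilisation(1e12, 10 * HOUR)       # slide: 10 hours
pb_util = implied_utilisation(1e15, 14 * MONTH)      # slide: 14 months
print(f"1 TB ideal: {tb_ideal / HOUR:.1f} h, implied utilisation {tb_util:.0%}")
print(f"1 PB ideal: {pb_ideal / MONTH:.1f} months, implied utilisation {pb_util:.0%}")
```

At full line rate a terabyte would take 8000 seconds (a little over two hours), so the quoted 10 hours implies sustained throughput of roughly a fifth of the nominal rate.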
Mohammed & Mountains
Petabytes of Data cannot be moved
It stays where it is produced or curated
Hospitals, observatories, European Bioinformatics Institute, …
Distributed collaborating communities
Expertise in curation, simulation & analysis
Distributed & diverse data collections
Discovery depends on insights
⇒ Unpredictable sophisticated application code
Tested by combining data from many sources
Using sophisticated models & algorithms
What can you do?
Dynamically
Move computation to the data
Assumption: code size << data size
Develop the database philosophy for this?
Queries are dynamically re-organised & bound
Develop the storage architecture for this?
Compute closer to disk?
Dave Patterson, Seattle, SIGMOD 98: System on a Chip using free space in the on-disk controller
Data Cutter a step in this direction
Develop the sensor & simulation architectures for this?
Safe hosting of arbitrary computation
Proof-carrying code for data and compute intensive tasks +
robust hosting environments
Provision combined storage & compute resources
Decomposition of applications
To ship behaviour-bounded sub-computations to data
Co-scheduling & co-optimisation
Data & Code (movement), Code execution
Recovery and compensation
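
To make the "code size << data size" assumption concrete, here is a minimal sketch that compares the bytes moved when all data is pulled to the client with the bytes moved when a small filter is shipped to each site. The `Site` class, the record contents and the 200-byte code size are all hypothetical, and the sketch only counts bytes; it does nothing about the safe hosting, co-scheduling or recovery issues listed above.

```python
from dataclasses import dataclass
from typing import Callable, List

CODE_BYTES = 200  # assumed size of the shipped filter, per site


@dataclass
class Site:
    """A hypothetical data holder; records are plain byte strings."""
    name: str
    records: List[bytes]

    def pull_all(self) -> List[bytes]:
        """Move the data to the computation: every record crosses the network."""
        return list(self.records)

    def run(self, predicate: Callable[[bytes], bool]) -> List[bytes]:
        """Move the computation to the data: only matching records come back."""
        return [r for r in self.records if predicate(r)]


def total_bytes(records: List[bytes]) -> int:
    return sum(len(r) for r in records)


def is_interesting(record: bytes) -> bool:
    return record.startswith(b"HIT")


sites = [
    Site("observatory", [b"HIT:001"] + [b"miss:%03d" % i for i in range(999)]),
    Site("hospital", [b"miss:%03d" % i for i in range(1000)]),
]

moved_pull = sum(total_bytes(s.pull_all()) for s in sites)
moved_ship = sum(CODE_BYTES + total_bytes(s.run(is_interesting)) for s in sites)
print(f"pull data: {moved_pull} bytes, ship code: {moved_ship} bytes")
```

The advantage grows with the ratio of data size to result size; if most records matched the filter, shipping code would save little.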
The Story so Far
Technology enables Grids, More Data & …
Information Grids will dominate
Collaboration essential
Combining approaches
Combining skills
Sharing resources
(Structured) Data is the language of Collaboration
Data Access & Integration a Ubiquitous Requirement
Many hard technical challenges
Scale, heterogeneity, distribution, dynamic variation
Intimate combinations of data and computation
With unpredictable (autonomous) development of both
Science as Workflow
Data integration = the derivation of new data from old,
via coordinated computation
May be computationally demanding
The workflows used to achieve integration are often
valuable artifacts in their own right
May be Data Access & Movement Demanding
Obtaining data from files and DBs, transfer between
computations, deliver to DBs and File stores
Thus we must be concerned with how we
Build workflows
Share and reuse workflows
Explain workflows
Schedule workflows
Slide derived from Ian Foster’s SSDBM 03 keynote
Consider also DBs &
(Autonomous) Updates
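
One way to picture building and scheduling a workflow is as a dependency graph whose steps run in topological order. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the step names and toy computations are invented for illustration and are not taken from any project mentioned in this talk.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs (hypothetical names).
workflow = {
    "fetch_catalogue": set(),
    "fetch_images": set(),
    "cross_match": {"fetch_catalogue", "fetch_images"},
    "store_results": {"cross_match"},
}

# Toy implementations: each takes a dict of its inputs, keyed by step name.
steps = {
    "fetch_catalogue": lambda inputs: [1, 2, 3],
    "fetch_images": lambda inputs: [2, 3, 4],
    "cross_match": lambda inputs: sorted(
        set(inputs["fetch_catalogue"]) & set(inputs["fetch_images"])
    ),
    "store_results": lambda inputs: len(inputs["cross_match"]),
}


def run(workflow, steps):
    """Execute every step after its dependencies; return the order and results."""
    results = {}
    order = list(TopologicalSorter(workflow).static_order())
    for name in order:
        inputs = {dep: results[dep] for dep in workflow[name]}
        results[name] = steps[name](inputs)
    return order, results


order, results = run(workflow, steps)
print(order)
print(results["cross_match"], results["store_results"])
```

Keeping the graph as plain data, separate from the step implementations, is what makes such a workflow easy to share, reuse and explain.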
Sloan Digital Sky Survey
Production System
Slide from Ian Foster’s SSDBM 03 keynote
OGSA-DAI
First steps towards a generic framework for
integrating data access and computation
Using the grid to take specific classes of
computation nearer to the data
Kit of parts for building tailored access and
integration applications
DAIS-WG
Specification of Grid Data Services
Chairs
Norman Paton, Manchester University
Dave Pearson, Oracle
Current Spec. Draft Authors
Mario Antonioletti
Neil P Chue Hong
Susan Malaika
Simon Laws
Norman W Paton
Malcolm Atkinson
Amy Krause
Gavin McCance
James Magowan
Greg Riccardi
Draft Specification for GGF 7
Conceptual Model
External Universe
External data resource
External data resource
Data set
DBMS
DB
ResultSet
Conceptual Model
DAI Service Classes
Data resource
Data resource
DBMS
DB
Data activity session
Data request
Data set
ResultSet
OGSA-DAI Partners
IBM
USA
EPCC & NeSC
Glasgow
Newcastle
Belfast
Daresbury Lab
Manchester
Oxford
Oracle
Cambridge
Hinxton
RAL
Cardiff
London
IBM Hursley
Southampton
$5 million, 20 months, started February 2002
Additional 24 months, starts October 2003
Infrastructure Architecture
Data Intensive X Scientists
Data Intensive Applications for Science X
Simulation, Analysis & Integration Technology for Science X
Generic Virtual Data Access and Integration Layer
Job Submission
Brokering
Registry
Banking
Data Transport
Workflow
Structured Data
Integration
Authorisation
OGSA
Resource Usage Transformation Structured Data Access
OGSI: Interface to Grid Infrastructure
Compute, Data & Storage Resources
Structured Data
Relational
Distributed
Virtual Integration Architecture
XML Semi-structured
Data Access & Integration Services
Components: Client, Registry, Factory, Grid Data Service (GDS), XML / Relational database
Legend: SOAP/HTTP, service creation, API interactions
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML
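
The interaction pattern in steps 1a to 3c can be mimicked in a few lines. The sketch below is not the OGSA-DAI API: the `Registry`, `Factory` and `GridDataService` classes are hypothetical in-process stand-ins, SOAP/HTTP is replaced by direct method calls, and an in-memory SQLite database plays the relational resource. It only shows the order of the steps and the return of query results as XML.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS: manages access to one database."""

    def __init__(self, connection: sqlite3.Connection):
        self._connection = connection

    def perform(self, sql: str) -> str:
        cursor = self._connection.execute(sql)           # 3b. GDS interacts with database
        columns = [d[0] for d in cursor.description]
        root = ET.Element("resultset")
        for row in cursor:
            row_el = ET.SubElement(root, "row")
            for column, value in zip(columns, row):
                ET.SubElement(row_el, column).text = str(value)
        return ET.tostring(root, encoding="unicode")      # 3c. results returned as XML


class Factory:
    """Hypothetical stand-in for the Factory that creates GDS instances."""

    def __init__(self, connection: sqlite3.Connection):
        self._connection = connection

    def create_service(self) -> GridDataService:
        return GridDataService(self._connection)          # 2b, 2c


class Registry:
    """Hypothetical stand-in for the Registry: maps topics to factories."""

    def __init__(self):
        self._factories = {}

    def register(self, topic: str, factory: Factory) -> None:
        self._factories[topic] = factory

    def find(self, topic: str) -> Factory:
        return self._factories[topic]                     # 1b. respond with Factory handle


# Set-up: one toy database registered under an invented topic.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE star (name TEXT, magnitude REAL)")
db.execute("INSERT INTO star VALUES ('Vega', 0.03)")
registry = Registry()
registry.register("stars", Factory(db))

# Client side, following the numbered steps.
factory = registry.find("stars")                               # 1a, 1b
gds = factory.create_service()                                 # 2a, 2b, 2c
xml_result = gds.perform("SELECT name, magnitude FROM star")   # 3a, 3b, 3c
print(xml_result)
```

The client never touches the database connection directly; it holds only the handle returned by the factory, which is the point of the indirection in the diagram.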
Future DAI Services
Components: Analyst, Client, Data Registry, Factory, Application Code, Problem Solving Environment, Data Access & Integration master, Semantic Meta data
“Scientific” Application coding scientific insights
Services: GDS, GDS1, GDS2, GDS3, GDTS, GDTS1, GDTS2
Resources: Sx, Sy, XML database, Relational database
Legend: SOAP/HTTP, service creation, API interactions
1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration from resources Sx and Sy
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits sequence of scripts, each has a set of queries to GDS with XPath, SQL, etc
3b. Client tells analyst
3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation
A New World
What Architecture will Enable Data &
Computation Integration?
Common Conceptual Models
Common Planning & Optimisation
Common Enactment of Workflows
Common Debugging
…
What Fundamental CS is needed?
Trustworthy code & Trustworthy evaluators
Decomposition and Recomposition of Applications
…
Is there an evolutionary path?
Take Home Message
There is plenty for DB researchers to do
Workflow & DB integration, co-optimised
Distributed Queries on a global scale
Heterogeneity on a global scale
Dynamic variability
Authorisation, Resources, Data & Schema
Performance
Some Massive Data
Metadata for discovery, automation, repetition, …
Grasp the theoretical & practical challenges
Working in Open & Dynamic systems
Incorporate all computation
Welcome code into your most critical places
www.ogsadai.org.uk
www.nesc.ac.uk