The Challenge of Data Integration Data + Grid = Discovery? 22 January 2003

The Challenge of Data Integration
Data + Grid = Discovery?
Prof. Malcolm Atkinson
Director
www.nesc.ac.uk
22nd January 2003
Overview
Essentials of e-Science
Collaboration
Resource Sharing
Data Sharing
Mutual Dependence
Essentials of the Grid
Distributed Virtual Machine?
Essentials of Data Sharing
Database Research did it?
New Challenges
Data Access & Integration Building Bricks
Band Wagon v Research Opportunity
Thresholds, Visions and Questions
UK e-Science Programme (1)
2001 - 2003
DG Research Councils
E-Science Steering Committee
Grid TAG
Director
Director’s Awareness and Co-ordination Role
Academic Application Support Programme (£80m): Research Councils (£74m), DTI (£5m)
PPARC (£26m), BBSRC (£8m), MRC (£8m), NERC (£7m), ESRC (£3m), EPSRC (£17m), CLRC (£5m)
Director’s Management Role
Generic Challenges: EPSRC (£15m), DTI (£15m)
Collaborative projects
Industrial Collaboration (£40m)
UK e-Science
e-Science and the Grid
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’
‘e-Science will change the dynamic of the way science is undertaken.’
John Taylor
Director General of Research Councils
Office of Science and Technology
From presentation by Tony Hey
UK e-Science Investment
National e-Science Centre
HPC(x)
Edinburgh
Glasgow
Newcastle
Belfast
Projects
> 60 started
> 30 proposed
+
EU Projects
Daresbury Lab
Manchester
Cambridge
Hinxton
Oxford
Cardiff
RAL
London
Southampton
UK e-Science Programme (2)
2003 - 2005
DG Research Councils
E-Science Steering Committee
Grid TAG
Director
Director’s Awareness and Co-ordination Role
Academic Application Support Programme (£80m): Research Councils (£74m), DTI (£5m)
PPARC (£26m), BBSRC (£8m), MRC (£8m), NERC (£7m), ESRC (£3m), EPSRC (£17m), CLRC (£5m)
Director’s Management Role
Generic Challenges: EPSRC (£15m), DTI (£15m)
Collaborative projects
Industrial Collaboration (£40m)
Collaboration Growing
Hard Problems, Multi-disciplinary, Expense
Sharing
Ideas
Thought processes and Stimuli
Effort
Resources
Requires
Communication
Common understanding & Framework
Mechanisms for sharing fairly
Organisation and Infrastructure
Scientists have done this for Centuries
Collaboration Growing
Data, Policy & Digital Infrastructure Key
Sharing
Ideas
Thought processes and Stimuli
Effort
Resources
Requires
Communication
Common understanding & Framework
Mechanisms for sharing fairly
Organisation and Infrastructure
Text, digital media, structured, organised & curated data, annotation, computable models, visualisation, shared instruments, shared systems, shared administration, …
Nationally & Internationally Distributed, …
Routine, Daily, Automated, …
That Requires very Significant Investment in Digital Systems and their Support
Collaboration Growing
Digital Communication, Metadata,
Sharing
Ideas
Thought processes and Stimuli
Effort
Resources
Requires
Communication
Common understanding & Framework
Mechanisms for sharing fairly
Organisation and Infrastructure
Digital networks, … digital work-places, digital instruments, …
Metadata, ontologies, standards, shared curated data, shared codes, …
Common platforms, shared software, shared training, …
Authentication, Authorisation, Accounting, Provenance, Policies, …
Shared Provision of Platform,
The Grid SHOULD make this much easier by providing a common, supported high-level of Software and Organisational infrastructure
Interdependence
Science has relied on experiment and theory
Simulation, Data Mining, Analysis
Theory - Experiment: Greece (400 BC), Italy (1500 AD)
Simulation: Europe (1980 AD)
For problems which are:
- too large/small
- too fast/slow
- too complex
- too expensive, unethical, ...
- Testing Understanding
Interdependence
Theory
Data
Computing
Models
Data
Experiment
Database Growth
PDB protein structures
Globus Toolkit® History
[Chart: Downloads per Month from ftp.globus.org, 1997 to 2002; vertical axis 0 to 30000]
Does not include downloads from: NMI, UK eScience, EU Datagrid, IBM, Platform, etc.
Chart annotations:
Physiology of the Grid Paper Released
Anatomy of the Grid Paper Released
The Grid: Blueprint for a New Computing Infrastructure published
DARPA, NSF, and DOE begin funding Grid work
NASA begins funding Grid work, DOE adds support
NSF & European Commission Initiate Many New Grid Projects
Significant Commercial Interest in Grids
Early Application Successes Reported
GT 1.0.0 Released
GT 2.0 Released
Encompassing Vision
software
computers
sensor nets
instruments
colleagues
data archives
People & Industry
Global Grid Forum
[Chart: Global Grid Forum meetings GGF1 to GGF7; vertical axis 0 to 900]
Values shown: 260, 220, 400, 900, 450, >1000
Dates shown: Jul 01, Oct 01, Feb 02, Jul 02, Oct 02, Mar 03
UK All Hands
AHM’02 350
450
“IBM DRIVES GRID COMPUTING FOR COMMERCIAL BUSINESS WITH TEN NEW GRID OFFERINGS”
Targets
Sep 02
Financial, Life Sciences
Automotive & Aerospace
Governments
Partners
Platform, DataSynapse
Avaki, Entropia
United Devices
IBM last 20 months
GlobusWorld
1
IBM This week
Jan 03
Leaders of OGSI
Development teams
Grid Jamboree
GGF
High-Altitude Views
A Rallying Cry
Meeting a Hard Challenge requires Many Minds
Operating & Maintaining Infrastructure requires Many Hands & Many Companies
Another Stab at Distributed Computing
Hard Challenge: Intellectually and Practically Important
Dependable Ubiquity over Heterogeneity & Fallibility
An Ambitious Virtual Machine
Consistent large scale computational environments
A Global Operating System
Collective Resources, Common Management
An Architectural View
Application Users
Application
Application
Common Application Platform for Group of Applications
Application & Platform Developers
Monitoring, Diagnosis, Logging, Scheduling, Accounting, Authorisation
Grid Plumbing & Security Infrastructure
Data & Compute Resources
Providers
Distributed Operations Teams
Open Grid Services Infrastructure
Confluence of Web Services & Grid
Consistent Interface Description
Based on WSDL 1.2 proposal
Extend Properties
Separate Binding from Interface
Function Composition & Inheritance
Exploit WS* Investment
Grid Features
Security
Life-Time Management
Service (state) Information via Data Elements
Discovery
Grouping
Notification
OGSI Version 1 Proposal at GGF7 (March 03)
Open Grid Services Architecture
Ubiquitous Building Blocks
Using OGSI Platform
Open & Extensible
Encourage Refactoring Experiments
Initially
The Globus 2 model
Except State Information now distributed
Example New Features
Global Name Mapping Service
Replication and Caching Service
Data Access & Integration
Metering, Logging, Authorisation, Charging, …
Grid Challenge
Balancing “Direct” Access to the “Platforms” with Abstraction & Virtualisation
Developers often have exploitable application knowledge
Automation necessary & helpful
Interface matching, operation validation, …
Optimisation at many scales
There isn’t enough effort to develop Languages & Abstractions
Data Integration
Scientist with Idea
1) Find Data
2) Extract Data
3) Transform Data
4) Combine Data
5) Interpret Data
Data Resource 1
Data Resource 2
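The five steps on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two data resources are hypothetical in-memory catalogues, and every name here (RESOURCES, find, extract, transform, combine, interpret, integrate) is invented for the example rather than taken from any Grid toolkit.

```python
# Two hypothetical data resources, modelled as in-memory catalogues (assumption).
RESOURCES = {
    "resource_1": {
        "topic": "genes",
        "records": [{"gene": "G1", "expr": "2.5"}, {"gene": "G2", "expr": "0.4"}],
    },
    "resource_2": {
        "topic": "genes",
        "records": [{"GENE": "g1", "tissue": "heart"}, {"GENE": "g3", "tissue": "liver"}],
    },
}


def find(topic):
    """1) Find Data: names of the resources that advertise the topic."""
    return sorted(name for name, res in RESOURCES.items() if res["topic"] == topic)


def extract(name):
    """2) Extract Data: copy the records out of one resource."""
    return [dict(record) for record in RESOURCES[name]["records"]]


def transform(records):
    """3) Transform Data: normalise key names and gene identifiers."""
    normalised = []
    for record in records:
        row = {key.lower(): value for key, value in record.items()}
        row["gene"] = row["gene"].upper()
        normalised.append(row)
    return normalised


def combine(left, right):
    """4) Combine Data: inner join of the two tables on the 'gene' key."""
    index = {row["gene"]: row for row in right}
    return [{**row, **index[row["gene"]]} for row in left if row["gene"] in index]


def interpret(rows):
    """5) Interpret Data: a toy one-line summary per combined row."""
    return [f"{row['gene']}: expression {row['expr']} in {row['tissue']}" for row in rows]


def integrate(topic):
    """Run steps 1 to 5 for a topic served by exactly two resources."""
    first, second = (transform(extract(name)) for name in find(topic))
    return interpret(combine(first, second))


print(integrate("genes"))
```

In this toy version each step is a plain function; the slide's point is that in practice every one of them crosses organisational and format boundaries.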
Wellcome Trust: Cardiovascular Functional Genomics
Glasgow
Shared data
Edinburgh
Public curated data
Leicester
Oxford
London
Netherlands
OGSA-DAI Partners
[Map of UK sites: Glasgow, Newcastle, Belfast, Daresbury Lab, Manchester, Oxford, Cambridge, Hinxton, RAL, Cardiff, London, Southampton]
Partners: EPCC & NeSC, Manchester e-SC, Newcastle e-SC, Oracle, IBM UK, IBM Hursley, IBM USA
£3 million, 18 months, started February 2002
OGSA-DAI
Data Access and Integration for the New Grid
Uniform Service Interfaces for Accessing Multiple Data Sources within the Open Grid Services Architecture.
UK e-Science Contribution to GT3
DAI Key Services
GridDataService (GDS): Access to data & DB operations
GridDataServiceFactory (GDSF): Makes GDS & GDSF
GridDataServiceRegistry (GDSR): Discovery of GDS(F) & Data
GridDataTranslationService (GDTS): Translates or Transforms Data
GridDataTransportDepot (GDTD): Data transport with persistence
Integrated Structured Data Transport
Relational & XML models supported
Role-based Authorisation
Binary structured files (later)
DAI Architecture
Data Intensive X Scientists
Data Intensive Applications for Science X
Simulation, Analysis & Integration Technology for Science X
Generic Virtual Data Access and Integration Technology
Monitoring
Diagnosis
Scheduling
Accounting
GridFTP
Naming
Authorisation
Caching
Data Integration Services
Data Access Services
Grid Infrastructure
Compute, Data & Storage Resources
Structured Data
Distributed
Data Integration Architecture
[Diagram components: Client, Registry, Factory, Grid Data Service, XML / Relational database]
[Diagram legend: SOAP/HTTP, service creation, API interactions]
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML
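The Registry, Factory and GDS interaction on this slide can be illustrated with a minimal in-process sketch. It makes several assumptions: plain Python objects stand in for SOAP/HTTP services and their handles, an in-memory SQLite database stands in for the relational database, and the class and method names (Registry.lookup, Factory.create_service, GridDataService.query) are hypothetical, not the OGSA-DAI interfaces.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS: manages access to one database."""

    def __init__(self, connection):
        self._connection = connection

    def query(self, sql):
        # 3b. GDS interacts with the database.
        cursor = self._connection.execute(sql)
        columns = [description[0] for description in cursor.description]
        # 3c. Results of the query are returned to the client as XML.
        root = ET.Element("results")
        for row in cursor:
            row_element = ET.SubElement(root, "row")
            for name, value in zip(columns, row):
                ET.SubElement(row_element, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class Factory:
    """Hypothetical factory: creates a GDS for the database it fronts."""

    def __init__(self, connection):
        self._connection = connection

    def create_service(self):
        # 2b. Factory creates a GDS; 2c. its "handle" is the returned object.
        return GridDataService(self._connection)


class Registry:
    """Hypothetical registry: maps a topic to a factory."""

    def __init__(self):
        self._factories = {}

    def register(self, topic, factory):
        self._factories[topic] = factory

    def lookup(self, topic):
        # 1b. Registry responds with the Factory handle.
        return self._factories[topic]


def demo():
    database = sqlite3.connect(":memory:")
    database.execute("CREATE TABLE x (id INTEGER, label TEXT)")
    database.execute("INSERT INTO x VALUES (1, 'alpha')")

    registry = Registry()
    registry.register("x", Factory(database))

    factory = registry.lookup("x")               # 1a, 1b
    gds = factory.create_service()               # 2a, 2b, 2c
    return gds.query("SELECT id, label FROM x")  # 3a, 3b, 3c


print(demo())
```

The client never touches the database connection directly; it only ever holds the registry, then the factory, then the GDS, which mirrors the indirection drawn on the slide.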
[Diagram components: Client, Consumer, Registry, Factory, a network of GDS instances, XML / Relational databases]
[Diagram legend: SOAP/HTTP, service creation, API interactions]
1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration to databases
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits set of queries to GDS with XPath, SQL, etc
3b. Tell consumer
3c. Results of queries returned to consumer as XML or binary
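What distinguishes this scenario from the single-database one is that the factory creates several GDS instances and the results are delivered to a separate consumer rather than back to the client. A minimal sketch of that delivery pattern follows, under stated assumptions: the databases are plain Python lists, queries are Python predicates instead of XPath or SQL, the registry lookup is omitted, and all class and method names are hypothetical.

```python
class Consumer:
    """Hypothetical consumer: receives results on the client's behalf."""

    def __init__(self):
        self.received = []

    def deliver(self, source, rows):
        self.received.append((source, rows))


class GridDataService:
    """Hypothetical GDS fronting one data source (a list of values here)."""

    def __init__(self, name, rows):
        self.name = name
        self._rows = rows

    def perform(self, predicate, consumer):
        # 3c. Results go to the consumer, not back to the client.
        consumer.deliver(self.name, [row for row in self._rows if predicate(row)])


class Factory:
    """Hypothetical factory that can build a network of GDS instances."""

    def __init__(self, databases):
        self._databases = databases

    def create_network(self, topics):
        # 2b. One GDS per requested database.
        return [GridDataService(topic, self._databases[topic]) for topic in topics]


def run():
    databases = {"x": [1, 5, 9], "y": [2, 6, 10]}  # toy data (assumption)
    factory = Factory(databases)                   # 1a, 1b: registry lookup omitted
    network = factory.create_network(["x", "y"])   # 2a, 2b, 2c
    consumer = Consumer()                          # 3b: the consumer the client names
    for gds in network:                            # 3a: client submits the queries
        gds.perform(lambda value: value > 4, consumer)
    return consumer.received


print(run())
```

Routing results to a third party means the client does not need to sit on the data path, which matters when the results are large.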
Biomedical (or ANY) Data
Opportunities
Global Production of Published Data
Volume↑ Diversity↑
Combination ⇒ Analysis ⇒ Discovery
Opportunities
Specialised Indexing
Structurally varied replication
Consistent Structured Universe of Discourse
Data & Computation Integration
Challenges
Data Huggers
Meagre metadata
Ease of Use
Automated, optimised integration
Traceability, Dependability
Challenges
Approximate Matching
Multi-scale optimisation
Bad habits / industrial structures
Safety and Multi-scale optimisation
Data Integration Challenges
High-Level Languages
Describing the Data Extraction Recipes
Describing the Sources & Components
Metadata that drives automation & validation
Mobility
Code & Data
Integrating Existing DB technology
Moving the DBMS to the Grid context
New Optimisation Challenges
Data & Computation & Storage & Movement
Shared Distributed Annotation Systems
How to Reference
Provenance & Acknowledgement
Challenges
A Programming & Development Model
Dependability at this Scale
Foundations for Trust
Raising the Level of Automation
Supporting New Forms of Collaboration
Data