Prepare for the Data Deluge UCISA: 23

UCISA:
How do I Grid enable my University?
Prepare for the Data Deluge
Prof. Malcolm Atkinson
Director
www.nesc.ac.uk
23rd October 2003
Outline
Aspects of the Data Deluge
Our approach to Data Access and Integration
Sloan Digital Sky Survey
Production System
Slide from Ian Foster’s SSDBM 2003 keynote
Global Knowledge Communities
Often Driven by Data: E.g., Astronomy
No. & sizes of data sets as of mid-2002,
grouped by wavelength
• 12 waveband coverage of large
areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins
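The "doubling every 12 months" figure can be turned into a rough projection. The sketch below is illustrative only: it assumes steady exponential growth from the roughly 200 TB quoted above, and the function names are made up for this example.

```python
import math


def projected_volume_tb(initial_tb: float, doubling_months: float,
                        elapsed_months: float) -> float:
    """Volume after elapsed_months, assuming steady exponential doubling."""
    return initial_tb * 2 ** (elapsed_months / doubling_months)


def months_until(initial_tb: float, target_tb: float,
                 doubling_months: float) -> float:
    """Months needed to grow from initial_tb to target_tb at the same rate."""
    return doubling_months * math.log2(target_tb / initial_tb)


# Three doublings in 36 months: 200 TB grows to 1600 TB.
after_three_years = projected_volume_tb(200, 12, 36)

# Time to reach one petabyte, taken here as 1000 TB (an assumption).
months_to_petabyte = months_until(200, 1000, 12)
```

Under these assumptions the holdings pass one petabyte in a little over two years, which is why the next slides treat the move from terabytes to petabytes as a near-term planning problem.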
Database Growth
Bases 45,356,382,990
PDB Content Growth
It’s Easy to Forget
How Different 2003 is From 1993
Enormous quantities of data: Petabytes
For an increasing number of communities
gating step is not collection but analysis
Ubiquitous Internet: >100 million hosts
Collaboration & resource sharing the norm
Security and Trust are crucial issues
Ultra-high-speed networks: >10 Gb/s
Global optical networks
Bottlenecks: last kilometre & firewalls
Huge quantities of computing: >100 Top/s
Moore’s law gives us all supercomputers
Organising their effective use is the challenge
Moore’s law everywhere
Instruments, detectors, sensors, scanners, …
Organising their effective use is the challenge
Derived from Ian Foster’s slide at SSDBM, July 2003
Tera → Peta Bytes
                     Terabyte                  Petabyte
RAM time to move     15 minutes                2 months
1Gb WAN move time    10 hours ($1000)          14 months ($1 million)
Disk Cost            7 disks = $5000 (SCSI)    6800 Disks + 490 units + 32 racks = $7 million
Disk Power           100 Watts                 100 Kilowatts
Disk Weight          5.6 Kg                    33 Tonnes
Disk Footprint       Inside machine            60 m²
May 2003 Approximately Correct
See also Distributed Computing Economics Jim Gray, Microsoft Research, MSR-TR-2003-24
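The WAN move times can be checked with simple arithmetic. The sketch below is a rough illustration, not the method used for the slide: it assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits/s), and the function names are invented for this example.

```python
SECONDS_PER_HOUR = 3600
TERABYTE = 10**12          # bytes, decimal units (assumption)
GIGABIT_PER_S = 10**9      # bits per second, decimal units (assumption)


def transfer_seconds(size_bytes: float, link_bits_per_s: float,
                     efficiency: float = 1.0) -> float:
    """Time to move size_bytes over a link that sustains
    efficiency * link_bits_per_s of useful throughput."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_bytes * 8 / (link_bits_per_s * efficiency)


def implied_throughput_bits_per_s(size_bytes: float, seconds: float) -> float:
    """Sustained throughput implied by moving size_bytes in the given time."""
    return size_bytes * 8 / seconds


# At full line rate a terabyte would take 8000 s, a little over 2 hours.
ideal_seconds = transfer_seconds(TERABYTE, GIGABIT_PER_S)

# The quoted 10 hours therefore implies a sustained rate well below 1 Gb/s.
implied_rate = implied_throughput_bits_per_s(TERABYTE, 10 * SECONDS_PER_HOUR)
```

On these assumptions the 10 hour figure corresponds to roughly 220 Mb/s sustained, and the petabyte figure of 14 months works out to a similar rate, so the two entries are mutually consistent.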
Infrastructure Architecture
Data Intensive Users
Data Intensive Applications for Application area X
Simulation, Analysis & Integration Technology for Application area X
Generic Virtual Data Access and Integration Layer
OGSA services: Job Submission, Brokering, Registry, Banking, Data Transport, Workflow, Structured Data Integration, Authorisation, Resource Usage, Transformation, Structured Data Access
OGSI: Interface to Grid Infrastructure
Compute, Data & Storage Resources
Structured Data: Relational, XML, Semi-structured
Distributed
Virtual Integration Architecture
Integrating DBs into the Grid
We want to build on existing DBs, not replace them.
Could produce a Grid-enabled version of JDBC/ODBC
Need something more for metadata-driven access to
data
Service-based approach should be better
Provide a uniform framework for access to
databases on the Grid
Data as Service:
OGSA Data Access & Integration
Service-oriented treatment of data appears to
have significant advantages
Leverage OGSI introspection, lifetime, etc.
Compatibility with Web services
Standard service interfaces being defined
Service data: e.g., schema
Derive new data services from old (views)
Externalize to e.g. file/database format
Perform queries or other operations
Data Services
GGF Data Access and Integration Services (DAIS)
OGSI-compliant interfaces to access relational and
XML databases
Needs to be generalized to encompass other data
sources (see next slide…)
Generalized DAIS becomes the foundation for:
Replication: Data located in multiple locations
Federation: Composition of multiple sources
Provenance: How was data generated?
“OGSA Data Services”
(Foster, Tuecke, Unger, eds.)
Describes conceptual model for representing all
manner of data sources as Web services
Database, filesystems, devices, programs, …
Integrates WS-Agreement
Data service is an OGSI-compliant Web service
that implements one or more of base data
interfaces:
DataDescription, DataAccess, DataFactory,
DataManagement
These would be extended and combined for specific
domains (including DAIS)
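One way to picture "a data service implements one or more of the base data interfaces" is as a class that mixes in several abstract interfaces. The sketch below is purely illustrative: only the interface names DataDescription, DataAccess and DataFactory come from the slide, while every method name, the in-memory service and the sample rows are assumptions made for this example and are not taken from the OGSA Data Services document. DataManagement is left out because the slide does not say what it covers.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List

Row = Dict[str, object]


class DataDescription(ABC):
    @abstractmethod
    def describe(self) -> Dict[str, object]:
        """Return metadata about the data (hypothetical method)."""


class DataAccess(ABC):
    @abstractmethod
    def select(self, predicate: Callable[[Row], bool]) -> List[Row]:
        """Return the rows that satisfy the predicate (hypothetical method)."""


class DataFactory(ABC):
    @abstractmethod
    def derive(self, predicate: Callable[[Row], bool]) -> "InMemoryDataService":
        """Create a new data service over a filtered view (hypothetical method)."""


class InMemoryDataService(DataDescription, DataAccess, DataFactory):
    """Toy data service over a list of rows, implementing three interfaces."""

    def __init__(self, rows: List[Row]):
        self._rows = list(rows)

    def describe(self) -> Dict[str, object]:
        columns = sorted({key for row in self._rows for key in row})
        return {"columns": columns, "rows": len(self._rows)}

    def select(self, predicate: Callable[[Row], bool]) -> List[Row]:
        return [dict(row) for row in self._rows if predicate(row)]

    def derive(self, predicate: Callable[[Row], bool]) -> "InMemoryDataService":
        # A derived service plays the role of a view over the parent's data.
        return InMemoryDataService(self.select(predicate))


# Made-up sample rows, for illustration only.
catalogue = InMemoryDataService([
    {"object": "A", "band": "optical"},
    {"object": "B", "band": "radio"},
    {"object": "C", "band": "optical"},
])
optical = catalogue.derive(lambda row: row["band"] == "optical")
```

Here `optical` is itself a data service, so it can be described, queried or used to derive further services, which mirrors the "derive new data services from old (views)" idea.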
OGSA-DAI Approach
Reuse existing technologies and standards
OGSA, Query languages, Java, transport
Build portTypes and services which will enable:
controlled exposure of heterogeneous data resources on an OGSI-compliant grid
access to these resources via common interfaces using existing underlying query mechanisms
(ultimately) data integration across distributed data resources
OGSA-DAI (the software) seeks to be a reference implementation of the GGF DAIS WG standard
Can’t keep up with frequent standard changes, so software releases track specific drafts
See http://www.ogsadai.org.uk/ for details.
Data Access & Integration Services
Components: Client, Registry, Factory, Grid Data Service (GDS), XML / Relational database
Legend: SOAP/HTTP, service creation, API interactions
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc
3b. GDS interacts with database
3c. Results of query returned to client as XML
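The registry, factory and Grid Data Service interaction can be mimicked in a few lines. This is a minimal single-process sketch, not OGSA-DAI code: the classes Registry, Factory and GridDataService, their methods, and the demo table are all hypothetical, SQLite stands in for the database, and plain method calls stand in for SOAP/HTTP.

```python
import sqlite3
import xml.etree.ElementTree as ET


class GridDataService:
    """Hypothetical stand-in for a GDS wrapping one database connection."""

    def __init__(self, connection: sqlite3.Connection):
        self._connection = connection

    def query(self, sql: str, params: tuple = ()) -> str:
        # 3b. The GDS interacts with the database.
        cursor = self._connection.execute(sql, params)
        columns = [column[0] for column in cursor.description]
        # 3c. Results are returned to the client as XML.
        root = ET.Element("results")
        for values in cursor.fetchall():
            row = ET.SubElement(root, "row")
            for name, value in zip(columns, values):
                ET.SubElement(row, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class Factory:
    """Hypothetical factory that creates a GDS per request."""

    def __init__(self, connect):
        self._connect = connect

    def create_service(self) -> GridDataService:
        # 2b. Create a GDS to manage access; 2c. return its handle.
        return GridDataService(self._connect())


class Registry:
    """Hypothetical registry mapping a topic to a factory."""

    def __init__(self):
        self._factories = {}

    def register(self, topic: str, factory: Factory) -> None:
        self._factories[topic] = factory

    def find(self, topic: str) -> Factory:
        # 1a. Look up sources of data about a topic; 1b. return the factory.
        return self._factories[topic]


def open_demo_database() -> sqlite3.Connection:
    """Build a small in-memory demo table (illustrative data only)."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE stars (name TEXT, magnitude REAL)")
    connection.executemany("INSERT INTO stars VALUES (?, ?)",
                           [("Vega", 0.03), ("Sirius", -1.46)])
    return connection


registry = Registry()
registry.register("stars", Factory(open_demo_database))

factory = registry.find("stars")     # steps 1a, 1b
gds = factory.create_service()       # steps 2a, 2b, 2c
xml_result = gds.query(              # steps 3a, 3b, 3c
    "SELECT name FROM stars WHERE magnitude < ?", (0,))
```

The client never touches the database directly: it only holds handles obtained from the registry and the factory, which is the point of the pattern on the slide.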
Third Party Delivery
[Figure: a Data Set moved in numbered steps 1 to 4. Labels: Client, API, Requestor, Stub; Client, Consumer, API, Stub; dr.]
Future DAI Services?
Components: Analyst, Problem Solving Environment, Client, Data Registry, Data Access & Integration master, Semantic Meta data, Application Code, GDS1, GDS2, GDS3, GDTS1, GDTS2, Sx, Sy, XML database, Relational database
Legend: SOAP/HTTP, service creation, API interactions
Analyst: “scientific” Application coding scientific insights
1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration from resources Sx and Sy
2b. Factory creates GridDataServices network
2c. Factory returns handle of GDS to client
3a. Client submits sequence of scripts, each with a set of queries, to GDS with XPath, SQL, etc
3b. Client tells analyst
3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation
Take Home Message
Information Grids
Support for collaboration
Integrated support for computation and data grids
Structured data fundamental

Relations, XML, semi-structured, files, …
Integrated strategies & technologies needed
OGSA-DAI is here now
See http://www.ogsadai.org.uk/ for details.
A first step — Try it
Tell us what is needed to make it better
Managing Scientific Data is a Major Requirement
The Grid is 30% of the software stack needed for e-Science