OGSA-DAI: tools for data access over web services PRISM Forum

NeSC, 27th April 2005
Neil Chue Hong
Project Manager, EPCC
N.ChueHong@epcc.ed.ac.uk
+44 131 650 5957
Overview
• The difficulty with data
• The Challenge of Data Services
• The OGSA-DAI Software
• Projects using OGSA-DAI
PRISM Forum - http://www.ogsadai.org.uk
The Data Deluge
• Entering an age of data
– Data Explosion
– CERN: LHC will generate 1 GB/s = 10 PB/y
– VLBA (NRAO) generates 1 GB/s today
– Pixar generates 100 TB per movie
– Storage getting cheaper
• Data stored in many different ways
– Data resources
– Relational databases
– XML databases / files
– Result files
• Need ways to facilitate
– Data discovery
– Data access
– Data integration
• Empower e-Business and e-Science
– The Grid is a vehicle for achieving this
Composing Observations in Astronomy
[Figure: number and sizes of data sets as of mid-2002, grouped by wavelength]
• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins
Data Services: motives
• Key to Integration of Scientific Methods
– Publication and sharing of results
– Primary data from observation, simulation & experiment
– Encourages novel uses
– Allows validation of methods and derivatives
– Enables discovery by combining data collected independently
• Key to Large-scale Collaboration
– Economies: data production, publication & management
– Sharing cost of storage, management and curation
– Many researchers contributing increments of data
– Pooling annotation leads to rapid incremental publication
– Accommodates global distribution
– Data & code travel faster and more cheaply
– Accommodates temporal distribution
– Researchers assemble data
– Later (other) researchers access data
Data Services: a definition
• A Data Service is a web service which provides published interfaces to allow:
– access to a data resource
– management of a data resource
– transfer of data to and from a data resource
• A data resource here is any form of structured data, e.g. databases, spreadsheets, image files, sensor streams, records, …
• Standards allow interoperability between services
– HTTP, SOAP, XML,…
Data Services: challenges
• Scale
– Many sites, large collections, many uses
• Longevity
– Research requirements outlive technical decisions
• Diversity
– No “one size fits all” solutions will work
– Primary Data, Data Products, Meta Data, Administrative data, …
• Many Data Resources
– Independently owned & managed
– No common goals
– No common design
– Work hard for agreements on foundation types and ontologies
– Autonomous decisions change data, structure, policy, …
– Geographically distributed
• and I haven’t even mentioned security yet!
The Discovery Process
• Choosing data sources
– How do you find them?
– How do they describe and advertise them?
– Is the equivalent of Google possible?
• Obtaining access to that data
– Overcoming administrative barriers
– Overcoming technical barriers
• Understanding that data and extracting from multiple sources
– The parts you care about for your research
• Combining them using sophisticated models
– The picture of reality in your head
• Analysis on scales required by statistics
– Coupling data access with computation
• Repeated Processes
– Examining variations, covering a set of candidates
– Monitoring the emerging details
Small problems
• Not just “Grand Challenges”!
– Also the small problems
• For instance:
– What happens to data when an analyst leaves a team?
– How does a team leader point to “popular” data when a new analyst joins?
– How do you manage your data when you start to run out of local storage
space?
– How do you get your data from one format/database to another?
– How do I combine my data with your data without changing either?
• You need to manage your data:
– the Grid can help, but you need to put in place the process yourself
What is a data service?
• An interface to a stored collection of data
– e.g. Google and Amazon
– web services
• But the data could be:
– replicated
– shared
– federated
– virtual
– incomplete
• Don’t care about the underlying representation
– do care about the information it represents
– standards give us interoperability
Examples of Data Services
• Many Data Services and applications
– Commercial databases
– Web interfaces
– Applications developed individually by groups and projects
• Also many places to get hold of public data
– Publications and citation servers
– Results servers
• OGSA-DAI is a project
– which provides an implementation of data services
– which provides an extensible framework to customise data services
for particular applications
OGSA-DAI Project
• Develop a component library
– Access and manipulate data in a grid
– Serve UK and International e-Science communities
• Aims to provide
– Common interface to data resources
– Simple integration of distributed queries to multiple data
resources
• Contribute to standardisation efforts
– Input into GGF DAIS WG and other groups
– Provide a reference implementation of DAIS spec
• Based on Open Grid Services Architecture (OGSA)
– Globus Toolkit 3 (GT3) “compliant”
– Moving to WS-RF (GT4) and WS-I+ (OMII) versions
Project Partners
Funded by the Grid Core Programme
• OGSA-DAI
– £3 million, 18 months, from Feb 2002
– Three major releases, three interim releases
• DAIT (DAI-Two)
– Keep the OGSA-DAI brand name
– £1.5 million, 24 months, from Oct 2003
– Four major releases
• GGF DAIS WG
– Strong involvement
– Standardise the interfaces
– OGSA-DAI to be a reference implementation
Web Service Architecture
[Diagram: a triangle of Service Registry, Service Consumer and Service Provider, with arrows labelled Publish, Discover and Bind.]
Why OGSA-DAI?
• Why use OGSA-DAI over JDBC?
– Can embed additional functionality at the service end
– Transformations, compressions
– Third party delivery
– The extensible activity framework
– Avoiding unnecessary data movement
– Common interface to heterogeneous data resources
– Relational, XML databases, and files
– Usefulness of the Registry for service discovery
– Dynamic service binding process
– Provision of good meta-data is necessary
– Language independence at the client end
– Do not need to use Java
– Platform independence
– Do not have to worry about connection technology, drivers, etc
Heterogeneity
[Diagram: a single Grid Data Service in front of Xindice, MySQL, Oracle and DB2 data resources.]
• Data source abstraction behind GDS instance
– plug in “data resource implementations” for different data source
technologies
– does not mandate any particular query language or data format
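The plug-in idea above can be sketched in Java roughly as follows. This is an illustration only, not the OGSA-DAI API: the names `DataResource`, `RelationalResource`, `XmlResource` and `DataService` are hypothetical, and the implementations return canned strings instead of talking to real databases. The point is that the service passes the query through untouched, so each resource keeps its own query language.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction: one interface, many back-end technologies.
interface DataResource {
    String technology();

    // The query is passed through as-is; its language is up to the resource.
    String perform(String query);
}

// Hypothetical plug-in for a relational database (canned answer, no JDBC).
class RelationalResource implements DataResource {
    public String technology() { return "relational"; }
    public String perform(String query) { return "rowset for: " + query; }
}

// Hypothetical plug-in for an XML database (canned answer, no real driver).
class XmlResource implements DataResource {
    public String technology() { return "xml"; }
    public String perform(String query) { return "node list for: " + query; }
}

// Hypothetical service front end that hides which implementation is plugged in.
class DataService {
    private final Map<String, DataResource> resources = new HashMap<>();

    void plugIn(String name, DataResource resource) {
        resources.put(name, resource);
    }

    String perform(String name, String query) {
        DataResource resource = resources.get(name);
        if (resource == null) {
            throw new IllegalArgumentException("Unknown data resource: " + name);
        }
        return resource.perform(query);
    }
}

class HeterogeneityDemo {
    public static void main(String[] args) {
        DataService service = new DataService();
        service.plugIn("patients", new RelationalResource());
        service.plugIn("genome", new XmlResource());
        System.out.println(service.perform("patients", "SELECT * FROM patient"));
        System.out.println(service.perform("genome", "/genes/gene"));
    }
}
```

Supporting another data source technology then means adding one more `DataResource` implementation, without changing the client-facing `DataService`.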
Location
[Diagram: an Analyst, a Registry (DAISGR) and a Factory (GDSF), with arrows labelled registerService and findServiceData.]
• Data resource publication through registry
• Data location hidden by factory
• Data resource meta data available through Service Data
Elements
OGSA-DAI Services
• OGSA-DAI uses three main service types
– DAISGR (registry) for discovery
– GDSF (factory) to represent a data resource
– GDS (data service) to access a data resource
[Diagram: the DAISGR locates a GDSF; the GDSF represents the Data Resource and creates a GDS; the GDS accesses the Data Resource.]
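The relationship between the three service types can be sketched in plain Java. This is a simplified, in-memory illustration under stated assumptions: `DaisgrSketch`, `GdsfSketch` and `GdsSketch` are hypothetical classes standing in for the real web services, and no OGSA-DAI or Globus call is used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a GDS: transient, holds one client session.
class GdsSketch {
    private final String resourceName;

    GdsSketch(String resourceName) {
        this.resourceName = resourceName;
    }

    String perform(String request) {
        return resourceName + " handled: " + request;
    }
}

// Hypothetical stand-in for a GDSF: persistent, represents one data resource.
class GdsfSketch {
    private final String resourceName;
    private final String resourceType;

    GdsfSketch(String resourceName, String resourceType) {
        this.resourceName = resourceName;
        this.resourceType = resourceType;
    }

    String resourceType() {
        return resourceType;
    }

    // Each call creates a fresh service instance for one client.
    GdsSketch createService() {
        return new GdsSketch(resourceName);
    }
}

// Hypothetical stand-in for a DAISGR: factories register, clients search.
class DaisgrSketch {
    private final List<GdsfSketch> factories = new ArrayList<>();

    void register(GdsfSketch factory) {
        factories.add(factory);
    }

    List<GdsfSketch> findByType(String resourceType) {
        List<GdsfSketch> matches = new ArrayList<>();
        for (GdsfSketch factory : factories) {
            if (factory.resourceType().equals(resourceType)) {
                matches.add(factory);
            }
        }
        return matches;
    }
}

class ServiceFlowDemo {
    public static void main(String[] args) {
        DaisgrSketch registry = new DaisgrSketch();
        registry.register(new GdsfSketch("proteins", "relational"));
        registry.register(new GdsfSketch("genome", "xml"));

        // 1. Discover a factory, 2. create a service, 3. access the data.
        GdsfSketch factory = registry.findByType("xml").get(0);
        GdsSketch service = factory.createService();
        System.out.println(service.perform("/genes"));
    }
}
```

The client only ever names a capability ("xml" here); where the data lives stays hidden behind the factory.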
GDSF and GDS
• Grid Data Service Factory (GDSF)
– Represents a data resource
– Persistent service
– Currently static (no dynamic GDSFs)
– Cannot instantiate new services to represent other/new
databases
– Exposes capabilities and metadata
– May register with a DAISGR
• Grid Data Service (GDS)
– Created by a GDSF
– Generally transient service
– Required to access data resource
– Holds the client session
DAISGR
• DAI Service Group Registry (DAISGR)
– Persistent service
– Based on OGSI ServiceGroups
– GDSFs may register with DAISGR
– Clients access DAISGR to discover
– Resources
– Services (may need specific capabilities)
– Support a given portType or activity
– In Release 5.0, services no longer automatically register
Current Version: Release 5.0
• Released on December 3rd 2004
– Globus Toolkit 3.2.1
– Platform and language independent
– Java 1.4
– Runs on Windows, Solaris, Linux, AIX
• Listened to major user requirements
– Wide range of supported data resources
– Wide range of delivery methods (e.g. GridFTP), transformations, …
– Added indexed text file access to support the bioinformatics community
– Added GUI installation and configuration wizard
– Continued making improvements in robustness and usability
• Work concentrated on data access
– Wraps data resources without hiding underlying data model
– Provide base for higher-level services
– Distributed Query Processing (DQP)
– Data federation services
• Next release (May 2005) offers GT4 and OMII versions
Supported Data Resources
• Relational
– DB2
– Oracle
– PostgreSQL
– SQLServer
– MySQL
• XML
– Xindice
– eXist
• Other
– Files
OGSA-DAI Deck of Activities
Predefined Activities
DeliverFromGDT
xmlCollectionManagement
relationalResourceManager
xmlResourceManagement
sqlBulkLoadRowset
sqlUpdateStatement
sqlStoredProcedure
sqlQueryStatement
xQueryStatement
xUpdateStatement
xPathStatement
DeliverToGDT
DeliverToStream
outputStream
DeliverFromGFTP
DeliverToGFTP
DeliverToURL
DeliverFromURL
inputStream
xslTransform
zipArchive
gzipCompression
Client Toolkit
• Why? Nobody wants to write XML!
• A programming API which makes writing
applications easier
– Now: Java
– Next: Perl, C, C#
// Create a query
SQLQuery query = new SQLQuery(SQLQueryString);
ActivityRequest request = new ActivityRequest();
request.addActivity(query);
// Perform the query
Response response = gds.perform(request);
// Display the result
ResultSet rs = query.getResultSet();
displayResultSet(rs, 1);
Integration Scenario
• A patient moves hospital
[Diagram: three data sources (Data A, Data B, Data C), held in DB2, Oracle and a CSV file, combine into an amalgamated patient record.]
– A: (PID, name, address, DOB)
– B: (PID, first_contact)
– C: (PID, first_name, last_name, address, first_contact, DOB)
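A minimal Java sketch of the amalgamation step, under stated assumptions: the slide does not give the target schema, so the sketch assumes a record of (PID, name, address, DOB, first_contact) with the name in a single field, and it assumes earlier sources take precedence when two sources disagree. `PatientMergeSketch` and `amalgamate` are hypothetical names, and rows are plain maps instead of real query results.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PatientMergeSketch {

    // Builds one amalgamated record from the rows of A, B and C that share a PID.
    // Any argument may be null if that source has no row for the patient.
    // Assumption: earlier sources win, so a field already set is not overwritten.
    static Map<String, String> amalgamate(Map<String, String> a,
                                          Map<String, String> b,
                                          Map<String, String> c) {
        Map<String, String> out = new LinkedHashMap<>();
        if (a != null) {
            copy(a, out, "PID", "name", "address", "DOB");
        }
        if (b != null) {
            copy(b, out, "PID", "first_contact");
        }
        if (c != null) {
            // C stores the name in two columns, so combine them into one field.
            String first = c.get("first_name");
            String last = c.get("last_name");
            if (first != null && last != null) {
                out.putIfAbsent("name", first + " " + last);
            }
            copy(c, out, "PID", "address", "first_contact", "DOB");
        }
        return out;
    }

    // Copies the listed keys that are present in 'from' and still missing in 'to'.
    private static void copy(Map<String, String> from, Map<String, String> to,
                             String... keys) {
        for (String key : keys) {
            String value = from.get(key);
            if (value != null) {
                to.putIfAbsent(key, value);
            }
        }
    }

    public static void main(String[] args) {
        // Invented example rows, for illustration only.
        Map<String, String> b = new LinkedHashMap<>();
        b.put("PID", "7");
        b.put("first_contact", "2001-03-04");

        Map<String, String> c = new LinkedHashMap<>();
        c.put("PID", "7");
        c.put("first_name", "Ada");
        c.put("last_name", "Example");
        c.put("address", "1 High Street");
        c.put("DOB", "1970-01-01");

        System.out.println(amalgamate(null, b, c));
    }
}
```

The precedence rule is a design choice of this sketch; a real deployment would need an agreed policy for conflicting values.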
Distributed Query Processing
• Higher level services building on OGSA-DAI
• Queries mapped to algebraic expressions for evaluation
• Parallelism represented by partitioning queries
– Use exchange operators
[Query plan diagram, partitions labelled 1, 2 and 3,4: table_scan (protein) and table_scan (proteinTerm, termID=S92) feed hash_join (proteinId) and then op_call (Blast), with reduce and exchange operators between the stages.]
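The core of the plan above, a filtered scan of proteinTerm joined to protein on proteinId, can be sketched as a single-node hash join in Java. This is an illustration under stated assumptions, not the DQP implementation: rows are plain string arrays, proteinId is assumed unique in the protein table, and the exchange operators that would move data between partitions are left out. `HashJoinSketch` and `sequencesForTerm` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashJoinSketch {

    // Joins protein rows {proteinId, sequence} with proteinTerm rows
    // {proteinId, termId}, keeping only rows whose termId equals wantedTermId.
    // Returns the sequences of the matching proteins, in proteinTerm order.
    static List<String> sequencesForTerm(List<String[]> proteins,
                                         List<String[]> proteinTerms,
                                         String wantedTermId) {
        // Build phase: hash the protein table on the join key.
        Map<String, String> sequenceById = new HashMap<>();
        for (String[] protein : proteins) {
            sequenceById.put(protein[0], protein[1]);
        }

        // Probe phase: scan proteinTerm, filter on termId, look up the join key.
        List<String> result = new ArrayList<>();
        for (String[] term : proteinTerms) {
            if (!wantedTermId.equals(term[1])) {
                continue;
            }
            String sequence = sequenceById.get(term[0]);
            if (sequence != null) {
                result.add(sequence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented rows, for illustration only.
        List<String[]> proteins = Arrays.asList(
                new String[] {"P1", "MKV"},
                new String[] {"P2", "GAT"});
        List<String[]> proteinTerms = Arrays.asList(
                new String[] {"P2", "S92"},
                new String[] {"P1", "S10"});
        System.out.println(sequencesForTerm(proteins, proteinTerms, "S92"));
    }
}
```

In the plan on the slide, the joined sequences would then be handed to the Blast operation call; partitioning the probe side by proteinId is what would let several nodes run this join in parallel.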
OGSA-DAI Users Group
• User Group Chair
– Prof. Beth Plale, Indiana University
• A separate independent body to engage with users and feed back to developers in a formal way
• Held meetings in Edinburgh and Brussels in 2004
– Presentations from projects using OGSA-DAI
– Discussion of requirements and issues
– Discussion of roadmap
• Next meeting is 1st June 2005 in Edinburgh
• Contact Beth Plale (plale@cs.indiana.edu) for more details
FAQ, Support, Mailing List
• Frequently Asked Questions
– http://www.ogsadai.org.uk/support/faq.php
– Updated as common problems become clear
• Support for OGSA-DAI releases
– http://www.ogsadai.org.uk/support
– support@ogsadai.org.uk
– Use to report problems
• Discussion list
– users@ogsadai.org.uk
– http://www.ogsadai.org.uk/support/list.php
– General discussion of OGSA-DAI, data and the Grid
Projects Using OGSA-DAI
• Bridges (http://www.brc.dcs.gla.ac.uk/projects/bridges/)
• N2Grid (http://www.cs.univie.ac.at/institute/index.html?project-80=80)
• BioSimGrid (http://www.biosimgrid.org/)
• AstroGrid (http://www.astrogrid.org/)
• BioGrid (http://www.biogrid.jp/)
• GEON (http://www.geongrid.org/)
• OGSA-DAI (http://www.ogsadai.org.uk)
• eDiaMoND (http://www.ediamond.ox.ac.uk/)
• OGSA-WebDB (http://www.gtrc.aist.go.jp/dbgrid/)
• GeneGrid (http://www.qub.ac.uk/escience/projects.php#genegrid)
• FirstDig (http://www.epcc.ed.ac.uk/~firstdig/)
• myGrid (http://www.mygrid.org.uk/)
• INWA (http://www.epcc.ed.ac.uk/)
• ODD-Genes (http://www.epcc.ed.ac.uk/oddgenes/)
• IU RGRBench (http://www.cs.indiana.edu/~plale/projects/RGR/OGSA-DAI.html)
Project classification
[Diagram: projects arranged around OGSA-DAI in four categories: Physical Sciences, Biological Sciences, Computer Sciences and Commercial Applications.]
• Bridges
• BioGrid
• ODD-Genes
• AstroGrid
• BioSimGrid
• GEON
• eDiaMoND
• myGrid
• GeneGrid
• MCS
• N2Grid
• OGSA-WebDB
• GridMiner
• IU RGRBench
• FirstDig
• INWA
eDiaMoND
• e-Digital MammOgraphy National Database
– Mammogram: X-ray of the breast
• Built prototype of a national database of mammographic images
– In support of the UK breast screening programme
• Employed Grid technologies to facilitate process
Thanks to the eDiaMoND project and the Digital Database for Screening Mammography for this image.
• Breast screening in the UK began in 1988
– Women aged 50-64 screened every 3 Years
– Women aged 50-70 from 2004
– 1 View/Breast → 2 views by 2003
• UK has
– Over 90 Breast screening units throughout the UK
– Each one deals with about 45000 women on average p.a.
• Each centre sees 5000-20000 images/year
• In 2001-02 → 2002-03
– Screened: 1.4M → 1.5M
– Recalled for assessment: 77911 → 79441
– Cancers detected: 10003 → 10467
– Lives saved per year: 300 → 1250 (by 2010)
• Distributed team of doctors perform the analysis
[Architecture diagram: four sites (CHU, KCL, UED, UCL), each with Data Load and Training applications, a Core & Training API, Core Services and OGSA-DAI over IBM DB2 Content Manager and database files, together with Training Services and a DB2 Federation component.]
• eDiaMoND Findings:
– OGSA-DAI provides a flexible framework
– Dynamically configure the system through discovery
– Activities can operate with different levels of granularity
– Federation can be introduced at various levels
– Good documentation on how to extend the framework
– Extended Activities to access IBM DB2 Content Manager
– Changes between versions broke some things
– Low level XML issues
FirstDIG
• Data mining with the First Transport Group, UK
– Example: “When buses are more than 10 minutes late there is an 82% chance that revenue drops by at least 10%”
– “The results of this exercise will revolutionise the way we do things in the bus industry.” Darren Unwin, Divisional Manager, First South Yorkshire.
[Diagram: four OGSA-DAI services, an OGSA-DAI Client Application and a Data Mining Application.]
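A finding such as the bus lateness example above is a conditional frequency: among journeys that ran more than 10 minutes late, the share that also showed a revenue drop of at least 10%. The Java sketch below shows that calculation only. It is not the FirstDIG method, `RuleConfidenceSketch` is a hypothetical name, and the numbers in `main` are invented purely to exercise the code.

```java
class RuleConfidenceSketch {

    // Confidence of the rule "late by more than 10 minutes => revenue drops
    // by at least 10%": late journeys with the drop, divided by late journeys.
    // revenueChangePercent is negative for a drop (-12.0 means a 12% drop).
    // Returns NaN when no journey was late, since the rule is then untested.
    static double confidence(double[] minutesLate, double[] revenueChangePercent) {
        if (minutesLate.length != revenueChangePercent.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        int late = 0;
        int lateAndDrop = 0;
        for (int i = 0; i < minutesLate.length; i++) {
            if (minutesLate[i] > 10.0) {
                late++;
                if (revenueChangePercent[i] <= -10.0) {
                    lateAndDrop++;
                }
            }
        }
        return late == 0 ? Double.NaN : (double) lateAndDrop / late;
    }

    public static void main(String[] args) {
        // Invented values, for illustration only (not FirstDIG data).
        double[] minutesLate = {12.0, 15.0, 3.0, 20.0};
        double[] revenueChangePercent = {-12.0, -5.0, -20.0, -10.0};
        System.out.println(confidence(minutesLate, revenueChangePercent));
    }
}
```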
INWA
• Innovation Node: Western Australia
– Informing Business & Regional Policy: Grid-enabled fusion of global data and local knowledge
• Project
– Ran from Nov 2003 to Aug 2004
– Involved 10 partners (6 UK + 4 Australia)
• Aim
– Data mine commercially sensitive data
– Security an absolute MUST
– Employ Grid technologies
– Need access to data and computational resources
• Demonstrator using:
– OGSA-DAI
– Incorporate data resources
– Sun DCG's TOG (Transfer-queue Over Globus)
– Handle job submission to analyse micro array data
INWA
[Diagram: two sites, EPCC (UK) and Curtin (Australia), each running TOG and Grid Engine with Bank and Telco partners and a Data Browser. OGSA-DAI services expose bank data, telco data, UK property data and Australian property data to user@edinburgh and user@australia.]
ODD-Genes
• OGSA-DAI Demo for Genetics
• Collaboration between
– EPCC
– Scottish Centre for Genomic Technology and Informatics (GTI)
– Human Genetics Unit (HGU)
• ODD-Genes demonstrates:
– Perform high-speed batch analysis of microarray data on the Grid
– Browse the results of previous analyses stored in a database
– View data from arbitrary databases as HTML
– Discover related databases on the Grid
– Perform coupled queries on newly discovered databases to provide a richer analysis of gene data
ODD-Genes Actors
[Diagram: GTI (micro array data, relational), EPCC and HGU (mouse genome information, XML), connected through OGSA-DAI services, a DAISGR, the ODD-Genes webapp, Globus, GridEngine and TOG.]
1. Client
2. EPCC is a computational resource.
3. HGU is an example of a data repository.
ODD-Genes Findings
• Data discovery perceived to be very important
– Map data views: time -> spatial locations
– Discovery of new resources
• Transparency to data access
– @HGU had an XML database
– @GTI had a relational database
– Deploy OGSA-DAI and not worry about databases
• Issues
– Registry maintenance policy
– Semantics of the discovery process
– Groups working in the same area but with different schemas; no generic metadata (schemas were the effective metadata)
• Provides an additional tool for researchers
GridMiner
• Test application area: medical
– traumatic brain injury treatment
– Predicting the outcome of seriously ill patients
– analytical part focuses on data mining and On-Line Analytical
Processing (OLAP)
• Target:
– provide tools to discover and access relevant knowledge and
information from different distributed and heterogeneous data
sources
– building on and extending OGSA-DAI
• http://www.gridminer.org/
GridMiner Scenario
• Heterogeneities:
– Name in A is “First Last” (as the target format)
– Name in C has to be combined
• Distribution:
– 3 data sources
Summary
• New technology
– Standardisation process still ongoing
– Infrastructure still developing
• OGSA-DAI acting as an enabler
– It builds on what you already have
– It does not define a radically new model (not rewriting SQL)
– It may make you think about your business process
• Some problems are not OGSA-DAI specific
– Metadata, time zones, security, …
• Data discovery opens up a window of integration opportunity
• Try it out
– It’s free and supported
– Make suggestions, extend functionality, contribute to DAIS-WG
OGSA-DAI Project Webpage
• http://www.ogsadai.org.uk
Background
News & Events
Software Releases
Documentation
On-line Tutorials
Support
Training Courses
Links