OGSA-DAI Requirements Gathering Exercise
2nd DIALOGUE workshop
eSI, 9-10 February 2006
Neil Chue Hong
Project Manager, EPCC
N.ChueHong@epcc.ed.ac.uk
+44 131 650 5957
OGSA-DAI Requirements Gathering
• Aims
– learn more about the data access and integration challenges that
other projects are facing
– use this information to inform the future development of the OGSA-DAI software
• Timescale
– Nov 2005 – Jan 2006
• Gatherers
– Ally Hume
– Amy Krause
– Tom Sugden
2nd DIALOGUE workshop - 9-10 February 2006
Projects
• AstroGrid
– (www.astrogrid.org) - distributed queries over large astronomy databases.
• AutoMed and ISpider
– (www.doc.ic.ac.uk/automed/) and (www.ispider.man.ac.uk) – model-based
data integration and Grid-based informatics platform for proteomics.
• CancerGrid
– (www.cancergrid.org) – storage and analysis of distributed data containing
clinical trial and lab data.
• ESSC
– (www.nerc-essc.ac.uk) – environmental and atmospheric simulations.
• GOLD
– (www.goldproject.ac.uk) – provides infrastructure for virtual organisations.
• NTRAC
– (www.ntrac.org.uk) – similar to CancerGrid.
Structure of Meeting Reports
• Data
– the kind of data that the project is concerned with, including the structure, quantity and types of data resource.
• Queries
– the types of queries that are performed against this data, including the query languages used and the typical size of result sets.
• The problem
– the main problems that the project is currently facing with regard to data access and integration.
• What Can OGSA-DAI Provide?
– the functionality that the project would like OGSA-DAI to provide.
• Checklist
– summarises the importance of various aspects of data access and integration for the project.
AstroGrid
• a number of distributed databases, each of which contains astronomical data captured from different modalities
• Almost all the tables in these databases contain a spatial coordinate of each feature and some numerical attributes associated with that feature.
• want to do distributed queries using their algorithmic domain-specific joins.
• Checklist
– Public/Internal: Public services.
– Data Movement: Efficient data movement is crucially important.
– Data Replication: Replication was not considered important.
– Data Discovery: Data discovery is not an issue as they already have data discovery mechanisms.
– Transactions: Transactions were not considered to be very important. Most of the queries are read-only, apart from the production of temporary tables. If there is an error they are happy to start the query again.
– Security: They considered security to be a real concern, but this seemed to be more of a resource usage issue than a data encryption issue. Here they were talking about issues such as restricting the amount of data a user's query can write to a database.
– Reliability: There was an emphasis on production-level functionality such as the ability to kill queries and a reduced requirement for server restart.
– Scalability: The AstroGrid systems must scale well to support many large distributed databases.
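As an illustration of what a domain-specific spatial join could look like, here is a minimal sketch that pairs rows from two catalogues whose sky positions lie within a given angular radius. It is not AstroGrid or OGSA-DAI code: the function names, the `ra`/`dec`/`id` field names and the sample rows are all assumptions made for the example, and the nested loop is only suitable for small inputs.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which stays numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def spatial_join(left, right, radius_deg):
    """Pair each left row with every right row within radius_deg of it."""
    return [
        (l["id"], r["id"])
        for l in left
        for r in right
        if angular_separation_deg(l["ra"], l["dec"], r["ra"], r["dec"]) <= radius_deg
    ]

# Hypothetical rows standing in for tables held in two different databases.
optical = [{"id": "opt-1", "ra": 10.000, "dec": 20.000},
           {"id": "opt-2", "ra": 50.000, "dec": -5.000}]
radio = [{"id": "rad-1", "ra": 10.001, "dec": 20.001},
         {"id": "rad-2", "ra": 120.000, "dec": 45.000}]

matches = spatial_join(optical, radio, radius_deg=0.01)
```

In a distributed setting the join predicate would stay the same, but the rows would have to be moved between sites first, which is why efficient data movement matters so much for this use case.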
AutoMed and ISpider
• middleware to transform schemas from different data sources (relational databases, XML documents, etc.) and evaluate distributed queries expressed in their own IQL language.
• By creating a path of schema transformations, it is possible to federate multiple data sources so that they appear as a single data source to the user
• how to optimise distributed queries using metadata such as data size, occurrence of indexes, performance rates, etc.
• how to fit AutoMed into a grid architecture
• Checklist
– Public/Internal: Public services.
– Data Movement: Efficient data movement is required. They are interested in XML compression algorithms.
– Data Replication: Important but not a part of ISpider or AutoMed specifically, though they may want to replicate data for query optimisation purposes.
– Data Discovery: The AutoMed APIs already provide a form of registry, but in the future a web service interface is envisaged.
– Security: Academics are fairly relaxed but in the broader scheme this is very important, particularly for use on medical data.
– Scalability: Important
CancerGrid
• By analysing laboratory data and correlating it with hospital and trials data, it is hoped that new subsets of patients can be discovered who respond best to particular treatments
• Security is a major concern because many of the owners of data are aware of the value of their data and consequently are concerned about who has access to it.
• A good means of transforming trial forms (XML documents) into a format suitable for automatic insertion into relational tables is required.
• Checklist
– Public/Internal: Public but only accessible by certain users.
– Data Movement: Important for distributed data integration use case.
– Data Replication: No real requirement at the moment.
– Data Discovery: They envisage a peer-to-peer system for data discovery, but also plan to use the national registries that are being developed elsewhere.
– Transactions: Distributed transactions are not a concern at the moment. Updates take place daily at most and queries will not take place during updates.
– Security: It is vital to be able to expose subsets of data to particular users. This is not just because of patient confidentiality (data is anonymised before it reaches CancerGrid), but also because of commercial interests.
– Reliability: Data integrity is important, but there is no need for 24hr services, so downtime is not a problem.
– Scalability: Must scale to many databases distributed around the world.
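The security requirement above, exposing only a subset of the data to a particular user, can be sketched as a per-role column filter. This is an illustrative sketch only, not a CancerGrid or OGSA-DAI mechanism: the role names, column names and the `VIEW_POLICY` table are hypothetical.

```python
# Hypothetical policy: the set of columns each role is allowed to see.
VIEW_POLICY = {
    "clinician": {"patient_id", "trial", "treatment", "response"},
    "commercial_partner": {"trial", "treatment"},
}

def view_for(role, rows):
    """Return the rows restricted to the columns the given role may see."""
    allowed = VIEW_POLICY.get(role, set())  # unknown roles see nothing
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]

# Hypothetical, already anonymised trial record.
rows = [{"patient_id": "p1", "trial": "T-01", "treatment": "A", "response": "good"}]

partner_view = view_for("commercial_partner", rows)
```

Keeping the policy in a table that can be edited at run time, rather than in DBMS-specific view definitions, is one way to make the views both dynamic and independent of the underlying database.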
ESSC
• dealing with large data sets of between 2 and 3 terabytes, stored mostly on a single machine. The user requests portions of data, often assembled from various files.
• Uniform web service interfaces are provided for accessing data sets using the standard APIs associated with the binary data file formats that are used (netCDF, GRIB, HDF, etc.).
• The queries used by ESSC are currently synchronous, which causes request timeout problems when the resulting datasets are large. Sceptical of current WS-Notification implementations that require open ports on client machines.
• Checklist
– Public/Internal: Both public and commercial.
– Data Movement: Efficient movement of large binary files (GBs) is required.
– Data Replication: Not important because this is handled by the NERC data grid and datasets are generally copies of Met Office data.
– Data Discovery: The NERC data grid solves this problem using 4 levels of metadata, expressed in XML.
– Transactions: Transactional updates are quite important internally, but not important for end users.
– Security: Essential because of commercial nature of data.
– Reliability: Their services are fairly static and restarts are infrequent. The metadata store can already be updated without restarting services.
– Scalability: Linear scalability is desired for data extraction. The current file APIs scale well.
GOLD
• develop an infrastructure to facilitate collaboration within
virtual organisations
• Data storage services will be used for capturing interactions
amongst parties of a VO in order to facilitate auditing and
VO-playback.
• Data analysis services will be used for performing particular
types of analysis of data existing mostly in relational
database back-ends.
• primary concern is managing security policies and service
access rights of different types of user dynamically.
NTRAC
• build platforms to bring different systems together
• Many of the data resources that they are accessing are stored in private networks (e.g. NHS patient information) with no open gateway to the public.
• Researchers want to mine the data to find people to recruit into studies.
• Checklist
– Public/Internal: Internal.
– Data Movement: Not a major concern - even if cross-site, people are not normally interested in the raw data.
– Data Replication: Mainly for backup purposes not load balancing.
– Data Discovery: The Scottish Executive and NCRN will be running a registry of trials.
– Transactions: There are usually sporadic queries made against a dataset that does not change very quickly.
– Security: Important because of private and commercial nature of some data.
– Reliability: NTRAC just recruits patients into the process so they are not concerned with reliability at the moment, but when getting involved in patient care, this is most important.
– Scalability: -
Prioritised Requirements
ID  Priority  Requirement
R1  High      Efficient transportation of large quantities of data between heterogeneous data resources.
R2  High      Data federation and distributed query processing across heterogeneous data resources.
R3  High      An asynchronous model for processing large, long-running queries where the client can poll or be notified of the query status and the query can be terminated at an intermediate stage.
R4  High      The ability to provide different views of data resources to different users in a secure, DBMS-independent manner and to manage these views dynamically.
R5  High      Security/certificate delegation to allow access to other networks and role-based data access rules.
R6  Medium    Provision of more extensive database metadata capabilities, in particular with the inclusion of statistics relevant to query optimisation such as table size, occurrence of indexes and performance rates.
R7  Medium    Support for a unified query language (RDBMS-neutral), possibly through integration with Hibernate.
R8  Medium    Extensible join criteria for data integration, including support for spatial joins.
R9  Medium    The ability to limit the size of updates to data resources. For example, the size of temporary tables created during a SkyQuery-style [Ref] distributed query.
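Requirement R3 describes a submit, poll and terminate pattern. The sketch below shows one possible shape for it using only the Python standard library. It is an assumption-laden illustration, not OGSA-DAI's interface: the class `AsyncQueryService`, its method names, the status strings and the sleep that stands in for real query work are all hypothetical.

```python
import threading
import time
import uuid

class AsyncQueryService:
    """Toy service: submit returns at once; the client polls or cancels later."""

    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, rows, chunk_size=2):
        """Start processing in the background and return a job identifier."""
        job_id = str(uuid.uuid4())
        job = {"status": "RUNNING", "results": [], "cancel": threading.Event()}
        job["thread"] = threading.Thread(
            target=self._run, args=(job, list(rows), chunk_size), daemon=True)
        with self._lock:
            self._jobs[job_id] = job
        job["thread"].start()
        return job_id

    def _run(self, job, rows, chunk_size):
        for start in range(0, len(rows), chunk_size):
            # Check for a termination request before each chunk of work.
            if job["cancel"].is_set():
                job["status"] = "CANCELLED"
                return
            job["results"].extend(rows[start:start + chunk_size])
            time.sleep(0.01)  # stand-in for real query work
        job["status"] = "COMPLETED"

    def status(self, job_id):
        return self._jobs[job_id]["status"]

    def cancel(self, job_id):
        self._jobs[job_id]["cancel"].set()

    def wait(self, job_id, timeout=None):
        self._jobs[job_id]["thread"].join(timeout)

    def results(self, job_id):
        return list(self._jobs[job_id]["results"])

service = AsyncQueryService()

# Poll a short job until it finishes.
job_id = service.submit(range(10))
while service.status(job_id) == "RUNNING":
    time.sleep(0.005)
rows = service.results(job_id)

# Terminate a long job at an intermediate stage.
long_job = service.submit(range(1000), chunk_size=1)
service.cancel(long_job)
service.wait(long_job, timeout=2)
```

Polling needs no inbound connection to the client, which sidesteps the open-port concern raised about notification-based approaches; a notification callback could be added alongside it where the network allows.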
Notes on requirements
• Prioritised based on a judgement of their importance to the
various projects that were investigated.
– Whether or not they are within the scope of the OGSA-DAI project, or have already been satisfied by OGSA-DAI, is not considered here.
• Frequent mention of the non-functional requirement: ease of use.
– Some concern that installation and configuration remains too complex
when compared with typical WAR-based web service deployment.
• Hope to publish the full document in near future
– let me know if you want a copy
Conclusions
• Efficient transportation of large quantities of data between heterogeneous
data resources is a crucial requirement for several projects from distinct
domains.
– This is also an implicit requirement for projects requiring data federation and
distributed query processing.
– If we could solve this problem, it would be of great benefit to these projects,
and also to higher-level middleware projects such as OGSA-DQP
• Security remains a major concern because of the commercial and
sensitive nature of much data
– want a generalised, role-based mechanism for exposing different views of
data resources to different users, and managing these views dynamically.
– is this outside the scope of data integration middleware?
• While we were previously aware of most of the requirements described in
this document, associating them with actual projects can help with
prioritisation.
Further information
• The OGSA-DAI Project Site:
– http://www.ogsadai.org.uk
• The DAIS-WG site:
– http://forge.gridforum.org/projects/dais-wg/
• OGSA-DAI Users Mailing list
– users@ogsadai.org.uk
– General discussion on grid DAI matters
• Formal support for OGSA-DAI releases
– http://bugs.ogsadai.org.uk/
• OGSA-DAI training courses