OGSA-DAI Requirements Gathering Exercise
2nd DIALOGUE workshop
eSI, 9-10 February 2006

Neil Chue Hong
Project Manager, EPCC
N.ChueHong@epcc.ed.ac.uk
+44 131 650 5957

OGSA-DAI Requirements Gathering
• Aims
  – learn more about the data access and integration challenges that other projects are facing
  – use this information to inform the future development of the OGSA-DAI software
• Timescale
  – Nov 2005 – Jan 2006
• Gatherers
  – Ally Hume
  – Amy Krause
  – Tom Sugden

Projects
• AstroGrid – (www.astrogrid.org) – distributed queries over large astronomy databases.
• AutoMed and ISpider – (www.doc.ic.ac.uk/automed/) and (www.ispider.man.ac.uk) – model-based data integration and Grid-based informatics platform for proteomics.
• CancerGrid – (www.cancergrid.org) – storage and analysis of distributed data containing clinical trial and lab data.
• ESSC – (www.nerc-essc.ac.uk) – environmental and atmospheric simulations.
• GOLD – (www.goldproject.ac.uk) – provides infrastructure for virtual organisations.
• NTRAC – (www.ntrac.org.uk) – similar to CancerGrid.

Structure of Meeting Reports
• Data
  – the kind of data that the project is concerned with, including the structure, quantity and types of data resource.
• Queries
  – the types of queries that are performed against this data, including the query languages used and the typical size of result sets.
• The problem
  – the main problems that the project is currently facing with regard to data access and integration.
• What Can OGSA-DAI Provide?
  – the functionality that the project would like OGSA-DAI to provide.
• Checklist
  – summarises the importance of various aspects of data access and integration for the project.

AstroGrid
• a number of distributed databases, each of which contains astronomical data captured from different modalities
• Almost all the tables in these databases contain a spatial coordinate of each feature and some numerical attributes associated with that feature.
• want to do distributed queries using their algorithmic domain-specific joins.
Checklist
  Public/Internal: Public services.
  Data Movement: Efficient data movement is crucially important.
  Data Replication: Replication was not considered important.
  Data Discovery: Data discovery is not an issue as they already have data discovery mechanisms.
  Transactions: Transactions were not considered to be very important. Most of the queries are read-only, apart from the production of temporary tables. If there is an error they are happy to start the query again.
  Security: They considered security to be a real concern, but this seemed to be more of a resource usage issue than a data encryption issue. Here they were talking about issues such as restricting the amount of data a user's query can write to a database.
  Reliability: There was an emphasis on production-level functionality such as the ability to kill queries and a reduced requirement for server restarts.
  Scalability: The AstroGrid systems must scale well to support many large distributed databases.

AutoMed and ISpider
• middleware to transform schemas from different data sources (relational databases, XML documents, etc.) and evaluate distributed queries expressed in their own IQL language.
• By creating a path of schema transformations, it is possible to federate multiple data sources so that they appear as a single data source to the user
• how to optimise distributed queries using metadata such as data size, occurrence of indexes, performance rates, etc.
• how to fit AutoMed into a grid architecture
Checklist
  Public/Internal: Public services.
  Data Movement: Efficient data movement is required. They are interested in XML compression algorithms.
  Data Replication: Important but not a part of ISpider or AutoMed specifically, though they may want to replicate data for query optimisation purposes.
  Data Discovery: The AutoMed APIs already provide a form of registry, but in the future a web service interface is envisaged.
  Security: Academics are fairly relaxed but in the broader scheme this is very important, particularly for use on medical data.
  Scalability: Important.

CancerGrid
• By analysing laboratory data and correlating it with hospital and trials data, it is hoped that new subsets of patients can be discovered who respond best to particular treatments
• Security is a major concern because many of the owners of data are aware of the value of their data and consequently are concerned about who has access to it.
• A good means of transforming trial forms (XML documents) into a format suitable for automatic insertion into relational tables is required.
Checklist
  Public/Internal: Public but only accessible by certain users.
  Data Movement: Important for the distributed data integration use case.
  Data Replication: No real requirement at the moment.
  Data Discovery: They envisage a peer-to-peer system for data discovery, but also plan to use the national registries that are being developed elsewhere.
  Transactions: Distributed transactions are not a concern at the moment. Updates take place daily at most and queries will not take place during updates.
  Security: It is vital to be able to expose subsets of data to particular users. This is not just because of patient confidentiality (data is anonymised before it reaches CancerGrid), but also because of commercial interests.
  Reliability: Data integrity is important, but there is no need for 24hr services, so downtime is not a problem.
  Scalability: Must scale to many databases distributed around the world.
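
The CancerGrid need to turn trial forms (XML documents) into rows for relational tables could be approached along the following lines. This is only a minimal sketch: the form layout, the element and attribute names (trialForm, patient, field) and the trial_data table are all hypothetical, since the actual CancerGrid form schema is not described here, and SQLite stands in for whichever relational database a project uses.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical trial form: element and attribute names are illustrative only.
FORM_XML = """
<trialForm trialId="T-001">
  <patient id="P-17">
    <field name="age">54</field>
    <field name="treatment">A</field>
  </patient>
  <patient id="P-18">
    <field name="age">61</field>
    <field name="treatment">B</field>
  </patient>
</trialForm>
"""


def form_to_rows(xml_text):
    """Flatten one form into (trial_id, patient_id, field_name, value) tuples."""
    root = ET.fromstring(xml_text.strip())
    trial_id = root.get("trialId")
    rows = []
    for patient in root.findall("patient"):
        for field in patient.findall("field"):
            rows.append((trial_id, patient.get("id"), field.get("name"), field.text))
    return rows


def load_rows(conn, rows):
    """Insert the flattened rows into a generic (hypothetical) trial_data table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trial_data "
        "(trial_id TEXT, patient_id TEXT, field_name TEXT, value TEXT)"
    )
    # Parameterised inserts keep form values out of the SQL text.
    conn.executemany("INSERT INTO trial_data VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
rows = form_to_rows(FORM_XML)
load_rows(conn, rows)
row_count = conn.execute("SELECT COUNT(*) FROM trial_data").fetchone()[0]
```

The one-row-per-field layout is chosen here only because it needs no knowledge of the form's fields in advance; a real mapping would more likely target typed columns agreed with the data owners.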

ESSC
• dealing with large data sets of between 2 and 3 terabytes, stored mostly on a single machine. The user requests portions of data, often assembled from various files.
• Uniform web service interfaces are provided for accessing data sets using the standard APIs associated with the binary data file formats that are used (netCDF, GRIB, HDF, etc.).
• The queries used by ESSC are currently synchronous, which causes request timeout problems when the resulting datasets are large. Sceptical of current WS-Notification implementations that require open ports on client machines.
Checklist
  Public/Internal: Both public and commercial.
  Data Movement: Efficient movement of large binary files (GBs) is required.
  Data Replication: Not important because this is handled by the NERC data grid and datasets are generally copies of Met Office data.
  Data Discovery: The NERC data grid solves this problem using 4 levels of metadata, expressed in XML.
  Transactions: Transactional updates are quite important internally, but not important for end users.
  Security: Essential because of commercial nature of data.
  Reliability: Their services are fairly static and restarts are infrequent. The metadata store can already be updated without restarting services.
  Scalability: Linear scalability is desired for data extraction. The current file APIs scale well.

GOLD
• develop an infrastructure to facilitate collaboration within virtual organisations
• Data storage services will be used for capturing interactions amongst parties of a VO in order to facilitate auditing and VO-playback.
• Data analysis services will be used for performing particular types of analysis of data existing mostly in relational database back-ends.
• primary concern is managing security policies and service access rights of different types of user dynamically.
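
The ESSC timeout problem (synchronous queries returning large datasets, combined with reluctance to open client ports for notifications) suggests a submit-and-poll pattern, in which the client only ever makes outbound calls. The sketch below shows the idea in a single process. AsyncQueryService, QueryJob and slow_query are hypothetical names invented for illustration and are not part of OGSA-DAI or any ESSC service; a thread stands in for the remote query engine.

```python
import threading
import time
import uuid


class QueryJob:
    """One long-running query executed in a background thread."""

    def __init__(self, chunks):
        self.id = uuid.uuid4().hex
        self.status = "RUNNING"
        self.results = []
        self._cancel = threading.Event()
        self._thread = threading.Thread(target=self._run, args=(chunks,), daemon=True)

    def _run(self, chunks):
        for chunk in chunks:
            # Check for termination between chunks, i.e. at an intermediate stage.
            if self._cancel.is_set():
                self.status = "CANCELLED"
                return
            self.results.append(chunk)
        self.status = "DONE"


class AsyncQueryService:
    """Hypothetical service facade: submit, poll, cancel, fetch."""

    def __init__(self):
        self._jobs = {}

    def submit(self, chunks):
        job = QueryJob(chunks)
        self._jobs[job.id] = job
        job._thread.start()
        return job.id  # the client keeps only this handle

    def poll(self, job_id):
        return self._jobs[job_id].status

    def cancel(self, job_id):
        self._jobs[job_id]._cancel.set()

    def wait(self, job_id, timeout=None):
        self._jobs[job_id]._thread.join(timeout)

    def fetch(self, job_id):
        return list(self._jobs[job_id].results)


def slow_query():
    """Stand-in for a query that would otherwise never finish."""
    n = 0
    while True:
        time.sleep(0.01)
        yield n
        n += 1


svc = AsyncQueryService()
job_id = svc.submit(iter(["chunk-1", "chunk-2", "chunk-3"]))
while svc.poll(job_id) == "RUNNING":  # client-side polling, no inbound port needed
    time.sleep(0.01)
done_status = svc.poll(job_id)
done_results = svc.fetch(job_id)
```

Polling trades some latency and extra requests for the ability to work through firewalls; a notification mechanism could be layered on top for clients that can accept callbacks.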

NTRAC
• build platforms to bring different systems together
• Many of the data resources that they are accessing are stored in private networks (e.g. NHS patient information) with no open gateway to the public.
• Researchers want to mine the data to find people to recruit into studies.
Checklist
  Public/Internal: Internal.
  Data Movement: Not a major concern - even if cross-site, people are not normally interested in the raw data.
  Data Replication: Mainly for backup purposes, not load balancing.
  Data Discovery: The Scottish Executive and NCRN will be running a registry of trials.
  Transactions: There are usually sporadic queries made against a dataset that does not change very quickly.
  Security: Important because of private and commercial nature of some data.
  Reliability: NTRAC just recruits patients into the process so they are not concerned with reliability at the moment, but when getting involved in patient care, this is most important.
  Scalability: -

Prioritised Requirements
• R1 (High): Efficient transportation of large quantities of data between heterogeneous data resources.
• R2 (High): Data federation and distributed query processing across heterogeneous data resources.
• R3 (High): An asynchronous model for processing large, long-running queries where the client can poll or be notified of the query status and the query can be terminated at an intermediate stage.
• R4 (High): The ability to provide different views of data resources to different users in a secure, DBMS-independent manner and to manage these views dynamically.
• R5 (High): Security/certificate delegation to allow access to other networks and role-based data access rules.
• R6 (Medium): Provision of more extensive database metadata capabilities, in particular with the inclusion of statistics relevant to query optimisation such as table size, occurrence of indexes and performance rates.
• R7 (Medium): Support for a unified query language (RDBMS-neutral), possibly through integration with Hibernate.
• R8 (Medium): Extensible join criteria for data integration, including support for spatial joins.
• R9 (Medium): The ability to limit the size of updates to data resources. For example, the size of temporary tables created during a SkyQuery-style [Ref] distributed query.

Notes on requirements
• Prioritised based on a judgement of their importance to the various projects that were investigated.
  – Whether or not they are within the scope of the OGSA-DAI project, or have already been satisfied by OGSA-DAI, is not considered here.
• Frequent mention of the non-functional requirement: ease of use.
  – Some concern that installation and configuration remain too complex when compared with typical WAR-based web service deployment.
• Hope to publish the full document in the near future
  – let me know if you want a copy

Conclusions
• Efficient transportation of large quantities of data between heterogeneous data resources is a crucial requirement for several projects from distinct domains.
  – This is also an implicit requirement for projects requiring data federation and distributed query processing.
  – If we could solve this problem, it would be of great benefit to these projects, and also to higher-level middleware projects such as OGSA-DQP
• Security remains a major concern because of the commercial and sensitive nature of much data
  – want a generalised, role-based mechanism for exposing different views of data resources to different users, and managing these views dynamically.
  – is this outside the scope of data integration middleware?
• While we were previously aware of most of the requirements described in this document, associating them with actual projects can help with prioritisation.
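
The role-based views asked for in the Conclusions could, in principle, be enforced in the middleware rather than in each DBMS. The sketch below is one possible reading of that idea, not an OGSA-DAI feature: ViewRegistry, the samples table, the role names and the sample rows are all hypothetical, and SQLite is used only because it has no role system of its own, which makes the point that the restriction lives outside the database.

```python
import sqlite3


class ViewRegistry:
    """Maps a role name to the columns and row filter that role may see."""

    def __init__(self, conn, table, all_columns):
        self.conn = conn
        self.table = table
        self.all_columns = set(all_columns)
        self.views = {}

    def set_view(self, role, columns, where="1=1"):
        # 'where' is assumed to come from a trusted administrator, not from end users.
        unknown = set(columns) - self.all_columns
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        self.views[role] = (list(columns), where)

    def drop_view(self, role):
        self.views.pop(role, None)

    def query(self, role):
        if role not in self.views:
            raise PermissionError(f"no view defined for role {role!r}")
        columns, where = self.views[role]
        sql = f"SELECT {', '.join(columns)} FROM {self.table} WHERE {where} ORDER BY rowid"
        return self.conn.execute(sql).fetchall()


# Hypothetical data: a lab results table with a commercially sensitive sponsor column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (patient_ref TEXT, marker TEXT, result TEXT, sponsor TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?, ?)",
    [
        ("P-1", "HER2", "pos", "AcmePharma"),
        ("P-2", "HER2", "neg", "OtherCo"),
        ("P-3", "BRCA1", "pos", "AcmePharma"),
    ],
)

registry = ViewRegistry(conn, "samples", ["patient_ref", "marker", "result", "sponsor"])
registry.set_view("clinician", ["patient_ref", "marker", "result"])
registry.set_view("acme_analyst", ["marker", "result"], where="sponsor = 'AcmePharma'")

clinician_rows = registry.query("clinician")
analyst_rows = registry.query("acme_analyst")
```

Because the views are held in a registry rather than defined in the database, they can be added, changed or dropped at run time with set_view and drop_view, which is the "managed dynamically" part of the requirement. Whether this belongs in data integration middleware at all is the open question raised above.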

Further information
• The OGSA-DAI Project Site:
  – http://www.ogsadai.org.uk
• The DAIS-WG site:
  – http://forge.gridforum.org/projects/dais-wg/
• OGSA-DAI Users Mailing list
  – users@ogsadai.org.uk
  – General discussion on grid DAI matters
• Formal support for OGSA-DAI releases
  – http://bugs.ogsadai.org.uk/
• OGSA-DAI training courses