TG5: Data Management White Paper on State of the Art and Planned Developments in the context of FP6 Grid Projects

Lead partner: Fraunhofer AIS
Chair: Dr. Michael May, michael.may@ais.fraunhofer.de
Contributing projects: CoreGrid, DataMiningGrid, EGEE, Inteligrid, NextGrid, OGSA-DAI, OntoGrid, SIMDAT

Table of Contents

Introduction
Objectives of the working group “Data Management”
  Objective 1: Transfer from FP5 to FP6
  Objective 2: Collaboration between FP6 projects
  Objective 3: Developing a unified view on Grid data management
Relevant European Projects
Contribution of FP6 projects to Data Management
  CoreGrid
  DataMiningGrid
  EGEE
  Inteligrid
  NextGrid
  OGSA-DAI
  OntoGrid
  SIMDAT
Complementarities & synergies among projects
  Transfer FP5 to FP6
  Collaboration of FP6 projects/ Unified view on grid data management
Positioning with respect to state of the art and related work outside FP6
  Middleware oriented projects
  Application oriented projects
Conclusions and Future plans
References

Introduction

Data management has been an important topic in several FP5 projects. Data access has been addressed e.g. on the level of basic data transfer (e.g. GridFTP), and data cataloguing and replication (provided e.g. by the European DataGrid project). The OGSA-DAI project has developed basic support for accessing remote databases. For a major uptake of Grid technology in industry, an even more comprehensive approach to data management will be needed. Applications making use of grid infrastructure such as ontologies or data mining have a need for sophisticated data access mechanisms. Accordingly, FP6 projects such as SIMDAT have data grids as their central topic.

This white paper:
• Describes the objectives of the working group data management,
• Gives an overview on data management approaches of FP6 projects,
• Identifies overlap, complementarity and points for synergies,
• Positions FP6 project approaches with respect to the state of the art and
• Identifies future points for action.

The document is intended to be a living document that provides an up-to-date view on data management issues in FP6.

Objectives of the working group “Data Management”

Objective 1: Transfer from FP5 to FP6

In general, the FP5 projects produced numerous results that could be taken up by FP6 projects. There are two basic modes of technology transfer:
• From projects that produced basic data management facilities to projects that will produce applications;
• From projects that produced grid applications to new projects that produce new applications.

In the first case, there is a need for transferring software and implementations. Transfer from grid middleware oriented projects to application oriented projects is especially relevant for data-intensive projects such as SIMDAT and DataMiningGrid. In the second case, there might be software to transfer, or the transfer might consist of know-how and lessons learned. The second case applies e.g. to the relation between the Grace project and the DataMiningGrid.

A second important consideration is whether an FP5 project has, in one way or another, a continuation in FP6. For some projects, there is a natural take-up of results in their successor projects. This assures continuity in the development of basic grid services. It also assures that technology developed in FP5 will be maintained for the next years and so provides a clear perspective for application projects to take up this technology. If there is no continuation, transfer of know-how is feasible but transfer of software is more problematic, since the issue of maintenance arises.

Objective 2: Collaboration between FP6 projects

The second major objective of the working group is to facilitate the exchange of know-how and software between FP6 projects. In this case, there is no risk that one of the projects ceases to exist in the near future. The challenge here is that the projects move on a parallel time scale; e.g.
project A needs some data management software from project B to implement a Grid service on top of it; however, project B has only just started to design that component, so it is not yet available. At the time of writing, most FP6 projects will have produced a system design document. Each of those documents will address issues relevant for data management. These documents provide the raw material for the current document. The material will be collected and synthesized to provide a coherent picture of data management issues in FP6.

Objective 3: Developing a unified view on Grid data management

The third major objective is, wherever appropriate, to define a unified view on Grid data management issues. While it is unrealistic to expect that all projects will converge on the same set of technologies (and possibly also dangerous, since it is important to explore alternative options), design choices should be made on a common understanding of the available options. Also, projects should inform each other about lessons learned, so that one project does not fall into a trap that another project has already identified, and so that no project overlooks a promising technology simply because it does not know about it. A further goal will be to position and synchronize the working group activities with world-wide activities on Grid data management, e.g. GGF.

Relevant European Projects

In the first concertation meeting the following projects were identified as having data management issues as a major topic:

FP5
• DataGrid;
• Grace;
• OpenMolGrid.

FP6
• DataMiningGrid;
• EGEE;
• InteliGrid;
• SIMDAT.

The list of projects has become longer as a result of the 2nd concertation meeting, where one objective was to involve additional projects. These projects are:

FP6
• CoreGrid;
• NextGrid;
• OntoGrid.

Other
• OGSA-DAI (UK eScience programme).

Contribution of FP6 projects to Data Management

CoreGrid
Domenico Talia, University of Calabria
Website: http://www.coregrid.net

Project goals

The CoreGrid Network of Excellence brings together a European critical mass of well-known experts in GRID and P2P research, so that Europe can compete with research and development in North America and Asia.

Data Management issues relevant for the project

An objective of the CoreGrid project is to build a European-wide research laboratory to avoid fragmentation of Grid research activities in Europe and to achieve long-term integration and sustainability. The Knowledge & Data Management Virtual Institute (KDM Virtual Institute) aims to further the integration of data management and knowledge discovery with Grid technologies for providing knowledge-based Grid services. Moreover, the objective is to provide a collaborative setting for European research teams working on distributed storage management on Grids, on semantic Grid techniques and tools for supporting data intensive applications, and on knowledge discovery and management in Grids. Another objective is to strengthen the joint activity of European research groups in those areas, promoting larger leading teams (working as a Virtual Research Institute) that work on models and tools for data and knowledge management in Grids and P2P systems.

References
Website: http://www.coregrid.net

DataMiningGrid
Martin Swain, University of Ulster
Thomas Niessen, Fraunhofer AIS
Website: http://www.datamininggrid.org/

Project facts

Data mining has been recognized as one of the most important information technologies for automating the process of analysing and interpreting the data in modern knowledge industries and high-tech sectors such as science and engineering. Currently there exists no coherent framework for developing and deploying data-mining applications on the Grid.
The DataMiningGrid project will address this gap by developing generic and sector-independent data mining tools and services for the Grid. A test bed consisting of several applications from a diverse set of sectors will serve as a platform for demonstrating and promoting the technology developed by the DataMiningGrid.

DataMiningGrid is part of the Sixth Framework Programme of the Information Society Technologies Programme (IST). The consortium consists of the University of Ulster (Northern Ireland), the Fraunhofer Institute for Autonomous Intelligent Systems (Germany), the Data Mining Solutions group from DaimlerChrysler (Germany), the Israel Institute of Technology (Israel), and the University of Ljubljana, Faculty of Civil and Geodetic Engineering (Slovenia). The grant period for this project is September 1, 2004 to August 31, 2006.

Project goals

The project’s main objectives are structured into four main phases or milestones (currently phase 1 has been completed):
• Specification and validation of data-mining-aware grid tools and interfaces to be developed by the project (due date: February 2005),
• Early implementation of a mock-up prototype featuring some of the more critical aspects of the project (architecture, middleware, data-mining-aware grid data access interfaces) (due date: August 2005),
• Delivery of middleware-integrated components (due date: April 2006) and
• A fully evaluated set of DataMiningGrid components, tools, interfaces and application demonstrators from different application scenarios (due date: August 2006).

In order to address the wide range of requirements that arise when mining data in distributed computing environments, the project will develop a test bed consisting, among other things, of various demonstrator applications. Demonstrators from biology and medicine will address data-mining problems requiring a data-mining-aware access of distributed and very large databases (e.g.
molecular dynamics simulation data) and the construction of compute-intensive predictive models. Demonstrators in the automotive industry and other text-mining scenarios need to mine large and inherently distributed text repositories (e.g. car repair protocols, customer relationship management data). Another demonstrator will directly mine data arising from logs produced by grid computing middleware.

Data Management issues relevant for the project

Data sets and data sources used for data mining vary considerably in structure, size, problem solving context, background knowledge, and other statistical and technological aspects across different domains and sectors. Data is more than a stream of bits and bytes; this fact needs to be reflected in the protocol and service layers. Furthermore, in grid environments this data is typically distributed over geographically dispersed sites and may become unavailable after some time, while replication is often ruled out by sheer size, privacy or licensing issues. As data mining results always need to be regarded in connection with the original data, provenance information for each model needs to be recorded at every step of the data mining process.

In order to address such diverse requirements regarding data management, the DataMiningGrid project will develop various data and model services, providing the functionality needed to perform data mining in grid environments. In detail these services will encompass the following functionality:
• Data service location (metadata annotation);
• Database federation;
• Data access and selection;
• Data transfer;
• Data transformation and pre-processing;
• Collecting and storing the corresponding provenance data;
• Mapping provenance data to the related models.
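
The last two items in the list (collecting provenance data and mapping it to models) can be illustrated with a minimal sketch. It is not taken from the DataMiningGrid design: the class names, the step kinds and the example identifiers are hypothetical, and a real service would persist the records and align them with standards such as those chosen by the project.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ProvenanceStep:
    """One recorded step of a data mining process (hypothetical structure)."""
    kind: str   # illustrative values: "dataset", "preprocessing", "algorithm", "evaluation"
    name: str
    parameters: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ProvenanceStore:
    """Keeps, per model identifier, the ordered steps that led to that model."""
    _steps: Dict[str, List[ProvenanceStep]] = field(default_factory=dict)

    def record(self, model_id: str, step: ProvenanceStep) -> None:
        # Append the step to the history of the given model.
        self._steps.setdefault(model_id, []).append(step)

    def trace(self, model_id: str) -> List[ProvenanceStep]:
        # Return a copy of the history; unknown models yield an empty list.
        return list(self._steps.get(model_id, []))


# Hypothetical example: all names and parameter values are made up.
store = ProvenanceStore()
store.record("model-1", ProvenanceStep("dataset", "repair_protocols"))
store.record("model-1", ProvenanceStep("preprocessing", "tokenise", {"lowercase": True}))
store.record("model-1", ProvenanceStep("algorithm", "naive_bayes", {"alpha": 1.0}))
store.record("model-1", ProvenanceStep("evaluation", "cross_validation", {"folds": 10}))

for step in store.trace("model-1"):
    print(step.kind, step.name, step.parameters)
```

Keeping the steps in insertion order is what allows a model to be retraced from the original data to its evaluation.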

Recording sufficient provenance data to retrace each step of the data mining process that ultimately led to a certain model involves capturing such diverse information as metadata about the original datasets used, the pre-processing operations applied, the data mining algorithms executed (including all parameters), and the evaluation methods performed on the model. As no standard currently exists that is capable of recording all of this information, available standards such as PMML or CWM will be complemented with additional syntax and data structures developed during the DataMiningGrid project.

Technologies involved

Data management in DataMiningGrid will rely on proven technologies such as OGSA-DAI for accessing heterogeneous databases and GridFTP for transferring substantial numbers of possibly large textual documents. While OGSA-DAI fits nicely into the service-oriented architecture envisioned for this project, GridFTP provides a widely accepted and proven technology for transferring large file based data volumes in grid environments.

References
Website: http://www.datamininggrid.org/
Public deliverables: http://www.datamininggrid.org/deliverables.htm

EGEE
Marc-Elian Begin, CERN
Website: http://public.eu-egee.org/

Project facts

The EGEE Project involves 71 leading organisations from around 27 countries, federated in regional Grids, with an ultimate combined capacity of over 20000 CPUs – the largest international Grid infrastructure ever assembled. The EU is contributing €32 million to the project, with a similar level of funding from the partners. The total manpower allocated to the project is approximately 600 person years over two years. The breakdown of funded activities is 48 percent for Grid service activities, 24 percent for middleware re-engineering and 28 percent for networking (dissemination, outreach and training).

Project goals

EGEE is a project that aims to integrate current national, regional and thematic Grid efforts, in order to create a seamless Grid infrastructure for the support of scientific research. EGEE provides researchers in academia and industry with round-the-clock access to major computing resources, independent of geographic location. The infrastructure supports distributed research communities, which share common Grid computing needs and are prepared to integrate their own computing infrastructures and agree on common access policies. Mostly funded by EU funding agencies, this project has a world-wide mission and receives important contributions from the US, Russia and other non EU partners.

Data Management issues relevant for the project

The EGEE application scenarios developed so far belong mainly to the particle physics and pharmaceutical areas. EGEE has to deal with very large amounts of data and with a variety of different interfaces. There is a high demand to harmonize interfaces and create higher levels of abstraction, with the vision of arriving at a common European infrastructure.

Large volumes of data (several petabytes)

With LHC (Large Hadron Collider) alone, the EGEE grid production service will have to handle 10-20 PB of persistent data per year. This data throughput will have to be maintained for several years. EGEE is also looking into another "Grand Challenge" which might complement the LHC, hence increasing even further these already challenging figures.

Integration of mass storage systems

The strategy that we have chosen is to rely on SRM for interfacing with storage systems. gLite (the EGEE middleware) doesn't provide a storage solution, but includes the ability to consume several SRM based storage solutions. The actual storage solution is considered a third party service. This strategy provides freedom and flexibility, while sharing work and leveraging other projects' investments.

Data Replication and location

The current solutions for managing replicated copies of the same data rely on a central/global catalogue. VOs can use several catalogues, but each catalogue currently only works as a central catalogue (the catalogue itself is not yet replicated and synchronised across site boundaries). The data management solution allows a simple query of a catalogue to resolve exact copies of the requested data. The metadata catalogue, also centralised for the moment, can be used to store metadata on existing data and, inversely, to query this metadata to locate data. The metadata and file catalogues can work together to ensure consistency across their respective information.

We already have a File Transfer Service being tested, which will allow data transfers to be triggered from specifications written in job descriptions. This service, coupled with the File Placement Service, will then allow application developers to assume that by the time a job lands on a worker node, data has been transferred and is locally available for processing. This strategy is important from several standpoints; for example, it allows resource broker level optimisation of network resources, and it allows algorithms to take advantage of free computing resources, required data, network topology and speed to better match resources.

Data naming

Thanks to metadata, the raw names of files and data (LFN, SURL, etc.) can be "embellished" with more human-readable terminology.

Secure access

The coordinated effort on security, not only from EGEE and LCG, but also from other Grid projects, is leading to a wide acceptance of VOMS. An important issue for users of sensitive data is whether local access to data via native means is allowed, or whether only grid mechanisms are allowed to access this data. EGEE is also working on an "encryption service" which will alleviate the burden of crypto key management through a simple and secure service, following a similar strategy as VOMS and CAs.
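
To make the catalogue-based replica location described under "Data Replication and location" more concrete, the following is a minimal sketch of a central catalogue that maps a logical file name (LFN) to its replica locations (SURLs) and supports the inverse metadata query. It is an illustration only and not the gLite catalogue interface: the class, its methods and the example names are hypothetical, and the file and metadata catalogues are folded into one class for brevity.

```python
from typing import Dict, List, Set


class ReplicaCatalogue:
    """Hypothetical central catalogue: LFN -> replica SURLs, plus metadata per LFN."""

    def __init__(self) -> None:
        self._replicas: Dict[str, Set[str]] = {}
        self._metadata: Dict[str, Dict[str, str]] = {}

    def register(self, lfn: str, surl: str, **metadata: str) -> None:
        # Record one more physical copy of the logical file and merge its metadata.
        self._replicas.setdefault(lfn, set()).add(surl)
        self._metadata.setdefault(lfn, {}).update(metadata)

    def resolve(self, lfn: str) -> List[str]:
        # Resolve a logical name to all known exact copies.
        return sorted(self._replicas.get(lfn, set()))

    def find(self, **criteria: str) -> List[str]:
        # Inverse query: locate logical files whose metadata match all criteria.
        return sorted(
            lfn
            for lfn, meta in self._metadata.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        )


# Hypothetical names; real LFNs and SURLs follow site and VO conventions.
catalogue = ReplicaCatalogue()
catalogue.register("lfn:/vo/run42.dat", "srm://site-a.example/run42.dat", experiment="demo")
catalogue.register("lfn:/vo/run42.dat", "srm://site-b.example/run42.dat")
catalogue.register("lfn:/vo/run43.dat", "srm://site-a.example/run43.dat", experiment="other")

print(catalogue.resolve("lfn:/vo/run42.dat"))
print(catalogue.find(experiment="demo"))
```

A job description would then only need to name the logical file; which replica is used can be decided at scheduling time.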

How they are addressed

A detailed description is beyond the scope of this document. Information can be found in the following documents:
• gLite architecture: https://edms.cern.ch/document/594698/ ;
• gLite data management overview: https://edms.cern.ch/file/570643/1/EGEE-TECH-570643-v1.0.pdf.

The above architecture document describes the high-level roadmap for addressing these issues. Further work is taking place to maintain and update the architecture document while we learn through developing, integrating and/or deploying the mentioned solutions.

Technologies involved

GSI
GSI and OpenSSL (now including Grid proxy certificates for authentication) are used by gLite services and components.

GridFTP
The file transfer service uses GridFTP to transfer data from site to site, adding a reliability and throttling layer on top of it.

SRM
As mentioned above, SRM is the main interface used by gLite to prepare for plug-in inclusion of storage backend services.

Web services
Several services and components are now Web Service enabled. Security to date is provided via SOAP over HTTPS, while the longer term goal is to explore message level security further, hopefully based on WS-Security implementations that are reliable by then. The EGEE Project Technical Forum has spun off a sub-group that is summarising the lessons learned from Web Services on gLite v1.0 and preparing for a wider, harmonious and consistent deployment of web service enabled services. The reality of Grids today is that there is no clear unanimity in terms of languages and platforms, which means that our web services have to be consumable from several, sometimes exotic, languages; this adds to the difficulty of deploying web service based services. Performance of web services is also an aspect that requires further investigation and demonstration. The indisputable benefits that XML based messaging through web services brings cannot come at too high a price in terms of performance.

POSIX-like data access
The lower level data management services provided to application developers resemble POSIX data access. However, the current implementation of the data management service available on the worker node and the user interface requires a slight adaptation of the standard POSIX API, and applications consequently need to be relinked with the corresponding libraries.

References
Website: http://public.eu-egee.org/
Architecture: https://edms.cern.ch/document/594698/
gLite: https://edms.cern.ch/file/570643/1/EGEE-TECH-570643-v1.0.pdf
       http://www.glite.org
SRM: http://sdm.lbl.gov/indexproj.php?ProjectID=SRM
GGF GSM: https://forge.gridforum.org/projects/gsm-wg/

Inteligrid
Jarek Nabrzyski, PSNC
Website: http://www.inteligrid.com

Project facts

The InteliGrid project started in September 2004, funded by the 6th FP. It is a European research project developing grid technologies to provide an interoperability platform for engineering. Interoperability of software, services and tools is one of the paramount problems for engineering virtual organizations (VOs). The VOs in industries like construction, shipbuilding and aerospace collaborate on the design, production and maintenance of advanced products that are described in complicated, structured product model databases. The InteliGrid project's hypothesis is that the collaboration platform, the semantic grid itself, must be aware of the business concepts (e.g. car, airplane, house) that the VO is addressing.

InteliGrid is a 360 person month, 3.1 MEuro project. Coordinated by the University of Ljubljana in Slovenia, the partners include Technical University Dresden from Germany, VTT from Finland, ESoCE Net from Italy, Poznan Supercomputing and Networking Center from Poland, Obermeyer Planen + Beraten from Germany, Sofistik Hellas from Greece and Conject AG from Germany.

Project goals

The goal of this project is to create an architecture and a prototype for such an infrastructure, based on existing grid middleware, and to test it in the context of the industries mentioned above. The project will create generic grid-related knowledge, infrastructure and toolkits that will allow for a broad transition of the advanced engineering industry towards semantic, model-based, ontology-committed collaboration on the grid.

A specific technological goal is to make the grid infrastructure available to the mostly small to medium enterprise (SME) companies that are providing the engineering software. The core competencies of these companies are topics like structural mechanics or 3D solid modelling and not the latest trends in middleware technology. The project will help SMEs to enhance their applications with grid-computing capabilities.

The project will demonstrate how typical server side applications can be made grid-computing compatible and how the mostly client side Computer Aided Design (CAD) applications can interface with the grid. The demonstration will show the next generation of key engineering collaboration software using the InteliGrid middleware, an ontology service, a product model database server, a project web collaboration service and characteristic computer aided design software.

The main results of the project are the generic business-object-aware extensions to grid middleware, implemented in a way that allows grids to commit to an arbitrary ontology. These extensions are propagated to toolkits that allow hardware and software to be integrated into the grid.

Data Management issues relevant for the project

The main innovation activities are focused on extending the grid architecture with semantics and ontologies beyond current work on metadata and heterogeneous data formats. Interoperability in AEC today relies, at best, on the management of IFC files. Attempts to use true IFC databases are mostly academic.
EPM is also managing a Product Life Cycle Support (PLCS) implementation for the Norwegian navy, which will handle all lifecycle data related to the frigates in their fleet. The implementation uses the concept of DEX (Data Exchange Set) developed in PLCS to allow applications to access subsets of the PLCS data model. In effect these are similar to STEP conformance classes and allow different applications to access the data they are interested in. Merging and partial extraction of the data to/from the PLCS data model, as well as access control, is managed by EPM. The PLCS implementation (http://www.posccaesar.org/) also makes use of reference data that complies with ISO15926 in order to further constrain the data population and to reduce the problem of data redundancy.

File-based environments are fragile and depend on a single server that provides such crucial functions as information and process management for a complex project.

How they are addressed

InteliGrid proposes to use the grid as a robust, scalable, safe infrastructure for the industry. It would allow seamless integration of software committing to any product data standard and would let developers focus on functionality rather than on data exchange or on interfacing with this or that information server. The grid is the place for the semantically rich data.

This project is introducing two major generic improvements visible to the end user as well as the application developer:
• The grid infrastructure eliminates the need to know the exact locations of semantically rich data and complex problem solving services. They are "on the grid".
• The ontology reduces the need to know the exact structure and access paths of the data in product model databases. The data can be accessed by using an engineering ontology as opposed to object names and record keys.

Technologies involved

GridSuite: Grid tools and services developed in several Grid projects, such as GridLab (www.gridlab.org), CrossGrid (www.eu-crossgrid.org) and Progress (http://psnc.progress.pl), and productized by Poznan Supercomputing and Networking Center. GridSuite includes such Grid services as: GRMS (Grid Resource Management System), Data and Replica Management System, GAS (Grid Authorization Service), Grid Accounting System, Adaptive Services, Mercury Monitoring System, Mobile User Support Services, Grid Application Toolkit, GridSphere, etc.

On the ontology side the project uses the Ontology Server. Other tools include the ICT tools, such as WinTUBE3D, Front3D, Front2D, SOFiEM etc.

References
Website: http://www.inteligrid.com
Some publications can be found under the “Results” link of the project’s home page, i.e.: http://www.inteligrid.com/results.htm

NextGrid
Neil Chue Hong, EPCC, University of Edinburgh
Website: http://www.nextgrid.org

Project goals

The objective of the NextGrid project is to develop a Grid architecture to support mainstream use that meets the needs of business users by addressing security and economically viable business models, including legal and privacy issues.

Data Management issues relevant for the project

The NextGrid environment will be service oriented, loosely coupled, heterogeneous and segmented into multiple business domains. Communication is secure and trustable, using XML and WS standards. When no trust can be assumed, interaction is driven by standard interfaces and SLAs.

NextGrid will be developing a unified data and process model with the aim of being able to treat compute and data services in a similar fashion. Furthermore, a framework for a suitable information discovery mechanism will be defined. For the discovery of resources a WS version of the echo pattern will be implemented, and for data integration the OGSA-DAI framework will be used. At present, NextGrid faces questions of standardisation, security and privacy.

References
Website: http://www.nextgrid.org

OGSA-DAI
Neil Chue Hong, EPCC, University of Edinburgh
Website: http://www.ogsadai.org.uk

Project facts

The OGSA-DAI project started in 2002 and is funded by the UK eScience Grid Core Programme. The partners involved are: EPCC, IBM UK, NeSC, Oracle, University of Manchester and University of Newcastle.

Project goals

The OGSA-DAI project (UK eScience) aims to develop middleware for data management, including data access, data delivery, data transformation and data security. The OGSA-DAI middleware provides a set of core features plus a framework for extending it and developing new functionality when building applications and data clients. The goals of the project are:
• To provide a grid based, extensible framework for access to distributed, heterogeneous data resources, that can be used by other projects,
• To implement a reference implementation of the WS-DAI specifications from the DAIS-WG and
• To provide feedback to the DAIS-WG gained from implementation experience.

Data Management issues relevant for the project
• Combination of different data storage models (relational, XML, object, ...)
• Schema integration and management
• Support for transactions.

How they are addressed
• Using XML dataset formats;
• Picking up relevant standards to help interoperability
• Assimilating research efforts in key areas of data integration, working with other data integration and management projects.

Technologies involved
• Java, SQL, XPath, XUpdate, Web Services, Web Services Resource Framework, Globus Toolkit 3.2, Globus Toolkit 4.0, OMII_1.

References
Website: http://www.ogsadai.org.uk

OntoGrid
Manolis Koubarakis, Technical University of Crete
Asunción Gomez-Perez, Universidad Politécnica de Madrid
Miguel Esteban Gutiérrez, Universidad Politécnica de Madrid

Project facts

OntoGrid (Paving the way for Knowledgeable Grid Services and Systems) is an FP6 project working towards making the Semantic Grid a reality.
The partners of OntoGrid are: Universidad Politécnica de Madrid (co-ordinator), University of Manchester, University of Liverpool, Technical University of Crete, Intelligent Software Components, Acklin, Boyd International and Deimos Space.

Project goals

The strategic goals of OntoGrid are the following:
• To pioneer the use of knowledge technologies to enhance and extend the architecture and design of current Grid computing systems to enable cross-process, cross-company, and cross-industry collaboration;
• To enable the wide deployment of Knowledge Technologies within Grid Computing architectures to achieve this goal, eventually leading to a Semantic Grid;
• To improve the performance, scalability, resilience to failures, robustness and adaptivity of current Grid and Knowledge Technology components to pave the way towards acceptability of deployed technological solutions;
• To widen the applicability of Grid computing architectures in applications involving virtual organizations formed by autonomous entities with conflicting goals;
• To expand the applicability of today’s Knowledge and Grid computing architectures and demonstrate that Semantic Grid computing systems can be exploited successfully in e-science and e-business applications.

Data Management issues relevant for the project

Metadata pervades the Grid middleware stack, describing the environment, the services available and the ways that they can be combined, and keeping audit trails of resource use and configurations. The operation of the Grid middleware itself depends on capturing and harnessing metadata, in addition to the metadata required for the Grid applications themselves. In current Grid architectures, though, this metadata may not be explicit; rather, the knowledge is often hard coded or encapsulated within the services and their interactions. The aim of the Semantic Web is to explicitly expose metadata relating to Web resources to better support intelligent agents.
The Semantic Grid aims to follow a similar path in the Grid setting. The key focus of the OntoGrid project is the provision of semantics, which is realized by the following:

• The development of intelligent middleware by utilizing explicit metadata and ontologies (e.g., for the tasks of service discovery, brokering or matchmaking);
• The exploitation of Semantic Web and Peer-to-Peer technologies for metadata management.

As a consequence, OntoGrid has a strong metadata and ontology management focus.

How they are addressed

From a data management point of view, OntoGrid currently focuses on the following two problems:

• Distributed storage and querying of RDF and RDFS data using P2P networks;
• The development of WS-DAIO.

RDF and RDFS are simple data models for describing resources on the Web and as such they are good candidates for developing simple Semantic Grid ontologies. More expressive ontologies can be developed in richer ontology languages such as OWL or other subsets or extensions of first-order logic (FOL).

In current work in OntoGrid, we are developing a distributed RDF storage and querying system on top of a distributed hash table (DHT). DHTs are a family of distributed protocols aimed at the development of P2P applications. DHTs can be used to solve the following look-up problem: ‘Let X be some data item stored at some distributed dynamic network of nodes. Find data item X.’ The core idea in most DHTs is to solve this look-up problem by offering some form of distributed hash table functionality: assuming that data items can be identified using unique numeric keys, DHT nodes cooperate to store keys for each other (data items can be actual data or pointers). We have chosen to use the distributed hash table Bamboo (http://bamboo-dht.org/) in our work. To develop a distributed RDF/RDFS store on top of Bamboo, we are currently working on answering questions such as: How do you index RDF data using Bamboo? What query answering algorithms are appropriate?
How do you balance the query processing load? We expect to use the resulting system, Atlas, for semantic grid service discovery and distributed ontology storage, retrieval and sharing.

In the OntoGrid project we are also developing infrastructure for accessing and using ontologies in a grid environment, which will be used to advance the research on semantic grid architectures. An ontology service will provide access to the information in an ontology. Ontologies can be queried as to their content. Content includes the concepts and relationships that they contain, along with intensional information about those concepts, for example the definitions that apply to a particular class. We might also expect an ontology service to support queries over the information in the ontology, e.g. what are the parents of concept X? Or what are the known individuals conforming to a particular description D? In this way, we can view Ontology Services as a particular kind of Data service: a Data service holds some data, and provides mechanisms for putting things in and getting things out. This suggests that a sensible mechanism for providing Ontology Services is to make use of existing infrastructure for Data Access, in particular the specifications produced by the GGF Data Access Working Group.

The OntoGrid team is defining a proposed specification for accessing ontologies in a grid environment, known as WS-DAIO. WS-DAIO is an extension of the WS-DAI specification for providing ontology data access, that is, a WS-DAI realization that adds ontology access capabilities to the current OGSA architecture. The OntoGrid team will start with the development of WS-DAIO-RDF(S), which uses ontologies implemented in the W3C language RDF(S). Our first aim is to test the usability and usefulness of the proposed interfaces, as well as to experiment with different alternatives and their implications for issues such as factory pattern usage, transactions, etc.
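To make the two ideas in this section more concrete (key-based look-up in a DHT, and an ontology service seen as a data service with "put" and "get" operations), the following is a minimal single-process sketch. It is not the Atlas design and does not use the Bamboo API: the names ToyDHT, TripleStore and key_of, the node names and the ex: identifiers are all hypothetical. It assumes one simple indexing scheme, in which every RDF triple is stored three times, under the hashed key of its subject, predicate and object, and it simplifies "individuals conforming to a description D" to plain class membership.

```python
import hashlib
from bisect import bisect_left

RDFS_SUBCLASS = "rdfs:subClassOf"
RDF_TYPE = "rdf:type"


def key_of(term: str) -> int:
    """Map an RDF term to a numeric key (assumed hashing scheme)."""
    return int(hashlib.sha1(term.encode("utf-8")).hexdigest(), 16)


class ToyDHT:
    """Hypothetical in-memory stand-in for a DHT.

    Nodes are placed on a hash ring; a key is stored on the first node
    whose identifier is >= the key, wrapping around at the end.
    """

    def __init__(self, node_names):
        self.ring = sorted((key_of(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.ring]
        self.storage = {name: {} for name in node_names}

    def node_for(self, key: int) -> str:
        index = bisect_left(self.ids, key) % len(self.ring)
        return self.ring[index][1]

    def put(self, key: int, value) -> None:
        self.storage[self.node_for(key)].setdefault(key, set()).add(value)

    def get(self, key: int) -> set:
        return self.storage[self.node_for(key)].get(key, set())


class TripleStore:
    """Hypothetical ontology 'data service' layered on the toy DHT."""

    def __init__(self, dht: ToyDHT):
        self.dht = dht

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # Index the triple under each of its three terms.
        triple = (subject, predicate, obj)
        for term in triple:
            self.dht.put(key_of(term), triple)

    def parents_of(self, concept: str) -> set:
        # "What are the parents of concept X?"
        return {o for s, p, o in self.dht.get(key_of(concept))
                if s == concept and p == RDFS_SUBCLASS}

    def individuals_of(self, concept: str) -> set:
        # Simplified: individuals declared to be of the given class.
        return {s for s, p, o in self.dht.get(key_of(concept))
                if o == concept and p == RDF_TYPE}


store = TripleStore(ToyDHT(["node-a", "node-b", "node-c"]))
store.add("ex:DataService", RDFS_SUBCLASS, "ex:GridService")
store.add("ex:atlas", RDF_TYPE, "ex:DataService")
print(store.parents_of("ex:DataService"))
print(store.individuals_of("ex:DataService"))
```

Storing each triple under three keys trades storage for the ability to answer a query from whichever term is known; it leaves open the load-balancing question raised above, since frequently used terms such as rdf:type would concentrate on a single node.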
Technologies involved

The main technologies involved here are ontology languages such as RDF, RDFS and OWL, OGSA-DAI, and distributed hash tables such as Bamboo.

References

Website: www.ontogrid.net

SIMDAT

Michael Krüger, Intel
Jörg Kindermann, Fraunhofer AIS
Website: http://www.simdat.org

Project facts

• IST Grid IP project;
• 4 years;
• Start date: September 1st, 2004;
• 26 partners.

Project goals

SIMDAT aims to test and enhance grid technology for product development and production process design, as well as to develop federated versions of problem-solving environments by leveraging enhanced Grid services. Further objectives are to exploit data grids as a basis for distributed knowledge discovery and to promote de facto standards for these enhanced Grid technologies across a range of disciplines and sectors. SIMDAT focuses on four application areas: product design in the automotive, aerospace and pharma industries, as well as service provision in meteorology. Key to seamless data access is the federation of problem-solving environments using grid technology. The federated problem-solving environments will be the major result of SIMDAT. Seven key technology layers have been identified as important to achieving the SIMDAT objectives: an integrated grid infrastructure, transparent access to data repositories on remote grid sites, management of Virtual Organizations, scientific workflow, ontologies, integration of analysis services, and knowledge services.

Data Management issues relevant for the project

• Deployment of Grid concepts, existing and emerging architectures with particular attention to data management: distributed databases;
• Data aggregation through the use of ontological concepts;
• Workflows for next-generation aggregated knowledge capture, discovery and mining.

Technologies involved

In the Auto, Aero and Pharma activities, SIMDAT is developing demonstrators based on Web Service Grids conforming to WS-Interoperability.
In Meteo, SIMDAT is developing a virtual data grid based on Globus Toolkit 3/4. After evaluation and comparison, the OGSA-DAI software has been selected as the central data access layer for the prototypes. For the prototypes, the partners will adapt OGSA-DAI to work with the GRIA layer and introduce functionality for handling provenance information. The resulting layer will be integrated with the Basic Grid Infrastructure system. In the longer term, SIMDAT aims to achieve interoperability between different infrastructures such as Globus Toolkit 4 and GRIA, with a Grid Service API based on the upcoming WSRF set of standards, although the level of compliance may differ between implementations.

References

Website: http://www.simdat.org

Complementarities & synergies among projects

Transfer FP5 to FP6

In general, the FP5 projects produced numerous results that could be taken up by FP6 projects. The potential transfer is both from projects that produced grid applications, such as Grace, and from projects that produced basic data management facilities, such as the DataGrid project.

For some projects, such as DataGrid, there is a natural take-up of results in their successor projects like EGEE. This assures continuity in the development of basic grid services. It also assures that technology developed in FP5 will be maintained over the coming years and so provides a clear perspective for application projects to take up this technology.

Transfer from grid middleware oriented projects to application oriented projects is especially relevant for data-intensive projects such as SIMDAT and DataMiningGrid. OGSA-DAI project results (funded by the UK e-Science programme) will be an important building block for several FP6 projects. The IP SIMDAT, the STREP DataMiningGrid and the OntoGrid project are considering building (part of) their data management on OGSA-DAI.
It would be an advantage if projects like DataMiningGrid and SIMDAT developed a joint strategy with respect to OGSA-DAI.

Results of the DataGrid project will be re-used and extended in the EGEE project, with partly overlapping consortia, so maintenance and further development are not an issue here. OGSA-DAI and the results of the DataGrid project can be seen as complementary, since, for example, the latter focuses on data replication while the former does not. For the application driven projects SIMDAT and DataMiningGrid, the DataGrid results are generally relevant, although it is currently not clear how exactly those results will be taken up.

The OpenMolGrid project led to interesting results and a build-up of know-how in the area of data management for bioinformatics. The results partially overlap with the OGSA-DAI approach (which was of course not available when OpenMolGrid started). OpenMolGrid partners (Univ. Ulster) will build part of the data management for the DataMiningGrid project on OpenMolGrid results.

Results and experience from the Grace project on distributed search engines may be relevant for some text-mining tasks in SIMDAT and DataMiningGrid.

Collaboration of FP6 projects/ Unified view on grid data management

To facilitate the exchange of know-how and software between FP6 projects, a fundamental step is to provide communication platforms. In this sense the concertation meetings on TG 5 Data Management, which are held once a year, serve the important role of identifying common problems and potentially setting up partnerships that could lead to joint solutions. To further strengthen cooperation, additional meetings and telephone conferences are planned.

A second major facility for collaboration is to create links in a Network of Excellence like CoreGrid.
Concerning Data Management, CoreGrid is offering personnel exchange between projects, including non-EU projects such as OGSA-DAI.

Besides establishing organisational structures to facilitate collaboration, it is necessary to create a unified view on grid data management so that design choices can be made on the basis of a common understanding of the available options. To achieve this, it is vital to classify data management into specific areas.

Data Sources

Whereas flat file-based data access dominated in previous projects, the new generation of projects uses a greater variety of data formats and storage mechanisms, ranging from XML and relational databases to flat files. This is driven by the broader range of applications taken up by FP6 projects and a shift towards the industrial world, where relational databases dominate.

Data Discovery and Metadata

Most projects need to extract and use metadata and have identified metadata management as a major issue. Yet there is no standard solution for metadata management, mostly due to the variety of specific demands. The OntoGrid project aims to provide a generic approach to metadata management based on ontologies and semantic web technology. Inteligrid is working in that direction as well. Ontologies are also a topic for SIMDAT. Collaboration between these projects could offer potential to share experiences and develop an accepted (de facto) standard.

Data Access/Data Transfer

Technology transfer from projects that produce basic data management facilities to projects that will produce applications is especially relevant for data-intensive projects such as SIMDAT and DataMiningGrid. Application oriented projects like SIMDAT or DataMiningGrid need high quality components for data management, yet developing such basic middleware components ‘in-house’ is normally out of their scope.
Cooperation between these two types of projects also benefits the middleware oriented projects, since they need user feedback for their further development and they have to report user numbers.

A major result identified in the 2nd TG5 concertation meeting is the convergence of many projects on OGSA-DAI for database access, which means that an important part of Data Management developments has great potential for collaboration. On the other hand, it has to be taken into account that OGSA-DAI support and training activities are limited due to its tight resources.

Projects are also considering their own developments to customize and enhance OGSA-DAI functionality for specific demands, e.g. for transferring large data sets for data mining applications. Some projects want to build specific extensions (e.g. for data mining) that can be made available to other projects, seeking collaboration with OGSA-DAI.

File based access is a major focus of the EGEE project, which uses GridFTP for data transfer. GridFTP is also used by many other projects.

The matrix below shows the deployed technology of different Grid projects with respect to five areas of data management.

DM Areas\ Projects DataMiningGrid EGEE Inteligrid Data Source Relational, Flat File XML, Flat files, mass Relational, XML, storage systems flat files, WebDAV, product model databases OD OGSA-DAI OD Data Discovery/ OGSA-DAI , OD (DEX) Metadata OGSA-DAI OGSA-DAI Database Access Grid FTP, OD GridFTP, FTP, File Access /File GridFTP HTTP, HTTPS, Transfer WebDAV OD Data Replication endData Mining Data intensive Engineering Applications applications simulations and user (CAD, structural analysis (e.g. LHC) Secure data access analysis, ...) (e.g.
biomedic)

DM Areas\ Projects NextGrid OGSA-DAI OntoGrid Data Source RDBMS, XML, files, semistructured, resource metadata RDBMS, XML, files, semistructured, data services RDF, RDFS, WebODE OD, Grimoires OGSA-DAI, OD(WS-DAIO) OD(Atlas - DHTbased RDF/RDFS Data Discovery/ OD (DEX) Metadata Database Access File Access/ File Transfer Data replication Applications OGSA-DAI OGSA-DAI, OD

DM Areas\ Projects SIMDAT Data Source Data Discovery/ Metadata Database Access File Access /File Transfer Data Replication Applications Relational, Flat File OGSA-DAI, OD OD ? Digital media, financial applications, supply chain management, data mining OD (OGSA-DAI) OD, Lucene, GridFTP, FTP various (medical, astronomy, meteorology, geology, biosciences) store) OGSA-DAI DHT-based Ontology based applications, Semantic registries OGSA-DAI,OD GridFTP, OD OD Data Mining, Simulation

Own Development Not used Unknown

Positioning with respect to state of the art and related work outside FP6

Overall, the FP6 projects are very well positioned with respect to the state of the art in data management.

Middleware oriented projects

OGSA-DAI defines the state of the art for database access. It has strong support from major database vendors (IBM, Oracle) and may evolve into a standard solution for database access. Application oriented projects are currently evaluating the potential of this solution in various areas; for example, DataMiningGrid uses it for data mining and text mining problems and will thus be able to provide feedback to guide further developments.

EGEE reports that it is the largest international Grid infrastructure ever assembled and thus provides a reference and test bed for data management issues centred on file based access.

Application oriented projects

These projects often assemble best-of-breed components built by other (middleware oriented) projects to provide problem-solving environments for specific business or scientific areas (e.g.
SIMDAT builds demonstrators for pharma, meteo, automotive, and aerospace). To this end, it is important that key players from middleware oriented projects also participate in these application oriented projects, so that take-up of technologies is facilitated and the right design decisions are made. However, this type of project derives its unique selling proposition (USP) from the end users and the application developers that build concrete applications. Some FP6 projects have assembled a critical mass of industrial companies that want to prototype grid technologies in their business areas. Several of these companies are global players and/or among the market leaders in their area.

Overall, the application oriented projects in FP6 are certainly at the leading edge of transferring grid technology from science to industry. It is to be expected that the successful deployment of grid technology will have a strong signal effect and will push take-up by other companies in the respective area.

Conclusions and Future plans

Most FP6 projects have now finished the stage of requirements gathering and design, so the overviews provided in this document reflect the still early stage of many projects. Future versions will be able to report on results, experiences and lessons learned. An important milestone here will be the completion of the demonstrators scheduled by various projects for late 2005.

While the 1st concertation meeting on Data Management focused on facilitating transfer from FP5 to FP6, the collaboration between FP6 projects is now moving to the centre of attention. A major result identified at the 2nd TG5 concertation meeting is the convergence on OGSA-DAI for database access, which means that an important part of Data Management developments has great potential for collaboration.

To further support collaboration on an organisational basis, it is planned to amplify the impact of the Data Management Technical Group by holding additional meetings or teleconferences.
Besides TG5, the CoreGrid Network of Excellence is also considered to provide good opportunities for knowledge exchange.

An important measure for the future will be to derive, through the process of concertation, a joint view and vision of grid-related data management issues, with the goal of developing a common understanding of the available options, removing island culture and helping to define standards. Accordingly, the TG5 meetings will shift to a more problem-driven approach.

References

CoreGrid website: http://www.coregrid.net
DataMiningGrid public deliverables: http://www.datamininggrid.org/deliverables.htm
DataMiningGrid website: http://www.datamininggrid.org/
EGEE architecture document: https://edms.cern.ch/document/594698/
EGEE gLite document: https://edms.cern.ch/file/570643/1/EGEE-TECH570643-v1.0.pdf
EGEE website: http://public.eu-egee.org/
GGF GSM: https://forge.gridforum.org/projects/gsm-wg/
InteliGrid publications: http://www.inteligrid.com/results.htm
InteliGrid website: http://www.inteligrid.com
NextGrid website: http://www.nextgrid.org
OGSA-DAI website: http://www.ogsadai.org.uk
OntoGrid website: http://www.ontogrid.net
SIMDAT website: http://www.simdat.org
SRM: http://sdm.lbl.gov/indexproj.php?ProjectID=SRM
