TG5: Data Management
White Paper on State of the Art and
Planned Developments in the context of FP6 Grid Projects
Lead partner:
Fraunhofer AIS
Chair:
Dr. Michael May, michael.may@ais.fraunhofer.de
Contributing projects:
CoreGrid, DataMiningGrid, EGEE, Inteligrid, NextGrid,
OGSA-DAI, OntoGrid, SIMDAT
Table of Contents
Introduction
Objectives of the working group “Data Management”
  Objective 1: Transfer from FP5 to FP6
  Objective 2: Collaboration between FP6 projects
  Objective 3: Developing a unified view on Grid data management
Relevant European Projects
Contribution of FP6 projects to Data Management
  CoreGrid
  DataMiningGrid
  EGEE
  InteliGrid
  NextGrid
  OGSA-DAI
  OntoGrid
  SIMDAT
Complementarities & synergies among projects
  Transfer FP5 to FP6
  Collaboration of FP6 projects/ Unified view on grid data management
Positioning with respect to state of the art and related work outside FP6
  Middleware oriented projects
  Application oriented projects
Conclusions and Future plans
References
Introduction
Data management has been an important topic in several FP5 projects. Data access
has been addressed e.g. on the level of basic data transfer (e.g. GridFTP), and data
cataloguing and replication (provided e.g. by the European DataGrid project). The
OGSA-DAI project has developed basic support for accessing remote databases. For
a major uptake of Grid technology in industry, an even more comprehensive
approach to data management will be needed. Applications making use of grid
infrastructure such as ontologies or data mining have a need for sophisticated data
access mechanisms. Accordingly, FP6 projects such as SIMDAT have data grids as
their central topic.
This white paper:
• Describes the objectives of the Data Management working group,
• Gives an overview of the data management approaches of FP6 projects,
• Identifies overlap, complementarity and points for synergies,
• Positions FP6 project approaches with respect to the state of the art and
• Identifies future points for action.
The document is intended to be a living document that provides an up-to-date view
on data management issues in FP6.
Objectives of the working group “Data Management”
Objective 1: Transfer from FP5 to FP6
In general, the FP5 projects produced numerous results that could be taken up by
FP6 projects.
There are two basic modes of technology transfer:
• From projects that produced basic data management facilities to projects that
  will produce applications;
• From projects that produced grid applications to new projects that produce
  new applications.
In the first case, there is a need for transferring software and implementations.
Transfer from grid middleware oriented projects to application oriented projects is
especially relevant for data-intensive projects such as SIMDAT and DataMiningGrid.
In the second case, there might be software to transfer, or the transfer might
consist of know-how and lessons learned. The second case applies, e.g., to the
relation between the Grace project and DataMiningGrid.
A second important consideration is whether an FP5 project has, in one way or
another, a continuation in FP6. For some projects, there is a natural take-up of
results in their successor projects. This ensures continuity in the development of
basic grid services. It also ensures that technology developed in FP5 will be
maintained over the coming years, and so gives application projects a clear
perspective for taking up this technology. If there is no continuation, transfer of
know-how is feasible but transfer of software is more problematic, since the issue
of maintenance arises.
Objective 2: Collaboration between FP6 projects
The second major objective of the working group is to facilitate the exchange of
know-how and software between FP6 projects. In this case, there is no risk that one
of the projects will cease to exist in the near future. The challenge here is that the
projects run on parallel time scales; e.g. project A needs some data management
software from project B in order to implement a Grid service on top of it, but project B
has only just started to design that component, so it is not yet available.
At the time of writing, most FP6 projects will have produced a system design
document. Each of those documents will address issues relevant for data
management. These documents provide the raw material for the current document.
The material will be collected and synthesized to provide a coherent picture of data
management issues in FP6.
Objective 3: Developing a unified view on Grid data management
The third major objective is, wherever appropriate, to define a unified view on Grid
data management issues. While it is unrealistic to expect that all projects will
converge on the same set of technologies (and possibly also dangerous, since it is
important to explore alternative options), design choices should be based on a
common understanding of the available options. Also, projects should inform each
other about lessons learned, so that no project falls into a trap that another
project has already identified, or passes over a promising technology simply
because it does not know about it.
A further goal will be to position and synchronize the working group activities with
world-wide activities on Grid data management, e.g. those of the Global Grid Forum (GGF).
Relevant European Projects
At the first concertation meeting, the following projects were identified as having
data management issues as a major topic:
FP5
• DataGrid;
• Grace;
• OpenMolGrid.
FP6
• DataMiningGrid;
• EGEE;
• InteliGrid;
• SIMDAT.
The list of projects has become longer as a result of the 2nd concertation meeting,
where one objective was to involve additional projects. These projects are:
FP6
• CoreGrid;
• NextGrid;
• OntoGrid.
Other
• OGSA-DAI (UK eScience programme).
Contribution of FP6 projects to Data Management
CoreGrid
Domenico Talia, University of Calabria
Website: http://www.coregrid.net
Project goals
The CoreGrid Network of Excellence brings together a European critical mass of
well-known experts in Grid and P2P research, allowing Europe to compete with
research and development in North America and Asia.
Data Management issues relevant for the project
An objective of the CoreGrid project is to build a European-wide research laboratory
to avoid fragmentation of Grid research activities in Europe and to achieve long-term
integration and sustainability. The Knowledge & Data Management Virtual Institute
(KDM Virtual Institute) aims to further the integration of data management and
knowledge discovery with Grid technologies for providing knowledge-based Grid
services. Moreover, the objective is to provide a collaborative setting for European
research teams working on distributed storage management on Grids, semantic Grid
techniques and tools for supporting data-intensive applications, and knowledge
discovery and management in Grids. Another objective is to strengthen the joint
activity of European research groups in those areas, promoting larger leading teams
(operating as a Virtual Research Institute) that work on models and tools for data and
knowledge management in Grids and P2P systems.
References
Website: http://www.coregrid.net
DataMiningGrid
Martin Swain, University of Ulster
Thomas Niessen, Fraunhofer AIS
Website: http://www.datamininggrid.org/
Project facts
Data mining has been recognized as one of the most important information
technologies for automating the process of analysing and interpreting the data in
modern knowledge industries and high-tech sectors such as science and
engineering. Currently there exists no coherent framework for developing and
deploying data-mining applications on the Grid. The DataMiningGrid project will
address this gap by developing generic and sector-independent data mining tools
and services for the Grid. A test bed consisting of several applications from a diverse
set of sectors will serve as platform for demonstrating and promoting the technology
developed by the DataMiningGrid.
DataMiningGrid is part of the Sixth Framework Programme of the Information Society
Technologies Programme (IST). The consortium consists of the University of
Ulster (Northern Ireland), the Fraunhofer Institute for Autonomous Intelligent Systems
(Germany), the Data Mining Solutions group from DaimlerChrysler (Germany), the
Israel Institute of Technology (Israel), and the University of Ljubljana, Faculty of Civil
and Geodetic Engineering (Slovenia). The grant period for this project is September
1, 2004 to August 31, 2006.
Project goals
The project’s main objectives are structured into four main phases or milestones
(currently phase 1 has been completed):
• Specification and validation of data-mining-aware grid tools and interfaces to
  be developed by the project (due date: February 2005),
• Early implementation of a mock-up prototype featuring some of the more
  critical aspects of the project (architecture, middleware, data-mining-aware
  grid data access interfaces) (due date: August 2005),
• Delivery of middleware-integrated components (due date: April 2006) and
• A fully evaluated set of DataMiningGrid components, tools, interfaces and
  application demonstrators from different application scenarios (due date:
  August 2006).
In order to address a wide range of requirements arising from the need and context
to mine data in distributed computing environments, the project will develop a test
bed consisting, among other things, of various demonstrator applications.
Demonstrators from biology and medicine will address data-mining problems
requiring a data-mining-aware access of distributed and very large databases (e.g.
molecular dynamics simulation data) and the construction of compute-intensive
predictive models. Demonstrators in the automotive industry and other text-mining
scenarios need to mine large and inherently distributed text repositories (e.g. car
repair protocols, customer relationship management data). Another demonstrator will
directly mine data arising from logs produced by grid computing middleware.
Data Management issues relevant for the project
Data sets and data sources used for data mining vary considerably in structure, size,
problem solving context, background knowledge, and other statistical and
technological aspects across different domains and sectors. Data is more than a
stream of bits and bytes; this fact needs to be reflected in the protocol and service
layers. Furthermore, in grid environments this data is typically distributed over
geographically dispersed sites and may become unavailable after some time, while
replication is often ruled out by sheer size, privacy or licensing issues. As data mining
results always need to be regarded in connection with the original data, provenance
information for each model needs to be recorded at every step of the data mining
process.
In order to address such diverse requirements regarding data management the
DataMiningGrid project will develop various data and model services, providing the
functionality needed to perform data mining in grid environments. In detail these
services will encompass the following functionality:
• Data service location (metadata annotation);
• Database federation;
• Data access and selection;
• Data transfer;
• Data transformation and pre-processing;
• Collecting and storing the corresponding provenance data;
• Mapping provenance data to the related models.
The recording of sufficient provenance data to retrace each step of a data mining
process, which ultimately led to a certain model, involves capturing such diverse
information as metadata about the original datasets used, pre-processing operations
applied, data mining algorithms executed (including all parameters), and evaluation
methods performed on the model. As no standard currently exists that is capable of
recording all this information, available standards such as PMML or CWM will be
complemented with additional syntax and data structures developed during the
DataMiningGrid project.
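The kinds of information listed above can be pictured as a single record attached to each model. The following Python sketch is illustrative only: the class name, field names and sample values are assumptions made for this example, and it does not reproduce PMML, CWM or any format defined by the DataMiningGrid project.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ProvenanceRecord:
    """Hypothetical record linking a mined model to the steps that produced it."""
    model_id: str
    dataset_metadata: dict                        # metadata about the original datasets
    preprocessing_steps: list = field(default_factory=list)
    algorithm: str = ""                           # data mining algorithm executed
    parameters: dict = field(default_factory=dict)
    evaluation: dict = field(default_factory=dict)

    def add_preprocessing(self, operation, **options):
        # Steps are appended in the order they were applied, so the
        # process can be retraced later.
        self.preprocessing_steps.append({"operation": operation, "options": options})

    def to_json(self):
        # Serialise the whole record so it can be stored next to the model.
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
record = ProvenanceRecord(
    model_id="model-001",
    dataset_metadata={"source": "example-db", "rows": 1000},
)
record.add_preprocessing("normalise", method="z-score")
record.algorithm = "decision_tree"
record.parameters = {"max_depth": 5}
record.evaluation = {"method": "10-fold cross-validation"}
serialised = record.to_json()
```

Keeping the pre-processing steps as an ordered list is what makes it possible to replay the process that led to a given model.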
Technologies involved
Data management in DataMiningGrid will rely on proven technologies such as
OGSA-DAI for accessing heterogeneous databases and GridFTP for transferring
substantial numbers of possibly large textual documents. While OGSA-DAI fits nicely
into the service-oriented architecture envisioned for this project, GridFTP provides a
widely accepted and proven technology for transferring large file-based data volumes
in grid environments.
References
Website: http://www.datamininggrid.org/
Public deliverables: http://www.datamininggrid.org/deliverables.htm
EGEE
Marc-Elian Begin, CERN
Website: http://public.eu-egee.org/
Project facts
The EGEE Project involves 71 leading organisations from around 27 countries,
federated in regional Grids, with an ultimate combined capacity of over 20000 CPUs
– the largest international Grid infrastructure ever assembled. The EU is funding €32
million towards the project, with a similar level of funding from the partners. The total
manpower allocated to the project is approximately 600 person years over two years.
The breakdown of funded activities is 48 percent for Grid service activities, 24
percent for middleware re-engineering and 28 percent for networking (dissemination,
outreach and training).
Project goals
EGEE is a project that aims to integrate current national, regional and thematic Grid
efforts, in order to create a seamless Grid infrastructure for the support of scientific
research. EGEE provides researchers in academia and industry with round-the-clock
access to major computing resources, independent of geographic location. The
infrastructure supports distributed research communities, which share common Grid
computing needs and are prepared to integrate their own computing infrastructures
and agree on common access policies. Mostly funded by EU funding agencies, this
project has a world-wide mission and receives important contributions from the US,
Russia and other non-EU partners.
Data Management issues relevant for the project
The EGEE application scenarios developed so far come mainly from particle physics
and the pharmaceutical sector. EGEE has to deal with very large amounts of data and
with a variety of different interfaces. There is a high demand to harmonize
interfaces and create higher levels of abstraction, with the vision of arriving at a
common European infrastructure.
Large volumes of data (several petabytes)
With LHC (Large Hadron Collider) alone, the EGEE grid production service will have
to handle 10-20 PB of persistent data per year. This data throughput will have to be
maintained for several years. EGEE is also looking into another "Grand Challenge"
which might complement the LHC, hence increasing even further these already
challenging figures.
Integration of mass storage systems
The strategy that we have chosen is to rely on SRM (Storage Resource Manager) for
interfacing with storage systems. gLite (the EGEE middleware) does not provide a
storage solution of its own, but can use several SRM-based storage solutions. The
actual storage solution is considered a third-party service. This strategy provides
freedom and flexibility, while sharing work and leveraging other projects'
investments.
Data Replication and location
The current solutions for managing replicated copies of the same data rely on a
central/global catalogue. VOs can use several catalogues, but each catalogue
currently works only as a central catalogue (the catalogue itself is not yet replicated
and synchronised across site boundaries). The data management solution allows a
simple query of a catalogue to resolve exact copies of the requested data. The
metadata catalogue, also centralised for the moment, can be used to store
metadata on existing data and, conversely, to query this metadata to locate data. The
metadata and file catalogues can work together to ensure consistency across their
respective information.
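The two look-ups described above (resolving a logical file name to its replicas, and locating data through its metadata) can be sketched with a small in-memory catalogue. This is a simplified illustration, not the gLite catalogue interface: the class, method names, file names and URLs are hypothetical, and the file and metadata catalogues are merged into one object for brevity.

```python
class ReplicaCatalogue:
    """Hypothetical central catalogue: one instance, no replication of the catalogue itself."""

    def __init__(self):
        self._replicas = {}   # logical file name (LFN) -> set of storage URLs (SURLs)
        self._metadata = {}   # logical file name (LFN) -> metadata dictionary

    def register(self, lfn, surl, **metadata):
        # Record one more physical copy of the logical file and any metadata.
        self._replicas.setdefault(lfn, set()).add(surl)
        self._metadata.setdefault(lfn, {}).update(metadata)

    def resolve(self, lfn):
        # Return every known exact copy of the requested data.
        return sorted(self._replicas.get(lfn, ()))

    def find(self, **criteria):
        # Inverse query: locate logical files whose metadata matches all criteria.
        return sorted(
            lfn for lfn, md in self._metadata.items()
            if all(md.get(key) == value for key, value in criteria.items())
        )


# Illustrative names only.
catalogue = ReplicaCatalogue()
catalogue.register("lfn:/grid/demo/run42.dat",
                   "srm://site-a.example.org/data/run42.dat",
                   experiment="demo", run=42)
catalogue.register("lfn:/grid/demo/run42.dat",
                   "srm://site-b.example.org/data/run42.dat")
copies = catalogue.resolve("lfn:/grid/demo/run42.dat")
matches = catalogue.find(run=42)
```

Because both dictionaries are keyed by the same logical file name, the file and metadata views stay consistent by construction in this sketch; in a deployment with separate services that consistency has to be maintained explicitly.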
We already have a File Transfer Service being tested, which will allow data transfers
to be triggered from specifications written in job descriptions. This service, coupled
with the File Placement Service, will then allow application developers to assume that
by the time a job lands on a worker node, data has been transferred and is locally
available for processing. This strategy is important from several standpoints: for
example, it allows optimisation of network resources at the resource broker level, and
it allows algorithms to take into account free computing resources, required data,
network topology and speed in order to better match resources.
Data naming
Thanks to metadata, the raw names of files and data (LFN, SURL, etc.) can be
"embellished" with more human-readable terminology.
Secure access
The coordinated effort on security, not only from EGEE and LCG, but also from other
Grid projects, is leading to a wide acceptance of VOMS. An important issue for users
of sensitive data is whether local access is allowed to data, via native means, or
whether only grid mechanisms are allowed to access this data. EGEE is also working
on an "encryption service" which will alleviate the burden of cryptographic key
management through a simple and secure service, following a strategy similar to that
of VOMS and CAs.
How they are addressed
A detailed description is beyond the scope of this document. Information can be
found in the following documents:
• gLite architecture: https://edms.cern.ch/document/594698/ ;
• gLite data management overview: https://edms.cern.ch/file/570643/1/EGEE-TECH-570643-v1.0.pdf.
The above architecture document describes the high-level roadmap for addressing
these issues. Further work is taking place to maintain and update the architecture
document while we learn through developing, integrating and/or deploying the
mentioned solutions.
Technologies involved
GSI
GSI and OpenSSL (now including Grid proxy certificates for authentication) are used
by gLite services and components.
GridFTP
The file transfer service uses GridFTP to transfer data between sites, adding a
reliability and throttling layer on top of it.
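The idea of layering reliability and throttling over a basic transfer call can be sketched as follows. This is a minimal illustration under stated assumptions, not the EGEE File Transfer Service: the class name, parameters and URLs are hypothetical, and the `transfer` callable is a stand-in for whatever client performs the actual GridFTP copy.

```python
import threading
import time


class ReliableTransferService:
    """Hypothetical wrapper adding retries and a concurrency limit to a transfer call."""

    def __init__(self, transfer, max_attempts=3, max_concurrent=2, retry_delay=0.0):
        self._transfer = transfer              # callable(source, destination)
        self._max_attempts = max_attempts
        self._retry_delay = retry_delay
        # Throttling: at most max_concurrent transfers run at the same time.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def submit(self, source, destination):
        """Run one transfer; return the number of attempts that were needed."""
        with self._slots:
            for attempt in range(1, self._max_attempts + 1):
                try:
                    self._transfer(source, destination)
                    return attempt
                except IOError:
                    # Reliability: retry transient failures, give up after the last attempt.
                    if attempt == self._max_attempts:
                        raise
                    time.sleep(self._retry_delay)


# Simulated transfer that fails twice before succeeding (illustration only).
calls = []

def flaky_transfer(source, destination):
    calls.append((source, destination))
    if len(calls) < 3:
        raise IOError("simulated transient failure")

service = ReliableTransferService(flaky_transfer, max_attempts=3)
attempts_used = service.submit("gsiftp://site-a.example.org/data/f.dat",
                               "gsiftp://site-b.example.org/data/f.dat")
```

The semaphore only has an effect when `submit` is called from several threads at once; in this single-threaded usage it simply shows where the limit would apply.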
SRM
As mentioned above, SRM is the main interface used by gLite to prepare for plug-in
inclusion of storage backend services.
Web services
Several services and components are now deployed as Web Services.
Security to date is provided via SOAP over HTTPS, while the longer-term goal is to
explore message-level security, ideally based on WS-Security implementations once
these are reliable. The EGEE Project Technical Forum has spun off a sub-group that is
summarising the lessons learned from Web Services on gLite v1.0 and preparing for
a wider, harmonious and consistent deployment of Web Service enabled services.
The reality of Grids today is that there is no clear consensus on languages
and platforms, which means that our web services have to be consumable from several,
sometimes exotic, languages; this adds to the difficulty of deploying web-service-based
services. Performance of web services is also an aspect that requires
further investigation and demonstration. The indisputable benefits of XML-based
messaging through web services must not come at too high a price in terms of
performance.
POSIX-like data access
The lower-level data management services provided to application developers
resemble POSIX data access. However, the current implementation of the data
management services available on the worker node and the user interface requires a
slight adaptation of the standard POSIX API, and applications consequently have to be
relinked with the corresponding libraries.
References
Website: http://public.eu-egee.org/
Architecture: https://edms.cern.ch/document/594698/
gLite: https://edms.cern.ch/file/570643/1/EGEE-TECH-570643-v1.0.pdf
       http://www.glite.org
SRM: http://sdm.lbl.gov/indexproj.php?ProjectID=SRM
GGF GSM: https://forge.gridforum.org/projects/gsm-wg/
InteliGrid
Jarek Nabrzyski, PSNC
Website: http://www.inteligrid.com
Project facts
The InteliGrid project started in September 2004 and is funded by the 6th Framework
Programme (FP6). It is a European research project developing grid technologies to
provide an interoperability platform for engineering. Interoperability of software,
services and tools is one of the paramount problems for engineering virtual
organizations (VOs). VOs in industries such as construction, shipbuilding and
aerospace collaborate on the design, production and maintenance of advanced
products that are described in complicated, structured product model databases.
The InteliGrid project's hypothesis
is that the collaboration platform - the semantic grid itself - must be aware of the
business concepts (e.g. car, airplane, house) that the VO is addressing.
InteliGrid is a 360 person month, 3.1 MEuro project. Coordinated by the University of
Ljubljana in Slovenia, the partners include Technical University Dresden from
Germany, VTT from Finland, ESoCE Net from Italy, Poznan Supercomputing and
Networking Center from Poland, Obermeyer Planen + Beraten from Germany,
Sofistik Hellas from Greece and Conject AG from Germany.
Project goals
The goal of this project is to create an architecture and a prototype for such an
infrastructure, based on existing grid middleware, and to test it in the context of
the industries mentioned above. The project will create generic grid-related knowledge,
infrastructure and toolkits that will allow for a broad transition of the advanced
engineering industry towards semantic, model-based, ontology-committed
collaboration on the grid.
A specific technological goal is to make the grid infrastructure available to the mostly
small to medium enterprise (SME) companies that are providing the engineering
software. The core competencies of these companies are topics like structural
mechanics or 3D solid modelling and not the latest trends in middleware technology.
The project will help SMEs to enhance their applications with grid-computing
capabilities.
The project will demonstrate how typical server side applications can be made grid-computing compatible and how the mostly client side Computer Aided Design (CAD)
applications can interface with the grid. The demonstration will show the next
generation of key engineering collaboration software using the InteliGrid middleware,
an ontology service, a product model database server, a project web collaboration
service and characteristic computer aided design software.
The main results of the project are the generic business-object-aware extensions to
grid middleware, implemented in a way that allows grids to commit to an arbitrary
ontology. These extensions are propagated to toolkits that allow hardware and
software to be integrated into the grid.
Data Management issues relevant for the project
Main innovation activities are focused on extending the grid architecture with
semantics and ontologies beyond current work on metadata and heterogeneous data
formats. Interoperability in AEC (architecture, engineering and construction) today
relies, at best, on the management of IFC (Industry Foundation Classes) files.
Attempts to use true IFC databases are mostly academic. EPM is also managing a
Product Life Cycle Support (PLCS) implementation for the Norwegian navy, which
will handle all lifecycle data related to the frigates in their fleet. The implementation
uses the concept of DEX (Data Exchange Set) developed in PLCS to allow
applications to access subsets of the PLCS data model. In effect these are similar to
STEP conformance classes and allow different applications to access the data they
are interested in.
Merging and partial extraction of the data to/from the PLCS data model, as well as
access control, are managed by EPM. The PLCS implementation
(http://www.posccaesar.org/) also makes use of reference data that complies with
ISO 15926 in order to further constrain the data population and to reduce the problem
of data redundancy.
File-based environments are fragile and depend on a single server that provides such
crucial functions as information and process management for a complex project.
How they are addressed
InteliGrid proposes to use the grid as a robust, scalable, safe infrastructure for the
industry. It would allow seamless integration of software committing to any product
data standard and let developers focus on functionality rather than on data exchange
or on interfacing with a particular information server. The grid is the place for the
semantically rich data.
This project is introducing two major generic improvements visible to the end user as
well as the application developer:
• The grid infrastructure eliminates the need to know the exact locations of
  semantically rich data and complex problem solving services. They are "on the
  grid".
• The ontology reduces the need to know the exact structure and access paths
  of the data in product model databases. They can be accessed by using an
  engineering ontology, as opposed to object names and record keys.
Technologies involved
GridSuite: Grid tools and services developed in several Grid projects, such as
GridLab (www.gridlab.org), CrossGrid (www.eu-crossgrid.org) and Progress
(http://psnc.progress.pl), and productized by Poznan Supercomputing and Networking
Center. GridSuite includes such Grid services as: GRMS (Grid Resource
Management System), Data and Replica Management System, GAS (Grid
Authorization Service), Grid Accounting System, Adaptive Services, Mercury
Monitoring System, Mobile User Support Services, Grid Application Toolkit,
GridSphere, etc.
On the ontology side the project uses the Ontology Server. Other tools include the
ICT tools, such as WinTUBE3D, Front3D, Front2D, SOFiEM etc.
References
Website: http://www.inteligrid.com
Some publications can be found under the “Results” link of the project’s home page:
http://www.inteligrid.com/results.htm
NextGrid
Neil Chue Hong, EPCC, University of Edinburgh
Website: http://www.nextgrid.org
Project goals
The objective of the NextGrid project is to develop a Grid architecture to support
mainstream use that meets the needs of business users by addressing security and
economically viable business models including legal and privacy issues.
Data Management issues relevant for the project
The NextGrid environment will be service oriented, loosely coupled, heterogeneous
and segmented into multiple business domains. Communication is secure and
trustable, using XML and WS standards. When no trust can be assumed, interaction
is driven by standard interfaces and SLAs. NextGrid will develop a unified data
and process model with the aim of treating compute and data services in a
similar fashion. Furthermore, a framework for a suitable information discovery
mechanism will be defined. For the discovery of resources a WS version of the echo
pattern will be implemented, and for data integration the OGSA-DAI framework will be
used. At present, NextGrid faces open questions of standardisation, security and
privacy.
References
Website: http://www.nextgrid.org
OGSA-DAI
Neil Chue Hong, EPCC, University of Edinburgh
Website: http://www.ogsadai.org.uk
Project facts
The OGSA-DAI project started in 2002 and is funded by the UK eScience Grid Core
Programme. The partners involved are EPCC, IBM UK, NeSC, Oracle, the University of
Manchester and the University of Newcastle.
Project goals
The OGSA-DAI project (UK eScience) aims to develop middleware for data
management, including data access, data delivery, data transformation and data
security. The OGSA-DAI middleware provides a set of core features plus a
framework for extending it and developing new functionality when building
applications and data clients.
The goals of the project are:
• To provide a grid based, extensible framework for access to distributed,
  heterogeneous data resources, that can be used by other projects,
• To implement a reference implementation of the WS-DAI specifications from
  the DAIS-WG and
• To provide feedback to the DAIS-WG gained from implementation experience.
Data Management issues relevant for the project
• Combination of different data storage models (relational, XML, object, ...);
• Schema integration and management;
• Support for transactions.
How they are addressed
• Using XML dataset formats;
• Picking up relevant standards to help interoperability;
• Assimilating research efforts in key areas of data integration, working with
  other data integration and management projects.
Technologies involved
• Java, SQL, XPath, XUpdate, Web Services, Web Services Resource
  Framework, Globus Toolkit 3.2, Globus Toolkit 4.0, OMII_1.
References
Website: http://www.ogsadai.org.uk
OntoGrid
Manolis Koubarakis, Technical University of Crete
Asunción Gomez-Perez, Universidad Politécnica de Madrid
Miguel Esteban Gutiérrez, Universidad Politécnica de Madrid
Project facts
OntoGrid (Paving the way for Knowledgeable Grid Services and Systems) is an FP6
project working towards making the Semantic Grid a reality. The partners of OntoGrid
are: Universidad Politécnica de Madrid (co-ordinator), University of Manchester,
University of Liverpool, Technical University of Crete, Intelligent Software
Components, Acklin, Boyd International and Deimos Space.
Project goals
The strategic goals of OntoGrid are the following:
• To pioneer the use of knowledge technologies to enhance and extend the
  architecture and design of current Grid computing systems to enable
  cross-process, cross-company, and cross-industry collaboration;
• To enable the wide deployment of Knowledge Technologies within Grid
  Computing architectures to achieve this goal, eventually leading to a
  Semantic Grid;
• To improve the performance, scalability, resilience to failures, robustness
  and adaptivity of current Grid and Knowledge Technology components to pave
  the way towards acceptability of deployed technological solutions;
• To widen the applicability of Grid computing architectures in applications
  involving virtual organizations formed by autonomous entities with conflicting
  goals;
• To expand the applicability of today’s Knowledge and Grid computing
  architectures and demonstrate that Semantic Grid computing systems can be
  exploited successfully in e-science and e-business applications.
Data Management issues relevant for the project
Metadata pervades the Grid middleware stack, describing the environment, the
services available and the ways that they can be combined, keeping audit trails of
resource use and configurations. The operation of the Grid middleware itself is
dependent on the capturing and harnessing of metadata, as well as the metadata
required for the Grid applications themselves. In current Grid architectures though,
this metadata may not be explicit -- rather the knowledge is often hard coded or
encapsulated within the services and their interactions. The aim of the Semantic Web
is to explicitly expose metadata relating to Web resources to better support intelligent
agents. The Semantic Grid aims to follow a similar path in the Grid setting. The key
focus of the OntoGrid project is in this provision of semantics and it is realized by the
following:
• The development of intelligent middleware by utilizing explicit metadata and
ontologies (e.g., for the tasks of service discovery, brokering or matchmaking);
• The exploitation of Semantic Web and Peer-to-Peer technologies for metadata
management.
As a consequence, OntoGrid has a strong metadata and ontology management
focus.
How they are addressed
From a data management point of view, OntoGrid currently focuses on the following
two problems:
• Distributed storage and querying of RDF and RDFS data using P2P networks;
• The development of WS-DAIO.
RDF and RDFS are simple data models for describing resources on the Web and, as
such, they are good candidates for developing simple Semantic Grid ontologies. More
expressive ontologies can be developed in richer ontology languages such as OWL
or other subsets or extensions of first-order logic (FOL).
In current work in OntoGrid, we are developing a distributed RDF storage and
querying system on top of a distributed hash table (DHT). DHTs are a family of
distributed protocols that are aimed at the development of P2P applications. DHTs
can be used to solve the following look-up problem: ‘Let X be some data item stored
in some distributed dynamic network of nodes. Find data item X.’ The core idea in
most DHTs is to solve this look-up problem by offering some form of distributed
hash table functionality: assuming that data items can be identified using unique
numeric keys, DHT nodes cooperate to store keys for each other (data items can be
actual data or pointers).
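As a rough illustration of the look-up problem described above, the following Python sketch places data items on nodes by hashing item names to numeric keys. It is a toy, single-process model: the names ToyDHT and key_for, the node names, and the simple "first node at or after the key" placement rule are assumptions made for illustration, and the sketch does not reproduce Bamboo's routing, placement rule, or API.

```python
import hashlib
from bisect import bisect_left


def key_for(name: str, bits: int = 32) -> int:
    """Map a name to a numeric key in [0, 2**bits) using SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class ToyDHT:
    """Hypothetical in-memory stand-in for a network of cooperating DHT nodes."""

    def __init__(self, node_names):
        # Each node gets a position on the key ring derived from its name.
        self.ring = sorted((key_for(n), n) for n in node_names)
        # One local key/value store per node.
        self.stores = {n: {} for n in node_names}

    def responsible_node(self, key: int) -> str:
        """Return the first node at or after the key, wrapping around the ring."""
        ids = [node_id for node_id, _ in self.ring]
        idx = bisect_left(ids, key) % len(self.ring)
        return self.ring[idx][1]

    def put(self, item_name: str, value) -> str:
        """Store a value (actual data or a pointer) and return the node that holds it."""
        key = key_for(item_name)
        node = self.responsible_node(key)
        self.stores[node][key] = value
        return node

    def get(self, item_name: str):
        """Find data item X: hash its name and ask the responsible node."""
        key = key_for(item_name)
        return self.stores[self.responsible_node(key)].get(key)
```

Calling put("X", ...) and later get("X") on the same ToyDHT instance returns the stored value, because both operations derive the same key and therefore reach the same node. A real DHT additionally has to cope with nodes joining and leaving, which this sketch ignores.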
We have chosen to use the distributed hash table Bamboo (http://bamboo-dht.org/)
in our work. To develop a distributed RDF/RDFS store on top of Bamboo, we are
currently working on answering questions such as: How do you index RDF data using
Bamboo? What query answering algorithms are appropriate? How do you balance
the query processing load?
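To make the indexing question concrete, one simple possibility is to store every RDF triple three times, once under the key of its subject, once under its predicate and once under its object, so that any triple pattern with at least one bound position can be answered with a single look-up. The Python sketch below illustrates this idea only; it is not a description of the scheme chosen in OntoGrid. The class TripleIndex and the function dht_key are hypothetical, and a local dictionary stands in for the put/get operations of a DHT.

```python
import hashlib
from collections import defaultdict


def dht_key(term: str) -> int:
    """Numeric key for an RDF term (assumption: SHA-1 of its string form)."""
    return int.from_bytes(hashlib.sha1(term.encode("utf-8")).digest(), "big")


class TripleIndex:
    """Hypothetical triple index; self.table stands in for DHT put/get."""

    def __init__(self):
        self.table = defaultdict(set)

    def store(self, s: str, p: str, o: str) -> None:
        """Index the triple under the key of each of its three terms."""
        triple = (s, p, o)
        for term in triple:
            self.table[dht_key(term)].add(triple)

    def match(self, s=None, p=None, o=None):
        """Answer a triple pattern; None marks an unbound position."""
        pattern = (s, p, o)
        bound = [t for t in pattern if t is not None]
        if not bound:
            raise ValueError("at least one position must be bound")
        # One look-up on the first bound term, then filter locally.
        candidates = self.table.get(dht_key(bound[0]), set())
        return {
            t for t in candidates
            if all(want is None or want == got for want, got in zip(pattern, t))
        }
```

Under such a scheme, very common terms (for example a predicate such as rdf:type) concentrate many triples on a single key, which is one reason why the load-balancing question raised above matters.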
We expect to use the resulting system, Atlas, for Semantic Grid service discovery and
distributed ontology storage, retrieval and sharing.
In the OntoGrid project we are also developing infrastructure for accessing and using
ontologies in a grid environment, which will be used to advance the research on
semantic grid architectures.
An ontology service will provide access to the information in an ontology. Ontologies
can be queried as to their content. Content includes the concepts and relationships
that they contain, along with intensional information about those concepts, for
example the definitions that apply to a particular class. We might also expect an
ontology service to support queries over the information in the ontology, e.g. what are
the parents of concept X? Or what are the known individuals conforming to a
particular description D? In this way, we can view Ontology Services as a particular
kind of Data service – a Data service holds some data, and provides mechanisms for
putting things in, and getting things out. This suggests that a sensible mechanism for
providing Ontology Services is to make use of existing infrastructure for Data Access,
in particular the specifications produced by the GGF Data Access Working Group.
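The two example queries above (the parents of a concept, and the known individuals conforming to a description) can be sketched as a very small in-memory service. This Python sketch is purely illustrative: ToyOntologyService and its method names are hypothetical and are not taken from WS-DAI or WS-DAIO, and a "description" is simplified here to a single named concept.

```python
class ToyOntologyService:
    """Hypothetical data service holding a concept hierarchy and individuals."""

    def __init__(self):
        self.parents = {}    # concept -> set of direct parent concepts
        self.instances = {}  # concept -> set of individuals asserted for it

    # "Putting things in"
    def add_subclass(self, child: str, parent: str) -> None:
        self.parents.setdefault(child, set()).add(parent)

    def add_individual(self, individual: str, concept: str) -> None:
        self.instances.setdefault(concept, set()).add(individual)

    # "Getting things out"
    def parents_of(self, concept: str) -> set:
        """What are the (direct) parents of concept X?"""
        return set(self.parents.get(concept, set()))

    def ancestors_of(self, concept: str) -> set:
        """All concepts reachable by following parent links."""
        seen, stack = set(), [concept]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def individuals_of(self, concept: str) -> set:
        """Known individuals of a concept, including those of its sub-concepts."""
        result = set()
        for asserted, members in self.instances.items():
            if asserted == concept or concept in self.ancestors_of(asserted):
                result |= members
        return result
```

The add_* methods play the role of "putting things in" and the query methods the role of "getting things out", which is the sense in which an ontology service can be viewed as a kind of data service.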
The OntoGrid team is drafting a proposed specification for accessing ontologies
in a grid environment, known as WS-DAIO. WS-DAIO extends the WS-DAI
specification to provide ontology data access; that is, it is a WS-DAI realization
that adds ontology access capabilities to the current OGSA architecture.
The OntoGrid team will start by developing WS-DAIO-RDF(S), which targets
ontologies implemented in the W3C language RDF(S). Our first aim is to test
the usability and usefulness of the proposed interfaces, and to experiment with
different alternatives and their implications for issues such as factory pattern
usage and transactions.
Technologies involved
The main technologies involved here are ontology languages such as RDF, RDFS
and OWL, as well as OGSA-DAI and distributed hash tables such as Bamboo.
References
Website: www.ontogrid.net
SIMDAT
Michael Krüger, Intel
Jörg Kindermann, Fraunhofer AIS
Website: http://www.simdat.org
Project facts
• IST Grid IP project;
• 4 years;
• Start date: September 1st, 2004;
• 26 partners.
Project goals
SIMDAT aims to test and enhance grid technology for product development and
production process design as well as to develop federated versions of
problem-solving environments by leveraging enhanced Grid services. Further objectives are to
exploit data grids as a basis for distributed knowledge discovery as well as to
promote de facto standards for these enhanced Grid technologies across a range of
disciplines and sectors.
SIMDAT focuses on four application areas: product design in the automotive,
aerospace and pharmaceutical industries as well as service provision in meteorology. Key to
seamless data access is the federation of problem-solving environments using
grid technology. The federated problem-solving environments will be the major result
of SIMDAT. Seven key technology layers have been identified as important to
achieving the SIMDAT objectives: an integrated grid infrastructure, transparent
access to data repositories on remote grid sites, management of Virtual
Organizations, scientific workflow, ontologies, integration of analysis services and
knowledge services.
Data Management issues relevant for the project
• Deployment of Grid concepts, existing and emerging architectures with
particular attention to data management: distributed databases;
• Data aggregation through the use of ontological concepts;
• Workflows for next-generation aggregated knowledge capture, discovery and
mining.
Technologies involved
In the Auto, Aero and Pharma activities, SIMDAT is developing demonstrators based
on Web Service Grids conforming to WS-Interoperability. In Meteo, SIMDAT is
developing a virtual data grid based on Globus Toolkit 3/4. After evaluation and
comparison, the OGSA-DAI software has been selected as the central data access
layer for the prototypes.
For the prototypes, the partners will adapt OGSA-DAI to work with the GRIA layer,
and introduce functionality for the handling of provenance information. The resulting
layer will be integrated with the Basic Grid Infrastructure system. In the longer term,
SIMDAT aims to achieve interoperability between different infrastructures such as
Globus Toolkit 4 and GRIA, with a Grid Service API based on the upcoming WSRF
set of standards, although the level of compliance may differ between
implementations.
References
Website: http://www.simdat.org
Complementarities & synergies among projects
Transfer FP5 to FP6
In general, the FP5 projects produced numerous results that could be taken up by
FP6 projects. The potential transfer is both from projects that produced grid
applications, such as Grace, and from projects that produced basic data management
facilities, such as the DataGrid project. For some projects, such as DataGrid, there is a
natural take-up of results in their successor projects like EGEE. This assures
continuity in the development of basic grid services. It also assures that technology
developed in FP5 will be maintained over the coming years and so provides a clear
perspective for application projects to take up this technology. Transfer from
middleware-oriented projects to application-oriented projects is especially relevant for
data-intensive projects such as SIMDAT and DataMiningGrid.
OGSA-DAI project results (funded by the UK e-Science programme) will be an
important building block for several FP6 projects. The IP SIMDAT, the STREP
DataMiningGrid and the OntoGrid project are considering building (part of) their data
management on OGSA-DAI. It would be an advantage if projects like DataMiningGrid
and SIMDAT developed a joint strategy with respect to OGSA-DAI.
Results of the DataGrid project will be re-used and extended in the EGEE project,
with partly overlapping consortia, so maintenance and further development are not an
issue here. OGSA-DAI and the results of the DataGrid project can be seen as
complementary, since, for example, the latter focuses on data replication while the
former does not. For the application-driven projects SIMDAT and DataMiningGrid the
DataGrid results are generally relevant, although specific approaches for taking up
those results are not yet clear.
The OpenMolGrid project led to interesting results and a build-up of know-how in the
area of data management for bioinformatics. The results partially overlap with
the OGSA-DAI approach (which was of course not available when OpenMolGrid
started). OpenMolGrid partners (Univ. Ulster) will build part of the data management
for the DataMiningGrid project on OpenMolGrid results.
Results and experience from the Grace project on distributed search engines may be
relevant for some text-mining tasks in SIMDAT and DataMiningGrid.
Collaboration of FP6 projects / Unified view on grid data management
To facilitate the exchange of know-how and software between FP6 projects, a
fundamental action is to provide communication platforms.
In this sense the concertation meetings on TG5 Data Management, which are held
once a year, serve the important role of identifying common problems and
potentially setting up partnerships that could lead to joint solutions. To further
strengthen cooperation, additional meetings and telephone conferences are
planned. A second major facility for collaboration is to create links in a
Network of Excellence like CoreGrid. Concerning Data Management,
CoreGrid is offering personnel exchange between projects, including non-EU
projects such as OGSA-DAI.
Besides establishing organisational structures to facilitate collaboration, it is
necessary to create a unified view on grid data management so that design choices
can be made on a common understanding of the available options. To achieve this, it
is vital to classify data management into specific areas.
Data Sources
Whereas in previous projects flat file-based data access dominated, the new
generation of projects uses a greater variety of data formats and storage
mechanisms, ranging from XML and relational databases to flat files. This is driven by
the broader range of applications taken up by FP6 projects and a shift towards the
industrial world, where relational databases dominate.
Data Discovery and Metadata
Most projects need to extract and use metadata and have identified
metadata management as a major issue. Yet there is no standard solution
for metadata management, mostly because of the variety of specific demands. The
OntoGrid project aims to provide a generic approach to metadata management
based on ontologies and semantic web technology. Inteligrid is working in that
direction as well. Ontologies are also a topic for SIMDAT. Collaboration between
these projects offers the potential to share experiences and develop an accepted
(de facto) standard.
Data Access/Data Transfer
Technology transfer from projects that produce basic data management facilities to
projects that will produce applications is especially relevant for data-intensive projects
such as SIMDAT and DataMiningGrid. Application-oriented projects like SIMDAT
or DataMiningGrid need high-quality components for data management, yet
developing such basic middleware components ‘in-house’ is normally out of their
scope. Cooperation between these two types of projects also benefits the
middleware-oriented projects, since they need user feedback for their further
development and they have to report user numbers.
A major result identified at the 2nd TG5 concertation meeting is the
convergence on OGSA-DAI for database access by many projects, which means that
an important part of Data Management developments has great potential for
collaboration. On the other hand, it has to be taken into account that OGSA-DAI
support and training activities are limited because the project's resources are tight.
Projects are also considering their own developments to customize and enhance
OGSA-DAI functionality for specific demands, e.g. for transferring large data sets for data
mining applications. Some projects want to build specific extensions (e.g. for data
mining) that can be made available to other projects, and are seeking collaboration with
OGSA-DAI.
File-based access is a major focus of the EGEE project, which uses GridFTP for data
transfer. GridFTP is also used by many other projects.
The matrix below shows the deployed technology of different Grid projects with
respect to five areas of data management.
DM Areas\ Projects
DataMiningGrid
EGEE
Inteligrid
Data Source
Relational,
Flat File
XML, Flat
files,
mass Relational,
XML,
storage systems
flat files, WebDAV,
product
model
databases
OD
OGSA-DAI
OD
Data
Discovery/ OGSA-DAI , OD
(DEX)
Metadata
OGSA-DAI
OGSA-DAI
Database Access
Grid FTP, OD
GridFTP,
FTP,
File Access /File GridFTP
HTTP,
HTTPS,
Transfer
WebDAV
OD
Data Replication
endData Mining
Data
intensive Engineering
Applications
applications
simulations
and user
(CAD,
structural
analysis (e.g. LHC)
Secure data access analysis, ...)
(e.g. biomedic)
DM Areas\ Projects
NextGrid
OGSA-DAI
OntoGrid
Data Source
RDBMS, XML,
files, semistructured,
resource metadata
RDBMS, XML,
files, semistructured,
data services
RDF, RDFS, WebODE
OD, Grimoires
OGSA-DAI,
OD(WS-DAIO)
OD(Atlas - DHTbased
RDF/RDFS
Data
Discovery/ OD (DEX)
Metadata
Database Access
File Access/ File
Transfer
Data replication
Applications
OGSA-DAI
OGSA-DAI, OD
DM Areas\ Projects
SIMDAT
Data Source
Data
Discovery/
Metadata
Database Access
File Access /File
Transfer
Data Replication
Applications
Relational, Flat File
OGSA-DAI, OD
OD
?
Digital media,
financial
applications,
supply chain
management, data
mining
OD (OGSA-DAI)
OD, Lucene,
GridFTP, FTP
various (medical,
astronomy,
meteorology,
geology,
biosciences)
store)
OGSA-DAI
DHT-based
Ontology
based
applications,
Semantic registries
OGSA-DAI,OD
GridFTP, OD
OD
Data
Mining,
Simulation
Own Development
Not used
Unknown
Positioning with respect to state of the art and related work outside FP6
Overall, the FP6 projects are very well positioned with respect to the state of the art
in data management.
Middleware oriented projects
OGSA-DAI defines the state of the art for database access. It has strong support
from major database vendors (IBM, Oracle) and may evolve into a standard solution for
database access. Application-oriented projects are currently evaluating the potential of this
solution in various areas; e.g. DataMiningGrid uses it for data mining and
text-mining problems and will thus be able to provide feedback to guide further
developments.
EGEE reports that it is the largest international Grid infrastructure ever assembled and
thus provides a reference and test bed for data management issues centred on
file-based access.
Application oriented projects
These projects often assemble best-of-breed components built by other
(middleware-oriented) projects to provide problem-solving environments for specific
business or scientific areas (e.g. SIMDAT builds demonstrators for pharma, meteo,
automotive, and aerospace).
To this end, it is important that key players from middleware-oriented projects
also participate in these application-oriented projects, so that take-up of technologies
is facilitated and the right design decisions are made. However, this type of project
derives its unique selling point (USP) from the end users and the application
developers that build concrete applications. Some FP6 projects have assembled a
critical mass of industrial companies that want to prototype grid technologies in their
business areas. Several of these companies are global players and/or among the
market leaders in their area.
Overall, the application-oriented projects in FP6 are certainly at the leading edge of
transferring grid technology from science to industry.
It is to be expected that the successful deployment of grid technology will have a
strong signal effect and will push take-up by other companies in the respective area.
Conclusions and Future plans
Most FP6 projects have now finished the stage of requirements gathering and
design, so the overviews provided in this document reflect the still early stage of
many projects. Future versions will be able to report on results, experiences and
lessons learned. An important milestone will be the completion of the demonstrators
scheduled by various projects for late 2005.
While the 1st concertation meeting on Data Management focused on
facilitating transfer from FP5 to FP6, the collaboration between FP6 projects is now
moving to the centre of attention. A major result identified at the 2nd TG5
concertation meeting is the convergence on OGSA-DAI for database access, which
means that an important part of Data Management developments has great
potential for collaboration.
To further support collaboration on an organisational basis, it is planned to amplify
the impact of the Data Management Technical Group by holding additional meetings
or teleconferences. Besides TG5, the CoreGrid Network of Excellence is also expected
to offer good opportunities for knowledge exchange.
An important measure for the future will be to derive, through the process of
concertation, a joint view and vision of grid-related data management issues, with the
goal of developing a common understanding of the available options, removing island
culture and helping to define standards. Accordingly, the TG5 meetings will shift to a
more problem-driven approach.
References
CoreGrid website: http://www.coregrid.net
DataMiningGrid public deliverables: http://www.datamininggrid.org/deliverables.htm
DataMiningGrid website: http://www.datamininggrid.org/
EGEE architecture document: https://edms.cern.ch/document/594698/
EGEE gLite document: https://edms.cern.ch/file/570643/1/EGEE-TECH570643-v1.0.pdf
EGEE website: http://public.eu-egee.org/
GGF GSM: https://forge.gridforum.org/projects/gsm-wg/
InteliGrid publications: http://www.inteligrid.com/results.htm
InteliGrid website: http://www.inteligrid.com
NextGrid website: http://www.nextgrid.org
OGSA-DAI website: http://www.ogsadai.org.uk
OntoGrid website: http://www.ontogrid.net
SIMDAT website: http://www.simdat.org
SRM: http://sdm.lbl.gov/indexproj.php?ProjectID=SRM