Semantic Web Infrastructure using DataPile
Jakub Yaghob, Filip Zavoral
Department of Software Engineering, Charles University in Prague
{Jakub.Yaghob, Filip.Zavoral}@mff.cuni.cz
Abstract
Years of research and development of technologies
and tools have not led to the expected widespread
adoption of the semantic web. We consider the practical
nonexistence of an infrastructure for semantic web
operation to be one of the main reasons for this status. In
this paper we describe a proposal for such an
infrastructure based on the DataPile technology and the
relevant developed tools, and their integration with web
search engines and other tools.
1. Introduction - Current State of Semantic
Web Implementation
If the semantic web were half as good as proclaimed
by its seers, it would be widely adopted by commercial
subjects and everyone would use it daily, analogously to
today's web. The reality is quite different: the semantic
web is still not in common everyday use. One of the main
reasons is the nonexistence of a unified infrastructure in
which the semantic web could be effectively tested and
operated.
This problem can be easily demonstrated by
comparison with the 'plain' web, whose infrastructure is
stabilized and steadily used. Web servers store web pages
somewhere on their disks, and a server typically provides
the data via the HTTP or HTTPS protocols. Data in the
HTML format, possibly complemented by some active
elements, is displayed to the client.
The semantic web has no such 'standard'
infrastructure. There are developed and relatively stable
standards for ontology description (OWL), and there are
proposed and implemented specialized query languages
(SPARQL [5], RQL [1], SeRQL [3], RDQL [9]) and
data/metadata storage systems (Sesame [10], Jena [11]).
However, the overall infrastructure remains unclear,
which is one of the reasons for the unsatisfactorily low
spread of the semantic web.
The next chapters are organized as follows: Chapter 2
demonstrates the DataPile system as a data storage for
the semantic web, Chapter 3 shows the individual
infrastructure modules, and Chapter 4 summarizes the
current status and foreshadows subsequent progress.
2. DataPile Systems and their relevance to
the semantic web
2.1. The DataPile
The principal idea behind the DataPile [2] is data
verticalization, i.e., it does not use the traditional
horizontal concept of a relational database table. In the
DataPile, each column of a 'traditional' row is
represented as one row of a database table, and the
corresponding attributes (columns) of an entity (row) are
connected together by the entity identification. Moreover,
because all attributes of all entity types have the same
physical representation, only one relational database table
is used for storing all data. The DataPile as a data
structure needs a second table, which stores the logical
structure of the entities. This table holds metadata which
can be used as a source for inference; we call this table
the metatable.
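The verticalization idea can be illustrated with a minimal sketch in SQLite; the table and column names below are illustrative, not the authors' actual schema:

```python
import sqlite3

# One "pile" table holds every attribute value of every entity as a
# separate row; the "metatable" describes the logical structure.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE metatable (
    attr_id   INTEGER PRIMARY KEY,
    entity_ty TEXT,          -- logical entity type this attribute belongs to
    attr_name TEXT
);
CREATE TABLE pile (
    entity_id INTEGER,       -- connects attributes of one entity
    attr_id   INTEGER REFERENCES metatable(attr_id),
    value     TEXT
);
""")
con.executemany("INSERT INTO metatable VALUES (?, ?, ?)",
                [(1, "Person", "name"), (2, "Person", "email")])
# One 'Person' entity becomes two vertical rows instead of one horizontal row.
con.executemany("INSERT INTO pile VALUES (?, ?, ?)",
                [(42, 1, "Jan Novak"), (42, 2, "jan@example.org")])
# Reassembling the horizontal view joins pile rows by entity_id.
rows = con.execute("""
    SELECT m.attr_name, p.value
    FROM pile p JOIN metatable m ON p.attr_id = m.attr_id
    WHERE p.entity_id = 42 ORDER BY m.attr_id
""").fetchall()
print(rows)  # [('name', 'Jan Novak'), ('email', 'jan@example.org')]
```

Note that changing the logical structure only requires inserting rows into the metatable; the physical schema of the pile table never changes.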
We have shown [4] that the DataPile is very suitable
as a storage for semantic web data. The DataPile was
originally developed as a common data structure for the
integration of heterogeneous information systems [2].
The original requirements laid on the DataPile as a
central data storage for data integration were:
1. keeping all legacy applications,
2. unification of data from different sites,
3. active distribution of data changes,
4. the possibility to add other collected and stored data,
5. information about data changes along the time axis.
All these requirements correspond very well with the
requirements laid on a data storage for the semantic web:
1. Data sources are distributed over the whole web and
we cannot enforce or influence their data schema.
2. Data stored in one instance of the DataPile
corresponds to one ontology. During data import
from other sources, the data is transformed (unified)
into the sustained ontology.
3. The DataPile is able to ensure data export, including
active export (push), which is especially useful for
executors (see below).
4. It is relatively easy to change the data structure in the
DataPile. The data structure is not directly dictated
by a relational database schema; it is determined by
the content of the metatables, where the structure is
kept. Changes in the data structure fit well with
changes in our sustained ontology.
5. Moreover, the DataPile keeps information about the
time history of the data in such a way that we are
able to discover the state of the world at any point in
time.
The work was partly supported by the project 1ET100300419 of the Program Information Society of the Thematic Program II of the National
Research Program of the Czech Republic.
2.2. Data storage in the DataPile and relation
between RDF and the DataPile
Setting performance optimizations aside, data in the
DataPile can be stored in one relational table which
contains one row for each attribute value of any entity in
a given time segment. Each row in this table contains the
following columns: entity ID, attribute type, attribute
value, data source, validity from, validity to, and data
relevance. The first three columns can be easily mapped
to the RDF data model: the entity represents a subject,
the attribute type represents a predicate, and the attribute
value is an object.
The remaining columns are described in the next
subchapter.
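The mapping of the first three columns onto an RDF triple can be sketched as follows (the column names and URI prefixes are illustrative):

```python
# One DataPile row, as a dict keyed by the seven columns described above.
row = {
    "entity_id": 42, "attr_type": "name", "attr_value": "Jan Novak",
    "source": "sis", "valid_from": 2004, "valid_to": 2006, "relevance": 0.9,
}

def row_to_triple(r):
    # entity ID -> subject, attribute type -> predicate, value -> object;
    # the remaining four columns carry reifications about this statement.
    return (f"entity:{r['entity_id']}", f"attr:{r['attr_type']}", r["attr_value"])

triple = row_to_triple(row)
print(triple)  # ('entity:42', 'attr:name', 'Jan Novak')
```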
Because the DataPile was applied in a particular large
information system, we were able to collect a large
amount of real data (tens of millions of entities). Because
the data in the DataPile is, as a matter of fact, RDF data,
we have obtained a very large amount of real RDF data.
The only small disadvantage is that some of the data is
private and cannot be published, which can be rectified
by proper data obfuscation.
2.3. Reifications
The "data source" column from the previous
subchapter denotes the source of the data, the "data
relevance" column represents our estimation of the
source and data credibility, and the two columns
"validity from" and "validity to" enclose the time
segment in which the data is valid. They represent
reifications of the RDF statement represented by one
row. Modeling such reifications in the pure RDF format,
we would get data roughly five times bigger, and
searching using the reifications would be very slow.
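The blow-up can be made concrete with a toy count of triples. The `rdf:` terms below are the standard RDF reification vocabulary; the `ex:` annotation predicates are illustrative, and the exact factor depends on how many reifications are attached:

```python
# One statement stored as a single DataPile row.
base = [("ent:42", "attr:name", "Jan Novak")]

# The same statement with its four reifications in pure RDF: a statement
# node plus one triple per component and per annotation.
reified = [
    ("stmt:1", "rdf:type",      "rdf:Statement"),
    ("stmt:1", "rdf:subject",   "ent:42"),
    ("stmt:1", "rdf:predicate", "attr:name"),
    ("stmt:1", "rdf:object",    "Jan Novak"),
    ("stmt:1", "ex:source",     "sis"),    # data source
    ("stmt:1", "ex:relevance",  "0.9"),    # data relevance
    ("stmt:1", "ex:validFrom",  "2004"),   # validity from
    ("stmt:1", "ex:validTo",    "2006"),   # validity to
]

print(len(reified), "triples instead of", len(base), "row")
```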
2.4. Context ontology and ontology mapping
The sustained ontology stored in the DataPile
metatables is a context ontology, i.e., it describes only
the data stored in the DataPile. The data grows over time
and the ontology changes as well, typically by
augmentation (adding other areas of interest), seldom by
modification. The DataPile metatables cope with those
ontology changes without changes in the data part of the
DataPile.
One of the biggest current problems impeding the
worldwide expansion of the semantic web is the problem
of ontology mapping. Using a context ontology means
mapping it to other existing ontologies during data and
metadata import. The metadata stored in the DataPile
provides a framework for ontology mapping research.
3. Semantic Web Operating Infrastructure
The proposed semantic web infrastructure is based on
the DataPile, where all the data and metadata are stored.
The DataPile supports four types of interfaces: data
import, import and merging of metadata, querying, and
executors. Several modules use these interfaces and
together provide complex semantic web functionality.
3.1. Importers
The data import interface makes it possible to import
and update the DataPile data. Typical examples of
importers are filters converting data from any source
(database, XML, web, ...) into a physical representation
acceptable by the DataPile. The logical format of the data
is converted by the filter to conform to the context
ontology.
Among the important features of importers is the
ability to detect and update existing data. The data
unification interface allows various unification algorithm
implementations that match existing and imported data
using significant and relevant attributes.
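The unification step can be sketched as follows; the interface and attribute names are hypothetical, not the authors' actual API:

```python
# Existing entities in the DataPile, keyed by entity ID.
existing = {
    42: {"name": "Jan Novak", "email": "jan@example.org"},
}

def unify(record, entities, keys=("email",)):
    """Return the ID of an existing entity matching `record` on all
    significant attributes in `keys`, or None for a new entity."""
    for eid, attrs in entities.items():
        if all(attrs.get(k) == record.get(k) for k in keys):
            return eid
    return None

# A re-imported record matches entity 42 despite a differing name spelling,
# so the importer updates it instead of creating a duplicate.
print(unify({"name": "J. Novak", "email": "jan@example.org"}, existing))  # 42
print(unify({"name": "Eva", "email": "eva@example.org"}, existing))       # None
```

Real unification algorithms would of course weigh several attributes and tolerate partial matches; the point here is only the interface shape.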
Web search engines are significantly important among
importers: they interconnect the semantic web with the
existing web. We use the Egothor system [6], whose
modular architecture is very well suited for cooperation
with the DataPile. In the original implementation,
Egothor builds a record of a part of the web in the inverse
vector representation. Newly added extraction modules
store the filtered incoming data into the DataPile instead
of building the vectors.
[Figure 1. Modules involved in the infrastructure: data sources and the web feed the Search Engine and the Data Import Filter with its Semantics Deductor; data and metadata are stored in the DataPile with its application server (Lion High Seat); the DataPile Query Processor serves the SPARQL, multicriterial querying, and SDE (Tykadlo) modules; the Conductor orchestrates Executors in the Execution Environment via SOAP, CORBA, ...]
The metadata naturally becomes obsolete over time.
Therefore the metadata importers play an important role
in keeping the semantic web up to date.
Metadata importers are classified as manual or
automatic. An import filter from a legacy system is a
typical example of a manual importer. The development
of manual importers is relatively expensive; moreover,
this approach does not make it possible to import data
from the contemporary web, which contains a plethora of
unstructured data.
We are working on a framework for automatic
metadata importers. Automatic importers based on
different heuristic, statistical, probabilistic, or artificial
intelligence methods may automatically generate
metadata from processed data sources. The Egothor
module that derives semantics automatically from the
processed data is an example of such an importer.
3.2. Query engines
In contrast to traditional relational databases with a
fixed data schema and a standardized query language
(SQL), querying semantic web data is still in the stage of
proposals and pilot implementations. Our proposed
framework is built upon a general query interface
providing means for submitting queries and for fetching
and processing results. This interface is intentionally
wide and general, so that both parts (querying and
processing of results) may be highly heterogeneous.
The querying itself is complicated by the fact that the
user typically does not know the data structure, which is
extremely large and dynamic in time. Therefore one of
the query modules is the Semantic Driven Explorer
(SDE) named Tykadlo, which is capable of searching,
displaying, and exploring metadata, filtering data related
to selected metadata, and exploring further semantically
related data.
The SPARQL [5] module converts a SPARQL query
into the respective SQL form and forwards it to the SQL
engine for evaluation. Although this query method is
adequate for simple queries, more complex queries with
many joins are inefficient.
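The conversion idea, and why many joins arise, can be sketched for basic graph patterns over the vertical table; the table and column names are illustrative and the real module handles far more than this:

```python
# Each triple pattern becomes one alias of the pile table; shared
# variables become join conditions, constants become filters.
def bgp_to_sql(patterns):
    """patterns: list of (subject, predicate, object) strings;
    terms starting with '?' are variables."""
    aliases, conds, var_cols = [], [], {}
    for i, (s, p, o) in enumerate(patterns):
        a = f"t{i}"
        aliases.append(f"pile {a}")
        for col, term in (("entity", s), ("attr", p), ("value", o)):
            if term.startswith("?"):
                if term in var_cols:                  # shared variable -> join
                    conds.append(f"{var_cols[term]} = {a}.{col}")
                else:
                    var_cols[term] = f"{a}.{col}"
            else:                                     # constant -> filter
                conds.append(f"{a}.{col} = '{term}'")
    cols = ", ".join(f"{c} AS {v[1:]}" for v, c in var_cols.items())
    return f"SELECT {cols} FROM {', '.join(aliases)} WHERE {' AND '.join(conds)}"

# "?x name ?n . ?x email ?m" -- two patterns joined on the shared ?x,
# i.e. one self-join of the pile table per extra triple pattern.
sql = bgp_to_sql([("?x", "name", "?n"), ("?x", "email", "?m")])
print(sql)
```

Since every additional triple pattern adds another self-join of the same large table, it is easy to see how complex queries become expensive.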
The Multicriterial Query Module [8] is suitable for
specifying several criteria and user preferences, while the
resulting data set is qualified by a monotone function of
the particular criteria. The task of this module is to search
for the best or the best-k answers. This module can be
extended into a Dynamical Multicriterial Query Module
which uses previous results (of the same user or of other
users) to improve the preciseness of the answers.
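A toy sketch of best-k answering under a monotone combining function; the item names, criteria, and weighted-sum scoring are illustrative, not the module's actual interface:

```python
import heapq

# Each candidate answer is scored on several criteria in [0, 1].
items = {
    "pharmacy_a": {"distance": 0.9, "price": 0.3},
    "pharmacy_b": {"distance": 0.5, "price": 0.8},
    "pharmacy_c": {"distance": 0.2, "price": 0.3},
}

def best_k(items, weights, k):
    # Weighted sum is monotone: raising any single criterion value
    # can never lower the combined score.
    def score(crit):
        return sum(weights[c] * v for c, v in crit.items())
    return heapq.nlargest(k, items, key=lambda name: score(items[name]))

print(best_k(items, {"distance": 0.5, "price": 0.5}, k=2))
# ['pharmacy_b', 'pharmacy_a']
```

Monotonicity of the combining function is what allows efficient best-k algorithms to stop early instead of scoring every candidate.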
The query engine is prepared to adopt further
querying modules, based e.g. on artificial intelligence,
advanced linguistics, or data mining methods.
3.3. Executors
The traditional representation of results is tightly
coupled to the query method; for example, the SDE
displays interconnected pages containing the result data
together with their structure and relationships, the search
engine displays web links with an appropriate piece of
text, and SPARQL returns rows of attribute tuples.
The technique of executors brings process models
into this infrastructure. The task of an executor is to
realize an atomic semantic action, i.e., an interaction of
the result data with the outside (not only semantic)
world. These atomic executors can be assembled together
to make complex composed executors. Orchestration,
i.e., the mutual interconnection of executors to achieve
more complex functionality, is carried out by the
Conductor module.
This may be illustrated by the following example.
One's mother has fallen ill and needs a medicine. A
query module searches for nearby pharmacies with the
medicine available, and the result is passed to the
Conductor. One executor is responsible for buying the
medicine, while another arranges delivery to the mother's
home. The Conductor orchestrates these two executors so
that they synchronize, mutually cooperate, and pass the
relevant data between them.
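The orchestration idea can be sketched as follows; all class names and the uniform `run()` interface are hypothetical illustrations, not the Conductor's actual API:

```python
# Atomic executors expose one interface; the Conductor chains them,
# passing each executor's output data to the next one.
class BuyMedicine:
    def run(self, data):
        return {**data, "order_id": f"order-for-{data['medicine']}"}

class ArrangeDelivery:
    def run(self, data):
        return {**data, "delivery": f"{data['order_id']} -> {data['address']}"}

class Conductor:
    """Orchestrates atomic executors into one composed executor."""
    def __init__(self, *executors):
        self.executors = executors
    def run(self, data):
        for ex in self.executors:
            data = ex.run(data)
        return data

result = Conductor(BuyMedicine(), ArrangeDelivery()).run(
    {"medicine": "aspirin", "address": "mother's home"})
print(result["delivery"])  # order-for-aspirin -> mother's home
```

A real Conductor would also handle failures and synchronization between concurrently running executors; the sketch shows only the sequential data-passing case.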
4. Conclusions
The proposed solution for a semantic web
infrastructure is an open framework, not a closed
solution. This is an essential characteristic given the
contemporary state of the art of the semantic web, its
immaturity, and the need for flexible extensibility.
The particular components are in various stages of
finalization. The central DataPile-based database as well
as the directly attached tools and unification algorithms
are completely implemented; tens of millions of entities
are contained in a real-world pilot deployment.
Nevertheless, the metadata structure should be extended
to be able to contain more complex ontologies.
The technique of filters as base tools for data and
metadata import is also reused from a pilot project [7];
the particular manual importers are yet to be
implemented. The search engine Egothor is fully
operational, and its integration into the infrastructure is
under development. The proposed modules for data and
metadata import and semantics derivation are specified.
The main part of our future work will consist of the
implementation of the remaining components, their
integration, and experimental research including
comparative analysis.
5. References
[1] Alexaki S., Christophides V., Karvounarakis G.,
Plexousakis D., Scholl M.: RQL: A Declarative Query Language
for RDF, in Proceedings of the Eleventh International World
Wide Web Conference, USA, 2002
[2] Bednarek D., Obdrzalek D., Yaghob J., Zavoral F.: Data
Integration Using DataPile Structure, ADBIS 2005, Proceedings
of the 9th East-European Conference on Advances in Databases
and Information Systems, Tallinn, 2005
[3] Broekstra J., Kampman A.: SeRQL: A Second Generation
RDF Query Language, SWAD-Europe Workshop on Semantic
Web Storage and Retrieval, Netherlands, 2003
[4] Dokulil J.: Transforming Data from DataPile Structure into
RDF, in Proceedings of the Dateso 2006 Workshop, Desna,
Czech Republic, 2006, 54-62, Vol-176
[5] Dokulil J.: Pouziti relacnich databazi pro vyhodnoceni
SPARQL dotazu (Using Relational Databases for SPARQL
Query Evaluation), ITAT 2006, Information Technologies -
Applications and Theory, University of P. J. Safarik, Kosice,
2006
[6] Galambos L., Sherlock W.: Getting Started with ::egothor,
Practical Usage and Theory Behind the Java Search Engine,
http://www.egothor.org/, 2004
[7] Kulhanek J., Obdrzalek D.: Generating and Handling of
Differential Data in DataPile-Oriented Systems, IASTED 2006,
Database and Applications DBA 2006, 503-045
[8] Pokorny J., Vojtas P.: A Data Model for Flexible Querying,
in Proc. ADBIS '01, A. Caplinskas and J. Eder eds., Lecture
Notes in Computer Science 2151, Springer-Verlag, Berlin,
2001, 280-293
[9] Seaborne A.: RDQL - A Query Language for RDF, W3C
Member Submission, 2004
[10] Broekstra J., Kampman A., van Harmelen F.: Sesame: An
Architecture for Storing and Querying RDF and RDF Schema,
ISWC 2002, Springer-Verlag, LNCS 2342
[11] McBride B.: Jena: A Semantic Web Toolkit, IEEE Internet
Computing, Vol. 6, Issue 6, November 2002