Semantic Web Infrastructure using DataPile

Jakub Yaghob, Filip Zavoral
Department of Software Engineering, Charles University in Prague
{Jakub.Yaghob, Filip.Zavoral}@mff.cuni.cz

This work was partly supported by the project 1ET100300419 of the Program Information Society of the Thematic Program II of the National Research Program of the Czech Republic.

Abstract

Years of research and development of technologies and tools have not led to the expected widespread adoption of the semantic web. We consider the practical nonexistence of an infrastructure for operating the semantic web to be one of the main reasons for this state. In this paper we propose such an infrastructure based on the DataPile technology, describe the relevant tools we have developed, and discuss their integration with web search engines and other tools.

1. Introduction - Current State of Semantic Web Implementation

If the semantic web were half as good as proclaimed by its seers, it would be widely adopted by commercial parties and everyone would use it daily, just as they use today's web. The reality is quite different: the semantic web is still not in common everyday use. One of the main reasons is the nonexistence of a unified infrastructure in which the semantic web could be effectively tested and operated.

This problem can be easily demonstrated by a comparison with the 'plain' web, whose infrastructure is stable and in steady use: web servers store web pages on their disks and typically provide the data over the HTTP or HTTPS protocols, and the data in HTML format, possibly complemented by some active elements, is displayed to the client. The semantic web has no such 'standard' infrastructure. There are developed and relatively stable standards for ontology description (OWL), and there are proposed and implemented specialized query languages (SPARQL [5], RQL [1], SeRQL [3], RDQL [9]) and data/metadata storage systems (Sesame [10], Jena [11]). However, the overall infrastructure remains unclear, which is one of the reasons for the unsatisfactory spread of the semantic web.

The following chapters are organized as follows: Chapter 2 presents the DataPile system as a data storage for the semantic web, Chapter 3 describes the individual infrastructure modules, and Chapter 4 summarizes the current status and outlines subsequent work.

2. DataPile Systems and Their Relevance to the Semantic Web

2.1. The DataPile

The principal idea behind the DataPile [2] is data verticalization, i.e. it abandons the traditional horizontal layout of a relational database table. In the DataPile, each column of a 'traditional' row is represented as one row of a database table, and the corresponding attributes (columns) of an entity (row) are connected together by the entity identification. Moreover, because all attributes of all entity types have the same physical representation, a single relational database table is used for storing all data. As a data structure, the DataPile needs a second table which stores the logical structure of the entities. This table holds metadata that can be used as a source for inference; we call it the metatable. We have shown [4] that the DataPile is very suitable as a storage for semantic web data.
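The following minimal sketch illustrates the verticalized layout on a toy example using SQLite; all table and column names (data_pile, metatable, entity_id, ...) are illustrative only and are not the names used by the actual DataPile implementation:

    # A minimal sketch of the verticalized DataPile layout, using SQLite.
    # Table and column names are illustrative, not the real DataPile schema.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- one row per attribute value of an entity (vertical storage)
        CREATE TABLE data_pile (
            entity_id  INTEGER,
            attr_type  TEXT,
            attr_value TEXT
        );
        -- metatable: the logical structure of the entity types
        CREATE TABLE metatable (
            entity_type TEXT,
            attr_type   TEXT
        );
    """)

    # the 'traditional' horizontal row (42, 'Alice', 'alice@example.org') of a
    # Person table becomes one data_pile row per column, linked by entity_id
    con.executemany("INSERT INTO metatable VALUES (?, ?)",
                    [("Person", "name"), ("Person", "email")])
    con.executemany("INSERT INTO data_pile VALUES (?, ?, ?)",
                    [(42, "name", "Alice"), (42, "email", "alice@example.org")])

    # reassemble the horizontal view of entity 42 from its vertical rows
    row = dict(con.execute(
        "SELECT attr_type, attr_value FROM data_pile WHERE entity_id = ?", (42,)))
    print(row)  # {'name': 'Alice', 'email': 'alice@example.org'}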
The DataPile was originally developed as a common data structure for the integration of heterogeneous information systems [2]. The original requirements laid on the DataPile as a central data storage for data integration were:

1. keeping all legacy applications,
2. data unification from different sites,
3. active distribution of data changes,
4. the possibility to add other collected and stored data,
5. information about data changes along the time axis.

All these requirements correspond very well with the requirements laid on a data storage for the semantic web:

1. Data sources are distributed over the whole web and we cannot enforce or influence their data schemes.
2. Data stored in one instance of the DataPile corresponds to one ontology. During data import from other sources, the data is transformed (unified) into the sustained ontology.
3. The DataPile is able to ensure data export, including active export (push), which is especially useful for executors (see below).
4. It is possible to change the data structure in the DataPile relatively easily. The data structure is not directly dictated by a relational database scheme; it is determined by the content of the metatables, where the structure is kept. Changes in the data structure therefore fit well with changes in our sustained ontology.
5. The DataPile keeps information about the time history of the data in such a way that we are able to discover the state of the world at any time point.

2.2. Data Storage in the DataPile and the Relation between RDF and the DataPile

Leaving performance optimizations aside, the data in the DataPile can be stored in one relational table which contains one row for each attribute value of any entity in a given time segment. Each row of this table contains the following columns: entity ID, attribute type, attribute value, data source, validity from, validity to, and data relevance. The first three columns can be easily mapped to the RDF data model: the entity represents a subject, the attribute type represents a predicate, and the attribute value is an object. The remaining columns are described in the next subchapter.

Because the DataPile was applied in a particular large information system, we were able to collect a large amount of real data (tens of millions of entities). Since the data in the DataPile is in fact RDF data, we have obtained a very large amount of real RDF data. The only small disadvantage is that some of the data is private and cannot be published, which can be rectified by proper data obfuscation.

2.3. Reifications

The "data source" column from the previous subchapter denotes the source of the data, the "data relevance" column represents our estimation of the credibility of the source and the data, and the two columns "validity from" and "validity to" enclose the time segment in which the data is valid. These columns represent reifications of the RDF statement represented by the row. Modeling such reifications in the pure RDF format would make the data up to five times bigger, and searching using the reifications would be very slow.
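A minimal sketch of this row layout and its RDF reading follows; the column names are again illustrative, dates are ISO strings for brevity, and '9999-12-31' is an assumed marker for an open validity end:

    # A sketch of the seven-column DataPile row and its RDF reading;
    # column names and the open-end date marker are illustrative only.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE data_pile (
            entity_id  INTEGER,  -- RDF subject
            attr_type  TEXT,     -- RDF predicate
            attr_value TEXT,     -- RDF object
            source     TEXT,     -- reification: where the statement came from
            valid_from TEXT,     -- reification: start of the validity segment
            valid_to   TEXT,     -- reification: end of the validity segment
            relevance  REAL      -- reification: estimated credibility
        )
    """)
    con.executemany("INSERT INTO data_pile VALUES (?,?,?,?,?,?,?)", [
        (42, "worksFor", "DeptA", "hr-import", "2004-01-01", "2005-06-30", 0.9),
        (42, "worksFor", "DeptB", "hr-import", "2005-07-01", "9999-12-31", 0.9),
    ])

    # the state of the world at a time point: the triples valid at that instant
    t = "2005-01-15"
    for s, p, o in con.execute(
            """SELECT entity_id, attr_type, attr_value FROM data_pile
               WHERE valid_from <= ? AND ? < valid_to""", (t, t)):
        print(f"<{s}> <{p}> \"{o}\" .")  # only the DeptA row is valid at t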
2.4. Context Ontology and Ontology Mapping

The sustained ontology stored in the DataPile metatables is a context ontology, i.e. it describes only the data stored in the DataPile. The data grows over time and the ontology changes as well, typically by augmentation (adding other areas of interest), seldom by modification. The DataPile metatables cope with such ontology changes without changes in the data part of the DataPile.

One of the biggest current problems impeding the worldwide expansion of the semantic web is the problem of ontology mapping. Using a context ontology means mapping it to other existing ontologies during data and metadata import. The metadata stored in the DataPile provides a framework for ontology mapping research.

3. Semantic Web Operating Infrastructure

The proposed semantic web infrastructure is based on the DataPile, where all the data and metadata are stored. The DataPile supports four types of interfaces: data import, import and merging of metadata, querying, and executors. Several modules use these interfaces and together provide complex semantic web functionality.

[Figure 1. Modules involved in the infrastructure: web data sources, search engine, data import filters, semantics deductor, application servers, the DataPile with its query processor, the query modules (SPARQL, multicriterial querying, SDE Tykadlo), the Conductor, and executors in an execution environment (SOAP, CORBA, ...).]

3.1. Importers

The data import interface enables importing and updating the DataPile data. Typical importers are filters converting data from an arbitrary source (a database, XML, the web, ...) into a physical representation acceptable by the DataPile; the logical format of the data is converted by the filter to conform to the context ontology. An important capability of importers is the ability to detect and update existing data. The data unification interface enables the use of various implementations of unification algorithms that match existing and imported data using significant and relevant attributes.

Web search engines are especially important among the importers, as they interconnect the semantic web with the existing web. We use the Egothor system [6], whose modular architecture is very well suited for cooperation with the DataPile. In the original implementation, Egothor builds a record of a part of the web in the inverse vector representation; newly added extraction modules store the filtered incoming data into the DataPile instead of building the vectors.

The metadata naturally becomes obsolete over time; therefore the metadata importers play an important role in keeping the semantic web up-to-date. Metadata importers are classified as manual or automatic. An import filter from a legacy system is a typical example of a manual importer. The development of manual importers is relatively expensive; moreover, this approach does not make it possible to import data from the contemporary web, which contains a plethora of unstructured data. We are therefore working on a framework for automatic metadata importers. Automatic importers based on various heuristic, statistical, probabilistic or artificial intelligence methods may automatically generate metadata from the processed data sources. The Egothor module for automatic derivation of semantics from the processed data is an example of such an importer.

3.2. Query Engines

In contrast to traditional relational databases with a fixed data schema and a standardized query language (SQL), querying semantic web data is still in the stage of proposals and pilot implementations. Our proposed framework is built upon a general query interface providing means for submitting queries and for fetching and processing results. This interface is intentionally wide and general, so that both parts (querying and processing of results) may be highly heterogeneous.

Querying itself is complicated by the fact that the user typically does not know the data structure, which is extremely large and dynamic in time. Therefore, one of the query modules is the Semantic Driven Explorer (SDE) named Tykadlo, which is able to search, display and explore metadata, filter the data related to the selected metadata, and explore further semantically related data.

The SPARQL [5] module converts a SPARQL query into the corresponding SQL form and forwards it to the SQL engine for evaluation.
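The following sketch shows the idea of such a translation for a single basic graph pattern over the vertical table from Chapter 2.2; bgp_to_sql is our own toy function written for illustration, not the module's interface, and it ignores full SPARQL syntax and proper value quoting:

    # An illustrative translation of one SPARQL basic graph pattern into SQL
    # over the vertical table; a simplification, not the real module.
    def bgp_to_sql(triples):
        """triples: (subject, predicate, object) patterns; '?x' marks variables.
        Every pattern scans data_pile once; every shared variable becomes a
        join condition."""
        selects, tables, wheres = [], [], []
        bound = {}  # variable -> the first SQL column that binds it
        for i, (s, p, o) in enumerate(triples):
            tables.append(f"data_pile t{i}")
            for col, term in (("entity_id", s), ("attr_type", p),
                              ("attr_value", o)):
                ref = f"t{i}.{col}"
                if term.startswith("?"):       # a variable
                    if term in bound:
                        wheres.append(f"{ref} = {bound[term]}")  # join
                    else:
                        bound[term] = ref
                        selects.append(f"{ref} AS {term[1:]}")
                else:                          # a constant
                    wheres.append(f"{ref} = '{term}'")
        return (f"SELECT {', '.join(selects)} FROM {', '.join(tables)}"
                + (f" WHERE {' AND '.join(wheres)}" if wheres else ""))

    # SELECT ?who WHERE { ?who worksFor ?d . ?d locatedIn "Prague" }
    print(bgp_to_sql([("?who", "worksFor", "?d"), ("?d", "locatedIn", "Prague")]))

As the sketch suggests, each triple pattern contributes one scan of the vertical table and each shared variable contributes a join condition, so the number of joins grows with the size of the pattern.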
Although this query method is adequate for simple queries, more complex queries with many joins are evaluated inefficiently.

The Multicriterial Query Module [8] is suitable for specifying several criteria and user preferences, while the resulting data set is qualified by a monotone function of the particular criteria. The task of this module is to find the best or the best-k answers. The module can be extended into a Dynamical Multicriterial Query Module, which uses previous results (of the same user or of other users) to improve the preciseness of the answer.

The query engine is prepared to adopt further querying modules, based e.g. on artificial intelligence, advanced linguistic, or data mining methods.

3.3. Executors

Traditional result representation is tightly coupled to the query method; for example, the SDE displays interconnected pages containing the result data together with their structure and relationships, a search engine displays web links with an appropriate piece of text, and SPARQL returns rows of attribute tuples. The technique of executors brings process models into this infrastructure. The task of an executor is to realize an atomic semantic action, i.e. an interaction of the result data with the outside (not only semantic) world. Atomic executors can be assembled together into complex composed executors. The orchestration, i.e. the mutual interconnection of executors to achieve more complex functionality, is performed by the Conductor module.

This may be illustrated by the following example. One's mother has fallen ill and needs a medicine. A query module searches for nearby pharmacies with the medicine available, and the result is passed to the Conductor. One executor is responsible for buying the medicine, while another one arranges the delivery to the mother's home. The Conductor orchestrates the two executors to synchronize, to mutually cooperate, and to pass the relevant data between them.
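A toy sketch of such an orchestration follows; all names (Executor, Conductor, buy_medicine, ...) are invented for illustration and do not describe the actual Conductor module's interface:

    # A toy sketch of executor orchestration following the pharmacy example;
    # the names are invented and do not describe the real Conductor interface.
    from typing import Callable, Dict, List

    Executor = Callable[[Dict], Dict]  # an atomic semantic action

    def buy_medicine(ctx: Dict) -> Dict:
        print(f"buying {ctx['medicine']} at {ctx['pharmacy']}")
        return {**ctx, "order_id": "A-17"}  # hypothetical order identifier

    def arrange_delivery(ctx: Dict) -> Dict:
        print(f"delivering order {ctx['order_id']} to {ctx['address']}")
        return ctx

    class Conductor:
        """Chains atomic executors; each output is passed to the next one."""
        def __init__(self, executors: List[Executor]):
            self.executors = executors

        def run(self, ctx: Dict) -> Dict:
            for execute in self.executors:
                ctx = execute(ctx)  # synchronize and pass the data on
            return ctx

    # result of the query module: a nearby pharmacy stocking the medicine
    result = {"medicine": "aspirin", "pharmacy": "Pharmacy A", "address": "home"}
    Conductor([buy_medicine, arrange_delivery]).run(result)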
4. Conclusions

The proposed solution for a semantic web infrastructure is an open framework, not a closed solution. This is an essential characteristic, given the contemporary state of the art of the semantic web, its immaturity, and the need for flexible extensibility.

The particular components are in various stages of completion. The central DataPile-based database, as well as the directly attached tools and the unification algorithms, are completely implemented, and a real-world pilot deployment contains tens of millions of entities. Nevertheless, the metadata structure should be extended to be able to hold more complex ontologies. The technique of filters as the base tools for data and metadata import is also reused from a pilot project [7]; the particular manual importers are yet to be implemented. The search engine Egothor is fully operational, and its integration into the infrastructure is under development. The proposed modules for data and metadata import and semantics derivation are specified.

The main part of our future work will consist in the implementation of the remaining components, their integration, and experimental research including a comparative analysis.

5. References

[1] Alexaki S., Christophides V., Karvounarakis G., Plexousakis D., Scholl M.: RQL: A Declarative Query Language for RDF. In Proceedings of the Eleventh International World Wide Web Conference, USA, 2002.
[2] Bednarek D., Obdrzalek D., Yaghob J., Zavoral F.: Data Integration Using DataPile Structure. In Proceedings of the 9th East-European Conference on Advances in Databases and Information Systems (ADBIS 2005), Tallinn, 2005.
[3] Broekstra J., Kampman A.: SeRQL: A Second Generation RDF Query Language. SWAD-Europe Workshop on Semantic Web Storage and Retrieval, Netherlands, 2003.
[4] Dokulil J.: Transforming Data from DataPile Structure into RDF. In Proceedings of the Dateso 2006 Workshop, Desna, Czech Republic, 2006, pp. 54-62, Vol-176.
[5] Dokulil J.: Using Relational Databases for SPARQL Query Evaluation (in Czech). ITAT 2006, Information Technologies - Applications and Theory, University of P. J. Safarik, Kosice, 2006.
[6] Galambos L., Sherlock W.: Getting Started with ::egothor - Practical Usage and Theory Behind the Java Search Engine. http://www.egothor.org/, 2004.
[7] Kulhanek J., Obdrzalek D.: Generating and Handling of Differential Data in DataPile-Oriented Systems. IASTED Databases and Applications (DBA 2006), 503-045, 2006.
[8] Pokorny J., Vojtas P.: A Data Model for Flexible Querying. In Proceedings of ADBIS'01, A. Caplinskas and J. Eder (eds.), Lecture Notes in Computer Science 2151, Springer-Verlag, Berlin, 2001, pp. 280-293.
[9] Seaborne A.: RDQL - A Query Language for RDF. W3C Member Submission, 2004.
[10] Broekstra J., Kampman A., van Harmelen F.: Sesame: An Architecture for Storing and Querying RDF and RDF Schema. In ISWC 2002, Lecture Notes in Computer Science 2342, Springer-Verlag, 2002.
[11] McBride B.: Jena: A Semantic Web Toolkit. IEEE Internet Computing, Vol. 6, Issue 6, November 2002.