The Semantic Web: Has the DB Community missed the bus (again)?

Vipul Kashyap, National Library of Medicine, NIH
kashyap@nlm.nih.gov
3 April, 2002

Abstract

There is widespread interest across research communities in the issues related to the Semantic Web. In this white paper, we try to understand the reasons behind the popularity of the current “Syntactic Web” and to identify factors that might lead to the success of the “Semantic Web”. The DB and IS communities were among those that completely missed the bus when it came to the Web, and they may miss it again when it comes to the “Semantic Web”. The research issues in enabling the Semantic Web are organized using a “layered” semantic networking metaphor, and the various problem components are identified. Interestingly, we observe that DB research has spanned all of these problem components, and that the theme of the Semantic Web can provide a unifying framework for them. We conclude by proposing a set of critical problems that DB researchers are well positioned to solve, and by identifying crucial assumptions that the DB community has to adopt in order to address (successfully) the research issues related to the Semantic Web.

The Success of the “Syntactic” Web

The current success of the Web is far beyond what was ever imagined. It was primarily conceived as a means for physicists at CERN to share scientific data with each other. However, it has since blossomed into a worldwide infrastructure for data and knowledge exchange and e-commerce transactions, involving technologies as diverse as databases, user interfaces, hypermedia, internetworking protocols, distributed object computing, and machine learning/data mining.
Why was the web so hugely successful and, more importantly, why did the major technology areas on which the web critically depends fail to anticipate its sudden emergence? We feel that the answers to these questions are critical to understanding the feasibility, and predicting the success, of the Semantic Web vision. Let us analyze the success of the web along the following major dimensions:

- Technology: An argument can be made that the success of the web was due to the sophistication of the underlying technology that enabled it. However, internet protocols such as telnet, ftp and gopher; DBMS servers; distributed object computing frameworks such as CORBA/RMI; and hypermedia-based systems existed long before the web came into being. Why is it that none of these component areas was able to anticipate the web?
- Multimedia: The ability to put up multimedia information probably contributed to the success of the web, as it is “cognitively easier” for people to browse multimedia information (a picture is worth a thousand words) than text documents and database tables. This, we believe, may be a contributory factor in the success of the web.
- Ease of use: The most important reason for the success of the web is that the technology is simple enough to attract a critical mass of adherents. For example, it is very easy to publish a (multimedia) web page and to “jump” from one document to another in the web hyperspace. This was due to the effective use of hypermedia technology to insulate users from protocol-specific commands.
- A Sociological Experiment: A more accurate description of the web is that it is a sociological experiment rather than a technological invention. Its ease of use lets people use the web infrastructure to exchange information and data with each other. The introduction of instant messaging services enhances the feel of a “virtual society” on the web.
It is clear from the above discussion that technology played a relatively minor role in the success of the web, even though the web critically depends on its various technological components. We need to think afresh about how the semantic web can make things easier for users and enhance their web experience. The DB research community, for its part, needs to re-examine its assumptions, especially regarding the deployment and usability of the technologies being developed. We now discuss the research issues for enabling the Semantic Web using the multi-layer networking metaphor.

Research Issues for the Semantic Web: The multi-layered network stack metaphor

In this section, we borrow a metaphor from arguably the most successful research community, that of network and internet devices and protocols. The metaphor is that of the network protocol stack, where one layer of the stack depends on one or more layers below it. One may view information at successive layers as increasingly “semantic”, as one goes from physical signals, to bits and bytes, to data frames and packets, to objects and methods, to entities and relationships, to processes. We now organize the research issues for the Semantic Web according to this metaphor, as illustrated in Figure 1.

Building on the success of the data networking and middleware communities, Figure 1 organizes and relates the semantic web efforts along multiple layers, some of which are described below:

- Object Interoperability: This is the layer at which current middleware products in the industry are aimed. However, these objects are primarily defined as containers for software and as a means of streamlining the software development process. The CORBA and EJB object models are examples of standards at this layer.
- Meta-Model Interoperability: This is the layer at which the cross-over from the “data” space to the “knowledge” space takes place.
The objects here are viewed as containers of knowledge to be fleshed out by the upper layers. The OKBC and RDF(S) core models are examples of standards at this layer.
- Ontology Interoperability: This is the layer where ontologies, schemas and classifications are built upon common, standardized underlying meta-models. The ability to use different ontologies to specify and query information constitutes interoperability at this layer.
- Meta-Data (View/Query) Interoperability: Semantic metadata descriptions can be constructed from one or more underlying ontologies. The issue at this layer is to decompose information requests into those supported by the individual semantic metadata descriptions corresponding to the information sources.

Figure 1: The multi-layered stack metaphor for the Semantic Web

Organizing semantic web research along these layers helps us structure the work required to build out the underlying infrastructure of the semantic web. The issues that arise are the development of standards and industry-wide APIs at each of the layers, and the building of semantic-web-specific functions such as semantic routing and “semantic” content delivery networks. Specification of further application layers may also be required.

An interesting research topic arises at the “crossover” from the Distributed Object Computing world to the Semantic World, i.e., interoperation across meta (or data) models such as frame-based and object-oriented models. However, the most interesting “semantic” issues arise from the meta-model layer upwards, as we expect the semantic web community to standardize either on one rich meta-model or on a limited set of meta-models with mappings across them. We visualize the semantic web fabric as a collection of ontologies and metadata descriptions, together with the inter-relationships and correspondences across them.
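
As an illustration of the request decomposition described for the Meta-Data (View/Query) Interoperability layer, the following is a minimal sketch in Python. It assumes, purely for illustration, that a metadata description can be reduced to the set of ontology terms a source supports; the source names, the terms and the function decompose_request are hypothetical and are not part of any standard discussed here.

```python
from typing import Dict, Set, Tuple

# Hypothetical metadata descriptions: each information source advertises
# the ontology terms it can answer requests about.
SOURCE_DESCRIPTIONS: Dict[str, Set[str]] = {
    "bibliography_db": {"Author", "Title", "Year"},
    "citation_index": {"Title", "CitedBy"},
}


def decompose_request(
    requested_terms: Set[str],
    descriptions: Dict[str, Set[str]],
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Split a request into per-source sub-requests.

    Returns (sub_requests, unsupported): sub_requests maps each source to
    the requested terms it supports; unsupported holds the requested terms
    that no source description covers.
    """
    sub_requests: Dict[str, Set[str]] = {}
    covered: Set[str] = set()
    for source, supported in descriptions.items():
        overlap = requested_terms & supported
        if overlap:
            sub_requests[source] = overlap
            covered |= overlap
    return sub_requests, requested_terms - covered


if __name__ == "__main__":
    subs, missing = decompose_request(
        {"Author", "Title", "CitedBy", "Abstract"}, SOURCE_DESCRIPTIONS
    )
    for source in sorted(subs):
        print(source, sorted(subs[source]))
    print("unsupported:", sorted(missing))
```

A real system would also have to recombine the partial answers and handle terms supported by several sources; the sketch only shows the split and makes the unsupported remainder explicit.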
The Semantic Web Fabric: A Collection of Metadata Descriptions and Ontologies

One way of visualizing the Semantic Web is illustrated in Figure 2, as a collection of ontologies corresponding to different domains and user communities, and of metadata descriptions constructed from those ontologies. Ontologies have been identified as the crucial component for capturing and representing semantics. The same information may be captured from differing perspectives using different ontologies, and inter-ontology interoperation is the key problem that needs to be addressed in order to make the semantic web a reality. Information requests specified using a particular ontology have to be transformed into similar requests expressed in the terms used in other ontologies.

[Figure 2 components: User Queries/Information Requests; Languages; Inter-Ontology Relationships Manager; Ontology Servers; Metadata Servers; Metadata Repositories; Distributed Computing Infrastructure (J2EE, .NET, CORBA, Agents) for ...; Data Repositories]

Figure 2: The Semantic Web Fabric: A Collection of Metadata and Ontologies

This necessitates an infrastructure with components that manage ontologies, inter-ontological relationships, metadata descriptions constructed from the various ontologies, and mappings of these metadata descriptions to the various data and information sources on the web.
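
The transformation of a request from the terms of one ontology into those of another can be sketched as follows. This is only an illustration under simplifying assumptions: inter-ontology relationships are reduced to a lookup table from a term in a hypothetical ontology A to a term in a hypothetical ontology B, labelled as either a synonym (exact) or a hypernym (broader, hence lossy). The terms, the relation names and the function translate_request are all invented for the example.

```python
from typing import Dict, List, NamedTuple, Tuple


class Mapping(NamedTuple):
    target_term: str
    relation: str  # "synonym" (exact) or "hypernym" (broader, so lossy)


# Hypothetical inter-ontology relationships from ontology A to ontology B.
A_TO_B: Dict[str, Mapping] = {
    "Physician": Mapping("Doctor", "synonym"),
    "Cardiologist": Mapping("Doctor", "hypernym"),
    "Drug": Mapping("Medication", "synonym"),
}


def translate_request(
    terms: List[str],
    mappings: Dict[str, Mapping],
) -> Tuple[List[str], List[str], List[str]]:
    """Rewrite request terms from ontology A into ontology B.

    Returns (translated, approximate, untranslated):
    - translated: target terms, without duplicates, in first-seen order
    - approximate: source terms that were mapped through a non-exact relation
    - untranslated: source terms with no known mapping
    """
    translated: List[str] = []
    approximate: List[str] = []
    untranslated: List[str] = []
    for term in terms:
        mapping = mappings.get(term)
        if mapping is None:
            untranslated.append(term)
            continue
        if mapping.target_term not in translated:
            translated.append(mapping.target_term)
        if mapping.relation != "synonym":
            approximate.append(term)
    return translated, approximate, untranslated
```

Reporting the approximate and untranslated terms separately, rather than silently dropping them, keeps visible where the translated request differs in meaning from the original one.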
The component functions that are crucial for enabling the semantic web are:

- Bootstrapping, Creation and Maintenance of Semantic Knowledge
  o Collaborative and Sociological Processes, Statistical Techniques
  o Ontology Building, Maintenance and Versioning Tools
- Re-use of Existing Semantic Knowledge (Ontologies)
- Annotation/Association/Extraction of Knowledge with/from Underlying Data
- Information Retrieval and Analysis (Distributed Querying/Search/Inference Middleware)
- Semantic Discovery and Composition of Services
- Distributed Computing/Communication Infrastructures
  o Component based technologies, Agent based systems, Web Services
- Repositories for managing data and semantic knowledge
  o Relational Databases, Content Management Systems, Knowledge Base Systems

As enumerated above, the range of issues involved in Semantic Web research is wide and spans multiple disciplines and research areas. Surprisingly, even though the database community has not taken up issues related to the Semantic Web, work done within the community has spanned almost all the categories mentioned above. This is the topic of the next section.

DB Research and the Semantic Web

The areas of DB research that overlap with the Semantic Web effort are as follows:

- Semantic Data Models: Database researchers have worked on various types of semantic data models with constructs at higher levels of abstraction, such as generalizations and aggregations. The main focus of this work, however, was support for queries at a higher level of abstraction and efficient indexing structures for them. Inference based on semantics was not the main focus of this area of work.
- Multi-database Schema Heterogeneity and Schema Integration: There is a wide body of work in the multidatabase literature that attempts to identify and enumerate various schema heterogeneities, and techniques for resolving those heterogeneities in the context of schema integration.
Attempts have also been made to use domain ontologies for the integration of data across multiple databases.
- Schema Evolution: Even though the database schema has for the most part been assumed to be relatively static, there has been work on schema evolution and versioning in the context of object-oriented databases.
- Object Oriented/XML Databases: Specialized databases, such as object-oriented databases in the 90s and XML databases today, have been, and are being, developed to address the specialized needs of web content and complex data.
- Deductive Databases/Rule Based Systems: Rule-based approaches for handling and manipulating data have been implemented in various deductive database prototypes. Rule-based approaches are also visible in the implementation of triggers in commercial relational database systems.
- Mediators and Wrappers: The availability of non-traditional (non-relational) data sources on the web created a need for exporting a “relational” view of the underlying data. Wrappers focused on encapsulating a data source in a relational or an object-relational model, whereas mediators focused on partitioning queries and combining results from multiple data sources.
- Multidatabase/Federated Database Query Processing: Query processing across multiple autonomous databases has been a significant endeavor in the federated database field, and frameworks to support mappings, as well as query decomposition algorithms, have been proposed.
- Data Mining: The presence of huge amounts of data in corporate databases (compounded by the data explosion on the internet) has given rise to the need for automatically “mining” patterns from databases, in order to arrive at insights that can be applied to derive further business efficiencies. The data mining field has focused on developing scalable and efficient algorithms for this purpose.
- Probabilistic Databases: Though not a part of the mainstream database literature, there has been a significant amount of work on the storage, manipulation and querying of probabilistic data.
- Workflow-based Coordination Systems: Work has been done on the definition of task-based workflow processes and on their control and coordination. There has been work on the dynamic instantiation and combination of workflows, and there have been approaches to re-use this work in the context of web services.
- Security in Database Systems: Security research in database systems has focused on the specification of access and authorization policies based on group membership, and on techniques to enforce the specified policies and prove them correct.
- Multimedia Databases: The explosion of data on the web has given rise to specialized databases dealing with multimedia information such as text, images and video data.

Based on the above discussion, it may be observed that even though the database community has not bought into the Semantic Web vision, work has been done across all the problem components crucial for the Semantic Web. We believe that the Semantic Web provides the underlying theme that can tie together all of these disparate pieces of work. We now discuss the gaps that need to be addressed to make the Semantic Web a reality, and what DB research can contribute in that context.

Missing Gaps

We now discuss some of the critical gaps in Semantic Web research and how the DB community can respond to these challenges:

- Ontology Impedance/Integration/Interoperation: Ontology impedance may be defined as the semantic mismatch between two or more ontologies that are being merged. However, the ontology integration problem is slightly different from the schema integration problem, as it focuses on the semantics of the relationships and on domain-specific constraints on the information. Work needs to be done to estimate the loss of information that results from this impedance.
Schema integration work may prove to be a good starting point for ontology integration and interoperation.
- Scalability/Performance: The scalability of web servers serving semantic web content is a critical issue on which the future of the semantic web depends. Work is needed on techniques that exploit “semantics” to design better caching schemes, e.g., semantic content distribution networks. There is also a need for metrics and measurements to evaluate how well algorithms for the semantic web will perform and scale. In general, this has been the strong point of DB research, and it is the area where a significant contribution from the database community is most likely.
- Dynamic Ontologies: A fundamental but flawed assumption made by the database research community is that database schemas, and by extension ontologies, are static in nature. Real-world ontologies are likely to be dynamic and to evolve over time, and algorithms and techniques should be designed on the assumption that they change.
- Semantic Metadata Extraction: Two crucial factors that will determine the success of the semantic web are the ease and cost of developing and maintaining ontologies, mappings and articulation rules, and the ease of constructing semantic annotations. Tools that drive the extraction process based on text processing and NLP techniques are important, since most of the data on the web is textual. There is significant work on mappings and related topics in the federated database literature, and this needs to be augmented with techniques from Information Retrieval (clustering) and NLP.
- Inferences based on the Semantics of the Data: The DB community has focused more on issues such as indexing and caching, where the schemas and the range of queries are known ahead of time. However, on the semantic web, where ontologies are dynamic and user requests may change over time, inferences based on the semantics of the data might be an important tool for addressing some of these issues.
- Semantics of Multimedia Data: There is a need to focus more on non-traditional data such as text, images and video. The challenge before the DB community is to evolve a data model that is as simple and effective as the relational model, together with a query language similar to SQL, that treats structured and unstructured data in a uniform manner.
- Semantics of Processes/Plans/Workflows: Whereas the DB community has done a lot of work on task- and process-based workflow, there is a need for the ability to map high-level semantic descriptions to workflow instances and to compose existing workflow instances on the fly. Once again, the DB community is well positioned to respond.
- Digital Rights Management: The appearance of new types of multimedia data creates a need for digital rights management, to which the DB community should respond. This is a new area for DB research, as the need for “watermarking” relational data was never felt.

We believe that the next wave of research will focus on re-using data models/ontologies/schemas in an open and dynamic environment. This requires the DB community to change its assumptions and think “out of the box” in order to make an impact.

Conclusion

In general, we believe that the database research community is well positioned to address the challenges involved in enabling the Semantic Web. Furthermore, the Semantic Web theme serves as a good unifying framework for pulling together the disparate pieces of work being performed by researchers in the database community. However, there is a need to think “outside the box” and change some of the underlying assumptions in order to make an impact in this area:

- Data Models/Schemas/Ontologies will form the critical infrastructure for the Semantic Web. More attention should be paid to issues such as model manipulation, management and querying.
- Re-use of pre-existing data models/schemas/ontologies is crucial in describing the semantics of the various information sources; that is, issues at this layer must receive the same level of attention as issues related to data management.
- There is a need to relax consistency and completeness requirements and to estimate the “error” in the results returned. The semantics of the information should be used to minimize the “error” in the information obtained.
- The new environment is likely to be more “dynamic” in nature: schemas, workflows, queries, etc. can no longer be assumed to be static.

We believe that if the DB community adapts to these requirements, it stands a good chance of making an impact; otherwise, it might miss the bus, again!