Event Report

Report authors: Jessie Kennedy, Iain Coleman
Event organiser(s): Jessie Kennedy, Peter Robson
Title of event: The Closed World of Databases meets the Open World of the Semantic Web
Date of event: 12th October 2006 to 13th October 2006 (2 days)
Target Audience: The workshop will be of interest to researchers in both communities and to scientists and data managers responsible for managing data where semantic web technologies are being used along with database technologies.

Objectives: (Please restate the original objectives of the event in bullet point form.)
- Explore the issues around the closed world assumption (CWA) used in the database community and the open world assumption (OWA) used in the semantic web community
- Review the CWA and OWA from both communities
- Discuss the implications for data processing when the two worlds meet, such as how to interpret the results of queries returned from a database as opposed to those returned from a reasoning engine
- Present the problem of nullology, i.e. the study of the empty set, and discuss how this should be interpreted
- Explore the issue of how to deal with missing information, and the question of what is missing information
- Discuss the identification of objects or concepts, including the mechanisms for identifying objects or concepts, how they may be referenced, how the approaches differ between databases and the semantic web, and whether they can work together effectively

Chronology of Event: (Details of talks, discussions and breakout sessions, with links to any supporting materials on the web. Please include the following information: For talks, the talk title, speaker’s name, and a brief summary of the talk; For discussions and breakout sessions, the subject of the session, name of the session chair (if any), and a brief summary of the discussion.)
Thursday, October 12th:

Data Webs: New Visions for Research Data on the Web
David Shotton
Shotton provided an overview of the main issues involved in exposing databases to the Web. He gave an outline of the semantic web vision, and emphasised that database integration will be the semantic web’s “killer app”. He went on to discuss the data web philosophy, in which a registry of metadata refers users to the original sources of data that match their query. Finally, he gave an example of such a data web: the ImageWeb project that aims to integrate and make cross-searchable research images held by publishers, in institutional repositories, and in specialist research collections.

Closed World Assumption
Chris Date
Date outlined the principles behind the relational database model, emphasising that it is a logical system: a relational database query essentially proves a true logical theorem. He showed that the CWA is critically important in maintaining this formal integrity. Regarding the issue of unknown information, he pointed out that a database query is not really “is X true”, but rather “do we know that X is true”. Bearing this subtlety in mind is crucial to using a relational database under the CWA to deal with unknown information.

Open World Assumption
Nick Drummond and Rob Shearer
Drummond and Shearer gave a joint presentation on semantics, data interpretation and the OWA. Data needs to be interpreted to derive knowledge, and is encoded with respect to a particular interpretation. Formal knowledge representations have well-defined semantics and provide unambiguous interpretations. The OWA considers a set of possible models: the ontology is constrained iteratively, becoming more restrictive as knowledge is added. The OWA is underspecified, readily reusable and extendable, and deals naturally with incomplete information. This is particularly appropriate in the domain of scientific research.

Nullology
Chris Date
Nullology is the study of the empty set.
Date argued that this is often the acid test of whether a database model is correct. He presented a number of different ways the empty set can be used to test a database model, including the properties of relations with no rows, relations with no columns, empty lists of operands, empty partitions of headings, empty left-hand sides in functional dependencies, and empty keys. This analysis illustrated a number of shortcomings with SQL.

The Semantic Gap between Databases and Ontologies
Catherine Dolbear
Dolbear discussed the issues that arise in authoring topographic ontologies, and in linking ontologies to a database, illustrating these with reference to her work with Ordnance Survey data. In creating ontologies, domain experts and ontology engineers have to collaborate in capturing expert knowledge in a formal way. Difficulties arise because domain experts tend to think in a closed world fashion. Software tools for linking ontologies to databases still have some way to go, with efficiency being a very important issue.

Missing Information
Chris Date
Date summarised his presentation as “Don’t use nulls!” He considered the proposal to use three-valued logic to deal with missing information in a relational database, and argued that this leads to pathological cases and a breakdown of logical consistency. A systematic and disciplined system of representing missing information with default values can avoid these problems. Other solutions include having separate tables for known and unknown data, or recording only known data and interpreting a false predicate as “we do not know that the predicate is true”. The interpretation of information in a database is contained in the predicates of database queries: in a sense, database design is predicate design.

Incomplete and Missing Data in Geoscience Databases
Steve Henley
Geoscience data can often be imprecise, incomplete or missing.
Henley gave some simple examples of such data, and showed how a CWA approach does not deal with such data in a scientifically reasonable way. He rejected the default value solution of Chris Date, and also argued that decomposition into separate tables is inappropriate for scientific data. Henley concluded that, for imprecise data, a probabilistic extension of two-valued logic is required, while three-valued logic may be necessary for dealing with missing information.

Integrating Genomic Mapping Data using Ontologies: A View from the Closed World
Trevor Paterson
Paterson showed how scientific data can be analysed and interpreted in a relational database under the CWA, using comparisons of genetic mapping data across different species as an example. Almost all of the data is positive, with negative data rarely recorded, and some of the data is contradictory. Interpreting the results of database queries and reasoning across the knowledge domain is currently performed in cerebro by expert scientists. The challenge in moving to an open world representation lies in capturing the assumptions and assertions of expert reasoning in a domain ontology formalism. Representing data is relatively straightforward: representing relationships between objects is much more challenging.

ComparaGRID Semantic Integration Technology
Matt Pocock
This presentation followed directly on from the previous talk by Paterson. Pocock demonstrated an OWL browser, called “Pussy Cat”, developed for the ComparaGRID genomic data project. Queries that require information from multiple sources are split up into separate queries to each source, the results of which are then reassembled and delivered to the user. The source databases for this project are in fundamentally different and contradictory schemas, even disagreeing on basic concepts like the definition of “gene”.
Deriving a good domain ontology for this work is a difficult and poorly-scoped task, with no widely-validated methodology and a language gap between biologists and modellers.

Friday, October 13th:

Open Ontologies
Dave Robertson
Robertson discussed the issues that arise in achieving consistent processes across different ontologies. He argued that the problem of open knowledge sharing cannot be solved in the general case, but that researchers should not be daunted by this: computer engineering frequently has to cope with problems of this nature. In practical applications, the problem can be simplified or the properties weakened, and engineers can work successfully within these tolerance spaces.

Identity, URIs and the Semantic Web
Henry Thompson
Studying the properties of URIs goes right to the foundations of the Web. Thompson discussed the properties of URI schemes, such as HTTP, and pointed out that URIs typically contain useful metadata, depending on the naming strategy. He advised that it is usually better to stick to the HTTP scheme rather than invent new URIs. The network effect is very powerful: the cost of giving away your data is minuscule compared to the benefits of gaining access to everybody else’s data. When using URIs in the database world, two questions arise: when do database owners need to mint URIs for objects, and what strategies should they use to do so?

Don’t Mix Pointers and Relations
Chris Date
In this presentation, Date argued that pointers to data should not be permitted as entries in a database. He stated that mixing pointers and relations is a major departure from the relational model, adding complexity without increasing power. While there can be good reasons for using pointers in the implementation of a database, that is no reason to expose pointers in the database model itself: it is important to keep the logical and physical levels distinct.
Panel Discussion
All Speakers
In the final session of the meeting, all the speakers assembled as a panel to discuss questions submitted by the meeting participants. A summary of the questions and answers can be found on the meeting wiki at http://wiki.esi.ac.uk/Closed_Verus_Open_World_Questions

Event Achievements: (Please describe how, and to what extent, the event achieved the objectives stated above. Give a narrative description of any new ideas and opportunities which have emerged as a result of the event. Do you see any potential for follow on activity arising from this event, and if so of what type, e.g. new collaborations, research proposals, future events?)

Both the closed world assumption (where everything stated or implied by the database is true and everything else is false) and the open world assumption (where everything stated or implied by the database is true and everything else may or may not be true) were reviewed. Chris Date explained how the CWA is crucial in maintaining the mathematical integrity of the relational database model, and Nick Drummond and Rob Shearer gave an account of how the OWA is a natural way to deal with incomplete information in domains such as science where not all answers are yet known.

At the ontological level, the scientific method can be seen as an open world; however, when proposing scientific theories, the universe of discourse must be carefully defined. Of course, scientists often assume the universe of discourse rather than making it explicit, which can cause problems. The universe may be iteratively redefined, but permitting enlargement (or contraction) without re-evaluation of the impact on other hypotheses (axioms and therefore database design in the database world) is likely to lead to erroneous conclusions. While reasoning within any theory, the closed world assumption (CWA) holds.
Its universe of discourse is determined by the entities modelled, relationships, constraints, dependencies, and by the domains of permissible values. A scientific database should then be considered as testing a theory (however "imprecise"). If the real world data either doesn't fit or creates a contradiction, the theory is invalidated and the database should then be redesigned to meet the next hypothesis. In a scientific database, a reasonable interpretation of True and False under CWA is "valid by experiment and consistent with hypotheses" and "not validated by experiment or inconsistent with hypotheses".

The CWA shouldn’t be confused with an assumption that we must know everything and embed that knowledge in a permanently fixed database schema. All we need is to understand the current boundaries of our application or investigation.

The CWA and OWA are really the extreme ends of a continuum: the OWA starts with no assumptions and, as restrictions are added to the ontology, an open world is progressively closed down. The real differences are between the explicit assumptions made in open world ontologies, and the implicit assumptions that underlie closed world databases. When querying a relational database, all the understanding and interpretation are in the mind of the person formulating the query, while in the open world of the semantic web at least some of this interpretation is formalised in the ontology.

The question of missing information was discussed at length. In the CWA, there can be no such thing as a missing value and all queries are answered with “yes” or “no”. This makes for a mathematically sound database that provides a true response to all queries. In scientific practice, however, missing data crops up routinely and in the OWA “don’t know” can be a meaningful answer.
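
The contrast between the “yes”/“no” answers of the CWA and the “don’t know” answer available under the OWA can be illustrated with a minimal Python sketch. It is not taken from any of the talks: the facts, the names KNOWN_TRUE and KNOWN_FALSE, and the two query functions are all hypothetical, and the sketch checks only explicitly stated facts, whereas a real database or reasoning engine would also consider what those facts imply.

```python
from typing import Optional, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical facts, invented for this illustration only.
KNOWN_TRUE: Set[Fact] = {("gene_A", "maps_to", "chromosome_1")}
KNOWN_FALSE: Set[Fact] = {("gene_A", "maps_to", "chromosome_2")}


def ask_closed_world(fact: Fact) -> bool:
    """CWA: anything that is not stated is taken to be false."""
    return fact in KNOWN_TRUE


def ask_open_world(fact: Fact) -> Optional[bool]:
    """OWA: True or False only when stated; None stands for "don't know"."""
    if fact in KNOWN_TRUE:
        return True
    if fact in KNOWN_FALSE:
        return False
    return None


if __name__ == "__main__":
    unstated = ("gene_B", "maps_to", "chromosome_1")
    print("closed world:", ask_closed_world(unstated))
    print("open world:  ", ask_open_world(unstated))
```

For the unstated fact, the closed world query answers False while the open world query answers None. Following the reading reported from Chris Date's talk, the closed world False is better understood as “we do not know that this is true” than as a positive claim of falsehood.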
The proposal that the two worlds might be reconciled by inserting three-valued logic into the relational database model was considered, but was rejected on the grounds that it can lead to pathological consequences which can undermine the relational database model. Even if some three-valued logic could be made effective, there is more than one type of missing value – unknown values, undefined values, invalid values, and so on – and to treat them all as different truth-values would lead to a multi-valued logic that was prohibitively unwieldy. It was generally agreed that there is as yet no ideal solution to the missing data problem, though some partial solutions were offered to what is recognised as a serious issue.

Even with all data present and correct, researchers still need to be able to capture and communicate metadata that describes it, so that others can access the data for purposes which may be unimagined by its creators. Creating this semantic data – data that tells you what other data means – can be as much a social and political exercise as a scientific one. Trevor Paterson described the challenges faced by the ComparaGRID project in integrating genomic mapping data using ontologies, that is, sets of semantic restrictions that describe the data and the possible relations between data objects. This is tantamount to automating the kind of interpretation of data that is performed by an expert scientist drawing on intuition, experience and training. Formalising this can open up fundamental scientific disagreements, which must be resolved or at least precisely defined before the ontology can be created.
Such disagreements or contradictions that are not subject to resolution would require second-order logic, which permits paradoxes (however, this might then be considered not science!). Even uncontroversial data can pose challenges of multiple interpretation, as Cathy Dolbear illustrated when she described her work in creating a widely useable ontology for Ordnance Survey data which was originally gathered for specific military and mapping purposes.

Social factors are also of great importance in realising the vision of the semantic web. Henry Thompson’s presentation emphasised the social properties of systems like HTTP and DNS, and the invaluable network effects that can come from understanding and using these social factors when opening up databases to the Web. Currently, database owners are using different technologies to create the URIs they will use to expose their databases, but the best approach is as yet unclear.

The event has resulted in an expression of interest in a follow-up event on Semantic Data Integration in collaboration with the Centre for Ecology and Hydrology early in Spring.

Any Other Observations:
This was a particularly lively meeting, with substantial debate during the presentations and a wide-ranging panel discussion that extended well past the scheduled finishing time.