Event Report

Report authors:
Jessie Kennedy, Iain Coleman
Event organiser(s):
Jessie Kennedy, Peter Robson
Title of event:
The Closed World of Databases meets the Open World of the Semantic
Web
Date of event:
12th October 2006 to 13th October 2006 (2 days)
Target Audience:
The workshop will be of interest to researchers in both communities and to
scientists and data managers responsible for managing data where
semantic web technologies are being used along with database
technologies.
Objectives:
(Please restate the original objectives of the event in bullet point form.)
• Explore the issues around the closed world assumption (CWA) used in the database
  community and the open world assumption (OWA) used in the semantic web community
• Review the CWA and OWA from both communities
• Discuss the implications for data processing when the two worlds meet, such as how to
  interpret the results of queries returned from a database as opposed to those returned from a
  reasoning engine
• Present the problem of nullology, i.e. the study of the empty set, and discuss how this should
  be interpreted
• Explore the issue of how to deal with missing information, and the question of what is missing
  information
• Discuss the identification of objects or concepts, including the mechanisms for identifying
  objects or concepts, how they may be referenced, how the approaches differ between
  databases and the semantic web, and whether they can work together effectively
Chronology of Event:
(Details of talks, discussions and breakout sessions, with links to any supporting materials on the web.
Please include the following information:
• For talks, the talk title, speaker’s name, and a brief summary of the talk;
• For discussions and breakout sessions, the subject of the session, name of the session chair
  (if any), and a brief summary of the discussion.)
Thursday, October 12th:
Data Webs: New Visions for Research Data on the Web David Shotton
Shotton provided an overview of the main issues involved in exposing databases to the Web. He gave
an outline of the semantic web vision, and emphasised that database integration will be the semantic
web’s “killer app”. He went on to discuss the data web philosophy, in which a registry of metadata
refers users to the original sources of data that match their query. Finally, he gave an example of such
a data web: the ImageWeb project that aims to integrate and make cross-searchable research images
held by publishers, in institutional repositories, and in specialist research collections.
Closed World Assumption Chris Date
Date outlined the principles behind the relational database model, emphasising that it is a logical
system: a relational database query essentially proves a true logical theorem. He showed that the
CWA is critically important in maintaining this formal integrity. Regarding the issue of unknown
information, he pointed out that a database query is not really “is X true”, but rather “do we know that
X is true”. Bearing this subtlety in mind is crucial to using a relational database under the CWA to deal
with unknown information.
Open World Assumption Nick Drummond and Rob Shearer
Drummond and Shearer gave a joint presentation on semantics, data interpretation and the OWA.
Data needs to be interpreted to derive knowledge, and is encoded with respect to a particular
interpretation. Formal knowledge representations have well-defined semantics and provide
unambiguous interpretations. The OWA considers a set of possible models: the ontology is
constrained iteratively, becoming more restrictive as knowledge is added. The OWA is underspecified,
readily reusable and extendable, and deals naturally with incomplete information. This is particularly
appropriate in the domain of scientific research.
Nullology Chris Date
Nullology is the study of the empty set. Date argued that this is often the acid test of whether a
database model is correct. He presented a number of different ways the empty set can be used to test
a database model, including the properties of relations with no rows, relations with no columns, empty
lists of operands, empty partitions of headings, empty left hand sides in functional dependences, and
empty keys. This analysis illustrated a number of shortcomings with SQL.
The Semantic Gap between Databases and Ontologies Catherine Dolbear
Dolbear discussed the issues that arise in authoring topographic ontologies, and in linking ontologies
to a database, illustrating these with reference to her work with Ordnance Survey data. In creating
ontologies, domain experts and ontology engineers have to collaborate in capturing expert knowledge
in a formal way. Difficulties arise because domain experts tend to think in a closed world fashion.
Software tools for linking ontologies to databases still have some way to go, with efficiency being a
very important issue.
Missing Information Chris Date
Date summarised his presentation as “Don’t use nulls!” He considered the proposal to use
three-valued logic to deal with missing information in a relational database, and argued that this leads to
pathological cases and a breakdown of logical consistency. A systematic and disciplined system of
representing missing information with default values can avoid these problems. Other solutions
include having separate tables for known and unknown data, or recording only known data and
interpreting a false predicate as “we do not know that the predicate is true”. The interpretation of
information in a database is contained in the predicates of database queries: in a sense, database
design is predicate design.
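The kind of pathological case that arises from three-valued logic can be illustrated with a short sketch. It uses Python's standard sqlite3 module; the table name, column name and sample values are hypothetical and are not taken from Date's talk. Under two-valued logic every row satisfies either a condition or its negation, but a row holding a null satisfies neither.

```python
import sqlite3

# Hypothetical table: three samples, one with no recorded depth (NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, depth_m REAL)")
conn.executemany(
    "INSERT INTO sample (id, depth_m) VALUES (?, ?)",
    [(1, 10.0), (2, 25.0), (3, None)],
)


def count(where):
    """Count the rows for which the WHERE condition evaluates to true."""
    sql = f"SELECT COUNT(*) FROM sample WHERE {where}"
    return conn.execute(sql).fetchone()[0]


total = conn.execute("SELECT COUNT(*) FROM sample").fetchone()[0]
shallow = count("depth_m < 20")
not_shallow = count("NOT (depth_m < 20)")
# In two-valued logic "p OR NOT p" holds for every row; with NULL it does not.
either = count("depth_m < 20 OR NOT (depth_m < 20)")
# Even "x = x" is not true when x is NULL.
self_equal = count("depth_m = depth_m")

print(total, shallow, not_shallow, either, self_equal)  # prints: 3 1 1 2 2
```

The row with the null depth silently drops out of all four conditional counts, so the "shallow" and "not shallow" results no longer add up to the total.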
Incomplete and Missing Data in Geoscience Databases Steve Henley
Geoscience data can often be imprecise, incomplete or missing. Henley gave some simple examples
of such data, and showed how a CWA approach does not deal with such data in a scientifically
reasonable way. He rejected the default value solution of Chris Date, and also argued that
decomposition into separate tables is inappropriate for scientific data. Henley concluded that, for
imprecise data, a probabilistic extension of two-valued logic is required, while three-valued logic may
be necessary for dealing with missing information.
Integrating Genomic Mapping Data using Ontologies: A View from the Closed World Trevor
Paterson
Paterson showed how scientific data can be analysed and interpreted in a relational database under
the CWA, using comparisons of genetic mapping data across different species as an example. Almost
all of the data is positive, with negative data rarely recorded, and some of the data is contradictory.
Interpreting the results of database queries and reasoning across the knowledge domain is currently
performed in cerebro by expert scientists. The challenge in moving to an open world representation
lies in capturing the assumptions and assertions of expert reasoning in a domain ontology formalism.
Representing data is relatively straightforward: representing relationships between objects is much
more challenging.
ComparaGRID Semantic Integration Technology Matt Pocock
This presentation followed directly on from the previous talk by Paterson. Pocock demonstrated an
OWL browser, called “Pussy Cat”, developed for the ComparaGRID genomic data project. Queries
that require information from multiple sources are split up into separate queries to each source, the
results of which are then reassembled and delivered to the user. The source databases for this project
are in fundamentally different and contradictory schemas, even disagreeing on basic concepts like the
definition of “gene”. Deriving a good domain ontology for this work is a difficult and poorly-scoped task,
with no widely-validated methodology and a language gap between biologists and modellers.
Friday, October 13th:
Open Ontologies Dave Robertson
Robertson discussed the issues that arise in achieving consistent processes across different
ontologies. He argued that the problem of open knowledge sharing cannot be solved in the general
case, but that researchers should not be daunted by this: computer engineering frequently has to cope
with problems of this nature. In practical applications, the problem can be simplified or the properties
weakened, and engineers can work successfully within these tolerance spaces.
Identity, URIs and the Semantic Web Henry Thompson
Studying the properties of URIs goes right to the foundations of the Web. Thompson discussed the
properties of URI schemes, such as HTTP, and pointed out that URIs typically contain useful
metadata, depending on the naming strategy. He advised that it is usually better to stick to the HTTP
scheme rather than invent new URIs. The network effect is very powerful: the cost of giving away your
data is minuscule compared to the benefits of gaining access to everybody else’s data. When using
URIs in the database world, two questions arise: when do database owners need to mint URIs for
objects, and what strategies should they use to do so?
Don’t Mix Pointers and Relations Chris Date
In this presentation, Date argued that pointers to data should not be permitted as entries in a
database. He stated that mixing pointers and relations is a major departure from the relational model,
adding complexity without increasing power. While there can be good reasons for using pointers in the
implementation of a database, that is no reason to expose pointers in the database model itself: it is
important to keep the logical and physical levels distinct.
Panel Discussion All Speakers
In the final session of the meeting, all the speakers assembled as a panel to discuss questions
submitted by the meeting participants. A summary of the questions and answers can be found on the
meeting wiki at http://wiki.esi.ac.uk/Closed_Verus_Open_World_Questions
Event Achievements:
(Please describe how, and to what extent, the event achieved the objectives stated above. Give a
narrative description of any new ideas and opportunities which have emerged as a result of the event.
Do you see any potential for follow on activity arising from this event, and if so of what type, e.g. new
collaborations, research proposals, future events?)
Both the closed world assumption (where everything stated or implied by the database is true
and everything else is false) and the open world assumption (where everything stated or
implied by the database is true and everything else may or may not be true) were reviewed.
Chris Date explained how the CWA is crucial in maintaining the mathematical integrity of the
relational database model, and Nick Drummond and Rob Shearer gave an account of how the
OWA is a natural way to deal with incomplete information in domains such as science where
not all answers are yet known. At the ontological level, the scientific method can be seen as an open
world; however, when proposing scientific theories the universe of discourse must be carefully defined.
Of course, scientists often assume the universe of discourse rather than making it explicit, which
can cause problems. The universe may be iteratively redefined, but permitting enlargement (or
contraction) without re-evaluating the impact on other hypotheses (axioms, and therefore database
design, in the database world) is likely to lead to erroneous conclusions.
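The difference between the two assumptions reviewed above can be shown in a minimal sketch. The function names, the triple-style facts and the explicit set of negative facts are all illustrative assumptions, and the sketch ignores inference: it checks only what is stated, not what is implied.

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


def ask_closed(facts, statement):
    """Closed world: anything not recorded is taken to be false."""
    return Answer.TRUE if statement in facts else Answer.FALSE


def ask_open(facts, negative_facts, statement):
    """Open world: a statement is false only if it is explicitly denied."""
    if statement in facts:
        return Answer.TRUE
    if statement in negative_facts:
        return Answer.FALSE
    return Answer.UNKNOWN


# Hypothetical data: one positive and one explicitly negative assertion.
facts = {("gene_A", "maps_to", "chromosome_1")}
negative_facts = {("gene_A", "maps_to", "chromosome_2")}
query = ("gene_B", "maps_to", "chromosome_1")  # nothing recorded about gene_B

closed_answer = ask_closed(facts, query)
open_answer = ask_open(facts, negative_facts, query)
print(closed_answer.value, open_answer.value)  # prints: false unknown
```

The same unrecorded statement is answered "false" under the CWA and "unknown" under the OWA, which is why results returned from a database and from a reasoning engine need to be interpreted differently.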
While reasoning within any theory, the closed world assumption (CWA) holds. Its universe of
discourse is determined by the entities modelled, relationships, constraints, dependencies, and by the
domains of permissible values. A scientific database should then be considered as testing a theory
(however "imprecise"). If the real world data either doesn't fit or creates a contradiction, the theory is
invalidated and the database should then be redesigned to meet the next hypothesis. In a scientific
database, a reasonable interpretation of True and False under CWA is "valid by experiment and
consistent with hypotheses" and "not validated by experiment or inconsistent with hypotheses". The
CWA shouldn’t be confused with an assumption that we must know everything and embed that
knowledge in a permanently fixed database schema. All we need is to understand the current boundaries
on our application or investigation.
The CWA and OWA are really the extreme ends of a continuum: the OWA starts with no assumptions
and as restrictions are added to the ontology, an open world is progressively closed down. The real
differences are between the explicit assumptions made in open world ontologies, and the implicit
assumptions that underlie closed world databases. When querying a relational database, all the
understanding and interpretation are in the mind of the person formulating the query, while in the open
world of the semantic web at least some of this interpretation is formalised in the ontology.
The question of missing information was discussed at length. In the CWA, there can be no such thing
as a missing value and all queries are answered with “yes” or “no”. This makes for a mathematically
sound database that provides a true response to all queries. In scientific practice, however, missing
data crops up routinely and in the OWA “don’t know” can be a meaningful answer. The proposal that
the two worlds might be reconciled by inserting three-valued logic into the relational database model
was considered, but was rejected on the grounds that it can lead to pathological consequences which
can undermine the relational database model. Even if some three-valued logic could be made
effective, there is more than one type of missing value – unknown values, undefined values, invalid
values, and so on – and to treat them all as different truth-values would lead to a multi-valued logic
that was prohibitively unwieldy. It was generally agreed that there is as yet no ideal solution to the
missing data problem, though some partial solutions were offered to what is recognised as a serious
issue.
Even with all data present and correct, researchers still need to be able to capture and communicate
metadata that describes it, so that others can access the data for purposes which may be unimagined
by its creators. Creating this semantic data – data that tells you what other data means – can be as
much a social and political exercise as a scientific one. Trevor Paterson described the challenges
faced by the ComparaGRID project in integrating genomic mapping data using ontologies, sets of
semantic restrictions that describe the data and the possible relations between data objects. This is
tantamount to automating the kind of interpretation of data that is performed by an expert scientist
drawing on intuition, experience and training. Formalising this can open up fundamental scientific
disagreements, which must be resolved or at least precisely defined before the ontology can be
created. Disagreements or contradictions that cannot be resolved would require second-order
logic, which permits paradoxes (though this might then be considered not science!). Even
uncontroversial data can pose challenges of multiple interpretation, as Cathy Dolbear illustrated when
she described her work in creating a widely useable ontology for Ordnance Survey data which was
originally gathered for specific military and mapping purposes.
Social factors are also of great importance in realising the vision of the semantic web. Henry
Thompson’s presentation emphasised the social properties of systems like HTTP and DNS, and the
invaluable network effects that can come from understanding and using these social factors when
opening up databases to the Web. Currently, database owners use different technologies to create
the URIs that expose their databases, and the best approach is as yet
unclear.
The event has resulted in an expression of interest in a follow-up event on Semantic Data Integration, in
collaboration with the Centre for Ecology and Hydrology, early in Spring.
Any Other Observations:
This was a particularly lively meeting, with substantial debate during the presentations and a
wide-ranging panel discussion that extended well past the scheduled finishing time.