
A platform for experimenting with measures of semantic similarity and
supporting individual perspectives onto shared ontologies
Mark Gahegan*, Ritesh Agrawal, Anuj Jaiswal, Junyan Luo and Kean-Huat Soon
GeoVISTA Center, Department of Geography, The Pennsylvania State University, USA.
*Now at: School of Geography, Geology and Environmental Science, University of Auckland, New Zealand. Email:
m.gahegan@auckland.ac.nz
Abstract
This paper describes two developments in the ongoing search for better semantic similarity tools: such
methods are important when attempting to reconcile or to integrate knowledge, or knowledge-related
resources such as ontologies and database schemas.
The first is an open, extensible platform for
experimenting with different measures of similarity for ontologies and concept maps. The platform is
based around three different types of similarity, which we ground in cognitive principles and use to provide a
taxonomy and structure by which new methods can be integrated. The platform supports a variety of
specific similarity methods, to which researchers can add others. It also provides flexible ways to
combine the results from multiple methods, and some graphic tools for visualizing and communicating
multi-part similarity scores. Details of the system, which forms part of the ConceptVista open codebase,
are described, along with associated details of the interfaces by which users can add new methods, choose
which methods are used and select how multiple similarity scores are aggregated. We offer this as a
community resource, since many similarity methods have been proposed but there is still much confusion
about which one(s) might work well for different geographical problems; hence a test environment that all
can access and extend would seem to be of practical use. We also provide some examples of the platform
in use.
The second part of the paper describes in detail the idea of ‘perspectives’—a means of defining specific
views onto semantic knowledge that can overcome some of the smaller differences in ontology that
sometimes are a stumbling block for compatibility or acceptance. Perspectives are designed to help
reconcile a user’s specific (but individual) understanding, or their current needs, to the contents of an
established domain ontology, but without forcing the user to adopt all the constructs of the ontology
directly. Minor differences can be overcome without the need for the user’s conceptual model (or related
application programs) to be changed. Perspectives offer a convenient way to customize how ontologies
appear to their users, rather like ‘views’ in a relational database (but significantly more powerful). We
argue that perspectives are a kind of mediating transformation by which knowledge resources can be
integrated. In fact, they operationalize the notion of foreground and background in cognition, allowing
currently irrelevant details to be moved to the periphery. So far in our implementation, perspectives
allow: (i) properties to be recast as concepts in their own right (and conversely, concepts and sub-graphs
reduced to properties), (ii) differences in specialization/generalization to be bypassed and, more generally,
(iii) implied relations to be used to connect concepts directly. We describe perspectives from a cognitive
standpoint, and give examples of how they can be used.
Keywords: semantic similarity, ontology mediation, open platform, concept mapping, GIScience
1 Introduction
Knowledge computing (ontologies, provenance, workflows, etc.) has opened the possibility of
representing and reasoning with knowledge about the geographical world, which—just like with data
before it—has led in turn to many further questions regarding interoperability, integration and update of
such knowledge resources. It is not surprising, then, that we often find ourselves faced with the problem of
reconciling knowledge captured from different experts, or scraped from published documents and
databases. But faced with the plethora of choices as to how knowledge should be recognized as similar
and integrated, the way forward is sometimes uncertain. Which methods should we use? Such demands
have led us to construct an evaluation platform for various semantic similarity methods that we describe
below, in the first part of the paper. The platform is implemented in ConceptVista, a concept mapping, semantic search and knowledge integration environment (http://www.geovista.psu.edu/ConceptVISTA/index.jsp). ConceptVista differs from ontology tools such
as Protégé in its support for less formal kinds of knowledge such as concept maps, its links to many other
Web 2.0 technologies, and its support for highly visual interaction and display. A fuller description is
available in Gahegan et al. (2007).
On the positive side, the semantic similarity research literature has grown very rapidly in the past five or
so years, so there are by now many useful similarity methods and associated metrics to draw from.
Harvey et al. (1999), Winter (2001), Kuhn (2005) and Agarwal (2005) provide good introductions to
geospatial semantics and some of the specific problems that must be overcome in a geographical
setting. More general accounts of data semantics and ontologies are given by Gruber (1993), Sheth
(1999), Guarino (1998) and Davies et al. (2006); while Wache et al. (2001) provide a useful summary of
ontology integration methods. Measuring semantic difference in geo-ontologies is specifically addressed
in the work of Rodriguez and Egenhofer (2003), Kavouras et al (2005) and Fonseca et al (2006), in which
practical metrics are proposed. Haase et al (2005), Klein (2004) and Bloehdorn et al (2006) specifically
discuss the problem of checking for, and maintaining, consistency in distributed (co-evolving) ontologies.
Sowa provides a very thought-provoking account of dynamic ontology (2006) and with colleague
Majumdar (Sowa & Majumdar, 2003) describes a system that can find matches between knowledge
fragments (conceptual graphs) using a sophisticated analogy reasoning engine. Nevertheless, live systems
which can be used by an entire research community to experiment with ontology-based knowledge
integration are rare. One very basic but useful example from the Geosciences Network (GEON) is
described by Lin and Ludäscher (2003); it can be accessed from the GEON cyberinfrastructure portal at
https://portal.geongrid.org/gridsphere/gridsphere.
However, a number of deeper questions about geographical knowledge construction and integration that
relate to the philosophy of geo-ontology (Frodeman, 2003), via the fields of hermeneutics, pragmatics and
situated cognition, require very careful attention since they impinge on the practicality and validity of
some matching methods. An excellent account of the background to these questions and the related
research literature over the past 40 years is provided by Schwering (2008). Many of these questions arise,
we believe, because of the situated and often contested sense of meaning that is common within
geography and the natural sciences (Clancey, 1994; Brodaric and Gahegan, 2007; Brodaric, 2007; Pike and
Gahegan, 2007). By and large, the concepts and relations we use to describe the world do not exist in
nature; they are constructed by humans. Hence it is not surprising that meaning differs between
individuals, and through time. The assumption that conveying and reconciling understanding is a merely
linguistic problem that would not occur in a purified language like description logic (ontology) is, in fact,
almost entirely wrong (Braspenning, 2000; Sowa, 2005). Thus there are often no perfect theoretical
solutions for geographical knowledge integration, but rather subjective measures and practices that, on
balance, provide useful results.
Take the case of two experts who study different, but overlapping domains, such as (i) vulnerability of
local places to climate change and (ii) crisis management and disaster relief. It is likely that they will
share some knowledge that is obviously the same, but there may also be knowledge that is not identical
but is commensurate to some degree, and finally of course there is knowledge that is not common to both
of them. The same can be said of computer programs, databases and ontologies; when compared with
each other there are grades of similarity and overlap. Putting it another way, there is intersecting
knowledge, where the problem is to ensure that ontological clauses are not repeated and that small
inconsistencies are recognized and resolved. Then there is augmenting knowledge, which extends what is
known by each party separately but remains compatible with the intersecting core. Finally, there is the
possibility of knowledge that (so far) is disjoint and does not (yet) fit in. Readers are directed to a
thought-provoking view of this problem, in a presentation by Von Schweber (2006) where Venn diagrams
are used to differentiate what is currently known, what could be known (building on current knowledge)
and what cannot be known, as a means of understanding the limits of knowledge exchange between two
parties.
For the first case of intersecting knowledge, we need to recognize equivalence—though there may be
naming conventions and differences in level of detail that must be resolved in order to do so. For the
second case of augmenting knowledge, we might expect fewer overlapping details and the problem
becomes one of knowledge integration. More importantly, there are likely to be concepts that play quite
different roles, i.e. have different properties and engage in different relations. For example, the
vulnerability expert might be able to identify and understand at-risk populations, but have no idea how to
evacuate them from the path of a hurricane. Yet much of the underlying semantic structure may well be
compatible, when examined. In the third case where knowledge does not yet overlap, the best strategy
might be to avoid placing too much weight on similarities because they may be coincidental. Of course,
the complicating factor is that in most cases, we typically do not know beforehand which of these three
situations applies for a given knowledge fragment: we need to infer it.
These apparent conflicts in purpose point to the need for very flexible methods for recognizing and
resolving semantic similarity. Many of the tools developed to date for similarity resolution assume that
the problem is actually of the first type. And it is true that many integration problems so far studied use
ontologies to resolve apparent differences between knowledge communities that are very ‘close’ to each
other—as in the many examples of ontologies used to build schema mappings between databases with
similar content. But these are not the only kinds of knowledge integration problems we are faced with.
From our point of view then, we see an urgent need, not for another kind of metric to calculate semantic
similarity, but for an environment that allows:
1. Methods and their metrics to be readily evaluated and compared,
2. Easy extension with new methods for specific kinds of similarity and matching problems,
3. Better support for augmenting knowledge (the second case described above),
4. Flexibility in the way that methods and results are combined and communicated, and
5. Simple, but effective, ways to investigate and visualize the results.
In short, we do not know enough about the problem to be able to move straight to a solution; we need first
to conduct experiments and evaluate strategies on their merits for resolving particular classes of problem.
Without such experimentation, how will we ever know which strategies are in fact the most successful?
A description of the environment we have constructed for evaluating similarity forms the first part of this
paper. The second part introduces a new idea to help knowledge communities look past unimportant
differences in ontology, so that they do not become a dogmatic stumbling block for potential users.
Specifically, we describe the idea of ‘perspectives’ onto knowledge, which provide different views onto
an underlying knowledge schema, adapted specifically for some purpose. To give just a simple example
for now, two experts (or two applications) might differ in the status afforded a notion such as ‘Mesozoic’.
To one scientist, this might be simply a property that describes the age of a fossil; but to another,
Mesozoic might be a complex concept in its own right, with an intricate web of relations. The first
scientist, not interested in these details, may wish to continue to treat Mesozoic as a property, and in our
opinion should be allowed to do so because it is not inconsistent with the roles it can play. But the second
scientist should be able to treat it as a highly connected concept. The trick here is to support both views
from the same ontology.
2 A semantic similarity platform for testing and evaluation
As noted above, much has been written already on the construction of different metrics for computing
similarity across ontologies, and the related task of matching, or building an equivalence mapping
between two schemas. However, it seems all too often that new methods are not compared to existing
ones, nor is the software that was developed made available so that tests can be replicated and methods compared. To
remedy this shortcoming, and to allow us to evaluate similarity methods in different contexts within the
geographical and geological realms, we have constructed an open, extensible similarity platform (test
bed), with graphical interfaces and pictorial display of the similarity measures used.
However, before describing our similarity platform, some clarification of purpose and intent is needed.
Firstly, we see our contribution here as providing a solid, conceptual structure for organizing the various
similarity methods available and thereby creating a clean and extensible programming interface for
ourselves and others to use. Good structure is needed if we are to manage a growing collection of more
complex methods. Secondly, by no means do we claim to have a complete set of methods to hand, nor
that the methods we describe are necessarily the best available. We do provide a structure, and have
populated it with methods that seem interesting and useful to us. But it is far from complete. One of our
purposes in writing this paper is to generate interest in sharing methods between research groups. If your
favorite method is missing, please consider supplying it to us! Finally, it is not our purpose here to
provide an in-depth review of similarity measures; we restrict ourselves to some brief notes on the different
types of similarity methods as they relate to supporting them computationally.
We also note that calculating similarity is just one part of a complex process of knowledge integration and
management, especially when the knowledge resources are community-based, that is, shared by a number
of users. We envisage the research reported here as fitting within a five-stage model of ontology management
as follows:
1. Choose a strategy for assessing similarity: based on an understanding of the task and the parties
involved.
2. Recognize similarities: use the chosen methods to compute similarity scores, and to summarize
and report the findings.
3. Update local knowledge resources according to the similarities encountered. As part of this same
project (but not reported here) we have developed different strategies for merging and integrating
knowledge, from importing absent concepts to anchoring new concepts into formal domain
ontologies [1].
4. Manage the revision and maintenance of ontologies within a community: this under-researched
area must address questions of how knowledge communities are formed, rules for participation
and a policy for approving revisions. Some ideas as to how this might work are presented in
Klein (2004) and Pike et al (2005). The tools for software version control can address the
technological aspects of the problem.
5. Broadcast changes to community knowledge resources using a regular update cycle and
distribution mechanism [2].
2.1 Specific similarity metrics
Following an idea suggested by Sowa and Majumdar (2003), we have developed a categorization scheme
for similarity metrics according to Peirce's notions of firstness, secondness, and thirdness (Peirce,
1931). In his writings, Peirce coins these terms to describe the different kinds of knowledge that are
needed to form and relate categories. To wit:
“First is the conception of being or existing independent of anything else. Second is the conception of
being relative to, the conception of reaction with, something else. Third is the conception of mediation,
whereby a first and a second are brought into relation.” (Peirce, 1891)
So, firstness describes the internal or intrinsic character of some entity, secondness relates an entity to
other entities, and thirdness describes mediating transformations by which entities can become connected.
If we extend these ideas into the realm of semantic similarity, we obtain a concise and clear taxonomy for
similarity methods that makes it easier to understand what the different methods do, and what their
operational parameters will be. This scheme provides some much-needed structure for the software
development, as summarized in Table 1.
Table 1. Modes of similarity, descriptions of what they do and the programming interfaces they require.

Mode: Firstness methods
Description: Match by internal property values, by property type, and also property name or identity.
Parameters: Properties(A): Vector, Properties(B): Vector

Mode: Secondness methods
Description: Match concepts according to their relationships with other concepts, i.e. their graph neighborhood—a sub-graph centered on each concept.
Parameters: Neighborhood(A): Graph, Neighborhood(B): Graph

Mode: Thirdness methods
Description: Introduce a mediating graph that brings A and B into relation.
Parameters: Properties(A): Vector, Properties(B): Vector, Mediation(A, B): Graph

[1] This leads naturally to the question of exactly how such merging should take place. Should properties from matching concepts be merged into one new concept? Should they be subsumed into one or more existing concepts? When combining knowledge across two communities, concepts—though their names and other properties may be similar—may actually mean different things to their knowledge community, or play significantly different roles. Thus assuming they are equivalent (and can therefore be merged) may be unhelpful.

[2] Imagine, if you will, Critical Knowledge Updates to your PC!
2.1.1 Firstness Methods
Value and Structure Similarity
Value similarity methods calculate similarity scores based on the commonality of property values. The
metrics are designed to be directly proportional to the number of values that two concepts share, and are
sometimes called set theoretic measures, employing notions of contrast (Tversky, 1977). Structural
similarity matches not on values of properties, but on their types. Thus it is basically a count of the
number of properties whose types are the same, again, normalized by the number that are different. The
form of the method is typically a measure of the information they share—a similarity score calculated as
the sum over all properties or values (pi) that are common between two concepts (A and B)—divided by
the total information contained in A and B separately. Commonality calculations vary with the type of the
properties; a different method is needed for each new property type. See the equations below for
examples of how these notions can be computed. Note that commonality can be computed in a number of
ways: the formulae below show integers compared according to their closeness to each other (giving a
numeric score in the range 0-1), and strings compared for exact matches (giving a score of either 0 or 1).
Partial string matching and more complex lexicographical analysis is often more reliable when comparing
concept maps and ontologies not created using a controlled vocabulary. The general form for all firstness
methods is (from Table 1) Properties(A): Vector, Properties(B): Vector; in the simple case of comparing
the names of concepts, the vectors reduce to simple strings or URIs.

\[
\mathrm{SimilarityScore}(A,B) \;=\; \frac{\sum_{i=1}^{n} \mathrm{commonality}(p_i, A, B)}{\mathrm{totalInformation}(A,B)}
\]

\[
\mathrm{Commonality}(p, A, B) \;=\;
\begin{cases}
1 - \dfrac{\lvert A_p - B_p \rvert}{\mathrm{range}(p)} & \text{if } p \text{ is of type int}\\[1.5ex]
1 & \text{if } p \text{ is of type string and } A_p = B_p\\[0.5ex]
0 & \text{if } p \text{ is of type string and } A_p \neq B_p
\end{cases}
\]

\[
\mathrm{TotalInformation}(A,B) \;=\; n_A + n_B - n_{A \cap B}
\]
It is in the nature of these kinds of measures that they are narrowly focused on the internal aspects of
concepts. As a consequence, it is usually not safe to rely on them alone. Types and values can be chosen
quite arbitrarily, and it is difficult to be sure that the comparisons are meaningful. For example, unless
constrained to only compare properties with the same names, these methods would compare the age of a
dog and the length of a train. But they do add evidence, and can be useful when used to support more
advanced methods.
2.1.2 Secondness Methods
Secondness methods concern the relations between concepts, or more broadly the similarity of their graph
neighborhoods. They are sometimes referred to as network methods. The simplest kinds are very similar
to those described above for properties and values, except they are used on relations (and possibly their
properties also). So, following from the equations given above, secondness measures can compute
similarity based on commonality among the relations, again normalized by the total information. One
complicating question is the depth of the subgraphs to be compared, and whether all types of relations
contribute equally to the overall score. For example, one could define a semantic distance metric, so that
more distant relations (not directly connected, but connected by intermediary concepts) count for less.
More useful relational measures can be calculated using the explicit semantics of the relations represented
in the ontology. For example, one can count the network distance that must be travelled across a
generalization hierarchy to connect two concepts A and B: the number of generalization steps from A to
their closest shared generalization G, plus the number of further specialization steps from G to B.
Various other families of proximal relations can be used as well, including the spatial (Kuipers, 2000;
Schwering and Raubal, 2005). This kind of method works well with ontologies that share type hierarchies
explicitly, but more sophisticated measures are needed otherwise, such as described by Rodriguez and
Egenhofer (2003).
The general form remains as Neighborhood(A): Graph, Neighborhood(B): Graph
(Table 1), though some of the more advanced methods may require additional parameters.
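The following sketch illustrates the generalization-hierarchy measure just described: it counts the steps from A up to the closest shared generalization G and then down to B. The child-to-parent map used to represent the hierarchy is an illustrative assumption only.

```java
import java.util.*;

// Sketch of a simple secondness measure: semantic distance across a
// generalization hierarchy, counting steps from A up to the closest shared
// generalization G and then down to B. The hierarchy representation here
// (child -> parent map) is an assumption for illustration only.
public class HierarchyDistanceExample {

    // Collect ancestors of a concept with their depth (number of generalization steps).
    static Map<String, Integer> ancestors(String concept, Map<String, String> parentOf) {
        Map<String, Integer> result = new LinkedHashMap<>();
        int depth = 0;
        String current = concept;
        while (current != null) {
            result.put(current, depth++);
            current = parentOf.get(current);
        }
        return result;
    }

    // Network distance between A and B via their closest shared generalization,
    // or -1 if they share no ancestor.
    static int distance(String a, String b, Map<String, String> parentOf) {
        Map<String, Integer> ancA = ancestors(a, parentOf);
        Map<String, Integer> ancB = ancestors(b, parentOf);
        int best = -1;
        for (Map.Entry<String, Integer> e : ancA.entrySet()) {
            Integer depthB = ancB.get(e.getKey());
            if (depthB != null) {
                int d = e.getValue() + depthB;
                if (best == -1 || d < best) best = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("Eucalypt", "Evergreen");
        parentOf.put("Evergreen", "Tree");
        parentOf.put("Oak", "Deciduous");
        parentOf.put("Deciduous", "Tree");
        System.out.println(distance("Eucalypt", "Oak", parentOf)); // 4 steps via Tree
    }
}
```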
2.1.3 Thirdness Methods
Implied semantic relationships connecting concepts (‘nym matching service)
It is often the case when combining knowledge gathered from different human experts or different
communities that we need to recognize connections by certain implicit (or missing) semantic
relationships, such as: toponyms, hypernyms, hyponyms, meronyms, holonyms, synonyms, and antonyms.
Although conceptually simple—use WordNet, Cyc or some other web-based thesaurus and gazetteer to
check if two concepts have some kind of ‘nym relationship—the problems here are ones of computational
performance and the reliability and completeness of these external resources. Performance is a problem
because searches through an external corpus must be made between all pairs of concepts that might be similar
(thus on the order of n(n-1)/2 comparisons, assuming symmetry). Completeness is a problem because specialized
scientific vocabulary is often missing from these general-purpose thesauri.
We have developed a separate matching service that uses the semantic relationships from WordNet. The
service searches for concepts that are similar based on these semantic relationships. Effectively this
offloads the computational burden to a remote computer as an RMI service using the powerful Lucene
indexing tool for efficiency. Users of the service do not need to be aware of these complexities; the
interface is as described in Table 1, i.e. A: Concept, B: Concept, Mediation (A, B): Graph, but the Graph in
this case reduces to a single ‘nym relation.
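Purely to illustrate the shape of such a service, the sketch below shows what a remote 'nym matching contract could look like; the interface name, the NymRelation enum and the use of plain Java RMI types are assumptions for illustration and do not describe the deployed service.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical sketch of a remote 'nym matching contract; the interface and
// NymRelation type are illustrative assumptions, not the actual RMI service
// described in the text.
interface NymMatchingService extends Remote {

    enum NymRelation { SYNONYM, ANTONYM, HYPERNYM, HYPONYM, MERONYM, HOLONYM, NONE }

    // Returns the implied relation (if any) between two concept names, so that
    // the mediating "graph" of Table 1 reduces to a single 'nym relation.
    NymRelation relate(String conceptA, String conceptB) throws RemoteException;
}
```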
Mediating subgraphs (analogies and perspectives)
The real power of mediation (thirdness) occurs when graphs more complex than a single relation can be
used to form connections between concepts. Sowa and Majumdar (2003) describe an analogy engine to
search for these mediating subgraphs in existing knowledge resources (such as ontologies). In our own
work, we have been frustrated by the lack of detailed domain knowledge by which these subgraphs might
be constructed, so have developed a human-led method by which users can construct their own mediating
subgraphs.
We term these subgraphs ‘perspectives’ and dedicate the next section of the paper to
explaining how they work and how they can be used. As far as the API is concerned they require no
special treatment.
Of course, the most successful strategies are likely to be those that can combine multiple methods and
resolve how to combine their results (see Schwering, 2005 as an example). We briefly touch on this
problem below.
2.2 Details of the Application Programming Interface
The Application Programming Interface is constructed around the following methods:
Analyzers: Methods for performing comparisons in values and types (described above). Three different families are implemented so far: StringAnalyzer, NumericAnalyzer and DateAnalyzer for the firstness methods [3]. Thirdness analyzers include a ‘nym matching service and the perspectives mechanism described below in section 3.

Filters: Exclude certain properties of concepts and relations from any comparison.

GUI Interfaces: Allow users to configure the various options and parameters that different Analyzers can use.

Extractors: Extract needed information from more complex fields, such as the local concept name from a concept URI string (for when concepts are identified with URIs).

Summarizers: These define the expressions used to combine multiple scores together.

Visualizers: Ways for reporting the similarity scores in the ConceptVista application.

[3] We have not included analyzers for secondness in our work so far, as there are so many to borrow from colleagues.
Supporting the Analyzers is a SimilarityRegistry class that maintains a HashMap index between similarity
methods and their graphical user interface (GUI) component (instance of SimilarityEditorInterface). This
class allows users to add a new HashMap entry at runtime, if there is a need to modify or add an analyzer.
It also decouples similarity measures from their GUI component, hence different graphical interfaces to
methods can also be substituted as needed. Specific RegistryAnalyzers maintain a registry of which GUI
components to use to set the different parameters of a specific implementation of an Analyzer interface.
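The registry idea can be summarized with the following sketch, in which a HashMap associates analyzer types with the editors used to configure them so that either side can be swapped at runtime; the class and interface names are hypothetical stand-ins for the ConceptVista classes mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the registry idea described above: a HashMap that associates a
// similarity method with the GUI editor used to configure it, so that either
// side can be replaced at runtime. Names are illustrative, not the actual
// ConceptVista classes.
class SimilarityRegistrySketch {

    interface Analyzer { double compare(Object a, Object b); }
    interface SimilarityEditor { void showOptionsFor(Analyzer analyzer); }

    private final Map<Class<? extends Analyzer>, SimilarityEditor> editors = new HashMap<>();

    // Register (or replace) the editor for a given analyzer type at runtime.
    void register(Class<? extends Analyzer> analyzerType, SimilarityEditor editor) {
        editors.put(analyzerType, editor);
    }

    SimilarityEditor editorFor(Analyzer analyzer) {
        return editors.get(analyzer.getClass());
    }
}
```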
Filters are provided to allow the user to exclude different properties of a concept from being used by the
various similarity metrics. This is necessary because some properties may be known to be misleading,
and also because various details unrelated to semantics sometimes find their way into some concept
maps.
GUI Interfaces are the graphical components by which the user interacts with the various similarity
methods. Because of the resemblance in parameters shown in Table 1, new similarity methods can
sometimes make use of existing GUI Interfaces, so no additional code is needed.
Extractors are used to mine out information for matching from more complex properties, such as creating
concept names from their URI. They provide a way to restrict what is compared for the properties
selected for use.
Summarizers are the means to combine similarity scores from multiple methods. We envisage here a
process by which the scores from different methods can be weighted and combined. Over time, we can
perhaps learn which weighted combinations of methods work best for different knowledge integration
problems and domains. So far, we have restricted our Summarizers to work only with the firstness
measures described above. It remains an open question how to build a more general summarizer that will
work across all kinds of similarity methods.
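One simple way to realize such a Summarizer is a weighted average of the individual scores, as in the sketch below; the class name and the choice of a linear combination are our own illustrative assumptions.

```java
import java.util.List;

// Sketch of a weighted Summarizer: combine several similarity scores into one,
// using user-supplied weights. A linear weighted average is one obvious choice;
// the class here is illustrative, not the ConceptVista Summarizer API.
class WeightedSummarizerSketch {

    static double combine(List<Double> scores, List<Double> weights) {
        if (scores.size() != weights.size()) {
            throw new IllegalArgumentException("one weight per score required");
        }
        double weightedSum = 0.0, weightTotal = 0.0;
        for (int i = 0; i < scores.size(); i++) {
            weightedSum += scores.get(i) * weights.get(i);
            weightTotal += weights.get(i);
        }
        return weightTotal == 0 ? 0.0 : weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        // e.g. structural score 0.9, string score 0.5, weighted 2:1
        System.out.println(combine(List.of(0.9, 0.5), List.of(2.0, 1.0))); // ~0.767
    }
}
```

Other combination rules (for example taking the maximum, or learning the weights from labelled examples) would slot into the same place.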
Visualizers are a class of methods for visually depicting the similarity scores. In the example shown
below in Figure 1, simple bars are used to show these scores. We have experimented with other visual
devices and selected this one for now, though we intend to add more methods soon.
A complete UML diagram for the less experimental aspects of the whole platform is given for reference
purposes in Appendix A. The entire application, which includes ConceptVista and the ‘nym matching
service, is available for download as a package from:
http://www.personal.psu.edu/arj135/Projects/CV4/CV4-Setup.exe. The authors will also be happy to make
source code available to any interested parties. The codebase uses an LGPL license and is written in Java
using the JENA ontology tools.
2.3 Example of use
When the user begins an evaluation of semantic similarity, two (or more) ontologies (or more informal
concept maps) are first loaded and displayed concurrently. The user then clicks on the similarity tools
panel to choose which methods to use, and via their GUI Interfaces, sets any necessary operational
parameters. Having configured the methods, and selected a Summarizer, the user then clicks on a concept
in one ontology from which to begin the search for similarity. The system responds by calculating
similarity scores between this concept and all the other concepts in the second ontology. The results are
projected into the display. The user can then choose to act on the computed scores, perhaps by creating
new relations to represent the uncovered connections, or by proceeding to compare further concepts.
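In outline, the comparison step amounts to scoring the selected concept against every concept in the second ontology and handing the results to a Visualizer, roughly as in this sketch (the Analyzer interface and method names are placeholders, not the actual workflow code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the comparison step described above: score one selected concept
// against every concept in a second ontology and collect the results for
// display. Analyzer and the aggregation step are placeholders, not the
// actual ConceptVista workflow code.
class CompareOneToManySketch {

    interface Analyzer { double compare(String conceptA, String conceptB); }

    static Map<String, Double> scoresFor(String selected, List<String> targets, Analyzer analyzer) {
        Map<String, Double> scores = new HashMap<>();
        for (String target : targets) {
            scores.put(target, analyzer.compare(selected, target));
        }
        return scores;
    }
}
```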
The example below in Figure 1 shows the concept of SurfaceWater (highlighted in red) from the SWEET
EarthRealm ontology (source: www.nasa.gov/earthrealm) compared to several concepts from the
AktiveSA ontology for crisis management / disaster relief (source: https://www.edefence.org/~ps/aktivesa/OntoWeb/index.htm).
The concept SurfaceWaterObject has the
best overall similarity scores. As configured in this example, the pink bar represents a structural similarity
score, which is calculated based on how many common links are associated with both concepts. The red
bar denotes a similarity score based on lexical similarities of the concept names. Finally, the brown bar
shows a score that combines both the structural and string value measures together. For this possible
match, structural similarity has the highest score, as the two concepts share an almost identical set of
properties. The string value similarity receives a moderate score because of the lexical difference between
“SurfaceWaterObject” and “SurfaceWater”. This also compromises the combined score that considers
both structural and value similarity measures.
Note of course that different methods may well change the outcomes. For example, a more refined lexical
string matching method might improve the results, as might the use of a Filter to remove the sub-string
“Object” from all concepts in the AktiveSA ontology. The graph neighborhoods are very similar, so
secondness methods may not be so effective in this simple example.
Figure 1: The concept of SurfaceWater from the SWEET EarthRealm ontology (highlighted in red) is compared
to several concepts from the AktiveSA ontology for crisis management. The resulting similarity scores are
shown by the multi-bar glyph symbols in the display. See text for further details.
3 Perspectives
Perspectives are sequences of ontological transformations, specifically designed to overcome some of the
possible conflicts that can occur in the process of human-oriented ontology creation and use. For
example:
(i) seemingly arbitrary decisions, where concepts could be linked by many relations, about which
ones should be made explicit and which ones are simply implied;
(ii) a different degree of interest or sophistication that may arise between ontology producers and
ontology consumers; or
(iii) a dissimilar propensity for levels of generalization among practitioners, where the ontology is
strongly hierarchical but the user conceives of a much flatter structure (or vice versa).
The aim of perspectives in all these cases is to finesse such differences.
An early implementation of perspectives as ontological filters was reported by Gahegan et al. (2008), as a
predominantly visual filter used to draw attention to certain themes in the display, built automatically
around pre-defined semantic types. Filters were used to show the visual intersection of different views
onto a concept map. Here perspectives are extended to a deeper cognitive level, designed to mediate
conceptual knowledge so it better matches an expert’s personal understanding or current need.
Importantly, in this sense perspectives broaden the notion of what might be considered commensurate
knowledge beyond the kinds of similarity measures described above. Specifically they allow us to
operationalize an idea from the writings of Whitehead (1929; 1933), where he describes the ability of
humans to move ideas (concepts and relationships) between the light of enquiry and the ‘penumbral
background’, where details are not known precisely, or are not currently needed. Using this idea, we can
define a filter that highlights a specific sub-graph, but which ‘rolls-up’ the concepts on the periphery of
the filter, temporarily removing or recasting them as simple properties of the concepts of interest.
Figure 2 describes schematically how perspectives work in the case of providing such conceptual focus.
The top diagram shows a concept map or ontology upon which two different perspectives (A and B) will
be defined. The left diagram represents the effect of applying perspective A: concepts inside the filter are
unchanged (numbers 6, 7, 8 and 9), concepts connected directly to those inside are reduced to being
properties of the included concepts and are shown now as circles (2, 5, 10, 11, 12, 15). Concepts not
directly connected to those within the perspective are temporarily removed (shown grayed out). The right
diagram shows perspective A retracted and perspective B asserted (concepts 6, 7, 12 and 14 now in focus),
with corresponding changes to the surrounding nodes. The movement from A to B represents conceptual
refocusing, so some previously relevant details are no longer required, some previously truncated concepts
are re-inflated and some additional concepts become visible.
Figure 2. An overview of how perspective filters work. The upper diagram shows a simplified ontology as a
series of connected nodes. Two perspectives (A and B) onto this ontology are shown on the lower diagrams.
Concepts on the periphery of a perspective are recast as properties and shown as circles. Concepts falling outside
a perspective are temporarily removed (grayed out). See text for details.
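The classification rule illustrated in Figure 2 can be stated compactly in code: concepts inside the perspective stay as concepts, their direct neighbours are recast as properties, and everything else is temporarily removed. The sketch below assumes a simple adjacency-map representation and is illustrative only.

```java
import java.util.*;

// Sketch of the perspective-focus rule illustrated in Figure 2: concepts inside
// the perspective stay as concepts, concepts directly connected to them are
// recast as properties, and everything else is temporarily removed. The graph
// representation (adjacency map) and enum are illustrative assumptions.
class PerspectiveFocusSketch {

    enum Status { CONCEPT, RECAST_AS_PROPERTY, REMOVED }

    static Map<String, Status> apply(Set<String> inPerspective, Map<String, Set<String>> neighbors) {
        Map<String, Status> result = new HashMap<>();
        for (String node : neighbors.keySet()) {
            if (inPerspective.contains(node)) {
                result.put(node, Status.CONCEPT);
            } else if (!Collections.disjoint(neighbors.get(node), inPerspective)) {
                result.put(node, Status.RECAST_AS_PROPERTY);  // on the periphery
            } else {
                result.put(node, Status.REMOVED);             // grayed out
            }
        }
        return result;
    }
}
```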
As an example of how perspectives are created, Figure 3 shows a snapshot from a user session where a
perspective is being constructed. Its purpose is to map a local expert’s understanding of vulnerability onto
the AktiveSA ontology.
The bottom left panel in the display visually depicts the perspective (here
comprised of three expressions shown as nodes that have been added separately by the user), and provides
a visual editor by which to create and interact with these expressions. The form of these expressions is
described in Gahegan et al. (2008) [4].
[4] Note to reviewers: I have a sequence of images showing how a perspective is created to bring the AktiveSA and local ontology into alignment…but they would take up a lot of space. Hence I have included only one snapshot for now. Perhaps these additional images could be added to an accompanying website?
Figure 3. A snapshot of the process of creating a perspective, to map local user knowledge onto an established
ontology. The perspective editor, shown at bottom left, contains a visual portrayal of a perspective, at this stage
comprising three expressions (the green circles).
As mentioned above, many semantic differences arise because of quite arbitrary choices during ontology
construction, or because of a predilection for ‘lumping’ or ‘splitting’, i.e. the degree to which specificity is
added to a generalization hierarchy. In Figure 4, a simple hierarchy is shown that contains the concepts of
Tree and Eucalypt Tree, and a single instance, e.Maculata. Eucalypts are Evergreens, so this category
could be added into the hierarchy, but in doing so, Tree would no longer be directly related to Eucalypt;
the dashed relation would be removed, and the two dotted relations would be added. Many local measures
of similarity can be misled by such simple differences, even though both ontologies are in most senses
entirely commensurate with each other. More to the point, if the concept Evergreen is superfluous,
confusing, contentious or absent from the conceptual model of the ontology user, then it need not be
shown.
Figure 4. A simple generalization hierarchy, showing the optional inclusion of an additional concept (Evergreen
Tree).
Following Sowa and Majumdar (2003), we recognize that the relationship between Tree and Eucalypt
remains equally true whether implied via the Evergreen concept, or explicitly linked.
For some
applications, the concepts of Evergreen and Deciduous may be useful, but for others they are unnecessary.
Should a researcher who does not need this distinction be forced to work with it? We believe not, and that
it is perfectly acceptable for them to ‘see’ and use this ontology as if Eucalypt and Tree were directly
connected.
The same logic can be applied to non-generalization relations too, and is very useful in emphasizing
specific (but implied and therefore non-obvious) patterns in knowledge. The following example should
make this point clear.
3.1 Examples of Perspectives in use
As an example, consider an ontology of authors, articles and thematic areas. Typically, we know which
articles the various authors have written, and (via keywords) which themes the articles relate to. But we
may wish to know, or see directly, which authors share interests in the same themes, or how themes
cluster together because they are studied by the same authors. Trying to glean this information from the
concept map can be difficult; the indirect connections via articles to researchers add a great deal of (what
for this question is) noise.
Figure 5 shows the GIST Body of Knowledge, recently developed to provide a consolidated account of the
various themes that comprise GISystems and Science from an educational point of view
(http://www.ucgis.org/priorities/education/modelcurriculaproject.asp). An ontology was constructed from
the GIST major themes along with their hierarchical relationships. Each major theme (such as analytical
methods, cartography and visualization) is colored differently, with the window on the left of the screen
acting as a legend for the various themes. Note that the figure is designed to show the breadth of the
ontology, the details are not important here.
Figure 5. The GIST ontology, created from the GIST Body of Knowledge document that describes the major
teaching themes in the field of GIS. The ontology, like the document, is a hierarchy, comprised of major themes
that are further subdivided into specific topics; color is used to differentiate the various themes. The left panel in
the display is a navigable legend, displayed as a tree.
The next image, Figure 6, shows a close-up of part of the GIST ontology after authors have been added in
(scraped from Google Scholar using the GIST themes as keywords), and connected to the various themes
that they have published on. There is a proliferation of new relationships, but they are useful for seeing which
authors publish broadly, and which narrowly (the broader authors have multiple connections, and include
different colored themes). However, if the user wishes to see which topics seem to be closely related (i.e.
often studied by the same researchers) then this display does not help. The GIST Body of Knowledge is
structured as a hierarchy, but in general, many GIScience researchers work on several subtopics across the
field. The final image in this sequence, Figure 7, shows relationships between GIST topics based on
authors who work across different areas. The idea is to find topics that are closely related to each other (in
terms of what authors study) but classified in GIST into different themes. A perspective was created to
derive relationships between topics based on common authors, and then to remove the authors (the
concepts that actually link themes together). The figure is shown in close-up, so the detail is readable. To
achieve this transformation, the perspective effectively internalizes (rolls up) the relations from topics to
authors inside the topic concepts, so that authors are now effectively attributes of topics. Topics are
then linked together if they share the same value for any of their author properties.
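The roll-up just described can be sketched as follows: once author links have been internalized as attributes of each topic, two topics are connected whenever their author sets overlap. The data structures are illustrative assumptions, not the perspective implementation itself.

```java
import java.util.*;

// Sketch of the roll-up described above: author links are internalized as
// attributes of each topic, and two topics are then connected whenever they
// share an author. Data structures are illustrative, not the perspective
// implementation in ConceptVista.
class SharedAuthorLinksSketch {

    // topicAuthors: topic -> set of author names (the rolled-up properties).
    static Set<List<String>> linkTopics(Map<String, Set<String>> topicAuthors) {
        Set<List<String>> links = new HashSet<>();
        List<String> topics = new ArrayList<>(topicAuthors.keySet());
        for (int i = 0; i < topics.size(); i++) {
            for (int j = i + 1; j < topics.size(); j++) {
                if (!Collections.disjoint(topicAuthors.get(topics.get(i)),
                                          topicAuthors.get(topics.get(j)))) {
                    links.add(List.of(topics.get(i), topics.get(j)));
                }
            }
        }
        return links;
    }
}
```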
Figure 6. A close-up of part of the GIST ontology with researchers added in and connected to themes, based on
the articles they have published. GIST themes are shown via the different colored ellipses, authors are displayed
as light blue rectangles (the currently selected author, Alan MacEachren, is highlighted in red). The inset
schematic at bottom left shows how the topics and authors are connected.
Figure 7. A perspective filter is applied to the ontology shown in Figure 6 above, to make explicit the implied
relationships between topics (based on co-occurrence of links to researchers). As the schematic at bottom left
shows, topics are now directly connected together when they share a researcher, and researchers are now absent.
See text for more details.
A perspective, then, is a special kind of mediating expression that extends the Jena query capability, so
that concepts, relations and properties can be hidden, internalized, and externalized but cannot be created
or destroyed. Perspectives do not change the underlying ontology whatsoever; rather, they support specific
views onto it, provided they do not conflict with the underlying semantics (interpreted as above). Using
perspectives, a user or a knowledge engineer can shape the ontology to reflect their own understanding,
compressing it where it is too specific, and truncating it where it is too broad. So they can still use an
agreed, community ontology without the need to proliferate new versions to suit their immediate needs,
with all the savings this entails in terms of additional maintenance and reconciliation. Small differences
can be overlooked where the overall logic is not compromised.
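To illustrate the principle that a perspective is a view over an unchanged base model, the sketch below (written against the current Apache Jena packages rather than the version contemporary with this work) materializes the implied Tree-Eucalypt link of Figure 4 in a separate view model and queries the union of the two; it stands in for, but is not, the ConceptVista perspective engine.

```java
import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDFS;

// Minimal sketch of a perspective as a view over an unchanged base model: an
// implied Tree - Eucalypt link is asserted in a separate "view" model whenever
// the base model connects them via an intermediate class, leaving the base
// model untouched. Illustration of the principle only, not the ConceptVista
// perspective engine.
public class PerspectiveViewSketch {

    public static void main(String[] args) {
        String ns = "http://example.org/";
        Model base = ModelFactory.createDefaultModel();
        Resource tree = base.createResource(ns + "Tree");
        Resource evergreen = base.createResource(ns + "Evergreen");
        Resource eucalypt = base.createResource(ns + "Eucalypt");
        eucalypt.addProperty(RDFS.subClassOf, evergreen);
        evergreen.addProperty(RDFS.subClassOf, tree);

        // Build the view: shortcut any two-step subclass chain into a direct link.
        Model view = ModelFactory.createDefaultModel();
        StmtIterator lower = base.listStatements(null, RDFS.subClassOf, (RDFNode) null);
        while (lower.hasNext()) {
            Statement s = lower.next();
            StmtIterator upper = base.listStatements(s.getResource(), RDFS.subClassOf, (RDFNode) null);
            while (upper.hasNext()) {
                view.add(s.getSubject(), RDFS.subClassOf, upper.next().getResource());
            }
        }
        // The user works against the union of base + view; the base is never modified.
        Model perspective = ModelFactory.createUnion(base, view);
        System.out.println(perspective.contains(eucalypt, RDFS.subClassOf, tree)); // true
    }
}
```

Because the derived statements live only in the view, retracting the perspective is simply a matter of discarding that model.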
4 Conclusions and Future Work
There seems to be no end to the set of possible semantic similarity methods. Of the many kinds reported
we have experienced mixed, patchy results with all of them. Moving forward from this point demands
more rigorous evaluation and comparison. The work we describe in the first part of this paper addresses
the problem via an open, extensible platform for computing semantic similarity. We have drawn on the
Peircian notions of firstness, secondness and thirdness to provide a strong conceptual underpinning that
we believe neatly describes the parameters and goals of different methods and so provides a firm
foundation for application development. Our API supports several methods for computing similarity, and
eases the problem of adding more. Results are provided visually, and the tools provide a high degree of
user interaction.
In parallel with this, with the aim of also producing more cognitively plausible tools with which to filter
and mediate ontologies, we have developed the notion of perspectives, described in the second half of the
paper. Perspectives are a special case of mediating transformations, or thirdness. They provide some
further sophistication to similarity methods, because they offer ways to see past what we argue are small
differences in the way knowledge can be encoded. We hope to eventually use these ideas as part of a
more advanced similarity reasoner that can itself take different perspectives during the process of
computing similarity.
Firstness, secondness and thirdness do seem to be robust notions on which to design a semantic similarity
platform, but it appears that a fourthness category may be needed in addition: to deal with the growing
possibilities of computing similarity using measures of association from massive collections of use-cases.
Similarity by emergence perhaps describes it best. Mitra (2004) describes how this might work, by
scraping information from web pages returned from Google searches on the concepts to be compared. We
have experimented with these methods at length, and hope to include them in a later release of our
software.
Having constructed a platform for evaluating semantic similarity, it is perhaps time to move to a stage of
careful experimentation and comparison of our collective set of methods. Such a program will take time
and effort, but is needed. Answers to the following deeper questions must now be sought:
1. Which semantic methods are the most reliable and useful in uncovering similarity, or in merging
ontologies? This remains an open question that requires some sophisticated experimentation with
users, using tools such as the one shown above—and more besides!
We are keen to hear of
experiences from other researchers on this topic.
2. How should different measures of similarity be combined when they are used? Are there contexts or
tasks for which some measures work better than others, and if so, what are they? Given that we might
discover these contexts, is it possible to achieve useful matching results without supervision by human
experts?
3. Is there a useful role for hermeneutics to play in constructing knowledge horizons or some other form
of perspective around concept maps, and showing where two horizons might intersect? So far in our
experiments with local perspectives we have not tried to explicitly highlight where perspectives
diverge, but this might hold some promise for communicating differences in understanding.
Acknowledgements:
This research was funded by the US National Science Foundation (NSF) via grants BCS–9978052
(HERO), ITR (BCS)–0219025, and ITR (EAR)–0225673 (GEON). The authors would like to thank
Stephen Weaver for his work in turning GIST into an RDF document (no mean feat) and the organizers of
the COSIT 2007 Semantic Similarity Workshop, where some of these ideas were first aired.
References
Agarwal, P. 2005. Ontological Considerations in GIScience. International Journal of Geographical
Information Science 19 (5):501-536.
Bloehdorn, S., Haase, P., Sure, Y., and Voelker, J. 2006. Ontology Evolution. In Semantic Web
Technologies: Trends and Research in Ontology-based Systems, eds. Davies, Studer and Warren, pp. 51-70. London: John Wiley & Sons Ltd.
Braspenning, P.J. 2000. Symposium on Intelligent Agents in Software Engineering for Planning, KaHo
St.-Lieven, Gent, 23rd February, 2000.
Brodaric, B. 2007. Geo-Pragmatics for the Geospatial Semantic Web. Transactions In GIS 11 (3):453-477.
Brodaric, B. and Gahegan, M. 2007. Experiments to examine situated geoscientific concepts. Spatial
Cognition and Computation Journal (Special Issue on Cognitive Semantics and Ontologies) 7(1): 61-95.
Clancey, W. 1994. Situated cognition: How representations are created and given meaning. Lessons from
Learning. R Lewis and P Mendelsohn (Eds.) Amsterdam, North-Holland: pp. 231-242.
Davies, J., Studer, R., and Warren, P. 2006. Semantic Web Technologies: Trends and Research in
Ontology-based Systems. London: John Wiley & Sons Ltd.
Fonseca, F., Camara, G., and Monteiro, A.M. 2006. A Framework for Measuring the Interoperability of
Geo-Ontologies. Spatial Cognition and Computation 6 (4): 309-331.
Frodeman, R. 2003. Geo-Logic: Breaking ground between philosophy and the earth sciences. Albany,
SUNY Press.
Gahegan, M., Agrawal, R. and DiBiase, D. 2007. Building rich, semantic descriptions of learning
activities to facilitate reuse in digital libraries. International Journal on Digital Libraries, 7, (1-2):
81-97. URL:
http://www.springerlink.com/content/q102m641460h77v6/?p=cae7b09531014c3d9605c2f74b10dbfa
&pi=4
Gahegan, M, and Pike, W. 2006. A Situated Knowledge Representation of Geographical
Information. Transactions In GIS 10 (5):727-749.
Gahegan, M., Luo, J., Weaver, S., Pike, W. and Banchuen, T. (in review). Connecting GEON: making
sense of the myriad resources, researchers and concepts that comprise a geoscience
cyberinfrastructure. Computers & Geosciences, special issue on cyberinfrastructure for the
geosciences.
Gruber, T. R. 1993. A Translations Approach to Portable Ontology Specifications. Knowledge
Acquisition 5: 199-220.
Guarino, N. 1998. Formal ontology in information systems. In: Guarino, N. (ed.) Formal Ontology in
Information Systems. Proc. FOIS'98, Trento, Italy, June 6-8 1998. IOS Press, Amsterdam, pp. 3-15.
Haase, P., van Harmelen, F., Huang, Z., Stuckenschmidt, H., and Sure, Y. 2005. A Framework of
Handling Inconsistency in Changing Ontologies. International Conference on Semantic Web (ISWC
2005) Lecture Notes in Computer Science 3729, pp. 353-367.
Harvey, F., Kuhn, W., Pundt, H., Bishr, Y., and Riedemann, C. 1999. Semantic Interoperability: A Central
Issue for Sharing Geographic Information. The Annals of Regional Science 33: 213-232.
Kavouras, M., Kokla, M., and Tomai, E. 2005. Comparing categories among geographic ontologies.
Computers & Geosciences 31 (2): 145-154.
Klein, M. 2004. Change Management for Distributed Ontologies. PhD Dissertation, Dutch Graduate
School for Information and Knowledge Systems, Vrije Universiteit, Amsterdam.
Kuhn, W. 2005. Geospatial Semantics: What, of What, and How? Journal on Data Semantics III LNCS
3534: 1-24.
Kuipers, B. 2000. The Spatial Semantic Hierarchy. Artificial Intelligence 111: 191-233.
Lin, K. and Ludäscher, B. 2003. A system for semantic integration of geological maps via ontologies.
Proc. of the Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data
(SCISW).
Mitra, P. 2004. An Algebraic Framework for the Interoperation of Ontologies. PhD Dissertation,
Department of Electrical Engineering, Stanford University.
Peirce, C. S. 1891. Review of: Principles of Psychology by William James, Nation, 53: 32.
Peirce, C. S. 1931. The Collected Papers of Charles Sanders Peirce. Harvard University Press:
Cambridge, MA.
Pike W, Yarnal B, MacEachren A, Gahegan M, Yu C, 2005. Infrastructure for collaboration:
Building the future for local environmental change, Environment, 47(2): 8-21.
Pike, W. and Gahegan, M. 2007. Beyond ontologies: towards situated representations of scientific
knowledge. International Journal of Human-Computer Studies,65 (7): 674-688.
Rodriguez, M. A., and Egenhofer, M.J. 2003. Determining Semantic Similarity among Entity Classes
from Different Ontologies. IEEE Transactions on Knowledge and Data Engineering 15 (2):442-456.
Schwering, A. 2008. Approaches to Semantic Similarity Measurement for Geo-Spatial Data: A Survey.
Transactions in GIS 12(1): 5-29.
Schwering, A. 2005. Hybrid Model for Semantic Similarity Measurement. Ordnance Survey Research
Report Series, Southampton, UK.
Schwering, A. and M. Raubal 2005. Spatial Relations for Semantic Similarity Measurement. 2nd
International Workshop on Conceptual Modeling for Geographic Information Systems
(CoMoGIS2005), Klagenfurt, Austria, Springer: Berlin.
Sheth, A. 1999. Changing Focus on Interoperability in Information Systems: From System, Syntax,
Structure to Semantics. In Interoperating Geographic Information Systems, eds. Goodchild,
Egenhofer, Fegeas and Kottman. Kluwer: New York, pp. 5-29.
Smart, P.D., Abdelmoty, A.I., El-Geresy, B.A., and Jones, C.B. 2007. A Framework for Combining Rules
and Geo-ontologies. RR 2007 Lecture Notes in Computer Science 4524. pp. 133-147.
Sowa, J.F. 2005. The Challenge of Knowledge Soup. Research Trends in Science, Technology and
Mathematics Education, pp. 55-90.
Sowa, J.F. 2006. A Dynamic Theory of Ontology. In Formal Ontology in Information Systems, eds.
Bennett and Fellbaum, Amsterdam: IOS Press, pp. 204-213.
Sowa, J., and Majumdar, A. 2003. Analogical reasoning, in de Moor A., Lex, W. and Ganter B. (eds.),
Conceptual Structures for Knowledge Creation and Communication, LNAI 2746, Springer-Verlag,
Berlin, pp. 16-36. http://www.jfsowa.com/pubs/analog.htm
Tversky, A. 1977. Features of Similarity. Psychological Review 84(4): 327-352.
Von Schweber, E. (2006) URL:
http://colab.cim3.net/file/work/Expedition_Workshop/2006_01_24_BootstrappingSOAthroughCOIs/
VonSchweber_LivingSystems_2006_01_24.ppt (accessed April 8, 2008).
Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H., and Hübner, S., 2001.
Ontology-based integration of information: a survey of existing approaches. In: Stuckenschmidt, H.
(ed.), IJCAI-01 Workshop: Ontologies and Information Sharing, pp. 108-117.
Whitehead, A. N. 1929. Process and Reality: An Essay in Cosmology. Social Science Book Store: New
York.
Whitehead, A. N. 1933. Adventures of Ideas, Macmillan: New York.
Winter, S. 2001. Ontology: buzzword or paradigm shift in GIScience? International Journal of
Geographical Information Science 15 (7): 587-590.
Appendix A: UML specification of the similarity Application Programming Interface (1)
Appendix A: UML specification of the similarity Application Programming Interface (2)