A platform for experimenting with measures of semantic similarity and supporting individual perspectives onto shared ontologies

Mark Gahegan*, Ritesh Agrawal, Anuj Jaiswal, Junyan Luo and Kean-Huat Soon
GeoVISTA Center, Department of Geography, The Pennsylvania State University, USA.
*Now at: School of Geography, Geology and Environmental Science, University of Auckland, New Zealand. Email: m.gahegan@auckland.ac.nz

Abstract

This paper describes two developments in the ongoing search for better semantic similarity tools: such methods are important when attempting to reconcile or to integrate knowledge, or knowledge-related resources such as ontologies and database schemas. The first is an open, extensible platform for experimenting with different measures of similarity for ontologies and concept maps. The platform is based around three different types of similarity, which we ground in cognitive principles and which provide a taxonomy and structure by which new methods can be integrated. The platform supports a variety of specific similarity methods, to which researchers can add others. It also provides flexible ways to combine the results from multiple methods, and some graphic tools for visualizing and communicating multi-part similarity scores. Details of the system, which forms part of the ConceptVista open codebase, are described, along with associated details of the interfaces by which users can add new methods, choose which methods are used and select how multiple similarity scores are aggregated. We offer this as a community resource, since many similarity methods have been proposed but there is still much confusion about which one(s) might work well for different geographical problems; hence a test environment that all can access and extend would seem to be of practical use. We also provide some examples of the platform in use.

The second part of the paper describes in detail the idea of 'perspectives'—a means of defining specific views onto semantic knowledge that can overcome some of the smaller differences in ontology that are sometimes a stumbling block for compatibility or acceptance. Perspectives are designed to help reconcile a user's specific (but individual) understanding, or their current needs, to the contents of an established domain ontology, but without forcing the user to adopt all the constructs of the ontology directly. Minor differences can be overcome without the need for the user's conceptual model (or related application programs) to be changed. Perspectives offer a convenient way to customize how ontologies appear to their users, rather like 'views' in a relational database (but significantly more powerful). We argue that perspectives are a kind of mediating transformation by which knowledge resources can be integrated. In fact, they operationalize the notion of foreground and background in cognition, allowing currently irrelevant details to be moved to the periphery. So far in our implementation, perspectives allow: (i) properties to be recast as concepts in their own right (and conversely, concepts and sub-graphs to be reduced to properties), (ii) differences in specialization/generalization to be bypassed and, more generally, (iii) implied relations to be used to connect concepts directly. We describe perspectives from a cognitive standpoint, and give examples of how they can be used.

Keywords: semantic similarity, ontology mediation, open platform, concept mapping, GIScience

1 Introduction
Knowledge computing (ontologies, provenance, workflows, etc.) has opened the possibility of representing and reasoning with knowledge about the geographical world, which—just like with data before it—has led in turn to many further questions regarding interoperability, integration and update of such knowledge resources. Not surprising, then, that we often find ourselves faced with the problem of reconciling knowledge captured from different experts, or scraped from published documents and databases. But faced with the plethora of choices as to how knowledge should be recognized as similar and integrated, the way forward is sometimes uncertain. Which methods should we use? Such demands have led us to construct an evaluation platform for various semantic similarity methods that we describe below, in the first part of the paper. The platform is implemented in ConceptVista, a concept mapping, semantic search and knowledge integration environment (http://www.geovista.psu.edu/ConceptVISTA/index.jsp). ConceptVista differs from ontology tools such as Protégé in its support for less formal kinds of knowledge such as concept maps, its links to many other Web 2.0 technologies, and its support of highly visual interaction and display. A fuller description is available from Gahegan et al. (2007).

On the positive side, the semantic similarity research literature has grown very rapidly in the past five or so years, so there are by now many useful similarity methods and associated metrics to draw from. Harvey et al. (1999), Winter (2001), Kuhn (2005) and Agarwal (2005) provide good introductions to the setting of geospatial semantics and some of the specific problems that must be overcome in a geographical setting. More general accounts of data semantics and ontologies are given by Gruber (1993), Sheth (1999), Guarino (1998) and Davies et al. (2006); while Wache et al. (2001) provide a useful summary of ontology integration methods. Measuring semantic difference in geo-ontologies is specifically addressed in the work of Rodriguez and Egenhofer (2003), Kavouras et al. (2005) and Fonseca et al. (2006), in which practical metrics are proposed. Haase et al. (2005), Klein (2004) and Bloehdorn et al. (2006) specifically discuss the problem of checking for, and maintaining, consistency in distributed (co-evolving) ontologies. Sowa (2006) provides a very thought-provoking account of dynamic ontology and, with colleague Majumdar (Sowa and Majumdar, 2003), describes a system that can find matches between knowledge fragments (conceptual graphs) using a sophisticated analogy reasoning engine. Nevertheless, live systems which can be used by an entire research community to experiment with ontology-based knowledge integration are rare. One very basic but useful example from the Geosciences Network (GEON) is described by Lin and Ludäscher (2003); it can be accessed from the GEON cyberinfrastructure portal at https://portal.geongrid.org/gridsphere/gridsphere. However, a number of deeper questions about geographical knowledge construction and integration that relate to the philosophy of geo-ontology (Frodeman, 2003), via the fields of hermeneutics, pragmatics and situated cognition, require very careful attention since they impinge on the practicality and validity of some matching methods. An excellent account of the background to these questions and the related research literature over the past 40 years is provided by Schwering (2008).
Many of these questions arise, we believe, because of the situated and often contested sense of meaning that is common within geography and the natural sciences (Clancey, 1994; Brodaric and Gahegan, 2007; Brodaric, 2007; Pike and Gahegan, 2007). By and large, the concepts and relations we use to describe the world do not exist in nature; they are constructed by humans. Hence it is not surprising that meaning differs between individuals, and through time. The assumption that conveying and reconciling understanding is a merely linguistic problem that would not occur in a purified language like description logic (ontology) is, in fact, almost entirely wrong (Braspenning, 2000; Sowa, 2005). Thus there are often no perfect theoretical solutions for geographical knowledge integration, but rather subjective measures and practices that, on balance, provide useful results.

Take the case of two experts who study different, but overlapping domains, such as (i) vulnerability of local places to climate change and (ii) crisis management and disaster relief. It is likely that they will share some knowledge that is obviously the same, but there may also be knowledge that is not identical but is commensurate to some degree, and finally of course there is knowledge that is not common to both of them. The same can be said of computer programs, databases and ontologies; when compared with each other there are grades of similarity and overlap. Putting it another way, there is intersecting knowledge, where the problem is to ensure that ontological clauses are not repeated and that small inconsistencies are recognized and resolved. Then there is augmenting knowledge, which extends what is known by each party separately but remains compatible with the intersecting core; and finally there is the possibility of knowledge that (so far) is disjoint and does not (yet) fit in. Readers are directed to a thought-provoking view of this problem in a presentation by Von Schweber (2006), where Venn diagrams are used to differentiate what is currently known, what could be known (building on current knowledge) and what cannot be known, as a means of understanding the limits of knowledge exchange between two parties.

For the first case of intersecting knowledge, we need to recognize equivalence—though there may be naming conventions and differences in level of detail that must be resolved in order to do so. For the second case of augmenting knowledge, we might expect fewer overlapping details and the problem becomes one of knowledge integration. More importantly, there are likely to be concepts that play quite different roles, i.e. have different properties and engage in different relations. For example, the vulnerability expert might be able to identify and understand at-risk populations, but have no idea how to evacuate them from the path of a hurricane. Yet much of the underlying semantic structure may well be compatible, when examined. In the third case, where knowledge does not yet overlap, the best strategy might be to avoid placing too much weight on similarities because they may be coincidental. Of course, the complicating factor is that in most cases we typically do not know beforehand which of these three situations applies for a given knowledge fragment: we need to infer it. These apparent conflicts in purpose point to the need for very flexible methods for recognizing and resolving semantic similarity.
Many of the tools developed to date for similarity resolution assume that the problem is actually of the first type. And it is true that many integration problems studied so far use ontologies to resolve apparent differences between knowledge communities that are very 'close' to each other—as in the many examples of ontologies used to build schema mappings between databases with similar content. But these are not the only kinds of knowledge integration problems we are faced with. From our point of view, then, we see an urgent need, not for another kind of metric to calculate semantic similarity, but for an environment that allows:

1. Methods and their metrics to be readily evaluated and compared,
2. Easy extension with new methods for specific kinds of similarity and matching problems,
3. Better support for augmenting knowledge (the second case described above),
4. Flexibility in the way that methods and results are combined and communicated, and
5. Simple, but effective ways to investigate and visualize the results.

In short, we do not know enough about the problem to be able to move straight to a solution; we need first to conduct experiments and evaluate strategies on their merits for resolving particular classes of problem. Without such experimentation, how will we ever know which strategies are in fact the most successful? A description of the environment we have constructed for evaluating similarity forms the first part of this paper.

The second part introduces a new idea to help knowledge communities look past unimportant differences in ontology, so that they do not become a dogmatic stumbling block for potential users. Specifically, we describe the idea of 'perspectives' onto knowledge, which provide different views onto an underlying knowledge schema, adapted specifically for some purpose. To give just a simple example for now, two experts (or two applications) might differ in the status afforded a notion such as 'Mesozoic'. To one scientist, this might be simply a property that describes the age of a fossil; but to another, Mesozoic might be a complex concept in its own right, with an intricate web of relations. The first scientist, not interested in these details, may wish to continue to treat Mesozoic as a property, and in our opinion should be allowed to do so because this is not inconsistent with the roles it can play. But the second scientist should be able to treat it as a highly connected concept. The trick here is to support both views from the same ontology.

2 A semantic similarity platform for testing and evaluation

As noted above, much has been written already on the construction of different metrics for computing similarity across ontologies, and the related task of matching, or building an equivalence mapping between two schemas. However, it seems all too often that new methods are not compared to existing ones, nor is the software developed made available to allow tests to be replicated and methods compared. To remedy this shortcoming, and to allow us to evaluate similarity methods in different contexts within the geographical and geological realms, we have constructed an open, extensible similarity platform (test bed), with graphical interfaces and pictorial display of the similarity measures used. However, before describing our similarity platform, some clarification of purpose and intent is needed.
Firstly, we see our contribution here as providing a solid conceptual structure for organizing the various similarity methods available and thereby creating a clean and extensible programming interface for ourselves and others to use. Good structure is needed if we are to manage a growing collection of more complex methods. Secondly, by no means do we claim to have a complete set of methods to hand, nor that the methods we describe are necessarily the best available. We do provide a structure, and have populated it with methods that seem interesting and useful to us. But it is far from complete. One of our purposes in writing this paper is to generate interest in sharing methods between research groups. If your favorite method is missing, please consider supplying it to us! Finally, it is not our purpose here to provide an in-depth review of similarity measures; we restrict ourselves to some brief notes on different types of similarity methods related to supporting them computationally.

We also note that calculating similarity is just one part of a complex process of knowledge integration and management, especially when the knowledge resources are community-based, that is, shared by a number of users. We envisage the research reported here fitting within a five-stage model of ontology management as follows:

1. Choose a strategy for assessing similarity: based on an understanding of the task and the parties involved.
2. Recognize similarities: use the chosen methods to compute similarity scores, and to summarize and report the findings.
3. Update local knowledge resources according to the similarities encountered. As part of this same project (but not reported here) we have developed different strategies for merging and integrating knowledge, from importing absent concepts to anchoring new concepts into formal domain ontologies. (This leads naturally to the question of exactly how such merging should take place. Should properties from matching concepts be merged into one new concept? Should they be subsumed into one or more existing concepts? When combining knowledge across two communities, concepts—though their names and other properties may be similar—may actually mean different things to their knowledge community, or play significantly different roles. Thus assuming they are equivalent, and can therefore be merged, may be unhelpful.)
4. Manage the revision and maintenance of ontologies within a community: this under-researched area must address questions of how knowledge communities are formed, rules for participation and a policy for approving revisions. Some ideas as to how this might work are presented in Klein (2004) and Pike et al. (2005). The tools for software version control can address the technological aspects of the problem.
5. Broadcast changes to community knowledge resources using a regular update cycle and distribution mechanism. (Imagine, if you will, Critical Knowledge Updates to your PC!)

2.1 Specific similarity metrics

Following an idea suggested by Sowa and Majumdar (2003), we have developed a categorization scheme for similarity metrics according to Peirce's notions of firstness, secondness, and thirdness (Peirce, 1931). In his writings, Peirce coins these terms to describe the different kinds of knowledge that are needed to form and relate categories. To wit:

"First is the conception of being or existing independent of anything else. Second is the conception of being relative to, the conception of reaction with, something else. Third is the conception of mediation, whereby a first and a second are brought into relation." (Peirce, 1891)

So, firstness describes the internal or intrinsic character of some entity, secondness relates an entity to other entities, and thirdness describes mediating transformations by which entities can become connected. If we extend these ideas into the realm of semantic similarity, we obtain a concise and clear taxonomy for similarity methods that makes it easier to understand what the different methods do, and what their operational parameters will be.
This scheme provides some much-needed structure for the software development, as summarized in Table 1.

Table 1. Modes of similarity, descriptions of what they do and the programming interfaces they require.

Mode: Firstness methods
Description: Match by internal property values, by property type, and also property name or identity.
Parameters: Properties(A): Vector, Properties(B): Vector

Mode: Secondness methods
Description: Match concepts according to their relationships with other concepts, i.e. their graph neighborhood—a sub-graph centered on each concept.
Parameters: Neighborhood(A): Graph, Neighborhood(B): Graph

Mode: Thirdness methods
Description: Introduce a mediating graph that brings A and B into relation.
Parameters: Properties(A): Vector, Properties(B): Vector, Mediation(A, B): Graph

2.1.1 Firstness Methods

Value and Structure Similarity
Value similarity methods calculate similarity scores based on the commonality of property values. The metrics are designed to be directly proportional to the number of values that two concepts share, and are sometimes called set-theoretic measures, employing notions of contrast (Tversky, 1977). Structural similarity matches not on the values of properties, but on their types. Thus it is basically a count of the number of properties whose types are the same, again normalized by the number that are different. The form of the method is typically a measure of the information the concepts share—a similarity score calculated as the sum, over all properties or values (p_i) that are common between two concepts (A and B), of their commonality—divided by the total information contained in A and B separately. Commonality calculations vary with the type of the properties; a different method is needed for each new property type. See the equations below for examples of how these notions can be computed. Note that commonality can be computed in a number of ways: the formulae below show integers compared according to their closeness to each other (giving a numeric score in the range 0–1), and strings compared for exact matches (giving a score of either 0 or 1). Partial string matching and more complex lexicographical analysis is often more reliable when comparing concept maps and ontologies not created using a controlled vocabulary. The general form for all firstness methods is (from Table 1) Properties(A): Vector, Properties(B): Vector; in the case of simply comparing the names of concepts, the vectors reduce to simple strings or URIs.

$$\mathrm{Similarity\ score}(A, B) = \frac{\sum_{i=1}^{n} \mathrm{commonality}(p_i, A, B)}{\mathrm{total\ information}(A, B)}$$

$$\mathrm{Commonality}(p, A, B) = \begin{cases} 1 - \dfrac{|A_p - B_p|}{\mathrm{range}(p)} & \text{if } p \text{ is of type int} \\ 1 & \text{if } p \text{ is of type string and } A_p = B_p \\ 0 & \text{if } p \text{ is of type string and } A_p \neq B_p \end{cases}$$

$$\mathrm{Total\ information}(A, B) = n_A + n_B - n_{A \cap B}$$

It is in the nature of these kinds of measures that they are narrowly focused on the internal aspects of concepts. As a consequence, it is usually not safe to rely on them alone. Types and values can be chosen quite arbitrarily, and it is difficult to be sure that the comparisons are meaningful. For example, unless constrained to only compare properties with the same names, these methods would compare the age of a dog and the length of a train. But they do add evidence, and can be useful when used to support more advanced methods.
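To make this concrete, the following is a minimal Java sketch of the value-similarity calculation defined by the equations above, assuming property vectors represented as simple name-to-value maps and a caller-supplied range for numeric properties. The class and method names are illustrative only and do not correspond to the actual ConceptVista API; a structural variant would compare property types rather than values, but is otherwise identical in form.

```java
import java.util.Map;

/**
 * A minimal sketch of the firstness value-similarity score defined by the
 * equations above. Property vectors are represented here as simple
 * name-to-value maps; names are illustrative, not the ConceptVista API.
 */
public class ValueSimilaritySketch {

    /** Similarity(A,B) = sum of per-property commonality / total information. */
    public static double similarity(Map<String, Object> a, Map<String, Object> b,
                                    Map<String, Double> numericRanges) {
        double common = 0.0;
        int shared = 0;
        for (String p : a.keySet()) {
            if (!b.containsKey(p)) continue;            // only properties present in both
            shared++;
            double range = numericRanges.getOrDefault(p, 1.0);
            common += commonality(a.get(p), b.get(p), range);
        }
        // Total information(A,B) = nA + nB - n(A intersect B)
        double total = a.size() + b.size() - shared;
        return total == 0.0 ? 0.0 : common / total;
    }

    /** Numeric values score by closeness within their range; others by exact match. */
    private static double commonality(Object va, Object vb, double range) {
        if (va instanceof Number && vb instanceof Number) {
            double diff = Math.abs(((Number) va).doubleValue() - ((Number) vb).doubleValue());
            return 1.0 - Math.min(1.0, diff / range);
        }
        return va.equals(vb) ? 1.0 : 0.0;
    }
}
```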
2.1.2 Secondness Methods

Secondness methods concern the relations between concepts, or more broadly the similarity of their graph neighborhoods. They are sometimes referred to as network methods. The simplest kinds are very similar to those described above for properties and values, except they are applied to relations (and possibly their properties also). So, following from the equations given above, secondness measures can compute similarity based on commonality among the relations, again normalized by the total information. One complicating question is the depth of the subgraphs to be compared, and whether all types of relations contribute equally to the overall score. For example, one could define a semantic distance metric, so that more distant relations (not directly connected, but connected by intermediary concepts) count for less.

More useful relational measures can be calculated using the explicit semantics of the relations represented in the ontology. For example, one can count the network distance that must be travelled across a generalization hierarchy to connect two concepts A and B: the number of generalization steps from A to their closest shared generalization (G), plus the number of further specialization steps from G to B. Various other families of proximal relations can be used as well, including the spatial (Kuipers, 2000; Schwering and Raubal, 2005). This kind of method works well with ontologies that share type hierarchies explicitly, but more sophisticated measures are needed otherwise, such as those described by Rodriguez and Egenhofer (2003). The general form remains as Neighborhood(A): Graph, Neighborhood(B): Graph (Table 1), though some of the more advanced methods may require additional parameters.
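The generalization-hierarchy measure just described can be sketched as follows, assuming a simple in-memory map from each concept to its direct generalizations rather than a full ontology model; all names are illustrative. The resulting distance d could be turned into a score in the range 0–1, for example as 1/(1+d), before being passed to a Summarizer (Section 2.2).

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch of a hierarchy-distance secondness measure; names are illustrative. */
public class HierarchyDistanceSketch {

    /**
     * Counts generalization steps from A up to the closest shared generalization G,
     * plus specialization steps from G down to B. Returns -1 if no shared
     * generalization exists.
     */
    public static int distance(String a, String b, Map<String, List<String>> parents) {
        Map<String, Integer> upFromA = stepsUpward(a, parents);
        Map<String, Integer> upFromB = stepsUpward(b, parents);
        int best = -1;
        for (Map.Entry<String, Integer> e : upFromA.entrySet()) {
            Integer fromB = upFromB.get(e.getKey());
            if (fromB == null) continue;                 // not a shared generalization
            int d = e.getValue() + fromB;
            if (best < 0 || d < best) best = d;
        }
        return best;
    }

    /** Breadth-first walk up the generalization hierarchy, recording step counts. */
    private static Map<String, Integer> stepsUpward(String start, Map<String, List<String>> parents) {
        Map<String, Integer> steps = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        steps.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String concept = queue.poll();
            for (String parent : parents.getOrDefault(concept, Collections.emptyList())) {
                if (!steps.containsKey(parent)) {
                    steps.put(parent, steps.get(concept) + 1);
                    queue.add(parent);
                }
            }
        }
        return steps;
    }
}
```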
2.1.3 Thirdness Methods

Implied semantic relationships connecting concepts ('nym matching service)
It is often the case when combining knowledge gathered from different human experts or different communities that we need to recognize connections via certain implicit (or missing) semantic relationships, such as: toponyms, hypernyms, hyponyms, meronyms, holonyms, synonyms, and antonyms. Although conceptually simple—use WordNet, Cyc or some other web-based thesaurus and gazetteer to check whether two concepts have some kind of 'nym relationship—the problems here are ones of computational performance and the reliability and completeness of these external resources. Performance is a problem because searches through an external corpus must be made between all concepts that might be similar (thus on the order of n(n-1)/2 comparisons, assuming symmetry). Completeness is a problem because specialized scientific vocabulary is often missing from these general-purpose thesauri. We have developed a separate matching service that uses the semantic relationships from WordNet. The service searches for concepts that are similar based on these semantic relationships. Effectively this offloads the computational burden to a remote computer as an RMI service, using the powerful Lucene indexing tool for efficiency. Users of the service do not need to be aware of these complexities; the interface is as described in Table 1, i.e. A: Concept, B: Concept, Mediation(A, B): Graph, but the Graph in this case reduces to a single 'nym relation.

Mediating subgraphs (analogies and perspectives)
The real power of mediation (thirdness) occurs when graphs more complex than a single relation can be used to form connections between concepts. Sowa and Majumdar (2003) describe an analogy engine to search for these mediating subgraphs in existing knowledge resources (such as ontologies). In our own work, we have been frustrated by the lack of detailed domain knowledge by which these subgraphs might be constructed, so have developed a human-led method by which users can construct their own mediating subgraphs. We term these subgraphs 'perspectives' and dedicate the next section of the paper to explaining how they work and how they can be used. As far as the API is concerned they require no special treatment. Of course, the most successful strategies are likely to be those that can combine multiple methods and resolve how to combine their results (see Schwering, 2005 as an example). We briefly touch on this problem below.

2.2 Details of the Application Programming Interface

The Application Programming Interface is constructed around the following components:

Analyzers: Methods for performing comparisons of values and types (described above). Three different families are implemented so far for the firstness methods: StringAnalyzer, NumericAnalyzer and DateAnalyzer. (We have not included analyzers for secondness in our work so far, as there are so many to borrow from colleagues.) Thirdness analyzers include a 'nym matching service and the perspectives mechanism described below in Section 3.

Filters: Exclude certain properties of concepts and relations from any comparison.

GUI Interfaces: Allow users to configure the various options and parameters that different Analyzers can use.

Extractors: Extract needed information from more complex fields, such as the local concept name from a concept URI string (for when concepts are identified with URIs).

Summarizers: Define the expressions used to combine multiple scores together.

Visualizers: Ways of reporting the similarity scores in the ConceptVista application.

Supporting the Analyzers is a SimilarityRegistry class that maintains a HashMap index between similarity methods and their graphical user interface (GUI) component (an instance of SimilarityEditorInterface). This class allows users to add a new HashMap entry at runtime, if there is a need to modify or add an analyzer. It also decouples similarity measures from their GUI component, hence different graphical interfaces to methods can also be substituted as needed. Specific RegistryAnalyzers maintain a registry of which GUI components to use to set the different parameters of a specific implementation of an Analyzer interface.

Filters are provided to allow the user to exclude different properties of a concept from being used by the various similarity metrics. This is necessary because some properties may be known to be misleading, and also because various details unrelated to semantics sometimes find their way into some concept maps.

GUI Interfaces are the graphical components by which the user interacts with the various similarity methods. Because of the resemblance in parameters shown in Table 1, new similarity methods can sometimes make use of existing GUI Interfaces, so no additional code is needed.

Extractors are used to mine out information for matching from more complex properties, such as creating concept names from their URIs. They provide a way to restrict what is compared for the properties selected for use.

Summarizers are the means to combine similarity scores from multiple methods. We envisage here a process by which the scores from different methods can be weighted and combined. Over time, we can perhaps learn which weighted combinations of methods work best for different knowledge integration problems and domains. So far, we have restricted our Summarizers to work only with the firstness measures described above. It remains an open question how to build a more general summarizer that will work across all kinds of similarity methods.

Visualizers are a class of methods for visually depicting the similarity scores. In the example shown below in Figure 1, simple bars are used to show these scores. We have experimented with other visual devices and selected this one for now, though we intend to add more methods soon.
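As an illustration of the registry pattern described above, the following is a minimal sketch showing how an Analyzer and its GUI editor might be registered and looked up at runtime. The interfaces shown are simplified stand-ins and do not reproduce the actual ConceptVista class signatures; because the editor is looked up through the registry rather than hard-wired to the measure, a different graphical interface can be substituted without touching the analyzer itself.

```java
import java.util.HashMap;
import java.util.Map;
import javax.swing.JComponent;

/** Simplified stand-in for a similarity method; the real interface carries more context. */
interface Analyzer {
    /** Returns a similarity score in the range 0..1 for two property values. */
    double compare(Object a, Object b);
}

/** Simplified stand-in for the GUI component used to configure an analyzer. */
interface SimilarityEditorInterface {
    JComponent getEditor();
}

/** Maintains the analyzer-to-editor index and allows new entries at runtime. */
class SimilarityRegistry {
    private final Map<Analyzer, SimilarityEditorInterface> registry = new HashMap<>();

    public void register(Analyzer analyzer, SimilarityEditorInterface editor) {
        registry.put(analyzer, editor);   // decouples the measure from its GUI component
    }

    public SimilarityEditorInterface editorFor(Analyzer analyzer) {
        return registry.get(analyzer);
    }
}
```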
A complete UML diagram for the less experimental aspects of the whole platform is given for reference purposes in Appendix A. The entire application, which includes ConceptVista and the 'nym matching service, is available for download as a package from: http://www.personal.psu.edu/arj135/Projects/CV4/CV4-Setup.exe. The authors will also be happy to make source code available to any interested parties. The codebase uses an LGPL license and is written in Java using the JENA ontology tools.

2.3 Example of use

When the user begins an evaluation of semantic similarity, two (or more) ontologies (or more informal concept maps) are first loaded and displayed concurrently. The user then clicks on the similarity tools panel to choose which methods to use and, via their GUI Interfaces, sets any necessary operational parameters. Having configured the methods, and selected a Summarizer, the user then clicks on a concept in one ontology from which to begin the search for similarity. The system responds by calculating similarity scores between this concept and all the other concepts in the second ontology. The results are projected into the display. The user can then choose to act on the computed scores, perhaps by creating new relations to represent the uncovered connections, or by proceeding to compare further concepts.

The example below in Figure 1 shows the concept of SurfaceWater (highlighted in red) from the SWEET EarthRealm ontology (source: www.nasa.gov/earthrealm) compared to several concepts from the AktiveSA ontology for crisis management / disaster relief (source: https://www.edefence.org/~ps/aktivesa/OntoWeb/index.htm). The concept SurfaceWaterObject has the best overall similarity scores. As configured in this example, the pink bar represents a structural similarity score, which is calculated based on how many common links are associated with both concepts. The red bar denotes a similarity score based on lexical similarities of the concept names. Finally, the brown bar shows a score that combines both the structural and string value measures together. For this possible match, structural similarity has the highest score, as the two concepts share an almost identical set of properties. The string value similarity receives a moderate score because of the lexical difference between "SurfaceWaterObject" and "SurfaceWater". This also compromises the combined score that considers both structural and value similarity measures. Note of course that different methods may well change the outcomes. For example, a more refined lexical string matching method might improve the results, as might the use of a Filter to remove the sub-string "Object" from all concepts in the AktiveSA ontology. The graph neighborhoods are very similar, so secondness methods may not be so effective in this simple example.
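The combined (brown bar) score in this walkthrough is the job of a Summarizer. As a minimal sketch, a weighted-sum Summarizer over named per-method scores might look like the following; the score names and weights are illustrative only, not the platform's defaults.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal sketch of a weighted-sum Summarizer; weights and score names are illustrative. */
public class WeightedSumSummarizer {

    private final Map<String, Double> weights = new LinkedHashMap<>();

    public WeightedSumSummarizer() {
        weights.put("structural", 0.5);   // e.g. the structural (pink bar) score
        weights.put("stringValue", 0.5);  // e.g. the lexical (red bar) score
    }

    /** Combines named scores (each in 0..1) into a single score in 0..1. */
    public double combine(Map<String, Double> scores) {
        double weighted = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            Double score = scores.get(entry.getKey());
            if (score == null) continue;          // ignore methods that were not run
            weighted += entry.getValue() * score;
            totalWeight += entry.getValue();
        }
        return totalWeight == 0.0 ? 0.0 : weighted / totalWeight;
    }
}
```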
Figure 1. The concept of SurfaceWater from the SWEET EarthRealm ontology (highlighted in red) is compared to several concepts from the AktiveSA ontology for crisis management. The resulting similarity scores are shown by the multi-bar glyph symbols in the display. See text for further details.

3 Perspectives

Perspectives are sequences of ontological transformations, specifically designed to overcome some of the possible conflicts that can occur in the process of human-oriented ontology creation and use. For example:

(i) seemingly arbitrary decisions, where concepts could be linked by many relations, concerning which ones should be made explicit and which ones are simply implied;
(ii) a different degree of interest or sophistication that may arise between ontology producers and ontology consumers; or
(iii) a dissimilar propensity for levels of generalization among practitioners, where the ontology is strongly hierarchical but the user conceives of a much flatter structure (or vice versa).

The aim of perspectives in all these cases is to finesse over such differences. An early implementation of perspectives as ontological filters was reported by Gahegan et al. (2008), as a predominantly visual filter used to draw attention to certain themes in the display, built automatically around pre-defined semantic types. Filters were used to show the visual intersection of different views onto a concept map. Here perspectives are extended to a deeper cognitive level, designed to mediate conceptual knowledge so it better matches an expert's personal understanding or current need. Importantly, in this sense perspectives broaden the notion of what might be considered commensurate knowledge beyond the kinds of similarity measures described above. Specifically, they allow us to operationalize an idea from the writings of Whitehead (1929; 1933), where he describes the ability of humans to move ideas (concepts and relationships) between the light of enquiry and the 'penumbral background', where details are not known precisely, or are not currently needed. Using this idea, we can define a filter that highlights a specific sub-graph, but which 'rolls up' the concepts on the periphery of the filter, temporarily removing or recasting them as simple properties of the concepts of interest.

Figure 2 describes schematically how perspectives work in the case of providing such conceptual focus. The top diagram shows a concept map or ontology upon which two different perspectives (A and B) will be defined. The left diagram represents the effect of applying perspective A: concepts inside the filter are unchanged (numbers 6, 7, 8 and 9), concepts connected directly to those inside are reduced to being properties of the included concepts and are shown now as circles (2, 5, 10, 11, 12, 15). Concepts not directly connected to those within the perspective are temporarily removed (shown grayed out). The right diagram shows perspective A retracted and perspective B asserted (concepts 6, 7, 12 and 14 now in focus), with corresponding changes to the surrounding nodes. The movement from A to B represents conceptual refocusing, so some previously relevant details are no longer required, some previously truncated concepts are re-inflated and some additional concepts become visible.
Figure 2. An overview of how perspective filters work. The upper diagram shows a simplified ontology as a series of connected nodes. Two perspectives (A and B) onto this ontology are shown on the lower diagrams. Concepts on the periphery of a perspective are recast as properties and shown as circles. Concepts falling outside a perspective are temporarily removed (grayed out). See text for details.

As an example of how perspectives are created, Figure 3 shows a snapshot from a user session where a perspective is being constructed. Its purpose is to map a local expert's understanding of vulnerability onto the AktiveSA ontology. The bottom left panel in the display visually depicts the perspective (here comprised of three expressions, shown as nodes, that have been added separately by the user), and provides a visual editor by which to create and interact with these expressions. The form of these expressions is described in Gahegan et al. (2008). (Note to reviewers: we have a sequence of images showing how a perspective is created to bring the AktiveSA and local ontology into alignment, but they would take up a lot of space; hence only one snapshot is included for now. Perhaps these additional images could be added to an accompanying website?)

Figure 3. A snapshot of the process of creating a perspective, to map local user knowledge onto an established ontology. The perspective editor, shown at bottom left, contains a visual portrayal of a perspective, at this stage comprising three expressions (the green circles).

As mentioned above, many semantic differences arise because of quite arbitrary choices during ontology construction, or because of a predilection for 'lumping' or 'splitting', i.e. the degree to which specificity is added to a generalization hierarchy. In Figure 4, a simple hierarchy is shown that contains the concepts Tree and Eucalypt Tree, and a single instance, e.Maculata. Eucalypts are Evergreens, so this category could be added into the hierarchy, but in doing so Tree would no longer be directly related to Eucalypt; the dashed relation would be removed, and the two dotted relations would be added. Many local measures of similarity can be misled by such simple differences, even though both ontologies are in most senses entirely commensurate with each other. More to the point, if the concept Evergreen is superfluous, confusing, contentious or absent from the conceptual model of the ontology user, then it need not be shown.

Figure 4. A simple generalization hierarchy, showing the optional inclusion of an additional concept (Evergreen Tree).

Following Sowa and Majumdar (2003), we recognize that the relationship between Tree and Eucalypt remains equally true whether implied via the Evergreen concept, or explicitly linked. For some applications, the concepts of Evergreen and Deciduous may be useful, but for others they are unnecessary. Should a researcher who does not need this distinction be forced to work with it? We believe not, and that it is perfectly acceptable for them to 'see' and use this ontology as if Eucalypt and Tree were directly connected. The same logic can be applied to non-generalization relations too, and is very useful in emphasizing specific (but implied and therefore non-obvious) patterns in knowledge. The following example should make this point clear.

3.1 Examples of Perspectives in use

As an example, consider an ontology of authors, articles and thematic areas.
Typically, we know which articles the various authors have written, and (via keywords) which themes the articles relate to. But we may wish to know, or see directly, which authors share interests in the same themes, or how themes cluster together because they are studied by the same authors. Trying to glean this information from the concept map can be difficult; the indirect connections via articles to researchers add a great deal of (what for this question is) noise.

Figure 5 shows the GIST Body of Knowledge, recently developed to provide a consolidated account of the various themes that comprise GISystems and Science from an educational point of view (http://www.ucgis.org/priorities/education/modelcurriculaproject.asp). An ontology was constructed from the GIST major themes along with their hierarchical relationships. Each major theme (such as analytical methods, or cartography and visualization) is colored differently, with the window on the left of the screen acting as a legend for the various themes. Note that the figure is designed to show the breadth of the ontology; the details are not important here.

Figure 5. The GIST ontology, created from the GIST Body of Knowledge document that describes the major teaching themes in the field of GIS. The ontology, like the document, is a hierarchy, comprised of major themes that are further subdivided into specific topics; color is used to differentiate the various themes. The left panel in the display is a navigable legend, displayed as a tree.

The next image, Figure 6, shows a close-up of part of the GIST ontology after authors have been added in (scraped from Google Scholar using the GIST themes as keywords), and connected to the various themes that they have published on. There is a proliferation of new relationships, but they are useful for seeing which authors publish broadly, and which narrowly (the broader authors have multiple connections, and include different colored themes). However, if the user wishes to see which topics seem to be closely related (i.e. often studied by the same researchers) then this display does not help. The GIST Body of Knowledge is structured as a hierarchy, but in general, many GIScience researchers work on several subtopics across the field.

The final image in this sequence, Figure 7, shows relationships between GIST topics based on authors who work across different areas. The idea is to find topics that are closely related to each other (in terms of what authors study) but classified in GIST into different themes. A perspective was created to derive relationships between topics based on common authors, but then removing the authors (the concepts that actually link themes together). The figure is shown in close-up, so the detail is readable. To achieve this transformation, the perspective effectively internalizes (rolls up) the relations from topics to authors inside the topic concepts, so that authors are now effectively attributes of topics. Topics are then linked together if they share the same value for any of their author properties.
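To make the roll-up transformation concrete, the following is a minimal sketch of how such a derived topic-to-topic view could be produced as a SPARQL CONSTRUCT query run through Jena, leaving the source model untouched. The namespace and the hasAuthor / sharesAuthorWith predicates are purely illustrative, and the modern org.apache.jena package names are assumed rather than those of the original codebase; this is not the actual perspective expression language used in ConceptVista. The derived model could then be displayed in place of, or layered over, the author-level view, which is essentially what the perspective in Figure 7 does interactively.

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.rdf.model.Model;

/** Minimal sketch: derive topic-topic links from shared authors without altering the source. */
public class SharedAuthorPerspectiveSketch {

    // Hypothetical namespace; the GIST ontology uses its own URIs.
    private static final String NS = "http://example.org/gist#";

    public static Model linkTopicsBySharedAuthor(Model source) {
        String queryString =
            "PREFIX ex: <" + NS + ">\n" +
            "CONSTRUCT { ?t1 ex:sharesAuthorWith ?t2 }\n" +
            "WHERE {\n" +
            "  ?t1 ex:hasAuthor ?author .\n" +
            "  ?t2 ex:hasAuthor ?author .\n" +
            "  FILTER (?t1 != ?t2)\n" +
            "}";
        Query query = QueryFactory.create(queryString);
        QueryExecution qexec = QueryExecutionFactory.create(query, source);
        try {
            // The derived relations live in a separate model; the underlying
            // ontology is never modified, in keeping with how perspectives behave.
            return qexec.execConstruct();
        } finally {
            qexec.close();
        }
    }
}
```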
Figure 6. A close-up of part of the GIST ontology with researchers added in and connected to themes, based on the articles they have published. GIST themes are shown via the different colored ellipses; authors are displayed as light blue rectangles (the currently selected author, Alan MacEachren, is highlighted in red). The inset schematic at bottom left shows how the topics and authors are connected.

Figure 7. A perspective filter is applied to the ontology shown in Figure 6 above, to make explicit the implied relationships between topics (based on co-occurrence of links to researchers). As the schematic at bottom left shows, topics are now directly connected together when they share a researcher, and researchers are now absent. See text for more details.

A perspective, then, is a special kind of mediating expression that extends the Jena query capability, so that concepts, relations and properties can be hidden, internalized, and externalized, but cannot be created or destroyed. Perspectives do not change the underlying ontology whatsoever; rather, they support specific views onto it, provided these do not conflict with the underlying semantics (interpreted as above). Using perspectives, a user or a knowledge engineer can shape the ontology to reflect their own understanding, compressing it where it is too specific, and truncating it where it is too broad. So they can still use an agreed, community ontology without the need to proliferate new versions to suit their immediate needs, with all the savings this entails in terms of additional maintenance and reconciliation. Small differences can be overlooked where the overall logic is not compromised.

4 Conclusions and Future Work

There seems to be no end to the set of possible semantic similarity methods. Of the many kinds reported, we have experienced mixed, patchy results with all of them. Moving forward from this point demands more rigorous evaluation and comparison. The work we describe in the first part of this paper addresses the problem via an open, extensible platform for computing semantic similarity. We have drawn on the Peircian notions of firstness, secondness and thirdness to provide a strong conceptual underpinning that we believe neatly describes the parameters and goals of different methods and so provides a firm foundation for application development. Our API supports several methods for computing similarity, and eases the problem of adding more. Results are provided visually, and the tools provide a high degree of user interaction.

In parallel with this, with the aim of also producing more cognitively plausible tools with which to filter and mediate ontologies, we have developed the notion of perspectives, described in the second half of the paper. Perspectives are a special case of mediating transformations, or thirdness. They provide some further sophistication to similarity methods, because they offer ways to see past what we argue are small differences in the way knowledge can be encoded. We hope to eventually use these ideas as part of a more advanced similarity reasoner, which can itself take different perspectives during the process of computing similarity.

Firstness, secondness and thirdness do seem to be robust notions on which to design a semantic similarity platform, but it appears that a fourthness category may be needed in addition: to deal with the growing possibilities of computing similarity using measures of association from massive collections of use-cases. Similarity by emergence perhaps describes it best. Mitra (2004) describes how this might work, by scraping information from web pages returned from Google searches on the concepts to be compared. We have experimented with these methods at length, and hope to include them in a later release of our software.
Having constructed a platform for evaluating semantic similarity, it is perhaps time to move to a stage of careful experimentation and comparison of our collective set of methods. Such a program will take time and effort, but is needed. Answers to the following deeper questions must now be sought:

1. Which semantic methods are the most reliable and useful in uncovering similarity, or in merging ontologies? This remains an open question that requires some sophisticated experimentation with users, using tools such as the one shown above—and more besides! We are keen to hear of experiences from other researchers on this topic.
2. How should different measures of similarity be combined when they are used? Are there contexts or tasks for which some measures work better than others, and if so, what are they? Given that we might discover these contexts, is it possible to achieve useful matching results without supervision by human experts?
3. Is there a useful role for hermeneutics to play in constructing knowledge horizons or some other form of perspective around concept maps, and showing where two horizons might intersect? So far in our experiments with local perspectives we have not tried to explicitly highlight where perspectives diverge, but this might hold some promise for communicating differences in understanding.

Acknowledgements: This research was funded by the US National Science Foundation (NSF) via grants BCS–9978052 (HERO), ITR (BCS)–0219025, and ITR (EAR)–0225673 (GEON). The authors would like to thank Stephen Weaver for his work in turning GIST into an RDF document (no mean feat) and the organizers of the COSIT 2007 Semantic Similarity Workshop, where some of these ideas were first aired.

References

Agarwal, P. 2005. Ontological Considerations in GIScience. International Journal of Geographical Information Science 19(5): 501-536.

Bloehdorn, S., Haase, P., Sure, Y., and Voelker, J. 2006. Ontology Evolution. In Semantic Web Technologies: Trends and Research in Ontology-based Systems, eds. Davies, Studer and Warren, pp. 51-70. London: John Wiley & Sons Ltd.

Braspenning, P.J. 2000. Symposium on Intelligent Agents in Software Engineering for Planning, KaHo St.-Lieven, Gent, 23rd February, 2000.

Brodaric, B. 2007. Geo-Pragmatics for the Geospatial Semantic Web. Transactions in GIS 11(3): 453-477.

Brodaric, B. and Gahegan, M. 2007. Experiments to examine situated geoscientific concepts. Spatial Cognition and Computation (Special Issue on Cognitive Semantics and Ontologies) 7(1): 61-95.

Clancey, W. 1994. Situated cognition: How representations are created and given meaning. In Lessons from Learning, eds. R. Lewis and P. Mendelsohn. Amsterdam: North-Holland, pp. 231-242.

Davies, J., Studer, R., and Warren, P. 2006. Semantic Web Technologies: Trends and Research in Ontology-based Systems. London: John Wiley & Sons Ltd.

Fonseca, F., Camara, G., and Monteiro, A.M. 2006. A Framework for Measuring the Interoperability of Geo-Ontologies. Spatial Cognition and Computation 6(4): 309-331.

Frodeman, R. 2003. Geo-Logic: Breaking ground between philosophy and the earth sciences. Albany: SUNY Press.

Gahegan, M., Agrawal, R. and DiBiase, D. 2007. Building rich, semantic descriptions of learning activities to facilitate reuse in digital libraries. International Journal on Digital Libraries 7(1-2): 81-97. URL: http://www.springerlink.com/content/q102m641460h77v6/?p=cae7b09531014c3d9605c2f74b10dbfa&pi=4

Gahegan, M. and Pike, W. 2006. A Situated Knowledge Representation of Geographical Information. Transactions in GIS 10(5): 727-749.
Gahegan, M., Luo, J., Weaver, S., Pike, W. and Banchuen, T. (in review). Connecting GEON: making sense of the myriad resources, researchers and concepts that comprise a geoscience cyberinfrastructure. Computers & Geosciences, special issue on cyberinfrastructure for the geosciences.

Gruber, T.R. 1993. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition 5: 199-220.

Guarino, N. 1998. Formal ontology in information systems. In Guarino, N. (ed.) Formal Ontology in Information Systems. Proc. FOIS'98, Trento, Italy, June 6-8 1998. Amsterdam: IOS Press, pp. 3-15.

Haase, P., van Harmelen, F., Huang, Z., Stuckenschmidt, H., and Sure, Y. 2005. A Framework for Handling Inconsistency in Changing Ontologies. International Conference on the Semantic Web (ISWC 2005), Lecture Notes in Computer Science 3729, pp. 353-367.

Harvey, F., Kuhn, W., Pundt, H., Bishr, Y., and Riedemann, C. 1999. Semantic Interoperability: A Central Issue for Sharing Geographic Information. The Annals of Regional Science 33: 213-232.

Kavouras, M., Kokla, M., and Tomai, E. 2005. Comparing categories among geographic ontologies. Computers & Geosciences 31(2): 145-154.

Klein, M. 2004. Change Management for Distributed Ontologies. PhD Dissertation, Dutch Graduate School for Information and Knowledge Systems, Vrije Universiteit, Amsterdam.

Kuhn, W. 2005. Geospatial Semantics: What, of What, and How? Journal on Data Semantics III, LNCS 3534: 1-24.

Kuipers, B. 2000. The Spatial Semantic Hierarchy. Artificial Intelligence 111: 191-233.

Lin, K. and Ludäscher, B. 2003. A system for semantic integration of geological maps via ontologies. Proc. of the Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW).

Mitra, P. 2004. An Algebraic Framework for the Interoperation of Ontologies. PhD Dissertation, Department of Electrical Engineering, Stanford University.

Peirce, C.S. 1891. Review of: Principles of Psychology by William James. Nation 53: 32.

Peirce, C.S. 1931. The Collected Papers of Charles Sanders Peirce. Cambridge, MA: Harvard University Press.

Pike, W., Yarnal, B., MacEachren, A., Gahegan, M. and Yu, C. 2005. Infrastructure for collaboration: Building the future for local environmental change. Environment 47(2): 8-21.

Pike, W. and Gahegan, M. 2007. Beyond ontologies: towards situated representations of scientific knowledge. International Journal of Human-Computer Studies 65(7): 674-688.

Rodriguez, M.A., and Egenhofer, M.J. 2003. Determining Semantic Similarity among Entity Classes from Different Ontologies. IEEE Transactions on Knowledge and Data Engineering 15(2): 442-456.

Schwering, A. 2008. Approaches to Semantic Similarity Measurement for Geo-Spatial Data: A Survey. Transactions in GIS 12(1): 5-29.

Schwering, A. 2005. Hybrid Model for Semantic Similarity Measurement. Ordnance Survey Research Report Series, Southampton, UK.

Schwering, A. and Raubal, M. 2005. Spatial Relations for Semantic Similarity Measurement. 2nd International Workshop on Conceptual Modeling for Geographic Information Systems (CoMoGIS2005), Klagenfurt, Austria. Berlin: Springer.

Sheth, A. 1999. Changing Focus on Interoperability in Information Systems: From System, Syntax, Structure to Semantics. In Interoperating Geographic Information Systems, eds. Goodchild, Egenhofer, Fegeas and Kottman. New York: Kluwer, pp. 5-29.

Smart, P.D., Abdelmoty, A.I., El-Geresy, B.A., and Jones, C.B. 2007. A Framework for Combining Rules and Geo-ontologies. RR 2007, Lecture Notes in Computer Science 4524, pp. 133-147.
Sowa, J.F. 2005. The Challenge of Knowledge Soup. In Research Trends in Science, Technology and Mathematics Education, pp. 55-90.

Sowa, J.F. 2006. A Dynamic Theory of Ontology. In Formal Ontology in Information Systems, eds. Bennett and Fellbaum. Amsterdam: IOS Press, pp. 204-213.

Sowa, J. and Majumdar, A. 2003. Analogical reasoning. In de Moor, A., Lex, W. and Ganter, B. (eds.), Conceptual Structures for Knowledge Creation and Communication, LNAI 2746. Berlin: Springer-Verlag, pp. 16-36. http://www.jfsowa.com/pubs/analog.htm

Tversky, A. 1977. Features of Similarity. Psychological Review 84(4): 327-352.

Von Schweber, E. 2006. URL: http://colab.cim3.net/file/work/Expedition_Workshop/2006_01_24_BootstrappingSOAthroughCOIs/VonSchweber_LivingSystems_2006_01_24.ppt (accessed April 8, 2008).

Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H., and Hübner, S. 2001. Ontology-based integration of information: a survey of existing approaches. In Stuckenschmidt, H. (ed.), IJCAI-01 Workshop: Ontologies and Information Sharing, pp. 108-117.

Whitehead, A.N. 1929. Process and Reality: An Essay in Cosmology. New York: Social Science Book Store.

Whitehead, A.N. 1933. Adventures of Ideas. New York: Macmillan.

Winter, S. 2001. Ontology: buzzword or paradigm shift in GIScience? International Journal of Geographical Information Science 15(7): 587-590.

Appendix A: UML specification of the similarity Application Programming Interface (1)

Appendix A: UML specification of the similarity Application Programming Interface (2)