Ontology Mapping: a survey and a proposal.
Roberto Basili, Maria Teresa Pazienza and Fabio Massimo Zanzotto
Lecture notes for the course Basi di Dati Distribuite (Distributed Databases)
Università degli Studi di Roma Tor Vergata, Academic Year 2003-2004
List of contents

1. INTRODUCTION
2. A DEFINITION FOR THE NOTION OF ONTOLOGY
   2.1 INTERFACES TOWARDS WORLDS WHERE CONCEPTS ARE EXPRESSED: THE TASK LEVEL OF THE ONTOLOGY
3. EXISTING APPROACHES TO ONTOLOGY MAPPING
   3.1 ONTOLOGY MAPPING ARCHITECTURES
       3.1.1 Topology
       3.1.2 Dynamics
   3.2 AUTOMATIC METHODS FOR MAPPING ONTOLOGIES
       3.2.1 Mapping at the Instance Level
       3.2.2 Denotations of the Concepts
       3.2.3 The use of the structure
4. PROPOSED APPROACH TO MEANING NEGOTIATION IN MOSES
   4.1.1 Basic Assumptions
   4.1.2 A Mental Model for Meaning Negotiation
   4.1.3 The Dimensions of the Meaning Negotiation Algorithm
5. ADDING NODES IN THE KNOWLEDGE GRID
   5.1 ONTOLOGY EXTRACTION FROM PLAIN TEXTS
REFERENCES
1. Introduction
The strong interoperability between software agents envisioned in the Semantic Web (SW) (Berners-Lee, Hendler, and Lassila, 2001) is a very appealing idea. Software agents "going" around the World Wide Web and "working" cooperatively to serve the needs of end users would be very valuable assistants in efficiently finding the desired information. To achieve these results, documents and services exposed on the web should clearly state their intended meaning. That is, human-readable information should be complemented with structured machine-readable data describing the intended semantics in an explicit and unambiguous way with respect to shared conceptualisations, often called ontologies.
Two major obstacles should be overcome to realise this vision in practice:
- firstly, after the definition of a formal language to express ontological information (e.g. SHOE (Heflin, Hendler, and Luke, 1999), DAML+OIL (DAML+OIL), or the more recent OWL (OWL)), real conceptualisations have to be written and agreed upon;
- secondly, the effective application of these methodologies can be guaranteed only when the costs of marking up content in documents can be neglected¹.

¹ As (Hendler, 2001) says: "Lowering the cost of markup isn't enough – for many users it needs to be free. That is, semantic markup should be a by-product of normal computer use. Much like current web content, a small number of tool creators and web ontology designers will need to know the details, but most users will not even know ontologies exist."
Explicit and unambiguous conceptualisations have often been seen as the basis for effective communication. The necessity of building such conceptualisations for given knowledge domains dates back a long way: not only communication among software agents but also communication among humans is affected by a large number of misunderstandings. In the 19th century the uncontrolled proliferation of names for new concepts made a reduction of divergent denotations urgent. Zoologists, botanists, and chemists had to handle a large number of new concept names referring to their new intuitions, just as technological progress required clear names for new artefacts (Cabré, 1998). Terminological studies originating with the work of (Wüster, 1931) addressed this problem. The attitude was prescriptive: a standardisation of the terms (i.e. concept denotations) used in a particular domain had to be settled and used for further reference. The final resource should be a standard governing communication among experts.
As the "prescriptive attitude" of terminology studies is very similar to the one underlying the Semantic Web vision, some insights into the problem of governing large-scale conceptualisations can be learnt from the long history of terminology, even if the addressed knowledge domains strongly differ. The software agents envisioned in the SW scenario should in fact help to perform everyday tasks such as reserving a room in a hotel, regardless of the underlying database or of the web interface of the specific (and selected) hotel. Terminological studies, on the contrary, aim to find explicit models for technical and scientific knowledge domains. These seem to be very far from the more common problems addressed in the SW.
When trying to define a conceptualisation for a knowledge domain, the problems that occur are similar regardless of the given domain. These problems are generated by the very nature of knowledge, so they appear in scientific or technical terminological standardisation efforts as well as in the conceptualisation of "simpler" (i.e. closer to everyday life) domains.
As knowledge is the psychological result of perception, learning, and reasoning:
- it belongs to an individual or a group of individuals (locality);
- it changes over time (variability).
An explicit and unambiguous version of the domain knowledge cannot be anything but the expression of a particular group of individuals at a particular time. Its strength depends on its stability and on its generality. Other individuals will use the proposed conceptualisation only if it represents their own perception of reality well.
The locality and the variability of knowledge can be immediately clarified with an example. Let us examine an explicit conceptualisation of the knowledge related to coffee, and let us imagine it as being compiled by an Italian. A very rough approximation of the possible resulting ontology is given in Fig. 1 (translations of the denotations are added for readability purposes).
[Figure: a taxonomy rooted in BEVERAGE with the nodes CAPPUCCINO (Cappuccino), MINI-CAPPUCCINO (Small Cappuccino), CAFFE (Coffee), CAFFE MACCHIATO (Dashed Coffee), CAFFE AMERICANO (American Coffee), CAFFE RISTRETTO (Short Coffee), CAFFE LUNGO (Long Coffee), CAFFE CORRETTO (Corrected Coffee), and CAFFE MACCHIATO CALDO (Hot Dashed Coffee).]
Fig. 1 An approximation of the coffee ontology for Italians
This can seem a very complex representation for a simple problem. However, it is hard to convince an Italian to renounce any of these concepts. Moreover, hard as it may be to believe, Italians can associate a very precise meaning with each of these nodes. For instance a CAFFE (coffee) is a black beverage served in a very small cup called a tazzina (where tazza means cup). It is prepared using a machine that injects steam through a pressed quantity of ground roasted coffee beans. A CAFFE MACCHIATO (dashed coffee) is the same black beverage served with a drop of milk, whereas a CAFFE CORRETTO (corrected coffee) is served with a drop of spirit (grappa is often used).
[Figure: knowledge nodes K1, ..., Kn connected in a grid, with an agent A interrogating them.]
Fig. 2 The knowledge grid: the linguistic and the content agents
The conceptualisation in Fig. 1 may well represent the knowledge that a group of people (i.e. Italians) have of the coffee domain, but it cannot be considered a general ontology for the coffee domain, as it represents a particular culture. However, an automatic reasoning system serving coffees in an Italian Bar (i.e. the place where coffees are served) must have such a conceptualisation to achieve the purpose of giving the final customer what he wants.
Moreover, the presented coffee ontology is still an approximation of the actual state of affairs. The Italian coffee world does not change rapidly, but new ways of preparing coffee may be invented, and this should be reflected in the ontology. This is the case of the so-called MAROCCHINO which, according to urban legend, is a sort of CAFFE MACCHIATO CALDO (i.e. a CAFFE with a drop of milk steamed in the Cappuccino way) with some cocoa powder on the top and in the liquid itself.
If this is the nature of "knowledge", a scenario (such as the one proposed in (DAML+OIL)) that admits a unique, stable, and agreed repository where explicit and unambiguous conceptualisations are collected does not seem very realistic. A more probable situation is that many conceptualisations of the same phenomena will coexist. Each knowledge node where information is exposed will follow one of these particular conceptualisations. This perspective is represented in Fig. 2, where K1,...,Kn are the information nodes and A is the agent aiming to satisfy its needs. With respect to the centralised and prescriptive attitude, this is a more complex case. When designing software agents it cannot be assumed that they share the conceptualisation of the information providers they are interrogating. Then, relying on their inner conceptualisation, they should be equipped with reasoning devices allowing them to interpret and transmit their own queries with respect to the conceptualisation used by the node publishing the information they are interested in. Before the message can be passed with success, a positive meaning negotiation has to be achieved. If this is the interaction model, the overall architecture proposed for the SW is very similar to the organisation of human society, where interacting people cannot be sure how widely their conceptualisation is shared.
The problem analysed in these terms is very fascinating. Basically, it may be seen as the problem of aligning different conceptualisations, which can be resolved in very different ways according to the capabilities of the automatic mapping algorithms and to the specific nature of the problem. Static or dynamic solutions can be foreseen.
[Figure: two knowledge nodes, Uniroma3 and UKBH.]
Fig. 3 The test-bed knowledge grid in MOSES
In this perspective, the vision of the Semantic Web slightly changes, and the initial obstacles to the success of the Semantic Web may be rewritten as follows:
- after the definition of a formal language to express ontological information, effective conceptualisations have to be written, and meaning negotiation algorithms have to be settled in order to mediate between the different conceptualisations;
- the real application of these methodologies can be guaranteed only when the costs of marking up content in documents can be neglected.
We plan to address these two issues in MOSES, and the purpose of this document is to describe how we plan to do so. As described in the previous deliverables of the project [2.1, 2.2, 2.3], we will have two information nodes with two different ontologies within a homogeneous knowledge domain, i.e. the university systems of two different countries. The knowledge nodes will represent two universities, Uniroma3 and UKBH (see Fig. 3).
As described in deliverable 3.1, two demonstrators are foreseen in MOSES: the static and the dynamic demonstrator. The first will implement an ontological approach to question-answering based on interactions via natural language. The second will show how we plan to make the approach scalable. From the point of view of the work planned in this deliverable, the construction of the static demonstrator will have as a by-product the definition of a test-bed, as:
- a static mapping between the ontologies of the two sites will be produced;
- the two university sites will be mapped to the ontologies.
These two resources will be used to test the algorithms for mapping the ontologies of existing knowledge nodes and for adding a new node.
In this document we will therefore firstly describe the notion of ontological conceptualisation that we will
adopt for addressing the problem of ontology mapping and ontology learning (Sec. 2). According to this
notion of conceptualisation we will classify and describe the existing methods for ontology mapping (Sec. 3).
We will then describe the architecture we propose to adopt in MOSES for the problem of meaning
negotiation (Sec. 4). Finally, we will briefly describe the methods we plan to use to add knowledge nodes in
the grid (Sec. 5).
2. A Definition for the Notion of Ontology
A very important concept in the Semantic Web and in this document is the notion of ontology. As (Guarino, 1998) points out, the term ontology is overloaded, as it is used to refer to very different concepts in many different disciplines (e.g. philosophy, artificial intelligence, knowledge management, and database theory). In order to better describe both ontology mapping and ontology building, it is better to define what we mean by the term ontology within this document.
The common perception is that an ontology is useful as it is a somehow systematic "description" of reality or of part of it, and the term ontology does its job in evoking this. However, there is no agreement on what should be modelled and how. In the following we will try to give a definition of the notion of ontology, considering it as an engineering artefact as opposed to the philosophical discipline. Our definition elaborates the relevant perspective given in (Maedche and Staab, 2001), introducing some more distinctions among the levels at which an ontology can be seen. The new levels serve to better explain how the mapping between different ontologies can be achieved and how conceptualisations can be derived from diverse sources of "ontological" information such as domain texts.
An ontology should serve the purpose of giving a conceptualisation of a knowledge domain, i.e. it should systematically show the necessary concepts and the necessary relationships among them. This should help in organising the particular state of the world that one wants to communicate. Communication implies and requires a semiotic level in which symbols related to concepts, relationships, and instances are represented. We will try to keep the inner level of the conceptualisation separate from the outer level of the denotation of the conceptualised elements. Therefore, for our purposes, the definition of an ontology is the following (see also Fig. 4, where the different levels are represented):
Definition (Ontology)
An ontology is a quintuple O = (C, I, S, F, G) where:
- C is the Conceptual Level, where concepts and relationships among concepts are postulated;
- I is the Instance Level, which represents the individuals of the investigated world and their actual relationships;
- S is the Denotation Level, in which signs for concepts, relationships, and instances are settled;
- F is the Instance Relation, which links instances to their related classes;
- G is the Denotation Relation, which indicates the signs associated with each element in C and I.
For the conceptual level we can rely on the definition of conceptualisation given in (Genesereth and Nilsson, 1987), where a conceptualisation C = (D, R) is a couple consisting of a domain D, i.e. a set of elements, and a set R of relations on D. Each element r in R is an n-ary (mathematical) relation on D for a given n, i.e. $r \subseteq D^n$. Note that n is not fixed for all the relations in R. As in OKBC (Chaudhri et al., 1998), we distinguish one special relationship among all, the binary isa relation, which, being transitive and acyclic, builds a hierarchy among the elements in D. There is a wide consensus on the existence of an isa relation among concepts because it is strongly related to the notion of concept itself. A concept in fact is defined when there exists a set of individuals (instances) sharing common properties; e.g. the concept of human is the intensional representation of an extension of individuals that share many properties, such as being biped, having two eyes, being able to laugh, etc. The inclusion of the extension of a concept A in the extension of a concept B is equivalent to thinking that the instances of the class A are also instances of the class B, and therefore that A is a B, as, for instance, a human is an animal.
All the other relations in R represent the specific conceptualisation of the domain. In OKBC, a part of these relationships, i.e. the binary relationships, are taken into account, and they are called slots. Given a relation r in R, an element (c1, ..., cn) in r governs the relationships among elements at the instance level.
The instance level, on the other hand, represents the space where the actual individuals of the represented world live and where their actual relationships are modelled. The instance level can therefore be seen as a couple $I = (D_I, R_I)$ where $D_I$ is the set of instances and $R_I$ is the set of relationships among the instances.
The denotation level contains and specifies the signs used to communicate about the objects foreseen in the two inner levels, i.e. the conceptual and the instance level.
The two relations, i.e. the Instance Relation F and the Denotation Relation G, connect respectively the conceptual level with the instance level, and the inner level (conceptual plus instance level) with the denotation level. They are defined as follows:

$F \subseteq I \times C$
$G \subseteq (C \cup I) \times S$

Note that these are thought of as mathematical relations and not as functions, to allow the modelling both of the ambiguity of the denotations on the one side and of multiple inheritance at the instance level on the other side. However, building on these relations, relevant subsets can be determined. The following "functions" can be defined:

$F(c) = \{ i \in I \mid (i,c) \in F \}$
$F^{-1}(i) = \{ c \in C \mid (i,c) \in F \}$
$G(e) = \{ d \in S \mid (e,d) \in G \}$ for $e \in C \cup I$
$G^{-1}(d) = \{ e \in C \cup I \mid (e,d) \in G \}$
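To make the five components concrete, here is a minimal Python sketch of this definition; it is only our illustration (class and method names are ours, not part of any MOSES code), with the relations F and G stored as sets of pairs:

    from dataclasses import dataclass, field

    @dataclass
    class Ontology:
        C: set = field(default_factory=set)  # concepts (domain D of the conceptual level)
        I: set = field(default_factory=set)  # instances
        S: set = field(default_factory=set)  # signs (denotations)
        F: set = field(default_factory=set)  # instance relation: pairs (i, c)
        G: set = field(default_factory=set)  # denotation relation: pairs (e, d)

        def instances_of(self, c):           # F(c)
            return {i for (i, cc) in self.F if cc == c}

        def classes_of(self, i):             # F^{-1}(i): multiple inheritance allowed
            return {c for (ii, c) in self.F if ii == i}

        def denotations_of(self, e):         # G(e)
            return {d for (ee, d) in self.G if ee == e}

        def denoted_by(self, d):             # G^{-1}(d): may be ambiguous
            return {e for (e, dd) in self.G if dd == d}

    # Toy usage on the coffee example:
    o = Ontology(C={"CAFFE"}, I={"cup_42"}, S={"caffe"},
                 F={("cup_42", "CAFFE")}, G={("CAFFE", "caffe")})
    print(o.instances_of("CAFFE"))  # {'cup_42'}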
[Figure: the three levels of an ontology – the denotation level, the conceptual level, and the instance level.]
Fig. 4 An ontology: the conceptual, the instance, and the denotation levels
Generally, the denotation level is neglected, as it is considered a constitutive part of any ontology. In fact, concepts, relations, and instances cannot be communicated if an effective sign system is not used for them. In these cases, the relation G is a one-to-one mapping between the symbols and the elements in the conceptual and instance spaces. However, this is not always the case: as pointed out in (Maedche and Staab, 2001), linguistic conceptualisations such as WordNet (Miller, 1995) or EuroWordNet (Vossen, 1998) need a clear separation between these two levels (i.e. the conceptual and the instance level on the one side, and the denotation level on the other side), since a one-to-one mapping between the inner and the communication level is not guaranteed due to the ambiguity of natural language.
A separation between the inner and the communication level is beneficial especially in a context of multiple competing conceptualisations such as the one envisaged above. This separation makes clearer the relationship between the modelled world and the signs used to convey the conceptualisation. The latter are a possible link between different conceptualisations.
Signs in the space S have in general a very nice property: they are formed using expressions that somehow follow the symbols and the formation rules of the natural language in which the conceptualisation is expressed. Let us take for instance one of the conceptualisations gathered in the DAML+OIL library: the university ontology (see deliverable 2.3). Concepts and relations (there called properties) are conveyed by natural language expressions following slightly different concatenation rules; e.g. the sign AssistantProfessor is a well-formed natural language expression except for the rule used to concatenate the two words.
It is worth noticing that natural language is one of the most important conceptualisation devices, even if not the only one. Moreover, it is the predominant one when dealing with non-formalised (and maybe familiar) knowledge domains such as the university domain, a bookstore, etc. When conceptualising, the background conceptualisation of the linguistic knowledge is a very strong bias. Moreover, since concepts and relationships have to be somehow efficiently communicated, the denotations used are chosen so as to recall the intended concepts or relations. This property and its implications are more or less consciously exploited in ontology mapping algorithms. The background conceptualisation is in fact very relevant. This will become clearer in the following sections.
2.1 Interfaces towards worlds where concepts are expressed: the task level of the ontology
An ontology is a specific conceptualisation of a domain. Usually, it aims to represent important concepts and relations in order to support the reasoning needed to accomplish a particular task. For instance, an ontology devoted to supporting question answering on the university system may suggest associations between professors and available courses. The necessary concepts are in this case faculty and course: their instances X and Y enter into relations like X teacherOf Y. This is all that is expressed at the denotation level. There are three denotation objects, faculty, course, and teacherOf, referring to objects in the conceptual level. How can they be used to:
- interpret natural language utterances such as the questions proposed to the system?
- interpret the pages from which this information has to be extracted?
For instance, all the following questions can be uttered to ask for information related to the previous relation among concepts:
Who teaches the Database course?
Who is the teacher of the Database course?
Who is the professor of the Database course?
Who holds the Database course?
Who gives the Database course?
[Figure: html pages in which teacherOf instances are conveyed by the style and position of simple textual elements.]
Fig. 5 Pages containing instances of the teacherOf relation
Furthermore, the required information, i.e. the tuples that instantiate the relation teacherOf, can be found in html pages like the one depicted in Fig. 5. There the information is mainly carried by the styles and the position of very simple textual elements in the pages.
[Figure: the interface level wrapping the ontology, comprising a Linguistic Interface, a Typographic Interface, and possibly others.]
Fig. 6 A complete ontology definition: the interface level
These examples show how a piece of ontological information such as a concept or a relation can appear in very different ways according to the space in which it occurs: these appearances are the properties it activates when it is uttered in one of these spaces. This kind of information is generally not included in an ontology, but it is important from many points of view:
- Interpreting questions: questions in natural language can be mapped to ontological relations and concepts only if a linguistic interface is foreseen;
- Automatically adding instances: describing the interfaces with the text and the html text can help in extracting the required knowledge, i.e. the instances of the foreseen concepts and relations, from the places where this information is naturally located, i.e. texts and www sites.
The denotation level may help in defining these interfaces, but it may not contain all the relevant information. Additional levels have to be foreseen. We collect them in what we call the interface level (see Fig. 6). Each element of this level represents an interface towards a particular form that the information may assume. It is still not clear how these interfaces should be separated from one another.
3. Existing approaches to Ontology Mapping
The interest in mapping diverse conceptualisations was not born with the Semantic Web. This is a very relevant issue for many disciplines, ranging from database theory (Batini, Lenzerini, Navathe, 1986) to psychological studies. The formalisation of a conceptual level (such as the Entity-Relationship diagram in database theory) rapidly induces the necessity of somehow stating the mapping between possibly divergent ontologies.
To better understand the scope of mapping different conceptualisations, let us go through an example. Imagine an Italian willing to have his habitual morning coffee in England, and suppose he has managed to find a coffee shop. What kind of interaction with the barman should he have to achieve the goal of having his own coffee? He will suspect that his own conceptualisation (or the way he represents it) is not the same as the barman's when he tries to ask for a hot dashed coffee. The situation may be depicted as in Fig. 7, where the two (non-exhaustive) ontologies are represented. It is worth noticing that the problem is not only a misalignment in the lexicalisation of the concepts (i.e. at the denotation or at the linguistic interface level). The real problem is that the conceptualisations also differ in structure, as the real worlds they model radically differ. The two ontologies should in fact model the Italian and the English ways of preparing coffee.
[Figure: side by side, the Italian client's taxonomy of Fig. 1 and the English clerk's taxonomy, whose nodes are BEVERAGE, COFFEE, ESPRESSO, CAPPUCCINO, and LONG CAPPUCCINO.]
Fig. 7 An attempt to map the Italian and the English coffee ontology
In the case of the Italian at the coffee shop, we can say that a successful communication of the concept happens whenever the poor Italian gets a cup of coffee that vaguely resembles what he is used to drinking. He should therefore manage to negotiate the intended meaning of his hot dashed coffee. He would be successful if he managed to negotiate that his CAFFE is similar to the ESPRESSO as intended by the barman. Something similar to the desired coffee can be obtained by adding a drop of milk to the served ESPRESSO, and this is the best that can be obtained. The main disappointment for the Italian Customer will be the realisation that the ESPRESSO he obtains is somehow different from what he was expecting. The Italian Customer and the English Clerk can finally conclude that there is a weak equivalence between the notions of CAFFE and ESPRESSO. This concludes the meaning negotiation activity and the breakfast of the Italian Customer.
Mapping different conceptualisations is then the activity of finding equivalences between concepts and relationships. As the above example shows, this activity can be tricky. Especially when the modelled domains, even if homogeneous, substantially differ because they represent different views of reality, strong (and maybe stable) equivalences between concepts may not be what we want to find. In these cases, what we may want is to discover a weak equivalence lasting just for the purpose of the communication.
Ontology mapping, or meaning negotiation, can be seen as a very relevant transversal field that arouses the interest of many researchers. The works in this interdisciplinary area are very difficult to follow because many of the contributions use the language proper to their original field. In this section we will report some of the results obtained in this transversal field in light of the definition of an ontology given in the previous section (Sec. 2). We will therefore first focus on the possible architectures for organising the activity of finding and describing the mappings among differentiated ontologies (Sec. 3.1). Secondly, we will describe the algorithms and the methods that have been proposed in the literature with respect to the level of the ontology definition they rely on (Sec. 3.2). The dimensions of the classification of the algorithms can be easily derived from the levels of the ontology definition. This does not mean that the cited algorithms rely only on that particular level; yet separating the contributions of the methods at the various levels they use better clarifies the role of each particular method. It is worth noticing that some of the methods simply neglect some of the levels. The dimensions of the classification are then the following:
- Denotation-based Ontology Mapping: the basic assumption is that a single denotation space has been used in the formalisation, inducing denotations that may differ only by "dialectal" inflections;
- Instance-based Ontology Mapping: the extensions of the concepts are used to determine the similarity;
- Conceptual-based Ontology Mapping: the structural relations among the concepts at the conceptual level are used to determine the similarity. The basic assumption is that structural similarity indicates similarity among concepts.
It is worth noting that in instance-based and conceptual-based ontology mapping the problem of different denotations is simply neglected.
3.1 Ontology Mapping Architectures
The first problem that should be addressed in mapping a large number of ontologies is deciding how to organise the work and how to organise the results of the mappings. Two main choices have to be made: on the one side, the topology of the final knowledge network has to be defined; on the other side, the dynamics of the final interactions. The architectural choices made in the literature strongly depend on the aims of the ontology mappings and on the availability and quality of the automatic mapping algorithms used to reach these aims. The possible choices are classified in the following, and related projects are cited where possible.
3.1.1 Topology
With respect to the topology of the final network, two different possible organisations of the mapping can be defined:
- a centralised approach: a central ontology is foreseen that guarantees the consistency of the overall knowledge system;
- a distributed approach: consistency is guaranteed by one-to-one mappings between all the possible pairs of ontologies.
3.1.1.1 Centralised Approach
The simplest idea for keeping consistency between different conceptualisations is to elect one conceptualisation as the "inter-lingua" (see Fig. 8). The use of the latter can be twofold. In the definition of a new conceptualisation it can be used as a shared vocabulary from which concepts can be extracted, if necessary and possible, or to which new concepts can be linked. If, on the contrary, the different conceptualisations already exist, the "inter-lingua" can be used as the reference for all the other ontologies.
Fig. 8 Keeping consistency with a centralised approach
This topology has often been used, as it reduces the number of necessary mappings. If N is the number of ontologies for which consistency has to be guaranteed, only N mappings are necessary. When the mappings are static, this is the most suitable solution.
This approach has often been adopted. In (Arens et al., 1993), a global ontology is postulated to keep the consistency of different database schemas. In (Beneventano et al., 2001) this model is applied by means of a general vocabulary, i.e. WordNet (Miller, 1995): the intended meaning of each node has to be linked to a particular synset in WordNet. The activity of choosing the appropriate synset is completely delegated to the knowledge engineer.
It is worth noting that the policy adopted in the Semantic Web (Berners-Lee, Hendler, and Lassila, 2001) follows this approach. One of the main efforts in fact is to define a space in which concepts can be univocally referred to. The definitions of the relations among concepts are then collected in a central repository (i.e. the DAML+OIL ontology library (DAML+OIL)) to which the developers of new ontological material have to refer.
3.1.1.2 Distributed Approach
The consistency between different ontologies may also be assured in pairs: given a set of ontologies, the links between each pair of ontologies should be given. The overall consistency is obtained using these one-to-one mappings. Given a set of N ontologies, on the order of N² mappings (N(N−1)/2 unordered pairs, e.g. 190 for N = 20) have to be produced to obtain the overall coverage. A sample mapping scheme is depicted in Fig. 9.
Fig. 9 Keeping consistency with a distributed approach
This approach may seem unreasonable due to the gigantic quantity of links that should be traced. However, it is often used. Consider for instance the construction of bilingual dictionaries: each pair of languages should be covered to ease the comprehension of language A for speakers of language B. The use of an inter-lingua may be misleading both for the final users and for the builders of the resource. This approach has also been used, for instance, in (Mena and Illarramendi, 2001).
3.1.2 Dynamics
The availability of automatic approaches to mapping ontologies can enlarge the set of possible architectures that can be designed for addressing the problem. A distributed architecture is unfeasible if the connections have to be statically traced: when the number of involved ontologies is large, the number of connections inherently grows quadratically. Yet, the availability of automatic approaches to mapping concepts and relationships belonging to different ontologies makes this solution more appealing and, in some cases, even more interesting than the centralised mapping. However, the final decision on the dynamics of the architecture strongly depends on the nature of the information modelled in the target ontologies.
3.1.2.1 Static Mappings
The final objective of a static mapping is to have a resource that connects different conceptualisations in a stable way. However, a static mapping is possible only when certain conditions are met by the nature of the target ontologies. For instance, in the case of EuroWordNet, the need was to create a static cross-lingual resource with a stable final mapping among the concepts expressed in the different languages. As the core of a language does not change frequently, a static mapping is the best solution. Here, automatic mapping algorithms may be used to ease the task of finding correspondences.
3.1.2.2 Dynamic Mappings
In a world where different conceptualisations may arise because of a slightly different perspective on the analysed domain, or because of a lack of knowledge of the existence of an already defined conceptualisation, the availability of ontology mapping algorithms may open very relevant dynamic scenarios. Software agents equipped with these ontology mapping algorithms may assess the meaning of an utterance (i.e. the denotation of a concept or a relationship) through meaning negotiation. Such an approach is closer to a model of human society, where "autonomous agents" negotiate the meaning of their utterances when they are not able to reach an immediate agreement on the underlying concept. It is worth noticing that a dynamic mapping can be interesting even when a clear relation between two concepts can be obtained: in this case, the relation may survive just for the period of the "conversation" between the involved software agents, serving the purpose of a successful information transmission.
3.1.2.2.1 The Mediator Approach
In (Pazienza and Vindigni, 2003), the mapping activity is dynamically carried on by a mediator agent intervening between dialoguing actors. The Mediator agent aims to understand the ontological information a speaker agent wants to express, as well as the way to present it to a computational hearer: it embodies inferential capabilities related to agent communication. It is assumed to have no prior knowledge about the dialogue topics, focusing instead on understanding which relationships exist between the concepts involved in the communication, as well as the terms used to express them, without any assumption on the adopted internal knowledge representation formalisms. Since the mediation activity takes place outside the speaking agents, the mediator needs to work on a restricted view of the ontologies. It builds a partial image of the involved parts by alternating queries to the agents about specific knowledge on terms and concepts, and by dynamically comparing the results in order to find a mapping for the target concept.
As in the previous section, the mediation function is inspired by human behaviour: if a hearer does not understand the meaning of a word, he asks for more details, firing a sub-dialogue based on a sequence of queries and answers aimed at identifying the target concept, or its position in the ontology; in this activity he will also rely on the clues provided by synonyms, relations with other concepts, and so on. One positive aspect of this kind of mediation is that the analysis is framed into a local search space, as in real-world ontologies big differences in the taxonomic structure already appear at the upper levels, making a large-scale mapping between misaligned structures impossible. Seeing the mediation activity as a terminological problem helps in understanding the analogies existing among the underlying models.
This view of ontological mediation is based on minimal assumptions on the knowledge model the involved agents have: by adding more restrictive hypotheses on their introspection capabilities, it is possible to include other, more sophisticated techniques in the same framework. Reasoning about concept properties, attribute similarity, and relations requires a deeper semantic agreement among participants that is out of the scope of this research. In the absence of such agreements, what agents expose of their ontologies is the denotational component, on which the mediation activity is focused.
3.1.2.2.2 Language Game Approach
Negotiating meaning is a process that may happen in interactions among humans. Let us take the following example, in which three participants each form their own idea about a new concept. This dialogue has been used in (Adriaans, 1992) to define the problem of language learning, and it is extracted from (Milne, 1968). The dialogue follows.
One day when Christopher Robin and Winnie-the-Pooh and Piglet were talking all together, Christopher Robin finished the mouthful he was eating and said carelessly: "I saw a Heffalump today, Piglet."
"What was it doing?" asked Piglet.
"Just lumping along," said Christopher Robin. "I don't think it saw me."
"I saw one once," said Piglet. "At least, I think I did," he said. "Only perhaps it wasn't."
"So did I," said Pooh, wondering what a Heffalump was like.
"You don't often see them," said Christopher Robin carelessly.
"Not now," said Piglet.
"Not at this time of year," said Pooh.
Then they all talked about something else...
This dialogue shows how, during a conversation, different participants can form their own opinion of a new notion. This very natural behaviour suggests an architecture in which autonomous agents equipped with meaning negotiation capabilities are free to circulate. Whenever a concept denotation is not known, the meaning negotiation activity can start in order to achieve a useful understanding.
Such an architecture is studied in (Steels, 1996), where the problem investigated is the formation of a language. Given a very precise conceptualisation shared by the agents, the problem is to see how they can converge to unique denotations for the concepts and the relationships. These studies investigate the construction of new lexicons in communities of intelligent agents. The agents envisaged in these studies share a very precise protocol to play what is there called the "language game".
The agent society envisioned in such approaches is very similar to a human society and therefore offers a very interesting benchmark on which human behaviour can be modelled. Furthermore, with respect to the mediator approach, it offers the possibility of reducing the number of interactions, since the number of actors in the overall communication is obviously smaller: the agent-mediator-agent communication is in fact reduced to an agent-agent communication.
3.2 Automatic Methods for Mapping Ontologies
In this section we will classify and discuss the approaches to automatically mapping elements belonging to different ontologies. The analysis is organised according to the levels of an ontology defined in Sec. 2.
3.2.1 Mapping at the Instance Level
The basic assumption is that equivalence at the conceptual level may be computed by analysing the instances of the involved concepts. The strength of the similarity between two concepts is reckoned from the shared instances. It is worth noticing that, as shown in Fig. 4, the inclusion of one set of instances in another defines an isa relation among the involved concepts. We will hereafter show some metrics to calculate this distance.
We first analyse the case in which each concept $c \in C$ has a set of instances $F(c)$. The analysed methods assume that two instances sharing the same denotation are in fact referring to the same object in the world, i.e. the denotation level is not considered.
In (Doan et al., 2002), a notion of probability is used. Given this model, the joint probability distribution between any two concepts c1 and c2 is well defined. This distribution consists of the four probabilities $P(c_1, c_2)$, $P(c_1, \bar{c}_2)$, $P(\bar{c}_1, c_2)$, and $P(\bar{c}_1, \bar{c}_2)$. A term such as $P(c_1, \bar{c}_2)$ is the probability that a randomly chosen instance of the universe belongs to c1 but not to c2, and is computed as the fraction of the universe that belongs to c1 but not to c2. The probabilities are then estimated as follows:

$P(c_1, c_2) = \frac{|F(c_1) \cap F(c_2)|}{|I|}$

$P(c_1, \bar{c}_2) = \frac{|F(c_1) \cap (I \setminus F(c_2))|}{|I|}$

$P(\bar{c}_1, c_2) = \frac{|(I \setminus F(c_1)) \cap F(c_2)|}{|I|}$

$P(\bar{c}_1, \bar{c}_2) = \frac{|(I \setminus F(c_1)) \cap (I \setminus F(c_2))|}{|I|}$
Many practical similarity measures can be defined based on the joint distribution of the concepts involved. For instance, a common definition of the "exact" similarity measure, i.e. a measure that estimates whether the concepts c1 and c2 are the same concept, is:

$\mathrm{Jaccard\_sim}(c_1, c_2) = \frac{P(c_1 \wedge c_2)}{P(c_1 \vee c_2)} = \frac{P(c_1, c_2)}{P(c_1, c_2) + P(c_1, \bar{c}_2) + P(\bar{c}_1, c_2)}$

This similarity measure is known as the Jaccard coefficient (van Rijsbergen, 1979). It takes the lowest value 0 when c1 and c2 are disjoint, and the highest value 1 when c1 and c2 are the same concept. It is worth noticing that:

$\mathrm{Jaccard\_sim}(c_1, c_2) = \frac{|F(c_1) \cap F(c_2)|}{|F(c_1) \cup F(c_2)|}$
A definition for the "most-specific-parent" similarity measure is also given:

$\mathrm{MSP}(c_1, c_2) = \begin{cases} P(c_1 \mid c_2) & \text{if } P(c_2 \mid c_1) = 1 \\ 0 & \text{otherwise} \end{cases}$

where the conditional probabilities $P(c_1 \mid c_2)$ and $P(c_2 \mid c_1)$ can be trivially expressed in terms of the four joint probabilities. This definition states that if c2 subsumes c1, then the more specific c2 is, the higher $P(c_1 \mid c_2)$ is, and thus the higher the similarity value MSP(c1, c2) is. It thus suits the intuition that the most specific parent of c1 in the taxonomy is the smallest set that subsumes c1. An analogous definition can be formulated for the "most-general-child" similarity measure. Similar measures are given in (Weinstein and Birmingham, 1999).
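As an illustration, the following minimal Python sketch (ours, not code from the cited works) computes both measures directly from concept extensions represented as sets, under the stated assumption that shared denotations identify shared instances:

    def jaccard_sim(ext1: set, ext2: set) -> float:
        # Jaccard coefficient: |F(c1) n F(c2)| / |F(c1) u F(c2)|
        union = ext1 | ext2
        return len(ext1 & ext2) / len(union) if union else 0.0

    def msp(ext1: set, ext2: set) -> float:
        # Most-specific-parent: P(c1|c2) if c2 subsumes c1, else 0.
        # P(c2|c1) = 1 holds exactly when F(c1) is a subset of F(c2).
        if ext2 and ext1 <= ext2:
            return len(ext1) / len(ext2)  # P(c1|c2) under subsumption
        return 0.0

    # Toy example: every espresso instance is also a coffee instance.
    coffee = {"i1", "i2", "i3", "i4"}
    espresso = {"i1", "i2"}
    print(jaccard_sim(espresso, coffee))  # 0.5
    print(msp(espresso, coffee))          # 0.5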
The main limitation of this kind of approach is that if the modelled worlds do not overlap, there is no possibility of estimating the distance between the involved concepts. If I1 and I2 are the instance spaces of the two ontologies, the method can be used only if $|I_1 \cap I_2|$ is comparable to $|I_1|$ and $|I_2|$. The case explored in (Weinstein and Birmingham, 1999) is in fact of this type: the modelled world is the general space of recordings (i.e. CDs, tapes, etc.). In this case, it may be assumed that the level of overlap between two databases modelling this world is very high because of the aims of the two models. This cannot always be the case.
3.2.2 Denotations of the Concepts
3.2.2.1 Edit Distance
In (Maedche and Staab, 2001) the denotations of the concepts and of the relations are used to detect similarity. The basic assumption is that two objects sharing the same denotation relate to the same concept. Therefore, a distance between the concepts can be established using the distance between their string denotations. The adopted measure is the edit distance (Levenshtein, 1966) between strings, which measures the minimum number of token insertions, deletions, and substitutions required to transform one string into another, computed with a dynamic programming algorithm. For example, the edit distance ed between the two surface denotations "TopHotel" and "Top_Hotel" equals 1, i.e.:

ed("TopHotel", "Top_Hotel") = 1

because one insertion operation changes the string "TopHotel" into "Top_Hotel". Based on Levenshtein's edit distance, they propose a syntactic similarity measure for strings, the String Matching (SM), which compares two denotations $L_i$ and $L_j$:

$\mathrm{SM}(L_i, L_j) = \max\left\{0,\ 1 - \frac{ed(L_i, L_j)}{\min(|L_i|, |L_j|)}\right\}$
SM returns a degree of similarity between 0 and 1, where 1 stands for a perfect match and 0 for a bad match. It considers the number of changes that must be made to turn one string into the other, and weighs this number against the length of the shorter of the two strings. In the above example, the String Matching is computed as follows:

SM("TopHotel", "Top_Hotel") = 7/8

Of course, SM may sometimes be deceptive, when two strings resemble each other although there is no meaningful relationship between them, e.g. "power" and "tower". In their case study, however, the authors found that in spite of this added "noise" SM may be very helpful for proposing good matches of strings.
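A minimal runnable sketch of the measure (our own rendering of the formula above, not code from the cited paper):

    def edit_distance(a: str, b: str) -> int:
        # Classic dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def string_matching(li: str, lj: str) -> float:
        # SM(Li, Lj) = max(0, 1 - ed(Li, Lj) / min(|Li|, |Lj|))
        return max(0.0, 1 - edit_distance(li, lj) / min(len(li), len(lj)))

    print(string_matching("TopHotel", "Top_Hotel"))  # 0.875, i.e. 7/8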
3.2.2.2 Terminological Approach
The overall expressiveness of the concept denotations cannot be completely captured by the String Matching based on the edit distance. Many denotations can be seen as complex nominals with particular concatenation rules; e.g. AssistantProfessor and AssociateProfessor can obviously be seen as the complex nominals "assistant professor" and "associate professor" respectively. The particular concatenation rules can be reversed and used to generate tokenisation rules. From there, the methods adopted to find similarities among complex noun phrases studied in the terminology variation problem (Jacquemin, 2001) can be used. Terms are linguistic denotations of key domain concepts that in general are represented by complex noun phrases. In (Jacquemin, 2001), the term variation problem is investigated. The point of view of these studies can be reversed: the methods settled to evaluate the variation of a term can be used to calculate the distance between the concepts denoted by different surface forms.
In (Jacquemin, 2001), term variation is studied along different dimensions:
- morphological variation, which postulates the equivalence between concepts expressed by singular and plural noun phrases;
- syntactic variation, which postulates the equivalence between concepts expressed by different syntactic structures built on the same words, e.g. Information Extraction vs. Extraction of Information;
- semantic variation, which postulates the equivalence between forms obtained by substituting words with equivalent words, e.g. elaboration of the natural language vs. processing of the natural language.
The variation and, consequently, the equivalence between different denotations can be computed along these different dimensions. In (Jacquemin, 2001), only the syntactic variation is extensively treated, and the idea is to provide rules such as the following:

N1 of N2 → N2 N1

which states the equivalence between forms like Extraction of Information and Information Extraction. A number of these rules are provided.
A mapping method based on terminology alignment makes strong assumptions on the linguistic well-formedness of the concept denotations. This is generally the case, except for the different concatenation rules; but cases may exist in which the gap between the pure linguistic expression and the actual denotation cannot be bridged.
3.2.3 The use of the structure
How can we be sure that a concept denoted by the same label is in fact the same concept? Or, conversely, how can we determine the similarity between two nodes of the conceptual structure if we do not know that they have a similar denotation? The only way seems to be analysing the structure of the conceptual level. In this section we will analyse mapping methods assuming that elements having the same denotation refer to the same element; differences are inspected in the structure of the conceptual level, i.e. C = (D, R).
Since one of the most relevant and agreed-upon relations in the structure is the isa, we will single it out from the set R of all the relations. The isa relation will be represented as H, with H(c1, c2) for c1, c2 in D representing that c1 is a c2.
3.2.3.1 Immediate Neighbourhood
In (Maedche and Staab, 2001), the similarity between two concepts is calculated building on the notion of semantic cotopy (SC). Given a concept c, the semantic cotopy SC(c, H) of c in H is the set collecting all its super- and subconcepts:

$SC(c, H) = \{ c_i \in C \mid H(c, c_i) \vee H(c_i, c) \vee c_i = c \}$

The similarity between the concepts expressed in two different taxonomies can therefore be calculated using the following function, i.e. the taxonomy overlap (TO):

$TO(c, O_1, O_2) = \frac{|SC(c, H_1) \cap SC(c, H_2)|}{|SC(c, H_1) \cup SC(c, H_2)|}$

It is worth noticing the similarity between the taxonomy overlap and the Jaccard_sim presented in Sec. 3.2.1, where the extension of a concept is explored as a mapping indicator. The principle here is very similar, but the "extension" is taken at the conceptual level.
In (Maedche and Staab, 2001), the taxonomy overlap TO is mainly used to determine the overlap between different taxonomies. Here we are interested in the measure because it can give the similarity between two concepts expressed in different taxonomies; the equivalence of the surface forms alone is generally not sufficient.
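A small Python sketch of these two notions (again our own illustration, with each taxonomy encoded as a dict from a concept to the set of its direct superconcepts; concept labels are assumed shared across the two taxonomies):

    def semantic_cotopy(c, isa):
        # All super- and subconcepts of c, plus c itself, under isa.
        supers, stack = set(), [c]
        while stack:
            x = stack.pop()
            for p in isa.get(x, set()):
                if p not in supers:
                    supers.add(p)
                    stack.append(p)
        subs, stack = set(), [c]
        while stack:
            x = stack.pop()
            for child, parents in isa.items():
                if x in parents and child not in subs:
                    subs.add(child)
                    stack.append(child)
        return supers | subs | {c}

    def taxonomy_overlap(c, isa1, isa2):
        sc1, sc2 = semantic_cotopy(c, isa1), semantic_cotopy(c, isa2)
        return len(sc1 & sc2) / len(sc1 | sc2)

    # Toy example from the coffee scenario:
    italian = {"CAFFE": {"BEVERAGE"}, "CAPPUCCINO": {"BEVERAGE"},
               "CAFFE MACCHIATO": {"CAFFE"}}
    english = {"COFFEE": {"BEVERAGE"}, "CAPPUCCINO": {"BEVERAGE"},
               "ESPRESSO": {"COFFEE"}}
    print(taxonomy_overlap("CAPPUCCINO", italian, english))  # 1.0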
4. Proposed Approach to Meaning Negotiation in MOSES
MOSES is a very interesting test-bed for ontology mapping algorithms because of the nature of the problem addressed here. To better specify the possible architecture (i.e. what has been called topology and dynamics) for the MOSES mediation capabilities and the dimensions over which the mapping can take place, it is best to first review the constraints that the overall MOSES operational scenario poses.
The basic aim of the MOSES system is answering questions using a structured representation of the information; it can be called an ontological question answering engine (Zajac, 2001). The MOSES prototype will operate on two virtually homogeneous knowledge repositories, i.e. two university sites, University of Roma III and UKBH of Copenhagen. However, two different ontologies have been produced for the two university systems. The denotation level of the two ontologies, as well as the possible instances, is basically in the respective national language, i.e. Italian and Danish. Furthermore, the analysed domain, i.e. the university roles and organisation, strongly depends on the culture of the particular country and on its particular laws. Some roles may exist in one country but not in the other or, worse, the same denotation may be related to very different concepts.
Even if the knowledge in the two universities is homogeneous, as both belong to the university knowledge domain, there is a very small (if not zero) overlap at the level of instances, as the worlds modelled in the two sites are substantially different.
In order to demonstrate its viability, the overall system should work in a situation in which different knowledge nodes can be dynamically added to the knowledge network with the smallest effort. In general, there is no guarantee that a knowledge node inserted into the network exposes its information in an already known ontology. Furthermore, we do not want to assume that the ontology underlying a new node somehow follows an "inter-lingua" conceptualisation.
The situation is then very intriguing:
- software agents representing the content of the WWW sites and software agents looking for information do not necessarily share the same conceptualisation of the domain;
- the overall agent society has to be "robust" to the insertion or deletion of a content node;
- the denotation levels of the conceptualisations of the software agents may be based on different natural languages.
The overall scenario will then configure a place where the interacting agents cannot be sure whether or not they are sharing the conceptualised knowledge. The software agents of the system cannot rely on static mappings between the ontologies.
4.1.1 Basic Assumptions
We have very few assumptions to rely on, and the scenario we have in mind opens very relevant research problems. In any case, the nature of the information we want to tackle can give very interesting starting points and theoretical foundations to the mapping architecture and algorithms we want to explore.
4.1.1.1 The Nature of the Domain Ontologies
The first strong assumption that we make is that we are working in a closed domain whose limits have been settled in deliverables 2.2 and 2.3, i.e. the university domain. Universities are observed from different angles, yet it seems agreed that these perspectives are internationally common. The basic roles of universities in society seem to go beyond national frontiers: universities are institutions involved in education and research. This fact guarantees that the conceptualisations arising from the analysis model a very similar domain, even if the modelled realities differ in some particular concepts and relationships.
As the conceptualised domains are similar, a positive interaction among the software agents is possible, since they are "referring" to the same knowledge. It is as if, in a conversation, the participants were restricting the topic to a very specific argument that is known to everybody, even if some different perceptions of some parts still remain. Moreover, there is no guarantee that the same signs are used to denote the same concepts.
4.1.1.2 The Predominance of Natural Language
Natural language is one of the most important means of conceptualisation. Not only may the denotation level of the ontological conceptualisation be fashioned according to its laws, but also the internal level, i.e. the concept level, may be strongly inspired by the conceptualisations hidden in the language used in everyday life. The major effort in a conceptualisation is in fact the definition of a single meaning for the concept denotations and the relationship denotations, by means of structural connections among the defined elements.
The use of natural language at the denotation level is so "natural" that generally the denotation level and the conceptual level are not divided: the denotations convey the concepts and the relationships. This is typical in the use of the Entity-Relationship diagram when defining a schema for a database, as well as when defining an ontological characterisation for a domain in DAML+OIL.
The naturalness of using natural language is due to the fact that denotations are often used to convey the meaning to other users of the resource. This fact has some important consequences:
- typographic limits imposed by the tool used (e.g. no blank spaces, or short names, are allowed in the denotations) are overcome consistently across the overall resource: the concatenation/truncation rule is generally applied consistently to all the denotations;
- the chosen denotations tend to be meaningful in a community of people.
In this perspective, natural language is not completely abandoned in favour of more formal systems: its power is only reduced so as to make it unambiguous in the context where it is used.
The inherent rules and conceptualisations that come with the use of natural language, and that are widely represented in lexical resources and interpretative grammars, can therefore be exploited. As language plays such an important role, the availability of language resources will help to govern the overall difficulty of the problem: lexical databases such as WordNet (Miller, 1995), EuroWordNet (Vossen, 1998) and MultiWordNet (Bentivogli, Pianta, and Girardi, 2002), or ontological lexical resources such as SIMPLE (SIMPLE), based on the notion of the Qualia structure (Pustejovsky, 1995), will be very valuable resources for studying the feasibility of the mapping. Work such as (Basili, Vindigni, and Zanzotto, 2003) on word sense disambiguation based on these kinds of resources will play a very relevant role.
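As an illustration of how such lexical databases could support the mapping, the sketch below uses the WordNet interface shipped with the NLTK library (our choice; any WordNet API would do) and Wu-Palmer similarity as one possible measure of closeness between two normalized denotations:

# requires the nltk package with the 'wordnet' corpus downloaded
from nltk.corpus import wordnet as wn

def denotation_similarity(words_a, words_b):
    """Best Wu-Palmer similarity between any pair of noun senses of the
    words in two normalized denotations (0.0 if nothing is comparable)."""
    best = 0.0
    for wa in words_a:
        for wb in words_b:
            for sa in wn.synsets(wa, pos=wn.NOUN):
                for sb in wn.synsets(wb, pos=wn.NOUN):
                    best = max(best, sa.wup_similarity(sb) or 0.0)
    return best

# e.g. two denotations of the same university role in two ontologies
print(denotation_similarity(["professor"], ["lecturer"]))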
4.1.2 A Mental Model for Meaning Negotiation
The environment in which the MOSES agents will operate is highly challenging:
 no assumption on the sharing of a common conceptualisation can be made;
 agents can enter or exit the MOSES society.
These two hypotheses impose very clear constraints on the possible architecture. As a common unambiguous conceptualisation cannot serve as the "glueware" between the conceptualisations of the single agents, a centralised approach is very difficult, though not impossible, to apply. Moreover, the dynamics of the agent society (i.e. the insertion and deletion of agents) imposes a distributed approach. Due to the inherent cost of such an approach, an overall static mapping is not feasible; we will therefore adopt a dynamic approach. Mimicking the way in which meaning consensus is established in human conversation, we will provide a meaning negotiation module that can be adopted by the agents willing to participate in the MOSES agent society. This module will give an agent the possibility of playing a very simple language game whose final aim is to reach an agreed meaning, so as to allow a positive interaction on the final MOSES task, i.e. answering the questions posed.
The resulting software agents are depicted in Fig. 10. This approach is interesting because it makes it possible to study the ontology mapping problem, or meaning negotiation, as a model of human behaviour. Furthermore, the implied software architecture is flexible and scalable.
[Figure: two agents, each equipped with a Meaning Negotiation Module, exchanging utterances.]
Fig. 10 MOSES Agents playing the “language game”
A language game whose target is to define the sense of unknown denotations will be defined.
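A minimal sketch of such a game is given below. The message exchange (ask about an unknown denotation, answer with alternative denotations of the same concept) and the agreement criterion are our own assumptions, introduced only to fix the idea:

class Agent:
    """An agent reduced to its lexicon: concept -> known denotations."""
    def __init__(self, name, lexicon):
        self.name = name
        self.lexicon = lexicon   # e.g. {"teacher": {"lecturer", "docente"}}

    def concept_for(self, denotation):
        for concept, denotations in self.lexicon.items():
            if denotation in denotations:
                return concept
        return None

def negotiate(asker, answerer, denotation):
    """One round of the language game: the asker utters a denotation it
    does not understand; the answerer offers alternative denotations of
    the same concept until one is shared, yielding an agreed meaning."""
    concept = answerer.concept_for(denotation)
    if concept is None:
        return None                          # the answerer cannot help
    for alternative in answerer.lexicon[concept]:
        agreed = asker.concept_for(alternative)
        if agreed is not None:               # agreement reached
            return agreed
    return None

a = Agent("A", {"teacher": {"lecturer", "docente"}})
b = Agent("B", {"prof": {"professor", "lecturer"}})
print(negotiate(a, b, "professor"))          # -> "teacher"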
4.1.3 The Dimensions of the Meaning Negotiation Algorithm
Due to the hypothesis of a multilingual scenario and the strong assumptions on the role of natural language, the meaning negotiation algorithm will rely heavily on linguistic resources, i.e. the denotation level will be thought of as living in the natural language space. For this reason, each agent participating in the language game will also be equipped with a lexical knowledge base that helps it reason about the language elements (i.e. the words) used for communication. Since the denotation levels of the ontologies may not be in the same language, interaction of the software agents with a bilingual dictionary is also foreseen.
The algorithm will exploit all the levels of the two involved ontologies that are available under the hypotheses made. The two levels exploited will be:
 the denotation level;
 the conceptual level;
since, due to the nature of the domains and of the analysed sites, the instance level will not be particularly relevant: it is in fact reasonable to suppose that no overlap at this level will be found in the analysed situations. A sketch combining the two exploited levels is given below.
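In the sketch, the Jaccard overlap, the toy bilingual dictionary and the weighting factor are all our own assumptions:

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# toy Italian-English dictionary standing in for a real bilingual resource
IT_EN = {"professore": "professor", "corso": "course"}

def translate(words):
    return {IT_EN.get(w, w) for w in words}

def concept_similarity(c1, c2, alpha=0.6):
    """Weighted combination of the two exploited levels: the denotation
    level (label overlap after translation) and the conceptual level
    (overlap of directly related concepts). alpha is an assumed weight."""
    d = jaccard(c1["labels"], translate(c2["labels"]))
    s = jaccard(c1["related"], translate(c2["related"]))
    return alpha * d + (1 - alpha) * s

c_en = {"labels": {"professor"}, "related": {"course", "department"}}
c_it = {"labels": {"professore"}, "related": {"corso", "facoltà"}}
print(concept_similarity(c_en, c_it))   # 0.6*1.0 + 0.4*(1/3) ≈ 0.73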
5. Adding Nodes in the Knowledge Grid
As discussed above, one of the important aspects of pushing the Semantic Web vision forward is demonstrating the ability to help the writers of www sites express the information they want to expose in the formalised ontological language, where necessary. Moreover, as the information in a www site may not be modelled by the already expressed conceptualisation, a very relevant activity is the automatic production of a possible conceptualisation starting from the www site or from some related texts.
Using the definition of the notion of ontology given in Sec. 2, we may determine the different directions in which we can work. Given a www site (i.e. a very specific collection of documents), the "learnable" objects are:
 The interface level. This may be learnt for two reasons: first, to learn information extraction rules able to automatically fill the instance level at a later stage; second, to learn different ways of expressing the same question.
 The conceptual level. This may be learnt in order to add concepts, and relationships among concepts, that are not foreseen in the previous conceptualisation.
Learning the interface level may help in maintaining the consistency of a knowledge node and in including homogeneous sites in the grid. Learning the conceptual level will make it possible to include heterogeneous knowledge nodes in the grid.
As already discussed, the information gathered in the annotation phase will be a very important test-bed over which the learning techniques can be explored. It is worth noting that this study will be the object of WP7. In the following we describe some of the ideas we will pursue to address these issues.
5.1 Ontology Extraction from Plain Texts
Here we propose an acquisition method for deriving the linguistic interface of an ontology, able to suggest linguistic patterns for known concepts and relations as well as to propose new concepts and new relation types. Our perspective is thus slightly different from that of other works (e.g. (Yangarber et al., 2000), (Riloff, 1996)). Plain texts are the starting point of our analysis; texts are here assumed to drive the discovery of new domain knowledge. Fig. 11 presents an overview of the overall process, where a cascade of activities (along the horizontal arrow) is defined to produce the final ontology.
[Figure: a cascade running from the Domain Corpus through Corpus Processing, Semantic Dictionary Building, Domain Oriented Clustering, Relation Type Definition and Relation Pattern Classification, supported by a Lexical Knowledge Base and by the pre-existing Domain Concept Hierarchy and Domain Relation Type Hierarchy; the intermediate and final products are Concept Patterns, Relation Patterns, Conceptual Relation Patterns, the Concept Hierarchy, Semantic Relations and the Linguistic Relation Interfaces.]
Fig. 11 Learning ontologies from free text: the architecture
The acquisition process makes use of pre-existing domain knowledge: it presupposes at least a domain concept hierarchy (DCH) and a relation type system (RTS). General-purpose linguistic knowledge is also required; it includes at least morphological and grammatical models, as well as lexical semantic knowledge
(such as the WordNet lexical hierarchy). The overall process aims to harmonise the two above sources by exploiting a large-scale corpus. The result is an augmented version of the source concept hierarchy that includes a variety of linguistic knowledge. The process should in fact:
 determine the relevant concepts that should be used in the target concept hierarchy DCH, possibly extending it;
 propose relevant relationship prototypes that are linguistic explanations of the relation prototypes postulated in the domain relation type system RTS, or that constitute new prototypes;
 determine the linguistic forms in which the domain concepts are realized in texts;
 propose textual denotations of the relation prototypes, that is, the linguistic interface of domain relations/associations.
For the enterprise above, a terminological perspective is helpful. The main objective of research in terminology has been the extraction of synthetic representations of domain knowledge from the available material (Pearson, 1998). Thesauri and technical vocabularies, i.e. the explicit domain models, are in fact built using domain text collections considered as implicit domain models. These simple assumptions characterize several approaches in the computational terminology area (Computerm, 1998). At the beginning of the process, only the text collection and a possibly pre-existing term list are available. A very general definition of the notion of term is then exploited, i.e. a surface form of a relevant domain concept (Jacquemin, 1997). The different approaches to Terminology Extraction (TE) tend to give an "operational" definition of term, describing (see (Zanzotto, 2002)) the following aspects:
 the admissible surface forms. Admissible surface forms are usually described as prototypes over a valid natural language interpretation level (i.e. the morphological, syntactic or semantic level).
 the domain relevance. Domain relevance is used as a decision function and has generally been implemented via statistical measures over text collections: the frequency of the term surface forms or their mutual information are examples of such functions. The simple frequency in the corpus has been suggested as the most effective decision function among surface representations including the same number of content words (Daille, 1994). A sketch of both aspects is given below.
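The sketch below instantiates both notions under simplifying assumptions of ours: admissible surface forms are reduced to Adjective*-Noun+ sequences over POS-tagged text (real TE systems use much richer grammars), and domain relevance is the raw corpus frequency, following (Daille, 1994):

from collections import Counter

def term_candidates(tagged_sentences):
    """Collect Adjective*-Noun+ sequences (the admissible surface forms)
    and rank them by raw frequency (the decision function of Daille)."""
    counts = Counter()
    for sent in tagged_sentences:
        i = 0
        while i < len(sent):
            j = i
            while j < len(sent) and sent[j][1] == "ADJ":
                j += 1
            k = j
            while k < len(sent) and sent[k][1] == "NOUN":
                k += 1
            if k > j:                    # at least one noun: keep the form
                counts[" ".join(w for w, _ in sent[i:k])] += 1
                i = k
            else:
                i += 1
    return counts.most_common()

# toy POS-tagged corpus (any POS tagger could produce such pairs)
corpus = [
    [("full", "ADJ"), ("professor", "NOUN")],
    [("the", "DET"), ("full", "ADJ"), ("professor", "NOUN"),
     ("teaches", "VERB"), ("a", "DET"), ("course", "NOUN")],
]
print(term_candidates(corpus))   # [('full professor', 2), ('course', 1)]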
The above model has generally been exploited for the description of the concepts of a domain, while an attempt to use it for the detection of domain relationships has been made in (Basili, Pazienza, and Zanzotto, 2002). It is worth noting that in the model underlying TE systems it is the domain corpus, and not the information needs, that drives the extraction of domain knowledge.
In learning an ontology for Information Extraction, the principles of TE practice can be usefully adopted, in particular its notions of admissible surface form and domain relevance. The first helps to characterize promising candidates for concept and relationship denotations. The notion of domain relevance optimises the work of ontology engineers: only the most relevant linguistic material (domain concepts as well as domain relations and their linguistic interface) will be shown, as patterns sorted according to their domain relevance. Due to their inductive nature, unexplained patterns are potential candidates for new relations and are helpful even in the design of concepts and relations. Moreover, since linguistic constraints are used to express concepts and relationships, they can also be used to cluster data. Sparse phenomena with a common semantic interpretation can thus be grouped: the higher the semantic agreement, the higher the ranking of the underlying phenomena according to the domain relevance function. The result is that the target content of the linguistically principled ontology will grow faster, as it integrates meaningful information emerging from the domain text collections.
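Such clustering could, for instance, pool the frequencies of surface forms whose heads share a lexical generalization. The sketch below uses the first WordNet hypernym (again through NLTK, a choice of ours) as a deliberately crude stand-in for the common semantic interpretation:

from collections import Counter
from nltk.corpus import wordnet as wn   # requires the 'wordnet' corpus

def generalize(word):
    """First-sense hypernym of a noun: a crude semantic generalization."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return word
    hypernyms = synsets[0].hypernyms()
    return hypernyms[0].name() if hypernyms else synsets[0].name()

def pooled_relevance(pattern_frequencies):
    """Pool the frequencies of patterns whose head nouns generalize to
    the same node, so that sparse but semantically coherent patterns
    rank higher together."""
    pooled = Counter()
    for pattern, freq in pattern_frequencies.items():
        pooled[generalize(pattern.split()[-1])] += freq
    return pooled.most_common()

print(pooled_relevance({"full professor": 3, "visiting professor": 1}))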
The overall learning process (Fig. 11) is organized as follows. Firstly, admissible surface forms are extracted from the corpus and promising concept and relation candidates are stored as patterns. This activity is referred to as Corpus Processing. Then, an analysis devoted to determining a concept hierarchy is applied to the most relevant concepts extracted in the previous phase; it also makes use of the pre-existing domain concept hierarchy (DCH). This activity generalizes the available evidence across the general-purpose lexical knowledge base and is hereafter called Semantic Dictionary Building. Its aim is mainly to map domain
concepts into the general lexical database. The resulting concept hierarchy can then be used in the analysis and interpretation of relational patterns in the domain texts. This generalization allows the surface forms observed throughout the corpus to be conceptually clustered. The derived generalizations undergo statistical processing during the Domain Oriented Clustering phase: their distributional figures are derived and the resulting generalized patterns are organised according to their domain relevance score. The Relation Type Definition and Relation Pattern Classification phases are manual. The first is targeted at the production of a set of Semantic Relations (SR), i.e. a system of domain-specific relationship names: labels helpful in the interpretation/explanation of prototypical concept associations in the domain. The semantic relation type system determined in this phase is then used to classify the specific linguistic patterns clustered and ranked in the earlier phases. The result of this last activity is the set of linguistic rules for the matching and prediction of relations in SR; hereafter we will call such rules Linguistic Relation Interfaces.
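The skeleton below summarises the cascade. Every function body is a mere placeholder standing for the phase of the same name, and all signatures and toy return values are our own assumptions:

def corpus_processing(corpus):
    # admissible surface forms -> candidate concept and relation patterns
    return ["full professor"], [("professor", "teach", "course")]

def semantic_dictionary_building(concept_patterns, dch, lexical_kb):
    # map the domain concepts into the general-purpose lexical database
    return {c: lexical_kb.get(c, c) for c in concept_patterns}

def domain_oriented_clustering(relation_patterns, semantic_dict):
    # generalize the patterns and rank them by domain relevance
    return list(relation_patterns)

def relation_type_definition(ranked_patterns, rts):
    # manual phase: produce the Semantic Relations (SR) type system
    return {"teach": "EDUCATIONAL_ACTIVITY"}

def relation_pattern_classification(ranked_patterns, semantic_relations):
    # manual phase: classified patterns become Linguistic Relation Interfaces
    return [(p, semantic_relations.get(p[1], "UNKNOWN")) for p in ranked_patterns]

def learn_ontology(corpus, dch, rts, lexical_kb):
    """The Fig. 11 cascade, phase by phase."""
    concepts, relations = corpus_processing(corpus)
    semantic_dict = semantic_dictionary_building(concepts, dch, lexical_kb)
    ranked = domain_oriented_clustering(relations, semantic_dict)
    sr = relation_type_definition(ranked, rts)
    return relation_pattern_classification(ranked, sr)

print(learn_ontology([], {}, {}, {}))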
References
(Adriaans, 1992) Pieter Willem Adriaans, Language Learning from a Categorial Perspective, Academisch
Proefschrift, Universiteit van Amsterdam, 1992
(Arens et al., 1993) Yigal Arens, Chin Y. Chee, Chun-Nan Hsu, Craig A. Knoblock, Retrieving and
Integrating Data from Multiple Information Sources, International Journal of Cooperative Information
Systems, 2(2), 1993
(Basili, Pazienza, and Zanzotto, 2002) Roberto Basili, Maria Teresa Pazienza, Fabio Massimo Zanzotto, Learning IE patterns: a terminology extraction perspective, Proc. of the Workshop on Event Modelling for Multilingual Document Linking at LREC 2002, Las Palmas, Spain, 2002
(Basili, Vindigni, and Zanzotto, 2003) Roberto Basili, Michele Vindigni, Fabio Massimo Zanzotto,
Integrating ontological and linguistic knowledge for Conceptual Information Extraction, Submitted
for publication, 2003
(Batini, Lenzerini, Navathe, 1986) C. Batini, M. Lenzerini, S.B. Navathe, A comparative analysis of methodologies for schema integration, ACM Computing Surveys, 18(4), 1986
(Beneventano et al., 2001) D. Beneventano, S. Bergamaschi, F. Guerra, M. Vincini, The MOMIS approach
to Information Integration, IEEE and AAAI International Conference on Enterprise Information
Systems (ICEIS01), Setúbal, Portugal, 2001.
(Bentivogli, Pianta, and Girardi, 2002) Luisa Bentivogli, Emanuele Pianta, Christian Girardi, MultiWordNet:
developing an aligned multilingual database, in Proceedings of the First International Conference on
Global WordNet, Mysore, India, January 21-25, 2002
(Cabré, 1998) Maria Teresa Cabré, Terminology, John Benjamins Publishing Company, Amsterdam/Philadelphia, 1998
(Chaudhri et al., 1998) V. Chaudhri, R. Fikes, P. Karp, J. Rice, OKBC: A Programmatic Foundation for
Knowledge Base Interoperability. Proceedings of AAAI-98, Madison, Wisconsin, February, 1998
(Computerm, 1998) Didier Bourigault, Christian Jacquemin, Marie-Claude L'Homme (Editors), Proceedings of the First Workshop on Computational Terminology COMPUTERM'98, held jointly with COLING-ACL'98, Montreal, Quebec, Canada, 1998
(Daille, 1994) Béatrice Daille, Approche mixte pour l'extraction de terminologie: statistique lexicale et filtres linguistiques, PhD Thesis, C2V, TALANA, Université Paris VII, France, 1994
(DAML+OIL) www.daml.org
(Doan et al, 2002) AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Halevy, Learning to Map
between Ontologies on the Semantic Web, WWW2002, Honolulu, Hawaii, USA, 2002
(Genesereth and Nilsson, 1987) M.R. Genesereth, N.J. Nilsson, Logical Foundations of Artificial Intelligence, Morgan Kaufmann, Los Altos, California, 1987
(Guarino, 1998) Nicola Guarino, Formal Ontology and Information Systems, Proceedings of the 1st International Conference on Formal Ontology in Information Systems, FOIS'98, Trento, Italy, IOS Press, 1998
(Hendler, 2001) James Hendler, Agents and the Semantic Web, IEEE Intelligent Systems Journal
(March/April 2001)
(Heflin, Hendler, and Luke, 1999) Jeff Heflin, James Hendler, Sean Luke, SHOE: A Knowledge Representation Language for Internet Applications, Technical Report CS-TR-4078 (UMIACS TR-99-71), 1999
(Jacquemin, 1997) Christian Jacquemin, Variation terminologique: Reconnaissance et acquisition automatiques de termes et de leurs variantes en corpus, Mémoire d'Habilitation à Diriger des Recherches en informatique fondamentale, Université de Nantes, France, 1997
26
O
(Jacquemin, 2001) Christian Jacquemin, Spotting and Discovering Terms through Natural Language Processing, MIT Press, Cambridge, Massachusetts, 2001
(Lee, Hendler, and Lassila, 2001) Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web,
Scientific American, May 2001
(Levenshtein, 1966) V.I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Cybernetics and Control Theory, 10(8), 1966
(Meadche and Staab, 2001) Alexander Maedche and Steffen Staab, Comparing Ontologies - Similarity Measures and Comparison Study, Internal Report No. 408, Institute AIFB, University of Karlsruhe, Germany, 2001
(Mena and Illarramendi, 2001) E. Mena and A. Illarramendi, Ontology-Based Query Processing for Global
Information Systems, Kluwer Academic Publishers, ISBN 0-7923-7375-8, pp. 215, 2001.
(Miller, 1995) George A. Miller, WordNet: A Lexical Database for English, Communications of the ACM, 38(11), 1995
(Milne, 1968) A.A. Milne, Winnie-the-Pooh, Noordhoff, Groningen, 1968
(OWL) http://www.w3.org/TR/owl-ref/
(Pazienza and Vindigni, 2003) Maria Teresa Pazienza, Michele Vindigni, Agents based ontological
mediation in IE systems, forthcoming
(Pearson, 1998) Jennifer Pearson, Terms in Context, John Benjamins Publishing Company, Amsterdam/Philadelphia, 1998
(Pustejovsky, 1995) James Pustejovsky, The Generative Lexicon, MIT Press, Cambridge, 1995
(Riloff, 1996) Ellen Riloff, Automatically Generating Extraction Patterns from Untagged Text, Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), Portland, Oregon, 1996
(SIMPLE) http://www.ub.es/gilcub/SIMPLE/simple2.html
(Steel, 1996) Luc Steels, Emergent Adaptive Lexicons, Proceedings of the Simulation of Adaptive Behavior Conference, MIT Press, 1996
(van Rijsbergen, 1979) C.J. van Rijsbergen, Information Retrieval, Butterworths, London, Second Edition, 1979
(Vossen, 1998) Piek Vossen, EuroWordNet: A Multilingual Database with Lexical Semantic Networks,
Kluwer Academic Publishers, Dordrecht, 1998
(Weinstein and Birmingham, 1999) Peter C. Weinstein, William P. Birmingham, Comparing Concepts in
Differentiated Ontologies, Twelfth Workshop on Knowledge Acquisition, Modeling and Management
(KAW 1999), Alberta, Canada, 1999
(Wrust, 1931) Eugen Wüster, Die Internationale Sprachnormung in der Technik, besonders in der Elektrotechnik, VDI-Verlag, Berlin, Germany, 1931
(Yangarber et al., 2000) Roman Yangarber, Ralph Grishman, Pasi Tapanainen, Silja Huttunen, Unsupervised Discovery of Scenario-Level Patterns for Information Extraction, Proceedings of the Conference on Applied Natural Language Processing ANLP-NAACL 2000, Seattle, WA, 2000
(Zajac, 2001) Remi Zajac, Towards Ontological Question Answering, ACL-2001 Workshop on Open-Domain Question Answering, Toulouse, France, 2001
(Zanzotto, 2002) Fabio Massimo Zanzotto, L'estrazione di terminologia come strumento per la modellazione di domini conoscitivi, Università di Roma "Tor Vergata", Italy, 2002