Ontology Mapping: a survey and a proposal.
Roberto Basili, Maria Teresa Pazienza and Fabio Massimo Zanzotto
Lecture notes for the course Basi di Dati Distribuite (Distributed Databases)
Università degli Studi di Roma Tor Vergata, Academic Year 2003-2004
List of contents

1. INTRODUCTION
2. A DEFINITION FOR THE NOTION OF ONTOLOGY
   2.1 INTERFACES TOWARDS WORLDS WHERE CONCEPTS ARE EXPRESSED: THE TASK LEVEL OF THE ONTOLOGY
3. EXISTING APPROACHES TO ONTOLOGY MAPPING
   3.1 ONTOLOGY MAPPING ARCHITECTURES
       3.1.1 Topology
       3.1.2 Dynamics
   3.2 AUTOMATIC METHODS FOR MAPPING ONTOLOGIES
       3.2.1 Mapping at the Instance Level
       3.2.2 Denotations of the Concepts
       3.2.3 The use of the structure
4. PROPOSED APPROACH TO MEANING NEGOTIATION IN MOSES
   4.1.1 Basic Assumptions
   4.1.2 A Mental Model for Meaning Negotiation
   4.1.3 The Dimensions of the Meaning Negotiation Algorithm
5. ADDING NODES IN THE KNOWLEDGE GRID
   5.1 ONTOLOGY EXTRACTION FROM PLAIN TEXTS
REFERENCES
1. Introduction
The strong interoperability between software agents envisioned in the Semantic Web (SW) (Berners-Lee, Hendler, and Lassila, 2001) is a very appealing idea. Software agents "going" around the World Wide Web and "working" cooperatively to serve the needs of end users would be very valuable assistants in efficiently finding the desired information. To achieve these results, documents and services exposed on the web should clearly state their intended meaning. That is, human-readable information should be complemented with structured machine-readable data describing the intended semantics in an explicit and unambiguous way with respect to shared conceptualisations, often called ontologies.
Two major obstacles should be overcome to realise this vision in practice:
- firstly, after the definition of a formal language to express ontological information (e.g. SHOE (Heflin, Hendler, and Luke, 1999), DAML+OIL (DAML+OIL), or the more recent OWL (OWL)), real conceptualisations have to be written and agreed upon;
- secondly, the effective application of these methodologies can be guaranteed only when the costs of marking up content in documents can be neglected¹.

¹ As (Hendler, 2001) says: "Lowering the cost of markup isn't enough – for many users it needs to be free. That is, semantic markup should be a by-product of normal computer use. Much like current web content, a small number of tool creators and web ontology designers will need to know the details, but most users will not even know ontologies exist."
Explicit and unambiguous conceptualisations have often been seen as the basis for effective communication. The necessity of building such conceptualisations for given knowledge domains dates back a long way: not only communication among software agents but also communication among humans is affected by a large number of misunderstandings. In the 19th century the uncontrolled proliferation of names for new concepts made a reduction of divergent denotations urgent. Zoologists, botanists, and chemists had to handle a large number of new concept names referring to their new intuitions, just as technological progress required clear names for new artefacts (Cabré, 1998). Terminological studies originating with the work of (Wüster, 1931) addressed this problem. The attitude was prescriptive: a standardisation of the terms (i.e. concept denotations) used in a particular domain had to be settled and used for further reference. The final resource should be a standard governing communication among experts.
As the "prescriptive attitude" of terminology studies is very similar to the one underlying the Semantic Web vision, some insights into the problem of governing large-scale conceptualisations can be learnt from the long history of terminology, even if the addressed knowledge domains strongly differ. The software agents envisioned in the SW scenario should in fact help to perform everyday tasks such as reserving a room in a hotel, regardless of the underlying database or of the web interface of the specific (and selected) hotel. Terminological studies, on the contrary, aim to find explicit models for technical and scientific knowledge domains. These seem to be very far from the more common problems addressed in the SW.
When trying to define a conceptualisation for a knowledge domain, the problems that occur are similar regardless of the given domain. These problems are generated by the very nature of knowledge, so they appear in scientific or technical terminological standardisation efforts as well as in the conceptualisation of "simpler" (i.e. closer to everyday life) domains.
As knowledge is the psychological result of perception, learning, and reasoning:
- it belongs to an individual or a group of individuals (locality);
- it changes over time (variability).
An explicit and unambiguous version of the domain knowledge cannot be anything but the expression of a particular group of individuals at a particular time. Its strength depends on its stability and on its generality. Other individuals will use the proposed conceptualisation only if it represents their own perception of reality well.
The locality and the variability of knowledge can be immediately clarified with an example. Let us examine an explicit conceptualisation of the knowledge related to coffee, and let us imagine it as being compiled by an Italian. A very rough approximation of the possible resulting ontology is given in Fig. 1 (translations of the denotations are added for readability purposes).
[Figure: a taxonomy rooted in BEVERAGE with the nodes CAPPUCCINO (Cappuccino), MINI-CAPPUCCINO (Small Cappuccino), CAFFE (Coffee), CAFFE MACCHIATO (Dashed Coffee), CAFFE AMERICANO (American Coffee), CAFFE RISTRETTO (Short Coffee), CAFFE LUNGO (Long Coffee), CAFFE CORRETTO (Corrected Coffee), and CAFFE MACCHIATO CALDO (Hot Dashed Coffee).]
Fig. 1 An approximation of the coffee ontology for Italians
This can seem a very complex representation for a simple problem. However, it is hard to convince an Italian to renounce any of these concepts. Moreover, hard as it may be to believe, Italians can associate a very precise meaning with each of these nodes. For instance a CAFFE (coffee) is a black beverage served in a very small cup called a tazzina (where tazza means cup). It is prepared using a machine that injects steam through a pressed quantity of ground roasted coffee beans. A CAFFE MACCHIATO (dashed coffee) is the same black beverage served with a drop of milk, whereas a CAFFE CORRETTO (corrected coffee) is served with a drop of spirit (grappa is often used).
[Figure: knowledge nodes K1, ..., Kn connected in a grid, with an agent A interrogating them.]
Fig. 2 The knowledge grid: the linguistic and the content agents
The conceptualisation in Fig. 1 may well represent the knowledge that a group of people (i.e. Italians) have of the coffee domain, but it cannot be considered a general ontology for the coffee domain, as it represents a particular culture. However, an automatic reasoning system serving coffees in an Italian Bar (i.e. the place where coffees are served) must have such a conceptualisation to achieve the purpose of giving the final customer what he wants.
Moreover, the presented coffee ontology is still an approximation of the actual state of affairs. The Italian coffee world does not change rapidly, but new ways of preparing coffee may be invented, and this should be reflected in the ontology. This is the case of the so-called MAROCCHINO which, according to urban legend, is a sort of CAFFE MACCHIATO CALDO (i.e. a CAFFE with a drop of milk steamed in the Cappuccino way) with some cocoa powder on the top and in the liquid itself.
If this is the nature of "knowledge", a scenario (such as the one proposed in (DAML+OIL)) that admits a unique, stable, and agreed repository where explicit and unambiguous conceptualisations are collected does not seem very realistic. A more probable situation is that many conceptualisations of the same phenomena will coexist. Each knowledge node where information is exposed will follow one of these particular conceptualisations. This perspective is represented in Fig. 2, where K1,...,Kn are the information nodes and A is the agent aiming to satisfy its needs. With respect to the centralised and prescriptive attitude, this is a more complex case. When designing software agents it cannot be assumed that they share the conceptualisation of the information providers they are interrogating. Then, relying on their inner conceptualisation, they should be equipped with reasoning devices allowing them to interpret and transmit their own queries with respect to the conceptualisation used by the node publishing the information they are interested in. Before the message can be passed with success, a positive meaning negotiation has to be achieved. If this is the interaction model, the overall architecture proposed for the SW is very similar to the organisation of human society, where interacting people cannot be sure how widely their conceptualisation is shared.
The problem analysed in these terms is very fascinating. Basically, it may be seen as the problem of aligning different conceptualisations, which can be resolved in very different ways according to the capabilities of the automatic mapping algorithms and to the specific nature of the problem. Static or dynamic solutions can be foreseen.
[Figure: two knowledge nodes, Uniroma3 and UKBH.]
Fig. 3 The test-bed knowledge grid in MOSES
In this perspective, the vision of the Semantic Web slightly changes, and the initial obstacles to the success of the Semantic Web may be rewritten as follows:
- after the definition of a formal language to express ontological information, effective conceptualisations have to be written, and meaning negotiation algorithms have to be settled in order to mediate between the different conceptualisations;
- the real application of these methodologies can be guaranteed only when the costs of marking up content in documents can be neglected.
We plan to address these two issues in MOSES, and the purpose of this document is to describe how we plan to do so. As described in the previous deliverables of the project [2.1, 2.2, 2.3], we will have two information nodes with two different ontologies within a homogeneous knowledge domain, i.e. the university systems of two different countries. The knowledge nodes will represent two universities, Uniroma3 and UKBH (see Fig. 3).
As described in deliverable 3.1, two demonstrators are foreseen in MOSES: the static and the dynamic demonstrator. The first will implement an ontological approach to question-answering based on interactions via natural language. The second will show how we plan to make the approach scalable. From the point of view of the work planned in this deliverable, the construction of the static demonstrator will have as a by-product the definition of a test-bed, as:
- a static mapping between the ontologies of the two sites will be produced;
- the two university sites will be mapped to the ontologies.
These two resources will be used to test the algorithms for mapping the ontologies of existing knowledge nodes and for adding a new node.
In this document we will therefore firstly describe the notion of ontological conceptualisation that we will
adopt for addressing the problem of ontology mapping and ontology learning (Sec. 2). According to this
notion of conceptualisation we will classify and describe the existing methods for ontology mapping (Sec. 3).
We will then describe the architecture we propose to adopt in MOSES for the problem of meaning
negotiation (Sec. 4). Finally, we will briefly describe the methods we plan to use to add knowledge nodes in
the grid (Sec. 5).
2. A Definition for the Notion of Ontology
A very important concept in the Semantic Web and in this document is the notion of ontology. As (Guarino, 1998) points out, the term ontology is overloaded, as it is used to refer to very different concepts in many different disciplines (e.g. philosophy, artificial intelligence, knowledge management, and database theory). In order to better describe both ontology mapping and ontology building, it is better to define what we mean by the term ontology within this document.
The common perception is that an ontology is useful as it is a somehow systematic "description" of reality or of part of it, and the term ontology does its job in evoking this. However, there is no agreement on what should be modelled and how. In the following we will try to give a definition of the notion of ontology, considering it as an engineering artefact as opposed to the philosophical discipline. Our definition elaborates the relevant perspective given in (Maedche and Staab, 2001), introducing some more distinctions among the levels at which an ontology can be seen. The new levels serve to better explain how the mapping between different ontologies can be achieved and how conceptualisations can be derived from diverse sources of "ontological" information such as domain texts.
An ontology should serve the purpose of giving a conceptualisation of a knowledge domain, i.e. it should systematically show the necessary concepts and the necessary relationships among them. This should help in organising the particular state of the world that one wants to communicate. Communication implies and requires a semiotic level in which symbols related to concepts, relationships, and instances are represented. We will try to keep the inner level of the conceptualisation separate from the outer level of the denotation of the conceptualised elements. Therefore, for our purposes, the definition of an ontology is the following (see also Fig. 4, where the different levels are represented):
Definition (Ontology)
An ontology is a quintuple O = (C, I, S, F, G) where:
- C is the Conceptual Level, where concepts and relationships among concepts are postulated;
- I is the Instance Level, which represents the individuals of the investigated world and their actual relationships;
- S is the Denotation Level, in which signs for concepts, relationships, and instances are settled;
- F is the Instance Relation, which links instances to their related classes;
- G is the Denotation Relation, which indicates the signs associated with each element in C and I.
For the conceptual level we can rely on the definition of conceptualisation given in (Genesereth and Nilsson, 1987), where a conceptualisation C = (D, R) is a couple consisting of a domain D, i.e. a set of elements, and a set R of relations on D. Each element r in R is an n-ary (mathematical) relation on D for a given n, i.e. $r \subseteq D^n$. Note that n is not fixed for all the relations in R. As in OKBC (Chaudhri et al., 1998), we distinguish one special relationship among all, the binary isa relation, which, being transitive and acyclic, builds a hierarchy among the elements in D. There is a wide consensus on the existence of an isa relation among concepts because it is strongly related to the notion of concept itself. A concept in fact is defined when there exists a set of individuals (instances) sharing common properties; e.g. the concept of human is the intensional representation of an extension of individuals that share many properties, such as being biped, having two eyes, being able to laugh, etc. The inclusion of the extension of a concept A in the extension of a concept B is equivalent to thinking that the instances of the class A are also instances of the class B, and therefore that A is a B, as, for instance, a human is an animal.
All the other relations in R represent the specific conceptualisation of the domain. In OKBC, a part of these relationships, i.e. the binary relationships, are taken into account, and they are called slots. Given a relation r in R, an element (c1, ..., cn) in r governs the relationships among elements at the instance level.
The instance level, on the other hand, represents the space where the actual individuals of the represented world live and where their actual relationships are modelled. The instance level can therefore be seen as a couple $I = (D_I, R_I)$ where $D_I$ is the set of instances and $R_I$ is the set of relationships among the instances.
The denotation level contains and specifies the signs used to communicate about the objects foreseen in the two inner levels, i.e. the conceptual and the instance level.
The two relations, i.e. the Instance Relation F and the Denotation Relation G, connect respectively the conceptual level with the instance level, and the inner level (conceptual plus instance level) with the denotation level. They are defined as follows:

$F \subseteq I \times C$
$G \subseteq (C \cup I) \times S$

Note that these are thought of as mathematical relations and not as functions, to allow the modelling both of the ambiguity of the denotations on the one side and of multiple inheritance at the instance level on the other side. However, building on these relations, relevant subsets can be determined. The following "functions" can be defined:

$F(c) = \{ i \in I \mid (i,c) \in F \}$
$F^{-1}(i) = \{ c \in C \mid (i,c) \in F \}$
$G(e) = \{ d \in S \mid (e,d) \in G \}$ for $e \in C \cup I$
$G^{-1}(d) = \{ e \in C \cup I \mid (e,d) \in G \}$
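To make the five components concrete, here is a minimal Python sketch of this definition; it is only our illustration (class and method names are ours, not part of any MOSES code), with the relations F and G stored as sets of pairs:

    from dataclasses import dataclass, field

    @dataclass
    class Ontology:
        C: set = field(default_factory=set)  # concepts (domain D of the conceptual level)
        I: set = field(default_factory=set)  # instances
        S: set = field(default_factory=set)  # signs (denotations)
        F: set = field(default_factory=set)  # instance relation: pairs (i, c)
        G: set = field(default_factory=set)  # denotation relation: pairs (e, d)

        def instances_of(self, c):           # F(c)
            return {i for (i, cc) in self.F if cc == c}

        def classes_of(self, i):             # F^{-1}(i): multiple inheritance allowed
            return {c for (ii, c) in self.F if ii == i}

        def denotations_of(self, e):         # G(e)
            return {d for (ee, d) in self.G if ee == e}

        def denoted_by(self, d):             # G^{-1}(d): may be ambiguous
            return {e for (e, dd) in self.G if dd == d}

    # Toy usage on the coffee example:
    o = Ontology(C={"CAFFE"}, I={"cup_42"}, S={"caffe"},
                 F={("cup_42", "CAFFE")}, G={("CAFFE", "caffe")})
    print(o.instances_of("CAFFE"))  # {'cup_42'}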
[Figure: the three levels of an ontology – the denotation level, the conceptual level, and the instance level.]
Fig. 4 An ontology: the conceptual, the instance, and the denotation levels
Generally, the denotation level is neglected, as it is considered a constitutive part of any ontology. In fact, concepts, relations, and instances cannot be communicated if an effective sign system is not used for them. In these cases, the relation G is a one-to-one mapping between the symbols and the elements in the conceptual and instance spaces. However, this is not always the case: as pointed out in (Maedche and Staab, 2001), linguistic conceptualisations such as WordNet (Miller, 1995) or EuroWordNet (Vossen, 1998) need a clear separation between these two levels (i.e. the conceptual and the instance level on the one side, and the denotation level on the other side), since a one-to-one mapping between the inner and the communication level is not guaranteed due to the ambiguity of natural language.
A separation between the inner and the communication level is beneficial especially in a context of multiple competing conceptualisations such as the one envisaged above. This separation makes clearer the relationship between the modelled world and the signs used to convey the conceptualisation. The latter are a possible link between different conceptualisations.
Signs in the space S have in general a very nice property: they are formed using expressions that somehow follow the symbols and the formation rules of the natural language in which the conceptualisation is expressed. Let us take for instance one of the conceptualisations gathered in the DAML+OIL library: the university ontology (see deliverable 2.3). Concepts and relations (there called properties) are conveyed by natural language expressions following slightly different concatenation rules; e.g. the sign AssistantProfessor is a well-formed natural language expression except for the rule used to concatenate the two words.
It is worth noticing that natural language is one of the most important conceptualisation devices, even if not the only one. Moreover, it is the predominant one when dealing with non-formalised (and maybe familiar) knowledge domains such as the university domain, a bookstore, etc. When conceptualising, the background conceptualisation of the linguistic knowledge is a very strong bias. Moreover, since concepts and relationships have to be somehow efficiently communicated, the denotations used are chosen so as to recall the intended concepts or relations. This property and its implications are more or less consciously exploited in ontology mapping algorithms. The background conceptualisation is in fact very relevant. This will become clearer in the following sections.
2.1 Interfaces towards worlds where concepts are expressed: the task level of the ontology
An ontology is a specific conceptualisation of a domain. Usually, it aims to represent important concepts and relations in order to support the reasoning needed to accomplish a particular task. For instance, an ontology devoted to supporting question answering on the university system may suggest associations between professors and available courses. The necessary concepts are in this case faculty and course: their instances X and Y enter into relations like X teacherOf Y. This is all that is expressed at the denotation level. There are three denotation objects, faculty, course, and teacherOf, referring to objects in the conceptual level. How can they be used to:
- interpret natural language utterances such as the questions proposed to the system?
- interpret the pages from which this information has to be extracted?
For instance, all the following questions can be uttered to ask for information related to the previous relation among concepts:
Who teaches the Database course?
Who is the teacher of the Database course?
Who is the professor of the Database course?
Who holds the Database course?
Who gives the Database course?
[Figure: html pages in which teacherOf instances are conveyed by the style and position of simple textual elements.]
Fig. 5 Pages containing instances of the teacherOf relation
Furthermore, the required information, i.e. the tuples that instantiate the relation teacherOf, can be found in html pages like the one depicted in Fig. 5. There the information is mainly carried by the styles and the position of very simple textual elements in the pages.
[Figure: the interface level wrapping the ontology, comprising a Linguistic Interface, a Typographic Interface, and possibly others.]
Fig. 6 A complete ontology definition: the interface level
These examples show how a piece of ontological information such as a concept or a relation can appear in very different ways according to the space in which it occurs: these appearances are the properties it activates when it is uttered in one of these spaces. This kind of information is generally not included in an ontology, but it is important from many points of view:
- Interpreting questions: questions in natural language can be mapped to ontological relations and concepts only if a linguistic interface is foreseen;
- Automatically adding instances: describing the interfaces with the text and the html text can help in extracting the required knowledge, i.e. the instances of the foreseen concepts and relations, from the places where this information is naturally located, i.e. texts and www sites.
The denotation level may help in defining these interfaces, but it may not contain all the relevant information. Additional levels have to be foreseen. We collect them in what we call the interface level (see Fig. 6). Each element of this level represents an interface towards a particular form that the information may assume. It is still not clear how these interfaces should be separated from one another.
3. Existing approaches to Ontology Mapping
The interest in mapping diverse conceptualisations was not born with the Semantic Web. This is a very relevant issue for many disciplines, ranging from database theory (Batini, Lenzerini, Navathe, 1986) to psychological studies. The formalisation of a conceptual level (such as the Entity-Relationship diagram in database theory) rapidly induces the necessity of somehow stating the mapping between possibly divergent ontologies.
To better understand the scope of mapping different conceptualisations, let us go through an example. Imagine an Italian willing to have his habitual morning coffee in England, and suppose he has managed to find a coffee shop. What kind of interaction with the barman should he have to achieve the goal of having his own coffee? He will suspect that his own conceptualisation (or the way he represents it) is not the same as the barman's when he tries to ask for a hot dashed coffee. The situation may be depicted as in Fig. 7, where the two (non-exhaustive) ontologies are represented. It is worth noticing that the problem is not only a misalignment in the lexicalisation of the concepts (i.e. at the denotation or at the linguistic interface level). The real problem is that the conceptualisations also differ in structure, as the real worlds they model radically differ. The two ontologies should in fact model the Italian and the English ways of preparing coffee.
[Figure: side by side, the Italian client's taxonomy of Fig. 1 and the English clerk's taxonomy, whose nodes are BEVERAGE, COFFEE, ESPRESSO, CAPPUCCINO, and LONG CAPPUCCINO.]
Fig. 7 An attempt to map the Italian and the English coffee ontology
In the case of the Italian at the coffee shop, we can say that a successful communication of the concept happens whenever the poor Italian gets a cup of coffee that vaguely resembles what he is used to drinking. He should therefore manage to negotiate the intended meaning of his hot dashed coffee. He would be successful if he managed to negotiate that his CAFFE is similar to the ESPRESSO as intended by the barman. Something similar to the desired coffee can be obtained by adding a drop of milk to the served ESPRESSO, and this is the best that can be obtained. The main disappointment for the Italian Customer will be the realisation that the ESPRESSO he obtains is somehow different from what he was expecting. The Italian Customer and the English Clerk can finally conclude that there is a weak equivalence between the notions of CAFFE and ESPRESSO. This concludes the meaning negotiation activity and the breakfast of the Italian Customer.
Mapping different conceptualisations is then the activity of finding equivalences between concepts and relationships. As the above example shows, this activity can be tricky. Especially when the modelled domains, even if homogeneous, substantially differ because they represent different views of reality, strong (and maybe stable) equivalences between concepts may not be what we want to find. In these cases, what we may want is to discover a weak equivalence lasting just for the purpose of the communication.
Ontology mapping, or meaning negotiation, can be seen as a very relevant transversal field that arouses the interest of many researchers. The works in this interdisciplinary area are very difficult to follow because many of the contributions use the language proper to their original field. In this section we will report some of the results obtained in this transversal field in light of the definition of an ontology given in the previous section (Sec. 2). We will therefore first focus on the possible architectures for organising the activity of finding and describing the mappings among differentiated ontologies (Sec. 3.1). Secondly, we will describe the algorithms and the methods that have been proposed in the literature with respect to the level of the ontology definition they rely on (Sec. 3.2). The dimensions of the classification of the algorithms can be easily derived from the levels of the ontology definition. This does not mean that the cited algorithms rely only on that particular level; yet separating the contributions of the methods at the various levels they use better clarifies the role of each particular method. It is worth noticing that some of the methods simply neglect some of the levels. The dimensions of the classification are then the following:
- Denotation-based Ontology Mapping: the basic assumption is that a single denotation space has been used in the formalisation, inducing denotations that may differ only by "dialectal" inflections;
- Instance-based Ontology Mapping: the extensions of the concepts are used to determine the similarity;
- Conceptual-based Ontology Mapping: the structural relations among the concepts at the conceptual level are used to determine the similarity. The basic assumption is that structural similarity indicates similarity among concepts.
It is worth noting that in instance-based and conceptual-based ontology mapping the problem of different denotations is simply neglected.
3.1 Ontology Mapping Architectures
The first problem that should be addressed in mapping a large number of ontologies is deciding how to organise the work and how to organise the results of the mappings. Two main choices have to be made: on the one side, the topology of the final knowledge network has to be defined; on the other side, the dynamics of the final interactions. The architectural choices made in the literature strongly depend on the aims of the ontology mappings and on the availability and quality of the automatic mapping algorithms used to reach these aims. The possible choices are classified in the following, and related projects are cited where possible.
3.1.1 Topology
With respect to the topology of the final network, two different possible organisations of the mapping can be defined:
- a centralised approach: a central ontology is foreseen that guarantees the consistency of the overall knowledge system;
- a distributed approach: consistency is guaranteed by one-to-one mappings between all the possible pairs of ontologies.
3.1.1.1 Centralised Approach
The simplest idea for keeping consistency between different conceptualisations is to elect one conceptualisation as the "inter-lingua" (see Fig. 8). The use of the latter can be twofold. In the definition of a new conceptualisation it can be used as a shared vocabulary from which concepts can be extracted, if necessary and possible, or to which new concepts can be linked. If, on the contrary, the different conceptualisations already exist, the "inter-lingua" can be used as the reference for all the other ontologies.
Fig. 8 Keeping consistency with a centralised approach
This topology has often been used, as it reduces the number of necessary mappings. If N is the number of ontologies for which consistency has to be guaranteed, only N mappings are necessary. When the mappings are static, this is the most suitable solution.
This approach has often been adopted. In (Arens et al., 1993), a global ontology is postulated to keep the consistency of different database schemas. In (Beneventano et al., 2001) this model is applied by means of a general vocabulary, i.e. WordNet (Miller, 1995): the intended meaning of each node has to be linked to a particular synset in WordNet. The activity of choosing the appropriate synset is completely delegated to the knowledge engineer.
It is worth noting that the policy adopted in the Semantic Web (Berners-Lee, Hendler, and Lassila, 2001) follows this approach. One of the main efforts in fact is to define a space in which concepts can be univocally referred to. The definitions of the relations among concepts are then collected in a central repository (i.e. the DAML+OIL ontology library (DAML+OIL)) to which the developers of new ontological material have to refer.
3.1.1.2 Distributed Approach
The consistency between different ontologies may also be assured in pairs: given a set of ontologies, the links between each pair of ontologies should be given. The overall consistency is obtained using these one-to-one mappings. Given a set of N ontologies, on the order of N² mappings (N(N−1)/2 unordered pairs, e.g. 190 for N = 20) have to be produced to obtain the overall coverage. A sample mapping scheme is depicted in Fig. 9.
Fig. 9 Keeping consistency with a distributed approach
This approach may seem unreasonable due to the gigantic quantity of links that should be traced. However, it is often used. Consider for instance the construction of bilingual dictionaries: each pair of languages should be covered to ease the comprehension of language A for speakers of language B. The use of an inter-lingua may be misleading both for the final users and for the builders of the resource. This approach has also been used, for instance, in (Mena and Illarramendi, 2001).
3.1.2 Dynamics
The availability of automatic approaches to mapping ontologies can enlarge the set of possible architectures that can be designed for addressing the problem. A distributed architecture is unfeasible if the connections have to be statically traced: when the number of involved ontologies is large, the number of connections inherently grows quadratically. Yet, the availability of automatic approaches to mapping concepts and relationships belonging to different ontologies makes this solution more appealing and, in some cases, even more interesting than the centralised mapping. However, the final decision on the dynamics of the architecture strongly depends on the nature of the information modelled in the target ontologies.
3.1.2.1 Static Mappings
The final objective of a static mapping is to have a resource that connects different conceptualisations in a stable way. However, a static mapping is possible only when certain conditions are met by the nature of the target ontologies. For instance, in the case of EuroWordNet, the need was to create a static cross-lingual resource with a stable final mapping among the concepts expressed in the different languages. As the core of a language does not change frequently, a static mapping is the best solution. Here, automatic mapping algorithms may be used to ease the task of finding correspondences.
3.1.2.2 Dynamic Mappings
In a world where different conceptualisations may arise because of a slightly different perspective on the analysed domain, or because of a lack of knowledge of the existence of an already defined conceptualisation, the availability of ontology mapping algorithms may open very relevant dynamic scenarios. Software agents equipped with these ontology mapping algorithms may assess the meaning of an utterance (i.e. the denotation of a concept or a relationship) through meaning negotiation. Such an approach is closer to a model of human society, where "autonomous agents" negotiate the meaning of their utterances when they are not able to reach an immediate agreement on the underlying concept. It is worth noticing that a dynamic mapping can be interesting even when a clear relation between two concepts can be obtained: in this case, the relation may survive just for the period of the "conversation" between the involved software agents, serving the purpose of a successful information transmission.
3.1.2.2.1 The Mediator Approach
In (Pazienza and Vindigni, 2003), the mapping activity is dynamically carried on by a mediator agent intervening between dialoguing actors. The Mediator agent aims to understand the ontological information a speaker agent wants to express, as well as the way to present it to a computational hearer: it embodies inferential capabilities related to agent communication. It is assumed to have no prior knowledge about the dialogue topics, focusing instead on understanding which relationships exist between the concepts involved in the communication, as well as the terms used to express them, without any assumption on the adopted internal knowledge representation formalisms. Since the mediation activity takes place outside the speaking agents, the mediator needs to work on a restricted view of the ontologies. It builds a partial image of the involved parts by alternating queries to the agents about specific knowledge on terms and concepts, and by dynamically comparing the results in order to find a mapping for the target concept.
As in the previous section, the mediation function is inspired by human behaviour: if a hearer does not understand the meaning of a word, he asks for more details, firing a sub-dialogue based on a sequence of queries and answers aimed at identifying the target concept, or its position in the ontology; in this activity he will also rely on the clues provided by synonyms, relations with other concepts, and so on. One positive aspect of this kind of mediation is that the analysis is framed into a local search space, as in real-world ontologies big differences in the taxonomic structure already appear at the upper levels, making a large-scale mapping between misaligned structures impossible. Seeing the mediation activity as a terminological problem helps in understanding the analogies existing among the underlying models.
This view of ontological mediation is based on minimal assumptions on the knowledge model the involved agents have: by adding more restrictive hypotheses on their introspection capabilities, it is possible to include other, more sophisticated techniques in the same framework. Reasoning about concept properties, attribute similarity, and relations requires a deeper semantic agreement among participants that is out of the scope of this research. In the absence of such agreements, what agents expose of their ontologies is the denotational component, on which the mediation activity is focused.
3.1.2.2.2 Language Game Approach
Negotiating meaning is a process that may happen in interactions among humans. Let us take the following example, in which three participants each form their own idea about a new concept. This dialogue has been used in (Adriaans, 1992) to define the problem of language learning, and it is extracted from (Milne, 1968). The dialogue follows.
One day when Christopher Robin and Winnie-the-Pooh and Piglet were talking all together, Christopher Robin finished the mouthful he was eating and said carelessly: "I saw a Heffalump today, Piglet."
"What was it doing?" asked Piglet.
"Just lumping along," said Christopher Robin. "I don't think it saw me."
"I saw one once," said Piglet. "At least, I think I did," he said. "Only perhaps it wasn't."
"So did I," said Pooh, wondering what a Heffalump was like.
"You don't often see them," said Christopher Robin carelessly.
"Not now," said Piglet.
"Not at this time of year," said Pooh.
Then they all talked about something else...
This dialogue shows how, during a conversation, different participants can form their own opinion of a new notion. This very natural behaviour suggests an architecture in which autonomous agents equipped with meaning negotiation capabilities are free to circulate. Whenever a concept denotation is not known, the meaning negotiation activity can start in order to achieve a useful understanding.
Such an architecture is studied in (Steels, 1996), where the problem investigated is the formation of a language. Given a very precise conceptualisation shared by the agents, the problem is to see how they can converge to unique denotations for the concepts and the relationships. These studies investigate the construction of new lexicons in communities of intelligent agents. The agents envisaged in these studies share a very precise protocol to play what is there called the "language game".
The agent society envisioned in such approaches is very similar to a human society and therefore offers a very interesting benchmark on which human behaviour can be modelled. Furthermore, with respect to the mediator approach, it offers the possibility of reducing the number of interactions, since the number of actors in the overall communication is obviously smaller: the agent-mediator-agent communication is in fact reduced to an agent-agent communication.
3.2 Automatic Methods for Mapping Ontologies
In this section we will classify and discuss the approaches to automatically mapping elements belonging to different ontologies. The analysis is organised according to the levels of an ontology defined in Sec. 2.
3.2.1 Mapping at the Instance Level
The basic assumption is that equivalence at the conceptual level may be computed by analysing the instances of the involved concepts. The strength of the similarity between two concepts is reckoned from the shared instances. It is worth noticing that, as shown in Fig. 4, the inclusion of one set of instances in another defines an isa relation among the involved concepts. We will hereafter show some metrics to calculate this distance.
We first analyse the case in which each concept $c \in C$ has a set of instances $F(c)$. The analysed methods assume that two instances sharing the same denotation are in fact referring to the same object in the world, i.e. the denotation level is not considered.
In (Doan et al., 2002), a notion of probability is used. Given this model, the joint probability distribution between any two concepts c1 and c2 is well defined. This distribution consists of the four probabilities $P(c_1, c_2)$, $P(c_1, \bar{c}_2)$, $P(\bar{c}_1, c_2)$, and $P(\bar{c}_1, \bar{c}_2)$. A term such as $P(c_1, \bar{c}_2)$ is the probability that a randomly chosen instance of the universe belongs to c1 but not to c2, and is computed as the fraction of the universe that belongs to c1 but not to c2. The probabilities are then estimated as follows:

$P(c_1, c_2) = \frac{|F(c_1) \cap F(c_2)|}{|I|}$

$P(c_1, \bar{c}_2) = \frac{|F(c_1) \cap (I \setminus F(c_2))|}{|I|}$

$P(\bar{c}_1, c_2) = \frac{|(I \setminus F(c_1)) \cap F(c_2)|}{|I|}$

$P(\bar{c}_1, \bar{c}_2) = \frac{|(I \setminus F(c_1)) \cap (I \setminus F(c_2))|}{|I|}$
Many practical similarity measures can be defined based on the joint distribution of the concepts involved. For instance, a common definition of the "exact" similarity measure, i.e. a measure that estimates whether the concepts c1 and c2 are the same concept, is:

$\mathrm{Jaccard\_sim}(c_1, c_2) = \frac{P(c_1 \wedge c_2)}{P(c_1 \vee c_2)} = \frac{P(c_1, c_2)}{P(c_1, c_2) + P(c_1, \bar{c}_2) + P(\bar{c}_1, c_2)}$

This similarity measure is known as the Jaccard coefficient (van Rijsbergen, 1979). It takes the lowest value 0 when c1 and c2 are disjoint, and the highest value 1 when c1 and c2 are the same concept. It is worth noticing that:

$\mathrm{Jaccard\_sim}(c_1, c_2) = \frac{|F(c_1) \cap F(c_2)|}{|F(c_1) \cup F(c_2)|}$
A definition for the "most-specific-parent" similarity measure is also given:

$\mathrm{MSP}(c_1, c_2) = \begin{cases} P(c_1 \mid c_2) & \text{if } P(c_2 \mid c_1) = 1 \\ 0 & \text{otherwise} \end{cases}$

where the conditional probabilities $P(c_1 \mid c_2)$ and $P(c_2 \mid c_1)$ can be trivially expressed in terms of the four joint probabilities. This definition states that if c2 subsumes c1, then the more specific c2 is, the higher $P(c_1 \mid c_2)$ is, and thus the higher the similarity value MSP(c1, c2) is. It thus suits the intuition that the most specific parent of c1 in the taxonomy is the smallest set that subsumes c1. An analogous definition can be formulated for the "most-general-child" similarity measure. Similar measures are given in (Weinstein and Birmingham, 1999).
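As an illustration, the following minimal Python sketch (ours, not code from the cited works) computes both measures directly from concept extensions represented as sets, under the stated assumption that shared denotations identify shared instances:

    def jaccard_sim(ext1: set, ext2: set) -> float:
        # Jaccard coefficient: |F(c1) n F(c2)| / |F(c1) u F(c2)|
        union = ext1 | ext2
        return len(ext1 & ext2) / len(union) if union else 0.0

    def msp(ext1: set, ext2: set) -> float:
        # Most-specific-parent: P(c1|c2) if c2 subsumes c1, else 0.
        # P(c2|c1) = 1 holds exactly when F(c1) is a subset of F(c2).
        if ext2 and ext1 <= ext2:
            return len(ext1) / len(ext2)  # P(c1|c2) under subsumption
        return 0.0

    # Toy example: every espresso instance is also a coffee instance.
    coffee = {"i1", "i2", "i3", "i4"}
    espresso = {"i1", "i2"}
    print(jaccard_sim(espresso, coffee))  # 0.5
    print(msp(espresso, coffee))          # 0.5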
The main limitation of this kind of approach is that if the modelled worlds do not overlap, there is no possibility of estimating the distance between the involved concepts. If I1 and I2 are the instance spaces of the two ontologies, the method can be used only if $|I_1 \cap I_2|$ is comparable to $|I_1|$ and $|I_2|$. The case explored in (Weinstein and Birmingham, 1999) is in fact of this type: the modelled world is the general space of recordings (i.e. CDs, tapes, etc.). In this case, it may be assumed that the level of overlap between two databases modelling this world is very high because of the aims of the two models. This cannot always be the case.
3.2.2 Denotations of the Concepts
3.2.2.1 Edit Distance
In (Maedche and Staab, 2001) the denotations of the concepts and of the relations are used to detect similarity. The basic assumption is that two objects sharing the same denotation relate to the same concept. Therefore, a distance between the concepts can be established using the distance between their string denotations. The adopted measure is the edit distance (Levenshtein, 1966) between strings, which measures the minimum number of token insertions, deletions, and substitutions required to transform one string into another, computed with a dynamic programming algorithm. For example, the edit distance ed between the two surface denotations "TopHotel" and "Top_Hotel" equals 1, i.e.:

ed("TopHotel", "Top_Hotel") = 1

because one insertion operation changes the string "TopHotel" into "Top_Hotel". Based on Levenshtein's edit distance, they propose a syntactic similarity measure for strings, the String Matching (SM), which compares two denotations $L_i$ and $L_j$:

$\mathrm{SM}(L_i, L_j) = \max\left\{0,\ 1 - \frac{ed(L_i, L_j)}{\min(|L_i|, |L_j|)}\right\}$
SM returns a degree of similarity between 0 and 1, where 1 stands for a perfect match and 0 for a bad match. It considers the number of changes that must be made to turn one string into the other, and weighs this number against the length of the shorter of the two strings. In the above example, the String Matching is computed as follows:

SM("TopHotel", "Top_Hotel") = 7/8

Of course, SM may sometimes be deceptive, when two strings resemble each other although there is no meaningful relationship between them, e.g. "power" and "tower". In their case study, however, the authors found that in spite of this added "noise" SM may be very helpful for proposing good matches of strings.
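A minimal runnable sketch of the measure (our own rendering of the formula above, not code from the cited paper):

    def edit_distance(a: str, b: str) -> int:
        # Classic dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def string_matching(li: str, lj: str) -> float:
        # SM(Li, Lj) = max(0, 1 - ed(Li, Lj) / min(|Li|, |Lj|))
        return max(0.0, 1 - edit_distance(li, lj) / min(len(li), len(lj)))

    print(string_matching("TopHotel", "Top_Hotel"))  # 0.875, i.e. 7/8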
3.2.2.2 Terminological Approach
The overall expressiveness of the concept denotations cannot be completely captured by the String Matching based on the edit distance. Many denotations can be seen as complex nominals with particular concatenation rules; e.g. AssistantProfessor and AssociateProfessor can obviously be seen as the complex nominals "assistant professor" and "associate professor" respectively. The particular concatenation rules can be reversed and used to generate tokenisation rules. From there, the methods adopted to find similarities among complex noun phrases studied in the terminology variation problem (Jacquemin, 2001) can be used. Terms are linguistic denotations of key domain concepts that in general are represented by complex noun phrases. In (Jacquemin, 2001), the term variation problem is investigated. The point of view of these studies can be reversed: the methods settled to evaluate the variation of a term can be used to calculate the distance between the concepts denoted by different surface forms.
In (Jacquemin, 2001), term variation is studied along different dimensions:
- morphological variation, which postulates the equivalence between concepts expressed by singular and plural noun phrases;
- syntactic variation, which postulates the equivalence between concepts expressed by different syntactic structures built on the same words, e.g. Information Extraction vs. Extraction of Information;
- semantic variation, which postulates the equivalence between forms obtained by substituting words with equivalent words, e.g. elaboration of the natural language vs. processing of the natural language.
The variation and, consequently, the equivalence between different denotations can be computed along these different dimensions. In (Jacquemin, 2001), only the syntactic variation is extensively treated, and the idea is to provide rules such as the following:

N1 of N2 → N2 N1

which states the equivalence between forms like Extraction of Information and Information Extraction. A number of these rules are provided.
A mapping method based on terminology alignment makes strong assumptions on the linguistic well-formedness of the concept denotations. This is generally the case, except for the different concatenation rules; but cases may exist in which the gap between the pure linguistic expression and the actual denotation cannot be bridged.
3.2.3 The use of the structure
How can we be sure that a concept denoted by the same label is in fact the same concept? Or, conversely, how can we determine the similarity between two nodes of the conceptual structure if we do not know that they have a similar denotation? The only way seems to be analysing the structure of the conceptual level. In this section we will analyse mapping methods assuming that elements having the same denotation refer to the same element; differences are inspected in the structure of the conceptual level, i.e. C = (D, R).
Since one of the most relevant and agreed-upon relations in the structure is the isa, we will single it out from the set R of all the relations. The isa relation will be represented as H, with H(c1, c2) for c1, c2 in D representing that c1 is a c2.
3.2.3.1 Immediate Neighbourhood
In (Maedche and Staab, 2001), the similarity between two concepts is calculated building on the notion of semantic cotopy (SC). Given a concept c, the semantic cotopy SC(c, H) of c in H is the set collecting all its super- and subconcepts:

$SC(c, H) = \{ c_i \in C \mid H(c, c_i) \vee H(c_i, c) \vee c_i = c \}$

The similarity between the concepts expressed in two different taxonomies can therefore be calculated using the following function, i.e. the taxonomy overlap (TO):

$TO(c, O_1, O_2) = \frac{|SC(c, H_1) \cap SC(c, H_2)|}{|SC(c, H_1) \cup SC(c, H_2)|}$

It is worth noticing the similarity between the taxonomy overlap and the Jaccard_sim presented in Sec. 3.2.1, where the extension of a concept is explored as a mapping indicator. The principle here is very similar, but the "extension" is taken at the conceptual level.
In (Maedche and Staab, 2001), the taxonomy overlap TO is mainly used to determine the overlap between different taxonomies. Here we are interested in the measure because it can give the similarity between two concepts expressed in different taxonomies; the equivalence of the surface forms alone is generally not sufficient.
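A small Python sketch of these two notions (again our own illustration, with each taxonomy encoded as a dict from a concept to the set of its direct superconcepts; concept labels are assumed shared across the two taxonomies):

    def semantic_cotopy(c, isa):
        # All super- and subconcepts of c, plus c itself, under isa.
        supers, stack = set(), [c]
        while stack:
            x = stack.pop()
            for p in isa.get(x, set()):
                if p not in supers:
                    supers.add(p)
                    stack.append(p)
        subs, stack = set(), [c]
        while stack:
            x = stack.pop()
            for child, parents in isa.items():
                if x in parents and child not in subs:
                    subs.add(child)
                    stack.append(child)
        return supers | subs | {c}

    def taxonomy_overlap(c, isa1, isa2):
        sc1, sc2 = semantic_cotopy(c, isa1), semantic_cotopy(c, isa2)
        return len(sc1 & sc2) / len(sc1 | sc2)

    # Toy example from the coffee scenario:
    italian = {"CAFFE": {"BEVERAGE"}, "CAPPUCCINO": {"BEVERAGE"},
               "CAFFE MACCHIATO": {"CAFFE"}}
    english = {"COFFEE": {"BEVERAGE"}, "CAPPUCCINO": {"BEVERAGE"},
               "ESPRESSO": {"COFFEE"}}
    print(taxonomy_overlap("CAPPUCCINO", italian, english))  # 1.0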
4. Proposed Approach to Meaning Negotiation in MOSES
MOSES is a very interesting test-bed for ontology mapping algorithms because of the nature of the problem addressed here. To better specify the possible architecture (i.e. what has been called topology and dynamics) for the MOSES mediation capabilities and the dimensions over which the mapping can take place, it is best to first review the constraints that the overall MOSES operational scenario poses.
The basic aim of the MOSES system is answering questions using a structured representation of the information; it can be called an ontological question answering engine (Zajac, 2001). The MOSES prototype will operate on two virtually homogeneous knowledge repositories, i.e. two university sites, University of Roma III and UKBH of Copenhagen. However, two different ontologies have been produced for the two university systems. The denotation level of the two ontologies, as well as the possible instances, is basically in the respective national language, i.e. Italian and Danish. Furthermore, the analysed domain, i.e. the university roles and organisation, strongly depends on the culture of the particular country and on its particular laws. Some roles may exist in one country but not in the other or, worse, the same denotation may be related to very different concepts.
Even if the knowledge in the two universities is homogeneous, as both belong to the university knowledge domain, there is a very small (if not zero) overlap at the level of instances, as the worlds modelled in the two sites are substantially different.
In order to demonstrate its viability, the overall system should work in a situation in which different knowledge nodes can be dynamically added to the knowledge network with the smallest effort. In general, there is no guarantee that a knowledge node inserted into the network exposes its information in an already known ontology. Furthermore, we do not want to assume that the ontology underlying a new node somehow follows an "inter-lingua" conceptualisation.
The situation is then very intriguing:
- software agents representing the content of the WWW sites and software agents looking for information do not necessarily share the same conceptualisation of the domain;
- the overall agent society has to be "robust" to the insertion or deletion of a content node;
- the denotation levels of the conceptualisations of the software agents may be based on different natural languages.
The overall scenario will then configure a place where the interacting agents cannot be sure whether or not they are sharing the conceptualised knowledge. The software agents of the system cannot rely on static mappings between the ontologies.
4.1.1 Basic Assumptions
We have very few assumptions to rely on, and the scenario we have in mind opens very relevant research problems. In any case, the nature of the information we want to tackle can give very interesting starting points and theoretical foundations to the mapping architecture and algorithms we want to explore.
4.1.1.1 The Nature of the Domain Ontologies
The first strong assumption that we make is that we are working in a closed domain whose limits have been settled in deliverables 2.2 and 2.3, i.e. the university domain. Universities are observed from different angles, yet it seems agreed that these perspectives are internationally common. The basic roles of universities in society seem to go beyond national frontiers: universities are institutions involved in education and research. This fact guarantees that the conceptualisations arising from the analysis model a very similar domain, even if the modelled realities differ in some particular concepts and relationships.
As the conceptualised domains are similar, a positive interaction among the software agents is possible, since they are "referring" to the same knowledge. It is as if, in a conversation, the participants were restricting the topic to a very specific argument that is known to everybody, even if some different perceptions of some parts still remain. Moreover, there is no guarantee that the same signs are used to denote the same concepts.
4.1.1.2 The Predominance of Natural Language
Natural language is one of the most important means of conceptualisation. Not only may the denotation level of the ontological conceptualisation be fashioned according to its laws, but also the internal level, i.e. the concept level, may be strongly inspired by the conceptualisations hidden in the language used in everyday life. The major effort in a conceptualisation is in fact the definition of a single meaning for the concept denotations and the relationship denotations, by means of structural connections among the defined elements.
The use of natural language at the denotation level is so "natural" that generally the denotation level and the conceptual level are not divided: the denotations convey the concepts and the relationships. This is typical in the use of the Entity-Relationship diagram when defining a schema for a database, as well as when defining an ontological characterisation for a domain in DAML+OIL.
The naturalness of using natural language is due to the fact that denotations are often used to convey the meaning to other users of the resource. This fact has some important consequences:
- typographic limits imposed by the tool used (e.g. no blank spaces, or short names, are allowed in the denotations) are overcome consistently across the overall resource: the concatenation/truncation rule is generally applied consistently to all the denotations;
- the chosen denotations tend to be meaningful in a community of people.
In this perspective, natural language is not completely abandoned in favour of more formal systems: its power is only reduced so as to make it unambiguous in the context where it is used.
The inherent rules and conceptualisations that come with the use of natural language, and that are widely represented in lexical resources and interpretative grammars, can therefore be exploited. As language plays such an important role, the availability of language resources will help to govern the overall difficulty of the problem: lexical databases such as WordNet (Miller, 1995), EuroWordNet (Vossen, 1998) and MultiWordNet (Bentivogli, Pianta, and Girardi, 2002), or ontological lexical resources such as SIMPLE (SIMPLE), based on the notion of the Qualia structure (Pustejovsky, 1995), will be very valuable resources for studying the feasibility of the mapping. Work such as (Basili, Vindigni, and Zanzotto, 2003) on word sense disambiguation based on these kinds of resources will play a very relevant role.
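As an illustration of how such lexical databases could support the mapping, the sketch below uses the WordNet interface shipped with the NLTK library (our choice; any WordNet API would do) and Wu-Palmer similarity as one possible measure of closeness between two normalized denotations:

# requires the nltk package with the 'wordnet' corpus downloaded
from nltk.corpus import wordnet as wn

def denotation_similarity(words_a, words_b):
    """Best Wu-Palmer similarity between any pair of noun senses of the
    words in two normalized denotations (0.0 if nothing is comparable)."""
    best = 0.0
    for wa in words_a:
        for wb in words_b:
            for sa in wn.synsets(wa, pos=wn.NOUN):
                for sb in wn.synsets(wb, pos=wn.NOUN):
                    best = max(best, sa.wup_similarity(sb) or 0.0)
    return best

# e.g. two denotations of the same university role in two ontologies
print(denotation_similarity(["professor"], ["lecturer"]))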
4.1.2 A Mental Model for Meaning Negotiation
The environment in which the MOSES agents will operate is highly challenging:
 no assumption on the sharing of a common conceptualisation can be made;
 agents can enter or exit the MOSES society.
These two hypotheses impose very clear constraints on the possible architecture. As a common unambiguous conceptualisation cannot serve as the "glueware" between the conceptualisations of the single agents, a centralised approach is very difficult, though not impossible, to apply. Moreover, the dynamics of the agent society (i.e. the insertion and deletion of agents) imposes a distributed approach. Due to the inherent cost of such an approach, an overall static mapping is not feasible; we will therefore adopt a dynamic approach. Mimicking the way in which meaning consensus is established in human conversation, we will provide a meaning negotiation module that can be adopted by the agents willing to participate in the MOSES agent society. This module will give an agent the possibility of playing a very simple language game whose final aim is to reach an agreed meaning, so as to allow a positive interaction on the final MOSES task, i.e. answering the questions posed.
The resulting software agents are depicted in Fig. 10. This approach is interesting because it makes it possible to study the ontology mapping problem, or meaning negotiation, as a model of human behaviour. Furthermore, the implied software architecture is flexible and scalable.
[Figure: two agents, each equipped with a Meaning Negotiation Module, exchanging utterances.]
Fig. 10 MOSES Agents playing the “language game”
A language game whose target is to define the sense of unknown denotations will be defined.
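A minimal sketch of such a game is given below. The message exchange (ask about an unknown denotation, answer with alternative denotations of the same concept) and the agreement criterion are our own assumptions, introduced only to fix the idea:

class Agent:
    """An agent reduced to its lexicon: concept -> known denotations."""
    def __init__(self, name, lexicon):
        self.name = name
        self.lexicon = lexicon   # e.g. {"teacher": {"lecturer", "docente"}}

    def concept_for(self, denotation):
        for concept, denotations in self.lexicon.items():
            if denotation in denotations:
                return concept
        return None

def negotiate(asker, answerer, denotation):
    """One round of the language game: the asker utters a denotation it
    does not understand; the answerer offers alternative denotations of
    the same concept until one is shared, yielding an agreed meaning."""
    concept = answerer.concept_for(denotation)
    if concept is None:
        return None                          # the answerer cannot help
    for alternative in answerer.lexicon[concept]:
        agreed = asker.concept_for(alternative)
        if agreed is not None:               # agreement reached
            return agreed
    return None

a = Agent("A", {"teacher": {"lecturer", "docente"}})
b = Agent("B", {"prof": {"professor", "lecturer"}})
print(negotiate(a, b, "professor"))          # -> "teacher"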
4.1.3 The Dimensions of the Meaning Negotiation Algorithm
Due to the hypothesis of a multilingual scenario and the strong assumptions on the role of natural language, the meaning negotiation algorithm will rely heavily on linguistic resources, i.e. the denotation level will be thought of as living in the natural language space. For this reason, each agent participating in the language game will also be equipped with a lexical knowledge base that helps it reason about the language elements (i.e. the words) used for communication. Since the denotation levels of the ontologies may not be in the same language, interaction of the software agents with a bilingual dictionary is also foreseen.
The algorithm will exploit all the levels of the two involved ontologies that are available under the hypotheses made. The two levels exploited will be:
 the denotation level;
 the conceptual level;
since, due to the nature of the domains and of the analysed sites, the instance level will not be particularly relevant: it is in fact reasonable to suppose that no overlap at this level will be found in the analysed situations. A sketch combining the two exploited levels is given below.
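In the sketch, the Jaccard overlap, the toy bilingual dictionary and the weighting factor are all our own assumptions:

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# toy Italian-English dictionary standing in for a real bilingual resource
IT_EN = {"professore": "professor", "corso": "course"}

def translate(words):
    return {IT_EN.get(w, w) for w in words}

def concept_similarity(c1, c2, alpha=0.6):
    """Weighted combination of the two exploited levels: the denotation
    level (label overlap after translation) and the conceptual level
    (overlap of directly related concepts). alpha is an assumed weight."""
    d = jaccard(c1["labels"], translate(c2["labels"]))
    s = jaccard(c1["related"], translate(c2["related"]))
    return alpha * d + (1 - alpha) * s

c_en = {"labels": {"professor"}, "related": {"course", "department"}}
c_it = {"labels": {"professore"}, "related": {"corso", "facoltà"}}
print(concept_similarity(c_en, c_it))   # 0.6*1.0 + 0.4*(1/3) ≈ 0.73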
5. Adding Nodes in the Knowledge Grid
As discussed above, one of the important aspects of pushing the Semantic Web vision forward is demonstrating the ability to help the writers of www sites express the information they want to expose in the formalised ontological language, where necessary. Moreover, as the information in a www site may not be modelled by the already expressed conceptualisation, a very relevant activity is the automatic production of a possible conceptualisation starting from the www site or from some related texts.
Using the definition of the notion of ontology given in Sec. 2, we may determine the different directions in which we can work. Given a www site (i.e. a very specific collection of documents), the "learnable" objects are:
 The interface level. This may be learnt for two reasons: first, to learn information extraction rules able to automatically fill the instance level at a later stage; second, to learn different ways of expressing the same question.
 The conceptual level. This may be learnt in order to add concepts, and relationships among concepts, that are not foreseen in the previous conceptualisation.
Learning the interface level may help in maintaining the consistency of a knowledge node and in including homogeneous sites in the grid. Learning the conceptual level will make it possible to include heterogeneous knowledge nodes in the grid.
As already discussed, the information gathered in the annotation phase will be a very important test-bed over which the learning techniques can be explored. It is worth noting that this study will be the object of WP7. In the following we describe some of the ideas we will pursue to address these issues.
5.1 Ontology Extraction from Plain Texts
Here we propose an acquisition method for deriving the linguistic interface of an ontology, able to suggest linguistic patterns for known concepts and relations as well as to propose new concepts and new relation types. Our perspective is thus slightly different from that of other works (e.g. (Yangarber et al., 2000), (Riloff, 1996)). Plain texts are the starting point of our analysis; texts are here assumed to drive the discovery of new domain knowledge. Fig. 11 presents an overview of the overall process, where a cascade of activities (along the horizontal arrow) is defined to produce the final ontology.
[Figure: a cascade running from the Domain Corpus through Corpus Processing, Semantic Dictionary Building, Domain Oriented Clustering, Relation Type Definition and Relation Pattern Classification, supported by a Lexical Knowledge Base and by the pre-existing Domain Concept Hierarchy and Domain Relation Type Hierarchy; the intermediate and final products are Concept Patterns, Relation Patterns, Conceptual Relation Patterns, the Concept Hierarchy, Semantic Relations and the Linguistic Relation Interfaces.]
Fig. 11 Learning ontologies from free text: the architecture
The acquisition process makes use of pre-existing domain knowledge: it presupposes at least a domain concept hierarchy (DCH) and a relation type system (RTS). General-purpose linguistic knowledge is also required; it includes at least morphological and grammatical models, as well as lexical semantic knowledge
(such as the WordNet lexical hierarchy). The overall process aims to harmonise the two above sources by exploiting a large-scale corpus. The result is an augmented version of the source concept hierarchy that includes a variety of linguistic knowledge. The process should in fact:
 determine the relevant concepts that should be used in the target concept hierarchy DCH, possibly extending it;
 propose relevant relationship prototypes that are linguistic explanations of the relation prototypes postulated in the domain relation type system RTS, or that constitute new prototypes;
 determine the linguistic forms in which the domain concepts are realized in texts;
 propose textual denotations of the relation prototypes, that is, the linguistic interface of domain relations/associations.
For the enterprise above, a terminological perspective is helpful. The main objective of research in terminology has been the extraction of synthetic representations of domain knowledge from the available material (Pearson, 1998). Thesauri and technical vocabularies, i.e. the explicit domain models, are in fact built using domain text collections considered as implicit domain models. These simple assumptions characterize several approaches in the computational terminology area (Computerm, 1998). At the beginning of the process, only the text collection and a possibly pre-existing term list are available. A very general definition of the notion of term is then exploited, i.e. a surface form of a relevant domain concept (Jacquemin, 1997). The different approaches to Terminology Extraction (TE) tend to give an "operational" definition of term, describing (see (Zanzotto, 2002)) the following aspects:
 the admissible surface forms. Admissible surface forms are usually described as prototypes over a valid natural language interpretation level (i.e. the morphological, syntactic or semantic level).
 the domain relevance. Domain relevance is used as a decision function and has generally been implemented via statistical measures over text collections: the frequency of the term surface forms or their mutual information are examples of such functions. The simple frequency in the corpus has been suggested as the most effective decision function among surface representations including the same number of content words (Daille, 1994). A sketch of both aspects is given below.
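The sketch below instantiates both notions under simplifying assumptions of ours: admissible surface forms are reduced to Adjective*-Noun+ sequences over POS-tagged text (real TE systems use much richer grammars), and domain relevance is the raw corpus frequency, following (Daille, 1994):

from collections import Counter

def term_candidates(tagged_sentences):
    """Collect Adjective*-Noun+ sequences (the admissible surface forms)
    and rank them by raw frequency (the decision function of Daille)."""
    counts = Counter()
    for sent in tagged_sentences:
        i = 0
        while i < len(sent):
            j = i
            while j < len(sent) and sent[j][1] == "ADJ":
                j += 1
            k = j
            while k < len(sent) and sent[k][1] == "NOUN":
                k += 1
            if k > j:                    # at least one noun: keep the form
                counts[" ".join(w for w, _ in sent[i:k])] += 1
                i = k
            else:
                i += 1
    return counts.most_common()

# toy POS-tagged corpus (any POS tagger could produce such pairs)
corpus = [
    [("full", "ADJ"), ("professor", "NOUN")],
    [("the", "DET"), ("full", "ADJ"), ("professor", "NOUN"),
     ("teaches", "VERB"), ("a", "DET"), ("course", "NOUN")],
]
print(term_candidates(corpus))   # [('full professor', 2), ('course', 1)]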
The above model has generally been exploited for the description of the concepts of a domain, while an attempt to use it for the detection of domain relationships has been made in (Basili, Pazienza, and Zanzotto, 2002). It is worth noting that in the model underlying TE systems it is the domain corpus, and not the information needs, that drives the extraction of domain knowledge.
In learning an ontology for Information Extraction, the principles of TE practice can be usefully adopted, in particular its notions of admissible surface form and domain relevance. The first helps to characterize promising candidates for concept and relationship denotations. The notion of domain relevance optimises the work of ontology engineers: only the most relevant linguistic material (domain concepts as well as domain relations and their linguistic interface) will be shown, as patterns sorted according to their domain relevance. Due to their inductive nature, unexplained patterns are potential candidates for new relations and are helpful even in the design of concepts and relations. Moreover, since linguistic constraints are used to express concepts and relationships, they can also be used to cluster data. Sparse phenomena with a common semantic interpretation can thus be grouped: the higher the semantic agreement, the higher the ranking of the underlying phenomena according to the domain relevance function. The result is that the target content of the linguistically principled ontology will grow faster, as it integrates meaningful information emerging from the domain text collections.
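Such clustering could, for instance, pool the frequencies of surface forms whose heads share a lexical generalization. The sketch below uses the first WordNet hypernym (again through NLTK, a choice of ours) as a deliberately crude stand-in for the common semantic interpretation:

from collections import Counter
from nltk.corpus import wordnet as wn   # requires the 'wordnet' corpus

def generalize(word):
    """First-sense hypernym of a noun: a crude semantic generalization."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return word
    hypernyms = synsets[0].hypernyms()
    return hypernyms[0].name() if hypernyms else synsets[0].name()

def pooled_relevance(pattern_frequencies):
    """Pool the frequencies of patterns whose head nouns generalize to
    the same node, so that sparse but semantically coherent patterns
    rank higher together."""
    pooled = Counter()
    for pattern, freq in pattern_frequencies.items():
        pooled[generalize(pattern.split()[-1])] += freq
    return pooled.most_common()

print(pooled_relevance({"full professor": 3, "visiting professor": 1}))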
The overall learning process (Fig. 11) is organized as follows. Firstly, admissible surface forms are extracted from the corpus and promising concept and relation candidates are stored as patterns. This activity is referred to as Corpus Processing. Then, an analysis devoted to determining a concept hierarchy is applied to the most relevant concepts extracted in the previous phase; it also makes use of the pre-existing domain concept hierarchy (DCH). This activity generalizes the available evidence across the general-purpose lexical knowledge base and is hereafter called Semantic Dictionary Building. Its aim is mainly to map domain
concepts into the general lexical database. The resulting concept hierarchy can then be used in the analysis and interpretation of relational patterns in the domain texts. This generalization allows the surface forms observed throughout the corpus to be conceptually clustered. The derived generalizations undergo statistical processing during the Domain Oriented Clustering phase: their distributional figures are derived and the resulting generalized patterns are organised according to their domain relevance score. The Relation Type Definition and Relation Pattern Classification phases are manual. The first is targeted at the production of a set of Semantic Relations (SR), i.e. a system of domain-specific relationship names: labels helpful in the interpretation/explanation of prototypical concept associations in the domain. The semantic relation type system determined in this phase is then used to classify the specific linguistic patterns clustered and ranked in the earlier phases. The result of this last activity is the set of linguistic rules for the matching and prediction of relations in SR; hereafter we will call such rules Linguistic Relation Interfaces.
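The skeleton below summarises the cascade. Every function body is a mere placeholder standing for the phase of the same name, and all signatures and toy return values are our own assumptions:

def corpus_processing(corpus):
    # admissible surface forms -> candidate concept and relation patterns
    return ["full professor"], [("professor", "teach", "course")]

def semantic_dictionary_building(concept_patterns, dch, lexical_kb):
    # map the domain concepts into the general-purpose lexical database
    return {c: lexical_kb.get(c, c) for c in concept_patterns}

def domain_oriented_clustering(relation_patterns, semantic_dict):
    # generalize the patterns and rank them by domain relevance
    return list(relation_patterns)

def relation_type_definition(ranked_patterns, rts):
    # manual phase: produce the Semantic Relations (SR) type system
    return {"teach": "EDUCATIONAL_ACTIVITY"}

def relation_pattern_classification(ranked_patterns, semantic_relations):
    # manual phase: classified patterns become Linguistic Relation Interfaces
    return [(p, semantic_relations.get(p[1], "UNKNOWN")) for p in ranked_patterns]

def learn_ontology(corpus, dch, rts, lexical_kb):
    """The Fig. 11 cascade, phase by phase."""
    concepts, relations = corpus_processing(corpus)
    semantic_dict = semantic_dictionary_building(concepts, dch, lexical_kb)
    ranked = domain_oriented_clustering(relations, semantic_dict)
    sr = relation_type_definition(ranked, rts)
    return relation_pattern_classification(ranked, sr)

print(learn_ontology([], {}, {}, {}))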
References
(Adriaans, 1992) Pieter Willem Adriaans, Language Learning from a Categorial Perspective, Academisch
Proefschrift, Universiteit van Amsterdam, 1992
(Arens et al., 1993) Yigal Arens, Chin Y. Chee, Chun-Nan Hsu, Craig A. Knoblock, Retrieving and
Integrating Data from Multiple Information Sources, International Journal of Cooperative Information
Systems, 2(2), 1993
(Basili, Pazienza, and Zanzotto, 2002) Roberto Basili, Maria Teresa Pazienza, Fabio Massimo Zanzotto, Learning IE patterns: a terminology extraction perspective, Proc. of the Workshop on Event Modelling for Multilingual Document Linking at LREC 2002, Las Palmas, Spain, 2002
(Basili, Vindigni, and Zanzotto, 2003) Roberto Basili, Michele Vindigni, Fabio Massimo Zanzotto,
Integrating ontological and linguistic knowledge for Conceptual Information Extraction, Submitted
for publication, 2003
(Batini, Lenzerini, Navathe, 1986) C. Batini, M. Lenzerini, S.B. Navathe, A comparative analysis of methodologies for schema integration, ACM Computing Surveys, 18(4), 1986
(Beneventano et al., 2001) D. Beneventano, S. Bergamaschi, F. Guerra, M. Vincini, The MOMIS approach
to Information Integration, IEEE and AAAI International Conference on Enterprise Information
Systems (ICEIS01), Setúbal, Portugal, 2001.
(Bentivogli, Pianta, and Girardi, 2002) Luisa Bentivogli, Emanuele Pianta, Christian Girardi, MultiWordNet:
developing an aligned multilingual database, in Proceedings of the First International Conference on
Global WordNet, Mysore, India, January 21-25, 2002
(Cabré, 1998) Maria Teresa Cabré, Terminology, John Benjamins Publishing Company, Amsterdam/Philadelphia, 1998
(Chaudhri et al., 1998) V. Chaudhri, R. Fikes, P. Karp, J. Rice, OKBC: A Programmatic Foundation for
Knowledge Base Interoperability. Proceedings of AAAI-98, Madison, Wisconsin, February, 1998
(Computerm, 1998) Didier Bourigault, Christian Jacquemin, Marie-Claude L'Homme (Editors), Proceedings of the First Workshop on Computational Terminology COMPUTERM'98, held jointly with COLING-ACL'98, Montreal, Quebec, Canada, 1998
(Daille, 1994) Béatrice Daille, Approche mixte pour l'extraction de terminologie: statistique lexicale et filtres linguistiques, PhD Thesis, C2V, TALANA, Université Paris VII, France, 1994
(DAML+OIL) www.daml.org
(Doan et al, 2002) AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Halevy, Learning to Map
between Ontologies on the Semantic Web, WWW2002, Honolulu, Hawaii, USA, 2002
(Genesereth and Nilsson, 1987) M.R. Genesereth, N.J. Nilsson, Logical Foundations of Artificial Intelligence, Morgan Kaufmann, Los Altos, California, 1987
(Guarino, 1998) Nicola Guarino, Formal Ontology and Information Systems, Proceedings of the 1st International Conference on Formal Ontology in Information Systems, FOIS'98, Trento, Italy, IOS Press, 1998
(Hendler, 2001) James Hendler, Agents and the Semantic Web, IEEE Intelligent Systems Journal
(March/April 2001)
(Heflin, Hendler, and Luke, 1999) Jeff Heflin, James Hendler, Sean Luke, SHOE: A Knowledge Representation Language for Internet Applications, Technical Report CS-TR-4078 (UMIACS TR-99-71), 1999
(Jacquemin, 1997) Christian Jacquemin, Variation terminologique: Reconnaissance et acquisition automatiques de termes et de leurs variantes en corpus, Mémoire d'Habilitation à Diriger des Recherches en informatique fondamentale, Université de Nantes, France, 1997
26
O
(Jacquemin, 2001) Christian Jacquemin, Spotting and Discovering Terms through Natural Language Processing, MIT Press, Cambridge, Massachusetts, 2001
(Lee, Hendler, and Lassila, 2001) Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web,
Scientific American, May 2001
(Levenshtein, 1966) V.I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Cybernetics and Control Theory, 10(8), 1966
(Meadche and Staab, 2001) Alexander Maedche and Steffen Staab, Comparing Ontologies - Similarity Measures and Comparison Study, Internal Report No. 408, Institute AIFB, University of Karlsruhe, Germany, 2001
(Mena and Illarramendi, 2001) E. Mena and A. Illarramendi, Ontology-Based Query Processing for Global
Information Systems, Kluwer Academic Publishers, ISBN 0-7923-7375-8, pp. 215, 2001.
(Miller, 1995) George A. Miller, WordNet: A Lexical Database for English, Communications of the ACM, 38(11), 1995
(Milne, 1968) A.A. Milne, Winnie-the-Pooh, Noordhoff, Groningen, 1968
(OWL) http://www.w3.org/TR/owl-ref/
(Pazienza and Vindigni, 2003) Maria Teresa Pazienza, Michele Vindigni, Agents based ontological
mediation in IE systems, forthcoming
(Pearson, 1998) Jennifer Pearson, Terms in Context, John Benjamins Publishing Company, Amsterdam/Philadelphia, 1998
(Pustejovsky, 1995) James Pustejovsky, The Generative Lexicon, MIT Press, Cambridge, 1995
(Riloff, 1996) Ellen Riloff, Automatically Generating Extraction Patterns from Untagged Text, Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), Portland, Oregon, 1996
(SIMPLE) http://www.ub.es/gilcub/SIMPLE/simple2.html
(Steel, 1996) Luc Steels, Emergent Adaptive Lexicons, Proceedings of the Simulation of Adaptive Behavior Conference, MIT Press, 1996
(van Rijsbergen, 1979) C.J. van Rijsbergen, Information Retrieval, Butterworths, London, Second Edition, 1979
(Vossen, 1998) Piek Vossen, EuroWordNet: A Multilingual Database with Lexical Semantic Networks,
Kluwer Academic Publishers, Dordrecht, 1998
(Weinstein and Birmingham, 1999) Peter C. Weinstein, William P. Birmingham, Comparing Concepts in
Differentiated Ontologies, Twelfth Workshop on Knowledge Acquisition, Modeling and Management
(KAW 1999), Alberta, Canada, 1999
(Wrust, 1931) Eugen Wüster, Die Internationale Sprachnormung in der Technik, besonders in der Elektrotechnik, VDI-Verlag, Berlin, Germany, 1931
(Yangarber et al., 2000) Roman Yangarber, Ralph Grishman, Pasi Tapanainen, Silja Huttunen, Unsupervised Discovery of Scenario-Level Patterns for Information Extraction, Proceedings of the Conference on Applied Natural Language Processing ANLP-NAACL 2000, Seattle, WA, 2000
(Zajac, 2001) Remi Zajac, Towards Ontological Question Answering, ACL-2001 Workshop on Open-Domain Question Answering, Toulouse, France, 2001
(Zanzotto, 2002) Fabio Massimo Zanzotto, L'estrazione di terminologia come strumento per la modellazione di domini conoscitivi, Università di Roma "Tor Vergata", Italy, 2002