HY-566 Semantic Web
Ontology Learning
Μπαλάφα Κασσιανή, Πλασταρά Κατερίνα

Table of contents
1. Introduction
2. Data sources for ontology learning
3. Ontology Learning Process
4. Architecture
5. Methods for learning ontologies
6. Ontology learning tools
7. Uses/applications of ontology learning
8. Conclusion
9. References

1. Introduction

1.1 Ontologies
Ontologies serve as a means for establishing a conceptually concise basis for communicating knowledge for many purposes. In recent years there has been a surge of interest in the discovery and automatic creation of complex, multirelational knowledge structures. Unlike knowledge bases, ontologies offer "all in one":
o a formal, machine-readable representation
o a full and explicitly described vocabulary
o a full model of some domain
o consensus knowledge: a common understanding of a domain
o easy sharing and reuse

1.2 Ontology learning

General
The main task of ontology learning is to learn complex domain ontologies automatically, a task that is usually performed only by humans. It explores techniques for applying knowledge discovery to different data sources (HTML, documents, dictionaries, free text, legacy ontologies, etc.) in order to support the engineering and maintenance of ontologies. In other words, it is the machine learning of ontologies.

Technical description
The manual building of ontologies is a tedious task, which can easily result in a knowledge acquisition bottleneck. In addition, modeling by hand by a human expert is biased, error prone and expensive. Fully automatic machine knowledge acquisition remains in the distant future. Most systems are semi-automatic and require human (expert) intervention and balanced cooperative modeling for constructing ontologies.

Semantic Integration
The conceptual structures that define an underlying ontology provide the key to machine-processable data on the Semantic Web. Ontologies serve as metadata schemas, providing a controlled vocabulary of concepts, each with explicitly defined and machine-processable semantics. Hence, the Semantic Web's success and proliferation depend on quickly and cheaply constructing domain-specific ontologies. Although ontology-engineering tools have matured over the last decade, manual ontology acquisition remains a tedious, cumbersome task that can easily result in a knowledge acquisition bottleneck. Intelligent support tools for an ontology engineer take on a different meaning than the integration architectures for more conventional knowledge acquisition. The figures referred to at this point (Semantic Information Integration; Ontology Alignment and Transformations; Ontology Engineering) illustrate where ontology learning fits within semantic integration.

2. Data sources for ontology learning

2.1 Natural languages
Natural language texts exhibit morphological, syntactic, semantic, pragmatic and conceptual constraints that interact to convey a particular meaning to the reader. The text thus transports information to the reader, and the reader embeds this information into his background knowledge. Through the understanding of the text, data is associated with conceptual structures, and new conceptual structures are learned from the interacting constraints given through language. Tools that learn ontologies from natural language exploit the interacting constraints on the various language levels (from morphology to pragmatics and background knowledge) in order to discover new concepts and stipulate relationships between concepts.
2.1.2 Example
An example of extracting semantic information from natural text in the form of an ontology is a methodology developed at Leipzig University in Germany. This approach focuses on applying statistical analysis of large corpora to the problem of extracting semantic relations from unstructured text. It is a viable method for generating input for the construction of ontologies, as ontologies use well-defined semantic relations as building blocks. The method's purpose is to create classes of terms (collocation sets) and to postprocess these statistically generated collocation sets in order to extract named relations. In addition, for different types of relations, such as cohyponyms or instance-of relations, different extraction methods as well as additional sources of information can be applied to the basic collocation sets in order to verify the existence of a specific type of semantic relation for a given set of terms.

The first step of this approach is to collect large amounts of unstructured text, which is processed in the following steps. The next step is to create the collocation sets, i.e. the classes of similar terms. The occurrence of two or more words within a well-defined unit of information (sentence, document) is called a collocation. For the selection of meaningful and significant collocations, an adequate collocation measure is defined based on probabilistic similarity metrics. To calculate the collocation measure for all reasonable pairs of terms, the joint occurrences of each pair are counted. This computation is expensive in both time and storage. Nevertheless, the collocation measure is calculated for any pair whose components each have a total frequency of at least 3 (a small counting sketch is given at the end of this example). The implementation is based on extensible ternary search trees, where a count can be associated with a pair of word numbers. The memory overhead of the original implementation could be reduced by allocating space in chunks of 100,000 nodes at once. Even when using this technique on a large-memory computer, more than one run through the corpus may be necessary, taking care that every pair is counted only once. The resulting word pairs above a threshold of significance are put into a database, where they can be accessed and grouped in many different ways.

Besides the textual output of collocation sets, visualizing them as graphs provides an additional type of representation. The procedure followed is: a word is chosen and its collocates are arranged in the plane so that collocations between collocates are taken into account. This results in graphs that show homogeneity where words are interconnected, and separation where collocates have little in common. Polysemy is made visible (see figure below). Line thickness represents the significance of the collocation. All words in the graph are linked to the central word; the rest of the picture is computed automatically, but represents semantic connectedness as well. The relations between the words are merely presented, not yet named. The figure shows the collocation graph for "space". Three different meaning contexts can be recognized in the graph:
o real estate,
o computer hardware, and
o astronautics.
The connection between address and memory results from the fact that address is another polysemous concept.
Figure: Collocation graph for "space"

The final step is to identify the relations between terms or collocation sets. The collocation sets are searched, and some semantic relations appear more often than others.
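Before turning to the relation types, here is a minimal sketch of the pair counting and significance filtering described above. It is an illustration only: the Poisson-style significance formula, the threshold values, and the function names are assumptions, not the actual Leipzig measure or its ternary-search-tree implementation.

```python
from collections import Counter
from itertools import combinations
import math

def collocations(sentences, min_freq=3, min_sig=5.0):
    """Count sentence-level co-occurrences and keep significant word pairs.

    sentences: iterable of token lists.
    Returns a dict mapping each word to its collocation set.
    """
    word_freq = Counter()
    pair_freq = Counter()
    n_sentences = 0
    for tokens in sentences:
        n_sentences += 1
        types = set(tokens)
        word_freq.update(types)
        # count each unordered pair at most once per sentence
        for a, b in combinations(sorted(types), 2):
            pair_freq[(a, b)] += 1

    colloc_sets = {}
    for (a, b), joint in pair_freq.items():
        # keep only pairs whose components each occur at least min_freq times
        if word_freq[a] < min_freq or word_freq[b] < min_freq:
            continue
        # Poisson-style surprise of `joint` co-occurrences versus the count
        # expected under independence (an assumed stand-in measure)
        expected = word_freq[a] * word_freq[b] / n_sentences
        sig = joint * (math.log(joint / expected) - 1) + expected
        if sig >= min_sig:
            colloc_sets.setdefault(a, set()).add(b)
            colloc_sets.setdefault(b, set()).add(a)
    return colloc_sets

# toy corpus; thresholds loosened for illustration
corpus = [
    "the shuttle reached outer space".split(),
    "disk space and memory address".split(),
    "office space for rent".split(),
]
print(collocations(corpus, min_freq=1, min_sig=0.4))
```

The textual and graph representations, as well as the relation identification discussed next, all operate on top of collocation sets of this kind.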
The following basic types of relations can be identified:
o cohyponymy,
o top-level syntactic relations, which translate to semantic 'actor-verb' relations and frequently used properties of a noun,
o instance-of,
o special relations given by multiwords (A prep/det/conj B), and
o unstructured sets of words describing some subject area.
These types of relations may be classified according to the properties symmetry, anti-symmetry, and transitivity. Additional relations between collocation sets can be identified with the user's contribution, such as:
o Pattern-based extraction (user defined), e.g. (profession) ? (last name) implies that ? is in fact a first name.
o Compound nouns. A semantic relation between the parts of a compound word can be found in most cases.
Term properties can be derived in similar ways. A combination of the results of the steps described above forms the ontology of the terms included in the original text. The output of this approach may be used for the automatic generation of semantic relations between terms in order to fill and expand ontology hierarchies.

2.2 Ontology Learning from Semi-structured Data
With the success of new standards for document publishing on the web, semi-structured data and formal descriptions of semi-structured data are becoming freely and widely available. HTML data, XML data, XML Document Type Definitions (DTDs), XML Schemata, and their likes add more or less expressive semantic information to documents. A number of approaches understand ontologies as a common generalizing level that may mediate between the various data types and data descriptions. Ontologies play a major role in allowing semantic access to these vast resources of semi-structured data. Though only a few approaches exist yet, it is believed that learning ontologies from such data and data descriptions may considerably leverage the application of ontologies and thus facilitate access to these data.

2.2.1 Example
An example of learning ontologies from both unstructured and semi-structured text is the DODDLE system. This approach, implemented at Shizuoka University in Japan, describes how to construct domain ontologies with taxonomic and non-taxonomic conceptual relationships by exploiting a machine-readable dictionary and domain-specific texts. The taxonomic relationships come from WordNet (an online lexical database for the English language) in interaction with a domain expert, using the following two strategies: match result analysis and trimmed result analysis. The non-taxonomic relationships come from domain-specific texts through the analysis of lexical co-occurrence statistics. The DODDLE (Domain Ontology Rapid Development Environment) system consists of two components: the taxonomic relationship acquisition module using WordNet and the non-taxonomic relationship learning module using domain-specific texts. An overview of the system and its components is depicted in Figure 1.
Figure 1: DODDLE overview

Taxonomic relationship acquisition module:
The taxonomic relationship acquisition module performs spell matching between the input domain terms and WordNet. The spell match links these terms to WordNet. The initial model obtained from the spell match results is thus a hierarchically structured set of all the nodes on the paths from these terms to the root of WordNet. However, the initial model contains unnecessary internal terms (nodes) that do not contribute to keeping the topological relationships among matched nodes, such as parent-child and sibling relationships.
These unnecessary internal nodes can therefore be trimmed from the initial model, yielding a trimmed model, as shown in the trimming process of Figure 2.
Figure 2: Trimming process

In order to refine the trimmed model, two strategies are applied in interaction with a user: match result analysis and trimmed result analysis.
o Match result analysis: Looking at the trimmed model, it turns out that it divides into PABs (PAths including only Best spell-matched nodes) and STMs (Sub-Trees that include best spell-matched nodes and other nodes and so should be Moved), based on the distribution of best-matched nodes. On one hand, a PAB is a path that includes only best-matched nodes, which make sense for the given domain. Because all nodes in a PAB have already been adjusted to the domain, PABs can stay where they are in the trimmed model. On the other hand, an STM is a sub-tree whose root is an internal node and whose subordinates are only best-matched nodes. Because internal nodes have not been confirmed to make sense for the given domain, an STM may be moved in the trimmed model. DODDLE thus identifies PABs and STMs in the trimmed model automatically and then supports the user in constructing a conceptual hierarchy by moving STMs. Figure 3 illustrates this match result analysis.
Figure 3: Match Result Analysis
o Trimmed result analysis: In order to refine the trimmed model, DODDLE uses trimmed result analysis as well as match result analysis. For sibling nodes with the same parent, the number of nodes trimmed between each sibling and the parent may differ considerably. When such a big difference comes up in a sub-tree of the trimmed model, it may be better to change the structure of the sub-tree. The system asks the user whether the sub-tree should be reconstructed or not. Figure 4 illustrates this trimmed result analysis.
Figure 4: Trimmed Result Analysis
Finally, DODDLE II completes the taxonomic relationships of the input domain terms with additional hand-made modifications from the user.

Non-taxonomic relationship learning module
Non-taxonomic relationship learning is largely based on WordSpace, which derives lexical co-occurrence information from a large text corpus and represents it as a multi-dimensional vector space (a set of vectors). The inner product between two word vectors works as the measure of their semantic relatedness. When the inner product of two words exceeds some threshold, they become candidates for a non-taxonomic relationship. WordSpace is constructed as shown in Figure 5.
Figure 5: Construction Flow of WordSpace
The main steps of the WordSpace construction process are: extraction of high-frequency 4-grams, construction of the collocation matrix, construction of context vectors, construction of word vectors, and construction of vector representations of all concepts. After these two main, parallel modules have concluded, all the resulting concepts are compared for similarity. The user defines a certain threshold for this similarity, and a concept pair with similarity beyond it is extracted as a similar concept pair. A set of similar concept pairs becomes a concept specification template. Both kinds of concept pairs, those whose meanings are similar (taxonomic relation) and those that are merely relevant to each other (non-taxonomic relation), are extracted together as concept pairs with high context similarity (a small sketch of this vector comparison is given below).
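A minimal sketch of the WordSpace-style comparison follows: word vectors are built from co-occurrence counts, and concept pairs whose similarity exceeds a user-defined threshold are proposed as candidates. This is a simplification under stated assumptions (plain window co-occurrence and cosine similarity); DODDLE's actual pipeline uses high-frequency 4-grams, a collocation matrix, and context vectors as described above.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math

def build_word_vectors(sentences, window=2):
    """Build simple co-occurrence vectors: vectors[w][c] = count of c near w."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for c in tokens[max(0, i - window): i + window + 1]:
                if c != w:
                    vectors[w][c] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def candidate_pairs(concepts, vectors, threshold=0.3):
    """Return concept pairs whose context similarity exceeds the threshold."""
    pairs = []
    for a, b in combinations(concepts, 2):
        sim = cosine(vectors.get(a, Counter()), vectors.get(b, Counter()))
        if sim >= threshold:
            pairs.append((a, b, round(sim, 3)))
    return pairs
```

Pairs returned by such a comparison form the concept specification template; the taxonomic information from the TRA module is then used to separate taxonomic from non-taxonomic candidates, as described next.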
By combining this co-occurrence information with taxonomic information from the TRA module, DODDLE marks as TAXONOMY those concept pairs that are hierarchically closer to each other than the other pairs. A user then constructs the domain ontology by considering the relation of each concept pair in the concept specification templates and by deleting unnecessary concept pairs. Figure 6 shows the ontology editor (left window) and the concept graph editor (right window).
Figure 6: The ontology editor
In order to evaluate how DODDLE performs in practical fields, case studies have been carried out on a particular body of law, the Contracts for the International Sale of Goods (CISG). Although this case study was small in scale, the results were encouraging.

2.3 Ontology Learning from Structured Data
Ontologies have been firmly established as a means for mediating between different databases. Nevertheless, the manual creation of a mediating ontology is again a tedious, often extremely difficult, task that may be facilitated through learning methods. The negotiation of a common ontology from a set of data, and the evolution of ontologies through the observation of data, are hot topics these days. The same applies to the learning of ontologies from metadata, such as database schemata, in order to derive a common high-level abstraction of the underlying data descriptions, an important precondition for data warehousing and intelligent information agents.

3. Ontology Learning Process
A general framework of the ontology learning process is shown in the figure below.
Figure: The ontology learning process
The basic steps in the engineering cycle are:
o Merging existing structures or defining mapping rules between these structures allows importing and reusing existing ontologies. (For instance, Cyc's ontological structures have been used to construct a domain-specific ontology.)
o Ontology extraction models major parts of the target ontology, with learning support fed from Web documents.
o The target ontology's rough outline, which results from import, reuse, and extraction, is pruned to better fit the ontology to its primary purpose.
o Ontology refinement profits from the pruned ontology but completes the ontology at a fine granularity (in contrast to extraction).
o The target application serves as a measure for validating the resulting ontology.
Finally, the ontology engineer can begin this cycle again, for example to include new domains in the constructed ontology or to maintain and update its scope.

3.1 Ontology learning process example
A variation of the ontology learning process described in the previous section was implemented in a user-centered system for ontology construction called Adaptiva, developed at the University of Sheffield (UK). In this approach, the user selects a corpus of texts and sketches a preliminary ontology (or selects an existing one) for a domain, with a preliminary vocabulary associated to the elements in the ontology (lexicalisations). Examples of sentences involving such lexicalisations (e.g. of the ISA relation) are automatically retrieved from the corpus by the system (a minimal sketch of this retrieval step follows this overview). Retrieved examples are then validated by the user and used by an adaptive Information Extraction system to generate patterns that discover other lexicalisations of the same objects in the ontology, possibly identifying new concepts or relations. New instances are added to the existing ontology or used to tune it. This process is repeated until a satisfactory ontology is obtained.
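The retrieval step can be sketched as follows under simple assumptions: lexicalisations of the ISA relation are located with surface patterns over the corpus, and the matches would then be shown to the user for validation. The fixed regular expressions and function names here are illustrative only; Adaptiva learns its patterns with an adaptive information extraction system rather than using a predefined list.

```python
import re

# Illustrative surface patterns for the ISA relation; Adaptiva induces its
# patterns from user-validated examples instead of a fixed list like this.
ISA_PATTERNS = [
    re.compile(r"(?P<hypo>\w+) is an? (?P<hyper>\w+)"),
    re.compile(r"(?P<hyper>\w+)s such as (?P<hypo>\w+)"),
]

def retrieve_isa_candidates(corpus_sentences):
    """Return (sentence, hyponym, hypernym) triples for user validation."""
    candidates = []
    for sentence in corpus_sentences:
        for pattern in ISA_PATTERNS:
            for m in pattern.finditer(sentence):
                candidates.append((sentence, m.group("hypo"), m.group("hyper")))
    return candidates

corpus = [
    "An ontology is a specification of a conceptualisation.",
    "Markup languages such as XML carry structure.",
]
for sent, hypo, hyper in retrieve_isa_candidates(corpus):
    print(f"{hypo} ISA {hyper}   <- {sent}")
```

Validated matches of this kind are what the adaptive learner generalizes into new extraction patterns in the pattern learning step described below.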
Each stage of the Adaptiva process consists of three steps: bootstrapping, pattern learning and user validation, and cleanup.
o Bootstrapping. The bootstrapping process involves the user specifying a corpus of texts and a seed ontology. The draft ontology must be associated with a small thesaurus of words, i.e. the user must indicate at least one term that lexicalises each concept in the hierarchy.
o Pattern Learning & User Validation. Words in the thesaurus are used by the system to retrieve a first set of examples of the lexicalisation of the relations among concepts in the corpus. These are then presented to the user for validation. The learner then uses the positive examples to induce generic patterns able to discriminate between them and the negative ones. Patterns are generalised in order to find new (positive) examples of the same relation in the corpus. These are presented to the user for validation, and user feedback is used to refine the patterns or to derive additional ones. The process terminates when the user feels that the system has learned to spot the target relations correctly. The final patterns are then applied to the whole corpus, and the ontology is presented to the user for cleanup.
o Cleanup. This step helps the user make the ontology developed by the system coherent. First, users can visualize the results and edit the ontologies directly. They may want to collapse nodes, establish that two nodes are not separate concepts but synonyms, split nodes, or move the hierarchical positioning of nodes with respect to each other. The user may also wish to 1) add further relations to a specific node; 2) ask the learner to find all relations between two given nodes; 3) refine or label relations discovered between given nodes. Corrections are returned to the IE system for retraining.
This methodology focuses the expensive user activity on sketching the initial ontology and validating textual examples and the final ontology, while the system performs the tedious activity of searching a large corpus for knowledge discovery. Moreover, the output of the process is not only an ontology, but also a system trained to rebuild and eventually retune the ontology, as the learner adapts by means of the user feedback. This simplifies ontology maintenance, a major problem in ontology-based methodologies.

4. Architecture
The general architecture of the ontology learning process is shown in the following figure.
Figure: Ontology learning architecture for the Semantic Web
The ontology engineer interacts only via the graphical interfaces, which comprise two of the four components: the Ontology Engineering Workbench and the Management Component. Resource Processing and the Algorithm Library are the architecture's remaining components. These components are described below.

Ontology Engineering Workbench
This component provides sophisticated means for manually modeling and refining the final ontology. The ontology engineer can browse the ontology resulting from the ontology learning process and decide to follow, delete or modify the proposals as the task requires.

Management component (graphical user interface)
The ontology engineer uses the management component to select input data, that is, relevant resources such as HTML and XML documents, DTDs, databases, or existing ontologies that the discovery process can further exploit.
Then, using the management component, the engineer chooses from a set of resource-processing methods available in the resource-processing component and from a set of algorithms available in the algorithm library. The management component also supports the engineer in discovering task-relevant legacy data; for example, an ontology-based crawler gathers HTML documents that are relevant to a given core ontology.

Resource processing
Depending on the available input data, the engineer can choose various strategies for resource processing:
o Index and reduce HTML documents to free text.
o Transform semi-structured documents, such as dictionaries, into a predefined relational structure.
o Handle semi-structured and structured schema data (such as DTDs, structured database schemata, and existing ontologies) by following different strategies for import.
o Process free natural text.
After first preprocessing the data according to one of these or similar strategies, the resource-processing module transforms the data into an algorithm-specific relational representation.

Algorithm library
An ontology can be described by a number of sets of concepts, relations, lexical entries, and links between these entities. An existing ontology definition can be acquired using various algorithms that work on this definition and the preprocessed input data. Although specific algorithms can vary greatly from one type of input to the next, a considerable overlap exists for underlying learning approaches such as association rules, formal concept analysis, or clustering. Hence, algorithms can be reused from the library for acquiring different parts of the ontology definition.

5. Methods for learning ontologies
Some methodologies used in the ontology learning process are described in the following sections.

5.1 Association Rules
A basic method used in many ontology learning systems is the use of association rules for ontology extraction. Association-rule-learning algorithms are used in prototypical data mining applications for finding associations that occur between items, here in order to construct ontologies (extraction stage). 'Classes' are expressed by the expert as a free-text conclusion to a rule. Relations between these 'classes' may be discovered from existing knowledge bases, and a model of the classes (an ontology) is constructed based on user-selected patterns in the class relations. This approach is useful for solving classification problems by creating classification taxonomies (ontologies) from rules. A classification knowledge-based system using this method, with experimental results on medical data, was implemented at the University of New South Wales, Australia. In this approach, Ripple Down Rules (RDR) are used to describe classes and their attributes. The form of RDR rules is shown in the following figure, which represents some rules for the class "Satisfactory lipid profile, previous raised LDL noted". In the first rule there is a condition Max(LDL) > 3.4, and in the second rule a condition Max(LDL) is HIGH, where HIGH is a range between two real numbers (a small representation sketch is given below).
Figure: An example of a class which is a disjunction of two rules
The conclusions of the rules form the classes of the classification ontology. If the system makes an error, the expert using this methodology is allowed to specify the correct conclusion and to identify the attributes and values that justify this conclusion.
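A minimal sketch of how such rules and classes might be represented is shown below. The attribute names, the HIGH range, and the helper functions are illustrative assumptions, not the actual RDR system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A rule is a condition over a case (attribute -> value) plus a free-text
# conclusion; a class is the disjunction of the rules sharing that conclusion.
@dataclass
class Rule:
    condition: Callable[[Dict[str, float]], bool]
    conclusion: str

HIGH_LDL = (2.6, 3.4)  # illustrative range between two real numbers

rules: List[Rule] = [
    Rule(lambda case: case["max_ldl"] > 3.4,
         "Satisfactory lipid profile, previous raised LDL noted"),
    Rule(lambda case: HIGH_LDL[0] <= case["max_ldl"] <= HIGH_LDL[1],
         "Satisfactory lipid profile, previous raised LDL noted"),
]

def classify(case: Dict[str, float]) -> List[str]:
    """Return the conclusions (classes) whose rules fire for this case."""
    return [r.conclusion for r in rules if r.condition(case)]

print(classify({"max_ldl": 3.0}))
```

Class relations such as subsumption or mutual exclusivity, described in the steps that follow, can then be estimated by checking which conclusions co-occur over the case base.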
The method applied in this approach includes three basic steps:
o The first step is to discover class relations between rules. In this stage, three basic relations are taken into account:
1. Subsumption/intersection: a class A subsumes/intersects with a class B if class A always occurs when class B occurs, but not the other way around.
2. Mutual exclusivity: two classes are mutually exclusive if they never occur together.
3. Similarity: two classes are similar if they have similar conditions in the rules they come from.
Based on these relations, the first classes of rule conclusions are formed.
o The second step is to specify, using the three basic relations, some compound relations that appear interesting. This step is performed in interaction with the expert.
o The final step is to extract instances of these compound relations or patterns and assemble them into a class model (ontology).
The key idea in this technique is that it seems reasonable to use heuristic quantitative measures to group classes and class relations. This then enables possible ontologies to be explored on a reasonable scale.

5.2 Clustering

Learning semantic classes
In the context of learning semantic classes, learning from syntactic contexts exploits syntactic relations among words to derive semantic relations, following Harris' hypothesis. According to this hypothesis, the study of syntactic regularities within a specialized corpus permits the identification of syntactic schemata made up of combinations of word classes that reflect specific domain knowledge. Using specialized corpora eases the learning task, since they involve a limited vocabulary with reduced polysemy and limited syntactic variability. In syntactic approaches, learning results can be of different types, depending on the method employed. They can be distances that reflect the degree of similarity among terms, distance-based term classes elaborated with the help of nearest-neighbor methods, degrees of membership in term classes, class hierarchies formed by conceptual clustering, or predicative schemata that use concepts to constrain selection. The notion of distance is fundamental in all cases, as it allows calculating the degree of proximity between two objects (terms, in this case) as a function of the degree of similarity between the syntactic contexts in which they appear. Classes built by aggregation of near terms can afterwards be used for different applications, such as syntactic disambiguation or document retrieval. Distances are, however, calculated using the same similarity notion in all cases, and the model relies on these studies regardless of the application task.

Conceptual clustering
Ontologies are organized as multiple hierarchies that form an acyclic graph, where nodes are term categories described by intension and links represent inclusion. Learning through hierarchical classification of a set of objects can be performed in two main ways: top-down, by incremental specialization of classes, and bottom-up, by incremental generalization. The bottom-up approach is preferable due to its smaller algorithmic complexity and its understandability to the user in an interactive validation task.

The Mo'K workbench
Mo'K is a workbench that supports the development of conceptual clustering methods for the (semi-)automatic construction of ontologies of the conceptual-hierarchy type from parsed corpora. The learning model it proposes takes parsed corpora as input (a minimal bottom-up clustering sketch follows below).
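A minimal sketch of bottom-up conceptual clustering over syntactic contexts, under simple assumptions: each term is described by the counts of the (verb, syntactic function) contexts it appears in, and the closest classes are merged repeatedly. Real workbenches such as Mo'K parameterize the distance, the pruning, and the stopping criterion; the measure and thresholds here are illustrative.

```python
from collections import Counter
import math

def context_distance(c1, c2):
    """1 - cosine similarity between two context-count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1 if k in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return 1.0 - (dot / (n1 * n2) if n1 and n2 else 0.0)

def bottom_up_clusters(term_contexts, max_dist=0.5):
    """Agglomerative generalization: start from singleton term classes and
    repeatedly merge the two closest classes while their distance is small."""
    clusters = [({t}, Counter(ctx)) for t, ctx in term_contexts.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = context_distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_dist:
            break
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [terms for terms, _ in clusters]

# toy input: terms described by (verb, syntactic function) contexts
term_contexts = {
    "carrot": {("cook", "object"): 4, ("peel", "object"): 2},
    "potato": {("cook", "object"): 5, ("peel", "object"): 3},
    "pan":    {("cook", "in"): 6, ("heat", "object"): 2},
}
print(bottom_up_clusters(term_contexts))
```

In an interactive setting, each proposed merge would be shown to the expert for validation before becoming a named concept, which is precisely the cooperative style of the approaches described in this section.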
Mo'K uses no additional (terminological or semantic) knowledge for labeling the input, guiding learning, or validating the learning results. Preliminary experiments showed that the quality of learning decreases with the generality of the corpus. This makes the use of general ontologies for guiding such learning somewhat unrealistic, as they seem too incomplete and polysemous to allow efficient learning in specific domains.

5.3 Ontology Learning with Information Extraction Rules
The figure below illustrates the overall idea of building ontologies with learned information extraction rules. The starting point is:
1. An initial, hand-crafted seed ontology of reasonable quality, which already contains the relevant types of relationships between ontology concepts in the given domain.
2. An initial set of documents which exemplarily represent, informally, substantial parts of the knowledge represented formally in the seed ontology.
The pairs (ontological statement, one or more textual representations) are taken as positive examples of how specific ontological statements can be reflected in texts. There are two possibilities for extracting such examples:
o Based on the seed ontology, the system looks up the signature of a certain relation R, searches for all occurrences of instances of the corresponding concept classes (for example Disease and Cure) within a certain maximum distance, and regards these co-occurrences as positive examples for the relationship R. This approach presupposes that the seed documents have some "definitional" character, like domain-specific lexica or textbooks.
o The user goes through the seed documents with a marker and manually highlights all interesting passages as instances of some relationship. This approach is more work-intensive, but promises faster learning and more precise results. It has already been employed successfully in an industrial information extraction project.
A pattern learning algorithm is then employed to automatically construct information extraction rules that abstract from the specific examples, creating general statements about which text patterns are evidence for a certain ontological relationship. In order to learn such information extraction rules, some prerequisites are needed:
(a) A sufficiently detailed representation of documents, in particular including word positions (which is not usual in conventional, vector-based learning algorithms), WordNet synsets, and part-of-speech tagging.
(b) A sufficiently powerful representation formalism for extraction patterns.
(c) A learning algorithm that has direct access to background knowledge sources, like the already available seed ontology containing statements about known concept instances, or like the WordNet database of lexical knowledge linking words to their synonym sets and giving access to sub- and superclasses of synonym sets, etc.
The learned information extraction rules are then applied to other, new text documents to discover new or not yet formalized instances of relationship R in the given application domain. Compared to other ontology learning approaches, this technique is not restricted to learning taxonomy relationships; it can learn arbitrary relationships in an application domain. A project that uses this technique is FRODO ("A Framework for Distributed Organizational Memories"), which concerns methods and tools for building and maintaining distributed Organizational Memories in a real-world enterprise environment.
It is funded by the German National Ministry for Research and Education and started with five scientific researchers in January 2000.

6. Ontology learning tools

6.1 TEXT-TO-ONTO
TEXT-TO-ONTO supports semi-automatic ontology learning from text and tries to overcome the knowledge acquisition bottleneck. It is based on a general architecture for discovering conceptual structures and engineering ontologies from text.

Architecture
The process of semi-automatic ontology learning from text is embedded in an architecture that comprises several core components arranged as a kind of pipeline. The main components of the architecture are the following:

Text & Processing Management Component
The ontology engineer uses this component to select the domain texts exploited in the further discovery process. The engineer can choose among a set of text (pre)processing methods available on the Text Processing Server and among a set of algorithms available in the Learning & Discovering component. The Text Processing Server returns text annotated with XML, and this XML-tagged text is fed to the Learning & Discovering component.

Text Processing Server
It contains a shallow text processor based on the core system SMES (Saarbrücken Message Extraction System). SMES performs syntactic analysis on natural language documents. It is organized in modules, such as a tokenizer, morphological and lexical processing, and chunk parsing, which use lexical resources to produce mixed syntactic/semantic information. The results of text processing are stored as annotations in XML-tagged text.

Lexical DB & Domain Lexicon
SMES accesses a lexical database with more than 120,000 stem entries and more than 12,000 subcategorization frames that are used for lexical analysis and chunk parsing. The domain-specific part of the lexicon associates word stems with concepts available in the concept taxonomy and links syntactic information with semantic knowledge that may be further refined in the ontology.

Learning & Discovering component
It uses various discovery methods on the annotated texts, e.g. term extraction methods for concept acquisition.

Ontology Engineering Environment (OntoEdit)
It supports the ontology engineer in semi-automatically adding newly discovered conceptual structures to the ontology. Internally, it stores the modeled ontologies using an XML serialization.

6.2 ASIUM

ASIUM overview
Asium is an acronym for "Acquisition of Semantic knowledge Using Machine learning method". The main aim of Asium is to help the expert acquire semantic knowledge from texts and to generalize the knowledge of the corpus. It also provides the expert with a user interface that includes tools and functionality for exploring the texts and then learning knowledge that is not explicitly in the texts. During the learning step, Asium helps the expert acquire semantic knowledge from the texts, namely subcategorization frames and an ontology. The ontology is an acyclic graph of the concepts of the studied domain. The subcategorization frames represent the use of the verbs in these texts. For example, starting from cooking recipe texts, Asium should learn an ontology with concepts such as "Recipients", "Vegetables" and "Meat". It can also learn, in parallel, the subcategorization frame of the verb "to cook", which can be:
to cook:
  Object: Vegetable or Meat
  in: Recipients

Methodology
The overall methodology implemented by ASIUM is depicted in the following figure. The input to Asium is syntactically parsed text from a specific domain.
Asium then extracts triplets of the form: verb, preposition (or syntactic function if there is no preposition), lemmatized head noun of the complement. Next, using factorization, Asium groups together all the head nouns occurring with the same verb and preposition/function pair. These lists of nouns are called basic clusters; they are linked to the verb and preposition/function pairs they come from. Asium then computes the similarity among all the basic clusters. The nearest ones are aggregated, and this aggregation is suggested to the expert for creating a new concept. The expert defines a minimum threshold for gathering classes into concepts. The distance computation alone is not enough to learn the concepts of a domain. The help of the expert is necessary, because learned concepts can contain noise (for example, parsing mistakes), some sub-concepts may not be identified, or over-generalization may occur due to aggregations. Similarity is computed between all pairs of basic clusters, and then the expert validates the list of classes learned by Asium. After this, Asium has learned the first level of the ontology. Similarity is computed again, but among all the clusters, both the old and the new ones, in order to learn the next level of the ontology. The advantages of this method are twofold. First, the similarity measure identifies all concepts of the domain, and the expert can validate or split them; the learning process is then, in part, based on these new concepts and suggests more relevant and more general concepts. Second, the similarity measure offers the expert aggregations between already validated concepts and new basic clusters in order to get more knowledge from the corpus. The cooperative process runs until there are no more possible aggregations. The output of the process is the set of subcategorization frames and the ontology.
Figure: The ASIUM methodology

SYLEX
The preprocessing of the free text is performed by Sylex. Sylex, a syntactic parser from the French company Ingénia, is used to parse source texts in French or English. This parser is a toolbox of about 700 functions that must be combined in order to produce results. In ASIUM, the attachments between verbs and the head nouns of their complements, together with their bounds, are retrieved from the full syntactic parsing performed by Sylex. The file format that Asium uses to read the parsing is the following:
--- (sentence of the original text)
Verbe: (the verb)
Kind of complement, one of Sujet (subject), COD (direct object), COI (indirect object), CC position/manière/but/provenance/direction (adjunct of position, manner, aim, provenance, direction): (head noun of the complement)
Bornes_mot: (bounds of the noun in the sentence)
Bornes_GN: (bounds of the noun phrase in the sentence)
Prep (optional): (the preposition)
The resulting parsed text is then provided to ASIUM for further processing.

The user interface
The user interface of ASIUM allows the user to manipulate and view the ontology at every stage of the learning process. The following figures show some of the basic windows of the interface:
o a window that allows the expert to validate the concepts learned by Asium;
o a window that displays the list of all the examples covered by a learned concept;
o a window that displays the ontology as it currently is in memory, i.e. learned concepts and concepts proposed for this level, where each blue circle represents a class, labeled or not;
o a window that allows the expert to split a class into several sub-concepts.
In the splitting window, the left list shows the nouns the expert wants to split into sub-concepts, and the right list contains the nouns of one sub-concept.

Uses of ASIUM
The kind of semantic knowledge produced by ASIUM can be very useful in many applications. Some of them are mentioned below:
o Information Retrieval: Verb subcategorization frames can be used to tag texts. Most of the nouns occurring in the texts are then tagged by their concept. The search for the right text can be based on a query using domain concepts instead of words. For example, if the user is interested in movie stars, he would not search for the noun "star" but for the concept "Movie_stars", which is clearly distinct from "Space_Stars".
o Information Extraction: Such subcategorization frames, together with an ontology, allow the expert to write "semantic" extraction rules.
o Text indexing: After the ontology for a domain has been learned, the texts can be enriched with the concepts. The ontology can then be used for indexing the texts.
o Text filtering: As with information extraction, filtering can use rules based on concepts and on the verbs used in the texts. The filtering quality should be improved by this semantic knowledge.
o Abstracts of texts: The use of subcategorization frames and ontology concepts allows texts to be tagged, which is a valuable help for extracting abstracts from texts.
o Automatic translation: Creating the ontologies and the subcategorization frames in both languages, and then matching the concepts of the verb frames across the two languages, should improve translators.
o Syntactic parsing improvement: Subcategorization frames and the concepts of a domain should improve a syntactic parser by letting it choose the right verb attachment with respect to the ontology, thus avoiding many ambiguities.

7. Uses/applications of ontology learning
The ontology learning process and methods described in the previous sections can be used and applied in many domains concerning knowledge and information extraction. Some uses and applications are described in this section.

7.1 Knowledge sharing in multi-agent systems
Discovering related concepts among agents with diverse ontologies in a multi-agent system is difficult using existing knowledge representation languages and approaches. This section describes an approach for identifying candidate relations between expressive, diverse ontologies using concept cluster integration. In order to facilitate knowledge sharing between a group of interacting information agents (i.e. a multi-agent system), a common ontology should be shared. However, agents do not always commit a priori to a common, predefined global ontology. This research investigates approaches for agents with diverse ontologies to share knowledge through automated learning methods and agent communication strategies. The goal is for agents that do not know how their concepts relate to each other to be able to teach each other these relationships. If the agents are able to discover these concept relations, this will aid them as a group in sharing knowledge even though they have diverse ontologies. Information agents acting on behalf of a diverse group of users need a way of discovering relationships between the individualized ontologies of users. These agents can use the discovered relationships to help their users find information related to their topic, or concept, of interest.
In this approach, semantic concepts are represented in each agent as concept vectors of terms. Supervised inductive learning is used by agents to learn their individual ontologies. The output of this ontology learning is semantic concept descriptions (SCD) in the form of interpretation rules. This concept representation and learning is shown in the following figure.
Figure: Supervised inductive learning produces ontology rules
The process of knowledge sharing between two agents, the Q (querying) and the R (responding) agent, begins when the Q agent sends a concept-based query. The R agent interprets this query, and if related concepts are found, a response is sent to the Q agent. After that, the Q agent takes the following steps to perform the concept cluster integration:
1. From the R agent's response, determine the names of the concepts to cluster.
2. Create a new compound concept using the above names.
3. Create a new ontology category by combining the instances associated with the compound concept.
4. Re-learn the ontology rules.
5. Re-interpret the concept-based query using the new ontology rules, including the new concept cluster description rules.
6. If the concept is verified, store the new concept relation rule.
In this way, an agent learns from the knowledge provided by another agent. This methodology was implemented in the DOGGIE (Distributed Ontology Gathering Group Integration Environment) system, which was developed at the University of Iowa.

7.2 Ontology based Interest Matching
Designing a general algorithm for interest matching is a major challenge in building online communities and agent-based communication networks. This section presents an information-theoretic concept-matching approach to measuring degrees of similarity among users. A distance metric is used as a measure of similarity between users represented by concept hierarchies. Preliminary sensitivity analysis shows that this distance metric has more interesting properties and is more noise tolerant than keyword-overlap approaches. With the emergence of online communities on the Internet, software-mediated social interactions are becoming an important field of research. Within an online community, the history of a user's online behavior can be analyzed and matched against other users to provide collaborative sanctioning and recommendation services that tailor and enhance the online experience. In this approach, the process of finding similar users based on data from logged behavior is called interest matching. Ontologies may take many forms. In the described method, an ontology is expressed as a tree hierarchy of concepts. In general, tree representations of ontologies are usually polytrees; however, for simplicity, here the tree representation is assumed to be singly connected, and all child nodes of a node are assumed to be mutually exclusive. Concepts in the hierarchy represent the subject areas that the user is interested in. To facilitate ontology exchange between agents, an ontology can be encoded in the DARPA Agent Markup Language (DAML). The figure below illustrates a visualization of this sample ontology.
Figure: An example of an ontology used
The root of the tree represents the interests of the user. Subsequent sub-trees represent classifications of the user's interests. Each parent node is related to a set of child nodes. A directed edge from the parent node to a child node represents a (possibly exclusive) sub-concept. For example, in the figure, Seafood and Poultry are both subcategories of the more general concept of Food (a small sketch of such a weighted concept tree and a simple similarity over it is given below).
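The following is a minimal sketch of such a weighted concept tree and of one possible similarity between two users' trees. The source does not specify the distance metric, so the weighted-overlap similarity below is an assumption for illustration only; the node names follow the Food/Seafood/Poultry example above, and the weights stand in for the behavior-derived values discussed next.

```python
# A user's personalized ontology: concept -> (parent, weight).
# Weights are placeholders for values derived from observed behavior.
user_a = {
    "Interests": (None, 1.0),
    "Food":      ("Interests", 0.7),
    "Seafood":   ("Food", 0.6),
    "Poultry":   ("Food", 0.1),
}
user_b = {
    "Interests": (None, 1.0),
    "Food":      ("Interests", 0.5),
    "Seafood":   ("Food", 0.5),
    "Poultry":   ("Food", 0.4),
}

def interest_similarity(tree1, tree2):
    """Assumed weighted-overlap similarity over the shared concepts.

    This is not the information-theoretic metric of the original work,
    just a simple stand-in: shared weight mass over total weight mass.
    """
    shared = set(tree1) & set(tree2)
    overlap = sum(min(tree1[c][1], tree2[c][1]) for c in shared)
    total = sum(w for _, w in tree1.values()) + sum(w for _, w in tree2.values())
    return 2 * overlap / total if total else 0.0

print(round(interest_similarity(user_a, user_b), 3))
```

Whatever metric is used, it operates on the per-user weights, which is why their derivation from observed behavior, described next, matters.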
Even if, in general, every user adopts the standard ontology, there must be a way to personalize the ontology to describe each user. For each user, each node has a weight attribute representing the importance of the concept. In this ontology, given the context of Food, the user tends to be more interested in Seafood than in Poultry. The weights in the ontology are determined by observing the behavior of the user. The history of the user's online readings and explicit relevance feedback are excellent sources for determining the values of the weights. In this approach, a standard ontology is used to categorize the interests of users. Using the standard ontology, the websites the user visits can be classified and entered into the standard ontology to personalize it. A form of weight for each category can then be derived: if a user frequents websites in that category, or an instance of that class, it can be assumed that the user will also be interested in other instances of the class. With the weights, the distance metric can be used to compare the interests of different users and finally categorize them. The effectiveness of the ontology matching algorithm is to be determined by deploying it in various instances of online communities.

7.3 Ontology learning for Web Directory Classification
Ontologies and ontology learning can also be used to create information extraction tools for collecting general information from the free text of web pages and classifying them into categories. The goal is to collect indicator terms from the web pages that may assist the classification process. These terms can be derived from the directory headings of a web page as well as from its content. The indicator terms, along with a collection of interpretation rules, can result in a hierarchy (ontology) of web pages. In this way, the information extraction and ontology learning process can be applied to large web directories both for information storage and for knowledge mining.

7.4 E-mail classification: KMi Planet
KMi Planet is a web-based news server for the communication of stories between members of the Knowledge Media Institute. Its main goals are to classify an incoming story, obtain the relevant objects within the story, deduce the relationships between them, and populate the ontology with minimal help from the user, integrating a template-driven information extraction engine with an ontology engine to supply the necessary semantic content. Two primary components are the story library and the ontology library. The story library contains the text of the stories that have been provided to Planet by the journalists; in the case of KMi Planet it contains stories relevant to the institute. The ontology library contains several existing ontologies, in particular the KMi ontology. PlanetOnto augments the basic publish/find scenario supported by KMi Planet and supports the following activities:
1. Story submission. A journalist submits a story to KMi Planet by e-mail. The story is then formatted and stored.
2. Story reading. A Planet reader browses through the latest stories using a standard Web browser.
3. Story annotation. Either a journalist or a knowledge engineer manually annotates the story using Knote (the Planet knowledge editor).
4. Provision of customized alerts. An agent called Newsboy builds user profiles from patterns of access to PlanetOnto and then uses these profiles to alert readers about relevant new stories.
5. Ontology editing.
A tool called WebOnto provides Web-based visualisation, browsing and editing support for the ontology. OCML, the Operational Conceptual Modelling Language, is a language designed for knowledge modeling. WebOnto uses OCML and allows the creation of classes and instances in the ontology, along with easier development and maintenance of the knowledge models. This is where ontology learning comes in.
6. Story soliciting. An agent called Newshound periodically solicits stories from the journalists.
7. Story retrieval and query answering. The Lois interface supports integrated access to the story archive.
Two other tools have been integrated into the architecture:
o MyPlanet: an extension to Newsboy that helps story readers read only the stories that are of interest, instead of reading all stories in the archive. It uses a manually predefined set of cue-phrases for each of the research areas defined in the ontology. For example, for genetic algorithms one cue-phrase is "evolutionary algorithms". Consider the example of someone interested in the research area Genetic Algorithms. A search engine will return all the stories that talk about that research area. In contrast, MyPlanet (by using the ontological relations) will also find all Projects that have research area Genetic Algorithms and then search for stories that talk about these projects, thus returning them to the reader even if the story text itself does not contain the phrase "genetic algorithms".
o An IE tool: a tool that extracts information from e-mail text and connects with WebOnto to prove theorems using the KMi Planet ontology.

8. Conclusion
Ontology learning could add significant leverage to the Semantic Web because it propels the construction of domain ontologies, which the Semantic Web needs to succeed. We have presented a collection of approaches and methodologies for ontology learning that crosses the boundaries of single disciplines, touching on a number of challenges. All these methods are still experimental and await further improvement and analysis. So far, the results are rather discouraging compared to the final goal: fully automated, intelligent knowledge learning systems. The good news, however, is that perfect or optimal support for cooperative ontology modeling is not yet needed. Cheap methods in an integrated environment can tremendously help the ontology engineer. While a number of problems remain within individual disciplines, additional challenges arise that specifically pertain to applying ontology learning to the Semantic Web. With the use of XML-based namespace mechanisms, the notion of an ontology with well-defined boundaries (for example, only definitions that are in one file) will disappear. Rather, the Semantic Web might yield a primitive structure regarding ontology boundaries, because ontologies refer to and import each other. However, what the semantics of these structures will look like is not yet known. In light of these facts, the importance of methods such as ontology pruning and crawling will drastically increase, and further approaches are yet to come.

9. References
[1] M. Sintek, M. Junker, L. van Elst, A. Abecker, Using Information Extraction Rules for Extending Domain Ontologies, German Research Center for Artificial Intelligence (DFKI)
[2] M. Vargas-Vera, J. Domingue, Y. Kalfoglou, E. Motta, S. Buckingham Shum, Template-Driven Information Extraction for Populating Ontologies, Knowledge Media Institute (UK)
[3] G. Bisson, C. Nedellec, Designing Clustering Methods for Ontology Building, University of Paris
[4] A. Maedche, S. Staab, The TEXT-TO-ONTO Ontology Learning Environment, University of Karlsruhe
[5] A. Maedche, S. Staab, Ontology Learning for the Semantic Web, University of Karlsruhe
[6] H. Suryanto, P. Compton, Learning Classification Taxonomies from a Classification Knowledge Based System, University of New South Wales (Australia)
[7] Proceedings of the First Workshop on Ontology Learning (OL'2000), Berlin, Germany, August 25, 2000
[8] Proceedings of the Second Workshop on Ontology Learning (OL'2001), Seattle, USA, August 4, 2001
[9] ASIUM web page: http://www.lri.fr/~faure/Demonstration.UK/Presentation_Demo.html
[10] T. Yamaguchi, Acquiring Conceptual Relationships from Domain Specific Texts, Shizuoka University, Japan
[11] G. Heyer, M. Lauter, Learning Relations using Collocations, Leipzig University, Germany
[12] C. Brewster, F. Ciravegna, Y. Wilks, User-Centered Ontology Learning for Knowledge Management, University of Sheffield, UK
[13] A. Williams, C. Tsatsoulis, An Instance-Based Approach for Identifying Candidate Ontology Relations within a Multi-Agent System, University of Iowa
[14] W. Koh, L. Mui, An Information Theoretic Approach to Ontology-Based Interest Matching, MIT
[15] M. Kavalec, V. Svatek, Information Extraction and Ontology Learning Guided by Web Directory, University of Prague
[16] C. Brewster, F. Ciravegna, Y. Wilks, Knowledge Acquisition for Knowledge Management, University of Sheffield, UK