MASTER-Web: An Ontology-Based Internet Data Mining Multi-Agent System

Frederico Luiz G. Freitas, Guilherme Bittencourt, Jacques Calmet

This work was supported by the Brazilian-German PROBRAL project "A semantic approach to data retrieval" under Grant No. 060/98. The authors thank the Brazilian "Fundação Coordenação de Aperfeiçoamento de Pessoal de Nível Superior" (CAPES) and the German "Deutscher Akademischer Austauschdienst" (DAAD) agencies for their support. Frederico Luiz G. Freitas and Guilherme Bittencourt are with the Laboratório de Controle e Microinformática (LCMI), Departamento de Automação e Sistemas (DAS), Universidade Federal de Santa Catarina, Florianópolis SC, 88040-900 Brazil (e-mail: {fred-pe,gb}@lcmi.ufsc.br). Prof. Jacques Calmet is with the Institute of Algorithms and Cognitive Systems (IAKS), Informatics Department, University of Karlsruhe, D-76128 Germany (e-mail: calmet@ira.uka.de).

Abstract-- The Web displays classes of pages with similar structure and contents (e.g., calls for papers, publications, etc.), which are interrelated and define clusters (e.g., Science). We report on the design and implementation of a multi-agent architecture for information retrieval and extraction from these clusters. The entities of a cluster are defined in reusable ontologies. Each agent processes one class, employing the ontologies to recognize pages, extract information, and communicate and cooperate with the other agents. Whenever an agent identifies information of interest to another agent, it forwards that information to it. These "hot hints" usually contain much less garbage than the results returned by traditional search engines (e.g., AltaVista or Excite). Cooperation among agents therefore facilitates the search for useful pages and outperforms existing search engines. The agent architecture enables many sorts of reuse, from code, database definitions and knowledge bases to the services provided by search engines. The architecture was implemented using Java and the Jess inference engine and produced promising preliminary results.

Index Terms-- Internet, cooperative systems, information retrieval, knowledge representation, knowledge-based systems.

I. INTRODUCTION

Finding only relevant information on the Web is one of the hardest challenges faced by researchers. Two reasons are the huge size of the Web and the diversity of the available heterogeneous information. Current search engines suffer from low precision rates because pages are not semantically defined and users can perform only statistical, lexicon-based searches, which cannot access the context that makes information relevant and meaningful. Search engines were designed around keyword-based indexing and retrieval methods. This approach, although robust, is inherently imprecise, and the output usually contains a great deal of irrelevant documents. A central problem here is word sense ambiguity (one word corresponding to several different meanings), which is inherent to natural languages. Shallow Natural Language Processing techniques have been applied to investigate such problems. Together with the application of linguistic ontologies like WordNet [26], this branch of research has led to retrieval improvements, but it does not provide semantics for the whole Web. A main reason is the lack of context in which words are being used.
When semantics is not available to guide information retrieval, context, defined as the set of entities and restrictions present in a page, must be used. A fact that should be recalled when trying to define context for Internet searches is that the Internet has become the only medium capable of gathering most of human knowledge: not only common facts about places and people, but also almost all facets of expert knowledge in a wide range of areas. It is therefore clear that context cannot be formulated for the whole network either. However, a lesson learned in Artificial Intelligence in the 70's [35], stating that knowledge works only over restricted domains, still holds for this task. Information retrieval researchers share this intuition of domain restriction; this is the reason why they evaluate techniques over homogeneous corpora. One option for providing context on the Internet is to rely on knowledge-based systems tailored to restricted domains. This takes advantage of the fact that knowledge engineering has for years developed methods and techniques for combining information in problem-, situation- and user-specific ways as a complement to index-based retrieval methods [5], and therefore offers adequate knowledge representations for the problem, such as ontologies, semantic networks and others.

A. Information extraction systems and the lack of integration among them

Although the Web is highly unstructured, we can identify classes of pages with similar structure and contents (e.g., calls for papers, references and lists of publications, etc.). Information Extraction (IE) systems are being designed to benefit from the domain restriction and from the existence of these classes, together with another assumption: a great deal of users are goal-oriented in their searches. They are mainly interested in actual, relevant, combinable and useful data rather than in the pages where it is located. Current IE systems aim at storing data taken from narrow-domain pages into databases that can be easily queried with semantically well-defined entities and relations. They also endow the Internet with a notion of memory, sparing users the need to manually combine results from search engine queries to obtain the data, thus saving bandwidth, processing and the user's patience. Even today's search engines compute their lists of best-ranked matches time after time, and there is no way of benefiting from past efforts. The purpose of designing such systems was also to allow users to take advantage of the diversity of relevant information spread over Web pages that have some structure (the concept of structure being here very loosely defined), and to enable them to combine sets of data that are often physically located in a great number of pages and servers. However, there are some interrelated facts that, so far, have been largely neglected by traditional Web information retrieval systems. Many of these classes are interrelated, forming clusters (e.g., Science). Some important issues arise from this fact: Is it better, more fruitful or more efficient to treat the whole cluster instead of a single page class? How should the interrelated databases generated by distinct Information Extraction systems be integrated, or, more deeply, how should the Web be viewed for extraction purposes if the extracted data is to be integrated? Only a few extraction systems have tried to address these questions, and they still ignore the relations among classes of pages.
We report on the design and implementation of MASTER-Web (Multi-Agent System for Text Extraction and Retrieval in the Web), a cognitive multi-agent architecture for integrated information retrieval and extraction that employs ontologies to define the cluster (or domain) being processed. A proposed vision of the Web, combining page contents and the functionality of page linkage, is also presented to support these tasks. The article is organized as follows: Section II presents the proposed Web vision. Section III justifies the application of cognitive multi-agent systems and ontologies to the problem. Section IV introduces the architecture of the system, its components and a design decision that makes the construction of new agents easy: the reuse of code, database definitions, page collectors and also knowledge (the latter being the most important feature). Section V describes the case study of an agent able to process "Call for Papers" pages, outlining some promising results on recognition, a task that strongly affects performance. Section VI discusses related work, while Section VII addresses future work and conclusions.

II. A VISION OF THE WEB FOR INTEGRATED EXTRACTION

A significant number of Web pages present data items, hereafter called entities (e.g., calls for papers, conference announcements, etc.), and information about them, which leads us to think of these pages as classes of specialized pages. When consulting scientific pages, for example, there is standard terminology, concepts, expected information and other patterns to take advantage of. Even page styles can be measured and compared, providing structural similarities [10] that can help determine whether or not a page belongs to a class. Once these classes are identified, their common characteristics can be viewed as a priori knowledge, which can help improve precision when searching for information about a restricted topic. They are semi-structured or structured and share several common features, such as patterns of page linkage, terminology and page style. For our architecture, a set of these pages is a class (e.g., calls for papers, researchers, etc.), and the existence of these classes outlines a division of the Web by contents. The data typically found in a class is considered as discriminators, since it helps distinguish class members. This fact supports the use of extraction in class identification. Researchers' pages, for instance, are expected to contain information such as projects, interest areas and other items, and the presence of these items in a page is a strong indication that the page is a researcher's home page. Most links in pages of a class point to pages containing entities of a few other classes, or attributes of or links to these entities. A set of classes and their relations gathers a body of knowledge involving entities of a specific domain (e.g., science, tourism, etc.). This set is a cluster of classes. In researcher pages we often find links to papers, calls for papers, and other classes of the scientific cluster.

Another view of the Web, based on a preexisting taxonomy [32], focuses on functionality, dividing pages according to the role they play in linkage and information storage. For integrated extraction purposes, we split them into functional groups:
1. Content pages, which contain the actual class members (entities);
2. Auxiliary pages, which contain attributes of these entities;
3. Lists of contents, which include the well-known resource directories: lists of links to pages of a class available on the Web, usually maintained by an organization, a person or even a search engine;
4. Messages or lists of messages, which keep e-mail correspondence about contents (the contents of these messages are discussions about the contents, and therefore do not constitute a safe source of information);
5. Recommendations, standing for members of other classes, which play the role of suggestions in a cooperation process among the software components that deal with the classes;
6. Simple garbage, pages whose only connection to the class being processed is the presence of similar keywords. When searching for content pages, search engines often return them.

We combine these two visions to accurately identify not only the information to be extracted from the page classes, but also instances of the relations among these classes in a cluster, significantly improving the search for useful pages from which data should be extracted. Fig. 1 illustrates these two visions. The ellipse corresponds to the vision by contents, showing the entities of the Science cluster, such as papers, researchers, organizations and other classes. Each slice of the ellipse is processed according to the vision by functionality illustrated in the rectangle. The hatched part of the rectangle exhibits the functional groups not used in the processing of a class. The relations among classes and among functional groups are represented by arrows. The direction of an arrow stands for links pointing to a class or functional group.

Fig. 1. Combination of visions to treat a class of pages. The ellipse corresponds to the vision by contents. Each slice of the ellipse is processed according to the vision by functionality illustrated in the rectangle. The hatched part of the rectangle evidences the functional groups not used in the processing of one class. Relations among classes and among functional groups are represented by arrows.
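To make the combined vision concrete, the following CLIPS-style fragment (the frame notation the prototype uses through Jess/JessTab) sketches how a page representation carrying both a functional group and a cluster class, and two related classes of the Science cluster, could be declared. The class and slot names are illustrative assumptions, not the prototype's actual ontology.

  ;; Functional-group vision: every fetched page is classified into one group.
  (defclass Web-Page (is-a USER)
    (role concrete)
    (slot url (type STRING))
    (slot title (type STRING))
    (multislot keywords (type STRING))
    ;; functional group assigned by recognition
    (slot functional-group (type SYMBOL)
          (allowed-symbols content auxiliary list message recommendation garbage))
    ;; cluster class the page belongs to, e.g. call-for-papers or researcher
    (slot page-class (type SYMBOL)))

  ;; Content vision: two interrelated classes of the Science cluster.
  (defclass Paper (is-a USER)
    (role concrete)
    (slot paper-title (type STRING))
    (multislot authors (type INSTANCE-NAME)))          ; instances of Researcher

  (defclass Researcher (is-a USER)
    (role concrete)
    (slot homepage (type STRING))
    (multislot interest-areas (type STRING))
    (multislot authored-papers (type INSTANCE-NAME)))  ; instances of Paper

Under this sketch, a page recognized as a content page of some class would trigger extraction into instances of the corresponding cluster class, while the relation slots (authors, authored-papers) capture the inter-class links exploited for cooperation.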
III. ONTOLOGY-BASED MULTI-AGENT APPROACH

In order to increase performance, Internet-taming solutions usually follow some general principles, namely distribution, to prevent bottlenecks and benefit from parallelism; cooperation among processes, to make them help one another and prevent overlapping and rework; and some sort of coordination among them, performed either by a central control or by the processes themselves through communication. For integrated extraction as described above, two other constraints hold. First, each component is responsible for processing (i.e., finding and filtering content pages and extracting data from them) just one class of the cluster, avoiding the complexity of mixing knowledge about distinct entities. Second, components must cooperate by taking advantage of the class relations in the cluster.

A multi-agent approach with explicit knowledge is suitable, not only because of the first constraint, but also because of the advantages of declarative solutions over procedural ones. Declarative solutions provide much more ontological engagement [20], i.e., a much more direct translation of the domain knowledge. Tasks like Web extraction and classification involve semi-structured or unstructured data, requiring frequent changes in the solution's behavior. With a declarative approach, such changes can be applied easily, without code recompilation or execution halts. This feature constitutes a relevant extensibility advantage.
Expressiveness is also a key issue here. Besides the inference capabilities, we remark that, when the concepts involved in these tasks (e.g., the cluster's entities, functional groups, Web page representations, etc.) are defined declaratively, they can be organized in structures known as ontologies [20]. The use of ontologies brings many benefits. Ontologies are usually frame-based [27], therefore allowing multiple inheritance, an advantage in expressiveness over object-oriented implementations. The advent of ontologies also supported the creation of a high-level communication model known as "peer-to-peer", in which the concepts defined as domain knowledge are common to the communicating agents, playing the role of a shared vocabulary for communication among them. Within this model, agents can express their intentions to the others by speech acts [4], such as informing, asking, recruiting or exchanging messages, using this vocabulary.

When we enhance our solution with ontologies, flexibility is also increased. The cluster entities (domain knowledge) can be defined with the proper granularity, representing the subtle grading differences among the entities. For example, in earlier versions we considered scientific events as one class with no sub-classes. Now, in the ontology of Science, we have scientific events with sub-classes conference, meeting, workshop, school, and others, and educational events with sub-classes lecture and school. In particular, the sub-class school presents features from both classes (scientific event and educational event) and inherits these features from them, as shown in Fig. 2 (a minimal sketch of this fragment is given at the end of this section).

Fig. 2. Part of the ontology of Science, displaying an example of multiple inheritance: the subclass School, which inherits from the classes Scientific-Event and Educational-Event. The ability to represent multiple inheritance is a clear advantage of ontologies over current object-oriented implementations. The graphic was generated by Ontoviz, a plug-in component of the Protégé ontology editor [28].

Moreover, the knowledge about pages and the conditions under which they are considered to represent an instance of an entity, when represented declaratively, is not limited to terms, keywords and statistics; it may include any fact that can distinguish a class of pages from other classes, such as facts involving page structure, probable regions where relevant information to be extracted can be found, concepts contained in the page, and phrase meaning obtained through Natural Language Processing. To sum up, ontologies promise to extend the concept of code reuse to knowledge reuse: there are repositories like Ontolingua [20] storing concepts about many subject areas from which this knowledge can be reused. Section VI will make all of the advantages mentioned here more evident.
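As an illustration of the multiple-inheritance point above (Fig. 2), the fragment below sketches how such a hierarchy could be written in the CLIPS-style frame notation used by the prototype. The slot names and the sample instance are illustrative assumptions, not the actual Science ontology.

  (defclass Event (is-a USER)
    (role abstract)
    (slot event-name (type STRING))
    (slot start-date (type STRING))
    (slot location (type STRING)))

  (defclass Scientific-Event (is-a Event)
    (role abstract)
    (slot submission-deadline (type STRING))
    (multislot topics (type STRING)))

  (defclass Educational-Event (is-a Event)
    (role abstract)
    (slot intended-audience (type STRING)))

  ;; School inherits slots from BOTH parents: submission-deadline and topics
  ;; from Scientific-Event, intended-audience from Educational-Event.
  (defclass School (is-a Scientific-Event Educational-Event)
    (role concrete))

  (make-instance [summer-school-example] of School
    (event-name "Example Summer School on Ontologies")
    (intended-audience "graduate students")
    (topics "ontologies" "information extraction"))

In a single-inheritance object-oriented language, one of the two parent classes would have to be flattened into the other or its attributes duplicated, which is precisely the expressiveness gap the frame representation avoids.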
IV. PROPOSED ARCHITECTURE

A cognitive multi-agent system (MASTER-Web) is proposed to retrieve and extract data from Web pages belonging to the classes of a cluster. The core idea of employing a multi-agent system is to take advantage of the relations among the classes in a cluster. The architecture overview is shown in Fig. 3. Each agent, represented as a circle in the figure, is an expert in recognizing and extracting data from pages supposed to belong to the class of pages that it processes (for instance, "Call for Papers" pages, organization pages, papers and others, for the scientific cluster). The multi-agent system is based on a Distributed Problem Solving approach, where each agent is assigned distinct functions and cooperates with the other agents without overlapping functions.

Since the generated database is normalized, stating correct queries to access its information would be a rather complicated task for the average user. A mediator [17] facilitates this task, providing reduced, non-normalized database views. Any user or agent, whether belonging to the system or not, is allowed to query the mediator.

A. Cooperation model

When an agent is added to the system, it registers and introduces itself by sending to all of the other agents a set of rules to be used by them for the recognition of pages likely to belong to its associated page class. The other agents update their recognition rules and send, in turn, their own recognition rules to the new agent. When a link or page fires another agent's recognition rule, the agent sends the link or page to that agent. This model meets a sociable agent test [21], which states that an agent must change its behavior when a new agent is added to the society: our agents will try to recognize pages for a new agent as soon as it joins the system.

B. Agent's tasks

An agent performs four successive steps during the processing of a URL. They are depicted in Fig. 4.

Fig. 3. General architecture of the system. Each MASTER-Web agent is represented as a circle in the figure. It has the expertise to recognize and extract data from pages supposed to belong to the class of pages processed by the agent.

In the model, each agent utilizes a meta-robot that can be connected to multiple search engines, such as AltaVista or Excite. The meta-robot queries the search engines with terms that assure recall for that agent's page class (e.g., "Call for papers" and "Call for participation" for the CFP agent). Due to the lack of precision, the URL set resulting from the queries presents a wide variety of functional groups, containing many lists, messages, pages from the agent's class and from other agents' classes, and garbage. The retrieved URLs are all put into a queue.

An agent continuously accesses two queues of URLs. The first one is filled by the meta-robot and is assigned low priority. The other one, to which a higher priority is given, stores URLs sent by other agents of the system or taken from pages considered as lists. These links are regarded as "hot hints", because they were found in a safer context and are therefore expected to present higher precision. Cooperation among agents pays off if these agents' suggestions contain less garbage than search engine results do.

Fig. 4. A MASTER-Web agent in detail. A meta-robot queues pages received from search engines into a low priority queue. A high priority queue is filled with suggestions sent by other MASTER-Web agents.

1) Validation

First, a validation takes place, ruling out non-HTML or non-HTTP pages, inaccessible pages and those already present in the database, which have already been processed (whether valid or not). Even invalid pages are kept in the database, since the meta-robot often finds repeated links, and a page is only retrieved again if its date has changed. This accelerates the processing, avoiding redundant work, and spares the Web from unnecessary strain.
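A minimal CLIPS-style sketch of this validation step is given below. The candidate-url and already-processed templates and the rules are illustrative assumptions about how such checks could be expressed declaratively, not the prototype's actual rule base.

  (deftemplate candidate-url
    (slot url (type STRING))
    (slot last-modified (type STRING)))

  (deftemplate already-processed
    (slot url (type STRING))
    (slot last-modified (type STRING)))

  ;; Rule out URLs that are not plain HTTP resources.
  (defrule discard-non-http
    ?c <- (candidate-url (url ?u))
    (test (neq (str-index "http://" ?u) 1))
    =>
    (retract ?c)
    (assert (invalid-url ?u)))

  ;; Skip pages already in the database whose date has not changed.
  (defrule skip-unchanged
    ?c <- (candidate-url (url ?u) (last-modified ?d))
    (already-processed (url ?u) (last-modified ?d))
    =>
    (retract ?c))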
2) Preprocessing

The preprocessing step extracts, from each valid page, representation elements such as contents with and without HTML tags, title, links and e-mails, among other elements, applying information retrieval techniques like stop-lists, centroids, stemming and tagging [5] when necessary. This data is passed to the agent's inference engine.

3) Recognition

This step directly impacts the model's performance and the effectiveness of cooperation. During it, an agent deduces to which functional group the page belongs: whether it is a list, a message, a piece of garbage, or a member of the class dealt with by that agent or by another agent. Bad recognition causes loss of data or waste of time processing garbage. For this reason, we take the atomic approach to recognition [2], which states that favoring recall is better than forcing precision when false but apparently positive answers can be ruled out during extraction. The knowledge required for this task associates terms in the dictionaries with slots of the page representation, such as contents, title, summary, etc. For example, the "Call for Papers" agent considers a page to be a recommendation to the "Research Organizations" agent when its title contains one or more of the terms "home", "site", "society", "ltd", "organization" and "association", and does not contain terms associated with events, like "call", "conference", "forum", "seminar", "meeting", "workshop", etc.

The recognition of a class member is a bit more complex than that of the other functional groups. The terms relate not only to the classes, but also to their attributes. As attribute terms are met, entries containing a class and its attributes are added to a list for use by the extraction step. Another issue here is granularity: terms associated with the sub-classes have to be tested to check whether the page is a member of any of them.

4) Extraction

The aims of this step are manifold: to extract data from the pages, to fill in the table(s) related to the entity being extracted (organizations, events, etc.), to identify links of interest to other agents and, if needed, to correct the recognition result. For the first aim, a piece of data is extracted or a category is inferred. To extract data, terms from the dictionaries associated with an attribute trigger the process. A region for the data is heuristically determined by a function associated with the attribute, and the data is extracted. Next, this data can be formatted (e.g., dates) and new attributes can be inferred. When there is no other attribute to be extracted, the entity is stored in the database. Categorization is accomplished in a similar way: keywords from the page, or from a tagged region probably containing terms, are matched against terms kept in the dictionaries and associated with the categories. If a keyword is part of a category term (which can have many words) and the whole term exists on the page or region, the data is categorized accordingly. In the case of multi-valued categorization, the process continues until the last word of the region or of the page is reached.

For the second aim, links in the page are sent to other agents when any of their identification rules about the anchor and/or the URL fires. For example, from a page representing an event, an anchor or URL that contains the word "paper" or "article" and does not contain expressions linked to events, like "call for" or "cfp", is considered useful for the "papers" agent.
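A CLIPS-style sketch of such a link-recommendation rule is shown below. The web-link template, the helper function, the agent name and the term lists are illustrative assumptions rather than the actual rules exchanged by the prototype's agents.

  (deftemplate web-link
    (slot page-url (type STRING))      ; page where the link was found
    (slot target-url (type STRING))
    (slot anchor (type STRING)))

  (deffunction contains-any (?text $?terms)
    "Returns TRUE if any of the terms occurs in the lower-cased text."
    (foreach ?term ?terms
      (if (neq (str-index ?term (lowcase ?text)) FALSE) then (return TRUE)))
    (return FALSE))

  ;; From a page recognized as an event, forward links that look like papers
  ;; (and are not themselves event announcements) to the "papers" agent.
  (defrule recommend-link-to-papers-agent
    (recognized ?page scientific-event)
    (web-link (page-url ?page) (target-url ?target) (anchor ?a))
    (test (contains-any ?a "paper" "article"))
    (test (not (contains-any ?a "call for" "cfp")))
    =>
    (assert (recommend-to papers-agent ?target)))

In the cooperation model of Section IV-A, rules of this kind are exactly what an agent sends to the others when it registers, so that they can produce its "hot hints".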
If contradictions or strange facts are found during extraction, recognition results can be changed; e.g., in "Call for Papers" pages for the CFP agent, dates older than one year cause the pages to be reclassified as lists.

C. Concurrency Issues

Each agent runs the following processes concurrently, in increasing order of priority:
1. The meta-robot collector, which populates the low priority URL queue;
2. A page processor that treats pages from this low priority queue;
3. Another page processor for the high priority URL queue, which is filled with other MASTER-Web agents' recommendations or with links found in lists;
4. An agent communication component that exchanges messages with the other agents. Since this process has a higher priority, when the agent receives a recommendation of a page, it stores it in the high priority queue, and this page will be processed before the ones found by the meta-robot.
To ensure that a page is completely treated before another one starts to be processed, we assign the page treatment process the highest priority until it terminates.

D. Knowledge Representation

The most important design decisions for an integrated extraction system to achieve maximum expressiveness, flexibility and reuse are related to knowledge representation. Based on [37], we consider that four types of knowledge are required:
1. Representations of the pages: either information retrieval representations (such as words and their frequencies, contents, links, e-mails, etc.), which can be chosen according to their adequacy to the recognition and extraction tasks, or Shallow Natural Language Processing representations [13], necessary for unstructured classes where a shallow interpretation of the meaning of a phrase is required.
2. The domain knowledge, represented not only by the entities of the cluster to be extracted, but also by the relations among them and by applicable restrictions.
3. Knowledge about how to recognize to which functional group a page belongs. It is represented by relations associating entries in dictionaries to concepts, whose presence, absence or high frequency in the page determines that the page must be classified into a certain functional group. Shallow Natural Language Processing associations in the form of templates are useful as well.
4. Structures that perform extraction, namely: a) relations associating dictionary entries to attributes of page representations (like title, contents, links, etc.), indicating the existence of attributes of an entity of the cluster; b) functions that determine regions, extract, convert, format, check consistency and dismiss extracted data; c) templates aggregating these relations and functions for each attribute; d) rules that apply over these templates (each in turn); e) rules that infer new data from extracted data or correct the result of a misclassification.
An example should make the use of these structures clearer.
To recognize pages that represent conferences, we first have an instance conference of the class Concept (all of the classes mentioned here belong to the "Web" ontology), with slots name and synonyms (other slots can be included):

  ([conference] of Concept
    (name "conference")
    (synonyms "symposium" ... "conference"))

Then, we have the instance Conference of the Class-Recognizer class:

  ([Conference] of Class-Recognizer
    (Class-name "conference")
    (Concept [conference])
    (Slots-in-the-Beginning "Initial-date" "Final-Date" "takes-Place-at") ...)

Finally, a rule recognizes pages representing a conference. The rule fires if any of the keywords associated with the concept is present in the page title:

  (defrule r_67_title
    (Web-page (Title ?t) (URL ?y))
    (Class-Recognizer (Class-name ?x) (Concept ?z))
    (Concept (name ?z) (synonyms $?w))
    (test (> (count-occurrences $?w ?t) 0))
    =>
    (assert (recognized ?y 67)))

Note that the rule holds not only for the class "conference", but for any class that has an associated instance of Class-Recognizer. It is important to remark that inheritance applies to subclasses too: for instance, the Class-Recognizer of the class Conference also considers concepts associated with its superclass Scientific-Event, like the concept Event, which has keywords like "call for papers", "call for participation", etc. There are also fields for concepts and keywords that are not inherited (Specific-Concepts and Specific-Keywords).

Similar (and slightly more complex) structures are applied for extraction. Instances of the class Slot-Recognizer, analogous to the Class-Recognizer mentioned above, specify how to extract each attribute of an entity.

Here again, ontologies show their expected usefulness. First, attributes relate to concepts, not only to keywords; a knowledge structure can provide many ways to represent each concept, including keywords. Second, domain entities are represented as frames, a formalism that provides a rich and detailed framework for representing attributes, facilitating the specification of the knowledge outlined above. For example, using the frame-based inference engine CLIPS ("C" Language Integrated Production System) [33], it is possible to define, for each attribute: its type (among the well-known types such as integer, float, string and Boolean, but also symbol, class, instance of classes, or any), the allowed classes when the type is instance, a default value, its cardinality (specifying whether it holds a single piece of data or multiple ones, including the maximum number if necessary), its range (minimum and maximum values), whether it is required or not, and inverse attributes (for example, in a person class, a "father" attribute is an instance of a person whose "son" attribute is the instance of the son). A sketch of such facet definitions is given at the end of this subsection.

To sum up, the construction of a new agent turns out to be easier, since only new instantiations are needed. There is also a clear gain in extensibility: if more items are desirable in the representation, no recompilation is required. Another advantage of this knowledge representation approach resides in the possibility of gradually using the available representations. For instance, natural language representations are expensive but well suited to extraction, so only after the page is recognized as a class member does the text need to be parsed and transformed into natural language representations.
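The fragment below sketches how such attribute facets look in CLIPS defclass syntax. The simplified Conference and Organization classes and their slots are illustrative assumptions, not the prototype's Science ontology; inverse attributes, as noted above, are maintained at the ontology-editor or rule level rather than by a native CLIPS facet.

  (defclass Organization (is-a USER)
    (role concrete)
    (slot org-name (type STRING)))

  (defclass Conference (is-a USER)
    (role concrete)
    (slot acronym (type STRING))                      ; string-valued attribute
    (slot year (type INTEGER) (range 1990 2100))      ; numeric range facet
    (slot organized-by (type INSTANCE-NAME)           ; typed reference to another entity
          (allowed-classes Organization))
    (multislot topics (type STRING)                   ; multiple cardinality, at most 20 values
          (cardinality 0 20))
    (slot country (type STRING) (default "unknown"))) ; default value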
E. Types of reuse

The following forms of reuse facilitate the construction of new agents, stressing the benefits of a knowledge-based approach to the problem.

1) Reuse of code

All agents share the same structure and components. They differ only in the knowledge about their specific class of pages. Independently of the cluster, agents inherit all or most of the code, implementing particular functions only where needed.

2) Reuse of database definitions

All agents access many tables with the same structure: pages not recognized, dictionaries, search engines (data for connecting to them), queries and statistics. The only tables particular to each agent are those that store the extracted data (e.g., the tables of Conferences, Workshops, Meetings, Magazines, Journals and Schools for the "Call for Papers" agent). However, the agents can abstract their structures through the use of metadata [11], inserting data properly into them.

3) Reuse of search engines

Instead of building a new robot collector, it is better practice to rely on existing search engines, for several reasons. Firstly, for extraction purposes it is not necessary to index the entire Web. Queries to search engines ensure recall and prevent strain on the network [23]. Moreover, as a project decision and approach to the problem, we claim that traditional keyword-based search engines can be a basis for more refined and accurate knowledge-based, domain-restricted search engines or extraction agents. The meta-robot of the case study was written in a parameterized form that allows the inclusion of new search engines as records in a table of the database, without code alterations or recompilation.

4) Reuse of knowledge

Taking advantage of the ontological approach, the architecture was planned to permit various types of knowledge reuse when a new cluster is to be processed:
1. The representations of pages and auxiliary ontologies, like Time and Locations, can be reused without alteration;
2. The structures that represent recognition and extraction knowledge, including most of the rules, can also be reused, but the instances for the new domain have to be created (a sketch is given at the end of this subsection);
3. Ontologies about or related to the domain (cluster) being implemented can be reused. For instance, Ontolingua's repository makes available many ontologies of interest for our approach, like Simple-time, Documents and a detailed ontology about science research developed by the project KA2 [13], which we extended and modified for our prototype. The Ontolingua framework offers at least three ways to reuse these ontologies, two of them direct: translators, which convert ontologies into several formalisms, like CLIPS, Prolog, LOOM, Epikit and others [20], and the Open Knowledge Base Connectivity (OKBC) [8], which enables a knowledge representation system such as CLIPS or Prolog to comply with an application program interface (API) that allows ontologies to be accessed and downloaded. If neither of these alternatives is possible or available, the only resource needed is a frame-based knowledge representation system into which the desired ontologies can be copied manually.
Therefore, a new agent can be built quickly, except for the knowledge acquisition task, which can be accomplished in two ways: browsing a lot of pages to understand their patterns, or annotating them in order to apply machine learning techniques that come up with the rules.
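To illustrate item 2 above, the sketch below shows how an agent for a hypothetical tourism cluster could be configured simply by creating new instances for the existing recognition structures, following the notation of Section IV-D. The Hotel class, its Class-Recognizer instance and the term lists are illustrative assumptions.

  ;; New domain concept: only instances are created, the rule base is unchanged.
  ([hotel-concept] of Concept
    (name "hotel")
    (synonyms "hotel" "inn" "guesthouse" "bed and breakfast"))

  ([Hotel] of Class-Recognizer
    (Class-name "hotel")
    (Concept [hotel-concept])
    (Slots-in-the-Beginning "address" "room-rate" "check-in-time"))

Generic rules such as r_67_title in Section IV-D would then apply to the new class without modification.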
V. CASE STUDY: THE "CALL FOR PAPERS" AGENT

A. Development Tools

For the construction of an agent, we need Internet networking facilities to deal with pages, a connection to a Database Management System with metadata capabilities, an Agent Communication Language such as KQML (Knowledge Query and Manipulation Language) [15] or FIPA (Foundation for Intelligent Physical Agents) [30], and an inference engine into which ontologies can be integrated and reused. Java fulfills all of these requirements. Besides portability across platforms, its distribution includes packages for networking and database connectivity, and we could reuse an implementation of the agent communication language KQML called JATLite (Java Agent Template) [22] from Stanford University.

The selection and use of an inference engine was the trickiest issue. Jess (the Java Expert System Shell) [19] is probably the most popular production system ever, with thousands of users, many of whom think of it as a reimplementation of CLIPS. The subtle difference between them resides in the way classes are represented. CLIPS encompasses the internal object-oriented language COOL [33], which represents classes as frames. Jess uses Java beans (components providing reflection) but no frame representations, since it is more oriented towards the object community. Since Jess is not capable of representing frames and there was no other frame-based inference engine available, at first we could not reuse ontologies. Fortunately, Jess's and Java's popularity paid off. A Jess plug-in for the Protégé ontology editor [28], called JessTab [14], has been developed and solved that problem. This plug-in overrides the definition of Jess classes so as to provide the same expressiveness as frames in CLIPS. With this tool, we could reuse, define and refine ontologies via Protégé. A lesson learned from this case is that the capability to represent frames constitutes a minimum requirement for inference engines when the reuse of ontologies is needed or intended.

B. "CFP" agent: preliminary results

We built an agent that queries pages announcing any event or publication ("Call for Papers" and "Call for Participation", CFPs), past or forthcoming. We tested it only for recognition against the class conference, obtaining the promising results displayed in Table I.

TABLE I
RESULTS OF THE "CFP" AGENT RECOGNITION

                 Correct   False negatives   False positives
  Recognized        81            1                 7
  Lists             25            9                 3
  Messages          16            1                 2
  Rejected          41            3                 0
  Recommended       21            1                 3
  Invalid           96            0                 0

The first interesting result was the correction of some pages that had been manually misclassified. The really good result was the precision, which reaches more than 97% once the atomic approach is taken into account, since it will reposition the recognized false positives (6 lists and one page with wrong HTML definitions) as unrecognized during extraction. The number of invalid pages gives a picture of how much garbage and repeated information search engines return. During recognition, the CFP agent sends only recommendations of organization pages, and just one rule related to the title was enough to attain 87.5% precision in recommendation. Recommendations of pages being processed constitute just a small part of the process, though: during extraction, when the links are investigated and many of them recommended, the number of recommendations will rise. Nevertheless, the precision achieved outperforms search engine results (33.9%) and suggests that recommendation and cooperation pay off. However, the number of false positives under lists must shrink, or many bad links will be queued as "hot hints".
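Assuming the reading of Table I above, the reported precision figures follow directly from the usual definition:

  precision = correct / (correct + false positives)
  precision(Recommended) = 21 / (21 + 3) = 87.5%
  precision(Recognized)  = 81 / (81 + 7) ≈ 92.0%

Ruling out the recognized false positives (six lists and one ill-formed page) during extraction, as the atomic approach prescribes, is what raises the recognition precision above the reported 97%.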
VI. RELATED WORK

A. Cooperative information gathering

The article with the above title [29] can be considered to have set the rails for knowledge-based manipulation of information on the Internet. It properly defines Information Gathering as the combination of information acquisition and retrieval, and proposes the use of cooperative multi-agent systems that manage their "independencies (...) so as to integrate and evolve consistent clusters of high quality information (...)". Distributed Problem Solving is advocated as a means for agents "to negotiate with other agents to discover consistent clusters of information".

Further research on agents for information gathering [1] fits exactly the retrieval part of the proposed model. Databases of a large digital library were grouped into hierarchical classes, each class possessing its own agent with explicit knowledge about it. These agents construct retrieval plans that improve efficiency in retrieval processing. If one sets out to transport this tool to the Web, other multi-agent systems will be necessary to correctly match pages to these classes and extract information to populate these databases. Therefore, extraction can be seen as a support tool for retrieval. On the other hand, extraction from the Web requires retrieval, so we can consider these two tasks as complementary. Also in [1], information acquisition seems to encompass learning about domains and information extraction, although this is not explicitly stated. Adopting this view, the work proposed here is innovative since it clearly tackles the extraction part of the problem. Another advantage of our approach is that extraction facilitates search: for example, during extraction, when a link of interest to another agent with similar interests is met, a message with the retrieved link is sent to it, making the search of this other agent more accurate.

B. Classification and extraction

Recognition, or classification, is usually tackled by statistics and learning [9], but, for cooperation purposes, the additional requirement of declarativity is imposed: the rules generated have to be represented explicitly (as in RIPPER [9]). Many systems perform extraction using wrappers, whose construction currently constitutes an active field of research. Wrappers can be built either by hand, using grammars [3] and finite automata [2], or automatically, through machine learning methods [24]. There are also systems employing learning and Natural Language Processing techniques, which provide more context and inference at the expense of many supplementary processing phases. AutoSlog [34], for instance, builds extraction dictionaries by analyzing annotated corpora and defines declarative concept nodes that match semantic roles in domain-specific texts, but without concepts and relations as in ontologies.

C. Ontologies in extraction and integrated extraction

At least three projects employing ontologies to perform extraction can be identified. The first one uses a database approach [12], providing ontology definition tools and automatically generating keywords, constants, relationships, constraints for extraction rules and the normalized database schema.
However, its ontologies are specific to extraction. Furthermore, they are not defined in a knowledge representation formalism, so they can neither be reasoned upon nor communicated at the knowledge level, thus blocking cooperation among extractors and integrated extraction. The second project [11] uses machine learning and a domain ontology with entities and relations. It represents title, keywords and hyperlinks, performs integrated extraction and delivers good results on recognition, but only regular ones on extraction, since it is directed at the harder treatment of raw text pages, as our system also is. With the exception of this approach and the natural language ones, all of the extractors above require a great deal of page structure, working over data-rich but content-poor collections of pages. Such pages should rather be considered structured than semi-structured. The decision of whether to rely on machine learning depends upon comparing the cost of annotating corpora against that of inspecting them for knowledge engineering purposes [2]. Machine learning brings advantages such as speed and adaptability, but also drawbacks such as poor readability and weak ontological engagement of the learned rules (which tend to be specific), as well as difficulties in applying a priori knowledge and in capturing some rules without introducing a lot of features. Normally, learning techniques are used to accelerate knowledge acquisition.

The last is a quite interesting project [13] linking ontologies and extraction, belonging to the Semantic Web [31] approach. It aims at the use of semantic markup languages in order to turn pages into agent-readable elements, enabling their use in new applications, environments and e-commerce. It involves the ontology editor OntoEdit and a page annotation tool that tries to facilitate annotation by suggesting that the user fill in attributes of entities defined via ontologies, using natural language extraction techniques. This system also provides an agent that gathers domain-specific information from the Web and processes queries in natural language, and an environment to learn ontologies from text (Text-to-Onto) [25]. Its designers assume that both linguistic and domain ontologies evolve over time, so their maintenance is a cyclic process that needs a learning component to acquire ontologies. Our work is of a similar flavor to this last related work. However, we do not investigate the problem of ontology acquisition, and, although the two efforts share a common flavor, they do not rely on the same design decisions or implementation methods.

VII. FUTURE WORK AND CONCLUSIONS

We have outlined an attempt, among many others in different fields, to make up for the lack of semantic soundness of the Web. We also try to answer the question of what constitutes an adequate representation formalism for searching, classifying and extracting information available on the Web, given the interrelations of page classes. Indeed, we raised the issue of integrated extraction and proposed for it both a multi-agent architecture and a vision of the Web combining contents and functionality (domains x functional groups). Knowledge engineering, in the form of domain and template ontologies, is a central tool to enable the flexibility, reusability and expressiveness of communication required by our approach. These are useful requirements for distribution and context specification, taking into account the size and diversity of the Web.
We propose an architecture designed to extract data not only from specific pages but from whole regions of the Web. Although we applied it only to the scientific domain, the methodology presented can deal with any cluster formed by interrelated page classes. There are clusters in the commercial domain that may fit this architecture well, e.g., a shopping cluster including shopping centers, stores and suppliers, or a tourism cluster linking events, hotels and transport pages, among many others. In this perspective, the architecture can also be seen as a support tool to facilitate the tasks of Personal Digital Assistants. In fact, we stress that search, retrieval, extraction and categorization are closely related, and integrated solutions represent a feasible option for system developers.

We intend to enhance our architecture with the following improvements:
1. To include machine learning techniques in order to accelerate knowledge acquisition for classification and extraction, creating an instinctive layer [7] in the agent architecture;
2. To apply machine learning and/or natural language processing techniques to extraction, taking advantage of the ontologies already built;
3. To evaluate the agent thoroughly on the Web;
4. To implement duplicity checking when finding entities;
5. To implement other agents and make them cooperate, as a proof of concept that recommendation and cooperation actually pay off.

REFERENCES

[1] Ambite, J., Knoblock, C.: Agents for information gathering. In: Software Agents. Bradshaw, J. (ed.), MIT Press, Pittsburgh, PA, USA (1997).
[2] Appelt, D. E., Israel, D. J.: Introduction to information extraction technology. International Joint Conference on Artificial Intelligence. Stockholm, Sweden (1999).
[3] Ashish, N., Knoblock, C.: Wrapper generation for semi-structured Internet sources. SIGMOD Record, 26(4):8-15 (1997).
[4] Austin, J. L.: How to do things with words. Clarendon Press, Oxford (1962).
[5] Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval. 167-9, Addison Wesley (1999).
[6] Benjamins, R., Fensel, D., Pérez, A.: Knowledge management through ontologies. Proceedings of the 2nd International Conference on Practical Aspects of Knowledge Management, Basel, Switzerland (1998).
[7] Bittencourt, G.: In the quest of the missing link. Proceedings of the International Joint Conference on Artificial Intelligence. Nagoya, Japan (1997).
[8] Chaudri, V. K., Farquhar, A., Fikes, R., Karp, P., Rice, J.: OKBC: a programmatic foundation for knowledge base interoperability. Proceedings of AAAI-98, Madison, WI (1998).
[9] Cohen, W. W.: Learning rules that classify e-mail. http://www.parc.xerox.com/istl/projects/mlia/papers/cohen.ps (1996).
[10] Cruz, I., Borisov, S., Marks, M. A., Webb, T. R.: Measuring structural similarity among Web documents: preliminary results. Proceedings of the 7th International Conference on Electronic Publishing, EP'98. LNCS 1375, Springer Verlag, Heidelberg, Germany (1998).
[11] Craven, M., McCallum, A. M., DiPasquo, D., Mitchell, T., Freitag, D., Nigam, K., Slattery, S.: Learning to extract symbolic knowledge from the World Wide Web. Technical report CMU-CS-98-122, School of Computer Science, Carnegie Mellon University (1998).
[12] Embley, D., Campbell, D., Liddle, S., Smith, R.: Ontology-based extraction of information from data-rich unstructured documents. http://www.deg.byu.edu/papers/cikm98.ps (1998).
[13] Erdmann, M., Maedche, A., Schnurr, H.-P., Staab, S.: From manual to semi-automatic semantic annotation: about ontology-based text annotation. http://www.aifb.uni-karlsruhe.de
[14] Eriksson, H.: Jess plug-in for Protégé. http://www.ida.liu.se/~her/JessTab
[15] Finin, T., Fritzson, R., McKay, D., McEntire, R.: KQML as an agent communication language. Proceedings of the International Conference on Information and Knowledge Management. ACM Press, NY (1994).
[16] Flanagan, D.: Java examples in a nutshell. O'Reilly, Sebastopol, CA, USA, 330-333 (1997).
[17] Freitas, F. L. G., Bittencourt, G.: Cognitive multi-agent systems for integrated information retrieval and extraction over the Web. In: LNCS-LNAI 1952 - Advances in Artificial Intelligence. Proceedings of the International Joint Conference SBIA-IBERAMIA. M. Monard and J. Sichman (eds.), 310-319, Springer-Verlag, Heidelberg (2000).
[18] Freitas, F., Siebra, C., Ferraz, C., Ramalho, G.: Mediation services for agents integration. Proceedings of SEMISH'99. Sociedade Brasileira de Computação (SBC), Rio de Janeiro, Brazil (1999).
[19] Friedman-Hill, E.: Jess, the Java expert system shell. http://herzberg.ca.sandia.gov/Jess (1997).
[20] Gruber, T. R.: Ontolingua: a mechanism to support portable ontologies. Technical report KSL-91-66, Stanford University, Knowledge Systems Laboratory, USA (1996).
[21] Huhns, M., Singh, M.: The agent test. IEEE Internet Computing, Sep/Oct (1997).
[22] JATLite, Java Agent Template. http://java.stanford.edu
[23] Koster, M.: Guidelines for robot writers. http://www.eskimo.com/~falken/guidelin.html (1993).
[24] Kushmerick, N.: Wrapper induction. http://www.compapp.dcu.ie/~nick/research/wrappers (1999).
[25] Maedche, A., Staab, S.: Discovering conceptual relations from text. Proceedings of ECAI-2000. IOS Press, Amsterdam (2000).
[26] Miller, G.: WordNet: a lexical database for English. Communications of the ACM, 38(11):39-41 (1995).
[27] Minsky, M.: A framework for representing knowledge. In: The Psychology of Computer Vision, 211-281, McGraw-Hill, New York (1975).
[28] Noy, N., Fergerson, R., Musen, M.: The model of Protégé: combining interoperability and flexibility. http://protege.stanford.edu
[29] Oates, T., Prasad, M., Lesser, V.: Cooperative information gathering: a distributed problem solving approach. Technical report 94-66, University of Massachusetts, USA (1994).
[30] O'Brien, P., Nicol, R.: FIPA - towards a standard for software agents. http://www.fipa.org (1998).
[31] PC Week magazine, February 7 (2000).
[32] Pirolli, P., Pitkow, J., Rao, R.: Silk from a sow's ear: extracting usable structures from the Web. http://www.acm.org/sigchi/chi96/proceedings/papers/Pirolli_2/pp2.html (1995).
[33] Riley, G.: CLIPS: a tool for building expert systems. http://www.ghg.net/clips/CLIPS.html (1999).
[34] Riloff, E.: Information extraction as a basis for portable text classification systems. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, USA (1994).
[35] Russell, S., Norvig, P.: Artificial intelligence: a modern approach. Prentice-Hall (1995).
[36] van de Velde, W.: Reuse in cyberspace. Abstract at the Dagstuhl seminar "Reusable Problem-Solving Methods". Musen, M., Studer, R. (orgs.), Dagstuhl, Germany (1995).
[37] Wee, L. K. A., Tong, L. C., Tan, C. L.: Knowledge representation issues in information extraction. PRICAI'98, 5th Pacific Rim International Conference on Artificial Intelligence, Proceedings. LNCS 1531, Springer-Verlag, Heidelberg (1998).
Frederico L. G. Freitas has been a PhD student at the Federal University of Santa Catarina, Brazil, since 1998. He graduated in Informatics at the Aeronautics Technological Institute (ITA), Brazil. His interest areas comprise multi-agent systems, knowledge representation and communication through ontologies, and cognitive Internet agents.

Dr. Guilherme Bittencourt has been an Adjunct Professor at the Federal University of Santa Catarina, Brazil, since 1995. He received his PhD in Informatics in 1990 from the University of Karlsruhe, Germany, MSc degrees in AI from Grenoble, France, and from the Space Research National Institute (INPE), Brazil, and a BA in Physics and Electronics Engineering from the Federal University of Rio Grande do Sul, Brazil. His interest areas include knowledge representation, logics, fuzzy systems and multi-agent systems.

Dr. Jacques Calmet has been a Professor at the University of Karlsruhe, Germany, since 1987. He is editor-in-chief of the journal "Applicable Algebra in Engineering, Communication and Computing", Springer Verlag. He received his PhD from Aix-Marseille University, France, in 1970. His main interest areas are computer algebra, knowledge representation, multi-agent systems and mediators. He has edited several books in the Lecture Notes in Computer Science series published by Springer Verlag.